Description
Implement kernel-level compute-communication overlapping for context parallelism (CP) linear attention kernels using nvshmem.
Context
For distributed training with context parallelism, overlapping computation with inter-GPU communication at the kernel level can reduce latency. Using nvshmem enables fine-grained, GPU-initiated communication directly from within CUDA kernels, avoiding CPU-side synchronization overhead.
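To make the idea concrete, here is a minimal, hypothetical sketch of the pattern described above: a kernel computes the linear-attention running state one chunk at a time and issues a non-blocking, GPU-initiated nvshmem put of that state to the next rank in the CP ring, so the transfer overlaps with compute on the following chunk. Everything here (`compute_chunk`, `state_buf`, the flag-based signaling) is illustrative, not an existing implementation; only the nvshmem calls (`nvshmemx_putmem_nbi_block`, `nvshmem_quiet`, `nvshmem_uint64_atomic_inc`) are real API.

```cuda
#include <nvshmem.h>
#include <nvshmemx.h>

// Placeholder for one chunk of the linear-attention recurrence
// (names and layout are assumptions for this sketch).
__device__ void compute_chunk(float* state, const float* q, const float* k,
                              const float* v, int chunk) {
    // ... update running KV state for this chunk (omitted) ...
}

__global__ void cp_linear_attn_overlap(float* state_buf,      // allocated with nvshmem_malloc
                                       const float* q, const float* k,
                                       const float* v,
                                       int n_chunks, size_t chunk_bytes,
                                       uint64_t* ready_flag) { // symmetric flag
    int me   = nvshmem_my_pe();
    int next = (me + 1) % nvshmem_n_pes();

    for (int c = 0; c < n_chunks; ++c) {
        compute_chunk(state_buf, q, k, v, c);
        __syncthreads();

        // GPU-initiated, non-blocking put of the running state to the next
        // PE in the CP ring; no CPU involvement, and compute on the next
        // chunk proceeds while the transfer is in flight.
        nvshmemx_putmem_nbi_block(state_buf, state_buf, chunk_bytes, next);
    }

    // Drain outstanding puts, then signal the neighbor that data is ready.
    if (threadIdx.x == 0) {
        nvshmem_quiet();
        nvshmem_uint64_atomic_inc(ready_flag, next);
    }
}
```

In practice the receiving kernel would spin on `ready_flag` (e.g. with `nvshmem_uint64_wait_until`) before consuming the incoming state, and double-buffering of `state_buf` would be needed to avoid overwriting data still in flight; both are omitted here for brevity.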
Tasks