
Kernel-level compute-communication overlapping for CP linear attention kernels via nvshmem #15

@icavan

Description


Implement kernel-level compute-communication overlapping for context parallelism (CP) linear attention kernels using nvshmem.

Context

For distributed training with context parallelism, overlapping computation with inter-GPU communication at the kernel level can reduce latency. Using nvshmem enables fine-grained, GPU-initiated communication directly from within CUDA kernels, avoiding CPU-side synchronization overhead.
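As a rough illustration of the idea, the following CUDA sketch shows GPU-initiated ring-style exchange of a KV state between CP ranks overlapped with local compute, using NVSHMEM device-side APIs (`nvshmemx_float_put_nbi_block`, `nvshmem_fence`, `nvshmemx_signal_op`, `nvshmem_signal_wait_until`). All buffer and function names (`kv_state`, `ready_flag`, `compute_chunk`) are hypothetical placeholders, not from an existing kernel in this repo:

```cuda
// Hedged sketch: overlap local linear-attention compute with a non-blocking,
// GPU-initiated NVSHMEM put of the KV state to the next rank in the CP ring.
// kv_state and ready_flag must live in NVSHMEM symmetric memory (nvshmem_malloc).
#include <nvshmem.h>
#include <nvshmemx.h>

__global__ void cp_linear_attn_overlap(float* kv_state,      // symmetric KV buffer
                                       uint64_t* ready_flag, // symmetric signal word
                                       const float* q, float* out,
                                       int chunk_elems) {
    int pe   = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    int peer = (pe + 1) % npes;   // ring neighbor

    for (int step = 0; step < npes; ++step) {
        // 1) Start a non-blocking, block-cooperative put of the current KV
        //    state to the peer; the transfer proceeds while we keep computing.
        nvshmemx_float_put_nbi_block(kv_state, kv_state, chunk_elems, peer);

        // 2) Overlap: consume the locally available chunk.
        //    (placeholder for the actual linear-attention chunk update)
        // compute_chunk(q, kv_state, out, chunk_elems);

        // 3) Order the data put before the signal, then notify the peer.
        nvshmem_fence();
        nvshmemx_signal_op(ready_flag, step + 1, NVSHMEM_SIGNAL_SET, peer);

        // 4) Wait until the previous rank's chunk for this step has arrived.
        nvshmem_signal_wait_until(ready_flag, NVSHMEM_CMP_GE, step + 1);
    }
}
```

A production kernel would double-buffer `kv_state` (send from one buffer while computing on and receiving into another) to avoid the read/write race this single-buffer sketch glosses over, and would likely tile the signal/wait per chunk rather than per ring step.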

Tasks

  • Design the compute-communication overlap strategy for CP linear attention
  • Integrate nvshmem for inter-GPU communication within kernels
  • Implement overlapping for KDA and/or GDN CP kernels
  • Benchmark communication overhead and end-to-end training throughput
  • Test across multi-GPU configurations
