Description
Implement kernel-level compute-communication overlapping for context parallelism (CP) linear attention kernels using nvshmem.
Context
For distributed training with context parallelism, overlapping computation with inter-GPU communication at the kernel level can reduce latency. Using nvshmem enables fine-grained, GPU-initiated communication directly from within CUDA kernels, avoiding CPU-side synchronization overhead.
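To make the idea concrete, here is a minimal, hypothetical sketch of the pattern described above: a kernel computes the linear-attention running state one chunk at a time and issues a non-blocking, GPU-initiated nvshmem put of that state to the next rank in the CP ring, so the transfer overlaps with compute on the following chunk. Everything here (`compute_chunk`, `state_buf`, the flag-based signaling) is illustrative, not an existing implementation; only the nvshmem calls (`nvshmemx_putmem_nbi_block`, `nvshmem_quiet`, `nvshmem_uint64_atomic_inc`) are real API.

```cuda
#include <nvshmem.h>
#include <nvshmemx.h>

// Placeholder for one chunk of the linear-attention recurrence
// (names and layout are assumptions for this sketch).
__device__ void compute_chunk(float* state, const float* q, const float* k,
                              const float* v, int chunk) {
    // ... update running KV state for this chunk (omitted) ...
}

__global__ void cp_linear_attn_overlap(float* state_buf,      // allocated with nvshmem_malloc
                                       const float* q, const float* k,
                                       const float* v,
                                       int n_chunks, size_t chunk_bytes,
                                       uint64_t* ready_flag) { // symmetric flag
    int me   = nvshmem_my_pe();
    int next = (me + 1) % nvshmem_n_pes();

    for (int c = 0; c < n_chunks; ++c) {
        compute_chunk(state_buf, q, k, v, c);
        __syncthreads();

        // GPU-initiated, non-blocking put of the running state to the next
        // PE in the CP ring; no CPU involvement, and compute on the next
        // chunk proceeds while the transfer is in flight.
        nvshmemx_putmem_nbi_block(state_buf, state_buf, chunk_bytes, next);
    }

    // Drain outstanding puts, then signal the neighbor that data is ready.
    if (threadIdx.x == 0) {
        nvshmem_quiet();
        nvshmem_uint64_atomic_inc(ready_flag, next);
    }
}
```

In practice the receiving kernel would spin on `ready_flag` (e.g. with `nvshmem_uint64_wait_until`) before consuming the incoming state, and double-buffering of `state_buf` would be needed to avoid overwriting data still in flight; both are omitted here for brevity.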
Tasks