[Multidevice] Add TMA bulk copy kernel and P2P transport option#6012
samnordmann merged 18 commits into main
Conversation
Description
Relevant files:
- Enhancement
- Configuration changes
- Tests
PR Reviewer Guide
Here are some key observations to aid the review process:
🧪 PR contains tests

⚡ Recommended focus areas for review:
- Thread Safety
Greptile Summary

This PR adds a Hopper TMA (`cp.async.bulk`) copy kernel (`csrc/multidevice/tma_copy.cu`), compiled at runtime via NVRTC, and wires it in as an alternative P2P data transport alongside the existing copy-engine (`cudaMemcpyAsync`) path.

Key changes:
- A new `P2pTransport` option (`NVFUSER_ENABLE=p2p_transport(tma)`) switches `sendPost`/`recvPost` in `cuda_p2p.cpp` between copy-engine (the default) and TMA.
Issues found:
Confidence Score: 4/5

Safe to merge for transfers ≤ 2 GB; one confirmed P1 logic defect was flagged (see review comments).
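For reference, the option string named in the summary would be passed through the environment; the binary name below is purely illustrative:

```shell
# Select the TMA-based P2P transport (hypothetical test binary name)
NVFUSER_ENABLE="p2p_transport(tma)" ./bin/test_multidevice

# Omit the option to keep the default copy-engine (cudaMemcpyAsync) path
./bin/test_multidevice
```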
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant Host
    participant GPU_Sender as GPU (Sender)
    participant GPU_Receiver as GPU (Receiver)
    Note over Host: getP2pTransport() == Tma?
    alt P2pProtocol::Put + TMA transport
        Host->>GPU_Sender: sendPost → launchTmaCopy(peer.ptr, local.ptr, count)
        Note over GPU_Sender: Blocks 0…N each handle one chunk
        GPU_Sender->>GPU_Sender: cp.async.bulk GMEM(local)→SMEM
        GPU_Sender->>GPU_Receiver: cp.async.bulk SMEM→GMEM(peer.ptr)
        Host->>GPU_Receiver: recvWait (cuStreamWaitValue32 kIdle)
    else P2pProtocol::Get + TMA transport
        Host->>GPU_Receiver: recvPost → launchTmaCopy(local.ptr, peer.ptr, count)
        Note over GPU_Receiver: Blocks 0…N each handle one chunk
        GPU_Receiver->>GPU_Sender: cp.async.bulk GMEM(peer.ptr)→SMEM
        GPU_Receiver->>GPU_Receiver: cp.async.bulk SMEM→GMEM(local.ptr)
        Host->>GPU_Sender: sendWait (cuStreamWaitValue32 kIdle)
    else CopyEngine transport (unchanged)
        Host->>GPU_Sender: cudaMemcpyAsync(peer.ptr, local.ptr)
    end
```
```cpp
static CUmodule module = nullptr;
static CUfunction kernel = nullptr;
```
Static initialization lacks thread-safety protection. Multiple threads calling launchTmaCopy concurrently could race on the module == nullptr check (line 365), causing duplicate compilations or accessing partially-initialized state.
Other kernels in this file (launchAlltoallvKernel, launchMulticastKernel) have the same pattern. Consider adding mutex protection or using std::call_once for thread-safe lazy initialization if concurrent calls are possible.
!test

!test
Force-pushed f791ac2 to 6ffe84c
```cpp
    src_bytes += chunk;
    remaining -= chunk;
  }
}
```
Would it be better if it launched one kernel that managed these chunks versus multiple? I'm not sure if there's a clear performance gain by launching 1 vs N kernels.
Good point! You are right that having a loop inside the kernel is a much better solution. Even better is to launch a single kernel with many blocks, so that SMs and TCs can work in parallel. I'm implementing the latter.

A follow-up optimization would be to use software pipelining and double-buffering to overlap the TMA loads and stores.
!test
wujingyue left a comment:

Some nits before I review the kernel.
!test

!test

!test
Force-pushed 2633996 to bcfe2bd
!test