SM memcpy performance concerns on GH200

I’m using `nvbandwidth` to understand memcpy performance across NVLink-C2C on a GH200 (900 GB/s bi-directional).

Command example: `./nvbandwidth --testcase 0 1 2 3 16 17 32 --testSamples 1 --useMean  -b 20480`
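To see how the results change with buffer size, the same command can be repeated over several `-b` values. Below is a minimal sketch that only builds the command lines; it does not run them. The flags are copied from the command above, and the sweep points other than 20480 are hypothetical values chosen for illustration.

```python
import shlex

# Flags copied verbatim from the command above; only -b is varied.
BASE_CMD = [
    "./nvbandwidth", "--testcase", "0", "1", "2", "3", "16", "17", "32",
    "--testSamples", "1", "--useMean",
]


def sweep_commands(buffer_sizes):
    """Return one shell-ready command string per buffer size (passed to -b)."""
    return [shlex.join(BASE_CMD + ["-b", str(size)]) for size in buffer_sizes]


if __name__ == "__main__":
    # Hypothetical sweep points; 20480 is the value used in the original run.
    for cmd in sweep_commands([1024, 4096, 10240, 20480]):
        print(cmd)
```

Running the printed commands one at a time and comparing the SM and CE rows should show at which buffer size the results start to diverge.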

I do not understand the following performance characteristics. What could explain them?

1. As I understand from https://github.com/NVIDIA/nvbandwidth/issues/23, the SM copy should outperform the Copy Engines because of the parallel threads available. In this case, however, I see a drastic drop in performance for buffer sizes of 20GB and above. Is copy kernel tuning necessary?
2. [Prior work](https://www.cs.unc.edu/~jbakita/rtas24.pdf) suggests that the copy engines (CEs) should operate in parallel. In the bi-directional case, however, I do not see this: the performance is half of the peak.
3. The DtoH and HtoD asymmetry is larger than previously reported estimates.
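
To make "half of the peak" and "asymmetry" concrete, here is a small sketch of how I am normalizing the numbers. It assumes that the 900 GB/s figure is the aggregate of both directions and that the link is symmetric (450 GB/s each way). It also assumes the bi-directional value being compared is the sum of both directions, which should be checked against what `nvbandwidth` actually reports. The inputs in the example are placeholders, not measurements.

```python
C2C_BIDIR_PEAK_GBPS = 900.0  # from the issue text: aggregate of both directions
C2C_UNIDIR_PEAK_GBPS = C2C_BIDIR_PEAK_GBPS / 2  # assumption: symmetric link


def fraction_of_peak(measured_gbps, bidirectional=False):
    """Measured bandwidth as a fraction of the assumed theoretical peak."""
    peak = C2C_BIDIR_PEAK_GBPS if bidirectional else C2C_UNIDIR_PEAK_GBPS
    return measured_gbps / peak


def asymmetry(h2d_gbps, d2h_gbps):
    """Gap between HtoD and DtoH, as a fraction of the faster direction."""
    faster = max(h2d_gbps, d2h_gbps)
    return abs(h2d_gbps - d2h_gbps) / faster


if __name__ == "__main__":
    # Placeholder inputs for illustration only; substitute measured values.
    print(fraction_of_peak(450.0, bidirectional=True))  # 0.5
    print(asymmetry(400.0, 300.0))                      # 0.25
```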

<img width="861" alt="Image" src="https://github.com/user-attachments/assets/e6620628-0f96-4dae-abdf-c686a222bcce" />
