I do not understand the following performance characteristics. What could be the reasons for this?
As I understand from [GH200] Unexpected Low Host-to-Device Bandwidth #23, the SM version should outperform the Copy Engines because of the parallel threads available to it. In this case, however, I am seeing a drastic drop in performance for buffer sizes of 20 GB and above. Is copy kernel tuning necessary?
Prior work suggests the Copy Engines should operate in parallel; however, in the bi-directional case I see that this does not hold, and performance is half of the peak.
The DtoH and HtoD asymmetry is larger than previously reported estimates.
I’m using nvbandwidth to understand the memcpy performance across NVLink-C2C on a GH200 (900 GB/s bi-directional).
Command example:
./nvbandwidth --testcase 0 1 2 3 16 17 32 --testSamples 1 --useMean -b 20480
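For reference when reading the numbers, here is a rough sketch of what the command's buffer size and the quoted link rate imply. It assumes `-b` is given in MiB (so 20480 is 20 GiB) and that the 900 GB/s bi-directional figure splits evenly into 450 GB/s per direction; both are assumptions on my side, and the helper names are only illustrative.

```python
# Back-of-the-envelope numbers for the nvbandwidth run above.
# Assumptions (not measured): -b is in MiB, and the 900 GB/s
# bi-directional peak splits evenly between the two directions.

MIB = 1024 ** 2   # bytes per MiB
GB = 10 ** 9      # bytes per GB (decimal, as bandwidth is usually quoted)


def buffer_bytes(buffer_mib: int) -> int:
    """Convert a buffer size in MiB to bytes."""
    return buffer_mib * MIB


def ideal_copy_seconds(buffer_mib: int, link_gb_per_s: float) -> float:
    """Time to move the buffer once if the link ran at its quoted rate."""
    return buffer_bytes(buffer_mib) / (link_gb_per_s * GB)


BUFFER_MIB = 20480                            # value passed to -b
BIDIR_PEAK_GB_S = 900.0                       # quoted bi-directional peak
PER_DIRECTION_GB_S = BIDIR_PEAK_GB_S / 2      # assumed symmetric split

size = buffer_bytes(BUFFER_MIB)
print(f"buffer size: {size} bytes ({size / GB:.2f} GB)")
print(f"assumed per-direction ceiling: {PER_DIRECTION_GB_S:.0f} GB/s")
print(f"ideal one-way copy time: "
      f"{ideal_copy_seconds(BUFFER_MIB, PER_DIRECTION_GB_S) * 1e3:.1f} ms")
```

Under these assumptions, uni-directional results are best compared against the per-direction ceiling, and only the summed bi-directional result against the full 900 GB/s.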