Which component has the problem?
CuTe DSL
Bug Report
Describe the bug
The following sample uses a spin loop to acquire system-scope writes, but the load inside the loop uses relaxed ordering instead of acquire:
res = cute.arch.load(flag.llvm_ptr, cutlass.Int32, sem="relaxed", scope="gpu")
The sample then executes a warp barrier, but a warp barrier only orders memory operations issued by threads of the same warp. It does not order them against writes from other warps or other thread blocks, which is what the example relies on.
Expected behavior
I expected the spin-loop to use acquire memory ordering.
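As a rough sketch of the expected pattern (not the sample's actual code): each poll of the flag is an acquire load, so that once the expected value is observed, subsequent reads are ordered after the producer's writes, assuming the producer published the flag with a release store or fence at a matching scope. `spin_until` and `load_flag` are hypothetical names used only for illustration; `load_flag` stands in for the DSL load so the snippet runs as plain Python.

```python
from typing import Callable


def spin_until(load_flag: Callable[..., int], expected: int) -> int:
    """Poll the flag with acquire loads until it equals `expected`."""
    # `load_flag` is a hypothetical stand-in for the DSL load call; it receives
    # the same `sem` and `scope` keyword arguments the sample passes today,
    # with only `sem` changed from "relaxed" to "acquire".
    value = load_flag(sem="acquire", scope="gpu")
    while value != expected:
        value = load_flag(sem="acquire", scope="gpu")
    return value
```

In the sample itself, this would presumably amount to passing `sem="acquire"` to the existing `cute.arch.load` call. That the DSL accepts this value is an assumption here, as is the choice to leave `scope` unchanged.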
Additional context
This is a customer request (ping me internally for more details if needed).
Affected sample: cutlass/examples/python/CuTeDSL/distributed/distributed_gemm_reduce_scatter_blackwell.py, line 1386 at commit 982748a.