[BUG] incorrect spin-loop synchronization

### Which component has the problem?

CuTe DSL

### Bug Report

**Describe the bug**
The following sample uses a spin-loop to wait for system-scope writes, but the loop uses relaxed memory ordering instead of acquire: https://github.com/NVIDIA/cutlass/blob/982748aa7356fa838c2ea4994ddcb0b2a4b4cefa/examples/python/CuTeDSL/distributed/distributed_gemm_reduce_scatter_blackwell.py#L1386

The sample then uses a warp barrier, but a warp barrier only orders memory operations issued by threads of the same warp. It does not order operations issued by other warps or other thread blocks, which is what the example relies on.

**Expected behavior**
I expected the spin-loop to use acquire memory ordering.
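
For illustration, here is a minimal host-side sketch of the pattern I would expect, written with C++ `std::atomic` as an analogy. It is not the CuTe DSL code from the sample: `Channel`, `produce`, `consume`, and `run_example` are hypothetical names, and on the GPU the counterpart would be an acquire load at system scope (for example PTX `ld.acquire.sys`) rather than `std::atomic`.

```cpp
#include <atomic>
#include <thread>

// Hypothetical names for illustration; not taken from the CuTe DSL sample.
struct Channel {
    int payload = 0;           // plain (non-atomic) data written by the producer
    std::atomic<int> flag{0};  // flag the consumer spins on
};

// Producer: write the payload, then publish it with a release store.
void produce(Channel& ch, int value) {
    ch.payload = value;
    ch.flag.store(1, std::memory_order_release);
}

// Consumer: spin with an acquire load. Once the load observes 1, the payload
// write that preceded the release store is guaranteed to be visible.
// With memory_order_relaxed here, the read of ch.payload would be a data race.
int consume(Channel& ch) {
    while (ch.flag.load(std::memory_order_acquire) != 1) {
        std::this_thread::yield();
    }
    return ch.payload;
}

// Runs one producer and one consumer thread and returns what the consumer read.
int run_example(int value) {
    Channel ch;
    int seen = 0;
    std::thread consumer([&] { seen = consume(ch); });
    std::thread producer([&] { produce(ch, value); });
    producer.join();
    consumer.join();
    return seen;
}
```

Calling `run_example(42)` returns the value the consumer observed after the spin-loop exits. The point of the sketch is that the ordering guarantee comes from the acquire load in the loop itself, paired with the producer's release store, and not from a barrier executed afterwards.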

**Additional context**
This is a customer request (ping me internally for more details if needed).


[BUG] incorrect spin-loop synchronization #3117
