Current kernels are designed for a single-GPU execution. Let's scale them to multi-GPU systems. Ideally, using TMA and cooperative groups.
Current kernels are designed for a single-GPU execution. Let's scale them to multi-GPU systems. Ideally, using TMA and cooperative groups.