Pull request overview
This PR adds support for scaled FP8 MFMA instructions (32x32x64 and 16x16x128) on the gfx950 architecture. These instructions use OCP FP8 types (f8E4M3FN, f8E5M2) with implicit scale factors, sharing the same hardware instructions as FP4 scaled MFMAs but with different configuration parameters (cbsz=0, blgp=0). The PR depends on #2242 which relaxes isCoherentWithK validation to allow kpack < k_base for double-buffer pipelines, enabling configurations like kpack=4 with k_base=32.
Changes:
- Added 4 new MfmaTypeId enum values for scaled FP8 type combinations (Fp8Fp8ScaledTyId, Fp8Bf8ScaledTyId, Bf8Fp8ScaledTyId, Bf8Bf8ScaledTyId)
- Implemented scaled FP8 MFMA selection logic in MfmaInsnGroup that tries scaled FP8 MFMAs first when kPerBlock is large enough
- Updated AccelEmitter to generate scaled MFMA operations with neutral scale values for FP8 types without explicit scale buffers
- Added 9 comprehensive tests covering all combinations of MFMA sizes (16x16x128, 32x32x64), schedule versions (1, 2, 3, 4), FP8 type combinations, and kpack values (1, 4, 8, 32)
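The "try scaled FP8 first" ordering described in the changes above can be sketched roughly as follows. This is an illustrative model only, not the code in MfmaInsnGroup.cpp: the `MfmaChoice` struct, the `selectFp8Mfma` function, the `native_fp8` placeholder name, and the reading of "large enough" as "kPerBlock is at least the instruction's K extent" are all assumptions made for the example.

```cpp
#include <string>

// Hypothetical result type; the real MfmaInsnGroup API is not shown in this PR text.
struct MfmaChoice {
  std::string name; // instruction shape, e.g. "32x32x64"
  bool scaled;      // true when a scaled FP8 MFMA was picked
};

// Sketch of the selection order: prefer the scaled FP8 MFMA for the tile
// size, and fall back to a native (unscaled) FP8 MFMA otherwise.
// Assumption: "large enough" means kPerBlock >= K of the scaled instruction.
MfmaChoice selectFp8Mfma(int kPerBlock, bool use32x32Tile) {
  const int scaledK = use32x32Tile ? 64 : 128; // 32x32x64 or 16x16x128
  if (kPerBlock >= scaledK)
    return {use32x32Tile ? "32x32x64" : "16x16x128", true};
  return {"native_fp8", false}; // placeholder name for the fallback path
}
```

Under these assumptions, `selectFp8Mfma(64, true)` picks the scaled 32x32x64 shape, while `selectFp8Mfma(64, false)` falls back because 64 is below the 16x16x128 instruction's K.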
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| mlir/test/Dialect/Rock/lowering_xdlops_gemm.mlir | Added 9 tests for scaled FP8 MFMA operations covering single/double-buffer pipelines and various kpack configurations |
| mlir/include/mlir/Dialect/Rock/IR/MfmaInsnGroup.h | Added 4 new enum values for scaled FP8 type IDs and isScaledFp8() method declaration |
| mlir/lib/Dialect/Rock/IR/MfmaInsnGroup.cpp | Implemented scaled FP8 MFMA instruction mapping, selection logic in selectForGfx950(), and isScaledFp8() method |
| mlir/lib/Dialect/Rock/utility/AccelEmitter.cpp | Added logic to emit scaled MFMA operations with neutral scale values for FP8 types without explicit scale buffers |
You're correct that the hardware (ISA) supports both scaled and non-scaled instructions for these FP8/FP6/FP4 matrix sizes. However, at the MLIR level, only the scaled MFMA intrinsics are exposed in the ROCDL dialect. Looking at ROCDLOps.td, only these are defined:
def ROCDL_mfma_scale_f32_16x16x128_f8f6f4 : ROCDL_Mfma_Scale_IntrOp<...>;
The non-scaled versions (mfma_f32_16x16x128_f8f6f4, mfma_f32_32x32x64_f8f6f4) are not exposed in the ROCDL dialect.
The LLVM backend has an optimization pattern (UnscaledMFMAOptimizationPat in SIInstrInfo.td) that automatically converts v_mfma_scale* instructions to v_mfma* instructions when the scale values are constant zero. Since we use Float8E8M0FNU with value 0.0 (which represents exponent = 0, meaning scale factor = 2^0 = 1), this optimization kicks in and the generated ISA will use the non-scaled v_mfma* instruction.
So the flow is:
This approach gives us the best of both worlds: we work within the constraints of what ROCDL exposes while still getting optimal ISA output.
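For readers unfamiliar with exponent-only scale types, here is a small sketch of why a scale factor of 1 is "neutral". It decodes an 8-bit E8M0 value under the OCP Microscaling layout, which is assumed here: no sign bit, no mantissa, exponent bias 127, and 0xFF reserved for NaN. The `decodeE8M0` helper is hypothetical, and the sketch does not model how the ROCDL intrinsic or the backend pattern interprets its scale operand.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Decode an E8M0 scale byte, assuming the OCP MX layout:
// value = 2^(bits - 127), with 0xFF reserved for NaN.
float decodeE8M0(std::uint8_t bits) {
  if (bits == 0xFF)
    return std::numeric_limits<float>::quiet_NaN();
  return std::ldexp(1.0f, static_cast<int>(bits) - 127);
}

// Apply a decoded scale to a value; a scale of 2^0 = 1 leaves it unchanged.
float applyScale(float value, std::uint8_t scaleBits) {
  return value * decodeE8M0(scaleBits);
}
```

In this model the neutral scale is whichever encoding decodes to 2^0, so multiplying by it is a no-op and a scaled MFMA with neutral scales computes the same result as the unscaled instruction.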
Can we add those to ROCDL, then? You can open a PR upstream; it should be simple.
You're right, adding the non-scaled MFMA intrinsics to upstream ROCDL would be the cleaner long-term solution.
I looked into this more deeply and found that adding the unscaled MFMA intrinsics to ROCDL is more complex than initially expected. Current state in LLVM:
What would be required: To add unscaled MFMA intrinsics to ROCDL, we would need to modify multiple LLVM files:
This is a significant upstream change that goes beyond the scope of this PR.
Can you add a test that checks whether the backend generates the non-scaled MFMA? you can use
Can we file a ticket for the backend compiler to expose that, then?
I'll add a lit test to verify the backend generates the non-scaled v_mfma_f32_16x16x128_f8f6f4 / v_mfma_f32_32x32x64_f8f6f4 instructions when using scaled MFMA with neutral scales. I'll file a ticket for the compiler team to expose the unscaled MFMA intrinsics.
…for scaled MFMAs on gfx950
- Remove duplicate entries in getMfmaInsnInfoMap
- Clarify neutral scale creation comment in AccelEmitter.cpp
- Rename zeroAttr to neutralScaleAttr for clarity
…l scales
- Rename isScaledFp8 -> selectedScaledMFMA
- Add neutral scale checks in lit tests
- Add ISA test for unscaled MFMA backend optimization
- Add E2E tests for fp8/bf8 combinations on gfx950
Resolves: https://amd-hub.atlassian.net/browse/AIROCMLIR-477
Motivation
This PR adds support for scaled FP8 MFMA instructions (32x32x64 and 16x16x128) on the gfx950 architecture. The scaled FP8 MFMAs use OCP FP8 types (f8E4M3FN, f8E5M2) with implicit scale factors and provide improved performance for 8-bit floating-point matrix operations.
PR #2242 relaxes the isCoherentWithK validation to allow kpack < k_base for double-buffer pipelines (scheduleVersion 2 or 4), which is required for some of the test configurations in this PR to work correctly.
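The relaxation described above can be pictured with a small predicate. This is a sketch of the stated rule only and not the actual isCoherentWithK implementation from #2242: the function names are hypothetical, and any other conditions the real validation enforces (for example, divisibility requirements) are deliberately left out.

```cpp
// Double-buffer pipelines, per the description: scheduleVersion 2 or 4.
bool isDoubleBufferSchedule(int scheduleVersion) {
  return scheduleVersion == 2 || scheduleVersion == 4;
}

// Hypothetical model of the relaxed rule: kpack < k_base is accepted only
// for double-buffer pipelines. Other checks in the real validation are
// not modeled here, so kpack >= k_base is simply accepted.
bool kpackAllowedForKBase(int kpack, int kBase, int scheduleVersion) {
  if (kpack >= kBase)
    return true;
  return isDoubleBufferSchedule(scheduleVersion);
}
```

With this model, the configuration mentioned in the review summary (kpack=4 with k_base=32) is accepted for scheduleVersion 2 or 4 and rejected for scheduleVersion 1 or 3.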
Technical Details
Scaled FP8 MFMA Instructions on gfx950
The gfx950 architecture introduces scaled MFMA instructions for OCP FP8 types (f8E4M3FN, f8E5M2):
These instructions differ from native FP8 MFMAs (32x32x16 with k_base=8) by using implicit scale factors. The compiler generates amdgpu.scaled_mfma operations with constant scale values.
MFMA Selection Logic
The MfmaInsnGroup::select function in MfmaInsnGroup.cpp selects scaled FP8 MFMAs when:
Test Plan
Added 9 tests to mlir/test/Dialect/Rock/lowering_xdlops_gemm.mlir covering all combinations of MFMA sizes, scheduleVersion values, FP8 type combinations, and kpack values.
All tests pass.
Test Result
Submission Checklist