
Scaled fp8 mfma gfx950 #2246

Merged
stefankoncarevic merged 7 commits into develop from scaled-fp8-mfma-gfx950 on Mar 27, 2026

Conversation

@stefankoncarevic (Contributor)

⚠️ This PR depends on #2242 and should not be merged before that.
Resolves: https://amd-hub.atlassian.net/browse/AIROCMLIR-477

Motivation

This PR adds support for scaled FP8 MFMA instructions (32x32x64 and 16x16x128) on gfx950 architecture. The scaled FP8 MFMAs use OCP FP8 types (f8E4M3FN, f8E5M2) with implicit scale factors and provide improved performance for 8-bit floating-point matrix operations.
PR #2242 relaxes the isCoherentWithK validation to allow kpack < k_base for double-buffer pipelines (scheduleVersion 2 or 4), which is required for some of the test configurations in this PR to work correctly.

Technical Details

Scaled FP8 MFMA Instructions on gfx950
The gfx950 architecture introduces scaled MFMA instructions for OCP FP8 types (f8E4M3FN, f8E5M2):

  • 32x32x64 MFMA: M=32, N=32, K=64, k_base=32, output vector<16xf32>
  • 16x16x128 MFMA: M=16, N=16, K=128, k_base=32, output vector<4xf32>

These instructions differ from the native FP8 MFMAs (32x32x16 with k_base=8) in using implicit scale factors. The compiler generates amdgpu.scaled_mfma operations with constant neutral scale values.
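For orientation, comparing against the native 32x32x16 FP8 MFMA just mentioned, the K depth consumed by a single 32x32 instruction quadruples; this is plain arithmetic on the shapes above, not a measured speedup:

$$
\frac{K_{\text{scaled}}}{K_{\text{native}}} = \frac{64}{16} = 4
$$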

MFMA Selection Logic
The MfmaInsnGroup::select function in MfmaInsnGroup.cpp selects scaled FP8 MFMAs when:

  1. Architecture is gfx950
  2. Input types are OCP FP8 (f8E4M3FN or f8E5M2)
  3. isCoherentWithK validation passes for the given kpack, kpackPerBlock, and scheduleVersion

Test Plan

Added 9 tests to mlir/test/Dialect/Rock/lowering_xdlops_gemm.mlir covering all combinations of MFMA sizes, scheduleVersion values, FP8 type combinations, and kpack values.
All tests pass.
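For orientation, a check on the lowered IR would look roughly like the sketch below (hypothetical check lines; the actual tests in lowering_xdlops_gemm.mlir may use different prefixes and match more of the op):

```mlir
// Hypothetical FileCheck lines for a 32x32x64 scaled FP8 case.
// CHECK: amdgpu.scaled_mfma
// CHECK-SAME: vector<16xf32>
```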


Copilot AI left a comment


Pull request overview

This PR adds support for scaled FP8 MFMA instructions (32x32x64 and 16x16x128) on the gfx950 architecture. These instructions use OCP FP8 types (f8E4M3FN, f8E5M2) with implicit scale factors, sharing the same hardware instructions as FP4 scaled MFMAs but with different configuration parameters (cbsz=0, blgp=0). The PR depends on #2242 which relaxes isCoherentWithK validation to allow kpack < k_base for double-buffer pipelines, enabling configurations like kpack=4 with k_base=32.

Changes:

  • Added 4 new MfmaTypeId enum values for scaled FP8 type combinations (Fp8Fp8ScaledTyId, Fp8Bf8ScaledTyId, Bf8Fp8ScaledTyId, Bf8Bf8ScaledTyId)
  • Implemented scaled FP8 MFMA selection logic in MfmaInsnGroup that tries scaled FP8 MFMAs first when kPerBlock is large enough
  • Updated AccelEmitter to generate scaled MFMA operations with neutral scale values for FP8 types without explicit scale buffers
  • Added 9 comprehensive tests covering all combinations of MFMA sizes (16x16x128, 32x32x64), schedule versions (1, 2, 3, 4), FP8 type combinations, and kpack values (1, 4, 8, 32)

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Files reviewed:

  • mlir/test/Dialect/Rock/lowering_xdlops_gemm.mlir: Added 9 tests for scaled FP8 MFMA operations covering single/double-buffer pipelines and various kpack configurations
  • mlir/include/mlir/Dialect/Rock/IR/MfmaInsnGroup.h: Added 4 new enum values for scaled FP8 type IDs and the isScaledFp8() method declaration
  • mlir/lib/Dialect/Rock/IR/MfmaInsnGroup.cpp: Implemented the scaled FP8 MFMA instruction mapping, selection logic in selectForGfx950(), and the isScaledFp8() method
  • mlir/lib/Dialect/Rock/utility/AccelEmitter.cpp: Added logic to emit scaled MFMA operations with neutral scale values for FP8 types without explicit scale buffers


stefankoncarevic force-pushed the scaled-fp8-mfma-gfx950 branch 3 times, most recently from f93c548 to ff1d1c9 on February 24, 2026, 11:56
stefankoncarevic force-pushed the scaled-fp8-mfma-gfx950 branch 3 times, most recently from eecc935 to 4f657da on March 3, 2026, 08:54
stefankoncarevic force-pushed the scaled-fp8-mfma-gfx950 branch 2 times, most recently from e7cfbe1 to 8eb0352 on March 24, 2026, 08:42
@umangyadav (Member) left a comment


There are v_mfma_f32_16x16x128_f8f6f4 and v_mfma_f32_32x32x64_f8f6f4 instructions that can be used for the double-rate non-scaled FP8 versions.

@stefankoncarevic (Contributor, Author)

> There are v_mfma_f32_16x16x128_f8f6f4 and v_mfma_f32_32x32x64_f8f6f4 instructions that can be used for the double-rate non-scaled FP8 versions.

You're correct that the hardware (ISA) supports both scaled and non-scaled instructions for these FP8/FP6/FP4 matrix sizes.

However, at the MLIR level, only the scaled MFMA intrinsics are exposed in the ROCDL dialect. Looking at ROCDLOps.td, only these are defined:

def ROCDL_mfma_scale_f32_16x16x128_f8f6f4 : ROCDL_Mfma_Scale_IntrOp<...>;
def ROCDL_mfma_scale_f32_32x32x64_f8f6f4 : ROCDL_Mfma_Scale_IntrOp<...>;

The non-scaled versions (mfma_f32_16x16x128_f8f6f4, mfma_f32_32x32x64_f8f6f4) are not exposed in the ROCDL dialect.

The LLVM backend has an optimization pattern (UnscaledMFMAOptimizationPat in SIInstrInfo.td) that automatically converts v_mfma_scale* instructions to v_mfma* instructions when the scale values are constant zero. Since we use Float8E8M0FNU with value 0.0 (which represents exponent = 0, meaning scale factor = 2^0 = 1), this optimization kicks in and the generated ISA will use the non-scaled v_mfma* instruction.

So the flow is:

  • MLIR level: Use amdgpu.scaled_mfma with neutral scale

  • LLVM IR level: Lowers to llvm.amdgcn.mfma.scale.* intrinsic

  • ISA level: LLVM backend optimization converts it to non-scaled v_mfma* instruction

This approach gives us the best of both worlds - we work within the constraints of what ROCDL exposes, while still getting optimal ISA output.
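For the 32x32x64 case, the concrete names along that path are as follows (the ROCDL op and ISA mnemonics are the ones discussed in this thread; the dotted intrinsic name follows the usual ROCDL-to-LLVM spelling, so treat it as an assumption):

```mlir
// amdgpu.scaled_mfma with neutral scales        (MLIR)
//   -> rocdl.mfma.scale.f32.32x32x64.f8f6f4     (ROCDL / llvm.amdgcn.mfma.scale.f32.32x32x64.f8f6f4)
//   -> v_mfma_f32_32x32x64_f8f6f4               (ISA, after the backend drops the neutral scales)
```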

@umangyadav (Member)

> def ROCDL_mfma_scale_f32_16x16x128_f8f6f4 : ROCDL_Mfma_Scale_IntrOp<...>;
> def ROCDL_mfma_scale_f32_32x32x64_f8f6f4 : ROCDL_Mfma_Scale_IntrOp<...>;

Can we add those into ROCDL then? You can do a PR upstream; it should be simple.

@stefankoncarevic (Contributor, Author)

> def ROCDL_mfma_scale_f32_16x16x128_f8f6f4 : ROCDL_Mfma_Scale_IntrOp<...>;
> def ROCDL_mfma_scale_f32_32x32x64_f8f6f4 : ROCDL_Mfma_Scale_IntrOp<...>;
>
> Can we add those into ROCDL then? You can do a PR upstream; it should be simple.

You're right, adding the non-scaled MFMA intrinsics to upstream ROCDL would be the cleaner long-term solution.
I can create an upstream PR to add the non-scaled intrinsics (mfma.f32.16x16x128.f8f6f4, mfma.f32.32x32x64.f8f6f4) to upstream LLVM as a separate task first.

@stefankoncarevic (Contributor, Author)

> def ROCDL_mfma_scale_f32_16x16x128_f8f6f4 : ROCDL_Mfma_Scale_IntrOp<...>;
> def ROCDL_mfma_scale_f32_32x32x64_f8f6f4 : ROCDL_Mfma_Scale_IntrOp<...>;
>
> Can we add those into ROCDL then? You can do a PR upstream; it should be simple.

> You're right, adding the non-scaled MFMA intrinsics to upstream ROCDL would be the cleaner long-term solution. I can create an upstream PR to add the non-scaled intrinsics (mfma.f32.16x16x128.f8f6f4, mfma.f32.32x32x64.f8f6f4) to upstream LLVM as a separate task first.

I looked into this more deeply and found that adding the unscaled MFMA intrinsics to ROCDL is more complex than initially expected.

Current state in LLVM:

  • Machine instructions exist - V_MFMA_F32_16X16X128_F8F6F4 and V_MFMA_F32_32X32X64_F8F6F4 are defined in VOP3PInstructions.td

  • No direct LLVM intrinsic - There is no int_amdgcn_mfma_f32_16x16x128_f8f6f4 or int_amdgcn_mfma_f32_32x32x64_f8f6f4 in IntrinsicsAMDGPU.td. Only the scaled versions exist.

  • Optimization pattern - The backend uses UnscaledMFMAOptimizationPat in SIInstrInfo.td to convert scaled intrinsics with neutral scales (scale=0) to unscaled machine instructions.

What would be required:

To add unscaled MFMA intrinsics to ROCDL, we would need to modify multiple LLVM files:

  • IntrinsicsAMDGPU.td - Add new intrinsics

  • SIInstrInfo.td or VOP3PInstructions.td - Add instruction selection patterns

  • ROCDLOps.td - Add ROCDL wrappers

This is a significant upstream change that goes beyond the scope of this PR.
Would it be acceptable to merge this PR as is?
If the compiler team adds the unscaled intrinsics upstream in the future, we can adapt our implementation accordingly.

@umangyadav (Member)

> ISA level: LLVM backend optimization converts it to non-scaled v_mfma* instruction

Can you add a test that checks whether the backend is generating the non-scaled MFMA or not? You can use --serialize-to-isa and then run checks on that output; it should verify that the double-rate MFMA is being used.

> No direct LLVM intrinsic - There is no int_amdgcn_mfma_f32_16x16x128_f8f6f4 or int_amdgcn_mfma_f32_32x32x64_f8f6f4 in IntrinsicsAMDGPU.td. Only the scaled versions exist.

Can we file a ticket for the backend compiler to expose that, then?

Base automatically changed from mfma-enable-kpack-values-gfx950 to develop on March 25, 2026, 20:02
@stefankoncarevic (Contributor, Author)

> ISA level: LLVM backend optimization converts it to non-scaled v_mfma* instruction
>
> Can you add a test that checks whether the backend is generating the non-scaled MFMA or not? You can use --serialize-to-isa and then run checks on that output; it should verify that the double-rate MFMA is being used.

> No direct LLVM intrinsic - There is no int_amdgcn_mfma_f32_16x16x128_f8f6f4 or int_amdgcn_mfma_f32_32x32x64_f8f6f4 in IntrinsicsAMDGPU.td. Only the scaled versions exist.
>
> Can we file a ticket for the backend compiler to expose that, then?

I'll add a lit test to verify the backend generates the non-scaled v_mfma_f32_16x16x128_f8f6f4 / v_mfma_f32_32x32x64_f8f6f4 instructions when using scaled MFMA with neutral scales.

I'll file a ticket for the compiler team to expose the unscaled MFMA intrinsics.
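A sketch of the ISA-level check (hypothetical check lines only; the RUN pipeline that serializes to ISA via --serialize-to-isa is omitted, and the committed test may differ):

```mlir
// CHECK-NOT: v_mfma_scale_f32_32x32x64_f8f6f4
// CHECK:     v_mfma_f32_32x32x64_f8f6f4
```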

Commit notes from the PR timeline:

  • Remove duplicate entries in getMfmaInsnInfoMap
  • Clarify neutral scale creation comment in AccelEmitter.cpp
  • Rename zeroAttr to neutralScaleAttr for clarity
  • …l scales
  • Rename isScaledFp8 -> selectedScaledMFMA
  • Add neutral scale checks in lit tests
  • Add ISA test for unscaled MFMA backend optimization
  • Add E2E tests for fp8/bf8 combinations on gfx950
stefankoncarevic merged commit d2d62f3 into develop on Mar 27, 2026
8 of 15 checks passed
stefankoncarevic deleted the scaled-fp8-mfma-gfx950 branch on March 27, 2026, 01:45