
Scaled fp8 mfma gfx950 #2246

Merged
stefankoncarevic merged 7 commits into develop from scaled-fp8-mfma-gfx950 on Mar 27, 2026

Conversation

@stefankoncarevic (Contributor)

⚠️ This PR depends on #2242 and should not be merged before that.
Resolves: https://amd-hub.atlassian.net/browse/AIROCMLIR-477

Motivation

This PR adds support for scaled FP8 MFMA instructions (32x32x64 and 16x16x128) on gfx950 architecture. The scaled FP8 MFMAs use OCP FP8 types (f8E4M3FN, f8E5M2) with implicit scale factors and provide improved performance for 8-bit floating-point matrix operations.
PR #2242 relaxes the isCoherentWithK validation to allow kpack < k_base for double-buffer pipelines (scheduleVersion 2 or 4), which is required for some of the test configurations in this PR to work correctly.

Technical Details

Scaled FP8 MFMA Instructions on gfx950
The gfx950 architecture introduces scaled MFMA instructions for OCP FP8 types (f8E4M3FN, f8E5M2):

  • 32x32x64 MFMA: M=32, N=32, K=64, k_base=32, output vector<16xf32>
  • 16x16x128 MFMA: M=16, N=16, K=128, k_base=32, output vector<4xf32>

These instructions differ from the native FP8 MFMAs (32x32x16 with k_base=8) in using implicit scale factors. The compiler generates amdgpu.scaled_mfma operations with constant neutral scale values.
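For orientation, comparing against the native 32x32x16 FP8 MFMA just mentioned, the K depth consumed by a single 32x32 instruction quadruples; this is plain arithmetic on the shapes above, not a measured speedup:

$$
\frac{K_{\text{scaled}}}{K_{\text{native}}} = \frac{64}{16} = 4
$$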

MFMA Selection Logic
The MfmaInsnGroup::select function in MfmaInsnGroup.cpp selects scaled FP8 MFMAs when:

  1. Architecture is gfx950
  2. Input types are OCP FP8 (f8E4M3FN or f8E5M2)
  3. isCoherentWithK validation passes for the given kpack, kpackPerBlock, and scheduleVersion

Test Plan

Added 9 tests to mlir/test/Dialect/Rock/lowering_xdlops_gemm.mlir covering all combinations of MFMA sizes, scheduleVersion values, FP8 type combinations, and kpack values.
All tests pass.
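For orientation, a check on the lowered IR would look roughly like the sketch below (hypothetical check lines; the actual tests in lowering_xdlops_gemm.mlir may use different prefixes and match more of the op):

```mlir
// Hypothetical FileCheck lines for a 32x32x64 scaled FP8 case.
// CHECK: amdgpu.scaled_mfma
// CHECK-SAME: vector<16xf32>
```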


Copilot AI left a comment


Pull request overview

This PR adds support for scaled FP8 MFMA instructions (32x32x64 and 16x16x128) on the gfx950 architecture. These instructions use OCP FP8 types (f8E4M3FN, f8E5M2) with implicit scale factors, sharing the same hardware instructions as FP4 scaled MFMAs but with different configuration parameters (cbsz=0, blgp=0). The PR depends on #2242 which relaxes isCoherentWithK validation to allow kpack < k_base for double-buffer pipelines, enabling configurations like kpack=4 with k_base=32.

Changes:

  • Added 4 new MfmaTypeId enum values for scaled FP8 type combinations (Fp8Fp8ScaledTyId, Fp8Bf8ScaledTyId, Bf8Fp8ScaledTyId, Bf8Bf8ScaledTyId)
  • Implemented scaled FP8 MFMA selection logic in MfmaInsnGroup that tries scaled FP8 MFMAs first when kPerBlock is large enough
  • Updated AccelEmitter to generate scaled MFMA operations with neutral scale values for FP8 types without explicit scale buffers
  • Added 9 comprehensive tests covering all combinations of MFMA sizes (16x16x128, 32x32x64), schedule versions (1, 2, 3, 4), FP8 type combinations, and kpack values (1, 4, 8, 32)

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Files reviewed:

  • mlir/test/Dialect/Rock/lowering_xdlops_gemm.mlir: Added 9 tests for scaled FP8 MFMA operations covering single/double-buffer pipelines and various kpack configurations
  • mlir/include/mlir/Dialect/Rock/IR/MfmaInsnGroup.h: Added 4 new enum values for scaled FP8 type IDs and the isScaledFp8() method declaration
  • mlir/lib/Dialect/Rock/IR/MfmaInsnGroup.cpp: Implemented the scaled FP8 MFMA instruction mapping, selection logic in selectForGfx950(), and the isScaledFp8() method
  • mlir/lib/Dialect/Rock/utility/AccelEmitter.cpp: Added logic to emit scaled MFMA operations with neutral scale values for FP8 types without explicit scale buffers


stefankoncarevic force-pushed the scaled-fp8-mfma-gfx950 branch 3 times, most recently from f93c548 to ff1d1c9 on February 24, 2026, 11:56
stefankoncarevic force-pushed the scaled-fp8-mfma-gfx950 branch 3 times, most recently from eecc935 to 4f657da on March 3, 2026, 08:54
stefankoncarevic force-pushed the scaled-fp8-mfma-gfx950 branch 2 times, most recently from e7cfbe1 to 8eb0352 on March 24, 2026, 08:42
@umangyadav (Member) left a comment


There are v_mfma_f32_16x16x128_f8f6f4 and v_mfma_f32_32x32x64_f8f6f4 instructions that can be used for the double-rate non-scaled FP8 versions.

@stefankoncarevic (Contributor, Author)

> There are v_mfma_f32_16x16x128_f8f6f4 and v_mfma_f32_32x32x64_f8f6f4 instructions that can be used for the double-rate non-scaled FP8 versions.

You're correct that the hardware (ISA) supports both scaled and non-scaled instructions for these FP8/FP6/FP4 matrix sizes.

However, at the MLIR level, only the scaled MFMA intrinsics are exposed in the ROCDL dialect. Looking at ROCDLOps.td, only these are defined:

def ROCDL_mfma_scale_f32_16x16x128_f8f6f4 : ROCDL_Mfma_Scale_IntrOp<...>;
def ROCDL_mfma_scale_f32_32x32x64_f8f6f4 : ROCDL_Mfma_Scale_IntrOp<...>;

The non-scaled versions (mfma_f32_16x16x128_f8f6f4, mfma_f32_32x32x64_f8f6f4) are not exposed in the ROCDL dialect.

The LLVM backend has an optimization pattern (UnscaledMFMAOptimizationPat in SIInstrInfo.td) that automatically converts v_mfma_scale* instructions to v_mfma* instructions when the scale values are constant zero. Since we use Float8E8M0FNU with value 0.0 (which represents exponent = 0, meaning scale factor = 2^0 = 1), this optimization kicks in and the generated ISA will use the non-scaled v_mfma* instruction.

So the flow is:

  • MLIR level: Use amdgpu.scaled_mfma with neutral scale

  • LLVM IR level: Lowers to llvm.amdgcn.mfma.scale.* intrinsic

  • ISA level: LLVM backend optimization converts it to non-scaled v_mfma* instruction

This approach gives us the best of both worlds - we work within the constraints of what ROCDL exposes, while still getting optimal ISA output.
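For the 32x32x64 case, the concrete names along that path are as follows (the ROCDL op and ISA mnemonics are the ones discussed in this thread; the dotted intrinsic name follows the usual ROCDL-to-LLVM spelling, so treat it as an assumption):

```mlir
// amdgpu.scaled_mfma with neutral scales        (MLIR)
//   -> rocdl.mfma.scale.f32.32x32x64.f8f6f4     (ROCDL / llvm.amdgcn.mfma.scale.f32.32x32x64.f8f6f4)
//   -> v_mfma_f32_32x32x64_f8f6f4               (ISA, after the backend drops the neutral scales)
```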

@umangyadav (Member)

> def ROCDL_mfma_scale_f32_16x16x128_f8f6f4 : ROCDL_Mfma_Scale_IntrOp<...>;
> def ROCDL_mfma_scale_f32_32x32x64_f8f6f4 : ROCDL_Mfma_Scale_IntrOp<...>;

Can we add those into ROCDL then? You can do a PR upstream; it should be simple.

@stefankoncarevic (Contributor, Author)

> def ROCDL_mfma_scale_f32_16x16x128_f8f6f4 : ROCDL_Mfma_Scale_IntrOp<...>;
> def ROCDL_mfma_scale_f32_32x32x64_f8f6f4 : ROCDL_Mfma_Scale_IntrOp<...>;
>
> Can we add those into ROCDL then? You can do a PR upstream; it should be simple.

You're right, adding the non-scaled MFMA intrinsics to upstream ROCDL would be the cleaner long-term solution.
I can create an upstream PR to add the non-scaled intrinsics (mfma.f32.16x16x128.f8f6f4, mfma.f32.32x32x64.f8f6f4) to upstream LLVM as a separate task first.

@stefankoncarevic (Contributor, Author)

> def ROCDL_mfma_scale_f32_16x16x128_f8f6f4 : ROCDL_Mfma_Scale_IntrOp<...>;
> def ROCDL_mfma_scale_f32_32x32x64_f8f6f4 : ROCDL_Mfma_Scale_IntrOp<...>;
>
> Can we add those into ROCDL then? You can do a PR upstream; it should be simple.

> You're right, adding the non-scaled MFMA intrinsics to upstream ROCDL would be the cleaner long-term solution. I can create an upstream PR to add the non-scaled intrinsics (mfma.f32.16x16x128.f8f6f4, mfma.f32.32x32x64.f8f6f4) to upstream LLVM as a separate task first.

I looked into this more deeply and found that adding the unscaled MFMA intrinsics to ROCDL is more complex than initially expected.

Current state in LLVM:

  • Machine instructions exist - V_MFMA_F32_16X16X128_F8F6F4 and V_MFMA_F32_32X32X64_F8F6F4 are defined in VOP3PInstructions.td

  • No direct LLVM intrinsic - There is no int_amdgcn_mfma_f32_16x16x128_f8f6f4 or int_amdgcn_mfma_f32_32x32x64_f8f6f4 in IntrinsicsAMDGPU.td. Only the scaled versions exist.

  • Optimization pattern - The backend uses UnscaledMFMAOptimizationPat in SIInstrInfo.td to convert scaled intrinsics with neutral scales (scale=0) to unscaled machine instructions.

What would be required:

To add unscaled MFMA intrinsics to ROCDL, we would need to modify multiple LLVM files:

  • IntrinsicsAMDGPU.td - Add new intrinsics

  • SIInstrInfo.td or VOP3PInstructions.td - Add instruction selection patterns

  • ROCDLOps.td - Add ROCDL wrappers

This is a significant upstream change that goes beyond the scope of this PR.
Would it be acceptable to merge this PR as is?
If the compiler team adds the unscaled intrinsics upstream in the future, we can adapt our implementation accordingly.

@umangyadav (Member)

> ISA level: LLVM backend optimization converts it to non-scaled v_mfma* instruction

Can you add a test that checks whether the backend is generating the non-scaled MFMA or not? You can use --serialize-to-isa and then run checks on that output; it should verify that the double-rate MFMA is being used.

> No direct LLVM intrinsic - There is no int_amdgcn_mfma_f32_16x16x128_f8f6f4 or int_amdgcn_mfma_f32_32x32x64_f8f6f4 in IntrinsicsAMDGPU.td. Only the scaled versions exist.

Can we file a ticket for the backend compiler to expose that, then?

Base automatically changed from mfma-enable-kpack-values-gfx950 to develop on March 25, 2026, 20:02
@stefankoncarevic (Contributor, Author)

> ISA level: LLVM backend optimization converts it to non-scaled v_mfma* instruction
>
> Can you add a test that checks whether the backend is generating the non-scaled MFMA or not? You can use --serialize-to-isa and then run checks on that output; it should verify that the double-rate MFMA is being used.

> No direct LLVM intrinsic - There is no int_amdgcn_mfma_f32_16x16x128_f8f6f4 or int_amdgcn_mfma_f32_32x32x64_f8f6f4 in IntrinsicsAMDGPU.td. Only the scaled versions exist.
>
> Can we file a ticket for the backend compiler to expose that, then?

I'll add a lit test to verify the backend generates the non-scaled v_mfma_f32_16x16x128_f8f6f4 / v_mfma_f32_32x32x64_f8f6f4 instructions when using scaled MFMA with neutral scales.

I'll file a ticket for the compiler team to expose the unscaled MFMA intrinsics.
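A sketch of the ISA-level check (hypothetical check lines only; the RUN pipeline that serializes to ISA via --serialize-to-isa is omitted, and the committed test may differ):

```mlir
// CHECK-NOT: v_mfma_scale_f32_32x32x64_f8f6f4
// CHECK:     v_mfma_f32_32x32x64_f8f6f4
```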

Commit notes from the PR timeline:

  • Remove duplicate entries in getMfmaInsnInfoMap
  • Clarify neutral scale creation comment in AccelEmitter.cpp
  • Rename zeroAttr to neutralScaleAttr for clarity
  • …l scales
  • Rename isScaledFp8 -> selectedScaledMFMA
  • Add neutral scale checks in lit tests
  • Add ISA test for unscaled MFMA backend optimization
  • Add E2E tests for fp8/bf8 combinations on gfx950
stefankoncarevic merged commit d2d62f3 into develop on Mar 27, 2026
8 of 15 checks passed
stefankoncarevic deleted the scaled-fp8-mfma-gfx950 branch on March 27, 2026, 01:45