[rocRoller] AIROCROLL-1547 Simplify pretiling #5688
Pull request overview
This PR simplifies rocRoller’s pre-tiling setup by shifting more of the pre-tiled layout handling into the KernelGraph lowering path, removing several places that manually constructed special tensor strides/descriptors for pre-tiled inputs and scales.
Changes:
- Rework pre-tiled coordinate transform lowering (`addLoadMacroTileCT*`) to optionally rewrite outer SubDimension sizes/strides for pre-tiled paths, and update SwizzleScale to avoid double-updating.
- Simplify command/tensor setup for pre-tiling and scale pre-tiling across tests, client, and hipblaslt integration by relying on `SubTileTranspose` + lowering rather than custom TensorDescriptors/strides.
- Adjust a GEMM pre-tile unit test to use much larger problem sizes.
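For readers unfamiliar with what "outer SubDimension sizes/strides" means for a pre-tiled tensor, here is a minimal sketch. It assumes a layout in which tiles are stored contiguously, tiles are ordered row-major, and elements inside a tile are row-major. `PreTiledLayout` and `preTiledOffset` are hypothetical names for illustration only; the layout rocRoller actually uses may differ.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of a 2D tensor stored tile-by-tile ("pre-tiled").
// Assumption: tiles are ordered row-major and each tile is contiguous and
// row-major internally. This is not a rocRoller type.
struct PreTiledLayout
{
    std::size_t rows, cols;         // logical matrix size
    std::size_t tileRows, tileCols; // inner tile size

    // Sizes of the outer (tile-index) dimensions.
    std::size_t outerRows() const { return rows / tileRows; }
    std::size_t outerCols() const { return cols / tileCols; }

    // Strides, in elements, of the outer dimensions.
    std::size_t outerColStride() const { return tileRows * tileCols; }
    std::size_t outerRowStride() const { return outerCols() * outerColStride(); }
};

// Map a logical (i, j) coordinate to a linear element offset.
inline std::size_t preTiledOffset(PreTiledLayout const& l, std::size_t i, std::size_t j)
{
    // This sketch only handles sizes that are exact multiples of the tile.
    assert(l.rows % l.tileRows == 0 && l.cols % l.tileCols == 0);
    assert(i < l.rows && j < l.cols);

    std::size_t outerI = i / l.tileRows, innerI = i % l.tileRows;
    std::size_t outerJ = j / l.tileCols, innerJ = j % l.tileCols;

    return outerI * l.outerRowStride() + outerJ * l.outerColStride()
           + innerI * l.tileCols + innerJ;
}
```

Under these assumptions, an 8x8 matrix with 4x4 tiles has outer sizes 2x2, an outer column stride of 16 elements, and an outer row stride of 32 elements. A standard 2D descriptor carries only the logical sizes, so the lowering step has to derive the outer sizes and strides itself.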
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| shared/rocroller/test/unit/GEMMTestBase.hpp | Removes explicit pretile stride/descriptor handling; relies on SubTileTranspose + lowering for pretiled A/B and scale tensors. |
| shared/rocroller/test/unit/GEMMPretileTest.cpp | Changes pre-tiling test M/N/K to much larger values. |
| shared/rocroller/lib/source/KernelGraph/Transformations/SwizzleScale.cpp | Updates call to addLoadMacroTileCT to disable pre-tiled SubDimension rewriting when duplicating an existing graph. |
| shared/rocroller/lib/source/KernelGraph/Transformations/LowerTile.cpp | Adds updatePreTiledSubDimStrides behavior and adjusts pre-tiled CT logic/order. |
| shared/rocroller/lib/source/KernelGraph/LowerFromCommand.cpp | Simplifies SubTileTranspose lowering to add inner tile SubDimensions; leaves an unused variable. |
| shared/rocroller/lib/include/rocRoller/KernelGraph/Utils.hpp | Extends addLoadMacroTileCT API with updatePreTiledSubDimStrides (default true) and documents it. |
| shared/rocroller/client/src/gemm.cpp | Removes scale pre-tile TensorDescriptor special-casing; always binds a standard 2D descriptor. |
| shared/rocroller/client/include/client/DataParallelGEMMSolution.hpp | Removes pre-tile TensorDescriptor special-casing; uses SubTileTranspose but currently lacks some pretile validation. |
| projects/hipblaslt/library/src/amd_detail/rocblaslt/src/rocroller/gemm.cpp | Switches scale pre-tiling from 4D Tensor descriptors to SubTileTranspose + standard 2D binding. |
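The `updatePreTiledSubDimStrides` flag described in the table (default true, disabled by SwizzleScale) can be pictured with the following sketch. `OuterDim` and `lowerPreTiledLoad` are hypothetical stand-ins, not the real `addLoadMacroTileCT` signature. The point is only that a rewrite which is not idempotent needs an opt-out when it runs over dimensions that were already rewritten.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for an outer SubDimension; not a rocRoller type.
struct OuterDim
{
    std::size_t size;   // number of entries along this dimension
    std::size_t stride; // stride in elements
};

// Hypothetical lowering helper. With updateStrides == true (the default) it
// rewrites the dimension from element granularity to tile granularity.
// The rewrite is not idempotent, so a caller that operates on an already
// rewritten dimension must pass false to avoid applying it twice.
inline void lowerPreTiledLoad(OuterDim& dim, std::size_t tileSize, bool updateStrides = true)
{
    if(!updateStrides)
        return; // dimension was rewritten by an earlier pass; leave it alone

    assert(tileSize > 0 && dim.size % tileSize == 0);
    dim.size /= tileSize;
    dim.stride *= tileSize;
}
```

In this sketch a first call with the default turns `{64, 1}` into `{4, 16}` for a tile size of 16, and a later call with `false` leaves it untouched. That mirrors the intent stated for the SwizzleScale change, which duplicates an existing graph.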
Code Coverage Report for gfx942
Summary
This PR adds/edits 73 newly uncovered lines.
Performance Report for gfx942
Comparison Summary
@@ Significant (p-val <0.05) Performance Diffs @@
====================================================================================================
+ 0.12% | p=2.5347e-02
| 3. FloatsGEMM(M: 256, N: 256, K: 16384, alpha: 2, beta: 0.5, types: {'type_A': 'float', 'type_B': 'float', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 2, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 509b104ce66) | GEMMProblem(M=256, N=256, K=16384, alpha=2, beta=0.5, types={'type_A': 'float', 'type_B': 'float', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=2, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 0.05% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 509b104ce66) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.78% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: TwoTile, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 509b104ce66) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='TwoTile', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.82% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: Standard, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 509b104ce66) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='Standard', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 1.35% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: TwoTile, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 509b104ce66) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='TwoTile', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.72% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: Standard, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 509b104ce66) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='Standard', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.84% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: TwoTile, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 509b104ce66) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='TwoTile', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 1.02% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: TwoTileDPFirst, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 509b104ce66) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='TwoTileDPFirst', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 1.33% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: Standard, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 509b104ce66) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='Standard', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.94% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: TwoTileDPFirst, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 509b104ce66) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='TwoTileDPFirst', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 1.16% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 128, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: TwoTile, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 509b104ce66) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=128, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='TwoTile', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.26% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: TwoTile, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 509b104ce66) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='TwoTile', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 0.54% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 256, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: TwoTile, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 509b104ce66) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=256, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='TwoTile', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 1.54% | p=5.6994e-05
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: TwoTile, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 509b104ce66) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='TwoTile', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 0.89% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 2, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: Standard, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 509b104ce66) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=2, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='Standard', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.30% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: Standard, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 509b104ce66) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='Standard', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 1.75% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: TwoTile, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 509b104ce66) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='TwoTile', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.83% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: TwoTileDPFirst, numWGs: 304, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 509b104ce66) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='TwoTileDPFirst', numWGs=304, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 1.50% | p=5.6994e-05
| CodeGen(instCount: 40000, instructions: comments)| CodeGen() | CodeGen(instCount: 40000, instructions: comments)
- 0.54% | p=5.6994e-05
| CodeGen(instCount: 40000, instructions: simple_mi)| CodeGen() | CodeGen(instCount: 40000, instructions: simple_mi)
|
Resource Report for gfx942
✔️ No Resource Usage Changes ✔️ |
perfci run on commit 9970352 |
Codecov Report
✅ All modified and coverable lines are covered by tests.
❌ Your project status has failed because the head coverage (77.21%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.
Additional details and impacted files
@@ Coverage Diff @@
## develop #5688 +/- ##
===========================================
- Coverage 67.41% 66.80% -0.60%
===========================================
Files 1868 1849 -19
Lines 289571 284936 -4635
Branches 40837 40020 -817
===========================================
- Hits 195198 190351 -4847
- Misses 77658 78191 +533
+ Partials 16715 16394 -321
*This pull request uses carry forward flags.
|
Performance Report for gfx950
Comparison Summary
@@ Significant (p-val <0.05) Performance Diffs @@
====================================================================================================
+ 0.55% | p=2.5347e-02
| 3. FloatsGEMM(M: 3072, N: 4096, K: 4096, alpha: 2, beta: 0, types: {'type_A': 'float', 'type_B': 'float', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 2, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 509b104ce66) | GEMMProblem(M=3072, N=4096, K=4096, alpha=2, beta=0, types={'type_A': 'float', 'type_B': 'float', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=2, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 0.27% | p=9.8009e-15
| 3. FloatsGEMM(M: 4096, N: 4096, K: 256, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScaleGFX950', 'scaleShuffleTileA': [32, 8, 4], 'scaleShuffleTileB': [32, 8, 4], 'scalePreTileA': [32, 8], 'scalePreTileB': [8, 32], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: 2)| GEMM(mac_m: 128, mac_n: 128, mac_k: 256, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeightedSimple, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToLDS, loadScale_B: BufferToLDS, swizzleScale: True, swizzleTileSize: {'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale: True, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 509b104ce66) | GEMMProblem(M=4096, N=4096, K=256, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScaleGFX950', 'scaleShuffleTileA': [32, 8, 4], 'scaleShuffleTileB': [32, 8, 4], 'scalePreTileA': [32, 8], 'scalePreTileB': [8, 32], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, 
initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, workgroupMappingValue=2)GEMMSolution(mac_m=128, mac_n=128, mac_k=256, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeightedSimple', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToLDS', loadScale_B='BufferToLDS', swizzleScale=True, swizzleTileSize={'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale=True, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 0.09% | p=3.5561e-02
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 256, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale: True, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 509b104ce66) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=256, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale=True, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.17% | p=2.7325e-03
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: 2)| GEMM(mac_m: 128, mac_n: 128, mac_k: 256, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale: True, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 509b104ce66) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, workgroupMappingValue=2)GEMMSolution(mac_m=128, mac_n=128, mac_k=256, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale=True, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.18% | p=2.5347e-02
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScale', 'scaleShuffleTileA': [64, 4, 2], 'scaleShuffleTileB': [64, 4, 2], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: 2)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 509b104ce66) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScale', 'scaleShuffleTileA': [64, 4, 2], 'scaleShuffleTileB': [64, 4, 2], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', 
initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, workgroupMappingValue=2)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 0.16% | p=8.3258e-03
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScale', 'scaleShuffleTileA': [64, 4, 4], 'scaleShuffleTileB': [64, 4, 4], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: 2)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeightedSimple, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 509b104ce66) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScale', 'scaleShuffleTileA': [64, 4, 4], 'scaleShuffleTileB': [64, 4, 4], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', 
initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, workgroupMappingValue=2)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeightedSimple', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 0.08% | p=1.7587e-05
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 509b104ce66) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', 
workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=128, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 0.06% | p=5.7413e-04
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 1, workgroupMappingValue: 2)| GEMM(mac_m: 128, mac_n: 128, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 509b104ce66) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', 
workgroupMappingDim=1, workgroupMappingValue=2)GEMMSolution(mac_m=128, mac_n=128, mac_k=128, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 0.17% | p=8.0987e-14
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 128, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 509b104ce66) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=128, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 0.16% | p=6.2574e-08
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 1, workgroupMappingValue: 2)| GEMM(mac_m: 128, mac_n: 128, mac_k: 128, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 509b104ce66) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=1, workgroupMappingValue=2)GEMMSolution(mac_m=128, mac_n=128, mac_k=128, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.24% | p=2.9184e-04
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScale', 'scaleShuffleTileA': [64, 4, 4], 'scaleShuffleTileB': [64, 4, 4], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 1, workgroupMappingValue: 1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 509b104ce66) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScale', 'scaleShuffleTileA': [64, 4, 4], 'scaleShuffleTileB': [64, 4, 4], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', 
initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=1, workgroupMappingValue=1)GEMMSolution(mac_m=128, mac_n=128, mac_k=128, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 0.90% | p=8.8969e-11
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0.5, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 1, prefetchMixMemOps: False, loadScale_A: BufferToLDSViaVGPR, loadScale_B: BufferToLDSViaVGPR, swizzleScale: False, swizzleTileSize: {'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 509b104ce66) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0.5, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=1, prefetchMixMemOps=False, loadScale_A='BufferToLDSViaVGPR', loadScale_B='BufferToLDSViaVGPR', swizzleScale=False, swizzleTileSize={'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.42% | p=6.5285e-153
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0.5, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 256, mac_n: 128, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 2, prefetchMixMemOps: False, loadScale_A: BufferToLDSViaVGPR, loadScale_B: BufferToLDSViaVGPR, swizzleScale: False, swizzleTileSize: {'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 509b104ce66) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0.5, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', 
initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=256, mac_n=128, mac_k=128, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=2, prefetchMixMemOps=False, loadScale_A='BufferToLDSViaVGPR', loadScale_B='BufferToLDSViaVGPR', swizzleScale=False, swizzleTileSize={'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.09% | p=1.3118e-07
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0.5, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 1, prefetchMixMemOps: False, loadScale_A: BufferToLDSViaVGPR, loadScale_B: BufferToLDSViaVGPR, swizzleScale: False, swizzleTileSize: {'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 509b104ce66) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0.5, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=1, prefetchMixMemOps=False, loadScale_A='BufferToLDSViaVGPR', loadScale_B='BufferToLDSViaVGPR', swizzleScale=False, swizzleTileSize={'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.12% | p=1.8915e-16
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0.5, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: 4)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 1, prefetchMixMemOps: False, loadScale_A: BufferToLDSViaVGPR, loadScale_B: BufferToLDSViaVGPR, swizzleScale: False, swizzleTileSize: {'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 509b104ce66) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0.5, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, workgroupMappingValue=4)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=1, prefetchMixMemOps=False, loadScale_A='BufferToLDSViaVGPR', loadScale_B='BufferToLDSViaVGPR', swizzleScale=False, swizzleTileSize={'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.88% | p=1.7451e-03
| 3. FloatsGEMM(M: 4096, N: 4096, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 256, workgroup_size_y: 1, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 509b104ce66) | GEMMProblem(M=4096, N=4096, K=8192, alpha=2, beta=0.5, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=64, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=256, workgroup_size_y=1, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 2.95% | p=1.7451e-03
| 3. FloatsGEMM(M: 4096, N: 4096, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'fp6', 'type_B': 'fp6', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 256, workgroup_size_y: 1, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 509b104ce66) | GEMMProblem(M=4096, N=4096, K=8192, alpha=2, beta=0.5, types={'type_A': 'fp6', 'type_B': 'fp6', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=128, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=256, workgroup_size_y=1, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 1.93% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 64, workgroup_size_y: 4, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 509b104ce66) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=64, workgroup_size_y=4, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 0.33% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: Standard, numWGs: 256, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 509b104ce66) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='Standard', numWGs=256, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
+ 0.14% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: TwoTile, numWGs: 256, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 509b104ce66) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='TwoTile', numWGs=256, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- 0.12% | p=2.5347e-02
Results truncated, see full report in workspace
Resource Report for gfx950
@@ Resource Usage Changes @@
====================================================================================================
- SGPR: 88 -> 90 (+2)
| 3. FloatsGEMM(M: 256, N: 256, K: 256, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScaleGFX950', 'scaleShuffleTileA': [32, 8, 4], 'scaleShuffleTileB': [32, 8, 4], 'scalePreTileA': [32, 8], 'scalePreTileB': [8, 32], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: 2)| GEMM(mac_m: 128, mac_n: 128, mac_k: 256, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeightedSimple, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToLDS, loadScale_B: BufferToLDS, swizzleScale: True, swizzleTileSize: {'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale: True, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 509b104ce66) | GEMMProblem(M=256, N=256, K=256, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScaleGFX950', 'scaleShuffleTileA': [32, 8, 4], 'scaleShuffleTileB': [32, 8, 4], 'scalePreTileA': [32, 8], 'scalePreTileB': [8, 32], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, 
initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, workgroupMappingValue=2)GEMMSolution(mac_m=128, mac_n=128, mac_k=256, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeightedSimple', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToLDS', loadScale_B='BufferToLDS', swizzleScale=True, swizzleTileSize={'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale=True, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- SGPR: 88 -> 90 (+2)
| 3. FloatsGEMM(M: 4096, N: 4096, K: 256, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScaleGFX950', 'scaleShuffleTileA': [32, 8, 4], 'scaleShuffleTileB': [32, 8, 4], 'scalePreTileA': [32, 8], 'scalePreTileB': [8, 32], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: 2)| GEMM(mac_m: 128, mac_n: 128, mac_k: 256, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeightedSimple, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToLDS, loadScale_B: BufferToLDS, swizzleScale: True, swizzleTileSize: {'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale: True, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: True, version: 509b104ce66) | GEMMProblem(M=4096, N=4096, K=256, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScaleGFX950', 'scaleShuffleTileA': [32, 8, 4], 'scaleShuffleTileB': [32, 8, 4], 'scalePreTileA': [32, 8], 'scalePreTileB': [8, 32], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, 
initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, workgroupMappingValue=2)GEMMSolution(mac_m=128, mac_n=128, mac_k=256, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeightedSimple', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToLDS', loadScale_B='BufferToLDS', swizzleScale=True, swizzleTileSize={'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale=True, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=True, version='')
- SGPR: 86 -> 88 (+2)
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScale', 'scaleShuffleTileA': [32, 8, 4], 'scaleShuffleTileB': [32, 8, 4], 'scalePreTileA': [32, 8], 'scalePreTileB': [8, 32], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: 2)| GEMM(mac_m: 256, mac_n: 256, mac_k: 256, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeightedSimple, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToLDS, loadScale_B: BufferToLDS, swizzleScale: True, swizzleTileSize: {'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale: True, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: False, version: 509b104ce66) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScale', 'scaleShuffleTileA': [32, 8, 4], 'scaleShuffleTileB': [32, 8, 4], 'scalePreTileA': [32, 8], 'scalePreTileB': [8, 32], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, 
initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, workgroupMappingValue=2)GEMMSolution(mac_m=256, mac_n=256, mac_k=256, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeightedSimple', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToLDS', loadScale_B='BufferToLDS', swizzleScale=True, swizzleTileSize={'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale=True, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=False, version='')
- SGPR: 86 -> 88 (+2)
| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScaleGFX950', 'scaleShuffleTileA': [32, 8, 4], 'scaleShuffleTileB': [32, 8, 4], 'scalePreTileA': [32, 8], 'scalePreTileB': [8, 32], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: 2)| GEMM(mac_m: 256, mac_n: 256, mac_k: 256, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDS, load_B: BufferToLDS, store: VGPRToGlobalMemoryWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeightedSimple, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToLDS, loadScale_B: BufferToLDS, swizzleScale: True, swizzleTileSize: {'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale: True, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops: False, version: 509b104ce66) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': 'PreSwizzleScaleGFX950', 'scaleShuffleTileA': [32, 8, 4], 'scaleShuffleTileB': [32, 8, 4], 'scalePreTileA': [32, 8], 'scalePreTileB': [8, 32], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, 
initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, workgroupMappingValue=2)GEMMSolution(mac_m=256, mac_n=256, mac_k=256, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDS', load_B='BufferToLDS', store='VGPRToGlobalMemoryWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeightedSimple', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToLDS', loadScale_B='BufferToLDS', swizzleScale=True, swizzleTileSize={'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale=True, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, tailLoops=False, version='')
Performance Report for gfx12
Comparison Summary
@@ Significant (p-val <0.05) Performance Diffs @@
====================================================================================================
+ 0.23% | p=1.7451e-03
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'bf8', 'type_B': 'bf8', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 16, wave_n: 16, wave_k: 16, wave_b: 1, workgroup_size_x: 64, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Sequential, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, tailLoops: True, version: 6a7e4a7ef07) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'bf8', 'type_B': 'bf8', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=16, wave_n=16, wave_k=16, wave_b=1, workgroup_size_x=64, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Sequential', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, tailLoops=True, version='')
+ 0.28% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'bf8', 'type_B': 'fp8', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 16, wave_n: 16, wave_k: 16, wave_b: 1, workgroup_size_x: 64, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Cooperative, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, tailLoops: True, version: 6a7e4a7ef07) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'bf8', 'type_B': 'fp8', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=16, wave_n=16, wave_k=16, wave_b=1, workgroup_size_x=64, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Cooperative', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, tailLoops=True, version='')
+ 0.18% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 16, wave_n: 16, wave_k: 16, wave_b: 1, workgroup_size_x: 64, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Sequential, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, tailLoops: True, version: 6a7e4a7ef07) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=16, wave_n=16, wave_k=16, wave_b=1, workgroup_size_x=64, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Sequential', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, tailLoops=True, version='')
+ 0.26% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 16, wave_n: 16, wave_k: 16, wave_b: 1, workgroup_size_x: 64, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Sequential, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, tailLoops: True, version: 6a7e4a7ef07) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=16, wave_n=16, wave_k=16, wave_b=1, workgroup_size_x=64, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Sequential', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, tailLoops=True, version='')
+ 0.49% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 16, wave_n: 16, wave_k: 16, wave_b: 1, workgroup_size_x: 64, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, tailLoops: True, version: 6a7e4a7ef07) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=16, wave_n=16, wave_k=16, wave_b=1, workgroup_size_x=64, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, tailLoops=True, version='')
+ 0.44% | p=2.5347e-02
| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'half', 'trans_A': 'T', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 16, wave_n: 16, wave_k: 16, wave_b: 1, workgroup_size_x: 64, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, store: VGPRToGlobalMemoryViaLDSWithBuffer, betaInFma: True, padLDS_A: [0, 0], padLDS_B: [0, 0], scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: None, numWGs: 0, architecture: {'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, tailLoops: True, version: 6a7e4a7ef07) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'half', 'trans_A': 'T', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': 'None', 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': [], 'pretileA': [], 'pretileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', 
initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=16, wave_n=16, wave_k=16, wave_b=1, workgroup_size_x=64, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', store='VGPRToGlobalMemoryViaLDSWithBuffer', betaInFma=True, padLDS_A=[0, 0], padLDS_B=[0, 0], scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK='None', numWGs=0, architecture={'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, tailLoops=True, version='')
+ 1.44% | p=1.7451e-03
| CodeGen(instCount: 40000, instructions: comments)| CodeGen() | CodeGen(instCount: 40000, instructions: comments)
+ 1.84% | p=5.6994e-05
| CodeGen(instCount: 40000, instructions: complex_mi_with_coop)| CodeGen() | CodeGen(instCount: 40000, instructions: complex_mi_with_coop)
+ 1.72% | p=5.6994e-05
| CodeGen(instCount: 40000, instructions: simple_mi)| CodeGen() | CodeGen(instCount: 40000, instructions: simple_mi)
|
Resource Report for gfx12: ✔️ No Resource Usage Changes ✔️ |
## Motivation

For AIROCROLL-1547. This PR simplifies rocRoller’s pre-tiling setup by shifting more of the pre-tiled layout handling into the KernelGraph lowering path, removing several places that manually constructed special tensor strides/descriptors for pre-tiled inputs and scales.

## Technical Details

- Rework pre-tiled coordinate transform lowering (`addLoadMacroTileCT*`) to optionally rewrite outer SubDimension sizes/strides for pre-tiled paths, and update SwizzleScale to avoid double-updating.
- Simplify command/tensor setup for pre-tiling and scale pre-tiling across tests, client, and hipblaslt integration by relying on `SubTileTranspose` + lowering rather than custom TensorDescriptors/strides.
- Adjust a GEMM pre-tile unit test to use much larger problem sizes.

---------

Co-authored-by: Caio S. Rohwedder <Caio.SalvadorRohwedder@amd.com>
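To picture what "rewriting outer SubDimension sizes/strides" for a pre-tiled tensor could mean, here is a minimal standalone sketch. It is not rocRoller code: `TileShape`, `rowMajorOffset`, and `preTiledOffset` are hypothetical names, and the layout (tiles stored contiguously, tiles ordered row-major, elements row-major within a tile) is an assumption for illustration only. The layout rocRoller actually produces via `SubTileTranspose` may differ.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical illustration only; none of these names are rocRoller APIs.
struct TileShape
{
    std::size_t rows;
    std::size_t cols;
};

// Plain row-major offset of element (i, j) in a matrix with K columns.
constexpr std::size_t rowMajorOffset(std::size_t i, std::size_t j, std::size_t K)
{
    return i * K + j;
}

// Offset of element (i, j) when the same matrix is stored tile by tile.
// Assumed layout: each tile is contiguous, tiles are ordered row-major,
// and elements inside a tile are row-major. Assumes K % t.cols == 0.
constexpr std::size_t preTiledOffset(std::size_t i, std::size_t j, std::size_t K, TileShape t)
{
    std::size_t tilesPerRow = K / t.cols;      // outer size along the column dimension
    std::size_t tileElems   = t.rows * t.cols; // stride between neighbouring tiles in a tile row

    std::size_t outerI = i / t.rows; // which tile row
    std::size_t innerI = i % t.rows; // row inside the tile
    std::size_t outerJ = j / t.cols; // which tile column
    std::size_t innerJ = j % t.cols; // column inside the tile

    // Outer strides: (tilesPerRow * tileElems, tileElems).
    // Inner strides: (t.cols, 1).
    return outerI * (tilesPerRow * tileElems) + outerJ * tileElems + innerI * t.cols + innerJ;
}
```

In this sketch the caller passes only the logical matrix shape and the tile shape; the outer sizes and strides are derived inside the function. That loosely mirrors the direction of this PR, where the lowering step rather than a hand-built tensor descriptor is responsible for the pre-tiled strides.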
Motivation
For AIROCROLL-1547. This PR simplifies rocRoller’s pre-tiling setup by shifting more of the pre-tiled layout handling into the KernelGraph lowering path, removing several places that manually constructed special tensor strides/descriptors for pre-tiled inputs and scales.
Technical Details