feat: add INT8/INT4 quantization support for 2-stage ASM MoE kernels by amd-zfyu · Pull Request #2340 · ROCm/aiter

amd-zfyu · 2026-03-19T02:33:48Z

Summary

Add INT8 per-token and INT4 (LQQ) quantization support for the 2-stage ASM MoE pipeline.
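For readers unfamiliar with the term, "per-token" quantization generally means one scale per activation row. The following is a minimal pure-Python sketch of symmetric per-token INT8 quantization under that assumption; it is not the ASM kernel or the aiter implementation, and the function names `quantize_per_token_int8` and `dequantize` are hypothetical.

```python
from typing import List, Tuple


def quantize_per_token_int8(
    rows: List[List[float]],
) -> Tuple[List[List[int]], List[float]]:
    """Symmetric per-token INT8 quantization: one scale per row (token)."""
    q_rows: List[List[int]] = []
    scales: List[float] = []
    for row in rows:
        amax = max((abs(v) for v in row), default=0.0)
        # Guard against all-zero rows so the scale is never zero.
        scale = amax / 127.0 if amax > 0.0 else 1.0
        # Round to the nearest integer and clamp to the symmetric INT8 range.
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows: List[List[int]], scales: List[float]) -> List[List[float]]:
    """Recover approximate float values from quantized rows and their scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


# Two tokens with very different magnitudes each get their own scale.
tokens = [[0.5, -1.0, 0.25], [8.0, 2.0, -4.0]]
q, s = quantize_per_token_int8(tokens)
restored = dequantize(q, s)
```

Because each row carries its own scale, a token with large activations does not reduce the precision available to a token with small ones.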

Changes

aiter/fused_moe_bf16_asm.py: Add asm_moe_stage2() wrapper, 2-stage ASM MoE pipeline with INT8/INT4 support, CSV-based kernel config lookup via pandas, refactored _run_asm_moe_a16() helper
csrc/py_itfs_cu/asm_moe_2stage.cu: Add Kernel2Args struct for stage2 kernels, INT8/INT4 kernel launch paths with splitk support, new fields (total_tgs, ps_deno, ptr_Qscl, ptr_Qzero, eLQQs)
csrc/include/moe_op.h: Add moe_stage2_g1u1 declaration
csrc/include/rocm_ops.hpp: Add INT8/INT4 MoE bindings
aiter/ops/moe_op.py: Register moe_stage2_g1u1 op
Pre-compiled .co kernels: Stage1 and stage2 binaries for INT8 per-token and INT4 LQQ quantization (gfx942, various tile sizes: 32x128 to 80x128)
op_tests/test_moe_ep.py: Add INT8/FP8 smoothquant EP test cases
Smoothquant fix: Use smooth_per_token_scaled_quant for both INT8 and FP8 smoothquant paths in EP mode
Backward compat: Fix legacy .co kernel loading in asm_moe_2stage
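The CSV-based kernel config lookup mentioned above could look roughly like the sketch below. The PR states that the lookup uses pandas; this sketch swaps in the standard-library `csv` module so it stays self-contained. The column names, kernel names, and the "smallest tile that fits" rule are all assumptions made for illustration, not the PR's actual schema or heuristic.

```python
import csv
import io

# Hypothetical CSV contents: column names and kernel names are invented.
CSV_TEXT = """quant_type,tile_m,tile_n,kernel_name
int8_per_token,32,128,stage2_int8_32x128
int8_per_token,80,128,stage2_int8_80x128
int4_lqq,32,128,stage2_int4_32x128
"""


def load_configs(text):
    """Parse CSV text into a list of dicts with integer tile sizes."""
    return [
        {**row, "tile_m": int(row["tile_m"]), "tile_n": int(row["tile_n"])}
        for row in csv.DictReader(io.StringIO(text))
    ]


def pick_kernel(configs, quant_type, tokens_per_expert):
    """Pick the smallest tile_m that covers the token count, else the largest."""
    candidates = [c for c in configs if c["quant_type"] == quant_type]
    if not candidates:
        raise KeyError(f"no kernel config for quant_type={quant_type!r}")
    fitting = [c for c in candidates if c["tile_m"] >= tokens_per_expert]
    if fitting:
        chosen = min(fitting, key=lambda c: c["tile_m"])
    else:
        chosen = max(candidates, key=lambda c: c["tile_m"])
    return chosen["kernel_name"]


configs = load_configs(CSV_TEXT)
small = pick_kernel(configs, "int8_per_token", 16)
large = pick_kernel(configs, "int8_per_token", 500)
```

Keeping the table in CSV means new tile sizes can be added without touching the dispatch code, which is presumably part of the appeal of this design.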

Test plan

CI: gfx942/gfx950 op tests
test_moe_ep.py INT8 smoothquant tests
test_moe_ep.py FP8 smoothquant tests

- Add moe_stage2_g1u1 C API for stage2 ASM kernel launch
- Extend moe_stage1_g1u1 with LQQ scale/zero, fc2_smooth_scale, multix support
- Add Kernel2Args struct for stage2 kernel arguments
- Add get_cfg_stage2() and is_MultiX logic in get_cfg()
- Enhanced heuristic kernel selection with buffer kernel tie-breaking
- Add lqq_1x64 to QuantType enum and pybind11 bindings
- Fix codegen.py: collect union of all CSV columns across groups (prevents smf/pf columns from being dropped due to glob ordering)
- Add 26 new .co kernel binaries and 4 CSV configs for gfx942
- Add Python wrappers: asm_moe_stage2, AsmInt8Config, 2-stage pipeline
- Fix opus.hpp compiler compatibility for LDS address space casts
- Update test_moe_ep.py with shared_expert parameterization

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
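The codegen.py fix in the commit message above describes collecting the union of CSV columns across groups so that columns such as smf and pf are not lost when the first file found by the glob lacks them. A minimal sketch of that idea follows; it is not the actual codegen.py code, the helper names `union_columns` and `merge_rows` are hypothetical, and the `knl_name` and `tile_m` columns are invented for the example.

```python
import csv
import io


def union_columns(csv_texts):
    """Return every column name seen in any CSV, in first-seen order."""
    columns = []
    seen = set()
    for text in csv_texts:
        reader = csv.DictReader(io.StringIO(text))
        for name in reader.fieldnames or []:
            if name not in seen:
                seen.add(name)
                columns.append(name)
    return columns


def merge_rows(csv_texts, fill=""):
    """Merge rows from all CSVs, padding missing columns with `fill`."""
    columns = union_columns(csv_texts)
    rows = []
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            rows.append({c: row.get(c, fill) for c in columns})
    return columns, rows


# Group A lacks the smf/pf columns; group B has them.
group_a = "knl_name,tile_m\nkernel_a,32\n"
group_b = "knl_name,tile_m,smf,pf\nkernel_b,80,1,0\n"
columns, rows = merge_rows([group_a, group_b])
```

If the header were taken only from the first file processed, the result would depend on file ordering; taking the union first makes the set of columns independent of that order.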

…zation

Replace smooth_per_token_scaled_quant with moe_smooth_per_token_scaled_quant, which accepts sorted_token_ids, sorted_expert_ids, num_valid_ids, and block_m for better kernel dispatch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

amd-zfyu requested a review from a team March 19, 2026 02:33

amd-zfyu force-pushed the asm_moe_2stages_int8_v2 branch 2 times, most recently from 6b1f76e to c662413 Compare March 27, 2026 08:58

amd-zfyu force-pushed the asm_moe_2stages_int8_v2 branch from c662413 to 1b5820c Compare March 27, 2026 09:07

zufayu requested a review from junhaha666 March 27, 2026 14:45

amd-zfyu wants to merge 2 commits into ROCm:main from amd-zfyu:asm_moe_2stages_int8_v2
