feat: add INT8/INT4 quantization support for 2-stage ASM MoE kernels #2340

Open
amd-zfyu wants to merge 2 commits into ROCm:main from amd-zfyu:asm_moe_2stages_int8_v2
@amd-zfyu (Contributor) commented Mar 19, 2026

Summary

Add INT8 per-token and INT4 (LQQ) quantization support for the 2-stage ASM MoE pipeline.

Changes

  • aiter/fused_moe_bf16_asm.py: Add asm_moe_stage2() wrapper, 2-stage ASM MoE pipeline with INT8/INT4 support, CSV-based kernel config lookup via pandas, refactored _run_asm_moe_a16() helper
  • csrc/py_itfs_cu/asm_moe_2stage.cu: Add Kernel2Args struct for stage2 kernels, INT8/INT4 kernel launch paths with splitk support, new fields (total_tgs, ps_deno, ptr_Qscl, ptr_Qzero, eLQQs)
  • csrc/include/moe_op.h: Add moe_stage2_g1u1 declaration
  • csrc/include/rocm_ops.hpp: Add INT8/INT4 MoE bindings
  • aiter/ops/moe_op.py: Register moe_stage2_g1u1 op
  • Pre-compiled .co kernels: Stage1 and stage2 binaries for INT8 per-token and INT4 LQQ quantization (gfx942, various tile sizes: 32x128 to 80x128)
  • op_tests/test_moe_ep.py: Add INT8/FP8 smoothquant EP test cases
  • Smoothquant fix: Use smooth_per_token_scaled_quant for both INT8 and FP8 smoothquant paths in EP mode
  • Backward compat: Fix legacy .co kernel loading in asm_moe_2stage
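As a rough illustration of what "INT8 per-token" smoothquant means in the bullets above, here is a minimal pure-Python sketch. It is not the code from this PR: the function name `smooth_per_token_int8`, the symmetric range of ±127, and the order of operations (multiply by a per-channel smooth scale, then derive one scale per token) are assumptions made for the example.

```python
from typing import List, Tuple


def smooth_per_token_int8(
    x: List[List[float]], smooth_scale: List[float]
) -> Tuple[List[List[int]], List[float]]:
    """Quantize each row (token) of x to INT8 with its own scale.

    Hypothetical reference helper, not the aiter kernel.
    """
    q_rows: List[List[int]] = []
    scales: List[float] = []
    for row in x:
        # Apply the per-channel smoothing factor before quantizing.
        smoothed = [v * s for v, s in zip(row, smooth_scale)]
        amax = max(abs(v) for v in smoothed)
        # One scale per token; fall back to 1.0 for an all-zero row.
        scale = amax / 127.0 if amax > 0 else 1.0
        # Round to nearest and clamp to the symmetric INT8 range.
        q = [max(-127, min(127, round(v / scale))) for v in smoothed]
        q_rows.append(q)
        scales.append(scale)
    return q_rows, scales


x = [[1.0, -2.0, 0.5], [0.0, 0.0, 0.0]]
q, scales = smooth_per_token_int8(x, [1.0, 0.5, 2.0])
```

Dequantizing a value is then `q[i][j] * scales[i]`, which is why the scale has to travel with each token to the next stage.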

Test plan

  • CI: gfx942/gfx950 op tests
  • test_moe_ep.py INT8 smoothquant tests
  • test_moe_ep.py FP8 smoothquant tests

@amd-zfyu amd-zfyu requested a review from a team March 19, 2026 02:33
@amd-zfyu amd-zfyu force-pushed the asm_moe_2stages_int8_v2 branch 2 times, most recently from 6b1f76e to c662413 March 27, 2026 08:58
- Add moe_stage2_g1u1 C API for stage2 ASM kernel launch
- Extend moe_stage1_g1u1 with LQQ scale/zero, fc2_smooth_scale, multix support
- Add Kernel2Args struct for stage2 kernel arguments
- Add get_cfg_stage2() and is_MultiX logic in get_cfg()
- Enhanced heuristic kernel selection with buffer kernel tie-breaking
- Add lqq_1x64 to QuantType enum and pybind11 bindings
- Fix codegen.py: collect union of all CSV columns across groups
  (prevents smf/pf columns from being dropped due to glob ordering)
- Add 26 new .co kernel binaries and 4 CSV configs for gfx942
- Add Python wrappers: asm_moe_stage2, AsmInt8Config, 2-stage pipeline
- Fix opus.hpp compiler compatibility for LDS address space casts
- Update test_moe_ep.py with shared_expert parameterization

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
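The codegen.py fix in the commit message above (collecting the union of CSV columns so that `smf`/`pf` are not dropped) could look roughly like the sketch below. This is an assumption about the approach, not the actual codegen.py code: the helper names `union_columns` and `load_rows`, the `knl_name`/`tile_m` columns, and the empty-string fill value are all hypothetical.

```python
import csv
import io
from typing import Dict, List, Tuple


def union_columns(csv_texts: List[str]) -> List[str]:
    """Return every column name seen in any CSV, in first-seen order."""
    columns: Dict[str, None] = {}
    for text in csv_texts:
        reader = csv.DictReader(io.StringIO(text))
        for name in reader.fieldnames or []:
            columns.setdefault(name, None)
    return list(columns)


def load_rows(
    csv_texts: List[str], fill: str = ""
) -> Tuple[List[str], List[Dict[str, str]]]:
    """Load all rows, padding columns a given CSV does not have."""
    columns = union_columns(csv_texts)
    rows: List[Dict[str, str]] = []
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            rows.append({c: row.get(c, fill) for c in columns})
    return columns, rows


# The first group lacks smf/pf; taking the header from it alone would lose them.
group_a = "knl_name,tile_m\nk0,32\n"
group_b = "knl_name,tile_m,smf,pf\nk1,80,1,0\n"
cols, rows = load_rows([group_a, group_b])
```

Because the column set no longer comes from whichever file happens to be read first, the result does not depend on glob ordering.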
@amd-zfyu amd-zfyu force-pushed the asm_moe_2stages_int8_v2 branch from c662413 to 1b5820c March 27, 2026 09:07
@zufayu zufayu requested a review from junhaha666 March 27, 2026 14:45
…zation

Replace smooth_per_token_scaled_quant with moe_smooth_per_token_scaled_quant
which accepts sorted_token_ids, sorted_expert_ids, num_valid_ids, and block_m
for better kernel dispatch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
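To make the extra arguments named in the commit message above more concrete, the sketch below shows one common way MoE code groups token slots by expert and pads each group to a multiple of `block_m`. It is illustrative only: the function `sort_tokens_by_expert`, the flat `token * topk + k` slot encoding, and the padding sentinel are assumptions, and the real aiter layout for `sorted_token_ids`, `sorted_expert_ids`, and `num_valid_ids` may differ.

```python
from typing import List, Tuple


def sort_tokens_by_expert(
    topk_ids: List[List[int]], num_experts: int, block_m: int
) -> Tuple[List[int], List[int], int]:
    """Group (token, k) slots by expert, padding each group to block_m.

    Hypothetical layout for illustration, not the aiter implementation.
    """
    topk = len(topk_ids[0])
    pad = len(topk_ids) * topk  # sentinel: one past the last real slot
    sorted_token_ids: List[int] = []
    sorted_expert_ids: List[int] = []
    for e in range(num_experts):
        slots = [
            t * topk + k
            for t, row in enumerate(topk_ids)
            for k, expert in enumerate(row)
            if expert == e
        ]
        if not slots:
            continue
        # Pad so every block of block_m rows belongs to a single expert.
        slots += [pad] * (-len(slots) % block_m)
        sorted_token_ids += slots
        # One expert id per block_m-sized block.
        sorted_expert_ids += [e] * (len(slots) // block_m)
    num_valid_ids = len(sorted_token_ids)  # padded length in this sketch
    return sorted_token_ids, sorted_expert_ids, num_valid_ids


topk_ids = [[0, 1], [1, 2], [1, 0]]  # 3 tokens, top-2 routing
ids, experts, num_valid = sort_tokens_by_expert(topk_ids, num_experts=3, block_m=2)
```

With this kind of layout, a quantization kernel can process one `block_m` block at a time and look up the expert (and hence the matching smooth scale) once per block instead of once per row.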