feat: add INT8/INT4 quantization support for 2-stage ASM MoE kernels #2340
- Add moe_stage2_g1u1 C API for stage2 ASM kernel launch
- Extend moe_stage1_g1u1 with LQQ scale/zero, fc2_smooth_scale, multix support
- Add Kernel2Args struct for stage2 kernel arguments
- Add get_cfg_stage2() and is_MultiX logic in get_cfg()
- Enhance heuristic kernel selection with buffer kernel tie-breaking
- Add lqq_1x64 to QuantType enum and pybind11 bindings
- Fix codegen.py: collect union of all CSV columns across groups (prevents smf/pf columns from being dropped due to glob ordering)
- Add 26 new .co kernel binaries and 4 CSV configs for gfx942
- Add Python wrappers: asm_moe_stage2, AsmInt8Config, 2-stage pipeline
- Fix opus.hpp compiler compatibility for LDS address space casts
- Update test_moe_ep.py with shared_expert parameterization

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
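The codegen.py fix listed above addresses a real pitfall when concatenating config CSVs: if the first file globbed happens to lack optional columns (such as smf/pf), naive column alignment silently drops them. A minimal pandas sketch of the fix follows; the function name and CSV layout here are illustrative, not the actual codegen.py code.

```python
import pandas as pd
from io import StringIO

def load_kernel_configs(csv_texts):
    """Load per-group kernel config CSVs and align them on the union of
    all columns seen across every group, so optional columns present in
    only some files survive regardless of glob ordering."""
    frames = [pd.read_csv(StringIO(t)) for t in csv_texts]
    # Build the order-preserving union of all columns across groups.
    all_cols = []
    for f in frames:
        for c in f.columns:
            if c not in all_cols:
                all_cols.append(c)
    # Reindex each frame to the full column set; missing cells become NaN.
    aligned = [f.reindex(columns=all_cols) for f in frames]
    return pd.concat(aligned, ignore_index=True)
```

With this, a config table loaded as `[a.csv, b.csv]` carries the same columns as one loaded as `[b.csv, a.csv]`.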
…zation

Replace smooth_per_token_scaled_quant with moe_smooth_per_token_scaled_quant, which accepts sorted_token_ids, sorted_expert_ids, num_valid_ids, and block_m for better kernel dispatch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
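For reference, the math that a per-token smooth-quant op computes can be sketched in NumPy as below. This is a schematic under assumptions: the real op is a fused GPU kernel, and the signature shown (tensor shapes, qmax) is illustrative rather than aiter's actual API.

```python
import numpy as np

def smooth_per_token_scaled_quant(x, smooth_scale, qmax=127):
    """Per-token smooth quantization to INT8 (NumPy sketch).

    x            : (tokens, hidden) activations
    smooth_scale : (hidden,) per-channel smoothing factors
    Returns int8 values plus one float dequant scale per token."""
    xs = x * smooth_scale                      # fold smoothing into activations
    per_token_amax = np.abs(xs).max(axis=1)    # (tokens,)
    scale = per_token_amax / qmax              # per-token dequant scale
    q = np.clip(np.round(xs / scale[:, None]), -qmax, qmax).astype(np.int8)
    return q, scale
```

The sorted_token_ids/sorted_expert_ids/block_m arguments mentioned in the commit let the fused kernel quantize tokens in the same expert-sorted block layout the MoE GEMM consumes, which is what improves dispatch.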
Summary
Add INT8 per-token and INT4 (LQQ) quantization support for the 2-stage ASM MoE pipeline.
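To make the INT4 (LQQ) side concrete: the lqq_1x64 quant type suggests group-wise 4-bit quantization with a scale and zero-point per group of 64 values (matching the ptr_Qscl/ptr_Qzero kernel fields). The sketch below shows generic asymmetric INT4 group quantization in NumPy; it is an assumed layout for illustration, not the actual LQQ kernel format or bit packing.

```python
import numpy as np

def quantize_int4_grouped(w, group=64):
    """Asymmetric INT4 group quantization (sketch): one scale and one
    zero-point per group of `group` weights along the last axis."""
    rows, cols = w.shape
    wg = w.reshape(rows, cols // group, group)
    wmin = wg.min(axis=2, keepdims=True)
    wmax = wg.max(axis=2, keepdims=True)
    scale = (wmax - wmin) / 15.0               # 4-bit unsigned range: 0..15
    zero = np.round(-wmin / scale)             # zero-point in quantized domain
    q = np.clip(np.round(wg / scale) + zero, 0, 15).astype(np.uint8)
    return q.reshape(rows, cols), scale.squeeze(2), zero.squeeze(2)

def dequantize_int4_grouped(q, scale, zero, group=64):
    rows, cols = q.shape
    qg = q.reshape(rows, cols // group, group).astype(np.float32)
    return ((qg - zero[..., None]) * scale[..., None]).reshape(rows, cols)
```

A real kernel would additionally pack two 4-bit values per byte; that packing step is omitted here.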
Changes
- aiter/fused_moe_bf16_asm.py: Add asm_moe_stage2() wrapper, a 2-stage ASM MoE pipeline with INT8/INT4 support, CSV-based kernel config lookup via pandas, and a refactored _run_asm_moe_a16() helper
- csrc/py_itfs_cu/asm_moe_2stage.cu: Add Kernel2Args struct for stage2 kernels, INT8/INT4 kernel launch paths with splitk support, and new fields (total_tgs, ps_deno, ptr_Qscl, ptr_Qzero, eLQQs)
- csrc/include/moe_op.h: Add moe_stage2_g1u1 declaration
- csrc/include/rocm_ops.hpp: Add INT8/INT4 MoE bindings
- aiter/ops/moe_op.py: Register the moe_stage2_g1u1 op
- .co kernels: Stage1 and stage2 binaries for INT8 per-token and INT4 LQQ quantization (gfx942, tile sizes 32x128 to 80x128)
- op_tests/test_moe_ep.py: Add INT8/FP8 smoothquant EP test cases; smooth_per_token_scaled_quant for both the INT8 and FP8 smoothquant paths in EP mode
- .co kernel loading in asm_moe_2stage
Test plan
- test_moe_ep.py INT8 smoothquant tests
- test_moe_ep.py FP8 smoothquant tests
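For readers unfamiliar with the 2-stage split these kernels implement, a plain NumPy reference of the overall dataflow may help: stage1 runs the gate/up GEMM plus the gated activation per selected expert, and stage2 runs the down GEMM with the weighted top-k reduction. This is a schematic sketch in fp32 only; the real kernels fuse the INT8/INT4 quantization of the intermediate activations between the two stages, and none of these function names are aiter's API.

```python
import numpy as np

def moe_2stage_ref(x, w_gate_up, w_down, topk_ids, topk_w):
    """NumPy reference for a 2-stage MoE forward pass (schematic).

    x          : (tokens, hidden)
    w_gate_up  : (experts, hidden, 2*inter)  gate and up projections
    w_down     : (experts, inter, hidden)
    topk_ids   : (tokens, k) expert indices per token
    topk_w     : (tokens, k) routing weights per token"""
    inter = w_down.shape[1]
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e, wgt in zip(topk_ids[t], topk_w[t]):
            # Stage 1: gate/up GEMM + SiLU-gated activation.
            gu = x[t] @ w_gate_up[e]                 # (2*inter,)
            gate, up = gu[:inter], gu[inter:]
            act = gate / (1.0 + np.exp(-gate)) * up  # SiLU(gate) * up
            # Stage 2: down GEMM + weighted top-k accumulation.
            out[t] += wgt * (act @ w_down[e])
    return out
```

In the quantized pipeline, `act` would be quantized per token (INT8) or per group (INT4 LQQ) before the stage2 GEMM, with the dequant scales applied during accumulation.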