
Refactor topk softmax asm bind#2327

Merged
amd-ruitang3 merged 6 commits into main from refactor_topk_softmax_asm_bind
Mar 23, 2026

Conversation

@yzhou103
Contributor

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

@yzhou103 yzhou103 requested review from a team and Copilot March 18, 2026 08:50
@github-actions
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

  • ci:sglang: SGLang integration tests
  • ci:atom: ATOM benchmark (DeepSeek-R1 + GPT-OSS)
  • ci:vllm: vLLM benchmark
  • ci:all: All of the above

Add labels via the sidebar or gh pr edit 2327 --add-label <label>

Contributor

Copilot AI left a comment


Pull request overview

This PR refactors several ASM-backed ops to use a torch-free C ABI (extern "C") plus a Python ctypes call path (via a new AiterTensor struct), reducing reliance on PyTorch C++/pybind for these kernels.

Changes:

  • Introduces AiterTensor/AiterDtype plumbing and a compile_ops(..., ffi_type="ctypes") dispatch path that loads and calls .so symbols via ctypes.
  • Refactors ASM implementations (topk-softmax, layernorm, GEMM a16w16) to accept AiterTensor* and an explicit hipStream_t.
  • Adjusts build config and Python wrappers to use the new ctypes path; adds contiguity gating for the ASM topk-softmax callsites/tests.
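The AiterTensor/ctypes plumbing described above can be sketched in miniature. This is an illustrative mock, not the PR's actual layout: the field names (data_ptr, ndim, dtype, sizes, strides), MAX_DIMS, and the helper pack_tensor are assumptions; the real definitions live in csrc/include/aiter_tensor.h and aiter/utility/aiter_types.py.

```python
import ctypes

MAX_DIMS = 8  # assumed cap; the real header defines its own limit

class AiterTensor(ctypes.Structure):
    """Hypothetical mirror of csrc/include/aiter_tensor.h."""
    _fields_ = [
        ("data_ptr", ctypes.c_void_p),
        ("ndim", ctypes.c_int32),
        ("dtype", ctypes.c_int32),            # AiterDtype enum value
        ("sizes", ctypes.c_int64 * MAX_DIMS),
        ("strides", ctypes.c_int64 * MAX_DIMS),
    ]

def pack_tensor(ptr, sizes, strides, dtype_id):
    """Fill an AiterTensor from raw metadata, roughly what the
    ctypes dispatch path would extract from a torch.Tensor."""
    t = AiterTensor()
    t.data_ptr = ptr
    t.ndim = len(sizes)
    t.dtype = dtype_id
    for i, (sz, st) in enumerate(zip(sizes, strides)):
        t.sizes[i] = sz
        t.strides[i] = st
    return t

# A kernel symbol would then be invoked roughly as:
#   lib = ctypes.CDLL("<module>.so")
#   lib.topk_softmax_asm(ctypes.byref(out_t), ..., ctypes.c_void_p(stream))
buf = (ctypes.c_float * 8)()          # stand-in for device memory
t = pack_tensor(ctypes.addressof(buf), [2, 4], [4, 1], dtype_id=0)
print(t.ndim, list(t.sizes[: t.ndim]))  # prints: 2 [2, 4]
```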

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 4 comments.

Summary per file:
op_tests/test_moeTopkSoftmax.py Makes ASM path contiguous; minor cleanup in allclose helper
csrc/py_itfs_cu/asm_topksoftmax.cu Switches to C ABI + AiterTensor* args and explicit stream/device handling
csrc/py_itfs_cu/asm_layernorm.cu Switches to C ABI + AiterTensor* args and explicit stream/device handling
csrc/py_itfs_cu/asm_gemm_a16w16.cu Switches to C ABI + AiterTensor* args; removes torch types; explicit device handling
csrc/include/rocm_ops.hpp Removes some pybind bindings; updates macros (notably quant bindings)
csrc/include/norm.h Formatting + removes ASM layernorm declarations from this torch header
csrc/include/moe_op.h Removes torch-signature declaration for topk_softmax_asm
csrc/include/asm_gemm_a16w16.h Deletes old torch-signature header for GEMM ASM
csrc/include/aiter_tensor.h Adds new C struct used for ctypes FFI tensor metadata
csrc/include/aiter_hip_common.h Adds AITER_CHECK and includes for new enum/tensor headers
csrc/include/aiter_enum.h Adds AiterDtype enum and helpers
aiter/utility/aiter_types.py Adds ctypes definitions + header parsing for dtype IDs
aiter/utility/dtypes.py Adds torch_to_aiter() conversion and dtype maps for ctypes path
aiter/ops/norm.py Points ASM layernorm wrappers to a new module + ctypes FFI
aiter/ops/moe_op.py Switches topk_softmax_asm wrapper to ctypes FFI
aiter/ops/gemm_op_a16w16.py Switches GEMM ASM wrapper to ctypes FFI and returns out explicitly
aiter/jit/optCompilerConfig.json Updates build modules (drops some pybind sources; adds module_asm_layernorm)
aiter/jit/core.py Adds _ctypes_call() and ffi_type option to compile_ops
aiter/fused_moe.py Adds gating_output.is_contiguous() requirement for ASM topk-softmax fast path
Comments suppressed due to low confidence (2)

aiter/ops/moe_op.py:32

  • topk_softmax_asm is switched to ffi_type="ctypes", but it still targets module_moe_asm, whose build config includes torch/pybind sources (e.g., pybind/moe_op_pybind.cu). _ctypes_call() forces torch_exclude=True when it needs to build the module, so a first-time call to topk_softmax_asm (without the pybind module already built) is likely to fail to link/compile. Suggested fix: create a dedicated torch-free module (e.g., module_asm_topksoftmax) that only compiles py_itfs_cu/asm_topksoftmax.cu (+ any needed deps) and point this decorator at it; alternatively, avoid setting torch_exclude=True for modules that still depend on torch/pybind.
@compile_ops("module_moe_asm", fc_name="topk_softmax_asm", ffi_type="ctypes")
def topk_softmax_asm(
    topk_weights: Tensor,
    topk_indices: Tensor,
    token_expert_indices: Tensor,
    gating_output: Tensor,
    need_renorm: bool,
) -> None: ...

csrc/include/rocm_ops.hpp:1459

  • QUANT_PYBIND no longer binds moe_smooth_per_token_scaled_quant_v1/v2, but Python still calls these via @compile_ops("module_quant") (see aiter/ops/quant.py). This will cause runtime AttributeError when the compiled module_quant is imported and getattr(module, ...) is attempted. Either restore these m.def(...) bindings in QUANT_PYBIND or update the Python API to stop referencing these functions.
#define QUANT_PYBIND                                                     \
    m.def("static_per_tensor_quant", &aiter::static_per_tensor_quant);   \
    m.def("dynamic_per_tensor_quant", &aiter::dynamic_per_tensor_quant); \
    m.def("dynamic_per_token_scaled_quant",                              \
          &aiter::dynamic_per_token_scaled_quant,                        \
          py::arg("out"),                                                \
          py::arg("input"),                                              \
          py::arg("scales"),                                             \
          py::arg("scale_ub")        = std::nullopt,                     \
          py::arg("shuffle_scale")   = false,                            \
          py::arg("num_rows")        = std::nullopt,                     \
          py::arg("num_rows_factor") = 1);                               \
    m.def("dynamic_per_group_scaled_quant_fp4",                          \
          &aiter::dynamic_per_group_scaled_quant_fp4,                    \
          py::arg("out"),                                                \
          py::arg("input"),                                              \
          py::arg("scales"),                                             \
          py::arg("group_size")      = 32,                               \
          py::arg("shuffle_scale")   = true,                             \
          py::arg("num_rows")        = std::nullopt,                     \
          py::arg("num_rows_factor") = 1);                               \
    m.def("smooth_per_token_scaled_quant",                               \
          &aiter::smooth_per_token_scaled_quant,                         \
          py::arg("out"),                                                \
          py::arg("input"),                                              \
          py::arg("scales"),                                             \
          py::arg("smooth_scale"),                                       \
          py::arg("smooth_scale_map")      = std::nullopt,               \
          py::arg("shuffle_scale")         = false,                      \
          py::arg("num_rows")              = std::nullopt,               \
          py::arg("num_rows_factor")       = 1,                          \
          py::arg("smooth_scale_map_hash") = std::nullopt,               \
          py::arg("enable_ps")             = true);                                  \
    m.def("partial_transpose",                                           \
          &aiter::partial_transpose,                                     \
          py::arg("out"),                                                \
          py::arg("input"),                                              \
          py::arg("num_rows"));
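The failure mode flagged above can be reproduced in miniature without compiling anything: the compile_ops path resolves functions via getattr on the loaded module, so a binding dropped from QUANT_PYBIND only surfaces as an AttributeError at lookup time. A sketch (the module contents here are stand-ins, not the real compiled extension):

```python
import types

# Stand-in for the compiled module_quant extension: it still exports
# the bindings that QUANT_PYBIND kept, but the v1/v2 variants are gone.
module_quant = types.SimpleNamespace(
    static_per_tensor_quant=lambda *a: None,
    dynamic_per_tensor_quant=lambda *a: None,
    # moe_smooth_per_token_scaled_quant_v1/v2 no longer bound
)

try:
    getattr(module_quant, "moe_smooth_per_token_scaled_quant_v1")
except AttributeError as e:
    err = str(e)
print(err)
```

The lookup succeeds at import but fails the first time the removed name is requested, which is why the review suggests either restoring the m.def(...) entries or dropping the Python references.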


Comment threads: op_tests/test_moeTopkSoftmax.py, csrc/include/aiter_tensor.h, csrc/include/aiter_enum.h (outdated), aiter/utility/dtypes.py
@amd-ruitang3 amd-ruitang3 merged commit 378efe2 into main Mar 23, 2026
32 checks passed
@amd-ruitang3 amd-ruitang3 deleted the refactor_topk_softmax_asm_bind branch March 23, 2026 02:52
zufayu pushed a commit that referenced this pull request Mar 23, 2026
This declaration was erroneously re-added during rebase conflict
resolution. PR #2327 already removed it when migrating to ctypes.
zufayu added a commit that referenced this pull request Mar 23, 2026
* Migrate MoE ASM kernels from pybind to C ABI + ctypes

Convert fmoe, fmoe_int8_g1u0, fmoe_g1u1, fmoe_g1u1_tkw1,
fmoe_int8_g1u0_a16, fmoe_g1u1_a16, fmoe_fp8_blockscale_g1u1,
and moe_stage1_g1u1 from torch::Tensor& (pybind11) to
AiterTensor* + hipStream_t (C ABI called via ctypes).

- asm_fmoe.cu: Remove torch/ATen includes, use AiterTensor*,
  AITER_DTYPE_*, AITER_CHECK; template <int I_elemSize, int O_elemSize>
- asm_moe_2stage.cu: Same conversion for moe_stage1_g1u1
- moe_op.h: Remove fmoe pybind declarations (now extern "C")
- rocm_ops.hpp: Remove fmoe entries from MOE_OP_PYBIND macro
- moe_op.py: Use ffi_type="ctypes" with new module_moe_fmoe_asm
- optCompilerConfig.json: Split ctypes sources into module_moe_fmoe_asm
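The AiterTensor* + hipStream_t calling convention this commit describes can be sketched without ROCm by standing in a Python callback for the extern "C" symbol. Everything below is illustrative: the struct fields and the two-tensor signature are assumptions, and the real fmoe symbols in module_moe_fmoe_asm take many more AiterTensor* arguments.

```python
import ctypes

class AiterTensor(ctypes.Structure):
    """Illustrative subset of the FFI tensor-metadata struct."""
    _fields_ = [("data_ptr", ctypes.c_void_p),
                ("ndim", ctypes.c_int32),
                ("dtype", ctypes.c_int32)]

# Prototype of a hypothetical C ABI kernel entry point:
#   void kernel(AiterTensor* out, AiterTensor* in, hipStream_t stream)
FMOE_SIG = ctypes.CFUNCTYPE(None,
                            ctypes.POINTER(AiterTensor),  # out
                            ctypes.POINTER(AiterTensor),  # input
                            ctypes.c_void_p)              # hipStream_t

calls = []

@FMOE_SIG
def fake_fmoe(out_p, in_p, stream):
    # Records what the C side would see: struct fields plus the raw stream.
    calls.append((out_p.contents.ndim, stream))

out_t = AiterTensor(None, 2, 0)
in_t = AiterTensor(None, 2, 0)
# The Python wrapper passes structs by reference and the HIP stream as void*.
fake_fmoe(ctypes.byref(out_t), ctypes.byref(in_t), None)
print(calls)
```

With a real build, fake_fmoe would be replaced by ctypes.CDLL("module_moe_fmoe_asm.so").<symbol> with the same argtypes, which is the shape of declaration _ctypes_call() sets up.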

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add missing #include <memory> for std::unique_ptr/make_unique

After pip install -e . refreshed aiter_meta from csrc/, the indirect
include chain changed and <memory> was no longer transitively included.

* fix: address Copilot review comments

- kernelName: str -> Optional[str] for correct ctypes c_char_p conversion
- Remove extra activation param from fmoe_int8_g1u0_a16 (C ABI has no such param)
- Fix typo "supput" -> "support" in asm_fmoe.cu

* fix: add HipDeviceGuard to all C ABI MoE kernel functions

Address review comment from amd-ruitang3: the pybind->ctypes migration
removed device_guard. Now that PR #2377 has merged, use the new
HipDeviceGuard in all 8 extern "C" fmoe/moe_stage functions.

* fix: restore activation parameter in fmoe_int8_g1u0_a16 C ABI

The activation parameter was dropped during pybind-to-ctypes migration.
The original implementation uses it to select between silu/gelu config
maps. Restore it in C ABI signature, Python ctypes declaration, and
call site.

* fix: remove stale topk_softmax_asm pybind declaration from moe_op.h

This declaration was erroneously re-added during rebase conflict
resolution. PR #2327 already removed it when migrating to ctypes.

---------

Co-authored-by: root <root@hjbog-srdc-39.amd.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>