
Refactor topk softmax asm bind#2327

Merged
amd-ruitang3 merged 6 commits into main from refactor_topk_softmax_asm_bind
Mar 23, 2026

Conversation

@yzhou103
Contributor

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

@yzhou103 yzhou103 requested review from a team and Copilot March 18, 2026 08:50
@github-actions
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

  • ci:sglang: SGLang integration tests
  • ci:atom: ATOM benchmark (DeepSeek-R1 + GPT-OSS)
  • ci:vllm: vLLM benchmark
  • ci:all: All of the above

Add labels via the sidebar or gh pr edit 2327 --add-label <label>

Contributor

Copilot AI left a comment


Pull request overview

This PR refactors several ASM-backed ops to use a torch-free C ABI (extern "C") plus a Python ctypes call path (via a new AiterTensor struct), reducing reliance on PyTorch C++/pybind for these kernels.

Changes:

  • Introduces AiterTensor/AiterDtype plumbing and a compile_ops(..., ffi_type="ctypes") dispatch path that loads and calls .so symbols via ctypes.
  • Refactors ASM implementations (topk-softmax, layernorm, GEMM a16w16) to accept AiterTensor* and an explicit hipStream_t.
  • Adjusts build config and Python wrappers to use the new ctypes path; adds contiguity gating for the ASM topk-softmax callsites/tests.
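The AiterTensor/ctypes plumbing described above can be sketched in miniature. This is an illustrative mock, not the PR's actual layout: the field names (data_ptr, ndim, dtype, sizes, strides), MAX_DIMS, and the helper pack_tensor are assumptions; the real definitions live in csrc/include/aiter_tensor.h and aiter/utility/aiter_types.py.

```python
import ctypes

MAX_DIMS = 8  # assumed cap; the real header defines its own limit

class AiterTensor(ctypes.Structure):
    """Hypothetical mirror of csrc/include/aiter_tensor.h."""
    _fields_ = [
        ("data_ptr", ctypes.c_void_p),
        ("ndim", ctypes.c_int32),
        ("dtype", ctypes.c_int32),            # AiterDtype enum value
        ("sizes", ctypes.c_int64 * MAX_DIMS),
        ("strides", ctypes.c_int64 * MAX_DIMS),
    ]

def pack_tensor(ptr, sizes, strides, dtype_id):
    """Fill an AiterTensor from raw metadata, roughly what the
    ctypes dispatch path would extract from a torch.Tensor."""
    t = AiterTensor()
    t.data_ptr = ptr
    t.ndim = len(sizes)
    t.dtype = dtype_id
    for i, (sz, st) in enumerate(zip(sizes, strides)):
        t.sizes[i] = sz
        t.strides[i] = st
    return t

# A kernel symbol would then be invoked roughly as:
#   lib = ctypes.CDLL("<module>.so")
#   lib.topk_softmax_asm(ctypes.byref(out_t), ..., ctypes.c_void_p(stream))
buf = (ctypes.c_float * 8)()          # stand-in for device memory
t = pack_tensor(ctypes.addressof(buf), [2, 4], [4, 1], dtype_id=0)
print(t.ndim, list(t.sizes[: t.ndim]))  # prints: 2 [2, 4]
```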

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 4 comments.

Summary per file:
op_tests/test_moeTopkSoftmax.py Makes ASM path contiguous; minor cleanup in allclose helper
csrc/py_itfs_cu/asm_topksoftmax.cu Switches to C ABI + AiterTensor* args and explicit stream/device handling
csrc/py_itfs_cu/asm_layernorm.cu Switches to C ABI + AiterTensor* args and explicit stream/device handling
csrc/py_itfs_cu/asm_gemm_a16w16.cu Switches to C ABI + AiterTensor* args; removes torch types; explicit device handling
csrc/include/rocm_ops.hpp Removes some pybind bindings; updates macros (notably quant bindings)
csrc/include/norm.h Formatting + removes ASM layernorm declarations from this torch header
csrc/include/moe_op.h Removes torch-signature declaration for topk_softmax_asm
csrc/include/asm_gemm_a16w16.h Deletes old torch-signature header for GEMM ASM
csrc/include/aiter_tensor.h Adds new C struct used for ctypes FFI tensor metadata
csrc/include/aiter_hip_common.h Adds AITER_CHECK and includes for new enum/tensor headers
csrc/include/aiter_enum.h Adds AiterDtype enum and helpers
aiter/utility/aiter_types.py Adds ctypes definitions + header parsing for dtype IDs
aiter/utility/dtypes.py Adds torch_to_aiter() conversion and dtype maps for ctypes path
aiter/ops/norm.py Points ASM layernorm wrappers to a new module + ctypes FFI
aiter/ops/moe_op.py Switches topk_softmax_asm wrapper to ctypes FFI
aiter/ops/gemm_op_a16w16.py Switches GEMM ASM wrapper to ctypes FFI and returns out explicitly
aiter/jit/optCompilerConfig.json Updates build modules (drops some pybind sources; adds module_asm_layernorm)
aiter/jit/core.py Adds _ctypes_call() and ffi_type option to compile_ops
aiter/fused_moe.py Adds gating_output.is_contiguous() requirement for ASM topk-softmax fast path
Comments suppressed due to low confidence (2)

aiter/ops/moe_op.py:32

  • topk_softmax_asm is switched to ffi_type="ctypes", but it still targets module_moe_asm, whose build config includes torch/pybind sources (e.g., pybind/moe_op_pybind.cu). _ctypes_call() forces torch_exclude=True when it needs to build the module, so a first-time call to topk_softmax_asm (without the pybind module already built) is likely to fail to link/compile. Suggested fix: create a dedicated torch-free module (e.g., module_asm_topksoftmax) that only compiles py_itfs_cu/asm_topksoftmax.cu (+ any needed deps) and point this decorator at it; alternatively, avoid setting torch_exclude=True for modules that still depend on torch/pybind.
@compile_ops("module_moe_asm", fc_name="topk_softmax_asm", ffi_type="ctypes")
def topk_softmax_asm(
    topk_weights: Tensor,
    topk_indices: Tensor,
    token_expert_indices: Tensor,
    gating_output: Tensor,
    need_renorm: bool,
) -> None: ...

csrc/include/rocm_ops.hpp:1459

  • QUANT_PYBIND no longer binds moe_smooth_per_token_scaled_quant_v1/v2, but Python still calls these via @compile_ops("module_quant") (see aiter/ops/quant.py). This will cause runtime AttributeError when the compiled module_quant is imported and getattr(module, ...) is attempted. Either restore these m.def(...) bindings in QUANT_PYBIND or update the Python API to stop referencing these functions.
#define QUANT_PYBIND                                                     \
    m.def("static_per_tensor_quant", &aiter::static_per_tensor_quant);   \
    m.def("dynamic_per_tensor_quant", &aiter::dynamic_per_tensor_quant); \
    m.def("dynamic_per_token_scaled_quant",                              \
          &aiter::dynamic_per_token_scaled_quant,                        \
          py::arg("out"),                                                \
          py::arg("input"),                                              \
          py::arg("scales"),                                             \
          py::arg("scale_ub")        = std::nullopt,                     \
          py::arg("shuffle_scale")   = false,                            \
          py::arg("num_rows")        = std::nullopt,                     \
          py::arg("num_rows_factor") = 1);                               \
    m.def("dynamic_per_group_scaled_quant_fp4",                          \
          &aiter::dynamic_per_group_scaled_quant_fp4,                    \
          py::arg("out"),                                                \
          py::arg("input"),                                              \
          py::arg("scales"),                                             \
          py::arg("group_size")      = 32,                               \
          py::arg("shuffle_scale")   = true,                             \
          py::arg("num_rows")        = std::nullopt,                     \
          py::arg("num_rows_factor") = 1);                               \
    m.def("smooth_per_token_scaled_quant",                               \
          &aiter::smooth_per_token_scaled_quant,                         \
          py::arg("out"),                                                \
          py::arg("input"),                                              \
          py::arg("scales"),                                             \
          py::arg("smooth_scale"),                                       \
          py::arg("smooth_scale_map")      = std::nullopt,               \
          py::arg("shuffle_scale")         = false,                      \
          py::arg("num_rows")              = std::nullopt,               \
          py::arg("num_rows_factor")       = 1,                          \
          py::arg("smooth_scale_map_hash") = std::nullopt,               \
          py::arg("enable_ps")             = true);                                  \
    m.def("partial_transpose",                                           \
          &aiter::partial_transpose,                                     \
          py::arg("out"),                                                \
          py::arg("input"),                                              \
          py::arg("num_rows"));
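The failure mode flagged above can be reproduced in miniature without compiling anything: the compile_ops path resolves functions via getattr on the loaded module, so a binding dropped from QUANT_PYBIND only surfaces as an AttributeError at lookup time. A sketch (the module contents here are stand-ins, not the real compiled extension):

```python
import types

# Stand-in for the compiled module_quant extension: it still exports
# the bindings that QUANT_PYBIND kept, but the v1/v2 variants are gone.
module_quant = types.SimpleNamespace(
    static_per_tensor_quant=lambda *a: None,
    dynamic_per_tensor_quant=lambda *a: None,
    # moe_smooth_per_token_scaled_quant_v1/v2 no longer bound
)

try:
    getattr(module_quant, "moe_smooth_per_token_scaled_quant_v1")
except AttributeError as e:
    err = str(e)
print(err)
```

The lookup succeeds at import but fails the first time the removed name is requested, which is why the review suggests either restoring the m.def(...) entries or dropping the Python references.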


Comment threads: op_tests/test_moeTopkSoftmax.py, csrc/include/aiter_tensor.h, csrc/include/aiter_enum.h (outdated), aiter/utility/dtypes.py
@amd-ruitang3 amd-ruitang3 merged commit 378efe2 into main Mar 23, 2026
32 checks passed
@amd-ruitang3 amd-ruitang3 deleted the refactor_topk_softmax_asm_bind branch March 23, 2026 02:52
zufayu pushed a commit that referenced this pull request Mar 23, 2026
This declaration was erroneously re-added during rebase conflict
resolution. PR #2327 already removed it when migrating to ctypes.
zufayu added a commit that referenced this pull request Mar 23, 2026
* Migrate MoE ASM kernels from pybind to C ABI + ctypes

Convert fmoe, fmoe_int8_g1u0, fmoe_g1u1, fmoe_g1u1_tkw1,
fmoe_int8_g1u0_a16, fmoe_g1u1_a16, fmoe_fp8_blockscale_g1u1,
and moe_stage1_g1u1 from torch::Tensor& (pybind11) to
AiterTensor* + hipStream_t (C ABI called via ctypes).

- asm_fmoe.cu: Remove torch/ATen includes, use AiterTensor*,
  AITER_DTYPE_*, AITER_CHECK; template <int I_elemSize, int O_elemSize>
- asm_moe_2stage.cu: Same conversion for moe_stage1_g1u1
- moe_op.h: Remove fmoe pybind declarations (now extern "C")
- rocm_ops.hpp: Remove fmoe entries from MOE_OP_PYBIND macro
- moe_op.py: Use ffi_type="ctypes" with new module_moe_fmoe_asm
- optCompilerConfig.json: Split ctypes sources into module_moe_fmoe_asm
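The AiterTensor* + hipStream_t calling convention this commit describes can be sketched without ROCm by standing in a Python callback for the extern "C" symbol. Everything below is illustrative: the struct fields and the two-tensor signature are assumptions, and the real fmoe symbols in module_moe_fmoe_asm take many more AiterTensor* arguments.

```python
import ctypes

class AiterTensor(ctypes.Structure):
    """Illustrative subset of the FFI tensor-metadata struct."""
    _fields_ = [("data_ptr", ctypes.c_void_p),
                ("ndim", ctypes.c_int32),
                ("dtype", ctypes.c_int32)]

# Prototype of a hypothetical C ABI kernel entry point:
#   void kernel(AiterTensor* out, AiterTensor* in, hipStream_t stream)
FMOE_SIG = ctypes.CFUNCTYPE(None,
                            ctypes.POINTER(AiterTensor),  # out
                            ctypes.POINTER(AiterTensor),  # input
                            ctypes.c_void_p)              # hipStream_t

calls = []

@FMOE_SIG
def fake_fmoe(out_p, in_p, stream):
    # Records what the C side would see: struct fields plus the raw stream.
    calls.append((out_p.contents.ndim, stream))

out_t = AiterTensor(None, 2, 0)
in_t = AiterTensor(None, 2, 0)
# The Python wrapper passes structs by reference and the HIP stream as void*.
fake_fmoe(ctypes.byref(out_t), ctypes.byref(in_t), None)
print(calls)
```

With a real build, fake_fmoe would be replaced by ctypes.CDLL("module_moe_fmoe_asm.so").<symbol> with the same argtypes, which is the shape of declaration _ctypes_call() sets up.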

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add missing #include <memory> for std::unique_ptr/make_unique

After pip install -e . refreshed aiter_meta from csrc/, the indirect
include chain changed and <memory> was no longer transitively included.

* fix: address Copilot review comments

- kernelName: str -> Optional[str] for correct ctypes c_char_p conversion
- Remove extra activation param from fmoe_int8_g1u0_a16 (C ABI has no such param)
- Fix typo "supput" -> "support" in asm_fmoe.cu

* fix: add HipDeviceGuard to all C ABI MoE kernel functions

Address review comment from amd-ruitang3: the pybind->ctypes migration
removed device_guard. Now that PR #2377 has merged, use the new
HipDeviceGuard in all 8 extern "C" fmoe/moe_stage functions.

* fix: restore activation parameter in fmoe_int8_g1u0_a16 C ABI

The activation parameter was dropped during pybind-to-ctypes migration.
The original implementation uses it to select between silu/gelu config
maps. Restore it in C ABI signature, Python ctypes declaration, and
call site.

* fix: remove stale topk_softmax_asm pybind declaration from moe_op.h

This declaration was erroneously re-added during rebase conflict
resolution. PR #2327 already removed it when migrating to ctypes.

---------

Co-authored-by: root <root@hjbog-srdc-39.amd.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>