Merged
Pull request overview
This PR refactors several ASM-backed ops to use a torch-free C ABI (extern "C") plus a Python ctypes call path (via a new AiterTensor struct), reducing reliance on PyTorch C++/pybind for these kernels.
Changes:
- Introduces `AiterTensor`/`AiterDtype` plumbing and a `compile_ops(..., ffi_type="ctypes")` dispatch path that loads and calls `.so` symbols via `ctypes`.
- Refactors ASM implementations (topk-softmax, layernorm, GEMM a16w16) to accept `AiterTensor*` and an explicit `hipStream_t`.
- Adjusts build config and Python wrappers to use the new ctypes path; adds contiguity gating for the ASM topk-softmax callsites/tests.
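To make the new call path concrete, here is a minimal Python-side sketch of a ctypes tensor struct and a `.so` invocation. The actual field layout of `AiterTensor` lives in `csrc/include/aiter_tensor.h`; the field names, order, and fixed rank below are illustrative assumptions, not the real definition.

```python
import ctypes

# Hypothetical field layout -- the real AiterTensor struct is defined in
# csrc/include/aiter_tensor.h; names, order, and max rank here are
# illustrative only.
class AiterTensor(ctypes.Structure):
    _fields_ = [
        ("data_ptr", ctypes.c_void_p),    # device pointer to tensor storage
        ("dtype", ctypes.c_int),          # AiterDtype enum value
        ("ndim", ctypes.c_int),           # number of dimensions
        ("sizes", ctypes.c_int64 * 8),    # shape, padded to a fixed rank
        ("strides", ctypes.c_int64 * 8),  # strides in elements
    ]

def call_asm_op(lib_path, symbol, *tensors, stream=None):
    """Load a compiled .so and invoke an extern "C" symbol via ctypes.

    Each AiterTensor is passed by pointer; the HIP stream is passed as a
    raw void pointer, matching an explicit hipStream_t parameter.
    """
    lib = ctypes.CDLL(lib_path)
    fn = getattr(lib, symbol)
    args = [ctypes.byref(t) for t in tensors]
    args.append(ctypes.c_void_p(stream))
    fn(*args)
```

The real dispatch in `aiter/jit/core.py` (`_ctypes_call()`) additionally handles building the module on first use and converting torch tensors into this struct.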
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| op_tests/test_moeTopkSoftmax.py | Makes ASM path contiguous; minor cleanup in allclose helper |
| csrc/py_itfs_cu/asm_topksoftmax.cu | Switches to C ABI + AiterTensor* args and explicit stream/device handling |
| csrc/py_itfs_cu/asm_layernorm.cu | Switches to C ABI + AiterTensor* args and explicit stream/device handling |
| csrc/py_itfs_cu/asm_gemm_a16w16.cu | Switches to C ABI + AiterTensor* args; removes torch types; explicit device handling |
| csrc/include/rocm_ops.hpp | Removes some pybind bindings; updates macros (notably quant bindings) |
| csrc/include/norm.h | Formatting + removes ASM layernorm declarations from this torch header |
| csrc/include/moe_op.h | Removes torch-signature declaration for topk_softmax_asm |
| csrc/include/asm_gemm_a16w16.h | Deletes old torch-signature header for GEMM ASM |
| csrc/include/aiter_tensor.h | Adds new C struct used for ctypes FFI tensor metadata |
| csrc/include/aiter_hip_common.h | Adds AITER_CHECK and includes for new enum/tensor headers |
| csrc/include/aiter_enum.h | Adds AiterDtype enum and helpers |
| aiter/utility/aiter_types.py | Adds ctypes definitions + header parsing for dtype IDs |
| aiter/utility/dtypes.py | Adds torch_to_aiter() conversion and dtype maps for ctypes path |
| aiter/ops/norm.py | Points ASM layernorm wrappers to a new module + ctypes FFI |
| aiter/ops/moe_op.py | Switches topk_softmax_asm wrapper to ctypes FFI |
| aiter/ops/gemm_op_a16w16.py | Switches GEMM ASM wrapper to ctypes FFI and returns out explicitly |
| aiter/jit/optCompilerConfig.json | Updates build modules (drops some pybind sources; adds module_asm_layernorm) |
| aiter/jit/core.py | Adds _ctypes_call() and ffi_type option to compile_ops |
| aiter/fused_moe.py | Adds gating_output.is_contiguous() requirement for ASM topk-softmax fast path |
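The `aiter/utility/aiter_types.py` change mentioned above parses dtype IDs out of a C header so Python and C agree on the `AiterDtype` enum. A self-contained sketch of that idea, with made-up enum names and header text (the real names live in `csrc/include/aiter_enum.h`):

```python
import re

# Illustrative header text only -- the real enum is in csrc/include/aiter_enum.h.
HEADER = """
enum AiterDtype {
    AITER_DTYPE_FP32 = 0,
    AITER_DTYPE_FP16 = 1,
    AITER_DTYPE_BF16 = 2,
    AITER_DTYPE_I8   = 3,
};
"""

def parse_dtype_ids(header_text):
    """Return {enum_name: int_id} for every AITER_DTYPE_* entry found."""
    pattern = re.compile(r"(AITER_DTYPE_\w+)\s*=\s*(\d+)")
    return {name: int(val) for name, val in pattern.findall(header_text)}
```

Parsing the header rather than hard-coding IDs keeps the Python side from silently drifting when the C enum changes.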
Comments suppressed due to low confidence (2)
aiter/ops/moe_op.py:32
`topk_softmax_asm` is switched to `ffi_type="ctypes"`, but it still targets `module_moe_asm`, whose build config includes torch/pybind sources (e.g., `pybind/moe_op_pybind.cu`). `_ctypes_call()` forces `torch_exclude=True` when it needs to build the module, so a first-time call to `topk_softmax_asm` (without the pybind module already built) is likely to fail to link/compile. Suggested fix: create a dedicated torch-free module (e.g., `module_asm_topksoftmax`) that only compiles `py_itfs_cu/asm_topksoftmax.cu` (+ any needed deps) and point this decorator at it; alternatively, avoid setting `torch_exclude=True` for modules that still depend on torch/pybind.
@compile_ops("module_moe_asm", fc_name="topk_softmax_asm", ffi_type="ctypes")
def topk_softmax_asm(
topk_weights: Tensor,
topk_indices: Tensor,
token_expert_indices: Tensor,
gating_output: Tensor,
need_renorm: bool,
) -> None: ...
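The dedicated torch-free module the comment suggests would be a new entry in `aiter/jit/optCompilerConfig.json`. The fragment below is only a sketch of the shape such an entry might take; the actual schema keys used by that file are not shown in this PR, so treat the key names as assumptions.

```json
{
  "module_asm_topksoftmax": {
    "srcs": ["csrc/py_itfs_cu/asm_topksoftmax.cu"],
    "torch_exclude": true
  }
}
```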
csrc/include/rocm_ops.hpp:1459
`QUANT_PYBIND` no longer binds `moe_smooth_per_token_scaled_quant_v1`/`v2`, but Python still calls these via `@compile_ops("module_quant")` (see `aiter/ops/quant.py`). This will cause a runtime `AttributeError` when the compiled `module_quant` is imported and `getattr(module, ...)` is attempted. Either restore these `m.def(...)` bindings in `QUANT_PYBIND` or update the Python API to stop referencing these functions.
#define QUANT_PYBIND \
m.def("static_per_tensor_quant", &aiter::static_per_tensor_quant); \
m.def("dynamic_per_tensor_quant", &aiter::dynamic_per_tensor_quant); \
m.def("dynamic_per_token_scaled_quant", \
&aiter::dynamic_per_token_scaled_quant, \
py::arg("out"), \
py::arg("input"), \
py::arg("scales"), \
py::arg("scale_ub") = std::nullopt, \
py::arg("shuffle_scale") = false, \
py::arg("num_rows") = std::nullopt, \
py::arg("num_rows_factor") = 1); \
m.def("dynamic_per_group_scaled_quant_fp4", \
&aiter::dynamic_per_group_scaled_quant_fp4, \
py::arg("out"), \
py::arg("input"), \
py::arg("scales"), \
py::arg("group_size") = 32, \
py::arg("shuffle_scale") = true, \
py::arg("num_rows") = std::nullopt, \
py::arg("num_rows_factor") = 1); \
m.def("smooth_per_token_scaled_quant", \
&aiter::smooth_per_token_scaled_quant, \
py::arg("out"), \
py::arg("input"), \
py::arg("scales"), \
py::arg("smooth_scale"), \
py::arg("smooth_scale_map") = std::nullopt, \
py::arg("shuffle_scale") = false, \
py::arg("num_rows") = std::nullopt, \
py::arg("num_rows_factor") = 1, \
py::arg("smooth_scale_map_hash") = std::nullopt, \
py::arg("enable_ps") = true); \
m.def("partial_transpose", \
&aiter::partial_transpose, \
py::arg("out"), \
py::arg("input"), \
py::arg("num_rows"));
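A minimal reproduction of the failure mode this comment describes: the Python wrapper resolves ops with `getattr(module, name)`, so any binding dropped from `QUANT_PYBIND` surfaces as a missing attribute at call time. The `SimpleNamespace` below is a stand-in for the real compiled `module_quant`.

```python
import types

# Stand-in for the compiled module_quant extension: only the ops that
# QUANT_PYBIND still binds exist as attributes.
compiled_quant_module = types.SimpleNamespace(
    static_per_tensor_quant=lambda *args: None,  # still bound
)

def lookup_op(module, name):
    """Mimic the wrapper's getattr-based resolution; None if unbound."""
    try:
        return getattr(module, name)
    except AttributeError:
        return None
```

Resolving `moe_smooth_per_token_scaled_quant_v1` against such a module raises `AttributeError`, which is exactly what the Python callers in `aiter/ops/quant.py` would hit.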
valarLip
approved these changes
Mar 23, 2026
zufayu
pushed a commit
that referenced
this pull request
Mar 23, 2026
This declaration was erroneously re-added during rebase conflict resolution. PR #2327 already removed it when migrating to ctypes.
zufayu
added a commit
that referenced
this pull request
Mar 23, 2026
* Migrate MoE ASM kernels from pybind to C ABI + ctypes

  Convert fmoe, fmoe_int8_g1u0, fmoe_g1u1, fmoe_g1u1_tkw1, fmoe_int8_g1u0_a16, fmoe_g1u1_a16, fmoe_fp8_blockscale_g1u1, and moe_stage1_g1u1 from torch::Tensor& (pybind11) to AiterTensor* + hipStream_t (C ABI called via ctypes).

  - asm_fmoe.cu: Remove torch/ATen includes; use AiterTensor*, AITER_DTYPE_*, AITER_CHECK; template <int I_elemSize, int O_elemSize>
  - asm_moe_2stage.cu: Same conversion for moe_stage1_g1u1
  - moe_op.h: Remove fmoe pybind declarations (now extern "C")
  - rocm_ops.hpp: Remove fmoe entries from MOE_OP_PYBIND macro
  - moe_op.py: Use ffi_type="ctypes" with new module_moe_fmoe_asm
  - optCompilerConfig.json: Split ctypes sources into module_moe_fmoe_asm

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add missing #include <memory> for std::unique_ptr/make_unique

  After pip install -e . refreshed aiter_meta from csrc/, the indirect include chain changed and <memory> was no longer transitively included.

* fix: address Copilot review comments

  - kernelName: str -> Optional[str] for correct ctypes c_char_p conversion
  - Remove extra activation param from fmoe_int8_g1u0_a16 (C ABI has no such param)
  - Fix typo "supput" -> "support" in asm_fmoe.cu

* fix: add HipDeviceGuard to all C ABI MoE kernel functions

  Address review comment from amd-ruitang3: the pybind->ctypes migration removed device_guard. Now that PR #2377 has merged, use the new HipDeviceGuard in all 8 extern "C" fmoe/moe_stage functions.

* fix: restore activation parameter in fmoe_int8_g1u0_a16 C ABI

  The activation parameter was dropped during the pybind-to-ctypes migration. The original implementation uses it to select between silu/gelu config maps. Restore it in the C ABI signature, the Python ctypes declaration, and the call site.

* fix: remove stale topk_softmax_asm pybind declaration from moe_op.h

  This declaration was erroneously re-added during rebase conflict resolution. PR #2327 already removed it when migrating to ctypes.

---------

Co-authored-by: root <root@hjbog-srdc-39.amd.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>