[feat] Add support for Qwen3.5 and Qwen3-Next to ATOM-plugined SGLang #532
wanzhenchn wants to merge 8 commits into ROCm:main
Conversation
I have turned it to draft and will let it go through CI after we finish the code review. @ganyi1996ppo @Yuechguo please help review this PR.
```diff
 fwd_ctx: ForwardContext = get_forward_context()
-gdn_metadata: GDNAttentionMetadata = fwd_ctx.attn_metadata.gdn_metadata
+gdn_metadata, conv_state, ssm_state = self._resolve_runtime_state(fwd_ctx)
```
Why abstract this part? The former version looks more straightforward, and I don't see any reuse of it.
- SGLang exposes the mamba/GDN conv cache as a row-major tensor shaped [slot, D, W].
- ATOM's causal_conv1d_* kernels consume the same logical shape, but they require the feature dimension D to be contiguous in memory (stride(-2) == 1).

So we need a tensor-layout transformation for the conv state. It is called in atom/plugin/sglang/attention_backend/attention_gdn.py.
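The layout fix described above can be sketched as follows. This is an illustrative snippet, not the code from the PR: the function name `to_atom_conv_layout` and the shapes are stand-ins, but the invariant (`stride(-2) == 1` on a `[slot, D, W]` tensor) is the one stated in this thread.

```python
import torch

def to_atom_conv_layout(conv_state: torch.Tensor) -> torch.Tensor:
    """Return a [slot, D, W] tensor whose feature dim D has unit stride.

    SGLang stores the conv cache row-major (W contiguous); ATOM's
    causal_conv1d_* kernels expect stride(-2) == 1 (D contiguous).
    Name and helper are hypothetical; only the layout contract is from the PR.
    """
    if conv_state.stride(-2) == 1:
        return conv_state  # already in the layout the kernels expect
    # Transpose to [slot, W, D], materialize contiguously, transpose back:
    # the result still indexes as [slot, D, W] but D now has unit stride.
    return conv_state.transpose(-1, -2).contiguous().transpose(-1, -2)

state = torch.zeros(4, 8, 3)      # [slot, D, W], W-contiguous as SGLang stores it
fixed = to_atom_conv_layout(state)
print(fixed.stride(-2))           # 1 -> D is now contiguous
```

Doing the transform once, right after the cache is allocated, avoids a per-step copy inside the attention backend.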
```python
class GatedDeltaNet(AtomGatedDeltaNet):
    """SGLang adapter over the shared ATOM GDN implementation."""

    def _resolve_runtime_state(
```
This seems like a redundant definition?
No, it is used to perform the conv-state tensor layout transformation after the GatedDeltaNet is initialized.
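To make the shape of this helper concrete, here is a minimal sketch of what `_resolve_runtime_state` unpacks, based only on the names visible in the diff (`ForwardContext`, `attn_metadata`, `gdn_metadata`, `conv_state`, `ssm_state`). The dataclasses below are hypothetical stand-ins for the real SGLang/ATOM types, and the real method also applies the conv-state layout fix discussed above.

```python
from dataclasses import dataclass
from typing import Any, Tuple

# Hypothetical stand-ins for the real SGLang/ATOM types referenced in the diff.
@dataclass
class GDNAttentionMetadata:
    ...

@dataclass
class AttnMetadata:
    gdn_metadata: GDNAttentionMetadata

@dataclass
class ForwardContext:
    attn_metadata: AttnMetadata
    conv_state: Any
    ssm_state: Any

def _resolve_runtime_state(
    fwd_ctx: ForwardContext,
) -> Tuple[GDNAttentionMetadata, Any, Any]:
    """Unpack GDN metadata and the per-layer cache tensors from the context.

    The real implementation additionally ensures the conv cache is in the
    D-contiguous layout ATOM's kernels require before returning it.
    """
    md = fwd_ctx.attn_metadata.gdn_metadata
    return md, fwd_ctx.conv_state, fwd_ctx.ssm_state

ctx = ForwardContext(AttnMetadata(GDNAttentionMetadata()), conv_state="conv", ssm_state="ssm")
md, conv, ssm = _resolve_runtime_state(ctx)
print(conv, ssm)
```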
```python
from atom.utils.decorators import TorchCompileWrapperWithCustomDispatcher


class Qwen3NextSglangAttention(_CoreQwen3NextAttention):
```
I think you can trust the modeling we already have in ATOM for the execution part. If anything varies from what we already have, you can just add an if branch or abstract it into a function.
We shouldn't add this sglang/models/ folder; you can patch qwen3_5.py or qwen3_next.py as needed.
Motivation
Background: ROCm/ATOM#355 and ROCm/ATOM#359.
PR #355 integrated ATOM with upstream SGLang through the SGLANG_EXTERNAL_MODEL_PACKAGE out-of-tree mechanism, replacing a fork-based workflow and establishing atom.plugin.sglang.models as the external entry package for ATOM-backed architectures.
Building on that foundation, this PR extends the SGLang plugin path so that two major ATOM model families, Qwen3-Next (Qwen3NextForCausalLM) and Qwen3.5 (Qwen3_5ForConditionalGeneration / Qwen3_5MoeForConditionalGeneration), can run as first-class external models inside SGLang. The goal is parity with prior ATOM-in-SGLang accuracy while improving end-to-end inference performance on the supported paths (e.g. ATOM's fused kernels, quantization, and MLA / MoE handling tuned for ROCm), without requiring a patched SGLang tree: users continue to point SGLANG_EXTERNAL_MODEL_PACKAGE at atom.plugin.sglang.models and launch with the standard upstream sglang.launch_server.

Technical Details
Qwen3-Next

Qwen3NextForCausalLM is registered under atom.plugin.sglang.models and subclasses _AtomCausalLMBaseForSglang, reusing the same SGLang-facing contract as other OOT entry points: the wrapper calls prepare_model(..., engine="sglang") to build the ATOM weight stack, runs the language-model forward with pipeline-parallel state mapped from pp_proxy_tensors, applies LogitsProcessor on the last PP rank, and loads weights via load_model_in_plugin_mode. Qwen3NextSglangModel plus sglang_gdn_bridge ensure GDN layers see the SGLang forward_batch context they expect. At prepare time, apply_qwen3_next_sglang_model_patch swaps atom.models.qwen3_next.Qwen3NextModel to that bridged implementation; the shared prepare hook defaults ATOM_SGLANG_USE_NATIVE_AITER_ATTN_BACKEND for Qwen3NextForCausalLM before register_ops_to_sglang.

Qwen3.5
How to Run
The following models are supported:
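Per the Motivation above, launching goes through upstream SGLang with the external model package pointed at ATOM. A minimal sketch; the model path and tensor-parallel size are illustrative placeholders, not values from this PR:

```shell
# Point SGLang at ATOM's external model package (mechanism from ROCm/ATOM#355).
export SGLANG_EXTERNAL_MODEL_PACKAGE=atom.plugin.sglang.models

# Launch with the standard upstream entry point; --model-path and --tp here
# are placeholder values for illustration only.
python -m sglang.launch_server --model-path Qwen/Qwen3-Next-80B-A3B-Instruct --tp 8
```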
Accuracy
Qwen3.5-397B-A17B-FP8
Qwen3-Next-80B-A3B-Instruct
Inference Perf.