
[Feat][Plugin] Enable spec decoding for vLLM Plugin#557

Draft
whx-sjtu wants to merge 7 commits into main from whx-sjtu/atom-support-vllm-glm5-mtp

Conversation

@whx-sjtu
Contributor

@whx-sjtu whx-sjtu commented Apr 14, 2026

Motivation

This PR enables the spec decode feature for running GLM5 with vLLM + atom.

Technical Details

  1. Fix atom_config-related bugs.
  2. Fix the wrong full_cls_name of the different MLA sparse attention backends.
  3. Register the model architecture and model class for GLM5 MTP.
  4. Add an index_buffer for DeepseekMTP.
  5. Adapt the full graph of the main model with MTP enabled.
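For reference, a launch command along these lines should exercise the MTP path. This is only a sketch: the exact flag name, JSON fields, and `"method"` value vary across vLLM versions, so check `vllm serve --help` for the version being used.

```shell
# Hypothetical serving command for the accuracy runs below.
# The speculative-config schema is version-dependent; adjust as needed.
vllm serve /home/models/GLM-5.1-FP8 \
    --port 8000 \
    --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'
```

With `num_speculative_tokens: 3` the draft model proposes three tokens per step, matching the `mtp=3` setting reported in the results.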

Test Plan

Coming soon.

Test Result

  1. zai-org/GLM-5.1-FP8

Accuracy test commands:

lm_eval --model local-completions \
        --model_args model=/home/models/GLM-5.1-FP8,base_url=http://localhost:8000/v1/completions,num_concurrent=64,max_retries=3 \
        --tasks gsm8k \
        --num_fewshot 20

Accuracy test result with mtp=3:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|    20|exact_match|↑  |0.9454|±  |0.0063|
|     |       |strict-match    |    20|exact_match|↑  |0.9462|±  |0.0062|
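As a quick sanity check, the reported standard errors are consistent with the binomial stderr over gsm8k's 1319-question test split (the sample size is an assumption about what lm_eval ran on; its exact estimator may differ):

```python
import math

def binomial_stderr(p: float, n: int) -> float:
    """Standard error of a proportion p estimated from n samples."""
    return math.sqrt(p * (1.0 - p) / n)

# 1319 is the size of the gsm8k test split (assumed here).
print(round(binomial_stderr(0.9454, 1319), 4))  # 0.0063, matching the table
```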
  2. deepseek-ai/DeepSeek-R1-0528

Accuracy test commands:

lm_eval --model local-completions \
        --model_args model=/home/models/DeepSeek-R1-0528,base_url=http://localhost:8000/v1/completions,num_concurrent=16,max_retries=3,tokenized_requests=False \
        --tasks gsm8k \
        --num_fewshot 3

Accuracy test result with mtp=3:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     3|exact_match|↑  |0.9492|±  |0.0060|
|     |       |strict-match    |     3|exact_match|↑  |0.9469|±  |0.0062|

Submission Checklist

@whx-sjtu whx-sjtu marked this pull request as ready for review April 14, 2026 14:52
@whx-sjtu whx-sjtu changed the title [Feat][Plugin] Enable spec decoding for GLM5 in atom (vLLM Plugin) [Feat][Plugin] Enable spec decoding for GLM5 (vLLM Plugin) Apr 14, 2026
Signed-off-by: whx-sjtu <xiaowang990929@gmail.com>
@whx-sjtu whx-sjtu force-pushed the whx-sjtu/atom-support-vllm-glm5-mtp branch from 17446a6 to 9015568 Compare April 15, 2026 03:37
@wuhuikx
Collaborator

wuhuikx commented Apr 15, 2026

Could you please help attach the accuracy test results on gsm8k? Do we support MTP=1 or MTP=1/2/3? How about the acceptance ratio?

@wuhuikx wuhuikx marked this pull request as draft April 15, 2026 09:16
@wuhuikx
Collaborator

wuhuikx commented Apr 15, 2026

I will turn this PR to draft and go through CI after the code review is done.

@whx-sjtu
Contributor Author

Could you please help attach the accuracy test results on gsm8k? Do we support MTP=1 or MTP=1/2/3? How about the acceptance ratio?

Sure, I will attach the accuracy results later. Currently we support MTP=1/2/3, but the acceptance rate is low (about 20% for the first draft token and 0 for the others), and I'm working on it.
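For context, those acceptance rates translate directly into tokens emitted per decoding step. A minimal sketch, using the rough per-position figures from this thread (not measured values):

```python
def expected_tokens_per_step(accept_probs):
    """Expected tokens per step: 1 from the target model, plus each draft
    token weighted by the chance that it and all earlier drafts were accepted."""
    expected = 1.0   # the target model always contributes one token
    cumulative = 1.0
    for p in accept_probs:
        cumulative *= p
        expected += cumulative
    return expected

# ~20% acceptance for the first draft token, 0 for the rest (from the thread)
print(expected_tokens_per_step([0.2, 0.0, 0.0]))  # 1.2
```

At these rates MTP=3 yields only ~1.2 tokens per step, which is why improving acceptance beyond the first draft token matters.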

Signed-off-by: whx-sjtu <xiaowang990929@gmail.com>
@whx-sjtu whx-sjtu changed the title [Feat][Plugin] Enable spec decoding for GLM5 (vLLM Plugin) [Feat][Plugin] Enable spec decoding for vLLM Plugin Apr 17, 2026