Conversation

@zhewenl commented Feb 11, 2026

Qwen3-Omni Thinking models have a separate thinker_max_new_tokens parameter (default 1024) that is independent of max_new_tokens.
During calibration, setting max_new_tokens=1 only limits the talker; the thinker still generates up to 1024 tokens per sample, causing a ~500x slowdown that makes calibration extremely slow.
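
A minimal sketch of the failure mode and the fix, assuming the generate signature described above (full_model and calib_batch are hypothetical stand-ins for the real calibration objects):

# Before: caps only the talker. The thinker still decodes up to its own
# default of thinker_max_new_tokens=1024 per calibration sample.
result = full_model.generate(**calib_batch, max_new_tokens=1)

# After: cap both limits so each calibration sample decodes only one token
# on both paths.
result = full_model.generate(
    **calib_batch, max_new_tokens=1, thinker_max_new_tokens=1
)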

Signed-off-by: Zhewen Li <zhewenli@inferact.ai>
@copy-pr-bot bot commented Feb 11, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners. Pull request vetters can view their responsibilities here; contributors can view more details about this message here.

@coderabbitai bot commented Feb 11, 2026

Review skipped: draft detected. To trigger a single review, invoke the @coderabbitai review command.

@zhewenl changed the title from "update" to "update qwen quant" on Feb 11, 2026

# Note: thinker_max_new_tokens controls the thinker's generation limit (default 1024),
# which is separate from max_new_tokens. Cap it to avoid long waits.
result = full_model.generate(
    **calib_batch, max_new_tokens=100, thinker_max_new_tokens=100
)
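
Note that the cap here is 100 rather than 1, presumably to keep a short but non-trivial preview generation for the calibration pass while still avoiding the thinker's 1024-token default.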
zhewenl (Author) replied: updated here

result = full_model.generate(**calib_batch, max_new_tokens=100)
print("[DEBUG] pre_quantize: starting qwen3omni preview generation (max_new_tokens=100)...", flush=True)
result = full_model.generate(
    **calib_batch, max_new_tokens=100, thinker_max_new_tokens=100
)
zhewenl (Author) replied: updated here

# For Qwen3-Omni Thinking models, the thinker's token limit is controlled by
# a separate `thinker_max_new_tokens` param (default 1024), not `max_new_tokens`.
# Cap it to avoid unbounded chain-of-thought generation during calibration.
if "qwen3omni" in model.__class__.__name__.lower():
zhewenl (Author) replied: updated here
