Conversation

@zhewenl commented Feb 11, 2026

Qwen3-Omni Thinking models have a separate thinker_max_new_tokens parameter (default 1024) that is independent of max_new_tokens.
During calibration, setting max_new_tokens=1 only limits the talker; the thinker still generates up to 1024 tokens per sample, causing a ~500x slowdown that makes calibration extremely slow.
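
A minimal sketch of the failure mode and the fix, assuming the generate signature described above (full_model and calib_batch are hypothetical stand-ins for the real calibration objects):

# Before: caps only the talker. The thinker still decodes up to its own
# default of thinker_max_new_tokens=1024 per calibration sample.
result = full_model.generate(**calib_batch, max_new_tokens=1)

# After: cap both limits so each calibration sample decodes only one token
# on both paths.
result = full_model.generate(
    **calib_batch, max_new_tokens=1, thinker_max_new_tokens=1
)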

Signed-off-by: Zhewen Li <zhewenli@inferact.ai>
@copy-pr-bot bot commented Feb 11, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners. Pull request vetters can view their responsibilities here; contributors can view more details about this message here.

@coderabbitai bot commented Feb 11, 2026

Review skipped: draft detected. To trigger a single review, invoke the @coderabbitai review command.

@zhewenl changed the title from "update" to "update qwen quant" on Feb 11, 2026

# Note: thinker_max_new_tokens controls the thinker's generation limit (default 1024),
# which is separate from max_new_tokens. Cap it to avoid long waits.
result = full_model.generate(
    **calib_batch, max_new_tokens=100, thinker_max_new_tokens=100
)
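
Note that the cap here is 100 rather than 1, presumably to keep a short but non-trivial preview generation for the calibration pass while still avoiding the thinker's 1024-token default.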
zhewenl (Author) replied: updated here

result = full_model.generate(**calib_batch, max_new_tokens=100)
print("[DEBUG] pre_quantize: starting qwen3omni preview generation (max_new_tokens=100)...", flush=True)
result = full_model.generate(
    **calib_batch, max_new_tokens=100, thinker_max_new_tokens=100
)
zhewenl (Author) replied: updated here

# For Qwen3-Omni Thinking models, the thinker's token limit is controlled by
# a separate `thinker_max_new_tokens` param (default 1024), not `max_new_tokens`.
# Cap it to avoid unbounded chain-of-thought generation during calibration.
if "qwen3omni" in model.__class__.__name__.lower():
zhewenl (Author) replied: updated here
