Add support for Qwen3Omni30B thinking model #856
base: main
Conversation
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Important: Review skipped. Auto incremental reviews are disabled on this repository; please check the settings in the CodeRabbit UI.
📝 Walkthrough
The pull request adds comprehensive support for Qwen3Omni multimodal models across the PTQ quantization pipeline, introduces specialized image and text processors for handling multimodal inputs, extends the model name-to-type mapping, and adds new dataset utilities and CLI options for enhanced quantization workflow control.
Sequence Diagram(s)
sequenceDiagram
participant CLI as User/CLI
participant Main as hf_ptq.py<br/>(quantize_main)
participant Processor as Processor<br/>Selection
participant DataLoader as DataLoader<br/>Factory
participant PreQuant as pre_quantize
participant Quant as quantize_model
participant PostQuant as post_quantize
participant Export as export_quantized
CLI->>Main: Run with --model_name=qwen3omni
Main->>Processor: get_processor(model_type="qwen3omni")
Processor-->>Main: Qwen3OmniImageProcessor
Main->>DataLoader: Select dataset<br/>(video/VL/text)
alt Video Dataset
DataLoader->>DataLoader: Load video samples
DataLoader-->>Main: video_dataloader
else VL Dataset
DataLoader->>DataLoader: Load image+text samples
DataLoader-->>Main: vl_dataloader
else Text-only Dataset
DataLoader->>DataLoader: Load text samples
DataLoader-->>Main: get_qwen3omni_text_dataloader()
end
Main->>PreQuant: pre_quantize(calib_dataloader, processor)
PreQuant->>PreQuant: Process full calib_batch<br/>(not just preview)
PreQuant-->>Main: preview_input_ids,<br/>generated_ids_before_ptq,<br/>calib_batch
Main->>Quant: quantize_model(calib_batch, processor)
Quant->>Quant: Use _should_use_generate<br/>to route inference
Quant-->>Main: quantized_model
Main->>PostQuant: post_quantize(..., calib_batch)
PostQuant->>PostQuant: Post-PTQ generation<br/>with calib_batch
PostQuant-->>Main: results
Main->>Export: export_quantized(calib_batch)
Export->>Export: Sanitize generation_config
Export-->>Main: exported_checkpoint
alt --quant_summary_path provided
Main->>Main: Save summary to file
else No path
Main->>Main: Print summary to stdout
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 219-221: The current code mutates processor.dtype (processor.dtype
= language_model.dtype) which can have unintended side effects; instead avoid
changing the existing processor and either pass dtype explicitly into
get_vlm_dataset_dataloader/collate function or construct a new processor
instance with the desired dtype (e.g., create a new Qwen3OmniImageProcessor
using processor.tokenizer and language_model.dtype) and pass that to
get_vlm_dataset_dataloader so the original processor remains unchanged.
In `@modelopt/torch/export/unified_export_hf.py`:
- Around line 1001-1009: The code currently mutates model.generation_config
in-place (via gen_config and clearing temperature/top_p/top_k) which persists
after export; instead, create a non-mutating copy of model.generation_config (or
record the original values for "temperature","top_p","top_k"), apply the
sampling fixes to that copy (or to the saved serialization data) and use the
copy for writing the checkpoint (e.g., before calling save_pretrained), then
restore the original model.generation_config (or simply never mutate it) so the
caller's model is unchanged; reference symbols: model.generation_config,
gen_config, and the sampling attrs ["temperature","top_p","top_k"] when locating
where to implement the copy/restore behavior.
In `@modelopt/torch/utils/dataset_utils.py`:
- Around line 470-472: The code computes batch_size using tensor_data but
iterates keys from batch_data (batch_size =
tensor_data[next(iter(batch_data.keys()))].shape[0]), which can pick a
non-tensor key (e.g., max_new_tokens) and cause .shape access to fail; change
the key source to tensor_data so you select a tensor key (e.g., batch_size =
tensor_data[next(iter(tensor_data.keys()))].shape[0]) or explicitly find the
first value in tensor_data that is a tensor and use its shape[0]; update the
logic around the batch_size variable and references to tensor_data/batch_data to
ensure batch_size is derived from actual tensors (functions/variables to check:
tensor_data, batch_data, batch_size).
In `@modelopt/torch/utils/image_processor.py`:
- Around line 264-277: The code uses torch.tensor(...) for "pixel_values" and
"audio_features" which can change/infer dtypes inconsistently; change those to
preserve original dtype by using
torch.as_tensor(first["pixel_values"]).to(self.device) and
torch.as_tensor(first["audio_features"]).to(self.device) (keeping the existing
.to(self.device) pattern), so the dtype of the incoming data is preserved;
update the handling in image_processor.py where result["pixel_values"] and
result["audio_features"] are set (using the local variables first and result) to
use torch.as_tensor instead of torch.tensor.
🧹 Nitpick comments (5)
modelopt/torch/export/model_utils.py (1)
145-148: Minor: Update comment for consistency. The comment "Pattern 3: No language_model found" at line 148 is now misleading since a new pattern was inserted above it. Consider renumbering or removing the pattern numbering.
💡 Suggested fix
 if hasattr(model, "thinker"):
     return [model, model.thinker]
-# Pattern 3: No language_model found
+# No language_model found
 return None
examples/llm_ptq/example_utils.py (1)
284-286: Update docstring to reflect new return types. The docstring states it returns a MllamaImageProcessor object, but now it can also return Qwen3OmniImageProcessor for the new model type.
📝 Suggested fix
 def get_processor(
     ckpt_path,
     model_type,
     device: torch.device = "auto",
     trust_remote_code=False,
     attn_implementation=None,
 ) -> BaseImageProcessor | ProcessorMixin | None:
     """
-    Returns a :class:`modelopt.torch.utils.image_processor.MllamaImageProcessor` object.
+    Returns an image processor for multimodal models (MllamaImageProcessor, Qwen3OmniImageProcessor)
+    or a ProcessorMixin for models like Whisper.
     """
modelopt/torch/utils/image_processor.py (1)
115-172: Consider interface consistency for preprocess_function.
Qwen3OmniTextProcessor.preprocess_function(text: str) has a different signature than the base class and sibling classes, which use preprocess_function(examples: dict). This could cause issues if callers expect a uniform interface. Additionally, the dtype attribute is stored but never used in the processor methods.
💡 Suggested approach
Consider either:
- Aligning the signature with the base class by wrapping text in a dict:
  def preprocess_function(self, examples):
      text = examples.get("text", examples) if isinstance(examples, dict) else examples
      # ... rest of implementation
- Or documenting the intentional deviation clearly in the class docstring.
For the unused dtype, either utilize it in tensor conversion or remove it if not needed.
modelopt/torch/utils/dataset_utils.py (1)
136-146: Replace assertions with proper exceptions for public API. Using assert for input validation in a public API function (get_qwen3omni_text_dataloader) is discouraged because assertions can be disabled with the -O flag. Use ValueError or TypeError instead.
💡 Suggested fix
-    assert processor is not None, "Please provide a Qwen3OmniTextProcessor."
+    if processor is None:
+        raise ValueError("Please provide a Qwen3OmniTextProcessor.")
     if isinstance(num_samples, int):
         num_samples = [num_samples]
     if isinstance(dataset_name, str):
         dataset_name = [dataset_name]
-    assert len(dataset_name) == len(num_samples), (
-        "dataset_name and num_samples must be the same length"
-    )
+    if len(dataset_name) != len(num_samples):
+        raise ValueError("dataset_name and num_samples must be the same length")
examples/llm_ptq/hf_ptq.py (1)
756-766: Code duplication in Qwen3Omni generation handling. The logic for handling Qwen3Omni's tuple return from generate() is duplicated between pre_quantize (lines 756-766) and post_quantize (lines 817-827). Consider extracting it into a helper function.
♻️ Suggested refactor
def _extract_qwen3omni_generated_ids(result):
    """Extract generated IDs from Qwen3Omni generate() output."""
    if isinstance(result, tuple):
        text_ids, _ = result
        return text_ids.sequences if hasattr(text_ids, "sequences") else text_ids
    return result
Then use in both places:
# In pre_quantize and post_quantize:
result = full_model.generate(**calib_batch, max_new_tokens=100)
generated_ids = _extract_qwen3omni_generated_ids(result)
Also applies to: 817-827
examples/llm_ptq/hf_ptq.py
Outdated
# Set the dtype for proper tensor conversion in collate_function
processor.dtype = language_model.dtype
calib_dataloader = get_vlm_dataset_dataloader(
Side effect: Mutating processor.dtype.
Setting processor.dtype = language_model.dtype mutates the processor object, which could affect other callers if the processor is reused elsewhere. Consider passing dtype as a parameter instead of mutating.
💡 Suggested approach
Rather than mutating the processor, consider passing dtype explicitly to the collate function or creating a new processor instance with the dtype set:
# Option 1: Create new processor with dtype
processor = Qwen3OmniImageProcessor(
processor.tokenizer,
device=device,
dtype=language_model.dtype
)
calib_dataloader = get_vlm_dataset_dataloader(...)
# Fix generation_config conflicts before saving
# Some models have temperature/top_p/top_k set but do_sample=False which causes validation errors
if hasattr(model, "generation_config") and model.generation_config is not None:
    gen_config = model.generation_config
    if not getattr(gen_config, "do_sample", True):
        # Remove sampling-related params when do_sample is False
        for attr in ["temperature", "top_p", "top_k"]:
            if hasattr(gen_config, attr):
                setattr(gen_config, attr, None)
🧩 Analysis chain
🏁 Script executed:
# First, examine the context around lines 1001-1009 to understand the function flow
sed -n '995,1020p' modelopt/torch/export/unified_export_hf.py
Repository: NVIDIA/Model-Optimizer
Length of output: 1243
🏁 Script executed:
# Check what happens after the save_pretrained call
sed -n '1001,1050p' modelopt/torch/export/unified_export_hf.py
Repository: NVIDIA/Model-Optimizer
Length of output: 1434
🏁 Script executed:
# Look for the function definition to understand the full scope
grep -n "def.*export" modelopt/torch/export/unified_export_hf.py | head -20
Repository: NVIDIA/Model-Optimizer
Length of output: 208
🌐 Web query:
Hugging Face Transformers GenerationConfig save_pretrained mutation best practice
💡 Result:
Best practice (avoid “mutation surprises”)
1) For one-off generation changes, don’t mutate model.generation_config
Prefer passing parameters directly to generate() (or pass a separate GenerationConfig). Transformers explicitly supports this: generation_config is the base, and any matching **kwargs override it for that call; if you don’t pass one, Transformers loads defaults from (1) generation_config.json if present, else (2) the model config. [1]
out = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.9)
2) For persistent defaults you want to version/control, create & save a dedicated GenerationConfig
Use GenerationConfig.from_pretrained(...) (loads generation_config.json) or GenerationConfig.from_model_config(model.config) (helpful for legacy configs), then edit and save_pretrained(). Saving writes generation_config.json by default and can be reloaded with from_pretrained(). [1]
from transformers import GenerationConfig
gen_cfg = GenerationConfig.from_pretrained("openai-community/gpt2") # or your model dir/id
gen_cfg.max_new_tokens = 256
gen_cfg.do_sample = True
gen_cfg.top_p = 0.9
gen_cfg.save_pretrained("./my_model_dir") # saves ./my_model_dir/generation_config.json
# later:
gen_cfg2 = GenerationConfig.from_pretrained("./my_model_dir")3) If you do want to mutate defaults, do it deliberately (and preferably on a copy)
Because GenerationConfig is a normal mutable object, in-place edits affect subsequent calls. A common pattern is: copy → edit → pass as generation_config to generate(), leaving model.generation_config unchanged (supported by the generation_config= argument). [1]
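A minimal sketch of that copy → edit → pass pattern (assuming `model` and `inputs` already exist in the caller's scope):

```python
import copy

# Edit a throwaway copy; model.generation_config itself stays untouched.
gen_cfg = copy.deepcopy(model.generation_config)
gen_cfg.do_sample = False
gen_cfg.temperature = None  # clear sampling params that conflict with greedy decoding
out = model.generate(**inputs, generation_config=gen_cfg)
```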
Source: Hugging Face Transformers “Generation / GenerationConfig” API docs (covers generate(generation_config=..., **kwargs), loading priority, and GenerationConfig.save_pretrained() / from_pretrained() usage). [1]
References:
[1] https://huggingface.co/docs/transformers/v4.53.2/en/main_classes/text_generation
Avoid persisting generation_config mutation beyond export.
This block mutates model.generation_config in-place; after export returns, callers retain the modified model with temperature/top_p/top_k cleared. Hugging Face best practices recommend avoiding direct mutation of generation_config—instead, preserve the original state or apply fixes only to the saved checkpoint. Consider saving the original values and restoring them after save_pretrained().
🔧 Suggested change (restore after save)
- if hasattr(model, "generation_config") and model.generation_config is not None:
- gen_config = model.generation_config
- if not getattr(gen_config, "do_sample", True):
- # Remove sampling-related params when do_sample is False
- for attr in ["temperature", "top_p", "top_k"]:
- if hasattr(gen_config, attr):
- setattr(gen_config, attr, None)
+ gen_config = getattr(model, "generation_config", None)
+ _gen_restore = None
+ if gen_config is not None and not getattr(gen_config, "do_sample", True):
+ _gen_restore = {a: getattr(gen_config, a, None) for a in ["temperature", "top_p", "top_k"]}
+ for attr in _gen_restore:
+ if hasattr(gen_config, attr):
+ setattr(gen_config, attr, None)
# Save model
- model.save_pretrained(
- export_dir, state_dict=post_state_dict, save_modelopt_state=save_modelopt_state
- )
+ try:
+ model.save_pretrained(
+ export_dir, state_dict=post_state_dict, save_modelopt_state=save_modelopt_state
+ )
+ finally:
+ if _gen_restore is not None:
+ for attr, value in _gen_restore.items():
+ if hasattr(gen_config, attr):
+ setattr(gen_config, attr, value)🤖 Prompt for AI Agents
In `@modelopt/torch/export/unified_export_hf.py` around lines 1001 - 1009, The
code currently mutates model.generation_config in-place (via gen_config and
clearing temperature/top_p/top_k) which persists after export; instead, create a
non-mutating copy of model.generation_config (or record the original values for
"temperature","top_p","top_k"), apply the sampling fixes to that copy (or to the
saved serialization data) and use the copy for writing the checkpoint (e.g.,
before calling save_pretrained), then restore the original
model.generation_config (or simply never mutate it) so the caller's model is
unchanged; reference symbols: model.generation_config, gen_config, and the
sampling attrs ["temperature","top_p","top_k"] when locating where to implement
the copy/restore behavior.
 )
 # Get the batch size of current data
-batch_size = batch_data[next(iter(batch_data.keys()))].shape[0]
+batch_size = tensor_data[next(iter(batch_data.keys()))].shape[0]
Potential bug: batch_data used instead of tensor_data.
Line 472 retrieves batch_size from tensor_data but uses batch_data.keys(). If the first key in batch_data is a scalar parameter (like max_new_tokens), accessing .shape[0] will fail.
🐛 Proposed fix
assert all(torch.is_tensor(data) or data is None for data in tensor_data.values()), (
"tensor_data values must be tensors"
)
# Get the batch size of current data
- batch_size = tensor_data[next(iter(batch_data.keys()))].shape[0]
+ batch_size = tensor_data[next(iter(tensor_data.keys()))].shape[0]
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
)
# Get the batch size of current data
batch_size = tensor_data[next(iter(tensor_data.keys()))].shape[0]
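A hedged sketch of the review's other option (scan for the first value that is actually a tensor); variable names follow the snippet above, and the fallback of 1 is an assumption:

```python
import torch

# Pick the first real tensor in tensor_data rather than relying on key order.
first_tensor = next((v for v in tensor_data.values() if torch.is_tensor(v)), None)
batch_size = first_tensor.shape[0] if first_tensor is not None else 1  # assumed fallback
```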
🤖 Prompt for AI Agents
In `@modelopt/torch/utils/dataset_utils.py` around lines 470 - 472, The code
computes batch_size using tensor_data but iterates keys from batch_data
(batch_size = tensor_data[next(iter(batch_data.keys()))].shape[0]), which can
pick a non-tensor key (e.g., max_new_tokens) and cause .shape access to fail;
change the key source to tensor_data so you select a tensor key (e.g.,
batch_size = tensor_data[next(iter(tensor_data.keys()))].shape[0]) or explicitly
find the first value in tensor_data that is a tensor and use its shape[0];
update the logic around the batch_size variable and references to
tensor_data/batch_data to ensure batch_size is derived from actual tensors
(functions/variables to check: tensor_data, batch_data, batch_size).
| if first.get("pixel_values") is not None: | ||
| result["pixel_values"] = torch.tensor(first["pixel_values"]).to(self.device) | ||
|
|
||
| # Handle image grid thw (tile height width info) | ||
| if first.get("image_grid_thw") is not None: | ||
| result["image_grid_thw"] = torch.LongTensor(first["image_grid_thw"]).to(self.device) | ||
|
|
||
| # Handle audio features if present | ||
| if first.get("audio_feature_lens") is not None: | ||
| result["audio_feature_lens"] = torch.LongTensor(first["audio_feature_lens"]).to( | ||
| self.device | ||
| ) | ||
| if first.get("audio_features") is not None: | ||
| result["audio_features"] = torch.tensor(first["audio_features"]).to(self.device) |
Inconsistent tensor creation may cause dtype issues.
torch.tensor() (lines 265, 277) infers dtype from input data, while torch.LongTensor() explicitly creates int64 tensors. For pixel_values and audio_features, the original dtype (likely float32/bfloat16) should be preserved.
🔧 Suggested fix to preserve dtype
# Handle pixel values for images
if first.get("pixel_values") is not None:
- result["pixel_values"] = torch.tensor(first["pixel_values"]).to(self.device)
+ result["pixel_values"] = torch.tensor(first["pixel_values"], dtype=torch.float32).to(self.device)
# Handle audio features if present
if first.get("audio_features") is not None:
- result["audio_features"] = torch.tensor(first["audio_features"]).to(self.device)
+ result["audio_features"] = torch.tensor(first["audio_features"], dtype=torch.float32).to(self.device)Alternatively, consider using the dtype attribute if it's meant to control these tensor types.
🤖 Prompt for AI Agents
In `@modelopt/torch/utils/image_processor.py` around lines 264 - 277, The code
uses torch.tensor(...) for "pixel_values" and "audio_features" which can
change/infer dtypes inconsistently; change those to preserve original dtype by
using torch.as_tensor(first["pixel_values"]).to(self.device) and
torch.as_tensor(first["audio_features"]).to(self.device) (keeping the existing
.to(self.device) pattern), so the dtype of the incoming data is preserved;
update the handling in image_processor.py where result["pixel_values"] and
result["audio_features"] are set (using the local variables first and result) to
use torch.as_tensor instead of torch.tensor.
cjluo-nv left a comment
@Edwardf0t1 could you review it as well?
        batch_size=args.batch_size,
        num_samples=args.calib_size[0],
    )
elif model_type == "qwen3omni":
can we move this to example_utils to keep hf_ptq short?
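A hedged sketch of what such a move could look like; the helper name, signature, and keyword arguments below are illustrative, not taken from the PR (see the author's resolution just after):

```python
# examples/llm_ptq/example_utils.py (illustrative only; the actual helper in the PR may differ)
def build_calib_dataloader(model_type, processor, args):
    """Keep the model-type dispatch out of hf_ptq.py."""
    if model_type == "qwen3omni":
        from modelopt.torch.utils.dataset_utils import get_qwen3omni_text_dataloader

        return get_qwen3omni_text_dataloader(
            processor=processor,
            num_samples=args.calib_size[0],
        )
    raise ValueError(f"Unsupported model_type for this helper: {model_type}")
```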
Done
examples/llm_ptq/hf_ptq.py
Outdated
old_stdout = sys.stdout
sys.stdout = buffer = io.StringIO()
try:
    mtq.print_quant_summary(full_model)
this is a bit hacky. How about just modify print_quant_summary to take a file path?
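For comparison, a hedged caller-side sketch that avoids swapping sys.stdout by hand (it assumes an args.quant_summary_path option; extending print_quant_summary itself to take a path, as suggested, is the cleaner fix):

```python
import contextlib

# Illustrative only: redirect the existing print-based summary into a file when a path is given.
if args.quant_summary_path:
    with open(args.quant_summary_path, "w") as f, contextlib.redirect_stdout(f):
        mtq.print_quant_summary(full_model)
else:
    mtq.print_quant_summary(full_model)
```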
Done
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Edwardf0t1 left a comment
@ajrasane I'm wondering if the exported checkpoint can be deployed on vLLM/SGLang/TRT-LLM and how's the accuracy?
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
@Edwardf0t1, I was able to deploy the thinking checkpoint on vLLM and get good outputs. I have also generated accuracy results with MMLU. PTAL.
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
    return None


def ensure_tokenizer_files(model_path: str, source_model_id: str) -> None:
why do we need this?
| print("No custom model files found to copy") | ||
|
|
||
|
|
||
| def patch_config_for_unified_export(model_type: str, export_path: str) -> None: |
@Edwardf0t1 could you review this part?
 else:
-    calibrate_loop = create_forward_loop(dataloader=calib_dataloader)
+    calibrate_loop = create_forward_loop(
+        dataloader=calib_dataloader, generation_kwargs=generation_kwargs
what will happen if generation_kwargs is empty?
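In plain-Python terms (a sketch, not the modelopt internals): splatting an empty dict contributes no arguments, so an empty generation_kwargs should make the new call equivalent to the old one.

```python
def fake_generate(**kwargs):
    return sorted(kwargs)

batch = {"input_ids": [1, 2, 3]}
generation_kwargs = {}  # empty: the call below is identical to fake_generate(**batch)
assert fake_generate(**batch, **generation_kwargs) == fake_generate(**batch)
```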
    generated_ids_before_ptq,
    is_nemotron_vl_model,
    first_text_speech_dataset,
    calib_batch: dict | None = None,
please document why we need to add calib_batch
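A hedged sketch of the kind of docstring note being asked for; the wording and trimmed signature below are illustrative, not from the PR:

```python
def post_quantize(full_model, preview_input_ids, calib_batch: dict | None = None):
    """Run post-PTQ generation and comparisons.

    calib_batch: the fully processed calibration batch (input_ids plus any multimodal
        tensors such as pixel values or audio features). Multimodal models like
        Qwen3Omni need the whole batch for post-PTQ generation, not just the preview
        input_ids, so the before/after outputs stay comparable.
    """
```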
        return processor.tokenizer.batch_decode(input_ids)
    # BaseImageProcessor covers MllamaImageProcessor and Qwen3OmniImageProcessor
    if processor is not None and isinstance(processor, BaseImageProcessor):
        return processor.tokenizer.batch_decode(input_ids, skip_special_tokens=True)
is 800 and 804 expected?
@@ -0,0 +1,136 @@
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
is this qwen3 omni specific?
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
What does this PR do?
Type of change:
Model support
Overview:
Usage
Testing
Able to quantize the model, export an HF checkpoint, and run inference with vLLM:
Before your PR is "Ready for review"
Summary by CodeRabbit
New Features
Bug Fixes
Chores