Add Nemotron parse PTQ support #786
base: main
Conversation
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files

@@ Coverage Diff @@
##            main     #786   +/-   ##
=======================================
Coverage    73.44%   73.44%
=======================================
Files          197      197
Lines        20657    20657
=======================================
Hits         15172    15172
Misses        5485     5485

☔ View full report in Codecov by Sentry.
jingyu-ml
left a comment
LGTM, but it seems like you need to rebase onto main, where I did some refactoring of the unified ckpt export.
Force-pushed from 64fa1fa to b670815
Important: Review skipped. Auto incremental reviews are disabled on this repository; please check the settings in the CodeRabbit UI.
📝 Walkthrough
The PR adds Nemotron-Parse support across quantization and export utilities. Changes include Nemotron-Parse detection via config architectures, processor loading, specialized dataset and calibration handling, generation config management, and FP8 quantization path improvements for multimodal models.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
examples/llm_ptq/vlm_utils.py (1)
357-382: ⚠️ Potential issue | 🟡 Minor
Logging level restored outside the finally block; may not restore on exception.
At line 376, logging.getLogger("transformers").setLevel(transformers_log_level) is called after the with block but not in a finally clause. If AutoProcessor.from_pretrained (line 370) raises, the except at line 380 catches it, but the logging level is never restored because line 376 is skipped.
Proposed fix
     try:
         import contextlib
         import io
         import logging

         # Suppress verbose output from processor/tokenizer loading
         transformers_log_level = logging.getLogger("transformers").level
         logging.getLogger("transformers").setLevel(logging.ERROR)
-        with contextlib.redirect_stdout(io.StringIO()):
-            processor = AutoProcessor.from_pretrained(
-                ckpt_path,
-                **model_kwargs,
-            )
-
-        # Restore logging
-        logging.getLogger("transformers").setLevel(transformers_log_level)
+        try:
+            with contextlib.redirect_stdout(io.StringIO()):
+                processor = AutoProcessor.from_pretrained(
+                    ckpt_path,
+                    **model_kwargs,
+                )
+        finally:
+            logging.getLogger("transformers").setLevel(transformers_log_level)

         print(f"Loaded AutoProcessor for model type: {model_type}")
         return processor
modelopt/torch/export/unified_export_hf.py (1)
448-471: ⚠️ Potential issue | 🔴 Critical
Critical: Indentation error: _amax conversion executes unconditionally, ignoring the hasattr guard.
Lines 455 and 471 are at the same indentation as their respective if guards (lines 454, 470), so the _amax.to(torch.float32) calls run even when _amax doesn't exist or is None, causing AttributeError.
Proposed fix
-        if hasattr(input_quantizer, "_amax") and input_quantizer._amax is not None:
-        input_quantizer._amax = input_quantizer._amax.to(torch.float32)
+        if hasattr(input_quantizer, "_amax") and input_quantizer._amax is not None:
+            input_quantizer._amax = input_quantizer._amax.to(torch.float32)
 ...
-        if hasattr(output_quantizer, "_amax") and output_quantizer._amax is not None:
-        output_quantizer._amax = output_quantizer._amax.to(torch.float32)
+        if hasattr(output_quantizer, "_amax") and output_quantizer._amax is not None:
+            output_quantizer._amax = output_quantizer._amax.to(torch.float32)
examples/llm_ptq/hf_ptq.py (1)
578-584: ⚠️ Potential issue | 🟠 Major
Fix incorrect comment and attribute linkage for encoder-decoder VL models.
The comment at lines 578-579 incorrectly states "For Nemotron-Parse, language_model IS full_model," but the actual behavior is: when Nemotron-Parse (an encoder-decoder model) goes through the else branch, extract_and_prepare_language_model_from_vl() extracts model.decoder as language_model, making language_model is not full_model evaluate to True.
Line 584 then executes language_model_lineage[-2].language_model = language_model, which sets full_model.language_model = decoder. For encoder-decoder architectures where the language component is naturally stored at .decoder (not .language_model), this creates an artificial attribute that doesn't align with the model's structure. Since mtq.quantize() operates in-place, the decoder is already quantized; this assignment becomes redundant and conceptually incorrect.
Update the comment to accurately reflect the behavior for encoder-decoder models, and consider whether this attribute assignment is necessary or should be conditional on the model architecture.
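One possible shape for making that assignment conditional, as a sketch only (the function name and its is_encoder_decoder argument are illustrative assumptions, not code from this PR):

def link_language_model(full_model, language_model, lineage, is_encoder_decoder: bool) -> None:
    """Sketch: re-attach the quantized language model only where the attribute is native."""
    if language_model is full_model or is_encoder_decoder:
        # Encoder-decoder VL models (e.g. Nemotron-Parse) keep their quantized decoder at
        # .decoder, and mtq.quantize() already worked in place, so there is nothing to re-link.
        return
    lineage[-2].language_model = language_model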
🤖 Fix all issues with AI agents
In `@examples/llm_ptq/example_utils.py`:
- Around line 641-654: The broad hasattr(model, "encoder") check should be
narrowed to only target Nemotron-Parse models so we don't accidentally move
encoders for unrelated encoder-decoder models; instead of simply checking
hasattr(model, "encoder"), first inspect model.config (or
model.config.architectures/model.config.model_type) or model.__class__.__name__
for the Nemotron-Parse identifier and only then run the named_buffers scan and
model.encoder.to(device) move; update the conditional around model.encoder,
named_buffers, and model.encoder.to(device) to gate on that Nemotron-specific
check (e.g., a helper like is_nemotron_parse(model) that checks
model.config.architectures or model.config.model_type) before performing the CPU
buffer detection and move.
- Around line 357-376: The transformer logging level restore is skipped if
AutoProcessor.from_pretrained raises; update the try block around
AutoProcessor.from_pretrained in example_utils.py so you capture the original
transformers logger level, set it to ERROR before loading, and always restore it
in a finally clause (i.e., move the call to
logging.getLogger("transformers").setLevel(transformers_log_level) into
finally). Reference the AutoProcessor.from_pretrained call and the transformers
logger calls to locate where to add the finally block and ensure the log level
is reset regardless of success or exception.
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 416-429: The current default for Nemotron-Parse sets args.dataset
= ["nemotron_vlm_v2"] but never enables image calibration, causing
make_calib_dataloader -> get_dataset_dataloader to raise NotImplementedError;
fix by ensuring the code either uses a text dataset or enables the image
dataloader: for a minimal change, when is_nemotron_parse is true and
args.dataset is None, set args.calib_with_images = True (and keep args.dataset =
["nemotron_vlm_v2"]), or alternatively change the default to a supported text
dataset like ["cnn_dailymail"]; if you prefer image routing, update
make_calib_dataloader to call get_nemotron_vlm_dataset_dataloader (the imported
symbol) when seeing "nemotron_vlm_v2" so the image-text loader is used instead
of get_dataset_dataloader.
In `@examples/llm_ptq/vlm_utils.py`:
- Around line 76-79: The function redundantly recomputes is_nemotron_parse from
model.config/architectures twice; remove the second computation (the repeated
any("nemotronparse" in arch.lower() for arch in architectures) block) and reuse
the existing is_nemotron_parse variable wherever it’s needed later in the
function, ensuring the original declaration (using model.config and
architectures) remains in scope for those subsequent uses.
In `@modelopt/torch/export/model_utils.py`:
- Around line 150-155: The unconditional hasattr(model, "decoder") check should
be narrowed to only match Nemotron-Parse VL models: change the condition to gate
on the same Nemotron-Parse detection used in is_multimodal_model (e.g., call
is_multimodal_model(model) and/or compare the model's architecture/config field
against the Nemotron-Parse identifier used there) before returning [model,
model.decoder]; reference the symbols model, decoder, and is_multimodal_model
and only return the decoder for models that pass the Nemotron-Parse-specific
check.
In `@modelopt/torch/export/unified_export_hf.py`:
- Around line 487-488: The deletion of weight_quantizer._scale is currently
executed unconditionally due to wrong indentation; update the block around
weight_quantizer to ensure the del statement only runs when the attribute exists
(e.g., indent the del under the if hasattr(weight_quantizer, "_scale") and
weight_quantizer._scale is not None check or replace with a safe
conditional/delete pattern), so only delete _scale when present—look for the
weight_quantizer variable usage in unified_export_hf.py and fix the if + del
pairing.
- Around line 148-157: The body lines in _input_hook and _output_hook are
mis-indented so their if-guards are empty; move the operations into the
corresponding if blocks: in _input_hook ensure the line
input_to_linear[input[0]].append(module) is indented under if len(input) > 0 and
isinstance(input[0], torch.Tensor) (so you avoid IndexError/KeyError when input
is empty or not a Tensor), and in _output_hook ensure
output_to_layernorm[output] = module is indented under if output_to_layernorm is
not None and isinstance(output, torch.Tensor) so the mapping only happens when
the guard passes; keep the existing TODO comment intact.
- Around line 413-446: The code currently computes weight_scaling_factor only
when amax_tensor is not None (using weight_quantizer and amax_tensor) but then
unconditionally calls sub_module.register_buffer(quantizer_attrs.weight_scale,
weight_scaling_factor), which can raise a NameError and duplicates the buffer
registration; remove the unconditional register_buffer call at the end and keep
only the registration inside the if amax_tensor is not None branch so
weight_scaling_factor is only referenced when defined and the buffer is
registered exactly once (refer to weight_quantizer, amax_tensor,
weight_scaling_factor, sub_module.register_buffer, and
quantizer_attrs.weight_scale).
- Around line 319-348: The llm_dummy_forward block has incorrect indentation and
broken if/elif/else flow; fix it by aligning the Nemotron language-model
extraction (language_model_lineage = get_language_model_from_vl(model)) directly
under the surrounding if is_vl_model... block and its subsequent branch (use
language_model_lineage[-1], print, and calls to language_model(...)) at the same
indentation level, and then restore the encoder-decoder branching so it reads:
elif getattr(model.config, "is_encoder_decoder", False): call model(fake_input,
decoder_input_ids=decoder_fake_input) else: call model(fake_input); ensure the
is_nemotron_parse special-case (language_model(..., use_cache=False)) remains
inside the Nemotron branch and that there are no orphaned else clauses.
🧹 Nitpick comments (3)
examples/llm_ptq/vlm_utils.py (1)
84-103: Generation config: dict fallback loses NP-specific settings from the pretrained config.
When GenerationConfig.from_pretrained fails for Nemotron-Parse (lines 91–97), the fallback dict hardcodes max_new_tokens=50. This silently drops model-specific settings (e.g., top_k, repetition_penalty) that the pretrained config would have provided. Consider at minimum logging which settings are being lost, or constructing a GenerationConfig object from the defaults so both paths use a consistent type.
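A minimal sketch of that consistent-type fallback (the default mirrors the value the current dict hardcodes; the function name and except clause are illustrative assumptions, not the PR's code):

from transformers import GenerationConfig

def load_generation_config(ckpt_path: str) -> GenerationConfig:
    """Sketch: always return a GenerationConfig, even when the pretrained one is unavailable."""
    try:
        return GenerationConfig.from_pretrained(ckpt_path)
    except (OSError, ValueError) as err:
        # Model-specific settings such as top_k or repetition_penalty are lost here,
        # so at least say so instead of failing silently.
        print(f"Falling back to default generation config for {ckpt_path}: {err}")
        return GenerationConfig(max_new_tokens=50)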
examples/llm_ptq/example_utils.py (1)
279-305: Inline imports of os, logging, etc. shadow top-level imports.
os is already imported at line 20 and logging doesn't need re-importing. The import contextlib, io can be at the top of the file. This is minor, but importing inside a function for modules that are already available at module scope adds unnecessary noise.
examples/llm_ptq/hf_ptq.py (1)
528-576: Dead code: original_forward is initialized to None and never reassigned.
original_forward is set to None at line 532 and never modified before the check at line 573. The entire block at lines 572–576 (restoring original forward) is unreachable. If wrapping was intended but not yet implemented, this should be a TODO or removed to avoid confusion.
examples/llm_ptq/example_utils.py
Outdated
    else:
        # Try to load AutoProcessor for other VL models (e.g., Nemotron-Parse)
        # This will only work if the model has a processor config
        try:
            import contextlib
            import io
            import logging

            # Suppress verbose output from processor/tokenizer loading
            transformers_log_level = logging.getLogger("transformers").level
            logging.getLogger("transformers").setLevel(logging.ERROR)

            with contextlib.redirect_stdout(io.StringIO()):
                processor = AutoProcessor.from_pretrained(
                    ckpt_path,
                    **model_kwargs,
                )

            # Restore logging
            logging.getLogger("transformers").setLevel(transformers_log_level)
Logging level not restored on exception — same pattern as in vlm_utils.py.
If AutoProcessor.from_pretrained raises, the except at line 380 catches it, but logging.getLogger("transformers").setLevel(transformers_log_level) at line 376 is skipped because it's between the with block and the except. Wrap the restore in a finally clause.
Proposed fix
 transformers_log_level = logging.getLogger("transformers").level
 logging.getLogger("transformers").setLevel(logging.ERROR)
-with contextlib.redirect_stdout(io.StringIO()):
-    processor = AutoProcessor.from_pretrained(
-        ckpt_path,
-        **model_kwargs,
-    )
-
-# Restore logging
-logging.getLogger("transformers").setLevel(transformers_log_level)
+try:
+    with contextlib.redirect_stdout(io.StringIO()):
+        processor = AutoProcessor.from_pretrained(
+            ckpt_path,
+            **model_kwargs,
+        )
+finally:
+    logging.getLogger("transformers").setLevel(transformers_log_level)
🤖 Prompt for AI Agents
In `@examples/llm_ptq/example_utils.py` around lines 357 - 376, The transformer
logging level restore is skipped if AutoProcessor.from_pretrained raises; update
the try block around AutoProcessor.from_pretrained in example_utils.py so you
capture the original transformers logger level, set it to ERROR before loading,
and always restore it in a finally clause (i.e., move the call to
logging.getLogger("transformers").setLevel(transformers_log_level) into
finally). Reference the AutoProcessor.from_pretrained call and the transformers
logger calls to locate where to add the finally block and ensure the log level
is reset regardless of success or exception.
examples/llm_ptq/example_utils.py
Outdated
    # For Nemotron-Parse, ensure the encoder (including RADIO) is fully on device
    # The RADIO encoder has buffers that might not be properly moved even with device_map="auto"
    # This is because custom RADIO modules might not fully support accelerate's device_map
    if device != "cpu" and hasattr(model, "encoder"):
        # Check if encoder has any buffers on CPU
        cpu_buffers = []
        for name, buffer in model.encoder.named_buffers():
            if buffer.device.type == "cpu":
                cpu_buffers.append(name)

        if cpu_buffers:
            print(f"Found {len(cpu_buffers)} encoder buffers on CPU. Moving encoder to {device}...")
            model.encoder = model.encoder.to(device)
            print(f"Encoder moved to {device}")
Broad hasattr(model, "encoder") check may unintentionally move encoder buffers for non-NP models.
Any encoder-decoder model (T5, Bart, Whisper) that runs through this code path will have its encoder buffers forcibly moved. While the actual move only happens if buffers are on CPU (which may be benign), the intent is NP-specific. Consider gating this on the Nemotron-Parse architecture check.
Proposed fix
- # For Nemotron-Parse, ensure the encoder (including RADIO) is fully on device
- # The RADIO encoder has buffers that might not be properly moved even with device_map="auto"
- # This is because custom RADIO modules might not fully support accelerate's device_map
- if device != "cpu" and hasattr(model, "encoder"):
+ # For Nemotron-Parse, ensure the encoder (including RADIO) is fully on device
+ architectures = getattr(model.config, "architectures", []) if hasattr(model, "config") else []
+ is_nemotron_parse = any("nemotronparse" in arch.lower() for arch in architectures)
+ if device != "cpu" and is_nemotron_parse and hasattr(model, "encoder"):📝 Committable suggestion
Suggested change:
    # For Nemotron-Parse, ensure the encoder (including RADIO) is fully on device
    architectures = getattr(model.config, "architectures", []) if hasattr(model, "config") else []
    is_nemotron_parse = any("nemotronparse" in arch.lower() for arch in architectures)
    if device != "cpu" and is_nemotron_parse and hasattr(model, "encoder"):
        # Check if encoder has any buffers on CPU
        cpu_buffers = []
        for name, buffer in model.encoder.named_buffers():
            if buffer.device.type == "cpu":
                cpu_buffers.append(name)
        if cpu_buffers:
            print(f"Found {len(cpu_buffers)} encoder buffers on CPU. Moving encoder to {device}...")
            model.encoder = model.encoder.to(device)
            print(f"Encoder moved to {device}")
🤖 Prompt for AI Agents
In `@examples/llm_ptq/example_utils.py` around lines 641 - 654, The broad
hasattr(model, "encoder") check should be narrowed to only target Nemotron-Parse
models so we don't accidentally move encoders for unrelated encoder-decoder
models; instead of simply checking hasattr(model, "encoder"), first inspect
model.config (or model.config.architectures/model.config.model_type) or
model.__class__.__name__ for the Nemotron-Parse identifier and only then run the
named_buffers scan and model.encoder.to(device) move; update the conditional
around model.encoder, named_buffers, and model.encoder.to(device) to gate on
that Nemotron-specific check (e.g., a helper like is_nemotron_parse(model) that
checks model.config.architectures or model.config.model_type) before performing
the CPU buffer detection and move.
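Since this architecture probe recurs across these comments, the suggested is_nemotron_parse helper could look roughly like the following (a sketch only; the exact name and placement are not part of the PR):

def is_nemotron_parse(model) -> bool:
    """Sketch of the suggested helper: detect Nemotron-Parse from the HF config."""
    config = getattr(model, "config", None)
    architectures = getattr(config, "architectures", None) or []
    return any("nemotronparse" in arch.lower() for arch in architectures)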
examples/llm_ptq/hf_ptq.py
Outdated
     if args.dataset is None:
-        args.dataset = ["cnn_dailymail", "nemotron-post-training-dataset-v2"]
-        warnings.warn(
-            "No dataset specified. Defaulting to cnn_dailymail and nemotron-post-training-dataset-v2."
-        )
+        if is_nemotron_parse:
+            # For Nemotron-Parse, default to Nemotron VLM Dataset v2
+            args.dataset = ["nemotron_vlm_v2"]
+            print(
+                "No dataset specified. Defaulting to 'nemotron_vlm_v2' for Nemotron-Parse "
+                "(NVIDIA's image-text dataset for better calibration)."
+            )
+        else:
+            # For other models, use text-only datasets
+            args.dataset = ["cnn_dailymail", "nemotron-post-training-dataset-v2"]
+            warnings.warn(
+                "No dataset specified. Defaulting to cnn_dailymail and nemotron-post-training-dataset-v2."
+            )
🧩 Analysis chain
🏁 Script executed:
# Find all references to nemotron_vlm_v2 and related functions
rg -n "nemotron_vlm_v2|get_nemotron_vlm_dataset_dataloader|get_dataset_dataloader" --type=py -C3Repository: NVIDIA/Model-Optimizer
Length of output: 12809
🏁 Script executed:
# Examine make_calib_dataloader function to understand the routing logic
sed -n '145,220p' examples/llm_ptq/hf_ptq.py | cat -nRepository: NVIDIA/Model-Optimizer
Length of output: 3752
🏁 Script executed:
# Find where model_type is determined and what it is for Nemotron-Parse
rg -n "model_type\s*=" examples/llm_ptq/hf_ptq.py | head -20Repository: NVIDIA/Model-Optimizer
Length of output: 1038
🏁 Script executed:
# Check what extract_and_prepare_language_model_from_vl returns for Nemotron-Parse
# and trace the model_type assignment around lines 400-460
sed -n '400,460p' examples/llm_ptq/hf_ptq.py | cat -nRepository: NVIDIA/Model-Optimizer
Length of output: 3334
🏁 Script executed:
# Check what model_type is returned for Nemotron-Parse by extract_and_prepare_language_model_from_vl
rg -n "def extract_and_prepare_language_model_from_vl|nemotron_parse|nemotron.*model_type" --type=py -A5 -B2 | head -50Repository: NVIDIA/Model-Optimizer
Length of output: 3920
🏁 Script executed:
# Find the extract_and_prepare_language_model_from_vl function and see what it returns
rg -n "def extract_and_prepare_language_model_from_vl" --type=py -A30Repository: NVIDIA/Model-Optimizer
Length of output: 2212
🏁 Script executed:
# Find get_language_model_from_vl function to see what it returns for Nemotron-Parse
rg -n "def get_language_model_from_vl" --type=py -A40Repository: NVIDIA/Model-Optimizer
Length of output: 2608
🏁 Script executed:
# Check what model_type get_model_type returns for Nemotron-Parse decoder
# Also check if nemotron_vlm_v2 is a valid dataset for get_dataset_dataloader
rg -n "nemotron_vlm_v2|supported.*dataset" --type=py -B2 -A2 | head -60Repository: NVIDIA/Model-Optimizer
Length of output: 4066
🏁 Script executed:
# Find the SUPPORTED_DATASET_CONFIG dictionary to see what datasets are supported
rg -n "SUPPORTED_DATASET_CONFIG\s*=" --type=py -A50 | head -80Repository: NVIDIA/Model-Optimizer
Length of output: 48
🏁 Script executed:
# Search for SUPPORTED_DATASET_CONFIG definition
rg -n "SUPPORTED_DATASET_CONFIG" --type=py -B2 -A20 | head -100Repository: NVIDIA/Model-Optimizer
Length of output: 6390
🏁 Script executed:
# Get the complete list of supported datasets
sed -n '32,150p' modelopt/torch/utils/dataset_utils.py | cat -nRepository: NVIDIA/Model-Optimizer
Length of output: 5345
🏁 Script executed:
# Check get_nemotron_vlm_dataset_dataloader and how it's supposed to be used
rg -n "def get_nemotron_vlm_dataset_dataloader|get_nemotron_vlm_dataset_dataloader" --type=py -B2 -A10Repository: NVIDIA/Model-Optimizer
Length of output: 1089
🏁 Script executed:
# Check if get_nemotron_vlm_dataset_dataloader is actually used in hf_ptq.py
rg -n "get_nemotron_vlm_dataset_dataloader" examples/llm_ptq/hf_ptq.pyRepository: NVIDIA/Model-Optimizer
Length of output: 164
🏁 Script executed:
# Verify the actual supported datasets by checking what get_supported_datasets returns
python3 << 'EOF'
import sys
sys.path.insert(0, '/repo')
from modelopt.torch.utils.dataset_utils import get_supported_datasets
print("Supported text datasets:", get_supported_datasets())
print("\nDoes it include 'nemotron_vlm_v2'?", 'nemotron_vlm_v2' in get_supported_datasets())
EOFRepository: NVIDIA/Model-Optimizer
Length of output: 1168
🏁 Script executed:
# Extract supported datasets directly from the source code
python3 << 'EOF'
import re
with open('modelopt/torch/utils/dataset_utils.py', 'r') as f:
content = f.read()
# Find SUPPORTED_DATASET_CONFIG dictionary
match = re.search(r'SUPPORTED_DATASET_CONFIG:\s*dict\[str,\s*Any\]\s*=\s*\{(.*?)\n\}', content, re.DOTALL)
if match:
config_content = match.group(1)
# Extract all dataset names (keys before the colon)
datasets = re.findall(r'"(\w+(?:-\w+)*)"\s*:', config_content)
print("Supported text datasets:")
for ds in datasets:
print(f" - {ds}")
print(f"\nTotal: {len(datasets)} datasets")
print(f"\nDoes it include 'nemotron_vlm_v2'? {'nemotron_vlm_v2' in datasets}")
EOFRepository: NVIDIA/Model-Optimizer
Length of output: 1049
🏁 Script executed:
# Trace the actual execution flow for Nemotron-Parse
# Check how the decoder model_type is determined
rg -n "def get_model_type" --type=py -A15 | head -40Repository: NVIDIA/Model-Optimizer
Length of output: 2834
Bug confirmed: nemotron_vlm_v2 dataset default will fail at runtime without --calib_with_images.
When Nemotron-Parse is detected without --calib_with_images, the code defaults to ["nemotron_vlm_v2"] (line 419). However, make_calib_dataloader routes non-image paths through get_dataset_dataloader (line 208), which only supports: open_code_reasoning, open_math_reasoning, llama-nemotron-post-training-dataset, nemotron-post-training-dataset-v2, nemotron-post-training-dataset-v1, magpie, cnn_dailymail, pile, pg19, wikipedia, c4, and wikitext.
Since nemotron_vlm_v2 is not in this list, the code will raise NotImplementedError at runtime. The comment suggests the intent is to use image-text calibration, but the routing logic doesn't support this configuration without the --calib_with_images flag.
Either:
- Change the default dataset to a supported text dataset (e.g., cnn_dailymail), or
- Force --calib_with_images=true for Nemotron-Parse models (a sketch of this option follows below), or
- Implement the nemotron_vlm_v2 dataset in the text dataset loader (unlikely to be correct since it's described as image-text data)
Note: get_nemotron_vlm_dataset_dataloader is imported at line 69 but never used.
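A sketch of the minimal-change option (the args.* fields mirror those discussed above; the function is illustrative, not the repository's final code):

def apply_default_calib_dataset(args, is_nemotron_parse: bool) -> None:
    """Sketch: choose a calibration-dataset default the loader can actually serve."""
    if args.dataset is not None:
        return
    if is_nemotron_parse:
        # Keep the image-text default, but force the image dataloader path so
        # get_dataset_dataloader is never asked for an unsupported dataset name.
        args.dataset = ["nemotron_vlm_v2"]
        args.calib_with_images = True
    else:
        args.dataset = ["cnn_dailymail", "nemotron-post-training-dataset-v2"]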
🤖 Prompt for AI Agents
In `@examples/llm_ptq/hf_ptq.py` around lines 416 - 429, The current default for
Nemotron-Parse sets args.dataset = ["nemotron_vlm_v2"] but never enables image
calibration, causing make_calib_dataloader -> get_dataset_dataloader to raise
NotImplementedError; fix by ensuring the code either uses a text dataset or
enables the image dataloader: for a minimal change, when is_nemotron_parse is
true and args.dataset is None, set args.calib_with_images = True (and keep
args.dataset = ["nemotron_vlm_v2"]), or alternatively change the default to a
supported text dataset like ["cnn_dailymail"]; if you prefer image routing,
update make_calib_dataloader to call get_nemotron_vlm_dataset_dataloader (the
imported symbol) when seeing "nemotron_vlm_v2" so the image-text loader is used
instead of get_dataset_dataloader.
examples/llm_ptq/vlm_utils.py
Outdated
    # Check if this is Nemotron-Parse early to set up proper generation config
    config = model.config
    architectures = getattr(config, "architectures", [])
    is_nemotron_parse = any("nemotronparse" in arch.lower() for arch in architectures)
🛠️ Refactor suggestion | 🟠 Major
Duplicate Nemotron-Parse detection within the same function.
is_nemotron_parse is computed at line 79 and then recomputed identically at lines 130–132. Remove the second computation and reuse the existing variable.
Proposed fix
- # Check if this is Nemotron-Parse (uses task prompts instead of chat templates)
- config = model.config
- architectures = getattr(config, "architectures", [])
- is_nemotron_parse = any("nemotronparse" in arch.lower() for arch in architectures)
+ # is_nemotron_parse already computed above (line 79)

Also applies to: 129-132
🤖 Prompt for AI Agents
In `@examples/llm_ptq/vlm_utils.py` around lines 76 - 79, The function redundantly
recomputes is_nemotron_parse from model.config/architectures twice; remove the
second computation (the repeated any("nemotronparse" in arch.lower() for arch in
architectures) block) and reuse the existing is_nemotron_parse variable wherever
it’s needed later in the function, ensuring the original declaration (using
model.config and architectures) remains in scope for those subsequent uses.
modelopt/torch/export/model_utils.py
Outdated
    # Pattern 3: For encoder-decoder VL models (e.g., Nemotron-Parse), the decoder is the language model
    if hasattr(model, "decoder"):
        return [model, model.decoder]

    # Pattern 4: No language_model found
    return None
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# First, let's look at the full function context
cd modelopt/torch/export && wc -l model_utils.pyRepository: NVIDIA/Model-Optimizer
Length of output: 84
🏁 Script executed:
#!/bin/bash
# Read the function containing the reviewed code
cat -n modelopt/torch/export/model_utils.py | head -200 | tail -100Repository: NVIDIA/Model-Optimizer
Length of output: 4348
🏁 Script executed:
#!/bin/bash
# Search for all calls to get_language_model_from_vl
rg "get_language_model_from_vl" --type pyRepository: NVIDIA/Model-Optimizer
Length of output: 936
🏁 Script executed:
#!/bin/bash
# Look for test files related to this module
find . -name "*test*.py" -type f | xargs rg -l "get_language_model_from_vl" 2>/dev/nullRepository: NVIDIA/Model-Optimizer
Length of output: 50
🏁 Script executed:
#!/bin/bash
# Check for references to Nemotron-Parse or similar model architectures
rg -i "nemotron" --type pyRepository: NVIDIA/Model-Optimizer
Length of output: 22220
🏁 Script executed:
#!/bin/bash
# Check the call site in unified_export_hf.py with context
rg "get_language_model_from_vl" -B 5 -A 5 modelopt/torch/export/unified_export_hf.pyRepository: NVIDIA/Model-Optimizer
Length of output: 1160
🏁 Script executed:
#!/bin/bash
# Check the call site in hf_ptq.py with context
rg "get_language_model_from_vl" -B 5 -A 5 examples/llm_ptq/hf_ptq.pyRepository: NVIDIA/Model-Optimizer
Length of output: 1650
🏁 Script executed:
#!/bin/bash
# Check if there are any non-VLM uses or if this function is always used after is_multimodal_model check
rg "language_model_lineage\s*=" --type py -B 3 -A 3Repository: NVIDIA/Model-Optimizer
Length of output: 2047
🏁 Script executed:
#!/bin/bash
# Let me check the broader context of extract_language_model_from_vl function definition and its callers
rg "def extract_language_model_from_vl" -A 15 examples/llm_ptq/hf_ptq.pyRepository: NVIDIA/Model-Optimizer
Length of output: 48
🏁 Script executed:
#!/bin/bash
# Also check where extract_language_model_from_vl is called from
rg "extract_language_model_from_vl\(" --type py -B 2 -A 2Repository: NVIDIA/Model-Optimizer
Length of output: 48
🏁 Script executed:
#!/bin/bash
# Check the docstring and function context around the hf_ptq.py call
sed -n '1500,1650p' examples/llm_ptq/hf_ptq.pyRepository: NVIDIA/Model-Optimizer
Length of output: 48
🏁 Script executed:
#!/bin/bash
# Better approach: search for the function that contains the language_model_lineage call
rg "def " examples/llm_ptq/hf_ptq.py | head -20Repository: NVIDIA/Model-Optimizer
Length of output: 602
🏁 Script executed:
#!/bin/bash
# Get the function name context around the call
rg "language_model_lineage = get_language_model_from_vl" -B 20 examples/llm_ptq/hf_ptq.py | grep "def " | tail -1Repository: NVIDIA/Model-Optimizer
Length of output: 125
🏁 Script executed:
#!/bin/bash
# Get the full function definition
rg "def extract_and_prepare_language_model_from_vl" -A 30 examples/llm_ptq/hf_ptq.pyRepository: NVIDIA/Model-Optimizer
Length of output: 1251
🏁 Script executed:
#!/bin/bash
# Check where extract_and_prepare_language_model_from_vl is called
rg "extract_and_prepare_language_model_from_vl\(" examples/llm_ptq/hf_ptq.py -B 5 -A 2Repository: NVIDIA/Model-Optimizer
Length of output: 1069
🏁 Script executed:
#!/bin/bash
# Check the context in unified_export_hf.py around the is_vl_model check
sed -n '1,50p' modelopt/torch/export/unified_export_hf.py | tail -30Repository: NVIDIA/Model-Optimizer
Length of output: 749
🏁 Script executed:
#!/bin/bash
# Get wider context around is_vl_model usage in unified_export_hf.py
rg "is_vl_model" modelopt/torch/export/unified_export_hf.py -B 3 -A 3Repository: NVIDIA/Model-Optimizer
Length of output: 872
🏁 Script executed:
#!/bin/bash
# Check what architectures are in model.config for common encoder-decoder models
rg "architectures" modelopt/torch/export/model_utils.py -B 2 -A 2Repository: NVIDIA/Model-Optimizer
Length of output: 762
🏁 Script executed:
#!/bin/bash
# Verify that the is_multimodal_model check is sufficient protection
# Look for any direct calls to get_language_model_from_vl without is_multimodal_model guard
rg "get_language_model_from_vl" --type py -B 10 | grep -E "(get_language_model_from_vl|is_multimodal_model)" | head -20Repository: NVIDIA/Model-Optimizer
Length of output: 936
🏁 Script executed:
#!/bin/bash
# Check if the function is documented as needing to be called only for VL models
sed -n '124,142p' modelopt/torch/export/model_utils.pyRepository: NVIDIA/Model-Optimizer
Length of output: 783
Gate the decoder check on Nemotron-Parse architecture to prevent unintended matches with other encoder-decoder models.
The hasattr(model, "decoder") check will match any model with a decoder attribute (T5, Bart, Whisper, etc.), not just Nemotron-Parse encoder-decoder VL models. While current call sites are protected by is_multimodal_model checks, the function lacks defensive validation. Since is_multimodal_model already contains the Nemotron-Parse architecture detection logic (lines 107–109), apply the same pattern here for consistency and robustness:
Proposed fix
# Pattern 3: For encoder-decoder VL models (e.g., Nemotron-Parse), the decoder is the language model
- if hasattr(model, "decoder"):
+ architectures = getattr(model.config, "architectures", []) if hasattr(model, "config") else []
+ is_nemotron_parse = any("nemotronparse" in arch.lower() for arch in architectures)
+ if is_nemotron_parse and hasattr(model, "decoder"):
     return [model, model.decoder]
🤖 Prompt for AI Agents
In `@modelopt/torch/export/model_utils.py` around lines 150 - 155, The
unconditional hasattr(model, "decoder") check should be narrowed to only match
Nemotron-Parse VL models: change the condition to gate on the same
Nemotron-Parse detection used in is_multimodal_model (e.g., call
is_multimodal_model(model) and/or compare the model's architecture/config field
against the Nemotron-Parse identifier used there) before returning [model,
model.decoder]; reference the symbols model, decoder, and is_multimodal_model
and only return the decoder for models that pass the Nemotron-Parse-specific
check.
| def _input_hook(module, input, output): | ||
| """Update dictionary with list of all modules that share the same input.""" | ||
| if len(input) > 0 and isinstance(input[0], torch.Tensor): | ||
| # TODO: Handle DBRX MoE case | ||
| input_to_linear[input[0]].append(module) | ||
| # TODO: Handle DBRX MoE case | ||
| input_to_linear[input[0]].append(module) | ||
|
|
||
| def _output_hook(module, input, output): | ||
| """Update dictionary with mapping of layernorms and their outputs.""" | ||
| if output_to_layernorm is not None and isinstance(output, torch.Tensor): | ||
| output_to_layernorm[output] = module | ||
| output_to_layernorm[output] = module |
Critical: Indentation error — guard conditions in _input_hook and _output_hook are bypassed.
Lines 151–152 and 157 are at the same indentation level as their respective if statements, so the if guards have empty bodies and the operations execute unconditionally. When input is empty or input[0] is not a Tensor, this will crash with an IndexError or KeyError.
Proposed fix
 def _input_hook(module, input, output):
     """Update dictionary with list of all modules that share the same input."""
     if len(input) > 0 and isinstance(input[0], torch.Tensor):
-    # TODO: Handle DBRX MoE case
-    input_to_linear[input[0]].append(module)
+        # TODO: Handle DBRX MoE case
+        input_to_linear[input[0]].append(module)

 def _output_hook(module, input, output):
     """Update dictionary with mapping of layernorms and their outputs."""
     if output_to_layernorm is not None and isinstance(output, torch.Tensor):
-    output_to_layernorm[output] = module
+        output_to_layernorm[output] = module
🤖 Prompt for AI Agents
In `@modelopt/torch/export/unified_export_hf.py` around lines 148 - 157, The body
lines in _input_hook and _output_hook are mis-indented so their if-guards are
empty; move the operations into the corresponding if blocks: in _input_hook
ensure the line input_to_linear[input[0]].append(module) is indented under if
len(input) > 0 and isinstance(input[0], torch.Tensor) (so you avoid
IndexError/KeyError when input is empty or not a Tensor), and in _output_hook
ensure output_to_layernorm[output] = module is indented under if
output_to_layernorm is not None and isinstance(output, torch.Tensor) so the
mapping only happens when the guard passes; keep the existing TODO comment
intact.
| # Check if this is Nemotron-Parse (encoder-decoder VL model) | ||
| architectures = getattr(model.config, "architectures", []) | ||
| is_nemotron_parse = any("nemotronparse" in arch.lower() for arch in architectures) | ||
|
|
||
| if is_vl_model and ("nemotron" in model_type or is_nemotron_parse): | ||
| # For Nemotron VL models (including Nemotron-Parse), run optimization on just the | ||
| # language model/decoder. This avoids needing pixel_values for the vision encoder. | ||
| language_model_lineage = get_language_model_from_vl(model) | ||
|
|
||
| if language_model_lineage is not None: | ||
| language_model = language_model_lineage[-1] | ||
| print( | ||
| f"Running optimization on language model with fake_input shape: {fake_input.shape}" | ||
| ) | ||
| # For Nemotron-Parse decoder, force use_cache=False to avoid tuple index errors | ||
| if is_nemotron_parse: | ||
| language_model(fake_input, use_cache=False) | ||
| else: | ||
| language_model(fake_input) | ||
| else: | ||
| raise ValueError( | ||
| f"Cannot extract language_model from Nemotron VL model (type: {model_type}). " | ||
| "This is required for requantization/resmoothing optimization. " | ||
| "Please ensure the model architecture is supported or file an issue." | ||
| ) | ||
| elif getattr(model.config, "is_encoder_decoder", False): | ||
| # For other encoder-decoder models (non-VL), pass both encoder and decoder input ids | ||
| model(fake_input, decoder_input_ids=decoder_fake_input) | ||
| elif is_vl_model and "nemotron" in model_type: | ||
| # For Nemotron VL models, try to run optimization on just the language model part | ||
| language_model_lineage = get_language_model_from_vl(model) | ||
|
|
||
| if language_model_lineage is not None: | ||
| # Run optimization on just the language model with the same input format as regular LLMs | ||
| # Use the same fake_input tensor that regular LLMs use | ||
| language_model = language_model_lineage[-1] | ||
| print( | ||
| f"Running optimization on language model with fake_input shape: {fake_input.shape}" | ||
| ) | ||
| language_model(fake_input) | ||
| else: | ||
| raise ValueError( | ||
| f"Cannot extract language_model from Nemotron VL model (type: {model_type}). " | ||
| "This is required for requantization/resmoothing optimization. " | ||
| "Please ensure the model architecture is supported or file an issue." | ||
| ) | ||
| else: | ||
| model(fake_input) | ||
| model(fake_input) |
Critical: Broken indentation and control flow in llm_dummy_forward — will cause SyntaxError or incorrect branching.
Multiple issues in this block:
- Line 326 (language_model_lineage = ...) is over-indented relative to its enclosing if at line 323.
- Lines 344–348: the elif/else structure is broken; else: at line 347 has no matching if at that indentation level, and model(fake_input) at line 348 appears to be inside an orphaned else.
This code would fail at parse time or produce incorrect branching. The intended structure should be:
Proposed fix (reconstructed intent)
if is_vl_model and ("nemotron" in model_type or is_nemotron_parse):
- # For Nemotron VL models (including Nemotron-Parse), run optimization on just the
- # language model/decoder. This avoids needing pixel_values for the vision encoder.
- language_model_lineage = get_language_model_from_vl(model)
-
- if language_model_lineage is not None:
- language_model = language_model_lineage[-1]
- print(
- f"Running optimization on language model with fake_input shape: {fake_input.shape}"
- )
- # For Nemotron-Parse decoder, force use_cache=False to avoid tuple index errors
- if is_nemotron_parse:
- language_model(fake_input, use_cache=False)
- else:
- language_model(fake_input)
- else:
- raise ValueError(
- f"Cannot extract language_model from Nemotron VL model (type: {model_type}). "
- "This is required for requantization/resmoothing optimization. "
- "Please ensure the model architecture is supported or file an issue."
- )
- elif getattr(model.config, "is_encoder_decoder", False):
- # For other encoder-decoder models (non-VL), pass both encoder and decoder input ids
- model(fake_input, decoder_input_ids=decoder_fake_input)
- else:
- model(fake_input)
+ # For Nemotron VL models (including Nemotron-Parse), run optimization on just the
+ # language model/decoder. This avoids needing pixel_values for the vision encoder.
+ language_model_lineage = get_language_model_from_vl(model)
+
+ if language_model_lineage is not None:
+ language_model = language_model_lineage[-1]
+ print(
+ f"Running optimization on language model with fake_input shape: {fake_input.shape}"
+ )
+ # For Nemotron-Parse decoder, force use_cache=False to avoid tuple index errors
+ if is_nemotron_parse:
+ language_model(fake_input, use_cache=False)
+ else:
+ language_model(fake_input)
+ else:
+ raise ValueError(
+ f"Cannot extract language_model from Nemotron VL model (type: {model_type}). "
+ "This is required for requantization/resmoothing optimization. "
+ "Please ensure the model architecture is supported or file an issue."
+ )
+ elif getattr(model.config, "is_encoder_decoder", False):
+ # For other encoder-decoder models (non-VL), pass both encoder and decoder input ids
+ model(fake_input, decoder_input_ids=decoder_fake_input)
+ else:
+ model(fake_input)🤖 Prompt for AI Agents
In `@modelopt/torch/export/unified_export_hf.py` around lines 319 - 348, The
llm_dummy_forward block has incorrect indentation and broken if/elif/else flow;
fix it by aligning the Nemotron language-model extraction
(language_model_lineage = get_language_model_from_vl(model)) directly under the
surrounding if is_vl_model... block and its subsequent branch (use
language_model_lineage[-1], print, and calls to language_model(...)) at the same
indentation level, and then restore the encoder-decoder branching so it reads:
elif getattr(model.config, "is_encoder_decoder", False): call model(fake_input,
decoder_input_ids=decoder_fake_input) else: call model(fake_input); ensure the
is_nemotron_parse special-case (language_model(..., use_cache=False)) remains
inside the Nemotron branch and that there are no orphaned else clauses.
| if quantization_format == QUANTIZATION_FP8: | ||
| # Convert amax to float32 | ||
| weight_quantizer._amax = weight_quantizer._amax.to(torch.float32) | ||
| # Note: Use the public 'amax' property, not the private '_amax' attribute | ||
| if hasattr(weight_quantizer, "_amax") and weight_quantizer._amax is not None: | ||
| weight_quantizer._amax = weight_quantizer._amax.to(torch.float32) | ||
| amax_tensor = weight_quantizer._amax | ||
| else: | ||
| # Fallback to public amax property | ||
| amax_tensor = weight_quantizer.amax | ||
| if amax_tensor is not None and hasattr(amax_tensor, "to"): | ||
| amax_tensor = amax_tensor.to(torch.float32) | ||
|
|
||
| # Only compute scaling factor if amax_tensor is valid | ||
| if amax_tensor is not None and hasattr(amax_tensor, "dim"): | ||
| if amax_tensor.dim() == 1: | ||
| # Per-tensor amax | ||
| weight_scaling_factor = torch.tensor( | ||
| weight_quantizer.amax.item() / weight_quantizer.maxbound | ||
| ) | ||
| else: | ||
| # Per-channel amax | ||
| weight_scaling_factor = torch.tensor( | ||
| weight_quantizer.amax / weight_quantizer.maxbound | ||
| ) | ||
|
|
||
| if weight_quantizer._amax.dim() == 1: | ||
| # Per-tensor amax | ||
| weight_scaling_factor = torch.tensor( | ||
| weight_quantizer.amax.item() / weight_quantizer.maxbound | ||
| sub_module.register_buffer( | ||
| quantizer_attrs.weight_scale, | ||
| weight_scaling_factor, | ||
| ) | ||
| else: | ||
| # Per-channel amax | ||
| weight_scaling_factor = torch.tensor(weight_quantizer.amax / weight_quantizer.maxbound) | ||
|
|
||
| sub_module.register_buffer( | ||
| quantizer_attrs.weight_scale, | ||
| weight_scaling_factor, | ||
| ) |
Critical: Duplicate register_buffer and potential NameError when amax_tensor is None.
weight_scaling_factor is computed and registered conditionally inside the if amax_tensor is not None block (lines 426–441), but then lines 443–446 unconditionally register the same buffer again. When amax_tensor is None, weight_scaling_factor is never assigned, causing a NameError at line 445.
The unconditional register_buffer at lines 443–446 appears to be leftover code from before the conditional guard was added. Remove it.
Proposed fix
         sub_module.register_buffer(
             quantizer_attrs.weight_scale,
             weight_scaling_factor,
         )
-    sub_module.register_buffer(
-        quantizer_attrs.weight_scale,
-        weight_scaling_factor,
-    )
📝 Committable suggestion
Suggested change:
    if quantization_format == QUANTIZATION_FP8:
        # Convert amax to float32
        # Note: Use the public 'amax' property, not the private '_amax' attribute
        if hasattr(weight_quantizer, "_amax") and weight_quantizer._amax is not None:
            weight_quantizer._amax = weight_quantizer._amax.to(torch.float32)
            amax_tensor = weight_quantizer._amax
        else:
            # Fallback to public amax property
            amax_tensor = weight_quantizer.amax
            if amax_tensor is not None and hasattr(amax_tensor, "to"):
                amax_tensor = amax_tensor.to(torch.float32)

        # Only compute scaling factor if amax_tensor is valid
        if amax_tensor is not None and hasattr(amax_tensor, "dim"):
            if amax_tensor.dim() == 1:
                # Per-tensor amax
                weight_scaling_factor = torch.tensor(
                    weight_quantizer.amax.item() / weight_quantizer.maxbound
                )
            else:
                # Per-channel amax
                weight_scaling_factor = torch.tensor(
                    weight_quantizer.amax / weight_quantizer.maxbound
                )

            sub_module.register_buffer(
                quantizer_attrs.weight_scale,
                weight_scaling_factor,
            )
🤖 Prompt for AI Agents
In `@modelopt/torch/export/unified_export_hf.py` around lines 413 - 446, The code
currently computes weight_scaling_factor only when amax_tensor is not None
(using weight_quantizer and amax_tensor) but then unconditionally calls
sub_module.register_buffer(quantizer_attrs.weight_scale, weight_scaling_factor),
which can raise a NameError and duplicates the buffer registration; remove the
unconditional register_buffer call at the end and keep only the registration
inside the if amax_tensor is not None branch so weight_scaling_factor is only
referenced when defined and the buffer is registered exactly once (refer to
weight_quantizer, amax_tensor, weight_scaling_factor,
sub_module.register_buffer, and quantizer_attrs.weight_scale).
| if hasattr(weight_quantizer, "_scale") and weight_quantizer._scale is not None: | ||
| del weight_quantizer._scale | ||
| del weight_quantizer._scale |
Critical: del weight_quantizer._scale executes unconditionally despite the hasattr guard.
Same indentation bug pattern — line 488 is not inside the if block on line 487, so _scale deletion runs even when the attribute doesn't exist.
Proposed fix
 if hasattr(weight_quantizer, "_scale") and weight_quantizer._scale is not None:
- del weight_quantizer._scale
+     del weight_quantizer._scale
📝 Committable suggestion
Suggested change:
    if hasattr(weight_quantizer, "_scale") and weight_quantizer._scale is not None:
        del weight_quantizer._scale
🤖 Prompt for AI Agents
In `@modelopt/torch/export/unified_export_hf.py` around lines 487 - 488, The
deletion of weight_quantizer._scale is currently executed unconditionally due to
wrong indentation; update the block around weight_quantizer to ensure the del
statement only runs when the attribute exists (e.g., indent the del under the if
hasattr(weight_quantizer, "_scale") and weight_quantizer._scale is not None
check or replace with a safe conditional/delete pattern), so only delete _scale
when present—look for the weight_quantizer variable usage in
unified_export_hf.py and fix the if + del pairing.
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Force-pushed from 48b7a46 to 10f1140
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
    # Check if this is a vision-language model
    # For VL models, skip automatic batch size detection and use a conservative default
    # since proper multimodal input preparation is complex
    if is_multimodal_model(full_model) or is_nemotron_vl(full_model):
does this apply only when you do calib_with_images?
@@ -800,36 +809,46 @@ def quantize_main(
     device: torch.device,
 ):
     if args.batch_size == 0:
This logic is getting long. Could you help refactor it into a helper function?
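For reference, the batch-size branching could be pulled into something like the following (a sketch only; is_vl_model would be computed via the is_multimodal_model / is_nemotron_vl checks shown in the diff above, auto_detect stands in for the script's existing automatic batch-size detection, and vl_default is a placeholder value):

def resolve_calib_batch_size(args, full_model, is_vl_model: bool, auto_detect, vl_default: int = 1) -> int:
    """Sketch of the requested helper: centralize batch-size selection for calibration."""
    if args.batch_size != 0:
        return args.batch_size
    if is_vl_model:
        # VL models skip automatic detection and use a conservative default,
        # since proper multimodal input preparation is complex.
        return vl_default
    return auto_detect(full_model)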
What does this PR do?
Type of change: New model support
Overview: Add PTQ support for https://huggingface.co/nvidia/NVIDIA-Nemotron-Parse-v1.1
Usage
By default, image-text data will be used in calibration for VLMs.
Testing
Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit
New Features
Improvements