Add Nemotron parse PTQ support #786
base: main
Conversation
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files

@@ Coverage Diff @@
##            main     #786   +/-   ##
=======================================
Coverage    73.44%   73.44%
=======================================
Files          197      197
Lines        20657    20657
=======================================
Hits         15172    15172
Misses        5485     5485

☔ View full report in Codecov by Sentry.
jingyu-ml
left a comment
LGTM, but it seems like you need to rebase onto main, where I did some refactoring of the unified ckpt export.
Force-pushed from 64fa1fa to b670815
Important: Review skipped. Auto incremental reviews are disabled on this repository; please check the settings in the CodeRabbit UI.
📝 Walkthrough
The PR adds Nemotron-Parse support across quantization and export utilities. Changes include Nemotron-Parse detection via config architectures, processor loading, specialized dataset and calibration handling, generation config management, and FP8 quantization path improvements for multimodal models.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
examples/llm_ptq/vlm_utils.py (1)
357-382: ⚠️ Potential issue | 🟡 Minor
Logging level restored outside the finally block; may not restore on exception.
At line 376, logging.getLogger("transformers").setLevel(transformers_log_level) is called after the with block but not in a finally clause. If AutoProcessor.from_pretrained (line 370) raises, the except at line 380 catches it, but the logging level is never restored because line 376 is skipped.
Proposed fix
     try:
         import contextlib
         import io
         import logging

         # Suppress verbose output from processor/tokenizer loading
         transformers_log_level = logging.getLogger("transformers").level
         logging.getLogger("transformers").setLevel(logging.ERROR)
-        with contextlib.redirect_stdout(io.StringIO()):
-            processor = AutoProcessor.from_pretrained(
-                ckpt_path,
-                **model_kwargs,
-            )
-
-        # Restore logging
-        logging.getLogger("transformers").setLevel(transformers_log_level)
+        try:
+            with contextlib.redirect_stdout(io.StringIO()):
+                processor = AutoProcessor.from_pretrained(
+                    ckpt_path,
+                    **model_kwargs,
+                )
+        finally:
+            logging.getLogger("transformers").setLevel(transformers_log_level)

         print(f"Loaded AutoProcessor for model type: {model_type}")
         return processor
modelopt/torch/export/unified_export_hf.py (1)
448-471: ⚠️ Potential issue | 🔴 Critical
Critical: Indentation error: _amax conversion executes unconditionally, ignoring the hasattr guard.
Lines 455 and 471 are at the same indentation as their respective if guards (lines 454, 470), so the _amax.to(torch.float32) calls run even when _amax doesn't exist or is None, causing AttributeError.
Proposed fix
-        if hasattr(input_quantizer, "_amax") and input_quantizer._amax is not None:
-        input_quantizer._amax = input_quantizer._amax.to(torch.float32)
+        if hasattr(input_quantizer, "_amax") and input_quantizer._amax is not None:
+            input_quantizer._amax = input_quantizer._amax.to(torch.float32)
 ...
-        if hasattr(output_quantizer, "_amax") and output_quantizer._amax is not None:
-        output_quantizer._amax = output_quantizer._amax.to(torch.float32)
+        if hasattr(output_quantizer, "_amax") and output_quantizer._amax is not None:
+            output_quantizer._amax = output_quantizer._amax.to(torch.float32)
examples/llm_ptq/hf_ptq.py (1)
578-584: ⚠️ Potential issue | 🟠 Major
Fix incorrect comment and attribute linkage for encoder-decoder VL models.
The comment at lines 578-579 incorrectly states "For Nemotron-Parse, language_model IS full_model," but the actual behavior is: when Nemotron-Parse (an encoder-decoder model) goes through the else branch, extract_and_prepare_language_model_from_vl() extracts model.decoder as language_model, making language_model is not full_model evaluate to True.
Line 584 then executes language_model_lineage[-2].language_model = language_model, which sets full_model.language_model = decoder. For encoder-decoder architectures where the language component is naturally stored at .decoder (not .language_model), this creates an artificial attribute that doesn't align with the model's structure. Since mtq.quantize() operates in-place, the decoder is already quantized; this assignment becomes redundant and conceptually incorrect.
Update the comment to accurately reflect the behavior for encoder-decoder models, and consider whether this attribute assignment is necessary or should be conditional on the model architecture.
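One possible shape for making that assignment conditional, as a sketch only (the function name and its is_encoder_decoder argument are illustrative assumptions, not code from this PR):

def link_language_model(full_model, language_model, lineage, is_encoder_decoder: bool) -> None:
    """Sketch: re-attach the quantized language model only where the attribute is native."""
    if language_model is full_model or is_encoder_decoder:
        # Encoder-decoder VL models (e.g. Nemotron-Parse) keep their quantized decoder at
        # .decoder, and mtq.quantize() already worked in place, so there is nothing to re-link.
        return
    lineage[-2].language_model = language_model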
🤖 Fix all issues with AI agents
In `@examples/llm_ptq/example_utils.py`:
- Around line 641-654: The broad hasattr(model, "encoder") check should be
narrowed to only target Nemotron-Parse models so we don't accidentally move
encoders for unrelated encoder-decoder models; instead of simply checking
hasattr(model, "encoder"), first inspect model.config (or
model.config.architectures/model.config.model_type) or model.__class__.__name__
for the Nemotron-Parse identifier and only then run the named_buffers scan and
model.encoder.to(device) move; update the conditional around model.encoder,
named_buffers, and model.encoder.to(device) to gate on that Nemotron-specific
check (e.g., a helper like is_nemotron_parse(model) that checks
model.config.architectures or model.config.model_type) before performing the CPU
buffer detection and move.
- Around line 357-376: The transformer logging level restore is skipped if
AutoProcessor.from_pretrained raises; update the try block around
AutoProcessor.from_pretrained in example_utils.py so you capture the original
transformers logger level, set it to ERROR before loading, and always restore it
in a finally clause (i.e., move the call to
logging.getLogger("transformers").setLevel(transformers_log_level) into
finally). Reference the AutoProcessor.from_pretrained call and the transformers
logger calls to locate where to add the finally block and ensure the log level
is reset regardless of success or exception.
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 416-429: The current default for Nemotron-Parse sets args.dataset
= ["nemotron_vlm_v2"] but never enables image calibration, causing
make_calib_dataloader -> get_dataset_dataloader to raise NotImplementedError;
fix by ensuring the code either uses a text dataset or enables the image
dataloader: for a minimal change, when is_nemotron_parse is true and
args.dataset is None, set args.calib_with_images = True (and keep args.dataset =
["nemotron_vlm_v2"]), or alternatively change the default to a supported text
dataset like ["cnn_dailymail"]; if you prefer image routing, update
make_calib_dataloader to call get_nemotron_vlm_dataset_dataloader (the imported
symbol) when seeing "nemotron_vlm_v2" so the image-text loader is used instead
of get_dataset_dataloader.
In `@examples/llm_ptq/vlm_utils.py`:
- Around line 76-79: The function redundantly recomputes is_nemotron_parse from
model.config/architectures twice; remove the second computation (the repeated
any("nemotronparse" in arch.lower() for arch in architectures) block) and reuse
the existing is_nemotron_parse variable wherever it’s needed later in the
function, ensuring the original declaration (using model.config and
architectures) remains in scope for those subsequent uses.
In `@modelopt/torch/export/model_utils.py`:
- Around line 150-155: The unconditional hasattr(model, "decoder") check should
be narrowed to only match Nemotron-Parse VL models: change the condition to gate
on the same Nemotron-Parse detection used in is_multimodal_model (e.g., call
is_multimodal_model(model) and/or compare the model's architecture/config field
against the Nemotron-Parse identifier used there) before returning [model,
model.decoder]; reference the symbols model, decoder, and is_multimodal_model
and only return the decoder for models that pass the Nemotron-Parse-specific
check.
In `@modelopt/torch/export/unified_export_hf.py`:
- Around line 487-488: The deletion of weight_quantizer._scale is currently
executed unconditionally due to wrong indentation; update the block around
weight_quantizer to ensure the del statement only runs when the attribute exists
(e.g., indent the del under the if hasattr(weight_quantizer, "_scale") and
weight_quantizer._scale is not None check or replace with a safe
conditional/delete pattern), so only delete _scale when present—look for the
weight_quantizer variable usage in unified_export_hf.py and fix the if + del
pairing.
- Around line 148-157: The body lines in _input_hook and _output_hook are
mis-indented so their if-guards are empty; move the operations into the
corresponding if blocks: in _input_hook ensure the line
input_to_linear[input[0]].append(module) is indented under if len(input) > 0 and
isinstance(input[0], torch.Tensor) (so you avoid IndexError/KeyError when input
is empty or not a Tensor), and in _output_hook ensure
output_to_layernorm[output] = module is indented under if output_to_layernorm is
not None and isinstance(output, torch.Tensor) so the mapping only happens when
the guard passes; keep the existing TODO comment intact.
- Around line 413-446: The code currently computes weight_scaling_factor only
when amax_tensor is not None (using weight_quantizer and amax_tensor) but then
unconditionally calls sub_module.register_buffer(quantizer_attrs.weight_scale,
weight_scaling_factor), which can raise a NameError and duplicates the buffer
registration; remove the unconditional register_buffer call at the end and keep
only the registration inside the if amax_tensor is not None branch so
weight_scaling_factor is only referenced when defined and the buffer is
registered exactly once (refer to weight_quantizer, amax_tensor,
weight_scaling_factor, sub_module.register_buffer, and
quantizer_attrs.weight_scale).
- Around line 319-348: The llm_dummy_forward block has incorrect indentation and
broken if/elif/else flow; fix it by aligning the Nemotron language-model
extraction (language_model_lineage = get_language_model_from_vl(model)) directly
under the surrounding if is_vl_model... block and its subsequent branch (use
language_model_lineage[-1], print, and calls to language_model(...)) at the same
indentation level, and then restore the encoder-decoder branching so it reads:
elif getattr(model.config, "is_encoder_decoder", False): call model(fake_input,
decoder_input_ids=decoder_fake_input) else: call model(fake_input); ensure the
is_nemotron_parse special-case (language_model(..., use_cache=False)) remains
inside the Nemotron branch and that there are no orphaned else clauses.
🧹 Nitpick comments (3)
examples/llm_ptq/vlm_utils.py (1)
84-103: Generation config: dict fallback loses NP-specific settings from the pretrained config.
When GenerationConfig.from_pretrained fails for Nemotron-Parse (lines 91–97), the fallback dict hardcodes max_new_tokens=50. This silently drops model-specific settings (e.g., top_k, repetition_penalty) that the pretrained config would have provided. Consider at minimum logging which settings are being lost, or constructing a GenerationConfig object from the defaults so both paths use a consistent type.
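A minimal sketch of that consistent-type fallback (the default mirrors the value the current dict hardcodes; the function name and except clause are illustrative assumptions, not the PR's code):

from transformers import GenerationConfig

def load_generation_config(ckpt_path: str) -> GenerationConfig:
    """Sketch: always return a GenerationConfig, even when the pretrained one is unavailable."""
    try:
        return GenerationConfig.from_pretrained(ckpt_path)
    except (OSError, ValueError) as err:
        # Model-specific settings such as top_k or repetition_penalty are lost here,
        # so at least say so instead of failing silently.
        print(f"Falling back to default generation config for {ckpt_path}: {err}")
        return GenerationConfig(max_new_tokens=50)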
examples/llm_ptq/example_utils.py (1)
279-305: Inline imports of os, logging, etc. shadow top-level imports.
os is already imported at line 20 and logging doesn't need re-importing. The import contextlib, io can be at the top of the file. This is minor, but importing inside a function for modules that are already available at module scope adds unnecessary noise.
examples/llm_ptq/hf_ptq.py (1)
528-576: Dead code: original_forward is initialized to None and never reassigned.
original_forward is set to None at line 532 and never modified before the check at line 573. The entire block at lines 572–576 (restoring original forward) is unreachable. If wrapping was intended but not yet implemented, this should be a TODO or removed to avoid confusion.
examples/llm_ptq/example_utils.py
Outdated
    else:
        # Try to load AutoProcessor for other VL models (e.g., Nemotron-Parse)
        # This will only work if the model has a processor config
        try:
            import contextlib
            import io
            import logging

            # Suppress verbose output from processor/tokenizer loading
            transformers_log_level = logging.getLogger("transformers").level
            logging.getLogger("transformers").setLevel(logging.ERROR)

            with contextlib.redirect_stdout(io.StringIO()):
                processor = AutoProcessor.from_pretrained(
                    ckpt_path,
                    **model_kwargs,
                )

            # Restore logging
            logging.getLogger("transformers").setLevel(transformers_log_level)
Logging level not restored on exception — same pattern as in vlm_utils.py.
If AutoProcessor.from_pretrained raises, the except at line 380 catches it, but logging.getLogger("transformers").setLevel(transformers_log_level) at line 376 is skipped because it's between the with block and the except. Wrap the restore in a finally clause.
Proposed fix
 transformers_log_level = logging.getLogger("transformers").level
 logging.getLogger("transformers").setLevel(logging.ERROR)
-with contextlib.redirect_stdout(io.StringIO()):
-    processor = AutoProcessor.from_pretrained(
-        ckpt_path,
-        **model_kwargs,
-    )
-
-# Restore logging
-logging.getLogger("transformers").setLevel(transformers_log_level)
+try:
+    with contextlib.redirect_stdout(io.StringIO()):
+        processor = AutoProcessor.from_pretrained(
+            ckpt_path,
+            **model_kwargs,
+        )
+finally:
+    logging.getLogger("transformers").setLevel(transformers_log_level)
🤖 Prompt for AI Agents
In `@examples/llm_ptq/example_utils.py` around lines 357 - 376, The transformer
logging level restore is skipped if AutoProcessor.from_pretrained raises; update
the try block around AutoProcessor.from_pretrained in example_utils.py so you
capture the original transformers logger level, set it to ERROR before loading,
and always restore it in a finally clause (i.e., move the call to
logging.getLogger("transformers").setLevel(transformers_log_level) into
finally). Reference the AutoProcessor.from_pretrained call and the transformers
logger calls to locate where to add the finally block and ensure the log level
is reset regardless of success or exception.
examples/llm_ptq/example_utils.py
Outdated
    # For Nemotron-Parse, ensure the encoder (including RADIO) is fully on device
    # The RADIO encoder has buffers that might not be properly moved even with device_map="auto"
    # This is because custom RADIO modules might not fully support accelerate's device_map
    if device != "cpu" and hasattr(model, "encoder"):
        # Check if encoder has any buffers on CPU
        cpu_buffers = []
        for name, buffer in model.encoder.named_buffers():
            if buffer.device.type == "cpu":
                cpu_buffers.append(name)

        if cpu_buffers:
            print(f"Found {len(cpu_buffers)} encoder buffers on CPU. Moving encoder to {device}...")
            model.encoder = model.encoder.to(device)
            print(f"Encoder moved to {device}")
Broad hasattr(model, "encoder") check may unintentionally move encoder buffers for non-NP models.
Any encoder-decoder model (T5, Bart, Whisper) that runs through this code path will have its encoder buffers forcibly moved. While the actual move only happens if buffers are on CPU (which may be benign), the intent is NP-specific. Consider gating this on the Nemotron-Parse architecture check.
Proposed fix
- # For Nemotron-Parse, ensure the encoder (including RADIO) is fully on device
- # The RADIO encoder has buffers that might not be properly moved even with device_map="auto"
- # This is because custom RADIO modules might not fully support accelerate's device_map
- if device != "cpu" and hasattr(model, "encoder"):
+ # For Nemotron-Parse, ensure the encoder (including RADIO) is fully on device
+ architectures = getattr(model.config, "architectures", []) if hasattr(model, "config") else []
+ is_nemotron_parse = any("nemotronparse" in arch.lower() for arch in architectures)
+ if device != "cpu" and is_nemotron_parse and hasattr(model, "encoder"):📝 Committable suggestion
Suggested change:
    # For Nemotron-Parse, ensure the encoder (including RADIO) is fully on device
    architectures = getattr(model.config, "architectures", []) if hasattr(model, "config") else []
    is_nemotron_parse = any("nemotronparse" in arch.lower() for arch in architectures)
    if device != "cpu" and is_nemotron_parse and hasattr(model, "encoder"):
        # Check if encoder has any buffers on CPU
        cpu_buffers = []
        for name, buffer in model.encoder.named_buffers():
            if buffer.device.type == "cpu":
                cpu_buffers.append(name)
        if cpu_buffers:
            print(f"Found {len(cpu_buffers)} encoder buffers on CPU. Moving encoder to {device}...")
            model.encoder = model.encoder.to(device)
            print(f"Encoder moved to {device}")
🤖 Prompt for AI Agents
In `@examples/llm_ptq/example_utils.py` around lines 641 - 654, The broad
hasattr(model, "encoder") check should be narrowed to only target Nemotron-Parse
models so we don't accidentally move encoders for unrelated encoder-decoder
models; instead of simply checking hasattr(model, "encoder"), first inspect
model.config (or model.config.architectures/model.config.model_type) or
model.__class__.__name__ for the Nemotron-Parse identifier and only then run the
named_buffers scan and model.encoder.to(device) move; update the conditional
around model.encoder, named_buffers, and model.encoder.to(device) to gate on
that Nemotron-specific check (e.g., a helper like is_nemotron_parse(model) that
checks model.config.architectures or model.config.model_type) before performing
the CPU buffer detection and move.
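Since this architecture probe recurs across these comments, the suggested is_nemotron_parse helper could look roughly like the following (a sketch only; the exact name and placement are not part of the PR):

def is_nemotron_parse(model) -> bool:
    """Sketch of the suggested helper: detect Nemotron-Parse from the HF config."""
    config = getattr(model, "config", None)
    architectures = getattr(config, "architectures", None) or []
    return any("nemotronparse" in arch.lower() for arch in architectures)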
examples/llm_ptq/hf_ptq.py
Outdated
     if args.dataset is None:
-        args.dataset = ["cnn_dailymail", "nemotron-post-training-dataset-v2"]
-        warnings.warn(
-            "No dataset specified. Defaulting to cnn_dailymail and nemotron-post-training-dataset-v2."
-        )
+        if is_nemotron_parse:
+            # For Nemotron-Parse, default to Nemotron VLM Dataset v2
+            args.dataset = ["nemotron_vlm_v2"]
+            print(
+                "No dataset specified. Defaulting to 'nemotron_vlm_v2' for Nemotron-Parse "
+                "(NVIDIA's image-text dataset for better calibration)."
+            )
+        else:
+            # For other models, use text-only datasets
+            args.dataset = ["cnn_dailymail", "nemotron-post-training-dataset-v2"]
+            warnings.warn(
+                "No dataset specified. Defaulting to cnn_dailymail and nemotron-post-training-dataset-v2."
+            )
🧩 Analysis chain
🏁 Script executed:
# Find all references to nemotron_vlm_v2 and related functions
rg -n "nemotron_vlm_v2|get_nemotron_vlm_dataset_dataloader|get_dataset_dataloader" --type=py -C3Repository: NVIDIA/Model-Optimizer
Length of output: 12809
🏁 Script executed:
# Examine make_calib_dataloader function to understand the routing logic
sed -n '145,220p' examples/llm_ptq/hf_ptq.py | cat -nRepository: NVIDIA/Model-Optimizer
Length of output: 3752
🏁 Script executed:
# Find where model_type is determined and what it is for Nemotron-Parse
rg -n "model_type\s*=" examples/llm_ptq/hf_ptq.py | head -20Repository: NVIDIA/Model-Optimizer
Length of output: 1038
🏁 Script executed:
# Check what extract_and_prepare_language_model_from_vl returns for Nemotron-Parse
# and trace the model_type assignment around lines 400-460
sed -n '400,460p' examples/llm_ptq/hf_ptq.py | cat -nRepository: NVIDIA/Model-Optimizer
Length of output: 3334
🏁 Script executed:
# Check what model_type is returned for Nemotron-Parse by extract_and_prepare_language_model_from_vl
rg -n "def extract_and_prepare_language_model_from_vl|nemotron_parse|nemotron.*model_type" --type=py -A5 -B2 | head -50Repository: NVIDIA/Model-Optimizer
Length of output: 3920
🏁 Script executed:
# Find the extract_and_prepare_language_model_from_vl function and see what it returns
rg -n "def extract_and_prepare_language_model_from_vl" --type=py -A30Repository: NVIDIA/Model-Optimizer
Length of output: 2212
🏁 Script executed:
# Find get_language_model_from_vl function to see what it returns for Nemotron-Parse
rg -n "def get_language_model_from_vl" --type=py -A40Repository: NVIDIA/Model-Optimizer
Length of output: 2608
🏁 Script executed:
# Check what model_type get_model_type returns for Nemotron-Parse decoder
# Also check if nemotron_vlm_v2 is a valid dataset for get_dataset_dataloader
rg -n "nemotron_vlm_v2|supported.*dataset" --type=py -B2 -A2 | head -60Repository: NVIDIA/Model-Optimizer
Length of output: 4066
🏁 Script executed:
# Find the SUPPORTED_DATASET_CONFIG dictionary to see what datasets are supported
rg -n "SUPPORTED_DATASET_CONFIG\s*=" --type=py -A50 | head -80Repository: NVIDIA/Model-Optimizer
Length of output: 48
🏁 Script executed:
# Search for SUPPORTED_DATASET_CONFIG definition
rg -n "SUPPORTED_DATASET_CONFIG" --type=py -B2 -A20 | head -100Repository: NVIDIA/Model-Optimizer
Length of output: 6390
🏁 Script executed:
# Get the complete list of supported datasets
sed -n '32,150p' modelopt/torch/utils/dataset_utils.py | cat -nRepository: NVIDIA/Model-Optimizer
Length of output: 5345
🏁 Script executed:
# Check get_nemotron_vlm_dataset_dataloader and how it's supposed to be used
rg -n "def get_nemotron_vlm_dataset_dataloader|get_nemotron_vlm_dataset_dataloader" --type=py -B2 -A10Repository: NVIDIA/Model-Optimizer
Length of output: 1089
🏁 Script executed:
# Check if get_nemotron_vlm_dataset_dataloader is actually used in hf_ptq.py
rg -n "get_nemotron_vlm_dataset_dataloader" examples/llm_ptq/hf_ptq.pyRepository: NVIDIA/Model-Optimizer
Length of output: 164
🏁 Script executed:
# Verify the actual supported datasets by checking what get_supported_datasets returns
python3 << 'EOF'
import sys
sys.path.insert(0, '/repo')
from modelopt.torch.utils.dataset_utils import get_supported_datasets
print("Supported text datasets:", get_supported_datasets())
print("\nDoes it include 'nemotron_vlm_v2'?", 'nemotron_vlm_v2' in get_supported_datasets())
EOFRepository: NVIDIA/Model-Optimizer
Length of output: 1168
🏁 Script executed:
# Extract supported datasets directly from the source code
python3 << 'EOF'
import re
with open('modelopt/torch/utils/dataset_utils.py', 'r') as f:
content = f.read()
# Find SUPPORTED_DATASET_CONFIG dictionary
match = re.search(r'SUPPORTED_DATASET_CONFIG:\s*dict\[str,\s*Any\]\s*=\s*\{(.*?)\n\}', content, re.DOTALL)
if match:
config_content = match.group(1)
# Extract all dataset names (keys before the colon)
datasets = re.findall(r'"(\w+(?:-\w+)*)"\s*:', config_content)
print("Supported text datasets:")
for ds in datasets:
print(f" - {ds}")
print(f"\nTotal: {len(datasets)} datasets")
print(f"\nDoes it include 'nemotron_vlm_v2'? {'nemotron_vlm_v2' in datasets}")
EOFRepository: NVIDIA/Model-Optimizer
Length of output: 1049
🏁 Script executed:
# Trace the actual execution flow for Nemotron-Parse
# Check how the decoder model_type is determined
rg -n "def get_model_type" --type=py -A15 | head -40Repository: NVIDIA/Model-Optimizer
Length of output: 2834
Bug confirmed: nemotron_vlm_v2 dataset default will fail at runtime without --calib_with_images.
When Nemotron-Parse is detected without --calib_with_images, the code defaults to ["nemotron_vlm_v2"] (line 419). However, make_calib_dataloader routes non-image paths through get_dataset_dataloader (line 208), which only supports: open_code_reasoning, open_math_reasoning, llama-nemotron-post-training-dataset, nemotron-post-training-dataset-v2, nemotron-post-training-dataset-v1, magpie, cnn_dailymail, pile, pg19, wikipedia, c4, and wikitext.
Since nemotron_vlm_v2 is not in this list, the code will raise NotImplementedError at runtime. The comment suggests the intent is to use image-text calibration, but the routing logic doesn't support this configuration without the --calib_with_images flag.
Either:
- Change the default dataset to a supported text dataset (e.g., cnn_dailymail), or
- Force --calib_with_images=true for Nemotron-Parse models (a sketch of this option follows below), or
- Implement the nemotron_vlm_v2 dataset in the text dataset loader (unlikely to be correct since it's described as image-text data)
Note: get_nemotron_vlm_dataset_dataloader is imported at line 69 but never used.
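A sketch of the minimal-change option (the args.* fields mirror those discussed above; the function is illustrative, not the repository's final code):

def apply_default_calib_dataset(args, is_nemotron_parse: bool) -> None:
    """Sketch: choose a calibration-dataset default the loader can actually serve."""
    if args.dataset is not None:
        return
    if is_nemotron_parse:
        # Keep the image-text default, but force the image dataloader path so
        # get_dataset_dataloader is never asked for an unsupported dataset name.
        args.dataset = ["nemotron_vlm_v2"]
        args.calib_with_images = True
    else:
        args.dataset = ["cnn_dailymail", "nemotron-post-training-dataset-v2"]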
🤖 Prompt for AI Agents
In `@examples/llm_ptq/hf_ptq.py` around lines 416 - 429, The current default for
Nemotron-Parse sets args.dataset = ["nemotron_vlm_v2"] but never enables image
calibration, causing make_calib_dataloader -> get_dataset_dataloader to raise
NotImplementedError; fix by ensuring the code either uses a text dataset or
enables the image dataloader: for a minimal change, when is_nemotron_parse is
true and args.dataset is None, set args.calib_with_images = True (and keep
args.dataset = ["nemotron_vlm_v2"]), or alternatively change the default to a
supported text dataset like ["cnn_dailymail"]; if you prefer image routing,
update make_calib_dataloader to call get_nemotron_vlm_dataset_dataloader (the
imported symbol) when seeing "nemotron_vlm_v2" so the image-text loader is used
instead of get_dataset_dataloader.
examples/llm_ptq/vlm_utils.py
Outdated
    # Check if this is Nemotron-Parse early to set up proper generation config
    config = model.config
    architectures = getattr(config, "architectures", [])
    is_nemotron_parse = any("nemotronparse" in arch.lower() for arch in architectures)
🛠️ Refactor suggestion | 🟠 Major
Duplicate Nemotron-Parse detection within the same function.
is_nemotron_parse is computed at line 79 and then recomputed identically at lines 130–132. Remove the second computation and reuse the existing variable.
Proposed fix
- # Check if this is Nemotron-Parse (uses task prompts instead of chat templates)
- config = model.config
- architectures = getattr(config, "architectures", [])
- is_nemotron_parse = any("nemotronparse" in arch.lower() for arch in architectures)
+ # is_nemotron_parse already computed above (line 79)

Also applies to: 129-132
🤖 Prompt for AI Agents
In `@examples/llm_ptq/vlm_utils.py` around lines 76 - 79, The function redundantly
recomputes is_nemotron_parse from model.config/architectures twice; remove the
second computation (the repeated any("nemotronparse" in arch.lower() for arch in
architectures) block) and reuse the existing is_nemotron_parse variable wherever
it’s needed later in the function, ensuring the original declaration (using
model.config and architectures) remains in scope for those subsequent uses.
modelopt/torch/export/model_utils.py
Outdated
    # Pattern 3: For encoder-decoder VL models (e.g., Nemotron-Parse), the decoder is the language model
    if hasattr(model, "decoder"):
        return [model, model.decoder]

    # Pattern 4: No language_model found
    return None
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# First, let's look at the full function context
cd modelopt/torch/export && wc -l model_utils.pyRepository: NVIDIA/Model-Optimizer
Length of output: 84
🏁 Script executed:
#!/bin/bash
# Read the function containing the reviewed code
cat -n modelopt/torch/export/model_utils.py | head -200 | tail -100Repository: NVIDIA/Model-Optimizer
Length of output: 4348
🏁 Script executed:
#!/bin/bash
# Search for all calls to get_language_model_from_vl
rg "get_language_model_from_vl" --type pyRepository: NVIDIA/Model-Optimizer
Length of output: 936
🏁 Script executed:
#!/bin/bash
# Look for test files related to this module
find . -name "*test*.py" -type f | xargs rg -l "get_language_model_from_vl" 2>/dev/nullRepository: NVIDIA/Model-Optimizer
Length of output: 50
🏁 Script executed:
#!/bin/bash
# Check for references to Nemotron-Parse or similar model architectures
rg -i "nemotron" --type pyRepository: NVIDIA/Model-Optimizer
Length of output: 22220
🏁 Script executed:
#!/bin/bash
# Check the call site in unified_export_hf.py with context
rg "get_language_model_from_vl" -B 5 -A 5 modelopt/torch/export/unified_export_hf.pyRepository: NVIDIA/Model-Optimizer
Length of output: 1160
🏁 Script executed:
#!/bin/bash
# Check the call site in hf_ptq.py with context
rg "get_language_model_from_vl" -B 5 -A 5 examples/llm_ptq/hf_ptq.pyRepository: NVIDIA/Model-Optimizer
Length of output: 1650
🏁 Script executed:
#!/bin/bash
# Check if there are any non-VLM uses or if this function is always used after is_multimodal_model check
rg "language_model_lineage\s*=" --type py -B 3 -A 3Repository: NVIDIA/Model-Optimizer
Length of output: 2047
🏁 Script executed:
#!/bin/bash
# Let me check the broader context of extract_language_model_from_vl function definition and its callers
rg "def extract_language_model_from_vl" -A 15 examples/llm_ptq/hf_ptq.pyRepository: NVIDIA/Model-Optimizer
Length of output: 48
🏁 Script executed:
#!/bin/bash
# Also check where extract_language_model_from_vl is called from
rg "extract_language_model_from_vl\(" --type py -B 2 -A 2Repository: NVIDIA/Model-Optimizer
Length of output: 48
🏁 Script executed:
#!/bin/bash
# Check the docstring and function context around the hf_ptq.py call
sed -n '1500,1650p' examples/llm_ptq/hf_ptq.pyRepository: NVIDIA/Model-Optimizer
Length of output: 48
🏁 Script executed:
#!/bin/bash
# Better approach: search for the function that contains the language_model_lineage call
rg "def " examples/llm_ptq/hf_ptq.py | head -20Repository: NVIDIA/Model-Optimizer
Length of output: 602
🏁 Script executed:
#!/bin/bash
# Get the function name context around the call
rg "language_model_lineage = get_language_model_from_vl" -B 20 examples/llm_ptq/hf_ptq.py | grep "def " | tail -1Repository: NVIDIA/Model-Optimizer
Length of output: 125
🏁 Script executed:
#!/bin/bash
# Get the full function definition
rg "def extract_and_prepare_language_model_from_vl" -A 30 examples/llm_ptq/hf_ptq.pyRepository: NVIDIA/Model-Optimizer
Length of output: 1251
🏁 Script executed:
#!/bin/bash
# Check where extract_and_prepare_language_model_from_vl is called
rg "extract_and_prepare_language_model_from_vl\(" examples/llm_ptq/hf_ptq.py -B 5 -A 2Repository: NVIDIA/Model-Optimizer
Length of output: 1069
🏁 Script executed:
#!/bin/bash
# Check the context in unified_export_hf.py around the is_vl_model check
sed -n '1,50p' modelopt/torch/export/unified_export_hf.py | tail -30Repository: NVIDIA/Model-Optimizer
Length of output: 749
🏁 Script executed:
#!/bin/bash
# Get wider context around is_vl_model usage in unified_export_hf.py
rg "is_vl_model" modelopt/torch/export/unified_export_hf.py -B 3 -A 3Repository: NVIDIA/Model-Optimizer
Length of output: 872
🏁 Script executed:
#!/bin/bash
# Check what architectures are in model.config for common encoder-decoder models
rg "architectures" modelopt/torch/export/model_utils.py -B 2 -A 2Repository: NVIDIA/Model-Optimizer
Length of output: 762
🏁 Script executed:
#!/bin/bash
# Verify that the is_multimodal_model check is sufficient protection
# Look for any direct calls to get_language_model_from_vl without is_multimodal_model guard
rg "get_language_model_from_vl" --type py -B 10 | grep -E "(get_language_model_from_vl|is_multimodal_model)" | head -20Repository: NVIDIA/Model-Optimizer
Length of output: 936
🏁 Script executed:
#!/bin/bash
# Check if the function is documented as needing to be called only for VL models
sed -n '124,142p' modelopt/torch/export/model_utils.pyRepository: NVIDIA/Model-Optimizer
Length of output: 783
Gate the decoder check on Nemotron-Parse architecture to prevent unintended matches with other encoder-decoder models.
The hasattr(model, "decoder") check will match any model with a decoder attribute (T5, Bart, Whisper, etc.), not just Nemotron-Parse encoder-decoder VL models. While current call sites are protected by is_multimodal_model checks, the function lacks defensive validation. Since is_multimodal_model already contains the Nemotron-Parse architecture detection logic (lines 107–109), apply the same pattern here for consistency and robustness:
Proposed fix
# Pattern 3: For encoder-decoder VL models (e.g., Nemotron-Parse), the decoder is the language model
- if hasattr(model, "decoder"):
+ architectures = getattr(model.config, "architectures", []) if hasattr(model, "config") else []
+ is_nemotron_parse = any("nemotronparse" in arch.lower() for arch in architectures)
+ if is_nemotron_parse and hasattr(model, "decoder"):
     return [model, model.decoder]
🤖 Prompt for AI Agents
In `@modelopt/torch/export/model_utils.py` around lines 150 - 155, The
unconditional hasattr(model, "decoder") check should be narrowed to only match
Nemotron-Parse VL models: change the condition to gate on the same
Nemotron-Parse detection used in is_multimodal_model (e.g., call
is_multimodal_model(model) and/or compare the model's architecture/config field
against the Nemotron-Parse identifier used there) before returning [model,
model.decoder]; reference the symbols model, decoder, and is_multimodal_model
and only return the decoder for models that pass the Nemotron-Parse-specific
check.
| def _input_hook(module, input, output): | ||
| """Update dictionary with list of all modules that share the same input.""" | ||
| if len(input) > 0 and isinstance(input[0], torch.Tensor): | ||
| # TODO: Handle DBRX MoE case | ||
| input_to_linear[input[0]].append(module) | ||
| # TODO: Handle DBRX MoE case | ||
| input_to_linear[input[0]].append(module) | ||
|
|
||
| def _output_hook(module, input, output): | ||
| """Update dictionary with mapping of layernorms and their outputs.""" | ||
| if output_to_layernorm is not None and isinstance(output, torch.Tensor): | ||
| output_to_layernorm[output] = module | ||
| output_to_layernorm[output] = module |
Critical: Indentation error — guard conditions in _input_hook and _output_hook are bypassed.
Lines 151–152 and 157 are at the same indentation level as their respective if statements, so the if guards have empty bodies and the operations execute unconditionally. When input is empty or input[0] is not a Tensor, this will crash with an IndexError or KeyError.
Proposed fix
 def _input_hook(module, input, output):
     """Update dictionary with list of all modules that share the same input."""
     if len(input) > 0 and isinstance(input[0], torch.Tensor):
-    # TODO: Handle DBRX MoE case
-    input_to_linear[input[0]].append(module)
+        # TODO: Handle DBRX MoE case
+        input_to_linear[input[0]].append(module)

 def _output_hook(module, input, output):
     """Update dictionary with mapping of layernorms and their outputs."""
     if output_to_layernorm is not None and isinstance(output, torch.Tensor):
-    output_to_layernorm[output] = module
+        output_to_layernorm[output] = module
🤖 Prompt for AI Agents
In `@modelopt/torch/export/unified_export_hf.py` around lines 148 - 157, The body
lines in _input_hook and _output_hook are mis-indented so their if-guards are
empty; move the operations into the corresponding if blocks: in _input_hook
ensure the line input_to_linear[input[0]].append(module) is indented under if
len(input) > 0 and isinstance(input[0], torch.Tensor) (so you avoid
IndexError/KeyError when input is empty or not a Tensor), and in _output_hook
ensure output_to_layernorm[output] = module is indented under if
output_to_layernorm is not None and isinstance(output, torch.Tensor) so the
mapping only happens when the guard passes; keep the existing TODO comment
intact.
| # Check if this is Nemotron-Parse (encoder-decoder VL model) | ||
| architectures = getattr(model.config, "architectures", []) | ||
| is_nemotron_parse = any("nemotronparse" in arch.lower() for arch in architectures) | ||
|
|
||
| if is_vl_model and ("nemotron" in model_type or is_nemotron_parse): | ||
| # For Nemotron VL models (including Nemotron-Parse), run optimization on just the | ||
| # language model/decoder. This avoids needing pixel_values for the vision encoder. | ||
| language_model_lineage = get_language_model_from_vl(model) | ||
|
|
||
| if language_model_lineage is not None: | ||
| language_model = language_model_lineage[-1] | ||
| print( | ||
| f"Running optimization on language model with fake_input shape: {fake_input.shape}" | ||
| ) | ||
| # For Nemotron-Parse decoder, force use_cache=False to avoid tuple index errors | ||
| if is_nemotron_parse: | ||
| language_model(fake_input, use_cache=False) | ||
| else: | ||
| language_model(fake_input) | ||
| else: | ||
| raise ValueError( | ||
| f"Cannot extract language_model from Nemotron VL model (type: {model_type}). " | ||
| "This is required for requantization/resmoothing optimization. " | ||
| "Please ensure the model architecture is supported or file an issue." | ||
| ) | ||
| elif getattr(model.config, "is_encoder_decoder", False): | ||
| # For other encoder-decoder models (non-VL), pass both encoder and decoder input ids | ||
| model(fake_input, decoder_input_ids=decoder_fake_input) | ||
| elif is_vl_model and "nemotron" in model_type: | ||
| # For Nemotron VL models, try to run optimization on just the language model part | ||
| language_model_lineage = get_language_model_from_vl(model) | ||
|
|
||
| if language_model_lineage is not None: | ||
| # Run optimization on just the language model with the same input format as regular LLMs | ||
| # Use the same fake_input tensor that regular LLMs use | ||
| language_model = language_model_lineage[-1] | ||
| print( | ||
| f"Running optimization on language model with fake_input shape: {fake_input.shape}" | ||
| ) | ||
| language_model(fake_input) | ||
| else: | ||
| raise ValueError( | ||
| f"Cannot extract language_model from Nemotron VL model (type: {model_type}). " | ||
| "This is required for requantization/resmoothing optimization. " | ||
| "Please ensure the model architecture is supported or file an issue." | ||
| ) | ||
| else: | ||
| model(fake_input) | ||
| model(fake_input) |
Critical: Broken indentation and control flow in llm_dummy_forward — will cause SyntaxError or incorrect branching.
Multiple issues in this block:
- Line 326 (language_model_lineage = ...) is over-indented relative to its enclosing if at line 323.
- Lines 344–348: the elif/else structure is broken; else: at line 347 has no matching if at that indentation level, and model(fake_input) at line 348 appears to be inside an orphaned else.
This code would fail at parse time or produce incorrect branching. The intended structure should be:
Proposed fix (reconstructed intent)
if is_vl_model and ("nemotron" in model_type or is_nemotron_parse):
- # For Nemotron VL models (including Nemotron-Parse), run optimization on just the
- # language model/decoder. This avoids needing pixel_values for the vision encoder.
- language_model_lineage = get_language_model_from_vl(model)
-
- if language_model_lineage is not None:
- language_model = language_model_lineage[-1]
- print(
- f"Running optimization on language model with fake_input shape: {fake_input.shape}"
- )
- # For Nemotron-Parse decoder, force use_cache=False to avoid tuple index errors
- if is_nemotron_parse:
- language_model(fake_input, use_cache=False)
- else:
- language_model(fake_input)
- else:
- raise ValueError(
- f"Cannot extract language_model from Nemotron VL model (type: {model_type}). "
- "This is required for requantization/resmoothing optimization. "
- "Please ensure the model architecture is supported or file an issue."
- )
- elif getattr(model.config, "is_encoder_decoder", False):
- # For other encoder-decoder models (non-VL), pass both encoder and decoder input ids
- model(fake_input, decoder_input_ids=decoder_fake_input)
- else:
- model(fake_input)
+ # For Nemotron VL models (including Nemotron-Parse), run optimization on just the
+ # language model/decoder. This avoids needing pixel_values for the vision encoder.
+ language_model_lineage = get_language_model_from_vl(model)
+
+ if language_model_lineage is not None:
+ language_model = language_model_lineage[-1]
+ print(
+ f"Running optimization on language model with fake_input shape: {fake_input.shape}"
+ )
+ # For Nemotron-Parse decoder, force use_cache=False to avoid tuple index errors
+ if is_nemotron_parse:
+ language_model(fake_input, use_cache=False)
+ else:
+ language_model(fake_input)
+ else:
+ raise ValueError(
+ f"Cannot extract language_model from Nemotron VL model (type: {model_type}). "
+ "This is required for requantization/resmoothing optimization. "
+ "Please ensure the model architecture is supported or file an issue."
+ )
+ elif getattr(model.config, "is_encoder_decoder", False):
+ # For other encoder-decoder models (non-VL), pass both encoder and decoder input ids
+ model(fake_input, decoder_input_ids=decoder_fake_input)
+ else:
+ model(fake_input)🤖 Prompt for AI Agents
In `@modelopt/torch/export/unified_export_hf.py` around lines 319 - 348, The
llm_dummy_forward block has incorrect indentation and broken if/elif/else flow;
fix it by aligning the Nemotron language-model extraction
(language_model_lineage = get_language_model_from_vl(model)) directly under the
surrounding if is_vl_model... block and its subsequent branch (use
language_model_lineage[-1], print, and calls to language_model(...)) at the same
indentation level, and then restore the encoder-decoder branching so it reads:
elif getattr(model.config, "is_encoder_decoder", False): call model(fake_input,
decoder_input_ids=decoder_fake_input) else: call model(fake_input); ensure the
is_nemotron_parse special-case (language_model(..., use_cache=False)) remains
inside the Nemotron branch and that there are no orphaned else clauses.
| if quantization_format == QUANTIZATION_FP8: | ||
| # Convert amax to float32 | ||
| weight_quantizer._amax = weight_quantizer._amax.to(torch.float32) | ||
| # Note: Use the public 'amax' property, not the private '_amax' attribute | ||
| if hasattr(weight_quantizer, "_amax") and weight_quantizer._amax is not None: | ||
| weight_quantizer._amax = weight_quantizer._amax.to(torch.float32) | ||
| amax_tensor = weight_quantizer._amax | ||
| else: | ||
| # Fallback to public amax property | ||
| amax_tensor = weight_quantizer.amax | ||
| if amax_tensor is not None and hasattr(amax_tensor, "to"): | ||
| amax_tensor = amax_tensor.to(torch.float32) | ||
|
|
||
| # Only compute scaling factor if amax_tensor is valid | ||
| if amax_tensor is not None and hasattr(amax_tensor, "dim"): | ||
| if amax_tensor.dim() == 1: | ||
| # Per-tensor amax | ||
| weight_scaling_factor = torch.tensor( | ||
| weight_quantizer.amax.item() / weight_quantizer.maxbound | ||
| ) | ||
| else: | ||
| # Per-channel amax | ||
| weight_scaling_factor = torch.tensor( | ||
| weight_quantizer.amax / weight_quantizer.maxbound | ||
| ) | ||
|
|
||
| if weight_quantizer._amax.dim() == 1: | ||
| # Per-tensor amax | ||
| weight_scaling_factor = torch.tensor( | ||
| weight_quantizer.amax.item() / weight_quantizer.maxbound | ||
| sub_module.register_buffer( | ||
| quantizer_attrs.weight_scale, | ||
| weight_scaling_factor, | ||
| ) | ||
| else: | ||
| # Per-channel amax | ||
| weight_scaling_factor = torch.tensor(weight_quantizer.amax / weight_quantizer.maxbound) | ||
|
|
||
| sub_module.register_buffer( | ||
| quantizer_attrs.weight_scale, | ||
| weight_scaling_factor, | ||
| ) |
Critical: Duplicate register_buffer and potential NameError when amax_tensor is None.
weight_scaling_factor is computed and registered conditionally inside the if amax_tensor is not None block (lines 426–441), but then lines 443–446 unconditionally register the same buffer again. When amax_tensor is None, weight_scaling_factor is never assigned, causing a NameError at line 445.
The unconditional register_buffer at lines 443–446 appears to be leftover code from before the conditional guard was added. Remove it.
Proposed fix
         sub_module.register_buffer(
             quantizer_attrs.weight_scale,
             weight_scaling_factor,
         )
-    sub_module.register_buffer(
-        quantizer_attrs.weight_scale,
-        weight_scaling_factor,
-    )
📝 Committable suggestion
Suggested change:
    if quantization_format == QUANTIZATION_FP8:
        # Convert amax to float32
        # Note: Use the public 'amax' property, not the private '_amax' attribute
        if hasattr(weight_quantizer, "_amax") and weight_quantizer._amax is not None:
            weight_quantizer._amax = weight_quantizer._amax.to(torch.float32)
            amax_tensor = weight_quantizer._amax
        else:
            # Fallback to public amax property
            amax_tensor = weight_quantizer.amax
            if amax_tensor is not None and hasattr(amax_tensor, "to"):
                amax_tensor = amax_tensor.to(torch.float32)

        # Only compute scaling factor if amax_tensor is valid
        if amax_tensor is not None and hasattr(amax_tensor, "dim"):
            if amax_tensor.dim() == 1:
                # Per-tensor amax
                weight_scaling_factor = torch.tensor(
                    weight_quantizer.amax.item() / weight_quantizer.maxbound
                )
            else:
                # Per-channel amax
                weight_scaling_factor = torch.tensor(
                    weight_quantizer.amax / weight_quantizer.maxbound
                )

            sub_module.register_buffer(
                quantizer_attrs.weight_scale,
                weight_scaling_factor,
            )
🤖 Prompt for AI Agents
In `@modelopt/torch/export/unified_export_hf.py` around lines 413 - 446, The code
currently computes weight_scaling_factor only when amax_tensor is not None
(using weight_quantizer and amax_tensor) but then unconditionally calls
sub_module.register_buffer(quantizer_attrs.weight_scale, weight_scaling_factor),
which can raise a NameError and duplicates the buffer registration; remove the
unconditional register_buffer call at the end and keep only the registration
inside the if amax_tensor is not None branch so weight_scaling_factor is only
referenced when defined and the buffer is registered exactly once (refer to
weight_quantizer, amax_tensor, weight_scaling_factor,
sub_module.register_buffer, and quantizer_attrs.weight_scale).
| if hasattr(weight_quantizer, "_scale") and weight_quantizer._scale is not None: | ||
| del weight_quantizer._scale | ||
| del weight_quantizer._scale |
Critical: del weight_quantizer._scale executes unconditionally despite the hasattr guard.
Same indentation bug pattern — line 488 is not inside the if block on line 487, so _scale deletion runs even when the attribute doesn't exist.
Proposed fix
 if hasattr(weight_quantizer, "_scale") and weight_quantizer._scale is not None:
- del weight_quantizer._scale
+     del weight_quantizer._scale
📝 Committable suggestion
Suggested change:
    if hasattr(weight_quantizer, "_scale") and weight_quantizer._scale is not None:
        del weight_quantizer._scale
🤖 Prompt for AI Agents
In `@modelopt/torch/export/unified_export_hf.py` around lines 487 - 488, The
deletion of weight_quantizer._scale is currently executed unconditionally due to
wrong indentation; update the block around weight_quantizer to ensure the del
statement only runs when the attribute exists (e.g., indent the del under the if
hasattr(weight_quantizer, "_scale") and weight_quantizer._scale is not None
check or replace with a safe conditional/delete pattern), so only delete _scale
when present—look for the weight_quantizer variable usage in
unified_export_hf.py and fix the if + del pairing.
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Force-pushed from 48b7a46 to 10f1140
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
    # Check if this is a vision-language model
    # For VL models, skip automatic batch size detection and use a conservative default
    # since proper multimodal input preparation is complex
    if is_multimodal_model(full_model) or is_nemotron_vl(full_model):
does this apply only when you do calib_with_images?
@@ -800,36 +809,46 @@ def quantize_main(
     device: torch.device,
 ):
     if args.batch_size == 0:
This logic is getting long. Could you help refactor it into a helper function?
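For reference, the batch-size branching could be pulled into something like the following (a sketch only; is_vl_model would be computed via the is_multimodal_model / is_nemotron_vl checks shown in the diff above, auto_detect stands in for the script's existing automatic batch-size detection, and vl_default is a placeholder value):

def resolve_calib_batch_size(args, full_model, is_vl_model: bool, auto_detect, vl_default: int = 1) -> int:
    """Sketch of the requested helper: centralize batch-size selection for calibration."""
    if args.batch_size != 0:
        return args.batch_size
    if is_vl_model:
        # VL models skip automatic detection and use a conservative default,
        # since proper multimodal input preparation is complex.
        return vl_default
    return auto_detect(full_model)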
What does this PR do?
Type of change: New model support
Overview: Add PTQ support for https://huggingface.co/nvidia/NVIDIA-Nemotron-Parse-v1.1
Usage
By default, image-text data will be used in calibration for VLMs.
Testing
Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit
New Features
Improvements