Add support for Qwen3Omni30B thinking model #856
base: main
Conversation
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Important: Review skipped. Auto incremental reviews are disabled on this repository; please check the settings in the CodeRabbit UI.
📝 Walkthrough
The pull request adds comprehensive support for Qwen3Omni multimodal models across the PTQ quantization pipeline, introduces specialized image and text processors for handling multimodal inputs, extends the model name-to-type mapping, and adds new dataset utilities and CLI options for enhanced quantization workflow control.
Sequence Diagram(s)
sequenceDiagram
participant CLI as User/CLI
participant Main as hf_ptq.py<br/>(quantize_main)
participant Processor as Processor<br/>Selection
participant DataLoader as DataLoader<br/>Factory
participant PreQuant as pre_quantize
participant Quant as quantize_model
participant PostQuant as post_quantize
participant Export as export_quantized
CLI->>Main: Run with --model_name=qwen3omni
Main->>Processor: get_processor(model_type="qwen3omni")
Processor-->>Main: Qwen3OmniImageProcessor
Main->>DataLoader: Select dataset<br/>(video/VL/text)
alt Video Dataset
DataLoader->>DataLoader: Load video samples
DataLoader-->>Main: video_dataloader
else VL Dataset
DataLoader->>DataLoader: Load image+text samples
DataLoader-->>Main: vl_dataloader
else Text-only Dataset
DataLoader->>DataLoader: Load text samples
DataLoader-->>Main: get_qwen3omni_text_dataloader()
end
Main->>PreQuant: pre_quantize(calib_dataloader, processor)
PreQuant->>PreQuant: Process full calib_batch<br/>(not just preview)
PreQuant-->>Main: preview_input_ids,<br/>generated_ids_before_ptq,<br/>calib_batch
Main->>Quant: quantize_model(calib_batch, processor)
Quant->>Quant: Use _should_use_generate<br/>to route inference
Quant-->>Main: quantized_model
Main->>PostQuant: post_quantize(..., calib_batch)
PostQuant->>PostQuant: Post-PTQ generation<br/>with calib_batch
PostQuant-->>Main: results
Main->>Export: export_quantized(calib_batch)
Export->>Export: Sanitize generation_config
Export-->>Main: exported_checkpoint
alt --quant_summary_path provided
Main->>Main: Save summary to file
else No path
Main->>Main: Print summary to stdout
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 219-221: The current code mutates processor.dtype (processor.dtype
= language_model.dtype) which can have unintended side effects; instead avoid
changing the existing processor and either pass dtype explicitly into
get_vlm_dataset_dataloader/collate function or construct a new processor
instance with the desired dtype (e.g., create a new Qwen3OmniImageProcessor
using processor.tokenizer and language_model.dtype) and pass that to
get_vlm_dataset_dataloader so the original processor remains unchanged.
In `@modelopt/torch/export/unified_export_hf.py`:
- Around line 1001-1009: The code currently mutates model.generation_config
in-place (via gen_config and clearing temperature/top_p/top_k) which persists
after export; instead, create a non-mutating copy of model.generation_config (or
record the original values for "temperature","top_p","top_k"), apply the
sampling fixes to that copy (or to the saved serialization data) and use the
copy for writing the checkpoint (e.g., before calling save_pretrained), then
restore the original model.generation_config (or simply never mutate it) so the
caller's model is unchanged; reference symbols: model.generation_config,
gen_config, and the sampling attrs ["temperature","top_p","top_k"] when locating
where to implement the copy/restore behavior.
In `@modelopt/torch/utils/dataset_utils.py`:
- Around line 470-472: The code computes batch_size using tensor_data but
iterates keys from batch_data (batch_size =
tensor_data[next(iter(batch_data.keys()))].shape[0]), which can pick a
non-tensor key (e.g., max_new_tokens) and cause .shape access to fail; change
the key source to tensor_data so you select a tensor key (e.g., batch_size =
tensor_data[next(iter(tensor_data.keys()))].shape[0]) or explicitly find the
first value in tensor_data that is a tensor and use its shape[0]; update the
logic around the batch_size variable and references to tensor_data/batch_data to
ensure batch_size is derived from actual tensors (functions/variables to check:
tensor_data, batch_data, batch_size).
In `@modelopt/torch/utils/image_processor.py`:
- Around line 264-277: The code uses torch.tensor(...) for "pixel_values" and
"audio_features" which can change/infer dtypes inconsistently; change those to
preserve original dtype by using
torch.as_tensor(first["pixel_values"]).to(self.device) and
torch.as_tensor(first["audio_features"]).to(self.device) (keeping the existing
.to(self.device) pattern), so the dtype of the incoming data is preserved;
update the handling in image_processor.py where result["pixel_values"] and
result["audio_features"] are set (using the local variables first and result) to
use torch.as_tensor instead of torch.tensor.
🧹 Nitpick comments (5)
modelopt/torch/export/model_utils.py (1)
145-148: Minor: Update comment for consistency. The comment "Pattern 3: No language_model found" at line 148 is now misleading since a new pattern was inserted above it. Consider renumbering or removing the pattern numbering.
💡 Suggested fix
 if hasattr(model, "thinker"):
     return [model, model.thinker]
-# Pattern 3: No language_model found
+# No language_model found
 return None
examples/llm_ptq/example_utils.py (1)
284-286: Update docstring to reflect new return types. The docstring states it returns a MllamaImageProcessor object, but now it can also return Qwen3OmniImageProcessor for the new model type.
📝 Suggested fix
 def get_processor(
     ckpt_path,
     model_type,
     device: torch.device = "auto",
     trust_remote_code=False,
     attn_implementation=None,
 ) -> BaseImageProcessor | ProcessorMixin | None:
     """
-    Returns a :class:`modelopt.torch.utils.image_processor.MllamaImageProcessor` object.
+    Returns an image processor for multimodal models (MllamaImageProcessor, Qwen3OmniImageProcessor)
+    or a ProcessorMixin for models like Whisper.
     """
modelopt/torch/utils/image_processor.py (1)
115-172: Consider interface consistency for preprocess_function.
Qwen3OmniTextProcessor.preprocess_function(text: str) has a different signature than the base class and sibling classes, which use preprocess_function(examples: dict). This could cause issues if callers expect a uniform interface. Additionally, the dtype attribute is stored but never used in the processor methods.
💡 Suggested approach
Consider either:
- Aligning the signature with the base class by wrapping text in a dict:
  def preprocess_function(self, examples):
      text = examples.get("text", examples) if isinstance(examples, dict) else examples
      # ... rest of implementation
- Or documenting the intentional deviation clearly in the class docstring.
For the unused dtype, either utilize it in tensor conversion or remove it if not needed.
modelopt/torch/utils/dataset_utils.py (1)
136-146: Replace assertions with proper exceptions for public API. Using assert for input validation in a public API function (get_qwen3omni_text_dataloader) is discouraged because assertions can be disabled with the -O flag. Use ValueError or TypeError instead.
💡 Suggested fix
-    assert processor is not None, "Please provide a Qwen3OmniTextProcessor."
+    if processor is None:
+        raise ValueError("Please provide a Qwen3OmniTextProcessor.")
     if isinstance(num_samples, int):
         num_samples = [num_samples]
     if isinstance(dataset_name, str):
         dataset_name = [dataset_name]
-    assert len(dataset_name) == len(num_samples), (
-        "dataset_name and num_samples must be the same length"
-    )
+    if len(dataset_name) != len(num_samples):
+        raise ValueError("dataset_name and num_samples must be the same length")
examples/llm_ptq/hf_ptq.py (1)
756-766: Code duplication in Qwen3Omni generation handling. The logic for handling Qwen3Omni's tuple return from generate() is duplicated between pre_quantize (lines 756-766) and post_quantize (lines 817-827). Consider extracting it into a helper function.
♻️ Suggested refactor
def _extract_qwen3omni_generated_ids(result):
    """Extract generated IDs from Qwen3Omni generate() output."""
    if isinstance(result, tuple):
        text_ids, _ = result
        return text_ids.sequences if hasattr(text_ids, "sequences") else text_ids
    return result
Then use in both places:
# In pre_quantize and post_quantize:
result = full_model.generate(**calib_batch, max_new_tokens=100)
generated_ids = _extract_qwen3omni_generated_ids(result)
Also applies to: 817-827
examples/llm_ptq/hf_ptq.py
Outdated
# Set the dtype for proper tensor conversion in collate_function
processor.dtype = language_model.dtype
calib_dataloader = get_vlm_dataset_dataloader(
Side effect: Mutating processor.dtype.
Setting processor.dtype = language_model.dtype mutates the processor object, which could affect other callers if the processor is reused elsewhere. Consider passing dtype as a parameter instead of mutating.
💡 Suggested approach
Rather than mutating the processor, consider passing dtype explicitly to the collate function or creating a new processor instance with the dtype set:
# Option 1: Create new processor with dtype
processor = Qwen3OmniImageProcessor(
processor.tokenizer,
device=device,
dtype=language_model.dtype
)
calib_dataloader = get_vlm_dataset_dataloader(...)
# Fix generation_config conflicts before saving
# Some models have temperature/top_p/top_k set but do_sample=False which causes validation errors
if hasattr(model, "generation_config") and model.generation_config is not None:
    gen_config = model.generation_config
    if not getattr(gen_config, "do_sample", True):
        # Remove sampling-related params when do_sample is False
        for attr in ["temperature", "top_p", "top_k"]:
            if hasattr(gen_config, attr):
                setattr(gen_config, attr, None)
🧩 Analysis chain
🏁 Script executed:
# First, examine the context around lines 1001-1009 to understand the function flow
sed -n '995,1020p' modelopt/torch/export/unified_export_hf.py
Repository: NVIDIA/Model-Optimizer
Length of output: 1243
🏁 Script executed:
# Check what happens after the save_pretrained call
sed -n '1001,1050p' modelopt/torch/export/unified_export_hf.py
Repository: NVIDIA/Model-Optimizer
Length of output: 1434
🏁 Script executed:
# Look for the function definition to understand the full scope
grep -n "def.*export" modelopt/torch/export/unified_export_hf.py | head -20
Repository: NVIDIA/Model-Optimizer
Length of output: 208
🌐 Web query:
Hugging Face Transformers GenerationConfig save_pretrained mutation best practice
💡 Result:
Best practice (avoid “mutation surprises”)
1) For one-off generation changes, don’t mutate model.generation_config
Prefer passing parameters directly to generate() (or pass a separate GenerationConfig). Transformers explicitly supports this: generation_config is the base, and any matching **kwargs override it for that call; if you don’t pass one, Transformers loads defaults from (1) generation_config.json if present, else (2) the model config. [1]
out = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.9)
2) For persistent defaults you want to version/control, create & save a dedicated GenerationConfig
Use GenerationConfig.from_pretrained(...) (loads generation_config.json) or GenerationConfig.from_model_config(model.config) (helpful for legacy configs), then edit and save_pretrained(). Saving writes generation_config.json by default and can be reloaded with from_pretrained(). [1]
from transformers import GenerationConfig
gen_cfg = GenerationConfig.from_pretrained("openai-community/gpt2") # or your model dir/id
gen_cfg.max_new_tokens = 256
gen_cfg.do_sample = True
gen_cfg.top_p = 0.9
gen_cfg.save_pretrained("./my_model_dir") # saves ./my_model_dir/generation_config.json
# later:
gen_cfg2 = GenerationConfig.from_pretrained("./my_model_dir")3) If you do want to mutate defaults, do it deliberately (and preferably on a copy)
Because GenerationConfig is a normal mutable object, in-place edits affect subsequent calls. A common pattern is: copy → edit → pass as generation_config to generate(), leaving model.generation_config unchanged (supported by the generation_config= argument). [1]
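A minimal sketch of that copy → edit → pass pattern (assuming `model` and `inputs` already exist in the caller's scope):

```python
import copy

# Edit a throwaway copy; model.generation_config itself stays untouched.
gen_cfg = copy.deepcopy(model.generation_config)
gen_cfg.do_sample = False
gen_cfg.temperature = None  # clear sampling params that conflict with greedy decoding
out = model.generate(**inputs, generation_config=gen_cfg)
```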
Source: Hugging Face Transformers “Generation / GenerationConfig” API docs (covers generate(generation_config=..., **kwargs), loading priority, and GenerationConfig.save_pretrained() / from_pretrained() usage). [1]
References:
[1] https://huggingface.co/docs/transformers/v4.53.2/en/main_classes/text_generation
Avoid persisting generation_config mutation beyond export.
This block mutates model.generation_config in-place; after export returns, callers retain the modified model with temperature/top_p/top_k cleared. Hugging Face best practices recommend avoiding direct mutation of generation_config—instead, preserve the original state or apply fixes only to the saved checkpoint. Consider saving the original values and restoring them after save_pretrained().
🔧 Suggested change (restore after save)
- if hasattr(model, "generation_config") and model.generation_config is not None:
- gen_config = model.generation_config
- if not getattr(gen_config, "do_sample", True):
- # Remove sampling-related params when do_sample is False
- for attr in ["temperature", "top_p", "top_k"]:
- if hasattr(gen_config, attr):
- setattr(gen_config, attr, None)
+ gen_config = getattr(model, "generation_config", None)
+ _gen_restore = None
+ if gen_config is not None and not getattr(gen_config, "do_sample", True):
+ _gen_restore = {a: getattr(gen_config, a, None) for a in ["temperature", "top_p", "top_k"]}
+ for attr in _gen_restore:
+ if hasattr(gen_config, attr):
+ setattr(gen_config, attr, None)
# Save model
- model.save_pretrained(
- export_dir, state_dict=post_state_dict, save_modelopt_state=save_modelopt_state
- )
+ try:
+ model.save_pretrained(
+ export_dir, state_dict=post_state_dict, save_modelopt_state=save_modelopt_state
+ )
+ finally:
+ if _gen_restore is not None:
+ for attr, value in _gen_restore.items():
+ if hasattr(gen_config, attr):
+ setattr(gen_config, attr, value)🤖 Prompt for AI Agents
In `@modelopt/torch/export/unified_export_hf.py` around lines 1001 - 1009, The
code currently mutates model.generation_config in-place (via gen_config and
clearing temperature/top_p/top_k) which persists after export; instead, create a
non-mutating copy of model.generation_config (or record the original values for
"temperature","top_p","top_k"), apply the sampling fixes to that copy (or to the
saved serialization data) and use the copy for writing the checkpoint (e.g.,
before calling save_pretrained), then restore the original
model.generation_config (or simply never mutate it) so the caller's model is
unchanged; reference symbols: model.generation_config, gen_config, and the
sampling attrs ["temperature","top_p","top_k"] when locating where to implement
the copy/restore behavior.
 )
 # Get the batch size of current data
-batch_size = batch_data[next(iter(batch_data.keys()))].shape[0]
+batch_size = tensor_data[next(iter(batch_data.keys()))].shape[0]
Potential bug: batch_data used instead of tensor_data.
Line 472 retrieves batch_size from tensor_data but uses batch_data.keys(). If the first key in batch_data is a scalar parameter (like max_new_tokens), accessing .shape[0] will fail.
🐛 Proposed fix
assert all(torch.is_tensor(data) or data is None for data in tensor_data.values()), (
"tensor_data values must be tensors"
)
# Get the batch size of current data
- batch_size = tensor_data[next(iter(batch_data.keys()))].shape[0]
+ batch_size = tensor_data[next(iter(tensor_data.keys()))].shape[0]
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
)
# Get the batch size of current data
batch_size = tensor_data[next(iter(tensor_data.keys()))].shape[0]
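A hedged sketch of the review's other option (scan for the first value that is actually a tensor); variable names follow the snippet above, and the fallback of 1 is an assumption:

```python
import torch

# Pick the first real tensor in tensor_data rather than relying on key order.
first_tensor = next((v for v in tensor_data.values() if torch.is_tensor(v)), None)
batch_size = first_tensor.shape[0] if first_tensor is not None else 1  # assumed fallback
```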
🤖 Prompt for AI Agents
In `@modelopt/torch/utils/dataset_utils.py` around lines 470 - 472, The code
computes batch_size using tensor_data but iterates keys from batch_data
(batch_size = tensor_data[next(iter(batch_data.keys()))].shape[0]), which can
pick a non-tensor key (e.g., max_new_tokens) and cause .shape access to fail;
change the key source to tensor_data so you select a tensor key (e.g.,
batch_size = tensor_data[next(iter(tensor_data.keys()))].shape[0]) or explicitly
find the first value in tensor_data that is a tensor and use its shape[0];
update the logic around the batch_size variable and references to
tensor_data/batch_data to ensure batch_size is derived from actual tensors
(functions/variables to check: tensor_data, batch_data, batch_size).
| if first.get("pixel_values") is not None: | ||
| result["pixel_values"] = torch.tensor(first["pixel_values"]).to(self.device) | ||
|
|
||
| # Handle image grid thw (tile height width info) | ||
| if first.get("image_grid_thw") is not None: | ||
| result["image_grid_thw"] = torch.LongTensor(first["image_grid_thw"]).to(self.device) | ||
|
|
||
| # Handle audio features if present | ||
| if first.get("audio_feature_lens") is not None: | ||
| result["audio_feature_lens"] = torch.LongTensor(first["audio_feature_lens"]).to( | ||
| self.device | ||
| ) | ||
| if first.get("audio_features") is not None: | ||
| result["audio_features"] = torch.tensor(first["audio_features"]).to(self.device) |
Inconsistent tensor creation may cause dtype issues.
torch.tensor() (lines 265, 277) infers dtype from input data, while torch.LongTensor() explicitly creates int64 tensors. For pixel_values and audio_features, the original dtype (likely float32/bfloat16) should be preserved.
🔧 Suggested fix to preserve dtype
# Handle pixel values for images
if first.get("pixel_values") is not None:
- result["pixel_values"] = torch.tensor(first["pixel_values"]).to(self.device)
+ result["pixel_values"] = torch.tensor(first["pixel_values"], dtype=torch.float32).to(self.device)
# Handle audio features if present
if first.get("audio_features") is not None:
- result["audio_features"] = torch.tensor(first["audio_features"]).to(self.device)
+ result["audio_features"] = torch.tensor(first["audio_features"], dtype=torch.float32).to(self.device)Alternatively, consider using the dtype attribute if it's meant to control these tensor types.
🤖 Prompt for AI Agents
In `@modelopt/torch/utils/image_processor.py` around lines 264 - 277, The code
uses torch.tensor(...) for "pixel_values" and "audio_features" which can
change/infer dtypes inconsistently; change those to preserve original dtype by
using torch.as_tensor(first["pixel_values"]).to(self.device) and
torch.as_tensor(first["audio_features"]).to(self.device) (keeping the existing
.to(self.device) pattern), so the dtype of the incoming data is preserved;
update the handling in image_processor.py where result["pixel_values"] and
result["audio_features"] are set (using the local variables first and result) to
use torch.as_tensor instead of torch.tensor.
cjluo-nv left a comment
@Edwardf0t1 could you review it as well?
        batch_size=args.batch_size,
        num_samples=args.calib_size[0],
    )
elif model_type == "qwen3omni":
can we move this to example_utils to keep hf_ptq short?
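A hedged sketch of what such a move could look like; the helper name, signature, and keyword arguments below are illustrative, not taken from the PR (see the author's resolution just after):

```python
# examples/llm_ptq/example_utils.py (illustrative only; the actual helper in the PR may differ)
def build_calib_dataloader(model_type, processor, args):
    """Keep the model-type dispatch out of hf_ptq.py."""
    if model_type == "qwen3omni":
        from modelopt.torch.utils.dataset_utils import get_qwen3omni_text_dataloader

        return get_qwen3omni_text_dataloader(
            processor=processor,
            num_samples=args.calib_size[0],
        )
    raise ValueError(f"Unsupported model_type for this helper: {model_type}")
```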
Done
examples/llm_ptq/hf_ptq.py
Outdated
old_stdout = sys.stdout
sys.stdout = buffer = io.StringIO()
try:
    mtq.print_quant_summary(full_model)
this is a bit hacky. How about just modify print_quant_summary to take a file path?
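For comparison, a hedged caller-side sketch that avoids swapping sys.stdout by hand (it assumes an args.quant_summary_path option; extending print_quant_summary itself to take a path, as suggested, is the cleaner fix):

```python
import contextlib

# Illustrative only: redirect the existing print-based summary into a file when a path is given.
if args.quant_summary_path:
    with open(args.quant_summary_path, "w") as f, contextlib.redirect_stdout(f):
        mtq.print_quant_summary(full_model)
else:
    mtq.print_quant_summary(full_model)
```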
Done
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Edwardf0t1 left a comment
@ajrasane I'm wondering if the exported checkpoint can be deployed on vLLM/SGLang/TRT-LLM and how's the accuracy?
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
@Edwardf0t1, I was able to deploy the thinking checkpoint on vLLM and get good outputs. I have also generated accuracy results with MMLU. PTAL.
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
    return None


def ensure_tokenizer_files(model_path: str, source_model_id: str) -> None:
why do we need this?
| print("No custom model files found to copy") | ||
|
|
||
|
|
||
| def patch_config_for_unified_export(model_type: str, export_path: str) -> None: |
@Edwardf0t1 could you review this part?
 else:
-    calibrate_loop = create_forward_loop(dataloader=calib_dataloader)
+    calibrate_loop = create_forward_loop(
+        dataloader=calib_dataloader, generation_kwargs=generation_kwargs
what will happen if generation_kwargs is empty?
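In plain-Python terms (a sketch, not the modelopt internals): splatting an empty dict contributes no arguments, so an empty generation_kwargs should make the new call equivalent to the old one.

```python
def fake_generate(**kwargs):
    return sorted(kwargs)

batch = {"input_ids": [1, 2, 3]}
generation_kwargs = {}  # empty: the call below is identical to fake_generate(**batch)
assert fake_generate(**batch, **generation_kwargs) == fake_generate(**batch)
```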
    generated_ids_before_ptq,
    is_nemotron_vl_model,
    first_text_speech_dataset,
    calib_batch: dict | None = None,
please document why we need to add calib_batch
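A hedged sketch of the kind of docstring note being asked for; the wording and trimmed signature below are illustrative, not from the PR:

```python
def post_quantize(full_model, preview_input_ids, calib_batch: dict | None = None):
    """Run post-PTQ generation and comparisons.

    calib_batch: the fully processed calibration batch (input_ids plus any multimodal
        tensors such as pixel values or audio features). Multimodal models like
        Qwen3Omni need the whole batch for post-PTQ generation, not just the preview
        input_ids, so the before/after outputs stay comparable.
    """
```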
        return processor.tokenizer.batch_decode(input_ids)
    # BaseImageProcessor covers MllamaImageProcessor and Qwen3OmniImageProcessor
    if processor is not None and isinstance(processor, BaseImageProcessor):
        return processor.tokenizer.batch_decode(input_ids, skip_special_tokens=True)
is 800 and 804 expected?
@@ -0,0 +1,136 @@
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
is this qwen3 omni specific?
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
What does this PR do?
Type of change:
Model support
Overview:
Usage
Testing
Able to quantize the model, export an HF checkpoint, and run inference with vLLM:
Before your PR is "Ready for review"
Summary by CodeRabbit
New Features
Bug Fixes
Chores