Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here.
📝 Walkthrough

Adds FP8 quantization support for the MiniMax M2.1 model.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Model as MiniMax M2.1 Model
    participant HFPlugin as HuggingFace Plugin
    participant Registry as QuantModuleRegistry
    participant ExportUtil as Export Utilities
    participant Kernel as FP8 Kernel

    Model->>HFPlugin: Load model (MiniMaxM2ForCausalLM)
    HFPlugin->>HFPlugin: register_minimax_m2_moe_on_the_fly()
    HFPlugin->>Registry: Register _QuantSparseMoe
    HFPlugin->>Registry: Register _QuantFP8Linear
    ExportUtil->>Model: Scan modules
    ExportUtil->>ExportUtil: is_quantlinear() checks for QuantFP8Linear
    alt QuantFP8Linear detected
        ExportUtil->>ExportUtil: Check element_size <= 1
        ExportUtil->>Kernel: Call weight_dequant with dtype
        Kernel->>Kernel: Dequantize weights using scaling factors
        Kernel-->>ExportUtil: Return dequantized tensor
    end
    ExportUtil-->>Model: Export with unpacked FP8 weights
```
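The "dequantize weights using scaling factors" step in the diagram is block-wise: each tile of the FP8 weight is multiplied by its per-block scale from `weight_scale_inv`. Below is a self-contained reference sketch of that behavior, not the actual `weight_dequant` kernel; the 128 block size, the bfloat16 default, and the multiply-by-stored-scale convention are assumptions.

```python
import torch

def weight_dequant_reference(weight: torch.Tensor, scale_inv: torch.Tensor,
                             block_size: int = 128,
                             dtype: torch.dtype = torch.bfloat16) -> torch.Tensor:
    """Reference block-wise dequantization: `weight` is (M, N) stored in FP8,
    `scale_inv` is (ceil(M/block_size), ceil(N/block_size)) with one scale per tile."""
    m, n = weight.shape
    w = weight.to(torch.float32)
    # Expand each per-tile scale so it covers its block_size x block_size region.
    s = scale_inv.to(torch.float32).repeat_interleave(block_size, dim=0)[:m]
    s = s.repeat_interleave(block_size, dim=1)[:, :n]
    return (w * s).to(dtype)

# Example: a 256x384 FP8 weight with 2x3 per-block scales.
w8 = torch.randn(256, 384).to(torch.float8_e4m3fn)
scales = torch.rand(2, 3) + 0.5
w_bf16 = weight_dequant_reference(w8, scales)  # torch.Size([256, 384]), bfloat16
```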
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ✅ 3 passed
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@CHANGELOG.rst`:
- Around line 13-15: The changelog entry duplicates the word "support" on the
MiniMax line; edit the sentence that currently reads "Add support for MiniMax
M2.1 model quantization support for the original FP8 checkpoint." (the MiniMax
M2.1 line) to remove the extra "support", e.g. "Add MiniMax M2.1 model
quantization for the original FP8 checkpoint."
In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 746-754: unpack_weight currently calls weight_dequant without
specifying dtype, which can mismatch forward (which uses dtype=input.dtype); fix
by storing the target dtype during _setup (e.g., save self._orig_dtype or
self.target_dtype) or by passing an explicit dtype parameter into unpack_weight,
then call weight_dequant(weight, scale_inv, self.block_size,
dtype=self._orig_dtype) so unpacked weights match forward; update any callers
and remove weight_scale_inv as before. Reference: unpack_weight, forward,
weight_dequant, and _setup.
- Around line 454-473: In the forward override where TRANSFORMERS_VERSION_GE_5_0
is checked, remove the stray unconditional assignment "self.top_k =
original_top_k" that appears after the version-gated if/else; that assignment is
incorrect for the TRANSFORMERS_VERSION_GE_5_0 branch (original_top_k references
self.gate.topk) and the two branches already restore their respective
attributes, so deleting this line in the forward method (the block manipulating
self.gate.topk and self.top_k) fixes the bug without changing branch-specific
restoration logic.
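For reference, a minimal sketch of the corrected restore pattern the prompt describes. Only the branch-specific save/restore structure and the names `TRANSFORMERS_VERSION_GE_5_0`, `self.gate.topk`, `self.top_k`, and `original_top_k` come from the review; the surrounding purpose (temporarily widening top-k, e.g. so every expert sees calibration data), the `num_experts` attribute, and the `super().forward()` delegation are assumptions, not the PR's actual code.

```python
def forward(self, hidden_states, *args, **kwargs):
    if TRANSFORMERS_VERSION_GE_5_0:
        original_top_k = self.gate.topk      # transformers >= 5.0: top-k lives on the gate
        self.gate.topk = self.num_experts    # assumption: widen routing temporarily
        output = super().forward(hidden_states, *args, **kwargs)
        self.gate.topk = original_top_k      # restore the gate attribute
    else:
        original_top_k = self.top_k          # older transformers: top-k lives on the MoE module
        self.top_k = self.num_experts
        output = super().forward(hidden_states, *args, **kwargs)
        self.top_k = original_top_k
    # No unconditional `self.top_k = original_top_k` after the if/else: on the >= 5.0 path
    # that would overwrite self.top_k with the value read from self.gate.topk.
    return output
```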
CHANGELOG.rst (outdated)

```rst
- Add standalone type inference option (``--use_standalone_type_inference``) in ONNX AutoCast as an alternative to ONNX's ``infer_shapes``. This experimental feature performs type-only inference without shape inference, useful as a workaround when shape inference fails or to avoid unnecessary shape inference overhead.
- Add support for Kimi K2 Thinking model quantization from the original int4 checkpoint.
- Add support for MiniMax M2.1 model quantization support for the original FP8 checkpoint.
```
Fix duplicated wording in the changelog entry.
Line 15 repeats “support”. Consider the tweak below for clarity.
📝 Proposed fix

```diff
-- Add support for MiniMax M2.1 model quantization support for the original FP8 checkpoint.
+- Add support for MiniMax M2.1 model quantization for the original FP8 checkpoint.
```

🤖 Prompt for AI Agents
In `@CHANGELOG.rst` around lines 13 - 15, the changelog entry duplicates the word
"support" on the MiniMax line; edit the sentence that currently reads "Add
support for MiniMax M2.1 model quantization support for the original FP8
checkpoint." (the MiniMax M2.1 line) to remove the extra "support", e.g. "Add
MiniMax M2.1 model quantization for the original FP8 checkpoint."
```python
def unpack_weight(self):
    with torch.cuda.device(self.weight.device):
        weight, scale_inv = self._get_weight_and_scale_inv()
        self.weight = nn.Parameter(
            weight_dequant(weight, scale_inv, self.block_size),
            requires_grad=False,
        )
        if hasattr(self, "weight_scale_inv"):
            del self.weight_scale_inv
```
Consider specifying dtype in unpack_weight for consistency.
In forward(), weight_dequant is called with dtype=input.dtype, preserving the input's precision. However, unpack_weight() omits dtype, defaulting to torch.get_default_dtype() (typically float32).
If this is intentional for export, a comment would clarify the design. Otherwise, consider accepting/storing a target dtype to ensure consistency.
💡 Suggested improvement

```diff
 def unpack_weight(self):
     with torch.cuda.device(self.weight.device):
         weight, scale_inv = self._get_weight_and_scale_inv()
         self.weight = nn.Parameter(
-            weight_dequant(weight, scale_inv, self.block_size),
+            weight_dequant(weight, scale_inv, self.block_size, dtype=torch.bfloat16),
             requires_grad=False,
         )
```

Or store the original dtype during `_setup` and use it here.
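A sketch of that second option, assuming the target dtype is discoverable when `_setup` runs. The `_dequant_dtype` attribute name and the bfloat16 fallback are hypothetical; `weight_dequant` and `_get_weight_and_scale_inv` are the existing helpers from the module under review, not redefined here.

```python
def _setup(self):
    # ... existing _setup logic ...
    # Hypothetical: remember the dtype unpacked weights should be produced in
    # (e.g., taken from the model config's torch_dtype by the caller).
    self._dequant_dtype = getattr(self, "_dequant_dtype", torch.bfloat16)

def unpack_weight(self):
    with torch.cuda.device(self.weight.device):
        weight, scale_inv = self._get_weight_and_scale_inv()
        self.weight = nn.Parameter(
            # Dequantize into the stored dtype so exported weights match what forward() produces.
            weight_dequant(weight, scale_inv, self.block_size, dtype=self._dequant_dtype),
            requires_grad=False,
        )
        if hasattr(self, "weight_scale_inv"):
            del self.weight_scale_inv
```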
🤖 Prompt for AI Agents
In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 746 - 754,
unpack_weight currently calls weight_dequant without specifying dtype, which can
mismatch forward (which uses dtype=input.dtype); fix by storing the target dtype
during _setup (e.g., save self._orig_dtype or self.target_dtype) or by passing
an explicit dtype parameter into unpack_weight, then call weight_dequant(weight,
scale_inv, self.block_size, dtype=self._orig_dtype) so unpacked weights match
forward; update any callers and remove weight_scale_inv as before. Reference:
unpack_weight, forward, weight_dequant, and _setup.
Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #817      +/-   ##
==========================================
- Coverage   73.73%   73.68%   -0.05%
==========================================
  Files         199      200       +1
  Lines       21165    21187      +22
==========================================
+ Hits        15606    15612       +6
- Misses       5559     5575      +16
```

☔ View full report in Codecov by Sentry.
```python
def register_minimax_m2_moe_on_the_fly(model):
    """Register MiniMax M2 MoE modules as a QUANT_MODULE.

    MiniMax M2 MoE modules are defined in the model card, so we need to register them on the fly.
```
Does the latest HF transformers not support MiniMax M2 MoE?
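For context on what "register on the fly" means here: when the MoE class ships with the model card (`trust_remote_code`) rather than with transformers, the plugin has to look the class up from the loaded model and register it at runtime. A rough sketch under that assumption follows; the attribute path, the import, and the registry usage are illustrative (patterned after other on-the-fly registrations in this plugin), not the PR's actual code, and `_QuantSparseMoe` refers to the wrapper class this PR adds.

```python
from modelopt.torch.quantization.nn import QuantModuleRegistry

def register_minimax_m2_moe_on_the_fly(model):
    # The MoE block class comes from the remote model card, so discover it from the
    # instantiated model instead of importing it from transformers.
    moe_cls = type(model.model.layers[0].block_sparse_moe)  # hypothetical attribute path
    if QuantModuleRegistry.get(moe_cls) is None:
        # _QuantSparseMoe is the quantized MoE wrapper added by this PR (see walkthrough above).
        QuantModuleRegistry.register({moe_cls: moe_cls.__name__})(_QuantSparseMoe)
```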
Code Review Summary

Thanks for adding MiniMax M2.1 support! This is a solid PR that extends the FP8 quantization capabilities. Here are my findings:

🚨 Critical Issue

Bug in the MoE forward override: as CodeRabbit already identified, there is a stray unconditional `self.top_k = original_top_k` assignment after the version-gated branch.
📝 Additional Suggestions
✅ Positive Notes
Please address the critical bug before merging. Let me know if you need any clarification!
What does this PR do?
Type of change: new feature

Overview: Support loading the MiniMax M2.1 (FP8) checkpoint for PTQ.
Usage
```sh
scripts/huggingface_example.sh --model --quant nvfp4 --trust_remote_code
```
Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit
New Features
Improvements
Documentation