
Support MiniMax M2.1 (FP8 checkpoint)#817

Open
cjluo-nv wants to merge 2 commits into main from chenjiel/support_minimax

Conversation

cjluo-nv (Collaborator) commented Jan 25, 2026

What does this PR do?

Type of change: New feature

Overview: Support loading the MiniMax M2.1 (FP8) checkpoint for PTQ.

Usage

scripts/huggingface_example.sh --model --quant nvfp4 --trust_remote_code

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information

Summary by CodeRabbit

  • New Features

    • Added MiniMax M2.1 model quantization support with nvfp4 format.
    • Extended FP8 quantization capabilities with configurable dtype parameter for enhanced precision control.
  • Improvements

    • Enhanced detection of quantized linear module variants.
    • Improved weight unpacking for FP8-based linear modules.
  • Documentation

    • Updated supported models table to include MiniMax M2.1.

copy-pr-bot bot commented Jan 25, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai bot commented Jan 25, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Adds FP8 quantization support for the MiniMax M2.1 model. Introduces a _QuantFP8Linear class in the HuggingFace plugin, registers MoE modules for the MiniMax M2 architecture, adds transformers version gating, updates export utilities to handle FP8 modules, and adds a dtype parameter to the weight dequantization kernel signature.

Changes

Documentation & Examples (CHANGELOG.rst, examples/llm_ptq/README.md):
Added a changelog entry and a model support table entry for MiniMax M2.1 quantization with nvfp4 support.

Import Path Updates (examples/deepseek/ptq.py):
Updated the weight_dequant import from ds_kernel to modelopt.torch.quantization.triton.fp8_kernel.

Export Utilities (modelopt/torch/export/layer_utils.py, modelopt/torch/export/unified_export_hf.py):
Enhanced is_quantlinear to detect QuantFP8Linear and exclude lora/ds_kernel modules; extended the weight unpacking condition to FP8 modules with element size ≤ 1 byte.

FP8 Kernel Implementation (modelopt/torch/quantization/triton/fp8_kernel.py):
Simplified the license header, updated documentation, and added a dtype parameter with a default value to the weight_dequant function.

HuggingFace Plugin Infrastructure (modelopt/torch/quantization/plugins/huggingface.py):
Introduced the _QuantFP8Linear class for FP8 weight quantization; added register_minimax_m2_moe_on_the_fly(); implemented transformers version gating (TRANSFORMERS_VERSION_GE_5_0) for KV attention paths; registered FP8Linear in QuantModuleRegistry.
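
As a point of reference, a gate like TRANSFORMERS_VERSION_GE_5_0 is typically a module-level flag; the following is a minimal sketch under that assumption, not the exact expression used in huggingface.py:

import transformers
from packaging import version

# Hedged sketch: a module-level flag for gating transformers>=5.0 code paths,
# as the walkthrough describes. The real expression in this PR may differ.
TRANSFORMERS_VERSION_GE_5_0 = version.parse(transformers.__version__) >= version.parse("5.0")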

Sequence Diagram(s)

sequenceDiagram
    participant Model as MiniMax M2.1 Model
    participant HFPlugin as HuggingFace Plugin
    participant Registry as QuantModuleRegistry
    participant ExportUtil as Export Utilities
    participant Kernel as FP8 Kernel

    Model->>HFPlugin: Load model (MiniMaxM2ForCausalLM)
    HFPlugin->>HFPlugin: register_minimax_m2_moe_on_the_fly()
    HFPlugin->>Registry: Register _QuantSparseMoe
    HFPlugin->>Registry: Register _QuantFP8Linear
    
    ExportUtil->>Model: Scan modules
    ExportUtil->>ExportUtil: is_quantlinear() checks for QuantFP8Linear
    alt QuantFP8Linear detected
        ExportUtil->>ExportUtil: Check element_size <= 1
        ExportUtil->>Kernel: Call weight_dequant with dtype
        Kernel->>Kernel: Dequantize weights using scaling factors
        Kernel-->>ExportUtil: Return dequantized tensor
    end
    
    ExportUtil-->>Model: Export with unpacked FP8 weights
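
As background for the "Check element_size <= 1" step above: FP8 tensors store one byte per element, so the byte size alone distinguishes packed FP8 weights from higher-precision ones. A small illustration follows; the helper name is hypothetical, not the repo's code.

import torch

def looks_like_fp8_weight(weight: torch.Tensor) -> bool:
    """Hypothetical helper: element_size() <= 1 separates packed FP8 weights
    from bf16/fp16/fp32 weights, which use 2 or 4 bytes per element."""
    return weight.element_size() <= 1

# element_size() is 1 for float8_e4m3fn, 2 for bfloat16, 4 for float32.
assert looks_like_fp8_weight(torch.zeros(2, 2, dtype=torch.float8_e4m3fn))
assert not looks_like_fp8_weight(torch.zeros(2, 2, dtype=torch.bfloat16))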

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks (3 passed)

Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
Title Check: ✅ Passed. The title 'Support MiniMax M2.1 (FP8 checkpoint)' directly and clearly summarizes the main objective of the pull request: adding support for the MiniMax M2.1 model with FP8 checkpoint quantization.
Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.



@cjluo-nv cjluo-nv marked this pull request as ready for review January 28, 2026 18:09
@cjluo-nv cjluo-nv requested review from a team as code owners January 28, 2026 18:09

coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@CHANGELOG.rst`:
- Around line 13-15: The changelog entry duplicates the word "support" on the
MiniMax line; edit the sentence that currently reads "Add support for MiniMax
M2.1 model quantization support for the original FP8 checkpoint." (the MiniMax
M2.1 line) to remove the extra "support", e.g. "Add MiniMax M2.1 model
quantization for the original FP8 checkpoint."

In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 746-754: unpack_weight currently calls weight_dequant without
specifying dtype, which can mismatch forward (which uses dtype=input.dtype); fix
by storing the target dtype during _setup (e.g., save self._orig_dtype or
self.target_dtype) or by passing an explicit dtype parameter into unpack_weight,
then call weight_dequant(weight, scale_inv, self.block_size,
dtype=self._orig_dtype) so unpacked weights match forward; update any callers
and remove weight_scale_inv as before. Reference: unpack_weight, forward,
weight_dequant, and _setup.
- Around line 454-473: In the forward override where TRANSFORMERS_VERSION_GE_5_0
is checked, remove the stray unconditional assignment "self.top_k =
original_top_k" that appears after the version-gated if/else; that assignment is
incorrect for the TRANSFORMERS_VERSION_GE_5_0 branch (original_top_k references
self.gate.topk) and the two branches already restore their respective
attributes, so deleting this line in the forward method (the block manipulating
self.gate.topk and self.top_k) fixes the bug without changing branch-specific
restoration logic.

CHANGELOG.rst Outdated
Comment on lines 13 to 15
- Add standalone type inference option (``--use_standalone_type_inference``) in ONNX AutoCast as an alternative to ONNX's ``infer_shapes``. This experimental feature performs type-only inference without shape inference, useful as a workaround when shape inference fails or to avoid unnecessary shape inference overhead.
- Add support for Kimi K2 Thinking model quantization from the original int4 checkpoint.
- Add support for MiniMax M2.1 model quantization support for the original FP8 checkpoint.

⚠️ Potential issue | 🟡 Minor

Fix duplicated wording in the changelog entry.

Line 15 repeats “support”. Consider the tweak below for clarity.

📝 Proposed fix
-- Add support for MiniMax M2.1 model quantization support for the original FP8 checkpoint.
+- Add support for MiniMax M2.1 model quantization for the original FP8 checkpoint.

Comment on lines 746 to 754 of modelopt/torch/quantization/plugins/huggingface.py
def unpack_weight(self):
    with torch.cuda.device(self.weight.device):
        weight, scale_inv = self._get_weight_and_scale_inv()
        self.weight = nn.Parameter(
            weight_dequant(weight, scale_inv, self.block_size),
            requires_grad=False,
        )
        if hasattr(self, "weight_scale_inv"):
            del self.weight_scale_inv

⚠️ Potential issue | 🟡 Minor

Consider specifying dtype in unpack_weight for consistency.

In forward(), weight_dequant is called with dtype=input.dtype, preserving the input's precision. However, unpack_weight() omits dtype, defaulting to torch.get_default_dtype() (typically float32).

If this is intentional for export, a comment would clarify the design. Otherwise, consider accepting/storing a target dtype to ensure consistency.

💡 Suggested improvement
     def unpack_weight(self):
         with torch.cuda.device(self.weight.device):
             weight, scale_inv = self._get_weight_and_scale_inv()
             self.weight = nn.Parameter(
-                weight_dequant(weight, scale_inv, self.block_size),
+                weight_dequant(weight, scale_inv, self.block_size, dtype=torch.bfloat16),
                 requires_grad=False,
             )

Or store the original dtype during _setup and use it here.
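
A minimal sketch of that suggestion, written as a free function for illustration; unpack_fp8_weight and target_dtype are hypothetical names, while the weight_dequant import path and signature follow this PR:

import torch
from torch import nn

from modelopt.torch.quantization.triton.fp8_kernel import weight_dequant  # import path per this PR

def unpack_fp8_weight(module: nn.Module, target_dtype: torch.dtype = torch.bfloat16) -> None:
    """Hypothetical helper: dequantize the FP8 weight into an explicit dtype,
    e.g. one recorded during _setup, instead of the kernel's default."""
    with torch.cuda.device(module.weight.device):
        weight, scale_inv = module._get_weight_and_scale_inv()  # same helper the PR code calls
        module.weight = nn.Parameter(
            weight_dequant(weight, scale_inv, module.block_size, dtype=target_dtype),
            requires_grad=False,
        )
        if hasattr(module, "weight_scale_inv"):
            del module.weight_scale_inv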


@cjluo-nv cjluo-nv force-pushed the chenjiel/support_minimax branch from 779621e to c7628f3 Compare February 12, 2026 19:57
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@cjluo-nv cjluo-nv force-pushed the chenjiel/support_minimax branch from c7628f3 to 0bb9291 Compare February 12, 2026 19:59

codecov bot commented Feb 12, 2026

Codecov Report

❌ Patch coverage is 50.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 73.68%. Comparing base (3e95d9f) to head (3b95a19).
⚠️ Report is 5 commits behind head on main.

Files with missing lines: modelopt/torch/quantization/triton/fp8_kernel.py (patch 50.00%, 1 line missing ⚠️)
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #817      +/-   ##
==========================================
- Coverage   73.73%   73.68%   -0.05%     
==========================================
  Files         199      200       +1     
  Lines       21165    21187      +22     
==========================================
+ Hits        15606    15612       +6     
- Misses       5559     5575      +16     

☔ View full report in Codecov by Sentry.


meenchen left a comment

LGTM

def register_minimax_m2_moe_on_the_fly(model):
    """Register MiniMax M2 MoE modules as a QUANT_MODULE.

    MiniMax M2 MoE modules are defined in the model card, so we need to register them on the fly.

Does the latest HF transformers not support MiniMax M2 MoE?

cjluo-nv (Collaborator, Author) replied:

It requires transformers 5.0.x.
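
For readers unfamiliar with the on-the-fly pattern, a rough sketch of what such a registration can look like; the registry import path, its get/register calls, and the MoE class name are assumptions for illustration, not the PR's exact code:

from modelopt.torch.quantization.nn import QuantModuleRegistry  # assumed import path

def register_moe_on_the_fly(model, quant_moe_cls):
    """Hypothetical outline: find the remote-code MoE class on the loaded model
    and register a quantized wrapper for it."""
    for module in model.modules():
        cls = type(module)
        if cls.__name__ == "MiniMaxM2SparseMoeBlock" and QuantModuleRegistry.get(cls) is None:
            QuantModuleRegistry.register({cls: cls.__name__})(quant_moe_cls)
            return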

modelopt-bot commented

Code Review Summary

Thanks for adding MiniMax M2.1 support! This is a solid PR that extends the FP8 quantization capabilities. Here are my findings:

🚨 Critical Issue

Bug in modelopt/torch/quantization/plugins/huggingface.py (lines 454-473)

As CodeRabbit already identified, there is a stray self.top_k = original_top_k assignment on line 472 that executes unconditionally after the version-gated block. This is problematic because:

  • In the TRANSFORMERS_VERSION_GE_5_0 branch, original_top_k holds self.gate.topk, not self.top_k
  • This line should be removed since each branch already handles its own restoration
  • Please remove line 472: self.top_k = original_top_k (see the sketch below)
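
To make the intended control flow concrete, a hypothetical, self-contained illustration of the save/modify/restore pattern described above; the attribute names follow the review comments, everything else (function name, SimpleNamespace toy) is illustrative only:

from types import SimpleNamespace

def run_moe_forward(module, transformers_ge_5_0: bool):
    """Hypothetical illustration of the intended control flow; not the repo's
    _QuantSparseMoe.forward."""
    if transformers_ge_5_0:
        original_top_k = module.gate.topk  # transformers >= 5.0: top-k lives on the gate
        # ... version-specific forward logic would go here ...
        module.gate.topk = original_top_k  # this branch restores its own attribute
    else:
        original_top_k = module.top_k  # older transformers: top-k lives on the module
        # ... forward logic ...
        module.top_k = original_top_k
    # No unconditional `module.top_k = original_top_k` after the if/else: that
    # trailing line is the stray assignment the review asks to delete.

# Toy usage: either path leaves the routing attributes unchanged afterwards.
moe = SimpleNamespace(top_k=2, gate=SimpleNamespace(topk=2))
run_moe_forward(moe, transformers_ge_5_0=True)
run_moe_forward(moe, transformers_ge_5_0=False)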

📝 Additional Suggestions

  1. Import organization in huggingface.py: The import transformers and from packaging import version imports were moved up, which is good, but consider keeping standard library imports (packaging) separate from third-party imports (transformers) per PEP 8.

  2. Docstring completeness: The _QuantFP8Linear class could benefit from a class-level docstring explaining its purpose and usage.

  3. Error handling: In _QuantFP8Linear._setup(), consider adding type hints and validation for the block_size assertion to provide clearer error messages.

  4. FP8 kernel dtype handling: The new dtype parameter in weight_dequant() is a good addition for flexibility. 👍

✅ Positive Notes

  • Good cleanup of the duplicate license header in fp8_kernel.py
  • Proper handling of DTensor for distributed training scenarios
  • Clean refactoring moving weight_dequant to a reusable location
  • Adding MiniMax to the model type detection and documentation

Please address the critical bug before merging. Let me know if you need any clarification!

Signed-off-by: Chenjie Luo <108829653+cjluo-nv@users.noreply.github.com>
@cjluo-nv cjluo-nv enabled auto-merge (squash) February 14, 2026 08:11