
Support MiniMax M2.1 (FP8 checkpoint)#817

Open
cjluo-nv wants to merge 2 commits into main from chenjiel/support_minimax

Conversation

cjluo-nv (Collaborator) commented Jan 25, 2026

What does this PR do?

Type of change: New feature

Overview: Support loading the MiniMax M2.1 (FP8) checkpoint for PTQ.

Usage

scripts/huggingface_example.sh --model --quant nvfp4 --trust_remote_code

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information

Summary by CodeRabbit

  • New Features

    • Added MiniMax M2.1 model quantization support with nvfp4 format.
    • Extended FP8 quantization capabilities with configurable dtype parameter for enhanced precision control.
  • Improvements

    • Enhanced detection of quantized linear module variants.
    • Improved weight unpacking for FP8-based linear modules.
  • Documentation

    • Updated supported models table to include MiniMax M2.1.

copy-pr-bot bot commented Jan 25, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai bot commented Jan 25, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Adds FP8 quantization support for the MiniMax M2.1 model. Introduces a _QuantFP8Linear class in the HuggingFace plugin, registers MoE modules for the MiniMax M2 architecture, adds transformers version gating, updates export utilities to handle FP8 modules, and adds a dtype parameter to the weight dequantization kernel signature.

Changes

Documentation & Examples (CHANGELOG.rst, examples/llm_ptq/README.md):
Added a changelog entry and a model support table entry for MiniMax M2.1 quantization with nvfp4 support.

Import Path Updates (examples/deepseek/ptq.py):
Updated the weight_dequant import from ds_kernel to modelopt.torch.quantization.triton.fp8_kernel.

Export Utilities (modelopt/torch/export/layer_utils.py, modelopt/torch/export/unified_export_hf.py):
Enhanced is_quantlinear to detect QuantFP8Linear and exclude lora/ds_kernel modules; extended the weight unpacking condition to FP8 modules with element size ≤ 1 byte.

FP8 Kernel Implementation (modelopt/torch/quantization/triton/fp8_kernel.py):
Simplified the license header, updated documentation, and added a dtype parameter with a default value to the weight_dequant function.

HuggingFace Plugin Infrastructure (modelopt/torch/quantization/plugins/huggingface.py):
Introduced the _QuantFP8Linear class for FP8 weight quantization; added register_minimax_m2_moe_on_the_fly(); implemented transformers version gating (TRANSFORMERS_VERSION_GE_5_0) for KV attention paths; registered FP8Linear in QuantModuleRegistry.
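
As a point of reference, a gate like TRANSFORMERS_VERSION_GE_5_0 is typically a module-level flag; the following is a minimal sketch under that assumption, not the exact expression used in huggingface.py:

import transformers
from packaging import version

# Hedged sketch: a module-level flag for gating transformers>=5.0 code paths,
# as the walkthrough describes. The real expression in this PR may differ.
TRANSFORMERS_VERSION_GE_5_0 = version.parse(transformers.__version__) >= version.parse("5.0")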

Sequence Diagram(s)

sequenceDiagram
    participant Model as MiniMax M2.1 Model
    participant HFPlugin as HuggingFace Plugin
    participant Registry as QuantModuleRegistry
    participant ExportUtil as Export Utilities
    participant Kernel as FP8 Kernel

    Model->>HFPlugin: Load model (MiniMaxM2ForCausalLM)
    HFPlugin->>HFPlugin: register_minimax_m2_moe_on_the_fly()
    HFPlugin->>Registry: Register _QuantSparseMoe
    HFPlugin->>Registry: Register _QuantFP8Linear
    
    ExportUtil->>Model: Scan modules
    ExportUtil->>ExportUtil: is_quantlinear() checks for QuantFP8Linear
    alt QuantFP8Linear detected
        ExportUtil->>ExportUtil: Check element_size <= 1
        ExportUtil->>Kernel: Call weight_dequant with dtype
        Kernel->>Kernel: Dequantize weights using scaling factors
        Kernel-->>ExportUtil: Return dequantized tensor
    end
    
    ExportUtil-->>Model: Export with unpacked FP8 weights
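
As background for the "Check element_size <= 1" step above: FP8 tensors store one byte per element, so the byte size alone distinguishes packed FP8 weights from higher-precision ones. A small illustration follows; the helper name is hypothetical, not the repo's code.

import torch

def looks_like_fp8_weight(weight: torch.Tensor) -> bool:
    """Hypothetical helper: element_size() <= 1 separates packed FP8 weights
    from bf16/fp16/fp32 weights, which use 2 or 4 bytes per element."""
    return weight.element_size() <= 1

# element_size() is 1 for float8_e4m3fn, 2 for bfloat16, 4 for float32.
assert looks_like_fp8_weight(torch.zeros(2, 2, dtype=torch.float8_e4m3fn))
assert not looks_like_fp8_weight(torch.zeros(2, 2, dtype=torch.bfloat16))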

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks (3 passed)

Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
Title Check: ✅ Passed. The title 'Support MiniMax M2.1 (FP8 checkpoint)' directly and clearly summarizes the main objective of the pull request: adding support for the MiniMax M2.1 model with FP8 checkpoint quantization.
Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.



@cjluo-nv cjluo-nv marked this pull request as ready for review January 28, 2026 18:09
@cjluo-nv cjluo-nv requested review from a team as code owners January 28, 2026 18:09

coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@CHANGELOG.rst`:
- Around line 13-15: The changelog entry duplicates the word "support" on the
MiniMax line; edit the sentence that currently reads "Add support for MiniMax
M2.1 model quantization support for the original FP8 checkpoint." (the MiniMax
M2.1 line) to remove the extra "support", e.g. "Add MiniMax M2.1 model
quantization for the original FP8 checkpoint."

In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 746-754: unpack_weight currently calls weight_dequant without
specifying dtype, which can mismatch forward (which uses dtype=input.dtype); fix
by storing the target dtype during _setup (e.g., save self._orig_dtype or
self.target_dtype) or by passing an explicit dtype parameter into unpack_weight,
then call weight_dequant(weight, scale_inv, self.block_size,
dtype=self._orig_dtype) so unpacked weights match forward; update any callers
and remove weight_scale_inv as before. Reference: unpack_weight, forward,
weight_dequant, and _setup.
- Around line 454-473: In the forward override where TRANSFORMERS_VERSION_GE_5_0
is checked, remove the stray unconditional assignment "self.top_k =
original_top_k" that appears after the version-gated if/else; that assignment is
incorrect for the TRANSFORMERS_VERSION_GE_5_0 branch (original_top_k references
self.gate.topk) and the two branches already restore their respective
attributes, so deleting this line in the forward method (the block manipulating
self.gate.topk and self.top_k) fixes the bug without changing branch-specific
restoration logic.

CHANGELOG.rst Outdated
Comment on lines 13 to 15
- Add standalone type inference option (``--use_standalone_type_inference``) in ONNX AutoCast as an alternative to ONNX's ``infer_shapes``. This experimental feature performs type-only inference without shape inference, useful as a workaround when shape inference fails or to avoid unnecessary shape inference overhead.
- Add support for Kimi K2 Thinking model quantization from the original int4 checkpoint.
- Add support for MiniMax M2.1 model quantization support for the original FP8 checkpoint.

⚠️ Potential issue | 🟡 Minor

Fix duplicated wording in the changelog entry.

Line 15 repeats “support”. Consider the tweak below for clarity.

📝 Proposed fix
-- Add support for MiniMax M2.1 model quantization support for the original FP8 checkpoint.
+- Add support for MiniMax M2.1 model quantization for the original FP8 checkpoint.

Comment on lines 746 to 754 of modelopt/torch/quantization/plugins/huggingface.py
def unpack_weight(self):
    with torch.cuda.device(self.weight.device):
        weight, scale_inv = self._get_weight_and_scale_inv()
        self.weight = nn.Parameter(
            weight_dequant(weight, scale_inv, self.block_size),
            requires_grad=False,
        )
        if hasattr(self, "weight_scale_inv"):
            del self.weight_scale_inv

⚠️ Potential issue | 🟡 Minor

Consider specifying dtype in unpack_weight for consistency.

In forward(), weight_dequant is called with dtype=input.dtype, preserving the input's precision. However, unpack_weight() omits dtype, defaulting to torch.get_default_dtype() (typically float32).

If this is intentional for export, a comment would clarify the design. Otherwise, consider accepting/storing a target dtype to ensure consistency.

💡 Suggested improvement
     def unpack_weight(self):
         with torch.cuda.device(self.weight.device):
             weight, scale_inv = self._get_weight_and_scale_inv()
             self.weight = nn.Parameter(
-                weight_dequant(weight, scale_inv, self.block_size),
+                weight_dequant(weight, scale_inv, self.block_size, dtype=torch.bfloat16),
                 requires_grad=False,
             )

Or store the original dtype during _setup and use it here.
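
A minimal sketch of that suggestion, written as a free function for illustration; unpack_fp8_weight and target_dtype are hypothetical names, while the weight_dequant import path and signature follow this PR:

import torch
from torch import nn

from modelopt.torch.quantization.triton.fp8_kernel import weight_dequant  # import path per this PR

def unpack_fp8_weight(module: nn.Module, target_dtype: torch.dtype = torch.bfloat16) -> None:
    """Hypothetical helper: dequantize the FP8 weight into an explicit dtype,
    e.g. one recorded during _setup, instead of the kernel's default."""
    with torch.cuda.device(module.weight.device):
        weight, scale_inv = module._get_weight_and_scale_inv()  # same helper the PR code calls
        module.weight = nn.Parameter(
            weight_dequant(weight, scale_inv, module.block_size, dtype=target_dtype),
            requires_grad=False,
        )
        if hasattr(module, "weight_scale_inv"):
            del module.weight_scale_inv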


@cjluo-nv cjluo-nv force-pushed the chenjiel/support_minimax branch from 779621e to c7628f3 Compare February 12, 2026 19:57
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@cjluo-nv cjluo-nv force-pushed the chenjiel/support_minimax branch from c7628f3 to 0bb9291 Compare February 12, 2026 19:59

codecov bot commented Feb 12, 2026

Codecov Report

❌ Patch coverage is 50.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 73.68%. Comparing base (3e95d9f) to head (3b95a19).
⚠️ Report is 5 commits behind head on main.

Files with missing lines: modelopt/torch/quantization/triton/fp8_kernel.py (patch 50.00%, 1 line missing ⚠️)
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #817      +/-   ##
==========================================
- Coverage   73.73%   73.68%   -0.05%     
==========================================
  Files         199      200       +1     
  Lines       21165    21187      +22     
==========================================
+ Hits        15606    15612       +6     
- Misses       5559     5575      +16     

☔ View full report in Codecov by Sentry.


meenchen left a comment

LGTM

def register_minimax_m2_moe_on_the_fly(model):
    """Register MiniMax M2 MoE modules as a QUANT_MODULE.

    MiniMax M2 MoE modules are defined in the model card, so we need to register them on the fly.

Does the latest HF transformers not support MiniMax M2 MoE?

cjluo-nv (Collaborator, Author) replied:

It requires transformers 5.0.x.
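
For readers unfamiliar with the on-the-fly pattern, a rough sketch of what such a registration can look like; the registry import path, its get/register calls, and the MoE class name are assumptions for illustration, not the PR's exact code:

from modelopt.torch.quantization.nn import QuantModuleRegistry  # assumed import path

def register_moe_on_the_fly(model, quant_moe_cls):
    """Hypothetical outline: find the remote-code MoE class on the loaded model
    and register a quantized wrapper for it."""
    for module in model.modules():
        cls = type(module)
        if cls.__name__ == "MiniMaxM2SparseMoeBlock" and QuantModuleRegistry.get(cls) is None:
            QuantModuleRegistry.register({cls: cls.__name__})(quant_moe_cls)
            return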

modelopt-bot commented

Code Review Summary

Thanks for adding MiniMax M2.1 support! This is a solid PR that extends the FP8 quantization capabilities. Here are my findings:

🚨 Critical Issue

Bug in modelopt/torch/quantization/plugins/huggingface.py (lines 454-473)

As CodeRabbit already identified, there is a stray self.top_k = original_top_k assignment on line 472 that executes unconditionally after the version-gated block. This is problematic because:

  • In the TRANSFORMERS_VERSION_GE_5_0 branch, original_top_k holds self.gate.topk, not self.top_k
  • This line should be removed since each branch already handles its own restoration
  • Please remove line 472: self.top_k = original_top_k (see the sketch below)
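
To make the intended control flow concrete, a hypothetical, self-contained illustration of the save/modify/restore pattern described above; the attribute names follow the review comments, everything else (function name, SimpleNamespace toy) is illustrative only:

from types import SimpleNamespace

def run_moe_forward(module, transformers_ge_5_0: bool):
    """Hypothetical illustration of the intended control flow; not the repo's
    _QuantSparseMoe.forward."""
    if transformers_ge_5_0:
        original_top_k = module.gate.topk  # transformers >= 5.0: top-k lives on the gate
        # ... version-specific forward logic would go here ...
        module.gate.topk = original_top_k  # this branch restores its own attribute
    else:
        original_top_k = module.top_k  # older transformers: top-k lives on the module
        # ... forward logic ...
        module.top_k = original_top_k
    # No unconditional `module.top_k = original_top_k` after the if/else: that
    # trailing line is the stray assignment the review asks to delete.

# Toy usage: either path leaves the routing attributes unchanged afterwards.
moe = SimpleNamespace(top_k=2, gate=SimpleNamespace(topk=2))
run_moe_forward(moe, transformers_ge_5_0=True)
run_moe_forward(moe, transformers_ge_5_0=False)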

📝 Additional Suggestions

  1. Import organization in huggingface.py: The import transformers and from packaging import version imports were moved up, which is good, but consider keeping standard library imports (packaging) separate from third-party imports (transformers) per PEP 8.

  2. Docstring completeness: The _QuantFP8Linear class could benefit from a class-level docstring explaining its purpose and usage.

  3. Error handling: In _QuantFP8Linear._setup(), consider adding type hints and validation for the block_size assertion to provide clearer error messages.

  4. FP8 kernel dtype handling: The new dtype parameter in weight_dequant() is a good addition for flexibility. 👍

✅ Positive Notes

  • Good cleanup of the duplicate license header in fp8_kernel.py
  • Proper handling of DTensor for distributed training scenarios
  • Clean refactoring moving weight_dequant to a reusable location
  • Adding MiniMax to the model type detection and documentation

Please address the critical bug before merging. Let me know if you need any clarification!

Signed-off-by: Chenjie Luo <108829653+cjluo-nv@users.noreply.github.com>
@cjluo-nv cjluo-nv enabled auto-merge (squash) February 14, 2026 08:11