
Conversation

@yueshen2016 yueshen2016 commented Feb 10, 2026

What does this PR do?

Type of change: Bug fix

Overview: When using the TE spec (Transformer Engine layer specification), input_layernorm and pre_mlp_layernorm are fused into TELayerNormColumnParallelLinear, leaving the standalone layernorm modules as IdentityOp. The HF export (export_mcore_gpt_to_hf) skips IdentityOp modules, so these layernorm weights were silently dropped from the exported HF checkpoint, producing an incomplete checkpoint that is missing input_layernorm.weight and post_attention_layernorm.weight for every layer.

This PR adds _FusedLayerNormProxy, a lightweight nn.Module that extracts the fused layer_norm_weight (and optional layer_norm_bias) from the TE linear module and exposes it as a standard .weight attribute. This allows the existing export rules to produce the correct HF key names without any changes to the rule mappings. The fix is applied to both transformer layer and Mamba layer export paths.
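
For illustration, a minimal sketch of what such a proxy can look like (the real implementation lives in modelopt/torch/export/unified_export_megatron.py; the only TE-specific assumption here is the layer_norm_weight / layer_norm_bias attribute names on TELayerNormColumnParallelLinear):

import torch.nn as nn

class _FusedLayerNormProxy(nn.Module):
    # Illustrative sketch: expose a TE-fused layernorm as a standalone module
    # so the export rules can read a plain .weight (and optional .bias).
    def __init__(self, te_linear: nn.Module):
        super().__init__()
        # TELayerNormColumnParallelLinear stores the fused norm parameters as
        # .layer_norm_weight and (optionally) .layer_norm_bias.
        self.weight = te_linear.layer_norm_weight
        layer_norm_bias = getattr(te_linear, "layer_norm_bias", None)
        if layer_norm_bias is not None:
            self.bias = layer_norm_bias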

Usage

No API changes. The fix is transparent — export_mcore_gpt_to_hf now correctly exports layernorm weights regardless of whether the model uses TE spec or local spec.

# Same usage as before — no changes needed
import modelopt.torch.export as mtex
mtex.export_mcore_gpt_to_hf(model, hf_model_id, export_dir="./hf_export")

Testing

  • Quantized Llama-3.2-1B with TE spec via Megatron-Bridge quantize.py
  • Exported to HF format via export.py
  • Verified the exported safetensors contain input_layernorm.weight and post_attention_layernorm.weight for all layers (see the check sketched after this list)
  • Ran end-to-end inference with vLLM (vllm/vllm-openai:latest, quantization="modelopt") — model loaded successfully and produced coherent text
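
For reference, a key check along these lines confirms the fix (a sketch only: the single-shard file name model.safetensors under ./hf_export and the 16-layer count for Llama-3.2-1B are assumptions about the exported checkpoint):

from safetensors import safe_open

with safe_open("./hf_export/model.safetensors", framework="pt") as f:
    keys = set(f.keys())

# Llama-3.2-1B has 16 decoder layers; adjust the range for other models.
missing = [
    key
    for i in range(16)
    for key in (
        f"model.layers.{i}.input_layernorm.weight",
        f"model.layers.{i}.post_attention_layernorm.weight",
    )
    if key not in keys
]
print("missing layernorm keys:", missing or "none")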

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes — local spec models are unaffected (the elif branches only activate when layernorm is IdentityOp and fused weights exist)
  • Did you write any new necessary tests?: No
  • Did you add or update any necessary documentation?: No
  • Did you update Changelog?: No

Additional Information

The root cause is that TE spec fuses layernorm into the subsequent linear layer (TELayerNormColumnParallelLinear), storing the norm weight as .layer_norm_weight on the linear module instead of as .weight on a standalone norm module. The export walker checks isinstance(layer.input_layernorm, IdentityOp) and skips it, but never looks for the fused weight on the linear module. This fix bridges that gap with a zero-overhead proxy.
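
Conceptually, the transformer-layer export path gains a branch along these lines (a sketch that reuses the proxy above; the helper name and the exact attribute access are illustrative, not the code in unified_export_megatron.py):

from megatron.core.transformer.identity_op import IdentityOp

def _resolve_input_layernorm(layer):
    # Local spec: the standalone norm module already carries .weight.
    if not isinstance(layer.input_layernorm, IdentityOp):
        return layer.input_layernorm
    # TE spec: the norm is fused into TELayerNormColumnParallelLinear, which
    # stores it as .layer_norm_weight on the QKV projection.
    linear_qkv = layer.self_attention.linear_qkv
    if getattr(linear_qkv, "layer_norm_weight", None) is not None:
        return _FusedLayerNormProxy(linear_qkv)
    return None

The pre-MLP norm and the Mamba mixer norm follow the same pattern against their respective fused modules.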

Summary by CodeRabbit

  • Bug Fixes
    • Improved export support for transformer and Mamba-based models with fused layer normalization components, ensuring proper tensor mapping and export compatibility.

@yueshen2016 yueshen2016 requested a review from a team as a code owner February 10, 2026 19:16
@yueshen2016 yueshen2016 requested review from kevalmorabia97 and meenchen and removed request for meenchen February 10, 2026 19:16

coderabbitai bot commented Feb 10, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

📝 Walkthrough

This change adds an internal proxy class to support exporting fused layernorm weights in Transformer Engine (TE)-based transformer models. The proxy exposes the fused layernorm weights during state-dict export by wrapping the TE linear modules, handling fused input, pre-MLP, and Mamba mixer norms without modifying any public APIs.

Changes

Fused Layernorm Export Support (modelopt/torch/export/unified_export_megatron.py):
Added the _FusedLayerNormProxy internal class to expose fused layernorm weights from TE modules. Extended _get_transformer_layer_state_dict and _get_mamba_layer_state_dict to detect fused layernorms (input, pre-MLP, and mixer norms) and wrap them with the proxy for proper export mapping.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks: 2 passed, 1 failed

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check ✅ Passed: The title clearly and specifically describes the main change: fixing export of fused layernorm weights for the Transformer Engine (TE) spec, which is the core issue addressed in this PR.

codecov bot commented Feb 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.44%. Comparing base (5e43b2a) to head (c828300).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #876   +/-   ##
=======================================
  Coverage   73.44%   73.44%           
=======================================
  Files         197      197           
  Lines       20657    20657           
=======================================
  Hits        15172    15172           
  Misses       5485     5485           


Signed-off-by: James Shen <yueshen@nvidia.com>
@yueshen2016 yueshen2016 force-pushed the yueshen/export-fused-layernorm-te-spec branch from ec257b8 to ff4332f on February 10, 2026 19:45
Signed-off-by: James Shen <yueshen@nvidia.com>

Review thread (diff context):

has_mcore = True

Qwen3VLModel = None
try:

Contributor commented:

Does Qwen3VLModel exist in mcore? The other imports were from mcore. Just wondering if this will cause a circular dependency, since MBridge also depends on modelopt.

Review thread (diff context):

)
else:
# For llama 4, in hf unified checkpoint, all local experts share one scale
self.rules["local_experts.linear_fc1"](

Contributor commented:

Does non-grouped MLP export still work even if you remove these lines?

