[OMNIML-3232] Support full TE spec for NemotronH HF-to-Megatron import #884
yueshen2016 merged 1 commit into main
Conversation
Walkthrough: The changes update model import/export handling to support fused layer normalization in transformer-based models. Mapping configurations skip direct layer_norm_weight loading, while new logic routes these weights through dedicated fused_norm rules. Expert layer paths are reworked from MTP-based to backbone-based paths.
Codecov Report: ✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
## main #884 +/- ##
=======================================
Coverage 73.74% 73.74%
=======================================
Files 199 199
Lines 21163 21163
=======================================
Hits 15606 15606
Misses 5557 5557
Can we make sure the previous modelopt spec continues to work? Pruning doesn't support the full TE spec yet.
No concerns from the pruning-support POV, where we still need to use the older non-full-TE modelopt spec.
Yes, the previous spec still works.
Signed-off-by: James Shen <yueshen@nvidia.com>
45399c9 to 5ddcbe7
What does this PR do?
Type of change: new feature
Overview: Enable full TE spec support for NemotronH (Mamba hybrid) models during HF-to-Megatron weight import via `import_mcore_gpt_from_hf`.

Previously, importing HF weights into a Megatron model built with the full TE spec (`TELayerNormColumnParallelLinear`, `TEGroupedMLP`, etc.) failed for NemotronH models due to two issues:

1. Grouped expert prefix bug: The `experts.linear_fc1`/`fc2` import rules had a hard-coded `mtp.layers.{}` prefix, which was only correct for MTP layers. When regular decoder MoE layers use `TEGroupedMLP` (via the full TE spec), the importer generated incorrect HF keys (e.g., `mtp.layers.27.mixer.experts.0.up_proj.weight` instead of `backbone.layers.27.mixer.experts.0.up_proj.weight`).
2. Fused layer norm loading: In the full TE spec, layer norms are fused into `TELayerNormColumnParallelLinear` modules as `layer_norm_weight`. The importer's `_name_remapping` would crash trying to load `layer_norm_weight` from a non-existent HF path (e.g., `backbone.layers.X.mixer.in_proj.layer_norm_weight`), when the actual HF norm weight lives at `backbone.layers.X.norm.weight`.

The snippet below spells out both failure modes in concrete key terms.
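This is illustrative only; the key strings are the ones quoted above, and layer index 27 is just the example used there.

```python
layer = 27

# 1) Grouped expert prefix bug: the hard-coded "mtp.layers.{}" prefix produced
#    the wrong HF key for regular decoder MoE layers.
wrong_expert_key = f"mtp.layers.{layer}.mixer.experts.0.up_proj.weight"
right_expert_key = f"backbone.layers.{layer}.mixer.experts.0.up_proj.weight"

# 2) Fused layer norm: the importer looked up a key that does not exist in the
#    HF checkpoint; the norm weight actually lives under the layer's norm.
nonexistent_key = f"backbone.layers.{layer}.mixer.in_proj.layer_norm_weight"
actual_key = f"backbone.layers.{layer}.norm.weight"

print(wrong_expert_key, right_expert_key, nonexistent_key, actual_key, sep="\n")
```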
Changes

mcore_nemotron.py:
- Changed the grouped expert prefix from `mtp.layers.{}` to `backbone.layers.{}`. The `_grouped_mlp_merging` function already handles the `backbone` → `mtp` replacement when `is_mtp=True`, so both decoder and MTP layers work correctly.
- Added `mapping={"layer_norm_weight": None}` to the `in_proj` and `linear_fc1` rules to skip `layer_norm_weight` during `_name_remapping` (it is loaded separately via `fused_norm`).
- Added a `fused_norm` rule (`NameRemapping("backbone.layers.{}.norm.weight")`) to load HF norm weights into fused TE modules (see the rule sketch after this list).

megatron_importer.py:
- Added a `source_key is None` check in `_name_remapping` to skip keys mapped to `None` in the mapping dict (keeps the existing value instead of crashing on a missing HF key).
- `_import_mamba_layer`: after loading `in_proj`, loads `layer_norm_weight` from HF via the `fused_norm` rule when `layer.norm` is `IdentityOp`.
- `_import_transformer_layer`: loads `layer_norm_weight` into `linear_qkv` (when `input_layernorm` is `IdentityOp`) and into `linear_fc1` (when `pre_mlp_layernorm` is `IdentityOp`).
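A minimal sketch of the rule shapes described above. `NameRemapping`, the `mapping={...}` kwarg, and the `fused_norm` target path come from this PR; the stand-in class, the other dict keys, and the other HF prefixes are illustrative assumptions rather than the actual `mcore_nemotron.py` contents.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class NameRemapping:
    """Stand-in for modelopt's NameRemapping rule helper, only so this sketch
    is self-contained; the real helper lives in the modelopt mcore plugins."""

    target: str
    mapping: dict[str, Any] = field(default_factory=dict)


# Shape of the NemotronH import rules after this change. Only the fused_norm
# target path is quoted verbatim from the PR; the other HF prefixes are
# placeholders.
nemotron_h_import_rules = {
    # Fused TE modules: map layer_norm_weight to None so _name_remapping
    # skips it instead of looking up a non-existent HF key ...
    "in_proj": NameRemapping(
        "backbone.layers.{}.mixer.in_proj.", mapping={"layer_norm_weight": None}
    ),
    "linear_fc1": NameRemapping(
        "backbone.layers.{}.mixer.up_proj.", mapping={"layer_norm_weight": None}
    ),
    # ... and load the HF norm weight into the fused module via fused_norm.
    "fused_norm": NameRemapping("backbone.layers.{}.norm.weight"),
}
```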
Usage

The full TE spec is enabled via the `--full-te-spec` flag on the Megatron-LM side (separate PR). On the ModelOpt side, no user-facing changes are needed -- the import rules automatically handle both local spec and full TE spec models.
Testing

The existing non-full-TE path keeps working: the new behavior is guarded by `IdentityOp` checks, `hasattr` checks, and `"fused_norm" in self.rules` checks. A condensed sketch of these guards follows.
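This is not the actual `megatron_importer.py` code; it is a self-contained toy version of the `source_key is None` skip in `_name_remapping`, with the fused-norm guard shown as pseudocode comments. Function and variable names are illustrative.

```python
def remap_weights(hf_state_dict, module_state_dict, prefix, mapping=None):
    """Toy version of the importer's skip logic: keys mapped to None keep
    their current value instead of being looked up in the HF state dict."""
    mapping = mapping or {}
    remapped = {}
    for key, current_value in module_state_dict.items():
        source_key = mapping.get(key, key)
        if source_key is None:  # e.g. mapping={"layer_norm_weight": None}
            remapped[key] = current_value  # keep existing value, no HF lookup
            continue
        remapped[key] = hf_state_dict[prefix + source_key]
    return remapped


hf = {"backbone.layers.0.mixer.in_proj.weight": "W"}
module = {"weight": None, "layer_norm_weight": "existing"}
print(remap_weights(hf, module, "backbone.layers.0.mixer.in_proj.",
                    mapping={"layer_norm_weight": None}))
# -> {'weight': 'W', 'layer_norm_weight': 'existing'}

# The fused-norm load itself is guarded roughly like this (pseudocode,
# mirroring the _import_mamba_layer description above):
#   if isinstance(layer.norm, IdentityOp) and "fused_norm" in self.rules:
#       # load backbone.layers.<i>.norm.weight into in_proj.layer_norm_weight
#       ...
```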
Additional Information
Companion megatron-lm changes (separate PR):
- megatron/core/post_training/modelopt/mamba/model_specs.py: Added a `use_full_te_spec` parameter to return the canonical `mamba_stack_spec` from `mamba_layer_specs.py`.
- megatron/post_training/model_builder.py: Passes `use_full_te_spec=args.full_te_spec` to `get_mamba_stack_modelopt_spec`.
- megatron/post_training/arguments.py: Added the `--full-te-spec` CLI flag.
- examples/post_training/modelopt/convert_model.py: Skips the `moe_grouped_gemm=False` override when `--full-te-spec` is set.

A rough sketch of this wiring follows.
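The stub standing in for the real `get_mamba_stack_modelopt_spec` is only there to make the sketch runnable; the flag name, the `use_full_te_spec` parameter, and the `moe_grouped_gemm` override come from the list above, everything else is illustrative.

```python
import argparse


def get_mamba_stack_modelopt_spec(use_full_te_spec: bool = False) -> str:
    """Stub for the real Megatron-LM helper in model_specs.py; it would return
    the canonical mamba_stack_spec when use_full_te_spec is True."""
    return "canonical mamba_stack_spec" if use_full_te_spec else "local modelopt spec"


# arguments.py (sketch): the new CLI flag.
parser = argparse.ArgumentParser()
parser.add_argument("--full-te-spec", action="store_true", dest="full_te_spec")
args = parser.parse_args(["--full-te-spec"])

# model_builder.py (sketch): forward the flag to the spec builder.
spec = get_mamba_stack_modelopt_spec(use_full_te_spec=args.full_te_spec)

# convert_model.py (sketch): only the non-full-TE path needs the
# moe_grouped_gemm=False override.
overrides = {} if args.full_te_spec else {"moe_grouped_gemm": False}
print(spec, overrides)
```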