Upgrade Dev containers for CICD to latest #891

Open
kevalmorabia97 wants to merge 2 commits into main from kmorabia/upgrade-dev-setup

Conversation

@kevalmorabia97
Collaborator

@kevalmorabia97 kevalmorabia97 commented Feb 13, 2026

What does this PR do?

  • Upgrade CICD test containers to latest
  • Enable torch 2.10 testing in CICD

Testing

CI/CD in this PR should pass

Summary by CodeRabbit

  • New Features

    • Added support for mixed-precision gradient handling with FSDP2.
  • Documentation

    • Updated Linux installation guide with CUDA 13.x support and cupy dependency guidance.
  • Chores

    • Updated CI/CD workflows and test infrastructure to support PyTorch 2.10 and CUDA 13.
    • Updated container image versions and test environment configurations.
    • Updated TensorRT-LLM version requirements.

@kevalmorabia97 kevalmorabia97 requested a review from a team as a code owner February 13, 2026 23:27
@coderabbitai
Contributor

coderabbitai bot commented Feb 13, 2026

📝 Walkthrough

CI workflows, test matrices, and installation docs were updated for newer PyTorch/CUDA/tooling (container tags moved to 26.01, CUDA 13 variants added, Torch 2.10 added). Megatron/TE guards on tests were mostly removed or relaxed, and a new FSDP2 post-backward patch was added to the quantization trainer.

Changes

Cohort / File(s) Summary
GitHub Actions workflows
.github/workflows/example_tests.yml, .github/workflows/gpu_tests.yml, .github/workflows/unit_tests.yml
Centralized Docker image selection via matrix (default 26.01), bumped container tags to 26.01, switched GPU job matrix variants from CUDA 12 → CUDA 13, and updated TensorRT/TensorRT-LLM tags. Unit test matrices renamed to use torch210 variants and expanded torch versions.
CI config (tox)
tox.ini
Aligned tox envs to Torch 2.10 / CUDA 13: added torch210 CPU envs, replaced cuda12-gpu with cuda13-gpu variants, updated torchvision mappings and GPU presteps (cupy-cuda13x, fast-hadamard-transform install).
Docs
docs/source/getting_started/_installation_for_Linux.rst
Updated Linux installation requirements to include CUDA 12.x and 13.x, relaxed TensorRT-LLM constraint, and added CUDA-specific instructions for cupy-cuda12x vs cupy-cuda13x installation flows.
Quantization trainer patch
modelopt/torch/quantization/plugins/transformers_trainer.py
Added _patch_fsdp2_post_backward() and invoked it during accelerate/FSDP2 preparation to patch FSDPParamGroup.post_backward, normalizing grad dtype handling across mixed-precision scenarios (idempotent wrapper + casting logic).
Test utilities
tests/_test_utils/import_helper.py
Added ctypes/os checks to verify that the TensorRT provider native library loads, and updated the skip_if_no_megatron signature to keyword-only (*, te_required: bool = True, mamba_required: bool = False) with TE-specific skip logic (a minimal sketch of the new signature follows this table).
Megatron test guards updated/removed
tests/gpu_megatron/... (multiple files)
Many test files removed or relaxed skip_if_no_megatron guards: some tests now run unconditionally; others drop TE/APEX requirements while preserving mamba checks. Review needed for environments lacking Megatron/TE. (See modified files under tests/gpu_megatron/....)
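
As a rough illustration of the import_helper change flagged in the table above, a keyword-only skip helper could look like the sketch below. The module names and skip messages are assumptions for illustration, not the repository's actual code.

import pytest


def skip_if_no_megatron(*, te_required: bool = True, mamba_required: bool = False):
    """Hypothetical sketch of the keyword-only guard; the real helper may differ."""
    try:
        import megatron.core  # noqa: F401
    except ImportError:
        pytest.skip("megatron-core is not installed")

    if te_required:
        try:
            import transformer_engine  # noqa: F401
        except ImportError:
            pytest.skip("transformer_engine is required for this test")

    if mamba_required:
        try:
            import mamba_ssm  # noqa: F401
        except ImportError:
            pytest.skip("mamba_ssm is required for this test")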

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1
❌ Failed checks (1 warning)
Check name | Status | Explanation | Resolution
Merge Conflict Detection | ⚠️ Warning | ❌ Merge conflicts detected (19 files):

⚔️ .github/workflows/example_tests.yml (content)
⚔️ .github/workflows/gpu_tests.yml (content)
⚔️ .github/workflows/unit_tests.yml (content)
⚔️ docs/source/getting_started/_installation_for_Linux.rst (content)
⚔️ modelopt/torch/export/plugins/mcore_nemotron.py (content)
⚔️ modelopt/torch/export/plugins/megatron_importer.py (content)
⚔️ modelopt/torch/quantization/plugins/transformers_trainer.py (content)
⚔️ tests/_test_utils/import_helper.py (content)
⚔️ tests/gpu_megatron/torch/distill/plugins/test_distill_megatron.py (content)
⚔️ tests/gpu_megatron/torch/export/test_unified_export_megatron.py (content)
⚔️ tests/gpu_megatron/torch/export/test_vllm_fakequant_megatron_export.py (content)
⚔️ tests/gpu_megatron/torch/nas/plugins/test_megatron_gpt_dynamic_modules.py (content)
⚔️ tests/gpu_megatron/torch/nas/plugins/test_megatron_mamba_dynamic_modules.py (content)
⚔️ tests/gpu_megatron/torch/peft/plugins/test_megatron_peft.py (content)
⚔️ tests/gpu_megatron/torch/prune/plugins/test_mcore_gpt_minitron_pruning.py (content)
⚔️ tests/gpu_megatron/torch/prune/plugins/test_mcore_mamba_minitron_pruning.py (content)
⚔️ tests/gpu_megatron/torch/quantization/plugins/test_megatron.py (content)
⚔️ tests/gpu_megatron/torch/speculative/plugins/test_speculative_megatron_modules.py (content)
⚔️ tox.ini (content)

These conflicts must be resolved before merging into main.
Resolve conflicts locally and push changes to this branch.
✅ Passed checks (3 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title directly addresses the main objective of the PR, which is upgrading CICD test containers to latest versions (PyTorch, CUDA, TensorRT, etc.).
Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
.github/workflows/gpu_tests.yml (1)

86-99: ⚠️ Potential issue | 🔴 Critical

gpu-tests-non-pr still references cuda12 tox environments that were removed from tox.ini.

The PR job matrix (lines 66-68) was updated to cuda13, but the non-PR job matrix (lines 92-94) still uses py312-cuda12-gpu and py312-cuda12-gpu-megatron. Since tox.ini no longer defines cuda12 environments (replaced by cuda13), nightly and on-demand runs will fail with an unknown tox environment error.

🐛 Proposed fix
       matrix:
         include:
-          - example: py312-cuda12-gpu
+          - example: py312-cuda13-gpu
             timeout: 90
-          - example: py312-cuda12-gpu-megatron
+          - example: py312-cuda13-gpu-megatron
             timeout: 120
.github/workflows/unit_tests.yml (1)

79-86: ⚠️ Potential issue | 🔴 Critical

multi-torch matrix includes torch26 which no longer exists in tox.ini.

As noted in the tox.ini review, torch26 was dropped from the tox env pattern. This matrix entry will produce py312-torch26-tf_latest-unit which tox won't recognize. Either add torch26 back to the tox env pattern or remove 26 from this matrix (and consider adding 210 if full coverage of the new version is desired).

🐛 Suggested fix (if torch26 is intentionally dropped)
       matrix:
-        torch: [26, 27, 28, 29]
+        torch: [27, 28, 29, 210]
🤖 Fix all issues with AI agents
In `@tox.ini`:
- Around line 14-21: The tox env pattern
[testenv:{py310,py311,py312}-torch{27,28,29,210}-tf_{min,latest}-unit] no longer
includes torch26 but there is still a deps mapping "torch26:
torchvision~=0.21.0" and the CI matrix (multi-torch) still lists 26; fix by
either (A) restoring 26 into the env factor (change torch{27,28,29,210} to
torch{26,27,28,29,210}) so tox, deps, and the workflow matrix align, or (B)
remove the dead deps line "torch26: torchvision~=0.21.0" and remove the "26"
entry from the multi-torch matrix in unit_tests.yml so the workflow no longer
references an unsupported tox env. Ensure both the tox env pattern and the
workflow matrix + deps block are consistent.
🧹 Nitpick comments (1)
docs/source/getting_started/_installation_for_Linux.rst (1)

129-131: Consider splitting the long instruction for readability.

The single bullet on line 131 packs multiple commands and conditional logic into one line, which can be hard to scan in rendered docs.

📝 Suggested restructure
-**CUDA specific dependencies**
-
-* By default, ``cupy-cuda12x`` is installed for INT4 ONNX quantization. If you have CUDA 13, you need to run ``pip uninstall -y cupy-cuda12x`` and ``pip install cupy-cuda13x`` after installing ``nvidia-modelopt[onnx]``.
+**CUDA specific dependencies**
+
+* By default, ``cupy-cuda12x`` is installed for INT4 ONNX quantization.
+* If you have CUDA 13, replace it after installing ``nvidia-modelopt[onnx]``:
+
+  .. code-block:: shell
+
+      pip uninstall -y cupy-cuda12x
+      pip install cupy-cuda13x

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 force-pushed the kmorabia/upgrade-dev-setup branch from b7e3883 to f0bd99c on February 13, 2026 23:31
@codecov

codecov bot commented Feb 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.74%. Comparing base (ae69d5d) to head (dea2eeb).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #891   +/-   ##
=======================================
  Coverage   73.74%   73.74%           
=======================================
  Files         199      199           
  Lines       21163    21163           
=======================================
  Hits        15606    15606           
  Misses       5557     5557           

☔ View full report in Codecov by Sentry.

@kevalmorabia97
Collaborator Author

@realAsma please help review this cursor-generated fix for passing the llm_qat test on PyTorch 2.10 (pytorch:26.01-py3 container)

Failing test without this fix: https://github.com/NVIDIA/Model-Optimizer/actions/runs/22007571666/job/63595325923?pr=891

[rank0]: RuntimeError: attempting to assign a gradient with dtype 'c10::BFloat16' to a tensor with grad_dtype 'Float'. The gradient must match the tensor's grad_dtype (defaults to the tensor's dtype). You can set the tensor's grad_dtype attribute with a specific dtype, or None to allow any dtype. Set grad_dtype with caution. Diverging the dtypes of a tensor and its gradient may break downstream systems that assume they match.
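
For context, the walkthrough above describes the fix as normalizing grad dtypes around the FSDP2 post-backward hook. The snippet below is a standalone illustration of that casting idea only, assuming a plain fp32 parameter and a bf16 gradient; it is not the trainer's actual patch.

import torch

# An fp32 parameter whose gradient arrives in bf16 under mixed precision.
param = torch.nn.Parameter(torch.zeros(4, dtype=torch.float32))
bf16_grad = torch.ones(4, dtype=torch.bfloat16)

# Recent PyTorch rejects assigning a grad whose dtype differs from the
# parameter's grad_dtype, so cast the gradient to the parameter dtype first.
param.grad = bf16_grad.to(param.dtype)
assert param.grad.dtype == torch.float32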

from packaging import version


def skip_if_no_tensorrt():
@kevalmorabia97
Collaborator Author

@gcunhase @ajrasane please help review this cursor-generated check for skipping tests/gpu/onnx/test_plugin.py on the CUDA 13 container with our current ORT-GPU installation.

Failure without this skip check: https://github.com/NVIDIA/Model-Optimizer/actions/runs/22007571673/job/63595332370?pr=891#step:7:364

*************** EP Error ***************
EP Error /onnxruntime_src/onnxruntime/python/onnxruntime_pybind_state.cc:505 void onnxruntime::python::RegisterTensorRTPluginsAsCustomOps(PySessionOptions&, const onnxruntime::ProviderOptions&) Please install TensorRT libraries as mentioned in the GPU requirements page, make sure they're in the PATH or LD_LIBRARY_PATH, and that your GPU is supported.
 when using [('TensorrtExecutionProvider', {'trt_max_workspace_size': 85899345920}), ('CUDAExecutionProvider', {'device_id': 0, 'arena_extend_strategy': 'kSameAsRequested'}), ('CPUExecutionProvider', {'arena_extend_strategy': 'kSameAsRequested'})]
Falling back to ['CPUExecutionProvider'] and retrying.
****************************************
----------------------------- Captured stderr call -----------------------------
2026-02-14 01:01:06.659874993 [E:onnxruntime:Default, provider_bridge_ort.cc:2167 TryGetProviderInfo_TensorRT] /onnxruntime_src/onnxruntime/core/session/provider_bridge_ort.cc:1778 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_tensorrt.so with error: libcublas.so.12: cannot open shared object file: No such file or directory
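
The requested skip check presumably probes whether ONNX Runtime's TensorRT execution provider library can actually be loaded before running the GPU ONNX plugin tests. The sketch below shows one way such a probe could work; the library location under onnxruntime/capi and the skip messages are assumptions, not the code in this PR.

import ctypes
import os

import pytest


def skip_if_trt_provider_unloadable():
    """Hypothetical probe for libonnxruntime_providers_tensorrt.so."""
    try:
        import onnxruntime
    except ImportError:
        pytest.skip("onnxruntime is not installed")

    lib_path = os.path.join(
        os.path.dirname(onnxruntime.__file__), "capi", "libonnxruntime_providers_tensorrt.so"
    )
    if not os.path.exists(lib_path):
        pytest.skip("TensorRT execution provider library not found")

    try:
        # Loading fails with OSError if transitive deps (e.g. libcublas.so.12) are missing.
        ctypes.CDLL(lib_path)
    except OSError as e:
        pytest.skip(f"TensorRT execution provider could not be loaded: {e}")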

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 force-pushed the kmorabia/upgrade-dev-setup branch from 86e9821 to dea2eeb on February 14, 2026 01:31
@kevalmorabia97
Collaborator Author

@coderabbitai full review

@coderabbitai
Contributor

coderabbitai bot commented Feb 14, 2026

✅ Actions performed

Full review triggered.

@NVIDIA NVIDIA deleted a comment from coderabbitai bot Feb 14, 2026
Contributor

@coderabbitai coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/_test_utils/import_helper.py (1)

100-102: ⚠️ Potential issue | 🟡 Minor

Pre-existing bug: skip message is inverted.

The condition on line 100 skips when the installed version is greater than 1.18, but the message says "is less than required". This is a pre-existing issue unrelated to this PR.

Proposed fix
     if version.parse(installed_version) > version.parse(required_version):
         pytest.skip(
-            f"{package_name} version {installed_version} is less than required {required_version}"
+            f"{package_name} version {installed_version} is greater than {required_version}"
         )
🧹 Nitpick comments (1)
modelopt/torch/quantization/plugins/transformers_trainer.py (1)

387-387: Placement note: _patch_fsdp2_post_backward() is called unconditionally.

This patches the class-level method regardless of whether FSDP2 is actually in use. The ImportError guard and double-patch sentinel make this safe, but calling it only when self.accelerator.is_fsdp2 would be more precise. Not a blocker since the guards handle it correctly.
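
If the more precise placement suggested above is preferred, the call could be gated on the accelerator flag while staying idempotent. The toy sketch below illustrates only the guard-plus-sentinel pattern; every name except accelerator.is_fsdp2 is a hypothetical stand-in, not the trainer's code.

class _DummyParamGroup:
    def post_backward(self):
        return "original"


def _patch_fsdp2_post_backward():
    # Idempotent class-level monkey patch, mirroring the wrapper described in
    # the walkthrough (the real target is FSDPParamGroup.post_backward).
    if getattr(_DummyParamGroup.post_backward, "_modelopt_patched", False):
        return
    original = _DummyParamGroup.post_backward

    def patched(self):
        # The real patch normalizes grad dtypes around the original call.
        return original(self)

    patched._modelopt_patched = True
    _DummyParamGroup.post_backward = patched


def prepare(accelerator):
    # Guard suggested in the comment above: only patch when FSDP2 is in use.
    if getattr(accelerator, "is_fsdp2", False):
        _patch_fsdp2_post_backward()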

Contributor

@coderabbitai coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/_test_utils/import_helper.py (1)

100-102: ⚠️ Potential issue | 🟡 Minor

Pre-existing bug: skip message says "less than" but condition checks "greater than".

Line 100 skips when installed_version > required_version, but the message on line 102 says the version "is less than required". This is a pre-existing issue unrelated to this PR but worth noting.

Proposed fix
     if version.parse(installed_version) > version.parse(required_version):
         pytest.skip(
-            f"{package_name} version {installed_version} is less than required {required_version}"
+            f"{package_name} version {installed_version} is above the maximum {required_version}"
         )
🧹 Nitpick comments (5)
docs/source/getting_started/_installation_for_Linux.rst (1)

129-131: Consider mentioning [all] in addition to [onnx] for the cupy swap instruction.

The cupy swap applies regardless of whether the user installed nvidia-modelopt[onnx] or nvidia-modelopt[all]. Mentioning only [onnx] may cause users who installed [all] to miss this step.

.github/workflows/example_tests.yml (1)

68-73: The explicit docker_image: "26.01" for speculative_decoding is currently redundant with the default.

Since the fallback || '26.01' already covers it, the explicit value is unnecessary. However, keeping it is fine if the intent is to make it easy to override per-example in the future. The comment on line 59 hints at this being intentional.

tox.ini (2)

62-73: Consider pinning the fast-hadamard-transform dependency to a specific commit or tag.

git+https://github.com/Dao-AILab/fast-hadamard-transform.git installs from the default branch head, meaning CI can break at any time due to upstream changes. Pinning to a known-good commit (e.g., ...transform.git@<commit-sha>) would make GPU test runs reproducible.

The cupy-cuda12x → cupy-cuda13x swap on lines 69–70 is consistent with the CUDA 13 migration.


75-83: Same unpinned-git concern applies to mamba; otherwise LGTM.

Line 79 installs mamba from the default branch of state-spaces/mamba without a commit pin — same reproducibility note as fast-hadamard-transform above. The rest of the Megatron GPU environment looks correct.

modelopt/torch/quantization/plugins/transformers_trainer.py (1)

104-147: Well-documented and correct implementation — optional refinements available.

The idempotency guard, import-safety, and clear documentation of this as a temporary workaround are sound.

Two optional observations:

  1. grad_dtype is cleared but never restored. After _modelopt_original_post_backward() runs, grad_dtype remains None. While it's re-set to None on every subsequent post_backward call, the codebase elsewhere uses save-restore patterns for similar monkey-patching (see utils.py lines 563–575). If future FSDP2 code paths inspect grad_dtype between backward passes, this could cause unexpected behavior. Consider saving and restoring it:
♻️ Optional: restore grad_dtype after the original call
     @torch.no_grad()
     def _patched_post_backward(self):
         # Allow bf16 gradients to be assigned to fp32 parameters
+        saved_grad_dtypes = {}
         for fsdp_param in self.fsdp_params:
             with contextlib.suppress(AttributeError):
+                saved_grad_dtypes[fsdp_param] = fsdp_param.sharded_param.grad_dtype
                 fsdp_param.sharded_param.grad_dtype = None

         self._modelopt_original_post_backward()

         # Cast gradients to parameter dtype so the optimizer sees matching dtypes
         for fsdp_param in self.fsdp_params:
             sp = fsdp_param.sharded_param
             if sp.grad is not None and sp.grad.dtype != sp.dtype:
                 sp.grad = sp.grad.to(sp.dtype)
+            if fsdp_param in saved_grad_dtypes:
+                sp.grad_dtype = saved_grad_dtypes[fsdp_param]
  2. Exception safety. If _modelopt_original_post_backward() raises, the gradient casting loop (lines 142–145) is skipped. A try/finally would ensure cleanup, though the risk is low for a temporary workaround; a hedged sketch of such a variant is shown below.
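
Building on both observations, a combined variant could restore grad_dtype and cast gradients inside a finally block. This is a hedged sketch based on the diff above, not the trainer's actual implementation; fsdp_params and sharded_param mirror names already shown there.

import contextlib

import torch


@torch.no_grad()
def _patched_post_backward(self):
    saved_grad_dtypes = {}
    # Allow bf16 gradients to be assigned to fp32 parameters.
    for fsdp_param in self.fsdp_params:
        with contextlib.suppress(AttributeError):
            saved_grad_dtypes[fsdp_param] = fsdp_param.sharded_param.grad_dtype
            fsdp_param.sharded_param.grad_dtype = None

    try:
        self._modelopt_original_post_backward()
    finally:
        # Even if the original hook raises, cast grads back to the parameter
        # dtype and restore grad_dtype so later passes see a clean state.
        for fsdp_param in self.fsdp_params:
            sp = fsdp_param.sharded_param
            if sp.grad is not None and sp.grad.dtype != sp.dtype:
                sp.grad = sp.grad.to(sp.dtype)
            if fsdp_param in saved_grad_dtypes:
                sp.grad_dtype = saved_grad_dtypes[fsdp_param]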
