Upgrade Dev containers for CICD to latest #891
Conversation
📝 Walkthrough

CI workflows, test matrices, and installation docs are updated for newer PyTorch/CUDA/tooling (container tags moved to 26.01, CUDA 13 variants added, Torch 2.10 added). Tests guarded by Megatron/TE checks were mostly removed or relaxed, and a new FSDP2 post-backward patch was added to the quantization trainer.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 3 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
.github/workflows/gpu_tests.yml (1)
86-99: ⚠️ Potential issue | 🔴 Critical

`gpu-tests-non-pr` still references `cuda12` tox environments that were removed from `tox.ini`.

The PR job matrix (lines 66-68) was updated to `cuda13`, but the non-PR job matrix (lines 92-94) still uses `py312-cuda12-gpu` and `py312-cuda12-gpu-megatron`. Since `tox.ini` no longer defines `cuda12` environments (replaced by `cuda13`), nightly and on-demand runs will fail with an unknown tox environment error.

🐛 Proposed fix

```diff
   matrix:
     include:
-      - example: py312-cuda12-gpu
+      - example: py312-cuda13-gpu
         timeout: 90
-      - example: py312-cuda12-gpu-megatron
+      - example: py312-cuda13-gpu-megatron
         timeout: 120
```

.github/workflows/unit_tests.yml (1)
79-86: ⚠️ Potential issue | 🔴 Critical

The `multi-torch` matrix includes `torch26`, which no longer exists in `tox.ini`.

As noted in the `tox.ini` review, `torch26` was dropped from the tox env pattern. This matrix entry will produce `py312-torch26-tf_latest-unit`, which tox won't recognize. Either add `torch26` back to the tox env pattern or remove `26` from this matrix (and consider adding `210` if full coverage of the new version is desired).

🐛 Suggested fix (if torch26 is intentionally dropped)

```diff
   matrix:
-    torch: [26, 27, 28, 29]
+    torch: [27, 28, 29, 210]
```
🤖 Fix all issues with AI agents
In `@tox.ini`:
- Around line 14-21: The tox env pattern
[testenv:{py310,py311,py312}-torch{27,28,29,210}-tf_{min,latest}-unit] no longer
includes torch26 but there is still a deps mapping "torch26:
torchvision~=0.21.0" and the CI matrix (multi-torch) still lists 26; fix by
either (A) restoring 26 into the env factor (change torch{27,28,29,210} to
torch{26,27,28,29,210}) so tox, deps, and the workflow matrix align, or (B)
remove the dead deps line "torch26: torchvision~=0.21.0" and remove the "26"
entry from the multi-torch matrix in unit_tests.yml so the workflow no longer
references an unsupported tox env. Ensure both the tox env pattern and the
workflow matrix + deps block are consistent.
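As a quick sanity check that either option keeps the tox env pattern and the workflow matrix aligned, here is a minimal sketch (not part of the repo; `expand_tox_envs` and the hard-coded pattern string are illustrative assumptions) that expands a tox-style factor pattern into the concrete env names CI would request:

```python
# Hypothetical helper: expand a tox factor pattern like the one in tox.ini so the
# generated env names can be compared against the entries built from the CI matrix.
import itertools
import re


def expand_tox_envs(pattern: str) -> list[str]:
    """Expand {a,b,c} factor groups in a tox env pattern into concrete env names."""
    groups = re.findall(r"\{([^}]*)\}", pattern)      # e.g. ["py310,py311,py312", ...]
    template = re.sub(r"\{[^}]*\}", "{}", pattern)    # e.g. "{}-torch{}-tf_{}-unit"
    return [
        template.format(*combo)
        for combo in itertools.product(*(group.split(",") for group in groups))
    ]


if __name__ == "__main__":
    # Option (A): restore 26 into the factor so the existing matrix entry stays valid.
    envs = expand_tox_envs("{py310,py311,py312}-torch{26,27,28,29,210}-tf_{min,latest}-unit")
    assert "py312-torch26-tf_latest-unit" in envs  # the env the multi-torch matrix requests
```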
🧹 Nitpick comments (1)
docs/source/getting_started/_installation_for_Linux.rst (1)
129-131: Consider splitting the long instruction for readability.

The single bullet on line 131 packs multiple commands and conditional logic into one line, which can be hard to scan in rendered docs.
📝 Suggested restructure
```diff
-**CUDA specific dependencies**
-
-* By default, ``cupy-cuda12x`` is installed for INT4 ONNX quantization. If you have CUDA 13, you need to run ``pip uninstall -y cupy-cuda12x`` and ``pip install cupy-cuda13x`` after installing ``nvidia-modelopt[onnx]``.
+**CUDA specific dependencies**
+
+* By default, ``cupy-cuda12x`` is installed for INT4 ONNX quantization.
+* If you have CUDA 13, replace it after installing ``nvidia-modelopt[onnx]``:
+
+  .. code-block:: shell
+
+     pip uninstall -y cupy-cuda12x
+     pip install cupy-cuda13x
```
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Force-pushed from b7e3883 to f0bd99c
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@           Coverage Diff           @@
##             main     #891   +/-   ##
=======================================
  Coverage   73.74%   73.74%
=======================================
  Files         199      199
  Lines       21163    21163
=======================================
  Hits        15606    15606
  Misses       5557     5557
```

☔ View full report in Codecov by Sentry.
Force-pushed from 59eb8ff to 86e9821
@realAsma please help review this cursor-generated fix for passing the llm_qat test on PyTorch 2.10 (pytorch:26.01-py3 container).
Failing test without this fix: https://github.com/NVIDIA/Model-Optimizer/actions/runs/22007571666/job/63595325923?pr=891
[rank0]: RuntimeError: attempting to assign a gradient with dtype 'c10::BFloat16' to a tensor with grad_dtype 'Float'. The gradient must match the tensor's grad_dtype (defaults to the tensor's dtype). You can set the tensor's grad_dtype attribute with a specific dtype, or None to allow any dtype. Set grad_dtype with caution. Diverging the dtypes of a tensor and its gradient may break downstream systems that assume they match.
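Roughly, the workaround looks like the following sketch (not the exact code in this PR; the `FSDPParamGroup` import path and the `_modelopt_patched` sentinel are assumptions, since the internal module has moved between PyTorch releases, while `fsdp_params`, `sharded_param`, and `grad_dtype` mirror the names referenced later in this thread):

```python
# Sketch only: relax grad_dtype during FSDP2's post_backward so bf16 gradients can be
# assigned to fp32 master parameters, then cast them back for the optimizer.
import contextlib

import torch


def patch_fsdp2_post_backward_sketch():
    try:
        # Assumption: internal import path; it differs across PyTorch versions.
        from torch.distributed.fsdp._fully_shard._fsdp_param_group import FSDPParamGroup
    except ImportError:
        return  # FSDP2 not available, nothing to patch

    if getattr(FSDPParamGroup, "_modelopt_patched", False):
        return  # only patch once
    original_post_backward = FSDPParamGroup.post_backward

    @torch.no_grad()
    def patched_post_backward(self):
        # Allow gradients whose dtype differs from the parameter dtype
        for fsdp_param in self.fsdp_params:
            with contextlib.suppress(AttributeError):
                fsdp_param.sharded_param.grad_dtype = None
        original_post_backward(self)
        # Cast gradients so the optimizer sees matching dtypes
        for fsdp_param in self.fsdp_params:
            sharded = fsdp_param.sharded_param
            if sharded.grad is not None and sharded.grad.dtype != sharded.dtype:
                sharded.grad = sharded.grad.to(sharded.dtype)

    FSDPParamGroup.post_backward = patched_post_backward
    FSDPParamGroup._modelopt_patched = True
```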
```python
from packaging import version
...
def skip_if_no_tensorrt():
```
@gcunhase @ajrasane please help review this cursor-generated check for skipping tests/gpu/onnx/test_plugin.py on a CUDA 13 container with our current ORT-GPU installation.
Failure without this skip check: https://github.com/NVIDIA/Model-Optimizer/actions/runs/22007571673/job/63595332370?pr=891#step:7:364
```
*************** EP Error ***************
EP Error /onnxruntime_src/onnxruntime/python/onnxruntime_pybind_state.cc:505 void onnxruntime::python::RegisterTensorRTPluginsAsCustomOps(PySessionOptions&, const onnxruntime::ProviderOptions&) Please install TensorRT libraries as mentioned in the GPU requirements page, make sure they're in the PATH or LD_LIBRARY_PATH, and that your GPU is supported.
when using [('TensorrtExecutionProvider', {'trt_max_workspace_size': 85899345920}), ('CUDAExecutionProvider', {'device_id': 0, 'arena_extend_strategy': 'kSameAsRequested'}), ('CPUExecutionProvider', {'arena_extend_strategy': 'kSameAsRequested'})]
Falling back to ['CPUExecutionProvider'] and retrying.
****************************************
----------------------------- Captured stderr call -----------------------------
2026-02-14 01:01:06.659874993 [E:onnxruntime:Default, provider_bridge_ort.cc:2167 TryGetProviderInfo_TensorRT] /onnxruntime_src/onnxruntime/core/session/provider_bridge_ort.cc:1778 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_tensorrt.so with error: libcublas.so.12: cannot open shared object file: No such file or directory
```
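Roughly, such a skip guard could look like the following sketch (not the exact helper added in this PR; only public `onnxruntime` and `pytest` APIs are used, and the `libcublas.so.12` probe is inferred from the log above rather than taken from the real implementation):

```python
# Hedged sketch: skip a test when ORT's TensorRT EP is missing or cannot load its
# CUDA 12 dependencies (e.g. on a CUDA 13 image).
import ctypes

import pytest


def skip_if_no_tensorrt():
    ort = pytest.importorskip("onnxruntime")
    if "TensorrtExecutionProvider" not in ort.get_available_providers():
        pytest.skip("TensorrtExecutionProvider is not available in this onnxruntime build")
    try:
        # The failure above is a missing libcublas.so.12, so probe for it directly.
        ctypes.CDLL("libcublas.so.12")
    except OSError:
        pytest.skip("TensorRT EP cannot load its libraries (libcublas.so.12 not found)")
```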
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

Force-pushed from 86e9821 to dea2eeb
@coderabbitai full review
✅ Actions performed: Full review triggered.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/_test_utils/import_helper.py (1)
100-102: ⚠️ Potential issue | 🟡 Minor

Pre-existing bug: skip message is inverted.
The condition on line 100 skips when the installed version is greater than 1.18, but the message says "is less than required". This is a pre-existing issue unrelated to this PR.
Proposed fix
```diff
 if version.parse(installed_version) > version.parse(required_version):
     pytest.skip(
-        f"{package_name} version {installed_version} is less than required {required_version}"
+        f"{package_name} version {installed_version} is greater than {required_version}"
     )
```
🧹 Nitpick comments (1)
modelopt/torch/quantization/plugins/transformers_trainer.py (1)
387-387: Placement note: `_patch_fsdp2_post_backward()` is called unconditionally.

This patches the class-level method regardless of whether FSDP2 is actually in use. The `ImportError` guard and double-patch sentinel make this safe, but calling it only when `self.accelerator.is_fsdp2` would be more precise, as sketched below. Not a blocker since the guards handle it correctly.
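A sketch of the more precise placement (a method fragment, not runnable standalone; whether `self.accelerator.is_fsdp2` exists on this trainer's accelerator is an assumption taken from the comment above):

```python
# Sketch: only install the FSDP2 workaround when FSDP2 is actually in use.
def _maybe_patch_fsdp2(self):
    if getattr(self.accelerator, "is_fsdp2", False):
        self._patch_fsdp2_post_backward()
```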
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/_test_utils/import_helper.py (1)
100-102: ⚠️ Potential issue | 🟡 Minor

Pre-existing bug: skip message says "less than" but condition checks "greater than".

Line 100 skips when `installed_version > required_version`, but the message on line 102 says the version "is less than required". This is a pre-existing issue unrelated to this PR but worth noting.

Proposed fix

```diff
 if version.parse(installed_version) > version.parse(required_version):
     pytest.skip(
-        f"{package_name} version {installed_version} is less than required {required_version}"
+        f"{package_name} version {installed_version} is above the maximum {required_version}"
     )
```
🧹 Nitpick comments (5)
docs/source/getting_started/_installation_for_Linux.rst (1)
129-131: Consider mentioning `[all]` in addition to `[onnx]` for the cupy swap instruction.

The cupy swap applies regardless of whether the user installed `nvidia-modelopt[onnx]` or `nvidia-modelopt[all]`. Mentioning only `[onnx]` may cause users who installed `[all]` to miss this step.

.github/workflows/example_tests.yml (1)
68-73: The explicit `docker_image: "26.01"` for `speculative_decoding` is currently redundant with the default.

Since the fallback `|| '26.01'` already covers it, the explicit value is unnecessary. However, keeping it is fine if the intent is to make it easy to override per-example in the future. The comment on line 59 hints at this being intentional.

tox.ini (2)
62-73: Consider pinning the `fast-hadamard-transform` dependency to a specific commit or tag.

`git+https://github.com/Dao-AILab/fast-hadamard-transform.git` installs from the default branch head, meaning CI can break at any time due to upstream changes. Pinning to a known-good commit (e.g., `...transform.git@<commit-sha>`) would make GPU test runs reproducible.

The `cupy-cuda12x` → `cupy-cuda13x` swap on lines 69-70 is consistent with the CUDA 13 migration.
75-83: Same unpinned-git concern applies to `mamba`; otherwise LGTM.

Line 79 installs `mamba` from the default branch of `state-spaces/mamba` without a commit pin; the same reproducibility note as `fast-hadamard-transform` above applies. The rest of the Megatron GPU environment looks correct.

modelopt/torch/quantization/plugins/transformers_trainer.py (1)
104-147: Well-documented and correct implementation; optional refinements available.

The idempotency guard, import-safety, and clear documentation of this as a temporary workaround are sound.
Two optional observations:
- `grad_dtype` is cleared but never restored. After `_modelopt_original_post_backward()` runs, `grad_dtype` remains `None`. While it's re-set to `None` on every subsequent `post_backward` call, the codebase elsewhere uses save-restore patterns for similar monkey-patching (see `utils.py` lines 563-575). If future FSDP2 code paths inspect `grad_dtype` between backward passes, this could cause unexpected behavior. Consider saving and restoring it:

  ♻️ Optional: restore grad_dtype after the original call

  ```diff
   @torch.no_grad()
   def _patched_post_backward(self):
       # Allow bf16 gradients to be assigned to fp32 parameters
  +    saved_grad_dtypes = {}
       for fsdp_param in self.fsdp_params:
           with contextlib.suppress(AttributeError):
  +            saved_grad_dtypes[fsdp_param] = fsdp_param.sharded_param.grad_dtype
               fsdp_param.sharded_param.grad_dtype = None
       self._modelopt_original_post_backward()
       # Cast gradients to parameter dtype so the optimizer sees matching dtypes
       for fsdp_param in self.fsdp_params:
           sp = fsdp_param.sharded_param
           if sp.grad is not None and sp.grad.dtype != sp.dtype:
               sp.grad = sp.grad.to(sp.dtype)
  +        if fsdp_param in saved_grad_dtypes:
  +            sp.grad_dtype = saved_grad_dtypes[fsdp_param]
  ```
- Exception safety. If `_modelopt_original_post_backward()` raises, the gradient casting loop (lines 142-145) is skipped. A `try`/`finally` would ensure cleanup, though the risk is low for a temporary workaround. A sketch of that variant follows below.
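A minimal sketch of the `try`/`finally` variant, reusing the names from the diff above (those names come from the review comments, not from re-checking the file):

```python
# Sketch: ensure the gradient-cast cleanup runs even if the wrapped call raises.
import contextlib

import torch


@torch.no_grad()
def _patched_post_backward(self):
    for fsdp_param in self.fsdp_params:
        with contextlib.suppress(AttributeError):
            fsdp_param.sharded_param.grad_dtype = None
    try:
        self._modelopt_original_post_backward()
    finally:
        # Runs regardless of whether the original post_backward raised
        for fsdp_param in self.fsdp_params:
            sharded = fsdp_param.sharded_param
            if sharded.grad is not None and sharded.grad.dtype != sharded.dtype:
                sharded.grad = sharded.grad.to(sharded.dtype)
```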
What does this PR do?
Testing
CI/CD in this PR should pass
Summary by CodeRabbit
New Features
Documentation
Chores