[Core] Gate NVTE_WITH_CUBLASMP build flag behind NCCL version 2.29.3+#2786
[Core] Gate NVTE_WITH_CUBLASMP build flag behind NCCL version 2.29.3+#2786jberchtold-nvidia wants to merge 1 commit intoNVIDIA:mainfrom
Conversation
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>
|
/te-ci |
Greptile SummaryThis PR adds a CMake-level guard that requires NCCL ≥ 2.29.3 before allowing the However, there is a meaningful correctness issue with how
The fix requires either (a) using Confidence Score: 3/5
Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[CMake configure with NVTE_WITH_CUBLASMP=ON] --> B{NCCL_VERSION\nCMake variable defined?}
B -- No --> C[FATAL_ERROR:\nNCCL_VERSION not set]
B -- Yes --> D{NCCL_VERSION matches\nX.Y.Z regex?}
D -- No --> E[FATAL_ERROR:\nInvalid format]
D -- Yes --> F[Parse MAJOR / MINOR / PATCH]
F --> G{version >= 2.29.3?}
G -- No --> H[FATAL_ERROR:\nNCCL too old]
G -- Yes --> I[STATUS: version check passed]
I --> J[target_compile_definitions\nNVTE_WITH_CUBLASMP]
J --> K[find_library: cublasmp]
J --> L[find_library: nccl]
K --> M[target_link_libraries\nnccl + cublasmp]
L --> M
M --> N[STATUS: Using cuBLASMp]
subgraph gap [Gap - standard build path]
P[setup.py: NVTE_WITH_CUBLASMP=1\npip install .] -- never passes\n-DNCCL_VERSION --> A
end
Last reviewed commit: "Update CMakeLists.tx..." |
| if(NOT DEFINED NCCL_VERSION) | ||
| message(FATAL_ERROR "NCCL_VERSION environment variable not set. NCCL 2.29.3+ is required for cuBLASMp support.") |
There was a problem hiding this comment.
CMake variable vs environment variable mismatch
if(NOT DEFINED NCCL_VERSION) checks for a CMake cache variable (set with -DNCCL_VERSION=2.29.3), but the error message says "NCCL_VERSION environment variable not set". If a user reads this error and sets export NCCL_VERSION=2.29.3 in their shell, the error will persist because CMake does not automatically promote environment variables to CMake variables.
Furthermore, setup.py (the standard build path via NVTE_WITH_CUBLASMP=1 pip install .) never passes -DNCCL_VERSION to CMake — so this check will always fail when building via setup.py unless the user also provides NVTE_CMAKE_EXTRA_ARGS="-DNCCL_VERSION=X.Y.Z". This is not obvious and entirely undocumented.
To read an environment variable in CMake the code should use $ENV{NCCL_VERSION} and handle both sources:
# Allow NCCL_VERSION to be supplied as either a -D CMake variable or an env var
if(NOT DEFINED NCCL_VERSION AND DEFINED ENV{NCCL_VERSION})
set(NCCL_VERSION "$ENV{NCCL_VERSION}")
endif()
if(NOT DEFINED NCCL_VERSION)
message(FATAL_ERROR "NCCL_VERSION is not set. Pass -DNCCL_VERSION=X.Y.Z to CMake "
"or set the NCCL_VERSION environment variable. NCCL 2.29.3+ is required for cuBLASMp support.")
endif()Additionally, setup.py should be updated to read NCCL_VERSION from the environment and forward it as a CMake flag alongside the other cuBLASMp flags (lines 74-79).
Description
Currently using older NCCL versions with the NVTE_WITH_CUBLASMP build flag lead to an unclear linker error. This PR addes a guard to the CMake setup to check that the NCCL version is sufficient before trying to link with cublasMp. If the user's NCCL version is not sufficient, this gives a clear error from CMake instead of a linker error.
Type of change
Changes
Checklist: