Skip to content

[Core] Gate NVTE_WITH_CUBLASMP build flag behind NCCL version 2.29.3+#2786

Closed
jberchtold-nvidia wants to merge 1 commit intoNVIDIA:mainfrom
jberchtold-nvidia:jberchtold/fix-cublasmp-build-error
Closed

[Core] Gate NVTE_WITH_CUBLASMP build flag behind NCCL version 2.29.3+#2786
jberchtold-nvidia wants to merge 1 commit intoNVIDIA:mainfrom
jberchtold-nvidia:jberchtold/fix-cublasmp-build-error

Conversation

@jberchtold-nvidia
Copy link
Collaborator

Description

Currently using older NCCL versions with the NVTE_WITH_CUBLASMP build flag lead to an unclear linker error. This PR addes a guard to the CMake setup to check that the NCCL version is sufficient before trying to link with cublasMp. If the user's NCCL version is not sufficient, this gives a clear error from CMake instead of a linker error.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Check NCCL version for cuBLASMp

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>
@jberchtold-nvidia
Copy link
Collaborator Author

/te-ci

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 20, 2026

Greptile Summary

This PR adds a CMake-level guard that requires NCCL ≥ 2.29.3 before allowing the NVTE_WITH_CUBLASMP build path to proceed, replacing a confusing linker error with a clear FATAL_ERROR message. The version comparison logic (parsing MAJOR.MINOR.PATCH and comparing numerically) is correct.

However, there is a meaningful correctness issue with how NCCL_VERSION is sourced:

  • if(NOT DEFINED NCCL_VERSION) tests for a CMake cache variable (supplied via -DNCCL_VERSION=X.Y.Z), but the error message calls it an "environment variable". A user who reads the error and runs export NCCL_VERSION=2.29.3 will see the same failure because CMake does not automatically promote shell environment variables to CMake variables.
  • setup.py — the standard way to build TransformerEngine — reads NVTE_WITH_CUBLASMP from the environment and appends -DNVTE_WITH_CUBLASMP=ON to CMake flags, but never reads or forwards NCCL_VERSION. This means NVTE_WITH_CUBLASMP=1 pip install . will always hit the new FATAL_ERROR unless the user additionally passes NVTE_CMAKE_EXTRA_ARGS="-DNCCL_VERSION=X.Y.Z", which is undocumented.

The fix requires either (a) using $ENV{NCCL_VERSION} in CMake to also accept the environment variable, and/or (b) updating setup.py to read NCCL_VERSION from the environment and forward it as a -D flag, consistent with how CUBLASMP_HOME and other options are handled.

Confidence Score: 3/5

  • The guard logic is directionally correct but will break the standard setup.py build path for all users of cuBLASMp due to a CMake-vs-environment-variable mismatch.
  • The version comparison logic itself is sound. However, the check relies on a CMake cache variable (NCCL_VERSION) that is never populated automatically — not by find_package, not by find_library, and not by setup.py. As a result, any user building via pip install . with NVTE_WITH_CUBLASMP=1 will hit a FATAL_ERROR with a misleading message pointing to an "environment variable", which is not what CMake reads. This makes the feature functionally broken through the standard build path until setup.py is updated to forward the variable.
  • transformer_engine/common/CMakeLists.txt (lines 291-292) and setup.py (lines 74-79) both need attention.

Important Files Changed

Filename Overview
transformer_engine/common/CMakeLists.txt Adds NCCL ≥2.29.3 version guard before the cuBLASMp build path. The version comparison logic is correct, but there is a CMake-vs-environment-variable mismatch: the guard checks a CMake cache variable while the error message refers to an "environment variable", and setup.py never forwards NCCL_VERSION to CMake, making the standard build path always fail.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[CMake configure with NVTE_WITH_CUBLASMP=ON] --> B{NCCL_VERSION\nCMake variable defined?}
    B -- No --> C[FATAL_ERROR:\nNCCL_VERSION not set]
    B -- Yes --> D{NCCL_VERSION matches\nX.Y.Z regex?}
    D -- No --> E[FATAL_ERROR:\nInvalid format]
    D -- Yes --> F[Parse MAJOR / MINOR / PATCH]
    F --> G{version >= 2.29.3?}
    G -- No --> H[FATAL_ERROR:\nNCCL too old]
    G -- Yes --> I[STATUS: version check passed]
    I --> J[target_compile_definitions\nNVTE_WITH_CUBLASMP]
    J --> K[find_library: cublasmp]
    J --> L[find_library: nccl]
    K --> M[target_link_libraries\nnccl + cublasmp]
    L --> M
    M --> N[STATUS: Using cuBLASMp]

    subgraph gap [Gap - standard build path]
      P[setup.py: NVTE_WITH_CUBLASMP=1\npip install .] -- never passes\n-DNCCL_VERSION --> A
    end
Loading

Last reviewed commit: "Update CMakeLists.tx..."

Comment on lines +291 to +292
if(NOT DEFINED NCCL_VERSION)
message(FATAL_ERROR "NCCL_VERSION environment variable not set. NCCL 2.29.3+ is required for cuBLASMp support.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 CMake variable vs environment variable mismatch

if(NOT DEFINED NCCL_VERSION) checks for a CMake cache variable (set with -DNCCL_VERSION=2.29.3), but the error message says "NCCL_VERSION environment variable not set". If a user reads this error and sets export NCCL_VERSION=2.29.3 in their shell, the error will persist because CMake does not automatically promote environment variables to CMake variables.

Furthermore, setup.py (the standard build path via NVTE_WITH_CUBLASMP=1 pip install .) never passes -DNCCL_VERSION to CMake — so this check will always fail when building via setup.py unless the user also provides NVTE_CMAKE_EXTRA_ARGS="-DNCCL_VERSION=X.Y.Z". This is not obvious and entirely undocumented.

To read an environment variable in CMake the code should use $ENV{NCCL_VERSION} and handle both sources:

  # Allow NCCL_VERSION to be supplied as either a -D CMake variable or an env var
  if(NOT DEFINED NCCL_VERSION AND DEFINED ENV{NCCL_VERSION})
    set(NCCL_VERSION "$ENV{NCCL_VERSION}")
  endif()
  if(NOT DEFINED NCCL_VERSION)
    message(FATAL_ERROR "NCCL_VERSION is not set. Pass -DNCCL_VERSION=X.Y.Z to CMake "
            "or set the NCCL_VERSION environment variable. NCCL 2.29.3+ is required for cuBLASMp support.")
  endif()

Additionally, setup.py should be updated to read NCCL_VERSION from the environment and forward it as a CMake flag alongside the other cuBLASMp flags (lines 74-79).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant