
feat: Implement persistent CUBIN caching, multi-threaded context safety, and cross-platform NVRTC support (#175)#193

Open
Franklalalala wants to merge 3 commits into PASSIONLab:main from Franklalalala:main

Conversation

@Franklalalala

📌 Content

This PR introduces a robust disk-based caching mechanism for NVRTC JIT compilation, ensures CUDA context safety across multiple threads, and provides cross-platform compatibility (Windows/Linux). It significantly optimizes the startup time of models using JIT-compiled kernels by avoiding redundant compilation.

🎯 Motivation

  1. Addressing Issue #175 (Compilation cache fails to persist on L40s/RTX 6000): Users on high-end hardware (e.g., L40s, RTX 6000) reported long JIT compilation times because kernels are recompiled every time the process starts. Persistent caching solves this.
  2. Multi-threaded Reliability: In environments like PyTorch DataLoader workers, CUDA contexts aren't always automatically initialized. This PR ensures a valid context is bound before any Driver API calls.
  3. Deployment Flexibility: Adds support for custom cache paths and provides environment variables to toggle caching.
  4. Bug Fixes: Fixed a memory leak where the compilation log was not freed during exception throws, and improved error reporting.

🛠️ Core Differences & Implementation Details

The new implementation introduces several critical enhancements over the previous version:

  1. Deterministic Persistent Caching

    • Old: Used an in-memory compiled flag; compilation was lost on process exit.
    • New: Generates a unique FNV-1a 64-bit hash based on: NVRTC version, GPU architecture (sm_xx), compilation options, source code, and kernel name expressions.
    • Storage: Saves both .cubin (binary) and .names (lowered name mappings) to ~/.cache/openequivariance or a custom path via OEQ_CACHE_PATH.
  2. Atomic & Safe File I/O

    • Race Condition Prevention: To prevent corrupted cache files when multiple processes/threads compile simultaneously, writes are performed to a temporary file (uniquely identified by PID and ThreadID) and then atomically moved to the final destination using std::rename.
  3. Automatic CUDA Context Management

    • Added ensure_cuda_context(). This function detects if the calling thread has an active CUDA context. If not, it retains and sets the Primary Context for the device. This is crucial for stability in multi-threaded C++ or Python environments.
  4. Cross-Platform Portability

    • Replaced Linux-specific logic with a unified macro system (_WIN32) for directory creation (mkdir vs _mkdir) and process identification (getpid vs _getpid).
    • Utilizes <cinttypes> and PRIx64 for consistent 64-bit hex formatting across different compilers/architectures.
  5. Robust Error Handling

    • Ensures delete[] log; is called even when a std::logic_error is thrown during NVRTC failures, preventing memory leaks during iterative debugging.
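The FNV-1a key derivation in item 1 (combined with the `PRIx64` formatting from item 4) could be sketched as follows. This is a minimal illustration, not the PR's actual code: the helper names, field order, and separator byte are assumptions.

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// FNV-1a 64-bit constants: standard offset basis and prime.
constexpr uint64_t FNV_OFFSET = 0xcbf29ce484222325ULL;
constexpr uint64_t FNV_PRIME  = 0x100000001b3ULL;

uint64_t fnv1a_64(const std::string& data, uint64_t hash = FNV_OFFSET) {
    for (unsigned char c : data) {
        hash ^= c;          // XOR before multiply distinguishes FNV-1a from FNV-1
        hash *= FNV_PRIME;
    }
    return hash;
}

// Hypothetical cache-key builder: chain the hash over every input that
// affects the compiled binary, so any change yields a new cache entry.
std::string cache_key(const std::string& nvrtc_version,
                      const std::string& arch,          // e.g. "sm_89"
                      const std::string& options,
                      const std::string& source,
                      const std::string& kernel_names) {
    uint64_t h = FNV_OFFSET;
    for (const std::string* part : {&nvrtc_version, &arch, &options,
                                    &source, &kernel_names}) {
        h = fnv1a_64(*part, h);
        h = fnv1a_64("\x1f", h);  // field separator guards against concatenation collisions
    }
    char buf[17];
    std::snprintf(buf, sizeof(buf), "%016" PRIx64, h);  // portable 64-bit hex via <cinttypes>
    return std::string(buf);
}
```

The resulting 16-character hex string can name the `.cubin` and `.names` files directly.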
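The write-to-temp-then-rename pattern from item 2 could look roughly like this sketch (function names assumed; the real code also embeds the PID in the temp name and switches on `_WIN32`). On POSIX, `std::rename` within one filesystem is atomic, so readers never observe a half-written cache file:

```cpp
#include <cstdio>
#include <fstream>
#include <sstream>
#include <string>
#include <thread>

// Write `contents` to a uniquely named temp file, then atomically move it
// into place. Concurrent writers each use their own temp file, so they can
// never corrupt each other; the last rename simply wins.
bool atomic_write(const std::string& final_path, const std::string& contents) {
    std::ostringstream tid;
    tid << std::this_thread::get_id();
    std::string tmp_path = final_path + ".tmp." + tid.str();
    {
        std::ofstream out(tmp_path, std::ios::binary);
        if (!out) return false;
        out.write(contents.data(), static_cast<std::streamsize>(contents.size()));
        if (!out) { std::remove(tmp_path.c_str()); return false; }
    }   // stream flushed and closed before the rename
    if (std::rename(tmp_path.c_str(), final_path.c_str()) != 0) {
        std::remove(tmp_path.c_str());  // clean up on failure (e.g. cross-device move)
        return false;
    }
    return true;
}
```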
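The context check described in item 3 could be sketched as below. The exact signature of `ensure_cuda_context()` is an assumption; the driver-API calls themselves (`cuCtxGetCurrent`, `cuDevicePrimaryCtxRetain`, `cuCtxSetCurrent`) are the standard way to bind a primary context:

```cuda
#include <cuda.h>
#include <stdexcept>

// Driver-API calls fail with CUDA_ERROR_INVALID_CONTEXT when the calling
// thread (e.g. a PyTorch DataLoader worker) has no current context. Binding
// the device's primary context fixes that without creating extra contexts,
// since the primary context is shared with the runtime API.
void ensure_cuda_context(CUdevice device) {
    CUcontext ctx = nullptr;
    if (cuCtxGetCurrent(&ctx) == CUDA_SUCCESS && ctx != nullptr)
        return;  // this thread already has an active context
    if (cuDevicePrimaryCtxRetain(&ctx, device) != CUDA_SUCCESS)
        throw std::runtime_error("cuDevicePrimaryCtxRetain failed");
    // Each retain must eventually be balanced by cuDevicePrimaryCtxRelease
    // at shutdown, or the context leaks until process exit.
    if (cuCtxSetCurrent(ctx) != CUDA_SUCCESS)
        throw std::runtime_error("cuCtxSetCurrent failed");
}
```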
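The cross-platform path handling in items 1 and 4 could be sketched as follows. The env-var name `OEQ_CACHE_PATH` and the default `~/.cache/openequivariance` come from the PR description; the helper names and fallbacks are assumptions:

```cpp
#include <cerrno>
#include <cstdlib>
#include <string>
#include <sys/stat.h>
#ifdef _WIN32
  #include <direct.h>   // _mkdir
#endif

// Resolve the cache directory: an explicit OEQ_CACHE_PATH wins, otherwise
// fall back to the user's home directory.
std::string resolve_cache_dir() {
    if (const char* custom = std::getenv("OEQ_CACHE_PATH"))
        return custom;
    const char* home = std::getenv("HOME");          // POSIX
#ifdef _WIN32
    if (!home) home = std::getenv("USERPROFILE");    // Windows fallback
#endif
    return std::string(home ? home : ".") + "/.cache/openequivariance";
}

// Create a directory, treating "already exists" as success so the call
// is idempotent across processes racing to create the cache dir.
bool make_dir(const std::string& path) {
#ifdef _WIN32
    return _mkdir(path.c_str()) == 0 || errno == EEXIST;
#else
    return mkdir(path.c_str(), 0755) == 0 || errno == EEXIST;
#endif
}
```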
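The leak class fixed in item 5 disappears entirely if the log buffer is owned by a smart pointer rather than a raw `new[]`. A minimal sketch, with hypothetical stand-ins for the NVRTC log-retrieval calls:

```cpp
#include <cstring>
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for nvrtcGetProgramLogSize / nvrtcGetProgramLog.
size_t fake_log_size() { return 16; }
void fake_get_log(char* buf) { std::strcpy(buf, "error: demo log"); }

// Instead of `char* log = new char[n]; ... throw;` (which leaks unless every
// throw site remembers `delete[] log;`), let unique_ptr free the buffer on
// every exit path, including exceptions thrown during stack unwinding.
void fetch_compile_log_and_throw() {
    size_t n = fake_log_size();
    std::unique_ptr<char[]> log(new char[n]);
    fake_get_log(log.get());
    std::string msg(log.get());  // copy the log before the buffer dies
    throw std::logic_error("NVRTC compilation failed: " + msg);
}
```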

🤖 Contribution Note

This PR was developed through a collaborative effort between the contributor and multiple AI systems (Gemini 3.1 Pro Preview, GPT-5.4 High & Claude 4.6 Thinking models). We have worked together to address complex technical edge cases (such as atomic renames and primary context retention) to ensure the long-term stability and performance of the OpenEquivariance library.

I have verified these changes on my local environment, and they successfully resolve the redundant JIT overhead described in Issue #175.

@Franklalalala
Author

Personally, I don't fully understand the underlying mechanism, though it does indeed fix the issue.

