Add GPU-accelerated conformer RMSD matrix computation #105
Conversation
Implement GetConformerRMSMatrix as a GPU-accelerated equivalent of
RDKit's AllChem.GetConformerRMSMatrix. One CUDA thread-block per
conformer pair computes optimal Kabsch alignment via closed-form 3x3
SVD (Cardano eigenvalues of H^T H), avoiding iterative solvers.
Output is a lower-triangle condensed distance matrix matching RDKit's
format, enabling a fully GPU-resident pipeline:
EmbedMolecules → MMFFOptimize → GetConformerRMSMatrix → butina()
New files:
src/conformer_rmsd.{h,cu} — kernel and C++ API
nvmolkit/conformerRmsd.{cpp,py} — Boost.Python binding and Python wrapper
nvmolkit/tests/test_conformer_rmsd.py — correctness tests vs RDKit reference
benchmarks/conformer_rmsd_bench.py — CPU vs GPU benchmark
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: andreivolgin <andrei@rebelation.com>
Import _arrayHelpers to register the Boost.Python PyArray converter, matching the pattern used by other modules (fingerprints, clustering, similarity). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Prealigned mode: compute RMSD on raw coordinates without centering, matching RDKit's prealigned=True behavior.
- Tests/benchmark: use deep copies for the RDKit reference computation, since GetConformerRMSMatrix(prealigned=False) modifies conformer coordinates in-place during sequential alignment.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
RDKit's GetConformerRMSMatrix modifies conformer coordinates in-place during sequential alignment, so results depend on pair ordering and are not purely pairwise. Switch to an independent numpy-based Kabsch RMSD reference that computes each pair from original coordinates, matching our GPU kernel's behavior. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace RDKit reference with independent numpy Kabsch SVD to avoid false negatives from RDKit's sequential coordinate mutation behavior. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace alkane chains with drug-like molecules spanning a wider range: aspirin (13 HA), celecoxib (24 HA), osimertinib (33 HA), and a cyclic pentapeptide (~48 HA). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| Filename | Overview |
|---|---|
| nvmolkit/conformerRmsd.cpp | Python/C++ binding layer — contains a signed integer overflow (nc * (nc - 1) / 2 in int) at line 67 for large conformer counts (nc ≈ 65 536) that corrupts the PyArray shape passed back to Python; line 88 of the same file correctly uses int64_t for the equivalent single-mol path. |
| src/conformer_rmsd.cu | Core CUDA kernel implementing per-pair Kabsch RMSD via Cardano eigenvalue decomposition of H^T·H; mathematics is sound, reflection handling via det(H) < 0 is correct, numerical guards (fmax, clamping) are appropriate, though fixed block size of 128 is suboptimal for small molecules. |
| src/conformer_rmsd_mol.cpp | Host-side mol-to-GPU pipeline; properly uses int64_t for overflow detection, initiates async device allocation before CPU coordinate packing, and handles edge cases (null mols, zero atoms, < 2 conformers) cleanly. |
| nvmolkit/tests/test_conformer_rmsd.py | 12 tests covering correctness vs numpy Kabsch, prealigned mode, edge cases (zero atoms, < 2 conformers, invalid types), streams, and both single/batch APIs; good coverage. |
| src/CMakeLists.txt | New conformer_rmsd and conformer_rmsd_mol targets are correctly structured, but cccl_interface (providing CUB / cuda::std::span) is not explicitly linked unlike every other CUDA kernel target in the repo that uses CUB. |
Comments Outside Diff (3)
- nvmolkit/conformerRmsd.cpp, line 67 (link): Integer overflow in `numPairs` computation

  `nc * (nc - 1)` is evaluated as a 32-bit signed `int` multiplication. For `nc = 65536`, the product is 4,294,901,760, which exceeds `INT_MAX` (2,147,483,647), causing signed integer overflow — undefined behavior in C++. On two's complement hardware this typically wraps to -65,536, so `numPairs` becomes -32,768, corrupting the shape passed to `makePyArray`.

  Note that the upstream `conformerRmsdBatchMatrixMol` does validate that the final pair count fits in `int` (using `int64_t` arithmetic in `conformer_rmsd_mol.cpp`), but it allows `nc = 65536` through because `numPairs = 2,147,450,880 ≤ INT_MAX`. The overflow therefore occurs silently in the binding code.

  Line 88 in the same file already applies the correct fix with `static_cast<int64_t>`. Apply the same pattern here.
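The wraparound described in this comment can be reproduced in numpy's fixed-width integer arithmetic, which wraps silently for arrays just as the C++ `int` path does. This is an illustration only; the actual fix is the `static_cast<int64_t>` pattern from line 88.

```python
import numpy as np

nc = np.array([65536], dtype=np.int32)

# 32-bit arithmetic: nc * (nc - 1) exceeds INT_MAX and wraps negative,
# so the derived pair count becomes a nonsense (corrupted) shape value.
pairs32 = (nc * (nc - 1)) // 2

# 64-bit arithmetic, analogous to widening before multiplying in C++:
nc64 = nc.astype(np.int64)
pairs64 = (nc64 * (nc64 - 1)) // 2

print(int(pairs32[0]))   # wrapped, negative
print(int(pairs64[0]))   # correct pair count
```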
- src/CMakeLists.txt, lines 132-134 (link): Missing `cccl_interface` dependency

  `conformer_rmsd.cu` includes `<cub/cub.cuh>` and `<cuda/std/span>` from CCCL. Every other target in this repo that uses CUB explicitly links `cccl_interface` (e.g. `similarity_kernels`, `morganFingerprintKernels`), but `conformer_rmsd` does not:

  ```cmake
  add_library(conformer_rmsd conformer_rmsd.cu)
  target_link_libraries(conformer_rmsd PUBLIC device_vector PRIVATE cuda_error_check)
  ```

  If `device_vector` does not transitively expose `cccl_interface`, builds may fail or silently pick up a different CCCL version from the system CUDA install. Consider explicitly adding `cccl_interface`, consistent with the rest of the codebase.
- src/conformer_rmsd.cu, lines 84-85 (link): Performance degrades severely for very small molecules

  `kRmsdBlockSize = 128` means every conformer pair dispatches exactly 128 threads regardless of atom count. For molecules with fewer than 128 heavy atoms (most drug-like molecules: aspirin has 13, celecoxib 26), most threads are idle on every loop iteration. All 11 `cub::BlockReduce` calls are still executed, and each requires a `__syncthreads()`, so the inter-thread synchronization cost is disproportionate to the actual work.

  The benchmark results confirm this pattern (celecoxib, 26 atoms, 200 conformers: 3.59 ms vs aspirin, 13 atoms, 200 conformers: 0.99 ms — the same number of pairs, yet 3.6× slower despite only 2× more atoms, suggesting overhead dominates).

  A smaller block size (e.g. 32) for small molecules would improve occupancy and reduce the per-pair reduction overhead. Consider exposing `kRmsdBlockSize` as a template parameter or choosing it adaptively at the launch site based on atom count.
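One adaptive scheme, sketched here in Python, rounds the atom count up to a warp multiple and clamps to the current maximum. The function name and bounds are assumptions for illustration, not code from this PR.

```python
def pick_block_size(num_atoms, warp=32, max_block=128):
    """Smallest warp multiple covering num_atoms, clamped to [warp, max_block]."""
    rounded = ((max(num_atoms, 1) + warp - 1) // warp) * warp
    return min(rounded, max_block)
```

On the CUDA side this would correspond to instantiating the kernel for a few fixed block sizes (e.g. 32/64/128 template parameters) and selecting one at the launch site.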
Last reviewed commit: "conformerRmsd: rever..."
For context on the real-world impact: we're processing ~200M molecule states in our conformer generation pipeline (embed → optimize → RMSD cluster → select). The RMSD matrix is currently the only stage that falls back to CPU. At ~2s saved per state, this kernel alone would save over 12 years of cumulative CPU time across our workload — and keep the entire pipeline on GPU.
- Fix async copy race condition: add cudaStreamSynchronize before hostCoords goes out of scope, matching the etkdg_impl.cpp pattern.
- Move the prealigned branch before centroid computation to avoid 6 wasted block reductions when centroids are not needed.
- Fix np.sign(0) edge case in the numpy Kabsch reference: treat det(H) == 0 (coplanar atoms) as +1 instead of zeroing the last singular value.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The cudaCheckError macro requires the nvMolKit namespace scope. Use direct CUDA API call with error check in the Boost.Python binding. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add a cj < 0 guard to the pair-index safety check in the CUDA kernel to close a theoretical OOB read from floating-point rounding.
- Fix docstring: "has no conformers" → "fewer than 2 conformers".
- Align the C++ binding with the Python wrapper: raise std::invalid_argument for ≤ 1 conformers instead of silently returning an empty array.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Graceful handling is better for batch pipelines processing millions of molecules where some may have failed embedding. Callers can check len(result) == 0 and skip clustering. Aligns with the original C++ intent and avoids crashing production workflows. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
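The caller-side pattern this enables might look like the following sketch. All names here are hypothetical; `cluster_fn` stands in for a Butina-style clustering call, and `rmsd_results` for the per-molecule condensed RMSD arrays.

```python
def cluster_batch(rmsd_results, cluster_fn):
    """Cluster per-molecule condensed RMSD matrices, skipping empty ones.

    Molecules with fewer than 2 conformers (e.g. failed embedding) yield
    an empty result; they are skipped instead of crashing the whole batch.
    """
    clusters = {}
    for mol_idx, rmsd in enumerate(rmsd_results):
        if len(rmsd) == 0:      # nothing to cluster for this molecule
            continue
        clusters[mol_idx] = cluster_fn(rmsd)
    return clusters
```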
Thank you for the contribution! This is a substantial addition — we'll review it thoroughly and follow up with any comments. We'll keep you updated as we work through it and determine the best approach for integration.
scal444
left a comment
Thanks for the contribution! Overall this is a useful acceleration that aligns well with nvMolKit's functionality, even though it wasn't initially on our radar. It's a very clean implementation and my code comments are minor.
One significant comment: I wonder if we're getting good GPU saturation. With many conformers (100+) there's probably saturation, but for smaller cases we typically need to send batches of molecules at a time, and calling singleApi(mol) repeatedly is less efficient than calling batchAPI([mol1, mol2, mol3]) multiple times. Adapting this to multiple molecules would require some minor preprocessing up front, and potentially an additional layer of indexing to match blocks to molecule ID and then to conformer ID. The output would be a list of result tensors.
Would your use cases match the multiple molecule multiple conformer case?
…safety, and edge cases

- Replace raw cudaMallocHost/cudaFreeHost with PinnedHostVector<double> for exception safety and consistency with the project's existing RAII helpers. The kernel is submitted before hostCoords goes out of scope, so no extra host-side latency is added to the GPU pipeline.
- Add a numAtoms == 0 guard that throws std::invalid_argument consistently, intentionally diverging from RDKit's inconsistent nan/ZeroDivisionError behavior.
- Increase benchmark correctness coverage from 50 to min(500, n_pairs) pairs.
- Add test_rmsd_zero_atoms documenting the intentional divergence from RDKit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds GetConformerRMSMatrixBatch(), which processes a list of molecules in a single kernel launch so conformer pairs from all molecules run concurrently. This improves GPU saturation for molecules with few conformers (e.g. simple inorganics) without changing the existing single-molecule API.

Design follows existing nvMolKit batch patterns:
- Flat pinned coordinate buffer with per-molecule coord/pair offset arrays, matching the substructure search approach for variable atom counts.
- Separate AsyncDeviceVector per molecule for output, returned as list[AsyncGpuResult] consistent with MMFFOptimizeMoleculesConfs.
- PinnedHostVector throughout for exception safety and DMA efficiency.
- Single kernel launch over totalPairs blocks; each block finds its molecule via a linear scan on pairOffsets (acceptable for current batch sizes).

Also adds a direct #include <vector> to conformerRmsd.cpp and a prealigned=True batch test to cover the hand-duplicated kernel path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… overflow
- Clamp the Cardano discriminant to fmax(..., 0.0) in symmetricEigenvalues3x3,
removing the incorrect fallback that set all eigenvalues to trace/3 for
near-degenerate inputs (e.g. near-planar conformer pairs with eigenvalue
structure like {4,4,0}), which silently produced wrong RMSD values.
- Compute numPairs as int64_t to prevent signed integer overflow for
numConfs > 65535 (int arithmetic overflows at ~2.1B). The kernel grid
launch is guarded with an explicit overflow check and cast to int, since
CUDA's grid dimension is bounded by INT_MAX regardless. Same fix applied
to the single-molecule binding, the batch binding, and conformerRmsdMatrixGpu.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
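As context for the discriminant clamp above: the kernel's closed-form eigenvalue step (Cardano's trigonometric method on the symmetric matrix HᵀH) can be sketched independently in numpy. This is an illustrative reference, not the kernel code; the function name is an assumption, and the clamp on the acos argument plays the role of the kernel's fmax guards for near-degenerate spectra such as {4, 4, 0}.

```python
import math
import numpy as np

def symmetric_eigenvalues_3x3(A):
    """Closed-form eigenvalues of a symmetric 3x3 matrix, descending order."""
    q = np.trace(A) / 3.0
    B = A - q * np.eye(3)
    p2 = float((B * B).sum())          # squared Frobenius norm of A - qI
    if p2 == 0.0:                      # scalar matrix: all eigenvalues equal q
        return np.array([q, q, q])
    p = math.sqrt(p2 / 6.0)
    r = np.linalg.det(B / p) / 2.0
    r = min(1.0, max(-1.0, r))         # clamp: rounding can push |r| past 1
    phi = math.acos(r) / 3.0
    e1 = q + 2.0 * p * math.cos(phi)                        # largest
    e3 = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)  # smallest
    e2 = 3.0 * q - e1 - e3             # trace identity gives the middle one
    return np.array([e1, e2, e3])
```

Without the clamp, rounding on near-degenerate inputs can push the acos argument fractionally outside [-1, 1] and produce NaN, which is the failure mode the kernel's guards exist to prevent.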
…and overflow guards

- Remove dead `disc` variable in symmetricEigenvalues3x3; near-degenerate inputs are handled by the existing fmax guards on sqrtPP and the acos argument
- Promote coordOffsets to size_t throughout: PinnedHostVector, host packing index (base), AsyncDeviceVector, and kernel parameter — prevents silent int overflow for large batches (> INT_MAX coordinate elements)
- Replace the O(numMols) linear scan with an O(log numMols) binary search in conformerRmsdBatchKernel for molecule lookup
- Guard the cumulative pairOffsets prefix sum against int overflow using an int64_t running total before narrowing back to int

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
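The binary-search molecule lookup can be sketched on the host side with Python's bisect. The exclusive-prefix layout of pairOffsets is an assumption for the sketch; note that bisect_right lands past duplicate offsets, so zero-pair molecules are skipped correctly.

```python
from bisect import bisect_right
from itertools import accumulate

def pair_offsets(pairs_per_mol):
    """Exclusive prefix sum: offsets[m] = first global pair index of mol m."""
    return [0] + list(accumulate(pairs_per_mol))[:-1]

def locate_pair(pair_idx, offsets):
    """Map a global pair index to (molecule index, local pair index)."""
    mol = bisect_right(offsets, pair_idx) - 1   # O(log numMols)
    return mol, pair_idx - offsets[mol]
```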
Document that pairOffsetsArr/totalPairs are intentionally int because the kernel launches one block per pair on the grid x-dimension (hardware limit 2^31-1 = INT_MAX), and that the int64_t accumulation exists to catch prefix-sum overflow before materializing into that launch/index type. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Addressed the latest round of feedback:
- Batch API (scal444) — Added …
- Dead …
- Coordinate offset overflow (Greptile) — Promoted …
- O(N) linear scan (Greptile) — Replaced with an O(log N) binary search in …
- Cumulative …
…cision notes

- Remove #undef BLOCK_REDUCE_SUM from inside the prealigned branch of conformerRmsdBatchKernel; the preprocessor processes it unconditionally, undefining the macro before the 11 uses in the non-prealigned path and causing a compilation failure
- Always allocate at least 1 element per molecule output buffer so that devRmsdPtrs never contains a null pointer; zero-pair molecules dispatch 0 blocks and never write to their slot
- Add comments at both inverse triangular-number formula sites documenting that pairIdx is bounded by INT_MAX (~2^31), which fits exactly in a 53-bit double significand

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
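The inverse triangular-number formula these comments refer to recovers a conformer pair (i, j), with i > j, from a condensed lower-triangle index k = i*(i-1)/2 + j. A Python sketch of the float-based inversion follows; it stays exact for k < 2^31 because such k fits in a double's 53-bit significand, and the guard mirrors the kind of rounding defense discussed in this thread.

```python
import math

def pair_from_index(k):
    """Invert k = i*(i-1)//2 + j to (i, j) with i > j >= 0."""
    # i is the largest integer with i*(i-1)/2 <= k (quadratic formula).
    i = int((1.0 + math.sqrt(1.0 + 8.0 * k)) / 2.0)
    if i * (i - 1) // 2 > k:        # guard against sqrt rounding up
        i -= 1
    j = k - i * (i - 1) // 2
    return i, j
```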
Fixed three issues from the latest Greptile pass: removed a premature …
Move the numPairs > INT_MAX guard to the binding layer, before the AsyncDeviceVector allocation, so a pathological conformer count produces a descriptive overflow_error rather than a CUDA OOM. The defensive check inside conformerRmsdMatrixGpu is retained. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The Kabsch computation duplication between the two kernels is noted. We're happy to extract a …
…e-aligned coords

AllChem.GetConformerRMSMatrix modifies conformer coordinates in-place during alignment, so reusing the same molecule across iterations measures already-aligned conformers and understates the true CPU cost. Deep-copy before each call to ensure each iteration starts from fresh coordinates.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…numMols==0

Extract a computePairRmsd() __device__ __forceinline__ helper containing the full RMSD computation path (prealigned fast-path, centroid reduction, cross-covariance accumulation, Cardano eigenvalue decomposition, reflection correction, final write). Both conformerRmsdKernel and conformerRmsdBatchKernel now resolve their coordinate pointers and output slot, then delegate to the helper, reducing each kernel body from ~80 lines to ~15.

BLOCK_REDUCE_SUM is now defined once at file scope and #undef'd after the helper, eliminating the duplicate macro definition.

Add an early `if (numMols <= 0) return;` guard at the top of conformerRmsdBatchKernel to prevent the binary search from reading pairOffsets[-1] if the kernel is ever called directly from C++ with numMols = 0.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Went ahead with the refactoring rather than deferring — the benefits are clear enough to include here: …
…odule

Move the duplicated _numpy_kabsch_rmsd function from test_conformer_rmsd.py and conformer_rmsd_bench.py into a new nvmolkit/tests/kabsch_reference.py module so there is a single source of truth for the numpy SVD reference implementation used in both correctness validation and benchmarking.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Thanks Andrei. Please see my comments from a few days ago. The only significant ones are moving business logic outside of the pybind layer where possible, and using cub::BlockReduce, as it's the NVIDIA standard for in-kernel sum reductions. The others are minor and optional; overall the change looks great.
Switch from the hand-rolled shared-memory tree reduction macro to cub::BlockReduce, the NVIDIA-standard in-kernel primitive. CUB performs intra-warp reduction via shuffle instructions before the cross-warp shared-memory stage, significantly reducing the number of __syncthreads() calls per reduction (from ~8 to ~2) at no extra cost.

With shared memory now managed entirely inside computePairRmsd(), the two kernels no longer declare any __shared__ variables of their own — all allocation is handled by the inlined helper.

Also pull out the repeated 2.0*sqrtPP subexpression in symmetricEigenvalues3x3 and the pairwise a00*a11 / a00*a22 / a11*a22 products shared between the q and r terms, which nvcc does not always CSE automatically.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…mol)
Add src/conformer_rmsd_mol.{h,cpp} — a regular C++ library (following the
morganFingerprint/morganFingerprintKernels pattern) that owns all RDKit
preprocessing and device coordination:
conformerRmsdMatrixMol — single molecule
conformerRmsdBatchMatrixMol — batch of molecules
Both functions:
- Validate inputs and compute metadata
- Allocate device buffers before filling host buffers so that async GPU
memory allocation overlaps with CPU coordinate extraction
- Call the existing device-span kernel entry points
- Return AsyncDeviceVector<double> (single) or vector thereof (batch)
The pybind layer (conformerRmsd.cpp) is now minimal: acquire stream,
extract mol pointers, call the C++ mol overload, wrap result in PyArray.
CMakeLists: add conformer_rmsd_mol library in src/; link _conformerRmsd
to conformer_rmsd_mol (which transitively exposes conformer_rmsd).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…edback

Add a 'Differences from RDKit' section noting the consistent ValueError for zero-atom molecules (vs RDKit's nan/ZeroDivisionError split). Update the example to show the RDKit equivalent call and drop the now-unnecessary explicit torch.cuda.synchronize() (the result is an AsyncGpuResult).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
nvmolkit/tests is not a Python package (no __init__.py), so importing from nvmolkit.tests.kabsch_reference breaks test collection with ModuleNotFoundError. Restore _numpy_kabsch_rmsd as a local function in each file rather than a shared module, per reviewer preference. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Thanks for the contribution, Andrei!
Summary
Adds `GetConformerRMSMatrix` — a GPU-accelerated equivalent of RDKit's `AllChem.GetConformerRMSMatrix` that computes pairwise Kabsch-aligned RMSD for all conformer pairs on the GPU.

- With `prealigned=True`, skips Kabsch alignment and computes raw coordinate RMSD (no centering), matching RDKit's behavior.
- Returns an `AsyncGpuResult` with `__cuda_array_interface__` support, consistent with the existing nvMolKit pattern.
- Adds `benchmarks/conformer_rmsd_bench.py` with correctness validation against numpy Kabsch SVD.

Benchmark results (RTX 4090, CUDA 12.4)
Fits the existing GPU pipeline
The result can be passed directly to `butina()` for GPU-accelerated conformer clustering, keeping the entire generate → optimize → RMSD → cluster pipeline on the GPU.

Note on RDKit compatibility
Results may differ slightly from RDKit's `GetConformerRMSMatrix(prealigned=False)` because RDKit modifies conformer coordinates in-place during sequential alignment (each `AlignMol` call updates the probe conformer, affecting subsequent pairs). This kernel computes each pair independently from the original coordinates, which is the mathematically correct pairwise Kabsch RMSD. Tests validate against an independent numpy SVD-based Kabsch reference.

Test plan
🤖 Generated with Claude Code