Add GPU-accelerated conformer RMSD matrix computation#105

Merged
scal444 merged 24 commits into NVIDIA-Digital-Bio:main from Novel-Therapeutics:feature/conformer-rmsd-matrix
Mar 23, 2026

Conversation


@volgin volgin commented Mar 7, 2026

Summary

Adds GetConformerRMSMatrix — a GPU-accelerated equivalent of RDKit's AllChem.GetConformerRMSMatrix that computes pairwise Kabsch-aligned RMSD for all conformer pairs on the GPU.

  • CUDA kernel: One thread-block (128 threads) per conformer pair. Computes centroids, cross-covariance matrix H, eigenvalues of H^T·H via Cardano's analytical formula for 3×3 symmetric matrices, singular values, reflection handling via det(H), and RMSD — all in a single kernel launch.
  • Prealigned mode: When prealigned=True, skips Kabsch alignment and computes raw coordinate RMSD (no centering), matching RDKit's behavior.
  • Async API: Returns AsyncGpuResult with __cuda_array_interface__ support, consistent with the existing nvMolKit pattern.
  • Benchmark included: benchmarks/conformer_rmsd_bench.py with correctness validation against numpy Kabsch SVD.
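
The numpy Kabsch SVD reference used for correctness validation can be sketched as follows. This is an illustrative reimplementation, not the repository's actual `_numpy_kabsch_rmsd` helper; the function name and exact structure here are assumptions:

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimal rotation
    (Kabsch), computed independently for each pair of conformers."""
    P = P - P.mean(axis=0)          # center both point sets
    Q = Q - Q.mean(axis=0)
    H = P.T @ Q                     # 3x3 cross-covariance matrix
    U, S, Vt = np.linalg.svd(H)
    if np.linalg.det(H) < 0:        # reflection correction
        S[-1] = -S[-1]
    # RMSD from singular values, without forming the rotation matrix:
    # sum |R P_i - Q_i|^2 = sum|P|^2 + sum|Q|^2 - 2 * sum(S)
    msd = (np.sum(P**2) + np.sum(Q**2) - 2.0 * S.sum()) / len(P)
    return np.sqrt(max(msd, 0.0))   # clamp tiny negative rounding error
```

This mirrors the structure described for the kernel: centroid removal, cross-covariance H, singular values, and a sign flip when det(H) is negative.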

Benchmark results (RTX 4090, CUDA 12.4)

| Molecule | Heavy atoms | Conformers | Pairs | CPU (ms) | GPU (ms) | Speedup |
|---|---|---|---|---|---|---|
| Aspirin | 13 | 100 | 4,950 | 26.6 | 0.28 | 94x |
| Aspirin | 13 | 200 | 19,900 | 121.5 | 0.99 | 122x |
| Celecoxib | 26 | 100 | 4,950 | 43.1 | 0.29 | 147x |
| Celecoxib | 26 | 200 | 19,900 | 190.8 | 3.59 | 53x |
| Osimertinib | 37 | 200 | 19,900 | 252.7 | 3.72 | 68x |
| Osimertinib | 37 | 500 | 124,750 | 1,889 | 21.9 | 86x |
| Cyclic pentapeptide | 48 | 200 | 19,900 | 316.1 | 3.95 | 80x |
| Cyclic pentapeptide | 48 | 500 | 124,750 | 2,276 | 19.0 | 120x |

Fits the existing GPU pipeline

The result can be passed directly to butina() for GPU-accelerated conformer clustering, keeping the entire generate → optimize → RMSD → cluster pipeline on the GPU:

from nvmolkit.conformerRmsd import GetConformerRMSMatrix
from nvmolkit.clustering import butina

rmsd = GetConformerRMSMatrix(mol, prealigned=False)
torch.cuda.synchronize()
n = mol.GetNumConformers()
square = torch.zeros(n, n, device='cuda', dtype=torch.float64)
idx = torch.tril_indices(n, n, offset=-1)
square[idx[0], idx[1]] = rmsd.torch()
square = square + square.T
clusters = butina(square, cutoff=0.5)

Note on RDKit compatibility

Results may differ slightly from RDKit's GetConformerRMSMatrix(prealigned=False) because RDKit modifies conformer coordinates in-place during sequential alignment (each AlignMol call updates the probe conformer, affecting subsequent pairs). This kernel computes each pair independently from the original coordinates, which is the mathematically correct pairwise Kabsch RMSD. Tests validate against an independent numpy SVD-based Kabsch reference.

Test plan

  • 12 tests covering correctness (vs numpy Kabsch SVD), prealigned mode, large conformer sets, edge cases, streams, and invalid inputs
  • Tested on RTX 4090 (CUDA 12.4)
  • Benchmark validates correctness across 4 drug-like molecules (13–48 heavy atoms) and 10–500 conformers
  • CI validation on NVIDIA Blossom

🤖 Generated with Claude Code

volgin and others added 6 commits March 6, 2026 18:29
Implement GetConformerRMSMatrix as a GPU-accelerated equivalent of
RDKit's AllChem.GetConformerRMSMatrix. One CUDA thread-block per
conformer pair computes optimal Kabsch alignment via closed-form 3x3
SVD (Cardano eigenvalues of H^T H), avoiding iterative solvers.

Output is a lower-triangle condensed distance matrix matching RDKit's
format, enabling a fully GPU-resident pipeline:
  EmbedMolecules → MMFFOptimize → GetConformerRMSMatrix → butina()

New files:
  src/conformer_rmsd.{h,cu}           — kernel and C++ API
  nvmolkit/conformerRmsd.{cpp,py}     — Boost.Python binding and Python wrapper
  nvmolkit/tests/test_conformer_rmsd.py — correctness tests vs RDKit reference
  benchmarks/conformer_rmsd_bench.py   — CPU vs GPU benchmark

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: andreivolgin <andrei@rebelation.com>
Import _arrayHelpers to register the Boost.Python PyArray converter,
matching the pattern used by other modules (fingerprints, clustering, similarity).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Prealigned mode: compute RMSD on raw coordinates without centering,
  matching RDKit's prealigned=True behavior.
- Tests/benchmark: use deep copies for RDKit reference computation since
  GetConformerRMSMatrix(prealigned=False) modifies conformer coordinates
  in-place during sequential alignment.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
RDKit's GetConformerRMSMatrix modifies conformer coordinates in-place
during sequential alignment, so results depend on pair ordering and
are not purely pairwise. Switch to an independent numpy-based Kabsch
RMSD reference that computes each pair from original coordinates,
matching our GPU kernel's behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace RDKit reference with independent numpy Kabsch SVD to avoid
false negatives from RDKit's sequential coordinate mutation behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace alkane chains with drug-like molecules spanning a wider range:
aspirin (13 HA), celecoxib (24 HA), osimertinib (33 HA), and a cyclic
pentapeptide (~48 HA).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

greptile-apps Bot commented Mar 7, 2026

Greptile Summary

This PR adds GetConformerRMSMatrix — a GPU-accelerated replacement for RDKit's AllChem.GetConformerRMSMatrix — along with a batch variant and benchmark. The CUDA kernel is mathematically sound: it uses Cardano's analytical formula to compute eigenvalues of H^T·H (avoiding a full SVD on-device), handles reflections via det(H) < 0, and properly guards against numerical edge cases. The host plumbing (async allocation, pinned-memory H2D transfer, stream threading) follows established nvMolKit patterns.

Key findings:

  • Signed integer overflow in batch binding (nvmolkit/conformerRmsd.cpp:67): nc * (nc - 1) / 2 uses int arithmetic; for nc = 65536 the intermediate product nc * (nc - 1) = 4,294,901,760 exceeds INT_MAX, causing undefined behavior and producing a corrupt PyArray shape. The single-mol path on line 88 of the same file already applies the correct int64_t cast — the batch path should match.
  • cccl_interface not linked in src/CMakeLists.txt for the new conformer_rmsd target, unlike every other CUDA kernel target in the repo that uses CUB / cuda::std::span. This may work via transitive dependencies through device_vector today but is inconsistent and fragile.
  • Fixed block size of 128 is suboptimal for small molecules (< 128 heavy atoms, which is the majority of drug-like inputs). Most benchmarked molecules have 13–48 atoms, so many threads are idle and all 11 BlockReduce calls still incur full synchronization overhead regardless.
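
Cardano's analytical eigenvalue formula for a 3×3 symmetric matrix, as referenced above, can be sketched in numpy. This is a textbook formulation for illustration; the kernel's symmetricEigenvalues3x3 differs in details such as its fmax guards:

```python
import math
import numpy as np

def sym_eigvals_3x3(A):
    """Eigenvalues of a symmetric 3x3 matrix via Cardano's
    trigonometric formula, returned in descending order."""
    q = np.trace(A) / 3.0
    B = A - q * np.eye(3)
    p = math.sqrt(max(np.sum(B * B) / 6.0, 0.0))
    if p < 1e-30:                      # A is (numerically) a multiple of I
        return np.array([q, q, q])
    r = np.linalg.det(B / p) / 2.0
    r = min(1.0, max(-1.0, r))         # clamp the acos argument
    phi = math.acos(r) / 3.0
    e1 = q + 2.0 * p * math.cos(phi)                        # largest
    e3 = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)  # smallest
    e2 = 3.0 * q - e1 - e3             # from the trace invariant
    return np.array([e1, e2, e3])
```

Applied to H^T·H, the square roots of these eigenvalues give the singular values of H without an iterative on-device SVD.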

Confidence Score: 3/5

  • Safe to merge after fixing the integer overflow in the batch binding; the overflow is unreachable for typical drug-discovery workloads but is undefined behavior and would produce wrong results at ~65k conformers.
  • The CUDA math is correct and well-tested against an independent numpy Kabsch reference. The single-mol Python API is clean. However, the signed integer overflow in the batch binding path is a real correctness bug (UB in C++, wrong PyArray shape at the boundary case) that should be fixed before merging. The missing cccl_interface link is a build-consistency issue.
  • nvmolkit/conformerRmsd.cpp (integer overflow at line 67) and src/CMakeLists.txt (missing cccl_interface for conformer_rmsd target).

Important Files Changed

Important files changed:

  • nvmolkit/conformerRmsd.cpp: Python/C++ binding layer — contains a signed integer overflow (nc * (nc - 1) / 2 in int) at line 67 for large conformer counts (nc ≈ 65,536) that corrupts the PyArray shape passed back to Python; line 88 of the same file correctly uses int64_t for the equivalent single-mol path.
  • src/conformer_rmsd.cu: Core CUDA kernel implementing per-pair Kabsch RMSD via Cardano eigenvalue decomposition of H^T·H; the mathematics is sound, reflection handling via det(H) < 0 is correct, and the numerical guards (fmax, clamping) are appropriate, though the fixed block size of 128 is suboptimal for small molecules.
  • src/conformer_rmsd_mol.cpp: Host-side mol-to-GPU pipeline; properly uses int64_t for overflow detection, initiates async device allocation before CPU coordinate packing, and handles edge cases (null mols, zero atoms, < 2 conformers) cleanly.
  • nvmolkit/tests/test_conformer_rmsd.py: 12 tests covering correctness vs numpy Kabsch, prealigned mode, edge cases (zero atoms, < 2 conformers, invalid types), streams, and both single/batch APIs; good coverage.
  • src/CMakeLists.txt: New conformer_rmsd and conformer_rmsd_mol targets are correctly structured, but cccl_interface (providing CUB / cuda::std::span) is not explicitly linked, unlike every other CUDA kernel target in the repo that uses CUB.

Comments Outside Diff (3)

  1. nvmolkit/conformerRmsd.cpp, line 67 (link)

    Integer overflow in numPairs computation

    nc * (nc - 1) is evaluated as a 32-bit signed int multiplication. For nc = 65536, the product is 4,294,901,760, which exceeds INT_MAX (2,147,483,647), causing signed integer overflow — undefined behavior in C++. On two's complement hardware this typically wraps to -65,536, so numPairs becomes -32,768, corrupting the shape passed to makePyArray.

    Note that the upstream conformerRmsdBatchMatrixMol does validate that the final pair count fits in int (using int64_t arithmetic in conformer_rmsd_mol.cpp), but it allows nc = 65536 through because numPairs = 2,147,450,880 ≤ INT_MAX. The overflow therefore occurs silently in the binding code.

    Line 88 in the same file already applies the correct fix with static_cast<int64_t>; apply the same pattern here.
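
The wraparound can be reproduced by simulating 32-bit arithmetic with numpy (illustrative only; the actual fix is the static_cast<int64_t> widening in the C++ binding):

```python
import numpy as np

nc = 65536
with np.errstate(over="ignore"):                   # suppress the overflow warning
    bad = (np.int32(nc) * np.int32(nc - 1)) // np.int32(2)
good = (np.int64(nc) * np.int64(nc - 1)) // 2      # widen before multiplying

print(int(bad))    # -32768: the corrupt numPairs described above
print(int(good))   # 2147450880: the correct pair count, still <= INT_MAX
```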

  2. src/CMakeLists.txt, line 132-134 (link)

    Missing cccl_interface dependency

    conformer_rmsd.cu includes <cub/cub.cuh> and <cuda/std/span> from CCCL. Every other target in this repo that uses CUB explicitly links cccl_interface (e.g. similarity_kernels, morganFingerprintKernels), but conformer_rmsd does not:

    add_library(conformer_rmsd conformer_rmsd.cu)
    target_link_libraries(conformer_rmsd PUBLIC device_vector PRIVATE cuda_error_check)

    If device_vector does not transitively expose cccl_interface, builds may fail or silently pick up a different CCCL version from the system CUDA install. Consider explicitly linking cccl_interface, consistent with the rest of the codebase.

  3. src/conformer_rmsd.cu, line 84-85 (link)

    Performance degrades severely for very small molecules

    kRmsdBlockSize = 128 means every conformer pair dispatches exactly 128 threads regardless of atom count. For molecules with fewer than 128 heavy atoms (most drug-like molecules: aspirin has 13, celecoxib 26), most threads are idle on every loop iteration. All 11 cub::BlockReduce calls are still executed, and each requires a __syncthreads(), meaning the inter-thread synchronization cost is disproportionate to the actual work.

    The benchmark results are consistent with this (at 100 conformers, Celecoxib with 26 atoms takes 0.29 ms vs Aspirin with 13 atoms at 0.28 ms — doubling the atom count barely changes the runtime, suggesting fixed per-pair reduction overhead dominates the actual per-atom work).

    A smaller block size (e.g., 32) for small molecules would improve occupancy and reduce the per-pair reduction overhead. Consider exposing kRmsdBlockSize as a template parameter or choosing it adaptively at the launch site based on atom count.

Last reviewed commit: "conformerRmsd: rever..."


volgin commented Mar 7, 2026

For context on the real-world impact: we're processing ~200M molecule states in our conformer generation pipeline (embed → optimize → RMSD cluster → select). The RMSD matrix is currently the only stage that falls back to CPU. At ~2s saved per state, this kernel alone would save over 12 years of cumulative CPU time across our workload — and keep the entire pipeline on GPU.

volgin and others added 4 commits March 7, 2026 00:28
- Fix async copy race condition: add cudaStreamSynchronize before
  hostCoords goes out of scope, matching the etkdg_impl.cpp pattern.
- Move prealigned branch before centroid computation to avoid 6
  wasted block reductions when centroids are not needed.
- Fix np.sign(0) edge case in numpy Kabsch reference: treat
  det(H) == 0 (coplanar atoms) as +1 instead of zeroing the
  last singular value.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The cudaCheckError macro requires the nvMolKit namespace scope. Use
direct CUDA API call with error check in the Boost.Python binding.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add cj < 0 guard to pair-index safety check in CUDA kernel to
  close a theoretical OOB read from floating-point rounding.
- Fix docstring: "has no conformers" → "fewer than 2 conformers".
- Align C++ binding with Python wrapper: raise std::invalid_argument
  for ≤1 conformers instead of silently returning empty array.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Graceful handling is better for batch pipelines processing millions
of molecules where some may have failed embedding. Callers can check
len(result) == 0 and skip clustering. Aligns with the original C++
intent and avoids crashing production workflows.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@evasnow1992 (Collaborator) commented:

Thank you for the contribution! This is a substantial addition — we'll review it thoroughly and follow up with any comments. We'll keep you updated as we work through it and determine the best approach for integration.

@scal444 scal444 left a comment

Thanks for the contribution! Overall this is a useful acceleration that aligns well with nvMolKit's functionality, even though it wasn't initially on our radar. It's a very clean implementation and my code comments are minor.

One significant comment: I wonder if we're getting good GPU saturation. With many conformers (100+) there's probably saturation, but for smaller cases we typically need to send batches of molecules at a time, and calling singleApi(mol) repeatedly is less efficient than calling batchAPI([mol1, mol2, mol3]) fewer times. Adapting this to multiple molecules would require some minor preprocessing up front, and potentially an additional layer of indexing to match blocks to molecule ID and then to conformer ID. The output would be a list of result tensors.

Would your use cases match the multiple-molecule, multiple-conformer case?

volgin and others added 5 commits March 17, 2026 00:11
…safety, and edge cases

- Replace raw cudaMallocHost/cudaFreeHost with PinnedHostVector<double> for
  exception safety and consistency with the project's existing RAII helpers.
  Kernel is submitted before hostCoords goes out of scope, so no extra host-side
  latency is added to the GPU pipeline.
- Add numAtoms == 0 guard that throws std::invalid_argument consistently,
  intentionally diverging from RDKit's inconsistent nan/ZeroDivisionError behavior.
- Increase benchmark correctness coverage from 50 to min(500, n_pairs) pairs.
- Add test_rmsd_zero_atoms documenting the intentional divergence from RDKit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds GetConformerRMSMatrixBatch(), which processes a list of molecules in
a single kernel launch so conformer pairs from all molecules run concurrently.
This improves GPU saturation for molecules with few conformers (e.g. simple
inorganics) without changing the existing single-molecule API.

Design follows existing nvMolKit batch patterns:
- Flat pinned coordinate buffer with per-molecule coord/pair offset arrays,
  matching the substructure search approach for variable atom counts.
- Separate AsyncDeviceVector per molecule for output, returned as
  list[AsyncGpuResult] consistent with MMFFOptimizeMoleculesConfs.
- PinnedHostVector throughout for exception safety and DMA efficiency.
- Single kernel launch over totalPairs blocks; each block finds its molecule
  via a linear scan on pairOffsets (acceptable for current batch sizes).

Also adds direct #include <vector> to conformerRmsd.cpp and a prealigned=True
batch test to cover the hand-duplicated kernel path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
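
The flat-buffer layout described in this commit can be illustrated in Python (hypothetical helper name pack_batch; the real host code packs into a PinnedHostVector and tracks per-molecule conformer counts separately):

```python
import numpy as np

def pack_batch(coords_per_mol):
    """coords_per_mol: list of (numConfs, numAtoms, 3) arrays, one per
    molecule. Returns the flat coordinate buffer plus per-molecule
    coordinate and pair offsets, mirroring the batch kernel's indexing."""
    coord_offsets, pair_offsets, flat = [], [], []
    coord_base, pair_base = 0, 0
    for c in coords_per_mol:
        n_confs = c.shape[0]
        coord_offsets.append(coord_base)           # in elements, not bytes
        pair_offsets.append(pair_base)             # first global pair index
        flat.append(c.reshape(-1))
        coord_base += c.size
        pair_base += n_confs * (n_confs - 1) // 2  # pairs this mol contributes
    return np.concatenate(flat), coord_offsets, pair_offsets, pair_base
```

Each block then uses its global pair index plus these offset arrays to locate its molecule's coordinates and output slot.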
… overflow

- Clamp the Cardano discriminant to fmax(..., 0.0) in symmetricEigenvalues3x3,
  removing the incorrect fallback that set all eigenvalues to trace/3 for
  near-degenerate inputs (e.g. near-planar conformer pairs with eigenvalue
  structure like {4,4,0}), which silently produced wrong RMSD values.

- Compute numPairs as int64_t to prevent signed integer overflow for
  numConfs > 65535 (int arithmetic overflows at ~2.1B). The kernel grid
  launch is guarded with an explicit overflow check and cast to int, since
  CUDA's grid dimension is bounded by INT_MAX regardless. Same fix applied
  to the single-molecule binding, the batch binding, and conformerRmsdMatrixGpu.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…and overflow guards

- Remove dead `disc` variable in symmetricEigenvalues3x3; near-degenerate
  inputs are handled by the existing fmax guards on sqrtPP and acos argument
- Promote coordOffsets to size_t throughout: PinnedHostVector, host packing
  index (base), AsyncDeviceVector, and kernel parameter — prevents silent
  int overflow for large batches (>INT_MAX coordinate elements)
- Replace O(numMols) linear scan with O(log numMols) binary search in
  conformerRmsdBatchKernel for molecule lookup
- Guard cumulative pairOffsets prefix sum against int overflow using int64_t
  running total before narrowing back to int

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
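
The binary-search molecule lookup can be sketched with Python's bisect, assuming pairOffsets stores each molecule's first global pair index (an exclusive prefix sum of per-molecule pair counts):

```python
import bisect

def molecule_for_pair(pair_idx, pair_offsets):
    """Given exclusive prefix sums of per-molecule pair counts, find
    which molecule owns a global pair index in O(log numMols)."""
    return bisect.bisect_right(pair_offsets, pair_idx) - 1

# pair_offsets for per-molecule pair counts [3, 6, 1] -> [0, 3, 9]
offsets = [0, 3, 9]
print(molecule_for_pair(0, offsets))   # 0
print(molecule_for_pair(3, offsets))   # 1
print(molecule_for_pair(9, offsets))   # 2
```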
Document that pairOffsetsArr/totalPairs are intentionally int because the
kernel launches one block per pair on the grid x-dimension (hardware limit
2^31-1 = INT_MAX), and that the int64_t accumulation exists to catch
prefix-sum overflow before materializing into that launch/index type.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

volgin commented Mar 17, 2026

Addressed the latest round of feedback:

Batch API (scal444) — Added GetConformerRMSMatrixBatch for processing multiple molecules in a single kernel launch. Follows existing nvMolKit batch patterns: flat coordinate buffer with per-molecule offset arrays, one block per conformer pair across all molecules, array of per-molecule device output pointers. Python wrapper validates inputs, accepts an optional CUDA stream, and returns list[AsyncGpuResult]. Tests cover: batch vs. single-molecule parity, mixed conformer counts, prealigned=True, empty input, None molecule, and explicit stream.

Dead disc variable (Greptile) — Removed. The variable was computed but never used after the else fallback branch was deleted in the previous round. Near-degenerate eigenvalue inputs are handled entirely by the existing fmax guards on sqrtPP and the acos argument.

Coordinate offset overflow (Greptile) — Promoted coordOffsets from int to size_t end-to-end: PinnedHostVector, the host packing index (base), AsyncDeviceVector, the kernel parameter, and the header declaration. Also fixed a narrowing bug in the host packing loop where the index expression coordOffsetsArr[m] + confIdx * na * 3 + a * 3 was being computed in int despite coordOffsetsArr[m] being size_t — the intermediate products are now widened via static_cast<size_t> before addition.

O(N) linear scan (Greptile) — Replaced with an O(log N) binary search in conformerRmsdBatchKernel.

Cumulative pairOffsets overflow (scal444) — pairOffsets and totalPairs are intentionally 32-bit: this kernel launches one block per pair on the grid x-dimension, whose hardware limit is 2^31 - 1 (INT_MAX). The int64_t running sum exists solely to detect prefix-sum overflow before materializing into that launch/index type — an unchecked overflow would silently route blocks to the wrong molecule or write out of bounds. Added a clarifying comment in the source.
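
The guarded prefix-sum construction can be sketched in Python (illustrative; Python's arbitrary-precision integers stand in for the C++ int64_t accumulator, and the function name is hypothetical):

```python
INT_MAX = 2**31 - 1   # CUDA grid x-dimension limit, and the int32 index type

def pair_offsets_checked(conf_counts):
    """Exclusive prefix sum of per-molecule pair counts, accumulated in
    wide arithmetic and checked before narrowing to the 32-bit
    launch/index type."""
    offsets, total = [], 0
    for nc in conf_counts:
        offsets.append(total)
        total += nc * (nc - 1) // 2
        if total > INT_MAX:
            raise OverflowError("total pair count exceeds grid x-dimension limit")
    return offsets, total
```

An unchecked narrowing here is exactly the failure mode described above: blocks silently routed to the wrong molecule or writing out of bounds.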

…cision notes

- Remove #undef BLOCK_REDUCE_SUM from inside the prealigned branch of
  conformerRmsdBatchKernel; the preprocessor processes it unconditionally,
  undefining the macro before the 11 uses in the non-prealigned path and
  causing a compilation failure
- Always allocate at least 1 element per molecule output buffer so that
  devRmsdPtrs never contains a null pointer; zero-pair molecules dispatch
  0 blocks and never write to their slot
- Add comments at both inverse triangular-number formula sites documenting
  that pairIdx is bounded by INT_MAX (~2^31), which fits exactly in a
  53-bit double significand

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
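
The inverse triangular-number mapping those comments document can be sketched as follows (one common formulation for the lower-triangle pair ordering (1,0), (2,0), (2,1), (3,0), ...; the kernel's exact expression and rounding guards may differ):

```python
import math

def pair_to_indices(pair_idx):
    """Map a flat lower-triangle pair index k to (i, j) with i > j.
    Uses a double-precision sqrt; exact because pair_idx <= INT_MAX
    fits comfortably in a 53-bit significand."""
    i = int((1.0 + math.sqrt(1.0 + 8.0 * pair_idx)) / 2.0)
    j = pair_idx - i * (i - 1) // 2
    if j < 0:               # guard: rounding pushed i one too high
        i -= 1
        j = pair_idx - i * (i - 1) // 2
    elif j >= i:            # guard: rounding left i one too low
        i += 1
        j = pair_idx - i * (i - 1) // 2
    return i, j
```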

volgin commented Mar 17, 2026

Fixed three issues from the latest Greptile pass: removed a premature #undef BLOCK_REDUCE_SUM inside the prealigned branch of conformerRmsdBatchKernel (the preprocessor processes it unconditionally, breaking compilation of the non-prealigned path); replaced nullptr output pointers for zero-pair molecules with a guaranteed valid device allocation; added precision comments at both inverse triangular-number formula sites.

Move the numPairs > INT_MAX guard to the binding layer, before the
AsyncDeviceVector allocation, so a pathological conformer count produces
a descriptive overflow_error rather than a CUDA OOM. The defensive check
inside conformerRmsdMatrixGpu is retained.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

volgin commented Mar 17, 2026

The Kabsch computation duplication between the two kernels is noted. We're happy to extract a __device__ helper in this PR if you'd prefer to keep it together, but can also defer it to a follow-up PR to keep this review focused. Let us know which you'd prefer.

…e-aligned coords

AllChem.GetConformerRMSMatrix modifies conformer coordinates in-place during
alignment, so reusing the same molecule across iterations measures already-
aligned conformers and understates the true CPU cost. Deep-copy before each
call to ensure each iteration starts from fresh coordinates.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…numMols==0

Extract computePairRmsd() __device__ __forceinline__ helper containing the
full RMSD computation path (prealigned fast-path, centroid reduction,
cross-covariance accumulation, Cardano eigenvalue decomposition, reflection
correction, final write). Both conformerRmsdKernel and
conformerRmsdBatchKernel now resolve their coordinate pointers and output
slot, then delegate to the helper — reducing each kernel body from ~80 lines
to ~15.

BLOCK_REDUCE_SUM is now defined once at file scope and #undef'd after the
helper, eliminating the duplicate macro definition.

Add an early `if (numMols <= 0) return;` guard at the top of
conformerRmsdBatchKernel to prevent the binary search from reading
pairOffsets[-1] if the kernel is ever called directly from C++ with numMols=0.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

volgin commented Mar 18, 2026

Went ahead with the refactoring rather than deferring — the benefits are clear enough to include here: computePairRmsd() is now a single __device__ __forceinline__ helper that contains the full RMSD computation path (prealigned fast-path, centroid reduction, cross-covariance accumulation, Cardano eigenvalue decomposition, reflection correction). Both kernels resolve their coordinate pointers and output slot, then delegate to it — reducing each kernel body from ~80 lines to ~15. BLOCK_REDUCE_SUM is defined once at file scope and #undef'd after the helper, removing the duplicate macro definition. Also added the if (numMols <= 0) return; guard at the top of conformerRmsdBatchKernel.

…odule

Move the duplicated _numpy_kabsch_rmsd function from test_conformer_rmsd.py
and conformer_rmsd_bench.py into a new nvmolkit/tests/kabsch_reference.py
module so there is a single source of truth for the numpy SVD reference
implementation used in both correctness validation and benchmarking.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

scal444 commented Mar 19, 2026

Thanks Andrei. Please see my comments from a few days ago. The only significant ones are moving business logic outside of the pybind layer where possible, and using cub::blockreduce as it's the nvidia standard for in-kernel sum reductions. The others are minor and optional, overall the change looks great.

volgin and others added 3 commits March 19, 2026 09:22
Switch from the hand-rolled shared-memory tree reduction macro to
cub::BlockReduce, the NVIDIA-standard in-kernel primitive. CUB performs
intra-warp reduction via shuffle instructions before the cross-warp
shared-memory stage, significantly reducing the number of __syncthreads()
calls per reduction (from ~8 to ~2) at the cost of nothing.

With shared memory now managed entirely inside computePairRmsd(), the two
kernels no longer declare any __shared__ variables of their own — all
allocation is handled by the inlined helper.

Also pull out the repeated 2.0*sqrtPP subexpression in
symmetricEigenvalues3x3 and the pairwise a00*a11 / a00*a22 / a11*a22
products shared between the q and r terms, which nvcc does not always CSE
automatically.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…mol)

Add src/conformer_rmsd_mol.{h,cpp} — a regular C++ library (following the
morganFingerprint/morganFingerprintKernels pattern) that owns all RDKit
preprocessing and device coordination:

  conformerRmsdMatrixMol    — single molecule
  conformerRmsdBatchMatrixMol — batch of molecules

Both functions:
  - Validate inputs and compute metadata
  - Allocate device buffers before filling host buffers so that async GPU
    memory allocation overlaps with CPU coordinate extraction
  - Call the existing device-span kernel entry points
  - Return AsyncDeviceVector<double> (single) or vector thereof (batch)

The pybind layer (conformerRmsd.cpp) is now minimal: acquire stream,
extract mol pointers, call the C++ mol overload, wrap result in PyArray.

CMakeLists: add conformer_rmsd_mol library in src/; link _conformerRmsd
to conformer_rmsd_mol (which transitively exposes conformer_rmsd).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…edback

Add a 'Differences from RDKit' section noting the consistent ValueError for
zero-atom molecules (vs RDKit's nan/ZeroDivisionError split). Update the
example to show the RDKit equivalent call and drop the now-unnecessary
explicit torch.cuda.synchronize() (the result is an AsyncGpuResult).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
nvmolkit/tests is not a Python package (no __init__.py), so importing from
nvmolkit.tests.kabsch_reference breaks test collection with ModuleNotFoundError.
Restore _numpy_kabsch_rmsd as a local function in each file rather than a
shared module, per reviewer preference.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@scal444 scal444 merged commit f5b7a7b into NVIDIA-Digital-Bio:main Mar 23, 2026
5 checks passed

scal444 commented Mar 23, 2026

Thanks for the contribution Andrei!

evasnow1992 added a commit that referenced this pull request Apr 2, 2026
…#126)

Add conformer RMSD to feature list (docs/index.rst) and API reference
(docs/api/nvmolkit.rst) — missing since PR #105.

Fix mmff_multimol_bench.cpp build error: MMFFOptimizeMoleculesConfsBfgs
now takes MMFFProperties instead of raw nonBondedThreshold (changed in
PR #116).
