Add GPU-accelerated conformer RMSD matrix computation #105
Conversation
Implement GetConformerRMSMatrix as a GPU-accelerated equivalent of
RDKit's AllChem.GetConformerRMSMatrix. One CUDA thread-block per
conformer pair computes optimal Kabsch alignment via closed-form 3x3
SVD (Cardano eigenvalues of H^T H), avoiding iterative solvers.
Output is a lower-triangle condensed distance matrix matching RDKit's
format, enabling a fully GPU-resident pipeline:
EmbedMolecules → MMFFOptimize → GetConformerRMSMatrix → butina()
New files:
src/conformer_rmsd.{h,cu} — kernel and C++ API
nvmolkit/conformerRmsd.{cpp,py} — Boost.Python binding and Python wrapper
nvmolkit/tests/test_conformer_rmsd.py — correctness tests vs RDKit reference
benchmarks/conformer_rmsd_bench.py — CPU vs GPU benchmark
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: andreivolgin <andrei@rebelation.com>
Import _arrayHelpers to register the Boost.Python PyArray converter, matching the pattern used by other modules (fingerprints, clustering, similarity). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Prealigned mode: compute RMSD on raw coordinates without centering, matching RDKit's prealigned=True behavior.
- Tests/benchmark: use deep copies for the RDKit reference computation, since GetConformerRMSMatrix(prealigned=False) modifies conformer coordinates in-place during sequential alignment.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
RDKit's GetConformerRMSMatrix modifies conformer coordinates in-place during sequential alignment, so results depend on pair ordering and are not purely pairwise. Switch to an independent numpy-based Kabsch RMSD reference that computes each pair from original coordinates, matching our GPU kernel's behavior. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace RDKit reference with independent numpy Kabsch SVD to avoid false negatives from RDKit's sequential coordinate mutation behavior. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace alkane chains with drug-like molecules spanning a wider range: aspirin (13 HA), celecoxib (24 HA), osimertinib (33 HA), and a cyclic pentapeptide (~48 HA). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| Filename | Overview |
|---|---|
| nvmolkit/conformerRmsd.cpp | Python/C++ binding layer — contains a signed integer overflow (nc * (nc - 1) / 2 in int) at line 67 for large conformer counts (nc ≈ 65 536) that corrupts the PyArray shape passed back to Python; line 88 of the same file correctly uses int64_t for the equivalent single-mol path. |
| src/conformer_rmsd.cu | Core CUDA kernel implementing per-pair Kabsch RMSD via Cardano eigenvalue decomposition of H^T·H; mathematics is sound, reflection handling via det(H) < 0 is correct, numerical guards (fmax, clamping) are appropriate, though fixed block size of 128 is suboptimal for small molecules. |
| src/conformer_rmsd_mol.cpp | Host-side mol-to-GPU pipeline; properly uses int64_t for overflow detection, initiates async device allocation before CPU coordinate packing, and handles edge cases (null mols, zero atoms, < 2 conformers) cleanly. |
| nvmolkit/tests/test_conformer_rmsd.py | 12 tests covering correctness vs numpy Kabsch, prealigned mode, edge cases (zero atoms, < 2 conformers, invalid types), streams, and both single/batch APIs; good coverage. |
| src/CMakeLists.txt | New conformer_rmsd and conformer_rmsd_mol targets are correctly structured, but cccl_interface (providing CUB / cuda::std::span) is not explicitly linked unlike every other CUDA kernel target in the repo that uses CUB. |
Comments Outside Diff (3)
- nvmolkit/conformerRmsd.cpp, line 67 (link): Integer overflow in `numPairs` computation

  `nc * (nc - 1)` is evaluated as a 32-bit signed `int` multiplication. For `nc = 65536`, the product is 4,294,901,760, which exceeds `INT_MAX` (2,147,483,647), causing signed integer overflow — undefined behavior in C++. On two's complement hardware this typically wraps to -65,536, so `numPairs` becomes -32,768, corrupting the shape passed to `makePyArray`.

  Note that the upstream `conformerRmsdBatchMatrixMol` does validate that the final pair count fits in `int` (using `int64_t` arithmetic in `conformer_rmsd_mol.cpp`), but it allows `nc = 65536` through because `numPairs = 2,147,450,880 ≤ INT_MAX`. The overflow therefore occurs silently in the binding code.

  Line 88 in the same file already applies the correct fix with `static_cast<int64_t>`. Apply the same pattern here.
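The wraparound described in this comment can be reproduced in numpy's fixed-width integer arithmetic, which wraps silently for arrays just as the C++ `int` path does. This is an illustration only; the actual fix is the `static_cast<int64_t>` pattern from line 88.

```python
import numpy as np

nc = np.array([65536], dtype=np.int32)

# 32-bit arithmetic: nc * (nc - 1) exceeds INT_MAX and wraps negative,
# so the derived pair count becomes a nonsense (corrupted) shape value.
pairs32 = (nc * (nc - 1)) // 2

# 64-bit arithmetic, analogous to widening before multiplying in C++:
nc64 = nc.astype(np.int64)
pairs64 = (nc64 * (nc64 - 1)) // 2

print(int(pairs32[0]))   # wrapped, negative
print(int(pairs64[0]))   # correct pair count
```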
- src/CMakeLists.txt, lines 132-134 (link): Missing `cccl_interface` dependency

  `conformer_rmsd.cu` includes `<cub/cub.cuh>` and `<cuda/std/span>` from CCCL. Every other target in this repo that uses CUB explicitly links `cccl_interface` (e.g. `similarity_kernels`, `morganFingerprintKernels`), but `conformer_rmsd` does not:

  ```cmake
  add_library(conformer_rmsd conformer_rmsd.cu)
  target_link_libraries(conformer_rmsd PUBLIC device_vector PRIVATE cuda_error_check)
  ```

  If `device_vector` does not transitively expose `cccl_interface`, builds may fail or silently pick up a different CCCL version from the system CUDA install. Consider explicitly adding `cccl_interface`, consistent with the rest of the codebase.
- src/conformer_rmsd.cu, lines 84-85 (link): Performance degrades severely for very small molecules

  `kRmsdBlockSize = 128` means every conformer pair dispatches exactly 128 threads regardless of atom count. For molecules with fewer than 128 heavy atoms (most drug-like molecules: aspirin has 13, celecoxib 26), most threads are idle on every loop iteration. All 11 `cub::BlockReduce` calls are still executed, and each requires a `__syncthreads()`, so the inter-thread synchronization cost is disproportionate to the actual work.

  The benchmark results confirm this pattern (celecoxib, 26 atoms, 200 conformers: 3.59 ms vs aspirin, 13 atoms, 200 conformers: 0.99 ms — the same number of pairs, yet 3.6× slower despite only 2× more atoms, suggesting overhead dominates).

  A smaller block size (e.g. 32) for small molecules would improve occupancy and reduce the per-pair reduction overhead. Consider exposing `kRmsdBlockSize` as a template parameter or choosing it adaptively at the launch site based on atom count.
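One adaptive scheme, sketched here in Python, rounds the atom count up to a warp multiple and clamps to the current maximum. The function name and bounds are assumptions for illustration, not code from this PR.

```python
def pick_block_size(num_atoms, warp=32, max_block=128):
    """Smallest warp multiple covering num_atoms, clamped to [warp, max_block]."""
    rounded = ((max(num_atoms, 1) + warp - 1) // warp) * warp
    return min(rounded, max_block)
```

On the CUDA side this would correspond to instantiating the kernel for a few fixed block sizes (e.g. 32/64/128 template parameters) and selecting one at the launch site.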
Last reviewed commit: "conformerRmsd: rever..."
For context on the real-world impact: we're processing ~200M molecule states in our conformer generation pipeline (embed → optimize → RMSD cluster → select). The RMSD matrix is currently the only stage that falls back to CPU. At ~2s saved per state, this kernel alone would save over 12 years of cumulative CPU time across our workload — and keep the entire pipeline on GPU.
- Fix async copy race condition: add cudaStreamSynchronize before hostCoords goes out of scope, matching the etkdg_impl.cpp pattern.
- Move the prealigned branch before centroid computation to avoid 6 wasted block reductions when centroids are not needed.
- Fix np.sign(0) edge case in the numpy Kabsch reference: treat det(H) == 0 (coplanar atoms) as +1 instead of zeroing the last singular value.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The cudaCheckError macro requires the nvMolKit namespace scope. Use direct CUDA API call with error check in the Boost.Python binding. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add a cj < 0 guard to the pair-index safety check in the CUDA kernel to close a theoretical OOB read from floating-point rounding.
- Fix docstring: "has no conformers" → "fewer than 2 conformers".
- Align the C++ binding with the Python wrapper: raise std::invalid_argument for ≤ 1 conformers instead of silently returning an empty array.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Graceful handling is better for batch pipelines processing millions of molecules where some may have failed embedding. Callers can check len(result) == 0 and skip clustering. Aligns with the original C++ intent and avoids crashing production workflows. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
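The caller-side pattern this enables might look like the following sketch. All names here are hypothetical; `cluster_fn` stands in for a Butina-style clustering call, and `rmsd_results` for the per-molecule condensed RMSD arrays.

```python
def cluster_batch(rmsd_results, cluster_fn):
    """Cluster per-molecule condensed RMSD matrices, skipping empty ones.

    Molecules with fewer than 2 conformers (e.g. failed embedding) yield
    an empty result; they are skipped instead of crashing the whole batch.
    """
    clusters = {}
    for mol_idx, rmsd in enumerate(rmsd_results):
        if len(rmsd) == 0:      # nothing to cluster for this molecule
            continue
        clusters[mol_idx] = cluster_fn(rmsd)
    return clusters
```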
Thank you for the contribution! This is a substantial addition — we'll review it thoroughly and follow up with any comments. We'll keep you updated as we work through it and determine the best approach for integration.
scal444
left a comment
Thanks for the contribution! Overall this is a useful acceleration that aligns well with nvMolKit's functionality, even though it wasn't initially on our radar. It's a very clean implementation and my code comments are minor.
One significant comment: I wonder if we're getting good GPU saturation. With many conformers (100+) there's probably saturation, but for smaller cases we typically need to send batches of molecules at a time, and calling singleApi(mol) repeatedly is less efficient than calling batchAPI([mol1, mol2, mol3]) multiple times. Adapting this to multiple molecules would require some minor preprocessing up front, and potentially an additional layer of indexing to match blocks to molecule ID and then to conformer ID. The output would be a list of result tensors.
Would your use cases match the multiple molecule multiple conformer case?
…safety, and edge cases

- Replace raw cudaMallocHost/cudaFreeHost with PinnedHostVector<double> for exception safety and consistency with the project's existing RAII helpers. The kernel is submitted before hostCoords goes out of scope, so no extra host-side latency is added to the GPU pipeline.
- Add a numAtoms == 0 guard that throws std::invalid_argument consistently, intentionally diverging from RDKit's inconsistent nan/ZeroDivisionError behavior.
- Increase benchmark correctness coverage from 50 to min(500, n_pairs) pairs.
- Add test_rmsd_zero_atoms documenting the intentional divergence from RDKit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds GetConformerRMSMatrixBatch(), which processes a list of molecules in a single kernel launch so conformer pairs from all molecules run concurrently. This improves GPU saturation for molecules with few conformers (e.g. simple inorganics) without changing the existing single-molecule API.

Design follows existing nvMolKit batch patterns:
- Flat pinned coordinate buffer with per-molecule coord/pair offset arrays, matching the substructure search approach for variable atom counts.
- Separate AsyncDeviceVector per molecule for output, returned as list[AsyncGpuResult] consistent with MMFFOptimizeMoleculesConfs.
- PinnedHostVector throughout for exception safety and DMA efficiency.
- Single kernel launch over totalPairs blocks; each block finds its molecule via a linear scan on pairOffsets (acceptable for current batch sizes).

Also adds a direct #include <vector> to conformerRmsd.cpp and a prealigned=True batch test to cover the hand-duplicated kernel path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… overflow
- Clamp the Cardano discriminant to fmax(..., 0.0) in symmetricEigenvalues3x3,
removing the incorrect fallback that set all eigenvalues to trace/3 for
near-degenerate inputs (e.g. near-planar conformer pairs with eigenvalue
structure like {4,4,0}), which silently produced wrong RMSD values.
- Compute numPairs as int64_t to prevent signed integer overflow for
numConfs > 65535 (int arithmetic overflows at ~2.1B). The kernel grid
launch is guarded with an explicit overflow check and cast to int, since
CUDA's grid dimension is bounded by INT_MAX regardless. Same fix applied
to the single-molecule binding, the batch binding, and conformerRmsdMatrixGpu.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
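As context for the discriminant clamp above: the kernel's closed-form eigenvalue step (Cardano's trigonometric method on the symmetric matrix HᵀH) can be sketched independently in numpy. This is an illustrative reference, not the kernel code; the function name is an assumption, and the clamp on the acos argument plays the role of the kernel's fmax guards for near-degenerate spectra such as {4, 4, 0}.

```python
import math
import numpy as np

def symmetric_eigenvalues_3x3(A):
    """Closed-form eigenvalues of a symmetric 3x3 matrix, descending order."""
    q = np.trace(A) / 3.0
    B = A - q * np.eye(3)
    p2 = float((B * B).sum())          # squared Frobenius norm of A - qI
    if p2 == 0.0:                      # scalar matrix: all eigenvalues equal q
        return np.array([q, q, q])
    p = math.sqrt(p2 / 6.0)
    r = np.linalg.det(B / p) / 2.0
    r = min(1.0, max(-1.0, r))         # clamp: rounding can push |r| past 1
    phi = math.acos(r) / 3.0
    e1 = q + 2.0 * p * math.cos(phi)                        # largest
    e3 = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)  # smallest
    e2 = 3.0 * q - e1 - e3             # trace identity gives the middle one
    return np.array([e1, e2, e3])
```

Without the clamp, rounding on near-degenerate inputs can push the acos argument fractionally outside [-1, 1] and produce NaN, which is the failure mode the kernel's guards exist to prevent.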
…and overflow guards

- Remove dead `disc` variable in symmetricEigenvalues3x3; near-degenerate inputs are handled by the existing fmax guards on sqrtPP and the acos argument
- Promote coordOffsets to size_t throughout: PinnedHostVector, host packing index (base), AsyncDeviceVector, and kernel parameter — prevents silent int overflow for large batches (> INT_MAX coordinate elements)
- Replace the O(numMols) linear scan with an O(log numMols) binary search in conformerRmsdBatchKernel for molecule lookup
- Guard the cumulative pairOffsets prefix sum against int overflow using an int64_t running total before narrowing back to int

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
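The binary-search molecule lookup can be sketched on the host side with Python's bisect. The exclusive-prefix layout of pairOffsets is an assumption for the sketch; note that bisect_right lands past duplicate offsets, so zero-pair molecules are skipped correctly.

```python
from bisect import bisect_right
from itertools import accumulate

def pair_offsets(pairs_per_mol):
    """Exclusive prefix sum: offsets[m] = first global pair index of mol m."""
    return [0] + list(accumulate(pairs_per_mol))[:-1]

def locate_pair(pair_idx, offsets):
    """Map a global pair index to (molecule index, local pair index)."""
    mol = bisect_right(offsets, pair_idx) - 1   # O(log numMols)
    return mol, pair_idx - offsets[mol]
```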
Document that pairOffsetsArr/totalPairs are intentionally int because the kernel launches one block per pair on the grid x-dimension (hardware limit 2^31-1 = INT_MAX), and that the int64_t accumulation exists to catch prefix-sum overflow before materializing into that launch/index type. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Addressed the latest round of feedback:
- Batch API (scal444) — Added …
- Dead …
- Coordinate offset overflow (Greptile) — Promoted …
- O(N) linear scan (Greptile) — Replaced with an O(log N) binary search in …
- Cumulative …
…cision notes

- Remove #undef BLOCK_REDUCE_SUM from inside the prealigned branch of conformerRmsdBatchKernel; the preprocessor processes it unconditionally, undefining the macro before the 11 uses in the non-prealigned path and causing a compilation failure
- Always allocate at least 1 element per molecule output buffer so that devRmsdPtrs never contains a null pointer; zero-pair molecules dispatch 0 blocks and never write to their slot
- Add comments at both inverse triangular-number formula sites documenting that pairIdx is bounded by INT_MAX (~2^31), which fits exactly in a 53-bit double significand

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
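The inverse triangular-number formula these comments refer to recovers a conformer pair (i, j), with i > j, from a condensed lower-triangle index k = i*(i-1)/2 + j. A Python sketch of the float-based inversion follows; it stays exact for k < 2^31 because such k fits in a double's 53-bit significand, and the guard mirrors the kind of rounding defense discussed in this thread.

```python
import math

def pair_from_index(k):
    """Invert k = i*(i-1)//2 + j to (i, j) with i > j >= 0."""
    # i is the largest integer with i*(i-1)/2 <= k (quadratic formula).
    i = int((1.0 + math.sqrt(1.0 + 8.0 * k)) / 2.0)
    if i * (i - 1) // 2 > k:        # guard against sqrt rounding up
        i -= 1
    j = k - i * (i - 1) // 2
    return i, j
```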
Fixed three issues from the latest Greptile pass: removed a premature …
Move the numPairs > INT_MAX guard to the binding layer, before the AsyncDeviceVector allocation, so a pathological conformer count produces a descriptive overflow_error rather than a CUDA OOM. The defensive check inside conformerRmsdMatrixGpu is retained. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The Kabsch computation duplication between the two kernels is noted. We're happy to extract a …
…e-aligned coords

AllChem.GetConformerRMSMatrix modifies conformer coordinates in-place during alignment, so reusing the same molecule across iterations measures already-aligned conformers and understates the true CPU cost. Deep-copy before each call to ensure each iteration starts from fresh coordinates.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…numMols==0

Extract a computePairRmsd() __device__ __forceinline__ helper containing the full RMSD computation path (prealigned fast-path, centroid reduction, cross-covariance accumulation, Cardano eigenvalue decomposition, reflection correction, final write). Both conformerRmsdKernel and conformerRmsdBatchKernel now resolve their coordinate pointers and output slot, then delegate to the helper, reducing each kernel body from ~80 lines to ~15.

BLOCK_REDUCE_SUM is now defined once at file scope and #undef'd after the helper, eliminating the duplicate macro definition.

Add an early `if (numMols <= 0) return;` guard at the top of conformerRmsdBatchKernel to prevent the binary search from reading pairOffsets[-1] if the kernel is ever called directly from C++ with numMols = 0.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Went ahead with the refactoring rather than deferring — the benefits are clear enough to include here: …
…odule

Move the duplicated _numpy_kabsch_rmsd function from test_conformer_rmsd.py and conformer_rmsd_bench.py into a new nvmolkit/tests/kabsch_reference.py module so there is a single source of truth for the numpy SVD reference implementation used in both correctness validation and benchmarking.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Thanks Andrei. Please see my comments from a few days ago. The only significant ones are moving business logic outside of the pybind layer where possible, and using cub::BlockReduce, as it's the NVIDIA standard for in-kernel sum reductions. The others are minor and optional; overall the change looks great.
Switch from the hand-rolled shared-memory tree reduction macro to cub::BlockReduce, the NVIDIA-standard in-kernel primitive. CUB performs intra-warp reduction via shuffle instructions before the cross-warp shared-memory stage, significantly reducing the number of __syncthreads() calls per reduction (from ~8 to ~2) at no extra cost.

With shared memory now managed entirely inside computePairRmsd(), the two kernels no longer declare any __shared__ variables of their own — all allocation is handled by the inlined helper.

Also pull out the repeated 2.0*sqrtPP subexpression in symmetricEigenvalues3x3 and the pairwise a00*a11 / a00*a22 / a11*a22 products shared between the q and r terms, which nvcc does not always CSE automatically.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…mol)
Add src/conformer_rmsd_mol.{h,cpp} — a regular C++ library (following the
morganFingerprint/morganFingerprintKernels pattern) that owns all RDKit
preprocessing and device coordination:
conformerRmsdMatrixMol — single molecule
conformerRmsdBatchMatrixMol — batch of molecules
Both functions:
- Validate inputs and compute metadata
- Allocate device buffers before filling host buffers so that async GPU
memory allocation overlaps with CPU coordinate extraction
- Call the existing device-span kernel entry points
- Return AsyncDeviceVector<double> (single) or vector thereof (batch)
The pybind layer (conformerRmsd.cpp) is now minimal: acquire stream,
extract mol pointers, call the C++ mol overload, wrap result in PyArray.
CMakeLists: add conformer_rmsd_mol library in src/; link _conformerRmsd
to conformer_rmsd_mol (which transitively exposes conformer_rmsd).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…edback

Add a 'Differences from RDKit' section noting the consistent ValueError for zero-atom molecules (vs RDKit's nan/ZeroDivisionError split). Update the example to show the RDKit equivalent call and drop the now-unnecessary explicit torch.cuda.synchronize() (the result is an AsyncGpuResult).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
nvmolkit/tests is not a Python package (no __init__.py), so importing from nvmolkit.tests.kabsch_reference breaks test collection with ModuleNotFoundError. Restore _numpy_kabsch_rmsd as a local function in each file rather than a shared module, per reviewer preference. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Thanks for the contribution, Andrei!
Summary
Adds `GetConformerRMSMatrix` — a GPU-accelerated equivalent of RDKit's `AllChem.GetConformerRMSMatrix` that computes pairwise Kabsch-aligned RMSD for all conformer pairs on the GPU.

- With `prealigned=True`, skips Kabsch alignment and computes raw coordinate RMSD (no centering), matching RDKit's behavior.
- Returns an `AsyncGpuResult` with `__cuda_array_interface__` support, consistent with the existing nvMolKit pattern.
- Adds `benchmarks/conformer_rmsd_bench.py` with correctness validation against numpy Kabsch SVD.

Benchmark results (RTX 4090, CUDA 12.4)
Fits the existing GPU pipeline
The result can be passed directly to `butina()` for GPU-accelerated conformer clustering, keeping the entire generate → optimize → RMSD → cluster pipeline on the GPU.

Note on RDKit compatibility
Results may differ slightly from RDKit's `GetConformerRMSMatrix(prealigned=False)` because RDKit modifies conformer coordinates in-place during sequential alignment (each `AlignMol` call updates the probe conformer, affecting subsequent pairs). This kernel computes each pair independently from the original coordinates, which is the mathematically correct pairwise Kabsch RMSD. Tests validate against an independent numpy SVD-based Kabsch reference.

Test plan
🤖 Generated with Claude Code