Neon MVP #777

Draft
hildebrandmw wants to merge 9 commits into main from mhildebr/neon


hildebrandmw (Contributor) commented Feb 16, 2026

Adds a (mostly) complete AArch64 Neon backend to diskann-wide and wires it through diskann-vector, diskann-quantization, and diskann-benchmark-simd.

This PR has existed in a largely completed state for quite a while now - but as usual the last 10% takes a considerable amount of work. So here it is.

diskann-wide — Neon backend

Neon implementations for all SIMD types matching the existing x86_64 (V3/V4) backends:

  • 16 register types across 64-bit and 128-bit widths: u8x8, i8x8, f32x2, u8x16, i8x16, u16x8,
    i16x8, u32x4, i32x4, f32x4, u64x2, i64x2, f16x4, f16x8.
  • Doubled types (f32x8, f32x16, u8x32, i8x32, i32x8, etc.) via the existing Doubled machinery.
  • Masks: move_mask, from_mask, and optimized keep_first for all 8 mask widths.
  • Arithmetic: Add, Sub, Mul, FMA, Abs, MinMax.
  • Comparisons: Full SIMDPartialEq and SIMDPartialOrd.
  • Bit operations: Not, And, Or, Xor, Shr, Shl (with Miri fallbacks for variable shifts).
  • Dot products: i16×i16→i32, u8×i8→i32, i8×u8→i32 using vdotq_s32 (requires +dotprod).
  • Reductions: sum_tree via pairwise addition (vpaddq).
  • Conversions: f16↔f32 (lossless and cast), u8→i16, i8→i16, i32→f32, split/join for all appropriate types.
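As a reading aid for the dot-product bullet, here is a scalar model of what the `vdotq_s32` intrinsic (the `SDOT` instruction) computes for signed 8-bit inputs. The names `sdot_model` and `reduce_add` are hypothetical and not taken from diskann-wide; the sketch covers only the signed-by-signed case and makes no claim about how the PR handles mixed-sign inputs.

```rust
/// Scalar model of `vdotq_s32`: each 32-bit accumulator lane gains the sum
/// of four adjacent i8 x i8 products. (Hypothetical helper for illustration.)
pub fn sdot_model(acc: [i32; 4], a: [i8; 16], b: [i8; 16]) -> [i32; 4] {
    let mut out = acc;
    for lane in 0..4 {
        let mut sum = 0i32;
        for j in 0..4 {
            let idx = 4 * lane + j;
            // Widen to i32 before multiplying so the product cannot overflow.
            sum += i32::from(a[idx]) * i32::from(b[idx]);
        }
        // The accumulation wraps on overflow, as the hardware instruction does.
        out[lane] = out[lane].wrapping_add(sum);
    }
    out
}

/// Horizontal sum of the four accumulator lanes.
pub fn reduce_add(v: [i32; 4]) -> i32 {
    v.iter().fold(0i32, |s, &x| s.wrapping_add(x))
}
```

One call consumes 16 byte pairs, which is why a 16-element i8 inner product needs a single `sdot_model` step followed by one horizontal reduction.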

Optimized load_simd_first (algorithms/load_first.rs):

Rather than falling back to scalar Emulated element-by-element loads, partial loads use Neon-native primitives:

  • ≤8 bytes: GPR-only overlapping reads — no SIMD instructions needed.
  • 8–16 bytes: Two overlapping vld1_u8 loads combined with vqtbl1q_u8 (TBL shuffle). Includes a Miri shim since Miri does not support vqtbl1q_u8.
  • 32-bit / 64-bit element types: Simple if-else chains using vld1_lane / vcombine.
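The "overlapping reads" idea for the ≤8-byte case can be sketched in portable Rust. This is an illustration under assumptions, not the code in load_first.rs: `load_first_bytes` is a hypothetical name, the result is modelled as a little-endian `u64`, and safe indexing stands in for the raw pointer loads a real kernel would use.

```rust
/// Load the first `n` bytes (n <= 8) of `src` into the low bytes of a u64,
/// zero-filling the rest. Two fixed-width reads that overlap in the middle
/// replace a byte-by-byte loop. (Hypothetical helper for illustration.)
pub fn load_first_bytes(src: &[u8], n: usize) -> u64 {
    assert!(n <= 8 && n <= src.len());
    match n {
        0 => 0,
        1 => u64::from(src[0]),
        2 | 3 => {
            // Two 16-bit reads: one at the start, one ending at byte n.
            let lo = u16::from_le_bytes([src[0], src[1]]);
            let hi = u16::from_le_bytes([src[n - 2], src[n - 1]]);
            u64::from(lo) | (u64::from(hi) << (8 * (n - 2)))
        }
        4..=7 => {
            // Two 32-bit reads: one at the start, one ending at byte n.
            let lo = u32::from_le_bytes([src[0], src[1], src[2], src[3]]);
            let hi = u32::from_le_bytes([src[n - 4], src[n - 3], src[n - 2], src[n - 1]]);
            u64::from(lo) | (u64::from(hi) << (8 * (n - 4)))
        }
        _ => u64::from_le_bytes([
            src[0], src[1], src[2], src[3], src[4], src[5], src[6], src[7],
        ]),
    }
}
```

The overlapping bytes are read twice but land on the same bit positions, so the OR is harmless, and no read ever touches memory past byte `n`.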

The aarch64_define_loadstore! macro accepts a $load_first function, and f16x4/f16x8 delegate to the u16x4/u16x8 primitives respectively.

Doubled types implement load_simd_first / store_simd_first branchlessly by passing the full count to the first half and first.saturating_sub(HALF) to the second.
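The count-splitting rule for Doubled types can be modelled with plain arrays. In this sketch `HALF`, `load_first_half`, and `load_first_doubled` are hypothetical names, each half is a 4-element `f32` array, and the clamping inside `load_first_half` stands in for whatever the half-width primitive does in the real backend.

```rust
const HALF: usize = 4;

/// Model of a half-width partial load: copies at most `first` elements
/// (never more than HALF) and zero-fills the remaining lanes.
fn load_first_half(src: &[f32], first: usize) -> [f32; HALF] {
    let mut out = [0.0f32; HALF];
    let n = first.min(HALF).min(src.len());
    out[..n].copy_from_slice(&src[..n]);
    out
}

/// Doubled partial load: the wrapper never branches on `first`. The low half
/// receives the full count and the high half receives the saturated remainder.
pub fn load_first_doubled(src: &[f32], first: usize) -> ([f32; HALF], [f32; HALF]) {
    let lo = load_first_half(src, first);
    // Safe-Rust stand-in for offsetting the source pointer by HALF elements.
    let hi_src = src.get(HALF..).unwrap_or(&[]);
    let hi = load_first_half(hi_src, first.saturating_sub(HALF));
    (lo, hi)
}
```

With `first = 6`, the low half sees a count of 6 and loads all four lanes, while the high half sees `6 - 4 = 2`. With `first = 3`, the saturating subtraction gives the high half a count of 0, so it stays zeroed.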

Test infrastructure:

  • test_neon() helper with WIDE_TEST_MIN_ARCH env-var support, matching the x86_64 test_arch_number() pattern. Supports "all" / "neon" (panics if unavailable) and "scalar" (skips).
  • All tests use if let Some(arch) = test_neon() { ... } — graceful skip when Neon is unavailable, hard failure when explicitly requested.
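The skip-or-fail policy described above might look roughly like the following. Only the environment variable name and the three accepted values come from the description; `NeonToken`, `resolve_test_arch`, and `test_neon_sketch` are hypothetical names, and treating an unset or unrecognized value as "run if available" is an assumption.

```rust
/// Hypothetical stand-in for the Neon architecture token.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct NeonToken;

/// Decide whether Neon tests should run.
/// `setting` is the value of WIDE_TEST_MIN_ARCH (if set) and `available`
/// says whether Neon can actually be used in this build.
pub fn resolve_test_arch(setting: Option<&str>, available: bool) -> Option<NeonToken> {
    match setting {
        // Explicit opt-out: skip Neon tests.
        Some("scalar") => None,
        // Explicit request: failing loudly beats silently skipping.
        Some("all") | Some("neon") => {
            assert!(available, "Neon tests were requested but Neon is unavailable");
            Some(NeonToken)
        }
        // Assumption: unset or unrecognized values run Neon tests only if available.
        _ => available.then_some(NeonToken),
    }
}

/// Reads the environment and the compile-time target to apply the policy.
pub fn test_neon_sketch() -> Option<NeonToken> {
    let setting = std::env::var("WIDE_TEST_MIN_ARCH").ok();
    let available = cfg!(all(target_arch = "aarch64", target_feature = "neon"));
    resolve_test_arch(setting.as_deref(), available)
}
```

Separating the pure decision function from the environment lookup keeps the policy itself testable on any host, including x86_64 CI machines.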

diskann-vector — Neon distance kernels

14 SIMDSchema implementations covering:

  • L2, InnerProduct, Cosine for f32, f16, u8, i8.
  • L1Norm for f32 and f16.
  • All use scalar epilogues (SIMD epilogues deferred pending Arm64 benchmarking).
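For readers unfamiliar with the term, a "scalar epilogue" means the vectorized main loop handles full-width chunks and an ordinary scalar loop finishes the leftover elements. The sketch below shows the shape for squared L2 distance using a 4-lane accumulator array in place of a real `f32x4` register; `LANES` and `l2_squared` are hypothetical names and this is not the SIMDSchema code from the PR.

```rust
const LANES: usize = 4;

/// Squared L2 distance with a lane-wise main loop and a scalar epilogue.
pub fn l2_squared(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let mut acc = [0.0f32; LANES];

    let a_chunks = a.chunks_exact(LANES);
    let b_chunks = b.chunks_exact(LANES);
    // Elements that do not fill a whole chunk are left for the epilogue.
    let (a_tail, b_tail) = (a_chunks.remainder(), b_chunks.remainder());

    // Main loop: each array slot plays the role of one SIMD lane.
    for (ca, cb) in a_chunks.zip(b_chunks) {
        for i in 0..LANES {
            let d = ca[i] - cb[i];
            acc[i] += d * d;
        }
    }

    // Pairwise (tree) reduction of the lanes.
    let mut sum = (acc[0] + acc[1]) + (acc[2] + acc[3]);

    // Scalar epilogue: finish the tail one element at a time.
    for (x, y) in a_tail.iter().zip(b_tail) {
        let d = x - y;
        sum += d * d;
    }
    sum
}
```

A SIMD epilogue would instead issue one masked or partial load (for example via `load_simd_first`) for the tail, which is the trade-off the description defers until Arm64 benchmarks are available.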

diskann-quantization

  • Neon Hadamard transform impl (delegates to scalar via retarget()).
  • Bit distances almost universally target the scalar architecture as well via retarget().
  • Neon test paths for bit-slice distances (1–8 bit), bit-transpose distances, and full distances.

diskann-benchmark-simd

  • Neon kernel registrations for f32, f16, u8, and i8.
  • Refactored per-architecture DispatchRule impls into a match_arch! macro.
  • Improved dispatch scoring for better mismatch diagnostics.
  • Added test-aarch64.json and architecture-aware integration test selection.

Other changes

  • .cargo/config.toml: Enables +neon,+dotprod for aarch64 targets.
  • .github/workflows/ci.yml: Added aarch64-unknown-linux-gnu to cross-compilation targets.
  • diskann-providers: Relaxed a PQ distance test tolerance (6e-7 → 6.3e-7) to account for the different floating-point association used by the Neon implementations.
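The tolerance change follows from floating-point addition not being associative: a kernel that accumulates in several lanes adds the same terms in a different order than a left-to-right loop. The following self-contained sketch (hypothetical function names, a deliberately extreme input) shows the two orders disagreeing in `f32`.

```rust
/// Sum in strict left-to-right order, as a scalar loop would.
pub fn sum_left_to_right(v: &[f32]) -> f32 {
    v.iter().fold(0.0f32, |s, &x| s + x)
}

/// Sum with two interleaved accumulators, the way a 2-lane SIMD loop would,
/// followed by a final horizontal add.
pub fn sum_two_lanes(v: &[f32]) -> f32 {
    let mut acc = [0.0f32; 2];
    for (i, &x) in v.iter().enumerate() {
        acc[i % 2] += x;
    }
    acc[0] + acc[1]
}
```

For the input `[1.0e8, 1.0, -1.0e8, 1.0]`, the left-to-right sum loses the first `1.0` when it is added to `1.0e8` (adjacent `f32` values are 8 apart at that magnitude) and returns `1.0`, while the two-lane sum returns the exact answer `2.0`. Real distance computations differ by far less, which is consistent with a tolerance shift in the seventh decimal place.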

Design decisions

  • Compile-time architecture gating. The Neon backend uses a compile-time token rather than runtime feature detection. Neon is mandatory on AArch64.
    Runtime dispatch can be added later if needed.
  • +dotprod required. Needed for vdotq in dot-product kernels. This excludes pre-2018 cores but should cover mainstream server and desktop targets (Graviton 2+, Apple M1+, Ampere Altra). ARMv8.4+ mandates it.
  • Scalar epilogues in diskann-vector. The SIMD epilogues could use load_simd_first for a potential win on i8/u8 cosine where the masked load cost is amortized across multiple operations, but real Arm64 benchmarking is needed first.
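A compile-time token of the kind described in the first bullet can be sketched as a zero-sized type whose constructor is decided by `cfg!` rather than by a runtime CPU query. The type and function names below are hypothetical and do not reproduce the definition in diskann-wide/src/arch/aarch64/mod.rs.

```rust
/// Zero-sized proof that Neon code paths may be used. The private field
/// prevents construction outside this module. (Hypothetical sketch.)
#[derive(Debug, Clone, Copy)]
pub struct Neon(());

impl Neon {
    /// Returns the token only when the build target guarantees Neon.
    /// `cfg!` is evaluated at compile time, so no runtime detection occurs.
    pub const fn new_checked() -> Option<Self> {
        if cfg!(all(target_arch = "aarch64", target_feature = "neon")) {
            Some(Neon(()))
        } else {
            None
        }
    }
}

/// Example consumer: pick a backend name based on the token.
pub fn backend_name() -> &'static str {
    match Neon::new_checked() {
        Some(_) => "neon",
        None => "scalar",
    }
}
```

Because the condition is a compile-time constant, the compiler can discard the unused branch entirely. The cost is that a binary built without the required target features can never take the Neon path, which is where later runtime dispatch would come in.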

Suggested reviewing order

  1. diskann-wide/src/arch/aarch64/mod.rs — Architecture definition, Neon token, dispatch, test_neon().
  2. diskann-wide/src/arch/aarch64/macros.rs — The macro infrastructure that all type files build on.
  3. diskann-wide/src/arch/aarch64/masks.rs — Mask representations and operations (move_mask, from_mask, keep_first).
  4. diskann-wide/src/arch/aarch64/algorithms/load_first.rs — Optimized partial load primitives. Read bottom-up: impl functions first, then wrappers.
  5. One representative type file (e.g., f32x4_.rs for 128-bit float, or i32x4_.rs for dot products) — the rest are structurally identical.
  6. diskann-wide/src/arch/aarch64/double.rs and diskann-wide/src/doubled.rs — Doubled types and branchless partial load/store.
  7. diskann-vector/src/distance/simd.rs — Neon distance kernels.
  8. diskann-benchmark-simd/src/lib.rs: match_arch! refactor and Neon registration.
  9. diskann-quantization/ — Neon test paths (mechanical).

codecov-commenter commented Feb 16, 2026

Codecov Report

❌ Patch coverage is 84.11215% with 34 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.99%. Comparing base (7cd231a) to head (c18da52).

Files with missing lines                          Patch %   Lines
diskann-benchmark-simd/src/lib.rs                 47.05%    18 Missing ⚠️
diskann-vector/src/distance/simd.rs               83.33%    4 Missing ⚠️
diskann-wide/src/test_utils/dot_product.rs        94.93%    4 Missing ⚠️
diskann-benchmark-simd/src/bin.rs                 50.00%    3 Missing ⚠️
diskann-vector/src/distance/implementations.rs    0.00%     3 Missing ⚠️
diskann-vector/src/conversion.rs                  66.66%    2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #777      +/-   ##
==========================================
- Coverage   89.00%   88.99%   -0.02%     
==========================================
  Files         428      428              
  Lines       78417    78565     +148     
==========================================
+ Hits        69795    69917     +122     
- Misses       8622     8648      +26     

Flag         Coverage Δ
miri         88.99% <84.11%> (-0.02%) ⬇️
unittests    88.99% <84.11%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines                              Coverage Δ
diskann-providers/src/model/pq/distance/dynamic.rs    86.40% <ø> (ø)
diskann-quantization/src/algorithms/hadamard.rs       97.94% <ø> (ø)
diskann-quantization/src/bits/distances.rs            91.49% <100.00%> (+0.05%) ⬆️
diskann-quantization/src/spherical/iface.rs           92.90% <ø> (+0.32%) ⬆️
diskann-vector/src/distance/distance_provider.rs      100.00% <ø> (ø)
diskann-wide/src/arch/mod.rs                          83.79% <ø> (ø)
diskann-wide/src/doubled.rs                           86.72% <100.00%> (+0.02%) ⬆️
diskann-wide/src/emulated.rs                          95.20% <100.00%> (+0.59%) ⬆️
diskann-wide/src/helpers.rs                           100.00% <ø> (ø)
diskann-wide/src/lib.rs                               86.66% <ø> (ø)
... and 7 more

... and 1 file with indirect coverage changes


hildebrandmw changed the title from "Neon MVP." to "Neon MVP" on Feb 16, 2026
@hildebrandmw (Contributor, Author)


I can see that coverage on non-x86-64 architectures is going to be fun to deal with ...
