Optimized load_simd_first (algorithms/load_first.rs):
Rather than falling back to scalar Emulated element-by-element loads, partial loads use Neon-native primitives:
≤8 bytes: GPR-only overlapping reads — no SIMD instructions needed.
8–16 bytes: Two overlapping vld1_u8 loads combined with vqtbl1q_u8 (TBL shuffle). Includes a Miri shim since Miri does not support vqtbl1q_u8.
32-bit / 64-bit element types: Simple if-else chains using vld1_lane / vcombine.
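The two byte-oriented strategies above can be modeled in portable Rust. This is a minimal sketch, not the code in algorithms/load_first.rs: the function names load_first_le, load_first_8_to_16, and tbl16 are hypothetical, the slices stand in for raw pointers, and tbl16 only models the documented behaviour of vqtbl1q_u8 (indices of 16 or more produce zero).

```rust
/// Read 4 bytes starting at `at` as a little-endian u32.
fn read_u32(bytes: &[u8], at: usize) -> u64 {
    let mut buf = [0u8; 4];
    buf.copy_from_slice(&bytes[at..at + 4]);
    u64::from(u32::from_le_bytes(buf))
}

/// Read 2 bytes starting at `at` as a little-endian u16.
fn read_u16(bytes: &[u8], at: usize) -> u64 {
    let mut buf = [0u8; 2];
    buf.copy_from_slice(&bytes[at..at + 2]);
    u64::from(u16::from_le_bytes(buf))
}

/// Hypothetical sketch: pack the first `n <= 8` bytes into a u64 using two
/// overlapping fixed-width reads instead of a per-byte loop.
fn load_first_le(bytes: &[u8], n: usize) -> u64 {
    assert!(n <= 8 && n <= bytes.len());
    if n >= 4 {
        // Reads [0, 4) and [n - 4, n); overlapping bytes OR to the same value.
        read_u32(bytes, 0) | (read_u32(bytes, n - 4) << ((n - 4) * 8))
    } else if n >= 2 {
        read_u16(bytes, 0) | (read_u16(bytes, n - 2) << ((n - 2) * 8))
    } else if n == 1 {
        u64::from(bytes[0])
    } else {
        0
    }
}

/// Portable model of a 16-byte table lookup: out-of-range indices yield 0.
fn tbl16(table: [u8; 16], idx: [u8; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for i in 0..16 {
        let j = idx[i] as usize;
        if j < 16 {
            out[i] = table[j];
        }
    }
    out
}

/// Hypothetical sketch: load the first `n` bytes (8 <= n <= 16) from two
/// overlapping 8-byte reads, then shuffle them into place with one lookup.
fn load_first_8_to_16(bytes: &[u8], n: usize) -> [u8; 16] {
    assert!((8..=16).contains(&n) && n <= bytes.len());
    let mut table = [0u8; 16];
    table[..8].copy_from_slice(&bytes[..8]); // low read: bytes [0, 8)
    table[8..].copy_from_slice(&bytes[n - 8..n]); // high read: bytes [n - 8, n)
    let mut idx = [0xFFu8; 16]; // 0xFF is out of range, so those lanes become 0
    for i in 0..n {
        idx[i] = if i < 8 { i as u8 } else { (i + 16 - n) as u8 };
    }
    tbl16(table, idx)
}
```

In both cases the reads never touch memory past byte n, which is the point of overlapping them rather than over-reading and masking.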
The aarch64_define_loadstore! macro accepts a $load_first function, and f16x4/f16x8 delegate to the u16x4/u16x8 primitives respectively.
Doubled types implement load_simd_first / store_simd_first branchlessly by passing the full count to the first half and first.saturating_sub(HALF) to the second.
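As a rough illustration of the branchless split, the sketch below models a doubled type as a pair of fixed-size arrays. The names load_first_half and load_first_doubled are hypothetical, and the sketch assumes the half-width load clamps its count to HALF itself, which is what allows the full count to be passed through unchanged.

```rust
const HALF: usize = 4;

/// Hypothetical half-width partial load: copies at most HALF elements and
/// zero-fills the rest. The count is clamped here, not by the caller.
fn load_first_half(src: &[f32], first: usize) -> [f32; HALF] {
    let mut out = [0.0f32; HALF];
    let n = first.min(HALF).min(src.len());
    out[..n].copy_from_slice(&src[..n]);
    out
}

/// Hypothetical doubled partial load: no branch on `first` at this level.
/// The low half receives the full count; the high half receives whatever
/// is left after HALF elements, saturating at zero.
fn load_first_doubled(src: &[f32], first: usize) -> ([f32; HALF], [f32; HALF]) {
    let lo = load_first_half(src, first);
    let hi_src = src.get(HALF..).unwrap_or(&[]);
    let hi = load_first_half(hi_src, first.saturating_sub(HALF));
    (lo, hi)
}
```

With first = 6, the low half loads four elements and the high half loads two; with first = 3, the high half receives a count of zero and stays all-zero.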
Test infrastructure:
test_neon() helper with WIDE_TEST_MIN_ARCH env-var support, matching the x86_64 test_arch_number() pattern. Supports "all" / "neon" (panics if unavailable) and "scalar" (skips).
All tests use if let Some(arch) = test_neon() { ... } — graceful skip when Neon is unavailable, hard failure when explicitly requested.
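The selection rules described above could look roughly like the following. Only the WIDE_TEST_MIN_ARCH variable and its "all", "neon", and "scalar" values come from the description; the Neon unit struct, the select_neon helper, the availability check via cfg!, and the treatment of unrecognized values as "unset" are assumptions made for this sketch.

```rust
/// Stand-in for the real architecture token (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Neon;

/// Decide whether Neon tests should run, given the env-var value and
/// whether Neon is available on this target.
fn select_neon(setting: Option<&str>, neon_available: bool) -> Option<Neon> {
    match setting {
        // Explicitly scalar: skip Neon tests.
        Some("scalar") => None,
        // Explicitly requested: unavailability is a hard failure.
        Some("all") | Some("neon") => {
            assert!(neon_available, "Neon was requested but is unavailable");
            Some(Neon)
        }
        // Unset (or, by assumption, any other value): skip gracefully.
        _ => {
            if neon_available {
                Some(Neon)
            } else {
                None
            }
        }
    }
}

/// Sketch of the helper: reads the environment and applies the rules above.
fn test_neon() -> Option<Neon> {
    let setting = std::env::var("WIDE_TEST_MIN_ARCH").ok();
    select_neon(setting.as_deref(), cfg!(target_arch = "aarch64"))
}
```

Splitting the decision into a pure function keeps the policy itself testable on any host, including x86_64 CI runners.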
diskann-vector — Neon distance kernels
14 SIMDSchema implementations covering:
L2, InnerProduct, Cosine for f32, f16, u8, i8.
L1Norm for f32 and f16.
All use scalar epilogues (SIMD epilogues deferred pending Arm64 benchmarking).
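For readers unfamiliar with the term, a scalar epilogue means the vectorized main loop handles full-width chunks and the leftover tail is processed one element at a time. The sketch below shows that shape in portable Rust for a squared L2 distance; it is not the SIMDSchema implementation, the four-lane array only stands in for an f32x4 register, and the choice of squared (rather than rooted) L2 is an assumption.

```rust
const LANES: usize = 4;

/// Squared L2 distance with a lane-wise main loop and a scalar epilogue.
fn squared_l2(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let mut acc = [0.0f32; LANES]; // models one f32x4 accumulator
    let mut ca = a.chunks_exact(LANES);
    let mut cb = b.chunks_exact(LANES);

    // Main loop: full chunks only, one accumulator per lane.
    for (x, y) in ca.by_ref().zip(cb.by_ref()) {
        for i in 0..LANES {
            let d = x[i] - y[i];
            acc[i] += d * d;
        }
    }

    // Horizontal reduction of the lanes (pairwise, like a sum tree).
    let mut total = (acc[0] + acc[1]) + (acc[2] + acc[3]);

    // Scalar epilogue: the remaining len % LANES elements.
    for (x, y) in ca.remainder().iter().zip(cb.remainder()) {
        let d = x - y;
        total += d * d;
    }
    total
}
```

A SIMD epilogue would replace the final loop with a single partial load of the tail, which is where load_simd_first would come in.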
diskann-quantization
Neon Hadamard transform impl (delegates to scalar via retarget()).
Bit distances almost universally target the scalar architecture as well via retarget().
Neon test paths for bit-slice distances (1–8 bit), bit-transpose distances, and full distances.
diskann-benchmark-simd
Neon kernel registrations for f32, f16, u8, and i8.
Refactored per-architecture DispatchRule impls into a match_arch! macro.
Improved dispatch scoring for better mismatch diagnostics.
Added test-aarch64.json and architecture-aware integration test selection.
Other changes
.cargo/config.toml: Enables +neon,+dotprod for aarch64 targets.
.github/workflows/ci.yml: Added aarch64-unknown-linux-gnu to cross-compilation targets.
diskann-providers: Relaxed a PQ distance test tolerance (6e-7 → 6.3e-7) to account for the different floating-point association order used by the Neon implementations.
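The tolerance change follows from floating-point addition not being associative: accumulating in lanes and reducing at the end groups the additions differently from a sequential sum. The toy example below is an illustration only, with values chosen to exaggerate the effect; the function names are hypothetical and it is unrelated to the actual PQ test.

```rust
/// Sum left to right with a single accumulator.
fn sum_sequential(v: &[f32]) -> f32 {
    v.iter().fold(0.0, |s, x| s + x)
}

/// Sum with two interleaved accumulators, as a two-lane SIMD loop would,
/// then combine the lanes at the end.
fn sum_two_lanes(v: &[f32]) -> f32 {
    let mut lanes = [0.0f32; 2];
    for (i, x) in v.iter().enumerate() {
        lanes[i % 2] += x;
    }
    lanes[0] + lanes[1]
}
```

For the input [1e8, 1.0, -1e8, 1.0], the sequential sum absorbs the first 1.0 into 1e8 (the spacing between adjacent f32 values there is 8) and returns 1.0, while the two-lane sum keeps the small values in their own lane and returns 2.0.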
Design decisions
Compile-time architecture gating. The Neon backend uses a compile-time token rather than runtime feature detection. Neon is mandatory on AArch64.
Runtime dispatch can be added later if needed.
+dotprod required. Needed for vdotq in dot-product kernels. This excludes pre-2018 cores but should cover mainstream server and desktop targets (Graviton 2+, Apple M1+, Ampere Altra). ARMv8.4+ mandates it.
Scalar epilogues in diskann-vector. The SIMD epilogues could use load_simd_first for a potential win on i8/u8 cosine where the masked load cost is amortized across multiple operations, but real Arm64 benchmarking is needed first.
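To make the +dotprod requirement concrete, the sketch below models in portable Rust what a signed dot-product-accumulate instruction computes: each 32-bit lane accumulates the sum of four adjacent 8-bit products. The names sdot_accumulate and dot_i8 are hypothetical, this is a reference model rather than the kernel in this PR, and it does not cover how the mixed u8×i8 case is handled.

```rust
/// Reference model of a signed 8-bit dot-product-accumulate over 16 bytes:
/// lane i accumulates a[4i..4i+4] . b[4i..4i+4].
fn sdot_accumulate(acc: [i32; 4], a: [i8; 16], b: [i8; 16]) -> [i32; 4] {
    let mut out = acc;
    for lane in 0..4 {
        for j in 0..4 {
            let k = 4 * lane + j;
            out[lane] += i32::from(a[k]) * i32::from(b[k]);
        }
    }
    out
}

/// Inner product of two i8 slices: 16-byte main loop plus scalar epilogue.
fn dot_i8(a: &[i8], b: &[i8]) -> i32 {
    assert_eq!(a.len(), b.len());
    let mut acc = [0i32; 4];
    let mut ca = a.chunks_exact(16);
    let mut cb = b.chunks_exact(16);
    for (x, y) in ca.by_ref().zip(cb.by_ref()) {
        let mut xa = [0i8; 16];
        let mut ya = [0i8; 16];
        xa.copy_from_slice(x);
        ya.copy_from_slice(y);
        acc = sdot_accumulate(acc, xa, ya);
    }
    // Reduce the four lanes, then handle the tail one element at a time.
    let mut total: i32 = acc.iter().sum();
    for (x, y) in ca.remainder().iter().zip(cb.remainder()) {
        total += i32::from(*x) * i32::from(*y);
    }
    total
}
```

Without the dot-product extension, the same result needs widening multiplies followed by pairwise adds, which is the cost the +dotprod requirement avoids.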
❌ Patch coverage is 84.11215% with 34 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.99%. Comparing base (7cd231a) to head (c18da52).
Flag coverage Δ: miri 88.99% <84.11%> (-0.02%); unittests 88.99% <84.11%> (-0.02%).
I can see that coverage on non-x86-64 architectures is going to be fun to deal with ...
Adds a (mostly) complete AArch64 Neon backend to diskann-wide and wires it through diskann-vector, diskann-quantization, and diskann-benchmark-simd.
This PR has existed in a largely completed state for quite a while now - but as usual the last 10% takes a considerable amount of work. So here it is.
diskann-wide: Neon backend
Neon implementations for all SIMD types matching the existing x86_64 (V3/V4) backends:
u8x8, i8x8, f32x2, u8x16, i8x16, u16x8, i16x8, u32x4, i32x4, f32x4, u64x2, i64x2, f16x4, f16x8.
f32x8, f32x16, u8x32, i8x32, i32x8, etc., via the existing Doubled machinery.
move_mask, from_mask, and optimized keep_first for all 8 mask widths.
Add, Sub, Mul, FMA, Abs, MinMax.
SIMDPartialEq and SIMDPartialOrd.
Not, And, Or, Xor, Shr, Shl (with Miri fallbacks for variable shifts).
i16×i16→i32, u8×i8→i32, i8×u8→i32 using vdotq_s32 (requires +dotprod).
sum_tree via pairwise addition (vpaddq).
f16↔f32 (lossless and cast), u8→i16, i8→i16, i32→f32, split/join for all appropriate types.
Suggested reviewing order
diskann-wide/src/arch/aarch64/mod.rs: Architecture definition, Neon token, dispatch, test_neon().
diskann-wide/src/arch/aarch64/macros.rs: The macro infrastructure that all type files build on.
diskann-wide/src/arch/aarch64/masks.rs: Mask representations and operations (move_mask, from_mask, keep_first).
diskann-wide/src/arch/aarch64/algorithms/load_first.rs: Optimized partial load primitives. Read bottom-up: impl functions first, then wrappers.
f32x4_.rs (128-bit float) or i32x4_.rs (dot products): the rest are structurally identical.
diskann-wide/src/arch/aarch64/double.rs and diskann-wide/src/doubled.rs: Doubled types and branchless partial load/store.
diskann-vector/src/distance/simd.rs: Neon distance kernels.
diskann-benchmark-simd/src/lib.rs: match_arch! refactor and Neon registration.
diskann-quantization/: Neon test paths (mechanical).