[disk-index] Doc outlining high-level flow of disk-index #898
arkrishn94 wants to merge 1 commit into main
Conversation
Pull request overview
Adds a new architecture document describing the DiskANN disk-index build and search flows, aiming to explain the trait/strategy/data-provider architecture and how components interact across the lifecycle.
Changes:
- Introduces `diskann-disk/docs/architecture.md` with high-level build/search flow explanations and diagrams.
- Documents key traits/structs (e.g., `InmemIndexBuilder`, `DiskProvider`, `DiskIndexSearcher`) and intended dataflow steps.
- Adds a build vs search comparison table and a “unifying abstraction” section.
```text
│ save_index(DynWriteProvider, metadata)             │
│ save_graph(DynWriteProvider, start_point_and_path) │
```
The InmemIndexBuilder<T> method list here is slightly out of sync with the actual trait: insert_vector, final_prune, save_index, and save_graph all return async futures (Pin<Box<dyn SendFuture<...>>>) rather than being synchronous methods. Updating the diagram to reflect that these operations are async will make it easier for readers to map this section to the code in diskann-disk/src/build/builder/inmem_builder.rs.
Suggested change:

```text
│ save_index(DynWriteProvider, metadata)             │
│   → Future<ANNResult<()>>                          │
│ save_graph(DynWriteProvider, start_point_and_path) │
│   → Future<ANNResult<()>>                          │
```
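To make the async shape concrete, here is a hedged sketch of a builder trait whose methods return boxed futures, as the comment describes. The names echo the doc's `InmemIndexBuilder<T>` and `ANNResult`, but every signature and type below is an illustrative stand-in, not the crate's real API.

```rust
use std::future::Future;
use std::pin::Pin;

// Illustrative stand-ins; not the crate's real types.
type ANNResult<T> = Result<T, String>;
type BoxFuture<T> = Pin<Box<dyn Future<Output = T> + Send>>;

trait InmemIndexBuilderSketch<T> {
    // Each operation returns a boxed future instead of completing inline.
    fn insert_vector(&mut self, id: u32, vector: Vec<T>) -> BoxFuture<ANNResult<()>>;
    fn final_prune(&mut self) -> BoxFuture<ANNResult<()>>;
}

struct ToyBuilder {
    inserted: usize,
}

impl InmemIndexBuilderSketch<f32> for ToyBuilder {
    fn insert_vector(&mut self, _id: u32, _vector: Vec<f32>) -> BoxFuture<ANNResult<()>> {
        self.inserted += 1; // bookkeeping happens eagerly in this toy
        Box::pin(async { Ok(()) })
    }
    fn final_prune(&mut self) -> BoxFuture<ANNResult<()>> {
        Box::pin(async { Ok(()) })
    }
}
```

Boxed futures keep the trait object-safe, which is often why this shape is chosen over plain `async fn` in trait definitions.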
```text
DataProvider (diskann::provider)
├── type InternalId = u32
├── type ExternalId = u32
├── type Context = DefaultContext
│
├── trait Accessor
│   ├── get_element(id) → vector data
│   ├── type Id, type GetError
│   │
│   ├── trait BuildQueryComputer<Query>
│   │   ├── build_query_computer(query) → QueryComputer
│   │   └── distances_unordered(ids, computer, callback)
│   │
│   └── trait BuildDistanceComputer
│       └── build_distance_computer() → DistanceComputer
│           (for random-access pairwise distance)
│
├── trait NeighborAccessor
│   └── neighbors(id) → &[Id] (read adjacency list)
│
├── trait NeighborAccessorMut
│   └── set_neighbors(id, &[Id]) (write adjacency list)
│
└── trait SetElement<&[T]>
    └── set_element(id, vector) (store vector data)
```
This "DataProvider trait hierarchy" diagram doesn’t match the real trait layout in diskann/src/provider.rs. DataProvider does not contain nested Accessor/NeighborAccessor traits; those are separate traits, and NeighborAccessor/NeighborAccessorMut are async and use an out-parameter (get_neighbors(self, id, &mut AdjacencyList) -> Future) rather than returning &[Id]. Consider reworking this section to reflect the actual trait boundaries and async signatures so readers don’t look for associated types/methods that don’t exist on DataProvider.
Suggested change:

```text
diskann::provider

DataProvider
├── type InternalId = u32
├── type ExternalId = u32
└── type Context = DefaultContext

Accessor (separate trait; implemented by accessor types, not nested in DataProvider)
├── type Id
├── type GetError
└── get_element(id) → vector data

BuildQueryComputer<Query> (separate trait)
├── build_query_computer(query) → QueryComputer
└── distances_unordered(ids, computer, callback)

BuildDistanceComputer (separate trait)
└── build_distance_computer() → DistanceComputer
    (for random-access pairwise distance)

NeighborAccessor (separate async trait)
└── get_neighbors(id, &mut AdjacencyList) → Future
    (read adjacency list into caller-provided buffer)

NeighborAccessorMut (separate async trait)
└── get_neighbors(id, &mut AdjacencyList) → Future
    (mutable-capable neighbor access using caller-provided buffer)

SetElement<&[T]> (separate trait)
└── set_element(id, vector) (store vector data)
```
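The out-parameter style the comment describes can be sketched as follows. This is a hedged, synchronous simplification (the real trait returns a `Future`), and every name here is an illustrative stand-in.

```rust
// Toy adjacency-list type; the real crate has its own AdjacencyList.
type AdjacencyList = Vec<u32>;

// Separate trait, not nested in any DataProvider: the caller owns the
// buffer and the accessor fills it, rather than returning &[Id].
trait NeighborAccessorSketch {
    fn get_neighbors(&self, id: u32, out: &mut AdjacencyList);
}

struct ToyGraph {
    adj: Vec<Vec<u32>>, // adjacency lists held in memory for the toy
}

impl NeighborAccessorSketch for ToyGraph {
    fn get_neighbors(&self, id: u32, out: &mut AdjacencyList) {
        out.clear();
        out.extend_from_slice(&self.adj[id as usize]);
    }
}
```

Reusing one `AdjacencyList` buffer across calls avoids per-candidate allocation, which is a common reason for the out-parameter shape in search loops.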
```text
save_index() ──► SaveWith<(u32, AsyncIndexMetadata)>
│   writes: graph structure, vectors, metadata
│
save_graph() ──► SaveWith<(u32, DiskGraphOnly)>
    writes: adjacency lists in disk sector layout
    (vectors + neighbor lists interleaved for locality)
```
The save_index() line implies SaveWith<(u32, AsyncIndexMetadata)>, but the index-level API is SaveWith<AsyncIndexMetadata> (the (u32, ...) tuple is handled internally by the provider save). save_graph() does use (u32, DiskGraphOnly) though. Updating the types here would better match diskann-providers/src/storage/index_storage.rs and diskann-disk/src/build/builder/inmem_builder.rs.
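The distinction the comment draws — one save target parameterised by the metadata alone, another by a `(u32, ...)` tuple — can be illustrated with a generic trait implemented for two target types. All types and strings below are hypothetical stand-ins for illustration only.

```rust
// Illustrative stand-ins, not the crate's real types.
struct AsyncIndexMetadata { num_points: u32 }
struct DiskGraphOnly;

// A save operation parameterised by its target type, echoing SaveWith<T>.
trait SaveWith<Target> {
    fn save_with(&self, target: Target) -> Result<String, String>;
}

struct ToyStorage;

// Index-level save: takes the metadata alone.
impl SaveWith<AsyncIndexMetadata> for ToyStorage {
    fn save_with(&self, m: AsyncIndexMetadata) -> Result<String, String> {
        Ok(format!("index: {} points", m.num_points))
    }
}

// Graph-only save: takes a (start_point, graph) tuple.
impl SaveWith<(u32, DiskGraphOnly)> for ToyStorage {
    fn save_with(&self, (start, _g): (u32, DiskGraphOnly)) -> Result<String, String> {
        Ok(format!("graph: start={}", start))
    }
}
```

Because the target type selects the impl, the same storage object can expose both save paths without separate method names.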
```text
DiskProvider<Data> (implements DataProvider)
├── type InternalId = u32
├── type ExternalId = u32
├── type Context = DefaultContext
│
├── pq_data: PQData            // PQ codebook + compressed vectors (in memory)
├── config: Config             // graph parameters
├── starting_points: Vec<u32>  // entry points for search
└── search_io_limit: usize     // max parallel IO ops
```
This DiskProvider<Data> field summary is inaccurate: the implementation does not have config or starting_points fields. Instead, it stores things like graph_header, distance_comparer, pq_data, num_points, metric, and search_io_limit, and the start vertex is derived from graph_header.metadata().medoid. Please adjust this struct overview to reflect the real fields so readers can reconcile it with diskann-disk/src/search/provider/disk_provider.rs.
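A hedged sketch of the corrected shape, using the field names the comment lists: the start vertex is derived from the header's medoid rather than stored as a `starting_points` field. Field types here are guesses for illustration, not the real definitions.

```rust
// Illustrative stand-in for the on-disk graph header.
struct GraphHeaderSketch { medoid: u32 }
impl GraphHeaderSketch {
    fn medoid(&self) -> u32 { self.medoid }
}

// Hypothetical simplification of DiskProvider<Data>'s field set.
struct DiskProviderSketch {
    graph_header: GraphHeaderSketch, // layout metadata, incl. medoid
    num_points: usize,
    search_io_limit: usize,          // max parallel IO ops
    // pq_data, distance_comparer, metric elided in this sketch
}

impl DiskProviderSketch {
    // The entry point is computed from the header, not a stored field.
    fn start_vertex(&self) -> u32 {
        self.graph_header.medoid()
    }
}
```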
```text
│ │ │ 1. vertex_provider reads disk sector → gets PQ-compressed vector
│ │ │    + neighbor list
│ │ │ 2. DiskQueryComputer computes approximate distance using PQ
│ │ │    lookup table
│ │ │ 3. callback(distance, id) feeds into best-first queue
│ │ │
│ │ └── IO is batched (beam_width sectors read in parallel)
│ │
│ ├── accessor.neighbors(id) → neighbor IDs
│ │   extracted from same disk sector that was read for the vector
│ │   (this is why disk layout interleaves vectors + adjacency lists)
│ │
│ └── ExpandBeam trait (implemented by DiskAccessor)
│     manages beam search expansion with IO batching
```
The search data-flow description currently suggests distances_unordered triggers disk reads via vertex_provider for each candidate. In the current implementation, DiskAccessor::get_element and distances_unordered operate on in-memory PQ-compressed vectors (pq_data.get_compressed_vector / compute_pq_distance), while disk IO happens when loading sectors to access adjacency lists / full vectors (via ensure_vertex_loaded in expand_beam and in post-processing rerank). Clarifying where IO actually occurs will prevent readers from misunderstanding the performance model.
Suggested change:

```text
│ │ │ 1. provider/PQ data supplies the in-memory PQ-compressed vector
│ │ │    for the candidate ID
│ │ │ 2. DiskQueryComputer computes approximate distance using the PQ
│ │ │    lookup table
│ │ │ 3. callback(distance, id) feeds into best-first queue
│ │ │
│ │ └── No per-candidate disk read occurs here; this step is PQ scoring
│ │     over in-memory compressed data
│ │
│ ├── accessor.neighbors(id) → neighbor IDs
│ │   available after the corresponding sector has been loaded during
│ │   beam expansion; disk IO is triggered by ensuring vertices/sectors
│ │   are loaded, not by `distances_unordered`
│ │
│ └── ExpandBeam trait (implemented by DiskAccessor)
│     manages beam search expansion with IO batching
│     (beam_width sectors may be read in parallel), which is where
│     adjacency lists / full on-disk records are fetched
```
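The performance point — PQ scoring touches only an in-memory lookup table, no disk — can be shown with a toy distance function. Real PQ codebooks and subspace math differ; this sketch only shows the access pattern: distance is a sum of table entries indexed by (subspace, code).

```rust
// Toy PQ scoring: `lookup[sub][code]` holds the precomputed partial
// distance of the query to centroid `code` in subspace `sub`. Scoring a
// candidate is pure table lookups over its in-memory compressed codes.
fn pq_distance(lookup: &[Vec<f32>], codes: &[u8]) -> f32 {
    codes
        .iter()
        .enumerate()
        .map(|(sub, &c)| lookup[sub][c as usize])
        .sum()
}
```

Because all inputs live in RAM, this step never blocks on IO; the disk reads happen only when beam expansion loads sectors for adjacency lists and full-precision vectors.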
```text
├── impl BuildQueryComputer<&[VectorDataType]>
│   ├── build_query_computer() → DiskQueryComputer
│   │   Preprocesses query against PQ centroids (quantizer_preprocess)
│   │   producing a lookup table for fast approximate distance
│   │
│   └── distances_unordered(ids, computer, callback)
│       Batch-reads sectors from disk, computes PQ distances
│
├── impl NeighborAccessor
```
This section says build_query_computer performs quantizer_preprocess, but in DiskAccessor the PQ preprocessing happens in DiskAccessor::new (before search), and build_query_computer just clones the already-populated aligned_pqtable_dist_scratch into a DiskQueryComputer. Updating the description would better match the actual initialization pipeline.
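The initialization pattern the comment describes — preprocess once in the constructor, then hand out cheap clones of the prepared scratch — looks roughly like this. All names and the squared-values "preprocessing" are invented stand-ins; only the constructor-vs-builder split mirrors the comment.

```rust
// Stand-in for the precomputed PQ lookup scratch.
#[derive(Clone)]
struct PqScratchSketch { table: Vec<f32> }

struct DiskAccessorSketch { scratch: PqScratchSketch }

impl DiskAccessorSketch {
    // The quantizer-preprocess analogue runs once, up front, in new().
    fn new(query: &[f32]) -> Self {
        let table = query.iter().map(|x| x * x).collect();
        Self { scratch: PqScratchSketch { table } }
    }

    // No recomputation here: just a clone of the already-populated scratch.
    fn build_query_computer(&self) -> PqScratchSketch {
        self.scratch.clone()
    }
}
```

This split matters for the doc: the expensive step is amortised per search, so `build_query_computer` stays cheap even if called repeatedly.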
| Aspect | Build | Search |
|--------|-------|--------|
| **DataProvider** | `FullPrecisionProvider<T>` or `DefaultProvider<NoStore, Q>` | `DiskProvider<Data>` |
| **Vector Storage** | In-memory (`Vec<T>` or quantized store) | Disk sectors (read via `VertexProvider`) |
| **Adjacency Lists** | In-memory graph (`Vec<Vec<u32>>`) | Interleaved in disk sectors |
| **Distance** | Exact (full precision or SQ) | Approximate (PQ lookup table) |
| **Strategy** | `InsertStrategy` + `PruneStrategy` | `SearchStrategy` only |
| **Accessor** | `inmem::Accessor` variants | `DiskAccessor<Data, VP>` |
| **IO** | Memory reads | Batched aligned disk reads (beam_width) |
| **Key Trait** | `InmemIndexBuilder<T>` | `VertexProviderFactory` + `VertexProvider` |
The markdown table under "Build vs Search Comparison" uses || at the start of each row, which creates an empty leading column and typically renders misaligned on GitHub. Switching to standard table syntax (single leading/trailing |) will make the comparison table render as intended.
Codecov Report

✅ All modified and coverable lines are covered by tests.

```text
@@ Coverage Diff @@
##             main     #898   +/- ##
=====================================
  Coverage   89.31%   89.32%
=====================================
  Files         445      445
  Lines       84095    84095
=====================================
+ Hits        75113    75116      +3
+ Misses       8982     8979      -3
```

Flags with carried forward coverage won't be shown.
```text
├── build_quantizer: BuildQuantizer    // decides FP vs quantized in-mem build
└── core: DiskIndexBuilderCore
    ├── index_writer: DiskIndexWriter  // writes final disk layout
    ├── storage_provider: &StorageProvider
```
It's important to distinguish between the StorageProvider used for build and the one used for search; they are different. The build one just uses regular `std::fs` functions, while the search one is more performance-centric.
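To illustrate the build-side half of that distinction, here is a hedged sketch of a write provider backed by plain `std::fs` — the point being that the build path needs nothing fancier than ordinary buffered file IO, while the search path (not shown) would sit behind an aligned-read abstraction. Trait and struct names are illustrative stand-ins.

```rust
use std::fs;

// Build-side storage seam: ordinary filesystem writes are sufficient.
trait WriteProviderSketch {
    fn write_all(&self, path: &str, bytes: &[u8]) -> std::io::Result<()>;
}

struct StdFsProvider;

impl WriteProviderSketch for StdFsProvider {
    fn write_all(&self, path: &str, bytes: &[u8]) -> std::io::Result<()> {
        fs::write(path, bytes) // plain std::fs, as used on the build path
    }
}
```

The search-side counterpart would instead expose aligned, batched sector reads, which is why the two providers are separate abstractions rather than one shared interface.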
```text
    (vectors + neighbor lists interleaved for locality)
```

The `DiskIndexWriter` (in `diskann-disk/src/storage`) handles the final disk layout where each sector contains a vertex's vector data adjacent to its neighbor list for cache-friendly disk reads.
It’s worth emphasizing that DiskIndexWriter is purely a data-transformation component: it reads data from files in one format and writes it out in another.
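A minimal illustration of that "pure transformation" framing: take vectors and adjacency lists as separate inputs and emit them interleaved, vector bytes immediately followed by neighbor bytes. The byte layout here is invented for the example; the real sector format has headers, padding, and fixed sector sizes.

```rust
// Toy re-layout: interleave each vertex's vector bytes with its
// adjacency bytes so a single contiguous read fetches both.
fn interleave(vectors: &[Vec<u8>], neighbors: &[Vec<u8>]) -> Vec<u8> {
    let mut out = Vec::new();
    for (v, n) in vectors.iter().zip(neighbors) {
        out.extend_from_slice(v); // vector data
        out.extend_from_slice(n); // neighbor list placed adjacent for locality
    }
    out
}
```

Nothing about the graph is recomputed here — the writer only changes representation, which is what makes it easy to reason about and test in isolation.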
```text
│ Creates per-search VertexProvider instances
│
└── DiskVertexProviderFactory<Data, ReaderFactory>
    ├── ReaderFactory: AlignedReaderFactory (creates aligned IO readers)
```
CLI tools use the DiskVertexProviderFactory, which returns an AlignedFileReader (e.g., WindowsAlignedFileReader, Linux*, vfs). However, some clients provide their own VertexProviderFactory implementations that rely on different storage-layer abstractions.
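That pluggability comes from the search path depending only on the factory trait, so any storage backend can be swapped in. A hedged sketch of the seam, with all names invented for illustration:

```rust
// The searcher depends only on this trait, not on a concrete backend.
trait VertexProviderFactorySketch {
    fn reader_kind(&self) -> &'static str;
}

// CLI-style factory over aligned file readers.
struct AlignedFileFactory;
impl VertexProviderFactorySketch for AlignedFileFactory {
    fn reader_kind(&self) -> &'static str {
        "aligned-file" // stands in for the Windows/Linux aligned readers
    }
}

// A client-provided factory over a different storage layer.
struct BlobStoreFactory;
impl VertexProviderFactorySketch for BlobStoreFactory {
    fn reader_kind(&self) -> &'static str {
        "blob-store"
    }
}

// Either factory can be plugged into code written against the trait.
fn describe(f: &dyn VertexProviderFactorySketch) -> String {
    format!("reads via {}", f.reader_kind())
}
```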
This pull request adds an architecture document for the DiskANN disk index, detailing the trait and data-flow architecture for both index building and searching. The document explains how the major traits, structs, and strategies interact throughout the lifecycle of the index, providing diagrams, data-flow explanations, and a comparison of build vs search paths. All generated using Copilot, of course.
I found this useful recently during a design ideation, so I thought it might be useful for other folks using the disk-index.