[disk-index] Doc outlining high-level flow of disk-index #898
arkrishn94 wants to merge 1 commit into main
Conversation
Pull request overview
Adds a new architecture document describing the DiskANN disk-index build and search flows, aiming to explain the trait/strategy/data-provider architecture and how components interact across the lifecycle.
Changes:
- Introduces `diskann-disk/docs/architecture.md` with high-level build/search flow explanations and diagrams.
- Documents key traits/structs (e.g., `InmemIndexBuilder`, `DiskProvider`, `DiskIndexSearcher`) and intended dataflow steps.
- Adds a build vs search comparison table and a “unifying abstraction” section.
```text
│ save_index(DynWriteProvider, metadata)             │
│ save_graph(DynWriteProvider, start_point_and_path) │
```
The InmemIndexBuilder<T> method list here is slightly out of sync with the actual trait: insert_vector, final_prune, save_index, and save_graph all return async futures (Pin<Box<dyn SendFuture<...>>>) rather than being synchronous methods. Updating the diagram to reflect that these operations are async will make it easier for readers to map this section to the code in diskann-disk/src/build/builder/inmem_builder.rs.
Suggested change:

```text
│ save_index(DynWriteProvider, metadata)             │
│   → Future<ANNResult<()>>                          │
│ save_graph(DynWriteProvider, start_point_and_path) │
│   → Future<ANNResult<()>>                          │
```
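To make the async shape concrete, here is a hedged sketch of a builder trait whose methods return boxed futures, as the comment describes. The names echo the doc's `InmemIndexBuilder<T>` and `ANNResult`, but every signature and type below is an illustrative stand-in, not the crate's real API.

```rust
use std::future::Future;
use std::pin::Pin;

// Illustrative stand-ins; not the crate's real types.
type ANNResult<T> = Result<T, String>;
type BoxFuture<T> = Pin<Box<dyn Future<Output = T> + Send>>;

trait InmemIndexBuilderSketch<T> {
    // Each operation returns a boxed future instead of completing inline.
    fn insert_vector(&mut self, id: u32, vector: Vec<T>) -> BoxFuture<ANNResult<()>>;
    fn final_prune(&mut self) -> BoxFuture<ANNResult<()>>;
}

struct ToyBuilder {
    inserted: usize,
}

impl InmemIndexBuilderSketch<f32> for ToyBuilder {
    fn insert_vector(&mut self, _id: u32, _vector: Vec<f32>) -> BoxFuture<ANNResult<()>> {
        self.inserted += 1; // bookkeeping happens eagerly in this toy
        Box::pin(async { Ok(()) })
    }
    fn final_prune(&mut self) -> BoxFuture<ANNResult<()>> {
        Box::pin(async { Ok(()) })
    }
}
```

Boxed futures keep the trait object-safe, which is often why this shape is chosen over plain `async fn` in trait definitions.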
```text
DataProvider (diskann::provider)
├── type InternalId = u32
├── type ExternalId = u32
├── type Context = DefaultContext
│
├── trait Accessor
│   ├── get_element(id) → vector data
│   ├── type Id, type GetError
│   │
│   ├── trait BuildQueryComputer<Query>
│   │   ├── build_query_computer(query) → QueryComputer
│   │   └── distances_unordered(ids, computer, callback)
│   │
│   └── trait BuildDistanceComputer
│       └── build_distance_computer() → DistanceComputer
│           (for random-access pairwise distance)
│
├── trait NeighborAccessor
│   └── neighbors(id) → &[Id] (read adjacency list)
│
├── trait NeighborAccessorMut
│   └── set_neighbors(id, &[Id]) (write adjacency list)
│
└── trait SetElement<&[T]>
    └── set_element(id, vector) (store vector data)
```
This "DataProvider trait hierarchy" diagram doesn’t match the real trait layout in diskann/src/provider.rs. DataProvider does not contain nested Accessor/NeighborAccessor traits; those are separate traits, and NeighborAccessor/NeighborAccessorMut are async and use an out-parameter (get_neighbors(self, id, &mut AdjacencyList) -> Future) rather than returning &[Id]. Consider reworking this section to reflect the actual trait boundaries and async signatures so readers don’t look for associated types/methods that don’t exist on DataProvider.
Suggested change:

```text
diskann::provider

DataProvider
├── type InternalId = u32
├── type ExternalId = u32
└── type Context = DefaultContext

Accessor (separate trait; implemented by accessor types, not nested in DataProvider)
├── type Id
├── type GetError
└── get_element(id) → vector data

BuildQueryComputer<Query> (separate trait)
├── build_query_computer(query) → QueryComputer
└── distances_unordered(ids, computer, callback)

BuildDistanceComputer (separate trait)
└── build_distance_computer() → DistanceComputer
    (for random-access pairwise distance)

NeighborAccessor (separate async trait)
└── get_neighbors(id, &mut AdjacencyList) → Future
    (read adjacency list into caller-provided buffer)

NeighborAccessorMut (separate async trait)
└── get_neighbors(id, &mut AdjacencyList) → Future
    (mutable-capable neighbor access using caller-provided buffer)

SetElement<&[T]> (separate trait)
└── set_element(id, vector) (store vector data)
```
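The out-parameter style the comment describes can be sketched as follows. This is a hedged, synchronous simplification (the real trait returns a `Future`), and every name here is an illustrative stand-in.

```rust
// Toy adjacency-list type; the real crate has its own AdjacencyList.
type AdjacencyList = Vec<u32>;

// Separate trait, not nested in any DataProvider: the caller owns the
// buffer and the accessor fills it, rather than returning &[Id].
trait NeighborAccessorSketch {
    fn get_neighbors(&self, id: u32, out: &mut AdjacencyList);
}

struct ToyGraph {
    adj: Vec<Vec<u32>>, // adjacency lists held in memory for the toy
}

impl NeighborAccessorSketch for ToyGraph {
    fn get_neighbors(&self, id: u32, out: &mut AdjacencyList) {
        out.clear();
        out.extend_from_slice(&self.adj[id as usize]);
    }
}
```

Reusing one `AdjacencyList` buffer across calls avoids per-candidate allocation, which is a common reason for the out-parameter shape in search loops.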
```text
save_index() ──► SaveWith<(u32, AsyncIndexMetadata)>
│   writes: graph structure, vectors, metadata
│
save_graph() ──► SaveWith<(u32, DiskGraphOnly)>
    writes: adjacency lists in disk sector layout
    (vectors + neighbor lists interleaved for locality)
```
The save_index() line implies SaveWith<(u32, AsyncIndexMetadata)>, but the index-level API is SaveWith<AsyncIndexMetadata> (the (u32, ...) tuple is handled internally by the provider save). save_graph() does use (u32, DiskGraphOnly) though. Updating the types here would better match diskann-providers/src/storage/index_storage.rs and diskann-disk/src/build/builder/inmem_builder.rs.
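The distinction the comment draws — one save target parameterised by the metadata alone, another by a `(u32, ...)` tuple — can be illustrated with a generic trait implemented for two target types. All types and strings below are hypothetical stand-ins for illustration only.

```rust
// Illustrative stand-ins, not the crate's real types.
struct AsyncIndexMetadata { num_points: u32 }
struct DiskGraphOnly;

// A save operation parameterised by its target type, echoing SaveWith<T>.
trait SaveWith<Target> {
    fn save_with(&self, target: Target) -> Result<String, String>;
}

struct ToyStorage;

// Index-level save: takes the metadata alone.
impl SaveWith<AsyncIndexMetadata> for ToyStorage {
    fn save_with(&self, m: AsyncIndexMetadata) -> Result<String, String> {
        Ok(format!("index: {} points", m.num_points))
    }
}

// Graph-only save: takes a (start_point, graph) tuple.
impl SaveWith<(u32, DiskGraphOnly)> for ToyStorage {
    fn save_with(&self, (start, _g): (u32, DiskGraphOnly)) -> Result<String, String> {
        Ok(format!("graph: start={}", start))
    }
}
```

Because the target type selects the impl, the same storage object can expose both save paths without separate method names.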
```text
DiskProvider<Data> (implements DataProvider)
├── type InternalId = u32
├── type ExternalId = u32
├── type Context = DefaultContext
│
├── pq_data: PQData            // PQ codebook + compressed vectors (in memory)
├── config: Config             // graph parameters
├── starting_points: Vec<u32>  // entry points for search
└── search_io_limit: usize     // max parallel IO ops
```
This DiskProvider<Data> field summary is inaccurate: the implementation does not have config or starting_points fields. Instead, it stores things like graph_header, distance_comparer, pq_data, num_points, metric, and search_io_limit, and the start vertex is derived from graph_header.metadata().medoid. Please adjust this struct overview to reflect the real fields so readers can reconcile it with diskann-disk/src/search/provider/disk_provider.rs.
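A hedged sketch of the corrected shape, using the field names the comment lists: the start vertex is derived from the header's medoid rather than stored as a `starting_points` field. Field types here are guesses for illustration, not the real definitions.

```rust
// Illustrative stand-in for the on-disk graph header.
struct GraphHeaderSketch { medoid: u32 }
impl GraphHeaderSketch {
    fn medoid(&self) -> u32 { self.medoid }
}

// Hypothetical simplification of DiskProvider<Data>'s field set.
struct DiskProviderSketch {
    graph_header: GraphHeaderSketch, // layout metadata, incl. medoid
    num_points: usize,
    search_io_limit: usize,          // max parallel IO ops
    // pq_data, distance_comparer, metric elided in this sketch
}

impl DiskProviderSketch {
    // The entry point is computed from the header, not a stored field.
    fn start_vertex(&self) -> u32 {
        self.graph_header.medoid()
    }
}
```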
```text
│ │ │ 1. vertex_provider reads disk sector → gets PQ-compressed vector
│ │ │    + neighbor list
│ │ │ 2. DiskQueryComputer computes approximate distance using PQ
│ │ │    lookup table
│ │ │ 3. callback(distance, id) feeds into best-first queue
│ │ │
│ │ └── IO is batched (beam_width sectors read in parallel)
│ │
│ ├── accessor.neighbors(id) → neighbor IDs
│ │   extracted from same disk sector that was read for the vector
│ │   (this is why disk layout interleaves vectors + adjacency lists)
│ │
│ └── ExpandBeam trait (implemented by DiskAccessor)
│     manages beam search expansion with IO batching
```
The search data-flow description currently suggests distances_unordered triggers disk reads via vertex_provider for each candidate. In the current implementation, DiskAccessor::get_element and distances_unordered operate on in-memory PQ-compressed vectors (pq_data.get_compressed_vector / compute_pq_distance), while disk IO happens when loading sectors to access adjacency lists / full vectors (via ensure_vertex_loaded in expand_beam and in post-processing rerank). Clarifying where IO actually occurs will prevent readers from misunderstanding the performance model.
Suggested change:

```text
│ │ │ 1. provider/PQ data supplies the in-memory PQ-compressed vector
│ │ │    for the candidate ID
│ │ │ 2. DiskQueryComputer computes approximate distance using the PQ
│ │ │    lookup table
│ │ │ 3. callback(distance, id) feeds into best-first queue
│ │ │
│ │ └── No per-candidate disk read occurs here; this step is PQ scoring
│ │     over in-memory compressed data
│ │
│ ├── accessor.neighbors(id) → neighbor IDs
│ │   available after the corresponding sector has been loaded during
│ │   beam expansion; disk IO is triggered by ensuring vertices/sectors
│ │   are loaded, not by `distances_unordered`
│ │
│ └── ExpandBeam trait (implemented by DiskAccessor)
│     manages beam search expansion with IO batching
│     (beam_width sectors may be read in parallel), which is where
│     adjacency lists / full on-disk records are fetched
```
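The performance point — PQ scoring touches only an in-memory lookup table, no disk — can be shown with a toy distance function. Real PQ codebooks and subspace math differ; this sketch only shows the access pattern: distance is a sum of table entries indexed by (subspace, code).

```rust
// Toy PQ scoring: `lookup[sub][code]` holds the precomputed partial
// distance of the query to centroid `code` in subspace `sub`. Scoring a
// candidate is pure table lookups over its in-memory compressed codes.
fn pq_distance(lookup: &[Vec<f32>], codes: &[u8]) -> f32 {
    codes
        .iter()
        .enumerate()
        .map(|(sub, &c)| lookup[sub][c as usize])
        .sum()
}
```

Because all inputs live in RAM, this step never blocks on IO; the disk reads happen only when beam expansion loads sectors for adjacency lists and full-precision vectors.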
```text
├── impl BuildQueryComputer<&[VectorDataType]>
│   ├── build_query_computer() → DiskQueryComputer
│   │   Preprocesses query against PQ centroids (quantizer_preprocess)
│   │   producing a lookup table for fast approximate distance
│   │
│   └── distances_unordered(ids, computer, callback)
│       Batch-reads sectors from disk, computes PQ distances
│
├── impl NeighborAccessor
```
This section says build_query_computer performs quantizer_preprocess, but in DiskAccessor the PQ preprocessing happens in DiskAccessor::new (before search), and build_query_computer just clones the already-populated aligned_pqtable_dist_scratch into a DiskQueryComputer. Updating the description would better match the actual initialization pipeline.
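The initialization pattern the comment describes — preprocess once in the constructor, then hand out cheap clones of the prepared scratch — looks roughly like this. All names and the squared-values "preprocessing" are invented stand-ins; only the constructor-vs-builder split mirrors the comment.

```rust
// Stand-in for the precomputed PQ lookup scratch.
#[derive(Clone)]
struct PqScratchSketch { table: Vec<f32> }

struct DiskAccessorSketch { scratch: PqScratchSketch }

impl DiskAccessorSketch {
    // The quantizer-preprocess analogue runs once, up front, in new().
    fn new(query: &[f32]) -> Self {
        let table = query.iter().map(|x| x * x).collect();
        Self { scratch: PqScratchSketch { table } }
    }

    // No recomputation here: just a clone of the already-populated scratch.
    fn build_query_computer(&self) -> PqScratchSketch {
        self.scratch.clone()
    }
}
```

This split matters for the doc: the expensive step is amortised per search, so `build_query_computer` stays cheap even if called repeatedly.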
| Aspect | Build | Search |
|--------|-------|--------|
| **DataProvider** | `FullPrecisionProvider<T>` or `DefaultProvider<NoStore, Q>` | `DiskProvider<Data>` |
| **Vector Storage** | In-memory (`Vec<T>` or quantized store) | Disk sectors (read via `VertexProvider`) |
| **Adjacency Lists** | In-memory graph (`Vec<Vec<u32>>`) | Interleaved in disk sectors |
| **Distance** | Exact (full precision or SQ) | Approximate (PQ lookup table) |
| **Strategy** | `InsertStrategy` + `PruneStrategy` | `SearchStrategy` only |
| **Accessor** | `inmem::Accessor` variants | `DiskAccessor<Data, VP>` |
| **IO** | Memory reads | Batched aligned disk reads (beam_width) |
| **Key Trait** | `InmemIndexBuilder<T>` | `VertexProviderFactory` + `VertexProvider` |
The markdown table under "Build vs Search Comparison" uses || at the start of each row, which creates an empty leading column and typically renders misaligned on GitHub. Switching to standard table syntax (single leading/trailing |) will make the comparison table render as intended.
Codecov Report

✅ All modified and coverable lines are covered by tests.

```text
@@ Coverage Diff @@
##             main     #898   +/- ##
=====================================
  Coverage   89.31%   89.32%
=====================================
  Files         445      445
  Lines       84095    84095
=====================================
+ Hits        75113    75116      +3
+ Misses       8982     8979      -3
```

Flags with carried forward coverage won't be shown.
```text
├── build_quantizer: BuildQuantizer    // decides FP vs quantized in-mem build
└── core: DiskIndexBuilderCore
    ├── index_writer: DiskIndexWriter  // writes final disk layout
    ├── storage_provider: &StorageProvider
```
It's important to distinguish between the StorageProvider used for build and the one used for search; they are different. The build one just uses regular `std::fs` functions, while the search one is more performance-centric.
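To illustrate the build-side half of that distinction, here is a hedged sketch of a write provider backed by plain `std::fs` — the point being that the build path needs nothing fancier than ordinary buffered file IO, while the search path (not shown) would sit behind an aligned-read abstraction. Trait and struct names are illustrative stand-ins.

```rust
use std::fs;

// Build-side storage seam: ordinary filesystem writes are sufficient.
trait WriteProviderSketch {
    fn write_all(&self, path: &str, bytes: &[u8]) -> std::io::Result<()>;
}

struct StdFsProvider;

impl WriteProviderSketch for StdFsProvider {
    fn write_all(&self, path: &str, bytes: &[u8]) -> std::io::Result<()> {
        fs::write(path, bytes) // plain std::fs, as used on the build path
    }
}
```

The search-side counterpart would instead expose aligned, batched sector reads, which is why the two providers are separate abstractions rather than one shared interface.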
```text
    (vectors + neighbor lists interleaved for locality)
```

The `DiskIndexWriter` (in `diskann-disk/src/storage`) handles the final disk layout where each sector contains a vertex's vector data adjacent to its neighbor list for cache-friendly disk reads.
It’s worth emphasizing that DiskIndexWriter is purely a data-transformation component: it reads data from files in one format and writes it out in another.
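A minimal illustration of that "pure transformation" framing: take vectors and adjacency lists as separate inputs and emit them interleaved, vector bytes immediately followed by neighbor bytes. The byte layout here is invented for the example; the real sector format has headers, padding, and fixed sector sizes.

```rust
// Toy re-layout: interleave each vertex's vector bytes with its
// adjacency bytes so a single contiguous read fetches both.
fn interleave(vectors: &[Vec<u8>], neighbors: &[Vec<u8>]) -> Vec<u8> {
    let mut out = Vec::new();
    for (v, n) in vectors.iter().zip(neighbors) {
        out.extend_from_slice(v); // vector data
        out.extend_from_slice(n); // neighbor list placed adjacent for locality
    }
    out
}
```

Nothing about the graph is recomputed here — the writer only changes representation, which is what makes it easy to reason about and test in isolation.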
```text
│ Creates per-search VertexProvider instances
│
└── DiskVertexProviderFactory<Data, ReaderFactory>
    ├── ReaderFactory: AlignedReaderFactory (creates aligned IO readers)
```
CLI tools use the DiskVertexProviderFactory, which returns an AlignedFileReader (e.g., WindowsAlignedFileReader, Linux*, vfs). However, some clients provide their own VertexProviderFactory implementations that rely on different storage-layer abstractions.
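That pluggability comes from the search path depending only on the factory trait, so any storage backend can be swapped in. A hedged sketch of the seam, with all names invented for illustration:

```rust
// The searcher depends only on this trait, not on a concrete backend.
trait VertexProviderFactorySketch {
    fn reader_kind(&self) -> &'static str;
}

// CLI-style factory over aligned file readers.
struct AlignedFileFactory;
impl VertexProviderFactorySketch for AlignedFileFactory {
    fn reader_kind(&self) -> &'static str {
        "aligned-file" // stands in for the Windows/Linux aligned readers
    }
}

// A client-provided factory over a different storage layer.
struct BlobStoreFactory;
impl VertexProviderFactorySketch for BlobStoreFactory {
    fn reader_kind(&self) -> &'static str {
        "blob-store"
    }
}

// Either factory can be plugged into code written against the trait.
fn describe(f: &dyn VertexProviderFactorySketch) -> String {
    format!("reads via {}", f.reader_kind())
}
```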
This pull request adds an architecture document for the DiskANN disk index, detailing the trait and data-flow architecture for both index building and searching. The document explains how the major traits, structs, and strategies interact throughout the lifecycle of the index, providing diagrams, data-flow explanations, and a comparison of build vs search paths. All generated using Copilot, of course.
I found this useful recently during a design ideation, so I thought it might be useful for other folks using the disk-index.