
[disk-index] Doc outlining high-level flow of disk-index #898

Open
arkrishn94 wants to merge 1 commit into main from u/adkrishnan/disk-index-docs

Conversation

@arkrishn94
Contributor

@arkrishn94 arkrishn94 commented Apr 3, 2026

This pull request adds an architecture document for the DiskANN disk index, detailing the trait and data-flow architecture for both index building and searching. The document explains how the major traits, structs, and strategies interact throughout the lifecycle of the index, providing diagrams, data-flow explanations, and a comparison of the build vs search paths. All generated using Copilot, of course.

I found this useful recently during design ideation, so I thought it might be useful for other folks using the disk-index.

@arkrishn94 arkrishn94 requested review from a team and Copilot April 3, 2026 16:56
@arkrishn94 arkrishn94 changed the title [disk-index] Docs outlining high-level flow of disk-index [disk-index] Doc outlining high-level flow of disk-index Apr 3, 2026

Copilot AI left a comment


Pull request overview

Adds a new architecture document describing the DiskANN disk-index build and search flows, aiming to explain the trait/strategy/data-provider architecture and how components interact across the lifecycle.

Changes:

  • Introduces diskann-disk/docs/architecture.md with high-level build/search flow explanations and diagrams.
  • Documents key traits/structs (e.g., InmemIndexBuilder, DiskProvider, DiskIndexSearcher) and intended dataflow steps.
  • Adds a build vs search comparison table and a “unifying abstraction” section.


Comment on lines +70 to +71

```text
│ save_index(DynWriteProvider, metadata) │
│ save_graph(DynWriteProvider, start_point_and_path) │
```

Copilot AI Apr 3, 2026


The InmemIndexBuilder<T> method list here is slightly out of sync with the actual trait: insert_vector, final_prune, save_index, and save_graph all return async futures (Pin<Box<dyn SendFuture<...>>>) rather than being synchronous methods. Updating the diagram to reflect that these operations are async will make it easier for readers to map this section to the code in diskann-disk/src/build/builder/inmem_builder.rs.

Suggested change

```diff
-│ save_index(DynWriteProvider, metadata) │
-│ save_graph(DynWriteProvider, start_point_and_path) │
+│ save_index(DynWriteProvider, metadata) │
+│   → Future<ANNResult<()>> │
+│ save_graph(DynWriteProvider, start_point_and_path) │
+│   → Future<ANNResult<()>> │
```
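To make the async shape concrete, here is a minimal sketch of what "returning a boxed future instead of completing synchronously" looks like. The trait and method names mirror the diagram, but the bodies, type aliases (`ANNResult`, `BoxFuture`), and the tiny executor are illustrative stand-ins, not the crate's real definitions.

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Hypothetical stand-ins for the crate's types, for illustration only.
type ANNResult<T> = Result<T, String>;
type BoxFuture<'a, T> = Pin<Box<dyn Future<Output = T> + Send + 'a>>;

// Sketch of the async shape of the builder methods: each returns a boxed
// future rather than completing synchronously.
trait InmemIndexBuilderSketch {
    fn save_index(&self, metadata: u32) -> BoxFuture<'_, ANNResult<()>>;
    fn save_graph(&self, path: &str) -> BoxFuture<'_, ANNResult<()>>;
}

struct DemoBuilder;

impl InmemIndexBuilderSketch for DemoBuilder {
    fn save_index(&self, _metadata: u32) -> BoxFuture<'_, ANNResult<()>> {
        Box::pin(std::future::ready(Ok(())))
    }
    fn save_graph(&self, _path: &str) -> BoxFuture<'_, ANNResult<()>> {
        Box::pin(std::future::ready(Ok(())))
    }
}

// Minimal executor, sufficient for already-ready futures.
fn block_on<F: Future>(mut fut: F) -> F::Output {
    fn noop(_: *const ()) {}
    fn clone_raw(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone_raw, noop, noop, noop);
    let waker = unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) };
    let mut cx = Context::from_waker(&waker);
    // Safety: `fut` is a local that is not moved after being pinned.
    let mut pinned = unsafe { Pin::new_unchecked(&mut fut) };
    loop {
        if let Poll::Ready(out) = pinned.as_mut().poll(&mut cx) {
            return out;
        }
    }
}
```

Callers must drive the returned future (here via the toy `block_on`), which is the key difference readers should take away from the updated diagram.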

Comment on lines +122 to +146

```text
DataProvider (diskann::provider)
├── type InternalId = u32
├── type ExternalId = u32
├── type Context = DefaultContext
├── trait Accessor
│ ├── get_element(id) → vector data
│ ├── type Id, type GetError
│ │
│ ├── trait BuildQueryComputer<Query>
│ │ ├── build_query_computer(query) → QueryComputer
│ │ └── distances_unordered(ids, computer, callback)
│ │
│ └── trait BuildDistanceComputer
│ └── build_distance_computer() → DistanceComputer
│ (for random-access pairwise distance)
├── trait NeighborAccessor
│ └── neighbors(id) → &[Id] (read adjacency list)
├── trait NeighborAccessorMut
│ └── set_neighbors(id, &[Id]) (write adjacency list)
└── trait SetElement<&[T]>
└── set_element(id, vector) (store vector data)
```

Copilot AI Apr 3, 2026


This "DataProvider trait hierarchy" diagram doesn’t match the real trait layout in diskann/src/provider.rs. DataProvider does not contain nested Accessor/NeighborAccessor traits; those are separate traits, and NeighborAccessor/NeighborAccessorMut are async and use an out-parameter (get_neighbors(self, id, &mut AdjacencyList) -> Future) rather than returning &[Id]. Consider reworking this section to reflect the actual trait boundaries and async signatures so readers don’t look for associated types/methods that don’t exist on DataProvider.

Suggested change

```diff
-DataProvider (diskann::provider)
-├── type InternalId = u32
-├── type ExternalId = u32
-├── type Context = DefaultContext
-├── trait Accessor
-│ ├── get_element(id) → vector data
-│ ├── type Id, type GetError
-│ │
-│ ├── trait BuildQueryComputer<Query>
-│ │ ├── build_query_computer(query) → QueryComputer
-│ │ └── distances_unordered(ids, computer, callback)
-│ │
-│ └── trait BuildDistanceComputer
-│ └── build_distance_computer() → DistanceComputer
-│ (for random-access pairwise distance)
-├── trait NeighborAccessor
-│ └── neighbors(id) → &[Id] (read adjacency list)
-├── trait NeighborAccessorMut
-│ └── set_neighbors(id, &[Id]) (write adjacency list)
-└── trait SetElement<&[T]>
-└── set_element(id, vector) (store vector data)
+diskann::provider
+DataProvider
+├── type InternalId = u32
+├── type ExternalId = u32
+└── type Context = DefaultContext
+Accessor (separate trait; implemented by accessor types, not nested in DataProvider)
+├── type Id
+├── type GetError
+└── get_element(id) → vector data
+BuildQueryComputer<Query> (separate trait)
+├── build_query_computer(query) → QueryComputer
+└── distances_unordered(ids, computer, callback)
+BuildDistanceComputer (separate trait)
+└── build_distance_computer() → DistanceComputer
+    (for random-access pairwise distance)
+NeighborAccessor (separate async trait)
+└── get_neighbors(id, &mut AdjacencyList) -> Future
+    (read adjacency list into caller-provided buffer)
+NeighborAccessorMut (separate async trait)
+└── get_neighbors(id, &mut AdjacencyList) -> Future
+    (mutable-capable neighbor access using caller-provided buffer)
+SetElement<&[T]> (separate trait)
+└── set_element(id, vector) (store vector data)
```
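A small Rust sketch of the "separate traits, async out-parameter" point: neighbor access lives outside `DataProvider` and fills a caller-provided buffer instead of returning a borrowed slice. Names follow the review comment; the bodies, `BoxFuture` alias, and the toy `block_on` executor are illustrative stand-ins, not the real code in `diskann/src/provider.rs`.

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Illustrative stand-ins only.
type AdjacencyList = Vec<u32>;
type BoxFuture<'a, T> = Pin<Box<dyn Future<Output = T> + Send + 'a>>;

// DataProvider carries the ID/context associated types ...
trait DataProviderSketch {
    type InternalId;
    type ExternalId;
}

// ... while neighbor access is a separate, async trait that fills a
// caller-provided buffer rather than returning a borrowed `&[Id]`.
trait NeighborAccessorSketch {
    fn get_neighbors<'a>(&'a self, id: u32, out: &'a mut AdjacencyList) -> BoxFuture<'a, ()>;
}

struct InMemGraph {
    adjacency: Vec<Vec<u32>>,
}

impl DataProviderSketch for InMemGraph {
    type InternalId = u32;
    type ExternalId = u32;
}

impl NeighborAccessorSketch for InMemGraph {
    fn get_neighbors<'a>(&'a self, id: u32, out: &'a mut AdjacencyList) -> BoxFuture<'a, ()> {
        // This sketch does the copy eagerly and returns an already-ready future.
        out.clear();
        out.extend_from_slice(&self.adjacency[id as usize]);
        Box::pin(std::future::ready(()))
    }
}

// Minimal executor, sufficient for already-ready futures.
fn block_on<F: Future>(mut fut: F) -> F::Output {
    fn noop(_: *const ()) {}
    fn clone_raw(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone_raw, noop, noop, noop);
    let waker = unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) };
    let mut cx = Context::from_waker(&waker);
    // Safety: `fut` is a local that is not moved after being pinned.
    let mut pinned = unsafe { Pin::new_unchecked(&mut fut) };
    loop {
        if let Poll::Ready(out) = pinned.as_mut().poll(&mut cx) {
            return out;
        }
    }
}
```

The out-parameter shape lets callers reuse one `AdjacencyList` buffer across many expansions instead of allocating per call.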

Comment on lines +169 to +174

```text
save_index() ──► SaveWith<(u32, AsyncIndexMetadata)>
│ writes: graph structure, vectors, metadata
save_graph() ──► SaveWith<(u32, DiskGraphOnly)>
writes: adjacency lists in disk sector layout
(vectors + neighbor lists interleaved for locality)
```

Copilot AI Apr 3, 2026


The save_index() line implies SaveWith<(u32, AsyncIndexMetadata)>, but the index-level API is SaveWith<AsyncIndexMetadata> (the (u32, ...) tuple is handled internally by the provider save). save_graph() does use (u32, DiskGraphOnly) though. Updating the types here would better match diskann-providers/src/storage/index_storage.rs and diskann-disk/src/build/builder/inmem_builder.rs.
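A generic-parameter sketch of the distinction being made: `SaveWith<Arg>` is parameterized over what gets saved, and the tuple form only appears at one level. All types and bodies below are placeholders echoing the comment, not the real `index_storage.rs` code.

```rust
// Placeholder types for illustration only.
struct AsyncIndexMetadata {
    num_points: u32,
}
struct DiskGraphOnly;

trait SaveWith<Arg> {
    fn save_with(&self, arg: Arg) -> Result<(), String>;
}

struct IndexLevelSave;
// The index-level API takes the metadata directly: SaveWith<AsyncIndexMetadata>.
impl SaveWith<AsyncIndexMetadata> for IndexLevelSave {
    fn save_with(&self, meta: AsyncIndexMetadata) -> Result<(), String> {
        let _ = meta.num_points;
        Ok(())
    }
}

struct ProviderLevelSave;
// Only the provider-level save pairs the payload with a u32 internally.
impl SaveWith<(u32, DiskGraphOnly)> for ProviderLevelSave {
    fn save_with(&self, (start, _graph): (u32, DiskGraphOnly)) -> Result<(), String> {
        let _ = start;
        Ok(())
    }
}
```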

Comment on lines +203 to +213
```text
DiskProvider<Data> (implements DataProvider)
├── type InternalId = u32
├── type ExternalId = u32
├── type Context = DefaultContext
├── pq_data: PQData // PQ codebook + compressed vectors (in memory)
├── config: Config // graph parameters
├── starting_points: Vec<u32> // entry points for search
└── search_io_limit: usize // max parallel IO ops
```

Copilot AI Apr 3, 2026


This DiskProvider<Data> field summary is inaccurate: the implementation does not have config or starting_points fields. Instead, it stores things like graph_header, distance_comparer, pq_data, num_points, metric, and search_io_limit, and the start vertex is derived from graph_header.metadata().medoid. Please adjust this struct overview to reflect the real fields so readers can reconcile it with diskann-disk/src/search/provider/disk_provider.rs.
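A struct-level sketch of the corrected field list, with the entry point derived from the header rather than stored separately. The field types here (`GraphHeader`, `String` for the metric, `Vec<u8>` for PQ data) are placeholders, not the real definitions in `disk_provider.rs`.

```rust
// Hypothetical field layout following the review comment.
struct GraphMetadata {
    medoid: u32,
}
struct GraphHeader {
    metadata: GraphMetadata,
}
impl GraphHeader {
    fn metadata(&self) -> &GraphMetadata {
        &self.metadata
    }
}

struct DiskProviderSketch {
    graph_header: GraphHeader, // on-disk layout info, including the medoid
    pq_data: Vec<u8>,          // PQ codebook + compressed vectors (in memory)
    num_points: usize,
    metric: String,            // placeholder for the crate's metric type
    search_io_limit: usize,    // max parallel IO ops
}

impl DiskProviderSketch {
    // There is no separate `starting_points` field; the entry point is
    // derived from the header's medoid.
    fn start_vertex(&self) -> u32 {
        self.graph_header.metadata().medoid
    }
}
```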

Comment on lines +252 to +265

```text
│ │ │ 1. vertex_provider reads disk sector → gets PQ-compressed vector
│ │ │ + neighbor list
│ │ │ 2. DiskQueryComputer computes approximate distance using PQ
│ │ │ lookup table
│ │ │ 3. callback(distance, id) feeds into best-first queue
│ │ │
│ │ └── IO is batched (beam_width sectors read in parallel)
│ │
│ ├── accessor.neighbors(id) → neighbor IDs
│ │ extracted from same disk sector that was read for the vector
│ │ (this is why disk layout interleaves vectors + adjacency lists)
│ │
│ └── ExpandBeam trait (implemented by DiskAccessor)
│ manages beam search expansion with IO batching
```

Copilot AI Apr 3, 2026


The search data-flow description currently suggests distances_unordered triggers disk reads via vertex_provider for each candidate. In the current implementation, DiskAccessor::get_element and distances_unordered operate on in-memory PQ-compressed vectors (pq_data.get_compressed_vector / compute_pq_distance), while disk IO happens when loading sectors to access adjacency lists / full vectors (via ensure_vertex_loaded in expand_beam and in post-processing rerank). Clarifying where IO actually occurs will prevent readers from misunderstanding the performance model.

Suggested change

```diff
-│ │ │ 1. vertex_provider reads disk sector → gets PQ-compressed vector
-│ │ │ + neighbor list
-│ │ │ 2. DiskQueryComputer computes approximate distance using PQ
-│ │ │ lookup table
-│ │ │ 3. callback(distance, id) feeds into best-first queue
-│ │ │
-│ │ └── IO is batched (beam_width sectors read in parallel)
-│ │
-│ ├── accessor.neighbors(id) → neighbor IDs
-│ │ extracted from same disk sector that was read for the vector
-│ │ (this is why disk layout interleaves vectors + adjacency lists)
-│ │
-│ └── ExpandBeam trait (implemented by DiskAccessor)
-│ manages beam search expansion with IO batching
+│ │ │ 1. provider/PQ data supplies the in-memory PQ-compressed vector
+│ │ │ for the candidate ID
+│ │ │ 2. DiskQueryComputer computes approximate distance using the PQ
+│ │ │ lookup table
+│ │ │ 3. callback(distance, id) feeds into best-first queue
+│ │ │
+│ │ └── No per-candidate disk read occurs here; this step is PQ scoring
+│ │ over in-memory compressed data
+│ │
+│ ├── accessor.neighbors(id) → neighbor IDs
+│ │ available after the corresponding sector has been loaded during
+│ │ beam expansion; disk IO is triggered by ensuring vertices/sectors
+│ │ are loaded, not by `distances_unordered`
+│ │
+│ └── ExpandBeam trait (implemented by DiskAccessor)
+│ manages beam search expansion with IO batching
+│ (beam_width sectors may be read in parallel), which is where
+│ adjacency lists / full on-disk records are fetched
```
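An illustrative-only model of the performance point in this suggestion: candidate scoring touches only in-memory PQ codes, while "disk IO" happens solely where sectors are loaded for adjacency lists. Every name here (`PqData`, `ensure_vertex_loaded`, the toy distance) is a stand-in, not the crate's code.

```rust
struct PqData {
    compressed: Vec<Vec<u8>>, // one PQ code sequence per vector, held in memory
}

impl PqData {
    // Stand-in for pq_data.get_compressed_vector: a plain memory lookup, no IO.
    fn get_compressed_vector(&self, id: u32) -> &[u8] {
        &self.compressed[id as usize]
    }
}

// Stand-in for the PQ lookup-table distance (here just a table sum).
fn compute_pq_distance(lut: &[u32], code: &[u8]) -> u32 {
    code.iter().map(|&c| lut[c as usize]).sum()
}

// Stand-in for ensure_vertex_loaded: the only place a "sector read" occurs.
fn ensure_vertex_loaded(id: u32, io_counter: &mut usize) -> Vec<u32> {
    *io_counter += 1; // one disk IO per sector load
    vec![(id + 1) % 4] // toy adjacency list decoded from the loaded sector
}

// PQ scoring over candidates: note that no IO counter is involved here.
fn score_candidates(pq: &PqData, lut: &[u32], ids: &[u32]) -> Vec<(u32, u32)> {
    ids.iter()
        .map(|&id| (id, compute_pq_distance(lut, pq.get_compressed_vector(id))))
        .collect()
}
```

Separating the two paths like this is exactly why the beam width bounds IO cost while PQ scoring stays memory-bound.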

Comment on lines +291 to +299

```text
├── impl BuildQueryComputer<&[VectorDataType]>
│ ├── build_query_computer() → DiskQueryComputer
│ │ Preprocesses query against PQ centroids (quantizer_preprocess)
│ │ producing a lookup table for fast approximate distance
│ │
│ └── distances_unordered(ids, computer, callback)
│ Batch-reads sectors from disk, computes PQ distances
├── impl NeighborAccessor
```

Copilot AI Apr 3, 2026


This section says build_query_computer performs quantizer_preprocess, but in DiskAccessor the PQ preprocessing happens in DiskAccessor::new (before search), and build_query_computer just clones the already-populated aligned_pqtable_dist_scratch into a DiskQueryComputer. Updating the description would better match the actual initialization pipeline.
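A sketch of the initialization order this comment describes: the PQ distance table is built once in the accessor constructor, and `build_query_computer` merely clones the precomputed scratch. The names mirror the comment; the arithmetic and types are illustrative stand-ins.

```rust
#[derive(Clone)]
struct DiskQueryComputerSketch {
    pq_dist_table: Vec<f32>,
}

struct DiskAccessorSketch {
    aligned_pqtable_dist_scratch: Vec<f32>,
}

impl DiskAccessorSketch {
    // Preprocessing happens here, before any search is issued.
    fn new(query: &[f32], centroids: &[f32]) -> Self {
        // Toy "quantizer preprocess": per-centroid squared differences.
        let table: Vec<f32> = query
            .iter()
            .zip(centroids)
            .map(|(q, c)| (q - c) * (q - c))
            .collect();
        Self { aligned_pqtable_dist_scratch: table }
    }

    // Cheap per-query step: just clone the already-populated table.
    fn build_query_computer(&self) -> DiskQueryComputerSketch {
        DiskQueryComputerSketch {
            pq_dist_table: self.aligned_pqtable_dist_scratch.clone(),
        }
    }
}
```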

Comment on lines +367 to +377
| Aspect | Build | Search |
|--------|-------|--------|
| **DataProvider** | `FullPrecisionProvider<T>` or `DefaultProvider<NoStore, Q>` | `DiskProvider<Data>` |
| **Vector Storage** | In-memory (`Vec<T>` or quantized store) | Disk sectors (read via `VertexProvider`) |
| **Adjacency Lists** | In-memory graph (`Vec<Vec<u32>>`) | Interleaved in disk sectors |
| **Distance** | Exact (full precision or SQ) | Approximate (PQ lookup table) |
| **Strategy** | `InsertStrategy` + `PruneStrategy` | `SearchStrategy` only |
| **Accessor** | `inmem::Accessor` variants | `DiskAccessor<Data, VP>` |
| **IO** | Memory reads | Batched aligned disk reads (beam_width) |
| **Key Trait** | `InmemIndexBuilder<T>` | `VertexProviderFactory` + `VertexProvider` |


Copilot AI Apr 3, 2026


The markdown table under "Build vs Search Comparison" uses || at the start of each row, which creates an empty leading column and typically renders misaligned on GitHub. Switching to standard table syntax (single leading/trailing |) will make the comparison table render as intended.
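For reference, a minimal before/after of the syntax issue being flagged (the table contents here are abbreviated placeholders):

```markdown
<!-- Double leading pipe: creates a spurious empty first column -->
|| Aspect | Build |
||--------|-------|
|| Distance | Exact |

<!-- Standard syntax: renders as intended -->
| Aspect | Build |
|--------|-------|
| Distance | Exact |
```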

@codecov-commenter

codecov-commenter commented Apr 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.32%. Comparing base (0ced23d) to head (5cdf51c).

Additional details and impacted files


```text
@@           Coverage Diff           @@
##             main     #898   +/-   ##
=======================================
  Coverage   89.31%   89.32%           
=======================================
  Files         445      445           
  Lines       84095    84095           
=======================================
+ Hits        75113    75116    +3     
+ Misses       8982     8979    -3     
```
| Flag | Coverage | Δ |
|------|----------|---|
| miri | 89.32% <ø> | (+<0.01%) ⬆️ |
| unittests | 89.16% <ø> | (+<0.01%) ⬆️ |

Flags with carried forward coverage won't be shown.
see 1 file with indirect coverage changes


```text
├── build_quantizer: BuildQuantizer // decides FP vs quantized in-mem build
└── core: DiskIndexBuilderCore
├── index_writer: DiskIndexWriter // writes final disk layout
├── storage_provider: &StorageProvider
```
Contributor
It's important to distinguish between the StorageProvider used for build and the one used for search; they are different. The build one just uses regular std::fs functions, while the search one is more performance-centric.
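A hypothetical sketch of that contrast: a build-side reader that does plain seek-and-read at any offset, versus a search-side reader that insists on sector-aligned offsets the way an aligned IO layer would. The trait, structs, and `SECTOR` constant are illustrative, not the crate's actual storage API.

```rust
use std::io::{self, Read, Seek, SeekFrom};

const SECTOR: u64 = 4096; // assumed sector size for the sketch

trait ReadAt {
    fn read_at(&mut self, offset: u64, buf: &mut [u8]) -> io::Result<usize>;
}

// Build path: ordinary std::io seek + read; any offset is fine.
struct BuildStorage<R> {
    inner: R,
}
impl<R: Read + Seek> ReadAt for BuildStorage<R> {
    fn read_at(&mut self, offset: u64, buf: &mut [u8]) -> io::Result<usize> {
        self.inner.seek(SeekFrom::Start(offset))?;
        self.inner.read(buf)
    }
}

// Search path: rejects unaligned offsets, as an aligned reader would.
struct SearchStorage<R> {
    inner: R,
}
impl<R: Read + Seek> ReadAt for SearchStorage<R> {
    fn read_at(&mut self, offset: u64, buf: &mut [u8]) -> io::Result<usize> {
        if offset % SECTOR != 0 {
            return Err(io::Error::new(io::ErrorKind::InvalidInput, "unaligned read"));
        }
        self.inner.seek(SeekFrom::Start(offset))?;
        self.inner.read(buf)
    }
}
```

The build side favors simplicity (sequential, one-off writes/reads), while the search side pays the alignment constraint to unlock batched, OS-bypassing reads on the hot path.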

```text
(vectors + neighbor lists interleaved for locality)
```

The `DiskIndexWriter` (in `diskann-disk/src/storage`) handles the final disk layout where each sector contains a vertex's vector data adjacent to its neighbor list for cache-friendly disk reads.
Contributor

It’s worth emphasizing that DiskIndexWriter is purely a data‑transformation component: it reads data from files in one format and writes it out in another.

```text
│ Creates per-search VertexProvider instances
└── DiskVertexProviderFactory<Data, ReaderFactory>
├── ReaderFactory: AlignedReaderFactory (creates aligned IO readers)
```
Contributor

CLI tools use the DiskVertexProviderFactory, which returns an AlignedFileReader (e.g., WindowsAlignedFileReader, Linux*, vfs).
However, some clients provide their own VertexProviderFactory implementations that rely on different storage‑layer abstractions.
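A sketch of the factory seam this comment points at: a factory trait that hands out a fresh provider per search, with a memory-backed client implementation standing in for the aligned-file one. The trait shapes and the in-memory implementation are hypothetical, for illustration only.

```rust
trait VertexProvider {
    fn load(&mut self, id: u32) -> Vec<u8>;
}

trait VertexProviderFactory {
    type Provider: VertexProvider;
    // Called per search to obtain a fresh provider instance.
    fn create(&self) -> Self::Provider;
}

// A client-supplied provider backed by memory instead of aligned file IO.
struct InMemoryVertexProvider {
    data: Vec<Vec<u8>>,
}

impl VertexProvider for InMemoryVertexProvider {
    fn load(&mut self, id: u32) -> Vec<u8> {
        self.data[id as usize].clone()
    }
}

struct InMemoryFactory {
    data: Vec<Vec<u8>>,
}

impl VertexProviderFactory for InMemoryFactory {
    type Provider = InMemoryVertexProvider;
    fn create(&self) -> Self::Provider {
        InMemoryVertexProvider { data: self.data.clone() }
    }
}
```

Keeping the factory behind a trait is what lets CLI tools and custom client storage layers coexist without the search code changing.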
