Skip to content

Extract diskann-storage crate from diskann-providers#873

Closed
varat73 wants to merge 1 commit intomainfrom
extract-diskann-storage
Closed

Extract diskann-storage crate from diskann-providers#873
varat73 wants to merge 1 commit intomainfrom
extract-diskann-storage

Conversation

@varat73
Copy link
Copy Markdown
Contributor

@varat73 varat73 commented Apr 2, 2026

Summary

Extract a new diskann-storage Tier 2 crate from diskann-providers as the first phase of decomposing the providers god crate (37K lines → ~15K target).

What moved to diskann-storage

Item Description
StorageReadProvider / StorageWriteProvider Core storage I/O traits
FileStorageProvider Filesystem-backed implementation
VirtualStorageProvider In-memory/overlay VFS (feature-gated behind virtual_storage)
DynWriteProvider / WriteProviderWrapper / WriteSeek Dynamic dispatch adapters
path_utility Canonical file-path naming for index artifacts
DatasetDto Aligned vector dataset transfer type
proto_storage Generic protobuf load/save via prost

Key properties

  • Zero diskann-* dependencies — only prost + thiserror. Clean Tier 2 crate.
  • 38 tests in diskann-storage covering all moved functionality.
  • Backward compatiblediskann-providers re-exports all moved types; existing crate::storage::StorageReadProvider paths continue to work.
  • Consumer crates migrateddiskann-disk, diskann-tools, diskann-benchmark updated to import directly from diskann-storage.

Verification

  • cargo fmt --all --check
  • cargo clippy --workspace --all-targets -- -D warnings
  • cargo test -p diskann-storage — 38 passed, 0 failed
  • cargo test -p diskann-providers --profile ci — 520 passed, 0 failed
  • cargo check --workspace

@varat73 varat73 requested review from a team and Copilot April 2, 2026 03:36
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR extracts storage abstractions and implementations from diskann-providers into a new Tier-2 crate, diskann-storage, and migrates downstream crates to use the new crate directly while keeping diskann-providers::storage::* as a compatibility re-export.

Changes:

  • Added new diskann-storage crate containing storage traits, filesystem and virtual storage providers, path naming helpers, dataset DTO, and protobuf load/save utilities.
  • Updated diskann-providers to re-export moved storage APIs from diskann-storage and to re-export DatasetDto from diskann-storage.
  • Migrated consumer crates (diskann-tools, diskann-disk, diskann-benchmark) to import storage traits/providers and path helpers from diskann-storage.

Reviewed changes

Copilot reviewed 54 out of 61 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
diskann-tools/src/utils/relative_contrast.rs Switch storage provider imports to diskann-storage.
diskann-tools/src/utils/random_data_generator.rs Test-only migration to diskann-storage virtual provider; main trait import still via providers re-export.
diskann-tools/src/utils/ground_truth.rs Switch StorageReadProvider/StorageWriteProvider imports to diskann-storage.
diskann-tools/src/utils/generate_synthetic_labels_utils.rs Switch DynWriteProvider + test storage providers to diskann-storage.
diskann-tools/src/utils/gen_associated_data_from_range.rs Switch StorageWriteProvider + test providers to diskann-storage.
diskann-tools/src/utils/build_pq.rs Move path utility + storage provider imports to diskann-storage.
diskann-tools/src/utils/build_disk_index.rs Switch storage provider imports (incl. tests) to diskann-storage.
diskann-tools/src/bin/subsample_bin.rs Switch FileStorageProvider/StorageWriteProvider imports to diskann-storage.
diskann-tools/src/bin/relative_contrast.rs Switch FileStorageProvider import to diskann-storage.
diskann-tools/src/bin/random_data_generator.rs Switch FileStorageProvider import to diskann-storage.
diskann-tools/src/bin/generate_pq.rs Switch FileStorageProvider import to diskann-storage.
diskann-tools/src/bin/generate_minmax.rs Switch StorageReadProvider/FileStorageProvider usage to diskann-storage.
diskann-tools/src/bin/gen_associated_data_from_range.rs Switch FileStorageProvider import to diskann-storage.
diskann-tools/Cargo.toml Add diskann-storage dependency (currently enabling virtual_storage in normal deps).
diskann-storage/src/virtual_storage_provider.rs Add/expand VirtualStorageProvider implementation + tests.
diskann-storage/src/storage_provider.rs New core storage traits + dyn adapters + wrapper + tests.
diskann-storage/src/proto_storage.rs New generic prost-backed load_proto/save_proto helpers + tests.
diskann-storage/src/path_utility.rs New canonical artifact path helpers + tests.
diskann-storage/src/lib.rs New crate root wiring and re-exports.
diskann-storage/src/file_storage_provider.rs Filesystem-backed provider documentation + expanded tests.
diskann-storage/src/dataset_dto.rs New DatasetDto type + tests (moved from providers utils).
diskann-storage/Cargo.toml New crate manifest (edition 2024) with virtual_storage feature.
diskann-providers/src/utils/utils.rs Remove in-crate DatasetDto definition (moved).
diskann-providers/src/utils/mod.rs Re-export DatasetDto from diskann-storage.
diskann-providers/src/storage/storage_provider.rs Remove in-crate storage traits/dyn adapters (moved).
diskann-providers/src/storage/path_utility.rs Remove in-crate path utilities (moved).
diskann-providers/src/storage/mod.rs Re-export diskann-storage storage APIs for backward compatibility.
diskann-providers/Cargo.toml Add dependency on diskann-storage (currently enabling virtual_storage unconditionally).
diskann-providers/benches/benchmarks/copy_aligned_data_bench.rs Cleanup after storage extraction (still uses providers storage module).
diskann-providers/benches/benchmarks_iai/copy_aligned_data_bench_iai.rs Cleanup after storage extraction (still uses providers storage module).
diskann-disk/src/utils/partition.rs Switch storage trait imports (and tests) to diskann-storage.
diskann-disk/src/utils/aligned_file_reader/virtual_aligned_reader_factory.rs Switch VirtualStorageProvider import to diskann-storage.
diskann-disk/src/utils/aligned_file_reader/storage_provider_aligned_file_reader.rs Switch StorageReadProvider and test provider imports to diskann-storage.
diskann-disk/src/utils/aligned_file_reader/aligned_file_reader_factory.rs Switch FileStorageProvider import (macOS/miri path) to diskann-storage.
diskann-disk/src/storage/quant/pq/pq_generation.rs Switch storage traits in main + tests to diskann-storage.
diskann-disk/src/storage/quant/generator.rs Switch storage traits and tests to diskann-storage.
diskann-disk/src/storage/disk_index_writer.rs Switch storage traits + path utilities to diskann-storage.
diskann-disk/src/storage/disk_index_reader.rs Switch StorageReadProvider + tests to diskann-storage.
diskann-disk/src/storage/cached_writer.rs Switch StorageWriteProvider + tests to diskann-storage.
diskann-disk/src/storage/cached_reader.rs Switch StorageReadProvider + tests to diskann-storage.
diskann-disk/src/search/provider/disk_vertex_provider.rs Switch storage traits + get_disk_index_file to diskann-storage.
diskann-disk/src/search/provider/disk_provider.rs Switch storage trait + path helper imports to diskann-storage.
diskann-disk/src/build/chunking/checkpoint/checkpoint_record_manager_with_file.rs Switch FileStorageProvider import to diskann-storage.
diskann-disk/src/build/builder/quantizer.rs Switch storage traits to diskann-storage.
diskann-disk/src/build/builder/inmem_builder.rs Switch dyn write wrapper types to diskann-storage.
diskann-disk/src/build/builder/core.rs Switch storage traits and path helpers to diskann-storage.
diskann-disk/src/build/builder/build.rs Switch storage traits and tests to diskann-storage.
diskann-disk/Cargo.toml Add diskann-storage dependency.
diskann-benchmark/src/utils/datafiles.rs Import StorageReadProvider from diskann-storage (still uses providers FileStorageProvider).
diskann-benchmark/src/inputs/save_and_load.rs Switch StorageReadProvider import to diskann-storage.
diskann-benchmark/src/backend/disk_index/build.rs Switch storage traits to diskann-storage.
diskann-benchmark/src/backend/disk_index/benchmarks.rs Switch FileStorageProvider import to diskann-storage.
diskann-benchmark/Cargo.toml Add diskann-storage dependency.
Cargo.toml Add diskann-storage to workspace members and workspace dependencies.
Cargo.lock Record new crate and dependency graph updates.
Comments suppressed due to low confidence (1)

diskann-storage/src/virtual_storage_provider.rs:292

  • The new trait_based_read_provider test uses &dyn StorageReadProvider<Reader = impl Read + Seek>, which is not a valid/idiomatic way to express “any reader” and may fail to compile due to impl Trait usage inside an associated-type equality. Prefer a generic helper like fn read_via_trait<P: StorageReadProvider>(p: &P) where P::Reader: Read + Seek { ... } (or specify the concrete reader type explicitly) to keep the test compiling on stable Rust.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@varat73
Copy link
Copy Markdown
Contributor Author

varat73 commented Apr 2, 2026

Regarding the review comments:

build_pq.rs \p_val\ (comment 1): This is a preexisting bug — our PR only changed imports in this file, not logic. Filed as a separate issue to avoid scope creep.

random_data_generator.rs (comment 2): Fixed in d6128ea — split the import so \StorageWriteProvider\ comes from \diskann_storage.

diskann-tools/Cargo.toml \�irtual_storage\ (comment 3): Fixed in d6128ea — moved \�irtual_storage\ feature to dev-dependencies only.

diskann-providers/Cargo.toml \�irtual_storage\ (comment 4): Fixed in d6128ea — \diskann-storage\ is now without features in normal deps; \�irtual_storage\ feature forwards to \diskann-storage/virtual_storage.

datafiles.rs \FileStorageProvider\ (comment 5): Fixed in d6128ea — migrated to \diskann_storage::FileStorageProvider.

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 2, 2026

Codecov Report

❌ Patch coverage is 96.09610% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.35%. Comparing base (0ced23d) to head (e3553e7).

Files with missing lines Patch % Lines
diskann-storage/src/storage_provider.rs 80.39% 10 Missing ⚠️
diskann-storage/src/proto_storage.rs 97.64% 2 Missing ⚠️
diskann-tools/src/bin/generate_minmax.rs 0.00% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #873      +/-   ##
==========================================
+ Coverage   89.31%   89.35%   +0.03%     
==========================================
  Files         445      447       +2     
  Lines       84095    84343     +248     
==========================================
+ Hits        75113    75363     +250     
+ Misses       8982     8980       -2     
Flag Coverage Δ
miri 89.35% <96.09%> (+0.03%) ⬆️
unittests 89.19% <96.09%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
diskann-benchmark/src/inputs/disk.rs 4.47% <ø> (ø)
diskann-benchmark/src/inputs/save_and_load.rs 86.66% <ø> (ø)
diskann-benchmark/src/main.rs 91.12% <ø> (ø)
diskann-benchmark/src/utils/datafiles.rs 83.33% <100.00%> (-0.25%) ⬇️
diskann-disk/src/build/builder/build.rs 94.16% <ø> (ø)
diskann-disk/src/build/builder/core.rs 95.25% <ø> (ø)
diskann-disk/src/build/builder/inmem_builder.rs 86.93% <ø> (ø)
diskann-disk/src/build/builder/quantizer.rs 90.52% <ø> (ø)
.../checkpoint/checkpoint_record_manager_with_file.rs 94.39% <ø> (ø)
diskann-disk/src/search/provider/disk_provider.rs 91.02% <ø> (ø)
... and 32 more

... and 4 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Create new Tier 2 crate diskann-storage with zero diskann-* dependencies,
containing:
- StorageReadProvider / StorageWriteProvider traits
- FileStorageProvider (filesystem-backed)
- VirtualStorageProvider (in-memory/overlay VFS, feature-gated)
- DynWriteProvider / WriteProviderWrapper / WriteSeek adapters
- Path utility functions for index artifact naming
- DatasetDto transfer type
- Generic protobuf load/save via prost

diskann-providers now depends on diskann-storage and re-exports all moved
types to preserve backward compatibility. The virtual_storage feature is
properly wired through diskann-providers' own virtual_storage feature.

Consumer crates (diskann-disk, diskann-tools, diskann-benchmark) updated
to import storage types directly from diskann-storage. The virtual_storage
feature is only in dev-dependencies for these crates.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@varat73 varat73 force-pushed the extract-diskann-storage branch from d6128ea to e3553e7 Compare April 2, 2026 15:37
Copy link
Copy Markdown
Contributor

@hildebrandmw hildebrandmw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @varat73, mechanically this seems fine, but I have some mechanics feedback and some large architectural concerns.

  • diskann-providers currently doesn't compile its tests because the virtual_storage feature is not enabled during tests. Our CI didn't catch this due to feature unification. I'm working on a fix.
  • I'm not convinced that the reexports are a good idea. Until we're happier with an overall design, reexports are going to make import locations more confusing rather than less.

Now, for my architectural concerns: I don't think it's a good idea to split out this functionality as-is just for the sake of making diskann-providers smaller because moving these traits into a lower-level crate impl…icitly blesses them when I believe we should be trying to get rid of these. Here is why I think this:

Storage provider traits

These traits are problematic for several reasons.
First, the use of GATs of the trait for the readers/writers in essence forces monomorphization of a lot of code, which I'm not convinced is necessary.
There's an argument that specialization on the returned Reader could be important, but I believe we can work around this by doing something similar to the read/write provider wrappers of having a Box<dyn ReadSeek> …as an inner type.
To prevent dynamic dispatch for small reads, the dynamic inner writer can be wrapped in a BufReader directly, sending most read/write calls to the buffer and only reading from the backing store in large granularities….

Next, the use of std::io::Result discards a lot of valuable information, and by using it, the storage provider traits do not provide better diagnostics (e.g., which file did we fail to open), and by using it, the Vi…rtualStorageProvider is forced to use std::io::Error::other for everything.

Finally, the abstraction of the filesystem is incomplete.
The FileStorageProvider uses std::fs for its file system implementation, but the file identifiers are &str instead of the incredibly useful std::path::Path types.

Path Utilities

These path utilities are intrinsically tied to the implementation in diskann-providers and we're actively working on revising our data storage story.
Moving these into a separate crate is leaning into an architectural anti-pattern of specific name mangling that we should be deprecating.

DatasetDto

This is a struct that should be deprecated. There are much better alternatives, such as diskann_utils::views::MutMatrixView that serve the same purpose with much better ergonomics and usability.

proto_storage

Architecturally, I think this shows the issues that the StorageReadProvider abstraction can cause.
It encourages readers to be passed around as a StorageReadProvider/pathname instead of a single Reader.
Code using these pairs of arguments is then forced to both deal with any issues related to opening the reader as well issues that can occur during the deserialization process.
This also inhibits the ability to read multiple structs from a single Reader by forcing code to be file-centric rather than reader centric.

Wrapping Up

I know this probably seems like a strong reaction, and I sympathize with the desire to break up diskann-providers, I really do.
But in the process of doing so, I strongly believe that we should do so in a way that pushes us towards a cleaner overall architecture.
I am more than happy to discuss ideas on how we might tackle a split out in a more structured way, but as is I do not think this approach is strictly better than constraining these abstractions to the diskann-provider…s crate.

@varat73
Copy link
Copy Markdown
Contributor Author

varat73 commented Apr 2, 2026

Withdrawing this PR based on the architectural feedback. The storage traits have known design issues (GAT-forced monomorphization, std::io::Result losing context, &str vs Path, incomplete abstraction) and extracting them as-is into a Tier 2 crate would bless the current design rather than improve it.

Will focus on:

  1. Phase 2 (downward migrations of AlignedBoxWithSlice, ThreadPool, Timer, Random) — uncontroversial moves of general-purpose primitives
  2. Phase 3 (diskann-pq extraction) — algorithm-focused split with no storage trait dependency

Storage trait redesign will be tackled collaboratively before any extraction.

@varat73 varat73 closed this Apr 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants