Extract diskann-storage crate from diskann-providers#873
Extract diskann-storage crate from diskann-providers#873
Conversation
There was a problem hiding this comment.
Pull request overview
This PR extracts storage abstractions and implementations from diskann-providers into a new Tier-2 crate, diskann-storage, and migrates downstream crates to use the new crate directly while keeping diskann-providers::storage::* as a compatibility re-export.
Changes:
- Added new
diskann-storagecrate containing storage traits, filesystem and virtual storage providers, path naming helpers, dataset DTO, and protobuf load/save utilities. - Updated
diskann-providersto re-export moved storage APIs fromdiskann-storageand to re-exportDatasetDtofromdiskann-storage. - Migrated consumer crates (
diskann-tools,diskann-disk,diskann-benchmark) to import storage traits/providers and path helpers fromdiskann-storage.
Reviewed changes
Copilot reviewed 54 out of 61 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| diskann-tools/src/utils/relative_contrast.rs | Switch storage provider imports to diskann-storage. |
| diskann-tools/src/utils/random_data_generator.rs | Test-only migration to diskann-storage virtual provider; main trait import still via providers re-export. |
| diskann-tools/src/utils/ground_truth.rs | Switch StorageReadProvider/StorageWriteProvider imports to diskann-storage. |
| diskann-tools/src/utils/generate_synthetic_labels_utils.rs | Switch DynWriteProvider + test storage providers to diskann-storage. |
| diskann-tools/src/utils/gen_associated_data_from_range.rs | Switch StorageWriteProvider + test providers to diskann-storage. |
| diskann-tools/src/utils/build_pq.rs | Move path utility + storage provider imports to diskann-storage. |
| diskann-tools/src/utils/build_disk_index.rs | Switch storage provider imports (incl. tests) to diskann-storage. |
| diskann-tools/src/bin/subsample_bin.rs | Switch FileStorageProvider/StorageWriteProvider imports to diskann-storage. |
| diskann-tools/src/bin/relative_contrast.rs | Switch FileStorageProvider import to diskann-storage. |
| diskann-tools/src/bin/random_data_generator.rs | Switch FileStorageProvider import to diskann-storage. |
| diskann-tools/src/bin/generate_pq.rs | Switch FileStorageProvider import to diskann-storage. |
| diskann-tools/src/bin/generate_minmax.rs | Switch StorageReadProvider/FileStorageProvider usage to diskann-storage. |
| diskann-tools/src/bin/gen_associated_data_from_range.rs | Switch FileStorageProvider import to diskann-storage. |
| diskann-tools/Cargo.toml | Add diskann-storage dependency (currently enabling virtual_storage in normal deps). |
| diskann-storage/src/virtual_storage_provider.rs | Add/expand VirtualStorageProvider implementation + tests. |
| diskann-storage/src/storage_provider.rs | New core storage traits + dyn adapters + wrapper + tests. |
| diskann-storage/src/proto_storage.rs | New generic prost-backed load_proto/save_proto helpers + tests. |
| diskann-storage/src/path_utility.rs | New canonical artifact path helpers + tests. |
| diskann-storage/src/lib.rs | New crate root wiring and re-exports. |
| diskann-storage/src/file_storage_provider.rs | Filesystem-backed provider documentation + expanded tests. |
| diskann-storage/src/dataset_dto.rs | New DatasetDto type + tests (moved from providers utils). |
| diskann-storage/Cargo.toml | New crate manifest (edition 2024) with virtual_storage feature. |
| diskann-providers/src/utils/utils.rs | Remove in-crate DatasetDto definition (moved). |
| diskann-providers/src/utils/mod.rs | Re-export DatasetDto from diskann-storage. |
| diskann-providers/src/storage/storage_provider.rs | Remove in-crate storage traits/dyn adapters (moved). |
| diskann-providers/src/storage/path_utility.rs | Remove in-crate path utilities (moved). |
| diskann-providers/src/storage/mod.rs | Re-export diskann-storage storage APIs for backward compatibility. |
| diskann-providers/Cargo.toml | Add dependency on diskann-storage (currently enabling virtual_storage unconditionally). |
| diskann-providers/benches/benchmarks/copy_aligned_data_bench.rs | Cleanup after storage extraction (still uses providers storage module). |
| diskann-providers/benches/benchmarks_iai/copy_aligned_data_bench_iai.rs | Cleanup after storage extraction (still uses providers storage module). |
| diskann-disk/src/utils/partition.rs | Switch storage trait imports (and tests) to diskann-storage. |
| diskann-disk/src/utils/aligned_file_reader/virtual_aligned_reader_factory.rs | Switch VirtualStorageProvider import to diskann-storage. |
| diskann-disk/src/utils/aligned_file_reader/storage_provider_aligned_file_reader.rs | Switch StorageReadProvider and test provider imports to diskann-storage. |
| diskann-disk/src/utils/aligned_file_reader/aligned_file_reader_factory.rs | Switch FileStorageProvider import (macOS/miri path) to diskann-storage. |
| diskann-disk/src/storage/quant/pq/pq_generation.rs | Switch storage traits in main + tests to diskann-storage. |
| diskann-disk/src/storage/quant/generator.rs | Switch storage traits and tests to diskann-storage. |
| diskann-disk/src/storage/disk_index_writer.rs | Switch storage traits + path utilities to diskann-storage. |
| diskann-disk/src/storage/disk_index_reader.rs | Switch StorageReadProvider + tests to diskann-storage. |
| diskann-disk/src/storage/cached_writer.rs | Switch StorageWriteProvider + tests to diskann-storage. |
| diskann-disk/src/storage/cached_reader.rs | Switch StorageReadProvider + tests to diskann-storage. |
| diskann-disk/src/search/provider/disk_vertex_provider.rs | Switch storage traits + get_disk_index_file to diskann-storage. |
| diskann-disk/src/search/provider/disk_provider.rs | Switch storage trait + path helper imports to diskann-storage. |
| diskann-disk/src/build/chunking/checkpoint/checkpoint_record_manager_with_file.rs | Switch FileStorageProvider import to diskann-storage. |
| diskann-disk/src/build/builder/quantizer.rs | Switch storage traits to diskann-storage. |
| diskann-disk/src/build/builder/inmem_builder.rs | Switch dyn write wrapper types to diskann-storage. |
| diskann-disk/src/build/builder/core.rs | Switch storage traits and path helpers to diskann-storage. |
| diskann-disk/src/build/builder/build.rs | Switch storage traits and tests to diskann-storage. |
| diskann-disk/Cargo.toml | Add diskann-storage dependency. |
| diskann-benchmark/src/utils/datafiles.rs | Import StorageReadProvider from diskann-storage (still uses providers FileStorageProvider). |
| diskann-benchmark/src/inputs/save_and_load.rs | Switch StorageReadProvider import to diskann-storage. |
| diskann-benchmark/src/backend/disk_index/build.rs | Switch storage traits to diskann-storage. |
| diskann-benchmark/src/backend/disk_index/benchmarks.rs | Switch FileStorageProvider import to diskann-storage. |
| diskann-benchmark/Cargo.toml | Add diskann-storage dependency. |
| Cargo.toml | Add diskann-storage to workspace members and workspace dependencies. |
| Cargo.lock | Record new crate and dependency graph updates. |
Comments suppressed due to low confidence (1)
diskann-storage/src/virtual_storage_provider.rs:292
- The new
trait_based_read_providertest uses&dyn StorageReadProvider<Reader = impl Read + Seek>, which is not a valid/idiomatic way to express “any reader” and may fail to compile due toimpl Traitusage inside an associated-type equality. Prefer a generic helper likefn read_via_trait<P: StorageReadProvider>(p: &P) where P::Reader: Read + Seek { ... }(or specify the concrete reader type explicitly) to keep the test compiling on stable Rust.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Regarding the review comments: build_pq.rs \p_val\ (comment 1): This is a preexisting bug — our PR only changed imports in this file, not logic. Filed as a separate issue to avoid scope creep. random_data_generator.rs (comment 2): Fixed in d6128ea — split the import so \StorageWriteProvider\ comes from \diskann_storage. diskann-tools/Cargo.toml \�irtual_storage\ (comment 3): Fixed in d6128ea — moved \�irtual_storage\ feature to dev-dependencies only. diskann-providers/Cargo.toml \�irtual_storage\ (comment 4): Fixed in d6128ea — \diskann-storage\ is now without features in normal deps; \�irtual_storage\ feature forwards to \diskann-storage/virtual_storage. datafiles.rs \FileStorageProvider\ (comment 5): Fixed in d6128ea — migrated to \diskann_storage::FileStorageProvider. |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #873 +/- ##
==========================================
+ Coverage 89.31% 89.35% +0.03%
==========================================
Files 445 447 +2
Lines 84095 84343 +248
==========================================
+ Hits 75113 75363 +250
+ Misses 8982 8980 -2
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
Create new Tier 2 crate diskann-storage with zero diskann-* dependencies, containing: - StorageReadProvider / StorageWriteProvider traits - FileStorageProvider (filesystem-backed) - VirtualStorageProvider (in-memory/overlay VFS, feature-gated) - DynWriteProvider / WriteProviderWrapper / WriteSeek adapters - Path utility functions for index artifact naming - DatasetDto transfer type - Generic protobuf load/save via prost diskann-providers now depends on diskann-storage and re-exports all moved types to preserve backward compatibility. The virtual_storage feature is properly wired through diskann-providers' own virtual_storage feature. Consumer crates (diskann-disk, diskann-tools, diskann-benchmark) updated to import storage types directly from diskann-storage. The virtual_storage feature is only in dev-dependencies for these crates. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
d6128ea to
e3553e7
Compare
hildebrandmw
left a comment
There was a problem hiding this comment.
Thanks @varat73, mechanically this seems fine, but I have some mechanics feedback and some large architectural concerns.
diskann-providerscurrently doesn't compile its tests because thevirtual_storagefeature is not enabled during tests. Our CI didn't catch this due to feature unification. I'm working on a fix.- I'm not convinced that the reexports are a good idea. Until we're happier with an overall design, reexports are going to make import locations more confusing rather than less.
Now, for my architectural concerns: I don't think it's a good idea to split out this functionality as-is just for the sake of making diskann-providers smaller because moving these traits into a lower-level crate impl…icitly blesses them when I believe we should be trying to get rid of these. Here is why I think this:
Storage provider traits
These traits are problematic for several reasons.
First, the use of GATs of the trait for the readers/writers in essence forces monomorphization of a lot of code, which I'm not convinced is necessary.
There's an argument that specialization on the returned Reader could be important, but I believe we can work around this by doing something similar to the read/write provider wrappers of having a Box<dyn ReadSeek> …as an inner type.
To prevent dynamic dispatch for small reads, the dynamic inner writer can be wrapped in a BufReader directly, sending most read/write calls to the buffer and only reading from the backing store in large granularities….
Next, the use of std::io::Result discards a lot of valuable information, and by using it, the storage provider traits do not provide better diagnostics (e.g., which file did we fail to open), and by using it, the Vi…rtualStorageProvider is forced to use std::io::Error::other for everything.
Finally, the abstraction of the filesystem is incomplete.
The FileStorageProvider uses std::fs for its file system implementation, but the file identifiers are &str instead of the incredibly useful std::path::Path types.
Path Utilities
These path utilities are intrinsically tied to the implementation in diskann-providers and we're actively working on revising our data storage story.
Moving these into a separate crate is leaning into an architectural anti-pattern of specific name mangling that we should be deprecating.
DatasetDto
This is a struct that should be deprecated. There are much better alternatives, such as diskann_utils::views::MutMatrixView that serve the same purpose with much better ergonomics and usability.
proto_storage
Architecturally, I think this shows the issues that the StorageReadProvider abstraction can cause.
It encourages readers to be passed around as a StorageReadProvider/pathname instead of a single Reader.
Code using these pairs of arguments is then forced to both deal with any issues related to opening the reader as well issues that can occur during the deserialization process.
This also inhibits the ability to read multiple structs from a single Reader by forcing code to be file-centric rather than reader centric.
Wrapping Up
I know this probably seems like a strong reaction, and I sympathize with the desire to break up diskann-providers, I really do.
But in the process of doing so, I strongly believe that we should do so in a way that pushes us towards a cleaner overall architecture.
I am more than happy to discuss ideas on how we might tackle a split out in a more structured way, but as is I do not think this approach is strictly better than constraining these abstractions to the diskann-provider…s crate.
|
Withdrawing this PR based on the architectural feedback. The storage traits have known design issues (GAT-forced monomorphization, std::io::Result losing context, &str vs Path, incomplete abstraction) and extracting them as-is into a Tier 2 crate would bless the current design rather than improve it. Will focus on:
Storage trait redesign will be tackled collaboratively before any extraction. |
Summary
Extract a new
diskann-storageTier 2 crate fromdiskann-providersas the first phase of decomposing the providers god crate (37K lines → ~15K target).What moved to
diskann-storageStorageReadProvider/StorageWriteProviderFileStorageProviderVirtualStorageProvidervirtual_storage)DynWriteProvider/WriteProviderWrapper/WriteSeekpath_utilityDatasetDtoproto_storageKey properties
diskann-*dependencies — onlyprost+thiserror. Clean Tier 2 crate.diskann-storagecovering all moved functionality.diskann-providersre-exports all moved types; existingcrate::storage::StorageReadProviderpaths continue to work.diskann-disk,diskann-tools,diskann-benchmarkupdated to import directly fromdiskann-storage.Verification
cargo fmt --all --check✅cargo clippy --workspace --all-targets -- -D warnings✅cargo test -p diskann-storage— 38 passed, 0 failedcargo test -p diskann-providers --profile ci— 520 passed, 0 failedcargo check --workspace✅