diff --git a/rfcs/document_provider_serialization.md b/rfcs/document_provider_serialization.md new file mode 100644 index 000000000..4a3811c92 --- /dev/null +++ b/rfcs/document_provider_serialization.md @@ -0,0 +1,260 @@ +# RFC: SaveWith and LoadWith for DocumentProvider + +| | | +| ---------------- | ---------------- | +| **Authors** | Sampath Rajendra | +| **Contributors** | Sampath Rajendra | +| **Created** | 2026-03-26 | +| **Updated** | 2026-03-26 | + +## Summary + +This RFC addresses the use case of saving and loading of label information along with the index information that is already saved and loaded by the `DataProvider`. The RFC proposes to do this by implementing the `SaveWith` and `LoadWith` serialization traits for `DocumentProvider`. `DocumentProvider` wraps an inner `DataProvider` (e.g. `DefaultProvider` or `BfTreeProvider`) and an `AttributeStore` that holds the per-vector label data. Serialization must handle both: delegating to the inner provider's own save/load and persisting the attribute store to a separate binary file. + +## Motivation + +### Background + +`DocumentProvider` (in `diskann-label-filter`) is a wrapper that combines a `DataProvider` with an `AttributeStore`. It is used when the ANN index must support filtered search: every indexed vector carries a set of typed key–value attributes (e.g. `"category" = "electronics"`, `"price" = 299.99`) and queries can include attribute-based predicates. + +```rust +pub struct DocumentProvider { + inner_provider: DP, + attribute_store: AS, +} +``` + +The concrete attribute store in use today is `RoaringAttributeStore`, which maintains: + +- An `AttributeEncoder` mapping each distinct `Attribute` (field name + typed value) to a compact `u64` ID. +- A forward index (`RoaringTreemapSetProvider`) mapping each node's internal ID to the set of its attribute IDs. +- An inverted index (`RoaringTreemapSetProvider`) mapping each attribute ID to the set of node IDs that carry it. The inverted index can be rebuilt from the forward index and so **does not need to be persisted**. + +The `SaveWith` and `LoadWith` traits are defined in `diskann-providers::storage`: + +```rust +pub trait SaveWith { + type Ok: Send; + type Error: std::error::Error + Send; + + fn save_with

(&self, provider: &P, auxiliary: &T) + -> impl Future> + Send + where + P: StorageWriteProvider; +} + +pub trait LoadWith: Sized { + type Error: std::error::Error + Send; + + fn load_with

(provider: &P, auxiliary: &T) + -> impl Future> + Send + where + P: StorageReadProvider; +} +``` + +The generic `T` parameter carries the context needed to derive file paths — examples are a file-path prefix `String`, a structured `AsyncIndexMetadata` (wrapping a prefix string). + +Existing serialization patterns for the inner provider types: + +**`BfTreeProvider`** (`SaveWith` / `LoadWith`): writes `_params.json`, `_vectors.bftree`, `_neighbors.bftree`, `_deleted.bin`, and PQ files when applicable. + +**`DefaultProvider`** (`SaveWith<(u32, AsyncIndexMetadata)>` / `LoadWith`): writes `{prefix}.data`, compressed vector files, and the raw graph file. The delete store is not persisted and is reconstructed empty on load. + +**`DiskProvider`** (`LoadWith` only): has no `SaveWith` implementation. Disk-index creation is handled externally by `DiskIndexWriter::create_disk_layout()`, which is not integrated with the `SaveWith`/`LoadWith` trait family. + +### Problem Statement + +`DocumentProvider` implements `DataProvider`, `SetElement`, and `Delete` but has no serialization support. An index built with `DocumentProvider` cannot be persisted and reloaded; every restart requires rebuilding the index from scratch, including re-encoding all attribute data. + +### Goals + +1. Implement `SaveWith` and `LoadWith` for `DocumentProvider`, delegating to the inner provider and the attribute store respectively. +2. Define a binary file format (`{prefix}.labels.bin`) for persisting `RoaringAttributeStore` label data. +3. Implement `SaveWith` and `LoadWith` directly on `RoaringAttributeStore` without widening the `AttributeStore` trait. + +## Proposal + +The `diskann-label-filter` crate takes a new dependency on `diskann-providers` (for the `SaveWith`/`LoadWith` trait definitions and the `StorageReadProvider`/`StorageWriteProvider` abstractions). + +### Attribute Store Serialization Interface + +Rather than extending the `AttributeStore` trait (which would impose serialization concerns on all implementations), `SaveWith` and `LoadWith` are added directly on `RoaringAttributeStore` for the relevant auxiliary types. The attribute store is responsible for extracting whatever path information it needs from `T`. + +The `DocumentProvider` impls require the attribute store bound: + +```rust +impl SaveWith for DocumentProvider +where + DP: DataProvider + SaveWith, + AS: AttributeStore + SaveWith + AsyncFriendly, + ANNError: From, +{ ... } +``` + +```rust +impl LoadWith for DocumentProvider +where + DP: DataProvider + LoadWith, + AS: AttributeStore + LoadWith + AsyncFriendly, + ANNError: From, +{ ... } +``` + +### Label File Format + +The label data is persisted to a single binary file at `{prefix}.labels.bin`. `RoaringAttributeStore` is responsible for deriving the path from the auxiliary type `T` passed to its `SaveWith` / `LoadWith` implementation. The format uses little-endian byte order throughout and is designed for straightforward sequential writes and reads. + +```text +┌────────────────────────────────────────────────────────────────────┐ +│ Header (17 bytes) │ +│ [u64: num_attribute_entries] │ +│ [u64: forward_index_offset] (byte offset from file start to │ +│ Section 2) │ +│ [u8: vector_id_type_tag] │ +│ 0 = u32 │ +│ 1 = u64 │ +├────────────────────────────────────────────────────────────────────┤ +│ Section 1: Attribute Dictionary │ +│ Repeated `num_attribute_entries` times: │ +│ │ +│ [u64: attribute_id] │ +│ [u32: field_name_byte_len] │ +│ [u8 * field_name_byte_len: UTF-8 field name] │ +│ [u8: type_tag] │ +│ 0 = Bool │ +│ 1 = Integer │ +│ 2 = Real │ +│ 3 = String │ +│ 4 = Empty │ +│ [value bytes, depends on type_tag]: │ +│ Bool: 1 byte (0 = false, 1 = true) │ +│ Integer: 8 bytes (i64, little-endian) │ +│ Real: 8 bytes (f64, little-endian) │ +│ String: [u32: byte_len] + [u8 * byte_len: UTF-8 value] │ +│ Empty: 0 bytes │ +├────────────────────────────────────────────────────────────────────┤ +│ Section 2: Forward Index │ +│ [u64: num_nodes_with_labels] │ +│ Repeated `num_nodes_with_labels` times: │ +│ │ +│ [N bytes: node_internal_id (width per vector_id_type_tag)] │ +│ [u32: num_attribute_ids_for_this_node] │ +│ [u64 * num_attribute_ids: attribute IDs (sorted ascending)] │ +└────────────────────────────────────────────────────────────────────┘ +``` + +**Notes:** + +- The inverted index is **not persisted**; it is rebuilt from the forward index during `load_with`. +- The attribute dictionary is written first so that a reader can build the reverse mapping (`u64` ID → `Attribute`) before scanning the forward index. + +### DocumentProvider SaveWith Implementation + +```rust +impl SaveWith for DocumentProvider +where + DP: DataProvider + SaveWith, + AS: AttributeStore + SaveWith + AsyncFriendly, + ANNError: From, +{ + type Ok = (); + type Error = ANNError; + + async fn save_with

( + &self, + provider: &P, + auxiliary: &T, + ) -> Result<(), ANNError> + where + P: StorageWriteProvider, + { + // 1. Delegate to inner provider. + self.inner_provider + .save_with(provider, auxiliary) + .await?; + + // 2. Persist the attribute store. + self.attribute_store + .save_with(provider, auxiliary) + .await?; + + Ok(()) + } +} +``` + +The `RoaringAttributeStore::save_with` implementation must: + +1. Acquire read locks on `attribute_map` and `index`. +2. Write the header. +3. Iterate over all `(InternalAttribute, u64)` pairs in `AttributeEncoder` via `AttributeEncoder::for_each` (already present, currently marked `dead_code`) and write the dictionary section. +4. Iterate over all `(node_id, attribute_id_set)` pairs in the forward index and write the forward index section. + +### DocumentProvider LoadWith Implementation + +```rust +impl LoadWith for DocumentProvider +where + DP: DataProvider + LoadWith, + AS: AttributeStore + LoadWith + AsyncFriendly, + ANNError: From, +{ + type Error = ANNError; + + async fn load_with

( + provider: &P, + auxiliary: &T, + ) -> Result + where + P: StorageReadProvider, + { + // 1. Load the inner provider. + let inner_provider = DP::load_with(provider, auxiliary).await?; + + // 2. Load the attribute store. + let attribute_store = AS::load_with(provider, auxiliary).await?; + + Ok(Self { + inner_provider, + attribute_store, + }) + } +} +``` + +The `RoaringAttributeStore::load_with` implementation must: + +1. Read the header to obtain `num_attribute_entries` and `forward_index_offset`. +2. Decode the attribute dictionary, inserting each `(u64 id, Attribute)` pair into the `AttributeEncoder`. +3. Seek to `forward_index_offset` and decode the forward index, inserting into the `RoaringTreemapSetProvider`. +4. Rebuild the inverted index from the forward index (iterate node → attribute-IDs, invert to attribute-ID → nodes). + +## Trade-offs + +### Keeping serialization off the `AttributeStore` trait + +Adding `SaveWith`/`LoadWith` directly to `RoaringAttributeStore` rather than to the `AttributeStore` trait avoids imposing a serialization requirement on every future `AttributeStore` implementation. The cost is that `DocumentProvider` can only be saved or loaded when `AS` satisfies the respective bound — callers using a hypothetical `AS` that does not implement `SaveWith` would need to handle serialization manually or outside of this trait pair. + +### DiskProvider compatibility + +`DiskProvider` has no `SaveWith` implementation; disk-index creation is handled by `DiskIndexWriter::create_disk_layout()`, which is not integrated with the `SaveWith`/`LoadWith` trait family. Consequently, `DocumentProvider, AS>` supports `LoadWith` only. + +For `LoadWith` to work, the label file (`{prefix}.labels.bin`) must have been written separately — e.g. by calling `attribute_store.save_with(...)` directly during the disk-index build pipeline before the disk layout is finalized. The alternative of integrating label-file writing into `DiskIndexWriter` is deferred to future work (see below). + +### Save format + +The save format could have been JSON. This would help with readability and debuggability but at the cost of more storage space and perhaps longer load times. + +## Benchmark Results + +TODO + +## Future Work + +- [ ] Extend `DiskIndexWriter` (or introduce a `create_disk_layout_with_labels` variant) to co-write `{prefix}.labels.bin` during disk-index construction, enabling a full `SaveWith`/`LoadWith` round-trip for `DocumentProvider, AS>`. + +## References + +1. [`diskann-providers` — `SaveWith` / `LoadWith` trait definitions](../diskann-providers/src/) +2. [`diskann-label-filter` — `DocumentProvider`, `RoaringAttributeStore`](../diskann-label-filter/src/)