Integrating in-mem, inline, beta search into GH DiskANN #782

Open
gopalrs wants to merge 5 commits into main from sync-from-cdb-diskann

Conversation


@gopalrs (Contributor) commented on Feb 16, 2026

Integrate filter search into benchmarks. This PR allows users to run filter benchmarks using the inline beta search algorithm. It also adds a "DocumentInsertStrategy", which lets users insert "documents" (i.e. {vector, attributes}) into the DiskANN index and query them with vectors plus predicates on the attributes; a sketch of the document concept follows the change list below.

- Refactored recall utilities in diskann-benchmark
- Updated tokio utilities
- Added attribute and format parser improvements in label-filter
- Updated ground_truth utilities in diskann-tools
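
For illustration, a minimal sketch of the document concept; the field names and concrete types here are assumptions for readability, not the PR's actual definitions (the PR makes Document generic over ?Sized vector types):

// Hypothetical sketch, not the PR's definition: a document pairs an
// embedding with arbitrary attributes that predicates can filter on.
struct Document {
    vector: Vec<f32>,              // embedding inserted into the DiskANN index
    attributes: serde_json::Value, // e.g. {"country": "AU", "year": 2007}
}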

Copilot AI left a comment


Pull request overview

This PR integrates label-filtered (“document”) insertion and inline beta filtered search into the DiskANN benchmark/tooling flow, enabling benchmarks that operate on { vector, attributes } documents and evaluate filtered queries.

Changes:

  • Added DocumentInsertStrategy and supporting public types to insert/query Document objects (vector + attributes) through DocumentProvider.
  • Extended inline beta filter search to handle predicate encoding failures and added a constructor for InlineBetaStrategy.
  • Added a new benchmark input/backend (document-index-build) plus example config for running document + filter benchmarks.

Reviewed changes

Copilot reviewed 22 out of 23 changed files in this pull request and generated 9 comments.

Summary per file:
File | Description
test_data/disk_index_search/data.256.label.jsonl | Updates LFS pointer for label test data used in filter benchmarks.
diskann-tools/src/utils/ground_truth.rs | Adds array-aware label matching/expansion and extensive tracing diagnostics for filter ground-truth generation.
diskann-tools/Cargo.toml | Adds serde_json dependency (and adjusts manifest metadata).
diskann-providers/src/model/graph/provider/async_/inmem/full_precision.rs | Adds Vec<T> query support for the full-precision in-mem provider (for inline beta usage).
diskann-label-filter/src/lib.rs | Exposes the new document_insert_strategy module under encoded_attribute_provider.
diskann-label-filter/src/inline_beta_search/inline_beta_filter.rs | Adds InlineBetaStrategy::new and introduces is_valid_filter fast-path logic.
diskann-label-filter/src/inline_beta_search/encoded_document_accessor.rs | Adjusts filter encoding to be optional and threads is_valid_filter into the query computer.
diskann-label-filter/src/encoded_attribute_provider/roaring_attribute_store.rs | Makes RoaringAttributeStore public for cross-crate use.
diskann-label-filter/src/encoded_attribute_provider/encoded_filter_expr.rs | Changes the encoded filter representation to Option, allowing "invalid filter" fallback behavior.
diskann-label-filter/src/encoded_attribute_provider/document_provider.rs | Allows vector types used in documents to be ?Sized.
diskann-label-filter/src/encoded_attribute_provider/document_insert_strategy.rs | New strategy wrapper enabling insertion/search over Document values.
diskann-label-filter/src/encoded_attribute_provider/ast_label_id_mapper.rs | Simplifies lookup error messaging and signature for attribute→id mapping.
diskann-label-filter/src/document.rs | Makes Document generic over ?Sized vectors.
diskann-benchmark/src/utils/tokio.rs | Adds a reusable multi-thread Tokio runtime builder.
diskann-benchmark/src/utils/recall.rs | Re-exports the knn recall helper for benchmark use.
diskann-benchmark/src/inputs/mod.rs | Registers a new document_index input module.
diskann-benchmark/src/inputs/document_index.rs | New benchmark input schema for document-index build + filtered search runs.
diskann-benchmark/src/backend/mod.rs | Registers new document_index backend benchmarks.
diskann-benchmark/src/backend/index/result.rs | Extends search result reporting with query count and wall-clock summary columns.
diskann-benchmark/src/backend/document_index/mod.rs | New backend module entrypoint for document index benchmarks.
diskann-benchmark/src/backend/document_index/benchmark.rs | New end-to-end benchmark: build via DocumentInsertStrategy + filtered search via InlineBetaStrategy.
diskann-benchmark/example/document-filter.json | Adds example job configuration for document filter benchmark runs.
Cargo.lock | Adds serde_json to the lockfile dependencies.


Comment on lines +730 to +731
// We need to clone data for each task
let queries_arc = Arc::new(queries.clone());

Copilot AI Feb 16, 2026


run_search_parallel clones the entire queries matrix and predicate list (predicates.to_vec()) for each call, which can be very expensive for realistic benchmark sizes (extra memory + copy time) and is repeated for every rep/run config. Since queries is already owned in the caller, consider passing an Arc<Matrix<T>> / Arc<Vec<(usize, ASTExpr)>> into run_filtered_search and cloning the Arc instead of deep-cloning the data.

Suggested change
// We need to clone data for each task
let queries_arc = Arc::new(queries.clone());
// Wrap shared data in Arc so it can be cloned cheaply per task
let queries_arc = Arc::new(queries);
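
For context, a minimal sketch of the Arc-per-task pattern the suggestion points at; the task-spawning shape here is assumed, not copied from the PR:

use std::sync::Arc;

let queries_arc = Arc::new(queries);          // take ownership of the data once
for _ in 0..num_tasks {
    let queries = Arc::clone(&queries_arc);   // cheap refcount bump, no deep copy
    tokio::spawn(async move {
        // ... run searches against `queries` for this task's slice ...
    });
}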

@@ -5,14 +5,13 @@ version.workspace = true
authors.workspace = true
description.workspace = true
documentation.workspace = true

Copilot AI Feb 16, 2026


diskann-tools no longer declares license.workspace = true while other workspace crates do. This inconsistency can break packaging/publishing or workspace metadata expectations. If this wasn’t intentional, re-add the workspace license field to match the rest of the workspace manifests.

Suggested change
documentation.workspace = true
documentation.workspace = true
license.workspace = true

Comment on lines +32 to +66
/// Expands a JSON object with array-valued fields into multiple objects with scalar values.
/// For example: {"country": ["AU", "NZ"], "year": 2007}
/// becomes: [{"country": "AU", "year": 2007}, {"country": "NZ", "year": 2007}]
///
/// If multiple fields have arrays, all combinations are generated.
fn expand_array_fields(value: &Value) -> Vec<Value> {
match value {
Value::Object(map) => {
// Start with a single empty object
let mut results: Vec<Map<String, Value>> = vec![Map::new()];

for (key, val) in map.iter() {
if let Value::Array(arr) = val {
// Expand: for each existing result, create copies for each array element
let mut new_results: Vec<Map<String, Value>> = Vec::new();
for existing in results.iter() {
for item in arr.iter() {
let mut new_map: Map<String, Value> = existing.clone();
new_map.insert(key.clone(), item.clone());
new_results.push(new_map);
}
}
// If array is empty, keep existing results without this key
if !arr.is_empty() {
results = new_results;
}
} else {
// Non-array field: add to all existing results
for existing in results.iter_mut() {
existing.insert(key.clone(), val.clone());
}
}
}

results.into_iter().map(Value::Object).collect()

Copilot AI Feb 16, 2026


expand_array_fields clones JSON objects/values and can generate a cartesian product when multiple fields are arrays. Inside read_labels_and_compute_bitmap this happens in a nested loop over (queries × base_labels), so worst-case time/memory can explode on realistic datasets. Consider avoiding materializing expanded variants (evaluate arrays lazily), or imposing a cap/short-circuit to prevent combinatorial blow-ups.
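
One possible short-circuit, sketched here with an assumed cap value, would bound the expansion count before materializing anything:

// Sketch: estimate the cartesian-product size up front and bail out if it
// exceeds a cap, instead of materializing every expanded object.
const MAX_EXPANSIONS: usize = 1_024; // assumed limit, tune per dataset
let combos: usize = map
    .values()
    .map(|v| match v {
        Value::Array(a) if !a.is_empty() => a.len(),
        _ => 1,
    })
    .product();
if combos > MAX_EXPANSIONS {
    // fall back to lazy evaluation or skip this label set
}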

Comment on lines +141 to +144
//If predicate evaluation fails, we will return the score returned by the
//inner computer, as though no predicate was specified.
tracing::warn!(
"Predicate evaluation failed in OnlineBetaComputer::evaluate_similarity()"

Copilot AI Feb 16, 2026


The warning message in the !is_valid_filter branch says “Predicate evaluation failed in OnlineBetaComputer::evaluate_similarity()”, but this branch is taken when encoding the filter expression failed (not when evaluation failed), and the type is InlineBetaComputer. The message should reflect the actual failure mode to avoid misleading diagnostics.

Suggested change
//If predicate evaluation fails, we will return the score returned by the
//inner computer, as though no predicate was specified.
tracing::warn!(
"Predicate evaluation failed in OnlineBetaComputer::evaluate_similarity()"
//If the filter expression is invalid or failed to encode, we will return the score
//returned by the inner computer, as though no predicate was specified.
tracing::warn!(
"Filter expression was invalid or failed to encode in InlineBetaComputer::evaluate_similarity(); proceeding without filtering"

Comment on lines +6 to +11
/// Create a generic multi-threaded runtime with `num_threads`.
pub(crate) fn runtime(num_threads: usize) -> anyhow::Result<tokio::runtime::Runtime> {
Ok(tokio::runtime::Builder::new_multi_thread()
.worker_threads(num_threads)
.build()?)
}

Copilot AI Feb 16, 2026


This new runtime() builder doesn’t call enable_all() (or at least enable_time/enable_io). If any benchmark path (now or later) uses Tokio timers, networking, or async FS, the runtime will fail at runtime with missing driver errors. Consider enabling the required drivers explicitly to make this utility safe to reuse.
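
A minimal sketch of that fix on the same builder (enable_all() turns on both the timer and I/O drivers):

pub(crate) fn runtime(num_threads: usize) -> anyhow::Result<tokio::runtime::Runtime> {
    Ok(tokio::runtime::Builder::new_multi_thread()
        .worker_threads(num_threads)
        .enable_all() // registers the time and I/O drivers
        .build()?)
}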

Comment on lines +689 to +699
search_l,
recall: recall_metrics,
qps,
wall_clock_time: rep_latencies,
mean_latency: mean,
p90_latency: p90,
p99_latency: p99,
mean_cmps,
mean_hops,
per_query_details: Some(per_query_details),
})

Copilot AI Feb 16, 2026


per_query_details is always set to Some(...) even when it’s empty, which forces this field into the serialized output for every run. For larger query sets this can bloat benchmark output significantly when many queries have imperfect recall. Consider only setting this to Some when you actually have entries (or gating it behind a debug/verbose option).
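
A minimal sketch of the gating, assuming per_query_details is a Vec built earlier in this function:

// Serialize the field only when there is something to report.
per_query_details: (!per_query_details.is_empty()).then_some(per_query_details),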

.accept(&pe)?
{
filtered_candidates.push(Neighbor::new(candidate.id, candidate.distance));
}

Copilot AI Feb 16, 2026


When is_valid_filter is false, post_process never pushes any candidates, so filtered_candidates stays empty and the search returns 0 results. If the intent is to fall back to unfiltered search when the predicate can’t be encoded, this should forward the original candidates (or treat it as match-all) instead of filtering everything out.

Suggested change
}
}
} else {
// If the filter is not valid/encodable, fall back to unfiltered candidates.
filtered_candidates.push(Neighbor::new(candidate.id, candidate.distance));

Comment on lines +141 to +145
//If predicate evaluation fails, we will return the score returned by the
//inner computer, as though no predicate was specified.
tracing::warn!(
"Predicate evaluation failed in OnlineBetaComputer::evaluate_similarity()"
);

Copilot AI Feb 16, 2026


The !is_valid_filter branch logs a warning on every evaluate_similarity call. This function is called extremely frequently during search, so this will spam logs and can materially degrade performance. Consider logging once when building the query computer (already done) and making evaluate_similarity silently fall back to unfiltered scoring when the filter is invalid.

Suggested change
//If predicate evaluation fails, we will return the score returned by the
//inner computer, as though no predicate was specified.
tracing::warn!(
"Predicate evaluation failed in OnlineBetaComputer::evaluate_similarity()"
);
// If the filter is invalid, treat it as though no predicate was specified
// and return the score from the inner computer without additional logging.

Comment on lines +21 to +28
pub fn new(ast_expr: &ASTExpr, attribute_map: Arc<RwLock<AttributeEncoder>>) -> Self {
let mut mapper = ASTLabelIdMapper::new(attribute_map);
let ast_id_expr = ast_expr.accept(&mut mapper)?;
Ok(Self { ast_id_expr })
match ast_expr.accept(&mut mapper) {
Ok(ast_id_expr) => Self {
ast_id_expr: Some(ast_id_expr),
},
Err(_e) => Self { ast_id_expr: None },
}

Copilot AI Feb 16, 2026


EncodedFilterExpr::new now swallows all mapping errors and returns None without preserving/logging the underlying error. This makes it hard to distinguish “empty predicate” vs “unknown attribute/value” vs other failures, and can cause silent fallback to unfiltered behavior. Consider returning a Result again, or storing the error for callers to surface in a controlled way.
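
A minimal sketch of the lighter option, logging the mapping error once at construction instead of discarding it (the tracing call mirrors what the crate already uses elsewhere; the error's Display impl is assumed):

match ast_expr.accept(&mut mapper) {
    Ok(ast_id_expr) => Self {
        ast_id_expr: Some(ast_id_expr),
    },
    Err(e) => {
        // Record why encoding failed before falling back to "no predicate".
        tracing::warn!("filter expression encoding failed: {e}");
        Self { ast_id_expr: None }
    }
}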
