Skip to content

refactor: internalize vector staging merge flow#6220

Open
Xuanwo wants to merge 15 commits intomainfrom
xuanwo/vector-staging-merge-internal
Open

refactor: internalize vector staging merge flow#6220
Xuanwo wants to merge 15 commits intomainfrom
xuanwo/vector-staging-merge-internal

Conversation

@Xuanwo
Copy link
Collaborator

@Xuanwo Xuanwo commented Mar 18, 2026

This refactors the distributed vector index finalization into an internal staging-merge pipeline while keeping the existing public workflow unchanged. The merge_index_metadata(...) function now acts as a thin wrapper over the new flow. The tests have been split into a dedicated commit to make the logic changes easier to review.

This change lays the internal groundwork for segmented vector index commit semantics without introducing new public orchestration APIs in this PR.

This PR is based on #6209. The changes are scoped into two commits:

Please review accordingly.

@github-actions
Copy link
Contributor

PR Review: refactor: internalize vector staging merge flow

Overall: Well-structured refactor that cleanly separates planning from execution for distributed vector index finalize. Good test coverage across IVF subtypes and error cases. A few items to consider:

P1: load_fragment_bitmap_from_storage scans all partitions and all row IDs

load_fragment_bitmap_from_storage (ivf.rs ~line 2598) iterates every partition and every row ID to reconstruct the fragment bitmap. For large indices with millions of vectors, this could be very expensive during plan_staging_segments (which calls load_partial_vector_segment per shard) and also during load_vector_segment in commit_existing_index. This happens at planning time, not just at merge time.

Consider whether the fragment bitmap could be stored as a sidecar in the partial shard directory during build, or at minimum document the cost tradeoff and ensure callers are aware. For the legacy compat path (commit_existing_indexload_vector_segment) this could be a regression if the index is large.

P1: reset_final_segment_dir deletes before writing — no atomicity

merge_staging_segment_to_dir calls reset_final_segment_dir (which does remove_dir_all) on the final_dir before writing new content. If the process crashes between delete and write completion, the segment is lost. The legacy same-dir path is especially fragile: it deletes the temp dir, copies files to final_dir, then deletes the partial shards — multiple non-atomic steps.

For object stores (S3/GCS) this is somewhat inherent, but for local filesystem consider whether an atomic rename could be used instead, or at least note the non-atomic window in comments.

Minor observations

  • copy_partial_segment_contents does serial object_store.copy() per file. For cloud stores with many files per shard, parallel copies (e.g., futures::stream::iter(...).buffered(N)) would be faster.
  • plan_staging_segments processes shards sequentially in a for loop. Since each load_partial_vector_segment does I/O, these could be loaded concurrently with try_join_all or similar.
  • The chrono dependency added to lance-index is only used for IndexSegment's created_at field in the commit path (in lance/src/index.rs), not in lance-index itself. It doesn't appear to be used in types.rs. Is this dependency needed in lance-index?

Tests

Good coverage: multiple IVF subtypes (flat/pq/sq), HNSW variants, error cases for mismatched subtypes/centroids/overlapping fragments, duplicate segments, empty segments, and the legacy commit_existing_index wrapper. The parametrized rstest approach is clean.

@Xuanwo Xuanwo changed the base branch from main to xuanwo/index-segment-commit-api March 18, 2026 06:43
@Xuanwo Xuanwo force-pushed the xuanwo/vector-staging-merge-internal branch from 5c4279d to 44062e8 Compare March 18, 2026 07:23
@Xuanwo
Copy link
Collaborator Author

Xuanwo commented Mar 18, 2026

CI is not running because our base is not main. I will change that once #6209 merged.

Base automatically changed from xuanwo/index-segment-commit-api to main March 18, 2026 17:22
…-merge-internal

# Conflicts:
#	python/src/indices.rs
#	rust/lance-index/src/traits.rs
#	rust/lance-index/src/types.rs
#	rust/lance/src/index.rs
#	rust/lance/src/index/vector/ivf/v2.rs
@codecov
Copy link

codecov bot commented Mar 18, 2026

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant