Skip to content

feat: add public vector segment builder API#6224

Open
Xuanwo wants to merge 1 commit intoxuanwo/vector-staging-merge-internalfrom
xuanwo/public-api-on-staging-merge
Open

feat: add public vector segment builder API#6224
Xuanwo wants to merge 1 commit intoxuanwo/vector-staging-merge-internalfrom
xuanwo/public-api-on-staging-merge

Conversation

@Xuanwo
Copy link
Collaborator

@Xuanwo Xuanwo commented Mar 18, 2026

This adds the public distributed vector finalize API on top of #6220's internal staging-merge flow. It exposes a canonical two-step workflow: materialize final segments from a staging root, then publish them with commit_existing_index_segments(...).

This keeps merge_index_metadata(...) as the single-root compatibility path while making the multi-segment path explicit in the Rust API. The branch includes an end-to-end vector test that stages partial shards, builds final segments through the new builder, and commits them as one logical index.

Context: #6220

@github-actions github-actions bot added the enhancement New feature or request label Mar 18, 2026
@github-actions
Copy link
Contributor

PR Review

This PR promotes the internal staging-merge types to public API and adds IndexSegmentBuilder as the canonical entry point for distributed vector index finalization. The design is clean — plan/build/commit separation gives callers good control.

Issues to address

P0 — build_all fans out all segment builds concurrently with no bound

pub async fn build_all(&self) -> Result<Vec<IndexSegment>> {
    let plans = self.plan().await?;
    try_join_all(plans.iter().map(|plan| self.build(plan))).await
}

try_join_all launches every segment merge concurrently. With many shards and no target_segment_bytes set (default: 1 segment per shard), this can spawn an unbounded number of concurrent I/O-heavy merge tasks. Consider using futures::stream::iter(...).buffered(concurrency) or at minimum documenting this behaviour so callers know to use plan() + manual orchestration for large staging roots.

P1 — plan() calls plan_staging_segments with None for requested_index_type

The plan method always passes None for requested_index_type:

crate::index::vector::ivf::plan_staging_segments(
    &self.dataset.indices_dir().child(self.staging_index_uuid.as_str()),
    &self.partial_shards,
    None,               // <-- always None
    self.target_segment_bytes,
)

There's no builder method to set a requested index type, yet VectorIndexSegmentPlan carries it. Either expose a with_index_type(...) builder method or document that callers should use plan() + modify plans if they need a non-default type.

P1 — Test doesn't verify recall quality

Per the project's testing standards: "Vector index tests must assert recall metrics (>=0.5 threshold), not just verify creation succeeds." The new test asserts result.num_rows() > 0 but doesn't check recall. Consider adding a brute-force KNN comparison similar to existing vector index tests.

Minor nits

  • StagingIndexShard::new accepts IntoIterator<Item = u32> for fragment_bitmap — this is a public API, so consider whether accepting RoaringBitmap directly would be more ergonomic for callers who already have one.
  • The type alias pub(crate) type PartialShard = StagingIndexShard; is a clean migration path but could be removed once refactor: internalize vector staging merge flow #6220 is updated to use the new public name directly, to avoid two names for the same thing living in the codebase long-term.

🤖 Generated with Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant