Skip to content

feat: add count_table_rows, insert_into_table, query_table for DirectoryNamespace#6132

Open
XuQianJin-Stars wants to merge 5 commits intolance-format:mainfrom
XuQianJin-Stars:feature/dir-namespace-table-ops
Open

feat: add count_table_rows, insert_into_table, query_table for DirectoryNamespace#6132
XuQianJin-Stars wants to merge 5 commits intolance-format:mainfrom
XuQianJin-Stars:feature/dir-namespace-table-ops

Conversation

@XuQianJin-Stars
Copy link
Contributor

@XuQianJin-Stars XuQianJin-Stars commented Mar 9, 2026

close #6111

  • Implement count_table_rows with optional version checkout and predicate filter
  • Implement insert_into_table with append/overwrite modes via Arrow IPC
  • Implement query_table supporting vector similarity search (with distance type, nprobes, refine factor, prefilter) and plain scan with filter/limit/offset/version
  • Add lance-linalg dependency for DistanceType in vector search
  • Add 8 unit tests covering all new methods and edge cases

Test plan

  • cargo test -p lance-namespace-impls passes all new tests
  • count_table_rows returns correct count with and without predicate filter
  • insert_into_table correctly appends and overwrites data
  • query_table returns correct results for vector search and plain scan

@github-actions github-actions bot added the enhancement New feature or request label Mar 9, 2026
@XuQianJin-Stars XuQianJin-Stars force-pushed the feature/dir-namespace-table-ops branch from eb5404d to cb74abf Compare March 9, 2026 08:07
@github-actions
Copy link
Contributor

github-actions bot commented Mar 9, 2026

PR Review

P0: multi_vector enters vector path but is never handled

In query_table, has_vector is true when multi_vector.is_some(), but the vector search path only handles single_vector:

let has_vector =
    request.vector.single_vector.is_some() || request.vector.multi_vector.is_some();
// ...
if has_vector {
    if let Some(ref single) = request.vector.single_vector {
        // only single_vector is handled
    }
    // multi_vector is silently ignored
}

If a caller passes multi_vector without single_vector, the code enters the vector path (skipping the plain-scan limit/offset logic) but never configures a nearest-neighbor query. This will silently produce wrong results. Either handle multi_vector or return an explicit "not supported" error when it's set.

P1: insert_into_table doesn't use session

count_table_rows and query_table both open datasets via DatasetBuilder with self.session, but insert_into_table uses Dataset::write with only ObjectStoreParams. If the namespace has a session configured, inserts won't use it. This is inconsistent and could cause failures in environments relying on session-based configuration.

P1: Empty-result schema may not reflect column projection

In query_table, when no batches are returned, the fallback creates an IPC stream using dataset.schema():

let schema: arrow_schema::Schema = dataset.schema().into();

This is the full dataset schema, not the projected schema. If the caller requested columns: ["id"], the empty-result IPC will contain the full schema instead. Consider capturing the projected schema from the scanner before streaming.

Minor: insert_into_table double-buffers all batches

The IPC stream is fully deserialized into Vec<RecordBatch> before being re-wrapped and passed to Dataset::write. Per project guidelines ("avoid collecting streams of RecordBatch into memory"), consider streaming directly or at minimum noting this as a known limitation for large payloads.


🤖 Generated with Claude Code

@codecov
Copy link

codecov bot commented Mar 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Copy link
Collaborator

@yanghua yanghua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@XuQianJin-Stars Thanks for the contribution. Left two minor comments.


let mode = match request.mode.as_deref() {
Some(m) if m.eq_ignore_ascii_case("overwrite") => lance::dataset::WriteMode::Overwrite,
_ => lance::dataset::WriteMode::Append,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be more seriously, we may need to enumale the mode more carefully. For example, if it is not append, we need to throw an exception?

.nearest(
request.vector_column.as_deref().unwrap_or("vector"),
&query_array,
request.k as usize,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check the k arg? Should it be larger than 0?

@XuQianJin-Stars XuQianJin-Stars force-pushed the feature/dir-namespace-table-ops branch 2 times, most recently from 60c499c to 1a2e984 Compare March 10, 2026 08:13
…oryNamespace

Implement three new table operation methods on DirectoryNamespace:

- count_table_rows: count rows with optional version checkout and predicate filter
- insert_into_table: write Arrow IPC batches in append or overwrite mode
- query_table: vector similarity search or plain scan with filter/limit/offset/version

Add lance-linalg as a dependency for DistanceType in vector search.
Include 8 unit tests covering all new methods and edge cases.
@XuQianJin-Stars XuQianJin-Stars force-pushed the feature/dir-namespace-table-ops branch from 1a2e984 to 9c8f82b Compare March 19, 2026 06:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: add count_table_rows, insert_into_table, query_table for DirectoryNamespace

2 participants