
feat: clustering #6136

Draft

hamersaw wants to merge 3 commits into lance-format:main from hamersaw:feature/clustering

Conversation

@hamersaw (Contributor) commented Mar 9, 2026

No description provided.

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>
@github-actions github-actions bot added the enhancement New feature or request label Mar 9, 2026
github-actions bot commented Mar 9, 2026

PR Review: feat: clustering

Overall this is a clean, well-structured addition. The trait-based RewriteStrategy abstraction is extensible, the DataFusion SortExec integration is sound, and the row-ID capture deferral for clustering is handled correctly. A few items worth addressing:

P0: Bug — ClusteringCompactionPlanner does not filter deleted rows from fragment row counts

In ClusteringCompactionPlanner::plan(), you use frag.physical_rows to accumulate current_rows for chunking. Physical rows include soft-deleted rows, so each task chunk may contain significantly more logical rows than target_rows_per_fragment after deletion filtering. This could lead to unexpectedly large sort buffers and OOM in deletion-heavy datasets. Consider using frag.physical_rows - frag.deletion_count() (or equivalent) for the row estimate, consistent with how DefaultCompactionPlanner handles this.
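A minimal sketch of the suggested fix. `FragmentStats`, its fields, and `logical_rows` are illustrative stand-ins, not the actual lance fragment API; the point is only that the chunking loop should accumulate logical (post-deletion) rows rather than physical rows:

```rust
// Hypothetical fragment metadata; names are illustrative, not lance's API.
struct FragmentStats {
    physical_rows: u64,
    deleted_rows: u64,
}

impl FragmentStats {
    /// Logical rows = physical rows minus soft-deleted rows. This is the
    /// estimate the planner should accumulate against
    /// target_rows_per_fragment, matching DefaultCompactionPlanner.
    fn logical_rows(&self) -> u64 {
        self.physical_rows.saturating_sub(self.deleted_rows)
    }
}

fn main() {
    // A deletion-heavy fragment: counting physical rows would over-estimate
    // the chunk size by 400k rows and inflate the sort buffer accordingly.
    let frag = FragmentStats {
        physical_rows: 1_000_000,
        deleted_rows: 400_000,
    };
    assert_eq!(frag.logical_rows(), 600_000);
    println!("{}", frag.logical_rows());
}
```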

P1: fragment_ids filter is O(n*m)

all_fragments.into_iter()
    .filter(|f| ids.contains(&(f.id() as u64)))

ids is a Vec<u64>, so .contains() is O(m) per fragment. For large datasets with many fragment IDs, convert fragment_ids to a HashSet<u64> (or accept one) for O(1) lookups.
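A sketch of that change, with a stand-in `Fragment` type (the real lance type differs):

```rust
use std::collections::HashSet;

// Stand-in for the real lance fragment type.
struct Fragment {
    id: u64,
}

fn filter_fragments(all_fragments: Vec<Fragment>, fragment_ids: &[u64]) -> Vec<Fragment> {
    // Build the set once: O(m). Each membership check is then O(1) on
    // average, instead of the O(m) linear scan Vec::contains performs,
    // so the whole filter drops from O(n*m) to O(n + m).
    let ids: HashSet<u64> = fragment_ids.iter().copied().collect();
    all_fragments
        .into_iter()
        .filter(|f| ids.contains(&f.id))
        .collect()
}

fn main() {
    let frags: Vec<Fragment> = (0..10).map(|id| Fragment { id }).collect();
    let kept = filter_fragments(frags, &[2, 5, 7]);
    let kept_ids: Vec<u64> = kept.iter().map(|f| f.id).collect();
    assert_eq!(kept_ids, vec![2, 5, 7]);
    println!("{:?}", kept_ids);
}
```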

P1: Missing test — integration with rewrite_files

The unit tests cover ClusteringRewriteStrategy in isolation, but there's no test exercising the full rewrite_files path with clustering enabled — specifically the deferred row-ID capture logic (include_row_id_column → sort → make_rowid_capture_stream). That code path has subtle ordering requirements and would benefit from at least one integration test that verifies row-ID remapping is correct after a clustering compaction.

Minor nits (non-blocking)

  • The debug_assert!(!(capture_row_ids && include_row_id_column)) in prepare_reader is only checked in debug builds. Since this is enforcing an API contract between internal callers, a hard assert! (or returning an error) would be safer.
  • ClusteringCompactionPlanner doc says "the caller should set a very large target" for global ordering — consider documenting this more prominently or providing a convenience constructor.
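On the first nit, a small sketch of the error-returning alternative. `check_row_id_flags` is a hypothetical helper, not the actual `prepare_reader` signature; it only illustrates why enforcing the contract in all build profiles is safer, since `debug_assert!` compiles to a no-op in release builds:

```rust
// Hypothetical helper mirroring the flags named in the review; not the
// real prepare_reader. Unlike debug_assert!, this check runs in release
// builds too, so an internal caller violating the contract fails loudly
// instead of silently proceeding.
fn check_row_id_flags(capture_row_ids: bool, include_row_id_column: bool) -> Result<(), String> {
    if capture_row_ids && include_row_id_column {
        return Err(
            "capture_row_ids and include_row_id_column are mutually exclusive".to_string(),
        );
    }
    Ok(())
}

fn main() {
    // Valid combinations pass; the contradictory one is rejected.
    assert!(check_row_id_flags(true, false).is_ok());
    assert!(check_row_id_flags(false, true).is_ok());
    assert!(check_row_id_flags(true, true).is_err());
    println!("ok");
}
```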

🤖 Generated with Claude Code
