
feat: clustering #6136

Draft

hamersaw wants to merge 3 commits into lance-format:main from hamersaw:feature/clustering

Conversation

@hamersaw (Contributor) commented Mar 9, 2026

No description provided.

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>
@github-actions github-actions bot added the enhancement New feature or request label Mar 9, 2026
github-actions bot commented Mar 9, 2026

PR Review: feat: clustering

Overall this is a clean, well-structured addition. The trait-based RewriteStrategy abstraction is extensible, the DataFusion SortExec integration is sound, and the row-ID capture deferral for clustering is handled correctly. A few items worth addressing:

P0: Bug — ClusteringCompactionPlanner does not filter deleted rows from fragment row counts

In ClusteringCompactionPlanner::plan(), you use frag.physical_rows to accumulate current_rows for chunking. Physical rows include soft-deleted rows, so each task chunk may contain significantly more logical rows than target_rows_per_fragment after deletion filtering. This could lead to unexpectedly large sort buffers and OOM in deletion-heavy datasets. Consider using frag.physical_rows - frag.deletion_count() (or equivalent) for the row estimate, consistent with how DefaultCompactionPlanner handles this.
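A minimal sketch of the suggested fix. `FragmentStats`, its fields, and `logical_rows` are illustrative stand-ins, not the actual lance fragment API; the point is only that the chunking loop should accumulate logical (post-deletion) rows rather than physical rows:

```rust
// Hypothetical fragment metadata; names are illustrative, not lance's API.
struct FragmentStats {
    physical_rows: u64,
    deleted_rows: u64,
}

impl FragmentStats {
    /// Logical rows = physical rows minus soft-deleted rows. This is the
    /// estimate the planner should accumulate against
    /// target_rows_per_fragment, matching DefaultCompactionPlanner.
    fn logical_rows(&self) -> u64 {
        self.physical_rows.saturating_sub(self.deleted_rows)
    }
}

fn main() {
    // A deletion-heavy fragment: counting physical rows would over-estimate
    // the chunk size by 400k rows and inflate the sort buffer accordingly.
    let frag = FragmentStats {
        physical_rows: 1_000_000,
        deleted_rows: 400_000,
    };
    assert_eq!(frag.logical_rows(), 600_000);
    println!("{}", frag.logical_rows());
}
```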

P1: fragment_ids filter is O(n*m)

all_fragments.into_iter()
    .filter(|f| ids.contains(&(f.id() as u64)))

ids is a Vec<u64>, so .contains() is O(m) per fragment. For large datasets with many fragment IDs, convert fragment_ids to a HashSet<u64> (or accept one) for O(1) lookups.
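A sketch of that change, with a stand-in `Fragment` type (the real lance type differs):

```rust
use std::collections::HashSet;

// Stand-in for the real lance fragment type.
struct Fragment {
    id: u64,
}

fn filter_fragments(all_fragments: Vec<Fragment>, fragment_ids: &[u64]) -> Vec<Fragment> {
    // Build the set once: O(m). Each membership check is then O(1) on
    // average, instead of the O(m) linear scan Vec::contains performs,
    // so the whole filter drops from O(n*m) to O(n + m).
    let ids: HashSet<u64> = fragment_ids.iter().copied().collect();
    all_fragments
        .into_iter()
        .filter(|f| ids.contains(&f.id))
        .collect()
}

fn main() {
    let frags: Vec<Fragment> = (0..10).map(|id| Fragment { id }).collect();
    let kept = filter_fragments(frags, &[2, 5, 7]);
    let kept_ids: Vec<u64> = kept.iter().map(|f| f.id).collect();
    assert_eq!(kept_ids, vec![2, 5, 7]);
    println!("{:?}", kept_ids);
}
```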

P1: Missing test — integration with rewrite_files

The unit tests cover ClusteringRewriteStrategy in isolation, but there's no test exercising the full rewrite_files path with clustering enabled — specifically the deferred row-ID capture logic (include_row_id_column → sort → make_rowid_capture_stream). That code path has subtle ordering requirements and would benefit from at least one integration test that verifies row-ID remapping is correct after a clustering compaction.

Minor nits (non-blocking)

  • The debug_assert!(!(capture_row_ids && include_row_id_column)) in prepare_reader is only checked in debug builds. Since this is enforcing an API contract between internal callers, a hard assert! (or returning an error) would be safer.
  • ClusteringCompactionPlanner doc says "the caller should set a very large target" for global ordering — consider documenting this more prominently or providing a convenience constructor.
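On the first nit, a small sketch of the error-returning alternative. `check_row_id_flags` is a hypothetical helper, not the actual `prepare_reader` signature; it only illustrates why enforcing the contract in all build profiles is safer, since `debug_assert!` compiles to a no-op in release builds:

```rust
// Hypothetical helper mirroring the flags named in the review; not the
// real prepare_reader. Unlike debug_assert!, this check runs in release
// builds too, so an internal caller violating the contract fails loudly
// instead of silently proceeding.
fn check_row_id_flags(capture_row_ids: bool, include_row_id_column: bool) -> Result<(), String> {
    if capture_row_ids && include_row_id_column {
        return Err(
            "capture_row_ids and include_row_id_column are mutually exclusive".to_string(),
        );
    }
    Ok(())
}

fn main() {
    // Valid combinations pass; the contradictory one is rejected.
    assert!(check_row_id_flags(true, false).is_ok());
    assert!(check_row_id_flags(false, true).is_ok());
    assert!(check_row_id_flags(true, true).is_err());
    println!("ok");
}
```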

🤖 Generated with Claude Code
