Skip to content

fix: avoid fts corruption when a fragment is deleted#6171

Open
westonpace wants to merge 1 commit intolance-format:mainfrom
westonpace:fix/avoid-fts-corruption-deleted-fragment
Open

fix: avoid fts corruption when a fragment is deleted#6171
westonpace wants to merge 1 commit intolance-format:mainfrom
westonpace:fix/avoid-fts-corruption-deleted-fragment

Conversation

@westonpace
Copy link
Member

@westonpace westonpace commented Mar 11, 2026

This is basically the same problem encountered in #5929

If an entire fragment is deleted then, after an FTS update, the rows in that deleted fragment will still be indexed but shouldn't be. This fix uses the old_data_filter to properly prune the old data during the update.


This PR was created with the help of Claude Opus 4.6. I take responsibility for all code.

@github-actions github-actions bot added the bug Something isn't working label Mar 11, 2026
@github-actions
Copy link
Contributor

PR Review

Clean and well-structured fix. The approach of threading old_data_filter through the inverted index merge pipeline is the right one, consistent with how other scalar indices handle this.

One minor observation

build_empty_batch hardcodes schema assumptions (index.rs:~1963-1987): The new build_empty_batchmanually constructs columns to match the posting list schema, duplicating the column layout knowledge that lives ininverted_list_schema()/build_batch(). If the schema ever changes (e.g., a new column is added), this function would silently produce a mismatched batch. Consider whether you could instead reuse build_batchwith an empty compressed array, or at least add a debug assertion that the constructed batch schema matchesinverted_list_schema(self.has_positions())`.

Looks good

  • The is_old tagging on PartitionSource is clean and correctly distinguishes old vs new partitions.
  • The old_doc_id_map approach in the merger correctly filters docs and remaps posting list entries in a single pass.
  • The single-partition shortcut bypass (len() <= 1 && old_data_filter.is_none()) is correct — can't just copy files when filtering is needed.
  • Filtering out empty postings before creating PostingIterator is a good defensive measure.
  • The regression test is thorough: covers merge_insert → compact → optimize → search, and validates both stale data removal and new data correctness.

@codecov
Copy link

codecov bot commented Mar 11, 2026

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant