Skip to content

perf: new layout for positions and new algo for phrase query#6203

Open
BubbleCal wants to merge 15 commits intomainfrom
yang/inverted-index-positions
Open

perf: new layout for positions and new algo for phrase query#6203
BubbleCal wants to merge 15 commits intomainfrom
yang/inverted-index-positions

Conversation

@BubbleCal
Copy link
Contributor

@BubbleCal BubbleCal commented Mar 16, 2026

This stores all positions of single posting list in a flat array, and apply compression to the positions:

  • for positions of single document, storing delta positions
  • divide positions of single posting list into blocks (128 by default) and apply bitpack to each block
  • store the remaining positions in varint

This also stores the remaining doc ids / frequencies in varint as well.

These are breaking changes.

on 1M random dataset:

  • build is 9.35x faster (34.269s -> 3.664s)
  • index is 8.87x smaller (955,002,412 bytes -> 107,603,319 bytes)
  • phrase query avg latency is 3.41x faster (21.631ms -> 6.334ms) and p95 latency is 3.17x faster (63.735ms -> 20.100ms)

the improvement can be more significant on larger dataset

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@github-actions
Copy link
Contributor

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@BubbleCal BubbleCal changed the title varint for doc/freq perf: new layout for positions and new algo for phrase query Mar 16, 2026
@github-actions
Copy link
Contributor

PR Review: varint for doc/freq

This PR introduces varint-delta encoding for posting list tails and a new shared position stream layout (Lucene-style packed deltas), replacing the fixed-size encoding. Good improvements for space efficiency with solid backward-compat handling via codec metadata detection.

Issues

P1: Compat test no longer covers the old FTS index format

The FtsIndexTest in test_scalar_indices.py now creates with with_position=True and uses MatchQuery/PhraseQuery instead of the old LIKE filter. This means the compat test framework no longer exercises the upgrade path for FTS indices created by older Lance versions (without positions / with the old Fixed32 tail codec). Consider keeping the existing test case as-is and adding a new compat test class (e.g., FtsIndexWithPositionsTest) for the new format. This is important since the old format will exist in the wild.

P1: Duplicated metadata construction in builder.rs

write_metadata and the partial-metadata writer (write_metadata for partition) have nearly identical blocks for constructing the codec/positions metadata HashMap. If either is updated without the other, they'll silently diverge. Consider extracting to a shared helper, e.g.:

fn build_index_metadata(&self, partitions: &[u64]) -> Result<HashMap<String, String>> { ... }

P1: Duplicate varint encoder functions in encoding.rs

write_varint_u32(builder: &mut impl Write, ...) and encode_varint_u32(dst: &mut Vec<u8>, ...) implement identical logic with different interfaces. Since Vec<u8> implements Write, one can be expressed in terms of the other (or just use one).

Minor

  • In decompress_posting_remainder (encoding.rs), the VarintDelta branch uses expect() / assert_eq!() for what could be corrupt-data errors. The Fixed32 path has always been panic-on-corrupt, so this is consistent, but eventually propagating Result from this function would be safer.

Overall the encoding/decoding logic looks correct, the backward-compat codec detection (parse_posting_tail_codec, parse_shared_position_codec with sensible defaults) is well-designed, and the test coverage for new codecs and legacy formats is thorough.

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@codecov
Copy link

codecov bot commented Mar 16, 2026

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
let compressed = list.value(self.idx);
let positions = decompress_positions(compressed.as_binary());
Box::new(positions.into_iter()) as Box<dyn Iterator<Item = u32>>
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if I understand this correctly. It looks like we have cached the entire POSITION_COL in LegacyPerDoc:

PositionsLayout::LegacyPerDoc => {
    let batch = self
        .reader
        .read_range(self.posting_list_range(token_id), Some(&[POSITION_COL]))
        .await
        .map_err(|e| match e {
            Error::Schema { .. } => Error::invalid_input("position is not found but required for phrase queries, try recreating the index with position".to_owned()),
            e => e,
        })?;
    CompressedPositionStorage::LegacyPerDoc(batch[POSITION_COL].as_list::<i32>().clone())

Do we still need to do this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it caches only the rows hit, not the entire column.

these line read the cached rows and decompress them.

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@BubbleCal BubbleCal requested a review from Xuanwo March 16, 2026 11:24
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Copy link
Collaborator

@Xuanwo Xuanwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm good with this PR once all CI passed

@BubbleCal BubbleCal added the donotmerge Do not merge label Mar 16, 2026
@BubbleCal
Copy link
Contributor Author

hold it because of compatibility issue

@LuQQiu
Copy link
Contributor

LuQQiu commented Mar 16, 2026

Looks like #[serde(rename = "memory_limit", skip_serializing, ...)]
pub(crate) memory_limit_mb: Option,

memory limit still not being respected and fallback to 2GiB

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants