perf: new layout for positions and new algo for phrase query#6203
perf: new layout for positions and new algo for phrase query#6203
Conversation
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
|
ACTION NEEDED The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error please inspect the "PR Title Check" action. |
PR Review: varint for doc/freqThis PR introduces varint-delta encoding for posting list tails and a new shared position stream layout (Lucene-style packed deltas), replacing the fixed-size encoding. Good improvements for space efficiency with solid backward-compat handling via codec metadata detection. IssuesP1: Compat test no longer covers the old FTS index format The P1: Duplicated metadata construction in
fn build_index_metadata(&self, partitions: &[u64]) -> Result<HashMap<String, String>> { ... }P1: Duplicate varint encoder functions in
Minor
Overall the encoding/decoding logic looks correct, the backward-compat codec detection ( |
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
| let compressed = list.value(self.idx); | ||
| let positions = decompress_positions(compressed.as_binary()); | ||
| Box::new(positions.into_iter()) as Box<dyn Iterator<Item = u32>> | ||
| } |
There was a problem hiding this comment.
I'm not sure if I understand this correctly. It looks like we have cached the entire POSITION_COL in LegacyPerDoc:
PositionsLayout::LegacyPerDoc => {
let batch = self
.reader
.read_range(self.posting_list_range(token_id), Some(&[POSITION_COL]))
.await
.map_err(|e| match e {
Error::Schema { .. } => Error::invalid_input("position is not found but required for phrase queries, try recreating the index with position".to_owned()),
e => e,
})?;
CompressedPositionStorage::LegacyPerDoc(batch[POSITION_COL].as_list::<i32>().clone())Do we still need to do this?
There was a problem hiding this comment.
it caches only the rows hit, not the entire column.
these line read the cached rows and decompress them.
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Xuanwo
left a comment
There was a problem hiding this comment.
I'm good with this PR once all CI passed
|
hold it because of compatibility issue |
|
Looks like #[serde(rename = "memory_limit", skip_serializing, ...)] memory limit still not being respected and fallback to 2GiB |
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
This stores all positions of single posting list in a flat array, and apply compression to the positions:
This also stores the remaining doc ids / frequencies in varint as well.
These are breaking changes.
on 1M random dataset:
the improvement can be more significant on larger dataset