Skip to content

perf: optimize BM25 scoring strategy#2043

Merged
dgageot merged 1 commit intodocker:mainfrom
dgageot:board/you-re-the-performance-expert-you-give-a-411a43fe
Mar 10, 2026
Merged

perf: optimize BM25 scoring strategy#2043
dgageot merged 1 commit intodocker:mainfrom
dgageot:board/you-re-the-performance-expert-you-give-a-411a43fe

Conversation

@dgageot
Copy link
Member

@dgageot dgageot commented Mar 10, 2026

Summary

Optimizes the BM25 keyword-based retrieval strategy in pkg/rag/strategy/bm25.go.

Changes

  • Move replacer and stopwords into struct fields — built once in the constructor instead of allocating a new strings.NewReplacer (16 pairs) and a new map[string]bool (20 entries) on every tokenize() call.
  • Pre-tokenize documents and pre-compute document frequencies — the old code tokenized each document twice per scoring call and scanned the entire corpus with strings.Contains per query term per document (O(T×N²×L)). Now Query() tokenizes all documents once and pre-computes DF in a single pass (O(N×L + T×N)).
  • Replace O(n²) selection sort with slices.SortFunc — idiomatic O(n log n).
  • Guard against division by zerodocLength / avgDocLength produced +Inf/NaN when avgDocLength was 0. Now defaults to a length ratio of 1.0.
  • Fix DF consistency — old code used substring matching (strings.Contains) for document frequency but exact token matching for term frequency, inflating DF and suppressing IDF scores. Now both use exact token matching, which is the correct BM25 formulation.

- Move replacer and stopwords into BM25Strategy struct fields,
  built once in the constructor instead of on every tokenize() call
- Pre-tokenize all documents once per query and pre-compute document
  frequencies, reducing complexity from O(T×N²×L) to O(N×L + T×N)
- Replace O(n²) selection sort with slices.SortFunc (O(n log n))
- Guard against division by zero when avgDocLength is 0
- Fix DF computation to use exact token matching (consistent with TF)

Assisted-By: docker-agent
@dgageot dgageot requested a review from a team as a code owner March 10, 2026 17:33
Copy link

@docker-agent docker-agent bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

Assessment: 🟢 APPROVE

This is a well-executed performance optimization with multiple improvements to the BM25 scoring implementation:

Memory optimization: Moving replacer and stopwords to struct fields eliminates repeated allocations on every tokenization call

Algorithmic improvement: Pre-tokenizing documents and pre-computing document frequencies reduces complexity from O(T×N²×L) to O(N×L + T×N)

Sorting efficiency: Replacing O(n²) selection sort with slices.SortFunc provides O(n log n) performance

Safety enhancement: Added guard against division by zero when avgDocLength is 0 (defaults to length ratio of 1.0)

Correctness fix: Using exact token matching for both TF and DF calculations instead of mixing exact matching and substring matching

The code includes appropriate guards for edge cases:

  • Query validation ensures queryTerms is non-empty before scoring
  • Division by avgDocLength is protected with a conditional check
  • Term frequency checks prevent unnecessary IDF calculations

No bugs found in the changed code. The optimizations are sound and improve both performance and correctness.

@dgageot dgageot merged commit f4d9d0f into docker:main Mar 10, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants