perf: optimize BM25 scoring strategy #2043
Conversation
- Move replacer and stopwords into BM25Strategy struct fields, built once in the constructor instead of on every tokenize() call
- Pre-tokenize all documents once per query and pre-compute document frequencies, reducing complexity from O(T×N²×L) to O(N×L + T×N)
- Replace O(n²) selection sort with slices.SortFunc (O(n log n))
- Guard against division by zero when avgDocLength is 0
- Fix DF computation to use exact token matching (consistent with TF)

Assisted-By: docker-agent
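As a rough illustration of the first change, here is a minimal sketch of moving the replacer and stopword set into struct fields built once in the constructor. The type name matches the description, but the field names, replacement pairs, and stopword entries are illustrative placeholders, not the repository's actual code:

```go
package main

import "strings"

// BM25Strategy sketches the struct described above: the replacer and the
// stopword set are built once in the constructor instead of being
// re-allocated on every tokenize() call.
type BM25Strategy struct {
	replacer  *strings.Replacer
	stopwords map[string]bool
}

func NewBM25Strategy() *BM25Strategy {
	return &BM25Strategy{
		// Allocated once here; previously rebuilt per tokenize() call.
		replacer: strings.NewReplacer(".", " ", ",", " ", "!", " ", "?", " "),
		stopwords: map[string]bool{
			"the": true, "a": true, "an": true, "and": true, "of": true,
		},
	}
}

// tokenize lowercases, strips punctuation via the shared replacer, and
// drops stopwords using the shared set.
func (s *BM25Strategy) tokenize(text string) []string {
	var tokens []string
	for _, tok := range strings.Fields(s.replacer.Replace(strings.ToLower(text))) {
		if !s.stopwords[tok] {
			tokens = append(tokens, tok)
		}
	}
	return tokens
}
```

Because the replacer and map are shared, repeated tokenize() calls no longer allocate them, which is what eliminates the per-call overhead noted in the review.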
Review Summary
Assessment: 🟢 APPROVE
This is a well-executed performance optimization with multiple improvements to the BM25 scoring implementation:
✅ Memory optimization: Moving replacer and stopwords to struct fields eliminates repeated allocations on every tokenization call
✅ Algorithmic improvement: Pre-tokenizing documents and pre-computing document frequencies reduces complexity from O(T×N²×L) to O(N×L + T×N)
✅ Sorting efficiency: Replacing O(n²) selection sort with slices.SortFunc provides O(n log n) performance
✅ Safety enhancement: Added guard against division by zero when avgDocLength is 0 (defaults to length ratio of 1.0)
✅ Correctness fix: Using exact token matching for both TF and DF calculations instead of mixing exact matching and substring matching
The code includes appropriate guards for edge cases:
- Query validation ensures queryTerms is non-empty before scoring
- Division by avgDocLength is protected with a conditional check
- Term frequency checks prevent unnecessary IDF calculations
No bugs found in the changed code. The optimizations are sound and improve both performance and correctness.
Summary
Optimizes the BM25 keyword-based retrieval strategy in pkg/rag/strategy/bm25.go.

Changes
- Allocations: tokenize() previously built a strings.NewReplacer (16 pairs) and a new map[string]bool (20 entries) on every call; these are now struct fields constructed once.
- Complexity: scoring previously ran strings.Contains per query term per document (O(T×N²×L)). Now Query() tokenizes all documents once and pre-computes DF in a single pass (O(N×L + T×N)).
- Sorting: the O(n²) selection sort is replaced with slices.SortFunc, an idiomatic O(n log n) sort.
- Division by zero: docLength / avgDocLength produced +Inf/NaN when avgDocLength was 0. Now defaults to a length ratio of 1.0.
- Correctness: document frequency previously used substring matching (strings.Contains) while term frequency used exact token matching, inflating DF and suppressing IDF scores. Now both use exact token matching, which is the correct BM25 formulation.
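A minimal sketch of the single-pass pre-computation and the slices.SortFunc replacement might look like this; the function and type names are assumptions for illustration, not the repository's actual identifiers:

```go
package main

import (
	"cmp"
	"slices"
)

type scored struct {
	doc   string
	score float64
}

// precompute tokenizes each document exactly once (O(N×L)) and builds
// exact-token document frequencies in the same pass, so per-term scoring
// later is a map lookup instead of a substring scan over every document.
func precompute(docs []string, tokenize func(string) []string) (docTokens [][]string, df map[string]int) {
	df = make(map[string]int)
	docTokens = make([][]string, len(docs))
	for i, d := range docs {
		toks := tokenize(d)
		docTokens[i] = toks
		seen := make(map[string]bool, len(toks))
		for _, t := range toks {
			if !seen[t] {
				seen[t] = true
				df[t]++ // count each document at most once per term
			}
		}
	}
	return docTokens, df
}

// sortByScore replaces the O(n²) selection sort with the standard
// library's O(n log n) slices.SortFunc, sorting descending by score.
func sortByScore(results []scored) {
	slices.SortFunc(results, func(a, b scored) int {
		return cmp.Compare(b.score, a.score)
	})
}
```

slices.SortFunc (Go 1.21+) takes a three-way comparator; comparing b against a yields the descending order a ranked retrieval list needs.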