Build:
cmake --preset release && cmake --build build/release --target <target>
Full-text and semantic search capabilities for ThemisDB.
Implements full-text and hybrid search for ThemisDB, providing inverted index management, BM25 relevance scoring, and hybrid vector+keyword search capabilities.
In scope: Inverted index construction and management, BM25/TF-IDF scoring, hybrid search (vector similarity + keyword), search result ranking and pagination.
Out of scope: Vector index construction (handled by index module), language-specific stemming/stopwords (handled by utils module).
- hybrid_search.cpp — RRF-based fusion of BM25 and vector search (main entry point)
- distributed_hybrid_search.cpp — distributed hybrid search across shards with mTLS
- query_expander.cpp — synonym expansion, spelling correction, query relaxation
- fuzzy_matcher.cpp — Levenshtein, Soundex, Metaphone, N-gram similarity
- llm_query_rewriter.cpp — LLM-based query rewriting for improved recall
- llm_reranker.cpp — LLM-based re-ranking of top-N results
- learning_to_rank.cpp — online pairwise gradient-descent ranking model
- faceted_search.cpp — value-count facets, range buckets, drill-down filtering
- search_analytics.cpp — thread-safe query log; avg/p95/p99 latency metrics
- autocomplete.cpp — prefix-index and popular-query suggestions
- multi_modal_search.cpp — TEXT/IMAGE/AUDIO/CUSTOM modalities with weighted RRF
- multi_field_search.cpp — per-field BM25 boosting (title > body > tags)
- neural_sparse_retrieval.cpp — SPLADE / BERT-based neural sparse retrieval
- cross_lingual_search.cpp — cross-lingual search via multilingual embeddings
- personalized_ranker.cpp — per-user interaction history with time-decayed scoring
- search_highlighter.cpp — matched-term highlighting and snippet generation
- negative_keyword_filter.cpp — NOT-operator negative keyword filtering
- conversational_search.cpp — multi-turn context-aware query reformulation
- federated_search.cpp — tenant-isolated parallel search with per-tenant weighting
- search_result_stream.cpp — cursor-based streaming pagination for large result sets
Maturity: 🟢 Production-Ready — all components production-ready as of v2.2.0.
- Hybrid Search: Combines BM25 (lexical) and vector (semantic) search with Reciprocal Rank Fusion (RRF) (HybridSearch)
- Distributed Hybrid Search: Cross-shard hybrid search with mTLS-secured inter-node communication (DistributedHybridSearch)
- Full-Text Search: BM25 ranking algorithm for keyword matching
- Fuzzy Matching: Phonetic search and edit distance matching (FuzzyMatcher)
- Search Indexing: Tokenization, stemming, and inverted index management
- Query Parsing: Natural language query parsing and expansion (QueryExpander)
- Result Ranking: Configurable scoring and ranking algorithms
- LLM Query Rewriting: LLM-based alternative query generation for improved recall (LlmQueryRewriter)
- LLM Re-ranking: Configurable re-ranking with LLM feedback loop (LlmReranker)
- Faceted Search: Per-field value-count facets, numeric range buckets, drill-down filtering (FacetedSearch)
- Search Analytics: Thread-safe query logging; avg/p95/p99 latency, zero-result detection (SearchAnalytics)
- Autocomplete: Prefix-index and popular-query suggestions (AutocompleteEngine)
- Learning to Rank: Linear re-ranker with online pairwise gradient-descent training (LearningToRank)
- Multi-Modal Search: TEXT/IMAGE/AUDIO/CUSTOM modalities with weighted RRF fusion (MultiModalSearch)
- Multi-Field Boosting: Per-field BM25 scoring with title > body > tags priority (MultiFieldBoostedSearch)
- Neural Sparse Retrieval: SPLADE / BERT-based sparse retrieval with vocabulary expansion (NeuralSparseRetrieval)
- Cross-Lingual Search: Multilingual embedding models with language detection (CrossLingualSearch)
- Personalized Ranking: Per-user interaction history with time-decayed scoring (PersonalizedRanker)
- Search Highlighter: Matched-term highlighting and snippet generation (SearchHighlighter)
- Negative Keyword Filter: NOT-operator support with posting-list candidate removal (NegativeKeywordFilter)
- Conversational Search: Multi-turn context-aware query reformulation (ConversationalSearch)
- Federated Search: Tenant-isolated parallel search with per-tenant weighting (FederatedSearch)
- Result Stream: Cursor-based streaming pagination for large result sets (SearchResultStream)
- BM25 Ranking: Industry-standard probabilistic ranking function
- Term frequency analysis: Boost relevance based on term occurrence
- Document frequency: Penalize common terms
- Field length normalization: Account for document length
- Configurable parameters: Tune k1 and b for optimal results
- Vector embeddings: Dense vector representations for semantic similarity
- HNSW indexing: Fast approximate nearest neighbor search
- Multiple distance metrics: Cosine, dot product, L2 distance
- Hybrid search: Combine lexical and semantic matching
- Fuzzy matching: Handle typos and misspellings
- Phonetic search: Soundex, Metaphone algorithms
- Stemming: Reduce words to root forms
- Stop word filtering: Remove common words
- Synonym expansion: Query expansion with synonyms
- Phrase search: Exact phrase matching
- Reciprocal Rank Fusion (RRF): Rank-based result merging, robust to incomparable score scales
- Configurable weights: Balance between BM25 and vector search
- Score normalization: Unified scoring across search types
- Multiple search modes: Text-only, vector-only, or hybrid
SearchModule
├─→ HybridSearch (RRF-based result fusion)
│ ├─→ SecondaryIndexManager (BM25 full-text search)
│ └─→ VectorIndexManager (Semantic vector search)
├─→ QueryParser (Natural language processing)
├─→ Tokenizer (Text analysis and stemming)
├─→ FuzzyMatcher (Typo tolerance)
└─→ ResultRanker (Score aggregation and ranking)
- Retrieve relevant context for LLM prompts
- Combine keyword and semantic search for best results
- High recall@10 (85%+) with hybrid search
- Support for reranking with LLM feedback
- Full-text search across document collections
- Semantic similarity for conceptual matching
- Faceted search with filters
- Highlight matching terms
- Find answers to natural language questions
- Semantic matching for paraphrased questions
- Context retrieval for answer generation
- Confidence scoring
- Product search with text and attributes
- Fuzzy matching for misspellings
- Faceted navigation (brand, price, category)
- Personalized ranking
- Query latency: 1-10ms for typical queries
- Index size: ~20-30% of document size
- Throughput: 1K-10K queries/second
- Scalability: Handles millions of documents
- Query latency: 1-10ms for k=10 on 1M vectors
- Index size: ~1.5x vector data size
- Throughput: 1K-5K queries/second
- Scalability: Handles 10M+ vectors
- Query latency: 5-20ms (combines both searches)
- Recall@10: 85%+ with optimal tuning
- Throughput: 500-2K queries/second
- Memory overhead: Minimal (result merging only)
#include "search/hybrid_search.h"
using namespace themis;
// Configure hybrid search
HybridSearch::Config config;
config.bm25_weight = 0.5; // Balance BM25 vs vector
config.vector_weight = 0.5;
config.k = 10; // Final result count
config.k_bm25 = 50; // BM25 candidates
config.k_vector = 50; // Vector candidates
config.use_rrf = true; // Use Reciprocal Rank Fusion
config.rrf_k = 60.0; // RRF constant
config.normalize_scores = true;
config.max_k = 10000; // Hard upper bound for k
config.max_candidates = 10000; // Hard upper bound for k_bm25 / k_vector
config.default_table = "documents";
config.default_column = "content";
config.vector_metric = VectorIndexManager::Metric::COSINE; // or DOT / L2
// Constructor throws std::invalid_argument for invalid config
HybridSearch hybrid_search(
fulltext_index_manager,
vector_index_manager,
config
);
// Perform hybrid search (never throws)
std::string text_query = "machine learning algorithms";
std::vector<float> query_vector = embeddings_model.encode(text_query);
HybridSearch::SearchStats stats;
auto results = hybrid_search.search(
text_query,
query_vector.data(),
query_vector.size(),
&stats // optional: receive per-source diagnostics
);
if (stats.partial_result) {
// One backend failed; results are degraded but not empty
}
// Process results
for (const auto& result : results) {
std::cout << "Doc: " << result.document_id
<< " Score: " << result.hybrid_score
<< " (BM25: " << result.bm25_score
<< ", Vector: " << result.vector_score << ")" << std::endl;
}

// Create full-text index
SecondaryIndexManager index_mgr(db);
index_mgr.createIndex(
"documents",
"content",
SecondaryIndexType::FULLTEXT
);
// Search documents
auto results = index_mgr.search(
"documents",
"content",
"machine learning",
10 // limit
);

// Create vector index
VectorIndexManager vector_mgr(db);
vector_mgr.createIndex(
"documents",
"embedding",
768, // dimension
VectorIndexManager::Metric::COSINE
);
// Search by vector
std::vector<float> query_vector = {...};
auto results = vector_mgr.search(
"documents",
"embedding",
query_vector,
10 // k
);

- Index Module: Secondary index for full-text, vector index for semantic search
- Storage Layer: Document retrieval and metadata
- Utils Module: Stemmer, tokenizer, stopwords
- LLM Module: Embedding generation for semantic search
- Query Module: Integration with AQL for search queries
A single HybridSearch instance is not thread-safe. search() and setConfig()
must not be called concurrently on the same instance. Callers that share an instance
across threads must provide their own synchronization (e.g. a mutex). The recommended
pattern is one HybridSearch instance per thread; the class is lightweight (Config +
two non-owning index pointers) and cheap to construct.
The underlying index managers (SecondaryIndexManager, VectorIndexManager) may have
their own thread-safety guarantees — consult their documentation.
- Secondary Index Manager: BM25 full-text search
- Vector Index Manager: HNSW vector search
- HNSWLIB: Approximate nearest neighbor library
- Utils: Stemmer, stopwords, text processing
For detailed implementation documentation, see:
- Architecture Guide - Component diagram, data flows, threading, security
- Roadmap - Implementation status, known limitations
- Future Enhancements - Planned features and IEEE references
- Secondary Docs (DE) - German overview and reality-check
- v1.2.0: Hybrid search with RRF for RAG optimization
- v1.3.0: Real BM25 and vector index integration
- v1.4.0: Configurable vector metric (COSINE/DOT/L2), strict config validation, resource limits (max_k, max_candidates), score normalization edge-case fixes, linear-combination pre-normalization, SearchStats for partial-result detection, thread-safety and exception-safety documentation, expanded test coverage
- v1.5.0: QueryExpander (synonym expansion, spelling correction, query relaxation), FuzzyMatcher (Levenshtein, Soundex, Metaphone, N-gram similarity), FacetedSearch (value-count facets, range buckets, drill-down filter intersection), SearchAnalytics (thread-safe query logging, p95/p99 latency, zero-result detection), AutocompleteEngine (prefix-index suggestions, popular-query suggestions), LearningToRank (linear re-ranker, click-through training, A/B variant selector), MultiModalSearch (text + embedding RRF fusion, searchTextAndImage convenience)
- v1.6.0: LlmQueryRewriter — LLM-based query rewriting for improved recall (injected backend, multi-strategy prompt, numbered-line parsing, deduplication, fallback)
- v1.7.0: Ranked spelling correction suggestions — SpellingCorrection struct, QueryExpander::suggestSpellingCorrections() (ranked multi-candidate word corrections with confidence scores), QueryExpander::suggestQueryCorrections() (ranked full-query correction suggestions with per-token substitution and all-corrected variants)
- v1.8.0: LlmReranker — configurable re-ranking with LLM feedback loop (batched prompting, per-document 0–10 score parsing, configurable score blending, toClickEvents() bridge to LearningToRank for closed-loop LTR training from LLM relevance judgments)
// Initialize
HybridSearch hybrid(fulltext_idx, vector_idx, config);
// Search
auto results = hybrid.search(
"neural networks",
embedding_vector,
embedding_dim
);

HybridSearch::Config config;
config.bm25_weight = 1.0;
config.vector_weight = 0.0;
HybridSearch bm25_only(fulltext_idx, vector_idx, config);

HybridSearch::Config config;
config.bm25_weight = 0.0;
config.vector_weight = 1.0;
HybridSearch vector_only(fulltext_idx, vector_idx, config);

// Post-filter results in application code after search()
auto results = hybrid.search("restaurants", query_vector, vector_dim);
results.erase(std::remove_if(results.begin(), results.end(),
[](const HybridSearch::Result& r) {
return r.hybrid_score < 0.5; // keep only high-confidence results
}), results.end());

- Start with balanced weights: 0.5/0.5 for BM25/vector
- Use RRF: Better than linear combination in most cases
- Adjust k parameter: Higher k_bm25 and k_vector improve recall
- Normalize scores: Enable for fair comparison
- Cache embeddings: Avoid recomputing query vectors
- Batch searches: Process multiple queries together
- Warm up indexes: Load into memory before serving
- Use appropriate k: Don't retrieve more results than needed
- Fine-tune embeddings: Domain-specific embeddings work better
- Optimize tokenization: Remove noise, handle special characters
- Use synonyms: Expand queries for better recall
- Rerank results: Use LLM or custom logic for final ranking
- Header Documentation - Public API definitions
- Index Module - Underlying index implementations
- LLM Module - Embedding generation
- Utils Module - Text processing utilities
- Architecture Guide - Search architecture
- Robertson, S., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval, 3(4), 333–389. https://doi.org/10.1561/1500000019
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems (NeurIPS), 30, 5998–6008. https://arxiv.org/abs/1706.03762
- Karpukhin, V., Oğuz, B., Min, S., Lewis, P., Wu, L., Edunov, S., … Yih, W.-t. (2020). Dense Passage Retrieval for Open-Domain Question Answering. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 6769–6781. https://doi.org/10.18653/v1/2020.emnlp-main.550
- Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press. https://doi.org/10.1017/CBO9780511809071
- Luan, Y., Eisenstein, J., Toutanova, K., & Collins, M. (2021). Sparse, Dense, and Attentional Representations for Text Retrieval. Transactions of the Association for Computational Linguistics, 9, 329–345. https://doi.org/10.1162/tacl_a_00369
This module is built as part of ThemisDB. See the root CMakeLists.txt for build configuration.
The implementation files in this module are compiled into the ThemisDB library.
See ../../include/search/README.md for the public API.