
Build: cmake --preset release && cmake --build build/release --target <target>

Search Module

Full-text and semantic search capabilities for ThemisDB.

Module Purpose

Implements full-text and hybrid search for ThemisDB, providing inverted index management, BM25 relevance scoring, and hybrid vector+keyword search capabilities.

Subsystem Scope

In scope: Inverted index construction and management, BM25/TF-IDF scoring, hybrid search (vector similarity + keyword), search result ranking and pagination.

Out of scope: Vector index construction (handled by index module), language-specific stemming/stopwords (handled by utils module).

Relevant Interfaces

  • hybrid_search.cpp — RRF-based fusion of BM25 and vector search (main entry point)
  • distributed_hybrid_search.cpp — distributed hybrid search across shards with mTLS
  • query_expander.cpp — synonym expansion, spelling correction, query relaxation
  • fuzzy_matcher.cpp — Levenshtein, Soundex, Metaphone, N-gram similarity
  • llm_query_rewriter.cpp — LLM-based query rewriting for improved recall
  • llm_reranker.cpp — LLM-based re-ranking of top-N results
  • learning_to_rank.cpp — online pairwise gradient-descent ranking model
  • faceted_search.cpp — value-count facets, range buckets, drill-down filtering
  • search_analytics.cpp — thread-safe query log; avg/p95/p99 latency metrics
  • autocomplete.cpp — prefix-index and popular-query suggestions
  • multi_modal_search.cpp — TEXT/IMAGE/AUDIO/CUSTOM modalities with weighted RRF
  • multi_field_search.cpp — per-field BM25 boosting (title > body > tags)
  • neural_sparse_retrieval.cpp — SPLADE / BERT-based neural sparse retrieval
  • cross_lingual_search.cpp — cross-lingual search via multilingual embeddings
  • personalized_ranker.cpp — per-user interaction history with time-decayed scoring
  • search_highlighter.cpp — matched-term highlighting and snippet generation
  • negative_keyword_filter.cpp — NOT-operator negative keyword filtering
  • conversational_search.cpp — multi-turn context-aware query reformulation
  • federated_search.cpp — tenant-isolated parallel search with per-tenant weighting
  • search_result_stream.cpp — cursor-based streaming pagination for large result sets

Current Delivery Status

Maturity: 🟢 Production-Ready — All components listed below are production-ready as of v2.2.0.

Components

  • Hybrid Search: Combines BM25 (lexical) and vector (semantic) search with Reciprocal Rank Fusion (RRF) (HybridSearch)
  • Distributed Hybrid Search: Cross-shard hybrid search with mTLS-secured inter-node communication (DistributedHybridSearch)
  • Full-Text Search: BM25 ranking algorithm for keyword matching
  • Fuzzy Matching: Phonetic search and edit distance matching (FuzzyMatcher)
  • Search Indexing: Tokenization, stemming, and inverted index management
  • Query Parsing: Natural language query parsing and expansion (QueryExpander)
  • Result Ranking: Configurable scoring and ranking algorithms
  • LLM Query Rewriting: LLM-based alternative query generation for improved recall (LlmQueryRewriter)
  • LLM Re-ranking: Configurable re-ranking with LLM feedback loop (LlmReranker)
  • Faceted Search: Per-field value-count facets, numeric range buckets, drill-down filtering (FacetedSearch)
  • Search Analytics: Thread-safe query logging; avg/p95/p99 latency, zero-result detection (SearchAnalytics)
  • Autocomplete: Prefix-index and popular-query suggestions (AutocompleteEngine)
  • Learning to Rank: Linear re-ranker with online pairwise gradient-descent training (LearningToRank)
  • Multi-Modal Search: TEXT/IMAGE/AUDIO/CUSTOM modalities with weighted RRF fusion (MultiModalSearch)
  • Multi-Field Boosting: Per-field BM25 scoring with title > body > tags priority (MultiFieldBoostedSearch)
  • Neural Sparse Retrieval: SPLADE / BERT-based sparse retrieval with vocabulary expansion (NeuralSparseRetrieval)
  • Cross-Lingual Search: Multilingual embedding models with language detection (CrossLingualSearch)
  • Personalized Ranking: Per-user interaction history with time-decayed scoring (PersonalizedRanker)
  • Search Highlighter: Matched-term <mark> highlighting and snippet generation (SearchHighlighter)
  • Negative Keyword Filter: NOT-operator support with posting-list candidate removal (NegativeKeywordFilter)
  • Conversational Search: Multi-turn context-aware query reformulation (ConversationalSearch)
  • Federated Search: Tenant-isolated parallel search with per-tenant weighting (FederatedSearch)
  • Result Stream: Cursor-based streaming pagination for large result sets (SearchResultStream)

Features

Full-Text Search

  • BM25 Ranking: Industry-standard probabilistic ranking function
  • Term frequency analysis: Boost relevance based on term occurrence
  • Document frequency: Penalize common terms
  • Field length normalization: Account for document length
  • Configurable parameters: Tune k1 and b for optimal results
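The role of k1 and b can be illustrated with a minimal, self-contained sketch of the per-term BM25 score (not ThemisDB's implementation), using the common +0.5-smoothed IDF form:

```cpp
#include <cmath>

// BM25 score contribution of one term in one document.
// tf: term frequency in the document; df: number of documents containing the term;
// N: total document count; dl: document length; avgdl: average document length.
// k1 controls term-frequency saturation, b controls length normalization.
double bm25_term_score(double tf, double df, double N,
                       double dl, double avgdl,
                       double k1 = 1.2, double b = 0.75) {
    // Smoothed IDF: rare terms (small df) score higher than common ones.
    double idf = std::log((N - df + 0.5) / (df + 0.5) + 1.0);
    // Saturating TF with document-length normalization.
    double norm = tf * (k1 + 1.0) / (tf + k1 * (1.0 - b + b * dl / avgdl));
    return idf * norm;
}
```

With b = 0 document length is ignored entirely; with b = 1 scores are fully length-normalized. Larger k1 lets repeated term occurrences keep adding to the score for longer before saturating.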

Semantic Search

  • Vector embeddings: Dense vector representations for semantic similarity
  • HNSW indexing: Fast approximate nearest neighbor search
  • Multiple distance metrics: Cosine, dot product, L2 distance
  • Hybrid search: Combine lexical and semantic matching
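Of the distance metrics above, cosine similarity is the most common for text embeddings. A minimal standalone sketch (not the VectorIndexManager implementation, which uses HNSW rather than brute force):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Cosine similarity between two dense embedding vectors of equal dimension.
// Returns a value in [-1, 1]; 1 means identical direction.
double cosine_similarity(const std::vector<float>& a, const std::vector<float>& b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    if (na == 0.0 || nb == 0.0) return 0.0;  // define similarity to a zero vector as 0
    return dot / (std::sqrt(na) * std::sqrt(nb));
}
```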

Search Enhancement

  • Fuzzy matching: Handle typos and misspellings
  • Phonetic search: Soundex, Metaphone algorithms
  • Stemming: Reduce words to root forms
  • Stop word filtering: Remove common words
  • Synonym expansion: Query expansion with synonyms
  • Phrase search: Exact phrase matching
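The edit-distance matching used for typo tolerance is classically implemented as the Levenshtein dynamic program. A self-contained sketch (FuzzyMatcher's actual implementation may differ):

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Levenshtein edit distance: minimum number of single-character insertions,
// deletions, and substitutions needed to turn a into b.
// Classic O(|a|*|b|) dynamic program, kept to two rolling rows of memory.
std::size_t levenshtein(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;  // distance from empty prefix
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            std::size_t sub = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({prev[j] + 1,      // deletion
                               cur[j - 1] + 1,   // insertion
                               sub});            // substitution (or match)
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}
```

A fuzzy matcher typically accepts a candidate term when its distance to the query term falls under a threshold scaled by term length (e.g. distance ≤ 1 for short words, ≤ 2 for longer ones).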

Hybrid Search (v1.2.0+)

  • Reciprocal Rank Fusion (RRF): Robust rank-based result merging
  • Configurable weights: Balance between BM25 and vector search
  • Score normalization: Unified scoring across search types
  • Multiple search modes: Text-only, vector-only, or hybrid
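The RRF step can be sketched as follows. This is a minimal standalone example, not the HybridSearch internals: each document earns 1/(k + rank) from every list it appears in, so documents ranked well by both BM25 and vector search rise to the top, and the rrf_k constant (60 by default) damps the dominance of the very first ranks:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Reciprocal Rank Fusion of two ranked lists of document IDs.
// Score(doc) = sum over lists of 1 / (k + rank), with rank starting at 1.
std::vector<std::pair<std::string, double>> rrf_fuse(
    const std::vector<std::string>& bm25_ranked,
    const std::vector<std::string>& vector_ranked,
    double k = 60.0) {
    std::unordered_map<std::string, double> scores;
    for (std::size_t r = 0; r < bm25_ranked.size(); ++r)
        scores[bm25_ranked[r]] += 1.0 / (k + r + 1);
    for (std::size_t r = 0; r < vector_ranked.size(); ++r)
        scores[vector_ranked[r]] += 1.0 / (k + r + 1);
    std::vector<std::pair<std::string, double>> fused(scores.begin(), scores.end());
    std::sort(fused.begin(), fused.end(),
              [](const auto& x, const auto& y) { return x.second > y.second; });
    return fused;
}
```

Because RRF uses only ranks, it sidesteps the problem that raw BM25 and cosine scores live on incompatible scales, which is why it usually beats naive linear score combination.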

Architecture

SearchModule
├─→ HybridSearch (RRF-based result fusion)
│   ├─→ SecondaryIndexManager (BM25 full-text search)
│   └─→ VectorIndexManager (Semantic vector search)
├─→ QueryParser (Natural language processing)
├─→ Tokenizer (Text analysis and stemming)
├─→ FuzzyMatcher (Typo tolerance)
└─→ ResultRanker (Score aggregation and ranking)

Use Cases

RAG (Retrieval-Augmented Generation)

  • Retrieve relevant context for LLM prompts
  • Combine keyword and semantic search for best results
  • High recall@10 (85%+) with hybrid search
  • Support for reranking with LLM feedback

Document Search

  • Full-text search across document collections
  • Semantic similarity for conceptual matching
  • Faceted search with filters
  • Highlight matching terms

Question Answering

  • Find answers to natural language questions
  • Semantic matching for paraphrased questions
  • Context retrieval for answer generation
  • Confidence scoring

E-commerce Search

  • Product search with text and attributes
  • Fuzzy matching for misspellings
  • Faceted navigation (brand, price, category)
  • Personalized ranking

Performance Characteristics

Full-Text Search (BM25)

  • Query latency: 1-10ms for typical queries
  • Index size: ~20-30% of document size
  • Throughput: 1K-10K queries/second
  • Scalability: Handles millions of documents

Vector Search (HNSW)

  • Query latency: 1-10ms for k=10 on 1M vectors
  • Index size: ~1.5x vector data size
  • Throughput: 1K-5K queries/second
  • Scalability: Handles 10M+ vectors

Hybrid Search (RRF)

  • Query latency: 5-20ms (combines both searches)
  • Recall@10: 85%+ with optimal tuning
  • Throughput: 500-2K queries/second
  • Memory overhead: Minimal (result merging only)

Configuration

Hybrid Search Setup

#include "search/hybrid_search.h"

using namespace themis;

// Configure hybrid search
HybridSearch::Config config;
config.bm25_weight = 0.5;          // Balance BM25 vs vector
config.vector_weight = 0.5;
config.k = 10;                     // Final result count
config.k_bm25 = 50;                // BM25 candidates
config.k_vector = 50;              // Vector candidates
config.use_rrf = true;             // Use Reciprocal Rank Fusion
config.rrf_k = 60.0;               // RRF constant
config.normalize_scores = true;
config.max_k = 10000;              // Hard upper bound for k
config.max_candidates = 10000;     // Hard upper bound for k_bm25 / k_vector
config.default_table  = "documents";
config.default_column = "content";
config.vector_metric = VectorIndexManager::Metric::COSINE; // or DOT / L2

// Constructor throws std::invalid_argument for invalid config
HybridSearch hybrid_search(
    fulltext_index_manager,
    vector_index_manager,
    config
);

// Perform hybrid search (never throws)
std::string text_query = "machine learning algorithms";
std::vector<float> query_vector = embeddings_model.encode(text_query);

HybridSearch::SearchStats stats;
auto results = hybrid_search.search(
    text_query,
    query_vector.data(),
    query_vector.size(),
    &stats  // optional: receive per-source diagnostics
);

if (stats.partial_result) {
    // One backend failed; results are degraded but not empty
}

// Process results
for (const auto& result : results) {
    std::cout << "Doc: " << result.document_id
              << " Score: " << result.hybrid_score
              << " (BM25: " << result.bm25_score
              << ", Vector: " << result.vector_score << ")" << std::endl;
}

Full-Text Search

// Create full-text index
SecondaryIndexManager index_mgr(db);
index_mgr.createIndex(
    "documents",
    "content",
    SecondaryIndexType::FULLTEXT
);

// Search documents
auto results = index_mgr.search(
    "documents",
    "content",
    "machine learning",
    10  // limit
);

Vector Search

// Create vector index
VectorIndexManager vector_mgr(db);
vector_mgr.createIndex(
    "documents",
    "embedding",
    768,  // dimension
    VectorIndexManager::Metric::COSINE
);

// Search by vector
std::vector<float> query_vector = {...};
auto results = vector_mgr.search(
    "documents",
    "embedding",
    query_vector,
    10  // k
);

Integration Points

  • Index Module: Secondary index for full-text, vector index for semantic search
  • Storage Layer: Document retrieval and metadata
  • Utils Module: Stemmer, tokenizer, stopwords
  • LLM Module: Embedding generation for semantic search
  • Query Module: Integration with AQL for search queries

Thread Safety

A single HybridSearch instance is not thread-safe. search() and setConfig() must not be called concurrently on the same instance. Callers that share an instance across threads must provide their own synchronization (e.g. a mutex). The recommended pattern is one HybridSearch instance per thread; the class is lightweight (Config + two non-owning index pointers) and cheap to construct.

The underlying index managers (SecondaryIndexManager, VectorIndexManager) may have their own thread-safety guarantees — consult their documentation.
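When per-thread instances are impractical, a thin mutex-guarded wrapper is the simplest way to share one instance safely. The sketch below is generic and self-contained; ToySearcher is a hypothetical stand-in so the example compiles on its own, and in real code the template parameter would be HybridSearch:

```cpp
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

// Serializes all calls to a non-thread-safe searcher behind one mutex.
// Simple and safe, at the cost of search throughput under contention.
template <typename Searcher>
class SynchronizedSearch {
public:
    explicit SynchronizedSearch(Searcher searcher) : searcher_(std::move(searcher)) {}

    template <typename... Args>
    auto search(Args&&... args) {
        std::lock_guard<std::mutex> lock(mutex_);  // one search at a time
        return searcher_.search(std::forward<Args>(args)...);
    }

private:
    std::mutex mutex_;
    Searcher searcher_;
};

// Hypothetical stand-in so the sketch is self-contained.
struct ToySearcher {
    std::vector<int> search(int n) { return std::vector<int>(n, 7); }
};
```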

Dependencies

  • Secondary Index Manager: BM25 full-text search
  • Vector Index Manager: HNSW vector search
  • HNSWLIB: Approximate nearest neighbor library
  • Utils: Stemmer, stopwords, text processing

Documentation

For detailed implementation documentation, see the public API headers documented in ../../include/search/README.md.

Version History

  • v1.2.0: Hybrid search with RRF for RAG optimization
  • v1.3.0: Real BM25 and vector index integration
  • v1.4.0: Configurable vector metric (COSINE/DOT/L2), strict config validation, resource limits (max_k, max_candidates), score normalization edge-case fixes, linear-combination pre-normalization, SearchStats for partial-result detection, thread-safety and exception-safety documentation, expanded test coverage
  • v1.5.0: QueryExpander (synonym expansion, spelling correction, query relaxation), FuzzyMatcher (Levenshtein, Soundex, Metaphone, N-gram similarity), FacetedSearch (value-count facets, range buckets, drill-down filter intersection), SearchAnalytics (thread-safe query logging, p95/p99 latency, zero-result detection), AutocompleteEngine (prefix-index suggestions, popular-query suggestions), LearningToRank (linear re-ranker, click-through training, A/B variant selector), MultiModalSearch (text + embedding RRF fusion, searchTextAndImage convenience)
  • v1.6.0: LlmQueryRewriter — LLM-based query rewriting for improved recall (injected backend, multi-strategy prompt, numbered-line parsing, deduplication, fallback)
  • v1.7.0: Ranked spelling correction suggestions — SpellingCorrection struct, QueryExpander::suggestSpellingCorrections() (ranked multi-candidate word corrections with confidence scores), QueryExpander::suggestQueryCorrections() (ranked full-query correction suggestions with per-token substitution and all-corrected variants)
  • v1.8.0: LlmReranker — Configurable re-ranking with LLM feedback loop (batched prompting, per-document 0–10 score parsing, configurable score blending, toClickEvents() bridge to LearningToRank for closed-loop LTR training from LLM relevance judgments)

Examples

Basic Hybrid Search

// Initialize
HybridSearch hybrid(fulltext_idx, vector_idx, config);

// Search
auto results = hybrid.search(
    "neural networks",
    embedding_vector,
    embedding_dim
);

BM25-Only Search

HybridSearch::Config config;
config.bm25_weight = 1.0;
config.vector_weight = 0.0;

HybridSearch bm25_only(fulltext_idx, vector_idx, config);

Vector-Only Search

HybridSearch::Config config;
config.bm25_weight = 0.0;
config.vector_weight = 1.0;

HybridSearch vector_only(fulltext_idx, vector_idx, config);

With Filters

// Post-filter results in application code after search()
auto results = hybrid.search("restaurants", query_vector, vector_dim);
results.erase(std::remove_if(results.begin(), results.end(),
    [](const HybridSearch::Result& r) {
        return r.hybrid_score < 0.5; // keep only high-confidence results
    }), results.end());

Best Practices

Tuning Hybrid Search

  1. Start with balanced weights: 0.5/0.5 for BM25/vector
  2. Use RRF: Better than linear combination in most cases
  3. Adjust k parameter: Higher k_bm25 and k_vector improve recall
  4. Normalize scores: Enable for fair comparison
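For the normalization step (point 4), min-max scaling is the usual approach when combining scores linearly rather than via RRF. A minimal sketch (ThemisDB's normalize_scores behavior may differ in details) that also handles the degenerate all-equal case mentioned in the v1.4.0 edge-case fixes:

```cpp
#include <algorithm>
#include <vector>

// Min-max normalization of raw scores to [0, 1] so BM25 and vector scores
// can be combined on a common scale. When all scores are equal (degenerate
// range), every score maps to 1.0 instead of dividing by zero.
std::vector<double> normalize_scores(const std::vector<double>& raw) {
    if (raw.empty()) return {};
    auto [lo_it, hi_it] = std::minmax_element(raw.begin(), raw.end());
    double lo = *lo_it, hi = *hi_it;
    std::vector<double> out;
    out.reserve(raw.size());
    for (double s : raw)
        out.push_back(hi == lo ? 1.0 : (s - lo) / (hi - lo));
    return out;
}
```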

Performance Optimization

  1. Cache embeddings: Avoid recomputing query vectors
  2. Batch searches: Process multiple queries together
  3. Warm up indexes: Load into memory before serving
  4. Use appropriate k: Don't retrieve more results than needed
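Point 1 (caching embeddings) can be sketched as a simple memoizing wrapper around the encoder. This is an illustrative standalone example, not a ThemisDB API; the injected encode callable stands in for whatever embedding model is in use, and a production cache would also bound its size (e.g. LRU eviction):

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Memoizes query embeddings so repeated queries skip the expensive encoder call.
class EmbeddingCache {
public:
    explicit EmbeddingCache(std::function<std::vector<float>(const std::string&)> encode)
        : encode_(std::move(encode)) {}

    const std::vector<float>& get(const std::string& query) {
        auto it = cache_.find(query);
        if (it == cache_.end())
            it = cache_.emplace(query, encode_(query)).first;  // miss: encode once
        return it->second;
    }

    std::size_t size() const { return cache_.size(); }

private:
    std::function<std::vector<float>(const std::string&)> encode_;
    std::unordered_map<std::string, std::vector<float>> cache_;
};
```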

Quality Improvement

  1. Fine-tune embeddings: Domain-specific embeddings work better
  2. Optimize tokenization: Remove noise, handle special characters
  3. Use synonyms: Expand queries for better recall
  4. Rerank results: Use LLM or custom logic for final ranking


Scientific References

  1. Robertson, S., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval, 3(4), 333–389. https://doi.org/10.1561/1500000019

  2. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems (NeurIPS), 30, 5998–6008. https://arxiv.org/abs/1706.03762

  3. Karpukhin, V., Oğuz, B., Min, S., Lewis, P., Wu, L., Edunov, S., … Yih, W.-t. (2020). Dense Passage Retrieval for Open-Domain Question Answering. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 6769–6781. https://doi.org/10.18653/v1/2020.emnlp-main.550

  4. Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press. https://doi.org/10.1017/CBO9780511809071

  5. Luan, Y., Eisenstein, J., Toutanova, K., & Collins, M. (2021). Sparse, Dense, and Attentional Representations for Text Retrieval. Transactions of the Association for Computational Linguistics, 9, 329–345. https://doi.org/10.1162/tacl_a_00369

Installation

This module is built as part of ThemisDB. See the root CMakeLists.txt for build configuration.

Usage

The implementation files in this module are compiled into the ThemisDB library. See ../../include/search/README.md for the public API.