Skip to content

Latest commit

 

History

History
349 lines (278 loc) · 10.6 KB

File metadata and controls

349 lines (278 loc) · 10.6 KB

PostgreSQL Full-Text Search Implementation for pgsqlite - COMPLETED ✅

Implementation Status: COMPLETED (2025-07-23)

This document originally outlined the implementation plan for adding PostgreSQL Full-Text Search (FTS) support to pgsqlite. The implementation has been successfully completed with full tsvector/tsquery support using SQLite's FTS5 extension.

Completed Implementation Overview

The FTS implementation provides complete PostgreSQL Full-Text Search compatibility using SQLite FTS5 as the backend engine, with comprehensive query translation and type support.

What Was Implemented ✅

Core Features:

  • Migration v9: FTS schema tables (__pgsqlite_fts_tables, __pgsqlite_fts_columns)
  • Type System: Full tsvector and tsquery types with PostgreSQL wire protocol support
  • CREATE TABLE: Automatic FTS5 virtual table creation for tsvector columns
  • Search Functions: to_tsvector(), to_tsquery(), plainto_tsquery()
  • Query Operations: @@ operator translation to SQLite FTS5 MATCH
  • Data Operations: INSERT, UPDATE, DELETE with automatic tsvector population

Advanced Features:

  • Complex Query Translation: tsquery to FTS5 syntax (AND, OR, NOT, phrases)
  • Table Alias Resolution: Proper handling of aliases (d.search_vector @@ query)
  • SQL Parser Compatibility: Custom pgsqlite_fts_match() function to avoid conflicts
  • DELETE/UPDATE with FTS: Full WHERE clause FTS search support
  • Comprehensive Testing: Integration tests covering all operations and edge cases

Files Implemented:

  • src/migration/registry.rs - Migration v9 registration
  • src/translator/fts_translator.rs - Core FTS translation logic
  • src/functions/fts_functions.rs - PostgreSQL FTS functions
  • src/types/type_mapper.rs - tsvector/tsquery type mappings
  • tests/test_fts_translator.rs - Comprehensive unit tests
  • tests/sql/features/test_fts_functions.sql - Integration tests

Architecture Design

1. Schema Extensions

Add new columns to __pgsqlite_schema table:

ALTER TABLE __pgsqlite_schema ADD COLUMN fts_table_name TEXT;
ALTER TABLE __pgsqlite_schema ADD COLUMN fts_config TEXT DEFAULT 'english';
ALTER TABLE __pgsqlite_schema ADD COLUMN fts_weights TEXT; -- JSON array of weight mappings

Create FTS metadata table:

CREATE TABLE __pgsqlite_fts_metadata (
    table_name TEXT,
    column_name TEXT,
    fts_table_name TEXT,
    config_name TEXT,
    tokenizer TEXT,
    stop_words TEXT, -- JSON array
    PRIMARY KEY (table_name, column_name)
);

2. Type System

PostgreSQL Types to SQLite Mapping

  • tsvector → TEXT column in main table (stores metadata) + FTS5 shadow table
  • tsquery → Translated to FTS5 MATCH syntax at query time
  • regconfig → Mapped to FTS5 tokenizer configurations

Wire Protocol

  • tsvector OID: 3614
  • tsquery OID: 3615
  • regconfig OID: 3734

3. Shadow FTS5 Table Structure

For each tsvector column, create a shadow FTS5 table:

CREATE VIRTUAL TABLE __pgsqlite_fts_{table}_{column} USING fts5(
    content,          -- The indexed text
    weights,          -- A,B,C,D weights as space-separated positions
    lexemes UNINDEXED,-- Original lexemes for exact reconstruction
    tokenize = 'porter unicode61'  -- Configurable based on regconfig
);

4. Query Translation Patterns

INSERT Translation

-- PostgreSQL:
INSERT INTO articles (id, title, content_tsv) 
VALUES (1, 'Title', to_tsvector('english', 'The quick brown fox'));

-- Translated to:
BEGIN;
INSERT INTO articles (id, title, content_tsv) 
VALUES (1, 'Title', '{"fts_ref": "__pgsqlite_fts_articles_content_tsv", "config": "english"}');

INSERT INTO __pgsqlite_fts_articles_content_tsv (rowid, content, weights, lexemes)
VALUES (1, 'The quick brown fox', '', '{"brown":{"pos":[3],"weight":"D"},"fox":{"pos":[4],"weight":"D"},"quick":{"pos":[2],"weight":"D"}}');
COMMIT;

SELECT Translation

-- PostgreSQL:
SELECT * FROM articles WHERE content_tsv @@ to_tsquery('english', 'quick & fox');

-- Translated to:
SELECT DISTINCT a.* FROM articles a
JOIN __pgsqlite_fts_articles_content_tsv f ON a.rowid = f.rowid
WHERE f.content MATCH 'quick AND fox';

UPDATE Translation

-- PostgreSQL:
UPDATE articles SET content_tsv = to_tsvector('english', 'New content')
WHERE id = 1;

-- Translated to:
BEGIN;
UPDATE articles SET content_tsv = '{"fts_ref": "__pgsqlite_fts_articles_content_tsv", "config": "english"}'
WHERE id = 1;

UPDATE __pgsqlite_fts_articles_content_tsv 
SET content = 'New content', 
    weights = '',
    lexemes = '{"content":{"pos":[2],"weight":"D"},"new":{"pos":[1],"weight":"D"}}'
WHERE rowid = (SELECT rowid FROM articles WHERE id = 1);
COMMIT;

5. Function Implementations

Core Functions

  1. to_tsvector(regconfig, text) → Creates tsvector
  2. to_tsquery(regconfig, text) → Creates tsquery
  3. plainto_tsquery(regconfig, text) → Creates tsquery from plain text
  4. phraseto_tsquery(regconfig, text) → Creates phrase query
  5. websearch_to_tsquery(regconfig, text) → Web search syntax
  6. ts_rank(tsvector, tsquery) → Calculate relevance
  7. ts_rank_cd(tsvector, tsquery) → Cover density ranking
  8. ts_headline(regconfig, text, tsquery) → Generate snippets

Operator Implementations

  • @@ (match) → FTS5 MATCH
  • @> (contains) → Custom implementation
  • <@ (contained by) → Custom implementation
  • || (concatenate) → Merge tsvectors

6. Implementation Phases

Phase 1: Core Infrastructure (Week 1)

  • Migration v9: Add FTS schema tables
  • Create FtsTranslator module
  • Implement type recognition for tsvector/tsquery
  • Basic CREATE TABLE translation with tsvector columns

Phase 2: Basic Operations (Week 2)

  • Implement to_tsvector() function
  • Implement to_tsquery() function
  • INSERT statement translation
  • SELECT with @@ operator translation

Phase 3: Advanced Features (Week 3)

  • UPDATE/DELETE translation
  • ts_rank() and ts_rank_cd() functions
  • ts_headline() function
  • Additional query functions (plainto_tsquery, etc.)

Phase 4: Configuration Support (Week 4)

  • Multiple language configurations
  • Custom stop words
  • Weight support (setweight function)
  • Position information preservation

Phase 5: Optimization (Week 5)

  • Query plan optimization for FTS queries
  • Caching of parsed tsvectors
  • Batch operation optimizations
  • Performance benchmarking

Benchmarking Strategy

1. Baseline Measurements

Create benchmark suite comparing:

  • Pure SQLite FTS5 performance
  • pgsqlite FTS implementation
  • Native PostgreSQL FTS (for reference)

2. Benchmark Scenarios

// File: tests/benchmark_fts.rs

#[bench]
fn bench_fts_insert_baseline_sqlite() {
    // Direct SQLite FTS5 inserts
    // Measure: inserts/second
}

#[bench]
fn bench_fts_insert_pgsqlite() {
    // PostgreSQL protocol FTS inserts through pgsqlite
    // Measure: inserts/second and overhead percentage
}

#[bench]
fn bench_fts_search_simple() {
    // Single term searches
    // Compare: SQLite MATCH vs pgsqlite @@ operator
}

#[bench]
fn bench_fts_search_complex() {
    // Complex boolean queries
    // Compare: query translation overhead
}

#[bench]
fn bench_fts_ranking() {
    // ts_rank() performance
    // Measure: overhead of rank calculation
}

#[bench]
fn bench_fts_mixed_workload() {
    // 70% searches, 20% inserts, 10% updates
    // Measure: real-world performance impact
}

3. Metrics to Track

  1. Overhead Percentage: (pgsqlite_time - sqlite_time) / sqlite_time * 100
  2. Operations per Second: For each operation type
  3. Memory Usage: Shadow table overhead
  4. Translation Time: Time spent in query translation
  5. Cache Hit Rate: For translated queries and tsvectors

4. Performance Goals

  • INSERT overhead: < 200% (similar to current INSERT performance)
  • Simple SELECT overhead: < 150%
  • Complex query overhead: < 300%
  • Ranking function overhead: < 100%
  • Memory overhead: < 2x the text size

5. Benchmark Script

#!/bin/bash
# File: tests/runner/run_fts_benchmarks.sh

# Setup test data
echo "Creating test dataset..."
sqlite3 bench_fts.db < tests/sql/fts/create_benchmark_data.sql

# Run SQLite baseline
echo "Running SQLite FTS5 baseline..."
time sqlite3 bench_fts.db < tests/sql/fts/benchmark_queries.sql > sqlite_baseline.txt

# Run pgsqlite 
echo "Running pgsqlite FTS..."
./target/release/pgsqlite --database bench_fts.db &
PGSQLITE_PID=$!
sleep 2

time psql -h localhost -p 5432 -U postgres -d postgres \
    -f tests/sql/fts/benchmark_queries_pg.sql > pgsqlite_results.txt

kill $PGSQLITE_PID

# Compare results
echo "Analyzing results..."
python3 tests/scripts/analyze_fts_benchmarks.py

Testing Strategy

1. Unit Tests

  • Type conversion tests
  • Query translation tests
  • Function implementation tests

2. Integration Tests

  • Full query execution tests
  • PostgreSQL client compatibility tests
  • Edge cases and error handling

3. Test Data

-- Create diverse test dataset
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    title TEXT,
    content TEXT,
    tags TEXT[],
    search_vector tsvector
);

-- Insert test data with various characteristics:
-- - Different languages
-- - Various document lengths
-- - Special characters and edge cases
-- - Performance testing corpus (100k+ documents)

Migration Plan

-- Migration v9: Full-Text Search Support
CREATE TABLE IF NOT EXISTS __pgsqlite_fts_metadata (
    table_name TEXT,
    column_name TEXT,
    fts_table_name TEXT,
    config_name TEXT,
    tokenizer TEXT,
    stop_words TEXT,
    PRIMARY KEY (table_name, column_name)
);

ALTER TABLE __pgsqlite_schema ADD COLUMN fts_table_name TEXT;
ALTER TABLE __pgsqlite_schema ADD COLUMN fts_config TEXT DEFAULT 'english';
ALTER TABLE __pgsqlite_schema ADD COLUMN fts_weights TEXT;

-- Register new types
INSERT INTO __pgsqlite_type_map (pg_type, sqlite_type, oid)
VALUES 
    ('tsvector', 'TEXT', 3614),
    ('tsquery', 'TEXT', 3615),
    ('regconfig', 'TEXT', 3734);

Success Criteria

  1. All PostgreSQL FTS operators and core functions work correctly
  2. Performance overhead is within acceptable limits (see goals above)
  3. psql and common PostgreSQL clients work without modification
  4. All tests pass in CI/CD pipeline
  5. Comprehensive documentation is available

Future Enhancements

  1. GIN Index Emulation: Create custom index structures for better performance
  2. Phrase Search: Enhanced phrase matching beyond FTS5 capabilities
  3. Custom Dictionaries: User-defined dictionaries and thesaurus support
  4. Multilingual Support: Better handling of non-English languages
  5. Streaming Updates: Efficient bulk loading of FTS data