A fast, lightweight natural language to SQL translator for PostgreSQL, MySQL, and SQLite. Query your database using plain English with schema-aware retrieval and minimal setup.
BoltSQL translates natural language questions into SQL queries using:
- Schema-aware retrieval: Automatically finds relevant tables and columns based on your question
- Fast embeddings: CPU-based semantic search with on-disk caching
- LLM-powered generation: Uses OpenAI's GPT models to generate SQL
- Multi-database support: Works with PostgreSQL, MySQL, and SQLite
- Read-only safety: Only executes SELECT queries to protect data
- Rust-based: Fast, memory-safe library and CLI
- Zero vector database: Simple brute-force cosine similarity for retrieval
- Embedding cache: Subsequent queries reuse cached schema embeddings
- Comprehensive logging: Debug queries with detailed timing and context logs
- JSON catalog: Define your schema once in a simple JSON format
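The "zero vector database" approach above amounts to comparing the question's embedding against every schema embedding and keeping the best matches. A minimal Rust sketch of that idea (illustrative only, not BoltSQL's actual code):

```rust
// Illustrative sketch of brute-force top-k retrieval by cosine
// similarity over a small set of schema embeddings.

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Return indices of the `k` embeddings most similar to `query`.
fn top_k(query: &[f32], embeddings: &[Vec<f32>], k: usize) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = embeddings
        .iter()
        .enumerate()
        .map(|(i, e)| (i, cosine(query, e)))
        .collect();
    // Sort by similarity, highest first.
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.into_iter().take(k).map(|(i, _)| i).collect()
}

fn main() {
    let query = vec![1.0, 0.0];
    let embeddings = vec![
        vec![0.0, 1.0],  // orthogonal to the query
        vec![1.0, 0.1],  // nearly parallel
        vec![-1.0, 0.0], // opposite direction
    ];
    println!("{:?}", top_k(&query, &embeddings, 2)); // [1, 0]
}
```

For catalogs with tens or hundreds of tables, a linear scan like this is typically faster than paying the overhead of a dedicated vector store.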
- Rust toolchain (rustup)
- An OpenAI API key
- A database connection (PostgreSQL, MySQL, or SQLite)
```shell
git clone <repository-url>
cd bolt-sql
cargo build --release
```

The binary will be available at `target/release/bolt-sql` (or `target/release/bolt-sql.exe` on Windows).
Create a .env file in the project root:
```
OPENAI_API_KEY=sk-your-key-here
DATABASE_URL=postgres://user:password@localhost:5432/mydb
```

For PostgreSQL with SSL (e.g., Supabase, AWS RDS):

```
DATABASE_URL=postgres://user:password@host:5432/db?sslmode=require
```

For SQLite:

```
DATABASE_URL=sqlite:///path/to/database.db
```

Define your database schema in a JSON file (`catalog.json`):
```json
{
  "databases": [
    {
      "name": "public",
      "tables": [
        {
          "name": "users",
          "description": "User accounts and profile information",
          "columns": [
            { "name": "id", "type": "integer", "description": "Primary key, unique user identifier" },
            { "name": "email", "type": "text", "description": "User email address" },
            { "name": "created_at", "type": "timestamp", "description": "Account creation timestamp" }
          ]
        },
        {
          "name": "orders",
          "description": "Customer orders and transactions",
          "columns": [
            { "name": "id", "type": "integer", "description": "Primary key" },
            { "name": "user_id", "type": "integer", "description": "Foreign key to users.id" },
            { "name": "total", "type": "decimal", "description": "Order total amount" },
            { "name": "status", "type": "text", "description": "Order status: pending, completed, cancelled" }
          ]
        }
      ]
    }
  ]
}
```

Tips for better results:
- Provide clear, descriptive table and column descriptions
- Include relationship hints (e.g., "Foreign key to users.id")
- Mention data types and constraints when relevant
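The descriptions matter because the catalog is flattened into short text snippets that get embedded for retrieval. A hypothetical sketch of that flattening step (struct and function names are illustrative, not BoltSQL's actual API):

```rust
// Hypothetical catalog flattening: each table and each column becomes a
// short text snippet, and those snippets are what get embedded for
// schema-aware retrieval. Names here are illustrative.

struct Column { name: String, ty: String, description: String }
struct Table { name: String, description: String, columns: Vec<Column> }

/// Flatten a table into one snippet for the table plus one per column.
fn flatten(table: &Table) -> Vec<String> {
    let mut out = vec![format!("table {}: {}", table.name, table.description)];
    for c in &table.columns {
        out.push(format!(
            "column {}.{} ({}): {}",
            table.name, c.name, c.ty, c.description
        ));
    }
    out
}

fn main() {
    let users = Table {
        name: "users".into(),
        description: "User accounts and profile information".into(),
        columns: vec![Column {
            name: "email".into(),
            ty: "text".into(),
            description: "User email address".into(),
        }],
    };
    for s in flatten(&users) {
        println!("{s}");
    }
}
```

This is why richer descriptions help: they are the text the question is matched against, so a column described only as "id" is much harder to retrieve than one described as "Primary key, unique user identifier".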
Ask questions in natural language:
```shell
# Basic query
target/release/bolt-sql ask "How many users signed up this month?" \
  --catalog catalog.json

# With custom database URL
target/release/bolt-sql ask "Show me the top 10 orders by total" \
  --catalog catalog.json \
  --db postgres://user:pass@localhost:5432/mydb

# Get JSON output
target/release/bolt-sql ask "List all pending orders" \
  --catalog catalog.json \
  --json

# Verbose mode (with timing and logs)
target/release/bolt-sql ask "What's the average order value?" \
  --catalog catalog.json \
  --verbose
```

The `ask` command queries your database with natural language.
Required:
- `question`: Your question in natural language
- `--catalog`: Path to your catalog JSON file

Optional:

- `--db`: Database connection URL (overrides `DATABASE_URL` env var)
- `--model`: OpenAI model to use (default: `gpt-4.1-nano`, fallback: `gpt-4o-mini`)
- `--top-tables`: Number of top tables to retrieve (default: 3)
- `--top-columns`: Number of top columns to retrieve (default: 10)
- `--max-rows`: Maximum rows to return (default: 100)
- `--json`: Output results as JSON instead of a table
- `--verbose` / `-v`: Enable detailed logging and timing information
- `--no-cache`: Disable embedding cache (useful for testing)
- `--pooler`: Use a connection pooler (for Supabase, replace port 5432 with 6543)
Preload embeddings and test database connection:
```shell
target/release/bolt-sql warm --verbose
```

When using `--verbose`, BoltSQL creates detailed logs in the `logger/` directory:

- `main.log`: General execution logs and timing information
- `retrieval.log`: Retrieved tables and columns for each query
- `prompt.log`: Full prompts sent to the LLM
- `context.txt`: Latest retrieved context (overwritten each run)
- `prompt.txt`: Latest prompt (overwritten each run)
- `llm-cache.jsonl`: Cached LLM responses (20-minute TTL)
- First run: Generates embeddings for your catalog (typically 1-5 seconds)
- Subsequent runs: Uses cached embeddings (typically <1 second for simple queries)
- Embedding cache: Stored as `boltsql-cache-<hash>-d384.bin` in the project root
- Connection pooling: Automatic connection pooling for PostgreSQL and MySQL
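Keying the cache file by a hash of the catalog means the cache is invalidated automatically whenever the schema changes. A rough sketch of the idea, using the standard library's `DefaultHasher` (BoltSQL's actual hash function and filename scheme may differ):

```rust
// Illustrative only: derive a cache filename from a hash of the catalog
// contents, so a changed schema produces a different cache file.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn cache_filename(catalog_json: &str, dims: usize) -> String {
    let mut h = DefaultHasher::new();
    catalog_json.hash(&mut h);
    format!("boltsql-cache-{:016x}-d{}.bin", h.finish(), dims)
}

fn main() {
    let a = cache_filename(r#"{"databases": []}"#, 384);
    let b = cache_filename(r#"{"databases": [{}]}"#, 384);
    println!("{a}");
    assert_ne!(a, b); // any schema change yields a different cache file
}
```

Note that `DefaultHasher` is not guaranteed stable across Rust releases; a real implementation would use a fixed hash function so caches survive toolchain upgrades.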
- Read-only: Only `SELECT` statements are executed; writes are rejected
- Transaction safety: Uses read-only transactions where supported
- No data modification: Your database is never modified
- API key security: Store your OpenAI API key in `.env` (never commit it)
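A minimal sketch of what a SELECT-only guard can look like, rejecting non-SELECT and multi-statement input (illustrative; BoltSQL's actual validation may be stricter, e.g. parsing the SQL rather than inspecting prefixes):

```rust
// Illustrative read-only guard: accept a statement only if it is a
// single SELECT (or WITH ... SELECT) with no second statement chained on.

fn is_read_only(sql: &str) -> bool {
    let s = sql.trim().trim_end_matches(';').to_ascii_lowercase();
    // Reject multi-statement input outright.
    if s.contains(';') {
        return false;
    }
    s.starts_with("select") || s.starts_with("with")
}

fn main() {
    assert!(is_read_only("SELECT * FROM users"));
    assert!(is_read_only("WITH t AS (SELECT 1) SELECT * FROM t"));
    assert!(!is_read_only("DELETE FROM users"));
    assert!(!is_read_only("SELECT 1; DROP TABLE users"));
    println!("all checks passed");
}
```

Layering a check like this on top of a read-only transaction gives defense in depth: even if a write slips past the string check, the database itself refuses it.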
Connection issues:
- Verify your `DATABASE_URL` is correct
- For managed databases (Supabase, AWS RDS), ensure SSL is enabled: `?sslmode=require`
- Test connectivity with your database client first
Slow queries:
- Check network latency to your database
- Use `--verbose` to identify bottlenecks
- Try `--model gpt-4o-mini` if `gpt-4.1-nano` is unavailable or slow
Poor SQL generation:
- Improve your catalog descriptions (more detail = better results)
- Increase `--top-tables` and `--top-columns` for complex schemas
- Check `logger/prompt.txt` to see what context was sent to the LLM
Empty results:
- Verify your catalog JSON is valid
- Ensure table and column names match your actual schema
- Check `logger/retrieval.log` to see what was retrieved
BoltSQL has been evaluated on the Spider dataset. Initial results (50 questions):
- Exact match: 4% (syntactic SQL match)
- Execution match: 16% (semantic correctness, produces same results)
- Token similarity: 0.57 average Jaccard score
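The token-similarity metric above can be sketched as a Jaccard score over the sets of SQL tokens, |intersection| / |union| (the exact tokenization used in the evaluation is an assumption here; this version splits on whitespace and lowercases):

```rust
use std::collections::HashSet;

// Illustrative token-level Jaccard similarity between two SQL strings:
// |intersection| / |union| over sets of whitespace-separated,
// lowercased tokens.

fn jaccard(a: &str, b: &str) -> f64 {
    let ta: HashSet<String> = a.split_whitespace().map(|t| t.to_lowercase()).collect();
    let tb: HashSet<String> = b.split_whitespace().map(|t| t.to_lowercase()).collect();
    let inter = ta.intersection(&tb).count() as f64;
    let union = ta.union(&tb).count() as f64;
    if union == 0.0 { 1.0 } else { inter / union }
}

fn main() {
    let gold = "SELECT name FROM users WHERE id = 1";
    let pred = "SELECT name FROM users WHERE id = 2";
    // The two queries share 7 of 9 distinct tokens, differing only in
    // the literal, so the score is 7/9.
    println!("{:.2}", jaccard(gold, pred)); // 0.78
}
```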
The system performs well on simple queries (SELECT, WHERE, ORDER BY, aggregates) and is actively being improved for complex operations (INTERSECT, EXCEPT, multi-join queries).
See README-spider.md for evaluation setup and scripts.
BoltSQL is organized into focused modules:
- `catalog`: Schema loading, flattening, and hashing
- `embedding`: CPU-based embeddings with on-disk caching
- `retrieval`: Cosine similarity and top-k selection
- `prompt`: Few-shot prompt construction with schema pruning
- `generator`: OpenAI API integration and SQL validation
- `executor`: Database connection pooling and query execution
- `logging`: Performance timing and file-based logging
Contributions are welcome! Areas for improvement:
- Cloud-based vector retrieval support (e.g., Qdrant, Pinecone)
- Complex query generation (INTERSECT, EXCEPT, correlated subqueries)
- Better schema retrieval for large databases
- Support for additional database dialects
- Query decomposition for multi-step questions
- Performance optimizations
GNU General Public License v3.0