Ashad001/bolt-sql

BoltSQL

A fast, lightweight natural language to SQL translator for PostgreSQL, MySQL, and SQLite. Query your database using plain English with schema-aware retrieval and minimal setup.

Overview

BoltSQL translates natural language questions into SQL queries using:

  • Schema-aware retrieval: Automatically finds relevant tables and columns based on your question
  • Fast embeddings: CPU-based semantic search with on-disk caching
  • LLM-powered generation: Uses OpenAI's GPT models to generate SQL
  • Multi-database support: Works with PostgreSQL, MySQL, and SQLite
  • Read-only safety: Only executes SELECT queries to protect data

Features

  • Rust-based: Fast, memory-safe library and CLI
  • Zero vector database: Simple brute-force cosine similarity for retrieval
  • Embedding cache: Subsequent queries reuse cached schema embeddings
  • Comprehensive logging: Debug queries with detailed timing and context logs
  • JSON catalog: Define your schema once in a simple JSON format
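The "zero vector database" retrieval named above can be sketched in a few lines of Rust. The function names (`cosine`, `top_k`) are illustrative, not BoltSQL's actual API; the point is that brute-force scoring over a small catalog needs no index at all:

```rust
// Brute-force retrieval sketch: score every catalog embedding against the
// query embedding with cosine similarity, then keep the top k.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Return the indices of the `k` embeddings most similar to `query`,
/// most similar first.
fn top_k(query: &[f32], embeddings: &[Vec<f32>], k: usize) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = embeddings
        .iter()
        .enumerate()
        .map(|(i, e)| (i, cosine(query, e)))
        .collect();
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.into_iter().take(k).map(|(i, _)| i).collect()
}

fn main() {
    let embeddings = vec![vec![1.0, 0.0], vec![0.0, 1.0], vec![0.7, 0.7]];
    let hits = top_k(&[1.0, 0.1], &embeddings, 2);
    println!("top items: {:?}", hits); // indices of the two closest embeddings
}
```

For catalogs with a few hundred tables and columns, this linear scan is faster than the round trip to a hosted vector store would be.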

Installation

Prerequisites

  • Rust toolchain (rustup)
  • OpenAI API key
  • A database connection (PostgreSQL, MySQL, or SQLite)

Build

git clone <repository-url>
cd bolt-sql
cargo build --release

The binary will be available at target/release/bolt-sql (or target/release/bolt-sql.exe on Windows).

Quick Start

1. Configure Environment

Create a .env file in the project root:

OPENAI_API_KEY=sk-your-key-here
DATABASE_URL=postgres://user:password@localhost:5432/mydb

For PostgreSQL with SSL (e.g., Supabase, AWS RDS):

DATABASE_URL=postgres://user:password@host:5432/db?sslmode=require

For SQLite:

DATABASE_URL=sqlite:///path/to/database.db

2. Create a Catalog

Define your database schema in a JSON file (catalog.json):

{
  "databases": [
    {
      "name": "public",
      "tables": [
        {
          "name": "users",
          "description": "User accounts and profile information",
          "columns": [
            { "name": "id", "type": "integer", "description": "Primary key, unique user identifier" },
            { "name": "email", "type": "text", "description": "User email address" },
            { "name": "created_at", "type": "timestamp", "description": "Account creation timestamp" }
          ]
        },
        {
          "name": "orders",
          "description": "Customer orders and transactions",
          "columns": [
            { "name": "id", "type": "integer", "description": "Primary key" },
            { "name": "user_id", "type": "integer", "description": "Foreign key to users.id" },
            { "name": "total", "type": "decimal", "description": "Order total amount" },
            { "name": "status", "type": "text", "description": "Order status: pending, completed, cancelled" }
          ]
        }
      ]
    }
  ]
}

Tips for better results:

  • Provide clear, descriptive table and column descriptions
  • Include relationship hints (e.g., "Foreign key to users.id")
  • Mention data types and constraints when relevant
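These descriptions matter because each table and column entry is ultimately turned into a text snippet that gets embedded for retrieval. A hypothetical flattening (the types here are illustrative, not BoltSQL's internal ones) might look like:

```rust
// Illustrative catalog types mirroring the JSON above; BoltSQL's internal
// representation may differ.
struct Column {
    name: String,
    ty: String,
    description: String,
}

struct Table {
    name: String,
    description: String,
    columns: Vec<Column>,
}

/// Flatten one table into embeddable text snippets: one for the table
/// itself, one per column. The richer the descriptions, the more signal
/// each snippet carries for semantic search.
fn flatten(table: &Table) -> Vec<String> {
    let mut out = vec![format!("table {}: {}", table.name, table.description)];
    for c in &table.columns {
        out.push(format!("{}.{} ({}): {}", table.name, c.name, c.ty, c.description));
    }
    out
}

fn main() {
    let users = Table {
        name: "users".into(),
        description: "User accounts and profile information".into(),
        columns: vec![Column {
            name: "email".into(),
            ty: "text".into(),
            description: "User email address".into(),
        }],
    };
    for snippet in flatten(&users) {
        println!("{snippet}");
    }
}
```

A column with an empty description flattens to little more than its name, which is why vague catalogs retrieve poorly.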

3. Query Your Database

Ask questions in natural language:

# Basic query
target/release/bolt-sql ask "How many users signed up this month?" \
  --catalog catalog.json

# With custom database URL
target/release/bolt-sql ask "Show me the top 10 orders by total" \
  --catalog catalog.json \
  --db postgres://user:pass@localhost:5432/mydb

# Get JSON output
target/release/bolt-sql ask "List all pending orders" \
  --catalog catalog.json \
  --json

# Verbose mode (with timing and logs)
target/release/bolt-sql ask "What's the average order value?" \
  --catalog catalog.json \
  --verbose

Command Reference

ask Command

Query your database with natural language.

Required:

  • question: Your question in natural language
  • --catalog: Path to your catalog JSON file

Optional:

  • --db: Database connection URL (overrides DATABASE_URL env var)
  • --model: OpenAI model to use (default: gpt-4.1-nano, fallback: gpt-4o-mini)
  • --top-tables: Number of top tables to retrieve (default: 3)
  • --top-columns: Number of top columns to retrieve (default: 10)
  • --max-rows: Maximum rows to return (default: 100)
  • --json: Output results as JSON instead of a table
  • --verbose / -v: Enable detailed logging and timing information
  • --no-cache: Disable embedding cache (useful for testing)
  • --pooler: Use connection pooler (for Supabase, replace port 5432 with 6543)

warm Command

Preload embeddings and test database connection:

target/release/bolt-sql warm --verbose

Logging

When using --verbose, BoltSQL creates detailed logs in the logger/ directory:

  • main.log: General execution logs and timing information
  • retrieval.log: Retrieved tables and columns for each query
  • prompt.log: Full prompts sent to the LLM
  • context.txt: Latest retrieved context (overwritten each run)
  • prompt.txt: Latest prompt (overwritten each run)
  • llm-cache.jsonl: Cached LLM responses (20-minute TTL)

Performance

  • First run: Generates embeddings for your catalog (typically 1-5 seconds)
  • Subsequent runs: Uses cached embeddings (typically <1 second for simple queries)
  • Embedding cache: Stored as boltsql-cache-<hash>-d384.bin in the project root
  • Connection pooling: Automatic connection pooling for PostgreSQL and MySQL
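A cache filename of the form `boltsql-cache-<hash>-d384.bin` could plausibly be derived as below. This is only a sketch: the actual hash BoltSQL computes is not specified here, and `DefaultHasher` is a stand-in (its output is not even guaranteed stable across Rust releases, so a real implementation would use a fixed algorithm):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical: derive a cache filename from the catalog text and the
/// embedding dimension, so a changed catalog invalidates the cache.
fn cache_filename(catalog_json: &str, dim: usize) -> String {
    let mut h = DefaultHasher::new();
    catalog_json.hash(&mut h);
    format!("boltsql-cache-{:016x}-d{}.bin", h.finish(), dim)
}

fn main() {
    // Any edit to the catalog changes the hash, forcing a re-embed on next run.
    println!("{}", cache_filename("{\"databases\":[]}", 384));
}
```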

Security

  • Read-only: Only SELECT statements are executed; writes are rejected
  • Transaction safety: Uses read-only transactions where supported
  • No data modification: Your database is never modified
  • API key security: Store your OpenAI API key in .env (never commit it)
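A read-only guard of this kind can be sketched as a simple prefix check; BoltSQL's actual validation may differ, and a production guard should parse the statement rather than inspect strings:

```rust
/// Illustrative read-only check: accept a single SELECT (or WITH ... SELECT)
/// statement and reject everything else. Note the limits of this sketch:
/// a PostgreSQL data-modifying CTE (WITH t AS (DELETE ...) SELECT ...) would
/// slip through, which is one reason to also run inside a read-only
/// transaction as the section above describes.
fn is_read_only(sql: &str) -> bool {
    let trimmed = sql.trim().trim_end_matches(';').to_ascii_lowercase();
    // Reject multi-statement input outright.
    if trimmed.contains(';') {
        return false;
    }
    trimmed.starts_with("select") || trimmed.starts_with("with")
}

fn main() {
    for sql in ["SELECT * FROM users;", "DROP TABLE users"] {
        println!("{sql} -> read-only: {}", is_read_only(sql));
    }
}
```

Layering the string check on top of a `SET TRANSACTION READ ONLY` gives defense in depth: even if the guard misclassifies, the database refuses the write.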

Troubleshooting

Connection issues:

  • Verify your DATABASE_URL is correct
  • For managed databases (Supabase, AWS RDS), ensure SSL is enabled: ?sslmode=require
  • Test connectivity with your database client first

Slow queries:

  • Check network latency to your database
  • Use --verbose to identify bottlenecks
  • Try --model gpt-4o-mini if gpt-4.1-nano is unavailable or slow

Poor SQL generation:

  • Improve your catalog descriptions (more detail = better results)
  • Increase --top-tables and --top-columns for complex schemas
  • Check logger/prompt.txt to see what context was sent to the LLM

Empty results:

  • Verify your catalog JSON is valid
  • Ensure table and column names match your actual schema
  • Check logger/retrieval.log to see what was retrieved

Evaluation

BoltSQL has been evaluated on the Spider dataset. Initial results (50 questions):

  • Exact match: 4% (syntactic SQL match)
  • Execution match: 16% (semantic correctness, produces same results)
  • Token similarity: 0.57 average Jaccard score
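The token-similarity metric above is a Jaccard score over SQL tokens. A minimal version, assuming whitespace tokenization (the evaluation's exact tokenizer is not specified here), looks like:

```rust
use std::collections::HashSet;

/// Jaccard similarity between the token sets of two SQL strings:
/// |intersection| / |union|, case-insensitive.
fn jaccard(a: &str, b: &str) -> f64 {
    let tokens = |s: &str| -> HashSet<String> {
        s.to_ascii_lowercase()
            .split_whitespace()
            .map(str::to_string)
            .collect()
    };
    let (ta, tb) = (tokens(a), tokens(b));
    if ta.is_empty() && tb.is_empty() {
        return 1.0;
    }
    let inter = ta.intersection(&tb).count() as f64;
    let union = ta.union(&tb).count() as f64;
    inter / union
}

fn main() {
    // Same shape, one differing table name: 3 shared tokens out of 5 total.
    let score = jaccard("select id from users", "select id from orders");
    println!("token similarity: {score}"); // 0.6
}
```

A 0.57 average therefore means generated queries typically share a bit more than half their tokens with the gold SQL even when they are not exact matches.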

The system performs well on simple queries (SELECT, WHERE, ORDER BY, aggregates) and is actively being improved for complex operations (INTERSECT, EXCEPT, multi-join queries).

See README-spider.md for evaluation setup and scripts.

Architecture

BoltSQL is organized into focused modules:

  • catalog: Schema loading, flattening, and hashing
  • embedding: CPU-based embeddings with on-disk caching
  • retrieval: Cosine similarity and top-k selection
  • prompt: Few-shot prompt construction with schema pruning
  • generator: OpenAI API integration and SQL validation
  • executor: Database connection pooling and query execution
  • logging: Performance timing and file-based logging

Contributing

Contributions are welcome! Areas for improvement:

  • Cloud-based vector retrieval support (e.g., Qdrant, Pinecone)
  • Complex query generation (INTERSECT, EXCEPT, correlated subqueries)
  • Better schema retrieval for large databases
  • Support for additional database dialects
  • Query decomposition for multi-step questions
  • Performance optimizations

License

GNU General Public License v3.0
