RAG pipeline over:

- chicago_Annual_Appropriation_Ordinance_2026.pdf
- chicago_Grant_Details_Ordinance_2026.pdf
The system does:

- PDF text extraction (`pdftotext -layout`)
- Improved chunking (smaller chunks + section-aware boundaries)
- TOC suppression (TOC-like chunks are penalized and optionally filtered)
- Hybrid retrieval (BM25 + optional embeddings)
- Optional cross-encoder reranking path
- Answers with page-level citations
- Source links that open exact PDF pages (new tab or embedded viewer panel)
- Clickable sample queries in the UI for quick testing
- One-click export of query results to Markdown, JSON, or CSV
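The TOC suppression above can be sketched with a simple heuristic. This is illustrative only: the function names, threshold, and dot-leader regex are assumptions, not the project's actual code.

```python
import re

# TOC pages in budget PDFs tend to be dominated by dot-leader lines that end
# in a page number, e.g. "Office of the Mayor .......... 112".
TOC_LINE = re.compile(r"\.{3,}\s*\d+\s*$")

def looks_like_toc(chunk: str, threshold: float = 0.4) -> bool:
    """Flag a chunk as TOC-like if enough of its lines match the pattern."""
    lines = [ln for ln in chunk.splitlines() if ln.strip()]
    if not lines:
        return False
    toc_lines = sum(1 for ln in lines if TOC_LINE.search(ln))
    return toc_lines / len(lines) >= threshold

def penalized_score(score: float, chunk: str, toc_penalty: float = 0.35) -> float:
    """Scale a retrieval score down when the chunk reads like a TOC page."""
    return score * toc_penalty if looks_like_toc(chunk) else score
```

Penalizing rather than hard-filtering keeps TOC pages retrievable when nothing better matches.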
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Optional cross-encoder reranker dependencies:

```bash
pip install -r requirements-reranker.txt
```

Optional provider setup examples:
OpenAI:

```bash
export LLM_PROVIDER=openai
export EMBEDDING_PROVIDER=openai
export OPENAI_API_KEY=your_key
export OPENAI_CHAT_MODEL=gpt-4.1-mini
export OPENAI_EMBED_MODEL=text-embedding-3-small
```

Ollama:

```bash
export LLM_PROVIDER=ollama
export EMBEDDING_PROVIDER=ollama
export OLLAMA_BASE_URL=http://localhost:11434
export OLLAMA_CHAT_MODEL=llama3.2:latest
export OLLAMA_EMBED_MODEL=qwen3-embedding:4b
```

AWS Bedrock:

```bash
export LLM_PROVIDER=bedrock
export EMBEDDING_PROVIDER=bedrock
export AWS_REGION=us-east-1
export BEDROCK_CHAT_MODEL=anthropic.claude-3-5-sonnet-20241022-v2:0
export BEDROCK_EMBED_MODEL=amazon.titan-embed-text-v2:0
```

Default chunking now uses `max_tokens=450` and `overlap_tokens=70`:
```bash
python3 build_index.py --pdf-dir . --index-dir data/index
```

Override if needed:

```bash
python3 build_index.py --max-tokens 400 --overlap-tokens 60
```

Query:

```bash
python3 query_rag.py "What is budgeted for the Office of the Mayor?"
```

Override the retrieval blend from the CLI:

```bash
python3 query_rag.py "What grants mention ARPA?" --bm25-weight 0.9 --vector-weight 0.1
```

JSON output:

```bash
python3 query_rag.py "What grants mention ARPA?" --json
```

A starter evaluation set is included at `eval/questions.sample.json`.
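The chunking defaults (`max_tokens=450`, `overlap_tokens=70`) amount to overlapping token windows. A minimal sketch, using whitespace tokenization as a stand-in for the project's real tokenizer and omitting the section-aware boundaries:

```python
def chunk_tokens(text: str, max_tokens: int = 450, overlap_tokens: int = 70):
    """Split text into windows of max_tokens, each restating the previous
    window's last overlap_tokens tokens so context straddles boundaries."""
    tokens = text.split()
    step = max_tokens - overlap_tokens  # advance 380 tokens per window
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + max_tokens]
        if window:
            chunks.append(" ".join(window))
        if start + max_tokens >= len(tokens):
            break  # last window already reached the end of the text
    return chunks
```

The overlap keeps line items that sit at a chunk boundary retrievable from both neighboring chunks.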
Run a single evaluation:

```bash
python3 eval_rag.py --questions-file eval/questions.sample.json --top-k 8 --show-queries
```

Run a tuning grid report (best BM25/vector blend):

```bash
python3 eval_rag.py --questions-file eval/questions.sample.json --top-k 8 --tune --bm25-grid "0.7,0.8,0.85,0.9,0.95"
```

JSON output is supported with `--json`.
```bash
uvicorn app:app --reload --port 8000
```

Then open http://localhost:8000.
After running a query in the UI, use the export buttons to download:

- Markdown (`.md`)
- JSON (`.json`)
- CSV (`.csv`)
You can also call export directly:

```bash
curl -L "http://localhost:8000/export?query=What%20grants%20mention%20ARPA%3F&fmt=markdown" -o export.md
```

`fmt` supports: `markdown`, `json`, `csv`.
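If you prefer Python over curl, the export URL can be built with the standard library. The endpoint and parameters are taken from the curl example above; fetching the URL requires the app to be running on localhost:8000.

```python
from urllib.parse import urlencode

def export_url(query: str, fmt: str = "markdown",
               base: str = "http://localhost:8000") -> str:
    """Build an /export URL for the given query and format."""
    assert fmt in {"markdown", "json", "csv"}
    return f"{base}/export?" + urlencode({"query": query, "fmt": fmt})
```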
```bash
docker compose up --build
```

Then open http://localhost:8000.
Notes:

- On first start, the container auto-builds `data/index/index.json`.
- The index is stored in a persistent Docker volume (`rag_index`).
- Force a rebuild when the embedding provider/model changes:

```bash
FORCE_REINDEX=1 docker compose up --build
```

Provider examples in Docker:
OpenAI:

```bash
LLM_PROVIDER=openai EMBEDDING_PROVIDER=openai OPENAI_API_KEY=your_key docker compose up --build
```

Ollama:

```bash
LLM_PROVIDER=ollama EMBEDDING_PROVIDER=ollama OLLAMA_CHAT_MODEL=llama3.2:latest OLLAMA_EMBED_MODEL=qwen3-embedding:4b docker compose up --build
```

Install Ollama:

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

Pull the two models used in this project:

```bash
ollama pull llama3.2:latest
ollama pull qwen3-embedding:4b
```

Run Docker with those two models:

```bash
LLM_PROVIDER=ollama EMBEDDING_PROVIDER=ollama OLLAMA_BASE_URL=http://host.docker.internal:11434 OLLAMA_CHAT_MODEL=llama3.2:latest OLLAMA_EMBED_MODEL=qwen3-embedding:4b docker compose up --build
```

Bedrock:

```bash
LLM_PROVIDER=bedrock EMBEDDING_PROVIDER=bedrock AWS_REGION=us-east-1 docker compose up --build
```

Set these as env vars (local or Docker):
- `RAG_BM25_WEIGHT` (default `0.85`)
- `RAG_VECTOR_WEIGHT` (default `0.15`)
- `RAG_TOC_PENALTY` (default `0.35`)
- `RAG_SUPPRESS_TOC` (default `true`)
- `RAG_CANDIDATE_MULTIPLIER` (default `8`)
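A sketch of how these variables might be read. The env var names and defaults come from this README; the parsing helpers themselves are illustrative, not the project's exact code.

```python
import os

def env_float(name: str, default: float) -> float:
    """Read a float env var, falling back to the documented default."""
    return float(os.getenv(name, str(default)))

def env_bool(name: str, default: bool) -> bool:
    """Read a boolean env var; accepts 1/true/yes (case-insensitive)."""
    return os.getenv(name, str(default)).strip().lower() in {"1", "true", "yes"}

BM25_WEIGHT = env_float("RAG_BM25_WEIGHT", 0.85)
VECTOR_WEIGHT = env_float("RAG_VECTOR_WEIGHT", 0.15)
TOC_PENALTY = env_float("RAG_TOC_PENALTY", 0.35)
SUPPRESS_TOC = env_bool("RAG_SUPPRESS_TOC", True)
CANDIDATE_MULTIPLIER = int(env_float("RAG_CANDIDATE_MULTIPLIER", 8))
```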
Reranking controls:

- `RAG_RERANKER=auto|cross-encoder|none` (default `auto`)
- `RAG_RERANK_CANDIDATES` (default `30`)
- `RAG_RERANKER_MODEL` (default `cross-encoder/ms-marco-MiniLM-L-6-v2`)

If cross-encoder dependencies are not installed, reranking falls back to heuristic ranking.
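The heuristic fallback can be as simple as query-term overlap; a minimal sketch (illustrative, not necessarily the project's exact heuristic). The cross-encoder path would instead score each (query, chunk) pair with `cross-encoder/ms-marco-MiniLM-L-6-v2`.

```python
def heuristic_rerank(query: str, candidates: list, top_k: int = 8) -> list:
    """Rank candidate chunks by how many query terms they contain."""
    q_terms = set(query.lower().split())

    def overlap(chunk: str) -> int:
        return len(q_terms & set(chunk.lower().split()))

    return sorted(candidates, key=overlap, reverse=True)[:top_k]
```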
The web app now includes an in-memory per-IP rate limiter for `POST /` by default.

Defaults:

- `RATE_LIMIT_ENABLED=true`
- `RATE_LIMIT_MAX_REQUESTS=20`
- `RATE_LIMIT_WINDOW_SECONDS=60`
- `RATE_LIMIT_METHOD=POST`
- `RATE_LIMIT_PATH=/`
- `RATE_LIMIT_TRUST_PROXY=true`
Example stricter public setting:

```bash
RATE_LIMIT_MAX_REQUESTS=10 RATE_LIMIT_WINDOW_SECONDS=60 docker compose up --build
```

Disable temporarily:

```bash
RATE_LIMIT_ENABLED=false docker compose up --build
```

Notes:

- This limiter uses process-local memory. If you scale to multiple app instances, use a shared limiter (e.g. Redis-based) so limits are enforced consistently.
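The process-local limiter described above can be sketched as a sliding window of request timestamps per IP. This is illustrative; the app's actual implementation may differ.

```python
import time
from collections import defaultdict, deque
from typing import Optional

class RateLimiter:
    """Allow at most max_requests per IP within any window_seconds span."""

    def __init__(self, max_requests: int = 20, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = defaultdict(deque)  # ip -> deque of request timestamps

    def allow(self, ip: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        q = self.hits[ip]
        while q and now - q[0] > self.window:
            q.popleft()  # drop hits that have aged out of the window
        if len(q) >= self.max_requests:
            return False  # over the limit; caller would return HTTP 429
        q.append(now)
        return True
```

Because state lives in a per-process dict, two app replicas would each grant the full quota independently, which is exactly why the note above recommends a shared store for multi-instance deployments.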
You can disable the site with an env flag and show a temporary disabled page that links to your repo.

- `SITE_ENABLED=true|false` (default `true`)
- `SITE_DISABLED_REPO_URL` (default `https://github.com`)

Example:

```bash
SITE_ENABLED=false SITE_DISABLED_REPO_URL=https://github.com/your-org/your-repo docker compose up --build
```

When disabled, the app returns a temporary disabled page (the health endpoint remains available).
If `sudo apt install docker-compose-plugin` fails with "Unable to locate package", install Docker from Docker's official apt repo:

```bash
sudo apt update
sudo apt install -y ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo $VERSION_CODENAME) stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin git
```

Verify:

```bash
docker --version
docker compose version
```