This is a Retrieval-Augmented Generation (RAG) system that combines document ingestion, vector-based retrieval, and LLM-powered response generation. The system is built using LangChain, LangGraph, and ChromaDB with support for both local and server-based deployments.
┌─────────────────┐
│ Data Sources │
│ (XML, TXT) │
└────────┬────────┘
│
▼
┌─────────────────┐
│ OCR Extraction │ (ocr_extract.py)
│ XML → TXT │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Ingestion │ (ingest.py)
│ - Load docs │
│ - Split chunks │
│ - Generate │
│ embeddings │
└────────┬────────┘
│
▼
┌─────────────────────────┐
│ Vector Store │
│ (ChromaDB) │
│ - Local (.chroma/) │
│ - Server (Docker) │
└────────┬────────────────┘
│
▼
┌──────────────────────────┐
│ Query Pipeline │ (graph.py)
│ ┌──────────┐ │
│ │ Retrieve │ │
│ └────┬─────┘ │
│ │ │
│ ┌────▼─────────┐ │
│ │ Synthesize │ │
│ │ (LLM) │ │
│ └──────────────┘ │
└──────────┬───────────────┘
│
▼
┌─────────────┐
│ Response │
└─────────────┘
Purpose: Extract OCR text from XML files and convert to plain text for ingestion.
Features:
- Heuristic-based OCR text detection
- Custom XPath support for structured XML
- Handles multiple XML formats (ALTO, hOCR, generic)
- Namespace-agnostic parsing
Key Functions:
- `extract_file()`: Processes a single XML file
- `extract_dir()`: Batch processes XML files matching a glob pattern
- `_collect_candidate_text()`: Heuristic text extraction logic
Data Flow:
XML Files → Parse → Extract OCR Text → Write .txt → data/ directory
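A minimal sketch of that flow, assuming a plain ElementTree walk that ignores namespaces (illustrative only; the real heuristics in `_collect_candidate_text()` are more involved):

```python
# Illustrative only: collect text from any XML dialect by walking every element,
# reading ALTO-style CONTENT attributes as well as plain element text.
from pathlib import Path
import xml.etree.ElementTree as ET

def extract_text(xml_path: Path) -> str:
    root = ET.parse(xml_path).getroot()
    pieces = []
    for elem in root.iter():
        text = elem.get("CONTENT") or (elem.text or "")
        if text.strip():
            pieces.append(text.strip())
    return "\n".join(pieces)

for xml_file in Path("data_sample").glob("*.xml"):
    out = Path("data") / (xml_file.stem + ".txt")
    out.write_text(extract_text(xml_file), encoding="utf-8")
```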
Purpose: Load, chunk, embed, and index documents into ChromaDB.
Process:
- Load: Uses `DirectoryLoader` with `TextLoader` to load `.txt` files
- Split: Uses `RecursiveCharacterTextSplitter` to chunk documents
  - Default: 800 chars with 120 char overlap
  - Preserves semantic boundaries (paragraphs → lines → words → chars)
- Embed: Generates embeddings using HuggingFace sentence-transformers
  - Default model: `sentence-transformers/all-MiniLM-L6-v2`
- Index: Stores vectors in ChromaDB (local or server mode)
Key Function:
ingest_corpus(): Main ingestion pipeline
Configuration:
- `source_dir`: Source directory for text files
- `chunk_size`: Maximum chunk size (default: 800)
- `chunk_overlap`: Overlap between chunks (default: 120)
- `embedding_model`: HuggingFace model name
- `chroma_url`: Optional Chroma server URL
- `collection_name`: ChromaDB collection name (default: "corpus")
Data Flow:
.txt Files → Load → Split into Chunks → Generate Embeddings → Store in ChromaDB
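A condensed sketch of what `ingest_corpus()` does with those defaults (import paths may differ slightly depending on the installed LangChain version):

```python
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load every .txt file under the source directory.
docs = DirectoryLoader("data", glob="**/*.txt", loader_cls=TextLoader).load()

# Split into overlapping chunks that respect paragraph/line boundaries.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800, chunk_overlap=120, add_start_index=True
)
chunks = splitter.split_documents(docs)

# Embed each chunk and persist the vectors locally.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
Chroma.from_documents(
    chunks, embeddings, collection_name="corpus", persist_directory=".chroma"
)
```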
Purpose: Persistent vector database for similarity search.
Deployment Modes:
- Local Mode: Persists to the `.chroma/` directory
- Server Mode: Connects to a ChromaDB server via HTTP (Docker)
Docker Configuration (docker-compose.yml):
chroma:
  image: chromadb/chroma:latest
  ports:
    - "8000:8000"
  volumes:
    - chroma-data:/chroma/.chroma

Features:
- DuckDB + Parquet backend for efficient storage
- Collection-based organization
- Similarity search with configurable top-k
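A sketch of how the two modes might be selected (an assumed helper, not the project's exact code):

```python
import chromadb
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

def open_vector_store(chroma_url: str | None = None) -> Chroma:
    if chroma_url:
        # Server mode: talk to the Dockerized ChromaDB over HTTP.
        host, port = chroma_url.removeprefix("http://").split(":")
        client = chromadb.HttpClient(host=host, port=int(port))
        return Chroma(client=client, collection_name="corpus", embedding_function=embeddings)
    # Local mode: read/write the .chroma/ directory directly.
    return Chroma(
        collection_name="corpus",
        embedding_function=embeddings,
        persist_directory=".chroma",
    )
```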
Purpose: Orchestrate retrieval and answer synthesis using LangGraph.
Architecture: 2-node state machine
┌─────────────┐ ┌──────────────┐
│ Retrieve │ ───▶ │ Synthesize │ ───▶ END
└─────────────┘ └──────────────┘
State Definition (GraphState):
{
"question": str,
"context_docs": List[str],
"answer": Optional[str]
}

Nodes:
- Retrieve Node (`retrieve_node`):
  - Takes the user question
  - Queries the ChromaDB retriever (similarity search)
  - Returns the top-k relevant document chunks
  - Updates state with `context_docs`
- Synthesize Node (`synthesize_node`):
  - Takes question + context documents
  - Constructs a prompt with the context
  - Calls the LLM (or fallback)
  - Returns the generated answer
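A trimmed-down sketch of the state machine (here `retriever` and `llm` are assumed to be created elsewhere, e.g. from the ChromaDB store and the selected provider):

```python
from typing import List, Optional, TypedDict
from langgraph.graph import END, StateGraph

class GraphState(TypedDict):
    question: str
    context_docs: List[str]
    answer: Optional[str]

def retrieve_node(state: GraphState) -> dict:
    # Similarity search against ChromaDB; returns the top-k chunks.
    docs = retriever.invoke(state["question"])
    return {"context_docs": [d.page_content for d in docs]}

def synthesize_node(state: GraphState) -> dict:
    # Stuff the retrieved chunks into the prompt and ask the LLM.
    prompt = (
        "Answer using only the context below.\n\n"
        + "\n\n".join(state["context_docs"])
        + f"\n\nQuestion: {state['question']}"
    )
    return {"answer": llm.invoke(prompt).content}

graph = StateGraph(GraphState)
graph.add_node("retrieve", retrieve_node)
graph.add_node("synthesize", synthesize_node)
graph.set_entry_point("retrieve")
graph.add_edge("retrieve", "synthesize")
graph.add_edge("synthesize", END)
app = graph.compile()
```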
LLM Providers (_llm_or_fallback):
Priority order:
1. Ollama (local LLM):
   - Default: `llama3.1:8b`
   - Configurable base URL
   - No API key required
   - Fully offline
2. GitHub Models (free cloud LLM):
   - Default: `meta-llama/Llama-3.1-8B-Instruct`
   - Requires a GitHub personal access token
   - Free tier with rate limits
   - OpenAI-compatible API
3. OpenAI (premium cloud LLM):
   - Fallback if other providers fail
   - Requires `OPENAI_API_KEY`
   - Default: `gpt-4o-mini`
   - Usage costs apply
4. Extractive Fallback:
   - Simple concatenation of top docs
   - Used when no LLM is available
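A hedged sketch of how that priority order could be implemented (the real `_llm_or_fallback()` may differ in detail; the GitHub Models endpoint URL below is an assumption):

```python
import os

def _llm_or_fallback(cfg):
    try:
        if cfg.provider == "ollama":
            from langchain_ollama import ChatOllama
            return ChatOllama(model=cfg.ollama_model, base_url=cfg.ollama_base_url,
                              temperature=cfg.temperature)
        if cfg.provider == "github_models" and cfg.github_token:
            from langchain_openai import ChatOpenAI
            # GitHub Models exposes an OpenAI-compatible API; endpoint assumed here.
            return ChatOpenAI(model=cfg.github_model, api_key=cfg.github_token,
                              base_url="https://models.inference.ai.azure.com",
                              temperature=cfg.temperature)
        if os.getenv("OPENAI_API_KEY"):
            from langchain_openai import ChatOpenAI
            return ChatOpenAI(model=cfg.model, temperature=cfg.temperature)
    except Exception:
        pass
    return None  # caller falls back to extractive concatenation of the top docs
```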
Configuration (RAGConfig):
persist_dir: str = ".chroma"
collection_name: str = "corpus"
embed_model: str = "sentence-transformers/all-MiniLM-L6-v2"
chroma_url: Optional[str] = None
provider: str = "ollama" # or "openai", "github_models"
model: str = "gpt-4o-mini"
ollama_model: str = "llama3.1:8b"
ollama_base_url: Optional[str] = None
github_model: str = "meta-llama/Llama-3.1-8B-Instruct"
github_token: Optional[str] = None
temperature: float = 0.0
k: int = 5

Purpose: Unified CLI for all system operations.
Commands:
- `extract_ocr`: Extract OCR text from XML files
  `python -m src.rag_system.cli extract_ocr --input data_sample --output data`
- `ingest`: Ingest documents into ChromaDB
  `python -m src.rag_system.cli ingest --source data --chunk_size 800`
- `query`: Query the RAG system
  `python -m src.rag_system.cli query "What is this about?" --provider ollama`
- `ollama_pull`: Download Ollama models
  `python -m src.rag_system.cli ollama_pull --model llama3.1:8b`
- `ui`: Launch the Gradio web interface
  `python -m src.rag_system.cli ui --port 7860`
Purpose: Interactive Gradio-based UI for querying the RAG system.
Features:
- Configuration panel for all RAG settings
- Real-time question answering
- Support for both local and server-mode ChromaDB
- Provider switching (Ollama/OpenAI/GitHub Models)
  - Ollama is a platform for running large language models (LLMs) locally on your machine. It provides a user-friendly way to download, manage, and interact with various open-source models, often in GGUF format, such as Llama 2, Code Llama, and others.
  - GitHub Models provides free cloud-based LLM access through a GitHub account, with no credit card required.
- Model configuration (embeddings, LLM)
- Top-k and temperature controls
Launch:
python -m src.rag_system.cli ui --host 127.0.0.1 --port 7860

UI Components:
- Settings accordion (collapsed by default)
- Question input textbox
- Answer markdown output
- Apply settings button
- Status messages
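A layout-only sketch of those components (event handlers omitted; the real `ui.py` wires them to the RAG pipeline and the settings panel covers every `RAGConfig` field):

```python
import gradio as gr

with gr.Blocks(title="RAG Prototype") as demo:
    with gr.Accordion("Settings", open=False):  # collapsed by default
        provider = gr.Dropdown(["ollama", "github_models", "openai"], label="Provider")
        top_k = gr.Slider(1, 20, value=5, step=1, label="Top-k")
        temperature = gr.Slider(0.0, 1.0, value=0.0, label="Temperature")
        apply_btn = gr.Button("Apply settings")
    question = gr.Textbox(label="Question")
    ask_btn = gr.Button("Ask")
    answer = gr.Markdown()   # rendered answer
    status = gr.Markdown()   # status messages

demo.launch(server_name="127.0.0.1", server_port=7860)
```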
| Component | Technology | Purpose |
|---|---|---|
| Document Loading | LangChain | Load and process text files |
| Text Splitting | RecursiveCharacterTextSplitter | Chunk documents with overlap |
| Embeddings | HuggingFace Transformers | Generate sentence embeddings |
| Vector Store | ChromaDB | Similarity search and storage |
| Orchestration | LangGraph | State machine for RAG pipeline |
| LLM (Local) | Ollama | Local language model inference |
| LLM (Cloud) | OpenAI | Cloud-based language models |
| LLM (Free Cloud) | GitHub Models | Free cloud-based LLMs via GitHub |
| UI | Gradio | Web-based user interface |
- Docker Compose: Container orchestration for ChromaDB and Ollama servers
- Python 3.10+: Runtime environment
- Virtual Environment: Dependency isolation
1. User Question
│
▼
2. Initialize RAG Config
│
▼
3. Load Retriever
├─ Connect to ChromaDB (local or server)
├─ Load embedding model
└─ Create retriever with top-k
│
▼
4. Build Graph
├─ Initialize LLM (Ollama/OpenAI/Fallback)
├─ Create prompt template
└─ Compile state graph
│
▼
5. Retrieve Node
├─ Embed query
├─ Similarity search in ChromaDB
└─ Return top-k chunks
│
▼
6. Synthesize Node
├─ Format context + question
├─ Call LLM with prompt
└─ Generate answer
│
▼
7. Return Answer to User
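In code, steps 5-7 reduce to a single invocation of the compiled graph (reusing the `app` object from the graph.py sketch above):

```python
state = app.invoke({
    "question": "What is this corpus about?",
    "context_docs": [],
    "answer": None,
})
print(state["answer"])
```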
1. Source Files (data/*.txt)
│
▼
2. DirectoryLoader
└─ Load all matching files
│
▼
3. RecursiveCharacterTextSplitter
├─ Split on: \n\n → \n → space → char
├─ Target: 800 chars
└─ Overlap: 120 chars
│
▼
4. HuggingFaceEmbeddings
└─ Generate 384-dim vectors
│
▼
5. ChromaDB
├─ Store vectors
├─ Store metadata
└─ Index for similarity search
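One quick way to confirm step 5 worked, assuming local mode with the default `.chroma/` directory:

```python
import chromadb

client = chromadb.PersistentClient(path=".chroma")
collection = client.get_collection("corpus")
print(f"{collection.count()} chunks indexed")  # each chunk is one 384-dim vector
```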
# Embeddings
EMBED_MODEL=sentence-transformers/all-MiniLM-L6-v2
# Chroma
CHROMA_URL=http://localhost:8000
# LLM Provider
LLM_PROVIDER=ollama # or openai, github_models
# Ollama
OLLAMA_MODEL=llama3.1:8b
OLLAMA_BASE_URL=http://localhost:11434
# OpenAI
OPENAI_API_KEY=sk-...
OPENAI_MODEL=gpt-4o-mini
# GitHub Models
GITHUB_TOKEN=github_pat_...
GITHUB_MODEL=meta-llama/Llama-3.1-8B-Instruct
# Logging
LOG_LEVEL=INFO

| Parameter | Default Value | Description |
|---|---|---|
| `chunk_size` | 800 | Maximum characters per chunk |
| `chunk_overlap` | 120 | Overlap between adjacent chunks |
| `k` | 5 | Number of documents to retrieve |
| `temperature` | 0.0 | LLM temperature (deterministic) |
| `collection_name` | "corpus" | ChromaDB collection name |
| `persist_dir` | ".chroma" | Local ChromaDB directory |
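A sketch of how these variables could be mapped onto `RAGConfig` (the project's exact precedence rules may differ):

```python
import os
from src.rag_system.graph import RAGConfig

cfg = RAGConfig(
    embed_model=os.getenv("EMBED_MODEL", "sentence-transformers/all-MiniLM-L6-v2"),
    chroma_url=os.getenv("CHROMA_URL"),
    provider=os.getenv("LLM_PROVIDER", "ollama"),
    model=os.getenv("OPENAI_MODEL", "gpt-4o-mini"),
    ollama_model=os.getenv("OLLAMA_MODEL", "llama3.1:8b"),
    ollama_base_url=os.getenv("OLLAMA_BASE_URL"),
    github_model=os.getenv("GITHUB_MODEL", "meta-llama/Llama-3.1-8B-Instruct"),
    github_token=os.getenv("GITHUB_TOKEN"),
)
```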
Components:
- Local ChromaDB (`.chroma/`)
- Ollama (installed on the host)
- HuggingFace embeddings (local)
Pros:
- No API keys required
- Fully offline
- Fast for single users
Cons:
- Limited by local compute
- No shared storage
Components:
- ChromaDB Server (Docker)
- Ollama Server (Docker)
- HuggingFace embeddings (local)
Pros:
- Shared vector store
- Scalable
- Better resource management
Cons:
- Requires Docker
- Network dependency
Setup:
# Start services
docker compose up -d chroma ollama
# Pull model
docker exec -it ollama ollama pull llama3.1:8b
# Configure clients
export CHROMA_URL=http://localhost:8000
export OLLAMA_BASE_URL=http://localhost:11434

Components:
- ChromaDB Server (Docker)
- GitHub Models API (free cloud LLM)
- HuggingFace embeddings (local)
Pros:
- Shared vector store
- No local GPU needed
- Free tier available
- No credit card required
Cons:
- Requires GitHub account
- Rate limits apply
Components:
- ChromaDB Server (Docker)
- OpenAI API (premium cloud LLM)
- HuggingFace embeddings (local)
Pros:
- Shared vector store
- High-quality LLM responses
- No local GPU needed
Cons:
- Requires API key
- Usage costs
- Ollama: Try local LLM first (if provider=ollama)
- GitHub Models: Try free cloud LLM (if provider=github_models and token available)
- OpenAI: Fallback if other providers unavailable (if API key set)
- Extractive: Simple doc concatenation if no LLM available
The system includes explicit NumPy checks with helpful error messages for macOS Apple Silicon users:
try:
    import numpy as _np
except Exception as e:
    raise RuntimeError("NumPy is required but not available...") from e

Custom error messages are also raised for LangChain/LangGraph version mismatches (cli.py:148-162, ui.py:85-91, ui.py:110-116).
- RecursiveCharacterTextSplitter preserves semantic boundaries
- Overlap improves retrieval recall across chunk boundaries
- `add_start_index=True` enables tracing chunks back to their source
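A small demonstration of `add_start_index` (standalone, with throwaway in-memory text):

```python
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800, chunk_overlap=120, add_start_index=True
)
chunks = splitter.split_documents([Document(page_content="A paragraph.\n\n" * 300)])
print(chunks[1].metadata)  # includes 'start_index': character offset in the source
```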
- Default embedding model: `all-MiniLM-L6-v2` (384 dimensions)
- Fast inference on CPU
- Good balance of speed and quality
- Alternative: `all-mpnet-base-v2` (768 dimensions, slower but higher quality)
- Top-k=5 balances context vs. noise
- ChromaDB uses approximate nearest neighbor search
- DuckDB backend optimized for analytical queries
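The retriever side in isolation (a minimal sketch; graph.py builds the equivalent from `RAGConfig`):

```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

store = Chroma(
    collection_name="corpus",
    embedding_function=HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    ),
    persist_directory=".chroma",
)
retriever = store.as_retriever(search_kwargs={"k": 5})  # top-k = 5
docs = retriever.invoke("example question")
```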
Edit ingest.py or graph.py:
from langchain_community.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)

Edit ingest.py:60:
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1200,
    chunk_overlap=150,
    separators=["\n\n## ", "\n\n", "\n", " ", ""],  # Markdown headings
)

Edit graph.py:_llm_or_fallback():
from langchain_anthropic import ChatAnthropic
if cfg.provider == "anthropic":
    return ChatAnthropic(api_key=..., model=cfg.model)

- Never commit `.env` files
- Use environment variables
- Add `.env` to `.gitignore`
- Persistent volume for ChromaDB (`chroma-data`)
- Persistent volume for Ollama models (`ollama-data`)
- Text loader handles encoding errors gracefully
- XML parser catches malformed files
- "NumPy is not available":
  - Install NumPy first: `pip install "numpy>=1.26,<2.1"`
  - Use Python 3.10+ on macOS Apple Silicon
- Empty retrieval results:
  - Verify ingestion completed: check `.chroma/` or the ChromaDB collection
  - Check embedding model consistency between ingest and query
- Ollama connection errors:
  - Verify Ollama is running: `ollama serve`
  - Check the base URL: `http://localhost:11434`
  - Pull the model first: `ollama pull llama3.1:8b`
- ChromaDB server connection:
  - Start the container: `docker compose up -d chroma`
  - Check the logs: `docker logs chroma-server`
  - Verify the port: `curl http://localhost:8000/api/v1/heartbeat`
rag_prototype/
├── src/
│ └── rag_system/
│ ├── __init__.py
│ ├── ocr_extract.py # XML → TXT conversion
│ ├── ingest.py # Document ingestion pipeline
│ ├── graph.py # RAG query pipeline (LangGraph)
│ ├── cli.py # Command-line interface
│ └── ui.py # Gradio web interface
├── data/ # Source text files (ingestion input)
├── .chroma/ # Local ChromaDB storage
├── docker-compose.yml # Docker services (Chroma, Ollama)
├── requirements.txt # Python dependencies
└── README.md # User documentation
- Advanced Retrieval:
  - Hybrid search (keyword + semantic)
  - Re-ranking with cross-encoders
  - Multi-query expansion
- Evaluation:
  - Add evaluation metrics (RAGAS, etc.)
  - Automated testing suite
  - Benchmark different embedding models
- Scalability:
  - Batch ingestion with progress tracking
  - Async query processing
  - Distributed ChromaDB deployment
- Features:
  - Document metadata filtering
  - Multi-turn conversations with memory
  - Citation/source tracking in responses
  - Direct PDF/DOCX support
- UI Enhancements:
  - Chat history
  - Document upload via the UI
  - Real-time ingestion status
  - Visualization of retrieved chunks
- LangChain: https://python.langchain.com/
- LangGraph: https://langchain-ai.github.io/langgraph/
- ChromaDB: https://docs.trychroma.com/
- Ollama: https://ollama.com/
- GitHub Models: https://github.com/marketplace/models
- Gradio: https://gradio.app/