System Architecture

Overview

This is a Retrieval-Augmented Generation (RAG) system that combines document ingestion, vector-based retrieval, and LLM-powered response generation. The system is built using LangChain, LangGraph, and ChromaDB with support for both local and server-based deployments.

High-Level Architecture

┌─────────────────┐
│  Data Sources   │
│  (XML, TXT)     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  OCR Extraction │ (ocr_extract.py)
│  XML → TXT      │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Ingestion    │ (ingest.py)
│  - Load docs    │
│  - Split chunks │
│  - Generate     │
│    embeddings   │
└────────┬────────┘
         │
         ▼
┌─────────────────────────┐
│   Vector Store          │
│   (ChromaDB)            │
│  - Local (.chroma/)     │
│  - Server (Docker)      │
└────────┬────────────────┘
         │
         ▼
┌──────────────────────────┐
│   Query Pipeline         │ (graph.py)
│   ┌──────────┐           │
│   │ Retrieve │           │
│   └────┬─────┘           │
│        │                 │
│   ┌────▼─────────┐       │
│   │ Synthesize   │       │
│   │ (LLM)        │       │
│   └──────────────┘       │
└──────────┬───────────────┘
           │
           ▼
    ┌─────────────┐
    │  Response   │
    └─────────────┘

Core Components

1. Data Preprocessing (ocr_extract.py)

Purpose: Extract OCR text from XML files and convert to plain text for ingestion.

Features:

  • Heuristic-based OCR text detection
  • Custom XPath support for structured XML
  • Handles multiple XML formats (ALTO, hOCR, generic)
  • Namespace-agnostic parsing

Key Functions:

  • extract_file(): Processes a single XML file
  • extract_dir(): Batch processes XML files matching a glob pattern
  • _collect_candidate_text(): Heuristic text extraction logic

Data Flow:

XML Files → Parse → Extract OCR Text → Write .txt → data/ directory
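
For illustration, a minimal sketch of this flow, assuming the standard-library xml.etree.ElementTree parser and a simple length-based heuristic (the real extract_file()/_collect_candidate_text() logic may differ):

import xml.etree.ElementTree as ET
from pathlib import Path

def extract_ocr_text(xml_path: str, min_len: int = 3) -> str:
    """Illustrative namespace-agnostic extraction; not the project's exact heuristic."""
    root = ET.parse(xml_path).getroot()
    parts = []
    for elem in root.iter():  # walks every element, regardless of namespace
        # ALTO keeps text in CONTENT attributes; hOCR/generic XML keeps it in element text
        text = (elem.get("CONTENT") or elem.text or "").strip()
        if len(text) >= min_len:
            parts.append(text)
    return "\n".join(parts)

def extract_ocr_dir(input_dir: str, output_dir: str, pattern: str = "*.xml") -> None:
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for xml_file in sorted(Path(input_dir).glob(pattern)):
        txt = extract_ocr_text(str(xml_file))
        (out / f"{xml_file.stem}.txt").write_text(txt, encoding="utf-8")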

2. Document Ingestion (ingest.py)

Purpose: Load, chunk, embed, and index documents into ChromaDB.

Process:

  1. Load: Uses DirectoryLoader with TextLoader to load .txt files
  2. Split: Uses RecursiveCharacterTextSplitter to chunk documents
    • Default: 800 chars with 120 char overlap
    • Preserves semantic boundaries (paragraphs → lines → words → chars)
  3. Embed: Generates embeddings using HuggingFace sentence-transformers
    • Default model: sentence-transformers/all-MiniLM-L6-v2
  4. Index: Stores vectors in ChromaDB (local or server mode)

Key Function:

  • ingest_corpus(): Main ingestion pipeline

Configuration:

  • source_dir: Source directory for text files
  • chunk_size: Maximum chunk size (default: 800)
  • chunk_overlap: Overlap between chunks (default: 120)
  • embedding_model: HuggingFace model name
  • chroma_url: Optional Chroma Server URL
  • collection_name: ChromaDB collection name (default: "corpus")

Data Flow:

.txt Files → Load → Split into Chunks → Generate Embeddings → Store in ChromaDB
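
A condensed sketch of this pipeline using the LangChain components named above (illustrative; the real ingest_corpus() may differ in details, and import paths vary across LangChain versions):

from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter

def ingest_corpus(source_dir="data", chunk_size=800, chunk_overlap=120,
                  embedding_model="sentence-transformers/all-MiniLM-L6-v2",
                  persist_dir=".chroma", collection_name="corpus"):
    # 1. Load all .txt files from the source directory
    docs = DirectoryLoader(source_dir, glob="**/*.txt", loader_cls=TextLoader).load()
    # 2. Split into overlapping chunks, keeping the start offset for traceability
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap, add_start_index=True
    )
    chunks = splitter.split_documents(docs)
    # 3 + 4. Embed the chunks and index them in a local ChromaDB collection
    embeddings = HuggingFaceEmbeddings(model_name=embedding_model)
    return Chroma.from_documents(
        chunks, embeddings,
        collection_name=collection_name, persist_directory=persist_dir,
    )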

3. Vector Store (ChromaDB)

Purpose: Persistent vector database for similarity search.

Deployment Modes:

  • Local Mode: Persists to .chroma/ directory
  • Server Mode: Connects to ChromaDB server via HTTP (Docker)

Docker Configuration (docker-compose.yml):

chroma:
  image: chromadb/chroma:latest
  ports:
    - "8000:8000"
  volumes:
    - chroma-data:/chroma/.chroma
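
The same compose file also defines an Ollama service (used by docker compose up -d chroma ollama below); roughly as follows, with the image tag and mount point assumed here:

ollama:
  image: ollama/ollama:latest
  ports:
    - "11434:11434"
  volumes:
    - ollama-data:/root/.ollama   # assumed model storage path inside the container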

Features:

  • Persistent on-disk storage of embeddings and metadata
  • Collection-based organization
  • Similarity search with configurable top-k

4. Query Pipeline (graph.py)

Purpose: Orchestrate retrieval and answer synthesis using LangGraph.

Architecture: 2-node state machine

┌─────────────┐       ┌──────────────┐
│  Retrieve   │ ───▶  │  Synthesize  │ ───▶ END
└─────────────┘       └──────────────┘

State Definition (GraphState):

{
    "question": str,
    "context_docs": List[str],
    "answer": Optional[str]
}

Nodes:

  1. Retrieve Node (retrieve_node):

    • Takes user question
    • Queries ChromaDB retriever (similarity search)
    • Returns top-k relevant document chunks
    • Updates state with context_docs
  2. Synthesize Node (synthesize_node):

    • Takes question + context documents
    • Constructs prompt with context
    • Calls LLM (or fallback)
    • Returns generated answer
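
A minimal sketch of wiring these two nodes into a LangGraph state machine (illustrative; node and helper names in graph.py may differ):

from typing import List, Optional, TypedDict
from langgraph.graph import StateGraph, END

class GraphState(TypedDict):
    question: str
    context_docs: List[str]
    answer: Optional[str]

def build_graph(retriever, llm_chain):
    def retrieve_node(state: GraphState) -> dict:
        docs = retriever.invoke(state["question"])          # similarity search in ChromaDB
        return {"context_docs": [d.page_content for d in docs]}

    def synthesize_node(state: GraphState) -> dict:
        context = "\n\n".join(state["context_docs"])
        answer = llm_chain.invoke({"question": state["question"], "context": context})
        return {"answer": answer}

    graph = StateGraph(GraphState)
    graph.add_node("retrieve", retrieve_node)
    graph.add_node("synthesize", synthesize_node)
    graph.set_entry_point("retrieve")
    graph.add_edge("retrieve", "synthesize")
    graph.add_edge("synthesize", END)
    return graph.compile()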

LLM Providers (_llm_or_fallback):

Priority order:

  1. Ollama (local LLM):

    • Default: llama3.1:8b
    • Configurable base URL
    • No API key required
    • Fully offline
  2. GitHub Models (free cloud LLM):

    • Default: meta-llama/Llama-3.1-8B-Instruct
    • Requires GitHub personal access token
    • Free tier with rate limits
    • OpenAI-compatible API
  3. OpenAI (premium cloud LLM):

    • Fallback if other providers fail
    • Requires OPENAI_API_KEY
    • Default: gpt-4o-mini
    • Usage costs apply
  4. Extractive Fallback:

    • Simple concatenation of top docs
    • Used when no LLM is available
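
A sketch of this fallback logic (illustrative; the GitHub Models branch needs the project's configured OpenAI-compatible endpoint, which is not shown here):

import os

def _llm_or_fallback(cfg):
    """Return a chat model according to the configured provider, or None."""
    if cfg.provider == "ollama":
        try:
            from langchain_ollama import ChatOllama
            return ChatOllama(model=cfg.ollama_model,
                              base_url=cfg.ollama_base_url or "http://localhost:11434",
                              temperature=cfg.temperature)
        except Exception:
            pass  # Ollama not installed or not reachable; fall through
    # A github_models branch would be analogous: ChatOpenAI pointed at GitHub's
    # OpenAI-compatible endpoint and authenticated with the GitHub token.
    if os.getenv("OPENAI_API_KEY"):
        from langchain_openai import ChatOpenAI
        return ChatOpenAI(model=cfg.model, temperature=cfg.temperature)
    return None  # caller falls back to extractive concatenation of the top documents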

Configuration (RAGConfig):

persist_dir: str = ".chroma"
collection_name: str = "corpus"
embed_model: str = "sentence-transformers/all-MiniLM-L6-v2"
chroma_url: Optional[str] = None
provider: str = "ollama"  # or "openai", "github_models"
model: str = "gpt-4o-mini"
ollama_model: str = "llama3.1:8b"
ollama_base_url: Optional[str] = None
github_model: str = "meta-llama/Llama-3.1-8B-Instruct"
github_token: Optional[str] = None
temperature: float = 0.0
k: int = 5

5. Command-Line Interface (cli.py)

Purpose: Unified CLI for all system operations.

Commands:

  1. extract_ocr: Extract OCR from XML files

    python -m src.rag_system.cli extract_ocr --input data_sample --output data
  2. ingest: Ingest documents into ChromaDB

    python -m src.rag_system.cli ingest --source data --chunk_size 800
  3. query: Query the RAG system

    python -m src.rag_system.cli query "What is this about?" --provider ollama
  4. ollama_pull: Download Ollama models

    python -m src.rag_system.cli ollama_pull --model llama3.1:8b
  5. ui: Launch Gradio web interface

    python -m src.rag_system.cli ui --port 7860

6. Web User Interface (ui.py)

Purpose: Interactive Gradio-based UI for querying the RAG system.

Features:

  • Configuration panel for all RAG settings
  • Real-time question answering
  • Support for both local and server-mode ChromaDB
  • Provider switching (Ollama/OpenAI/GitHub Models)
    • Ollama runs open-source LLMs (such as Llama and Code Llama, typically in GGUF format) locally, handling model download, management, and inference on your own machine
    • GitHub Models provides free, cloud-based LLM access through a GitHub account; no credit card is required
  • Model configuration (embeddings, LLM)
  • Top-k and temperature controls

Launch:

python -m src.rag_system.cli ui --host 127.0.0.1 --port 7860

UI Components:

  • Settings accordion (collapsed by default)
  • Question input textbox
  • Answer markdown output
  • Apply settings button
  • Status messages
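
A stripped-down sketch of this kind of Gradio layout (illustrative only; ui.py wires in the full RAG configuration):

import gradio as gr

def answer_question(question: str) -> str:
    # In ui.py this would build (or reuse) the RAG graph and invoke it with the question.
    return f"(answer for: {question})"

with gr.Blocks(title="RAG System") as demo:
    with gr.Accordion("Settings", open=False):   # collapsed by default
        gr.Dropdown(["ollama", "github_models", "openai"], value="ollama", label="Provider")
        gr.Slider(1, 20, value=5, step=1, label="Top-k")
    question = gr.Textbox(label="Question")
    answer = gr.Markdown()
    ask = gr.Button("Ask")
    ask.click(answer_question, inputs=question, outputs=answer)

if __name__ == "__main__":
    demo.launch(server_name="127.0.0.1", server_port=7860)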

Technology Stack

Core Libraries

Component          Technology                       Purpose
Document Loading   LangChain                        Load and process text files
Text Splitting     RecursiveCharacterTextSplitter   Chunk documents with overlap
Embeddings         HuggingFace Transformers         Generate sentence embeddings
Vector Store       ChromaDB                         Similarity search and storage
Orchestration      LangGraph                        State machine for RAG pipeline
LLM (Local)        Ollama                           Local language model inference
LLM (Cloud)        OpenAI                           Cloud-based language models
LLM (Free Cloud)   GitHub Models                    Free cloud-based LLMs via GitHub
UI                 Gradio                           Web-based user interface

Infrastructure

  • Docker Compose: Container orchestration for ChromaDB and Ollama servers
  • Python 3.10+: Runtime environment
  • Virtual Environment: Dependency isolation

Data Flow

End-to-End Query Flow

1. User Question
       │
       ▼
2. Initialize RAG Config
       │
       ▼
3. Load Retriever
   ├─ Connect to ChromaDB (local or server)
   ├─ Load embedding model
   └─ Create retriever with top-k
       │
       ▼
4. Build Graph
   ├─ Initialize LLM (Ollama/OpenAI/Fallback)
   ├─ Create prompt template
   └─ Compile state graph
       │
       ▼
5. Retrieve Node
   ├─ Embed query
   ├─ Similarity search in ChromaDB
   └─ Return top-k chunks
       │
       ▼
6. Synthesize Node
   ├─ Format context + question
   ├─ Call LLM with prompt
   └─ Generate answer
       │
       ▼
7. Return Answer to User
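
Putting this together, a minimal end-to-end invocation (illustrative, reusing the build_graph sketch from the Query Pipeline section; the real CLI/UI code adds configuration handling and error reporting):

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import ChatOllama

# Retriever over the locally persisted collection
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
store = Chroma(collection_name="corpus", persist_directory=".chroma", embedding_function=embeddings)
retriever = store.as_retriever(search_kwargs={"k": 5})

# Simple context-grounded prompt chain (the project's actual prompt may differ)
prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context.\n\nContext:\n{context}\n\nQuestion: {question}"
)
llm_chain = prompt | ChatOllama(model="llama3.1:8b", temperature=0.0) | StrOutputParser()

app = build_graph(retriever, llm_chain)  # from the earlier Query Pipeline sketch
state = app.invoke({"question": "What is this corpus about?", "context_docs": [], "answer": None})
print(state["answer"])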

Ingestion Flow

1. Source Files (data/*.txt)
       │
       ▼
2. DirectoryLoader
   └─ Load all matching files
       │
       ▼
3. RecursiveCharacterTextSplitter
   ├─ Split on: \n\n → \n → space → char
   ├─ Target: 800 chars
   └─ Overlap: 120 chars
       │
       ▼
4. HuggingFaceEmbeddings
   └─ Generate 384-dim vectors
       │
       ▼
5. ChromaDB
   ├─ Store vectors
   ├─ Store metadata
   └─ Index for similarity search

Configuration & Environment

Environment Variables

# Embeddings
EMBED_MODEL=sentence-transformers/all-MiniLM-L6-v2

# Chroma
CHROMA_URL=http://localhost:8000

# LLM Provider
LLM_PROVIDER=ollama  # or openai, github_models

# Ollama
OLLAMA_MODEL=llama3.1:8b
OLLAMA_BASE_URL=http://localhost:11434

# OpenAI
OPENAI_API_KEY=sk-...
OPENAI_MODEL=gpt-4o-mini

# GitHub Models
GITHUB_TOKEN=github_pat_...
GITHUB_MODEL=meta-llama/Llama-3.1-8B-Instruct

# Logging
LOG_LEVEL=INFO
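
These variables are read at startup. A sketch of how they might map onto the RAGConfig fields shown earlier (the exact mapping in the project may differ):

import os

def config_from_env() -> dict:
    # Defaults mirror the RAGConfig defaults shown earlier; None means the variable is unset
    return dict(
        embed_model=os.getenv("EMBED_MODEL", "sentence-transformers/all-MiniLM-L6-v2"),
        chroma_url=os.getenv("CHROMA_URL"),
        provider=os.getenv("LLM_PROVIDER", "ollama"),
        ollama_model=os.getenv("OLLAMA_MODEL", "llama3.1:8b"),
        ollama_base_url=os.getenv("OLLAMA_BASE_URL"),
        model=os.getenv("OPENAI_MODEL", "gpt-4o-mini"),
        github_model=os.getenv("GITHUB_MODEL", "meta-llama/Llama-3.1-8B-Instruct"),
        github_token=os.getenv("GITHUB_TOKEN"),
    )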

Default Configuration

Parameter         Default Value   Description
chunk_size        800             Maximum characters per chunk
chunk_overlap     120             Overlap between adjacent chunks
k                 5               Number of documents to retrieve
temperature       0.0             LLM temperature (deterministic)
collection_name   "corpus"        ChromaDB collection name
persist_dir       ".chroma"       Local ChromaDB directory

Deployment Modes

1. Fully Local Mode

Components:

  • Local ChromaDB (.chroma/)
  • Ollama (installed on host)
  • HuggingFace embeddings (local)

Pros:

  • No API keys required
  • Fully offline
  • Fast for single users

Cons:

  • Limited by local compute
  • No shared storage

2. Server Mode (Recommended)

Components:

  • ChromaDB Server (Docker)
  • Ollama Server (Docker)
  • HuggingFace embeddings (local)

Pros:

  • Shared vector store
  • Scalable
  • Better resource management

Cons:

  • Requires Docker
  • Network dependency

Setup:

# Start services
docker compose up -d chroma ollama

# Pull model
docker exec -it ollama ollama pull llama3.1:8b

# Configure clients
export CHROMA_URL=http://localhost:8000
export OLLAMA_BASE_URL=http://localhost:11434

3. Hybrid Mode (GitHub Models)

Components:

  • ChromaDB Server (Docker)
  • GitHub Models API (free cloud LLM)
  • HuggingFace embeddings (local)

Pros:

  • Shared vector store
  • No local GPU needed
  • Free tier available
  • No credit card required

Cons:

  • Requires GitHub account
  • Rate limits apply

4. Hybrid Mode (OpenAI)

Components:

  • ChromaDB Server (Docker)
  • OpenAI API (premium cloud LLM)
  • HuggingFace embeddings (local)

Pros:

  • Shared vector store
  • High-quality LLM responses
  • No local GPU needed

Cons:

  • Requires API key
  • Usage costs

Error Handling & Fallbacks

LLM Fallback Chain

  1. Ollama: Try local LLM first (if provider=ollama)
  2. GitHub Models: Try free cloud LLM (if provider=github_models and token available)
  3. OpenAI: Fallback if other providers unavailable (if API key set)
  4. Extractive: Simple doc concatenation if no LLM available

NumPy Compatibility

The system includes explicit NumPy checks with helpful error messages for macOS Apple Silicon users:

try:
    import numpy as _np
except Exception as e:
    raise RuntimeError("NumPy is required but not available...")

Configuration Deserialization

Custom error messages for LangChain/LangGraph version mismatches (cli.py:148-162, ui.py:85-91, ui.py:110-116).

Performance Considerations

Chunking Strategy

  • RecursiveCharacterTextSplitter preserves semantic boundaries
  • Overlap improves retrieval recall across chunk boundaries
  • add_start_index=True enables tracing back to source
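
For example, with add_start_index=True each chunk's metadata carries the character offset at which it began in the source document, so retrieved chunks can be traced back (a small illustrative check):

from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=120, add_start_index=True)
text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. " * 40  # long enough for several chunks
chunks = splitter.split_documents([Document(page_content=text, metadata={"source": "data/example.txt"})])
for chunk in chunks[:3]:
    print(chunk.metadata["source"], chunk.metadata["start_index"], repr(chunk.page_content[:40]))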

Embedding Model

  • Default: all-MiniLM-L6-v2 (384 dimensions)
  • Fast inference on CPU
  • Good balance of speed/quality
  • Alternative: all-mpnet-base-v2 (768d, slower but higher quality)

Retrieval

  • Top-k=5 balances context vs. noise
  • ChromaDB uses approximate nearest neighbor search
  • On-disk persistence (local .chroma/ or the Docker volume) avoids re-ingesting between runs

Extensibility

Adding Custom Embeddings

Edit ingest.py or graph.py:

from langchain_community.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)

Custom Text Splitters

Edit ingest.py:60:

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1200,
    chunk_overlap=150,
    separators=["\n\n## ", "\n\n", "\n", " ", ""],  # Markdown headings
)

Adding New LLM Providers

Edit graph.py:_llm_or_fallback():

from langchain_anthropic import ChatAnthropic
if cfg.provider == "anthropic":
    return ChatAnthropic(api_key=..., model=cfg.model)

Security & Best Practices

API Keys

  • Never commit .env files
  • Use environment variables
  • Add .env to .gitignore

Docker Volumes

  • Persistent volumes for ChromaDB (chroma-data)
  • Persistent volumes for Ollama models (ollama-data)

File Permissions

  • Text loader handles encoding errors gracefully
  • XML parser catches malformed files

Troubleshooting

Common Issues

  1. "NumPy is not available":

    • Install NumPy first: pip install "numpy>=1.26,<2.1"
    • Use Python 3.10+ on macOS Apple Silicon
  2. Empty retrieval results:

    • Verify ingestion completed: check .chroma/ or ChromaDB collection
    • Check embedding model consistency between ingest and query
  3. Ollama connection errors:

    • Verify Ollama is running: ollama serve
    • Check base URL: http://localhost:11434
    • Pull model first: ollama pull llama3.1:8b
  4. ChromaDB server connection:

    • Start container: docker compose up -d chroma
    • Check logs: docker logs chroma-server
    • Verify port: curl http://localhost:8000/api/v1/heartbeat

File Structure

rag_prototype/
├── src/
│   └── rag_system/
│       ├── __init__.py
│       ├── ocr_extract.py    # XML → TXT conversion
│       ├── ingest.py          # Document ingestion pipeline
│       ├── graph.py           # RAG query pipeline (LangGraph)
│       ├── cli.py             # Command-line interface
│       └── ui.py              # Gradio web interface
├── data/                      # Source text files (ingestion input)
├── .chroma/                   # Local ChromaDB storage
├── docker-compose.yml         # Docker services (Chroma, Ollama)
├── requirements.txt           # Python dependencies
└── README.md                  # User documentation

Future Enhancements

Potential Improvements

  1. Advanced Retrieval:

    • Hybrid search (keyword + semantic)
    • Re-ranking with cross-encoders
    • Multi-query expansion
  2. Evaluation:

    • Add evaluation metrics (RAGAS, etc.)
    • Automated testing suite
    • Benchmark different embedding models
  3. Scalability:

    • Batch ingestion with progress tracking
    • Async query processing
    • Distributed ChromaDB deployment
  4. Features:

    • Document metadata filtering
    • Multi-turn conversations with memory
    • Citation/source tracking in responses
    • PDF/DOCX direct support
  5. UI Enhancements:

    • Chat history
    • Document upload via UI
    • Real-time ingestion status
    • Visualization of retrieved chunks

References