System Architecture

Overview

This is a Retrieval-Augmented Generation (RAG) system that combines document ingestion, vector-based retrieval, and LLM-powered response generation. The system is built using LangChain, LangGraph, and ChromaDB with support for both local and server-based deployments.

High-Level Architecture

┌─────────────────┐
│  Data Sources   │
│  (XML, TXT)     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  OCR Extraction │ (ocr_extract.py)
│  XML → TXT      │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Ingestion    │ (ingest.py)
│  - Load docs    │
│  - Split chunks │
│  - Generate     │
│    embeddings   │
└────────┬────────┘
         │
         ▼
┌─────────────────────────┐
│   Vector Store          │
│   (ChromaDB)            │
│  - Local (.chroma/)     │
│  - Server (Docker)      │
└────────┬────────────────┘
         │
         ▼
┌──────────────────────────┐
│   Query Pipeline         │ (graph.py)
│   ┌──────────┐           │
│   │ Retrieve │           │
│   └────┬─────┘           │
│        │                 │
│   ┌────▼─────────┐       │
│   │ Synthesize   │       │
│   │ (LLM)        │       │
│   └──────────────┘       │
└──────────┬───────────────┘
           │
           ▼
    ┌─────────────┐
    │  Response   │
    └─────────────┘

Core Components

1. Data Preprocessing (ocr_extract.py)

Purpose: Extract OCR text from XML files and convert to plain text for ingestion.

Features:

  • Heuristic-based OCR text detection
  • Custom XPath support for structured XML
  • Handles multiple XML formats (ALTO, hOCR, generic)
  • Namespace-agnostic parsing

Key Functions:

  • extract_file(): Processes a single XML file
  • extract_dir(): Batch processes XML files matching a glob pattern
  • _collect_candidate_text(): Heuristic text extraction logic

Data Flow:

XML Files → Parse → Extract OCR Text → Write .txt → data/ directory
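
For illustration, a minimal sketch of this flow, assuming the standard-library xml.etree.ElementTree parser and a simple length-based heuristic (the real extract_file()/_collect_candidate_text() logic may differ):

import xml.etree.ElementTree as ET
from pathlib import Path

def extract_ocr_text(xml_path: str, min_len: int = 3) -> str:
    """Illustrative namespace-agnostic extraction; not the project's exact heuristic."""
    root = ET.parse(xml_path).getroot()
    parts = []
    for elem in root.iter():  # walks every element, regardless of namespace
        # ALTO keeps text in CONTENT attributes; hOCR/generic XML keeps it in element text
        text = (elem.get("CONTENT") or elem.text or "").strip()
        if len(text) >= min_len:
            parts.append(text)
    return "\n".join(parts)

def extract_ocr_dir(input_dir: str, output_dir: str, pattern: str = "*.xml") -> None:
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for xml_file in sorted(Path(input_dir).glob(pattern)):
        txt = extract_ocr_text(str(xml_file))
        (out / f"{xml_file.stem}.txt").write_text(txt, encoding="utf-8")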

2. Document Ingestion (ingest.py)

Purpose: Load, chunk, embed, and index documents into ChromaDB.

Process:

  1. Load: Uses DirectoryLoader with TextLoader to load .txt files
  2. Split: Uses RecursiveCharacterTextSplitter to chunk documents
    • Default: 800 chars with 120 char overlap
    • Preserves semantic boundaries (paragraphs → lines → words → chars)
  3. Embed: Generates embeddings using HuggingFace sentence-transformers
    • Default model: sentence-transformers/all-MiniLM-L6-v2
  4. Index: Stores vectors in ChromaDB (local or server mode)

Key Function:

  • ingest_corpus(): Main ingestion pipeline

Configuration:

  • source_dir: Source directory for text files
  • chunk_size: Maximum chunk size (default: 800)
  • chunk_overlap: Overlap between chunks (default: 120)
  • embedding_model: HuggingFace model name
  • chroma_url: Optional Chroma Server URL
  • collection_name: ChromaDB collection name (default: "corpus")

Data Flow:

.txt Files → Load → Split into Chunks → Generate Embeddings → Store in ChromaDB
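
A condensed sketch of this pipeline using the LangChain components named above (illustrative; the real ingest_corpus() may differ in details, and import paths vary across LangChain versions):

from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter

def ingest_corpus(source_dir="data", chunk_size=800, chunk_overlap=120,
                  embedding_model="sentence-transformers/all-MiniLM-L6-v2",
                  persist_dir=".chroma", collection_name="corpus"):
    # 1. Load all .txt files from the source directory
    docs = DirectoryLoader(source_dir, glob="**/*.txt", loader_cls=TextLoader).load()
    # 2. Split into overlapping chunks, keeping the start offset for traceability
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap, add_start_index=True
    )
    chunks = splitter.split_documents(docs)
    # 3 + 4. Embed the chunks and index them in a local ChromaDB collection
    embeddings = HuggingFaceEmbeddings(model_name=embedding_model)
    return Chroma.from_documents(
        chunks, embeddings,
        collection_name=collection_name, persist_directory=persist_dir,
    )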

3. Vector Store (ChromaDB)

Purpose: Persistent vector database for similarity search.

Deployment Modes:

  • Local Mode: Persists to .chroma/ directory
  • Server Mode: Connects to ChromaDB server via HTTP (Docker)

Docker Configuration (docker-compose.yml):

chroma:
  image: chromadb/chroma:latest
  ports:
    - "8000:8000"
  volumes:
    - chroma-data:/chroma/.chroma
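
The same compose file also defines an Ollama service (used by docker compose up -d chroma ollama below); roughly as follows, with the image tag and mount point assumed here:

ollama:
  image: ollama/ollama:latest
  ports:
    - "11434:11434"
  volumes:
    - ollama-data:/root/.ollama   # assumed model storage path inside the container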

Features:

  • Persistent on-disk storage of embeddings and metadata
  • Collection-based organization
  • Similarity search with configurable top-k

4. Query Pipeline (graph.py)

Purpose: Orchestrate retrieval and answer synthesis using LangGraph.

Architecture: 2-node state machine

┌─────────────┐       ┌──────────────┐
│  Retrieve   │ ───▶  │  Synthesize  │ ───▶ END
└─────────────┘       └──────────────┘

State Definition (GraphState):

{
    "question": str,
    "context_docs": List[str],
    "answer": Optional[str]
}

Nodes:

  1. Retrieve Node (retrieve_node):

    • Takes user question
    • Queries ChromaDB retriever (similarity search)
    • Returns top-k relevant document chunks
    • Updates state with context_docs
  2. Synthesize Node (synthesize_node):

    • Takes question + context documents
    • Constructs prompt with context
    • Calls LLM (or fallback)
    • Returns generated answer
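
A minimal sketch of wiring these two nodes into a LangGraph state machine (illustrative; node and helper names in graph.py may differ):

from typing import List, Optional, TypedDict
from langgraph.graph import StateGraph, END

class GraphState(TypedDict):
    question: str
    context_docs: List[str]
    answer: Optional[str]

def build_graph(retriever, llm_chain):
    def retrieve_node(state: GraphState) -> dict:
        docs = retriever.invoke(state["question"])          # similarity search in ChromaDB
        return {"context_docs": [d.page_content for d in docs]}

    def synthesize_node(state: GraphState) -> dict:
        context = "\n\n".join(state["context_docs"])
        answer = llm_chain.invoke({"question": state["question"], "context": context})
        return {"answer": answer}

    graph = StateGraph(GraphState)
    graph.add_node("retrieve", retrieve_node)
    graph.add_node("synthesize", synthesize_node)
    graph.set_entry_point("retrieve")
    graph.add_edge("retrieve", "synthesize")
    graph.add_edge("synthesize", END)
    return graph.compile()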

LLM Providers (_llm_or_fallback):

Priority order:

  1. Ollama (local LLM):

    • Default: llama3.1:8b
    • Configurable base URL
    • No API key required
    • Fully offline
  2. GitHub Models (free cloud LLM):

    • Default: meta-llama/Llama-3.1-8B-Instruct
    • Requires GitHub personal access token
    • Free tier with rate limits
    • OpenAI-compatible API
  3. OpenAI (premium cloud LLM):

    • Fallback if other providers fail
    • Requires OPENAI_API_KEY
    • Default: gpt-4o-mini
    • Usage costs apply
  4. Extractive Fallback:

    • Simple concatenation of top docs
    • Used when no LLM is available
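
A sketch of this fallback logic (illustrative; the GitHub Models branch needs the project's configured OpenAI-compatible endpoint, which is not shown here):

import os

def _llm_or_fallback(cfg):
    """Return a chat model according to the configured provider, or None."""
    if cfg.provider == "ollama":
        try:
            from langchain_ollama import ChatOllama
            return ChatOllama(model=cfg.ollama_model,
                              base_url=cfg.ollama_base_url or "http://localhost:11434",
                              temperature=cfg.temperature)
        except Exception:
            pass  # Ollama not installed or not reachable; fall through
    # A github_models branch would be analogous: ChatOpenAI pointed at GitHub's
    # OpenAI-compatible endpoint and authenticated with the GitHub token.
    if os.getenv("OPENAI_API_KEY"):
        from langchain_openai import ChatOpenAI
        return ChatOpenAI(model=cfg.model, temperature=cfg.temperature)
    return None  # caller falls back to extractive concatenation of the top documents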

Configuration (RAGConfig):

persist_dir: str = ".chroma"
collection_name: str = "corpus"
embed_model: str = "sentence-transformers/all-MiniLM-L6-v2"
chroma_url: Optional[str] = None
provider: str = "ollama"  # or "openai", "github_models"
model: str = "gpt-4o-mini"
ollama_model: str = "llama3.1:8b"
ollama_base_url: Optional[str] = None
github_model: str = "meta-llama/Llama-3.1-8B-Instruct"
github_token: Optional[str] = None
temperature: float = 0.0
k: int = 5

5. Command-Line Interface (cli.py)

Purpose: Unified CLI for all system operations.

Commands:

  1. extract_ocr: Extract OCR from XML files

    python -m src.rag_system.cli extract_ocr --input data_sample --output data
  2. ingest: Ingest documents into ChromaDB

    python -m src.rag_system.cli ingest --source data --chunk_size 800
  3. query: Query the RAG system

    python -m src.rag_system.cli query "What is this about?" --provider ollama
  4. ollama_pull: Download Ollama models

    python -m src.rag_system.cli ollama_pull --model llama3.1:8b
  5. ui: Launch Gradio web interface

    python -m src.rag_system.cli ui --port 7860

6. Web User Interface (ui.py)

Purpose: Interactive Gradio-based UI for querying the RAG system.

Features:

  • Configuration panel for all RAG settings
  • Real-time question answering
  • Support for both local and server-mode ChromaDB
  • Provider switching (Ollama/OpenAI/GitHub Models)
    • Ollama runs open-source LLMs (such as Llama and Code Llama, typically in GGUF format) locally, handling model download, management, and inference on your own machine
    • GitHub Models provides free, cloud-based LLM access through a GitHub account; no credit card is required
  • Model configuration (embeddings, LLM)
  • Top-k and temperature controls

Launch:

python -m src.rag_system.cli ui --host 127.0.0.1 --port 7860

UI Components:

  • Settings accordion (collapsed by default)
  • Question input textbox
  • Answer markdown output
  • Apply settings button
  • Status messages
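
A stripped-down sketch of this kind of Gradio layout (illustrative only; ui.py wires in the full RAG configuration):

import gradio as gr

def answer_question(question: str) -> str:
    # In ui.py this would build (or reuse) the RAG graph and invoke it with the question.
    return f"(answer for: {question})"

with gr.Blocks(title="RAG System") as demo:
    with gr.Accordion("Settings", open=False):   # collapsed by default
        gr.Dropdown(["ollama", "github_models", "openai"], value="ollama", label="Provider")
        gr.Slider(1, 20, value=5, step=1, label="Top-k")
    question = gr.Textbox(label="Question")
    answer = gr.Markdown()
    ask = gr.Button("Ask")
    ask.click(answer_question, inputs=question, outputs=answer)

if __name__ == "__main__":
    demo.launch(server_name="127.0.0.1", server_port=7860)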

Technology Stack

Core Libraries

Component          Technology                       Purpose
Document Loading   LangChain                        Load and process text files
Text Splitting     RecursiveCharacterTextSplitter   Chunk documents with overlap
Embeddings         HuggingFace Transformers         Generate sentence embeddings
Vector Store       ChromaDB                         Similarity search and storage
Orchestration      LangGraph                        State machine for RAG pipeline
LLM (Local)        Ollama                           Local language model inference
LLM (Cloud)        OpenAI                           Cloud-based language models
LLM (Free Cloud)   GitHub Models                    Free cloud-based LLMs via GitHub
UI                 Gradio                           Web-based user interface

Infrastructure

  • Docker Compose: Container orchestration for ChromaDB and Ollama servers
  • Python 3.10+: Runtime environment
  • Virtual Environment: Dependency isolation

Data Flow

End-to-End Query Flow

1. User Question
       │
       ▼
2. Initialize RAG Config
       │
       ▼
3. Load Retriever
   ├─ Connect to ChromaDB (local or server)
   ├─ Load embedding model
   └─ Create retriever with top-k
       │
       ▼
4. Build Graph
   ├─ Initialize LLM (Ollama/OpenAI/Fallback)
   ├─ Create prompt template
   └─ Compile state graph
       │
       ▼
5. Retrieve Node
   ├─ Embed query
   ├─ Similarity search in ChromaDB
   └─ Return top-k chunks
       │
       ▼
6. Synthesize Node
   ├─ Format context + question
   ├─ Call LLM with prompt
   └─ Generate answer
       │
       ▼
7. Return Answer to User
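
Putting this together, a minimal end-to-end invocation (illustrative, reusing the build_graph sketch from the Query Pipeline section; the real CLI/UI code adds configuration handling and error reporting):

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import ChatOllama

# Retriever over the locally persisted collection
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
store = Chroma(collection_name="corpus", persist_directory=".chroma", embedding_function=embeddings)
retriever = store.as_retriever(search_kwargs={"k": 5})

# Simple context-grounded prompt chain (the project's actual prompt may differ)
prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context.\n\nContext:\n{context}\n\nQuestion: {question}"
)
llm_chain = prompt | ChatOllama(model="llama3.1:8b", temperature=0.0) | StrOutputParser()

app = build_graph(retriever, llm_chain)  # from the earlier Query Pipeline sketch
state = app.invoke({"question": "What is this corpus about?", "context_docs": [], "answer": None})
print(state["answer"])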

Ingestion Flow

1. Source Files (data/*.txt)
       │
       ▼
2. DirectoryLoader
   └─ Load all matching files
       │
       ▼
3. RecursiveCharacterTextSplitter
   ├─ Split on: \n\n → \n → space → char
   ├─ Target: 800 chars
   └─ Overlap: 120 chars
       │
       ▼
4. HuggingFaceEmbeddings
   └─ Generate 384-dim vectors
       │
       ▼
5. ChromaDB
   ├─ Store vectors
   ├─ Store metadata
   └─ Index for similarity search

Configuration & Environment

Environment Variables

# Embeddings
EMBED_MODEL=sentence-transformers/all-MiniLM-L6-v2

# Chroma
CHROMA_URL=http://localhost:8000

# LLM Provider
LLM_PROVIDER=ollama  # or openai, github_models

# Ollama
OLLAMA_MODEL=llama3.1:8b
OLLAMA_BASE_URL=http://localhost:11434

# OpenAI
OPENAI_API_KEY=sk-...
OPENAI_MODEL=gpt-4o-mini

# GitHub Models
GITHUB_TOKEN=github_pat_...
GITHUB_MODEL=meta-llama/Llama-3.1-8B-Instruct

# Logging
LOG_LEVEL=INFO
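
These variables are read at startup. A sketch of how they might map onto the RAGConfig fields shown earlier (the exact mapping in the project may differ):

import os

def config_from_env() -> dict:
    # Defaults mirror the RAGConfig defaults shown earlier; None means the variable is unset
    return dict(
        embed_model=os.getenv("EMBED_MODEL", "sentence-transformers/all-MiniLM-L6-v2"),
        chroma_url=os.getenv("CHROMA_URL"),
        provider=os.getenv("LLM_PROVIDER", "ollama"),
        ollama_model=os.getenv("OLLAMA_MODEL", "llama3.1:8b"),
        ollama_base_url=os.getenv("OLLAMA_BASE_URL"),
        model=os.getenv("OPENAI_MODEL", "gpt-4o-mini"),
        github_model=os.getenv("GITHUB_MODEL", "meta-llama/Llama-3.1-8B-Instruct"),
        github_token=os.getenv("GITHUB_TOKEN"),
    )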

Default Configuration

Parameter         Default Value   Description
chunk_size        800             Maximum characters per chunk
chunk_overlap     120             Overlap between adjacent chunks
k                 5               Number of documents to retrieve
temperature       0.0             LLM temperature (deterministic)
collection_name   "corpus"        ChromaDB collection name
persist_dir       ".chroma"       Local ChromaDB directory

Deployment Modes

1. Fully Local Mode

Components:

  • Local ChromaDB (.chroma/)
  • Ollama (installed on host)
  • HuggingFace embeddings (local)

Pros:

  • No API keys required
  • Fully offline
  • Fast for single users

Cons:

  • Limited by local compute
  • No shared storage

2. Server Mode (Recommended)

Components:

  • ChromaDB Server (Docker)
  • Ollama Server (Docker)
  • HuggingFace embeddings (local)

Pros:

  • Shared vector store
  • Scalable
  • Better resource management

Cons:

  • Requires Docker
  • Network dependency

Setup:

# Start services
docker compose up -d chroma ollama

# Pull model
docker exec -it ollama ollama pull llama3.1:8b

# Configure clients
export CHROMA_URL=http://localhost:8000
export OLLAMA_BASE_URL=http://localhost:11434

3. Hybrid Mode (GitHub Models)

Components:

  • ChromaDB Server (Docker)
  • GitHub Models API (free cloud LLM)
  • HuggingFace embeddings (local)

Pros:

  • Shared vector store
  • No local GPU needed
  • Free tier available
  • No credit card required

Cons:

  • Requires GitHub account
  • Rate limits apply

4. Hybrid Mode (OpenAI)

Components:

  • ChromaDB Server (Docker)
  • OpenAI API (premium cloud LLM)
  • HuggingFace embeddings (local)

Pros:

  • Shared vector store
  • High-quality LLM responses
  • No local GPU needed

Cons:

  • Requires API key
  • Usage costs

Error Handling & Fallbacks

LLM Fallback Chain

  1. Ollama: Try local LLM first (if provider=ollama)
  2. GitHub Models: Try free cloud LLM (if provider=github_models and token available)
  3. OpenAI: Fallback if other providers unavailable (if API key set)
  4. Extractive: Simple doc concatenation if no LLM available

NumPy Compatibility

The system includes explicit NumPy checks with helpful error messages for macOS Apple Silicon users:

try:
    import numpy as _np
except Exception as e:
    raise RuntimeError("NumPy is required but not available...")

Configuration Deserialization

Custom error messages for LangChain/LangGraph version mismatches (cli.py:148-162, ui.py:85-91, ui.py:110-116).

Performance Considerations

Chunking Strategy

  • RecursiveCharacterTextSplitter preserves semantic boundaries
  • Overlap improves retrieval recall across chunk boundaries
  • add_start_index=True enables tracing back to source
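
For example, with add_start_index=True each chunk's metadata carries the character offset at which it began in the source document, so retrieved chunks can be traced back (a small illustrative check):

from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=120, add_start_index=True)
text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. " * 40  # long enough for several chunks
chunks = splitter.split_documents([Document(page_content=text, metadata={"source": "data/example.txt"})])
for chunk in chunks[:3]:
    print(chunk.metadata["source"], chunk.metadata["start_index"], repr(chunk.page_content[:40]))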

Embedding Model

  • Default: all-MiniLM-L6-v2 (384 dimensions)
  • Fast inference on CPU
  • Good balance of speed/quality
  • Alternative: all-mpnet-base-v2 (768d, slower but higher quality)

Retrieval

  • Top-k=5 balances context vs. noise
  • ChromaDB uses approximate nearest neighbor search
  • On-disk persistence (local .chroma/ or the Docker volume) avoids re-ingesting between runs

Extensibility

Adding Custom Embeddings

Edit ingest.py or graph.py:

from langchain_community.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)

Custom Text Splitters

Edit ingest.py:60:

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1200,
    chunk_overlap=150,
    separators=["\n\n## ", "\n\n", "\n", " ", ""],  # Markdown headings
)

Adding New LLM Providers

Edit graph.py:_llm_or_fallback():

from langchain_anthropic import ChatAnthropic
if cfg.provider == "anthropic":
    return ChatAnthropic(api_key=..., model=cfg.model)

Security & Best Practices

API Keys

  • Never commit .env files
  • Use environment variables
  • Add .env to .gitignore

Docker Volumes

  • Persistent volumes for ChromaDB (chroma-data)
  • Persistent volumes for Ollama models (ollama-data)

File Permissions

  • Text loader handles encoding errors gracefully
  • XML parser catches malformed files

Troubleshooting

Common Issues

  1. "NumPy is not available":

    • Install NumPy first: pip install "numpy>=1.26,<2.1"
    • Use Python 3.10+ on macOS Apple Silicon
  2. Empty retrieval results:

    • Verify ingestion completed: check .chroma/ or ChromaDB collection
    • Check embedding model consistency between ingest and query
  3. Ollama connection errors:

    • Verify Ollama is running: ollama serve
    • Check base URL: http://localhost:11434
    • Pull model first: ollama pull llama3.1:8b
  4. ChromaDB server connection:

    • Start container: docker compose up -d chroma
    • Check logs: docker logs chroma-server
    • Verify port: curl http://localhost:8000/api/v1/heartbeat

File Structure

rag_prototype/
├── src/
│   └── rag_system/
│       ├── __init__.py
│       ├── ocr_extract.py    # XML → TXT conversion
│       ├── ingest.py          # Document ingestion pipeline
│       ├── graph.py           # RAG query pipeline (LangGraph)
│       ├── cli.py             # Command-line interface
│       └── ui.py              # Gradio web interface
├── data/                      # Source text files (ingestion input)
├── .chroma/                   # Local ChromaDB storage
├── docker-compose.yml         # Docker services (Chroma, Ollama)
├── requirements.txt           # Python dependencies
└── README.md                  # User documentation

Future Enhancements

Potential Improvements

  1. Advanced Retrieval:

    • Hybrid search (keyword + semantic)
    • Re-ranking with cross-encoders
    • Multi-query expansion
  2. Evaluation:

    • Add evaluation metrics (RAGAS, etc.)
    • Automated testing suite
    • Benchmark different embedding models
  3. Scalability:

    • Batch ingestion with progress tracking
    • Async query processing
    • Distributed ChromaDB deployment
  4. Features:

    • Document metadata filtering
    • Multi-turn conversations with memory
    • Citation/source tracking in responses
    • PDF/DOCX direct support
  5. UI Enhancements:

    • Chat history
    • Document upload via UI
    • Real-time ingestion status
    • Visualization of retrieved chunks

References