Tutorial: Custom Document Ingestion for Legal LoRA Training

Introduction

This tutorial guides you through ingesting your own legal documents into ThemisDB for training a customized LoRA adapter. We'll cover:

  1. Preparing your documents
  2. Configuring data sources
  3. Running ingestion
  4. Validating results
  5. Troubleshooting common issues

Prerequisites

  • ThemisDB installed and running
  • Documents to ingest (PDF, DOCX, TXT, or other supported formats)
  • Basic understanding of YAML configuration
  • For scanned PDFs: Tesseract OCR installed

Step 1: Prepare Your Documents

Organize Your Files

Create a directory structure:

/data/legal_documents/
├── federal/
│   ├── BGB_sections.pdf
│   ├── StGB_commentary.docx
│   └── ...
├── state/
│   ├── LandVG_Berlin.pdf
│   └── ...
└── internal/
    ├── verwaltungsvorschrift_2024.pdf
    ├── guidance_modal_verbs.docx
    └── ...

File Requirements

Supported Formats

  • PDF: Text-based or scanned (requires OCR)
  • DOCX: Microsoft Word documents
  • TXT: Plain text files
  • HTML/XML: Structured documents
  • JSON: Structured legal data

File Size Limits

  • Default: 100 MB per file
  • Adjustable in configuration
  • For very large files, consider splitting

Document Quality

For best results:

  1. Scanned PDFs:

    • Minimum 300 DPI
    • Clear, high-contrast text
    • Proper orientation (not rotated)
  2. Text Documents:

    • UTF-8 encoding preferred
    • Consistent formatting
    • Minimal boilerplate (headers/footers)
  3. Metadata:

    • Include in filename if possible: BGB_§123_2024.pdf
    • Helps with categorization and tracking
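The filename convention above can be mined for metadata at ingestion time. A minimal sketch, assuming a `<LAW>_<SECTION>_<YEAR>.<ext>` pattern (the `FilenameMeta` struct and `parseFilename` helper are hypothetical, not part of the ThemisDB API):

```cpp
#include <optional>
#include <regex>
#include <string>

// Hypothetical helper: pull law, section, and year out of a
// "<LAW>_<SECTION>_<YEAR>.<ext>" filename such as "BGB_§123_2024.pdf".
struct FilenameMeta {
    std::string law;      // e.g. "BGB"
    std::string section;  // e.g. "§123"
    std::string year;     // e.g. "2024"
};

std::optional<FilenameMeta> parseFilename(const std::string& name) {
    static const std::regex pattern(R"(([A-Za-z]+)_(.+)_(\d{4})\.\w+)");
    std::smatch m;
    if (!std::regex_match(name, m, pattern)) {
        return std::nullopt;  // filename does not follow the convention
    }
    return FilenameMeta{m[1].str(), m[2].str(), m[3].str()};
}
```

Files that match the convention get structured metadata for free; anything else falls back to content-based categorization.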

Step 2: Configure Data Sources

Basic Configuration

Create or edit config/ingestion/my_sources.yaml:

filesystem_sources:
  # Internal administrative regulations
  - source_id: "my_verwaltungsvorschriften"
    enabled: true
    type: "filesystem"
    location: "/data/legal_documents/internal"
    priority: 10  # Higher than public data
    description: "Internal administrative regulations"
    options:
      format: "auto"  # Auto-detect format
      recursive: true  # Include subdirectories
      
      # OCR configuration
      ocr_enabled: true
      ocr_language: "deu"  # German
      ocr_dpi: 300
      skip_text_pdfs: true  # Skip OCR if PDF has text
      
      # File filtering
      extensions:
        - ".pdf"
        - ".docx"
      exclude_patterns:
        - "**/backup/**"
        - "**/draft/**"
      min_size_bytes: 100
      max_size_bytes: 104857600  # 100 MB
      
      # Metadata
      extract_metadata: true

Advanced Configuration

Multiple Sources with Priorities

filesystem_sources:
  # Priority 10: Most important (internal guidance)
  - source_id: "internal_guidance"
    location: "/data/legal_documents/internal"
    priority: 10
    
  # Priority 8: State-level laws
  - source_id: "state_laws"
    location: "/data/legal_documents/state"
    priority: 8
    
  # Priority 5: Federal laws (public)
  - source_id: "federal_laws"
    location: "/data/legal_documents/federal"
    priority: 5

Sources with a higher priority are processed first and carry more weight during training.
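The ordering above amounts to a descending sort on priority. A minimal sketch (the `Source` struct and `orderByPriority` function are illustrative; the actual scheduling lives inside `IngestionManager`):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Illustrative source record; real SourceConfig entries carry more fields.
struct Source {
    std::string id;
    int priority;
};

// Process the highest-priority sources first.
void orderByPriority(std::vector<Source>& sources) {
    std::sort(sources.begin(), sources.end(),
              [](const Source& a, const Source& b) {
                  return a.priority > b.priority;  // descending
              });
}
```

With the three sources above, `internal_guidance` (10) is processed before `state_laws` (8), which is processed before `federal_laws` (5).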

OCR for Legacy Documents

options:
  ocr_enabled: true
  ocr_language: "deu+eng"  # German + English
  ocr_dpi: 600  # Higher quality for poor scans
  skip_text_pdfs: false  # Always OCR (legacy docs)

Selective Ingestion

Only specific files:

options:
  extensions:
    - ".pdf"  # Only PDFs
  include_patterns:
    - "**/*_final_*.pdf"  # Only final versions
    - "**/BGB_*.pdf"      # Only BGB sections
  exclude_patterns:
    - "**/backup/**"
    - "**/archive/**"
    - "**/*_draft_*.pdf"

Step 3: Run Ingestion

Command Line

# Using example executable
cd examples/legal_lora_training
./train_legal_lora --config my_sources.yaml --phase ingestion

# Or using Python client
python ingest_documents.py --config my_sources.yaml

Programmatic (C++)

#include <iostream>

#include "ingestion/ingestion_manager.h"
#include "ingestion/filesystem_ingester.h"

int main() {
    std::string db = "http://localhost:8529/_db/legal_training";
    
    // Create ingestion manager
    ingestion::IngestionManager mgr(db);
    
    // Configure source
    ingestion::SourceConfig config;
    config.source_id = "my_docs";
    config.type = ingestion::SourceType::FILESYSTEM;
    config.location = "/data/legal_documents/internal";
    config.priority = 10;
    config.options["ocr_enabled"] = "true";
    config.options["ocr_language"] = "deu";
    
    // Register and ingest
    mgr.registerSource(config);
    mgr.setTargetCollection("legal_documents");
    
    auto report = mgr.ingestAll([](const std::string& source, 
                                    size_t processed, 
                                    size_t total,
                                    const std::string& status) {
        std::cout << "[" << source << "] " 
                  << processed << "/" << total 
                  << " - " << status << std::endl;
    });
    
    // Check results
    std::cout << "Documents ingested: " << report.total_documents << std::endl;
    std::cout << "Failures: " << report.total_failures << std::endl;
    
    return 0;
}

Monitoring Progress

The ingestion run provides real-time feedback:

[my_docs] 0/127 - Starting ingestion...
[my_docs] 10/127 - Processing: verwaltungsvorschrift_2024.pdf
[my_docs] 20/127 - OCR in progress: legacy_regulation_1995.pdf
[my_docs] 50/127 - Halfway complete
[my_docs] 100/127 - Almost done...
[my_docs] 127/127 - Ingestion complete!

Ingestion complete:
  Total documents: 127
  Total failures: 3
  Total time: 45.2s

Step 4: Validate Results

Check Ingested Documents

Using AQL query:

FOR doc IN legal_documents
    FILTER doc.source_id == "my_docs"
    COLLECT source = doc.document_type WITH COUNT INTO count
    RETURN {type: source, count: count}

Expected output:

[
  {"type": "regulation", "count": 85},
  {"type": "guidance", "count": 32},
  {"type": "case_law", "count": 10}
]

Inspect Document Content

FOR doc IN legal_documents
    FILTER doc.source_id == "my_docs"
    LIMIT 5
    RETURN {
        title: doc.title,
        type: doc.document_type,
        content_length: LENGTH(doc.content),
        has_embedding: doc.embedding != null
    }

Check for Errors

FOR log IN ingestion_logs
    FILTER log.source_id == "my_docs"
    FILTER log.status == "failed"
    RETURN {
        document: log.document_path,
        error: log.error_message
    }

Step 5: Advanced Features

OCR Quality Assessment

After OCR, check quality:

FOR doc IN legal_documents
    FILTER doc.source_id == "my_docs"
    FILTER doc.metadata.ocr_performed == true
    SORT doc.metadata.ocr_confidence ASC
    LIMIT 10
    RETURN {
        title: doc.title,
        confidence: doc.metadata.ocr_confidence,
        word_count: doc.metadata.word_count
    }

Low confidence (<0.7)? Try:

  • Increase DPI: ocr_dpi: 600
  • Pre-process images (contrast, deskew)
  • Manual review and correction

Metadata Enrichment

Extract additional metadata:

filesystem_ingester.setMetadataExtraction(true);

// Extracted metadata includes:
// - Author
// - Creation date
// - Document type (from filename or content)
// - Keywords
// - References to other documents

Chunking Large Documents

For documents >10 pages:

text_processing:
  enable_chunking: true
  chunk_size: 2000  # characters
  chunk_overlap: 200  # for context continuity

This creates:

Document → Chunk 1 (chars 0-2000)
        → Chunk 2 (chars 1800-3800)  # 200 overlap
        → Chunk 3 (chars 3600-5600)
        → ...
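The overlap arithmetic behind this layout can be sketched as follows (`chunkOffsets` is an illustrative helper, not a ThemisDB function):

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Compute (start, end) character offsets for overlapping chunks,
// e.g. chunk_size = 2000 and chunk_overlap = 200 as configured above.
std::vector<std::pair<std::size_t, std::size_t>>
chunkOffsets(std::size_t text_len, std::size_t chunk_size, std::size_t overlap) {
    std::vector<std::pair<std::size_t, std::size_t>> chunks;
    const std::size_t step = chunk_size - overlap;  // advance per chunk
    for (std::size_t start = 0; start < text_len; start += step) {
        const std::size_t end = std::min(start + chunk_size, text_len);
        chunks.push_back({start, end});
        if (end == text_len) break;  // last chunk reached the end
    }
    return chunks;
}
```

For a 5,600-character document this yields (0, 2000), (1800, 3800), (3600, 5600), matching the diagram above.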

Deduplication

Avoid duplicate documents:

deduplication:
  enabled: true
  method: "content_hash"  # SHA-256 of content

# Alternative: fuzzy title matching
# deduplication:
#   enabled: true
#   method: "title"
#   fuzzy_threshold: 0.9
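The content-hash method boils down to remembering a hash per ingested document and skipping repeats. A dependency-free sketch (the config specifies SHA-256; `std::hash` stands in here only to avoid a crypto dependency, and the `Deduplicator` class is illustrative):

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>

// Illustrative content-hash deduplicator. The real pipeline hashes with
// SHA-256; std::hash is a stand-in to keep this sketch self-contained.
class Deduplicator {
public:
    // Returns true if the content is new, false if it is a duplicate.
    bool insert(const std::string& content) {
        return seen_.insert(std::hash<std::string>{}(content)).second;
    }

private:
    std::unordered_set<std::size_t> seen_;
};
```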

Troubleshooting

Issue 1: OCR Fails

Symptom: Error "Tesseract not found" or empty content

Solution:

# Install Tesseract
sudo apt-get install tesseract-ocr tesseract-ocr-deu

# Verify
tesseract --version
tesseract --list-langs

Issue 2: Out of Memory

Symptom: Process crashes during ingestion

Solution:

# Reduce parallel processing
parallel_processing:
  enabled: true
  max_threads: 2  # Instead of 4

# Process in smaller batches
batch_size: 50  # Instead of 100

Issue 3: Slow Ingestion

Symptom: Takes >1 minute per document

Causes & Solutions:

  • Large PDFs: Enable chunking
  • Many scanned pages: Disable OCR for text PDFs
  • Network storage: Copy to local disk first
  • CPU-bound: Increase max_threads

Issue 4: Encoding Errors

Symptom: Garbled text, special characters broken

Solution:

text_processing:
  encoding: "utf-8"
  fallback_encodings:
    - "iso-8859-1"
    - "windows-1252"
  normalize_unicode: true

Issue 5: Missing Documents

Symptom: Expected 200 files, only 150 ingested

Debug:

// Check file filter
auto count = ingester.getDocumentCount();
std::cout << "Matching files: " << count << std::endl;

// List filtered files
auto ingester_impl = ingester.getImpl();
for (const auto& file : ingester_impl->listMatchingFiles()) {
    std::cout << file << std::endl;
}

Best Practices

1. Start Small

Ingest a small subset first (10-20 documents) to validate:

  • Format compatibility
  • OCR quality
  • Metadata extraction
  • Processing time

2. Use Priorities

Assign priorities based on importance:

  • 10: Critical internal documents
  • 7-9: Important references
  • 5-6: General legal corpus
  • 1-4: Background/context

3. Monitor Quality

After ingestion, check:

  • Content length distribution
  • OCR confidence scores
  • Missing metadata fields
  • Duplicate detection

4. Incremental Updates

Don't re-ingest everything:

incremental:
  enabled: true
  check_modified_time: true
  skip_existing: true

5. Backup First

Before large ingestion:

# Backup database
themisdb-backup --database legal_training --output backup/

# Or use snapshot
curl -X POST http://localhost:8529/_db/legal_training/_api/snapshot

Next Steps

After ingestion:

  1. Auto-Labeling: Generate training samples

    ./train_legal_lora --phase labeling
  2. Graph Enrichment: Add context

    ./train_legal_lora --phase enrichment
  3. Training: Create LoRA adapter

    ./train_legal_lora --phase training

See main documentation: LEGAL_LORA_TRAINING_PIPELINE.md

Example: Complete Workflow

# 1. Configure sources
filesystem_sources:
  - source_id: "my_internal_docs"
    location: "/data/verwaltung"
    priority: 10
    options:
      ocr_enabled: true
      ocr_language: "deu"

# 2. Run ingestion
./train_legal_lora --config my_config.yaml --phase ingestion

# 3. Validate (AQL)
FOR doc IN legal_documents
    FILTER doc.source_id == "my_internal_docs"
    COLLECT WITH COUNT INTO count
    RETURN count
# Expected: ~127 documents

# 4. Continue to training
./train_legal_lora --config my_config.yaml --phase all

Resources

  • Main Documentation: LEGAL_LORA_TRAINING_PIPELINE.md
  • Configuration Reference: config/ingestion/sources.yaml
  • Code Examples: examples/legal_lora_training/
  • API Reference: See header files in include/ingestion/

Support

Questions? Open an issue in the project's issue tracker.