DocExtract AI

Extract structured data from unstructured documents in seconds -- not hours.

Tests Eval Gate Coverage License: MIT Python 3.10+ FastAPI Open in Streamlit

Proof in 30 seconds -- 95.5% extraction F1 | $0.03/doc avg cost | p95 latency 4.1s | 1,185 tests | 72 eval cases | live demo

| Metric | Value |
|---|---|
| Extraction accuracy (F1) | 95.5% |
| Avg cost per document | $0.03 |
| p95 end-to-end latency | 4.1s |
| Straight-through rate | 88% |
| Test suite | 1,185 tests |
| Eval framework | LLM-as-judge + promptfoo CI gate |

Key features: instructor typed extraction with auto-retry, LLM-as-judge online quality scoring (10% sampling), hybrid RRF retrieval, vision extraction mode, business metrics API, 15-page Streamlit dashboard
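The typed extraction with auto-retry mentioned above boils down to a validate-then-retry loop: the model's JSON is checked against a schema, and validation errors are fed back into the next attempt. This stdlib-only sketch is illustrative only; the field names, validation rules, and `fake_llm` stub are assumptions, not the repo's actual code (which uses instructor + Pydantic).

```python
# Minimal sketch of schema-validated extraction with auto-retry.
# All names here are hypothetical; the repo's real pipeline uses
# instructor + Pydantic models instead of hand-rolled validation.

def validate_invoice(data: dict) -> list[str]:
    """Return a list of validation errors (empty list means valid)."""
    errors = []
    if not data.get("invoice_number"):
        errors.append("invoice_number is required")
    if not isinstance(data.get("total"), (int, float)) or data["total"] < 0:
        errors.append("total must be a non-negative number")
    return errors

def extract_with_retry(call_llm, document: str, max_retries: int = 2) -> dict:
    """Call the model, validate, and retry with error feedback on failure."""
    feedback = ""
    for _ in range(max_retries + 1):
        data = call_llm(document, feedback)
        errors = validate_invoice(data)
        if not errors:
            return data
        feedback = "; ".join(errors)  # fed back into the next prompt
    raise ValueError(f"extraction failed: {feedback}")

# Stub model that corrects its output once it sees validation feedback.
def fake_llm(document, feedback):
    if feedback:
        return {"invoice_number": "INV-001", "total": 120.0}
    return {"invoice_number": "INV-001", "total": -5}

print(extract_with_retry(fake_llm, "…invoice text…"))
```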

Best fit -- AI Engineer, Applied AI Engineer, AI Backend Engineer

For Hiring Managers

| If you're evaluating for... | Where to look | Training behind it |
|---|---|---|
| AI / ML Engineer | Agentic RAG ReAct loop (app/services/agentic_rag.py), RAGAS evaluation pipeline (app/services/ragas_evaluator.py), QLoRA fine-tuning pipeline (scripts/train_qlora.py) with training infrastructure ready, W&B experiment tracking, golden eval CI gate | IBM GenAI Engineering (144h), IBM RAG & Agentic AI (24h), DeepLearning.AI Deep Learning (120h) |
| Backend / Platform Engineer | Circuit breaker model fallback (app/services/circuit_breaker.py), async ARQ job queue (worker/), prompt versioning, eval CI, and sliding-window rate limiter | Microsoft AI & ML Engineering (75h), Google Cloud GenAI Leader (25h) |
| Full-Stack AI Engineer | 15-page Streamlit dashboard (frontend/), SSE streaming progress, MCP tool server (mcp_server.py), interactive demo sandbox | IBM BI Analyst (141h), Google Data Analytics (181h), Microsoft Data Viz (87h) |
| MLOps / LLMOps Engineer | Prompt versioning + regression testing (app/services/prompt_registry.py), model A/B testing with z-test significance (app/services/model_ab_test.py), DeepEval CI gates, per-request cost tracking | Duke LLMOps (48h), Google Advanced Data Analytics (200h) |
| EdTech / LMS Engineer | Document extraction maps directly to assignment processing and syllabus parsing; the batch pipeline (worker/) handles grading-document ingestion at scale; the PII sanitizer (app/services/pii_sanitizer.py) enforces FERPA compliance for student records | IBM GenAI Engineering (144h), Google Data Analytics (181h) |

→ Supporting background map: docs/certifications.md

Quickstart

git clone https://github.com/ChunkyTortoise/docextract.git
cd docextract
cp .env.example .env  # Add ANTHROPIC_API_KEY + GEMINI_API_KEY
docker compose up -d
open http://localhost:8501  # Streamlit UI

Services: API at :8000 (/docs for Swagger) | Frontend at :8501 | PostgreSQL :5432 | Redis :6379

Demo

Open in Streamlit

First visit may take 30 seconds to wake up. Pre-cached results for invoice, contract, and receipt extraction.

Local demo (no API key needed):

DEMO_MODE=true streamlit run frontend/app.py

Architecture

graph LR
  A[Browser / API Client] -->|POST /documents| B[FastAPI]
  B -->|enqueue| C[ARQ Worker]
  C -->|classify| D{Model Router}
  D -->|primary| E[Claude Sonnet]
  D -->|fallback| F[Claude Haiku]
  E -->|Pass 2: extract + correct| G[pgvector HNSW]
  B -->|SSE stream stages| A
  G -->|semantic search| B
  B -->|/metrics| H[Prometheus]
  D --- I[Circuit Breaker]
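The Model Router and Circuit Breaker in the diagram cooperate roughly as follows: after a threshold of consecutive primary-model failures, requests route to the fallback model until a cool-down elapses, then the primary is retried. A minimal sketch, assuming illustrative thresholds and stub model calls (not the repo's circuit_breaker.py):

```python
# Minimal circuit-breaker sketch for a Sonnet -> Haiku fallback chain.
# Thresholds, timing, and the stub model functions are illustrative.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means circuit closed (primary healthy)

    def call(self, primary, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()   # circuit open: skip the primary model
            self.opened_at = None   # cool-down elapsed: retry the primary
            self.failures = 0
        try:
            result = primary()
            self.failures = 0       # any success resets the failure count
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()

breaker = CircuitBreaker()

def sonnet():  # stand-in for the primary model call
    raise TimeoutError("simulated outage")

def haiku():   # stand-in for the fallback model call
    return "haiku-result"

print([breaker.call(sonnet, haiku) for _ in range(3)])  # falls back each time
```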

Supported Models

| Model | Provider | Env Var | Notes |
|---|---|---|---|
| claude-sonnet-4-6 | Anthropic | ANTHROPIC_API_KEY | Default extraction model |
| claude-haiku-4-5-20251001 | Anthropic | ANTHROPIC_API_KEY | Default classification + circuit breaker fallback |
| glm-4-plus | Zhipu AI | ZHIPUAI_API_KEY | Chinese AI model, OpenAI-compatible API |
| glm-4-flash | Zhipu AI | ZHIPUAI_API_KEY | Fast/cheap GLM variant |
| Gemini (embedding) | Google | GEMINI_API_KEY | Used for pgvector embeddings only |

GLM-4 models use an OpenAI-compatible endpoint (https://open.bigmodel.cn/api/paas/v4/). Configure via EXTRACTION_MODELS env var.
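For illustration, here is one way a comma-separated EXTRACTION_MODELS spec could be resolved into per-provider endpoints and key variables. The spec format and the PROVIDERS map are assumptions for this sketch, not the repo's actual parsing logic.

```python
# Hypothetical resolver mapping model names to their provider endpoint
# and API-key env var; the EXTRACTION_MODELS format is an assumption.
import os

PROVIDERS = {
    "claude": ("https://api.anthropic.com", "ANTHROPIC_API_KEY"),
    "glm": ("https://open.bigmodel.cn/api/paas/v4/", "ZHIPUAI_API_KEY"),
}

def resolve_models(spec: str) -> dict:
    """Map each model in a comma-separated spec to its endpoint + key var."""
    routes = {}
    for model in filter(None, (m.strip() for m in spec.split(","))):
        prefix = model.split("-")[0]        # e.g. "glm-4-plus" -> "glm"
        base_url, key_env = PROVIDERS.get(prefix, PROVIDERS["claude"])
        routes[model] = {"base_url": base_url, "api_key_env": key_env}
    return routes

print(resolve_models(os.environ.get("EXTRACTION_MODELS",
                                    "claude-sonnet-4-6,glm-4-plus")))
```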

Screenshots

Panels: Upload & Extraction · Extracted Records & ROI dashboard

SSE Streaming Demo

SSE streaming extraction flow

Real-time progress: PREPROCESSING > EXTRACTING > CLASSIFYING > VALIDATING > EMBEDDING > COMPLETED
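A client consuming this stream only needs to decode the `data:` lines of the SSE protocol. A minimal sketch, assuming a `{"stage": ...}` payload shape (the repo's exact event schema may differ):

```python
# Minimal SSE parser for the stage-progress stream described above.
# The JSON payload shape is an assumption, not the repo's exact schema.
import json

def parse_sse(stream_lines):
    """Yield decoded JSON payloads from the 'data:' lines of an SSE stream."""
    for line in stream_lines:
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):].strip())

raw = [
    'data: {"stage": "PREPROCESSING"}',
    ": keep-alive comment, ignored per the SSE spec",
    'data: {"stage": "EXTRACTING"}',
    'data: {"stage": "COMPLETED", "document_id": "doc_123"}',
]
stages = [event["stage"] for event in parse_sse(raw)]
print(stages)  # ['PREPROCESSING', 'EXTRACTING', 'COMPLETED']
```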

Key Capabilities

  • Extraction: Two-pass Claude pipeline (draft + verify via tool_use), 6 document types, 95.5% extraction F1 on 72-case eval corpus (51 golden + 21 adversarial)
  • Search & RAG: pgvector semantic search (768-dim HNSW), hybrid BM25+RRF retrieval, agentic ReAct loop with 5 tools, map-reduce multi-document synthesis, semantic deduplication cache
  • Reliability: Circuit breaker (Sonnet to Haiku fallback), dead-letter queue, idempotent retries, HMAC-signed webhooks with 4-attempt retry, SHA-256 upload dedup
  • Observability: OpenTelemetry traces (Jaeger/Tempo), Prometheus metrics, Grafana dashboards, per-request cost tracking, structured logging
  • Developer Experience: SSE streaming progress, MCP server integration, prompt versioning (semver), model A/B testing (z-test), 12 ADRs, 90%+ test coverage
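The hybrid BM25+RRF retrieval listed above merges a keyword ranking and a vector ranking with Reciprocal Rank Fusion. A minimal sketch of standard RRF (k=60 is the conventional constant; this is not the repo's exact implementation):

```python
# Reciprocal Rank Fusion: each list contributes 1/(k + rank) per document,
# so items ranked highly in several lists float to the top.
def rrf_fuse(rankings, k=60):
    """Fuse multiple ranked lists of doc IDs into one RRF-scored ranking."""
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["d1", "d3", "d2"]    # keyword (BM25) ranking
vector = ["d2", "d1", "d4"]  # semantic (pgvector) ranking
print(rrf_fuse([bm25, vector]))  # ['d1', 'd2', 'd3', 'd4']
```

d1 and d2 win because both rankers surface them; documents seen by only one ranker trail behind.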

Performance

| Metric | Value |
|---|---|
| Document extraction (p50) | ~8s (two-pass Claude) |
| SSE first token (p50) | <500ms |
| Semantic search (p95) | <100ms |
| Extraction accuracy (eval gate) | 95.5% F1 across 72 cases, 6 document types |
| Test suite | ~5s (1,185 tests) |
| Coverage | 90%+ (CI-enforced) |

Evaluation Results

72-case corpus: 51 golden + 21 adversarial (prompt injection, PII leak, hallucination bait). Scores are field-level F1. CI-enforced on every PR that touches prompts or extraction services via eval-gate.yml.

| Document Type | F1 Score |
|---|---|
| Invoice | 97.3% |
| Purchase Order | 97.6% |
| Bank Statement | 95.8% |
| Medical Record | 99.2% |
| Receipt | 91.1% |
| Identity Document | 81.4% |
| Overall | 95.5% |
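Field-level F1, as used for the scores above, can be sketched by comparing predicted and gold field values per document. Exact-match comparison is a simplifying assumption here; real harnesses typically normalize values (dates, currency, whitespace) before comparing.

```python
# Field-level F1 sketch: true positives are correctly extracted fields,
# false positives are wrong/extra fields, false negatives are missed ones.
def field_f1(predicted: dict, gold: dict) -> float:
    tp = sum(1 for k, v in predicted.items() if gold.get(k) == v)
    fp = len(predicted) - tp                                       # wrong or extra
    fn = sum(1 for k in gold if predicted.get(k) != gold.get(k))   # missed or wrong
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

gold = {"invoice_number": "INV-001", "total": 120.0, "currency": "USD"}
pred = {"invoice_number": "INV-001", "total": 120.0, "currency": "EUR"}
print(round(field_f1(pred, gold), 3))  # 0.667 -- one wrong field out of three
```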

Baseline: autoresearch/baseline.json (28-case golden set, legacy runner). Multi-metric baseline (Promptfoo + Ragas + LLM-judge, 72 cases) pending API credit top-up.

# Full eval suite (Promptfoo + Ragas + LLM-judge, ~$0.44, ~4 min):
make eval

# Fast eval (Promptfoo only, ~$0.02, ~20s):
make eval-fast

For methodology details see docs/eval-methodology.md.

Project Structure

app/
  api/          -- FastAPI route modules (10 routers)
  auth/         -- API key auth + rate limiting middleware
  models/       -- SQLAlchemy models (8 tables)
  schemas/      -- Pydantic request/response schemas
  services/     -- Extraction, classification, embedding, validation
  storage/      -- Pluggable storage backends (local, R2)
  utils/        -- Hashing, MIME detection, token counting
worker/         -- ARQ async job processor
frontend/       -- Streamlit 15-page dashboard
alembic/        -- Database migrations (001-012)
scripts/        -- CLI tools: eval harness, training, seeding, Langfuse sync
tests/          -- Unit, integration, frontend, e2e, and load tests
evals/          -- Golden + adversarial eval corpus (72 cases)
prompts/        -- Versioned prompt templates with CHANGELOG

Architecture Decisions

12 Architecture Decision Records (ADRs) document the key design choices: docs/adr/

| ADR | Decision |
|---|---|
| ADR-0001 | ARQ over Celery for async job queue |
| ADR-0002 | pgvector over Pinecone/Weaviate |
| ADR-0003 | Two-pass Claude extraction with confidence gating |
| ADR-0006 | Circuit breaker model fallback chain |
| ADR-0011 | API key auth over OAuth/JWT |
| ADR-0012 | Pluggable storage backend (Local/R2) |

Production Readiness

Runs locally via Docker Compose. Reference Kubernetes (deploy/k8s/) and AWS Terraform (deploy/aws/main.tf) configs are included to show infrastructure direction for future deployment work; the clearest production-facing proof here is the live demo, the observability stack, and the CI-enforced eval gate.

| Document | Purpose |
|---|---|
| SLO Targets | Latency, availability, quality, cost targets |
| Common Failure Runbook | Circuit breaker, Redis, DB, queue, vector index recovery |
| Security Guide | API keys, webhooks, CORS, data handling |
| Compliance & Privacy | Privacy controls, PII handling notes, and compliance considerations |
| Architecture | Full system architecture overview |
| Case Study | Engineering journey from prototype to production |
| MCP Integration | Claude Desktop / agent framework setup |
| Cost Model | Token costs, per-document pricing, volume estimates |
| Certifications Applied | Supporting background mapped to implementation areas |

Deployment

Render (one-click): Deploy to Render

Kubernetes: kubectl apply -k deploy/k8s/ (HPA auto-scaling, nginx ingress, SSE buffering disabled)

AWS Terraform: cd deploy/aws && terraform apply (EC2 + RDS PostgreSQL 16 + ElastiCache Redis 7, free-tier eligible)

See deploy/ for full manifests and configuration.

Running Tests

pytest tests/ -v                      # Full suite (1,185 tests, ~5s)
pytest tests/ -v --run-eval           # Include golden eval (requires API key)
python scripts/run_eval_ci.py --ci    # Deterministic eval (no API key)

Known Limitations

  • Tesseract degradation on handwriting: OCR accuracy drops significantly on handwritten documents. Set OCR_ENGINE=vision to route through Claude's vision API instead.
  • English-only extraction prompts: Non-English documents may extract with lower accuracy.

Contributing

See CONTRIBUTING.md for development setup, testing, and PR guidelines.

License

MIT
