Extract structured data from unstructured documents in seconds -- not hours.
Proof in 30 seconds -- 95.5% extraction F1 | $0.03/doc avg cost | p95 latency 4.1s | 1,185 tests | 72 eval cases | live demo
| Metric | Value |
|---|---|
| Extraction accuracy (F1) | 95.5% |
| Avg cost per document | $0.03 |
| p95 end-to-end latency | 4.1s |
| Straight-through rate | 88% |
| Test suite | 1,185 tests |
| Eval framework | LLM-as-judge + promptfoo CI gate |
Key features: instructor typed extraction with auto-retry, LLM-as-judge online quality scoring (10% sampling), hybrid RRF retrieval, vision extraction mode, business metrics API, 15-page Streamlit dashboard
Best fit -- AI Engineer, Applied AI Engineer, AI Backend Engineer
| If you're evaluating for... | Where to look | Training behind it |
|---|---|---|
| AI / ML Engineer | Agentic RAG ReAct loop (app/services/agentic_rag.py), RAGAS evaluation pipeline (app/services/ragas_evaluator.py), QLoRA fine-tuning pipeline (scripts/train_qlora.py) — training infrastructure ready, W&B experiment tracking, golden eval CI gate |
IBM GenAI Engineering (144h), IBM RAG & Agentic AI (24h), DeepLearning.AI Deep Learning (120h) |
| Backend / Platform Engineer | Circuit breaker model fallback (app/services/circuit_breaker.py), async ARQ job queue (worker/), prompt versioning, eval CI, and sliding-window rate limiter |
Microsoft AI & ML Engineering (75h), Google Cloud GenAI Leader (25h) |
| Full-Stack AI Engineer | 15-page Streamlit dashboard (frontend/), SSE streaming progress, MCP tool server (mcp_server.py), interactive demo sandbox |
IBM BI Analyst (141h), Google Data Analytics (181h), Microsoft Data Viz (87h) |
| MLOps / LLMOps Engineer | Prompt versioning + regression testing (app/services/prompt_registry.py), model A/B testing with z-test significance (app/services/model_ab_test.py), DeepEval CI gates, cost tracking per request |
Duke LLMOps (48h), Google Advanced Data Analytics (200h) |
| EdTech / LMS Engineer | Document extraction maps directly to assignment processing and syllabus parsing, batch pipeline (worker/) handles grading document ingestion at scale, PII sanitizer (app/services/pii_sanitizer.py) enforces FERPA compliance for student records |
IBM GenAI Engineering (144h), Google Data Analytics (181h) |
→ Supporting background map: docs/certifications.md
git clone https://github.com/ChunkyTortoise/docextract.git
cd docextract
cp .env.example .env # Add ANTHROPIC_API_KEY + GEMINI_API_KEY
docker compose up -d
open http://localhost:8501 # Streamlit UIServices: API at :8000 (/docs for Swagger) | Frontend at :8501 | PostgreSQL :5432 | Redis :6379
First visit may take 30 seconds to wake up. Pre-cached results for invoice, contract, and receipt extraction.
Local demo (no API key needed):
DEMO_MODE=true streamlit run frontend/app.pygraph LR
A[Browser / API Client] -->|POST /documents| B[FastAPI]
B -->|enqueue| C[ARQ Worker]
C -->|classify| D{Model Router}
D -->|primary| E[Claude Sonnet]
D -->|fallback| F[Claude Haiku]
E -->|Pass 2: extract + correct| G[pgvector HNSW]
B -->|SSE stream stages| A
G -->|semantic search| B
B -->|/metrics| H[Prometheus]
D --- I[Circuit Breaker]
| Model | Provider | Env Var | Notes |
|---|---|---|---|
claude-sonnet-4-6 |
Anthropic | ANTHROPIC_API_KEY |
Default extraction model |
claude-haiku-4-5-20251001 |
Anthropic | ANTHROPIC_API_KEY |
Default classification + circuit breaker fallback |
glm-4-plus |
Zhipu AI | ZHIPUAI_API_KEY |
Chinese AI model, OpenAI-compatible API |
glm-4-flash |
Zhipu AI | ZHIPUAI_API_KEY |
Fast/cheap GLM variant |
| Gemini (embedding) | GEMINI_API_KEY |
Used for pgvector embeddings only |
GLM-4 models use an OpenAI-compatible endpoint (https://open.bigmodel.cn/api/paas/v4/). Configure via EXTRACTION_MODELS env var.
| Upload & Extraction | Extracted Records & ROI |
|---|---|
![]() |
![]() |
Real-time progress: PREPROCESSING > EXTRACTING > CLASSIFYING > VALIDATING > EMBEDDING > COMPLETED
- Extraction: Two-pass Claude pipeline (draft + verify via
tool_use), 6 document types, 95.5% extraction F1 on 72-case eval corpus (51 golden + 21 adversarial) - Search & RAG: pgvector semantic search (768-dim HNSW), hybrid BM25+RRF retrieval, agentic ReAct loop with 5 tools, map-reduce multi-document synthesis, semantic deduplication cache
- Reliability: Circuit breaker (Sonnet to Haiku fallback), dead-letter queue, idempotent retries, HMAC-signed webhooks with 4-attempt retry, SHA-256 upload dedup
- Observability: OpenTelemetry traces (Jaeger/Tempo), Prometheus metrics, Grafana dashboards, per-request cost tracking, structured logging
- Developer Experience: SSE streaming progress, MCP server integration, prompt versioning (semver), model A/B testing (z-test), 12 ADRs, 90%+ test coverage
| Metric | Value |
|---|---|
| Document extraction (p50) | ~8s (two-pass Claude) |
| SSE first token (p50) | <500ms |
| Semantic search (p95) | <100ms |
| Extraction accuracy (eval gate) | 95.5% F1 across 72 cases, 6 document types |
| Test suite | ~5s (1,185 tests) |
| Coverage | 90%+ (CI-enforced) |
72-case corpus: 51 golden + 21 adversarial (prompt injection, PII leak, hallucination bait). Scores are field-level F1. CI-enforced on every PR that touches prompts or extraction services via eval-gate.yml.
| Document Type | F1 Score |
|---|---|
| Invoice | 97.3% |
| Purchase Order | 97.6% |
| Bank Statement | 95.8% |
| Medical Record | 99.2% |
| Receipt | 91.1% |
| Identity Document | 81.4% |
| Overall | 95.5% |
Baseline: autoresearch/baseline.json (28-case golden set, legacy runner). Multi-metric baseline (Promptfoo + Ragas + LLM-judge, 72 cases) pending API credit top-up.
# Full eval suite (Promptfoo + Ragas + LLM-judge, ~$0.44, ~4 min):
make eval
# Fast eval (Promptfoo only, ~$0.02, ~20s):
make eval-fastFor methodology details see docs/eval-methodology.md.
app/
api/ -- FastAPI route modules (10 routers)
auth/ -- API key auth + rate limiting middleware
models/ -- SQLAlchemy models (8 tables)
schemas/ -- Pydantic request/response schemas
services/ -- Extraction, classification, embedding, validation
storage/ -- Pluggable storage backends (local, R2)
utils/ -- Hashing, MIME detection, token counting
worker/ -- ARQ async job processor
frontend/ -- Streamlit 15-page dashboard
alembic/ -- Database migrations (001-012)
scripts/ -- CLI tools: eval harness, training, seeding, Langfuse sync
tests/ -- Unit, integration, frontend, e2e, and load tests
evals/ -- Golden + adversarial eval corpus (72 cases)
prompts/ -- Versioned prompt templates with CHANGELOG
12 Architecture Decision Records (ADRs) document the key design choices: docs/adr/
| ADR | Decision |
|---|---|
| ADR-0001 | ARQ over Celery for async job queue |
| ADR-0002 | pgvector over Pinecone/Weaviate |
| ADR-0003 | Two-pass Claude extraction with confidence gating |
| ADR-0006 | Circuit breaker model fallback chain |
| ADR-0011 | API key auth over OAuth/JWT |
| ADR-0012 | Pluggable storage backend (Local/R2) |
Runs locally via Docker Compose. Reference Kubernetes and AWS Terraform configs are included for future deployment work, but the clearest production-facing proof here is the live demo, observability stack, and CI-enforced eval gate.
Cloud infrastructure (deploy/aws/main.tf, deploy/k8s/): Reference AWS Terraform and Kubernetes configs are included for infrastructure direction, along with Docker Compose for local end-to-end runs.
| Document | Purpose |
|---|---|
| SLO Targets | Latency, availability, quality, cost targets |
| Common Failure Runbook | Circuit breaker, Redis, DB, queue, vector index recovery |
| Security Guide | API keys, webhooks, CORS, data handling |
| Compliance & Privacy | Privacy controls, PII handling notes, and compliance considerations |
| Architecture | Full system architecture overview |
| Case Study | Engineering journey from prototype to production |
| MCP Integration | Claude Desktop / agent framework setup |
| Cost Model | Token costs, per-document pricing, volume estimates |
| Certifications Applied | Supporting background mapped to implementation areas |
Kubernetes: kubectl apply -k deploy/k8s/ (HPA auto-scaling, nginx ingress, SSE buffering disabled)
AWS Terraform: cd deploy/aws && terraform apply (EC2 + RDS PostgreSQL 16 + ElastiCache Redis 7, free-tier eligible)
See deploy/ for full manifests and configuration.
pytest tests/ -v # Full suite (1,185 tests, ~5s)
pytest tests/ -v --run-eval # Include golden eval (requires API key)
python scripts/run_eval_ci.py --ci # Deterministic eval (no API key)- Tesseract degradation on handwriting: OCR accuracy drops significantly on handwritten documents. Set
OCR_ENGINE=visionto route through Claude's vision API instead. - English-only extraction prompts: Non-English documents may extract with lower accuracy.
See CONTRIBUTING.md for development setup, testing, and PR guidelines.
MIT



