New-Math-Data/nmd-data-engineer-test-amctest
BlueTail Aviation Compliance Reports

Extracts Airworthiness Directives (ADs) and Service Bulletins (SBs) from aviation compliance report PDFs using AWS Textract and Bedrock LLMs.

Overview

  • PDF document processing via AWS Textract (OCR plus layout, form, and table analysis via AsyncAnalyzeDocument)
  • LLM-based field extraction via AWS Bedrock (Claude and Nova models)
  • Hybrid validation: LLM extraction with Textract fact-checking
  • PDF coordinate enrichment: every extracted field linked to a page location and bounding box (not currently hooked up)
  • Step Functions orchestration for scalable, parallel batch processing
  • Configurable extraction profiles (accuracy vs. speed tradeoffs); each profile defines the Textract and LLM configuration
  • Validation harness for ground-truth comparison and experiment tracking (dashboard generation plus tracking of runs and metrics)

Project Structure

bluetail-poc/
├── src/                      # Core application code
│   ├── bedrock/              # LLM integration via AWS Bedrock
│   ├── classifier/           # Report family classification (ADSI vs Status)
│   ├── extractor/            # Page processing orchestration, prompts & few-shot examples
│   ├── schemas/              # Canonical data models (records, enums, extraction envelope)
│   ├── textract/             # OCR & document analysis
│   ├── validator/            # Field matching & validation
│   ├── lambdas/              # Lambda handlers for workflow steps
│   ├── tasks/                # Task logic (reused by Lambda handlers)
│   ├── aws/                  # AWS client management
│   └── common/               # Shared config, logging & utilities
├── config/                   # Extraction profiles (YAML)
├── validation_harness/       # Ground truth testing & experiment tracking
├── web_viewer/               # Flask-based PDF visualization tool
├── terraform/                # AWS infrastructure as code
│   ├── live/dev/             # Environment configuration
│   └── modules/              # Reusable modules (S3, Lambda, Step Functions, etc.)
├── scripts/                  # Utility scripts (local testing, deployment)
├── tests/                    # pytest test suite
├── Dockerfile                # Local testing via docker-compose
└── docker-compose.yml        # Local development

Source Modules (src/)

| Module | Purpose |
|--------|---------|
| bedrock/ | LLM integration via AWS Bedrock (model-agnostic invocation, response parsing, prompt caching) |
| classifier/ | Report family classification (ADSI vs Status) using Bedrock |
| extractor/ | Page processing orchestration with prompts, few-shot examples, and record auditing |
| schemas/ | Canonical data models: record types, enums, extraction envelope (single source of truth) |
| textract/ | AWS Textract wrapper for OCR, text detection, and bounding box extraction |
| validator/ | Field matching and validation using fuzzy matching strategies |
| lambdas/ | Lambda handlers (s3_path_resolver, classify_report, run_textract, build_manifest, split_batches, aggregate_results, process_document) |
| tasks/ | Task logic reused by Lambda handlers (process_document) |
| aws/ | Singleton AWS client management (S3, Textract, Bedrock) |
| common/ | Shared configuration, logging, and utilities |

Architecture

AWS Services

| Service | Purpose |
|---------|---------|
| S3 | Document storage (input PDFs, output results) |
| Textract | OCR and document analysis (async) |
| Bedrock | LLM inference (Claude Sonnet, Nova) |
| Step Functions | Workflow orchestration |
| EventBridge | S3 upload triggers |
| SQS | Job progress tracking (FIFO queue) |

Workflow

S3 Upload ({client_id}/{job_id}/input/report.pdf)
    │
    ▼
EventBridge Rule ──► Step Functions
    │
    ▼
Step 1: Preprocess (Lambda)
    └── Resolve S3 paths from input key
    │
    ▼
Step 2: ClassifyReport (Lambda)
    └── Determine report family (ADSI / Status)
    │
    ▼
Step 3: RunTextract (Lambda)
    └── Full document OCR + two-tier caching
    │
    ▼
Step 4: BuildManifest (Lambda)
    └── Page inventory + page count verification
    │
    ▼
Step 5: SplitBatches (Lambda)
    └── Batch pages for parallel processing
    │
    ▼
Step 6: ProcessBatchesMap (Lambda, parallel per batch)
    └── Bedrock extraction
    │
    ▼
Step 7: AggregateResults (Lambda)
    └── Merge batches + record auditing → extraction_result.json
    │
    ▼
Results ──► S3 {client_id}/{job_id}/output/extraction_result.json

Lambda Functions

| Lambda | Purpose |
|--------|---------|
| s3_path_resolver | Resolves output paths from input S3 keys (Preprocess) |
| classify_report | Classifies the report family (ADSI vs Status) using Bedrock |
| run_textract | Runs full-document OCR with two-tier caching (S3 + DynamoDB) |
| build_manifest | Builds a page inventory and validates page counts |
| split_batches | Splits pages into batches for parallel processing |
| aggregate_results | Merges batch outputs, validates coverage, and runs record auditing |
| process_document | Processes PDF pages with Bedrock extraction (Distributed Map) |
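As an illustration of the batching step, the page inventory can be chunked into fixed-size batches for the Distributed Map. This is a hypothetical sketch (the real logic lives in the split_batches handler and may differ):

```python
def split_batches(pages: list[int], batch_size: int) -> list[list[int]]:
    """Chunk a page inventory into fixed-size batches for parallel processing.

    Illustrative sketch only; the deployed split_batches Lambda may differ.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    return [pages[i:i + batch_size] for i in range(0, len(pages), batch_size)]
```

With batch_size 5 (the default profile's setting), an 11-page report would yield three batches of 5, 5, and 1 pages, each processed in parallel by process_document.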

Architecture Diagram

(Architecture diagram image: BlueTailArch)

S3 Bucket Structure

nmd-bluetail-poc-bucket/
└── {client_id}/{job_id}/
    ├── input/
    │   └── report.pdf                     ← Triggers workflow
    ├── cached-results/
    │   ├── classification.json            ← Report family classification result
    │   ├── textract_results/
    │   │   ├── raw.json                   ← Full Textract response
    │   │   └── extracted.txt              ← Human-readable text
    │   ├── page_manifest.json             ← Page inventory + metadata
    │   └── batches/                       ← Per-batch outputs (intermediate)
    └── output/
        └── extraction_result.json         ← Extracted ADs/SBs with coordinates
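The path resolution performed in the Preprocess step can be sketched from the bucket layout above. The function name and return keys here are hypothetical, not the project's actual API:

```python
def resolve_job_paths(input_key: str) -> dict[str, str]:
    """Derive cached-results and output keys from an input key of the form
    '{client_id}/{job_id}/input/report.pdf'. Illustrative sketch only."""
    client_id, job_id = input_key.split("/")[:2]
    prefix = f"{client_id}/{job_id}"
    return {
        "classification": f"{prefix}/cached-results/classification.json",
        "textract_raw": f"{prefix}/cached-results/textract_results/raw.json",
        "extracted_text": f"{prefix}/cached-results/textract_results/extracted.txt",
        "page_manifest": f"{prefix}/cached-results/page_manifest.json",
        "result": f"{prefix}/output/extraction_result.json",
    }
```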

Progress Tracking (SQS)

Each pipeline step emits status events to an SQS FIFO queue so downstream consumers can track job progress in real time.

Queue: bluetail-job-progress.fifo (FIFO with MessageGroupId = job_id)

Event lifecycle for a successful job:

Each step emits exactly two events: STARTED then COMPLETED (or FAILED on error).

STARTED  (Preprocess)          seq=11
COMPLETED (Preprocess)         seq=12
STARTED  (ClassifyReport)      seq=21
COMPLETED (ClassifyReport)     seq=22
STARTED  (Textract)            seq=31
COMPLETED (Textract)           seq=32
STARTED  (BuildManifest)       seq=41
COMPLETED (BuildManifest)      seq=42
STARTED  (SplitBatches)        seq=51
COMPLETED (SplitBatches)       seq=52
STARTED  (ProcessBatches)      seq=61  ← emitted by Step Functions Pass state
COMPLETED (ProcessBatches)     seq=62  ← emitted by Step Functions Pass state
STARTED  (AggregateResults)    seq=71
COMPLETED (AggregateResults)   seq=72

Sequence numbering: step_index * 10 + offset (STARTED=+1, COMPLETED=+2, FAILED=+9). E.g. Preprocess: STARTED=11, COMPLETED=12; ClassifyReport: STARTED=21, COMPLETED=22.
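The sequence rule above can be expressed directly in code:

```python
STATUS_OFFSETS = {"STARTED": 1, "COMPLETED": 2, "FAILED": 9}

def sequence_number(step_index: int, status: str) -> int:
    """step_index * 10 + offset, per the numbering scheme described above."""
    return step_index * 10 + STATUS_OFFSETS[status]
```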

Message schema:

| Field | Type | Description |
|-------|------|-------------|
| job_id | string | Unique job identifier (ULID) |
| client_id | string | Client/tenant identifier |
| status | string | STARTED, COMPLETED, or FAILED |
| step_name | string | Pipeline step name |
| sequence | int | Monotonic sequence for ordering/dedup |
| timestamp | string | ISO 8601 UTC |
| step_index | int or null | 1-based pipeline position |
| total_steps | int or null | Total pipeline steps (currently 7) |
| message | string or null | Human-readable summary |
| details | object or null | Step-specific context |

Design notes:

  • Dedup IDs ({job_id}:{step_name}:{sequence}) handle Step Functions retries
  • ProgressPublisher is fire-and-forget — SQS failures never break the pipeline
  • When PROGRESS_QUEUE_URL is unset (local dev), all emit calls are silent no-ops
  • All emission uses ProgressPublisher from src/common/progress.py (all Lambda handlers share the same code path)
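Putting the design notes together, a publisher along these lines could implement the fire-and-forget behavior. This is a simplified stand-in for src/common/progress.py with an injected SQS client; the real class may differ:

```python
import json
import os
from datetime import datetime, timezone


class ProgressPublisher:
    """Fire-and-forget progress events; a silent no-op when no queue URL is set.

    Simplified stand-in for the project's ProgressPublisher.
    """

    def __init__(self, sqs_client, queue_url=None):
        self._sqs = sqs_client
        self._queue_url = queue_url or os.environ.get("PROGRESS_QUEUE_URL")

    def emit(self, job_id, step_name, status, sequence):
        if not self._queue_url:
            return  # local dev: PROGRESS_QUEUE_URL unset, silently do nothing
        body = {
            "job_id": job_id,
            "step_name": step_name,
            "status": status,
            "sequence": sequence,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
        try:
            self._sqs.send_message(
                QueueUrl=self._queue_url,
                MessageBody=json.dumps(body),
                MessageGroupId=job_id,  # per-job FIFO ordering
                # Dedup ID format from the design notes; absorbs Step Functions retries
                MessageDeduplicationId=f"{job_id}:{step_name}:{sequence}",
            )
        except Exception:
            pass  # SQS failures never break the pipeline
```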

Project Prerequisites

  • Python 3.12+
  • uv package manager
  • AWS CLI configured with BlueTail profile
  • Docker and Docker Compose (potentially for local testing only)
  • Terraform >= 1.5

Setup

There is a Makefile in the root of this project. It defines commands that help with project initialization, running code quality checks via pre-commit hooks, and deploying Terraform.

Makefile Commands

| Command | Description |
|---------|-------------|
| make init | Install uv/prek utilities if missing, validate the Terraform version, run uv sync --all-groups, install pre-commit hooks |
| make check | Run all checks (format, lint, typecheck, test) via pre-commit hooks |
| make build-lambda-artifacts | Build Lambda handler zips and the shared layer for Terraform |
| make deploy-dev | Build Lambda artifacts, then terraform apply in terraform/live/dev |
| make clean | Remove build artifacts, caches, and __pycache__ directories |

Installing project dependencies and activating the python virtual environment

# Initialize project (install tools, deps, pre-commit hooks)
make init

# Activate python virtual environment - necessary to run python code locally
source .venv/bin/activate

Testing

Unit Tests

Unit tests cover the source code and are run with pytest:

# Run all tests
pytest tests/

# Run with coverage
pytest tests/ --cov=src --cov-report=term-missing

AWS Deployment

Deploying to a New Account

More comprehensive Terraform documentation can be found in a separate README.md; it explains the modifications you will want to make before deploying.

Deploy Infrastructure

Assuming you have updated the Terraform S3 backend configuration and any Terraform variables you want to change:

# Initialize the terraform providers, needs to be run the first time and any time providers change
terraform -chdir=terraform/live/dev init

# Run a terraform plan to see what is going to be created
terraform -chdir=terraform/live/dev plan -var-file dev.tfvars

# Build the Lambda handler zips and shared layer (code the Lambdas use); run this before deploying
# Note that `make deploy-dev` runs this step automatically
make build-lambda-artifacts

# Build Lambda artifacts and deploy to AWS; this runs `terraform apply` under the hood
# Note: if you use a *.tfvars file not named dev.tfvars, update this command in the Makefile
make deploy-dev

# Destroy deployed resources
terraform -chdir=terraform/live/dev destroy

Trigger Extraction

Upload a PDF to the S3 input bucket to trigger the Step Functions workflow:

aws s3 cp report.pdf s3://nmd-bluetail-poc-bucket/{client_id}/{job_id}/input/report.pdf --profile <YOUR_AWS_PROFILE_NAME>

Monitor progress in the AWS Step Functions console.

Configuration

Extraction Profiles

Select a profile via the EXTRACTION_PROFILE environment variable or the terraform extraction_profile variable.

Profiles are defined in config/extraction_profiles.yaml

These make "extraction approaches" configurable. When deploying with Terraform, you can set the EXTRACTION_PROFILE environment variable to change your approach or test new ones, although the validation_harness is the better tool for A/B testing.

The profile determines the configuration; the EXTRACTION_PROFILE environment variable (set in .env for local runs, in Terraform for cloud deploys) selects which approach the process uses.

Example of a profile:

profiles:
  default:
    name: "Sonnet Hybrid Batch 5"

    # Model settings
    model_id: "us.anthropic.claude-sonnet-4-5-20250929-v1:0"
    max_tokens: 16000
    max_tokens_multi_page: 32000
    temperature: 0.0

    # Extraction behavior
    type: hybrid
    batch_size: 5
    few_shot: true
    prompt_caching: true

    # Textract settings
    textract_features:
      - TABLES
      - FORMS
      - LAYOUT
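A minimal profile loader could read the YAML above and select the active profile via EXTRACTION_PROFILE. This is an illustrative sketch assuming PyYAML; the project's actual loader in src/common may differ:

```python
import os

import yaml  # PyYAML; assumed available in the project environment


def load_profile(path="config/extraction_profiles.yaml"):
    """Return the profile selected by EXTRACTION_PROFILE (default: 'default').

    Illustrative sketch; the real loader may validate the profile further.
    """
    with open(path) as f:
        profiles = yaml.safe_load(f)["profiles"]
    name = os.environ.get("EXTRACTION_PROFILE", "default")
    return profiles[name]
```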
| Profile | Type | Model | Batch Size | Features |
|---------|------|-------|------------|----------|
| default | hybrid | Claude Sonnet 4.5 | 5 | Few-shot + caching |
| accurate | hybrid | Claude Opus 4.5 | 3 | Few-shot + caching |
| fast | llm | Claude Haiku | 10 | No few-shot |

Extraction Types:

| Type | Description | Notes |
|------|-------------|-------|
| llm | Pure LLM visual extraction from the PDF | |
| hybrid | LLM + Textract (Textract as a spell-checker for accuracy) | |
| textract | Textract-only parsing (no LLM) | Not really used in this project |

Environment Variables

| Variable | Description | Default | Notes |
|----------|-------------|---------|-------|
| AWS_REGION | AWS region | us-west-2 | |
| AWS_PROFILE | AWS credentials profile (local only) | BlueTail | |
| EXTRACTION_PROFILE | Extraction profile name | default | |
| APP_LOG_LEVEL | Log level | INFO | |
| APP_LOG_FORMAT | Log format (json or text) | json | |
| APP_LOGGER_PREFIX | Logger name prefix | src | |
| S3_ENDPOINT_URL | S3 endpoint URL (LocalStack) | - | |
| TEXTRACT_ENDPOINT_URL | Textract endpoint URL (LocalStack) | - | |
| BEDROCK_ENDPOINT_URL | Bedrock endpoint URL (LocalStack) | - | Not supported in LocalStack |
| TEXTRACT_INPUT_BUCKET | S3 bucket for input PDFs | - | Only used for local testing (EventBridge determines this in the cloud deployment) |
| TEXTRACT_OUTPUT_BUCKET | S3 bucket for output results | - | Only used for local testing (EventBridge determines this in the cloud deployment) |
| TEXTRACT_OPERATION_TYPE | Textract operation type | analyze_document | detect_text, analyze_document, or analyze_async |
| TEXTRACT_FEATURES | Textract features | TABLES,FORMS,LAYOUT | Comma-separated |
| PROGRESS_QUEUE_URL | SQS FIFO queue URL for progress events | - | Unset disables progress tracking (safe for local dev) |

Textract Caching

Textract results are cached in two layers to avoid repeated OCR work:

  • Same-job cache: reuses Textract results from earlier steps in the same job
  • Cross-job cache: reuses results across jobs via DynamoDB dedupe

Cache controls:

  • TEXTRACT_CACHE_VERSION - cache invalidation version (bump to force refresh)
  • TEXTRACT_CACHE_TABLE_NAME - DynamoDB table name (unset disables cross-job cache)
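The two-tier lookup can be sketched as follows. The function and its cache arguments are hypothetical, standing in for the same-job S3 cache and the DynamoDB cross-job table:

```python
def get_textract_result(cache_key, job_cache, cross_job_cache, run_ocr):
    """Two-tier cache lookup (hypothetical sketch).

    job_cache stands in for same-job results, cross_job_cache for the
    DynamoDB dedupe table; cache_key would include TEXTRACT_CACHE_VERSION,
    so bumping the version forces a refresh.
    """
    if cache_key in job_cache:          # tier 1: same-job hit
        return job_cache[cache_key]
    if cache_key in cross_job_cache:    # tier 2: cross-job hit, warm tier 1
        job_cache[cache_key] = cross_job_cache[cache_key]
        return job_cache[cache_key]
    result = run_ocr(cache_key)         # miss: run real OCR exactly once
    job_cache[cache_key] = cross_job_cache[cache_key] = result
    return result
```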

Validation Harness

Compare extraction approaches against ground truth for AD/SB extraction accuracy.

Note: the validation harness is a custom application and should not be assumed to be bug-free.

Analysis of the data and careful inspection of results indicate it works correctly, but for continued use, more deliberate development (or a dedicated tool) is recommended.

Quick Start:

# Run extraction with all approaches in manifest
uv run python validation_harness/tracking_app/execute.py --folder 2023_camp_adsi --pages all

# Generate comparison dashboard
uv run python -m validation_harness.tracking_app.cli dashboard

# View results
open validation_harness/.tracking/reports/index.html

Options:

  • --folder - Report folder name (e.g., 2023_camp_adsi)
  • --pages - all, 1, 1-10, or 1,3,5
  • --approach - Run only a specific approach by name

Each validation folder contains:

  • manifest.yaml - Extraction approaches to test
  • report.pdf - Source PDF
  • ground_truth_results.json - Expected results for comparison

See validation_harness/README.md for full documentation.

Web Viewer

A web-based PDF viewer for visualizing extraction results with field-level highlighting.

This requires hooking up and running the src/validator module to map extracted results to locations in the PDF; it is not currently hooked up.

You will need two JSON files: one with the locations in the PDF and one with the output results.

Features:

  • Upload PDF reports and extraction JSON results
  • Click entries to highlight fields in the PDF
  • Color-coded highlights by field type
  • Confidence indicators (High/Medium/Low)

Quick Start:

cd web_viewer
docker-compose up --build
# Open http://localhost:5000

See web_viewer/README.md for full documentation.

Development

Type Checking

mypy src/

Troubleshooting

| Issue | Solution |
|-------|----------|
| AWS credentials error | Check that ~/.aws/credentials has a [BlueTail] profile, or one matching aws_profile in your Terraform configuration |
| Change model/settings | Edit EXTRACTION_PROFILE in .env, or extraction_profile in the Terraform variables |
| Import errors | Run uv sync and activate the venv: source .venv/bin/activate |
| Debug logs | Set APP_LOG_LEVEL=DEBUG in .env |
| Validation failed: terraform/modules/lambda | Run make build-lambda-artifacts to build the Lambda layers |
