- Choose implementation stack (Node.js + Fastify OR Python + FastAPI)
- Scaffold project (deps, scripts, formatter, linter, tests)
- Implement config loader and validation for `.env` and `./config/values.json`
- Ensure directories exist: `./reports`, `./preprocessing`, `./processed`, `./config`
- Add structured logging and basic metrics scaffolding
- Create initial unit tests for utilities (env, fs, json helpers)
- Implement Step 1: cache check vs `.env` `LAST_MODIFIED`
- Implement Step 2: list PDFs in `./reports` and parallelize tasks (bounded)
- Implement Step 3.1–3.6 per-file sequential flow
  - 3.1 Processed file check (`./processed/[filename].json`)
  - 3.2 Preprocessing file check (`./preprocessing/[filename]`)
  - 3.3 Appendix detection via Gemini (start/end pages)
  - 3.4 PDF appendix extraction to `./preprocessing/[filename]`
  - 3.5 Field extraction via Gemini using `./config/values.json`
  - 3.6 Save JSON to `./processed/[filename].json` (validate company name)
- Implement Step 4: completion synchronization and validation of expected outputs
- Centralize error handling and map failures to 500 per PRD
- Implement Step 5: consolidate `./processed/*.json` and write `./processed/result.json`
- Implement Step 6: API endpoint to serve consolidated JSON (omit `LAST_MODIFIED`)
- Add end-to-end tests with fixture PDFs and stubbed Gemini
- Validate idempotency with unchanged `LAST_MODIFIED`
- Add retries with exponential backoff for Gemini calls
- Tune concurrency, timeouts, memory use; stream PDFs where possible
- Add observability: request/file correlation ids, step durations, counts
- Extend negative tests (missing files, invalid PDFs, Gemini errors)
- Document setup/run/deploy in README and create runbook
- (Optional) Add CI workflow to run tests and lint on PRs
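
As a rough illustration of the bounded parallelism in Step 2, here is a minimal sketch. It assumes the Python + FastAPI option is chosen and uses `asyncio`. The names `process_reports`, `process_one`, and `max_concurrency` are hypothetical and do not come from the PRD; `process_one` stands in for the whole per-file Step 3.1–3.6 flow.

```python
import asyncio
from pathlib import Path
from typing import Awaitable, Callable


async def process_reports(
    reports_dir: Path,
    process_one: Callable[[Path], Awaitable[dict]],  # hypothetical per-file flow (Step 3.1-3.6)
    max_concurrency: int = 4,  # assumed default; tune later
) -> list[dict]:
    """Run process_one over every PDF in reports_dir, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(pdf: Path) -> dict:
        # The semaphore bounds how many files are in flight at once.
        async with semaphore:
            return await process_one(pdf)

    # Step 2: list PDFs (case-insensitive extension match), in a stable order.
    pdfs = sorted(p for p in reports_dir.iterdir() if p.suffix.lower() == ".pdf")

    # gather preserves input order and re-raises the first failure,
    # which the caller can then map to an error response.
    return await asyncio.gather(*(guarded(p) for p in pdfs))
```

Each file still runs its own steps sequentially inside `process_one`; only the fan-out across files is concurrent.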
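
The retry item for Gemini calls could look roughly like the sketch below, again assuming Python. It wraps any awaitable call rather than a specific Gemini client method, since the client library is not fixed here. `with_retries` and its parameters are hypothetical names, and the full-jitter delay is one common choice, not something the PRD specifies.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def with_retries(
    call: Callable[[], Awaitable[T]],  # hypothetical wrapper around one Gemini request
    max_attempts: int = 4,   # assumed defaults, not from the PRD
    base_delay: float = 0.5,
    max_delay: float = 8.0,
) -> T:
    """Await call(), retrying failures with exponential backoff and full jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return await call()
        except Exception:
            if attempt == max_attempts:
                raise  # Out of attempts: propagate so the caller can fail the request.
            # The delay cap doubles each attempt (base, 2*base, 4*base, ...) up to max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random amount up to the cap to avoid synchronized retries.
            await asyncio.sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")
```

In practice the `except` clause would likely be narrowed to the transient error types the chosen client raises, so that permanent failures such as invalid input are not retried.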