
feat(pipeline): deliver v3 deterministic ingest/build with provenance and performance hardening#2

Open
jamiethompson wants to merge 17 commits into main from develop/v3-pipeline-determinism-perf

Conversation

@jamiethompson
Owner

Summary

This PR delivers the V3 pipeline rework as an integrated stack: deterministic multi-pass ingest/build, strict provenance contracts, append-only evidence semantics, exact probability normalization, and API projection materialization for downstream consumers.

It also includes a focused performance hardening pass based on real-volume runs, especially around stage normalization and UPRN reinforcement.

What Changed

  • Introduced V3 schema/migration foundation with raw, stage, core, derived, internal, and api layers.
  • Implemented deterministic pass orchestration (0a, 0b, 1-8) and checkpoint/resume behavior.
  • Added explicit provenance primitives:
    • meta.ingest_run_file
    • relational final provenance links (derived.postcode_streets_final_candidate, derived.postcode_streets_final_source)
    • candidate lineage table (derived.postcode_street_candidate_lineage)
  • Enforced append-only candidate evidence contract in DB via mutation-rejecting triggers.
  • Locked exact probability behavior (formula + deterministic residual correction to exact 1.0000 per postcode).
  • Standardized naming from oli_* / os_open_linked_identifiers to open_lids.
  • Added versioned API projection tables and stable API views.
  • Added/updated docs as a navigable architecture map (datasets, stages, relationships, value-added, agent guidance).
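The append-only evidence contract above can be illustrated with a small trigger sketch. SQLite stands in for Postgres here, and the table and trigger names are invented for the example, not the pipeline's real schema:

```python
import sqlite3

# Mutation-rejecting triggers: INSERT is allowed, UPDATE/DELETE abort.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE candidate_evidence (id INTEGER PRIMARY KEY, payload TEXT);
CREATE TRIGGER candidate_evidence_no_update
BEFORE UPDATE ON candidate_evidence
BEGIN SELECT RAISE(ABORT, 'candidate evidence is append-only'); END;
CREATE TRIGGER candidate_evidence_no_delete
BEFORE DELETE ON candidate_evidence
BEGIN SELECT RAISE(ABORT, 'candidate evidence is append-only'); END;
""")
conn.execute("INSERT INTO candidate_evidence (payload) VALUES ('row1')")  # allowed
rejected = None
try:
    conn.execute("UPDATE candidate_evidence SET payload = 'x'")  # rejected
except sqlite3.IntegrityError as exc:
    rejected = str(exc)
```

Enforcing the contract in the database rather than in application code means no caller, including ad-hoc SQL, can mutate evidence rows.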

Performance Work Included

  • Stage workspace lifecycle hardening (truncate rebuildable stage tables per run).
  • Stage/raw unlogged profile for faster dev rebuild loops.
  • Added stage join indexes for Open Names hot joins.
  • Refactored heavy pass SQL paths to reduce avoidable temp spill and planner drift:
    • post-pass ANALYZE refreshes for stage/core relations
    • pass-local work_mem boosts for heavy passes
    • removed unnecessary ordered upsert work on large stage inserts
    • pass 4 provenance lookup simplified to bundle-level LIDS run id
    • pass 4 postcode resolution moved to normalized stage.onspd_postcode path
  • Pass 8 finalization materialized fully set-based (no row-by-row loops).
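The "no row-by-row loops" point can be sketched as a single set-based statement; sqlite3 stands in for Postgres and the schema is invented for illustration:

```python
import sqlite3

# One INSERT ... SELECT with a window function replaces a per-row
# Python loop: pick the best candidate per postcode, tie-breaking
# deterministically on street name.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE candidate (postcode TEXT, street TEXT, score REAL);
CREATE TABLE final (postcode TEXT PRIMARY KEY, street TEXT, score REAL);
INSERT INTO candidate VALUES
  ('AB1 2CD', 'High St', 0.7), ('AB1 2CD', 'Low St', 0.3),
  ('EF3 4GH', 'Main Rd', 1.0);
""")
conn.execute("""
INSERT INTO final
SELECT postcode, street, score FROM (
  SELECT postcode, street, score,
         ROW_NUMBER() OVER (PARTITION BY postcode
                            ORDER BY score DESC, street) AS rn
  FROM candidate
) WHERE rn = 1
""")
rows = conn.execute("SELECT * FROM final ORDER BY postcode").fetchall()
```

Keeping finalization in one statement lets the planner do the work and removes per-row round trips entirely.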

Why These Choices

  • Traceability first: relational provenance links and immutable candidate evidence make audits straightforward.
  • Determinism first: explicit ordering/tie-break rules plus exact probability closure prevent “drift by implementation detail.”
  • Operational realism: raw/stage architecture and pass checkpoints support replay/debug at source-run granularity.
  • Performance pragmatism: high-volume joins are kept in SQL with controlled planner inputs and indexes.

Test Evidence

Unit/contract coverage

  • .venv/bin/python -m unittest discover tests
  • Result: Ran 14 tests ... OK

Build/verify evidence on real-volume bundle

  • Build resume completed successfully:
    • build_run_id=7a2aa7aa-d76e-4be4-8d1c-947ca41de1d7
    • dataset_version=v3_087b2254bdae
  • Verify completed successfully with canonical hashes:
    • derived_postcode_streets_final
    • api_postcode_street_lookup
    • api_postcode_lookup

Edge Cases Covered

  • Pass 3 TOID promotion is append-only with lineage links (no mutation of base evidence rows).
  • Candidate table immutability contract is enforced at DB level (UPDATE/DELETE rejected).
  • Weight config contract validates positive/known candidate-type weights.
  • Probability contract enforces exact postcode normalization to 1.0000 after rounding + deterministic residual.
  • Stage cleanup contract validated to avoid stale-state accumulation across rebuilds.
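The exact-normalization contract can be sketched as follows; function and key names are illustrative, not the pipeline's API. Probabilities are rounded to four decimal places and the rounding residual is assigned to a deterministically chosen candidate so each postcode sums to exactly 1.0000:

```python
from decimal import Decimal

def normalize_probabilities(weights):
    """Round raw candidate weights to 4-dp probabilities summing to 1.0000.

    The residual left by rounding goes to the first candidate in a
    deterministic sort (highest raw weight, then key), so repeated runs
    always agree.
    """
    total = sum(weights.values())
    ordered = sorted(weights, key=lambda k: (-weights[k], k))
    probs = {
        k: (Decimal(str(weights[k])) / Decimal(str(total))).quantize(Decimal("0.0001"))
        for k in ordered
    }
    residual = Decimal("1.0000") - sum(probs.values())
    probs[ordered[0]] += residual  # deterministic residual correction
    return probs
```

With three equal weights, each rounds to 0.3333 and the 0.0001 residual lands on the deterministically first candidate.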

Reviewer Notes

  • This is intentionally a stacked delivery because the V3 data model, pass logic, and docs are tightly coupled.
  • The currently open PR feat(ids): deterministic human-readable bundle ids with uuid run ids #1 (develop/human-readable-run-ids) is orthogonal and can be merged independently.
  • If helpful, I can split follow-up PRs for additional perf tightening once we run a full cold ingest+build benchmark.

… real-scale builds

- rename source contract from os_open_linked_identifiers to os_open_lids across manifests, build profiles, ingest/build mappings, and tests
- complete raw table naming migration to raw.os_open_lids_row with compatibility migrations for existing databases
- add id_1/id_2 + relation_type staging for LIDS links via stage.oli_identifier_pair
- refactor pass 0a to use ingest metadata row counts instead of expensive raw COUNT scans
- replace fetchall-based stage loaders with streamed/batched ingestion to avoid multi-GB memory spikes on real datasets
- add resilient field resolution aliases so smoke fixtures and real source schemas both validate deterministically
- extend canonical street derivation to infer USRN street names from Open Names + LIDS TOID mappings when direct USRN names are absent

Smoke build validation: bundle 183340b9-b9d8-4667-bcbf-d5f7f953381f rebuilt successfully after these changes.
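The fetchall-to-streaming change mentioned in this commit can be sketched like this; sqlite3 stands in for the real driver and all names are illustrative:

```python
import sqlite3

def iter_batches(cur, size=10_000):
    """Stream rows in fixed-size batches instead of cur.fetchall(),
    keeping memory flat on multi-GB sources. Batch size is illustrative."""
    while True:
        rows = cur.fetchmany(size)
        if not rows:
            return
        yield rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_rows (n INTEGER)")
conn.executemany("INSERT INTO raw_rows VALUES (?)", [(i,) for i in range(25)])
cur = conn.execute("SELECT n FROM raw_rows ORDER BY n")
batch_sizes = [len(b) for b in iter_batches(cur, size=10)]
```

Only one batch is resident at a time, so peak memory is bounded by the batch size rather than the result-set size.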
Codifies the requested workflow:
- commit at logical checkpoints
- prefer atomic commits by concern
- require Conventional Commits format for every commit
- compact AGENTS.md into a concise roadmap-first guide
- add docs index and agent-focused onboarding/runbook/codebase-map docs under docs/
- remove absolute local paths from these docs and enforce relative-path rule in AGENTS
…ped agent guides

- add docs/architecture index with deep links for relationships, stages, and value-added narrative
- add separate dataset pages documenting ingest->stage->transform lineage and value per source
- add separate pass pages (0a..8) documenting inputs, outputs, contracts, and value added
- add scoped AGENTS guides at docs/, pipeline/, pipeline/src/pipeline/, tests/, and data/
- strengthen root AGENTS roadmap with architecture links and strict docs-in-lockstep rule
- normalize historical docs to remove absolute local filesystem paths
- verify local markdown cross-links resolve and enforce relative-path documentation convention
- add a detailed Mermaid flowchart covering source->meta->raw->stage->core->derived->api relationships
- align node names with current v3 table/source naming (including os_open_lids)
- mark optional NI/PPD branches in the diagram
- surface the diagram from architecture index quick links
This keeps the LIDS stage transform fully deterministic while removing avoidable sort work and write amplification.

Changes:
- remove ORDER BY clauses from open_lids stage inserts where ordering does not affect set semantics
- set synchronous_commit=off during build runs to reduce fsync waits on rebuildable stage writes
- mark stage tables UNLOGGED via migration 0012
- update stage 0b docs to reflect deterministic replay/indexing plus UNLOGGED stage behavior
For local development we prioritize ingest+build iteration speed and rely on archived source files for replayability.

Changes:
- add migration 0013 to set active V3 raw tables to UNLOGGED
- document the replay-first durability model in pass 0a, pass 0b, and V3 data model docs
- keep provenance guarantees anchored on ingest manifests and meta.ingest_run_file file hashes
This change targets the stage 0b hotspot that was growing with each rebuild.

What changed:
- stage cleanup now truncates all stage tables at pass 0b start instead of per-build-row deletes
- added a safety guard that blocks truncate when another build is actively in status=started
- open_lids stage transform now writes directly to typed tables (toid/uprn) without persisting a large pair intermediate
- pass checkpoint now records open_lids relation volume as a metric key
- updated architecture/spec docs and relationship diagram to match the new flow
- added a contract test to keep truncate + started-run guard behavior locked in

Why:
- prevents historical stage index/data accumulation from degrading rebuild performance
- removes one high-volume write path while preserving deterministic outputs and provenance contracts
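The truncate-plus-guard behavior can be sketched as follows; SQLite stands in for Postgres, the schema is invented for the example, and DELETE stands in for TRUNCATE (which SQLite lacks):

```python
import sqlite3

def truncate_stage(conn, table):
    """Clear a rebuildable stage table, refusing while any build is
    still in status='started'. Illustrative sketch only."""
    (active,) = conn.execute(
        "SELECT COUNT(*) FROM build_run WHERE status = 'started'"
    ).fetchone()
    if active:
        raise RuntimeError(f"refusing to truncate {table}: {active} build(s) started")
    conn.execute(f"DELETE FROM {table}")  # TRUNCATE equivalent in SQLite

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE build_run (id TEXT, status TEXT);
CREATE TABLE stage_rows (v TEXT);
INSERT INTO stage_rows VALUES ('stale');
INSERT INTO build_run VALUES ('b1', 'started');
""")
blocked = False
try:
    truncate_stage(conn, "stage_rows")  # blocked: a build is started
except RuntimeError:
    blocked = True
conn.execute("UPDATE build_run SET status = 'built'")
truncate_stage(conn, "stage_rows")  # now allowed
remaining = conn.execute("SELECT COUNT(*) FROM stage_rows").fetchone()[0]
```

The guard makes the truncate safe to run unconditionally at pass 0b start while still preventing two concurrent builds from clobbering each other's workspace.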
This commit tackles the long-running build bottlenecks we observed on real-volume data, with a focus on pass 0b and pass 4 performance while preserving deterministic outputs and provenance rules.

What changed:

- add stage join indexes migration for Open Names joins (build_run_id+toid and build_run_id+postcode_norm)

- add reusable ANALYZE helper and refresh stats after stage/core population to improve planner decisions

- raise pass-local work_mem for heavy pass 0b and pass 4 operations to reduce temp-file spill

- remove unnecessary ORDER BY clauses in large stage upserts (open_uprn/nsul)

- refactor pass 4 provenance lookup to use bundle-level os_open_lids ingest_run_id directly

- update pass 4 postcode resolution path via stage.onspd_postcode normalization contract

- document execution-shape/performance behavior updates across stage docs

Validation performed:

- full test suite: .venv/bin/python -m unittest discover tests (14/14 passing)

- resumed real build completed and verify succeeded for build_run_id=7a2aa7aa-d76e-4be4-8d1c-947ca41de1d7
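The pass-local tuning in this commit amounts to statements like these, issued inside the pass's transaction so SET LOCAL reverts automatically at commit; the values and table names are illustrative, not the pipeline's actual settings:

```python
def pass_local_tuning(work_mem="256MB", tables=("stage.open_names",)):
    """Build the planner-tuning statements a heavy pass might run:
    a transaction-scoped work_mem boost plus ANALYZE refreshes so
    the planner sees fresh statistics for the tables just populated."""
    stmts = [f"SET LOCAL work_mem = '{work_mem}'"]
    stmts += [f"ANALYZE {t}" for t in tables]
    return stmts
```

Because SET LOCAL is scoped to the enclosing transaction, the boost cannot leak into other passes or sessions.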

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e7659e12be


lookup_ident = sql.Identifier(lookup_table_name)

with conn.cursor() as cur:
    cur.execute(sql.SQL("DROP TABLE IF EXISTS api.{} CASCADE").format(street_ident))


P1 Badge Stop dropping published API views during pass 8 rebuilds

Pass 8 drops versioned API tables with CASCADE, which also drops dependent stable views when those views currently point at the same dataset version. Re-running build run for an already-published bundle can therefore remove api.postcode_lookup/api.postcode_street_lookup before publish_build recreates them, causing live lookup failures (and leaving them missing if the build fails before publish).


Comment on lines +3202 to +3203
if status not in {"built", "published"}:
    raise BuildError(f"Build run {build_run_id} must be built before publish (status={status})")


P1 Badge Enforce verify-before-publish gate in publish_build

publish_build only checks that the run status is built/published, but it never verifies that verify_build has been executed (for example by requiring canonical hash rows in meta.canonical_hash). This allows operators to publish directly after build run, bypassing the explicit verification step and potentially exposing unverified outputs through the stable API views.


Comment on lines +43 to +46
cleaned = re.sub(r"[^A-Za-z0-9]", "", value).upper()
if not cleaned:
    return None
return cleaned


P2 Badge Validate UK postcode structure in postcode_norm

postcode_norm currently accepts any non-empty alphanumeric token after stripping punctuation, so malformed values (for example short non-postcode tokens) are treated as valid postcodes and flow into joins/final outputs. The V3 canonicalisation spec requires a minimum UK postcode structure, so this implementation can admit invalid keys and degrade result accuracy.
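A minimal structural check along the lines this comment suggests might look like the following; the regex is a coarse outward/inward shape filter (area letters, district, then digit plus two letters), not full PAF validation, and the V3 spec's exact rule may differ:

```python
import re

# Structural shape of a UK postcode after punctuation stripping:
# 1-2 area letters, district digit, optional second district char,
# then inward code: one digit and two letters.
_POSTCODE_RE = re.compile(r"^[A-Z]{1,2}[0-9][0-9A-Z]?[0-9][A-Z]{2}$")

def postcode_norm(value):
    """Normalize a raw postcode, rejecting tokens that lack UK structure."""
    if value is None:
        return None
    cleaned = re.sub(r"[^A-Za-z0-9]", "", value).upper()
    if not _POSTCODE_RE.fullmatch(cleaned):
        return None
    return cleaned
```

Rejecting malformed keys here keeps them out of downstream joins and final outputs entirely.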


Ensure ONSPD passthrough admin fields are not silently dropped during stage normalisation.

Changes:

- add onspd field_map entries for post_town (potown) and locality

- resolve post_town/locality via _field_value(...) so mapped names + aliases are honored

- add contract tests guarding schema mapping + stage-loader extraction paths

- update ONSPD/stage/data-model docs to document passthrough semantics and NULL behavior when absent upstream

Validation:

- python3 -m unittest discover -s tests -p 'test_*.py' (16 tests, all passing)
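The alias-honoring resolution this commit describes can be sketched as a standalone stand-in for the pipeline's _field_value helper; the alias names are illustrative:

```python
def field_value(row, mapped_key, aliases=()):
    """Resolve a source field by its mapped name first, then by known
    aliases, matching case-insensitively so both smoke fixtures and
    real source schemas extract the same value."""
    lowered = {k.lower(): v for k, v in row.items()}
    for key in (mapped_key, *aliases):
        if key.lower() in lowered:
            return lowered[key.lower()]
    return None  # field absent upstream: passthrough stays NULL
```

Returning None when every candidate name is absent matches the documented NULL-when-absent passthrough semantics.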

@jamiethompson
Owner Author

I found blocking issues in this branch as-is.

  1. Required-field validation only checks the first raw row, which is brittle for heterogeneous row shapes and can produce false failures/misses.
  • pipeline/src/pipeline/build/workflows.py:494
  • pipeline/src/pipeline/build/workflows.py:545
  • pipeline/src/pipeline/build/workflows.py:631
  2. Some stage loaders bypass mapped-field fallback by using direct row.get(mapped_key) reads, so alias/case fallback is inconsistently applied and fields can be silently dropped depending on source shape.
  • pipeline/src/pipeline/build/workflows.py:971
  • pipeline/src/pipeline/build/workflows.py:1044
  • pipeline/src/pipeline/build/workflows.py:1128

I verified locally on this commit that tests pass (16/16), but these are still correctness risks in production data. I recommend resolving these before merge.
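A fix for the first-row-only validation might scan every row, along these lines (an illustrative sketch, not the workflows.py code):

```python
def validate_required_fields(rows, required):
    """Check required fields across every raw row, not just the first,
    so heterogeneous row shapes cannot slip past validation.
    Returns {row_index: [missing fields]} for any offending rows."""
    missing = {}
    for i, row in enumerate(rows):
        absent = [f for f in required if row.get(f) in (None, "")]
        if absent:
            missing[i] = absent
    return missing
```

An empty result means every row carried every required field; a non-empty one pinpoints exactly which rows and fields failed.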

