
feat(pipeline): deliver v3 deterministic ingest/build with provenance and performance hardening#2

Open
jamiethompson wants to merge 17 commits into main from develop/v3-pipeline-determinism-perf

Conversation

@jamiethompson
Owner

Summary

This PR delivers the V3 pipeline rework as an integrated stack: deterministic multi-pass ingest/build, strict provenance contracts, append-only evidence semantics, exact probability normalization, and API projection materialization for downstream consumers.

It also includes a focused performance hardening pass based on real-volume runs, especially around stage normalization and UPRN reinforcement.

What Changed

  • Introduced V3 schema/migration foundation with raw, stage, core, derived, internal, and api layers.
  • Implemented deterministic pass orchestration (0a, 0b, 1-8) and checkpoint/resume behavior.
  • Added explicit provenance primitives:
    • meta.ingest_run_file
    • relational final provenance links (derived.postcode_streets_final_candidate, derived.postcode_streets_final_source)
    • candidate lineage table (derived.postcode_street_candidate_lineage)
  • Enforced append-only candidate evidence contract in DB via mutation-rejecting triggers.
  • Locked exact probability behavior (formula + deterministic residual correction to exact 1.0000 per postcode).
  • Standardized naming from oli_* / os_open_linked_identifiers to open_lids.
  • Added versioned API projection tables and stable API views.
  • Added/updated docs as a navigable architecture map (datasets, stages, relationships, value-added, agent guidance).
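The append-only evidence contract above can be illustrated with a small trigger sketch. SQLite stands in for Postgres here, and the table and trigger names are invented for the example, not the pipeline's real schema:

```python
import sqlite3

# Mutation-rejecting triggers: INSERT is allowed, UPDATE/DELETE abort.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE candidate_evidence (id INTEGER PRIMARY KEY, payload TEXT);
CREATE TRIGGER candidate_evidence_no_update
BEFORE UPDATE ON candidate_evidence
BEGIN SELECT RAISE(ABORT, 'candidate evidence is append-only'); END;
CREATE TRIGGER candidate_evidence_no_delete
BEFORE DELETE ON candidate_evidence
BEGIN SELECT RAISE(ABORT, 'candidate evidence is append-only'); END;
""")
conn.execute("INSERT INTO candidate_evidence (payload) VALUES ('row1')")  # allowed
rejected = None
try:
    conn.execute("UPDATE candidate_evidence SET payload = 'x'")  # rejected
except sqlite3.IntegrityError as exc:
    rejected = str(exc)
```

Enforcing the contract in the database rather than in application code means no caller, including ad-hoc SQL, can mutate evidence rows.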

Performance Work Included

  • Stage workspace lifecycle hardening (truncate rebuildable stage tables per run).
  • Stage/raw unlogged profile for faster dev rebuild loops.
  • Added stage join indexes for Open Names hot joins.
  • Refactored heavy pass SQL paths to reduce avoidable temp spill and planner drift:
    • post-pass ANALYZE refreshes for stage/core relations
    • pass-local work_mem boosts for heavy passes
    • removed unnecessary ordered upsert work on large stage inserts
    • pass 4 provenance lookup simplified to bundle-level LIDS run id
    • pass 4 postcode resolution moved to normalized stage.onspd_postcode path
  • Pass 8 finalization materialized fully set-based (no row-by-row loops).
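The "no row-by-row loops" point can be sketched as a single set-based statement; sqlite3 stands in for Postgres and the schema is invented for illustration:

```python
import sqlite3

# One INSERT ... SELECT with a window function replaces a per-row
# Python loop: pick the best candidate per postcode, tie-breaking
# deterministically on street name.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE candidate (postcode TEXT, street TEXT, score REAL);
CREATE TABLE final (postcode TEXT PRIMARY KEY, street TEXT, score REAL);
INSERT INTO candidate VALUES
  ('AB1 2CD', 'High St', 0.7), ('AB1 2CD', 'Low St', 0.3),
  ('EF3 4GH', 'Main Rd', 1.0);
""")
conn.execute("""
INSERT INTO final
SELECT postcode, street, score FROM (
  SELECT postcode, street, score,
         ROW_NUMBER() OVER (PARTITION BY postcode
                            ORDER BY score DESC, street) AS rn
  FROM candidate
) WHERE rn = 1
""")
rows = conn.execute("SELECT * FROM final ORDER BY postcode").fetchall()
```

Keeping finalization in one statement lets the planner do the work and removes per-row round trips entirely.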

Why These Choices

  • Traceability first: relational provenance links and immutable candidate evidence make audits straightforward.
  • Determinism first: explicit ordering/tie-break rules plus exact probability closure prevent “drift by implementation detail.”
  • Operational realism: raw/stage architecture and pass checkpoints support replay/debug at source-run granularity.
  • Performance pragmatism: high-volume joins are kept in SQL with controlled planner inputs and indexes.

Test Evidence

Unit/contract coverage

  • .venv/bin/python -m unittest discover tests
  • Result: Ran 14 tests ... OK

Build/verify evidence on real-volume bundle

  • Build resume completed successfully:
    • build_run_id=7a2aa7aa-d76e-4be4-8d1c-947ca41de1d7
    • dataset_version=v3_087b2254bdae
  • Verify completed successfully with canonical hashes:
    • derived_postcode_streets_final
    • api_postcode_street_lookup
    • api_postcode_lookup

Edge Cases Covered

  • Pass 3 TOID promotion is append-only with lineage links (no mutation of base evidence rows).
  • Candidate table immutability contract is enforced at DB level (UPDATE/DELETE rejected).
  • Weight config contract validates positive/known candidate-type weights.
  • Probability contract enforces exact postcode normalization to 1.0000 after rounding + deterministic residual.
  • Stage cleanup contract validated to avoid stale-state accumulation across rebuilds.
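The exact-normalization contract can be sketched as follows; function and key names are illustrative, not the pipeline's API. Probabilities are rounded to four decimal places and the rounding residual is assigned to a deterministically chosen candidate so each postcode sums to exactly 1.0000:

```python
from decimal import Decimal

def normalize_probabilities(weights):
    """Round raw candidate weights to 4-dp probabilities summing to 1.0000.

    The residual left by rounding goes to the first candidate in a
    deterministic sort (highest raw weight, then key), so repeated runs
    always agree.
    """
    total = sum(weights.values())
    ordered = sorted(weights, key=lambda k: (-weights[k], k))
    probs = {
        k: (Decimal(str(weights[k])) / Decimal(str(total))).quantize(Decimal("0.0001"))
        for k in ordered
    }
    residual = Decimal("1.0000") - sum(probs.values())
    probs[ordered[0]] += residual  # deterministic residual correction
    return probs
```

With three equal weights, each rounds to 0.3333 and the 0.0001 residual lands on the deterministically first candidate.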

Reviewer Notes

  • This is intentionally a stacked delivery because the V3 data model, pass logic, and docs are tightly coupled.
  • The currently open PR feat(ids): deterministic human-readable bundle ids with uuid run ids #1 (develop/human-readable-run-ids) is orthogonal and can be merged independently.
  • If helpful, I can split follow-up PRs for additional perf tightening once we run a full cold ingest+build benchmark.

… real-scale builds

- rename source contract from os_open_linked_identifiers to os_open_lids across manifests, build profiles, ingest/build mappings, and tests
- complete raw table naming migration to raw.os_open_lids_row with compatibility migrations for existing databases
- add id_1/id_2 + relation_type staging for LIDS links via stage.oli_identifier_pair
- refactor pass 0a to use ingest metadata row counts instead of expensive raw COUNT scans
- replace fetchall-based stage loaders with streamed/batched ingestion to avoid multi-GB memory spikes on real datasets
- add resilient field resolution aliases so smoke fixtures and real source schemas both validate deterministically
- extend canonical street derivation to infer USRN street names from Open Names + LIDS TOID mappings when direct USRN names are absent

Smoke build validation: bundle 183340b9-b9d8-4667-bcbf-d5f7f953381f rebuilt successfully after these changes.
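The fetchall-to-streaming change mentioned in this commit can be sketched like this; sqlite3 stands in for the real driver and all names are illustrative:

```python
import sqlite3

def iter_batches(cur, size=10_000):
    """Stream rows in fixed-size batches instead of cur.fetchall(),
    keeping memory flat on multi-GB sources. Batch size is illustrative."""
    while True:
        rows = cur.fetchmany(size)
        if not rows:
            return
        yield rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_rows (n INTEGER)")
conn.executemany("INSERT INTO raw_rows VALUES (?)", [(i,) for i in range(25)])
cur = conn.execute("SELECT n FROM raw_rows ORDER BY n")
batch_sizes = [len(b) for b in iter_batches(cur, size=10)]
```

Only one batch is resident at a time, so peak memory is bounded by the batch size rather than the result-set size.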
Codifies the requested workflow:
- commit at logical checkpoints
- prefer atomic commits by concern
- require Conventional Commits format for every commit
- compact AGENTS.md into a concise roadmap-first guide
- add docs index and agent-focused onboarding/runbook/codebase-map docs under docs/
- remove absolute local paths from these docs and enforce relative-path rule in AGENTS
…ped agent guides

- add docs/architecture index with deep links for relationships, stages, and value-added narrative
- add separate dataset pages documenting ingest->stage->transform lineage and value per source
- add separate pass pages (0a..8) documenting inputs, outputs, contracts, and value added
- add scoped AGENTS guides at docs/, pipeline/, pipeline/src/pipeline/, tests/, and data/
- strengthen root AGENTS roadmap with architecture links and strict docs-in-lockstep rule
- normalize historical docs to remove absolute local filesystem paths
- verify local markdown cross-links resolve and enforce relative-path documentation convention
- add a detailed Mermaid flowchart covering source->meta->raw->stage->core->derived->api relationships
- align node names with current v3 table/source naming (including os_open_lids)
- mark optional NI/PPD branches in the diagram
- surface the diagram from architecture index quick links
This keeps the LIDS stage transform fully deterministic while removing avoidable sort work and write amplification.

Changes:
- remove ORDER BY clauses from open_lids stage inserts where ordering does not affect set semantics
- set synchronous_commit=off during build runs to reduce fsync waits on rebuildable stage writes
- mark stage tables UNLOGGED via migration 0012
- update stage 0b docs to reflect deterministic replay/indexing plus UNLOGGED stage behavior
For local development we prioritize ingest+build iteration speed and rely on archived source files for replayability.

Changes:
- add migration 0013 to set active V3 raw tables to UNLOGGED
- document the replay-first durability model in pass 0a, pass 0b, and V3 data model docs
- keep provenance guarantees anchored on ingest manifests and meta.ingest_run_file file hashes
This change targets the stage 0b hotspot that was growing with each rebuild.

What changed:
- stage cleanup now truncates all stage tables at pass 0b start instead of per-build-row deletes
- added a safety guard that blocks truncate when another build is actively in status=started
- open_lids stage transform now writes directly to typed tables (toid/uprn) without persisting a large pair intermediate
- pass checkpoint now records open_lids relation volume as a metric key
- updated architecture/spec docs and relationship diagram to match the new flow
- added a contract test to keep truncate + started-run guard behavior locked in

Why:
- prevents historical stage index/data accumulation from degrading rebuild performance
- removes one high-volume write path while preserving deterministic outputs and provenance contracts
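The truncate-plus-guard behavior can be sketched as follows; SQLite stands in for Postgres, the schema is invented for the example, and DELETE stands in for TRUNCATE (which SQLite lacks):

```python
import sqlite3

def truncate_stage(conn, table):
    """Clear a rebuildable stage table, refusing while any build is
    still in status='started'. Illustrative sketch only."""
    (active,) = conn.execute(
        "SELECT COUNT(*) FROM build_run WHERE status = 'started'"
    ).fetchone()
    if active:
        raise RuntimeError(f"refusing to truncate {table}: {active} build(s) started")
    conn.execute(f"DELETE FROM {table}")  # TRUNCATE equivalent in SQLite

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE build_run (id TEXT, status TEXT);
CREATE TABLE stage_rows (v TEXT);
INSERT INTO stage_rows VALUES ('stale');
INSERT INTO build_run VALUES ('b1', 'started');
""")
blocked = False
try:
    truncate_stage(conn, "stage_rows")  # blocked: a build is started
except RuntimeError:
    blocked = True
conn.execute("UPDATE build_run SET status = 'built'")
truncate_stage(conn, "stage_rows")  # now allowed
remaining = conn.execute("SELECT COUNT(*) FROM stage_rows").fetchone()[0]
```

The guard makes the truncate safe to run unconditionally at pass 0b start while still preventing two concurrent builds from clobbering each other's workspace.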
This commit tackles the long-running build bottlenecks we observed on real-volume data, with a focus on pass 0b and pass 4 performance while preserving deterministic outputs and provenance rules.

What changed:

- add stage join indexes migration for Open Names joins (build_run_id+toid and build_run_id+postcode_norm)

- add reusable ANALYZE helper and refresh stats after stage/core population to improve planner decisions

- raise pass-local work_mem for heavy pass 0b and pass 4 operations to reduce temp-file spill

- remove unnecessary ORDER BY clauses in large stage upserts (open_uprn/nsul)

- refactor pass 4 provenance lookup to use bundle-level os_open_lids ingest_run_id directly

- update pass 4 postcode resolution path via stage.onspd_postcode normalization contract

- document execution-shape/performance behavior updates across stage docs

Validation performed:

- full test suite: .venv/bin/python -m unittest discover tests (14/14 passing)

- resumed real build completed and verify succeeded for build_run_id=7a2aa7aa-d76e-4be4-8d1c-947ca41de1d7
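The pass-local tuning in this commit amounts to statements like these, issued inside the pass's transaction so SET LOCAL reverts automatically at commit; the values and table names are illustrative, not the pipeline's actual settings:

```python
def pass_local_tuning(work_mem="256MB", tables=("stage.open_names",)):
    """Build the planner-tuning statements a heavy pass might run:
    a transaction-scoped work_mem boost plus ANALYZE refreshes so
    the planner sees fresh statistics for the tables just populated."""
    stmts = [f"SET LOCAL work_mem = '{work_mem}'"]
    stmts += [f"ANALYZE {t}" for t in tables]
    return stmts
```

Because SET LOCAL is scoped to the enclosing transaction, the boost cannot leak into other passes or sessions.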

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e7659e12be


lookup_ident = sql.Identifier(lookup_table_name)

with conn.cursor() as cur:
    cur.execute(sql.SQL("DROP TABLE IF EXISTS api.{} CASCADE").format(street_ident))


P1 Badge Stop dropping published API views during pass 8 rebuilds

Pass 8 drops versioned API tables with CASCADE, which also drops dependent stable views when those views currently point at the same dataset version. Re-running build run for an already-published bundle can therefore remove api.postcode_lookup/api.postcode_street_lookup before publish_build recreates them, causing live lookup failures (and leaving them missing if the build fails before publish).


Comment on lines +3202 to +3203
if status not in {"built", "published"}:
    raise BuildError(f"Build run {build_run_id} must be built before publish (status={status})")


P1 Badge Enforce verify-before-publish gate in publish_build

publish_build only checks that the run status is built/published, but it never verifies that verify_build has been executed (for example by requiring canonical hash rows in meta.canonical_hash). This allows operators to publish directly after build run, bypassing the explicit verification step and potentially exposing unverified outputs through the stable API views.


Comment on lines +43 to +46
cleaned = re.sub(r"[^A-Za-z0-9]", "", value).upper()
if not cleaned:
    return None
return cleaned


P2 Badge Validate UK postcode structure in postcode_norm

postcode_norm currently accepts any non-empty alphanumeric token after stripping punctuation, so malformed values (for example short non-postcode tokens) are treated as valid postcodes and flow into joins/final outputs. The V3 canonicalisation spec requires a minimum UK postcode structure, so this implementation can admit invalid keys and degrade result accuracy.
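A minimal structural check along the lines this comment suggests might look like the following; the regex is a coarse outward/inward shape filter (area letters, district, then digit plus two letters), not full PAF validation, and the V3 spec's exact rule may differ:

```python
import re

# Structural shape of a UK postcode after punctuation stripping:
# 1-2 area letters, district digit, optional second district char,
# then inward code: one digit and two letters.
_POSTCODE_RE = re.compile(r"^[A-Z]{1,2}[0-9][0-9A-Z]?[0-9][A-Z]{2}$")

def postcode_norm(value):
    """Normalize a raw postcode, rejecting tokens that lack UK structure."""
    if value is None:
        return None
    cleaned = re.sub(r"[^A-Za-z0-9]", "", value).upper()
    if not _POSTCODE_RE.fullmatch(cleaned):
        return None
    return cleaned
```

Rejecting malformed keys here keeps them out of downstream joins and final outputs entirely.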


Ensure ONSPD passthrough admin fields are not silently dropped during stage normalisation.

Changes:

- add onspd field_map entries for post_town (potown) and locality

- resolve post_town/locality via _field_value(...) so mapped names + aliases are honored

- add contract tests guarding schema mapping + stage-loader extraction paths

- update ONSPD/stage/data-model docs to document passthrough semantics and NULL behavior when absent upstream

Validation:

- python3 -m unittest discover -s tests -p 'test_*.py' (16 tests, all passing)
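The alias-honoring resolution this commit describes can be sketched as a standalone stand-in for the pipeline's _field_value helper; the alias names are illustrative:

```python
def field_value(row, mapped_key, aliases=()):
    """Resolve a source field by its mapped name first, then by known
    aliases, matching case-insensitively so both smoke fixtures and
    real source schemas extract the same value."""
    lowered = {k.lower(): v for k, v in row.items()}
    for key in (mapped_key, *aliases):
        if key.lower() in lowered:
            return lowered[key.lower()]
    return None  # field absent upstream: passthrough stays NULL
```

Returning None when every candidate name is absent matches the documented NULL-when-absent passthrough semantics.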

@jamiethompson
Owner Author

I found blocking issues in this branch as-is.

  1. Required-field validation only checks the first raw row, which is brittle for heterogeneous row shapes and can produce false failures/misses.
  • pipeline/src/pipeline/build/workflows.py:494
  • pipeline/src/pipeline/build/workflows.py:545
  • pipeline/src/pipeline/build/workflows.py:631
  2. Some stage loaders bypass mapped-field fallback by using direct row.get(mapped_key) reads, so alias/case fallback is inconsistently applied and fields can be silently dropped depending on source shape.
  • pipeline/src/pipeline/build/workflows.py:971
  • pipeline/src/pipeline/build/workflows.py:1044
  • pipeline/src/pipeline/build/workflows.py:1128

I verified locally on this commit that tests pass (16/16), but these are still correctness risks in production data. I recommend resolving these before merge.
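A fix for the first-row-only validation might scan every row, along these lines (an illustrative sketch, not the workflows.py code):

```python
def validate_required_fields(rows, required):
    """Check required fields across every raw row, not just the first,
    so heterogeneous row shapes cannot slip past validation.
    Returns {row_index: [missing fields]} for any offending rows."""
    missing = {}
    for i, row in enumerate(rows):
        absent = [f for f in required if row.get(f) in (None, "")]
        if absent:
            missing[i] = absent
    return missing
```

An empty result means every row carried every required field; a non-empty one pinpoints exactly which rows and fields failed.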

