feat(pipeline): deliver v3 deterministic ingest/build with provenance and performance hardening #2

jamiethompson wants to merge 17 commits into main from
Conversation
… real-scale builds

- rename source contract from os_open_linked_identifiers to os_open_lids across manifests, build profiles, ingest/build mappings, and tests
- complete raw table naming migration to raw.os_open_lids_row with compatibility migrations for existing databases
- add id_1/id_2 + relation_type staging for LIDS links via stage.oli_identifier_pair
- refactor pass 0a to use ingest metadata row counts instead of expensive raw COUNT scans
- replace fetchall-based stage loaders with streamed/batched ingestion to avoid multi-GB memory spikes on real datasets
- add resilient field resolution aliases so smoke fixtures and real source schemas both validate deterministically
- extend canonical street derivation to infer USRN street names from Open Names + LIDS TOID mappings when direct USRN names are absent

Smoke build validation: bundle 183340b9-b9d8-4667-bcbf-d5f7f953381f rebuilt successfully after these changes.
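The streamed/batched loader change above can be sketched as follows. This is an illustrative sketch, not the pipeline's actual helper: `batched_rows`, the batch size, and the commented cursor usage are all assumed names.

```python
from itertools import islice
from typing import Iterable, Iterator, List

def batched_rows(rows: Iterable, batch_size: int = 10_000) -> Iterator[List]:
    """Yield fixed-size batches from a row iterator without materialising
    the whole result set in memory (unlike cursor.fetchall())."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# With a server-side (named) cursor, ingestion would look roughly like:
#   with conn.cursor(name="ingest_stream") as cur:
#       cur.itersize = 10_000
#       cur.execute("SELECT ... FROM raw.os_open_lids_row")
#       for batch in batched_rows(cur):
#           insert_batch(stage_cur, batch)   # hypothetical insert helper
```

Because batches are built lazily from the iterator, peak memory is bounded by `batch_size` rows rather than the full table.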
Codifies the requested workflow:

- commit at logical checkpoints
- prefer atomic commits by concern
- require Conventional Commits format for every commit
- compact AGENTS.md into a concise roadmap-first guide
- add docs index and agent-focused onboarding/runbook/codebase-map docs under docs/
- remove absolute local paths from these docs and enforce relative-path rule in AGENTS
…ped agent guides

- add docs/architecture index with deep links for relationships, stages, and value-added narrative
- add separate dataset pages documenting ingest->stage->transform lineage and value per source
- add separate pass pages (0a..8) documenting inputs, outputs, contracts, and value added
- add scoped AGENTS guides at docs/, pipeline/, pipeline/src/pipeline/, tests/, and data/
- strengthen root AGENTS roadmap with architecture links and strict docs-in-lockstep rule
- normalize historical docs to remove absolute local filesystem paths
- verify local markdown cross-links resolve and enforce relative-path documentation convention
- add a detailed Mermaid flowchart covering source->meta->raw->stage->core->derived->api relationships
- align node names with current v3 table/source naming (including os_open_lids)
- mark optional NI/PPD branches in the diagram
- surface the diagram from architecture index quick links
…b heavy-source normalization
This keeps the LIDS stage transform fully deterministic while removing avoidable sort work and write amplification. Changes:

- remove ORDER BY clauses from open_lids stage inserts where ordering does not affect set semantics
- set synchronous_commit=off during build runs to reduce fsync waits on rebuildable stage writes
- mark stage tables UNLOGGED via migration 0012
- update stage 0b docs to reflect deterministic replay/indexing plus UNLOGGED stage behavior
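A minimal sketch of the durability trade-off described above. The SQL forms (`ALTER TABLE … SET UNLOGGED`, `SET synchronous_commit = off`) are standard PostgreSQL; the helper functions and table names are illustrative assumptions, not the migration's actual code.

```python
# Illustrative stage table names -- the real list lives in the migration.
STAGE_TABLES = ["oli_identifier_pair", "onspd_postcode"]

def unlogged_migration_sql(tables):
    """Emit ALTER statements marking rebuildable stage tables UNLOGGED,
    skipping WAL writes at the cost of losing the tables on a crash --
    acceptable here because stage data is replayed from archived sources."""
    return [f"ALTER TABLE stage.{t} SET UNLOGGED;" for t in tables]

def session_speedup_sql():
    """Per-build-session setting: commit returns before the WAL fsync,
    trading a small durability window for much lower commit latency."""
    return "SET synchronous_commit = off;"
```

A build runner would execute `session_speedup_sql()` once per connection before the stage passes, while `unlogged_migration_sql(...)` runs once as migration 0012.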
For local development we prioritize ingest+build iteration speed and rely on archived source files for replayability. Changes:

- add migration 0013 to set active V3 raw tables to UNLOGGED
- document the replay-first durability model in pass 0a, pass 0b, and V3 data model docs
- keep provenance guarantees anchored on ingest manifests and meta.ingest_run_file file hashes
This change targets the stage 0b hotspot that was growing with each rebuild.

What changed:

- stage cleanup now truncates all stage tables at pass 0b start instead of per-build-row deletes
- added a safety guard that blocks truncate when another build is actively in status=started
- open_lids stage transform now writes directly to typed tables (toid/uprn) without persisting a large pair intermediate
- pass checkpoint now records open_lids relation volume as a metric key ()
- updated architecture/spec docs and relationship diagram to match the new flow
- added a contract test to keep truncate + started-run guard behavior locked in

Why:

- prevents historical stage index/data accumulation from degrading rebuild performance
- removes one high-volume write path while preserving deterministic outputs and provenance contracts
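The truncate safety guard can be sketched like this; `guard_stage_truncate` and the status values passed to it are assumptions for illustration (the commit's contract test pins the real behavior):

```python
def guard_stage_truncate(active_statuses):
    """Refuse to TRUNCATE shared stage tables while any other build run is
    mid-flight: truncation is table-wide, so it would destroy the in-flight
    run's rows, unlike the old per-build-row deletes."""
    if any(status == "started" for status in active_statuses):
        raise RuntimeError(
            "another build run is in status=started; refusing to truncate stage tables"
        )
```

The caller would feed it the statuses of all other build runs queried at pass 0b start, and only proceed to `TRUNCATE stage.*` when the guard returns without raising.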
This commit tackles the long-running build bottlenecks we observed on real-volume data, with a focus on pass 0b and pass 4 performance while preserving deterministic outputs and provenance rules.

What changed:

- add stage join indexes migration for Open Names joins (build_run_id+toid and build_run_id+postcode_norm)
- add reusable ANALYZE helper and refresh stats after stage/core population to improve planner decisions
- raise pass-local work_mem for heavy pass 0b and pass 4 operations to reduce temp-file spill
- remove unnecessary ORDER BY clauses in large stage upserts (open_uprn/nsul)
- refactor pass 4 provenance lookup to use bundle-level os_open_lids ingest_run_id directly
- update pass 4 postcode resolution path via stage.onspd_postcode normalization contract
- document execution-shape/performance behavior updates across stage docs

Validation performed:

- full test suite: .venv/bin/python -m unittest discover tests (14/14 passing)
- resumed real build completed and verify succeeded for build_run_id=7a2aa7aa-d76e-4be4-8d1c-947ca41de1d7
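The ANALYZE helper and pass-local work_mem boost might look roughly like this; the function names, the 256MB figure, and the relation list are illustrative assumptions, though `ANALYZE` and transaction-scoped `SET LOCAL work_mem` are standard PostgreSQL mechanisms:

```python
def analyze_sql(relations):
    """Refresh planner statistics after bulk-populating stage/core tables,
    so subsequent joins get sensible row estimates instead of stale ones."""
    return [f"ANALYZE {rel};" for rel in relations]

def pass_local_work_mem_sql(mb):
    """SET LOCAL scopes the override to the current transaction, so only
    the heavy pass gets the larger sort/hash budget; other sessions keep
    the server default and temp-file spill is reduced where it matters."""
    return f"SET LOCAL work_mem = '{mb}MB';"
```

A pass runner would issue `pass_local_work_mem_sql(256)` inside the pass 0b/pass 4 transaction and run `analyze_sql([...])` once the population step commits.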
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e7659e12be
    lookup_ident = sql.Identifier(lookup_table_name)

    with conn.cursor() as cur:
        cur.execute(sql.SQL("DROP TABLE IF EXISTS api.{} CASCADE").format(street_ident))
Stop dropping published API views during pass 8 rebuilds
Pass 8 drops versioned API tables with CASCADE, which also drops dependent stable views when those views currently point at the same dataset version. Re-running a build for an already-published bundle can therefore remove api.postcode_lookup/api.postcode_street_lookup before publish_build recreates them, causing live lookup failures (and leaving the views missing entirely if the build fails before publish).
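One possible mitigation, sketched here as a suggestion rather than the PR's actual fix: drop with RESTRICT (PostgreSQL's default drop behavior) so a dependent stable view makes the drop fail loudly instead of being cascaded away. The helper name and quoting style are assumptions.

```python
def safe_drop_sql(schema, table):
    """Build a DROP that refuses to remove dependent objects: if a stable
    published view still points at this versioned table, the statement
    errors out instead of silently taking the view down with it."""
    return f'DROP TABLE IF EXISTS {schema}."{table}" RESTRICT;'
```

Pass 8 could then catch the dependency error and either skip the drop or repoint the stable views first, keeping the published API intact throughout a rebuild.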
    if status not in {"built", "published"}:
        raise BuildError(f"Build run {build_run_id} must be built before publish (status={status})")
Enforce verify-before-publish gate in publish_build
publish_build only checks that the run status is built/published, but it never verifies that verify_build has actually been executed (for example by requiring canonical hash rows in meta.canonical_hash). This allows operators to publish directly after a build run, bypassing the explicit verification step and potentially exposing unverified outputs through the stable API views.
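A sketch of the gate this comment asks for, under the assumption that verify_build leaves rows in meta.canonical_hash as its receipt; the function name and error type are illustrative:

```python
class BuildError(Exception):
    """Raised when a build run fails a publish precondition."""

def require_verified(canonical_hash_rows, build_run_id):
    """Publish gate: treat rows in meta.canonical_hash as proof that
    verify_build ran for this build, and refuse to publish without them."""
    if not canonical_hash_rows:
        raise BuildError(
            f"Build run {build_run_id} has no canonical hash rows; "
            "run verify_build before publish"
        )
```

publish_build would call this with the result of a `SELECT ... FROM meta.canonical_hash WHERE build_run_id = %s` query, alongside the existing status check.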
    cleaned = re.sub(r"[^A-Za-z0-9]", "", value).upper()
    if not cleaned:
        return None
    return cleaned
Validate UK postcode structure in postcode_norm
postcode_norm currently accepts any non-empty alphanumeric token after stripping punctuation, so malformed values (for example short non-postcode tokens) are treated as valid postcodes and flow into joins/final outputs. The V3 canonicalisation spec requires a minimum UK postcode structure, so this implementation can admit invalid keys and degrade result accuracy.
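A hedged sketch of the structural check this comment suggests. The regex below encodes a common minimal UK-postcode shape (1-2 letter area, digit, optional alphanumeric, then digit plus two letters) and is an assumption, not necessarily the exact rule the V3 canonicalisation spec mandates:

```python
import re

# Minimal no-space UK postcode shape: outward code (e.g. SW1A, M1)
# followed by inward code (digit + two letters, e.g. 1AA).
_POSTCODE_RE = re.compile(r"^[A-Z]{1,2}[0-9][0-9A-Z]?[0-9][A-Z]{2}$")

def postcode_norm(value):
    """Strip punctuation/whitespace, uppercase, then reject tokens that do
    not match the minimal UK postcode structure -- so short junk tokens
    return None instead of flowing into joins as phantom keys."""
    cleaned = re.sub(r"[^A-Za-z0-9]", "", value or "").upper()
    if not cleaned or not _POSTCODE_RE.match(cleaned):
        return None
    return cleaned
```

With this shape check, a value like "AB" or "n/a" normalises to None rather than surviving as a fake postcode key.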
Ensure ONSPD passthrough admin fields are not silently dropped during stage normalisation. Changes:

- add onspd field_map entries for post_town (potown) and locality
- resolve post_town/locality via _field_value(...) so mapped names + aliases are honored
- add contract tests guarding schema mapping + stage-loader extraction paths
- update ONSPD/stage/data-model docs to document passthrough semantics and NULL behavior when absent upstream

Validation:

- python3 -m unittest discover -s tests -p 'test_*.py' (16 tests, all passing)
I found blocking issues in this branch as-is.
I verified locally on this commit that tests pass.
Summary
This PR delivers the V3 pipeline rework as an integrated stack: deterministic multi-pass ingest/build, strict provenance contracts, append-only evidence semantics, exact probability normalization, and API projection materialization for downstream consumers.
It also includes a focused performance hardening pass based on real-volume runs, especially around stage normalization and UPRN reinforcement.
What Changed
- raw, stage, core, derived, internal, and api layers.
- 0a, 0b, 1-8) and checkpoint/resume behavior.
- meta.ingest_run_file
- derived.postcode_streets_final_candidate, derived.postcode_streets_final_source)
- derived.postcode_street_candidate_lineage)
- 1.0000 per postcode).
- oli_*/os_open_linked_identifiers to open_lids.

Performance Work Included
- ANALYZE refreshes for stage/core relations
- work_mem boosts for heavy passes
- stage.onspd_postcode path

Why These Choices
Test Evidence
Unit/contract coverage
- .venv/bin/python -m unittest discover tests
- Ran 14 tests ... OK

Build/verify evidence on real-volume bundle
- build_run_id=7a2aa7aa-d76e-4be4-8d1c-947ca41de1d7
- dataset_version=v3_087b2254bdae
- derived_postcode_streets_final
- api_postcode_street_lookup
- api_postcode_lookup

Edge Cases Covered
- UPDATE/DELETE rejected).
- 1.0000 after rounding + deterministic residual.
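The "sums to exactly 1.0000 after rounding + deterministic residual" behavior can be sketched as follows. This is an assumed implementation for illustration: the 4dp precision matches the 1.0000 figure above, but the rounding mode and the largest-share/lowest-index tie-break are my assumptions, not necessarily the pipeline's exact rule.

```python
from decimal import Decimal, ROUND_HALF_EVEN

def normalise_probs(weights):
    """Scale positive weights so the rounded shares sum to exactly 1.0000
    at 4dp: quantize each share, then fold the leftover residual into the
    largest share (ties broken deterministically by lowest index)."""
    total = sum(weights)
    q = Decimal("0.0001")
    shares = [
        (Decimal(w) / Decimal(total)).quantize(q, rounding=ROUND_HALF_EVEN)
        for w in weights
    ]
    residual = Decimal("1.0000") - sum(shares)
    if residual:
        # max share wins the residual; -i prefers the earliest index on ties
        idx = max(range(len(shares)), key=lambda i: (shares[i], -i))
        shares[idx] += residual
    return shares
```

Because every step is exact decimal arithmetic with a fixed tie-break, re-running the normalization over the same candidate set always yields identical rows, which is what lets the canonical hashes stay stable across rebuilds.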
develop/human-readable-run-ids) is orthogonal and can be merged independently.