Changes from all commits (17 commits)
e405af7 docs(spec): add source acquisition and phase documentation updates (jamiethompson, Feb 20, 2026)
3cb8239 feat(pipeline): implement deterministic ingest/build runtime and migr… (jamiethompson, Feb 20, 2026)
b17b626 test(pipeline): add v3 contract coverage and determinism checks (jamiethompson, Feb 20, 2026)
ddb6153 chore(repo): ignore local datasets and python build artifacts (jamiethompson, Feb 20, 2026)
c3f40e6 feat(pipeline): rename LIDS source and harden stage normalization for… (jamiethompson, Feb 21, 2026)
9f2c9f5 docs(agents): add explicit commit workflow and conventional commit rules (jamiethompson, Feb 21, 2026)
f42f397 docs(agents): compact roadmap and add agent onboarding docs (jamiethompson, Feb 21, 2026)
d5101b4 docs(architecture): add exhaustive dataset/stage lineage docs and sco… (jamiethompson, Feb 21, 2026)
fe40449 docs(architecture): add mermaid dataflow diagram for v3 pipeline (jamiethompson, Feb 21, 2026)
f61657b refactor(pipeline): standardize open_lids naming and speed up stage 0… (jamiethompson, Feb 21, 2026)
e8e86b7 docs(architecture): align LIDS stage/candidate naming and pass 0b nor… (jamiethompson, Feb 21, 2026)
64337fc fix(build): escape open_lids LIKE patterns for psycopg placeholder pa… (jamiethompson, Feb 21, 2026)
8802490 perf(build): reduce stage 0b runtime and write overhead (jamiethompson, Feb 21, 2026)
a0b1253 perf(ingest): mark v3 raw tables unlogged for faster development loops (jamiethompson, Feb 21, 2026)
052c0f1 perf(build): reset stage workspace and streamline open_lids stage path (jamiethompson, Feb 21, 2026)
e7659e1 perf(pipeline): stabilize heavy passes and cut temp-spill hot paths (jamiethompson, Feb 22, 2026)
2aef8a2 fix(onspd): propagate post_town/locality and lock contract tests (#3) (jamiethompson, Feb 22, 2026)
15 changes: 15 additions & 0 deletions .gitignore
@@ -0,0 +1,15 @@
/.idea/
/.DS
.DS_Store
**/.DS_Store

# Python caches and local build artifacts
__pycache__/
*.py[cod]
*.egg-info/
.pytest_cache/

# Local datasets and generated source extracts
/data/source_files/real/
/data/source_files/e2e/
/data/source_files/v3_smoke/
327 changes: 61 additions & 266 deletions AGENTS.md
@@ -1,268 +1,63 @@
# AGENTS.md

This repository contains a **data import and transformation pipeline** for UK open datasets.
Its purpose is to produce a reproducible, versioned derived dataset:

UPRN → postcode → inferred street name → confidence score

This file defines behavioural rules, quality standards, and documentation requirements
for any agent contributing to this project.

The priority is **accuracy, provenance, and reproducibility**.

---

## 1. Core Principles

### 1.1 No Guessing
If a dataset field, schema, release identifier, or licence detail is unknown:
- Mark it as **Unknown**
- Add validation logic
- Document the assumption explicitly

Never silently assume structure based on “typical” formats.

---

### 1.2 Reproducibility First
The pipeline must be:
- Deterministic
- Rebuildable from raw inputs
- Fully traceable to dataset release identifiers

If the same inputs are used, outputs must be identical.

No hidden state.
No environment-dependent logic.
No implicit defaults.

---

### 1.3 Raw Data is Sacred
- Raw imports are immutable.
- Transformations must not mutate raw tables.
- Derived outputs must be rebuildable from raw + release metadata.

If you need to correct something, rebuild it — do not patch it.

---

### 1.4 Provenance is Mandatory
Every derived dataset must clearly record:
- Source dataset release identifiers
- Method used
- Computation timestamp

If provenance is not recorded, the output is invalid.

---

### 1.5 Explicit Limitations
Street inference is:
- Heuristic
- Distance-based
- Non-authoritative

Documentation must clearly state this.
Do not imply authoritative delivery-level correctness.

---

## 2. Documentation Requirements

Every meaningful change must include documentation updates.

At minimum:

### 2.1 Dataset Documentation
Maintain a living document describing:
- Each dataset
- Where it is obtained
- Licence type
- Required fields
- Known limitations
- Known schema quirks

If a dataset changes, update the documentation immediately.

---

### 2.2 Data Model Documentation
Maintain clear documentation for:
- Raw tables
- Core tables
- Derived tables
- Metrics tables

Include:
- Field definitions
- Data types
- Constraints
- Semantic meaning

No column should exist without documented purpose.

---

### 2.3 Transform Documentation
For each transformation layer, document:
- Inputs
- Outputs
- Assumptions
- Failure modes
- Determinism guarantees

If logic changes (e.g., confidence thresholds), update documentation and record the change rationale.

---

### 2.4 Metrics Documentation
Define:
- What each metric measures
- How it is calculated
- Why it exists
- Expected ranges

Metrics are part of product quality, not optional extras.

---

## 3. Quality Standards

### 3.1 Deterministic Behaviour
- Stable ordering in queries
- Explicit tie-breaking rules
- No reliance on implicit database ordering
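These rules can be illustrated with a minimal Python sketch; the field names and values below are invented for illustration, not the pipeline's actual schema:

```python
# Deterministic candidate selection: every field in the sort key is explicit,
# so the result never depends on input order or implicit database ordering.
candidates = [
    {"uprn": 100, "distance_m": 12.5, "street": "HIGH STREET"},
    {"uprn": 100, "distance_m": 12.5, "street": "CHURCH LANE"},
    {"uprn": 100, "distance_m": 9.1, "street": "MILL ROAD"},
]

# Primary key: smallest distance; explicit tie-break: street name.
best = sorted(candidates, key=lambda c: (c["distance_m"], c["street"]))[0]
print(best["street"])  # MILL ROAD
```

The same principle applies in SQL: an `ORDER BY` must name every tie-breaking column rather than leaving equal rows in storage order.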

### 3.2 Observability
Each pipeline run must:
- Log row counts per stage
- Log join coverage percentages
- Log resolution percentages
- Log distance percentiles

Silent processing is not acceptable.
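A per-stage logging helper along these lines would satisfy the row-count and coverage requirements; the stage name and counts here are fabricated for illustration:

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("pipeline")

def log_stage_metrics(stage: str, rows_in: int, rows_out: int, joined: int) -> float:
    """Log per-stage row counts and join coverage; returns the coverage %."""
    coverage = 100.0 * joined / rows_out if rows_out else 0.0
    log.info("stage=%s rows_in=%d rows_out=%d join_coverage=%.2f%%",
             stage, rows_in, rows_out, coverage)
    return coverage

# Fabricated example counts for a hypothetical stage "0b".
coverage = log_stage_metrics("0b", 1_000_000, 998_340, 912_114)
```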

---

### 3.3 Fail Fast
If:
- Required columns are missing
- Geometry is invalid
- Coordinate reference systems are inconsistent

The pipeline must fail clearly.

Partial silent success is worse than failure.

---

### 3.4 Schema Validation
Before processing:
- Validate required fields exist
- Validate types where possible
- Record dataset release metadata

Do not infer schema dynamically without documentation.
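A fail-fast column check might look like the following sketch; `REQUIRED_COLUMNS` is a hypothetical set, not the pipeline's real contract:

```python
REQUIRED_COLUMNS = {"uprn", "postcode", "easting", "northing"}  # hypothetical

def validate_columns(header: list[str]) -> None:
    """Raise immediately if required columns are absent; never infer silently."""
    missing = REQUIRED_COLUMNS - set(header)
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")

validate_columns(["uprn", "postcode", "easting", "northing", "extra"])  # passes
try:
    validate_columns(["uprn", "postcode"])
except ValueError as exc:
    error_message = str(exc)
print(error_message)  # names every missing column
```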

---

### 3.5 No Scope Drift
This repository is a **pipeline**, not:
- An API
- A serving layer
- An analytics platform
- A proprietary dataset reconstruction engine

Keep scope disciplined.

---

## 4. Testing Expectations

Agents must ensure:

- Normalisation logic is tested.
- Derived outputs are deterministic.
- Schema validation works.
- Metrics calculations are stable.
- Small fixture datasets validate spatial inference logic.

Tests must:
- Use synthetic or reduced fixture data.
- Not depend on downloading live datasets.
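A determinism check over a tiny synthetic fixture can be as simple as hashing the derived output twice; the fixture rows and the transform below are placeholders, not the pipeline's real logic:

```python
import hashlib
import json

# Synthetic fixture rows: (uprn, postcode) pairs, no live download needed.
FIXTURE = [("100023336956", "SW1A 1AA"), ("10002833374", "EC1A 1BB")]

def build_digest(rows):
    """Placeholder transform: sorted, explicitly keyed output, then hashed."""
    derived = sorted(({"uprn": u, "postcode": p} for u, p in rows),
                     key=lambda r: (r["uprn"], r["postcode"]))
    return hashlib.sha256(json.dumps(derived, sort_keys=True).encode()).hexdigest()

# Same rows in any input order must yield a byte-identical digest.
assert build_digest(FIXTURE) == build_digest(list(reversed(FIXTURE)))
```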

---

## 5. Change Management

Any change to:
- Confidence scoring
- Search radius
- Join logic
- Normalisation rules
- Spatial reference systems

Must include:

1. Rationale
2. Before/after metrics comparison
3. Determinism confirmation
4. Documentation update

---

## 6. What Must Never Be Implemented Here

- Address enumeration features
- Proprietary dataset integration
- Undocumented inference layers
- Hidden optimisation logic
- Behaviour designed for ambiguous or non-transparent use cases

This pipeline exists to:
- Normalise open data
- Join open data
- Derive transparent street-level inference
- Record quality metrics

Nothing more.

---

## 7. Communication Standards

Pull requests must:

- State the problem being solved
- Describe the solution
- Document assumptions
- Include metric impact
- Confirm reproducibility

Avoid vague language such as:
- “Seems to work”
- “Probably correct”
- “Should be fine”

Be precise.

---

## 8. Decision Rule

If a proposed change:
- Reduces transparency,
- Obscures provenance,
- Makes outputs less reproducible,
- Or introduces implicit assumptions,

It should not be merged.

Clarity over cleverness.
Traceability over speed.
Correctness over convenience.

---

End of AGENTS.md
Purpose: this file is the agent entrypoint for this repository.
Use it as a roadmap to the docs, then execute work with strict reproducibility and provenance.

## 1. Start Here (Required Reading Order)
1. `docs/README.md`
2. `docs/agent/start-here.md`
3. `docs/architecture/README.md`
4. `docs/spec/pipeline_v3/spec.md`
5. `docs/spec/pipeline_v3/data_model.md`
6. `docs/spec/pipeline_v3/canonicalisation.md`

If behavior in code differs from spec, treat it as a defect and document the delta.

## 2. Documentation Roadmap
- V3 product/behavior spec: `docs/spec/pipeline_v3/spec.md`
- V3 schema and table contracts: `docs/spec/pipeline_v3/data_model.md`
- Determinism and canonical rules: `docs/spec/pipeline_v3/canonicalisation.md`
- Source acquisition + licensing context: `docs/spec/data_sources.md`
- Agent onboarding: `docs/agent/start-here.md`
- Codebase map: `docs/agent/codebase-map.md`
- Operational runbook (ingest/build/publish): `docs/agent/runbook.md`
- Dataset lineage pages: `docs/architecture/datasets/README.md`
- Stage/pass pages: `docs/architecture/stages/README.md`
- Legacy phase docs (historical only): `docs/spec/phase_1/`, `docs/spec/phase_2-open-names/`

## 3. Non-Negotiable Engineering Rules
- No guessing: unknown fields/semantics must be marked unknown and validated explicitly.
- Reproducibility first: same inputs must produce same outputs.
- Raw data is immutable: never mutate raw source snapshots.
- Provenance is mandatory: derived records must trace to source run(s) and method.
- Deterministic execution: stable ordering + explicit tie-breaks only.
- Fail fast on schema/geometry/CRS issues.
- This repo is a pipeline only; do not add API-serving scope here.

## 4. Change Requirements
For meaningful behavior changes (join logic, scoring, normalization, radius/thresholds, CRS, pass semantics):
1. Update spec/docs in `docs/` in the same change.
2. Never place absolute local filesystem paths in docs; use repository-relative paths.
3. State rationale.
4. Provide before/after metrics or counts where applicable.
5. Confirm determinism impact.
6. Add/adjust tests (fixture-based; no live-download dependency).

This rule is strict: agents must always keep documentation in step with code changes.

## 5. Commit Standards
- Commit at logical checkpoints as work progresses.
- Prefer atomic commits grouped by concern (schema, ingest, transforms, tests, docs).
- Use Conventional Commits for every commit message (`type(scope): summary`).

## 6. Decision Rule
If a change reduces transparency, obscures provenance, weakens reproducibility, or introduces hidden assumptions, do not merge it.

Clarity over cleverness. Traceability over speed. Correctness over convenience.

## 7. Scoped Agent Guides
- Docs scope: `docs/AGENTS.md`
- Pipeline scope: `pipeline/AGENTS.md`
- Runtime code scope: `pipeline/src/pipeline/AGENTS.md`
- Test scope: `tests/AGENTS.md`
- Data/manifest scope: `data/AGENTS.md`
17 changes: 17 additions & 0 deletions data/AGENTS.md
@@ -0,0 +1,17 @@
# data/AGENTS.md

## Scope
Manifests and local source-file conventions under `data/`.

## Critical Rule
Manifest/source contract changes must be reflected in docs (`docs/spec/...` and `docs/architecture/...`) and code (`pipeline/src/pipeline/manifest.py`, `pipeline/config/source_schema.yaml`) together.

## Conventions
- source manifests live under `data/manifests/`
- keep source naming aligned with `pipeline/src/pipeline/manifest.py`
- avoid absolute local paths in documentation; manifests may contain absolute file paths for runtime only
- update bundle manifests when source keys change

## Useful References
- source acquisition: `docs/spec/data_sources.md`
- architecture dataset pages: `docs/architecture/datasets/`
15 changes: 15 additions & 0 deletions data/manifests/e2e/onsud_manifest.json
@@ -0,0 +1,15 @@
{
"dataset_key": "onsud",
"release_id": "2026-Q1-E2E-P2",
"source_url": "https://example.local/onsud-sample",
"licence": "OGL v3.0",
"file_path": "/Users/jamie/code/postcod.es/data/source_files/e2e/onsud_sample.csv",
"expected_sha256": "dfe6e4bc4d4405edc6463fcb1b55929f867d8e7b9907afb92e893a9f8911033f",
"format": "csv",
"column_map": {
"uprn": "ONS_UPRN",
"postcode": "ONS_POSTCODE",
"postcode_unit_easting": "PC_UNIT_E",
"postcode_unit_northing": "PC_UNIT_N"
}
}
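The `expected_sha256` field enables checksum verification before ingest; a minimal sketch of that check (the sample bytes and manifest below are fabricated, not the real file):

```python
import hashlib

def verify_manifest_checksum(manifest: dict, data: bytes) -> bool:
    """Compare file bytes against the manifest's expected_sha256."""
    return hashlib.sha256(data).hexdigest() == manifest["expected_sha256"]

# Fabricated sample: a header-only CSV and a manifest built from its hash.
sample = b"ONS_UPRN,ONS_POSTCODE,PC_UNIT_E,PC_UNIT_N\n"
manifest = {"expected_sha256": hashlib.sha256(sample).hexdigest()}
print(verify_manifest_checksum(manifest, sample))       # True
print(verify_manifest_checksum(manifest, b"tampered"))  # False
```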