Changes from all commits (17 commits)
e405af7 docs(spec): add source acquisition and phase documentation updates (jamiethompson, Feb 20, 2026)
3cb8239 feat(pipeline): implement deterministic ingest/build runtime and migr… (jamiethompson, Feb 20, 2026)
b17b626 test(pipeline): add v3 contract coverage and determinism checks (jamiethompson, Feb 20, 2026)
ddb6153 chore(repo): ignore local datasets and python build artifacts (jamiethompson, Feb 20, 2026)
c3f40e6 feat(pipeline): rename LIDS source and harden stage normalization for… (jamiethompson, Feb 21, 2026)
9f2c9f5 docs(agents): add explicit commit workflow and conventional commit rules (jamiethompson, Feb 21, 2026)
f42f397 docs(agents): compact roadmap and add agent onboarding docs (jamiethompson, Feb 21, 2026)
d5101b4 docs(architecture): add exhaustive dataset/stage lineage docs and sco… (jamiethompson, Feb 21, 2026)
fe40449 docs(architecture): add mermaid dataflow diagram for v3 pipeline (jamiethompson, Feb 21, 2026)
f61657b refactor(pipeline): standardize open_lids naming and speed up stage 0… (jamiethompson, Feb 21, 2026)
e8e86b7 docs(architecture): align LIDS stage/candidate naming and pass 0b nor… (jamiethompson, Feb 21, 2026)
64337fc fix(build): escape open_lids LIKE patterns for psycopg placeholder pa… (jamiethompson, Feb 21, 2026)
8802490 perf(build): reduce stage 0b runtime and write overhead (jamiethompson, Feb 21, 2026)
a0b1253 perf(ingest): mark v3 raw tables unlogged for faster development loops (jamiethompson, Feb 21, 2026)
052c0f1 perf(build): reset stage workspace and streamline open_lids stage path (jamiethompson, Feb 21, 2026)
e7659e1 perf(pipeline): stabilize heavy passes and cut temp-spill hot paths (jamiethompson, Feb 22, 2026)
2aef8a2 fix(onspd): propagate post_town/locality and lock contract tests (#3) (jamiethompson, Feb 22, 2026)
15 changes: 15 additions & 0 deletions .gitignore
@@ -0,0 +1,15 @@
/.idea/
/.DS
.DS_Store
**/.DS_Store

# Python caches and local build artifacts
__pycache__/
*.py[cod]
*.egg-info/
.pytest_cache/

# Local datasets and generated source extracts
/data/source_files/real/
/data/source_files/e2e/
/data/source_files/v3_smoke/
327 changes: 61 additions & 266 deletions AGENTS.md
@@ -1,268 +1,63 @@
# AGENTS.md

This repository contains a **data import and transformation pipeline** for UK open datasets.
Its purpose is to produce a reproducible, versioned derived dataset:

UPRN → postcode → inferred street name → confidence score

This file defines behavioural rules, quality standards, and documentation requirements
for any agent contributing to this project.

The priority is **accuracy, provenance, and reproducibility**.

---

## 1. Core Principles

### 1.1 No Guessing
If a dataset field, schema, release identifier, or licence detail is unknown:
- Mark it as **Unknown**
- Add validation logic
- Document the assumption explicitly

Never silently assume structure based on “typical” formats.

---

### 1.2 Reproducibility First
The pipeline must be:
- Deterministic
- Rebuildable from raw inputs
- Fully traceable to dataset release identifiers

If the same inputs are used, outputs must be identical.

No hidden state.
No environment-dependent logic.
No implicit defaults.

---

### 1.3 Raw Data is Sacred
- Raw imports are immutable.
- Transformations must not mutate raw tables.
- Derived outputs must be rebuildable from raw + release metadata.

If you need to correct something, rebuild it — do not patch it.

---

### 1.4 Provenance is Mandatory
Every derived dataset must clearly record:
- Source dataset release identifiers
- Method used
- Computation timestamp

If provenance is not recorded, the output is invalid.

---

### 1.5 Explicit Limitations
Street inference is:
- Heuristic
- Distance-based
- Non-authoritative

Documentation must clearly state this.
Do not imply authoritative delivery-level correctness.

---

## 2. Documentation Requirements

Every meaningful change must include documentation updates.

At minimum:

### 2.1 Dataset Documentation
Maintain a living document describing:
- Each dataset
- Where it is obtained
- Licence type
- Required fields
- Known limitations
- Known schema quirks

If a dataset changes, update the documentation immediately.

---

### 2.2 Data Model Documentation
Maintain clear documentation for:
- Raw tables
- Core tables
- Derived tables
- Metrics tables

Include:
- Field definitions
- Data types
- Constraints
- Semantic meaning

No column should exist without documented purpose.

---

### 2.3 Transform Documentation
For each transformation layer, document:
- Inputs
- Outputs
- Assumptions
- Failure modes
- Determinism guarantees

If logic changes (e.g., confidence thresholds), update documentation and record the change rationale.

---

### 2.4 Metrics Documentation
Define:
- What each metric measures
- How it is calculated
- Why it exists
- Expected ranges

Metrics are part of product quality, not optional extras.

---

## 3. Quality Standards

### 3.1 Deterministic Behaviour
- Stable ordering in queries
- Explicit tie-breaking rules
- No reliance on implicit database ordering
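These rules can be illustrated with a minimal Python sketch; the field names and values below are invented for illustration, not the pipeline's actual schema:

```python
# Deterministic candidate selection: every field in the sort key is explicit,
# so the result never depends on input order or implicit database ordering.
candidates = [
    {"uprn": 100, "distance_m": 12.5, "street": "HIGH STREET"},
    {"uprn": 100, "distance_m": 12.5, "street": "CHURCH LANE"},
    {"uprn": 100, "distance_m": 9.1, "street": "MILL ROAD"},
]

# Primary key: smallest distance; explicit tie-break: street name.
best = sorted(candidates, key=lambda c: (c["distance_m"], c["street"]))[0]
print(best["street"])  # MILL ROAD
```

The same principle applies in SQL: an `ORDER BY` must name every tie-breaking column rather than leaving equal rows in storage order.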

### 3.2 Observability
Each pipeline run must:
- Log row counts per stage
- Log join coverage percentages
- Log resolution percentages
- Log distance percentiles

Silent processing is not acceptable.
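A per-stage logging helper along these lines would satisfy the row-count and coverage requirements; the stage name and counts here are fabricated for illustration:

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("pipeline")

def log_stage_metrics(stage: str, rows_in: int, rows_out: int, joined: int) -> float:
    """Log per-stage row counts and join coverage; returns the coverage %."""
    coverage = 100.0 * joined / rows_out if rows_out else 0.0
    log.info("stage=%s rows_in=%d rows_out=%d join_coverage=%.2f%%",
             stage, rows_in, rows_out, coverage)
    return coverage

# Fabricated example counts for a hypothetical stage "0b".
coverage = log_stage_metrics("0b", 1_000_000, 998_340, 912_114)
```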

---

### 3.3 Fail Fast
If:
- Required columns are missing
- Geometry is invalid
- Coordinate reference systems are inconsistent

The pipeline must fail clearly.

Partial silent success is worse than failure.

---

### 3.4 Schema Validation
Before processing:
- Validate required fields exist
- Validate types where possible
- Record dataset release metadata

Do not infer schema dynamically without documentation.
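A fail-fast column check might look like the following sketch; `REQUIRED_COLUMNS` is a hypothetical set, not the pipeline's real contract:

```python
REQUIRED_COLUMNS = {"uprn", "postcode", "easting", "northing"}  # hypothetical

def validate_columns(header: list[str]) -> None:
    """Raise immediately if required columns are absent; never infer silently."""
    missing = REQUIRED_COLUMNS - set(header)
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")

validate_columns(["uprn", "postcode", "easting", "northing", "extra"])  # passes
try:
    validate_columns(["uprn", "postcode"])
except ValueError as exc:
    error_message = str(exc)
print(error_message)  # names every missing column
```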

---

### 3.5 No Scope Drift
This repository is a **pipeline**, not:
- An API
- A serving layer
- An analytics platform
- A proprietary dataset reconstruction engine

Keep scope disciplined.

---

## 4. Testing Expectations

Agents must ensure:

- Normalisation logic is tested.
- Derived outputs are deterministic.
- Schema validation works.
- Metrics calculations are stable.
- Small fixture datasets validate spatial inference logic.

Tests must:
- Use synthetic or reduced fixture data.
- Not depend on downloading live datasets.
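A determinism check over a tiny synthetic fixture can be as simple as hashing the derived output twice; the fixture rows and the transform below are placeholders, not the pipeline's real logic:

```python
import hashlib
import json

# Synthetic fixture rows: (uprn, postcode) pairs, no live download needed.
FIXTURE = [("100023336956", "SW1A 1AA"), ("10002833374", "EC1A 1BB")]

def build_digest(rows):
    """Placeholder transform: sorted, explicitly keyed output, then hashed."""
    derived = sorted(({"uprn": u, "postcode": p} for u, p in rows),
                     key=lambda r: (r["uprn"], r["postcode"]))
    return hashlib.sha256(json.dumps(derived, sort_keys=True).encode()).hexdigest()

# Same rows in any input order must yield a byte-identical digest.
assert build_digest(FIXTURE) == build_digest(list(reversed(FIXTURE)))
```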

---

## 5. Change Management

Any change to:
- Confidence scoring
- Search radius
- Join logic
- Normalisation rules
- Spatial reference systems

Must include:

1. Rationale
2. Before/after metrics comparison
3. Determinism confirmation
4. Documentation update

---

## 6. What Must Never Be Implemented Here

- Address enumeration features
- Proprietary dataset integration
- Undocumented inference layers
- Hidden optimisation logic
- Behaviour designed for ambiguous or non-transparent use cases

This pipeline exists to:
- Normalise open data
- Join open data
- Derive transparent street-level inference
- Record quality metrics

Nothing more.

---

## 7. Communication Standards

Pull requests must:

- State the problem being solved
- Describe the solution
- Document assumptions
- Include metric impact
- Confirm reproducibility

Avoid vague language such as:
- “Seems to work”
- “Probably correct”
- “Should be fine”

Be precise.

---

## 8. Decision Rule

If a proposed change:
- Reduces transparency,
- Obscures provenance,
- Makes outputs less reproducible,
- Or introduces implicit assumptions,

It should not be merged.

Clarity over cleverness.
Traceability over speed.
Correctness over convenience.

---

End of AGENTS.md
Purpose: this file is the agent entrypoint for this repository.
Use it as a roadmap to the docs, then execute work with strict reproducibility and provenance.

## 1. Start Here (Required Reading Order)
1. `docs/README.md`
2. `docs/agent/start-here.md`
3. `docs/architecture/README.md`
4. `docs/spec/pipeline_v3/spec.md`
5. `docs/spec/pipeline_v3/data_model.md`
6. `docs/spec/pipeline_v3/canonicalisation.md`

If behavior in code differs from spec, treat it as a defect and document the delta.

## 2. Documentation Roadmap
- V3 product/behavior spec: `docs/spec/pipeline_v3/spec.md`
- V3 schema and table contracts: `docs/spec/pipeline_v3/data_model.md`
- Determinism and canonical rules: `docs/spec/pipeline_v3/canonicalisation.md`
- Source acquisition + licensing context: `docs/spec/data_sources.md`
- Agent onboarding: `docs/agent/start-here.md`
- Codebase map: `docs/agent/codebase-map.md`
- Operational runbook (ingest/build/publish): `docs/agent/runbook.md`
- Dataset lineage pages: `docs/architecture/datasets/README.md`
- Stage/pass pages: `docs/architecture/stages/README.md`
- Legacy phase docs (historical only): `docs/spec/phase_1/`, `docs/spec/phase_2-open-names/`

## 3. Non-Negotiable Engineering Rules
- No guessing: unknown fields/semantics must be marked unknown and validated explicitly.
- Reproducibility first: same inputs must produce same outputs.
- Raw data is immutable: never mutate raw source snapshots.
- Provenance is mandatory: derived records must trace to source run(s) and method.
- Deterministic execution: stable ordering + explicit tie-breaks only.
- Fail fast on schema/geometry/CRS issues.
- This repo is a pipeline only; do not add API-serving scope here.

## 4. Change Requirements
For meaningful behavior changes (join logic, scoring, normalization, radius/thresholds, CRS, pass semantics):
1. Update spec/docs in `docs/` in the same change.
2. Never place absolute local filesystem paths in docs; use repository-relative paths.
3. State rationale.
4. Provide before/after metrics or counts where applicable.
5. Confirm determinism impact.
6. Add/adjust tests (fixture-based; no live-download dependency).

This rule is strict: agents must always keep documentation in step with code changes.

## 5. Commit Standards
- Commit at logical checkpoints as work progresses.
- Prefer atomic commits grouped by concern (schema, ingest, transforms, tests, docs).
- Use Conventional Commits for every commit message (`type(scope): summary`).

## 6. Decision Rule
If a change reduces transparency, obscures provenance, weakens reproducibility, or introduces hidden assumptions, do not merge it.

Clarity over cleverness. Traceability over speed. Correctness over convenience.

## 7. Scoped Agent Guides
- Docs scope: `docs/AGENTS.md`
- Pipeline scope: `pipeline/AGENTS.md`
- Runtime code scope: `pipeline/src/pipeline/AGENTS.md`
- Test scope: `tests/AGENTS.md`
- Data/manifest scope: `data/AGENTS.md`
17 changes: 17 additions & 0 deletions data/AGENTS.md
@@ -0,0 +1,17 @@
# data/AGENTS.md

## Scope
Manifests and local source-file conventions under `data/`.

## Critical Rule
Manifest/source contract changes must be reflected in docs (`docs/spec/...` and `docs/architecture/...`) and code (`pipeline/src/pipeline/manifest.py`, `pipeline/config/source_schema.yaml`) together.

## Conventions
- source manifests live under `data/manifests/`
- keep source naming aligned with `pipeline/src/pipeline/manifest.py`
- avoid absolute local paths in documentation; manifests may contain absolute file paths for runtime only
- update bundle manifests when source keys change

## Useful References
- source acquisition: `docs/spec/data_sources.md`
- architecture dataset pages: `docs/architecture/datasets/`
15 changes: 15 additions & 0 deletions data/manifests/e2e/onsud_manifest.json
@@ -0,0 +1,15 @@
{
"dataset_key": "onsud",
"release_id": "2026-Q1-E2E-P2",
"source_url": "https://example.local/onsud-sample",
"licence": "OGL v3.0",
"file_path": "/Users/jamie/code/postcod.es/data/source_files/e2e/onsud_sample.csv",
"expected_sha256": "dfe6e4bc4d4405edc6463fcb1b55929f867d8e7b9907afb92e893a9f8911033f",
"format": "csv",
"column_map": {
"uprn": "ONS_UPRN",
"postcode": "ONS_POSTCODE",
"postcode_unit_easting": "PC_UNIT_E",
"postcode_unit_northing": "PC_UNIT_N"
}
}
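The `expected_sha256` field enables checksum verification before ingest; a minimal sketch of that check (the sample bytes and manifest below are fabricated, not the real file):

```python
import hashlib

def verify_manifest_checksum(manifest: dict, data: bytes) -> bool:
    """Compare file bytes against the manifest's expected_sha256."""
    return hashlib.sha256(data).hexdigest() == manifest["expected_sha256"]

# Fabricated sample: a header-only CSV and a manifest built from its hash.
sample = b"ONS_UPRN,ONS_POSTCODE,PC_UNIT_E,PC_UNIT_N\n"
manifest = {"expected_sha256": hashlib.sha256(sample).hexdigest()}
print(verify_manifest_checksum(manifest, sample))       # True
print(verify_manifest_checksum(manifest, b"tampered"))  # False
```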