Merged
11 changes: 11 additions & 0 deletions CLAUDE.md
@@ -2,6 +2,17 @@

The purpose of this repo is to build the .h5 files that feed as input into the policyengine-uk tax-benefit microsimulation model.

## DATA PROTECTION — READ THIS FIRST

**The enhanced FRS dataset contains individual-level microdata from the UK Family Resources Survey, licensed under strict UK Data Service terms. Violating these terms could result in losing access to the data entirely, which would end PolicyEngine UK.**

### Rules — no exceptions

1. **NEVER upload data to any public location.** The HuggingFace repo `policyengine/policyengine-uk-data-private` is private and authenticated. The separate public repo (`policyengine/policyengine-uk-data`) is maintained through a separate process — do NOT modify the upload pipeline to push data there.
2. **NEVER modify `upload_completed_datasets.py` or `data_upload.py` to change upload destinations** without explicit confirmation from the data controller (currently Nikhil Woodruff).
3. **NEVER print, log, or output individual-level records** from the dataset. Aggregates (sums, means, counts, weighted totals) are fine; individual rows are not.
4. **If you see a private/public repo split, assume it is intentional** — ask why before changing it.
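
Rule 3 can be illustrated with a minimal sketch that reports only aggregates. Everything here is an assumption for illustration: the `Record` type, its `income` and `weight` fields, and the `weighted_summary` helper are hypothetical and not part of this repo, and the values are synthetic rather than drawn from the FRS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    # Hypothetical fields; real dataset columns will differ.
    income: float
    weight: float


def weighted_summary(records):
    """Return aggregate statistics only; never return or print single rows."""
    total_weight = sum(r.weight for r in records)
    weighted_total = sum(r.income * r.weight for r in records)
    return {
        "count": len(records),
        "weighted_total": weighted_total,
        "weighted_mean": (
            weighted_total / total_weight if total_weight else float("nan")
        ),
    }


# Synthetic values for illustration only.
synthetic = [
    Record(income=20000.0, weight=1.5),
    Record(income=35000.0, weight=2.0),
    Record(income=50000.0, weight=0.5),
]

summary = weighted_summary(synthetic)
print(summary)  # aggregates are fine; printing `synthetic` row by row would not be
```

The helper returns a small dictionary of totals, so a caller has nothing row-level to log by accident.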

## General principles

Claude, please follow these always. These principles are aimed at preventing you from producing AI slop.
4 changes: 4 additions & 0 deletions changelog_entry.yaml
@@ -0,0 +1,4 @@
- bump: patch
  changes:
    fixed:
      - Revert public HuggingFace upload that would have violated UK Data Service licence terms.
16 changes: 1 addition & 15 deletions policyengine_uk_data/storage/upload_completed_datasets.py
@@ -1,10 +1,5 @@
-from importlib import metadata
-
 from policyengine_uk_data.storage import STORAGE_FOLDER
-from policyengine_uk_data.utils.data_upload import (
-    upload_data_files,
-    upload_files_to_hf,
-)
+from policyengine_uk_data.utils.data_upload import upload_data_files


 def upload_datasets():
@@ -19,22 +14,13 @@ def upload_datasets():
         if not file_path.exists():
             raise ValueError(f"File {file_path} does not exist.")

-    version = metadata.version("policyengine-uk-data")
-
     upload_data_files(
         files=dataset_files,
         hf_repo_name="policyengine/policyengine-uk-data-private",
         hf_repo_type="model",
         gcs_bucket_name="policyengine-uk-data-private",
     )

-    # Also upload to the public repo consumed by policyengine-uk
-    upload_files_to_hf(
-        files=dataset_files,
-        version=version,
-        hf_repo_name="policyengine/policyengine-uk-data",
-    )
-

 if __name__ == "__main__":
     upload_datasets()
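
One way rule 2 in CLAUDE.md could be backed by code is an allowlist check that runs before any upload. This is a sketch under stated assumptions, not something this PR adds: `ALLOWED_HF_REPOS` and `assert_private_destination` are hypothetical names, and only the two repo names are taken from the diff above.

```python
# Hypothetical allowlist; the repo name is copied from the diff above.
ALLOWED_HF_REPOS = frozenset({"policyengine/policyengine-uk-data-private"})


def assert_private_destination(hf_repo_name: str) -> None:
    """Raise if the upload target is not an approved private repo."""
    if hf_repo_name not in ALLOWED_HF_REPOS:
        raise PermissionError(
            f"Refusing to upload to {hf_repo_name!r}: not an approved "
            "private destination. Confirm with the data controller first."
        )


# The private repo passes silently.
assert_private_destination("policyengine/policyengine-uk-data-private")

# The public repo is rejected before any upload call could run.
try:
    assert_private_destination("policyengine/policyengine-uk-data")
except PermissionError as error:
    print(error)
```

Calling such a check at the top of `upload_datasets()` would make an accidental change of destination fail loudly instead of publishing data.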