Background
PR #516 originally designed the calibration pipeline so that PUF + QRF imputation would run after cloning and geography assignment, giving each geographic clone geographically-informed tax imputations (with state_fips as a QRF predictor). This was a key part of the vision: households cloned into different states would receive state-appropriate imputed tax values rather than sharing identical federal-return-derived imputations.
What Changed
Commit 49a1f66 removed the --puf-dataset flag from the calibration pipeline, moving PUF cloning upstream to extended_cps.py. As a result, all ~436 clones of the same household now share identical PUF-imputed tax values regardless of their assigned geography.
Why This Was Deferred
Restoring post-cloning PUF re-imputation was deliberately deferred for several reasons:
- Matrix builder precomputation: The current matrix builder precomputes variable values per state and reuses them across congressional districts. Post-cloning re-imputation would make tax values vary per clone (not just per state), breaking this optimization pattern.
- X*w ↔ sim.calculate().sum() consistency: Ensuring the calibration matrix matches simulation output is already a hard problem (see recent fixes for cross-state cache pollution, clone-to-CD collisions, and takeup draw alignment). Re-imputation per clone would add another dimension of potential mismatch.
- Runtime cost: QRF training is expensive. Running it after cloning (on ~436× the original household count) would substantially increase pipeline runtime.
- Current targets don't require it: The calibration weight solver already handles geographic distribution of tax-related aggregates. Identical per-clone values with different weights achieve the same calibration targets.
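The last point can be illustrated with a toy calculation. This is only a sketch under stated assumptions: the clone records, the weights, and the `state_totals` helper are hypothetical and are not taken from the pipeline. It shows how one shared imputed value can still produce different state-level aggregates once the solver assigns different weights to each clone.

```python
from collections import defaultdict

# Hypothetical toy data: one source household cloned into three geographies.
# Every clone carries the same PUF-imputed value; only the weights differ.
imputed_tax = 12_000.0
clones = [
    {"state_fips": 6, "weight": 2.0},
    {"state_fips": 6, "weight": 1.0},
    {"state_fips": 36, "weight": 0.5},
]


def state_totals(clones, value):
    """Sum weight * value per state for clones that share one imputed value."""
    totals = defaultdict(float)
    for clone in clones:
        totals[clone["state_fips"]] += clone["weight"] * value
    return dict(totals)


totals = state_totals(clones, imputed_tax)
print(totals)  # {6: 36000.0, 36: 6000.0}
```

The weights move the aggregate between states, but the per-household value stays the same everywhere, which is exactly the limitation this issue tracks.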
What Restoration Would Require
- Re-introduce PUF dataset loading and QRF imputation into the post-cloning pipeline stage
- Update the matrix builder to handle clone-varying tax variable values (cannot precompute per-state)
- Ensure X*w consistency when imputed values differ across clones of the same source household
- Profile and optimize QRF training for the larger post-cloning dataset
- Add validation that geographic tax variation improves microsimulation accuracy rather than only adding complexity
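The matrix builder change in the list above might look roughly like the following. This is a hedged sketch, not the actual builder: the function names, the clone record layout, and all numbers are invented for illustration. It contrasts a cache keyed by (household, state) with values carried per clone, and flags the clones for which the per-state cache would be stale.

```python
def precompute_per_state(household_ids, states, compute):
    """Evaluate `compute` once per (household, state) pair."""
    return {(h, s): compute(h, s) for h in household_ids for s in states}


def row_from_state_cache(clones, cache):
    """Reuse the per-state value for every clone, whatever its district."""
    return [cache[(c["household_id"], c["state_fips"])] for c in clones]


def row_from_clone_values(clones):
    """With post-cloning re-imputation, each clone carries its own value."""
    return [c["reimputed_value"] for c in clones]


# Hypothetical clones of one household; two share a state but not a district.
clones = [
    {"household_id": 1, "state_fips": 6, "cd": "0601", "reimputed_value": 950.0},
    {"household_id": 1, "state_fips": 6, "cd": "0612", "reimputed_value": 1100.0},
    {"household_id": 1, "state_fips": 36, "cd": "3603", "reimputed_value": 1400.0},
]

# Stand-in for a per-state simulation result; the numbers are made up.
cache = precompute_per_state([1], [6, 36], lambda h, s: 1000.0 if s == 6 else 1400.0)

cached_row = row_from_state_cache(clones, cache)
clone_row = row_from_clone_values(clones)

# Indices of clones whose cached per-state value no longer matches.
stale = [i for i, (a, b) in enumerate(zip(cached_row, clone_row)) if a != b]
print(stale)  # [0, 1]
```

In this toy case the two clones in the same state diverge from the cached value, so a builder keyed only on state would need either a per-clone key or a different evaluation strategy.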
Acceptance Criteria
- PUF QRF re-imputation runs after cloning and geography assignment
- state_fips (or equivalent) is used as a QRF predictor so clones get state-appropriate tax values
- Matrix builder correctly handles clone-varying values
- X @ w matches sim.calculate(var) * w for all tax-related target variables
- Pipeline runtime remains tractable (document benchmarks)
- Calibration results are at least as good as the current identical-clone approach
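The consistency criterion above could be checked with something like the sketch below. It reads the criterion as a comparison of weighted totals, which is an interpretation rather than something the issue spells out. The `check_consistency` helper, the variable names, and the data are hypothetical; in practice the `simulated` values would come from `sim.calculate(var)` and the rows from the calibration matrix.

```python
import math


def dot(values, weights):
    """Weighted total; a plain-Python stand-in for a matrix row times w."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(v * w for v, w in zip(values, weights))


def check_consistency(matrix_rows, simulated, weights, rel_tol=1e-9):
    """Return the variables whose matrix total differs from the simulated total."""
    mismatched = []
    for var, row in matrix_rows.items():
        matrix_total = dot(row, weights)
        sim_total = dot(simulated[var], weights)
        if not math.isclose(matrix_total, sim_total, rel_tol=rel_tol, abs_tol=1e-6):
            mismatched.append(var)
    return mismatched


# Hypothetical per-clone values for two illustrative target variables.
weights = [2.0, 1.0, 0.5]
matrix_rows = {
    "income_tax": [100.0, 100.0, 100.0],
    "salt_deduction": [10.0, 10.0, 10.0],
}
simulated = {
    "income_tax": [100.0, 100.0, 100.0],
    "salt_deduction": [10.0, 12.0, 10.0],  # one clone was re-imputed differently
}

mismatched = check_consistency(matrix_rows, simulated, weights)
print(mismatched)  # ['salt_deduction']
```

A check of this kind could run as a pipeline test over every tax-related target, failing the build whenever the returned list is non-empty.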