Restore post-cloning PUF QRF re-imputation for geographic tax variation #560

@baogorek

Background

PR #516 originally designed the calibration pipeline so that PUF + QRF imputation would run after cloning and geography assignment, giving each geographic clone geographically-informed tax imputations (with state_fips as a QRF predictor). This was a key part of the vision: households cloned into different states would receive state-appropriate imputed tax values rather than sharing identical federal-return-derived imputations.

What Changed

Commit 49a1f66 removed the --puf-dataset flag from the calibration pipeline, moving PUF cloning upstream to extended_cps.py. As a result, all ~436 clones of the same household now share identical PUF-imputed tax values regardless of their assigned geography.

Why This Was Deferred

Restoring post-cloning PUF re-imputation was deliberately deferred for several reasons:

  1. Matrix builder precomputation: The current matrix builder precomputes variable values per state and reuses them across congressional districts. Post-cloning re-imputation would make tax values vary per clone (not just per state), breaking this optimization pattern.
  2. X*w ↔ sim.calculate().sum() consistency: Ensuring the calibration matrix matches simulation output is already a hard problem (see recent fixes for cross-state cache pollution, clone-to-CD collisions, and takeup draw alignment). Re-imputation per clone would add another dimension of potential mismatch.
  3. Runtime cost: QRF training is expensive. Running it after cloning (on ~436× the original household count) would substantially increase pipeline runtime.
  4. Current targets don't require it: The calibration weight solver already handles geographic distribution of tax-related aggregates. Identical per-clone values with different weights achieve the same calibration targets.
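To make point 1 concrete, here is a minimal sketch of why a per-state cache stops being valid once values vary per clone. Everything in it is hypothetical: `build_values`, the clone tuples, and the stand-in tax functions are illustrative only and are not taken from the actual matrix builder.

```python
# Hypothetical illustration; names and values are not from the real matrix builder.

def build_values(clones, compute, key_fn):
    """Return per-clone values plus the number of compute() calls made.

    key_fn decides which clones are allowed to share one precomputed value.
    """
    cache = {}
    values = []
    for clone in clones:
        key = key_fn(clone)
        if key not in cache:
            cache[key] = compute(clone)
        values.append(cache[key])
    return values, len(cache)


# Each clone: (clone_id, household_id, state_fips, congressional_district)
clones = [
    (0, "hh1", 6, "CA-12"),
    (1, "hh1", 6, "CA-13"),
    (2, "hh1", 36, "NY-10"),
    (3, "hh1", 36, "NY-12"),
]


def state_level_tax(clone):
    """Current behavior: the value depends only on (household, state)."""
    _, _, state, _ = clone
    return 1000.0 + state  # stand-in for a state-dependent calculation


def clone_level_tax(clone):
    """After re-imputation: the value would also depend on the clone itself."""
    clone_id, _, state, _ = clone
    return 1000.0 + state + 10.0 * clone_id  # stand-in for a per-clone imputation


def per_state_key(clone):
    return (clone[1], clone[2])


def per_clone_key(clone):
    return clone[0]


# Per-state caching is correct today: 2 computations serve 4 clones.
state_vals, state_calls = build_values(clones, state_level_tax, per_state_key)

# The same cache silently returns stale values once values vary per clone.
stale_vals, _ = build_values(clones, clone_level_tax, per_state_key)

# Keying by clone is correct again, at the cost of one computation per clone.
clone_vals, clone_calls = build_values(clones, clone_level_tax, per_clone_key)
```

In this toy setup the per-state cache needs 2 computations and the per-clone version needs 4. The same one-computation-per-clone scaling is what makes points 1 and 3 interact.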

What Restoration Would Require

  • Re-introduce PUF dataset loading and QRF imputation into the post-cloning pipeline stage
  • Update the matrix builder to handle clone-varying tax variable values (cannot precompute per-state)
  • Ensure X*w consistency when imputed values differ across clones of the same source household
  • Profile and optimize QRF training for the larger post-cloning dataset
  • Add validation that geographic tax variation improves microsimulation accuracy rather than just adding complexity

Acceptance Criteria

  • PUF QRF re-imputation runs after cloning and geography assignment
  • state_fips (or equivalent) is used as a QRF predictor so clones get state-appropriate tax values
  • Matrix builder correctly handles clone-varying values
  • Each entry of X @ w matches the corresponding weighted simulation total, (sim.calculate(var) * w).sum(), for all tax-related target variables
  • Pipeline runtime remains tractable (document benchmarks)
  • Calibration results are at least as good as the current identical-clone approach
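The consistency criterion could be checked along the following lines. This is a rough sketch with toy data, not the pipeline's actual validation code: the arrays stand in for `sim.calculate(var)`, the clone weights, and state-level targets, and the matrix layout (one row per target, zero outside the target's state) is an assumption made for illustration.

```python
import numpy as np

# Toy data (hypothetical): 4 clones, one tax variable, two state-level targets.
values = np.array([1006.0, 1016.0, 1056.0, 1066.0])  # stand-in for sim.calculate(var)
state = np.array([6, 6, 36, 36])                     # state_fips per clone
w = np.array([0.5, 1.5, 2.0, 1.0])                   # calibration weights
target_states = [6, 36]

# Assumed layout: one row per target, holding each clone's value inside the
# target's state and 0 elsewhere.
X = np.array([np.where(state == s, values, 0.0) for s in target_states])

# Matrix side and simulation side of the comparison.
matrix_totals = X @ w
sim_totals = np.array([(values * w)[state == s].sum() for s in target_states])
consistent = np.allclose(matrix_totals, sim_totals)

# A matrix built from per-state (stale) values no longer matches once the
# simulated values differ across clones of the same household.
stale_values = np.array([1006.0, 1006.0, 1056.0, 1056.0])
X_stale = np.array([np.where(state == s, stale_values, 0.0) for s in target_states])
stale_consistent = np.allclose(X_stale @ w, sim_totals)
```

Here `consistent` is true and `stale_consistent` is false. A check of this shape, run per target variable, would catch a matrix builder that still reuses per-state values after re-imputation.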
