Build reproducible IPEDS institution-year panels from NCES IPEDS cross-sections with strict release checks, provenance-preserving harmonization, DuckDB-backed wide builds, and auditable QC artifacts.
This repository supports two main release products:
- `Panels/2004_2024_IPEDS_clean_Panel_DS.parquet` for the full cleaned release window (2004:2024)
- `Panels/panel_wide_analysis_2004_2023.parquet` for the analysis release window (2004:2023 by default)
2024 is treated as provisional/schema-transition for analysis-wide builds.
Scripts/04_build_wide_panel.py now uses DuckDB as the build/query engine for wide-panel construction, QC, and partition export. Python remains the orchestration layer.
Core stages:
- `Scripts/01_download_ipeds.py`: multithreaded raw acquisition, manifests, retry logic, and optional dictionary extraction.
- `Scripts/02_dictionary_ingest.py`: dictionary lake and dictionary codes, including synthetic imputation-variable rows and UNITID metadata rows.
- `Scripts/03_harmonize.py`: provenance-preserving long panel build with chunked melts, release allowlist checks, and deterministic deduplication.
- `Scripts/04_build_wide_panel.py`: DuckDB-backed wide build with year-scoped execution, scalar/dimension lane split, anti-garbage gates, typed casting, discrete collapse, target lineage, and monitored phase logs.
- `Scripts/05_clean_panel.py`: PRCH parent/child cleaning that preserves all institution-year rows and nulls only the affected component families.
- `Scripts/06_build_custom_panel.py`: variable-select export from stitched wide or clean panels to parquet or CSV.
python3 -m pip install -r requirements.txt
export IPEDS_ROOT="/path/to/IPEDS_Paneling"

Required input under `$IPEDS_ROOT`:

- `Raw_Cross_Section_Data/`
- `Dictionary/dictionary_lake.parquet`
bash manual_commands.sh

This runs Scripts/00_run_all.py and produces:

- `Panels/2004-2024/panel_long_varnum_2004_2024.parquet`
- `Panels/2004_2024_IPEDS_Raw_Panel_DS.parquet`
- `Panels/2004_2024_IPEDS_PRCHclean_Panel_DS.parquet`
- `Panels/2004_2024_IPEDS_clean_Panel_DS.parquet`
That full-release path is the cleaned 2004:2024 product. The separate analysis-wide build below is an unbalanced research panel for 2004:2023 and is not PRCH-cleaned unless you explicitly run Scripts/05_clean_panel.py on it.
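The PRCH-cleaning principle is simple to state: no institution-year row is ever dropped; only the columns in an affected component family are nulled on flagged rows. A minimal sketch, with invented column names and a hypothetical `flagged_ids` set standing in for the real parent/child detection:

```python
# Hypothetical component family; the real script derives families from metadata.
AFFECTED_FAMILY = ("F_TUITION", "F_AUXILIARY")

def prch_null_family(rows, flagged_ids):
    """Null only the family columns on flagged rows; never drop a row."""
    cleaned = []
    for row in rows:
        row = dict(row)
        if (row["UNITID"], row["year"]) in flagged_ids:
            for col in AFFECTED_FAMILY:
                row[col] = None  # null the component family, keep everything else
        cleaned.append(row)
    return cleaned

rows = [
    {"UNITID": 1, "year": 2022, "F_TUITION": 10.0, "F_AUXILIARY": 2.0, "INSTNM": "A"},
    {"UNITID": 2, "year": 2022, "F_TUITION": 5.0,  "F_AUXILIARY": 1.0, "INSTNM": "B"},
]
out = prch_null_family(rows, flagged_ids={(2, 2022)})
```

The row count before and after cleaning is identical, which is what makes the cleaned product safe to join back against the raw panel.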
Run this after harmonization output exists:
python3 Scripts/04_build_wide_panel.py \
--input "$IPEDS_ROOT/Panels/2004-2024/panel_long_varnum_2004_2024.parquet" \
--out_dir "$IPEDS_ROOT/Panels/wide_analysis_parts" \
--years "2004:2023" \
--dictionary "$IPEDS_ROOT/Dictionary/dictionary_lake.parquet" \
--lane-split \
--dim-sources "C_A,C_B,C_C,CDEP,EAP,IC_CAMPUSES,IC_PCCAMPUSES,F_FA_F,F_FA_G" \
--dim-prefixes "C_,EF,GR,GR200,SAL,S_,OM,DRV" \
--exclude-vars "SPORT1,SPORT2,SPORT3,SPORT4" \
--scalar-long-out "$IPEDS_ROOT/Panels/panel_long_scalar_unique.parquet" \
--dim-long-out "$IPEDS_ROOT/Panels/panel_long_dim.parquet" \
--wide-analysis-out "$IPEDS_ROOT/Panels/panel_wide_analysis_2004_2023.parquet" \
--typed-output \
--drop-empty-cols \
--collapse-disc \
--drop-disc-components \
--qc-dir "$IPEDS_ROOT/Checks/wide_qc" \
--disc-qc-dir "$IPEDS_ROOT/Checks/disc_qc" \
--duckdb-path "$IPEDS_ROOT/build/ipeds_build.duckdb" \
--duckdb-temp-dir "$IPEDS_ROOT/build/duckdb_tmp" \
--persist-duckdb

Analysis-wide schema notes:
- By default, lane-split analysis builds use `Artifacts/legacy_analysis_schema_seed.csv` to seed legacy compatibility columns.
- Those seeded columns are visible in year partitions and recorded in `Checks/wide_qc/qc_seeded_legacy_columns.csv`.
- If `--drop-globally-null-post` remains enabled, globally null compatibility columns are still removed from the stitched single-file output and listed in `Checks/wide_qc/qc_globally_null_columns_dropped.csv`.
- Use `--no-legacy-analysis-schema` if you want the narrower semantic-window schema surface instead of the legacy-compatible schema contract.
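The seed-then-prune behavior described by `--drop-globally-null-post` can be sketched in a few lines. Everything below is illustrative (invented column names, plain dicts standing in for parquet partitions): each partition is first widened to the seed column list, then any column that is null in every stitched row is dropped from the single-file output.

```python
def seed_and_prune(partitions, seed_cols):
    """Seed missing columns as None across partitions, then drop globally null ones."""
    # Widen every row to the full seed schema (missing columns become None).
    stitched = [{c: row.get(c) for c in seed_cols} for rows in partitions for row in rows]
    # A column is globally null only if no partition ever populated it.
    globally_null = [c for c in seed_cols if all(r[c] is None for r in stitched)]
    for r in stitched:
        for c in globally_null:
            del r[c]
    return stitched, globally_null

parts = [
    [{"UNITID": 1, "year": 2004, "INSTNM": "A"}],                      # lacks LEGACY_VAR
    [{"UNITID": 1, "year": 2023, "INSTNM": "A", "LEGACY_VAR": None}],  # seeded but null
]
stitched, dropped = seed_and_prune(parts, ["UNITID", "year", "INSTNM", "LEGACY_VAR"])
```

Note the asymmetry this explains: seeded columns survive in the per-year partitions (every partition shares one schema) but can still disappear from the stitched file.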
If you want a cleaned version of the stitched analysis panel, run:
python3 Scripts/05_clean_panel.py \
--input "$IPEDS_ROOT/Panels/panel_wide_analysis_2004_2023.parquet" \
--output "$IPEDS_ROOT/Panels/panel_clean_analysis_2004_2023.parquet" \
--dictionary "$IPEDS_ROOT/Dictionary/dictionary_lake.parquet" \
--qc-dir "$IPEDS_ROOT/Checks/prch_qc"

Target-lineage audit without running the year loop:
python3 Scripts/04_build_wide_panel.py \
--input "$IPEDS_ROOT/Panels/2004-2024/panel_long_varnum_2004_2024.parquet" \
--out_dir "$IPEDS_ROOT/Panels/wide_analysis_parts" \
--years "2004:2023" \
--dictionary "$IPEDS_ROOT/Dictionary/dictionary_lake.parquet" \
--lane-split \
--qc-dir "$IPEDS_ROOT/Checks/wide_qc" \
--lineage-only

This writes `Checks/wide_qc/qc_target_lineage.csv` and stops before per-year builds.
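Conceptually, a lineage audit tags every candidate variable with the first gate that removed it (or "kept"), so each row of `qc_target_lineage.csv` can explain a column's fate. A hedged sketch with invented stage names and toy filter rules, not the script's actual gate logic:

```python
def trace_lineage(candidates, stage_filters):
    """Return {varname: 'kept' or 'dropped:<stage>'} for audit output."""
    lineage = {}
    for var in candidates:
        lineage[var] = "kept"
        for stage, keep in stage_filters:
            if not keep(var):
                lineage[var] = f"dropped:{stage}"  # record the first gate that fails
                break
    return lineage

# Toy gates: route EF-prefixed vars to the dimension lane, block malformed names.
stage_filters = [
    ("scalar_filter", lambda v: not v.startswith("EF")),
    ("anti_garbage",  lambda v: v.isidentifier()),
]
lineage = trace_lineage(["INSTNM", "EFTOTLT", "BAD COL"], stage_filters)
```

Because only the first failing gate is recorded, the audit stays unambiguous even when a variable would fail several gates.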
Monitored real-data build with durable logs and telemetry:
python3 Scripts/QA_QC/03_monitored_analysis_build.py \
--input "$IPEDS_ROOT/Panels/2004-2024/panel_long_varnum_2004_2024.parquet" \
--dictionary "$IPEDS_ROOT/Dictionary/dictionary_lake.parquet" \
--years "2004:2023"

The monitored runner writes `build.log`, `monitor.log`, `build_telemetry.json`, and `run_meta.json` under `Checks/real_parity_runs/<run_id>/`.
It also prints a live terminal heartbeat with elapsed time, partition count, disk usage, and latest phase.
Certification runner for a completed monitored analysis build:
python3 Scripts/QA_QC/04_certify_analysis_build.py \
--run-dir "Checks/real_parity_runs/<run_id>" \
--years "2004:2023"

This writes `certification_summary.csv` and `certification_summary.md` into the monitored run directory.
Parity harness:
python3 Scripts/QA_QC/02_wide_parity.py \
--input "$IPEDS_ROOT/Panels/2004-2024/panel_long_varnum_2004_2024.parquet" \
--dictionary "$IPEDS_ROOT/Dictionary/dictionary_lake.parquet" \
--years "2004:2023" \
--lane-split \
--typed-output \
--drop-empty-cols \
--collapse-disc \
--drop-disc-components \
--parity-contract legacy_schema

Use `--parity-contract semantic_window` to compare the narrower semantic-window schema instead of the legacy-compatible analysis schema.
Use the input that matches your goal:
- Analysis-ready subset (recommended): `Panels/panel_wide_analysis_2004_2023.parquet`
- Broad full panel (includes 2024): `Panels/2004_2024_IPEDS_clean_Panel_DS.parquet`
UNITID and year are always included automatically.
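A sketch of that column-selection contract, assuming the key names `UNITID` and `year` (the helper and its error message are illustrative, not the script's actual code):

```python
def resolve_columns(requested, available):
    """Prepend the panel keys, validate the rest, preserve the user's order."""
    keys = ["UNITID", "year"]
    selected = [v for v in requested if v not in keys]  # keys need not be requested
    missing = [v for v in selected if v not in available]
    if missing:
        raise ValueError(f"unknown variables: {missing}")
    return keys + selected

cols = resolve_columns(
    requested=["INSTNM", "CONTROL"],
    available={"UNITID", "year", "INSTNM", "CONTROL", "SECTOR"},
)
```

Failing fast on unknown variable names keeps a typo in the vars file from silently producing an empty column.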
Export a custom panel as parquet:
python3 Scripts/06_build_custom_panel.py \
--input "$IPEDS_ROOT/Panels/panel_wide_analysis_2004_2023.parquet" \
--vars-file "Customize_Panel/selectedvars.txt" \
--years "2004:2023" \
--format parquet \
--output "$IPEDS_ROOT/Panels/custom_panel_2004_2023.parquet"

Export the same custom panel as CSV:
python3 Scripts/06_build_custom_panel.py \
--input "$IPEDS_ROOT/Panels/panel_wide_analysis_2004_2023.parquet" \
--vars-file "Customize_Panel/selectedvars.txt" \
--years "2004:2023" \
--format csv \
--output "$IPEDS_ROOT/Panels/custom_panel_2004_2023.csv"

Build a panel-specific variable dictionary from the actual stitched wide schema:
python3 Scripts/07_build_panel_dictionary.py \
--input "$IPEDS_ROOT/Panels/panel_clean_analysis_2004_2023.parquet" \
--dictionary "$IPEDS_ROOT/Dictionary/dictionary_lake.parquet" \
--output "Artifacts/panel_clean_analysis_2004_2023_dictionary.csv"

The output includes:
- `varname`
- `varTitle`
- `longDescription`
- `panelDataType` (actual parquet type in the panel)
- `dictionaryDataType` (metadata type from `dictionary_lake.parquet`)
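The join behind those last two columns can be sketched with toy data. The variable names and type strings below are invented for illustration; the point is that the panel's physical parquet type and the dictionary's metadata type are reported side by side so disagreements are visible:

```python
# Physical types as read from the stitched parquet schema (illustrative values).
panel_types = {"UNITID": "BIGINT", "INSTNM": "VARCHAR", "HLOFFER": "INTEGER"}
# Metadata types as recorded in the dictionary lake (illustrative values).
dictionary_types = {"UNITID": "integer", "INSTNM": "alphanumeric", "HLOFFER": "discrete"}

records = [
    {
        "varname": name,
        "panelDataType": ptype,
        # None when a panel column has no dictionary row at all.
        "dictionaryDataType": dictionary_types.get(name),
    }
    for name, ptype in sorted(panel_types.items())
]
```

Sorting by `varname` keeps the emitted CSV deterministic across runs.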
- `Checks/release_qc/`: release-manifest validation and selected-file evidence
- `Checks/harmonize_qc/`: missing-UNITID drop logs and harmonize summaries
- `Checks/disc_qc/`: discrete-collapse conflicts and collapse map
- `Checks/wide_qc/qc_scalar_conflicts.csv`: scalar-lane key conflicts
- `Checks/wide_qc/qc_anti_garbage_failures.csv`: blocked dimension identifiers in wide targets
- `Checks/wide_qc/qc_cast_report.csv`: typed-cast parse report
- `Checks/wide_qc/qc_target_lineage.csv`: target-universe lineage through scalar filter, discrete collapse, anti-garbage, and legacy seeding
- `Checks/wide_qc/qc_seeded_legacy_columns.csv`: legacy compatibility columns injected into analysis-wide targets
- `Checks/wide_qc/qc_globally_null_columns_dropped.csv`: globally null columns removed post-build
- `Checks/prch_qc/`: PRCH cleaning evidence
- `Checks/real_parity_runs/<run_id>/build.log`, `monitor.log`, `build_telemetry.json`, and `run_meta.json` for monitored analysis builds
- `Checks/real_parity_runs/<run_id>/certification_summary.csv` and `.md` for post-run certification
- `zsh: parse error near ')'`: Run the commands from a `.sh` file or run `bash manual_commands.sh` directly.
- `ModuleNotFoundError: duckdb`: Install dependencies in the active Python environment.
- Scalar conflict gate failed: Inspect `Checks/wide_qc/qc_scalar_conflicts.csv`; add true dimensioned sources/prefixes or exclude known problem vars.
- Anti-garbage gate failed: Inspect `Checks/wide_qc/qc_anti_garbage_failures.csv`; treat those variables as dimensioned or exclude them.
- `OutOfMemoryException` during large analysis-wide builds: Run `Scripts/QA_QC/03_monitored_analysis_build.py` to capture phase logs and resource telemetry under `Checks/real_parity_runs/`, then inspect the latest `build.log`, `monitor.log`, and `build_telemetry.json` to determine whether memory pressure occurs in target discovery, scalar conflict scans, lane exports, or pivoting.
- Unexpected analysis-wide columns: Inspect `Checks/wide_qc/qc_target_lineage.csv` and `Checks/wide_qc/qc_seeded_legacy_columns.csv`. Use `--no-legacy-analysis-schema` to disable the legacy-compatible placeholder columns if you want only the semantic-window target set.
- Keep generated large data out of git: `Raw_Cross_Section_Data/`, `Cross_sections/`, `Panels/`, `Checks/`, `build/`, `*.duckdb`, `*.duckdb.wal`.
- `UNITID` is documented in the dictionary as controlled metadata and is also used as the panel key in harmonization.
- The legacy analysis schema seed manifest is `Artifacts/legacy_analysis_schema_seed.csv`.