125 changes: 125 additions & 0 deletions statvar_imports/pennsylvania/pennsylvania_education/README.md
@@ -0,0 +1,125 @@
## Pennsylvania_education Import

Datasets on education in Pennsylvania, reported at the county and educational-institution level.
-----

**Provenance Description:**
Data assets within this catalog are authored and maintained by individual Commonwealth agencies, which serve as the authoritative sources for their respective domains. The portal, managed by the Office of Administration, provides a transparent audit trail by documenting original publication dates, metadata update frequencies, and the specific departmental "stewards" responsible for the data's accuracy and integrity.

## Datasets
The datasets cover the following date ranges, place types, and update frequencies:

*Educational attainment*
- Years: 2010-2016
- Type of Place: County (resolved via FIPS codes)
- Update Frequency: Annual

*Post-secondary completions*
- Years: 2016-2021
- Type of Place: Educational Institution (resolved via IPEDS ID)
- Update Frequency: Annual

*Public school enrollment*
- Years: 2017
- Type of Place: County (resolved via FIPS codes)
- Update Frequency: Annual

*Undergraduate STEM enrollment*
- Years: 2014, 2016, 2018, 2020
- Type of Place: Educational Institution (resolved via IPEDS ID)
- Update Frequency: Annual (Data collected in Spring, reported in Fall)

**Place Resolution**

Place resolution is handled using specific mapping files (`*_places_resolved.csv`) or direct code mapping in the PVMap:
- **Counties:** Mapped to Data Commons `geoId` using standard FIPS codes (e.g., `geoId/42001`).
- **Institutions:** Mapped to Data Commons `ipedsId` using the Integrated Postsecondary Education Data System (IPEDS) identifiers.
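
As a minimal sketch of the county mapping described above: the helper name `county_geo_id` is hypothetical and is not part of this import. It only illustrates how a five-digit county FIPS code becomes a `geoId` value, assuming the code may arrive as a string or as an integer that has lost its leading zero.

```python
def county_geo_id(fips_code):
    """Return a Data Commons geoId for a county FIPS code (illustrative helper)."""
    # Zero-pad to five digits so codes parsed as integers keep their leading zero.
    return f"geoId/{int(fips_code):05d}"


# Adams County, Pennsylvania has county FIPS code 42001.
print(county_geo_id("42001"))
```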


### How to Use

The workflow for this data import involves two main steps: downloading the necessary files and then processing them.

#### Step 1: Download the Data

- **Source:** [Pennsylvania_Education](https://data.pa.gov/browse?sortBy=relevance&page=1&pageSize=20)
- **Description:** The provided URL links to the Education data category within the Commonwealth of Pennsylvania’s open data repository. This portal serves as a centralized clearinghouse for public records, statistics, and geospatial data managed by the Pennsylvania Department of Education (PDE) and related agencies.

To fetch the necessary data files, run the download script `download_script.py`.

The script downloads the files listed below into the `input_files` folder. Within this folder, there are four sub-folders, each containing categorized data for both adults and children:

- educational_attainment_by_age_range_and_gender

- post_secondary_completions_total_awards_degrees

- public_school_enrollment_by_county_grade_and_race

- undergraduate_stem_enrollment


### Auto refresh Type

This import will be refreshed in a fully automated manner.

-----

#### Step 2: Process the Files

After downloading the files, you can process them to generate the final output. To do this:

**Option A: Use the `run_processing.sh` script**

The `run_processing.sh` script automates the processing of all the downloaded files.

**Run the following command:**

```bash
sh run_processing.sh
```

**Option B: Manually Execute the Processing Script**

You can also run the `stat_var_processor.py` script individually for each file. This script is located in the `data/tools/statvar_importer/` directory.

Here are the specific commands for each file:

```bash
python3 stat_var_processor.py \
  --input_data="../../statvar_imports/pennsylvania/pennsylvania_education/input_files/educational_attainment_by_age_range_and_gender/*.csv" \
  --pv_map="../../statvar_imports/pennsylvania/pennsylvania_education/educational_attainment_by_age_range_and_gender_pvmap.csv" \
  --config_file="../../statvar_imports/pennsylvania/pennsylvania_education/common_metadata.csv" \
  --output_path="../../statvar_imports/pennsylvania/pennsylvania_education/output_files/educational_attainment_by_age_range_and_gender_output" \
  --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf \
  --places_resolved_csv="../../statvar_imports/pennsylvania/pennsylvania_education/educational_attainment_by_age_range_and_gender_places_resolver.csv"
```

```bash
python3 stat_var_processor.py \
  --input_data="../../statvar_imports/pennsylvania/pennsylvania_education/input_files/post_secondary_completions_total_awards_degrees/*.csv" \
  --pv_map="../../statvar_imports/pennsylvania/pennsylvania_education/post_secondary_completions_total_awards_degrees_pvmap.csv" \
  --config_file="../../statvar_imports/pennsylvania/pennsylvania_education/common_metadata.csv" \
  --output_path="../../statvar_imports/pennsylvania/pennsylvania_education/output_files/post_secondary_completions_total_awards_degrees_output" \
  --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf \
  --places_resolved_csv="../../statvar_imports/pennsylvania/pennsylvania_education/post_secondary_completions_total_awards_degrees_places_resolver.csv"
```

```bash
python3 stat_var_processor.py \
  --input_data="../../statvar_imports/pennsylvania/pennsylvania_education/input_files/public_school_enrollment_by_county_grade_and_race/*.csv" \
  --pv_map="../../statvar_imports/pennsylvania/pennsylvania_education/public_school_enrollment_by_county_grade_and_race_pvmap.csv" \
  --config_file="../../statvar_imports/pennsylvania/pennsylvania_education/common_metadata.csv" \
  --output_path="../../statvar_imports/pennsylvania/pennsylvania_education/output_files/public_school_enrollment_by_county_grade_and_race_output" \
  --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf \
  --places_resolved_csv="../../statvar_imports/pennsylvania/pennsylvania_education/public_school_enrollment_by_county_grade_and_race_places_resolver.csv"
```

```bash
python3 stat_var_processor.py \
  --input_data="../../statvar_imports/pennsylvania/pennsylvania_education/input_files/undergraduate_stem_enrollment/*.csv" \
  --pv_map="../../statvar_imports/pennsylvania/pennsylvania_education/undergraduate_stem_enrollment_pvmap.csv" \
  --config_file="../../statvar_imports/pennsylvania/pennsylvania_education/common_metadata.csv" \
  --output_path="../../statvar_imports/pennsylvania/pennsylvania_education/output_files/undergraduate_stem_enrollment_output" \
  --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf \
  --places_resolved_csv="../../statvar_imports/pennsylvania/pennsylvania_education/undergraduate_stem_enrollment_places_resolver.csv"
```
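
The four commands above differ only in the dataset name, so they can be generated from one template. The following is a rough sketch of that idea, not a reproduction of `run_processing.sh` (whose contents are not shown here). The names `BASE`, `DATASETS`, and `build_command` are hypothetical; the flags and paths are copied from the commands above.

```python
import shlex

# Path prefix used by every command above (relative to data/tools/statvar_importer/).
BASE = "../../statvar_imports/pennsylvania/pennsylvania_education"

DATASETS = [
    "educational_attainment_by_age_range_and_gender",
    "post_secondary_completions_total_awards_degrees",
    "public_school_enrollment_by_county_grade_and_race",
    "undergraduate_stem_enrollment",
]


def build_command(name):
    """Build the stat_var_processor.py argument list for one dataset."""
    return [
        "python3",
        "stat_var_processor.py",
        f"--input_data={BASE}/input_files/{name}/*.csv",
        f"--pv_map={BASE}/{name}_pvmap.csv",
        f"--config_file={BASE}/common_metadata.csv",
        f"--output_path={BASE}/output_files/{name}_output",
        "--existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf",
        f"--places_resolved_csv={BASE}/{name}_places_resolver.csv",
    ]


for name in DATASETS:
    # shlex.join quotes arguments (such as the glob) so the line is safe to paste into a shell.
    print(shlex.join(build_command(name)))
```

To execute the commands instead of printing them, each list could be passed to `subprocess.run(..., check=True)` from the `data/tools/statvar_importer/` directory.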
@@ -0,0 +1,5 @@
parameter,value
#places_within,
output_columns,"observationAbout,observationDate,value,variableMeasured"
header_rows,1
url,https://data.pa.gov/browse?sortBy=relevance&page=1&pageSize=20
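
For reviewers unfamiliar with this parameter/value layout, here is a small sketch of how such a file can be read. It is only an illustration: the function name `read_config` is hypothetical, and treating rows that start with `#` as disabled is an assumption about how the processor interprets them, not something this diff confirms.

```python
import csv
import io

# Inline copy of the metadata file above, so the sketch is self-contained.
SAMPLE = '''parameter,value
#places_within,
output_columns,"observationAbout,observationDate,value,variableMeasured"
header_rows,1
url,https://data.pa.gov/browse?sortBy=relevance&page=1&pageSize=20
'''


def read_config(text):
    """Parse parameter,value rows into a dict, skipping rows that start with '#'."""
    config = {}
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the "parameter,value" header row
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # assumption: '#' marks a disabled parameter
        config[row[0]] = row[1] if len(row) > 1 else ""
    return config


config = read_config(SAMPLE)
# The quoted output_columns value is itself a comma-separated list.
print(config["output_columns"].split(","))
```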
@@ -0,0 +1,32 @@
import os
import requests


def download_file(url, output_path):
    """Stream the file at `url` to `output_path`, creating parent folders as needed."""
    print(f'Downloading {url} to {output_path}...')
    # A timeout keeps the script from hanging indefinitely on a stalled connection.
    response = requests.get(url, stream=True, timeout=60)
    response.raise_for_status()

    os.makedirs(os.path.dirname(output_path), exist_ok=True)
    with open(output_path, 'wb') as f:
        # Write in chunks so large files are never held fully in memory.
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
    print('Download complete.')


def main():
    base_path = os.path.dirname(os.path.abspath(__file__))
    input_files_dir = os.path.join(base_path, 'input_files')

    # Maps each dataset name to its data.pa.gov view ID.
    datasets = {
        'educational_attainment_by_age_range_and_gender': 'xwn6-8rmw',
        'post_secondary_completions_total_awards_degrees': 'jqcu-bcsg',
        'public_school_enrollment_by_county_grade_and_race': 'wb8u-h3s8',
        'undergraduate_stem_enrollment': 'r75w-4bue'
    }

    for folder, data_id in datasets.items():
        url = f'https://data.pa.gov/api/views/{data_id}/rows.csv?accessType=DOWNLOAD'
        output_path = os.path.join(input_files_dir, f'{folder}.csv')
        download_file(url, output_path)


if __name__ == '__main__':
    main()
@@ -0,0 +1,38 @@
key,,,p1,v1,p2,v2
County FIPS Code,observationAbout,geoId/{Data},populationType,Person,statType,measuredValue
Total Population,measuredProperty,count,value,{Number},,
No High School Diploma,educationalAttainment,NoDiploma,value,{Number},,
High School Diploma Or Equivalent,educationalAttainment,HighSchoolDiplomaIncludesEquivalency,value,{Number},,
Some College No Degree,educationalAttainment,SomeCollegeNoDegree,value,{Number},,
Associate's Degree,educationalAttainment,AssociatesDegree,value,{Number},,
Bachelor's Degree,educationalAttainment,BachelorsDegree,value,{Number},,
Graduate or Professional Degree,educationalAttainment,GraduateOrProfessionalDegree,value,{Number},,
Total Post-Secondary Degrees,educationalAttainment,PostSecondaryDegree,value,{Number},,
Male,gender,Male,,,,
Female,gender,Female,,,,
35 to 44 Years,age,[35 44 Years],,,,
25 to 34 Years,age,[25 34 Years],,,,
45 to 64 Years,age,[45 64 Years],,,,
,,,,,,
2010,observationDate,2010,,,,
2011,observationDate,2011,,,,
2012,observationDate,2012,,,,
2013,observationDate,2013,,,,
2014,observationDate,2014,,,,
2015,observationDate,2015,,,,
2016,observationDate,2016,,,,
2017,observationDate,2017,,,,
2018,observationDate,2018,,,,
2019,observationDate,2019,,,,
2020,observationDate,2020,,,,
2021,observationDate,2021,,,,
2022,observationDate,2022,,,,
2023,observationDate,2023,,,,
2024,observationDate,2024,,,,
2025,observationDate,2025,,,,
2026,observationDate,2026,,,,
2027,observationDate,2027,,,,
2028,observationDate,2028,,,,
2029,observationDate,2029,,,,
2030,observationDate,2030,,,,
2031,observationDate,2031,,,,
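
To make the row layout of this PV map easier to review, the sketch below splits one row into its key and its property-value pairs. It is a simplified illustration of the `key,p,v,p1,v1,p2,v2` layout shown above, not the actual logic of `stat_var_processor.py`; the function name `parse_pvmap_row` is hypothetical.

```python
import csv


def parse_pvmap_row(line):
    """Split one PV map row into (key, {property: value}), ignoring empty pairs."""
    cells = next(csv.reader([line]))
    key, rest = cells[0], cells[1:]
    pvs = {}
    # After the key, cells alternate property, value, property, value, ...
    for prop, value in zip(rest[0::2], rest[1::2]):
        if prop:
            pvs[prop] = value
    return key, pvs


key, pvs = parse_pvmap_row(
    "County FIPS Code,observationAbout,geoId/{Data},populationType,Person,statType,measuredValue"
)
print(key, pvs)
```

Read this way, the `County FIPS Code` column supplies `observationAbout` along with two fixed properties, while a row such as `Male,gender,Male,,,,` contributes only the `gender` constraint.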
34 changes: 34 additions & 0 deletions statvar_imports/pennsylvania/pennsylvania_education/manifest.json
@@ -0,0 +1,34 @@
{
  "import_specifications": [
    {
      "import_name": "Pennsylvania_Education",
      "curator_emails": ["support@datacommons.org"],
      "provenance_url": "https://data.pa.gov/",
      "provenance_description": "Dataset related to Pennsylvania's education at county level.",
      "scripts": ["download_script.py", "run_processing.sh"],
      "source_files": [
        "input_files/*.csv"
      ],
      "import_inputs": [
        {
          "template_mcf": "output_files/educational_attainment_by_age_range_and_gender/educational_attainment_by_age_range_and_gender_output.tmcf",
          "cleaned_csv": "output_files/educational_attainment_by_age_range_and_gender/educational_attainment_by_age_range_and_gender_output.csv"
        },
        {
          "template_mcf": "output_files/post_secondary_completions_total_awards_degrees/post_secondary_completions_total_awards_degrees_output.tmcf",
          "cleaned_csv": "output_files/post_secondary_completions_total_awards_degrees/post_secondary_completions_total_awards_degrees_output.csv"
        },
        {
          "template_mcf": "output_files/public_school_enrollment_by_county_grade_and_race/public_school_enrollment_by_county_grade_and_race_output.tmcf",
          "cleaned_csv": "output_files/public_school_enrollment_by_county_grade_and_race/public_school_enrollment_by_county_grade_and_race_output.csv"
        },
        {
          "template_mcf": "output_files/undergraduate_stem_enrollment/undergraduate_stem_enrollment_output.tmcf",
          "cleaned_csv": "output_files/undergraduate_stem_enrollment/undergraduate_stem_enrollment_output.csv"
        }
      ],
      "cron_schedule": "0 02 * * 2"
    }
  ]
}
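
The `cron_schedule` value can be read field by field. The sketch below assumes the import infrastructure uses the standard five-field cron format (minute, hour, day of month, month, day of week), which this diff does not state. Under that assumption, `0 02 * * 2` would fire at 02:00 every Tuesday; the time zone is not given in the manifest. The names `FIELDS` and `parse_cron` are hypothetical.

```python
# Field order of a standard five-field cron expression (assumed format).
FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def parse_cron(expr):
    """Label the five whitespace-separated fields of a cron expression."""
    parts = expr.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


schedule = parse_cron("0 02 * * 2")
for field, value in schedule.items():
    print(f"{field}: {value}")
```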
