This project now uses BigQuery for every dataset shown in the Streamlit app.
- Source: https://data.ny.gov/resource/vxuj-8kew
- BigQuery table: `sipa-adv-c-bouncing-penguin.mta_data.daily_ridership`
- Loading type: batch full refresh
- Why: the dataset is small, updated on a daily cadence, and easy to keep consistent by reloading the full table instead of managing row-by-row updates.
- Source: https://data.cityofnewyork.us/resource/rc75-m7u3
- BigQuery table: `sipa-adv-c-bouncing-penguin.mta_data.nyc_covid_cases`
- Loading type: batch full refresh
- Why: this table is also small enough for a daily refresh, and full replacement keeps the historical series in sync without extra incremental-loading logic.
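The two sources above can be summarized in a small configuration mapping. This is a sketch, not the actual layout in the repository; the table IDs and load type come from this document, while the dict structure and the `.json` suffix on each endpoint (the usual Socrata resource convention) are assumptions.

```python
# Hypothetical source-to-table configuration; only the table IDs and
# load type are taken from this document, the rest is illustrative.
SOURCES = {
    "mta": {
        # ".json" suffix assumed per the standard Socrata API convention
        "api_url": "https://data.ny.gov/resource/vxuj-8kew.json",
        "table": "sipa-adv-c-bouncing-penguin.mta_data.daily_ridership",
        "load_type": "batch full refresh",
    },
    "covid": {
        "api_url": "https://data.cityofnewyork.us/resource/rc75-m7u3.json",
        "table": "sipa-adv-c-bouncing-penguin.mta_data.nyc_covid_cases",
        "load_type": "batch full refresh",
    },
}
```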
The repository includes `load_data_to_bq.py`, which:
- Authenticates with Google BigQuery
- Creates the `mta_data` dataset if it does not already exist
- Pulls source data from both Open Data APIs
- Cleans date and numeric fields before upload
- Replaces the target BigQuery tables
- Verifies each upload with row counts and date ranges
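The clean-and-replace steps above can be sketched as follows. This is an assumed shape, not the actual contents of `load_data_to_bq.py`; it uses the official `google-cloud-bigquery` client with `WRITE_TRUNCATE` to implement the full refresh.

```python
import pandas as pd

def clean_frame(df, date_col, numeric_cols):
    """Coerce date and numeric fields; drop rows whose date cannot be parsed."""
    df = df.copy()
    df[date_col] = pd.to_datetime(df[date_col], errors="coerce")
    for col in numeric_cols:
        df[col] = pd.to_numeric(df[col], errors="coerce")
    return df.dropna(subset=[date_col])

def replace_table(df, table_id):
    """Full refresh: WRITE_TRUNCATE replaces all rows of the target table."""
    from google.cloud import bigquery  # deferred so clean_frame works without GCP deps

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE")
    client.load_table_from_dataframe(df, table_id, job_config=job_config).result()
    # Verification step: report the row count once the load job completes.
    print(f"{table_id}: {client.get_table(table_id).num_rows} rows")
```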
Run it with:

```
python load_data_to_bq.py --dataset all
```

You can also load a single table:

```
python load_data_to_bq.py --dataset mta
python load_data_to_bq.py --dataset covid
```

The Streamlit app no longer reads API responses directly inside page files.
- `utils.py` now provides shared BigQuery helpers for both datasets
- `streamlit_app.py` reads MTA data from BigQuery
- `pages/1_MTA_Ridership.py` reads MTA data from BigQuery
- `pages/2_Second_Dataset.py` reads COVID data from BigQuery
This keeps all pages aligned with the lab requirement that every dataset come from BigQuery.
To improve load time and make performance visible:
- Each page uses a custom `display_load_time()` context manager and shows total load time in the UI
- BigQuery results are cached with Streamlit caching
- Queries select only the columns used by the app instead of `SELECT *`
- Repeated client setup is cached with a shared BigQuery client helper
- Basic data cleaning is centralized in `utils.py` so pages do less work on every rerun
- The homepage dashboard is split into lighter sections so each view renders only the charts needed for that section
- Default chart selections were reduced to fewer transit modes so the initial render sends fewer Plotly traces
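The timing helper mentioned above could look like the following sketch. The real `display_load_time()` likely writes straight to `st.caption`; this version accepts an optional reporter callable (an assumed signature) so the timer can also be exercised outside a Streamlit session.

```python
import time
from contextlib import contextmanager

@contextmanager
def display_load_time(report=None):
    """Time the enclosed block and report the elapsed seconds."""
    if report is None:
        import streamlit as st  # deferred: only needed when running inside the app
        report = st.caption
    start = time.perf_counter()
    try:
        yield
    finally:
        report(f"Page loaded in {time.perf_counter() - start:.2f} s")
```

Usage in a page: wrap the data fetch and chart code in `with display_load_time(): ...` so the caption reflects the full render.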
These changes improve both initial and subsequent page loads, while keeping the code easier to maintain.
- Run `python load_data_to_bq.py --dataset all`
- Run `streamlit run streamlit_app.py`
- Open each page and confirm the caption shows the page load time
- Record the screen while loading the main page and both sub-pages
I interpreted "repeat the middle steps from Part 5" as: load the datasets into BigQuery, point the app at BigQuery tables, and document the table-level setup in the repository.