diff --git a/LAB_11_DIAGRAM.md b/LAB_11_DIAGRAM.md
new file mode 100644
index 0000000..7567c4e
--- /dev/null
+++ b/LAB_11_DIAGRAM.md
@@ -0,0 +1,65 @@
+# Lab 11 Data Flow Diagram
+
+Open this file in Markdown Preview to render the Mermaid diagram.
+
+```mermaid
+%%{init: {'theme': 'neutral', 'flowchart': {'curve': 'linear', 'nodeSpacing': 35, 'rankSpacing': 55}}}%%
+flowchart TD
+    A["MTA operations generate daily ridership and traffic estimates"]:::source
+    B["NYC health reporting systems generate daily COVID case counts"]:::source
+
+    C["Open NY API<br/>Dataset: vxuj-8kew"]:::api
+    D["NYC Open Data API<br/>Dataset: rc75-m7u3"]:::api
+
+    E["Python batch loader<br/>load_data_to_bq.py"]:::loader
+    F["Cleaning and type conversion<br/>pandas parsing + column renaming"]:::loader
+    G["BigQuery tables<br/>mta_data.daily_ridership<br/>mta_data.nyc_covid_cases"]:::storage
+    H["Shared query utilities<br/>utils.py"]:::app
+    I["Streamlit pages<br/>dashboard + analysis views"]:::app
+    J["User-facing output<br/>charts, KPIs, and recovery analysis"]:::output
+
+    R1["Risk: upstream definitions or collection methods change"]:::risk
+    R2["Risk: API outage, timeout, or schema drift"]:::risk
+    R3["Risk: authentication failure or bad full refresh"]:::risk
+    R4["Risk: permissions, cache, or query errors"]:::risk
+
+    A --> C
+    B --> D
+    C --> E
+    D --> E
+    E --> F --> G --> H --> I --> J
+
+    A -.-> R1
+    B -.-> R1
+    C -.-> R2
+    D -.-> R2
+    E -.-> R3
+    G -.-> R4
+    H -.-> R4
+    I -.-> R4
+
+    classDef source fill:#d2e3fc,stroke:#5b9cf6,color:#202124,stroke-width:2px;
+    classDef api fill:#e4d7ff,stroke:#8b5cf6,color:#202124,stroke-width:2px;
+    classDef loader fill:#fde7c3,stroke:#f59e0b,color:#202124,stroke-width:2px;
+    classDef storage fill:#fbd3d0,stroke:#e56b5d,color:#202124,stroke-width:2px;
+    classDef app fill:#d7f1ea,stroke:#2ea76d,color:#202124,stroke-width:2px;
+    classDef output fill:#dbeafe,stroke:#3b82f6,color:#202124,stroke-width:2px;
+    classDef risk fill:#fff4cc,stroke:#d97706,color:#7c2d12,stroke-width:2px;
+```
+
+## What Happens
+
+- Two upstream organizations generate the source data.
+- The project pulls both datasets from public JSON APIs.
+- `load_data_to_bq.py` cleans the data and fully refreshes two BigQuery tables.
+- `utils.py` queries BigQuery and prepares data for the app.
+- Streamlit renders charts and analysis for the user.
+
+## What Can Go Wrong
+
+- Source definitions or field names may change upstream.
+- Public APIs may fail, time out, or return unexpected schemas.
+- A failed batch load can overwrite a previously good table.
+- BigQuery permissions, caching, or app queries may fail.
+
+If Mermaid Preview is unavailable, use [LAB_11_DIAGRAM.svg](./LAB_11_DIAGRAM.svg) as the screenshot version.
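
The "failed batch load can overwrite a previously good table" risk could be reduced by checking each fetched batch before the full refresh runs. The following is a minimal, dependency-free sketch of such a check, not the code in `load_data_to_bq.py`. The function name `validate_batch`, the `date` field name, the `previous_row_count` input, and the 10% shrink threshold are all assumptions made for illustration.

```python
def validate_batch(rows, previous_row_count, date_field="date", max_shrink=0.10):
    """Return a list of problems; an empty list means the batch looks safe to load.

    rows: list of dicts, one per record fetched from the API.
    previous_row_count: row count of the table that would be replaced.
    All names and thresholds are hypothetical, chosen for this example.
    """
    # An empty response usually signals an outage or a failed request.
    if not rows:
        return ["batch is empty"]

    problems = []

    # Schema drift: the date column may be missing or renamed upstream.
    missing = sum(1 for row in rows if row.get(date_field) is None)
    if missing:
        problems.append(f"{missing} rows are missing '{date_field}'")

    # A full refresh should not shrink the table by more than max_shrink.
    floor = previous_row_count * (1 - max_shrink)
    if len(rows) < floor:
        problems.append(
            f"batch has {len(rows)} rows, expected at least {floor:.0f}"
        )

    return problems
```

A loader could call this just before the `to_gbq()` upload and skip the refresh whenever the returned list is non-empty, leaving the previous table in place.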
diff --git a/LAB_11_DIAGRAM.svg b/LAB_11_DIAGRAM.svg
new file mode 100644
index 0000000..061563a
--- /dev/null
+++ b/LAB_11_DIAGRAM.svg
@@ -0,0 +1,136 @@
+
+  Lab 11 Data Flow Diagram
+  Google Drawings-style flowchart showing how MTA ridership data and NYC COVID data move from upstream generation through public APIs, a Python batch loader, BigQuery storage, shared query utilities, Streamlit pages, and final user analysis.
+
+  Lab 11 Data Flow Diagram
+  Google Drawings-style draft for the MTA ridership project
+
+  Upstream generation
+  Public APIs
+  Batch loading
+  Storage
+  App and output
+
+  MTA Operations
+  Daily ridership and traffic
+  estimates are generated.
+
+  NYC Health Reporting
+  Daily COVID case counts
+  are compiled and published.
+
+  Open NY API
+  Dataset: `vxuj-8kew`
+  Structured JSON endpoint
+
+  NYC Open Data API
+  Dataset: `rc75-m7u3`
+  Structured JSON endpoint
+
+  `load_data_to_bq.py`
+  1. Fetches both APIs with `requests.get()`
+  2. Loads results into pandas DataFrames
+  3. Parses dates and numeric fields
+  4. Renames MTA legacy percent columns
+  5. Uploads tables with `to_gbq()`
+  6. Verifies row count and date range
+  Load pattern: batch full refresh
+
+  BigQuery Table A
+  `mta_data.daily_ridership`
+  Project storage layer
+
+  BigQuery Table B
+  `mta_data.nyc_covid_cases`
+  Project storage layer
+
+  Shared Query Layer
+  `utils.py` builds SQL,
+  casts types, cleans data,
+  and caches query results.
+
+  Streamlit Pages
+  `streamlit_app.py`
+  `pages/1_MTA_Ridership.py`
+  `pages/2_Second_Dataset.py`
+  Charts, KPIs, and trend views
+
+  User Outcome
+  Recovery analysis, context,
+  and project insights
+
+  Possible failure
+  Upstream collection methods
+  or reporting definitions change.
+
+  Possible failure
+  API outage, timeout, or
+  schema drift in source fields.
+
+  Possible failure
+  Authentication fails, bad values pass through,
+  or a bad full refresh replaces a good table.
+
+  Possible failure
+  Permissions, dataset creation,
+  or table access problems.
+
+  Possible failure
+  Missing columns, stale cache,
+  query errors, or chart rendering issues.
+  Interpretation can also hide short spikes.
+
+  Main question answered by this diagram: how the data is generated, loaded, stored, queried, and turned into user-facing analysis, plus what can go wrong at each stage.
+
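
Steps 3 and 4 in the `load_data_to_bq.py` box (parse dates and numeric fields, rename legacy columns) can be illustrated with a small sketch. This is a plain-Python stand-in so that it runs without dependencies; the diagram says the actual loader uses pandas. The column names `subway_riders`, `subway_pct`, and `subway_pct_legacy`, the `RENAMES` map, and the assumption that the API delivers every value as a string are all hypothetical.

```python
from datetime import datetime

# Hypothetical rename map; the real legacy column names are not shown in the diagram.
RENAMES = {"subway_pct_legacy": "subway_pct"}

# Assumed names of the fields that should end up numeric.
NUMERIC_FIELDS = ("subway_riders", "subway_pct")


def clean_record(raw, date_field="date"):
    """Convert one JSON record of strings into typed values."""
    # Apply renames first so later steps only see the new column names.
    record = {RENAMES.get(key, key): value for key, value in raw.items()}

    # Parse the ISO-style timestamp and keep only the calendar date.
    record[date_field] = datetime.fromisoformat(record[date_field]).date()

    # Convert numeric strings to floats; missing or unparseable values become None.
    for field in NUMERIC_FIELDS:
        try:
            record[field] = float(record[field])
        except (KeyError, TypeError, ValueError):
            record[field] = None

    return record
```

Turning bad values into `None` rather than raising keeps one malformed row from aborting the whole batch, but it also lets "bad values pass through" unless a later check counts the nulls, which is the failure mode the diagram flags for this stage.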