1. Review of Existing Code
The current Streamlit script combines multiple responsibilities in a single file:
1) Configuration / secrets: loading EIA_API_KEY from environment variables or st.secrets.
2) Network I/O: calling external APIs (EIA, and weather if applicable) via HTTP requests.
3) Data processing: filtering to actual demand, type conversion, resampling/aggregation to daily averages, pivoting, slicing recent time windows, merging datasets, and computing KPIs.
4) UI rendering: Streamlit sidebar controls, metrics, charts, and raw table display.
This structure works for a prototype, but it makes unit testing difficult because UI logic, network calls, and data transformations are tightly coupled.
2. What Can Be Refactored Into Functions
To improve testability and readability, we extract the core data logic into pure functions (functions that operate on DataFrames and return DataFrames or values, with no Streamlit side effects). In particular:
1) Demand + Weather daily merge (pure data function)
merge_daily_demand_weather(eia_df, weather_df) -> pd.DataFrame
- Responsibilities:
Filter EIA rows to only "Demand" (exclude forecasts).
Convert timestamps to date and compute daily average demand.
Merge with daily weather data using an inner join on date.
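The steps above can be sketched as a pure function. This is an illustrative draft, not the app's actual code: the column names ("period", "type-name", "value", "date", "avg_demand_mwh") are assumptions about the EIA/weather schemas and should be adjusted to match the real data.

```python
import pandas as pd

def merge_daily_demand_weather(eia_df: pd.DataFrame,
                               weather_df: pd.DataFrame) -> pd.DataFrame:
    """Filter to actual demand, average per day, inner-join with daily weather.

    Column names are hypothetical placeholders for the real schemas.
    """
    if eia_df.empty or weather_df.empty:
        return pd.DataFrame()
    # Keep only actual demand rows; forecasts are excluded.
    demand = eia_df[eia_df["type-name"] == "Demand"].copy()
    if demand.empty:
        return pd.DataFrame()
    # Convert hourly timestamps to calendar dates and average per day.
    demand["date"] = pd.to_datetime(demand["period"]).dt.date
    daily = (
        demand.groupby("date", as_index=False)["value"]
        .mean()
        .rename(columns={"value": "avg_demand_mwh"})
    )
    # Inner join: only dates present in both frames survive.
    return daily.merge(weather_df, on="date", how="inner")
```

Because the function takes DataFrames in and returns a DataFrame out, it can be unit-tested without mocking Streamlit or the network.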
2) Grid KPIs computation (pure data function)
calculate_grid_kpis(df_pivot) -> tuple[last_actual, last_forecast, delta]
- Responsibilities:
Read the latest (first-row) "Demand" and "Day-ahead demand forecast" values.
Compute forecast error delta = actual - forecast.
Handle empty DataFrames or missing columns gracefully.
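A minimal sketch of this function, assuming the pivoted frame is sorted newest-first and uses the column names "Demand" and "Day-ahead demand forecast" (both assumptions to verify against the real pivot):

```python
import pandas as pd

def calculate_grid_kpis(df_pivot: pd.DataFrame):
    """Return (last_actual, last_forecast, delta) from the latest (first) row,
    or (None, None, None) when the data is empty or a column is missing."""
    try:
        last_actual = df_pivot["Demand"].iloc[0]
        last_forecast = df_pivot["Day-ahead demand forecast"].iloc[0]
    except (KeyError, IndexError):
        # Missing column (KeyError) or empty frame (IndexError):
        # degrade gracefully instead of crashing the app.
        return (None, None, None)
    return (last_actual, last_forecast, last_actual - last_forecast)
```

Catching both KeyError and IndexError in one place means the UI layer only ever has to check for None before rendering metrics.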
3) Supporting utility functions (optional, also pure)
clean_eia_df(df): enforce numeric/datetime types and standardize column names.
pivot_hourly_demand_forecast(df): pivot to wide format for plotting and KPI extraction.
select_last_n_days(df, days_to_show): slice the most recent time window in hours.
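As one example of these utilities, select_last_n_days could look like the sketch below. It assumes the frame carries a DatetimeIndex; if time lives in a column instead, the comparison would target that column.

```python
import pandas as pd

def select_last_n_days(df: pd.DataFrame, days_to_show: int) -> pd.DataFrame:
    """Keep only rows within the most recent days_to_show * 24 hours.

    Assumes a DatetimeIndex (a hypothetical layout; adapt as needed).
    """
    if df.empty:
        return df
    cutoff = df.index.max() - pd.Timedelta(hours=24 * days_to_show)
    return df[df.index >= cutoff]
```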
After refactoring, the Streamlit file would primarily orchestrate: load secrets → fetch raw data → call pure functions → render outputs.
3. Where Can We Make the Code DRY?
Several parts of the script can be made more DRY (Don’t Repeat Yourself) by centralizing repeated logic and “magic strings”:
1) Centralize key column names
Repeated strings like "Demand" and "Day-ahead demand forecast" should be defined once (constants or default parameters). This prevents bugs from typos and simplifies future changes.
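Concretely, this could be as small as a pair of module-level constants that every consumer shares (the helper below is a made-up example consumer, not code from the app):

```python
# Single source of truth for the "magic strings" used across the app.
COL_DEMAND = "Demand"
COL_FORECAST = "Day-ahead demand forecast"

def forecast_error(row: dict) -> float:
    """Illustrative consumer: every delta computation uses the same constants,
    so renaming a column is a one-line change."""
    return row[COL_DEMAND] - row[COL_FORECAST]
```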
2) Separate fetch from transform
Network calls (requests.get) should be separated from transformation logic (filtering, pivoting, aggregating, merging). This reduces duplication and makes data logic independently testable.
3) Standardize time handling
Converting period to datetime and extracting date/daily grouping should be done in one shared function instead of scattered across the app.
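A shared helper for that conversion might look like this sketch (column names "period" and "value" are assumed defaults, overridable per call):

```python
import pandas as pd

def to_daily_mean(df: pd.DataFrame,
                  period_col: str = "period",
                  value_col: str = "value") -> pd.DataFrame:
    """One shared place for the period -> datetime -> date conversion
    and the daily-average aggregation."""
    out = df.copy()
    out["date"] = pd.to_datetime(out[period_col]).dt.date
    return out.groupby("date", as_index=False)[value_col].mean()
```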
4) Graceful empty-data handling
Instead of repeating checks like if df.empty in multiple places, make functions return empty outputs (or (None, None, None) for KPIs) consistently, and let the UI layer decide how to display warnings.
5) Reuse computation for metrics
KPI calculations should be performed in one function (calculate_grid_kpis) rather than inline computations, so both plotting and dashboard metrics rely on the same validated logic.
4. Test Plan: merge_daily_demand_weather
Function Purpose: Filters EIA data to only include "Demand", calculates the daily average, and merges it with daily weather data using an inner join.
🟢 Test Case 1: Happy Path (Standard Valid Data)
Scenario: The function receives normal hourly EIA data (mix of Actual and Forecast) and normal daily weather data spanning two days.
Input 1 (eia_df):
Jan 1, 10:00 | Demand | 100 MWh
Jan 1, 11:00 | Demand | 200 MWh
Jan 1, 10:00 | Forecast | 999 MWh (should be filtered out)
Jan 2, 10:00 | Demand | 300 MWh
Jan 2, 11:00 | Demand | 300 MWh
Input 2 (weather_df):
Jan 1 | 15.0 °C
Jan 2 | 20.0 °C
Expected Output: A merged DataFrame with exactly 2 rows:
Row 1: Date: Jan 1 | avg_demand_mwh: 150 | avg_temp: 15.0
Row 2: Date: Jan 2 | avg_demand_mwh: 300 | avg_temp: 20.0
🟡 Test Case 2: Empty Data (API Failure Simulation)
Scenario: One or both of the APIs failed and returned an empty DataFrame.
Input 1 (eia_df): Empty DataFrame
Input 2 (weather_df): Normal weather data
Expected Output: Empty DataFrame (the function should gracefully exit without crashing Streamlit).
🟡 Test Case 3: Missing "Demand" Data
Scenario: The EIA API successfully returns data, but it contains only forecast rows and no actual "Demand" values.
Input 1 (eia_df):
Jan 1, 10:00 | Forecast | 500 MWh
Jan 1, 11:00 | Forecast | 550 MWh
Input 2 (weather_df): Normal weather data
Expected Output: Empty DataFrame. (The filter step type-name == "Demand" will leave an empty table, so the final merge should also be empty).
🟡 Test Case 4: Unmatched Dates (Inner Join Test)
Scenario: The EIA data and Weather data cover different time ranges. Because the function uses how="inner", it should only keep overlapping dates.
Input 1 (eia_df): Data for Jan 1 and Jan 2.
Input 2 (weather_df): Data for Jan 2 and Jan 3.
Expected Output: A merged DataFrame with exactly 1 row (for Jan 2), as that is the only overlapping date. Jan 1 and Jan 3 are dropped.
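The four cases above can be written as plain assertions. The implementation below is redefined inline only so the tests run stand-alone; the real merge_daily_demand_weather (and its column names) may differ in detail.

```python
import datetime as dt
import pandas as pd

# Minimal reference implementation for illustration; column names are assumed.
def merge_daily_demand_weather(eia_df, weather_df):
    if eia_df.empty or weather_df.empty:
        return pd.DataFrame()
    demand = eia_df[eia_df["type-name"] == "Demand"].copy()
    if demand.empty:
        return pd.DataFrame()
    demand["date"] = pd.to_datetime(demand["period"]).dt.date
    daily = (demand.groupby("date", as_index=False)["value"].mean()
             .rename(columns={"value": "avg_demand_mwh"}))
    return daily.merge(weather_df, on="date", how="inner")

def make_weather(days):
    return pd.DataFrame({"date": days,
                         "avg_temp": [15.0 + 5 * i for i in range(len(days))]})

jan1, jan2, jan3 = (dt.date(2024, 1, 1), dt.date(2024, 1, 2),
                    dt.date(2024, 1, 3))
eia = pd.DataFrame({
    "period": ["2024-01-01 10:00", "2024-01-01 11:00", "2024-01-01 10:00",
               "2024-01-02 10:00", "2024-01-02 11:00"],
    "type-name": ["Demand", "Demand", "Day-ahead demand forecast",
                  "Demand", "Demand"],
    "value": [100, 200, 999, 300, 300],
})

# Test Case 1: happy path -- forecast row filtered out, daily means computed.
out = merge_daily_demand_weather(eia, make_weather([jan1, jan2]))
assert out["avg_demand_mwh"].tolist() == [150.0, 300.0]

# Test Case 2: empty EIA frame -> empty result, no crash.
assert merge_daily_demand_weather(pd.DataFrame(), make_weather([jan1])).empty

# Test Case 3: forecast-only EIA data -> empty result.
forecast_only = eia[eia["type-name"] != "Demand"]
assert merge_daily_demand_weather(forecast_only, make_weather([jan1])).empty

# Test Case 4: inner join keeps only the overlapping date (Jan 2).
out = merge_daily_demand_weather(eia, make_weather([jan2, jan3]))
assert out["date"].tolist() == [jan2]
```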
5. Test Plan: calculate_grid_kpis
Function Purpose: Extracts the latest actual demand and forecast values from a pivoted DataFrame and calculates the forecast error (delta).
🟢 Test Case 1: Happy Path (Standard Valid Data)
Scenario: The function receives a properly formatted DataFrame with both "Demand" and "Day-ahead demand forecast" columns, sorted with the latest data in the first row.
Input (df_pivot):
Row 1 (Latest): Demand = 5000 | Day-ahead demand forecast = 4800
Row 2: Demand = 4900 | Day-ahead demand forecast = 4950
Expected Output: (5000, 4800, 200) (Calculation: 5000 - 4800 = 200)
🟡 Test Case 2: Edge Case - Missing Column
Scenario: The API returns incomplete data, resulting in a DataFrame that is missing the "Day-ahead demand forecast" column.
Input (df_pivot):
Row 1: Demand = 5000 | (No forecast column exists)
Expected Output: (None, None, None) (The function should catch the KeyError and fail gracefully without crashing the app).
🟡 Test Case 3: Edge Case - Empty DataFrame
Scenario: The function receives a completely empty DataFrame (e.g., due to an API failure upstream or aggressive filtering).
Input (df_pivot): pd.DataFrame() (0 rows, 0 columns)
Expected Output: (None, None, None) (The function should catch the IndexError or check for emptiness and return None values).
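These three cases translate directly into assertions. As above, a minimal implementation is included only so the snippet runs stand-alone; it is a sketch of the intended behavior, not the app's actual code.

```python
import pandas as pd

# Minimal reference implementation for illustration.
def calculate_grid_kpis(df_pivot):
    try:
        actual = df_pivot["Demand"].iloc[0]
        forecast = df_pivot["Day-ahead demand forecast"].iloc[0]
    except (KeyError, IndexError):
        return (None, None, None)
    return (actual, forecast, actual - forecast)

# Test Case 1: happy path -- latest row read, delta = actual - forecast.
df = pd.DataFrame({"Demand": [5000, 4900],
                   "Day-ahead demand forecast": [4800, 4950]})
assert calculate_grid_kpis(df) == (5000, 4800, 200)

# Test Case 2: missing forecast column -> graceful (None, None, None).
assert calculate_grid_kpis(pd.DataFrame({"Demand": [5000]})) == (None, None, None)

# Test Case 3: fully empty DataFrame -> graceful (None, None, None).
assert calculate_grid_kpis(pd.DataFrame()) == (None, None, None)
```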