End-to-end BI solution transforming 541K+ messy transactional records into executive insights — automated Python ETL feeding a multi-page Power BI report with advanced DAX measures and time intelligence.
| Dataset | UCI Online Retail Dataset (UK-based e-commerce, 2010–2011) |
| Raw Records | 541,909 |
| After Cleaning | 397,884 rows (removed 135,080 records with missing Customer IDs) |
| Total Revenue Analyzed | $8,911,407.90 |
| Geographic Scope | International — UK, EIRE, Hong Kong, and more |
| Tools | Python, Pandas, Power BI, DAX, Power Query |
Raw data is messy — over 24% of records had missing Customer IDs. Before any analysis was meaningful, the data needed to be understood and cleaned properly.
| Step | Action | Result |
|---|---|---|
| Load | Import 541,909 raw transaction records | Full dataset |
| Investigate | Identify missing Customer IDs — understand why, not just drop | Informed cleaning strategy |
| Clean | Remove 135,080 rows with null Customer IDs | 397,884 clean rows |
| Export | Output clean CSV for Power BI consumption | $8.91M revenue ready for analysis |
Row-level precision using SUMX to calculate revenue from quantity × unit price:
Sales Amount =
SUMX(
Online_Retail,
Online_Retail[Quantity] * Online_Retail[UnitPrice]
)
Dynamic Calendar table enabling seasonal and temporal analysis. This single model decision unlocked the discovery that Q4 drives 35.48% of total sales volume:
Calendar =
VAR _minDate = CALCULATE( MIN( Online_Retail[InvoiceDate] ), REMOVEFILTERS( Online_Retail ) )
VAR _maxDate = CALCULATE( MAX( Online_Retail[InvoiceDate] ), REMOVEFILTERS( Online_Retail ) )
RETURN
ADDCOLUMNS(
CALENDAR( _minDate, _maxDate ),
"Year", YEAR( [Date] ),
"Month Name", FORMAT( [Date], "MMM" ),
"Season", SWITCH( TRUE(),
MONTH( [Date] ) IN { 12, 1, 2 }, "Winter",
MONTH( [Date] ) IN { 3, 4, 5 }, "Spring",
MONTH( [Date] ) IN { 6, 7, 8 }, "Summer",
"Autumn" )
)
Global KPIs: Total Revenue, Total Quantity Sold, Average Unit Price.
Seasonal sales patterns — Q4 (Autumn) identified as peak revenue period at 35.48% of annual volume.
Revenue distribution across international markets — UK dominates, with EIRE and Hong Kong as secondary markets.
Average Unit Price by country and high-value product category analysis.
Global_Retail_Analysis/
├── Data/
│ └── Online_Retail_Sample.csv # 10,000 row sample for demonstration
├── Scripts/
│ └── Online_Retail_powerbi.ipynb # Python ETL & data cleaning
├── Media/
│ ├── capstone video.mp4 # Full project walkthrough
│ ├── Dashboard.png
│ ├── Sales_Trends_by_Time.png
│ ├── Sales_by_Country.png
│ ├── Unit_Price_vs_Country.png
│ └── Country_Category.png
└── README.md
- Data cleaning is 80% of the work. Over 24% of records had missing Customer IDs — understanding why data is missing, not just dropping it, changed the entire cleaning approach.
- DAX requires a different mental model. Writing
SUMXand time intelligence measures means thinking in filter context, not row-by-row iteration like Python. - Seasonality is invisible without proper time modeling. Building an explicit Calendar table was the key to discovering Q4's 35% revenue dominance — the trend simply couldn't be seen without it.
- Visualization is a communication skill. Deciding what not to show on a dashboard is just as important as what to include.
- Migrate ETL from notebook to modular
.pypipeline scripts - Rebuild data model in dbt + Snowflake for production-grade architecture
- Add RFM customer segmentation (Recency, Frequency, Monetary)
- Build Power BI incremental refresh for ongoing data updates
Jing You — Data Analytics & Engineering
Built with Python + Power BI · Dataset: UCI Online Retail 2010–2011