Skip to content

JingYou-data/Global_Retail_Analysis

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

46 Commits
 
 
 
 
 
 
 
 

Repository files navigation

📊 Global Online Retail Strategic Intelligence (2010–2011)

End-to-end BI solution transforming 541K+ messy transactional records into executive insights — automated Python ETL feeding a multi-page Power BI report with advanced DAX measures and time intelligence.

🔗 View Live Dashboard


📌 Project Overview

Dataset UCI Online Retail Dataset (UK-based e-commerce, 2010–2011)
Raw Records 541,909
After Cleaning 397,884 rows (removed 135,080 records with missing Customer IDs)
Total Revenue Analyzed $8,911,407.90
Geographic Scope International — UK, EIRE, Hong Kong, and more
Tools Python, Pandas, Power BI, DAX, Power Query

🛠️ ETL Pipeline (Python)

Raw data is messy — over 24% of records had missing Customer IDs. Before any analysis was meaningful, the data needed to be understood and cleaned properly.

Step Action Result
Load Import 541,909 raw transaction records Full dataset
Investigate Identify missing Customer IDs — understand why, not just drop Informed cleaning strategy
Clean Remove 135,080 rows with null Customer IDs 397,884 clean rows
Export Output clean CSV for Power BI consumption $8.91M revenue ready for analysis

🧮 DAX Implementation

Core Revenue Measure

Row-level precision using SUMX to calculate revenue from quantity × unit price:

Sales Amount =
SUMX(
    Online_Retail,
    Online_Retail[Quantity] * Online_Retail[UnitPrice]
)

Time Intelligence — Calendar Table

Dynamic Calendar table enabling seasonal and temporal analysis. This single model decision unlocked the discovery that Q4 drives 35.48% of total sales volume:

Calendar =
VAR _minDate = CALCULATE( MIN( Online_Retail[InvoiceDate] ), REMOVEFILTERS( Online_Retail ) )
VAR _maxDate = CALCULATE( MAX( Online_Retail[InvoiceDate] ), REMOVEFILTERS( Online_Retail ) )
RETURN
ADDCOLUMNS(
    CALENDAR( _minDate, _maxDate ),
    "Year",       YEAR( [Date] ),
    "Month Name", FORMAT( [Date], "MMM" ),
    "Season",     SWITCH( TRUE(),
                    MONTH( [Date] ) IN { 12, 1, 2 }, "Winter",
                    MONTH( [Date] ) IN { 3, 4, 5 },  "Spring",
                    MONTH( [Date] ) IN { 6, 7, 8 },  "Summer",
                    "Autumn" )
)

📈 Dashboard Pages

1. Executive Summary

Dashboard Overview Global KPIs: Total Revenue, Total Quantity Sold, Average Unit Price.

2. Temporal Analysis

Sales Trends Seasonal sales patterns — Q4 (Autumn) identified as peak revenue period at 35.48% of annual volume.

3. Geospatial Intelligence

Global Sales Revenue distribution across international markets — UK dominates, with EIRE and Hong Kong as secondary markets.

4. Market Benchmarking

Pricing Analysis Category Analysis Average Unit Price by country and high-value product category analysis.


📁 Repository Structure

Global_Retail_Analysis/
├── Data/
│   └── Online_Retail_Sample.csv       # 10,000 row sample for demonstration
├── Scripts/
│   └── Online_Retail_powerbi.ipynb    # Python ETL & data cleaning
├── Media/
│   ├── capstone video.mp4             # Full project walkthrough
│   ├── Dashboard.png
│   ├── Sales_Trends_by_Time.png
│   ├── Sales_by_Country.png
│   ├── Unit_Price_vs_Country.png
│   └── Country_Category.png
└── README.md

💡 Key Takeaways

  • Data cleaning is 80% of the work. Over 24% of records had missing Customer IDs — understanding why data is missing, not just dropping it, changed the entire cleaning approach.
  • DAX requires a different mental model. Writing SUMX and time intelligence measures means thinking in filter context, not row-by-row iteration like Python.
  • Seasonality is invisible without proper time modeling. Building an explicit Calendar table was the key to discovering Q4's 35% revenue dominance — the trend simply couldn't be seen without it.
  • Visualization is a communication skill. Deciding what not to show on a dashboard is just as important as what to include.

🚀 Future Work

  • Migrate ETL from notebook to modular .py pipeline scripts
  • Rebuild data model in dbt + Snowflake for production-grade architecture
  • Add RFM customer segmentation (Recency, Frequency, Monetary)
  • Build Power BI incremental refresh for ongoing data updates

👤 Author

Jing You — Data Analytics & Engineering LinkedIn GitHub Portfolio


Built with Python + Power BI · Dataset: UCI Online Retail 2010–2011

About

Included Power BI Star Schema model, Python EDA scripts, and interactive dashboard documentation. Focused on analyzing 1.45M+ sales records.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors