E-Commerce Data Pipeline

ELT Pipeline for Brazilian E-Commerce Analytics (2016-2018)

This project implements an ELT pipeline that extracts, loads, and transforms e-commerce data to support revenue and delivery performance analysis for a Latin American e-commerce platform (2016–2018).


Problem Statement

A major Latin American e-commerce platform needs to understand its business performance across two critical dimensions:

Revenue Analysis

  • Monthly revenue trends across 2016-2018
  • Top 10 revenue-generating product categories and their order volumes
  • Bottom 10 revenue-generating categories for optimization opportunities
  • Revenue distribution by state to identify regional performance

Delivery Performance Analysis

  • Delivery time accuracy: Difference between estimated and actual delivery dates
  • Monthly delivery performance trends (real vs. estimated delivery times)
  • State-level delivery efficiency metrics
  • Impact of public holidays on order volumes and delivery patterns
  • Freight value vs. product weight correlation analysis

The solution requires building a reusable ELT (Extract, Load, Transform) pipeline that can be executed periodically to generate these insights.


Dataset Description

Data Sources

1. Olist E-Commerce Dataset

A Brazilian e-commerce public dataset containing 100,000 anonymized orders from 2016-2018. The dataset includes:

  • 99,441 orders with status, timestamps, and delivery information
  • 112,650 order items with product details and shipping information
  • 103,886 payment records with payment methods and values
  • 99,224 customer reviews with ratings and comments
  • 32,951 products with categories, dimensions, and weights
  • 99,441 customers with geographic information
  • 3,095 sellers with location data
  • 1,000,163 geolocation records for mapping
  • 71 product categories with Portuguese-to-English translations

Key Tables:

  • olist_orders: Order lifecycle timestamps and status
  • olist_order_items: Product details per order
  • olist_order_payments: Payment information
  • olist_products: Product catalog with categories
  • olist_customers: Customer geographic data
  • olist_sellers: Seller information
  • product_category_name_translation: Category translations

2. Public Holidays API

  • Source: date.nager.at API
  • Data: Brazil's public holidays for 2017
  • Purpose: Correlate holiday periods with order volumes and delivery performance
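As a rough sketch of how the holidays feed can be consumed (the repository's actual extract code may differ), the API returns JSON records that normalize cleanly into a DataFrame. The `holidays_to_frame` helper and the sample records below are illustrative, not the repo's real code:

```python
import pandas as pd
import requests


def holidays_to_frame(records: list[dict]) -> pd.DataFrame:
    """Normalize the API's JSON records into a typed DataFrame."""
    df = pd.DataFrame(records)
    df["date"] = pd.to_datetime(df["date"])  # ISO date strings -> datetime64
    return df


def fetch_holidays(url: str) -> pd.DataFrame:
    """Fetch public holidays from the date.nager.at API."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    return holidays_to_frame(response.json())


# Sample records shaped like the API response (not real API output)
sample = [
    {"date": "2017-01-01", "localName": "Confraternização Universal",
     "name": "New Year's Day"},
    {"date": "2017-09-07", "localName": "Independência do Brasil",
     "name": "Independence Day"},
]
holidays = holidays_to_frame(sample)
print(holidays[["date", "name"]])
```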

Data Schema

See images/data_schema.png for the complete database schema visualization.


Modeling Approach

This project implements an ELT (Extract, Load, Transform) pipeline architecture:

1. Extract Phase (src/extract.py)

  • CSV Data Extraction: Reads 9 CSV files from the dataset/ folder
  • API Data Extraction: Fetches Brazil public holidays via REST API
  • Data Validation: Ensures data integrity and proper data types
  • Output: Dictionary of pandas DataFrames ready for loading
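The core of the extract phase can be sketched as a loop over a filename-to-table mapping; `extract_csvs` below is a simplified stand-in for the real `src/extract.py`, demonstrated against a throwaway CSV rather than the nine Olist files:

```python
import tempfile
from pathlib import Path

import pandas as pd


def extract_csvs(root: Path, csv_to_table: dict[str, str]) -> dict[str, pd.DataFrame]:
    """Read each CSV under `root` into a DataFrame keyed by its target table name."""
    frames = {}
    for csv_name, table_name in csv_to_table.items():
        frames[table_name] = pd.read_csv(root / csv_name)
    return frames


# Demo with a tiny throwaway file (the real pipeline reads the dataset/ folder)
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "olist_sellers_dataset.csv").write_text(
        "seller_id,seller_state\ns1,SP\ns2,RJ\n"
    )
    frames = extract_csvs(root, {"olist_sellers_dataset.csv": "olist_sellers"})
    print(frames["olist_sellers"].shape)  # → (2, 2)
```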

2. Load Phase (src/load.py)

  • Database Creation: Initializes SQLite database (olist.db)
  • Table Creation: Maps CSV files to database tables
  • Data Loading: Uses SQLAlchemy to load DataFrames into SQLite
  • Schema Management: Handles table replacement and indexing
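In pandas/SQLAlchemy terms, the load phase reduces to `DataFrame.to_sql` with `if_exists="replace"`. The sketch below runs against an in-memory database for illustration; the actual `src/load.py` targets `olist.db` on disk and may differ in detail:

```python
import pandas as pd
from sqlalchemy import create_engine


def load(frames: dict[str, pd.DataFrame], engine) -> None:
    """Write each DataFrame to its table, replacing any existing table."""
    for table_name, df in frames.items():
        df.to_sql(table_name, engine, if_exists="replace", index=False)


# Demo against an in-memory SQLite database
engine = create_engine("sqlite:///:memory:")
sellers = pd.DataFrame({"seller_id": ["s1"], "seller_state": ["SP"]})
load({"olist_sellers": sellers}, engine)

count = pd.read_sql("SELECT COUNT(*) AS n FROM olist_sellers", engine)["n"].iloc[0]
print(count)  # → 1
```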

3. Transform Phase (src/transform.py)

SQL-Based Transformations

Seven SQL queries perform complex aggregations and joins:

  1. Revenue by Month/Year: Aggregates revenue by month across 2016-2018
  2. Revenue per State: Top 10 states by total revenue
  3. Top 10 Revenue Categories: Highest revenue categories with order counts
  4. Top 10 Least Revenue Categories: Lowest revenue categories for optimization
  5. Global Order Status: Distribution of orders by status
  6. Delivery Date Difference: Average delivery delay by state
  7. Real vs. Estimated Delivery Time: Monthly comparison of actual vs. estimated delivery times
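The queries themselves live in the queries/ folder; as an illustration of their shape, a monthly-revenue aggregation in SQLite (query #1 above) could look like the SQL string below, run here against a tiny synthetic orders/payments pair. Column names follow the Olist schema, but the repo's actual SQL may differ:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///:memory:")

orders = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "order_status": ["delivered", "delivered", "canceled"],
    "order_delivered_customer_date": ["2017-01-10", "2017-02-05", None],
})
payments = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "payment_value": [100.0, 250.0, 50.0],
})
orders.to_sql("olist_orders", engine, index=False)
payments.to_sql("olist_order_payments", engine, index=False)

# Monthly revenue over delivered orders
query = """
SELECT strftime('%Y-%m', o.order_delivered_customer_date) AS month,
       SUM(p.payment_value)                               AS revenue
FROM olist_orders o
JOIN olist_order_payments p ON p.order_id = o.order_id
WHERE o.order_status = 'delivered'
GROUP BY month
ORDER BY month;
"""
revenue = pd.read_sql(query, engine)
print(revenue)
```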

Pandas-Based Transformations

Two transformations use pandas for complex data manipulation:

  1. Freight Value vs. Weight Relationship: Analyzes correlation between product weight and shipping costs
  2. Orders per Day with Holidays: Daily order counts with holiday indicators for 2017
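The freight-vs-weight transformation boils down to a merge followed by a Pearson correlation. The sketch below uses synthetic stand-ins for `olist_order_items` and `olist_products` (column names follow the Olist schema); the real `src/transform.py` may handle joins and filtering differently:

```python
import pandas as pd

# Synthetic stand-ins for olist_order_items (freight_value) and
# olist_products (product_weight_g)
items = pd.DataFrame({
    "product_id": ["p1", "p2", "p3", "p4"],
    "freight_value": [10.0, 18.5, 35.0, 60.0],
})
products = pd.DataFrame({
    "product_id": ["p1", "p2", "p3", "p4"],
    "product_weight_g": [500, 1200, 3000, 8000],
})

merged = items.merge(products, on="product_id")
corr = merged["product_weight_g"].corr(merged["freight_value"])  # Pearson by default
print(round(corr, 3))
```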

4. Visualization Phase (src/plots.py)

  • Revenue Visualizations: Line charts, bar charts, treemaps
  • Delivery Performance: Time series, bar charts, scatter plots
  • Category Analysis: Pie charts and treemaps
  • Holiday Impact: Time series with holiday markers

Evaluation Metrics

Data Quality Metrics

  • Data Completeness: All 9 CSV files loaded successfully
  • Record Counts: Validated against expected dataset sizes
    • Orders: 99,441
    • Order Items: 112,650
    • Payments: 103,886
    • Products: 32,951
    • Customers: 99,441

Query Accuracy Metrics

  • Unit Tests: All SQL queries validated against expected JSON results
  • Tolerance: Floating-point comparisons within 0.1 tolerance
  • Schema Validation: Column names and data types verified
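The comparison the tests perform can be sketched as follows; `assert_results_match` and the inline expected payload are hypothetical, standing in for whatever JSON files live under tests/query_results/:

```python
import json
import math


def assert_results_match(actual: list[dict], expected: list[dict],
                         tol: float = 0.1) -> None:
    """Compare query output to expected rows, allowing float drift up to `tol`."""
    assert len(actual) == len(expected), "row count mismatch"
    for got, want in zip(actual, expected):
        assert got.keys() == want.keys(), "schema (column) mismatch"
        for col, value in want.items():
            if isinstance(value, float):
                assert math.isclose(got[col], value, abs_tol=tol), \
                    f"{col}: {got[col]} != {value}"
            else:
                assert got[col] == value, f"{col}: {got[col]} != {value}"


# Hypothetical expected-result payload, shaped like a tests/query_results/ file
expected = json.loads('[{"month": "2017-01", "revenue": 100.0}]')
assert_results_match([{"month": "2017-01", "revenue": 100.04}], expected)
print("ok")
```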

Business Metrics Calculated

  1. Revenue Metrics

    • Total revenue by month/year
    • Revenue per state (top 10)
    • Revenue per product category (top/bottom 10)
  2. Delivery Metrics

    • Average delivery delay (days) by state
    • Real vs. estimated delivery time (monthly)
    • Delivery performance trends
  3. Operational Metrics

    • Order status distribution
    • Orders per day with holiday correlation
    • Freight cost vs. weight correlation

Results

Revenue Insights

  • 2017: Revenue grows from Q1 to Q4, peaking in December (~R$1.08M).
  • 2018: Revenue is highest in the first eight months of the year, peaking in August (~R$1.32M).
  • State distribution: Top 10 states by revenue are identified; results show concentration patterns.
  • Category performance: High- and low-performing product categories are identified.

Delivery Insights

  • Delivery accuracy: State-level metrics show variation in delivery performance.
  • Time trends: Monthly comparison of actual vs. estimated delivery times.
  • Holiday impact: Order volumes and delivery times are correlated with public holidays.
  • Shipping costs: Relationship between product weight and freight value is analyzed.

Visualizations Generated

  • Revenue trends by month/year (line + bar charts)
  • Revenue distribution by state (treemap)
  • Top/bottom revenue categories (pie charts and treemaps)
  • Delivery performance metrics (bar charts, time series)
  • Order patterns with holiday markers (time series)
  • Freight value vs. weight scatter plot

Visualizations are produced in the Jupyter notebook and written to the images/ directory.


How to Run

Prerequisites

  • Python 3.9, 3.10, 3.11, or 3.12
  • Windows, Linux, or macOS
  • Git (for cloning the repository)

Step 1: Clone the Repository

git clone <your-repo-url>
cd Sprint_01-Marissa_Singh

Step 2: Set Up Virtual Environment

Windows (PowerShell):

# Create virtual environment
python -m venv venv

# Activate virtual environment
.\venv\Scripts\Activate.ps1

# If you get an execution policy error, run:
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

Linux/macOS:

# Create virtual environment
python3 -m venv venv

# Activate virtual environment
source venv/bin/activate

Step 3: Install Dependencies

# Upgrade pip
python -m pip install --upgrade pip

# Install requirements
pip install -r requirements.txt

Step 4: Download Dataset

  1. Download the dataset from this link
  2. Extract the dataset folder from the .zip file
  3. Place the dataset folder in the project root directory

Verify dataset structure:

dataset/
├── olist_customers_dataset.csv
├── olist_geolocation_dataset.csv
├── olist_order_items_dataset.csv
├── olist_order_payments_dataset.csv
├── olist_order_reviews_dataset.csv
├── olist_orders_dataset.csv
├── olist_products_dataset.csv
├── olist_sellers_dataset.csv
└── product_category_name_translation.csv

Step 5: Run Tests

# Run all tests
pytest tests/

# Run specific test files
pytest tests/test_extract.py
pytest tests/test_transform.py

Step 6: Execute the Pipeline

Option A: Jupyter Notebook

# Start Jupyter Notebook
jupyter notebook

# Open and run: "elt_pipeline.ipynb"
# Execute all cells in order

Option B: Python Script

from src import config
from src.extract import extract
from src.load import load
from src.transform import run_queries
from sqlalchemy import create_engine
from pathlib import Path

# Create database
Path(config.SQLITE_BD_ABSOLUTE_PATH).touch()
ENGINE = create_engine(f"sqlite:///{config.SQLITE_BD_ABSOLUTE_PATH}", echo=False)

# Extract
csv_dataframes = extract(
    config.DATASET_ROOT_PATH,
    config.get_csv_to_table_mapping(),
    config.PUBLIC_HOLIDAYS_URL
)

# Load
load(csv_dataframes, ENGINE)

# Transform
query_results = run_queries(ENGINE)

Step 7: View Results

  • Database: SQLite database at olist.db
  • Visualizations: Rendered in the Jupyter notebook
  • Images: Output written to images/

Optional: Code formatting

# Format code with Black
black --line-length=88 .

Architecture Diagram

┌────────────────────────────────────────────────────────────┐
│                        DATA SOURCES                        │
├────────────────────────────────────────────────────────────┤
│                                                            │
│   ┌────────────────────┐       ┌────────────────────────┐  │
│   │  CSV Files         │       │  Public Holidays API   │  │
│   │  (9 datasets)      │       │  (date.nager.at)       │  │
│   └────────────────────┘       └────────────────────────┘  │
│             │                              │               │
└─────────────┼──────────────────────────────┼───────────────┘
              │                              │
              ▼                              ▼
┌────────────────────────────────────────────────────────────┐
│                       EXTRACT PHASE                        │
│                      (src/extract.py)                      │
├────────────────────────────────────────────────────────────┤
│  • Read CSV files using pandas                             │
│  • Fetch API data using requests                           │
│  • Validate and clean data                                 │
│  • Return dictionary of DataFrames                         │
└────────────────────────────────────────────────────────────┘
            │
            ▼
┌────────────────────────────────────────────────────────────┐
│                         LOAD PHASE                         │
│                       (src/load.py)                        │
├────────────────────────────────────────────────────────────┤
│  • Create SQLite database (olist.db)                       │
│  • Map CSV files to database tables                        │
│  • Load DataFrames using SQLAlchemy                        │
│  • Create indexes for performance                          │
└────────────────────────────────────────────────────────────┘
            │
            ▼
┌────────────────────────────────────────────────────────────┐
│                      TRANSFORM PHASE                       │
│                     (src/transform.py)                     │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  ┌─────────────────────────┐  ┌─────────────────────────┐  │
│  │  SQL Transformations    │  │  Pandas Transformations │  │
│  ├─────────────────────────┤  ├─────────────────────────┤  │
│  │ • Revenue queries       │  │ • Freight vs Weight     │  │
│  │ • Delivery queries      │  │ • Orders & Holidays     │  │
│  │ • Category analysis     │  │                         │  │
│  │ • State analysis        │  │                         │  │
│  └─────────────────────────┘  └─────────────────────────┘  │
│                                                            │
└────────────────────────────────────────────────────────────┘
            │
            ▼
┌────────────────────────────────────────────────────────────┐
│                    VISUALIZATION PHASE                     │
│                      (src/plots.py)                        │
├────────────────────────────────────────────────────────────┤
│  • Revenue visualizations (line, bar, treemap)             │
│  • Delivery performance charts                             │
│  • Category analysis (pie charts)                          │
│  • Holiday impact analysis                                 │
└────────────────────────────────────────────────────────────┘
            │
            ▼
┌────────────────────────────────────────────────────────────┐
│                           OUTPUT                           │
├────────────────────────────────────────────────────────────┤
│  • SQLite database (olist.db)                              │
│  • Visualizations (images/)                                │
│  • Query results (DataFrames)                              │
│  • Jupyter notebook report                                 │
└────────────────────────────────────────────────────────────┘

Pipeline Flow

  1. Extract → Data from CSV files and API
  2. Load → Data into SQLite database
  3. Transform → SQL queries and pandas operations
  4. Visualize → Generate charts and reports

Project Structure

Sprint_01-Marissa_Singh/
├── dataset/                          # CSV data files (download separately)
│   ├── olist_customers_dataset.csv
│   ├── olist_geolocation_dataset.csv
│   ├── olist_order_items_dataset.csv
│   ├── olist_order_payments_dataset.csv
│   ├── olist_order_reviews_dataset.csv
│   ├── olist_orders_dataset.csv
│   ├── olist_products_dataset.csv
│   ├── olist_sellers_dataset.csv
│   └── product_category_name_translation.csv
├── images/                           # Generated visualizations
│   ├── data_schema.png
│   ├── freight_value_weight_relationship.png
│   └── orders_per_day_and_holidays.png
├── queries/                          # SQL query files
│   ├── delivery_date_difference.sql
│   ├── global_amount_order_status.sql
│   ├── real_vs_estimated_delivered_time.sql
│   ├── revenue_by_month_year.sql
│   ├── revenue_per_state.sql
│   ├── top_10_least_revenue_categories.sql
│   └── top_10_revenue_categories.sql
├── src/                              # Source code
│   ├── __init__.py
│   ├── config.py                    # Configuration and paths
│   ├── extract.py                   # Extract phase
│   ├── load.py                      # Load phase
│   ├── transform.py                 # Transform phase
│   └── plots.py                     # Visualization functions
├── tests/                            # Unit tests
│   ├── __init__.py
│   ├── query_results/                # Expected query results (JSON)
│   ├── test_extract.py
│   └── test_transform.py
├── ASSIGNMENT.md                     # Project instructions
├── elt_pipeline.ipynb               # Main Jupyter notebook
├── README.md                         # This file
├── README_og.md                      # Original README
├── requirements.txt                  # Python dependencies
└── olist.db                          # Generated SQLite database (after running)

Technologies Used

  • Python 3.9+: Core programming language
  • Pandas: Data manipulation and analysis
  • SQLAlchemy: Database ORM and connection management
  • SQLite: Lightweight database engine
  • Requests: HTTP library for API calls
  • Matplotlib: Static plotting library
  • Seaborn: Statistical data visualization
  • Plotly: Interactive visualizations
  • Jupyter: Interactive notebook environment
  • Pytest: Testing framework
  • Black: Code formatter

License

This project is part of an educational course assignment.


Author

Marissa Singh


Acknowledgments

  • Olist for providing the public e-commerce dataset
  • date.nager.at for the public holidays API
  • AnyoneAI for the project framework and guidance
