# Data Engineering Capstone Project

**Specialization:** Data Engineering
**Focus:** Data Modeling • Version Control • Python Integration

## Table of Contents

- Project Overview
- Dataset Mapping (Olist → Pay-Flow)
- Technologies & Tools
- Step-by-Step What This Repository Does
- Repository Structure
- Quick Start (Local Setup)
- Learning Outcomes You Will Gain
- Dataset Download
- Connect & Contribute
## Project Overview

Pay-Flow is a fictional fintech platform specializing in cross-border payments, merchant services, and transaction analytics. This repository demonstrates how to build a production-grade, version-controlled, and Python-automated PostgreSQL database using the real-world Brazilian E-Commerce Public Dataset by Olist.
The project solves three critical enterprise challenges faced by fintech teams:
- Code Management – No more scattered SQL files; everything lives in Git with proper versioning and collaboration.
- Analytics Gap – Raw transaction data is transformed into rich analytical schemas and views (merchant revenue, geographic fraud patterns, payment trends, etc.).
- Python Integration – Fully reproducible ETL pipelines using Python (psycopg2 + pandas) that can be scheduled, tested, and versioned.
Result: A clean, optimized, scalable database ready for fraud detection, BI dashboards, and regulatory reporting — exactly what fintech companies demand.
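The pandas + psycopg2 pattern behind the third point can be sketched as follows. This is an illustrative helper, not code from the repository; the table and column names are placeholders:

```python
import pandas as pd

def build_insert(table: str, df: pd.DataFrame) -> tuple[str, list[tuple]]:
    """Build a parameterized INSERT statement plus row tuples suitable for
    psycopg2's cursor.executemany(); keeping SQL and data separate avoids
    string-concatenation injection bugs."""
    cols = ", ".join(df.columns)
    placeholders = ", ".join(["%s"] * len(df.columns))
    sql = f"INSERT INTO {table} ({cols}) VALUES ({placeholders})"
    rows = [tuple(r) for r in df.itertuples(index=False, name=None)]
    return sql, rows

# Tiny stand-in for a real Olist CSV loaded with pd.read_csv()
customers = pd.DataFrame({"customer_id": ["c1", "c2"],
                          "customer_city": ["sao paulo", "rio"]})
sql, rows = build_insert("customers", customers)
# With a live database you would then run:
#   with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
#       cur.executemany(sql, rows)
```

Because the statement and the row tuples are built separately, the same helper works for any of the staging tables.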
## Dataset Mapping (Olist → Pay-Flow)

| Raw CSV File | Target Table | Key Transformations |
|---|---|---|
| `olist_customers_dataset.csv` | `customers` | Direct mapping |
| `olist_sellers_dataset.csv` | `merchants` | `seller_id` → `merchant_id` |
| `olist_orders_dataset.csv` | `transactions` | Core order data |
| `olist_order_items_dataset.csv` | `transactions` | Adds `seller_id`, `price`, `freight_value` |
| `olist_order_payments_dataset.csv` | `transactions` | Adds `payment_type`, `payment_value` |
| Other 4 files | Not used | — |
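The three-way merge in the table above can be sketched with pandas. The frames here are tiny stand-ins for the real CSVs, which carry many more columns:

```python
import pandas as pd

# Minimal stand-ins for the three Olist CSVs that feed `transactions`
orders = pd.DataFrame({"order_id": ["o1"], "customer_id": ["c1"]})
items = pd.DataFrame({"order_id": ["o1"], "seller_id": ["s1"],
                      "price": [100.0], "freight_value": [10.0]})
payments = pd.DataFrame({"order_id": ["o1"], "payment_type": ["credit_card"],
                         "payment_value": [110.0]})

transactions = (
    orders
    .merge(items, on="order_id", how="left")      # add seller, price, freight
    .merge(payments, on="order_id", how="left")   # add payment details
    .rename(columns={"seller_id": "merchant_id"}) # Olist sellers → Pay-Flow merchants
)
```

Left joins keep every order even when an item or payment row is missing, which mirrors how the raw Olist files relate to each other.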
## Technologies & Tools

- Database: PostgreSQL 16 (with proper indexing & constraints)
- Language: Python 3.11 (pandas + psycopg2 for ETL automation)
- Version Control: Git + GitHub (branching strategy for SQL & Python)
- Schema Design: 3NF + Star-Schema analytical layer
- Automation: Python scripts that generate, migrate, and populate tables
## Step-by-Step: What This Repository Does

1. **Version-Controlled SQL**
   - All SQL scripts are stored in `/sql/` with semantic versioning (`v1.0.0__create_customers.sql`)
   - Git branching strategy (`main`, `feature/`, `release/`) enforced
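Semantically versioned migration files need to be applied in numeric order, and a plain string sort gets this wrong (`v1.10.0` would sort before `v1.2.0`). A small helper, assumed rather than taken from the repo, might look like:

```python
def order_migrations(names: list[str]) -> list[str]:
    """Sort migration filenames of the form v<major>.<minor>.<patch>__desc.sql
    numerically, so they can be applied to the database in sequence."""
    def key(name: str) -> tuple[int, ...]:
        version = name.split("__")[0].lstrip("v")  # "v1.10.0" -> "1.10.0"
        return tuple(int(part) for part in version.split("."))
    return sorted(names, key=key)

ordered = order_migrations([
    "v1.10.0__add_index.sql",
    "v1.0.0__create_customers.sql",
    "v1.2.0__add_fk.sql",
])
```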
2. **DDL Scripts** (`/sql/ddl/`)
   - Create `customers`, `merchants`, and `transactions` tables
   - Proper primary keys, foreign keys, indexes, and constraints
   - Data dictionary included as Markdown + SQL comments
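A simplified sketch of what a `transactions` DDL script might contain, shown as a Python string ready for `cursor.execute()`. The column set is inferred from the dataset mapping above; the real schema in `/sql/ddl/` has more columns and constraints:

```python
# Hypothetical simplified DDL; the authoritative scripts live in /sql/ddl/.
DDL_TRANSACTIONS = """
CREATE TABLE IF NOT EXISTS transactions (
    order_id                 TEXT PRIMARY KEY,
    customer_id              TEXT NOT NULL REFERENCES customers (customer_id),
    merchant_id              TEXT REFERENCES merchants (merchant_id),
    price                    NUMERIC(10, 2),
    freight_value            NUMERIC(10, 2),
    payment_type             TEXT,
    payment_value            NUMERIC(10, 2),
    order_purchase_timestamp TIMESTAMP
);

-- Column-level comments double as the in-database data dictionary
COMMENT ON COLUMN transactions.payment_value IS 'Total amount paid (BRL)';
"""
```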
3. **Python ETL Automation**
   - `etl_load_data.py` → reads all CSVs and loads them into PostgreSQL
   - `etl_merge_transactions.py` → combines orders, items, and payments into one enriched `transactions` table
   - `etl_create_analytical_views.py` → builds the materialized views and analytical layer
   - Fully reproducible: run `python run_etl.py` to recreate everything from scratch
4. **Analytical Views**
   - `vw_merchant_revenue` → highest-revenue merchants
   - `vw_geographic_fraud_patterns` → fraud signals by state/city
   - `vw_payment_trends` → payment method adoption over time
   - `vw_customer_lifetime_value` → CLV calculations
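As an illustration of the analytical layer, a merchant-revenue view could be defined roughly like this (a hypothetical sketch; the actual definitions live in `/sql/views/`):

```python
# Hypothetical definition of one analytical view, as a Python string
# ready for cursor.execute(); not copied from the repository.
VW_MERCHANT_REVENUE = """
CREATE MATERIALIZED VIEW IF NOT EXISTS vw_merchant_revenue AS
SELECT merchant_id,
       SUM(payment_value) AS total_revenue,
       COUNT(*)           AS order_count
FROM transactions
GROUP BY merchant_id
ORDER BY total_revenue DESC;
"""
```

Materializing the view trades storage for query speed, which suits dashboard workloads that read far more often than the underlying table changes.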
5. **Performance & Compliance**
   - Optimized for complex fintech queries (sub-second response on millions of rows)
   - Indexes on high-frequency columns (`order_id`, `merchant_id`, `customer_id`, `order_purchase_timestamp`)
   - Partitioning-ready design
   - Column-level comments for compliance audits
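The index list above could be generated mechanically. This sketch assumes one single-column index per high-frequency column; the repository's actual index statements may differ:

```python
# Hypothetical index DDL for the high-frequency columns named above;
# the authoritative statements live in the repo's /sql/ scripts.
HIGH_FREQUENCY_COLUMNS = [
    "order_id", "merchant_id", "customer_id", "order_purchase_timestamp",
]

INDEX_DDL = [
    f"CREATE INDEX IF NOT EXISTS idx_transactions_{col} ON transactions ({col});"
    for col in HIGH_FREQUENCY_COLUMNS
]
```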
## Repository Structure

```
pay-flow-database-optimization/
├── sql/
│   ├── ddl/               # Table creation scripts
│   ├── views/             # Analytical views
│   └── migrations/        # Versioned migration files
├── python/
│   ├── etl/               # Data loading & merging scripts
│   ├── analytics/         # View creation scripts
│   └── utils/             # Database connection helpers
├── data/                  # Raw Olist CSVs (gitignore'd — download link below)
├── docs/
│   ├── data_dictionary.md
│   └── schema_diagram.png
├── .github/workflows/     # CI/CD for SQL validation (optional)
├── run_etl.py             # One-click full pipeline
├── requirements.txt
├── README.md              # ← You are here
└── LICENSE
```
## Quick Start (Local Setup)

1. **Clone the repo**

   ```bash
   git clone https://github.com/yourusername/pay-flow-database-optimization.git
   cd pay-flow-database-optimization
   ```
2. **Create the PostgreSQL database**

   ```sql
   CREATE DATABASE payflow_db;
   ```
3. **Install Python dependencies**

   ```bash
   pip install -r requirements.txt
   ```
4. **Configure the connection** (edit `python/utils/db_config.py` or use environment variables)
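A configuration helper along these lines would support the environment-variable option. This is an assumed sketch of what `python/utils/db_config.py` might contain, using the standard `PG*` variable names with local defaults:

```python
import os

def db_config() -> dict:
    """Read PostgreSQL connection settings from the environment, falling
    back to sensible local-development defaults. The dict's keys match
    psycopg2.connect()'s keyword arguments."""
    return {
        "host": os.environ.get("PGHOST", "localhost"),
        "port": int(os.environ.get("PGPORT", "5432")),
        "dbname": os.environ.get("PGDATABASE", "payflow_db"),
        "user": os.environ.get("PGUSER", "postgres"),
        "password": os.environ.get("PGPASSWORD", ""),
    }
```

Keeping credentials out of the repo and in the environment is what lets the same scripts run unchanged in CI and locally.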
5. **Run the full pipeline**

   ```bash
   python run_etl.py
   ```
6. **Explore**
   - Connect with pgAdmin / DBeaver
   - Query the analytical views
## Learning Outcomes You Will Gain

- Professional PostgreSQL data modeling for fintech
- Git-based collaborative SQL development (exactly how senior data engineers work)
- Python automation of ETL + schema management
- Building analytical layers on top of transactional data
- Real-world documentation & migration practices
## Dataset Download

Download the full Olist dataset here: https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce

Place the CSVs in the `data/` folder.
## Connect & Contribute

This repo is part of my Data Engineering Portfolio. Feel free to fork, open issues, or use it as a template for your own fintech/database projects!

Made with ❤️ for real-world data engineering practice.

**Status:** Production-ready template