Skip to content

spachito/Data_Cleaning_Football_Data_SQL

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Data_Cleaning_Football_Data_SQL

📌 Project Overview

End-to-End Data Cleaning and Preparation on a raw FIFA players dataset using PostgreSQL and pgAdmin 4.

🛠️ Tools & Technologies Used

Database:PostgreSQL

Query Language: SQL

Database Management: pgAdmin 4

Data Manipulation & String Matching: PostgreSQL built-in string functions, Regular Expressions (Regex), fuzzystrmatch extension

🧹 Methodology: Data Cleaning & Processing

To ensure data integrity before analysis, the following steps were performed:

Database Setup:

Imported the raw CSV file into a base table (fifa_ratings) and immediately created a working copy (fifa_rt) to safely perform all transformations.

Duplicate Removal & Integrity:

Verified the uniqueness of the player id and checked for duplicate player entries (based on name, age, and nationality) using Window Functions like ROW_NUMBER() OVER()

Text Normalization & Fuzzy Matching:

Trimmed whitespaces from player names and nationalities. Standardized text by converting nationalities to uppercase and used REGEXP_REPLACE to strip hidden characters (\n, \r, \t) from club names. Employed the fuzzystrmatch extension, using levenshtein distance and similarity scores, to detect and verify spelling inconsistencies in nationalities and club names.

Data Parsing & Feature Engineering:

Split the complex text column contract into four distinct, functional columns: contract_start, contract_end, contract_type (e.g., 'Regular', 'On Loan', 'Free'), and loan_end_date. Extracted the specific best_position from a comma-separated list of positions using split_part.

Data Type Conversion:

Converted date strings (like joined and loan_end_date) into proper DATE format using to_date for accurate temporal analysis.

Unit Standardization:

Converted mixed height measurements (feet/inches) into a uniform centimeter (cm) scale. Converted imperial weight measurements (lbs) into metric kilograms (kg) and cast them to integers.

Financial Data Cleaning:

Standardized monetary columns (value, wage, release_clause) by stripping the Euro symbol ('€') and mathematically converting text abbreviations ('M' for millions, 'K' for thousands) into uniform numeric types

Range Validations:

Performed logical checks on numerical columns (age, ova, pot, bov) to ensure all values fell within realistic limits (e.g., confirming max age was 53 and minimum overall rating was 47) and contained no unexpected nulls.

About

End-to-End Data Cleaning and Preparation on a raw FIFA players dataset using PostgreSQL and pgAdmin 4.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors