End-to-End Data Cleaning and Preparation on a raw FIFA players dataset using PostgreSQL and pgAdmin 4.
Database:PostgreSQL
Query Language: SQL
Database Management: pgAdmin 4
Data Manipulation & String Matching: PostgreSQL built-in string functions, Regular Expressions (Regex), fuzzystrmatch extension
To ensure data integrity before analysis, the following steps were performed:
Imported the raw CSV file into a base table (fifa_ratings) and immediately created a working copy (fifa_rt) to safely perform all transformations.
Verified the uniqueness of the player id and checked for duplicate player entries (based on name, age, and nationality) using Window Functions like ROW_NUMBER() OVER()
Trimmed whitespaces from player names and nationalities. Standardized text by converting nationalities to uppercase and used REGEXP_REPLACE to strip hidden characters (\n, \r, \t) from club names. Employed the fuzzystrmatch extension, using levenshtein distance and similarity scores, to detect and verify spelling inconsistencies in nationalities and club names.
Split the complex text column contract into four distinct, functional columns: contract_start, contract_end, contract_type (e.g., 'Regular', 'On Loan', 'Free'), and loan_end_date. Extracted the specific best_position from a comma-separated list of positions using split_part.
Converted date strings (like joined and loan_end_date) into proper DATE format using to_date for accurate temporal analysis.
Converted mixed height measurements (feet/inches) into a uniform centimeter (cm) scale. Converted imperial weight measurements (lbs) into metric kilograms (kg) and cast them to integers.
Standardized monetary columns (value, wage, release_clause) by stripping the Euro symbol ('€') and mathematically converting text abbreviations ('M' for millions, 'K' for thousands) into uniform numeric types
Performed logical checks on numerical columns (age, ova, pot, bov) to ensure all values fell within realistic limits (e.g., confirming max age was 53 and minimum overall rating was 47) and contained no unexpected nulls.