A collection of five data cleaning projects that demonstrate data preprocessing, feature engineering, and CSV output.
This repository is a portfolio piece that showcases Python data cleaning skills.
Location: Project_HouseNew/
Raw Data: Project_HouseNew/Datasets/HouseNew.csv
Cleaned Data: Project_HouseNew/CleanedDatasets/clean_HouseNew.csv
Script: Project_HouseNew/Scripts/Clean_HouseNew.py
Description:
- Fills missing `Address` values using mode
- Adds mean price feature (`MeanPrice`)
- Calculates difference from mean price (`DifPrice`)
- Saves cleaned dataset as CSV
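The steps above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code in `Clean_HouseNew.py`, which may be written quite differently. The function name `add_price_features` and the toy rows are hypothetical; only the column names come from this README.

```python
from collections import Counter
from statistics import mean


def add_price_features(rows):
    """Fill missing Address with the mode, then add MeanPrice and DifPrice.

    Hypothetical helper for illustration; `rows` is a list of dicts.
    """
    # The most frequent non-empty address (the mode) is the fill value.
    known = [r["Address"] for r in rows if r["Address"]]
    mode_address = Counter(known).most_common(1)[0][0]

    # Assumption: MeanPrice is the mean over the whole Price column.
    mean_price = mean(r["Price"] for r in rows)

    cleaned = []
    for r in rows:
        row = dict(r)  # copy so the input rows are left untouched
        if not row["Address"]:
            row["Address"] = mode_address
        row["MeanPrice"] = mean_price
        row["DifPrice"] = row["Price"] - mean_price
        cleaned.append(row)
    return cleaned


# Toy rows invented for this example, not taken from HouseNew.csv.
rows = [
    {"Address": "A", "Price": 100},
    {"Address": "A", "Price": 200},
    {"Address": "", "Price": 300},
]
cleaned = add_price_features(rows)
```

In the sample output, every row carries the same `MeanPrice`, which is consistent with a single mean taken over the whole `Price` column, and `DifPrice` equals `Price` minus `MeanPrice`.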
Sample CSV Output:
| Elevator | Floor | Area | Parking | Room | Warehouse | YearOfConstruction | Address | Price | MeanPrice | DifPrice |
|---|---|---|---|---|---|---|---|---|---|---|
| True | 1 | 311 | True | 4 | True | 1396 | دروس | 31100000000 | 6159466196 | 24940533804 |
| True | 13 | 99 | True | 2 | True | 1401 | دریاچه شهدای خلیج فارس | 4700000000 | 6159466196 | -1459466196 |
| True | 15 | 251 | True | 3 | True | 1375 | شهرک غرب | 30120000000 | 6159466196 | 23960533804 |
Location: Project_NBA/
Raw Data: Project_NBA/Datasets/NBA.csv
Cleaned Data: Project_NBA/CleanedDatasets/clean_NBA.csv
Script: Project_NBA/Scripts/Clean_NBA.py
Description:
- Remove empty rows
- Convert Height to inches and centimeters
- Convert Weight to pounds and kilograms
- Fill missing College values using mode
- Fill missing Salary values using mean
- Save cleaned dataset as CSV
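As a rough sketch of the conversions and the mean fill listed above, using only the standard library: the helper names below are hypothetical and the real `Clean_NBA.py` may differ. The sketch assumes `Height` is a `feet-inches` string such as `6-2` and `Weight` is in pounds, as the sample output suggests.

```python
from statistics import mean

CM_PER_INCH = 2.54
KG_PER_POUND = 0.45359237


def height_to_inches(height):
    """Convert a 'feet-inches' string such as '6-2' to total inches."""
    feet, inches = height.split("-")
    return int(feet) * 12 + int(inches)


def inches_to_cm(inches):
    """Convert inches to centimeters (left unrounded)."""
    return inches * CM_PER_INCH


def pounds_to_kg(pounds):
    """Convert pounds to kilograms (left unrounded)."""
    return pounds * KG_PER_POUND


def fill_with_mean(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    average = mean(known)
    return [average if v is None else v for v in values]


height_in = height_to_inches("6-2")
height_cm = inches_to_cm(height_in)
weight_kg = round(pounds_to_kg(180), 1)
salaries = fill_with_mean([10, None, 30])  # toy values, not real salaries
```

The sample table shows whole centimeters and kilograms with one decimal place. How the script rounds is not stated here, so the centimeter value is left unrounded in this sketch.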
Sample CSV Output:
| Name | Team | Number | Position | Age | Height | Height_in | Height_cm | Weight | Weight_lb | Weight_kg | College | Salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Avery Bradley | Boston Celtics | 0 | PG | 25 | 6-2 | 74 | 188 | 180 | 180 | 81.6 | Texas | 7730337 |
| Jae Crowder | Boston Celtics | 99 | SF | 25 | 6-6 | 78 | 198 | 235 | 235 | 106.6 | Marquette | 6796117 |
| John Holland | Boston Celtics | 30 | SG | 27 | 6-5 | 77 | 195 | 205 | 205 | 93.0 | Boston University | 4842684 |
Location: Project_Sales/
Raw Data: Project_Sales/Datasets/Sales.csv
Cleaned Data: Project_Sales/CleanedDatasets/clean_Sales.csv
Script: Project_Sales/Scripts/Clean_Sales.py
Description:
- Strips extra spaces from `CategoryOfBook`
- Removes commas `,` and `$` from `SalesAmount` and converts to float
- Saves cleaned dataset as CSV
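The string clean-up described above might look something like this. It is a standard-library sketch with hypothetical helper names, and the raw input formats shown (padded category text, amounts such as `$1,299.85`) are assumptions rather than values taken from `Sales.csv`.

```python
def clean_category(value):
    """Strip leading and trailing whitespace from a category label."""
    return value.strip()


def parse_sales_amount(value):
    """Remove '$' and ',' from an amount string and convert it to float."""
    return float(value.replace("$", "").replace(",", "").strip())


# Assumed raw values, invented for illustration.
category = clean_category("  business ")
amount = parse_sales_amount("$1,299.85")
```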
Sample CSV Output:
| ID | BookID | BookName | CategoryOfBook | SalesAmount |
|---|---|---|---|---|
| 1 | BU1032 | The Busy Executive's Database Guide | business | 299.85 |
| 2 | BU1111 | Cooking with Computers: Surreptitious Balance Sheets | business | 298.75 |
| 3 | BU2075 | You Can Combat Computer Stress! | business | 104.65 |
Location: Project_SMSSpam/
Raw Data: Project_SMSSpam/Datasets/SMSSpamCollection
Cleaned Data: Project_SMSSpam/CleanedDatasets/clean_SMSSPAMCollection.csv
Script: Project_SMSSpam/Scripts/SMSSpamCollection.py
Description:
- Removes punctuation from `message` column
- Calculates length of messages (`LenOfMessage`)
- Saves cleaned dataset as CSV
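One possible way to express these two steps with the standard library is shown below. The helper name is hypothetical, and the raw message text is an assumed original for the second sample row. Measuring the length on the raw message rather than the cleaned one is also an assumption, suggested by the sample output where `LenOfMessage` is larger than the length of `clean_message`.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation(message):
    """Return the message with ASCII punctuation removed."""
    return message.translate(_PUNCTUATION_TABLE)


# Assumed raw text for illustration.
message = "Ok lar... Joking wif u oni..."
clean_message = remove_punctuation(message)

# Assumption: the length is taken from the raw message, before cleaning.
len_of_message = len(message)
```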
Sample CSV Output:
| label | clean_message | LenOfMessage |
|---|---|---|
| ham | Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat | 111 |
| ham | Ok lar Joking wif u oni | 29 |
| spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive entry questionstd txt rateTCs apply 08452810075over18s | 155 |
Location: Project_Titanic/
Raw Data: Project_Titanic/Datasets/Titanic.csv
Cleaned Data: Project_Titanic/CleanedDatasets/clean_Titanic.csv
Script: Project_Titanic/Scripts/Titanic.py
Description:
- Drops `Cabin` column
- Fills missing `Age` values with mean
- Fills missing `Embarked` values with mode
- Saves cleaned dataset as CSV
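A minimal sketch of these steps, again using only the standard library and not taken from the repository's script. The function name `clean_passengers` and the toy rows are hypothetical, and missing values are assumed to appear as `None` or an empty string.

```python
from collections import Counter
from statistics import mean


def clean_passengers(rows):
    """Drop Cabin, fill missing Age with the mean and Embarked with the mode."""
    mean_age = mean(r["Age"] for r in rows if r["Age"] is not None)
    ports = [r["Embarked"] for r in rows if r["Embarked"]]
    mode_port = Counter(ports).most_common(1)[0][0]

    cleaned = []
    for r in rows:
        # Copy every column except Cabin.
        row = {key: value for key, value in r.items() if key != "Cabin"}
        if row["Age"] is None:
            row["Age"] = mean_age
        if not row["Embarked"]:
            row["Embarked"] = mode_port
        cleaned.append(row)
    return cleaned


# Toy rows invented for this example, not taken from Titanic.csv.
passengers = [
    {"Age": 22.0, "Embarked": "S", "Cabin": None},
    {"Age": None, "Embarked": "S", "Cabin": "C85"},
    {"Age": 38.0, "Embarked": "", "Cabin": None},
]
cleaned_passengers = clean_passengers(passengers)
```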
Sample CSV Output:
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | S |
git clone git@github.com:SamiraSiavash/Data_Cleaning_Project.git
cd Data_Cleaning_Project
python -m venv .venv
On Windows: .venv\Scripts\activate
On macOS or Linux: source .venv/bin/activate
pip install -r requirements.txt
Run each project's script with Python from the repository root:
python Project_HouseNew/Scripts/Clean_HouseNew.py
python Project_NBA/Scripts/Clean_NBA.py
python Project_Sales/Scripts/Clean_Sales.py
python Project_SMSSpam/Scripts/Clean_SMSSpam.py
python Project_Titanic/Scripts/Clean_Titanic.py
Each cleaned CSV is saved in its project's CleanedDatasets/ folder.
Data_Cleaning_Project/
├── Project_HouseNew/
│ ├── Datasets/
│ ├── CleanedDatasets/
│ └── Scripts/
├── Project_NBA/
│ ├── Datasets/
│ ├── CleanedDatasets/
│ └── Scripts/
├── Project_Sales/
│ ├── Datasets/
│ ├── CleanedDatasets/
│ └── Scripts/
├── Project_SMSSpam/
│ ├── Datasets/
│ ├── CleanedDatasets/
│ └── Scripts/
├── Project_Titanic/
│ ├── Datasets/
│ ├── CleanedDatasets/
│ └── Scripts/
├── .gitignore
├── README.md
└── requirements.txt
- All scripts are independent and can be run separately.
- The project demonstrates basic data cleaning and feature engineering suitable for portfolio purposes.
- Use requirements.txt to ensure consistent dependencies.
MIT License (optional)
Samira Siavash
🔗 GitHub: https://github.com/SamiraSiavash
🔗 LinkedIn: https://linkedin.com/in/samira-siavash