Data Cleaning Project

A collection of five data cleaning projects demonstrating data preprocessing, feature engineering, and CSV output.
The repository is intended as a portfolio piece that showcases Python data cleaning skills.


Projects Overview

1. House Dataset

Location: Project_HouseNew/
Raw Data: Project_HouseNew/Datasets/HouseNew.csv
Cleaned Data: Project_HouseNew/CleanedDatasets/clean_HouseNew.csv
Script: Project_HouseNew/Scripts/Clean_HouseNew.py


Description:

  • Fills missing Address values with the mode (the most frequent address)
  • Adds mean price feature (MeanPrice)
  • Calculates difference from mean price (DifPrice)
  • Saves cleaned dataset as CSV
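The steps above can be sketched as follows. This is a minimal illustration rather than the contents of Clean_HouseNew.py: it assumes pandas is used, the function name `clean_house` and the toy data are hypothetical, and MeanPrice is taken to be the overall mean of Price, which matches the sample output where every row carries the same MeanPrice.

```python
import pandas as pd


def clean_house(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing Address with the mode, then add MeanPrice and DifPrice."""
    out = df.copy()
    # mode() ignores missing values and returns a Series (ties are possible),
    # so take the first entry as the fill value.
    most_common_address = out["Address"].mode().iloc[0]
    out["Address"] = out["Address"].fillna(most_common_address)
    # MeanPrice repeats the overall mean price on every row (assumption).
    out["MeanPrice"] = out["Price"].mean()
    # DifPrice is each row's price minus the mean price.
    out["DifPrice"] = out["Price"] - out["MeanPrice"]
    return out


# Toy data standing in for HouseNew.csv; the values are made up.
raw = pd.DataFrame({
    "Address": ["A", None, "A", "B"],
    "Price": [100.0, 200.0, 300.0, 400.0],
})
cleaned = clean_house(raw)
print(cleaned)

# Saving would then be a single call (index=False is an assumption):
# cleaned.to_csv("Project_HouseNew/CleanedDatasets/clean_HouseNew.csv", index=False)
```

The sample output shows integer prices, so the actual script presumably casts MeanPrice and DifPrice to integers; this sketch leaves them as floats.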

Sample CSV Output:

| Elevator | Floor | Area | Parking | Room | Warehouse | YearOfConstruction | Address | Price | MeanPrice | DifPrice |
|---|---|---|---|---|---|---|---|---|---|---|
| True | 1 | 311 | True | 4 | True | 1396 | دروس | 31100000000 | 6159466196 | 24940533804 |
| True | 13 | 99 | True | 2 | True | 1401 | دریاچه شهدای خلیج فارس | 4700000000 | 6159466196 | -1459466196 |
| True | 15 | 251 | True | 3 | True | 1375 | شهرک غرب | 30120000000 | 6159466196 | 23960533804 |


2. NBA Dataset

Location: Project_NBA/
Raw Data: Project_NBA/Datasets/NBA.csv
Cleaned Data: Project_NBA/CleanedDatasets/clean_NBA.csv
Script: Project_NBA/Scripts/Clean_NBA.py

Description:

  • Removes empty rows
  • Converts Height to inches (Height_in) and centimeters (Height_cm)
  • Converts Weight to pounds (Weight_lb) and kilograms (Weight_kg)
  • Fills missing College values with the mode
  • Fills missing Salary values with the mean
  • Saves cleaned dataset as CSV
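A rough sketch of these conversions, assuming pandas and a Height column stored as feet-inches strings such as 6-2 (as in the sample output). The helper names, the rounding choices, and the toy rows are assumptions for illustration and may differ from Clean_NBA.py.

```python
import pandas as pd

IN_TO_CM = 2.54        # centimeters per inch
LB_TO_KG = 0.45359237  # kilograms per pound


def height_to_inches(height: str) -> int:
    """Convert a 'feet-inches' string such as '6-2' to total inches."""
    feet, inches = height.split("-")
    return int(feet) * 12 + int(inches)


def clean_nba(df: pd.DataFrame) -> pd.DataFrame:
    # Drop rows in which every column is missing.
    out = df.dropna(how="all").copy()
    # Height: total inches, then centimeters rounded to a whole number.
    out["Height_in"] = out["Height"].map(height_to_inches)
    out["Height_cm"] = (out["Height_in"] * IN_TO_CM).round().astype(int)
    # Weight is assumed to be in pounds already; kilograms to one decimal.
    out["Weight_lb"] = out["Weight"]
    out["Weight_kg"] = (out["Weight"] * LB_TO_KG).round(1)
    # Categorical gap -> mode, numeric gap -> mean.
    out["College"] = out["College"].fillna(out["College"].mode().iloc[0])
    out["Salary"] = out["Salary"].fillna(out["Salary"].mean())
    return out


# Toy rows standing in for NBA.csv; the last row is entirely empty.
raw = pd.DataFrame({
    "Name": ["P1", "P2", "P3", None],
    "Height": ["6-2", "6-6", "6-0", None],
    "Weight": [180.0, 235.0, 200.0, None],
    "College": ["Texas", None, "Texas", None],
    "Salary": [1000.0, None, 3000.0, None],
})
cleaned = clean_nba(raw)
print(cleaned)
```

The sketch assumes Height is present in every non-empty row; a missing Height would need handling before the split.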

Sample CSV Output:

| Name | Team | Number | Position | Age | Height | Height_in | Height_cm | Weight | Weight_lb | Weight_kg | College | Salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Avery Bradley | Boston Celtics | 0 | PG | 25 | 6-2 | 74 | 188 | 180 | 180 | 81.6 | Texas | 7730337 |
| Jae Crowder | Boston Celtics | 99 | SF | 25 | 6-6 | 78 | 198 | 235 | 235 | 106.6 | Marquette | 6796117 |
| John Holland | Boston Celtics | 30 | SG | 27 | 6-5 | 77 | 195 | 205 | 205 | 93.0 | Boston University | 4842684 |


3. Sales Dataset

Location: Project_Sales/
Raw Data: Project_Sales/Datasets/Sales.csv
Cleaned Data: Project_Sales/CleanedDatasets/clean_Sales.csv
Script: Project_Sales/Scripts/Clean_Sales.py


Description:

  • Strips extra spaces from CategoryOfBook
  • Removes commas (,) and dollar signs ($) from SalesAmount and converts it to float
  • Saves cleaned dataset as CSV
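As an illustration of these two steps, here is a small pandas sketch. It assumes SalesAmount is read as text (for example "$1,299.85"); the function name `clean_sales` and the toy rows are hypothetical and not taken from Clean_Sales.py.

```python
import pandas as pd


def clean_sales(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Trim leading and trailing whitespace from the category labels.
    out["CategoryOfBook"] = out["CategoryOfBook"].str.strip()
    # Remove every '$' and ',' character, then parse the result as float.
    out["SalesAmount"] = (
        out["SalesAmount"]
        .str.replace(r"[$,]", "", regex=True)
        .astype(float)
    )
    return out


# Toy rows standing in for Sales.csv; the values are made up.
raw = pd.DataFrame({
    "CategoryOfBook": ["  business ", "psychology  "],
    "SalesAmount": ["$1,299.85", "104.65"],
})
cleaned = clean_sales(raw)
print(cleaned)
```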

Sample CSV Output:

| ID | BookID | BookName | CategoryOfBook | SalesAmount |
|---|---|---|---|---|
| 1 | BU1032 | The Busy Executive's Database Guide | business | 299.85 |
| 2 | BU1111 | Cooking with Computers: Surreptitious Balance Sheets | business | 298.75 |
| 3 | BU2075 | You Can Combat Computer Stress! | business | 104.65 |


4. SMS Spam Dataset

Location: Project_SMSSpam/
Raw Data: Project_SMSSpam/Datasets/SMSSpamCollection
Cleaned Data: Project_SMSSpam/CleanedDatasets/clean_SMSSPAMCollection.csv
Script: Project_SMSSpam/Scripts/SMSSpamCollection.py


Description:

  • Removes punctuation from message column
  • Calculates length of messages (LenOfMessage)
  • Saves cleaned dataset as CSV

Sample CSV Output:

| label | clean_message | LenOfMessage |
|---|---|---|
| ham | Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat | 111 |
| ham | Ok lar Joking wif u oni | 29 |
| spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive entry questionstd txt rateTCs apply 08452810075over18s | 155 |


5. Titanic Dataset

Location: Project_Titanic/
Raw Data: Project_Titanic/Datasets/Titanic.csv
Cleaned Data: Project_Titanic/CleanedDatasets/clean_Titanic.csv
Script: Project_Titanic/Scripts/Titanic.py


Description:

  • Drops Cabin column
  • Fills missing Age values with mean
  • Fills missing Embarked values with mode
  • Saves cleaned dataset as CSV
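The same steps, sketched with pandas under the assumption that the column names match the sample output. The function name `clean_titanic` and the toy rows are hypothetical and only illustrate the technique.

```python
import pandas as pd


def clean_titanic(df: pd.DataFrame) -> pd.DataFrame:
    # Cabin is dropped entirely rather than imputed.
    out = df.drop(columns=["Cabin"])
    # Numeric gap: replace missing ages with the mean age.
    out["Age"] = out["Age"].fillna(out["Age"].mean())
    # Categorical gap: replace missing ports with the most frequent one.
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode().iloc[0])
    return out


# Toy rows standing in for Titanic.csv; the values are made up.
raw = pd.DataFrame({
    "Age": [22.0, None, 38.0],
    "Cabin": [None, "C85", None],
    "Embarked": ["S", "S", None],
})
cleaned = clean_titanic(raw)
print(cleaned)
```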

Sample CSV Output:

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | S |


Installation

1. Clone the repository:

git clone git@github.com:SamiraSiavash/Data_Cleaning_Project.git
cd Data_Cleaning_Project

2. Create a virtual environment:

python -m venv .venv

3. Activate the virtual environment:

* Windows:

.venv\Scripts\activate

* macOS / Linux:

source .venv/bin/activate

4. Install dependencies:

pip install -r requirements.txt

Usage

Run each script from the repository root using Python:

python Project_HouseNew/Scripts/Clean_HouseNew.py
python Project_NBA/Scripts/Clean_NBA.py
python Project_Sales/Scripts/Clean_Sales.py
python Project_SMSSpam/Scripts/Clean_SMSSpam.py
python Project_Titanic/Scripts/Clean_Titanic.py

Each script saves its cleaned CSV in the CleanedDatasets/ folder of its own project.


Folder Structure

Data_Cleaning_Project/
├── Project_HouseNew/
│   ├── Datasets/
│   ├── CleanedDatasets/
│   └── Scripts/
├── Project_NBA/
│   ├── Datasets/
│   ├── CleanedDatasets/
│   └── Scripts/
├── Project_Sales/
│   ├── Datasets/
│   ├── CleanedDatasets/
│   └── Scripts/
├── Project_SMSSpam/
│   ├── Datasets/
│   ├── CleanedDatasets/
│   └── Scripts/
├── Project_Titanic/
│   ├── Datasets/
│   ├── CleanedDatasets/
│   └── Scripts/
├── .gitignore
├── README.md
└── requirements.txt

Notes

  • All scripts are independent and can be run separately.
  • The project demonstrates basic data cleaning and feature engineering suitable for portfolio purposes.
  • Use requirements.txt to ensure consistent dependencies.

License

MIT License (optional)


Author

Samira Siavash

🔗 GitHub: https://github.com/SamiraSiavash

🔗 LinkedIn: https://linkedin.com/in/samira-siavash
