Data Cleaning Project

A collection of five data cleaning projects demonstrating data preprocessing, feature engineering, and CSV output.
The repository is intended as a portfolio piece showcasing Python data cleaning skills.

Projects Overview

1. House Dataset

Location: Project_HouseNew/
Raw Data: Project_HouseNew/Datasets/HouseNew.csv
Cleaned Data: Project_HouseNew/CleanedDatasets/clean_HouseNew.csv
Script: Project_HouseNew/Scripts/Clean_HouseNew.py

Description:

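Fills missing Address values with the mode (most frequent value)
Adds a mean price feature (MeanPrice), the same dataset-wide mean on every row
Calculates each row's difference from the mean price (DifPrice = Price - MeanPrice)
Saves the cleaned dataset as CSV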
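
The steps above can be sketched roughly as follows. This is a minimal illustration, not the contents of Clean_HouseNew.py: it assumes pandas is among the dependencies, and the small DataFrame is made-up data standing in for HouseNew.csv.

```python
import pandas as pd

# Hypothetical toy data standing in for HouseNew.csv (values are illustrative).
df = pd.DataFrame({
    "Address": ["A", None, "A", "B"],
    "Price": [100.0, 200.0, 300.0, 400.0],
})

# Fill missing Address values with the most frequent address (the mode).
most_common_address = df["Address"].mode().iloc[0]
df["Address"] = df["Address"].fillna(most_common_address)

# MeanPrice: the mean of Price over the whole dataset, repeated on every row.
df["MeanPrice"] = df["Price"].mean()

# DifPrice: how far each row's price is from that mean.
df["DifPrice"] = df["Price"] - df["MeanPrice"]

print(df)
```

A negative DifPrice marks a listing priced below the dataset mean, as in the second sample row below.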

Sample CSV Output:

Elevator	Floor	Area	Parking	Room	Warehouse	YearOfConstruction	Address	Price	MeanPrice	DifPrice
True	1	311	True	4	True	1396	دروس	31100000000	6159466196	24940533804
True	13	99	True	2	True	1401	دریاچه شهدای خلیج فارس	4700000000	6159466196	-1459466196
True	15	251	True	3	True	1375	شهرک غرب	30120000000	6159466196	23960533804

2. NBA Dataset

Location: Project_NBA/
Raw Data: Project_NBA/Datasets/NBA.csv
Cleaned Data: Project_NBA/CleanedDatasets/clean_NBA.csv
Script: Project_NBA/Scripts/Clean_NBA.py

Description:

Removes empty rows
Converts Height to inches and centimeters
Converts Weight to pounds and kilograms
Fills missing College values with the mode
Fills missing Salary values with the mean
Saves the cleaned dataset as CSV
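
A rough sketch of these conversions is shown below. It is an illustration under assumptions rather than the code in Clean_NBA.py: pandas is assumed, the toy rows are invented, "empty rows" is taken to mean rows where every value is missing, and the rounding choices (whole centimeters, one decimal for kilograms) are guesses.

```python
import pandas as pd

# Hypothetical toy data; the last row is entirely empty.
df = pd.DataFrame({
    "Height": ["6-2", "6-6", None],
    "Weight": [180.0, 235.0, None],
    "College": ["Texas", None, None],
    "Salary": [100.0, None, None],
})

# Remove rows in which every column is missing (assumed meaning of "empty").
df = df.dropna(how="all").reset_index(drop=True)

# Height arrives as "feet-inches"; split it and convert to total inches.
parts = df["Height"].str.split("-", expand=True).astype(int)
df["Height_in"] = parts[0] * 12 + parts[1]

# 1 inch = 2.54 cm; rounding to whole centimeters is an assumption.
df["Height_cm"] = (df["Height_in"] * 2.54).round().astype(int)

# Weight is already in pounds; 1 lb = 0.45359237 kg.
df["Weight_lb"] = df["Weight"]
df["Weight_kg"] = (df["Weight"] * 0.45359237).round(1)

# Fill missing College with the mode and missing Salary with the mean.
df["College"] = df["College"].fillna(df["College"].mode().iloc[0])
df["Salary"] = df["Salary"].fillna(df["Salary"].mean())

print(df)
```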

Sample CSV Output:

Name	Team	Number	Position	Age	Height	Height_in	Height_cm	Weight	Weight_lb	Weight_kg	College	Salary
Avery Bradley	Boston Celtics	0	PG	25	6-2	74	188	180	180	81.6	Texas	7730337
Jae Crowder	Boston Celtics	99	SF	25	6-6	78	198	235	235	106.6	Marquette	6796117
John Holland	Boston Celtics	30	SG	27	6-5	77	195	205	205	93.0	Boston University	4842684

3. Sales Dataset

Location: Project_Sales/
Raw Data: Project_Sales/Datasets/Sales.csv
Cleaned Data: Project_Sales/CleanedDatasets/clean_Sales.csv
Script: Project_Sales/Scripts/Clean_Sales.py

Description:

Strips extra spaces from CategoryOfBook
Removes commas (,) and dollar signs ($) from SalesAmount and converts it to float
Saves cleaned dataset as CSV
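
One possible way to express these two clean-ups is sketched below. pandas is assumed, and the raw values are hypothetical examples of the kind of formatting being removed, not rows from Sales.csv.

```python
import pandas as pd

# Hypothetical raw values with stray spaces, dollar signs, and commas.
df = pd.DataFrame({
    "CategoryOfBook": ["  business ", "business", " psychology"],
    "SalesAmount": ["$1,299.85", "298.75", "$104.65"],
})

# Strip leading and trailing whitespace from the category labels.
df["CategoryOfBook"] = df["CategoryOfBook"].str.strip()

# Drop "$" and "," characters, then convert the remaining text to float.
df["SalesAmount"] = (
    df["SalesAmount"].str.replace(r"[$,]", "", regex=True).astype(float)
)

print(df)
```

Stripping the category first matters for later grouping: "  business " and "business" would otherwise count as different categories.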

Sample CSV Output:

ID	BookID	BookName	CategoryOfBook	SalesAmount
1	BU1032	The Busy Executive's Database Guide	business	299.85
2	BU1111	Cooking with Computers: Surreptitious Balance Sheets	business	298.75
3	BU2075	You Can Combat Computer Stress!	business	104.65

4. SMS Spam Dataset

Location: Project_SMSSpam/
Raw Data: Project_SMSSpam/Datasets/SMSSpamCollection
Cleaned Data: Project_SMSSpam/CleanedDatasets/clean_SMSSPAMCollection.csv
Script: Project_SMSSpam/Scripts/SMSSpamCollection.py

Description:

Removes punctuation from the message column (saved as clean_message)
Calculates the length of each message (LenOfMessage)
Saves cleaned dataset as CSV
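
The sketch below shows one way these steps might look. It assumes pandas and a raw column named message; the second toy message is invented. The sample output suggests LenOfMessage counts the characters of the original message, punctuation included, so the sketch measures the length before cleaning. That ordering is an inference from the sample, not something confirmed by the script.

```python
import string

import pandas as pd

# Hypothetical toy data in the label/message layout.
df = pd.DataFrame({
    "label": ["ham", "spam"],
    "message": ["Ok lar... Joking wif u oni...", "Free entry, txt NOW!"],
})

# Length of the original message (assumed to be measured before cleaning).
df["LenOfMessage"] = df["message"].str.len()

# Translation table that deletes every ASCII punctuation character.
remove_punctuation = str.maketrans("", "", string.punctuation)
df["clean_message"] = df["message"].str.translate(remove_punctuation)

# Keep only the columns that appear in the cleaned CSV.
df = df[["label", "clean_message", "LenOfMessage"]]

print(df)
```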

Sample CSV Output:

label	clean_message	LenOfMessage
ham	Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat	111
ham	Ok lar Joking wif u oni	29
spam	Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive entry questionstd txt rateTCs apply 08452810075over18s	155

5. Titanic Dataset

Location: Project_Titanic/
Raw Data: Project_Titanic/Datasets/Titanic.csv
Cleaned Data: Project_Titanic/CleanedDatasets/clean_Titanic.csv
Script: Project_Titanic/Scripts/Titanic.py

Description:

Drops Cabin column
Fills missing Age values with mean
Fills missing Embarked values with mode
Saves cleaned dataset as CSV
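
As a minimal sketch of the steps listed above (assuming pandas, and using made-up rows rather than Titanic.csv):

```python
import pandas as pd

# Hypothetical toy data with gaps in Age, Embarked, and Cabin.
df = pd.DataFrame({
    "Age": [22.0, None, 38.0],
    "Embarked": ["S", None, "S"],
    "Cabin": [None, "C85", None],
})

# Drop the Cabin column entirely.
df = df.drop(columns=["Cabin"])

# Replace missing ages with the mean of the known ages.
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Replace missing ports with the most frequent port (the mode).
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode().iloc[0])

print(df)
```

The mean suits a numeric column such as Age, while the mode is the natural fill for a categorical column such as Embarked, where an average is undefined.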

Sample CSV Output:

PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Embarked
1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.25	S
2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Thayer)	female	38.0	1	PC 17599	71.2833	C
3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.925	S

Installation

1. Clone the repository:

git clone git@github.com:SamiraSiavash/Data_Cleaning_Project.git
cd Data_Cleaning_Project

2. Create a virtual environment:

python -m venv .venv

3. Activate the virtual environment:

* Windows:

.venv\Scripts\activate

* macOS / Linux:

source .venv/bin/activate

4. Install dependencies:

pip install -r requirements.txt

Usage

Run each project's script with Python from the repository root:

python Project_HouseNew/Scripts/Clean_HouseNew.py
python Project_NBA/Scripts/Clean_NBA.py
python Project_Sales/Scripts/Clean_Sales.py
python Project_SMSSpam/Scripts/Clean_SMSSpam.py
python Project_Titanic/Scripts/Clean_Titanic.py

Each script saves its cleaned CSV in the corresponding project's CleanedDatasets/ folder.

Folder Structure

Data_Cleaning_Project/
├── Project_HouseNew/
│   ├── Datasets/
│   ├── CleanedDatasets/
│   └── Scripts/
├── Project_NBA/
│   ├── Datasets/
│   ├── CleanedDatasets/
│   └── Scripts/
├── Project_Sales/
│   ├── Datasets/
│   ├── CleanedDatasets/
│   └── Scripts/
├── Project_SMSSpam/
│   ├── Datasets/
│   ├── CleanedDatasets/
│   └── Scripts/
├── Project_Titanic/
│   ├── Datasets/
│   ├── CleanedDatasets/
│   └── Scripts/
├── .gitignore
├── README.md
└── requirements.txt

Notes

All scripts are independent and can be run separately.
The projects demonstrate basic data cleaning and feature engineering.
Install from requirements.txt to keep dependencies consistent.

License

MIT License (optional)

Author

Samira Siavash

🔗 GitHub: https://github.com/SamiraSiavash

🔗 LinkedIn: https://linkedin.com/in/samira-siavash
