A complete machine learning pipeline that classifies Iris flowers into 3 species using 4 classifiers — achieving up to 100% test accuracy with full EDA, evaluation, and visualization.
📓 View Notebook • 📊 Results • 🚀 How to Run • 📁 Project Structure
This project is Task 1 of the CodeAlpha Data Science Internship. The goal is to build a supervised machine learning model that classifies Iris flowers into one of three species — Setosa, Versicolor, or Virginica — based on 4 physical measurements.
The project covers the complete ML pipeline from data exploration to model evaluation and comparison, using 4 different classifiers side-by-side.
- ✅ Perform thorough Exploratory Data Analysis (EDA) with visualizations
- ✅ Preprocess data with StandardScaler for optimal model performance
- ✅ Train and compare 4 machine learning classifiers
- ✅ Evaluate models using accuracy, cross-validation, and confusion matrices
- ✅ Identify the best model and extract feature importance insights
| Property | Detail |
|---|---|
| Source | UCI Machine Learning Repository (via sklearn.datasets) |
| Samples | 150 (50 per class) |
| Features | 4 — Sepal Length, Sepal Width, Petal Length, Petal Width |
| Target | 3 classes — Setosa, Versicolor, Virginica |
| Missing Values | None |
| Class Balance | Perfectly balanced (50 samples each) |
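The dataset facts above can be reproduced directly from scikit-learn's bundled copy of Iris. A minimal loading sketch (the `df` / `species` names are illustrative, not necessarily the notebook's):

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the Iris dataset bundled with scikit-learn (originally from UCI)
iris = load_iris(as_frame=True)
df = iris.frame  # 4 feature columns + numeric 'target' column

# Map the numeric target to species names for readability
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))

print(df.shape)                      # (150, 6)
print(df["species"].value_counts())  # 50 per class
print(df.isna().sum().sum())         # 0 missing values
```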
📋 Feature Description
| Feature | Unit | Description |
|---|---|---|
| `sepal_length` | cm | Length of the flower's sepal |
| `sepal_width` | cm | Width of the flower's sepal |
| `petal_length` | cm | Length of the flower's petal |
| `petal_width` | cm | Width of the flower's petal |
| `species` | — | Target: Setosa / Versicolor / Virginica |
| Tool | Version | Purpose |
|---|---|---|
| Python | 3.10+ | Core language |
| Pandas | 2.0+ | Data manipulation |
| NumPy | 1.24+ | Numerical operations |
| Scikit-learn | 1.3+ | ML models & evaluation |
| Matplotlib | 3.7+ | Visualizations |
| Seaborn | 0.12+ | Statistical plots |
| Jupyter Notebook | — | Development environment |
| # | Model | Key Hyperparameters |
|---|---|---|
| 1 | Logistic Regression | max_iter=300 |
| 2 | Decision Tree | max_depth=4 |
| 3 | Random Forest | n_estimators=100, max_depth=5 |
| 4 | K-Nearest Neighbors (KNN) | n_neighbors=5, metric='euclidean' |
All models evaluated with:
- 80/20 Train-Test Split (stratified)
- 5-Fold Stratified Cross-Validation
- StandardScaler normalization applied
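This setup can be sketched end to end with scikit-learn, wrapping `StandardScaler` and each classifier in a pipeline so scaling statistics never leak from the test fold; `random_state=42` is an assumed seed here, not necessarily the notebook's:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Stratified 80/20 split keeps the 50/50/50 class balance in both halves
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=300),
    "Decision Tree": DecisionTreeClassifier(max_depth=4),
    "Random Forest": RandomForestClassifier(n_estimators=100, max_depth=5),
    "KNN": KNeighborsClassifier(n_neighbors=5, metric="euclidean"),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = {}
for name, model in models.items():
    # Scaler inside the pipeline: fitted on train folds only, no leakage
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(X_train, y_train)
    scores = cross_val_score(pipe, X_train, y_train, cv=cv)
    results[name] = (pipe.score(X_test, y_test), scores.mean(), scores.std())
    print(f"{name:22s} test={results[name][0]:.3f} "
          f"cv={scores.mean():.3f} (+/- {scores.std():.3f})")
```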
| Rank | Model | Test Accuracy | CV Mean | CV Std |
|---|---|---|---|---|
| 🥇 | Random Forest | ~97–100% | ~96% | ±2% |
| 🥈 | KNN | ~96–100% | ~95% | ±3% |
| 🥉 | Logistic Regression | ~96–97% | ~95% | ±3% |
| 4 | Decision Tree | ~93–97% | ~93% | ±4% |
💡 Exact values depend on the random seed. Run the notebook to see your results.
The project generates 8 professional plots, all saved as .png files:
| Plot | File | Description |
|---|---|---|
| 🔷 Feature Distributions | feature_distributions.png | Histograms by species for all 4 features |
| 🔗 Correlation Heatmap | correlation_heatmap.png | Feature correlation matrix |
| 🌐 Pairplot | pairplot.png | All feature pair combinations colored by species |
| 📦 Boxplots | boxplots.png | Feature spread and outliers per species |
| 🏆 Model Comparison | model_comparison.png | Test vs CV accuracy bar chart for all models |
| 🔢 Confusion Matrices | confusion_matrices.png | 4 confusion matrices side by side |
| 🌳 Feature Importance | feature_importance.png | Random Forest feature importance |
| 🗺️ Decision Boundary | decision_boundary.png | RF decision boundary (petal features) |
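As one example of how these plots are produced, here is a sketch for the correlation heatmap; figure size, colormap, and dpi are illustrative choices, not necessarily the notebook's exact settings:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; works without a display
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from sklearn.datasets import load_iris

# Correlate the four measurement columns only (drop the numeric target)
features = load_iris(as_frame=True).frame.drop(columns="target")

plt.figure(figsize=(6, 5))
sns.heatmap(features.corr(), annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Iris Feature Correlation")
plt.tight_layout()
plt.savefig("correlation_heatmap.png", dpi=150)
plt.close()

saved = Path("correlation_heatmap.png").exists()
print("saved:", saved)
```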
- Petal features dominate — `petal_length` and `petal_width` together explain ~95%+ of the class separability
- Setosa is trivially separable — linearly separable from the other species in petal space
- Versicolor–Virginica boundary is the challenge — slight overlap in feature space
- Random Forest handles noise best — ensemble averaging reduces boundary errors
- No overfitting detected — consistent train/CV scores across all models
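The petal-dominance insight can be checked by reading `feature_importances_` off a fitted Random Forest. A short sketch (`random_state=42` is an assumption for reproducibility):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf.fit(iris.data, iris.target)

# Impurity-based importances, normalized to sum to 1
importances = dict(zip(iris.feature_names, rf.feature_importances_))
for name, score in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name:20s} {score:.3f}")

# Combined share attributed to the two petal features
petal_share = importances["petal length (cm)"] + importances["petal width (cm)"]
print(f"petal share: {petal_share:.2f}")
```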
CodeAlpha_IrisFlowerClassification/
│
├── 📓 iris_classification.ipynb ← Main Jupyter Notebook (full pipeline)
├── 📄 README.md ← This file
├── 📋 requirements.txt ← Python dependencies
│
└── 📊 Generated Plots/
├── feature_distributions.png
├── correlation_heatmap.png
├── pairplot.png
├── boxplots.png
├── model_comparison.png
├── confusion_matrices.png
├── feature_importance.png
└── decision_boundary.png
# 1. Clone the repository
git clone https://github.com/MOHAMMED-ABUZAR317/CodeAlpha_IrisFlowerClassification.git
cd CodeAlpha_IrisFlowerClassification
# 2. Install dependencies
pip install -r requirements.txt
# 3. Launch Jupyter Notebook
jupyter notebook iris_classification.ipynb

Click the badge below to open directly in Google Colab:
pandas>=2.0.0
numpy>=1.24.0
scikit-learn>=1.3.0
matplotlib>=3.7.0
seaborn>=0.12.0
jupyter>=1.0.0

Install all at once:

pip install -r requirements.txt

- How to perform structured EDA with multiple visualization techniques
- Understanding class separability through pairplots and correlation analysis
- Importance of feature scaling (StandardScaler) for KNN and Logistic Regression
- Comparing models objectively using cross-validation vs test accuracy
- How ensemble methods (Random Forest) outperform single models on real datasets
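The scaling point above can be illustrated by comparing KNN cross-validation scores with and without `StandardScaler`. On Iris all four features share the same unit (cm), so the gap is small here, but the habit matters on mixed-unit data; the 5-fold setup and seed below are assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# KNN on raw centimetre features vs. standardized features
raw = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=cv).mean()
scaled = cross_val_score(
    make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)), X, y, cv=cv
).mean()

print(f"KNN without scaling: {raw:.3f}")
print(f"KNN with scaling:    {scaled:.3f}")
```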
| Platform | Link |
|---|---|
| 💼 LinkedIn | [Profile](https://www.linkedin.com/in/mohammed-abuzar-9061a1375/) |
| 🐙 GitHub | [MOHAMMED-ABUZAR317](https://github.com/MOHAMMED-ABUZAR317) |
| 🏢 Internship | CodeAlpha |
🌸 Made with ❤️ during the CodeAlpha Data Science Internship
If you found this project helpful, please give it a ⭐ on GitHub!