diff --git a/ABOUT.md b/ABOUT.md
index a739b9d..68ee428 100644
--- a/ABOUT.md
+++ b/ABOUT.md
@@ -155,7 +155,7 @@ Both exist simultaneously, creating a living curriculum.
### Near-term (2026)
- ✅ Launch master repository (this!)
- ✅ Complete Foundation Track (5 chapters – all available!)
-- 🚧 Release Practitioner Track (2 of 10 chapters available)
+- 🚧 Release Practitioner Track (3 of 10 chapters available)
- 🚧 Establish community request process
- 🚧 Build 100+ community-contributed chapters
@@ -178,11 +178,11 @@ Both exist simultaneously, creating a living curriculum.
## 📊 By The Numbers
**Current State:**
-- 7 chapters available (Foundation complete + Practitioner started)
+- 8 chapters available (Foundation complete + Practitioner started)
-- 21 Jupyter notebooks with interactive content
+- 24 Jupyter notebooks with interactive content
-- 21 professional SVG diagrams
+- 24 professional SVG diagrams
-- 37 exercises with solutions
+- 42 exercises with solutions
-- 56 hours of learning content available
+- 64 hours of learning content available
-- 5 practice datasets
+- 7 practice datasets
- 25+ total chapters planned
- $0 barrier to entry
diff --git a/GITHUB_PROFILE_README.md b/GITHUB_PROFILE_README.md
index b10f849..66f93ad 100644
--- a/GITHUB_PROFILE_README.md
+++ b/GITHUB_PROFILE_README.md
@@ -10,7 +10,7 @@
**[Berta AI](https://berta.one)** – AI-powered tools for tomorrow's world
-- **[Berta Chapters](https://github.com/luigipascal/berta-chapters)** β Free, open-source AI curriculum. 7 chapters live, 25 planned. Learn Python to production ML through interactive notebooks, exercises, and an online playground. No paywall, no signup.
+- **[Berta Chapters](https://github.com/luigipascal/berta-chapters)** β Free, open-source AI curriculum. 8 chapters live, 25 planned. Learn Python to production ML through interactive notebooks, exercises, and an online playground. No paywall, no signup.
- **[LLM Cost Optimizer](https://llm.berta.one)** β Cut LLM API costs 80-95% while keeping data private. Local processing, text anonymization, automatic model routing.
- **OrbaOS** β A framework for post-project work. AI handles coordination so teams focus on strategy and creative output.
diff --git a/README.md b/README.md
index a6b150f..59d88a5 100644
--- a/README.md
+++ b/README.md
@@ -53,7 +53,7 @@ Apply what you've learned to real-world machine learning and AI problems.
|---------|-------|------|--------|
| 6 | [Introduction to Machine Learning](./chapters/chapter-06-intro-machine-learning/) | 8h | ✅ Available |
| 7 | [Supervised Learning: Regression & Classification](./chapters/chapter-07-supervised-learning/) | 10h | ✅ Available |
-| 8 | Unsupervised Learning: Clustering & Dimensionality Reduction | 8h | 🚧 Coming Soon |
+| 8 | [Unsupervised Learning: Clustering & Dimensionality Reduction](./chapters/chapter-08-unsupervised-learning/) | 8h | ✅ Available |
| 9 | Deep Learning Fundamentals | 12h | 🚧 Coming Soon |
| 10 | Natural Language Processing Basics | 10h | 🚧 Coming Soon |
| 11 | Large Language Models & Transformers | 10h | 🚧 Coming Soon |
@@ -268,7 +268,7 @@ pie title Curriculum Breakdown
"Community Requested" : 999
```
-- **Chapters Available Now**: 7 (56 hours of content)
+- **Chapters Available Now**: 8 (64 hours of content)
- **Total Planned Chapters**: 25+
-- **Jupyter Notebooks**: 21 interactive notebooks
+- **Jupyter Notebooks**: 24 interactive notebooks
-- **SVG Diagrams**: 21 professional diagrams
+- **SVG Diagrams**: 24 professional diagrams
diff --git a/ROADMAP.md b/ROADMAP.md
index 8f05088..a52b19c 100644
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -8,11 +8,11 @@ Our vision for the future of AI education. This is a living document–prioritie
**Master Repository**: ✅ Live
**Foundation Track**: ✅ Complete (5 chapters available)
-**Practitioner Track**: 🚧 In progress (2 of 10 chapters available)
+**Practitioner Track**: 🚧 In progress (3 of 10 chapters available)
**Advanced Track**: 🚧 Planned (10 chapters)
**Community Requests**: 🚧 Starting (unlimited)
**Total Planned**: 25+ chapters, 500+ hours of content
-**Currently Available**: 7 chapters, 56 hours of content, 21 SVG diagrams
+**Currently Available**: 8 chapters, 64 hours of content, 24 SVG diagrams
---
@@ -21,7 +21,7 @@ Our vision for the future of AI education. This is a living document–prioritie
### Objectives
- ✅ Establish master repository (DONE)
- ✅ Complete Foundation Track (DONE)
-- ✅ Begin Practitioner Track (Ch 6-7 available)
+- ✅ Begin Practitioner Track (Ch 6-8 available)
- 🚧 Establish community request process
- 🚧 Build first 100 community chapters
- ✅ Create core infrastructure and documentation (DONE)
@@ -37,11 +37,11 @@ Our vision for the future of AI education. This is a living document–prioritie
- One new chapter released per week
- New chapters unlock after reaching **10 newsletter subscribers**
- ✅ Foundation Track complete (Chapters 1-5)
-- ✅ Practitioner Track started (Chapters 6-7)
+- ✅ Practitioner Track started (Chapters 6-8)
### Metrics to Track
- Newsletter subscribers (target: 10 to unlock weekly releases)
-- Chapters completed: 7 / 25
+- Chapters completed: 8 / 25
- Community requests received
- Stars on master repo
@@ -59,7 +59,7 @@ Our vision for the future of AI education. This is a living document–prioritie
### Practitioner Track Chapters
- [x] Chapter 6: Introduction to Machine Learning
- [x] Chapter 7: Supervised Learning (Regression & Classification)
-- [ ] Chapter 8: Unsupervised Learning
+- [x] Chapter 8: Unsupervised Learning
- [ ] Chapter 9: Deep Learning Fundamentals
- [ ] Chapter 10: Natural Language Processing Basics
- [ ] Chapter 11: Large Language Models & Transformers
diff --git a/SYLLABUS.md b/SYLLABUS.md
index 04493d3..0601119 100644
--- a/SYLLABUS.md
+++ b/SYLLABUS.md
@@ -16,7 +16,7 @@ graph TD
CH6["Ch 6: Intro to ML<br/>8h | Available"]
CH7["Ch 7: Supervised Learning<br/>10h | Available"]
- CH8["Ch 8: Unsupervised Learning<br/>8h | Coming Soon"]
+ CH8["Ch 8: Unsupervised Learning<br/>8h | Available"]
CH9["Ch 9: Deep Learning<br/>12h | Coming Soon"]
CH10["Ch 10: NLP Basics<br/>10h | Coming Soon"]
CH11["Ch 11: LLMs & Transformers<br/>10h | Coming Soon"]
@@ -56,7 +56,7 @@ graph TD
style CH5 fill:#4caf50,color:#fff
style CH6 fill:#4caf50,color:#fff
style CH7 fill:#4caf50,color:#fff
- style CH8 fill:#f3e5f5
+ style CH8 fill:#4caf50,color:#fff
style CH9 fill:#f3e5f5
style CH10 fill:#f3e5f5
style CH11 fill:#f3e5f5
@@ -66,7 +66,7 @@ graph TD
style CH15 fill:#f3e5f5
```
-**Legend**: Green = Available | Purple = Practitioner (Coming Soon) | Chapters 1-7 fully available with SVG diagrams
+**Legend**: Green = Available | Purple = Coming Soon | Chapters 1-8 fully available with SVG diagrams
---
@@ -81,7 +81,7 @@ graph TD
| 5 | [Software Design & Best Practices](./chapters/chapter-05-software-design/) | Foundation | 6h | Available | 3 notebooks, scripts, 5 exercises, 3 SVGs |
| 6 | [Introduction to Machine Learning](./chapters/chapter-06-intro-machine-learning/) | Practitioner | 8h | Available | 3 notebooks, scripts, 5 exercises, 3 SVGs |
| 7 | [Supervised Learning](./chapters/chapter-07-supervised-learning/) | Practitioner | 10h | Available | 3 notebooks, scripts, 5 exercises, 3 SVGs |
-| 8 | Unsupervised Learning | Practitioner | 8h | Planned | - |
+| 8 | [Unsupervised Learning](./chapters/chapter-08-unsupervised-learning/) | Practitioner | 8h | Available | 3 notebooks, scripts, 5 exercises, 3 SVGs |
| 9 | Deep Learning Fundamentals | Practitioner | 12h | Planned | - |
| 10 | Natural Language Processing | Practitioner | 10h | Planned | - |
| 11 | LLMs & Transformers | Practitioner | 10h | Planned | - |
diff --git a/chapters/chapter-08-unsupervised-learning/README.md b/chapters/chapter-08-unsupervised-learning/README.md
new file mode 100644
index 0000000..0326ae6
--- /dev/null
+++ b/chapters/chapter-08-unsupervised-learning/README.md
@@ -0,0 +1,61 @@
+# Chapter 8: Unsupervised Learning
+
+**Track**: Practitioner | **Time**: 8 hours | **Prerequisites**: Chapters 1-6
+
+---
+
+## Learning Objectives
+
+By the end of this chapter, you will be able to:
+
+- Understand the difference between supervised and unsupervised learning
+- Implement K-Means clustering from scratch using NumPy
+- Apply hierarchical (agglomerative) clustering and interpret dendrograms
+- Use DBSCAN for density-based clustering with automatic cluster count detection
+- Evaluate clusters with the silhouette score, inertia, and the elbow method
+- Apply Principal Component Analysis (PCA) for dimensionality reduction
+- Implement t-SNE for 2D visualization of high-dimensional data
+- Perform anomaly detection with Isolation Forest and statistical methods
+- Build a complete customer segmentation pipeline end-to-end
+
+---
+
+## Chapter Structure
+
+```
+chapter-08-unsupervised-learning/
+├── README.md
+├── requirements.txt
+├── notebooks/
+│   ├── 01_introduction.ipynb        # K-Means, evaluation metrics, elbow method
+│   ├── 02_intermediate.ipynb        # Hierarchical, DBSCAN, Gaussian Mixture Models
+│   └── 03_advanced.ipynb            # PCA, t-SNE, anomaly detection, customer segmentation capstone
+├── scripts/
+│   ├── unsupervised_toolkit.py      # KMeansScratch, PCA, plotting utilities
+│   └── utilities.py                 # Helper functions
+├── exercises/
+│   ├── exercises.py                 # 5 exercises
+│   └── solutions/
+│       └── solutions.py             # Complete solutions
+├── assets/diagrams/
+│   ├── clustering_algorithms.svg    # K-Means, Hierarchical, DBSCAN comparison
+│   ├── dimensionality_reduction.svg # PCA and t-SNE visual
+│   └── anomaly_detection.svg        # Normal vs anomalous points
+└── datasets/
+    ├── customers.csv                # Synthetic customer data (300+ rows)
+    └── sensors.csv                  # Synthetic sensor data with anomalies (200+ rows)
+```
+
+## Time Estimate
+
+| Section | Time |
+|---------|------|
+| Notebook 01: Introduction (Clustering Basics) | 2.5 hours |
+| Notebook 02: Intermediate (Advanced Clustering) | 2.5 hours |
+| Notebook 03: Advanced (Dimensionality Reduction & Capstone) | 3 hours |
+| Exercises | Included in notebooks |
+| **Total** | **8 hours** |
+
+---
+
+**Generated by Berta AI | Created by Luigi Pascal Rondanini**
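
The "from scratch" K-Means objective above comes down to alternating two NumPy steps: assign every point to its nearest centroid, then move each centroid to the mean of its assigned points. A minimal sketch on synthetic blobs, assuming nothing beyond NumPy (variable names here are illustrative, not taken from the chapter code):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
X[20:40] += 6.0                     # shift a second blob away from the first
X[40:] += np.array([6.0, -6.0])     # shift a third blob

k = 3
centroids = X[rng.choice(len(X), size=k, replace=False)]  # init from data points

for _ in range(100):
    # assignment step: index of the nearest centroid for every point
    dists = np.linalg.norm(X[:, None] - centroids, axis=2)  # shape (n, k)
    labels = dists.argmin(axis=1)
    # update step: each centroid becomes the mean of its assigned points
    # (keep the old centroid if a cluster ends up empty)
    new_centroids = np.array([
        X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
        for j in range(k)
    ])
    if np.allclose(new_centroids, centroids):  # converged: centroids stopped moving
        break
    centroids = new_centroids

inertia = float(((X - centroids[labels]) ** 2).sum())
print(f"inertia = {inertia:.2f}")
```

On well-separated blobs like these the loop typically converges in a handful of iterations; the chapter's exercises wrap the same two steps in a class with `fit`/`predict` methods.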
diff --git a/chapters/chapter-08-unsupervised-learning/assets/diagrams/anomaly_detection.svg b/chapters/chapter-08-unsupervised-learning/assets/diagrams/anomaly_detection.svg
new file mode 100644
index 0000000..92452f7
--- /dev/null
+++ b/chapters/chapter-08-unsupervised-learning/assets/diagrams/anomaly_detection.svg
@@ -0,0 +1,90 @@
+
diff --git a/chapters/chapter-08-unsupervised-learning/assets/diagrams/clustering_algorithms.svg b/chapters/chapter-08-unsupervised-learning/assets/diagrams/clustering_algorithms.svg
new file mode 100644
index 0000000..f17f560
--- /dev/null
+++ b/chapters/chapter-08-unsupervised-learning/assets/diagrams/clustering_algorithms.svg
@@ -0,0 +1,92 @@
+
diff --git a/chapters/chapter-08-unsupervised-learning/assets/diagrams/dimensionality_reduction.svg b/chapters/chapter-08-unsupervised-learning/assets/diagrams/dimensionality_reduction.svg
new file mode 100644
index 0000000..7f4b92b
--- /dev/null
+++ b/chapters/chapter-08-unsupervised-learning/assets/diagrams/dimensionality_reduction.svg
@@ -0,0 +1,81 @@
+
diff --git a/chapters/chapter-08-unsupervised-learning/datasets/customers.csv b/chapters/chapter-08-unsupervised-learning/datasets/customers.csv
new file mode 100644
index 0000000..889bba4
--- /dev/null
+++ b/chapters/chapter-08-unsupervised-learning/datasets/customers.csv
@@ -0,0 +1,301 @@
+age,income,spending_score,visits,online_ratio
+34,54766,39,9,0.54
+49,111074,67,9,0.25
+49,95903,27,1,0.21
+51,121394,41,4,0.38
+18,28010,72,16,1.0
+25,42858,84,13,0.73
+21,33927,54,8,0.96
+52,103491,78,10,0.93
+49,69579,40,0,0.39
+55,73860,68,3,0.23
+27,31448,57,15,0.78
+46,106849,69,9,0.37
+43,66166,93,12,0.67
+23,24789,58,15,0.91
+22,25771,64,16,0.59
+21,34690,48,10,0.8
+37,57274,47,4,0.2
+30,67012,30,6,0.51
+35,22561,64,5,0.59
+29,43617,92,13,0.84
+44,70128,46,5,0.28
+30,50813,85,7,0.77
+18,34524,60,10,0.68
+32,71483,70,5,0.63
+31,53350,61,7,0.26
+53,93115,32,0,0.29
+27,40318,72,9,0.79
+32,70778,36,6,0.46
+51,92763,37,5,0.29
+24,20205,56,8,0.9
+62,85856,30,4,0.3
+19,23195,55,11,0.67
+23,25355,76,10,0.87
+69,158343,80,8,0.62
+58,106604,82,4,0.73
+50,77316,90,7,0.78
+21,56478,28,8,0.43
+25,33556,79,15,0.83
+48,94480,37,3,0.6
+46,75340,49,5,0.33
+30,28927,73,7,0.86
+33,57335,73,3,0.49
+20,23720,54,9,0.82
+42,156212,82,10,0.5
+25,29032,75,15,0.8
+51,98058,41,4,0.46
+24,49848,56,11,0.59
+43,46690,52,4,0.42
+37,38124,53,6,0.41
+28,27017,65,12,0.81
+54,128594,99,2,0.9
+58,45557,48,4,0.31
+35,120819,80,16,0.84
+54,101037,67,7,0.51
+54,72377,36,3,0.45
+42,96894,70,6,0.61
+26,17794,81,12,0.81
+21,40100,79,15,0.73
+23,25585,54,14,0.62
+31,56380,57,6,0.52
+38,53078,50,9,0.52
+29,31476,67,4,0.97
+29,117914,92,7,0.73
+31,45976,50,4,0.62
+29,46492,86,17,0.89
+28,45447,35,1,0.61
+29,38192,79,13,0.72
+58,121733,77,10,0.42
+20,36226,75,9,0.9
+40,124420,72,7,0.51
+50,66955,42,0,0.31
+28,29973,85,8,0.81
+24,28638,73,18,0.84
+45,153511,72,3,0.73
+39,66490,45,4,0.42
+55,89575,43,3,0.34
+55,75070,57,4,0.21
+23,41820,93,9,0.81
+39,49203,48,5,0.42
+30,28450,92,12,1.0
+37,91936,78,9,0.63
+62,64967,29,1,0.35
+52,51777,42,6,0.39
+25,29026,38,11,0.85
+35,29178,31,11,0.44
+40,46370,16,4,0.2
+46,41564,51,3,0.37
+51,77134,63,1,0.07
+33,146182,66,7,0.64
+38,55922,44,8,0.32
+20,28188,54,10,0.77
+44,76056,42,7,0.23
+59,80336,51,5,0.25
+37,88206,44,5,0.15
+45,62287,39,0,0.04
+34,71931,55,7,0.5
+35,53816,43,6,0.38
+42,55226,50,6,0.23
+24,31127,88,14,0.84
+20,15852,95,18,0.88
+25,32585,78,15,0.84
+32,88877,45,1,0.28
+27,47650,50,1,0.6
+23,59338,55,2,0.41
+61,116713,76,6,0.73
+23,25738,75,6,0.87
+25,54601,38,10,0.55
+45,57987,57,4,0.31
+20,29779,76,13,0.74
+26,44178,76,16,0.85
+26,43290,70,9,0.9
+22,26343,78,2,0.76
+38,107425,71,11,0.62
+32,126587,68,9,0.54
+37,55986,37,2,0.25
+37,114139,82,6,0.83
+47,100405,87,6,0.4
+23,39815,66,6,0.83
+48,97095,84,0,0.81
+42,76303,41,6,0.52
+25,118320,84,9,0.55
+48,102106,71,8,0.73
+53,67006,70,6,0.65
+19,32938,63,9,0.8
+24,37308,72,11,0.81
+34,51999,52,10,0.46
+37,132492,81,8,1.0
+44,131800,100,7,0.69
+27,29829,79,9,0.83
+31,64729,46,5,0.6
+34,64521,48,8,0.66
+27,37072,60,15,0.8
+28,35894,73,9,0.88
+35,35050,53,7,0.5
+43,100917,87,8,0.47
+45,68520,50,0,0.54
+18,41908,52,2,0.45
+38,86995,41,3,0.11
+43,41731,34,5,0.73
+51,179096,88,12,0.37
+34,100350,68,4,0.63
+29,19147,46,12,0.7
+25,70723,41,0,0.47
+29,71748,71,5,0.75
+29,17078,66,14,0.67
+25,39317,66,17,0.61
+37,110354,45,1,0.3
+43,45329,47,7,0.66
+20,31604,83,17,0.82
+22,39189,60,18,0.76
+52,169812,81,5,0.78
+22,30493,54,7,0.84
+21,33430,62,12,0.9
+39,58361,53,6,0.61
+26,110321,100,5,0.55
+34,27063,50,9,0.82
+48,73561,48,2,0.09
+50,88828,41,0,0.31
+20,21422,78,5,0.74
+33,86932,31,3,0.0
+27,49763,48,1,0.44
+44,59552,39,0,0.32
+41,82845,59,2,0.27
+32,50559,65,1,0.31
+18,34953,78,3,0.9
+42,113005,83,4,0.61
+41,61731,48,3,0.17
+31,24175,74,13,0.89
+18,28536,72,8,0.77
+57,77136,45,4,0.27
+50,90796,41,6,0.25
+25,23606,54,13,0.88
+24,53712,41,2,0.43
+28,22373,43,14,0.88
+57,102261,46,2,0.02
+24,42997,57,9,0.75
+58,75304,37,4,0.62
+46,56715,27,4,0.48
+44,132046,65,6,0.76
+19,17494,60,6,0.69
+31,62656,46,7,0.6
+31,29377,65,12,0.84
+45,116863,90,4,0.57
+31,128196,78,8,0.45
+32,32390,77,10,0.83
+46,79008,42,0,0.32
+38,114255,78,12,0.68
+53,76294,34,2,0.27
+37,31248,44,5,0.39
+33,18493,82,6,0.8
+23,37353,69,4,0.78
+49,112161,62,12,0.63
+45,143045,73,9,0.83
+53,99819,50,5,0.1
+35,24636,27,4,0.67
+37,51567,65,8,0.5
+54,52001,44,5,0.32
+50,91727,56,7,0.38
+60,131840,88,11,0.65
+36,91859,53,1,0.29
+53,50448,45,3,0.17
+28,38239,65,0,0.77
+32,27321,66,8,0.83
+41,112622,78,9,0.43
+40,80214,45,4,0.07
+18,33388,94,11,0.92
+18,46500,100,11,0.82
+58,90294,42,0,0.23
+26,30193,74,12,0.79
+24,41297,47,15,0.72
+30,58229,48,11,0.68
+49,56536,55,5,0.05
+48,104258,46,4,0.39
+28,33426,53,10,0.8
+55,66518,38,0,0.65
+20,37885,57,23,0.64
+30,68512,48,3,0.43
+61,93824,35,2,0.23
+44,78086,35,2,0.33
+26,73001,83,11,0.47
+53,58232,38,3,0.37
+52,79818,34,4,0.18
+22,32707,88,7,0.72
+53,71493,30,8,0.28
+24,31347,59,17,0.76
+19,40540,66,13,0.73
+30,128195,82,7,0.61
+39,57678,53,8,0.62
+44,94652,83,7,0.62
+27,86064,58,2,0.45
+59,98536,41,1,0.26
+34,62742,41,9,0.61
+47,90203,41,4,0.23
+30,56074,28,12,0.6
+28,22005,57,12,0.91
+59,84794,41,6,0.22
+22,36724,78,7,0.67
+22,34373,88,16,0.87
+49,85215,39,4,0.22
+18,27065,76,9,1.0
+31,112508,72,4,0.54
+28,118744,72,1,0.4
+59,185519,62,3,0.65
+24,50337,40,7,0.21
+27,39737,55,10,0.65
+55,144921,67,8,0.42
+34,45725,61,6,0.3
+63,163965,79,1,0.73
+51,129302,78,4,0.74
+40,129728,76,10,0.55
+18,22770,60,18,0.9
+39,65348,46,6,0.56
+52,147411,84,9,0.59
+42,61157,65,4,0.39
+21,26283,62,15,0.68
+51,76226,46,1,0.2
+45,99999,81,8,0.6
+48,101335,24,2,0.26
+49,106083,36,2,0.27
+32,26200,38,17,0.85
+55,81279,44,4,0.13
+46,121450,77,9,0.74
+32,31011,77,16,0.93
+42,67841,27,1,0.24
+31,104832,78,10,0.39
+34,101023,39,5,0.43
+57,89756,38,3,0.3
+52,57453,36,2,0.43
+41,46897,52,7,0.47
+34,33913,63,8,0.77
+56,104115,67,2,0.94
+52,90258,81,7,0.88
+54,104391,71,10,0.43
+37,115386,85,4,0.84
+36,55014,40,5,0.31
+31,45194,59,4,0.44
+45,88017,45,4,0.5
+65,93314,34,2,0.19
+48,75189,52,1,0.36
+51,106928,31,4,0.29
+49,114962,71,8,0.53
+36,43721,68,0,0.51
+42,64953,58,6,0.42
+28,29275,97,10,0.77
+69,92584,32,7,0.4
+20,31523,80,4,0.78
+35,62625,41,4,0.56
+42,40128,58,0,0.48
+18,34104,59,9,0.82
+59,91684,15,4,0.32
+21,35910,67,15,0.91
+27,34922,69,12,0.82
+34,130244,73,14,0.68
+45,57206,38,2,0.49
+18,25712,68,8,0.75
+31,69867,64,3,0.64
+49,90433,21,6,0.44
+20,44705,84,18,0.63
+52,75590,54,0,0.18
+18,27205,63,12,0.77
diff --git a/chapters/chapter-08-unsupervised-learning/datasets/sensors.csv b/chapters/chapter-08-unsupervised-learning/datasets/sensors.csv
new file mode 100644
index 0000000..e641ca3
--- /dev/null
+++ b/chapters/chapter-08-unsupervised-learning/datasets/sensors.csv
@@ -0,0 +1,201 @@
+temp,pressure,vibration,is_anomaly
+67.4,29.4,0.343,0
+62.3,32.1,0.492,0
+63.6,27.0,0.549,0
+68.4,28.0,0.52,0
+68.9,31.1,0.316,0
+67.8,27.4,0.665,0
+76.3,32.1,0.408,0
+60.1,28.5,0.674,0
+97.6,52.9,0.756,1
+67.4,24.7,0.533,0
+77.2,31.6,0.611,0
+76.7,33.9,0.591,0
+73.0,32.0,0.512,0
+65.0,31.5,0.356,0
+68.8,32.5,0.546,0
+76.6,34.7,0.547,0
+69.5,26.7,0.461,0
+100.5,40.7,1.425,1
+62.0,36.4,0.474,0
+73.5,28.6,0.406,0
+59.0,29.0,0.5,0
+76.1,27.3,0.433,0
+67.8,29.0,0.459,0
+65.4,33.1,0.373,0
+72.2,29.9,0.4,0
+62.9,29.2,0.514,0
+66.2,31.5,0.323,0
+70.9,30.8,0.437,0
+62.8,23.9,0.437,0
+73.1,31.5,0.407,0
+77.0,30.2,0.45,0
+73.4,29.5,0.251,0
+77.3,49.5,0.949,1
+64.8,27.5,0.555,0
+69.4,30.4,0.552,0
+75.0,28.6,0.52,0
+70.4,34.2,0.397,0
+73.5,35.1,0.6,0
+78.9,28.9,0.62,0
+93.7,51.2,1.037,1
+72.5,31.0,0.544,0
+64.3,27.3,0.654,0
+70.4,30.8,0.384,0
+67.1,30.9,0.53,0
+67.4,33.8,0.448,0
+71.5,32.7,0.496,0
+70.6,28.9,0.39,0
+59.6,27.9,0.304,0
+64.8,31.4,0.643,0
+72.2,22.9,0.534,0
+101.6,52.0,1.254,1
+80.7,27.1,0.605,0
+65.9,29.3,0.375,0
+67.2,28.1,0.568,0
+77.1,24.3,0.262,0
+74.6,26.5,0.538,0
+65.4,29.0,0.619,0
+73.6,31.1,0.578,0
+72.6,34.8,0.288,0
+80.3,29.1,0.329,0
+70.6,34.3,0.43,0
+64.6,28.1,0.424,0
+67.5,33.2,0.502,0
+70.4,27.9,0.392,0
+72.5,31.5,0.33,0
+63.2,32.6,0.43,0
+66.8,26.1,0.45,0
+73.2,34.5,0.519,0
+67.0,26.4,0.402,0
+70.3,34.4,0.439,0
+74.0,33.5,0.262,0
+69.6,34.3,0.714,0
+92.9,53.4,1.367,1
+68.8,30.1,0.526,0
+95.1,40.5,1.75,1
+61.5,34.1,0.421,0
+68.9,26.6,0.371,0
+66.4,27.6,0.471,0
+102.8,40.4,0.751,1
+74.8,23.2,0.441,0
+69.1,24.6,0.647,0
+100.5,38.5,1.069,1
+102.0,52.7,0.611,1
+63.3,29.2,0.737,0
+95.5,33.2,0.94,1
+75.6,31.2,0.515,0
+83.2,25.6,0.377,0
+60.3,29.6,0.589,0
+104.6,40.3,1.798,1
+69.9,34.4,0.405,0
+70.0,26.0,0.423,0
+76.9,26.0,0.609,0
+64.0,24.5,0.445,0
+73.7,33.8,0.528,0
+76.4,27.8,0.432,0
+68.8,25.3,0.419,0
+66.9,32.4,0.68,0
+58.5,27.5,0.478,0
+74.5,30.8,0.557,0
+75.5,25.8,0.551,0
+70.3,33.3,0.533,0
+70.2,33.4,0.475,0
+73.8,33.9,0.482,0
+62.8,34.0,0.461,0
+76.4,29.9,0.503,0
+75.5,26.9,0.594,0
+64.0,27.9,0.46,0
+69.4,29.8,0.485,0
+68.9,32.1,0.583,0
+71.7,29.1,0.582,0
+69.4,34.4,0.351,0
+75.1,31.8,0.433,0
+73.4,31.8,0.682,0
+62.8,24.6,0.516,0
+72.5,30.4,0.425,0
+62.5,26.9,0.586,0
+72.7,27.6,0.419,0
+69.1,33.0,0.369,0
+68.1,30.4,0.577,0
+70.4,27.7,0.537,0
+74.4,30.8,0.706,0
+69.3,28.9,0.278,0
+81.6,33.0,0.503,0
+65.9,31.2,0.396,0
+71.8,29.3,0.674,0
+91.6,44.4,1.021,1
+67.9,31.7,0.678,0
+57.6,31.0,0.438,0
+101.6,43.6,1.199,1
+70.1,34.0,0.57,0
+74.8,27.3,0.586,0
+60.3,32.1,0.428,0
+75.6,29.0,0.229,0
+86.2,43.1,1.148,1
+72.5,25.4,0.667,0
+66.7,28.6,0.537,0
+85.9,48.5,1.196,1
+63.2,34.6,0.643,0
+74.6,29.5,0.415,0
+74.6,30.4,0.542,0
+73.9,29.6,0.519,0
+74.2,30.7,0.449,0
+69.2,25.8,0.537,0
+76.7,30.3,0.541,0
+69.5,28.9,0.445,0
+63.8,32.9,0.537,0
+74.8,28.6,0.646,0
+76.2,33.7,0.468,0
+69.2,32.4,0.36,0
+72.2,29.4,0.497,0
+67.0,32.4,0.435,0
+94.3,47.2,1.333,1
+67.8,30.4,0.613,0
+64.2,30.1,0.646,0
+70.3,30.4,0.496,0
+77.3,29.1,0.466,0
+65.5,30.5,0.306,0
+86.2,44.0,0.773,1
+70.7,21.9,0.387,0
+70.6,28.2,0.535,0
+69.1,30.3,0.524,0
+68.1,31.3,0.572,0
+69.8,31.0,0.418,0
+68.6,29.1,0.547,0
+68.2,29.2,0.53,0
+65.3,27.5,0.506,0
+76.5,25.9,0.495,0
+74.3,25.8,0.479,0
+68.3,29.3,0.489,0
+70.0,26.4,0.527,0
+67.4,33.5,0.41,0
+74.9,37.5,0.408,0
+91.4,37.8,0.975,1
+75.6,22.0,0.609,0
+69.1,33.5,0.412,0
+69.5,26.5,0.503,0
+73.7,32.9,0.546,0
+65.9,30.7,0.48,0
+70.4,27.0,0.461,0
+77.0,26.3,0.324,0
+70.7,27.3,0.295,0
+68.2,29.2,0.343,0
+67.8,27.7,0.476,0
+83.9,31.7,0.607,0
+71.5,26.0,0.72,0
+74.8,28.0,0.633,0
+68.3,31.6,0.477,0
+70.9,26.2,0.68,0
+85.7,42.0,0.568,1
+71.8,29.7,0.581,0
+69.1,29.2,0.561,0
+71.5,29.2,0.528,0
+68.7,30.0,0.668,0
+68.0,25.5,0.424,0
+69.7,31.2,0.539,0
+67.0,28.3,0.494,0
+68.2,22.1,0.596,0
+74.8,29.8,0.435,0
+74.5,32.1,0.794,0
+62.9,29.4,0.552,0
diff --git a/chapters/chapter-08-unsupervised-learning/exercises/exercises.py b/chapters/chapter-08-unsupervised-learning/exercises/exercises.py
new file mode 100644
index 0000000..07d8f6c
--- /dev/null
+++ b/chapters/chapter-08-unsupervised-learning/exercises/exercises.py
@@ -0,0 +1,154 @@
+"""
+Chapter 8 Exercises: Unsupervised Learning
+
+Generated by Berta AI | Created by Luigi Pascal Rondanini
+"""
+
+import numpy as np
+
+
+# =============================================================================
+# Exercise 1: Implement K-Means Clustering From Scratch
+# =============================================================================
+# Build a KMeans class that:
+# - Initializes K centroids randomly from the data points
+# - Assigns each point to the nearest centroid (Euclidean distance)
+# - Recomputes centroids as the mean of assigned points
+# - Repeats for max_iters or until convergence (centroids stop moving)
+#
+# Methods:
+# - fit(X): Run the K-Means algorithm
+# - predict(X): Assign each row to its nearest centroid
+# - fit_predict(X): fit then predict
+#
+# Attributes after fit:
+# - centroids: (K, n_features) array
+# - inertia: within-cluster sum of squared distances
+#
+# Hint: np.linalg.norm(X[:, None] - centroids, axis=2) gives all pairwise distances
+
+class KMeansClustering:
+ def __init__(self, n_clusters=3, max_iters=100, random_state=42):
+ # YOUR CODE HERE
+ pass
+
+ def fit(self, X):
+ # YOUR CODE HERE
+ pass
+
+ def predict(self, X):
+ # YOUR CODE HERE
+ pass
+
+ def fit_predict(self, X):
+ # YOUR CODE HERE
+ pass
+
+
+# =============================================================================
+# Exercise 2: Implement PCA From Scratch
+# =============================================================================
+# Build a PCA class that:
+# - Centers the data (subtract mean)
+# - Computes the covariance matrix
+# - Finds eigenvectors/eigenvalues via np.linalg.eigh
+# - Sorts components by descending eigenvalue
+# - Projects data onto the top n_components eigenvectors
+#
+# Methods:
+# - fit(X): Compute components
+# - transform(X): Project X onto components
+# - fit_transform(X): fit then transform
+#
+# Attributes after fit:
+# - components_: (n_components, n_features) array
+# - explained_variance_ratio_: fraction of variance per component
+#
+# Hint: covariance = X_centered.T @ X_centered / (n - 1)
+
+class PCAFromScratch:
+ def __init__(self, n_components=2):
+ # YOUR CODE HERE
+ pass
+
+ def fit(self, X):
+ # YOUR CODE HERE
+ pass
+
+ def transform(self, X):
+ # YOUR CODE HERE
+ pass
+
+ def fit_transform(self, X):
+ # YOUR CODE HERE
+ pass
+
+
+# =============================================================================
+# Exercise 3: Implement Silhouette Score From Scratch
+# =============================================================================
+# Compute the silhouette score for a clustering result:
+# For each point i:
+# a(i) = mean distance to all other points in the same cluster
+# b(i) = min over other clusters of mean distance to that cluster's points
+# s(i) = (b(i) - a(i)) / max(a(i), b(i))
+# Return the mean of s(i) over all points.
+#
+# Parameters:
+# X: (n_samples, n_features) array
+# labels: (n_samples,) array of cluster assignments
+#
+# Return: float in [-1, 1], higher is better
+#
+# Hint: Use pairwise Euclidean distances. Handle single-point clusters (s=0).
+
+def silhouette_score_scratch(X, labels):
+ # YOUR CODE HERE
+ pass
+
+
+# =============================================================================
+# Exercise 4: Anomaly Detection with Z-Score
+# =============================================================================
+# Implement a simple anomaly detector that:
+# 1. Computes the Z-score for each feature: z = (x - mean) / std
+# 2. Flags a point as anomalous if any feature has |z| > threshold
+#
+# Parameters:
+# X: (n_samples, n_features) array
+# threshold: float (default 3.0)
+#
+# Return: (n_samples,) boolean array, True = anomaly
+#
+# Hint: np.any(np.abs(z_scores) > threshold, axis=1)
+
+def detect_anomalies_zscore(X, threshold=3.0):
+ # YOUR CODE HERE
+ pass
+
+
+# =============================================================================
+# Exercise 5: End-to-End Customer Segmentation Pipeline
+# =============================================================================
+# Build a pipeline that:
+# 1. Loads customer data from datasets/customers.csv
+# 2. Scales features with StandardScaler
+# 3. Applies PCA (keep 95% variance)
+# 4. Uses elbow method to find optimal K (test K=2..8)
+# 5. Runs K-Means with optimal K
+# 6. Returns segment profiles (mean of original features per cluster)
+#
+# Return dict: {
+# "n_clusters": int,
+# "labels": array,
+# "profiles": DataFrame (one row per cluster, columns = original features),
+# "inertias": list (for each K tested),
+# "silhouette": float
+# }
+#
+# Hint: The "elbow" can be found by looking for the K where the second
+# derivative of inertia changes most (or just pick K=4 if uncertain).
+
+def customer_segmentation_pipeline(csv_path="datasets/customers.csv"):
+ # YOUR CODE HERE
+ pass
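
The elbow hint in Exercise 5 (look for the largest change in the second derivative of inertia) can be sanity-checked on a made-up inertia curve before wiring it into the full pipeline. The numbers below are illustrative only, not real clustering output:

```python
import numpy as np

ks = list(range(2, 9))  # K values tested, matching Exercise 5
inertias = [1000.0, 600.0, 350.0, 320.0, 300.0, 290.0, 285.0]  # made-up curve bending at K=4

d1 = np.diff(inertias)  # first differences: how much inertia drops per extra cluster
d2 = np.diff(d1)        # second differences: curvature of the elbow plot
# d2[i] is computed from inertias[i], inertias[i+1], inertias[i+2],
# so it is centred on ks[i + 1] - that K is where the sharpest bend sits
elbow_k = ks[int(np.argmax(np.abs(d2))) + 1]
print(elbow_k)  # → 4
```

Any reasonable elbow picker works here; as the exercise notes, falling back to K=4 is acceptable when the curve is ambiguous.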
diff --git a/chapters/chapter-08-unsupervised-learning/exercises/solutions/solutions.py b/chapters/chapter-08-unsupervised-learning/exercises/solutions/solutions.py
new file mode 100644
index 0000000..1ffad8c
--- /dev/null
+++ b/chapters/chapter-08-unsupervised-learning/exercises/solutions/solutions.py
@@ -0,0 +1,265 @@
+"""
+Chapter 8 Solutions: Unsupervised Learning
+
+Generated by Berta AI | Created by Luigi Pascal Rondanini
+"""
+
+import numpy as np
+from pathlib import Path
+
+
+# =============================================================================
+# Exercise 1: K-Means Clustering From Scratch
+# =============================================================================
+
+class KMeansClustering:
+ def __init__(self, n_clusters=3, max_iters=100, random_state=42):
+ self.n_clusters = n_clusters
+ self.max_iters = max_iters
+ self.random_state = random_state
+ self.centroids = None
+ self.inertia = None
+
+ def fit(self, X):
+ X = np.asarray(X, dtype=float)
+ rng = np.random.RandomState(self.random_state)
+ idx = rng.choice(len(X), size=self.n_clusters, replace=False)
+ self.centroids = X[idx].copy()
+
+ for _ in range(self.max_iters):
+ distances = np.linalg.norm(X[:, None] - self.centroids, axis=2)
+ labels = np.argmin(distances, axis=1)
+
+ new_centroids = np.array([
+ X[labels == k].mean(axis=0) if np.any(labels == k) else self.centroids[k]
+ for k in range(self.n_clusters)
+ ])
+
+ if np.allclose(new_centroids, self.centroids):
+ break
+ self.centroids = new_centroids
+
+ distances = np.linalg.norm(X[:, None] - self.centroids, axis=2)
+ labels = np.argmin(distances, axis=1)
+ self.inertia = sum(
+ np.sum((X[labels == k] - self.centroids[k]) ** 2)
+ for k in range(self.n_clusters)
+ )
+ self._labels = labels
+ return self
+
+ def predict(self, X):
+ X = np.asarray(X, dtype=float)
+ distances = np.linalg.norm(X[:, None] - self.centroids, axis=2)
+ return np.argmin(distances, axis=1)
+
+ def fit_predict(self, X):
+ self.fit(X)
+ return self._labels
+
+
+# =============================================================================
+# Exercise 2: PCA From Scratch
+# =============================================================================
+
+class PCAFromScratch:
+ def __init__(self, n_components=2):
+ self.n_components = n_components
+ self.components_ = None
+ self.explained_variance_ratio_ = None
+ self._mean = None
+
+ def fit(self, X):
+ X = np.asarray(X, dtype=float)
+ self._mean = X.mean(axis=0)
+ X_centered = X - self._mean
+ n = X.shape[0]
+ cov = X_centered.T @ X_centered / (n - 1)
+
+ eigenvalues, eigenvectors = np.linalg.eigh(cov)
+ idx = np.argsort(eigenvalues)[::-1]
+ eigenvalues = eigenvalues[idx]
+ eigenvectors = eigenvectors[:, idx]
+
+ self.components_ = eigenvectors[:, :self.n_components].T
+ total_var = eigenvalues.sum()
+ self.explained_variance_ratio_ = eigenvalues[:self.n_components] / total_var
+ return self
+
+ def transform(self, X):
+ X = np.asarray(X, dtype=float)
+ X_centered = X - self._mean
+ return X_centered @ self.components_.T
+
+ def fit_transform(self, X):
+ self.fit(X)
+ return self.transform(X)
+
+
+# =============================================================================
+# Exercise 3: Silhouette Score From Scratch
+# =============================================================================
+
+def silhouette_score_scratch(X, labels):
+ X = np.asarray(X, dtype=float)
+ labels = np.asarray(labels)
+ n = len(X)
+ unique_labels = np.unique(labels)
+
+ if len(unique_labels) < 2:
+ return 0.0
+
+ scores = np.zeros(n)
+ for i in range(n):
+ same_mask = labels == labels[i]
+ same_mask[i] = False
+ same_cluster = X[same_mask]
+
+ if len(same_cluster) == 0:
+ scores[i] = 0.0
+ continue
+
+ a_i = np.mean(np.linalg.norm(same_cluster - X[i], axis=1))
+
+ b_i = np.inf
+ for k in unique_labels:
+ if k == labels[i]:
+ continue
+ other_cluster = X[labels == k]
+ mean_dist = np.mean(np.linalg.norm(other_cluster - X[i], axis=1))
+ b_i = min(b_i, mean_dist)
+
+ denom = max(a_i, b_i)
+ scores[i] = (b_i - a_i) / denom if denom > 0 else 0.0
+
+ return float(np.mean(scores))
+
+
+# =============================================================================
+# Exercise 4: Anomaly Detection with Z-Score
+# =============================================================================
+
+def detect_anomalies_zscore(X, threshold=3.0):
+ X = np.asarray(X, dtype=float)
+ mean = X.mean(axis=0)
+ std = X.std(axis=0)
+ std[std == 0] = 1.0
+ z_scores = (X - mean) / std
+ return np.any(np.abs(z_scores) > threshold, axis=1)
+
+
+# =============================================================================
+# Exercise 5: Customer Segmentation Pipeline
+# =============================================================================
+
+def customer_segmentation_pipeline(csv_path="datasets/customers.csv"):
+ try:
+ import pandas as pd
+ from sklearn.preprocessing import StandardScaler
+ from sklearn.decomposition import PCA
+ from sklearn.cluster import KMeans
+ from sklearn.metrics import silhouette_score
+ except ImportError:
+ return {"n_clusters": 0, "labels": None, "profiles": None,
+ "inertias": [], "silhouette": 0.0}
+
+ base = Path(__file__).parent.parent.parent
+ path = base / csv_path
+ if not path.exists():
+ return {"n_clusters": 0, "labels": None, "profiles": None,
+ "inertias": [], "silhouette": 0.0}
+
+ df = pd.read_csv(path)
+ feature_cols = [c for c in df.columns if c not in ("customer_id", "segment")]
+ X_raw = df[feature_cols].values
+
+ scaler = StandardScaler()
+ X_scaled = scaler.fit_transform(X_raw)
+
+ pca = PCA(n_components=0.95)
+ X_pca = pca.fit_transform(X_scaled)
+
+ K_range = range(2, 9)
+ inertias = []
+ for k in K_range:
+ km = KMeans(n_clusters=k, n_init=10, random_state=42)
+ km.fit(X_pca)
+ inertias.append(km.inertia_)
+
+ diffs = np.diff(inertias)
+ diffs2 = np.diff(diffs)
+ # diffs2[i] is the discrete curvature centred on inertias[i + 1], i.e. K = i + 3
+ best_k = int(np.argmax(np.abs(diffs2)) + 3)
+ best_k = max(2, min(best_k, 8))
+
+ km_final = KMeans(n_clusters=best_k, n_init=10, random_state=42)
+ labels = km_final.fit_predict(X_pca)
+ sil = silhouette_score(X_pca, labels)
+
+ df["cluster"] = labels
+ profiles = df.groupby("cluster")[feature_cols].mean()
+
+ return {
+ "n_clusters": best_k,
+ "labels": labels,
+ "profiles": profiles,
+ "inertias": list(inertias),
+ "silhouette": float(sil),
+ }
+
+
+if __name__ == "__main__":
+ print("Chapter 8 Solutions - Verification\n")
+
+ np.random.seed(42)
+
+ # Ex 1
+ print("Exercise 1: K-Means Clustering")
+ from sklearn.datasets import make_blobs
+ X, y_true = make_blobs(n_samples=200, centers=3, random_state=42)
+ km = KMeansClustering(n_clusters=3, random_state=42)
+ labels = km.fit_predict(X)
+ assert km.centroids.shape == (3, 2)
+ assert len(labels) == 200
+ assert km.inertia > 0
+ print(f" Inertia = {km.inertia:.2f}")
+ print(f" Centroids shape: {km.centroids.shape}")
+
+ # Ex 2
+ print("\nExercise 2: PCA From Scratch")
+ X_4d = np.random.randn(100, 4)
+ pca = PCAFromScratch(n_components=2)
+ X_2d = pca.fit_transform(X_4d)
+ assert X_2d.shape == (100, 2)
+ assert len(pca.explained_variance_ratio_) == 2
+ assert 0.0 < sum(pca.explained_variance_ratio_) <= 1.0 + 1e-9
+ print(f" Variance explained: {pca.explained_variance_ratio_}")
+ print(f" Projected shape: {X_2d.shape}")
+
+ # Ex 3
+ print("\nExercise 3: Silhouette Score")
+ sil = silhouette_score_scratch(X, y_true)
+ assert -1 <= sil <= 1
+ print(f" Silhouette score = {sil:.4f}")
+
+ # Ex 4
+ print("\nExercise 4: Anomaly Detection (Z-Score)")
+ X_normal = np.random.randn(100, 3)
+ X_anomalies = np.array([[10, 10, 10], [-8, -8, -8]])
+ X_combined = np.vstack([X_normal, X_anomalies])
+ flags = detect_anomalies_zscore(X_combined, threshold=3.0)
+ assert flags[-1]
+ assert flags[-2]
+ n_detected = flags.sum()
+ print(f" Detected {n_detected} anomalies out of {len(X_combined)} points")
+
+ # Ex 5
+ print("\nExercise 5: Customer Segmentation Pipeline")
+ result = customer_segmentation_pipeline()
+ if result["labels"] is not None:
+ print(f" Optimal K: {result['n_clusters']}")
+ print(f" Silhouette: {result['silhouette']:.4f}")
+ print(f" Segment profiles:\n{result['profiles']}")
+ else:
+ print(" (Dataset may not be found - run from chapter root)")
+
+ print("\nAll verifications passed.")
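The pipeline above chooses K with a second-difference elbow heuristic. Here is a standalone sketch with fabricated inertia values (the numbers are made up for the demo). The key index detail: `diffs2[i]` is the discrete curvature centred on `inertias[i + 1]`, so with a search starting at K = 2 the sharpest bend sits at `K_range[i + 1]`:

```python
import numpy as np

# Illustrative inertias for K = 2..8 (fabricated for this demo)
K_range = list(range(2, 9))
inertias = [900.0, 700.0, 300.0, 260.0, 230.0, 210.0, 195.0]

diffs = np.diff(inertias)   # how much inertia drops at each step
diffs2 = np.diff(diffs)     # change in the drop (discrete curvature)

# diffs2[i] is centred on inertias[i + 1], i.e. K = K_range[i + 1]
best_k = K_range[int(np.argmax(np.abs(diffs2))) + 1]
print(best_k)  # -> 4: the big 700 -> 300 drop levels off at K = 4
```

The same idea generalises to any K range; only the offset between the `diffs2` index and K changes with the starting value.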
diff --git a/chapters/chapter-08-unsupervised-learning/notebooks/01_introduction.ipynb b/chapters/chapter-08-unsupervised-learning/notebooks/01_introduction.ipynb
new file mode 100644
index 0000000..5bb2233
--- /dev/null
+++ b/chapters/chapter-08-unsupervised-learning/notebooks/01_introduction.ipynb
@@ -0,0 +1,580 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Chapter 8: Unsupervised Learning\n",
+ "## Notebook 01 - Introduction: Clustering Basics\n",
+ "\n",
+ "Unsupervised learning finds hidden patterns in data without labels. We start with the most fundamental algorithm: K-Means clustering.\n",
+ "\n",
+ "**What you'll learn:**\n",
+ "- The difference between supervised and unsupervised learning\n",
+ "- K-Means clustering from scratch using NumPy\n",
+ "- Evaluating clusters with inertia and silhouette score\n",
+ "- The elbow method for choosing K\n",
+ "- Scikit-learn's KMeans interface\n",
+ "\n",
+ "**Time estimate:** 2.5 hours\n",
+ "\n",
+ "---\n",
+ "*Generated by Berta AI | Created by Luigi Pascal Rondanini*"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "## 1. Supervised vs Unsupervised Learning\n",
+ "\n",
+ "In **supervised learning**, every training example comes with a label β the \"right answer\" β and the model learns a mapping from inputs to outputs. Classification and regression are the classic examples.\n",
+ "\n",
+ "In **unsupervised learning**, there are **no labels at all**. The algorithm must discover structure in the data on its own. Common tasks include:\n",
+ "\n",
+ "| Task | Goal | Example algorithms |\n",
+ "|------|------|--------------------|\n",
+ "| **Clustering** | Group similar points together | K-Means, DBSCAN, Hierarchical |\n",
+ "| **Dimensionality reduction** | Compress features while preserving structure | PCA, t-SNE, UMAP |\n",
+ "| **Anomaly detection** | Find unusual observations | Isolation Forest, LOF |\n",
+ "\n",
+ "This notebook focuses on **clustering** β specifically the **K-Means** algorithm, the most widely-used clustering method.\n",
+ "\n",
+ "Let's start by generating some data and seeing what it looks like *without* labels."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import matplotlib.pyplot as plt\n",
+ "from sklearn.datasets import make_blobs\n",
+ "\n",
+ "np.random.seed(42)\n",
+ "\n",
+ "X, y_true = make_blobs(\n",
+ " n_samples=200, centers=3, cluster_std=0.9, random_state=42\n",
+ ")\n",
+ "\n",
+ "fig, axes = plt.subplots(1, 2, figsize=(13, 5))\n",
+ "\n",
+ "axes[0].scatter(X[:, 0], X[:, 1], c=\"steelblue\", edgecolors=\"k\", s=50, alpha=0.7)\n",
+ "axes[0].set_title(\"What we observe (no labels)\", fontsize=14)\n",
+ "axes[0].set_xlabel(\"Feature 1\")\n",
+ "axes[0].set_ylabel(\"Feature 2\")\n",
+ "\n",
+ "colors = [\"#e74c3c\", \"#2ecc71\", \"#3498db\"]\n",
+ "for k in range(3):\n",
+ " mask = y_true == k\n",
+ " axes[1].scatter(X[mask, 0], X[mask, 1], c=colors[k],\n",
+ " edgecolors=\"k\", s=50, alpha=0.7, label=f\"Cluster {k}\")\n",
+ "axes[1].set_title(\"True clusters (hidden from algorithm)\", fontsize=14)\n",
+ "axes[1].set_xlabel(\"Feature 1\")\n",
+ "axes[1].set_ylabel(\"Feature 2\")\n",
+ "axes[1].legend()\n",
+ "\n",
+ "plt.tight_layout()\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The left panel is what an unsupervised algorithm receives β raw coordinates with no color-coding. The right panel reveals the ground truth we want the algorithm to *recover* on its own.\n",
+ "\n",
+ "---\n",
+ "## 2. K-Means Algorithm β Theory\n",
+ "\n",
+ "K-Means is an iterative algorithm that partitions *n* data points into *K* clusters. It works in three repeating steps:\n",
+ "\n",
+ "### Step 1 β Initialize\n",
+ "Pick *K* points as initial **centroids** (cluster centers). The simplest approach is to choose *K* data points at random.\n",
+ "\n",
+ "### Step 2 β Assign\n",
+ "For every data point, compute the Euclidean distance to each centroid and assign the point to the **nearest** centroid:\n",
+ "\n",
+ "$$c_i = \\arg\\min_{k} \\| x_i - \\mu_k \\|^2$$\n",
+ "\n",
+ "### Step 3 β Update\n",
+ "Recompute each centroid as the **mean** of all points currently assigned to that cluster:\n",
+ "\n",
+ "$$\\mu_k = \\frac{1}{|C_k|} \\sum_{x_i \\in C_k} x_i$$\n",
+ "\n",
+ "### Repeat\n",
+ "Alternate between Steps 2 and 3 until the assignments no longer change (or a maximum number of iterations is reached).\n",
+ "\n",
+ "### Important caveats\n",
+ "- **Random initialization sensitivity:** Different starting centroids can lead to different final clusters. Running the algorithm multiple times with different seeds and keeping the best result is standard practice.\n",
+ "- **K must be chosen in advance.** We'll learn the *elbow method* later in this notebook.\n",
+ "- The algorithm minimises **inertia** (within-cluster sum of squares) β it always converges, but to a *local* minimum, not necessarily the global one."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "## 3. K-Means From Scratch\n",
+ "\n",
+ "Let's implement K-Means using only NumPy so we truly understand every step."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "class KMeansScratch:\n",
+ " \"\"\"Minimal K-Means implementation using NumPy.\"\"\"\n",
+ "\n",
+ " def __init__(self, k=3, max_iters=100, random_state=42):\n",
+ " self.k = k\n",
+ " self.max_iters = max_iters\n",
+ " self.random_state = random_state\n",
+ " self.centroids = None\n",
+ " self.labels_ = None\n",
+ " self.inertia_ = None\n",
+ " self.inertia_history = []\n",
+ " self.centroid_history = []\n",
+ " self.label_history = []\n",
+ "\n",
+ " def _euclidean_distances(self, X, centroids):\n",
+ " \"\"\"Compute distance from every point to every centroid.\"\"\"\n",
+ " # X: (n, d), centroids: (k, d) -> result: (n, k)\n",
+ " return np.sqrt(((X[:, np.newaxis] - centroids[np.newaxis]) ** 2).sum(axis=2))\n",
+ "\n",
+ " def _compute_inertia(self, X, labels, centroids):\n",
+ " return sum(\n",
+ " np.sum((X[labels == k] - centroids[k]) ** 2)\n",
+ " for k in range(self.k)\n",
+ " )\n",
+ "\n",
+ " def fit(self, X):\n",
+ " rng = np.random.RandomState(self.random_state)\n",
+ " n_samples = X.shape[0]\n",
+ "\n",
+ " # Step 1: random initialization\n",
+ " idx = rng.choice(n_samples, self.k, replace=False)\n",
+ " self.centroids = X[idx].copy()\n",
+ "\n",
+ " self.inertia_history = []\n",
+ " self.centroid_history = [self.centroids.copy()]\n",
+ " self.label_history = []\n",
+ "\n",
+ " for _ in range(self.max_iters):\n",
+ " # Step 2: assign\n",
+ " distances = self._euclidean_distances(X, self.centroids)\n",
+ " labels = np.argmin(distances, axis=1)\n",
+ " self.label_history.append(labels.copy())\n",
+ "\n",
+ " # Step 3: update centroids\n",
+ " new_centroids = np.array([\n",
+ " X[labels == k].mean(axis=0) if np.any(labels == k)\n",
+ " else self.centroids[k]\n",
+ " for k in range(self.k)\n",
+ " ])\n",
+ "\n",
+ " inertia = self._compute_inertia(X, labels, new_centroids)\n",
+ " self.inertia_history.append(inertia)\n",
+ " self.centroid_history.append(new_centroids.copy())\n",
+ "\n",
+ " if np.allclose(new_centroids, self.centroids):\n",
+ " break\n",
+ " self.centroids = new_centroids\n",
+ "\n",
+ " self.labels_ = labels\n",
+ " self.inertia_ = self.inertia_history[-1]\n",
+ " return self\n",
+ "\n",
+ " def predict(self, X):\n",
+ " distances = self._euclidean_distances(X, self.centroids)\n",
+ " return np.argmin(distances, axis=1)\n",
+ "\n",
+ "\n",
+ "km_scratch = KMeansScratch(k=3, random_state=42)\n",
+ "km_scratch.fit(X)\n",
+ "\n",
+ "print(f\"Converged in {len(km_scratch.inertia_history)} iterations\")\n",
+ "print(f\"Final inertia: {km_scratch.inertia_:.2f}\")\n",
+ "print(f\"Centroids:\\n{km_scratch.centroids}\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "fig, axes = plt.subplots(1, 2, figsize=(13, 5))\n",
+ "\n",
+ "colors_map = np.array([\"#e74c3c\", \"#2ecc71\", \"#3498db\"])\n",
+ "\n",
+ "for k in range(3):\n",
+ " mask = y_true == k\n",
+ " axes[0].scatter(X[mask, 0], X[mask, 1], c=colors[k],\n",
+ " edgecolors=\"k\", s=50, alpha=0.7, label=f\"True {k}\")\n",
+ "axes[0].set_title(\"Ground Truth\", fontsize=14)\n",
+ "axes[0].legend()\n",
+ "axes[0].set_xlabel(\"Feature 1\")\n",
+ "axes[0].set_ylabel(\"Feature 2\")\n",
+ "\n",
+ "axes[1].scatter(X[:, 0], X[:, 1], c=colors_map[km_scratch.labels_],\n",
+ " edgecolors=\"k\", s=50, alpha=0.7)\n",
+ "axes[1].scatter(km_scratch.centroids[:, 0], km_scratch.centroids[:, 1],\n",
+ " c=colors, marker=\"X\", s=250, edgecolors=\"k\", linewidths=1.5,\n",
+ " zorder=5, label=\"Centroids\")\n",
+ "axes[1].set_title(\"K-Means (scratch) result\", fontsize=14)\n",
+ "axes[1].legend()\n",
+ "axes[1].set_xlabel(\"Feature 1\")\n",
+ "axes[1].set_ylabel(\"Feature 2\")\n",
+ "\n",
+ "plt.tight_layout()\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "## 4. Step-by-Step K-Means Visualization\n",
+ "\n",
+ "To build intuition for how the algorithm converges, let's watch the first four iterations unfold. Each subplot shows the cluster assignments and centroid positions at a particular iteration."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "fig, axes = plt.subplots(2, 2, figsize=(12, 10))\n",
+ "axes = axes.ravel()\n",
+ "\n",
+ "colors_map = np.array([\"#e74c3c\", \"#2ecc71\", \"#3498db\"])\n",
+ "\n",
+ "n_show = min(4, len(km_scratch.label_history))\n",
+ "\n",
+ "for i in range(n_show):\n",
+ " ax = axes[i]\n",
+ " labels_i = km_scratch.label_history[i]\n",
+ " centroids_i = km_scratch.centroid_history[i] # centroids *before* this assignment\n",
+ " centroids_next = km_scratch.centroid_history[i + 1] # centroids *after* update\n",
+ "\n",
+ " ax.scatter(X[:, 0], X[:, 1], c=colors_map[labels_i],\n",
+ " edgecolors=\"k\", s=40, alpha=0.6)\n",
+ "\n",
+ " # Old centroids (hollow)\n",
+ " ax.scatter(centroids_i[:, 0], centroids_i[:, 1],\n",
+ " facecolors=\"none\", edgecolors=\"k\", marker=\"o\",\n",
+ " s=200, linewidths=2, label=\"Old centroid\")\n",
+ "\n",
+ " # New centroids (filled star)\n",
+ " ax.scatter(centroids_next[:, 0], centroids_next[:, 1],\n",
+ " c=colors, marker=\"X\", s=250, edgecolors=\"k\",\n",
+ " linewidths=1.5, zorder=5, label=\"New centroid\")\n",
+ "\n",
+ " # Arrows showing centroid movement\n",
+ " for k in range(3):\n",
+ " ax.annotate(\"\",\n",
+ " xy=centroids_next[k], xytext=centroids_i[k],\n",
+ " arrowprops=dict(arrowstyle=\"->\", lw=1.5, color=\"black\"))\n",
+ "\n",
+ " ax.set_title(f\"Iteration {i + 1} | inertia = {km_scratch.inertia_history[i]:.1f}\",\n",
+ " fontsize=12)\n",
+ " if i == 0:\n",
+ " ax.legend(fontsize=9, loc=\"upper left\")\n",
+ "\n",
+ "for j in range(n_show, 4):\n",
+ " axes[j].axis(\"off\")\n",
+ "\n",
+ "plt.suptitle(\"K-Means β Iteration-by-Iteration\", fontsize=15, y=1.01)\n",
+ "plt.tight_layout()\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Notice how the centroids (stars) migrate toward the cluster centers with each iteration while the assignments stabilize.\n",
+ "\n",
+ "---\n",
+ "## 5. Evaluating Clusters\n",
+ "\n",
+ "How do we know if K-Means did a good job? Two common metrics:\n",
+ "\n",
+ "### Inertia (Within-Cluster Sum of Squares β WCSS)\n",
+ "$$\\text{Inertia} = \\sum_{k=1}^{K} \\sum_{x_i \\in C_k} \\| x_i - \\mu_k \\|^2$$\n",
+ "\n",
+ "Lower is better, but inertia **always decreases** as K increases (at K = n every point is its own cluster with inertia = 0). So inertia alone doesn't tell us the *right* K.\n",
+ "\n",
+ "### Silhouette Score\n",
+ "For each point *i*:\n",
+ "- **a(i)** = mean distance to other points in the *same* cluster\n",
+ "- **b(i)** = mean distance to points in the *nearest different* cluster\n",
+ "\n",
+ "$$s(i) = \\frac{b(i) - a(i)}{\\max(a(i),\\, b(i))}$$\n",
+ "\n",
+ "Values range from β1 to +1. Higher is better; values near 0 indicate overlapping clusters."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.metrics import silhouette_score, silhouette_samples\n",
+ "\n",
+ "sil_avg = silhouette_score(X, km_scratch.labels_)\n",
+ "sil_vals = silhouette_samples(X, km_scratch.labels_)\n",
+ "\n",
+ "print(f\"Inertia: {km_scratch.inertia_:.2f}\")\n",
+ "print(f\"Silhouette (mean): {sil_avg:.4f}\")\n",
+ "print(f\"Silhouette (min): {sil_vals.min():.4f}\")\n",
+ "print(f\"Silhouette (max): {sil_vals.max():.4f}\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "fig, ax = plt.subplots(figsize=(8, 5))\n",
+ "\n",
+ "y_lower = 10\n",
+ "colors_sil = [\"#e74c3c\", \"#2ecc71\", \"#3498db\"]\n",
+ "\n",
+ "for k in range(3):\n",
+ " cluster_sil = np.sort(sil_vals[km_scratch.labels_ == k])\n",
+ " cluster_size = cluster_sil.shape[0]\n",
+ " y_upper = y_lower + cluster_size\n",
+ "\n",
+ " ax.fill_betweenx(np.arange(y_lower, y_upper), 0, cluster_sil,\n",
+ " facecolor=colors_sil[k], edgecolor=colors_sil[k], alpha=0.7)\n",
+ " ax.text(-0.05, y_lower + 0.5 * cluster_size, f\"Cluster {k}\", fontsize=11,\n",
+ " fontweight=\"bold\", va=\"center\")\n",
+ " y_lower = y_upper + 10\n",
+ "\n",
+ "ax.axvline(x=sil_avg, color=\"k\", linestyle=\"--\", linewidth=1.5,\n",
+ " label=f\"Mean silhouette = {sil_avg:.3f}\")\n",
+ "ax.set_xlabel(\"Silhouette coefficient\", fontsize=12)\n",
+ "ax.set_ylabel(\"Points (sorted within cluster)\", fontsize=12)\n",
+ "ax.set_title(\"Silhouette Plot β K-Means (K=3)\", fontsize=14)\n",
+ "ax.legend(fontsize=11)\n",
+ "ax.set_yticks([])\n",
+ "plt.tight_layout()\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "A healthy silhouette plot shows clusters of roughly similar width that extend well past the mean line. Thin slivers or clusters that barely cross zero suggest poor separation.\n",
+ "\n",
+ "---\n",
+ "## 6. The Elbow Method for Choosing K\n",
+ "\n",
+ "Since we must specify *K* before running K-Means, how do we pick a good value?\n",
+ "\n",
+ "**The Elbow Method:**\n",
+ "1. Run K-Means for K = 1, 2, β¦, K_max.\n",
+ "2. Plot inertia vs K.\n",
+ "3. Look for the **\"elbow\"** β the point where inertia stops decreasing sharply and begins to level off.\n",
+ "\n",
+ "The elbow suggests a natural number of clusters in the data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "K_range = range(1, 11)\n",
+ "inertias = []\n",
+ "silhouettes = []\n",
+ "\n",
+ "for k in K_range:\n",
+ " km = KMeansScratch(k=k, random_state=42)\n",
+ " km.fit(X)\n",
+ " inertias.append(km.inertia_)\n",
+ " if k >= 2:\n",
+ " silhouettes.append(silhouette_score(X, km.labels_))\n",
+ " else:\n",
+ " silhouettes.append(np.nan)\n",
+ "\n",
+ "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
+ "\n",
+ "axes[0].plot(K_range, inertias, \"o-\", color=\"#2c3e50\", linewidth=2, markersize=8)\n",
+ "axes[0].set_xlabel(\"Number of clusters (K)\", fontsize=12)\n",
+ "axes[0].set_ylabel(\"Inertia\", fontsize=12)\n",
+ "axes[0].set_title(\"Elbow Method\", fontsize=14)\n",
+ "axes[0].axvline(x=3, color=\"#e74c3c\", linestyle=\"--\", alpha=0.7, label=\"K = 3 (elbow)\")\n",
+ "axes[0].legend(fontsize=11)\n",
+ "axes[0].grid(True, alpha=0.3)\n",
+ "\n",
+ "sil_values = [s for s in silhouettes if not np.isnan(s)]\n",
+ "sil_ks = list(range(2, 11))\n",
+ "axes[1].plot(sil_ks, sil_values, \"s-\", color=\"#27ae60\", linewidth=2, markersize=8)\n",
+ "axes[1].set_xlabel(\"Number of clusters (K)\", fontsize=12)\n",
+ "axes[1].set_ylabel(\"Mean Silhouette Score\", fontsize=12)\n",
+ "axes[1].set_title(\"Silhouette Score vs K\", fontsize=14)\n",
+ "axes[1].axvline(x=3, color=\"#e74c3c\", linestyle=\"--\", alpha=0.7, label=\"K = 3\")\n",
+ "axes[1].legend(fontsize=11)\n",
+ "axes[1].grid(True, alpha=0.3)\n",
+ "\n",
+ "plt.tight_layout()\n",
+ "plt.show()\n",
+ "\n",
+ "print(\"Silhouette scores by K:\")\n",
+ "for k, s in zip(sil_ks, sil_values):\n",
+ " print(f\" K={k:2d} -> {s:.4f}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Both plots agree: **K = 3** is the best choice for this dataset β inertia has a clear elbow and the silhouette score peaks at K = 3.\n",
+ "\n",
+ "---\n",
+ "## 7. Scikit-learn's KMeans\n",
+ "\n",
+ "In practice you'll use scikit-learn's battle-tested implementation. Let's verify our scratch version gives the same answer."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.cluster import KMeans\n",
+ "\n",
+ "km_sklearn = KMeans(n_clusters=3, random_state=42, n_init=10)\n",
+ "km_sklearn.fit(X)\n",
+ "\n",
+ "print(\"=== Scikit-learn KMeans ===\")\n",
+ "print(f\"Inertia: {km_sklearn.inertia_:.2f}\")\n",
+ "print(f\"Silhouette score: {silhouette_score(X, km_sklearn.labels_):.4f}\")\n",
+ "print(f\"Centroids:\\n{km_sklearn.cluster_centers_}\")\n",
+ "print()\n",
+ "\n",
+ "print(\"=== Our scratch KMeans ===\")\n",
+ "print(f\"Inertia: {km_scratch.inertia_:.2f}\")\n",
+ "print(f\"Silhouette score: {silhouette_score(X, km_scratch.labels_):.4f}\")\n",
+ "print(f\"Centroids:\\n{km_scratch.centroids}\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "fig, axes = plt.subplots(1, 2, figsize=(13, 5))\n",
+ "\n",
+ "colors_map = np.array([\"#e74c3c\", \"#2ecc71\", \"#3498db\"])\n",
+ "\n",
+ "axes[0].scatter(X[:, 0], X[:, 1], c=colors_map[km_scratch.labels_],\n",
+ " edgecolors=\"k\", s=50, alpha=0.7)\n",
+ "axes[0].scatter(km_scratch.centroids[:, 0], km_scratch.centroids[:, 1],\n",
+ " c=\"gold\", marker=\"X\", s=250, edgecolors=\"k\", linewidths=1.5, zorder=5)\n",
+ "axes[0].set_title(\"Our Scratch Implementation\", fontsize=14)\n",
+ "axes[0].set_xlabel(\"Feature 1\")\n",
+ "axes[0].set_ylabel(\"Feature 2\")\n",
+ "\n",
+ "axes[1].scatter(X[:, 0], X[:, 1], c=colors_map[km_sklearn.labels_],\n",
+ " edgecolors=\"k\", s=50, alpha=0.7)\n",
+ "axes[1].scatter(km_sklearn.cluster_centers_[:, 0], km_sklearn.cluster_centers_[:, 1],\n",
+ " c=\"gold\", marker=\"X\", s=250, edgecolors=\"k\", linewidths=1.5, zorder=5)\n",
+ "axes[1].set_title(\"Scikit-learn KMeans\", fontsize=14)\n",
+ "axes[1].set_xlabel(\"Feature 1\")\n",
+ "axes[1].set_ylabel(\"Feature 2\")\n",
+ "\n",
+ "plt.suptitle(\"Scratch vs Scikit-learn β Side by Side\", fontsize=15, y=1.01)\n",
+ "plt.tight_layout()\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The cluster labels may differ in numbering (label 0 in one could be label 2 in the other), but the **groupings themselves** should be nearly identical. Scikit-learn's version often achieves slightly lower inertia because it uses the smarter **k-means++** initialization by default and runs multiple initializations (`n_init=10`).\n",
+ "\n",
+ "---\n",
+ "## 8. Practical Tips\n",
+ "\n",
+ "### Assumptions of K-Means\n",
+ "K-Means works best when clusters are:\n",
+ "- **Spherical (isotropic):** roughly the same spread in every direction.\n",
+ "- **Similar in size:** very uneven cluster sizes can pull centroids away from smaller groups.\n",
+ "- **Well-separated:** heavily overlapping clusters confuse the algorithm.\n",
+ "\n",
+ "### Feature Scaling\n",
+ "K-Means relies on Euclidean distance. If one feature has a range of 0β1 and another 0β10,000, the second feature will dominate. **Always standardize your features** (e.g., `StandardScaler`) before clustering.\n",
+ "\n",
+ "### Multiple Initializations\n",
+ "Scikit-learn's `n_init` parameter (default 10) runs K-Means 10 times with different random seeds and keeps the result with the lowest inertia. This greatly reduces the risk of a poor local minimum.\n",
+ "\n",
+ "### When K-Means Fails\n",
+ "K-Means struggles with:\n",
+ "- **Non-convex shapes** (e.g., crescent moons, concentric rings) β consider DBSCAN or spectral clustering instead.\n",
+ "- **Clusters with very different densities** β HDBSCAN handles this better.\n",
+ "- **High-dimensional data** β distances become less meaningful (curse of dimensionality); apply dimensionality reduction first.\n",
+ "\n",
+ "We'll explore some of these alternatives in later notebooks."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "## 9. Summary\n",
+ "\n",
+ "### Key Takeaways\n",
+ "\n",
+ "1. **Unsupervised learning** discovers structure without labels. Clustering is its flagship task.\n",
+ "2. **K-Means** iterates between *assigning* points to the nearest centroid and *updating* centroids as cluster means until convergence.\n",
+ "3. **Inertia** measures within-cluster compactness; **silhouette score** balances compactness and separation.\n",
+ "4. The **elbow method** plots inertia vs K to find a natural number of clusters.\n",
+ "5. **Scikit-learn's KMeans** adds smart initialization (k-means++) and multiple restarts for robust results.\n",
+ "6. Always **scale features** before clustering, and remember that K-Means assumes spherical, similarly-sized clusters.\n",
+ "\n",
+ "### What's Next\n",
+ "In the following notebooks we will:\n",
+ "- Explore **hierarchical clustering** and dendrograms\n",
+ "- Learn **DBSCAN** for density-based clustering\n",
+ "- Apply **dimensionality reduction** (PCA, t-SNE) for visualization\n",
+ "\n",
+ "---\n",
+ "*End of Notebook 01 β Clustering Basics*"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "name": "python",
+ "version": "3.9.0"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
\ No newline at end of file
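Bridging the notebook's silhouette formula and scikit-learn: a hedged sketch that computes s(i) for a single point by hand and compares it with `silhouette_samples` (the toy dataset and seed are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

X, labels = make_blobs(n_samples=60, centers=2, random_state=0)

i = 0
# a(i): mean distance to the *other* points in the same cluster
same = X[labels == labels[i]]
dists_same = np.linalg.norm(same - X[i], axis=1)
a_i = dists_same.sum() / (len(same) - 1)  # the zero self-distance is excluded from the mean

# b(i): mean distance to the nearest other cluster (there is only one other here)
other = X[labels != labels[i]]
b_i = np.mean(np.linalg.norm(other - X[i], axis=1))

s_i = (b_i - a_i) / max(a_i, b_i)
print(np.isclose(s_i, silhouette_samples(X, labels)[i]))  # both follow s = (b - a) / max(a, b)
```

With more than two clusters, b(i) would be the minimum of the per-cluster mean distances, exactly as in `silhouette_score_scratch` in the solutions file.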
diff --git a/chapters/chapter-08-unsupervised-learning/notebooks/02_intermediate.ipynb b/chapters/chapter-08-unsupervised-learning/notebooks/02_intermediate.ipynb
new file mode 100644
index 0000000..584626b
--- /dev/null
+++ b/chapters/chapter-08-unsupervised-learning/notebooks/02_intermediate.ipynb
@@ -0,0 +1,721 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Chapter 8: Unsupervised Learning\n",
+ "## Notebook 02 - Intermediate: Advanced Clustering\n",
+ "\n",
+ "Beyond K-Means: hierarchical clustering, density-based methods, and Gaussian mixtures for real-world data shapes.\n",
+ "\n",
+ "**What you'll learn:**\n",
+ "- Hierarchical (agglomerative) clustering and dendrograms\n",
+ "- DBSCAN for density-based clustering\n",
+ "- Gaussian Mixture Models (GMMs)\n",
+ "- Comparing clustering algorithms on different data shapes\n",
+ "\n",
+ "**Time estimate:** 2.5 hours\n",
+ "\n",
+ "---\n",
+ "*Generated by Berta AI | Created by Luigi Pascal Rondanini*"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import matplotlib.pyplot as plt\n",
+ "import matplotlib.cm as cm\n",
+ "from sklearn.datasets import make_blobs, make_moons, make_circles\n",
+ "from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN\n",
+ "from sklearn.mixture import GaussianMixture\n",
+ "from sklearn.preprocessing import StandardScaler\n",
+ "from sklearn.neighbors import NearestNeighbors\n",
+ "from scipy.cluster.hierarchy import dendrogram, linkage, fcluster\n",
+ "from scipy.stats import multivariate_normal\n",
+ "\n",
+ "np.random.seed(42)\n",
+ "\n",
+ "plt.rcParams['figure.figsize'] = (10, 6)\n",
+ "plt.rcParams['figure.dpi'] = 100\n",
+ "plt.rcParams['font.size'] = 11\n",
+ "\n",
+ "print(\"All imports loaded successfully.\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "## 1. Hierarchical (Agglomerative) Clustering\n",
+ "\n",
+ "Hierarchical clustering builds a tree of clusters instead of requiring a fixed number of clusters up front.\n",
+ "\n",
+ "### How agglomerative clustering works\n",
+ "\n",
+ "The **agglomerative (bottom-up)** approach proceeds as follows:\n",
+ "\n",
+ "1. **Start** β treat every data point as its own single-point cluster.\n",
+ "2. **Merge** β find the two closest clusters and merge them into one.\n",
+ "3. **Repeat** β keep merging until only a single cluster remains (or until a stopping criterion is met).\n",
+ "\n",
+ "The result is a hierarchy that can be visualised as a **dendrogram** β a tree diagram showing the order and distance of each merge.\n",
+ "\n",
+ "### Linkage criteria\n",
+ "\n",
+ "\"Distance between two clusters\" can be measured in several ways:\n",
+ "\n",
+ "| Linkage | Definition | Tendency |\n",
+ "|---------|-----------|----------|\n",
+ "| **Single** | Minimum distance between any pair of points across two clusters | Produces elongated, chain-like clusters |\n",
+ "| **Complete** | Maximum distance between any pair of points across two clusters | Produces compact, roughly equal-sized clusters |\n",
+ "| **Average** | Mean distance between all pairs of points across two clusters | Compromise between single and complete |\n",
+ "| **Ward** | Minimises the total within-cluster variance at each merge | Tends to produce equally sized, spherical clusters |\n",
+ "\n",
+ "Ward linkage is the most commonly used default and works well when clusters are roughly spherical."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Generate synthetic data with 4 well-separated clusters\n",
+ "X_hier, y_hier = make_blobs(\n",
+ " n_samples=200, centers=4, cluster_std=0.8, random_state=42\n",
+ ")\n",
+ "\n",
+ "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
+ "\n",
+ "# Left panel β raw data\n",
+ "axes[0].scatter(X_hier[:, 0], X_hier[:, 1], s=30, alpha=0.7, edgecolors='k', linewidths=0.3)\n",
+ "axes[0].set_title('Raw Data (200 points, 4 clusters)')\n",
+ "axes[0].set_xlabel('Feature 1')\n",
+ "axes[0].set_ylabel('Feature 2')\n",
+ "\n",
+ "# Right panel β dendrogram using Ward linkage\n",
+ "Z_ward = linkage(X_hier, method='ward')\n",
+ "dendrogram(\n",
+ " Z_ward,\n",
+ " truncate_mode='lastp',\n",
+ " p=30,\n",
+ " leaf_rotation=90,\n",
+ " leaf_font_size=8,\n",
+ " ax=axes[1],\n",
+ " color_threshold=12\n",
+ ")\n",
+ "axes[1].set_title('Dendrogram (Ward Linkage, truncated to 30 leaves)')\n",
+ "axes[1].set_xlabel('Cluster (size)')\n",
+ "axes[1].set_ylabel('Merge Distance')\n",
+ "axes[1].axhline(y=12, color='r', linestyle='--', label='Cut at distance = 12')\n",
+ "axes[1].legend()\n",
+ "\n",
+ "plt.tight_layout()\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The dendrogram shows the full merge history. By drawing a horizontal cut line we decide\n",
+ "how many clusters to keep β each vertical line that crosses the cut corresponds to one cluster.\n",
+ "\n",
+ "### Comparing linkage methods\n",
+ "\n",
+ "Let's visualise how the four linkage types partition the same dataset."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "linkage_methods = ['single', 'complete', 'average', 'ward']\n",
+ "fig, axes = plt.subplots(1, 4, figsize=(20, 4.5))\n",
+ "\n",
+ "for ax, method in zip(axes, linkage_methods):\n",
+ " Z = linkage(X_hier, method=method)\n",
+ " labels = fcluster(Z, t=4, criterion='maxclust')\n",
+ " scatter = ax.scatter(\n",
+ " X_hier[:, 0], X_hier[:, 1],\n",
+ " c=labels, cmap='viridis', s=30, alpha=0.7, edgecolors='k', linewidths=0.3\n",
+ " )\n",
+ " ax.set_title(f'{method.capitalize()} linkage')\n",
+ " ax.set_xlabel('Feature 1')\n",
+ " ax.set_ylabel('Feature 2')\n",
+ "\n",
+ "plt.suptitle('Agglomerative Clustering β 4 Linkage Methods (k=4)', fontsize=14, y=1.02)\n",
+ "plt.tight_layout()\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Scikit-learn's AgglomerativeClustering with Ward linkage\n",
+ "agg = AgglomerativeClustering(n_clusters=4, linkage='ward')\n",
+ "agg_labels = agg.fit_predict(X_hier)\n",
+ "\n",
+ "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
+ "\n",
+ "axes[0].scatter(\n",
+ " X_hier[:, 0], X_hier[:, 1],\n",
+ " c=y_hier, cmap='tab10', s=40, alpha=0.7, edgecolors='k', linewidths=0.3\n",
+ ")\n",
+ "axes[0].set_title('Ground-Truth Labels')\n",
+ "axes[0].set_xlabel('Feature 1')\n",
+ "axes[0].set_ylabel('Feature 2')\n",
+ "\n",
+ "axes[1].scatter(\n",
+ " X_hier[:, 0], X_hier[:, 1],\n",
+ " c=agg_labels, cmap='tab10', s=40, alpha=0.7, edgecolors='k', linewidths=0.3\n",
+ ")\n",
+ "axes[1].set_title('AgglomerativeClustering (Ward, k=4)')\n",
+ "axes[1].set_xlabel('Feature 1')\n",
+ "axes[1].set_ylabel('Feature 2')\n",
+ "\n",
+ "plt.tight_layout()\n",
+ "plt.show()\n",
+ "\n",
+ "print(f\"Cluster sizes: {np.bincount(agg_labels)}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "## 2. DBSCAN β Density-Based Spatial Clustering\n",
+ "\n",
+ "**DBSCAN** (Density-Based Spatial Clustering of Applications with Noise) takes a fundamentally different\n",
+ "approach to clustering:\n",
+ "\n",
+ "- It does **not** require the number of clusters in advance.\n",
+ "- It defines clusters as **dense regions** separated by sparse regions.\n",
+ "- Points that don't belong to any dense region are labelled as **noise** (label = -1).\n",
+ "\n",
+ "### Key parameters\n",
+ "\n",
+ "| Parameter | Meaning |\n",
+ "|-----------|--------|\n",
+ "| `eps` (Ξ΅) | Maximum distance between two points for them to be considered neighbours |\n",
+ "| `min_samples` | Minimum number of points within Ξ΅-distance to form a dense region |\n",
+ "\n",
+ "### Point types\n",
+ "\n",
+ "- **Core point** β has at least `min_samples` neighbours within Ξ΅.\n",
+ "- **Border point** β within Ξ΅ of a core point but doesn't have enough neighbours itself.\n",
+ "- **Noise point** β neither core nor border; isolated outliers.\n",
+ "\n",
+ "### Key advantage\n",
+ "\n",
+ "DBSCAN can discover clusters of **arbitrary shape** and naturally identifies outliers β something\n",
+ "centroid-based methods like K-Means cannot do."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Generate two non-convex datasets\n",
+ "X_moons, y_moons = make_moons(n_samples=500, noise=0.08, random_state=42)\n",
+ "X_circles, y_circles = make_circles(n_samples=500, noise=0.05, factor=0.5, random_state=42)\n",
+ "\n",
+ "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
+ "\n",
+ "axes[0].scatter(X_moons[:, 0], X_moons[:, 1], c=y_moons, cmap='coolwarm', s=20, alpha=0.7)\n",
+ "axes[0].set_title('Two Moons Dataset')\n",
+ "axes[0].set_xlabel('Feature 1')\n",
+ "axes[0].set_ylabel('Feature 2')\n",
+ "\n",
+ "axes[1].scatter(X_circles[:, 0], X_circles[:, 1], c=y_circles, cmap='coolwarm', s=20, alpha=0.7)\n",
+ "axes[1].set_title('Two Circles Dataset')\n",
+ "axes[1].set_xlabel('Feature 1')\n",
+ "axes[1].set_ylabel('Feature 2')\n",
+ "\n",
+ "plt.suptitle('Non-Convex Datasets β Ground Truth', fontsize=14, y=1.02)\n",
+ "plt.tight_layout()\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Apply DBSCAN to both datasets\n",
+ "db_moons = DBSCAN(eps=0.2, min_samples=5).fit(X_moons)\n",
+ "db_circles = DBSCAN(eps=0.15, min_samples=5).fit(X_circles)\n",
+ "\n",
+ "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
+ "\n",
+ "colors_moons = db_moons.labels_\n",
+ "colors_circles = db_circles.labels_\n",
+ "\n",
+ "axes[0].scatter(\n",
+ " X_moons[:, 0], X_moons[:, 1],\n",
+ " c=colors_moons, cmap='viridis', s=20, alpha=0.7\n",
+ ")\n",
+ "n_noise_moons = (db_moons.labels_ == -1).sum()\n",
+ "axes[0].set_title(f'DBSCAN on Moons β {len(set(colors_moons)) - (1 if -1 in colors_moons else 0)} clusters, {n_noise_moons} noise')\n",
+ "axes[0].set_xlabel('Feature 1')\n",
+ "axes[0].set_ylabel('Feature 2')\n",
+ "\n",
+ "axes[1].scatter(\n",
+ " X_circles[:, 0], X_circles[:, 1],\n",
+ " c=colors_circles, cmap='viridis', s=20, alpha=0.7\n",
+ ")\n",
+ "n_noise_circles = (db_circles.labels_ == -1).sum()\n",
+ "axes[1].set_title(f'DBSCAN on Circles β {len(set(colors_circles)) - (1 if -1 in colors_circles else 0)} clusters, {n_noise_circles} noise')\n",
+ "axes[1].set_xlabel('Feature 1')\n",
+ "axes[1].set_ylabel('Feature 2')\n",
+ "\n",
+ "plt.suptitle('DBSCAN Handles Non-Convex Shapes', fontsize=14, y=1.02)\n",
+ "plt.tight_layout()\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# K-Means vs DBSCAN on the moons dataset\n",
+ "km_moons = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X_moons)\n",
+ "\n",
+ "fig, axes = plt.subplots(1, 3, figsize=(18, 5))\n",
+ "\n",
+ "axes[0].scatter(X_moons[:, 0], X_moons[:, 1], c=y_moons, cmap='coolwarm', s=20, alpha=0.7)\n",
+ "axes[0].set_title('Ground Truth')\n",
+ "axes[0].set_xlabel('Feature 1')\n",
+ "axes[0].set_ylabel('Feature 2')\n",
+ "\n",
+ "axes[1].scatter(X_moons[:, 0], X_moons[:, 1], c=km_moons.labels_, cmap='coolwarm', s=20, alpha=0.7)\n",
+ "axes[1].scatter(km_moons.cluster_centers_[:, 0], km_moons.cluster_centers_[:, 1],\n",
+ " marker='X', s=200, c='black', edgecolors='white', linewidths=1.5)\n",
+ "axes[1].set_title('K-Means (k=2) β Fails on non-convex shapes')\n",
+ "axes[1].set_xlabel('Feature 1')\n",
+ "axes[1].set_ylabel('Feature 2')\n",
+ "\n",
+ "axes[2].scatter(X_moons[:, 0], X_moons[:, 1], c=db_moons.labels_, cmap='coolwarm', s=20, alpha=0.7)\n",
+ "axes[2].set_title('DBSCAN (eps=0.2) β Correctly separates crescents')\n",
+ "axes[2].set_xlabel('Feature 1')\n",
+ "axes[2].set_ylabel('Feature 2')\n",
+ "\n",
+ "plt.suptitle('K-Means vs DBSCAN on the Moons Dataset', fontsize=14, y=1.02)\n",
+ "plt.tight_layout()\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "## 3. Choosing DBSCAN Parameters\n",
+ "\n",
+ "Picking `eps` and `min_samples` can be tricky. A practical heuristic:\n",
+ "\n",
+ "1. Set `min_samples` β 2 Γ number of features (a reasonable default).\n",
+ "2. For each point compute the distance to its **k-th nearest neighbour** (k = `min_samples`).\n",
+ "3. Sort these distances and plot them β the **k-distance graph**.\n",
+ "4. Look for the \"elbow\" β the point where the curve bends sharply upward. The distance at that elbow is a good candidate for `eps`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# k-distance graph for the moons dataset\n",
+ "k = 5 # same as min_samples\n",
+ "nn = NearestNeighbors(n_neighbors=k)\n",
+ "nn.fit(X_moons)\n",
+ "distances, _ = nn.kneighbors(X_moons)\n",
+ "\n",
+ "k_distances = np.sort(distances[:, k - 1])[::-1]\n",
+ "\n",
+ "plt.figure(figsize=(10, 5))\n",
+ "plt.plot(k_distances, linewidth=1.5)\n",
+ "plt.axhline(y=0.2, color='r', linestyle='--', label='eps = 0.2 (our choice)')\n",
+ "plt.title(f'k-Distance Graph (k={k}) β Elbow Indicates Good eps')\n",
+ "plt.xlabel('Points (sorted by descending k-distance)')\n",
+ "plt.ylabel(f'Distance to {k}-th Nearest Neighbour')\n",
+ "plt.legend()\n",
+ "plt.grid(True, alpha=0.3)\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Effect of different eps values on DBSCAN results\n",
+ "eps_values = [0.05, 0.1, 0.2, 0.3, 0.5]\n",
+ "fig, axes = plt.subplots(1, len(eps_values), figsize=(22, 4))\n",
+ "\n",
+ "for ax, eps in zip(axes, eps_values):\n",
+ " db = DBSCAN(eps=eps, min_samples=5).fit(X_moons)\n",
+ " labels = db.labels_\n",
+ " n_clusters = len(set(labels)) - (1 if -1 in labels else 0)\n",
+ " n_noise = (labels == -1).sum()\n",
+ "\n",
+ " unique_labels = set(labels)\n",
+ " colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]\n",
+ "\n",
+ " for k_label, col in zip(sorted(unique_labels), colors):\n",
+ " if k_label == -1:\n",
+ " col = [0, 0, 0, 1] # black for noise\n",
+ " mask = labels == k_label\n",
+ " ax.scatter(X_moons[mask, 0], X_moons[mask, 1], c=[col], s=15, alpha=0.7)\n",
+ "\n",
+ " ax.set_title(f'eps={eps}\\n{n_clusters} clusters, {n_noise} noise')\n",
+ " ax.set_xlabel('Feature 1')\n",
+ "\n",
+ "axes[0].set_ylabel('Feature 2')\n",
+ "plt.suptitle('Effect of eps on DBSCAN (min_samples=5)', fontsize=14, y=1.05)\n",
+ "plt.tight_layout()\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Observations:**\n",
+ "- **eps too small** (0.05) β most points classified as noise; many tiny clusters.\n",
+ "- **eps just right** (0.2) β two clean crescent clusters with very little noise.\n",
+ "- **eps too large** (0.5) β everything merges into a single cluster.\n",
+ "\n",
+ "The k-distance graph helps you find that sweet spot without trial and error."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "## 4. Gaussian Mixture Models (GMMs)\n",
+ "\n",
+ "A **Gaussian Mixture Model** assumes that the data is generated from a mixture of a finite number\n",
+ "of Gaussian (normal) distributions with unknown parameters.\n",
+ "\n",
+ "### GMM vs K-Means\n",
+ "\n",
+ "| Aspect | K-Means | GMM |\n",
+ "|--------|---------|-----|\n",
+ "| Cluster assignment | **Hard** β each point belongs to exactly one cluster | **Soft** β each point has a probability for every cluster |\n",
+ "| Cluster shape | Spherical (Voronoi cells) | Elliptical (full covariance matrices) |\n",
+ "| Outlier handling | None β every point is assigned | Naturally down-weights low-probability points |\n",
+ "| Output | Cluster label | Probability vector over all clusters |\n",
+ "\n",
+ "GMMs are fit using the **Expectation-Maximisation (EM)** algorithm:\n",
+ "1. **E-step** β compute the probability that each point belongs to each Gaussian component.\n",
+ "2. **M-step** β update each component's mean, covariance, and weight to maximise log-likelihood.\n",
+ "3. Repeat until convergence."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Create elongated / elliptical clusters that challenge K-Means\n",
+ "np.random.seed(42)\n",
+ "\n",
+ "n_per_cluster = 200\n",
+ "cov1 = [[2.0, 1.5], [1.5, 1.5]]\n",
+ "cov2 = [[1.5, -1.2], [-1.2, 1.5]]\n",
+ "cov3 = [[0.5, 0.0], [0.0, 2.5]]\n",
+ "\n",
+ "cluster1 = np.random.multivariate_normal([0, 0], cov1, n_per_cluster)\n",
+ "cluster2 = np.random.multivariate_normal([5, 5], cov2, n_per_cluster)\n",
+ "cluster3 = np.random.multivariate_normal([8, 0], cov3, n_per_cluster)\n",
+ "\n",
+ "X_gmm = np.vstack([cluster1, cluster2, cluster3])\n",
+ "y_gmm_true = np.array([0]*n_per_cluster + [1]*n_per_cluster + [2]*n_per_cluster)\n",
+ "\n",
+ "fig, axes = plt.subplots(1, 3, figsize=(18, 5))\n",
+ "\n",
+ "# Ground truth\n",
+ "axes[0].scatter(X_gmm[:, 0], X_gmm[:, 1], c=y_gmm_true, cmap='tab10', s=15, alpha=0.6)\n",
+ "axes[0].set_title('Ground Truth (Elliptical Clusters)')\n",
+ "axes[0].set_xlabel('Feature 1')\n",
+ "axes[0].set_ylabel('Feature 2')\n",
+ "\n",
+ "# K-Means\n",
+ "km_gmm = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X_gmm)\n",
+ "axes[1].scatter(X_gmm[:, 0], X_gmm[:, 1], c=km_gmm.labels_, cmap='tab10', s=15, alpha=0.6)\n",
+ "axes[1].scatter(km_gmm.cluster_centers_[:, 0], km_gmm.cluster_centers_[:, 1],\n",
+ " marker='X', s=200, c='black', edgecolors='white', linewidths=1.5)\n",
+ "axes[1].set_title('K-Means (k=3) β Spherical assumption')\n",
+ "axes[1].set_xlabel('Feature 1')\n",
+ "axes[1].set_ylabel('Feature 2')\n",
+ "\n",
+ "# GMM\n",
+ "gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=42)\n",
+ "gmm.fit(X_gmm)\n",
+ "gmm_labels = gmm.predict(X_gmm)\n",
+ "axes[2].scatter(X_gmm[:, 0], X_gmm[:, 1], c=gmm_labels, cmap='tab10', s=15, alpha=0.6)\n",
+ "axes[2].set_title('GMM (3 components) β Elliptical fit')\n",
+ "axes[2].set_xlabel('Feature 1')\n",
+ "axes[2].set_ylabel('Feature 2')\n",
+ "\n",
+ "plt.suptitle('K-Means vs GMM on Elliptical Clusters', fontsize=14, y=1.02)\n",
+ "plt.tight_layout()\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Visualise GMM probability contours\n",
+ "x_min, x_max = X_gmm[:, 0].min() - 2, X_gmm[:, 0].max() + 2\n",
+ "y_min, y_max = X_gmm[:, 1].min() - 2, X_gmm[:, 1].max() + 2\n",
+ "xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300), np.linspace(y_min, y_max, 300))\n",
+ "grid_points = np.column_stack([xx.ravel(), yy.ravel()])\n",
+ "\n",
+ "log_prob = gmm.score_samples(grid_points)\n",
+ "log_prob = log_prob.reshape(xx.shape)\n",
+ "\n",
+ "fig, ax = plt.subplots(figsize=(10, 7))\n",
+ "ax.contourf(xx, yy, np.exp(log_prob), levels=30, cmap='YlOrRd', alpha=0.6)\n",
+ "ax.contour(xx, yy, np.exp(log_prob), levels=10, colors='darkred', linewidths=0.5, alpha=0.5)\n",
+ "ax.scatter(X_gmm[:, 0], X_gmm[:, 1], c=gmm_labels, cmap='tab10', s=10, alpha=0.7,\n",
+ " edgecolors='k', linewidths=0.2)\n",
+ "\n",
+ "for i in range(gmm.n_components):\n",
+ " ax.scatter(gmm.means_[i, 0], gmm.means_[i, 1],\n",
+ " marker='+', s=300, c='black', linewidths=3)\n",
+ "\n",
+ "ax.set_title('GMM Probability Density Contours')\n",
+ "ax.set_xlabel('Feature 1')\n",
+ "ax.set_ylabel('Feature 2')\n",
+ "plt.tight_layout()\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Soft cluster probabilities β the key advantage of GMM\n",
+ "probs = gmm.predict_proba(X_gmm)\n",
+ "\n",
+ "print(\"Cluster membership probabilities for the first 10 points:\")\n",
+ "print(f\"{'Point':>5} {'P(C0)':>8} {'P(C1)':>8} {'P(C2)':>8} {'Assigned':>8}\")\n",
+ "print(\"-\" * 48)\n",
+ "for i in range(10):\n",
+ " print(f\"{i:5d} {probs[i, 0]:8.4f} {probs[i, 1]:8.4f} {probs[i, 2]:8.4f} {gmm_labels[i]:8d}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Model selection with BIC and AIC\n",
+ "\n",
+ "How many Gaussian components should we use? We can use information criteria:\n",
+ "\n",
+ "- **BIC** (Bayesian Information Criterion) β penalises model complexity more heavily.\n",
+ "- **AIC** (Akaike Information Criterion) β lighter penalty.\n",
+ "\n",
+ "**Lower is better** for both. We fit GMMs with different numbers of components and pick the one with the lowest BIC (or AIC)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "n_components_range = range(1, 10)\n",
+ "bic_scores = []\n",
+ "aic_scores = []\n",
+ "\n",
+ "for n in n_components_range:\n",
+ " gmm_test = GaussianMixture(n_components=n, covariance_type='full', random_state=42)\n",
+ " gmm_test.fit(X_gmm)\n",
+ " bic_scores.append(gmm_test.bic(X_gmm))\n",
+ " aic_scores.append(gmm_test.aic(X_gmm))\n",
+ "\n",
+ "fig, ax = plt.subplots(figsize=(10, 5))\n",
+ "ax.plot(list(n_components_range), bic_scores, 'bo-', label='BIC', linewidth=2)\n",
+ "ax.plot(list(n_components_range), aic_scores, 'rs--', label='AIC', linewidth=2)\n",
+ "ax.axvline(x=3, color='green', linestyle=':', alpha=0.7, label='True number of components (3)')\n",
+ "ax.set_xlabel('Number of Components')\n",
+ "ax.set_ylabel('Score (lower is better)')\n",
+ "ax.set_title('GMM Model Selection: BIC and AIC')\n",
+ "ax.legend()\n",
+ "ax.grid(True, alpha=0.3)\n",
+ "plt.tight_layout()\n",
+ "plt.show()\n",
+ "\n",
+ "print(f\"Best BIC at n_components = {np.argmin(bic_scores) + 1}\")\n",
+ "print(f\"Best AIC at n_components = {np.argmin(aic_scores) + 1}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "## 5. Algorithm Comparison on Multiple Datasets\n",
+ "\n",
+ "Let's put all four algorithms head-to-head on three different data geometries:\n",
+ "\n",
+ "1. **Blobs** β well-separated spherical clusters\n",
+ "2. **Moons** β two interleaving crescents\n",
+ "3. **Varied-variance blobs** β spherical clusters with very different densities"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "np.random.seed(42)\n",
+ "\n",
+ "n_samples = 500\n",
+ "\n",
+ "# Dataset 1: standard blobs\n",
+ "X_blobs, y_blobs = make_blobs(n_samples=n_samples, centers=3, cluster_std=1.0, random_state=42)\n",
+ "\n",
+ "# Dataset 2: moons\n",
+ "X_moons2, y_moons2 = make_moons(n_samples=n_samples, noise=0.07, random_state=42)\n",
+ "\n",
+ "# Dataset 3: varied-variance blobs\n",
+ "X_varied, y_varied = make_blobs(\n",
+ " n_samples=n_samples, centers=3, cluster_std=[0.5, 2.5, 1.0], random_state=42\n",
+ ")\n",
+ "\n",
+ "datasets = [\n",
+ " ('Blobs', X_blobs, {'n_clusters': 3, 'eps': 1.0}),\n",
+ " ('Moons', X_moons2, {'n_clusters': 2, 'eps': 0.2}),\n",
+ " ('Varied', X_varied, {'n_clusters': 3, 'eps': 1.5}),\n",
+ "]\n",
+ "\n",
+ "fig, axes = plt.subplots(3, 4, figsize=(22, 15))\n",
+ "\n",
+ "for row, (name, X, params) in enumerate(datasets):\n",
+ " X_scaled = StandardScaler().fit_transform(X)\n",
+ " n_c = params['n_clusters']\n",
+ " eps = params['eps']\n",
+ "\n",
+ " # K-Means\n",
+ " km = KMeans(n_clusters=n_c, random_state=42, n_init=10).fit(X_scaled)\n",
+ " # Agglomerative\n",
+ " agg = AgglomerativeClustering(n_clusters=n_c, linkage='ward').fit(X_scaled)\n",
+ " # DBSCAN\n",
+ " db = DBSCAN(eps=eps, min_samples=5).fit(X_scaled)\n",
+ " # GMM\n",
+ " gm = GaussianMixture(n_components=n_c, random_state=42).fit(X_scaled)\n",
+ "\n",
+ " results = [\n",
+ " ('K-Means', km.labels_),\n",
+ " ('Agglomerative', agg.labels_),\n",
+ " ('DBSCAN', db.labels_),\n",
+ " ('GMM', gm.predict(X_scaled)),\n",
+ " ]\n",
+ "\n",
+ " for col, (algo_name, labels) in enumerate(results):\n",
+ " ax = axes[row, col]\n",
+ " unique_labels = set(labels)\n",
+ " n_clust = len(unique_labels) - (1 if -1 in unique_labels else 0)\n",
+ "\n",
+ " noise_mask = labels == -1\n",
+ " ax.scatter(X_scaled[~noise_mask, 0], X_scaled[~noise_mask, 1],\n",
+ " c=labels[~noise_mask], cmap='viridis', s=12, alpha=0.7)\n",
+ " if noise_mask.any():\n",
+ " ax.scatter(X_scaled[noise_mask, 0], X_scaled[noise_mask, 1],\n",
+ " c='red', marker='x', s=15, alpha=0.5, label='noise')\n",
+ " ax.legend(fontsize=8)\n",
+ "\n",
+ " if row == 0:\n",
+ " ax.set_title(algo_name, fontsize=13, fontweight='bold')\n",
+ " ax.set_ylabel(f'{name}' if col == 0 else '', fontsize=12)\n",
+ " ax.text(0.02, 0.98, f'{n_clust} cluster(s)',\n",
+ " transform=ax.transAxes, fontsize=9, va='top',\n",
+ " bbox=dict(boxstyle='round,pad=0.3', facecolor='white', alpha=0.8))\n",
+ "\n",
+ "plt.suptitle('Algorithm Comparison Across Data Geometries', fontsize=16, y=1.01)\n",
+ "plt.tight_layout()\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "## 6. Summary β When to Use Each Algorithm\n",
+ "\n",
+ "### Quick reference\n",
+ "\n",
+ "| Algorithm | Best for | Weaknesses | Must specify k? |\n",
+ "|-----------|---------|------------|------------------|\n",
+ "| **K-Means** | Large datasets with spherical clusters | Cannot handle non-convex shapes; sensitive to outliers | Yes |\n",
+ "| **Agglomerative Clustering** | Small-to-medium datasets; exploring hierarchy | O(nΒ³) time complexity; hard to scale | Yes (or cut dendrogram) |\n",
+ "| **DBSCAN** | Arbitrary shapes; datasets with noise/outliers | Sensitive to `eps`; struggles with varying densities | No |\n",
+ "| **Gaussian Mixture Model** | Elliptical clusters; need soft assignments | Assumes Gaussian components; sensitive to initialisation | Yes |\n",
+ "\n",
+ "### Rules of thumb\n",
+ "\n",
+ "1. **Start simple:** try K-Means first. If results look poor, consider the data geometry.\n",
+ "2. **Non-convex shapes?** β Use DBSCAN.\n",
+ "3. **Elliptical or overlapping clusters?** β Use GMM.\n",
+ "4. **Need a hierarchy or dendrogram?** β Use Agglomerative Clustering.\n",
+ "5. **Noisy data with outliers?** β DBSCAN naturally handles noise.\n",
+ "6. **Need probability estimates?** β GMM provides soft assignments.\n",
+ "\n",
+ "### What's next\n",
+ "\n",
+ "In the **advanced notebook** (Notebook 03) we will explore:\n",
+ "- Dimensionality reduction (PCA, t-SNE, UMAP)\n",
+ "- Clustering evaluation metrics (Silhouette, Adjusted Rand Index)\n",
+ "- Pipelines combining reduction + clustering on real-world datasets\n",
+ "\n",
+ "---\n",
+ "*Generated by Berta AI | Created by Luigi Pascal Rondanini*"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "name": "python",
+ "version": "3.10.0"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/chapters/chapter-08-unsupervised-learning/notebooks/03_advanced.ipynb b/chapters/chapter-08-unsupervised-learning/notebooks/03_advanced.ipynb
new file mode 100644
index 0000000..d73ba76
--- /dev/null
+++ b/chapters/chapter-08-unsupervised-learning/notebooks/03_advanced.ipynb
@@ -0,0 +1,938 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Chapter 8: Unsupervised Learning\n",
+ "## Notebook 03 - Advanced: Dimensionality Reduction & Capstone\n",
+ "\n",
+ "Reduce high-dimensional data for visualization and modeling, detect anomalies, and build a complete customer segmentation system.\n",
+ "\n",
+ "**What you'll learn:**\n",
+ "- Principal Component Analysis (PCA) from scratch\n",
+ "- t-SNE for 2D visualization\n",
+ "- Anomaly detection with Isolation Forest\n",
+ "- Customer segmentation capstone project\n",
+ "\n",
+ "**Time estimate:** 3 hours\n",
+ "\n",
+ "---\n",
+ "*Generated by Berta AI | Created by Luigi Pascal Rondanini*"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "## 1. Principal Component Analysis (PCA) β Theory\n",
+ "\n",
+ "### The Core Idea\n",
+ "\n",
+ "PCA is a **linear** dimensionality-reduction technique that finds the directions\n",
+ "(called **principal components**) along which the data varies the most.\n",
+ "\n",
+ "Imagine a cloud of 3-D points that is shaped like a flat pancake. Two axes\n",
+ "capture almost all of the spread; the third adds very little information. PCA\n",
+ "discovers those two dominant axes automatically.\n",
+ "\n",
+ "### Algorithm Steps\n",
+ "\n",
+ "1. **Center the data** β subtract the mean of each feature so that the cloud is\n",
+ " centered at the origin.\n",
+ "2. **Compute the covariance matrix** β a $d \\times d$ matrix (where $d$ is the\n",
+ " number of features) that captures pairwise linear relationships.\n",
+ "3. **Eigendecomposition** β find the eigenvectors and eigenvalues of the\n",
+ " covariance matrix. Each eigenvector is a principal component direction;\n",
+ " its eigenvalue tells us how much variance that direction explains.\n",
+ "4. **Sort & select** β rank components by eigenvalue (descending) and keep the\n",
+ " top $k$ to reduce dimensionality from $d$ to $k$.\n",
+ "5. **Project** β multiply the centered data by the selected eigenvectors to\n",
+ " obtain the lower-dimensional representation.\n",
+ "\n",
+ "### Variance Explained Ratio\n",
+ "\n",
+ "$$\\text{variance explained ratio}_i = \\frac{\\lambda_i}{\\sum_{j=1}^{d} \\lambda_j}$$\n",
+ "\n",
+ "where $\\lambda_i$ is the $i$-th eigenvalue. The **cumulative** variance explained\n",
+ "tells us how much total information is retained when we keep the first $k$\n",
+ "components."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "## 2. PCA From Scratch\n",
+ "\n",
+ "We will implement PCA using only NumPy and apply it to the classic **Iris**\n",
+ "dataset (4 features β 2 components)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import matplotlib.pyplot as plt\n",
+ "from sklearn.datasets import load_iris\n",
+ "\n",
+ "np.random.seed(42)\n",
+ "\n",
+ "# Load the Iris dataset (4 features, 150 samples, 3 classes)\n",
+ "iris = load_iris()\n",
+ "X = iris.data # shape (150, 4)\n",
+ "y = iris.target # 0, 1, 2\n",
+ "feature_names = iris.feature_names\n",
+ "target_names = iris.target_names\n",
+ "\n",
+ "print(f\"Dataset shape: {X.shape}\")\n",
+ "print(f\"Features: {feature_names}\")\n",
+ "print(f\"Classes: {list(target_names)}\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def pca_from_scratch(X, n_components=2):\n",
+ " \"\"\"Implement PCA using NumPy.\"\"\"\n",
+ " # Step 1: Center the data\n",
+ " mean = np.mean(X, axis=0)\n",
+ " X_centered = X - mean\n",
+ "\n",
+ " # Step 2: Covariance matrix (features Γ features)\n",
+ " cov_matrix = np.cov(X_centered, rowvar=False)\n",
+ "\n",
+ " # Step 3: Eigendecomposition\n",
+ " eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)\n",
+ "\n",
+ " # Step 4: Sort by eigenvalue descending\n",
+ " sorted_idx = np.argsort(eigenvalues)[::-1]\n",
+ " eigenvalues = eigenvalues[sorted_idx]\n",
+ " eigenvectors = eigenvectors[:, sorted_idx]\n",
+ "\n",
+ " # Variance explained ratio\n",
+ " variance_ratio = eigenvalues / eigenvalues.sum()\n",
+ "\n",
+ " # Step 5: Project onto top-k components\n",
+ " W = eigenvectors[:, :n_components]\n",
+ " X_projected = X_centered @ W\n",
+ "\n",
+ " return X_projected, eigenvalues, variance_ratio, W\n",
+ "\n",
+ "\n",
+ "X_pca_scratch, eigenvalues, var_ratio, components = pca_from_scratch(X, n_components=2)\n",
+ "\n",
+ "print(\"Eigenvalues:\", np.round(eigenvalues, 4))\n",
+ "print(\"Variance explained ratio:\", np.round(var_ratio, 4))\n",
+ "print(f\"Total variance retained (2 components): {var_ratio[:2].sum():.2%}\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# --- Variance Explained Bar + Cumulative Line ---\n",
+ "fig, axes = plt.subplots(1, 2, figsize=(13, 5))\n",
+ "\n",
+ "# Left: bar chart of individual variance ratios\n",
+ "axes[0].bar(range(1, len(var_ratio) + 1), var_ratio, color=\"steelblue\", edgecolor=\"black\")\n",
+ "axes[0].set_xlabel(\"Principal Component\")\n",
+ "axes[0].set_ylabel(\"Variance Explained Ratio\")\n",
+ "axes[0].set_title(\"Variance Explained by Each Component\")\n",
+ "axes[0].set_xticks(range(1, len(var_ratio) + 1))\n",
+ "\n",
+ "# Right: cumulative variance explained\n",
+ "cumulative = np.cumsum(var_ratio)\n",
+ "axes[1].plot(range(1, len(cumulative) + 1), cumulative, \"o-\", color=\"darkorange\", linewidth=2)\n",
+ "axes[1].axhline(y=0.95, color=\"red\", linestyle=\"--\", label=\"95% threshold\")\n",
+ "axes[1].set_xlabel(\"Number of Components\")\n",
+ "axes[1].set_ylabel(\"Cumulative Variance Explained\")\n",
+ "axes[1].set_title(\"Cumulative Variance Explained\")\n",
+ "axes[1].set_xticks(range(1, len(cumulative) + 1))\n",
+ "axes[1].legend()\n",
+ "\n",
+ "plt.tight_layout()\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# --- 2-D scatter plot of the scratch PCA projection ---\n",
+ "colors = [\"#1f77b4\", \"#ff7f0e\", \"#2ca02c\"]\n",
+ "\n",
+ "plt.figure(figsize=(8, 6))\n",
+ "for i, name in enumerate(target_names):\n",
+ " mask = y == i\n",
+ " plt.scatter(X_pca_scratch[mask, 0], X_pca_scratch[mask, 1],\n",
+ " label=name, alpha=0.7, edgecolors=\"k\", linewidth=0.5,\n",
+ " color=colors[i], s=60)\n",
+ "plt.xlabel(f\"PC 1 ({var_ratio[0]:.1%} variance)\")\n",
+ "plt.ylabel(f\"PC 2 ({var_ratio[1]:.1%} variance)\")\n",
+ "plt.title(\"PCA From Scratch β Iris Dataset (2-D Projection)\")\n",
+ "plt.legend()\n",
+ "plt.grid(alpha=0.3)\n",
+ "plt.tight_layout()\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "## 3. PCA with Scikit-learn\n",
+ "\n",
+ "Now let's verify our scratch implementation against the well-optimized\n",
+ "`sklearn.decomposition.PCA`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.decomposition import PCA\n",
+ "\n",
+ "pca_sk = PCA(n_components=4) # keep all 4 to inspect variance\n",
+ "X_pca_sk_full = pca_sk.fit_transform(X)\n",
+ "\n",
+ "print(\"Sklearn variance explained ratio:\", np.round(pca_sk.explained_variance_ratio_, 4))\n",
+ "print(\"Scratch variance explained ratio: \", np.round(var_ratio, 4))\n",
+ "print()\n",
+ "print(\"Cumulative (sklearn):\", np.round(np.cumsum(pca_sk.explained_variance_ratio_), 4))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "X_pca_sk = X_pca_sk_full[:, :2] # first 2 components\n",
+ "\n",
+ "# Sign of eigenvectors can flip β align for visual comparison\n",
+ "for col in range(2):\n",
+ " if np.corrcoef(X_pca_scratch[:, col], X_pca_sk[:, col])[0, 1] < 0:\n",
+ " X_pca_scratch[:, col] *= -1\n",
+ "\n",
+ "fig, axes = plt.subplots(1, 2, figsize=(14, 5), sharex=True, sharey=True)\n",
+ "\n",
+ "for ax, data, title in zip(axes,\n",
+ " [X_pca_scratch, X_pca_sk],\n",
+ " [\"PCA (from scratch)\", \"PCA (scikit-learn)\"]):\n",
+ " for i, name in enumerate(target_names):\n",
+ " mask = y == i\n",
+ " ax.scatter(data[mask, 0], data[mask, 1], label=name,\n",
+ " alpha=0.7, edgecolors=\"k\", linewidth=0.5,\n",
+ " color=colors[i], s=60)\n",
+ " ax.set_xlabel(\"PC 1\")\n",
+ " ax.set_ylabel(\"PC 2\")\n",
+ " ax.set_title(title)\n",
+ " ax.legend()\n",
+ " ax.grid(alpha=0.3)\n",
+ "\n",
+ "plt.suptitle(\"Scratch vs Scikit-learn PCA β Identical Results\", fontsize=14, y=1.02)\n",
+ "plt.tight_layout()\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The two plots are virtually identical (eigenvector signs may differ, which is\n",
+ "cosmetic). This confirms our from-scratch implementation is correct."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "## 4. t-SNE β Non-linear Visualization\n",
+ "\n",
+ "### What is t-SNE?\n",
+ "\n",
+ "**t-distributed Stochastic Neighbor Embedding (t-SNE)** is a non-linear\n",
+ "dimensionality-reduction technique designed specifically for **visualization**.\n",
+ "\n",
+ "Key properties:\n",
+ "- Preserves **local structure**: points that are close in high-dimensional space\n",
+ " stay close in the 2-D embedding.\n",
+ "- Does **not** preserve global distances β clusters may move relative to each\n",
+ " other between runs.\n",
+ "- Computationally expensive β not suitable as a preprocessing step in machine-\n",
+ " learning pipelines.\n",
+ "- The **perplexity** parameter (roughly: how many neighbors each point\n",
+ " considers) strongly influences the result. Typical range: 5β50.\n",
+ "\n",
+ "> **Rule of thumb:** Use PCA when you need a general-purpose reduction (for\n",
+ "> modeling, compression, noise removal). Use t-SNE when your sole goal is to\n",
+ "> *see* cluster structure in 2-D."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.manifold import TSNE\n",
+ "\n",
+ "tsne = TSNE(n_components=2, perplexity=30, random_state=42, n_iter=1000)\n",
+ "X_tsne = tsne.fit_transform(X)\n",
+ "\n",
+ "print(f\"t-SNE output shape: {X_tsne.shape}\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# --- Side-by-side: PCA vs t-SNE ---\n",
+ "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
+ "\n",
+ "for ax, data, title in zip(axes,\n",
+ " [X_pca_sk, X_tsne],\n",
+ " [\"PCA (linear)\", \"t-SNE (non-linear)\"]):\n",
+ " for i, name in enumerate(target_names):\n",
+ " mask = y == i\n",
+ " ax.scatter(data[mask, 0], data[mask, 1], label=name,\n",
+ " alpha=0.7, edgecolors=\"k\", linewidth=0.5,\n",
+ " color=colors[i], s=60)\n",
+ " ax.set_xlabel(\"Dim 1\")\n",
+ " ax.set_ylabel(\"Dim 2\")\n",
+ " ax.set_title(title)\n",
+ " ax.legend()\n",
+ " ax.grid(alpha=0.3)\n",
+ "\n",
+ "plt.suptitle(\"PCA vs t-SNE β Iris Dataset\", fontsize=14, y=1.02)\n",
+ "plt.tight_layout()\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# --- Effect of perplexity on t-SNE ---\n",
+ "perplexities = [5, 15, 30, 50]\n",
+ "fig, axes = plt.subplots(1, 4, figsize=(20, 4))\n",
+ "\n",
+ "for ax, perp in zip(axes, perplexities):\n",
+ " embedding = TSNE(n_components=2, perplexity=perp,\n",
+ " random_state=42, n_iter=1000).fit_transform(X)\n",
+ " for i, name in enumerate(target_names):\n",
+ " mask = y == i\n",
+ " ax.scatter(embedding[mask, 0], embedding[mask, 1],\n",
+ " alpha=0.7, color=colors[i], s=40, edgecolors=\"k\",\n",
+ " linewidth=0.3, label=name)\n",
+ " ax.set_title(f\"Perplexity = {perp}\")\n",
+ " ax.set_xticks([])\n",
+ " ax.set_yticks([])\n",
+ "\n",
+ "axes[0].legend(fontsize=8)\n",
+ "plt.suptitle(\"t-SNE: Impact of Perplexity\", fontsize=14, y=1.04)\n",
+ "plt.tight_layout()\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Observations on perplexity:**\n",
+ "- Low perplexity (5): focuses on very local neighbors → clusters may fragment.\n",
+ "- High perplexity (50): considers more neighbors → clusters become rounder and\n",
+ " more global structure is visible, but fine local detail may blur.\n",
+ "- There is no single \"correct\" perplexity; try several and look for consistent\n",
+ " patterns."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "## 5. Anomaly Detection\n",
+ "\n",
+ "### Why Unsupervised Anomaly Detection?\n",
+ "\n",
+ "In many real-world scenarios, labeled anomalies are scarce or non-existent:\n",
+ "\n",
+ "| Domain | Normal | Anomaly |\n",
+ "|--------|--------|--------|\n",
+ "| Banking | Legitimate transactions | Fraud |\n",
+ "| Manufacturing | Good products | Defects |\n",
+ "| Cybersecurity | Regular traffic | Intrusions |\n",
+ "\n",
+ "Unsupervised methods learn the distribution of *normal* data and flag anything\n",
+ "that doesn't fit.\n",
+ "\n",
+ "### Approach 1 – Z-Score\n",
+ "\n",
+ "Flag a point as anomalous if any feature has a Z-score $|z| > \\tau$ (e.g.,\n",
+ "$\\tau = 3$). Simple, but it assumes roughly Gaussian features and checks each\n",
+ "feature independently, so it misses points that are unusual only in\n",
+ "combination (multivariate anomalies).\n",
+ "\n",
+ "### Approach 2 – Isolation Forest\n",
+ "\n",
+ "The **Isolation Forest** algorithm isolates observations by randomly selecting\n",
+ "a feature and a split value. Anomalies are easier to isolate (fewer splits\n",
+ "needed), so they have shorter average path lengths in the trees.\n",
+ "\n",
+ "Advantages:\n",
+ "- Works well in high dimensions\n",
+ "- No distribution assumptions\n",
+ "- Linear time complexity"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.ensemble import IsolationForest\n",
+ "\n",
+ "np.random.seed(42)\n",
+ "\n",
+ "# Generate normal data: 2 clusters\n",
+ "normal_a = np.random.randn(150, 2) * 0.8 + np.array([2, 2])\n",
+ "normal_b = np.random.randn(150, 2) * 0.8 + np.array([-2, -2])\n",
+ "normal_data = np.vstack([normal_a, normal_b])\n",
+ "\n",
+ "# Inject 20 anomalies scattered far from the clusters\n",
+ "anomalies = np.random.uniform(low=-6, high=6, size=(20, 2))\n",
+ "\n",
+ "X_anom = np.vstack([normal_data, anomalies])\n",
+ "labels_true = np.array([0] * len(normal_data) + [1] * len(anomalies)) # 0=normal, 1=anomaly\n",
+ "\n",
+ "print(f\"Total points: {len(X_anom)} (normal: {len(normal_data)}, anomalies: {len(anomalies)})\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# --- Z-Score method ---\n",
+ "from scipy import stats\n",
+ "\n",
+ "z_scores = np.abs(stats.zscore(X_anom))\n",
+ "z_threshold = 3.0\n",
+ "z_anomaly_mask = (z_scores > z_threshold).any(axis=1)\n",
+ "\n",
+ "print(f\"Z-Score method detected {z_anomaly_mask.sum()} anomalies (threshold={z_threshold})\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# --- Isolation Forest ---\n",
+ "iso_forest = IsolationForest(n_estimators=200, contamination=0.06,\n",
+ " random_state=42)\n",
+ "iso_preds = iso_forest.fit_predict(X_anom) # 1 = normal, -1 = anomaly\n",
+ "iso_anomaly_mask = iso_preds == -1\n",
+ "\n",
+ "print(f\"Isolation Forest detected {iso_anomaly_mask.sum()} anomalies\")"
+ ]
+ },
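+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To see *why* the forest flags these points, we can inspect the raw scores.\n",
+ "This is a quick sketch using `score_samples`, which returns lower values for\n",
+ "points that are easier to isolate (i.e., more anomalous):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Lower score = shorter average path length = more anomalous\n",
+ "scores = iso_forest.score_samples(X_anom)\n",
+ "\n",
+ "plt.figure(figsize=(8, 4))\n",
+ "plt.hist(scores[labels_true == 0], bins=30, alpha=0.6, color=\"steelblue\", label=\"Normal\")\n",
+ "plt.hist(scores[labels_true == 1], bins=30, alpha=0.6, color=\"red\", label=\"Anomaly\")\n",
+ "plt.xlabel(\"Isolation Forest score\")\n",
+ "plt.ylabel(\"Count\")\n",
+ "plt.title(\"Score Distributions: Normal vs Injected Anomalies\")\n",
+ "plt.legend()\n",
+ "plt.tight_layout()\n",
+ "plt.show()"
+ ]
+ },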
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "fig, axes = plt.subplots(1, 3, figsize=(18, 5))\n",
+ "\n",
+ "# Ground truth\n",
+ "axes[0].scatter(X_anom[labels_true == 0, 0], X_anom[labels_true == 0, 1],\n",
+ " c=\"steelblue\", s=30, alpha=0.6, label=\"Normal\")\n",
+ "axes[0].scatter(X_anom[labels_true == 1, 0], X_anom[labels_true == 1, 1],\n",
+ " c=\"red\", s=80, marker=\"X\", label=\"True Anomaly\")\n",
+ "axes[0].set_title(\"Ground Truth\")\n",
+ "axes[0].legend()\n",
+ "axes[0].grid(alpha=0.3)\n",
+ "\n",
+ "# Z-Score\n",
+ "axes[1].scatter(X_anom[~z_anomaly_mask, 0], X_anom[~z_anomaly_mask, 1],\n",
+ " c=\"steelblue\", s=30, alpha=0.6, label=\"Normal\")\n",
+ "axes[1].scatter(X_anom[z_anomaly_mask, 0], X_anom[z_anomaly_mask, 1],\n",
+ " c=\"red\", s=80, marker=\"X\", label=\"Detected Anomaly\")\n",
+ "axes[1].set_title(f\"Z-Score (threshold={z_threshold})\")\n",
+ "axes[1].legend()\n",
+ "axes[1].grid(alpha=0.3)\n",
+ "\n",
+ "# Isolation Forest\n",
+ "axes[2].scatter(X_anom[~iso_anomaly_mask, 0], X_anom[~iso_anomaly_mask, 1],\n",
+ " c=\"steelblue\", s=30, alpha=0.6, label=\"Normal\")\n",
+ "axes[2].scatter(X_anom[iso_anomaly_mask, 0], X_anom[iso_anomaly_mask, 1],\n",
+ " c=\"red\", s=80, marker=\"X\", label=\"Detected Anomaly\")\n",
+ "axes[2].set_title(\"Isolation Forest\")\n",
+ "axes[2].legend()\n",
+ "axes[2].grid(alpha=0.3)\n",
+ "\n",
+ "plt.suptitle(\"Anomaly Detection Comparison\", fontsize=14, y=1.02)\n",
+ "plt.tight_layout()\n",
+ "plt.show()"
+ ]
+ },
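+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Because we injected the anomalies ourselves, we can also quantify the\n",
+ "comparison (a quick sketch using scikit-learn's precision and recall):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.metrics import precision_score, recall_score\n",
+ "\n",
+ "for name, mask in [(\"Z-Score\", z_anomaly_mask), (\"Isolation Forest\", iso_anomaly_mask)]:\n",
+ "    prec = precision_score(labels_true, mask.astype(int))\n",
+ "    rec = recall_score(labels_true, mask.astype(int))\n",
+ "    print(f\"{name:>16}: precision={prec:.2f}, recall={rec:.2f}\")"
+ ]
+ },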
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Key takeaway:** The Isolation Forest typically outperforms the Z-Score\n",
+ "method, especially when the data is multi-modal or the anomalies are not simply\n",
+ "extreme values along a single axis."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "## 6. Capstone Project – Customer Segmentation\n",
+ "\n",
+ "We will build a complete customer-segmentation pipeline:\n",
+ "\n",
+ "1. Generate & save a synthetic customer dataset\n",
+ "2. Feature scaling\n",
+ "3. Dimensionality reduction with PCA\n",
+ "4. Elbow method to choose optimal $K$\n",
+ "5. K-Means clustering\n",
+ "6. Segment profiling & visualization\n",
+ "7. Business recommendations"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 6.1 Generate Synthetic Customer Data\n",
+ "\n",
+ "We create five features that mimic a retail scenario:\n",
+ "\n",
+ "| Feature | Description |\n",
+ "|---------|-------------|\n",
+ "| `age` | Customer age (18–70) |\n",
+ "| `income` | Annual income in $k (15–150) |\n",
+ "| `spending_score` | In-store spending score (1–100) |\n",
+ "| `visits` | Monthly store visits (0–30) |\n",
+ "| `online_ratio` | Fraction of purchases made online (0–1) |"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "import os\n",
+ "\n",
+ "np.random.seed(42)\n",
+ "n_customers = 500\n",
+ "\n",
+ "# Segment 1: Young, moderate income, high online, high spending\n",
+ "seg1 = {\n",
+ " \"age\": np.random.normal(25, 4, 130).clip(18, 40),\n",
+ " \"income\": np.random.normal(45, 12, 130).clip(15, 80),\n",
+ " \"spending_score\": np.random.normal(75, 10, 130).clip(1, 100),\n",
+ " \"visits\": np.random.normal(8, 3, 130).clip(0, 30),\n",
+ " \"online_ratio\": np.random.normal(0.75, 0.1, 130).clip(0, 1),\n",
+ "}\n",
+ "\n",
+ "# Segment 2: Middle-aged, high income, balanced channel, moderate spending\n",
+ "seg2 = {\n",
+ " \"age\": np.random.normal(42, 6, 150).clip(28, 60),\n",
+ " \"income\": np.random.normal(95, 18, 150).clip(50, 150),\n",
+ " \"spending_score\": np.random.normal(55, 12, 150).clip(1, 100),\n",
+ " \"visits\": np.random.normal(15, 5, 150).clip(0, 30),\n",
+ " \"online_ratio\": np.random.normal(0.45, 0.15, 150).clip(0, 1),\n",
+ "}\n",
+ "\n",
+ "# Segment 3: Older, lower income, low online, low spending\n",
+ "seg3 = {\n",
+ " \"age\": np.random.normal(58, 7, 120).clip(40, 70),\n",
+ " \"income\": np.random.normal(35, 10, 120).clip(15, 70),\n",
+ " \"spending_score\": np.random.normal(25, 10, 120).clip(1, 100),\n",
+ " \"visits\": np.random.normal(20, 5, 120).clip(0, 30),\n",
+ " \"online_ratio\": np.random.normal(0.15, 0.08, 120).clip(0, 1),\n",
+ "}\n",
+ "\n",
+ "# Segment 4: Mixed ages, very high income, high spending, moderate visits\n",
+ "seg4 = {\n",
+ " \"age\": np.random.normal(38, 10, 100).clip(18, 70),\n",
+ " \"income\": np.random.normal(120, 15, 100).clip(80, 150),\n",
+ " \"spending_score\": np.random.normal(85, 8, 100).clip(1, 100),\n",
+ " \"visits\": np.random.normal(12, 4, 100).clip(0, 30),\n",
+ " \"online_ratio\": np.random.normal(0.55, 0.15, 100).clip(0, 1),\n",
+ "}\n",
+ "\n",
+ "frames = []\n",
+ "for seg in [seg1, seg2, seg3, seg4]:\n",
+ " frames.append(pd.DataFrame(seg))\n",
+ "\n",
+ "df_customers = pd.concat(frames, ignore_index=True)\n",
+ "df_customers = df_customers.sample(frac=1, random_state=42).reset_index(drop=True)\n",
+ "\n",
+ "df_customers[\"age\"] = df_customers[\"age\"].round(0).astype(int)\n",
+ "df_customers[\"income\"] = df_customers[\"income\"].round(1)\n",
+ "df_customers[\"spending_score\"] = df_customers[\"spending_score\"].round(0).astype(int)\n",
+ "df_customers[\"visits\"] = df_customers[\"visits\"].round(0).astype(int)\n",
+ "df_customers[\"online_ratio\"] = df_customers[\"online_ratio\"].round(2)\n",
+ "\n",
+ "# Save to CSV\n",
+ "dataset_dir = os.path.join(os.path.dirname(os.getcwd()), \"datasets\")\n",
+ "os.makedirs(dataset_dir, exist_ok=True)\n",
+ "csv_path = os.path.join(dataset_dir, \"customers.csv\")\n",
+ "df_customers.to_csv(csv_path, index=False)\n",
+ "print(f\"Saved {len(df_customers)} rows to {csv_path}\")\n",
+ "df_customers.head(10)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "df_customers.describe().round(2)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 6.2 Feature Scaling"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.preprocessing import StandardScaler\n",
+ "\n",
+ "feature_cols = [\"age\", \"income\", \"spending_score\", \"visits\", \"online_ratio\"]\n",
+ "X_cust = df_customers[feature_cols].values\n",
+ "\n",
+ "scaler = StandardScaler()\n",
+ "X_scaled = scaler.fit_transform(X_cust)\n",
+ "\n",
+ "print(\"Scaled means (≈0):\", np.round(X_scaled.mean(axis=0), 4))\n",
+ "print(\"Scaled stds (≈1):\", np.round(X_scaled.std(axis=0), 4))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 6.3 PCA for Dimensionality Reduction"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "pca_cust = PCA(n_components=5)\n",
+ "X_pca_cust = pca_cust.fit_transform(X_scaled)\n",
+ "\n",
+ "cum_var = np.cumsum(pca_cust.explained_variance_ratio_)\n",
+ "\n",
+ "plt.figure(figsize=(7, 4))\n",
+ "plt.bar(range(1, 6), pca_cust.explained_variance_ratio_,\n",
+ " color=\"steelblue\", edgecolor=\"black\", alpha=0.7, label=\"Individual\")\n",
+ "plt.step(range(1, 6), cum_var, where=\"mid\", color=\"darkorange\",\n",
+ " linewidth=2, label=\"Cumulative\")\n",
+ "plt.axhline(0.90, color=\"red\", linestyle=\"--\", alpha=0.7, label=\"90% threshold\")\n",
+ "plt.xlabel(\"Principal Component\")\n",
+ "plt.ylabel(\"Variance Explained\")\n",
+ "plt.title(\"Customer Data – PCA Variance Explained\")\n",
+ "plt.xticks(range(1, 6))\n",
+ "plt.legend()\n",
+ "plt.tight_layout()\n",
+ "plt.show()\n",
+ "\n",
+ "n_keep = np.argmax(cum_var >= 0.90) + 1\n",
+ "print(f\"\\nComponents needed for ≥90% variance: {n_keep}\")\n",
+ "print(f\"Using first 2 components for visualization ({cum_var[1]:.1%} variance).\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 6.4 K-Means – Elbow Method"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.cluster import KMeans\n",
+ "\n",
+ "K_range = range(2, 11)\n",
+ "inertias = []\n",
+ "\n",
+ "for k in K_range:\n",
+ " km = KMeans(n_clusters=k, n_init=10, random_state=42)\n",
+ " km.fit(X_scaled)\n",
+ " inertias.append(km.inertia_)\n",
+ "\n",
+ "plt.figure(figsize=(8, 4))\n",
+ "plt.plot(list(K_range), inertias, \"o-\", linewidth=2, color=\"steelblue\")\n",
+ "plt.xlabel(\"Number of Clusters (K)\")\n",
+ "plt.ylabel(\"Inertia (within-cluster sum of squares)\")\n",
+ "plt.title(\"Elbow Method for Optimal K\")\n",
+ "plt.xticks(list(K_range))\n",
+ "plt.grid(alpha=0.3)\n",
+ "plt.tight_layout()\n",
+ "plt.show()\n",
+ "\n",
+ "print(\"Look for the 'elbow' – the point where adding more clusters yields\")\n",
+ "print(\"diminishing returns. Here K=4 appears to be a good choice.\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 6.5 Fit K-Means with Optimal K"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "optimal_k = 4\n",
+ "km_final = KMeans(n_clusters=optimal_k, n_init=20, random_state=42)\n",
+ "cluster_labels = km_final.fit_predict(X_scaled)\n",
+ "\n",
+ "df_customers[\"cluster\"] = cluster_labels\n",
+ "print(f\"Cluster distribution:\\n{df_customers['cluster'].value_counts().sort_index()}\")"
+ ]
+ },
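+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a quick sanity check, we can reuse the silhouette score from Notebook 02\n",
+ "to confirm the four segments are reasonably well separated:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.metrics import silhouette_score\n",
+ "\n",
+ "sil = silhouette_score(X_scaled, cluster_labels)\n",
+ "print(f\"Silhouette score for K={optimal_k}: {sil:.3f}\")"
+ ]
+ },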
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 6.6 Segment Profiling"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "segment_profile = df_customers.groupby(\"cluster\")[feature_cols].mean().round(2)\n",
+ "segment_profile[\"count\"] = df_customers.groupby(\"cluster\").size()\n",
+ "print(\"=== Segment Profiles ===\")\n",
+ "segment_profile"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Bar-chart comparison of feature means by cluster\n",
+ "# (no sharey: the features have very different scales, and a shared y-axis\n",
+ "# would make small-scale features like online_ratio invisible)\n",
+ "fig, axes = plt.subplots(1, len(feature_cols), figsize=(18, 4))\n",
+ "cluster_colors = [\"#1f77b4\", \"#ff7f0e\", \"#2ca02c\", \"#d62728\"]\n",
+ "\n",
+ "for idx, feat in enumerate(feature_cols):\n",
+ " means = df_customers.groupby(\"cluster\")[feat].mean()\n",
+ "    axes[idx].bar(means.index, means.values,\n",
+ "                  color=cluster_colors[:optimal_k], edgecolor=\"black\")\n",
+ " axes[idx].set_title(feat, fontsize=11)\n",
+ " axes[idx].set_xlabel(\"Cluster\")\n",
+ " axes[idx].set_xticks(range(optimal_k))\n",
+ "\n",
+ "axes[0].set_ylabel(\"Mean Value\")\n",
+ "plt.suptitle(\"Feature Means by Cluster\", fontsize=14, y=1.02)\n",
+ "plt.tight_layout()\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 6.7 Visualize Segments in 2-D (PCA Projection)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "X_vis = X_pca_cust[:, :2]\n",
+ "centroids_scaled = km_final.cluster_centers_\n",
+ "centroids_2d = pca_cust.transform(centroids_scaled)[:, :2] # project centroids\n",
+ "\n",
+ "plt.figure(figsize=(9, 7))\n",
+ "for c in range(optimal_k):\n",
+ " mask = cluster_labels == c\n",
+ " plt.scatter(X_vis[mask, 0], X_vis[mask, 1], s=40, alpha=0.6,\n",
+ " color=cluster_colors[c], edgecolors=\"k\", linewidth=0.3,\n",
+ " label=f\"Segment {c}\")\n",
+ "\n",
+ "plt.scatter(centroids_2d[:, 0], centroids_2d[:, 1], s=250, c=\"black\",\n",
+ " marker=\"*\", zorder=5, label=\"Centroids\")\n",
+ "\n",
+ "plt.xlabel(f\"PC 1 ({pca_cust.explained_variance_ratio_[0]:.1%} var)\")\n",
+ "plt.ylabel(f\"PC 2 ({pca_cust.explained_variance_ratio_[1]:.1%} var)\")\n",
+ "plt.title(\"Customer Segments – PCA 2-D Projection\")\n",
+ "plt.legend()\n",
+ "plt.grid(alpha=0.3)\n",
+ "plt.tight_layout()\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 6.8 Business Recommendations"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# NOTE: K-Means cluster IDs are arbitrary -- verify each label against the\n",
+ "# segment profiles in 6.6 before applying this mapping to a fresh run.\n",
+ "recommendations = {\n",
+ " 0: {\n",
+ " \"label\": \"Budget Traditionalists\",\n",
+ " \"description\": \"Older customers with low income and spending, who shop mostly in-store.\",\n",
+ " \"actions\": [\n",
+ " \"Offer loyalty discounts and in-store promotions\",\n",
+ " \"Simplify the in-store experience\",\n",
+ " \"Provide personalized coupons at checkout\",\n",
+ " ],\n",
+ " },\n",
+ " 1: {\n",
+ " \"label\": \"Young Digital Shoppers\",\n",
+ " \"description\": \"Young customers with moderate income but high online engagement and spending.\",\n",
+ " \"actions\": [\n",
+ " \"Invest in mobile app features and social media marketing\",\n",
+ " \"Offer free shipping and digital-only deals\",\n",
+ " \"Launch a referral program to leverage their network\",\n",
+ " ],\n",
+ " },\n",
+ " 2: {\n",
+ " \"label\": \"Premium High-Spenders\",\n",
+ "        \"description\": \"High income, high spending score – the most valuable segment.\",\n",
+ " \"actions\": [\n",
+ " \"Create a VIP/premium loyalty tier\",\n",
+ " \"Offer early access to new products\",\n",
+ " \"Assign dedicated account managers for retention\",\n",
+ " ],\n",
+ " },\n",
+ " 3: {\n",
+ " \"label\": \"Established Moderates\",\n",
+ " \"description\": \"Middle-aged, higher income, moderate spending, balanced channel use.\",\n",
+ " \"actions\": [\n",
+ " \"Cross-sell higher-margin products\",\n",
+ " \"Provide omni-channel convenience (buy online, pick up in store)\",\n",
+ " \"Target with email campaigns for seasonal offers\",\n",
+ " ],\n",
+ " },\n",
+ "}\n",
+ "\n",
+ "for seg_id, info in recommendations.items():\n",
+ " count = (cluster_labels == seg_id).sum()\n",
+ " print(f\"\\n{'='*60}\")\n",
+ " print(f\"Segment {seg_id}: {info['label']} (n={count})\")\n",
+ " print(f\"{'='*60}\")\n",
+ " print(f\" {info['description']}\")\n",
+ " print(\" Recommended actions:\")\n",
+ " for action in info[\"actions\"]:\n",
+ "        print(f\"   • {action}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "## 7. Summary & Key Takeaways\n",
+ "\n",
+ "### What We Covered in This Notebook\n",
+ "\n",
+ "| Topic | Key Idea |\n",
+ "|-------|----------|\n",
+ "| **PCA** | Linear projection onto directions of maximum variance |\n",
+ "| **t-SNE** | Non-linear embedding that preserves local neighborhoods – for visualization only |\n",
+ "| **Z-Score Anomaly Detection** | Simple threshold on standardized values |\n",
+ "| **Isolation Forest** | Tree-based anomaly detector – fast, distribution-free |\n",
+ "| **Customer Segmentation** | End-to-end pipeline: scale → PCA → K-Means → profile → recommend |\n",
+ "\n",
+ "### Chapter 8 Recap\n",
+ "\n",
+ "Across the three notebooks you have:\n",
+ "\n",
+ "1. **Notebook 01 (Introduction):** Learned K-Means, hierarchical clustering, and evaluation metrics.\n",
+ "2. **Notebook 02 (Intermediate):** Explored DBSCAN, Gaussian Mixture Models, and silhouette analysis.\n",
+ "3. **Notebook 03 (Advanced – this one):** Mastered PCA, t-SNE, anomaly detection, and built a full capstone project.\n",
+ "\n",
+ "### What's Next\n",
+ "\n",
+ "In **Chapter 9: Deep Learning** we'll move from classical ML to neural\n",
+ "networks – starting with perceptrons, backpropagation, and building your first\n",
+ "deep network with PyTorch/Keras.\n",
+ "\n",
+ "---\n",
+ "*Generated by Berta AI | Created by Luigi Pascal Rondanini*"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "name": "python",
+ "version": "3.10.0"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
\ No newline at end of file
diff --git a/chapters/chapter-08-unsupervised-learning/requirements.txt b/chapters/chapter-08-unsupervised-learning/requirements.txt
new file mode 100644
index 0000000..8781803
--- /dev/null
+++ b/chapters/chapter-08-unsupervised-learning/requirements.txt
@@ -0,0 +1,7 @@
+jupyter
+notebook
+numpy
+pandas
+matplotlib
+scikit-learn
+scipy
diff --git a/chapters/chapter-08-unsupervised-learning/scripts/unsupervised_toolkit.py b/chapters/chapter-08-unsupervised-learning/scripts/unsupervised_toolkit.py
new file mode 100644
index 0000000..c1b1659
--- /dev/null
+++ b/chapters/chapter-08-unsupervised-learning/scripts/unsupervised_toolkit.py
@@ -0,0 +1,423 @@
+"""
+Unsupervised Learning Toolkit - Core implementations and plotting utilities.
+Generated by Berta AI | Created by Luigi Pascal Rondanini
+"""
+
+import numpy as np
+import matplotlib.pyplot as plt
+from sklearn.metrics import silhouette_samples, silhouette_score
+from scipy.cluster.hierarchy import dendrogram, linkage
+from sklearn.datasets import make_blobs
+
+
+class KMeansScratch:
+ """
+ K-Means clustering implementation from scratch.
+ """
+
+ def __init__(self, n_clusters=3, max_iters=100, random_state=42):
+ """
+ Initialize K-Means.
+
+ Parameters
+ ----------
+ n_clusters : int
+ Number of clusters.
+ max_iters : int
+ Maximum iterations for the algorithm.
+ random_state : int
+ Random seed for reproducibility.
+ """
+ self.n_clusters = n_clusters
+ self.max_iters = max_iters
+ self.random_state = random_state
+ self.centroids = None
+ self.labels_ = None
+ self.inertia_history = []
+
+ def fit(self, X):
+ """
+ Fit K-Means to the data.
+
+ Parameters
+ ----------
+ X : array-like of shape (n_samples, n_features)
+ Training data.
+
+ Returns
+ -------
+ self
+ """
+        X = np.asarray(X, dtype=float)  # float dtype avoids integer truncation in centroid means
+        # Local RNG keeps fitting reproducible without mutating NumPy's global state
+        rng = np.random.RandomState(self.random_state)
+        n_samples = X.shape[0]
+
+        # Random centroid initialization
+        idx = rng.choice(n_samples, self.n_clusters, replace=False)
+        self.centroids = X[idx].copy()
+
+ for _ in range(self.max_iters):
+ # Assign points to nearest centroid
+ labels = self._assign_clusters(X)
+ # Recompute centroids
+ new_centroids = np.zeros_like(self.centroids)
+ for k in range(self.n_clusters):
+ mask = labels == k
+ if np.any(mask):
+ new_centroids[k] = X[mask].mean(axis=0)
+ else:
+ new_centroids[k] = self.centroids[k]
+
+ inertia = self._compute_inertia(X, labels, new_centroids)
+ self.inertia_history.append(inertia)
+
+ if np.allclose(self.centroids, new_centroids):
+ break
+ self.centroids = new_centroids
+
+ self.labels_ = self._assign_clusters(X)
+ return self
+
+ def _assign_clusters(self, X):
+ """Assign each point to the nearest centroid."""
+ distances = np.linalg.norm(X[:, np.newaxis] - self.centroids, axis=2)
+ return np.argmin(distances, axis=1)
+
+ def predict(self, X):
+ """
+ Predict cluster labels for new data.
+
+ Parameters
+ ----------
+ X : array-like of shape (n_samples, n_features)
+ Data to predict.
+
+ Returns
+ -------
+ labels : ndarray of shape (n_samples,)
+ Cluster indices.
+ """
+ X = np.asarray(X)
+ return self._assign_clusters(X)
+
+ def fit_predict(self, X):
+ """
+ Fit and return cluster labels.
+
+ Parameters
+ ----------
+ X : array-like of shape (n_samples, n_features)
+ Training data.
+
+ Returns
+ -------
+ labels : ndarray of shape (n_samples,)
+ Cluster indices.
+ """
+ return self.fit(X).labels_
+
+ def _compute_inertia(self, X, labels, centroids):
+ """
+ Compute within-cluster sum of squares (inertia).
+
+ Parameters
+ ----------
+ X : ndarray
+ Data points.
+ labels : ndarray
+ Cluster labels.
+ centroids : ndarray
+ Cluster centroids.
+
+ Returns
+ -------
+ inertia : float
+ """
+ inertia = 0.0
+ for k in range(self.n_clusters):
+ mask = labels == k
+ if np.any(mask):
+ inertia += np.sum((X[mask] - centroids[k]) ** 2)
+ return inertia
+
+
+class PCAScratch:
+ """
+ Principal Component Analysis implementation from scratch.
+ """
+
+ def __init__(self, n_components=2):
+ """
+ Initialize PCA.
+
+ Parameters
+ ----------
+ n_components : int
+ Number of components to keep.
+ """
+ self.n_components = n_components
+ self.mean_ = None
+ self.components_ = None
+ self.explained_variance_ = None
+
+ def fit(self, X):
+ """
+ Fit PCA to the data.
+
+ Parameters
+ ----------
+ X : array-like of shape (n_samples, n_features)
+ Training data.
+
+ Returns
+ -------
+ self
+ """
+ X = np.asarray(X)
+ self.mean_ = X.mean(axis=0)
+ X_centered = X - self.mean_
+
+ # Covariance matrix
+ cov = np.cov(X_centered.T)
+
+ # Eigendecomposition
+ eigenvalues, eigenvectors = np.linalg.eigh(cov)
+ idx = np.argsort(eigenvalues)[::-1]
+ eigenvalues = eigenvalues[idx]
+ eigenvectors = eigenvectors[:, idx]
+
+ n = min(self.n_components, len(eigenvalues))
+ self.components_ = eigenvectors[:, :n].T
+ self.explained_variance_ = eigenvalues[:n]
+ return self
+
+ def transform(self, X):
+ """
+ Project data onto principal components.
+
+ Parameters
+ ----------
+ X : array-like of shape (n_samples, n_features)
+ Data to transform.
+
+ Returns
+ -------
+ X_transformed : ndarray of shape (n_samples, n_components)
+ """
+ X = np.asarray(X)
+ X_centered = X - self.mean_
+ return X_centered @ self.components_.T
+
+ def fit_transform(self, X):
+ """
+ Fit and transform in one step.
+
+ Parameters
+ ----------
+ X : array-like of shape (n_samples, n_features)
+ Training data.
+
+ Returns
+ -------
+ X_transformed : ndarray of shape (n_samples, n_components)
+ """
+ return self.fit(X).transform(X)
+
+ @property
+ def explained_variance_ratio_(self):
+ """Fraction of variance explained by each component."""
+ total = np.sum(self.explained_variance_)
+ return self.explained_variance_ / total if total > 0 else self.explained_variance_
+
+
+def plot_clusters(X, labels, centroids=None, title="Clusters"):
+ """
+ Scatter plot of clustered data with optional centroid markers.
+
+ Parameters
+ ----------
+ X : array-like
+ Data points (2D).
+ labels : array-like
+ Cluster labels.
+ centroids : array-like, optional
+ Centroids to plot as markers.
+ title : str
+ Plot title.
+ """
+ X = np.asarray(X)
+ labels = np.asarray(labels)
+ plt.figure(figsize=(8, 6))
+ scatter = plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis", alpha=0.7, edgecolors="k")
+ if centroids is not None:
+ centroids = np.asarray(centroids)
+ plt.scatter(centroids[:, 0], centroids[:, 1], c="red", marker="X", s=200, edgecolors="black")
+ plt.colorbar(scatter, label="Cluster")
+ plt.title(title)
+ plt.xlabel("Feature 1")
+ plt.ylabel("Feature 2")
+ plt.tight_layout()
+ plt.show()
+
+
+def plot_elbow(K_range, inertias, title="Elbow Method"):
+ """
+ Line plot of inertia vs K for elbow method.
+
+ Parameters
+ ----------
+ K_range : array-like
+ Range of K values.
+ inertias : array-like
+ Inertia for each K.
+ title : str
+ Plot title.
+ """
+ plt.figure(figsize=(8, 5))
+ plt.plot(K_range, inertias, "bo-")
+ plt.xlabel("Number of clusters (K)")
+ plt.ylabel("Inertia")
+ plt.title(title)
+ plt.grid(True, alpha=0.3)
+ plt.tight_layout()
+ plt.show()
+
+
+def plot_silhouette(X, labels, title="Silhouette Analysis"):
+ """
+ Silhouette plot using sklearn.metrics.
+
+ Parameters
+ ----------
+ X : array-like
+ Data points.
+ labels : array-like
+ Cluster labels.
+ title : str
+ Plot title.
+ """
+ X = np.asarray(X)
+ labels = np.asarray(labels)
+ n_clusters = len(np.unique(labels))
+ silhouette_vals = silhouette_samples(X, labels)
+ score = silhouette_score(X, labels)
+
+ plt.figure(figsize=(10, 6))
+ y_lower = 10
+ for i in range(n_clusters):
+ cluster_silhouette = silhouette_vals[labels == i]
+ cluster_silhouette.sort()
+ size = cluster_silhouette.shape[0]
+ y_upper = y_lower + size
+ plt.fill_betweenx(np.arange(y_lower, y_upper), 0, cluster_silhouette, alpha=0.7)
+ plt.text(-0.05, y_lower + 0.5 * size, str(i))
+ y_lower = y_upper + 10
+
+ plt.axvline(x=score, color="red", linestyle="--", label=f"Avg: {score:.3f}")
+ plt.xlabel("Silhouette coefficient")
+ plt.ylabel("Cluster label")
+ plt.title(title)
+ plt.legend()
+ plt.tight_layout()
+ plt.show()
+
+
+def plot_dendrogram(X, method="ward", title="Dendrogram"):
+ """
+ Hierarchical clustering dendrogram using scipy.
+
+ Parameters
+ ----------
+ X : array-like
+ Data points.
+ method : str
+ Linkage method ('ward', 'complete', 'average', 'single').
+ title : str
+ Plot title.
+ """
+ X = np.asarray(X)
+ linkage_matrix = linkage(X, method=method)
+ plt.figure(figsize=(10, 6))
+ dendrogram(linkage_matrix)
+ plt.title(title)
+ plt.xlabel("Sample index or (cluster size)")
+ plt.ylabel("Distance")
+ plt.tight_layout()
+ plt.show()
+
+
+def plot_pca_variance(pca, title="PCA Variance Explained"):
+ """
+ Bar chart and cumulative line for PCA variance explained.
+
+ Parameters
+ ----------
+ pca : PCAScratch
+ Fitted PCA object.
+ title : str
+ Plot title.
+ """
+ ratios = pca.explained_variance_ratio_
+ cumsum = np.cumsum(ratios)
+ n = len(ratios)
+
+ fig, ax1 = plt.subplots(figsize=(8, 5))
+ x = np.arange(1, n + 1)
+ ax1.bar(x - 0.2, ratios, 0.4, label="Individual", color="steelblue")
+ ax1.set_xlabel("Principal Component")
+ ax1.set_ylabel("Variance explained ratio")
+ ax1.set_xticks(x)
+
+ ax2 = ax1.twinx()
+ ax2.plot(x, cumsum, "ro-", label="Cumulative")
+ ax2.set_ylabel("Cumulative variance")
+ ax2.set_ylim(0, 1.05)
+
+ plt.title(title)
+ fig.legend(loc="upper right", bbox_to_anchor=(1, 1), bbox_transform=ax1.transAxes)
+ plt.tight_layout()
+ plt.show()
+
+
+def plot_anomalies(X, labels, title="Anomaly Detection"):
+ """
+ Scatter plot for normal vs anomaly points.
+
+ Parameters
+ ----------
+ X : array-like
+ Data points (2D).
+ labels : array-like
+ Binary labels (0=normal, 1=anomaly or similar).
+ title : str
+ Plot title.
+ """
+ X = np.asarray(X)
+ labels = np.asarray(labels)
+ plt.figure(figsize=(8, 6))
+ normal = labels == 0
+ anomaly = labels == 1
+ plt.scatter(X[normal, 0], X[normal, 1], c="steelblue", alpha=0.7, label="Normal")
+ plt.scatter(X[anomaly, 0], X[anomaly, 1], c="red", alpha=0.7, label="Anomaly")
+ plt.xlabel("Feature 1")
+ plt.ylabel("Feature 2")
+ plt.title(title)
+ plt.legend()
+ plt.tight_layout()
+ plt.show()
+
+
+if __name__ == "__main__":
+ # Demo: Generate blobs, run KMeansScratch
+ X_blobs, _ = make_blobs(n_samples=300, n_features=2, centers=4, random_state=42)
+ kmeans = KMeansScratch(n_clusters=4, max_iters=100, random_state=42)
+ kmeans.fit(X_blobs)
+ print("KMeansScratch inertia:", kmeans.inertia_history[-1] if kmeans.inertia_history else "N/A")
+
+ # Demo: Run PCAScratch on 4D dataset
+ X_4d, _ = make_blobs(n_samples=200, n_features=4, centers=3, random_state=42)
+ pca = PCAScratch(n_components=4)
+ pca.fit(X_4d)
+ print("PCA variance explained:", pca.explained_variance_ratio_)
+
+ print("Demo complete.")
diff --git a/chapters/chapter-08-unsupervised-learning/scripts/utilities.py b/chapters/chapter-08-unsupervised-learning/scripts/utilities.py
new file mode 100644
index 0000000..bdf4c31
--- /dev/null
+++ b/chapters/chapter-08-unsupervised-learning/scripts/utilities.py
@@ -0,0 +1,104 @@
+"""
+Helper utilities for unsupervised learning.
+Generated by Berta AI | Created by Luigi Pascal Rondanini
+"""
+
+import numpy as np
+import pandas as pd
+from sklearn.preprocessing import StandardScaler
+
+
+def scale_features(X):
+ """
+ Scale features using StandardScaler (zero mean, unit variance).
+
+ Parameters
+ ----------
+ X : array-like of shape (n_samples, n_features)
+ Data to scale.
+
+ Returns
+ -------
+ X_scaled : ndarray of shape (n_samples, n_features)
+ Scaled data.
+ """
+ scaler = StandardScaler()
+ return scaler.fit_transform(X)
+
+
+def generate_synthetic_customers(n=300, seed=42):
+ """
+ Generate synthetic customer data for clustering/segmentation.
+
+ Parameters
+ ----------
+ n : int
+ Number of customers to generate.
+ seed : int
+ Random seed for reproducibility.
+
+ Returns
+ -------
+ df : pandas.DataFrame
+ DataFrame with columns: age, income, spending_score, visits, online_ratio.
+ """
+ np.random.seed(seed)
+ age = np.random.randint(18, 70, size=n)
+ income = np.random.exponential(scale=30000, size=n).astype(int) + 20000
+ spending_score = np.random.exponential(scale=50, size=n).astype(int) + 10
+ visits = np.random.poisson(lam=5, size=n) + 1
+ online_ratio = np.random.beta(2, 2, size=n)
+ return pd.DataFrame({
+ "age": age,
+ "income": income,
+ "spending_score": spending_score,
+ "visits": visits,
+ "online_ratio": online_ratio,
+ })
+
+
+def generate_synthetic_sensors(n=200, anomaly_fraction=0.1, seed=42):
+ """
+ Generate synthetic sensor data with anomalies.
+
+ Parameters
+ ----------
+ n : int
+ Number of sensor readings.
+ anomaly_fraction : float
+ Fraction of readings that are anomalies (0 to 1).
+ seed : int
+ Random seed for reproducibility.
+
+ Returns
+ -------
+ df : pandas.DataFrame
+ DataFrame with columns: temp, pressure, vibration, is_anomaly.
+ """
+ np.random.seed(seed)
+ n_anomaly = int(n * anomaly_fraction)
+ n_normal = n - n_anomaly
+
+ # Normal readings
+ temp_normal = np.random.normal(25, 2, n_normal)
+ pressure_normal = np.random.normal(100, 5, n_normal)
+ vibration_normal = np.random.exponential(0.5, n_normal)
+
+ # Anomalous readings (outliers)
+ temp_anomaly = np.random.uniform(50, 90, n_anomaly)
+ pressure_anomaly = np.random.uniform(150, 200, n_anomaly)
+ vibration_anomaly = np.random.exponential(5, n_anomaly)
+
+ temp = np.concatenate([temp_normal, temp_anomaly])
+ pressure = np.concatenate([pressure_normal, pressure_anomaly])
+ vibration = np.concatenate([vibration_normal, vibration_anomaly])
+ is_anomaly = np.concatenate([np.zeros(n_normal, dtype=int), np.ones(n_anomaly, dtype=int)])
+
+ # Shuffle
+ idx = np.random.permutation(n)
+ return pd.DataFrame({
+ "temp": temp[idx],
+ "pressure": pressure[idx],
+ "vibration": vibration[idx],
+ "is_anomaly": is_anomaly[idx],
+ })
diff --git a/docs/chapters/assets/diagrams/anomaly_detection.svg b/docs/chapters/assets/diagrams/anomaly_detection.svg
new file mode 100644
index 0000000..92452f7
--- /dev/null
+++ b/docs/chapters/assets/diagrams/anomaly_detection.svg
@@ -0,0 +1,90 @@
+
diff --git a/docs/chapters/assets/diagrams/clustering_algorithms.svg b/docs/chapters/assets/diagrams/clustering_algorithms.svg
new file mode 100644
index 0000000..f17f560
--- /dev/null
+++ b/docs/chapters/assets/diagrams/clustering_algorithms.svg
@@ -0,0 +1,92 @@
+
diff --git a/docs/chapters/assets/diagrams/dimensionality_reduction.svg b/docs/chapters/assets/diagrams/dimensionality_reduction.svg
new file mode 100644
index 0000000..7f4b92b
--- /dev/null
+++ b/docs/chapters/assets/diagrams/dimensionality_reduction.svg
@@ -0,0 +1,81 @@
+
diff --git a/docs/chapters/chapter-08.md b/docs/chapters/chapter-08.md
new file mode 100644
index 0000000..1a9eb96
--- /dev/null
+++ b/docs/chapters/chapter-08.md
@@ -0,0 +1,100 @@
+# Chapter 8: Unsupervised Learning
+
+Discover hidden patterns in unlabeled data: clustering, dimensionality reduction, and anomaly detection.
+
+---
+
+## Metadata
+
+| Field | Value |
+|-------|-------|
+| **Track** | Practitioner |
+| **Time** | 8 hours |
+| **Prerequisites** | Chapters 1-6 |
+
+---
+
+## Learning Objectives
+
+- Implement K-Means clustering from scratch using NumPy
+- Apply hierarchical clustering and interpret dendrograms
+- Use DBSCAN for density-based clustering with noise detection
+- Evaluate clusters with silhouette scores and the elbow method
+- Reduce dimensionality with PCA and t-SNE
+- Detect anomalies with Isolation Forest and statistical methods
+- Build a complete customer segmentation pipeline
+
+---
+
+## What's Included
+
+### Notebooks
+
+| Notebook | Description |
+|----------|-------------|
+| `01_introduction.ipynb` | K-Means from scratch, evaluation, elbow method |
+| `02_intermediate.ipynb` | Hierarchical, DBSCAN, Gaussian Mixture Models |
+| `03_advanced.ipynb` | PCA, t-SNE, anomaly detection, customer segmentation capstone |
+
+### Scripts
+
+- `unsupervised_toolkit.py` -- Core implementations (KMeansScratch, PCAScratch) and plotting utilities
+
+### Exercises
+
+- **5 exercises** with solutions (in `solutions/` branch)
+
+### SVG Diagrams
+
+- 3 visual diagrams for clustering algorithms, dimensionality reduction, and anomaly detection
+
+---
+
+
+## Read Online
+
+You can read the full chapter content right here on the website:
+
+- **[08.1 Introduction](content/ch08-01_introduction.md)** -- K-Means from scratch, silhouette scores, elbow method
+- **[08.2 Intermediate](content/ch08-02_intermediate.md)** -- Hierarchical clustering, DBSCAN, Gaussian Mixture Models
+- **[08.3 Advanced](content/ch08-03_advanced.md)** -- PCA, t-SNE, anomaly detection, customer segmentation capstone
+
+Or [try the code in the Playground](../playground.md).
+
+## How to Use This Chapter
+
+!!! tip "Quick Start"
+ Follow these steps to get coding in minutes.
+
+**1. Clone and install dependencies**
+
+```bash
+git clone https://github.com/luigipascal/berta-chapters.git
+cd berta-chapters
+pip install -r requirements.txt
+```
+
+**2. Navigate to the chapter**
+
+```bash
+cd chapters/chapter-08-unsupervised-learning
+```
+
+**3. Launch Jupyter**
+
+```bash
+jupyter notebook notebooks/01_introduction.ipynb
+```
+
+!!! info "GitHub Folder"
+ All chapter materials live in: [`chapters/chapter-08-unsupervised-learning/`](https://github.com/luigipascal/berta-chapters/tree/main/chapters/chapter-08-unsupervised-learning/)
+
+!!! tip "SciPy"
+ This chapter uses SciPy for hierarchical clustering dendrograms. Ensure it's installed: `pip install scipy`
+
+---
+
+**Created by Luigi Pascal Rondanini | Generated by Berta AI**
diff --git a/docs/chapters/content/ch08-01_introduction.md b/docs/chapters/content/ch08-01_introduction.md
new file mode 100644
index 0000000..1f2321e
--- /dev/null
+++ b/docs/chapters/content/ch08-01_introduction.md
@@ -0,0 +1,448 @@
+# Ch 8: Unsupervised Learning - Introduction
+
+**Track**: Practitioner | [Try code in Playground](../../playground.md) | [Back to chapter overview](../chapter-08.md)
+
+
+!!! tip "Read online or run locally"
+ You can read this content here on the web. To run the code interactively,
+ either use the [Playground](../../playground.md) or clone the repo and open
+ `chapters/chapter-08-unsupervised-learning/notebooks/01_introduction.ipynb` in Jupyter.
+
+---
+
+# Chapter 8: Unsupervised Learning
+## Notebook 01 - Introduction: Clustering Basics
+
+Unsupervised learning finds hidden patterns in data without labels. We start with the most fundamental algorithm: **K-Means clustering**.
+
+**What you'll learn:**
+- The difference between supervised and unsupervised learning
+- K-Means clustering from scratch using NumPy
+- Evaluating clusters with inertia and silhouette score
+- The elbow method for choosing K
+- Scikit-learn's KMeans interface
+
+**Time estimate:** 2.5 hours
+
+---
+
+## 1. Supervised vs Unsupervised Learning
+
+In **supervised learning**, every training example comes with a label -- the "right answer" -- and the model learns a mapping from inputs to outputs. Classification and regression are the classic examples.
+
+In **unsupervised learning**, there are **no labels at all**. The algorithm must discover structure in the data on its own. Common tasks include clustering (group similar points), dimensionality reduction (compress features), and anomaly detection (find unusual observations).
+
+This notebook focuses on **clustering** -- specifically the **K-Means** algorithm. Let's start by generating some data and seeing what it looks like *without* labels. The left plot shows raw data (all same color); the right reveals the true clusters we want the algorithm to recover on its own.
+
+```python
+import numpy as np
+import matplotlib.pyplot as plt
+from sklearn.datasets import make_blobs
+
+np.random.seed(42)
+
+X, y_true = make_blobs(
+ n_samples=200, centers=3, cluster_std=0.9, random_state=42
+)
+
+fig, axes = plt.subplots(1, 2, figsize=(13, 5))
+
+axes[0].scatter(X[:, 0], X[:, 1], c="steelblue", edgecolors="k", s=50, alpha=0.7)
+axes[0].set_title("What we observe (no labels)", fontsize=14)
+axes[0].set_xlabel("Feature 1")
+axes[0].set_ylabel("Feature 2")
+
+colors = ["#e74c3c", "#2ecc71", "#3498db"]
+for k in range(3):
+ mask = y_true == k
+ axes[1].scatter(X[mask, 0], X[mask, 1], c=colors[k],
+ edgecolors="k", s=50, alpha=0.7, label=f"Cluster {k}")
+axes[1].set_title("True clusters (hidden from algorithm)", fontsize=14)
+axes[1].set_xlabel("Feature 1")
+axes[1].set_ylabel("Feature 2")
+axes[1].legend()
+
+plt.tight_layout()
+plt.show()
+```
+
+---
+
+## 2. K-Means Algorithm
+
+K-Means is an iterative algorithm that partitions *n* data points into *K* clusters. It works in three repeating steps:
+
+**Step 1 -- Initialize:** Pick *K* points as initial **centroids** (cluster centers). The simplest approach is to choose *K* data points at random.
+
+**Step 2 -- Assign:** For every data point, compute the Euclidean distance to each centroid and assign the point to the **nearest** centroid.
+
+**Step 3 -- Update:** Recompute each centroid as the **mean** of all points currently assigned to that cluster.
+
+**Repeat** Steps 2 and 3 until the assignments no longer change (or a maximum number of iterations is reached).
+
+Let's implement K-Means from scratch using only NumPy:
+
+```python
+class KMeansScratch:
+ """Minimal K-Means implementation using NumPy."""
+
+ def __init__(self, k=3, max_iters=100, random_state=42):
+ self.k = k
+ self.max_iters = max_iters
+ self.random_state = random_state
+ self.centroids = None
+ self.labels_ = None
+ self.inertia_ = None
+ self.inertia_history = []
+ self.centroid_history = []
+ self.label_history = []
+
+ def _euclidean_distances(self, X, centroids):
+ """Compute distance from every point to every centroid."""
+ return np.sqrt(((X[:, np.newaxis] - centroids[np.newaxis]) ** 2).sum(axis=2))
+
+ def _compute_inertia(self, X, labels, centroids):
+ return sum(
+ np.sum((X[labels == k] - centroids[k]) ** 2)
+ for k in range(self.k)
+ )
+
+ def fit(self, X):
+ rng = np.random.RandomState(self.random_state)
+ n_samples = X.shape[0]
+
+ # Step 1: random initialization
+ idx = rng.choice(n_samples, self.k, replace=False)
+ self.centroids = X[idx].copy()
+
+ self.inertia_history = []
+ self.centroid_history = [self.centroids.copy()]
+ self.label_history = []
+
+ for _ in range(self.max_iters):
+ # Step 2: assign
+ distances = self._euclidean_distances(X, self.centroids)
+ labels = np.argmin(distances, axis=1)
+ self.label_history.append(labels.copy())
+
+ # Step 3: update centroids
+ new_centroids = np.array([
+ X[labels == k].mean(axis=0) if np.any(labels == k)
+ else self.centroids[k]
+ for k in range(self.k)
+ ])
+
+ inertia = self._compute_inertia(X, labels, new_centroids)
+ self.inertia_history.append(inertia)
+ self.centroid_history.append(new_centroids.copy())
+
+ if np.allclose(new_centroids, self.centroids):
+ break
+ self.centroids = new_centroids
+
+ self.labels_ = labels
+ self.inertia_ = self.inertia_history[-1]
+ return self
+
+ def predict(self, X):
+ distances = self._euclidean_distances(X, self.centroids)
+ return np.argmin(distances, axis=1)
+
+
+km_scratch = KMeansScratch(k=3, random_state=42)
+km_scratch.fit(X)
+
+print(f"Converged in {len(km_scratch.inertia_history)} iterations")
+print(f"Final inertia: {km_scratch.inertia_:.2f}")
+print(f"Centroids:\n{km_scratch.centroids}")
+```
+
+Now let's plot the ground truth alongside our K-Means result:
+
+```python
+fig, axes = plt.subplots(1, 2, figsize=(13, 5))
+
+colors_map = np.array(["#e74c3c", "#2ecc71", "#3498db"])
+
+for k in range(3):
+ mask = y_true == k
+ axes[0].scatter(X[mask, 0], X[mask, 1], c=colors[k],
+ edgecolors="k", s=50, alpha=0.7, label=f"True {k}")
+axes[0].set_title("Ground Truth", fontsize=14)
+axes[0].legend()
+axes[0].set_xlabel("Feature 1")
+axes[0].set_ylabel("Feature 2")
+
+axes[1].scatter(X[:, 0], X[:, 1], c=colors_map[km_scratch.labels_],
+ edgecolors="k", s=50, alpha=0.7)
+axes[1].scatter(km_scratch.centroids[:, 0], km_scratch.centroids[:, 1],
+ c=colors, marker="X", s=250, edgecolors="k", linewidths=1.5,
+ zorder=5, label="Centroids")
+axes[1].set_title("K-Means (scratch) result", fontsize=14)
+axes[1].legend()
+axes[1].set_xlabel("Feature 1")
+axes[1].set_ylabel("Feature 2")
+
+plt.tight_layout()
+plt.show()
+```
+
+---
+
+## 3. Step-by-Step Visualization
+
+To build intuition for how the algorithm converges, let's watch the first four iterations unfold. Each subplot shows the cluster assignments and centroid positions at a particular iteration. Notice how the centroids migrate toward the cluster centers with each iteration.
+
+```python
+fig, axes = plt.subplots(2, 2, figsize=(12, 10))
+axes = axes.ravel()
+
+colors_map = np.array(["#e74c3c", "#2ecc71", "#3498db"])
+
+n_show = min(4, len(km_scratch.label_history))
+
+for i in range(n_show):
+ ax = axes[i]
+ labels_i = km_scratch.label_history[i]
+ centroids_i = km_scratch.centroid_history[i]
+ centroids_next = km_scratch.centroid_history[i + 1]
+
+ ax.scatter(X[:, 0], X[:, 1], c=colors_map[labels_i],
+ edgecolors="k", s=40, alpha=0.6)
+
+ ax.scatter(centroids_i[:, 0], centroids_i[:, 1],
+ facecolors="none", edgecolors="k", marker="o",
+ s=200, linewidths=2, label="Old centroid")
+
+ ax.scatter(centroids_next[:, 0], centroids_next[:, 1],
+ c=colors, marker="X", s=250, edgecolors="k",
+ linewidths=1.5, zorder=5, label="New centroid")
+
+ for k in range(3):
+ ax.annotate("",
+ xy=centroids_next[k], xytext=centroids_i[k],
+ arrowprops=dict(arrowstyle="->", lw=1.5, color="black"))
+
+ ax.set_title(f"Iteration {i + 1} | inertia = {km_scratch.inertia_history[i]:.1f}",
+ fontsize=12)
+ if i == 0:
+ ax.legend(fontsize=9, loc="upper left")
+
+for j in range(n_show, 4):
+ axes[j].axis("off")
+
+plt.suptitle("K-Means -- Iteration-by-Iteration", fontsize=15, y=1.01)
+plt.tight_layout()
+plt.show()
+```
+
+---
+
+## 4. Evaluating Clusters
+
+How do we know if K-Means did a good job? Two common metrics:
+
+**Inertia (Within-Cluster Sum of Squares):** The sum of squared distances from each point to its centroid. Lower is better, but inertia *always* decreases as K increases -- so it alone doesn't tell us the right K.
+
+**Silhouette Score:** For each point, we compare the mean distance to others in the same cluster (*a*) vs. the mean distance to the nearest other cluster (*b*). The score is *(b - a) / max(a, b)*, ranging from -1 to +1. Higher is better; values near 0 indicate overlapping clusters.
+
+```python
+from sklearn.metrics import silhouette_score, silhouette_samples
+
+sil_avg = silhouette_score(X, km_scratch.labels_)
+sil_vals = silhouette_samples(X, km_scratch.labels_)
+
+print(f"Inertia: {km_scratch.inertia_:.2f}")
+print(f"Silhouette (mean): {sil_avg:.4f}")
+print(f"Silhouette (min): {sil_vals.min():.4f}")
+print(f"Silhouette (max): {sil_vals.max():.4f}")
+```
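+
+To connect these numbers back to the formula, here's a small sketch (assuming `X`, `km_scratch`, and `sil_vals` from the cells above) that computes the silhouette coefficient of one point directly from the *(b - a) / max(a, b)* definition:
+
+```python
+# Silhouette of point 0, straight from the definition
+i = 0
+own = km_scratch.labels_[i]
+
+# a: mean distance to the other points in the same cluster
+same = km_scratch.labels_ == own
+same[i] = False  # exclude the point itself
+a = np.linalg.norm(X[same] - X[i], axis=1).mean()
+
+# b: smallest mean distance to the points of any other cluster
+b = min(
+    np.linalg.norm(X[km_scratch.labels_ == c] - X[i], axis=1).mean()
+    for c in range(km_scratch.k) if c != own
+)
+
+s_manual = (b - a) / max(a, b)
+print(f"Manual:  {s_manual:.4f}")
+print(f"sklearn: {sil_vals[0]:.4f}")
+```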
+
+A silhouette plot shows each cluster's distribution of silhouette coefficients. Healthy clusters extend well past the mean line; thin slivers or clusters barely crossing zero suggest poor separation.
+
+```python
+fig, ax = plt.subplots(figsize=(8, 5))
+
+y_lower = 10
+colors_sil = ["#e74c3c", "#2ecc71", "#3498db"]
+
+for k in range(3):
+ cluster_sil = np.sort(sil_vals[km_scratch.labels_ == k])
+ cluster_size = cluster_sil.shape[0]
+ y_upper = y_lower + cluster_size
+
+ ax.fill_betweenx(np.arange(y_lower, y_upper), 0, cluster_sil,
+ facecolor=colors_sil[k], edgecolor=colors_sil[k], alpha=0.7)
+ ax.text(-0.05, y_lower + 0.5 * cluster_size, f"Cluster {k}", fontsize=11,
+ fontweight="bold", va="center")
+ y_lower = y_upper + 10
+
+ax.axvline(x=sil_avg, color="k", linestyle="--", linewidth=1.5,
+ label=f"Mean silhouette = {sil_avg:.3f}")
+ax.set_xlabel("Silhouette coefficient", fontsize=12)
+ax.set_ylabel("Points (sorted within cluster)", fontsize=12)
+ax.set_title("Silhouette Plot -- K-Means (K=3)", fontsize=14)
+ax.legend(fontsize=11)
+ax.set_yticks([])
+plt.tight_layout()
+plt.show()
+```
+
+---
+
+## 5. The Elbow Method
+
+Since we must specify *K* before running K-Means, how do we pick a good value?
+
+**The Elbow Method:**
+1. Run K-Means for K = 1, 2, ..., K_max.
+2. Plot inertia vs K.
+3. Look for the **"elbow"** -- the point where inertia stops decreasing sharply and begins to level off.
+
+We can also plot silhouette score vs K; the best K often maximizes silhouette. Both plots together give a clearer picture.
+
+```python
+K_range = range(1, 11)
+inertias = []
+silhouettes = []
+
+for k in K_range:
+ km = KMeansScratch(k=k, random_state=42)
+ km.fit(X)
+ inertias.append(km.inertia_)
+ if k >= 2:
+ silhouettes.append(silhouette_score(X, km.labels_))
+ else:
+ silhouettes.append(np.nan)
+
+fig, axes = plt.subplots(1, 2, figsize=(14, 5))
+
+axes[0].plot(K_range, inertias, "o-", color="#2c3e50", linewidth=2, markersize=8)
+axes[0].set_xlabel("Number of clusters (K)", fontsize=12)
+axes[0].set_ylabel("Inertia", fontsize=12)
+axes[0].set_title("Elbow Method", fontsize=14)
+axes[0].axvline(x=3, color="#e74c3c", linestyle="--", alpha=0.7, label="K = 3 (elbow)")
+axes[0].legend(fontsize=11)
+axes[0].grid(True, alpha=0.3)
+
+sil_values = [s for s in silhouettes if not np.isnan(s)]
+sil_ks = list(range(2, 11))
+axes[1].plot(sil_ks, sil_values, "s-", color="#27ae60", linewidth=2, markersize=8)
+axes[1].set_xlabel("Number of clusters (K)", fontsize=12)
+axes[1].set_ylabel("Mean Silhouette Score", fontsize=12)
+axes[1].set_title("Silhouette Score vs K", fontsize=14)
+axes[1].axvline(x=3, color="#e74c3c", linestyle="--", alpha=0.7, label="K = 3")
+axes[1].legend(fontsize=11)
+axes[1].grid(True, alpha=0.3)
+
+plt.tight_layout()
+plt.show()
+
+print("Silhouette scores by K:")
+for k, s in zip(sil_ks, sil_values):
+ print(f" K={k:2d} -> {s:.4f}")
+```
+
+Both plots agree: **K = 3** is the best choice for this dataset -- inertia has a clear elbow and the silhouette score peaks at K = 3.
+
+---
+
+## 6. Scikit-learn KMeans
+
+In practice you'll use scikit-learn's battle-tested implementation. It uses smarter **k-means++** initialization and runs multiple restarts (`n_init`) to avoid poor local minima. Let's compare with our scratch version:
+
+```python
+from sklearn.cluster import KMeans
+
+km_sklearn = KMeans(n_clusters=3, random_state=42, n_init=10)
+km_sklearn.fit(X)
+
+print("=== Scikit-learn KMeans ===")
+print(f"Inertia: {km_sklearn.inertia_:.2f}")
+print(f"Silhouette score: {silhouette_score(X, km_sklearn.labels_):.4f}")
+print(f"Centroids:\n{km_sklearn.cluster_centers_}")
+print()
+
+print("=== Our scratch KMeans ===")
+print(f"Inertia: {km_scratch.inertia_:.2f}")
+print(f"Silhouette score: {silhouette_score(X, km_scratch.labels_):.4f}")
+print(f"Centroids:\n{km_scratch.centroids}")
+```
+
+The cluster labels may differ in numbering (label 0 in one could be label 2 in the other), but the **groupings themselves** should be nearly identical.
+
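+One quick way to confirm that, agnostic to label numbering, is the adjusted Rand index -- it scores the agreement between two partitions and is invariant to relabelling (1.0 means identical groupings):
+
+```python
+from sklearn.metrics import adjusted_rand_score
+
+ari = adjusted_rand_score(km_scratch.labels_, km_sklearn.labels_)
+print(f"Adjusted Rand Index (scratch vs sklearn): {ari:.4f}")
+```
+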
+```python
+fig, axes = plt.subplots(1, 2, figsize=(13, 5))
+
+colors_map = np.array(["#e74c3c", "#2ecc71", "#3498db"])
+
+axes[0].scatter(X[:, 0], X[:, 1], c=colors_map[km_scratch.labels_],
+ edgecolors="k", s=50, alpha=0.7)
+axes[0].scatter(km_scratch.centroids[:, 0], km_scratch.centroids[:, 1],
+ c="gold", marker="X", s=250, edgecolors="k", linewidths=1.5, zorder=5)
+axes[0].set_title("Our Scratch Implementation", fontsize=14)
+axes[0].set_xlabel("Feature 1")
+axes[0].set_ylabel("Feature 2")
+
+axes[1].scatter(X[:, 0], X[:, 1], c=colors_map[km_sklearn.labels_],
+ edgecolors="k", s=50, alpha=0.7)
+axes[1].scatter(km_sklearn.cluster_centers_[:, 0], km_sklearn.cluster_centers_[:, 1],
+ c="gold", marker="X", s=250, edgecolors="k", linewidths=1.5, zorder=5)
+axes[1].set_title("Scikit-learn KMeans", fontsize=14)
+axes[1].set_xlabel("Feature 1")
+axes[1].set_ylabel("Feature 2")
+
+plt.suptitle("Scratch vs Scikit-learn -- Side by Side", fontsize=15, y=1.01)
+plt.tight_layout()
+plt.show()
+```
+
+---
+
+## 7. Practical Tips
+
+### When K-Means Works Well
+
+K-Means works best when clusters are:
+- **Spherical (isotropic):** roughly the same spread in every direction
+- **Similar in size:** very uneven cluster sizes can pull centroids away from smaller groups
+- **Well-separated:** heavily overlapping clusters confuse the algorithm
+
+### Feature Scaling
+
+K-Means relies on Euclidean distance. If one feature has a range of 0-1 and another 0-10,000, the second feature will dominate. **Always standardize your features** (e.g., `StandardScaler`) before clustering.
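+
+As a hedged sketch of why this matters (hypothetical data, not this chapter's dataset): three clear age groups whose income is unrelated to age. Without scaling, income's large variance dominates the distance and K-Means misses the age structure entirely:
+
+```python
+import numpy as np
+from sklearn.cluster import KMeans
+from sklearn.metrics import adjusted_rand_score
+from sklearn.preprocessing import StandardScaler
+
+rng = np.random.RandomState(0)
+age = np.concatenate([rng.normal(m, 3, 100) for m in (25, 45, 65)])
+income = rng.normal(60_000, 15_000, 300)  # independent of age group
+X_demo = np.column_stack([age, income])
+true_group = np.repeat([0, 1, 2], 100)
+
+km_raw = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)
+km_std = KMeans(n_clusters=3, n_init=10, random_state=0).fit(
+    StandardScaler().fit_transform(X_demo)
+)
+
+# ARI of 1.0 = perfect recovery of the true age groups, ~0 = random
+ari_raw = adjusted_rand_score(true_group, km_raw.labels_)
+ari_std = adjusted_rand_score(true_group, km_std.labels_)
+print(f"ARI without scaling: {ari_raw:.3f}")
+print(f"ARI with scaling:    {ari_std:.3f}")
+```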
+
+### Multiple Initializations
+
+Scikit-learn's `n_init` parameter (default 10) runs K-Means multiple times with different random seeds and keeps the result with the lowest inertia. This greatly reduces the risk of a poor local minimum.
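+
+A minimal sketch of the effect (assuming `X` from the earlier cells; plain random init so the restarts actually differ):
+
+```python
+from sklearn.cluster import KMeans
+
+km_once = KMeans(n_clusters=3, n_init=1, init="random", random_state=7).fit(X)
+km_many = KMeans(n_clusters=3, n_init=10, init="random", random_state=7).fit(X)
+
+# Best of 10 restarts is typically at or below the single run's inertia
+print(f"n_init=1  inertia: {km_once.inertia_:.2f}")
+print(f"n_init=10 inertia: {km_many.inertia_:.2f}")
+```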
+
+### When K-Means Fails
+
+K-Means struggles with:
+- **Non-convex shapes** (e.g., crescent moons, concentric rings) -- consider DBSCAN or spectral clustering instead
+- **Clusters with very different densities** -- HDBSCAN handles this better
+- **High-dimensional data** -- distances become less meaningful (curse of dimensionality); apply dimensionality reduction first
+
+---
+
+## Summary
+
+### Key Takeaways
+
+1. **Unsupervised learning** discovers structure without labels. Clustering is its flagship task.
+2. **K-Means** iterates between *assigning* points to the nearest centroid and *updating* centroids as cluster means until convergence.
+3. **Inertia** measures within-cluster compactness; **silhouette score** balances compactness and separation.
+4. The **elbow method** plots inertia vs K to find a natural number of clusters.
+5. **Scikit-learn's KMeans** adds smart initialization (k-means++) and multiple restarts for robust results.
+6. Always **scale features** before clustering, and remember that K-Means assumes spherical, similarly-sized clusters.
+
+### What's Next
+
+In the following notebooks we will:
+- Explore **hierarchical clustering** and dendrograms
+- Learn **DBSCAN** for density-based clustering
+- Apply **dimensionality reduction** (PCA, t-SNE) for visualization
+
+---
+
+*Generated by Berta AI | Created by Luigi Pascal Rondanini*
diff --git a/docs/chapters/content/ch08-02_intermediate.md b/docs/chapters/content/ch08-02_intermediate.md
new file mode 100644
index 0000000..66973b2
--- /dev/null
+++ b/docs/chapters/content/ch08-02_intermediate.md
@@ -0,0 +1,520 @@
+# Ch 8: Unsupervised Learning - Intermediate
+
+**Track**: Practitioner | [Try code in Playground](../../playground.md) | [Back to chapter overview](../chapter-08.md)
+
+
+!!! tip "Read online or run locally"
+ You can read this content here on the web. To run the code interactively,
+ either use the [Playground](../../playground.md) or clone the repo and open
+ `chapters/chapter-08-unsupervised-learning/notebooks/02_intermediate.ipynb` in Jupyter.
+
+---
+
+# Chapter 8: Unsupervised Learning
+## Notebook 02 - Intermediate: Advanced Clustering
+
+Beyond K-Means: hierarchical clustering, density-based methods, and Gaussian mixtures for real-world data shapes.
+
+**What you'll learn:**
+- Hierarchical (agglomerative) clustering and dendrograms
+- DBSCAN for density-based clustering
+- Gaussian Mixture Models (GMMs)
+- Comparing clustering algorithms on different data shapes
+
+**Time estimate:** 2.5 hours
+
+**Try it yourself:** Experiment with different linkage methods (single, complete, average, ward) on the hierarchical clustering example. Change `eps` and `min_samples` in DBSCAN to see how they affect cluster formation.
+
+**Common mistakes:** Using K-Means on non-convex shapes (e.g., moons), ignoring the k-distance graph when tuning DBSCAN, or assuming spherical clusters when data is elliptical.
+
+---
+
+## 1. Hierarchical Clustering
+
+Hierarchical clustering builds a tree of clusters instead of requiring a fixed number of clusters up front. The **agglomerative (bottom-up)** approach proceeds as follows:
+
+1. **Start** -- treat every data point as its own single-point cluster.
+2. **Merge** -- find the two closest clusters and merge them into one.
+3. **Repeat** -- keep merging until only a single cluster remains (or until a stopping criterion is met).
+
+The result is a hierarchy that can be visualised as a **dendrogram** -- a tree diagram showing the order and distance of each merge.
+
+### Linkage criteria
+
+"Distance between two clusters" can be measured in several ways:
+
+| Linkage | Definition | Tendency |
+|---------|-----------|----------|
+| **Single** | Minimum distance between any pair of points across two clusters | Produces elongated, chain-like clusters |
+| **Complete** | Maximum distance between any pair of points across two clusters | Produces compact, roughly equal-sized clusters |
+| **Average** | Mean distance between all pairs of points across two clusters | Compromise between single and complete |
+| **Ward** | Minimises the total within-cluster variance at each merge | Tends to produce equally sized, spherical clusters |
+
+Ward linkage is the most commonly used default and works well when clusters are roughly spherical.
+
+```python
+import numpy as np
+import matplotlib.pyplot as plt
+from sklearn.datasets import make_blobs
+from sklearn.cluster import AgglomerativeClustering
+from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
+
+np.random.seed(42)
+
+# Generate synthetic data with 4 well-separated clusters
+X_hier, y_hier = make_blobs(
+ n_samples=200, centers=4, cluster_std=0.8, random_state=42
+)
+
+fig, axes = plt.subplots(1, 2, figsize=(14, 5))
+
+# Left panel β raw data
+axes[0].scatter(X_hier[:, 0], X_hier[:, 1], s=30, alpha=0.7, edgecolors='k', linewidths=0.3)
+axes[0].set_title('Raw Data (200 points, 4 clusters)')
+axes[0].set_xlabel('Feature 1')
+axes[0].set_ylabel('Feature 2')
+
+# Right panel -- dendrogram using Ward linkage
+Z_ward = linkage(X_hier, method='ward')
+dendrogram(
+ Z_ward,
+ truncate_mode='lastp',
+ p=30,
+ leaf_rotation=90,
+ leaf_font_size=8,
+ ax=axes[1],
+ color_threshold=12
+)
+axes[1].set_title('Dendrogram (Ward Linkage, truncated to 30 leaves)')
+axes[1].set_xlabel('Cluster (size)')
+axes[1].set_ylabel('Merge Distance')
+axes[1].axhline(y=12, color='r', linestyle='--', label='Cut at distance = 12')
+axes[1].legend()
+
+plt.tight_layout()
+plt.show()
+```
+
+The dendrogram shows the full merge history. By drawing a horizontal cut line we decide how many clusters to keep -- each vertical line that crosses the cut corresponds to one cluster.
+
+### Comparing linkage methods
+
+Let's visualise how the four linkage types partition the same dataset.
+
+```python
+linkage_methods = ['single', 'complete', 'average', 'ward']
+fig, axes = plt.subplots(1, 4, figsize=(20, 4.5))
+
+for ax, method in zip(axes, linkage_methods):
+ Z = linkage(X_hier, method=method)
+ labels = fcluster(Z, t=4, criterion='maxclust')
+ scatter = ax.scatter(
+ X_hier[:, 0], X_hier[:, 1],
+ c=labels, cmap='viridis', s=30, alpha=0.7, edgecolors='k', linewidths=0.3
+ )
+ ax.set_title(f'{method.capitalize()} linkage')
+ ax.set_xlabel('Feature 1')
+ ax.set_ylabel('Feature 2')
+
+plt.suptitle('Agglomerative Clustering -- 4 Linkage Methods (k=4)', fontsize=14, y=1.02)
+plt.tight_layout()
+plt.show()
+```
+
+```python
+# Scikit-learn's AgglomerativeClustering with Ward linkage
+agg = AgglomerativeClustering(n_clusters=4, linkage='ward')
+agg_labels = agg.fit_predict(X_hier)
+
+fig, axes = plt.subplots(1, 2, figsize=(14, 5))
+
+axes[0].scatter(
+ X_hier[:, 0], X_hier[:, 1],
+ c=y_hier, cmap='tab10', s=40, alpha=0.7, edgecolors='k', linewidths=0.3
+)
+axes[0].set_title('Ground-Truth Labels')
+axes[0].set_xlabel('Feature 1')
+axes[0].set_ylabel('Feature 2')
+
+axes[1].scatter(
+ X_hier[:, 0], X_hier[:, 1],
+ c=agg_labels, cmap='tab10', s=40, alpha=0.7, edgecolors='k', linewidths=0.3
+)
+axes[1].set_title('AgglomerativeClustering (Ward, k=4)')
+axes[1].set_xlabel('Feature 1')
+axes[1].set_ylabel('Feature 2')
+
+plt.tight_layout()
+plt.show()
+
+print(f"Cluster sizes: {np.bincount(agg_labels)}")
+```
+
+---
+
+## 2. DBSCAN
+
+**DBSCAN** (Density-Based Spatial Clustering of Applications with Noise) takes a fundamentally different approach to clustering:
+
+- It does **not** require the number of clusters in advance.
+- It defines clusters as **dense regions** separated by sparse regions.
+- Points that don't belong to any dense region are labelled as **noise** (label = -1).
+
+### Key parameters
+
+| Parameter | Meaning |
+|-----------|--------|
+| `eps` (ε) | Maximum distance between two points for them to be considered neighbours |
+| `min_samples` | Minimum number of points within ε-distance to form a dense region |
+
+### Point types
+
+- **Core point** -- has at least `min_samples` neighbours within ε.
+- **Border point** -- within ε of a core point but doesn't have enough neighbours itself.
+- **Noise point** -- neither core nor border; isolated outliers.
+
+DBSCAN can discover clusters of **arbitrary shape** and naturally identifies outliers -- something centroid-based methods like K-Means cannot do.
+
+```python
+from sklearn.datasets import make_moons
+from sklearn.cluster import KMeans, DBSCAN
+
+np.random.seed(42)
+
+# Generate two moons (non-convex dataset)
+X_moons, y_moons = make_moons(n_samples=500, noise=0.08, random_state=42)
+
+# Apply DBSCAN and K-Means
+db_moons = DBSCAN(eps=0.2, min_samples=5).fit(X_moons)
+km_moons = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X_moons)
+
+fig, axes = plt.subplots(1, 3, figsize=(18, 5))
+
+axes[0].scatter(X_moons[:, 0], X_moons[:, 1], c=y_moons, cmap='coolwarm', s=20, alpha=0.7)
+axes[0].set_title('Ground Truth')
+axes[0].set_xlabel('Feature 1')
+axes[0].set_ylabel('Feature 2')
+
+axes[1].scatter(X_moons[:, 0], X_moons[:, 1], c=km_moons.labels_, cmap='coolwarm', s=20, alpha=0.7)
+axes[1].scatter(km_moons.cluster_centers_[:, 0], km_moons.cluster_centers_[:, 1],
+ marker='X', s=200, c='black', edgecolors='white', linewidths=1.5)
+axes[1].set_title('K-Means (k=2) -- Fails on non-convex shapes')
+axes[1].set_xlabel('Feature 1')
+axes[1].set_ylabel('Feature 2')
+
+axes[2].scatter(X_moons[:, 0], X_moons[:, 1], c=db_moons.labels_, cmap='coolwarm', s=20, alpha=0.7)
+axes[2].set_title('DBSCAN (eps=0.2) -- Correctly separates crescents')
+axes[2].set_xlabel('Feature 1')
+axes[2].set_ylabel('Feature 2')
+
+plt.suptitle('K-Means vs DBSCAN on the Moons Dataset', fontsize=14, y=1.02)
+plt.tight_layout()
+plt.show()
+```
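+
+Scikit-learn exposes the fitted core points through `core_sample_indices_`, so the three point types from the moons fit above can be counted directly (noise has label -1; border points are the clustered remainder):
+
+```python
+core_mask = np.zeros(len(X_moons), dtype=bool)
+core_mask[db_moons.core_sample_indices_] = True
+noise_mask = db_moons.labels_ == -1
+border_mask = ~core_mask & ~noise_mask  # clustered but not core
+
+print(f"Core points:   {core_mask.sum()}")
+print(f"Border points: {border_mask.sum()}")
+print(f"Noise points:  {noise_mask.sum()}")
+```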
+
+---
+
+## 3. Choosing DBSCAN Parameters
+
+Picking `eps` and `min_samples` can be tricky. A practical heuristic:
+
+1. Set `min_samples` ≈ 2 × number of features (a reasonable default).
+2. For each point compute the distance to its **k-th nearest neighbour** (k = `min_samples`).
+3. Sort these distances and plot them -- the **k-distance graph**.
+4. Look for the "elbow" -- the point where the curve bends sharply upward. The distance at that elbow is a good candidate for `eps`.
+
+```python
+from sklearn.neighbors import NearestNeighbors
+
+k = 5  # same as min_samples
+# kneighbors on the training data returns each point itself (distance 0)
+# in the first column, so ask for k + 1 neighbours and take column k
+nn = NearestNeighbors(n_neighbors=k + 1)
+nn.fit(X_moons)
+distances, _ = nn.kneighbors(X_moons)
+
+k_distances = np.sort(distances[:, k])[::-1]
+
+plt.figure(figsize=(10, 5))
+plt.plot(k_distances, linewidth=1.5)
+plt.axhline(y=0.2, color='r', linestyle='--', label='eps = 0.2 (our choice)')
+plt.title(f'k-Distance Graph (k={k}) -- Elbow Indicates Good eps')
+plt.xlabel('Points (sorted by descending k-distance)')
+plt.ylabel(f'Distance to {k}-th Nearest Neighbour')
+plt.legend()
+plt.grid(True, alpha=0.3)
+plt.show()
+```
+
+```python
+# Effect of different eps values on DBSCAN results
+eps_values = [0.05, 0.1, 0.2, 0.3, 0.5]
+fig, axes = plt.subplots(1, len(eps_values), figsize=(22, 4))
+
+for ax, eps in zip(axes, eps_values):
+ db = DBSCAN(eps=eps, min_samples=5).fit(X_moons)
+ labels = db.labels_
+ n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
+ n_noise = (labels == -1).sum()
+
+ unique_labels = set(labels)
+ colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
+
+ for k_label, col in zip(sorted(unique_labels), colors):
+ if k_label == -1:
+ col = [0, 0, 0, 1] # black for noise
+ mask = labels == k_label
+ ax.scatter(X_moons[mask, 0], X_moons[mask, 1], c=[col], s=15, alpha=0.7)
+
+ ax.set_title(f'eps={eps}\n{n_clusters} clusters, {n_noise} noise')
+ ax.set_xlabel('Feature 1')
+
+axes[0].set_ylabel('Feature 2')
+plt.suptitle('Effect of eps on DBSCAN (min_samples=5)', fontsize=14, y=1.05)
+plt.tight_layout()
+plt.show()
+```
+
+**Observations:**
+- **eps too small** (0.05) -- most points classified as noise; many tiny clusters.
+- **eps just right** (0.2) -- two clean crescent clusters with very little noise.
+- **eps too large** (0.5) -- everything merges into a single cluster.
+
+The k-distance graph helps you find that sweet spot without trial and error.
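+
+If you prefer a number to an eyeball estimate, one crude heuristic (a sketch of the "kneedle" idea, not part of this notebook's workflow) is to pick the point on the ascending k-distance curve that lies farthest below the straight line joining the curve's endpoints. The sample size and noise level below are assumptions mirroring the moons data used earlier:
+
+```python
+import numpy as np
+from sklearn.datasets import make_moons
+from sklearn.neighbors import NearestNeighbors
+
+# Rebuild moons data like the one above (parameters assumed)
+X, _ = make_moons(n_samples=300, noise=0.07, random_state=42)
+
+k = 5
+dists, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
+k_dist = np.sort(dists[:, k - 1])  # ascending k-distance curve
+
+# Elbow = point farthest below the chord joining the curve's endpoints
+idx = np.arange(len(k_dist))
+chord = k_dist[0] + (k_dist[-1] - k_dist[0]) * idx / (len(k_dist) - 1)
+eps_candidate = k_dist[np.argmax(chord - k_dist)]
+print(f"Suggested eps ~ {eps_candidate:.3f}")
+```
+
+Treat the suggestion as a starting point and confirm it against the plot; the heuristic is sensitive to outliers at the steep tail of the curve.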
+
+---
+
+## 4. Gaussian Mixture Models
+
+A **Gaussian Mixture Model** assumes that the data is generated from a mixture of a finite number of Gaussian (normal) distributions with unknown parameters.
+
+### GMM vs K-Means
+
+| Aspect | K-Means | GMM |
+|--------|---------|-----|
+| Cluster assignment | **Hard** – each point belongs to exactly one cluster | **Soft** – each point has a probability for every cluster |
+| Cluster shape | Spherical (Voronoi cells) | Elliptical (full covariance matrices) |
+| Outlier handling | None – every point is assigned | Naturally down-weights low-probability points |
+| Output | Cluster label | Probability vector over all clusters |
+
+GMMs are fit using the **Expectation-Maximisation (EM)** algorithm:
+1. **E-step** – compute the probability that each point belongs to each Gaussian component.
+2. **M-step** – update each component's mean, covariance, and weight to maximise log-likelihood.
+3. Repeat until convergence.
+
+```python
+from sklearn.mixture import GaussianMixture
+
+np.random.seed(42)
+
+# Create elongated / elliptical clusters that challenge K-Means
+n_per_cluster = 200
+cov1 = [[2.0, 1.5], [1.5, 1.5]]
+cov2 = [[1.5, -1.2], [-1.2, 1.5]]
+cov3 = [[0.5, 0.0], [0.0, 2.5]]
+
+cluster1 = np.random.multivariate_normal([0, 0], cov1, n_per_cluster)
+cluster2 = np.random.multivariate_normal([5, 5], cov2, n_per_cluster)
+cluster3 = np.random.multivariate_normal([8, 0], cov3, n_per_cluster)
+
+X_gmm = np.vstack([cluster1, cluster2, cluster3])
+y_gmm_true = np.array([0]*n_per_cluster + [1]*n_per_cluster + [2]*n_per_cluster)
+
+fig, axes = plt.subplots(1, 3, figsize=(18, 5))
+
+# Ground truth
+axes[0].scatter(X_gmm[:, 0], X_gmm[:, 1], c=y_gmm_true, cmap='tab10', s=15, alpha=0.6)
+axes[0].set_title('Ground Truth (Elliptical Clusters)')
+axes[0].set_xlabel('Feature 1')
+axes[0].set_ylabel('Feature 2')
+
+# K-Means
+km_gmm = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X_gmm)
+axes[1].scatter(X_gmm[:, 0], X_gmm[:, 1], c=km_gmm.labels_, cmap='tab10', s=15, alpha=0.6)
+axes[1].scatter(km_gmm.cluster_centers_[:, 0], km_gmm.cluster_centers_[:, 1],
+ marker='X', s=200, c='black', edgecolors='white', linewidths=1.5)
+axes[1].set_title('K-Means (k=3) – Spherical assumption')
+axes[1].set_xlabel('Feature 1')
+axes[1].set_ylabel('Feature 2')
+
+# GMM
+gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=42)
+gmm.fit(X_gmm)
+gmm_labels = gmm.predict(X_gmm)
+axes[2].scatter(X_gmm[:, 0], X_gmm[:, 1], c=gmm_labels, cmap='tab10', s=15, alpha=0.6)
+axes[2].set_title('GMM (3 components) – Elliptical fit')
+axes[2].set_xlabel('Feature 1')
+axes[2].set_ylabel('Feature 2')
+
+plt.suptitle('K-Means vs GMM on Elliptical Clusters', fontsize=14, y=1.02)
+plt.tight_layout()
+plt.show()
+```
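+
+The table above calls GMM assignments "soft". A minimal, self-contained sketch of what that means, using fresh toy data rather than `X_gmm` (the names `X_demo` and `gmm_demo` are ours):
+
+```python
+import numpy as np
+from sklearn.mixture import GaussianMixture
+
+rng = np.random.default_rng(0)
+X_demo = np.vstack([rng.normal(0, 1, (100, 2)),
+                    rng.normal(5, 1, (100, 2))])
+
+gmm_demo = GaussianMixture(n_components=2, random_state=0).fit(X_demo)
+
+# predict() returns hard labels; predict_proba() returns the soft assignment
+proba = gmm_demo.predict_proba(X_demo)
+print(proba[:3].round(3))      # one probability per component, per point
+print(proba.sum(axis=1)[:3])   # each row sums to 1
+```
+
+Points near a cluster boundary get probabilities well away from 0 or 1, which is exactly the information a hard K-Means label discards.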
+
+```python
+# Visualise GMM probability contours
+x_min, x_max = X_gmm[:, 0].min() - 2, X_gmm[:, 0].max() + 2
+y_min, y_max = X_gmm[:, 1].min() - 2, X_gmm[:, 1].max() + 2
+xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300), np.linspace(y_min, y_max, 300))
+grid_points = np.column_stack([xx.ravel(), yy.ravel()])
+
+log_prob = gmm.score_samples(grid_points)
+log_prob = log_prob.reshape(xx.shape)
+
+fig, ax = plt.subplots(figsize=(10, 7))
+ax.contourf(xx, yy, np.exp(log_prob), levels=30, cmap='YlOrRd', alpha=0.6)
+ax.contour(xx, yy, np.exp(log_prob), levels=10, colors='darkred', linewidths=0.5, alpha=0.5)
+ax.scatter(X_gmm[:, 0], X_gmm[:, 1], c=gmm_labels, cmap='tab10', s=10, alpha=0.7,
+ edgecolors='k', linewidths=0.2)
+
+for i in range(gmm.n_components):
+ ax.scatter(gmm.means_[i, 0], gmm.means_[i, 1],
+ marker='+', s=300, c='black', linewidths=3)
+
+ax.set_title('GMM Probability Density Contours')
+ax.set_xlabel('Feature 1')
+ax.set_ylabel('Feature 2')
+plt.tight_layout()
+plt.show()
+```
+
+### Model selection with BIC and AIC
+
+How many Gaussian components should we use? We can use information criteria:
+
+- **BIC** (Bayesian Information Criterion) – penalises model complexity more heavily.
+- **AIC** (Akaike Information Criterion) – lighter penalty.
+
+**Lower is better** for both. We fit GMMs with different numbers of components and pick the one with the lowest BIC (or AIC).
+
+```python
+n_components_range = range(1, 10)
+bic_scores = []
+aic_scores = []
+
+for n in n_components_range:
+ gmm_test = GaussianMixture(n_components=n, covariance_type='full', random_state=42)
+ gmm_test.fit(X_gmm)
+ bic_scores.append(gmm_test.bic(X_gmm))
+ aic_scores.append(gmm_test.aic(X_gmm))
+
+fig, ax = plt.subplots(figsize=(10, 5))
+ax.plot(list(n_components_range), bic_scores, 'bo-', label='BIC', linewidth=2)
+ax.plot(list(n_components_range), aic_scores, 'rs--', label='AIC', linewidth=2)
+ax.axvline(x=3, color='green', linestyle=':', alpha=0.7, label='True number of components (3)')
+ax.set_xlabel('Number of Components')
+ax.set_ylabel('Score (lower is better)')
+ax.set_title('GMM Model Selection: BIC and AIC')
+ax.legend()
+ax.grid(True, alpha=0.3)
+plt.tight_layout()
+plt.show()
+
+print(f"Best BIC at n_components = {np.argmin(bic_scores) + 1}")
+print(f"Best AIC at n_components = {np.argmin(aic_scores) + 1}")
+```
+
+---
+
+## 5. Algorithm Comparison
+
+Let's put all four algorithms head-to-head on three different data geometries:
+
+1. **Blobs** – well-separated spherical clusters
+2. **Moons** – two interleaving crescents
+3. **Varied-variance blobs** – spherical clusters with very different densities
+
+```python
+from sklearn.preprocessing import StandardScaler
+
+np.random.seed(42)
+
+n_samples = 500
+
+# Dataset 1: standard blobs
+X_blobs, y_blobs = make_blobs(n_samples=n_samples, centers=3, cluster_std=1.0, random_state=42)
+
+# Dataset 2: moons
+X_moons2, y_moons2 = make_moons(n_samples=n_samples, noise=0.07, random_state=42)
+
+# Dataset 3: varied-variance blobs
+X_varied, y_varied = make_blobs(
+ n_samples=n_samples, centers=3, cluster_std=[0.5, 2.5, 1.0], random_state=42
+)
+
+datasets = [
+ ('Blobs', X_blobs, {'n_clusters': 3, 'eps': 1.0}),
+ ('Moons', X_moons2, {'n_clusters': 2, 'eps': 0.2}),
+ ('Varied', X_varied, {'n_clusters': 3, 'eps': 1.5}),
+]
+
+fig, axes = plt.subplots(3, 4, figsize=(22, 15))
+
+for row, (name, X, params) in enumerate(datasets):
+ X_scaled = StandardScaler().fit_transform(X)
+ n_c = params['n_clusters']
+ eps = params['eps']
+
+ # K-Means
+ km = KMeans(n_clusters=n_c, random_state=42, n_init=10).fit(X_scaled)
+ # Agglomerative
+ agg = AgglomerativeClustering(n_clusters=n_c, linkage='ward').fit(X_scaled)
+ # DBSCAN
+ db = DBSCAN(eps=eps, min_samples=5).fit(X_scaled)
+ # GMM
+ gm = GaussianMixture(n_components=n_c, random_state=42).fit(X_scaled)
+
+ results = [
+ ('K-Means', km.labels_),
+ ('Agglomerative', agg.labels_),
+ ('DBSCAN', db.labels_),
+ ('GMM', gm.predict(X_scaled)),
+ ]
+
+ for col, (algo_name, labels) in enumerate(results):
+ ax = axes[row, col]
+ unique_labels = set(labels)
+ n_clust = len(unique_labels) - (1 if -1 in unique_labels else 0)
+
+ noise_mask = labels == -1
+ ax.scatter(X_scaled[~noise_mask, 0], X_scaled[~noise_mask, 1],
+ c=labels[~noise_mask], cmap='viridis', s=12, alpha=0.7)
+ if noise_mask.any():
+ ax.scatter(X_scaled[noise_mask, 0], X_scaled[noise_mask, 1],
+ c='red', marker='x', s=15, alpha=0.5, label='noise')
+ ax.legend(fontsize=8)
+
+ if row == 0:
+ ax.set_title(algo_name, fontsize=13, fontweight='bold')
+ ax.set_ylabel(f'{name}' if col == 0 else '', fontsize=12)
+ ax.text(0.02, 0.98, f'{n_clust} cluster(s)',
+ transform=ax.transAxes, fontsize=9, va='top',
+ bbox=dict(boxstyle='round,pad=0.3', facecolor='white', alpha=0.8))
+
+plt.suptitle('Algorithm Comparison Across Data Geometries', fontsize=16, y=1.01)
+plt.tight_layout()
+plt.show()
+```
+
+---
+
+## Summary
+
+### When to use each algorithm
+
+| Algorithm | Best for | Weaknesses | Must specify k? |
+|-----------|----------|------------|-----------------|
+| **K-Means** | Large datasets with spherical clusters | Cannot handle non-convex shapes; sensitive to outliers | Yes |
+| **Agglomerative Clustering** | Small-to-medium datasets; exploring hierarchy | O(n³) time complexity; hard to scale | Yes (or cut dendrogram) |
+| **DBSCAN** | Arbitrary shapes; datasets with noise/outliers | Sensitive to `eps`; struggles with varying densities | No |
+| **Gaussian Mixture Model** | Elliptical clusters; need soft assignments | Assumes Gaussian components; sensitive to initialisation | Yes |
+
+### Rules of thumb
+
+1. **Start simple:** try K-Means first. If results look poor, consider the data geometry.
+2. **Non-convex shapes?** → Use DBSCAN.
+3. **Elliptical or overlapping clusters?** → Use GMM.
+4. **Need a hierarchy or dendrogram?** → Use Agglomerative Clustering.
+5. **Noisy data with outliers?** → DBSCAN naturally handles noise.
+6. **Need probability estimates?** → GMM provides soft assignments.
+
+---
+*Generated by Berta AI | Created by Luigi Pascal Rondanini*
diff --git a/docs/chapters/content/ch08-03_advanced.md b/docs/chapters/content/ch08-03_advanced.md
new file mode 100644
index 0000000..1e08ea4
--- /dev/null
+++ b/docs/chapters/content/ch08-03_advanced.md
@@ -0,0 +1,687 @@
+# Ch 8: Unsupervised Learning - Advanced
+
+**Track**: Practitioner | [Try code in Playground](../../playground.md) | [Back to chapter overview](../chapter-08.md)
+
+
+!!! tip "Read online or run locally"
+ You can read this content here on the web. To run the code interactively,
+ either use the [Playground](../../playground.md) or clone the repo and open
+ `chapters/chapter-08-unsupervised-learning/notebooks/03_advanced.ipynb` in Jupyter.
+
+---
+
+# Chapter 8: Unsupervised Learning
+## Notebook 03 - Advanced: Dimensionality Reduction & Capstone
+
+Reduce high-dimensional data for visualization and modeling, detect anomalies, and build a complete customer segmentation system.
+
+**What you'll learn:**
+- Principal Component Analysis (PCA) from scratch
+- t-SNE for 2D visualization
+- Anomaly detection with Isolation Forest
+- Customer segmentation capstone project
+
+**Time estimate:** 3 hours
+
+---
+
+## 1. PCA Theory
+
+### The Core Idea
+
+PCA is a **linear** dimensionality-reduction technique that finds the directions (called **principal components**) along which the data varies the most.
+
+Imagine a cloud of 3-D points shaped like a flat pancake. Two axes capture almost all of the spread; the third adds very little information. PCA discovers those two dominant axes automatically.
+
+### Algorithm Steps
+
+1. **Center the data** – subtract the mean of each feature so that the cloud is centered at the origin.
+2. **Compute the covariance matrix** – a \(d \times d\) matrix (where \(d\) is the number of features) that captures pairwise linear relationships.
+3. **Eigendecomposition** – find the eigenvectors and eigenvalues of the covariance matrix. Each eigenvector is a principal component direction; its eigenvalue tells us how much variance that direction explains.
+4. **Sort & select** – rank components by eigenvalue (descending) and keep the top \(k\) to reduce dimensionality from \(d\) to \(k\).
+5. **Project** – multiply the centered data by the selected eigenvectors to obtain the lower-dimensional representation.
+
+### Variance Explained Ratio
+
+The variance explained ratio for component \(i\) is \(\lambda_i / \sum_{j=1}^{d} \lambda_j\), where \(\lambda_i\) is the \(i\)-th eigenvalue. The **cumulative** variance explained tells us how much total information is retained when we keep the first \(k\) components.
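+
+As a quick numeric sanity check (with hypothetical eigenvalues, not the Iris ones), the ratios follow directly from the formula:
+
+```python
+import numpy as np
+
+lam = np.array([4.0, 0.25, 0.10, 0.05])   # hypothetical eigenvalues
+ratio = lam / lam.sum()
+print(ratio.round(3))             # individual variance explained
+print(np.cumsum(ratio).round(3))  # cumulative; first 2 PCs keep ~97%
+```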
+
+---
+
+## 2. PCA From Scratch
+
+We implement PCA using only NumPy and apply it to the classic **Iris** dataset (4 features → 2 components).
+
+```python
+import numpy as np
+import matplotlib.pyplot as plt
+from sklearn.datasets import load_iris
+
+np.random.seed(42)
+
+# Load the Iris dataset (4 features, 150 samples, 3 classes)
+iris = load_iris()
+X = iris.data # shape (150, 4)
+y = iris.target # 0, 1, 2
+feature_names = iris.feature_names
+target_names = iris.target_names
+
+print(f"Dataset shape: {X.shape}")
+print(f"Features: {feature_names}")
+print(f"Classes: {list(target_names)}")
+```
+
+```python
+def pca_from_scratch(X, n_components=2):
+ """Implement PCA using NumPy."""
+ # Step 1: Center the data
+ mean = np.mean(X, axis=0)
+ X_centered = X - mean
+
+ # Step 2: Covariance matrix (features Γ features)
+ cov_matrix = np.cov(X_centered, rowvar=False)
+
+ # Step 3: Eigendecomposition
+ eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
+
+ # Step 4: Sort by eigenvalue descending
+ sorted_idx = np.argsort(eigenvalues)[::-1]
+ eigenvalues = eigenvalues[sorted_idx]
+ eigenvectors = eigenvectors[:, sorted_idx]
+
+ # Variance explained ratio
+ variance_ratio = eigenvalues / eigenvalues.sum()
+
+ # Step 5: Project onto top-k components
+ W = eigenvectors[:, :n_components]
+ X_projected = X_centered @ W
+
+ return X_projected, eigenvalues, variance_ratio, W
+
+
+X_pca_scratch, eigenvalues, var_ratio, components = pca_from_scratch(X, n_components=2)
+
+print("Eigenvalues:", np.round(eigenvalues, 4))
+print("Variance explained ratio:", np.round(var_ratio, 4))
+print(f"Total variance retained (2 components): {var_ratio[:2].sum():.2%}")
+```
+
+```python
+# Variance Explained Bar + Cumulative Line
+fig, axes = plt.subplots(1, 2, figsize=(13, 5))
+
+# Left: bar chart of individual variance ratios
+axes[0].bar(range(1, len(var_ratio) + 1), var_ratio, color="steelblue", edgecolor="black")
+axes[0].set_xlabel("Principal Component")
+axes[0].set_ylabel("Variance Explained Ratio")
+axes[0].set_title("Variance Explained by Each Component")
+axes[0].set_xticks(range(1, len(var_ratio) + 1))
+
+# Right: cumulative variance explained
+cumulative = np.cumsum(var_ratio)
+axes[1].plot(range(1, len(cumulative) + 1), cumulative, "o-", color="darkorange", linewidth=2)
+axes[1].axhline(y=0.95, color="red", linestyle="--", label="95% threshold")
+axes[1].set_xlabel("Number of Components")
+axes[1].set_ylabel("Cumulative Variance Explained")
+axes[1].set_title("Cumulative Variance Explained")
+axes[1].set_xticks(range(1, len(cumulative) + 1))
+axes[1].legend()
+
+plt.tight_layout()
+plt.show()
+```
+
+```python
+# 2-D scatter plot of the scratch PCA projection
+colors = ["#1f77b4", "#ff7f0e", "#2ca02c"]
+
+plt.figure(figsize=(8, 6))
+for i, name in enumerate(target_names):
+ mask = y == i
+ plt.scatter(X_pca_scratch[mask, 0], X_pca_scratch[mask, 1],
+ label=name, alpha=0.7, edgecolors="k", linewidth=0.5,
+ color=colors[i], s=60)
+plt.xlabel(f"PC 1 ({var_ratio[0]:.1%} variance)")
+plt.ylabel(f"PC 2 ({var_ratio[1]:.1%} variance)")
+plt.title("PCA From Scratch – Iris Dataset (2-D Projection)")
+plt.legend()
+plt.grid(alpha=0.3)
+plt.tight_layout()
+plt.show()
+```
+
+---
+
+## 3. PCA with Scikit-learn
+
+We verify our scratch implementation against the well-optimized `sklearn.decomposition.PCA`.
+
+```python
+from sklearn.decomposition import PCA
+
+pca_sk = PCA(n_components=4) # keep all 4 to inspect variance
+X_pca_sk_full = pca_sk.fit_transform(X)
+
+print("Sklearn variance explained ratio:", np.round(pca_sk.explained_variance_ratio_, 4))
+print("Scratch variance explained ratio: ", np.round(var_ratio, 4))
+print()
+print("Cumulative (sklearn):", np.round(np.cumsum(pca_sk.explained_variance_ratio_), 4))
+```
+
+```python
+X_pca_sk = X_pca_sk_full[:, :2] # first 2 components
+
+# Sign of eigenvectors can flip β align for visual comparison
+for col in range(2):
+ if np.corrcoef(X_pca_scratch[:, col], X_pca_sk[:, col])[0, 1] < 0:
+ X_pca_scratch[:, col] *= -1
+
+fig, axes = plt.subplots(1, 2, figsize=(14, 5), sharex=True, sharey=True)
+
+for ax, data, title in zip(axes,
+ [X_pca_scratch, X_pca_sk],
+ ["PCA (from scratch)", "PCA (scikit-learn)"]):
+ for i, name in enumerate(target_names):
+ mask = y == i
+ ax.scatter(data[mask, 0], data[mask, 1], label=name,
+ alpha=0.7, edgecolors="k", linewidth=0.5,
+ color=colors[i], s=60)
+ ax.set_xlabel("PC 1")
+ ax.set_ylabel("PC 2")
+ ax.set_title(title)
+ ax.legend()
+ ax.grid(alpha=0.3)
+
+plt.suptitle("Scratch vs Scikit-learn PCA – Identical Results", fontsize=14, y=1.02)
+plt.tight_layout()
+plt.show()
+```
+
+The two plots are virtually identical (eigenvector signs may differ, which is cosmetic). This confirms our from-scratch implementation is correct.
+
+---
+
+## 4. t-SNE
+
+### What is t-SNE?
+
+**t-distributed Stochastic Neighbor Embedding (t-SNE)** is a non-linear dimensionality-reduction technique designed specifically for **visualization**.
+
+Key properties:
+- Preserves **local structure**: points that are close in high-dimensional space stay close in the 2-D embedding.
+- Does **not** preserve global distances – clusters may move relative to each other between runs.
+- Computationally expensive, and has no `transform()` for new data – so it is not suitable as a preprocessing step in machine-learning pipelines.
+- The **perplexity** parameter (roughly: how many neighbors each point considers) strongly influences the result. Typical range: 5–50.
+
+**Rule of thumb:** Use PCA when you need a general-purpose reduction (for modeling, compression, noise removal). Use t-SNE when your sole goal is to *see* cluster structure in 2-D.
+
+```python
+from sklearn.manifold import TSNE
+
+tsne = TSNE(n_components=2, perplexity=30, random_state=42, n_iter=1000)  # n_iter is renamed max_iter in scikit-learn >= 1.5
+X_tsne = tsne.fit_transform(X)
+
+print(f"t-SNE output shape: {X_tsne.shape}")
+```
+
+```python
+# Side-by-side: PCA vs t-SNE
+fig, axes = plt.subplots(1, 2, figsize=(14, 5))
+
+for ax, data, title in zip(axes,
+ [X_pca_sk, X_tsne],
+ ["PCA (linear)", "t-SNE (non-linear)"]):
+ for i, name in enumerate(target_names):
+ mask = y == i
+ ax.scatter(data[mask, 0], data[mask, 1], label=name,
+ alpha=0.7, edgecolors="k", linewidth=0.5,
+ color=colors[i], s=60)
+ ax.set_xlabel("Dim 1")
+ ax.set_ylabel("Dim 2")
+ ax.set_title(title)
+ ax.legend()
+ ax.grid(alpha=0.3)
+
+plt.suptitle("PCA vs t-SNE – Iris Dataset", fontsize=14, y=1.02)
+plt.tight_layout()
+plt.show()
+```
+
+```python
+# Effect of perplexity on t-SNE
+perplexities = [5, 15, 30, 50]
+fig, axes = plt.subplots(1, 4, figsize=(20, 4))
+
+for ax, perp in zip(axes, perplexities):
+ embedding = TSNE(n_components=2, perplexity=perp,
+ random_state=42, n_iter=1000).fit_transform(X)
+ for i, name in enumerate(target_names):
+ mask = y == i
+ ax.scatter(embedding[mask, 0], embedding[mask, 1],
+ alpha=0.7, color=colors[i], s=40, edgecolors="k",
+ linewidth=0.3, label=name)
+ ax.set_title(f"Perplexity = {perp}")
+ ax.set_xticks([])
+ ax.set_yticks([])
+
+axes[0].legend(fontsize=8)
+plt.suptitle("t-SNE: Impact of Perplexity", fontsize=14, y=1.04)
+plt.tight_layout()
+plt.show()
+```
+
+**Observations on perplexity:**
+- Low perplexity (5): focuses on very local neighbors – clusters may fragment.
+- High perplexity (50): considers more neighbors – clusters become rounder and more global structure is visible, but fine local detail may blur.
+- There is no single "correct" perplexity; try several and look for consistent patterns.
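+
+One practical pattern worth knowing (our addition; it is unnecessary for 4-feature Iris): on genuinely high-dimensional data, reduce to roughly 50 dimensions with PCA first, then run t-SNE on the result. This speeds t-SNE up considerably and suppresses noise. A sketch on the 64-dimensional digits dataset:
+
+```python
+from sklearn.datasets import load_digits
+from sklearn.decomposition import PCA
+from sklearn.manifold import TSNE
+
+# Subsample to keep the run fast
+X_digits = load_digits().data[:500]
+
+X_reduced = PCA(n_components=50, random_state=42).fit_transform(X_digits)
+embedding = TSNE(n_components=2, perplexity=30,
+                 random_state=42).fit_transform(X_reduced)
+print(embedding.shape)  # (500, 2)
+```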
+
+---
+
+## 5. Anomaly Detection
+
+### Why Unsupervised Anomaly Detection?
+
+In many real-world scenarios, labeled anomalies are scarce or non-existent:
+
+| Domain | Normal | Anomaly |
+|--------|--------|---------|
+| Banking | Legitimate transactions | Fraud |
+| Manufacturing | Good products | Defects |
+| Cybersecurity | Regular traffic | Intrusions |
+
+Unsupervised methods learn the distribution of *normal* data and flag anything that doesn't fit.
+
+### Approach 1 – Z-Score
+
+Flag a point as anomalous if any feature has a Z-score \(|z| > \tau\) (e.g., \(\tau = 3\)). Simple, but assumes Gaussian features and works only for univariate or low-dimensional data.
+
+### Approach 2 – Isolation Forest
+
+The **Isolation Forest** algorithm isolates observations by randomly selecting a feature and a split value. Anomalies are easier to isolate (fewer splits needed), so they have shorter average path lengths in the trees.
+
+Advantages:
+- Works well in high dimensions
+- No distribution assumptions
+- Linear time complexity
+
+```python
+from sklearn.ensemble import IsolationForest
+from scipy import stats
+
+np.random.seed(42)
+
+# Generate normal data: 2 clusters
+normal_a = np.random.randn(150, 2) * 0.8 + np.array([2, 2])
+normal_b = np.random.randn(150, 2) * 0.8 + np.array([-2, -2])
+normal_data = np.vstack([normal_a, normal_b])
+
+# Inject 20 anomalies scattered far from the clusters
+anomalies = np.random.uniform(low=-6, high=6, size=(20, 2))
+
+X_anom = np.vstack([normal_data, anomalies])
+labels_true = np.array([0] * len(normal_data) + [1] * len(anomalies)) # 0=normal, 1=anomaly
+
+print(f"Total points: {len(X_anom)} (normal: {len(normal_data)}, anomalies: {len(anomalies)})")
+```
+
+```python
+# Z-Score method
+z_scores = np.abs(stats.zscore(X_anom))
+z_threshold = 3.0
+z_anomaly_mask = (z_scores > z_threshold).any(axis=1)
+
+print(f"Z-Score method detected {z_anomaly_mask.sum()} anomalies (threshold={z_threshold})")
+```
+
+```python
+# Isolation Forest
+# contamination sets the expected anomaly fraction (true rate here: 20/320 = 6.25%)
+iso_forest = IsolationForest(n_estimators=200, contamination=0.06,
+                             random_state=42)
+iso_preds = iso_forest.fit_predict(X_anom) # 1 = normal, -1 = anomaly
+iso_anomaly_mask = iso_preds == -1
+
+print(f"Isolation Forest detected {iso_anomaly_mask.sum()} anomalies")
+```
+
+```python
+fig, axes = plt.subplots(1, 3, figsize=(18, 5))
+
+# Ground truth
+axes[0].scatter(X_anom[labels_true == 0, 0], X_anom[labels_true == 0, 1],
+ c="steelblue", s=30, alpha=0.6, label="Normal")
+axes[0].scatter(X_anom[labels_true == 1, 0], X_anom[labels_true == 1, 1],
+ c="red", s=80, marker="X", label="True Anomaly")
+axes[0].set_title("Ground Truth")
+axes[0].legend()
+axes[0].grid(alpha=0.3)
+
+# Z-Score
+axes[1].scatter(X_anom[~z_anomaly_mask, 0], X_anom[~z_anomaly_mask, 1],
+ c="steelblue", s=30, alpha=0.6, label="Normal")
+axes[1].scatter(X_anom[z_anomaly_mask, 0], X_anom[z_anomaly_mask, 1],
+ c="red", s=80, marker="X", label="Detected Anomaly")
+axes[1].set_title(f"Z-Score (threshold={z_threshold})")
+axes[1].legend()
+axes[1].grid(alpha=0.3)
+
+# Isolation Forest
+axes[2].scatter(X_anom[~iso_anomaly_mask, 0], X_anom[~iso_anomaly_mask, 1],
+ c="steelblue", s=30, alpha=0.6, label="Normal")
+axes[2].scatter(X_anom[iso_anomaly_mask, 0], X_anom[iso_anomaly_mask, 1],
+ c="red", s=80, marker="X", label="Detected Anomaly")
+axes[2].set_title("Isolation Forest")
+axes[2].legend()
+axes[2].grid(alpha=0.3)
+
+plt.suptitle("Anomaly Detection Comparison", fontsize=14, y=1.02)
+plt.tight_layout()
+plt.show()
+```
+
+**Key takeaway:** The Isolation Forest typically outperforms the Z-Score method, especially when the data is multi-modal or the anomalies are not simply extreme values along a single axis.
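+
+To quantify the comparison rather than eyeball it, you can score both detectors against the known labels. The sketch below regenerates the same data as the cells above and adds precision/recall scoring (the scoring step is our addition):
+
+```python
+import numpy as np
+from scipy import stats
+from sklearn.ensemble import IsolationForest
+from sklearn.metrics import precision_score, recall_score
+
+# Same seed and call order as above, so the data is identical
+np.random.seed(42)
+normal = np.vstack([np.random.randn(150, 2) * 0.8 + [2, 2],
+                    np.random.randn(150, 2) * 0.8 + [-2, -2]])
+anoms = np.random.uniform(-6, 6, (20, 2))
+X = np.vstack([normal, anoms])
+y = np.array([0] * 300 + [1] * 20)  # 1 = anomaly
+
+z_pred = (np.abs(stats.zscore(X)) > 3).any(axis=1).astype(int)
+iso_pred = (IsolationForest(n_estimators=200, contamination=0.06,
+                            random_state=42).fit_predict(X) == -1).astype(int)
+
+for name, pred in [("Z-score", z_pred), ("IsoForest", iso_pred)]:
+    print(f"{name}: precision={precision_score(y, pred, zero_division=0):.2f}, "
+          f"recall={recall_score(y, pred):.2f}")
+```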
+
+---
+
+## 6. Capstone – Customer Segmentation
+
+We build a complete customer-segmentation pipeline:
+
+1. Generate & save a synthetic customer dataset
+2. Feature scaling
+3. Dimensionality reduction with PCA
+4. Elbow method to choose optimal \(K\)
+5. K-Means clustering
+6. Segment profiling & visualization
+7. Business recommendations
+
+### 6.1 Generate Synthetic Customer Data
+
+We create five features that mimic a retail scenario:
+
+| Feature | Description |
+|---------|-------------|
+| `age` | Customer age (18–70) |
+| `income` | Annual income in $k (15–150) |
+| `spending_score` | In-store spending score (1–100) |
+| `visits` | Monthly store visits (0–30) |
+| `online_ratio` | Fraction of purchases made online (0–1) |
+
+```python
+import pandas as pd
+import os
+
+np.random.seed(42)
+n_customers = 500
+
+# Segment 1: Young, moderate income, high online, high spending
+seg1 = {
+ "age": np.random.normal(25, 4, 130).clip(18, 40),
+ "income": np.random.normal(45, 12, 130).clip(15, 80),
+ "spending_score": np.random.normal(75, 10, 130).clip(1, 100),
+ "visits": np.random.normal(8, 3, 130).clip(0, 30),
+ "online_ratio": np.random.normal(0.75, 0.1, 130).clip(0, 1),
+}
+
+# Segment 2: Middle-aged, high income, balanced channel, moderate spending
+seg2 = {
+ "age": np.random.normal(42, 6, 150).clip(28, 60),
+ "income": np.random.normal(95, 18, 150).clip(50, 150),
+ "spending_score": np.random.normal(55, 12, 150).clip(1, 100),
+ "visits": np.random.normal(15, 5, 150).clip(0, 30),
+ "online_ratio": np.random.normal(0.45, 0.15, 150).clip(0, 1),
+}
+
+# Segment 3: Older, lower income, low online, low spending
+seg3 = {
+ "age": np.random.normal(58, 7, 120).clip(40, 70),
+ "income": np.random.normal(35, 10, 120).clip(15, 70),
+ "spending_score": np.random.normal(25, 10, 120).clip(1, 100),
+ "visits": np.random.normal(20, 5, 120).clip(0, 30),
+ "online_ratio": np.random.normal(0.15, 0.08, 120).clip(0, 1),
+}
+
+# Segment 4: Mixed ages, very high income, high spending, moderate visits
+seg4 = {
+ "age": np.random.normal(38, 10, 100).clip(18, 70),
+ "income": np.random.normal(120, 15, 100).clip(80, 150),
+ "spending_score": np.random.normal(85, 8, 100).clip(1, 100),
+ "visits": np.random.normal(12, 4, 100).clip(0, 30),
+ "online_ratio": np.random.normal(0.55, 0.15, 100).clip(0, 1),
+}
+
+frames = []
+for seg in [seg1, seg2, seg3, seg4]:
+ frames.append(pd.DataFrame(seg))
+
+df_customers = pd.concat(frames, ignore_index=True)
+df_customers = df_customers.sample(frac=1, random_state=42).reset_index(drop=True)
+
+df_customers["age"] = df_customers["age"].round(0).astype(int)
+df_customers["income"] = df_customers["income"].round(1)
+df_customers["spending_score"] = df_customers["spending_score"].round(0).astype(int)
+df_customers["visits"] = df_customers["visits"].round(0).astype(int)
+df_customers["online_ratio"] = df_customers["online_ratio"].round(2)
+
+# Save to CSV (run from chapter folder: chapters/chapter-08-unsupervised-learning/)
+dataset_dir = "datasets"
+os.makedirs(dataset_dir, exist_ok=True)
+csv_path = os.path.join(dataset_dir, "customers.csv")
+df_customers.to_csv(csv_path, index=False)
+print(f"Saved {len(df_customers)} rows to {csv_path}")
+df_customers.head(10)
+```
+
+### 6.2 Feature Scaling
+
+```python
+from sklearn.preprocessing import StandardScaler
+
+feature_cols = ["age", "income", "spending_score", "visits", "online_ratio"]
+X_cust = df_customers[feature_cols].values
+
+scaler = StandardScaler()
+X_scaled = scaler.fit_transform(X_cust)
+
+print("Scaled means (≈0):", np.round(X_scaled.mean(axis=0), 4))
+print("Scaled stds (≈1):", np.round(X_scaled.std(axis=0), 4))
+```
+
+### 6.3 PCA for Dimensionality Reduction
+
+```python
+pca_cust = PCA(n_components=5)
+X_pca_cust = pca_cust.fit_transform(X_scaled)
+
+cum_var = np.cumsum(pca_cust.explained_variance_ratio_)
+
+plt.figure(figsize=(7, 4))
+plt.bar(range(1, 6), pca_cust.explained_variance_ratio_,
+ color="steelblue", edgecolor="black", alpha=0.7, label="Individual")
+plt.step(range(1, 6), cum_var, where="mid", color="darkorange",
+ linewidth=2, label="Cumulative")
+plt.axhline(0.90, color="red", linestyle="--", alpha=0.7, label="90% threshold")
+plt.xlabel("Principal Component")
+plt.ylabel("Variance Explained")
+plt.title("Customer Data β PCA Variance Explained")
+plt.xticks(range(1, 6))
+plt.legend()
+plt.tight_layout()
+plt.show()
+
+n_keep = np.argmax(cum_var >= 0.90) + 1
+print(f"\nComponents needed for ≥90% variance: {n_keep}")
+print(f"Using first 2 components for visualization ({cum_var[1]:.1%} variance).")
+```
+
+### 6.4 K-Means β Elbow Method
+
+```python
+from sklearn.cluster import KMeans
+
+K_range = range(2, 11)
+inertias = []
+
+for k in K_range:
+ km = KMeans(n_clusters=k, n_init=10, random_state=42)
+ km.fit(X_scaled)
+ inertias.append(km.inertia_)
+
+plt.figure(figsize=(8, 4))
+plt.plot(list(K_range), inertias, "o-", linewidth=2, color="steelblue")
+plt.xlabel("Number of Clusters (K)")
+plt.ylabel("Inertia (within-cluster sum of squares)")
+plt.title("Elbow Method for Optimal K")
+plt.xticks(list(K_range))
+plt.grid(alpha=0.3)
+plt.tight_layout()
+plt.show()
+
+print("Look for the 'elbow' – the point where adding more clusters yields")
+print("diminishing returns. Here K=4 appears to be a good choice.")
+```
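+
+The elbow can be ambiguous, so it is worth cross-checking with the silhouette score covered in Notebook 02 (higher is better). A self-contained sketch on stand-in blob data; in the notebook itself you would pass the scaled customer matrix `X_scaled` instead:
+
+```python
+from sklearn.cluster import KMeans
+from sklearn.datasets import make_blobs
+from sklearn.metrics import silhouette_score
+
+# Stand-in data with 4 planted clusters (parameters are ours)
+X_demo, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.2,
+                       random_state=42)
+
+for k in range(2, 7):
+    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
+    print(f"K={k}: silhouette = {silhouette_score(X_demo, labels):.3f}")
+```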
+
+### 6.5 Fit K-Means with Optimal K
+
+```python
+optimal_k = 4
+km_final = KMeans(n_clusters=optimal_k, n_init=20, random_state=42)
+cluster_labels = km_final.fit_predict(X_scaled)
+
+df_customers["cluster"] = cluster_labels
+print(f"Cluster distribution:\n{df_customers['cluster'].value_counts().sort_index()}")
+```
+
+### 6.6 Segment Profiling
+
+```python
+segment_profile = df_customers.groupby("cluster")[feature_cols].mean().round(2)
+segment_profile["count"] = df_customers.groupby("cluster").size()
+print("=== Segment Profiles ===")
+segment_profile
+```
+
+```python
+# Bar charts comparing mean feature values across clusters
+fig, axes = plt.subplots(1, len(feature_cols), figsize=(18, 4), sharey=True)
+cluster_colors = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728"]
+
+for idx, feat in enumerate(feature_cols):
+ means = df_customers.groupby("cluster")[feat].mean()
+ axes[idx].bar(means.index, means.values,
+ color=cluster_colors[:optimal_k], edgecolor="black")
+ axes[idx].set_title(feat, fontsize=11)
+ axes[idx].set_xlabel("Cluster")
+ axes[idx].set_xticks(range(optimal_k))
+
+axes[0].set_ylabel("Mean Value")
+plt.suptitle("Feature Means by Cluster", fontsize=14, y=1.02)
+plt.tight_layout()
+plt.show()
+```
+
+### 6.7 Visualize Segments in 2-D (PCA Projection)
+
+```python
+X_vis = X_pca_cust[:, :2]
+centroids_scaled = km_final.cluster_centers_
+centroids_2d = pca_cust.transform(centroids_scaled)[:, :2] # project centroids
+
+plt.figure(figsize=(9, 7))
+for c in range(optimal_k):
+ mask = cluster_labels == c
+ plt.scatter(X_vis[mask, 0], X_vis[mask, 1], s=40, alpha=0.6,
+ color=cluster_colors[c], edgecolors="k", linewidth=0.3,
+ label=f"Segment {c}")
+
+plt.scatter(centroids_2d[:, 0], centroids_2d[:, 1], s=250, c="black",
+ marker="*", zorder=5, label="Centroids")
+
+plt.xlabel(f"PC 1 ({pca_cust.explained_variance_ratio_[0]:.1%} var)")
+plt.ylabel(f"PC 2 ({pca_cust.explained_variance_ratio_[1]:.1%} var)")
+plt.title("Customer Segments – PCA 2-D Projection")
+plt.legend()
+plt.grid(alpha=0.3)
+plt.tight_layout()
+plt.show()
+```
+
+### 6.8 Business Recommendations
+
+K-Means numbers its clusters arbitrarily, so check the segment-profile table above to confirm which cluster ID corresponds to which persona before acting; the mapping below is illustrative.
+
+```python
+recommendations = {
+ 0: {
+ "label": "Budget Traditionalists",
+ "description": "Older customers with low income and spending, who shop mostly in-store.",
+ "actions": [
+ "Offer loyalty discounts and in-store promotions",
+ "Simplify the in-store experience",
+ "Provide personalized coupons at checkout",
+ ],
+ },
+ 1: {
+ "label": "Young Digital Shoppers",
+ "description": "Young customers with moderate income but high online engagement and spending.",
+ "actions": [
+ "Invest in mobile app features and social media marketing",
+ "Offer free shipping and digital-only deals",
+ "Launch a referral program to leverage their network",
+ ],
+ },
+ 2: {
+ "label": "Premium High-Spenders",
+        "description": "High income, high spending score – the most valuable segment.",
+ "actions": [
+ "Create a VIP/premium loyalty tier",
+ "Offer early access to new products",
+ "Assign dedicated account managers for retention",
+ ],
+ },
+ 3: {
+ "label": "Established Moderates",
+ "description": "Middle-aged, higher income, moderate spending, balanced channel use.",
+ "actions": [
+ "Cross-sell higher-margin products",
+ "Provide omni-channel convenience (buy online, pick up in store)",
+ "Target with email campaigns for seasonal offers",
+ ],
+ },
+}
+
+for seg_id, info in recommendations.items():
+ count = (cluster_labels == seg_id).sum()
+ print(f"\n{'='*60}")
+ print(f"Segment {seg_id}: {info['label']} (n={count})")
+ print(f"{'='*60}")
+ print(f" {info['description']}")
+ print(" Recommended actions:")
+ for action in info["actions"]:
+        print(f"      • {action}")
+```
+
+---
+
+## 7. Summary
+
+### What We Covered in This Notebook
+
+| Topic | Key Idea |
+|-------|----------|
+| **PCA** | Linear projection onto directions of maximum variance |
+| **t-SNE** | Non-linear embedding that preserves local neighborhoods – for visualization only |
+| **Z-Score Anomaly Detection** | Simple threshold on standardized values |
+| **Isolation Forest** | Tree-based anomaly detector – fast, distribution-free |
+| **Customer Segmentation** | End-to-end pipeline: scale → PCA → K-Means → profile → recommend |
+
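+The two anomaly-detection rows can be contrasted in a few lines. A quick sketch on synthetic one-dimensional data (all names, thresholds, and planted outliers are local to this example):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(50, 5, 500), [120.0, -30.0]])  # two planted outliers

# Z-score: flag points more than 3 standard deviations from the mean
z = (values - values.mean()) / values.std()
z_outliers = np.where(np.abs(z) > 3)[0]

# Isolation Forest: tree-based, no distribution assumption (-1 marks anomalies)
iso = IsolationForest(contamination=0.01, random_state=0)
iso_flags = iso.fit_predict(values.reshape(-1, 1))
iso_outliers = np.where(iso_flags == -1)[0]

print("Z-score outliers:        ", z_outliers)
print("Isolation Forest outliers:", iso_outliers)
```

Both methods catch the planted extremes here; on multi-modal or skewed data the z-score threshold breaks down while Isolation Forest keeps working, which is why the table calls it distribution-free.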
+### Chapter 8 Recap
+
+Across the three notebooks you have:
+
+1. **Notebook 01 (Introduction):** Learned K-Means, hierarchical clustering, and evaluation metrics.
+2. **Notebook 02 (Intermediate):** Explored DBSCAN, Gaussian Mixture Models, and silhouette analysis.
+3. **Notebook 03 (Advanced β this one):** Mastered PCA, t-SNE, anomaly detection, and built a full capstone project.
+
+### What's Next
+
+In **Chapter 9: Deep Learning** we'll move from classical ML to neural networks, starting with perceptrons and backpropagation, then building your first deep network with PyTorch/Keras.
+
+---
+*Generated by Berta AI | Created by Luigi Pascal Rondanini*
diff --git a/docs/chapters/index.md b/docs/chapters/index.md
index cd83b10..ba4b1ea 100644
--- a/docs/chapters/index.md
+++ b/docs/chapters/index.md
@@ -48,6 +48,10 @@ Apply your knowledge to real-world ML and AI problems.
*10h · 3 notebooks, 5 exercises, 3 SVGs*
Regression, regularization; classification, SVM, ROC; ensembles, tuning, credit-risk
+- **Ch 8: [Unsupervised Learning](chapter-08.md)**
+  *8h · 3 notebooks, 5 exercises, 3 SVGs*
+ K-Means, hierarchical, DBSCAN; PCA, t-SNE; anomaly detection, customer segmentation
+
---
@@ -63,15 +67,16 @@ Apply your knowledge to real-world ML and AI problems.
| [5: Software Design](chapter-05.md) | Foundation | 6h | 3 | 5 | 3 |
| [6: Intro to ML](chapter-06.md) | Practitioner | 8h | 3 | 5 | 3 |
| [7: Supervised Learning](chapter-07.md) | Practitioner | 10h | 3 | 5 | 3 |
+| [8: Unsupervised Learning](chapter-08.md) | Practitioner | 8h | 3 | 5 | 3 |
---
## Coming Soon
-!!! info "Chapters 8β25"
+!!! info "Chapters 9β25"
Additional chapters are planned for the Practitioner and Advanced tracks:
- - **Practitioner** (8β15): Unsupervised Learning, Deep Learning, NLP, LLMs, Prompt Engineering, RAG, Fine-tuning, MLOps
+ - **Practitioner** (9β15): Deep Learning, NLP, LLMs, Prompt Engineering, RAG, Fine-tuning, MLOps
- **Advanced** (16β25): Multi-Agent Systems, Advanced RAG, Reinforcement Learning, Model Optimization, Production AI, Finance, Safety, AI Products, Research, Governance & Ethics
[Request a custom chapter](../guides/chapter-requests.md) on any AI topic while you wait!
diff --git a/docs/guides/curriculum.md b/docs/guides/curriculum.md
index 2e997d7..6401d23 100644
--- a/docs/guides/curriculum.md
+++ b/docs/guides/curriculum.md
@@ -30,7 +30,7 @@ Apply knowledge to real-world ML and AI problems.
|---|---------|-------|--------|------|
| 6 | Introduction to Machine Learning | 8h | Available | [chapter-06.md](../chapters/chapter-06.md) |
| 7 | Supervised Learning | 10h | Available | [chapter-07.md](../chapters/chapter-07.md) |
-| 8 | Unsupervised Learning | 8h | Coming soon | – |
+| 8 | Unsupervised Learning | 8h | Available | [chapter-08.md](../chapters/chapter-08.md) |
| 9 | Deep Learning Fundamentals | 12h | Coming soon | – |
| 10 | Natural Language Processing Basics | 10h | Coming soon | – |
| 11 | Large Language Models & Transformers | 10h | Coming soon | – |
@@ -39,7 +39,7 @@ Apply knowledge to real-world ML and AI problems.
| 14 | Fine-tuning & Adaptation | 8h | Coming soon | – |
| 15 | MLOps & Deployment | 8h | Coming soon | – |
-**Total: 88 hours (18h available)**
+**Total: 88 hours (26h available)**
---
@@ -69,9 +69,9 @@ Master complex topics and specialized domains.
| Track | Chapters | Total Hours | Available |
|-------|----------|-------------|-----------|
| Foundation | 1–5 | 38h | 5/5 |
-| Practitioner | 6–15 | 88h | 2/10 |
+| Practitioner | 6–15 | 88h | 3/10 |
| Advanced | 16–25 | 84h | 0/10 |
-| **Total** | **25** | **210h+** | **7** |
+| **Total** | **25** | **210h+** | **8** |
---
diff --git a/docs/guides/roadmap.md b/docs/guides/roadmap.md
index d9161ef..528ba8f 100644
--- a/docs/guides/roadmap.md
+++ b/docs/guides/roadmap.md
@@ -10,10 +10,10 @@ Our vision for the future of AI education. Priorities evolve based on community
|-----------|--------|
| Master Repository | Live |
| Foundation Track | Complete (5 chapters) |
-| Practitioner Track | In progress (2 of 10 chapters) |
+| Practitioner Track | In progress (3 of 10 chapters) |
| Advanced Track | Planned (10 chapters) |
| Community Requests | Starting |
-| **Available Now** | 7 chapters, 56 hours, 21 SVGs |
+| **Available Now** | 8 chapters, 64 hours, 24 SVGs |
---
@@ -28,13 +28,13 @@ Our vision for the future of AI education. Priorities evolve based on community
## Phase 1: Foundation & Launch β Complete
!!! success "Complete"
- Foundation Track complete. Chapters 6-7 available. Core infrastructure done.
+ Foundation Track complete. Chapters 6-8 available. Core infrastructure done.
### Objectives
- [x] Establish master repository
- [x] Complete Foundation Track (Chapters 1-5)
-- [x] Begin Practitioner Track (Ch 6-7)
+- [x] Begin Practitioner Track (Ch 6-8)
- [ ] Establish community request process
- [ ] Build first 100 community chapters
@@ -63,7 +63,7 @@ Our vision for the future of AI education. Priorities evolve based on community
|---|---------|--------|
| 6 | Introduction to ML | Done |
| 7 | Supervised Learning | Done |
-| 8 | Unsupervised Learning | Next |
+| 8 | Unsupervised Learning | Done |
| 9 | Deep Learning Fundamentals | Planned |
| 10 | NLP Basics | Planned |
| 11 | LLMs & Transformers | Planned |
diff --git a/docs/guides/syllabus.md b/docs/guides/syllabus.md
index ed94091..c8f174e 100644
--- a/docs/guides/syllabus.md
+++ b/docs/guides/syllabus.md
@@ -16,7 +16,7 @@ graph TD
CH6["Ch 6: Intro to ML
8h | Available"]
CH7["Ch 7: Supervised Learning
10h | Available"]
- CH8["Ch 8: Unsupervised Learning
8h | Coming Soon"]
+ CH8["Ch 8: Unsupervised Learning
8h | Available"]
CH9["Ch 9: Deep Learning
12h | Coming Soon"]
CH10["Ch 10: NLP Basics
10h | Coming Soon"]
CH11["Ch 11: LLMs & Transformers
10h | Coming Soon"]
@@ -56,7 +56,7 @@ graph TD
style CH5 fill:#4caf50,color:#fff
style CH6 fill:#4caf50,color:#fff
style CH7 fill:#4caf50,color:#fff
- style CH8 fill:#f3e5f5
+ style CH8 fill:#4caf50,color:#fff
style CH9 fill:#f3e5f5
style CH10 fill:#f3e5f5
style CH11 fill:#f3e5f5
@@ -89,7 +89,7 @@ graph TD
| 5 | Software Design | Foundation | 6h | Available |
| 6 | Introduction to ML | Practitioner | 8h | Available |
| 7 | Supervised Learning | Practitioner | 10h | Available |
-| 8 | Unsupervised Learning | Practitioner | 8h | Coming soon |
+| 8 | Unsupervised Learning | Practitioner | 8h | Available |
| 9 | Deep Learning Fundamentals | Practitioner | 12h | Coming soon |
| 10 | NLP Basics | Practitioner | 10h | Coming soon |
| 11 | LLMs & Transformers | Practitioner | 10h | Coming soon |
diff --git a/docs/index.md b/docs/index.md
index e7c82c5..a3fb797 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -28,27 +28,27 @@ Free. Open-source. Community-driven. Generated by [Berta AI](https://berta.one).
" + chapterDescription + "
", + "", + " ", + " Read Chapter " + chapterNumber + " Now", + "
", + "What's included:
", + "Browse all chapters | ", + " Try the Playground
", + "",
+ " You're receiving this because you subscribed at " + siteUrl + "
",
+ " To unsubscribe, reply to this email with 'unsubscribe'.",
+ "
", + " Created by Luigi Pascal Rondanini | ", + " Powered by Berta AI", + "
", + "Thank you for subscribing to updates.
", "You'll receive an email when new chapters are published. At most one email per week.
", "Latest chapter:
", + "Chapter 8: Unsupervised Learning β ", + " K-Means, hierarchical clustering, DBSCAN, PCA, t-SNE, anomaly detection, and a customer segmentation capstone.
", "Start learning now:
", "