diff --git a/ABOUT.md b/ABOUT.md index a739b9d..68ee428 100644 --- a/ABOUT.md +++ b/ABOUT.md @@ -155,7 +155,7 @@ Both exist simultaneously, creating a living curriculum. ### Near-term (2026) - ✅ Launch master repository (this!) - ✅ Complete Foundation Track (5 chapters — all available!) -- 🔄 Release Practitioner Track (2 of 10 chapters available) +- 🔄 Release Practitioner Track (3 of 10 chapters available) - 🔄 Establish community request process - 🔄 Build 100+ community-contributed chapters @@ -178,11 +178,11 @@ Both exist simultaneously, creating a living curriculum. ## 📊 By The Numbers **Current State:** -- 7 chapters available (Foundation complete + Practitioner started) +- 8 chapters available (Foundation complete + Practitioner started) - 21 Jupyter notebooks with interactive content - 21 professional SVG diagrams - 37 exercises with solutions -- 56 hours of learning content available +- 64 hours of learning content available - 5 practice datasets - 25+ total chapters planned - $0 barrier to entry diff --git a/GITHUB_PROFILE_README.md b/GITHUB_PROFILE_README.md index b10f849..66f93ad 100644 --- a/GITHUB_PROFILE_README.md +++ b/GITHUB_PROFILE_README.md @@ -10,7 +10,7 @@ **[Berta AI](https://berta.one)** — AI-powered tools for tomorrow's world -- **[Berta Chapters](https://github.com/luigipascal/berta-chapters)** — Free, open-source AI curriculum. 7 chapters live, 25 planned. Learn Python to production ML through interactive notebooks, exercises, and an online playground. No paywall, no signup. +- **[Berta Chapters](https://github.com/luigipascal/berta-chapters)** — Free, open-source AI curriculum. 8 chapters live, 25 planned. Learn Python to production ML through interactive notebooks, exercises, and an online playground. No paywall, no signup. - **[LLM Cost Optimizer](https://llm.berta.one)** — Cut LLM API costs 80-95% while keeping data private. Local processing, text anonymization, automatic model routing.
- **OrbaOS** — A framework for post-project work. AI handles coordination so teams focus on strategy and creative output. diff --git a/README.md b/README.md index a6b150f..59d88a5 100644 --- a/README.md +++ b/README.md @@ -53,7 +53,7 @@ Apply what you've learned to real-world machine learning and AI problems. |---------|-------|------|--------| | 6 | [Introduction to Machine Learning](./chapters/chapter-06-intro-machine-learning/) | 8h | ✅ Available | | 7 | [Supervised Learning: Regression & Classification](./chapters/chapter-07-supervised-learning/) | 10h | ✅ Available | -| 8 | Unsupervised Learning: Clustering & Dimensionality Reduction | 8h | 🔄 Coming Soon | +| 8 | [Unsupervised Learning: Clustering & Dimensionality Reduction](./chapters/chapter-08-unsupervised-learning/) | 8h | ✅ Available | | 9 | Deep Learning Fundamentals | 12h | 🔄 Coming Soon | | 10 | Natural Language Processing Basics | 10h | 🔄 Coming Soon | | 11 | Large Language Models & Transformers | 10h | 🔄 Coming Soon | @@ -268,7 +268,7 @@ pie title Curriculum Breakdown "Community Requested" : 999 ``` -- **Chapters Available Now**: 7 (56 hours of content) +- **Chapters Available Now**: 8 (64 hours of content) - **Total Planned Chapters**: 25+ - **Jupyter Notebooks**: 21 interactive notebooks - **SVG Diagrams**: 21 professional diagrams diff --git a/ROADMAP.md b/ROADMAP.md index 8f05088..a52b19c 100644 --- a/ROADMAP.md +++ b/ROADMAP.md @@ -8,11 +8,11 @@ Our vision for the future of AI education.
This is a living document—prioritie **Master Repository**: ✅ Live **Foundation Track**: ✅ Complete (5 chapters available) -**Practitioner Track**: 🔄 In progress (2 of 10 chapters available) +**Practitioner Track**: 🔄 In progress (3 of 10 chapters available) **Advanced Track**: 📋 Planned (10 chapters) **Community Requests**: 🚀 Starting (unlimited) **Total Planned**: 25+ chapters, 500+ hours of content -**Currently Available**: 7 chapters, 56 hours of content, 21 SVG diagrams +**Currently Available**: 8 chapters, 64 hours of content, 24 SVG diagrams --- @@ -21,7 +21,7 @@ Our vision for the future of AI education. This is a living document—prioritie ### Objectives - ✅ Establish master repository (DONE) - ✅ Complete Foundation Track (DONE) -- ✅ Begin Practitioner Track (Ch 6-7 available) +- ✅ Begin Practitioner Track (Ch 6-8 available) - 🔄 Establish community request process - 🔄 Build first 100 community chapters - ✅ Create core infrastructure and documentation (DONE) @@ -37,11 +37,11 @@ Our vision for the future of AI education. This is a living document—prioritie - One new chapter released per week - New chapters unlock after reaching **10 newsletter subscribers** - ✅ Foundation Track complete (Chapters 1-5) -- ✅ Practitioner Track started (Chapters 6-7) +- ✅ Practitioner Track started (Chapters 6-8) ### Metrics to Track - Newsletter subscribers (target: 10 to unlock weekly releases) -- Chapters completed: 7 / 25 +- Chapters completed: 8 / 25 - Community requests received - Stars on master repo @@ -59,7 +59,7 @@ Our vision for the future of AI education.
This is a living document—prioritie ### Practitioner Track Chapters - [x] Chapter 6: Introduction to Machine Learning - [x] Chapter 7: Supervised Learning (Regression & Classification) -- [ ] Chapter 8: Unsupervised Learning +- [x] Chapter 8: Unsupervised Learning - [ ] Chapter 9: Deep Learning Fundamentals - [ ] Chapter 10: Natural Language Processing Basics - [ ] Chapter 11: Large Language Models & Transformers diff --git a/SYLLABUS.md b/SYLLABUS.md index 04493d3..0601119 100644 --- a/SYLLABUS.md +++ b/SYLLABUS.md @@ -16,7 +16,7 @@ graph TD CH6["Ch 6: Intro to ML<br/>8h | Available"] CH7["Ch 7: Supervised Learning<br/>10h | Available"] - CH8["Ch 8: Unsupervised Learning<br/>8h | Coming Soon"] + CH8["Ch 8: Unsupervised Learning<br/>8h | Available"] CH9["Ch 9: Deep Learning<br/>12h | Coming Soon"] CH10["Ch 10: NLP Basics<br/>10h | Coming Soon"] CH11["Ch 11: LLMs & Transformers<br/>
10h | Coming Soon"] @@ -56,7 +56,7 @@ graph TD style CH5 fill:#4caf50,color:#fff style CH6 fill:#4caf50,color:#fff style CH7 fill:#4caf50,color:#fff - style CH8 fill:#f3e5f5 + style CH8 fill:#4caf50,color:#fff style CH9 fill:#f3e5f5 style CH10 fill:#f3e5f5 style CH11 fill:#f3e5f5 @@ -66,7 +66,7 @@ graph TD style CH15 fill:#f3e5f5 ``` -**Legend**: Green = Available | Purple = Practitioner (Coming Soon) | Chapters 1-7 fully available with SVG diagrams +**Legend**: Green = Available | Purple = Practitioner (Coming Soon) | Chapters 1-8 fully available with SVG diagrams --- @@ -81,7 +81,7 @@ graph TD | 5 | [Software Design & Best Practices](./chapters/chapter-05-software-design/) | Foundation | 6h | Available | 3 notebooks, scripts, 5 exercises, 3 SVGs | | 6 | [Introduction to Machine Learning](./chapters/chapter-06-intro-machine-learning/) | Practitioner | 8h | Available | 3 notebooks, scripts, 5 exercises, 3 SVGs | | 7 | [Supervised Learning](./chapters/chapter-07-supervised-learning/) | Practitioner | 10h | Available | 3 notebooks, scripts, 5 exercises, 3 SVGs | -| 8 | Unsupervised Learning | Practitioner | 8h | Planned | - | +| 8 | [Unsupervised Learning](./chapters/chapter-08-unsupervised-learning/) | Practitioner | 8h | Available | 3 notebooks, scripts, 5 exercises, 3 SVGs | | 9 | Deep Learning Fundamentals | Practitioner | 12h | Planned | - | | 10 | Natural Language Processing | Practitioner | 10h | Planned | - | | 11 | LLMs & Transformers | Practitioner | 10h | Planned | - | diff --git a/chapters/chapter-08-unsupervised-learning/README.md b/chapters/chapter-08-unsupervised-learning/README.md new file mode 100644 index 0000000..0326ae6 --- /dev/null +++ b/chapters/chapter-08-unsupervised-learning/README.md @@ -0,0 +1,61 @@ +# Chapter 8: Unsupervised Learning + +**Track**: Practitioner | **Time**: 8 hours | **Prerequisites**: Chapters 1-6 + +--- + +## Learning Objectives + +By the end of this chapter, you will be able to: + +- Understand the difference between 
supervised and unsupervised learning +- Implement K-Means clustering from scratch using NumPy +- Apply hierarchical (agglomerative) clustering and interpret dendrograms +- Use DBSCAN for density-based clustering with automatic cluster count detection +- Evaluate clusters with the silhouette score, inertia, and the elbow method +- Apply Principal Component Analysis (PCA) for dimensionality reduction +- Implement t-SNE for 2D visualization of high-dimensional data +- Perform anomaly detection with Isolation Forest and statistical methods +- Build a complete customer segmentation pipeline end-to-end + +--- + +## Chapter Structure + +``` +chapter-08-unsupervised-learning/ +├── README.md +├── requirements.txt +├── notebooks/ +│ ├── 01_introduction.ipynb # K-Means, evaluation metrics, elbow method +│ ├── 02_intermediate.ipynb # Hierarchical, DBSCAN, Gaussian Mixture Models +│ └── 03_advanced.ipynb # PCA, t-SNE, anomaly detection, customer segmentation capstone +├── scripts/ +│ ├── unsupervised_toolkit.py # KMeansScratch, PCA, plotting utilities +│ └── utilities.py # Helper functions +├── exercises/ +│ ├── exercises.py # 5 exercises +│ └── solutions/ +│ └── solutions.py # Complete solutions +├── assets/diagrams/ +│ ├── clustering_algorithms.svg # K-Means, Hierarchical, DBSCAN comparison +│ ├── dimensionality_reduction.svg # PCA and t-SNE visual +│ └── anomaly_detection.svg # Normal vs anomalous points +└── datasets/ + ├── customers.csv # Synthetic customer data (300+ rows) + └── sensors.csv # Synthetic sensor data with anomalies (200+ rows) +``` + +## Time Estimate + +| Section | Time | +|---------|------| +| Notebook 01: Introduction (Clustering Basics) | 2.5 hours | +| Notebook 02: Intermediate (Advanced Clustering) | 2.5 hours | +| Notebook 03: Advanced (Dimensionality Reduction & Capstone) | 3 hours | +| Exercises | Included in notebooks | +| **Total** | **8
hours** | + +--- + +**Generated by Berta AI | Created by Luigi Pascal Rondanini** diff --git a/chapters/chapter-08-unsupervised-learning/assets/diagrams/anomaly_detection.svg b/chapters/chapter-08-unsupervised-learning/assets/diagrams/anomaly_detection.svg new file mode 100644 index 0000000..92452f7 --- /dev/null +++ b/chapters/chapter-08-unsupervised-learning/assets/diagrams/anomaly_detection.svg @@ -0,0 +1,90 @@ + + + + + + Statistical (Z-Score) + + + + + + + + -3 sigma + +3 sigma + + + + + + + + ! + ! + Points beyond threshold + Simple, fast, assumes normal + + + + Isolation Forest + + + + + + + + + + anomaly + + + + Split 1 + Split 2 + + + short path + + + long path + Anomalies isolated quickly + Works with any distribution + + + + Applications + + + $ + Fraud Detection + Unusual transactions + + + ! + Manufacturing QA + Defective products + + + + + + Network Intrusion + Unusual traffic patterns + + + + + Health Monitoring + Abnormal sensor readings + + + IoT + Predictive Maintenance + Equipment failure warnings + diff --git a/chapters/chapter-08-unsupervised-learning/assets/diagrams/clustering_algorithms.svg b/chapters/chapter-08-unsupervised-learning/assets/diagrams/clustering_algorithms.svg new file mode 100644 index 0000000..f17f560 --- /dev/null +++ b/chapters/chapter-08-unsupervised-learning/assets/diagrams/clustering_algorithms.svg @@ -0,0 +1,92 @@ + + + + + + K-Means + + + + + + + + + + + + + + + + + + + + + + + + + + + Spherical clusters, fixed K + Assigns to nearest centroid + + + + Hierarchical + + + + + + + + + + + + + + + + cut + + A + B + C + D + E + Dendrogram, cut to get K + Bottom-up merging + + + + DBSCAN + + + + + + + + + + + + + + noise + noise + + + eps + Arbitrary shapes, auto K + Density-based, detects noise + diff --git a/chapters/chapter-08-unsupervised-learning/assets/diagrams/dimensionality_reduction.svg b/chapters/chapter-08-unsupervised-learning/assets/diagrams/dimensionality_reduction.svg new file mode 100644 index 0000000..7f4b92b 
--- /dev/null +++ b/chapters/chapter-08-unsupervised-learning/assets/diagrams/dimensionality_reduction.svg @@ -0,0 +1,81 @@ + + + + + + High-Dimensional Data + + + f1 f2 f3 f4 ... fN + + 2.1 0.3 1.7 4.2 ... 0.9 + 1.5 2.8 0.4 3.1 ... 1.2 + 3.2 1.1 2.9 0.8 ... 2.4 + ... n rows x d features ... + d = 50, 100, 1000+ + + Curse of dimensionality + Hard to visualize + Noisy, redundant features + Slow computation + + + + Reduce + + + + PCA (Linear) + + + PC1 + PC2 + + + + + + + + + + + + + + + + + PC1: 72% + PC2: 18% + Max variance directions + Global structure preserved + + + + t-SNE (Nonlinear) + + + + + + + + + + + + + + + + + Preserves local neighborhoods + Best for visualization + diff --git a/chapters/chapter-08-unsupervised-learning/datasets/customers.csv b/chapters/chapter-08-unsupervised-learning/datasets/customers.csv new file mode 100644 index 0000000..889bba4 --- /dev/null +++ b/chapters/chapter-08-unsupervised-learning/datasets/customers.csv @@ -0,0 +1,301 @@ +age,income,spending_score,visits,online_ratio +34,54766,39,9,0.54 +49,111074,67,9,0.25 +49,95903,27,1,0.21 +51,121394,41,4,0.38 +18,28010,72,16,1.0 +25,42858,84,13,0.73 +21,33927,54,8,0.96 +52,103491,78,10,0.93 +49,69579,40,0,0.39 +55,73860,68,3,0.23 +27,31448,57,15,0.78 +46,106849,69,9,0.37 +43,66166,93,12,0.67 +23,24789,58,15,0.91 +22,25771,64,16,0.59 +21,34690,48,10,0.8 +37,57274,47,4,0.2 +30,67012,30,6,0.51 +35,22561,64,5,0.59 +29,43617,92,13,0.84 +44,70128,46,5,0.28 +30,50813,85,7,0.77 +18,34524,60,10,0.68 +32,71483,70,5,0.63 +31,53350,61,7,0.26 +53,93115,32,0,0.29 +27,40318,72,9,0.79 +32,70778,36,6,0.46 +51,92763,37,5,0.29 +24,20205,56,8,0.9 +62,85856,30,4,0.3 +19,23195,55,11,0.67 +23,25355,76,10,0.87 +69,158343,80,8,0.62 +58,106604,82,4,0.73 +50,77316,90,7,0.78 +21,56478,28,8,0.43 +25,33556,79,15,0.83 +48,94480,37,3,0.6 +46,75340,49,5,0.33 +30,28927,73,7,0.86 +33,57335,73,3,0.49 +20,23720,54,9,0.82 +42,156212,82,10,0.5 +25,29032,75,15,0.8 +51,98058,41,4,0.46 +24,49848,56,11,0.59 
+43,46690,52,4,0.42 +37,38124,53,6,0.41 +28,27017,65,12,0.81 +54,128594,99,2,0.9 +58,45557,48,4,0.31 +35,120819,80,16,0.84 +54,101037,67,7,0.51 +54,72377,36,3,0.45 +42,96894,70,6,0.61 +26,17794,81,12,0.81 +21,40100,79,15,0.73 +23,25585,54,14,0.62 +31,56380,57,6,0.52 +38,53078,50,9,0.52 +29,31476,67,4,0.97 +29,117914,92,7,0.73 +31,45976,50,4,0.62 +29,46492,86,17,0.89 +28,45447,35,1,0.61 +29,38192,79,13,0.72 +58,121733,77,10,0.42 +20,36226,75,9,0.9 +40,124420,72,7,0.51 +50,66955,42,0,0.31 +28,29973,85,8,0.81 +24,28638,73,18,0.84 +45,153511,72,3,0.73 +39,66490,45,4,0.42 +55,89575,43,3,0.34 +55,75070,57,4,0.21 +23,41820,93,9,0.81 +39,49203,48,5,0.42 +30,28450,92,12,1.0 +37,91936,78,9,0.63 +62,64967,29,1,0.35 +52,51777,42,6,0.39 +25,29026,38,11,0.85 +35,29178,31,11,0.44 +40,46370,16,4,0.2 +46,41564,51,3,0.37 +51,77134,63,1,0.07 +33,146182,66,7,0.64 +38,55922,44,8,0.32 +20,28188,54,10,0.77 +44,76056,42,7,0.23 +59,80336,51,5,0.25 +37,88206,44,5,0.15 +45,62287,39,0,0.04 +34,71931,55,7,0.5 +35,53816,43,6,0.38 +42,55226,50,6,0.23 +24,31127,88,14,0.84 +20,15852,95,18,0.88 +25,32585,78,15,0.84 +32,88877,45,1,0.28 +27,47650,50,1,0.6 +23,59338,55,2,0.41 +61,116713,76,6,0.73 +23,25738,75,6,0.87 +25,54601,38,10,0.55 +45,57987,57,4,0.31 +20,29779,76,13,0.74 +26,44178,76,16,0.85 +26,43290,70,9,0.9 +22,26343,78,2,0.76 +38,107425,71,11,0.62 +32,126587,68,9,0.54 +37,55986,37,2,0.25 +37,114139,82,6,0.83 +47,100405,87,6,0.4 +23,39815,66,6,0.83 +48,97095,84,0,0.81 +42,76303,41,6,0.52 +25,118320,84,9,0.55 +48,102106,71,8,0.73 +53,67006,70,6,0.65 +19,32938,63,9,0.8 +24,37308,72,11,0.81 +34,51999,52,10,0.46 +37,132492,81,8,1.0 +44,131800,100,7,0.69 +27,29829,79,9,0.83 +31,64729,46,5,0.6 +34,64521,48,8,0.66 +27,37072,60,15,0.8 +28,35894,73,9,0.88 +35,35050,53,7,0.5 +43,100917,87,8,0.47 +45,68520,50,0,0.54 +18,41908,52,2,0.45 +38,86995,41,3,0.11 +43,41731,34,5,0.73 +51,179096,88,12,0.37 +34,100350,68,4,0.63 +29,19147,46,12,0.7 +25,70723,41,0,0.47 +29,71748,71,5,0.75 +29,17078,66,14,0.67 
+25,39317,66,17,0.61 +37,110354,45,1,0.3 +43,45329,47,7,0.66 +20,31604,83,17,0.82 +22,39189,60,18,0.76 +52,169812,81,5,0.78 +22,30493,54,7,0.84 +21,33430,62,12,0.9 +39,58361,53,6,0.61 +26,110321,100,5,0.55 +34,27063,50,9,0.82 +48,73561,48,2,0.09 +50,88828,41,0,0.31 +20,21422,78,5,0.74 +33,86932,31,3,0.0 +27,49763,48,1,0.44 +44,59552,39,0,0.32 +41,82845,59,2,0.27 +32,50559,65,1,0.31 +18,34953,78,3,0.9 +42,113005,83,4,0.61 +41,61731,48,3,0.17 +31,24175,74,13,0.89 +18,28536,72,8,0.77 +57,77136,45,4,0.27 +50,90796,41,6,0.25 +25,23606,54,13,0.88 +24,53712,41,2,0.43 +28,22373,43,14,0.88 +57,102261,46,2,0.02 +24,42997,57,9,0.75 +58,75304,37,4,0.62 +46,56715,27,4,0.48 +44,132046,65,6,0.76 +19,17494,60,6,0.69 +31,62656,46,7,0.6 +31,29377,65,12,0.84 +45,116863,90,4,0.57 +31,128196,78,8,0.45 +32,32390,77,10,0.83 +46,79008,42,0,0.32 +38,114255,78,12,0.68 +53,76294,34,2,0.27 +37,31248,44,5,0.39 +33,18493,82,6,0.8 +23,37353,69,4,0.78 +49,112161,62,12,0.63 +45,143045,73,9,0.83 +53,99819,50,5,0.1 +35,24636,27,4,0.67 +37,51567,65,8,0.5 +54,52001,44,5,0.32 +50,91727,56,7,0.38 +60,131840,88,11,0.65 +36,91859,53,1,0.29 +53,50448,45,3,0.17 +28,38239,65,0,0.77 +32,27321,66,8,0.83 +41,112622,78,9,0.43 +40,80214,45,4,0.07 +18,33388,94,11,0.92 +18,46500,100,11,0.82 +58,90294,42,0,0.23 +26,30193,74,12,0.79 +24,41297,47,15,0.72 +30,58229,48,11,0.68 +49,56536,55,5,0.05 +48,104258,46,4,0.39 +28,33426,53,10,0.8 +55,66518,38,0,0.65 +20,37885,57,23,0.64 +30,68512,48,3,0.43 +61,93824,35,2,0.23 +44,78086,35,2,0.33 +26,73001,83,11,0.47 +53,58232,38,3,0.37 +52,79818,34,4,0.18 +22,32707,88,7,0.72 +53,71493,30,8,0.28 +24,31347,59,17,0.76 +19,40540,66,13,0.73 +30,128195,82,7,0.61 +39,57678,53,8,0.62 +44,94652,83,7,0.62 +27,86064,58,2,0.45 +59,98536,41,1,0.26 +34,62742,41,9,0.61 +47,90203,41,4,0.23 +30,56074,28,12,0.6 +28,22005,57,12,0.91 +59,84794,41,6,0.22 +22,36724,78,7,0.67 +22,34373,88,16,0.87 +49,85215,39,4,0.22 +18,27065,76,9,1.0 +31,112508,72,4,0.54 +28,118744,72,1,0.4 +59,185519,62,3,0.65 
+24,50337,40,7,0.21 +27,39737,55,10,0.65 +55,144921,67,8,0.42 +34,45725,61,6,0.3 +63,163965,79,1,0.73 +51,129302,78,4,0.74 +40,129728,76,10,0.55 +18,22770,60,18,0.9 +39,65348,46,6,0.56 +52,147411,84,9,0.59 +42,61157,65,4,0.39 +21,26283,62,15,0.68 +51,76226,46,1,0.2 +45,99999,81,8,0.6 +48,101335,24,2,0.26 +49,106083,36,2,0.27 +32,26200,38,17,0.85 +55,81279,44,4,0.13 +46,121450,77,9,0.74 +32,31011,77,16,0.93 +42,67841,27,1,0.24 +31,104832,78,10,0.39 +34,101023,39,5,0.43 +57,89756,38,3,0.3 +52,57453,36,2,0.43 +41,46897,52,7,0.47 +34,33913,63,8,0.77 +56,104115,67,2,0.94 +52,90258,81,7,0.88 +54,104391,71,10,0.43 +37,115386,85,4,0.84 +36,55014,40,5,0.31 +31,45194,59,4,0.44 +45,88017,45,4,0.5 +65,93314,34,2,0.19 +48,75189,52,1,0.36 +51,106928,31,4,0.29 +49,114962,71,8,0.53 +36,43721,68,0,0.51 +42,64953,58,6,0.42 +28,29275,97,10,0.77 +69,92584,32,7,0.4 +20,31523,80,4,0.78 +35,62625,41,4,0.56 +42,40128,58,0,0.48 +18,34104,59,9,0.82 +59,91684,15,4,0.32 +21,35910,67,15,0.91 +27,34922,69,12,0.82 +34,130244,73,14,0.68 +45,57206,38,2,0.49 +18,25712,68,8,0.75 +31,69867,64,3,0.64 +49,90433,21,6,0.44 +20,44705,84,18,0.63 +52,75590,54,0,0.18 +18,27205,63,12,0.77 diff --git a/chapters/chapter-08-unsupervised-learning/datasets/sensors.csv b/chapters/chapter-08-unsupervised-learning/datasets/sensors.csv new file mode 100644 index 0000000..e641ca3 --- /dev/null +++ b/chapters/chapter-08-unsupervised-learning/datasets/sensors.csv @@ -0,0 +1,201 @@ +temp,pressure,vibration,is_anomaly +67.4,29.4,0.343,0 +62.3,32.1,0.492,0 +63.6,27.0,0.549,0 +68.4,28.0,0.52,0 +68.9,31.1,0.316,0 +67.8,27.4,0.665,0 +76.3,32.1,0.408,0 +60.1,28.5,0.674,0 +97.6,52.9,0.756,1 +67.4,24.7,0.533,0 +77.2,31.6,0.611,0 +76.7,33.9,0.591,0 +73.0,32.0,0.512,0 +65.0,31.5,0.356,0 +68.8,32.5,0.546,0 +76.6,34.7,0.547,0 +69.5,26.7,0.461,0 +100.5,40.7,1.425,1 +62.0,36.4,0.474,0 +73.5,28.6,0.406,0 +59.0,29.0,0.5,0 +76.1,27.3,0.433,0 +67.8,29.0,0.459,0 +65.4,33.1,0.373,0 +72.2,29.9,0.4,0 +62.9,29.2,0.514,0 +66.2,31.5,0.323,0 
+70.9,30.8,0.437,0 +62.8,23.9,0.437,0 +73.1,31.5,0.407,0 +77.0,30.2,0.45,0 +73.4,29.5,0.251,0 +77.3,49.5,0.949,1 +64.8,27.5,0.555,0 +69.4,30.4,0.552,0 +75.0,28.6,0.52,0 +70.4,34.2,0.397,0 +73.5,35.1,0.6,0 +78.9,28.9,0.62,0 +93.7,51.2,1.037,1 +72.5,31.0,0.544,0 +64.3,27.3,0.654,0 +70.4,30.8,0.384,0 +67.1,30.9,0.53,0 +67.4,33.8,0.448,0 +71.5,32.7,0.496,0 +70.6,28.9,0.39,0 +59.6,27.9,0.304,0 +64.8,31.4,0.643,0 +72.2,22.9,0.534,0 +101.6,52.0,1.254,1 +80.7,27.1,0.605,0 +65.9,29.3,0.375,0 +67.2,28.1,0.568,0 +77.1,24.3,0.262,0 +74.6,26.5,0.538,0 +65.4,29.0,0.619,0 +73.6,31.1,0.578,0 +72.6,34.8,0.288,0 +80.3,29.1,0.329,0 +70.6,34.3,0.43,0 +64.6,28.1,0.424,0 +67.5,33.2,0.502,0 +70.4,27.9,0.392,0 +72.5,31.5,0.33,0 +63.2,32.6,0.43,0 +66.8,26.1,0.45,0 +73.2,34.5,0.519,0 +67.0,26.4,0.402,0 +70.3,34.4,0.439,0 +74.0,33.5,0.262,0 +69.6,34.3,0.714,0 +92.9,53.4,1.367,1 +68.8,30.1,0.526,0 +95.1,40.5,1.75,1 +61.5,34.1,0.421,0 +68.9,26.6,0.371,0 +66.4,27.6,0.471,0 +102.8,40.4,0.751,1 +74.8,23.2,0.441,0 +69.1,24.6,0.647,0 +100.5,38.5,1.069,1 +102.0,52.7,0.611,1 +63.3,29.2,0.737,0 +95.5,33.2,0.94,1 +75.6,31.2,0.515,0 +83.2,25.6,0.377,0 +60.3,29.6,0.589,0 +104.6,40.3,1.798,1 +69.9,34.4,0.405,0 +70.0,26.0,0.423,0 +76.9,26.0,0.609,0 +64.0,24.5,0.445,0 +73.7,33.8,0.528,0 +76.4,27.8,0.432,0 +68.8,25.3,0.419,0 +66.9,32.4,0.68,0 +58.5,27.5,0.478,0 +74.5,30.8,0.557,0 +75.5,25.8,0.551,0 +70.3,33.3,0.533,0 +70.2,33.4,0.475,0 +73.8,33.9,0.482,0 +62.8,34.0,0.461,0 +76.4,29.9,0.503,0 +75.5,26.9,0.594,0 +64.0,27.9,0.46,0 +69.4,29.8,0.485,0 +68.9,32.1,0.583,0 +71.7,29.1,0.582,0 +69.4,34.4,0.351,0 +75.1,31.8,0.433,0 +73.4,31.8,0.682,0 +62.8,24.6,0.516,0 +72.5,30.4,0.425,0 +62.5,26.9,0.586,0 +72.7,27.6,0.419,0 +69.1,33.0,0.369,0 +68.1,30.4,0.577,0 +70.4,27.7,0.537,0 +74.4,30.8,0.706,0 +69.3,28.9,0.278,0 +81.6,33.0,0.503,0 +65.9,31.2,0.396,0 +71.8,29.3,0.674,0 +91.6,44.4,1.021,1 +67.9,31.7,0.678,0 +57.6,31.0,0.438,0 +101.6,43.6,1.199,1 +70.1,34.0,0.57,0 +74.8,27.3,0.586,0 +60.3,32.1,0.428,0 
+75.6,29.0,0.229,0 +86.2,43.1,1.148,1 +72.5,25.4,0.667,0 +66.7,28.6,0.537,0 +85.9,48.5,1.196,1 +63.2,34.6,0.643,0 +74.6,29.5,0.415,0 +74.6,30.4,0.542,0 +73.9,29.6,0.519,0 +74.2,30.7,0.449,0 +69.2,25.8,0.537,0 +76.7,30.3,0.541,0 +69.5,28.9,0.445,0 +63.8,32.9,0.537,0 +74.8,28.6,0.646,0 +76.2,33.7,0.468,0 +69.2,32.4,0.36,0 +72.2,29.4,0.497,0 +67.0,32.4,0.435,0 +94.3,47.2,1.333,1 +67.8,30.4,0.613,0 +64.2,30.1,0.646,0 +70.3,30.4,0.496,0 +77.3,29.1,0.466,0 +65.5,30.5,0.306,0 +86.2,44.0,0.773,1 +70.7,21.9,0.387,0 +70.6,28.2,0.535,0 +69.1,30.3,0.524,0 +68.1,31.3,0.572,0 +69.8,31.0,0.418,0 +68.6,29.1,0.547,0 +68.2,29.2,0.53,0 +65.3,27.5,0.506,0 +76.5,25.9,0.495,0 +74.3,25.8,0.479,0 +68.3,29.3,0.489,0 +70.0,26.4,0.527,0 +67.4,33.5,0.41,0 +74.9,37.5,0.408,0 +91.4,37.8,0.975,1 +75.6,22.0,0.609,0 +69.1,33.5,0.412,0 +69.5,26.5,0.503,0 +73.7,32.9,0.546,0 +65.9,30.7,0.48,0 +70.4,27.0,0.461,0 +77.0,26.3,0.324,0 +70.7,27.3,0.295,0 +68.2,29.2,0.343,0 +67.8,27.7,0.476,0 +83.9,31.7,0.607,0 +71.5,26.0,0.72,0 +74.8,28.0,0.633,0 +68.3,31.6,0.477,0 +70.9,26.2,0.68,0 +85.7,42.0,0.568,1 +71.8,29.7,0.581,0 +69.1,29.2,0.561,0 +71.5,29.2,0.528,0 +68.7,30.0,0.668,0 +68.0,25.5,0.424,0 +69.7,31.2,0.539,0 +67.0,28.3,0.494,0 +68.2,22.1,0.596,0 +74.8,29.8,0.435,0 +74.5,32.1,0.794,0 +62.9,29.4,0.552,0 diff --git a/chapters/chapter-08-unsupervised-learning/exercises/exercises.py b/chapters/chapter-08-unsupervised-learning/exercises/exercises.py new file mode 100644 index 0000000..07d8f6c --- /dev/null +++ b/chapters/chapter-08-unsupervised-learning/exercises/exercises.py @@ -0,0 +1,154 @@ +""" +Chapter 8 Exercises: Unsupervised Learning + +Generated by Berta AI | Created by Luigi Pascal Rondanini +""" + +import numpy as np + + +# ============================================================================= +# Exercise 1: Implement K-Means Clustering From Scratch +# ============================================================================= +# Build a KMeans class that: +# - Initializes K centroids 
randomly from the data points +# - Assigns each point to the nearest centroid (Euclidean distance) +# - Recomputes centroids as the mean of assigned points +# - Repeats for max_iters or until convergence (centroids stop moving) +# +# Methods: +# - fit(X): Run the K-Means algorithm +# - predict(X): Assign each row to its nearest centroid +# - fit_predict(X): fit then predict +# +# Attributes after fit: +# - centroids: (K, n_features) array +# - inertia: within-cluster sum of squared distances +# +# Hint: np.linalg.norm(X[:, None] - centroids, axis=2) gives all pairwise distances + +class KMeansClustering: + def __init__(self, n_clusters=3, max_iters=100, random_state=42): + # YOUR CODE HERE + pass + + def fit(self, X): + # YOUR CODE HERE + pass + + def predict(self, X): + # YOUR CODE HERE + pass + + def fit_predict(self, X): + # YOUR CODE HERE + pass + + +# ============================================================================= +# Exercise 2: Implement PCA From Scratch +# ============================================================================= +# Build a PCA class that: +# - Centers the data (subtract mean) +# - Computes the covariance matrix +# - Finds eigenvectors/eigenvalues via np.linalg.eigh +# - Sorts components by descending eigenvalue +# - Projects data onto the top n_components eigenvectors +# +# Methods: +# - fit(X): Compute components +# - transform(X): Project X onto components +# - fit_transform(X): fit then transform +# +# Attributes after fit: +# - components_: (n_components, n_features) array +# - explained_variance_ratio_: fraction of variance per component +# +# Hint: covariance = X_centered.T @ X_centered / (n - 1) + +class PCAFromScratch: + def __init__(self, n_components=2): + # YOUR CODE HERE + pass + + def fit(self, X): + # YOUR CODE HERE + pass + + def transform(self, X): + # YOUR CODE HERE + pass + + def fit_transform(self, X): + # YOUR CODE HERE + pass + + +# 
============================================================================= +# Exercise 3: Implement Silhouette Score From Scratch +# ============================================================================= +# Compute the silhouette score for a clustering result: +# For each point i: +# a(i) = mean distance to all other points in the same cluster +# b(i) = min over other clusters of mean distance to that cluster's points +# s(i) = (b(i) - a(i)) / max(a(i), b(i)) +# Return the mean of s(i) over all points. +# +# Parameters: +# X: (n_samples, n_features) array +# labels: (n_samples,) array of cluster assignments +# +# Return: float in [-1, 1], higher is better +# +# Hint: Use pairwise Euclidean distances. Handle single-point clusters (s=0). + +def silhouette_score_scratch(X, labels): + # YOUR CODE HERE + pass + + +# ============================================================================= +# Exercise 4: Anomaly Detection with Z-Score +# ============================================================================= +# Implement a simple anomaly detector that: +# 1. Computes the Z-score for each feature: z = (x - mean) / std +# 2. Flags a point as anomalous if any feature has |z| > threshold +# +# Parameters: +# X: (n_samples, n_features) array +# threshold: float (default 3.0) +# +# Return: (n_samples,) boolean array, True = anomaly +# +# Hint: np.any(np.abs(z_scores) > threshold, axis=1) + +def detect_anomalies_zscore(X, threshold=3.0): + # YOUR CODE HERE + pass + + +# ============================================================================= +# Exercise 5: End-to-End Customer Segmentation Pipeline +# ============================================================================= +# Build a pipeline that: +# 1. Loads customer data from datasets/customers.csv +# 2. Scales features with StandardScaler +# 3. Applies PCA (keep 95% variance) +# 4. Uses elbow method to find optimal K (test K=2..8) +# 5. Runs K-Means with optimal K +# 6. 
Returns segment profiles (mean of original features per cluster) +# +# Return dict: { +# "n_clusters": int, +# "labels": array, +# "profiles": DataFrame (one row per cluster, columns = original features), +# "inertias": list (for each K tested), +# "silhouette": float +# } +# +# Hint: The "elbow" can be found by looking for the K where the second +# derivative of inertia changes most (or just pick K=4 if uncertain). + +def customer_segmentation_pipeline(csv_path="datasets/customers.csv"): + # YOUR CODE HERE + pass diff --git a/chapters/chapter-08-unsupervised-learning/exercises/solutions/solutions.py b/chapters/chapter-08-unsupervised-learning/exercises/solutions/solutions.py new file mode 100644 index 0000000..1ffad8c --- /dev/null +++ b/chapters/chapter-08-unsupervised-learning/exercises/solutions/solutions.py @@ -0,0 +1,265 @@ +""" +Chapter 8 Solutions: Unsupervised Learning + +Generated by Berta AI | Created by Luigi Pascal Rondanini +""" + +import numpy as np +from pathlib import Path + + +# ============================================================================= +# Exercise 1: K-Means Clustering From Scratch +# ============================================================================= + +class KMeansClustering: + def __init__(self, n_clusters=3, max_iters=100, random_state=42): + self.n_clusters = n_clusters + self.max_iters = max_iters + self.random_state = random_state + self.centroids = None + self.inertia = None + + def fit(self, X): + X = np.asarray(X, dtype=float) + rng = np.random.RandomState(self.random_state) + idx = rng.choice(len(X), size=self.n_clusters, replace=False) + self.centroids = X[idx].copy() + + for _ in range(self.max_iters): + distances = np.linalg.norm(X[:, None] - self.centroids, axis=2) + labels = np.argmin(distances, axis=1) + + new_centroids = np.array([ + X[labels == k].mean(axis=0) if np.any(labels == k) else self.centroids[k] + for k in range(self.n_clusters) + ]) + + if np.allclose(new_centroids, self.centroids): + 
break + self.centroids = new_centroids + + distances = np.linalg.norm(X[:, None] - self.centroids, axis=2) + labels = np.argmin(distances, axis=1) + self.inertia = sum( + np.sum((X[labels == k] - self.centroids[k]) ** 2) + for k in range(self.n_clusters) + ) + self._labels = labels + return self + + def predict(self, X): + X = np.asarray(X, dtype=float) + distances = np.linalg.norm(X[:, None] - self.centroids, axis=2) + return np.argmin(distances, axis=1) + + def fit_predict(self, X): + self.fit(X) + return self._labels + + +# ============================================================================= +# Exercise 2: PCA From Scratch +# ============================================================================= + +class PCAFromScratch: + def __init__(self, n_components=2): + self.n_components = n_components + self.components_ = None + self.explained_variance_ratio_ = None + self._mean = None + + def fit(self, X): + X = np.asarray(X, dtype=float) + self._mean = X.mean(axis=0) + X_centered = X - self._mean + n = X.shape[0] + cov = X_centered.T @ X_centered / (n - 1) + + eigenvalues, eigenvectors = np.linalg.eigh(cov) + idx = np.argsort(eigenvalues)[::-1] + eigenvalues = eigenvalues[idx] + eigenvectors = eigenvectors[:, idx] + + self.components_ = eigenvectors[:, :self.n_components].T + total_var = eigenvalues.sum() + self.explained_variance_ratio_ = eigenvalues[:self.n_components] / total_var + return self + + def transform(self, X): + X = np.asarray(X, dtype=float) + X_centered = X - self._mean + return X_centered @ self.components_.T + + def fit_transform(self, X): + self.fit(X) + return self.transform(X) + + +# ============================================================================= +# Exercise 3: Silhouette Score From Scratch +# ============================================================================= + +def silhouette_score_scratch(X, labels): + X = np.asarray(X, dtype=float) + labels = np.asarray(labels) + n = len(X) + unique_labels = 
np.unique(labels) + + if len(unique_labels) < 2: + return 0.0 + + scores = np.zeros(n) + for i in range(n): + same_mask = labels == labels[i] + same_mask[i] = False + same_cluster = X[same_mask] + + if len(same_cluster) == 0: + scores[i] = 0.0 + continue + + a_i = np.mean(np.linalg.norm(same_cluster - X[i], axis=1)) + + b_i = np.inf + for k in unique_labels: + if k == labels[i]: + continue + other_cluster = X[labels == k] + mean_dist = np.mean(np.linalg.norm(other_cluster - X[i], axis=1)) + b_i = min(b_i, mean_dist) + + denom = max(a_i, b_i) + scores[i] = (b_i - a_i) / denom if denom > 0 else 0.0 + + return float(np.mean(scores)) + + +# ============================================================================= +# Exercise 4: Anomaly Detection with Z-Score +# ============================================================================= + +def detect_anomalies_zscore(X, threshold=3.0): + X = np.asarray(X, dtype=float) + mean = X.mean(axis=0) + std = X.std(axis=0) + std[std == 0] = 1.0 + z_scores = (X - mean) / std + return np.any(np.abs(z_scores) > threshold, axis=1) + + +# ============================================================================= +# Exercise 5: Customer Segmentation Pipeline +# ============================================================================= + +def customer_segmentation_pipeline(csv_path="datasets/customers.csv"): + try: + import pandas as pd + from sklearn.preprocessing import StandardScaler + from sklearn.decomposition import PCA + from sklearn.cluster import KMeans + from sklearn.metrics import silhouette_score + except ImportError: + return {"n_clusters": 0, "labels": None, "profiles": None, + "inertias": [], "silhouette": 0.0} + + base = Path(__file__).parent.parent.parent + path = base / csv_path + if not path.exists(): + return {"n_clusters": 0, "labels": None, "profiles": None, + "inertias": [], "silhouette": 0.0} + + df = pd.read_csv(path) + feature_cols = [c for c in df.columns if c not in ("customer_id", "segment")] + 
X_raw = df[feature_cols].values
+
+    scaler = StandardScaler()
+    X_scaled = scaler.fit_transform(X_raw)
+
+    pca = PCA(n_components=0.95)
+    X_pca = pca.fit_transform(X_scaled)
+
+    K_range = range(2, 9)
+    inertias = []
+    for k in K_range:
+        km = KMeans(n_clusters=k, n_init=10, random_state=42)
+        km.fit(X_pca)
+        inertias.append(km.inertia_)
+
+    # Elbow heuristic: diffs2[i] is the second difference centred on
+    # inertias[i + 1], which corresponds to k = i + 3 since K_range starts at 2
+    diffs = np.diff(inertias)
+    diffs2 = np.diff(diffs)
+    best_k = int(np.argmax(np.abs(diffs2)) + 3)
+    best_k = max(2, min(best_k, 8))
+
+    km_final = KMeans(n_clusters=best_k, n_init=10, random_state=42)
+    labels = km_final.fit_predict(X_pca)
+    sil = silhouette_score(X_pca, labels)
+
+    df["cluster"] = labels
+    profiles = df.groupby("cluster")[feature_cols].mean()
+
+    return {
+        "n_clusters": best_k,
+        "labels": labels,
+        "profiles": profiles,
+        "inertias": list(inertias),
+        "silhouette": float(sil),
+    }
+
+
+if __name__ == "__main__":
+    print("Chapter 8 Solutions - Verification\n")
+
+    np.random.seed(42)
+
+    # Ex 1
+    print("Exercise 1: K-Means Clustering")
+    from sklearn.datasets import make_blobs
+    X, y_true = make_blobs(n_samples=200, centers=3, random_state=42)
+    km = KMeansClustering(n_clusters=3, random_state=42)
+    labels = km.fit_predict(X)
+    assert km.centroids.shape == (3, 2)
+    assert len(labels) == 200
+    assert km.inertia > 0
+    print(f" Inertia = {km.inertia:.2f}")
+    print(f" Centroids shape: {km.centroids.shape}")
+
+    # Ex 2
+    print("\nExercise 2: PCA From Scratch")
+    X_4d = np.random.randn(100, 4)
+    pca = PCAFromScratch(n_components=2)
+    X_2d = pca.fit_transform(X_4d)
+    assert X_2d.shape == (100, 2)
+    assert len(pca.explained_variance_ratio_) == 2
+    # Each ratio is a fraction of the total variance, so their sum lies in (0, 1]
+    assert 0.0 < sum(pca.explained_variance_ratio_) <= 1.0 + 1e-10
+    print(f" Variance explained: {pca.explained_variance_ratio_}")
+    print(f" Projected shape: {X_2d.shape}")
+
+    # Ex 3
+    print("\nExercise 3: Silhouette Score")
+    sil = silhouette_score_scratch(X, y_true)
+    assert -1 <= sil <= 1
+    print(f" Silhouette score = {sil:.4f}")
+
+    # Ex 4
+    
print("\nExercise 4: Anomaly Detection (Z-Score)")
+    X_normal = np.random.randn(100, 3)
+    X_anomalies = np.array([[10, 10, 10], [-8, -8, -8]])
+    X_combined = np.vstack([X_normal, X_anomalies])
+    flags = detect_anomalies_zscore(X_combined, threshold=3.0)
+    assert flags[-1]
+    assert flags[-2]
+    n_detected = flags.sum()
+    print(f" Detected {n_detected} anomalies out of {len(X_combined)} points")
+
+    # Ex 5
+    print("\nExercise 5: Customer Segmentation Pipeline")
+    result = customer_segmentation_pipeline()
+    if result["labels"] is not None:
+        print(f" Optimal K: {result['n_clusters']}")
+        print(f" Silhouette: {result['silhouette']:.4f}")
+        print(f" Segment profiles:\n{result['profiles']}")
+    else:
+        print(" (Dataset not found - run from the chapter root)")
+
+    print("\nAll verifications passed.")
diff --git a/chapters/chapter-08-unsupervised-learning/notebooks/01_introduction.ipynb b/chapters/chapter-08-unsupervised-learning/notebooks/01_introduction.ipynb
new file mode 100644
index 0000000..5bb2233
--- /dev/null
+++ b/chapters/chapter-08-unsupervised-learning/notebooks/01_introduction.ipynb
@@ -0,0 +1,580 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Chapter 8: Unsupervised Learning\n",
+    "## Notebook 01 - Introduction: Clustering Basics\n",
+    "\n",
+    "Unsupervised learning finds hidden patterns in data without labels. We start with the most fundamental algorithm: K-Means clustering.\n",
+    "\n",
+    "**What you'll learn:**\n",
+    "- The difference between supervised and unsupervised learning\n",
+    "- K-Means clustering from scratch using NumPy\n",
+    "- Evaluating clusters with inertia and silhouette score\n",
+    "- The elbow method for choosing K\n",
+    "- Scikit-learn's KMeans interface\n",
+    "\n",
+    "**Time estimate:** 2.5 hours\n",
+    "\n",
+    "---\n",
+    "*Generated by Berta AI | Created by Luigi Pascal Rondanini*"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "---\n",
+    "## 1. 
Supervised vs Unsupervised Learning\n", + "\n", + "In **supervised learning**, every training example comes with a label β€” the \"right answer\" β€” and the model learns a mapping from inputs to outputs. Classification and regression are the classic examples.\n", + "\n", + "In **unsupervised learning**, there are **no labels at all**. The algorithm must discover structure in the data on its own. Common tasks include:\n", + "\n", + "| Task | Goal | Example algorithms |\n", + "|------|------|--------------------|\n", + "| **Clustering** | Group similar points together | K-Means, DBSCAN, Hierarchical |\n", + "| **Dimensionality reduction** | Compress features while preserving structure | PCA, t-SNE, UMAP |\n", + "| **Anomaly detection** | Find unusual observations | Isolation Forest, LOF |\n", + "\n", + "This notebook focuses on **clustering** β€” specifically the **K-Means** algorithm, the most widely-used clustering method.\n", + "\n", + "Let's start by generating some data and seeing what it looks like *without* labels." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "from sklearn.datasets import make_blobs\n", + "\n", + "np.random.seed(42)\n", + "\n", + "X, y_true = make_blobs(\n", + " n_samples=200, centers=3, cluster_std=0.9, random_state=42\n", + ")\n", + "\n", + "fig, axes = plt.subplots(1, 2, figsize=(13, 5))\n", + "\n", + "axes[0].scatter(X[:, 0], X[:, 1], c=\"steelblue\", edgecolors=\"k\", s=50, alpha=0.7)\n", + "axes[0].set_title(\"What we observe (no labels)\", fontsize=14)\n", + "axes[0].set_xlabel(\"Feature 1\")\n", + "axes[0].set_ylabel(\"Feature 2\")\n", + "\n", + "colors = [\"#e74c3c\", \"#2ecc71\", \"#3498db\"]\n", + "for k in range(3):\n", + " mask = y_true == k\n", + " axes[1].scatter(X[mask, 0], X[mask, 1], c=colors[k],\n", + " edgecolors=\"k\", s=50, alpha=0.7, label=f\"Cluster {k}\")\n", + "axes[1].set_title(\"True clusters (hidden from algorithm)\", fontsize=14)\n", + "axes[1].set_xlabel(\"Feature 1\")\n", + "axes[1].set_ylabel(\"Feature 2\")\n", + "axes[1].legend()\n", + "\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The left panel is what an unsupervised algorithm receives β€” raw coordinates with no color-coding. The right panel reveals the ground truth we want the algorithm to *recover* on its own.\n", + "\n", + "---\n", + "## 2. K-Means Algorithm β€” Theory\n", + "\n", + "K-Means is an iterative algorithm that partitions *n* data points into *K* clusters. It works in three repeating steps:\n", + "\n", + "### Step 1 β€” Initialize\n", + "Pick *K* points as initial **centroids** (cluster centers). 
The simplest approach is to choose *K* data points at random.\n", + "\n", + "### Step 2 β€” Assign\n", + "For every data point, compute the Euclidean distance to each centroid and assign the point to the **nearest** centroid:\n", + "\n", + "$$c_i = \\arg\\min_{k} \\| x_i - \\mu_k \\|^2$$\n", + "\n", + "### Step 3 β€” Update\n", + "Recompute each centroid as the **mean** of all points currently assigned to that cluster:\n", + "\n", + "$$\\mu_k = \\frac{1}{|C_k|} \\sum_{x_i \\in C_k} x_i$$\n", + "\n", + "### Repeat\n", + "Alternate between Steps 2 and 3 until the assignments no longer change (or a maximum number of iterations is reached).\n", + "\n", + "### Important caveats\n", + "- **Random initialization sensitivity:** Different starting centroids can lead to different final clusters. Running the algorithm multiple times with different seeds and keeping the best result is standard practice.\n", + "- **K must be chosen in advance.** We'll learn the *elbow method* later in this notebook.\n", + "- The algorithm minimises **inertia** (within-cluster sum of squares) β€” it always converges, but to a *local* minimum, not necessarily the global one." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "## 3. K-Means From Scratch\n", + "\n", + "Let's implement K-Means using only NumPy so we truly understand every step." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "class KMeansScratch:\n", + " \"\"\"Minimal K-Means implementation using NumPy.\"\"\"\n", + "\n", + " def __init__(self, k=3, max_iters=100, random_state=42):\n", + " self.k = k\n", + " self.max_iters = max_iters\n", + " self.random_state = random_state\n", + " self.centroids = None\n", + " self.labels_ = None\n", + " self.inertia_ = None\n", + " self.inertia_history = []\n", + " self.centroid_history = []\n", + " self.label_history = []\n", + "\n", + " def _euclidean_distances(self, X, centroids):\n", + " \"\"\"Compute distance from every point to every centroid.\"\"\"\n", + " # X: (n, d), centroids: (k, d) -> result: (n, k)\n", + " return np.sqrt(((X[:, np.newaxis] - centroids[np.newaxis]) ** 2).sum(axis=2))\n", + "\n", + " def _compute_inertia(self, X, labels, centroids):\n", + " return sum(\n", + " np.sum((X[labels == k] - centroids[k]) ** 2)\n", + " for k in range(self.k)\n", + " )\n", + "\n", + " def fit(self, X):\n", + " rng = np.random.RandomState(self.random_state)\n", + " n_samples = X.shape[0]\n", + "\n", + " # Step 1: random initialization\n", + " idx = rng.choice(n_samples, self.k, replace=False)\n", + " self.centroids = X[idx].copy()\n", + "\n", + " self.inertia_history = []\n", + " self.centroid_history = [self.centroids.copy()]\n", + " self.label_history = []\n", + "\n", + " for _ in range(self.max_iters):\n", + " # Step 2: assign\n", + " distances = self._euclidean_distances(X, self.centroids)\n", + " labels = np.argmin(distances, axis=1)\n", + " self.label_history.append(labels.copy())\n", + "\n", + " # Step 3: update centroids\n", + " new_centroids = np.array([\n", + " X[labels == k].mean(axis=0) if np.any(labels == k)\n", + " else self.centroids[k]\n", + " for k in range(self.k)\n", + " ])\n", + "\n", + " inertia = self._compute_inertia(X, labels, new_centroids)\n", + " self.inertia_history.append(inertia)\n", + " 
self.centroid_history.append(new_centroids.copy())\n", + "\n", + " if np.allclose(new_centroids, self.centroids):\n", + " break\n", + " self.centroids = new_centroids\n", + "\n", + " self.labels_ = labels\n", + " self.inertia_ = self.inertia_history[-1]\n", + " return self\n", + "\n", + " def predict(self, X):\n", + " distances = self._euclidean_distances(X, self.centroids)\n", + " return np.argmin(distances, axis=1)\n", + "\n", + "\n", + "km_scratch = KMeansScratch(k=3, random_state=42)\n", + "km_scratch.fit(X)\n", + "\n", + "print(f\"Converged in {len(km_scratch.inertia_history)} iterations\")\n", + "print(f\"Final inertia: {km_scratch.inertia_:.2f}\")\n", + "print(f\"Centroids:\\n{km_scratch.centroids}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "fig, axes = plt.subplots(1, 2, figsize=(13, 5))\n", + "\n", + "colors_map = np.array([\"#e74c3c\", \"#2ecc71\", \"#3498db\"])\n", + "\n", + "for k in range(3):\n", + " mask = y_true == k\n", + " axes[0].scatter(X[mask, 0], X[mask, 1], c=colors[k],\n", + " edgecolors=\"k\", s=50, alpha=0.7, label=f\"True {k}\")\n", + "axes[0].set_title(\"Ground Truth\", fontsize=14)\n", + "axes[0].legend()\n", + "axes[0].set_xlabel(\"Feature 1\")\n", + "axes[0].set_ylabel(\"Feature 2\")\n", + "\n", + "axes[1].scatter(X[:, 0], X[:, 1], c=colors_map[km_scratch.labels_],\n", + " edgecolors=\"k\", s=50, alpha=0.7)\n", + "axes[1].scatter(km_scratch.centroids[:, 0], km_scratch.centroids[:, 1],\n", + " c=colors, marker=\"X\", s=250, edgecolors=\"k\", linewidths=1.5,\n", + " zorder=5, label=\"Centroids\")\n", + "axes[1].set_title(\"K-Means (scratch) result\", fontsize=14)\n", + "axes[1].legend()\n", + "axes[1].set_xlabel(\"Feature 1\")\n", + "axes[1].set_ylabel(\"Feature 2\")\n", + "\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "## 4. 
Step-by-Step K-Means Visualization\n", + "\n", + "To build intuition for how the algorithm converges, let's watch the first four iterations unfold. Each subplot shows the cluster assignments and centroid positions at a particular iteration." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "fig, axes = plt.subplots(2, 2, figsize=(12, 10))\n", + "axes = axes.ravel()\n", + "\n", + "colors_map = np.array([\"#e74c3c\", \"#2ecc71\", \"#3498db\"])\n", + "\n", + "n_show = min(4, len(km_scratch.label_history))\n", + "\n", + "for i in range(n_show):\n", + " ax = axes[i]\n", + " labels_i = km_scratch.label_history[i]\n", + " centroids_i = km_scratch.centroid_history[i] # centroids *before* this assignment\n", + " centroids_next = km_scratch.centroid_history[i + 1] # centroids *after* update\n", + "\n", + " ax.scatter(X[:, 0], X[:, 1], c=colors_map[labels_i],\n", + " edgecolors=\"k\", s=40, alpha=0.6)\n", + "\n", + " # Old centroids (hollow)\n", + " ax.scatter(centroids_i[:, 0], centroids_i[:, 1],\n", + " facecolors=\"none\", edgecolors=\"k\", marker=\"o\",\n", + " s=200, linewidths=2, label=\"Old centroid\")\n", + "\n", + " # New centroids (filled star)\n", + " ax.scatter(centroids_next[:, 0], centroids_next[:, 1],\n", + " c=colors, marker=\"X\", s=250, edgecolors=\"k\",\n", + " linewidths=1.5, zorder=5, label=\"New centroid\")\n", + "\n", + " # Arrows showing centroid movement\n", + " for k in range(3):\n", + " ax.annotate(\"\",\n", + " xy=centroids_next[k], xytext=centroids_i[k],\n", + " arrowprops=dict(arrowstyle=\"->\", lw=1.5, color=\"black\"))\n", + "\n", + " ax.set_title(f\"Iteration {i + 1} | inertia = {km_scratch.inertia_history[i]:.1f}\",\n", + " fontsize=12)\n", + " if i == 0:\n", + " ax.legend(fontsize=9, loc=\"upper left\")\n", + "\n", + "for j in range(n_show, 4):\n", + " axes[j].axis(\"off\")\n", + "\n", + "plt.suptitle(\"K-Means β€” Iteration-by-Iteration\", fontsize=15, y=1.01)\n", + 
"plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Notice how the centroids (stars) migrate toward the cluster centers with each iteration while the assignments stabilize.\n", + "\n", + "---\n", + "## 5. Evaluating Clusters\n", + "\n", + "How do we know if K-Means did a good job? Two common metrics:\n", + "\n", + "### Inertia (Within-Cluster Sum of Squares β€” WCSS)\n", + "$$\\text{Inertia} = \\sum_{k=1}^{K} \\sum_{x_i \\in C_k} \\| x_i - \\mu_k \\|^2$$\n", + "\n", + "Lower is better, but inertia **always decreases** as K increases (at K = n every point is its own cluster with inertia = 0). So inertia alone doesn't tell us the *right* K.\n", + "\n", + "### Silhouette Score\n", + "For each point *i*:\n", + "- **a(i)** = mean distance to other points in the *same* cluster\n", + "- **b(i)** = mean distance to points in the *nearest different* cluster\n", + "\n", + "$$s(i) = \\frac{b(i) - a(i)}{\\max(a(i),\\, b(i))}$$\n", + "\n", + "Values range from βˆ’1 to +1. Higher is better; values near 0 indicate overlapping clusters." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.metrics import silhouette_score, silhouette_samples\n", + "\n", + "sil_avg = silhouette_score(X, km_scratch.labels_)\n", + "sil_vals = silhouette_samples(X, km_scratch.labels_)\n", + "\n", + "print(f\"Inertia: {km_scratch.inertia_:.2f}\")\n", + "print(f\"Silhouette (mean): {sil_avg:.4f}\")\n", + "print(f\"Silhouette (min): {sil_vals.min():.4f}\")\n", + "print(f\"Silhouette (max): {sil_vals.max():.4f}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "fig, ax = plt.subplots(figsize=(8, 5))\n", + "\n", + "y_lower = 10\n", + "colors_sil = [\"#e74c3c\", \"#2ecc71\", \"#3498db\"]\n", + "\n", + "for k in range(3):\n", + " cluster_sil = np.sort(sil_vals[km_scratch.labels_ == k])\n", + " cluster_size = cluster_sil.shape[0]\n", + " y_upper = y_lower + cluster_size\n", + "\n", + " ax.fill_betweenx(np.arange(y_lower, y_upper), 0, cluster_sil,\n", + " facecolor=colors_sil[k], edgecolor=colors_sil[k], alpha=0.7)\n", + " ax.text(-0.05, y_lower + 0.5 * cluster_size, f\"Cluster {k}\", fontsize=11,\n", + " fontweight=\"bold\", va=\"center\")\n", + " y_lower = y_upper + 10\n", + "\n", + "ax.axvline(x=sil_avg, color=\"k\", linestyle=\"--\", linewidth=1.5,\n", + " label=f\"Mean silhouette = {sil_avg:.3f}\")\n", + "ax.set_xlabel(\"Silhouette coefficient\", fontsize=12)\n", + "ax.set_ylabel(\"Points (sorted within cluster)\", fontsize=12)\n", + "ax.set_title(\"Silhouette Plot β€” K-Means (K=3)\", fontsize=14)\n", + "ax.legend(fontsize=11)\n", + "ax.set_yticks([])\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A healthy silhouette plot shows clusters of roughly similar width that extend well past the mean line. Thin slivers or clusters that barely cross zero suggest poor separation.\n", + "\n", + "---\n", + "## 6. 
The Elbow Method for Choosing K\n", + "\n", + "Since we must specify *K* before running K-Means, how do we pick a good value?\n", + "\n", + "**The Elbow Method:**\n", + "1. Run K-Means for K = 1, 2, …, K_max.\n", + "2. Plot inertia vs K.\n", + "3. Look for the **\"elbow\"** β€” the point where inertia stops decreasing sharply and begins to level off.\n", + "\n", + "The elbow suggests a natural number of clusters in the data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "K_range = range(1, 11)\n", + "inertias = []\n", + "silhouettes = []\n", + "\n", + "for k in K_range:\n", + " km = KMeansScratch(k=k, random_state=42)\n", + " km.fit(X)\n", + " inertias.append(km.inertia_)\n", + " if k >= 2:\n", + " silhouettes.append(silhouette_score(X, km.labels_))\n", + " else:\n", + " silhouettes.append(np.nan)\n", + "\n", + "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n", + "\n", + "axes[0].plot(K_range, inertias, \"o-\", color=\"#2c3e50\", linewidth=2, markersize=8)\n", + "axes[0].set_xlabel(\"Number of clusters (K)\", fontsize=12)\n", + "axes[0].set_ylabel(\"Inertia\", fontsize=12)\n", + "axes[0].set_title(\"Elbow Method\", fontsize=14)\n", + "axes[0].axvline(x=3, color=\"#e74c3c\", linestyle=\"--\", alpha=0.7, label=\"K = 3 (elbow)\")\n", + "axes[0].legend(fontsize=11)\n", + "axes[0].grid(True, alpha=0.3)\n", + "\n", + "sil_values = [s for s in silhouettes if not np.isnan(s)]\n", + "sil_ks = list(range(2, 11))\n", + "axes[1].plot(sil_ks, sil_values, \"s-\", color=\"#27ae60\", linewidth=2, markersize=8)\n", + "axes[1].set_xlabel(\"Number of clusters (K)\", fontsize=12)\n", + "axes[1].set_ylabel(\"Mean Silhouette Score\", fontsize=12)\n", + "axes[1].set_title(\"Silhouette Score vs K\", fontsize=14)\n", + "axes[1].axvline(x=3, color=\"#e74c3c\", linestyle=\"--\", alpha=0.7, label=\"K = 3\")\n", + "axes[1].legend(fontsize=11)\n", + "axes[1].grid(True, alpha=0.3)\n", + "\n", + "plt.tight_layout()\n", + 
"plt.show()\n", + "\n", + "print(\"Silhouette scores by K:\")\n", + "for k, s in zip(sil_ks, sil_values):\n", + " print(f\" K={k:2d} -> {s:.4f}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Both plots agree: **K = 3** is the best choice for this dataset β€” inertia has a clear elbow and the silhouette score peaks at K = 3.\n", + "\n", + "---\n", + "## 7. Scikit-learn's KMeans\n", + "\n", + "In practice you'll use scikit-learn's battle-tested implementation. Let's verify our scratch version gives the same answer." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.cluster import KMeans\n", + "\n", + "km_sklearn = KMeans(n_clusters=3, random_state=42, n_init=10)\n", + "km_sklearn.fit(X)\n", + "\n", + "print(\"=== Scikit-learn KMeans ===\")\n", + "print(f\"Inertia: {km_sklearn.inertia_:.2f}\")\n", + "print(f\"Silhouette score: {silhouette_score(X, km_sklearn.labels_):.4f}\")\n", + "print(f\"Centroids:\\n{km_sklearn.cluster_centers_}\")\n", + "print()\n", + "\n", + "print(\"=== Our scratch KMeans ===\")\n", + "print(f\"Inertia: {km_scratch.inertia_:.2f}\")\n", + "print(f\"Silhouette score: {silhouette_score(X, km_scratch.labels_):.4f}\")\n", + "print(f\"Centroids:\\n{km_scratch.centroids}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "fig, axes = plt.subplots(1, 2, figsize=(13, 5))\n", + "\n", + "colors_map = np.array([\"#e74c3c\", \"#2ecc71\", \"#3498db\"])\n", + "\n", + "axes[0].scatter(X[:, 0], X[:, 1], c=colors_map[km_scratch.labels_],\n", + " edgecolors=\"k\", s=50, alpha=0.7)\n", + "axes[0].scatter(km_scratch.centroids[:, 0], km_scratch.centroids[:, 1],\n", + " c=\"gold\", marker=\"X\", s=250, edgecolors=\"k\", linewidths=1.5, zorder=5)\n", + "axes[0].set_title(\"Our Scratch Implementation\", fontsize=14)\n", + "axes[0].set_xlabel(\"Feature 1\")\n", + "axes[0].set_ylabel(\"Feature 
2\")\n", + "\n", + "axes[1].scatter(X[:, 0], X[:, 1], c=colors_map[km_sklearn.labels_],\n", + " edgecolors=\"k\", s=50, alpha=0.7)\n", + "axes[1].scatter(km_sklearn.cluster_centers_[:, 0], km_sklearn.cluster_centers_[:, 1],\n", + " c=\"gold\", marker=\"X\", s=250, edgecolors=\"k\", linewidths=1.5, zorder=5)\n", + "axes[1].set_title(\"Scikit-learn KMeans\", fontsize=14)\n", + "axes[1].set_xlabel(\"Feature 1\")\n", + "axes[1].set_ylabel(\"Feature 2\")\n", + "\n", + "plt.suptitle(\"Scratch vs Scikit-learn β€” Side by Side\", fontsize=15, y=1.01)\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The cluster labels may differ in numbering (label 0 in one could be label 2 in the other), but the **groupings themselves** should be nearly identical. Scikit-learn's version often achieves slightly lower inertia because it uses the smarter **k-means++** initialization by default and runs multiple initializations (`n_init=10`).\n", + "\n", + "---\n", + "## 8. Practical Tips\n", + "\n", + "### Assumptions of K-Means\n", + "K-Means works best when clusters are:\n", + "- **Spherical (isotropic):** roughly the same spread in every direction.\n", + "- **Similar in size:** very uneven cluster sizes can pull centroids away from smaller groups.\n", + "- **Well-separated:** heavily overlapping clusters confuse the algorithm.\n", + "\n", + "### Feature Scaling\n", + "K-Means relies on Euclidean distance. If one feature has a range of 0–1 and another 0–10,000, the second feature will dominate. **Always standardize your features** (e.g., `StandardScaler`) before clustering.\n", + "\n", + "### Multiple Initializations\n", + "Scikit-learn's `n_init` parameter (default 10) runs K-Means 10 times with different random seeds and keeps the result with the lowest inertia. 
This greatly reduces the risk of a poor local minimum.\n", + "\n", + "### When K-Means Fails\n", + "K-Means struggles with:\n", + "- **Non-convex shapes** (e.g., crescent moons, concentric rings) β€” consider DBSCAN or spectral clustering instead.\n", + "- **Clusters with very different densities** β€” HDBSCAN handles this better.\n", + "- **High-dimensional data** β€” distances become less meaningful (curse of dimensionality); apply dimensionality reduction first.\n", + "\n", + "We'll explore some of these alternatives in later notebooks." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "## 9. Summary\n", + "\n", + "### Key Takeaways\n", + "\n", + "1. **Unsupervised learning** discovers structure without labels. Clustering is its flagship task.\n", + "2. **K-Means** iterates between *assigning* points to the nearest centroid and *updating* centroids as cluster means until convergence.\n", + "3. **Inertia** measures within-cluster compactness; **silhouette score** balances compactness and separation.\n", + "4. The **elbow method** plots inertia vs K to find a natural number of clusters.\n", + "5. **Scikit-learn's KMeans** adds smart initialization (k-means++) and multiple restarts for robust results.\n", + "6. 
Always **scale features** before clustering, and remember that K-Means assumes spherical, similarly-sized clusters.\n", + "\n", + "### What's Next\n", + "In the following notebooks we will:\n", + "- Explore **hierarchical clustering** and dendrograms\n", + "- Learn **DBSCAN** for density-based clustering\n", + "- Apply **dimensionality reduction** (PCA, t-SNE) for visualization\n", + "\n", + "---\n", + "*End of Notebook 01 β€” Clustering Basics*" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.9.0" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} \ No newline at end of file diff --git a/chapters/chapter-08-unsupervised-learning/notebooks/02_intermediate.ipynb b/chapters/chapter-08-unsupervised-learning/notebooks/02_intermediate.ipynb new file mode 100644 index 0000000..584626b --- /dev/null +++ b/chapters/chapter-08-unsupervised-learning/notebooks/02_intermediate.ipynb @@ -0,0 +1,721 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Chapter 8: Unsupervised Learning\n", + "## Notebook 02 - Intermediate: Advanced Clustering\n", + "\n", + "Beyond K-Means: hierarchical clustering, density-based methods, and Gaussian mixtures for real-world data shapes.\n", + "\n", + "**What you'll learn:**\n", + "- Hierarchical (agglomerative) clustering and dendrograms\n", + "- DBSCAN for density-based clustering\n", + "- Gaussian Mixture Models (GMMs)\n", + "- Comparing clustering algorithms on different data shapes\n", + "\n", + "**Time estimate:** 2.5 hours\n", + "\n", + "---\n", + "*Generated by Berta AI | Created by Luigi Pascal Rondanini*" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "import matplotlib.cm as cm\n", + "from sklearn.datasets import make_blobs, make_moons, 
make_circles\n", + "from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN\n", + "from sklearn.mixture import GaussianMixture\n", + "from sklearn.preprocessing import StandardScaler\n", + "from sklearn.neighbors import NearestNeighbors\n", + "from scipy.cluster.hierarchy import dendrogram, linkage, fcluster\n", + "from scipy.stats import multivariate_normal\n", + "\n", + "np.random.seed(42)\n", + "\n", + "plt.rcParams['figure.figsize'] = (10, 6)\n", + "plt.rcParams['figure.dpi'] = 100\n", + "plt.rcParams['font.size'] = 11\n", + "\n", + "print(\"All imports loaded successfully.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "## 1. Hierarchical (Agglomerative) Clustering\n", + "\n", + "Hierarchical clustering builds a tree of clusters instead of requiring a fixed number of clusters up front.\n", + "\n", + "### How agglomerative clustering works\n", + "\n", + "The **agglomerative (bottom-up)** approach proceeds as follows:\n", + "\n", + "1. **Start** β€” treat every data point as its own single-point cluster.\n", + "2. **Merge** β€” find the two closest clusters and merge them into one.\n", + "3. 
**Repeat** β€” keep merging until only a single cluster remains (or until a stopping criterion is met).\n", + "\n", + "The result is a hierarchy that can be visualised as a **dendrogram** β€” a tree diagram showing the order and distance of each merge.\n", + "\n", + "### Linkage criteria\n", + "\n", + "\"Distance between two clusters\" can be measured in several ways:\n", + "\n", + "| Linkage | Definition | Tendency |\n", + "|---------|-----------|----------|\n", + "| **Single** | Minimum distance between any pair of points across two clusters | Produces elongated, chain-like clusters |\n", + "| **Complete** | Maximum distance between any pair of points across two clusters | Produces compact, roughly equal-sized clusters |\n", + "| **Average** | Mean distance between all pairs of points across two clusters | Compromise between single and complete |\n", + "| **Ward** | Minimises the total within-cluster variance at each merge | Tends to produce equally sized, spherical clusters |\n", + "\n", + "Ward linkage is the most commonly used default and works well when clusters are roughly spherical." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Generate synthetic data with 4 well-separated clusters\n", + "X_hier, y_hier = make_blobs(\n", + " n_samples=200, centers=4, cluster_std=0.8, random_state=42\n", + ")\n", + "\n", + "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n", + "\n", + "# Left panel β€” raw data\n", + "axes[0].scatter(X_hier[:, 0], X_hier[:, 1], s=30, alpha=0.7, edgecolors='k', linewidths=0.3)\n", + "axes[0].set_title('Raw Data (200 points, 4 clusters)')\n", + "axes[0].set_xlabel('Feature 1')\n", + "axes[0].set_ylabel('Feature 2')\n", + "\n", + "# Right panel β€” dendrogram using Ward linkage\n", + "Z_ward = linkage(X_hier, method='ward')\n", + "dendrogram(\n", + " Z_ward,\n", + " truncate_mode='lastp',\n", + " p=30,\n", + " leaf_rotation=90,\n", + " leaf_font_size=8,\n", + " ax=axes[1],\n", + " color_threshold=12\n", + ")\n", + "axes[1].set_title('Dendrogram (Ward Linkage, truncated to 30 leaves)')\n", + "axes[1].set_xlabel('Cluster (size)')\n", + "axes[1].set_ylabel('Merge Distance')\n", + "axes[1].axhline(y=12, color='r', linestyle='--', label='Cut at distance = 12')\n", + "axes[1].legend()\n", + "\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The dendrogram shows the full merge history. By drawing a horizontal cut line we decide\n", + "how many clusters to keep β€” each vertical line that crosses the cut corresponds to one cluster.\n", + "\n", + "### Comparing linkage methods\n", + "\n", + "Let's visualise how the four linkage types partition the same dataset." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "linkage_methods = ['single', 'complete', 'average', 'ward']\n", + "fig, axes = plt.subplots(1, 4, figsize=(20, 4.5))\n", + "\n", + "for ax, method in zip(axes, linkage_methods):\n", + " Z = linkage(X_hier, method=method)\n", + " labels = fcluster(Z, t=4, criterion='maxclust')\n", + " scatter = ax.scatter(\n", + " X_hier[:, 0], X_hier[:, 1],\n", + " c=labels, cmap='viridis', s=30, alpha=0.7, edgecolors='k', linewidths=0.3\n", + " )\n", + " ax.set_title(f'{method.capitalize()} linkage')\n", + " ax.set_xlabel('Feature 1')\n", + " ax.set_ylabel('Feature 2')\n", + "\n", + "plt.suptitle('Agglomerative Clustering β€” 4 Linkage Methods (k=4)', fontsize=14, y=1.02)\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Scikit-learn's AgglomerativeClustering with Ward linkage\n", + "agg = AgglomerativeClustering(n_clusters=4, linkage='ward')\n", + "agg_labels = agg.fit_predict(X_hier)\n", + "\n", + "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n", + "\n", + "axes[0].scatter(\n", + " X_hier[:, 0], X_hier[:, 1],\n", + " c=y_hier, cmap='tab10', s=40, alpha=0.7, edgecolors='k', linewidths=0.3\n", + ")\n", + "axes[0].set_title('Ground-Truth Labels')\n", + "axes[0].set_xlabel('Feature 1')\n", + "axes[0].set_ylabel('Feature 2')\n", + "\n", + "axes[1].scatter(\n", + " X_hier[:, 0], X_hier[:, 1],\n", + " c=agg_labels, cmap='tab10', s=40, alpha=0.7, edgecolors='k', linewidths=0.3\n", + ")\n", + "axes[1].set_title('AgglomerativeClustering (Ward, k=4)')\n", + "axes[1].set_xlabel('Feature 1')\n", + "axes[1].set_ylabel('Feature 2')\n", + "\n", + "plt.tight_layout()\n", + "plt.show()\n", + "\n", + "print(f\"Cluster sizes: {np.bincount(agg_labels)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "## 2. 
DBSCAN β€” Density-Based Spatial Clustering\n", + "\n", + "**DBSCAN** (Density-Based Spatial Clustering of Applications with Noise) takes a fundamentally different\n", + "approach to clustering:\n", + "\n", + "- It does **not** require the number of clusters in advance.\n", + "- It defines clusters as **dense regions** separated by sparse regions.\n", + "- Points that don't belong to any dense region are labelled as **noise** (label = -1).\n", + "\n", + "### Key parameters\n", + "\n", + "| Parameter | Meaning |\n", + "|-----------|--------|\n", + "| `eps` (Ξ΅) | Maximum distance between two points for them to be considered neighbours |\n", + "| `min_samples` | Minimum number of points within Ξ΅-distance to form a dense region |\n", + "\n", + "### Point types\n", + "\n", + "- **Core point** β€” has at least `min_samples` neighbours within Ξ΅.\n", + "- **Border point** β€” within Ξ΅ of a core point but doesn't have enough neighbours itself.\n", + "- **Noise point** β€” neither core nor border; isolated outliers.\n", + "\n", + "### Key advantage\n", + "\n", + "DBSCAN can discover clusters of **arbitrary shape** and naturally identifies outliers β€” something\n", + "centroid-based methods like K-Means cannot do." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Generate two non-convex datasets\n", + "X_moons, y_moons = make_moons(n_samples=500, noise=0.08, random_state=42)\n", + "X_circles, y_circles = make_circles(n_samples=500, noise=0.05, factor=0.5, random_state=42)\n", + "\n", + "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n", + "\n", + "axes[0].scatter(X_moons[:, 0], X_moons[:, 1], c=y_moons, cmap='coolwarm', s=20, alpha=0.7)\n", + "axes[0].set_title('Two Moons Dataset')\n", + "axes[0].set_xlabel('Feature 1')\n", + "axes[0].set_ylabel('Feature 2')\n", + "\n", + "axes[1].scatter(X_circles[:, 0], X_circles[:, 1], c=y_circles, cmap='coolwarm', s=20, alpha=0.7)\n", + "axes[1].set_title('Two Circles Dataset')\n", + "axes[1].set_xlabel('Feature 1')\n", + "axes[1].set_ylabel('Feature 2')\n", + "\n", + "plt.suptitle('Non-Convex Datasets β€” Ground Truth', fontsize=14, y=1.02)\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Apply DBSCAN to both datasets\n", + "db_moons = DBSCAN(eps=0.2, min_samples=5).fit(X_moons)\n", + "db_circles = DBSCAN(eps=0.15, min_samples=5).fit(X_circles)\n", + "\n", + "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n", + "\n", + "colors_moons = db_moons.labels_\n", + "colors_circles = db_circles.labels_\n", + "\n", + "axes[0].scatter(\n", + " X_moons[:, 0], X_moons[:, 1],\n", + " c=colors_moons, cmap='viridis', s=20, alpha=0.7\n", + ")\n", + "n_noise_moons = (db_moons.labels_ == -1).sum()\n", + "axes[0].set_title(f'DBSCAN on Moons β€” {len(set(colors_moons)) - (1 if -1 in colors_moons else 0)} clusters, {n_noise_moons} noise')\n", + "axes[0].set_xlabel('Feature 1')\n", + "axes[0].set_ylabel('Feature 2')\n", + "\n", + "axes[1].scatter(\n", + " X_circles[:, 0], X_circles[:, 1],\n", + " c=colors_circles, cmap='viridis', s=20, alpha=0.7\n", + ")\n", + "n_noise_circles = 
(db_circles.labels_ == -1).sum()\n", + "axes[1].set_title(f'DBSCAN on Circles β€” {len(set(colors_circles)) - (1 if -1 in colors_circles else 0)} clusters, {n_noise_circles} noise')\n", + "axes[1].set_xlabel('Feature 1')\n", + "axes[1].set_ylabel('Feature 2')\n", + "\n", + "plt.suptitle('DBSCAN Handles Non-Convex Shapes', fontsize=14, y=1.02)\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# K-Means vs DBSCAN on the moons dataset\n", + "km_moons = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X_moons)\n", + "\n", + "fig, axes = plt.subplots(1, 3, figsize=(18, 5))\n", + "\n", + "axes[0].scatter(X_moons[:, 0], X_moons[:, 1], c=y_moons, cmap='coolwarm', s=20, alpha=0.7)\n", + "axes[0].set_title('Ground Truth')\n", + "axes[0].set_xlabel('Feature 1')\n", + "axes[0].set_ylabel('Feature 2')\n", + "\n", + "axes[1].scatter(X_moons[:, 0], X_moons[:, 1], c=km_moons.labels_, cmap='coolwarm', s=20, alpha=0.7)\n", + "axes[1].scatter(km_moons.cluster_centers_[:, 0], km_moons.cluster_centers_[:, 1],\n", + " marker='X', s=200, c='black', edgecolors='white', linewidths=1.5)\n", + "axes[1].set_title('K-Means (k=2) β€” Fails on non-convex shapes')\n", + "axes[1].set_xlabel('Feature 1')\n", + "axes[1].set_ylabel('Feature 2')\n", + "\n", + "axes[2].scatter(X_moons[:, 0], X_moons[:, 1], c=db_moons.labels_, cmap='coolwarm', s=20, alpha=0.7)\n", + "axes[2].set_title('DBSCAN (eps=0.2) β€” Correctly separates crescents')\n", + "axes[2].set_xlabel('Feature 1')\n", + "axes[2].set_ylabel('Feature 2')\n", + "\n", + "plt.suptitle('K-Means vs DBSCAN on the Moons Dataset', fontsize=14, y=1.02)\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "## 3. Choosing DBSCAN Parameters\n", + "\n", + "Picking `eps` and `min_samples` can be tricky. A practical heuristic:\n", + "\n", + "1. 
Set `min_samples` β‰ˆ 2 Γ— number of features (a reasonable default).\n", + "2. For each point compute the distance to its **k-th nearest neighbour** (k = `min_samples`).\n", + "3. Sort these distances and plot them β€” the **k-distance graph**.\n", + "4. Look for the \"elbow\" β€” the point where the curve bends sharply upward. The distance at that elbow is a good candidate for `eps`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# k-distance graph for the moons dataset\n", + "k = 5 # same as min_samples\n", + "nn = NearestNeighbors(n_neighbors=k)\n", + "nn.fit(X_moons)\n", + "distances, _ = nn.kneighbors(X_moons)\n", + "\n", + "k_distances = np.sort(distances[:, k - 1])[::-1]\n", + "\n", + "plt.figure(figsize=(10, 5))\n", + "plt.plot(k_distances, linewidth=1.5)\n", + "plt.axhline(y=0.2, color='r', linestyle='--', label='eps = 0.2 (our choice)')\n", + "plt.title(f'k-Distance Graph (k={k}) β€” Elbow Indicates Good eps')\n", + "plt.xlabel('Points (sorted by descending k-distance)')\n", + "plt.ylabel(f'Distance to {k}-th Nearest Neighbour')\n", + "plt.legend()\n", + "plt.grid(True, alpha=0.3)\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Effect of different eps values on DBSCAN results\n", + "eps_values = [0.05, 0.1, 0.2, 0.3, 0.5]\n", + "fig, axes = plt.subplots(1, len(eps_values), figsize=(22, 4))\n", + "\n", + "for ax, eps in zip(axes, eps_values):\n", + " db = DBSCAN(eps=eps, min_samples=5).fit(X_moons)\n", + " labels = db.labels_\n", + " n_clusters = len(set(labels)) - (1 if -1 in labels else 0)\n", + " n_noise = (labels == -1).sum()\n", + "\n", + " unique_labels = set(labels)\n", + " colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]\n", + "\n", + " for k_label, col in zip(sorted(unique_labels), colors):\n", + " if k_label == -1:\n", + " col = [0, 0, 0, 1] # black for 
noise\n", + " mask = labels == k_label\n", + " ax.scatter(X_moons[mask, 0], X_moons[mask, 1], c=[col], s=15, alpha=0.7)\n", + "\n", + " ax.set_title(f'eps={eps}\\n{n_clusters} clusters, {n_noise} noise')\n", + " ax.set_xlabel('Feature 1')\n", + "\n", + "axes[0].set_ylabel('Feature 2')\n", + "plt.suptitle('Effect of eps on DBSCAN (min_samples=5)', fontsize=14, y=1.05)\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Observations:**\n", + "- **eps too small** (0.05) β†’ most points classified as noise; many tiny clusters.\n", + "- **eps just right** (0.2) β†’ two clean crescent clusters with very little noise.\n", + "- **eps too large** (0.5) β†’ everything merges into a single cluster.\n", + "\n", + "The k-distance graph helps you find that sweet spot without trial and error." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "## 4. Gaussian Mixture Models (GMMs)\n", + "\n", + "A **Gaussian Mixture Model** assumes that the data is generated from a mixture of a finite number\n", + "of Gaussian (normal) distributions with unknown parameters.\n", + "\n", + "### GMM vs K-Means\n", + "\n", + "| Aspect | K-Means | GMM |\n", + "|--------|---------|-----|\n", + "| Cluster assignment | **Hard** β€” each point belongs to exactly one cluster | **Soft** β€” each point has a probability for every cluster |\n", + "| Cluster shape | Spherical (Voronoi cells) | Elliptical (full covariance matrices) |\n", + "| Outlier handling | None β€” every point is assigned | Naturally down-weights low-probability points |\n", + "| Output | Cluster label | Probability vector over all clusters |\n", + "\n", + "GMMs are fit using the **Expectation-Maximisation (EM)** algorithm:\n", + "1. **E-step** β€” compute the probability that each point belongs to each Gaussian component.\n", + "2. **M-step** β€” update each component's mean, covariance, and weight to maximise log-likelihood.\n", + "3. 
Repeat until convergence." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Create elongated / elliptical clusters that challenge K-Means\n", + "np.random.seed(42)\n", + "\n", + "n_per_cluster = 200\n", + "cov1 = [[2.0, 1.5], [1.5, 1.5]]\n", + "cov2 = [[1.5, -1.2], [-1.2, 1.5]]\n", + "cov3 = [[0.5, 0.0], [0.0, 2.5]]\n", + "\n", + "cluster1 = np.random.multivariate_normal([0, 0], cov1, n_per_cluster)\n", + "cluster2 = np.random.multivariate_normal([5, 5], cov2, n_per_cluster)\n", + "cluster3 = np.random.multivariate_normal([8, 0], cov3, n_per_cluster)\n", + "\n", + "X_gmm = np.vstack([cluster1, cluster2, cluster3])\n", + "y_gmm_true = np.array([0]*n_per_cluster + [1]*n_per_cluster + [2]*n_per_cluster)\n", + "\n", + "fig, axes = plt.subplots(1, 3, figsize=(18, 5))\n", + "\n", + "# Ground truth\n", + "axes[0].scatter(X_gmm[:, 0], X_gmm[:, 1], c=y_gmm_true, cmap='tab10', s=15, alpha=0.6)\n", + "axes[0].set_title('Ground Truth (Elliptical Clusters)')\n", + "axes[0].set_xlabel('Feature 1')\n", + "axes[0].set_ylabel('Feature 2')\n", + "\n", + "# K-Means\n", + "km_gmm = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X_gmm)\n", + "axes[1].scatter(X_gmm[:, 0], X_gmm[:, 1], c=km_gmm.labels_, cmap='tab10', s=15, alpha=0.6)\n", + "axes[1].scatter(km_gmm.cluster_centers_[:, 0], km_gmm.cluster_centers_[:, 1],\n", + " marker='X', s=200, c='black', edgecolors='white', linewidths=1.5)\n", + "axes[1].set_title('K-Means (k=3) β€” Spherical assumption')\n", + "axes[1].set_xlabel('Feature 1')\n", + "axes[1].set_ylabel('Feature 2')\n", + "\n", + "# GMM\n", + "gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=42)\n", + "gmm.fit(X_gmm)\n", + "gmm_labels = gmm.predict(X_gmm)\n", + "axes[2].scatter(X_gmm[:, 0], X_gmm[:, 1], c=gmm_labels, cmap='tab10', s=15, alpha=0.6)\n", + "axes[2].set_title('GMM (3 components) β€” Elliptical fit')\n", + "axes[2].set_xlabel('Feature 1')\n", + 
"axes[2].set_ylabel('Feature 2')\n", + "\n", + "plt.suptitle('K-Means vs GMM on Elliptical Clusters', fontsize=14, y=1.02)\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Visualise GMM probability contours\n", + "x_min, x_max = X_gmm[:, 0].min() - 2, X_gmm[:, 0].max() + 2\n", + "y_min, y_max = X_gmm[:, 1].min() - 2, X_gmm[:, 1].max() + 2\n", + "xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300), np.linspace(y_min, y_max, 300))\n", + "grid_points = np.column_stack([xx.ravel(), yy.ravel()])\n", + "\n", + "log_prob = gmm.score_samples(grid_points)\n", + "log_prob = log_prob.reshape(xx.shape)\n", + "\n", + "fig, ax = plt.subplots(figsize=(10, 7))\n", + "ax.contourf(xx, yy, np.exp(log_prob), levels=30, cmap='YlOrRd', alpha=0.6)\n", + "ax.contour(xx, yy, np.exp(log_prob), levels=10, colors='darkred', linewidths=0.5, alpha=0.5)\n", + "ax.scatter(X_gmm[:, 0], X_gmm[:, 1], c=gmm_labels, cmap='tab10', s=10, alpha=0.7,\n", + " edgecolors='k', linewidths=0.2)\n", + "\n", + "for i in range(gmm.n_components):\n", + " ax.scatter(gmm.means_[i, 0], gmm.means_[i, 1],\n", + " marker='+', s=300, c='black', linewidths=3)\n", + "\n", + "ax.set_title('GMM Probability Density Contours')\n", + "ax.set_xlabel('Feature 1')\n", + "ax.set_ylabel('Feature 2')\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Soft cluster probabilities β€” the key advantage of GMM\n", + "probs = gmm.predict_proba(X_gmm)\n", + "\n", + "print(\"Cluster membership probabilities for the first 10 points:\")\n", + "print(f\"{'Point':>5} {'P(C0)':>8} {'P(C1)':>8} {'P(C2)':>8} {'Assigned':>8}\")\n", + "print(\"-\" * 48)\n", + "for i in range(10):\n", + " print(f\"{i:5d} {probs[i, 0]:8.4f} {probs[i, 1]:8.4f} {probs[i, 2]:8.4f} {gmm_labels[i]:8d}\")" + ] + }, + { + "cell_type": "markdown", + 
"metadata": {}, + "source": [ + "### Model selection with BIC and AIC\n", + "\n", + "How many Gaussian components should we use? We can use information criteria:\n", + "\n", + "- **BIC** (Bayesian Information Criterion) β€” penalises model complexity more heavily.\n", + "- **AIC** (Akaike Information Criterion) β€” lighter penalty.\n", + "\n", + "**Lower is better** for both. We fit GMMs with different numbers of components and pick the one with the lowest BIC (or AIC)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "n_components_range = range(1, 10)\n", + "bic_scores = []\n", + "aic_scores = []\n", + "\n", + "for n in n_components_range:\n", + " gmm_test = GaussianMixture(n_components=n, covariance_type='full', random_state=42)\n", + " gmm_test.fit(X_gmm)\n", + " bic_scores.append(gmm_test.bic(X_gmm))\n", + " aic_scores.append(gmm_test.aic(X_gmm))\n", + "\n", + "fig, ax = plt.subplots(figsize=(10, 5))\n", + "ax.plot(list(n_components_range), bic_scores, 'bo-', label='BIC', linewidth=2)\n", + "ax.plot(list(n_components_range), aic_scores, 'rs--', label='AIC', linewidth=2)\n", + "ax.axvline(x=3, color='green', linestyle=':', alpha=0.7, label='True number of components (3)')\n", + "ax.set_xlabel('Number of Components')\n", + "ax.set_ylabel('Score (lower is better)')\n", + "ax.set_title('GMM Model Selection: BIC and AIC')\n", + "ax.legend()\n", + "ax.grid(True, alpha=0.3)\n", + "plt.tight_layout()\n", + "plt.show()\n", + "\n", + "print(f\"Best BIC at n_components = {np.argmin(bic_scores) + 1}\")\n", + "print(f\"Best AIC at n_components = {np.argmin(aic_scores) + 1}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "## 5. Algorithm Comparison on Multiple Datasets\n", + "\n", + "Let's put all four algorithms head-to-head on three different data geometries:\n", + "\n", + "1. **Blobs** β€” well-separated spherical clusters\n", + "2. 
**Moons** β€” two interleaving crescents\n", + "3. **Varied-variance blobs** β€” spherical clusters with very different densities" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "np.random.seed(42)\n", + "\n", + "n_samples = 500\n", + "\n", + "# Dataset 1: standard blobs\n", + "X_blobs, y_blobs = make_blobs(n_samples=n_samples, centers=3, cluster_std=1.0, random_state=42)\n", + "\n", + "# Dataset 2: moons\n", + "X_moons2, y_moons2 = make_moons(n_samples=n_samples, noise=0.07, random_state=42)\n", + "\n", + "# Dataset 3: varied-variance blobs\n", + "X_varied, y_varied = make_blobs(\n", + " n_samples=n_samples, centers=3, cluster_std=[0.5, 2.5, 1.0], random_state=42\n", + ")\n", + "\n", + "datasets = [\n", + " ('Blobs', X_blobs, {'n_clusters': 3, 'eps': 1.0}),\n", + " ('Moons', X_moons2, {'n_clusters': 2, 'eps': 0.2}),\n", + " ('Varied', X_varied, {'n_clusters': 3, 'eps': 1.5}),\n", + "]\n", + "\n", + "fig, axes = plt.subplots(3, 4, figsize=(22, 15))\n", + "\n", + "for row, (name, X, params) in enumerate(datasets):\n", + " X_scaled = StandardScaler().fit_transform(X)\n", + " n_c = params['n_clusters']\n", + " eps = params['eps']\n", + "\n", + " # K-Means\n", + " km = KMeans(n_clusters=n_c, random_state=42, n_init=10).fit(X_scaled)\n", + " # Agglomerative\n", + " agg = AgglomerativeClustering(n_clusters=n_c, linkage='ward').fit(X_scaled)\n", + " # DBSCAN\n", + " db = DBSCAN(eps=eps, min_samples=5).fit(X_scaled)\n", + " # GMM\n", + " gm = GaussianMixture(n_components=n_c, random_state=42).fit(X_scaled)\n", + "\n", + " results = [\n", + " ('K-Means', km.labels_),\n", + " ('Agglomerative', agg.labels_),\n", + " ('DBSCAN', db.labels_),\n", + " ('GMM', gm.predict(X_scaled)),\n", + " ]\n", + "\n", + " for col, (algo_name, labels) in enumerate(results):\n", + " ax = axes[row, col]\n", + " unique_labels = set(labels)\n", + " n_clust = len(unique_labels) - (1 if -1 in unique_labels else 0)\n", + "\n", + " 
noise_mask = labels == -1\n", + " ax.scatter(X_scaled[~noise_mask, 0], X_scaled[~noise_mask, 1],\n", + " c=labels[~noise_mask], cmap='viridis', s=12, alpha=0.7)\n", + " if noise_mask.any():\n", + " ax.scatter(X_scaled[noise_mask, 0], X_scaled[noise_mask, 1],\n", + " c='red', marker='x', s=15, alpha=0.5, label='noise')\n", + " ax.legend(fontsize=8)\n", + "\n", + " if row == 0:\n", + " ax.set_title(algo_name, fontsize=13, fontweight='bold')\n", + " ax.set_ylabel(f'{name}' if col == 0 else '', fontsize=12)\n", + " ax.text(0.02, 0.98, f'{n_clust} cluster(s)',\n", + " transform=ax.transAxes, fontsize=9, va='top',\n", + " bbox=dict(boxstyle='round,pad=0.3', facecolor='white', alpha=0.8))\n", + "\n", + "plt.suptitle('Algorithm Comparison Across Data Geometries', fontsize=16, y=1.01)\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "## 6. Summary β€” When to Use Each Algorithm\n", + "\n", + "### Quick reference\n", + "\n", + "| Algorithm | Best for | Weaknesses | Must specify k? |\n", + "|-----------|---------|------------|------------------|\n", + "| **K-Means** | Large datasets with spherical clusters | Cannot handle non-convex shapes; sensitive to outliers | Yes |\n", + "| **Agglomerative Clustering** | Small-to-medium datasets; exploring hierarchy | O(nΒ³) time complexity; hard to scale | Yes (or cut dendrogram) |\n", + "| **DBSCAN** | Arbitrary shapes; datasets with noise/outliers | Sensitive to `eps`; struggles with varying densities | No |\n", + "| **Gaussian Mixture Model** | Elliptical clusters; need soft assignments | Assumes Gaussian components; sensitive to initialisation | Yes |\n", + "\n", + "### Rules of thumb\n", + "\n", + "1. **Start simple:** try K-Means first. If results look poor, consider the data geometry.\n", + "2. **Non-convex shapes?** β†’ Use DBSCAN.\n", + "3. **Elliptical or overlapping clusters?** β†’ Use GMM.\n", + "4. 
**Need a hierarchy or dendrogram?** β†’ Use Agglomerative Clustering.\n", + "5. **Noisy data with outliers?** β†’ DBSCAN naturally handles noise.\n", + "6. **Need probability estimates?** β†’ GMM provides soft assignments.\n", + "\n", + "### What's next\n", + "\n", + "In the **advanced notebook** (Notebook 03) we will explore:\n", + "- Dimensionality reduction (PCA, t-SNE, UMAP)\n", + "- Clustering evaluation metrics (Silhouette, Adjusted Rand Index)\n", + "- Pipelines combining reduction + clustering on real-world datasets\n", + "\n", + "---\n", + "*Generated by Berta AI | Created by Luigi Pascal Rondanini*" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.10.0" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/chapters/chapter-08-unsupervised-learning/notebooks/03_advanced.ipynb b/chapters/chapter-08-unsupervised-learning/notebooks/03_advanced.ipynb new file mode 100644 index 0000000..d73ba76 --- /dev/null +++ b/chapters/chapter-08-unsupervised-learning/notebooks/03_advanced.ipynb @@ -0,0 +1,938 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Chapter 8: Unsupervised Learning\n", + "## Notebook 03 - Advanced: Dimensionality Reduction & Capstone\n", + "\n", + "Reduce high-dimensional data for visualization and modeling, detect anomalies, and build a complete customer segmentation system.\n", + "\n", + "**What you'll learn:**\n", + "- Principal Component Analysis (PCA) from scratch\n", + "- t-SNE for 2D visualization\n", + "- Anomaly detection with Isolation Forest\n", + "- Customer segmentation capstone project\n", + "\n", + "**Time estimate:** 3 hours\n", + "\n", + "---\n", + "*Generated by Berta AI | Created by Luigi Pascal Rondanini*" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "## 1. 
Principal Component Analysis (PCA) β€” Theory\n", + "\n", + "### The Core Idea\n", + "\n", + "PCA is a **linear** dimensionality-reduction technique that finds the directions\n", + "(called **principal components**) along which the data varies the most.\n", + "\n", + "Imagine a cloud of 3-D points that is shaped like a flat pancake. Two axes\n", + "capture almost all of the spread; the third adds very little information. PCA\n", + "discovers those two dominant axes automatically.\n", + "\n", + "### Algorithm Steps\n", + "\n", + "1. **Center the data** β€” subtract the mean of each feature so that the cloud is\n", + " centered at the origin.\n", + "2. **Compute the covariance matrix** β€” a $d \\times d$ matrix (where $d$ is the\n", + " number of features) that captures pairwise linear relationships.\n", + "3. **Eigendecomposition** β€” find the eigenvectors and eigenvalues of the\n", + " covariance matrix. Each eigenvector is a principal component direction;\n", + " its eigenvalue tells us how much variance that direction explains.\n", + "4. **Sort & select** β€” rank components by eigenvalue (descending) and keep the\n", + " top $k$ to reduce dimensionality from $d$ to $k$.\n", + "5. **Project** β€” multiply the centered data by the selected eigenvectors to\n", + " obtain the lower-dimensional representation.\n", + "\n", + "### Variance Explained Ratio\n", + "\n", + "$$\\text{variance explained ratio}_i = \\frac{\\lambda_i}{\\sum_{j=1}^{d} \\lambda_j}$$\n", + "\n", + "where $\\lambda_i$ is the $i$-th eigenvalue. The **cumulative** variance explained\n", + "tells us how much total information is retained when we keep the first $k$\n", + "components." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "## 2. PCA From Scratch\n", + "\n", + "We will implement PCA using only NumPy and apply it to the classic **Iris**\n", + "dataset (4 features β†’ 2 components)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "from sklearn.datasets import load_iris\n", + "\n", + "np.random.seed(42)\n", + "\n", + "# Load the Iris dataset (4 features, 150 samples, 3 classes)\n", + "iris = load_iris()\n", + "X = iris.data # shape (150, 4)\n", + "y = iris.target # 0, 1, 2\n", + "feature_names = iris.feature_names\n", + "target_names = iris.target_names\n", + "\n", + "print(f\"Dataset shape: {X.shape}\")\n", + "print(f\"Features: {feature_names}\")\n", + "print(f\"Classes: {list(target_names)}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def pca_from_scratch(X, n_components=2):\n", + " \"\"\"Implement PCA using NumPy.\"\"\"\n", + " # Step 1: Center the data\n", + " mean = np.mean(X, axis=0)\n", + " X_centered = X - mean\n", + "\n", + " # Step 2: Covariance matrix (features Γ— features)\n", + " cov_matrix = np.cov(X_centered, rowvar=False)\n", + "\n", + " # Step 3: Eigendecomposition\n", + " eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)\n", + "\n", + " # Step 4: Sort by eigenvalue descending\n", + " sorted_idx = np.argsort(eigenvalues)[::-1]\n", + " eigenvalues = eigenvalues[sorted_idx]\n", + " eigenvectors = eigenvectors[:, sorted_idx]\n", + "\n", + " # Variance explained ratio\n", + " variance_ratio = eigenvalues / eigenvalues.sum()\n", + "\n", + " # Step 5: Project onto top-k components\n", + " W = eigenvectors[:, :n_components]\n", + " X_projected = X_centered @ W\n", + "\n", + " return X_projected, eigenvalues, variance_ratio, W\n", + "\n", + "\n", + "X_pca_scratch, eigenvalues, var_ratio, components = pca_from_scratch(X, n_components=2)\n", + "\n", + "print(\"Eigenvalues:\", np.round(eigenvalues, 4))\n", + "print(\"Variance explained ratio:\", np.round(var_ratio, 4))\n", + "print(f\"Total variance retained (2 components): 
{var_ratio[:2].sum():.2%}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# --- Variance Explained Bar + Cumulative Line ---\n", + "fig, axes = plt.subplots(1, 2, figsize=(13, 5))\n", + "\n", + "# Left: bar chart of individual variance ratios\n", + "axes[0].bar(range(1, len(var_ratio) + 1), var_ratio, color=\"steelblue\", edgecolor=\"black\")\n", + "axes[0].set_xlabel(\"Principal Component\")\n", + "axes[0].set_ylabel(\"Variance Explained Ratio\")\n", + "axes[0].set_title(\"Variance Explained by Each Component\")\n", + "axes[0].set_xticks(range(1, len(var_ratio) + 1))\n", + "\n", + "# Right: cumulative variance explained\n", + "cumulative = np.cumsum(var_ratio)\n", + "axes[1].plot(range(1, len(cumulative) + 1), cumulative, \"o-\", color=\"darkorange\", linewidth=2)\n", + "axes[1].axhline(y=0.95, color=\"red\", linestyle=\"--\", label=\"95% threshold\")\n", + "axes[1].set_xlabel(\"Number of Components\")\n", + "axes[1].set_ylabel(\"Cumulative Variance Explained\")\n", + "axes[1].set_title(\"Cumulative Variance Explained\")\n", + "axes[1].set_xticks(range(1, len(cumulative) + 1))\n", + "axes[1].legend()\n", + "\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# --- 2-D scatter plot of the scratch PCA projection ---\n", + "colors = [\"#1f77b4\", \"#ff7f0e\", \"#2ca02c\"]\n", + "\n", + "plt.figure(figsize=(8, 6))\n", + "for i, name in enumerate(target_names):\n", + " mask = y == i\n", + " plt.scatter(X_pca_scratch[mask, 0], X_pca_scratch[mask, 1],\n", + " label=name, alpha=0.7, edgecolors=\"k\", linewidth=0.5,\n", + " color=colors[i], s=60)\n", + "plt.xlabel(f\"PC 1 ({var_ratio[0]:.1%} variance)\")\n", + "plt.ylabel(f\"PC 2 ({var_ratio[1]:.1%} variance)\")\n", + "plt.title(\"PCA From Scratch β€” Iris Dataset (2-D Projection)\")\n", + "plt.legend()\n", + "plt.grid(alpha=0.3)\n", + 
"plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "## 3. PCA with Scikit-learn\n", + "\n", + "Now let's verify our scratch implementation against the well-optimized\n", + "`sklearn.decomposition.PCA`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.decomposition import PCA\n", + "\n", + "pca_sk = PCA(n_components=4) # keep all 4 to inspect variance\n", + "X_pca_sk_full = pca_sk.fit_transform(X)\n", + "\n", + "print(\"Sklearn variance explained ratio:\", np.round(pca_sk.explained_variance_ratio_, 4))\n", + "print(\"Scratch variance explained ratio: \", np.round(var_ratio, 4))\n", + "print()\n", + "print(\"Cumulative (sklearn):\", np.round(np.cumsum(pca_sk.explained_variance_ratio_), 4))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "X_pca_sk = X_pca_sk_full[:, :2] # first 2 components\n", + "\n", + "# Sign of eigenvectors can flip β€” align for visual comparison\n", + "for col in range(2):\n", + " if np.corrcoef(X_pca_scratch[:, col], X_pca_sk[:, col])[0, 1] < 0:\n", + " X_pca_scratch[:, col] *= -1\n", + "\n", + "fig, axes = plt.subplots(1, 2, figsize=(14, 5), sharex=True, sharey=True)\n", + "\n", + "for ax, data, title in zip(axes,\n", + " [X_pca_scratch, X_pca_sk],\n", + " [\"PCA (from scratch)\", \"PCA (scikit-learn)\"]):\n", + " for i, name in enumerate(target_names):\n", + " mask = y == i\n", + " ax.scatter(data[mask, 0], data[mask, 1], label=name,\n", + " alpha=0.7, edgecolors=\"k\", linewidth=0.5,\n", + " color=colors[i], s=60)\n", + " ax.set_xlabel(\"PC 1\")\n", + " ax.set_ylabel(\"PC 2\")\n", + " ax.set_title(title)\n", + " ax.legend()\n", + " ax.grid(alpha=0.3)\n", + "\n", + "plt.suptitle(\"Scratch vs Scikit-learn PCA β€” Identical Results\", fontsize=14, y=1.02)\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": 
"markdown", + "metadata": {}, + "source": [ + "The two plots are virtually identical (eigenvector signs may differ, which is\n", + "cosmetic). This confirms our from-scratch implementation is correct." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "## 4. t-SNE β€” Non-linear Visualization\n", + "\n", + "### What is t-SNE?\n", + "\n", + "**t-distributed Stochastic Neighbor Embedding (t-SNE)** is a non-linear\n", + "dimensionality-reduction technique designed specifically for **visualization**.\n", + "\n", + "Key properties:\n", + "- Preserves **local structure**: points that are close in high-dimensional space\n", + " stay close in the 2-D embedding.\n", + "- Does **not** preserve global distances β€” clusters may move relative to each\n", + " other between runs.\n", + "- Computationally expensive β€” not suitable as a preprocessing step in\n", + " machine-learning pipelines.\n", + "- The **perplexity** parameter (roughly: how many neighbors each point\n", + " considers) strongly influences the result. Typical range: 5–50.\n", + "\n", + "> **Rule of thumb:** Use PCA when you need a general-purpose reduction (for\n", + "> modeling, compression, noise removal). Use t-SNE when your sole goal is to\n", + "> *see* cluster structure in 2-D."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.manifold import TSNE\n", + "\n", + "# Note: n_iter is omitted (its default, 1000, is used); the keyword was renamed\n", + "# to max_iter in recent scikit-learn releases, so omitting it keeps the cell portable.\n", + "tsne = TSNE(n_components=2, perplexity=30, random_state=42)\n", + "X_tsne = tsne.fit_transform(X)\n", + "\n", + "print(f\"t-SNE output shape: {X_tsne.shape}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# --- Side-by-side: PCA vs t-SNE ---\n", + "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n", + "\n", + "for ax, data, title in zip(axes,\n", + " [X_pca_sk, X_tsne],\n", + " [\"PCA (linear)\", \"t-SNE (non-linear)\"]):\n", + " for i, name in enumerate(target_names):\n", + " mask = y == i\n", + " ax.scatter(data[mask, 0], data[mask, 1], label=name,\n", + " alpha=0.7, edgecolors=\"k\", linewidth=0.5,\n", + " color=colors[i], s=60)\n", + " ax.set_xlabel(\"Dim 1\")\n", + " ax.set_ylabel(\"Dim 2\")\n", + " ax.set_title(title)\n", + " ax.legend()\n", + " ax.grid(alpha=0.3)\n", + "\n", + "plt.suptitle(\"PCA vs t-SNE β€” Iris Dataset\", fontsize=14, y=1.02)\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# --- Effect of perplexity on t-SNE ---\n", + "perplexities = [5, 15, 30, 50]\n", + "fig, axes = plt.subplots(1, 4, figsize=(20, 4))\n", + "\n", + "for ax, perp in zip(axes, perplexities):\n", + " embedding = TSNE(n_components=2, perplexity=perp,\n", + " random_state=42).fit_transform(X)\n", + " for i, name in enumerate(target_names):\n", + " mask = y == i\n", + " ax.scatter(embedding[mask, 0], embedding[mask, 1],\n", + " alpha=0.7, color=colors[i], s=40, edgecolors=\"k\",\n", + " linewidth=0.3, label=name)\n", + " ax.set_title(f\"Perplexity = {perp}\")\n", + " ax.set_xticks([])\n", + " ax.set_yticks([])\n", + "\n", + "axes[0].legend(fontsize=8)\n", + "plt.suptitle(\"t-SNE: Impact of Perplexity\", 
fontsize=14, y=1.04)\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Observations on perplexity:**\n", + "- Low perplexity (5): focuses on very local neighbors β€” clusters may fragment.\n", + "- High perplexity (50): considers more neighbors β€” clusters become rounder and\n", + " more global structure is visible, but fine local detail may blur.\n", + "- There is no single \"correct\" perplexity; try several and look for consistent\n", + " patterns." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "## 5. Anomaly Detection\n", + "\n", + "### Why Unsupervised Anomaly Detection?\n", + "\n", + "In many real-world scenarios, labeled anomalies are scarce or non-existent:\n", + "\n", + "| Domain | Normal | Anomaly |\n", + "|--------|--------|--------|\n", + "| Banking | Legitimate transactions | Fraud |\n", + "| Manufacturing | Good products | Defects |\n", + "| Cybersecurity | Regular traffic | Intrusions |\n", + "\n", + "Unsupervised methods learn the distribution of *normal* data and flag anything\n", + "that doesn't fit.\n", + "\n", + "### Approach 1 β€” Z-Score\n", + "\n", + "Flag a point as anomalous if any feature has a Z-score $|z| > \\tau$ (e.g.,\n", + "$\\tau = 3$). Simple, but assumes Gaussian features and works only for\n", + "univariate or low-dimensional data.\n", + "\n", + "### Approach 2 β€” Isolation Forest\n", + "\n", + "The **Isolation Forest** algorithm isolates observations by randomly selecting\n", + "a feature and a split value. 
Anomalies are easier to isolate (fewer splits\n", + "needed), so they have shorter average path lengths in the trees.\n", + "\n", + "Advantages:\n", + "- Works well in high dimensions\n", + "- No distribution assumptions\n", + "- Linear time complexity" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.ensemble import IsolationForest\n", + "\n", + "np.random.seed(42)\n", + "\n", + "# Generate normal data: 2 clusters\n", + "normal_a = np.random.randn(150, 2) * 0.8 + np.array([2, 2])\n", + "normal_b = np.random.randn(150, 2) * 0.8 + np.array([-2, -2])\n", + "normal_data = np.vstack([normal_a, normal_b])\n", + "\n", + "# Inject 20 anomalies scattered far from the clusters\n", + "anomalies = np.random.uniform(low=-6, high=6, size=(20, 2))\n", + "\n", + "X_anom = np.vstack([normal_data, anomalies])\n", + "labels_true = np.array([0] * len(normal_data) + [1] * len(anomalies)) # 0=normal, 1=anomaly\n", + "\n", + "print(f\"Total points: {len(X_anom)} (normal: {len(normal_data)}, anomalies: {len(anomalies)})\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# --- Z-Score method ---\n", + "from scipy import stats\n", + "\n", + "z_scores = np.abs(stats.zscore(X_anom))\n", + "z_threshold = 3.0\n", + "z_anomaly_mask = (z_scores > z_threshold).any(axis=1)\n", + "\n", + "print(f\"Z-Score method detected {z_anomaly_mask.sum()} anomalies (threshold={z_threshold})\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# --- Isolation Forest ---\n", + "iso_forest = IsolationForest(n_estimators=200, contamination=0.06,\n", + " random_state=42)\n", + "iso_preds = iso_forest.fit_predict(X_anom) # 1 = normal, -1 = anomaly\n", + "iso_anomaly_mask = iso_preds == -1\n", + "\n", + "print(f\"Isolation Forest detected {iso_anomaly_mask.sum()} anomalies\")" + ] + }, + { + "cell_type": 
"code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "fig, axes = plt.subplots(1, 3, figsize=(18, 5))\n", + "\n", + "# Ground truth\n", + "axes[0].scatter(X_anom[labels_true == 0, 0], X_anom[labels_true == 0, 1],\n", + " c=\"steelblue\", s=30, alpha=0.6, label=\"Normal\")\n", + "axes[0].scatter(X_anom[labels_true == 1, 0], X_anom[labels_true == 1, 1],\n", + " c=\"red\", s=80, marker=\"X\", label=\"True Anomaly\")\n", + "axes[0].set_title(\"Ground Truth\")\n", + "axes[0].legend()\n", + "axes[0].grid(alpha=0.3)\n", + "\n", + "# Z-Score\n", + "axes[1].scatter(X_anom[~z_anomaly_mask, 0], X_anom[~z_anomaly_mask, 1],\n", + " c=\"steelblue\", s=30, alpha=0.6, label=\"Normal\")\n", + "axes[1].scatter(X_anom[z_anomaly_mask, 0], X_anom[z_anomaly_mask, 1],\n", + " c=\"red\", s=80, marker=\"X\", label=\"Detected Anomaly\")\n", + "axes[1].set_title(f\"Z-Score (threshold={z_threshold})\")\n", + "axes[1].legend()\n", + "axes[1].grid(alpha=0.3)\n", + "\n", + "# Isolation Forest\n", + "axes[2].scatter(X_anom[~iso_anomaly_mask, 0], X_anom[~iso_anomaly_mask, 1],\n", + " c=\"steelblue\", s=30, alpha=0.6, label=\"Normal\")\n", + "axes[2].scatter(X_anom[iso_anomaly_mask, 0], X_anom[iso_anomaly_mask, 1],\n", + " c=\"red\", s=80, marker=\"X\", label=\"Detected Anomaly\")\n", + "axes[2].set_title(\"Isolation Forest\")\n", + "axes[2].legend()\n", + "axes[2].grid(alpha=0.3)\n", + "\n", + "plt.suptitle(\"Anomaly Detection Comparison\", fontsize=14, y=1.02)\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Key takeaway:** The Isolation Forest typically outperforms the Z-Score\n", + "method, especially when the data is multi-modal or the anomalies are not simply\n", + "extreme values along a single axis." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "## 6. 
Capstone Project β€” Customer Segmentation\n", + "\n", + "We will build a complete customer-segmentation pipeline:\n", + "\n", + "1. Generate & save a synthetic customer dataset\n", + "2. Feature scaling\n", + "3. Dimensionality reduction with PCA\n", + "4. Elbow method to choose optimal $K$\n", + "5. K-Means clustering\n", + "6. Segment profiling & visualization\n", + "7. Business recommendations" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 6.1 Generate Synthetic Customer Data\n", + "\n", + "We create five features that mimic a retail scenario:\n", + "\n", + "| Feature | Description |\n", + "|---------|-------------|\n", + "| `age` | Customer age (18–70) |\n", + "| `income` | Annual income in $k (15–150) |\n", + "| `spending_score` | In-store spending score (1–100) |\n", + "| `visits` | Monthly store visits (0–30) |\n", + "| `online_ratio` | Fraction of purchases made online (0–1) |" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import os\n", + "\n", + "np.random.seed(42)\n", + "n_customers = 500\n", + "\n", + "# Segment 1: Young, moderate income, high online, high spending\n", + "seg1 = {\n", + " \"age\": np.random.normal(25, 4, 130).clip(18, 40),\n", + " \"income\": np.random.normal(45, 12, 130).clip(15, 80),\n", + " \"spending_score\": np.random.normal(75, 10, 130).clip(1, 100),\n", + " \"visits\": np.random.normal(8, 3, 130).clip(0, 30),\n", + " \"online_ratio\": np.random.normal(0.75, 0.1, 130).clip(0, 1),\n", + "}\n", + "\n", + "# Segment 2: Middle-aged, high income, balanced channel, moderate spending\n", + "seg2 = {\n", + " \"age\": np.random.normal(42, 6, 150).clip(28, 60),\n", + " \"income\": np.random.normal(95, 18, 150).clip(50, 150),\n", + " \"spending_score\": np.random.normal(55, 12, 150).clip(1, 100),\n", + " \"visits\": np.random.normal(15, 5, 150).clip(0, 30),\n", + " \"online_ratio\": np.random.normal(0.45, 0.15, 
150).clip(0, 1),\n", + "}\n", + "\n", + "# Segment 3: Older, lower income, low online, low spending\n", + "seg3 = {\n", + " \"age\": np.random.normal(58, 7, 120).clip(40, 70),\n", + " \"income\": np.random.normal(35, 10, 120).clip(15, 70),\n", + " \"spending_score\": np.random.normal(25, 10, 120).clip(1, 100),\n", + " \"visits\": np.random.normal(20, 5, 120).clip(0, 30),\n", + " \"online_ratio\": np.random.normal(0.15, 0.08, 120).clip(0, 1),\n", + "}\n", + "\n", + "# Segment 4: Mixed ages, very high income, high spending, moderate visits\n", + "seg4 = {\n", + " \"age\": np.random.normal(38, 10, 100).clip(18, 70),\n", + " \"income\": np.random.normal(120, 15, 100).clip(80, 150),\n", + " \"spending_score\": np.random.normal(85, 8, 100).clip(1, 100),\n", + " \"visits\": np.random.normal(12, 4, 100).clip(0, 30),\n", + " \"online_ratio\": np.random.normal(0.55, 0.15, 100).clip(0, 1),\n", + "}\n", + "\n", + "frames = []\n", + "for seg in [seg1, seg2, seg3, seg4]:\n", + " frames.append(pd.DataFrame(seg))\n", + "\n", + "df_customers = pd.concat(frames, ignore_index=True)\n", + "df_customers = df_customers.sample(frac=1, random_state=42).reset_index(drop=True)\n", + "\n", + "df_customers[\"age\"] = df_customers[\"age\"].round(0).astype(int)\n", + "df_customers[\"income\"] = df_customers[\"income\"].round(1)\n", + "df_customers[\"spending_score\"] = df_customers[\"spending_score\"].round(0).astype(int)\n", + "df_customers[\"visits\"] = df_customers[\"visits\"].round(0).astype(int)\n", + "df_customers[\"online_ratio\"] = df_customers[\"online_ratio\"].round(2)\n", + "\n", + "# Save to CSV\n", + "dataset_dir = os.path.join(os.path.dirname(os.getcwd()), \"datasets\")\n", + "os.makedirs(dataset_dir, exist_ok=True)\n", + "csv_path = os.path.join(dataset_dir, \"customers.csv\")\n", + "df_customers.to_csv(csv_path, index=False)\n", + "print(f\"Saved {len(df_customers)} rows to {csv_path}\")\n", + "df_customers.head(10)" + ] + }, + { + "cell_type": "code", + "execution_count": null, 
+ "metadata": {}, + "outputs": [], + "source": [ + "df_customers.describe().round(2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 6.2 Feature Scaling" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.preprocessing import StandardScaler\n", + "\n", + "feature_cols = [\"age\", \"income\", \"spending_score\", \"visits\", \"online_ratio\"]\n", + "X_cust = df_customers[feature_cols].values\n", + "\n", + "scaler = StandardScaler()\n", + "X_scaled = scaler.fit_transform(X_cust)\n", + "\n", + "print(\"Scaled means (β‰ˆ0):\", np.round(X_scaled.mean(axis=0), 4))\n", + "print(\"Scaled stds (β‰ˆ1):\", np.round(X_scaled.std(axis=0), 4))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 6.3 PCA for Dimensionality Reduction" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pca_cust = PCA(n_components=5)\n", + "X_pca_cust = pca_cust.fit_transform(X_scaled)\n", + "\n", + "cum_var = np.cumsum(pca_cust.explained_variance_ratio_)\n", + "\n", + "plt.figure(figsize=(7, 4))\n", + "plt.bar(range(1, 6), pca_cust.explained_variance_ratio_,\n", + " color=\"steelblue\", edgecolor=\"black\", alpha=0.7, label=\"Individual\")\n", + "plt.step(range(1, 6), cum_var, where=\"mid\", color=\"darkorange\",\n", + " linewidth=2, label=\"Cumulative\")\n", + "plt.axhline(0.90, color=\"red\", linestyle=\"--\", alpha=0.7, label=\"90% threshold\")\n", + "plt.xlabel(\"Principal Component\")\n", + "plt.ylabel(\"Variance Explained\")\n", + "plt.title(\"Customer Data β€” PCA Variance Explained\")\n", + "plt.xticks(range(1, 6))\n", + "plt.legend()\n", + "plt.tight_layout()\n", + "plt.show()\n", + "\n", + "n_keep = np.argmax(cum_var >= 0.90) + 1\n", + "print(f\"\\nComponents needed for β‰₯90% variance: {n_keep}\")\n", + "print(f\"Using first 2 components for visualization ({cum_var[1]:.1%} variance).\")" + ] 
+ }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 6.4 K-Means β€” Elbow Method" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.cluster import KMeans\n", + "\n", + "K_range = range(2, 11)\n", + "inertias = []\n", + "\n", + "for k in K_range:\n", + " km = KMeans(n_clusters=k, n_init=10, random_state=42)\n", + " km.fit(X_scaled)\n", + " inertias.append(km.inertia_)\n", + "\n", + "plt.figure(figsize=(8, 4))\n", + "plt.plot(list(K_range), inertias, \"o-\", linewidth=2, color=\"steelblue\")\n", + "plt.xlabel(\"Number of Clusters (K)\")\n", + "plt.ylabel(\"Inertia (within-cluster sum of squares)\")\n", + "plt.title(\"Elbow Method for Optimal K\")\n", + "plt.xticks(list(K_range))\n", + "plt.grid(alpha=0.3)\n", + "plt.tight_layout()\n", + "plt.show()\n", + "\n", + "print(\"Look for the 'elbow' β€” the point where adding more clusters yields\")\n", + "print(\"diminishing returns. Here K=4 appears to be a good choice.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 6.5 Fit K-Means with Optimal K" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "optimal_k = 4\n", + "km_final = KMeans(n_clusters=optimal_k, n_init=20, random_state=42)\n", + "cluster_labels = km_final.fit_predict(X_scaled)\n", + "\n", + "df_customers[\"cluster\"] = cluster_labels\n", + "print(f\"Cluster distribution:\\n{df_customers['cluster'].value_counts().sort_index()}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 6.6 Segment Profiling" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "segment_profile = df_customers.groupby(\"cluster\")[feature_cols].mean().round(2)\n", + "segment_profile[\"count\"] = df_customers.groupby(\"cluster\").size()\n", + "print(\"=== Segment Profiles ===\")\n", + "segment_profile" 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Small-multiples bar charts: mean of each feature by cluster.\n", + "# Features live on very different scales (income in $k vs. online_ratio in 0-1),\n", + "# so each subplot keeps its own y-axis instead of a shared one.\n", + "fig, axes = plt.subplots(1, len(feature_cols), figsize=(18, 4))\n", + "cluster_colors = [\"#1f77b4\", \"#ff7f0e\", \"#2ca02c\", \"#d62728\"]\n", + "\n", + "for idx, feat in enumerate(feature_cols):\n", + " means = df_customers.groupby(\"cluster\")[feat].mean()\n", + " axes[idx].bar(means.index, means.values,\n", + " color=cluster_colors[:optimal_k], edgecolor=\"black\")\n", + " axes[idx].set_title(feat, fontsize=11)\n", + " axes[idx].set_xlabel(\"Cluster\")\n", + " axes[idx].set_xticks(range(optimal_k))\n", + "\n", + "axes[0].set_ylabel(\"Mean Value\")\n", + "plt.suptitle(\"Feature Means by Cluster\", fontsize=14, y=1.02)\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 6.7 Visualize Segments in 2-D (PCA Projection)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "X_vis = X_pca_cust[:, :2]\n", + "centroids_scaled = km_final.cluster_centers_\n", + "centroids_2d = pca_cust.transform(centroids_scaled)[:, :2] # project centroids\n", + "\n", + "plt.figure(figsize=(9, 7))\n", + "for c in range(optimal_k):\n", + " mask = cluster_labels == c\n", + " plt.scatter(X_vis[mask, 0], X_vis[mask, 1], s=40, alpha=0.6,\n", + " color=cluster_colors[c], edgecolors=\"k\", linewidth=0.3,\n", + " label=f\"Segment {c}\")\n", + "\n", + "plt.scatter(centroids_2d[:, 0], centroids_2d[:, 1], s=250, c=\"black\",\n", + " marker=\"*\", zorder=5, label=\"Centroids\")\n", + "\n", + "plt.xlabel(f\"PC 1 ({pca_cust.explained_variance_ratio_[0]:.1%} var)\")\n", + "plt.ylabel(f\"PC 2 ({pca_cust.explained_variance_ratio_[1]:.1%} var)\")\n", + "plt.title(\"Customer Segments — PCA 2-D Projection\")\n", + "plt.legend()\n", + "plt.grid(alpha=0.3)\n", + "plt.tight_layout()\n", + 
"plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 6.8 Business Recommendations" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "recommendations = {\n", + " 0: {\n", + " \"label\": \"Budget Traditionalists\",\n", + " \"description\": \"Older customers with low income and spending, who shop mostly in-store.\",\n", + " \"actions\": [\n", + " \"Offer loyalty discounts and in-store promotions\",\n", + " \"Simplify the in-store experience\",\n", + " \"Provide personalized coupons at checkout\",\n", + " ],\n", + " },\n", + " 1: {\n", + " \"label\": \"Young Digital Shoppers\",\n", + " \"description\": \"Young customers with moderate income but high online engagement and spending.\",\n", + " \"actions\": [\n", + " \"Invest in mobile app features and social media marketing\",\n", + " \"Offer free shipping and digital-only deals\",\n", + " \"Launch a referral program to leverage their network\",\n", + " ],\n", + " },\n", + " 2: {\n", + " \"label\": \"Premium High-Spenders\",\n", + " \"description\": \"High income, high spending score β€” the most valuable segment.\",\n", + " \"actions\": [\n", + " \"Create a VIP/premium loyalty tier\",\n", + " \"Offer early access to new products\",\n", + " \"Assign dedicated account managers for retention\",\n", + " ],\n", + " },\n", + " 3: {\n", + " \"label\": \"Established Moderates\",\n", + " \"description\": \"Middle-aged, higher income, moderate spending, balanced channel use.\",\n", + " \"actions\": [\n", + " \"Cross-sell higher-margin products\",\n", + " \"Provide omni-channel convenience (buy online, pick up in store)\",\n", + " \"Target with email campaigns for seasonal offers\",\n", + " ],\n", + " },\n", + "}\n", + "\n", + "for seg_id, info in recommendations.items():\n", + " count = (cluster_labels == seg_id).sum()\n", + " print(f\"\\n{'='*60}\")\n", + " print(f\"Segment {seg_id}: {info['label']} (n={count})\")\n", + " 
print(f\"{'='*60}\")\n", + " print(f\" {info['description']}\")\n", + " print(\" Recommended actions:\")\n", + " for action in info[\"actions\"]:\n", + " print(f\" β€’ {action}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "## 7. Summary & Key Takeaways\n", + "\n", + "### What We Covered in This Notebook\n", + "\n", + "| Topic | Key Idea |\n", + "|-------|----------|\n", + "| **PCA** | Linear projection onto directions of maximum variance |\n", + "| **t-SNE** | Non-linear embedding that preserves local neighborhoods β€” for visualization only |\n", + "| **Z-Score Anomaly Detection** | Simple threshold on standardized values |\n", + "| **Isolation Forest** | Tree-based anomaly detector β€” fast, distribution-free |\n", + "| **Customer Segmentation** | End-to-end pipeline: scale β†’ PCA β†’ K-Means β†’ profile β†’ recommend |\n", + "\n", + "### Chapter 8 Recap\n", + "\n", + "Across the three notebooks you have:\n", + "\n", + "1. **Notebook 01 (Introduction):** Learned K-Means, hierarchical clustering, and evaluation metrics.\n", + "2. **Notebook 02 (Intermediate):** Explored DBSCAN, Gaussian Mixture Models, and silhouette analysis.\n", + "3. 
**Notebook 03 (Advanced β€” this one):** Mastered PCA, t-SNE, anomaly detection, and built a full capstone project.\n", + "\n", + "### What's Next\n", + "\n", + "In **Chapter 9: Deep Learning** we'll move from classical ML to neural\n", + "networks β€” starting with perceptrons, backpropagation, and building your first\n", + "deep network with PyTorch/Keras.\n", + "\n", + "---\n", + "*Generated by Berta AI | Created by Luigi Pascal Rondanini*" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.10.0" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} \ No newline at end of file diff --git a/chapters/chapter-08-unsupervised-learning/requirements.txt b/chapters/chapter-08-unsupervised-learning/requirements.txt new file mode 100644 index 0000000..8781803 --- /dev/null +++ b/chapters/chapter-08-unsupervised-learning/requirements.txt @@ -0,0 +1,7 @@ +jupyter +notebook +numpy +pandas +matplotlib +scikit-learn +scipy diff --git a/chapters/chapter-08-unsupervised-learning/scripts/unsupervised_toolkit.py b/chapters/chapter-08-unsupervised-learning/scripts/unsupervised_toolkit.py new file mode 100644 index 0000000..c1b1659 --- /dev/null +++ b/chapters/chapter-08-unsupervised-learning/scripts/unsupervised_toolkit.py @@ -0,0 +1,423 @@ +""" +Unsupervised Learning Toolkit - Core implementations and plotting utilities. +Generated by Berta AI | Created by Luigi Pascal Rondanini +""" + +import numpy as np +import matplotlib.pyplot as plt +from sklearn.metrics import silhouette_samples, silhouette_score +from scipy.cluster.hierarchy import dendrogram, linkage +from sklearn.datasets import make_blobs + + +class KMeansScratch: + """ + K-Means clustering implementation from scratch. + """ + + def __init__(self, n_clusters=3, max_iters=100, random_state=42): + """ + Initialize K-Means. + + Parameters + ---------- + n_clusters : int + Number of clusters. 
+ max_iters : int + Maximum iterations for the algorithm. + random_state : int + Random seed for reproducibility. + """ + self.n_clusters = n_clusters + self.max_iters = max_iters + self.random_state = random_state + self.centroids = None + self.labels_ = None + self.inertia_history = [] + + def fit(self, X): + """ + Fit K-Means to the data. + + Parameters + ---------- + X : array-like of shape (n_samples, n_features) + Training data. + + Returns + ------- + self + """ + X = np.asarray(X, dtype=float) # float dtype so centroid means are not truncated + rng = np.random.RandomState(self.random_state) # local RNG; don't mutate global state + self.inertia_history = [] # reset so repeated fit() calls don't accumulate + n_samples = X.shape[0] + + # Random centroid initialization + idx = rng.choice(n_samples, self.n_clusters, replace=False) + self.centroids = X[idx].copy() + + for _ in range(self.max_iters): + # Assign points to nearest centroid + labels = self._assign_clusters(X) + # Recompute centroids + new_centroids = np.zeros_like(self.centroids) + for k in range(self.n_clusters): + mask = labels == k + if np.any(mask): + new_centroids[k] = X[mask].mean(axis=0) + else: + new_centroids[k] = self.centroids[k] + + inertia = self._compute_inertia(X, labels, new_centroids) + self.inertia_history.append(inertia) + + if np.allclose(self.centroids, new_centroids): + break + self.centroids = new_centroids + + self.labels_ = self._assign_clusters(X) + return self + + def _assign_clusters(self, X): + """Assign each point to the nearest centroid.""" + distances = np.linalg.norm(X[:, np.newaxis] - self.centroids, axis=2) + return np.argmin(distances, axis=1) + + def predict(self, X): + """ + Predict cluster labels for new data. + + Parameters + ---------- + X : array-like of shape (n_samples, n_features) + Data to predict. + + Returns + ------- + labels : ndarray of shape (n_samples,) + Cluster indices. + """ + X = np.asarray(X, dtype=float) + return self._assign_clusters(X) + + def fit_predict(self, X): + """ + Fit and return cluster labels. + + Parameters + ---------- + X : array-like of shape (n_samples, n_features) + Training data. 
+ + Returns + ------- + labels : ndarray of shape (n_samples,) + Cluster indices. + """ + return self.fit(X).labels_ + + def _compute_inertia(self, X, labels, centroids): + """ + Compute within-cluster sum of squares (inertia). + + Parameters + ---------- + X : ndarray + Data points. + labels : ndarray + Cluster labels. + centroids : ndarray + Cluster centroids. + + Returns + ------- + inertia : float + """ + inertia = 0.0 + for k in range(self.n_clusters): + mask = labels == k + if np.any(mask): + inertia += np.sum((X[mask] - centroids[k]) ** 2) + return inertia + + +class PCAScratch: + """ + Principal Component Analysis implementation from scratch. + """ + + def __init__(self, n_components=2): + """ + Initialize PCA. + + Parameters + ---------- + n_components : int + Number of components to keep. + """ + self.n_components = n_components + self.mean_ = None + self.components_ = None + self.explained_variance_ = None + self.total_variance_ = None + + def fit(self, X): + """ + Fit PCA to the data. + + Parameters + ---------- + X : array-like of shape (n_samples, n_features) + Training data. + + Returns + ------- + self + """ + X = np.asarray(X) + self.mean_ = X.mean(axis=0) + X_centered = X - self.mean_ + + # Covariance matrix + cov = np.cov(X_centered.T) + + # Eigendecomposition + eigenvalues, eigenvectors = np.linalg.eigh(cov) + idx = np.argsort(eigenvalues)[::-1] + eigenvalues = eigenvalues[idx] + eigenvectors = eigenvectors[:, idx] + + n = min(self.n_components, len(eigenvalues)) + self.components_ = eigenvectors[:, :n].T + self.explained_variance_ = eigenvalues[:n] + # Keep the variance of ALL components so explained_variance_ratio_ + # is relative to the full data, matching scikit-learn's semantics. + self.total_variance_ = eigenvalues.sum() + return self + + def transform(self, X): + """ + Project data onto principal components. + + Parameters + ---------- + X : array-like of shape (n_samples, n_features) + Data to transform. 
+ + Returns + ------- + X_transformed : ndarray of shape (n_samples, n_components) + """ + X = np.asarray(X) + X_centered = X - self.mean_ + return X_centered @ self.components_.T + + def fit_transform(self, X): + """ + Fit and transform in one step. + + Parameters + ---------- + X : array-like of shape (n_samples, n_features) + Training data. + + Returns + ------- + X_transformed : ndarray of shape (n_samples, n_components) + """ + return self.fit(X).transform(X) + + @property + def explained_variance_ratio_(self): + """Fraction of the total variance explained by each kept component.""" + # Prefer the true total variance (stored by fit); fall back to the sum of + # the kept eigenvalues if it is unavailable. + total = getattr(self, "total_variance_", None) or np.sum(self.explained_variance_) + return self.explained_variance_ / total if total > 0 else self.explained_variance_ + + +def plot_clusters(X, labels, centroids=None, title="Clusters"): + """ + Scatter plot of clustered data with optional centroid markers. + + Parameters + ---------- + X : array-like + Data points (2D). + labels : array-like + Cluster labels. + centroids : array-like, optional + Centroids to plot as markers. + title : str + Plot title. + """ + X = np.asarray(X) + labels = np.asarray(labels) + plt.figure(figsize=(8, 6)) + scatter = plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis", alpha=0.7, edgecolors="k") + if centroids is not None: + centroids = np.asarray(centroids) + plt.scatter(centroids[:, 0], centroids[:, 1], c="red", marker="X", s=200, edgecolors="black") + plt.colorbar(scatter, label="Cluster") + plt.title(title) + plt.xlabel("Feature 1") + plt.ylabel("Feature 2") + plt.tight_layout() + plt.show() + + +def plot_elbow(K_range, inertias, title="Elbow Method"): + """ + Line plot of inertia vs K for elbow method. + + Parameters + ---------- + K_range : array-like + Range of K values. + inertias : array-like + Inertia for each K. + title : str + Plot title. 
+ """ + plt.figure(figsize=(8, 5)) + plt.plot(K_range, inertias, "bo-") + plt.xlabel("Number of clusters (K)") + plt.ylabel("Inertia") + plt.title(title) + plt.grid(True, alpha=0.3) + plt.tight_layout() + plt.show() + + +def plot_silhouette(X, labels, title="Silhouette Analysis"): + """ + Silhouette plot using sklearn.metrics. + + Parameters + ---------- + X : array-like + Data points. + labels : array-like + Cluster labels. + title : str + Plot title. + """ + X = np.asarray(X) + labels = np.asarray(labels) + n_clusters = len(np.unique(labels)) + silhouette_vals = silhouette_samples(X, labels) + score = silhouette_score(X, labels) + + plt.figure(figsize=(10, 6)) + y_lower = 10 + for i in range(n_clusters): + cluster_silhouette = silhouette_vals[labels == i] + cluster_silhouette.sort() + size = cluster_silhouette.shape[0] + y_upper = y_lower + size + plt.fill_betweenx(np.arange(y_lower, y_upper), 0, cluster_silhouette, alpha=0.7) + plt.text(-0.05, y_lower + 0.5 * size, str(i)) + y_lower = y_upper + 10 + + plt.axvline(x=score, color="red", linestyle="--", label=f"Avg: {score:.3f}") + plt.xlabel("Silhouette coefficient") + plt.ylabel("Cluster label") + plt.title(title) + plt.legend() + plt.tight_layout() + plt.show() + + +def plot_dendrogram(X, method="ward", title="Dendrogram"): + """ + Hierarchical clustering dendrogram using scipy. + + Parameters + ---------- + X : array-like + Data points. + method : str + Linkage method ('ward', 'complete', 'average', 'single'). + title : str + Plot title. + """ + X = np.asarray(X) + linkage_matrix = linkage(X, method=method) + plt.figure(figsize=(10, 6)) + dendrogram(linkage_matrix) + plt.title(title) + plt.xlabel("Sample index or (cluster size)") + plt.ylabel("Distance") + plt.tight_layout() + plt.show() + + +def plot_pca_variance(pca, title="PCA Variance Explained"): + """ + Bar chart and cumulative line for PCA variance explained. + + Parameters + ---------- + pca : PCAScratch + Fitted PCA object. + title : str + Plot title. 
+ """ + ratios = pca.explained_variance_ratio_ + cumsum = np.cumsum(ratios) + n = len(ratios) + + fig, ax1 = plt.subplots(figsize=(8, 5)) + x = np.arange(1, n + 1) + ax1.bar(x - 0.2, ratios, 0.4, label="Individual", color="steelblue") + ax1.set_xlabel("Principal Component") + ax1.set_ylabel("Variance explained ratio") + ax1.set_xticks(x) + + ax2 = ax1.twinx() + ax2.plot(x, cumsum, "ro-", label="Cumulative") + ax2.set_ylabel("Cumulative variance") + ax2.set_ylim(0, 1.05) + + plt.title(title) + fig.legend(loc="upper right", bbox_to_anchor=(1, 1), bbox_transform=ax1.transAxes) + plt.tight_layout() + plt.show() + + +def plot_anomalies(X, labels, title="Anomaly Detection"): + """ + Scatter plot for normal vs anomaly points. + + Parameters + ---------- + X : array-like + Data points (2D). + labels : array-like + Binary labels (0=normal, 1=anomaly or similar). + title : str + Plot title. + """ + X = np.asarray(X) + labels = np.asarray(labels) + plt.figure(figsize=(8, 6)) + normal = labels == 0 + anomaly = labels == 1 + plt.scatter(X[normal, 0], X[normal, 1], c="steelblue", alpha=0.7, label="Normal") + plt.scatter(X[anomaly, 0], X[anomaly, 1], c="red", alpha=0.7, label="Anomaly") + plt.xlabel("Feature 1") + plt.ylabel("Feature 2") + plt.title(title) + plt.legend() + plt.tight_layout() + plt.show() + + +if __name__ == "__main__": + # Demo: Generate blobs, run KMeansScratch + X_blobs, _ = make_blobs(n_samples=300, n_features=2, centers=4, random_state=42) + kmeans = KMeansScratch(n_clusters=4, max_iters=100, random_state=42) + kmeans.fit(X_blobs) + print("KMeansScratch inertia:", kmeans.inertia_history[-1] if kmeans.inertia_history else "N/A") + + # Demo: Run PCAScratch on 4D dataset + X_4d, _ = make_blobs(n_samples=200, n_features=4, centers=3, random_state=42) + pca = PCAScratch(n_components=4) + pca.fit(X_4d) + print("PCA variance explained:", pca.explained_variance_ratio_) + + print("Demo complete.") diff --git 
a/chapters/chapter-08-unsupervised-learning/scripts/utilities.py b/chapters/chapter-08-unsupervised-learning/scripts/utilities.py new file mode 100644 index 0000000..bdf4c31 --- /dev/null +++ b/chapters/chapter-08-unsupervised-learning/scripts/utilities.py @@ -0,0 +1,104 @@ +""" +Helper utilities for unsupervised learning. +Generated by Berta AI | Created by Luigi Pascal Rondanini +""" + +import numpy as np +import pandas as pd +from sklearn.preprocessing import StandardScaler + + +def scale_features(X): + """ + Scale features using StandardScaler (zero mean, unit variance). + + Parameters + ---------- + X : array-like of shape (n_samples, n_features) + Data to scale. + + Returns + ------- + X_scaled : ndarray of shape (n_samples, n_features) + Scaled data. + """ + scaler = StandardScaler() + return scaler.fit_transform(X) + + +def generate_synthetic_customers(n=300, seed=42): + """ + Generate synthetic customer data for clustering/segmentation. + + Parameters + ---------- + n : int + Number of customers to generate. + seed : int + Random seed for reproducibility. + + Returns + ------- + df : pandas.DataFrame + DataFrame with columns: age, income, spending_score, visits, online_ratio. + """ + np.random.seed(seed) + age = np.random.randint(18, 70, size=n) + income = np.random.exponential(scale=30000, size=n).astype(int) + 20000 + spending_score = np.random.exponential(scale=50, size=n).astype(int) + 10 + visits = np.random.poisson(lam=5, size=n) + 1 + online_ratio = np.random.beta(2, 2, size=n) + return pd.DataFrame({ + "age": age, + "income": income, + "spending_score": spending_score, + "visits": visits, + "online_ratio": online_ratio, + }) + + +def generate_synthetic_sensors(n=200, anomaly_fraction=0.1, seed=42): + """ + Generate synthetic sensor data with anomalies. + + Parameters + ---------- + n : int + Number of sensor readings. + anomaly_fraction : float + Fraction of readings that are anomalies (0 to 1). + seed : int + Random seed for reproducibility. 
+ + Returns + ------- + df : pandas.DataFrame + DataFrame with columns: temp, pressure, vibration, is_anomaly. + """ + np.random.seed(seed) + n_anomaly = int(n * anomaly_fraction) + n_normal = n - n_anomaly + + # Normal readings + temp_normal = np.random.normal(25, 2, n_normal) + pressure_normal = np.random.normal(100, 5, n_normal) + vibration_normal = np.random.exponential(0.5, n_normal) + + # Anomalous readings (outliers) + temp_anomaly = np.random.uniform(50, 90, n_anomaly) + pressure_anomaly = np.random.uniform(150, 200, n_anomaly) + vibration_anomaly = np.random.exponential(5, n_anomaly) + + temp = np.concatenate([temp_normal, temp_anomaly]) + pressure = np.concatenate([pressure_normal, pressure_anomaly]) + vibration = np.concatenate([vibration_normal, vibration_anomaly]) + is_anomaly = np.concatenate([np.zeros(n_normal, dtype=int), np.ones(n_anomaly, dtype=int)]) + + # Shuffle + idx = np.random.permutation(n) + return pd.DataFrame({ + "temp": temp[idx], + "pressure": pressure[idx], + "vibration": vibration[idx], + "is_anomaly": is_anomaly[idx], + }) diff --git a/docs/chapters/assets/diagrams/anomaly_detection.svg b/docs/chapters/assets/diagrams/anomaly_detection.svg new file mode 100644 index 0000000..92452f7 --- /dev/null +++ b/docs/chapters/assets/diagrams/anomaly_detection.svg @@ -0,0 +1,90 @@ + + + + + + Statistical (Z-Score) + + + + + + + + -3 sigma + +3 sigma + + + + + + + + ! + ! + Points beyond threshold + Simple, fast, assumes normal + + + + Isolation Forest + + + + + + + + + + anomaly + + + + Split 1 + Split 2 + + + short path + + + long path + Anomalies isolated quickly + Works with any distribution + + + + Applications + + + $ + Fraud Detection + Unusual transactions + + + ! 
+ Manufacturing QA + Defective products + + + + + + Network Intrusion + Unusual traffic patterns + + + + + Health Monitoring + Abnormal sensor readings + + + IoT + Predictive Maintenance + Equipment failure warnings + diff --git a/docs/chapters/assets/diagrams/clustering_algorithms.svg b/docs/chapters/assets/diagrams/clustering_algorithms.svg new file mode 100644 index 0000000..f17f560 --- /dev/null +++ b/docs/chapters/assets/diagrams/clustering_algorithms.svg @@ -0,0 +1,92 @@ + + + + + + K-Means + + + + + + + + + + + + + + + + + + + + + + + + + + + Spherical clusters, fixed K + Assigns to nearest centroid + + + + Hierarchical + + + + + + + + + + + + + + + + cut + + A + B + C + D + E + Dendrogram, cut to get K + Bottom-up merging + + + + DBSCAN + + + + + + + + + + + + + + noise + noise + + + eps + Arbitrary shapes, auto K + Density-based, detects noise + diff --git a/docs/chapters/assets/diagrams/dimensionality_reduction.svg b/docs/chapters/assets/diagrams/dimensionality_reduction.svg new file mode 100644 index 0000000..7f4b92b --- /dev/null +++ b/docs/chapters/assets/diagrams/dimensionality_reduction.svg @@ -0,0 +1,81 @@ + + + + + + High-Dimensional Data + + + f1 f2 f3 f4 ... fN + + 2.1 0.3 1.7 4.2 ... 0.9 + 1.5 2.8 0.4 3.1 ... 1.2 + 3.2 1.1 2.9 0.8 ... 2.4 + ... n rows x d features ... 
+ d = 50, 100, 1000+ + + Curse of dimensionality + Hard to visualize + Noisy, redundant features + Slow computation + + + + Reduce + + + + PCA (Linear) + + + PC1 + PC2 + + + + + + + + + + + + + + + + + PC1: 72% + PC2: 18% + Max variance directions + Global structure preserved + + + + t-SNE (Nonlinear) + + + + + + + + + + + + + + + + + Preserves local neighborhoods + Best for visualization + diff --git a/docs/chapters/chapter-08.md b/docs/chapters/chapter-08.md new file mode 100644 index 0000000..1a9eb96 --- /dev/null +++ b/docs/chapters/chapter-08.md @@ -0,0 +1,100 @@ +# Chapter 8: Unsupervised Learning + +Discover hidden patterns in unlabeled dataβ€”clustering, dimensionality reduction, and anomaly detection. + +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Track** | Practitioner | +| **Time** | 8 hours | +| **Prerequisites** | Chapters 1–6 | + +--- + +## Learning Objectives + +- Implement K-Means clustering from scratch using NumPy +- Apply hierarchical clustering and interpret dendrograms +- Use DBSCAN for density-based clustering with noise detection +- Evaluate clusters with silhouette scores and the elbow method +- Reduce dimensionality with PCA and t-SNE +- Detect anomalies with Isolation Forest and statistical methods +- Build a complete customer segmentation pipeline + +--- + +## What's Included + +### Notebooks + +| Notebook | Description | +|----------|-------------| +| `01_introduction.ipynb` | K-Means from scratch, evaluation, elbow method | +| `02_intermediate.ipynb` | Hierarchical, DBSCAN, Gaussian Mixture Models | +| `03_advanced.ipynb` | PCA, t-SNE, anomaly detection, customer segmentation capstone | + +### Scripts + +- `unsupervised_toolkit.py` β€” Core implementations (KMeansScratch, PCAScratch) and plotting utilities + +### Exercises + +- **5 exercises** with solutions (in `solutions/` branch) + +### SVG Diagrams + +- 3 visual diagrams for clustering algorithms, dimensionality reduction, and anomaly detection + +--- + + + +--- 
+ +## Read Online + +You can read the full chapter content right here on the website: + +- **[08.1 Introduction](content/ch08-01_introduction.md)** -- K-Means from scratch, silhouette scores, elbow method +- **[08.2 Intermediate](content/ch08-02_intermediate.md)** -- Hierarchical clustering, DBSCAN, Gaussian Mixture Models +- **[08.3 Advanced](content/ch08-03_advanced.md)** -- PCA, t-SNE, anomaly detection, customer segmentation capstone + +Or [try the code in the Playground](../playground.md). + +## How to Use This Chapter + +!!! tip "Quick Start" + Follow these steps to get coding in minutes. + +**1. Clone and install dependencies** + +```bash +git clone https://github.com/luigipascal/berta-chapters.git +cd berta-chapters +pip install -r requirements.txt +``` + +**2. Navigate to the chapter** + +```bash +cd chapters/chapter-08-unsupervised-learning +``` + +**3. Launch Jupyter** + +```bash +jupyter notebook notebooks/01_introduction.ipynb +``` + +!!! info "GitHub Folder" + All chapter materials live in: [`chapters/chapter-08-unsupervised-learning/`](https://github.com/luigipascal/berta-chapters/tree/main/chapters/chapter-08-unsupervised-learning/) + +!!! tip "SciPy" + This chapter uses SciPy for hierarchical clustering dendrograms. Ensure it's installed: `pip install scipy` + +--- + +**Created by Luigi Pascal Rondanini | Generated by Berta AI** diff --git a/docs/chapters/content/ch08-01_introduction.md b/docs/chapters/content/ch08-01_introduction.md new file mode 100644 index 0000000..1f2321e --- /dev/null +++ b/docs/chapters/content/ch08-01_introduction.md @@ -0,0 +1,448 @@ +# Ch 8: Unsupervised Learning - Introduction + +**Track**: Practitioner | [Try code in Playground](../../playground.md) | [Back to chapter overview](../chapter-08.md) + + +!!! tip "Read online or run locally" + You can read this content here on the web. 
To run the code interactively, + either use the [Playground](../../playground.md) or clone the repo and open + `chapters/chapter-08-unsupervised-learning/notebooks/01_introduction.ipynb` in Jupyter. + +--- + +# Chapter 8: Unsupervised Learning +## Notebook 01 - Introduction: Clustering Basics + +Unsupervised learning finds hidden patterns in data without labels. We start with the most fundamental algorithm: **K-Means clustering**. + +**What you'll learn:** +- The difference between supervised and unsupervised learning +- K-Means clustering from scratch using NumPy +- Evaluating clusters with inertia and silhouette score +- The elbow method for choosing K +- Scikit-learn's KMeans interface + +**Time estimate:** 2.5 hours + +--- + +## 1. Supervised vs Unsupervised Learning + +In **supervised learning**, every training example comes with a label β€” the "right answer" β€” and the model learns a mapping from inputs to outputs. Classification and regression are the classic examples. + +In **unsupervised learning**, there are **no labels at all**. The algorithm must discover structure in the data on its own. Common tasks include clustering (group similar points), dimensionality reduction (compress features), and anomaly detection (find unusual observations). + +This notebook focuses on **clustering** β€” specifically the **K-Means** algorithm. Let's start by generating some data and seeing what it looks like *without* labels. The left plot shows raw data (all same color); the right reveals the true clusters we want the algorithm to recover on its own. 
+ +```python +import numpy as np +import matplotlib.pyplot as plt +from sklearn.datasets import make_blobs + +np.random.seed(42) + +X, y_true = make_blobs( + n_samples=200, centers=3, cluster_std=0.9, random_state=42 +) + +fig, axes = plt.subplots(1, 2, figsize=(13, 5)) + +axes[0].scatter(X[:, 0], X[:, 1], c="steelblue", edgecolors="k", s=50, alpha=0.7) +axes[0].set_title("What we observe (no labels)", fontsize=14) +axes[0].set_xlabel("Feature 1") +axes[0].set_ylabel("Feature 2") + +colors = ["#e74c3c", "#2ecc71", "#3498db"] +for k in range(3): + mask = y_true == k + axes[1].scatter(X[mask, 0], X[mask, 1], c=colors[k], + edgecolors="k", s=50, alpha=0.7, label=f"Cluster {k}") +axes[1].set_title("True clusters (hidden from algorithm)", fontsize=14) +axes[1].set_xlabel("Feature 1") +axes[1].set_ylabel("Feature 2") +axes[1].legend() + +plt.tight_layout() +plt.show() +``` + +--- + +## 2. K-Means Algorithm + +K-Means is an iterative algorithm that partitions *n* data points into *K* clusters. It works in three repeating steps: + +**Step 1 β€” Initialize:** Pick *K* points as initial **centroids** (cluster centers). The simplest approach is to choose *K* data points at random. + +**Step 2 β€” Assign:** For every data point, compute the Euclidean distance to each centroid and assign the point to the **nearest** centroid. + +**Step 3 β€” Update:** Recompute each centroid as the **mean** of all points currently assigned to that cluster. + +**Repeat** Steps 2 and 3 until the assignments no longer change (or a maximum number of iterations is reached). 
+ +Let's implement K-Means from scratch using only NumPy: + +```python +class KMeansScratch: + """Minimal K-Means implementation using NumPy.""" + + def __init__(self, k=3, max_iters=100, random_state=42): + self.k = k + self.max_iters = max_iters + self.random_state = random_state + self.centroids = None + self.labels_ = None + self.inertia_ = None + self.inertia_history = [] + self.centroid_history = [] + self.label_history = [] + + def _euclidean_distances(self, X, centroids): + """Compute distance from every point to every centroid.""" + return np.sqrt(((X[:, np.newaxis] - centroids[np.newaxis]) ** 2).sum(axis=2)) + + def _compute_inertia(self, X, labels, centroids): + return sum( + np.sum((X[labels == k] - centroids[k]) ** 2) + for k in range(self.k) + ) + + def fit(self, X): + rng = np.random.RandomState(self.random_state) + n_samples = X.shape[0] + + # Step 1: random initialization + idx = rng.choice(n_samples, self.k, replace=False) + self.centroids = X[idx].copy() + + self.inertia_history = [] + self.centroid_history = [self.centroids.copy()] + self.label_history = [] + + for _ in range(self.max_iters): + # Step 2: assign + distances = self._euclidean_distances(X, self.centroids) + labels = np.argmin(distances, axis=1) + self.label_history.append(labels.copy()) + + # Step 3: update centroids + new_centroids = np.array([ + X[labels == k].mean(axis=0) if np.any(labels == k) + else self.centroids[k] + for k in range(self.k) + ]) + + inertia = self._compute_inertia(X, labels, new_centroids) + self.inertia_history.append(inertia) + self.centroid_history.append(new_centroids.copy()) + + if np.allclose(new_centroids, self.centroids): + break + self.centroids = new_centroids + + self.labels_ = labels + self.inertia_ = self.inertia_history[-1] + return self + + def predict(self, X): + distances = self._euclidean_distances(X, self.centroids) + return np.argmin(distances, axis=1) + + +km_scratch = KMeansScratch(k=3, random_state=42) +km_scratch.fit(X) + 
+print(f"Converged in {len(km_scratch.inertia_history)} iterations") +print(f"Final inertia: {km_scratch.inertia_:.2f}") +print(f"Centroids:\n{km_scratch.centroids}") +``` + +Now let's plot the ground truth alongside our K-Means result: + +```python +fig, axes = plt.subplots(1, 2, figsize=(13, 5)) + +colors_map = np.array(["#e74c3c", "#2ecc71", "#3498db"]) + +for k in range(3): + mask = y_true == k + axes[0].scatter(X[mask, 0], X[mask, 1], c=colors[k], + edgecolors="k", s=50, alpha=0.7, label=f"True {k}") +axes[0].set_title("Ground Truth", fontsize=14) +axes[0].legend() +axes[0].set_xlabel("Feature 1") +axes[0].set_ylabel("Feature 2") + +axes[1].scatter(X[:, 0], X[:, 1], c=colors_map[km_scratch.labels_], + edgecolors="k", s=50, alpha=0.7) +axes[1].scatter(km_scratch.centroids[:, 0], km_scratch.centroids[:, 1], + c=colors, marker="X", s=250, edgecolors="k", linewidths=1.5, + zorder=5, label="Centroids") +axes[1].set_title("K-Means (scratch) result", fontsize=14) +axes[1].legend() +axes[1].set_xlabel("Feature 1") +axes[1].set_ylabel("Feature 2") + +plt.tight_layout() +plt.show() +``` + +--- + +## 3. Step-by-Step Visualization + +To build intuition for how the algorithm converges, let's watch the first four iterations unfold. Each subplot shows the cluster assignments and centroid positions at a particular iteration. Notice how the centroids migrate toward the cluster centers with each iteration. 
+ +```python +fig, axes = plt.subplots(2, 2, figsize=(12, 10)) +axes = axes.ravel() + +colors_map = np.array(["#e74c3c", "#2ecc71", "#3498db"]) + +n_show = min(4, len(km_scratch.label_history)) + +for i in range(n_show): + ax = axes[i] + labels_i = km_scratch.label_history[i] + centroids_i = km_scratch.centroid_history[i] + centroids_next = km_scratch.centroid_history[i + 1] + + ax.scatter(X[:, 0], X[:, 1], c=colors_map[labels_i], + edgecolors="k", s=40, alpha=0.6) + + ax.scatter(centroids_i[:, 0], centroids_i[:, 1], + facecolors="none", edgecolors="k", marker="o", + s=200, linewidths=2, label="Old centroid") + + ax.scatter(centroids_next[:, 0], centroids_next[:, 1], + c=colors, marker="X", s=250, edgecolors="k", + linewidths=1.5, zorder=5, label="New centroid") + + for k in range(3): + ax.annotate("", + xy=centroids_next[k], xytext=centroids_i[k], + arrowprops=dict(arrowstyle="->", lw=1.5, color="black")) + + ax.set_title(f"Iteration {i + 1} | inertia = {km_scratch.inertia_history[i]:.1f}", + fontsize=12) + if i == 0: + ax.legend(fontsize=9, loc="upper left") + +for j in range(n_show, 4): + axes[j].axis("off") + +plt.suptitle("K-Means β€” Iteration-by-Iteration", fontsize=15, y=1.01) +plt.tight_layout() +plt.show() +``` + +--- + +## 4. Evaluating Clusters + +How do we know if K-Means did a good job? Two common metrics: + +**Inertia (Within-Cluster Sum of Squares):** The sum of squared distances from each point to its centroid. Lower is better, but inertia *always* decreases as K increases β€” so it alone doesn't tell us the right K. + +**Silhouette Score:** For each point, we compare the mean distance to others in the same cluster (*a*) vs. the mean distance to the nearest other cluster (*b*). The score is *(b - a) / max(a, b)*, ranging from βˆ’1 to +1. Higher is better; values near 0 indicate overlapping clusters. 
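
Before reaching for scikit-learn, the formula can be sanity-checked by hand. Below is a minimal sketch (the helper `silhouette_manual` and the tiny `X_demo` dataset are illustrative, not part of the chapter toolkit) that computes each point's coefficient directly from *(b - a) / max(a, b)*:

```python
import numpy as np

def silhouette_manual(X, labels):
    """Silhouette coefficient of every point, straight from the definition."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distance matrix.
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    scores = np.empty(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        # a: mean distance to the *other* points in the same cluster.
        a = d[i, same].sum() / (same.sum() - 1)
        # b: mean distance to the nearest foreign cluster.
        b = min(d[i, labels == c].mean() for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# Two tight, well-separated pairs -> coefficients close to +1.
X_demo = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(silhouette_manual(X_demo, np.array([0, 0, 1, 1])))
```

Note that the division breaks down for single-point clusters (scikit-learn defines the score as 0 in that case), so treat this as a teaching aid rather than a replacement for `silhouette_samples`.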
+ +```python +from sklearn.metrics import silhouette_score, silhouette_samples + +sil_avg = silhouette_score(X, km_scratch.labels_) +sil_vals = silhouette_samples(X, km_scratch.labels_) + +print(f"Inertia: {km_scratch.inertia_:.2f}") +print(f"Silhouette (mean): {sil_avg:.4f}") +print(f"Silhouette (min): {sil_vals.min():.4f}") +print(f"Silhouette (max): {sil_vals.max():.4f}") +``` + +A silhouette plot shows each cluster's distribution of silhouette coefficients. Healthy clusters extend well past the mean line; thin slivers or clusters barely crossing zero suggest poor separation. + +```python +fig, ax = plt.subplots(figsize=(8, 5)) + +y_lower = 10 +colors_sil = ["#e74c3c", "#2ecc71", "#3498db"] + +for k in range(3): + cluster_sil = np.sort(sil_vals[km_scratch.labels_ == k]) + cluster_size = cluster_sil.shape[0] + y_upper = y_lower + cluster_size + + ax.fill_betweenx(np.arange(y_lower, y_upper), 0, cluster_sil, + facecolor=colors_sil[k], edgecolor=colors_sil[k], alpha=0.7) + ax.text(-0.05, y_lower + 0.5 * cluster_size, f"Cluster {k}", fontsize=11, + fontweight="bold", va="center") + y_lower = y_upper + 10 + +ax.axvline(x=sil_avg, color="k", linestyle="--", linewidth=1.5, + label=f"Mean silhouette = {sil_avg:.3f}") +ax.set_xlabel("Silhouette coefficient", fontsize=12) +ax.set_ylabel("Points (sorted within cluster)", fontsize=12) +ax.set_title("Silhouette Plot β€” K-Means (K=3)", fontsize=14) +ax.legend(fontsize=11) +ax.set_yticks([]) +plt.tight_layout() +plt.show() +``` + +--- + +## 5. The Elbow Method + +Since we must specify *K* before running K-Means, how do we pick a good value? + +**The Elbow Method:** +1. Run K-Means for K = 1, 2, …, K_max. +2. Plot inertia vs K. +3. Look for the **"elbow"** β€” the point where inertia stops decreasing sharply and begins to level off. + +We can also plot silhouette score vs K; the best K often maximizes silhouette. Both plots together give a clearer picture. 
+ +```python +K_range = range(1, 11) +inertias = [] +silhouettes = [] + +for k in K_range: + km = KMeansScratch(k=k, random_state=42) + km.fit(X) + inertias.append(km.inertia_) + if k >= 2: + silhouettes.append(silhouette_score(X, km.labels_)) + else: + silhouettes.append(np.nan) + +fig, axes = plt.subplots(1, 2, figsize=(14, 5)) + +axes[0].plot(K_range, inertias, "o-", color="#2c3e50", linewidth=2, markersize=8) +axes[0].set_xlabel("Number of clusters (K)", fontsize=12) +axes[0].set_ylabel("Inertia", fontsize=12) +axes[0].set_title("Elbow Method", fontsize=14) +axes[0].axvline(x=3, color="#e74c3c", linestyle="--", alpha=0.7, label="K = 3 (elbow)") +axes[0].legend(fontsize=11) +axes[0].grid(True, alpha=0.3) + +sil_values = [s for s in silhouettes if not np.isnan(s)] +sil_ks = list(range(2, 11)) +axes[1].plot(sil_ks, sil_values, "s-", color="#27ae60", linewidth=2, markersize=8) +axes[1].set_xlabel("Number of clusters (K)", fontsize=12) +axes[1].set_ylabel("Mean Silhouette Score", fontsize=12) +axes[1].set_title("Silhouette Score vs K", fontsize=14) +axes[1].axvline(x=3, color="#e74c3c", linestyle="--", alpha=0.7, label="K = 3") +axes[1].legend(fontsize=11) +axes[1].grid(True, alpha=0.3) + +plt.tight_layout() +plt.show() + +print("Silhouette scores by K:") +for k, s in zip(sil_ks, sil_values): + print(f" K={k:2d} -> {s:.4f}") +``` + +Both plots agree: **K = 3** is the best choice for this dataset β€” inertia has a clear elbow and the silhouette score peaks at K = 3. + +--- + +## 6. Scikit-learn KMeans + +In practice you'll use scikit-learn's battle-tested implementation. It uses smarter **k-means++** initialization and runs multiple restarts (`n_init`) to avoid poor local minima. 
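
The seeding idea is simple enough to sketch: after a uniform first pick, each subsequent centroid is sampled with probability proportional to its squared distance from the nearest centroid chosen so far. This is a simplified illustration (the function name `kmeanspp_init` and the toy data are mine; scikit-learn's actual implementation adds further refinements such as candidate sampling):

```python
import numpy as np

def kmeanspp_init(X, k, rng):
    """Pick k seeds, each sampled far from the seeds chosen so far."""
    X = np.asarray(X, dtype=float)
    centroids = [X[rng.randint(len(X))]]  # first seed: uniform random point
    for _ in range(k - 1):
        # Squared distance from every point to its nearest existing seed.
        d2 = ((X[:, None] - np.array(centroids)[None]) ** 2).sum(axis=2).min(axis=1)
        # Sample the next seed with probability proportional to d2.
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

rng = np.random.RandomState(42)
X_toy = np.vstack([rng.normal(loc, 0.3, size=(50, 2))
                   for loc in ([0, 0], [5, 5], [0, 5])])
seeds = kmeanspp_init(X_toy, k=3, rng=rng)
print(seeds)
```

Because the seeds start far apart, Lloyd's iterations rarely fall into the poor local minima that plague purely random initialization.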
Let's compare with our scratch version: + +```python +from sklearn.cluster import KMeans + +km_sklearn = KMeans(n_clusters=3, random_state=42, n_init=10) +km_sklearn.fit(X) + +print("=== Scikit-learn KMeans ===") +print(f"Inertia: {km_sklearn.inertia_:.2f}") +print(f"Silhouette score: {silhouette_score(X, km_sklearn.labels_):.4f}") +print(f"Centroids:\n{km_sklearn.cluster_centers_}") +print() + +print("=== Our scratch KMeans ===") +print(f"Inertia: {km_scratch.inertia_:.2f}") +print(f"Silhouette score: {silhouette_score(X, km_scratch.labels_):.4f}") +print(f"Centroids:\n{km_scratch.centroids}") +``` + +The cluster labels may differ in numbering (label 0 in one could be label 2 in the other), but the **groupings themselves** should be nearly identical. + +```python +fig, axes = plt.subplots(1, 2, figsize=(13, 5)) + +colors_map = np.array(["#e74c3c", "#2ecc71", "#3498db"]) + +axes[0].scatter(X[:, 0], X[:, 1], c=colors_map[km_scratch.labels_], + edgecolors="k", s=50, alpha=0.7) +axes[0].scatter(km_scratch.centroids[:, 0], km_scratch.centroids[:, 1], + c="gold", marker="X", s=250, edgecolors="k", linewidths=1.5, zorder=5) +axes[0].set_title("Our Scratch Implementation", fontsize=14) +axes[0].set_xlabel("Feature 1") +axes[0].set_ylabel("Feature 2") + +axes[1].scatter(X[:, 0], X[:, 1], c=colors_map[km_sklearn.labels_], + edgecolors="k", s=50, alpha=0.7) +axes[1].scatter(km_sklearn.cluster_centers_[:, 0], km_sklearn.cluster_centers_[:, 1], + c="gold", marker="X", s=250, edgecolors="k", linewidths=1.5, zorder=5) +axes[1].set_title("Scikit-learn KMeans", fontsize=14) +axes[1].set_xlabel("Feature 1") +axes[1].set_ylabel("Feature 2") + +plt.suptitle("Scratch vs Scikit-learn β€” Side by Side", fontsize=15, y=1.01) +plt.tight_layout() +plt.show() +``` + +--- + +## 7. 
Practical Tips + +### When K-Means Works Well + +K-Means works best when clusters are: +- **Spherical (isotropic):** roughly the same spread in every direction +- **Similar in size:** very uneven cluster sizes can pull centroids away from smaller groups +- **Well-separated:** heavily overlapping clusters confuse the algorithm + +### Feature Scaling + +K-Means relies on Euclidean distance. If one feature has a range of 0–1 and another 0–10,000, the second feature will dominate. **Always standardize your features** (e.g., `StandardScaler`) before clustering. + +### Multiple Initializations + +Scikit-learn's `n_init` parameter (default 10) runs K-Means multiple times with different random seeds and keeps the result with the lowest inertia. This greatly reduces the risk of a poor local minimum. + +### When K-Means Fails + +K-Means struggles with: +- **Non-convex shapes** (e.g., crescent moons, concentric rings) β€” consider DBSCAN or spectral clustering instead +- **Clusters with very different densities** β€” HDBSCAN handles this better +- **High-dimensional data** β€” distances become less meaningful (curse of dimensionality); apply dimensionality reduction first + +--- + +## Summary + +### Key Takeaways + +1. **Unsupervised learning** discovers structure without labels. Clustering is its flagship task. +2. **K-Means** iterates between *assigning* points to the nearest centroid and *updating* centroids as cluster means until convergence. +3. **Inertia** measures within-cluster compactness; **silhouette score** balances compactness and separation. +4. The **elbow method** plots inertia vs K to find a natural number of clusters. +5. **Scikit-learn's KMeans** adds smart initialization (k-means++) and multiple restarts for robust results. +6. Always **scale features** before clustering, and remember that K-Means assumes spherical, similarly-sized clusters. 
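
To make takeaway 6 concrete, here is a small, hypothetical example of the scale-first advice: feature 1 carries the real group structure on a 0-1 scale, while feature 2 is pure noise on a scale of thousands (the dataset and variable names are invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Feature 1 holds the real structure (two groups around 0.2 and 0.8);
# feature 2 is pure noise on a scale thousands of times larger.
X_raw = np.column_stack([
    np.concatenate([rng.normal(0.2, 0.05, 100), rng.normal(0.8, 0.05, 100)]),
    rng.normal(5000, 1000, 200),
])

unscaled = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_raw)
scaled = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=0),
).fit_predict(X_raw)

print("unscaled cluster sizes:", np.bincount(unscaled))
print("scaled cluster sizes:  ", np.bincount(scaled))
```

Without scaling, the noisy large-scale feature dictates the partition and mixes the true groups; with `StandardScaler` in front, K-Means should recover the two real groups cleanly.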
+ +### What's Next + +In the following notebooks we will: +- Explore **hierarchical clustering** and dendrograms +- Learn **DBSCAN** for density-based clustering +- Apply **dimensionality reduction** (PCA, t-SNE) for visualization + +--- + +*Generated by Berta AI | Created by Luigi Pascal Rondanini* diff --git a/docs/chapters/content/ch08-02_intermediate.md b/docs/chapters/content/ch08-02_intermediate.md new file mode 100644 index 0000000..66973b2 --- /dev/null +++ b/docs/chapters/content/ch08-02_intermediate.md @@ -0,0 +1,520 @@ +# Ch 8: Unsupervised Learning - Intermediate + +**Track**: Practitioner | [Try code in Playground](../../playground.md) | [Back to chapter overview](../chapter-08.md) + + +!!! tip "Read online or run locally" + You can read this content here on the web. To run the code interactively, + either use the [Playground](../../playground.md) or clone the repo and open + `chapters/chapter-08-unsupervised-learning/notebooks/02_intermediate.ipynb` in Jupyter. + +--- + +# Chapter 8: Unsupervised Learning +## Notebook 02 - Intermediate: Advanced Clustering + +Beyond K-Means: hierarchical clustering, density-based methods, and Gaussian mixtures for real-world data shapes. + +**What you'll learn:** +- Hierarchical (agglomerative) clustering and dendrograms +- DBSCAN for density-based clustering +- Gaussian Mixture Models (GMMs) +- Comparing clustering algorithms on different data shapes + +**Time estimate:** 2.5 hours + +**Try it yourself:** Experiment with different linkage methods (single, complete, average, ward) on the hierarchical clustering example. Change `eps` and `min_samples` in DBSCAN to see how they affect cluster formation. + +**Common mistakes:** Using K-Means on non-convex shapes (e.g., moons), ignoring the k-distance graph when tuning DBSCAN, or assuming spherical clusters when data is elliptical. + +--- + +## 1. 
Hierarchical Clustering + +Hierarchical clustering builds a tree of clusters instead of requiring a fixed number of clusters up front. The **agglomerative (bottom-up)** approach proceeds as follows: + +1. **Start** β€” treat every data point as its own single-point cluster. +2. **Merge** β€” find the two closest clusters and merge them into one. +3. **Repeat** β€” keep merging until only a single cluster remains (or until a stopping criterion is met). + +The result is a hierarchy that can be visualised as a **dendrogram** β€” a tree diagram showing the order and distance of each merge. + +### Linkage criteria + +"Distance between two clusters" can be measured in several ways: + +| Linkage | Definition | Tendency | +|---------|-----------|----------| +| **Single** | Minimum distance between any pair of points across two clusters | Produces elongated, chain-like clusters | +| **Complete** | Maximum distance between any pair of points across two clusters | Produces compact, roughly equal-sized clusters | +| **Average** | Mean distance between all pairs of points across two clusters | Compromise between single and complete | +| **Ward** | Minimises the total within-cluster variance at each merge | Tends to produce equally sized, spherical clusters | + +Ward linkage is the most commonly used default and works well when clusters are roughly spherical. 
+ +```python +import numpy as np +import matplotlib.pyplot as plt +from sklearn.datasets import make_blobs +from sklearn.cluster import AgglomerativeClustering +from scipy.cluster.hierarchy import dendrogram, linkage, fcluster + +np.random.seed(42) + +# Generate synthetic data with 4 well-separated clusters +X_hier, y_hier = make_blobs( + n_samples=200, centers=4, cluster_std=0.8, random_state=42 +) + +fig, axes = plt.subplots(1, 2, figsize=(14, 5)) + +# Left panel β€” raw data +axes[0].scatter(X_hier[:, 0], X_hier[:, 1], s=30, alpha=0.7, edgecolors='k', linewidths=0.3) +axes[0].set_title('Raw Data (200 points, 4 clusters)') +axes[0].set_xlabel('Feature 1') +axes[0].set_ylabel('Feature 2') + +# Right panel β€” dendrogram using Ward linkage +Z_ward = linkage(X_hier, method='ward') +dendrogram( + Z_ward, + truncate_mode='lastp', + p=30, + leaf_rotation=90, + leaf_font_size=8, + ax=axes[1], + color_threshold=12 +) +axes[1].set_title('Dendrogram (Ward Linkage, truncated to 30 leaves)') +axes[1].set_xlabel('Cluster (size)') +axes[1].set_ylabel('Merge Distance') +axes[1].axhline(y=12, color='r', linestyle='--', label='Cut at distance = 12') +axes[1].legend() + +plt.tight_layout() +plt.show() +``` + +The dendrogram shows the full merge history. By drawing a horizontal cut line we decide how many clusters to keep β€” each vertical line that crosses the cut corresponds to one cluster. + +### Comparing linkage methods + +Let's visualise how the four linkage types partition the same dataset. 
+ +```python +linkage_methods = ['single', 'complete', 'average', 'ward'] +fig, axes = plt.subplots(1, 4, figsize=(20, 4.5)) + +for ax, method in zip(axes, linkage_methods): + Z = linkage(X_hier, method=method) + labels = fcluster(Z, t=4, criterion='maxclust') + scatter = ax.scatter( + X_hier[:, 0], X_hier[:, 1], + c=labels, cmap='viridis', s=30, alpha=0.7, edgecolors='k', linewidths=0.3 + ) + ax.set_title(f'{method.capitalize()} linkage') + ax.set_xlabel('Feature 1') + ax.set_ylabel('Feature 2') + +plt.suptitle('Agglomerative Clustering β€” 4 Linkage Methods (k=4)', fontsize=14, y=1.02) +plt.tight_layout() +plt.show() +``` + +```python +# Scikit-learn's AgglomerativeClustering with Ward linkage +agg = AgglomerativeClustering(n_clusters=4, linkage='ward') +agg_labels = agg.fit_predict(X_hier) + +fig, axes = plt.subplots(1, 2, figsize=(14, 5)) + +axes[0].scatter( + X_hier[:, 0], X_hier[:, 1], + c=y_hier, cmap='tab10', s=40, alpha=0.7, edgecolors='k', linewidths=0.3 +) +axes[0].set_title('Ground-Truth Labels') +axes[0].set_xlabel('Feature 1') +axes[0].set_ylabel('Feature 2') + +axes[1].scatter( + X_hier[:, 0], X_hier[:, 1], + c=agg_labels, cmap='tab10', s=40, alpha=0.7, edgecolors='k', linewidths=0.3 +) +axes[1].set_title('AgglomerativeClustering (Ward, k=4)') +axes[1].set_xlabel('Feature 1') +axes[1].set_ylabel('Feature 2') + +plt.tight_layout() +plt.show() + +print(f"Cluster sizes: {np.bincount(agg_labels)}") +``` + +--- + +## 2. DBSCAN + +**DBSCAN** (Density-Based Spatial Clustering of Applications with Noise) takes a fundamentally different approach to clustering: + +- It does **not** require the number of clusters in advance. +- It defines clusters as **dense regions** separated by sparse regions. +- Points that don't belong to any dense region are labelled as **noise** (label = -1). 
+ +### Key parameters + +| Parameter | Meaning | +|-----------|--------| +| `eps` (Ξ΅) | Maximum distance between two points for them to be considered neighbours | +| `min_samples` | Minimum number of points within Ξ΅-distance to form a dense region | + +### Point types + +- **Core point** β€” has at least `min_samples` neighbours within Ξ΅. +- **Border point** β€” within Ξ΅ of a core point but doesn't have enough neighbours itself. +- **Noise point** β€” neither core nor border; isolated outliers. + +DBSCAN can discover clusters of **arbitrary shape** and naturally identifies outliers β€” something centroid-based methods like K-Means cannot do. + +```python +from sklearn.datasets import make_moons +from sklearn.cluster import KMeans, DBSCAN + +np.random.seed(42) + +# Generate two moons (non-convex dataset) +X_moons, y_moons = make_moons(n_samples=500, noise=0.08, random_state=42) + +# Apply DBSCAN and K-Means +db_moons = DBSCAN(eps=0.2, min_samples=5).fit(X_moons) +km_moons = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X_moons) + +fig, axes = plt.subplots(1, 3, figsize=(18, 5)) + +axes[0].scatter(X_moons[:, 0], X_moons[:, 1], c=y_moons, cmap='coolwarm', s=20, alpha=0.7) +axes[0].set_title('Ground Truth') +axes[0].set_xlabel('Feature 1') +axes[0].set_ylabel('Feature 2') + +axes[1].scatter(X_moons[:, 0], X_moons[:, 1], c=km_moons.labels_, cmap='coolwarm', s=20, alpha=0.7) +axes[1].scatter(km_moons.cluster_centers_[:, 0], km_moons.cluster_centers_[:, 1], + marker='X', s=200, c='black', edgecolors='white', linewidths=1.5) +axes[1].set_title('K-Means (k=2) β€” Fails on non-convex shapes') +axes[1].set_xlabel('Feature 1') +axes[1].set_ylabel('Feature 2') + +axes[2].scatter(X_moons[:, 0], X_moons[:, 1], c=db_moons.labels_, cmap='coolwarm', s=20, alpha=0.7) +axes[2].set_title('DBSCAN (eps=0.2) β€” Correctly separates crescents') +axes[2].set_xlabel('Feature 1') +axes[2].set_ylabel('Feature 2') + +plt.suptitle('K-Means vs DBSCAN on the Moons Dataset', fontsize=14, 
y=1.02) +plt.tight_layout() +plt.show() +``` + +--- + +## 3. Choosing DBSCAN Parameters + +Picking `eps` and `min_samples` can be tricky. A practical heuristic: + +1. Set `min_samples` β‰ˆ 2 Γ— number of features (a reasonable default). +2. For each point compute the distance to its **k-th nearest neighbour** (k = `min_samples`). +3. Sort these distances and plot them β€” the **k-distance graph**. +4. Look for the "elbow" β€” the point where the curve bends sharply upward. The distance at that elbow is a good candidate for `eps`. + +```python +from sklearn.neighbors import NearestNeighbors + +k = 5 # same as min_samples +nn = NearestNeighbors(n_neighbors=k) +nn.fit(X_moons) +distances, _ = nn.kneighbors(X_moons) + +k_distances = np.sort(distances[:, k - 1])[::-1] + +plt.figure(figsize=(10, 5)) +plt.plot(k_distances, linewidth=1.5) +plt.axhline(y=0.2, color='r', linestyle='--', label='eps = 0.2 (our choice)') +plt.title(f'k-Distance Graph (k={k}) β€” Elbow Indicates Good eps') +plt.xlabel('Points (sorted by descending k-distance)') +plt.ylabel(f'Distance to {k}-th Nearest Neighbour') +plt.legend() +plt.grid(True, alpha=0.3) +plt.show() +``` + +```python +# Effect of different eps values on DBSCAN results +eps_values = [0.05, 0.1, 0.2, 0.3, 0.5] +fig, axes = plt.subplots(1, len(eps_values), figsize=(22, 4)) + +for ax, eps in zip(axes, eps_values): + db = DBSCAN(eps=eps, min_samples=5).fit(X_moons) + labels = db.labels_ + n_clusters = len(set(labels)) - (1 if -1 in labels else 0) + n_noise = (labels == -1).sum() + + unique_labels = set(labels) + colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))] + + for k_label, col in zip(sorted(unique_labels), colors): + if k_label == -1: + col = [0, 0, 0, 1] # black for noise + mask = labels == k_label + ax.scatter(X_moons[mask, 0], X_moons[mask, 1], c=[col], s=15, alpha=0.7) + + ax.set_title(f'eps={eps}\n{n_clusters} clusters, {n_noise} noise') + ax.set_xlabel('Feature 1') + 
+axes[0].set_ylabel('Feature 2') +plt.suptitle('Effect of eps on DBSCAN (min_samples=5)', fontsize=14, y=1.05) +plt.tight_layout() +plt.show() +``` + +**Observations:** +- **eps too small** (0.05) β†’ most points classified as noise; many tiny clusters. +- **eps just right** (0.2) β†’ two clean crescent clusters with very little noise. +- **eps too large** (0.5) β†’ everything merges into a single cluster. + +The k-distance graph helps you find that sweet spot without trial and error. + +--- + +## 4. Gaussian Mixture Models + +A **Gaussian Mixture Model** assumes that the data is generated from a mixture of a finite number of Gaussian (normal) distributions with unknown parameters. + +### GMM vs K-Means + +| Aspect | K-Means | GMM | +|--------|---------|-----| +| Cluster assignment | **Hard** β€” each point belongs to exactly one cluster | **Soft** β€” each point has a probability for every cluster | +| Cluster shape | Spherical (Voronoi cells) | Elliptical (full covariance matrices) | +| Outlier handling | None β€” every point is assigned | Naturally down-weights low-probability points | +| Output | Cluster label | Probability vector over all clusters | + +GMMs are fit using the **Expectation-Maximisation (EM)** algorithm: +1. **E-step** β€” compute the probability that each point belongs to each Gaussian component. +2. **M-step** β€” update each component's mean, covariance, and weight to maximise log-likelihood. +3. Repeat until convergence. 
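To make the E-step and M-step concrete, here is a minimal NumPy sketch of the loop. It uses its own made-up data (two well-separated 2-D blobs, not the chapter's `X_gmm`), a crude deterministic initialisation, a fixed iteration count instead of a convergence check, and a tiny ridge on each covariance for numerical stability:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
# Two well-separated 2-D Gaussian blobs
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

k, (n, d) = 2, X.shape
means = X[[0, 100]].astype(float)       # crude init: one point from each half
covs = np.stack([np.eye(d) for _ in range(k)])
weights = np.full(k, 1.0 / k)

for _ in range(50):
    # E-step: responsibility of each component for each point
    dens = np.column_stack([w * multivariate_normal.pdf(X, m, c)
                            for w, m, c in zip(weights, means, covs)])
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: re-estimate weights, means, and covariances
    Nk = resp.sum(axis=0)
    weights = Nk / n
    means = (resp.T @ X) / Nk[:, None]
    for j in range(k):
        diff = X - means[j]
        covs[j] = (resp[:, j, None] * diff).T @ diff / Nk[j] + 1e-6 * np.eye(d)

print(np.round(means, 2))  # component means land near (0, 0) and (5, 5)
```

This is exactly what `GaussianMixture.fit` does under the hood, plus smarter initialisation (k-means by default), convergence checks, and log-space arithmetic to avoid underflow.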
+ +```python +from sklearn.mixture import GaussianMixture + +np.random.seed(42) + +# Create elongated / elliptical clusters that challenge K-Means +n_per_cluster = 200 +cov1 = [[2.0, 1.5], [1.5, 1.5]] +cov2 = [[1.5, -1.2], [-1.2, 1.5]] +cov3 = [[0.5, 0.0], [0.0, 2.5]] + +cluster1 = np.random.multivariate_normal([0, 0], cov1, n_per_cluster) +cluster2 = np.random.multivariate_normal([5, 5], cov2, n_per_cluster) +cluster3 = np.random.multivariate_normal([8, 0], cov3, n_per_cluster) + +X_gmm = np.vstack([cluster1, cluster2, cluster3]) +y_gmm_true = np.array([0]*n_per_cluster + [1]*n_per_cluster + [2]*n_per_cluster) + +fig, axes = plt.subplots(1, 3, figsize=(18, 5)) + +# Ground truth +axes[0].scatter(X_gmm[:, 0], X_gmm[:, 1], c=y_gmm_true, cmap='tab10', s=15, alpha=0.6) +axes[0].set_title('Ground Truth (Elliptical Clusters)') +axes[0].set_xlabel('Feature 1') +axes[0].set_ylabel('Feature 2') + +# K-Means +km_gmm = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X_gmm) +axes[1].scatter(X_gmm[:, 0], X_gmm[:, 1], c=km_gmm.labels_, cmap='tab10', s=15, alpha=0.6) +axes[1].scatter(km_gmm.cluster_centers_[:, 0], km_gmm.cluster_centers_[:, 1], + marker='X', s=200, c='black', edgecolors='white', linewidths=1.5) +axes[1].set_title('K-Means (k=3) β€” Spherical assumption') +axes[1].set_xlabel('Feature 1') +axes[1].set_ylabel('Feature 2') + +# GMM +gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=42) +gmm.fit(X_gmm) +gmm_labels = gmm.predict(X_gmm) +axes[2].scatter(X_gmm[:, 0], X_gmm[:, 1], c=gmm_labels, cmap='tab10', s=15, alpha=0.6) +axes[2].set_title('GMM (3 components) β€” Elliptical fit') +axes[2].set_xlabel('Feature 1') +axes[2].set_ylabel('Feature 2') + +plt.suptitle('K-Means vs GMM on Elliptical Clusters', fontsize=14, y=1.02) +plt.tight_layout() +plt.show() +``` + +```python +# Visualise GMM probability contours +x_min, x_max = X_gmm[:, 0].min() - 2, X_gmm[:, 0].max() + 2 +y_min, y_max = X_gmm[:, 1].min() - 2, X_gmm[:, 1].max() + 2 +xx, yy = 
np.meshgrid(np.linspace(x_min, x_max, 300), np.linspace(y_min, y_max, 300)) +grid_points = np.column_stack([xx.ravel(), yy.ravel()]) + +log_prob = gmm.score_samples(grid_points) +log_prob = log_prob.reshape(xx.shape) + +fig, ax = plt.subplots(figsize=(10, 7)) +ax.contourf(xx, yy, np.exp(log_prob), levels=30, cmap='YlOrRd', alpha=0.6) +ax.contour(xx, yy, np.exp(log_prob), levels=10, colors='darkred', linewidths=0.5, alpha=0.5) +ax.scatter(X_gmm[:, 0], X_gmm[:, 1], c=gmm_labels, cmap='tab10', s=10, alpha=0.7, + edgecolors='k', linewidths=0.2) + +for i in range(gmm.n_components): + ax.scatter(gmm.means_[i, 0], gmm.means_[i, 1], + marker='+', s=300, c='black', linewidths=3) + +ax.set_title('GMM Probability Density Contours') +ax.set_xlabel('Feature 1') +ax.set_ylabel('Feature 2') +plt.tight_layout() +plt.show() +``` + +### Model selection with BIC and AIC + +How many Gaussian components should we use? We can use information criteria: + +- **BIC** (Bayesian Information Criterion) β€” penalises model complexity more heavily. +- **AIC** (Akaike Information Criterion) β€” lighter penalty. + +**Lower is better** for both. We fit GMMs with different numbers of components and pick the one with the lowest BIC (or AIC). 
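To see where these scores come from, here is a small sketch that recomputes BIC and AIC by hand and checks them against scikit-learn. It uses its own synthetic data (not the chapter's `X_gmm`), and the parameter count assumes `covariance_type='full'`: k*d means, k*d*(d+1)/2 covariance entries, and k-1 free mixture weights (they sum to 1).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (150, 2)), rng.normal(4, 1, (150, 2))])

k = 2
gmm = GaussianMixture(n_components=k, covariance_type='full', random_state=42).fit(X)

n, d = X.shape
# Free parameters: means + full covariances + mixture weights
p = k * d + k * d * (d + 1) // 2 + (k - 1)

log_likelihood = gmm.score(X) * n           # score() returns the *mean* log-likelihood
bic_manual = -2 * log_likelihood + p * np.log(n)
aic_manual = -2 * log_likelihood + 2 * p

print(p)                                    # 11 free parameters
print(np.isclose(bic_manual, gmm.bic(X)))   # True
print(np.isclose(aic_manual, gmm.aic(X)))   # True
```

Because BIC's penalty grows with `ln(n)` while AIC's is a flat `2p`, BIC prefers smaller models as the dataset grows.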
+ +```python +n_components_range = range(1, 10) +bic_scores = [] +aic_scores = [] + +for n in n_components_range: + gmm_test = GaussianMixture(n_components=n, covariance_type='full', random_state=42) + gmm_test.fit(X_gmm) + bic_scores.append(gmm_test.bic(X_gmm)) + aic_scores.append(gmm_test.aic(X_gmm)) + +fig, ax = plt.subplots(figsize=(10, 5)) +ax.plot(list(n_components_range), bic_scores, 'bo-', label='BIC', linewidth=2) +ax.plot(list(n_components_range), aic_scores, 'rs--', label='AIC', linewidth=2) +ax.axvline(x=3, color='green', linestyle=':', alpha=0.7, label='True number of components (3)') +ax.set_xlabel('Number of Components') +ax.set_ylabel('Score (lower is better)') +ax.set_title('GMM Model Selection: BIC and AIC') +ax.legend() +ax.grid(True, alpha=0.3) +plt.tight_layout() +plt.show() + +print(f"Best BIC at n_components = {np.argmin(bic_scores) + 1}") +print(f"Best AIC at n_components = {np.argmin(aic_scores) + 1}") +``` + +--- + +## 5. Algorithm Comparison + +Let's put all four algorithms head-to-head on three different data geometries: + +1. **Blobs** β€” well-separated spherical clusters +2. **Moons** β€” two interleaving crescents +3. 
**Varied-variance blobs** β€” spherical clusters with very different densities + +```python +from sklearn.preprocessing import StandardScaler + +np.random.seed(42) + +n_samples = 500 + +# Dataset 1: standard blobs +X_blobs, y_blobs = make_blobs(n_samples=n_samples, centers=3, cluster_std=1.0, random_state=42) + +# Dataset 2: moons +X_moons2, y_moons2 = make_moons(n_samples=n_samples, noise=0.07, random_state=42) + +# Dataset 3: varied-variance blobs +X_varied, y_varied = make_blobs( + n_samples=n_samples, centers=3, cluster_std=[0.5, 2.5, 1.0], random_state=42 +) + +datasets = [ + ('Blobs', X_blobs, {'n_clusters': 3, 'eps': 1.0}), + ('Moons', X_moons2, {'n_clusters': 2, 'eps': 0.2}), + ('Varied', X_varied, {'n_clusters': 3, 'eps': 1.5}), +] + +fig, axes = plt.subplots(3, 4, figsize=(22, 15)) + +for row, (name, X, params) in enumerate(datasets): + X_scaled = StandardScaler().fit_transform(X) + n_c = params['n_clusters'] + eps = params['eps'] + + # K-Means + km = KMeans(n_clusters=n_c, random_state=42, n_init=10).fit(X_scaled) + # Agglomerative + agg = AgglomerativeClustering(n_clusters=n_c, linkage='ward').fit(X_scaled) + # DBSCAN + db = DBSCAN(eps=eps, min_samples=5).fit(X_scaled) + # GMM + gm = GaussianMixture(n_components=n_c, random_state=42).fit(X_scaled) + + results = [ + ('K-Means', km.labels_), + ('Agglomerative', agg.labels_), + ('DBSCAN', db.labels_), + ('GMM', gm.predict(X_scaled)), + ] + + for col, (algo_name, labels) in enumerate(results): + ax = axes[row, col] + unique_labels = set(labels) + n_clust = len(unique_labels) - (1 if -1 in unique_labels else 0) + + noise_mask = labels == -1 + ax.scatter(X_scaled[~noise_mask, 0], X_scaled[~noise_mask, 1], + c=labels[~noise_mask], cmap='viridis', s=12, alpha=0.7) + if noise_mask.any(): + ax.scatter(X_scaled[noise_mask, 0], X_scaled[noise_mask, 1], + c='red', marker='x', s=15, alpha=0.5, label='noise') + ax.legend(fontsize=8) + + if row == 0: + ax.set_title(algo_name, fontsize=13, fontweight='bold') + 
ax.set_ylabel(f'{name}' if col == 0 else '', fontsize=12) + ax.text(0.02, 0.98, f'{n_clust} cluster(s)', + transform=ax.transAxes, fontsize=9, va='top', + bbox=dict(boxstyle='round,pad=0.3', facecolor='white', alpha=0.8)) + +plt.suptitle('Algorithm Comparison Across Data Geometries', fontsize=16, y=1.01) +plt.tight_layout() +plt.show() +``` + +--- + +## Summary + +### When to use each algorithm + +| Algorithm | Best for | Weaknesses | Must specify k? | +|-----------|----------|------------|-----------------| +| **K-Means** | Large datasets with spherical clusters | Cannot handle non-convex shapes; sensitive to outliers | Yes | +| **Agglomerative Clustering** | Small-to-medium datasets; exploring hierarchy | O(nΒ³) time complexity; hard to scale | Yes (or cut dendrogram) | +| **DBSCAN** | Arbitrary shapes; datasets with noise/outliers | Sensitive to `eps`; struggles with varying densities | No | +| **Gaussian Mixture Model** | Elliptical clusters; need soft assignments | Assumes Gaussian components; sensitive to initialisation | Yes | + +### Rules of thumb + +1. **Start simple:** try K-Means first. If results look poor, consider the data geometry. +2. **Non-convex shapes?** β†’ Use DBSCAN. +3. **Elliptical or overlapping clusters?** β†’ Use GMM. +4. **Need a hierarchy or dendrogram?** β†’ Use Agglomerative Clustering. +5. **Noisy data with outliers?** β†’ DBSCAN naturally handles noise. +6. **Need probability estimates?** β†’ GMM provides soft assignments. + +--- +*Generated by Berta AI | Created by Luigi Pascal Rondanini* diff --git a/docs/chapters/content/ch08-03_advanced.md b/docs/chapters/content/ch08-03_advanced.md new file mode 100644 index 0000000..1e08ea4 --- /dev/null +++ b/docs/chapters/content/ch08-03_advanced.md @@ -0,0 +1,687 @@ +# Ch 8: Unsupervised Learning - Advanced + +**Track**: Practitioner | [Try code in Playground](../../playground.md) | [Back to chapter overview](../chapter-08.md) + + +!!! 
tip "Read online or run locally" + You can read this content here on the web. To run the code interactively, + either use the [Playground](../../playground.md) or clone the repo and open + `chapters/chapter-08-unsupervised-learning/notebooks/03_advanced.ipynb` in Jupyter. + +--- + +# Chapter 8: Unsupervised Learning +## Notebook 03 - Advanced: Dimensionality Reduction & Capstone + +Reduce high-dimensional data for visualization and modeling, detect anomalies, and build a complete customer segmentation system. + +**What you'll learn:** +- Principal Component Analysis (PCA) from scratch +- t-SNE for 2D visualization +- Anomaly detection with Isolation Forest +- Customer segmentation capstone project + +**Time estimate:** 3 hours + +--- + +## 1. PCA Theory + +### The Core Idea + +PCA is a **linear** dimensionality-reduction technique that finds the directions (called **principal components**) along which the data varies the most. + +Imagine a cloud of 3-D points shaped like a flat pancake. Two axes capture almost all of the spread; the third adds very little information. PCA discovers those two dominant axes automatically. + +### Algorithm Steps + +1. **Center the data** β€” subtract the mean of each feature so that the cloud is centered at the origin. +2. **Compute the covariance matrix** β€” a \(d \times d\) matrix (where \(d\) is the number of features) that captures pairwise linear relationships. +3. **Eigendecomposition** β€” find the eigenvectors and eigenvalues of the covariance matrix. Each eigenvector is a principal component direction; its eigenvalue tells us how much variance that direction explains. +4. **Sort & select** β€” rank components by eigenvalue (descending) and keep the top \(k\) to reduce dimensionality from \(d\) to \(k\). +5. **Project** β€” multiply the centered data by the selected eigenvectors to obtain the lower-dimensional representation. 
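As a tiny worked example of steps 2-4: for the covariance matrix \(\Sigma = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}\), the eigenpairs are \(\lambda_1 = 3\) with \(v_1 = \frac{1}{\sqrt{2}}(1, 1)^\top\) and \(\lambda_2 = 1\) with \(v_2 = \frac{1}{\sqrt{2}}(1, -1)^\top\). The data varies three times as much along the diagonal direction as across it, so projecting onto \(v_1\) alone retains \(3 / (3 + 1) = 75\%\) of the total variance.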
+ +### Variance Explained Ratio + +The variance explained ratio for component \(i\) is \(\lambda_i / \sum_{j=1}^{d} \lambda_j\), where \(\lambda_i\) is the \(i\)-th eigenvalue. The **cumulative** variance explained tells us how much total information is retained when we keep the first \(k\) components. + +--- + +## 2. PCA From Scratch + +We implement PCA using only NumPy and apply it to the classic **Iris** dataset (4 features β†’ 2 components). + +```python +import numpy as np +import matplotlib.pyplot as plt +from sklearn.datasets import load_iris + +np.random.seed(42) + +# Load the Iris dataset (4 features, 150 samples, 3 classes) +iris = load_iris() +X = iris.data # shape (150, 4) +y = iris.target # 0, 1, 2 +feature_names = iris.feature_names +target_names = iris.target_names + +print(f"Dataset shape: {X.shape}") +print(f"Features: {feature_names}") +print(f"Classes: {list(target_names)}") +``` + +```python +def pca_from_scratch(X, n_components=2): + """Implement PCA using NumPy.""" + # Step 1: Center the data + mean = np.mean(X, axis=0) + X_centered = X - mean + + # Step 2: Covariance matrix (features Γ— features) + cov_matrix = np.cov(X_centered, rowvar=False) + + # Step 3: Eigendecomposition + eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix) + + # Step 4: Sort by eigenvalue descending + sorted_idx = np.argsort(eigenvalues)[::-1] + eigenvalues = eigenvalues[sorted_idx] + eigenvectors = eigenvectors[:, sorted_idx] + + # Variance explained ratio + variance_ratio = eigenvalues / eigenvalues.sum() + + # Step 5: Project onto top-k components + W = eigenvectors[:, :n_components] + X_projected = X_centered @ W + + return X_projected, eigenvalues, variance_ratio, W + + +X_pca_scratch, eigenvalues, var_ratio, components = pca_from_scratch(X, n_components=2) + +print("Eigenvalues:", np.round(eigenvalues, 4)) +print("Variance explained ratio:", np.round(var_ratio, 4)) +print(f"Total variance retained (2 components): {var_ratio[:2].sum():.2%}") +``` + +```python +# 
Variance Explained Bar + Cumulative Line +fig, axes = plt.subplots(1, 2, figsize=(13, 5)) + +# Left: bar chart of individual variance ratios +axes[0].bar(range(1, len(var_ratio) + 1), var_ratio, color="steelblue", edgecolor="black") +axes[0].set_xlabel("Principal Component") +axes[0].set_ylabel("Variance Explained Ratio") +axes[0].set_title("Variance Explained by Each Component") +axes[0].set_xticks(range(1, len(var_ratio) + 1)) + +# Right: cumulative variance explained +cumulative = np.cumsum(var_ratio) +axes[1].plot(range(1, len(cumulative) + 1), cumulative, "o-", color="darkorange", linewidth=2) +axes[1].axhline(y=0.95, color="red", linestyle="--", label="95% threshold") +axes[1].set_xlabel("Number of Components") +axes[1].set_ylabel("Cumulative Variance Explained") +axes[1].set_title("Cumulative Variance Explained") +axes[1].set_xticks(range(1, len(cumulative) + 1)) +axes[1].legend() + +plt.tight_layout() +plt.show() +``` + +```python +# 2-D scatter plot of the scratch PCA projection +colors = ["#1f77b4", "#ff7f0e", "#2ca02c"] + +plt.figure(figsize=(8, 6)) +for i, name in enumerate(target_names): + mask = y == i + plt.scatter(X_pca_scratch[mask, 0], X_pca_scratch[mask, 1], + label=name, alpha=0.7, edgecolors="k", linewidth=0.5, + color=colors[i], s=60) +plt.xlabel(f"PC 1 ({var_ratio[0]:.1%} variance)") +plt.ylabel(f"PC 2 ({var_ratio[1]:.1%} variance)") +plt.title("PCA From Scratch β€” Iris Dataset (2-D Projection)") +plt.legend() +plt.grid(alpha=0.3) +plt.tight_layout() +plt.show() +``` + +--- + +## 3. PCA with Scikit-learn + +We verify our scratch implementation against the well-optimized `sklearn.decomposition.PCA`. 
+ +```python +from sklearn.decomposition import PCA + +pca_sk = PCA(n_components=4) # keep all 4 to inspect variance +X_pca_sk_full = pca_sk.fit_transform(X) + +print("Sklearn variance explained ratio:", np.round(pca_sk.explained_variance_ratio_, 4)) +print("Scratch variance explained ratio: ", np.round(var_ratio, 4)) +print() +print("Cumulative (sklearn):", np.round(np.cumsum(pca_sk.explained_variance_ratio_), 4)) +``` + +```python +X_pca_sk = X_pca_sk_full[:, :2] # first 2 components + +# Sign of eigenvectors can flip β€” align for visual comparison +for col in range(2): + if np.corrcoef(X_pca_scratch[:, col], X_pca_sk[:, col])[0, 1] < 0: + X_pca_scratch[:, col] *= -1 + +fig, axes = plt.subplots(1, 2, figsize=(14, 5), sharex=True, sharey=True) + +for ax, data, title in zip(axes, + [X_pca_scratch, X_pca_sk], + ["PCA (from scratch)", "PCA (scikit-learn)"]): + for i, name in enumerate(target_names): + mask = y == i + ax.scatter(data[mask, 0], data[mask, 1], label=name, + alpha=0.7, edgecolors="k", linewidth=0.5, + color=colors[i], s=60) + ax.set_xlabel("PC 1") + ax.set_ylabel("PC 2") + ax.set_title(title) + ax.legend() + ax.grid(alpha=0.3) + +plt.suptitle("Scratch vs Scikit-learn PCA β€” Identical Results", fontsize=14, y=1.02) +plt.tight_layout() +plt.show() +``` + +The two plots are virtually identical (eigenvector signs may differ, which is cosmetic). This confirms our from-scratch implementation is correct. + +--- + +## 4. t-SNE + +### What is t-SNE? + +**t-distributed Stochastic Neighbor Embedding (t-SNE)** is a non-linear dimensionality-reduction technique designed specifically for **visualization**. + +Key properties: +- Preserves **local structure**: points that are close in high-dimensional space stay close in the 2-D embedding. +- Does **not** preserve global distances β€” clusters may move relative to each other between runs. +- Computationally expensive β€” not suitable as a preprocessing step in machine-learning pipelines. 
+- The **perplexity** parameter (roughly: how many neighbors each point considers) strongly influences the result. Typical range: 5–50.
+
+**Rule of thumb:** Use PCA when you need a general-purpose reduction (for modeling, compression, noise removal). Use t-SNE when your sole goal is to *see* cluster structure in 2-D.
+
+```python
+from sklearn.manifold import TSNE
+
+# `max_iter` was named `n_iter` before scikit-learn 1.5
+tsne = TSNE(n_components=2, perplexity=30, random_state=42, max_iter=1000)
+X_tsne = tsne.fit_transform(X)
+
+print(f"t-SNE output shape: {X_tsne.shape}")
+```
+
+```python
+# Side-by-side: PCA vs t-SNE
+fig, axes = plt.subplots(1, 2, figsize=(14, 5))
+
+for ax, data, title in zip(axes,
+                           [X_pca_sk, X_tsne],
+                           ["PCA (linear)", "t-SNE (non-linear)"]):
+    for i, name in enumerate(target_names):
+        mask = y == i
+        ax.scatter(data[mask, 0], data[mask, 1], label=name,
+                   alpha=0.7, edgecolors="k", linewidth=0.5,
+                   color=colors[i], s=60)
+    ax.set_xlabel("Dim 1")
+    ax.set_ylabel("Dim 2")
+    ax.set_title(title)
+    ax.legend()
+    ax.grid(alpha=0.3)
+
+plt.suptitle("PCA vs t-SNE β€” Iris Dataset", fontsize=14, y=1.02)
+plt.tight_layout()
+plt.show()
+```
+
+```python
+# Effect of perplexity on t-SNE
+perplexities = [5, 15, 30, 50]
+fig, axes = plt.subplots(1, 4, figsize=(20, 4))
+
+for ax, perp in zip(axes, perplexities):
+    embedding = TSNE(n_components=2, perplexity=perp,
+                     random_state=42, max_iter=1000).fit_transform(X)
+    for i, name in enumerate(target_names):
+        mask = y == i
+        ax.scatter(embedding[mask, 0], embedding[mask, 1],
+                   alpha=0.7, color=colors[i], s=40, edgecolors="k",
+                   linewidth=0.3, label=name)
+    ax.set_title(f"Perplexity = {perp}")
+    ax.set_xticks([])
+    ax.set_yticks([])
+
+axes[0].legend(fontsize=8)
+plt.suptitle("t-SNE: Impact of Perplexity", fontsize=14, y=1.04)
+plt.tight_layout()
+plt.show()
+```
+
+**Observations on perplexity:**
+- Low perplexity (5): focuses on very local neighbors β€” clusters may fragment.
+- High perplexity (50): considers more neighbors β€” clusters become rounder and more global structure is visible, but fine local detail may blur. +- There is no single "correct" perplexity; try several and look for consistent patterns. + +--- + +## 5. Anomaly Detection + +### Why Unsupervised Anomaly Detection? + +In many real-world scenarios, labeled anomalies are scarce or non-existent: + +| Domain | Normal | Anomaly | +|--------|--------|---------| +| Banking | Legitimate transactions | Fraud | +| Manufacturing | Good products | Defects | +| Cybersecurity | Regular traffic | Intrusions | + +Unsupervised methods learn the distribution of *normal* data and flag anything that doesn't fit. + +### Approach 1 β€” Z-Score + +Flag a point as anomalous if any feature has a Z-score \(|z| > \tau\) (e.g., \(\tau = 3\)). Simple, but assumes Gaussian features and works only for univariate or low-dimensional data. + +### Approach 2 β€” Isolation Forest + +The **Isolation Forest** algorithm isolates observations by randomly selecting a feature and a split value. Anomalies are easier to isolate (fewer splits needed), so they have shorter average path lengths in the trees. 
+ +Advantages: +- Works well in high dimensions +- No distribution assumptions +- Linear time complexity + +```python +from sklearn.ensemble import IsolationForest +from scipy import stats + +np.random.seed(42) + +# Generate normal data: 2 clusters +normal_a = np.random.randn(150, 2) * 0.8 + np.array([2, 2]) +normal_b = np.random.randn(150, 2) * 0.8 + np.array([-2, -2]) +normal_data = np.vstack([normal_a, normal_b]) + +# Inject 20 anomalies scattered far from the clusters +anomalies = np.random.uniform(low=-6, high=6, size=(20, 2)) + +X_anom = np.vstack([normal_data, anomalies]) +labels_true = np.array([0] * len(normal_data) + [1] * len(anomalies)) # 0=normal, 1=anomaly + +print(f"Total points: {len(X_anom)} (normal: {len(normal_data)}, anomalies: {len(anomalies)})") +``` + +```python +# Z-Score method +z_scores = np.abs(stats.zscore(X_anom)) +z_threshold = 3.0 +z_anomaly_mask = (z_scores > z_threshold).any(axis=1) + +print(f"Z-Score method detected {z_anomaly_mask.sum()} anomalies (threshold={z_threshold})") +``` + +```python +# Isolation Forest +iso_forest = IsolationForest(n_estimators=200, contamination=0.06, + random_state=42) +iso_preds = iso_forest.fit_predict(X_anom) # 1 = normal, -1 = anomaly +iso_anomaly_mask = iso_preds == -1 + +print(f"Isolation Forest detected {iso_anomaly_mask.sum()} anomalies") +``` + +```python +fig, axes = plt.subplots(1, 3, figsize=(18, 5)) + +# Ground truth +axes[0].scatter(X_anom[labels_true == 0, 0], X_anom[labels_true == 0, 1], + c="steelblue", s=30, alpha=0.6, label="Normal") +axes[0].scatter(X_anom[labels_true == 1, 0], X_anom[labels_true == 1, 1], + c="red", s=80, marker="X", label="True Anomaly") +axes[0].set_title("Ground Truth") +axes[0].legend() +axes[0].grid(alpha=0.3) + +# Z-Score +axes[1].scatter(X_anom[~z_anomaly_mask, 0], X_anom[~z_anomaly_mask, 1], + c="steelblue", s=30, alpha=0.6, label="Normal") +axes[1].scatter(X_anom[z_anomaly_mask, 0], X_anom[z_anomaly_mask, 1], + c="red", s=80, marker="X", label="Detected 
Anomaly") +axes[1].set_title(f"Z-Score (threshold={z_threshold})") +axes[1].legend() +axes[1].grid(alpha=0.3) + +# Isolation Forest +axes[2].scatter(X_anom[~iso_anomaly_mask, 0], X_anom[~iso_anomaly_mask, 1], + c="steelblue", s=30, alpha=0.6, label="Normal") +axes[2].scatter(X_anom[iso_anomaly_mask, 0], X_anom[iso_anomaly_mask, 1], + c="red", s=80, marker="X", label="Detected Anomaly") +axes[2].set_title("Isolation Forest") +axes[2].legend() +axes[2].grid(alpha=0.3) + +plt.suptitle("Anomaly Detection Comparison", fontsize=14, y=1.02) +plt.tight_layout() +plt.show() +``` + +**Key takeaway:** The Isolation Forest typically outperforms the Z-Score method, especially when the data is multi-modal or the anomalies are not simply extreme values along a single axis. + +--- + +## 6. Capstone β€” Customer Segmentation + +We build a complete customer-segmentation pipeline: + +1. Generate & save a synthetic customer dataset +2. Feature scaling +3. Dimensionality reduction with PCA +4. Elbow method to choose optimal \(K\) +5. K-Means clustering +6. Segment profiling & visualization +7. 
Business recommendations + +### 6.1 Generate Synthetic Customer Data + +We create five features that mimic a retail scenario: + +| Feature | Description | +|---------|-------------| +| `age` | Customer age (18–70) | +| `income` | Annual income in $k (15–150) | +| `spending_score` | In-store spending score (1–100) | +| `visits` | Monthly store visits (0–30) | +| `online_ratio` | Fraction of purchases made online (0–1) | + +```python +import pandas as pd +import os + +np.random.seed(42) +n_customers = 500 + +# Segment 1: Young, moderate income, high online, high spending +seg1 = { + "age": np.random.normal(25, 4, 130).clip(18, 40), + "income": np.random.normal(45, 12, 130).clip(15, 80), + "spending_score": np.random.normal(75, 10, 130).clip(1, 100), + "visits": np.random.normal(8, 3, 130).clip(0, 30), + "online_ratio": np.random.normal(0.75, 0.1, 130).clip(0, 1), +} + +# Segment 2: Middle-aged, high income, balanced channel, moderate spending +seg2 = { + "age": np.random.normal(42, 6, 150).clip(28, 60), + "income": np.random.normal(95, 18, 150).clip(50, 150), + "spending_score": np.random.normal(55, 12, 150).clip(1, 100), + "visits": np.random.normal(15, 5, 150).clip(0, 30), + "online_ratio": np.random.normal(0.45, 0.15, 150).clip(0, 1), +} + +# Segment 3: Older, lower income, low online, low spending +seg3 = { + "age": np.random.normal(58, 7, 120).clip(40, 70), + "income": np.random.normal(35, 10, 120).clip(15, 70), + "spending_score": np.random.normal(25, 10, 120).clip(1, 100), + "visits": np.random.normal(20, 5, 120).clip(0, 30), + "online_ratio": np.random.normal(0.15, 0.08, 120).clip(0, 1), +} + +# Segment 4: Mixed ages, very high income, high spending, moderate visits +seg4 = { + "age": np.random.normal(38, 10, 100).clip(18, 70), + "income": np.random.normal(120, 15, 100).clip(80, 150), + "spending_score": np.random.normal(85, 8, 100).clip(1, 100), + "visits": np.random.normal(12, 4, 100).clip(0, 30), + "online_ratio": np.random.normal(0.55, 0.15, 100).clip(0, 
1), +} + +frames = [] +for seg in [seg1, seg2, seg3, seg4]: + frames.append(pd.DataFrame(seg)) + +df_customers = pd.concat(frames, ignore_index=True) +df_customers = df_customers.sample(frac=1, random_state=42).reset_index(drop=True) + +df_customers["age"] = df_customers["age"].round(0).astype(int) +df_customers["income"] = df_customers["income"].round(1) +df_customers["spending_score"] = df_customers["spending_score"].round(0).astype(int) +df_customers["visits"] = df_customers["visits"].round(0).astype(int) +df_customers["online_ratio"] = df_customers["online_ratio"].round(2) + +# Save to CSV (run from chapter folder: chapters/chapter-08-unsupervised-learning/) +dataset_dir = "datasets" +os.makedirs(dataset_dir, exist_ok=True) +csv_path = os.path.join(dataset_dir, "customers.csv") +df_customers.to_csv(csv_path, index=False) +print(f"Saved {len(df_customers)} rows to {csv_path}") +df_customers.head(10) +``` + +### 6.2 Feature Scaling + +```python +from sklearn.preprocessing import StandardScaler + +feature_cols = ["age", "income", "spending_score", "visits", "online_ratio"] +X_cust = df_customers[feature_cols].values + +scaler = StandardScaler() +X_scaled = scaler.fit_transform(X_cust) + +print("Scaled means (β‰ˆ0):", np.round(X_scaled.mean(axis=0), 4)) +print("Scaled stds (β‰ˆ1):", np.round(X_scaled.std(axis=0), 4)) +``` + +### 6.3 PCA for Dimensionality Reduction + +```python +pca_cust = PCA(n_components=5) +X_pca_cust = pca_cust.fit_transform(X_scaled) + +cum_var = np.cumsum(pca_cust.explained_variance_ratio_) + +plt.figure(figsize=(7, 4)) +plt.bar(range(1, 6), pca_cust.explained_variance_ratio_, + color="steelblue", edgecolor="black", alpha=0.7, label="Individual") +plt.step(range(1, 6), cum_var, where="mid", color="darkorange", + linewidth=2, label="Cumulative") +plt.axhline(0.90, color="red", linestyle="--", alpha=0.7, label="90% threshold") +plt.xlabel("Principal Component") +plt.ylabel("Variance Explained") +plt.title("Customer Data β€” PCA Variance 
Explained") +plt.xticks(range(1, 6)) +plt.legend() +plt.tight_layout() +plt.show() + +n_keep = np.argmax(cum_var >= 0.90) + 1 +print(f"\nComponents needed for β‰₯90% variance: {n_keep}") +print(f"Using first 2 components for visualization ({cum_var[1]:.1%} variance).") +``` + +### 6.4 K-Means β€” Elbow Method + +```python +from sklearn.cluster import KMeans + +K_range = range(2, 11) +inertias = [] + +for k in K_range: + km = KMeans(n_clusters=k, n_init=10, random_state=42) + km.fit(X_scaled) + inertias.append(km.inertia_) + +plt.figure(figsize=(8, 4)) +plt.plot(list(K_range), inertias, "o-", linewidth=2, color="steelblue") +plt.xlabel("Number of Clusters (K)") +plt.ylabel("Inertia (within-cluster sum of squares)") +plt.title("Elbow Method for Optimal K") +plt.xticks(list(K_range)) +plt.grid(alpha=0.3) +plt.tight_layout() +plt.show() + +print("Look for the 'elbow' β€” the point where adding more clusters yields") +print("diminishing returns. Here K=4 appears to be a good choice.") +``` + +### 6.5 Fit K-Means with Optimal K + +```python +optimal_k = 4 +km_final = KMeans(n_clusters=optimal_k, n_init=20, random_state=42) +cluster_labels = km_final.fit_predict(X_scaled) + +df_customers["cluster"] = cluster_labels +print(f"Cluster distribution:\n{df_customers['cluster'].value_counts().sort_index()}") +``` + +### 6.6 Segment Profiling + +```python +segment_profile = df_customers.groupby("cluster")[feature_cols].mean().round(2) +segment_profile["count"] = df_customers.groupby("cluster").size() +print("=== Segment Profiles ===") +segment_profile +``` + +```python +# Radar / parallel-coordinates style comparison +fig, axes = plt.subplots(1, len(feature_cols), figsize=(18, 4), sharey=True) +cluster_colors = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728"] + +for idx, feat in enumerate(feature_cols): + means = df_customers.groupby("cluster")[feat].mean() + axes[idx].bar(means.index, means.values, + color=cluster_colors[:optimal_k], edgecolor="black") + axes[idx].set_title(feat, 
fontsize=11) + axes[idx].set_xlabel("Cluster") + axes[idx].set_xticks(range(optimal_k)) + +axes[0].set_ylabel("Mean Value") +plt.suptitle("Feature Means by Cluster", fontsize=14, y=1.02) +plt.tight_layout() +plt.show() +``` + +### 6.7 Visualize Segments in 2-D (PCA Projection) + +```python +X_vis = X_pca_cust[:, :2] +centroids_scaled = km_final.cluster_centers_ +centroids_2d = pca_cust.transform(centroids_scaled)[:, :2] # project centroids + +plt.figure(figsize=(9, 7)) +for c in range(optimal_k): + mask = cluster_labels == c + plt.scatter(X_vis[mask, 0], X_vis[mask, 1], s=40, alpha=0.6, + color=cluster_colors[c], edgecolors="k", linewidth=0.3, + label=f"Segment {c}") + +plt.scatter(centroids_2d[:, 0], centroids_2d[:, 1], s=250, c="black", + marker="*", zorder=5, label="Centroids") + +plt.xlabel(f"PC 1 ({pca_cust.explained_variance_ratio_[0]:.1%} var)") +plt.ylabel(f"PC 2 ({pca_cust.explained_variance_ratio_[1]:.1%} var)") +plt.title("Customer Segments β€” PCA 2-D Projection") +plt.legend() +plt.grid(alpha=0.3) +plt.tight_layout() +plt.show() +``` + +### 6.8 Business Recommendations + +```python +recommendations = { + 0: { + "label": "Budget Traditionalists", + "description": "Older customers with low income and spending, who shop mostly in-store.", + "actions": [ + "Offer loyalty discounts and in-store promotions", + "Simplify the in-store experience", + "Provide personalized coupons at checkout", + ], + }, + 1: { + "label": "Young Digital Shoppers", + "description": "Young customers with moderate income but high online engagement and spending.", + "actions": [ + "Invest in mobile app features and social media marketing", + "Offer free shipping and digital-only deals", + "Launch a referral program to leverage their network", + ], + }, + 2: { + "label": "Premium High-Spenders", + "description": "High income, high spending score β€” the most valuable segment.", + "actions": [ + "Create a VIP/premium loyalty tier", + "Offer early access to new products", + "Assign 
dedicated account managers for retention", + ], + }, + 3: { + "label": "Established Moderates", + "description": "Middle-aged, higher income, moderate spending, balanced channel use.", + "actions": [ + "Cross-sell higher-margin products", + "Provide omni-channel convenience (buy online, pick up in store)", + "Target with email campaigns for seasonal offers", + ], + }, +} + +for seg_id, info in recommendations.items(): + count = (cluster_labels == seg_id).sum() + print(f"\n{'='*60}") + print(f"Segment {seg_id}: {info['label']} (n={count})") + print(f"{'='*60}") + print(f" {info['description']}") + print(" Recommended actions:") + for action in info["actions"]: + print(f" β€’ {action}") +``` + +--- + +## 7. Summary + +### What We Covered in This Notebook + +| Topic | Key Idea | +|-------|----------| +| **PCA** | Linear projection onto directions of maximum variance | +| **t-SNE** | Non-linear embedding that preserves local neighborhoods β€” for visualization only | +| **Z-Score Anomaly Detection** | Simple threshold on standardized values | +| **Isolation Forest** | Tree-based anomaly detector β€” fast, distribution-free | +| **Customer Segmentation** | End-to-end pipeline: scale β†’ PCA β†’ K-Means β†’ profile β†’ recommend | + +### Chapter 8 Recap + +Across the three notebooks you have: + +1. **Notebook 01 (Introduction):** Learned K-Means, hierarchical clustering, and evaluation metrics. +2. **Notebook 02 (Intermediate):** Explored DBSCAN, Gaussian Mixture Models, and silhouette analysis. +3. **Notebook 03 (Advanced β€” this one):** Mastered PCA, t-SNE, anomaly detection, and built a full capstone project. + +### What's Next + +In **Chapter 9: Deep Learning** we'll move from classical ML to neural networks β€” starting with perceptrons, backpropagation, and building your first deep network with PyTorch/Keras. 
+ +--- +*Generated by Berta AI | Created by Luigi Pascal Rondanini* diff --git a/docs/chapters/index.md b/docs/chapters/index.md index cd83b10..ba4b1ea 100644 --- a/docs/chapters/index.md +++ b/docs/chapters/index.md @@ -48,6 +48,10 @@ Apply your knowledge to real-world ML and AI problems. *10h Β· 3 notebooks, 5 exercises, 3 SVGs* Regression, regularization; classification, SVM, ROC; ensembles, tuning, credit-risk +- **Ch 8: [Unsupervised Learning](chapter-08.md)** + *8h Β· 3 notebooks, 5 exercises, 3 SVGs* + K-Means, hierarchical, DBSCAN; PCA, t-SNE; anomaly detection, customer segmentation + --- @@ -63,15 +67,16 @@ Apply your knowledge to real-world ML and AI problems. | [5: Software Design](chapter-05.md) | Foundation | 6h | 3 | 5 | 3 | | [6: Intro to ML](chapter-06.md) | Practitioner | 8h | 3 | 5 | 3 | | [7: Supervised Learning](chapter-07.md) | Practitioner | 10h | 3 | 5 | 3 | +| [8: Unsupervised Learning](chapter-08.md) | Practitioner | 8h | 3 | 5 | 3 | --- ## Coming Soon -!!! info "Chapters 8–25" +!!! info "Chapters 9–25" Additional chapters are planned for the Practitioner and Advanced tracks: - - **Practitioner** (8–15): Unsupervised Learning, Deep Learning, NLP, LLMs, Prompt Engineering, RAG, Fine-tuning, MLOps + - **Practitioner** (9–15): Deep Learning, NLP, LLMs, Prompt Engineering, RAG, Fine-tuning, MLOps - **Advanced** (16–25): Multi-Agent Systems, Advanced RAG, Reinforcement Learning, Model Optimization, Production AI, Finance, Safety, AI Products, Research, Governance & Ethics [Request a custom chapter](../guides/chapter-requests.md) on any AI topic while you wait! diff --git a/docs/guides/curriculum.md b/docs/guides/curriculum.md index 2e997d7..6401d23 100644 --- a/docs/guides/curriculum.md +++ b/docs/guides/curriculum.md @@ -30,7 +30,7 @@ Apply knowledge to real-world ML and AI problems. 
|---|---------|-------|--------|------| | 6 | Introduction to Machine Learning | 8h | Available | [chapter-06.md](../chapters/chapter-06.md) | | 7 | Supervised Learning | 10h | Available | [chapter-07.md](../chapters/chapter-07.md) | -| 8 | Unsupervised Learning | 8h | Coming soon | β€” | +| 8 | Unsupervised Learning | 8h | Available | [chapter-08.md](../chapters/chapter-08.md) | | 9 | Deep Learning Fundamentals | 12h | Coming soon | β€” | | 10 | Natural Language Processing Basics | 10h | Coming soon | β€” | | 11 | Large Language Models & Transformers | 10h | Coming soon | β€” | @@ -39,7 +39,7 @@ Apply knowledge to real-world ML and AI problems. | 14 | Fine-tuning & Adaptation | 8h | Coming soon | β€” | | 15 | MLOps & Deployment | 8h | Coming soon | β€” | -**Total: 88 hours (18h available)** +**Total: 88 hours (26h available)** --- @@ -69,9 +69,9 @@ Master complex topics and specialized domains. | Track | Chapters | Total Hours | Available | |-------|----------|-------------|-----------| | Foundation | 1–5 | 38h | 5/5 | -| Practitioner | 6–15 | 88h | 2/10 | +| Practitioner | 6–15 | 88h | 3/10 | | Advanced | 16–25 | 84h | 0/10 | -| **Total** | **25** | **210h+** | **7** | +| **Total** | **25** | **210h+** | **8** | --- diff --git a/docs/guides/roadmap.md b/docs/guides/roadmap.md index d9161ef..528ba8f 100644 --- a/docs/guides/roadmap.md +++ b/docs/guides/roadmap.md @@ -10,10 +10,10 @@ Our vision for the future of AI education. Priorities evolve based on community |-----------|--------| | Master Repository | Live | | Foundation Track | Complete (5 chapters) | -| Practitioner Track | In progress (2 of 10 chapters) | +| Practitioner Track | In progress (3 of 10 chapters) | | Advanced Track | Planned (10 chapters) | | Community Requests | Starting | -| **Available Now** | 7 chapters, 56 hours, 21 SVGs | +| **Available Now** | 8 chapters, 64 hours, 24 SVGs | --- @@ -28,13 +28,13 @@ Our vision for the future of AI education. 
Priorities evolve based on community ## Phase 1: Foundation & Launch β€” Complete !!! success "Complete" - Foundation Track complete. Chapters 6-7 available. Core infrastructure done. + Foundation Track complete. Chapters 6-8 available. Core infrastructure done. ### Objectives - [x] Establish master repository - [x] Complete Foundation Track (Chapters 1-5) -- [x] Begin Practitioner Track (Ch 6-7) +- [x] Begin Practitioner Track (Ch 6-8) - [ ] Establish community request process - [ ] Build first 100 community chapters @@ -63,7 +63,7 @@ Our vision for the future of AI education. Priorities evolve based on community |---|---------|--------| | 6 | Introduction to ML | Done | | 7 | Supervised Learning | Done | -| 8 | Unsupervised Learning | Next | +| 8 | Unsupervised Learning | Done | | 9 | Deep Learning Fundamentals | Planned | | 10 | NLP Basics | Planned | | 11 | LLMs & Transformers | Planned | diff --git a/docs/guides/syllabus.md b/docs/guides/syllabus.md index ed94091..c8f174e 100644 --- a/docs/guides/syllabus.md +++ b/docs/guides/syllabus.md @@ -16,7 +16,7 @@ graph TD CH6["Ch 6: Intro to ML
8h | Available"]
        CH7["Ch 7: Supervised Learning<br/>10h | Available"]
-       CH8["Ch 8: Unsupervised Learning<br/>8h | Coming Soon"]
+       CH8["Ch 8: Unsupervised Learning<br/>8h | Available"]
        CH9["Ch 9: Deep Learning<br/>12h | Coming Soon"]
        CH10["Ch 10: NLP Basics<br/>10h | Coming Soon"]
        CH11["Ch 11: LLMs & Transformers
10h | Coming Soon"] @@ -56,7 +56,7 @@ graph TD style CH5 fill:#4caf50,color:#fff style CH6 fill:#4caf50,color:#fff style CH7 fill:#4caf50,color:#fff - style CH8 fill:#f3e5f5 + style CH8 fill:#4caf50,color:#fff style CH9 fill:#f3e5f5 style CH10 fill:#f3e5f5 style CH11 fill:#f3e5f5 @@ -89,7 +89,7 @@ graph TD | 5 | Software Design | Foundation | 6h | Available | | 6 | Introduction to ML | Practitioner | 8h | Available | | 7 | Supervised Learning | Practitioner | 10h | Available | -| 8 | Unsupervised Learning | Practitioner | 8h | Coming soon | +| 8 | Unsupervised Learning | Practitioner | 8h | Available | | 9 | Deep Learning Fundamentals | Practitioner | 12h | Coming soon | | 10 | NLP Basics | Practitioner | 10h | Coming soon | | 11 | LLMs & Transformers | Practitioner | 10h | Coming soon | diff --git a/docs/index.md b/docs/index.md index e7c82c5..a3fb797 100644 --- a/docs/index.md +++ b/docs/index.md @@ -28,27 +28,27 @@ Free. Open-source. Community-driven. Generated by [Berta AI](https://berta.one).
-    7
+    8
     Chapters
-    21
+    24
     Notebooks
-    21
+    24
     Diagrams
-    56h
+    64h
     Content
-    37
+    42
     Exercises
@@ -79,7 +79,8 @@ Free. Open-source. Community-driven. Generated by [Berta AI](https://berta.one). |---|---------|-------|----------| | 6 | [Introduction to Machine Learning](chapters/chapter-06.md) | 8h | 3 notebooks, 5 exercises, 3 diagrams | | 7 | [Supervised Learning](chapters/chapter-07.md) | 10h | 3 notebooks, 5 exercises, 3 diagrams | -| 8–25 | Coming soon | | [View roadmap](guides/roadmap.md) | +| 8 | [Unsupervised Learning](chapters/chapter-08.md) | 8h | 3 notebooks, 5 exercises, 3 diagrams | +| 9–25 | Coming soon | | [View roadmap](guides/roadmap.md) | --- diff --git a/mkdocs.yml b/mkdocs.yml index 97b5818..591ae44 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -133,6 +133,10 @@ nav: - "7.1 Introduction": chapters/content/ch07-01_introduction.md - "7.2 Intermediate": chapters/content/ch07-02_intermediate.md - "7.3 Advanced": chapters/content/ch07-03_advanced.md + - "Ch 8: Unsupervised Learning": chapters/chapter-08.md + - "8.1 Introduction": chapters/content/ch08-01_introduction.md + - "8.2 Intermediate": chapters/content/ch08-02_intermediate.md + - "8.3 Advanced": chapters/content/ch08-03_advanced.md - Playground: playground.md - Community: - Contributing: guides/contributing.md diff --git a/netlify.toml b/netlify.toml index 75b89e4..7319110 100644 --- a/netlify.toml +++ b/netlify.toml @@ -1,10 +1,23 @@ [build] command = "pip install mkdocs-material mkdocs-minify-plugin && mkdocs build" publish = "site" + functions = "netlify/functions" [build.environment] PYTHON_VERSION = "3.11" +# Newsletter: auto-sends on deploy when chapter_notification.json changes. +# One-time setup in Netlify Dashboard > Site settings > Environment variables: +# RESEND_API_KEY = your Resend API key +# CONFIRM_FROM_EMAIL = verified sender email +# SITE_URL = https://chapters.berta.one +# NETLIFY_API_TOKEN = personal access token (app.netlify.com/user/applications) +# NETLIFY_SITE_ID = site ID (Site settings > General) +# +# To notify subscribers about a new chapter: +# 1. 
Edit netlify/functions/chapter_notification.json +# 2. Commit and deploy β€” done. No dashboard changes needed. + [[redirects]] from = "/chapters" to = "/chapters/" diff --git a/netlify/functions/chapter_notification.json b/netlify/functions/chapter_notification.json new file mode 100644 index 0000000..b422030 --- /dev/null +++ b/netlify/functions/chapter_notification.json @@ -0,0 +1,5 @@ +{ + "chapter_number": 8, + "chapter_title": "Unsupervised Learning", + "chapter_description": "K-Means clustering, hierarchical clustering, DBSCAN, PCA, t-SNE, anomaly detection, and a customer segmentation capstone project." +} diff --git a/netlify/functions/deploy-succeeded.js b/netlify/functions/deploy-succeeded.js new file mode 100644 index 0000000..afb4f91 --- /dev/null +++ b/netlify/functions/deploy-succeeded.js @@ -0,0 +1,225 @@ +/** + * Netlify Function: deploy-succeeded (Background Event) + * + * Automatically triggered after every successful Netlify deploy. + * Reads chapter_notification.json to determine what to notify about, + * then checks LAST_NOTIFIED_CHAPTER to avoid re-sending. + * + * How to send a newsletter for a new chapter: + * 1. Update chapter_notification.json with the new chapter details. + * 2. Commit, push, and deploy. That's it. + * + * The function compares the chapter number in the JSON file against + * LAST_NOTIFIED_CHAPTER (stored as a Netlify env var). If they differ, + * it fetches all subscribers from Netlify Forms and emails them via + * Resend, then updates LAST_NOTIFIED_CHAPTER so subsequent deploys + * won't re-send. + * + * One-time setup (Netlify Dashboard > Site settings > Environment variables): + * RESEND_API_KEY - Resend API key + * CONFIRM_FROM_EMAIL - Verified sender email in Resend + * SITE_URL - https://chapters.berta.one + * NETLIFY_API_TOKEN - Personal access token (app.netlify.com/user/applications) + * NETLIFY_SITE_ID - Site ID (Site settings > General) + * + * No per-release changes needed in the dashboard. 
+ */ + +var config = require("./chapter_notification.json"); + +exports.handler = async function () { + var chapterNumber = String(config.chapter_number); + var chapterTitle = config.chapter_title; + var chapterDescription = config.chapter_description; + + var lastNotified = process.env.LAST_NOTIFIED_CHAPTER || ""; + if (lastNotified === chapterNumber) { + console.log("Chapter " + chapterNumber + " already notified β€” skipping"); + return { statusCode: 200, body: "Already notified for chapter " + chapterNumber }; + } + + var apiKey = process.env.RESEND_API_KEY; + if (!apiKey) { + console.log("RESEND_API_KEY not set β€” cannot send emails"); + return { statusCode: 200, body: "No email service configured" }; + } + + var netlifyToken = process.env.NETLIFY_API_TOKEN; + var siteId = process.env.NETLIFY_SITE_ID; + if (!netlifyToken || !siteId) { + console.log("NETLIFY_API_TOKEN or NETLIFY_SITE_ID not set"); + return { statusCode: 200, body: "Netlify API not configured" }; + } + + var fromEmail = process.env.CONFIRM_FROM_EMAIL || "onboarding@resend.dev"; + var siteUrl = process.env.SITE_URL || "https://chapters.berta.one"; + + // ── Fetch subscribers from Netlify Forms ── + var subscribers = []; + try { + var formsRes = await fetch( + "https://api.netlify.com/api/v1/sites/" + siteId + "/forms", + { headers: { "Authorization": "Bearer " + netlifyToken } } + ); + if (!formsRes.ok) { + console.log("Failed to fetch forms: HTTP " + formsRes.status); + return { statusCode: 200, body: "Failed to fetch forms" }; + } + var forms = await formsRes.json(); + var newsletterForm = forms.find(function (f) { return f.name === "newsletter"; }); + + if (!newsletterForm) { + console.log("Newsletter form not found"); + return { statusCode: 200, body: "Newsletter form not found" }; + } + + var page = 1; + var perPage = 100; + while (true) { + var subsRes = await fetch( + "https://api.netlify.com/api/v1/forms/" + newsletterForm.id + + "/submissions?per_page=" + perPage + "&page=" + page, + { 
headers: { "Authorization": "Bearer " + netlifyToken } } + ); + if (!subsRes.ok) break; + var subs = await subsRes.json(); + if (!subs || subs.length === 0) break; + for (var i = 0; i < subs.length; i++) { + var email = subs[i].data && subs[i].data.email; + if (email && subscribers.indexOf(email) === -1) { + subscribers.push(email); + } + } + if (subs.length < perPage) break; + page++; + } + } catch (err) { + console.log("Error fetching subscribers: " + err.message); + return { statusCode: 500, body: "Failed to fetch subscribers" }; + } + + if (subscribers.length === 0) { + console.log("No subscribers found"); + return { statusCode: 200, body: "No subscribers" }; + } + + console.log("Sending Chapter " + chapterNumber + ": " + chapterTitle + + " notification to " + subscribers.length + " subscriber(s)"); + + // ── Build email ── + var chapterUrl = siteUrl + "/chapters/chapter-" + + (parseInt(chapterNumber) < 10 ? "0" : "") + chapterNumber + "/"; + + var htmlBody = [ + "
<div>",
+    "  <h1>New Chapter Published!</h1>",
+    "  <h2>Chapter " + chapterNumber + ": " + chapterTitle + "</h2>",
+    "  <p>" + chapterDescription + "</p>",
+    "  <p>",
+    "    <a href=\"" + chapterUrl + "\">Read Chapter " + chapterNumber + " Now</a>",
+    "  </p>",
+    "  <p><strong>What's included:</strong></p>",
+    "  <p>",
+    "    <a href=\"" + siteUrl + "/chapters/\">Browse all chapters</a> | ",
+    "    <a href=\"" + siteUrl + "/playground/\">Try the Playground</a>",
+    "  </p>",
+    "  <hr/>",
+    "  <p>",
+    "    You're receiving this because you subscribed at " + siteUrl + "<br/>",
+    "    To unsubscribe, reply to this email with 'unsubscribe'.",
+    "  </p>",
+    "  <p>",
+    "    Created by Luigi Pascal Rondanini | ",
+    "    Powered by <a href=\"https://berta.one\">Berta AI</a>",
+    "  </p>",
+    "</div>
", + ].join("\n"); + + // ── Send emails ── + var sent = 0; + var failed = 0; + + for (var j = 0; j < subscribers.length; j++) { + try { + var response = await fetch("https://api.resend.com/emails", { + method: "POST", + headers: { + "Authorization": "Bearer " + apiKey, + "Content-Type": "application/json", + }, + body: JSON.stringify({ + from: "Berta Chapters <" + fromEmail + ">", + to: [subscribers[j]], + subject: "New Chapter: " + chapterTitle + " (Chapter " + chapterNumber + ")", + html: htmlBody, + }), + }); + + if (response.ok) { + sent++; + console.log("Sent to " + subscribers[j]); + } else { + failed++; + var errorText = await response.text(); + console.log("Failed: " + subscribers[j] + " β€” " + errorText); + } + } catch (err) { + failed++; + console.log("Error: " + subscribers[j] + " β€” " + err.message); + } + } + + // ── Update LAST_NOTIFIED_CHAPTER via Netlify API ── + if (sent > 0) { + try { + // Fetch existing env vars for this account (scoped to site) + var envRes = await fetch( + "https://api.netlify.com/api/v1/accounts/me/env?site_id=" + siteId, + { headers: { "Authorization": "Bearer " + netlifyToken } } + ); + var envVars = await envRes.json(); + var existing = Array.isArray(envVars) + ? 
envVars.find(function (v) { return v.key === "LAST_NOTIFIED_CHAPTER"; }) + : null; + + if (existing) { + // Delete then recreate (Netlify API pattern for updating) + await fetch( + "https://api.netlify.com/api/v1/accounts/me/env/LAST_NOTIFIED_CHAPTER?site_id=" + siteId, + { method: "DELETE", headers: { "Authorization": "Bearer " + netlifyToken } } + ); + } + await fetch( + "https://api.netlify.com/api/v1/accounts/me/env?site_id=" + siteId, + { + method: "POST", + headers: { + "Authorization": "Bearer " + netlifyToken, + "Content-Type": "application/json", + }, + body: JSON.stringify([{ + key: "LAST_NOTIFIED_CHAPTER", + scopes: ["functions"], + values: [{ value: chapterNumber, context: "all" }], + }]), + } + ); + console.log("Updated LAST_NOTIFIED_CHAPTER to " + chapterNumber); + } catch (err) { + console.log("Could not update LAST_NOTIFIED_CHAPTER: " + err.message); + } + } + + var summary = "Chapter " + chapterNumber + " β€” Sent: " + sent + + ", Failed: " + failed + ", Total: " + subscribers.length; + console.log(summary); + return { statusCode: 200, body: summary }; +}; diff --git a/netlify/functions/submission-created.js b/netlify/functions/submission-created.js index 9a272f4..8aefae5 100644 --- a/netlify/functions/submission-created.js +++ b/netlify/functions/submission-created.js @@ -41,9 +41,12 @@ exports.handler = async function (event) { "

<p>Thank you for subscribing to updates.</p>",
     "<p>You'll receive an email when new chapters are published. At most one email per week.</p>",
+    "<p><strong>Latest chapter:</strong></p>",
+    "<p><strong>Chapter 8: Unsupervised Learning</strong> β€” ",
+    "  K-Means, hierarchical clustering, DBSCAN, PCA, t-SNE, anomaly detection, and a customer segmentation capstone.</p>",
     "<p><strong>Start learning now:</strong>
", " ", diff --git a/wiki/Home.md b/wiki/Home.md index 83c9da6..3a61a8c 100644 --- a/wiki/Home.md +++ b/wiki/Home.md @@ -47,7 +47,7 @@ A full journey from basics to advanced AI: - **Practitioner (Ch 6–15)** β€” ML intro, supervised learning, deep learning, NLP, RAG, MLOps - **Advanced (Ch 16–25+)** β€” Multi-agent systems, reinforcement learning, production AI, ethics -**7 chapters available now** (56 hours of content). More unlock as the community grows. +**8 chapters available now** (64 hours of content). More unlock as the community grows. ### 2. Community-Requested Chapters β€” Learn What You Need @@ -121,7 +121,7 @@ Same format everywhere. Learn once, then move fast. | **Advanced** | 16–25+ | πŸ“‹ Planned | | **Community** | Your requests | πŸš€ Unlimited | -**21 notebooks**, **37 exercises with solutions**, **5 datasets**, **21 diagrams**. All open and ready to run. +**24 notebooks**, **42 exercises with solutions**, **7 datasets**, **24 diagrams**. All open and ready to run. ---