A machine learning project focused on unsupervised learning, pattern discovery, and population-level health analysis using the NHANES (National Health and Nutrition Examination Survey) dataset.
This project explores hidden health-related patterns in large-scale population data using unsupervised machine learning techniques.
Instead of predicting a predefined target, the goal is to:
- Discover natural population subgroups
- Identify latent health profiles
- Analyze relationships between clinical, demographic, and lifestyle variables
- Compare clustering behaviors across multiple algorithms
The project emphasizes interpretability, data-driven insight, and real-world complexity rather than model accuracy alone.
- Name: NHANES (National Health and Nutrition Examination Survey)
- Source: Data set
- Type: Real-world health survey data
- Scale: Large, multi-table, heterogeneous dataset
- Demographic features (age, gender, ethnicity, etc.)
- Clinical measurements
- Laboratory results
- Lifestyle and health indicators
- Missing values and mixed distributions (realistic noise)
This dataset reflects real population health conditions and is significantly more complex than typical educational datasets.
- Perform structured preprocessing on complex health data
- Reduce dimensionality for better visualization and analysis
- Apply and compare multiple clustering algorithms
- Interpret discovered clusters from a health perspective
- Build a reusable framework for unsupervised health analysis
- Selected relevant features from the NHANES dataset
- Handled missing values
- Removed non-informative or redundant attributes
- Prepared a clean analytical dataset
The focus was on preserving real-world data structure rather than over-sanitizing the dataset.
Key preprocessing steps include:
- Merging multiple NHANES tables using SEQN
- Handling missing values using:
  - Domain-driven filling (categorical logic)
  - Mean imputation for numerical features
- Removing survey-specific noise codes (e.g. 777, 99)
- Feature engineering:
  - Sugar-to-carb ratio
  - Fiber density
  - Saturated fat ratio
  - Physical activity indicator
  - Smoking status
  - Alcohol consumption normalization
- Target construction:
  - Diabetes label derived from multiple questionnaire variables
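The steps above can be sketched on toy data. This is a minimal illustration, not the project's actual pipeline: every column name other than SEQN (`age`, `sugar_g`, `carb_g`, `fiber_g`, `kcal`) is hypothetical, the two small frames stand in for real NHANES tables, and the fiber density definition (grams per 1000 kcal) is an assumption.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for two NHANES tables. Only SEQN is a real NHANES key;
# all other column names are hypothetical.
demo = pd.DataFrame({"SEQN": [1, 2, 3, 4], "age": [34, 67, 777, 51]})
diet = pd.DataFrame({
    "SEQN": [1, 2, 3],
    "sugar_g": [40.0, 90.0, np.nan],
    "carb_g": [200.0, 300.0, 250.0],
    "fiber_g": [20.0, 15.0, 10.0],
    "kcal": [2000.0, 2500.0, 1800.0],
})


def preprocess(demo, diet, noise_codes=(777, 99)):
    # Merge on the respondent identifier; an inner join keeps only
    # respondents that appear in both tables.
    df = demo.merge(diet, on="SEQN", how="inner")

    # Turn survey sentinel codes into missing values. This is applied to one
    # column only, because a code such as 99 can be a valid value elsewhere.
    df["age"] = df["age"].mask(df["age"].isin(noise_codes))

    # Mean imputation for the numeric features.
    num_cols = df.columns.drop("SEQN")
    df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

    # Two of the engineered ratios (definitions assumed for illustration).
    df["sugar_to_carb"] = df["sugar_g"] / df["carb_g"]
    df["fiber_density"] = df["fiber_g"] / df["kcal"] * 1000  # g per 1000 kcal
    return df


clean = preprocess(demo, diet)
print(clean)
```

Which codes count as noise depends on the variable, so in practice the replacement would likely be driven by a per-column mapping taken from the NHANES codebooks rather than one global tuple.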
Features were scaled with two scikit-learn transformers to put them on comparable ranges:
- StandardScaler()
- MinMaxScaler()
This step ensures fair contribution of all features to distance-based models.
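As a small illustration of why scaling matters for distance-based models, the sketch below applies both scalers to a toy matrix whose two columns differ in magnitude by a factor of about 100. The data is made up for the example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales (values are illustrative only).
X = np.array([[20.0, 1000.0],
              [40.0, 3000.0],
              [60.0, 5000.0]])

# StandardScaler: each column is shifted to mean 0 and scaled to unit variance.
X_std = StandardScaler().fit_transform(X)

# MinMaxScaler: each column is rescaled linearly to the range [0, 1].
X_mm = MinMaxScaler().fit_transform(X)

print(X_std)
print(X_mm)
```

Without scaling, Euclidean distances on `X` would be dominated almost entirely by the second column.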
Several models were trained and evaluated:
- Logistic Regression
- Support Vector Machine (SVC)
- Decision Tree
- Random Forest
- Gradient Boosting
- K-Nearest Neighbors
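A comparison of the models listed above could look like the following sketch. It uses synthetic data from `make_classification` in place of the NHANES features and diabetes label, default hyperparameters, and 3-fold cross-validated accuracy; all of these are assumptions made for illustration and not the project's actual evaluation setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared feature matrix and binary label.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Scale-sensitive models are wrapped in a pipeline so that scaling is
# fitted inside each cross-validation fold.
models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "SVC": make_pipeline(StandardScaler(), SVC()),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "K-Nearest Neighbors": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

# Mean cross-validated accuracy per model.
scores = {name: cross_val_score(model, X, y, cv=3).mean() for name, model in models.items()}

for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.3f}")
```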
Applied Principal Component Analysis (PCA) to:
- Reduce high-dimensional feature space
- Enable visualization in 2D and 3D
- Capture dominant variance patterns
PCA was used for analysis and visualization, not as a predictive model.
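A minimal sketch of this use of PCA, assuming the features have already been scaled. Random data stands in for the NHANES feature matrix, and the choice of three components is only for the 2D and 3D views.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random stand-in for the scaled feature matrix (200 respondents, 8 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
X_scaled = StandardScaler().fit_transform(X)

# Project onto the first three principal components.
pca = PCA(n_components=3)
X_3d = pca.fit_transform(X_scaled)   # coordinates for a 3D scatter plot
X_2d = X_3d[:, :2]                   # first two components for a 2D plot

# Share of total variance captured by each component, in descending order.
explained = pca.explained_variance_ratio_
print(explained, explained.sum())
```

The explained variance ratios indicate how much structure a 2D or 3D plot can show; if their sum is small, apparent cluster overlap in the plot may not reflect overlap in the full feature space.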
Multiple clustering approaches were applied and compared to explore different structural perspectives of the data.
- KMeans:
  - Explored different numbers of clusters
  - Used the elbow method to evaluate inertia
  - Provided a baseline partitioning of the population
- DBSCAN:
  - Detected dense population groups
  - Identified outliers and rare health profiles
  - Demonstrated sensitivity to data density and scaling
- Cluster analysis:
  - Cluster assignments were analyzed statistically
  - Summary tables were generated to compare cluster-level characteristics
  - Differences between clusters were interpreted in a health-related context
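The two clustering approaches described above can be sketched as follows. Synthetic blobs plus one deliberately distant point stand in for the scaled NHANES features, and the `eps` and `min_samples` values are illustrative assumptions that would need tuning on real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Three synthetic groups plus one far-away point acting as an outlier.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)
X = np.vstack([X, [[20.0, 20.0]]])
X = StandardScaler().fit_transform(X)

# Elbow method: fit KMeans for several k and record the inertia
# (within-cluster sum of squared distances). The "elbow" is where the
# decrease in inertia starts to flatten.
inertias = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in range(1, 7)
}
for k, inertia in inertias.items():
    print(k, round(inertia, 2))

# DBSCAN: points that are not density-reachable from any core point
# receive the label -1 (noise), which is how outliers show up.
db = DBSCAN(eps=0.3, min_samples=5).fit(X)
n_noise = int((db.labels_ == -1).sum())
print("noise points:", n_noise)
```

KMeans assigns every point to some cluster, including the distant one, whereas DBSCAN can leave it unassigned. That difference is one reason the two methods give complementary views of the same data.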
Interpretation focuses on patterns, not medical diagnosis.
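A cluster-level summary table of the kind described above can be built with a pandas group-by. The sketch below uses a tiny made-up frame; the column names `cluster`, `age`, and `education` are hypothetical and the aggregates chosen are only examples.

```python
import pandas as pd

# Made-up respondents with an assigned cluster label (names are hypothetical).
df = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 1],
    "age": [70, 66, 35, 41, 38],
    "education": [1, 2, 4, 5, 4],
})

# One row per cluster: size and mean of each feature.
summary = df.groupby("cluster").agg(
    n=("age", "size"),
    mean_age=("age", "mean"),
    mean_education=("education", "mean"),
)
print(summary)

# Assumption: a table like this could be exported with
# summary.to_csv("cluster_summary.csv")
```

Reporting the cluster size next to the means helps avoid over-interpreting small clusters.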
- Distinct population subgroups emerged without supervision
- Different clustering algorithms revealed complementary perspectives
- Density-based methods highlighted rare and extreme cases
These results demonstrate that unsupervised learning can extract meaningful structure from complex health data.
- Real-world health data is noisy, incomplete, and non-trivial
- Unsupervised learning is powerful for exploratory analysis
- No single clustering method is universally optimal
- Interpretation is as important as algorithm selection
- PCA is a critical tool for understanding high-dimensional structure
- No causal inference or medical diagnosis
- Cluster interpretation depends on selected features
- NHANES dataset complexity limits full coverage in a single project
These limitations are intentional and reflect real data science constraints.
- numpy
- pandas
- matplotlib
- seaborn
- scikit-learn: PCA, KMeans, DBSCAN, StandardScaler, DecisionTreeClassifier, KNeighborsClassifier, GradientBoostingClassifier, RandomForestClassifier, MinMaxScaler, GridSearchCV, ...
pip install -r requirements.txt
- requirements.txt → File
or directly:
pip install numpy pandas seaborn matplotlib scikit-learn
Run the script to generate all visualizations.
Based on KMeans clustering with 10 groups, patients were segmented into distinct risk profiles:
| Cluster | Patient Group | Characteristics |
|---|---|---|
| 0 | Older Low-Education Males | Older age (65%), Predominantly male (63%), Low education level, Single/never married, US-born, Potential social isolation risk |
| 1 | Middle-Aged Immigrant Males | Middle age (59%), Male majority (56%), Medium education, Married, Foreign-born population, Cultural adaptation factors |
| 2 | Older Low-Education Females | Older age (62%), Predominantly female (43%), Low education, Single/unmarried, US-born, Higher vulnerability to chronic conditions |
| 3 | Elderly Males with Low SES | Elderly population (68%), Male majority (57%), Low education, Single/unmarried, US-born, High-risk demographic for diabetes complications |
| 4 | Young Educated Women | Youngest group (41%), Female majority (44%), Highest education level, Married, Foreign-born, Health-conscious demographic |
| 5 | Elderly Low-Education Population | Oldest group (72%), Gender-balanced, Very low education, Single/unmarried, US-born, Highest risk for health complications |
| 6 | Older Males with Moderate SES | Older age (64%), Male majority (65%), Medium education, Single, US-born, Moderate diabetes risk profile |
| 7 | Educated Married Immigrants | Middle age (48%), Gender-balanced, Highest education and marriage rate, Foreign-born, Likely higher health literacy |
| 8 | Middle-Aged Male Immigrants | Middle age (56%), Male majority (59%), Low-medium education, Single, Some foreign-born, Mixed risk factors |
| 9 | Elderly with Medium Education | Elderly (69%), Gender-balanced, Medium education, Single, US-born, Moderate-to-high risk for age-related conditions |
Clusters 3, 5, and 9 represent the highest-risk patient populations:
- Advanced age (68-72% normalized age score)
- Low educational attainment limiting health literacy
- Social isolation (predominantly single/unmarried)
- US-born suggesting longer exposure to Western dietary patterns
These groups require intensive diabetes screening, patient education programs, and social support interventions.
Clusters 4 and 7 show protective characteristics:
- Younger age or middle-aged
- Higher education levels (53-59% education score)
- Married status providing social support
- Foreign-born potentially maintaining healthier traditional diets
These groups demonstrate better health-seeking behaviors and preventive care engagement.
- Male-dominated clusters (0, 1, 3, 6, 8): Higher prevalence in older age groups
- Female-dominated clusters (2, 4): Split between elderly low-SES and young educated groups
- Gender-balanced clusters (5, 7, 9): Mixed demographics with varying education levels
Clear stratification emerged across education levels:
- Low education (clusters 0, 2, 3, 5, 8): Associated with older age and single status
- Medium education (clusters 1, 6, 9): Mixed risk profiles
- High education (clusters 4, 7): Younger age, married, better health outcomes
- Clusters 3, 5, 9: Prioritize aggressive diabetes screening, medication adherence programs, and geriatric care coordination
- Clusters 0, 2: Focus on social support services and simplified health education materials
- Clusters 4, 7: Leverage as peer educators and community health advocates
- Clusters 1, 8: Provide culturally adapted health materials in multiple languages
Healthcare systems should allocate:
- 70% of preventive resources to high-risk clusters (0, 2, 3, 5, 9)
- 20% to moderate-risk clusters (1, 6, 8)
- 10% to low-risk clusters (4, 7) for maintenance and prevention
- Implement age-stratified screening protocols (65+ priority)
- Develop education-level appropriate diabetes education materials
- Create culturally sensitive programs for immigrant populations
- Establish social support networks for single/isolated patients
- cluster_summary.csv → file
The clustering analysis successfully identified 10 distinct patient subgroups with varying diabetes risk profiles. The most vulnerable populations are characterized by:
- Advanced age (>65 years)
- Low educational attainment
- Social isolation (single/unmarried status)
- US nativity with likely Western dietary exposure
This segmentation enables precision public health interventions, allowing healthcare providers to:
- Allocate resources efficiently to high-risk groups
- Design tailored prevention programs based on demographic characteristics
- Improve health equity by addressing socioeconomic disparities
- Optimize screening schedules and follow-up protocols
The identification of lower-risk clusters (4 and 7) points to protective factors that can inform population-level prevention strategies. Future research should investigate behavioral and lifestyle differences across these clusters to develop evidence-based interventions for diabetes prevention and management.
Author: Ali
Field: Data Science & Machine Learning Student
Email: ali.hz87980@gmail.com
GitHub: ali-119