Mental Health & Wellbeing Prediction

📹 Video Presentation

[YOUR VIDEO LINK HERE - Add after recording]


📋 Project Overview

This project predicts mental wellbeing scores based on lifestyle and environmental factors. We built both regression models (to predict exact scores) and classification models (to categorize wellbeing levels as Low vs Medium/High).

Dataset: Synthetic Mental Health, Lifestyle & Wellbeing (Kaggle)
Size: 400,000 individuals, 15 features
Target: mental_wellbeing_score (0-100)
Train/Test: 320,000 / 80,000 (80/20 split)

Main Question

Which lifestyle and environmental factors are most strongly associated with mental wellbeing, and how accurately can we predict wellbeing scores from these features?

Goals

  1. Explore relationships between lifestyle factors and mental wellbeing
  2. Build a baseline regression model and improve it through feature engineering
  3. Apply K-Means clustering to discover lifestyle segments
  4. Convert to binary classification and identify at-risk individuals

📊 Part 1-2: Exploratory Data Analysis

Dataset Features

| Feature Type | Features |
|---|---|
| Target | mental_wellbeing_score (0-100) |
| Lifestyle | sleep_hours, screen_time, physical_activity, diet_quality, sleep_quality |
| Stress | work_stress, financial_stress |
| Social | social_interactions |
| Environment | air_quality_index, noise_level |
| Demographics | age, gender, city_type |

Target Distribution

[Figure: Target Distribution]

The mental wellbeing score ranges from 0 to 100, with scores concentrated in the 80-100 range.


Research Question 1: Screen Time vs Wellbeing

[Figure: Screen Time vs Wellbeing]

Finding: Higher screen time is associated with slightly lower mental wellbeing. The relationship is negative but relatively weak.


Research Question 2: Physical Activity vs Wellbeing

[Figure: Physical Activity vs Wellbeing]

Finding: Higher physical activity levels are associated with better mental wellbeing. This is one of the positive lifestyle factors.


Research Question 3: Work Stress vs Wellbeing

[Figure: Work Stress vs Wellbeing]

Finding: Work stress has a strong negative relationship with mental wellbeing - one of the most impactful factors.


Research Question 4: Sleep Quality vs Wellbeing

[Figure: Sleep Quality vs Wellbeing]

Finding: Better sleep quality strongly correlates with higher mental wellbeing scores. Sleep quality is one of the top positive predictors.


Research Question 5: Diet Quality vs Wellbeing

[Figure: Diet Quality vs Wellbeing]

Finding: Higher diet quality is associated with better mental wellbeing outcomes.


Correlation Analysis

[Figure: Correlation Heatmap]

Key Correlations with Mental Wellbeing:

| Factor | Correlation | Direction |
|---|---|---|
| Sleep Quality | Strong Positive | ↑ Better sleep = Higher wellbeing |
| Diet Quality | Moderate Positive | ↑ Better diet = Higher wellbeing |
| Physical Activity | Moderate Positive | ↑ More activity = Higher wellbeing |
| Work Stress | Strong Negative | ↑ More stress = Lower wellbeing |
| Financial Stress | Moderate Negative | ↑ More stress = Lower wellbeing |
| Screen Time | Weak Negative | ↑ More screen time = Lower wellbeing |

Feature Correlation with Target

[Figure: Feature Correlation]

This visualization shows how each feature correlates with mental wellbeing score. Green bars indicate positive relationships (beneficial factors), while red bars indicate negative relationships (risk factors).
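The ranking behind a chart like this can be sketched with a plain Pearson correlation. This is a minimal illustration on made-up values, not the notebook's code; the `pearson` helper and the toy numbers are assumptions, and only the column names come from the feature table.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy values, not taken from the dataset.
wellbeing = [95.0, 90.0, 85.0, 80.0]
features = {
    "sleep_quality": [9.0, 8.0, 6.0, 5.0],
    "work_stress": [2.0, 4.0, 6.0, 9.0],
}
correlations = {name: pearson(vals, wellbeing) for name, vals in features.items()}

# Sort by absolute strength; the sign decides the bar color (green > 0, red < 0).
ranked = sorted(correlations.items(), key=lambda kv: abs(kv[1]), reverse=True)
for name, r in ranked:
    print(f"{name}: {r:+.3f}")
```

Sorting by absolute value keeps strong negative factors such as work stress next to strong positive ones instead of pushing them to the bottom of the chart.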


📈 Part 3: Baseline Model

Baseline Configuration

| Setting | Value |
|---|---|
| Algorithm | Linear Regression |
| Features | 6 lifestyle scores |
| Preprocessing | StandardScaler |
| Train/Test Split | 80/20 |
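A configuration like this might look as follows in scikit-learn. The library choice is an assumption (the README names `StandardScaler` but not the package), and the data here is synthetic with illustrative coefficients, so the printed score is not the project's result.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 6 columns play the role of the six lifestyle scores.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))
true_coef = np.array([-4.0, 4.0, -2.7, 2.7, 2.2, -1.6])  # illustrative only
y = 85.0 + X @ true_coef + rng.normal(scale=5.0, size=1000)

# 80/20 split, matching the configuration table.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Inside a pipeline the scaler is fitted on the training split only.
baseline = make_pipeline(StandardScaler(), LinearRegression())
baseline.fit(X_train, y_train)
r2 = r2_score(y_test, baseline.predict(X_test))
print(f"test R^2 = {r2:.3f}")
```

Wrapping the scaler and the model in one pipeline keeps test-set statistics out of the scaling step.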

Baseline Results

| Metric | Value |
|---|---|
| R² Score | 0.672 |
| MAE | 4.11 |
| RMSE | 5.16 |

Interpretation: The baseline model explains 67.2% of variance in wellbeing scores with an average error of about 4 points on the 0-100 scale. This is a solid baseline.
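For reference, the three metrics can be computed from first principles as sketched below. The `regression_metrics` helper and the toy values are illustrative assumptions, not code or numbers from the notebook.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2

# Toy values, not taken from the project.
mae, rmse, r2 = regression_metrics(
    [90.0, 80.0, 70.0, 60.0], [88.0, 83.0, 70.0, 59.0]
)
print(f"MAE={mae:.2f} RMSE={rmse:.2f} R2={r2:.3f}")
```

RMSE is never smaller than MAE because squaring weights large errors more heavily, which matches the 5.16 vs 4.11 gap in the table.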

Baseline: Actual vs Predicted

[Figure: Baseline Actual vs Predicted]

Baseline Feature Importance

[Figure: Baseline Feature Importance]

Top Features (Baseline):

| Rank | Feature | Coefficient | Effect |
|---|---|---|---|
| 1 | Work Stress | -4.04 | Strongest negative |
| 2 | Sleep Quality | +4.03 | Strongest positive |
| 3 | Financial Stress | -2.69 | Negative |
| 4 | Diet Quality | +2.69 | Positive |
| 5 | Physical Activity | +2.24 | Positive |
| 6 | Screen Time | -1.56 | Weakest negative |

🔧 Part 4: Feature Engineering

Engineered Features

We created additional features to capture more complex relationships:

| Feature | Description | Rationale |
|---|---|---|
| Weighted Lifestyle Risk | Composite score combining all risk factors | Captures overall lifestyle health |
| Cluster Labels | K-Means lifestyle segments (k=3) | Non-linear pattern capture |
| PCA Components | Lifestyle_PCA_1, Lifestyle_PCA_2 | Dimensionality reduction |

4.1 Weighted Lifestyle Risk Score

[Figure: Lifestyle Risk Distribution]

We created a weighted lifestyle risk score based on EDA findings:

| Factor | Weight | Direction |
|---|---|---|
| Work Stress | 0.30 | Higher = More Risk |
| Financial Stress | 0.25 | Higher = More Risk |
| Poor Sleep Quality | 0.20 | Lower quality = More Risk |
| Poor Diet Quality | 0.15 | Lower quality = More Risk |
| Low Physical Activity | 0.05 | Lower = More Risk |
| Screen Time | 0.05 | Higher = More Risk |

Interpretation: A higher score means a riskier lifestyle profile (worse for wellbeing).
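The README does not give the exact formula. One plausible reading of the weights above is a weighted sum in which the protective factors are inverted so that every term points toward more risk. The sketch below assumes each input has already been scaled to [0, 1]; that scaling, the `lifestyle_risk` helper and the example person are assumptions, not the notebook's code.

```python
# Weights from the table above; they sum to 1.0.
WEIGHTS = {
    "work_stress": 0.30,
    "financial_stress": 0.25,
    "sleep_quality": 0.20,
    "diet_quality": 0.15,
    "physical_activity": 0.05,
    "screen_time": 0.05,
}
# Protective factors: a lower value means more risk, so they are inverted.
PROTECTIVE = {"sleep_quality", "diet_quality", "physical_activity"}

def lifestyle_risk(row):
    """Weighted risk in [0, 1]; assumes every input is already scaled to [0, 1]."""
    score = 0.0
    for name, weight in WEIGHTS.items():
        value = row[name]
        risk = 1.0 - value if name in PROTECTIVE else value
        score += weight * risk
    return score

# Hypothetical person: high stress, poor sleep, otherwise average.
person = {
    "work_stress": 0.9, "financial_stress": 0.8, "sleep_quality": 0.2,
    "diet_quality": 0.5, "physical_activity": 0.5, "screen_time": 0.5,
}
print(round(lifestyle_risk(person), 3))
```

Because the weights sum to 1.0 and each term lies in [0, 1], the score itself stays in [0, 1], which makes it easy to compare across individuals.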


4.2 K-Means Clustering (k=3)

[Figure: Clustering PCA]

We applied K-Means clustering to identify distinct lifestyle segments:

| Cluster | Count | Profile |
|---|---|---|
| 0 | 129,119 (32%) | Lifestyle Profile A |
| 1 | 142,014 (36%) | Lifestyle Profile B |
| 2 | 128,867 (32%) | Lifestyle Profile C |

The cluster label becomes a categorical feature that helps the model capture non-linear relationships between lifestyle factors.
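A minimal sketch of this step, assuming scikit-learn and standardized inputs. The data is synthetic, and the one-hot encoding of the label is an assumption about how the categorical feature might be fed to a model; the README does not say how it was encoded.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the six lifestyle features (not the real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 6))

# K-Means is distance-based, so features are standardized first.
X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

# One-hot encode the label so a linear model does not read it as an ordered number.
cluster_onehot = np.eye(3)[labels]
X_with_clusters = np.hstack([X, cluster_onehot])
print(np.bincount(labels, minlength=3), X_with_clusters.shape)
```

In a train/test setting, fitting the clusterer on the training split and calling `kmeans.predict` on the test split keeps test information out of the feature.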

4.3 PCA Components

We added two PCA components (Lifestyle_PCA_1, Lifestyle_PCA_2) that compress the six lifestyle features into orthogonal dimensions capturing the main variance patterns.
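The two components could be derived roughly as follows. This is a sketch on synthetic data, assuming scikit-learn and standardized inputs; only the names `Lifestyle_PCA_1` and `Lifestyle_PCA_2` come from the README.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))  # synthetic stand-in for the six lifestyle features

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2, random_state=42)
components = pca.fit_transform(X_scaled)

lifestyle_pca_1 = components[:, 0]
lifestyle_pca_2 = components[:, 1]

# Orthogonality check: the two component scores are uncorrelated.
corr = np.corrcoef(lifestyle_pca_1, lifestyle_pca_2)[0, 1]
print(pca.explained_variance_ratio_, round(abs(corr), 6))
```

`explained_variance_ratio_` shows how much of the original variance the two components retain, which is a quick check on how lossy the compression is.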


🎯 Part 5: Improved Regression Models

Model Comparison RΒ²

Model Comparison RMSE

Model Comparison Results

| Model | MAE | RMSE | R² |
|---|---|---|---|
| Baseline Linear Regression | 4.11 | 5.16 | 0.672 |
| Linear Regression (engineered) | 4.09 | 5.14 | 0.675 |
| Random Forest (engineered) | 2.28 | 3.61 | 0.839 |
| Gradient Boosting (engineered) | 2.29 | 3.49 | 0.850 |

Improvement Analysis

| Comparison | Improvement |
|---|---|
| Baseline → Gradient Boosting | +26.5% relative R² improvement (0.672 → 0.850) |
| MAE Reduction | 4.11 → 2.29 (44% reduction) |
| RMSE Reduction | 5.16 → 3.49 (32% reduction) |
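The percentages above are relative changes against the baseline. The arithmetic can be checked with a few lines; the `relative_change` helper is illustrative, and the inputs are the values reported in the comparison table.

```python
def relative_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

baseline = {"MAE": 4.11, "RMSE": 5.16, "R2": 0.672}
boosted = {"MAE": 2.29, "RMSE": 3.49, "R2": 0.850}

changes = {k: relative_change(baseline[k], boosted[k]) for k in baseline}
for metric, pct in changes.items():
    print(f"{metric}: {pct:+.1f}%")
```

Note that the R² figure is a relative gain; in absolute terms R² rose by 0.178.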

Feature Importance (Best Model)

[Figure: Feature Importance]

Key Insights:

  • Stress factors (work, financial) remain the strongest predictors
  • Sleep quality continues to be the top positive factor
  • Engineered features and cluster labels add predictive value

🏆 Part 6: Regression Winner

Gradient Boosting Regressor

| Metric | Value |
|---|---|
| R² Score | 0.850 |
| MAE | 2.29 |
| RMSE | 3.49 |

Why Gradient Boosting Won:

  • Captures non-linear relationships between lifestyle factors
  • Handles feature interactions naturally
  • Best balance of accuracy and generalization
  • Lowest RMSE among all models
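The first two points can be illustrated on a toy problem. The sketch below assumes scikit-learn and uses a synthetic target containing an interaction term; it shows the general effect, not the project's data or tuned hyperparameters.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic target with an interaction term that a plain linear model cannot represent.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(2000, 3))
y = 3.0 * X[:, 0] + 4.0 * X[:, 1] * X[:, 2] + rng.normal(scale=0.3, size=2000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scores = {}
for name, model in [
    ("linear", LinearRegression()),
    ("gradient_boosting", GradientBoostingRegressor(random_state=42)),
]:
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))
print(scores)
```

The linear model only recovers the additive part of the signal, while the boosted trees can split on one feature and then another, which is how they pick up the interaction.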

Saved as: winning_regressor.pkl


🔄 Part 7: Regression to Classification

We converted wellbeing scores into two classes using a single quantile threshold (92.49, which places 33% of the training set in the low class):

Class Distribution

| Class | Wellbeing Level | Threshold | Train Count | Percentage |
|---|---|---|---|---|
| 0 | Low Wellbeing | < 92.49 | 105,600 | 33% |
| 1 | Medium/High Wellbeing | ≥ 92.49 | 214,400 | 67% |

Note: The classes are imbalanced (33% vs 67%), so we focus on F1-score and recall rather than accuracy alone.
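The conversion can be sketched as below, assuming the cut-off is the 33rd percentile of the training scores (consistent with the 33% share in the table, but not stated explicitly in the README). The scores are synthetic, so the printed threshold will not be 92.49.

```python
import numpy as np

# Synthetic scores skewed toward the top of the 0-100 range (not the real data).
rng = np.random.default_rng(0)
train_scores = np.clip(100.0 - rng.exponential(scale=8.0, size=10_000), 0.0, 100.0)
test_scores = np.clip(100.0 - rng.exponential(scale=8.0, size=2_500), 0.0, 100.0)

# The cut-off is computed on the training split only, then reused on the test split.
threshold = np.quantile(train_scores, 0.33)

# Class 0 = Low (< threshold), class 1 = Medium/High (>= threshold).
y_train_cls = (train_scores >= threshold).astype(int)
y_test_cls = (test_scores >= threshold).astype(int)
print(f"threshold={threshold:.2f}, low share={np.mean(y_train_cls == 0):.2%}")
```

Reusing the training threshold on the test split means the test class balance can drift slightly from 33/67, which is expected.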


Precision vs Recall Analysis

For mental health prediction, recall on the low-wellbeing class matters more than precision. Missing a person who is actually struggling (false negative) is more harmful than flagging someone as "at risk" when they are actually fine (false positive).

False Positive vs False Negative

False Negatives are more critical:

| Error Type | Meaning | Consequence |
|---|---|---|
| False Positive | Predict Low, actually OK | Extra attention to someone who is fine (less harmful) |
| False Negative | Predict OK, actually Low | Person who needs support is not identified (more harmful) |

A false negative means the model predicts that someone is not in the low-wellbeing group, while in reality they are. This could result in a person who needs support not being identified.

Conclusion: We prioritize recall for Class 0 (Low Wellbeing) to minimize missed at-risk individuals.
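To make the trade-off concrete, precision, recall and F1 for class 0 can be computed from three confusion-matrix counts. The `low_class_metrics` helper and the counts are hypothetical, not results from this project.

```python
def low_class_metrics(tp, fp, fn):
    """Precision, recall and F1 for class 0 (Low), treated as the positive class.

    tp: actual Low, predicted Low
    fp: actual Medium/High, predicted Low
    fn: actual Low, predicted Medium/High (the costly error)
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts, not results from this project.
precision, recall, f1 = low_class_metrics(tp=80, fp=20, fn=10)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Only `fn` enters the recall denominator, so reducing missed at-risk individuals raises recall directly, usually at the cost of more false positives and lower precision.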


📊 Part 8: Classification Models

Classification Results

| Model | Accuracy | F1 (macro) |
|---|---|---|
| Logistic Regression | 90.55% | 0.893 |
| Gradient Boosting | 90.47% | 0.892 |
| Random Forest | 90.39% | 0.891 |
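Metrics like these, plus recall for the Low class, might be computed as in the sketch below. It assumes scikit-learn and uses a synthetic imbalanced binary problem, so the printed numbers are not the project's results.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced binary problem (not the real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 4))
logits = X @ np.array([2.0, -1.5, 1.0, 0.5]) + 1.0
y = (logits + rng.logistic(size=3000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
macro_f1 = f1_score(y_test, y_pred, average="macro")
recall_low = recall_score(y_test, y_pred, pos_label=0)  # recall for class 0 (Low)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])  # rows: actual, columns: predicted
print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f} recall_low={recall_low:.3f}")
print(cm)
```

Macro F1 averages the per-class F1 scores with equal weight, so the smaller Low class counts as much as the larger Medium/High class.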

Confusion Matrices

[Figure: Confusion Matrix]

Key Observations:

  • All models achieve ~90% accuracy
  • Because the task is binary, every error is a confusion between Low and Medium/High
  • The errors that matter most are Low individuals predicted as Medium/High (missed at-risk individuals)
  • Logistic Regression achieves the highest F1 score despite being the simplest model

Confusion Matrix Analysis

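Because Medium and High wellbeing are merged into class 1, the confusion matrices cannot show confusion between those two groups; every error is a confusion between class 0 (low wellbeing) and class 1 (medium/high wellbeing). The cell to watch is actual class 0 predicted as class 1, since these are the at-risk individuals the model misses. The matrices show the model rarely makes this error, which is good from a practical perspective: it seldom predicts medium/high wellbeing for people who are actually in the low group.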


🏆 Part 8.4: Classification Winner

Logistic Regression

| Metric | Value |
|---|---|
| Accuracy | 90.55% |
| Macro F1 | 0.893 |

Why Logistic Regression Won:

  • Highest accuracy and F1 score (by a narrow margin)
  • Simple, interpretable model
  • Fast training (1.48 seconds vs 117-247 seconds for the others)
  • Outputs class probabilities directly, which are typically reasonably well calibrated
  • Performs well here, which suggests the class boundary is close to linear in these features

Saved as: winning_classifier.pkl


📁 Repository Files

| File | Description |
|---|---|
| winning_regressor.pkl | Gradient Boosting regression model (R²=0.85) |
| winning_classifier.pkl | Logistic Regression classifier (90.55% accuracy) |
| notebook.ipynb | Complete Jupyter notebook with all code |
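Loading one of the `.pkl` files might look like the round trip below. This assumes the models were written with Python's `pickle` module (they could equally have been saved with joblib), and it uses a tiny stand-in model in a temporary directory because the expected input columns and their order are not documented here.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in model; in the repository this would be the fitted regressor.
model = LinearRegression().fit(
    np.array([[0.0], [1.0], [2.0]]), np.array([1.0, 3.0, 5.0])
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "winning_regressor.pkl"
    with path.open("wb") as fh:
        pickle.dump(model, fh)
    # Only unpickle files from a source you trust: loading can run arbitrary code.
    with path.open("rb") as fh:
        restored = pickle.load(fh)

prediction = float(restored.predict(np.array([[3.0]]))[0])
print(round(prediction, 6))
```

A pickled scikit-learn model is generally only safe to load with the same library version that produced it, so pinning that version alongside the file is worthwhile.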

💡 Key Takeaways

What Affects Mental Wellbeing Most?

Negative Factors (Risk):

  1. 🔴 Work Stress - Strongest negative impact (coefficient: -4.04)
  2. 🔴 Financial Stress - Significant negative impact (coefficient: -2.69)
  3. 🟡 Screen Time - Weak negative impact (coefficient: -1.56)

Positive Factors (Protective):

  1. 🟒 Sleep Quality - Strongest positive impact (coefficient: +4.03)
  2. 🟒 Diet Quality - Significant positive impact (coefficient: +2.69)
  3. 🟒 Physical Activity - Moderate positive impact (coefficient: +2.24)

Model Performance Summary

| Task | Best Model | Performance |
|---|---|---|
| Regression | Gradient Boosting | R² = 0.850, RMSE = 3.49 |
| Classification | Logistic Regression | 90.55% accuracy, F1 = 0.893 |

Feature Engineering Impact

| Model | MAE | RMSE | R² |
|---|---|---|---|
| Baseline (6 features) | 4.11 | 5.16 | 0.672 |
| Gradient Boosting (engineered) | 2.29 | 3.49 | 0.850 |
| Improvement | -44% | -32% | +26.5% |

Lessons Learned

  1. Stress management is crucial - Work and financial stress are the strongest predictors of low wellbeing
  2. Sleep quality matters most among positive lifestyle factors
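  3. Feature engineering helps, but modestly on its own - With Linear Regression, the engineered features raised R² only from 0.672 to 0.675; the large gain came from pairing them with tree ensembles
  4. Simple models can win - Logistic Regression narrowly beat the more complex models for classification (90.55% vs 90.47% and 90.39% accuracy)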
  5. Ensemble methods excel for regression - Gradient Boosting captured non-linear patterns
  6. Recall matters for mental health - Don't miss at-risk individuals (minimize false negatives)

👤 Author

Odeya

Assignment #2: Classification, Regression, Clustering, Evaluation

