Mental Health & Wellbeing Prediction
Video Presentation
[YOUR VIDEO LINK HERE - Add after recording]
Project Overview
This project predicts mental wellbeing scores based on lifestyle and environmental factors. We built both regression models (to predict exact scores) and classification models (to categorize wellbeing levels as Low vs Medium/High).
| Item | Details |
|---|---|
| Dataset | Synthetic Mental Health, Lifestyle & Wellbeing (Kaggle) |
| Size | 400,000 individuals, 15 features |
| Target | mental_wellbeing_score (0-100) |
| Train/Test | 320,000 / 80,000 (80/20 split) |
Main Question
Which lifestyle and environmental factors are most strongly associated with mental wellbeing, and how accurately can we predict wellbeing scores from these features?
Goals
- Explore relationships between lifestyle factors and mental wellbeing
- Build baseline regression model and improve through feature engineering
- Apply K-Means clustering to discover lifestyle segments
- Convert to binary classification and identify at-risk individuals
Part 1-2: Exploratory Data Analysis
Dataset Features
| Feature Type | Features |
|---|---|
| Target | mental_wellbeing_score (0-100) |
| Lifestyle | sleep_hours, screen_time, physical_activity, diet_quality, sleep_quality |
| Stress | work_stress, financial_stress |
| Social | social_interactions |
| Environment | air_quality_index, noise_level |
| Demographics | age, gender, city_type |
Target Distribution
The mental wellbeing score ranges from 0 to 100, with scores concentrated in the 80-100 range.
Research Question 1: Screen Time vs Wellbeing
Finding: Higher screen time is associated with slightly lower mental wellbeing. The relationship is negative but relatively weak.
Research Question 2: Physical Activity vs Wellbeing
Finding: Higher physical activity levels are associated with better mental wellbeing. This is one of the positive lifestyle factors.
Research Question 3: Work Stress vs Wellbeing
Finding: Work stress has a strong negative relationship with mental wellbeing - one of the most impactful factors.
Research Question 4: Sleep Quality vs Wellbeing
Finding: Better sleep quality strongly correlates with higher mental wellbeing scores. Sleep quality is one of the top positive predictors.
Research Question 5: Diet Quality vs Wellbeing
Finding: Higher diet quality is associated with better mental wellbeing outcomes.
Correlation Analysis
Key Correlations with Mental Wellbeing:
| Factor | Correlation | Direction |
|---|---|---|
| Sleep Quality | Strong Positive | Better sleep = Higher wellbeing |
| Diet Quality | Moderate Positive | Better diet = Higher wellbeing |
| Physical Activity | Moderate Positive | More activity = Higher wellbeing |
| Work Stress | Strong Negative | More stress = Lower wellbeing |
| Financial Stress | Moderate Negative | More stress = Lower wellbeing |
| Screen Time | Weak Negative | More screen time = Lower wellbeing |
Feature Correlation with Target
This visualization shows how each feature correlates with mental wellbeing score. Green bars indicate positive relationships (beneficial factors), while red bars indicate negative relationships (risk factors).
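A minimal sketch of how such a correlation ranking could be computed with pandas. The toy data and the helper name `correlations_with_target` are assumptions for illustration; only the column names come from the feature table above, and the numbers do not reproduce the project's results.

```python
import numpy as np
import pandas as pd


def correlations_with_target(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of each numeric feature with the target, sorted ascending."""
    numeric = df.select_dtypes("number")
    return numeric.corr()[target].drop(target).sort_values()


# Toy data standing in for the real dataset (values are made up).
rng = np.random.default_rng(0)
n = 1_000
toy = pd.DataFrame({
    "sleep_quality": rng.normal(size=n),
    "work_stress": rng.normal(size=n),
})
toy["mental_wellbeing_score"] = (
    3 * toy["sleep_quality"] - 3 * toy["work_stress"] + rng.normal(size=n)
)

corr = correlations_with_target(toy, "mental_wellbeing_score")
print(corr)  # negative values are risk factors, positive values are protective
```

Sorting the series puts the strongest negative correlations first and the strongest positive ones last, which is the order a red-to-green bar chart would use.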
Part 3: Baseline Model
Baseline Configuration
| Setting | Value |
|---|---|
| Algorithm | Linear Regression |
| Features | 6 lifestyle scores |
| Preprocessing | StandardScaler |
| Train/Test Split | 80/20 |
Baseline Results
| Metric | Value |
|---|---|
| R² Score | 0.672 |
| MAE | 4.11 |
| RMSE | 5.16 |
Interpretation: The baseline model explains 67.2% of the variance in wellbeing scores, with a mean absolute error of about 4 points on the 0-100 scale. This gives a solid reference point for the later models.
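The baseline configuration can be sketched as a scikit-learn pipeline. This is an illustration on synthetic data, not the notebook's code: the generating coefficients, noise level, and random seeds are assumptions, so the printed metrics will differ from the table above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the six lifestyle scores (illustrative only).
rng = np.random.default_rng(42)
X = rng.normal(size=(2_000, 6))
assumed_coefs = np.array([-4.0, 4.0, -2.7, 2.7, 2.2, -1.6])  # assumption
y = 85 + X @ assumed_coefs + rng.normal(scale=5.0, size=2_000)

# 80/20 split, then scale and fit inside one pipeline so the scaler
# only ever sees the training rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
baseline = make_pipeline(StandardScaler(), LinearRegression())
baseline.fit(X_train, y_train)

pred = baseline.predict(X_test)
mae = mean_absolute_error(y_test, pred)
rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
r2 = r2_score(y_test, pred)
print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}")
```

Because the inputs are standardized, the fitted coefficients are on a common scale and can be compared directly.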
Baseline: Actual vs Predicted
Baseline Feature Importance
Top Features (Baseline):
| Rank | Feature | Coefficient | Effect |
|---|---|---|---|
| 1 | Work Stress | -4.04 | Strongest negative |
| 2 | Sleep Quality | +4.03 | Strongest positive |
| 3 | Financial Stress | -2.69 | Negative |
| 4 | Diet Quality | +2.69 | Positive |
| 5 | Physical Activity | +2.24 | Positive |
| 6 | Screen Time | -1.56 | Weakest negative |
Part 4: Feature Engineering
Engineered Features
We created additional features to capture more complex relationships:
| Feature | Description | Rationale |
|---|---|---|
| Weighted Lifestyle Risk | Composite score combining all risk factors | Captures overall lifestyle health |
| Cluster Labels | K-Means lifestyle segments (k=3) | Non-linear pattern capture |
| PCA Components | Lifestyle_PCA_1, Lifestyle_PCA_2 | Dimensionality reduction |
4.1 Weighted Lifestyle Risk Score
We created a weighted lifestyle risk score based on EDA findings:
| Factor | Weight | Direction |
|---|---|---|
| Work Stress | 0.30 | Higher = More Risk |
| Financial Stress | 0.25 | Higher = More Risk |
| Poor Sleep Quality | 0.20 | Lower quality = More Risk |
| Poor Diet Quality | 0.15 | Lower quality = More Risk |
| Low Physical Activity | 0.05 | Lower = More Risk |
| Screen Time | 0.05 | Higher = More Risk |
Formula: the score is the weighted sum of the six risk-oriented factors above (the weights sum to 1.00). Higher score = Riskier lifestyle profile (worse for wellbeing).
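One way the weighted score could be computed is sketched below. The weights come from the table above; the min-max scaling to [0, 1], the inversion of the protective factors, and the function name `weighted_lifestyle_risk` are assumptions, since the exact scaling is not stated here.

```python
import pandas as pd

# Weights from the table above (they sum to 1.0).
RISK_WEIGHTS = {
    "work_stress": 0.30,
    "financial_stress": 0.25,
    "sleep_quality": 0.20,
    "diet_quality": 0.15,
    "physical_activity": 0.05,
    "screen_time": 0.05,
}
# Factors where a LOWER value means MORE risk, so they are inverted.
PROTECTIVE = {"sleep_quality", "diet_quality", "physical_activity"}


def weighted_lifestyle_risk(df: pd.DataFrame) -> pd.Series:
    """Weighted sum of min-max scaled risk factors; result lies in [0, 1]."""
    score = pd.Series(0.0, index=df.index)
    for col, weight in RISK_WEIGHTS.items():
        lo, hi = df[col].min(), df[col].max()
        scaled = (df[col] - lo) / (hi - lo)  # assumes hi > lo
        if col in PROTECTIVE:
            scaled = 1.0 - scaled  # poor sleep/diet/activity raises risk
        score += weight * scaled
    return score


# Two made-up people: the healthiest and the riskiest possible profile.
toy = pd.DataFrame({
    "work_stress": [1, 10],
    "financial_stress": [1, 10],
    "sleep_quality": [10, 1],
    "diet_quality": [10, 1],
    "physical_activity": [10, 1],
    "screen_time": [1, 10],
})
risk = weighted_lifestyle_risk(toy)
print(risk.tolist())
```

In a train/test setting, the minimum and maximum used for scaling would be taken from the training split only.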
4.2 K-Means Clustering (k=3)
We applied K-Means clustering to identify distinct lifestyle segments:
| Cluster | Count | Profile |
|---|---|---|
| 0 | 129,119 (32%) | Lifestyle Profile A |
| 1 | 142,014 (36%) | Lifestyle Profile B |
| 2 | 128,867 (32%) | Lifestyle Profile C |
The cluster label becomes a categorical feature that helps the model capture non-linear relationships between lifestyle factors.
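A minimal sketch of deriving cluster labels with K-Means and turning them into a categorical feature. The synthetic blobs, the scaling step, and the one-hot encoding are assumptions for illustration; the cluster sizes will not match the table above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Three well-separated synthetic groups in a 6-feature space (made up).
rng = np.random.default_rng(0)
centers = np.array([[-3.0] * 6, [0.0] * 6, [3.0] * 6])
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 6)) for c in centers])

# K-Means is distance based, so features are standardized first.
X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

# One-hot encode the label so models do not read it as an ordered number.
onehot = np.eye(3)[labels]
print(np.bincount(labels))
```

Cluster ids are arbitrary, so one-hot encoding avoids implying that cluster 2 is "more" than cluster 0.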
4.3 PCA Components
We added two PCA components (Lifestyle_PCA_1, Lifestyle_PCA_2) that compress the six lifestyle features into orthogonal dimensions capturing the main variance patterns.
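The PCA step could look roughly like this. The random input and the standardization before PCA are assumptions; only the output column names are taken from the text above.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random stand-in for the six lifestyle features (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 6))

# Standardize, then keep the two directions with the most variance.
pca = PCA(n_components=2, random_state=0)
components = pca.fit_transform(StandardScaler().fit_transform(X))

pca_df = pd.DataFrame(components, columns=["Lifestyle_PCA_1", "Lifestyle_PCA_2"])
print(pca.explained_variance_ratio_)
```

`explained_variance_ratio_` shows how much of the original variance each component keeps, which helps judge whether two components are enough.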
Part 5: Improved Regression Models
Model Comparison Results
| Model | MAE | RMSE | R² |
|---|---|---|---|
| Baseline Linear Regression | 4.11 | 5.16 | 0.672 |
| Linear Regression (engineered) | 4.09 | 5.14 | 0.675 |
| Random Forest (engineered) | 2.28 | 3.61 | 0.839 |
| Gradient Boosting (engineered) | 2.29 | 3.49 | 0.850 |
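A comparison like the one above can be sketched as a loop over candidate regressors. The Friedman #1 synthetic dataset, the hyperparameters, and the seeds are assumptions chosen because that dataset contains non-linear effects; the resulting metrics are unrelated to the table.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem with non-linear terms (not the project data).
X, y = make_friedman1(n_samples=1_500, n_features=6, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "MAE": mean_absolute_error(y_test, pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
        "R2": r2_score(y_test, pred),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Evaluating every model on the same held-out split keeps the rows of the comparison table directly comparable.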
Improvement Analysis
| Comparison | Improvement |
|---|---|
| Baseline → Gradient Boosting | +26.5% relative R² improvement (0.672 → 0.850) |
| MAE Reduction | 4.11 → 2.29 (44% reduction) |
| RMSE Reduction | 5.16 → 3.49 (32% reduction) |
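The percentages above are relative changes against the baseline value. A short check of the arithmetic, using the metric values from the tables (the helper name `relative_change` is just for illustration):

```python
def relative_change(before: float, after: float) -> float:
    """Percentage change from `before` to `after`, relative to `before`."""
    return (after - before) / before * 100.0


r2_gain = relative_change(0.672, 0.850)      # baseline R2 -> Gradient Boosting R2
mae_change = relative_change(4.11, 2.29)     # baseline MAE -> Gradient Boosting MAE
rmse_change = relative_change(5.16, 3.49)    # baseline RMSE -> Gradient Boosting RMSE

print(f"R2 {r2_gain:+.1f}%, MAE {mae_change:+.1f}%, RMSE {rmse_change:+.1f}%")
```

Note that the R² figure is a relative gain; in absolute terms R² rises by 0.178.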
Feature Importance (Best Model)
Key Insights:
- Stress factors (work, financial) remain the strongest predictors
- Sleep quality continues to be the top positive factor
- Engineered features and cluster labels add predictive value
Part 6: Regression Winner
Gradient Boosting Regressor
| Metric | Value |
|---|---|
| R² Score | 0.850 |
| MAE | 2.29 |
| RMSE | 3.49 |
Why Gradient Boosting Won:
- Captures non-linear relationships between lifestyle factors
- Handles feature interactions naturally
- Highest R² (0.850) among all models
- Lowest RMSE (3.49) among all models; Random Forest's MAE is marginally lower (2.28 vs 2.29)
Saved as: winning_regressor.pkl
Part 7: Regression to Classification
We converted the wellbeing scores into two classes using a quantile threshold of 92.49, which places roughly the lowest third of training scores in the Low class:
| Class | Wellbeing Level | Threshold | Train Count | Percentage |
|---|---|---|---|---|
| 0 | Low Wellbeing | < 92.49 | 105,600 | 33% |
| 1 | Medium/High Wellbeing | ≥ 92.49 | 214,400 | 67% |
Note: The classes are imbalanced (33% vs 67%), so we focus on F1-score and recall rather than accuracy alone.
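A minimal sketch of the score-to-class conversion. The quantile `q=0.33` is an assumption inferred from the 33% class share in the table, and the function name and toy scores are hypothetical. The threshold is computed on the training scores only and then reused for the test scores.

```python
import numpy as np


def binarize_by_quantile(train_scores, test_scores, q=0.33):
    """Return (threshold, train_classes, test_classes).

    Class 0 = Low wellbeing (score below the threshold),
    class 1 = Medium/High wellbeing (score at or above it).
    """
    threshold = float(np.quantile(train_scores, q))

    def to_class(scores):
        return (np.asarray(scores) >= threshold).astype(int)

    return threshold, to_class(train_scores), to_class(test_scores)


# Toy scores 0..99 standing in for the real target (made up).
train_scores = np.arange(100.0)
test_scores = np.array([10.0, 50.0, 95.0])
threshold, y_train_cls, y_test_cls = binarize_by_quantile(train_scores, test_scores)
print(threshold, np.bincount(y_train_cls), y_test_cls)
```

Deriving the threshold from the training split avoids leaking information about the test distribution into the labels.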
Precision vs Recall Analysis
Recall matters more than precision for the low-wellbeing class:
Missing a person who is actually struggling (false negative) is more harmful than flagging someone as "at risk" when they are actually fine (false positive).
False Positive vs False Negative
False Negatives are more critical:
| Error Type | Meaning | Consequence |
|---|---|---|
| False Positive | Predict Low, actually OK | Extra attention to someone who is fine (less harmful) |
| False Negative | Predict OK, actually Low | Person who needs support is not identified (more harmful) |
A false negative means the model predicts that someone is not in the low-wellbeing group, while in reality they are. This could result in a person who needs support not being identified.
Conclusion: We prioritize recall for Class 0 (Low Wellbeing) to minimize missed at-risk individuals.
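Recall for Class 0 can be read directly from predictions, as in the sketch below. The labels are made up for illustration. Because the class of interest is 0 rather than scikit-learn's default positive label 1, `pos_label=0` is passed explicitly.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels: 0 = Low wellbeing, 1 = Medium/High wellbeing.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]

# Share of truly Low individuals that the model actually flags as Low.
recall_low = recall_score(y_true, y_pred, pos_label=0)

# Rows are true classes, columns are predicted classes, in the order [0, 1].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
missed_low = int(cm[0, 1])     # actually Low, predicted OK (false negative)
false_alarms = int(cm[1, 0])   # actually OK, predicted Low (false positive)

print(recall_low, missed_low, false_alarms)
```

`missed_low` is the count to minimize when the goal is to avoid overlooking at-risk individuals.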
Part 8: Classification Models
Classification Results
| Model | Accuracy | F1 (macro) |
|---|---|---|
| Logistic Regression | 90.55% | 0.893 |
| Gradient Boosting | 90.47% | 0.892 |
| Random Forest | 90.39% | 0.891 |
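The comparison above could be produced with a loop like the following sketch. The synthetic data from `make_classification`, the stratified split, and the hyperparameters are assumptions, so the scores are not the project's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem with a 33% / 67% class balance (illustrative).
X, y = make_classification(
    n_samples=1_000, n_features=6, n_informative=4, n_redundant=0,
    weights=[0.33, 0.67], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1_000),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        # Macro F1 averages the per-class F1 scores, so the smaller
        # Low class counts as much as the larger Medium/High class.
        "f1_macro": f1_score(y_test, pred, average="macro"),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

With imbalanced classes, macro F1 is a stricter summary than accuracy because a model cannot score well by favoring the majority class.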
Confusion Matrices
Key Observations:
- All models achieve ~90% accuracy
- Because the task is binary, every error is a confusion between Low and Medium/High wellbeing
- With roughly 90% accuracy, fewer than 10% of test cases fall into either error type (important for identifying at-risk individuals)
- Logistic Regression achieves the highest F1 score despite being the simplest model
Confusion Matrix Analysis
Because the task is binary, each confusion matrix contains only two kinds of error: a Medium/High individual predicted as Low (false positive) and a Low individual predicted as Medium/High (false negative). The second kind matters most from a practical perspective. It is bounded by the overall error rate of about 9.5%, so only a small share of all predictions assign Medium/High wellbeing to someone who is actually in the Low group.
Part 8.4: Classification Winner
Logistic Regression
| Metric | Value |
|---|---|
| Accuracy | 90.55% |
| Macro F1 | 0.893 |
Why Logistic Regression Won:
- Highest accuracy and F1 score
- Simple, interpretable model
- Fast training (1.48 seconds vs 117-247 seconds for the others)
- Well-calibrated probabilities
- Performs well because the two classes are largely linearly separable
Saved as: winning_classifier.pkl
Repository Files
| File | Description |
|---|---|
| winning_regressor.pkl | Gradient Boosting regression model (R² = 0.850) |
| winning_classifier.pkl | Logistic Regression classifier (90.55% accuracy) |
| notebook.ipynb | Complete Jupyter notebook with all code |
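A sketch of how a `.pkl` model file is typically written and read back. It assumes the files were saved with Python's `pickle` module (they may instead have been saved with `joblib`, which is not stated here). The demo uses a throwaway model and a temporary file named `demo_classifier.pkl`, not the repository files.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny throwaway model, only to demonstrate the save/load round trip.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo_classifier.pkl"  # hypothetical file name
    with path.open("wb") as f:
        pickle.dump(model, f)
    with path.open("rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions exactly.
same = bool(np.array_equal(model.predict(X), restored.predict(X)))
print(same)
```

Pickled models should only be loaded from trusted sources and with a scikit-learn version compatible with the one used for saving. The loaded model also expects the same feature columns and preprocessing used during training.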
Key Takeaways
What Affects Mental Wellbeing Most?
Negative Factors (Risk):
- Work Stress - Strongest negative impact (coefficient: -4.04)
- Financial Stress - Significant negative impact (coefficient: -2.69)
- Screen Time - Weak negative impact (coefficient: -1.56)
Positive Factors (Protective):
- Sleep Quality - Strongest positive impact (coefficient: +4.03)
- Diet Quality - Significant positive impact (coefficient: +2.69)
- Physical Activity - Moderate positive impact (coefficient: +2.24)
Model Performance Summary
| Task | Best Model | Performance |
|---|---|---|
| Regression | Gradient Boosting | R² = 0.850, RMSE = 3.49 |
| Classification | Logistic Regression | 90.55% accuracy, F1 = 0.893 |
Feature Engineering Impact
| Model | MAE | RMSE | R² |
|---|---|---|---|
| Baseline (6 features) | 4.11 | 5.16 | 0.672 |
| Gradient Boosting (engineered) | 2.29 | 3.49 | 0.850 |
| Improvement | -44% | -32% | +26.5% |
Lessons Learned
- Stress management is crucial - Work and financial stress are the strongest predictors of low wellbeing
- Sleep quality matters most among positive lifestyle factors
- Feature engineering helps modestly - With the same linear model, the engineered features raised R² from 0.672 to 0.675; most of the gain came from switching to tree ensembles
- Simple models can win - Logistic Regression beat complex models for classification
- Ensemble methods excel for regression - Gradient Boosting captured non-linear patterns
- Recall matters for mental health - Don't miss at-risk individuals (minimize false negatives)
Author
Odeya
Assignment #2: Classification, Regression, Clustering, Evaluation
References
- Dataset: Synthetic Mental Health, Lifestyle & Wellbeing Dataset - Kaggle
- Tools: scikit-learn, pandas, numpy, matplotlib, seaborn
- Algorithms: Linear Regression, Random Forest, Gradient Boosting, Logistic Regression, K-Means