📉 Diabetes: Disease Progression Prediction (Linear Regression)

A Linear Regression model trained on the Diabetes dataset from Azure Open Datasets to predict Y (a quantitative measure of disease progression one year after baseline).

Built and deployed on Microsoft Fabric during the Offline Workshop Training organized by Microsoft Elevate and Dicoding.

📊 Model Details

| Property | Value |
| --- | --- |
| Model Type | Linear Regression |
| Framework | scikit-learn |
| Task | Tabular Regression |
| Target Variable | Y (disease progression, continuous) |
| Training Platform | Microsoft Fabric + MLflow |
| Dataset | Diabetes (Azure Open Datasets) |
| Total Samples | 442 |
| Train/Test Split | 70/30 (random_state=0) |

Features (10)

| Feature | Type | Description |
| --- | --- | --- |
| AGE | int | Age of patient |
| SEX | int | Gender |
| BMI | float | Body Mass Index |
| BP | float | Average Blood Pressure |
| S1 | int | Total Serum Cholesterol (tc) |
| S2 | float | Low-Density Lipoproteins (ldl) |
| S3 | float | High-Density Lipoproteins (hdl) |
| S4 | float | Total Cholesterol / HDL (tch) |
| S5 | float | Log of Serum Triglycerides (ltg) |
| S6 | int | Blood Sugar Level (glu) |

📈 Performance

Best Model: Linear Regression

| Metric | Score |
| --- | --- |
| R² (Coefficient of Determination) | 0.3929 |
| MAE (Mean Absolute Error) | 44.62 |
| RMSE (Root Mean Squared Error) | 55.65 |
| CV R² (5-fold) | 0.4823 ± 0.0493 |
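
For readers who want the definitions behind these scores, here is a small standard-library sketch of R², MAE, and RMSE. The function name `regression_metrics` and the toy numbers are illustrative only; they are not taken from the model or its predictions.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return R², MAE, and RMSE for two equal-length sequences of numbers."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)              # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(r) for r in residuals) / n,
        "rmse": math.sqrt(ss_res / n),
    }

# Hypothetical toy values, not taken from the model's predictions.
scores = regression_metrics([100.0, 150.0, 200.0], [110.0, 140.0, 190.0])
print(scores)
```

MAE and RMSE are in the same units as Y, while R² is unitless and compares the model against always predicting the mean of the true values.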

All Models Compared

| Model | R² | MAE | RMSE |
| --- | --- | --- | --- |
| Linear Regression | 0.3929 | 44.62 | 55.65 |
| Random Forest | 0.3011 | 47.86 | 59.71 |
| XGBoost | 0.2026 | 48.93 | 63.78 |
| Gradient Boosting | 0.1823 | 51.44 | 64.59 |

ℹ️ Note: An R² of about 0.39 is not unusual for clinical datasets, where disease progression depends on many unmeasured factors (genetics, lifestyle, diet). The linear model outperforms the tree-based models here, most likely because the small sample size (442 rows) makes the more flexible models prone to overfitting.

💻 Usage

import pickle
import numpy as np

# Load model (ensure model.pkl is in the directory)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

# Input: [AGE, SEX, BMI, BP, S1, S2, S3, S4, S5, S6]
# Example: Patient with average stats
sample = np.array([[50, 1, 28.5, 90.0, 200, 120.5, 45.0, 4.5, 5.2, 95]])

# Predict Disease Progression
prediction = model.predict(sample)
print(f"Predicted Disease Progression (Y): {prediction[0]:.2f}")
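
Because the model expects the ten features in a fixed positional order, it can help to build the input row from named values. The helper below is a hypothetical convenience (`to_model_input` and `FEATURE_ORDER` are not part of the released model); it only assumes the feature order given in the comment above.

```python
# Feature order expected by the model, as listed in the usage example.
FEATURE_ORDER = ["AGE", "SEX", "BMI", "BP", "S1", "S2", "S3", "S4", "S5", "S6"]

def to_model_input(patient):
    """Convert a {feature: value} mapping into a 1 x 10 row in training order."""
    missing = [name for name in FEATURE_ORDER if name not in patient]
    if missing:
        raise ValueError(f"Missing features: {missing}")
    return [[float(patient[name]) for name in FEATURE_ORDER]]

# Same illustrative patient as above, written as named values in arbitrary order.
patient = {
    "BMI": 28.5, "AGE": 50, "SEX": 1, "BP": 90.0, "S1": 200,
    "S2": 120.5, "S3": 45.0, "S4": 4.5, "S5": 5.2, "S6": 95,
}
row = to_model_input(patient)
print(row)
```

The resulting nested list can be passed to `model.predict(row)` in place of the NumPy array.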

🔍 Key Insights

  • S5 (Log of Serum Triglycerides) has by far the largest coefficient in absolute value (65.8), which suggests a strong association with disease progression.
  • SEX and BMI have the 2nd and 3rd largest coefficients.
  • Simplicity wins: Linear Regression outperforms the more complex ensemble models on this small dataset. Simpler models often generalize better when data is limited (n=442).
  • Stability: 5-fold cross-validation gives R² = 0.48 ± 0.05. The small spread across folds suggests reasonably stable performance, although the held-out test R² (0.39) is lower than the cross-validation mean.

⚖️ Feature Importance

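Ranked by the absolute value of the fitted coefficients. The features are on different scales, so these magnitudes reflect each feature's units as well as its influence: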

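| Rank | Feature | Coef (Abs) | Impact |
| --- | --- | --- | --- |
| 1 | S5 | 65.807 | ⭐⭐⭐⭐⭐ |
| 2 | SEX | 18.445 | ⭐⭐⭐ |
| 3 | BMI | 6.246 | ⭐⭐ |
| 4 | S4 | 3.196 | ⭐ |
| 5 | BP | 0.938 | |
| 6 | S1 | 0.694 | |
| 7 | S2 | 0.378 | |
| 8 | S3 | 0.257 | |
| 9 | AGE | 0.191 | |
| 10 | S6 | 0.111 | |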
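
One common way to compare coefficients across features with different units is to multiply each coefficient by its feature's standard deviation. The sketch below illustrates the idea with the S5 and BMI magnitudes from the table and hypothetical toy feature values; the function name `scaled_importance` and the data are assumptions, not part of the original analysis.

```python
import statistics

def scaled_importance(coefs, columns):
    """Multiply each |coefficient| by its feature's population standard deviation."""
    return {
        name: abs(coefs[name]) * statistics.pstdev(columns[name])
        for name in coefs
    }

# Coefficient magnitudes from the table above.
coefs = {"S5": 65.807, "BMI": 6.246}

# Hypothetical toy feature values, chosen only to show the effect of scale.
columns = {
    "S5": [4.0, 4.5, 5.0, 5.5],       # narrow range (log-scaled feature)
    "BMI": [20.0, 25.0, 30.0, 35.0],  # wider range
}

importance = scaled_importance(coefs, columns)
ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

In this toy example the two scaled values end up close to each other even though the raw coefficients differ by roughly a factor of ten, which is why a ranking on real data should use the actual feature spreads.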

⚠️ Intended Use

  • Primary: Educational / demonstration of ML workflow on Microsoft Fabric.
  • Not intended for: Clinical decision-making without further validation.

🙌 Acknowledgments

  • Microsoft Elevate and Dicoding, for organizing the Offline Workshop Training.
  • Azure Open Datasets, for providing the Diabetes dataset.
