Diabetes: Disease Progression Prediction (Linear Regression)
A Linear Regression model trained on the Diabetes dataset from Azure Open Datasets to predict Y (a quantitative measure of disease progression one year after baseline).
Built and deployed on Microsoft Fabric during the Offline Workshop Training organized by Microsoft Elevate and Dicoding.
Model Details

| Property | Value |
| --- | --- |
| Model Type | Linear Regression |
| Framework | scikit-learn |
| Task | Tabular Regression |
| Target Variable | Y (disease progression, continuous) |
| Training Platform | Microsoft Fabric + MLflow |
| Dataset | Diabetes (Azure Open Datasets) |
| Total Samples | 442 |
| Train/Test Split | 70/30 (random_state=0) |
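
For reference, the row counts behind the 70/30 split can be worked out by hand. The sketch below assumes the split was made with scikit-learn's `train_test_split` using `test_size=0.3`, which rounds the test share up and gives the remainder to training. The helper name `split_sizes` is hypothetical and is not part of the original workflow.

```python
import math
from typing import Tuple


def split_sizes(n_samples: int, test_fraction: float) -> Tuple[int, int]:
    """Return (n_train, n_test), rounding the test share up.

    Assumption: this mirrors how scikit-learn sizes the test set when
    test_size is a float and train_size is left unset.
    """
    n_test = math.ceil(test_fraction * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


n_train, n_test = split_sizes(442, 0.3)
print(f"train rows: {n_train}, test rows: {n_test}")
```

Under that assumption, the test metrics reported in this card are computed on roughly 130 rows, which is worth keeping in mind when comparing models whose scores differ by a few points.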
Features (10)

| Feature | Type | Description |
| --- | --- | --- |
| AGE | int | Age of patient |
| SEX | int | Gender |
| BMI | float | Body Mass Index |
| BP | float | Average Blood Pressure |
| S1 | int | Total Serum Cholesterol (tc) |
| S2 | float | Low-Density Lipoproteins (ldl) |
| S3 | float | High-Density Lipoproteins (hdl) |
| S4 | float | Total Cholesterol / HDL (tch) |
| S5 | float | Log of Serum Triglycerides (ltg) |
| S6 | int | Blood Sugar Level (glu) |
Performance

Best Model: Linear Regression

| Metric | Score |
| --- | --- |
| R² (Coefficient of Determination) | 0.3929 |
| MAE (Mean Absolute Error) | 44.62 |
| RMSE (Root Mean Squared Error) | 55.65 |
| CV R² (5-fold) | 0.4823 ± 0.0493 |
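
As a reminder of what the three test metrics measure, here is a minimal pure-Python sketch of their standard definitions. The function name `regression_metrics` and the sample values are hypothetical and chosen only for illustration. They are not taken from the dataset or from the training notebook, which presumably used scikit-learn's metric functions.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (r2, mae, rmse) for two equal-length sequences of numbers."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)               # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    mae = sum(abs(r) for r in residuals) / n
    rmse = math.sqrt(ss_res / n)
    return r2, mae, rmse


# Hypothetical values for illustration only, not real model output.
y_true = [100.0, 150.0, 200.0, 250.0]
y_pred = [110.0, 140.0, 190.0, 260.0]
r2, mae, rmse = regression_metrics(y_true, y_pred)
print(f"R2={r2:.4f}  MAE={mae:.2f}  RMSE={rmse:.2f}")
```

RMSE penalizes large errors more heavily than MAE, so it is never smaller than MAE on the same predictions. The gap between 55.65 and 44.62 in the table indicates that some predictions miss by considerably more than the average error.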
All Models Compared

| Model | R² | MAE | RMSE |
| --- | --- | --- | --- |
| Linear Regression | 0.3929 | 44.62 | 55.65 |
| Random Forest | 0.3011 | 47.86 | 59.71 |
| XGBoost | 0.2026 | 48.93 | 63.78 |
| Gradient Boosting | 0.1823 | 51.44 | 64.59 |

ℹ️ Note: An R² of ~0.39 is not unusual for clinical datasets where disease progression depends on many unmeasured factors (genetics, lifestyle, diet). The linear model outperforms the tree-based models here, most likely because the small sample size (442 rows) leaves the more flexible models prone to overfitting.
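Usage

    import pickle

    import numpy as np

    # Load the trained model (only unpickle files from a source you trust).
    with open("model.pkl", "rb") as f:
        model = pickle.load(f)

    # One patient; columns follow the Features table: AGE, SEX, BMI, BP, S1..S6.
    sample = np.array([[50, 1, 28.5, 90.0, 200, 120.5, 45.0, 4.5, 5.2, 95]])
    prediction = model.predict(sample)
    print(f"Predicted Disease Progression (Y): {prediction[0]:.2f}")
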
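
Because the model receives a bare array, a value placed in the wrong column would go unnoticed. The sketch below shows one way to build the input row from named values. It assumes the model expects the columns in the order of the Features table, which matches the sample in the usage example. `FEATURE_ORDER` and `build_feature_row` are hypothetical names and are not part of the published model.

```python
# Assumed column order, taken from the Features table in this card.
FEATURE_ORDER = ["AGE", "SEX", "BMI", "BP", "S1", "S2", "S3", "S4", "S5", "S6"]


def build_feature_row(patient):
    """Return the feature values as floats in FEATURE_ORDER.

    Raises ValueError if a feature is missing or an unknown key is present.
    """
    missing = [name for name in FEATURE_ORDER if name not in patient]
    extra = [name for name in patient if name not in FEATURE_ORDER]
    if missing or extra:
        raise ValueError(f"missing={missing}, unexpected={extra}")
    return [float(patient[name]) for name in FEATURE_ORDER]


# Same values as the sample in the usage example, keyed by feature name.
patient = {"AGE": 50, "SEX": 1, "BMI": 28.5, "BP": 90.0, "S1": 200,
           "S2": 120.5, "S3": 45.0, "S4": 4.5, "S5": 5.2, "S6": 95}
row = build_feature_row(patient)
print(row)
```

With the model loaded as in the usage example, the row can then be passed as `model.predict([row])`.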
Key Insights
- S5 (Log of Serum Triglycerides) has the largest coefficient by far (65.8), making it the strongest predictor in this model.
- SEX and BMI have the 2nd and 3rd largest coefficients.
- Simplicity wins: Linear Regression outperforms the more complex ensemble models on this small dataset (n=442). Simpler models often generalize better when data is limited.
- Stability: 5-fold cross-validation gives CV R² = 0.48 ± 0.05, a fairly narrow spread across folds. The CV mean is higher than the held-out test R² (0.39), which suggests the test score depends noticeably on how the data is split.
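
The "mean ± spread" notation used for the CV score can be reproduced from per-fold scores. The sketch below assumes the spread is the standard deviation across the five folds, which the card does not state explicitly. The fold scores are invented for illustration and are not the actual fold results. `summarize_cv` is a hypothetical helper name.

```python
import statistics


def summarize_cv(fold_scores):
    """Return (mean, std) of per-fold scores, using the population std."""
    mean = statistics.fmean(fold_scores)
    std = statistics.pstdev(fold_scores)
    return mean, std


# Hypothetical per-fold R² values, made up for illustration only.
fold_scores = [0.40, 0.45, 0.50, 0.55, 0.60]
mean, std = summarize_cv(fold_scores)
print(f"CV R2 = {mean:.4f} +/- {std:.4f}")
```

The population standard deviation is used here because that is what NumPy's `std()` computes by default. A sample standard deviation would give a slightly larger spread.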
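Feature Importance
Ranked by the absolute value of the fitted coefficients. These are raw coefficients, so if the features were not standardized before fitting (the usage example passes unscaled values), each magnitude reflects the feature's units as well as its influence.
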
| Rank | Feature | Coef (Abs) | Impact |
| --- | --- | --- | --- |
| 1 | S5 | 65.807 | ★★★★★ |
| 2 | SEX | 18.445 | ★★★ |
| 3 | BMI | 6.246 | ★★ |
| 4 | S4 | 3.196 | ★ |
| 5 | BP | 0.938 | |
| 6 | S1 | 0.694 | |
| 7 | S2 | 0.378 | |
| 8 | S3 | 0.257 | |
| 9 | AGE | 0.191 | |
| 10 | S6 | 0.111 | |
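
The ranking above can be reproduced by sorting on coefficient magnitude. The sketch below also shows a common way to put coefficients on a comparable footing: multiplying each one by its feature's standard deviation, which gives the change in Y per one-standard-deviation change in that feature. The standard deviations used here are placeholders, not values measured from the dataset, and the helper names are hypothetical.

```python
# Absolute coefficients copied from the Feature Importance table.
ABS_COEFS = {"S5": 65.807, "SEX": 18.445, "BMI": 6.246, "S4": 3.196,
             "BP": 0.938, "S1": 0.694, "S2": 0.378, "S3": 0.257,
             "AGE": 0.191, "S6": 0.111}


def rank_by_magnitude(values):
    """Return (name, value) pairs sorted from largest to smallest value."""
    return sorted(values.items(), key=lambda item: item[1], reverse=True)


def per_std_effects(abs_coefs, feature_stds):
    """Scale each |coefficient| by the standard deviation of its feature."""
    return {name: abs_coefs[name] * feature_stds[name] for name in feature_stds}


# Placeholder standard deviations, NOT measured from the dataset.
HYPOTHETICAL_STDS = {"S5": 0.5, "SEX": 0.5, "BMI": 4.0}

for name, value in rank_by_magnitude(ABS_COEFS)[:3]:
    print(f"raw     {name}: {value:.3f}")
for name, value in rank_by_magnitude(per_std_effects(ABS_COEFS, HYPOTHETICAL_STDS)):
    print(f"per-std {name}: {value:.3f}")
```

With real standard deviations computed from the training data, the per-std ordering may differ from the raw ordering, since features measured on a narrow scale (such as a log value or a binary code) tend to receive large raw coefficients.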
⚠️ Intended Use
- Primary: Educational / demonstration of ML workflow on Microsoft Fabric.
- Not intended for: Clinical decision-making without further validation.
Acknowledgments
- Microsoft Elevate and Dicoding: for organizing the Offline Workshop Training.
- Azure Open Datasets: for providing the Diabetes dataset.