EthanGabis committed on
Commit
6584d58
·
verified ·
1 Parent(s): 563065c

Upload 14 files

.gitattributes CHANGED
@@ -33,3 +33,12 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ classification_confusion_matrices.png filter=lfs diff=lfs merge=lfs -text
37
+ classification_distribution.png filter=lfs diff=lfs merge=lfs -text
38
+ eda_boxplots.png filter=lfs diff=lfs merge=lfs -text
39
+ eda_correlation_heatmap.png filter=lfs diff=lfs merge=lfs -text
40
+ eda_price_distribution.png filter=lfs diff=lfs merge=lfs -text
41
+ global_house_purchase_dataset.csv filter=lfs diff=lfs merge=lfs -text
42
+ regression_actual_vs_predicted.png filter=lfs diff=lfs merge=lfs -text
43
+ regression_comparison.png filter=lfs diff=lfs merge=lfs -text
44
+ regression_feature_importance.png filter=lfs diff=lfs merge=lfs -text
README.MD ADDED
@@ -0,0 +1,281 @@
1
+ ---
2
+ license: mit
3
+ tags:
4
+ - sklearn
5
+ - random-forest
6
+ - gradient-boosting
7
+ - regression
8
+ - classification
9
+ - house-price-prediction
10
+ - tabular-data
11
+ datasets:
12
+ - global-house-purchase-dataset
13
+ metrics:
14
+ - r2
15
+ - mae
16
+ - rmse
17
+ - accuracy
18
+ - f1
19
+ ---
20
+
21
+ # 🏠 House Price Prediction - Regression & Classification Models
22
+
23
+ ## 📹 Presentation Video
24
+
25
+ [![Watch the video](https://img.shields.io/badge/Watch-Presentation-red?style=for-the-badge&logo=youtube)](YOUR_VIDEO_LINK_HERE)
26
+
27
+ > **Replace `YOUR_VIDEO_LINK_HERE` with your YouTube/Loom/Vimeo link**
28
+
29
+ ---
30
+
31
+ ## 📋 Project Overview
32
+
33
+ This project predicts house prices using machine learning, implementing both **regression** (predicting exact price) and **classification** (predicting price category: Low/Medium/High) approaches.
34
+
35
+ ### Dataset
36
+ - **Name**: Global House Purchase Decision Dataset
37
+ - **Size**: 200,000 entries, 27 features
38
+ - **Target**: Property price prediction
39
+
40
+ ### Main Goals
41
+ 1. Predict house prices using regression models
42
+ 2. Classify properties into price categories (Low/Medium/High)
43
+ 3. Compare multiple ML algorithms and select the best performers
44
+
45
+ ---
46
+
47
+ ## πŸ” Exploratory Data Analysis (EDA)
48
+
49
+ ### Price Distribution
50
+ ![Price Distribution](eda_price_distribution.png)
51
+
52
+ **Key Insights:**
53
+ - Price distribution is slightly right-skewed
54
+ - Most properties fall in the low-to-medium price range
55
+ - Outliers exist in the high-price segment
56
+
57
+ ### Correlation Analysis
58
+ ![Correlation Heatmap](eda_correlation_heatmap.png)
59
+
60
+ **Key Findings:**
61
+ - Property size has the strongest correlation with price
62
+ - Location features (country, city) significantly impact price
63
+ - Customer salary shows moderate correlation with purchase decisions
64
+
65
+ ### Feature Distributions
66
+ ![Boxplots](eda_boxplots.png)
67
+
68
+ ---
69
+
70
+ ## πŸ› οΈ Feature Engineering
71
+
72
+ ### New Features Created (9 total):
73
+ 1. **property_age** = 2025 - constructed_year
74
+ 2. **rooms_to_bathrooms_ratio** = rooms / (bathrooms + 1)
75
+ 3. **total_amenities** = garage + garden
76
+ 4. **size_category** = Small/Medium/Large bins
77
+ 5. **safety_score** = 1 / (crime_cases + legal_cases + 1)
78
+ 6. **financial_capacity** = customer_salary - monthly_expenses
79
+ 7. **is_new_property** = 1 if property_age <= 5
80
+ 8. **high_satisfaction** = 1 if satisfaction >= median
81
+ 9. **location_quality** = neighbourhood_rating + connectivity_score
82
+
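The row-level formulas above can be sketched in plain Python. This is a minimal illustration, not the author's pipeline: the input keys (`constructed_year`, `rooms`, `bathrooms`, and so on) are assumed from the formulas and may not match the dataset's exact column names. `size_category` and `high_satisfaction` are left out because they depend on dataset-level bin edges and a median that the README does not state.

```python
REFERENCE_YEAR = 2025  # year used in the property_age formula above


def engineer_features(row: dict) -> dict:
    """Return the row-level engineered features for one property record.

    The input keys are assumed names, inferred from the README formulas.
    """
    property_age = REFERENCE_YEAR - row["constructed_year"]
    return {
        "property_age": property_age,
        # The +1 in each denominator avoids division by zero.
        "rooms_to_bathrooms_ratio": row["rooms"] / (row["bathrooms"] + 1),
        "total_amenities": row["garage"] + row["garden"],
        "safety_score": 1 / (row["crime_cases"] + row["legal_cases"] + 1),
        "financial_capacity": row["customer_salary"] - row["monthly_expenses"],
        "is_new_property": 1 if property_age <= 5 else 0,
        "location_quality": row["neighbourhood_rating"] + row["connectivity_score"],
    }


# Hypothetical record, for illustration only.
example = {
    "constructed_year": 2021, "rooms": 4, "bathrooms": 1,
    "garage": 1, "garden": 0, "crime_cases": 1, "legal_cases": 0,
    "customer_salary": 5000, "monthly_expenses": 3000,
    "neighbourhood_rating": 7, "connectivity_score": 6,
}
features = engineer_features(example)
```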
83
+ ### One-Hot Encoding
84
+ - Encoded: country, city, property_type, furnishing_status, size_category
85
+ - Created ~60 binary columns
86
+
87
+ ### Polynomial Features
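A small sketch of what one-hot encoding does to a single categorical column. The README does not say which library call produced the encoded columns, so the helper `one_hot` below is hypothetical. The lowercase `prefix_value` naming is assumed from feature names such as `country_uae` that appear in the feature importance list.

```python
def one_hot(values, prefix):
    """One-hot encode a list of category labels into 0/1 columns.

    Hypothetical helper: each distinct label becomes one binary column
    named "<prefix>_<label in lowercase>".
    """
    categories = sorted(set(values))
    return [
        {f"{prefix}_{cat.lower()}": int(value == cat) for cat in categories}
        for value in values
    ]


# Illustrative input: three rows of a "country" column.
encoded = one_hot(["UAE", "USA", "UAE"], prefix="country")
```

Each distinct label adds one column, which is how five categorical features can expand into roughly 60 binary columns.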
88
+ - property_size_sqft², customer_salary², property_age²
89
+ - Interaction terms (e.g., property_size × customer_salary)
90
+
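The squared and interaction terms listed above could be built as follows. This is a sketch under assumptions: the output column names (`*_squared`, `property_size_x_customer_salary`) are made up for illustration, and the README does not list every interaction term that was actually generated.

```python
def add_polynomial_terms(row: dict) -> dict:
    """Return a copy of row with squared terms and one interaction term added."""
    out = dict(row)
    # Squared terms for the three numeric features named in the README.
    for name in ("property_size_sqft", "customer_salary", "property_age"):
        out[f"{name}_squared"] = row[name] ** 2
    # One example interaction term: the product of two features.
    out["property_size_x_customer_salary"] = (
        row["property_size_sqft"] * row["customer_salary"]
    )
    return out


# Hypothetical (already scaled or raw) values, for illustration only.
expanded = add_polynomial_terms(
    {"property_size_sqft": 1000, "customer_salary": 50, "property_age": 4}
)
```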
91
+ ### Final Feature Count: 78 features
92
+
93
+ ---
94
+
95
+ ## 🎯 K-Means Clustering
96
+
97
+ ### Elbow Method for Optimal K
98
+ ![Elbow Method](clustering_elbow.png)
99
+
100
+ **Selected K = 4** based on the elbow curve
101
+
102
+ ### Cluster Visualization (PCA)
103
+ ![PCA Clusters](clustering_pca.png)
104
+
105
+ ### Cluster Interpretation:
106
+ | Cluster | Property Age | Salary | Purchase Rate | Characteristics |
107
+ |---------|--------------|--------|---------------|-----------------|
108
+ | 0 | Older (50 yrs) | Low (~$29k) | 18% | Budget buyers, older properties |
109
+ | 1 | Older (33 yrs) | High (~$79k) | 22% | Affluent buyers |
110
+ | 2 | Newer (17 yrs) | Low (~$29k) | 18% | First-time buyers, new builds |
111
+ | 3 | Older (33 yrs) | Medium (~$50k) | 35% | Sweet spot - highest purchase rate |
112
+
113
+ ---
114
+
115
+ ## 📈 Part 1: Regression Models
116
+
117
+ ### Baseline Model (Linear Regression - Part 3)
118
+ - **R²**: 0.1945 (19.45%)
119
+ - **MAE**: 0.4168
120
+ - **RMSE**: 0.5345
121
+
122
+ ### Improved Models with Engineered Features (Part 5)
123
+
124
+ ![Model Comparison](regression_comparison.png)
125
+
126
+ | Model | Train R² | Test R² | Test MAE | Test RMSE | Improvement |
127
+ |-------|----------|---------|----------|-----------|-------------|
128
+ | Baseline (Part 3) | 0.1919 | 0.1945 | 0.4168 | 0.5345 | - |
129
+ | Linear Regression | 0.9847 | 0.9845 | 0.0919 | 0.1237 | +406% |
130
+ | Random Forest | 0.9999 | 1.0000 | 0.0054 | 0.0063 | +414% |
131
+ | Gradient Boosting | 0.9995 | 0.9994 | 0.0177 | 0.0236 | +414% |
132
+
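The Improvement column appears to be the relative increase in test R² over the baseline, i.e. `(model R² - baseline R²) / baseline R² × 100`. That reading is inferred from the numbers in the table rather than stated in the README. A quick check reproduces the reported percentages:

```python
BASELINE_TEST_R2 = 0.1945  # Baseline (Part 3) test R² from the table

# Test R² values copied from the table above.
test_r2 = {
    "Linear Regression": 0.9845,
    "Random Forest": 1.0000,
    "Gradient Boosting": 0.9994,
}


def relative_improvement(new_r2, baseline_r2=BASELINE_TEST_R2):
    """Relative increase over the baseline, in percent."""
    return (new_r2 - baseline_r2) / baseline_r2 * 100


improvements = {
    name: round(relative_improvement(r2)) for name, r2 in test_r2.items()
}
```

Because the baseline R² is small, a relative figure like this grows quickly; the absolute gain for Random Forest is about 0.81 R² points.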
133
+ ### Actual vs Predicted
134
+ ![Actual vs Predicted](regression_actual_vs_predicted.png)
135
+
136
+ ### Feature Importance
137
+ ![Feature Importance](regression_feature_importance.png)
138
+
139
+ **Top 5 Most Important Features (Random Forest):**
140
+ 1. property_size_sqft × property_age (0.249)
141
+ 2. country_uae (0.197)
142
+ 3. country_usa (0.114)
143
+ 4. country_singapore (0.093)
144
+ 5. city_singapore (0.088)
145
+
146
+ ### πŸ† Regression Winner: Random Forest
147
+ - **Test R²**: 0.9999 (99.99%)
148
+ - **Test MAE**: 0.0054
149
+ - **Improvement**: 414% over baseline
150
+
151
+ ---
152
+
153
+ ## 📊 Part 2: Classification Models
154
+
155
+ ### Target Conversion Strategy
156
+ **Quantile Binning (3 Classes):**
157
+ - Class 0 (Low): Price < 33rd percentile
158
+ - Class 1 (Medium): 33rd - 66th percentile
159
+ - Class 2 (High): Price ≥ 66th percentile
160
+
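The binning rule above can be sketched with the standard library alone. This is an illustration rather than the author's code: the helper name `price_classes` is hypothetical, and the percentile interpolation used here (`method="inclusive"`) is an assumption, since the README does not say how the 33rd and 66th percentiles were computed.

```python
from statistics import quantiles


def price_classes(prices):
    """Label each price 0 (Low), 1 (Medium) or 2 (High) by percentile cut points."""
    # quantiles(..., n=100) returns the 1st..99th percentile cut points,
    # so index 32 is the 33rd percentile and index 65 is the 66th.
    cuts = quantiles(prices, n=100, method="inclusive")
    p33, p66 = cuts[32], cuts[65]
    return [0 if p < p33 else 1 if p < p66 else 2 for p in prices]


# Toy data: 100 evenly spaced prices.
prices = list(range(1, 101))
labels = price_classes(prices)
counts = [labels.count(c) for c in (0, 1, 2)]
```

On this toy data the three classes come out at 33, 33 and 34 items, the same roughly 33/33/34 split as the class distribution reported below.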
161
+ ### Class Distribution
162
+ ![Class Distribution](classification_distribution.png)
163
+
164
+ | Class | Label | Count | Percentage |
165
+ |-------|-------|-------|------------|
166
+ | 0 | Low | 20,398 | 33.0% |
167
+ | 1 | Medium | 20,398 | 33.0% |
168
+ | 2 | High | 21,016 | 34.0% |
169
+
170
+ ### Model Performance
171
+
172
+ ![Classification Comparison](classification_comparison.png)
173
+
174
+ | Model | Accuracy | Precision | Recall | F1-Score |
175
+ |-------|----------|-----------|--------|----------|
176
+ | Logistic Regression | 99.50% | 0.9949 | 0.9949 | 0.9949 |
177
+ | Random Forest | 98.58% | 0.9855 | 0.9855 | 0.9854 |
178
+ | Gradient Boosting | 99.56% | 0.9956 | 0.9956 | 0.9956 |
179
+
180
+ ### Confusion Matrices
181
+ ![Confusion Matrices](classification_confusion_matrices.png)
182
+
183
+ ### πŸ† Classification Winner: Gradient Boosting
184
+ - **Accuracy**: 99.56%
185
+ - **F1-Score**: 0.9956
186
+ - **Total Errors**: 55 out of 12,363 (0.44%)
187
+
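As a sanity check, the reported accuracy and error rate follow directly from the error count. The snippet uses only the figures quoted above:

```python
total_test_samples = 12_363  # size of the test set quoted above
total_errors = 55            # misclassified samples for Gradient Boosting

accuracy = (total_test_samples - total_errors) / total_test_samples
error_rate = total_errors / total_test_samples

accuracy_pct = round(accuracy * 100, 2)
error_rate_pct = round(error_rate * 100, 2)
```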
188
+ ---
189
+
190
+ ## 🧠 Key Insights & Learnings
191
+
192
+ ### Precision vs Recall Analysis
193
+ - **Precision is more important** for house price classification
194
+ - False Positives (overvaluation) are more critical than False Negatives
195
+ - Overpricing can lead to: lost buyers, investor losses, legal issues
196
+
197
+ ### Challenges Faced
198
+ 1. **Data Leakage**: Initial model showed R² = 1.0 due to leaky features (loan_amount, down_payment). Removed them for valid results.
199
+ 2. **Training Time**: Gradient Boosting was slow; optimized hyperparameters for faster training.
200
+ 3. **Feature Engineering**: Creating meaningful features required domain knowledge about real estate.
201
+
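A minimal sketch of the leakage fix described in item 1: drop the leaky columns before training. The column names come from the README, while the helper `drop_leaky_columns` and the sample record are hypothetical. The README does not explain why these two columns leak; a plausible reason, stated here as an assumption, is that together they closely determine the price.

```python
LEAKY_COLUMNS = {"loan_amount", "down_payment"}  # named in the README


def drop_leaky_columns(rows, leaky=LEAKY_COLUMNS):
    """Return copies of the records without the leaky feature columns."""
    return [
        {key: value for key, value in row.items() if key not in leaky}
        for row in rows
    ]


# Hypothetical record, for illustration only.
rows = [
    {
        "property_size_sqft": 1200,
        "loan_amount": 150000,
        "down_payment": 50000,
        "price": 200000,
    }
]
clean_rows = drop_leaky_columns(rows)
```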
202
+ ### Lessons Learned
203
+ 1. Always check for data leakage before trusting "perfect" results
204
+ 2. Feature engineering has a massive impact (over 400% relative improvement in test R²)
205
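+ 3. Ensemble methods (Random Forest, Gradient Boosting) outperformed the linear model in regression; in classification, Logistic Regression (99.50%) was close to Gradient Boosting (99.56%) and ahead of Random Forest (98.58%)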
206
+ 4. Location matters most for price: four of the top five Random Forest features are country or city indicators
207
+
208
+ ---
209
+
210
+ ## πŸ“ Repository Contents
211
+
212
+ | File | Description |
213
+ |------|-------------|
214
+ | `random_forest_house_price_model.pkl` | Regression model (Random Forest) |
215
+ | `classification_model_winner.pkl` | Classification model (Gradient Boosting) |
216
+ | `global_house_purchase_dataset.csv` | Engineered dataset |
217
+ | `README.MD` | This documentation |
218
+
219
+ ---
220
+
221
+ ## 🚀 Usage
222
+
223
+ ### Load Regression Model
224
+ ```python
225
+ import pickle
226
+
227
+ # Load model
228
+ with open('random_forest_house_price_model.pkl', 'rb') as f:
229
+     regression_model = pickle.load(f)
230
+
231
+ # Predict (features must be scaled and in same format as training)
232
+ predictions = regression_model.predict(X_scaled)
233
+ ```
234
+
235
+ ### Load Classification Model
236
+ ```python
237
+ import pickle
238
+
239
+ # Load model
240
+ with open('classification_model_winner.pkl', 'rb') as f:
241
+     classification_model = pickle.load(f)
242
+
243
+ # Predict price class (0=Low, 1=Medium, 2=High)
244
+ price_class = classification_model.predict(X_scaled)
245
+ ```
246
+
247
+ ---
248
+
249
+ ## 📊 Model Specifications
250
+
251
+ ### Regression Model (Random Forest)
252
+ ```
253
+ Algorithm: RandomForestRegressor
254
+ n_estimators: 50
255
+ max_depth: 10
256
+ min_samples_split: 20
257
+ min_samples_leaf: 10
258
+ random_state: 42
259
+ ```
260
+
261
+ ### Classification Model (Gradient Boosting)
262
+ ```
263
+ Algorithm: GradientBoostingClassifier
264
+ n_estimators: 50
265
+ max_depth: 3
266
+ learning_rate: 0.2
267
+ subsample: 0.8
268
+ random_state: 42
269
+ ```
270
+
271
+ ---
272
+
273
+ ## πŸ‘¨β€πŸ’» Author
274
+
275
+ Created as part of a Data Science course assignment.
276
+
277
+ ---
278
+
279
+ ## 📜 License
280
+
281
+ MIT License
classification_comparison.png ADDED
classification_confusion_matrices.png ADDED

Git LFS Details

  • SHA256: 9c4b65b94673429c587675384b03924dccd930da7543e032458d6e3dba6b7958
  • Pointer size: 131 Bytes
  • Size of remote file: 134 kB
classification_distribution.png ADDED

Git LFS Details

  • SHA256: 31f20037554fad55f0b0bceba136db1b450008c56bd429dd6ed864da811b9dda
  • Pointer size: 131 Bytes
  • Size of remote file: 106 kB
classification_model_winner.pkl ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:cf6447a9c8e59c595314158cc43748164282b070b3aa27fcf7b8539e1e9839ed
3
+ size 208652
clustering_pca.png ADDED
eda_boxplots.png ADDED

Git LFS Details

  • SHA256: 7fdf68528c4095b297b8a25435c3363d49c7e63bb5fc9b1ba220bc551e2bd731
  • Pointer size: 131 Bytes
  • Size of remote file: 210 kB
eda_correlation_heatmap.png ADDED

Git LFS Details

  • SHA256: b4327ee2a5abbe7648674c5734aea1a83ddfb4dd1a5c899d34e0bc7d6da2f07d
  • Pointer size: 131 Bytes
  • Size of remote file: 453 kB
eda_price_distribution.png ADDED

Git LFS Details

  • SHA256: cdbbd0cd5e11e475eef14b6a75d340cdd0caa686548bad9483a6ec775815c6a1
  • Pointer size: 131 Bytes
  • Size of remote file: 158 kB
global_house_purchase_dataset.csv ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3ff0f9d0ce76b0253387745d70a920b7809115b80f36ba89838d0382a8249be4
3
+ size 24058205
random_forest_house_price_model.pkl ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:20cefda1603ab0db6a94bbb7f6827916fb9f363a45c0fbac0983d73bcd54ce0a
3
+ size 47446799
regression_actual_vs_predicted.png ADDED

Git LFS Details

  • SHA256: c7382ff230d81b43f7f0cec8346d595d4deadb2b0db6e4475ffb442e4654f930
  • Pointer size: 131 Bytes
  • Size of remote file: 202 kB
regression_comparison.png ADDED

Git LFS Details

  • SHA256: 14f8ab522784e7e0633974836b1bf939dc9dc7a911e04ab1490f2956556b220b
  • Pointer size: 131 Bytes
  • Size of remote file: 125 kB
regression_feature_importance.png ADDED

Git LFS Details

  • SHA256: 3ad5a43674de49b8784f8446bf1e9b2deee1aeb99e4cbd7569b9059126a5aac3
  • Pointer size: 131 Bytes
  • Size of remote file: 173 kB