Example: Decision Tree Regression (DTR)
Problem Statement
Suppose we want to predict salary based on years of experience.
Dataset
| Experience (X) | Salary (Y) |
|---|---|
| 1 | 20 |
| 2 | 25 |
| 3 | 30 |
| 4 | 80 |
Main Idea of Decision Tree Regression
Decision Tree Regression works by splitting the dataset into smaller regions and predicting one value (the mean of the targets) for each region.
At each step, the algorithm looks for the split that minimizes prediction error, measured with Mean Squared Error (MSE).
Step 1: Find Possible Split Points
The tree first sorts input values:
1, 2, 3, 4
Then it calculates:
Midpoints between consecutive values
Possible Split Calculations
Between 1 and 2:
(1 + 2) / 2
3 / 2
1.5
Between 2 and 3:
(2 + 3) / 2
5 / 2
2.5
Between 3 and 4:
(3 + 4) / 2
7 / 2
3.5
Final Possible Splits
1.5, 2.5, 3.5
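The midpoint step can be sketched in a few lines of Python. This is a minimal illustration, not a library implementation: the helper name `candidate_splits` is hypothetical, and removing duplicate values before pairing neighbours is an assumption (the dataset here has no duplicates).

```python
# Candidate thresholds: midpoints between consecutive sorted feature values.
experience = [1, 2, 3, 4]

def candidate_splits(values):
    """Return the midpoints between neighbouring distinct values."""
    ordered = sorted(set(values))  # assumption: duplicates are collapsed first
    return [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]

splits = candidate_splits(experience)
print(splits)  # [1.5, 2.5, 3.5]
```

With n distinct values there are n - 1 candidate splits, which is why four experience values give three thresholds.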
The algorithm will:
- Try every split
- Calculate MSE
- Select the split with minimum error
Step 2: Calculate Initial Mean
Before splitting, all samples sit in one region, so the prediction is the overall mean salary.
Formula:
Mean = Sum of salaries / Number of samples
Calculation:
(20 + 25 + 30 + 80) / 4
155 / 4
38.75
Initially:
Every prediction = 38.75
Step 3: Calculate Initial MSE
Formula:
MSE = mean((Actual - Predicted)²)
Error Table
| Actual Salary | Predicted Salary | Error | Error² |
|---|---|---|---|
| 20 | 38.75 | -18.75 | 351.56 |
| 25 | 38.75 | -13.75 | 189.06 |
| 30 | 38.75 | -8.75 | 76.56 |
| 80 | 38.75 | 41.25 | 1701.56 |
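Sum of Squared Errors
351.5625 + 189.0625 + 76.5625 + 1701.5625 (unrounded values of the Error² column)
= 2318.75
Initial MSE
2318.75 / 4
579.69
This is the error of predicting the overall mean for every sample; the salary of 80 contributes most of it.
So the tree tries splitting to reduce it.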
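The baseline mean and MSE can be checked with a short script. This is a sketch using plain Python lists; the variable names are illustrative. Working with unrounded squared errors gives a sum of 2318.75 and an MSE of 579.6875.

```python
salaries = [20, 25, 30, 80]

# Before any split, every sample is predicted as the overall mean.
mean_salary = sum(salaries) / len(salaries)

# Squared error of each sample against that single prediction.
squared_errors = [(y - mean_salary) ** 2 for y in salaries]
sse = sum(squared_errors)
mse = sse / len(salaries)

print(mean_salary)     # 38.75
print(squared_errors)  # [351.5625, 189.0625, 76.5625, 1701.5625]
print(sse)             # 2318.75
print(mse)             # 579.6875
```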
Step 4: Try Split = 1.5
Split rule:
Experience < 1.5
Left Region
| Experience | Salary |
|---|---|
| 1 | 20 |
Prediction:
20
MSE:
0
Right Region
| Experience | Salary |
|---|---|
| 2 | 25 |
| 3 | 30 |
| 4 | 80 |
Mean:
(25 + 30 + 80) / 3
135 / 3
45
Right Region Error Table
| Actual | Predicted | Error² |
|---|---|---|
| 25 | 45 | 400 |
| 30 | 45 | 225 |
| 80 | 45 | 1225 |
Total Error
0 + 400 + 225 + 1225 = 1850
Total MSE
1850 / 4 (divide by all 4 samples, since both regions are combined)
462.5
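Evaluating a single split can be written as a small function. The sketch below assumes the rule used in this example (samples with Experience below the threshold go left, the rest go right) and divides the combined squared error by the total number of samples; `sse` and `split_mse` are hypothetical helper names.

```python
data = [(1, 20), (2, 25), (3, 30), (4, 80)]  # (experience, salary)

def sse(values):
    """Sum of squared errors around the mean; 0 for an empty region."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def split_mse(rows, threshold):
    """Combined MSE when each region predicts its own mean."""
    left = [y for x, y in rows if x < threshold]
    right = [y for x, y in rows if x >= threshold]
    return (sse(left) + sse(right)) / len(rows)

result = split_mse(data, 1.5)
print(result)  # 462.5
```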
Step 5: Try Split = 2.5
Split rule:
Experience < 2.5
Left Region
| Experience | Salary |
|---|---|
| 1 | 20 |
| 2 | 25 |
Mean:
(20 + 25) / 2
22.5
Left Error
| Actual | Predicted | Error² |
|---|---|---|
| 20 | 22.5 | 6.25 |
| 25 | 22.5 | 6.25 |
Total:
12.5
Right Region
| Experience | Salary |
|---|---|
| 3 | 30 |
| 4 | 80 |
Mean:
(30 + 80) / 2
55
Right Error
| Actual | Predicted | Error² |
|---|---|---|
| 30 | 55 | 625 |
| 80 | 55 | 625 |
Total:
1250
Total Error
12.5 + 1250
1262.5
Total MSE
1262.5 / 4
315.625 ≈ 315.63
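The same figure can be reached another way: as the average of the two region MSEs, weighted by each region's share of the samples. The sketch below checks this for the 2.5 split; `mse` is an illustrative helper, and the equivalence holds because each region's MSE is its squared-error total divided by its own size.

```python
left = [20, 25]   # salaries with Experience < 2.5
right = [30, 80]  # salaries with Experience >= 2.5

def mse(values):
    """Mean squared error of a region that predicts its own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

n = len(left) + len(right)
weighted = (len(left) / n) * mse(left) + (len(right) / n) * mse(right)

print(mse(left))   # 6.25
print(mse(right))  # 625.0
print(weighted)    # 315.625
```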
Step 6: Try Split = 3.5
Split rule:
Experience < 3.5
Left Region
| Experience | Salary |
|---|---|
| 1 | 20 |
| 2 | 25 |
| 3 | 30 |
Mean:
(20 + 25 + 30) / 3
75 / 3
25
Left Error
| Actual | Predicted | Error² |
|---|---|---|
| 20 | 25 | 25 |
| 25 | 25 | 0 |
| 30 | 25 | 25 |
Total:
50
Right Region
| Experience | Salary |
|---|---|
| 4 | 80 |
Prediction:
80
Error:
0
Total Error
50 + 0
50
Total MSE
50 / 4
12.5
Step 7: Compare All Splits
| Split | MSE |
|---|---|
| 1.5 | 462.5 |
| 2.5 | 315.63 |
| 3.5 | 12.5 |
Best Split
The minimum MSE is:
12.5
So the best split becomes:
Experience < 3.5
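The whole search (generate candidates, score each one, keep the minimum) fits in one short script. This is a sketch for a single feature and a single split only; the helper names are hypothetical, and it is not meant to mirror how any particular library implements the search.

```python
data = [(1, 20), (2, 25), (3, 30), (4, 80)]  # (experience, salary)

def sse(values):
    """Sum of squared errors around the mean of a non-empty region."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def split_mse(rows, threshold):
    """Combined MSE when each side of the threshold predicts its mean."""
    left = [y for x, y in rows if x < threshold]
    right = [y for x, y in rows if x >= threshold]
    return (sse(left) + sse(right)) / len(rows)

# Midpoints between neighbouring experience values are the candidates.
xs = sorted({x for x, _ in data})
candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]

# Score every candidate and keep the one with the lowest MSE.
scores = {t: split_mse(data, t) for t in candidates}
best = min(scores, key=scores.get)

print(scores)  # {1.5: 462.5, 2.5: 315.625, 3.5: 12.5}
print(best)    # 3.5
```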
Final Decision Tree
        Experience < 3.5
           /        \
         Yes         No
          |           |
     Predict 25   Predict 80
Prediction Example
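Suppose:
Experience = 2
Since 2 < 3.5, the sample follows the Yes branch.
Prediction:
25
Suppose:
Experience = 4
Since 4 is not less than 3.5, the sample follows the No branch.
Prediction:
80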
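The finished tree is just one comparison, so prediction can be sketched as a single function. The name `predict` and its default arguments are illustrative, with the threshold and leaf values taken from the tree in this example.

```python
def predict(experience, threshold=3.5, left_value=25.0, right_value=80.0):
    """Return the leaf mean for the region the input falls into."""
    return left_value if experience < threshold else right_value

print(predict(2))  # 25.0
print(predict(4))  # 80.0
```

Any input below 3.5 gets the same prediction of 25, and any input at or above 3.5 gets 80, which is why the output of a regression tree is a step function rather than a smooth curve.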
Summary
Decision Tree Regression automatically generates possible split points using midpoints between neighboring values. It evaluates every split using Mean Squared Error (MSE) and selects the split that minimizes prediction error. The final tree predicts the average value within each region.