10 Decision Trees
Consider the following dataset of four observations, each with a single predictor variable \(x\) and a response variable \(y\):
| Observation | \(x\) | \(y\) |
|---|---|---|
| 1 | 1 | 2 |
| 2 | 2 | 3 |
| 3 | 3 | 2.5 |
| 4 | 4 | 5 |
Construct a regression decision tree by performing a binary split on \(x\) that minimizes the sum of squared errors (SSE). Show your calculations to determine the optimal split point.
Provide a plot of the resulting tree model overlaid on a scatter plot of the data.
To find the optimal split point, evaluate every candidate split, conventionally taken as the midpoint between each pair of consecutive \(x\) values. For each candidate split:
- Divide the data into two regions
- Calculate the mean \(y\) value in each region
- Compute the SSE for each region: \(\text{SSE} = \sum_{i \in \text{region}} (y_i - \bar{y}_{\text{region}})^2\)
- Sum the SSEs from both regions to get the total SSE for that split
The optimal split minimizes the total SSE.
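The procedure above can be sketched in Python. This is a minimal illustration, not part of the original exercise; the helper `total_sse` and the variable names are my own:

```python
def total_sse(x, y, split):
    """Total SSE of a one-split regression tree at the given split point."""
    # Partition the responses into the two regions induced by the split
    left = [yi for xi, yi in zip(x, y) if xi <= split]
    right = [yi for xi, yi in zip(x, y) if xi > split]
    sse = 0.0
    for region in (left, right):
        if region:
            # Each region predicts its mean; accumulate squared deviations
            m = sum(region) / len(region)
            sse += sum((yi - m) ** 2 for yi in region)
    return sse

x = [1, 2, 3, 4]
y = [2, 3, 2.5, 5]

# Candidate splits are the midpoints between consecutive x values
candidates = [(a + b) / 2 for a, b in zip(x, x[1:])]
best = min(candidates, key=lambda s: total_sse(x, y, s))
```

Running this reproduces the search carried out by hand in Part 1 below, with `best` holding the SSE-minimizing split point.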
Create a scatter plot of the data points and overlay horizontal line segments representing the predicted values in each region created by your optimal split. Include a vertical dashed line to show the split point.
Part 1: Constructing the Regression Tree
We need to find the split point on \(x\) that minimizes the total SSE. The candidate split points are the midpoints between consecutive pairs of \(x\) values:
- Split at \(x = 1.5\)
- Split at \(x = 2.5\)
- Split at \(x = 3.5\)
1. Split at \(x = 1.5\)
Regions: \[R_1: x \leq 1.5 \quad (\text{Observation }1)\] \[R_2: x > 1.5 \quad (\text{Observations }2-4)\]
Means: \[\bar{y}_{R_1} = y_1 = 2\] \[\bar{y}_{R_2} = \frac{3 + 2.5 + 5}{3} = \frac{10.5}{3} = 3.5\]
SSE: \[\text{SSE}_{R_1} = (2 - 2)^2 = 0\] \[\text{SSE}_{R_2} = (3 - 3.5)^2 + (2.5 - 3.5)^2 + (5 - 3.5)^2 = 0.25 + 1 + 2.25 = 3.5\] \[\text{Total SSE} = 0 + 3.5 = 3.5\]
2. Split at \(x = 2.5\)
Regions: \[R_1: x \leq 2.5 \quad (\text{Observations }1-2)\] \[R_2: x > 2.5 \quad (\text{Observations }3-4)\]
Means: \[\bar{y}_{R_1} = \frac{2 + 3}{2} = 2.5\] \[\bar{y}_{R_2} = \frac{2.5 + 5}{2} = 3.75\]
SSE: \[\text{SSE}_{R_1} = (2 - 2.5)^2 + (3 - 2.5)^2 = 0.25 + 0.25 = 0.5\] \[\text{SSE}_{R_2} = (2.5 - 3.75)^2 + (5 - 3.75)^2 = 1.5625 + 1.5625 = 3.125\] \[\text{Total SSE} = 0.5 + 3.125 = 3.625\]
3. Split at \(x = 3.5\)
Regions: \[R_1: x \leq 3.5 \quad (\text{Observations }1-3)\] \[R_2: x > 3.5 \quad (\text{Observation }4)\]
Means: \[\bar{y}_{R_1} = \frac{2 + 3 + 2.5}{3} = 2.5\] \[\bar{y}_{R_2} = y_4 = 5\]
SSE: \[\text{SSE}_{R_1} = (2 - 2.5)^2 + (3 - 2.5)^2 + (2.5 - 2.5)^2 = 0.25 + 0.25 + 0 = 0.5\] \[\text{SSE}_{R_2} = (5 - 5)^2 = 0\] \[\text{Total SSE} = 0.5 + 0 = 0.5\]
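As a quick numeric check (a sketch, not part of the original solution), the three totals computed above can be recomputed directly:

```python
from statistics import mean

x = [1, 2, 3, 4]
y = [2, 3, 2.5, 5]

totals = {}
for split in (1.5, 2.5, 3.5):
    # Split the responses into the two regions induced by this candidate
    left = [yi for xi, yi in zip(x, y) if xi <= split]
    right = [yi for xi, yi in zip(x, y) if xi > split]
    # Total SSE is the sum of squared deviations from each region's mean
    totals[split] = (sum((v - mean(left)) ** 2 for v in left)
                     + sum((v - mean(right)) ** 2 for v in right))
    print(f"split at x = {split}: total SSE = {totals[split]}")
```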
Conclusion:
The optimal split is at \(x = 3.5\), resulting in the lowest total SSE of \(0.5\).
Part 2: Plot of the Tree Model
In the resulting plot, the vertical dashed line marks the split at \(x = 3.5\), and the red horizontal segments show the predicted value in each region:
\[\hat{y} = \begin{cases} 2.5 & \text{if } x \leq 3.5 \\ 5 & \text{if } x > 3.5 \end{cases}\]
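The plot described above can be produced with a short matplotlib sketch; the prediction rule comes from the piecewise formula, while the output filename `tree_fit.png` and the styling choices are my own assumptions:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt

def predict(x):
    # Piecewise-constant prediction from the optimal split at x = 3.5
    return 2.5 if x <= 3.5 else 5.0

xs = [1, 2, 3, 4]
ys = [2, 3, 2.5, 5]

fig, ax = plt.subplots()
ax.scatter(xs, ys, label="data")
# Horizontal segments at the region means, red as in the description
ax.hlines(2.5, 1, 3.5, colors="red", label=r"$\hat{y}$")
ax.hlines(5.0, 3.5, 4, colors="red")
# Vertical dashed line at the split point
ax.axvline(3.5, linestyle="--", color="gray", label="split at x = 3.5")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
fig.savefig("tree_fit.png")
```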