10 Decision Trees

Exercise 10.1: Regression Decision Tree

Consider the following dataset consisting of 4 observations with a single predictor variable \(x\) and a response variable \(y\):

Observation   \(x\)   \(y\)
1             1       2
2             2       3
3             3       2.5
4             4       5
  1. Construct a regression decision tree by performing a binary split on \(x\) that minimizes the sum of squared errors (SSE). Show your calculations to determine the optimal split point.

  2. Provide a plot of the resulting tree model overlaid on a scatter plot of the data.

To find the optimal split point, evaluate the candidate split points, conventionally taken as the midpoints between consecutive \(x\) values. For each candidate split:

  1. Divide the data into two regions
  2. Calculate the mean \(y\) value in each region
  3. Compute the SSE for each region: \(\text{SSE} = \sum_{i \in \text{region}} (y_i - \bar{y}_{\text{region}})^2\)
  4. Sum the SSEs from both regions to get the total SSE for that split

The optimal split minimizes the total SSE.
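
The same search can also be done programmatically. Below is a minimal Python/NumPy sketch of this procedure; the function and variable names are our own, and candidate splits are taken at the midpoints between consecutive \(x\) values:

```python
import numpy as np

x = np.array([1, 2, 3, 4], dtype=float)
y = np.array([2, 3, 2.5, 5])

def total_sse(split):
    """Total SSE of a binary split of the data at the given x threshold."""
    left, right = y[x <= split], y[x > split]
    return ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()

# Candidate splits: midpoints between consecutive sorted x values.
candidates = (x[:-1] + x[1:]) / 2  # [1.5, 2.5, 3.5]
for s in candidates:
    print(f"split at {s}: total SSE = {total_sse(s):.4g}")
# split at 1.5: total SSE = 3.5
# split at 2.5: total SSE = 3.625
# split at 3.5: total SSE = 0.5
```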

Create a scatter plot of the data points and overlay horizontal line segments representing the predicted values in each region created by your optimal split. Include a vertical dashed line to show the split point.

Part 1: Constructing the Regression Tree

We need to find the split point on \(x\) that minimizes the total SSE. The candidate split points are the midpoints between each pair of consecutive \(x\) values:

  • Split at \(x = 1.5\)
  • Split at \(x = 2.5\)
  • Split at \(x = 3.5\)

1. Split at \(x = 1.5\)

Regions: \[R_1: x \leq 1.5 \quad (\text{Observation }1)\] \[R_2: x > 1.5 \quad (\text{Observations }2-4)\]

Means: \[\bar{y}_{R_1} = y_1 = 2\] \[\bar{y}_{R_2} = \frac{3 + 2.5 + 5}{3} = \frac{10.5}{3} = 3.5\]

SSE: \[\text{SSE}_{R_1} = (2 - 2)^2 = 0\] \[\text{SSE}_{R_2} = (3 - 3.5)^2 + (2.5 - 3.5)^2 + (5 - 3.5)^2 = 0.25 + 1 + 2.25 = 3.5\] \[\text{Total SSE} = 0 + 3.5 = 3.5\]

2. Split at \(x = 2.5\)

Regions: \[R_1: x \leq 2.5 \quad (\text{Observations }1-2)\] \[R_2: x > 2.5 \quad (\text{Observations }3-4)\]

Means: \[\bar{y}_{R_1} = \frac{2 + 3}{2} = 2.5\] \[\bar{y}_{R_2} = \frac{2.5 + 5}{2} = 3.75\]

SSE: \[\text{SSE}_{R_1} = (2 - 2.5)^2 + (3 - 2.5)^2 = 0.25 + 0.25 = 0.5\] \[\text{SSE}_{R_2} = (2.5 - 3.75)^2 + (5 - 3.75)^2 = 1.5625 + 1.5625 = 3.125\] \[\text{Total SSE} = 0.5 + 3.125 = 3.625\]

3. Split at \(x = 3.5\)

Regions: \[R_1: x \leq 3.5 \quad (\text{Observations }1-3)\] \[R_2: x > 3.5 \quad (\text{Observation }4)\]

Means: \[\bar{y}_{R_1} = \frac{2 + 3 + 2.5}{3} = 2.5\] \[\bar{y}_{R_2} = y_4 = 5\]

SSE: \[\text{SSE}_{R_1} = (2 - 2.5)^2 + (3 - 2.5)^2 + (2.5 - 2.5)^2 = 0.25 + 0.25 + 0 = 0.5\] \[\text{SSE}_{R_2} = (5 - 5)^2 = 0\] \[\text{Total SSE} = 0.5 + 0 = 0.5\]

Conclusion:

The optimal split is at \(x = 3.5\), which yields the lowest total SSE of \(0.5\) (compared with \(3.5\) for the split at \(x = 1.5\) and \(3.625\) for the split at \(x = 2.5\)).
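
As an optional cross-check (not required by the exercise), a depth-one scikit-learn tree recovers the same split point and leaf predictions:

```python
from sklearn.tree import DecisionTreeRegressor

X = [[1], [2], [3], [4]]
y = [2, 3, 2.5, 5]

# A depth-1 tree (a "stump") performs exactly one binary split.
stump = DecisionTreeRegressor(max_depth=1).fit(X, y)
print(stump.tree_.threshold[0])   # 3.5  (root split point)
print(stump.predict([[2], [4]]))  # [2.5 5. ]  (leaf means)
```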

Part 2: Plot of the Tree Model

Figure 1: Regression tree model with split at \(x = 3.5\)

The vertical dashed line represents the split at \(x = 3.5\). The red lines show the predicted values in each region:

\[\hat{y} = \begin{cases} 2.5 & \text{if } x \leq 3.5 \\ 5 & \text{if } x > 3.5 \end{cases}\]
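
A minimal matplotlib sketch that produces a plot like Figure 1; the styling choices (colors, labels) are our own, but the elements follow the description above:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4], dtype=float)
y = np.array([2, 3, 2.5, 5])

fig, ax = plt.subplots()
ax.scatter(x, y, color="black", zorder=3, label="data")
# Horizontal segments: the predicted value in each region.
ax.hlines(2.5, xmin=x.min(), xmax=3.5, color="red", label=r"$\hat{y}$")
ax.hlines(5.0, xmin=3.5, xmax=x.max(), color="red")
# Dashed vertical line at the split point.
ax.axvline(3.5, linestyle="--", color="gray", label="split at $x = 3.5$")
ax.set_xlabel("$x$")
ax.set_ylabel("$y$")
ax.legend()
plt.show()
```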