11 Random Forests

11.1 Exercises

Exercise 11.1: Random Forests

Suppose you have built a random forest model for the quantitative business task at hand.

  1. Explain briefly how random forests improve upon single regression trees in terms of model variance.

  2. Given the following table of Out-Of-Bag (OOB) error estimates for different numbers of trees in the forest, interpret the trend in OOB error as the number of trees increases.

    Number of Trees    OOB Error
    ---------------    ---------
    3                  0.9
    5                  0.7
    10                 0.6
    20                 0.55
    50                 0.5
  3. Impact of Hyperparameters on Random Forest Performance

    You are tasked with building a Random Forest model to forecast inflation rates using a dataset with 50 predictor variables. Consider the following hyperparameters: the number of trees (n_estimators), the number of features considered at each split (max_features), and the minimum number of samples required to be at a leaf node (min_samples_leaf).

    1. Predict how increasing n_estimators while keeping other hyperparameters constant would affect the model’s variance and computational cost.

    2. Discuss the potential consequences of setting max_features too low or too high. How does this choice affect the diversity among trees and the overall ensemble performance?

    3. Explain the impact of increasing min_samples_leaf on the complexity of individual trees and the Random Forest’s ability to capture underlying data patterns.

Hints (exam level)

Think about how ensemble methods work:

  • How does averaging multiple models affect variance?
  • What role does bootstrap sampling play?
  • How does random feature selection at each split contribute to model diversity?

Consider:

  • What does the decreasing trend tell us about ensemble performance?
  • Why might the improvement rate slow down after a certain number of trees?
  • What is the practical implication for choosing the number of trees?

For each hyperparameter, think about:

  • n_estimators: Law of large numbers and computational complexity
  • max_features: Tree correlation vs. individual tree strength
  • min_samples_leaf: Tree complexity and generalization

Part 1: Explanation of Random Forests Lowering Variance

Random forests improve upon single regression trees by reducing model variance through ensemble averaging. By building many trees on bootstrap samples of the training data and averaging their predictions, a random forest is far less sensitive to the idiosyncrasies of any particular sample than a single deep tree. Additionally, by considering only a random subset of features at each split, it reduces the correlation between individual trees, which lowers the variance of the average further, at the cost of at most a small increase in bias.
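To make this concrete, the following sketch (assuming scikit-learn is available; the synthetic dataset and the prediction_spread helper are illustrative, not part of the exercise) refits a single regression tree and a 50-tree forest on repeated bootstrap resamples and compares how much their test-set predictions vary across fits:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
    X_train, X_test, y_train, y_test = X[:400], X[400:], y[:400], y[400:]

    def prediction_spread(make_model, n_repeats=30):
        """Refit on bootstrap resamples; return the average variance of the
        test-set predictions across fits (a proxy for model variance)."""
        preds = []
        for i in range(n_repeats):
            idx = rng.integers(0, len(X_train), size=len(X_train))  # bootstrap resample
            preds.append(make_model(i).fit(X_train[idx], y_train[idx]).predict(X_test))
        return np.mean(np.var(preds, axis=0))

    tree_var = prediction_spread(lambda i: DecisionTreeRegressor(random_state=i))
    forest_var = prediction_spread(lambda i: RandomForestRegressor(n_estimators=50, random_state=i))
    print(f"single tree: {tree_var:.1f}   forest: {forest_var:.1f}")
    # The forest's predictions should vary far less across resamples.

Because each forest is itself an average of 50 trees, its predictions fluctuate much less from resample to resample, which is exactly the variance reduction described above.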

Part 2: Interpretation of OOB Error Trend

The table shows that as the number of trees increases, the Out-Of-Bag (OOB) error decreases:

  • From 3 to 10 trees, the OOB error drops from 0.9 to 0.6.
  • Beyond 10 trees, the reduction in OOB error becomes more gradual.

This trend indicates that adding more trees improves the model’s performance by reducing variance. However, after a certain point (e.g., 20 trees), the benefit of adding more trees diminishes, suggesting that the ensemble has stabilized.

Figure 1: Out-of-Bag Error vs Number of Trees
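A curve like the one in Figure 1 can be reproduced with scikit-learn's built-in OOB machinery (a sketch on synthetic data; here the OOB error is taken to be the mean squared error of the out-of-bag predictions):

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor

    X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

    for n_trees in [3, 5, 10, 20, 50]:
        forest = RandomForestRegressor(
            n_estimators=n_trees,
            oob_score=True,      # track out-of-bag predictions during fitting
            random_state=0,
        ).fit(X, y)
        # OOB error as the MSE of the out-of-bag predictions. With very few
        # trees, some samples are never out-of-bag and sklearn emits a warning.
        oob_mse = np.mean((y - forest.oob_prediction_) ** 2)
        print(f"{n_trees:>3} trees -> OOB MSE {oob_mse:.1f}")

The absolute numbers depend on the data, but the shape of the curve matches the table: a steep drop at first, then a plateau.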

Part 3: Impact of Hyperparameters on Random Forest Performance

  1. n_estimators: Increasing the number of trees reduces variance due to averaging effects (following the law of large numbers). However, this comes at the cost of increased computational resources and training time. The variance reduction follows a pattern of diminishing returns: the first few trees provide most of the benefit, as the OOB sweep in Part 2 illustrates.

  2. max_features:

    • Too low: Increases diversity (reduces correlation between trees) but may limit each tree’s ability to capture complex patterns. Trees become weaker individually.
    • Too high: May decrease diversity, leading to highly correlated trees that don’t benefit as much from ensemble averaging. However, individual trees can capture more intricate relationships.
  3. min_samples_leaf:

    • Higher values: Lead to simpler trees (fewer splits), potentially reducing overfitting but possibly increasing bias; the model may miss fine-grained patterns in the data.
    • Lower values: Allow deeper, more complex trees that can capture such fine-grained patterns, but may increase variance and risk overfitting.
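The following sketch makes the max_features and min_samples_leaf trade-offs above visible, with synthetic data standing in for the 50-predictor inflation dataset (the grid values are illustrative, and OOB R² serves as the quality measure):

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor

    # Synthetic stand-in for the 50-predictor inflation dataset.
    X, y = make_regression(n_samples=1000, n_features=50, n_informative=20,
                           noise=5.0, random_state=0)

    for max_features in [5, 15, 50]:          # few, moderate, all features per split
        for min_samples_leaf in [1, 10, 50]:  # deep trees ... shallow trees
            forest = RandomForestRegressor(
                n_estimators=200,
                max_features=max_features,
                min_samples_leaf=min_samples_leaf,
                oob_score=True,               # oob_score_ is the R^2 of OOB predictions
                random_state=0,
            ).fit(X, y)
            print(f"max_features={max_features:>2}  min_samples_leaf={min_samples_leaf:>2}"
                  f"  ->  OOB R^2 = {forest.oob_score_:.3f}")

On data like this, one would expect the extremes of both grids to underperform the middle: very small max_features weakens individual trees, using all 50 features yields highly correlated trees, and very large leaves over-smooth the fit.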