13 Advanced Tree-Based Methods

13.1 Exercises

Exercise 13.1: Quantile Regression Forest Prediction

You trained a Quantile Regression Forest with \(B=3\) trees. For a new test point \(x_{\text{new}}\), you collect all training outcomes \(Y_i\) that fall into the terminal leaf reached by \(x_{\text{new}}\) in each tree:

  • Tree 1 leaf: \(\{10,\,12,\,15\}\)
  • Tree 2 leaf: \(\{8,\,11,\,12\}\)
  • Tree 3 leaf: \(\{12,\,14,\,20\}\)

Questions:

  1. What is the full pooled collection (keeping duplicates) of neighboring outcomes for \(x_{\text{new}}\)?
  2. What is the predicted conditional median (50th percentile)?
  3. What is the predicted conditional 25th percentile?

Exam level

Pool (concatenate) all leaf outcomes from all trees; keep duplicates; then sort.

For \(N\) odd, the median is the \(\frac{N+1}{2}\)-th smallest value.

For quantile level \(p\), compute position \(k = p (N+1)\).
If \(k\) is not an integer, interpolate linearly between the \(\lfloor k \rfloor\)-th and \(\lceil k \rceil\)-th smallest values, with weight \(k - \lfloor k \rfloor\) on the latter.

  1. Pool: \(\{10,12,15,8,11,12,12,14,20\}\) → sorted: \(\{8,10,11,12,12,12,14,15,20\}\) (\(N=9\)).
  2. Median index: \((9+1)/2 = 5\) → 5th value \(=12\) so \(\hat q_{0.5}(x_{\text{new}})=12\).
  3. 25th percentile position: \(0.25(9+1)=2.5\) → interpolate between the 2nd value (10) and the 3rd value (11): \(10 + 0.5(11-10)=10.5\). Thus \(\hat q_{0.25}(x_{\text{new}})=10.5\).

(The interpolation above follows the simple \(p(N+1)\) rule; software may use alternative conventions, as the short numpy sketch below illustrates.)
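
As a quick sanity check, the pooled quantiles can be reproduced in Python. This is a minimal sketch: it assumes numpy 1.22 or later, where np.quantile accepts a method argument, and uses method="weibull" (Hyndman & Fan type 6), which implements exactly the \(p(N+1)\) positioning rule used above.

```python
import numpy as np

# Pool the leaf outcomes from all three trees, keeping duplicates, then sort.
leaf_1 = [10, 12, 15]
leaf_2 = [8, 11, 12]
leaf_3 = [12, 14, 20]
pooled = np.sort(np.array(leaf_1 + leaf_2 + leaf_3, dtype=float))
print(pooled)  # [ 8. 10. 11. 12. 12. 12. 14. 15. 20.]

# method="weibull" uses position p*(N+1), matching the rule in the solution.
print(np.quantile(pooled, 0.50, method="weibull"))  # 12.0
print(np.quantile(pooled, 0.25, method="weibull"))  # 10.5

# numpy's default convention ("linear", Hyndman & Fan type 7) positions the
# quantile at p*(N-1)+1 instead and returns 11.0 for the 25th percentile --
# one of the "alternative conventions" mentioned above.
print(np.quantile(pooled, 0.25))  # 11.0
```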

Exercise 13.2: NGBoost with Exponential Likelihood

Assume we model a positive target variable \(Y\) with a conditional Exponential distribution: \[ Y \mid X=x \sim \text{Exp}(\lambda(x)), \qquad \text{pdf is } f(y;\lambda)=\lambda e^{-\lambda y} \text{ for } y\ge 0. \] NGBoost can be used to learn the rate parameter function \(\lambda(x)\).

  1. Conceptual: In your own words, contrast NGBoost with (i) standard gradient boosting (using MSE) and (ii) Quantile Regression Forests.
  2. Loss & Gradient (Rate parameter \(\lambda\)):
    1. Write the per-observation negative log-likelihood (NLL), \(\ell(\lambda;y)\).
    2. Compute the ordinary gradient \(\nabla_\lambda \ell = \partial \ell/\partial \lambda\).
  3. Natural Gradient (Rate parameter \(\lambda\)):
    1. The Fisher Information for \(\lambda\) is \(F_{\lambda\lambda} = 1/\lambda^2\). Calculate the natural gradient, \(g^{\text{nat}}_\lambda = F_{\lambda\lambda}^{-1} \nabla_\lambda \ell\).
    2. For a single data point with current prediction \(\lambda=0.4\) and observed \(y=3\), compute both the ordinary and natural gradients for \(\lambda\).
  4. Reparameterization (Log-rate parameter \(\theta\)): To ensure \(\lambda>0\), we often model \(\theta(x) = \log \lambda(x)\), so \(\lambda = e^\theta\).
    1. Use the chain rule to find the ordinary gradient for \(\theta\), \(\nabla_\theta \ell = \partial \ell/\partial \theta\).
    2. Fisher Information Transformation Rule: If \(\theta = g(\lambda)\), then \(F_{\theta\theta} = F_{\lambda\lambda} \cdot (\frac{d\lambda}{d\theta})^2\). Use this rule to find the Fisher Information \(F_{\theta\theta}\) for our case where \(\theta = \log \lambda\).
    3. Calculate the natural gradient for \(\theta\), \(g^{\text{nat}}_\theta = F_{\theta\theta}^{-1} \nabla_\theta \ell\). What do you notice?
  5. Conclusion: Why is the \(\theta = \log \lambda\) parameterization often preferred in practice for NGBoost with an Exponential likelihood?

Note that the natural gradient matters most for multi-parameter densities, where it puts the gradients of the different parameters on a common scale; for practical context, a brief fitting sketch follows below.
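
As a practical aside (not required for the exercise), the sketch below shows how such a model would typically be fit with the ngboost package on synthetic data. The names NGBRegressor, ngboost.distns.Exponential, and ngboost.scores.LogScore follow the package's documented interface, but treat them as assumptions and check them against the installed version.

```python
import numpy as np
from ngboost import NGBRegressor
from ngboost.distns import Exponential   # assumed to be exposed by ngboost.distns
from ngboost.scores import LogScore      # negative log-likelihood scoring rule

# Synthetic data: Y | X = x ~ Exp(lambda(x)) with lambda(x) = 0.5 + 2x.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(500, 1))
rate = 0.5 + 2.0 * X[:, 0]
Y = rng.exponential(scale=1.0 / rate)

ngb = NGBRegressor(
    Dist=Exponential,        # parametric family for Y | X
    Score=LogScore,          # proper scoring rule minimized during boosting
    natural_gradient=True,   # fit weak learners to natural gradients
    n_estimators=300,
    learning_rate=0.05,
)
ngb.fit(X, Y)

print(ngb.predict(X[:5]))            # point predictions (distribution means)
print(ngb.pred_dist(X[:5]).params)   # fitted distribution parameter(s) per point
```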

Exam level

The NLL is \(-\log f(y;\lambda)\). Remember that \(\log(ab) = \log a + \log b\).

For Question 4.1, you’ll need \(\partial \ell/\partial \theta = (\partial \ell/\partial \lambda) \cdot (\partial \lambda/\partial \theta)\).

For Question 4.2, you’ll need the derivative of \(\lambda = e^\theta\) with respect to \(\theta\).

  1. Conceptual question:
    • Standard gradient boosting (with MSE) learns a single point function (typically the conditional mean) by fitting weak learners to (negative) gradients of the loss; uncertainty is not modeled.
    • Quantile Regression Forests approximate the entire conditional distribution nonparametrically by pooling leaf observations from many trees; quantiles are read off the empirical distribution (no parametric assumptions, no explicit gradient optimization, no natural gradients).
    • NGBoost is a semi-parametric method: it specifies a parametric family (here Exponential with rate \(\lambda(x)\)) and trains weak learners to predict its parameters by minimizing a proper scoring rule (the negative log-likelihood), using (natural) gradients. It yields full predictive distributions (mean, quantiles, tail probabilities) provided the family is suitable. Trade-off: risk of misspecification and added complexity (Fisher information / natural gradient machinery).
  2. Loss & Gradient for \(\lambda\):
    1. \(\ell(\lambda;y) = -(\log \lambda - \lambda y) = -\log \lambda + \lambda y\).
    2. \(\nabla_\lambda \ell = -1/\lambda + y\).
  3. Natural Gradient for \(\lambda\):
    1. \(g^{\text{nat}}_\lambda = (\lambda^2) \cdot (-1/\lambda + y) = -\lambda + \lambda^2 y = \lambda(\lambda y - 1)\).
    2. With \(\lambda=0.4, y=3\):
      • Ordinary: \(\nabla_\lambda \ell = -1/0.4 + 3 = -2.5 + 3 = 0.5\).
      • Natural: \(g^{\text{nat}}_\lambda = -0.4 + (0.4^2 \cdot 3) = -0.4 + 0.48 = 0.08\).
  4. Reparameterization to \(\theta = \log \lambda\):
    1. \(\nabla_\theta \ell = (\partial \ell/\partial \lambda) \cdot (\partial \lambda/\partial \theta) = (-1/\lambda + y) \cdot e^\theta = (-1/e^\theta + y) \cdot e^\theta = -1 + y e^\theta\).
    2. \(F_{\theta\theta} = F_{\lambda\lambda} \cdot (\frac{d\lambda}{d\theta})^2 = (1/\lambda^2) \cdot (e^\theta)^2 = (1/(e^\theta)^2) \cdot (e^\theta)^2 = 1\).
    3. \(g^{\text{nat}}_\theta = F_{\theta\theta}^{-1} \nabla_\theta \ell = (1)^{-1} \cdot (-1 + y e^\theta) = -1 + y e^\theta\). Observation: For the log-parameter \(\theta\), the natural gradient is identical to the ordinary gradient (verified numerically in the sketch after this solution).
  5. Conclusion: The \(\theta = \log \lambda\) parameterization is preferred for two main reasons:
    • Practical: It transforms the constrained optimization problem (\(\lambda>0\)) into an unconstrained one for \(\theta\), which can be learned by a standard tree booster.
    • Theoretical: It is the “canonical” parameter of the Exponential family. In this space, the Fisher Information is constant (1), making the geometry of the parameter space flat. This means the ordinary gradient is already “natural,” simplifying the algorithm as no scaling by the Fisher matrix is needed. The gradient steps are more stable.
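
The gradient calculations above are easy to verify programmatically. The minimal sketch below evaluates the NLL and both gradients at \(\lambda = 0.4\), \(y = 3\), checks the ordinary gradient against a finite-difference approximation, and confirms that the ordinary and natural gradients coincide under \(\theta = \log \lambda\).

```python
import numpy as np

def nll(lam, y):
    """Per-observation negative log-likelihood of Exp(lambda): -log(lambda) + lambda*y."""
    return -np.log(lam) + lam * y

lam, y = 0.4, 3.0

# Ordinary gradient w.r.t. lambda: -1/lambda + y.
grad_lam = -1.0 / lam + y
# Natural gradient w.r.t. lambda: F^{-1} * grad = lambda^2 * (-1/lambda + y).
nat_grad_lam = lam**2 * grad_lam
print(grad_lam, nat_grad_lam)  # approximately 0.5 and 0.08

# Finite-difference check of the ordinary gradient.
eps = 1e-6
fd_lam = (nll(lam + eps, y) - nll(lam - eps, y)) / (2 * eps)
print(np.isclose(fd_lam, grad_lam))  # True

# Log-rate parameterization theta = log(lambda): ordinary gradient -1 + y*exp(theta).
# The Fisher information is 1 here, so the natural gradient equals the ordinary one.
theta = np.log(lam)
grad_theta = -1.0 + y * np.exp(theta)
fd_theta = (nll(np.exp(theta + eps), y) - nll(np.exp(theta - eps), y)) / (2 * eps)
print(grad_theta, np.isclose(fd_theta, grad_theta))  # approximately 0.2, True
```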