Part of Advances in Neural Information Processing Systems 31 (NeurIPS 2018)
Cheng Tang, Damien Garreau, Ulrike von Luxburg
Random forests are learning algorithms that build large collections of random trees and make predictions by averaging the individual tree predictions.
In this paper, we consider various tree constructions and examine how the choice of parameters affects the generalization error of the resulting random forests as the sample size goes to infinity.
We show that subsampling of data points during the tree construction phase is important: forests can become inconsistent either with no subsampling or with subsampling that is too severe.
As a consequence, even highly randomized trees can lead to inconsistent forests if no subsampling is used, which implies that some of the commonly used setups for random forests can be inconsistent.
As a second consequence, we show that trees that perform well for nearest-neighbor search can be a poor choice for random forests.
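To make the role of subsampling concrete, the following is a minimal sketch of a random forest regressor in which each tree is fit on a random subsample of the data and predictions are averaged over trees. It is an illustrative implementation, not the constructions analyzed in the paper; the names `fit_forest`, `predict_forest`, and the parameter `subsample_rate` are assumptions introduced here, and the base learner (scikit-learn's `DecisionTreeRegressor`) and synthetic data are placeholder choices.

```python
# Minimal sketch (not the paper's tree constructions): a forest where each
# tree sees only a random subsample of size s, and the forest prediction is
# the average of the individual tree predictions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_forest(X, y, n_trees=100, subsample_rate=0.5, rng=None):
    """Fit each tree on a random subsample drawn without replacement."""
    rng = np.random.default_rng(rng)
    n = len(X)
    # Subsample size; the paper studies how this should scale with n
    # (neither the full sample nor a vanishingly small fraction).
    s = max(1, int(subsample_rate * n))
    trees = []
    for _ in range(n_trees):
        idx = rng.choice(n, size=s, replace=False)
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    """Forest prediction = average of the individual tree predictions."""
    return np.mean([t.predict(X) for t in trees], axis=0)

# Usage on synthetic data (illustrative only).
rng = np.random.default_rng(0)
X = rng.random((1000, 2))
y = np.sin(4 * X[:, 0]) + 0.1 * rng.standard_normal(1000)
forest = fit_forest(X, y, n_trees=50, subsample_rate=0.5, rng=0)
print(predict_forest(forest, X[:5]))
```

Setting `subsample_rate=1.0` corresponds to building every tree on the full sample (no subsampling), while a rate that shrinks too quickly with the sample size corresponds to overly severe subsampling; the paper's results concern how consistency depends on this choice.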