BRFOREST procedure

Constructs a random regression forest (R.W. Payne).

Options

`PRINT` = string tokens	Controls printed output (`outofbagerror`, `youtofbagestimates`, `importance`, `orderedimportance`, `monitoring`); default `outo`, `impo`
`Y` = variate	Response variate for the regression
`NTREES` = scalar	Number of trees in the forest; no default – must be specified
`NXTRY` = scalar	Number of `X` variables to select at random at each node from which to choose the `X` variable to use there; default is the square root of number of `X` variables
`NUNITSTRY` = scalar	Number of units of the X variables to select at random to use in the construction of each tree; default is two thirds of the number of units
`MSLIMIT` = scalar	Limit on the mean square of the observations at a node at which to stop making splits; default 0
`NSTOP` = scalar	Specifies the number of observations at a node at which to stop making splits; default 1
`SEED` = scalar	Seed for random numbers to select the `NXTRY` `X`-variables and `NUNITSTRY` units; default 0
`OWNBSELECT` = string token	Indicates whether or not your own version of the `BSELECT` procedure is to be used, as explained in the Method section (`yes`, `no`); default `no`
`OUTOFBAGERROR` = scalar	Saves the “out-of-bag” error rate
`YOUTOFBAGESTIMATES` = variate	Saves the “out-of-bag” estimates of `Y`
`SAVE` = pointer	Saves details of the forest that has been constructed

Parameters

`X` = factors or variates	X-variables available for constructing the trees
`ORDERED` = string tokens	Whether factor levels are ordered (`yes`, `no`); default `no`
`IMPORTANCE` = scalars	Saves the importance of each x-variable

Description

A regression tree is a mechanism for predicting a response variable from a set of independent variables (see Chapter 8 of Breiman et al. 1984). A random regression forest is a set of regression trees that are used collectively to form the prediction, by averaging the predictions from the individual trees (see e.g. Breiman 2001). The number of trees in the forest is specified by the NTREES option, which has no default and must be set. Constructing a large forest can be time-consuming, so it may be best to investigate first with a relatively small number of trees (e.g. 10).
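The averaging step can be illustrated with a minimal Python sketch. This is not Genstat code and not the BRFOREST implementation; the name `forest_predict` and the representation of each tree as a plain function are assumptions made purely for illustration.

```python
def forest_predict(trees, x_row):
    """Average the predictions of the individual trees for one observation.

    `trees` is a list of callables (hypothetical stand-ins for fitted trees),
    each mapping a row of x-values to a predicted response.
    """
    predictions = [tree(x_row) for tree in trees]
    return sum(predictions) / len(predictions)


# Three toy "trees" that ignore their input and return fixed predictions.
toy_trees = [lambda row: 1.0, lambda row: 2.0, lambda row: 6.0]
forest_value = forest_predict(toy_trees, {"Temp": 58.8})
```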

The trees are constructed using data on a set of observations. Their values for the response variable are specified (in a variate) using the Y option, and their values for the independent variables are specified (in a list of variates or factors) using the X parameter. Factors may have either ordered or unordered levels, according to whether the corresponding value of the ORDERED parameter is set to yes or no. For example, a factor called Dose with levels 1, 1.5, 2 and 2.5 would usually be treated as having ordered levels, whereas levels labelled 'Morphine', 'Amidone', 'Phenadoxone' and 'Pethidine' of a factor called Drug would be regarded as unordered.

Each regression tree is formed using a random sample of the X variables in the data set, and a bootstrap random sample of their units (i.e. sampled with replacement). The NXTRY option defines how many X variables to select, and the NUNITSTRY option defines how many units to take. The default for NXTRY is the square root of the number of X variables, and the default for NUNITSTRY is two thirds of the number of units. The SEED option specifies a seed for the random numbers that are used to select the variables and the units. The default of zero continues an existing sequence of random numbers if any of the random functions (GRSELECT etc.) has already been used in the current Genstat run; otherwise, a seed is chosen at random.
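The two kinds of random selection can be sketched in Python as follows. This is an illustration rather than the procedure's own code: the function names are hypothetical, and the rounding of the two defaults to whole numbers is an assumption, as the text above gives them only as "the square root" and "two thirds". Note also that the options list describes the NXTRY selection as being made at each node, so `sample_x` would be called once per node under that reading.

```python
import math
import random


def default_nxtry(n_x):
    # Assumed rounding of the square root of the number of X variables.
    return max(1, round(math.sqrt(n_x)))


def default_nunitstry(n_units):
    # Assumed rounding of two thirds of the number of units.
    return max(1, round(2 * n_units / 3))


def sample_x(rng, n_x, nxtry):
    """Select nxtry distinct X variables (by index), without replacement."""
    return sorted(rng.sample(range(n_x), nxtry))


def sample_units(rng, n_units, nunitstry):
    """Bootstrap sample of units (by index), i.e. sampled with replacement."""
    return [rng.randrange(n_units) for _ in range(nunitstry)]


rng = random.Random(185090)          # a fixed seed makes the run repeatable
n_x, n_units = 4, 17                 # hypothetical data set dimensions
x_chosen = sample_x(rng, n_x, default_nxtry(n_x))
in_bag = sample_units(rng, n_units, default_nunitstry(n_units))
out_of_bag = sorted(set(range(n_units)) - set(in_bag))
```

The units left in `out_of_bag` are those that play no part in building this particular tree.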

The construction process splits the observations into subsets. With an x-variate or a factor with ordered levels, the subsets are formed by taking the observations with values less than or greater than some split point p. For a factor with unordered levels, all possible ways of dividing its levels into two subsets are tried. The aim is to form subsets whose observations have similar values of the response variate. The predicted value of the response variable at each node of the tree is the mean of its values for the subset of observations at that node. The accuracy of a node is the sum of the squared deviations of the response values from their mean for the observations at the node, divided by the total number of observations. Each potential split at the node is assessed by its effect on the accuracy, that is, the difference between the accuracy of the node and the sum of the accuracies of the two potential successor nodes. The node becomes a terminal node if none of the splits provides any improvement in accuracy, or if the mean square of the observations at the node is less than or equal to the limit specified by the MSLIMIT option (default 0), or if the number of observations at the node is less than or equal to the number specified by the NSTOP option (default 1).
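The assessment of splits can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not the BSELECT or BASSESS code: the function names are hypothetical, candidate split points are assumed to lie midway between adjacent distinct x-values, and "squared distance" is read as the sum of squared deviations from the node mean.

```python
from itertools import combinations


def accuracy(y, n_total):
    """Sum of squared deviations from the node mean, over the total number of observations."""
    if not y:
        return 0.0
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y) / n_total


def best_split(x, y, n_total):
    """Best split point p for an x-variate; returns (improvement, p), or (0.0, None) if no split helps."""
    node_accuracy = accuracy(y, n_total)
    best = (0.0, None)
    values = sorted(set(x))
    for low, high in zip(values, values[1:]):
        p = (low + high) / 2  # assumed choice of candidate split point
        left = [yi for xi, yi in zip(x, y) if xi < p]
        right = [yi for xi, yi in zip(x, y) if xi > p]
        gain = node_accuracy - (accuracy(left, n_total) + accuracy(right, n_total))
        if gain > best[0]:
            best = (gain, p)
    return best


def two_way_partitions(levels):
    """All distinct ways of dividing the levels of an unordered factor into two non-empty subsets."""
    first, rest = levels[0], levels[1:]
    for r in range(len(rest)):  # r == len(rest) would leave the second subset empty
        for extra in combinations(rest, r):
            left = (first,) + extra
            right = tuple(level for level in rest if level not in extra)
            yield left, right


split = best_split([1, 2, 3, 4], [1.0, 1.0, 5.0, 5.0], n_total=4)
drug_splits = list(two_way_partitions(["Morphine", "Amidone", "Phenadoxone", "Pethidine"]))
```

For the toy data the best split is at p = 2.5, which separates the two groups of response values completely. A factor with k unordered levels gives 2^(k-1) - 1 candidate divisions, so the four-level Drug factor gives 7.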

The resulting forest (and its associated information) can be saved using the SAVE option. This can then be used in the BRFDISPLAY procedure to produce further output, or in the BRFPREDICT procedure to predict the response for new values of the x-variables.

The OUTOFBAGERROR option can save the “out-of-bag” error rate. This is calculated using the individuals that were not involved in the construction of each tree, so it gives an independent measure of the reliability of the forest. The idea is to put the x-values of each observation through all of the trees in which it was not used, and to predict its y-value by taking the average of the predictions from those trees. The out-of-bag error is the square root of the mean of the squared differences of the predictions from the values in the response variate. The YOUTOFBAGESTIMATES option can save a variate containing the out-of-bag predictions, and the %VARIANCE option can save the percentage of the variance in the y-values that is accounted for by the forest. Note: the out-of-bag prediction will be missing for any observation that has been selected in all the random samples (i.e. that has been used to construct every tree).
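The out-of-bag calculation can be sketched in Python as below. This is an illustration only: the function name and data layout are hypothetical, and it is an assumption that observations with a missing out-of-bag prediction are simply left out of the mean.

```python
import math


def out_of_bag_error(y, tree_predictions, in_bag):
    """Return (error, estimates) for a forest.

    y                  -- observed responses, one per unit
    tree_predictions   -- tree_predictions[t][i] is the prediction of tree t for unit i
    in_bag             -- in_bag[t] is the set of units used to construct tree t
    """
    estimates = []
    for i in range(len(y)):
        # Use only the trees that did not see unit i during construction.
        preds = [tree_predictions[t][i] for t in range(len(in_bag)) if i not in in_bag[t]]
        estimates.append(sum(preds) / len(preds) if preds else None)  # None = missing
    squared = [(est - obs) ** 2 for est, obs in zip(estimates, y) if est is not None]
    error = math.sqrt(sum(squared) / len(squared)) if squared else None
    return error, estimates


# Toy forest of two trees on three units; all numbers are made up.
y_obs = [1.0, 2.0, 3.0]
preds = [[1.0, 2.0, 5.0],   # tree 0
         [1.0, 4.0, 3.0]]   # tree 1
bags = [{0, 1}, {0, 2}]     # unit 0 was used to construct both trees
oob_error, oob_estimates = out_of_bag_error(y_obs, preds, bags)
```

Unit 0 was used in both trees, so its out-of-bag estimate is missing, as described in the note above.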

The IMPORTANCE parameter can save the “importance” of each X variate or factor in the forest (in a scalar for each one), calculated as the total amount by which the variable increases the accuracy in the forest.
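As a minimal Python sketch of this total, suppose that each tree is summarized by the list of splits it made, recorded as pairs of variable name and accuracy improvement. That representation, the function names and the numbers are all hypothetical; only the idea of summing the improvements per variable comes from the description above.

```python
def importance(forest_splits, x_names):
    """Total accuracy improvement attributed to each X variable over all trees."""
    totals = {name: 0.0 for name in x_names}
    for tree_splits in forest_splits:
        for name, gain in tree_splits:
            totals[name] += gain
    return totals


def ordered_importance(totals):
    """Importance ratings in decreasing order."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up splits from a forest of two trees.
splits = [[("Temp", 2.0), ("Product", 1.0)],
          [("Temp", 0.5)]]
totals = importance(splits, ["Employ", "Opdays", "Product", "Temp"])
ranking = ordered_importance(totals)
```

A variable that is never chosen for a split, such as Employ here, has an importance of zero.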

Printed output is controlled by the PRINT option, with settings:

`outofbagerror`	out-of-bag error rate,
`youtofbagestimates`	out-of-bag predictions of the y-values,
`importance`	importance ratings of the `X` variates and factors,
`orderedimportance`	importance ratings of the `X` variates and factors in decreasing order, and
`monitoring`	monitoring information during the construction process.

The default is PRINT=outofbagerror,importance.

Options: PRINT, Y, NTREES, NXTRY, NUNITSTRY, MSLIMIT, NSTOP, SEED, OWNBSELECT, OUTOFBAGERROR, YOUTOFBAGESTIMATES, SAVE.

Parameters: X, ORDERED, IMPORTANCE.

Method

BRFOREST calls procedure BCONSTRUCT to form each tree. This uses a special-purpose procedure BSELECT, which is customized specifically to select splits for use in regression trees. You can use your own method of selection by providing your own BSELECT and setting option OWNBSELECT=yes. In the standard version of BSELECT, the BASSESS directive is used to assess the potential splits.

Action with `RESTRICT`

Restrictions on the X or Y vectors are ignored.

References

Breiman, L., Friedman, J.H., Olshen, R.A. & Stone, C.J. (1984). Classification and Regression Trees. Wadsworth, Monterey.

Breiman, L. (2001). Random forests. Machine Learning, 45, 5-32.

Example

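CAPTION    'BRFOREST example'; STYLE=meta
"Load the example data from the Genstat data directory."
SPLOAD     [PRINT=*] '%gendir%/data/water.gsh'
"Construct a forest of 8 trees for Water, with a fixed seed so that the run is repeatable."
BRFOREST   [PRINT=outofbagerror,youtofbagestimates,importance;\
           Y=Water; NTREES=8; NXTRY=3; NUNITSTRY=10; SEED=185090]\
           Employ,Opdays,Product,Temp
"Predict from the forest, and print the predictions alongside the observed values."
BRFPREDICT [PRINT=*; PREDICTION=Prediction] Employ,Opdays,Product,Temp
PRINT      Water,Prediction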

Updated on March 8, 2019
