Constructs a random regression forest (R.W. Payne).

### Options

| `PRINT` = string tokens | Controls printed output (`outofbagerror`, `youtofbagestimates`, `importance`, `orderedimportance`, `monitoring`); default `outo`, `impo` |
|---|---|
| `Y` = variate | Response variate for the regression |
| `NTREES` = scalar | Number of trees in the forest; no default, must be specified |
| `NXTRY` = scalar | Number of `X` variables to select at random at each node from which to choose the `X` variable to use there; default is the square root of the number of `X` variables |
| `NUNITSTRY` = scalar | Number of units of the `X` variables to select at random to use in the construction of each tree; default is two thirds of the number of units |
| `MSLIMIT` = scalar | Limit on the mean square of the observations at a node at which to stop making splits; default 0 |
| `NSTOP` = scalar | Specifies the number of observations at a node at which to stop making splits; default 1 |
| `SEED` = scalar | Seed for the random numbers used to select the `NXTRY` `X`-variables and `NUNITSTRY` units; default 0 |
| `OWNBSELECT` = string token | Indicates whether or not your own version of the `BSELECT` procedure is to be used, as explained in the Method section (`yes`, `no`); default `no` |
| `OUTOFBAGERROR` = scalar | Saves the “out-of-bag” error rate |
| `YOUTOFBAGESTIMATES` = variate | Saves the “out-of-bag” estimates of `Y` |
| `SAVE` = pointer | Saves details of the forest that has been constructed |

### Parameters

| `X` = factors or variates | X-variables available for constructing the trees |
|---|---|
| `ORDERED` = string tokens | Whether factor levels are ordered (`yes`, `no`); default `no` |
| `IMPORTANCE` = scalars | Saves the importance of each x-variable |

### Description

A regression tree is a mechanism for predicting a response variable from a set of independent variables (see Chapter 8 of Breiman *et al*. 1984). A random regression forest is a set of regression trees that are used collectively to form the prediction, by averaging the predictions from the individual trees (see e.g. Breiman 2001). The number of trees in the forest is specified by the `NTREES` option. Constructing a large forest can be time consuming, so it may be best to investigate first with a relatively small number of trees (e.g. 10).

The trees are constructed using data on a set of observations. Their values for the response variable are specified (in a variate) using the `Y` option, and their values for the independent variables are specified (in a list of variates or factors) using the `X` parameter. Factors may have either ordered or unordered levels, according to whether the corresponding value of the `ORDERED` parameter is set to `yes` or `no`. For example, a factor called `Dose` with levels 1, 1.5, 2 and 2.5 would usually be treated as having ordered levels, whereas levels labelled `'Morphine'`, `'Amidone'`, `'Phenadoxone'` and `'Pethidine'` of a factor called `Drug` would be regarded as unordered.

Each regression tree is formed using a random sample of the `X` variables in the data set, and a bootstrap random sample of their units (i.e. sampled with replacement). The `NXTRY` option defines how many `X` variables to select, and the `NUNITSTRY` option defines how many units to take. The default for `NXTRY` is the square root of the number of variables, and the default for `NUNITSTRY` is two thirds of the number of units. The `SEED` option specifies a seed for the random numbers that are used to select the variables and to select the units. The default of zero continues an existing sequence of random numbers, if any of the random functions (`GRSELECT` etc.) has already been used in the current Genstat run. Otherwise, a seed is chosen at random.
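The sampling described above can be illustrated with a short Python sketch. This is not Genstat code and does not reproduce the procedure's internals: the function name `draw_sample`, its argument names and the rounding of the defaults are assumptions made for illustration only.

```python
import math
import random


def draw_sample(n_units, n_x, nxtry=None, nunitstry=None, seed=None):
    """Pick candidate X variables and a bootstrap sample of units.

    Hypothetical helper: mirrors the roles of NXTRY, NUNITSTRY and SEED,
    but is not the algorithm that Genstat actually runs.
    """
    rng = random.Random(seed)
    # Defaults from the text; rounding to the nearest integer is an assumption.
    if nxtry is None:
        nxtry = max(1, round(math.sqrt(n_x)))
    if nunitstry is None:
        nunitstry = max(1, round(2 * n_units / 3))
    # X variables are drawn without replacement.
    x_chosen = sorted(rng.sample(range(n_x), nxtry))
    # Units are drawn with replacement (a bootstrap sample).
    units = [rng.randrange(n_units) for _ in range(nunitstry)]
    # Units never drawn are "out of bag" for this tree.
    out_of_bag = sorted(set(range(n_units)) - set(units))
    return x_chosen, units, out_of_bag


x_chosen, units, out_of_bag = draw_sample(n_units=15, n_x=9, seed=1)
print(len(x_chosen), len(units))
```

With 9 x-variables and 15 units, the defaults in this sketch give 3 candidate variables and 10 sampled units; because units are drawn with replacement, some may appear more than once and at least five are left out of bag.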

The construction process splits the observations into subsets. With an x-variate or a factor with ordered levels, the subsets are formed by taking the observations with values less than or greater than some split point *p*. For a factor with unordered levels, all possible ways of dividing its levels into two subsets are tried. The aim is to form subsets that have similar values for the response variate. The predicted value of the response variable for each node of the tree is the mean of its values for the subset of observations at that node. The *accuracy* of the node is the sum of the squared distances of the values of the response variate from their mean for the observations at the node, divided by the total number of observations. The potential splits at the node are assessed by their effect on the accuracy, that is, the difference between the accuracy of the node and the sum of the accuracies of the two potential successor nodes. The node will become a terminal node if none of the splits provides any improvement in accuracy, or if the mean square of the observations at the node is less than or equal to a limit specified by the `MSLIMIT` option (default 0), or if the number of observations at the node is less than or equal to the number specified by the `NSTOP` option (default 1).
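As a rough illustration of the split criterion for a single x-variate, the following Python sketch computes the accuracy of a node and searches for the split point that gives the largest improvement. The function names are hypothetical, and placing candidate split points midway between adjacent distinct values is an assumption; the Genstat procedures may choose split points differently.

```python
def accuracy(y, n_total):
    """Sum of squared deviations from the node mean, divided by the
    total number of observations (not the number at the node)."""
    if not y:
        return 0.0
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y) / n_total


def best_split(x, y, n_total):
    """Return (improvement, split_point) for the best split on x,
    or (0.0, None) if no split improves the accuracy."""
    parent = accuracy(y, n_total)
    best = (0.0, None)
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        p = (lo + hi) / 2  # assumed: midpoint between adjacent distinct values
        left = [yi for xi, yi in zip(x, y) if xi < p]
        right = [yi for xi, yi in zip(x, y) if xi > p]
        gain = parent - (accuracy(left, n_total) + accuracy(right, n_total))
        if gain > best[0]:
            best = (gain, p)
    return best


print(best_split([1, 2, 3, 4], [1.0, 1.0, 5.0, 5.0], n_total=4))  # (4.0, 2.5)
```

In this toy data the parent node has accuracy 16/4 = 4, and splitting at 2.5 leaves two nodes with identical responses (accuracy 0 each), so the improvement is 4. A node whose responses are all equal returns `(0.0, None)` and would become a terminal node.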

The resulting forest (and its associated information) can be saved using the `SAVE` option. This can then be used in the `BRFDISPLAY` procedure to produce further output, or in the `BRFPREDICT` procedure to predict the response for new values of the x-variables.

The `OUTOFBAGERROR` option can save the “out-of-bag” error rate. This is calculated using the individuals that were not involved in the construction of each tree, so it gives an independent measure of the reliability of the forest. The idea is to put the x-values of each observation through all of the trees where it was not used, and predict its y-value by taking the average of the predictions from those trees. The out-of-bag error is the square root of the mean of the squared differences of the predictions from the values in the response variate. The `YOUTOFBAGESTIMATES` option can save a variate containing the out-of-bag predictions, and the `%VARIANCE` option can save the percentage of the variance in the y-values that is accounted for by the forest. Note: the out-of-bag prediction will be missing for any observation that has been selected in all the random samples (i.e. that has been used to construct every tree).
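The out-of-bag calculation can be sketched in Python as follows. The function name and the data layout (`tree_predictions`, `in_bag`) are hypothetical, and leaving observations with a missing prediction out of the mean is an assumption rather than documented behaviour.

```python
import math


def out_of_bag(y, tree_predictions, in_bag):
    """Out-of-bag estimates and error for a forest.

    y                      -- list of observed responses
    tree_predictions[t][i] -- prediction of tree t for observation i
    in_bag[t]              -- set of observation indices used to build tree t
    An estimate is None (missing) when every tree used the observation.
    """
    estimates = []
    for i in range(len(y)):
        # Use only the trees that did not see observation i.
        preds = [tree_predictions[t][i]
                 for t in range(len(in_bag)) if i not in in_bag[t]]
        estimates.append(sum(preds) / len(preds) if preds else None)
    # Assumed: observations with missing estimates are left out of the mean.
    sq = [(e - yi) ** 2 for e, yi in zip(estimates, y) if e is not None]
    error = math.sqrt(sum(sq) / len(sq)) if sq else None
    return estimates, error


y = [1.0, 2.0, 3.0]
tree_predictions = [[1.0, 2.0, 5.0], [3.0, 2.0, 3.0]]
in_bag = [{0, 1}, {1, 2}]
print(out_of_bag(y, tree_predictions, in_bag))  # ([3.0, None, 5.0], 2.0)
```

Here the second observation was used by both trees, so its estimate is missing; the other two estimates each differ from the response by 2, giving an out-of-bag error of 2.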

The `IMPORTANCE` parameter can save a variate giving the “importance” of each `X` variate or factor in the forest, calculated as the total amount by which the variable increases the accuracy in the forest.
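One way to picture the importance calculation is as a running total of accuracy improvements, grouped by the variable used at each split. The Python sketch below assumes a hypothetical list of `(variable, improvement)` pairs collected from every split in every tree; it is not how the saved forest is actually stored, and the numbers are invented for illustration.

```python
def importance(forest_splits, x_names):
    """Total accuracy improvement attributed to each x-variable.

    forest_splits -- iterable of (variable_name, improvement) pairs,
                     one per split in any tree (hypothetical layout)
    x_names       -- names of all x-variables, so unused ones score 0
    """
    totals = dict.fromkeys(x_names, 0.0)
    for name, gain in forest_splits:
        totals[name] += gain
    return totals


splits = [("Temp", 2.0), ("Product", 0.5), ("Temp", 1.0)]  # made-up values
totals = importance(splits, ["Employ", "Opdays", "Product", "Temp"])
# Decreasing order, as in the orderedimportance print setting.
ordered = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(ordered[:2])  # [('Temp', 3.0), ('Product', 0.5)]
```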

Printed output is controlled by the `PRINT` option, with settings:

| `outofbagerror` | out-of-bag error rate, |
|---|---|
| `youtofbagestimates` | out-of-bag predictions of the y-values, |
| `importance` | importance ratings of the `X` variates and factors, |
| `orderedimportance` | importance ratings of the `X` variates and factors in decreasing order, and |
| `monitoring` | monitoring information during the construction process. |

The default is `PRINT=outofbagerror,importance`.

Options: `PRINT`, `Y`, `NTREES`, `NXTRY`, `NUNITSTRY`, `MSLIMIT`, `NSTOP`, `SEED`, `OWNBSELECT`, `OUTOFBAGERROR`, `YOUTOFBAGESTIMATES`, `SAVE`.

Parameters: `X`, `ORDERED`, `IMPORTANCE`.

### Method

`BRFOREST` calls procedure `BCONSTRUCT` to form the tree. This uses a special-purpose procedure `BSELECT`, which is customized specifically to select splits for use in regression trees. You can use your own method of selection by providing your own `BSELECT` and setting option `OWNBSELECT=yes`. In the standard version of `BSELECT`, the `BASSESS` directive is used to assess the potential splits.

### Action with `RESTRICT`

Restrictions on the `X` or `Y` vectors are ignored.

### References

Breiman, L., Friedman, J.H., Olshen, R.A. & Stone, C.J. (1984). *Classification and Regression Trees*. Wadsworth, Monterey.

Breiman, L. (2001). Random forests. *Machine Learning*, 45, 5-32.

### See also

Procedures: `BRFDISPLAY`, `BRFPREDICT`, `BREGRESSION`.

Commands for: Regression analysis, Multivariate and cluster analysis.

### Example

    CAPTION    'BRFOREST example'; STYLE=meta
    SPLOAD     [PRINT=*] '%gendir%/data/water.gsh'
    BRFOREST   [PRINT=outofbagerror,youtofbagestimates,importance;\
               Y=Water; NTREES=8; NXTRY=3; NUNITSTRY=10; SEED=185090]\
               Employ,Opdays,Product,Temp
    BRFPREDICT [PRINT=*; PREDICTION=Prediction] Employ,Opdays,Product,Temp
    PRINT      Water,Prediction