Constructs a regression tree (R.W. Payne).

### Options

| Option | Description |
|---|---|
| `PRINT` = string tokens | Controls printed output (`summary`, `details`, `indented`, `bracketed`, `labelleddiagram`, `numbereddiagram`, `graph`, `monitoring`); default `*` i.e. none |
| `Y` = variate | Response variate for the regression |
| `TREE` = tree | Saves the tree that has been constructed |
| `MSLIMIT` = scalar | Limit on the mean square of the observations at a node at which to stop making splits; default 0 |
| `NSTOP` = scalar | Specifies the number of observations at a node at which to stop making splits; default 1 |
| `OWNBSELECT` = string token | Indicates whether or not your own version of the `BSELECT` procedure is to be used, as explained in the Method section (`yes`, `no`); default `no` |

### Parameters

| Parameter | Description |
|---|---|
| `X` = variates or factors | Independent variables available for constructing the tree |
| `ORDERED` = string tokens | Whether factor levels are ordered (`yes`, `no`); default `no` |

### Description

A regression tree is a mechanism for predicting a response variable from a set of independent variables (see Chapter 8 of Breiman *et al*. 1984). The tree is constructed using data on a set of observations. Their values for the response variable are specified (in a variate) using the `Y` option, and their values for the independent variables are specified (in a list of variates or factors) using the `X` parameter. Factors may have either ordered or unordered levels, according to whether the corresponding value of the `ORDERED` parameter is set to `yes` or `no`. For example, a factor called `Dose` with levels 1, 1.5, 2 and 2.5 would usually be treated as having ordered levels, whereas levels labelled `'Morphine'`, `'Amidone'`, `'Phenadoxone'` and `'Pethidine'` of a factor called `Drug` would be regarded as unordered.
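The practical effect of the `ORDERED` setting is on how many candidate splits a factor offers: ordered levels can only be cut at a boundary between adjacent levels, whereas unordered levels can be divided into two subsets in every possible way. The following Python sketch (an illustration only, not Genstat code; the function names are hypothetical) enumerates both kinds of split for the `Dose` and `Drug` factors above.

```python
from itertools import combinations

def ordered_splits(levels):
    """Cut an ordered list of levels into (low, high) groups at each boundary."""
    levels = sorted(levels)
    return [(levels[:i], levels[i:]) for i in range(1, len(levels))]

def unordered_splits(levels):
    """Every division of the levels into two non-empty subsets, each counted once."""
    levels = list(levels)
    first, rest = levels[0], levels[1:]
    splits = []
    # Keep the first level on the left so mirror-image divisions are not repeated.
    for size in range(len(rest)):
        for chosen in combinations(rest, size):
            left = [first, *chosen]
            right = [lev for lev in rest if lev not in chosen]
            splits.append((left, right))
    return splits

dose = [1, 1.5, 2, 2.5]
drug = ["Morphine", "Amidone", "Phenadoxone", "Pethidine"]
print(len(ordered_splits(dose)))    # 3 candidate splits
print(len(unordered_splits(drug)))  # 7 candidate splits
```

With k levels there are k - 1 ordered splits but 2^(k-1) - 1 unordered ones, so treating a many-level factor as unordered can make the search considerably larger.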

The construction process splits the observations into subsets. With an x-variate or a factor with ordered levels, the subsets are formed by taking the observations with values less than or greater than some split point *p*. For a factor with unordered levels, all possible ways of dividing its levels into two subsets are tried. The aim is to form subsets that have similar values for the response variate. The predicted value of the response variable for each node of the tree is the mean of its values for the subset of observations at that node. The *accuracy* of the node is the sum of the squared distances of the values of the response variate from their mean for the observations at the node, divided by the total number of observations. The potential splits at the node are assessed by their effect on the accuracy, that is, the difference between the accuracy of the node and the sum of the accuracies of the two potential successor nodes. The node will become a terminal node if none of the splits provides any improvement in accuracy, or if the mean square of the observations at the node is less than or equal to a limit specified by the `MSLIMIT` option (default 0), or if the number of observations at the node is less than or equal to the number specified by the `NSTOP` option (default 1).
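The split criterion and stopping rules can be sketched in a few lines of Python. This is an illustration of the description above, not the code used by `BREGRESSION`: the function names are hypothetical, the accuracy is taken as the sum of squared deviations divided by the total number of observations, the split point is assumed to lie midway between adjacent distinct x-values, and the mean square is assumed to use the divisor n - 1.

```python
def accuracy(y, n_total):
    """Sum of squared deviations from the node mean, over the total number of observations."""
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y) / n_total

def best_split(x, y, n_total):
    """Return (improvement, split_point) for the best cut on one x-variate, or None."""
    parent = accuracy(y, n_total)
    values = sorted(set(x))
    best = None
    for lo, hi in zip(values, values[1:]):
        p = (lo + hi) / 2  # assumption: cut midway between adjacent distinct values
        left = [yi for xi, yi in zip(x, y) if xi < p]
        right = [yi for xi, yi in zip(x, y) if xi > p]
        gain = parent - (accuracy(left, n_total) + accuracy(right, n_total))
        if gain > 0 and (best is None or gain > best[0]):
            best = (gain, p)
    return best

def is_terminal(x, y, n_total, mslimit=0.0, nstop=1):
    """Apply the three stopping rules to a single node."""
    n = len(y)
    if n <= max(nstop, 1):
        return True
    mean = sum(y) / n
    mean_square = sum((v - mean) ** 2 for v in y) / (n - 1)  # assumption: divisor n - 1
    if mean_square <= mslimit:
        return True
    return best_split(x, y, n_total) is None

x = [1, 2, 3, 4]
y = [1.0, 1.0, 5.0, 5.0]
print(best_split(x, y, n_total=4))   # (4.0, 2.5)
print(is_terminal(x, y, n_total=4))  # False
```

In this toy data the cut at 2.5 separates the two groups perfectly, so both successor nodes have accuracy 0 and the improvement equals the accuracy of the parent node.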

The resulting tree can be saved using the `TREE` option. Details of the tree can be printed as selected by the `PRINT` option, with settings:

| Setting | Description |
|---|---|
| `summary` | prints a summary of the properties of the tree; |
| `details` | gives detailed information about the nodes of the tree; |
| `bracketed` | display as used to represent an identification key in “bracketed” form (printed node by node); |
| `indented` | display as used to represent an identification key in “indented” form (printed branch by branch); |
| `labelleddiagram` | diagrammatic display including the node labels; |
| `numbereddiagram` | diagrammatic display with the nodes labelled by their numbers; |
| `graph` | plots the tree using high-resolution graphics; |
| `monitoring` | prints information monitoring the construction process. |

`BREGRESSION` stores the information required for printing as part of the tree. For variates and ordered factors, the labels are generally formed as “*identifier*<*p*” and “*identifier*>*p*”, where *p* is the value chosen to partition the data for the variate concerned. Alternatively, if you have defined an “extra” text for the variate (using the `EXTRA` parameter of the `VARIATE` command), this will be used instead. The labels are then “*extra-text* < *p*” and “*extra-text* > *p*”. The style is similar for unordered factors, but here the labels involve the operators `.IN.` and `.NI.` instead of `<` and `>`.

Generally the construction will result in *over-fitting*, that is, it will form a tree that keeps making splits beyond the point that can be justified statistically. The solution is to prune the tree to remove the uninformative sub-branches, and this can be performed using the `BPRUNE` procedure. It is best, if possible, to base the pruning on an independent set of data. The pruning uses the *accuracy* figures, which are stored with the tree. The `BRVALUES` procedure can be used to calculate new accuracy (and prediction) values from another data set.

Finally, once the tree has been pruned, the value predicted for a new set of independent values can be obtained by supplying their values to the `BRPREDICT` procedure. This runs the values through the tree to see which terminal node they reach. The prediction is then provided by the value predicted at that node.
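The idea of running a set of values through the tree can be illustrated with a short Python sketch. It is not the `BRPREDICT` implementation: the nested-dictionary representation, the function name and the tree itself (variables `x1` and `x2`, split points and predictions) are all hypothetical, and only splits on variates are shown.

```python
def predict(node, values):
    """Follow the splits from the root until a terminal node is reached."""
    while "split" in node:
        name, p = node["split"]
        node = node["below"] if values[name] < p else node["above"]
    return node["prediction"]

# Hypothetical tree: every number here is invented for illustration.
tree = {
    "split": ("x1", 10.0),
    "below": {"prediction": 2.9},
    "above": {
        "split": ("x2", 70.0),
        "below": {"prediction": 3.2},
        "above": {"prediction": 3.9},
    },
}

print(predict(tree, {"x1": 14.5, "x2": 81.0}))  # 3.9
print(predict(tree, {"x1": 7.0, "x2": 81.0}))   # 2.9
```

Each terminal node carries a single predicted value (the mean of the response for the observations that reached it during construction), so every new observation arriving at the same terminal node receives the same prediction.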

Options: `PRINT`, `Y`, `TREE`, `MSLIMIT`, `NSTOP`, `OWNBSELECT`.

Parameters: `X`, `ORDERED`.

### Method

`BREGRESSION` calls procedure `BCONSTRUCT` to form the tree. This uses a special-purpose procedure `BSELECT`, which is customized specifically to select splits for use in regression trees and stored with `BREGRESSION`. You can use your own method of selection by providing your own `BSELECT` and setting option `OWNBSELECT=yes`. In the standard version of `BSELECT`, the `BASSESS` directive is used to assess the potential splits.

### Action with `RESTRICT`

Any restrictions on the `Y` or `X` variates are removed.

### Reference

Breiman, L., Friedman, J.H., Olshen, R.A. & Stone, C.J. (1984). *Classification and Regression Trees*. Wadsworth, Monterey.

### See also

Procedures: `BRDISPLAY`, `BRKEEP`, `BRPREDICT`, `BRVALUES`, `BRFOREST`, `BGRAPH`, `BPRUNE`, `BCLASSIFICATION`, `BCFOREST`.

Commands for: Regression analysis, Multivariate and cluster analysis.

### Example

    CAPTION 'BREGRESSION example',!t('Water usage data (Draper & Smith 1981,',\
            'Applied Regression Analysis, Wiley, New York).'); STYLE=meta,plain
    READ    temp,product,opdays,employ,water
    58.8  7.107 21 129 3.067
    65.2  6.373 22 141 2.828
    70.9  6.796 22 153 2.891
    77.4  9.208 20 166 2.994
    79.3 14.792 25 193 3.082
    81.0 14.564 23 189 3.898
    71.9 11.964 20 175 3.502
    63.9 13.526 23 186 3.060
    54.5 12.656 20 190 3.211
    39.5 14.119 20 187 3.286
    44.5 16.691 22 195 3.542
    43.6 14.571 19 206 3.125
    56.0 13.619 22 198 3.022
    64.7 14.575 22 192 2.922
    73.0 14.556 21 191 3.950
    78.9 18.573 21 200 4.488
    79.4 15.618 22 200 3.295 :
    "form the regression tree"
    BREGRESSION [PRINT=summary,indented; Y=water; TREE=tree]\
                employ,opdays,product,temp
    "prune the tree"
    BPRUNE      [PRINT=table,graph] tree; NEWTREES=pruned
    "use tree 6 - renumber nodes"
    BCUT        [RENUMBER=yes] pruned[6]; NEWTREE=tree
    "display the tree"
    BRDISPLAY   [PRINT=summary,indented,graph] tree