Assesses potential splits for regression and classification trees.

### Options

`Y` = variate or factor |
Response variate for a regression tree, or factor specifying the groupings for a classification tree |
---|---|

`SELECTED` = dummy |
Returns the identifier of `X` variate or factor used in the best split |

`TESTSPLIT` = expression structure |
Logical expression representing the best split |

`MAXSPLITPOINT` = scalar or variate |
When `SELECTED` is a variate or a factor with ordered levels, this returns a scalar containing the boundary between the two splits; when `SELECTED` is a factor with unordered levels, it returns a variate containing the levels allocated to the first split

`MAXCRITERION` = scalar |
Maximum value obtained for the selection criterion |

`NOSELECTION` = scalar |
Returns the value 1 if no split has been selected, otherwise 0 |

`FMETHOD` = string token |
Selection method to use when `Y` is a factor (`Gini` , `MPI` ); default `Gini` |

`ANTIENDCUTFACTOR` = string token |
Anti-end-cut factor to use when `Y` is a factor (`classnumber` , `reciprocalentropy` ); default `*` i.e. none |

`WEIGHTS` = variate |
Weights; default `*` i.e. all weights 1 |

`TOLERANCE` = scalar |
Tolerance multiplier used e.g. to check for equality of x-values; default `*` i.e. set automatically for the implementation concerned |

### Parameters

`X` = variates or factors |
Variables available to make the split |
---|---|

`ORDERED` = string tokens |
Whether factor levels are ordered (`yes` , `no` ); default `no` |

`SPLITPOINT` = scalars or variates |
Saves details of the best split found for each `X` variable; when `X` is a variate or a factor with ordered levels this returns a scalar containing the boundary between the two splits, and when `X` is a factor with unordered levels it returns a variate containing the levels allocated to the first split

`CRITERIONVALUE` = scalars |
Saves the value of the selection criterion for the best split found for each `X` variable |

### Description

`BASSESS` selects splits for use when constructing classification or regression trees. The `Y` option specifies the factor defining the groupings for a classification tree, or the response variate for a regression tree. The x-variables that are available to make the split are supplied by the `X` parameter. They can be variates, or factors with either ordered or unordered levels, as indicated by the `ORDERED` parameter. For example, a factor called `Dose` with levels 1, 1.5, 2 and 2.5 would usually be treated as having ordered levels, whereas levels labelled `'Morphine'`, `'Amidone'`, `'Phenadoxone'` and `'Pethidine'` of a factor called `Drug` would be regarded as unordered.

In a regression tree, the accuracy of each node is the sum of squared deviations of the response values from their mean, over the observations at the node, divided by the total number of observations. The potential splits are assessed by their effect on the accuracy, that is, the difference between the initial accuracy and the sum of the accuracies of the two successor nodes resulting from the split.
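
The criterion just described can be sketched in a few lines of Python. This is an illustration only, not Genstat code; the function name and the example data are invented for the sketch.

```python
import numpy as np

def assess_regression_splits(x, y):
    """Assess every candidate boundary on variate x for response y.

    A node's accuracy is taken, as described above, as the sum of
    squared deviations of y from its node mean, divided by the total
    number of observations; a split's criterion value is the reduction
    in accuracy it achieves.  (Illustrative sketch, not Genstat.)
    """
    n = len(y)
    order = np.argsort(x)
    xs, ys = np.asarray(x, float)[order], np.asarray(y, float)[order]

    def accuracy(v):
        return np.sum((v - v.mean()) ** 2) / n

    best_point, best_criterion = None, -np.inf
    for i in range(1, n):
        if xs[i] == xs[i - 1]:      # no boundary between tied x-values
            continue
        crit = accuracy(ys) - accuracy(ys[:i]) - accuracy(ys[i:])
        if crit > best_criterion:
            # boundary between the two splits, as SPLITPOINT would hold
            best_point = (xs[i - 1] + xs[i]) / 2
            best_criterion = crit
    return best_point, best_criterion
```

For instance, with `x = [1, 2, 3, 10, 11, 12]` and `y = [0, 0, 0, 5, 5, 5]` the best boundary found is 6.5, halfway between the two clusters of x-values.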

For a classification tree, the `FMETHOD` option allows one of two selection criteria to be requested: either Gini information or the MPI (*mean posterior improvement*) criterion of Taylor & Silverman (1993). The default is to use Gini information. The `ANTIENDCUTFACTOR` option allows you to request the use of adaptive anti-end-cut factors as devised by Taylor & Silverman (1993, Section 5). Further details are given in the *Method* section. By default no adaptive factors are used.

The `SPLITPOINT` parameter can be used to save details of the best split found for each `X` variable. When `X` is a variate or a factor with ordered levels, this returns a scalar containing the boundary between the two splits. Alternatively, when `X` is a factor with unordered levels, it returns a variate containing the levels allocated to the first split. The `CRITERIONVALUE` parameter saves the value of the selection criterion for the best split found for each `X` variable.

The `SELECTED` option can be set to a dummy to store the identifier of the `X` variate or factor used in the best split, and the `MAXSPLITPOINT` option can save details of the best split, similarly to the `SPLITPOINT` parameter. The `MAXCRITERION` option saves the maximum value obtained for the selection criterion, and the `NOSELECTION` option saves a scalar containing the value 0 if a split could be selected, or 1 if no further splitting was possible. You can save a logical expression representing the best split using the `TESTSPLIT` option. So, for example, you can put

`BASSESS [Y=Yvar; TESTSPLIT=Test; ...]`

`RESTRICT Yvar; #Test == 1`

`PRINT Yvar`

to print the y-values of the individuals in the first successor set. `BASSESS` takes account of restrictions on `Y` or on any of the `X` variates or factors, so you could now use `BASSESS` again to find the best split on that restricted set.

The `WEIGHTS` option can supply a variate of weights for the observations. This could be used to supply prior probabilities, or to emphasize units that are perceived as being especially important.

Finally, the `TOLERANCE` option can be used to modify the tolerance multiplier used internally, for example to check for equality of x-values. By default this is set automatically to a value appropriate for the Genstat implementation concerned.

Options: `Y`, `SELECTED`, `TESTSPLIT`, `MAXSPLITPOINT`, `MAXCRITERION`, `NOSELECTION`, `FMETHOD`, `ANTIENDCUTFACTOR`, `WEIGHTS`, `TOLERANCE`.

Parameters: `X`, `ORDERED`, `SPLITPOINT`, `CRITERIONVALUE`.

### Method

Further general information about classification and regression trees can be found in Breiman *et al*. (1984). The methods used by `BASSESS` for classification trees are based on Taylor & Silverman (1993). The `Gini` setting of the `FMETHOD` option uses the change in Gini information:

*G* = (1 – ∑_{k} *α*_{k}^{2}) – (∑_{k} *β*_{1k}) × (1 – ∑_{k} *β*_{1k}^{2}) – (∑_{k} *β*_{2k}) × (1 – ∑_{k} *β*_{2k}^{2})

where *α*_{k} is the proportion of individuals in the original set that are in group *k*, and *β*_{ik} is the proportion of individuals in successor set *i* (*i* = 1 or 2) that are in group *k*. The aim here is to split the individuals into sets to maximize differences between the within-set group probabilities. An equivalent formula (Taylor & Silverman 1993, Section 4) is

*G* = (*p*_{1} × *p*_{2}) × { ∑_{k} *β*_{1k}^{2} + ∑_{k} *β*_{2k}^{2} – ∑_{k} ( *β*_{1k} × *β*_{2k} ) }

where *p*_{i} = ∑_{k} *β*_{ik}. The alternative `MPI` (*mean posterior improvement*) criterion concentrates more on making the group probabilities differ between the successor sets:

*MPI* = (*p*_{1} × *p*_{2}) × { 1 – ∑_{k} (( *β*_{1k} × *β*_{2k}) / ( *β*_{1k} + *β*_{2k})) }
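
As a hedged illustration of these two criteria, the computation can be sketched in Python. This is not Genstat code: here *β*_{ik} is taken as the group proportion within successor set *i* and *p*_{i} as the proportion of individuals allocated to set *i*, which is one natural reading of the definitions above, so the exact scaling may differ from Genstat's.

```python
import numpy as np

def gini_and_mpi(groups, first_set):
    """Change in Gini information and MPI criterion for one split.

    groups    : array of group labels, one per individual
    first_set : boolean array, True for individuals in successor set 1

    Within-set group proportions are used for beta; Genstat's exact
    scaling of the criteria may differ.  (Illustrative sketch.)
    """
    groups = np.asarray(groups)
    first_set = np.asarray(first_set)
    labels = np.unique(groups)
    alpha = np.array([(groups == k).mean() for k in labels])
    p1 = first_set.mean()
    p2 = 1 - p1
    beta1 = np.array([(groups[first_set] == k).mean() for k in labels])
    beta2 = np.array([(groups[~first_set] == k).mean() for k in labels])

    # change in Gini information: initial impurity minus the
    # size-weighted impurities of the two successor sets
    gini = ((1 - np.sum(alpha ** 2))
            - p1 * (1 - np.sum(beta1 ** 2))
            - p2 * (1 - np.sum(beta2 ** 2)))

    # MPI criterion, with the plain anti-end-cut factor p1 * p2;
    # guard against a group being absent from both successor sets
    denom = beta1 + beta2
    ratio = np.divide(beta1 * beta2, denom,
                      out=np.zeros_like(denom), where=denom > 0)
    mpi = p1 * p2 * (1 - np.sum(ratio))
    return gini, mpi
```

Under this reading, a split that separates two equally sized groups perfectly gives *G* = 0.5 and *MPI* = 0.25, while a split that leaves the group proportions unchanged gives *G* = 0.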

Taylor & Silverman (1993) note that the term (*p*_{1} × *p*_{2}) aims to generate successor sets of similar size, and refer to it as the *anti-end-cut factor* because it aims to avoid sets being produced with only a small number of individuals. They suggest that this should vary according to the complexity of the problem, and instead become

min { *p*_{1} × *p*_{2}, *p*_{low} × (1 – *p*_{low}) }

where *p*_{low} is the reciprocal of the number of groups in the initial set for the `classnumber` setting of the `ANTIENDCUTFACTOR` option, and

min { 0.5, 1 / ( ∑_{k} *α*_{k}^{2} ) }

for the `reciprocalentropy` setting. The idea is to encourage splits that lead to terminal nodes, and to take account of the fact that these are more likely to be generated as the number of groups becomes small.

### Action with `RESTRICT`

You can request that `BASSESS` operate on only a subset of the units by applying a restriction to the `Y` variate or factor, or to any of the `X` variates or factors, or to the `WEIGHTS` variate.

### References

Breiman, L., Friedman, J.H., Olshen, R.A. & Stone, C.J. (1984). *Classification and Regression Trees*. Wadsworth, Monterey.

Taylor, P.C. & Silverman, B.W. (1993). Block diagrams and splitting criteria for classification trees. *Statistics and Computing*, 3, 147-161.

### See also

Directives: `BCUT`, `BGROW`, `BIDENTIFY`, `BJOIN`, `TREE`.

Procedures: `BCONSTRUCT`, `BCLASSIFICATION`, `BGRAPH`, `BKEY`, `BPRINT`, `BPRUNE`.

Commands for: Calculations and manipulation.