Assesses potential splits for regression and classification trees.
Options
| Option | Description |
| --- | --- |
| Y = variate or factor | Response variate for a regression tree, or factor specifying the groupings for a classification tree |
| SELECTED = dummy | Returns the identifier of the X variate or factor used in the best split |
| TESTSPLIT = expression structure | Logical expression representing the best split |
| MAXSPLITPOINT = scalar or variate | When SELECTED is a variate or a factor with ordered levels, this returns a scalar containing the boundary between the two splits; when SELECTED is a factor with unordered levels, it returns a variate containing the levels allocated to the first split |
| MAXCRITERION = scalar | Maximum value obtained for the selection criterion |
| NOSELECTION = scalar | Returns the value 1 if no split has been selected, otherwise 0 |
| FMETHOD = string token | Selection method to use when Y is a factor (Gini, MPI); default Gini |
| ANTIENDCUTFACTOR = string token | Anti-end-cut factor to use when Y is a factor (classnumber, reciprocalentropy); default * i.e. none |
| WEIGHTS = variate | Weights; default * i.e. all weights 1 |
| TOLERANCE = scalar | Tolerance multiplier used e.g. to check for equality of x-values; default * i.e. set automatically for the implementation concerned |
Parameters
| Parameter | Description |
| --- | --- |
| X = variates or factors | Variables available to make the split |
| ORDERED = string tokens | Whether factor levels are ordered (yes, no); default no |
| SPLITPOINT = scalars or variates | Saves details of the best split found for each X variable; when X is a variate or a factor with ordered levels, this returns a scalar containing the boundary between the two splits; when X is a factor with unordered levels, it returns a variate containing the levels allocated to the first split |
| CRITERIONVALUE = scalars | Saves the value of the selection criterion for the best split found for each X variable |
Description
BASSESS
selects splits for use when constructing classification or regression trees. The Y
option specifies the factor defining the groupings for a classification tree, or the response variate for a regression tree. The x-variables that are available to make the split are supplied by the X
parameter. They can be variates, or factors with either ordered or unordered levels as indicated by the ORDERED
parameter. For example, a factor called Dose with levels 1, 1.5, 2 and 2.5 would usually be treated as having ordered levels, whereas levels labelled 'Morphine', 'Amidone', 'Phenadoxone' and 'Pethidine' of a factor called Drug would be regarded as unordered.
In a regression tree, the accuracy of each node is the sum of squared deviations of the response values from their mean, taken over the observations at the node and divided by the total number of observations. Potential splits are assessed by their effect on the accuracy, that is, by the difference between the accuracy of the node and the sum of the accuracies of the two successor nodes resulting from the split.
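For intuition only, here is a minimal Python sketch of this criterion. It is not Genstat code: the function names (node_accuracy, best_split_on_variate) are hypothetical, all weights are taken as 1, ties and tolerances are ignored, and the exact expressions evaluated by BASSESS may differ in detail.

```python
import numpy as np

def node_accuracy(y, n_total):
    """Sum of squared deviations of the node's y-values from their mean,
    divided by the total number of observations in the whole set."""
    return np.sum((y - y.mean()) ** 2) / n_total

def best_split_on_variate(x, y):
    """Assess every candidate boundary on a single numeric x-variable and
    return the boundary giving the largest reduction in accuracy."""
    n_total = len(y)
    parent = node_accuracy(y, n_total)
    best_boundary, best_reduction = None, 0.0
    for boundary in np.unique(x)[:-1]:      # each distinct x-value except the last
        left, right = y[x <= boundary], y[x > boundary]
        reduction = parent - (node_accuracy(left, n_total)
                              + node_accuracy(right, n_total))
        if reduction > best_reduction:
            best_boundary, best_reduction = boundary, reduction
    return best_boundary, best_reduction

x = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5])
y = np.array([4.1, 3.9, 4.0, 7.8, 8.2, 8.0])
boundary, reduction = best_split_on_variate(x, y)
print(boundary, reduction)                  # -> 2.0 4.0, i.e. split at x <= 2.0
```

In this sketch the boundary is reported as the largest x-value sent to the first successor set; whether the boundary saved by SPLITPOINT or MAXSPLITPOINT is that value or, say, a midpoint is not specified here.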
For a classification tree, the FMETHOD
option allows one of two selection criteria to be requested, either Gini information or the MPI (mean posterior improvement) criterion of Taylor & Silverman (1993). The default is to use Gini information. The ANTIENDCUTFACTOR
option allows you to request use of adaptive anti-end-cut factors as devised by Taylor & Silverman (1993, Section 5). Further details are given in the Method section below. By default no adaptive factors are used.
The SPLITPOINT
parameter can be used to save details of the best split found for each X
variable. When X
is a variate or a factor with ordered levels, this returns a scalar containing the boundary between the two splits. Alternatively, when X
is a factor with unordered levels, it returns a variate containing the levels allocated to the first split. The CRITERIONVALUE
parameter saves the value of the selection criterion for the best split found for each X
variable.
The SELECTED
option can be set to a dummy to store the identifier of the X
variate or factor used in the best split, and the MAXSPLITPOINT
option can save details of the best split, similarly to the SPLITPOINT
parameter. The MAXCRITERION
option saves the maximum value obtained for the selection criterion, and the NOSELECTION
option saves a scalar containing the value 0 if a split could be selected, or 1 if no further splitting was possible. You can save a logical expression representing the best split using the TESTSPLIT
option. So, for example, you can put
BASSESS [Y=Yvar; TESTSPLIT=Test; ...]
RESTRICT Yvar; #Test == 1
PRINT Yvar
to print the y-values of the individuals in the first successor set. BASSESS
takes account of restrictions on Y
or on any of the X
variates or factors. So you could now also use BASSESS
to find the best split on that set.
The WEIGHTS
option can supply a variate of weights for the observations. This could be used to supply prior probabilities, or to emphasize units that are perceived as being especially important.
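The documentation does not spell out exactly how the weights enter the accuracy calculation; one plausible scheme (an assumption for illustration, not taken from the Genstat source) is sketched below in Python, replacing the unweighted node accuracy used in the earlier sketch.

```python
import numpy as np

def weighted_node_accuracy(y, w, total_weight):
    """Assumed weighting scheme: weighted sum of squared deviations from
    the weighted mean, divided by the total weight of all observations."""
    mean = np.average(y, weights=w)          # weighted mean of the node
    return np.sum(w * (y - mean) ** 2) / total_weight

y = np.array([4.1, 3.9, 4.0])
w = np.array([1.0, 1.0, 2.0])                # emphasize the third unit
print(weighted_node_accuracy(y, w, total_weight=w.sum()))
```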
Finally, the TOLERANCE
option can be used to modify the tolerance multiplier used internally for example to check for equality of x-values. By default this is set automatically to a value appropriate for the Genstat implementation concerned.
Options: Y, SELECTED, TESTSPLIT, MAXSPLITPOINT, MAXCRITERION, NOSELECTION, FMETHOD, ANTIENDCUTFACTOR, WEIGHTS, TOLERANCE.
Parameters: X, ORDERED, SPLITPOINT, CRITERIONVALUE.
Method
Further general information about classification and regression trees can be found in Breiman et al. (1984). The methods used by BASSESS
for classification trees are based on Taylor & Silverman (1993). The Gini
setting of the FMETHOD
option uses the change in Gini information:
G = (1 – ∑k αk²) – (∑k β1k) × (1 – ∑k β1k²) – (∑k β2k) × (1 – ∑k β2k²)
where αk is the proportion of individuals in the original set that are in group k, and βik is the proportion of individuals in successor set i (i = 1 or 2) that are in group k. The aim here is to split the individuals into sets to maximize differences between the within-set group probabilities. An equivalent formula (Taylor & Silverman 1993, Section 4) is
G = (p1 × p2) × { ∑k β1k² + ∑k β2k² – ∑k ( β1k × β2k ) }
where pi = ∑k βik. The alternative MPI
(mean posterior improvement) criterion concentrates more on making the group probabilities differ between the successor sets:
MPI = (p1 × p2) × { 1 – ∑k (( β1k × β2k) / ( β1k + β2k)) }
Taylor & Silverman (1993) note that the term (p1 × p2) aims to generate successor sets of similar size, and refer to it as the anti-end-cut factor because it aims to avoid sets being produced with only a small number of individuals. They suggest that this should vary according to the complexity of the problem, and instead become
min { p1 × p2, p_low × (1 – p_low) }
where p_low is the reciprocal of the number of groups in the initial set for the classnumber
setting of the ANTIENDCUTFACTOR
option, and
min { 0.5, 1 / ( ∑k αk² ) }
for the reciprocalentropy
setting. The idea is to encourage splits that lead to terminal nodes, and to take account of the fact that these are more likely to be generated as the number of groups becomes small.
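For intuition, the sketch below computes the decrease in Gini impurity for a proposed two-way grouping in its standard textbook form, G = (1 – ∑k αk²) – p1 × Gini(set 1) – p2 × Gini(set 2), where p1 and p2 are the proportions of individuals allocated to the two successor sets (Breiman et al. 1984). It is illustrative Python, not Genstat code, and the exact expressions evaluated by BASSESS (including the MPI criterion and the anti-end-cut adjustments) may differ in detail from it.

```python
import numpy as np

def gini(labels):
    """Gini impurity 1 - sum_k p_k**2 of a set of group labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_decrease(groups, in_first_set):
    """Standard decrease in Gini impurity when the parent set is split
    into two successor sets (Breiman et al. 1984 form)."""
    first = groups[in_first_set]
    second = groups[~in_first_set]
    p1 = len(first) / len(groups)
    p2 = 1.0 - p1
    return gini(groups) - p1 * gini(first) - p2 * gini(second)

# Eight individuals in two groups, split into the first four and last four
groups = np.array(list("AABABBBB"))
split = np.array([True] * 4 + [False] * 4)
print(gini_decrease(groups, split))          # -> 0.28125
```

An equivalent way of writing this decrease is p1 × p2 × ∑k (β1k – β2k)², with βik the within-set group proportions, which is the form that makes the role of the anti-end-cut factor p1 × p2 visible.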
Action with RESTRICT
You can request that BASSESS
operate on only a subset of the units by applying a restriction to the Y
variate or factor, or to any of the X
variates or factors, or to the WEIGHTS
variate.
References
Breiman, L., Friedman, J.H., Olshen, R.A. & Stone, C.J. (1984). Classification and Regression Trees. Wadsworth, Monterey.
Taylor, P.C. & Silverman, B.W. (1993). Block diagrams and splitting criteria for classification trees. Statistics and Computing, 3, 147-161.
See also
Directives: BCUT, BGROW, BIDENTIFY, BJOIN, TREE.
Procedures: BCONSTRUCT, BCLASSIFICATION, BGRAPH, BKEY, BPRINT, BPRUNE.
Commands for: Calculations and manipulation.