Assesses potential splits for regression and classification trees.
Options
| Option | Description |
| --- | --- |
| Y = variate or factor | Response variate for a regression tree, or factor specifying the groupings for a classification tree |
| SELECTED = dummy | Returns the identifier of the X variate or factor used in the best split |
| TESTSPLIT = expression structure | Logical expression representing the best split |
| MAXSPLITPOINT = scalar or variate | When SELECTED is a variate or a factor with ordered levels, this returns a scalar containing the boundary between the two splits; when SELECTED is a factor with unordered levels, it returns a variate containing the levels allocated to the first split |
| MAXCRITERION = scalar | Maximum value obtained for the selection criterion |
| NOSELECTION = scalar | Returns the value 1 if no split has been selected, otherwise 0 |
| FMETHOD = string token | Selection method to use when Y is a factor (Gini, MPI); default Gini |
| ANTIENDCUTFACTOR = string token | Anti-end-cut factor to use when Y is a factor (classnumber, reciprocalentropy); default * i.e. none |
| WEIGHTS = variate | Weights; default * i.e. all weights 1 |
| TOLERANCE = scalar | Tolerance multiplier used e.g. to check for equality of x-values; default * i.e. set automatically for the implementation concerned |
Parameters
| Parameter | Description |
| --- | --- |
| X = variates or factors | Variables available to make the split |
| ORDERED = string tokens | Whether factor levels are ordered (yes, no); default no |
| SPLITPOINT = scalars or variates | Saves details of the best split found for each X variable; when X is a variate or a factor with ordered levels, this returns a scalar containing the boundary between the two splits; when X is a factor with unordered levels, it returns a variate containing the levels allocated to the first split |
| CRITERIONVALUE = scalars | Saves the value of the selection criterion for the best split found for each X variable |
Description
BASSESS
selects splits for use when constructing classification or regression trees. The Y
option specifies the factor defining the groupings for a classification tree, or the response variate for a regression tree. The x-variables that are available to make the split are supplied by the X
parameter. They can be variates, or factors with either ordered or unordered levels as indicated by the ORDERED
parameter. For example, a factor called Dose with levels 1, 1.5, 2 and 2.5 would usually be treated as having ordered levels, whereas levels labelled 'Morphine', 'Amidone', 'Phenadoxone' and 'Pethidine' of a factor called Drug would be regarded as unordered.
In a regression tree, the accuracy of each node is the sum of squared deviations of the response values from their mean, taken over the observations at the node and divided by the total number of observations. Potential splits are assessed by their effect on the accuracy, that is, by the difference between the accuracy of the node and the sum of the accuracies of the two successor nodes resulting from the split.
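For intuition only, here is a minimal Python sketch of this criterion. It is not Genstat code: the function names (node_accuracy, best_split_on_variate) are hypothetical, all weights are taken as 1, ties and tolerances are ignored, and the exact expressions evaluated by BASSESS may differ in detail.

```python
import numpy as np

def node_accuracy(y, n_total):
    """Sum of squared deviations of the node's y-values from their mean,
    divided by the total number of observations in the whole set."""
    return np.sum((y - y.mean()) ** 2) / n_total

def best_split_on_variate(x, y):
    """Assess every candidate boundary on a single numeric x-variable and
    return the boundary giving the largest reduction in accuracy."""
    n_total = len(y)
    parent = node_accuracy(y, n_total)
    best_boundary, best_reduction = None, 0.0
    for boundary in np.unique(x)[:-1]:      # each distinct x-value except the last
        left, right = y[x <= boundary], y[x > boundary]
        reduction = parent - (node_accuracy(left, n_total)
                              + node_accuracy(right, n_total))
        if reduction > best_reduction:
            best_boundary, best_reduction = boundary, reduction
    return best_boundary, best_reduction

x = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5])
y = np.array([4.1, 3.9, 4.0, 7.8, 8.2, 8.0])
boundary, reduction = best_split_on_variate(x, y)
print(boundary, reduction)                  # -> 2.0 4.0, i.e. split at x <= 2.0
```

In this sketch the boundary is reported as the largest x-value sent to the first successor set; whether the boundary saved by SPLITPOINT or MAXSPLITPOINT is that value or, say, a midpoint is not specified here.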
For a classification tree, the FMETHOD
option allows one of two selection criteria to be requested, either Gini information or the MPI (mean posterior improvement) criterion of Taylor & Silverman (1993). The default is to use Gini information. The ANTIENDCUTFACTOR
option allows you to request use of adaptive anti-end-cut factors as devised by Taylor & Silverman (1993, Section 5). Further details are given in the Method section below. By default no adaptive factors are used.
The SPLITPOINT
parameter can be used to save details of the best split found for each X
variable. When X
is a variate or a factor with ordered levels, this returns a scalar containing the boundary between the two splits. Alternatively, when X
is a factor with unordered levels, it returns a variate containing the levels allocated to the first split. The CRITERIONVALUE
parameter saves the value of the selection criterion for the best split found for each X
variable.
The SELECTED
option can be set to a dummy to store the identifier of the X
variate or factor used in the best split, and the MAXSPLITPOINT
option can save details of the best split, similarly to the SPLITPOINT
parameter. The MAXCRITERION
option saves the maximum value obtained for the selection criterion, and the NOSELECTION
option saves a scalar containing the value 0 if a split could be selected, or 1 if no further splitting was possible. You can save a logical expression representing the best split using the TESTSPLIT
option. So, for example, you can put
BASSESS [Y=Yvar; TESTSPLIT=Test; ...]
RESTRICT Yvar; #Test == 1
PRINT Yvar
to print the y-values of the individuals in the first successor set. BASSESS
takes account of restrictions on Y
or on any of the X
variates or factors. So you could now also use BASSESS
to find the best split on that set.
The WEIGHTS
option can supply a variate of weights for the observations. This could be used to supply prior probabilities, or to emphasize units that are perceived as being especially important.
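The documentation does not spell out exactly how the weights enter the accuracy calculation; one plausible scheme (an assumption for illustration, not taken from the Genstat source) is sketched below in Python, replacing the unweighted node accuracy used in the earlier sketch.

```python
import numpy as np

def weighted_node_accuracy(y, w, total_weight):
    """Assumed weighting scheme: weighted sum of squared deviations from
    the weighted mean, divided by the total weight of all observations."""
    mean = np.average(y, weights=w)          # weighted mean of the node
    return np.sum(w * (y - mean) ** 2) / total_weight

y = np.array([4.1, 3.9, 4.0])
w = np.array([1.0, 1.0, 2.0])                # emphasize the third unit
print(weighted_node_accuracy(y, w, total_weight=w.sum()))
```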
Finally, the TOLERANCE
option can be used to modify the tolerance multiplier used internally for example to check for equality of x-values. By default this is set automatically to a value appropriate for the Genstat implementation concerned.
Options: Y, SELECTED, TESTSPLIT, MAXSPLITPOINT, MAXCRITERION, NOSELECTION, FMETHOD, ANTIENDCUTFACTOR, WEIGHTS, TOLERANCE.
Parameters: X, ORDERED, SPLITPOINT, CRITERIONVALUE.
Method
Further general information about classification and regression trees can be found in Breiman et al. (1984). The methods used by BASSESS
for classification trees are based on Taylor & Silverman (1993). The Gini
setting of the FMETHOD
option uses the change in Gini information:
G = (1 – ∑k αk²) – (∑k β1k) × (1 – ∑k β1k²) – (∑k β2k) × (1 – ∑k β2k²)
where αk is the proportion of individuals in the original set that are in group k, and βik is the proportion of individuals in successor set i (i = 1 or 2) that are in group k. The aim here is to split the individuals into sets to maximize differences between the within-set group probabilities. An equivalent formula (Taylor & Silverman 1993, Section 4) is
G = (p1 × p2) × { ∑k β1k² + ∑k β2k² – ∑k ( β1k × β2k ) }
where pi = ∑k βik. The alternative MPI
(mean posterior improvement) criterion concentrates more on making the group probabilities differ between the successor sets:
MPI = (p1 × p2) × { 1 – ∑k (( β1k × β2k) / ( β1k + β2k)) }
Taylor & Silverman (1993) note that the term (p1 × p2) aims to generate successor sets of similar size, and refer to it as the anti-end-cut factor because it aims to avoid sets being produced with only a small number of individuals. They suggest that this should vary according to the complexity of the problem, and instead become
min { p1 × p2, p_low × (1 – p_low) }
where p_low is the reciprocal of the number of groups in the initial set for the classnumber
setting of the ANTIENDCUTFACTOR
option, and
min { 0.5, 1 / ( ∑k αk² ) }
for the reciprocalentropy
setting. The idea is to encourage splits that lead to terminal nodes, and to take account of the fact that these are more likely to be generated as the number of groups becomes small.
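For intuition, the sketch below computes the decrease in Gini impurity for a proposed two-way grouping in its standard textbook form, G = (1 – ∑k αk²) – p1 × Gini(set 1) – p2 × Gini(set 2), where p1 and p2 are the proportions of individuals allocated to the two successor sets (Breiman et al. 1984). It is illustrative Python, not Genstat code, and the exact expressions evaluated by BASSESS (including the MPI criterion and the anti-end-cut adjustments) may differ in detail from it.

```python
import numpy as np

def gini(labels):
    """Gini impurity 1 - sum_k p_k**2 of a set of group labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_decrease(groups, in_first_set):
    """Standard decrease in Gini impurity when the parent set is split
    into two successor sets (Breiman et al. 1984 form)."""
    first = groups[in_first_set]
    second = groups[~in_first_set]
    p1 = len(first) / len(groups)
    p2 = 1.0 - p1
    return gini(groups) - p1 * gini(first) - p2 * gini(second)

# Eight individuals in two groups, split into the first four and last four
groups = np.array(list("AABABBBB"))
split = np.array([True] * 4 + [False] * 4)
print(gini_decrease(groups, split))          # -> 0.28125
```

An equivalent way of writing this decrease is p1 × p2 × ∑k (β1k – β2k)², with βik the within-set group proportions, which is the form that makes the role of the anti-end-cut factor p1 × p2 visible.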
Action with RESTRICT
You can request that BASSESS
operate on only a subset of the units by applying a restriction to the Y
variate or factor, or to any of the X
variates or factors, or to the WEIGHTS
variate.
References
Breiman, L., Friedman, J.H., Olshen, R.A. & Stone, C.J. (1984). Classification and Regression Trees. Wadsworth, Monterey.
Taylor, P.C. & Silverman, B.W. (1993). Block diagrams and splitting criteria for classification trees. Statistics and Computing, 3, 147-161.
See also
Directives: BCUT, BGROW, BIDENTIFY, BJOIN, TREE.
Procedures: BCONSTRUCT, BCLASSIFICATION, BGRAPH, BKEY, BPRINT, BPRUNE.
Commands for: Calculations and manipulation.