Identifies specimens using a classification tree (R.W. Payne).
Options
| PRINT = string tokens | Controls printed output (identification, transcript); if PRINT is unset in an interactive run, BCIDENTIFY will ask what you want to print; in a batch run the default is iden |
|---|---|
| TREE = tree | Specifies the tree |
| IDENTIFICATION = text | Saves the identification of each specimen |
| TERMINALNODES = pointer | Saves the numbers of the terminal nodes reached by each specimen |
| PROBABILITIES = matrix | Specimen × group matrix giving the probability that the specimens belong to each group |
| MVINCLUDE = string token | Whether to provide identifications for specimens with missing or unavailable values of the x-variables (explanatory); default expl |
Parameters
| X = variates or factors | Explanatory variables |
|---|---|
| VALUES = scalars, variates or texts | Values to use for the explanatory variables; if these are unset for any variable, its existing values are used |
Description
BCIDENTIFY identifies specimens using a classification tree, as constructed by the BCLASSIFICATION procedure. The tree can be saved from BCLASSIFICATION (using the TREE option of BCLASSIFICATION), and specified for BCIDENTIFY using its own TREE option. Alternatively, BCIDENTIFY will ask you for the identifier of the tree if you do not specify TREE when running interactively.
The characteristics of the specimens can be specified in the variates or factors listed by the X parameter. These must have identical names (and levels) to those used originally to construct the tree. You can use the VALUES parameter to supply new values, if those stored in any of the variates or factors are unsuitable.
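As a minimal sketch of the VALUES parameter, the call below reuses identifiers from the Example section. It assumes that the tree was built from the factors x1 to x7 with levels 0 and 1, and that the variates xdefn[1...7] hold the new 0/1 values. That values supplied in variates are matched against the factor levels is an assumption of this sketch, not something stated above.

```genstat
"Sketch only: identify the specimens whose characteristics are held in"
"xdefn[1...7], leaving the values stored in x1...x7 unchanged."
BCIDENTIFY [PRINT=identification; TREE=tree] X=x1,x2,x3,x4,x5,x6,x7;\
  VALUES=xdefn[1...7]
```

Supplying the values this way avoids redefining the factors themselves, which may be convenient when the original data are still needed afterwards.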
If you do not set X when running interactively, BCIDENTIFY will ask you to supply the relevant characteristics in turn, as required by the tree. Otherwise, if an x-variable in the tree is not specified in the X parameter list, its values are assumed to be unavailable (i.e. missing).
By default, when the x-variable required at a node in the tree is unavailable or contains a missing value, BCIDENTIFY will follow all the branches from that node and form a combined conclusion. If you would prefer the identification to be missing, you can set option MVINCLUDE=*.
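The two treatments of unavailable x-variables might be compared as in the sketch below. It reuses the identifiers tree and x1 to x7 from the Example section, and deliberately leaves x7 out of the X list so that it is treated as unavailable. Whether x7 is actually needed for any specimen depends on the tree, so the two calls may or may not give different results.

```genstat
"Sketch only: x7 is omitted, so it is unavailable wherever the tree asks for it."
"Default: follow every branch at such nodes and form a combined conclusion."
BCIDENTIFY [PRINT=identification; TREE=tree] x1,x2,x3,x4,x5,x6

"Alternative: give a missing identification for the affected specimens."
BCIDENTIFY [PRINT=identification; TREE=tree; MVINCLUDE=*] x1,x2,x3,x4,x5,x6
```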
The PRINT option controls printed output, with settings:
| identification | prints the identifications obtained using the tree; |
|---|---|
| transcript | prints the observed characteristics when supplied in response to questions in an interactive run. |
If you do not set PRINT in an interactive run, BCIDENTIFY will ask what you would like to print. In batch, the default is to print the identifications.
The IDENTIFICATION option allows you to save the identifications (in a text). The TERMINALNODES option allows you to save a pointer, with an element for each specimen, containing the numbers of the terminal nodes reached in the tree to provide its identification. Each element will be a scalar if the identification was derived from a single node, or a variate if it involved more than one (because several branches were taken as the result of a missing x-value). Finally, the PROBABILITIES option can save a specimen-by-group matrix giving the probability that the specimens belong to each group.
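A sketch of saving all three results in one call is given below. The identifiers ident, nodes and probs are hypothetical names chosen for illustration, while tree and x1 to x7 are taken from the Example section.

```genstat
"Sketch only: suppress printed output and save the results instead."
BCIDENTIFY [PRINT=*; TREE=tree; IDENTIFICATION=ident; TERMINALNODES=nodes;\
  PROBABILITIES=probs] x1,x2,x3,x4,x5,x6,x7

"ident is a text with one identification per specimen."
PRINT ident
"probs is the specimen-by-group matrix of probabilities."
PRINT probs
"nodes[1] holds the terminal node number(s) reached by the first specimen."
PRINT nodes[1]
```

The pointer elements are printed one at a time here because they need not all have the same length: an element is a variate rather than a scalar when several branches were followed.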
Options: PRINT, TREE, IDENTIFICATION, TERMINALNODES, PROBABILITIES, MVINCLUDE.
Parameters: X, VALUES.
Method
BCIDENTIFY uses BIDENTIFY to find the terminal nodes of the tree that correspond to the values of the explanatory variables.
Action with RESTRICT
Restrictions are ignored.
See also
Procedures: BCLASSIFICATION, BCDISPLAY, BCKEEP.
Commands for: Multivariate and cluster analysis.
Example
CAPTION 'BCIDENTIFY example',!t(\
'Calculator digit recognition problem as in Breiman et al.',\
'(1984, p.44). The assumption is that the digits of a calculator',\
'are made up of 7 lines (as shown below), which may be missing for',\
'any particular digit with probability 0.1:'); STYLE=meta,plain
SCALAR Chan
ENQUIRE Chan; FILETYPE=output; OUTSTYLE=Style
OUTPUT [STYLE=plain]
PRINT !t(' -1- ','|   |','2   3','|   |',' -4- ',\
'|   |','5   6','|   |',' -7- '); FIELD=20
OUTPUT [STYLE=#Style]
VARIATE xdefn[1...7]
READ [PRINT=error] xdefn[1...7]
0 0 1 0 0 1 0
1 0 1 1 1 0 1
1 0 1 1 0 1 1
0 1 1 1 0 1 0
1 1 0 1 0 1 1
1 1 0 1 1 1 1
1 0 1 0 0 1 0
1 1 1 1 1 1 1
1 1 1 1 0 1 1
1 1 1 0 1 1 1 :
"generate a set of random observations"
SCALAR nsamples,seed; VALUE=50,876083
VARIATE [NVALUES=nsamples] light[1...7],truelight[1...7],error[1...7],rdigit
CALC rdigit = MOD( INTEGER( URAND(seed; nsamples) * 10); 10) + 1
& truelight[] = ELEMENTS(xdefn[]; rdigit)
GRANDOM [DISTRIBUTION=binomial; PROBABILITY=0.1; NVALUES=nsamples]error[1...7]
CALC light[] = MOD(truelight[] + error[]; 2)
FACTOR [LEVELS=!(0...9)] digit; VALUES=MOD(rdigit; 10); DECIMALS=0
FACTOR [LEVELS=!(0,1)] x1,x2,x3,x4,x5,x6,x7; VALUES=light[]; DECIMALS=0
"form the classification tree"
BCLASSIFICATION [PRINT=*; GROUPS=digit; TREE=tree]\
x1,x2,x3,x4,x5,x6,x7
"prune the tree"
BPRUNE [PRINT=table] tree; NEWTREE=pruned
"use the 5th tree - renumber nodes"
BCUT [RENUMBER=yes] pruned[5]; NEWTREE=tree
"display the tree"
BCDISPLAY [PRINT=labelled] tree
PRINT 'Check identification of the true representations of the digits.'
FACTOR [LEVELS=!(0,1); NVALUES=10] x1,x2,x3,x4,x5,x6,x7; VALUES=xdefn[]
BCIDENTIFY [PRINT=*; TREE=tree; IDENTIFICATION=identification]\
x1,x2,x3,x4,x5,x6,x7
TEXT [VALUES='Digit 1:','Digit 2:','Digit 3:','Digit 4:','Digit 5:',\
'Digit 6:','Digit 7:','Digit 8:','Digit 9:','Digit 0:'] name
PRINT name,identification; FIELD=15