Constructs an identification key (R.W. Payne).
Options
PRINT = string tokens | Controls printed output (indented, bracketed, diagram, graph); default * i.e. none |
---|---|
TAXONNAMES = text | Names of the taxa in the key; default * uses textual versions of the numbers 1, 2 onwards |
GROUPS = factor | Groupings of the taxa, if the key is to identify the group of a specimen rather than its taxon |
CRITERION = string token | Criterion used to select the character at each node of the key (CME, CMV, GME); default GME when GROUPS is set, otherwise CME |
PARTIAL = string token | Controls whether or not to use partial separation (yes, no); default no |
KEY = tree | Saves the key |
Parameters
CHARACTER = factors | Characters available to construct the key |
---|---|
COST = scalars | Cost of each character; default 1 |
Description
Identification keys provide efficient ways of identifying objects, or taxa, whose properties can be described by a set of discrete-valued tests. Many applications are biological. For example, in botanical work, the taxa may be species of plant and the tests may require the observation of characters like the colours of petals or numbers of leaves. Similarly, in microbiology, the tests may involve the ability of an organism to grow in various media. Using a key involves doing a sequence of tests which continues until the unknown specimen can be identified.
The characters that are available for constructing the key are specified, as a list of factors, using the CHARACTER parameter. Each factor has a level for each possible value of the character concerned, and you can insert a missing value for a particular taxon to indicate that its value for the character is either variable or unknown. If an “extra” text has been defined for the factor (using the EXTRA parameter of the FACTOR directive), BKEY will use this, instead of the identifier of the factor, when printing the textual forms of the key. (So the characters can be described in the key using any printable symbols, not just those that may be used in identifiers.)
The COST parameter allows you to specify a cost for each character. This may be how much the character costs to observe, or may simply record your own preferences between the characters. By default all the costs are 1.
The names of the taxa can be specified in a text using the TAXONNAMES option. If this is omitted, they are simply numbered 1, 2 and so on. If the taxa are classified into groups, BKEY can construct a key to identify the group of a specimen rather than the taxon itself. These groupings can be supplied, as a factor, using the GROUPS option.
The efficiency of a key is usually measured by its expected cost of identification. To find the optimal key using a particular set of data essentially requires the construction and comparison of all possible keys for the taxa that could be formed with the available tests. This is impracticable even for moderate numbers of tests and taxa. Thus, heuristic algorithms are used which construct the key sequentially, selecting first the test that “best” divides the taxa into sets (where set k for test i contains all the taxa that can give result k to test i), then selecting the best test to use with each set, continuing until the sets each contain only one taxon – or until no further separation is possible. The “best” test can be defined using a selection criterion function (Gower & Payne 1975). BKEY provides three criteria, which can be selected using the CRITERION option, with settings:
CME | is an estimate of the expected cost of completing the identification from the current point of the key, assuming that test i is used and that, below this point, the key is completed optimally (this is the function CMe devised by Payne 1981); |
---|---|
CMV | is a less optimistic estimate, which assumes that the key is completed by simple binary tests (i.e. tests for each of which one particular taxon always gives a positive response and the other taxa give negative responses); this corresponds to the function CMv′ of Payne (1981); |
GME | is an equivalent version of CMv for the identification of groups of taxa (see Payne, Yarrow & Barnett 1982). |
CMe and CMv′ (and two other criteria) were studied by Payne & Thompson (1989), who found that each of them produced the best key for some sets of data. They thus concluded that programs for key construction should allow their users to try several so that they can choose the one that behaves best with any particular set of data.
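The sequential construction described above can be sketched as follows. This is an illustration in Python, not Genstat, and not the code that BKEY runs. It assumes equally likely taxa, and its `score` function is a simple stand-in (test cost plus a log2 estimate of the remaining work), not CMe, CMv′ or GMe. All function and variable names, and the small data set, are hypothetical.

```python
import math

def split(taxa, data, test, levels):
    """Set k holds every taxon that can give result k to the test; a missing
    value (None) means the taxon may give any result, so it joins every set."""
    return {k: [t for t in taxa if data[t][test] in (k, None)] for k in levels}

def score(taxa, data, test, levels, cost):
    """Stand-in selection criterion (an assumption, not CMe/CMv'/GMe):
    cost of the test plus a log2 estimate of the work left in each set."""
    sets = split(taxa, data, test, levels)
    n = len(taxa)
    return cost + sum(len(s) / n * math.log2(len(s)) for s in sets.values() if s)

def build_key(taxa, data, levels, costs):
    """Greedy construction: choose the best test, split, recurse on each set."""
    if len(taxa) <= 1:
        return taxa  # leaf: a single taxon remains
    # A test is useful only if every resulting set is smaller than the current one.
    useful = [i for i in range(len(costs))
              if all(len(s) < len(taxa)
                     for s in split(taxa, data, i, levels[i]).values())]
    if not useful:
        return taxa  # leaf: no further separation is possible
    best = min(useful, key=lambda i: score(taxa, data, i, levels[i], costs[i]))
    sets = split(taxa, data, best, levels[best])
    return (best, {k: build_key(s, data, levels, costs) for k, s in sets.items() if s})

def cost_for(key, taxon, data, costs):
    """Cost of identifying one taxon; a missing value averages over branches."""
    if not isinstance(key, tuple):
        return 0.0
    test, branches = key
    value = data[taxon][test]
    nxt = [branches[value]] if value is not None else list(branches.values())
    return costs[test] + sum(cost_for(b, taxon, data, costs) for b in nxt) / len(nxt)

# Hypothetical data: four taxa, three two-level characters, unit costs.
data = {"A": (1, 1, 1), "B": (1, 2, None), "C": (2, 1, 1), "D": (2, 2, 2)}
levels = [(1, 2)] * 3
costs = [1, 1, 1]
key = build_key(list(data), data, levels, costs)
mean_cost = sum(cost_for(key, t, data, costs) for t in data) / len(data)
print(key, mean_cost)
```

With this data the sketch uses the first character at the root and the second character below it, so each taxon is reached after two unit-cost tests. Swapping in a different `score` function is the analogue of changing the CRITERION setting.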
Usually construction of the key stops when the possible taxa at that point share identical values or have missing values for all the characters. However, if the missing values represent variable rather than unknown values, it may still be worth using these tests in case a specimen of the taxon concerned is obtained that happens to give a level different from the shared level. This partial separation can be requested by setting option PARTIAL=yes.
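One way to read the stopping rule and the effect of partial separation is sketched below, in Python rather than Genstat. The function names and the `partial` flag are hypothetical, and the tests are a plain interpretation of the paragraph above, not the logic of BKEY itself. Missing values are represented by `None`.

```python
def separates(taxa, data, test):
    """Full separation: at least two taxa have different known values."""
    known = {data[t][test] for t in taxa if data[t][test] is not None}
    return len(known) > 1

def partially_separates(taxa, data, test):
    """Partial separation: the known values all agree, but at least one taxon
    has a missing (variable) value and so might give a different level."""
    values = [data[t][test] for t in taxa]
    known = {v for v in values if v is not None}
    return len(known) == 1 and None in values

def usable_tests(taxa, data, n_tests, partial=False):
    """Tests worth applying at this point; construction stops if none remain."""
    return [i for i in range(n_tests)
            if separates(taxa, data, i)
            or (partial and partially_separates(taxa, data, i))]

# Hypothetical pair of taxa: identical on test 0, B is variable on test 1.
pair = {"A": (1, 1), "B": (1, None)}
print(usable_tests(["A", "B"], pair, 2))                # no usable test
print(usable_tests(["A", "B"], pair, 2, partial=True))  # test 1 becomes usable
```

In the second call, test 1 is kept because a specimen of B that gives a level other than 1 would be identified immediately, while a result of 1 leaves both taxa possible.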
The key can be printed in various formats, as requested by the PRINT option, or it can be saved using the KEY option. The settings of PRINT are:
indented | indented form – prints the key branch by branch; |
---|---|
bracketed | bracketed form – prints the key test by test; |
diagram | diagrammatic representation; |
graph | plots the key using high resolution graphics. |
BKEY stores the information required for printing as part of the tree. The labels for the diagram are formed as “identifier==n1”, where n1 is the first level of the factor. The lines of the indented and bracketed keys are formed similarly if the factor has no extra text and no labels. Otherwise, the form is “fname lname”, where fname is the extra text if this has been defined (by the EXTRA parameter of the FACTOR directive) or else the identifier of the factor, and lname is the label if available or the level if not.
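The rule for the lines of the indented and bracketed keys can be restated as a small function. This is a Python sketch of the rule as worded above, not BKEY's own code; the function name `key_line` and its arguments are hypothetical, and the exact spacing of the printed output is an assumption.

```python
def key_line(identifier, level, extra=None, label=None):
    """Text describing one test result in the indented or bracketed key."""
    if extra is None and label is None:
        # No extra text and no labels: same form as the diagram labels.
        return f"{identifier}=={level}"
    fname = extra if extra is not None else identifier  # extra text, else identifier
    lname = label if label is not None else str(level)  # label, else level
    return f"{fname} {lname}"

print(key_line("C11", 1))
print(key_line("C11", 2, label="positive"))
print(key_line("C11", 2, extra="Maltose growth", label="positive"))
```

Under this reading, a factor with identifier C11, extra text 'Maltose growth' and label 'positive' would be described as "Maltose growth positive".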
Options: PRINT, TAXONNAMES, GROUPS, CRITERION, PARTIAL, KEY.
Parameters: CHARACTER, COST.
Method
BKEY calls procedure BCONSTRUCT to form the key. This uses a special-purpose procedure BSELECT, which is customized specifically for keys, and stored with BKEY. The methodology involved in the construction of keys is reviewed by Payne & Preece (1980). Statistical applications of keys are described by Payne (1992).
Action with RESTRICT
Any restrictions on the CHARACTER factors or on TAXONNAMES or GROUPS are removed.
References
Gower, J.C. & Payne, R.W. (1975). A comparison of different criteria for selecting binary tests in diagnostic keys. Biometrika, 62, 665-671.
Payne, R.W. & Preece, D.A. (1980). Identification keys and diagnostic tables: a review (with discussion). Journal of the Royal Statistical Society, Series A, 143, 253-292.
Payne, R.W. (1981). Selection criteria for the construction of efficient diagnostic keys. Journal of Statistical Planning and Inference, 5, 27-36.
Payne, R.W., Yarrow, D. & Barnett, J.A. (1982). The construction by computer of a diagnostic key to the genera of yeasts and other such groups of taxa. Journal of General Microbiology, 128, 1265-1277.
Payne, R.W. & Thompson, C.J. (1989). A study of selection criteria for constructing identification keys containing tests with different costs. Computational Statistics Quarterly, 5, 43-52.
Payne, R.W. (1992). The use of identification keys and diagnostic tables in statistical work. In: COMPSTAT 92 Proceedings in Computational Statistics (Ed. Y. Dodge & J. Whittaker), Volume 2, 239-244. Heidelberg: Physica-Verlag.
See also
Directive: IRREDUNDANT.
Procedures: BKDISPLAY, BKIDENTIFY, BKKEEP, BCLASSIFICATION, BCFOREST, BREGRESSION, IDENTIFY.
Commands for: Multivariate and cluster analysis.
Example
CAPTION 'BKEY example',\
        'Common clinical yeasts from Payne (1992) COMPSTAT paper';\
        STYLE=meta,plain
TEXT    [VALUES='Candida albicans','Candida glabrata',\
        'Candida parapsilosis','Candida tropicalis',\
        'Cryptococcus albidus','Cryptococcus laurentii',\
        'Filobasidiella neoformans',\
        'Issatchenkia orientalis',\
        'Kluyveromyces marxianus',\
        'Pichia guilliermondii','Rhodotorula glutinis',\
        'Rhodotorula mucilaginosa','Trichosporon beigelii'] Yeasts
FACTOR  [NVALUES=Yeasts; LABELS=!t('-','+')]\
        C11; EXTRA='Maltose growth'
&       C18; EXTRA='Lactose growth'
&       C19; EXTRA='Raffinose growth'
&       C36; EXTRA='D-Glucuronate growth'
&       N1;  EXTRA='Nitrate growth'
&       V5;  EXTRA='Growth w/o Thiamin'
&       O2;  EXTRA='0.1% Cycloheximide growth'
&       E5;  EXTRA='Splitting cells'
READ    [PRINT=errors]\
        C11,C18,C19,C36,N1,V5,O2,E5; FREPRESENTATION=labels
'+' '-' '-' '-' '-' '+' '+' '-'
'-' '-' '-' '-' '-' '-' '-' '-'
'+' '-' '-' '-' '-' '+' '-' '-'
'+' '-' '-' '-' '-' '+' '+' '-'
'+'  *   *  '+' '+' '-' '-' '-'
'+' '+' '+' '+' '-'  *   *  '-'
'+' '-'  *  '+' '-' '-' '-' '-'
'-' '-' '-' '-' '-' '+' '-' '-'
'-'  *  '+' '-' '-' '+' '+' '-'
'+' '-' '+' '-' '-' '+' '+' '-'
'+' '-'  *  '-' '+'  *   *  '-'
 *  '-' '+' '-'  *  '-'  *  '-'
 *  '+'  *  '+' '-' '-'  *  '+' :
PRINT   [MISSING='V'] C11,C18,C19,C36,N1,V5,O2,E5; FIELDWIDTH=4; DECIMALS=0
FACTOR  [MODIFY=yes; LABELS=!t(negative,positive)] C11,C18,C19,C36,N1,V5,O2,E5
BKEY    [PRINT=bracketed,indented; TAXONNAMES=Yeasts;\
        CRITERION=cme] C11,C18,C19,C36,N1,V5,O2,E5