Forms irredundant test sets for the efficient identification of a set of objects.
Options
PRINT = string tokens |
Controls printed output (numbers , diagram , notdistinguished , messages ); default numb , diag , notd , mess |
---|---|
BESTSET = pointer |
Saves the best set |
SETS = matrix |
Saves details of the available sets |
NOTDISTINGUISHED = matrix |
Saves details of the objects that cannot be distinguished |
METHOD = string token |
Algorithm to use (exact , sequential ); default exac |
TAXONNAMES = text, variate or factor |
Defines labels for the objects (or taxa) to be identified; default uses the unit labels vector of the CHARACTER factors |
GROUPS = factor |
Defines groupings of the objects so that the sets are constructed to distinguish only between the objects that belong to different groups; default constructs sets to distinguish between individual objects |
OBJECT = scalar or text |
If this is specified, sets are constructed just to distinguish the specified object (or taxon) from the other objects |
NDISTINCTIONS = scalar |
Number of factors required in each set to distinguish between each pair of objects; default 1 |
MAXPREFERENCE = scalar |
Maximum preference of the factors to be included in the sets |
MAXSIZE = scalar |
Limit on number of factors in a set (sets containing more than this are discarded); default * i.e. none |
NPRINT = scalar |
Number of sets to print (a positive number specifies the number to print, a negative number sets a tolerance on the difference between the sizes of the sets printed and the size of the best set); default * prints them all |
NSAVE = scalar |
Number of sets to save in the SETS matrix; default * saves them all |
LIMSETS = variate |
Variate containing two numbers n1 and n2, if this is specified then every time that there are more than n1 sets under construction using the exact method, the sets are arranged in order of increasing size and all sets containing more factors than set n2 are deleted |
DISTINCTIONS = string token |
Whether or not to store the distinctions or recalculate them at every stage in the exact algorithm (store , calculate ); default stor |
CRITERION = string token |
Function to be use to select factors by the sequential method (ndistinctions , weightedndistinctions ); default ndis |
MAXCYCLE = scalar |
Maximum number of improvement cycles to perform during the sequential method; default 20 |
EQUIVALENCE = scalar |
Value for determining equivalence of the selection criteria of tests selected during the sequential method |
Parameters
CHARACTER = identifiers |
Factors, and/or tables classified by a single factor, defining the properties of the objects to to be identified |
---|---|
COST = scalars |
Cost associated with each factor; default 1 |
PREFERENCE = scalar |
Preference rating for each factor (1 representing most preferred etc.); default 1 |
VARIABLE = scalar or text |
Factor level used to represent variable information; default is to use a missing value |
INAPPLICABLE = scalar or text |
Factor level used to indicate that the information provided by that factor is inapplicable for a particular object |
Description
The IRREDUNDANT
directive is useful when you have a set of objects (or taxa) whose properties can be described by a set of discrete-valued tests. Many applications are biological. For example, in botanical work, the taxa may be species of plant and the tests may require the observation of characters like the colours of petals or numbers of leaves. Similarly, in microbiology, the tests may involve the ability of an organism to grow in various media. IRREDUNDANT
helps you to select an efficient set of tests that can be applied, in a batch, to identify any unknown specimen of any of the objects. (The batch of tests is then often printed as a diagnostic table; see Payne & Preece 1980.) As all the tests in the set are to be used for every identification, it is best for the set to contain as few tests as possible. So there should thus be no redundant tests: these are tests that can be deleted from the set without causing any object (or taxon) to be no longer identifiable. Sets of tests that contain no redundant tests are known as irredundant.
Consider taxa A, B, C and D, whose responses to tests 1-5 are shown in the table below. The symbol “+”, for example in the entry for taxon A and test 1, indicates that all specimens of taxon A will always give a positive result to test 1, the symbol “-” for taxon D with test 1 indicates a negative result, and the symbol “v” for taxon B with test 3 indicates that some specimens of D will give a positive result to test 3 but others will give a negative result.
Test | |||||
Taxon | 1 | 2 | 3 | 4 | 5 |
A | + | + | + | – | + |
B | + | – | v | – | – |
C | + | – | – | + | + |
D | – | + | + | + | – |
The table contains several irredundant sets, one of which contains the tests 1, 3 and 5. (If, for example, test 3 is deleted from this set, taxa A and C can no longer be distinguished). Another set contains tests 2 and 4. So, the irredundant sets can be of different sizes. The optimum set will often be defined to be one containing a minimum number of tests. Alternatively, if the test cost different amounts to apply, the optimum set may be one with minimum total cost. However, whichever of these situations applies, the optimum set will be irredundant, as otherwise a better set could be obtained by deleting a redundant test.
The characteristics of the taxa and tests are specified using the CHARACTER
parameter of IRREDUNDANT
. In the simplest situation, this provides a list of factors, one for each test (or character), as with the BKEY
procedure. The factors contain a unit for each taxon, and the level stored in that unit indicates how the taxon can respond to the test. For example, irredundant sets for the data in the table above could be constructed as follows:
TEXT [VALUES=A,B,C,D] Taxa
FACTOR [NVALUES=4; LEVELS=2] T1,T2,T3,T4,T5
READ T1,T2,T3,T4,T5
2 2 2 1 2
2 1 * 1 1
2 1 1 2 2
1 2 2 2 1 :
IRREDUNDANT [TAXONNAMES=Taxa] T1,T2,T3,T4,T5
Level 1 of the factors T1
– T5
represents a negative response, and level 2 represents a positive response. The variable response of taxon B with test 3 is represented by a missing value, but you can use the VARIABLE
parameter to use a particular level of the factor instead. There may be tests that are not applicable to some of the taxa. For example, when identifying insects, tests concerning colours of wings are not applicable to those that do not fly! The level to be used to indicate these responses is specified by the INAPPLICABLE
parameter. Costs for the test can be specified by the COST
parameter; by default, these are all taken to be one. Names for the taxa can be supplied, in either a text or a variate or a factor, using the TAXONNAMES
parameter. If this is not set, IRREDUNDANT
uses the unit labels of the CHARACTER
factors if any have been defined (see the FACTOR
directive), or otherwise the integers 1, 2 upwards.
The use of the VARIABLE
option works well with responses that are completely variable i.e. where the specimens of the taxon may give any of the available results to the test. However, when the tests have more than two possible results, there may be taxa that can give some but not all of the available results to a test. The responses to a test like this should be specified by a two-way table classified by one factor with a level for each possible result, and another with a level for each taxon. The table should then contain a positive (e.g. one) whererever the taxon concerned can deliver the result, and zero elsewhere. For example suppose that, with test T6
, taxon A, C and D always give result 1, 2 and 3 respectively, but taxon B can give either or results 2 or 3. The relevant table could then be constructed and used as follows:
FACTOR [LABELS=Taxa] Taxfact
FACTOR [LEVELS=3] T6fact
TABLE [CLASSIFICATION=T6fact,Taxfact; VALUES=\
" level 1:" 1, 0, 0, 0,\
" level 2:" 0, 1, 1, 0,\
" level 3:" 0, 1, 0, 1 ] T6tab
IRREDUNDANT [TAXONNAMES=Taxfact] T1,T2,T3,T4,T5,T6tab
The standard irredundant sets contain at least one test to distinguish each pair of taxa. However, to guard against mistakes in either the original data on during the subsequent use of the set, you can set the NDISTINCTIONS
option to ask for the set to include a larger number of tests able to distinguish each pair. Another refinement is that you can set the GROUPS
option to a factor defining groupings of the taxa. The sets are then formed to distinguish only pairs of taxa that belong to different groups. Alternatively, you may want a set of tests to either confirm whether or not the specimen belongs to one particular taxon. The taxon of interest should then be indicated by setting the OBJECT
option to the number of the taxon or, if textual taxon names have been defined, to the text identifying the taxon. Finally, if you set both GROUPS
and OBJECT
, the sets will be constructed to confirm whether or not a specimen belongs to a particular group.
IRREDUNDANT
provides two methods for constructing the irredundant sets. The default is to use an exact method (Payne 1991) which constructs all possible sets for the dataset concerned. However, with some datasets, there may be too many sets to construct them all. If you run out of workspace (or time), you can use the LIMSETS
to specify a variate containing two integers n1 and n2. Then whenever there are more than n1 sets under construction, the sets are arranged in order of increasing size and all sets containing more factors than set n2 are deleted. The method then no longer guarantees to find all the irredundant sets containing the fewest number of tests or with the minimum total costs, but in the situations where this modification is needed, it is very unlikely that it will fail to find any of them.
Alternatively, you can set option METHOD=sequential
to use a sequential algorithm (Payne & Preece 1980, Section 6.6). This does not guarantee to find a set with minimum size or cost, but it takes much less computing time and should always should produce a satisfactory set. The sequential method starts with an initial set containing all the essential tests, and then adds additional tests, one at a time, until each pair of taxa can be distinguished. (A test is essential if it is the only test which can distinguish between a particular pair of taxa.) The criterion for selecting the test to add to the set at each stage is usually the number of pairs of taxa that the test distinguishes, of those pairs not distinguished by tests already in the set. If costs have been defined, this number of pairs is divided by the cost of the test concerned.
Setting option CRITERION=weighted
uses a refinement, suggested by Barnett et al. (1983), which weights each pair of taxa by the reciprocal of the number of tests that can distinguish between them. The criterion is then the maximum weighted number of pairs of taxa (divided by the cost of the test, if defined). This causes tests that distinguish “difficult” pairs of taxa (those with nearly identical characteristics) to be selected earlier during the construction of the set, and thus tends to generate smaller sets. You can set a preference rating for each test using the PREFERENCE
parameter; the most-preferred tests should have ratings of one, and less-preferred tests should have ratings of two and upwards. Then, if at any stage there is then more than one test with the best criterion value, the most-preferred test is selected. If these preferances are especially important, you may also also want to set the EQUIVALENCE
option to a scalar, e say. Then all tests whose criterion values are within e of the current maximum are regarded as equivalent, and the best test is selected from within these tests according to the preferences.
The main disadvantage of most sequential methods is that they produce only a single set of tests. In order to allow a choice of sets and as a way of improving the original set, IRREDUNDANT
can run through a sequence of cycles. In each of these, the tests in the best set are deleted in turn, further tests are selected to separate the pairs of taxa distinguished only by the deleted test, and any redundant tests are deleted. If no improvement is achieved, all the non-essential tests are deleted, and the set is reformed without using those tests. The process can be then repeated until no improvements are being achieved of until the number of cycles exceeds the setting of the MAXCYCLE
option (default 20).
Printed output is controlled by the PRINT
option, with settings:
numbers |
numbers of the tests in the sets, |
---|---|
diagram |
table showing the contents of the sets, |
notdistinguished |
lists of pairs of taxa that cannot be distinguished, |
messages |
messages for example when the number of sets has been reduced as requested by the LIMSETS option, or concerning pairs of taxa than cannot be distinguished. |
The default is PRINT=numb,diag,notd,mess
.
The best set can be saved using the BESTSET
option, as a pointer containing the relevant factors. The SETS
option can save a matrix, with a row for each set and a column for each test, representing all the sets that have been formed. In each row the matrix generally stores the number one in the columns corresponding to the tests in that set, and zero elsewhere. However, if the sets have been constructed to confirm the identification of a single taxon, the matrix contains more informative numbers than one. So, down each column wherever one would be stored, it instead stores the level given by the taxon for the factor corresponding to the test concerned. The NOTDISTINGUISHED
option can save information about the pairs of taxa that cannot be distinguished, or that are distinguished by less than NDISTINCTIONS
tests. The matrix has a row for each such pair of taxa, and three columns. Columns 1 and 2 contain the numbers of the taxa in the pair, and column 3 contains the number of tests that can distinguish them.
Options: PRINT
, BESTSET
, SETS
, NOTDISTINGUISHED
, METHOD
, UNITS
, GROUPS
, OBJECT
, NDISTINCTIONS
, MAXPREFERENCE
, MAXSIZE
, NPRINT
, NSAVE
, LIMSETS
, DISTINCTIONS
, CRITERION
, MAXCYCLE
, EQUIVALENCE
.
Parameters: CHARACTER
, COST
, PREFERENCE
, VARIABLE
, INAPPLICABLE
.
Method
The exact method uses an extension of the algorithm of Payne (1991). The sequential method is an extension by Barnett et al. (1983) of the algorithm described in Payne & Preece (1980) Section 6.6.
Action with RESTRICT
IRREDUNDANT
takes account of restrictions on any of the CHARACTER
factors or on TAXONNAMES
or GROUPS
.
References
Barnett, J.A., Payne, R.W. & Yarrow, D. (1983). Yeasts: Characteristics and Identification. Cambridge: Cambridge University Press.
Payne, R.W. (1991). Algorithm AS263 Construction of irredundant test sets. Applied Statistics, 40, 213-229.
Payne, R.W. & Preece, D.A. (1980). Identification keys and diagnostic tables: a review (with discussion). Journal of the Royal Statistical Society, Series A, 143, 253-292.
See also
Commands for: Calculations and manipulation,