Forms irredundant test sets for the efficient identification of a set of objects.

### Options

`PRINT` = string tokens |
Controls printed output (`numbers` , `diagram` , `notdistinguished` , `messages` ); default `numb` , `diag` , `notd` , `mess` |
---|---|

`BESTSET` = pointer |
Saves the best set |

`SETS` = matrix |
Saves details of the available sets |

`NOTDISTINGUISHED` = matrix |
Saves details of the objects that cannot be distinguished |

`METHOD` = string token |
Algorithm to use (`exact` , `sequential` ); default `exac` |

`TAXONNAMES` = text, variate or factor |
Defines labels for the objects (or taxa) to be identified; default uses the unit labels vector of the `CHARACTER` factors |

`GROUPS` = factor |
Defines groupings of the objects so that the sets are constructed to distinguish only between the objects that belong to different groups; default constructs sets to distinguish between individual objects |

`OBJECT` = scalar or text |
If this is specified, sets are constructed just to distinguish the specified object (or taxon) from the other objects |

`NDISTINCTIONS` = scalar |
Number of factors required in each set to distinguish between each pair of objects; default 1 |

`MAXPREFERENCE` = scalar |
Maximum preference of the factors to be included in the sets |

`MAXSIZE` = scalar |
Limit on number of factors in a set (sets containing more than this are discarded); default `*` i.e. none |

`NPRINT` = scalar |
Number of sets to print (a positive number specifies the number to print, a negative number sets a tolerance on the difference between the sizes of the sets printed and the size of the best set); default `*` prints them all |

`NSAVE` = scalar |
Number of sets to save in the `SETS` matrix; default `*` saves them all |

`LIMSETS` = variate |
Variate containing two numbers n_{1} and n_{2}, if this is specified then every time that there are more than n_{1} sets under construction using the exact method, the sets are arranged in order of increasing size and all sets containing more factors than set n_{2} are deleted |

`DISTINCTIONS` = string token |
Whether or not to store the distinctions or recalculate them at every stage in the exact algorithm (`store` , `calculate` ); default `stor` |

`CRITERION` = string token |
Function to be use to select factors by the sequential method (`ndistinctions` , `weightedndistinctions` ); default `ndis` |

`MAXCYCLE` = scalar |
Maximum number of improvement cycles to perform during the sequential method; default 20 |

`EQUIVALENCE` = scalar |
Value for determining equivalence of the selection criteria of tests selected during the sequential method |

### Parameters

`CHARACTER` = identifiers |
Factors, and/or tables classified by a single factor, defining the properties of the objects to to be identified |
---|---|

`COST` = scalars |
Cost associated with each factor; default 1 |

`PREFERENCE` = scalar |
Preference rating for each factor (1 representing most preferred etc.); default 1 |

`VARIABLE` = scalar or text |
Factor level used to represent variable information; default is to use a missing value |

`INAPPLICABLE` = scalar or text |
Factor level used to indicate that the information provided by that factor is inapplicable for a particular object |

### Description

The `IRREDUNDANT`

directive is useful when you have a set of objects (or *taxa*) whose properties can be described by a set of discrete-valued tests. Many applications are biological. For example, in botanical work, the taxa may be species of plant and the tests may require the observation of characters like the colours of petals or numbers of leaves. Similarly, in microbiology, the tests may involve the ability of an organism to grow in various media. `IRREDUNDANT`

helps you to select an efficient set of tests that can be applied, in a batch, to identify any unknown specimen of any of the objects. (The batch of tests is then often printed as a diagnostic table; see Payne & Preece 1980.) As all the tests in the set are to be used for every identification, it is best for the set to contain as few tests as possible. So there should thus be no *redundant* tests: these are tests that can be deleted from the set without causing any object (or *taxon*) to be no longer identifiable. Sets of tests that contain no redundant tests are known as *irredundant*.

Consider taxa A, B, C and D, whose responses to tests 1-5 are shown in the table below. The symbol “+”, for example in the entry for taxon A and test 1, indicates that all specimens of taxon A will always give a positive result to test 1, the symbol “-” for taxon D with test 1 indicates a negative result, and the symbol “v” for taxon B with test 3 indicates that some specimens of D will give a positive result to test 3 but others will give a negative result.

Test |
|||||

Taxon |
1 | 2 | 3 | 4 | 5 |

A |
+ | + | + | – | + |

B |
+ | – | v | – | – |

C |
+ | – | – | + | + |

D |
– | + | + | + | – |

The table contains several irredundant sets, one of which contains the tests 1, 3 and 5. (If, for example, test 3 is deleted from this set, taxa A and C can no longer be distinguished). Another set contains tests 2 and 4. So, the irredundant sets can be of different sizes. The optimum set will often be defined to be one containing a minimum number of tests. Alternatively, if the test cost different amounts to apply, the optimum set may be one with minimum total cost. However, whichever of these situations applies, the optimum set will be irredundant, as otherwise a better set could be obtained by deleting a redundant test.

The characteristics of the taxa and tests are specified using the `CHARACTER`

parameter of `IRREDUNDANT`

. In the simplest situation, this provides a list of factors, one for each test (or character), as with the `BKEY`

procedure. The factors contain a unit for each taxon, and the level stored in that unit indicates how the taxon can respond to the test. For example, irredundant sets for the data in the table above could be constructed as follows:

`TEXT [VALUES=A,B,C,D] Taxa `

`FACTOR [NVALUES=4; LEVELS=2] T1,T2,T3,T4,T5`

`READ T1,T2,T3,T4,T5`

`2 2 2 1 2`

`2 1 * 1 1`

`2 1 1 2 2`

`1 2 2 2 1 :`

`IRREDUNDANT [TAXONNAMES=Taxa] T1,T2,T3,T4,T5`

Level 1 of the factors `T1`

– `T5`

represents a negative response, and level 2 represents a positive response. The variable response of taxon B with test 3 is represented by a missing value, but you can use the `VARIABLE`

parameter to use a particular level of the factor instead. There may be tests that are not applicable to some of the taxa. For example, when identifying insects, tests concerning colours of wings are not applicable to those that do not fly! The level to be used to indicate these responses is specified by the `INAPPLICABLE`

parameter. Costs for the test can be specified by the `COST`

parameter; by default, these are all taken to be one. Names for the taxa can be supplied, in either a text or a variate or a factor, using the `TAXONNAMES`

parameter. If this is not set, `IRREDUNDANT`

uses the unit labels of the `CHARACTER`

factors if any have been defined (see the `FACTOR`

directive), or otherwise the integers 1, 2 upwards.

The use of the `VARIABLE`

option works well with responses that are completely variable i.e. where the specimens of the taxon may give any of the available results to the test. However, when the tests have more than two possible results, there may be taxa that can give some but not all of the available results to a test. The responses to a test like this should be specified by a two-way table classified by one factor with a level for each possible result, and another with a level for each taxon. The table should then contain a positive (e.g. one) whererever the taxon concerned can deliver the result, and zero elsewhere. For example suppose that, with test `T6`

, taxon A, C and D always give result 1, 2 and 3 respectively, but taxon B can give either or results 2 or 3. The relevant table could then be constructed and used as follows:

`FACTOR [LABELS=Taxa] Taxfact`

`FACTOR [LEVELS=3] T6fact`

`TABLE [CLASSIFICATION=T6fact,Taxfact; VALUES=\`

` " level 1:" 1, 0, 0, 0,\`

` " level 2:" 0, 1, 1, 0,\`

` " level 3:" 0, 1, 0, 1 ] T6tab`

`IRREDUNDANT [TAXONNAMES=Taxfact] T1,T2,T3,T4,T5,T6tab`

The standard irredundant sets contain at least one test to distinguish each pair of taxa. However, to guard against mistakes in either the original data on during the subsequent use of the set, you can set the `NDISTINCTIONS`

option to ask for the set to include a larger number of tests able to distinguish each pair. Another refinement is that you can set the `GROUPS`

option to a factor defining groupings of the taxa. The sets are then formed to distinguish only pairs of taxa that belong to different groups. Alternatively, you may want a set of tests to either confirm whether or not the specimen belongs to one particular taxon. The taxon of interest should then be indicated by setting the `OBJECT`

option to the number of the taxon or, if textual taxon names have been defined, to the text identifying the taxon. Finally, if you set both `GROUPS`

and `OBJECT`

, the sets will be constructed to confirm whether or not a specimen belongs to a particular group.

`IRREDUNDANT`

provides two methods for constructing the irredundant sets. The default is to use an exact method (Payne 1991) which constructs all possible sets for the dataset concerned. However, with some datasets, there may be too many sets to construct them all. If you run out of workspace (or time), you can use the `LIMSETS`

to specify a variate containing two integers *n*_{1} and *n*_{2}. Then whenever there are more than *n*_{1} sets under construction, the sets are arranged in order of increasing size and all sets containing more factors than set *n*_{2} are deleted. The method then no longer guarantees to find all the irredundant sets containing the fewest number of tests or with the minimum total costs, but in the situations where this modification is needed, it is very unlikely that it will fail to find any of them.

Alternatively, you can set option `METHOD=sequential`

to use a sequential algorithm (Payne & Preece 1980, Section 6.6). This does not guarantee to find a set with minimum size or cost, but it takes much less computing time and should always should produce a satisfactory set. The sequential method starts with an initial set containing all the essential tests, and then adds additional tests, one at a time, until each pair of taxa can be distinguished. (A test is *essential* if it is the only test which can distinguish between a particular pair of taxa.) The criterion for selecting the test to add to the set at each stage is usually the number of pairs of taxa that the test distinguishes, of those pairs not distinguished by tests already in the set. If costs have been defined, this number of pairs is divided by the cost of the test concerned.

Setting option `CRITERION=weighted`

uses a refinement, suggested by Barnett *et al*. (1983), which weights each pair of taxa by the reciprocal of the number of tests that can distinguish between them. The criterion is then the maximum *weighted* number of pairs of taxa (divided by the cost of the test, if defined). This causes tests that distinguish “difficult” pairs of taxa (those with nearly identical characteristics) to be selected earlier during the construction of the set, and thus tends to generate smaller sets. You can set a preference rating for each test using the `PREFERENCE`

parameter; the most-preferred tests should have ratings of one, and less-preferred tests should have ratings of two and upwards. Then, if at any stage there is then more than one test with the best criterion value, the most-preferred test is selected. If these preferances are especially important, you may also also want to set the `EQUIVALENCE`

option to a scalar, *e* say. Then all tests whose criterion values are within *e* of the current maximum are regarded as equivalent, and the best test is selected from within these tests according to the preferences.

The main disadvantage of most sequential methods is that they produce only a single set of tests. In order to allow a choice of sets and as a way of improving the original set, `IRREDUNDANT`

can run through a sequence of cycles. In each of these, the tests in the best set are deleted in turn, further tests are selected to separate the pairs of taxa distinguished only by the deleted test, and any redundant tests are deleted. If no improvement is achieved, all the non-essential tests are deleted, and the set is reformed without using those tests. The process can be then repeated until no improvements are being achieved of until the number of cycles exceeds the setting of the `MAXCYCLE`

option (default 20).

Printed output is controlled by the `PRINT`

option, with settings:

`numbers` |
numbers of the tests in the sets, |
---|---|

`diagram` |
table showing the contents of the sets, |

`notdistinguished` |
lists of pairs of taxa that cannot be distinguished, |

`messages` |
messages for example when the number of sets has been reduced as requested by the `LIMSETS` option, or concerning pairs of taxa than cannot be distinguished. |

The default is `PRINT=numb,diag,notd,mess`

.

The best set can be saved using the `BESTSET`

option, as a pointer containing the relevant factors. The `SETS`

option can save a matrix, with a row for each set and a column for each test, representing all the sets that have been formed. In each row the matrix generally stores the number one in the columns corresponding to the tests in that set, and zero elsewhere. However, if the sets have been constructed to confirm the identification of a single taxon, the matrix contains more informative numbers than one. So, down each column wherever one would be stored, it instead stores the level given by the taxon for the factor corresponding to the test concerned. The `NOTDISTINGUISHED`

option can save information about the pairs of taxa that cannot be distinguished, or that are distinguished by less than `NDISTINCTIONS`

tests. The matrix has a row for each such pair of taxa, and three columns. Columns 1 and 2 contain the numbers of the taxa in the pair, and column 3 contains the number of tests that can distinguish them.

Options: `PRINT`

, `BESTSET`

, `SETS`

, `NOTDISTINGUISHED`

, `METHOD`

, `UNITS`

, `GROUPS`

, `OBJECT`

, `NDISTINCTIONS`

, `MAXPREFERENCE`

, `MAXSIZE`

, `NPRINT`

, `NSAVE`

, `LIMSETS`

, `DISTINCTIONS`

, `CRITERION`

, `MAXCYCLE`

, `EQUIVALENCE`

.

Parameters: `CHARACTER`

, `COST`

, `PREFERENCE`

, `VARIABLE`

, `INAPPLICABLE`

.

### Method

The exact method uses an extension of the algorithm of Payne (1991). The sequential method is an extension by Barnett *et al*. (1983) of the algorithm described in Payne & Preece (1980) Section 6.6.

### Action with `RESTRICT`

`IRREDUNDANT`

takes account of restrictions on any of the `CHARACTER`

factors or on `TAXONNAMES`

or `GROUPS`

.

### References

Barnett, J.A., Payne, R.W. & Yarrow, D. (1983). *Yeasts: Characteristics and Identification*. Cambridge: Cambridge University Press.

Payne, R.W. (1991). Algorithm AS263 Construction of irredundant test sets*. Applied Statistics*, 40, 213-229.

Payne, R.W. & Preece, D.A. (1980). Identification keys and diagnostic tables: a review (with discussion). *Journal of the Royal Statistical Society, Series A*, 143, 253-292.

### See also

Commands for: Calculations and manipulation,