IRREDUNDANT directive

Forms irredundant test sets for the efficient identification of a set of objects.

Options

`PRINT` = string tokens	Controls printed output (`numbers`, `diagram`, `notdistinguished`, `messages`); default `numb`, `diag`, `notd`, `mess`
`BESTSET` = pointer	Saves the best set
`SETS` = matrix	Saves details of the available sets
`NOTDISTINGUISHED` = matrix	Saves details of the objects that cannot be distinguished
`METHOD` = string token	Algorithm to use (`exact`, `sequential`); default `exac`
`TAXONNAMES` = text, variate or factor	Defines labels for the objects (or taxa) to be identified; default uses the unit labels vector of the `CHARACTER` factors
`GROUPS` = factor	Defines groupings of the objects so that the sets are constructed to distinguish only between the objects that belong to different groups; default constructs sets to distinguish between individual objects
`OBJECT` = scalar or text	If this is specified, sets are constructed just to distinguish the specified object (or taxon) from the other objects
`NDISTINCTIONS` = scalar	Number of factors required in each set to distinguish between each pair of objects; default 1
`MAXPREFERENCE` = scalar	Maximum preference of the factors to be included in the sets
`MAXSIZE` = scalar	Limit on number of factors in a set (sets containing more than this are discarded); default `*` i.e. none
`NPRINT` = scalar	Number of sets to print (a positive number specifies the number to print, a negative number sets a tolerance on the difference between the sizes of the sets printed and the size of the best set); default `*` prints them all
`NSAVE` = scalar	Number of sets to save in the `SETS` matrix; default `*` saves them all
`LIMSETS` = variate	Variate containing two numbers n₁ and n₂, if this is specified then every time that there are more than n₁ sets under construction using the exact method, the sets are arranged in order of increasing size and all sets containing more factors than set n₂ are deleted
`DISTINCTIONS` = string token	Whether or not to store the distinctions or recalculate them at every stage in the exact algorithm (`store`, `calculate`); default `stor`
`CRITERION` = string token	Function to be use to select factors by the sequential method (`ndistinctions`, `weightedndistinctions`); default `ndis`
`MAXCYCLE` = scalar	Maximum number of improvement cycles to perform during the sequential method; default 20
`EQUIVALENCE` = scalar	Value for determining equivalence of the selection criteria of tests selected during the sequential method

Parameters

`CHARACTER` = identifiers	Factors, and/or tables classified by a single factor, defining the properties of the objects to to be identified
`COST` = scalars	Cost associated with each factor; default 1
`PREFERENCE` = scalar	Preference rating for each factor (1 representing most preferred etc.); default 1
`VARIABLE` = scalar or text	Factor level used to represent variable information; default is to use a missing value
`INAPPLICABLE` = scalar or text	Factor level used to indicate that the information provided by that factor is inapplicable for a particular object

Description

The IRREDUNDANT directive is useful when you have a set of objects (or taxa) whose properties can be described by a set of discrete-valued tests. Many applications are biological. For example, in botanical work, the taxa may be species of plant and the tests may require the observation of characters like the colours of petals or numbers of leaves. Similarly, in microbiology, the tests may involve the ability of an organism to grow in various media. IRREDUNDANT helps you to select an efficient set of tests that can be applied, in a batch, to identify any unknown specimen of any of the objects. (The batch of tests is then often printed as a diagnostic table; see Payne & Preece 1980.) As all the tests in the set are to be used for every identification, it is best for the set to contain as few tests as possible. So there should thus be no redundant tests: these are tests that can be deleted from the set without causing any object (or taxon) to be no longer identifiable. Sets of tests that contain no redundant tests are known as irredundant.

Consider taxa A, B, C and D, whose responses to tests 1-5 are shown in the table below. The symbol “+”, for example in the entry for taxon A and test 1, indicates that all specimens of taxon A will always give a positive result to test 1, the symbol “-” for taxon D with test 1 indicates a negative result, and the symbol “v” for taxon B with test 3 indicates that some specimens of D will give a positive result to test 3 but others will give a negative result.

	Test
Taxon	1	2	3	4	5
A	+	+	+	–	+
B	+	–	v	–	–
C	+	–	–	+	+
D	–	+	+	+	–

The table contains several irredundant sets, one of which contains the tests 1, 3 and 5. (If, for example, test 3 is deleted from this set, taxa A and C can no longer be distinguished). Another set contains tests 2 and 4. So, the irredundant sets can be of different sizes. The optimum set will often be defined to be one containing a minimum number of tests. Alternatively, if the test cost different amounts to apply, the optimum set may be one with minimum total cost. However, whichever of these situations applies, the optimum set will be irredundant, as otherwise a better set could be obtained by deleting a redundant test.

The characteristics of the taxa and tests are specified using the CHARACTER parameter of IRREDUNDANT. In the simplest situation, this provides a list of factors, one for each test (or character), as with the BKEY procedure. The factors contain a unit for each taxon, and the level stored in that unit indicates how the taxon can respond to the test. For example, irredundant sets for the data in the table above could be constructed as follows:

TEXT [VALUES=A,B,C,D] Taxa

FACTOR [NVALUES=4; LEVELS=2] T1,T2,T3,T4,T5

READ T1,T2,T3,T4,T5

2 2 2 1 2

2 1 * 1 1

2 1 1 2 2

1 2 2 2 1 :

IRREDUNDANT [TAXONNAMES=Taxa] T1,T2,T3,T4,T5

Level 1 of the factors T1 – T5 represents a negative response, and level 2 represents a positive response. The variable response of taxon B with test 3 is represented by a missing value, but you can use the VARIABLE parameter to use a particular level of the factor instead. There may be tests that are not applicable to some of the taxa. For example, when identifying insects, tests concerning colours of wings are not applicable to those that do not fly! The level to be used to indicate these responses is specified by the INAPPLICABLE parameter. Costs for the test can be specified by the COST parameter; by default, these are all taken to be one. Names for the taxa can be supplied, in either a text or a variate or a factor, using the TAXONNAMES parameter. If this is not set, IRREDUNDANT uses the unit labels of the CHARACTER factors if any have been defined (see the FACTOR directive), or otherwise the integers 1, 2 upwards.

The use of the VARIABLE option works well with responses that are completely variable i.e. where the specimens of the taxon may give any of the available results to the test. However, when the tests have more than two possible results, there may be taxa that can give some but not all of the available results to a test. The responses to a test like this should be specified by a two-way table classified by one factor with a level for each possible result, and another with a level for each taxon. The table should then contain a positive (e.g. one) whererever the taxon concerned can deliver the result, and zero elsewhere. For example suppose that, with test T6, taxon A, C and D always give result 1, 2 and 3 respectively, but taxon B can give either or results 2 or 3. The relevant table could then be constructed and used as follows:

FACTOR [LABELS=Taxa] Taxfact

FACTOR [LEVELS=3] T6fact

TABLE [CLASSIFICATION=T6fact,Taxfact; VALUES=\

" level 1:" 1, 0, 0, 0,\

" level 2:" 0, 1, 1, 0,\

" level 3:" 0, 1, 0, 1 ] T6tab

IRREDUNDANT [TAXONNAMES=Taxfact] T1,T2,T3,T4,T5,T6tab

The standard irredundant sets contain at least one test to distinguish each pair of taxa. However, to guard against mistakes in either the original data on during the subsequent use of the set, you can set the NDISTINCTIONS option to ask for the set to include a larger number of tests able to distinguish each pair. Another refinement is that you can set the GROUPS option to a factor defining groupings of the taxa. The sets are then formed to distinguish only pairs of taxa that belong to different groups. Alternatively, you may want a set of tests to either confirm whether or not the specimen belongs to one particular taxon. The taxon of interest should then be indicated by setting the OBJECT option to the number of the taxon or, if textual taxon names have been defined, to the text identifying the taxon. Finally, if you set both GROUPS and OBJECT, the sets will be constructed to confirm whether or not a specimen belongs to a particular group.

IRREDUNDANT provides two methods for constructing the irredundant sets. The default is to use an exact method (Payne 1991) which constructs all possible sets for the dataset concerned. However, with some datasets, there may be too many sets to construct them all. If you run out of workspace (or time), you can use the LIMSETS to specify a variate containing two integers n₁ and n₂. Then whenever there are more than n₁ sets under construction, the sets are arranged in order of increasing size and all sets containing more factors than set n₂ are deleted. The method then no longer guarantees to find all the irredundant sets containing the fewest number of tests or with the minimum total costs, but in the situations where this modification is needed, it is very unlikely that it will fail to find any of them.

Alternatively, you can set option METHOD=sequential to use a sequential algorithm (Payne & Preece 1980, Section 6.6). This does not guarantee to find a set with minimum size or cost, but it takes much less computing time and should always should produce a satisfactory set. The sequential method starts with an initial set containing all the essential tests, and then adds additional tests, one at a time, until each pair of taxa can be distinguished. (A test is essential if it is the only test which can distinguish between a particular pair of taxa.) The criterion for selecting the test to add to the set at each stage is usually the number of pairs of taxa that the test distinguishes, of those pairs not distinguished by tests already in the set. If costs have been defined, this number of pairs is divided by the cost of the test concerned.

Setting option CRITERION=weighted uses a refinement, suggested by Barnett et al. (1983), which weights each pair of taxa by the reciprocal of the number of tests that can distinguish between them. The criterion is then the maximum weighted number of pairs of taxa (divided by the cost of the test, if defined). This causes tests that distinguish “difficult” pairs of taxa (those with nearly identical characteristics) to be selected earlier during the construction of the set, and thus tends to generate smaller sets. You can set a preference rating for each test using the PREFERENCE parameter; the most-preferred tests should have ratings of one, and less-preferred tests should have ratings of two and upwards. Then, if at any stage there is then more than one test with the best criterion value, the most-preferred test is selected. If these preferances are especially important, you may also also want to set the EQUIVALENCE option to a scalar, e say. Then all tests whose criterion values are within e of the current maximum are regarded as equivalent, and the best test is selected from within these tests according to the preferences.

The main disadvantage of most sequential methods is that they produce only a single set of tests. In order to allow a choice of sets and as a way of improving the original set, IRREDUNDANT can run through a sequence of cycles. In each of these, the tests in the best set are deleted in turn, further tests are selected to separate the pairs of taxa distinguished only by the deleted test, and any redundant tests are deleted. If no improvement is achieved, all the non-essential tests are deleted, and the set is reformed without using those tests. The process can be then repeated until no improvements are being achieved of until the number of cycles exceeds the setting of the MAXCYCLE option (default 20).

Printed output is controlled by the PRINT option, with settings:

`numbers`	numbers of the tests in the sets,
`diagram`	table showing the contents of the sets,
`notdistinguished`	lists of pairs of taxa that cannot be distinguished,
`messages`	messages for example when the number of sets has been reduced as requested by the `LIMSETS` option, or concerning pairs of taxa than cannot be distinguished.

The default is PRINT=numb,diag,notd,mess.

The best set can be saved using the BESTSET option, as a pointer containing the relevant factors. The SETS option can save a matrix, with a row for each set and a column for each test, representing all the sets that have been formed. In each row the matrix generally stores the number one in the columns corresponding to the tests in that set, and zero elsewhere. However, if the sets have been constructed to confirm the identification of a single taxon, the matrix contains more informative numbers than one. So, down each column wherever one would be stored, it instead stores the level given by the taxon for the factor corresponding to the test concerned. The NOTDISTINGUISHED option can save information about the pairs of taxa that cannot be distinguished, or that are distinguished by less than NDISTINCTIONS tests. The matrix has a row for each such pair of taxa, and three columns. Columns 1 and 2 contain the numbers of the taxa in the pair, and column 3 contains the number of tests that can distinguish them.

Options: PRINT, BESTSET, SETS, NOTDISTINGUISHED, METHOD, UNITS, GROUPS, OBJECT, NDISTINCTIONS, MAXPREFERENCE, MAXSIZE, NPRINT, NSAVE, LIMSETS, DISTINCTIONS, CRITERION, MAXCYCLE, EQUIVALENCE.

Parameters: CHARACTER, COST, PREFERENCE, VARIABLE, INAPPLICABLE.

Method

The exact method uses an extension of the algorithm of Payne (1991). The sequential method is an extension by Barnett et al. (1983) of the algorithm described in Payne & Preece (1980) Section 6.6.

Action with `RESTRICT`

IRREDUNDANT takes account of restrictions on any of the CHARACTER factors or on TAXONNAMES or GROUPS.

References

Barnett, J.A., Payne, R.W. & Yarrow, D. (1983). Yeasts: Characteristics and Identification. Cambridge: Cambridge University Press.

Payne, R.W. (1991). Algorithm AS263 Construction of irredundant test sets. Applied Statistics, 40, 213-229.

Payne, R.W. & Preece, D.A. (1980). Identification keys and diagnostic tables: a review (with discussion). Journal of the Royal Statistical Society, Series A, 143, 253-292.

Was this article helpful?

Yes No