Selects the best set of variates to discriminate between groups (D.B. Baird, L.H. Schmitt & J.W. McNicol).

### Options

| Option | Description |
|---|---|
| `PRINT` = string tokens | Printed output from the analysis (`summary`, `steps`, `validation`, `specificity`, `discrimination`, `monitoring`); default `summ`, `vali`, `spec`, `disc` |
| `PLOT` = string tokens | What plots to produce (`errorrate`, `steps`, `specificity`, `discriminant`); default `erro`, `steps`, `spec`, `disc` |
| `DDISCRIMINANT` = string tokens | What to display on the discriminant plot (`means`, `mlabels`, `scores`, `polygons`, `confidencecircle`); default `means`, `mlabels`, `scores`, `conf` |
| `METHOD` = string token | The variable selection method to use (`forward`, `backward`); default `forw` |
| `NSELECT` = scalar | Number of variates to select; default 4 |
| `CRITERION` = string token | Criterion to use to select variables (`wilkslambda`, `crossvalidation`, `bootstrap`, `jackknife`); default `wilk` |
| `MODELCHOICE` = string token | Which model to save (`optimal`, `nselect`); default `opti` |
| `VALIDATIONMETHOD` = string token | Validation method to use to calculate error rates (`bootstrap`, `crossvalidation`, `jackknife`, `prediction`); default `cros` |
| `NSIMULATIONS` = variate | Number of bootstraps or cross-validation sets to use for selection and for validation; default `!(10,50)` |
| `NCROSSVALIDATIONGROUPS` = scalar | Number of groups for cross-validation; default 10 |
| `SEED` = scalar | Seed for random number generation; default 0 |
| `YROOT` = scalars | Specifies roots for plotting on y-axes |
| `XROOT` = scalars | Specifies roots for plotting on x-axes |

### Parameters

| Parameter | Description |
|---|---|
| `DATA` = pointers | Each pointer contains a set of variates that are available to be selected |
| `GROUPS` = factors | Define groupings for the units in each training set |
| `FORCED` = pointers | Variates that must be included in the model |
| `SELECTED` = pointers | Saves the variates in the final model |
| `STEPS` = pointers | Saves the criterion values for each step in the model selection |
| `ERRORRATE` = scalars | Saves the validation error rate for the final model |
| `SPECIFICITY` = matrices | Saves the specificity table for the final model |
| `ALLOCATION` = factors | Saves the groups allocated by the final model |
| `LRV` = LRVs | Saves the LRVs from the final discriminant analysis |
| `SCORES` = matrices or pointers | Saves discriminant scores for units from the final model |

### Description

`SDISCRIMINATE` uses forward selection or backwards elimination to search for the best set of variates to discriminate between groups. The variates that are available for the discrimination must be specified, in a pointer, by the `DATA` parameter. The membership of the groups must be specified, in a factor, by the `GROUPS` parameter. If there are some variates that must always be included in the model, these can be specified, in a pointer, by the `FORCED` parameter.
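As a minimal sketch of a call that uses `FORCED`: this assumes the `Automobile.gsh` data from the Example section have already been loaded. The pointer names `Candidates` and `Always`, and the choice of `price` as a forced variate, are purely illustrative. Supplying the forced variate in a separate pointer from the candidates is also an assumption, as the text does not say whether a forced variate may appear in `DATA` too.

```genstat
" Candidate variates that may be selected (names taken from the Example section) "
POINTER [VALUES=wheel_base,length,width,height,curb_weight] Candidates
" Hypothetical choice: always keep price in the model "
POINTER [VALUES=price] Always
SDISCRIMINATE DATA=Candidates; GROUPS=symboling; FORCED=Always
```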

Printed output is controlled by the option `PRINT`, with settings:

| Setting | Output |
|---|---|
| `summary` | summary of the model fitting, |
| `steps` | criterion values evaluated at each step of the model fitting, |
| `validation` | error rates at each model step, |
| `specificity` | specificity of allocation (i.e. the proportion of each group that is assigned correctly), |
| `discrimination` | the standard discriminant analysis output for the final model, and |
| `monitoring` | criterion values for each model tried. |

The default is `PRINT=summ,vali,spec,disc`.

The `PLOT` option controls what plots are displayed, with settings:

| Setting | Plot |
|---|---|
| `errorrate` | error rate at each selection step, |
| `steps` | criterion values at each step of the model fitting, |
| `specificity` | specificity at each selection step, and |
| `discriminant` | the standard discriminant plot from the final model. |

By default these are all plotted. The `DDISCRIMINANT` option allows group means, labels for group means, unit scores, group polygons enclosing units, and 95% confidence circles around group means to be included on the discriminant plot. The `YROOT` and `XROOT` options specify the roots for the axes.
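The output options can be combined in a single call, as in the sketch below. It assumes the `Xvars` pointer and `symboling` factor from the Example section. Setting `XROOT=1` and `YROOT=2`, i.e. identifying the roots by their number, is an assumption about how these options are specified.

```genstat
" Print only the summary and the step-by-step criterion values. "
" Draw only the discriminant plot, with group means, their labels and group polygons. "
SDISCRIMINATE [PRINT=summary,steps; PLOT=discriminant;\
               DDISCRIMINANT=means,mlabels,polygons; YROOT=2; XROOT=1]\
               DATA=Xvars; GROUPS=symboling
```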

The selection method is defined by the `METHOD` option. The `forward` setting (the default) starts with the `FORCED` model and then, at each step, looks to see which of the `DATA` variates not already in the model gives the best improvement. The `backward` setting starts with the full model, and looks to see which variate in the model (other than those in `FORCED`) gives the least reduction in the criterion when eliminated at that step.
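A backward search might be requested as follows. This is a sketch only, again assuming the `Xvars` pointer and `symboling` factor from the Example section.

```genstat
" Backward elimination: start from the variates in Xvars and eliminate at each step "
SDISCRIMINATE [METHOD=backward; PRINT=summary,steps] DATA=Xvars; GROUPS=symboling
```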

The criterion for evaluating the model is defined by the `CRITERION` option, with settings:

| Setting | Criterion |
|---|---|
| `wilkslambda` | uses the ratio of the determinant of the within-group sums of squares and products matrix to the determinant of the total sums of squares and products matrix (default), |
| `crossvalidation` | uses the cross-validation error rate, |
| `bootstrap` | uses the bootstrap error rate, and |
| `jackknife` | uses the jackknife error rate. |

Cross-validation, bootstrapping and jackknifing take much longer than the use of Wilks’ lambda.
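Wilks’ lambda can be written compactly. The symbols below are introduced here for illustration and are not taken from the procedure's source: $\mathbf{W}$ is the within-group sums of squares and products matrix, $\mathbf{B}$ the between-group matrix, and $\mathbf{T}$ the total matrix.

$$
\Lambda = \frac{\det(\mathbf{W})}{\det(\mathbf{T})}, \qquad \mathbf{T} = \mathbf{W} + \mathbf{B}
$$

The statistic lies between 0 and 1. Values near 0 indicate that most of the variation is between groups, so the groups are well separated. Values near 1 indicate little separation.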

The number of variates to select (excluding those in the `FORCED` model) is set by the `NSELECT` option (default 4). The `MODELCHOICE` option indicates how to choose the final model. The default setting `optimal` takes the model from the step with the minimum validation error. Alternatively, the `nselect` setting takes the model with the number of variates specified by the `NSELECT` option.
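To keep the model with exactly the requested number of variates, rather than the step with the smallest validation error, a call might look like the sketch below. It assumes the `Xvars` pointer and `symboling` factor from the Example section.

```genstat
" Save the six-variate model instead of the step with minimum validation error "
SDISCRIMINATE [NSELECT=6; MODELCHOICE=nselect; SEED=925081] DATA=Xvars; GROUPS=symboling
```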

The `VALIDATIONMETHOD` option specifies the validation method, with settings for prediction, cross-validation, jackknife and bootstrap. Cross-validation works by randomly splitting the units into a number of groups specified by the `NCROSSVALIDATIONGROUPS` option (default 10). It then omits each of the groups, in turn, and predicts how the omitted units are allocated to the discrimination groups. Jackknifing leaves the units out one at a time, and uses the rest of the data to predict the group of the omitted unit. The bootstrap method works by drawing a bootstrap sample of units (a random sample, drawn with replacement, of the same size as the original sample), and predicting the groups of the units that are not present in that sample. The resulting bootstrap error rate is then calculated as a weighted average of the error rate of the omitted observations and the predictive error rate of the bootstrap sample. The weights used are 0.632 and 0.368 respectively, and so this is known as the *632 rule*.
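The weighted average used by the bootstrap can be written out as below. The symbol names are chosen here for illustration: $\widehat{\mathrm{err}}_{\text{omitted}}$ is the error rate for the units left out of the bootstrap sample, and $\widehat{\mathrm{err}}_{\text{sample}}$ is the predictive error rate of the bootstrap sample itself.

$$
\widehat{\mathrm{err}}_{.632} = 0.632\,\widehat{\mathrm{err}}_{\text{omitted}} + 0.368\,\widehat{\mathrm{err}}_{\text{sample}}
$$

For example, with made-up rates of 0.30 for the omitted units and 0.10 for the bootstrap sample, the combined rate is $0.632 \times 0.30 + 0.368 \times 0.10 = 0.1896 + 0.0368 = 0.2264$.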

The `NSIMULATIONS` option sets the number of simulations for cross-validation or bootstrapping. It should be set to a variate with two values: the first value defines the number of simulations to use during selection (default 10), and the second sets the number to use in the estimation of the error rates (default 50).

The `SEED` option provides the seed for the random numbers used for the randomizations during the simulations. The default value of 0 continues an existing sequence of random numbers; if none have been used in the current Genstat job, it initializes the seed automatically using the computer clock.
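The simulation settings can be sketched together as follows, assuming the `Xvars` pointer and `symboling` factor from the Example section. The simulation counts and the number of cross-validation groups are arbitrary illustrative values. The seed is the one used in the Example section.

```genstat
" Bootstrap for both selection and validation: 20 simulations while selecting, "
" 100 for the final error rates; a non-zero seed makes the run repeatable. "
SDISCRIMINATE [CRITERION=bootstrap; VALIDATIONMETHOD=bootstrap;\
               NSIMULATIONS=!(20,100); SEED=925081] DATA=Xvars; GROUPS=symboling

" Cross-validation with 5 groups instead of the default 10 "
SDISCRIMINATE [CRITERION=crossvalidation; VALIDATIONMETHOD=crossvalidation;\
               NCROSSVALIDATIONGROUPS=5; SEED=925081] DATA=Xvars; GROUPS=symboling
```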

The `SELECTED` parameter can save the contents of the chosen model, in a pointer. The `STEPS` parameter can save a pointer with a variate for each step of the selection, containing the criterion evaluated for each `DATA` variate at that step. The variates contain a missing value if the `DATA` variate had already been included in or excluded from the model. The `ERRORRATE` parameter can save a variate with the minimum value of the validation error rate after each step. The `SPECIFICITY` parameter can save a matrix containing the specificity table for the final model. The `LRV` parameter can save the latent roots, vectors and trace from the final discriminant analysis, and the `ALLOCATION` and `SCORES` parameters can save the assigned groups and discriminant scores.
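Saving and inspecting results might look like the sketch below. It assumes the `Xvars` pointer and `symboling` factor from the Example section. The identifiers `Chosen`, `Err`, `Spec` and `Alloc` are hypothetical names for the saved structures.

```genstat
SDISCRIMINATE [NSELECT=6; SEED=925081] DATA=Xvars; GROUPS=symboling;\
              SELECTED=Chosen; ERRORRATE=Err; SPECIFICITY=Spec; ALLOCATION=Alloc
" Validation error rate and specificity table "
PRINT Err
PRINT Spec
" Cross-tabulate the original groups against the allocated groups "
TABULATE [PRINT=counts; CLASSIFICATION=symboling,Alloc]
```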

Options: `PRINT`, `PLOT`, `DDISCRIMINANT`, `METHOD`, `NSELECT`, `CRITERION`, `MODELCHOICE`, `VALIDATIONMETHOD`, `NSIMULATIONS`, `NCROSSVALIDATIONGROUPS`, `SEED`, `YROOT`, `XROOT`.

Parameters: `DATA`, `GROUPS`, `FORCED`, `SELECTED`, `STEPS`, `ERRORRATE`, `SPECIFICITY`, `ALLOCATION`, `LRV`, `SCORES`.

### Method

The procedure steps through the models using `FSSPM` to calculate Wilks’ lambda, and subsidiary procedures `_SDISCROSSVALIDATE` and `_SDISBOOTSTRAP` to calculate the other selection criteria. `DISCRIMINATE` is called to provide the output for the final model.

### Action with `RESTRICT`

The input variates and factor may be restricted (but any restrictions must be identical). The restricted units are omitted from the analysis.
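Applying an identical restriction to all the inputs could be sketched as below. This assumes the `Xvars` pointer and `symboling` factor from the Example section. The indicator variate `Cheap` and the price threshold of 20000 are hypothetical.

```genstat
" Hypothetical subset: units with price below 20000 "
CALCULATE Cheap = price < 20000
" Apply the same restriction to every DATA variate and to the GROUPS factor "
RESTRICT Xvars[],symboling; CONDITION=Cheap
SDISCRIMINATE [NSELECT=6; SEED=925081] DATA=Xvars; GROUPS=symboling
" Remove the restrictions afterwards "
RESTRICT Xvars[],symboling
```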

### See also

Directive: `CVA`.

Procedures: `CVAPLOT`, `DBIPLOT`, `DISCRIMINATE`, `QDISCRIMINATE`.

Commands for: Multivariate and cluster analysis.

### Example

    CAPTION 'SDISCRIMINATE example'; STYLE=meta
    SPLOAD FILE='%gendir%/examples/Automobile.gsh'
    POINTER [VALUES=normalized_losses,wheel_base,length,width,height,\
            curb_weight,engine_size,bore,stroke,compression_ratio,\
            horsepower,peak_rpm,city_mpg,highway_mpg,price] Xvars
    SDISCRIMINATE [NSELECT=6; SEED=925081] DATA=Xvars; GROUPS=symboling