Selects the best set of variates to discriminate between groups (D.B. Baird, L.H. Schmitt & J.W. McNicol).
Options
PRINT = string tokens | Printed output from the analysis (summary, steps, validation, specificity, discrimination, monitoring); default summ, vali, spec, disc |
---|---|
PLOT = string tokens | What plots to produce (errorrate, steps, specificity, discriminant); default erro, steps, spec, disc |
DDISCRIMINANT = string tokens | What to display on the discriminant plot (means, mlabels, scores, polygons, confidencecircle); default means, mlabels, scores, conf |
METHOD = string token | The variable selection method to use (forward, backward); default forw |
NSELECT = scalar | Number of variates to select; default 4 |
CRITERION = string token | Criterion to use to select variables (wilkslambda, crossvalidation, bootstrap, jackknife); default wilk |
MODELCHOICE = string token | Which model to save (optimal, nselect); default opti |
VALIDATIONMETHOD = string token | Validation method to use to calculate error rates (bootstrap, crossvalidation, jackknife, prediction); default cros |
NSIMULATIONS = variate | Number of bootstraps or cross-validation sets to use for selection and for validation; default !(10,50) |
NCROSSVALIDATIONGROUPS = scalar | Number of groups for cross-validation; default 10 |
SEED = scalar | Seed for random number generation; default 0 |
YROOT = scalars | Specifies roots for plotting on y-axes |
XROOT = scalars | Specifies roots for plotting on x-axes |
Parameters
DATA = pointers | Each pointer contains a set of variates that are available to be selected |
---|---|
GROUPS = factors | Define groupings for the units in each training set |
FORCED = pointers | Variates that must be included in the model |
SELECTED = pointers | Saves the variates in the final model |
STEPS = pointers | Saves the criterion values for each step in the model selection |
ERRORRATE = scalars | Saves the validation error rate for the final model |
SPECIFICITY = matrices | Saves the specificity table for the final model |
ALLOCATION = factors | Saves the groups allocated by the final model |
LRV = LRVs | Saves the LRVs from the final discriminant analysis |
SCORES = matrices or pointers | Saves discriminant scores for units from the final model |
Description
SDISCRIMINATE uses forward selection or backwards elimination to search for the best set of variates to discriminate between groups. The variates that are available for the discrimination must be specified, in a pointer, by the DATA parameter. The membership of the groups must be specified, in a factor, by the GROUPS parameter. If there are some variates that must always be included in the model, these can be specified, in a pointer, by the FORCED parameter.
Printed output is controlled by the option PRINT, with settings:
summary | summary of the model fitting, |
---|---|
steps | criterion values evaluated at each step of the model fitting, |
validation | error rates at each model step, |
specificity | specificity of allocation (i.e. the proportion of each group that is assigned correctly), |
discrimination | the standard discriminant analysis output for the final model, and |
monitoring | criterion values for each model tried. |
The default is PRINT=summ,vali,spec,disc.
The PLOT option controls what plots are displayed, with settings:
errorrate | error rate at each selection step, |
---|---|
steps | criterion values at each step of the model fitting, |
specificity | specificity at each selection step, and |
discriminant | the standard discriminant plot from the final model. |
By default these are all plotted. The DDISCRIMINANT option allows group means, labels for group means, unit scores, group polygons enclosing units, and 95% confidence circles around group means to be included on the discriminant plot. The YROOT and XROOT options specify the roots for the axes.
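As a rough illustration of the output options, the call below limits the printed output to the summary and step-by-step criterion values, and draws only the discriminant plot, showing group means, their labels and group polygons. It is a sketch that assumes the pointer Xvars and the factor symboling have already been set up as in the Example section.

```genstat
"Sketch only: assumes Xvars and symboling exist, as in the Example section"
SDISCRIMINATE [PRINT=summary,steps; PLOT=discriminant;\
  DDISCRIMINANT=means,mlabels,polygons; NSELECT=6; SEED=925081]\
  DATA=Xvars; GROUPS=symboling
```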
The selection method is defined by the METHOD option. The forward setting starts with the FORCED model and then, at each step, looks to see which of the DATA variates not already in the model gives the best improvement; this is the default. The backward setting starts with the full model and, at each step, looks to see which variate in the model (other than those in FORCED) gives the least reduction in the criterion when eliminated.
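For example, backward elimination could be requested as sketched below. This again assumes the Xvars pointer and symboling factor from the Example section; the other option values are simply those used there.

```genstat
"Sketch only: backward elimination instead of the default forward selection"
SDISCRIMINATE [METHOD=backward; NSELECT=6; SEED=925081]\
  DATA=Xvars; GROUPS=symboling
```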
The criterion for evaluating the model is defined by the CRITERION option, with settings:
wilkslambda | uses the ratio of the determinant of the within-group sums of squares and products to the determinant of the total sums of squares and products (default), |
---|---|
crossvalidation | uses the cross-validation error rate, |
bootstrap | uses the bootstrap error rate, and |
jackknife | uses jackknifing. |
Cross-validation, bootstrapping and jackknifing take much longer than the use of Wilks’ lambda.
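The Wilks' lambda criterion described above can be written out as follows. The symbols W, B and T are notation introduced here for illustration (within-group, between-group and total sums of squares and products matrices for the variates in the model); they are not names used by the procedure.

```latex
\Lambda = \frac{\det(\mathbf{W})}{\det(\mathbf{T})},
\qquad \mathbf{T} = \mathbf{W} + \mathbf{B},
\qquad 0 \le \Lambda \le 1 .
```

Smaller values of the statistic indicate better separation of the groups, since less of the total variation then lies within groups.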
The number of variates in the final model (excluding those in the FORCED model) is set by the NSELECT option. The MODELCHOICE option indicates how to choose the final model. The default setting optimal takes the model from the step with the minimum validation error. Alternatively, the nselect setting takes the model with the number of variates specified by the NSELECT option.
The VALIDATIONMETHOD option specifies the validation method, with settings for prediction, cross-validation, jackknife and bootstrap. Cross-validation works by randomly splitting the units into a number of groups specified by the NCROSSVALIDATIONGROUPS option (default 10). It then omits each of the groups, in turn, and predicts how the omitted units are allocated to the discrimination groups. Jackknifing leaves the units out one at a time, and uses the rest of the data to predict the group of the omitted unit. The bootstrap method works by drawing a bootstrap sample of units (a random sample, drawn with replacement, of the same size as the original sample), and predicting the units that are not present in the random sample. The resulting bootstrap error rate is then calculated as a weighted average of the error rate of the omitted observations and the predictive error rate of the bootstrap sample. The weights used are 0.632 and 0.368 respectively, and so this is known as the 632 rule.
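The weighted average in the 632 rule can be written as below. The symbols are notation assumed here for illustration: e_omit is the error rate for the units omitted from the bootstrap sample, and e_sample is the error rate for the units in the bootstrap sample. The numbers in the second line are made up purely to show the arithmetic.

```latex
e_{632} = 0.632\, e_{\mathrm{omit}} + 0.368\, e_{\mathrm{sample}}
\\[4pt]
\text{e.g. } e_{\mathrm{omit}} = 0.30,\; e_{\mathrm{sample}} = 0.10:
\quad e_{632} = 0.632 \times 0.30 + 0.368 \times 0.10 = 0.1896 + 0.0368 = 0.2264
```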
The NSIMULATIONS option sets the number of simulations for cross-validation or bootstrapping. It should be set to a variate with two values: the first value defines the number of simulations to use during selection (default 10), and the second sets the number to use in the estimation of the error rates (default 50).
The SEED option provides the seed for the random numbers used for the randomizations in the simulations. The default value of 0 continues an existing sequence of random numbers or, if none have been used in the current Genstat job, initializes the seed automatically using the computer clock.
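A sketch of how the validation options might be combined is shown below: selection by cross-validation error rate with 5 cross-validation groups, 5 simulations during selection and 20 for the final error rates, and a fixed seed so that the run can be repeated. The particular values are arbitrary choices for illustration, and Xvars and symboling are assumed to be defined as in the Example section.

```genstat
"Sketch only: option values are illustrative, not recommendations"
SDISCRIMINATE [CRITERION=crossvalidation; VALIDATIONMETHOD=crossvalidation;\
  NCROSSVALIDATIONGROUPS=5; NSIMULATIONS=!(5,20); NSELECT=6; SEED=925081]\
  DATA=Xvars; GROUPS=symboling
```

Supplying a non-zero SEED makes the random splits reproducible, which is useful when comparing settings.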
The SELECTED parameter can save the contents of the chosen model, in a pointer. The STEPS parameter can save a pointer with a variate for each step of the selection, containing the criterion evaluated for each DATA variate at that step. The variates contain a missing value if the DATA variate had already been included in or excluded from the model. The ERRORRATE parameter can save a variate with the minimum value of the validation error rate after each step. The SPECIFICITY parameter can save a matrix containing the specificity table for the final model. The LRV parameter can save the latent roots, vectors and trace from the final discriminant analysis, and the ALLOCATION and SCORES parameters can save the assigned groups and discriminant scores.
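The saving parameters might be used along the following lines. The identifiers Chosen, Steps, Error, Spec, Alloc and Scores are hypothetical names picked for this sketch, and Xvars and symboling are assumed to be defined as in the Example section.

```genstat
"Sketch only: output identifiers are hypothetical names"
SDISCRIMINATE [NSELECT=6; SEED=925081] DATA=Xvars; GROUPS=symboling;\
  SELECTED=Chosen; STEPS=Steps; ERRORRATE=Error; SPECIFICITY=Spec;\
  ALLOCATION=Alloc; SCORES=Scores
"Display the saved error rate and specificity table"
PRINT Error
PRINT Spec
```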
Options: PRINT, PLOT, DDISCRIMINANT, METHOD, NSELECT, CRITERION, MODELCHOICE, VALIDATIONMETHOD, NSIMULATIONS, NCROSSVALIDATIONGROUPS, SEED, YROOT, XROOT.
Parameters: DATA, GROUPS, FORCED, SELECTED, STEPS, ERRORRATE, SPECIFICITY, ALLOCATION, LRV, SCORES.
Method
The procedure steps through the models using FSSPM to calculate Wilks’ Lambda, and subsidiary procedures _SDISCROSSVALIDATE and _SDISBOOTSTRAP to calculate the other selection criteria. DISCRIMINATE is called to provide the output for the final model.
Action with RESTRICT
The input variates and factor may be restricted (but any restrictions must be identical). The restricted units are omitted from the analysis.
See also
Directive: CVA.
Procedures: CVAPLOT, DBIPLOT, DISCRIMINATE, QDISCRIMINATE.
Commands for: Multivariate and cluster analysis.
Example
CAPTION 'SDISCRIMINATE example'; STYLE=meta
SPLOAD FILE='%gendir%/examples/Automobile.gsh'
POINTER [VALUES=normalized_losses,wheel_base,length,width,height,\
  curb_weight,engine_size,bore,stroke,compression_ratio,\
  horsepower,peak_rpm,city_mpg,highway_mpg,price] Xvars
SDISCRIMINATE [NSELECT=6; SEED=925081] DATA=Xvars; GROUPS=symboling