Performs quadratic discrimination between groups, i.e. allowing for different variance-covariance matrices in each group (D.B. Baird).
Options
| PRINT = string tokens | Printed output from the analysis (allocation, counts, distance, probabilities, specificity, summary, table, validation, vcovariance); default spec, summ, vali |
|---|---|
| VALIDATIONMETHOD = string token | Validation method to use to calculate error rates (bootstrap, crossvalidation, jackknife, prediction); default cros |
| NSIMULATIONS = scalar | Number of bootstraps or cross-validation sets; default 50 |
| NCROSSVALIDATIONGROUPS = scalar | Number of groups for cross-validation; default 10 |
Parameters
| DATA = pointers | Each pointer contains a training set of variates to be used to form a quadratic discrimination |
|---|---|
| GROUPS = factors | Define groupings for the units in each training set |
| PRIORPROBABILITIES = variates | Prior probabilities of group membership; default * i.e. equal |
| SEED = scalars | Seed for the random numbers used in bootstrapping or cross-validation; default 0 continues from the previous generation or (if none) initializes the seed automatically |
| ERRORRATE = scalars | Saves the validation error rate |
| SPECIFICITY = matrices | Saves the specificity table |
| ALLOCATION = factors | Saves the groups allocated by the discriminant rule |
| PROBABILITIES = matrices or pointers | Saves posterior probabilities of membership of the groups (in the columns of a matrix or the variates in a pointer) for the units in the training set (in the rows) |
Description
QDISCRIMINATE performs a quadratic discrimination analysis to identify members of a set of groups using their observations on a set of variates. The quadratic discrimination rule assumes that the values of the variates within each group follow a multivariate Normal distribution, and that the variance-covariance matrices of these distributions differ between groups. This differs from the more familiar linear discriminant analysis, performed by procedure DISCRIMINATE, where the groups are assumed to have the same variance-covariance matrix.
The variates to be used to discriminate between the groups are specified in a pointer by the DATA parameter, and the membership of the groups is specified in a factor by the GROUPS parameter. The non-missing units of the GROUPS factor provide a training set to estimate the discriminant rule. Units that you would like to allocate to groups using the discriminant rule should be included in the data set with missing values in the GROUPS factor.
You can specify prior probabilities for the groups using the PRIORPROBABILITIES parameter; by default the groups are all assumed to be equally likely. You can use this to allow for unequal costs of mis-allocation by weighting the prior probabilities like this:
PRIORPROBABILITIES = Cost * Prior / SUM(Cost * Prior)
where Cost is a variate defining the cost of mis-allocation for each group, and Prior is a variate containing the unweighted prior probabilities.
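The weighting above is a simple renormalization, and a minimal Python sketch may make the arithmetic concrete. The function name `cost_weighted_priors` and the numeric values are hypothetical, chosen only for illustration; within Genstat the same calculation would be done on variates.

```python
def cost_weighted_priors(cost, prior):
    """Return Cost * Prior / SUM(Cost * Prior), one value per group."""
    if len(cost) != len(prior):
        raise ValueError("cost and prior must have one value per group")
    weighted = [c * p for c, p in zip(cost, prior)]
    total = sum(weighted)
    if total <= 0:
        raise ValueError("the weighted priors must have a positive sum")
    return [w / total for w in weighted]


# Hypothetical values for three groups: mis-allocating a unit from the
# third group is taken to be four times as costly as for the other two.
cost = [1.0, 1.0, 4.0]
prior = [0.5, 0.3, 0.2]
weighted_priors = cost_weighted_priors(cost, prior)
```

Dividing by the sum keeps the weighted values on the probability scale, so they still add up to 1 and can be supplied wherever prior probabilities are expected.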
Printed output is controlled by the option PRINT, with settings:
| allocation | the allocated group for each unit, |
|---|---|
| counts | number of units in each group with a complete set of observations, |
| distance | generalized pairwise distance between group means, |
| probabilities | the posterior probability of being allocated to each group, |
| specificity | specificity of allocation (i.e. the proportion of each group that is assigned correctly), |
| summary | summary of the model fitting, |
| table | table of counts of training units allocated to each group, |
| validation | the error rate, and |
| vcovariance | variance-covariance matrices for the groups. |

The default is PRINT=spec,summ,vali.
The VALIDATIONMETHOD option specifies the validation method, with settings for prediction, cross-validation, jackknife and bootstrap. Prediction calculates the error rate as the proportion of the units in the training set that were misallocated. Cross-validation works by randomly splitting the units into a number of groups specified by the NCROSSVALIDATIONGROUPS option (default 10). It then omits each of the groups, in turn, and predicts how the omitted units are allocated to the discrimination groups. Jackknifing leaves the units out one at a time, and uses the rest of the data to predict the group of the omitted unit. The bootstrap method works by drawing a bootstrap sample of units (a random sample of units, drawn with replacement, of the same size as the original sample), and predicting the units that are not present in the random sample. The resulting bootstrap error rate is then calculated as a weighted average of the error rate of the omitted observations and the predictive error rate of the bootstrap sample. The weights used are 0.632 and 0.368 respectively, and so this is known as the 632 rule.
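The bootstrap calculation described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure's own code: the `fit` and `predict` callables are a hypothetical interface, averaging the estimate over the simulations is an assumption, and the toy nearest-mean rule stands in for the quadratic discriminant rule purely to keep the example self-contained.

```python
import random


def bootstrap_632_error(units, groups, fit, predict, nsimulations=50, seed=1):
    """Average the 632-rule error estimate over bootstrap samples.

    `fit(units, groups)` returns a discriminant rule and
    `predict(rule, unit)` returns a group label. Both callables are a
    hypothetical interface for this sketch, not part of Genstat.
    """
    rng = random.Random(seed)
    n = len(units)
    estimates = []
    for _ in range(nsimulations):
        # Draw n unit indices at random with replacement.
        sample = [rng.randrange(n) for _ in range(n)]
        drawn = set(sample)
        omitted = [i for i in range(n) if i not in drawn]
        if not omitted:
            continue  # every unit was drawn, so nothing is left to predict
        rule = fit([units[i] for i in sample], [groups[i] for i in sample])

        def error_rate(indices):
            wrong = sum(predict(rule, units[i]) != groups[i] for i in indices)
            return wrong / len(indices)

        # Weighted average: 0.632 for the omitted units, 0.368 for the sample.
        estimates.append(0.632 * error_rate(omitted) + 0.368 * error_rate(sample))
    if not estimates:
        raise ValueError("no bootstrap sample left any unit out")
    return sum(estimates) / len(estimates)


def fit_nearest_mean(units, groups):
    """Toy rule for the demonstration: the mean of each group."""
    sums, counts = {}, {}
    for x, g in zip(units, groups):
        sums[g] = sums.get(g, 0.0) + x
        counts[g] = counts.get(g, 0) + 1
    return {g: sums[g] / counts[g] for g in sums}


def predict_nearest_mean(rule, unit):
    """Allocate a unit to the group with the closest mean."""
    return min(rule, key=lambda g: abs(unit - rule[g]))


# Made-up one-variate data with two well separated groups.
units = [1.0, 1.2, 0.8, 1.1, 5.0, 5.3, 4.9, 5.1]
groups = ["a"] * 4 + ["b"] * 4
error = bootstrap_632_error(units, groups, fit_nearest_mean,
                            predict_nearest_mean, nsimulations=100, seed=764527)
```

Fixing the seed makes the estimate reproducible, which mirrors the role of the SEED parameter.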
The NSIMULATIONS option sets the number of simulations for cross-validation or bootstrapping; default 50.
The SEED parameter provides the seed for the random numbers used for the randomizations during the simulations. The default value of 0 continues an existing sequence of random numbers; if none have been used in the current Genstat job, it initializes the seed automatically using the computer clock.
The ERRORRATE parameter can save the validation error rates. The SPECIFICITY parameter can save the proportion of each group that is assigned correctly. The ALLOCATION parameter can save the assigned groups, and the PROBABILITIES parameter can save the posterior probabilities of the groups.
Options: PRINT, VALIDATIONMETHOD, NSIMULATIONS, NCROSSVALIDATIONGROUPS.
Parameters: DATA, GROUPS, PRIORPROBABILITIES, SEED, ERRORRATE, SPECIFICITY, ALLOCATION, PROBABILITIES.
Method
The FSSPM directive is used to calculate the variance-covariance matrices of the groups. The posterior probabilities of belonging to each group are then calculated for each unit, and the unit is assigned to the most likely group. For more details, see e.g. Hastie et al. (2001) or McLachlan (1992).
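For reference, the standard textbook form of the rule (as in Hastie et al. 2001) can be written as below. The notation is an assumption introduced here rather than taken from the procedure: x is the vector of p variate values for a unit, q_g is the prior probability of group g, and mu_g and Sigma_g are the mean vector and variance-covariance matrix of group g, which in practice are replaced by estimates from the training set.

```latex
\[
P(g \mid x) = \frac{q_g \, f_g(x)}{\sum_{h} q_h \, f_h(x)},
\qquad
f_g(x) = (2\pi)^{-p/2} \, |\Sigma_g|^{-1/2}
\exp\!\left\{ -\tfrac{1}{2} (x - \mu_g)^{\mathsf{T}} \Sigma_g^{-1} (x - \mu_g) \right\}
\]
\[
\delta_g(x) = \log q_g - \tfrac{1}{2} \log |\Sigma_g|
- \tfrac{1}{2} (x - \mu_g)^{\mathsf{T}} \Sigma_g^{-1} (x - \mu_g)
\]
```

Allocating a unit to the group with the largest posterior probability is equivalent to choosing the group with the largest score delta_g(x), because the scores differ from the log posterior probabilities only by a term that is the same for every group. The score is quadratic in x because Sigma_g differs between groups; with a common matrix the quadratic term cancels and the rule becomes linear.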
Action with RESTRICT
The input variates and factor may be restricted (but any restrictions must be identical). The restricted units are omitted from the analysis.
References
Hastie, T., Tibshirani, R. & Friedman, J. (2001). The Elements of Statistical Learning. Springer, New York.
McLachlan, G.J. (1992). Discriminant Analysis and Statistical Pattern Recognition. Wiley, Hoboken, New Jersey.
See also
Directive: CVA.
Procedures: CVAPLOT, DBIPLOT, DISCRIMINATE, SDISCRIMINATE.
Commands for: Multivariate and cluster analysis.
Example
CAPTION 'QDISCRIMINATE example','Fisher''s Iris data'; STYLE=meta,plain
SPLOAD [PRINT=*] '%GENDIR%/Data/Iris.gsh'
POINTER [VALUES=Sepal_Length,Sepal_Width,Petal_Length,Petal_Width]\
        Measures
QDISCRIMINATE [PRINT=allocation,probabilities,specificity,summary,validation;\
        VALIDATIONMETHOD=bootstrap; NSIMULATIONS=100]\
        Measures; GROUPS=Species; SEED=764527