Performs principal coordinates analysis, also principal components and canonical variates analysis (but with different weighting from that used in CVA) as special cases.
Options
PRINT = string tokens | Printed output required (roots, scores, loadings, residuals, centroid, distances); default * i.e. no printing |
---|---|
NROOTS = scalar | Number of latent roots for printed output; default * requests them all to be printed |
SMALLEST = string token | Whether to print the smallest roots instead of the largest (yes, no); default no |
Parameters
DATA = identifiers | These can be specified either as a symmetric matrix of similarities or transformed distances or, for the canonical variates analysis, as an SSPM containing within-group sums of squares and products etc., or, for principal components analysis, either as a pointer containing the variates of the data matrix or as a matrix storing the variates by columns |
---|---|
LRV = LRVs | Latent vectors (i.e. coordinates or scores), roots and trace from each analysis |
CENTROID = diagonal matrices | Squared distances of the units from their centroid |
RESIDUALS = matrices or variates | Distances of the units from the fitted space |
LOADINGS = matrices | Principal component loadings, or canonical variate loadings |
DISTANCES = symmetric matrices | Computed inter-unit distances calculated from the variates of a data matrix, or inter-group Mahalanobis distances calculated from a within-group SSPM |
SAVE = pointers | Saves details of the analysis; if unset, an unnamed save structure is saved automatically (and this can be accessed using the GET directive) |
Description
The PCO directive is used for principal coordinates analysis. This method encompasses principal components analysis and a form of canonical variates analysis as special cases, as explained above.
There are six sections of output from PCO, requested using the PRINT option: roots, scores, loadings, residuals, centroid and distances.
The NROOTS and SMALLEST options control the printed output of roots, scores, loadings and residuals. By default, results are printed for all the roots, but you can set the NROOTS option to specify a smaller number. If option SMALLEST has the default setting no, these are taken to be the largest roots, but if you set SMALLEST=yes the results are for the smallest non-zero roots. The inter-unit distances are unaffected by the setting of the NROOTS option.
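A minimal sketch of how these two options combine. The matrix Dsq and its values are illustrative assumptions (squared distances between the four corners of a 2-by-1 rectangle), transformed in the same way as in the example for this directive.

```genstat
" Illustrative data: squared distances between the corners of a
  2-by-1 rectangle, entered as a lower triangle by rows. "
SYMMETRIC [ROWS=4; VALUES=0, 4,0, 5,1,0, 1,5,4,0] Dsq
" Transform the squared distances to associations. "
CALCULATE Dsq = -Dsq/2
" Print results for the two largest roots only. "
PCO [PRINT=roots,scores; NROOTS=2] Dsq
" Print results for the single smallest non-zero root instead. "
PCO [PRINT=roots,scores; NROOTS=1; SMALLEST=yes] Dsq
```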
The DATA parameter supplies the data. In its simplest form, PCO works on a symmetric matrix, with values giving the associations amongst a set of objects. This could, for example, be a similarity matrix produced by FSIMILARITY.
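For orientation, the usual principal coordinates construction can be written as below. This is a sketch of the standard textbook method, not a statement of Genstat's internal algorithm. The symbols d_ij (inter-unit distance), A, B, V, Lambda, X and n (number of units) are notation introduced here, and the associations are assumed to be supplied as transformed squared distances, as in the CALCULATE statement of the example.

```latex
a_{ij} = -\tfrac{1}{2}\, d_{ij}^{2}, \qquad
B = \Bigl(I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top}\Bigr) A \Bigl(I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top}\Bigr), \qquad
B = V \Lambda V^{\top}, \qquad
X = V \Lambda^{1/2}
```

Under this construction the rows of X are the coordinates (scores) of the units, the diagonal of Lambda holds the latent roots, and the diagonal element b_ii is the squared distance of unit i from the centroid.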
Alternatively, the input to PCO can be a pointer whose values are the identifiers of a set of variates, or a matrix storing the variates by columns. Now the PCO directive will construct the matrix of inter-unit squared distances, and will base the analysis on associations derived from this. This is equivalent to a principal components analysis; however, the results are derived by analysing the distance matrix rather than an SSPM. When there are more units than variates, using PCO for principal components analysis is less efficient than using the PCP directive; however, if there are more variates than units the PCO directive is more efficient. When PCO is used for principal components analysis, all the variates must be of the same length and none of their values may be missing; any restrictions on the variates are ignored.
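A minimal sketch of the pointer form of input. The variate names X1, X2 and X3, the pointer name Vars and all the values are made up for illustration.

```genstat
" Illustrative data: three variates of equal length with no missing values. "
VARIATE [VALUES=2.1,3.4,1.9,4.2,3.0,2.7] X1
VARIATE [VALUES=0.8,1.1,0.7,1.6,1.0,0.9] X2
VARIATE [VALUES=5.5,4.9,6.1,4.0,5.2,5.6] X3
" Collect the variates in a pointer and use it as the DATA parameter. "
POINTER [VALUES=X1,X2,X3] Vars
" PCO forms the inter-unit squared distances from the variates itself. "
PCO [PRINT=roots,scores,loadings] Vars
```

With six units and three variates there are more units than variates, so this is the situation in which PCP would be the more efficient choice; the sketch only shows the syntax.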
The third type of input to PCO is an SSPM structure. This must be a within-group SSPM: that is, you must have set the GROUP option of the SSPM directive when the SSPM was declared. Now the PCO directive will calculate the Mahalanobis distances amongst the group means, and base the analysis on them. This will give results similar to a canonical variates analysis. The representation of distances will be better than that of CVA, but CVA will be better if you are interested in loadings for discriminatory purposes.
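A minimal sketch of the SSPM form of input. The names X1, X2, Group and W and all the values are made up for illustration. The grouping option is written in full as GROUPS, on the assumption that GROUP in the text above is an abbreviation of it, and FSSPM is used to form the values of the SSPM before it is passed to PCO.

```genstat
" Illustrative data: two variates measured on 12 units in 3 groups. "
VARIATE [VALUES=5.1,4.9,5.4,5.0,6.3,6.0,6.6,6.1,7.2,7.5,7.0,7.4] X1
VARIATE [VALUES=2.0,2.3,1.9,2.4,3.1,2.8,3.3,2.9,2.2,2.6,2.1,2.5] X2
FACTOR [LEVELS=3; VALUES=1,1,1,1,2,2,2,2,3,3,3,3] Group
" Declare a within-group SSPM and form its values. "
SSPM [TERMS=X1+X2; GROUPS=Group] W
FSSPM W
" The units of this analysis are the three groups. "
PCO [PRINT=roots,scores,distances] W
```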
The second and subsequent parameters of PCO allow you to save the results. The number of units that determine the sizes of the output structures differs according to the input to PCO. For a matrix or a symmetric matrix the number of units is the number of rows of the matrix, for a pointer it is the number of values in the variates that the pointer contains, while for an SSPM the number of units is the number of groups.
The latent roots, scores and trace can be saved in an LRV structure using the LRV parameter. If you have declared the LRV already, its number of rows must equal the number of units.
If the input to PCO is a pointer, a matrix, or an SSPM, the principal component or canonical variate loadings can be saved in a matrix using the LOADINGS parameter. The number of rows of the matrix is equal to the number of variates (either those specified by an input pointer or those specified in the SSPM directive for an input SSPM structure), or the number of columns in an input matrix.
The number of columns of the LRV and of the LOADINGS matrix corresponds to the number of dimensions to be saved from the analysis, and this must be the same for both of them. If the structures have been declared already, Genstat will take the larger of the numbers of columns declared for either, and declare (or redeclare) the other one to match. If neither has been declared and option SMALLEST retains the default setting no, Genstat takes the number of columns from the setting of the NROOTS option. Otherwise, Genstat saves results for the full set of dimensions. The trace saved as the third component of the LRV structure, however, will contain the sum of all the latent roots, whether or not they have all been saved.
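A minimal sketch of saving scores and loadings for a reduced number of dimensions. The names X1, X2, X3, Vars, Lrv2 and Load2 and all the values are made up for illustration. The elements of the LRV are referred to by suffix, on the assumption that suffixes 1, 2 and 3 hold the vectors, roots and trace respectively (the text above confirms only that the trace is the third component).

```genstat
" Illustrative data: three variates collected in a pointer. "
VARIATE [VALUES=2.1,3.4,1.9,4.2,3.0,2.7] X1
VARIATE [VALUES=0.8,1.1,0.7,1.6,1.0,0.9] X2
VARIATE [VALUES=5.5,4.9,6.1,4.0,5.2,5.6] X3
POINTER [VALUES=X1,X2,X3] Vars
" Neither Lrv2 nor Load2 is declared in advance and SMALLEST keeps
  its default, so NROOTS=2 sets the number of columns saved. "
PCO [NROOTS=2] Vars; LRV=Lrv2; LOADINGS=Load2
" Scores, roots and trace (the trace sums all the roots). "
PRINT Lrv2[1]
PRINT Lrv2[2]
PRINT Lrv2[3]
PRINT Load2
```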
The distances of the units from their centroid can be saved in a diagonal matrix using the CENTROID parameter. The diagonal matrix has the same number of rows as the number of units, defined above. The RESIDUALS parameter allows you to save residuals, formed from the dimensions that have not been saved, in a matrix with one column and number of rows equal to the number of units. Finally, the inter-unit distances can be saved in a symmetric matrix using the DISTANCES parameter. The number of rows of the symmetric matrix is again the same as the number of units.
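A minimal sketch of saving these three structures from a symmetric-matrix input. The names Dsq, L1, Cen, Res and Dist and the data values are made up for illustration (squared distances between the corners of a 2-by-1 rectangle).

```genstat
" Illustrative data: lower triangle of squared distances, by rows. "
SYMMETRIC [ROWS=4; VALUES=0, 4,0, 5,1,0, 1,5,4,0] Dsq
CALCULATE Dsq = -Dsq/2
" Save one dimension, so that the residuals are formed from the
  dimensions left out. None of the output structures is declared
  in advance. "
PCO [NROOTS=1] Dsq; LRV=L1; CENTROID=Cen; RESIDUALS=Res; DISTANCES=Dist
" Each saved structure has four rows, one per unit. "
PRINT Cen
PRINT Res
PRINT Dist
```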
The SAVE parameter can supply a pointer to save a multivariate save structure containing all the details of the analysis. If this is unset, an unnamed save structure is saved automatically (and this can be accessed using the GET directive). Alternatively, you can set SAVE=* to prevent any save structure being formed if, for example, you have a very large data set and want to avoid committing the storage space.
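A minimal sketch of the two explicit settings of SAVE. The names Dsq and PcoSave and the data values are made up for illustration.

```genstat
" Illustrative data: lower triangle of squared distances, by rows. "
SYMMETRIC [ROWS=4; VALUES=0, 4,0, 5,1,0, 1,5,4,0] Dsq
CALCULATE Dsq = -Dsq/2
" Keep the save structure in a named pointer. "
PCO [PRINT=roots] Dsq; SAVE=PcoSave
" Or suppress the save structure altogether. "
PCO [PRINT=roots] Dsq; SAVE=*
```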
Having obtained an ordination, you may sometimes want to add points to the ordination for additional units. If you know the squared distances of the new units from the old, the technique of Gower (1968) can be used to add points to the ordination for the new units. You can do this in Genstat by using the ADDPOINTS directive.
Options: PRINT, NROOTS, SMALLEST.
Parameters: DATA, LRV, CENTROID, RESIDUALS, LOADINGS, DISTANCES, SAVE.
Action with RESTRICT
PCO ignores any restrictions on the DATA variates.
Reference
Gower, J.C. (1968). Adding a point to vector diagrams in multivariate analysis. Biometrika, 55, 582-585.
See also
Directives: CVA, FCA, MDS, PCP, PCORELATE, SSPM.
Procedures: LRVSCREE, DBIPLOT, DMST, MULTMISSING, MVAOD, DISCRIMINATE, SDISCRIMINATE, PLS, RIDGE.
Commands for: Multivariate and cluster analysis.
Example
" Genstat example PCO-1: Principal coordinates analysis. The data for this example (Nathanson J A 1971. An aplication of multivariate analysis in astronomy. Applied Statistics 20, 239-249) gives squared distances amongst ten types of galaxy: those of an elliptical shape, eight different kinds of spiral galaxy , and irregularly-shaped galaxies. The spiral types vary from those which are mailnly made up of a central core (coded as types SO and SBO) to those that are extremely tenuous (Sc and SBc). This example forms an ordination of the ten galaxy types. " " Declare the symmetric data matrix " SYMMETRIC [ROWS=!T(E,SO,SBO,Sa,SBa,Sb,SBb,Sc,SBc,I)] Galaxy READ Galaxy 0 1.87 0 2.24 0.91 0 4.03 2.05 1.51 0 4.09 1.74 1.59 0.68 0 5.38 3.41 3.15 1.86 1.27 0 7.03 3.85 3.24 2.25 1.89 2.02 0 6.02 4.85 4.11 3.00 2.13 1.71 1.45 0 6.88 5.70 5.12 3.72 3.01 2.97 1.75 1.13 0 4.12 3.77 3.86 3.93 3.27 3.77 3.52 2.79 3.29 0 : PRINT Galaxy CALCULATE Galaxy = -Galaxy/2 " Carry out the principal coordinates analysis, printing out the latent roots and trace, the principal coordinate scores, the distances of each unit from their overall centroid, and the matrix of inter-unit distances. " PCO [PRINT=roots,scores,centroid,distances] Galaxy " Carry out the analysis once again, printing information for the 8 smallest roots only. " PCO [PRINT=residuals,centroid; NROOTS=8; SMALLEST=yes] Galaxy " Create two different data matrices: Gname8 - which holds the data corresponding to the eight spiral galaxies. This is created from taking row 2 to column 2, to row 9, column 9 of the symmetric matrix Galaxy. Corresponding row labels are supplied. Gname2 - which holds the data corresponding to the elliptical and irregularly-shaped galaxies. This is created from taking the values in the Galaxy matrix from row 1, columns 2 to 9, and row 10, columns 2 to 9. Again, appropriate labels are supplied. 
" TEXT Gname8; !T(SO,SBO,Sa,SBa,Sb,SBb,Sc,SBc) & Gname2; !T(E,I) SYMMETRIC [ROWS=Gname8] G8 CALCULATE G8 = Galaxy$[!(2...9)] MATRIX [ROWS=Gname2; COLUMNS=Gname8] G2 CALCULATE G2 = Galaxy$[!(1,10); !(2...9)] " Transform the matrix back to the original scale. " CALCULATE G2 = -2*G2 PRINT G2; FIELDWIDTH=7 " Perform the analysis for the eight spiral galaxies, saving the latent vectors in the LRV structure L8, and the centroid distances in the diagonal matric C8. Their is no need to declare these structures in advance since the PCO will do this automatically. " PCO [PRINT=roots,scores] G8; LRV=L8; CENTROID=C8 " Now add the points for the elliptical and irregularly shaped galaxies to the principal coordinate analysis. " ADDPOINTS [PRINT=coordinates,residuals] G2; LRV=L8; CENTROID=C8