Compares groupings generated, for example, from cluster analyses (R.W. Payne).
Options
| PRINT = string tokens | Controls printed output (indexes, tests); default indexes |
|---|---|
| PLOT = string | What to plot (histogram); default * |
| METHOD = string tokens | Which indexes to calculate (arand, jaccard, rand); default arand |
| NTIMES = scalar | Number of permutations to make for the tests; default 999 |
Parameters
| FIRSTGROUPING = factors | First set of groupings |
|---|---|
| SECONDGROUPING = factors | Second set of groupings |
| ESTIMATES = pointers | Saves the values of the indexes calculated from the original data set |
| SEED = scalars | Seed for the random number generator used to make the permutations; default 0 continues from the previous generation or (if none) initializes the seed automatically |
| PERMUTATIONESTIMATES = pointers | Saves the values of the indexes calculated from the permuted data sets |
Description
HCOMPAREGROUPINGS calculates indexes to assess the similarity between two sets of groupings, which are specified in factors using the FIRSTGROUPING and SECONDGROUPING parameters. These may, for example, have been obtained from two different cluster analyses.
The METHOD option selects the indexes, with settings:
| arand | adjusted Rand index, |
|---|---|
| jaccard | Jaccard index, and |
| rand | Rand index. |
Details are given in the Method section. The default is to calculate only the adjusted Rand index.
The ESTIMATES parameter can save the calculated values in a pointer containing a scalar for each index. The elements of the pointer are labelled by the index names, and are defined so that you can refer to them in lower case, upper case or a mixture.
The PRINT option controls the printed output, with settings:
| indexes | prints the indexes, and |
|---|---|
| tests | prints probabilities obtained from random permutation tests. |
The random permutation tests allow you to assess whether the similarity may have arisen only by chance. The NTIMES option specifies the number of permutations to make (default 999). HCOMPAREGROUPINGS checks whether NTIMES is greater than the number of possible permutations available for the data set. If so, it does an exact test instead, which uses each possible permutation once. The SEED parameter specifies the seed that is used to generate the random numbers that form the permutations.
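The idea of a random permutation test can be sketched in Python as below. This is an illustration only, not the procedure's own code: the function names, the use of the Rand index as the test statistic, and the probability formula (permuted values at least as large as the observed one, plus one, divided by the number of permutations plus one) are all assumptions of this sketch. The exact test that replaces random sampling when NTIMES exceeds the number of possible permutations is not implemented here.

```python
import random
from collections import Counter
from math import comb


def rand_index(first, second):
    """Rand index computed from the two-way table of group memberships."""
    n_pairs = comb(len(first), 2)
    together_first = sum(comb(a, 2) for a in Counter(first).values())
    together_second = sum(comb(b, 2) for b in Counter(second).values())
    # np1: pairs in the same group in both groupings
    np1 = sum(comb(m, 2) for m in Counter(zip(first, second)).values())
    # np2: pairs in different groups in both groupings
    np2 = n_pairs - together_first - together_second + np1
    return (np1 + np2) / n_pairs


def permutation_test(first, second, index=rand_index, ntimes=999, seed=None):
    """Shuffle the second grouping ntimes and compare with the observed index.

    The probability formula below is an assumption of this sketch.
    """
    rng = random.Random(seed)
    observed = index(first, second)
    shuffled = list(second)
    permuted = []
    for _ in range(ntimes):
        rng.shuffle(shuffled)  # random reallocation of units to groups
        permuted.append(index(first, shuffled))
    as_large = sum(value >= observed for value in permuted)
    probability = (as_large + 1) / (ntimes + 1)
    return observed, permuted, probability


observed, permuted, probability = permutation_test(
    [1, 1, 1, 2, 2, 2], [1, 1, 2, 2, 3, 3], ntimes=99, seed=353445
)
print(observed, probability)
```

Fixing the seed makes the sequence of permutations, and hence the reported probability, reproducible between runs.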
The PERMUTATIONESTIMATES parameter can save the values calculated in the random permutations, in a pointer containing a variate for each index. The elements of the pointer are labelled by the index names, and are defined so that you can refer to them in lower case, upper case or a mixture.
You can set option PLOT=histogram to plot histograms showing where the calculated value of each index lies within those obtained from the permutation tests.
Options: PRINT, PLOT, METHOD, NTIMES.
Parameters: FIRSTGROUPING, SECONDGROUPING, ESTIMATES, SEED, PERMUTATIONESTIMATES.
Method
The Rand index (Rand 1971) is defined as
$$R = \frac{n_{p1} + n_{p2}}{{}^{N}C_2}$$
where
$n_{p1}$ is the number of pairs of units that are in the same group in both factors,
$n_{p2}$ is the number of pairs of units that are in different groups in both factors,
$N$ is the total number of units, and
${}^{N}C_2$ is the total number of ways of selecting 2 units from a sample of $N$ units, which can be calculated as $N(N-1)/2$.
The index ranges from zero (for no similarity) to one (for complete similarity).
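As a small illustration of the definition, the Python sketch below counts the pairs directly. The function names and the toy groupings are assumptions made for the example and are not part of the procedure.

```python
from itertools import combinations


def pair_counts(first, second):
    """Return (np1, np2): pairs together in both groupings, pairs apart in both."""
    if len(first) != len(second):
        raise ValueError("both groupings must cover the same units")
    np1 = np2 = 0
    for i, j in combinations(range(len(first)), 2):
        same_first = first[i] == first[j]
        same_second = second[i] == second[j]
        if same_first and same_second:
            np1 += 1
        elif not same_first and not same_second:
            np2 += 1
    return np1, np2


def rand_index(first, second):
    """Rand index: (np1 + np2) divided by the number of pairs N(N-1)/2."""
    n = len(first)
    if n < 2:
        raise ValueError("at least two units are needed to form a pair")
    np1, np2 = pair_counts(first, second)
    return (np1 + np2) / (n * (n - 1) / 2)


first = [1, 1, 1, 2, 2, 2]    # hypothetical grouping of six units
second = [1, 1, 2, 2, 3, 3]   # a second hypothetical grouping
print(pair_counts(first, second), rand_index(first, second))
```

For these six units there are 15 pairs, of which 2 are together in both groupings and 8 are apart in both, so the index is 10/15.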
The adjusted Rand index of Hubert & Arabie (1985) is defined as
$$\mathrm{ARI} = \frac{\sum_i \sum_j {}^{m_{ij}}C_2 \;-\; \left[ \sum_i {}^{a_i}C_2 \times \sum_j {}^{b_j}C_2 \right] / \, {}^{N}C_2}{\tfrac{1}{2} \left[ \sum_i {}^{a_i}C_2 + \sum_j {}^{b_j}C_2 \right] \;-\; \left[ \sum_i {}^{a_i}C_2 \times \sum_j {}^{b_j}C_2 \right] / \, {}^{N}C_2}$$
where
$m_{ij}$ is the number of units that are in group $i$ of the first factor and group $j$ of the second factor,
$a_i$ is the number of units in group $i$ of the first factor, and
$b_j$ is the number of units in group $j$ of the second factor.
The first term in the numerator measures the agreement between the groupings. The second term is the expected value of the first term, assuming a generalized hypergeometric distribution, and the first term of the denominator is its maximum value. The index has a value of zero if the groupings are independent, and one if they are in complete agreement.
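The calculation can be sketched in Python from the two-way table of group memberships. This is an illustration under stated assumptions rather than the procedure's implementation: the function name is hypothetical, the maximum is taken as half the sum of the two marginal terms as in Hubert & Arabie (1985), and raising an error when the index is undefined is a choice made for this sketch.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(first, second):
    """Adjusted Rand index from cell counts m_ij and margins a_i, b_j."""
    if len(first) != len(second):
        raise ValueError("both groupings must cover the same units")
    n = len(first)
    cells = Counter(zip(first, second))   # m_ij
    rows = Counter(first)                 # a_i
    cols = Counter(second)                # b_j
    agreement = sum(comb(m, 2) for m in cells.values())
    sum_rows = sum(comb(a, 2) for a in rows.values())
    sum_cols = sum(comb(b, 2) for b in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)   # expected agreement
    maximum = (sum_rows + sum_cols) / 2           # maximum agreement
    if maximum == expected:
        # e.g. both groupings put every unit in one group (sketch's choice)
        raise ValueError("index is undefined for these groupings")
    return (agreement - expected) / (maximum - expected)


first = [1, 1, 1, 2, 2, 2]    # hypothetical grouping of six units
second = [1, 1, 2, 2, 3, 3]   # a second hypothetical grouping
print(adjusted_rand_index(first, second))
```

For these groupings the agreement term is 2, its expected value is 6 × 3 / 15 = 1.2 and its maximum is 4.5, so the index is 0.8/3.3.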
The Jaccard index is defined as
$$J = \frac{n_{p1}}{{}^{N}C_2 - n_{p2}}$$
This is similar to the Rand index, except that it excludes the pairs of units that are in different groups in both factors.
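A short Python sketch of the Jaccard index follows. It uses the fact that the denominator, the number of pairs that are not apart in both factors, equals the number of pairs that are together in at least one factor. The function name and the error raised for an undefined index are assumptions of this sketch.

```python
from collections import Counter
from math import comb


def jaccard_index(first, second):
    """Jaccard index: np1 divided by the pairs together in at least one grouping."""
    if len(first) != len(second):
        raise ValueError("both groupings must cover the same units")
    # np1: pairs in the same group in both groupings
    np1 = sum(comb(m, 2) for m in Counter(zip(first, second)).values())
    together_first = sum(comb(a, 2) for a in Counter(first).values())
    together_second = sum(comb(b, 2) for b in Counter(second).values())
    # Equivalent to NC2 - np2 by inclusion-exclusion
    denominator = together_first + together_second - np1
    if denominator == 0:
        # every unit is alone in both groupings (sketch's choice to raise)
        raise ValueError("index is undefined for these groupings")
    return np1 / denominator


first = [1, 1, 1, 2, 2, 2]    # hypothetical grouping of six units
second = [1, 1, 2, 2, 3, 3]   # a second hypothetical grouping
print(jaccard_index(first, second))
```

Here 2 pairs are together in both groupings and 7 pairs are together in at least one, so the index is 2/7.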
Action with RESTRICT
There must be no restrictions.
See also
Directives: CLUSTER, FACTOR, HCLUSTER.
Commands for: Multivariate and cluster analysis.
Example
CAPTION 'HCOMPAREGROUPINGS example',\
        !t('Compare groupings from average and single-link cluster',\
        'analyses of cars in Guide to Genstat, Part 2, Section 6.1.2.');\
        STYLE=meta,plain
TEXT    Cars; !T(Estate,'Arna1.5','Alfa2.5',Mondialqc,Testarossa,Croma,\
        Panda,Regatta,Regattad,Uno,X19,Contach,Delta,Thema,Y10,Spider)
POINTER Vars; !P(CC,NCyl,Tank,Wt,Length,Width,Ht,WBase,TSpeed,StSt,\
        Carb,Drive)
VARIATE [NVALUES=Cars] Vars[]
READ    [PRINT=*] Vars[]
1490  4  50  966 414 161 133 245 177 10.9 1 2
1409  4  50  845 399 162 139 242 174 10.2 1 2
2492  6  49 1160 433 163 140 251 210  8.2 1 1
3185  8  87 1430 458 179 126 265 249  7.4 2 1
4942 12 120 1506 449 198 113 255 291  5.8 2 1
1995  4  70 1180 450 176 143 266 209  7.8 2 2
 965  4  35  761 338 149 146 216 134 16.8 1 2
1585  4  55  970 426 165 141 244 180 10.0 1 2
1714  4  55  980 426 165 141 245 150 18.9 3 2
 999  4  42  720 364 155 143 236 145 16.2 1 2
1498  4  48  912 397 157 118 220 171 11.0 1 1
5167 12 120 1446 414 200 107 245 286  4.9 1 1
1585  4  45 1000 389 162 138 247 195  8.2 1 2
1995  4  70 1150 459 175 143 266 224  7.6 2 2
1049  4  47  790 339 151 143 216 179 11.8 1 2
1995  4  45 1050 414 162 125 228 190  9.0 2 1 :
SYMMETRIC [ROWS=Cars] CarSim
FSIMILARITY [SIMILARITY=CarSim]\
        Vars[]; TEST=4(cityblock,euclidean),2(cityblock,simplematching)
HCLUSTER [PRINT=dendrogram; METHOD=average] CarSim;\
        GROUPS=AverageLink; GTHRESHOLD=90
HCLUSTER [PRINT=dendrogram; METHOD=single] CarSim;\
        GROUPS=SingleLink; GTHRESHOLD=90
SORT    [INDEX=AverageLink,Cars] AverageLink,Cars; NEWV=Group,Car
PRINT   Group,Car
SORT    [INDEX=SingleLink,Cars] SingleLink,Cars; NEWV=Group,Car
PRINT   Group,Car
HCOMPAREGROUPINGS [PRINT=indexes,tests] FIRSTGROUPING=AverageLink;\
        SECONDGROUPING=SingleLink; SEED=353445