HCOMPAREGROUPINGS procedure

Compares groupings generated, for example, from cluster analyses (R.W. Payne).

Options

`PRINT` = string tokens	Controls printed output (`indexes`, `tests`); default `inde`
`PLOT` = string	What to plot (`histogram`); default `*`
`METHOD` = string tokens	Which indexes to calculate (`arand`, `jaccard`, `rand`); default `arand`
`NTIMES` = scalar	Number of permutations to make for the tests; default `999`

Parameters

`FIRSTGROUPING` = factors	First set of groupings
`SECONDGROUPING` = factors	Second set of groupings
`ESTIMATES` = pointers	Saves the values of the indexes calculated from the original data set
`SEED` = scalars	Seed for the random number generator used to make the permutations; default `0` continues from the previous generation or (if none) initializes the seed automatically
`PERMUTATIONESTIMATES` = pointers	Saves the values of the indexes calculated from the permuted data sets

Description

HCOMPAREGROUPINGS calculates indexes to assess the similarity between two sets of groupings, which are specified in factors using the FIRSTGROUPING and SECONDGROUPING parameters. These may, for example, have been obtained from two different cluster analyses.

The METHOD option selects the indexes, with settings:

`arand`	adjusted Rand index,
`jaccard`	Jaccard index, and
`rand`	Rand index.

Details are given in the Method section. The default is to calculate only the adjusted Rand index.

The ESTIMATES parameter can save the calculated values in a pointer containing a scalar for each index. The elements of the pointer are labelled by the index names, and are defined so that you can refer to them in lower case, upper case or a mixture of the two.

The PRINT option controls the printed output, with settings:

`indexes`	prints the indexes, and
`tests`	prints probabilities obtained from random permutation tests.

The random permutation tests allow you to assess whether the similarity could have arisen purely by chance. The NTIMES option specifies the number of permutations to make (default 999). If NTIMES is greater than the number of possible permutations available for the data set, HCOMPAREGROUPINGS does an exact test instead, which uses each possible permutation once. The SEED parameter specifies the seed for the random number generator used to form the permutations; the default of 0 continues from the previous generation or, if there has been none, initializes the seed automatically.
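
The idea of the random permutation test can be sketched outside Genstat. The Python below is an illustration only, not the procedure's own code: the function names, the choice to shuffle the second grouping, and the rule used for the probability (the proportion of values, including the observed one, that are at least as large as the observed value) are all assumptions. It does not implement the switch to an exact test.

```python
import random
from collections import Counter
from math import comb


def pair_sum(counts):
    """Sum of 'count choose 2' over a table of counts."""
    return sum(comb(c, 2) for c in counts.values())


def adjusted_rand_index(first, second):
    """Adjusted Rand index of two equal-length lists of group labels."""
    sum_m = pair_sum(Counter(zip(first, second)))
    sum_a = pair_sum(Counter(first))
    sum_b = pair_sum(Counter(second))
    expected = sum_a * sum_b / comb(len(first), 2)
    maximum = (sum_a + sum_b) / 2
    return (sum_m - expected) / (maximum - expected)


def permutation_test(first, second, index, ntimes=999, seed=None):
    """Hypothetical helper: random permutation test for a similarity index.

    Returns the observed index, the list of permuted values, and an
    upper-tail probability (an assumed convention, see the lead-in).
    """
    rng = random.Random(seed)          # seeded generator, so runs are repeatable
    observed = index(first, second)
    shuffled = list(second)            # copy, so the caller's data is untouched
    permuted = []
    for _ in range(ntimes):
        rng.shuffle(shuffled)          # break any association with `first`
        permuted.append(index(first, shuffled))
    n_at_least = sum(1 for value in permuted if value >= observed)
    probability = (n_at_least + 1) / (ntimes + 1)
    return observed, permuted, probability


# Illustrative data: two groupings of 15 units that differ in one unit.
first = [1] * 5 + [2] * 5 + [3] * 5
second = [1] * 5 + [2] * 4 + [3] * 6
observed, permuted, probability = permutation_test(
    first, second, adjusted_rand_index, ntimes=999, seed=353445
)
print(observed, probability)
```

The list of permuted values plays the same role as the variates saved by PERMUTATIONESTIMATES: a histogram of it, with the observed value marked, shows how unusual the observed similarity is.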

The PERMUTATIONESTIMATES parameter can save the values calculated in the random permutations, in a pointer containing a variate for each index. The elements of the pointer are labelled by the index names, and are defined so that you can refer to them in lower case, upper case or a mixture of the two.

You can set option PLOT=histogram to plot histograms showing where the calculated value of each index lies within those obtained from the permutation tests.

Options: PRINT, PLOT, METHOD, NTIMES.
Parameters: FIRSTGROUPING, SECONDGROUPING, ESTIMATES, SEED, PERMUTATIONESTIMATES.

Method

The Rand index (Rand 1971) is defined as

$$\frac{\mathit{np}_1 + \mathit{np}_2}{\binom{N}{2}}$$

where

$\mathit{np}_1$ is the number of pairs of units that are in the same group in both factors,
$\mathit{np}_2$ is the number of pairs of units that are in different groups in both factors,
$N$ is the total number of units, and
$\binom{N}{2}$ is the total number of ways of selecting 2 units from a sample of $N$ units, which can be calculated as $N(N-1)/2$.

The index ranges from zero (for no similarity) to one (for complete similarity).
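
As a minimal sketch of this definition, the Python below counts the pairs directly. It is an illustration rather than the procedure's own code, and the function name `rand_index` and the example data are assumptions.

```python
from itertools import combinations


def rand_index(first, second):
    """Rand index of two equal-length lists of group labels."""
    if len(first) != len(second):
        raise ValueError("groupings must have the same number of units")
    n = len(first)
    if n < 2:
        raise ValueError("at least two units are needed to form a pair")
    total_pairs = n * (n - 1) // 2     # N x (N-1) / 2
    np1 = 0                            # pairs together in both groupings
    np2 = 0                            # pairs apart in both groupings
    for i, j in combinations(range(n), 2):
        same_first = first[i] == first[j]
        same_second = second[i] == second[j]
        if same_first and same_second:
            np1 += 1
        elif not same_first and not same_second:
            np2 += 1
    return (np1 + np2) / total_pairs


# Illustrative data: six units, two groups in the first grouping, three in the second.
first = [1, 1, 1, 2, 2, 2]
second = [1, 1, 2, 2, 3, 3]
print(rand_index(first, second))       # (2 + 8) / 15, about 0.667
```

Here 2 of the 15 pairs are together in both groupings and 8 are apart in both, so the index is 10/15.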

The adjusted Rand index of Hubert & Arabie (1985) is defined as

$$\frac{\sum_i \sum_j \binom{m_{ij}}{2} - \left[ \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2} \right] / \binom{N}{2}}{\frac{1}{2} \left[ \sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2} \right] - \left[ \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2} \right] / \binom{N}{2}}$$

where

$m_{ij}$ is the number of units that are in group $i$ for the first factor and group $j$ for the second factor,
$a_i$ is the number of units in group $i$ of the first factor,
$b_j$ is the number of units in group $j$ of the second factor, and
$\binom{n}{2} = n(n-1)/2$ is the number of pairs that can be formed from $n$ units.

The first term in the numerator measures the agreement between the groupings. The second term is the expected value of the first term, assuming a generalized hypergeometric distribution, and the first term of the denominator is its maximum value. The index therefore has an expected value of zero if the groupings are independent, and a value of one if they are in complete agreement.
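
The three terms can be computed from the counts in the two-way table of the groupings. The Python below is a minimal sketch under that reading of the formula, not the procedure's own code; the function and variable names are assumptions.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(first, second):
    """Adjusted Rand index of two equal-length lists of group labels."""
    if len(first) != len(second):
        raise ValueError("groupings must have the same number of units")
    n = len(first)
    m = Counter(zip(first, second))    # m_ij: units in group i and group j
    a = Counter(first)                 # a_i: units in group i of the first grouping
    b = Counter(second)                # b_j: units in group j of the second grouping
    sum_m = sum(comb(count, 2) for count in m.values())
    sum_a = sum(comb(count, 2) for count in a.values())
    sum_b = sum(comb(count, 2) for count in b.values())
    expected = sum_a * sum_b / comb(n, 2)   # expected value of sum_m
    maximum = (sum_a + sum_b) / 2           # maximum value of sum_m
    # Note: the denominator is zero when both groupings put every unit
    # in a single group; that case is not handled in this sketch.
    return (sum_m - expected) / (maximum - expected)


# Illustrative data: six units, two groups in the first grouping, three in the second.
first = [1, 1, 1, 2, 2, 2]
second = [1, 1, 2, 2, 3, 3]
print(adjusted_rand_index(first, second))   # (2 - 1.2) / (4.5 - 1.2), about 0.242
```

For these data the agreement term is 2, its expected value is 6 × 3 / 15 = 1.2 and its maximum is (6 + 3) / 2 = 4.5.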

The Jaccard index is defined as

$$\frac{\mathit{np}_1}{\binom{N}{2} - \mathit{np}_2}$$

This is similar to the Rand index, except that it excludes the pairs of units that are in different groups in both factors.
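
A minimal Python sketch of this definition follows. As before, it is an illustration rather than the procedure's own code, and the function name `jaccard_index` and the example data are assumptions.

```python
from itertools import combinations


def jaccard_index(first, second):
    """Jaccard index of two equal-length lists of group labels."""
    if len(first) != len(second):
        raise ValueError("groupings must have the same number of units")
    n = len(first)
    total_pairs = n * (n - 1) // 2     # N x (N-1) / 2
    np1 = 0                            # pairs together in both groupings
    np2 = 0                            # pairs apart in both groupings
    for i, j in combinations(range(n), 2):
        same_first = first[i] == first[j]
        same_second = second[i] == second[j]
        if same_first and same_second:
            np1 += 1
        elif not same_first and not same_second:
            np2 += 1
    # Note: the denominator is zero if every pair is apart in both
    # groupings; that case is not handled in this sketch.
    return np1 / (total_pairs - np2)


# Illustrative data: six units, two groups in the first grouping, three in the second.
first = [1, 1, 1, 2, 2, 2]
second = [1, 1, 2, 2, 3, 3]
print(jaccard_index(first, second))    # 2 / (15 - 8), about 0.286
```

With the same data the Rand index is 10/15, which shows how much of that value comes from pairs that are apart in both groupings.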

Action with RESTRICT

There must be no restrictions.

Example

CAPTION     'HCOMPAREGROUPINGS example',\
            !t('Compare groupings from average and single-link cluster',\
            'analyses of cars in Guide to Genstat, Part 2, Section 6.1.2.');\
            STYLE=meta,plain
TEXT        Cars; !T(Estate,'Arna1.5','Alfa2.5',Mondialqc,Testarossa,Croma,\
            Panda,Regatta,Regattad,Uno,X19,Contach,Delta,Thema,Y10,Spider)
POINTER     Vars; !P(CC,NCyl,Tank,Wt,Length,Width,Ht,WBase,TSpeed,StSt,\
            Carb,Drive)
VARIATE     [NVALUES=Cars] Vars[]
READ        [PRINT=*] Vars[]
 1490  4  50  966 414 161 133 245 177 10.9  1  2
 1409  4  50  845 399 162 139 242 174 10.2  1  2
 2492  6  49 1160 433 163 140 251 210  8.2  1  1
 3185  8  87 1430 458 179 126 265 249  7.4  2  1
 4942 12 120 1506 449 198 113 255 291  5.8  2  1
 1995  4  70 1180 450 176 143 266 209  7.8  2  2
  965  4  35  761 338 149 146 216 134 16.8  1  2
 1585  4  55  970 426 165 141 244 180 10.0  1  2
 1714  4  55  980 426 165 141 245 150 18.9  3  2
  999  4  42  720 364 155 143 236 145 16.2  1  2
 1498  4  48  912 397 157 118 220 171 11.0  1  1
 5167 12 120 1446 414 200 107 245 286  4.9  1  1
 1585  4  45 1000 389 162 138 247 195  8.2  1  2
 1995  4  70 1150 459 175 143 266 224  7.6  2  2
 1049  4  47  790 339 151 143 216 179 11.8  1  2
 1995  4  45 1050 414 162 125 228 190  9.0  2  1 :
SYMMETRIC   [ROWS=Cars] CarSim
FSIMILARITY [SIMILARITY=CarSim]\
            Vars[]; TEST=4(cityblock,euclidean),2(cityblock,simplematching)
HCLUSTER    [PRINT=dendrogram; METHOD=average] CarSim;\
            GROUPS=AverageLink; GTHRESHOLD=90
HCLUSTER    [PRINT=dendrogram; METHOD=single] CarSim;\
            GROUPS=SingleLink; GTHRESHOLD=90
SORT        [INDEX=AverageLink,Cars] AverageLink,Cars; NEWV=Group,Car
PRINT       Group,Car
SORT        [INDEX=SingleLink,Cars] SingleLink,Cars; NEWV=Group,Car
PRINT       Group,Car
HCOMPAREGROUPINGS [PRINT=indexes,tests] FIRSTGROUPING=AverageLink;\
            SECONDGROUPING=SingleLink; SEED=353445

Updated on September 12, 2019
