Compares groupings generated, for example, from cluster analyses (R.W. Payne).

### Options

| Option | Description |
|---|---|
| `PRINT` = string tokens | Controls printed output (`indexes`, `tests`); default `inde` |
| `PLOT` = string | What to plot (`histogram`); default `*` |
| `METHOD` = string tokens | Which indexes to calculate (`arand`, `jaccard`, `rand`); default `arand` |
| `NTIMES` = scalar | Number of permutations to make for the tests; default `999` |

### Parameters

| Parameter | Description |
|---|---|
| `FIRSTGROUPING` = factors | First set of groupings |
| `SECONDGROUPING` = factors | Second set of groupings |
| `ESTIMATES` = pointers | Saves the values of the indexes calculated from the original data set |
| `SEED` = scalars | Seed for the random number generator used to make the permutations; default `0` continues from the previous generation or (if none) initializes the seed automatically |
| `PERMUTATIONESTIMATES` = pointers | Saves the values of the indexes calculated from the permuted data sets |

### Description

`HCOMPAREGROUPINGS` calculates indexes to assess the similarity between two sets of groupings, which are specified in factors using the `FIRSTGROUPING` and `SECONDGROUPING` parameters. These may, for example, have been obtained from two different cluster analyses.

The `METHOD` option selects the indexes, with settings:

| Setting | Index |
|---|---|
| `arand` | adjusted Rand index |
| `jaccard` | Jaccard index |
| `rand` | Rand index |

Details are given in the *Method* section. The default is to calculate only the adjusted Rand index.

The `ESTIMATES` parameter can save the calculated values in a pointer containing a scalar for each index. The elements of the pointer are labelled by the index names, and are defined so that you can refer to them in lower case, upper case or a mixture.

The `PRINT` option controls the printed output, with settings:

| Setting | Output |
|---|---|
| `indexes` | prints the indexes |
| `tests` | prints probabilities obtained from random permutation tests |

The random permutation tests allow you to assess whether the similarity may have arisen only by chance. The `NTIMES` option specifies the number of permutations to make (default `999`). `HCOMPAREGROUPINGS` checks whether `NTIMES` is greater than the number of possible permutations available for the data set. If so, it does an exact test instead, which uses each possible permutation once. The `SEED` parameter specifies the seed for the random number generator used to form the permutations; the default of `0` continues from the previous generation or, if there has been none, initializes the seed automatically.
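The idea of a random permutation test can be sketched in Python. This is an illustration only, not the code used by `HCOMPAREGROUPINGS`: the names `rand_index` and `permutation_test` are hypothetical, the exact-test fallback is not implemented, and the probability is computed by one common convention (counting the observed value as one of the permutations), which is an assumption rather than something stated here.

```python
import random
from itertools import combinations


def rand_index(first, second):
    """Proportion of pairs of units on which the two groupings agree."""
    pairs = list(combinations(range(len(first)), 2))
    agree = sum(
        (first[u] == first[v]) == (second[u] == second[v]) for u, v in pairs
    )
    return agree / len(pairs)


def permutation_test(first, second, index=rand_index, ntimes=999, seed=None):
    """Return (observed index, probability, permuted index values).

    The labels of the second grouping are shuffled ntimes, and the index is
    recalculated each time. The probability is the proportion of values at
    least as large as the observed one, with the observed value included
    (an assumed convention, not taken from the Genstat documentation).
    """
    rng = random.Random(seed)  # seed=None lets Python choose a seed
    observed = index(first, second)
    shuffled = list(second)    # copy, so the caller's data are untouched
    permuted = []
    for _ in range(ntimes):
        rng.shuffle(shuffled)
        permuted.append(index(first, shuffled))
    at_least = sum(value >= observed for value in permuted)
    probability = (at_least + 1) / (ntimes + 1)
    return observed, probability, permuted


first = [1, 1, 1, 2, 2, 2, 3, 3, 3]
second = [1, 1, 1, 2, 2, 2, 3, 3, 3]
observed, probability, permuted = permutation_test(
    first, second, ntimes=199, seed=353445
)
print(observed, probability)
```

A small probability suggests that agreement as strong as that observed is unlikely to arise when the group labels are assigned at random.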

The `PERMUTATIONESTIMATES` parameter can save the values calculated in the random permutations in a pointer containing a variate for each index. The elements of the pointer are labelled by the index names, and are defined so that you can refer to them in lower case, upper case or a mixture.

You can set option `PLOT=histogram` to plot histograms showing where the calculated value of each index lies within those obtained from the permutation tests.

Options: `PRINT`, `PLOT`, `METHOD`, `NTIMES`.

Parameters: `FIRSTGROUPING`, `SECONDGROUPING`, `ESTIMATES`, `SEED`, `PERMUTATIONESTIMATES`.

### Method

The Rand index (Rand 1971) is defined as

$$\frac{\mathit{np}_1 + \mathit{np}_2}{{}^{N}C_{2}}$$

where

$\mathit{np}_1$ is the number of pairs of units that are in the same group in both factors,

$\mathit{np}_2$ is the number of pairs of units that are in different groups in both factors,

$N$ is the total number of units, and

${}^{N}C_{2}$ is the total number of ways of selecting 2 units from a sample of $N$ units, which can be calculated as $N \times (N-1)/2$.

This ranges from zero (for no similarity) to one (for complete similarity).
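As an illustration of the definition above, the following Python sketch counts the two kinds of agreeing pairs directly. The function name `rand_index` and the example labels are assumptions made for this sketch; it is not the code used by `HCOMPAREGROUPINGS`.

```python
from itertools import combinations


def rand_index(first, second):
    """Rand index for two equal-length sequences of group labels."""
    if len(first) != len(second):
        raise ValueError("the groupings must contain the same units")
    n = len(first)
    if n < 2:
        raise ValueError("at least two units are needed to form a pair")
    np1 = 0  # pairs in the same group in both groupings
    np2 = 0  # pairs in different groups in both groupings
    for u, v in combinations(range(n), 2):
        same_first = first[u] == first[v]
        same_second = second[u] == second[v]
        if same_first and same_second:
            np1 += 1
        elif not same_first and not same_second:
            np2 += 1
    total_pairs = n * (n - 1) // 2  # N x (N-1) / 2
    return (np1 + np2) / total_pairs


first = [1, 1, 2, 2, 3, 3]
second = [1, 1, 1, 2, 2, 2]
# Here np1 = 2 and np2 = 8 out of 15 pairs, so the index is 10/15.
print(rand_index(first, second))  # about 0.667
```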

The adjusted Rand index of Hubert & Arabie (1985) is defined as

$$\frac{\sum_i \sum_j {}^{m_{ij}}C_{2} - \left\{ \sum_i {}^{a_i}C_{2} \times \sum_j {}^{b_j}C_{2} \right\} / {}^{N}C_{2}}{\tfrac{1}{2} \left\{ \sum_i {}^{a_i}C_{2} + \sum_j {}^{b_j}C_{2} \right\} - \left\{ \sum_i {}^{a_i}C_{2} \times \sum_j {}^{b_j}C_{2} \right\} / {}^{N}C_{2}}$$

where

$m_{ij}$ is the number of units that are in group $i$ for the first factor, and group $j$ for the second factor,

$a_i$ is the number of units in group $i$ of the first factor, and

$b_j$ is the number of units in group $j$ of the second factor.

The first term in the numerator measures the agreement between the groupings. The second term is the expected value of the first term, assuming a generalized hypergeometric distribution, and the first term of the denominator is its maximum value. The index has an expected value of zero if the groupings are independent, and a value of one if they are in complete agreement.
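The adjusted Rand index can be sketched in Python from the cross-tabulation of the two groupings. This uses the standard Hubert & Arabie form, in which the maximum term is half the sum of the two marginal pair counts. The function name `adjusted_rand_index` and the example labels are assumptions for illustration, and the sketch is not the code used by `HCOMPAREGROUPINGS`.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(first, second):
    """Adjusted Rand index for two equal-length sequences of group labels."""
    if len(first) != len(second):
        raise ValueError("the groupings must contain the same units")
    n = len(first)
    if n < 2:
        raise ValueError("at least two units are needed to form a pair")
    m = Counter(zip(first, second))  # m_ij: units in group i and group j
    a = Counter(first)               # a_i: units in group i of the first
    b = Counter(second)              # b_j: units in group j of the second
    sum_m = sum(comb(count, 2) for count in m.values())
    sum_a = sum(comb(count, 2) for count in a.values())
    sum_b = sum(comb(count, 2) for count in b.values())
    expected = sum_a * sum_b / comb(n, 2)  # expected value of sum_m
    maximum = (sum_a + sum_b) / 2          # maximum value of sum_m
    if maximum == expected:
        # Happens, for example, when each grouping has a single group.
        raise ValueError("the index is undefined for these groupings")
    return (sum_m - expected) / (maximum - expected)


first = [1, 1, 2, 2, 3, 3]
second = [1, 1, 1, 2, 2, 2]
# sum_m = 2, sum_a = 3, sum_b = 6, so the index is (2 - 1.2) / (4.5 - 1.2).
print(adjusted_rand_index(first, second))  # about 0.242
```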

The Jaccard index is defined as

$$\frac{\mathit{np}_1}{{}^{N}C_{2} - \mathit{np}_2}$$

This is similar to the Rand index, except that it excludes the pairs of units that are in different groups in both factors.
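The Jaccard index can be sketched in Python using group counts rather than a loop over pairs. The pair counts are obtained from the cross-tabulation of the two groupings. The function name `jaccard_index` and the example labels are assumptions for illustration, and the sketch is not the code used by `HCOMPAREGROUPINGS`.

```python
from collections import Counter
from math import comb


def jaccard_index(first, second):
    """Jaccard index for two equal-length sequences of group labels."""
    if len(first) != len(second):
        raise ValueError("the groupings must contain the same units")
    n = len(first)
    total_pairs = comb(n, 2)
    # Pairs together in both groupings, in the first, and in the second.
    np1 = sum(comb(c, 2) for c in Counter(zip(first, second)).values())
    same_first = sum(comb(c, 2) for c in Counter(first).values())
    same_second = sum(comb(c, 2) for c in Counter(second).values())
    # Pairs apart in both groupings, by inclusion-exclusion.
    np2 = total_pairs - same_first - same_second + np1
    denominator = total_pairs - np2
    if denominator == 0:
        # No pair of units shares a group in either grouping.
        raise ValueError("the index is undefined for these groupings")
    return np1 / denominator


first = [1, 1, 2, 2, 3, 3]
second = [1, 1, 1, 2, 2, 2]
# np1 = 2 and np2 = 8 out of 15 pairs, so the index is 2 / (15 - 8).
print(jaccard_index(first, second))  # about 0.286
```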

### Action with RESTRICT

There must be no restrictions.

### See also

Directives: `CLUSTER`, `FACTOR`, `HCLUSTER`.

Commands for: Multivariate and cluster analysis.

### Example

    CAPTION 'HCOMPAREGROUPINGS example',\
            !t('Compare groupings from average and single-link cluster',\
            'analyses of cars in Guide to Genstat, Part 2, Section 6.1.2.');\
            STYLE=meta,plain
    TEXT    Cars; !T(Estate,'Arna1.5','Alfa2.5',Mondialqc,Testarossa,Croma,\
            Panda,Regatta,Regattad,Uno,X19,Contach,Delta,Thema,Y10,Spider)
    POINTER Vars; !P(CC,NCyl,Tank,Wt,Length,Width,Ht,WBase,TSpeed,StSt,\
            Carb,Drive)
    VARIATE [NVALUES=Cars] Vars[]
    READ    [PRINT=*] Vars[]
    1490  4  50  966 414 161 133 245 177 10.9 1 2
    1409  4  50  845 399 162 139 242 174 10.2 1 2
    2492  6  49 1160 433 163 140 251 210  8.2 1 1
    3185  8  87 1430 458 179 126 265 249  7.4 2 1
    4942 12 120 1506 449 198 113 255 291  5.8 2 1
    1995  4  70 1180 450 176 143 266 209  7.8 2 2
     965  4  35  761 338 149 146 216 134 16.8 1 2
    1585  4  55  970 426 165 141 244 180 10.0 1 2
    1714  4  55  980 426 165 141 245 150 18.9 3 2
     999  4  42  720 364 155 143 236 145 16.2 1 2
    1498  4  48  912 397 157 118 220 171 11.0 1 1
    5167 12 120 1446 414 200 107 245 286  4.9 1 1
    1585  4  45 1000 389 162 138 247 195  8.2 1 2
    1995  4  70 1150 459 175 143 266 224  7.6 2 2
    1049  4  47  790 339 151 143 216 179 11.8 1 2
    1995  4  45 1050 414 162 125 228 190  9.0 2 1 :
    SYMMETRIC [ROWS=Cars] CarSim
    FSIMILARITY [SIMILARITY=CarSim]\
            Vars[]; TEST=4(cityblock,euclidean),2(cityblock,simplematching)
    HCLUSTER [PRINT=dendrogram; METHOD=average] CarSim;\
            GROUPS=AverageLink; GTHRESHOLD=90
    HCLUSTER [PRINT=dendrogram; METHOD=single] CarSim;\
            GROUPS=SingleLink; GTHRESHOLD=90
    SORT    [INDEX=AverageLink,Cars] AverageLink,Cars; NEWV=Group,Car
    PRINT   Group,Car
    SORT    [INDEX=SingleLink,Cars] SingleLink,Cars; NEWV=Group,Car
    PRINT   Group,Car
    HCOMPAREGROUPINGS [PRINT=indexes,tests] FIRSTGROUPING=AverageLink;\
            SECONDGROUPING=SingleLink; SEED=353445