Performs hierarchical cluster analysis.

### Options

| `PRINT` = string tokens | Printed output required (`dendrogram`, `amalgamations`); default `*` i.e. no printing |
|---|---|
| `METHOD` = string token | Criterion for forming clusters (`singlelink`, `nearestneighbour`, `completelink`, `furthestneighbour`, `averagelink`, `mediansort`, `groupaverage`); default `sing` |
| `CTHRESHOLD` = scalar | Clustering threshold at which to print formation of clusters; default `*` i.e. determined automatically |

### Parameters

| `SIMILARITY` = symmetric matrices | Input similarity matrix for each cluster analysis |
|---|---|
| `GTHRESHOLD` = scalars | Grouping threshold where groups are formed from the dendrogram |
| `GROUPS` = factors | Stores the groups formed |
| `PERMUTATION` = variates | Permutation order of the units on the dendrogram |
| `AMALGAMATIONS` = matrices | To store linked list of amalgamations |

### Description

The aim of cluster analysis is to arrange the *n* sampling units into more or less homogeneous groups. `HCLUSTER` offers several possibilities. The general strategy is best appreciated in geometrical terms, with the *n* sampling units represented by points in a multidimensional space. In *agglomerative* methods, these points initially represent *n* separate clusters, each containing one member. At each of *n*-1 stages, two clusters are fused into one bigger cluster, until at the final stage all units are fused into a single cluster: this process can be represented by a hierarchical tree whose nodes indicate what fusions have occurred. The methods fuse the two closest clusters and vary in how *closest* is defined. In *single-linkage* cluster analysis, *closest* is defined as the smallest distance between any two samples from different clusters; in *centroid* clustering it is the smallest distance between cluster centroids; and so on. Genstat can display the tree fitted to a given similarity matrix, and provides a scale to show the level of similarity at which the fusions have occurred; such a scaled tree is termed a *dendrogram*.

The input for `HCLUSTER` is provided by the `SIMILARITY` parameter, as a list of symmetric matrices, one for each analysis. These matrices can be formed by `FSIMILARITY`, by `HREDUCE` or by `CALCULATE`. Missing values are allowed in the similarity matrix only with the single-linkage method.

A hierarchical tree does not by itself provide a classification. This can be derived by cutting the dendrogram at some arbitrary level of similarity, specified as a percentage similarity using the `GTHRESHOLD` parameter. Each cluster then consists of those samples occurring on the same detached branch of the dendrogram, and the resulting cluster membership can be saved in a factor whose identifier is specified by the `GROUPS` parameter. The factor will be declared implicitly, if necessary, and it will have its number of levels set to the number of clusters formed and its number of values taken from the number of rows of the corresponding symmetric matrix. `GTHRESHOLD` and `GROUPS` must be either both present or both absent.
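As a minimal sketch of cutting the tree, the commands below assume the similarity matrix `CarSim` and the text `Cars` formed in the Example section. The threshold of 80% and the identifier `CarGroups` are illustrative choices, not taken from the original example.

```genstat
" Sketch: cut the average-linkage dendrogram at 80% similarity.
  CarSim and Cars are assumed to exist as in the Example section;
  CarGroups is a hypothetical identifier, declared implicitly as a factor."
HCLUSTER [PRINT=dendrogram; METHOD=averagelink] SIMILARITY=CarSim;\
  GTHRESHOLD=80; GROUPS=CarGroups
" List each car alongside the cluster to which it was assigned."
PRINT Cars,CarGroups
```

Note that `GTHRESHOLD` and `GROUPS` are supplied together, as the description requires.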

The endpoints of the dendrogram correspond to the units in some permuted order. The `PERMUTATION` parameter allows you to specify a variate to save this order, for example to use in the `FSIMILARITY` directive. Genstat will define it to be a variate automatically, if necessary, with its number of values taken from the number of rows of the corresponding similarity matrix. Conventionally, the first unit on the dendrogram is unit 1 and so the first value of the variate of permutations will be 1.

The `AMALGAMATIONS` parameter can specify a matrix to store information about the order in which the units form groups, and at what level of similarity. At any stage in the process of agglomeration, each group is represented by the unit with the smallest unit number: for example, a group containing units 2, 5, 17 and 22 is represented by unit 2. This means that the final merge is always between a group indexed by unit 1 and a group indexed by another unit. Since there are *n*-1 stages of agglomeration, the matrix will have a number of rows one less than the number of rows of the input similarity matrix. Each row represents a joining of two groups and consists of three values. The first two values are the numbers indexing the two groups that are joining, and the third value is the level of similarity. So the matrix has three columns. The matrix will be declared implicitly, if necessary.
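The saved structures can be inspected directly. The following is a sketch only, again assuming the 16-unit similarity matrix `CarSim` from the Example section; `Am` and `Perm` are the identifiers used there.

```genstat
" Sketch: run the analysis with no printed output (the PRINT default)
  and save the dendrogram order and the amalgamation stages."
HCLUSTER [METHOD=averagelink] SIMILARITY=CarSim; PERMUTATION=Perm;\
  AMALGAMATIONS=Am
" From the description, with 16 units Am should have 15 rows and 3 columns:
  the two group indices that join, then the level of similarity.
  Perm should have 16 values, the first of which is 1."
PRINT Am
PRINT Perm
```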

`HCLUSTER` can print two pieces of information. The first gives details of each amalgamation, followed by a list of clusters that are formed at decreasing levels of similarity. The second is the dendrogram. The `PRINT` option allows you to control which of these are printed. If `METHOD=singlelink` and the `PRINT` setting includes `amalgamations`, the minimum spanning tree will be printed instead of the stages at which the clusters merge. This is because information from forming the minimum spanning tree is used to form the single linkage clustering.

Alternatively, if you save the `AMALGAMATIONS` matrix, you can use procedure `DDENDROGRAM` to display the dendrogram using high-resolution graphics. Also the `HFCLUSTERS` procedure can be used to obtain the full set of clusters constructed during the cluster analysis, and the similarity values at which they were formed.

The `METHOD` option has seven possible settings; these determine how the similarities amongst clusters are redefined after each merge. The default `singlelink`, which has synonym `nearestneighbour`, gives single linkage. The setting `completelink` (synonym `furthestneighbour`) defines the distance between two clusters as the maximum distance between any two units in those clusters. The setting `averagelink` defines the similarity between a cluster and two merged clusters as the average of the similarities of the cluster with each of the two. For `groupaverage`, an average is taken over all the units in the two merged clusters. Median sorting (`mediansort`) is best thought of in terms of clusters being represented by points in a multidimensional space; when two clusters join, the new cluster is represented by the midpoint of the original cluster points.
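On the similarity scale, the verbal definitions above can be read as the following update rules. This is a sketch in assumed notation that does not appear in the original: clusters *i* and *j* have just merged, *k* is any other cluster, $s_{ab}$ is the similarity between clusters *a* and *b*, and $n_a$ is the number of units in cluster *a*. Because a larger distance corresponds to a smaller similarity, complete linkage takes the minimum similarity. The rule for `mediansort` is omitted because the text describes it only geometrically.

$$
\begin{aligned}
\text{singlelink:}\quad & s_{k(ij)} = \max(s_{ki},\, s_{kj}) \\
\text{completelink:}\quad & s_{k(ij)} = \min(s_{ki},\, s_{kj}) \\
\text{averagelink:}\quad & s_{k(ij)} = \tfrac{1}{2}\,(s_{ki} + s_{kj}) \\
\text{groupaverage:}\quad & s_{k(ij)} = \frac{n_i\, s_{ki} + n_j\, s_{kj}}{n_i + n_j}
\end{aligned}
$$

The size weights in the `groupaverage` rule are what make it an average over all units of the merged cluster, whereas `averagelink` weights the two merged clusters equally regardless of their sizes.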

The `CTHRESHOLD` option is a scalar which allows you to define the levels of decreasing similarity at which the lists of clusters are printed with their membership. The decreasing levels of similarity are formed by repeatedly subtracting the `CTHRESHOLD` value from the maximum similarity of 100%. For example, setting `CTHRESHOLD=10` will list the clusters formed at 90% similarity, 80%, and so on. At each level, those units that have not joined any group are also listed. If you do not set this option, the default value will be calculated from the range of similarities at which merges occur, to give between 10 and 20 separate levels.
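For instance, a finer listing could be requested as below. This is a sketch assuming the similarity matrix `CarSim` from the Example section; the step of 5 is an arbitrary illustrative value.

```genstat
" Sketch: by the subtraction rule described above, CTHRESHOLD=5 should list
  cluster membership at 95%, 90%, 85%, ... similarity."
HCLUSTER [PRINT=amalgamations; METHOD=groupaverage; CTHRESHOLD=5] CarSim
```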

Options: `PRINT`, `METHOD`, `CTHRESHOLD`.

Parameters: `SIMILARITY`, `GTHRESHOLD`, `GROUPS`, `PERMUTATION`, `AMALGAMATIONS`.

### See also

Directives: `FSIMILARITY`, `HDISPLAY`, `HLIST`, `HSUMMARIZE`, `CLUSTER`, `HREDUCE`.

Procedures: `DDENDROGRAM`, `DCLUSTERLABELS`, `DMST`, `BCLASSIFICATION`, `BKEY`, `CINTERACTION`, `HBOOTSTRAP`, `HCOMPAREGROUPINGS`, `HFAMALGAMATIONS`, `HFCLUSTERS`, `HPCLUSTERS`, `MASCLUSTER`.

Commands for: Multivariate and cluster analysis.

### Example

" Genstat example HCLU-1: Cluster analysis Data from 'Observers Book of Automobiles', 1986 16 Italian cars and 10 measurements: 1. engine capacity c.c. CC 2. number of cylinders NCyl 3. fuel tank litres Tank 4. unladen weight kg Wt 5. length cm Length 6. width cm Width 7. height cm Ht 8. wheelbase cm Wbase 9. top speed kph TSpeed 10. time to 100kph secs StSt 11. carburettor/inj/diesel 1/2/3 Carb 12. front/rear wheel drive 1/2 Drive " TEXT [VALUES=Estate,'Arna1.5','Alfa2.5',Mondialqc,Testarossa,Croma,\ Panda,Regatta,Regattad,Uno,X19,Contach,Delta,Thema,Y10,Spider] Cars POINTER [VALUES=CC,NCyl,Tank,Wt,Length,Width,Ht,WBase,TSpeed,StSt,\ Carb,Drive] Vars " Read the data - measurements and carnames - from the file 'HCLU-1.DAT', and then display it." OPEN '%gendir%/examples/HCLU-1.DAT'; CHANNEL=cardat READ [CHANNEL=cardat] Vars[] CLOSE cardat " Treat the number of cylinders, data[2], differently to the continuous measurements." HLIST [UNITS=Cars]\ Vars[]; TEST=4(cityblock,euclidean),2(cityblock,simplematching) " Form a hierarchical clustering of the cars, using the single linkage method." SYMMETRIC [ROWS=Cars] CarSim FSIMILARITY [SIMILARITY=CarSim]\ Vars[]; TEST=4(cityblock,euclidean),2(cityblock,simplematching) HCLUSTER [PRINT=amalgamations; METHOD=single] CarSim " Use the average-linkage method." HCLUSTER [PRINT=dendrogram; METHOD=average] CarSim;\ AMALGAMATIONS=Am; PERMUTATION=Perm " Display a high-resolution dendrogram." DDENDROGRAM [ORDERING=given] DATA=Am; PERMUTATION=Perm; LABELS=Cars;\ TITLE='Italian cars clustered by average linkage'