HCLUSTER directive

Performs hierarchical cluster analysis.

Options

`PRINT` = string tokens	Printed output required (`dendrogram`, `amalgamations`); default `*` i.e. no printing
`METHOD` = string token	Criterion for forming clusters (`singlelink`, `nearestneighbour`, `completelink`, `furthestneighbour`, `averagelink`, `mediansort`, `groupaverage`); default `sing`
`CTHRESHOLD` = scalar	Clustering threshold at which to print formation of clusters; default `*` i.e. determined automatically

Parameters

`SIMILARITY` = symmetric matrices	Input similarity matrix for each cluster analysis
`GTHRESHOLD` = scalars	Grouping threshold where groups are formed from the dendrogram
`GROUPS` = factors	Stores the groups formed
`PERMUTATION` = variates	Permutation order of the units on the dendrogram
`AMALGAMATIONS` = matrices	To store linked list of amalgamations

Description

The aim of cluster analysis is to arrange the n sampling units into more or less homogeneous groups. HCLUSTER offers several agglomerative methods for doing this. The general strategy is best appreciated in geometrical terms, with the n sampling units represented by points in a multidimensional space. In agglomerative methods, these points initially represent n separate clusters, each containing one member. At each of n-1 stages, two clusters are fused into one bigger cluster, until at the final stage all units are fused into a single cluster: this process can be represented by a hierarchical tree whose nodes indicate which fusions have occurred. At each stage the methods fuse the two closest clusters; they differ in how "closest" is defined. In single-linkage cluster analysis, it is the smallest distance between any two samples from different clusters; in centroid clustering it is the smallest distance between cluster centroids; and so on. Genstat can display the tree fitted to a given similarity matrix, and provides a scale to show the level of similarity at which the fusions have occurred; such a scaled tree is termed a dendrogram.

The input for HCLUSTER is provided by the SIMILARITY parameter, as a list of symmetric matrices, one for each analysis. These matrices can be formed by FSIMILARITY, by HREDUCE or by CALCULATE. Missing values are allowed in the similarity matrix only with the single-linkage method.

A hierarchical tree does not by itself provide a classification. This can be derived by cutting the dendrogram at some arbitrary level of similarity, specified as a percentage similarity using the GTHRESHOLD parameter. Each cluster then consists of those samples occurring on the same detached branch of the dendrogram, and the resulting cluster membership can be saved in a factor whose identifier is specified by the GROUPS parameter. The factor will be declared implicitly, if necessary, and it will have its number of levels set to the number of clusters formed and its number of values taken from the number of rows of the corresponding symmetric matrix. GTHRESHOLD and GROUPS must be either both present or both absent.
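
As a minimal sketch of this usage, assuming the similarity matrix CarSim and the text Cars have already been formed as in the Example section, the following statements cut an average-linkage tree at 80% similarity and list each car with its group. The identifier CarGroups and the threshold value 80 are illustrative choices, not part of the original example.

```genstat
" Cut the tree at 80% similarity and save the cluster membership.
  Assumes CarSim and Cars exist as in the Example section; CarGroups is
  a hypothetical identifier and 80 is an arbitrary illustrative threshold."
HCLUSTER [PRINT=*; METHOD=averagelink] CarSim; GTHRESHOLD=80; GROUPS=CarGroups
" CarGroups is declared implicitly as a factor with one value per car."
PRINT Cars,CarGroups
```

Raising the threshold should give more, smaller groups, and lowering it fewer, larger ones, so it can be worth trying several values before settling on a classification.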

The endpoints of the dendrogram correspond to the units in some permuted order. The PERMUTATION parameter allows you to specify a variate to save this order, for example to use in the FSIMILARITY directive. Genstat will define it to be a variate automatically, if necessary, with its number of values taken from the number of rows of the corresponding similarity matrix. Conventionally, the first unit on the dendrogram is unit 1 and so the first value of the variate of permutations will be 1.

The AMALGAMATIONS parameter can specify a matrix to store information about the order in which the units form groups, and at what level of similarity. At any stage in the process of agglomeration, each group is represented by the unit with the smallest unit number: for example, a group containing units 2, 5, 17 and 22 is represented by unit 2. This means that the final merge is always between a group indexed by unit 1 and a group indexed by another unit. Since there are n-1 stages of agglomeration, the matrix will have a number of rows one less than the number of rows of the input similarity matrix. Each row represents a joining of two groups and consists of three values. The first two values are the numbers indexing the two groups that are joining, and the third value is the level of similarity. So the matrix has three columns. The matrix will be declared implicitly, if necessary.
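
The layout described above can be inspected directly by saving and printing the matrix. This is a sketch only, assuming CarSim has been formed as in the Example section; for the 16 cars the saved matrix should have 15 rows and 3 columns.

```genstat
" Save the amalgamations without printing anything from HCLUSTER itself.
  Assumes CarSim exists as in the Example section; Am is declared implicitly."
HCLUSTER [PRINT=*; METHOD=averagelink] CarSim; AMALGAMATIONS=Am
" Each row of Am holds the two group indices that join, then the similarity."
PRINT Am
```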

HCLUSTER can print two pieces of information. The first gives details of each amalgamation, followed by a list of clusters that are formed at decreasing levels of similarity. The second is the dendrogram. The PRINT option allows you to control which of these are printed. If METHOD=singlelink and the PRINT setting includes amalgamations, the minimum spanning tree will be printed instead of the stages at which the clusters merge. This is because information from forming the minimum spanning tree is used to form the single linkage clustering.

Alternatively, if you save the AMALGAMATIONS matrix, you can use procedure DDENDROGRAM to display the dendrogram using high-resolution graphics. Also the HFCLUSTERS procedure can be used to obtain the full set of clusters constructed during the cluster analysis, and the similarity values at which they were formed.

The METHOD option has seven possible settings, two of which are synonyms, so five methods are available; these determine how the similarities amongst clusters are redefined after each merge. The default singlelink, which has synonym nearestneighbour, gives single linkage. The setting completelink (synonym furthestneighbour) defines the distance between two clusters as the maximum distance between any two units in those clusters. The setting averagelink defines the similarity between a cluster and two merged clusters as the average of the similarities of the cluster with each of the two. For groupaverage, an average is taken over all the units in the two merged clusters. Median sorting (mediansort) is best thought of in terms of clusters being represented by points in a multidimensional space; when two clusters join, the new cluster is represented by the midpoint of the original cluster points.
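
The first four rules can be sketched as update formulas applied when clusters i and j merge and the similarity with another cluster k is redefined. The notation is introduced here for illustration and is not Genstat's own: s_{ki} is the similarity between clusters k and i, and n_i is the number of units in cluster i. The formulas are the standard forms implied by the verbal description, not a statement of Genstat's internal calculations. Because a large similarity corresponds to a small distance, single linkage takes the maximum similarity and complete linkage the minimum.

```latex
\begin{aligned}
\text{singlelink:}   \quad & s_{k,(ij)} = \max(s_{ki},\, s_{kj}) \\
\text{completelink:} \quad & s_{k,(ij)} = \min(s_{ki},\, s_{kj}) \\
\text{averagelink:}  \quad & s_{k,(ij)} = \tfrac{1}{2}\,(s_{ki} + s_{kj}) \\
\text{groupaverage:} \quad & s_{k,(ij)} = \frac{n_i\, s_{ki} + n_j\, s_{kj}}{n_i + n_j}
\end{aligned}
```

For instance, with s_{ki} = 0.9, s_{kj} = 0.5, n_i = 3 and n_j = 1, these rules give 0.9, 0.5, 0.7 and 0.8 respectively.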

The CTHRESHOLD option is a scalar which allows you to define the levels of decreasing similarity at which the lists of clusters are printed with their membership. The decreasing levels of similarity are formed by repeatedly subtracting the CTHRESHOLD value from the maximum similarity of 100%. For example, setting CTHRESHOLD=10 will list the clusters formed at 90% similarity, 80%, and so on. At each level, those units that have not joined any group are also listed. If you do not set this option, the default value will be calculated from the range of similarities at which merges occur, to give between 10 and 20 separate levels.
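
A minimal sketch of the setting described above, assuming CarSim has been formed as in the Example section. Average linkage is used here because, with METHOD=singlelink, the amalgamations output shows the minimum spanning tree rather than the stages at which clusters merge.

```genstat
" List cluster membership at 90%, 80%, 70%, ... similarity.
  Assumes CarSim exists as in the Example section."
HCLUSTER [PRINT=amalgamations; METHOD=averagelink; CTHRESHOLD=10] CarSim
```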

Options: PRINT, METHOD, CTHRESHOLD.
Parameters: SIMILARITY, GTHRESHOLD, GROUPS, PERMUTATION, AMALGAMATIONS.

Example

" Genstat example HCLU-1: Cluster analysis

   Data from 'Observers Book of Automobiles', 1986
   16 Italian cars and 12 measurements:
   1.  engine capacity        c.c.        CC
   2.  number of cylinders                NCyl
   3.  fuel tank              litres      Tank
   4.  unladen weight         kg          Wt
   5.  length                 cm          Length
   6.  width                  cm          Width
   7.  height                 cm          Ht
   8.  wheelbase              cm          WBase
   9.  top speed              kph         TSpeed
  10.  time to 100kph         secs        StSt
  11.  carburettor/inj/diesel 1/2/3       Carb
  12.  front/rear wheel drive 1/2         Drive
"

TEXT [VALUES=Estate,'Arna1.5','Alfa2.5',Mondialqc,Testarossa,Croma,\ 
  Panda,Regatta,Regattad,Uno,X19,Contach,Delta,Thema,Y10,Spider] Cars
POINTER [VALUES=CC,NCyl,Tank,Wt,Length,Width,Ht,WBase,TSpeed,StSt,\ 
  Carb,Drive] Vars
" Read the data - measurements and carnames - from the file
 'HCLU-1.DAT', and then display it."
OPEN '%gendir%/examples/HCLU-1.DAT'; CHANNEL=cardat
READ [CHANNEL=cardat] Vars[]
CLOSE cardat

" Treat the number of cylinders, Vars[2], differently to the 
  continuous measurements."
HLIST [UNITS=Cars]\ 
  Vars[]; TEST=4(cityblock,euclidean),2(cityblock,simplematching)

" Form a hierarchical clustering of the cars,
  using the single linkage method."
SYMMETRIC [ROWS=Cars] CarSim
FSIMILARITY [SIMILARITY=CarSim]\ 
  Vars[]; TEST=4(cityblock,euclidean),2(cityblock,simplematching)
HCLUSTER [PRINT=amalgamations; METHOD=single] CarSim

" Use the average-linkage method."
HCLUSTER [PRINT=dendrogram; METHOD=average] CarSim;\ 
  AMALGAMATIONS=Am; PERMUTATION=Perm

" Display a high-resolution dendrogram."
DDENDROGRAM [ORDERING=given] DATA=Am; PERMUTATION=Perm; LABELS=Cars;\ 
  TITLE='Italian cars clustered by average linkage'

Updated on September 2, 2019

