Performs bootstrap analyses to assess the reliability of clusters from hierarchical cluster analysis (R.W. Payne).

### Options

`PRINT` = string token | Controls printed output (`clusters`, `dendrograms`); default `*` i.e. none |
---|---|
`METHOD` = string token | Criterion for forming clusters (`singlelink`, `nearestneighbour`, `completelink`, `furthestneighbour`, `averagelink`, `mediansort`, `groupaverage`); default `sing` |
`CLIMIT` = scalar | Similarity value below which clusters are not recorded; default `0` |
`UNITS` = text or variate | Names to label the units of the clusters when they are printed; default `*` |
`MINKOWSKI` = scalar | Index *t* for use with `TEST=minkowski` |
`CLUSTERS` = pointer | Specifies or saves the clusters |
`REPLICATION` = variate | Saves the replication of the clusters in the bootstrap samples |
`NDATASAMPLE` = scalar | Number of `DATA` vectors to take in each sample; default takes the same number as supplied by the `DATA` parameter |
`NTIMES` = scalar | Number of times to resample; default `100` |
`SEED` = scalar | Seed for random number generator; default continue from previous generation or use system clock |

### Parameters

`DATA` = variates or factors | The characteristics of the units to be clustered |
---|---|
`TEST` = string tokens | Test type, defining how each `DATA` variate or factor is treated in the calculation of the similarity between each unit (`simplematching`, `jaccard`, `russellrao`, `dice`, `antidice`, `sneathsokal`, `rogerstanimoto`, `cityblock`, `manhattan`, `ecological`, `euclidean`, `pythagorean`, `minkowski`, `divergence`, `canberra`, `braycurtis`, `soergel`); default `*` ignores that variate or factor |
`RANGE` = scalar | Range of possible values of each `DATA` variate or factor; if omitted, the observed range is taken |

### Description

`HBOOTSTRAP` uses bootstrapping to assess the reliability of clusters formed in hierarchical cluster analyses. The characteristics of the units to be clustered are described in a list of variates and factors, specified by the `DATA` parameter. The `TEST` parameter defines how each one is to be used when calculating similarities, and the `RANGE` parameter can specify ranges of their values. These operate as in the `FSIMILARITY` directive, which is used to form the similarity matrix for each cluster analysis. The `MINKOWSKI` option specifies the index *t* for the Minkowski type of test.
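
As a rough illustration of how `TEST` and `RANGE` enter a similarity, the Python sketch below averages one contribution per characteristic for a pair of units. It assumes the usual range-scaled forms for the two test types used in the example on this page (`simplematching` scores 1 for equal values and 0 otherwise; `euclidean` scores 1 minus the squared difference divided by the squared range). The function name `unit_similarity` is hypothetical, the exact formulas are those of `FSIMILARITY` and are not reproduced here, and the Minkowski type is deliberately left out.

```python
def unit_similarity(x, y, tests, ranges):
    """Average the per-characteristic contributions for two units.

    Assumed forms (not taken from this page):
      simplematching: 1 if the values are equal, else 0
      euclidean:      1 - ((x - y) / range) ** 2
    A test of None plays the role of the default `*` (characteristic ignored).
    """
    contributions = []
    for xi, yi, test, r in zip(x, y, tests, ranges):
        if test is None:
            continue
        if test == "simplematching":
            contributions.append(1.0 if xi == yi else 0.0)
        elif test == "euclidean":
            contributions.append(1.0 - ((xi - yi) / r) ** 2)
        else:
            raise ValueError(f"test type not covered by this sketch: {test}")
    return sum(contributions) / len(contributions)


# Two cars: same fuel type, horsepower differing by 10 on a range of 20.
print(unit_similarity(("gas", 100.0), ("gas", 110.0),
                      ["simplematching", "euclidean"], [None, 20.0]))  # 0.875
```

Supplying the range explicitly corresponds to setting `RANGE`; when it is omitted, the observed range of the characteristic would be used instead.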

For each bootstrap sample, a set of vectors is formed by sampling with replacement from the `DATA` vectors. The `NDATASAMPLE` option specifies the number of vectors to take; by default this is the same as the number of vectors supplied by `DATA`. The `NTIMES` option specifies the number of bootstrap samples; the default is 100. The `SEED` option specifies the seed for the random numbers used to select the sample; the default of zero continues an existing sequence of random numbers or, if there is none, initializes the sequence using the system clock. `HBOOTSTRAP` performs a cluster analysis on the sampled vectors using the `HCLUSTER` directive, and obtains the clusters that it forms using the `HFCLUSTERS` procedure. The `CLIMIT` option can be used to specify a similarity limit, below which clusters are excluded.
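
The resampling step can be pictured with a short Python sketch. Note that it is the `DATA` vectors (the characteristics) that are resampled, not the units. The function name `bootstrap_vector_indices` and its arguments are hypothetical stand-ins for `NDATASAMPLE`, `NTIMES` and `SEED`. As a simplification, a seed of zero here just asks Python for a fresh seed, whereas Genstat would continue an existing random sequence if there is one.

```python
import random


def bootstrap_vector_indices(n_vectors, n_sample=None, ntimes=100, seed=0):
    """Yield `ntimes` lists of vector indices drawn with replacement.

    n_vectors : number of DATA vectors supplied
    n_sample  : vectors per sample (defaults to n_vectors, like NDATASAMPLE)
    seed      : 0 means "no fixed seed" in this sketch
    """
    rng = random.Random(seed if seed != 0 else None)
    k = n_vectors if n_sample is None else n_sample
    for _ in range(ntimes):
        # Sampling with replacement: the same vector may appear several times.
        yield [rng.randrange(n_vectors) for _ in range(k)]


# 22 characteristics, 3 bootstrap samples, fixed seed for reproducibility.
for sample in bootstrap_vector_indices(22, ntimes=3, seed=161647):
    print(sorted(sample))
```

Each list of indices would then select the vectors passed to the cluster analysis for that bootstrap sample.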

The `CLUSTERS` option can supply a pointer containing a list of clusters whose reliability is to be assessed. This would usually have been obtained previously, from a cluster analysis performed with all the `DATA` vectors. Alternatively, if `CLUSTERS` is set to a pointer whose number of values has not been defined, or to an undeclared data structure, it will be defined as a pointer containing one copy of every cluster that has occurred during the bootstrapping. Each cluster is represented as a variate containing the number of each unit in that cluster. (This number corresponds to the location of that unit in the `DATA` vectors.)
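
The second use of `CLUSTERS`, where every distinct cluster seen during the bootstrapping is collected, can be sketched in Python as follows. Each cluster is held as a sorted tuple of unit numbers, mirroring the variate of unit numbers described above. The helper `add_new_clusters` is hypothetical and only illustrates the bookkeeping.

```python
def add_new_clusters(known, sample_clusters):
    """Append clusters not seen before, keeping one copy of each.

    known           : list of clusters collected so far (sorted tuples)
    sample_clusters : clusters found in one bootstrap sample (any iterables
                      of unit numbers)
    """
    seen = set(known)
    for cluster in sample_clusters:
        key = tuple(sorted(cluster))  # unit order within a cluster is irrelevant
        if key not in seen:
            seen.add(key)
            known.append(key)
    return known


known = []
add_new_clusters(known, [{2, 1}, {1, 2, 3}])   # first bootstrap sample
add_new_clusters(known, [{1, 2}, {4, 5}])      # second bootstrap sample
print(known)  # [(1, 2), (1, 2, 3), (4, 5)]
```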

The `REPLICATION` option can save a variate containing the number of times each cluster has occurred during the bootstrapping. These replications can be used by the `DCLUSTERLABELS` procedure to label the clusters on a dendrogram.
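
To show what the replication counts mean, here is a minimal, self-contained Python sketch of the whole idea. It is not the procedure's implementation: it assumes numeric characteristics only, a range-scaled euclidean-type similarity, and a simple single-link merge in place of `HCLUSTER` and `HFCLUSTERS`. All function names are hypothetical.

```python
import random
from itertools import combinations


def similarity(data, cols, i, j):
    """Mean of 1 - ((x_i - x_j) / range)^2 over the sampled vectors (assumed form)."""
    total = 0.0
    for c in cols:
        col = data[c]
        r = (max(col) - min(col)) or 1.0
        total += 1.0 - ((col[i] - col[j]) / r) ** 2
    return total / len(cols)


def single_link_clusters(data, cols, climit=0.0):
    """Return every cluster formed by single-link merging, as frozensets of 1-based unit numbers."""
    n_units = len(data[0])
    pairs = sorted(((similarity(data, cols, i, j), i, j)
                    for i, j in combinations(range(n_units), 2)), reverse=True)
    groups = [frozenset([u]) for u in range(n_units)]
    found = set()
    for s, i, j in pairs:
        if s < climit:          # clusters below the limit are not recorded
            break
        gi = next(g for g in groups if i in g)
        gj = next(g for g in groups if j in g)
        if gi != gj:            # most similar pair spanning two groups: merge them
            merged = gi | gj
            groups = [g for g in groups if g != gi and g != gj] + [merged]
            found.add(frozenset(u + 1 for u in merged))
    return found


def replication(data, reference, ntimes=100, seed=1):
    """Count in how many bootstrap samples each reference cluster reappears."""
    rng = random.Random(seed)
    n_vec = len(data)
    counts = [0] * len(reference)
    for _ in range(ntimes):
        cols = [rng.randrange(n_vec) for _ in range(n_vec)]  # resample the vectors
        found = single_link_clusters(data, cols)
        for k, cluster in enumerate(reference):
            counts[k] += int(frozenset(cluster) in found)
    return counts


# Three vectors (one list per characteristic) measured on six units;
# units 1-3 and units 4-6 are clearly separated on every characteristic.
data = [[0.0, 1.0, 2.0, 10.0, 11.0, 12.0],
        [0.5, 0.0, 1.5, 9.0, 10.0, 9.5],
        [1.0, 0.2, 0.6, 8.0, 8.5, 9.0]]
print(replication(data, [(1, 2, 3), (4, 5, 6), (1, 4)], ntimes=20, seed=3))
# [20, 20, 0]
```

A cluster that reappears in nearly every sample, like the first two here, is well supported by the characteristics; one that rarely or never reappears depends on a particular subset of them.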

The clusters and their replications can be printed by setting option `PRINT=clusters`. The `UNITS` option can be set to a text or a variate, to provide textual labels or other numbers to use for the units of the clusters, instead of the numbers in the `CLUSTERS` variates. The other `PRINT` setting, `dendrograms`, prints the dendrogram of the cluster analysis from each bootstrap sample.

Options: `PRINT`, `METHOD`, `CLIMIT`, `UNITS`, `MINKOWSKI`, `CLUSTERS`, `REPLICATION`, `NDATASAMPLE`, `NTIMES`, `SEED`.

Parameters: `DATA`, `TEST`, `RANGE`.

### Action with `RESTRICT`

The `DATA` variates and factors must not be restricted.

### See also

Directive: `HCLUSTER`.

Procedures: `BOOTSTRAP`, `DCLUSTERLABELS`, `HFCLUSTERS`, `HPCLUSTERS`.

Commands for: Multivariate and cluster analysis.

### Example

    CAPTION 'HBOOTSTRAP example',\
            !t('Random classification forest for automobile data',\
            'from UCI Machine Learning Repository',\
            'http://archive.ics.uci.edu/ml/datasets/Automobile');\
            STYLE=meta,plain
    SPLOAD  [PRINT=*] '%gendir%/examples/Automobile.gsh'
    " select cars with wagon body style "
    SUBSET  [body_style.IN.'wagon'] make,\
            fuel_type,aspiration,number_doors,drive_wheels,\
            engine_location,engine_type,number_cylinders,fuel_system,\
            wheel_base,length,width,height,curb_weight,\
            engine_size,bore,stroke,compression_ratio,horsepower,\
            peak_rpm,city_mpg,highway_mpg,price
    " form labels for the cars from make and price "
    TXCONSTRUCT [TEXT=car] make,' ',price
    " cluster analysis using all the data variables "
    FSIMILARITY [PRINT=*; SIMILARITY=similarity; UNITS=car]\
            fuel_type,aspiration,number_doors,drive_wheels,\
            engine_location,engine_type,number_cylinders,fuel_system,\
            wheel_base,length,width,height,curb_weight,\
            engine_size,bore,stroke,compression_ratio,horsepower,\
            peak_rpm,city_mpg,highway_mpg,price;\
            TEST=8(simplematching),14(euclidean)
    HCLUSTER [METHOD=averagelink] similarity; AMALGAMATIONS=amalg;\
            PERMUTATION=perm
    " plot dendrogram "
    FRAME   3; XMLOWER=0.2; XMUPPER=0
    DDENDROGRAM [STYLE=average; ORDERING=given; DSIMILARITY=yes] amalg;\
            PERMUTATION=perm; LABELS=car; WINDOW=3
    " form the clusters in the dendrogram "
    HFCLUSTERS amalg; CLUSTERS=clusters
    " see how often these clusters occur in 100 bootstrap samples of data variables "
    HBOOTSTRAP [PRINT=clusters; METHOD=averagelink; NTIMES=100;\
            SEED=161647; CLUSTERS=clusters; REPLICATION=reps]\
            fuel_type,aspiration,number_doors,drive_wheels,\
            engine_location,engine_type,number_cylinders,fuel_system,\
            wheel_base,length,width,height,curb_weight,\
            engine_size,bore,stroke,compression_ratio,horsepower,\
            peak_rpm,city_mpg,highway_mpg,price;\
            TEST=8(simplematching),14(euclidean)
    " plot the numbers of occurrence on the dendrogram "
    DCLUSTERLABELS [WINDOW=3] #clusters; LABEL=#reps