Performs bootstrap analyses to assess the reliability of clusters from hierarchical cluster analysis (R.W. Payne).
Options
PRINT = string token | Controls printed output (clusters, dendrograms); default * i.e. none |
---|---|
METHOD = string token | Criterion for forming clusters (singlelink, nearestneighbour, completelink, furthestneighbour, averagelink, mediansort, groupaverage); default sing |
CLIMIT = scalar | Similarity value below which clusters are not recorded; default 0 |
UNITS = text or variate | Names to label the units of the clusters when they are printed; default * |
MINKOWSKI = scalar | Index t for use with TEST=minkowski |
CLUSTERS = pointer | Specifies or saves the clusters |
REPLICATION = variate | Saves the replication of the clusters in the bootstrap samples |
NDATASAMPLE = scalar | Number of DATA vectors to take in each sample; default takes the same number as supplied by the DATA parameter |
NTIMES = scalar | Number of times to resample; default 100 |
SEED = scalar | Seed for random number generator; default continue from previous generation or use system clock |
Parameters
DATA = variates or factors | The characteristics of the units to be clustered |
---|---|
TEST = string tokens | Test type, defining how each DATA variate or factor is treated in the calculation of the similarity between each unit (simplematching, jaccard, russellrao, dice, antidice, sneathsokal, rogerstanimoto, cityblock, manhattan, ecological, euclidean, pythagorean, minkowski, divergence, canberra, braycurtis, soergel); default * ignores that variate or factor |
RANGE = scalar | Range of possible values of each DATA variate or factor; if omitted, the observed range is taken |
Description
HBOOTSTRAP uses bootstrapping to assess the reliability of clusters formed in hierarchical cluster analyses. The characteristics of the units to be clustered are described in a list of variates and factors, specified by the DATA parameter. The TEST parameter defines how each one is to be used when calculating similarities, and the RANGE parameter can specify ranges of their values. These operate as in the FSIMILARITY directive, which is used to form the similarity matrix for each cluster analysis. The MINKOWSKI option specifies the index t for the Minkowski type of test.
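As a rough illustration of how a per-variable TEST setting and RANGE might combine into one similarity between two units, the Python sketch below averages one contribution per variable. It is not Genstat code. The contribution formulae used here (1 or 0 for simplematching, 1 - |x-y|/r for cityblock, 1 - ((x-y)/r)^2 for euclidean, equal weights) are an assumed Gower-type convention; the authoritative definitions are those documented for FSIMILARITY. The function name `similarity` and the example values are hypothetical.

```python
def similarity(unit_a, unit_b, tests, ranges):
    """Average the per-variable contributions to the similarity of two units.

    Assumed convention (not taken from this page): every tested variable
    has weight 1, and a test of None plays the role of the default *,
    which ignores that variable.
    """
    total = 0.0
    weight = 0.0
    for x, y, test, r in zip(unit_a, unit_b, tests, ranges):
        if test is None:
            continue  # variable takes no part in the similarity
        if test == "simplematching":
            contribution = 1.0 if x == y else 0.0
        elif test == "cityblock":
            contribution = 1.0 - abs(x - y) / r
        elif test == "euclidean":
            contribution = 1.0 - ((x - y) / r) ** 2
        else:
            raise ValueError(f"test type not covered by this sketch: {test}")
        total += contribution
        weight += 1.0
    if weight == 0.0:
        raise ValueError("no variable has a test type")
    return total / weight


# Two hypothetical units: two factors, one variate with range 20, one ignored variate.
tests = ("simplematching", "simplematching", "euclidean", None)
ranges = (None, None, 20.0, None)
s = similarity(("gas", "std", 100.0, 5.0), ("gas", "turbo", 110.0, 3.0), tests, ranges)
print(round(s, 4))  # 0.5833
```

Here the contributions are 1 (matching factor), 0 (differing factor) and 0.75 (variate), so the similarity is 1.75/3.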
For each bootstrap sample, a set of vectors is formed by sampling with replacement from the DATA vectors. The NDATASAMPLE option specifies the number of vectors to take; by default this is the same as the number of vectors supplied by DATA. The NTIMES option specifies the number of bootstrap samples; default 100. The SEED option specifies the seed for the random numbers used to select the sample; the default of zero continues an existing sequence of random numbers or, if there is none, initializes the sequence using the system clock. HBOOTSTRAP does a cluster analysis with those vectors using the HCLUSTER directive, and obtains the clusters that it forms using the HFCLUSTERS procedure. The CLIMIT option can be used to specify a similarity limit, below which any clusters will be excluded.
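The resampling step can be pictured with a short Python sketch (not Genstat code). Note that it is the DATA vectors, i.e. the variables, that are resampled, not the units. The function name and arguments are hypothetical stand-ins for NDATASAMPLE, NTIMES and SEED, and Python's generator is of course not the one Genstat uses, so the same seed will not reproduce Genstat's samples.

```python
import random


def bootstrap_variable_indices(n_vectors, n_times=100, n_data_sample=None, seed=None):
    """Return n_times lists of DATA-vector indices, each drawn with replacement."""
    rng = random.Random(seed)  # seed=None lets Python choose its own starting state
    # By default each sample has as many vectors as were supplied.
    size = n_vectors if n_data_sample is None else n_data_sample
    return [[rng.randrange(n_vectors) for _ in range(size)] for _ in range(n_times)]


# 22 variables, 100 bootstrap samples, fixed seed for repeatability.
samples = bootstrap_variable_indices(n_vectors=22, n_times=100, seed=161647)
print(len(samples), len(samples[0]))  # 100 22
```

Because sampling is with replacement, a given variable may appear several times in one sample and not at all in another; each sample then feeds its own similarity matrix and cluster analysis.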
The CLUSTERS option can supply a pointer containing a list of clusters whose reliability is to be assessed. This would usually have been obtained previously, from a cluster analysis performed with all the DATA vectors. Alternatively, if CLUSTERS is set to a pointer whose number of values has not been defined, or to an undeclared data structure, it will be defined as a pointer containing one copy of every distinct cluster that has occurred during the bootstrapping. Each cluster is represented as a variate, containing the number of each unit in that cluster. (This number corresponds to the location of that unit in the DATA vectors.)
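The representation of clusters as lists of unit numbers, and the effect of a similarity limit, can be sketched in Python as follows (not Genstat code). The sketch uses a naive single-link agglomeration on a small made-up similarity matrix; it is a stand-in for illustration only and does not reproduce HCLUSTER or HFCLUSTERS. In particular, whether the real procedures record the final cluster containing all the units is not something this sketch settles. All names and data are hypothetical.

```python
def single_link_clusters(sim, climit=0.0):
    """Return the clusters formed by single-link agglomeration.

    Each cluster is a sorted tuple of 1-based unit numbers (the unit's
    position in the data). Merging stops once the best available
    similarity falls below climit.
    """
    groups = [(i,) for i in range(len(sim))]
    found = []
    while len(groups) > 1:
        best = None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                # Single link: similarity of the closest pair of members.
                s = max(sim[i][j] for i in groups[a] for j in groups[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        s, a, b = best
        if s < climit:
            break
        merged = tuple(sorted(groups[a] + groups[b]))
        groups = [g for k, g in enumerate(groups) if k not in (a, b)] + [merged]
        found.append(tuple(u + 1 for u in merged))
    return found


def distinct_clusters(clusters_per_analysis):
    """Keep one copy of every cluster seen, in order of first appearance."""
    seen = []
    for clusters in clusters_per_analysis:
        for cluster in clusters:
            if cluster not in seen:
                seen.append(cluster)
    return seen


sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
all_found = single_link_clusters(sim)
limited = single_link_clusters(sim, climit=0.5)
print(all_found)  # [(1, 2), (3, 4), (1, 2, 3, 4)]
print(limited)    # [(1, 2), (3, 4)]
print(distinct_clusters([all_found, limited]))  # [(1, 2), (3, 4), (1, 2, 3, 4)]
```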
The REPLICATION option can save a variate containing the number of times each cluster has occurred during the bootstrapping. These replications can be used by the DCLUSTERLABELS procedure to label the clusters on a dendrogram.
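Counting the replications amounts to asking, for each cluster of interest, in how many bootstrap analyses exactly that set of units turned up. A minimal Python sketch of the counting (not Genstat code; the function name and the toy data are hypothetical):

```python
def replication(reference_clusters, clusters_per_sample):
    """Count, for each reference cluster, the bootstrap samples that contain it.

    Clusters are compared as sets of unit numbers, so the order in which
    the units are listed does not matter.
    """
    counts = []
    for ref in reference_clusters:
        key = frozenset(ref)
        counts.append(
            sum(
                1
                for found in clusters_per_sample
                if any(frozenset(cluster) == key for cluster in found)
            )
        )
    return counts


reference = [(1, 2), (3, 4), (1, 2, 3)]
clusters_per_sample = [
    [(1, 2), (3, 4)],     # clusters found in bootstrap sample 1
    [(2, 1), (1, 2, 3)],  # sample 2
    [(1, 3), (2, 4)],     # sample 3
]
print(replication(reference, clusters_per_sample))  # [2, 1, 1]
```

A cluster that recurs in most of the samples is well supported by the variables; one that appears rarely depends on a few particular variables being present.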
The clusters and their replications can be printed by setting option PRINT=clusters. The UNITS option can be set to a text or a variate, to provide textual labels or other numbers to use for the units of the clusters, instead of the numbers in the CLUSTERS variates. The other PRINT setting, dendrograms, prints the dendrogram of the cluster analysis from each bootstrap sample.
Options: PRINT, METHOD, CLIMIT, UNITS, MINKOWSKI, CLUSTERS, REPLICATION, NDATASAMPLE, NTIMES, SEED.
Parameters: DATA, TEST, RANGE.
Action with RESTRICT
The DATA variates and factors must not be restricted.
See also
Directive: HCLUSTER.
Procedures: BOOTSTRAP, DCLUSTERLABELS, HFCLUSTERS, HPCLUSTERS.
Commands for: Multivariate and cluster analysis.
Example
CAPTION     'HBOOTSTRAP example',\
            !t('Random classification forest for automobile data',\
            'from UCI Machine Learning Repository',\
            'http://archive.ics.uci.edu/ml/datasets/Automobile');\
            STYLE=meta,plain
SPLOAD      [PRINT=*] '%gendir%/examples/Automobile.gsh'
" select cars with wagon body style "
SUBSET      [body_style.IN.'wagon'] make,\
            fuel_type,aspiration,number_doors,drive_wheels,\
            engine_location,engine_type,number_cylinders,fuel_system,\
            wheel_base,length,width,height,curb_weight,\
            engine_size,bore,stroke,compression_ratio,horsepower,\
            peak_rpm,city_mpg,highway_mpg,price
" form labels for the cars from make and price "
TXCONSTRUCT [TEXT=car] make,' ',price
" cluster analysis using all the data variables "
FSIMILARITY [PRINT=*; SIMILARITY=similarity; UNITS=car]\
            fuel_type,aspiration,number_doors,drive_wheels,\
            engine_location,engine_type,number_cylinders,fuel_system,\
            wheel_base,length,width,height,curb_weight,\
            engine_size,bore,stroke,compression_ratio,horsepower,\
            peak_rpm,city_mpg,highway_mpg,price;\
            TEST=8(simplematching),14(euclidean)
HCLUSTER    [METHOD=averagelink] similarity; AMALGAMATIONS=amalg;\
            PERMUTATION=perm
" plot dendrogram "
FRAME       3; XMLOWER=0.2; XMUPPER=0
DDENDROGRAM [STYLE=average; ORDERING=given; DSIMILARITY=yes] amalg;\
            PERMUTATION=perm; LABELS=car; WINDOW=3
" form the clusters in the dendrogram "
HFCLUSTERS  amalg; CLUSTERS=clusters
" see how often these clusters occur in 100 bootstrap samples of data variables "
HBOOTSTRAP  [PRINT=clusters; METHOD=averagelink; NTIMES=100;\
            SEED=161647; CLUSTERS=clusters; REPLICATION=reps]\
            fuel_type,aspiration,number_doors,drive_wheels,\
            engine_location,engine_type,number_cylinders,fuel_system,\
            wheel_base,length,width,height,curb_weight,\
            engine_size,bore,stroke,compression_ratio,horsepower,\
            peak_rpm,city_mpg,highway_mpg,price;\
            TEST=8(simplematching),14(euclidean)
" plot the numbers of occurrence on the dendrogram "
DCLUSTERLABELS [WINDOW=3] #clusters; LABEL=#reps