Forms a similarity matrix or a between-group-elements similarity matrix, or prints a similarity matrix.
Options
PRINT = string token 
Printed output required (similarities, summary); default * i.e. no printing

STYLE = string token
Print percentage similarities in full or just the 10% digit (full, abbreviated); default full
METHOD = string token
Form similarity matrix or rectangular between-group-element similarity matrix (similarities, betweengroupsimilarities); default simi
SIMILARITY = matrix or symmetric matrix 
Input or output matrix of similarities; default * 
GROUPS = factor 
Grouping of units into two groups for between-group-element similarity matrix; default *
PERMUTATION = variate
Permutation of units (possibly from HCLUSTER) for order in which units of the similarity matrix are printed; default *
UNITS = text or variate 
Unit names to label the rows of the similarity matrix; default * 
MINKOWSKI = scalar 
Index t for use with TEST=minkowski 
Parameters
DATA = variates or factors 
The data values 

TEST = string tokens 
Test type, defining how each DATA variate or factor is treated in the calculation of the similarity between each unit (simplematching, jaccard, russellrao, dice, antidice, sneathsokal, rogerstanimoto, cityblock, manhattan, ecological, euclidean, pythagorean, minkowski, divergence, canberra, braycurtis, soergel); default * ignores that variate or factor
RANGE = scalars 
Range of possible values of each DATA variate or factor; if omitted, the observed range is taken 
Description
The FSIMILARITY directive forms similarity matrices, essentially using the method described by Gower (1971). The similarity coefficient that is calculated allows variables to be qualitative, quantitative or dichotomous, or mixtures of these types; values of some of the variables may be missing for some samples. The values of a similarity coefficient vary between zero and unity: two samples have a similarity of unity only when both have identical values for all variables; a value of zero occurs when the values for the two samples differ maximally for all variables.
You can form a symmetric matrix of similarities, or a rectangular matrix of similarities between the units in two groups. You can save either form of similarity matrix, using the SIMILARITY option. FSIMILARITY can also be used to print the symmetric matrix of similarities after it has formed it; alternatively, you can input an existing similarity matrix for printing, using the SIMILARITY option.
The DATA parameter specifies a list of variates or factors, all of which must be of the same length. If you want to print an existing similarity matrix, the DATA parameter (and the TEST and RANGE parameters) should be omitted, and the SIMILARITY option used to input the matrix concerned.
The TEST parameter specifies a list of strings, one for each variate or factor in the DATA parameter list, that define their “types”. If you want to exclude a variate or factor from contributing, you should specify an empty string (* or ''). Otherwise the similarity between units i and j is calculated as
∑_{k} { w_{k}(x_{ik}, x_{jk}) s_{k}(x_{ik}, x_{jk}) } / ∑_{k} w_{k}(x_{ik}, x_{jk})
where x_{ik} is the value of DATA variate or factor k in unit i, and the contribution functions s_{k} and weight functions w_{k} for each of the available types are defined in the tables below (for further details see Gower 1971, 1985).
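The formula above is a weighted mean of the per-variable contributions. The short Python sketch below only illustrates that arithmetic; it is not Genstat code and not FSIMILARITY's implementation. The function name `gower_similarity`, the list-based inputs and the `None` result for a zero total weight are assumptions made for the illustration.

```python
def gower_similarity(contributions, weights):
    """Weighted mean of contributions s_k using weights w_k for one pair of units."""
    total_weight = sum(weights)
    if total_weight == 0:
        # No variable carries any weight for this pair, so the ratio is
        # undefined; returning None is a choice made for this sketch only.
        return None
    return sum(w * s for s, w in zip(contributions, weights)) / total_weight


# Three variables: a full match (s=1, w=1), a partial match (s=0.5, w=1),
# and a variable that carries no weight for this pair (w=0).
print(gower_similarity([1.0, 0.5, 0.0], [1.0, 1.0, 0.0]))  # 0.75
```

A variable with weight 0 drops out of both the numerator and the denominator, which is why the result is 1.5 / 2 rather than 1.5 / 3.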
The first table contains the types appropriate for variates that are recording the presence or absence of a characteristic; these cannot be used with factors.
Type            Condition                         Contribution s_{k}  Weight w_{k}
Jaccard         x_{i} ≠ 0 and x_{j} ≠ 0           1                   1
                x_{i} = x_{j} = 0                 0                   0
                only one of x_{i} or x_{j} = 0    0                   1
RussellRao      x_{i} ≠ 0 and x_{j} ≠ 0           1                   1
                x_{i} = 0 or x_{j} = 0            0                   1
Dice            x_{i} ≠ 0 and x_{j} ≠ 0           1                   1
                x_{i} = x_{j} = 0                 0                   0
                only one of x_{i} or x_{j} = 0    0                   0.5
Antidice        x_{i} ≠ 0 and x_{j} ≠ 0           1                   1
                x_{i} = x_{j} = 0                 0                   0
                only one of x_{i} or x_{j} = 0    0                   2
SneathSokal     x_{i} ≠ 0 and x_{j} ≠ 0           1                   1
                x_{i} = x_{j} = 0                 1                   1
                only one of x_{i} or x_{j} = 0    0                   0.5
RogersTanimoto  x_{i} ≠ 0 and x_{j} ≠ 0           1                   1
                x_{i} = x_{j} = 0                 1                   1
                only one of x_{i} or x_{j} = 0    0                   2
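To see how the weights in the table above separate the presence/absence types, the following Python sketch (an illustration only, not Genstat code) applies each type to the same pair of units. The dictionary `DICHOTOMOUS`, the function name and the example data are hypothetical; each entry copies the (contribution, weight) pairs from the table, and the sketch treats every variable with the same type.

```python
# (contribution, weight) for: both present, both absent, exactly one present.
DICHOTOMOUS = {
    "jaccard":        ((1, 1), (0, 0), (0, 1)),
    "russellrao":     ((1, 1), (0, 1), (0, 1)),
    "dice":           ((1, 1), (0, 0), (0, 0.5)),
    "antidice":       ((1, 1), (0, 0), (0, 2)),
    "sneathsokal":    ((1, 1), (1, 1), (0, 0.5)),
    "rogerstanimoto": ((1, 1), (1, 1), (0, 2)),
}


def dichotomous_similarity(unit_i, unit_j, test):
    """Similarity of two presence/absence profiles, using one type throughout."""
    both_present, both_absent, one_present = DICHOTOMOUS[test]
    numerator = 0.0
    denominator = 0.0
    for x_i, x_j in zip(unit_i, unit_j):
        if x_i != 0 and x_j != 0:
            s, w = both_present
        elif x_i == 0 and x_j == 0:
            s, w = both_absent
        else:
            s, w = one_present
        numerator += w * s
        denominator += w
    # The denominator is zero if every variable has weight 0 (for example
    # two all-absent profiles under jaccard); this sketch does not handle that.
    return numerator / denominator


# Two shared presences, one shared absence, two mismatches.
unit_i = [1, 1, 0, 0, 1]
unit_j = [1, 0, 0, 1, 1]
for test in DICHOTOMOUS:
    print(test, round(dichotomous_similarity(unit_i, unit_j, test), 3))
```

With these data the types that ignore shared absences (Jaccard, Dice, Antidice) differ only in how heavily the two mismatches count, giving 0.5, 0.667 and 0.333 respectively.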
The simplematching type is appropriate for qualitative variables, which may be either variates or factors.
Type            Condition        Contribution s_{k}  Weight w_{k}
simplematching  x_{i} = x_{j}    1                   1
                x_{i} ≠ x_{j}    0                   1
The next table shows the types that can be used for quantitative variates (but not factors). In the definitions, r is the range of the variate, and t is the Minkowski index (defined by the MINKOWSKI option). Note, however, that BrayCurtis and Soergel should not be mixed with other types.
Type         Contribution s_{k}                             Weight w_{k}
cityblock    1 – |x_{i} – x_{j}| / r                        1
Manhattan    synonymous with cityblock
ecological   1 – |x_{i} – x_{j}| / r                        1, unless x_{i} = x_{j} = 0, then 0
Euclidean    1 – {(x_{i} – x_{j}) / r}^{2}                  1
Pythagorean  synonymous with Euclidean
Minkowski    1 – |x_{i} – x_{j}|^{t} / r^{t}                1
Divergence   1 – {(x_{i} – x_{j}) / (x_{i} + x_{j})}^{2}    1
Canberra     1 – |x_{i} – x_{j}| / (x_{i} + x_{j})          1
BrayCurtis   1 – |x_{i} – x_{j}| / (x_{i} + x_{j})          x_{i} + x_{j}
Soergel      1 – |x_{i} – x_{j}| / max(x_{i}, x_{j})        max(x_{i}, x_{j})
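The quantitative definitions above can be tried out numerically. The Python sketch below is an illustration only (not Genstat code); it reads each difference as the absolute value |x_{i} – x_{j}|, which is an interpretation of the table, and it assumes positive data values for the types that divide by x_{i} + x_{j} or by the maximum. The names `quantitative_term` and `combine` are hypothetical.

```python
def quantitative_term(x_i, x_j, test, r=None, t=None):
    """Return (contribution, weight) for one quantitative variate."""
    diff = abs(x_i - x_j)
    total = x_i + x_j
    if test in ("cityblock", "manhattan"):
        return 1 - diff / r, 1
    if test == "ecological":
        # Same contribution as cityblock, but a double zero gets weight 0.
        return 1 - diff / r, (0 if x_i == 0 and x_j == 0 else 1)
    if test in ("euclidean", "pythagorean"):
        return 1 - (diff / r) ** 2, 1
    if test == "minkowski":
        return 1 - diff ** t / r ** t, 1
    if test == "divergence":
        return 1 - (diff / total) ** 2, 1
    if test == "canberra":
        return 1 - diff / total, 1
    if test == "braycurtis":
        return 1 - diff / total, total
    if test == "soergel":
        return 1 - diff / max(x_i, x_j), max(x_i, x_j)
    raise ValueError(f"unknown test type: {test}")


def combine(terms):
    """Weighted mean of (contribution, weight) pairs."""
    return sum(w * s for s, w in terms) / sum(w for _, w in terms)


# One variate with values 2 and 6, range 10, Minkowski index 3.
for test in ("cityblock", "euclidean", "minkowski", "canberra"):
    s, w = quantitative_term(2, 6, test, r=10, t=3)
    print(test, round(s, 3), w)

# Two variates treated as braycurtis: the weights x_i + x_j turn the
# weighted mean into 1 - sum|x_i - x_j| / sum(x_i + x_j).
unit_i, unit_j = [2, 5], [6, 5]
terms = [quantitative_term(a, b, "braycurtis") for a, b in zip(unit_i, unit_j)]
print(round(combine(terms), 4))
```

The last calculation gives a hint of why BrayCurtis and Soergel are best kept apart from other types: their weights depend on the data values, so a variate with large values would dominate any unit-weight types mixed in with it.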
The RANGE parameter contains a list of scalars, one for each variate or factor in the DATA list. This allows you to check that the values of each variate lie within the given range. If any variate or factor fails the range check, FSIMILARITY gives an error diagnostic and terminates without forming the similarity matrix. The range is also used to standardize quantitative variates; this lets you impose a standard range, for example when variates are measured on commensurate scales. You can omit the RANGE parameter for all or any of the variates or factors by giving a missing identifier or a scalar with a missing value; Genstat then uses the observed range. If PRINT=summary, Genstat prints the name, the minimum value, and the range for each variate and factor.
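The effect of supplying a range, rather than letting the observed range be used, can be seen in a small Python sketch (an illustration only, not Genstat code). It assumes that the range check amounts to comparing the observed spread of the values with the supplied range; the text above does not spell out the exact rule, so treat that as a guess. The function names and the data are hypothetical.

```python
def resolve_range(values, supplied_range=None):
    """Return the range used for standardization, or raise if the check fails."""
    observed_range = max(values) - min(values)
    if supplied_range is None:
        # No range supplied: fall back to the observed range.
        return observed_range
    if observed_range > supplied_range:
        # Assumed form of the range check (see the note above).
        raise ValueError("values span more than the supplied range")
    return supplied_range


def cityblock_similarity(x_i, x_j, r):
    """City-block contribution for one variate with range r."""
    return 1 - abs(x_i - x_j) / r


values = [2.0, 4.0, 6.0]
observed = resolve_range(values)        # 4.0
imposed = resolve_range(values, 10.0)   # 10.0

# The units with values 2 and 6 are the extremes of the observed data.
print(cityblock_similarity(2.0, 6.0, observed))  # 0.0
print(cityblock_similarity(2.0, 6.0, imposed))   # 0.6
```

With the observed range, the two extreme units differ maximally and contribute a similarity of zero; with the wider imposed range the same pair contributes 0.6.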
The METHOD option controls what type of matrix is produced. METHOD=similarity, the default, gives a symmetric matrix of similarities amongst a single set of units. METHOD=betweengroupsimilarity gives a rectangular matrix of similarities between two sets of units. To form a rectangular matrix of similarities, you must also define the grouping of units by setting the GROUPS option (see below).
The PRINT, STYLE and PERMUTATION options govern the printing of a symmetric matrix of similarities. You can either form the similarity matrix within FSIMILARITY, or input it by the SIMILARITY option. To print the similarity matrix you should set option PRINT=similarity. The STYLE option has two settings, full (the default) or abbreviated. The similarity matrix printed in full style has its values displayed as percentages with one decimal place. If you put STYLE=abbreviated, the values of the similarity matrix are printed as single digits with no spaces, each digit being the tens digit of the similarity expressed as a percentage. In both cases, though, the actual similarities in the range 0–1 are stored in the similarity matrix itself. The PERMUTATION option lets you specify a variate with values corresponding to the order in which you want the rows of the similarity matrix to be printed. The reordering of the rows is most effective when the permutation arises from a hierarchical clustering and corresponds to the dendrogram order.
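The two print styles can be mimicked with a few lines of Python (an illustration only, not Genstat code or Genstat's actual output layout). The sketch assumes that the abbreviated digit is obtained by truncating the percentage to its tens digit, and it does not cover a similarity of exactly 1 (100%), which the description above does not address. The function name `format_similarity` is hypothetical.

```python
def format_similarity(similarity, style="full"):
    """Format one stored similarity (0-1 scale) for printing."""
    percent = 100 * similarity
    if style == "full":
        # Percentage with one decimal place.
        return f"{percent:.1f}"
    if style == "abbreviated":
        # Tens digit of the percentage (assumed to be truncated, not rounded).
        return str(int(percent // 10))
    raise ValueError(f"unknown style: {style}")


row = [0.875, 0.5, 0.125]
print(" ".join(format_similarity(s) for s in row))                # 87.5 50.0 12.5
print("".join(format_similarity(s, "abbreviated") for s in row))  # 851
```

Only the printed form changes; the list `row` still holds the similarities on the 0 to 1 scale, as the stored matrix does.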
You use the GROUPS option to specify a partition of the units into two groups, by giving a factor with two levels. The units with level 1 of the factor correspond to the rows of the matrix, while the units with level 2 correspond to the columns.
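The layout of the rectangular matrix can be pictured as a block cut out of the full symmetric matrix. The Python sketch below is an illustration only (not Genstat code, and not a claim about how FSIMILARITY computes the result internally); the function name `between_group_block` and the example values are hypothetical.

```python
def between_group_block(similarity, groups):
    """Rows are the units with level 1, columns the units with level 2."""
    rows = [unit for unit, level in enumerate(groups) if level == 1]
    cols = [unit for unit, level in enumerate(groups) if level == 2]
    return [[similarity[i][j] for j in cols] for i in rows]


# Symmetric similarities among four units (made-up values).
similarity = [
    [1.0, 0.8, 0.3, 0.5],
    [0.8, 1.0, 0.4, 0.6],
    [0.3, 0.4, 1.0, 0.9],
    [0.5, 0.6, 0.9, 1.0],
]
groups = [1, 2, 1, 2]  # levels of the two-level grouping factor

for row in between_group_block(similarity, groups):
    print(row)
# [0.8, 0.5]
# [0.4, 0.9]
```

Here units 1 and 3 (level 1) label the rows and units 2 and 4 (level 2) label the columns, so the block holds only between-group similarities.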
The UNITS option lets you label the rows of the output similarity matrix if the variates of the DATA parameter do not have any unit labels, or if you want to use different labels from those labelling the units of the variates. This labelling also applies to the rows and columns of a matrix of similarities between group elements.
Options: PRINT, STYLE, METHOD, SIMILARITY, GROUPS, PERMUTATION, UNITS, MINKOWSKI.
Parameters: DATA, TEST, RANGE.
Action with RESTRICT
If any of the DATA variates or factors is restricted, or if the factor in the GROUPS option is restricted, then that restriction is applied to all the variates or factors. If more than one is restricted, then the restrictions must all be to the same set of units. The dimension of the resulting symmetric matrix of similarities is taken from the number of units that contribute to the similarity matrix.
References
Gower, J.C. (1971). A general coefficient of similarity and some of its properties. Biometrics, 27, 857–871.
Gower, J.C. (1985). Measures of similarity, dissimilarity and distance. In: Encyclopedia of Statistical Sciences, Volume 5, 397–405.
See also
Directives: CLUSTER, HCLUSTER, PCO, HREDUCE.
Procedures: ECANOSIM, HBOOTSTRAP, MANTEL, MASCLUSTER.
Commands for: Calculations and manipulation, Multivariate and cluster analysis.
Example
" Genstat example HCLU1: Cluster analysis

  Data from 'Observers Book of Automobiles', 1986
  16 Italian cars and 10 measurements:

   1. engine capacity          c.c.    CC
   2. number of cylinders              NCyl
   3. fuel tank                litres  Tank
   4. unladen weight           kg      Wt
   5. length                   cm      Length
   6. width                    cm      Width
   7. height                   cm      Ht
   8. wheelbase                cm      Wbase
   9. top speed                kph     TSpeed
  10. time to 100kph           secs    StSt
  11. carburettor/inj/diesel   1/2/3   Carb
  12. front/rear wheel drive   1/2     Drive
"
TEXT [VALUES=Estate,'Arna1.5','Alfa2.5',Mondialqc,Testarossa,Croma,\
  Panda,Regatta,Regattad,Uno,X19,Contach,Delta,Thema,Y10,Spider] Cars
POINTER [VALUES=CC,NCyl,Tank,Wt,Length,Width,Ht,WBase,TSpeed,StSt,\
  Carb,Drive] Vars
" Read the data - measurements and car names - from the file
  'HCLU1.DAT', and then display it."
OPEN '%gendir%/examples/HCLU1.DAT'; CHANNEL=cardat
READ [CHANNEL=cardat] Vars[]
CLOSE cardat
" Treat the number of cylinders, data[2], differently to the
  continuous measurements."
HLIST [UNITS=Cars] \
  Vars[]; TEST=4(cityblock,euclidean),2(cityblock,simplematching)
" Form a hierarchical clustering of the cars, using the
  single linkage method."
SYMMETRIC [ROWS=Cars] CarSim
FSIMILARITY [SIMILARITY=CarSim]\
  Vars[]; TEST=4(cityblock,euclidean),2(cityblock,simplematching)
HCLUSTER [PRINT=amalgamations; METHOD=single] CarSim
" Use the average-linkage method."
HCLUSTER [PRINT=dendrogram; METHOD=average] CarSim;\
  AMALGAMATIONS=Am; PERMUTATION=Perm
" Display a high-resolution dendrogram."
DDENDROGRAM [ORDERING=given] DATA=Am; PERMUTATION=Perm; LABELS=Cars;\
  TITLE='Italian cars clustered by average linkage'