Estimates the weights for self-organizing maps (R.W. Payne).

### Options

`PRINT` = string tokens | Controls output (`weights`, `errors`, `monitoring`, `report`); default `weig`, `repo`
---|---
`PLOT` = string token | Controls what to plot (`fit`, `totalerror`); default `fit`
`DMETHOD` = string token | Method for calculating the distances of data points from the nodes (`euclidean`, `cityblock`); default `eucl`
`WMETHOD` = string token | Method for calculating the contribution of a data point to each node when revising the weights (`gaussian`, `neighbour`); default `gaus`
`ALPHA` = scalar or variate | Initial alpha value for each set of iterations; default `!(1, 0.1)`
`SIGMA` = scalar or variate | Initial sigma value for each set of iterations when `WMETHOD=gaussian`; default `!(1, 0.1)` multiplied by the maximum distance between nodes
`THRESHOLD` = scalar or variate | Initial distance threshold for each set of iterations when `WMETHOD=neighbour`; default `!(0.5, 0.1)` multiplied by the maximum distance between nodes
`NCYCLE` = scalar or variate | Number of cycles in each set of iterations; default 500
`NSTOP` = scalar | Number of consecutive cycles with no changes required for convergence; default 10

### Parameters

`SOM` = pointers | Save the information about each map
---|---
`DATA` = matrices or pointers | Data values for training each map
`ERRORS` = matrices | Reconstruction errors at the nodes of each map
`FITROWS` = factors | Save the positions of the rows allocated to the data points
`FITCOLUMNS` = factors | Save the positions of the columns allocated to the data points
`Y` = variates | Save y-values used to plot the data points
`X` = variates | Save x-values used to plot the data points
`PEN` = scalars, variates or factors | Pens used to plot the maps
`SEED` = scalars | Seed for the random numbers used to initialize the weights in each map

### Description

A self-organizing map is a two-dimensional grid of nodes, used to classify vectors of observations on *p* variables. Each node is characterized by a vector of *p* weights (one for each variable).

Before estimating the weights, you first need to declare a SOM structure to store the map. The `SOM` procedure, which does this, defines the row and column positions of the nodes on the grid. It also stores the names of the weight variables and information about how distances are to be measured on the grid and how the weights should be adjusted during their estimation. The SOM structure is then input to `SOMESTIMATE` by the `SOM` parameter.

The training dataset used to estimate the weights is specified by the `DATA` parameter, either as a matrix with *n* rows and *p* columns (where *n* is the number of observations in the training set) or as a pointer containing *p* variates each with *n* units. `SOMESTIMATE` gives a warning if the column names of a `DATA` matrix or the names of the variates in a `DATA` pointer differ from the names stored for the weight variables in the SOM structure.

The weights are estimated by a sequence of iterations, which are performed by the `SOMADJUST` procedure. In an iteration, the training observations are taken in turn. Each observation *i* is assessed to find its closest node. The method to use to measure distance on the map will have been specified, by the `DMETHOD` option of `SOM`, and stored with the SOM structure when it was declared. However, `SOMESTIMATE` also has a `DMETHOD` option in case you want to override the stored setting. The default setting for the `DMETHOD` option of `SOM` is `euclidean`. If `X_i` is a variate containing the values of the variables for observation *i* and `W_j` is the variate of weights at node *j*, the distance is then given by

`d_ij = SQRT(SUM((X_i - W_j)**2))`

The alternative setting, `cityblock`, calculates the distance as

`d_ij = SUM(ABS(X_i - W_j))`
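The two distance settings and the search for the closest node can be illustrated with a short Python sketch. This is not the Genstat implementation: the function names and the toy values are hypothetical, and the weights are assumed to be held as a list of per-node vectors.

```python
import math

def euclidean(x, w):
    """d_ij = SQRT(SUM((X_i - W_j)**2))"""
    return math.sqrt(sum((xv - wv) ** 2 for xv, wv in zip(x, w)))

def cityblock(x, w):
    """d_ij = SUM(ABS(X_i - W_j))"""
    return sum(abs(xv - wv) for xv, wv in zip(x, w))

def closest_node(x, weights, distance=euclidean):
    """Return the index k of the node whose weight vector is nearest to x."""
    return min(range(len(weights)), key=lambda j: distance(x, weights[j]))

# Hypothetical toy values: three nodes, two variables.
weights = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]
x = [2.5, 2.5]
k_euclidean = closest_node(x, weights, euclidean)
k_cityblock = closest_node(x, weights, cityblock)
```

Passing the distance function as an argument mirrors the way `DMETHOD` switches between the two measures without changing the rest of the iteration.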

Once the closest node, *k*, has been found, the weights at that node and other nodes are adjusted. The method to use will have been specified when the SOM structure was declared, by the `WMETHOD` option of `SOM`. However, `SOMESTIMATE` again has its own `WMETHOD` option, which you can use to override the stored setting. The default setting for the `WMETHOD` option of `SOM` is `gaussian`. This adjusts the weights `W_j` at every node *j* to become

`W_j + alpha * EXP( -0.5 * (d_jk / sigma)**2) * (X_i - W_j)`

where `d_jk` is the distance between nodes *j* and *k*. With the alternative setting, `neighbour`, the weights at node *j* are adjusted to become

`W_j + alpha * (X_i - W_j)`

but only if `d_jk` is less than a threshold `r`.
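The two adjustment rules can be sketched in Python as follows. This is an illustration only, not the code of `SOMADJUST`: the function names are hypothetical, and `grid_dist[j][k]` is assumed to hold the distance `d_jk` between nodes *j* and *k* on the grid.

```python
import math

def update_gaussian(weights, grid_dist, k, x, alpha, sigma):
    """Apply W_j + alpha * EXP(-0.5 * (d_jk / sigma)**2) * (X_i - W_j) to every node j."""
    new_weights = []
    for j, w in enumerate(weights):
        h = alpha * math.exp(-0.5 * (grid_dist[j][k] / sigma) ** 2)
        new_weights.append([wv + h * (xv - wv) for wv, xv in zip(w, x)])
    return new_weights

def update_neighbour(weights, grid_dist, k, x, alpha, r):
    """Apply W_j + alpha * (X_i - W_j), but only to nodes with d_jk < r."""
    new_weights = []
    for j, w in enumerate(weights):
        if grid_dist[j][k] < r:
            new_weights.append([wv + alpha * (xv - wv) for wv, xv in zip(w, x)])
        else:
            new_weights.append(list(w))  # outside the threshold: unchanged
    return new_weights

# Hypothetical toy map: three nodes in a row, one variable, closest node k = 0.
grid_dist = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
weights = [[0.0], [0.0], [0.0]]
gauss = update_gaussian(weights, grid_dist, 0, [1.0], alpha=0.5, sigma=1.0)
neigh = update_neighbour(weights, grid_dist, 0, [1.0], alpha=0.5, r=1.5)
```

With the Gaussian rule every node moves towards the observation, by an amount that shrinks with its grid distance from node *k*; with the neighbour rule the nodes inside the threshold all move by the same fraction `alpha` and the others stay put.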

The values of `alpha`, `sigma` and `r` change at each iteration. By default, `SOMESTIMATE` runs two sets of iterations. At the start of the first set, the parameters have initial values

`alpha = 1`

`sigma = dmax`

`r = dmax / 2`

where `dmax` is the maximum distance between any two nodes in the network. At the end of the first set, they have final values

`alpha = 0.1`

`sigma = dmax / 10`

`r = dmax / 10`

There are 500 iterations in the first set, and the parameters decrease in equal steps from their initial to their final values. There are also 500 cycles in the second set of iterations, and the parameters now decrease in equal steps to final values

`alpha = 0`

`sigma = 0`

`r = dmin`

where `dmin` is the minimum distance between any two nodes in the network. If `dmax/10` is less than `dmin`, then the value of `r` at the end of the first set will be `dmin` too.
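The equal-step decrease over the first set of iterations can be sketched in Python. The helper names are hypothetical, and one detail is an assumption rather than something stated here: the value at cycle `c` (counting from 0) is taken to be `start + (end - start) * c / ncycle`, so that the final value is the point at which the next set would begin.

```python
def ramp(start, end, ncycle):
    """Values for cycles 0..ncycle-1, moving in equal steps from start towards end."""
    return [start + (end - start) * c / ncycle for c in range(ncycle)]

def first_set(dmax, dmin, ncycle=500):
    """Default first set: alpha 1 -> 0.1, sigma dmax -> dmax/10, r dmax/2 -> dmax/10."""
    r_end = max(dmax / 10, dmin)  # r is not allowed to finish below dmin
    return {
        "alpha": ramp(1.0, 0.1, ncycle),
        "sigma": ramp(dmax, dmax / 10, ncycle),
        "r": ramp(dmax / 2, r_end, ncycle),
    }

# Hypothetical grids: dmax = 10 with dmin = 1, and a smaller one where dmax/10 < dmin.
schedule = first_set(dmax=10.0, dmin=1.0)
small = first_set(dmax=5.0, dmin=1.0)
```

In the second call `dmax/10` is 0.5, which is below `dmin`, so the threshold `r` heads for `dmin` instead.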

You can define your own sequence of iterations using the `ALPHA`, `SIGMA`, `THRESHOLD` and `NCYCLE` options (where `SIGMA` is relevant only when `WMETHOD=gaussian`, and `THRESHOLD` only when `WMETHOD=neighbour`). Setting all the relevant options to scalars defines a single set of iterations where the parameters decrease from initial values set by the options to the final values specified above. Alternatively, you can set `ALPHA` and either `SIGMA` or `THRESHOLD` to variates to specify initial values for several sets of iterations. `NCYCLE` can be set to a scalar if all the sets are to contain the same number of iterations, or to a variate of the same length as `ALPHA` if you want each set to contain a different number.

The weights are initialized to have random positions within the plane of the first two principal components for the `DATA` matrix. The `SEED` parameter supplies a seed for the random numbers used to define the positions. The default value of zero initializes the random number generator automatically if this is the first time that it has been used in the current job, or continues the existing sequence of random numbers.

By default `SOMESTIMATE` will stop the estimation process if there are more than ten successive iterations in which no observation changes its closest node. Different numbers of successive iterations with no changes can be specified using the `NSTOP` option.
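The stopping rule can be sketched in Python as a counter of consecutive iterations without changes. This is a hypothetical illustration: `one_cycle` stands for a function that runs one iteration and returns how many observations changed their closest node, and the sketch assumes that the process stops as soon as `nstop` consecutive iterations show no changes, which is how the `NSTOP` option is defined.

```python
def run_until_stable(one_cycle, max_cycles, nstop=10):
    """Run iterations until nstop consecutive ones produce no changes.

    Returns the number of iterations that were actually performed.
    """
    unchanged = 0
    for cycle in range(1, max_cycles + 1):
        if one_cycle() == 0:
            unchanged += 1
            if unchanged >= nstop:
                return cycle
        else:
            unchanged = 0  # any change restarts the count
    return max_cycles

# Hypothetical record of changes per iteration, replayed in place of real training.
changes = iter([3, 1, 0, 0, 2, 0, 0, 0, 0, 0])
cycles_used = run_until_stable(lambda: next(changes), max_cycles=10, nstop=3)
```

The count restarts whenever an observation moves to a different node, so isolated quiet iterations early in the estimation do not end it.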

Printed output is controlled by the `PRINT` option, with settings:

`weights` | to print the weights at each node of the map;
---|---
`errors` | to print the reconstruction errors at each node of the map;
`monitoring` | to provide monitoring about each iteration; and
`report` | to print a report at the end of the estimation process.

By default `PRINT=weights,report`.

The `PLOT` option controls which plots are produced, with settings:

`fit` | for a plot showing how the data observations are allocated to the nodes of the map; and
---|---
`totalerror` | for a plot showing how the total reconstruction error changes at each iteration.

By default, the map is plotted. The `PEN` parameter can be used to define the pen or pens to be used to plot the points on the map. If `PEN` is set to a scalar, the same pen will be used for every point, so the plot will show only the density of points around the map. Alternatively, you can supply a variate or factor to distinguish different groups of observations.

The `ERRORS` parameter can save a matrix with the reconstruction error at each node of the map. The `Y` and `X` parameters can save the coordinates used to plot the points on the map. These are formed by adding a small amount of random variation to the row and column of the nodes, to ensure that points allocated to the same node are not all plotted in the same position.
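The way the plotting coordinates are formed can be illustrated in Python. The size and distribution of the random variation are not stated here, so the uniform jitter and the `spread` argument below are assumptions, as are the function name and the toy allocations.

```python
import random

def jitter_positions(rows, cols, spread=0.3, seed=None):
    """Return (y, x) plotting coordinates: node row and column plus a small random offset."""
    rng = random.Random(seed)
    y = [r + rng.uniform(-spread, spread) for r in rows]
    x = [c + rng.uniform(-spread, spread) for c in cols]
    return y, x

# Hypothetical allocations: three observations on node (1, 2), one on node (3, 1).
fit_rows = [1, 1, 1, 3]
fit_cols = [2, 2, 2, 1]
y, x = jitter_positions(fit_rows, fit_cols, seed=1)
```

Observations that share a node end up scattered around its grid position, so the density of points at each node stays visible in the plot.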

Options: `PRINT`, `PLOT`, `DMETHOD`, `WMETHOD`, `ALPHA`, `SIGMA`, `THRESHOLD`, `NCYCLE`, `NSTOP`.

Parameters: `SOM`, `DATA`, `ERRORS`, `FITROWS`, `FITCOLUMNS`, `Y`, `X`, `PEN`, `SEED`.

### Method

The individual iterations involved in the estimation are carried out by the `SOMADJUST` procedure.

### Action with `RESTRICT`

`SOMESTIMATE` takes account of any restrictions defined on the `DATA` variates.

### See also

Procedures: `SOM`, `SOMADJUST`, `SOMDESCRIBE`, `SOMIDENTIFY`, `SOMPREDICT`.

Commands for: Data mining.

### Example

CAPTION 'SOMESTIMATE example',!t('Fisher''s Iris Data'); STYLE=meta,plain
SOM     Som; VARIABLENAMES=!t(Sepal_L,Sepal_W,Petal_L,Petal_W)
MATRIX  [ROWS=150; COLUMNS=!t(Sepal_L,Sepal_W,Petal_L,Petal_W)] Measures
READ    Measures
5.1 3.5 1.4 0.2  4.9 3.0 1.4 0.2  4.7 3.2 1.3 0.2  4.6 3.1 1.5 0.2  5.0 3.6 1.4 0.2
5.4 3.9 1.7 0.4  4.6 3.4 1.4 0.3  5.0 3.4 1.5 0.2  4.4 2.9 1.4 0.2  4.9 3.1 1.5 0.1
5.4 3.7 1.5 0.2  4.8 3.4 1.6 0.2  4.8 3.0 1.4 0.1  4.3 3.0 1.1 0.1  5.8 4.0 1.2 0.2
5.7 4.4 1.5 0.4  5.4 3.9 1.3 0.4  5.1 3.5 1.4 0.3  5.7 3.8 1.7 0.3  5.1 3.8 1.5 0.3
5.4 3.4 1.7 0.2  5.1 3.7 1.5 0.4  4.6 3.6 1.0 0.2  5.1 3.3 1.7 0.5  4.8 3.4 1.9 0.2
5.0 3.0 1.6 0.2  5.0 3.4 1.6 0.4  5.2 3.5 1.5 0.2  5.2 3.4 1.4 0.2  4.7 3.2 1.6 0.2
4.8 3.1 1.6 0.2  5.4 3.4 1.5 0.4  5.2 4.1 1.5 0.1  5.5 4.2 1.4 0.2  4.9 3.1 1.5 0.2
5.0 3.2 1.2 0.2  5.5 3.5 1.3 0.2  4.9 3.6 1.4 0.1  4.4 3.0 1.3 0.2  5.1 3.4 1.5 0.2
5.0 3.5 1.3 0.3  4.5 2.3 1.3 0.3  4.4 3.2 1.3 0.2  5.0 3.5 1.6 0.6  5.1 3.8 1.9 0.4
4.8 3.0 1.4 0.3  5.1 3.8 1.6 0.2  4.6 3.2 1.4 0.2  5.3 3.7 1.5 0.2  5.0 3.3 1.4 0.2
7.0 3.2 4.7 1.4  6.4 3.2 4.5 1.5  6.9 3.1 4.9 1.5  5.5 2.3 4.0 1.3  6.5 2.8 4.6 1.5
5.7 2.8 4.5 1.3  6.3 3.3 4.7 1.6  4.9 2.4 3.3 1.0  6.6 2.9 4.6 1.3  5.2 2.7 3.9 1.4
5.0 2.0 3.5 1.0  5.9 3.0 4.2 1.5  6.0 2.2 4.0 1.0  6.1 2.9 4.7 1.4  5.6 2.9 3.6 1.3
6.7 3.1 4.4 1.4  5.6 3.0 4.5 1.5  5.8 2.7 4.1 1.0  6.2 2.2 4.5 1.5  5.6 2.5 3.9 1.1
5.9 3.2 4.8 1.8  6.1 2.8 4.0 1.3  6.3 2.5 4.9 1.5  6.1 2.8 4.7 1.2  6.4 2.9 4.3 1.3
6.6 3.0 4.4 1.4  6.8 2.8 4.8 1.4  6.7 3.0 5.0 1.7  6.0 2.9 4.5 1.5  5.7 2.6 3.5 1.0
5.5 2.4 3.8 1.1  5.5 2.4 3.7 1.0  5.8 2.7 3.9 1.2  6.0 2.7 5.1 1.6  5.4 3.0 4.5 1.5
6.0 3.4 4.5 1.6  6.7 3.1 4.7 1.5  6.3 2.3 4.4 1.3  5.6 3.0 4.1 1.3  5.5 2.5 4.0 1.3
5.5 2.6 4.4 1.2  6.1 3.0 4.6 1.4  5.8 2.6 4.0 1.2  5.0 2.3 3.3 1.0  5.6 2.7 4.2 1.3
5.7 3.0 4.2 1.2  5.7 2.9 4.2 1.3  6.2 2.9 4.3 1.3  5.1 2.5 3.0 1.1  5.7 2.8 4.1 1.3
6.3 3.3 6.0 2.5  5.8 2.7 5.1 1.9  7.1 3.0 5.9 2.1  6.3 2.9 5.6 1.8  6.5 3.0 5.8 2.2
7.6 3.0 6.6 2.1  4.9 2.5 4.5 1.7  7.3 2.9 6.3 1.8  6.7 2.5 5.8 1.8  7.2 3.6 6.1 2.5
6.5 3.2 5.1 2.0  6.4 2.7 5.3 1.9  6.8 3.0 5.5 2.1  5.7 2.5 5.0 2.0  5.8 2.8 5.1 2.4
6.4 3.2 5.3 2.3  6.5 3.0 5.5 1.8  7.7 3.8 6.7 2.2  7.7 2.6 6.9 2.3  6.0 2.2 5.0 1.5
6.9 3.2 5.7 2.3  5.6 2.8 4.9 2.0  7.7 2.8 6.7 2.0  6.3 2.7 4.9 1.8  6.7 3.3 5.7 2.1
7.2 3.2 6.0 1.8  6.2 2.8 4.8 1.8  6.1 3.0 4.9 1.8  6.4 2.8 5.6 2.1  7.2 3.0 5.8 1.6
7.4 2.8 6.1 1.9  7.9 3.8 6.4 2.0  6.4 2.8 5.6 2.2  6.3 2.8 5.1 1.5  6.1 2.6 5.6 1.4
7.7 3.0 6.1 2.3  6.3 3.4 5.6 2.4  6.4 3.1 5.5 1.8  6.0 3.0 4.8 1.8  6.9 3.1 5.4 2.1
6.7 3.1 5.6 2.4  6.9 3.1 5.1 2.3  5.8 2.7 5.1 1.9  6.8 3.2 5.9 2.3  6.7 3.3 5.7 2.5
6.7 3.0 5.2 2.3  6.3 2.5 5.0 1.9  6.5 3.0 5.2 2.0  6.2 3.4 5.4 2.3  5.9 3.0 5.1 1.8 :
FACTOR  [NVALUES=150; LABELS=!t(Setosa,Versicolor,Virginica);\
        VALUES=50(1,2,3)] Species
SOMESTIMATE [PRINT=weights,errors,report; PLOT=fit,totalerror;\
        NCYCLE=!(100,200); SIGMA=!(5,1)] Som; DATA=Measures;\
        PEN=Species; SEED=419749