Fits a support vector machine (D. B. Baird).
Options
PRINT = string tokens | Printed output from the analysis (summary, predictions, allocations, debug); default summ, alloc |
---|---|
SVMTYPE = string token | Type of support vector machine to fit (svc, svr, nusvc, nusvr, lsvc, lsvr, lcs, svm1); default svc |
KERNEL = string token | Type of kernel to use (linear, polynomial, radialbasis, sigmoid); default radi |
PENALTY = scalar or variate | Penalty or cost for points on the wrong side of the boundary; default 1 |
GAMMA = scalar or variate | Gamma parameter for types with non-linear kernels; default 1 |
NU = scalar or variate | Nu parameter for types nusvc, nusvr and svm1; default 0.5 |
EPSILON = scalar or variate | Epsilon parameter for types svr and lsvr; default 0.1 |
BIAS = scalar | Bias for allocations to groups for types lsvc and lsvr; default -1, i.e. no bias |
DEGREE = scalar | Degree for polynomial kernel; default 3 |
CONSTANTVALUE = scalar | Constant for polynomial or sigmoid kernel; default 0 |
LOWER = scalar or variate | Lower limit for scaling data variates; default -1 |
UPPER = scalar or variate | Upper limit for scaling data variates; default 1 |
SCALING = string token | Type of scaling to use (none, uniform, given); default unif |
NOSHRINK = string token | Whether to suppress the shrinkage of attributes to exclude unused ones (no, yes); default no |
OPTMETHOD = string token | Whether to optimize probabilities or allocations (allocations, probabilities); default allo |
REGULARIZATIONMETHOD = string token | Regularization method for SVMTYPE = lsvc or lsvr (l1, l2); default l2 |
LOSSMETHOD = string token | Loss method for SVMTYPE = lsvc or lsvr (logistic, l1, l2); default logi |
DUALMETHOD = string token | Whether to use the dual algorithm for SVMTYPE = lsvc or lsvr (yes, no); default no |
NCROSSVALIDATIONGROUPS = scalar | Number of groups for cross-validation; default 10 |
SEED = scalar | Seed for random number generation; default 0 |
TOLERANCE = scalar | Tolerance for termination criterion; default 0.001 |
WORKSPACE = scalar | Size of workspace needed for data; default is to calculate this from the number of observations and variates |
Parameters
Y = factors or variates | Define groupings for the units in each training set, or the y-variate to be predicted via regression, with missing values in the units to be allocated or predicted |
---|---|
X = pointers | Each pointer contains a set of explanatory variates or factors |
WEIGHTS = variates | Weights to multiply penalties for each group when SVMTYPE = svc, nusvc, lsvc or lcs |
PREDICTIONS = factors or variates | Saves allocations to groups or predictions from regression |
ERRORRATE = scalars, variates or matrices | Saves the error rate for the combinations of parameters specified for the support vector machine |
OPTPENALTY = scalars | Saves the optimal value of the penalty parameter |
OPTGAMMA = scalars | Saves the optimal value of the gamma parameter |
OPTNU = scalars | Saves the optimal value of the nu parameter |
OPTEPSILON = scalars | Saves the optimal value of the epsilon parameter |
OPTERRORRATE = scalars | Saves the minimum error rate |
SCALE = texts or pointers | Saves the scaling used for the X variates, in a file if a text is given, or otherwise in a pointer to a pair of variates |
SAVEFILE = texts | File in which to save the model, for use by SVMPREDICT |
Description
SVMFIT fits a support vector machine (Cortes & Vapnik 1995), which defines multivariate boundaries to separate groups, or predict values. It provides a Genstat interface to the libraries LIBSVM (Chang & Lin 2001) and LIBLINEAR (Fan et al. 2008), which are made available subject to the conditions listed in the Method section.
Unlike linear discriminant analysis, a support vector machine assumes no statistical model for the distribution of individuals within a group, so the method is less affected by outliers. It chooses boundaries to maximize the separation between groups. The name arises because a small set of data points defines the boundaries, and these points are known as the support vectors. If individuals lie on the wrong side of the boundary, their distance from the boundary, multiplied by a penalty, is added to the separation criterion.
The type of support vector machine to fit is specified by the SVMTYPE option, with settings:
svc | a multi-class support vector classifier with a range of kernels for discriminating between groups; |
---|---|
svr | support vector regression with a range of kernels for predicting the values of a y-variate as in a regression; |
nusvc | Nu classification: a multi-class support vector classifier with a range of kernels for discriminating between groups, with a parameter NU that controls the fraction of support vectors used; |
nusvr | Nu regression: support vector regression with a range of kernels for predicting the values of a y-variate as in a regression, with a parameter NU that controls the fraction of support vectors used; |
lsvc | Fast linear classification: a fast regularized linear support vector classifier for discriminating between groups; |
lsvr | Fast linear regression: a fast regularized linear support vector regression for predicting the values of a y-variate as in a regression; |
lcs | a fast linear support vector machine for discriminating between groups using the approach of Crammer & Singer (2000), where a direct method for training multi-class predictors is used, rather than dividing the multi-class classification into a set of binary classifications; and |
svm1 | Consistent group SVM: a support vector machine which attempts to identify a consistent group of observations. |
The shape of the boundary is controlled by the KERNEL option, which specifies the metric used to measure distance between multi-dimensional points $u$ and $v$. The settings are:
linear | the linear function $u'v$; |
---|---|
polynomial | the polynomial function $\gamma (u'v + c)^d$; |
radialbasis | the radial basis function $\exp(-\gamma \lVert u - v \rVert^2)$; and |
sigmoid | the sigmoid function $\tanh(\gamma u'v + c)$. |
With a linear kernel, the boundaries are multi-dimensional planes. For the other types they are curved surfaces. The kernel is ignored for SVMTYPE=lsvc, lsvr and lcs, as these always use a linear kernel.
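As a rough illustration of the kernel formulas, the following sketch evaluates the linear and radial basis kernels for two made-up points. The variates u and v, the scalar g (standing in for γ) and the result names are hypothetical; none of this is needed when using SVMFIT, which evaluates the kernels internally.

```genstat
" Two hypothetical points with three attributes each "
VARIATE [VALUES=1,2,3] u
VARIATE [VALUES=2,0,3] v
" Assumed value of gamma, chosen only for illustration "
SCALAR g; VALUE=0.5
" Linear kernel: the inner product of u and v "
CALCULATE klinear = SUM(u*v)
" Radial basis kernel: exp of minus gamma times the squared distance between u and v "
CALCULATE kradial = EXP(-g*SUM((u-v)**2))
PRINT klinear,kradial
```

The radial basis value shrinks towards zero as the points move apart, and it does so faster for larger values of gamma.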
The data set is supplied in a pointer of explanatory variates or factors, specified by the X parameter, and a response variate or factor specified by the Y parameter. The Y parameter need not be set if SVMTYPE=svm1, as this searches for a consistent group of individuals in the data set, ignoring the Y parameter. Explanatory factors are converted to variates, using the levels of the factor concerned. Any unit with a missing value in an explanatory variate takes a zero value for that attribute. With the default (uniform) scaling, this places the unit at the centre of the range of the variate concerned. Units can also be excluded from the analysis by restricting the factor or variates; any such restrictions must be consistent.
The response factor specifies the pre-defined groupings of the units from which the allocation is derived (the “training set”); the units to be allocated by the analysis have missing values for Y. A response variate supplies training values for a regression-type support vector machine. (These are requested by the SVMTYPE settings svr, nusvr and lsvr.) Units to be predicted by the regression have missing values in the y-variate.
The support vector machine solutions depend on the scale of the attributes. It is usually recommended that all attributes are put on the same scale, so that they all have the same influence. This is controlled by the SCALING option, with settings:
none | the attributes are used as supplied, with no scaling; |
---|---|
uniform | all the attributes are centred, and scaled to have the same minimum and maximum (default); and |
given | the variates are scaled using the LOWER and UPPER options. |
The LOWER and UPPER options can be set to a scalar, to apply a uniform scaling, where all the variates are given the same minimum (LOWER) and maximum (UPPER) value; alternatively, they can be variates specifying the minimum and maximum value for each variate, respectively.
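The two ways of setting LOWER and UPPER might look as follows. This is only a sketch: it reuses the Iris data and the pointer Var from the Example section, and the limits themselves are arbitrary values chosen for illustration.

```genstat
" Load the Iris data and collect the four attributes, as in the Example section "
SPLOAD [PRINT=*] '%DATA%/Iris.gsh'
POINTER [VALUES=Sepal_Length,Sepal_Width,Petal_Length,Petal_Width] Var
" Scalars: every attribute is scaled to run from 0 to 1 "
SVMFIT [PRINT=summary,allocations; SCALING=given; LOWER=0; UPPER=1;\
       SEED=726454] Y=Species; X=Var
" Variates: separate limits for each of the four attributes in Var "
SVMFIT [PRINT=summary,allocations; SCALING=given; LOWER=!(0,0,-1,-1);\
       UPPER=!(1,1,1,1); SEED=726454] Y=Species; X=Var
```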
The PENALTY option defines the penalty that is applied to the sum of distances for the points on the wrong side of the boundary when calculating the optimal boundaries; default 1. Larger values apply more weight to points that are on the wrong side of the discrimination boundaries, and can be investigated to optimize performance. However, linear support vector machines are generally insensitive to the choice of the penalty. The WEIGHTS parameter can be used to change the penalty for mis-assigning a case to a particular group, and should be a variate with one value for each level of Y. The penalty for each group is then the corresponding value of PENALTY*WEIGHTS.
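A sketch of group-specific penalties, assuming the Iris data from the Example section, where Species has three levels. The weights are illustrative only: with PENALTY=10 and WEIGHTS=!(1,2,1), the penalties for the three groups would be 10, 20 and 10.

```genstat
" Load the Iris data and collect the four attributes, as in the Example section "
SPLOAD [PRINT=*] '%DATA%/Iris.gsh'
POINTER [VALUES=Sepal_Length,Sepal_Width,Petal_Length,Petal_Width] Var
" One weight per level of Species; the second group is penalized twice as heavily "
SVMFIT [PRINT=summary,allocations; PENALTY=10; SEED=726454] Y=Species;\
       X=Var; WEIGHTS=!(1,2,1)
```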
The GAMMA option (γ in the equations for the kernels) controls the smoothness of the boundary for non-linear kernels, with larger values giving a rougher surface.
With SVMTYPE=nusvc and nusvr, the parameter NU controls the number of support vectors used; default 0.5. NU acts as a lower bound on the fraction of units that become support vectors, so smaller values of NU allow fewer support vectors to be used, giving a sparser solution that may be more robust and thus perform better in future prediction.
With the regression cases SVMTYPE=svr and lsvr, the parameter EPSILON controls the sensitivity of the loss function being optimized; default 0.1.
A range of values for PENALTY, GAMMA, NU or EPSILON is usually tried, to optimize the discrimination between groups or the predictions of the y-variate. These options also accept a variate, in which case all the values in the variate are tried and the one that minimizes the error rate is selected. Up to two of these options can be variates at once. A grid of error rates is then calculated using every combination of the two sets of values, and the optimal combination is selected. If three or more of these options are set to variates, a warning is given, and only the first values of the third and fourth variates are used.
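A search over a single option could be set up along the following lines. The sketch assumes the Iris data from the Example section; the three NU values are arbitrary, and BestNu is a hypothetical name for the saved result.

```genstat
" Load the Iris data and collect the four attributes, as in the Example section "
SPLOAD [PRINT=*] '%DATA%/Iris.gsh'
POINTER [VALUES=Sepal_Length,Sepal_Width,Petal_Length,Petal_Width] Var
" Try three values of NU; the one with the lowest error rate is selected "
SVMFIT [PRINT=summary,allocations; SVMTYPE=nusvc; NU=!(0.1,0.3,0.5);\
       SEED=726454] Y=Species; X=Var; OPTNU=BestNu
PRINT BestNu
```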
When KERNEL=polynomial, the DEGREE option defines the degree of the polynomial (d in the equation for the polynomial kernel). The CONSTANTVALUE option gives the constant (c in the equations for the kernels), for KERNEL=polynomial and sigmoid.
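For instance, a quadratic kernel with a non-zero constant might be requested as below. This is a sketch using the Iris data from the Example section; the settings DEGREE=2 and CONSTANTVALUE=1 are illustrative rather than recommended values.

```genstat
" Load the Iris data and collect the four attributes, as in the Example section "
SPLOAD [PRINT=*] '%DATA%/Iris.gsh'
POINTER [VALUES=Sepal_Length,Sepal_Width,Petal_Length,Petal_Width] Var
" Polynomial kernel of degree 2 (d) with constant 1 (c) "
SVMFIT [PRINT=summary,allocations; KERNEL=polynomial; DEGREE=2;\
       CONSTANTVALUE=1; SEED=726454] Y=Species; X=Var
```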
The TOLERANCE option supplies a small positive value that controls the precision used for the termination criterion. Decreasing this may provide a better solution, but will increase the time taken until convergence.
The NOSHRINK option controls whether unnecessary attributes are dropped from the fitting process; by default, these are dropped, thus increasing the speed to find a solution when there are many iterations (e.g. when TOLERANCE has been made smaller). If few iterations are required to find a solution, it may be faster to set NOSHRINK=yes.
The OPTMETHOD option controls the criterion that is optimized when SVMTYPE is set to svc, svr, nusvc or nusvr, with settings:
allocations | for the accuracy of allocating individuals to groups; or |
---|---|
probabilities | for the sum of the probabilities of allocating an individual to the correct group. |
The SVMTYPE settings lsvc, lsvr and lcs fit regularized linear support vector machines using the algorithms in the LIBLINEAR library of Fan et al. (2008). These are much faster than the default algorithm, allowing much bigger data sets to be analysed. The REGULARIZATIONMETHOD, LOSSMETHOD and DUALMETHOD options specify which LIBLINEAR algorithm is used for the SVMTYPE settings lsvc and lsvr.
The REGULARIZATIONMETHOD option allows you to create sparser sets of support vectors, with the l1 setting giving a smaller set of support vectors than l2. The LOSSMETHOD option controls the loss function being minimized: the l2 setting minimizes the sum of the squared distances of points on the wrong side of the boundary, the l1 setting minimizes the sum of the distances, and the logistic setting uses a logistic regression loss function. Setting option DUALMETHOD=yes may be faster when there are a large number of attributes. Not all combinations of the REGULARIZATIONMETHOD, LOSSMETHOD and DUALMETHOD options are available.
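A call that selects a LIBLINEAR algorithm might look like this. It is a sketch on the Iris data from the Example section; that this particular combination of settings is accepted by SVMFIT is an assumption, based on LIBLINEAR itself providing a dual solver for L2-regularized, L2-loss classification.

```genstat
" Load the Iris data and collect the four attributes, as in the Example section "
SPLOAD [PRINT=*] '%DATA%/Iris.gsh'
POINTER [VALUES=Sepal_Length,Sepal_Width,Petal_Length,Petal_Width] Var
" Fast linear classifier: l2 regularization, l2 loss, dual algorithm (assumed to be an available combination) "
SVMFIT [PRINT=summary,allocations; SVMTYPE=lsvc; REGULARIZATIONMETHOD=l2;\
       LOSSMETHOD=l2; DUALMETHOD=yes; SEED=726454] Y=Species; X=Var
```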
When SVMTYPE=lsvc, you can use the BIAS option to try to improve the discrimination between groups. When BIAS is set to a non-negative value, an extra constant attribute is added to the end of each individual. This extra attribute is given a weight that controls the origin of the separating hyper-plane (the origin is where all attributes have a value of 0). A BIAS of 0 forces the separating hyper-plane to go through the origin, and a non-zero value moves the plane away from the origin. The BIAS thus acts as a tuning parameter that changes the hyper-plane’s origin. A range of values can be investigated, to try to improve the discrimination.
Printed output is controlled by the option PRINT with settings:
summary | tables giving the number of units in each group with a complete set of observations; |
---|---|
allocations | tables of counts of allocations; and |
debug | details of the parameters set when calling the libraries. |
The error rate is estimated by cross-validation, which randomly splits the units into the number of groups specified by the NCROSSVALIDATIONGROUPS option. Each group is then omitted in turn, and the analysis predicts how the omitted units are allocated to the discrimination groups.
The SEED option provides the seed for the random numbers used for allocating individuals to the cross-validation groups. The default value of 0 continues an existing sequence of random numbers. If none have been used in the current Genstat job, it initializes the seed automatically using the computer clock.
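The cross-validation settings can be changed as in the sketch below, which again assumes the Iris data from the Example section. The choice of five groups is arbitrary; supplying a fixed non-zero seed, as the Example section does, should give the same random split each time the program is run.

```genstat
" Load the Iris data and collect the four attributes, as in the Example section "
SPLOAD [PRINT=*] '%DATA%/Iris.gsh'
POINTER [VALUES=Sepal_Length,Sepal_Width,Petal_Length,Petal_Width] Var
" Five cross-validation groups instead of the default 10, with a fixed seed "
SVMFIT [PRINT=summary,allocations; NCROSSVALIDATIONGROUPS=5; SEED=726454]\
       Y=Species; X=Var
```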
The WORKSPACE option can be set if the problem requires more memory than the default settings.
Results from the analysis can be saved using the parameters PREDICTIONS, ERRORRATE, OPTPENALTY, OPTGAMMA, OPTNU, OPTEPSILON and OPTERRORRATE. The structures specified for these parameters need not be declared in advance. If one of the options PENALTY, GAMMA, NU or EPSILON has been set to a variate, ERRORRATE will be a variate indexed by that variate. Alternatively, if two of these options have been set to variates, ERRORRATE will be a matrix with rows and columns indexed by those variates. The OPTPENALTY, OPTGAMMA, OPTNU and OPTEPSILON parameters save the parameter values that give the minimum error rate (returned in OPTERRORRATE).
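Saving the results of a two-way search might be sketched as follows, using the Iris data from the Example section. The grids of values and the identifiers Err, BestC, BestG and MinErr are hypothetical; none of them is declared in advance.

```genstat
" Load the Iris data and collect the four attributes, as in the Example section "
SPLOAD [PRINT=*] '%DATA%/Iris.gsh'
POINTER [VALUES=Sepal_Length,Sepal_Width,Petal_Length,Petal_Width] Var
" Two options set to variates, so ERRORRATE is saved as a matrix "
SVMFIT [PRINT=summary; PENALTY=!(1,10,100); GAMMA=!(0.1,0.5,1);\
       SEED=726454] Y=Species; X=Var; ERRORRATE=Err; OPTPENALTY=BestC;\
       OPTGAMMA=BestG; OPTERRORRATE=MinErr
" Grid of error rates, then the best combination and its error rate "
PRINT Err
PRINT BestC,BestG,MinErr
```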
The support vector machine model can be saved in an external file, using the SAVEFILE parameter, so that it can be used later with SVMPREDICT. As the scaling on the attributes must be the same in future data sets, the scaling can be saved with the SCALE parameter. This can supply either a filename (ending in .gsh) to keep these permanently, or a pointer so that these can be applied to the attributes used in SVMPREDICT later in the same program. The file or pointer contains two variates, which give the slope and intercept (in that order) for the linear transform applied to each attribute.
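A sketch of saving both the model and the scaling to files, assuming the Iris data from the Example section. The file names are hypothetical; only the .gsh ending of the scaling file follows from the description, and the ending used for the model file is a guess.

```genstat
" Load the Iris data and collect the four attributes, as in the Example section "
SPLOAD [PRINT=*] '%DATA%/Iris.gsh'
POINTER [VALUES=Sepal_Length,Sepal_Width,Petal_Length,Petal_Width] Var
" Save the scaling (text, so written to a file) and the fitted model for later use by SVMPREDICT "
SVMFIT [PRINT=summary; SEED=726454] Y=Species; X=Var;\
       SCALE='IrisScale.gsh'; SAVEFILE='IrisModel.svm'
```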
Options: PRINT, SVMTYPE, KERNEL, PENALTY, GAMMA, NU, EPSILON, BIAS, DEGREE, CONSTANTVALUE, LOWER, UPPER, SCALING, NOSHRINK, OPTMETHOD, REGULARIZATIONMETHOD, LOSSMETHOD, DUALMETHOD, NCROSSVALIDATIONGROUPS, SEED, TOLERANCE, WORKSPACE.
Parameters: Y, X, WEIGHTS, PREDICTIONS, ERRORRATE, OPTPENALTY, OPTGAMMA, OPTNU, OPTEPSILON, OPTERRORRATE, SCALE, SAVEFILE.
Method
SVMFIT provides a Genstat interface to the C++ libraries LIBSVM (Chang & Lin 2001) and LIBLINEAR (Fan et al. 2008), which have been compiled into the GenSVM dynamic link library. A user guide by Hsu et al. (2003) gives details on their use.
LIBSVM is provided subject to the following copyright notice.
Copyright © 2000-2014 Chih-Chung Chang and Chih-Jen Lin. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
3. Neither name of copyright holders nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
This software is provided by the copyright holders and contributors “as is” and any express or implied warranties, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose are disclaimed. In no event shall the regents or contributors be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of this software, even if advised of the possibility of such damage.
LIBLINEAR is provided subject to the following copyright notice.
Copyright © 2007-2013 The LIBLINEAR Project. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
3. Neither name of copyright holders nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
This software is provided by the copyright holders and contributors “as is” and any express or implied warranties, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose are disclaimed. In no event shall the regents or contributors be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of this software, even if advised of the possibility of such damage.
Action with RESTRICT
The input variates and factor may be restricted. The restrictions must be identical.
References
Cortes, C. & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273-297.
URL: https://link.springer.com/article/10.1007%2FBF00994018
Chang, C.C. & Lin, C.J. (2001). LIBSVM: A library for support vector machines.
URL: http://www.csie.ntu.edu.tw/~cjlin/libsvm
Crammer, K. & Singer, Y. (2000). On learnability and design of output codes for multi-class problems. In Computational Learning Theory, 35-46.
Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R. & Lin, C.J. (2008). LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9, 1871-1874.
URL: http://www.csie.ntu.edu.tw/~cjlin/papers/liblinear.pdf
Hsu, C.W., Chang, C.C. & Lin, C.J. (2003). A practical guide to support vector classification. (Technical report). Department of Computer Science and Information Engineering, National Taiwan University.
URL: http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf
See also
Directive: CVA.
Procedures: SVMPREDICT, DISCRIMINATE, QDISCRIMINATE, SDISCRIMINATE.
Example
CAPTION 'SVMFIT for classification: Fisher Iris data'; STYLE=meta
SPLOAD [PRINT=*] '%DATA%/Iris.gsh'
POINTER [VALUES=Sepal_Length,Sepal_Width,Petal_Length,Petal_Width] Var
" Default - radialbasis kernel with scaling."
SVMFIT [PRINT=summary,allocations; SEED=726454] Y=Species; X=Var
" Unscaled with linear kernel."
SVMFIT [PRINT=summary,allocations; KERNEL=linear; SCALING=none;\
       SEED=143038] Y=Species; X=Var
CAPTION 'SVMFIT for regression: Los Angeles Ozone data'; STYLE=meta
SPLOAD [PRINT=*] '%DATA%/Ozone.gsh'; ISAVE=Data
SUBSET [Ozone /= !s(*)] Data[]
POINTER [VALUES=Data[1,2,(5...10)]] OZVars
" Find optimal values for penalty and gamma."
SVMFIT [PRINT=summary; SVMTYPE=svr; PENALTY=!(1,10,100,500,1000);\
       GAMMA=!(0.05,0.1,0.2,0.4); SEED=562011] Y=Ozone; X=OZVars;\
       PREDICTIONS=POzone
DGRAPH [TITLE='Los Angeles Ozone levels 1976 ~{epsilon}-regression';\
       KEY=0;WIND=3] Y=POzone; X=Ozone