Fits a partial least squares regression model (Ian Wakeling & Nick Bratchell).
Options
PRINT = string tokens | Printed output required (data, xloadings, yloadings, ploadings, scores, leverages, xerrors, yerrors, scree, xpercent, ypercent, predictions, groups, estimates, fittedvalues); default esti, xper, yper, scor, xloa, yloa, ploa |
---|---|
NROOTS = scalar | Number of PLS dimensions to be extracted |
YSCALING = string token | Whether to scale the Y variates to unit variance (yes, no); default no |
XSCALING = string token | Whether to scale the X variates to unit variance (yes, no); default no |
NGROUPS = scalar | Number of cross-validation groups into which to divide the data; default 1 (i.e. no cross-validation performed) |
SEED = scalar or factor | A scalar indicating the seed value to use when dividing the data randomly into NGROUPS groups for the cross-validation, or a factor to indicate a specific set of groupings to use for the cross-validation; default 0 |
LABELS = text | Sample labels for X and Y that are to be used in the printed output; defaults to the integers 1…n where n is the length of the variates in X and Y |
PLABELS = text | Sample labels for XPREDICTIONS that are to be used in the printed output; default uses the integers 1, 2 … |
Parameters
Y = pointers | Pointer to variates containing the dependent variables |
---|---|
X = pointers | Pointer to variates containing the independent variables |
YLOADINGS = pointers | Pointer to variates used to store the Y component loadings for each dimension extracted |
XLOADINGS = pointers | Pointer to variates used to store the X component loadings for each dimension extracted |
PLOADINGS = pointers | Pointer to variates used to store the loadings for the bilinear model for the X block |
YSCORES = pointers | Pointer to variates used to store the Y component scores for each dimension extracted |
XSCORES = pointers | Pointer to variates used to store the X component scores for each dimension extracted |
B = matrices | A diagonal matrix containing the regression coefficients of YSCORES on XSCORES for each dimension |
YPREDICTIONS = pointers | A pointer to variates used to store predicted Y values for samples in the prediction set |
XPREDICTIONS = pointers | A pointer to variates containing data for the independent variables in the prediction set |
ESTIMATES = matrices | An nX+1 by nY matrix (where nX and nY are the numbers of variates contained in X and Y respectively) used to store the PLS regression coefficients for a PLS model with NROOTS dimensions |
FITTEDVALUES = pointers | Pointer to variates used to store the fitted values for each Y variate |
LEVERAGES = variates | Variate used to store the leverage that each sample has on the PLS model |
PRESS = variates | Variate used to contain the Predictive Residual Error Sum of Squares for each dimension in the PLS model, available only if cross-validation has been selected |
RSS = variates | Variate used to store the Residual Sum of Squares for each dimension extracted |
YRESIDUALS = pointers | Pointer to variates used to store the residuals from the Y block after NROOTS dimensions have been extracted, uncorrected for any scaling applied using YSCALING |
XRESIDUALS = pointers | Pointer to variates used to store the residuals from the X block after NROOTS dimensions have been extracted, uncorrected for any scaling applied using XSCALING |
XPRESIDUALS = pointers | Pointer to variates used to store the residuals from the XPREDICTIONS block after NROOTS dimensions have been extracted |
Description
The regression method of Partial Least Squares (PLS) was initially developed as a calibration method for use with chemical data. It was designed principally for use with overdetermined data sets and to be more efficient computationally than competing methods such as principal components regression. If Y and X denote matrices of dependent and independent variables respectively, then the aim of PLS is to fit a bilinear model having the form T=XW, X=TP′+E and Y=TQ′+F, where W is a matrix of coefficients whose columns define the PLS factors as linear combinations of the independent variables. Successive PLS factors contained in the columns of T are selected both to minimise the residuals in E and simultaneously to have high squared covariance with a single Y variate (PLS1) or a linear combination of multiple Y variates (PLS2). The columns of T are constrained to be mutually orthogonal. See Helland (1988) or Hoskuldsson (1988) for a more comprehensive description of the PLS method.
The procedure allows the calculation of PLS1 and PLS2 models with cross-validation to assist in the determination of the correct number of dimensions to include in the model. By setting the NGROUPS option the data are randomly divided into a number of groups; samples in each group are then modelled from the remaining samples only. The sum of squares of differences between these “leave out predictions” and the observed values of Y is called PRESS. Many tests of significance for determining the correct number of dimensions are based on comparing values of PRESS for PLS models of varying rank. Values of PRESS are used in the procedure to perform Osten’s (1988) test of significance and may also be plotted out in a scree diagram. In addition to the factor scores, factor loadings and residuals, the procedure also calculates a leverage measure (Martens & Naes 1989, page 276) and a single linear combination of the X variables (ESTIMATES) which summarises the entire PLS model.
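As a rough illustration of the cross-validation described above, the following Python sketch allocates samples to groups at random and accumulates PRESS from the leave-out predictions. It is not the procedure's own code: the function names (`cv_groups`, `press`, `fit_mean`) are hypothetical, the way samples are allocated to groups is an assumption, and a mean-only predictor stands in for the PLS fit so that the arithmetic is easy to check by hand.

```python
import random


def cv_groups(n, ngroups, seed):
    """Randomly allocate n samples to ngroups groups of near-equal size.

    Assumption: a shuffled round-robin allocation; the procedure's own
    allocation scheme may differ.
    """
    order = list(range(n))
    random.Random(seed).shuffle(order)
    groups = [0] * n
    for position, sample in enumerate(order):
        groups[sample] = position % ngroups
    return groups


def press(x, y, groups, fit):
    """Sum of squared differences between leave-out predictions and y."""
    total = 0.0
    for g in sorted(set(groups)):
        train = [i for i, gi in enumerate(groups) if gi != g]
        held_out = [i for i, gi in enumerate(groups) if gi == g]
        # Fit using only the samples outside the current group.
        predict = fit([x[i] for i in train], [y[i] for i in train])
        total += sum((y[i] - predict(x[i])) ** 2 for i in held_out)
    return total


def fit_mean(x_train, y_train):
    """Stand-in model (not PLS): predicts the training mean of y."""
    mean = sum(y_train) / len(y_train)
    return lambda x_new: mean


x = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 2.0, 3.0, 4.0]
groups = cv_groups(len(y), ngroups=4, seed=57133)
press_value = press(x, y, groups, fit_mean)
print(groups, press_value)
```

With the number of groups equal to the number of samples, as here, every sample forms its own group, which corresponds to leave-one-out cross-validation. Replacing `fit_mean` with a routine that fits a model of a given rank would give one PRESS value per number of dimensions.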
The procedure will fail if there are missing values present in either the X or Y variates.
Two methods are available for using a PLS model to make predictions from new observations on the X variables. The user may either apply the model manually, using the coefficients saved in the ESTIMATES matrix, or supply the new X data beforehand as the pointer to variates XPREDICTIONS and obtain the corresponding predictions as YPREDICTIONS.
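The manual route can be sketched in Python as follows. This is an illustration only, under two assumptions that the text does not confirm: that the extra row of the nX+1 by nY coefficient matrix holds the constant terms and comes first, and that the coefficients apply to X values on their original scale. The function name `predict_from_estimates` and the numbers are hypothetical.

```python
def predict_from_estimates(estimates, x_new):
    """Apply an (nX+1) by nY coefficient matrix to new X observations.

    Assumption: row 0 holds the constant for each Y variate and rows
    1..nX hold the coefficients of the X variates, in order.
    """
    n_x = len(estimates) - 1
    n_y = len(estimates[0])
    predictions = []
    for row in x_new:
        if len(row) != n_x:
            raise ValueError("each new observation needs nX values")
        predictions.append([
            estimates[0][k]
            + sum(estimates[j + 1][k] * row[j] for j in range(n_x))
            for k in range(n_y)
        ])
    return predictions


# Hypothetical coefficients for nX = 2 and nY = 2.
estimates = [
    [1.0, 0.0],   # constants (assumed to be the first row)
    [2.0, 1.0],   # coefficients of the first X variate
    [0.5, -1.0],  # coefficients of the second X variate
]
x_new = [[1.0, 2.0], [0.0, 4.0]]
predicted = predict_from_estimates(estimates, x_new)
print(predicted)
```

If the constants were stored in the last row instead, only the row indexing would change.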
Output from the PLS procedure can be selected using the following settings of the PRINT option.
data | the unscaled data values (with labels). |
---|---|
xloadings | X-component loadings (columns of the matrix W – see above). |
yloadings | variable loadings for the bilinear model of the matrix of dependent variables. Note that these are standardized to unit length and are not the same as the columns of the matrix Q above. To obtain Q, form the matrix C, whose columns are the standardized loadings, and post-multiply by the diagonal matrix supplied as the output parameter B. |
ploadings | variable loadings for the bilinear model of the matrix of independent variables (columns of the matrix P – see above). |
scores | X and Y component scores. The X component scores are the columns of the matrix T and are mutually orthogonal. The Y component scores, usually given the symbol u, are not in fact needed in the calculation of the PLS model unless an iterative algorithm is used (see the Method section). They are provided here for completeness, as it is sometimes useful to plot the Y component scores against the X component scores to give a visual indication of the degree of fit for each PLS dimension. |
leverages | measure of leverage. |
xerrors | residual sum of squares and residual standard deviations for all the independent variables. When NGROUPS > 1, additional statistics are calculated from the cross-validated residuals, derived when each object is left out. The PRESS value is equal to the sum of squares of cross-validated standard deviations for each X variable multiplied by N-1, where N is the total number of observations. The cross-validated standard deviations may therefore be used to measure the predictive ability of the model for each of the variables. |
yerrors | residual sum of squares and residual standard deviations for all the dependent variables (see xerrors above). |
scree | scree diagram of PRESS. |
xpercent | percentage variance explained for the X variables. |
ypercent | percentage variance explained for the Y variables. |
predictions | predicted values for any observations that were not included in the PLS model but were supplied using the XPREDICTIONS parameter. |
groups | details of groupings used for cross-validation. |
estimates | estimated PLS regression coefficients. |
fittedvalues | fitted values from the PLS regressions. |
The default settings are estimates, xpercent, ypercent, scores, xloadings, yloadings, ploadings.
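The yloadings entry in the table above states that Q is obtained by post-multiplying the matrix C of standardized loadings by the diagonal matrix B. Because B is diagonal, this amounts to scaling each column of C by one coefficient, as the following Python sketch shows. The function name `recover_q`, the column-wise storage and the numbers are illustrative assumptions.

```python
def recover_q(c_columns, b_diagonal):
    """Return the columns of Q = C B for a diagonal matrix B.

    c_columns holds one list per PLS dimension (a column of C);
    b_diagonal holds the matching diagonal elements of B.
    """
    if len(c_columns) != len(b_diagonal):
        raise ValueError("need one diagonal element per column of C")
    return [[value * b for value in column]
            for column, b in zip(c_columns, b_diagonal)]


# Hypothetical standardized loadings for two Y variates, two dimensions.
c_columns = [[0.6, 0.8], [1.0, 0.0]]
b_diagonal = [2.0, 0.5]
q_columns = recover_q(c_columns, b_diagonal)
print(q_columns)
```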
The data for PLS are supplied using the X and Y parameters, as pointers to variates containing the columns of the X and Y matrices. Other parameters allow output to be saved in appropriate data structures.
Options: PRINT, NROOTS, YSCALING, XSCALING, NGROUPS, SEED, LABELS, PLABELS.
Parameters: Y, X, YLOADINGS, XLOADINGS, PLOADINGS, YSCORES, XSCORES, B, YPREDICTIONS, XPREDICTIONS, ESTIMATES, FITTEDVALUES, LEVERAGES, PRESS, RSS, YRESIDUALS, XRESIDUALS, XPRESIDUALS.
Method
Although the PLS method is often presented in terms of an iterative algorithm (Manne 1987), the X block loadings vector for the first PLS dimension (w1) is simply the eigenvector of X′YY′X corresponding to its largest eigenvalue. To find the second and subsequent dimensions, X and Y are deflated by orthogonalising with respect to the current PLS factor (t=Xw) and the eigenanalysis repeated. The above approach was adopted by Rogers (1987) in an implementation of a Genstat 4 macro. Here we adopt a very similar approach by performing a singular value decomposition on the matrix X′Y which simultaneously obtains loading vectors for both data blocks (Hoskuldsson 1988, de Jong & ter Braak 1994).
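For a single Y variate (PLS1) the matrix X′YY′X has rank one, so its dominant eigenvector is simply X′y scaled to unit length and no eigenanalysis or singular value decomposition is needed. The Python sketch below uses that special case to illustrate the extract-and-deflate cycle described above. It is a simplified illustration, not the procedure's implementation: it handles only one Y variate, applies centring but no scaling, and the function name `pls1` and the data are made up.

```python
import math


def pls1(X, y, nroots):
    """Extract nroots PLS dimensions for a single y variate.

    Returns weights W, X loadings P, y loadings q and scores T,
    each as a list with one entry per dimension.
    """
    n, m = len(X), len(X[0])
    x_means = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    E = [[row[j] - x_means[j] for j in range(m)] for row in X]  # centred X
    f = [value - y_mean for value in y]                         # centred y
    W, P, q, T = [], [], [], []
    for _ in range(nroots):
        # Weight vector: X'y scaled to unit length.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
        # Scores t = Xw, then loadings by regression on t.
        t = [sum(E[i][j] * w[j] for j in range(m)) for i in range(n)]
        tt = sum(v * v for v in t)
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
        q_a = sum(f[i] * t[i] for i in range(n)) / tt
        # Deflate both blocks by orthogonalising with respect to t.
        E = [[E[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        f = [f[i] - q_a * t[i] for i in range(n)]
        W.append(w)
        P.append(p)
        q.append(q_a)
        T.append(t)
    return W, P, q, T


# Made-up data: five samples, three X variates.
X = [[1.0, 2.0, 0.0], [2.0, 1.0, 1.0], [3.0, 5.0, 2.0],
     [4.0, 3.0, 5.0], [5.0, 6.0, 3.0]]
y = [1.0, 2.5, 2.0, 4.5, 4.0]
W, P, q, T = pls1(X, y, nroots=2)
print(sum(a * b for a, b in zip(T[0], T[1])))  # close to zero
```

The deflation step is what makes successive score vectors mutually orthogonal, which the final line checks numerically.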
It is usual to centre all variables prior to a PLS analysis; the procedure does so automatically, even if the XSCALING/YSCALING options are not set. On exit from the procedure the variates pointed to by X and Y are unchanged.
Action with RESTRICT
The procedure will work with restricted variates, fitting a PLS model to the subset of objects indicated by the restriction. If there are different restrictions on different data variates then these restrictions will be combined and the analysis performed on the subset of samples that is common to all the restrictions. Note that the unrestricted length of all of the data variates must be the same and the number of samples in the common subset must be at least three. Any restrictions on a text supplied for the LABELS option or a factor for the SEED option will be ignored. On exit from the procedure all the data variates and, if supplied, the SEED factor and LABELS text will be returned restricted to the common subset of samples. Output data structures that correspond to the samples (i.e. XSCORES, YSCORES, FITTEDVALUES, LEVERAGES, YRESIDUALS and XRESIDUALS) will also be returned restricted to the common subset, and missing values will be used for those values that have been restricted out.
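The way restrictions are combined can be pictured with a short Python sketch. It only mimics the behaviour described above and is not Genstat code; the function names `common_subset` and `expand`, the 0-based unit numbers and the use of `None` for a missing value are illustrative assumptions.

```python
def common_subset(n, restrictions):
    """Units (0-based) retained by every restriction.

    Each restriction is a list of retained units, or None when the
    corresponding variate is unrestricted.
    """
    keep = set(range(n))
    for restriction in restrictions:
        if restriction is not None:
            keep &= set(restriction)
    if len(keep) < 3:
        raise ValueError("the common subset must contain at least three samples")
    return sorted(keep)


def expand(values, subset, n):
    """Return results at full length, with None where units were restricted out."""
    full = [None] * n
    for unit, value in zip(subset, values):
        full[unit] = value
    return full


# Three data variates of unrestricted length 6 with differing restrictions.
subset = common_subset(6, [[0, 1, 2, 3, 4], None, [1, 2, 3, 4, 5]])
scores = expand([10.0, 20.0, 30.0, 40.0], subset, 6)
print(subset, scores)
```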
When restricted data are supplied and LABELS are also given then the appropriate subset of labels will appear in the output; if LABELS are not defined then default labels reflecting the position of the restricted data in the unrestricted variate will be used instead.
No restrictions are allowed in the variates supplied in the XPREDICTIONS parameter or the PLABELS option.
References
Helland, I.S. (1988). On the structure of partial least squares regression. Commun. Statist. - Simula. Comput., 17, 581-607.
Hoskuldsson, A. (1988). PLS Regression Methods. J. Chemometrics, 2, 211-228.
de Jong, S. & ter Braak, C.J.F. (1994). Comments on the PLS kernel algorithm. J. Chemometrics, 8, 169-174.
Manne, R. (1987). Analysis of two partial least squares algorithms for multivariate calibration. Chemometrics and Intell. Lab. Systems, 2, 187-197.
Martens, H. & Naes, T. (1989). Multivariate Calibration. John Wiley, Chichester.
Osten, D.W. (1988). Selection of optimal regression models via cross-validation. J. Chemometrics, 2, 39-48.
Rogers, C.A. (1987). A Genstat Macro for Partial Least Squares Analysis with Cross-Validation Assessment of Model Dimensionality. Genstat Newsletter, 18, 81-92.
See also
Procedures: CCA, OPLS, RDA, RIDGE.
Commands for: Multivariate and cluster analysis, Regression analysis.
Example
CAPTION 'PLS example',!t('The data are 24 calibration',\
  'samples used to determine the protein content of wheat',\
  'from spectroscopic readings at six different wavelengths',\
  '(Fearn, T., 1983, Applied Statistics, 32, 73-79).'); STYLE=meta,plain
VARIATE [NVALUES=24] L[1...6],%Protein[1]
READ L[1...6],%Protein[1]
468 123 246 374 386 -11  9.23
458 112 236 368 383 -15  8.01
457 118 240 359 353 -16 10.95
450 115 236 352 340 -15 11.67
464 119 243 366 371 -16 10.41
499 147 273 404 433   5  9.51
463 119 242 370 377 -12  8.67
462 115 238 370 353 -13  7.75
488 134 258 393 377  -5  8.05
483 141 264 384 398  -2 11.39
463 120 243 367 378 -13  9.95
456 111 233 365 365 -15  8.25
512 161 288 415 443  12 10.57
518 167 293 421 450  19 10.23
552 197 324 448 467  32 11.87
497 146 271 407 451  11  8.09
592 229 360 484 524  51 12.55
501 150 274 406 407  11  8.38
483 137 260 385 374  -3  9.64
491 147 269 389 391   1 11.35
463 121 242 366 353 -13  9.70
507 159 285 410 445  13 10.75
474 132 255 376 383  -7 10.75
496 152 276 396 404   6 11.47 :
" Fit a 3 dimensional PLS model to the standardized data using
  leave-one-out cross-validation. All three dimensions are
  significant using Osten's test"
PLS [PRINT=estimates,xpercent,ypercent,xloadings,yloadings,ploadings;\
  NROOTS=3; NGROUPS=24; SEED=57133; XSCALING=yes; YSCALING=yes]\
  Y=%Protein; X=L