PLS procedure

Fits a partial least squares regression model (Ian Wakeling & Nick Bratchell).

Options

`PRINT` = string tokens	Printed output required (`data`, `xloadings`, `yloadings`, `ploadings`, `scores`, `leverages`, `xerrors`, `yerrors`, `scree`, `xpercent`, `ypercent`, `predictions`, `groups`, `estimates`, `fittedvalues`); default `esti`, `xper`, `yper`, `scor`, `xloa`, `yloa`, `ploa`
`NROOTS` = scalar	Number of PLS dimensions to be extracted
`YSCALING` = string token	Whether to scale the `Y` variates to unit variance; (`yes`, `no`); default `no`
`XSCALING` = string token	Whether to scale the `X` variates to unit variance; (`yes`, `no`); default `no`
`NGROUPS` = scalar	Number of cross-validation groups into which to divide the data; default 1 (i.e. no cross-validation performed)
`SEED` = scalar or factor	A scalar indicating the seed value to use when dividing the data randomly into `NGROUPS` groups for the cross-validation or a factor to indicate a specific set of groupings to use for the cross-validation; default 0
`LABELS` = text	Sample labels for `X` and `Y` that are to be used in the printed output; defaults to the integers 1…n where n is the length of the variates in `X` and `Y`
`PLABELS` = text	Sample labels for `XPREDICTIONS` that are to be used in the printed output; default uses the integers 1, 2 …

Parameters

`Y` = pointers	Pointer to variates containing the dependent variables
`X` = pointers	Pointer to variates containing the independent variables
`YLOADINGS` = pointers	Pointer to variates used to store the Y component loadings for each dimension extracted
`XLOADINGS` = pointers	Pointer to variates used to store the X component loadings for each dimension extracted
`PLOADINGS` = pointers	Pointer to variates used to store the loadings for the bilinear model for the X block
`YSCORES` = pointers	Pointer to variates used to store the Y component scores for each dimension extracted
`XSCORES` = pointers	Pointer to variates used to store the X component scores for each dimension extracted
`B` = matrices	A diagonal matrix containing the regression coefficients of `YSCORES` on `XSCORES` for each dimension
`YPREDICTIONS` = pointers	A pointer to variates used to store predicted Y values for samples in the prediction set
`XPREDICTIONS` = pointers	A pointer to variates containing data for the independent variables in the prediction set
`ESTIMATES` = matrices	An n_X+1 by n_Y matrix (where n_X and n_Y are the numbers of variates contained in `X` and `Y` respectively) used to store the PLS regression coefficients for a PLS model with `NROOTS` dimensions
`FITTEDVALUES` = pointers	Pointer to variates used to store the fitted values for each `Y` variate
`LEVERAGES` = variates	Variate used to store the leverage that each sample has on the PLS model
`PRESS` = variates	Variate used to contain the Predictive Residual Error Sum of Squares for each dimension in the PLS model, available only if cross-validation has been selected
`RSS` = variates	Variate used to store the Residual Sum of Squares for each dimension extracted
`YRESIDUALS` = pointers	Pointer to variates used to store the residuals from the Y block after `NROOTS` dimensions have been extracted, uncorrected for any scaling applied using `YSCALING`
`XRESIDUALS` = pointers	Pointer to variates used to store the residuals from the X block after `NROOTS` dimensions have been extracted, uncorrected for any scaling applied using `XSCALING`
`XPRESIDUALS` = pointers	Pointer to variates used to store the residuals from the `XPREDICTIONS` block after `NROOTS` dimensions have been extracted

Description

The regression method of Partial Least Squares (PLS) was initially developed as a calibration method for use with chemical data. It was designed principally for use with overdetermined data sets and to be more efficient computationally than competing methods such as principal components regression. If Y and X denote matrices of dependent and independent variables respectively, then the aim of PLS is to fit a bilinear model having the form T=XW, X=TP′+E and Y=TQ′+F, where W is a matrix of coefficients whose columns define the PLS factors as linear combinations of the independent variables. Successive PLS factors contained in the columns of T are selected both to minimise the residuals in E and simultaneously to have high squared covariance with a single Y variate (PLS1) or a linear combination of multiple Y variates (PLS2). The columns of T are constrained to be mutually orthogonal. See Helland (1988) or Hoskuldsson (1988) for a more comprehensive description of the PLS method.

The procedure allows the calculation of PLS1 and PLS2 models with cross-validation to assist in the determination of the correct number of dimensions to include in the model. By setting the NGROUPS option the data are randomly divided into a number of groups; samples in each group are then modelled from the remaining samples only. The sum of squares of the differences between these “leave out predictions” and the observed values of Y is called PRESS. Many tests of significance for determining the correct number of dimensions are based on comparing values of PRESS for PLS models of varying rank. Values of PRESS are used in the procedure to perform Osten’s (1988) test of significance and may also be plotted in a scree diagram. In addition to the factor scores, factor loadings and residuals, the procedure also calculates a leverage measure (Martens & Naes 1989 page 276) and a single linear combination of the X variables (ESTIMATES) which summarises the entire PLS model.
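As an illustration of the cross-validation settings, the sketch below supplies a fixed set of groups through the SEED option and saves PRESS and RSS for inspection. It assumes that the variates L[1...6] and %Protein[1] from the Example section have already been read; the identifiers Cvgroup, Press and Rss are hypothetical names chosen for this sketch, and setting NGROUPS to match the number of factor levels is an assumption rather than a documented requirement.

```genstat
" Sketch only: assumes L[1...6] and %Protein[1] hold the 24 wheat
  samples from the Example section. Cvgroup, Press and Rss are
  hypothetical names. "
" Allocate the samples to 4 groups in the order 1,2,3,4,1,2,3,4,... "
FACTOR  [LEVELS=4; VALUES=(1...4)6] Cvgroup
" A factor supplied through SEED fixes the groupings instead of
  drawing them at random; NGROUPS=4 is set to match the factor "
PLS     [PRINT=groups,yerrors,scree; NROOTS=3; NGROUPS=4; SEED=Cvgroup;\
        XSCALING=yes; YSCALING=yes] Y=%Protein; X=L; PRESS=Press; RSS=Rss
" Inspect the saved sums of squares for each dimension "
PRINT   Press
PRINT   Rss
```

Fixing the groups in this way makes the cross-validation reproducible and lets related samples (for example replicates) be left out together.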

The procedure will fail if there are missing values present in either the X or Y variates.

To use a PLS model to make predictions from new observations on the X variables, two methods are available. Either the user may do this manually by using the model as specified in the estimates matrix, or the new X data may be specified beforehand as the pointer to variates XPREDICTIONS and the corresponding predictions obtained as YPREDICTIONS.
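The second method might be set up as in the sketch below. It assumes that L[1...6] and %Protein[1] from the Example section have already been read. The identifiers Lnew, Newlab and Pnew are hypothetical, and the two "new" spectra simply repeat the first two calibration samples as placeholders; they are not additional data.

```genstat
" Sketch only: Lnew, Newlab and Pnew are hypothetical names. The two
  new spectra repeat the first two calibration samples as placeholders. "
VARIATE [NVALUES=2] Lnew[1...6]
READ    Lnew[1...6]
468 123 246 374 386 -11
458 112 236 368 383 -15  :
" Labels for the prediction samples, used in the printed output "
TEXT    [VALUES='New 1','New 2'] Newlab
" Fit the model to the calibration data and predict the new samples "
PLS     [PRINT=predictions; NROOTS=3; XSCALING=yes; YSCALING=yes;\
        PLABELS=Newlab] Y=%Protein; X=L; XPREDICTIONS=Lnew; YPREDICTIONS=Pnew
```

The pointer supplied as XPREDICTIONS must contain one variate for each variate in X, and these variates must not be restricted (see the section on RESTRICT).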

Output from the PLS procedure can be selected using the following settings of the PRINT option.

`data`	the unscaled data values (with labels).
`xloadings`	X-component loadings (columns of the matrix W – see above).
`yloadings`	variable loadings for the bilinear model of the matrix of dependent variables. Note that these are standardized to unit length and are not the same as the columns of the matrix Q above. To obtain Q, form the matrix C, whose columns are the standardized loadings, and post-multiply by the diagonal matrix supplied as the output parameter B.
`ploadings`	variable loadings for the bilinear model of the matrix of independent variables (columns of the matrix P – see above).
`scores`	X and Y component scores. The X component scores are the columns of the matrix T and are mutually orthogonal. The Y component scores, usually given the symbol u, are not in fact needed in the calculation of the PLS model unless an iterative algorithm is used (see method section). They are provided here for completeness, as sometimes it is useful to plot the Y component scores against the X component scores to give a visual indication of the degree of fit for each PLS dimension.
`leverages`	measure of leverage.
`xerrors`	residual sum of squares and residual standard deviations for all the independent variables. When `NGROUPS`>1 additional statistics are calculated from the cross-validated residuals, derived when each object is left out. The PRESS value is equal to the sum of squares of cross-validated standard deviations for each X variable multiplied by N-1, where N is the total number of observations. The cross-validated standard deviations may therefore be used to measure the predictive ability of the model for each of the variables.
`yerrors`	residual sum of squares and residual standard deviations for all the dependent variables (see xerrors above).
`scree`	scree diagram of PRESS.
`xpercent`	percentage variance explained for the X variables.
`ypercent`	percentage variance explained for the Y variables.
`predictions`	predicted values for any observations that were not included in the PLS model but were supplied using the `XPREDICTIONS` parameter.
`groups`	details of groupings used for cross-validation.
`estimates`	estimated PLS regression coefficients.
`fittedvalues`	fitted values from the PLS regressions.

The default settings are estimates, xpercent, ypercent, scores, xloadings, yloadings, ploadings.
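The relationship between the standardized Y loadings and the matrix Q of the bilinear model, described under yloadings above, can be written out as follows. The symbols c_a (columns of C), b_a (diagonal elements of B) and A (the number of dimensions extracted) are notation introduced here for illustration only.

```latex
\[
Q = C\,B, \qquad B = \operatorname{diag}(b_1,\dots,b_A), \qquad
q_a = b_a\,c_a \quad\text{with}\quad \lVert c_a \rVert = 1 ,
\]
\[
Y = T\,Q' + F = T\,B\,C' + F .
\]
```

Because B is diagonal, each column of Q is simply the corresponding unit-length loading vector rescaled by the regression coefficient for that dimension.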

The data for PLS are supplied using the X and Y parameters, as pointers to variates containing the columns of the X and Y matrices. Other parameters allow output to be saved in appropriate data structures.

Options: PRINT, NROOTS, YSCALING, XSCALING, NGROUPS, SEED, LABELS, PLABELS.

Parameters: Y, X, YLOADINGS, XLOADINGS, PLOADINGS, YSCORES, XSCORES, B, YPREDICTIONS, XPREDICTIONS, ESTIMATES, FITTEDVALUES, LEVERAGES, PRESS, RSS, YRESIDUALS, XRESIDUALS, XPRESIDUALS.

Method

Although the PLS method is often presented in terms of an iterative algorithm (Manne 1987), the X block loadings vector for the first PLS dimension (w₁) is simply the eigenvector of X′YY′X corresponding to its largest eigenvalue. To find the second and subsequent dimensions, X and Y are deflated by orthogonalising with respect to the current PLS factor (t=Xw) and the eigenanalysis repeated. The above approach was adopted by Rogers (1987) in an implementation of a Genstat 4 macro. Here we adopt a very similar approach by performing a singular value decomposition on the matrix X′Y which simultaneously obtains loading vectors for both data blocks (Hoskuldsson 1988, de Jong & ter Braak 1994).
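The extraction of one dimension can be summarised as below. This is a sketch in one standard notation, not a transcription of the procedure's code: the decomposition X′Y = USV′, the vector c₁ for the unit-length Y loadings and the explicit projection form of the deflation are assumptions made for illustration.

```latex
\[
X'Y = U S V', \qquad w_1 = u_1, \quad c_1 = v_1, \qquad
X'YY'X\,w_1 = s_1^{2}\,w_1 ,
\]
\[
t = Xw, \qquad
X \leftarrow X - t\,(t't)^{-1}\,t'X, \qquad
Y \leftarrow Y - t\,(t't)^{-1}\,t'Y .
\]
```

The first line shows why the two descriptions agree: the leading left singular vector of X′Y is the eigenvector of X′YY′X with the largest eigenvalue. The second line removes from both blocks everything that lies along the current factor t, so that the factor found in the next cycle is orthogonal to it.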

It is usual to centre all variables prior to a PLS analysis; the procedure does so automatically, even if the XSCALING and YSCALING options are not set. On exit from the procedure the variates pointed to by X and Y are unchanged.

Action with `RESTRICT`

The procedure will work with restricted variates, fitting a PLS model to the subset of objects indicated by the restriction. If there are different restrictions on different data variates then these restrictions will be combined and the analysis performed on the subset of samples that is common to all the restrictions. Note that the unrestricted length of all of the data variates must be the same and the number of samples in the common subset must be at least three. Any restrictions on a text supplied for the LABELS option or a factor for the SEED option will be ignored. On exit from the procedure all the data variates and, if supplied, the SEED factor and LABELS text will be returned restricted to the common subset of samples. Output data structures that correspond to the samples (i.e. XSCORES, YSCORES, FITTEDVALUES, LEVERAGES, YRESIDUALS and XRESIDUALS) will also be returned restricted to the common subset, and missing values will be used for those values that have been restricted out.

When restricted data are supplied and LABELS are also given then the appropriate subset of labels will appear in the output; if LABELS are not defined then default labels reflecting the position of the restricted data in the unrestricted variate will be used instead.

No restrictions are allowed in the variates supplied in the XPREDICTIONS parameter or the PLABELS option.
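The behaviour described in this section might be exercised as in the sketch below. It assumes that L[1...6] and %Protein[1] from the Example section have already been read; the cut-off of 9 is an arbitrary illustration and Fit is a hypothetical identifier.

```genstat
" Sketch only: the cut-off of 9 is arbitrary and Fit is a
  hypothetical name. "
" Restrict a single variate; PLS combines the restrictions found
  on all of the data variates "
RESTRICT %Protein[1]; CONDITION=%Protein[1] > 9
PLS     [PRINT=estimates,ypercent; NROOTS=2] Y=%Protein; X=L;\
        FITTEDVALUES=Fit
" On exit every data variate carries the common restriction, so
  clear it on all of them before any unrestricted analysis "
RESTRICT %Protein[1],L[1...6]
```

With the wheat data this restriction retains 17 of the 24 samples, comfortably above the minimum of three. The fitted values saved in Fit are returned restricted in the same way, with missing values for the excluded samples.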

References

Helland, I.S. (1988). On the structure of partial least squares regression. Commun. Statist. - Simula. Comput., 17, 581-607.

Hoskuldsson, A. (1988). PLS Regression Methods. J. Chemometrics, 2, 211-228.

de Jong, S. & ter Braak, C.J.F. (1994). Comments on the PLS kernel algorithm. J. Chemometrics, 8, 169-174.

Manne, R. (1987). Analysis of two partial least squares algorithms for multivariate calibration. Chemometrics and Intell. Lab. Systems, 2, 187-197.

Martens, H. & Naes, T. (1989). Multivariate Calibration. John Wiley, Chichester.

Osten, D.W. (1988). Selection of optimal regression models via cross-validation. J. Chemometrics, 2, 39-48.

Rogers, C.A. (1987). A Genstat Macro for Partial Least Squares Analysis with Cross-Validation Assessment of Model Dimensionality. Genstat Newsletter, 18, 81-92.

Example

CAPTION 'PLS example',!t('The data are 24 calibration',\
        'samples used to determine the protein content of wheat',\
        'from spectroscopic readings at six different wavelengths',\
        '(Fearn, T., 1983, Applied Statistics, 32, 73-79).'); STYLE=meta,plain
VARIATE [NVALUES=24] L[1...6],%Protein[1]
READ    L[1...6],%Protein[1]
468 123 246 374 386 -11  9.23   458 112 236 368 383 -15  8.01
457 118 240 359 353 -16 10.95   450 115 236 352 340 -15 11.67
464 119 243 366 371 -16 10.41   499 147 273 404 433   5  9.51
463 119 242 370 377 -12  8.67   462 115 238 370 353 -13  7.75
488 134 258 393 377  -5  8.05   483 141 264 384 398  -2 11.39
463 120 243 367 378 -13  9.95   456 111 233 365 365 -15  8.25
512 161 288 415 443  12 10.57   518 167 293 421 450  19 10.23
552 197 324 448 467  32 11.87   497 146 271 407 451  11  8.09
592 229 360 484 524  51 12.55   501 150 274 406 407  11  8.38
483 137 260 385 374  -3  9.64   491 147 269 389 391   1 11.35
463 121 242 366 353 -13  9.70   507 159 285 410 445  13 10.75
474 132 255 376 383  -7 10.75   496 152 276 396 404   6 11.47  :
" Fit a 3 dimensional PLS model to the standardized data using
  leave-one-out cross-validation. All three dimensions are
  significant using Osten's test"
PLS     [PRINT=estimates,xpercent,ypercent,xloadings,yloadings,ploadings;\
        NROOTS=3; NGROUPS=24; SEED=57133; XSCALING=yes; YSCALING=yes]\
        Y=%Protein; X=L

Updated on March 6, 2019
