Forms predictions from a linear or generalized linear model.

### Options

| Option | Description |
| --- | --- |
| `PRINT` = string token | What to print (`description`, `lsd`, `predictions`, `se`, `sed`, `vcovariance`); default `desc`, `pred`, `se` |
| `CHANNEL` = scalar | Channel number for output; default `*` i.e. current output channel |
| `COMBINATIONS` = string token | Which combinations of factors in the current model to include (`full`, `present`, `estimable`); default `esti` |
| `ADJUSTMENT` = string token | Type of adjustment (`marginal`, `equal`); default `marg` |
| `WEIGHTS` = table | Weights classified by some or all of the factors in the model; default `*` |
| `OFFSET` = scalar | Value of offset on which to base predictions; default mean of offset variate |
| `METHOD` = string token | Method of forming margin (`mean`, `total`); default `mean` |
| `ALIASING` = string token | How to deal with aliased parameters (`fault`, `ignore`); default `faul` |
| `BACKTRANSFORM` = string token | What back-transformation to apply to the values on the linear scale, before calculating the predicted means (`link`, `none`); default `link` |
| `SCOPE` = string token | Whether the variance of predictions is calculated on the basis of forecasting new observations rather than summarizing the data to which the model has been fitted (`data`, `new`); default `data` |
| `NOMESSAGE` = string tokens | Which warning messages to suppress (`dispersion`, `nonlinear`); default `*` |
| `DISPERSION` = scalar | Value of the dispersion parameter in the calculation of s.e.s; default is as set in the `MODEL` statement |
| `DMETHOD` = string token | Basis of the estimate of dispersion, if not fixed by the `DISPERSION` option (`deviance`, `Pearson`); default is as set in the `MODEL` statement |
| `NBINOMIAL` = scalar | Total number of trials to be used for prediction with a binomial distribution (a value *n* greater than one allows predictions of the number of “successes” out of *n*, whereas the value one predicts the proportion of successes); default 1 |
| `PREDICTIONS` = tables or scalars | Saves predictions for each y variate; default `*` |
| `SE` = tables or scalars | Saves standard errors of predictions for each y variate; default `*` |
| `SED` = symmetric matrices | Saves standard errors of differences between predictions for each y variate; default `*` |
| `LSD` = symmetric matrices | Saves least significant differences between predictions for each y variate (models with Normal errors only); default `*` |
| `LSDLEVEL` = scalar | Significance level (%) to use in the calculation of least significant differences; default 5 |
| `VCOVARIANCE` = symmetric matrices | Saves variance-covariance matrices of predictions for each y variate; default `*` |
| `SAVE` = identifier | Specifies the save structure of the model from which to predict; default `*` i.e. that from the latest model fitted |

### Parameters

| Parameter | Description |
| --- | --- |
| `CLASSIFY` = vectors | Variates and/or factors to classify the table of predictions |
| `LEVELS` = variates, scalars or texts | To specify values of variates and levels of factors |
| `PARALLEL` = identifiers | For each vector in the `CLASSIFY` list, allows you to specify another vector in the `CLASSIFY` list with which the values of this vector should change in parallel (you then obtain just one dimension in the table of predictions for these vectors) |
| `NEWFACTOR` = identifiers | Identifiers for new factors that are defined when `LEVELS` are specified |

### Description

The `PREDICT` directive can be used after the `FIT` directive to summarize the results of the regression, by using the fitted relationship to predict the values of the response variate at particular values of the explanatory variables. `CLASSIFY`, the first parameter of `PREDICT`, specifies those variates or factors in the current regression model whose effects you want to summarize. Any variate or factor in the current model that you do not include will be standardized in some way, as described below.

The `LEVELS` parameter specifies the values at which the summaries are to be calculated, for each of the structures in the `CLASSIFY` list. For factors, you can select some or all of the levels, while for variates you can specify any set of values. A single level or value is represented by a scalar; several levels or values must be combined into a variate (which may of course be unnamed). Alternatively, if the factor has labels, you can use these to select the levels for the summaries by setting `LEVELS` to a text. A missing value in the `LEVELS` parameter is taken by Genstat to stand for all the levels of a factor, or for the mean value of a variate.
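
As an illustration (a hypothetical sketch: the variate `Rate` and the labelled factor `Variety` are assumed names, not taken from this page):

`" A single value of a variate is given as a scalar. "`

`PREDICT Rate; LEVELS=2.5`

`" Several values must be combined into a (possibly unnamed) variate. "`

`PREDICT Rate; LEVELS=!(1,2,3,4)`

`" Levels of a labelled factor can be selected by an unnamed text. "`

`PREDICT Variety; LEVELS=!t(early,late)`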

The `PARALLEL` parameter allows you to indicate that a factor or variate should change in parallel with another factor or variate. Both must have the same number of values specified for them by the `LEVELS` parameter of `PREDICT`. The predictions are then formed for each corresponding set of values rather than for every combination of these values. For example, suppose we had fitted a quadratic model with explanatory variates `X` and `Xsquared`. We could then put

`PREDICT Xsquared,X; PARALLEL=X,*;\`

` LEVELS=!(0,4,16,36,64,100),!(0,2,4,6,8,10)`

The `PARALLEL` parameter specifies that `Xsquared` should change in parallel with `X`, so that we obtain predictions only for matching values.

When you specify `LEVELS`, `PREDICT` needs to define a new factor to classify that dimension of the table. By default this will be an unnamed factor, but you can use the `NEWFACTOR` parameter to give it an identifier. The `EXTRA` attribute of the factor is set to the name of the corresponding factor or variate in the `CLASSIFY` list; this will then be used to label that dimension of the table of predictions.
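
For example (a sketch with assumed names), you could name the factor created for a set of variate values and use it later with the saved results:

`PREDICT [PREDICTIONS=Pred] Rate; LEVELS=!(1,2,3); NEWFACTOR=Frate`

`Frate` is then a three-level factor classifying the saved table `Pred`, with its `EXTRA` attribute set to `Rate`.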

You can best understand how Genstat forms predictions by regarding its calculations as consisting of two steps. The first step, referred to below as Step A, is to calculate the full table of predictions, classified by every factor in the current model. For any variate in the model, the predictions are formed at its mean, unless you have specified some other values using the `LEVELS` parameter; if so, these are then taken as a further classification of the table of predictions. The second step, referred to as Step B, is to average the full table of predictions over the classifications that do not appear in the `CLASSIFY` parameter: you can control the type of averaging using the `COMBINATIONS`, `ADJUSTMENT` and `WEIGHTS` options. By default, the predictions are made at the mean of any offset variate, but the `OFFSET` option can be used to specify another value at which the predictions should be made instead.

Printed output is controlled by the settings of the `PRINT` option:

| Setting | What is printed |
| --- | --- |
| `description` | the standardization policies used when forming the predictions |
| `predictions` | the predictions |
| `se` | the predictions and their standard errors |
| `sed` | standard errors for differences between the predictions |
| `lsd` | least significant differences between the predictions (ordinary linear regression models or generalized linear models with the Normal distribution only) |
| `vcovariance` | the variances and covariances of the predictions |

By default, descriptions, predictions and standard errors are printed. The standard errors (and sed’s) are relevant for the predictions when considered as means of those data that have been analysed, with the means formed according to the averaging policy defined by the options of `PREDICT`. The word *prediction* is used because these are predictions of what the means would have been if the factor levels had been replicated differently in the data; see Lane & Nelder (1982) for more details. The `LSDLEVEL` option specifies the significance level (%) to use in the calculation of least significant differences (default 5%).
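
For example, to print 1% least significant differences for the predictions (a sketch, assuming a factor `Variety` in the current model):

`PREDICT [PRINT=predictions,lsd; LSDLEVEL=1] Variety`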

By default, the standard errors (and sed’s) are not augmented by any component corresponding to the estimated variability of a new observation. However, you can set option `SCOPE=new` to request that the variance of predictions should be calculated on the basis of forecasting new observations rather than of summarizing the data to which the model has been fitted. This setting cannot be used if the predictions are to be standardized for the effects of any factors in the model; in other words, all factors in the current model must be listed in the `CLASSIFY` parameter of the `PREDICT` statement. In addition, it cannot be used when making predictions from generalized linear models with option `BACKTRANSFORM=none`, nor with weighted regression. The effect of `SCOPE=new` is to form variances for each predicted value by combining the variance of the estimated mean value of the prediction (as produced for `SCOPE=data`) with the estimated variance of a new observation with the same values of the explanatory variates and factors:

“new” variance = “data” variance + (dispersion × variance function)
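
In the worked example at the end of this page, standard errors for future observations are constructed explicitly from the residual mean square. With a Normal model (where the variance function is 1 and the dispersion is estimated by the residual mean square) the same quantity could be requested directly, for instance:

`PREDICT [SCOPE=new] Boiltemp; LEVELS=190`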

The `DISPERSION` and `DMETHOD` options allow you to change the method by which the variance of the distribution of the response values is obtained for calculating the standard errors. These options operate like the corresponding options of `MODEL` (except that they apply only to the current statement). The default is to use the method as originally defined by the `MODEL` statement.

The `NBINOMIAL` option can be used to supply the total number of trials to be used for prediction with a binomial distribution when option `BACKTRANSFORM` is set to `link`. If you provide a value *n* greater than one, Genstat will predict the number of “successes” out of *n*. The default, `NBINOMIAL=1`, causes Genstat to predict the proportion of successes.
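
For instance (a hypothetical sketch, assuming mortality counts `Deaths` out of totals `N` modelled against a variate `Logdose`):

`MODEL [DISTRIBUTION=binomial; LINK=logit] Deaths; NBINOMIAL=N`

`FIT Logdose`

`" Predict the proportion dying at a log-dose of 1. "`

`PREDICT Logdose; LEVELS=1`

`" Predict the number dying out of batches of 50. "`

`PREDICT [NBINOMIAL=50] Logdose; LEVELS=1`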

You can send the output to another channel, or to a text structure, by setting the `CHANNEL` option.

The `COMBINATIONS` option specifies which cells of the full table in Step A are to be filled for averaging in Step B. The default, `COMBINATIONS=estimable`, uses all the cells other than those that involve parameters that cannot be estimated, for example because of aliasing. Alternatively, you can set `COMBINATIONS=present` to exclude cells for factor combinations that do not occur in the data, or `COMBINATIONS=full` to use all the cells. When `COMBINATIONS=estimable` or `COMBINATIONS=present`, the `LEVELS` parameter is overruled. Any subsets of factor levels in the `LEVELS` parameter are ignored, and predictions are formed for all the factor levels that occur in the data or are estimable. Likewise, the full table cannot then be classified by any sets of values of variates; the `LEVELS` parameter must then supply only single values for variates.
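
To illustrate the distinction (a sketch, assuming two factors `A` and `B` in the current model):

`" Form predictions for every combination of the levels of A and B. "`

`PREDICT [COMBINATIONS=full] A,B`

`" Exclude combinations of A and B that do not occur in the data. "`

`PREDICT [COMBINATIONS=present] A,B`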

The `ADJUSTMENT` and `WEIGHTS` options define how the averaging is done in Step B. Values in the full table produced in Step A are averaged with respect to all those factors that you have not included in the settings of the `CLASSIFY` parameter. By default, the levels of any such factor are combined with what we call *marginal weights*: that is, by the number of occurrences of each of its levels in the whole dataset. The `ADJUSTMENT` and `WEIGHTS` options allow you to change the weights. The setting `ADJUSTMENT=equal` specifies that the levels are to be weighted equally. (This corresponds to the default weighting used by `VPREDICT`.) The `WEIGHTS` option is more powerful than the `ADJUSTMENT` option, allowing you to specify an explicit table of weights. This table can be classified by any, or all, of the factors over whose levels the predictions are to be averaged; the levels of the remaining factors will be weighted according to the `ADJUSTMENT` option. Moreover, you can classify the weights by the factors in the `CLASSIFY` parameter as well, to provide different weightings for different combinations of levels of these factors. If you supply explicit weights in the `WEIGHTS` option, any setting of the `COMBINATIONS` option is ignored. You will find explicit weights useful in particular when you have population estimates of the proportions of each level of a factor – proportions which may not be matched well in the available data.
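
For example (a sketch with assumed names and an assumed form of table declaration), predictions for each treatment could be averaged over the sexes using known population proportions instead of the proportions in the data:

`TABLE [CLASSIFICATION=Sex] Popwt; VALUES=!(0.49,0.51)`

`PREDICT [WEIGHTS=Popwt] Treatment`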

If a model contains any aliased parameters, predicted values cannot be formed for some cells of the full table without assuming a value for the aliased parameters. With the default setting, `COMBINATIONS=estimable`, no predictions are formed for these cells. When `COMBINATIONS=full`, if the aliased parameters simply represent effects of variates that are correlated with other explanatory variables in the model, it may be sufficient just to ignore them. This can be done by setting the `ALIASING` option to `ignore`. The aliased parameters are then taken to be zero, and fitted values are calculated for all cells of the table from the remaining parameters in the model. Aliasing can also occur if there are some combinations of factors that do not occur in the data, and here it may be more sensible to set option `COMBINATIONS=present` so that these cells are all excluded from the calculation of predictions. The final way to overcome aliasing is to supply explicit weights using the `WEIGHTS` option.

Averaging is usually the appropriate way of combining predicted values over the levels of a factor. But sometimes summation is needed, for example in the analysis of counts by log-linear models. You can achieve this by setting the `METHOD` option to `total`. The rules about weights and so on still apply. In a generalized linear model, averaging is done by default on the scale of the original response variable, not on the scale transformed by the link function. In other words, linear predictors are formed for all the combinations of factor levels and variate values specified by `PREDICT`, and then transformed by the link function back to the natural scale. This back-transformation may be useful when you are reporting results, since the tables from `PREDICT` can then be interpreted as natural averages of means predicted by the fitted model. You can set option `BACKTRANSFORM=none` if you want the averaging to be done on the scale of the linear predictor; `PREDICT` will then form averages and report predictions on the transformed scale.
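
For example (a sketch, assuming counts classified by factors `Region` and `Month`), totals rather than means might be wanted from a log-linear model:

`MODEL [DISTRIBUTION=poisson; LINK=log] Count`

`FIT Region*Month`

`" Total the predicted counts over months, on the natural (count) scale. "`

`PREDICT [METHOD=total] Region`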

`PREDICT` calculates the standard errors of predictions from iterative models by using first-order approximations that allow for the effect of the link function. Thus you should interpret them only as a rough guide to the variability of individual predictions.

The `PREDICTIONS`, `SE`, `SED`, `LSD` and `VCOVARIANCE` options let you save the results of `PREDICT` as well as, or instead of, printing them.
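
For example (a sketch in the style of the worked example at the end of this page, which uses the explanatory variate `Boiltemp`):

`PREDICT [PRINT=*; PREDICTIONS=Pred; SE=Sepred] Boiltemp; LEVELS=!(190,200,210)`

`PRINT Pred,Sepred; DECIMALS=2`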

The `SAVE` option allows you to specify the regression save structure of the analysis on which the predictions are based. If `SAVE` is not set, the most recent regression model is used.

The `NOMESSAGE` option controls the printing of messages. The `nonlinear` setting suppresses messages about the approximate nature of standard errors of predictions in generalized linear models, and the `dispersion` setting prevents reminders appearing about the basis of the standard errors.

Options: `PRINT`, `CHANNEL`, `COMBINATIONS`, `ADJUSTMENT`, `WEIGHTS`, `OFFSET`, `METHOD`, `ALIASING`, `BACKTRANSFORM`, `SCOPE`, `NOMESSAGE`, `DISPERSION`, `DMETHOD`, `NBINOMIAL`, `PREDICTIONS`, `SE`, `SED`, `LSD`, `LSDLEVEL`, `VCOVARIANCE`, `SAVE`.

Parameters: `CLASSIFY`, `LEVELS`, `PARALLEL`, `NEWFACTOR`.

### Reference

Lane, P.W. & Nelder, J.A. (1982). Analysis of covariance and standardization as instances of prediction. *Biometrics*, 38, 613-621.

### See also

Directives: `FIT`, `RDISPLAY`, `VPREDICT`.

Procedure: `HGPREDICT`.

Commands for: Regression analysis.

### Example

" Example PRED-1: Prediction from simple linear regression.
  Attempt to find a linear relationship between the boiling point of
  water and barometric pressure, to allow prediction of pressure and
  thus of altitude."
" Read and display the data."
READ Boiltemp,Pressure
194.5 20.79 194.3 20.79 197.9 22.40 198.4 22.67 199.4 23.15
199.9 23.35 200.9 23.89 201.1 23.99 201.4 24.02 201.3 24.01
203.6 25.14 204.6 26.57 209.5 28.49 208.6 27.76 210.7 29.04
211.9 29.88 212.2 30.06 :
DGRAPH Pressure; Boiltemp
" Regress pressure on boiling point."
MODEL Pressure
FIT Boiltemp
" Predict pressure when boiling point is 190."
PREDICT Boiltemp; LEVEL=190
" Print a chart of predictions for a range of temperatures including
  standard errors of the predicted means and standard errors for
  future observations."
VARIATE [VALUES=190,192...216] temp
PREDICT [PRINT=*; PREDICT=predict; SE=sepred] Boiltemp; LEVEL=temp
RKEEP DEVIANCE=rss; DF=rdf
CALCULATE sefuture = SQRT(sepred**2 + rss/rdf)
PRINT predict,sepred,sefuture; DECIMALS=2