Defines the response variate(s) and the type of model to be fitted for linear, generalized linear, generalized additive and nonlinear models.

### Options

| Option | Description |
|---|---|
| `DISTRIBUTION` = string token | Distribution of the response variable (`normal`, `poisson`, `binomial`, `gamma`, `inversenormal`, `multinomial`, `calculated`, `negativebinomial`, `geometric`, `exponential`, `bernoulli`); default `norm` |
| `LINK` = string token | Link function (`canonical`, `identity`, `logarithm`, `logit`, `reciprocal`, `power`, `squareroot`, `probit`, `complementaryloglog`, `calculated`, `logratio`); default `cano` (i.e. `iden` for `DIST=norm` or `calc`; `loga` for `DIST=pois`; `logi` for `DIST=bino`, `bern` or `mult`; `reci` for `DIST=gamm` or `expo`; `powe` for `DIST=inve`; `logr` for `DIST=nega` or `geom`) |
| `EXPONENT` = scalar | Exponent for power link; default -2 |
| `AGGREGATION` = scalar | Fixed parameter for negative binomial distribution (parameter *k* as in variance function Var = mean + mean²/*k*); default 1 |
| `KLOGRATIO` = scalar | Parameter for logratio link, in form log(mean/(mean+*k*)); default as set in `AGGREGATION` option |
| `DISPERSION` = scalar | Value of dispersion parameter in calculation of s.e.s etc; default `*` for `DIST=norm`, `gamm`, `inve` or `calc`, and 1 for `DIST=pois`, `bino`, `mult`, `nega`, `geom`, `expo` or `bern` |
| `WEIGHTS` = variate or symmetric matrix | Variate of weights for weighted regression, or symmetric matrix of weights (one row and column for each unit of data) for generalized least squares; default `*` |
| `OFFSET` = variate | Offset variate to be included in model; default `*` |
| `GROUPS` = factor | Absorbing factor defining the groups for within-groups linear or generalized linear regression; default `*` |
| `RMETHOD` = string token | Type of residuals to form, if any, after each model is fitted (`deviance`, `Pearson`, `simple`); default `devi` |
| `DMETHOD` = string token | Basis of estimate of dispersion, if not fixed by `DISPERSION` option (`deviance`, `Pearson`); default `devi` |
| `FUNCTIONVALUE` = scalar | Scalar whose value is to be minimized by calculation; default `*` |
| `YRELATION` = string token | Whether to analyse the y-variates separately, as in ordinary regression, or to analyse them cumulatively as counts in successive categories of a multinomial distribution (`separate`, `cumulative`); default `sepa` |
| `DCALCULATION` = expression structures | Calculations to define the deviance contributions and variance function for a non-standard distribution; must be specified when `DIST=calc` |
| `LCALCULATION` = expression structures | Calculations to define the fitted values and link derivative for a non-standard link; must be specified when `LINK=calc` |
| `DFDISPERSION` = scalar | Number of degrees of freedom for a dispersion parameter specified by the `DISPERSION` option; if this is not set, the supplied dispersion is assumed to be known exactly |
| `SAVE` = identifier | To name regression save structure; default `*` |

### Parameters

| Parameter | Description |
|---|---|
| `Y` = variates | Response variates; only the first is used in nonlinear models and in generalized linear models except when `DIST=mult`, when they specify the numbers in each category of an ordinal response model |
| `NBINOMIAL` = variate or scalar | Total numbers for `DIST=bino` |
| `RESIDUALS` = variates | To save residuals for each y variate after fitting a model |
| `FITTEDVALUES` = variates | To save fitted values, and provide fitted values if no terms are given in `FITNONLINEAR` |
| `LINEARPREDICTOR` = variate | Specifies the identifier of the variate to hold the linear predictor |
| `DERIVATIVE` = variate | Specifies the identifier of the variate to hold the derivative of the link function at each unit |
| `DEVIANCE` = variate | Specifies the identifier of the variate to hold the contribution to the deviance from each unit |
| `VFUNCTION` = variate | Specifies the identifier of the variate to hold the value of the variance function at each unit |

### Description

The `MODEL` directive does not itself fit anything: it sets up structures inside Genstat that are used when you later give a `FIT`, `FITCURVE` or `FITNONLINEAR` statement. So, in a regression analysis, `MODEL` is always accompanied by at least one other regression statement, such as `FIT`, to fit a model.

The `Y` parameter allows a list of variates; if you supply more than one for linear regression, you will get an analysis for each. This is more efficient than separate pairs of `MODEL` and `FIT` statements when you want many linear regressions on the same explanatory variables. With additive models, generalized linear models and nonlinear models, only the first variate will be analysed (with the exception of multinomial response models); the others will be ignored.

The `RESIDUALS` and `FITTEDVALUES` parameters allow you to specify variates to contain the residuals and fitted values for each response variable. The residuals are the “unexplained” component of the response variable, standardized in some way according to the `RMETHOD` option. The fitted values are the “explained” component: that is, the combination of parameters and explanatory variables fitted in the model. You can get access to these sets of values in a different way through the `RKEEP` directive.
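As a minimal sketch of the two paragraphs above, the following statements fit the same explanatory variate to two responses with a single `MODEL` and `FIT` pair, and save the residuals and fitted values for each response. The data values and the identifiers `X`, `Y1`, `Y2`, `R1`, `R2`, `F1` and `F2` are hypothetical.

```genstat
" Hypothetical data: two responses recorded on the same six units "
VARIATE [VALUES=1,2,3,4,5,6] X
VARIATE [VALUES=2.1,3.9,6.2,7.8,10.1,12.2] Y1
VARIATE [VALUES=5.0,4.1,3.2,2.0,1.1,0.2] Y2

" One MODEL statement for both responses: R1 and F1 go with Y1,
  R2 and F2 go with Y2 "
MODEL Y1,Y2; RESIDUALS=R1,R2; FITTEDVALUES=F1,F2
FIT X
PRINT Y1,F1,R1
```

The parameter lists are taken in parallel, so each response variate is paired with the residual and fitted-value variates in the same position.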

The `DISTRIBUTION` and `LINK` options are used to specify a *generalized linear model* (McCullagh & Nelder 1989). By default the data are assumed to follow a Normal distribution, as required for ordinary linear regression, but other distributions can be selected using the `DISTRIBUTION` option. The `LINK` option specifies the *link function* that relates the linear model to the expected values of the distribution; in the default ordinary linear regression, this is the identity function (indicating no transformation). So, for example, for a log-linear model we would specify `DISTRIBUTION=Poisson` and `LINK=log`, while for logistic regression we would have `DISTRIBUTION=binomial` and `LINK=logit`. The `NBINOMIAL` parameter must also be set when `DISTRIBUTION=binomial`, to give the number of binomial trials for each unit.
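The two models mentioned above might be set up as in the following sketch. The data and the identifiers `Dose`, `Total`, `Dead` and `Count` are hypothetical, and the link setting is written out in full as `logarithm`.

```genstat
" Hypothetical dose-response data: Dead out of Total insects at each dose "
VARIATE [VALUES=1,2,3,4,5] Dose
VARIATE [VALUES=20,20,20,20,20] Total
VARIATE [VALUES=2,5,9,14,18] Dead

" Logistic regression: NBINOMIAL supplies the number of trials per unit "
MODEL [DISTRIBUTION=binomial; LINK=logit] Dead; NBINOMIAL=Total
FIT Dose

" Log-linear model for hypothetical counts at the same doses "
VARIATE [VALUES=3,6,8,15,24] Count
MODEL [DISTRIBUTION=poisson; LINK=logarithm] Count
FIT Dose
```

Because `logit` and `logarithm` are the canonical links for the binomial and Poisson distributions, the `LINK` settings here could be omitted; they are shown only to make the choice explicit.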

The `EXPONENT` option specifies the exponent when `LINK=power`. Similarly, the `AGGREGATION` option specifies the aggregation parameter *k* when `DISTRIBUTION=negativebinomial`. This is a measure of the tendency for observations to cluster together, which appears in the formula for the variance as a function of the mean:

variance = mean + mean²/*k*

The default value of *k* is set at 1, which corresponds to the geometric distribution. The parameter *k* must be positive, and as it increases to infinity the distribution approaches the Poisson distribution. The `KLOGRATIO` option sets the parameter *k* for the logratio link.
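As an illustration of fixing the aggregation parameter, the sketch below fits a negative binomial model with *k* = 2 and a logarithmic link. The counts, the identifiers `X` and `Count`, and the value chosen for *k* are all hypothetical.

```genstat
" Hypothetical overdispersed counts "
VARIATE [VALUES=1,2,3,4,5,6,7,8] X
VARIATE [VALUES=0,1,0,3,7,2,12,5] Count

" k is fixed at 2, so the variance function is mean + mean**2/2 "
MODEL [DISTRIBUTION=negativebinomial; LINK=logarithm; AGGREGATION=2] Count
FIT X
```

If `LINK` were left unset here, the canonical logratio link would be used, with its parameter taken from `AGGREGATION` unless `KLOGRATIO` is set.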

You can also define your own distribution or link function for a generalized linear model. To specify your own distribution, you need to set `DISTRIBUTION=calculated` and then specify expression structures with the `DCALCULATION` option to calculate the deviance and the variance function for each unit of the response variate, using the current values of the fitted-values variate. You must also set the `FITTEDVALUES`, `DEVIANCE` and `VFUNCTION` parameters to indicate which identifiers are used to represent these in the expressions.

To specify your own link, you need to set `LINK=calculated` and provide expressions with the `LCALCULATION` option for two other calculations to form the fitted values and the derivative of the link function for each unit of the response variate, using the current values of the linear predictor. You must also set the `FITTEDVALUES`, `LINEARPREDICTOR` and `DERIVATIVE` parameters to specify the identifiers used to represent these in the calculations. In addition, you must provide initial values for the linear predictor, so that the iterative process can get started: often this can be done just by applying the link function to the response variate itself, but it may be necessary to modify extreme values such as 0 that may be mapped to infinity by the link function.

You can fit ordinal response models by setting option `YRELATION=cumulative` and option `DISTRIBUTION=multinomial`.
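A minimal sketch of an ordinal response model follows, assuming the counts in each ordered category are held in separate variates, as described for the `Y` parameter. The data and the identifiers `Dose`, `Mild`, `Moderate` and `Severe` are hypothetical.

```genstat
" Hypothetical counts in three ordered categories at four dose levels "
VARIATE [VALUES=1,2,3,4] Dose
VARIATE [VALUES=12,9,5,2] Mild
VARIATE [VALUES=6,8,9,7] Moderate
VARIATE [VALUES=2,3,6,11] Severe

" The y-variates are analysed cumulatively as successive categories;
  LINK is left unset, so the default for the multinomial (logit) applies "
MODEL [DISTRIBUTION=multinomial; YRELATION=cumulative] Mild,Moderate,Severe
FIT Dose
```

The order of the variates in the `Y` list defines the order of the categories.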

The `DISPERSION` option controls how the variance of the distribution of the response values is calculated. By default, the variance is estimated from the residual mean square, and standard errors and standardized residuals are calculated from the estimate. If you use `DISPERSION` to supply a value for the variance of the Normal distribution, or for the dispersion parameter of other distributions, then standard errors and residuals are based on this given value instead. In a generalized linear model, the dispersion of the chosen distribution can be fixed at a value provided by the `DISPERSION` option, or estimated from either the residual deviance or the Pearson chi-square statistic, as specified by the `DMETHOD` option.

The `DFDISPERSION` option allows you to specify the number of degrees of freedom for a value specified by the `DISPERSION` option. You might want to use this, for example, if you had estimated the dispersion from some other data set. If `DFDISPERSION` is not set, the supplied dispersion is assumed to be known exactly.
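The dispersion settings described above might be used as in this sketch. The data, the identifiers `X`, `Count` and `Y`, the supplied dispersion of 2.5 and its 20 degrees of freedom are all made up for illustration.

```genstat
VARIATE [VALUES=1,2,3,4,5,6,7,8] X
VARIATE [VALUES=2,3,9,7,18,20,41,39] Count

" Poisson log-linear model with the dispersion estimated from the
  Pearson chi-square statistic instead of being fixed at 1 "
MODEL [DISTRIBUTION=poisson; LINK=logarithm; DISPERSION=*; DMETHOD=Pearson] Count
FIT X

" Normal model with the variance supplied as 2.5, treated as an
  estimate on 20 degrees of freedom rather than as known exactly "
VARIATE [VALUES=3.1,4.8,7.2,8.9,11.3,12.8,15.1,17.2] Y
MODEL [DISPERSION=2.5; DFDISPERSION=20] Y
FIT X
```

Setting `DISPERSION=*` for the Poisson model overrides its default of 1, so that standard errors reflect any overdispersion in the counts.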

The `WEIGHTS` option allows you to specify a variate holding weights for each unit. In simple linear regression, the estimate of dispersion is then the weighted residual mean square. Thus, if the variance of the response variable is not constant, and you know the relative size of the variance for each observation, you can set the weight to be proportional to the inverse of the variance of an observation. Alternatively, if the variance is related in a simple way to the mean, you may just need to specify a different distribution for the response. The `WEIGHTS` option can also be set to a symmetric matrix, supplying weights corresponding to some pattern of correlation or covariance between units as well as variance of each unit. The subsequent analysis is known as generalized least-squares if the response distribution is Normal.
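A minimal sketch of weighted regression with inverse-variance weights is given below. It assumes the relative variances are known and held in a variate; the data and the identifiers `X`, `Y`, `Relvar` and `W` are hypothetical.

```genstat
VARIATE [VALUES=1,2,3,4,5,6] X
VARIATE [VALUES=1.2,1.9,3.2,3.8,5.5,5.7] Y

" Relative variances of the six observations (assumed known) "
VARIATE [VALUES=1,1,2,2,4,4] Relvar

" Weights proportional to the inverse of the variance of each observation "
CALCULATE W = 1/Relvar
MODEL [WEIGHTS=W] Y
FIT X
```

Only the relative sizes of the weights matter for the parameter estimates; the estimate of dispersion is scaled accordingly.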

The `OFFSET` option allows you to include in the regression a variable with no corresponding parameter. Linear regression analysis of *Y* with offset *O* is just the same as analysis of *Y* − *O*, but the offset has non-trivial applications in generalized linear models.
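One common use of an offset in a generalized linear model is to model rates: with a logarithmic link, including the log of the exposure as an offset makes the expected count proportional to the exposure. The sketch below assumes this setting; the data and the identifiers `Exposure`, `Claims`, `Agegroup` and `Logexp` are hypothetical.

```genstat
" Hypothetical counts of claims with differing amounts of exposure "
VARIATE [VALUES=100,250,400,800,1200] Exposure
VARIATE [VALUES=2,4,9,15,26] Claims
VARIATE [VALUES=1,2,3,4,5] Agegroup

" The offset enters the linear predictor with its coefficient fixed at 1 "
CALCULATE Logexp = LOG(Exposure)
MODEL [DISTRIBUTION=poisson; LINK=logarithm; OFFSET=Logexp] Claims
FIT Agegroup
```

The fitted model is then log(mean) = log(exposure) + linear model, that is, a linear model for the log of the rate per unit of exposure.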

The `GROUPS` option specifies a factor whose effects you want to eliminate before any regression is fitted. The factor must already have been defined. This method of elimination is sometimes called *absorption*; you might want to use it when data from many different groups are to be modelled. Use of `GROUPS` gives less information than you would get if you included the factor explicitly in the model (leverages, predictions and some parameter correlations cannot be formed), but it saves space and time in fitting the model when the factor has many levels. You can use `GROUPS` only with linear and generalized linear regression.
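The following sketch shows absorption of a grouping factor. Only three groups are used to keep the example short, whereas the option is intended for factors with many levels. The data and the identifiers `Site`, `X` and `Y` are hypothetical.

```genstat
" The absorbing factor must be defined before the MODEL statement "
FACTOR [LEVELS=3; VALUES=1,1,1,2,2,2,3,3,3] Site
VARIATE [VALUES=1,2,3,1,2,3,1,2,3] X
VARIATE [VALUES=2.0,3.1,3.9,5.2,6.0,7.1,1.1,1.9,3.2] Y

" Site effects are eliminated before the regression on X is fitted "
MODEL [GROUPS=Site] Y
FIT X
```

The regression on `X` is then a within-groups regression, without a separate parameter being reported for each level of `Site`.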

The `RMETHOD` option controls how residuals are formed. By default, residuals are *deviance residuals* standardized by their estimated variance. The alternative *Pearson residuals* are defined in exactly the same way if the distribution is Normal, but for regression models with distributions other than Normal the two kinds of residual are different. If you do not want residuals, you can set the option to a missing value (`*`) to save space within Genstat. However, you will then not be able to get residuals, fitted values or leverages, and the automatic checks on the fit of a model will not be done.

The `FUNCTIONVALUE` option is relevant only when you want to use `FITNONLINEAR` to optimize a general function. It then identifies the scalar that stores the results in the expression that calculates the function to be minimized (see the `CALCULATION` option of `FITNONLINEAR`). This should calculate a deviance if you are using this general facility to fit a statistical model. `FUNCTIONVALUE` is ignored if the `Y` parameter of `MODEL` is set.

The `SAVE` option allows you to specify an identifier for the regression save structure. This structure stores the current state of the regression model, and can be used explicitly in the directives `RDISPLAY`, `RKEEP`, `PREDICT` and `RFUNCTION`. If the identifier in `SAVE` belongs to a regression save structure that already has values, those values are deleted. You can reset the current regression save structure at any point in a program by using the `SET` directive; later regression statements then use the model stored in that save structure.
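As a sketch of working with named save structures, the statements below keep two fitted models and then display the first one after the second has been fitted. The data and the identifiers `X`, `Y`, `Count`, `Linfit` and `Loglin` are hypothetical, and the example assumes that `RDISPLAY` accepts the save structure through a `SAVE` option with `PRINT=estimates`.

```genstat
VARIATE [VALUES=1,2,3,4,5,6] X
VARIATE [VALUES=2.1,3.9,6.2,7.8,10.1,12.2] Y
VARIATE [VALUES=1,3,4,8,13,22] Count

" Each MODEL statement names its own regression save structure "
MODEL [SAVE=Linfit] Y
FIT X
MODEL [DISTRIBUTION=poisson; LINK=logarithm; SAVE=Loglin] Count
FIT X

" Display estimates from the first model, not the most recent one "
RDISPLAY [SAVE=Linfit; PRINT=estimates]
```

Without the `SAVE` setting, `RDISPLAY` would report on the most recently fitted model.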

Options: `DISTRIBUTION`, `LINK`, `EXPONENT`, `AGGREGATION`, `KLOGRATIO`, `DISPERSION`, `WEIGHTS`, `OFFSET`, `GROUPS`, `RMETHOD`, `DMETHOD`, `FUNCTIONVALUE`, `YRELATION`, `DCALCULATION`, `LCALCULATION`, `DFDISPERSION`, `SAVE`.

Parameters: `Y`, `NBINOMIAL`, `RESIDUALS`, `FITTEDVALUES`, `LINEARPREDICTOR`, `DERIVATIVE`, `DEVIANCE`, `VFUNCTION`.

### Action with `RESTRICT`

You can restrict the units that Genstat will use for the regression by putting a restriction on any of the vectors involved in the `MODEL` statement (response variates, weight variate, offset variate, grouping factor or variate of binomial totals), or on any explanatory variate or factor in a subsequent `TERMS` statement. However, you are not allowed to have different restrictions on the different vectors. You should not alter the restriction applied to the vectors between the `TERMS` statement and subsequent fitting statements.

### Reference

McCullagh, P. & Nelder, J.A. (1989). *Generalized Linear Models* (second edition). Chapman and Hall, London.

### See also

Directives: `FIT`, `FITCURVE`, `FITNONLINEAR`, `TERMS`.

Commands for: `Regression analysis`.

### Example

    " Example FIT-1: Simple linear regression

      Modelling the relationship between counts of apples from 12 trees
      (recorded as 100s of fruit) and percentage damage by codling moth.
      (Snedecor & Cochran, Statistical analysis, 1980, p162.)"

    VARIATE [VALUES= 8, 6,11,22,14,17,18,24,19,23,26,40] Cropsize
    &       [VALUES=59,58,56,53,50,45,43,42,39,38,30,27] Wormy
    DGRAPH Wormy; Cropsize

    " It is expected that the larger the crop is the less the damage will be,
      since the density of the flying moths is unrelated to the crop size.
      Try fitting a linear model relating the percentage of damage directly
      to the size of the crop."
    MODEL Wormy
    FIT Cropsize

    " Tree number 4 seems different from the rest: perhaps it was not
      adequately protected by the standard spraying programme, or was on
      the side from which the codling moths flew in to the orchard.
      Tree number 12 has a much larger crop than the rest: the results of
      the regression are strongly influenced by this one observation.
      Display all the fitted values, residuals and leverages (influence)."
    RDISPLAY [PRINT=fittedvalues]

    " Check the effect of omitting tree number 4."
    RESTRICT Wormy; .NOT.EXPAND(4; 12)
    FIT [PRINT=summary] Cropsize

    " Return to the complete dataset, and display the fitted line."
    RESTRICT Wormy
    FIT [PRINT=*] Cropsize
    RGRAPH [GRAPHICS=high]

    " Plot the fitted values against the residuals, to check that the
      variance is roughly constant; use the procedure RCHECK from the
      Genstat Procedure Library."
    RCHECK [GRAPHICS=high] residual; fittedvalues