Fits generalized linear models to survey data (S.D. Langton).

### Options

| Option | Description |
|---|---|
| `PRINT` = string token | What output to display (`model`, `summary`, `estimates`, `wald`, `predictions`, `monitor`); default `mode`, `esti`, `wald`, `pred` |
| `DISTRIBUTION` = string token | Error distribution (`binomial`, `poisson`, `normal`, `gamma`); default `norm` |
| `LINK` = string token | Link function (`identity`, `logarithm`, `logit`, `reciprocal`, `probit`, `complementaryloglog`, `canonical`); default `cano` |
| `DISPERSION` = scalar | Value at which to fix the residual variance; if missing, the variance is estimated; default 1 for binomial or Poisson, otherwise `*` |
| `TERMS` = formula | Explanatory model |
| `CONSTANT` = string token | Whether to estimate or omit constant term in fixed model (`omit`, `estimate`); default `esti` |
| `FACTORIAL` = scalar | Limit on number of factors/covariates in a model term; default 3 |
| `PFACTORS` = factors or variates | Variables for which predictions are to be formed; default `*`, or as specified in `PTERMS` |
| `PLEVELS` = variates or scalars | Levels or values at which predictions are to be made corresponding to `PFACTORS`; default (weighted) mean for variates, all levels for factors |
| `PTERMS` = formula | Formula specifying fixed terms for which predicted means are to be printed; default `*`, unless `PFACTORS` is set, in which case it is all main effects of and interactions between `PFACTORS` |
| `STRATUMFACTOR` = factor | Stratification factor; default `*`, i.e. unstratified |
| `NUNITS` = variate or table | Number of primary sampling units in each stratum |
| `SAMPLINGUNITS` = factor | Factor indicating the primary sampling units; default `*`, i.e. single stage design |
| `WEIGHTS` = variates | Survey weights |
| `METHOD` = string token | Bootstrapping method (`simple`, `csimple`, `sarndal`); default `simp` |
| `NBOOT` = scalar | Number of bootstrap samples to use; default 0 uses a Taylor series approximation for `DISTRIBUTION=normal`, or a simple approximation otherwise |
| `SEED` = scalar | Seed for random number generator for bootstrap; default 0 |
| `CIPROBABILITY` = scalars | The probability level for the confidence intervals; default 0.95 |
| `CIMETHOD` = string token | Method for forming confidence intervals (`automatic`, `tdistribution`, `percentile`); default `auto` |

### Parameters

| Parameter | Description |
|---|---|
| `Y` = variates | Dependent variates |
| `NBINOMIAL` = scalars or variates | Number of binomial trials for each unit (must be set if `DISTRIBUTION=binomial`) |
| `RESIDUALS` = variates | Variates to save residuals |
| `FITTEDVALUES` = variates | Variates to save fitted values |
| `ESTIMATES` = variates | Estimates of parameters for each `Y` variate |
| `SE` = variates | Standard errors of the estimates |
| `VCOVARIANCE` = symmetric matrices | Variance-covariance matrix for the estimates |
| `LOWER` = variates | Lower confidence limits for estimates |
| `UPPER` = variates | Upper confidence limits for estimates |
| `WALD` = pointers | Pointers to save Wald statistics for each term (pointer contains name of term, Wald statistic, F statistic, degrees of freedom, and P-value) |
| `PREDICTIONS` = pointers | Pointers to tables of predictions |
| `SEPREDICTIONS` = pointers | Pointers to tables of standard errors of predictions |
| `LOWPREDICTIONS` = variates | Lower confidence limits for predictions |
| `UPPREDICTIONS` = variates | Upper confidence limits for predictions |
| `VCPREDICTIONS` = symmetric matrices | Variance-covariance matrix for the predictions |

### Description

`SVGLM` fits generalized linear models to data from one- or two-stage surveys. Variance estimates reflecting the survey design are obtained by a bootstrap method or a Taylor series approximation (Korn & Graubard 1999). Survey weights, which are supplied using the `WEIGHTS` option and which may be calculated by `SVWEIGHT`, are used to ensure that unbiased estimates of the finite survey population parameters are produced. Note that a weighted analysis is not the only way to handle such data; in some circumstances it may be preferable to use an unweighted analysis that includes factors reflecting the survey design (see, for example, Chapter 5 of Korn & Graubard 1999 for discussion of this subject). Mixed models, such as those fitted by the `REML` directive, the `GLMM` procedure or the `HGANALYSE` procedure, may be another way of accounting for the correlations induced in the data by the survey design.

The `DISTRIBUTION`, `LINK`, `DISPERSION`, `CONSTANT` and `FACTORIAL` options are used to specify the model in exactly the same way as in the `MODEL` directive. Similarly, the `Y` parameter supplies the response variable to be analysed and, for the binomial distribution, `NBINOMIAL` supplies the number of trials for each unit. The terms to be fitted are supplied using option `TERMS` as either a formula or, if no interactions are fitted, a list of variates and factors.

Information on the survey design is provided using the `STRATUMFACTOR` and `SAMPLINGUNITS` options. The option `NUNITS` can be used to list the number of primary sampling units per stratum, using a table or variate with one value for each stratum; this is used to calculate the appropriate degrees of freedom for test statistics and in construction of bootstrap samples.

The bootstrapping method is selected using the `METHOD` option. In a one-stage design the default of `simple` forms each bootstrap sample by sampling with replacement from the original sample within each stratum. In a two-stage design (i.e. if `SAMPLINGUNITS` is set), primary sampling units are first sampled with replacement, and then secondary units are sampled with replacement within the selected primary units. Variance estimates from the bootstrapping process will be biased where there are very few sampling units in each stratum, and so the method is not recommended in this situation. For a cluster sample the setting `csimple` should be used; this samples primary sampling units with replacement as for the two-stage design, but does not resample the secondary units within them. The setting `METHOD=sarndal` constructs a “pseudo-population” by replicating each sampled unit by the rounded value of its weight, so that, for example, an observation with weight 16.1 is represented sixteen times in the pseudo-population (see Sarndal *et al*. 1992, page 442). The bootstrap sample is formed by sampling with replacement from this pseudo-population. At present this method is only available for single-stage sampling.
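The three resampling schemes described above can be sketched in Python. This is an illustration of the ideas only, not the procedure's own code: the function names are hypothetical, the cluster version ignores stratification for brevity, the pseudo-population version assumes a bootstrap sample of the same size as the original sample, and Python's `round` is assumed to be an acceptable stand-in for the rounding of weights.

```python
import random
from collections import defaultdict


def simple_bootstrap(units, strata, rng):
    """One-stage 'simple' scheme: resample with replacement within each stratum."""
    by_stratum = defaultdict(list)
    for unit, stratum in zip(units, strata):
        by_stratum[stratum].append(unit)
    sample = []
    for stratum in sorted(by_stratum):
        members = by_stratum[stratum]
        # Each stratum keeps its original sample size.
        sample.extend(rng.choices(members, k=len(members)))
    return sample


def cluster_bootstrap(units, psus, rng, resample_within=False):
    """Resample primary sampling units (PSUs) with replacement.

    resample_within=False keeps each selected cluster intact ('csimple');
    resample_within=True also resamples secondary units inside each
    selected PSU, as described for two-stage designs.
    Assumption: a single stratum, to keep the sketch short.
    """
    by_psu = defaultdict(list)
    for unit, psu in zip(units, psus):
        by_psu[psu].append(unit)
    labels = sorted(by_psu)
    sample = []
    for psu in rng.choices(labels, k=len(labels)):
        members = by_psu[psu]
        if resample_within:
            members = rng.choices(members, k=len(members))
        sample.extend(members)
    return sample


def pseudo_population_bootstrap(units, weights, rng):
    """'sarndal' scheme: replicate each unit round(weight) times, then resample.

    Assumption: the bootstrap sample has the same size as the original sample.
    """
    pseudo_population = []
    for unit, weight in zip(units, weights):
        pseudo_population.extend([unit] * round(weight))
    return rng.choices(pseudo_population, k=len(units))
```

Passing a seeded `random.Random` instance as `rng` makes the resampling reproducible, which plays the same role as the `SEED` option.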

The number of bootstrap samples used is set by means of the `NBOOT` option. For exploratory analyses a relatively low value (perhaps 20) may suffice, but where test statistics or confidence limits are required a value of at least 500 is recommended. For ordinary linear regression (i.e. `DISTRIBUTION=normal`), setting `NBOOT` to zero calculates variances of regression parameters by a linearization approach similar to that used for means and totals by `SVTABULATE` (Binder 1983). For other generalized linear models, setting `NBOOT` to zero uses a simple approximation in which the weights are scaled to sum to the number of observations in the sample; this setting is only recommended for initial model fitting, as variance estimates will be seriously inaccurate, particularly in two-stage designs.
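The weight scaling used by the simple approximation can be illustrated with a short Python sketch. The function name and the example weights are hypothetical; only the scaling rule (weights rescaled to sum to the sample size) comes from the description above.

```python
def scale_weights(weights):
    """Rescale survey weights so that they sum to the number of observations."""
    n = len(weights)
    total = sum(weights)
    return [w * n / total for w in weights]


# Hypothetical weights for a sample of four units.
scaled = scale_weights([3.0, 3.0, 2.0, 4.0])
```

The scaled weights keep the relative contribution of each unit to the estimates, but they carry no information about strata or sampling units, which is consistent with the warning that the resulting variance estimates can be seriously inaccurate.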

Parameter estimates and their standard errors can be saved using the `ESTIMATES` and `SE` parameters, whilst `VCOVARIANCE` saves the full variance-covariance matrix. The `LOWER` and `UPPER` parameters save confidence limits for the estimates; by default 95% confidence limits are formed, but this may be changed by means of the `CIPROBABILITY` option. The `CIMETHOD` option controls how confidence limits are formed after bootstrapping: `percentile` uses simple percentiles of the bootstrapped distribution, whilst `tdistribution` calculates a standard error from the bootstrapped estimates and then uses the t-distribution to form intervals; the default of `automatic` uses the percentile method unless fewer than 400 bootstrap samples have been made.
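The two ways of forming intervals can be sketched in Python as follows. This is an illustration only, not the procedure's code: the function names, the nearest-rank percentile rule and the `t_crit` argument (the t critical value for the design's degrees of freedom, which the caller must supply) are assumptions. Treating `automatic` as falling back to the t-distribution method below 400 samples is also an assumption, since the text above only says when the percentile method is used.

```python
import statistics


def percentile_interval(boot_estimates, prob=0.95):
    """Read the limits directly off the sorted bootstrap estimates."""
    ordered = sorted(boot_estimates)
    tail = (1.0 - prob) / 2.0
    last = len(ordered) - 1
    lower = ordered[int(tail * last + 0.5)]          # nearest rank, lower tail
    upper = ordered[int((1.0 - tail) * last + 0.5)]  # nearest rank, upper tail
    return lower, upper


def t_interval(estimate, boot_estimates, t_crit):
    """Use the bootstrap standard deviation as a standard error."""
    se = statistics.stdev(boot_estimates)
    return estimate - t_crit * se, estimate + t_crit * se


def automatic_interval(estimate, boot_estimates, t_crit, prob=0.95):
    """Percentile limits unless there are fewer than 400 bootstrap samples."""
    if len(boot_estimates) < 400:
        return t_interval(estimate, boot_estimates, t_crit)
    return percentile_interval(boot_estimates, prob)
```

The percentile method needs enough bootstrap samples for the tails of the distribution to be populated, which is why a threshold on the number of samples is a natural switch between the two methods.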

Wald statistics (Korn & Graubard 1999) for terms in the model can be saved using parameter `WALD`, in the form of a pointer with elements corresponding to the term (as a text), the Wald statistic, the approximate F statistic, the two sets of degrees of freedom, and the probability value.

Predicted values can be formed from the analysis. These estimate the average value of the response variable that would have been expected in the population had all the units been in the specified group, or had the specified covariate value. The averages are taken over the distribution of the other fitted variables within the population (as deduced from the weighted sample). Factors and variates for which predictions are required are specified using the `PFACTORS` option, and particular levels or values may be specified using `PLEVELS`, which operates in the same way as the `LEVELS` parameter of `PREDICT`. Alternatively, `PTERMS` can be used to specify particular terms so that, for example, `PTERMS=A.B` would produce a two-way table classified by factors `A` and `B`. The parameters `PREDICTIONS`, `SEPREDICTIONS`, `LOWPREDICTIONS`, and `UPPREDICTIONS` save the tables of predictions, their standard errors, and the lower and upper confidence limits respectively. `VCPREDICTIONS` saves the full variance-covariance matrix of the bootstrapped predictions.
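The averaging that defines a prediction can be sketched in Python. The sketch assumes that a fitted model is available as a function `predict` mapping one unit's explanatory values to a fitted value on the response scale; that function, the name `marginal_prediction` and the example data are all hypothetical.

```python
def marginal_prediction(units, weights, predict, name, value):
    """Weighted mean fitted value with variable `name` set to `value` for every unit.

    All other explanatory variables keep their observed values, so the mean
    is taken over their (weighted) distribution in the sample.
    """
    total_weight = sum(weights)
    weighted_sum = 0.0
    for row, weight in zip(units, weights):
        modified = dict(row)
        modified[name] = value          # put every unit at the requested value
        weighted_sum += weight * predict(modified)
    return weighted_sum / total_weight


# Hypothetical fitted model and data, for illustration only.
def fitted(row):
    return 1.0 + 2.0 * row["x"] + 3.0 * row["z"]


sample = [{"x": 0.0, "z": 1.0}, {"x": 5.0, "z": 3.0}]
survey_weights = [1.0, 3.0]
prediction_at_x2 = marginal_prediction(sample, survey_weights, fitted, "x", 2.0)
```

Repeating the calculation for each requested level or value gives one cell of a prediction table per call; repeating it over bootstrap samples would give the spread from which standard errors and limits can be formed.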

Printing is controlled by the `PRINT` option. The default output consists of model details, parameter estimates, Wald statistics and, if `PFACTORS` or `PTERMS` is set, predictions. The `monitor` setting reports progress through the bootstrap samples.

Options: `PRINT`, `DISTRIBUTION`, `LINK`, `DISPERSION`, `TERMS`, `CONSTANT`, `FACTORIAL`, `PFACTORS`, `PLEVELS`, `PTERMS`, `STRATUMFACTOR`, `NUNITS`, `SAMPLINGUNITS`, `WEIGHTS`, `METHOD`, `NBOOT`, `SEED`, `CIPROBABILITY`, `CIMETHOD`.

Parameters: `Y`, `NBINOMIAL`, `RESIDUALS`, `FITTEDVALUES`, `ESTIMATES`, `SE`, `VCOVARIANCE`, `LOWER`, `UPPER`, `WALD`, `PREDICTIONS`, `SEPREDICTIONS`, `VCPREDICTIONS`, `LOWPREDICTIONS`, `UPPREDICTIONS`.

### Action with `RESTRICT`

Restricting the response variate `Y` fits a model to the subpopulation defined by the restriction.

### References

Binder, D.A. (1983). On the Variances of Asymptotically Normal Estimators from Complex Surveys. *International Statistical Review*, 51, 279-292.

Korn, E.L. & Graubard, B.I. (1999). *Analysis of Health Surveys*. Wiley, New York.

Sarndal, C., Swensson, B. & Wretman, J. (1992). *Model Assisted Survey Sampling*. Springer-Verlag, New York.

### See also

Procedures: `SVBOOT`, `SVCALIBRATE`, `SVHOTDECK`, `SVREWEIGHT`, `SVSAMPLE`, `SVSTRATIFIED`, `SVTABULATE`, `SVWEIGHT`.

Commands for: Survey analysis, Regression analysis, REML analysis of linear mixed models.

### Example

    CAPTION   'SVGLM example',\
              'Data from Sampford, Table 5.1, page 61, using farms of Table 6.1.';\
              STYLE=meta,plain
    " Three strata; N holds the number of farms in each stratum. "
    FACTOR    [LEVELS=3] stratum
    TABLE     [CLASS=stratum; VALUES=12,12,11] N
    READ      farm,stratum,crops,oats
     6 1  60  15
     7 1  62  20
     8 1  65  18
    12 1  74  18
    13 2  78  23
    15 2  91  27
    17 2  96  25
    23 2 190  60
    26 3 240  28
    31 3 324 128
    33 3 356  69
    34 3 410  72 :
    " Form the survey weights and log-transform both variables. "
    SVWEIGHT  [PRINT=summary; STRATUM=stratum; NUNITS=N] OUTWEIGHTS=wts
    CALCULATE logcrops,logoats=log10(crops,oats)
    " Fit without bootstrapping (NBOOT=0). "
    SVGLM     [PRINT=model,estimates,wald;STRATUM=stratum;\
              WEIGHTS=wts;TERMS=logcrops;NBOOT=0] logoats
    " Refit with 50 bootstrap samples, saving predictions at the values in xpred. "
    VARIATE   [VALUES=1.7,1.8...2.7] xpred
    SVGLM     [PRINT=model,estimates,wald,prediction,monitor; STRATUM=stratum;\
              PFACTORS=logcrops; PLEVELS=xpred; WEIGHTS=wts; TERMS=logcrops;\
              NBOOT=50; SEED=630232] logoats; PREDICTIONS=ypred;\
              SEPREDICTIONS=sep; LOWPREDICTIONS=lo; UPPREDICTIONS=hi
    " Plot the observations, the fitted line and the confidence limits. "
    PEN       2,3,4; METHOD=line; LINE=1,2,2; COLOUR='red',2('limegreen');\
              SYMBOL=0
    DGRAPH    logoats,ypred[],lo[],hi[]; logcrops,(xpred)3; DESCRIPTION=\
              'observed','fitted line','lower 95% limit','upper 95% limit'