1. Home
  2. GEE procedure

GEE procedure

Fits models to longitudinal data by generalized estimating equations (D.M. Smith & M.G.Kenward).


PRINT = string token What to display (estimates, correlations, scalefactor, wald, monitoring); default esti, corr, scal
DISTRIBUTION = string token Distribution of response (normal, Poisson, binomial, gamma, inversenormal, negativebinomial); default *
LINK = string token Link function (identity, logarithm, logit, reciprocal, power, squareroot, probit, complementaryloglog, logratio); default *
EXPONENT = scalar Exponent for power link; default -2
TERMS = formula Explanatory variates, factors etc
CONSTANT = string token How to treat constant (estimate, omit); default esti
FACTORIAL = scalar Limit for expansion of model terms; default 3
AGGREGATION = scalar Fixed parameter for negative binomial distribution (parameter k as in variance function var = mean + mean2/k); default 1
KLOGRATIO = scalar Parameter for logratio link, in form log(mean / (mean + k)); default as set in AGGREGATION option
QUADESTIMATION = string token Whether to use quadratic estimation (used, notused); default used
SCALEFACTOR = string token How to calculate the scale factor (fixed, constant, varytime); default varies with distribution, fixed for Poisson, binomial and negative binomial, constant for rest
SFVALUE = scalar Value for scale factor when SCALEFACTOR=fixed; default 1.0 for Poisson and binomial, missing for rest
CRTYPE = string token Form of correlation matrix (independence, unstructured, exchangeable, autoregressive, dependence, antedependence); default *
ORDER = scalar Order in dependence and ante-dependence form of correlation matrix; default 1
TIMEDEPENDENT = string token Whether correlation in dependence model changes with time (no, yes); default no


Y = variates Response variate for each analysis
NBINOMIAL = variates or scalars Denominator in binomial
FITTEDVALUES = variates To store fitted values
RESIDUALS = variates To store residuals
SUBJECT = factors Identifier of subjects
OUTCOME = factors Identifier of outcomes
COUNT = variates Variate of counts of no. outcomes
TIME = factors Times of repeated measures variate
WEIGHT = variates Weight variate
OFFSET = variates Offset variate
SAVE = pointers Structure to save output variables


GEE implements the General Estimating Equation (GEE) methodology of Liang & Zeger (1986) with quadratic estimation for the covariance structure. In the terminology of Liang et al. (1992) the methodology implemented is a form of GEE1. Full details of the implementation are given in Kenward & Smith (1995a). GEE, as implemented here, is a comparatively simple non-likelihood method for fitting marginal models to repeated measurements that can be used when the response has a distribution in the exponential family. This includes the Gaussian distribution, for which the procedure implemented here reduces to a form of the EM algorithm, and then produces exact ML or REML estimates, or a close approximation to these depending on the particular correlation structure chosen. For other distributions the resulting estimates are not maximum likelihood but can be shown to have asymptotic properties familiar from quasi-likelihood, such as consistency and asymptotic normality.

The standard range of generalized linear models (as in procedure GLM) can be fitted involving a variety of covariance or correlation structures over the times of the repeated measurements. The standard links and distributions can be chosen by setting the options DISTRIBUTION, LINK, EXPONENT, AGGREGATION and KLOGRATIO, as in the MODEL directive. Non-standard ones require the definition of auxiliary procedures to carry out the necessary calculations (see the Method section). The terms in the fitted model are specified by the TERMS option, which may be set to a formula or left unset to fit a null model. The FACTORIAL option (default 3) sets a limit on the number of factors and variates in the terms that are fitted, as in the FIT directive. The CONSTANT option can be used to omit a constant term. Setting the QUADESTIMATION option to used requests the use of quadratic estimation for the data-based covariance or correlation matrix (see Kenward & Smith 1995a). The SCALEFACTOR option specifies the form of scalefactor to be used (fixed to a value specified by the SFVALUE option, constant over times of repeated measurements, or varying over times of repeated measurements). The CRTYPE option specifies the structure of the covariance or correlation matrix over the times of the repeated measurements. The ORDER option specifies the order of the covariance or correlation structures for the dependence and ante-dependence cases, with option TIMEDEPENDENT specifying whether the correlation in a dependence structure changes with the time of the repeated measurement.

The Y parameter must be set to specify the response variate. For a binomial distribution the NBINOMIAL parameter must also be set; this may be either a variate or a scalar (if it is a scalar, GEE maps it out automatically to a variate with the same number of values as Y). The SUBJECT parameter specifies a factor to identify the subjects. Alternatively, where the data consist of outcomes and numbers with those outcomes, the parameter OUTCOME must be set to the identifier of the outcome and the parameter COUNT to the number with the outcome. The parameter TIME must be set to the times of the repeated measurements. The parameters WEIGHT and OFFSET specify weight and offset variates that may be involved.

The output from the procedure is controlled by the PRINT option; by default estimates, their standard errors, covariances or correlations and scalefactors are given. Two sets of standard errors are provided for the estimates. One is the naive estimate which assumes the specified covariance or correlation structure holds. The other is the sandwich estimate which makes no such assumption. When PRINT=wald, Wald tests are produced using both sets of standard errors and correlations.

The fitted values and residuals can be obtained by setting the parameters FITTEDVALUES and RESIDUALS. The residuals are the Pearson residuals as defined in the Guide to the Genstat Command Language, Part 2, Section 3.1.1.

The SAVE parameter can save various details of the analysis, in a pointer with the following suffixes and labels:

1 or 'scalefactors' scalefactor(s),
2 or 'correlation' or 'covariance' correlations or covariances, according to the type of model (and labelled appropriately),
3 or 'estimates' the estimates of the linear predictor parameters,
4 or 'naive covariances' naive variance-covariance matrix for the estimates,
5 or 'sandwich covariances' sandwich variance-covariance matrix for the estimates,
6 or 'naive Wald' Wald tests calculated using the naive variance-covariance matrix, and
7 or 'sandwich Wald' Wald tests calculated using the sandwich variance-covariance matrix.

The algorithms in the procedure have been set up assuming that the data contain a complete set of observations for each subject. Where there are missing values these must be included explicitly (using the missing value symbol *) to create a complete set of observations. Missing values are allowed in both the Y variate and the explanatory variates in TERMS.

In the case of the Gaussian distribution, a working covariance matrix, rather than correlation matrix, is used. This provides considerable simplification within the algorithm.

This is a complicated algorithm and some examples may take a while to run. If necessary, however, you can set option PRINT=monitoring to see what is happening.




For full details of the method implemented in this procedure see Kenward & Smith (1995a). A generalized linear model is formulated for the marginal distribution of the observations at each time point using an appropriate link function and error distribution. If the repeated measurements could be assumed to be independent, the well-known iterative weighted least squares fitting procedure could be used to obtain ML estimates of the marginal model parameters. However this ignores the dependence among the repeated measurements. Full likelihood is in general very awkward in this setting so, to avoid a formal introduction of dependence into the model, a working correlation matrix is introduced into the iterative procedure, changing the least squares from a weighted to a generalized form. The correlation matrix can be introduced in various ways. It can be held constant throughout the iterative procedure. An example of this is the use of the identity matrix, leading to the so-called independence estimating equations for which the process reduces back to that of fitting a univariate generalized linear model. Alternatively an estimated correlation matrix can be introduced into the algorithm which is updated at each cycle using quadratic estimation: essentially the correlation structure is estimated from the residuals using the equations that would be appropriate were the residuals normally distributed. On convergence consistent estimates of the marginal linear model parameters are obtained and, if the correlation structure chosen is appropriate, then this will be consistently estimated as well. It is not necessary for the correlation structure to be correct for the consistency of the marginal parameter estimates, at least when the correlation structure is fixed; indeed the common choice of independence is almost certain not to be appropriate. However the estimates of precision of the marginal parameter estimates do need to be adjusted to allow for the true correlation structure. This correction is done in the so-called “sandwich” estimator provided by the procedure.

The procedures have been written so that it is possible to fit models other than the standard ones. An important example of such a model is the application of the GEE methodology to ordinal categorical data. This application requires the data to be arranged in a particular form (as cummulative logits) and a particular correlation matrix (specified in _GEECORRELATION). The type of analyses are explained in Kenward et al. (1994) and the methodology described in that paper has been duplicated. Further details are given in Kenward & Smith (1995b).

An option (SCALEFACTOR) has been included that allows the user to decide whether or not the scale factor is fixed at its independence distributional default, or is estimated from the scaled residuals as in Liang & Zeger (1986), or is treated as a vector varying over time.

GEE calls three subsidiary procedures, _GEECODI, _GEECALLIN and _GEECALDIS to assist with the analysis. There is no need for the user to modify these procedures:

    _GEECODI prints out the results of the iterative processes;
    _GEECALLIN calculates the fitted values and derivatives for various links;
    _GEECALDIS calculates the variance function and deviance for various distributions.

There are also four other procedures, which can be re-written or replaced, to cater for further user-defined distributions, links and correlation structures:

    _GEEINIT calculates initial estimates of the linear predictor in the generalized linear model;
    _GEELINK calculates fitted values and derivatives;
    _GEEDISTRIBUTION calculates the variance function and deviance;
    _GEECORRELATION calculates the correlation matrix and the sandwich matrix involving the residuals. (For the normal distribution the variance-covariance matrices are used not the correlation matrices.)

If the LINK option is unset, the procedure will call _GEEINIT and _GEELINK instead of using those for the various standard link functions. For a logit link function _GEEINIT and _GEELINK should be defined as follows.

          “Calculation of initial estimate of linear predictor,
           link unset”
          ‘Y’, “I: variate; response variate”\
          ‘LINEARPREDICTOR’,”O: variate; linear predictor”\
          ‘OFFSET’, “I: variate; offset”\
          ‘NBINOMIAL’; “I: variate; denominator of binomial”\
          SET=3(yes),no;TYPE=4(‘variate’); \
          “Calculation of fitted values and derivatives”
          ‘LINEARPREDICTOR’, “I: variate; linear predictor”\
          ‘FITTEDVALUES’, “O: variate; estimate of fitted values”\
          ‘DERIVATIVES’, “O: variate; estimate of derivatives”\
          ‘OFFSET’, “I: variate; offset”\
          ‘NBINOMIAL’; “I: variate; denominator of binomial”\
          SET=4(yes),no;TYPE=5(‘variate’); \
If the DISTRIBUTION option is unset, the procedure will call _GEEDISTRIBUTION instead of using one of the various standard distributions. For a binomial error distribution _GEEDISTRIBUTION should be defined as follows.
          ” Calculation of variance function and deviance”
          ‘Y’, “I: variate; response variate”\
          ‘FITTEDVALUES’,”I: variate; fitted values”\
          ‘VARIANCE’, “O: variate; variance”\
          ‘DEVIANCE’, “O: scalar; total deviance”\
          ‘NBINOMIAL’; “I: variate; denominator of binomial”\
          SET=4(yes),no;TYPE=3(‘variate’),’scalar’,’variate’; \
                     !T(type,nvalues,restriction); \
If the CRTYPE option is unset, the procedure will call _GEECORRELATION instead of using one of the various standard correlation models. For the independence model _GEECORRELATION should be defined as follows. Kenward & Smith (1995b) describe how _GEECORRELATION should be set up for analysing repeated ordinal categorical data.
          Calculation of correlation matrix
          For SANDWICH = NO
                  input is the R matrix as for UNSPECIFIED
                  output is the desired R matrix.
          For SANDWICH = YES
                  input is the (Y-MU)*T(Y-MU) matrix
          output is the desired modified (Y-MU)*T(Y-MU) matrix.
          N.B. For the normal distribution both the input and
                 output R’s should be variance/covariance matrices
                 not correlation matrices.
        ‘CONSTANT’, “I: text; how to treat constant (estimate,
                        omit); default e”\
        ‘SANDWICH’; “I; text; whether the sandwich central matrix
                        product or not) (no,yes); default no”\
            MODE=2(T); NVALUES=2(1); SET=yes;\
            VALUES=!T(ESTIMATE,OMIT),!T(NO,YES); \
      ‘CORRELATIONS’,”I/O: matrix; the correlation matrix”\
       ESTIMATES’, “I: variate; estimates of parameters in
      ‘Y’, “I: variate; response variate”\
      ‘RESIDUALS’, “I: variate; residuals”\
      ‘FITTEDVALUES’,”I: variate; fitted values”\
       TIME’, “I: variate; times of repeated measures”\
      ‘MARKER’, “I: factor; identifier of subject or outcome”\
      ‘DISTRIBUTION’,”I: text; identifier of distribution”\
      ‘SCALEFACTOR’, “I: text; scalefactor option in use”\
      ‘SFVALUE’; “I: scalar; value of scalefactor if FIXED”\
      SET=10(yes);DECLARED=10(yes); \
      TYPE=’symmetric’,5(‘variate’),’factor’,2(‘text’),’scalar’; \
DIAGONALMATRIX [ROWS=ntime;MODIFY=yes] done,wkdm; \
 & sandw = ‘NO’ .IN. SANDWICH
IF sandw

  SCALEFACTOR is as in GEE i.e. FIXED means fixed to SFVALUE
  CONSTANT means the scalefactor is estimated but constant
  across time, and VARYTIME means the scalefactor is estimated
  and varies across time.
  The variate TIME in this PROCEDURE represents the 1…ntime
  distinct times, it is not a FACTOR of length nobs as in GEE.
  It is the levels of the parameter TIME of GEE.

        CALC wkdm = SFVALUE
        CALC wkdm = TRACE(CORRELATIONS)/ntime
    CALC CORRELATIONS = 0 + wkdm

If LINK, DISTRIBUTION or CRTYPE are unset, but no user routines are given for _GEEINIT, _GEELINK, _GEEDISTRIBUTION and _GEECORRELATION, then those given here (for logit link, binomial error distribution and independence) will be used.

Action with RESTRICT

Input structures must not be restricted, and any existing restrictions will be cancelled.


Kenward, M.G., Lesaffre, E. & Molenberghs, G. (1994). An application of maximum likelihood and generalized estimating equations to the analysis of ordinal data from a longitudinal study with cases missing at random. Biometrics, 50, 945-953.

Kenward, M.G. & Smith, D.M. (1995a). Computing the generalized estimating equations for repeated measurements. Genstat Newsletter, 32, 50-62.

Kenward, M.G. & Smith, D.M. (1995b). Computing the generalized estimating equations for repeated ordinal, categorical measurements. Genstat Newsletter, 32, 63-70.

Liang, K.-Y. & Zeger,S.L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73, 13-22.

Liang, K.-Y., Zeger, S.L. & Qaqish, B. (1992). Multivariate regression analyses for categorical data (with discussion). Journal of the Royal Statistical Society, Series B, 54, 3-40.

See also

Procedures: GLMM, HGANALYSE.

Commands for: Regression analysis, Repeated measurements.


CAPTION 'GEE example',!t('Data from Archer',\
        '(1987, Fertility & Sterility, 47, 559-564)',\
        'about prolactin response to thyroptrin releasing',\
        'hormone in women grouped according fertility status;',\
        'also see Paik (1992, Biometrics, 48, 19-30).'); STYLE=meta,plain
VARIATE [VALUES=44,45,41,40,72,49,41,31,37,23,15,10,103,79,47,\
                156,117,91,100,90,60,50,86,65,57,42,101,93,68,47] Response
 &      [VALUES=16,16,16,16,10,10,10,10,7,7,7,7,8,8,8,8,5,5,5,5,\
 &      [VALUES=(1...4)30] CTime
FACTOR  [LEVELS=30; VALUES=4(1...30)] Woman
 &      [LEVELS=3; VALUES=24(1),48(2),48(3)] Group
 &      [LEVELS=4; VALUES=(1...4)30] Time
GEE     [LINK=log; DISTRIBUTION=gamma; TERMS=Group+CTime+Baseline;\ 
Updated on March 7, 2019

Was this article helpful?