Fits zero-inflated regression models to count data with excess zeros (D.A. Murray).
Options
| Option | Description |
|---|---|
| PRINT = string token | Controls printed output (model, summary, estimates, fittedvalues, monitoring); default mode, summ, esti |
| DISTRIBUTION = string token | Distribution of response variable (poisson, binomial, negativebinomial); default pois |
| METHOD = string token | Method used for model fitting (em, conditional); default em |
| CONSTANT = string token | How to treat constant for count state (estimate, omit); default esti |
| ZCONSTANT = string token | How to treat constant for zero-inflation state (estimate, omit); default esti |
| XTERMS = formula | List of explanatory variates and factors, or model formula for count state of model |
| ZTERMS = formula | List of explanatory variates and factors, or model formula for zero-inflation state of model |
| WEIGHTS = variate | Variate of weights for weighted zero-inflated regression (EM model only) |
| OFFSET = variate | Offset variate to be used in the model (EM model only) |
| XGROUPS = factor | Absorbing factor defining the groups for within-groups regression for the count state model (EM model only) |
| ZGROUPS = factor | Absorbing factor defining the groups for within-groups regression for the zero-inflation state model (EM model only) |
| MAXCYCLE = scalar | Maximum number of iterations for EM algorithm; default 100 |
| TOLERANCE = scalar or variate | Convergence criteria for the EM algorithm, for k, and for the generalized linear models; default !(1.E-4, 1.E-4, 1.E-4) |
| ZPARAMETERIZATION = string token | Parameterization of the probability of the zero-inflation model (zero, nonzero); if unset, zero is used for the EM model and nonzero for the conditional model |
Parameters
| Parameter | Description |
|---|---|
| Y = variates | Response variate |
| NBINOMIAL = scalars or variates | Total numbers for DISTRIBUTION=binomial |
| RESIDUALS = variates | Saves the simple residuals |
| FITTEDVALUES = variates | Saves the fitted values |
| ESTIMATES = variates | Saves the estimates of the parameters |
| SE = variates | Saves the standard errors of the estimates |
| RSAVE = identifiers | Saves the regression structure for the final generalized linear model fitted for the count model |
| ZSAVE = identifiers | Saves the regression structure for the final binomial regression fitted for the zero-inflation model |
Description
R0INFLATED can be used to fit zero-inflated regression models to count data with excess zeros. The procedure allows the data to be modelled using two different approaches. The first possibility is to fit a zero-inflated Poisson regression model (ZIP), a zero-inflated binomial regression model (ZIB) or a zero-inflated negative binomial regression model (ZINB) using an EM algorithm (Lambert 1992). In this analysis, the response variable of counts is assumed to be distributed as a mixture of a distribution (such as Poisson) and a degenerate distribution at zero. In these models, a generalized linear model with a Poisson or negative binomial distribution and log link, or with a binomial distribution and logit link, is used for the count model. A generalized linear model with a binomial distribution and logit link is used for the zero-inflation model.
The alternative is to fit the conditional model of Welsh et al. (1996), which assumes that the data are in one of two states: a state where zeros are observed, or a state where counts are recorded. A binomial model with a logit link is used for the zero state. A truncated Poisson, truncated binomial or truncated negative binomial model is used for the count state.
The response variable is supplied, in a variate, using the Y parameter. The NBINOMIAL parameter must also be set when DISTRIBUTION=binomial, to give the number of binomial trials for each unit. The XTERMS and ZTERMS options each specify a formula, to describe the count model and the zero-inflation model respectively. The CONSTANT and ZCONSTANT options control whether a constant parameter is included in the count and zero-inflation models.
The METHOD option specifies the type of model to fit: the em setting fits the ZIP, ZIB and ZINB mixture models, and the conditional setting fits the conditional model. The DISTRIBUTION option specifies the distribution for the count model. Note that a log link is always used for the count model with the Poisson and negative binomial distributions, and a logit link is used with the binomial distribution.
The XGROUPS and ZGROUPS options can specify factors whose effects you want to eliminate from the count or zero-inflation state respectively, before any regression is fitted. This method of elimination is sometimes called absorption. (See the GROUPS option of the MODEL directive.) It gives less information than you would get if you included the factor explicitly in the model. For example, no standard errors are produced. However, it saves space and time when data from many different groups are to be modelled. These options are only available for the EM model.
The ESTIMATES and SE parameters save the parameter estimates and their standard errors. R0INFLATED puts them into variates, using the same order as in the display produced by the PRINT option. The simple residuals and the fitted values can be saved using the RESIDUALS and FITTEDVALUES parameters.
The RSAVE and ZSAVE parameters allow you to specify identifiers for the regression save structures for the count and zero-inflation states of the model. These structures store the final state of the regression models fitted. Note that the standard errors for the parameter estimates in the regression save structures will not be correct and should instead be obtained using the SE parameter or by the R0KEEP procedure.
For the mixture models, the WEIGHTS option can specify a variate holding weights for each unit, and the OFFSET option allows you to include an offset (i.e. a variable in the regression model with a regression coefficient fixed at one).
The PRINT option controls printed output, with settings:

| Setting | Description |
|---|---|
| model | gives a description of the model, including response and explanatory variates for count and zero-inflation models; |
| summary | displays minus twice the log-likelihood, the Akaike information criterion (AIC) and the Schwarz (Bayesian) information criterion (BIC or SIC); |
| estimates | gives the estimates of the parameters in the model with standard errors based on the asymptotic variance-covariance matrix derived from the inverse of the observed Fisher information matrix; |
| fittedvalues | displays a table of unit labels, values of response variate, fitted values and residuals; |
| monitoring | displays monitoring information of the iterative algorithm. |
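The information criteria in the summary can be related to minus twice the log-likelihood as in the short Python sketch below. It assumes the usual textbook definitions (AIC adds twice the number of parameters, BIC adds the number of parameters times the log of the number of units); the text above does not state exactly which parameters the procedure counts, and the input numbers are made up purely for illustration.

```python
import math


def information_criteria(minus2_loglik, n_params, n_units):
    """Return (AIC, BIC) from minus twice the log-likelihood.

    Uses the standard definitions. How the procedure counts parameters
    (for example whether k is included) is an assumption here.
    """
    aic = minus2_loglik + 2 * n_params
    bic = minus2_loglik + n_params * math.log(n_units)
    return aic, bic


# Hypothetical values, not output from R0INFLATED.
aic, bic = information_criteria(minus2_loglik=1200.0, n_params=5, n_units=270)
print(f"AIC = {aic:.2f}, BIC = {bic:.2f}")
```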
The iterative process for the EM algorithm is controlled by the MAXCYCLE option, which defines the maximum number of cycles, and the TOLERANCE option, which sets convergence criteria. The EM algorithm cycle stops when successive values of the log-likelihood are within a tolerance set by the first element of the TOLERANCE option. The second and third elements of TOLERANCE control the convergence criterion for the aggregation parameter (k) for the negative binomial model and for the generalized linear model, respectively.
The ZPARAMETERIZATION option controls how the probability for the zero-inflation model is specified. Note that the parameters in the model specification for the mixture and conditional models have different interpretations. In the mixture model the default setting is zero, which parameterizes the model such that ω is the probability of the excess zeros. Alternatively, you can set ZPARAMETERIZATION=nonzero, to parameterize the model such that ω is the probability that an observation is generated through the distribution. In the conditional model the default setting is nonzero, which parameterizes the model such that ω = 1 – p(x) where p(x) is the probability of detecting at least one observation, given that there is at least one observation. Alternatively, if you set ZPARAMETERIZATION=zero, the parameterization is that ω = p(x). For further details, see the Method section.
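Because the zero-inflation model uses a logit link, the two parameterizations describe complementary probabilities whose linear predictors differ only in sign. The Python sketch below checks this identity numerically; the value of the linear predictor is hypothetical, and whether the procedure reports exactly sign-reversed estimates under the two settings is an assumption rather than something stated above.

```python
import math


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def inv_logit(eta):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))


eta = 0.8                      # hypothetical linear predictor Z*alpha
omega = inv_logit(eta)         # probability under one parameterization
omega_prime = 1.0 - omega      # complementary probability

# logit(1 - omega) equals -logit(omega), so the predictor changes sign.
print(omega, omega_prime, logit(omega_prime))
```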
Options: PRINT, DISTRIBUTION, METHOD, CONSTANT, ZCONSTANT, XTERMS, ZTERMS, WEIGHTS, OFFSET, XGROUPS, ZGROUPS, MAXCYCLE, TOLERANCE, ZPARAMETERIZATION.
Parameters: Y, NBINOMIAL, RESIDUALS, FITTEDVALUES, ESTIMATES, SE, RSAVE, ZSAVE.
Method
The zero-inflated Poisson (mixture) regression model has the distribution

$$
\Pr(Y=y) =
\begin{cases}
\omega + (1-\omega)\,e^{-\lambda}, & y = 0 \\[4pt]
(1-\omega)\,\dfrac{e^{-\lambda}\,\lambda^{y}}{y!}, & y > 0
\end{cases}
$$

where λ and ω are given by the following models

$$
\log(\lambda) = X\beta
$$

$$
\log\!\left(\frac{\omega}{1-\omega}\right) = Z\alpha
$$
where X and Z are covariate matrices and β and α are vectors of unknown parameters.

The zero-inflated binomial (mixture) regression model has the distribution

$$
\Pr(Y=y) =
\begin{cases}
\omega + (1-\omega)\,(1-p)^{n}, & y = 0 \\[4pt]
(1-\omega)\,\dfrac{n!}{y!\,(n-y)!}\,p^{y}\,(1-p)^{n-y}, & y > 0
\end{cases}
$$

where p and ω are given by the following models

$$
\log\!\left(\frac{p}{1-p}\right) = X\beta
$$

$$
\log\!\left(\frac{\omega}{1-\omega}\right) = Z\alpha
$$
The zero-inflated negative binomial (mixture) regression model has the distribution

$$
\Pr(Y=y) =
\begin{cases}
\omega + (1-\omega)\,(1+k\lambda)^{-1/k}, & y = 0 \\[4pt]
(1-\omega)\,\dfrac{\Gamma(y+1/k)}{y!\,\Gamma(1/k)}\,(k\lambda)^{y}\,(1+k\lambda)^{-(y+1/k)}, & y > 0
\end{cases}
$$

where λ and ω are given by the same models as for the Poisson distribution, and k is the extra-variation parameter in the negative binomial distribution.
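As a numerical check on the three mixture distributions, the following Python sketch evaluates each probability function and confirms that it sums to one. It is an illustration outside Genstat, not the procedure's code: the function names and parameter values are hypothetical, and the negative binomial term is written in its standard form with mean λ and extra-variation parameter k.

```python
import math


def zip_pmf(y, lam, omega):
    """Zero-inflated Poisson probability of count y."""
    pois = math.exp(-lam) * lam ** y / math.factorial(y)
    return omega + (1 - omega) * pois if y == 0 else (1 - omega) * pois


def zib_pmf(y, n, p, omega):
    """Zero-inflated binomial probability of y successes out of n."""
    binom = math.comb(n, y) * p ** y * (1 - p) ** (n - y)
    return omega + (1 - omega) * binom if y == 0 else (1 - omega) * binom


def zinb_pmf(y, lam, k, omega):
    """Zero-inflated negative binomial probability (mean lam, parameter k)."""
    log_nb = (math.lgamma(y + 1 / k) - math.lgamma(y + 1) - math.lgamma(1 / k)
              + y * math.log(k * lam) - (y + 1 / k) * math.log(1 + k * lam))
    nb = math.exp(log_nb)
    return omega + (1 - omega) * nb if y == 0 else (1 - omega) * nb


# Hypothetical parameter values; each total should be very close to 1.
print(sum(zip_pmf(y, 3.0, 0.2) for y in range(60)))
print(sum(zib_pmf(y, 10, 0.3, 0.2) for y in range(11)))
print(sum(zinb_pmf(y, 3.0, 0.5, 0.2) for y in range(300)))
```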
The maximum likelihood estimates for β, α and k are obtained using an EM algorithm (Lambert 1992). The standard errors for the parameter estimates are derived using the incomplete data observed information matrix as proposed by Lambert (1992). The default parameterization for the mixture models estimates ω, the probability of excess zeros. You can use the ZPARAMETERIZATION option to change the parameterization to estimate ω′, the probability that an observation is generated through the distribution instead (ω′ = 1-ω).
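The idea behind the EM algorithm can be sketched in Python for the simplest case, a zero-inflated Poisson model with constants only (no explanatory terms). This is a minimal illustration under that assumption, not the procedure's implementation: with covariates, the two closed-form updates in the M-step would be replaced by weighted generalized linear model fits. The function names are hypothetical, the defaults merely mirror the MAXCYCLE and TOLERANCE defaults, and the data are the 30 shoot counts for hormone level 0.5 in the 16-hour period of the example.

```python
import math


def zip_loglik(counts, lam, omega):
    """Log-likelihood of a constant-only zero-inflated Poisson model."""
    total = 0.0
    for y in counts:
        pois = math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))
        if y == 0:
            total += math.log(omega + (1 - omega) * pois)
        else:
            total += math.log((1 - omega) * pois)
    return total


def fit_zip_em(counts, max_cycle=100, tol=1e-4):
    """Fit lam and omega by EM; assumes at least one positive count."""
    n = len(counts)
    n_zero = sum(1 for y in counts if y == 0)
    omega = 0.5 * n_zero / n          # crude starting values
    lam = sum(counts) / n
    old = zip_loglik(counts, lam, omega)
    for cycle in range(1, max_cycle + 1):
        # E-step: probability that each zero is an excess zero.
        p_excess = omega / (omega + (1 - omega) * math.exp(-lam))
        z = [p_excess if y == 0 else 0.0 for y in counts]
        # M-step: closed-form updates in the constant-only case.
        omega = sum(z) / n
        lam = sum((1 - zi) * y for zi, y in zip(z, counts)) / sum(1 - zi for zi in z)
        new = zip_loglik(counts, lam, omega)
        if abs(new - old) < tol:      # stop when the log-likelihood settles
            break
        old = new
    return lam, omega, cycle


shoots = [0] * 15 + [2, 2, 3, 3, 4, 5, 5, 6, 8, 9, 9, 9, 10, 11, 12]
lam, omega, cycles = fit_zip_em(shoots)
print(lam, omega, cycles)
```

At convergence the fitted probability of a zero, ω + (1 – ω) exp(-λ), matches the observed proportion of zeros, which is a convenient check on the fit.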
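In the Poisson case of the conditional model, the positive values of y have a truncated Poisson distribution with parameter λ, so the probability model is

$$
\Pr(Y=y) =
\begin{cases}
\omega, & y = 0 \\[4pt]
(1-\omega)\,\dfrac{e^{-\lambda}\,\lambda^{y}}{y!\,\left(1-e^{-\lambda}\right)}, & y > 0
\end{cases}
$$

where λ and ω are given by the following models

$$
\log(\lambda) = X\beta
$$

$$
\log\!\left(\frac{\omega}{1-\omega}\right) = Z\alpha
$$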
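In the conditional model the likelihood separates into a part for the zeros and a part for the positive counts, so ω and λ can be estimated independently. The Python sketch below illustrates this for a constant-only model: ω is the observed proportion of zeros, and λ solves λ / (1 – exp(-λ)) = mean of the positive counts. The simple fixed-point iteration used here is a stand-in chosen for brevity, not the iteratively re-weighted least squares algorithm used by the procedure; the function names are hypothetical, and the data are the first 20 values of no_lb from the example.

```python
import math


def conditional_poisson_pmf(y, lam, omega):
    """Probability of y under the conditional (truncated Poisson) model."""
    if y == 0:
        return omega
    truncated = math.exp(-lam) * lam ** y / (math.factorial(y) * (1 - math.exp(-lam)))
    return (1 - omega) * truncated


def fit_conditional_poisson(counts, tol=1e-10, max_iter=200):
    """Constant-only fit; assumes the mean of the positive counts exceeds 1."""
    n = len(counts)
    positives = [y for y in counts if y > 0]
    omega = (n - len(positives)) / n       # proportion of zeros
    mean_pos = sum(positives) / len(positives)
    lam = mean_pos                         # starting value
    for _ in range(max_iter):
        new = mean_pos * (1 - math.exp(-lam))   # fixed-point update
        converged = abs(new - lam) < tol
        lam = new
        if converged:
            break
    return lam, omega


possums = [7, 0, 0, 3, 2, 10, 7, 3, 0, 0, 0, 0, 0, 2, 0, 1, 0, 4, 3, 2]
lam, omega = fit_conditional_poisson(possums)
print(lam, omega)
```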
In the binomial case, the positive values of y have a truncated binomial distribution, so the probability model is

$$
\Pr(Y=y) =
\begin{cases}
\omega, & y = 0 \\[4pt]
(1-\omega)\,\dfrac{n!}{y!\,(n-y)!}\,\dfrac{p^{y}\,(1-p)^{n-y}}{1-(1-p)^{n}}, & y > 0
\end{cases}
$$

where p and ω are given by the following models

$$
\log\!\left(\frac{p}{1-p}\right) = X\beta
$$

$$
\log\!\left(\frac{\omega}{1-\omega}\right) = Z\alpha
$$
In the negative binomial case, the positive values of y have a truncated negative binomial distribution with parameters λ and k, so the probability model is

$$
\Pr(Y=y) =
\begin{cases}
\omega, & y = 0 \\[4pt]
(1-\omega)\,\dfrac{\Gamma(y+1/k)}{y!\,\Gamma(1/k)}\,(k\lambda)^{y}\,(1+k\lambda)^{-(y+1/k)}\,\left\{1-(1+k\lambda)^{-1/k}\right\}^{-1}, & y > 0
\end{cases}
$$

where λ and ω are given by the same models as for the Poisson distribution, and k is the extra-variation parameter in the negative binomial distribution.
The truncated Poisson model is fitted using an iteratively re-weighted least squares algorithm (see Welsh et al. 1996). The truncated binomial and negative binomial models are fitted using FITNONLINEAR. The default parameterization for the conditional models estimates ω′ (= 1-ω), the probability of detecting at least one observation given that there is at least one observation, as in Welsh et al. (1996). You can use the ZPARAMETERIZATION option to change the parameterization to estimate ω, the probability of detecting a zero observation, instead.
Action with RESTRICT
If a parameter is restricted the statistics will be calculated using only those units included in the restriction.
References
Hall, D.B. (2000). Zero-inflated Poisson and binomial regression with random effects: a case study. Biometrics, 56, 1030-1039.
Lambert, D. (1992). Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics, 34, 1-14.
Ridout, M., Demetrio, C.G.B. & Hinde, J. (1998). Models for count data with many zeros. International Biometrics Conference, Cape Town.
Welsh, A.H., Cunningham, R.B., Donnelly, C.F. & Lindenmayer, D.B. (1996). Modelling the abundance of rare species: statistical models for counts with extra zeros. Ecological Modelling, 88, 297-308.
See also
Procedures: RNEGBINOMIAL, R0KEEP.
Commands for: Regression analysis.
Example
CAPTION 'R0INFLATED example - EM algorithm',\
        'Apple shoot data',\
        !t('Ridout et al. (1998)',\
        'Models for count data with many zeros,',\
        'IBC Cape Town 1998.'); STYLE=meta,minor,plain
FACTOR [LABELS=!T('0.5','1','2','4'); VALUES=30(1,2),\
       40(3,4),30(1,2,3),40(4)] Hormone
FACTOR [LABELS=!T('8','16'); VALUES=140(1),130(2)] Period
READ NShoots
1 1 1 2 2 3 3 3 4 4 4 4 4 4 5 5 5 6 6 7 7 8 8 8 9 10 10 11 13 17
2 2 2 4 6 6 6 7 7 7 7 7 7 7 8 8 8 9 9 9 9 9 10 10 10 11 11 11 11 13
2 3 3 4 4 4 4 5 5 5 5 6 6 6 6 7 7 7 7 8 8 8 8 8 8 8 9 9 9 9 9 10 10 10 10 11 12 12 14 14
0 0 3 3 4 4 5 5 5 5 5 6 6 6 6 6 7 7 7 7 8 8 8 8 8 8 8 8 9 9 9 10 10 10 10 11 11 11 11 14
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 3 3 4 5 5 6 8 9 9 9 10 11 12
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 2 3 4 4 5 6 6 8 10 10 10 12
0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 2 2 3 4 4 5 5 6 6 6 7 9 9 11 12
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 2 2 3 3 3 3 4 4 4 5 6 6 6 6 7 7 7 9 9 :
R0INFLATED [PRINT=mod,est,sum; CONSTANT=estimate; XTERMS=Hormone*Period]\
           NShoots
R0INFLATED [PRINT=mod,est,sum; CONSTANT=estimate; XTERMS=Hormone*Period;\
           ZCONSTANT=estimate; ZTERMS=Period] NShoots
R0INFLATED [PRINT=mod,est,sum; DISTRIBUTION=negativebinomial;\
           XTERMS=Hormone*Period; ZTERMS=Period] NShoots
R0INFLATED [PRINT=mod,est,sum; DISTRIBUTION=negativebinomial;\
           XTERMS=Period; ZTERMS=Period] NShoots
CAPTION 'R0INFLATED example - Conditional Model',\
        'Leadbeater''s Possum data,',\
        !t('Welsh et al. (1996) Modelling the abundance of rare species:',\
        'statistical models for counts with extra zeros.',\
        'Ecological Modelling.'); STYLE=meta,minor,plain
VARIATE [NVALUES=151] no_lb,stags
READ no_lb
7 0 0 3 2 10 7 3 0 0 0 0 0 2 0 1 0 4 3 2
10 7 0 3 7 0 0 0 0 0 5 9 0 0 0 0 1 0 5 4
0 0 4 0 4 0 2 0 0 1 1 0 3 0 0 0 0 0 2 0
0 1 0 2 5 3 0 0 0 0 0 0 0 0 5 0 0 0 0 0
0 1 5 4 0 0 0 0 3 0 3 3 1 0 0 0 0 0 2 0
0 1 0 3 0 0 4 0 0 3 4 0 8 5 3 0 0 0 5 5
0 2 0 0 0 0 0 2 0 2 0 0 0 0 0 4 0 0 0 0
5 0 0 0 0 0 1 0 0 0 0 :
READ stags
12 15 6 14 16 16 9 20 7 4 6 5 4 6 4 10 6 11 11 4
16 8 10 9 7 10 15 5 7 10 11 8 8 3 14 5 8 14 11 2
1 1 7 2 7 7 1 6 8 6 6 5 6 0 0 2 0 1 3 2
2 6 3 4 3 4 5 2 3 4 4 2 2 10 16 10 4 3 2 2
2 2 3 1 6 8 2 4 12 13 3 14 2 4 0 2 3 14 29 2
4 6 3 8 4 7 20 4 11 5 1 2 27 24 9 18 3 20 25 4
4 30 24 8 4 6 5 3 5 2 3 5 7 4 5 4 4 1 4 23
25 31 0 8 4 4 1 3 1 1 4 :
CALCULATE lstags = log(stags+1)
R0INFLATED [PRINT=mod,sum,est; METHOD=conditional; DIST=negative;\
           ZTERMS=lstags; XTERMS=lstags] no_lb
R0INFLATED [PRINT=mod,sum,est; METHOD=conditional;\
           ZTERMS=lstags; XTERMS=lstags] no_lb