Estimates the parameters of continuous and discrete distributions.
Options
PRINT = string tokens 
Printed output required from each individual fit (parameters, samplestatistics, fittedvalues, proportions, monitoring); default para, samp, fitt

CBPRINT = string tokens 
Printed output required from a fit combining all the input data (parameters, samplestatistics, fittedvalues, proportions, monitoring); default *
DISTRIBUTION = string token 
Distribution to be fitted (Poisson, geometric, logseries, negativebinomial, NeymanA, PolyaAeppli, PlogNormal, PPascal, Normal, dNvequal, dNvunequal, logNormal, exponential, gamma, Weibull, b1, b2, Pareto); default * i.e. fit nothing
CONSTANT = string token 
Whether to estimate a location parameter for the gamma, logNormal, Pareto or Weibull distributions (estimate, omit); default omit
LIMITS = variate 
Variate to specify or save upper limits for classifying the data into groups; default * 
NGROUPS = scalar 
When LIMITS is not specified, this defines the number of groups (of approximately equal size) into which the data are to be classified; default is the integer value nearest to the square root of the number of data values 
XDEVIATES = variate 
Variate to specify points up to which the CUMPROPORTIONS are to be estimated 
JOINT = string token 
Requests joint estimates from the combined fit to be used for a refit to the separate data sets (dispersion, variancemeanratio, Poissonindex); default *
PARAMETERS = variate 
Estimated parameters from the combined fit 
SE = variate 
Standard errors for the estimated parameters of the combined fit 
VCOVARIANCE = symmetric matrix 
Variance-covariance matrix for the estimated parameters of the combined fit
CUMPROPORTIONS = variate 
Estimated cumulative proportions of the combined distribution up to the values specified by the XDEVIATES option 
MAXCYCLE = scalar 
Maximum number of iterations; default 30 
TOLERANCE = scalar 
Convergence criterion; default 0.0001 
Parameters
DATA = variates or tables 
Data values either classified (table) or unclassified (variate) 

NOBSERVATIONS = tables 
One-way table to save the data classified into groups
RESIDUALS = tables 
Residuals from each (individual) fit 
FITTEDVALUES = tables 
Fitted values from each fit 
PARAMETERS = variates 
Estimated parameters from each fit 
SE = variates 
Standard errors of the estimates 
VCOVARIANCE = symmetric matrices 
Variance-covariance matrix for each set of estimated parameters
CUMPROPORTIONS = variates 
Estimated cumulative proportions of each distribution up to the values specified by the XDEVIATES option 
CBRESIDUALS = tables 
Residuals from the combined fit 
CBFITTEDVALUES = tables 
Fitted values from the combined fit 
STEPLENGTH = variates 
Initial step lengths for each fit 
INITIAL = variates 
Initial parameter values for each fit
Description
The DISTRIBUTION directive is used to fit an observed sample of data to a theoretical distribution function, in order to obtain maximum-likelihood estimates of the parameters of the distribution and to test the goodness of fit. The data consist of observations x_{i} of a random variable X, which has a distribution function F(x) defined by F(x)=Pr(X≤x). A selection of both discrete and continuous distributions is available; full details are given below.
For discrete distributions X may take non-negative integer values only, except for the logseries distribution where only positive integer values are allowed. For continuous distributions the random variable X may take any values, subject to constraints for certain distributions; for example, data values must be strictly positive in order to fit a logNormal distribution. Constraints are detailed with the individual distributions described below.
The data can be supplied to DISTRIBUTION as a variate or as a one-way table of counts. If the raw data are available, then these should be supplied (as a variate), since the raw data contain more information than grouped data.
If raw data are not available, then a one-way table of counts, or frequencies, should be given. The factor classifying the table must have its levels vector declared explicitly, since the levels are used to indicate the boundary values of the raw data used to create the grouping. For example, if the discrete variable X takes the values 0…8, with numbers of observations 2,6,7,4,2,1,0,1,0 respectively, a table of counts can be declared by
FACTOR [LEVELS=!(0...8)] F
TABLE [CLASSIFICATION=F; VALUES=2,6,7,4,2,1,0,1,0] T
The factor levels do not have to specify single data values: often it will be desirable to group certain values together, and indeed for continuous data this is the only sensible way to proceed. In general, for a classifying factor with levels l_{1}, l_{2}, … , l_{f}, the count n_{k} for the kth cell of the table will be the number of observations x_{i} such that
x_{i} ≤ l_{1},  k=1
l_{k–1} < x_{i} ≤ l_{k},  2≤k≤f–1
l_{f–1} < x_{i},  k=f
This means that for all except the last cell of the table, the factor level represents the upper limit on values in that cell. The final class of the table is termed the tail; it is formed by combining the frequencies for all values of X greater than l_{f–1}, and the upper limit on values in the tail is infinity. For continuous distributions with no lower bound, the first class will be the lower tail. You will often want to form the tail(s) by amalgamating groups with low numbers of counts. In the example above, you might amalgamate the groups for values 6 to 8:
FACTOR [LEVELS=!(0...5,99)] F2
TABLE [CLASSIFICATION=F2; VALUES=2,6,7,4,2,1,1] T2
Note that the final factor level, for the tail, can be given a dummy value of 99 to indicate that it has no upper limit, since this value is never used in calculations.
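The same convention applies to grouped continuous data, where the first and last cells act as the lower and upper tails. The following is a minimal sketch with made-up counts; the identifiers Hgroup and Heights and the class limits are hypothetical, and 999 is a dummy level for the upper tail in the same spirit as the 99 above.

```genstat
" Hypothetical grouped continuous data: the cells are
  x <= 150 (lower tail), 150 < x <= 160, 160 < x <= 170,
  170 < x <= 180, and x > 180 (upper tail; 999 is a dummy level) "
FACTOR [LEVELS=!(150,160,170,180,999)] Hgroup
TABLE [CLASSIFICATION=Hgroup; VALUES=4,21,38,25,6] Heights
" Fit a Normal distribution to the grouped counts "
DISTRIBUTION [DISTRIBUTION=Normal] DATA=Heights
```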
When data are supplied as a table instead of as a variate, the computed log-likelihood is only an approximation to the full log-likelihood and the solution obtained will depend to some extent on the choice of class limits. More reliable results will be achieved with a larger number of classes, since this gives more information on the data distribution, so only classes with very few observations should be amalgamated. In general, care should be taken to choose class limits that give a reasonable number of counts in each class, but with none of the individual classes holding a disproportionately large number of observations.
The DISTRIBUTION option should be set to indicate which distribution is to be fitted to the data. The available distributions are the settings listed for the DISTRIBUTION option above.
Note: the parameterization for the gamma distribution differs from that used in the gamma probability functions. DISTRIBUTION uses the shape parameter k and the rate parameter b, while the functions use the shape parameter k and the scale parameter t, which is the reciprocal of the rate (t=1/b).
The first step of the fitting process is to compute and print various sample statistics. Examining these may help in the selection of appropriate distributions for fitting; properties of the various distributions are listed at the end of this section. The setting DISTRIBUTION=* can be used to produce this output without any model fitting. The following sample statistics are calculated:
Sample size: n
Sample mean: m = Σ x_{i}/n
Sample variance: s^{2} = Σ x_{i}^{2}/n – m^{2} (discrete distributions)
Sample variance: s^{2} = Σ (x_{i}–m)^{2} / (n–1) (continuous distributions)
Sample skewness: g_{1} = Σ (x_{i}–m)^{3} / {(n–1)s^{3}} = m_{3}/s^{3}
Sample kurtosis: g_{2} = Σ (x_{i}–m)^{4} / {(n–1)s^{4}} – 3 (continuous distributions only)
Sample quartiles: x_{p}: F(x_{p})=p
Poisson index: (s^{2}–m)/m^{2} (discrete distributions only)
Negative binomial index: m(m_{3}–3s^{2}+2m)/(s^{2}–m)^{2} (discrete distributions only)
If the original data are not available, the sample statistics are calculated by substituting class midpoints in place of the data. For the lower tail, the class “midpoint” is taken to be l_{1}–½(l_{2}–l_{1}) and for the upper tail, l_{f–1}+½(l_{f–1}–l_{f–2}). No corrections are made for groupings. When a distribution has been fitted to data, the relevant theoretical statistics of that distribution are printed for comparison with the sample statistics, as a check on the appropriateness of the model for the data.
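As a worked illustration of the discrete-distribution formulae, suppose the 23 observations from the earlier example (values 0…8 with frequencies 2,6,7,4,2,1,0,1,0) were supplied as raw values. This is hand arithmetic under that assumption, not output from the directive:

```latex
\begin{align*}
n &= 23, \qquad \textstyle\sum x_i = 52, \qquad \sum x_i^2 = 176 \\
m &= \frac{52}{23} \approx 2.261 \\
s^2 &= \frac{176}{23} - \left(\frac{52}{23}\right)^2 = \frac{1344}{529} \approx 2.541 \\
\text{Poisson index} &= \frac{s^2 - m}{m^2} = \frac{148}{2704} \approx 0.055
\end{align*}
```

The variance is only slightly larger than the mean, so the Poisson index is close to zero.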
A summary is given of the fit: the parameter estimates are printed with their standard errors and correlations, including the working parameters, which are stable functions of the parameters defining the distribution and are used in the internal algorithm. The goodness of fit to the chosen distribution is indicated by the residual deviance, which has an asymptotic chi-square distribution with the specified degrees of freedom. The deviance is also the preferred statistic for comparison of nested models, for example the double Normal distribution with equal and unequal variances. This is followed by a table of observed and fitted values (expected frequencies), together with weighted residuals. If raw data are supplied, by default this table is formed by dividing the data into √n groups of approximately equal observed frequency, which are therefore likely to be of unequal widths. The NGROUPS option may be used to set the number of groups for this table. If data are supplied as a table, the fitted values use the classification from that table. In either case the LIMITS option may be used to supply a different set of limits, with the constraint that if tabulated data are analysed these limits should be a subset of the original limits so that the new groups are formed by aggregation.
The NOBSERVATIONS, FITTEDVALUES and RESIDUALS parameters can be used to save the number of observations in each cell, the fitted number, and the residual respectively (all in tables). The parameter estimates and their standard errors can be saved in variates specified by PARAMETERS and SE. The variance-covariance matrix for the estimated parameters can be saved as a symmetric matrix using the VCOVARIANCE parameter.
Having fitted the required distribution, the estimated cumulative distribution function (CDF) can be evaluated at specified values of X. These are defined using the XDEVIATES option. The values of the CDF can be printed (by selecting PRINT=proportions) or saved in a variate by setting the CUMPROPORTIONS parameter.
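A minimal sketch of saving the estimates and the estimated CDF, using only the options and parameters described above. The identifiers Class, Counts, Est, EstSE, EstVcov and Cdf are hypothetical names chosen for illustration, and the counts are those of the amalgamated example table.

```genstat
" Hypothetical identifiers; counts as in the amalgamated example "
FACTOR [LEVELS=!(0...5,99)] Class
TABLE [CLASSIFICATION=Class; VALUES=2,6,7,4,2,1,1] Counts
" Fit a Poisson distribution, evaluate the CDF at X = 0...4,
  and save the estimates, standard errors and CDF values "
DISTRIBUTION [PRINT=parameters,proportions; DISTRIBUTION=Poisson;\
   XDEVIATES=!(0,1,2,3,4)] DATA=Counts; PARAMETERS=Est; SE=EstSE;\
   VCOVARIANCE=EstVcov; CUMPROPORTIONS=Cdf
PRINT Est,EstSE
PRINT Cdf
```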
If you have several sets of data you may be interested in fitting the distribution individually to each set; this can be done by setting the DATA parameter to a list of identifiers. A separate analysis is then performed for each set of data, but any option settings are common to all the data sets. The data sets should all be specified in the same way, either as raw data or as tabulated counts. For tabulated counts, the same categories must be used for defining every table. You can also carry out one final fit to the combined data set, in order to investigate whether the data can be adequately modelled as coming from a single population. This combined fit is produced if any of the options relating to the combined fit have been set (that is, options CBPRINT, PARAMETERS, SE, VCOVARIANCE or CUMPROPORTIONS, which print or save information from the combined analysis). For each individual data set you can also save fitted values and residuals based on the parameters estimated from the combined data set, using the CBRESIDUALS and CBFITTEDVALUES parameters. The JOINT option can be used to specify that certain parameters should be held constant at their estimated values from the combined analysis during refits to the individual data sets. For continuous distributions only, a common dispersion parameter can be requested; for discrete distributions a common value can be requested for either the Poisson index or the ratio of variance to mean. An analysis of deviance is printed to compare the nested models.
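The following sketch shows how separate and combined fits might be requested together. The two tables SiteA and SiteB, their counts, and the save identifiers are all hypothetical; both tables are classified by the same factor, as required for tabulated counts.

```genstat
" Two hypothetical tables of counts on the same classifying factor "
FACTOR [LEVELS=!(0...5,99)] Grp
TABLE [CLASSIFICATION=Grp; VALUES=30,20,12,8,5,3,6] SiteA
TABLE [CLASSIFICATION=Grp; VALUES=25,22,14,7,6,2,4] SiteB
" Fit each table separately, then the combined data (requested by CBPRINT
  and the PARAMETERS option), then refit with a common Poisson index "
DISTRIBUTION [DISTRIBUTION=negativebinomial; CBPRINT=parameters;\
   JOINT=Poissonindex; PARAMETERS=CbEst] DATA=SiteA,SiteB;\
   PARAMETERS=EstA,EstB; CBFITTEDVALUES=CbFitA,CbFitB
```

Here the PARAMETERS option (inside the brackets) saves the estimates from the combined fit, while the PARAMETERS parameter saves one variate of estimates per data set.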
If the original data are available, the full log-likelihood is used in the optimization algorithm. Otherwise, an approximate log-likelihood is optimized, using representative values for each class. For some distributions, it is necessary to use stable working parameters in the optimization algorithm (Ross 1990), and the defining parameters for the distribution are then evaluated by a simple transformation.
The deviance and corresponding degrees of freedom that are printed as part of the model summary are based on the table of fitted values, and thus may be affected by the choice of limits. The residuals computed are deviance residuals (McCullagh & Nelder 1989), and the deviance is therefore the sum of squared residuals. The degrees of freedom are n–p–1, where n is the number of cells in the table of fitted values and p is the number of parameters estimated in the model. The default limits for grouping the raw data are designed to avoid small expected frequencies (for example in the tail cells), which can have an inflationary effect on the deviance; however, if the tails are important, because of the origin of the data, it may be important to specify the limits explicitly.
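For instance, assuming a Poisson distribution (one estimated parameter) is fitted to the seven-cell amalgamated table T2 defined earlier, and the table of fitted values keeps that classification, the degrees of freedom would be:

```latex
\[
\text{d.f.} = n - p - 1 = 7 - 1 - 1 = 5
\]
```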
An iterative Gauss-Newton optimization method is used to estimate the parameters of the distribution. The parameterization is chosen for each model so that the optimization is stable, but if there are any problems with particular data sets it may be necessary to control this process. The MAXCYCLE and TOLERANCE options allow you to increase the number of iterations and alter the convergence criterion for data sets that fail to converge. You can also specify initial values and step lengths for the parameters for each set of data using the INITIAL and STEPLENGTH parameters. These parameters should be set to variates of length appropriate for the distribution being fitted; for example, if DISTRIBUTION=Poisson they should have just one value. Another use of INITIAL and STEPLENGTH is to constrain a parameter to a particular value; for example, when fitting a double Normal the proportion parameter p could be fixed at 0.5 by setting the initial value to 0.5 and the step length to 0, thus fitting a double Normal in equal proportions. Note that the degrees of freedom are not adjusted to take account of this.
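A minimal sketch of controlling the iteration for a one-parameter fit. The starting value 2.3 and step length 0.1 are arbitrary illustrative choices, and the identifiers Class and Counts are hypothetical.

```genstat
" Hypothetical table of counts, as in the amalgamated example "
FACTOR [LEVELS=!(0...5,99)] Class
TABLE [CLASSIFICATION=Class; VALUES=2,6,7,4,2,1,1] Counts
" Poisson has one parameter, so INITIAL and STEPLENGTH each hold one value;
  allow more cycles and a tighter convergence criterion than the defaults "
DISTRIBUTION [PRINT=parameters,monitoring; DISTRIBUTION=Poisson;\
   MAXCYCLE=100; TOLERANCE=0.00001] DATA=Counts;\
   INITIAL=!(2.3); STEPLENGTH=!(0.1)
```

Setting the step length to 0 instead would hold the parameter at its initial value, as described above.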
Options: PRINT, CBPRINT, DISTRIBUTION, CONSTANT, LIMITS, NGROUPS, XDEVIATES, JOINT, PARAMETERS, SE, VCOVARIANCE, CUMPROPORTIONS, MAXCYCLE, TOLERANCE.
Parameters: DATA, NOBSERVATIONS, RESIDUALS, FITTEDVALUES, PARAMETERS, SE, VCOVARIANCE, CUMPROPORTIONS, CBRESIDUALS, CBFITTEDVALUES, STEPLENGTH, INITIAL.
Action with RESTRICT
You can restrict the units of a DATA variate to fit a distribution to a subset of its values.
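For example, a restriction could be used to exclude zero values before fitting a logNormal distribution, which requires strictly positive data. This is a sketch with made-up values; the identifier Conc is hypothetical.

```genstat
" Hypothetical data containing some zeros "
VARIATE [VALUES=0.8,1.5,2.1,0,3.7,5.2,1.1,0,2.9,4.4,\
   1.9,0.6,2.4,3.1,0,1.3,6.8,2.2,0.9,1.7] Conc
" Restrict to the strictly positive units, fit, then remove the restriction "
RESTRICT Conc; CONDITION=Conc.GT.0
DISTRIBUTION [DISTRIBUTION=logNormal] DATA=Conc
RESTRICT Conc
```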
References
McCullagh, P. & Nelder, J.A. (1989). Generalized Linear Models (second edition). Chapman and Hall, London.
Ross, G.J.S. (1990). Nonlinear Estimation. Springer-Verlag, New York.
See also
Procedures: BBINOMIAL, CUMDISTRIBUTION, DPROBABILITY, EDFTEST, FDRMIXTURE, KERNELDENSITY, NORMTEST, WSTATISTIC, RSURVIVAL.
Functions: CLBETA, CLBINOMIAL, CLBVARIATENORMAL, CLCHISQUARE, CLF, CLGAMMA, CLHYPERGEOMETRIC, CLINVNORMAL, CLLOGNORMAL, CLNORMAL, CLOGLOG, CLPOISSON, CLSMMODULUS, CLSRANGE, CLT, CLUNIFORM, CUBETA, CUBINOMIAL, CUBVARIATENORMAL, CUCHISQUARE, CUF, CUGAMMA, CUHYPERGEOMETRIC, CUINVNORMAL, CULOGNORMAL, CUNORMAL, CUPOISSON, CUSMMODULUS, CUSRANGE, CUT, CUUNIFORM, EDBETA, EDBINOMIAL, EDCHISQUARE, EDF, EDGAMMA, EDHYPERGEOMETRIC, EDINVNORMAL, EDLOGNORMAL, EDNORMAL, EDPOISSON, EDSMMODULUS, EDSRANGE, EDT, EDUNIFORM, GRBETA, GRBINOMIAL, GRCHISQUARE, GRF, GRGAMMA, GRHYPERGEOMETRIC, GRLOGNORMAL, GRNORMAL, GRPOISSON, GRSAMPLE, GRSELECT, GRT, GRUNIFORM, PRBETA, PRBINOMIAL, PRCHISQUARE, PRF, PRGAMMA, PRHYPERGEOMETRIC, PRINVNORMAL, PRLOGNORMAL, PRNORMAL, PRPOISSON, PRSMMODULUS, PRSRANGE, PRT, PRUNIFORM.
Commands for: Basic and nonparametric statistics.
Example
" Example DIST1: Negative Binomial and Log-Series distributions

  Taken from Chatfield et al. (1966), JRSS A, 129, p317-360.

  The data are recorded frequencies of number of purchases of a
  household product by 2000 households over a 26 week period.
  Thus, 1612 households made 0 purchases of the product, 164 made
  1 purchase, and so on. The final cell of the table is the tail,
  the number of households that made 21 or more purchases. "

FACTOR [LEVELS=!(0...21)] Npurchase; DECIMALS=0
TABLE [CLASSIFICATION=Npurchase] Purchases; DECIMALS=0
READ Purchases
1612 164 71 47 28 17 12 12 5 7 6 3 3 5 0 0 0 2 0 0 1 5 :

" Fit negative binomial"
DISTRIBUTION [DISTRIBUTION=negativebinomial] Purchases

" Fit logseries distribution"
DISTRIBUTION [DISTRIBUTION=logseries] Purchases