Performs empirical-distribution-function goodness-of-fit tests (V.M. Cave).
Options
PRINT = string tokens |
Controls printed output (summary , tests ); default summ , test |
---|---|
PLOT = string tokens |
What graphs to plot (kerneldensity , histogram ); default * |
TEST = string tokens |
Specifies the type of goodness-of-fit test to perform (andersondarling , cramervonmises , kolmogorovsmirnov ); default ande , cram , kolm |
DISTRIBUTION = string tokens |
Continuous distribution that is hypothesized to have generated the DATA ; (beta , b2 , burr , cauchy , chisquare , ev1 (or gumbel ), ev2 (or frechet ), ev3 , exponential , fdistribution , gamma , gev , gpareto , iburr , igamma , invnormal , iweibull , laplace , loggamma , logistic , loglogistic , lognormal , normal , paralogistic , pareto , stdnormal , stduniform , tdistribution , ubetamix , ugammamix , uniform , weibull , calculated ); default norm |
CONSTANT = string tokens |
Whether to estimate a constant for the distribution, when the parameter values are estimated from the DATA (estimate , omit ); default omit |
TMETHOD = string tokens |
Specifies the method used to perform the goodness-of-fit tests (likelihoodratio , traditional ); default like |
PARAMETERS = scalar or variate |
Parameter values for the hypothesized distribution; if this is not set, parameter values are estimated from the DATA |
NAMES = text |
Names to identify the parameters in PARAMETERS ; if this is not set, the default parameter ordering is assumed |
CDFCALCULATION = expression |
Expression, formed using argument X , that defines the cumulative distribution function of the hypothesized distribution; must be specified when DISTRIBUTION = calculated |
MCPARAMETERS = string tokens |
Whether the parameters are re-estimated or fixed during the Monte-Carlo simulations, when the parameter values are estimated from the DATA (fix , estimate ); default esti |
NTIMES = scalar |
Number of Monte-Carlo simulations to perform; default 999 |
SEED = scalar |
Seed for random number generation; default 0 continues an existing sequence or, if none, selects a seed automatically |
TITLE = text |
Title for the graphs; default generates the title automatically |
YTITLE = text |
Y-axis title for the graphs; default generates the title automatically |
XTITLE = text |
X-axis title for the graphs; default generates the title automatically |
WINDOW = scalar |
Window to use for the graphs; default 3 |
SCREEN = string tokens |
Whether to clear the screen before plotting the graph or to continue plotting on the old screen, when a single graph is requested (clear , keep ); default clear |
Parameters
DATA = variate |
Identifier of the variate holding the data |
---|---|
STATISTIC = pointer |
Pointer to scalar(s) to save the test statistic(s) |
MCSTATISTICS = pointer |
Pointer to variate(s) to save the Monte-Carlo simulated test statistic(s) |
PROBABILITY = pointer |
Pointer to scalar(s) to save the probability value(s) of the test statistic(s) |
Description
EDFTEST
performs one-sample two-sided empirical-distribution-function goodness-of-fit tests to assess whether a sample of data comes from a specified continuous distribution. The data values must be supplied, in a variate, using the DATA
parameter. The types of test to be performed are specified by the TEST
option, with settings andersondarling
(Anderson-Darling), cramervonmises
(Cramér-von Mises) and kolmogorovsmirnov
(Kolmogorov-Smirnov).
The method used to perform these tests is specified by the TMETHOD
option, with settings likelihoodratio
for the Zhang (2002) likelihood-ratio based method, and traditional
for the traditional approach. The default is to use the likelihood-ratio based tests, which are generally more powerful.
The distribution from which the data are assumed to arise is specified using the DISTRIBUTION
option; default normal
. Values for the parameters can be supplied, in either a scalar or a variate, by the PARAMETERS
option. However, when parameter values are supplied, a value must be specified for every parameter.
If parameter values are not supplied, they are estimated from the DATA
, except when DISTRIBUTION
is set to stdnormal
, stduniform
or calculated
.
The NAMES
option specifies a text to identify the individual parameter values within a variate of PARAMETERS
. When the names are not supplied, the default ordering of the parameters is assumed. (This matches the ordering in which parameter estimates are saved using the ESTIMATES
parameter of the DPROBABILITY
procedure.) The parameter names are listed below, in the default parameter ordering for each distribution:
Beta Type I (beta ) |
ashape, bshape; |
---|---|
Beta Type II (b2 ) |
ashape, bshape, rate; |
Burr (burr ) |
ashape, scale, bshape; |
Cauchy (cauchy ) |
location, scale; |
Chi-square (chisquare ) |
df; |
Extreme Value Type I (ev1 or gumbel ) |
location, scale; |
Extreme Value Type II (ev2 or frechet ) |
location, scale, shape; |
Extreme Value Type III (ev3 ) |
location, scale, shape; |
Exponential (exponential ) |
rate; |
F (fdistribution ) |
ndf, ddf; |
Gamma (gamma ) |
shape, rate, constant (optional); |
Generalized Extreme Value (gev ) |
shape, location, scale; |
Generalized Pareto (gpareto ) |
shape, scale; |
Inverse Burr (iburr ) |
ashape, scale, bshape; |
Inverse Gamma (igamma ) |
shape, scale; |
Inverse Normal (invnormal ) |
mean, shape; |
Inverse Weibull (iweibull ) |
scale, shape; |
Laplace (laplace ) |
location, scale; |
Log-Gamma (loggamma ) |
shape, rate; |
Logistic (logistic ) |
location, scale; |
Log-Logistic (loglogistic ) |
shape, scale; |
Log-Normal (lognormal ) |
mean, sd, constant (optional); |
Normal (normal ) |
mean, sd; |
Paralogistic (paralogistic ) |
shape, scale; |
Pareto (pareto ) |
shape, scale, constant (optional); |
t (tdistribution ) |
df; |
Uniform-Beta mixture (ubetamix ) |
weight, ashape, bshape; |
Uniform-Gamma mixture (ugammamix ) |
weight, shape, scale; |
Uniform (uniform ) |
min, max; |
Weibull (weibull ) |
shape, rate, constant (optional); |
The Gamma, Log-Normal, Pareto and Weibull distributions can have an extra constant parameter, so that the data values minus the constant then follow the specified distribution. When PARAMETERS
are not supplied, you can set option CONSTANT
= estimate
to estimate a constant from the DATA
. The default is not to estimate a constant.
The DISTRIBUTION
option provides the common distributions. Alternatively, for traditional tests (i.e. TMETHOD
=
traditional
) you can set DISTRIBUTION=calculated
to define your own distribution. You must then use the CDFCALCULATION
option to provide an expression, formed using argument X
, to calculate the cumulative distribution function. For example, the exponential
distribution with rate parameter of 2 could be specified by setting options
DISTRIBUTION=calculated
and
CDF=!E(X=1-EXP(-2*X)).
Monte-Carlo simulations are used to calculate the empirical probability values of the test statistics under the likelihood-ratio based method (i.e. TMETHOD
= likelihoodratio
), or, by default, under the traditional method when the parameters are estimated from the DATA
. The NTIMES
option defines how many Monte-Carlo simulations are used; default 999. The SEED
option can be set to initialize the random-number generator used during the Monte-Carlo simulations; if the procedure is called again with the same settings, you will get identical results. The default of zero continues the sequence of random numbers from a previous generation or, if this is the first use of the generator in this run of Genstat, the seed is initialized automatically.
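The way NTIMES and SEED interact can be illustrated with a minimal Python sketch (not Genstat code). It assumes a fully specified Uniform[0,1] hypothesis, uses the Cramér-von Mises statistic, and assumes that the empirical probability is the proportion of statistics, counting the observed one, that are at least as large as the observed value. The function names are hypothetical, and the formula is an assumption rather than a statement of what EDFTEST computes internally.

```python
import random

def cramer_von_mises(u):
    """Cramér-von Mises statistic for values already transformed by the hypothesized CDF."""
    u = sorted(u)
    n = len(u)
    return 1 / (12 * n) + sum((ui - (2 * i + 1) / (2 * n)) ** 2 for i, ui in enumerate(u))

def monte_carlo_pvalue(observed, n, ntimes=999, seed=1234):
    """Empirical probability from ntimes simulated samples (assumed (r + 1) / (ntimes + 1) rule)."""
    rng = random.Random(seed)  # a fixed seed makes the simulations repeatable
    exceed = 0
    for _ in range(ntimes):
        # Under a fully specified hypothesis the transformed data are Uniform(0,1)
        simulated = cramer_von_mises([rng.random() for _ in range(n)])
        exceed += simulated >= observed
    return (exceed + 1) / (ntimes + 1)

# Sample of size 10 from the Example section, tested against Uniform[0,1]
x = [0.621, 0.503, 0.203, 0.477, 0.710, 0.581, 0.329, 0.480, 0.554, 0.382]
observed = cramer_von_mises(x)  # the Uniform[0,1] CDF is the identity on [0,1]
p_first = monte_carlo_pvalue(observed, len(x))
p_again = monte_carlo_pvalue(observed, len(x))
print(observed, p_first, p_first == p_again)
```

Calling the function twice with the same seed and number of simulations reproduces the same probability, which mirrors the behaviour described for a non-zero SEED.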
By default, when parameters are estimated from the DATA
, they are re-estimated for each Monte-Carlo simulation to ensure that the correct probability values are obtained. However, this can be overridden by setting the MCPARAMETERS
option to fix
.
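The effect of re-estimating the parameters can be sketched in Python (not Genstat code). This is an illustration under stated assumptions: a Normal hypothesis with maximum-likelihood estimates of the mean and standard deviation, the Kolmogorov-Smirnov statistic, and the same assumed empirical-probability rule that counts the observed statistic. The names `mc_pvalue` and `mcparameters` are hypothetical and only mimic the option settings.

```python
import random
from statistics import NormalDist, fmean, pstdev

def ks_statistic(data, dist):
    """Two-sided Kolmogorov-Smirnov distance between the data and dist."""
    u = sorted(dist.cdf(v) for v in data)
    n = len(u)
    return max(max((i + 1) / n - ui, ui - i / n) for i, ui in enumerate(u))

def fit_normal(data):
    """Maximum-likelihood Normal fit (standard deviation with divisor n)."""
    return NormalDist(fmean(data), pstdev(data))

def mc_pvalue(data, mcparameters="estimate", ntimes=999, seed=1234):
    rng = random.Random(seed)
    fitted = fit_normal(data)
    observed = ks_statistic(data, fitted)
    exceed = 0
    for _ in range(ntimes):
        sample = [rng.gauss(fitted.mean, fitted.stdev) for _ in data]
        # "estimate": refit to each simulated sample; "fix": keep the original fit
        reference = fit_normal(sample) if mcparameters == "estimate" else fitted
        exceed += ks_statistic(sample, reference) >= observed
    return (exceed + 1) / (ntimes + 1)

x = [0.621, 0.503, 0.203, 0.477, 0.710, 0.581, 0.329, 0.480, 0.554, 0.382]
p_estimate = mc_pvalue(x, "estimate")
p_fix = mc_pvalue(x, "fix")
print(p_estimate, p_fix)
```

Because the observed statistic is computed with fitted parameters, comparing it with simulations that keep the parameters fixed tends to give probabilities that are too large, which is why re-estimation is the default.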
Printed output is controlled by the PRINT
option, with settings:
summary |
to print summary information; and |
---|---|
tests |
to print the test statistic(s), together with the corresponding probability value(s) under the assumption that the data are from the hypothesized distribution (so a low probability indicates that the data are unlikely to be from the hypothesized distribution). |
The printed output can be suppressed by setting option PRINT
= *. The default is to print the summary and the tests.
The PLOT
option controls graphical output, with settings:
histogram |
to plot a histogram of the Monte-Carlo simulated test statistics; and |
---|---|
kerneldensity |
to produce a kernel density plot of the Monte-Carlo simulated test statistics. |
By default, nothing is plotted.
The TITLE
, YTITLE
and XTITLE
options can supply an overall title, a y-axis title and an x-axis title for the graphs, respectively. If these are not supplied, suitable titles are generated automatically. When a single plot is requested, you can set option SCREEN
= keep
to plot the graph on an existing screen; by default the screen is cleared first. The WINDOW
option defines the window to use for the plots; default 3.
The STATISTIC
, PROBABILITY
and MCSTATISTICS
parameters allow the test statistics, their probabilities and the Monte-Carlo simulated test statistics to be saved, respectively, in pointers.
Options: PRINT
, PLOT
, TEST
, DISTRIBUTION
, CONSTANT
, TMETHOD
, PARAMETERS
, NAMES
, CDFCALCULATION
, MCPARAMETERS
, NTIMES
, SEED
, TITLE
, YTITLE
, XTITLE
, WINDOW
, SCREEN
.
Parameters: DATA
, STATISTIC
, MCSTATISTICS
, PROBABILITY
.
Method
If TMETHOD=traditional
, EDFTEST
calculates the traditional Anderson-Darling, Cramér-von Mises and Kolmogorov-Smirnov goodness-of-fit tests. When PARAMETERS
are supplied (or if MCPARAMETERS
= fix
), the probability of the Anderson-Darling test statistic is calculated using the fast algorithm (adinf) of Marsaglia & Marsaglia (2004), the probability of the Cramér-von Mises test statistic is calculated using the one-term linking approximation (equation 1.8) of Csörgő & Faraway (1996), and the probability of the Kolmogorov-Smirnov test statistic is calculated using the method of Carvalho (2015) for data sets with fewer than 171 values or using the Wang et al. (2003) approximation for larger data sets. When PARAMETERS
are not supplied, Monte-Carlo simulation is used by default to obtain empirical probability values of the test statistics. However, empirical probability values are not available for DISTRIBUTION
=
ubetamix
or ugammamix
.
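For reference, the three traditional statistics can be computed from the sorted values of the hypothesized cumulative distribution function, as in the following Python sketch (not Genstat code). It uses the standard textbook computing formulae and covers only the statistics, not the probability calculations cited above. The function name `edf_statistics` is hypothetical.

```python
import math

def edf_statistics(data, cdf):
    """Return (Anderson-Darling, Cramér-von Mises, Kolmogorov-Smirnov) statistics
    for a fully specified continuous distribution with the given CDF."""
    u = sorted(cdf(v) for v in data)
    n = len(u)
    # Kolmogorov-Smirnov: largest gap between the EDF steps and the CDF
    d_plus = max((i + 1) / n - ui for i, ui in enumerate(u))
    d_minus = max(ui - i / n for i, ui in enumerate(u))
    ks = max(d_plus, d_minus)
    # Cramér-von Mises: squared distances from the plotting positions (2i - 1) / (2n)
    cvm = 1 / (12 * n) + sum((ui - (2 * i + 1) / (2 * n)) ** 2 for i, ui in enumerate(u))
    # Anderson-Darling: weights the tails more heavily through the logarithms
    ad = -n - sum(
        (2 * i + 1) * (math.log(u[i]) + math.log(1 - u[n - 1 - i])) for i in range(n)
    ) / n
    return ad, cvm, ks

# Sample from the Example section, tested against Uniform[0,1] (CDF is the identity)
x = [0.621, 0.503, 0.203, 0.477, 0.710, 0.581, 0.329, 0.480, 0.554, 0.382]
ad, cvm, ks = edf_statistics(x, lambda v: v)
print(ad, cvm, ks)  # ks is 0.290, attained at the largest observation
```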
If TMETHOD
=
likelihoodratio
, EDFTEST
calculates likelihood-ratio based goodness-of-fit test statistics using the method of Zhang (2002). (Note, however, that the likelihood-ratio based method is not available for DISTRIBUTION
= ubetamix
, ugammamix
, or calculated
.) The resulting tests are generally more powerful than their traditional analogues. Monte-Carlo simulation is used to obtain empirical probability values of the test statistics.
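The likelihood-ratio statistics can be sketched in Python (not Genstat code), assuming they take the forms usually quoted from Zhang (2002) as the analogues of the Anderson-Darling, Cramér-von Mises and Kolmogorov-Smirnov statistics. Whether EDFTEST evaluates exactly these expressions is not stated here, so treat this as an illustration only. The function name `zhang_statistics` is hypothetical.

```python
import math

def zhang_statistics(data, cdf):
    """Return (Z_A, Z_C, Z_K) for a fully specified continuous CDF,
    assuming the forms given by Zhang (2002)."""
    u = sorted(cdf(v) for v in data)
    n = len(u)
    z_a = 0.0
    z_c = 0.0
    z_k = -math.inf
    for i, ui in enumerate(u, start=1):
        # Anderson-Darling analogue
        z_a -= math.log(ui) / (n - i + 0.5) + math.log(1 - ui) / (i - 0.5)
        # Cramér-von Mises analogue
        z_c += math.log((1 / ui - 1) / ((n - 0.5) / (i - 0.75) - 1)) ** 2
        # Kolmogorov-Smirnov analogue: maximum of the pointwise likelihood ratios
        term = (i - 0.5) * math.log((i - 0.5) / (n * ui)) + (n - i + 0.5) * math.log(
            (n - i + 0.5) / (n * (1 - ui))
        )
        z_k = max(z_k, term)
    return z_a, z_c, z_k

# Sample from the Example section, tested against Uniform[0,1] (CDF is the identity)
x = [0.621, 0.503, 0.203, 0.477, 0.710, 0.581, 0.329, 0.480, 0.554, 0.382]
z_a, z_c, z_k = zhang_statistics(x, lambda v: v)
print(z_a, z_c, z_k)
```

Large values of each statistic count against the hypothesized distribution, so empirical probabilities would be obtained by simulation in the same way as for the traditional statistics.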
When PARAMETERS
are not supplied, maximum-likelihood estimates are obtained using the methods in the DPROBABILITY
procedure. When MCPARAMETERS
=
estimate
, the parameter values are re-estimated for each simulated data set using the DPROBABILITY
procedure.
The kernel-density plot is generated by the KERNELDENSITY
procedure, using the method of Sheather & Jones (1991), with the default number of grid points. The simulated test statistics are plotted using red +
symbols along the x-axis, and the location of the observed test statistic is denoted by a blue line. As the observed test statistic contributes to the null distribution, it is included in the calculation of both the kernel density and the histogram.
Action with RESTRICT
The DATA
variate can be restricted to assess a subset of the data.
References
Carvalho, L. (2015). An improved evaluation of Kolmogorov’s distribution. Journal of Statistical Software, 65(3), 1-7.
Csörgő, S. & Faraway, J.J. (1996). The exact and asymptotic distributions of Cramér-von Mises statistics. Journal of the Royal Statistical Society, Series B, 58, 221-234.
Marsaglia, G. & Marsaglia, J. (2004). Evaluating the Anderson-Darling distribution. Journal of Statistical Software, 9(2), 1-5.
Sheather, S.J. & Jones, M.C. (1991). A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society, Series B, 53, 683-690.
Wang, J., Tsang, W.W. & Marsaglia, G. (2003). Evaluating Kolmogorov’s distribution. Journal of Statistical Software, 8(18), 1-4.
Zhang, J. (2002). Powerful goodness-of-fit tests based on the likelihood ratio. Journal of the Royal Statistical Society, Series B, 64, 281-294.
See also
Directive: DISTRIBUTION
.
Procedures: DPROBABILITY
, NORMTEST
, KOLMOG2
, WSTATISTIC
.
Commands for: Basic and nonparametric statistics.
Example
CAPTION 'EDFTEST example',\
        !t('Random sample of size 10 assumed to come from the Uniform distribution.'),\
        !t('From W.J. Conover (1980), Practical Nonparametric Statistics 2ed, pg 348.');\
        STYLE=meta,plain,plain
VARIATE [VALUES=0.621,0.503,0.203,0.477,0.710,0.581,0.329,0.480,0.554,0.382] x
"Assuming a Uniform[0,1] distribution."
"Likelihood-ratio based tests with histograms of the Monte-Carlo test statistics."
EDFTEST [PLOT=histogram; DISTRIBUTION=uniform; PARAMETERS=!(1,0); NAMES=!t(max,min);\
        SEED=1234; NTIMES=999] x
"Traditional tests."
EDFTEST [TMETHOD=traditional; DISTRIBUTION=uniform; PARAMETERS=!(1,0);\
        NAMES=!t(max,min)] x
"Estimating parameter values from the data."
"Likelihood-ratio based tests with kernel density plots of the Monte-Carlo test statistics."
EDFTEST [PLOT=kerneldensity; DISTRIBUTION=uniform; SEED=1234; NTIMES=999] x
"Traditional tests with kernel density plots of the Monte-Carlo test statistics."
EDFTEST [TMETHOD=traditional; PLOT=kerneldensity; DISTRIBUTION=uniform; SEED=1234;\
        NTIMES=999] x