Estimates parameters in Box-Jenkins models for time series.

### Options

| `PRINT` = string tokens | What to print (`model`, `summary`, `estimates`, `correlations`, `monitoring`); default `mode,summ,esti` |
|---|---|
| `LIKELIHOOD` = string token | Method of likelihood calculation (`exact`, `leastsquares`, `marginal`); default `exac` |
| `CONSTANT` = string token | How to treat the constant (`estimate`, `fix`); default `esti` |
| `RECYCLE` = string token | Whether to continue from previous estimation (`yes`, `no`); default `no` |
| `WEIGHTS` = variate | Weights; default `*` |
| `MVREPLACE` = string token | Whether to replace missing values by their estimates (`yes`, `no`); default `no` |
| `FIX` = variate | Defines constraints on parameters (ordered as in each model, transfer-function models first): zeros fix parameters, parameters with equal numbers are constrained to be equal; default `*` |
| `METHOD` = string token | Whether to carry out full iterative estimation, to carry out just one iterative step, to perform no steps but still give parameter standard deviations, or only to initialize for forecasting by regenerating residuals (`full`, `onestep`, `zerostep`, `initialize`); default `full` |
| `MAXCYCLE` = scalar | Maximum number of iterations; default 15 |
| `TOLERANCE` = scalar | Criterion for convergence; default 0.0004 |
| `SAVE` = identifier | To name save structure, or supply save structure with transfer-functions; default `*` i.e. transfer-functions taken from the latest model |

### Parameters

| `SERIES` = variate | Time series to be modelled (output series) |
|---|---|
| `TSM` = TSM | Model for output series |
| `BOXCOXMETHOD` = string token | How to treat transformation parameter in output series (`fix`, `estimate`); default `fix` |
| `RESIDUALS` = variate | To save residual series |

### Description

The main use of `TFIT` is to fit parameters to time-series models, although you can also use it to initialize for the `TFORECAST` directive, even when the model parameters are already known. `TFIT` was originally called `ESTIMATE`, but was renamed in Release 14 to emphasize its status as a time-series command. The earlier name (`ESTIMATE`) was retained to allow previous programs to continue to run, but it may be removed in a future release.

You need to define a TSM structure before using `TFIT`, to provide the setting for the `TSM` parameter. You may also wish to give a `TRANSFERFUNCTION` statement, for example if you wish to specify explanatory variables for regression with ARIMA errors, or to define transfer-function models. In many applications of estimating a univariate ARIMA model, you will need only a simple form of the directive, such as:

`TFIT Daylength; TSM=Erp`
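As a minimal sketch of the complete sequence, the statements below declare the model and then fit it. The identifiers `Daylength` and `Erp` are those of the statement above, and `Daylength` is assumed to hold the series already; the orders (1,0,1) are an illustrative assumption, not a recommendation for any particular series.

```genstat
" Declare an ARIMA model with p=1, d=0, q=1 (illustrative orders) "
TSM [MODELTYPE=arima] Erp; ORDERS=!(1,0,1)
" Fit the model to the series, using the default options "
TFIT SERIES=Daylength; TSM=Erp
```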

The `SERIES` parameter specifies the variate holding the time series data to which the model is to be fitted.

The `TSM` parameter specifies the ARIMA model that is to be fitted to the time-series data. This TSM must already have been declared and its `ORDERS` must have been set. If the `LAGS` parameter of the TSM has been set, the lags must have been given values. However, if the `PARAMETERS` of the TSM have been set, these need not have been declared previously nor given values. When the parameter values are not set, default values are used: these are all zero, except for the transformation parameter, which is set to 1.0 if it is not to be estimated (see `BOXCOXMETHOD` and `FIX` below). Any parameter values that you do specify will be used as initial values for the parameters in the model; Genstat replaces any missing values by the default values. If any group of autoregressive or moving-average parameters does not satisfy the required conditions for stationarity or invertibility, all the parameters to be estimated are reset by Genstat to the default values. After `TFIT`, the parameters of the TSM contain the estimated parameter values.

The `BOXCOXMETHOD` parameter allows you to estimate the transformation parameter $\lambda$.

The `RESIDUALS` parameter saves the estimated innovations (or residuals). The residuals are calculated for $t = t_0 \ldots N$, where $t_0 = 1 + p + d - q$ for a simple ARIMA model. If $t_0 > 1$, missing values will be inserted for $t = 1 \ldots t_0 - 1$.

The `PRINT` option controls printed output. If you specify `monitoring`, then at each cycle of the iterative process of estimation, Genstat prints the *deviance* for the current fitted model, together with the current estimates of model parameters. The format is simple with the minimum of description, to let you judge easily how quickly the process is converging. The other settings of `PRINT` control output at the end of the iterative process. If you specify `model`, the model is briefly described, giving the identifier of the series and the time-series model, together with the orders of the model. If you specify `summary`, the deviance of the final model is printed, along with the residual number of degrees of freedom. If you specify `estimates`, the estimates of the model parameters are printed in a descriptive format, together with their estimated standard errors and reference numbers. If you specify `correlations`, the correlations between estimates of parameters are printed, with reference numbers to identify the parameters.
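For instance, several settings can be combined to watch convergence and then see the final estimates with their correlations. This is only a sketch; it reuses the identifiers `Daylength` and `Erp` from the simple statement earlier in this section.

```genstat
" Print the deviance and estimates at each cycle, then the final
  estimates and the correlations between them "
TFIT [PRINT=monitoring,estimates,correlations] SERIES=Daylength; TSM=Erp
```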

The `LIKELIHOOD` option specifies the criterion that Genstat minimizes to obtain the estimates of the parameters: this is described in detail later in this section. The default setting `exact` is recommended for most applications.

You can use the `CONSTANT` option to specify whether Genstat is to estimate the constant term $c$ in the model. If `CONSTANT=fix`, the constant is held at the value given in the initial parameter values; this need not be zero.

The `RECYCLE` option allows a previous `TFIT` statement to continue; this can save computing time. If `RECYCLE=yes`, the most recent `TFIT` statement is continued, unless the `SAVE` option has been set to the save structure from some other `TFIT` statement. The `SERIES` and `TSM` settings are then taken from this previous `TFIT` statement: Genstat ignores any specified in the current statement. Most of the settings of other parameters and options are carried over from the previous statement, and new values are ignored. However, there are some exceptions. You can change the `RESIDUALS` variate, you can reset `MAXCYCLE` to the number of further iterations you require, and you can change the settings of `TOLERANCE` and `PRINT`. You can also change the values of the variate in the `WEIGHTS` option; you can thus get reweighted estimation. You can change the values of the `SERIES` itself, although you cannot change missing values; if the `MVREPLACE` option was previously set to `yes`, you must put the original missing values back into the `SERIES` variate before the new `TFIT` statement.
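One possible pattern, sketched here on the assumption that a first run stops at its iteration limit before converging, is to name the save structure and then continue from it. `Fitsave` and `Resid` are hypothetical identifiers, and the iteration counts are arbitrary.

```genstat
" First run: at most 5 iterations, keeping the save structure "
TFIT [PRINT=monitoring; MAXCYCLE=5; SAVE=Fitsave] SERIES=Daylength; TSM=Erp
" Continue the same fit for up to 10 further iterations.
  SERIES and TSM are repeated only for readability: with RECYCLE=yes
  they are taken from the earlier statement "
TFIT [RECYCLE=yes; SAVE=Fitsave; MAXCYCLE=10; PRINT=summary,estimates] SERIES=Daylength; TSM=Erp; RESIDUALS=Resid
```

Only the settings listed above as exceptions (here `MAXCYCLE`, `PRINT` and `RESIDUALS`) take effect in the second statement.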

The `WEIGHTS` option includes in the likelihood a weighted sum-of-squares term

$$\sum_{t=t_0}^{N} w_t \, a_t^2$$

where $w_t$, $t = 1 \ldots N$, are provided by the `WEIGHTS` variate. The values of $w_t$ must be strictly positive. If $t_0 < 1$, where $t_0 = 1 + d + p - q$, then $w_t$ is taken as 1 for $t < 1$.

The `MVREPLACE` option allows you to request any missing values in the time series to be replaced by their estimates after estimation. Genstat will always estimate the missing values, irrespective of the setting of `MVREPLACE`; so you can also obtain these estimates later from `TKEEP`.

The `FIX` option allows you to place simple constraints on parameter values throughout the estimation. The units of the `FIX` variate correspond to the parameters of the TSM, excluding the innovation variance. The values of the `FIX` variate are used to define the parameter constraints and must be integers. If an element of the `FIX` variate is set to 0, the corresponding parameter is constrained to remain at its initial setting. If an element is not 0, and the value is unique in the `FIX` variate, the parameter is estimated without any special constraint. If two or more values are equal, the corresponding parameters are constrained to be equal throughout the estimation. The number that you give to a parameter by `FIX` will appear as the reference number of the parameter in the printed model and correlation matrix. This option overrides any setting of `CONSTANT` and `BOXCOXMETHOD`.
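To illustrate the numbering scheme, the sketch below uses an ARIMA(2,0,0) model. It assumes that the units of the `FIX` variate are ordered as transformation parameter, constant, then the two autoregressive parameters (that is, the TSM parameter order with the innovation variance dropped); that ordering is an assumption here and should be checked against the `TSM` documentation. `Ar2` and `Constr` are hypothetical identifiers.

```genstat
TSM [MODELTYPE=arima] Ar2; ORDERS=!(2,0,0)
" Assumed unit order: transformation, constant, phi1, phi2.
  0 holds the transformation parameter at its initial value;
  the unique value 1 leaves the constant free;
  the repeated 2 forces phi1 and phi2 to be equal "
VARIATE [VALUES=0,1,2,2] Constr
TFIT [FIX=Constr] SERIES=Daylength; TSM=Ar2
```

With these settings the reference numbers 1 and 2 would be the ones shown against the constant and the autoregressive parameters in the printed output.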

The `MAXCYCLE` option specifies the maximum number of iterations to be performed.

The `TOLERANCE` option specifies the convergence criterion. Genstat decides that convergence has occurred if the fractional reduction in the deviance in successive iterations is less than the specified value, provided also that the search is not encountering numerical difficulties that force the step length in the parameter space to be severely limited. You can use monitoring to judge whether, for all practical purposes, the iterations have converged. Genstat gives warnings if the specified number of iterations is completed without convergence, or if the search procedure fails to find a reduced value of the deviance despite a very short step length. Such an outcome may be due to complexities in the likelihood function that make the search difficult, but can be due to your specifying too small a value for `TOLERANCE`.
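If a fit reaches the iteration limit without converging, one option is to allow more cycles and inspect the monitoring output before changing the criterion. The values 30 and 0.001 below are arbitrary illustrations, not recommended settings.

```genstat
" Allow up to 30 iterations, relax the criterion slightly, and
  monitor the deviance at each cycle "
TFIT [PRINT=monitoring,summary; MAXCYCLE=30; TOLERANCE=0.001] SERIES=Daylength; TSM=Erp
```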

The `SAVE` option allows you to save the *time-series save structure* produced by `TFIT`. You can use this in further `TFIT` statements with `RECYCLE=yes`, or in `TFORECAST` statements. It can also be used by the `TDISPLAY` and `TKEEP` directives. Genstat automatically saves the structure from the most recent `TFIT` statement, but this is overwritten when the next `TFIT` statement is executed, unless you have used `SAVE` to give it an identifier of its own. You can access the current time-series save structure by the `SPECIAL` option of the `GET` directive, and reset it by the `TSAVE` option of the `SET` directive.

The `METHOD` option has four possible settings. The default setting is `full`, which gives the usual estimation to convergence or until the maximum number of iterations has been reached.

With the setting `METHOD=initialize`, `TFIT` carries out only the residual regeneration steps (that is, calculation of $a_t$ for $t = t_0 \ldots N$) which are needed before `TFORECAST` can be used. If the model has just been estimated using the default `full` setting, this is unnecessary. The setting `initialize` is useful when the time series is supplied with a known model and a minimal amount of calculation is wanted to prepare or initialize for forecasting. None of the model parameters are changed, and no standard errors of parameter estimates are available. Missing values in the series *are* estimated, so this setting provides an efficient way of getting their values when the time series model is known; they can then be obtained using `TKEEP`. The deviance value is also available from `TKEEP`. This setting is therefore useful for efficient calculation of deviance values when you want to plot the shape of the deviance as a function of parameter values.

With the setting `METHOD=zerostep` the effect is the same as for `initialize` except that `TFIT` also calculates the standard errors of the parameters as if they had just been estimated. These can be used together with other quantities available from `TKEEP` to construct confidence intervals and carry out tests on the parameter values, which remain unchanged except that the innovation variance in the ARIMA model is replaced by its estimate conditional on all other parameters.
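The known-model case might look like the sketch below. Two things are assumptions rather than facts taken from this page: that the `PARAMETERS` variate of an ARIMA TSM is ordered as transformation parameter, constant, innovation variance, then autoregressive and moving-average parameters; and that `TKEEP` returns the deviance through a parameter named `DEVIANCE`. The identifiers `Known`, `Ar1` and `Dev` and the parameter values are hypothetical.

```genstat
" Known ARIMA(1,0,0) model, assuming the order
  lambda=1, c=0, innovation variance=1, phi1=0.6 "
VARIATE [VALUES=1,0,1,0.6] Known
TSM [MODELTYPE=arima] Ar1; ORDERS=!(1,0,0); PARAMETERS=Known
" Regenerate the residuals only; no parameters are changed "
TFIT [PRINT=*; METHOD=initialize] SERIES=Daylength; TSM=Ar1
" Retrieve the deviance for these parameter values "
TKEEP DEVIANCE=Dev
PRINT Dev
```

Repeating these statements over a grid of parameter values would give the points needed to plot the deviance as a function of a parameter; replacing `initialize` by `zerostep` would additionally make standard errors available.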

The setting `METHOD=onestep` gives the same results as specifying the option `MAXCYCLE=1` in `TFIT`. It is convenient for carrying out quick tests of model parameters.

To explain the `LIKELIHOOD` option, we need to describe the estimation of ARIMA models in more detail. You may want to skip this if you are doing fairly routine work.

The first step in deriving the likelihood for a simple model is to calculate

$$w_t = \nabla^d y_t - c, \qquad t = 1 + d \ldots N$$

This has a multivariate Normal distribution with dispersion matrix $V \sigma_a^2$, where $V$ depends only on the autoregressive and moving-average parameters. The likelihood is then proportional to

$$\left\{ \sigma_a^{2m} \, |V| \right\}^{-1/2} \exp\left\{ -\frac{w' V^{-1} w}{2 \sigma_a^2} \right\}$$

where $m = N - d$. In practice Genstat evaluates this by using the formula

$$w' V^{-1} w = W + \sum_{t=t_0}^{N} a_t^2 = S$$

where $t_0 = 1 + d + p - q$. The term $W$ is a quadratic form in the $p$ values $w_{1+d-q}, \ldots, w_{p+d-q}$: it takes account of the starting-value problem for regenerating the innovations $a_t$, and avoids losing information as would happen if the process used only a conditional sum-of-squares function. If $q > 0$, Genstat introduces unobserved values of $w_{1+d-q}, \ldots, w_d$ in order to calculate the sum $S$. Genstat uses linear least-squares to calculate these $q$ starting values for $w$, thus minimizing $S$. We shall call them *back-forecasts*, though if $p > 0$ they are actually computationally convenient linear functions of the proper back-forecasts. We shall call $S$ the sum-of-squares function: it is the sum of the quadratic form and the sum-of-squares term, and is identical to the value expressed by Box & Jenkins (1970) as

$$\sum_{t=-\infty}^{N} a_t^2$$

using infinite back-forecasting; that is, using:

$$W = \sum_{t=-\infty}^{t_0 - 1} a_t^2$$

The values $a_t$ for $t = t_0 \ldots N$ agree precisely with those of Box and Jenkins.

To clarify all this, consider examples with no differencing; that is, $d = 0$. If $p = 0$ and $q = 1$ then $W = 0$ and $t_0 = 0$, and one back-forecast $w_0$ is introduced. If $p = 1$ and $q = 0$ then $W = (1 - \phi_1^2) w_1^2$ and $t_0 = 2$, and no back-forecasts are needed. If $p = q = 1$ then $W = (1 - \phi_1^2) w_0^2$ and $t_0 = 1$, and so one back-forecast $w_0$ is needed. In this case the proper back-forecast is in fact $w_0 / (1 - \theta_1 \phi_1)$.

The value of $|V|$ is a by-product of calculating $W$ and the back-forecast. For example, if $p = 0$ and $q = 1$, then

$$|V| = 1 + \theta_1^2 + \ldots + \theta_1^{2N}$$

If $p = 1$ and $q = 0$,

$$|V| = \frac{1}{1 - \phi_1^2}$$

and if $p = q = 1$,

$$|V| = 1 + \frac{(\phi_1 - \theta_1)^2 \left( 1 + \theta_1^2 + \ldots + \theta_1^{2N-2} \right)}{1 - \phi_1^2}$$

Concentrating the likelihood over $\sigma_a^2$ by setting $\sigma_a^2 = S/m$ yields a value proportional to $\left\{ |V|^{1/m} S \right\}^{-m/2}$.
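The concentration step can be spelled out as follows. This is a sketch of the standard argument, using only the symbols defined above and dropping additive constants from the log-likelihood:

$$
\begin{aligned}
\log L(\sigma_a^2) &= -\frac{m}{2} \log \sigma_a^2 - \frac{1}{2} \log |V| - \frac{S}{2 \sigma_a^2} \\
\frac{\partial \log L}{\partial \sigma_a^2} &= -\frac{m}{2 \sigma_a^2} + \frac{S}{2 \sigma_a^4} = 0
\quad \Longrightarrow \quad \hat{\sigma}_a^2 = \frac{S}{m} \\
\log L(\hat{\sigma}_a^2) &= -\frac{m}{2} \log \frac{S}{m} - \frac{1}{2} \log |V| - \frac{m}{2}
= -\frac{m}{2} \log \left\{ |V|^{1/m} S \right\} + \text{constant}
\end{aligned}
$$

Maximizing the concentrated likelihood is therefore equivalent to minimizing $|V|^{1/m} S$.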

The default setting of the `LIKELIHOOD` option is `exact`. In this case the concentrated likelihood is maximized, by minimizing the quantity

$$D = |V|^{1/m} S$$

which is called the deviance.

The setting `leastsquares` specifies that Genstat is to minimize only the sum-of-squares term $S$. This criterion corresponds to the back-forecasting sum-of-squares used by Box & Jenkins (1970), and will in many cases give estimates close to those of the exact likelihood. However, some discrepancy arises if the series is short or the model is close to the invertibility boundary. This is because of limitations on the back-forecasting procedure, as described in the algorithms of Box & Jenkins (1970). The deviance value $D$ that Genstat prints is, with this setting, simply $S$.

When you use exact likelihood, the factor $|V|^{1/m}$ reduces bias in the estimates of the parameters; you would get bias if you used `leastsquares` instead. However, $|V|^{1/m}$ is generally close to one, unless the series is short or the model is either seasonal or close to the boundaries of invertibility or stationarity. The `leastsquares` setting is therefore adequate for most long, non-seasonal sets of data; using it may reduce the computation time by up to 50%.

When you specify that Genstat is to estimate the parameter $\lambda$ of the Box-Cox transformation, Genstat also includes the Jacobian of the transformation in the likelihood function. The result is an extra factor $G^{-2(\lambda - 1)}$ in the definition of the deviance, $G$ being the geometric mean of the data,

$$G = \left( \prod_{t=1}^{N} y_t \right)^{1/N}$$

Note that this is not included unless $\lambda$ is being estimated, even if $\lambda \neq 1$.

You can treat differences in $N \log(D)$ as a chi-square variable in order to test nested models: this is supported by asymptotic theory, and by experience with models that have moderately large sample sizes. Similarly, you can select between different models by using $N \log(D) + 2k$ as an information criterion, $k$ being the number of estimated parameters. But both of these test procedures are questionable if the estimated models are close to the boundaries of invertibility or stationarity. Provided all the models that are being compared have the same orders of differencing, with the differenced series being of length $m$, it is recommended that $m \log(D)$ be used rather than $N \log(D)$ in these tests, since $m \log(D)$ is precisely minus two multiplied by the log-likelihood as defined above.
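A comparison of two nested models along these lines could be sketched as below. It assumes that `TKEEP` returns the deviance through a parameter named `DEVIANCE`, that the series has no missing values and is not restricted, and that there is no differencing, so that $m = N$. The identifiers `Small`, `Large`, `Dsmall`, `Dlarge`, `M` and `Chisq` are hypothetical.

```genstat
" Two nested models with the same differencing (d=0 here) "
TSM [MODELTYPE=arima] Small; ORDERS=!(1,0,0)
TSM [MODELTYPE=arima] Large; ORDERS=!(1,0,1)
" Fit each model and keep its deviance "
TFIT [PRINT=*] SERIES=Daylength; TSM=Small
TKEEP DEVIANCE=Dsmall
TFIT [PRINT=*] SERIES=Daylength; TSM=Large
TKEEP DEVIANCE=Dlarge
" With d=0 the differenced series has length m = N "
CALCULATE M = NVALUES(Daylength)
CALCULATE Chisq = M * (LOG(Dsmall) - LOG(Dlarge))
PRINT Chisq
```

Here the larger model has one extra parameter, so `Chisq` would be referred to a chi-square distribution on one degree of freedom.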

The setting `marginal` is relevant mainly when `TFIT` is used for regression with ARIMA errors. (This requires a `TRANSFERFUNCTION` statement beforehand to specify the explanatory variables.) The likelihood for the model is defined as that of the univariate error series $e_t$, which is defined in general by

$$e_t = y_t - b_1 x_{1,t} - \ldots - b_m x_{m,t}$$

(the $x_i$ being $m$ explanatory variables). The constant term therefore appears in the model after any differencing of $e_t$; for example

$$\nabla e_t = c + (1 - \theta_1 B) a_t$$

You can get bias in the estimates of the parameters of an ARIMA model because the regression is estimated at the same time. You can guard against this by specifying `LIKELIHOOD=marginal`. This can be particularly important if the series are short or if you use many explanatory variables (Tunnicliffe Wilson 1989). The deviance is now defined as

$$D = S \left( |X' V^{-1} X| \, |V| \right)^{1/m}$$

where $m$ (here the length of the differenced series, as in the exact likelihood, rather than the number of explanatory variables) is reduced by the number of regressors (including the constant term) and the columns of $X$ are the differenced explanatory series: the other terms are as in the exact likelihood.

You can use the `marginal` setting also for univariate ARIMA modelling, when the constant term is the only explanatory term. Furthermore, Genstat deals with missing values in the response variate by doing a regression on indicator variates; these too are included in the $X$ matrix. However, you cannot use marginal likelihood and estimate a transformation parameter in either the transfer-function model or an ARIMA model. Neither can you use it if you set the `FIX` option in `TFIT`. In these cases Genstat automatically resets the `LIKELIHOOD` option to `exact`.

At every iteration with the setting `LIKELIHOOD=marginal`, the regression coefficients are the maximum-likelihood estimates conditional upon the estimated values of the parameters of the ARIMA model: these are also the generalized least-squares estimates, conditioned in the same way. This is so even if `MAXCYCLE=0`; that is, the coefficients of the regression are re-estimated even at iteration 0. Therefore you must not use the `marginal` setting with the option `METHOD=initialize` to initialize for `TFORECAST`. You can compare deviance values that were obtained using marginal likelihood only for models with the same explanatory variables and the same differencing structure in the error model.
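A regression with ARIMA errors fitted by marginal likelihood might be set up as in the sketch below. The form of the `TRANSFERFUNCTION` statement is an assumption: it supposes that listing the explanatory variates as its `SERIES` setting, with no transfer-function models given, produces simple regression terms, which should be checked against the `TRANSFERFUNCTION` documentation. `Y`, `X1`, `X2` and `Errmod` are hypothetical identifiers.

```genstat
" ARIMA model for the error series (illustrative orders) "
TSM [MODELTYPE=arima] Errmod; ORDERS=!(1,0,0)
" Declare the explanatory series (assumed to give plain regression terms) "
TRANSFERFUNCTION SERIES=X1,X2
" Fit by marginal likelihood; FIX must not be set and no
  transformation parameter may be estimated "
TFIT [LIKELIHOOD=marginal] SERIES=Y; TSM=Errmod
```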

Options: `PRINT`, `LIKELIHOOD`, `CONSTANT`, `RECYCLE`, `WEIGHTS`, `MVREPLACE`, `FIX`, `METHOD`, `MAXCYCLE`, `TOLERANCE`, `SAVE`.

Parameters: `SERIES`, `TSM`, `BOXCOXMETHOD`, `RESIDUALS`.

### Action with `RESTRICT`

The `SERIES` variate can be restricted, but this must be to a contiguous set of units.

### References

Box, G.E.P. & Jenkins, G.M. (1970). *Time Series Analysis, Forecasting and Control*. Holden-Day, San Francisco.

Tunnicliffe Wilson, G. (1989). On the use of marginal likelihood in time-series model estimation. *Journal of the Royal Statistical Society, Series B*, 51, 15-27.

### See also

Directives: `TSM`, `FTSM`, `TRANSFERFUNCTION`, `TDISPLAY`, `TFILTER`, `TFORECAST`, `TKEEP`, `TSUMMARIZE`, `CORRELATE`, `FOURIER`.

Procedures: `BJESTIMATE`, `BJFORECAST`, `BJIDENTIFY`, `MOVINGAVERAGE`, `PERIODTEST`, `PREWHITEN`, `REPPERIODOGRAM`, `SMOOTHSPECTRUM`.

Commands for: Time series.

### Example

    " Example TFIT-1: Fitting a seasonal ARIMA model"
    VARIATE time; VALUES=!(1...120)
    FILEREAD [NAME='%gendir%/examples/TFIT-1.DAT'] apt
    " Display the correlation structure of the logged data"
    CALCULATE lapt = LOG(apt)
    BJIDENTIFY [GRAPHICS=high; WINDOWS=!(5,6,7,8)] lapt
    " Calculate the autocorrelations of the differences and seasonally
      differenced series"
    CALCULATE ddslapt = DIFFERENCE(DIFFERENCE(lapt; 12); 1)
    CORRELATE [PRINT=auto; MAXLAG=48] ddslapt; AUTO=ddsr
    " Define a model for the series: IMA(1) (that is, a model with a single
      moving-average parameter applied to the differences of the series)
      plus a seasonal IMA(1) component"
    TSM [MODELTYPE=arima] airpass; ORDERS=!((0,1,1)2,12)
    " Form preliminary estimates of the parameters, using a log
      transformation (BOXCOX=0 is equivalent to log)"
    FTSM [PRINT=model] airpass; ddsr; BOXCOX=0
    " Get the best estimates, fixing the constant"
    TFIT [CONSTANT=fix] SERIES=apt; TSM=airpass
    " Graph the residuals against time"
    TKEEP RESID=resids
    DGRAPH [WINDOW=3; KEYWINDOW=0; TITLE='Residuals vs Time'] resids; time
    " Test the independence of the residuals"
    CORRELATE [GRAPH=auto; MAXLAG=48] resids; TEST=S
    PRINT 'Test statistic for independence of the residuals',S