Helps search through models for a regression or generalized linear model (P.W. Goedhart).

### Options

`PRINT` = string token |
Printed output required (`model` , `results` ); default `mode` , `resu` |
---|---|

`METHOD` = string tokens |
Model selection method to employ (`allpossible` , `forward` , `backward` , `fstepwise` , `bstepwise` , `accumulated` , `pooled` ); default `allp` |

`FORCED` = formula |
Model formula to include in every model; default `*` |

`CONSTANT` = string token |
How to treat the constant (`estimate` , `omit` ); default `esti` |

`FACTORIAL` = scalar |
Limit for expansion of all model terms; default 3 |

`DENOMINATOR` = string token |
Whether to base ratios in accumulated summaries on rms from model with smallest residual ss or smallest residual ms (`ss` , `ms` ); default `ss` |

`INRATIO` = scalar |
Criterion for inclusion of terms for forward selection, backward elimination and stepwise regression; default 1.0 |

`OUTRATIO` = scalar |
Criterion for exclusion of terms for forward selection, backward elimination and stepwise regression; default 1.0 |

`MAXCYCLE` = scalar |
Limit on number of times to repeat stepwise selection methods, unless no change is made; default 50 |

`CRITERION` = string token |
Criterion for selecting best models among all possible models (`r2` , `adjusted` , `cp` , `ep` , `aic` , `bic` , `sic` , `meandeviance` , `deviance` ); default `adju` |

`EXTRA` = string token |
Criterion which is also printed for the selected best models (`r2` , `adjusted` , `cp` , `ep` , `aic` , `bic` , `sic` , `meandeviance` , `deviance` ); default `cp` when `DISPERSION=*` , and `mean` otherwise |

`AFACTORIAL` = scalar |
Limit for expansion of `FREE` model terms for the fitting of all possible models; default 3 |

`PENALTY` = scalar |
Penalty for Mallows Cp and Akaike’s information criterion AIC; default 2 |

`NTERMS` = scalar |
Limit on the number of terms to be fitted when fitting all possible models; default 16 |

`NBESTMODELS` = scalar |
Number of best models printed for each subset size; default 8 |

`PPROBABILITY` = scalar |
When `METHOD=allpossible` , only models with all probabilities less than `PPROBABILITY` are printed; default 1 i.e. all models are printed |

`FINALMODELS` = pointer |
Pointer to save the final models for forward, backward, `fstepwise` and `bstepwise` regression methods |

`ALLMODELS` = pointer |
Pointer to save formulae for all possible regression models containing the fitted terms of all the models; every formula includes the `FORCED` formula if set |

`ESTIMATES` = pointer |
Pointer to save variates for all possible regression models containing the parameter estimates |

`SE` = pointer |
Pointer to save variates for all possible regression models containing standard errors of the parameter estimates |

`RESULTS` = pointer |
Pointer to save variates for all possible regression models containing the criteria (`r2` , `adjusted` , `cp` , `ep` , `aic` , `sic` or `bic` , `deviance` , `meandeviance` ), degrees of freedom for residual and total number of fitted parameters p |

`STATISTICS` = pointer |
Pointer to save variates for all possible regression models containing the test statistics. These are F-to-delete statistics (i.e. deviance ratios) when the `DISPERSION` option of the `MODEL` directive is set to `*` , and Chi-square-to-delete statistics (i.e. deviance differences scaled by the dispersion parameter) for a fixed dispersion parameter |

`DF` = pointer |
Pointer to save variates for all possible regression models containing the degrees of freedom for the numerator of the test statistics |

`PROBABILITIES` = pointer |
Pointer to save variates for all possible regression models containing the probabilities of the test statistics |

`MARGINALTERMS` = string token |
How to treat terms that are marginal to other terms in the `FREE` formula (`forced` , `free` ); default `forc` |

### Parameter

`FREE` = formula |
Model formula specifying the candidate model terms |
---|

### Description

There are various methods for choosing a regression model when there are many candidate model terms, see e.g. Montgomery & Peck (1992) or Miller (1990). The `STEP`

directive provides forward selection, backward elimination and stepwise regression. However these methods result in only one model and alternative models, with an equivalent or even better fit, are easily overlooked. Especially in observational studies with many non-orthogonal terms there are frequently a number of alternative models, and then selection of just one well-fitting model is unsatisfactory and possibly misleading. A preferable method is to fit all possible regression models, and to evaluate these according to some criterion. In this way a number of best regression models can be selected. However the fitting of all possible regression models is very computer intensive. It should also be used with caution, because models can be selected which appear to have a lot of explanatory power, but contain noise variables only, see e.g. Flack & Chang (1987). This may occur particularly when the number of parameters is large in comparison with the number of units, as illustrated by the example for `RSEARCH`

. Terms should therefore not be selected on the basis of a statistical analysis alone.

`RSEARCH`

can be used to perform these model selection methods. The call to `RSEARCH`

must be preceded by a `MODEL`

statement which defines the response variate and, if required, all other aspects of a (generalized) linear model. Only one response variate is allowed unless the `DISTRIBUTION`

option of `MODEL`

is set to `multinomial. The FREE`

parameter specifies the candidate model terms. These may include variates, factors, interactions and regression functions like `POL`

and `SSPLINE`

. The `METHOD`

option controls which model selection methods are employed:

`accumulated` |
prints an accumulated analysis of deviance in which all model terms are added one by one to the model in the given order; |
---|---|

`pooled` |
prints an accumulated analysis of deviance in which terms with the same number of identifiers, e.g. main effects or two-factor interactions, are pooled; |

`forward` |
prints an accumulated analysis of deviance resulting from forward selection; |

`backward` |
prints an accumulated analysis of deviance resulting from backward elimination; |

`fstepwise` |
prints an accumulated analysis of deviance resulting from stepwise regression starting with no candidate terms in the model; |

`bstepwise` |
prints an accumulated analysis of deviance resulting from stepwise regression starting with all candidate terms in the model; |

`allpossible` |
prints summary statistics for a number of best models among all possible models. |

For each model with `METHOD=allpossible`

, the selection criterion and the degrees of freedom of the included terms are printed. The probability for the hypothesis that an included term can be deleted as the last term is also printed. These probabilities are based on F-to-delete statistics (i.e. deviance ratios) when the `DISPERSION`

option of the `MODEL`

directive is set to `*`

, and Chi-square-to-delete statistics (i.e. deviance differences scaled by the dispersion parameter) for a fixed dispersion parameter.

The `PPROBABILITY`

option allows you to reduce the amount of output when `METHOD=allpossible`

. If this is set, only models where all the probabilities are less than `PPROBABILITY`

are printed. (By default `PPROBABILITY=1`

, and so they are all printed.)

It is sometimes desirable to include specific terms in every model. Such terms may be specified by means of the `FORCED`

option. The `FORCED`

model terms are always fitted first. The `CONSTANT`

option controls whether the constant parameter is included in the model. The limit for expanding the `FREE`

and `FORCED`

model formulae can be set with the `FACTORIAL`

option, which has default value 3. The `PRINT`

option can be used to control the output from `RSEARCH`

.

The criteria for inclusion and exclusion of terms for forward selection, backward elimination and stepwise regression can be specified by the `INRATIO`

and `OUTRATIO`

options respectively. The `MAXCYCLE`

option specifies the number of steps. These operate exactly as in the `STEP`

directive. The `DENOMINATOR`

option controls the way in which variance ratios are calculated in accumulated analysis of deviance summaries.

All possible regression models are fitted only when the number of candidate `FREE`

model terms does not exceed 16. If the `FREE`

formula specifies a main effects model, i.e. a model without interactions, the main effects are the candidate terms. When the `FREE`

formula contains interactions, the default is to remove any terms marginal to an interaction from the `FREE`

formula, and include them instead in the `FORCED`

formula. However, you can set option `MARGINALTERMS`

to `free`

to retain them in `FREE`

formula. Note that `RSEARCH`

considers only models that obey the principle of marginality. This states that a model that includes an interaction term must also include all its marginal terms. For example, a model that includes the interaction `A.B`

must also include the main effects `A`

and `B`

.

The `AFACTORIAL`

option can be used to limit the expansion of the `FREE`

model terms for the fitting of all possible regression models. The expansion is limited in addition to the limitation imposed by the `FACTORIAL`

option. As an example, the following calls to `RSEARCH`

result in identical candidate model terms, namely `a.b`

, `a.c`

, `b.c`

and `d`

, for all possible regression models:

`RSEARCH [METHOD=forward,backward,allpossible;\`

` FACTORIAL=3; AFACTORIAL=2] a*b*c + d`

`RSEARCH [METHOD=forward,backward,allpossible;\`

` FACTORIAL=2; AFACTORIAL=2; FORCED=a+b+c] a*b*c + d`

However, forward selection starts with no terms in the first call and with the model `a+b+c`

in the second call. Backward elimination starts with the full model including the three factor interaction `a.b.c`

in the first call, while this term is not fitted in the second call.

The `CRITERION`

option controls the selection of the best models among all possible regression models. The criteria employed in `RSEARCH`

are defined as follows:

`r2` |
100 × [1 – Dev / Dev0] |
---|---|

`adjusted` |
100 × [1 – (Dev / (n–p)) / (Dev0 / (n–p_{0}))] |

`cp` |
Dev / f + 2 × p – n |

`ep` |
Dev × (n+1) × (n-2) / [n × (n–p) × (n–p-1)] |

`aic` |
Dev / f + 2 × p |

`sic` or `bic` (synonyms) |
Dev / f + Ln(n) × p |

`deviance` |
Dev |

`meandeviance` |
Dev / (n–p) |

where

Dev | is the deviance of the current model; |
---|---|

Dev0 | is the deviance of the null model; |

p |
is the number of fitted parameters of the current model; |

p_{0} |
is the number of fitted parameters of the null model; |

n |
is the number of units; |

f |
is the dispersion parameter. |

The null model is the model with only a constant term, which may include the fitting of a grouping factor for a within groups regression and/or the fitting of cut-points for an ordinal response model.

The dispersion parameter *f* is specified by the `DISPERSION`

option of the `MODEL`

directive or, when `DISPERSION is set to *`

, is estimated by the mean deviance of the model with all the candidate terms. In ordinary linear regression R², adjusted R² and Mallows Cp are widely used. When R² is used, there is no penalty for adding a term, i.e. R² always improves with the addition of a term. When adjusted R² or Cp is employed, there is a penalty for adding a term. Adjusted R² improves when the F-ratio due to the addition of the term is larger than 1, while Cp improves when the F-ratio is larger than 2. Clearly, Cp is the more conservative criterion and will tend to select models with fewer terms as compared to R² and adjusted R². Minimizing Cp minimizes the mean squared error of prediction in ordinary linear regression in the case where predictions will be made at the same values as are present in the current data set. Models with negligible bias have Cp » *p*. For predictions at new random values, as is common in observational studies, Ep estimates the mean squared error of prediction; then Ep should be minimized. Thompson (1978) and Miller (1990) discuss Cp and Ep in detail.

Criteria suggested for generalized linear models are the Akaike information criterion (AIC) and the Schwarz (Bayesian) information criterion (SIC, or its synonym BIC). The definition of both criteria used here is different from that in the literature. The deviance is used instead of the maximum value of the log-likelihood, which implies a constant shift for distributions without dispersion parameter. Moreover, in the spirit of generalized linear models, the deviance is scaled by the dispersion parameter. This makes AIC equivalent to Cp. Clearly, SIC is the more conservative criterion, especially when the number of units is large.

Note that the best models have a small Cp, Ep, AIC, SIC, deviance and mean deviance, but a large R² and adjusted R². The default penalty of 2 in the definition of Cp and AIC can be altered by setting the `PENALTY`

option, in which case Cp and AIC improves when the F-ratio is larger than `PENALTY`

. The `EXTRA`

option specifies an extra criterion which is printed alongside the selection criterion. The default for `CRITERION`

is `adjusted`

. The default for `EXTRA`

is `cp`

when `DISPERSION`

is set to `*`

, and `meandeviance`

otherwise.

The `NTERMS`

option specifies the maximum number of candidate terms in a model. This can be used when only models with few candidate terms are relevant or to reduce the computational burden. For example with 12 candidate terms there are 4096 different models, while there are only 299 models with maximally three terms. Specifying `NTERMS=3`

then saves a considerable amount of computing time. The `NBESTMODELS`

option specifies the number of best models within each subset size for which summary statistics are printed.

The `FINALMODEL`

option can be used to save the last models for forward selection, backward elimination and fstepwise and bstepwise regression. Results of the fitting of all possible regression models can be saved by means of the parameters `ALLMODELS`

, `ESTIMATES`

, `SE`

, `RESULTS`

, `STATISTICS`

, `DF`

and `PROBABILITIES`

. This saves results from all the fitted models not only from those that are printed. This includes the constant model.

All regression warnings are suppressed. This is to prevent the printing of long lists of similar warnings like “Iterative weights have become 0, or have been held at a limit”. Note that the printed output of all possible regression models is adjusted to the width of the output file.

Options: `PRINT`

, `METHOD`

, `FORCED`

, `CONSTANT`

, `FACTORIAL`

, `DENOMINATOR`

, `INRATIO`

, `OUTRATIO`

, `MAXCYCLE`

, `CRITERION, EXTRA`

, `AFACTORIAL`

, `PENALTY`

, `NTERMS`

, `NBESTMODELS`

, `PPROBABILITY`

, `FINALMODELS`

, `ALLMODELS`

, `ESTIMATES`

, `SE`

, `RESULTS, STATISTICS`

, `DF`

, `PROBABILITIES`

, `MARGINALTERMS`

.

Parameters: `FREE`

.

### Method

First the `FREE`

and `FORCED`

formulae are checked using subsidiary procedure `_RSEARCHCHECK`

, and terms that appear in both are dropped from the `FREE`

formula. Then the full model is fitted and aliased predictors are dropped from both formulae. Forward selection, backward elimination and stepwise regression are straightforward implemented using the `STEP`

directive.

The fitting of all possible regression models uses a sequence of models in which, within each subset size, every model is fitted by dropping one term from the previous model and adding another term. Test statistics are calculated as though the tested term is the last term to enter the model. When the `DISPERSION`

option of the `MODEL`

directive is set to `*`

, terms are tested by means of F-to-delete statistics, which are deviance ratios. For a fixed dispersion parameter Chi-square-to-delete statistics, i.e. deviance differences scaled by the dispersion parameter, are used to calculate probabilities.

Smoothing splines are not allowed in the `FREE`

formula for `METHOD=allpossible`

due to a limitation of the `FCLASSIFICATION`

directive.

### Action with `RESTRICT`

Factors and variates in the `FREE`

and `FORCED`

formulae should not be restricted. Any restriction applied to vectors used in the `MODEL`

statement applies also to the results from `RSEARCH`

.

### References

Flack, V.F. & Chang, P.C. (1987). Frequency of selecting noise variables in subset regression analysis: a simulation study. *The American Statistician*, 41, 84-86.

Miller, A.J. (1990). *Subset Selection in Regression*. Chapman & Hall, London.

Montgomery, D.C. & Peck, E.A. (1992). *Introduction to Linear Regression Analysis (second edition)*. Wiley, New York.

Thompson, M.L. (1978). Selection of variables in multiple regression: Part I. A review and evaluation. *International Statistical Review*, 46, 1-19.

### See also

Directive: `STEP`

.

Procedures: `ASCREEN`

, `RSCREEN`

, `RWALD`

.

Commands for: Regression analysis.

### Example

CAPTION 'RSEARCH example',\ !t('Analysis of the so-called Hald data: heat evolved in calories',\ 'per gram of cement as a function of the amount of each of four',\ 'ingredients in the mix: tricalcium aluminate, tricalcium',\ 'silicate, tetracalcium alumino ferrite and dicalcium silicate;',\ 'see page 276 of Montgomery & Peck (1992), Introduction to',\ 'Linear Regression Analysis, 2nd Edition, Wiley.'); STYLE=meta,plain VARIATE [NVALUES=13] Ingred1,Ingred2,Ingred3,Ingred4,Heat READ Ingred1,Ingred2,Ingred3,Ingred4,Heat 7 26 6 60 78.5 1 29 15 52 74.3 11 56 8 20 104.3 11 31 8 47 87.6 7 52 6 33 95.9 11 55 9 22 109.2 3 71 17 6 102.7 1 31 22 44 72.5 2 54 18 22 93.1 21 47 4 26 115.9 1 40 23 34 83.8 11 66 9 12 113.3 10 68 8 12 109.4 : MODEL Heat RSEARCH [METHOD=forward,backward,fstepwise,bstepwise,allpossible]\ Ingred1,Ingred2,Ingred3,Ingred4 CAPTION !t('Analysis of the damage caused by waves to forward sections of',\ 'cargo-carrying ships. The data are counts of damage incidents',\ 'for each combination of three risk factors: the type of ship,',\ 'the year of construction and the period of operation; see',\ 'page 204 of McCullagh & Nelder (1989), Generalized Linear',\ 'Models, 2nd Edition, Chapman & Hall.') VARIATE [NVALUES=40] Service,Incident FACTOR [NVALUES=40; LABELS=!t(A,B,C,D,E)] Type & [LABELS=!t('60-64','65-69','70-74','75-79')] Construction & [LABELS=!t('60-74','75-79')] Operation GENERATE Type,Construction,Operation READ Service,Incident 127 0 63 0 1095 3 1095 4 1512 6 3353 18 * * 2244 11 44882 39 17176 29 28609 58 20370 53 7064 12 13099 44 * * 7117 18 1179 1 552 1 781 0 676 1 783 6 1948 2 * * 274 1 251 0 105 0 288 0 192 0 349 2 1208 11 * * 2051 4 45 0 0 0 789 7 437 7 1157 5 2161 12 * * 542 1 : CALCULATE Service = LOG(Service) MODEL [DISTRIBUTION=poisson; OFFSET=Service; DISPERSION=*] Incident RSEARCH [FACTORIAL=1; EXTRA=deviance] Type * Construction * Operation RSEARCH [METHOD=pooled,accumulated,allpossible; FACTORIAL=2;\ EXTRA=deviance] Type * Construction * Operation CAPTION !t('The following example shows that when the number of',\ 'predictors is large compared with the number of units,',\ 'models with noise variables only may appear to have a lot',\ 'of explanatory power.') VARIATE [NVALUES=12] Y,Noise[1...10] CALCULATE Y,Noise[] = URAND(912439,10(0); 12) MODEL Y RSEARCH Noise[]