Produces bootstrapped estimates, standard errors and distributions (P.W. Lane).

### Options

`PRINT` = string token | Controls printed output (`estimates`, `graphs`, `vcovariance`); default `esti` |
---|---|
`DATA` = variates, factors or texts | Data vectors from which the statistics are to be calculated; no default |
`AUXILIARY` = pointers | Further sets of data vectors, each set to be resampled independently |
`ANCILLARY` = any type | Other relevant information needed to calculate the statistics |
`NTIMES` = scalar | Number of times to resample; default 100 |
`SEED` = scalar | Seed for random number generator; default continue from previous generation or use system clock |
`GRAPHICS` = string token | Type of graphics (`lineprinter`, `highresolution`); default `high` |
`PROBABILITY` = scalar | Probability level for confidence interval; default 0.95 |
`METHOD` = string token | What type of bootstrapping to use (`random`, `balance`, `permute`); default `rand` |
`BLOCKSTRUCTURE` = formula | Block structure to use for random permutations |
`CIMETHOD` = string token | What type of confidence intervals to provide (`bca`, `percentile`); default `perc` |
`VCOVARIANCE` = symmetric matrix | Saves the variance-covariance matrix of the statistics |

### Parameters

`LABEL` = texts | Texts, each containing a single line, to label the statistics; default `'Statistic'` |
---|---|
`ESTIMATE` = scalars | Saves the bootstrap mean for each statistic |
`SE` = scalars | Saves the bootstrap standard error for each statistic |
`LOWER` = scalars | Saves the bootstrap lower confidence limit for each statistic |
`UPPER` = scalars | Saves the bootstrap upper confidence limit for each statistic |
`STATISTIC` = variates | Saves the series of bootstrap estimates of each statistic |
`WINDOW` = scalars | Graphical window to use for displaying bootstrap distribution for each statistic; default 4 |
`SCREEN` = string tokens | Whether to clear graphical frame or draw on top (`clear`, `keep`); default `clea` |

### Description

The bootstrap is a method of providing distributional information, such as standard errors, about statistical estimates – without making precise distributional assumptions about the data. It can also provide estimates with reduced bias. This is achieved by “resampling” from the data; that is, generating new data sets by sampling with replacement from the data set being investigated. A good introduction to the bootstrap is given by Efron & Tibshirani (1986); a fuller treatment can be found in Efron & Tibshirani (1993).
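The resampling idea described above can be sketched outside Genstat. The following Python fragment is only an illustration of sampling with replacement and of taking the standard deviation of the resampled statistics as a standard error; the function name `bootstrap_estimates` is hypothetical, and the code is not the implementation used by the `BOOTSTRAP` procedure. The data are the first variate of the correlation example given later on this page.

```python
import random
import statistics

def bootstrap_estimates(data, statistic, ntimes=100, seed=None):
    """Return `ntimes` values of `statistic`, each computed on a sample
    of the same size drawn with replacement from `data`."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(ntimes):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(resample))
    return estimates

# First variate (Y) of the correlation example on this page.
scores = [576, 635, 558, 578, 666, 580, 555, 661, 651, 605,
          653, 575, 545, 572, 594]

boot = bootstrap_estimates(scores, statistics.mean, ntimes=200, seed=1)
boot_mean = statistics.mean(boot)   # bootstrap estimate of the mean
boot_se = statistics.stdev(boot)    # bootstrap standard error
print(boot_mean, boot_se)
```

Setting `seed` makes the sketch reproducible, in the same spirit as the `SEED` option.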

The `BOOTSTRAP` procedure can be used for any statistic or set of statistics that can be calculated by Genstat from one or more data matrices. You need to provide a procedure called `RESAMPLE` which calculates the statistics from the data, as explained in the Method section. There are also several examples of `RESAMPLE` in the standard examples, which can be extracted by the commands:

`LIBEXAMPLE 'BOOTSTRAP'; EXAMPLE=Ex`

`PRINT Ex; JUSTIF=left`

The options and parameters of `RESAMPLE` must not be changed. The body of the procedure should calculate the required statistics from the variates, factors and texts called `DATA[1...d]`, and store them in scalars called `STATISTIC[1...s]`, where each of `s` and `d` can be any positive integer. The `EXIT` parameter of `RESAMPLE` should be set to indicate when any of the calculations fail, as can sometimes happen if degenerate data sets are generated (see Example 3).

The data for `BOOTSTRAP` are provided as a list of vectors (variates, factors or texts) using the `DATA` option. From this, the procedure will generate new data by resampling from the set of units: all the vectors must have the same length, and each new sample uses the same set of units for all vectors. The procedure `RESAMPLE` is then called to calculate the statistics.
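The point that each new sample uses the same set of units for all vectors can be illustrated as follows. This is a Python sketch under the assumption that units are simply row positions; `resample_units` is a hypothetical helper, not part of Genstat. The values are the first five units of the correlation example on this page.

```python
import random

def resample_units(vectors, rng):
    """Draw one bootstrap sample of unit numbers and apply it to every vector,
    so that values belonging to the same unit stay together."""
    n = len(vectors[0])
    if any(len(v) != n for v in vectors):
        raise ValueError("all data vectors must have the same length")
    units = [rng.randrange(n) for _ in range(n)]
    return [[v[i] for i in units] for v in vectors]

rng = random.Random(77320)
y = [576, 635, 558, 578, 666]
z = [3.39, 3.30, 2.81, 3.03, 3.44]
new_y, new_z = resample_units([y, z], rng)

# Every resampled (y, z) pair is one of the original pairs.
original_pairs = set(zip(y, z))
print(all(pair in original_pairs for pair in zip(new_y, new_z)))
```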

Extra information required in procedure `RESAMPLE` to calculate the statistics, which is not to be resampled along with the data matrix, can be passed as a list of data structures using the `ANCILLARY` option of `BOOTSTRAP` (see Examples 2 and 3).

The procedure can also deal with statistics calculated from several independent data matrices. For example, the difference in means between two independent samples must be dealt with by resampling independently from each sample, which may have different numbers of observations. In this case, one data matrix is specified as a list of vectors using the `DATA` option as usual, and the second data matrix is specified as a pointer using the `AUXILIARY` option. This option may be set to any number of pointers, each storing a list of vectors; resampling is done independently for each set of vectors (see Example 4).
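Independent resampling of two samples of different sizes might look like the sketch below. It is written in Python purely for illustration; the helper names are hypothetical, and the values are only the first few heights from each variate of Example 4, truncated for brevity.

```python
import random
import statistics

def resample(values, rng):
    """One sample, with replacement, of the same size as `values`."""
    return [values[rng.randrange(len(values))] for _ in values]

def median_difference_bootstrap(first, second, ntimes=100, seed=None):
    """Resample each data set separately, then take the difference in medians."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(ntimes):
        diffs.append(statistics.median(resample(first, rng))
                     - statistics.median(resample(second, rng)))
    return diffs

# Leading values of America and AsiaOcea from Example 4 (truncated).
first = [130, 126, 124, 124, 113, 89, 83, 77]
second = [156, 125, 122, 120, 112, 109, 103, 100, 100, 96, 95]

diffs = median_difference_bootstrap(first, second, ntimes=200, seed=3)
print(statistics.mean(diffs), statistics.stdev(diffs))
```

Because each sample is drawn from its own data set, the two sets need not have the same number of observations.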

The option `NTIMES` specifies how many times the resampling is carried out. The default value is 100, which has been found by many users of the bootstrap to be sufficient for producing standard errors and bias-reduced estimates. However, the number should be increased to get reliable distributional information: 1000 or more may be needed for reliable 95% confidence limits.

Printed output is controlled by the `PRINT` option, with settings `estimates` for the estimates and their standard errors and confidence limits, and `vcovariance` for the variance-covariance matrix. The `graphs` setting draws a histogram of the bootstrap distributions. The default setting is just `estimates`.

A label should be provided for each statistic, using the `LABEL` parameter; by default, bootstrapping will be done for a single statistic which will be labelled simply as `Statistic`. The estimates and their standard errors can be saved by the `ESTIMATE` and `SE` parameters. Also, a variance-covariance matrix of the estimates can be saved using the `VCOVARIANCE` option. The number of labels, `s` say, must match the number of statistics, called `STATISTIC[1...s]`, calculated in your version of the `RESAMPLE` procedure.

The parameters `LOWER` and `UPPER` allow confidence limits for each statistic to be saved, with the probability level specified in the `PROBABILITY` option (default 0.95, i.e. 95% confidence intervals). By default the intervals are constructed as percentiles of the empirical distribution of the bootstrap estimates. However, provided there are no auxiliary data vectors, you can request bias-corrected and accelerated limits instead by setting option `CIMETHOD=bca` (see Efron & Tibshirani, 1993, Section 14.3). The full sets of bootstrap estimates can be saved by setting the `STATISTIC` parameter; each variate will contain *n* values, where *n* is the setting of the `NTIMES` option.
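As an illustration of percentile limits, the Python sketch below takes the limits as empirical quantiles of the saved bootstrap estimates. The simple rounding rule used to pick the ranks is an assumption made for this sketch; the quantile definition used inside `BOOTSTRAP` may differ in detail, and `percentile_limits` is a hypothetical name.

```python
def percentile_limits(estimates, probability=0.95):
    """Lower and upper confidence limits taken as empirical quantiles
    of the bootstrap estimates (rank chosen by simple rounding)."""
    ordered = sorted(estimates)
    n = len(ordered)
    tail = (1.0 - probability) / 2.0
    lower_rank = min(n, max(1, round(tail * n)))
    upper_rank = min(n, max(1, round((1.0 - tail) * n)))
    return ordered[lower_rank - 1], ordered[upper_rank - 1]

# Stand-in for 1000 saved bootstrap estimates.
estimates = list(range(1, 1001))
lower, upper = percentile_limits(estimates, probability=0.95)
print(lower, upper)
```

With only 100 resamples, each 2.5% tail is determined by two or three values, which is one way to see why a larger `NTIMES` is advisable for confidence limits.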

Three methods of bootstrapping are provided. By default, resampling is completely pseudo-random, using Genstat’s random-number generator. The generator can be initialized by setting option `SEED`, thereby producing reproducible results; otherwise, the initialization uses the system clock. The second method is balanced bootstrapping, requested by setting `METHOD=balance`. In this case, the resampling is constrained to ensure that each unit of the data matrix occurs the same number of times in the complete set of generated samples (see Examples 3 and 4). The third method, specified by `METHOD=permute`, is simply to permute the units of the data matrix. Note that this method gives no variation in results if the statistics are independent of the order of the data, like the sample mean. However, this method provides permutation tests, a type of randomization test that can be applied to grouped data (see Example 4). When `METHOD=permute`, you can set the `BLOCKSTRUCTURE` option to a model formula to define how the randomization is to be done (see the `RANDOMIZE` directive for details).
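The remark about permutation can be made concrete with a small Python sketch. The values and group labels below are made up for illustration, and the helper names are hypothetical. Permuting the values leaves an order-independent statistic such as the mean unchanged, while a statistic that depends on which value falls in which group does vary.

```python
import random
import statistics

def permute(values, rng):
    """Return the values in a random order (no value is repeated or dropped)."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled

def group_mean_difference(values, groups):
    """Mean of group 'A' minus mean of group 'B'; the grouping is held fixed."""
    a = [v for v, g in zip(values, groups) if g == "A"]
    b = [v for v, g in zip(values, groups) if g == "B"]
    return statistics.mean(a) - statistics.mean(b)

# Illustrative values only; the grouping plays the role of ancillary information.
values = [4.9, 7.3, 4.8, 3.4, 6.9, 4.1]
groups = ["A", "B", "A", "B", "A", "B"]
rng = random.Random(46921)

observed = group_mean_difference(values, groups)
permuted = [group_mean_difference(permute(values, rng), groups) for _ in range(500)]

# The sample mean is identical under every permutation.
means = {statistics.mean(permute(values, rng)) for _ in range(20)}
print(len(means), observed, min(permuted), max(permuted))
```

Comparing `observed` with the spread of `permuted` is the basis of a permutation test.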

If the `graphs` setting of the `PRINT` option is used, the procedure will display the distribution of each set of bootstrap estimates as a histogram. By default, this will be a high-resolution plot on the current device, but the `GRAPHICS` option can be set to `line` to produce a line-printer histogram. In a high-resolution plot, the histogram is enhanced with a smoothed line, giving a clearer indication of the distribution of the statistic. By default, the display for the statistics will appear in graphical window 4, one at a time (this window is set by default to fill the whole graphical frame). But the `WINDOW` and `SCREEN` parameters can be set to arrange for concurrent displays of the statistics in differently sized windows.

Options: `PRINT`, `DATA`, `AUXILIARY`, `ANCILLARY`, `NTIMES`, `SEED`, `GRAPHICS`, `PROBABILITY`, `METHOD`, `BLOCKSTRUCTURE`, `CIMETHOD`, `VCOVARIANCE`.

Parameters: `LABEL`, `ESTIMATE`, `SE`, `LOWER`, `UPPER`, `STATISTIC`, `WINDOW`, `SCREEN`.

### Method

Samples are generated by scaling uniform random numbers produced by the `URAND` function. For the balanced bootstrap, a list of repeated unit numbers is sorted into random order and used one block at a time. For the permutation test, the `RANDOMIZE` directive is used to re-order the data at random.
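The balanced scheme described above can be sketched in Python as follows. This is an illustration of the idea (repeat every unit number once per resample, shuffle the whole list, then cut it into blocks), not the Genstat code itself; `balanced_samples` is a hypothetical name.

```python
import random

def balanced_samples(n_units, ntimes, seed=None):
    """Return `ntimes` samples of unit numbers such that every unit occurs
    exactly `ntimes` times over the complete set of samples."""
    rng = random.Random(seed)
    # Each unit number repeated once per resample.
    pool = [unit for unit in range(n_units) for _ in range(ntimes)]
    rng.shuffle(pool)
    # Use the shuffled list one block of n_units at a time.
    return [pool[k * n_units:(k + 1) * n_units] for k in range(ntimes)]

samples = balanced_samples(n_units=15, ntimes=100, seed=23845)

# Count how often each unit appears across all samples.
counts = [0] * 15
for sample in samples:
    for unit in sample:
        counts[unit] += 1
print(set(counts))
```

Within any single sample a unit may still appear several times or not at all; the balance holds only over the complete set.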

`BOOTSTRAP` needs a subsidiary procedure `RESAMPLE` to calculate the statistics of interest. `RESAMPLE` has an option, `DATA`, which is used to supply the data vectors (variates, factors or texts) from which the statistics are to be calculated. Other relevant information can be supplied through the `AUXILIARY` and `ANCILLARY` options, which correspond to the `AUXILIARY` and `ANCILLARY` options of `BOOTSTRAP` itself. There are two parameters: `STATISTIC` supplies a list of scalars to store the estimates of each statistic, and `EXIT` supplies a list of scalars, each of which should be set to zero if the corresponding statistic was estimated successfully from the supplied data vectors, or to one if it was not. If the value of `EXIT` is not calculated in `RESAMPLE`, the `BOOTSTRAP` procedure assumes that the calculations succeeded.

This example shows a version of `RESAMPLE` which calculates the correlation between two variates.

`PROCEDURE [PARAMETER=pointer] 'RESAMPLE'`

`OPTION 'DATA', " (I: variates, factors or texts) data`

` vectors from which to calculate`

` the statistics; no default"\`

` 'AUXILIARY', " (I: pointers) auxiliary sets of data`

` vectors, each of which is to be`

` resampled independently"\`

` 'ANCILLARY'; " (I: any type of structure) other`

` relevant information needed to`

` calculate the statistics "\`

` MODE=p; TYPE=!t(variate,factor,text),'pointer',*;\`

` SET=yes,no,no; LIST=yes; DECLARED=yes; PRESENT=yes`

`PARAMETER 'STATISTIC', " (O: scalars) to save the calculated`

` statistics "\`

` 'EXIT'; " (O: scalars) to save an exit code`

` to indicate failure (EXIT[i]=1) or`

` success (EXIT[i]=0) when calculating`

` each STATISTIC[i]"\`

` MODE=p; TYPE='scalar'; SET=yes`

` CALC STATISTIC[1] = CORRELATION(DATA[1]; DATA[2])`

` & EXIT[1] = STATISTIC[1]==C('missing')`

`ENDPROCEDURE`

`VARIATE [VALUES=576,635,558,578,666,580,555,661,651,605,\`

` 653,575,545,572,594] Y`

`& [VALUES=3.39,3.30,2.81,3.03,3.44,3.07,3.00,3.43,3.36,3.13,\`

` 3.12,2.74,2.76,2.88,2.96] Z`

`BOOTSTRAP [DATA=Y,Z; SEED=77320] 'Correlation'`

The `RESAMPLE` procedure is called within a loop, and the statistics that are returned are loaded into variates. If any statistics fail to be calculated, as recorded by the `EXIT` parameter of `RESAMPLE`, they are stored as missing values. `BOOTSTRAP` will then base its estimation on the successful generations, but reports how many failures occurred.
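The handling of failed generations can be pictured with the Python sketch below, in which a missing value is represented by NaN. Both that representation and the numbers are assumptions made for illustration, and `summarise` is a hypothetical name.

```python
import math
import statistics

MISSING = float("nan")  # stand-in for a Genstat missing value

def summarise(estimates):
    """Mean and standard error from the successful generations only,
    together with the number of failures."""
    successful = [e for e in estimates if not math.isnan(e)]
    failures = len(estimates) - len(successful)
    return statistics.mean(successful), statistics.stdev(successful), failures

# Illustrative stored estimates; two generations failed.
stored = [0.71, MISSING, 0.80, 0.77, MISSING, 0.69]
estimate, se, failures = summarise(stored)
print(estimate, se, failures)
```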

The bootstrap estimates are formed as simple means of the stored variates, and the standard errors are the square roots of their sample variances. The `TABULATE` directive is used to estimate quantiles from the stored variates, to define confidence limits. The variance-covariance matrix is formed from the statistics using the `FSSPM` directive.
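A variance-covariance matrix of the stored series, with standard errors on its diagonal, can be sketched in Python as below. The divisor *n* − 1 is an assumption consistent with the term "sample variance"; the function name `vcovariance` is hypothetical and the two short series are made up for illustration.

```python
import statistics

def vcovariance(series):
    """Variance-covariance matrix of several series of bootstrap estimates,
    one series per statistic, all of the same length (divisor n - 1)."""
    n = len(series[0])
    means = [statistics.mean(s) for s in series]
    return [[sum((a - mean_a) * (b - mean_b) for a, b in zip(sa, sb)) / (n - 1)
             for sb, mean_b in zip(series, means)]
            for sa, mean_a in zip(series, means)]

# Illustrative series for two statistics.
s1 = [1.0, 2.0, 3.0, 4.0]
s2 = [2.0, 4.0, 6.0, 8.0]
v = vcovariance([s1, s2])

# Standard errors are the square roots of the diagonal elements.
ses = [v[i][i] ** 0.5 for i in range(len(v))]
print(v, ses)
```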

The graphical representation uses `DHISTOGRAM` or `HISTOGRAM` on the stored variates. The smoothed curves are calculated from the transformed percentages from the histogram: `LOGIT(CUM(%))`. A smoothing spline is fitted on this scale, by the `FIT` directive with the `SSPLINE` function, using 4 d.f. The resulting fitted values are then backtransformed and drawn on the plot with the `monotonic` setting of the `PEN` directive.

### Action with `RESTRICT`

If any of the data vectors is restricted, `BOOTSTRAP` will use only the units that are not restricted for any of the vectors. The data vectors that are passed to the `RESAMPLE` procedure are all restricted to this identified set of units, but otherwise match the original data vectors. Each set of vectors supplied in pointers in the `AUXILIARY` option is treated separately in this way.

### References

Efron, B. & Tibshirani, R. (1986). Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. *Statistical Science*, 1, 54-77.

Efron, B. & Tibshirani, R.J. (1993). *An Introduction to the Bootstrap*. Chapman & Hall, London.

### See also

Procedures: `JACKKNIFE`, `APERMTEST`, `CHIPERMTEST`, `HBOOTSTRAP`, `RPERMTEST`.

### Example

`CAPTION 'BOOTSTRAP example','1) Bootstrapped correlation.',\`

` !t('The data are scores from two tests on new admissions to Law School ',\`

` '(Efron, 1981, The Jackknife, the Bootstrap & Other Resampling Plans.',\`

` 'CBMS Monograph 38, SIAM, Philadelphia); listed in Table 1 of Hinkley',\`

` '(1983), Encyclopedia of Statistics, Volume 4, page 282.');\`

` STYLE=meta,plain,plain`

`" Define RESAMPLE to calculate the correlation between the two scores."`

`PROCEDURE [PARAMETER=pointer] 'RESAMPLE'`

`OPTION 'DATA', " (I: variates, factors or texts) data vectors from which to calculate the statistics; no default"\`

` 'AUXILIARY', " (I: pointers) auxiliary sets of data vectors, each of which is to be resampled independently"\`

` 'ANCILLARY'; " (I: any type of structure) other relevant information needed to calculate the statistics "\`

` MODE=p; TYPE=!t(variate,factor,text),'pointer',*; SET=yes,no,no;\`

` LIST=yes; DECLARED=yes; PRESENT=yes`

`PARAMETER 'STATISTIC', " (O: scalars) to save the calculated statistics "\`

` 'EXIT'; " (O: scalars) to save an exit code to indicate failure (EXIT[i]=1) or success (EXIT[i]=0) when calculating each STATISTIC[i]"\`

` MODE=p; TYPE='scalar'; SET=yes`

` CALC STATISTIC[1] = CORRELATION(DATA[1]; DATA[2])`

` & EXIT[1] = STATISTIC[1]==C('missing')`

`ENDPROCEDURE`

`VARIATE [VALUES=576,635,558,578,666,580,555,661,651,605,653,575,545,572,594] Y`

`& [VALUES=3.39,3.30,2.81,3.03,3.44,3.07,3.00,3.43,3.36,3.13,\`

` 3.12,2.74,2.76,2.88,2.96] Z`

`BOOTSTRAP [DATA=Y,Z; SEED=77320] 'Correlation'`

`CAPTION '2) A permutation test.',\`

` !t('Five wines are tested in a completely randomized design for',\`

` 'alcohol content. The variance-ratio for the treatment effect is',\`

` 'estimated by resampling with random permutation of the observations.')`

`" Re-define RESAMPLE to calculate the ratio. The treatment factor must be passed to the procedure via the ANCILLARY option so that AKEEP can extract the treatment sum of squares."`

`PROCEDURE [PARAMETER=pointer] 'RESAMPLE'`

`OPTION 'DATA', " (I: variates, factors or texts) data vectors from which to calculate the statistics; no default"\`

` 'AUXILIARY', " (I: pointers) auxiliary sets of data vectors, each of which is to be resampled independently"\`

` 'ANCILLARY'; " (I: any type of structure) other relevant information needed to calculate the statistics "\`

` MODE=p; TYPE=!t(variate,factor,text),'pointer',*; SET=yes,no,no;\`

` LIST=yes; DECLARED=yes; PRESENT=yes`

`PARAMETER 'STATISTIC', " (O: scalars) to save the calculated statistics "\`

` 'EXIT'; " (O: scalars) to save an exit code to indicate failure (EXIT[i]=1) or success (EXIT[i]=0) when calculating each STATISTIC[i]"\`

` MODE=p; TYPE='scalar'; SET=yes`

` ANOVA [PRINT=*] DATA[1]`

` AKEEP TERMS=ANCILLARY[1],'*Units*'; SS=sstreat,ssresid; DF=dftreat,dfresid`

` CALC STATISTIC[1] = (sstreat/dftreat)/(ssresid/dfresid)`

`ENDPROCEDURE`

`FACTOR Wine`

`READ Wine,%Alcohol; FREP=labels`

`E 4.931 D 7.263 A 4.857 C 3.361 B 6.871`

`E 4.141 C 3.164 B 3.012 A 5.668 D 12.185`

`B 4.223 E 3.323 A 4.668 C 2.686 D 7.776 :`

`TREATMENT Wine`

`ANOVA %Alcohol`

`BOOTSTRAP [DATA=%Alcohol; ANCILLARY=Wine; METHOD=permute; NTIMES=500;\`

` PROBABILITY=0.90; SEED=46921] 'Ratio'`

`CAPTION !t(\`

` 'The observed variance ratio of 6.41 is well outside the 90% confidence',\`

` 'interval. A one-sided permutation test at the 95% level therefore',\`

` 'rejects the hypothesis that the observed treatment differences could',\`

` 'have arisen by chance from this set of data.')`

`CAPTION '3) Balanced bootstrap.',\`

` !t('Fit parallel exponential curves to the relationship between',\`

` 'Sugar yield and Soil phosphorus in four years.',\`

` 'Estimate the asymptotic yields for each year, with standard errors.',\`

` '(Note that FITCURVE does not provide these s.e.s.)')`

`PROCEDURE [PARAMETER=pointer] 'RESAMPLE'`

`OPTION 'DATA', " (I: variates, factors or texts) data vectors from which to calculate the statistics; no default"\`

` 'AUXILIARY', " (I: pointers) auxiliary sets of data vectors, each of which is to be resampled independently"\`

` 'ANCILLARY'; " (I: any type of structure) other relevant information needed to calculate the statistics "\`

` MODE=p; TYPE=!t(variate,factor,text),'pointer',*; SET=yes,no,no;\`

` LIST=yes; DECLARED=yes; PRESENT=yes`

`PARAMETER 'STATISTIC', " (O: scalars) to save the calculated statistics "\`

` 'EXIT'; " (O: scalars) to save an exit code to indicate failure (EXIT[i]=1) or success (EXIT[i]=0) when calculating each STATISTIC[i]"\`

` MODE=p; TYPE='scalar'; SET=yes`

` CALC y = ANCILLARY[1]+DATA[1]`

` MODEL y`

` FITCURVE [PRINT=*] ANCILLARY[2,3]`

` RKEEP ESTIMATES=est; EXIT=ex`

` " Extract asymptotes: first two parameters are rate and range."`

` EQUATE [OLDFORMAT=!(-2,4)] OLD=est; NEW=STATISTIC`

` " Pass on information about success of fitting"`

` EQUATE OLD=ex; NEW=EXIT`

`ENDPROCEDURE`

`FACTOR [LEVELS=4; VALUES=16(1...4)] Year`

`READ Beetwt,%sugar,SoilP`

` 7.23 18.5 5.4 7.69 18.0 5.4 24.64 20.1 7.8 26.67 19.8 8.0`

` 39.78 19.5 18.0 44.98 19.3 15.6 41.59 19.7 30.4 44.08 19.8 33.8`

` 48.37 19.4 50.4 44.76 19.0 51.0 49.73 18.6 44.0 51.54 18.5 40.2`

` 47.69 19.0 57.2 45.66 19.4 65.0 50.18 18.6 27.0 47.69 18.7 30.0`

` 8.82 13.8 5.6 1.81 13.9 4.8 15.82 14.5 10.2 9.04 14.0 8.6`

` 24.41 15.0 21.6 22.60 14.1 17.2 26.45 15.2 36.4 20.80 15.3 37.2`

` 28.30 14.2 44.4 22.60 14.7 44.4 14.24 13.5 41.0 35.94 15.6 30.2`

` 25.54 15.8 60.8 27.13 15.6 47.0 31.42 15.6 27.0 34.13 15.4 29.0`

` 19.90 16.1 3.0 20.60 16.0 2.0 34.70 16.7 6.2 35.40 16.4 6.2`

` 46.80 17.1 19.8 40.50 16.9 17.2 43.00 16.9 29.6 48.60 17.1 28.0`

` 47.30 17.0 42.8 41.30 17.1 46.2 44.30 17.0 36.6 47.60 16.6 40.0`

` 45.60 17.0 42.2 44.60 17.0 52.0 44.00 17.2 23.4 40.10 16.6 28.0`

` 14.35 16.1 4.0 14.35 15.5 3.8 26.71 16.6 8.0 25.12 16.4 6.4`

` 33.39 17.2 18.2 33.79 16.2 14.8 36.68 17.0 35.0 33.69 16.8 29.6`

` 34.98 17.0 37.2 35.78 17.0 40.0 42.06 17.2 39.6 38.77 17.3 36.8`

` 40.66 17.3 52.4 37.28 17.2 45.6 34.68 17.3 22.0 32.59 17.2 26.0 :`

`CALC Sugar = Beetwt * %sugar / 100`

`MODEL Sugar`

`FITCURVE [PRINT=model,estimates] SoilP,Year`

`RKEEP FITTED=Fit`

`CALCULATE Simplres = Sugar-Fit`

`CAPTION !t(\`

` 'Resample the residuals, adding to the fitted values already calculated.',\`

` 'The calculations require the fitted values, and the explanatory variate',\`

` 'and factor, but the values of these vectors must not be resampled.',\`

` 'Use the balanced bootstrap, which ensures that all observations occur',\`

` 'an equal number of times in the complete set of bootstrap samples.')`

`BOOTSTRAP [DATA=Simplres; ANCILLARY=Fit,SoilP,Year; METHOD=balance;\`

` SEED=23845] 'Year 1','Year 2','Year 3','Year 4'`

`CAPTION '4) Use of auxiliary data.',\`

` !t('Estimate difference in medians between the heights',\`

` 'of active volcanos in America and in Asia/Oceania',\`

` '(see the Guide to Genstat, Part 2, Section 2.1).')`

`PROCEDURE [PARAMETER=pointer] 'RESAMPLE'`

`OPTION 'DATA', " (I: variates, factors or texts) data vectors from which to calculate the statistics; no default"\`

` 'AUXILIARY', " (I: pointers) auxiliary sets of data vectors, each of which is to be resampled independently"\`

` 'ANCILLARY'; " (I: any type of structure) other relevant information needed to calculate the statistics "\`

` MODE=p; TYPE=!t(variate,factor,text),'pointer',*; SET=yes,no,no;\`

` LIST=yes; DECLARED=yes; PRESENT=yes`

`PARAMETER 'STATISTIC', " (O: scalars) to save the calculated statistics "\`

` 'EXIT'; " (O: scalars) to save an exit code to indicate failure (EXIT[i]=1) or success (EXIT[i]=0) when calculating each STATISTIC[i]"\`

` MODE=p; TYPE='scalar'; SET=yes`

` CALC STATISTIC[1] = MEDIAN(DATA[1])-MEDIAN(AUXILIARY[1][1])`

`ENDPROCEDURE`

`CAPTION !t(\`

` 'Since the samples are independent, they must be resampled separately.',\`

` 'The sets of vectors provided by the AUXILIARY option are each subjected',\`

` 'to separate resampling. The sets must be combined into pointers, to',\`

` 'allow the possibility of more than one additional set of data.')`

`VARIATE America,AsiaOcea; VALUES=!(130,126,124,124,113,89,83,77,70,62,58,51,\`

` 51,42,40,34,199,197,193,185,177,172,157,156,140,102,93,86,36,140,102,100,\`

` 94,83,83,82,73,67,67,66,60,57,57,53,49,43,43,40,35,35), !(156,125,122,120,\`

` 112,109,103,100,100,96,95,95,90,83,81,81,81,77,75,75,73,71,71,67,66,66,64,\`

` 62,60,60,60,59,58,57,56,56,55,54,54,52,52,52,51,50,49,49,48,45,44,44,37,\`

` 36,36,26,26,24,19,11,10,137,41)`

`CAPTION 'Calculate and print the difference using the procedure.'`

`POINTER [VALUE=AsiaOcea] Aux`

`RESAMPLE [DATA=America; AUXILIARY=Aux] Diff; EXIT=exit`

`PRINT Diff`

`CAPTION !t('Produce the bootstrapped estimate and 90% confidence interval;',\`

` 'using balanced resampling.')`

`BOOTSTRAP [DATA=America; AUXILIARY=Aux; METHOD=balance; PROBABILITY=0.90]\`

` 'Difference'`

`CAPTION '5) Bias-corrected and accelerated confidence limits.',\`

` !t('Spatial data test (Efron, B. & Tibshirani, R.J., 1993,',\`

` 'An Introduction to the Bootstrap, Chapman & Hall, London).')`

`PROCEDURE [PARAMETER=pointer] 'RESAMPLE'`

`OPTION 'DATA', " (I: variates, factors or texts) data vectors from which to calculate the statistics; no default"\`

` 'AUXILIARY', " (I: pointers) auxiliary sets of data vectors, each of which is to be resampled independently"\`

` 'ANCILLARY'; " (I: any type of structure) other relevant information needed to calculate the statistics "\`

` MODE=p; TYPE=!t(variate,factor,text),'pointer',*; SET=yes,no,no;\`

` LIST=yes; DECLARED=yes; PRESENT=yes`

`PARAMETER 'STATISTIC', " (O: scalars) to save the calculated statistics "\`

` 'EXIT'; " (O: scalars) to save an exit code to indicate failure (EXIT[i]=1) or success (EXIT[i]=0) when calculating each STATISTIC[i]"\`

` MODE=p; TYPE='scalar'; SET=yes`

` CALC STATISTIC[1] = SUM((DATA[1] - MEAN(DATA[1]))**2) / NOBS(DATA[1])`

`ENDPROCEDURE`

`VARIATE [VALUES=48,36,20,29,42,42,20,42,22,41,45,14,6,\`

` 0,33,28,34,4,32,24,47,41,24,26,30,41] A`

`& [VALUES=42,33,16,39,38,36,15,33,20,43,34,22,7,\`

` 15,34,29,41,13,38,25,27,41,28,14,28,40] B`

`BOOTSTRAP [DATA=A; NTIMES=2000; SEED=245875; PROBABILITY=0.90; CIMETHOD=bca]\`

` 'Variance'`