Prints histograms with improved definition of groups (A. Keen).
Options
CHANNEL = scalar | Channel number of output file; default is the current output file |
---|---|
TITLE = text | General title; default ‘Histogram of ...’, where ... is the identifier of the structure specified by DATA |
LOWER = scalar | Lowest class limit |
WIDTH = scalar | Interval width |
SCALE = scalar | Number of units represented by each symbol; default 1 (or more if the page width is not sufficient) |
Parameters
DATA = identifiers | Data for the histograms (variate, table, factor or matrix) |
---|---|
NOBSERVATIONS = tables | One-way table to save numbers in the groups |
GROUPS = factors | Factor to save the groups defined, with LEVELS set to the midpoints of the intervals and LABELS the same as LEVELS, but stored as a text vector |
SYMBOLS = texts | Characters to be used to represent the bars of each histogram |
DESCRIPTION = texts | Annotation for key |
Description
The procedure AKAIKEHISTOGRAM has been designed as an alternative to the Genstat directive HISTOGRAM, for cases where the default settings are not optimal. Such cases may arise due to the following disadvantages of HISTOGRAM:
– HISTOGRAM does not take into account the round-off of the data. The round-off defines a minimal interval width, say dy, for the observations. A sensible interval width must be a multiple of dy, because otherwise the actual width is not equal for all intervals. An extreme example of this is the case where the interval width is smaller than dy; this causes artificial “holes” in the histogram.
– The default number of groups equals the square root of the number of observations, irrespective of the shape of the distribution. In some situations (for instance if the number of observations is large) the number of groups is unnecessarily large; in other situations (for instance if the shape of the distribution is complex) the number of groups can be too small. If the number of groups is too large, then differences in numbers of observations between neighbouring classes may be just random fluctuations, while if the number of groups is too small, valuable information is lost.
– Specifying your own class limits (in a variate) can be rather cumbersome, especially if many histograms have to be produced.
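The round-off problem in the first point can be illustrated with a small sketch. It is written in Python rather than Genstat and is not part of the procedure; the function name `grid_points_per_bin` and the choice of half-open classes are assumptions made for illustration. It counts how many distinct rounded values (multiples of dy) can fall into each class, using exact fractions to avoid floating-point artefacts.

```python
from fractions import Fraction
from math import ceil


def grid_points_per_bin(dy, width, lower, nbins):
    """Count the multiples of dy that fall in each class [lo, lo + width).

    Hypothetical helper for illustration; arguments are decimal strings so
    that the arithmetic is exact.
    """
    dy, width, lower = Fraction(dy), Fraction(width), Fraction(lower)
    counts = []
    for j in range(nbins):
        lo = lower + j * width
        hi = lo + width
        k_min = ceil(lo / dy)  # first k with k*dy >= lo
        k_end = ceil(hi / dy)  # first k with k*dy >= hi (excluded)
        counts.append(k_end - k_min)
    return counts


# Data rounded to dy = 0.01, classes starting at 0.005.
print(grid_points_per_bin("0.01", "0.015", "0.005", 4))  # [1, 2, 1, 2]
print(grid_points_per_bin("0.01", "0.005", "0.005", 4))  # [0, 1, 0, 1]
print(grid_points_per_bin("0.01", "0.02", "0.005", 4))   # [2, 2, 2, 2]
```

With a width of 1.5 × dy the classes alternately cover one and two possible values, so their effective widths differ. With a width below dy every other class is necessarily empty, which gives the “holes” described above. Only a width that is a multiple of dy gives every class the same number of possible values.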
AKAIKEHISTOGRAM aims to avoid these disadvantages of HISTOGRAM. By default an “optimal” number of groups is determined using Akaike’s Information Criterion.
Alternatively, your own class limits can be specified using the options LOWER and WIDTH instead of the option LIMITS of HISTOGRAM. In a FOR loop, different values for the lower limit and/or the interval width can be specified for different quantitative structures. Scalars with missing values can be used to specify default values for these options. Option LOWER is especially important if the observations have a “natural” lower limit, for example the value 0; then 0 is taken as the lower limit of the first group and the first group has the same interval width as the following groups.
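As a rough illustration of how a lower limit and an interval width determine the groups, the following Python sketch (not Genstat, and not the procedure's own code) counts observations in equal-width classes and returns the class midpoints, which correspond to the LEVELS saved by the GROUPS parameter. The function name `group_data` and the half-open class convention [limit, limit + width) are assumptions; the procedure's actual treatment of observations that lie exactly on a limit is not specified here.

```python
import math


def group_data(data, lower, width):
    """Count observations in equal-width classes starting at `lower`.

    Assumption: classes are half-open, [limit, limit + width).
    Returns (midpoints, counts).
    """
    if width <= 0:
        raise ValueError("width must be positive")
    if min(data) < lower:
        raise ValueError("lower must not exceed the smallest observation")
    n_groups = math.floor((max(data) - lower) / width) + 1
    counts = [0] * n_groups
    for x in data:
        counts[math.floor((x - lower) / width)] += 1
    midpoints = [lower + (j + 0.5) * width for j in range(n_groups)]
    return midpoints, counts


# Natural lower limit 0 and width 5: every class, including the first,
# has the same width.
midpoints, counts = group_data([0, 1, 4, 5, 9, 12], lower=0, width=5)
print(midpoints)  # [2.5, 7.5, 12.5]
print(counts)     # [3, 2, 1]
```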
The option TITLE and the parameters of HISTOGRAM have been transferred to AKAIKEHISTOGRAM. However, options NGROUPS and LABELS from HISTOGRAM have been omitted, because they are not in line with the style of AKAIKEHISTOGRAM.
Options: CHANNEL, TITLE, LOWER, WIDTH, SCALE.
Parameters: DATA, NOBSERVATIONS, GROUPS, SYMBOLS, DESCRIPTION.
Method
The optimality criterion used is Akaike’s Information Criterion (AIC), which is twice the number of free parameters of the model (that is, the number of groups minus 1) minus twice the maximal log likelihood of the observations under the multinomial model. The starting histogram has equal-length intervals and more than sufficient groups. From this histogram, new histograms are derived with interval length r times the interval length of the starting histogram, r = 2, … etc. The “optimal” histogram is the one with minimal AIC. The basic idea for the method is taken from Sakamoto, Ishiguro & Kitagawa (1986); see also Taylor (1987).
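The selection step can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure's own code. It uses the standard form AIC = 2k − 2 log L. It assumes that a pooled group spreads its probability evenly over its r starting cells, so that histograms with different r are compared on the same starting cells. It also assumes that the last group is padded with empty cells when the number of starting cells is not a multiple of r. The names `pooled_aic` and `best_pooling` are hypothetical.

```python
import math


def pooled_aic(counts, r):
    """AIC of the histogram that pools every r adjacent starting cells."""
    n = sum(counts)
    # Assumption: pad with empty cells so the cell count is a multiple of r.
    padded = list(counts) + [0] * (-len(counts) % r)
    pooled = [sum(padded[i:i + r]) for i in range(0, len(padded), r)]
    # Each starting cell in a pooled group gets probability m / (n * r);
    # empty groups contribute nothing to the log likelihood.
    log_lik = sum(m * math.log(m / (n * r)) for m in pooled if m > 0)
    free_params = len(pooled) - 1
    return 2 * free_params - 2 * log_lik


def best_pooling(counts):
    """Return the pooling factor r with minimal AIC (smallest r on ties)."""
    return min(range(1, len(counts) + 1), key=lambda r: pooled_aic(counts, r))


# Flat counts: pooling loses nothing, so the coarsest histogram wins.
print(best_pooling([10] * 8))      # 8
# Strongly alternating counts: pooling destroys real structure.
print(best_pooling([100, 0] * 4))  # 1
```

The two calls show the trade-off described above: extra groups are penalised when neighbouring differences carry no information, and retained when merging them would sharply lower the likelihood.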
The starting histogram is obtained as follows. First the range of the observations is divided into five equal-length intervals, from which the apparent number of observations Na is calculated as five times the number of observations in the interval with the largest frequency. Na is then used as the number of observations instead of the true number, and the number of groups Ng is calculated as five times the number obtained from Sturges’ formula (see, for example, Sakamoto, Ishiguro & Kitagawa (1986), page 117):
Ng = 5 × ( 1 + log10( Na/2 ))
The final limits of the starting histogram are obtained by a relatively strong rounding-off of the class limits (comparable with that in HISTOGRAM), where the width is always a multiple of the rounding-off interval.
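The calculation of Na and Ng can be sketched in Python as below. It applies the formula exactly as printed above and leaves out the final rounding of the class limits, whose details are not given here. The function name `starting_group_count` is hypothetical, and placing the maximum observation in the fifth interval is an assumption.

```python
import math


def starting_group_count(data):
    """Return (Na, Ng) for the starting histogram, before any rounding."""
    lo, hi = min(data), max(data)
    if hi == lo:
        raise ValueError("data must contain at least two distinct values")
    width = (hi - lo) / 5
    counts = [0] * 5
    for x in data:
        # Assumption: the maximum observation belongs to the fifth interval.
        counts[min(int((x - lo) / width), 4)] += 1
    na = 5 * max(counts)               # apparent number of observations
    ng = 5 * (1 + math.log10(na / 2))  # formula as printed in the text
    return na, ng


na, ng = starting_group_count(list(range(100)))
print(na)            # 100
print(round(ng, 2))  # 13.49
```

For evenly spread data Na equals the true number of observations; for a peaked distribution Na is larger, so the starting histogram gets more groups where the data are dense.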
Action with RESTRICT
The structures in DATA can be restricted, and in different ways; AKAIKEHISTOGRAM uses only those units that are not excluded by their respective restrictions.
References
Sakamoto, Y., Ishiguro, M. & Kitagawa, G. (1986). Akaike Information Criterion Statistics. D. Reidel Publishing Company, Dordrecht.
Taylor, C.C. (1987). Akaike’s Information Criterion and the Histogram. Biometrika, 74, 636-639.
See also
Directives: DHISTOGRAM, LPHISTOGRAM.
Example
CAPTION 'AKAIKEHISTOGRAM example',\
        !t('The first example illustrates what can go wrong if the',\
        'class-width is not a multiple of the round-off interval.');\
        STYLE=meta,plain
VARIATE [NVALUES=436] Cadmium
READ Cadmium
.03 .02 .06 .03 .04 .04 .03 .04 .03 .05 .04 .03 .04 .03 .03 .02 .03 .05 .03 .04
.04 .03 .06 .05 .03 .04 .02 .04 .04 .05 .02 .05 .04 .04 .03 .04 .03 .06 .05 .04
.06 .03 .05 .08 .09 .08 .08 .09 .05 .03 .04 .04 .04 .03 .08 .11 .02 .04 .02 .03
.05 .04 .03 .02 .02 .02 .03 .04 .04 .03 .03 .07 .04 .06 .06 .05 .03 .05 .06 .04
.03 .07 .07 .07 .07 .06 .03 .04 .04 .05 .03 .08 .02 .04 .03 .06 .07 .07 .04 .05
.07 .04 .09 .10 .05 .05 .05 .06 .05 .08 .04 .03 .03 .02 .03 .04 .07 .02 .05 .13
.03 .06 .03 .08 .07 .07 .07 .05 .05 .03 .05 .06 .06 .03 .05 .04 .04 .03 .03 .02
.03 .01 .02 .03 .03 .04 .09 .04 .05 .15 .09 .07 .04 .05 .04 .06 .03 .07 .04 .05
.06 .03 .06 .05 .02 .02 .03 .02 .05 .04 .05 .05 .08 .07 .08 .06 .02 .04 .05 .08
.07 .04 .02 .03 .03 .05 .03 .03 .05 .04 .06 .06 .08 .08 .06 .06 .04 .07 .06 .02
.08 .10 .08 .06 .05 .11 .05 .06 .04 .07 .08 .07 .08 .07 .07 .08 .05 .04 .04 .07
.08 .03 .07 .10 .09 .05 .07 .05 .06 .07 .07 .07 .06 .19 .13 .09 .05 .13 .12 .04
.07 .04 .03 .06 .07 .06 .03 .06 .06 .06 .06 .05 .07 .05 .06 .04 .07 .06 .06 .03
.05 .04 .06 .05 .09 .07 .04 .05 .09 .17 .23 .19 .05 .04 .05 .11 .05 .06 .06 .08
.04 .09 .16 .06 .05 .09 .05 .06 .06 .05 .04 .04 .05 .09 .05 .08 .04 .05 .04 .05
.06 .08 .04 .03 .06 .05 .11 .06 .05 .05 .09 .05 .04 .05 .04 .05 .04 .08 .04 .02
.02 .03 .02 .02 .03 .08 .04 .03 .03 .04 .04 .04 .02 .05 .09 .09 .09 .09 .08 .06
.07 .05 .08 .08 .06 .06 .08 .08 .09 .08 .07 .07 .08 .06 .09 .08 .09 .09 .09 .08
.08 .07 .06 .07 .07 .08 .07 .05 .05 .06 .09 .04 .06 .05 .07 .07 .04 .03 .03 .03
.03 .02 .02 .05 .07 .06 .04 .05 .05 .03 .03 .03 .06 .05 .06 .06 .06 .04 .06 .06
.06 .06 .06 .07 .05 .06 .06 .07 .08 .06 .06 .06 .06 .08 .03 .04 .06 .05 .02 .05
.04 .06 .05 .04 .05 .04 .04 .03 .03 .06 .04 .05 .02 .08 .06 .04 :
HISTOGRAM Cadmium
AKAIKEHISTOGRAM Cadmium
CAPTION !t('The second example illustrates similarity and differences',\
        'between HISTOGRAM and AKAIKEHISTOGRAM with respect to the',\
        'number of groups, class-limits and representation for a random',\
        'sample of size 1000 from the standard normal distribution and',\
        'the halfnormal distribution.')
VARIATE [NVAL= 1000] y
CALCULATE y= NED(URAND( 34761; 1000))
HISTOGRAM y ; SYMB= '.'
AKAIKEHISTOGRAM y ; SYMB= '.'
CALCULATE y= ABS( y)
HISTOGRAM y ; SYMB= '-'
AKAIKEHISTOGRAM y ; SYMB= '-'