AKAIKEHISTOGRAM procedure

Prints histograms with improved definition of groups (A. Keen).

Options

`CHANNEL` = scalar	Channel number of output file; default is the current output file
`TITLE` = text	General title; default ‘`Histogram of ...`’, where `...` is the identifier of the structure specified by `DATA`
`LOWER` = scalar	Lowest class limit
`WIDTH` = scalar	Interval width
`SCALE` = scalar	Number of units represented by each symbol; default 1 (or more if the page width is not sufficient)

Parameters

`DATA` = identifiers	Data for the histograms (variate, table, factor or matrix)
`NOBSERVATIONS` = tables	One-way table to save numbers in the groups
`GROUPS` = factors	Factor to save the groups defined, with `LEVELS` set to the midpoints of the intervals and `LABELS` holding the same values as a text vector
`SYMBOLS` = texts	Characters to be used to represent the bars of each histogram
`DESCRIPTION` = texts	Annotation for key

Description

The procedure AKAIKEHISTOGRAM is designed as an alternative to the Genstat directive HISTOGRAM for cases where the default settings of HISTOGRAM are not optimal. Such cases arise from the following disadvantages of HISTOGRAM:

– HISTOGRAM does not take into account the round-off of the data. The round-off defines a minimal interval width, say dy, for the observations. A sensible interval width must be a multiple of dy, because otherwise the actual width is not equal for all intervals. An extreme example of this is the case where the interval width is smaller than dy; this causes artificial “holes” in the histogram.
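The round-off effect can be illustrated with a small Python sketch (not Genstat code). The data, the function name `bin_counts`, and the lower-inclusive interval convention are assumptions chosen only for illustration. Values are recorded to the nearest unit, so dy = 1, and exact fractions are used to avoid floating-point noise at the limits.

```python
from fractions import Fraction

def bin_counts(values, lower, width, nbins):
    """Count values in nbins intervals [lower + i*width, lower + (i+1)*width)."""
    counts = [0] * nbins
    for v in values:
        i = (Fraction(v) - lower) // width  # floor division on Fractions gives an int
        if 0 <= i < nbins:
            counts[i] += 1
    return counts

# Hypothetical data recorded to the nearest unit (dy = 1).
data = [1, 1, 2, 2, 2, 3, 3, 4]

# Width 1/2 (smaller than dy): every second interval can never receive a value.
narrow = bin_counts(data, Fraction(1), Fraction(1, 2), 8)

# Width 3/2 (not a multiple of dy): intervals alternately cover 2 and 1 recorded values.
uneven = bin_counts(data, Fraction(1), Fraction(3, 2), 3)

# Width 2 (a multiple of dy): every interval covers exactly 2 recorded values.
even = bin_counts(data, Fraction(1), Fraction(2), 2)

print(narrow, uneven, even)
```

With the width of 1/2 the empty intervals are the artificial "holes"; with the width of 3/2 the intervals look equal on paper but cover different numbers of possible recorded values.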

– The default number of groups equals the square root of the number of observations, irrespective of the shape of the distribution. In some situations (for instance if the number of observations is large) the number of groups is unnecessarily large; in other situations (for instance if the shape of the distribution is complex) the number of groups can be too small. If the number of groups is too large, then differences in numbers of observations between neighbouring classes may be just random fluctuations, while if the number of groups is too small, valuable information is lost.

– Specifying your own class limits (in a variate) can be rather cumbersome, especially if many histograms have to be produced.

AKAIKEHISTOGRAM aims to avoid these disadvantages of HISTOGRAM. By default an “optimal” number of groups is determined using Akaike’s Information Criterion.

Alternatively, you can specify your own class limits using the options LOWER and WIDTH, which replace the option LIMITS of HISTOGRAM. In a FOR loop, different values for the lower limit and/or the interval width can be specified for different quantitative structures. A scalar containing a missing value requests the default for the corresponding option. Option LOWER is especially important if the observations have a “natural” lower limit, for example the value 0; then 0 is taken as the lower limit of the first group, and the first group has the same interval width as the following groups.
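As a rough illustration of how a lower limit and an interval width together determine equally spaced class limits, consider the following Python sketch (not Genstat code). The function name `class_limits` is hypothetical, and the sketch makes no claim about how the procedure treats values that fall exactly on a limit.

```python
import math

def class_limits(values, lower, width):
    """Return limits lower, lower + width, ... extending far enough to reach max(values)."""
    n_groups = max(1, math.ceil((max(values) - lower) / width))
    return [lower + i * width for i in range(n_groups + 1)]

# Hypothetical non-negative data with a natural lower limit of 0.
limits = class_limits([0.2, 1.7, 3.1], lower=0, width=1)
print(limits)
```

Because the first limit is the natural lower limit itself, the first group is as wide as all the others.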

The option TITLE and the parameters of HISTOGRAM have been transferred to AKAIKEHISTOGRAM. However, options NGROUPS and LABELS from HISTOGRAM have been omitted, because they are not in line with the style of AKAIKEHISTOGRAM.

Options: CHANNEL, TITLE, LOWER, WIDTH, SCALE.

Parameters: DATA, NOBSERVATIONS, GROUPS, SYMBOLS, DESCRIPTION.

Method

The optimality criterion is Akaike’s Information Criterion (AIC): twice the number of free parameters of the model (that is, the number of groups minus 1) minus twice the maximal log likelihood of the observations under the multinomial model. The starting histogram has equal-length intervals and more than enough groups. From it, new histograms are derived whose interval length is r times that of the starting histogram, for r = 2, 3, and so on. The “optimal” histogram is the one with minimal AIC. The basic idea of the method comes from Sakamoto, Ishiguro & Kitagawa (1986); see also Taylor (1987).
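The selection step can be sketched in Python (not Genstat code) using the standard form AIC = 2k − 2 log L, with k the number of groups minus 1. This is an interpretation, not the procedure's source: the function names are hypothetical, r = 1 (the starting histogram itself) is included as a candidate, and the likelihood of every candidate is evaluated on the groups of the starting histogram, with the probability of a merged group spread evenly over the starting groups it covers, so that histograms of different widths are comparable.

```python
import math

def merge_groups(counts, r):
    """Pool each run of r adjacent groups of the starting histogram."""
    return [sum(counts[i:i + r]) for i in range(0, len(counts), r)]

def histogram_aic(counts, r):
    """AIC of the histogram whose intervals are r starting intervals wide."""
    n_total = sum(counts)
    merged = merge_groups(counts, r)
    log_lik = 0.0
    for start, n in zip(range(0, len(counts), r), merged):
        cells = min(r, len(counts) - start)  # the last merged group may be shorter
        if n > 0:
            # Each starting group inside a merged group gets probability n / (N * cells).
            log_lik += n * math.log(n / (n_total * cells))
    free_params = len(merged) - 1
    return 2 * free_params - 2 * log_lik

def best_width_factor(counts, max_r):
    """Return the factor r in 1..max_r that gives the smallest AIC."""
    return min(range(1, max_r + 1), key=lambda r: histogram_aic(counts, r))

# Hypothetical starting histograms.
flat = [10] * 8                           # no structure: merging costs nothing
spiky = [50, 1, 50, 1, 50, 1, 50, 1]      # real structure at the finest width
print(best_width_factor(flat, 8), best_width_factor(spiky, 4))
```

For the flat counts every candidate has the same likelihood, so the penalty term favours the fewest groups; for the alternating counts, merging destroys so much likelihood that the starting width is retained.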

The starting histogram is obtained as follows. First the range of the observations is divided into five equal-length intervals, from which the apparent number of observations Na is calculated as five times the number of observations in the interval with the largest frequency. Na is then used as the number of observations instead of the true number, and the number of groups Ng is calculated as five times the number obtained from Sturges’ formula (see, for example, Sakamoto, Ishiguro & Kitagawa (1986), page 117):

Ng = 5 × ( 1 + log₁₀( Na/2 ))
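The first step, the apparent number of observations Na, can be sketched in Python (not Genstat code) as follows. The function name `apparent_n` is hypothetical, and the handling of values on interval boundaries (lower-inclusive, with the maximum placed in the last interval) is an assumption. The sketch stops at Na and does not compute Ng.

```python
def apparent_n(values):
    """Five times the largest count among five equal-length intervals over the range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 5 * len(values)  # all values coincide, so one interval holds everything
    counts = [0] * 5
    for v in values:
        i = min(int(5 * (v - lo) / (hi - lo)), 4)  # put the maximum in the last interval
        counts[i] += 1
    return 5 * max(counts)

# Evenly spread data: Na equals the true number of observations.
na_even = apparent_n(list(range(10)))

# Data with a peak: Na exceeds the true number of observations (9).
na_peak = apparent_n([0, 1, 2, 3, 4, 5, 5, 5, 10])
print(na_even, na_peak)
```

For evenly spread data Na matches the true count, while a concentrated peak inflates Na and therefore leads to a finer starting histogram.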

The final limits of the starting histogram are obtained by a relatively strong rounding-off of the class limits (comparable with that in HISTOGRAM), where the width is always a multiple of the rounding-off interval.

Action with `RESTRICT`

The structures in DATA can be restricted, and in different ways; AKAIKEHISTOGRAM uses only those units that are not excluded by their respective restrictions.

References

Sakamoto, Y., Ishiguro, M. & Kitagawa, G. (1986). Akaike Information Statistics. D. Reidel Publishing Company, Dordrecht.

Taylor, C.C. (1987). Akaike’s Information Criterion and the Histogram. Biometrika, 74, 636-639.

Example

CAPTION  'AKAIKEHISTOGRAM example',\
         !t('The first example illustrates what can go wrong if the',\
         'class-width is not a multiple of the round-off interval.');\
         STYLE=meta,plain
VARIATE  [NVALUES=436] Cadmium
READ     Cadmium
.03 .02 .06 .03 .04 .04 .03 .04 .03 .05 .04 .03 .04 .03 .03 .02 .03 .05
.03 .04 .04 .03 .06 .05 .03 .04 .02 .04 .04 .05 .02 .05 .04 .04 .03 .04
.03 .06 .05 .04 .06 .03 .05 .08 .09 .08 .08 .09 .05 .03 .04 .04 .04 .03
.08 .11 .02 .04 .02 .03 .05 .04 .03 .02 .02 .02 .03 .04 .04 .03 .03 .07
.04 .06 .06 .05 .03 .05 .06 .04 .03 .07 .07 .07 .07 .06 .03 .04 .04 .05
.03 .08 .02 .04 .03 .06 .07 .07 .04 .05 .07 .04 .09 .10 .05 .05 .05 .06
.05 .08 .04 .03 .03 .02 .03 .04 .07 .02 .05 .13 .03 .06 .03 .08 .07 .07
.07 .05 .05 .03 .05 .06 .06 .03 .05 .04 .04 .03 .03 .02 .03 .01 .02 .03
.03 .04 .09 .04 .05 .15 .09 .07 .04 .05 .04 .06 .03 .07 .04 .05 .06 .03
.06 .05 .02 .02 .03 .02 .05 .04 .05 .05 .08 .07 .08 .06 .02 .04 .05 .08
.07 .04 .02 .03 .03 .05 .03 .03 .05 .04 .06 .06 .08 .08 .06 .06 .04 .07
.06 .02 .08 .10 .08 .06 .05 .11 .05 .06 .04 .07 .08 .07 .08 .07 .07 .08
.05 .04 .04 .07 .08 .03 .07 .10 .09 .05 .07 .05 .06 .07 .07 .07 .06 .19
.13 .09 .05 .13 .12 .04 .07 .04 .03 .06 .07 .06 .03 .06 .06 .06 .06 .05
.07 .05 .06 .04 .07 .06 .06 .03 .05 .04 .06 .05 .09 .07 .04 .05 .09 .17
.23 .19 .05 .04 .05 .11 .05 .06 .06 .08 .04 .09 .16 .06 .05 .09 .05 .06
.06 .05 .04 .04 .05 .09 .05 .08 .04 .05 .04 .05 .06 .08 .04 .03 .06 .05
.11 .06 .05 .05 .09 .05 .04 .05 .04 .05 .04 .08 .04 .02 .02 .03 .02 .02
.03 .08 .04 .03 .03 .04 .04 .04 .02 .05 .09 .09 .09 .09 .08 .06 .07 .05
.08 .08 .06 .06 .08 .08 .09 .08 .07 .07 .08 .06 .09 .08 .09 .09 .09 .08
.08 .07 .06 .07 .07 .08 .07 .05 .05 .06 .09 .04 .06 .05 .07 .07 .04 .03
.03 .03 .03 .02 .02 .05 .07 .06 .04 .05 .05 .03 .03 .03 .06 .05 .06 .06
.06 .04 .06 .06 .06 .06 .06 .07 .05 .06 .06 .07 .08 .06 .06 .06 .06 .08
.03 .04 .06 .05 .02 .05 .04 .06 .05 .04 .05 .04 .04 .03 .03 .06 .04 .05
.02 .08 .06 .04 :
HISTOGRAM        Cadmium
AKAIKEHISTOGRAM  Cadmium
CAPTION !t('The second example illustrates similarity and differences',\
        'between HISTOGRAM and AKAIKEHISTOGRAM with respect to the',\
        'number of groups, class-limits and representation for a random',\
        'sample of size 1000 from the standard normal distribution and',\
        'the halfnormal distribution.')
VARIATE    [NVAL= 1000] y
CALCULATE  y= NED(URAND( 34761; 1000))

HISTOGRAM  y ; SYMB= '.'
AKAIKEHISTOGRAM  y ; SYMB= '.'

CALCULATE  y= ABS( y)
HISTOGRAM  y ; SYMB= '-'
AKAIKEHISTOGRAM  y ; SYMB= '-'

Updated on March 11, 2019
