Performs hot-deck and model-based imputation for survey data (S.D. Langton).

### Options

| Option | Description |
|---|---|
| `PRINT` = string token | Controls printed output (`summary`, `monitoring`, `check`, `list`, `regression`); default `summ` |
| `METHOD` = string token | Imputation method (`hotdeck`, `modelbased`); default `hotd` |
| `DMETHOD` = string token | Method for calculating distances (`mean`, `minimax`, `regression`); default `mini` |
| `%THRESHOLD` = scalar | Percentage threshold for matches |
| `THRESHOLD` = scalar | Absolute threshold for matches |
| `DVARIABLES` = variates or factors | Variates or factors to use for the distance calculation |
| `DRANGES` = scalars | Ranges to use for distance calculations with each of the `DVARIABLES`; default `*` uses the observed range |
| `LABELS` = variate, factor or text | Provides labels for the cases |
| `SEED` = scalar | Seed for random numbers; default 0 |
| `IMPUTE` = variate or scalar | A variate provides logical (0 or 1) values indicating whether each unit is to be imputed; alternatively, a scalar specifies a number of rows to be selected at random for imputation, so that the effectiveness of the imputation process can be studied; default `*` imputes values for any units where an `OLDSTRUCTURE` contains a missing value |
| `DONORS` = variate | Logical variate indicating whether each unit can be used as a donor; default `*` implies that all units with complete data for each `OLDSTRUCTURE` are used |
| `RSAVE` = rsave | Regression analysis to use for `METHOD=modelbased` or `DMETHOD=regression` |
| `URECEPTORS` = variate | Saves unit numbers of receptor (imputed) cases |
| `UDONORS` = variate | Saves unit numbers of donor cases |
| `DISTANCES` = variate | Saves the distances for the chosen receptor-donor pairs |

### Parameters

| Parameter | Description |
|---|---|
| `OLDSTRUCTURE` = variates or factors | Structures containing missing values |
| `NEWSTRUCTURE` = variates or factors | New structures with imputed values |
| `OVERWRITE` = string tokens | Whether to overwrite any existing data for imputed cases (`yes`, `no`); default `no` |

### Description

Survey data frequently contain missing values. When all the information is missing for a sample unit, it is generally appropriate to allow for this by modifying the weights, but when only certain variables are missing (item non-response), imputation is often used to fill in the missing values. `SVHOTDECK` performs “hot-deck” imputation (see for example Korn & Graubard 1999), whereby replacement values are taken from another unit, chosen at random, usually from a list of suitable matches determined by a distance metric. The procedure can also be used for model-based imputation; in this case the imputed value is the sum of the fitted value from a regression model and a residual chosen at random from another unit. In the description below, “donor” means a unit supplying data to a “receptor” that initially has a missing value.

The data are usually supplied by the `OLDSTRUCTURE` parameter, as variates and/or factors containing missing values. The `NEWSTRUCTURE` parameter supplies new variates or factors to contain the values of each `OLDSTRUCTURE` variate or factor, but with the missing values replaced by the imputed values. By default, imputation is carried out for any row of data where an `OLDSTRUCTURE` contains missing values. Alternatively, the rows to be imputed can be specified by setting the `IMPUTE` option. This can supply a logical variate, containing the value one in the units whose values are to be imputed and zero elsewhere, or it can supply a scalar specifying a number of rows to be selected at random for imputation. The scalar setting is useful if you want to study the effectiveness of the imputation process.

By default, imputed values are used only to replace the missing values in each `OLDSTRUCTURE`. If the corresponding setting of the `OVERWRITE` parameter is `yes`, imputed values are inserted for the imputed cases even if the original value is not missing. This allows you, for example, to compare real and imputed data in order to check the efficiency of the imputation process. Alternatively, you might set `OVERWRITE=yes` for every `OLDSTRUCTURE` in order to preserve the correlations between the variables by taking all the values from each donor.
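The effect of `OVERWRITE` can be pictured with a small Python sketch. This is an illustration of the rule described above, not `SVHOTDECK`'s own code: the function name `merge_imputed`, the use of NaN for missing values and the donor values are all assumptions made for the example.

```python
import math

def merge_imputed(old, imputed, impute_flags, overwrite=False):
    """Build the new structure from the old one.

    For units flagged for imputation, the imputed value replaces a missing
    value (NaN here); it replaces a non-missing value only if overwrite=True.
    """
    new = []
    for value, candidate, flagged in zip(old, imputed, impute_flags):
        missing = isinstance(value, float) and math.isnan(value)
        if flagged and (missing or overwrite):
            new.append(candidate)
        else:
            new.append(value)
    return new

nan = float("nan")
old = [25.0, nan, 60.0]
imputed = [24.0, 43.0, 63.0]   # hypothetical values taken from donors
flags = [True, True, False]    # the first two units are to be imputed

kept = merge_imputed(old, imputed, flags, overwrite=False)
replaced = merge_imputed(old, imputed, flags, overwrite=True)
```

With `overwrite=False` the observed value 25 in the first unit is kept; with `overwrite=True` it is replaced by the donor value, which is what makes a real-versus-imputed comparison possible.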

By default, any row of `OLDSTRUCTURE` with no missing values may be used as a donor. Alternatively, the `DONORS` option can supply a logical variate indicating the rows that are to act as potential donors.

The `DVARIABLES` option supplies one or more variables to use to determine the matching between donors and receptors. In the simplest case, if you set `DVARIABLES` to a single factor, the donor for each receptor is selected at random from the units with the same factor value (e.g. to replace observations by others from the same stratum). For more complex matching, `DVARIABLES` can be set to a list of variates or factors, which are then used to determine a distance between each receptor and the potential donors. By default the distance for a `DVARIABLES` variate is calculated as

$$d = |x_i - x_j| / r$$

where $r$ is the observed range of the data, but an alternative value of $r$ may be supplied using the `DRANGES` option. `DRANGES` should be set to 1 if no scaling of the distances is required. For a `DVARIABLES` factor a simple matching criterion is used, so $d = 0$ if $x_i$ and $x_j$ are the same, and $d = 1$ if they are not.
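As a rough illustration of the per-variable distances just described, the Python sketch below evaluates the scaled absolute difference for a variate and the simple matching rule for a factor. The function names and the factor levels are hypothetical; the `Crops` values are a few of those in the Example section.

```python
def variate_distance(x_i, x_j, r):
    """Scaled absolute difference |x_i - x_j| / r for a variate."""
    return abs(x_i - x_j) / r

def factor_distance(level_i, level_j):
    """Simple matching for a factor: 0 if the levels agree, 1 otherwise."""
    return 0.0 if level_i == level_j else 1.0

crops = [50, 96, 140, 430]                  # includes the smallest and largest Crops values
observed_range = max(crops) - min(crops)    # 380, the default r

d_default = variate_distance(96, 140, observed_range)   # 44 / 380
d_unscaled = variate_distance(96, 140, 1)                # as with DRANGES=1: 44
d_factor = factor_distance("north", "south")             # hypothetical levels: 1
```

Setting the range to 1 leaves the distance in the original units of the variable, which is why the Example sets `DRANGES=1` to make the distances easy to interpret.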

Matches are then determined from these distances using a “minimax” approach, where the best match is the one with the minimum value of the maximum absolute difference across the `DVARIABLES`. Alternatively, you can set the `DMETHOD` option to `mean` to use the mean of the absolute differences, or to `regression` to request that the distances are determined on the basis of predictions from a regression.
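The difference between the minimax and mean criteria can be seen in a short Python sketch. It assumes the per-variable distances have already been computed for each potential donor; the donor labels and distance values are made up for illustration, and the code is not taken from the procedure itself.

```python
def combined_distance(per_variable, method="minimax"):
    """Combine the per-variable distances for one receptor-donor pair."""
    if method == "minimax":
        return max(per_variable)          # the worst-matching variable decides
    if method == "mean":
        return sum(per_variable) / len(per_variable)
    raise ValueError(f"unknown method: {method}")

def best_donor(distances_by_donor, method="minimax"):
    """Return the donor whose combined distance is smallest."""
    return min(
        distances_by_donor,
        key=lambda donor: combined_distance(distances_by_donor[donor], method),
    )

# Hypothetical distances on two DVARIABLES for two potential donors.
candidates = {"A": [0.10, 0.50], "B": [0.30, 0.35]}

choice_minimax = best_donor(candidates, "minimax")   # B: max 0.35 beats max 0.50
choice_mean = best_donor(candidates, "mean")         # A: mean 0.30 beats mean 0.325
```

Minimax favours a donor that is reasonably close on every variable, whereas the mean tolerates one poor match if the others are close.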

The `RSAVE` option specifies the regression analysis to use when `DMETHOD=regression`. The terms in the model must include the `DVARIABLES`. If `RSAVE` is not specified, the most recent regression analysis is used. The calculation of the distances between units is then weighted by the appropriate regression coefficients: for example, if the slope of `x1` is 0.24 and two units have `x1` values of 10 and 20, the distance is (20 – 10) × 0.24 = 2.4. `DRANGES` are ignored when `DMETHOD=regression`.
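The single-variable arithmetic above can be written as a small Python sketch. Only the one-term case given in the text is shown; how several terms are combined is not stated here, so the sketch does not attempt it. The function name and the donor values are assumptions for illustration.

```python
def regression_distance(x_i, x_j, slope):
    """Distance between two units on one variable, weighted by its slope."""
    return abs((x_i - x_j) * slope)

# The worked example from the text: slope 0.24, x1 values of 10 and 20.
d = regression_distance(20, 10, 0.24)    # (20 - 10) * 0.24 = 2.4

# Nearest of some hypothetical donors to a receptor with x1 = 12.
donor_x1 = {"A": 10, "B": 20, "C": 13}
nearest = min(donor_x1, key=lambda u: regression_distance(12, donor_x1[u], 0.24))
```

Because the distance is on the scale of the predicted response, a variable with a small coefficient contributes little, however large its raw difference.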

Conventional hot-deck imputation is the default method. Alternatively, if you set option `METHOD=modelbased`, `SVHOTDECK` will do model-based imputation. Note, though, that this cannot be used if `DMETHOD=regression`. Model-based imputation uses a regression analysis, specified by the `RSAVE` option. If `RSAVE` is not specified, the most recent regression analysis is used. The method creates an imputed value by adding a random residual to the fitted value of the selected donor. This method can be used only if the `OLDSTRUCTURE` is the same as the y-variate in the regression. `DVARIABLES` will frequently be left unset in this situation, so that the residuals are chosen totally at random. However, in some situations it may be preferable to select residuals from similar units, in which case `DVARIABLES` can be used to determine the matching, as with the hot-deck method.
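A minimal Python sketch of the model-based idea follows, under the reading given at the start of the Description: the imputed value is a fitted value from the regression plus a residual taken at random from another unit. The intercept and slope are hypothetical rather than estimated, the function name is invented for the example, and the (Crops, Oats) pairs are three farms from the Example section.

```python
import random

def model_based_impute(fitted_value, donor_residuals, rng):
    """Add a residual chosen at random from the donors to a fitted value."""
    residual = rng.choice(donor_residuals)
    return fitted_value + residual, residual

# Hypothetical fitted line Oats = intercept + slope * Crops (not estimated here).
intercept, slope = 2.0, 0.2

donors = [(50, 17), (96, 25), (140, 43)]    # (Crops, Oats) for three donor farms
residuals = [oats - (intercept + slope * crops) for crops, oats in donors]

rng = random.Random(12345)                   # fixed seed for repeatability
fitted = intercept + slope * 110             # fitted value for a farm with Crops = 110
value, used = model_based_impute(fitted, residuals, rng)
```

Leaving the residual choice unrestricted corresponds to leaving `DVARIABLES` unset; restricting `donor_residuals` to similar units would mimic matching on `DVARIABLES`.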

By default, `SVHOTDECK` will determine the single best match for each unit, where possible. In many cases (e.g. when doing multiple imputation), it is necessary to select one at random from the closest matches. The `%THRESHOLD` option specifies the tolerance to use in these situations: for example, setting `%THRESHOLD` to 10 requests that the match is selected at random from amongst the donors with a distance up to 10% greater than the minimum distance. The `SEED` option specifies the seed for the random numbers that are used for this operation (default 0). Alternatively, if you want to specify the distance relative to the minimum in absolute terms, the `THRESHOLD` option should be used instead. If both `THRESHOLD` and `%THRESHOLD` are set, both criteria must be met. The `THRESHOLD` value is normally set relative to the minimum distance, but if it is set to a negative value, a match is selected at random from those with a distance less than the absolute value of the `THRESHOLD`. Thus, for example, if `THRESHOLD` is set to -0.2 and `DMETHOD=mean`, any units with a mean distance of less than 0.2 (after taking into account settings of `DRANGES`) from the unit to be imputed are considered matches, and one of these is selected at random. Alternatively, if `THRESHOLD` is set to 0.2 and the best match has a distance of, for example, 0.18, any units with a mean distance of less than 0.18 + 0.2 = 0.38 are considered matches, and one of these is selected at random.
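The threshold rules can be summarised in a short Python sketch. It is an interpretation of the description above rather than the procedure's code: the function name, the donor labels and distances, and the handling of values exactly on a boundary (inclusive for the percentage rule, strict for the absolute rule) are assumptions.

```python
import random

def eligible_donors(distances, pct_threshold=None, threshold=None):
    """Return the donors that count as matches for one receptor.

    distances maps each potential donor to its distance from the receptor.
    """
    d_min = min(distances.values())
    if pct_threshold is None and threshold is None:
        return [u for u, d in distances.items() if d == d_min]   # best match only
    eligible = []
    for unit, d in distances.items():
        ok = True
        if pct_threshold is not None:
            ok = ok and d <= d_min * (1 + pct_threshold / 100)
        if threshold is not None:
            # Positive: relative to the minimum. Negative: an absolute limit.
            limit = d_min + threshold if threshold >= 0 else abs(threshold)
            ok = ok and d < limit
        if ok:
            eligible.append(unit)
    return eligible

distances = {"A": 0.18, "B": 0.19, "C": 0.30, "D": 0.45}   # hypothetical mean distances

within_pct = eligible_donors(distances, pct_threshold=10)    # limit 0.198: A, B
within_rel = eligible_donors(distances, threshold=0.2)       # limit 0.38: A, B, C
within_abs = eligible_donors(distances, threshold=-0.2)      # limit 0.2: A, B
within_both = eligible_donors(distances, pct_threshold=10, threshold=0.2)

rng = random.Random(12345)                # plays the role of the SEED option
chosen = rng.choice(within_rel)           # one match picked at random
```

Note that with a negative threshold the eligible list can be empty if even the best match is further away than the limit, so a caller would need to handle that case.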

The `URECEPTORS` and `UDONORS` options can be used to save the unit numbers of the receptor (imputed) cases and the donor cases, respectively. Note that, if the `IMPUTE` option is set, the `OLDSTRUCTURE` and `NEWSTRUCTURE` parameters need not be set. The use of `URECEPTORS` and `UDONORS` then allows more complicated methods of replacement to be used than those provided directly by `SVHOTDECK`.

Printed output and plots are controlled by the `PRINT` option, with the settings:

| Setting | Output |
|---|---|
| `monitoring` | provides information about each match, |
| `summary` | provides a summary, |
| `list` | produces a list of receptors and donors, |
| `check` | prints correlations as well as giving a scatter plot of the predictions against the actual data, and |
| `regression` | gives details of the model used when `DMETHOD` is set to `regression`. |

To use `check` it is necessary to impute for data values that are present. This can be achieved either by specifying these units using `IMPUTE`, or by setting `IMPUTE` to a scalar, in which case the appropriate number of rows will be selected at random.
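The idea behind this kind of check can be sketched in Python: hold out some units whose values are known, impute them from the remaining units, and compare imputed with actual values. The sketch uses a simple nearest-neighbour match on one variable as a stand-in for the matching described earlier, and computes only a correlation; the function names are hypothetical and the data are a subset of the farms in the Example section.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation; returns NaN if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return math.nan
    return sxy / math.sqrt(sxx * syy)

def holdout_check(x, y, n_holdout, rng):
    """Impute n_holdout randomly chosen known values from their nearest donors."""
    units = range(len(y))
    held = rng.sample(units, n_holdout)            # like IMPUTE set to a scalar
    donors = [u for u in units if u not in held]
    actual, imputed = [], []
    for u in held:
        nearest = min(donors, key=lambda d: abs(x[d] - x[u]))
        actual.append(y[u])
        imputed.append(y[nearest])
    return held, actual, imputed

# (Crops, Oats) for 13 of the farms in the Example section.
crops = [50, 60, 71, 90, 96, 110, 140, 156, 190, 240, 300, 330, 410]
oats = [17, 6, 24, 0, 25, 24, 43, 44, 60, 28, 59, 38, 72]

rng = random.Random(600209)
held, actual, imputed = holdout_check(crops, oats, 5, rng)
r = pearson(actual, imputed)    # agreement between actual and imputed values
```

A correlation near 1 suggests the matching variable carries useful information about the imputed variable; with only a handful of held-out units the estimate is very noisy.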

Options: `PRINT`, `METHOD`, `DMETHOD`, `%THRESHOLD`, `THRESHOLD`, `DVARIABLES`, `DRANGES`, `LABELS`, `SEED`, `IMPUTE`, `DONORS`, `RSAVE`, `URECEPTORS`, `UDONORS`, `DISTANCES`.

Parameters: `OLDSTRUCTURE`, `NEWSTRUCTURE`, `OVERWRITE`.

### Action with `RESTRICT`

`SVHOTDECK` takes restrictions from any `OLDSTRUCTURE` or `DVARIABLES` vectors. Only unrestricted units are used as either donors or receptors. However, restrictions on `IMPUTE` and `DONORS` are ignored.

### References

Korn, E.L. & Graubard, B.I. (1999). *Analysis of Health Surveys*. Wiley, New York.

### See also

Procedures: `SVBOOT`, `SVCALIBRATE`, `SVGLM`, `SVREWEIGHT`, `SVSAMPLE`, `SVSTRATIFIED`, `SVTABULATE`, `SVWEIGHT`, `MULTMISSING`, `QMVREPLACE`.

Commands for: Survey analysis.

### Example

    CAPTION 'SVHOTDECK example',\
            'Orkney oats data (Sampford, Table 5.1, page 61).';\
            STYLE=meta,plain
    VARIATE Oats
    READ Farm
     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
    19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 :
    READ Crops
     50  50  52  58  60  60  62  65  65  68  71  74  78  90  91  92  96 110
    140 140 156 156 190 198 209 240 274 300 303 311 324 330 356 410 430 :
    READ Oats
     17  17  10  16   6  15  20  18  14  20  24  18  23   0  27  34  25  24
     43  48  44  45  60  63  70  28  62  59  66  58 128  38  69  72 103 :
    "Insert some missing values to impute"
    CALCULATE Oatsmiss = MVINSERT(Oats; Farm.IN.!(17,23,30))
    "First nearest match. Set DRANGES to 1 to make distances easy to interpret"
    SVHOTDECK [PRINT=summary,list,monitoring; DVARIABLES=Crops; DRANGES=1;\
              SEED=600209] Oatsmiss; NEWSTRUCTURE=Oatsimp1
    "now pick at random from those within 20 acres of nearest match on crops"
    SVHOTDECK [PRINT=summary,list,monitoring; DVARIABLES=Crops; DRANGES=1;\
              THRESHOLD=20; SEED=12345] Oatsmiss; NEWSTRUCTURE=Oatsimp2
    "and at random from those differing in crop area by 20 hectares or less"
    SVHOTDECK [PRINT=summary,list,monitoring; DVARIABLES=Crops; DRANGES=1;\
              THRESHOLD=-20; SEED=23456] Oatsmiss; NEWSTRUCTURE=Oatsimp3
    PRINT Farm,Crops,Oats,Oatsmiss,Oatsimp1,Oatsimp2,Oatsimp3;\
          DECIMALS=0; FIELD=9