Estimates false discovery rates using mixture distributions (J.W. McNicol & D.B. Baird).

### Options

`PRINT` = string token |
What to print (`monitoring` , `estimates` ); default `esti` |
---|---|

`DISTRIBUTION` = string token |
Which distribution to mix with Uniform (`beta` , `gamma` ); default `beta` |

`INITIAL` = variate |
Initial values for mixing proportion (φ) and Beta or Gamma parameters (A and B); default `!(0.90,` `0.30,` `2)` |

`LOWER` = variate |
Lower limits for parameters; default `!(0.00001,` `0.001,` `0.001)` |

`UPPER` = variate |
Upper values for parameters; default `!(0.99999,` `5,` `1000)` |

`PLOT` = string token |
What to plot (`histogram` , `density` , `logdensity` , `inference` , `loginference` ); default `hist` , `dens` , `logd` , `infe` , `logi` |

`WINDOW` = scalar |
Window for graphs; default 1 |

`KEYWINDOW` = scalar |
Key window for Inference plot; default 2 |

`MAXCYCLE` = scalar |
Maximum iteration cycles; default 50 |

`TOLERANCE` = scalar or variate |
Tolerance for convergence of parameters; default 0.01 for Beta, and 0.001 for Gamma |

### Parameters

`PROBABILITIES` = variates |
Significance values, must lie between 0 and 1 |
---|---|

`ESTIMATES` = variates |
Saves the estimates of mixture parameters φ, A and B |

`FDR` = variates |
Saves the False Discovery Rates at the p-values in `PROBABILITIES` i.e. q-values |

`FRR` = variates |
Saves the False Rejection Rates at the p-values in `PROBABILITIES` |

`POWER` = variates |
Saves the power estimates as a function of the p-values in `PROBABILITIES` |

`POSTHA` = variates |
Saves the Posterior Probability of H_{a} at the p-values in `PROBABILITIES` |

`LOGLIKELIHOOD` = scalars |
Value of the loglikelihood at end of the iteration process |

`NCYCLES` = scalars |
Number of iterations taken to convergence |

### Description

`FDRMIXTURE`

estimates the false discovery rate (FDR), false rejection rate (FRR) and power of a test by modelling significance values as a 2-component mixture of Uniform and Beta or Gamma densities, Allison *et al.* (2002). The context is multiple testing, with data from any situation where the same simple hypothesis, H_{o}, is tested many times, such as in transcriptomics (microarrays), metabolomics and proteomics. These tests generate a large number of significance values which, under H_{0}, have a Uniform distribution and, under H_{a}, can be modelled as a Beta or truncated Gamma density. `FDRMIXTURE`

estimates the parameters of this mixture distribution to derive the False Discovery Rate, Prob(H_{0}/D_{a}), the False Rejection Rate, Prob(H_{a}/D_{0}) and the Power of the test, Prob(D_{a}/H_{a}), each as a function of *p*_{crit}. Here Da denotes the event “*p*<*p*crit”. The procedure also calculates the posterior probability of Ha, Prob(Ha/p), (POSTHa) from the mixture distribution. The significance values are provided by the `PROBABILITIES`

parameter and the choice of distribution (Beta or Gamma) by the `DISTRIBUTION`

option. The `FDR`

, `FRR`

, `POWER`

and `POSTHA`

parameters return estimates at the corresponding values of `PROBABILITIES`

. Thus `FDR`

contains the q-values of Storey & Tibshirani (2003). An EM algorithm is used to estimate the mixture parameters which are returned in the parameter `ESTIMATES`

.

The mixture model parameterization takes a proportion φ from the Uniform distribution, and (1 – φ) from either a Beta or Gamma distribution. The Gamma parameterization is

f(*x*) = (1/*b*)^{A} / Gamma(*A*) × exp(-*x*/*B*) × *x*^{(A-1)}

truncated at x=1, and the Beta parameterization is

f(*x*) = *x*^{(A-1)} × (1-*x*)^{(B-1)} / Beta(*A*; *B*).

Details of the estimation process are returned in the parameters `NCYCLES`

and `LOGLIKELIHOOD`

. Initial values, lower and upper mixture parameter limits are set by the `INITIAL`

, `LOWER`

and `UPPER`

options. Convergence can be controlled by a single tolerance for all three parameters or for each parameter separately using the `TOLERANCE`

option, and the number of iterations by the `MAXCYCLE`

option. A warning is printed when the parameter estimates imply a Beta or Gamma density which is unimodal rather than reverse J-shaped. The former would give rise to situations where Pr(H_{0}/D_{a}) > Pr(H_{a}/D_{a}) for very small *p*.

Printed output is controlled by the `PRINT`

option with settings:

`estimates` |
for estimates of the parameters, and |
---|---|

`monitoring` |
for monitoring information from the fitting process. |

By default `PRINT=estimates`

.

Graphical output is controlled by the `PLOT`

option with settings:

`histogram` |
for a plot of the fitted mixture against the histogram of probabilities, |
---|---|

`density` |
for a plot of the fitted mixture against the kernel density estimate of the probabilities on a logit scale (this allows a more detailed comparison at small probability value), |

`logdensity` |
gives even greater detail, by putting the density on a log scale (note that greater variation is expected around small density values on the log scale), |

`inference` |
generates a plot of `FDR` , `FRR` and `POWER` against p, and |

`loginference` |
plots these statistics on log scales, restricted to p<0.5, with a background grid, to enable estimates to be read for specific probability values. |

By default all the plots are produced.

The `WINDOW`

option controls where the plots go and the `KEYWINDOW`

option can be used to position the key in the inference plots.

Options: `PRINT`

, `DISTRIBUTION`

, `LOWER`

, `UPPER`

, `PLOT`

, `WINDOW`

, `KEYWINDOW`

, `MAXCYCLE`

, `TOLERANCE`

.

Parameters: `PROBABILITIES`

, `ESTIMATES`

, `FDR`

, `FRR`

, `POWER`

, `POSTHA`

, `LOGLIKELIHOOD`

, `NCYCLES`

.

### Method

In the context of hypothesis testing the false discovery rate, `FDR`

, can be defined as the probability of H_{0} being true when the result of the statistical test leads us to accept H_{a}:

`FDR`

= Prob(H_{0}/D_{a}).

By Bayes theorem

Prob(H_{0}/D_{a}) = Prob(D_{a}/H_{0}) × Prob(H_{0}) / Prob(D_{a})

and

Prob(D_{a}) = Prob(D_{a}/H_{0}) × Prob(H_{0}) + Prob(D_{a}/H_{a}) × Prob(H_{a}).

Further, in the context of multiple testing, where there are many *p*-values available, all the terms in these expressions can be derived by modelling the *p*-values as a 2-component mixture distribution. The *p*-values, under H_{0}, have a Uniform density and, under H_{a}, can be modelled as a Beta or truncated Gamma density. The mixing proportions are Prob(H_{0}) and Prob(H_{a}) respectively. Prob(D_{a}/H_{a}) is `CLBETA`

(*p*; *A*; *B*) or `CLGAMMA`

(*p*; *A*; *B*). The False Rejection Rate,

`FRR`

= Prob(H_{a}/Do)

is derived similarly. The posterior probability of H_{a},

PostH_{a} = Prob(H_{a}/p)

= Prob(p/H_{a}) × Prob(H_{a}) / (Prob(p/H_{a}) × Prob(H_{a}) + Prob(p/H_{0}) × Prob(H_{0}))

and each term is estimated by the mixture model parameters.

### Action with `RESTRICT`

The `PROBABILITIES`

parameter can be restricted. All output estimates will then be based only on those unrestricted units.

### References

Allison, D.B., Gadbury. G.L., Heo, M., Fernandez, J.R., Lee, C.-K., Prolla, T.A., & Weindruch R. (2002). A mixture model approach for the analysis of microarray gene expression data. *Computational Statistics & Data Analysis*, 39, 1-16.

Storey J.D. & Tibshirani R. (2003). Statistical significance for genomewide studies. *Proceedings of the National Academy of Science*, 100, 9440-9445.

### See also

Procedures: `FDRBONFERRONI`

, `QTHRESHOLD`

.

Commands for: Microarray data.

### Example

CAPTION 'FDRMIXTURE example'; STYLE=meta SCALAR ntests; VALUE = 20000 SCALAR seed; VALUE = 1231 SCALAR delta; VALUE = 0.3 SCALAR phi; VALUE = 0.90 SCALAR sampsize; VALUE = 50 " Create ntest data sets (rows), each of size sampsize (columns)." GRANDOM [NVALUES=1; SEED=seed] dum POINTER [NVALUES=sampsize] x CALCULATE x[] = GRNORMAL(ntests;0;1) " Add an offset, delta, to a fraction, (1-phi), of the samples. Derive test statistic, z, for Test Ho: mean=0 against Ha:mean0 and significance, p." CALCULATE nHa = ROUND(ntests * (1-phi)) & nHo = ntests - nHa VARIATE [VALUES=#nHo(0),#nHa(delta)] vdelta CALCULATE xbar = vmean(x) + vdelta & t = ABS(xbar * SQRT(sampsize)) & p = 2 * CUNORMAL(t; 0; 1) " Derive FDR, FRR, Power, PostHa." VARIATE [VALUES=0.85,0.4,3.74] initb FDRMIXTURE [DISTRIBUTION=beta; INITIAL=initb; TOLERANCE=0.01]\ p; FDR=qvalues; FRR=frr; POWER=power; POSTHA=postha