Hypothesis testing is a statistical inference technique that uses a sample to draw probabilistic conclusions about the underlying population.

In scientific methodology, a working assumption is a statement used to construct scientific theories. Working assumptions are neither logically true nor logically false; they should be falsifiable and frequently examined. You start from an assumption that you believe is at least partially true, and results will confirm, reject, or suggest modifications to your assumption.

In contrast, a core assumption is a proposition that is either logically true or accepted as a fundamental principle for pragmatic use. Core assumptions are thus rarely examined.

Definitions

In statistics, a hypothesis is a statement about population parameters. Typically we have two mutually exclusive hypotheses: one that is doubted, called the null hypothesis \( H_0: \theta \in \Theta_0 \); the other that is believed, called the alternative hypothesis \( H_1: \theta \in \Theta_0^C \). A hypothesis test is a rule that specifies, for every possible sample, whether to reject or (passively) accept the null.

Theoretically, a hypothesis test is an indicator function that divides the sample space into a rejection region and an acceptance region. In practice, we develop a test statistic as a real-valued function of the sample, and a rejection interval on the test statistic based on the p-value. The p-value of a test statistic is the probability of observing a test statistic at least as deviant (away from the center) as the observed value, if the null hypothesis is true. The significance level \( \alpha \) is a conventional upper bound on the p-value, below which we claim the alternative is statistically significant.

\[ \alpha = P(\text{Test Positive} | \text{Truth Negative}) \]

A typical hypothesis testing procedure:

  1. State the null and alternative hypotheses: \( H_0: \theta \in \Theta_0; \quad H_1: \theta \in \Theta_0^C \).
  2. State the hypothesis test: test statistic \( W(\mathbf{X})\); rejection interval \(R\) or significance level \( \alpha \).
  3. Sample and compute the test statistic \( W(\mathbf{X})\).
  4. Reject the null hypothesis if \( W \in R \) or \( p(W) < \alpha \); otherwise passively accept the null.
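The four-step procedure can be sketched with a one-sample t-test; the data and the hypothesized mean below are illustrative, not from the text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# 1. Hypotheses: H0: mu = 0 vs H1: mu != 0.
# 2. Test: t-statistic; significance level alpha = 0.05.
alpha = 0.05

# 3. Sample and compute the test statistic.
x = rng.normal(loc=0.5, scale=1.0, size=30)   # true mean is 0.5, so H0 is false
t_stat, p_value = stats.ttest_1samp(x, popmean=0.0)

# 4. Reject H0 if p < alpha; otherwise passively accept H0.
reject = p_value < alpha
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, reject H0: {reject}")
```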

The power of a hypothesis test is the probability of correctly rejecting the null hypothesis: \[ 1 - \beta(\theta) = P(\text{Test Positive} | \text{Truth Positive}) \]
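For a concrete case, the power of a two-sided one-sample z-test with known unit variance can be computed directly from the normal distribution; the sample size and effect size below are illustrative.

```python
from scipy.stats import norm

alpha, n, mu1 = 0.05, 30, 0.5        # level, sample size, mean under H1
z_crit = norm.ppf(1 - alpha / 2)     # two-sided critical value
se = 1 / n ** 0.5                    # standard error with sigma = 1
shift = mu1 / se                     # standardized effect

# Probability the test statistic lands in either rejection tail under H1.
power = norm.cdf(-z_crit + shift) + norm.cdf(-z_crit - shift)
print(round(power, 3))
```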

Errors cannot be eliminated from hypothesis tests, and they come in two types: a Type I error is a false positive (rejecting a true null); a Type II error is a false negative (accepting a false null).

Table 1: Test Conclusion Probabilities Conditional on Truth

| Test \ Truth | Negative | Positive |
| --- | --- | --- |
| Negative | confidence level \( 1-\alpha \) | Type II error rate \( \beta(\theta) \) |
| Positive | significance level \( \alpha \) | power \( 1-\beta(\theta) \) |

Table 2: Truth Likelihood Conditional on Test Conclusions

| Test \ Truth | Negative | Positive |
| --- | --- | --- |
| Negative | True negative | False negative (Type II error) |
| Positive | False positive (Type I error) | True positive |

Assume the fraction of test candidates that have real effects is \( p_0 \), known as prevalence in screening tests. Then by rules of conditional probability, the error probabilities can be written as functions of \( (p_0, \alpha, \beta) \):

\[ P(\text{Truth Negative} | \text{Test Positive}) = \frac{(1-p_0)\alpha}{p_0(1-\beta) + (1-p_0)\alpha} \] \[ P(\text{Truth Positive} | \text{Test Negative}) = \frac{p_0 \beta}{(1-p_0)(1-\alpha) + p_0 \beta} \]
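These two conditional error probabilities can be written directly as functions of \( (p_0, \alpha, \beta) \); the sketch below uses illustrative values to show how low prevalence inflates the share of false positives among positive results.

```python
def prob_false_discovery(p0, alpha, beta):
    """P(Truth Negative | Test Positive)."""
    return (1 - p0) * alpha / (p0 * (1 - beta) + (1 - p0) * alpha)

def prob_false_omission(p0, alpha, beta):
    """P(Truth Positive | Test Negative)."""
    return p0 * beta / ((1 - p0) * (1 - alpha) + p0 * beta)

# With prevalence 1%, alpha = 0.05, beta = 0.2, most positives are false:
print(round(prob_false_discovery(p0=0.01, alpha=0.05, beta=0.2), 3))  # ~0.861
print(round(prob_false_omission(p0=0.01, alpha=0.05, beta=0.2), 4))
```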

Test Statistics

t-statistic

The t-statistic under the null \( H_0: \beta_i = \beta^{*} \) is:

\[ t = \frac{\hat{\beta}_i - \beta^{*}}{\widehat{\text{s.e.}}(\hat{\beta}_i)} \]
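A minimal sketch of this formula for a simple-regression slope under \( H_0: \beta_1 = 0 \), computed from the least-squares fit; the data-generating process is illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)           # true slope is 2

X = np.column_stack([np.ones(n), x])       # design matrix with intercept
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - 2)       # unbiased error-variance estimate
cov = sigma2_hat * np.linalg.inv(X.T @ X)  # estimated covariance of beta_hat

t = (beta_hat[1] - 0.0) / np.sqrt(cov[1, 1])  # (estimate - null) / s.e.
print(t)
```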

F-statistic

F-statistic:

\[ F = \frac{\text{MSS} / p}{\text{RSS} / (n - p - 1)} \]
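The F-statistic can be computed from a regression's sums of squares; the sketch below fits an illustrative model with \( p = 3 \) predictors plus an intercept.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 60, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=n)

Xd = np.column_stack([np.ones(n), X])      # add intercept column
beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)
y_hat = Xd @ beta_hat

tss = np.sum((y - y.mean()) ** 2)          # total sum of squares
rss = np.sum((y - y_hat) ** 2)             # residual sum of squares
mss = tss - rss                            # model sum of squares

F = (mss / p) / (rss / (n - p - 1))
print(F)
```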

Analysis of Variance

Analysis of variance (ANOVA) is a collection of multivariate statistical methods originally used to analyze data obtained from (agricultural) experiments. The design of the experiment affects the choice of analysis of variance method.

Analysis of variance as defined in Merriam-Webster:

analysis of variation in an experimental outcome and especially of a statistical variance in order to determine the contributions of given factors or variables to the variance. (First use: 1918)

Analysis of variance is fundamentally about multilevel modeling: parameters/effects are structured into batches, and each row in the ANOVA table corresponds to a different batch of parameters, along with inference about the standard deviation of the parameters in that batch.

The ANOVA F-statistic is the ratio of MSS to RSS, each normalized by its degrees of freedom.

ANOVA decomposes the variance (total sum of squares, TSS) into component sums of squares: factor sum of squares and residual sum of squares. ANOVA is closely related to regression analysis: recall that a fitted regression model splits TSS into model sum of squares (MSS) and residual sum of squares (RSS). Indeed, ANOVA can be seen as a special case of generalized linear models: the independent variables in analysis of variance become dummy variables in a regression model.

Multi-factor ANOVA is used to detect significant factors in a multi-factor model.

The ANOVA table is a standard report of an analysis of variance, which provides the sums of squares and a formal F-test for the factor effect. One-way ANOVA is an omnibus test that determines, for several independent groups, whether any of their group means are statistically significantly different from each other. The one-way ANOVA F-test is a generalization of the two-sample t-test; the two are identical (in particular, \(F = t^2\)) when there are only two groups. As an omnibus test, ANOVA does not suffer the inflated Type I error probability of multiple t-tests.
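The identity \(F = t^2\) for two groups can be checked numerically; the data below are illustrative, and the t-test uses the pooled-variance (equal-variance) form to match one-way ANOVA.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.normal(loc=0.0, size=20)
b = rng.normal(loc=1.0, size=20)

F, p_f = stats.f_oneway(a, b)        # one-way ANOVA with two groups
t, p_t = stats.ttest_ind(a, b)       # pooled-variance two-sample t-test

print(np.isclose(F, t ** 2))         # prints True: F = t^2 for two groups
print(np.isclose(p_f, p_t))          # prints True: identical p-values
```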

Oneway ANOVA by M. Plonsky

Three main assumptions of ANOVA test:

  1. (The residuals in) each group are normally distributed (though the test is robust against violations of normality).
  2. All groups have the same variance.
  3. Observations are independent.

Non-parametric alternative: Kruskal-Wallis H-test.
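When the normality assumption is doubtful, the rank-based Kruskal-Wallis H-test is a drop-in alternative; the three groups below are illustrative.

```python
from scipy import stats

g1 = [2.9, 3.0, 2.5, 2.6, 3.2]
g2 = [3.8, 2.7, 4.0, 2.4]
g3 = [2.8, 3.4, 3.7, 2.2, 2.0]

# Omnibus test on ranks: H0 is that all groups share the same median.
H, p = stats.kruskal(g1, g2, g3)
print(f"H = {H:.3f}, p = {p:.4f}")
```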

To determine which specific groups differ from one another, you need to use a post hoc test:

  • Homoskedasticity satisfied: Tukey's honestly significant difference (HSD) post hoc test;
  • Homoskedasticity violated: Games-Howell post hoc test.

A factor is a categorical/discrete variable; a level is a possible value of a factor. A contrast is a linear combination of factor level means whose coefficients sum to zero. Two contrasts are orthogonal if their coefficient vectors are orthogonal. A simple contrast is the difference between two factor level means.
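These definitions can be checked directly on coefficient vectors; the two contrasts below, over three illustrative factor levels, are a standard orthogonal pair.

```python
import numpy as np

c1 = np.array([1.0, -1.0, 0.0])   # simple contrast: level 1 vs level 2
c2 = np.array([1.0, 1.0, -2.0])   # average of levels 1,2 vs level 3

# Valid contrasts: coefficients sum to zero.
assert np.isclose(c1.sum(), 0.0) and np.isclose(c2.sum(), 0.0)

# Orthogonal contrasts: coefficient vectors have zero dot product.
print(np.isclose(c1 @ c2, 0.0))   # prints True
```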

Multiple comparison is a more formal analysis for comparing individual batch means.

Fixed effects and random effects have many incompatible definitions, but it is better to keep the terminology simple and unambiguous. An effect/coefficient in a multilevel model is constant ("fixed effect") if it is identical for all groups in a population; varying ("random effect") if it is allowed to differ from group to group. {Gelman:2005. Annals of Statistics.}

Types of Tests

Likelihood Ratio Test (LRT)

Def: likelihood ratio statistic

Def: likelihood ratio test

Def: Nuisance parameter

Uniformly Most Powerful (UMP) Test

Def: unbiased

Def: uniformly most powerful (UMP) test

Thm: (Neyman-Pearson)

Def: monotone likelihood ratio (MLR)

Thm: (Karlin-Rubin)

Other Tests

Wald, LM, LR, J

Large-scale Simultaneous Hypothesis Testing

Classical testing theory involves a single test statistic. Since the 1960s, a theory of multiple testing has been developed to handle \(N\) between 2 and perhaps 20, where \(N\) denotes the number of test statistics.

For small-scale simultaneous hypothesis testing, such as \( N \le 20 \), family-wise error rate (FWER) is used to control the probability of making any false rejection.

For large-scale testing, with thousands or more simultaneous hypothesis tests, the false discovery rate (FDR) has become the standard control criterion.
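The standard FDR-controlling method is the Benjamini-Hochberg step-up procedure; the minimal version below is a sketch, with illustrative p-values.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean mask of rejected hypotheses at FDR level q."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m   # step-up thresholds q*k/m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])       # largest k meeting its threshold
        reject[order[: k + 1]] = True          # reject the k smallest p-values
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(benjamini_hochberg(pvals, q=0.05))       # rejects the two smallest
```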

{Efron2004}


🏷 Category=Statistics