**Hypothesis testing** is a statistical inference technique
that uses a sample to draw probabilistic conclusions about the underlying population.

In scientific methodology, a **core assumption** is a proposition
that is either logically true or accepted as a fundamental principle for pragmatic use,
thus rarely examined.

In contrast, a **working assumption** is a statement used to construct scientific theories.
Working assumptions are neither logically false nor logically true;
they should be falsifiable and frequently examined.
You start from an assumption that you believe is at least partially true,
and the results will confirm, suggest modifications to, or reject your assumption.

In statistics, a **hypothesis** is a statement about population parameters $\theta$.
Typically we have two mutually exclusive hypotheses:

- The **null hypothesis** (aka negative) to falsify, $H_0: \theta \in \Theta_0$;
- The **alternative hypothesis** (aka positive), $H_1: \theta \in \Theta_0^C$;

A **hypothesis test** is a rule that specifies, for every possible sample $\mathbf{X}$,
whether to reject or passively accept the null $H_0$.
Theoretically, a hypothesis test is an indicator function
that divides the sample space into a rejection region and an acceptance region.
In practice, we design a **test statistic** $W(\mathbf{X})$ of the sample
and define rejection thresholds on the space of the test statistic.

If the test statistic of the given sample falls beyond the threshold,
we claim the alternative hypothesis is **statistically significant**,
where the **significance level** (aka size) $\alpha$ of a test
is the probability of the test statistic falling beyond the rejection thresholds
if the null hypothesis is true.
Instead of reporting a binary decision along with the significance level, a more refined measure is the p-value.
The **p-value** of a test statistic is the probability of it
being at least as deviant (away from center) as the observed value, if the null hypothesis is true:
$$p(w) = P(W > w \mid H_0)$$
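As a minimal sketch (assuming, hypothetically, a standard-normal null distribution for $W$), the p-value is just a tail probability under the null:

```python
import math

def norm_sf(w):
    """Survival function P(W > w) of the standard normal."""
    return 0.5 * math.erfc(w / math.sqrt(2))

# Hypothetical observed test statistic, with W ~ N(0, 1) under H0.
w = 2.1
p_one_sided = norm_sf(w)           # P(W > w | H0)
p_two_sided = 2 * norm_sf(abs(w))  # doubled tail, for a symmetric null
print(p_one_sided, p_two_sided)
```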

A typical hypothesis testing procedure:

- State the null and alternative hypotheses: $H_0: \theta \in \Theta_0; \quad H_1: \theta \in \Theta_0^C$;
- State the hypothesis test: test statistic $W(\mathbf{X})$ and significance level $\alpha$;
- Sample and compute the test statistic $w = W(\mathbf{x})$;
- Reject the null hypothesis if $p(w) < \alpha$; otherwise passively accept the null;
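The steps above can be sketched end-to-end for a one-sample z-test (a hypothetical setup with known unit variance and made-up data):

```python
import math

def norm_sf(x):
    """Survival function of the standard normal."""
    return 0.5 * math.erfc(x / math.sqrt(2))

# 1. Hypotheses: H0: mu = 0 vs H1: mu != 0 (sigma = 1 assumed known).
# 2. Test: statistic W = sqrt(n) * (xbar - 0) / sigma, level alpha = 0.05.
alpha = 0.05
x = [0.9, 1.3, 0.4, 0.7, 1.1, 0.2, 0.8, 1.0, 0.6, 0.5]  # made-up sample

# 3. Compute the observed test statistic.
n, sigma = len(x), 1.0
w = math.sqrt(n) * (sum(x) / n - 0.0) / sigma

# 4. Two-sided p-value and decision.
p = 2 * norm_sf(abs(w))
print("reject H0" if p < alpha else "passively accept H0")
```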

Errors occur in hypothesis testing whenever the test conclusion differs from the truth, which cannot be eliminated in a probabilistic setting. The probabilities of errors can be conditioned either on truth or on test conclusions.

Table: Truth-Test Contingency

Frequencies | Test Negative, $N^0$ | Test Positive, $N^1$
---|---|---
True Negative, $N_0$ | true negative, $N_0^0$ | false positive, $N_0^1$
True Positive, $N_1$ | false negative, $N_1^0$ | true positive, $N_1^1$

The significance level $\alpha$ and the **power** $1 - \beta(\theta)$ of a hypothesis test
are the probabilities of false and true positives conditioned on truth,
where the latter depends on the true population parameter $\theta$.
**Type I Error**, aka **false discovery rate** (FDR), and **Type II Error**
are the probabilities of false positives and false negatives conditioned on test conclusions.

$$\begin{aligned} \alpha = N_0^1 / N_0 && \text{Type I error} = N_0^1 / N^1 \\ \beta(\theta) = N_1^0 / N_1 && \text{Type II error} = N_1^0 / N^0 \end{aligned}$$

Denoting the **prevalence** of true positives among test candidates as $p_1 = N_1 / N$,
the error probabilities are related by:
$$\begin{aligned}
\text{Type I error} &= \frac{(1-p_1)\alpha}{(1-p_1)\alpha + p_1(1-\beta)} \\
\text{Type II error} &= \frac{p_1 \beta}{(1-p_1)(1-\alpha) + p_1 \beta}
\end{aligned}$$

Common convention sets significance level $\alpha = 0.05$, power $1 - \beta(\theta) = 0.8$, and type I error $\text{FDR} = 0.1$.
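Under these conventional values, the relations above show how strongly both errors depend on prevalence (a hand-rolled sketch; the prevalence values are made up):

```python
# Conventional significance level and power (so beta = 0.2).
alpha, beta = 0.05, 0.2

def type_1_error(p1):
    """False positives among positive test conclusions (the FDR above)."""
    return (1 - p1) * alpha / ((1 - p1) * alpha + p1 * (1 - beta))

def type_2_error(p1):
    """False negatives among negative test conclusions."""
    return p1 * beta / ((1 - p1) * (1 - alpha) + p1 * beta)

# At 50% prevalence the FDR is near alpha; at 10% it balloons to 36%.
for p1 in (0.5, 0.1):
    print(p1, round(type_1_error(p1), 3), round(type_2_error(p1), 3))
```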

The **t-statistic** under the null $H_0: \beta_i = \beta^{*}$ is:

$$t = \frac{\hat{\beta}_i - \beta^{*}}{\hat{\sigma}_{\hat{\beta}_i}}$$

**F-statistic**:

$$F = \frac{\text{MSS} / p}{\text{RSS} / (n - p - 1)}$$
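Both statistics can be computed by hand for a simple regression $y = \beta_0 + \beta_1 x + \varepsilon$ on made-up data; with a single predictor ($p = 1$), the overall F-statistic equals the square of the t-statistic for $H_0: \beta_1 = 0$:

```python
import math

# Made-up data for a simple linear regression.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Least-squares estimates.
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = ybar - b1 * xbar

# Sums of squares: TSS = MSS + RSS.
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
rss = sum(r * r for r in resid)
tss = sum((yi - ybar) ** 2 for yi in y)
mss = tss - rss

p = 1                                # number of predictors
s2 = rss / (n - p - 1)               # residual variance estimate
se_b1 = math.sqrt(s2 / sxx)          # standard error of b1
t = (b1 - 0.0) / se_b1               # t-statistic under H0: b1 = 0
F = (mss / p) / (rss / (n - p - 1))  # overall F-statistic

print(abs(t ** 2 - F) < 1e-8)        # F = t^2 when p = 1
```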

**likelihood ratio** statistic

**likelihood ratio test**

Def: Nuisance parameter

Def: unbiased

**uniformly most powerful** (UMP) test

Thm: (Neyman-Pearson)

**monotone likelihood ratio** (MLR)

Thm: (Karlin-Rubin)

Wald, LM, LR, J

Analysis of variance (ANOVA) is a collection of multivariate statistical methods originally used to analyze data obtained from (agricultural) experiments. The design of the experiment affects the choice of analysis-of-variance method.

Analysis of variance as defined in Merriam-Webster:

Analysis of variation in an experimental outcome and especially of a statistical variance in order to determine the contributions of given factors or variables to the variance. (First use: 1918)

Analysis of variance is fundamentally about multilevel modeling: each row in the ANOVA table corresponds to a different batch of parameters, along with inference about the standard deviation of the parameters in this batch.

structuring of parameters/effects into batches

ANOVA F-statistic: MSS/RSS, normalized.

**ANOVA** decomposes the variance (total sum of squares, TSS) into component sums of squares:
factor sum of squares and residual sum of squares.
ANOVA is closely related to regression analysis:
recall that a fitted regression model split TSS into
model sum of squares (MSS) and residual sum of squares (RSS).
Indeed, it can be seen as a special case of generalized linear models:
independent variables in analysis of variance become dummy variables for a regression model.

Multi-factor ANOVA is used to detect significant factors in a multi-factor model.

The **ANOVA table** is a standard report of an analysis of variance,
which provides sum of squares and also a formal F-test for the factor effect.
**One-way ANOVA** is an omnibus test that determines for several independent groups
whether any of their group means are statistically significantly different from each other.
The one-way ANOVA F-test is a generalization of the two-sample t-test;
the two are identical (in particular, $F = t^2$) when there are only two groups.
As an omnibus test, ANOVA does not have the problem of increased Type I error probability
in multiple t-tests.
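A hand-rolled sketch on made-up data verifies the $F = t^2$ identity for two groups (using the pooled, equal-variance t-statistic):

```python
import math

def anova_f(groups):
    """One-way ANOVA F-statistic: between-group over within-group mean squares."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ssw = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ssb / (k - 1)) / (ssw / (n - k))

def two_sample_t(a, b):
    """Pooled two-sample t-statistic (equal variances assumed)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    s2 = (sum((x - ma) ** 2 for x in a)
          + sum((x - mb) ** 2 for x in b)) / (na + nb - 2)
    return (ma - mb) / math.sqrt(s2 * (1 / na + 1 / nb))

# Made-up measurements for two groups.
a = [4.1, 5.0, 4.6, 5.3]
b = [5.8, 6.1, 5.5, 6.4]
F = anova_f([a, b])
t = two_sample_t(a, b)
print(abs(F - t ** 2) < 1e-9)  # True: F = t^2 with two groups
```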

Three main assumptions of ANOVA test:

- The residuals in each group are normally distributed (but the test is robust against violations of normality).
- All groups have the same variance.
- Observations are independent.

Non-parametric alternative: Kruskal-Wallis H-test.

To determine which specific groups differ from one another, you need to use a *post hoc* test:

- Homoskedasticity satisfied: honestly significant difference (HSD) post hoc test;
- Homoskedasticity violated: Games-Howell post hoc test;

Factor is a categorical/discrete variable; level is a possible value of a factor.
A **contrast** is a linear combination of factor level means whose coefficients sum to zero.
Two contrasts are *orthogonal* if their coefficient vectors are orthogonal.
A simple contrast is the difference between two factor-level means.
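With hypothetical means for three factor levels, contrasts and their orthogonality reduce to zero-sum coefficient vectors and a dot product:

```python
# Made-up means of three factor levels.
means = [4.7, 5.9, 6.1]

c1 = [1, -1, 0]   # simple contrast: level 1 vs level 2
c2 = [1, 1, -2]   # levels 1 and 2 (averaged) vs level 3

for c in (c1, c2):
    assert sum(c) == 0          # contrast coefficients sum to zero

value = sum(ci * mi for ci, mi in zip(c1, means))  # estimated contrast
dot = sum(u * v for u, v in zip(c1, c2))
print(value, dot == 0)          # c1 and c2 are orthogonal
```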

**multiple comparison**, a more formal analysis for comparing individual batch means.

Fixed effects and random effects have many incompatible definitions,
but it is better to keep the terminology simple and unambiguous.
An effect/coefficient in a multilevel model is **constant** ("fixed effect")
if it is identical for all groups in a population;
**varying** ("random effect") if it is allowed to differ from group to group. [@Gelman2005]

Consider the number of test statistics $N$.
Classical testing theory involves a single test statistic, i.e. $N = 1$.
Since the 1960s, the theory of **multiple testing** has tried to handle $N$ between 2 and perhaps 20.
**Large-scale testing** deals with thousands or more simultaneous hypothesis tests [@Efron2004].

Single tests depend on the theoretical null distribution, a purely frequentist long-run property.

**Family-wise error rate** (FWER) is used to control the probability of
making any false rejection among the $N$ simultaneous tests.
Common FWER procedures include the Bonferroni bound and Holm's procedure,
both of which are very conservative (statistical significance is hard to attain).
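Both procedures are short enough to sketch directly (the p-values below are made up):

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H_i iff p_i <= alpha / N."""
    n = len(pvals)
    return [p <= alpha / n for p in pvals]

def holm(pvals, alpha=0.05):
    """Step down through sorted p-values, stopping at the first failure."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    reject = [False] * n
    for rank, i in enumerate(order):       # rank 0 is the smallest p-value
        if pvals[i] <= alpha / (n - rank):
            reject[i] = True
        else:
            break                          # all larger p-values fail too
    return reject

pvals = [0.001, 0.015, 0.04, 0.2]
print(bonferroni(pvals))  # [True, False, False, False]
print(holm(pvals))        # [True, True, False, False]: slightly more powerful
```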

**False-discovery rate** (FDR) has become the standard control criterion for large-scale tests,
with empirical estimates $\widehat{\text{fdr}}$ (local) and $\widehat{\text{Fdr}}$ (tail-area).

Decision rule $D_q$:
order the p-values $p_{(1)} \le \dots \le p_{(N)}$, find the largest index $i_{\max}$ such that $p_{(i)} \le \frac{i}{N} q$, and reject the null for the $i_{\max}$ smallest p-values.
The **Benjamini-Hochberg FDR Control Theorem** shows that
this decision rule controls FDR at level $\pi_0 q$, which is bounded above by $q$:

If the p-values corresponding to valid null hypotheses are independent of each other, then $\text{FDR}(D_q) = \pi_0 q$, where $\pi_0 = N_0 / N$ is the prevalence of negatives.
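The decision rule $D_q$ is straightforward to implement (a sketch with made-up p-values):

```python
def benjamini_hochberg(pvals, q=0.1):
    """Benjamini-Hochberg step-up rule D_q: find the largest i with
    p_(i) <= (i/N) q, and reject the i smallest p-values."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    i_max = 0
    for rank, i in enumerate(order, start=1):  # rank = i in p_(i)
        if pvals[i] <= rank * q / n:
            i_max = rank
    reject = [False] * n
    for rank, i in enumerate(order, start=1):
        if rank <= i_max:
            reject[i] = True
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.09, 0.5]
print(benjamini_hochberg(pvals, q=0.1))
# [True, True, True, True, False, False]
```

Note that $p_{(4)} = 0.041$ exceeds $q \cdot 4/6$ for no index below it, yet it is rejected because a step-up rule rejects everything up to the largest qualifying index.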

🏷 Category=Statistics