Sample and Statistic

Random variables $X_1, \cdots, X_n$ are called a random sample of size $n$ from population $f(x)$ iff they are independent and identically distributed with distribution $f(x)$: $X_1, \cdots, X_n \overset{\text{i.i.d.}}{\sim} f(x)$. A basic and important example of a random sample is Gaussian white noise, the statistical model for a deterministic value affected by numerous small, additive, uncorrelated, zero-mean error terms.

In the frequentist approach to statistics, the sample is the sole source of information, and all statistical results are derived from it or, equivalently, from its empirical distribution. (In Bayesian statistics, prior knowledge is accepted as a second source of information, formally expressed as prior distributions.)

Statistic

A statistic of a random sample is a transformation of the sample, and is therefore itself a random variable: $T = g(X_1, \cdots, X_n)$. The sampling distribution of a statistic is its probability distribution as a random variable. The standard error $\sigma_T$ of a statistic is the standard deviation of its sampling distribution.

A statistic is typically designed to estimate some property of a probabilistic model, such as its expectation or variance. Since the sampling distribution of a statistic is also a probabilistic model, its properties can in turn be estimated by other statistics.
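As a small illustration of the last point: the standard error of the sample mean is $\sigma / \sqrt{n}$, and it can itself be estimated from the sample by the plug-in statistic $S / \sqrt{n}$, where $S$ is the sample standard deviation. The sketch below uses only Python's standard library; the data values and the function name `standard_error_of_mean` are made up for illustration.

```python
import math
import statistics

def standard_error_of_mean(sample):
    """Estimate the standard error of the sample mean by S / sqrt(n)."""
    n = len(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    return s / math.sqrt(n)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative values only
print(standard_error_of_mean(data))
```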

Common statistics:

Sample mean: $\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i$
(Unbiased) Sample variance: $S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2$
Sample standard deviation: $S = \sqrt{S^2}$
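
The three formulas above translate directly into code. The following is a minimal sketch in plain Python; the helper names and the data are illustrative assumptions, and the last line only cross-checks against the standard library, which uses the same $n-1$ convention.

```python
import math
import statistics

def sample_mean(xs):
    """X-bar = (1/n) * sum of X_i."""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Unbiased sample variance S^2 with the n - 1 denominator."""
    xbar = sample_mean(xs)
    return sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1)

def sample_std(xs):
    """Sample standard deviation S = sqrt(S^2)."""
    return math.sqrt(sample_variance(xs))

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative values only
print(sample_mean(data))      # 5.0
print(sample_variance(data))  # equals 32/7 up to floating-point rounding
print(sample_std(data))
# Cross-check against the standard library (same n - 1 convention).
print(statistics.variance(data), statistics.stdev(data))
```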

If a random sample is drawn from a population with finite mean $\mu$ and variance $\sigma^2$, then:

$\mathbb{E}\bar{X} = \mu$, $\text{Var}\bar{X} = \frac{\sigma^2}{n}$;
$\mathbb{E} S^2 = \sigma^2$;
$\mathbb{E} S \leq \sigma$, which follows from Jensen's inequality: $(\mathbb{E} S)^2 \leq \mathbb{E} S^2 = \sigma^2$.
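
These moment identities hold for any population with finite variance, not only a Gaussian one. The sketch below checks them by Monte Carlo for an Exponential(1) population, for which $\mu = \sigma^2 = 1$. The choice of population, the sample size, the seed, and all function names are assumptions made for illustration; the printed estimates are random and only approximate the theoretical values.

```python
import math
import random

def mean(values):
    return sum(values) / len(values)

def sample_variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def simulate_moments(n=5, reps=20000, seed=0):
    """Monte Carlo estimates for samples of size n from Exponential(1)."""
    rng = random.Random(seed)
    xbars, s2s, ss = [], [], []
    for _ in range(reps):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        s2 = sample_variance(sample)
        xbars.append(mean(sample))
        s2s.append(s2)
        ss.append(math.sqrt(s2))
    return {
        "E_mean": mean(xbars),               # theory: mu = 1
        "Var_mean": sample_variance(xbars),  # theory: sigma^2 / n = 0.2
        "E_S2": mean(s2s),                   # theory: sigma^2 = 1
        "E_S": mean(ss),                     # theory: at most sigma = 1
    }

print(simulate_moments())
```

Note that $S^2$ is unbiased for $\sigma^2$ while $S$ is biased downward for $\sigma$, which the last estimate makes visible.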

If a random sample is drawn from Gaussian population $N(\mu,\sigma^2)$, then:

$\bar{X} \sim N(\mu, \frac{\sigma^2}{n})$
$S^2 \sim \frac{\sigma^2}{n-1} \chi_{n-1}^2$
$\bar{X} \perp\!\!\!\perp S^2$, i.e. the sample mean and the sample variance are independent.
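
A simulation can make the Gaussian case concrete. The sketch below draws many Gaussian samples and checks that $(n-1) S^2 / \sigma^2$ has the mean $n-1$ and variance $2(n-1)$ of a $\chi_{n-1}^2$ variable, and that $\bar{X}$ and $S^2$ are empirically uncorrelated. Zero correlation is only a necessary consequence of independence, not a proof of it. The parameter values, seed, and function names are assumptions for illustration.

```python
import random

def mean(values):
    return sum(values) / len(values)

def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def simulate_gaussian_sample(n=6, mu=1.0, sigma=2.0, reps=20000, seed=0):
    rng = random.Random(seed)
    xbars, scaled = [], []
    for _ in range(reps):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        xbars.append(mean(xs))
        # (n - 1) S^2 / sigma^2 should follow chi-square with n - 1 df.
        scaled.append((n - 1) * variance(xs) / sigma ** 2)
    mx, ms = mean(xbars), mean(scaled)
    cov = sum((a - mx) * (b - ms) for a, b in zip(xbars, scaled)) / (reps - 1)
    corr = cov / (variance(xbars) * variance(scaled)) ** 0.5
    return {
        "mean_xbar": mx,                # theory: mu
        "var_xbar": variance(xbars),    # theory: sigma^2 / n
        "mean_chi2": ms,                # theory: n - 1
        "var_chi2": variance(scaled),   # theory: 2 (n - 1)
        "corr": corr,                   # theory: 0
    }

print(simulate_gaussian_sample())
```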

Gaussian Sampling Distributions

Student's t-distribution

For a random sample of size $n$ drawn from a Gaussian population with mean $\mu$, Student's t-distribution with $n-1$ degrees of freedom is defined as the sampling distribution of the t-statistic:

$$T = \frac{\bar{X} - \mu}{S / \sqrt{n}} \sim t_{n-1}$$

An equivalent definition is:

$$t_q \sim \frac{\Gamma(\frac{q+1}{2})}{\Gamma(\frac{q}{2}) \Gamma(\frac{1}{2})} \sqrt{\frac{1}{q}} \left( 1+\frac{t^2}{q} \right)^{-\frac{q+1}{2}}$$
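
The agreement between the two definitions can be checked numerically. The sketch below evaluates the density formula above, integrates it with the trapezoidal rule to get $P(|T| \le c)$, and compares the result with the fraction of simulated t-statistics falling in the same interval. The function names, the cutoff $c = 2$, the sample size, and the seed are illustrative assumptions.

```python
import math
import random

def t_pdf(t, q):
    """Density of Student's t with q degrees of freedom (formula from the text)."""
    c = math.gamma((q + 1) / 2) / (math.gamma(q / 2) * math.gamma(0.5))
    return c * math.sqrt(1 / q) * (1 + t * t / q) ** (-(q + 1) / 2)

def t_prob_within(c, q, steps=2000):
    """P(|T| <= c) via the trapezoidal rule applied to t_pdf on [-c, c]."""
    h = 2 * c / steps
    total = 0.5 * (t_pdf(-c, q) + t_pdf(c, q))
    total += sum(t_pdf(-c + k * h, q) for k in range(1, steps))
    return total * h

def simulated_prob_within(c, n, mu=0.0, sigma=1.0, reps=20000, seed=0):
    """Fraction of simulated t-statistics (Gaussian samples of size n) in [-c, c]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(xs) / n
        s = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
        t = (xbar - mu) / (s / math.sqrt(n))
        if abs(t) <= c:
            hits += 1
    return hits / reps

n = 10
# The two numbers should be close: density with q = n - 1 vs. simulation.
print(t_prob_within(2.0, n - 1), simulated_prob_within(2.0, n))
```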

Student's t-distribution also arises from Bayesian inference as the (standardised) marginal posterior distribution of the mean of a normal distribution, with uninformative priors for the unknown mean and variance.

Snedecor's F distribution

For a random sample of size $n$ drawn from a Gaussian population with variance $\sigma_X^2$ and another of size $m$ drawn from an independent Gaussian population with variance $\sigma_Y^2$, Snedecor's F distribution with $n-1$ and $m-1$ degrees of freedom is defined as the sampling distribution of the F-statistic:

$$F = \frac{ S_X^2 / \sigma_X^2 }{ S_Y^2 / \sigma_Y^2 } \sim F_{n-1,m-1}$$

An equivalent definition is:

$$F_{p,q} \sim \frac{ \Gamma(\frac{p+q}{2}) }{ \Gamma(\frac{p}{2}) \Gamma(\frac{q}{2}) } \left( \frac{p}{q} \right) \frac{ \left( \frac{p}{q} x \right)^{\frac{p}{2}-1} }{ \left( 1 + \frac{p}{q} x \right)^{\frac{p+q}{2}} } 1_{\{x>0\}}$$

Properties of F distribution:

If $X \sim F_{p,q}$, then $\frac{1}{X} \sim F_{q,p}$;
If $X \sim t_q$, then $X^2 \sim F_{1,q}$;
If $X \sim F_{p,q}$, then $\frac{ \frac{p}{q} X }{ 1 + \frac{p}{q} X } \sim B(\frac{p}{2}, \frac{q}{2})$.
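
Two of these properties are easy to probe by simulation. The sketch below builds F-statistics from pairs of Gaussian samples, then (a) compares the mean and variance of the transformed values with those of $B(\frac{p}{2}, \frac{q}{2})$, and (b) compares the median of $1/X$ with the median of an independently simulated $F_{q,p}$ sample. Sample sizes, standard deviations, seeds, and function names are assumptions for illustration, and matching a few summary values is a sanity check rather than a proof.

```python
import random
import statistics

def sample_variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def simulate_f(n, m, sigma_x=2.0, sigma_y=0.5, reps=20000, seed=0):
    """Simulated F-statistics with n - 1 and m - 1 degrees of freedom."""
    rng = random.Random(seed)
    out = []
    for _ in range(reps):
        xs = [rng.gauss(0.0, sigma_x) for _ in range(n)]
        ys = [rng.gauss(0.0, sigma_y) for _ in range(m)]
        out.append((sample_variance(xs) / sigma_x ** 2)
                   / (sample_variance(ys) / sigma_y ** 2))
    return out

def to_beta(fs, p, q):
    """Map X ~ F(p, q) to (p/q) X / (1 + (p/q) X)."""
    return [(p / q * x) / (1 + p / q * x) for x in fs]

p, q = 4, 8                      # degrees of freedom for n = 5, m = 9
fs = simulate_f(n=5, m=9)
betas = to_beta(fs, p, q)
a, b = p / 2, q / 2
print(sum(betas) / len(betas), a / (a + b))                         # Beta mean
print(sample_variance(betas), a * b / ((a + b) ** 2 * (a + b + 1)))  # Beta variance

# Reciprocal property: 1/X should behave like F(q, p).
fs_swapped = simulate_f(n=9, m=5, seed=1)
print(statistics.median(1 / x for x in fs), statistics.median(fs_swapped))
```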


Order Statistic

The order statistics of a random sample are the sample values arranged in ascending order, denoted $X_{(1)}, \cdots, X_{(n)}$, so that: $$P(X_{(1)} \le \cdots \le X_{(n)}) = 1$$

Statistics derived from order statistics:

Sample range: $R = X_{(n)} - X_{(1)}$
Sample median: $M = \begin{cases} X_{\left(\frac{n+1}{2}\right)} &n \text{ odd} \\ \frac{1}{2} \left( X_{\left(\frac{n}{2}\right)} + X_{\left(\frac{n}{2}+1\right)} \right) &n \text{ even} \end{cases}$
Interquartile range: $\text{IQR} = X_{( n+1 - \{ \frac{n}{4} \} )} - X_{( \{ \frac{n}{4} \} )}$, where $\{\cdot\}$ is rounding to the nearest integer.
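
These statistics can be computed by sorting the sample and indexing into it. The sketch below is one possible reading of the definitions: for even $n$ the median averages the two middle order statistics $X_{(n/2)}$ and $X_{(n/2+1)}$, and ties in the rounding $\{\cdot\}$ are rounded up, which is an assumption since the text does not specify tie handling. All function names and data values are illustrative.

```python
import math

def round_half_up(x):
    """Nearest integer, ties rounded up (assumed; the text leaves ties open)."""
    return math.floor(x + 0.5)

def order_statistic(xs, r):
    """r-th order statistic X_(r), with r counted from 1."""
    return sorted(xs)[r - 1]

def sample_range(xs):
    return order_statistic(xs, len(xs)) - order_statistic(xs, 1)

def sample_median(xs):
    n = len(xs)
    if n % 2 == 1:
        return order_statistic(xs, (n + 1) // 2)
    return 0.5 * (order_statistic(xs, n // 2) + order_statistic(xs, n // 2 + 1))

def interquartile_range(xs):
    n = len(xs)
    k = round_half_up(n / 4)
    if k < 1:
        raise ValueError("sample too small for this IQR definition")
    return order_statistic(xs, n + 1 - k) - order_statistic(xs, k)

data = [7, 1, 15, 4, 10, 3, 8, 6]  # sorted: 1, 3, 4, 6, 7, 8, 10, 15
print(sample_range(data), sample_median(data), interquartile_range(data))
# 14 6.5 7
```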

Sampling Distribution of Order Statistic

Lemma: Given a continuous random vector $(X_1, \cdots, X_n) \sim f(\mathbf{x})$, the random vector in ascending order $$(X_{(1)}, \cdots, X_{(n)}) \sim \sum_{\pi \in S_n} f\left( \pi^{-1}(x_{(1)}, \cdots, x_{(n)}) \right) 1_{\{x_{(1)} \le \cdots \le x_{(n)}\}}$$ Here $S_n$, the symmetric group of degree $n$, is the collection of all possible permutations of $(1,\cdots,n)$.

Theorem: If a random sample is drawn from a continuous distribution with density $f(x)$ and cumulative distribution function $F(x)$, then for $r, i, j \in \{1, \cdots, n\}$ with $i < j$, its order statistics satisfy $$\begin{aligned} X_{(r)} &\sim \binom{n}{r-1, 1, n-r} \left(F(x)\right)^{r-1} f(x) \left(1-F(x)\right)^{n-r} \\ (X_{(i)}, X_{(j)}) &\sim \binom{n}{i-1, 1, j-i-1, 1, n-j} \left(F(u)\right)^{i-1} f(u) \left(F(v)-F(u)\right)^{j-i-1} f(v) \left(1-F(v)\right)^{n-j} 1_{\{u \le v\}} \end{aligned}$$ Additionally, the cumulative distribution function of $X_{(r)}$ is $F_{X_{(r)}}(x) = \sum_{k=r}^{n} \binom{n}{k} \left(F(x)\right)^k \left(1-F(x)\right)^{n-k}$, the probability that at least $r$ of the $n$ sample values are at most $x$.
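
The closed-form cumulative distribution function of $X_{(r)}$ can be compared with a simulation. The sketch below evaluates the binomial sum and estimates the same probability empirically for an Exponential(1) population, where $F(x) = 1 - e^{-x}$. The population, the values of $x$, $n$, $r$, the seed, and the function names are assumptions for illustration.

```python
import math
import random

def order_stat_cdf(F_x, n, r):
    """P(X_(r) <= x) given F_x = F(x): at least r of n values are at most x."""
    return sum(math.comb(n, k) * F_x ** k * (1 - F_x) ** (n - k)
               for k in range(r, n + 1))

def simulated_order_stat_cdf(x, n, r, reps=20000, seed=0):
    """Empirical P(X_(r) <= x) for samples of size n from Exponential(1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xs = sorted(rng.expovariate(1.0) for _ in range(n))
        if xs[r - 1] <= x:
            hits += 1
    return hits / reps

x, n, r = 0.5, 5, 2
F_x = 1 - math.exp(-x)  # CDF of Exponential(1) at x
# The formula and the simulation should agree up to Monte Carlo error.
print(order_stat_cdf(F_x, n, r), simulated_order_stat_cdf(x, n, r))
```

Two special cases give a quick check of the formula: $r = n$ reduces to $F(x)^n$ (the maximum) and $r = 1$ reduces to $1 - (1 - F(x))^n$ (the minimum).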

Uniform order statistics are the order statistics of a sample from the uniform distribution on $[0, 1]$. Their joint sampling distribution has constant density $n!$ on the hyper-tetrahedron $\{0 \le x_{(1)} \le \cdots \le x_{(n)} \le 1\}$, and the marginals are beta distributions: $$U_{(r)} \sim B(r,n+1-r)$$
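
The beta marginal can be checked through its first two moments: $B(a, b)$ has mean $\frac{a}{a+b}$ and variance $\frac{ab}{(a+b)^2 (a+b+1)}$. The sketch below simulates $U_{(r)}$ and compares the empirical mean and variance with these values for $a = r$, $b = n+1-r$. The values of $n$ and $r$, the seed, and the function names are illustrative assumptions.

```python
import random

def simulate_uniform_order_stat(n=7, r=3, reps=20000, seed=0):
    """Simulated values of U_(r) from uniform samples of size n."""
    rng = random.Random(seed)
    return [sorted(rng.random() for _ in range(n))[r - 1] for _ in range(reps)]

def beta_mean_var(a, b):
    """Mean and variance of the Beta(a, b) distribution."""
    return a / (a + b), a * b / ((a + b) ** 2 * (a + b + 1))

def mean_var(values):
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / (len(values) - 1)
    return m, v

n, r = 7, 3
print(mean_var(simulate_uniform_order_stat(n, r)))  # empirical
print(beta_mean_var(r, n + 1 - r))                  # theoretical, Beta(3, 5)
```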

