Statistics

Statistics has two aspects: algorithms and inference.

Statistical inference is a set of methods that derive probabilistic conclusions from finite data. The classical schools of inference are frequentist, Bayesian, and Fisherian.

Principles of Statistical Inference: sufficiency principle, conditionality principle, likelihood principle.

Core Concepts

The fundamental construct in probability is the random variable; the fundamental construct in statistics is the random sample.

A random sample is drawn by a sampling process from a hypothetical population. Traditional statistics assumes "large $n$, small $p$" (where $n$ is the number of observations and $p$ is the number of parameters measured), whereas in modern statistics the typical problem is "small $n$, large $p$".

A model in statistics is a probability distribution of one or more variables, e.g., univariate models and regression models.

Parametric and nonparametric methods do not differ in essence, and neither is inherently superior: both are collections of models that take random samples as the sole input for estimation (in the frequentist setting). Parametric methods are algorithms that select one model from a subspace of probabilistic models indexed by model parameters. Nonparametric methods are algorithms that select one model from another subspace of probabilistic models, only without such an index. Generally, nonparametric methods are non-mechanistic methods, which are statistical in essence.

Estimation

Point Estimation: methods of finding and evaluating estimators, UMVU estimators

Interval Estimation: confidence interval, tolerance interval

Regression: Least-squares, lasso, ridge
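
As a rough sketch of how the three regression estimators relate, each can be written as a (possibly penalized) least-squares problem. The notation is an assumption introduced here for illustration: $y \in \mathbb{R}^n$ is the response vector, $X \in \mathbb{R}^{n \times p}$ is the design matrix, $\beta \in \mathbb{R}^p$ is the coefficient vector, and $\lambda \ge 0$ is a tuning parameter.

$$
\begin{aligned}
\hat{\beta}_{\text{LS}} &= \arg\min_{\beta} \|y - X\beta\|_2^2, \\
\hat{\beta}_{\text{ridge}} &= \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2, \\
\hat{\beta}_{\text{lasso}} &= \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1.
\end{aligned}
$$

Setting $\lambda = 0$ in either penalized form recovers ordinary least squares.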

Correlation

The Pearson correlation coefficient, also called the Pearson product-moment correlation coefficient, Pearson's rho (per Karl Pearson, 1895), or simply the correlation coefficient, is a measure of linear correlation. It is defined as the ratio between the covariance of two nondegenerate real random variables and the product of their standard deviations: $\rho = \text{cov}(X, Y) / (\sigma_X \sigma_Y)$, or in terms of moments, $\rho = \frac{\mathbb{E}(XY) - \mathbb{E}(X)\mathbb{E}(Y)} {\sqrt{\mathbb{E}(X^2) - \mathbb{E}(X)^2} \sqrt{\mathbb{E}(Y^2) - \mathbb{E}(Y)^2}}$. For a random sample, an estimator of the correlation coefficient is obtained by replacing all expectations with sample means: $r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})} {\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}$. It takes values in $[-1, 1]$ and assesses linear relationships: it equals 1 if and only if the two variables are positively affinely dependent (i.e., $Y - \mu_Y = k (X - \mu_X)$ with $k > 0$); it equals -1 if and only if they are negatively affinely dependent (i.e., $k < 0$). A value of zero means that the two variables are linearly uncorrelated, although they may still have a meaningful nonlinear relationship.
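
The sample estimator $r$ can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation; the function name `pearson_r` and the example data are assumptions made up for this sketch.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length n >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Centered cross-products and sums of squares: the numerator and
    # the two factors under the square roots in the formula for r.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined for a constant (degenerate) sample")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: an increasing affine relation y = 2x + 1.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
r_affine = pearson_r(x, [2 * v + 1 for v in x])

# Hypothetical data: y = x^2 on points symmetric about zero.
r_parabola = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])
```

With these inputs, `r_affine` comes out as 1 and `r_parabola` as 0, even though in the second case $y$ is completely determined by $x$. That is the "linearly uncorrelated but nonlinearly related" situation.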

Rank correlation is the statistical dependence between the rankings of two random samples. The rank statistics $(R_1, \cdots, R_n)$ of a random sample $(X_1, \cdots, X_n)$ are the ranks (i.e., a weak order in order theory) of the observations in ascending order, defined as follows: let $(X_{(1)}, \cdots, X_{(n)})$ be the order statistics of the sample, and let $R$ be a permutation of $\{1, \cdots, n\}$ such that $X_i = X_{(R_i)}$. (1) If all values are unique, the permutation is unique. (2) If there are duplicate values, replace each rank with the average rank of its ties, $R_i \gets \text{avg} \{R_j : X_i = X_{(R_j)}\}$, which is unique. The rank statistics are then positive integers or half integers (i.e., real numbers that end with .5). This tie-breaking mechanism is called fractional ranking (e.g., 1 2.5 2.5 4), and it does not change the sum of all ranks. Other methods may also be used. For example, replacing the ranks with the minimum rank of ties, $R_i \gets \min \{R_j : X_i = X_{(R_j)}\}$, gives the standard competition ranking (e.g., 1 2 2 4).
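
The two tie-handling rules can be sketched as follows. The function names `fractional_ranks` and `competition_ranks` and the sample values are hypothetical, chosen only to reproduce the "1 2.5 2.5 4" and "1 2 2 4" patterns.

```python
def fractional_ranks(values):
    """Ascending 1-based ranks; tied values share the average of their positions."""
    n = len(values)
    # Indices of the observations, sorted by ascending value.
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        # Extend [start, end] over the run of equal values.
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Mean of the 1-based positions start+1 .. end+1.
        avg = (start + 1 + end + 1) / 2
        for k in range(start, end + 1):
            ranks[order[k]] = avg
        start = end + 1
    return ranks

def competition_ranks(values):
    """Standard competition ranking: tied values share their minimum position."""
    return [1 + sum(1 for w in values if w < v) for v in values]

sample = [10, 20, 20, 30]          # hypothetical sample with one tie
frac = fractional_ranks(sample)    # average rank for the tied pair
comp = competition_ranks(sample)   # minimum rank for the tied pair
```

Here `frac` is `[1.0, 2.5, 2.5, 4.0]`, whose sum equals $n(n+1)/2 = 10$, while `comp` is `[1, 2, 2, 4]`, whose sum is smaller.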

The Spearman rank correlation coefficient $r_s$ or Spearman's rho $\rho$ (per Charles Spearman, 1904) is a measure of rank correlation, defined as the Pearson correlation coefficient between the rank statistics of (a joint sample of) two random variables: $r_s = r(R(X), R(Y))$. The random variables need not be real valued; each may take values from a totally ordered set. Spearman's $\rho$ can be defined for random variables in the large-sample limit: $\rho := \lim_{n \to \infty} r_s$. It takes values in $[-1, 1]$ and assesses monotonic relationships: it equals 1 (or -1) if and only if the two variables are related by a strictly increasing (or strictly decreasing) function. Compared with Pearson's rho, Spearman's rho can detect nonlinear monotonic dependence, and it is less sensitive to outliers in the tails of both samples because the outliers are converted to their ranks.
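
A minimal sketch of $r_s = r(R(X), R(Y))$, assuming fractional ranking for ties. The helper names and the cubic example are illustrative assumptions; the ranking helper is a simple quadratic-time version chosen for brevity, not efficiency.

```python
import math

def fractional_ranks(values):
    """Ascending 1-based ranks, with ties replaced by their average rank."""
    ranks = []
    for v in values:
        less = sum(1 for w in values if w < v)
        equal = sum(1 for w in values if w == v)
        # Tied values occupy positions less+1 .. less+equal; take their mean.
        ranks.append(less + (equal + 1) / 2)
    return ranks

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def spearman_r(xs, ys):
    """Spearman's r_s: Pearson's r applied to the rank statistics."""
    return pearson_r(fractional_ranks(xs), fractional_ranks(ys))

# Hypothetical data: y = x^3 is strictly increasing but not affine in x.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
r_s = spearman_r(x, y)   # the two rank vectors coincide
r = pearson_r(x, y)      # strictly below 1 because the relation is not affine
```

In this example `r_s` equals 1 while `r` is below 1, which illustrates that Spearman's coefficient responds to monotonicity rather than linearity.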

The Kendall rank correlation coefficient or Kendall's tau $\tau$ (per Maurice Kendall, 1938) is a measure of ordinal association (which is more general than rank correlation), defined as the average pairwise concordance of a joint sample of two random variables: $\tau = \binom{n}{2}^{-1} \sum_{i < j} s(x_i, x_j) s(y_i, y_j)$, where $s(x, x')$ is the sign of the pair $(x, x')$, which equals 1 if $x < x'$, -1 if $x > x'$, and 0 if $x = x'$. It takes values in $[-1, 1]$.
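
The pairwise definition translates almost directly into code. The sketch below assumes that a tied pair contributes a sign of 0 and that the sum is divided by the full count $\binom{n}{2}$; the function names and the example data are hypothetical.

```python
from itertools import combinations

def sign(a, b):
    """s(a, b): 1 if a < b, -1 if a > b, 0 if the pair is tied (assumed convention)."""
    return (a < b) - (a > b)

def kendall_tau(xs, ys):
    """Average pairwise concordance over all pairs i < j."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length n >= 2")
    total = sum(
        sign(xs[i], xs[j]) * sign(ys[i], ys[j])
        for i, j in combinations(range(n), 2)
    )
    return total / (n * (n - 1) / 2)   # divide by C(n, 2)

tau_same = kendall_tau([1, 2, 3, 4], [10, 20, 30, 40])  # all pairs concordant
tau_rev = kendall_tau([1, 2, 3, 4], [4, 3, 2, 1])       # all pairs discordant
tau_mixed = kendall_tau([1, 2, 3, 4], [1, 3, 2, 4])     # one discordant pair of six
```

In the last case five pairs are concordant and one is discordant, so the sum is $5 - 1 = 4$ and $\tau = 4/6 = 2/3$.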

Hypothesis Testing

Likelihood Ratio Test (LRT), Uniformly Most Powerful (UMP) Test

False discovery rate (FDR)

Miscellaneous Topics

Asymptotic Analysis:

Statistical learning is the attempt to explain techniques of learning from data in a statistical framework.

prediction, explanation

Before Fisher, statisticians didn’t really understand estimation. The same can be said now about prediction. [@CASI2017]

Reference

Notes on Intuitive Biostatistics [@Motulsky1995]

Table 1: Statistical Techniques

Purpose	Continuous Data	Count or Ranked Data	Arrival Time	Binary Data
(Examples)	(Height)	(Number of headaches in a week; Self-report score)	(Life expectancy of a patient; Minutes until REM sleep begins)	(Recurrence of infection)
Describe one sample	Frequency distribution; Sample mean; Quantiles; Sample standard deviation	Frequency distribution; Quantiles	Kaplan-Meier survival curve; Median survival time; Five-year survival percentage	Proportion
Distributional Test	Normality tests; Outlier tests	N/A	N/A	N/A
Infer about one population	One-sample t test	Wilcoxon’s signed-rank test	Confidence bands around survival curve; CI of median survival	CI of proportion; Binomial test to compare observed distribution with a theoretical (expected) distribution
Compare two unpaired groups	Unpaired t test	Mann-Whitney test	Log-rank test; Gehan-Breslow test; CI of ratio of median survival times; CI of hazard ratio	Fisher’s exact test
Compare two paired groups	Paired t test	Wilcoxon’s matched-pairs test	Conditional proportional hazards regression	McNemar’s test
Compare three or more unpaired groups	One-way ANOVA followed by multiple comparison tests	Kruskal-Wallis test; Dunn’s posttest	Log-rank test; Gehan-Breslow test	Chi-squared test (for trend)
Compare three or more paired groups	Repeated-measures ANOVA followed by multiple comparison tests	Friedman’s test; Dunn’s posttest	Conditional proportional hazards regression	Cochran’s Q
Quantify association between two variables	Pearson’s correlation	Spearman’s correlation	N/A	N/A
Predict one variable from one or several others	Linear/nonlinear regression	N/A	Cox’s proportional hazards regression	Logistic regression
