Statistics has two aspects---algorithms and inference.
Statistical inference is a set of methods for deriving probabilistic conclusions from finite data. The three classical inference methodologies are frequentist, Bayesian, and Fisherian.
Principles of Statistical Inference: Sufficiency principle, Conditionality principle, Likelihood principle.
The fundamental construct in probability is the random variable; the fundamental construct in statistics is the random sample.
A random sample is the outcome of a sampling process from a hypothetical population. Traditional statistics assumes "large $n$, small $p$" ($n$ for observations, $p$ for parameters measured), while in modern statistics the typical problem is "small $n$, large $p$".
In statistics, a model is a probability distribution of one or more variables; examples include univariate models and regression models.
Parametric and nonparametric methods have no essential difference or comparative superiority: both are collections of models, and both take a random sample as the sole input for estimation (in the frequentist view). Parametric methods are algorithms that select one model from a subspace of probability models indexed by model parameters; nonparametric methods select from another subspace of probability models, only without such an index. Generally, nonparametric methods are non-mechanistic methods, which are statistical in essence.
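To make the contrast concrete, here is a minimal sketch, assuming NumPy and SciPy are available: the parametric route selects a member of the Gaussian family indexed by $(\mu, \sigma)$ via maximum likelihood, while the nonparametric route (a kernel density estimate) selects a density with no finite-dimensional index. Both consume nothing but the random sample. The simulated data are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=2.0, scale=1.5, size=200)  # the random sample: sole input

# Parametric: select from the Gaussian subfamily, indexed by (mu, sigma).
mu_hat, sigma_hat = stats.norm.fit(sample)         # maximum-likelihood estimates

# Nonparametric: select a density with no finite-dimensional index.
kde = stats.gaussian_kde(sample)                   # kernel density estimate

x = np.linspace(sample.min(), sample.max(), 5)
print("parametric density   :", stats.norm.pdf(x, mu_hat, sigma_hat))
print("nonparametric density:", kde(x))
```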
Point Estimation: methods of finding and evaluating estimators; UMVU estimators (see the first sketch after this list).
Interval Estimation: confidence interval, tolerance interval (also in the first sketch below).
Regression: least squares, lasso, ridge (second sketch below).
Likelihood Ratio Test (LRT), Uniformly Most Powerful (UMP) test (third sketch below).
False Discovery Rate (FDR) (fourth sketch below).
Asymptotic Analysis.
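A minimal sketch of point and interval estimation, assuming NumPy and SciPy: the sample mean is the point estimate (UMVU for a normal mean), and a two-sided 95% $t$ interval is the interval estimate. The simulated data and the 95% level are illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=10.0, scale=2.0, size=30)   # illustrative sample

n = x.size
mean = x.mean()                                # point estimate (UMVU for a normal mean)
se = x.std(ddof=1) / np.sqrt(n)                # standard error of the mean
t_crit = stats.t.ppf(0.975, df=n - 1)          # two-sided 95% critical value
print(f"point estimate: {mean:.3f}")
print(f"95% CI: ({mean - t_crit * se:.3f}, {mean + t_crit * se:.3f})")
```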
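For the regression entry, a sketch contrasting least squares with its penalized variants, assuming scikit-learn is installed; the penalty strengths `alpha` are arbitrary illustrative values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))
beta = np.array([3.0, -2.0] + [0.0] * 8)       # sparse true coefficients
y = X @ beta + rng.normal(scale=0.5, size=100)

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 2))
# Ridge shrinks all coefficients toward zero; lasso sets some exactly to zero.
```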
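For the LRT entry, a sketch assuming SciPy: testing $H_0\colon \mu = 0$ for normal data with unknown variance, using Wilks' theorem ($2\log\Lambda \sim \chi^2_1$ asymptotically under $H_0$). The data are simulated for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(loc=0.4, scale=1.0, size=50)    # illustrative sample

def profile_loglik(x, mu):
    """Normal log-likelihood at mu, with sigma^2 profiled out by its MLE."""
    sigma2 = np.mean((x - mu) ** 2)
    return np.sum(stats.norm.logpdf(x, loc=mu, scale=np.sqrt(sigma2)))

# H0: mu = 0 versus H1: mu unrestricted (MLE of mu is the sample mean)
lrt = 2 * (profile_loglik(x, x.mean()) - profile_loglik(x, 0.0))
p_value = stats.chi2.sf(lrt, df=1)             # Wilks: chi^2 with 1 df under H0
print(f"LRT statistic = {lrt:.3f}, p = {p_value:.4f}")
```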
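For the FDR entry, a self-contained sketch of the Benjamini-Hochberg step-up procedure in NumPy; the p-values are made up for illustration.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Boolean mask of rejections controlling the FDR at level q."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m   # q * i / m for i = 1..m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()         # largest i with p_(i) <= q*i/m
        reject[order[: k + 1]] = True
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.310, 0.900]
print(benjamini_hochberg(pvals, q=0.05))       # rejects the two smallest
```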
Statistical learning is the attempt to explain techniques of learning from data in a statistical framework.
Its two goals: prediction and explanation.
Before Fisher, statisticians didn’t really understand estimation. The same can be said now about prediction. [@CASI2017]
Notes on Intuitive Biostatistics [@Motulsky1995]
Table 1: Statistical Techniques
Purpose | Continuous Data | Count or Ranked Data | Survival Time | Binary Data |
---|---|---|---|---|
(Examples) | (Height) | (Number of headaches in a week; Self-report score) | (Life expectancy of a patient; Minutes until REM sleep begins) | (Recurrence of infection)
Describe one sample | Frequency distribution; Sample mean; Quantiles; Sample standard deviation | Frequency distribution; Quantiles | Kaplan-Meier survival curve; Median survival time; Five-year survival percentage | Proportion
Distributional Test | Normality tests; Outlier tests | N/A | N/A | N/A |
Infer about one population | One-sample t test | Wilcoxon signed-rank test | Confidence bands around survival curve; CI of median survival | CI of proportion; Binomial test to compare observed distribution with a theoretical (expected) distribution
Compare two unpaired groups | Unpaired t test | Mann-Whitney test | Log-rank test; Gehan-Breslow test; CI of ratio of median survival times; CI of hazard ratio | Fisher’s exact test (chi-squared test for large samples)
Compare two paired groups | Paired t test | Wilcoxon’s matched-pairs test | Conditional proportional hazards regression | McNemar’s test
Compare three or more unpaired groups | One-way ANOVA followed by multiple comparison tests | Kruskal-Wallis test; Dunn’s posttest | Log-rank test; Gehan-Breslow test | Chi-squared test; Chi-squared test for trend
Compare three or more paired groups | Repeated-measures ANOVA followed by multiple comparison tests | Friedman’s test; Dunn’s posttest | Conditional proportional hazards regression | Cochran’s Q |
Quantify association between two variables | Pearson’s correlation | Spearman’s correlation | N/A | N/A |
Predict one variable from one or several others | Linear or nonlinear regression | N/A | Cox’s proportional hazards regression | Logistic regression
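As a small demonstration of the “Compare two unpaired groups” row, assuming SciPy: the same two simulated samples run through the unpaired t test (continuous data) and the Mann-Whitney test (rank-based).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
treated = rng.normal(loc=5.5, scale=1.0, size=25)   # illustrative group 1
control = rng.normal(loc=5.0, scale=1.0, size=25)   # illustrative group 2

t_res = stats.ttest_ind(treated, control)           # unpaired t test
u_res = stats.mannwhitneyu(treated, control)        # Mann-Whitney (rank-based)
print(f"t test:       t = {t_res.statistic:.3f}, p = {t_res.pvalue:.4f}")
print(f"Mann-Whitney: U = {u_res.statistic:.1f}, p = {u_res.pvalue:.4f}")
```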