Statistics has two aspects: algorithms and inference.
Statistical inference is a system of mathematical logic for the guidance and correction (or justification) of statistical algorithms. The classical inference methodologies are frequentist, Bayesian, and Fisherian.
Principles of statistical inference: the sufficiency principle, the conditionality principle, and the likelihood principle.
The fundamental construct in probability is the random variable; the fundamental construct in statistics is the random sample.
In statistics, a model is a probability distribution of one or more variables; examples include univariate models and regression models.
Parametric and nonparametric methods are not essentially different, and neither is superior to the other: both are collections of models, and both take a random sample as the sole input for estimation (in the frequentist view). Parametric methods are algorithms that select a unique probability model from a subspace of probability models indexed by a model parameter. Nonparametric methods are algorithms that select a unique probability model from another subspace of probability models, one without such an index. Generally, nonparametric methods are non-mechanistic methods, which are statistical in essence.
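To make the parallel concrete, here is a minimal sketch (assuming NumPy and SciPy; the simulated data are purely illustrative) that estimates one distribution both ways: the parametric fit selects a member of the normal family indexed by \((\mu, \sigma)\), while the nonparametric kernel density estimate selects a model with no such finite-dimensional index.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=2.0, scale=1.5, size=200)  # the random sample: sole input

# Parametric: select from the normal family, indexed by (mu, sigma).
mu_hat, sigma_hat = stats.norm.fit(sample)         # maximum likelihood estimates
parametric = stats.norm(mu_hat, sigma_hat)

# Nonparametric: Gaussian kernel density estimate, no finite-dimensional index.
kde = stats.gaussian_kde(sample)

x = np.linspace(sample.min(), sample.max(), 5)
print("parametric density:   ", parametric.pdf(x))
print("nonparametric density:", kde(x))
```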
A random sample is the outcome of a sampling process from a hypothetical population. Traditional statistics assumes "large n, small p" (\(n\) for the number of observations, \(p\) for the number of parameters measured), while in modern statistics the problem is typically "small n, large p".
Point estimation: methods of finding and evaluating estimators; UMVU estimators
Interval estimation: confidence interval, tolerance interval
Regression: least squares, lasso, ridge (sketched after this list)
Hypothesis testing: likelihood ratio test (LRT), uniformly most powerful (UMP) test
Multiple testing: false discovery rate (FDR) (sketched after this list)
Asymptotic analysis
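For the regression entry above, a minimal sketch (assuming scikit-learn is available; the simulated design and penalty values are made up for illustration) contrasting least squares with the two shrinkage estimators: ridge shrinks all coefficients toward zero, while the lasso can set some exactly to zero.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(1)
n, p = 50, 10                        # a "large n, small p" setting
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.0]          # sparse true coefficients
y = X @ beta + rng.normal(scale=0.5, size=n)

for name, model in [("least squares", LinearRegression()),
                    ("ridge", Ridge(alpha=1.0)),
                    ("lasso", Lasso(alpha=0.1))]:
    coef = model.fit(X, y).coef_
    print(f"{name:>13}: {np.round(coef, 2)}")
```

For the FDR entry, a sketch of the Benjamini-Hochberg step-up procedure in plain NumPy (the p-values are hypothetical): with \(m\) tests, find the largest \(k\) such that \(p_{(k)} \le \frac{k}{m} q\) and reject the hypotheses with the \(k\) smallest p-values.

```python
import numpy as np

def benjamini_hochberg(pvalues, q=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level q."""
    p = np.asarray(pvalues)
    m = p.size
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m  # (k/m) * q for sorted p-values
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = below.nonzero()[0].max()          # largest k with p_(k) <= (k/m) * q
        reject[order[: k + 1]] = True         # reject the k smallest p-values
    return reject

# Hypothetical p-values from m = 8 simultaneous tests.
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.3, 0.9]))
```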
Statistical learning attempts to explain techniques for learning from data within a statistical framework. Two aims: prediction and explanation.
Before Fisher, statisticians didn’t really understand estimation. The same can be said now about prediction. {CASI2017}
Notes on Intuitive Biostatistics {Motulsky1995}
Table 1: Statistical Techniques
Purpose | Continuous Data | Count or Ranked Data | Survival Time | Binary Data |
---|---|---|---|---|
(Examples) | (Height) | (Number of headaches in a week; self-report score) | (Life expectancy of a patient; minutes until REM sleep begins) | (Recurrence of infection) |
Describe one sample | Frequency distribution; sample mean; quantiles; sample standard deviation | Frequency distribution; quantiles | Kaplan-Meier survival curve; median survival time; five-year survival percentage | Proportion |
Distributional test | Normality tests; outlier tests | N/A | N/A | N/A |
Infer about one population | One-sample t test | Wilcoxon’s signed-rank test | Confidence bands around survival curve; CI of median survival | CI of proportion; binomial test to compare the observed distribution with a theoretical (expected) distribution |
Compare two unpaired groups | Unpaired t test | Mann-Whitney test | Log-rank test; Gehan-Breslow test; CI of ratio of median survival times; CI of hazard ratio | Fisher’s exact test |
Compare two paired groups | Paired t test | Wilcoxon’s matched-pairs signed-rank test | Conditional proportional hazards regression | McNemar’s test |
Compare three or more unpaired groups | One-way ANOVA followed by multiple comparison tests | Kruskal-Wallis test; Dunn’s posttest | Log-rank test; Gehan-Breslow test | Chi-squared test (chi-squared test for trend if groups are ordered) |
Compare three or more paired groups | Repeated-measures ANOVA followed by multiple comparison tests | Friedman’s test; Dunn’s posttest | Conditional proportional hazards regression | Cochran’s Q |
Quantify association between two variables | Pearson’s correlation | Spearman’s correlation | N/A | N/A |
Predict one variable from one or several others | Linear or nonlinear regression | N/A | Cox’s proportional hazards regression | Logistic regression |
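As a quick illustration of the "compare two unpaired groups" row of Table 1, a minimal sketch (assuming SciPy; all data are invented): the unpaired t test for continuous data, the Mann-Whitney test for count or ranked data, and Fisher’s exact test for binary data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
group_a = rng.normal(loc=170, scale=8, size=30)  # e.g. heights in cm
group_b = rng.normal(loc=175, scale=8, size=30)

# Continuous data: unpaired (two-sample) t test.
print(stats.ttest_ind(group_a, group_b))

# Count or ranked data: Mann-Whitney test on the same samples.
print(stats.mannwhitneyu(group_a, group_b))

# Binary data: Fisher's exact test on a 2x2 table
# (rows: treatment/control; columns: infection recurred / did not recur).
table = [[7, 23], [15, 15]]
print(stats.fisher_exact(table))
```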