The original probability space is not directly amenable to analysis. By mapping the sample space to a mathematical structure suited to classical deterministic analysis, an analytic approach to probability is naturally established. This is the motivation for random variables.

When extending a deterministic variable to a stochastic one, the *first-order uncertainty* is the variance, not the expectation.

A **measurable function** is a function such that the preimage of every measurable set in the codomain is measurable in the domain:

Given measurable spaces $(X, \Sigma_X)$ and $(Y, \Sigma_Y)$, a function $f: X \to Y$ is $(\Sigma_X, \Sigma_Y)$-measurable if $\forall E \in \Sigma_Y, f^{-1}(E) \in \Sigma_X$.
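On finite measurable spaces the definition can be checked exhaustively. A minimal Python sketch, with all names (`preimage`, `is_measurable`, the example sigma-algebras) hypothetical:

```python
def preimage(f, E, domain):
    """Preimage f^{-1}(E) as a frozenset of domain points."""
    return frozenset(x for x in domain if f[x] in E)

def is_measurable(f, domain, sigma_X, sigma_Y):
    """Check: the preimage of every E in Sigma_Y lies in Sigma_X."""
    return all(preimage(f, E, domain) in sigma_X for E in sigma_Y)

# Domain {1,2,3,4} with the coarse sigma-algebra whose atoms are {1,2}, {3,4}.
domain = {1, 2, 3, 4}
sigma_X = {frozenset(), frozenset({1, 2}), frozenset({3, 4}), frozenset(domain)}

# Codomain {0,1} with its power set.
sigma_Y = {frozenset(), frozenset({0}), frozenset({1}), frozenset({0, 1})}

f = {1: 0, 2: 0, 3: 1, 4: 1}   # constant on the atoms -> measurable
g = {1: 0, 2: 1, 3: 1, 4: 1}   # splits the atom {1,2} -> not measurable

print(is_measurable(f, domain, sigma_X, sigma_Y))  # True
print(is_measurable(g, domain, sigma_X, sigma_Y))  # False
```

The function `g` fails because $g^{-1}(\{0\}) = \{1\}$ is not a member of $\Sigma_X$: measurability is a property of the sigma-algebras, not of the function's formula.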

A **random variable** is a measurable function from a probability space to a measurable space:

Given a probability space $(\Omega, \Sigma, P)$ and a measurable space $(F, \Sigma_F)$, a random variable $X: \Omega \to F$ is a $(\Sigma, \Sigma_F)$-measurable function.

In the most common case, a random variable is a real-valued function measurable with respect to the Borel sigma-algebra: $X: (\Omega, \Sigma) \to (\mathbb{R}, \mathcal{B})$. The Lebesgue measurable space $(\mathbb{R}, \mathcal{L})$ may also be used.

The collection of inverse images of all measurable sets in the range is called the **sigma-algebra induced on the sample space by the random variable**, denoted $\Sigma_X$.

As a measurable function, a random variable is a special type of function. Although in most cases its range is the real line or some Banach/Hilbert space, its domain, the sample space, need not carry any structure beyond being a measure space.

A random variable induces a probability measure on its range measurable space, called the **distribution** of the random variable:

$$\mu(B) \equiv P(X^{-1}(B)), \forall B \in \mathcal{B}$$
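For a finite probability space the pushforward measure can be computed directly. A sketch with a fair die (the helper `distribution` is hypothetical):

```python
from fractions import Fraction

# Probability space: a fair die, Omega = {1,...,6}, P uniform.
P = {w: Fraction(1, 6) for w in range(1, 7)}

# Random variable X: Omega -> R, the parity of the face.
X = {w: w % 2 for w in range(1, 7)}

def distribution(B):
    """Pushforward measure mu(B) = P(X^{-1}(B))."""
    return sum(P[w] for w in P if X[w] in B)

print(distribution({0}))      # 1/2  (the three even faces)
print(distribution({0, 1}))   # 1    (the whole range)
```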

The Lebesgue decomposition of the distribution with respect to the Lebesgue measure $\lambda$ is: {Radon–Nikodym Theorem} $$\mu(A) = \int_{A} f(x) \mathrm{d}\lambda + \mu^s (A)$$ where $\mu^s$ is singular with respect to $\lambda$.

When $\mu^s =0$, the distribution $\mu$ is said to be **absolutely continuous**.

The distribution of a random variable can be conveniently characterized by the **cumulative distribution function** (CDF):

$$F_{X} (x) = \mu \{ ( -\infty, x ] \}$$

The cumulative distribution function always exists, and it can be shown to determine the distribution of the random variable uniquely when the range sigma-algebra is Borel.
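A sketch of the CDF on a finite space, again with a fair die (the helper `cdf` is hypothetical):

```python
from fractions import Fraction

P = {w: Fraction(1, 6) for w in range(1, 7)}  # fair die
X = {w: w for w in P}                          # identity random variable

def cdf(x):
    """F_X(x) = mu((-inf, x]) = P(X <= x)."""
    return sum(P[w] for w in P if X[w] <= x)

print(cdf(0))    # 0
print(cdf(3))    # 1/2
print(cdf(3.5))  # 1/2  (flat between the atoms)
print(cdf(6))    # 1
```

Note the staircase shape: a discrete distribution has a perfectly good CDF even though it has no density.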

The **probability density function** (PDF) is the derivative of the cumulative distribution function, when that derivative exists:

$$f_{X} (x) = \frac{\mathrm{d}}{\mathrm{d} x} F_{X} (x)$$

The **probability mass function** (PMF) of a discrete random variable gives the probability that the random variable equals a given value:

$$f_X (x) = P(X^{-1}(\{x\})), \quad \forall x \in \mathbb{R}$$
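A sketch of the PMF for the sum of two fair dice (the helper `pmf` is hypothetical):

```python
from fractions import Fraction

# Sum of two fair dice: Omega = {1..6}^2 with uniform P.
P = {(a, b): Fraction(1, 36) for a in range(1, 7) for b in range(1, 7)}
X = {w: sum(w) for w in P}

def pmf(x):
    """f_X(x) = P(X^{-1}({x})), i.e. P(X = x)."""
    return sum(P[w] for w in P if X[w] == x)

print(pmf(7))   # 1/6   (six outcomes sum to 7)
print(pmf(2))   # 1/36  (only (1,1))
print(pmf(13))  # 0     (outside the range of X)
```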

The **expectation** of a random variable is its Lebesgue integral with respect to the probability measure on the sample space:

$$\mathbb{E}X = \int_{\Omega} X \mathrm{d}P$$

We choose Lebesgue integral for two reasons:

- To ensure closure of the various function spaces, in particular Banach and Hilbert spaces; this fails under the Riemann integral.
- Lebesgue integration provides a uniform notion of integral whether the random variable is discrete or continuous.
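On a finite sample space every random variable is a simple function, so the Lebesgue integral reduces to a weighted sum over outcomes. A sketch (fair die, names hypothetical):

```python
from fractions import Fraction

# E[X] = sum over omega of X(omega) * P({omega}) -- the Lebesgue
# integral of a simple function against the probability measure.
P = {w: Fraction(1, 6) for w in range(1, 7)}  # fair die
X = {w: w for w in P}                          # identity random variable

E = sum(X[w] * P[w] for w in P)
print(E)  # 7/2
```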

The change of variables theorem converts the Lebesgue integral over the probability space into an integral over the real line with respect to the induced measure (equivalently, a Lebesgue–Stieltjes integral against the CDF):

**Change of variables theorem**:

$$\int_{\Omega} X \mathrm{d}P = \int_{\mathbb{R}} x \mathrm{d} \mu$$
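Both sides of the theorem can be evaluated independently on a finite space. A sketch with the sum of two dice (names hypothetical):

```python
from fractions import Fraction

P = {(a, b): Fraction(1, 36) for a in range(1, 7) for b in range(1, 7)}
X = {w: sum(w) for w in P}           # sum of two fair dice

# Left-hand side: integrate X over the sample space against P.
lhs = sum(X[w] * P[w] for w in P)

# Right-hand side: integrate the identity over R against the pushforward mu.
mu = {}
for w in P:
    mu[X[w]] = mu.get(X[w], Fraction(0)) + P[w]
rhs = sum(x * mu[x] for x in mu)

print(lhs, rhs)  # 7 7
```

The point of the theorem is practical: the right-hand side only needs the distribution $\mu$, so the (possibly unstructured) sample space never has to be touched again.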

The **characteristic function** of a random variable $X$ with measure $\mu$ is

$$\begin{aligned} \varphi_X (t) &\equiv \mathbb{E} e^{itX} = \int_{\mathbb{R}} e^{itx} \mathrm{d} \mu && \text{(scalar form)} \\ \Phi_{\mathbf{X}}(\mathbf{w}) &\equiv \mathbb{E}e^{i \mathbf{w}^T \mathbf{X}} = \mathcal{F} f_{\mathbf{X}}(\mathbf{x}) && \text{(vector form)} \end{aligned}$$

The characteristic function can be thought of as the Fourier transform of the PDF. But unlike the PDF, the characteristic function of a distribution always exists.
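For a discrete distribution the defining expectation is a finite sum, $\varphi_X(t) = \sum_x e^{itx} f_X(x)$. A sketch for a fair die, checking two properties every characteristic function satisfies ($\varphi(0) = 1$ and $|\varphi(t)| \le 1$):

```python
import cmath

# PMF of a fair die; phi_X(t) = E[e^{itX}] = sum_x e^{itx} f_X(x).
pmf = {x: 1 / 6 for x in range(1, 7)}

def phi(t):
    return sum(cmath.exp(1j * t * x) * p for x, p in pmf.items())

print(abs(phi(0) - 1) < 1e-12)  # True: phi(0) = E[1] = 1 for any distribution
print(abs(phi(1.0)) <= 1)       # True: |phi(t)| <= 1 everywhere
```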

The characteristic function uniquely determines the distribution of a random variable: $f_{\mathbf{X}}(\mathbf{x}) = \mathcal{F}^{-1} \Phi_{\mathbf{X}}(\mathbf{w})$.

Weak convergence of random variables implies pointwise convergence of corresponding characteristic functions.

If a random variable $X$ has moments up to the $k$-th order, then the characteristic function is $k$ times continuously differentiable on the entire real line.

If a characteristic function has a $k$-th derivative at 0, then the random variable has moments up to the $k$-th order if $k$ is even, and up to the $(k-1)$-th order if $k$ is odd.

If the right-hand side is well defined, the $k$-th moment can be computed as:

$$\mathbb{E} X^k = (-i)^k \varphi_X^{(k)} (0)$$
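The formula can be checked numerically by approximating $\varphi^{(k)}(0)$ with finite differences. A sketch for a fair die, where $\mathbb{E}X = 7/2$ and $\mathbb{E}X^2 = 91/6$ (step size and helper names are choices of this sketch):

```python
import cmath

pmf = {x: 1 / 6 for x in range(1, 7)}  # fair die

def phi(t):
    """Characteristic function of the die: sum_x e^{itx} f_X(x)."""
    return sum(cmath.exp(1j * t * x) * p for x, p in pmf.items())

h = 1e-4
# First moment: E[X] = (-i) * phi'(0), central difference for phi'(0).
m1 = ((-1j) * (phi(h) - phi(-h)) / (2 * h)).real
# Second moment: E[X^2] = (-i)^2 * phi''(0) = -phi''(0).
m2 = (-(phi(h) - 2 * phi(0) + phi(-h)) / h**2).real

print(round(m1, 6))  # 3.5
print(round(m2, 4))  # 15.1667
```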

Table: Standard Form of Dominant Moments

Name | Definition | Interpretation | Dimension | Range† |
---|---|---|---|---|
mean | first raw moment | central tendency | as is | $(-\infty, \infty)$ |
standard deviation | square root of the second central moment | variation | as is | $[0, \infty)$ |
skewness | normalized third central moment | lopsidedness | dimensionless | $(-\infty, \infty)$ |
excess kurtosis | normalized fourth central moment minus 3, the normal distribution's value | (for a symmetric distribution) probability concentration in the center and tails relative to the standard deviation | dimensionless | $[-2, \infty)$ |

† If it exists.
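The four quantities can be computed from a sample with their population-moment definitions. A sketch (the helper `moments` is hypothetical); a symmetric two-point mass attains the kurtosis lower bound of $-2$:

```python
import math

def moments(xs):
    """Mean, standard deviation, skewness, excess kurtosis of a sample."""
    n = len(xs)
    mean = sum(xs) / n

    def central(k):
        return sum((x - mean) ** k for x in xs) / n

    var = central(2)
    sd = math.sqrt(var)
    skew = central(3) / sd ** 3          # normalized third central moment
    exkurt = central(4) / var ** 2 - 3   # fourth, minus the normal's 3
    return mean, sd, skew, exkurt

# Symmetric two-point sample: zero skewness, excess kurtosis -2 (the bound).
print(moments([-1, 1, -1, 1]))  # (0.0, 1.0, 0.0, -2.0)
```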

Classification of positive random variables by concentration: [@Taleb2018]

- compact support;
- sub-Gaussian: $\exists a > 0: 1 - F(x) = O(e^{-ax^2})$;
- Gaussian;
- sub-exponential: no exponential moment; sum dominated by the maximum for large values [@Embrechts1979];
- power law ($p > 3$): finite mean and variance;
- power law ($2 < p \leq 3$): finite mean, infinite variance;
- power law ($1 < p \leq 2$): infinite mean.
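The "sum dominated by the maximum" property of the heavier classes can be illustrated with a rough, seeded simulation (not from the source; sample sizes and distribution parameters are arbitrary choices). A Pareto tail with exponent $1.5$ (infinite variance) is compared against an exponential:

```python
import random

# Heuristic: for heavy tails, one extreme observation carries a
# non-negligible fraction of the whole sum; for light tails it does not.
random.seed(0)
n = 100_000
pareto = [random.paretovariate(1.5) for _ in range(n)]  # infinite variance
expo = [random.expovariate(1.0) for _ in range(n)]      # light tail

heavy = max(pareto) / sum(pareto)  # stays non-negligible as n grows
light = max(expo) / sum(expo)      # ~ log(n)/n, vanishing

print(heavy > light)  # True
```

For the exponential, the maximum grows like $\log n$ while the sum grows like $n$, so the ratio vanishes; for the Pareto tail the maximum grows fast enough to remain visible in the sum.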