# Conditioning in measure-theoretic probability (draft)

This post is adapted from my notes on CMU’s machine learning course.

A foundational concept introduced in all courses on measure-theoretic probability is the conditional expectation, which generalizes the discrete notion of averaging the outcomes of a random variable over an event. A typical course, however, moves on to martingale theory without investing much time in generalizing the other conditional constructs from discrete probability theory, often omitting them entirely. We develop the missing measure-theoretic generalizations in this post.

## 1. Review of Measure-Theoretic Terminology

A measure space is an ordered triple $(\Omega, \mathcal{F}, \mu)$ consisting of a set $\Omega$ denoting the sample space, a $\sigma$-algebra $\mathcal{F}$ of events, and a measure $\mu$ on the measurable space $(\Omega, \mathcal{F})$. If $\mathbb{P} = \mu$ is a probability measure, then we say that $(\Omega,\mathcal{F},\mathbb{P})$ is a probability space.

By a $\sigma$-algebra, we mean a set $\mathcal{F}$ of subsets of $\Omega$ that satisfies the following properties:

1. The full set $\Omega$ and the empty set $\varnothing$ are elements of $\mathcal{F}$;
2. if $E$ is an element of $\mathcal{F}$, then its complement $\Omega \smallsetminus E$ is an element of $\mathcal{F}$;
3. if $\{E_{n}\}_{n=1}^{\infty}$ is a countable collection of sets in $\mathcal{F}$, then its union $\bigcup_{n=1}^\infty E_{n}$ and its intersection $\bigcap_{n=1}^\infty E_n$ are elements of $\mathcal{F}$.

The ordered pair $(\Omega, \mathcal{F})$ of a set and a $\sigma$-algebra on it is called a measurable space, because we can define a measure on it. A measure on $(\Omega, \mathcal{F})$ is a function $\mu:\mathcal{F} \to [0,\infty]$ such that $\mu(\varnothing) = 0$ and the countable additivity criterion

$$\mu\left(\bigcup_{n=1}^\infty E_{n}\right) = \sum_{n=1}^\infty \mu(E_{n})$$

holds whenever $\{E_{n}\}_{n=1}^\infty$ is a disjoint collection of events. If, in addition, $\mu(\Omega) = 1$, then we say that $\mu$ is a probability measure on $(\Omega,\mathcal{F})$.

We say that a property of a probability space $(\Omega, \mathcal{F},\mathbb{P})$ holds almost surely if it holds on $\Omega \smallsetminus E$ for an event $E$ of measure zero. The corresponding term for a general measure space is almost everywhere. A function $g$ that is equal almost everywhere to another function $f$ is said to be a version of $f$.

Given a measurable space $(\Omega,\mathcal{F})$, we define a $\sigma$-subalgebra of $\mathcal{F}$ to be a subset of $\mathcal{F}$ that is also a $\sigma$-algebra on $\Omega$. The $\sigma$-subalgebra of $\mathcal{F}$ generated by a collection $\mathscr{C} \subseteq \mathcal{F}$ is the intersection of all $\sigma$-subalgebras of $\mathcal{F}$ containing $\mathscr{C}$.

If no ambient $\sigma$-algebra is given, we define the $\sigma$-algebra on $\Omega$ generated by a collection $\mathscr{C}$ of subsets of $\Omega$ to be the intersection of all $\sigma$-algebras on $\Omega$ containing $\mathscr{C}$. The intersection is well-defined, as the power set $\mathcal{P}(\Omega)$ is always a $\sigma$-algebra on $\Omega$.
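On a finite sample space, the generated $\sigma$-algebra can be computed by brute force, since closing a collection under complements and pairwise unions already gives closure under countable unions and intersections. The following is a minimal sketch under that finiteness assumption; the function name `generate_sigma_algebra` and the example sets are made up for illustration.

```python
from itertools import combinations

def generate_sigma_algebra(omega, collection):
    """Smallest sigma-algebra on the FINITE set `omega` containing `collection`."""
    omega = frozenset(omega)
    sets = {frozenset(), omega} | {frozenset(c) for c in collection}
    changed = True
    while changed:
        changed = False
        current = list(sets)
        # Close under complements.
        for e in current:
            comp = omega - e
            if comp not in sets:
                sets.add(comp)
                changed = True
        # Close under pairwise unions (intersections follow by De Morgan).
        for e, f in combinations(current, 2):
            u = e | f
            if u not in sets:
                sets.add(u)
                changed = True
    return sets

omega = {1, 2, 3, 4}
sigma = generate_sigma_algebra(omega, [{1}, {2}])
```

Here the generated $\sigma$-algebra consists of all unions of the atoms $\{1\}$, $\{2\}$, and $\{3,4\}$, so it separates the points 1 and 2 but cannot tell 3 and 4 apart.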

A useful tool for constructing measure spaces is the Carathéodory extension theorem, which states that a nonnegative, countably additive function $\mu_{0}$ on a collection $\mathcal{A}$ of subsets of a sample space $\Omega$ that is closed under finite unions, intersections, and complementation admits an extension $\mu$ to $\sigma(\mathcal{A})$, the $\sigma$-algebra generated by $\mathcal{A}$.

Furthermore, the extension is unique if $\mu_{0}$ is $\sigma$-finite, i.e., there exists a sequence of sets in $\mathcal{A}$ of finite measure whose union is the entire sample space. In particular, every countably additive function on $\mathcal{A}$ that assigns measure 1 to the sample space has a unique extension to a probability measure.

A measurable function from a measurable space $(A,\mathcal{F})$ to another measurable space $(B,\mathcal{G})$ is a function $f:A \to B$ such that $f^{-1}(E) \in \mathcal{F}$ whenever $E \in \mathcal{G}$.

A random variable on a probability space $(\Omega,\mathcal{F},\mathbb{P})$ is a measurable function from $(\Omega,\mathcal{F})$ to $(\mathbb{R},\mathscr{B}_\mathbb{R})$, where $\mathscr{B}_{\mathbb{R}}$ is the Borel $\sigma$-algebra, the $\sigma$-algebra generated by open sets. A random variable $X$ is said to be discrete if the image of $X$ is a countable set.

A random $n$-vector on $(\Omega,\mathcal{F},\mathbb{P})$ is a measurable function from $(\Omega,\mathcal{F})$ to $(\mathbb{R}^n,\mathscr{B}_{\mathbb{R}^n})$, where $\mathscr{B}_{\mathbb{R}^n}$ is the Borel $\sigma$-algebra on $\mathbb{R}^n$.

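The distribution of a random variable $X$ is the real-valued function

$$F_{X}(\alpha) = \mathbb{P}[X \leq \alpha], \qquad \alpha \in \mathbb{R}.$$

The set function defined by the formula

$$\mathscr{L}_{X}((a,b]) = F_{X}(b) - F_{X}(a)$$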

can be extended to a probability measure on $(\mathbb{R},\mathscr{B}_\mathbb{R})$, called the law associated with the random variable $X$.

Conversely, any increasing, right-continuous function $F:\mathbb{R} \to [0,1]$ such that $F(\alpha) \to 0$ as $\alpha \to -\infty$ and $F(\alpha) \to 1$ as $\alpha \to \infty$ admits a random variable $X$ such that $F_{X} = F$. Since the extension of the set function

$$dF((a,b]) = F(b) - F(a)$$

to a probability measure on $(\mathbb{R},\mathscr{B}_\mathbb{R})$ agrees with $\mathscr{L}_{X}$, we conclude that there is a one-to-one correspondence between probability distributions and probability measures of the form $dF$, called the Lebesgue–Stieltjes measures.
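As a small numerical illustration of the correspondence between a distribution function and its law, here is a sketch for a fair six-sided die. It assumes the usual convention $F_X(\alpha) = \mathbb{P}[X \leq \alpha]$ and that the law assigns $F_X(b) - F_X(a)$ to the half-open interval $(a,b]$; the names `pmf`, `cdf`, and `law_half_open` are hypothetical.

```python
from fractions import Fraction

# Assumed example: a fair six-sided die.
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

def cdf(alpha):
    """Distribution function F_X(alpha) = P[X <= alpha]."""
    return sum((p for x, p in pmf.items() if x <= alpha), Fraction(0))

def law_half_open(a, b):
    """Law of the half-open interval (a, b], computed as F_X(b) - F_X(a)."""
    return cdf(b) - cdf(a)
```

For example, `law_half_open(2, 5)` adds up the masses at 3, 4, and 5, while the mass at the excluded left endpoint 2 is not counted.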

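A simple function on a probability space $(\Omega, \mathcal{F},\mathbb{P})$ is a linear combination of indicator functions on events:

$$s = \sum_{n=1}^{N} a_{n} \mathbf{1}_{E_{n}}, \qquad a_{n} \in \mathbb{R}, \quad E_{n} \in \mathcal{F}.$$

The expectation of $s$ is the sum

$$\mathbb{E}[s] = \sum_{n=1}^{N} a_{n} \mathbb{P}[E_{n}].$$

Given a nonnegative random variable $X$ on $(\Omega,\mathcal{F},\mathbb{P})$, there exists a sequence $(s_n)_{n=1}^\infty$ of simple functions such that $0 \leq s_{1} \leq s_{2} \leq \cdots \leq X$ and that $s_{n} \to X$ pointwise almost surely. We define

$$\mathbb{E}[X] = \lim_{n \to \infty} \mathbb{E}[s_{n}].$$

In general, we can write a random variable $X$ as the difference $X^+ - X^-$, where $X^+ = \max(X,0)$ and $X^- = \max(-X,0)$. We can then define the expectation of $X$ to be the difference

$$\mathbb{E}[X] = \mathbb{E}[X^+] - \mathbb{E}[X^-],$$

provided that at least one of the two terms is finite.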

This definition yields a linear functional on the space of all integrable random variables on $(\Omega,\mathcal{F},\mathbb{P})$. The construction of the Lebesgue integral

$$\int_{\Omega} f \, d\mu$$

of a real-valued measurable function $f$ on a general measure space $(\Omega,\mathcal{F},\mu)$ is analogous.
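The approximation of a nonnegative random variable by simple functions can be tried out on a small finite probability space. The sketch below uses the staircase functions $s_n = \min(n, \lfloor 2^n X \rfloor / 2^n)$, which are one standard choice of approximating sequence and not necessarily the one intended above. The sample space, the probabilities, and the values of $X$ are made up for illustration, and on a finite space $X$ is of course already simple.

```python
from fractions import Fraction
from math import floor

# Assumed toy probability space with three outcomes.
prob = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}
# Assumed nonnegative random variable on that space.
X = {"a": Fraction(1, 3), "b": Fraction(5, 2), "c": Fraction(7)}

def staircase(n):
    """s_n = min(n, floor(2^n X) / 2^n): simple, nondecreasing in n, and <= X."""
    scale = 2 ** n
    return {w: min(Fraction(n), Fraction(floor(scale * x), scale))
            for w, x in X.items()}

def expectation(f):
    """Expectation of a function on the finite space: sum of value times mass."""
    return sum((f[w] * prob[w] for w in prob), Fraction(0))

# Expectations of the first few approximations, to compare with expectation(X).
approximations = [expectation(staircase(n)) for n in range(1, 13)]
```

The list `approximations` increases toward `expectation(X)`, mirroring the definition of $\mathbb{E}[X]$ as a limit of expectations of simple functions.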

We record two computational identities regarding the expectation, which hold for each random variable $X$ and every $(\mathscr{B}_\mathbb{R},\mathscr{B}_\mathbb{R})$-measurable (or Borel measurable, for short) function $g:\mathbb{R} \to \mathbb{R}$:

The integral in the first identity is to be understood as a Lebesgue–Stieltjes integral.

The second identity is a consequence of integration on product measure spaces.

Given two measure spaces $(A,\mathcal{F},\mu)$ and $(B,\mathcal{G},\nu)$, we define a rectangle to be the Cartesian product $P \times Q$ for any $P \in \mathcal{F}$ and $Q \in \mathcal{G}$. The product measure of a rectangle is defined to be the product

$$(\mu \otimes \nu)(P \times Q) = \mu(P)\nu(Q).$$

The set function $\mu \otimes \nu$ is countably additive on the algebra $\mathcal{R}$ generated by the rectangles, i.e., the collection of all sets obtained from rectangles by finitely many unions, intersections, and complementations, whence Carathéodory’s extension theorem furnishes a measure on the $\sigma$-algebra $\mathcal{F} \otimes \mathcal{G}$ generated by $\mathcal{R}$.

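If both $\mu$ and $\nu$ are $\sigma$-finite, then the product measure $\mu \otimes \nu$ is $\sigma$-finite, and the Fubini–Tonelli theorem holds: for every $\mathcal{F} \otimes \mathcal{G}$-measurable function $f:A \times B \to \mathbb{R}$,

$$\int_{A \times B} f \, d(\mu \otimes \nu) = \int_{A} \left( \int_{B} f(x,y) \, d\nu(y) \right) d\mu(x) = \int_{B} \left( \int_{A} f(x,y) \, d\mu(x) \right) d\nu(y)$$

whenever $\int_{A \times B} \lvert f \rvert \, d(\mu \otimes \nu) < \infty$ or $f$ is nonnegative.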
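On finite sets equipped with weighted counting measures, every integral is a finite sum, so the interchange of the order of integration can be checked directly. The following is a toy sketch; the weights `mu` and `nu`, the helper `rank`, and the integrand `f` are all made up for illustration.

```python
from fractions import Fraction
from itertools import product

# Assumed finite measures, given by the mass of each point.
mu = {0: Fraction(1, 2), 1: Fraction(3, 2)}
nu = {"x": Fraction(2), "y": Fraction(1, 4), "z": Fraction(1)}

rank = {"x": 1, "y": 2, "z": 3}

def f(a, b):
    """An arbitrary nonnegative integrand on the product space."""
    return Fraction(a + 1, rank[b])

# Integral against the product measure, whose point masses are mu[a] * nu[b].
product_integral = sum(f(a, b) * mu[a] * nu[b] for a, b in product(mu, nu))

# The two iterated integrals.
inner_mu_first = sum(sum(f(a, b) * mu[a] for a in mu) * nu[b] for b in nu)
inner_nu_first = sum(sum(f(a, b) * nu[b] for b in nu) * mu[a] for a in mu)
```

All three quantities agree, which in the finite case is nothing more than rearranging a finite double sum.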

Yet another useful computational device for the expectation is the probability density function, which is defined for each random variable $X$ to be a Borel measurable function $f_{X}:\mathbb{R} \to \mathbb{R}$ such that

$$\mathbb{P}[a \leq X \leq b] = \int_{[a,b]} f_{X} \, d\mu$$

for all real numbers $a \leq b$. Here, $\mu$ is a measure on $(\mathbb{R},\mathscr{B}_{\mathbb{R}})$.

A sufficient condition for the existence of a probability density function is established through the theory of signed measures.

A signed measure on a measurable space $(\Omega,\mathcal{F})$ is a countably additive set function $\mu:\mathcal{F} \to (-\infty, \infty]$. To highlight the difference, a measure is sometimes called a positive measure. A signed measure $\mu$ on $(\Omega,\mathcal{F})$ is said to be absolutely continuous with respect to a positive measure $\nu$ on $(\Omega,\mathcal{F})$, denoted by $\mu \ll \nu$, if $\mu(E) = 0$ whenever $E \in \mathcal{F}$ and $\nu(E) = 0$.

A useful special case of the Radon–Nikodym theorem (also known as the Lebesgue–Radon–Nikodym theorem) states the following: if a $\sigma$-finite signed measure $\mu$ on a measurable space $(\Omega,\mathcal{F})$ is absolutely continuous with respect to a $\sigma$-finite positive measure $\nu$ on $(\Omega,\mathcal{F})$, then there exists an $\mathcal{F}$-measurable function $\frac{d\mu}{d\nu}:\Omega \to \mathbb{R}$, called the Radon–Nikodym derivative of $\mu$ with respect to $\nu$, such that

$$\mu(E) = \int_{E} \frac{d\mu}{d\nu} \, d\nu$$

for all $E \in \mathcal{F}$.

We fix a probability space $(\Omega,\mathcal{F},\mathbb{P})$ and a random variable $X$. If the law $\mathscr{L}_{X}$ is absolutely continuous with respect to a $\sigma$-finite positive measure $\nu$ on $(\mathbb{R},\mathscr{B}_{\mathbb{R}})$, then the Radon–Nikodym theorem implies that

$$\mathbb{P}[a \leq X \leq b] = \mathscr{L}_{X}([a,b]) = \int_{[a,b]} \frac{d\mathscr{L}_{X}}{d\nu} \, d\nu$$

for all real numbers $a \leq b$. If $X$ is discrete, then choosing the counting measure on the countable image $X(\Omega)$,

$$\nu(E) = \#\bigl(E \cap X(\Omega)\bigr),$$

gives us absolute continuity. Otherwise, we typically shoot for absolute continuity with respect to $\nu = \mathscr{L}_{\mathbb{R}}$, the Lebesgue measure on the real line.
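In the discrete case, the density with respect to the counting measure is just the probability mass function, and integrating it over an interval means summing it over the atoms in that interval. Here is a minimal sketch, assuming as an example the sum of two fair dice; the names `density` and `prob_interval` are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Assumed example: two fair dice, with X the sum of the faces.
omega = list(product(range(1, 7), repeat=2))
P = {w: Fraction(1, 36) for w in omega}

def X(w):
    return w[0] + w[1]

# Push P forward through X: the mass at each value is the density of the law
# of X with respect to the counting measure on the image of X.
density = defaultdict(Fraction)
for w, p in P.items():
    density[X(w)] += p

def prob_interval(a, b):
    """P[a <= X <= b], computed by summing the density over atoms in [a, b]."""
    return sum((density[x] for x in density if a <= x <= b), Fraction(0))
```

The value `prob_interval(a, b)` agrees with adding up `P[w]` directly over the outcomes `w` with `a <= X(w) <= b`, which is the content of the displayed identity in this special case.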

## 2. Conditioning and Independence on a Discrete Probability Space

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a discrete probability space, i.e., the sample space $\Omega$ is countable. Given two events $A,B \in \mathcal{F}$, what should the probability of $A$ given $B$ be? To answer this question, we construct a new measurable space $(\Omega_{B}, \mathcal{F}_{B})$ where event $B$ always happens. To do this, we take $\Omega_{B} = \Omega \cap B = B$ and

$$\mathcal{F}_{B} = \{E' \cap B : E' \in \mathcal{F}\}.$$

The construction implies that each $E \in \mathcal{F}_{B}$ admits $E' \in \mathcal{F}$ such that $E = E' \cap B$. It thus makes sense to define the new probability measure $\mathbb{P}_{B}$ on $(\Omega_{B}, \mathcal{F}_{B})$ in terms of $\mathbb{P}[E' \cap B]$. We take the normalization

$$\mathbb{P}_{B}[E] = \frac{\mathbb{P}[E' \cap B]}{\mathbb{P}[B]}$$

so that $\mathbb{P}_{B}$ is a bona fide probability measure on $(\Omega_{B},\mathcal{F}_{B})$.

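With this construction, it would be reasonable to say that the probability of $A$ given $B$, denoted by $\mathbb{P}[A \mid B]$, is

$$\mathbb{P}[A \mid B] = \mathbb{P}_{B}[A \cap B] = \frac{\mathbb{P}[A \cap B]}{\mathbb{P}[B]},$$

provided that $\mathbb{P}[B] > 0$. If $\mathbb{P}[B] = 0$, then conditioning is meaningless: the numerator $\mathbb{P}[A \cap B]$ is 0 for every event $A$, so the quotient is the indeterminate form $0/0$.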
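The construction of the conditioned space can be carried out by hand on a small example. The sketch below assumes two fair dice, with $A$ the event that the faces sum to 8 and $B$ the event that the first face is even; the helper names `prob`, `conditional`, and `P_B` are made up for illustration.

```python
from fractions import Fraction
from itertools import product

# Assumed example: two fair dice, every outcome equally likely.
omega = set(product(range(1, 7), repeat=2))
P = {w: Fraction(1, 36) for w in omega}

def prob(event):
    """Probability of an event, given as a set of outcomes."""
    return sum((P[w] for w in event), Fraction(0))

def conditional(A, B):
    """P[A | B] = P[A and B] / P[B], defined only when P[B] > 0."""
    if prob(B) == 0:
        raise ValueError("conditioning on an event of probability zero")
    return prob(A & B) / prob(B)

A = {w for w in omega if w[0] + w[1] == 8}   # the faces sum to 8
B = {w for w in omega if w[0] % 2 == 0}      # the first face is even

# The normalized measure on the restricted sample space B.
P_B = {w: P[w] / prob(B) for w in B}
```

The dictionary `P_B` sums to 1 over `B`, and adding up `P_B[w]` over `A & B` gives the same number as `conditional(A, B)`.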

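If the conditioning does not change the probability, then the events are said to be independent. Formally, $A$ and $B$ are independent if

$$\mathbb{P}[A \mid B] = \mathbb{P}[A].$$

Whenever $\mathbb{P}[B] > 0$, this holds if and only if $\mathbb{P}[A \cap B] = \mathbb{P}[A]\mathbb{P}[B]$, which yields a natural $n$-fold generalization: events $E_{1},\ldots,E_{n}$ are independent if

$$\mathbb{P}[E_{i_{1}} \cap \cdots \cap E_{i_{k}}] = \mathbb{P}[E_{i_{1}}] \cdots \mathbb{P}[E_{i_{k}}]$$

for every choice of indices $1 \leq i_{1} < \cdots < i_{k} \leq n$.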
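The product criterion is easy to test on a finite space. The sketch below assumes the usual definition of independence for several events, namely that the product rule holds for every subcollection of two or more of them, and uses two fair coin flips as a made-up example in which three events are pairwise independent without being independent as a triple.

```python
from fractions import Fraction
from itertools import combinations, product

# Assumed example: two fair coin flips.
omega = list(product("HT", repeat=2))
P = {w: Fraction(1, 4) for w in omega}

def prob(event):
    return sum((P[w] for w in event), Fraction(0))

def independent(*events):
    """Check the product rule on every subcollection of size at least 2."""
    for k in range(2, len(events) + 1):
        for sub in combinations(events, k):
            intersection = set(omega).intersection(*sub)
            expected = Fraction(1)
            for e in sub:
                expected *= prob(e)
            if prob(intersection) != expected:
                return False
    return True

E1 = {w for w in omega if w[0] == "H"}     # first flip is heads
E2 = {w for w in omega if w[1] == "H"}     # second flip is heads
E3 = {w for w in omega if w[0] == w[1]}    # the two flips agree
```

Each pair among `E1`, `E2`, `E3` passes the check, but `independent(E1, E2, E3)` fails because the triple intersection has probability $1/4$ rather than $1/8$. This is why the product rule is required for every subcollection and not just for the full intersection.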
