Conditioning in measure-theoretic probability (draft)

This post is adapted from my notes on CMU’s machine learning course.

A foundational concept introduced in all courses on measure-theoretic probability is the conditional expectation, which generalizes the discrete-probability notion of the average value of a random variable over an event. A typical course, however, moves on to martingale theory without investing much time in generalizing the other conditional constructs from discrete probability theory, often omitting them entirely. We develop the missing measure-theoretic generalizations in this post.

1. Review of Measure-Theoretic Terminology

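A measure space is an ordered triple $(\Omega, \mathcal{F}, \mu)$ consisting of a set $\Omega$ denoting the sample space, a $\sigma$-algebra $\mathcal{F}$ of events, and a measure $\mu$ on the measurable space $(\Omega, \mathcal{F})$. If $\mu$ is a probability measure, then we say that $(\Omega, \mathcal{F}, \mu)$ is a probability space.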

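By a $\sigma$-algebra, we mean a set $\mathcal{F}$ of subsets of $\Omega$ that satisfies the following properties:

  1. The full set $\Omega$ and the empty set $\varnothing$ are elements of $\mathcal{F}$;
  2. if $E$ is an element of $\mathcal{F}$, then its complement $\Omega \setminus E$ is an element of $\mathcal{F}$;
  3. if $\{E_n\}_{n=1}^{\infty}$ is a countable collection of sets in $\mathcal{F}$, then its union $\bigcup_{n=1}^{\infty} E_n$ and its intersection $\bigcap_{n=1}^{\infty} E_n$ are elements of $\mathcal{F}$.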

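The ordered pair $(\Omega, \mathcal{F})$ of a set and a $\sigma$-algebra on it is called a measurable space, because we can define a probability measure on it. A measure on $(\Omega, \mathcal{F})$ is a function $\mu : \mathcal{F} \to [0, \infty]$ such that the countable additivity criterion

$$\mu\left(\bigcup_{n=1}^{\infty} E_n\right) = \sum_{n=1}^{\infty} \mu(E_n)$$

holds whenever $\{E_n\}_{n=1}^{\infty}$ is a disjoint collection of events. If, in addition, $\mu(\Omega) = 1$, then we say that $\mu$ is a probability measure on $(\Omega, \mathcal{F})$.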

We say that a property of a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ holds almost surely if it holds on $\Omega \setminus N$ for an event $N$ of measure zero. The corresponding term for a general measure space is almost everywhere. A function $g$ that is equal almost everywhere to another function $f$ is said to be a version of $f$.

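Given a measurable space $(\Omega, \mathcal{F})$, we define a $\sigma$-subalgebra of $\mathcal{F}$ to be a subset of $\mathcal{F}$ that is also a $\sigma$-algebra on $\Omega$. The $\sigma$-subalgebra of $\mathcal{F}$ generated by a collection $\mathcal{C} \subseteq \mathcal{F}$ is the intersection of all $\sigma$-subalgebras of $\mathcal{F}$ containing $\mathcal{C}$.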

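If no ambient $\sigma$-algebra is given, we define the $\sigma$-algebra $\sigma(\mathcal{C})$ on $\Omega$ generated by a collection $\mathcal{C}$ of subsets of $\Omega$ to be the intersection of all $\sigma$-algebras on $\Omega$ containing $\mathcal{C}$. The intersection is well-defined, as the power set $\mathcal{P}(\Omega)$ is always a $\sigma$-algebra on $\Omega$.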
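On a finite sample space, the generated $\sigma$-algebra can be computed by brute force: start from the generating collection and keep adding complements and pairwise unions until nothing new appears. The following Python sketch illustrates this; the function name `generate_sigma_algebra` and the example sets are our own, and the approach is only sensible for small finite sets, where countable unions reduce to finite ones.

```python
from itertools import combinations


def generate_sigma_algebra(omega, collection):
    """Smallest sigma-algebra on the finite set `omega` containing `collection`."""
    omega = frozenset(omega)
    sets = {frozenset(), omega} | {frozenset(c) for c in collection}
    while True:
        new = {omega - s for s in sets}                    # complements
        new |= {a | b for a, b in combinations(sets, 2)}   # pairwise unions
        if new <= sets:                                    # nothing new: closed
            return sets
        sets |= new


omega = {1, 2, 3, 4}
sigma = generate_sigma_algebra(omega, [{1}, {1, 2}])
print(len(sigma))  # the atoms are {1}, {2}, {3, 4}, so there are 2^3 = 8 events
```

Closure under intersections need not be checked separately, since it follows from closure under complements and unions by De Morgan's laws.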

A useful tool for constructing measure spaces is the Carathéodory extension theorem, which states that a countably additive function $\mu_0 : \mathcal{A} \to [0, \infty]$ on a collection $\mathcal{A}$ of subsets of a sample space $\Omega$ that is closed under finite unions, intersections, and complementation admits an extension to a measure on $\sigma(\mathcal{A})$, the $\sigma$-algebra generated by $\mathcal{A}$.

Furthermore, the extension is unique if the measure is $\sigma$-finite, i.e., there exists a sequence of events of finite measure whose union is the entire sample space. In particular, every countably additive function that assigns measure 1 to the sample space has a unique extension to a probability measure.

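A measurable function from a measurable space $(\Omega, \mathcal{F})$ to another measurable space $(\Omega', \mathcal{F}')$ is a function $f : \Omega \to \Omega'$ such that $f^{-1}(E') \in \mathcal{F}$ whenever $E' \in \mathcal{F}'$.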
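On finite spaces, measurability can be checked directly by computing every preimage. The sketch below uses a hypothetical four-point sample space with a coarse $\sigma$-algebra that cannot distinguish 1 from 2 or 3 from 4; the helper names `power_set` and `is_measurable` are illustrative.

```python
from itertools import chain, combinations


def power_set(points):
    """All subsets of a finite set, as frozensets."""
    points = list(points)
    subsets = chain.from_iterable(
        combinations(points, r) for r in range(len(points) + 1)
    )
    return {frozenset(s) for s in subsets}


def is_measurable(f, domain, sigma_domain, sigma_target):
    """Check that the preimage of every target event is a domain event."""
    for target_event in sigma_target:
        preimage = frozenset(w for w in domain if f(w) in target_event)
        if preimage not in sigma_domain:
            return False
    return True


omega = {1, 2, 3, 4}
# A coarse sigma-algebra that cannot tell 1 from 2, or 3 from 4.
coarse = {frozenset(), frozenset({1, 2}), frozenset({3, 4}), frozenset(omega)}
target_sigma = power_set({0, 1})


def which_half(w):
    return 0 if w <= 2 else 1   # constant on {1, 2} and on {3, 4}


def parity(w):
    return w % 2                # separates 1 from 2


print(is_measurable(which_half, omega, coarse, target_sigma))  # True
print(is_measurable(parity, omega, coarse, target_sigma))      # False
```

The second function fails because the preimage of $\{0\}$ is $\{2, 4\}$, which is not an event in the coarse $\sigma$-algebra.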

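A random variable on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ is a measurable function $X$ from $(\Omega, \mathcal{F})$ to $(\mathbb{R}, \mathcal{B}_{\mathbb{R}})$, where $\mathcal{B}_{\mathbb{R}}$ is the Borel $\sigma$-algebra, the $\sigma$-algebra generated by open sets. A random variable $X$ is said to be discrete if the image of $X$ is a countable set.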

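A random $n$-vector on $(\Omega, \mathcal{F}, \mathbb{P})$ is a measurable function from $(\Omega, \mathcal{F})$ to $(\mathbb{R}^n, \mathcal{B}_{\mathbb{R}^n})$, where $\mathcal{B}_{\mathbb{R}^n}$ is the Borel $\sigma$-algebra on $\mathbb{R}^n$.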

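The distribution of a random variable $X$ is the real-valued function

$$F_X(x) = \mathbb{P}(X \leq x).$$

The set function defined by the formula

$$\mathbb{P}_X((a, b]) = F_X(b) - F_X(a)$$

can be extended to a probability measure on $(\mathbb{R}, \mathcal{B}_{\mathbb{R}})$, called the law associated with the random variable $X$.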

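Conversely, any increasing, right-continuous function $F : \mathbb{R} \to [0, 1]$ such that $F(x) \to 0$ as $x \to -\infty$ and $F(x) \to 1$ as $x \to \infty$ admits a random variable $X$ such that $F_X = F$. Since the extension of the set function

$$\mu_F((a, b]) = F(b) - F(a)$$

to a probability measure on $(\mathbb{R}, \mathcal{B}_{\mathbb{R}})$ agrees with the law $\mathbb{P}_X$ of $X$, we conclude that there is a one-to-one correspondence between probability distributions and probability measures of the form $\mu_F$, called the Lebesgue–Stieltjes measures.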
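For a discrete random variable, the relationship between the distribution function and the law of a half-open interval can be checked by direct computation. The sketch below assumes a hypothetical fair six-sided die and uses exact rational arithmetic; the names `pmf`, `cdf`, and `law_of_interval` are our own.

```python
from fractions import Fraction

# Hypothetical example: a fair six-sided die, with X the face value.
pmf = {k: Fraction(1, 6) for k in range(1, 7)}


def cdf(x):
    """Distribution function F_X(x) = P(X <= x)."""
    return sum((p for k, p in pmf.items() if k <= x), Fraction(0))


def law_of_interval(a, b):
    """Law of the half-open interval (a, b], computed as F_X(b) - F_X(a)."""
    return cdf(b) - cdf(a)


# Compare with summing the point masses in (2, 5] directly.
direct = sum((p for k, p in pmf.items() if 2 < k <= 5), Fraction(0))
print(law_of_interval(2, 5), direct)  # 1/2 1/2
```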

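A simple function on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ is a linear combination of indicator functions on events:

$$\varphi = \sum_{i=1}^{n} a_i \mathbf{1}_{E_i}, \qquad a_i \in \mathbb{R}, \quad E_i \in \mathcal{F}.$$

The expectation of $\varphi$ is the sum

$$\mathbb{E}[\varphi] = \sum_{i=1}^{n} a_i \, \mathbb{P}(E_i).$$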

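Given a nonnegative random variable $X$ on $(\Omega, \mathcal{F}, \mathbb{P})$, there exists a sequence $(\varphi_n)_{n=1}^{\infty}$ of simple functions such that $0 \leq \varphi_1 \leq \varphi_2 \leq \cdots \leq X$ and that $\varphi_n \to X$ pointwise almost surely. We define

$$\mathbb{E}[X] = \lim_{n \to \infty} \mathbb{E}[\varphi_n].$$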

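In general, we can write a random variable $X$ as the difference $X = X^+ - X^-$, where $X^+ = \max(X, 0)$ and $X^- = \max(-X, 0)$. We can then define the expectation of $X$ to be the difference

$$\mathbb{E}[X] = \mathbb{E}[X^+] - \mathbb{E}[X^-],$$

provided that at least one of the two terms is finite. This definition yields a linear functional on the space of all integrable random variables on $(\Omega, \mathcal{F}, \mathbb{P})$. The construction of the Lebesgue integral

$$\int_{\Omega} f \, d\mu$$

of a real-valued measurable function $f$ on a general measure space $(\Omega, \mathcal{F}, \mu)$ is analogous.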
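To see the approximation by simple functions at work, the sketch below uses the standard dyadic staircase $\varphi_n = \min(\lfloor 2^n X \rfloor / 2^n, n)$ on a hypothetical four-point probability space. On a finite space every random variable is already simple, so this is purely an illustration of how the expectations $\mathbb{E}[\varphi_n]$ increase toward $\mathbb{E}[X]$; the specific values of `X` are made up.

```python
from fractions import Fraction
from math import floor

# Hypothetical finite probability space: four equally likely outcomes.
prob = {w: Fraction(1, 4) for w in range(4)}
# A nonnegative random variable X, given by its value at each outcome.
X = {0: Fraction(1, 3), 1: Fraction(5, 4), 2: Fraction(2), 3: Fraction(0)}


def expectation(Y):
    """E[Y] on a finite space: sum of Y(w) * P({w}) over all outcomes w."""
    return sum((Y[w] * prob[w] for w in prob), Fraction(0))


def simple_approx(n):
    """phi_n = min(floor(2^n X) / 2^n, n), which takes finitely many dyadic values."""
    return {w: min(Fraction(floor(2**n * X[w]), 2**n), Fraction(n)) for w in X}


approx = [expectation(simple_approx(n)) for n in range(1, 11)]
exact = expectation(X)
print(approx[0], approx[1], exact)  # 1/2 7/8 43/48
```

Because $1/3$ is not a dyadic rational, the approximations never reach $\mathbb{E}[X]$ exactly, but the gap at stage $n$ is below $2^{-n}$ once $n$ exceeds the largest value of $X$.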

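We record two computational identities regarding the expectation, which hold for each random variable $X$ and every $\mathcal{B}_{\mathbb{R}}$-measurable (Borel measurable, for short) function $g : \mathbb{R} \to \mathbb{R}$:

$$\mathbb{E}[g(X)] = \int_{-\infty}^{\infty} g(x) \, dF_X(x), \qquad \mathbb{E}[|g(X)|] = \int_{0}^{\infty} \mathbb{P}(|g(X)| > t) \, dt.$$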

The integral in the first identity is to be understood as a Lebesgue–Stieltjes integral.

The second identity is a consequence of integration on product measure spaces.

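Given two measure spaces $(\Omega_1, \mathcal{F}_1, \mu_1)$ and $(\Omega_2, \mathcal{F}_2, \mu_2)$, we define a rectangle to be the cartesian product $E_1 \times E_2$ for any $E_1 \in \mathcal{F}_1$ and $E_2 \in \mathcal{F}_2$. The product measure of a rectangle is defined to be the product

$$(\mu_1 \times \mu_2)(E_1 \times E_2) = \mu_1(E_1) \, \mu_2(E_2).$$

The set function $\mu_1 \times \mu_2$ is countably additive on the algebra $\mathcal{A}$ of all finite unions, intersections, and complementations of rectangles, whence the Carathéodory extension theorem furnishes a measure on the $\sigma$-algebra $\mathcal{F}_1 \otimes \mathcal{F}_2$ generated by $\mathcal{A}$.

If both $\mu_1$ and $\mu_2$ are $\sigma$-finite, then the product measure $\mu_1 \times \mu_2$ is $\sigma$-finite, and the Fubini–Tonelli theorem holds:

$$\int_{\Omega_1 \times \Omega_2} f \, d(\mu_1 \times \mu_2) = \int_{\Omega_1} \left( \int_{\Omega_2} f(x, y) \, d\mu_2(y) \right) d\mu_1(x) = \int_{\Omega_2} \left( \int_{\Omega_1} f(x, y) \, d\mu_1(x) \right) d\mu_2(y)$$

whenever the measurable function $f$ satisfies $\int_{\Omega_1 \times \Omega_2} |f| \, d(\mu_1 \times \mu_2) < \infty$ or $f$ is nonnegative.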
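On finite measure spaces, integrals are weighted sums, and the Fubini–Tonelli theorem reduces to the fact that a finite double sum can be taken in either order. The sketch below checks this on two hypothetical measures given by point masses; the measures and the function `f` are illustrative choices, not taken from any particular application.

```python
from fractions import Fraction

# Hypothetical finite measure spaces: each measure is given by point masses.
mu1 = {"a": Fraction(1, 2), "b": Fraction(3, 2)}
mu2 = {0: Fraction(1, 3), 1: Fraction(2, 3), 2: Fraction(1)}


def f(x, y):
    """A nonnegative function on the product space (illustrative choice)."""
    return Fraction(y + 1) if x == "a" else Fraction(y * y)


# Integral against the product measure: the mass of {(x, y)} is mu1(x) * mu2(y).
product_integral = sum(f(x, y) * mu1[x] * mu2[y] for x in mu1 for y in mu2)

# Iterated integrals, in both orders.
x_outer = sum(mu1[x] * sum(f(x, y) * mu2[y] for y in mu2) for x in mu1)
y_outer = sum(mu2[y] * sum(f(x, y) * mu1[x] for x in mu1) for y in mu2)

print(product_integral, x_outer, y_outer)  # 28/3 28/3 28/3
```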

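Yet another useful computational device for the expectation is the probability density function, which is defined for each random variable $X$ to be a function $f_X : \mathbb{R} \to [0, \infty)$ such that

$$\mathbb{P}(X \leq x) = \int_{(-\infty, x]} f_X \, d\mu$$

for all real numbers $x$. Here, $\mu$ is a measure on $(\mathbb{R}, \mathcal{B}_{\mathbb{R}})$.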

A sufficient condition for the existence of a probability density function is established through the theory of signed measures.

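A signed measure on a measurable space $(\Omega, \mathcal{F})$ is a countably additive set function $\nu : \mathcal{F} \to [-\infty, \infty]$ that attains at most one of the values $\pm\infty$. To highlight the difference, a measure is sometimes called a positive measure. A signed measure $\nu$ on $(\Omega, \mathcal{F})$ is said to be absolutely continuous with respect to a positive measure $\mu$ on $(\Omega, \mathcal{F})$, denoted by $\nu \ll \mu$, if $\nu(E) = 0$ whenever $E \in \mathcal{F}$ and $\mu(E) = 0$.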

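A useful special case of the Radon–Nikodym theorem (also known as the Lebesgue–Radon–Nikodym theorem) states the following: if a $\sigma$-finite signed measure $\nu$ on a measurable space $(\Omega, \mathcal{F})$ is absolutely continuous with respect to a $\sigma$-finite positive measure $\mu$ on $(\Omega, \mathcal{F})$, then there exists an $\mathcal{F}$-measurable function $\frac{d\nu}{d\mu} : \Omega \to [-\infty, \infty]$, called the Radon–Nikodym derivative of $\nu$ with respect to $\mu$, such that

$$\nu(E) = \int_{E} \frac{d\nu}{d\mu} \, d\mu$$

for all $E \in \mathcal{F}$.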
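On a finite space where every subset is an event, the Radon–Nikodym derivative is simply the ratio of point masses wherever the reference measure is positive. The sketch below builds it for two hypothetical measures and verifies the defining identity on every event; the point masses are made up, and the value assigned on the null point is an arbitrary choice, reflecting that the derivative is only determined almost everywhere.

```python
from fractions import Fraction
from itertools import chain, combinations

# Hypothetical point masses on Omega = {0, 1, 2, 3}.
mu = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(0), 3: Fraction(1, 4)}  # positive
nu = {0: Fraction(1), 1: Fraction(-1, 2), 2: Fraction(0), 3: Fraction(3)}       # signed


def measure(masses, event):
    """Measure of an event, as the sum of its point masses."""
    return sum((masses[w] for w in event), Fraction(0))


# On a finite space, nu << mu exactly when nu has no mass where mu has none.
assert all(nu[w] == 0 for w in mu if mu[w] == 0)

# Ratio of point masses; the value on the mu-null point is arbitrary, we pick 0.
derivative = {w: nu[w] / mu[w] if mu[w] != 0 else Fraction(0) for w in mu}


def integral(g, masses, event):
    """Integral of g over the event with respect to the given point masses."""
    return sum((g[w] * masses[w] for w in event), Fraction(0))


# Check nu(E) = integral over E of (d nu / d mu) d mu for all 16 events E.
events = chain.from_iterable(combinations(mu, r) for r in range(len(mu) + 1))
identity_holds = all(measure(nu, e) == integral(derivative, mu, e) for e in events)
print(identity_holds)  # True
```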

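We fix a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ and a random variable $X$. If the law $\mathbb{P}_X$ is absolutely continuous with respect to a $\sigma$-finite positive measure $\mu$ on $(\mathbb{R}, \mathcal{B}_{\mathbb{R}})$, then the Radon–Nikodym theorem implies that

$$\mathbb{P}(X \leq x) = \mathbb{P}_X((-\infty, x]) = \int_{(-\infty, x]} \frac{d\mathbb{P}_X}{d\mu} \, d\mu$$

for all real numbers $x$. If $X$ is discrete, then choosing the counting measure

$$\mu(E) = \#\bigl(E \cap X(\Omega)\bigr)$$

gives us absolute continuity. Otherwise, we typically shoot for absolute continuity with respect to $\lambda$, the Lebesgue measure on the real line.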

2. Conditioning and Independence on a Discrete Probability Space

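Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a discrete probability space, i.e., the sample space $\Omega$ is countable. Given two events $A, B \in \mathcal{F}$, what should the probability of $A$ given $B$ be? To answer this question, we construct a new measurable space where event $B$ always happens. To do this, we take $\Omega_B = B$ and

$$\mathcal{F}_B = \{E \cap B : E \in \mathcal{F}\}.$$

The construction implies that each $E' \in \mathcal{F}_B$ admits $E \in \mathcal{F}$ such that $E' = E \cap B$. It thus makes sense to define the new probability measure $\mathbb{P}_B$ on $(\Omega_B, \mathcal{F}_B)$ in terms of $\mathbb{P}$. We take the normalization

$$\mathbb{P}_B(E \cap B) = \frac{\mathbb{P}(E \cap B)}{\mathbb{P}(B)}$$

so that $\mathbb{P}_B$ is a bona fide probability measure on $(\Omega_B, \mathcal{F}_B)$.

With this construction, it would be reasonable to say that the probability of $A$ given $B$ is

$$\mathbb{P}(A \mid B) = \mathbb{P}_B(A \cap B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)},$$

provided that $\mathbb{P}(B) > 0$. If $\mathbb{P}(B) = 0$, then conditioning is meaningless, as $\mathbb{P}(A \cap B)$ would have to be 0 for any event $A$.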
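The construction is easy to carry out on a small discrete space. The sketch below assumes a hypothetical experiment of rolling two fair dice, with the uniform probability measure on the 36 outcomes; the events `A` and `B` and the helper names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Hypothetical example: two fair dice, all 36 outcomes equally likely.
omega = set(product(range(1, 7), repeat=2))


def prob(event):
    """Uniform probability measure on omega."""
    return Fraction(len(event), len(omega))


def conditional(a, b):
    """P(A | B) = P(A ∩ B) / P(B), defined only when P(B) > 0."""
    if prob(b) == 0:
        raise ValueError("conditioning on an event of probability zero")
    return prob(a & b) / prob(b)


A = {w for w in omega if sum(w) == 8}      # the sum is 8
B = {w for w in omega if w[0] % 2 == 0}    # the first die is even

print(prob(A), conditional(A, B))  # 5/36 1/6
```

Here conditioning on $B$ changes the probability of $A$ from $5/36$ to $1/6$, while `conditional(B, B)` returns 1, reflecting that the normalized measure assigns total mass 1 to the new sample space $B$.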

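If the conditioning does not change the probability, then the events are said to be independent. Formally, $A$ and $B$ are independent if

$$\mathbb{P}(A \mid B) = \mathbb{P}(A).$$

This holds if and only if $\mathbb{P}(A \cap B) = \mathbb{P}(A) \, \mathbb{P}(B)$, which yields a natural $n$-fold generalization: events $A_1, \ldots, A_n$ are independent if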