Reservoir sampling (Draft)

8 m

In this post, we study reservoir sampling, a technique for randomly choosing a sample from a large list. In practical scenarios, the list is often so large that it does not fit into memory and is instead streamed. In other words, we only have one-time access to each element.

Knuth attributes reservoir sampling to Alan G. Waterman in Volume 2, Section 3.4.1 of The Art of Computer Programming, but he does not provide a reference, and there is little information available on the matter. I have sent an inquiry to Knuth, who only accepts snail mail:

1. Sampling One Node From a Long Linked List

Recall that a linked list is a collection of nodes that contain a piece of data and a pointer to the next node:

Let us consider the problem of picking a node uniformly at random, i.e., each node has equal probability of being chosen. To make the problem interesting, let us assume that we do not know the length $n$ of the linked list.

One solution, of course, is to traverse the linked list once to find out the value of $n$, choose an integer $k$ between $0$ and $n-1$ uniformly at random, and traverse the first $k$ nodes to access the $k$th node. While correct, this solution requires two passes through the linked list, which may not be practical for extremely large lists.

To perform the selection in one pass, we make use of the following algorithm:

Algorithm 1.1 (One-node reservoir sampling). We enter through the head node and set node_selected to be the head node. For each $i \geq 1$, we choose an integer $k$ between $0$ and $i$, uniformly at random. We set node_selected to be the $i$th node if $k$ is 0; we leave node_selected unchanged for all other choices of $k$.

The algorithm performs selection in one pass and terminates at the end of the list. It follows that the time complexity of the algorithm is $\Theta(n)$, assuming that selecting an integer uniformly at random can be done in constant time.

We shall show that the probability that node_selected ends up being the $i$th node for some fixed index $i$ is $1/n$. In other words, one-node reservoir sampling chooses a node uniformly at random.

To this end, we recall that $\mathbb{P}[E]$ denotes the probability of event $E$ occurring, and that $\mathbb{P}[E \mid F]$ denotes the probability of event $E$ occurring assuming that event $F$ occurs. We have the identity

where $\mathbb{P}[E \cap F]$ denotes the probability that both $E$ and $F$ occur.

For each $0 \leq j \leq n-1$, we let $S^i_{j}{}_{}$ be the event that node_selected equals the $i$th node when we pass through the $j$th node. Our goal is to show that $\mathbb{P}[S^i_{n-1}{}_{}]$, the probability that node_selected equals the $i$th node after we pass through all the nodes, is equal to $1/n$.

We first observe that

for all $j \geq i$, as node_selected cannot be the $i$th node after we pass through the $(j+1)$th node had it not been the $i$th node right before we pass through the $(j+1)$th node. It follows that

At the $i$th node, we set node_selected to be the $i$th node with probability $1/(i+1)$, as per the specification of one-node reservoir sampling. We thus have

Now, we fix $j > i$ and assume that node_selected equals the $i$th node right before we pass through the $j$th node. With this assumption, the probability that node_selected is set to be the $j$th node is $1/(j+1)$, as per the specification of one-node reservoir sampling. We thus see that the probability of node_selected remaining unchanged is $1 - 1/(j+1) = j/(j+1)$, i.e.,

It now follows that

Since our choice of index $i$ was arbitrary, we conclude that one-node reservoir sampling chooses a node uniformly at random. $\square$

An alternative algorithm popular on the internet is the following:

Algorithm 1.2. At each node, we pick a number from the interval $[0,1]$, uniformly at random. If the number is larger than all the random numbers picked so far, then we set node_selected to be the current node. Repeat iteratively until we reach the end of the list.

Albeit appealing, the algorithm is wrong for all but the $n = 1$ case.

To see this, we recall that the notion of uniform randomness on $[0,1]$ corresponds to the lengths of subintervals of $[0,1]$. Indeed, given $x \in [0,1]$, the probability that a number picked from $[0,1]$ uniformly at random is larger than $x$ is $1-x$.

Let us assume $n \geq 2$. For each $0 \leq j \leq n-1$, we let $x_{j}{}_{}$ denote the random number chosen from the interval $[0,1]$ upon reaching the $j$th node. We fix an index $i$ and let $x = \max(x_{0},\ldots,x_{i-1})$. There is probability $1-x$ that $x_{i}{}_{} > x$, and so

Now, we fix $j > i$ and assume that node_selected equals the $i$th node right before we pass through the $j$th node. With this assumption, the probability that node_selected remains unchanged is the probability that $x \geq x_{j}$, which is $1-x_{j}$. Therefore,

Since

we see that

It is not difficult to find values of $x,x_{i+1},\ldots,x_{j}$ such that the above product does not equal $1/n$. For example,

would do. $\square$

2. Selecting a Random Node From a Graph

The one-node reservoir sampling from Section 1 can be used to select a node, uniformly at random, from a connected graph.

Algorithm 2.1 (random node selection on a graph). Using any graph traversal algorithm, we construct a path that visits every node on the connected graph of interest. By keeping track of the visited nodes, we can construct a linked list of graph nodes where each node appears exactly one. We now apply one-node reservoir sampling to select a node, uniformly at random.

For example, the following algorithm selects a random node from a binary tree.

from random import choice

def random_node_selection(root_node):
return sample(root_node, root_node, 0)[1]

def sample(node, selected, prob):
# take care of the leaf nodes
if node is None:
return (node, selected, prob-1)

# one-node reservoir sampling
prob = prob + 1
random_ind = choice(range(prob))
if random_ind == 0:
selected = node
else:
selected = selected

# pre-order walk on the child nodes
(node, selected, prob) = sample(node.left, selected, prob+1)
(node, selected, prob) = sample(node.right, selected, prob+1)

return (node, selected, prob)


3. Sampling Several Nodes From a Long Linked List

We consider once again a long linked list

We shall generalize one-node reservoir sampling to obtain a method for sampling several nodes from the list. To select $m$ random elements, the algorithm below keeps a reservoir $R$ of $m$ elements that is updated throughout the scanning of the list. We assume, of course, that $n \gg m$.

Algorithm 3.1 (Reservoir sampling). We enter through the head node and set $R[i] = v_i$ for $0 \leq i \leq m-1$. For each index $% $, we pick a random integer $k$ between $0$ and $j$, uniformly at random. We set $R[k]$ to be $v_j$ if $% $; we leave $R$ unchanged for all other choices of $k$.

The algorithm performs selection in one pass and terminates at the end of the list. It follows that the time complexity of the algorithm is $\Theta(n)$, assuming that selecting an integer uniformly at random can be done in constant time.

How do we show that the above algorithm is correct? We first note that a simple modification of the correctness argument for one-node reservoir sampling proves that the probability of an arbitrary node being in the reservoir is always $1/n$.

To see this, we fix an index $0 \leq i \leq n-1$. For each $0 \leq j \leq n-1$, we let $S_j^i$ be the event that the reservoir $R$ contains $v_i$ when we pass through $v_j$. We have that

for all $j \geq i$, as $v_i$ cannot be in $R$ after we pass through $v_{j+1}{}_{}$ had it not been in the reservoir before we pass through the $j$th node. It follows that

At $v_i$, we replace an element of $R$ with $v_i$ with probability $\frac{m}{i+1}$. Therefore,

We now fix $j > i$ and assume that $v_i$ is in $R$ just prior to visiting the $j$th node.

Tags:

Categories:

Updated: