Reservoir sampling (Draft)

8 m

In this post, we study reservoir sampling, a technique for randomly choosing a sample from a large list. In practical scenarios, the list is often so large that it does not fit into memory and is instead streamed. In other words, we only have one-time access to each element.

Knuth attributes reservoir sampling to Alan G. Waterman in Volume 2, Section 3.4.1 of The Art of Computer Programming, but he does not provide a reference, and there is little information available on the matter. I have sent an inquiry to Knuth, who only accepts snail mail:

letter-to-knuth

1. Sampling One Node From a Long Linked List

Recall that a linked list is a collection of nodes that contain a piece of data and a pointer to the next node:

Let us consider the problem of picking a node uniformly at random, i.e., each node has equal probability of being chosen. To make the problem interesting, let us assume that we do not know the length of the linked list.

One solution, of course, is to traverse the linked list once to find out the value of , choose an integer between and uniformly at random, and traverse the first nodes to access the th node. While correct, this solution requires two passes through the linked list, which may not be practical for extremely large lists.

To perform the selection in one pass, we make use of the following algorithm:

Algorithm 1.1 (One-node reservoir sampling). We enter through the head node and set node_selected to be the head node. For each , we choose an integer between and , uniformly at random. We set node_selected to be the th node if is 0; we leave node_selected unchanged for all other choices of .

The algorithm performs selection in one pass and terminates at the end of the list. It follows that the time complexity of the algorithm is , assuming that selecting an integer uniformly at random can be done in constant time.

We shall show that the probability that node_selected ends up being the th node for some fixed index is . In other words, one-node reservoir sampling chooses a node uniformly at random.

To this end, we recall that denotes the probability of event occurring, and that denotes the probability of event occurring assuming that event occurs. We have the identity

where denotes the probability that both and occur.

For each , we let be the event that node_selected equals the th node when we pass through the th node. Our goal is to show that , the probability that node_selected equals the th node after we pass through all the nodes, is equal to .

We first observe that

for all , as node_selected cannot be the th node after we pass through the th node had it not been the th node right before we pass through the th node. It follows that

At the th node, we set node_selected to be the th node with probability , as per the specification of one-node reservoir sampling. We thus have

Now, we fix and assume that node_selected equals the th node right before we pass through the th node. With this assumption, the probability that node_selected is set to be the th node is , as per the specification of one-node reservoir sampling. We thus see that the probability of node_selected remaining unchanged is , i.e.,

It now follows that

Since our choice of index was arbitrary, we conclude that one-node reservoir sampling chooses a node uniformly at random.

An alternative algorithm popular on the internet is the following:

Algorithm 1.2. At each node, we pick a number from the interval , uniformly at random. If the number is larger than all the random numbers picked so far, then we set node_selected to be the current node. Repeat iteratively until we reach the end of the list.

Albeit appealing, the algorithm is wrong for all but the case.

To see this, we recall that the notion of uniform randomness on corresponds to the lengths of subintervals of . Indeed, given , the probability that a number picked from uniformly at random is larger than is .

Let us assume . For each , we let denote the random number chosen from the interval upon reaching the th node. We fix an index and let . There is probability that , and so

Now, we fix and assume that node_selected equals the th node right before we pass through the th node. With this assumption, the probability that node_selected remains unchanged is the probability that , which is . Therefore,

Since

we see that

It is not difficult to find values of such that the above product does not equal . For example,

would do.

2. Selecting a Random Node From a Graph

The one-node reservoir sampling from Section 1 can be used to select a node, uniformly at random, from a connected graph.

Algorithm 2.1 (random node selection on a graph). Using any graph traversal algorithm, we construct a path that visits every node on the connected graph of interest. By keeping track of the visited nodes, we can construct a linked list of graph nodes where each node appears exactly one. We now apply one-node reservoir sampling to select a node, uniformly at random.

For example, the following algorithm selects a random node from a binary tree.

from random import choice


def random_node_selection(root_node):
    return sample(root_node, root_node, 0)[1]

def sample(node, selected, prob):
    # take care of the leaf nodes
    if node is None:
        return (node, selected, prob-1)

    # one-node reservoir sampling
    prob = prob + 1
    random_ind = choice(range(prob))
    if random_ind == 0:
        selected = node
    else:
        selected = selected

    # pre-order walk on the child nodes
    (node, selected, prob) = sample(node.left, selected, prob+1)
    (node, selected, prob) = sample(node.right, selected, prob+1)

    return (node, selected, prob)

3. Sampling Several Nodes From a Long Linked List

We consider once again a long linked list

We shall generalize one-node reservoir sampling to obtain a method for sampling several nodes from the list. To select random elements, the algorithm below keeps a reservoir of elements that is updated throughout the scanning of the list. We assume, of course, that .

Algorithm 3.1 (Reservoir sampling). We enter through the head node and set for . For each index , we pick a random integer between and , uniformly at random. We set to be if ; we leave unchanged for all other choices of .

The algorithm performs selection in one pass and terminates at the end of the list. It follows that the time complexity of the algorithm is , assuming that selecting an integer uniformly at random can be done in constant time.

How do we show that the above algorithm is correct? We first note that a simple modification of the correctness argument for one-node reservoir sampling proves that the probability of an arbitrary node being in the reservoir is always .

To see this, we fix an index . For each , we let be the event that the reservoir contains when we pass through . We have that

for all , as cannot be in after we pass through had it not been in the reservoir before we pass through the th node. It follows that

At , we replace an element of with with probability . Therefore,

We now fix and assume that is in just prior to visiting the th node.