<h1 id="reservoir-sampling-draft">Reservoir Sampling (Draft)</h1>
<p><em>Mark Hyun-ki Kim, 2017-10-14, <a href="https://markhkim.com/foundtechnicalities/reservoir-sampling">https://markhkim.com/foundtechnicalities/reservoir-sampling</a></em></p>
<p>In this post, we study <strong><em>reservoir sampling</em></strong>, a technique for randomly choosing a sample from a large list. In practical scenarios, the list is often so large that it does not fit into memory and is instead <a href="https://en.wikipedia.org/wiki/Data_stream_mining">streamed</a>. In other words, we only have one-time access to each element.</p>
<p>Knuth attributes reservoir sampling to Alan G. Waterman in Volume 2 of <em>The Art of Computer Programming</em>, but he does not provide a reference, and there is little information available on the matter. I have sent an inquiry to Knuth, who only accepts <a href="http://www-cs-faculty.stanford.edu/~knuth/email.html">snail mail</a>:</p>
<p><img src="https://markhkim.com/uploads/found-technicalities/reservoir-sampling/letter-to-knuth.jpg" alt="letter-to-knuth" /></p>
<p><a name="1"></a></p>
<h2 id="1-sampling-one-node-from-a-long-linked-list">1. Sampling One Node From a Long Linked List</h2>
<p>Recall that a <strong><em>linked list</em></strong> is a collection of nodes that contain a piece of data and a pointer to the next node:</p>
<script type="math/tex; mode=display">\underbrace{\boxed{v_{0}}}_{\text{head}} \to \boxed{v_{1}} \to \cdots \to \boxed{v_{n-2}} \to \underbrace{\boxed{v_{n-1}}}_{\text{tail}}</script>
<p>Let us consider the problem of picking a node uniformly at random, i.e., each node has equal probability of being chosen. To make the problem interesting, let us assume that we do not know the length <script type="math/tex">n</script> of the linked list.</p>
<p>One solution, of course, is to traverse the linked list once to find out the value of <script type="math/tex">n</script>, choose an integer <script type="math/tex">k</script> between <script type="math/tex">0</script> and <script type="math/tex">n-1</script> uniformly at random, and traverse the list a second time to access the <script type="math/tex">k</script>th node. While correct, this solution requires two passes through the linked list, which is not possible when the list is streamed and each element can be read only once.</p>
<p>To perform the selection in one pass, we make use of the following algorithm:</p>
<blockquote>
<p><a name="algorithm-1-1"></a><strong>Algorithm 1.1</strong> (One-node reservoir sampling). We enter through the head node and set <code class="highlighter-rouge">node_selected</code> to be the head node. For each <script type="math/tex">i \geq 1</script>, upon reaching the <script type="math/tex">i</script>th node, we choose an integer <script type="math/tex">k</script> between <script type="math/tex">0</script> and <script type="math/tex">i</script>, inclusive, uniformly at random, so that there are <script type="math/tex">i+1</script> possible choices. We set <code class="highlighter-rouge">node_selected</code> to be the <script type="math/tex">i</script>th node if <script type="math/tex">k</script> is 0; we leave <code class="highlighter-rouge">node_selected</code> unchanged for all other choices of <script type="math/tex">k</script>.</p>
</blockquote>
<p>The algorithm performs selection in one pass and terminates at the end of the list. It follows that the time complexity of the algorithm is <script type="math/tex">\Theta(n)</script>.</p>
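<p>The algorithm can be sketched in a few lines of Python. The <code class="highlighter-rouge">Node</code> class below is a hypothetical bare-bones linked-list node, not part of any library:</p>

```python
from random import randrange

class Node:
    """A minimal singly linked list node."""
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def reservoir_sample_one(head):
    """Return a node of the list chosen uniformly at random, in one pass."""
    node_selected = head          # the 0th node starts out selected
    node, i = head.next, 1
    while node is not None:
        # replace the current selection with probability 1/(i + 1)
        if randrange(i + 1) == 0:
            node_selected = node
        node, i = node.next, i + 1
    return node_selected
```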
<p>We shall show that the probability that <code class="highlighter-rouge">node_selected</code> ends up being the <script type="math/tex">i</script>th node for some fixed index <script type="math/tex">i</script> is <script type="math/tex">1/n</script>. In other words, <a href="#algorithm-1-1">one-node reservoir sampling</a> chooses a node uniformly at random.</p>
<p>To this end, we recall that <script type="math/tex">\mathbb{P}[E]</script> denotes the probability of event <script type="math/tex">E</script> occurring, and that <script type="math/tex">\mathbb{P}[E \mid F]</script> denotes the probability of event <script type="math/tex">E</script> occurring assuming that event <script type="math/tex">F</script> occurs. We have the identity</p>
<script type="math/tex; mode=display">\mathbb{P}[E \mid F] = \frac{\mathbb{P}[E \cap F]}{\mathbb{P}[F]},</script>
<p>where <script type="math/tex">\mathbb{P}[E \cap F]</script> denotes the probability that both <script type="math/tex">E</script> and <script type="math/tex">F</script> occur.</p>
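<p>For example, if we roll a fair die, let <script type="math/tex">E</script> be the event that the roll is even, and let <script type="math/tex">F</script> be the event that the roll exceeds 3, then <script type="math/tex">\mathbb{P}[E \cap F] = \mathbb{P}[\{4,6\}] = 1/3</script> and <script type="math/tex">\mathbb{P}[F] = 1/2</script>, whence</p>
<script type="math/tex; mode=display">\mathbb{P}[E \mid F] = \frac{1/3}{1/2} = \frac{2}{3}.</script>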
<p>For each <script type="math/tex">0 \leq j \leq n-1</script>, we let <script type="math/tex">S^i_{j}</script> be the event that <code class="highlighter-rouge">node_selected</code> equals the <script type="math/tex">i</script>th node when we pass through the <script type="math/tex">j</script>th node. Our goal is to show that <script type="math/tex">\mathbb{P}[S^i_{n-1}]</script>, the probability that <code class="highlighter-rouge">node_selected</code> equals the <script type="math/tex">i</script>th node after we pass through all the nodes, is equal to <script type="math/tex">1/n</script>.</p>
<p>We first observe that</p>
<script type="math/tex; mode=display">S^i_{j} \cap S^i_{j+1} = S^i_{j+1}</script>
<p>for all <script type="math/tex">j \geq i</script>, as <code class="highlighter-rouge">node_selected</code> cannot be the <script type="math/tex">i</script>th node after we pass through the <script type="math/tex">(j+1)</script>th node had it not been the <script type="math/tex">i</script>th node right before we pass through the <script type="math/tex">(j+1)</script>th node. It follows that</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\mathbb{P}[S^i_{n-1}]
&= \frac{\mathbb{P}[S^i_{n-1}]}{\mathbb{P}[S^i_{n-2}]}
\cdot \frac{\mathbb{P}[S^i_{n-2}]}{\mathbb{P}[S^i_{n-3}]}
\cdots \frac{\mathbb{P}[S^i_{i+1}]}{\mathbb{P}[S^i_{i}]}
\cdot \mathbb{P}[S^i_{i}] \\
&= \frac{\mathbb{P}[S^i_{n-1} \cap S^i_{n-2}]}{\mathbb{P}[S^i_{n-2}]}
\cdot \frac{\mathbb{P}[S^i_{n-2} \cap S^i_{n-3}]}{\mathbb{P}[S^i_{n-3}]}
\cdots \frac{\mathbb{P}[S^i_{i+1} \cap S^i_{i}]}{\mathbb{P}[S^i_{i}]}
\cdot \mathbb{P}[S^i_{i}] \\
&= \mathbb{P}[S^i_{n-1} \mid S^i_{n-2}]
\cdot \mathbb{P}[S^i_{n-2} \mid S^i_{n-3}]
\cdots \mathbb{P}[S^i_{i+1} \mid S^i_{i}]
\cdot \mathbb{P}[S^i_{i}].
\end{align*} %]]></script>
<p>At the <script type="math/tex">i</script>th node, we set <code class="highlighter-rouge">node_selected</code> to be the <script type="math/tex">i</script>th node with probability <script type="math/tex">1/(i+1)</script>, as <script type="math/tex">k</script> is chosen uniformly at random from the <script type="math/tex">i+1</script> integers <script type="math/tex">0, 1, \ldots, i</script>, as per the specification of <a href="#algorithm-1-1">one-node reservoir sampling</a>. We thus have</p>
<script type="math/tex; mode=display">\mathbb{P}[S^i_{i}] = \frac{1}{i+1}.</script>
<p>Note that the formula also covers the head node: for <script type="math/tex">i = 0</script>, the head is selected with certainty, and <script type="math/tex">1/(i+1) = 1</script>.</p>
<p>Now, we fix <script type="math/tex">j > i</script> and assume that <code class="highlighter-rouge">node_selected</code> equals the <script type="math/tex">i</script>th node right before we pass through the <script type="math/tex">j</script>th node. With this assumption, the probability that <code class="highlighter-rouge">node_selected</code> is set to be the <script type="math/tex">j</script>th node is <script type="math/tex">1/(j+1)</script>, as per the specification of <a href="#algorithm-1-1">one-node reservoir sampling</a>. We thus see that the probability of <code class="highlighter-rouge">node_selected</code> remaining unchanged is <script type="math/tex">1 - 1/(j+1) = j/(j+1)</script>, i.e.,</p>
<script type="math/tex; mode=display">\mathbb{P}[S^i_{j} \mid S^i_{j-1}] = \frac{j}{j+1}.</script>
<p>It now follows that</p>
<script type="math/tex; mode=display">\mathbb{P}[S^i_{n-1}]
= \frac{n-1}{n} \cdot \frac{n-2}{n-1} \cdots \frac{i+1}{i+2} \cdot \frac{1}{i+1}
= \frac{1}{n}.</script>
<p>Since our choice of index <script type="math/tex">i</script> was arbitrary, we conclude that <a href="#algorithm-1-1">one-node reservoir sampling</a> chooses a node uniformly at random. <script type="math/tex">\square</script></p>
<p>An alternative algorithm popular on the internet is the following:</p>
<blockquote>
<p><strong>Algorithm 1.2.</strong> At each node, we pick a number from the interval <script type="math/tex">[0,1]</script>, uniformly at random. If the number is larger than all the random numbers picked so far, then we set <code class="highlighter-rouge">node_selected</code> to be the current node. Repeat iteratively until we reach the end of the list.</p>
</blockquote>
<p>Albeit appealing, the algorithm is wrong for all but the <script type="math/tex">n = 1</script> case.</p>
<p>To see this, we recall that the notion of uniform randomness on <script type="math/tex">[0,1]</script> corresponds to the lengths of subintervals of <script type="math/tex">[0,1]</script>. Indeed, given <script type="math/tex">x \in [0,1]</script>, the probability that a number picked from <script type="math/tex">[0,1]</script> uniformly at random is larger than <script type="math/tex">x</script> is <script type="math/tex">1-x</script>.</p>
<p>Let us assume <script type="math/tex">n \geq 2</script>. For each <script type="math/tex">0 \leq j \leq n-1</script>, we let <script type="math/tex">x_{j}</script> denote the random number chosen from the interval <script type="math/tex">[0,1]</script> upon reaching the <script type="math/tex">j</script>th node. We fix an index <script type="math/tex">i</script> and let <script type="math/tex">x = \max(x_{0},\ldots,x_{i-1})</script>. There is probability <script type="math/tex">1-x</script> that <script type="math/tex">x_{i} > x</script>, and so</p>
<script type="math/tex; mode=display">\mathbb{P}[S^i_{i}] = 1-x.</script>
<p>Now, we fix <script type="math/tex">j > i</script> and assume that <code class="highlighter-rouge">node_selected</code> equals the <script type="math/tex">i</script>th node right before we pass through the <script type="math/tex">j</script>th node. With this assumption, the probability that <code class="highlighter-rouge">node_selected</code> remains unchanged is the probability that <script type="math/tex">x \geq x_{j}</script>, which is <script type="math/tex">1-x_{j}</script>. Therefore,</p>
<script type="math/tex; mode=display">\mathbb{P}[S^i_{j} \mid S^i_{j-1}] = 1-x_{j}.</script>
<p>Since</p>
<script type="math/tex; mode=display">\mathbb{P}[S^i_{n-1}]
= \mathbb{P}[S^i_{n-1} \mid S^i_{n-2}]
\cdot \mathbb{P}[S^i_{n-2} \mid S^i_{n-3}]
\cdots \mathbb{P}[S^i_{i+1} \mid S^i_{i}]
\cdot \mathbb{P}[S^i_{i}],</script>
<p>we see that</p>
<script type="math/tex; mode=display">\mathbb{P}[S^i_{n-1}] = (1-x_{i+1})(1-x_{i+2}) \cdots (1-x_{n-1})(1-x).</script>
<p>It is not difficult to find values of <script type="math/tex">x,x_{i+1},\ldots,x_{n-1}</script> such that the above product does not equal <script type="math/tex">1/n</script>. For example,</p>
<script type="math/tex; mode=display">x = x_{i+1} = \cdots = x_{n-1} = 1 - \frac{1}{n}</script>
<p>would do. <script type="math/tex">\square</script></p>
<p><a name="2"></a></p>
<h2 id="2-selecting-a-random-node-from-a-graph">2. Selecting a Random Node From a Graph</h2>
<p>The <a href="#algorithm-1-1">one-node reservoir sampling</a> from <a href="#1">Section 1</a> can be used to select a node, uniformly at random, from a connected graph.</p>
<blockquote>
<p><strong>Algorithm 2.1</strong> (random node selection on a graph). Using any graph traversal algorithm, we construct a path that visits every node of the connected graph of interest. By keeping track of the visited nodes, we can construct a linked list of graph nodes in which each node appears exactly once. We now apply <a href="#algorithm-1-1">one-node reservoir sampling</a> to select a node, uniformly at random.</p>
</blockquote>
<p>For example, the following algorithm selects a random node from a binary tree.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code>from random import choice

def random_node_selection(root_node):
    # the selected node is the second component of the returned pair
    return sample(root_node, None, 0)[1]

def sample(node, selected, count):
    # empty subtrees contribute no nodes
    if node is None:
        return (count, selected)
    # one-node reservoir sampling: the node visited (count + 1)th overall
    # replaces the current selection with probability 1/(count + 1)
    count = count + 1
    if choice(range(count)) == 0:
        selected = node
    # pre-order walk on the child nodes
    (count, selected) = sample(node.left, selected, count)
    (count, selected) = sample(node.right, selected, count)
    return (count, selected)
</code></pre>
</div>
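<p>Recursion is not essential here: the same traversal-plus-sampling idea can be written iteratively with an explicit stack. The following self-contained sketch assumes a hypothetical <code class="highlighter-rouge">TreeNode</code> class standing in for whatever node type the tree actually uses:</p>

```python
from random import randrange

class TreeNode:
    """A minimal binary tree node (hypothetical stand-in)."""
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

def random_tree_node(root):
    """Return a node of the tree chosen uniformly at random, in one pass."""
    selected, count = None, 0
    stack = [root]
    while stack:
        node = stack.pop()
        if node is None:
            continue
        # one-node reservoir sampling over the traversal order
        count += 1
        if randrange(count) == 0:
            selected = node
        stack.append(node.left)
        stack.append(node.right)
    return selected
```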
<p><a name="3"></a></p>
<h2 id="3-sampling-several-nodes-from-a-long-linked-list">3. Sampling Several Nodes From a Long Linked List</h2>
<p>We consider once again a long linked list</p>
<script type="math/tex; mode=display">\boxed{v_{0}} \to \boxed{v_{1}} \to \cdots \to \boxed{v_{n-2}} \to \boxed{v_{n-1}}</script>
<p>We shall generalize <a href="#algorithm-1-1">one-node reservoir sampling</a> to obtain a method for sampling several nodes from the list.</p>
<h1 id="how-not-to-lie-with-statistics-averages">How Not To Lie With Statistics: Averages</h1>
<p><em>Mark Hyun-ki Kim, 2017-09-22, <a href="https://markhkim.com/foundtechnicalities/how-not-to-lie-with-statistics-averages">https://markhkim.com/foundtechnicalities/how-not-to-lie-with-statistics-averages</a></em></p>
<p>The <strong>average</strong> of numbers, defined to be the sum of the numbers divided by the count of numbers being summed, is a familiar way of extracting a single-number summary of a numerical dataset. While popular, averaging can often lead to misleading observations about the data at hand.</p>
<p>As an example, let us take a look at the <a href="http://databank.worldbank.org/data/reports.aspx?source=world-development-indicators">income statistics across the world</a>. The following table shows per capita income of three countries:</p>
<table>
<thead>
<tr>
<th>Year</th>
<th>Zimbabwe</th>
<th>Russia</th>
<th>Singapore</th>
</tr>
</thead>
<tbody>
<tr>
<td>2000</td>
<td>$502.95</td>
<td>$1304.7</td>
<td>$20318</td>
</tr>
<tr>
<td>2002</td>
<td>$471.56</td>
<td>$1822.8</td>
<td>$17942</td>
</tr>
<tr>
<td>2004</td>
<td>$409.57</td>
<td>$3203.2</td>
<td>$21349</td>
</tr>
<tr>
<td>2006</td>
<td>$363.79</td>
<td>$5392.9</td>
<td>$27791</td>
</tr>
<tr>
<td>2008</td>
<td>$265.51</td>
<td>$9279.1</td>
<td>$32379</td>
</tr>
</tbody>
</table>
<p>Let’s take a look at the income statistics of Russia and Singapore in relation to Zimbabwe’s income statistics. Typically, this is done by declaring Zimbabwe’s statistics to be 1 and scaling the other statistics accordingly—a process known as <a href="https://en.wikipedia.org/wiki/Normalization_(statistics)">normalization</a>. In this case, we divide each year’s income statistics by Zimbabwe’s income per capita:</p>
<table>
<thead>
<tr>
<th>Year</th>
<th>Zimbabwe</th>
<th>Russia</th>
<th>Singapore</th>
</tr>
</thead>
<tbody>
<tr>
<td>2000</td>
<td>1</td>
<td>2.59</td>
<td>40.4</td>
</tr>
<tr>
<td>2002</td>
<td>1</td>
<td>3.87</td>
<td>38.0</td>
</tr>
<tr>
<td>2004</td>
<td>1</td>
<td>7.82</td>
<td>52.1</td>
</tr>
<tr>
<td>2006</td>
<td>1</td>
<td>14.8</td>
<td>76.4</td>
</tr>
<tr>
<td>2008</td>
<td>1</td>
<td>34.9</td>
<td>122.0</td>
</tr>
</tbody>
</table>
<p>Now, let’s take the average of each country’s income data, so we can compare them with ease:</p>
<table>
<thead>
<tr>
<th> </th>
<th>Zimbabwe</th>
<th>Russia</th>
<th>Singapore</th>
</tr>
</thead>
<tbody>
<tr>
<td>Average</td>
<td>1</td>
<td>12.8</td>
<td>65.8</td>
</tr>
</tbody>
</table>
<p>The above summary statistics suggest that Singapore’s income levels are approximately <script type="math/tex">65.8 \div 12.8 \approx 5</script> times those of Russia.</p>
<p>Is this true? Let us compare the income statistics of Russia and Singapore by computing the ratio of the income data from Singapore to those from Russia, both unnormalized:</p>
<table>
<thead>
<tr>
<th>Year</th>
<th>Singapore <script type="math/tex">\div</script> Russia</th>
</tr>
</thead>
<tbody>
<tr>
<td>2000</td>
<td>15.6</td>
</tr>
<tr>
<td>2002</td>
<td>9.84</td>
</tr>
<tr>
<td>2004</td>
<td>6.66</td>
</tr>
<tr>
<td>2006</td>
<td>5.15</td>
</tr>
<tr>
<td>2008</td>
<td>3.49</td>
</tr>
</tbody>
</table>
<p>The above computations reveal that Singapore’s income levels are, in fact, well above 5 times the income levels of Russia in most of the years in our dataset. Taking the average of the ratios yields 8.13, a more realistic summary statistic.</p>
<p>What is going on here? Normalizing with respect to Zimbabwe’s income data assigns different <em>weights</em> to the income statistics of Russia and Singapore. Since Zimbabwe’s per capita income in 2002 is higher than its income in 2008, the 2008 figures are divided by a smaller number and therefore carry more <em>weight</em> in the normalized dataset than the 2002 figures. This results in averages that do not quite reflect the true income levels of Russia and Singapore.</p>
<p>To explore this phenomenon further, we consider a substantially simpler dataset:</p>
<table>
<thead>
<tr>
<th> </th>
<th>Monday</th>
<th>Tuesday</th>
<th>Wednesday</th>
<th>Thursday</th>
<th>Friday</th>
</tr>
</thead>
<tbody>
<tr>
<td>Coffee</td>
<td>3 cups</td>
<td>1 cup</td>
<td>4 cups</td>
<td>5 cups</td>
<td>8 cups</td>
</tr>
</tbody>
</table>
<p>Here, the average number of cups of coffee consumed is</p>
<script type="math/tex; mode=display">(3 + 1 + 4 + 5 + 8) \div 5 = 4.2.</script>
<p>Now, let’s say a cup of coffee is usually 2 dollars. On Tuesdays, the café near work serves coffee brewed from special beans, so a cup of coffee costs twice as much. On Fridays, the café serves extra cheap coffee, at 50 cents a cup. So, the average amount of money spent on coffee is</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
(&3 \times \$2.00 + 1 \times \$4.00 + 4 \times \$2.00 \\
& +5 \times \$2.00 + 8 \times \$0.50) \div 5 = \$6.40,
\end{align*} %]]></script>
<p>which, at the regular price of $2.00 a cup, corresponds to 3.2 cups, closer to 3 cups than to the 4.2 cups computed above. This, as you can see, is the result of assigning different <em>weights</em> to Tuesday and Friday. For this reason, the average obtained by assigning (potentially different) weights to each item in a dataset is called the <strong>weighted average</strong>.</p>
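<p>The same computation, using the cup counts from the table above and the prices just described:</p>

```python
cups = [3, 1, 4, 5, 8]                   # Monday through Friday
prices = [2.00, 4.00, 2.00, 2.00, 0.50]  # dollars per cup on each day

# unweighted average number of cups per day
avg_cups = sum(cups) / len(cups)

# average daily spending: each day's count weighted by that day's price
avg_spent = sum(c * p for c, p in zip(cups, prices)) / len(cups)
```

<p>Here <code class="highlighter-rouge">avg_cups</code> works out to 4.2 cups and <code class="highlighter-rouge">avg_spent</code> to $6.40.</p>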
<p>What if we’re only given a normalized dataset? Let’s go back to the income dataset and assume that</p>
<table>
<thead>
<tr>
<th>Year</th>
<th>Zimbabwe</th>
<th>Russia</th>
<th>Singapore</th>
</tr>
</thead>
<tbody>
<tr>
<td>2000</td>
<td>1</td>
<td>2.59</td>
<td>40.4</td>
</tr>
<tr>
<td>2002</td>
<td>1</td>
<td>3.87</td>
<td>38.0</td>
</tr>
<tr>
<td>2004</td>
<td>1</td>
<td>7.82</td>
<td>52.1</td>
</tr>
<tr>
<td>2006</td>
<td>1</td>
<td>14.8</td>
<td>76.4</td>
</tr>
<tr>
<td>2008</td>
<td>1</td>
<td>34.9</td>
<td>122.0</td>
</tr>
</tbody>
</table>
<p>is all we have available. How do we compare the income levels of Russia and Singapore?</p>
<p>The answer is to compute the <strong>geometric mean</strong> instead of the average. The usual average, also known as the <strong>arithmetic mean</strong>, of <script type="math/tex">N</script> numbers is computed by taking the sum of all <script type="math/tex">N</script> numbers and then dividing the sum by <script type="math/tex">N</script>, the total count of the numbers. Since adding a number <script type="math/tex">x</script> <script type="math/tex">N</script> times is the same as multiplying <script type="math/tex">x</script> by <script type="math/tex">N</script>, the division by <script type="math/tex">N</script> makes sense.</p>
<p>In contrast, the geometric mean is computed by multiplying all <script type="math/tex">N</script> numbers and then taking the <a href="https://en.wikipedia.org/wiki/Nth_root"><script type="math/tex">N</script>th root</a> of the product. This is to be understood as the multiplicative analogue of the arithmetic mean. Indeed, multiplying a number <script type="math/tex">x</script> <script type="math/tex">N</script> times is the same as taking the <script type="math/tex">N</script>th power of <script type="math/tex">x</script>, and so the <script type="math/tex">N</script>th root operation, which reverses the exponentiation operation, is appropriate here.</p>
<table>
<thead>
<tr>
<th> </th>
<th>Russia</th>
<th>Singapore</th>
</tr>
</thead>
<tbody>
<tr>
<td>Geometric mean</td>
<td>8.35</td>
<td>59.5</td>
</tr>
</tbody>
</table>
<p>Now, the ratio of the two summary statistics is <script type="math/tex">59.5 \div 8.35 \approx 7</script>, a more reasonable value than the ratio of arithmetic means.</p>
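<p>Both means are easy to compute directly from the normalized table. The following sketch uses only the Python standard library; the function names are my own, not from any particular package:</p>

```python
from math import prod

def arithmetic_mean(xs):
    # add the N numbers, then divide by N
    return sum(xs) / len(xs)

def geometric_mean(xs):
    # multiply the N numbers, then take the Nth root
    return prod(xs) ** (1 / len(xs))

# normalized per capita income (Zimbabwe = 1), 2000 through 2008
russia = [2.59, 3.87, 7.82, 14.8, 34.9]
singapore = [40.4, 38.0, 52.1, 76.4, 122.0]
```

<p>These reproduce the arithmetic means 12.8 and 65.8 and the geometric means 8.35 and 59.5 quoted above.</p>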
<p>The difference lies in the fact that multiplication plays nicely with itself, whereas addition does not mix as well with multiplication.</p>
<p>As we have seen, taking the arithmetic mean of normalized values is equivalent to taking a weighted average, which is a multiply-then-add operation. Once the weights are assigned and the resulting weighted values averaged away, there is no easy way to get rid of them. Removing the weights requires division—which is a form of multiplication, after all—and we cannot switch the order of multiplication and addition.</p>
<p>On the other hand, computing the geometric mean of normalized values is a divide-then-multiply operation. Since division is equivalent to multiplication (by the reciprocal), the entire operation consists of a sequence of multiplications, whose orders we can swap without changing the final answer. This is known as the <a href="https://en.wikipedia.org/wiki/Commutative_property">commutative property of multiplication</a>.</p>
<p>In summary, it is best to resist the temptation to take averages right away when faced with a task of comparing data about multiple items. Normalization can easily render arithmetic means meaningless, and geometric means perform far better in such cases.</p>
<p>As a matter of fact, it is possible to <em>prove</em> that the geometric mean is the only correct mean to use when averaging normalized values. If you are interested, take a look at Fleming/Wallace, “<a href="http://citeseerx.ist.psu.edu/viewdoc/download?rep=rep1&type=pdf&doi=10.1.1.174.8565">How Not To Lie With Statistics: The Correct Way To Summarize Benchmark Results</a>” (<em>Communications of the ACM</em>, 1986) for details.</p>
<p><em>Thanks to <a href="http://ahnheejong.name/">Ahn Heejong</a> for corrections!</em></p>
<h1 id="conditioning-in-measure-theoretic-probability-draft">Conditioning in Measure-Theoretic Probability (Draft)</h1>
<p><em>Mark Hyun-ki Kim, 2017-09-18, <a href="https://markhkim.com/foundtechnicalities/conditioning-in-measure-theoretic-probability">https://markhkim.com/foundtechnicalities/conditioning-in-measure-theoretic-probability</a></em></p>
<p>A foundational concept introduced in all courses on measure-theoretic probability is the <strong>conditional expectation</strong>, which generalizes the discrete-probability notion of the average of all outcomes of a random variable within an event. A typical course, however, moves on to martingale theory without investing much time in generalizing the other conditional constructs from discrete probability theory, often omitting them entirely. We develop the missing measure-theoretic generalizations in this post.</p>
<p><a name="1"></a></p>
<h2 id="1-review-of-measure-theoretic-terminology">1. Review of Measure-Theoretic Terminology</h2>
<p>A <strong>measure space</strong> is an ordered triple <script type="math/tex">(\Omega, \mathcal{F}, \mu)</script> consisting of a set <script type="math/tex">\Omega</script> denoting the sample space, a <strong><script type="math/tex">\sigma</script>-algebra</strong> <script type="math/tex">\mathcal{F}</script> of events, and a <strong>measure</strong> <script type="math/tex">\mu</script> on the measurable space <script type="math/tex">(\Omega, \mathcal{F})</script>. If <script type="math/tex">\mathbb{P} = \mu</script> is a probability measure, then we say that <script type="math/tex">(\Omega,\mathcal{F},\mathbb{P})</script> is a <strong>probability space</strong>.</p>
<p>By a <script type="math/tex">\sigma</script>-algebra, we mean a set <script type="math/tex">\mathcal{F}</script> of subsets of <script type="math/tex">\Omega</script> that satisfies the following properties:</p>
<ol>
<li>The full set <script type="math/tex">\Omega</script> and the empty set <script type="math/tex">\varnothing</script> are elements of <script type="math/tex">\mathcal{F}</script>;</li>
<li>if <script type="math/tex">E</script> is an element of <script type="math/tex">\mathcal{F}</script>, then its complement <script type="math/tex">\Omega \smallsetminus E</script> is an element of <script type="math/tex">\mathcal{F}</script>;</li>
<li>if <script type="math/tex">\{E_{n}\}_{n=1}^{\infty}</script> is a collection of sets in <script type="math/tex">\mathcal{F}</script>, then its union <script type="math/tex">\bigcup_{n=1}^\infty E_{n}</script> and its intersection <script type="math/tex">\bigcap_{n=1}^\infty E_n</script> are elements of <script type="math/tex">\mathcal{F}</script>.</li>
</ol>
<p>The ordered pair <script type="math/tex">(\Omega, \mathcal{F})</script> of a set and a <script type="math/tex">\sigma</script>-algebra on it is called a <strong>measurable space</strong>, because we can define a measure on it. A <strong>measure</strong> on <script type="math/tex">(\Omega, \mathcal{F})</script> is a function <script type="math/tex">\mu:\mathcal{F} \to [0,\infty]</script> with <script type="math/tex">\mu(\varnothing) = 0</script> such that the <strong>countable additivity</strong> criterion</p>
<script type="math/tex; mode=display">\mu\left(\bigcup_{n=1}^\infty E_n \right) = \sum_{n=1}^\infty \mu(E_n)</script>
<p>holds whenever <script type="math/tex">\{E_{n}\}_{n=1}^\infty</script> is a disjoint collection of events. If, in addition, <script type="math/tex">\mu(\Omega) = 1</script>, then we say that <script type="math/tex">\mu</script> is a <strong>probability measure</strong> on <script type="math/tex">(\Omega,\mathcal{F})</script>.</p>
<p>We say that a property of a probability space <script type="math/tex">(\Omega, \mathcal{F},\mathbb{P})</script> holds <strong>almost surely</strong> if it holds on <script type="math/tex">\Omega \smallsetminus E</script> for an event <script type="math/tex">E</script> of measure zero. The corresponding term for a general measure space is <strong>almost everywhere</strong>. A function <script type="math/tex">g</script> that is equal almost everywhere to another function <script type="math/tex">f</script> is said to be a <strong>version</strong> of <script type="math/tex">f</script>.</p>
<p>Given a measurable space <script type="math/tex">(\Omega,\mathcal{F})</script>, we define a <strong><script type="math/tex">\sigma</script>-subalgebra</strong> of <script type="math/tex">\mathcal{F}</script> to be a subset of <script type="math/tex">\mathcal{F}</script> that is also a <script type="math/tex">\sigma</script>-algebra on <script type="math/tex">\Omega</script>. The <strong><script type="math/tex">\sigma</script>-subalgebra of <script type="math/tex">\mathcal{F}</script> generated by a collection <script type="math/tex">\mathscr{C} \subseteq \mathcal{F}</script></strong> is the intersection of all <script type="math/tex">\sigma</script>-subalgebras of <script type="math/tex">\mathcal{F}</script> containing <script type="math/tex">\mathscr{C}</script>.</p>
<p>If no ambient <script type="math/tex">\sigma</script>-algebra is given, we define the <strong><script type="math/tex">\sigma</script>-algebra on <script type="math/tex">\Omega</script> generated by a collection <script type="math/tex">\mathscr{C}</script> of subsets of <script type="math/tex">\Omega</script></strong> to be the intersection of all <script type="math/tex">\sigma</script>-algebras on <script type="math/tex">\Omega</script> containing <script type="math/tex">\mathscr{C}</script>. The intersection is well-defined, as the power set <script type="math/tex">\mathcal{P}(\Omega)</script> is always a <script type="math/tex">\sigma</script>-algebra on <script type="math/tex">\Omega</script>.</p>
<p>A useful tool for constructing measure spaces is the <strong>Carathéodory extension theorem</strong>, which states that a countably additive function <script type="math/tex">\mu_{0}</script> on a collection <script type="math/tex">\mathcal{A}</script> of subsets of a sample space <script type="math/tex">\Omega</script> that is closed under finite unions, intersections, and complementation admits an extension <script type="math/tex">\mu</script> to <script type="math/tex">\sigma(\mathcal{A})</script>, the <script type="math/tex">\sigma</script>-algebra generated by <script type="math/tex">\mathcal{A}</script>.</p>
<p>Furthermore, the extension is unique if the measure is <strong><script type="math/tex">\sigma</script>-finite</strong>, i.e., if there exists a sequence of events of finite measure whose union is the entire sample space. In particular, any countably additive function that assigns measure 1 to the sample space has a unique extension to a probability measure.</p>
<p>A <strong>measurable function</strong> from a measurable space <script type="math/tex">(A,\mathcal{F})</script> to another measurable space <script type="math/tex">(B,\mathcal{G})</script> is a function <script type="math/tex">f:A \to B</script> such that <script type="math/tex">f^{-1}(E) \in \mathcal{F}</script> whenever <script type="math/tex">E \in \mathcal{G}</script>.</p>
<p>A <strong>random variable</strong> on a probability space <script type="math/tex">(\Omega,\mathcal{F},\mathbb{P})</script> is a measurable function from <script type="math/tex">(\Omega,\mathcal{F})</script> to <script type="math/tex">(\mathbb{R},\mathscr{B}_\mathbb{R})</script>, where <script type="math/tex">\mathscr{B}_{\mathbb{R}}</script> is the <strong>Borel <script type="math/tex">\sigma</script>-algebra</strong>, the <script type="math/tex">\sigma</script>-algebra generated by open sets. A random variable <script type="math/tex">X</script> is said to be <strong>discrete</strong> if the image of <script type="math/tex">X</script> is a countable set.</p>
<p>A <strong>random <script type="math/tex">n</script>-vector</strong> on <script type="math/tex">(\Omega,\mathcal{F},\mathbb{P})</script> is a measurable function from <script type="math/tex">(\Omega,\mathcal{F})</script> to <script type="math/tex">(\mathbb{R}^n,\mathscr{B}_{\mathbb{R}^n})</script>, where <script type="math/tex">\mathscr{B}_{\mathbb{R}^n}</script> is the Borel <script type="math/tex">\sigma</script>-algebra on <script type="math/tex">\mathbb{R}^n</script>.</p>
<p>The <strong>distribution</strong> of a random variable <script type="math/tex">X</script> is the real-valued function</p>
<script type="math/tex; mode=display">F_{X}(\alpha) = \mathbb{P}[X \leq \alpha] = \mathbb{P}[\{\omega : X(\omega) \leq \alpha\}].</script>
<p>The set function defined by the formula</p>
<script type="math/tex; mode=display">\mathscr{L}_{X}((-\infty, \alpha]) = F_{X}(\alpha)</script>
<p>can be extended to a probability measure on <script type="math/tex">(\mathbb{R},\mathscr{B}_\mathbb{R})</script>, called the <strong>law</strong> associated with the random variable <script type="math/tex">X</script>.</p>
<p>Conversely, any increasing, right-continuous function <script type="math/tex">F:\mathbb{R} \to [0,1]</script> such that <script type="math/tex">F(\alpha) \to 0</script> as <script type="math/tex">\alpha \to -\infty</script> and <script type="math/tex">F(\alpha) \to 1</script> as <script type="math/tex">\alpha \to \infty</script> admits a random variable <script type="math/tex">X</script> such that <script type="math/tex">F_{X} = F</script>. Since the extension of the set function</p>
<script type="math/tex; mode=display">dF((-\infty, \alpha]) = F(\alpha)</script>
<p>to a probability measure on <script type="math/tex">(\mathbb{R},\mathscr{B}_\mathbb{R})</script> agrees with <script type="math/tex">\mathscr{L}_{X}</script>, we conclude that there is a one-to-one correspondence between probability distributions and probability measures of the form <script type="math/tex">dF</script>, called the <strong>Lebesgue–Stieltjes measures</strong>.</p>
<p>A <strong>simple function</strong> on a probability space <script type="math/tex">(\Omega, \mathcal{F},\mathbb{P})</script> is a linear combination of indicator functions on events:</p>
<script type="math/tex; mode=display">s(\omega) = \sum_{i=1}^k a_{i} \boldsymbol{1}_{E_{i}}(\omega).</script>
<p>The <strong>expectation</strong> of <script type="math/tex">s</script> is the sum</p>
<script type="math/tex; mode=display">\mathbb{E}[s] = \sum_{i=1}^k a_{i} \mathbb{P}[E_{i}].</script>
<p>Given a nonnegative random variable <script type="math/tex">X</script> on <script type="math/tex">(\Omega,\mathcal{F},\mathbb{P})</script>, there exists a sequence <script type="math/tex">(s_n)_{n=1}^\infty</script> of simple functions such that <script type="math/tex">0 \leq s_{1} \leq s_{2} \leq \cdots \leq X</script> and that <script type="math/tex">s_{n} \to X</script> pointwise almost surely. We define</p>
<script type="math/tex; mode=display">\mathbb{E}[X] = \lim_{n \to \infty} \mathbb{E}[s_{n}].</script>
<p>In general, we can write a random variable <script type="math/tex">X</script> as the difference <script type="math/tex">X^+ - X^-</script>, where <script type="math/tex">X^+ = \max(X,0)</script> and <script type="math/tex">X^- = \max(-X,0)</script>. We can then define the expectation of <script type="math/tex">X</script> to be the sum</p>
<script type="math/tex; mode=display">\mathbb{E}[X] = \mathbb{E}[X^+] - \mathbb{E}[X^-].</script>
<p>This definition yields a linear functional on the space of all integrable random variables on <script type="math/tex">(\Omega,\mathcal{F},\mathbb{P})</script>. The construction of the <strong>Lebesgue integral</strong></p>
<script type="math/tex; mode=display">\int_\Omega f \, d\mu</script>
<p>of a real-valued measurable function <script type="math/tex">f</script> on a general measure space <script type="math/tex">(\Omega,\mathcal{F},\mu)</script> is analogous.</p>
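<p>To make the construction concrete, here is a minimal numerical sketch (an illustration of mine, not from the original post): take the sample space [0, 1] with a fine grid standing in for Lebesgue measure, the random variable X(&omega;) = &omega;&sup2; (so E[X] = 1/3), and the standard dyadic simple functions s<sub>n</sub>(x) = min(&lfloor;2&#8319;x&rfloor;/2&#8319;, n). The expectations E[s<sub>n</sub>(X)] increase toward E[X].</p>

```python
def s_n(x, n):
    # Level-n dyadic simple-function approximation of the identity,
    # truncated at height n; pointwise increasing in n, and s_n(x) <= x.
    return min(int(x * 2**n) / 2**n, n)

def expectation_of_simple(n, grid_size=100_000):
    # E[s_n(X)] as an average over a fine grid of Omega = [0, 1]
    # (a stand-in for Lebesgue measure); X(omega) = omega**2.
    total = 0.0
    for i in range(grid_size):
        omega = (i + 0.5) / grid_size
        total += s_n(omega**2, n)
    return total / grid_size

for n in (2, 4, 8, 12):
    print(n, expectation_of_simple(n))   # increases toward E[X] = 1/3
```

The approximation error of s<sub>n</sub> is at most 2<sup>&minus;n</sup> on [0, 1], so the expectations converge geometrically.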
<p>We record two computational identities regarding the expectation, which hold for each random variable <script type="math/tex">X</script> and every <script type="math/tex">(\mathscr{B}_\mathbb{R},\mathscr{B}_\mathbb{R})</script>-measurable (or, <strong>Borel measurable</strong> for short) function <script type="math/tex">g:\mathbb{R} \to \mathbb{R}</script>:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\mathbb{E}[g(X)] &= \int_{-\infty}^\infty g(\alpha) \, dF_X(\alpha) \\
\mathbb{E}[g(X)] &= \int_0^\infty \mathbb{P}[g(X) \geq t] \, dt
\end{align*} %]]></script>
<p>The integral in the first identity is to be understood as a Lebesgue–Stieltjes integral. The second identity holds when <script type="math/tex">g(X)</script> is nonnegative.</p>
<p>The second identity is a consequence of integration on <strong>product measure spaces</strong>.</p>
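<p>As a sanity check of the second (tail-integral) identity, consider an example of mine, not from the post: a fair die X and g(x) = x&sup2;. Since t &#8614; P[g(X) &ge; t] is a step function that only jumps at the values of g, the integral reduces to an exact finite sum, which can be compared against the direct computation of E[g(X)].</p>

```python
from fractions import Fraction

# Verify E[g(X)] = ∫_0^∞ P[g(X) >= t] dt for a fair die X and g(x) = x**2.
values = range(1, 7)
g = lambda x: x * x

direct = sum(Fraction(g(x), 6) for x in values)           # E[g(X)] = 91/6

# P[g(X) >= t] is constant between consecutive values of g, so the
# improper integral is an exact finite sum over the gaps between jumps.
jumps = sorted({0} | {g(x) for x in values})
tail = Fraction(0)
for lo, hi in zip(jumps, jumps[1:]):
    p = Fraction(sum(1 for x in values if g(x) > lo), 6)  # P[g(X) >= t] on (lo, hi]
    tail += p * (hi - lo)

print(direct, tail)  # both equal 91/6
```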
<p>Given two measure spaces <script type="math/tex">(A,\mathcal{F},\mu)</script> and <script type="math/tex">(B,\mathcal{G},\nu)</script>, we define a <strong>rectangle</strong> to be the cartesian product <script type="math/tex">P \times Q</script> for any <script type="math/tex">P \in \mathcal{F}</script> and <script type="math/tex">Q \in \mathcal{G}</script>. The <strong>product measure</strong> of a rectangle is defined to be the product</p>
<script type="math/tex; mode=display">(\mu \otimes \nu)(P \times Q) = \mu(P)\nu(Q).</script>
<p><script type="math/tex">\mu \otimes \nu</script> is countably additive on the algebra <script type="math/tex">\mathcal{R}</script> of all finite unions, intersections, and complementations of rectangles, whence the Carathéodory extension theorem furnishes a measure on the <script type="math/tex">\sigma</script>-algebra <script type="math/tex">\mathcal{F} \otimes \mathcal{G}</script> generated by <script type="math/tex">\mathcal{R}</script>.</p>
<p>If both <script type="math/tex">\mu</script> and <script type="math/tex">\nu</script> are <script type="math/tex">\sigma</script>-finite, then the product measure <script type="math/tex">\mu \otimes \nu</script> is <script type="math/tex">\sigma</script>-finite, and the <strong>Fubini–Tonelli theorem</strong> holds:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\int_{A \times B} f(x,y) d(\mu \otimes \nu)(x,y)
&= \int_A \int_B f(x,y) \, d\nu(y) \, d\mu(x) \\
&= \int_B \int_A f(x,y) \, d\mu(x) \, d\nu(y)
\end{align*} %]]></script>
<p>whenever <script type="math/tex">% <![CDATA[
\int_{A \times B} \vert f \vert \, d(\mu \otimes \nu) < \infty %]]></script> or <script type="math/tex">f</script> is nonnegative.</p>
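<p>A quick numerical illustration (the setup is mine: Lebesgue measure on the unit square, with midpoint Riemann sums standing in for the integrals): for the nonnegative integrand f(x, y) = xy&sup2;, both orders of iterated integration agree, and both approximate the double integral 1/6.</p>

```python
# Numerical check of Fubini–Tonelli on [0,1] x [0,1]:
# both iterated (midpoint Riemann) sums of the nonnegative integrand
# f(x, y) = x * y**2 agree, approximating 1/2 * 1/3 = 1/6.
N = 400
xs = [(i + 0.5) / N for i in range(N)]

f = lambda x, y: x * y * y

# Integrate over y first, then x.
inner_y = lambda x: sum(f(x, y) for y in xs) / N
order_xy = sum(inner_y(x) for x in xs) / N

# Integrate over x first, then y.
inner_x = lambda y: sum(f(x, y) for x in xs) / N
order_yx = sum(inner_x(y) for y in xs) / N

print(order_xy, order_yx)  # both approximately 1/6
```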
<p>Yet another useful computational device for the expectation is the <strong>probability density function</strong>, which is defined for each random variable <script type="math/tex">X</script> to be a function <script type="math/tex">f_{X}:\mathbb{R} \to \mathbb{R}</script> such that</p>
<script type="math/tex; mode=display">\mathbb{P}[a \leq X \leq b] = \int_{a}^b f_{X} \, d\mu</script>
<p>for all real numbers <script type="math/tex">a \leq b</script>. Here, <script type="math/tex">\mu</script> is a measure on <script type="math/tex">(\mathbb{R},\mathscr{B}_{\mathbb{R}})</script>.</p>
<p>A sufficient condition for the existence of a probability density function is established through the theory of signed measures.</p>
<p>A <strong>signed measure</strong> on a measurable space <script type="math/tex">(\Omega,\mathcal{F})</script> is a countably additive set function <script type="math/tex">\mu:\mathcal{F} \to (-\infty, \infty]</script>. To highlight the difference, a measure is sometimes called a <strong>positive measure</strong>. A signed measure <script type="math/tex">\mu</script> on <script type="math/tex">(\Omega,\mathcal{F})</script> is said to be <strong>absolutely continuous</strong> with respect to a positive measure <script type="math/tex">\nu</script> on <script type="math/tex">(\Omega,\mathcal{F})</script>, denoted by <script type="math/tex">\mu \ll \nu</script>, if <script type="math/tex">\mu(E) = 0</script> whenever <script type="math/tex">E \in \mathcal{F}</script> and <script type="math/tex">\nu(E) = 0</script>.</p>
<p>A useful special case of the <strong>Radon–Nikodym theorem</strong> (also known as the <strong>Lebesgue–Radon–Nikodym theorem</strong>) states the following: if a <script type="math/tex">\sigma</script>-finite signed measure <script type="math/tex">\mu</script> on a measurable space <script type="math/tex">(\Omega,\mathcal{F})</script> is absolutely continuous with respect to a <script type="math/tex">\sigma</script>-finite positive measure <script type="math/tex">\nu</script> on <script type="math/tex">(\Omega,\mathcal{F})</script>, then there exists a <script type="math/tex">\mathcal{F}</script>-measurable function <script type="math/tex">\frac{d\mu}{d\nu}:\Omega \to \mathbb{R}</script>, called the <strong>Radon–Nikodym derivative of <script type="math/tex">\mu</script> with respect to <script type="math/tex">\nu</script></strong>, such that</p>
<script type="math/tex; mode=display">\mu(E) = \int_{E} \frac{d\mu}{d\nu} \, d\nu</script>
<p>for all <script type="math/tex">E \in \mathcal{F}</script>.</p>
<p>We fix a probability space <script type="math/tex">(\Omega,\mathcal{F},\mathbb{P})</script> and a random variable <script type="math/tex">X</script>. If the law <script type="math/tex">\mathscr{L}_{X}</script> is absolutely continuous with respect to a <script type="math/tex">\sigma</script>-finite positive measure <script type="math/tex">\nu</script> on <script type="math/tex">(\mathbb{R},\mathscr{B}_{\mathbb{R}})</script>, then the Radon–Nikodym theorem implies that</p>
<script type="math/tex; mode=display">\mathbb{P}[a \leq X \leq b] = \mathscr{L}_{X}([a,b]) = \int_a^b \frac{d\mathscr{L}_{X}}{d\nu} \, d\nu</script>
<p>for all real numbers <script type="math/tex">a \leq b</script>. If <script type="math/tex">X</script> is discrete, then choosing the counting measure</p>
<script type="math/tex; mode=display">\nu(E) = \vert E \cap \operatorname{im} X \vert</script>
<p>gives us absolute continuity. Otherwise, we typically shoot for absolute continuity with respect to <script type="math/tex">\nu = \mathscr{L}_{\mathbb{R}}</script>, the Lebesgue measure on the real line.</p>
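<p>In the discrete case, the Radon–Nikodym derivative of the law with respect to the counting measure on im&nbsp;X is simply the probability mass function. A small sketch (the example, X = sum of two fair dice, is mine, not from the post): "integrating" the density against counting measure recovers interval probabilities.</p>

```python
from fractions import Fraction
from collections import Counter

# For discrete X, the density with respect to the counting measure on
# im(X) is the pmf: P[a <= X <= b] = sum of f_X(k) over k in im(X) ∩ [a, b].
# Example: X = sum of two fair dice.
outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]
pmf = Counter()
for i, j in outcomes:
    pmf[i + j] += Fraction(1, 36)

def prob_interval(a, b):
    # "Integrate" the density against the counting measure on im(X).
    return sum(p for k, p in pmf.items() if a <= k <= b)

print(prob_interval(2, 12))  # total mass 1
print(prob_interval(7, 7))   # P[X = 7] = 1/6
```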
<p><a name="2"></a></p>
<h2 id="2-conditioning-and-independence-on-a-discrete-probability-space">2. Conditioning and Independence on a Discrete Probability Space</h2>
<p>Let <script type="math/tex">(\Omega, \mathcal{F}, \mathbb{P})</script> be a discrete probability space, i.e., the sample space <script type="math/tex">\Omega</script> is countable. Given two events <script type="math/tex">A,B \in \mathcal{F}</script>, what should the <strong>probability of <script type="math/tex">A</script> given <script type="math/tex">B</script></strong> be? To answer this question, we construct a new measurable space <script type="math/tex">(\Omega_{B}, \mathcal{F}_{B})</script> where event <script type="math/tex">B</script> <strong>always happens</strong>. To do this, we take <script type="math/tex">\Omega_{B} = \Omega \cap B</script> and</p>
<script type="math/tex; mode=display">\mathcal{F}_{B} = \{E \cap B : E \in \mathcal{F}\}.</script>
<p>The construction implies that each <script type="math/tex">E \in \mathcal{F}_{B}</script> admits <script type="math/tex">E' \in \mathcal{F}</script> such that <script type="math/tex">E = E' \cap B</script>. It thus makes sense to define the new probability measure <script type="math/tex">\mathbb{P}_{B}</script> on <script type="math/tex">(\Omega_{B}, \mathcal{F}_{B})</script> in terms of <script type="math/tex">\mathbb{P}[E' \cap B]</script>. We take the normalization</p>
<script type="math/tex; mode=display">\mathbb{P}_{B}[E] = \frac{\mathbb{P}[E' \cap B]}{\mathbb{P}[B]},</script>
<p>so that <script type="math/tex">\mathbb{P}_{B}</script> is a <em>bona fide</em> probability measure on <script type="math/tex">(\Omega_{B},\mathcal{F}_{B})</script>.</p>
<p>With this construction, it would be reasonable to say that the <strong>probability of <script type="math/tex">A</script> given <script type="math/tex">B</script></strong> <script type="math/tex">\mathbb{P}[A \mid B]</script> is</p>
<script type="math/tex; mode=display">\mathbb{P}[A \mid B] = \mathbb{P}_{B}[A] = \frac{\mathbb{P}[A \cap B]}{\mathbb{P}[B]},</script>
<p>provided that <script type="math/tex">\mathbb{P}[B] > 0</script>. If <script type="math/tex">\mathbb{P}[B] = 0</script>, then conditioning is meaningless in this setting, as the defining ratio degenerates to <script type="math/tex">0/0</script>.</p>
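<p>On a finite sample space, the definition can be checked by direct enumeration. A small sketch (the example is mine, not from the post): roll two fair dice, with A = &ldquo;the sum is 8&rdquo; and B = &ldquo;the first die is even&rdquo;.</p>

```python
from fractions import Fraction

# Conditional probability by enumeration on a discrete sample space:
# two fair dice, A = "the sum is 8", B = "the first die is even".
omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
P = lambda event: Fraction(sum(1 for w in omega if event(w)), len(omega))

A = lambda w: w[0] + w[1] == 8
B = lambda w: w[0] % 2 == 0

p_A_given_B = P(lambda w: A(w) and B(w)) / P(B)
print(P(A), p_A_given_B)  # 5/36 versus 1/6: conditioning on B changes P[A]
```

Since P[A | B] &ne; P[A], the events A and B are not independent.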
<p>If the conditioning does not change the probability, then the events are said to be <strong>independent</strong>. Formally, <script type="math/tex">A</script> and <script type="math/tex">B</script> are <strong>independent</strong> if</p>
<script type="math/tex; mode=display">\mathbb{P}[A \mid B] = \mathbb{P}[A] \hspace{1em}\mbox{and}\hspace{1em} \mathbb{P}[B \mid A] = \mathbb{P}[B].</script>
<p>This holds if and only if <script type="math/tex">\mathbb{P}[A \cap B] = \mathbb{P}[A]\mathbb{P}[B]</script>, which yields a natural <script type="math/tex">n</script>-fold generalization: events <script type="math/tex">E_{1},\ldots,E_{n}</script> are <strong>independent</strong> if, for every subset <script type="math/tex">S \subseteq \{1,\ldots,n\}</script>,</p>
<script type="math/tex; mode=display">\mathbb{P}\left[ \bigcap_{j \in S} E_{j} \right] = \prod_{j \in S} \mathbb{P}[E_{j}].</script>Mark Hyun-ki KimConditional expectation, conditional probabilities, conditional distributions, conditional densitiesPoincaré Embeddings of Neural Networks2017-08-15T14:00:00-04:002017-08-15T14:00:00-04:00https://markhkim.com/foundtechnicalities/poincare-embeddings-of-neural-networks<blockquote>
<p><a href="https://arxiv.org/abs/1705.08039v2">Poincaré Embeddings for Learning Hierarchical Representations</a><br />
M. Nickel, D. Kiela</p>
<p><a href="https://arxiv.org/abs/1705.10359v1">Neural Embeddings of Graphs in Hyperbolic Space</a><br />
B. P. Chamberlain, J. Clough, M. P. Deisenroth</p>
</blockquote>
<p>In studying a mathematical model, it is helpful to put the model in the context of a simpler, well-studied mathematical structure. For example, the Earth is a spherical object, and yet we can learn much about its surface by studying its <a href="https://en.wikipedia.org/wiki/Transverse_Mercator_projection">Mercator projection</a> on a two-dimensional plane.</p>
<p>The study of geometric methods of context transfer starts with <strong>embeddings</strong>, which are exact copies of mathematical models in another mathematical structure. More specifically, an <a href="https://en.wikipedia.org/wiki/Embedding">embedding</a> of a mathematical object <script type="math/tex">X</script> of type <script type="math/tex">\mathcal{A}</script> (or, formally, of <a href="https://en.wikipedia.org/wiki/Category_(mathematics)">category</a> <script type="math/tex">\mathcal{A}</script>) in a mathematical object <script type="math/tex">Y</script> of type <script type="math/tex">\mathcal{B}</script> is a <a href="https://en.wikipedia.org/wiki/Distinct_(mathematics)">one-to-one</a> mapping <script type="math/tex">f:X \to Y</script> such that <script type="math/tex">f(X)</script> is a copy of <script type="math/tex">X</script> in <script type="math/tex">Y</script>, viz., an object of type <script type="math/tex">\mathcal{A}</script> that shares the same structural properties as <script type="math/tex">X</script>.</p>
<p>Differential-geometric tools such as <a href="https://en.wikipedia.org/wiki/Whitney_embedding_theorem">Whitney embedding theorem</a> (<a href="https://books.google.com/books?id=In1Dbj-pkIkC&pg=PA349&lpg=PA349&dq=whitney+embedding+theorem+1930&source=bl&ots=JA6284Ey43&sig=FIUzNE2-pTbAwhB-jLaXJOIF4Xg&hl=en&sa=X&ved=0ahUKEwi0trX_nprUAhWK8YMKHVxuAbYQ6AEIOjAE#v=onepage&q&f=false">1944</a>) and <a href="https://en.wikipedia.org/wiki/Nash_embedding_theorem">Nash embedding theorem</a> (<a href="http://www.ams.org/journals/bull/2017-54-02/S0273-0979-2016-01551-5/S0273-0979-2016-01551-5.pdf">1954–1966</a>) allow us to view abstract geometric objects as concrete surfaces in the <a href="https://en.wikipedia.org/wiki/Euclidean_space">Euclidean space</a>. Moreover, discrete analogues such as <a href="https://en.wikipedia.org/wiki/Kuratowski%27s_theorem">Kuratowski’s theorem</a> (<a href="http://matwbn.icm.edu.pl/ksiazki/fm/fm15/fm15126.pdf">1930</a>) and the <a href="https://en.wikipedia.org/wiki/Planarity_testing#Path_addition_method">Hopcroft–Tarjan planarity testing algorithm</a> (<a href="https://dl.acm.org/citation.cfm?id=321852">1974</a>), as well as <a href="http://www-math.mit.edu/~goemans/18409.html">results</a> <a href="http://theory.stanford.edu/~tim/w06b/w06b.html">from</a> <a href="https://www.cs.cmu.edu/~anupamg/metrics/">metric</a> <a href="http://www.cs.toronto.edu/~avner/teaching/S6-2414/">embedding</a> <a href="http://ttic.uchicago.edu/~harry/teaching/teaching.html">theory</a> like <a href="https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma">Johnson–Lindenstrauss lemma</a> (<a href="https://www.enseignement.polytechnique.fr/informatique/INF442/TD/td_3/JLL-original_proof.pdf">1984</a>), suggest embedding methods can be applied in quite general contexts.</p>
<p>Indeed, datasets can be represented as <a href="https://en.wikipedia.org/wiki/Euclidean_vector">vectors</a> or <a href="https://en.wikipedia.org/wiki/Graph_(discrete_mathematics)#Weighted_graph">weighted graphs</a>: the former by encoding each feature as coordinate values, and the latter by encoding relationships between data points as weights of edges between nodes. The latter is already a graph; the former can be thought of as sampled points on a <a href="https://en.wikipedia.org/wiki/Manifold">manifold</a>, a principal object of study in geometry, that describes the <em>true</em> nature of the dataset. As discussed above, both representations can be thought of as geometric objects embedded within a Euclidean space, allowing us to visualize the structure of a dataset.</p>
<p>Nevertheless, embedding in the strict sense is not terribly useful in data analysis, because such visualizations often take place in extremely high-dimensional space. For example, if we wish to encode color information, to be taken from the color set <code class="highlighter-rouge"><span class="p">{</span><span class="err">red,</span><span class="w"> </span><span class="err">orange,</span><span class="w"> </span><span class="err">yellow,</span><span class="w"> </span><span class="err">green,</span><span class="w"> </span><span class="err">cyan,</span><span class="w"> </span><span class="err">blue,</span><span class="w"> </span><span class="err">violet</span><span class="p">}</span></code>, in numeric vectors, we might apply <a href="https://en.wikipedia.org/wiki/One-hot">one-hot encoding</a> to represent</p>
<table>
<thead>
<tr>
<th>id</th>
<th>color</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>red</td>
</tr>
<tr>
<td>2</td>
<td>orange</td>
</tr>
<tr>
<td>3</td>
<td>yellow</td>
</tr>
<tr>
<td>4</td>
<td>green</td>
</tr>
</tbody>
</table>
<p>as</p>
<table>
<thead>
<tr>
<th>id</th>
<th>red?</th>
<th>orange?</th>
<th>yellow?</th>
<th>green?</th>
<th>cyan?</th>
<th>blue?</th>
<th>violet?</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>2</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>3</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>4</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
</tbody>
</table>
<p>which is a collection of 7-dimensional vectors. As we consider more features, the requisite dimension increases significantly. This results in the so-called <a href="https://en.wikipedia.org/wiki/Curse_of_dimensionality">curse of dimensionality</a>, which is a heuristic principle referring to various computational difficulties that worsen significantly as the dimension increases.</p>
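<p>The encoding described above can be sketched in a few lines (the helper <code>one_hot</code> is mine, not a library function): each category becomes one coordinate of a 0/1 vector whose dimension equals the number of categories.</p>

```python
# Minimal one-hot encoder for the seven-color example above.
COLORS = ["red", "orange", "yellow", "green", "cyan", "blue", "violet"]

def one_hot(color, categories=COLORS):
    # Return a 0/1 vector with a single 1 at the category's index.
    vec = [0] * len(categories)
    vec[categories.index(color)] = 1
    return vec

print(one_hot("red"))    # [1, 0, 0, 0, 0, 0, 0]
print(one_hot("green"))  # [0, 0, 0, 1, 0, 0, 0]
```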
<p>It is thus beneficial to consider low-dimensional representations of datasets via <em>approximate</em> embeddings, which do not constitute exact copies but still contain enough information to be useful. For example, we might try and encode certain features of a dataset as geometric properties, such as the distance between two data points.</p>
<p>The method of data analysis via low-dimensional representations has a long history, significantly predating the formal development of embedding theory in mathematics. There are hints of early theoretical results found in literature as early as 1880 (see Section 1.3 of <a href="http://gifi.stat.ucla.edu/janspubs/1982/chapters/deleeuw_heiser_C_82.pdf">de Leeuw–Heiser, 1982</a>), and computational methods have been around for decades as well (see, for example, <a href="http://theoval.cmp.uea.ac.uk/~gcc/matlab/sammon/sammon.pdf">Sammon, 1969</a>).</p>
<p>Nevertheless, it is the explosion of neural-network methods, backed by ever-so-powerful modern computers, that afforded embedding methods with the significance they have now. A landmark example of the modern embedding method is <strong>Word2Vec</strong> (<a href="https://arxiv.org/abs/1301.3781">Mikolov, et al., 2013</a>), a computationally efficient technique for modeling similarities between words built on the neural probabilistic language model (<a href="http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf">Bengio, et al., 2003</a>).</p>
<p>Word2Vec and its variants, such as Node2Vec (<a href="http://snap.stanford.edu/node2vec/">Grover–Leskovec, 2016</a>), make use of <em>vector representations</em> of data, as their names suggest. Vector-space embeddings are versatile, but their dimensionality-reduction efficiency leaves much to be desired. For example, linear embeddings of certain kinds of graph-structured data require an extremely large number of dimensions (<a href="https://papers.nips.cc/paper/5448-reducing-the-rank-in-relational-factorization-models-by-including-observable-patterns.pdf">Nickel–Jiang–Tresp, 2014, Theorem 1</a>). This is because the inherent shape of a dataset is not necessarily <em>flat</em>. The intrinsic dimension of a <em>curved</em> object in a vector space may well be significantly lower than the dimension of the vector space itself: indeed, a sphere is two-dimensional, but the usual coordinate representation requires a three-dimensional vector space.</p>
<p>The <a href="https://arxiv.org/pdf/1310.0425.pdf"><strong>manifold hypothesis</strong></a> posits that each high-dimensional dataset has a corresponding lower-dimensional structure that approximates the data points well. Such a structure is assumed to be a <strong>manifold</strong>: curves, surfaces, and their high-dimensional generalizations. In light of this, we reason that there may be a <em>curved</em> manifold that represents data better than <em>flat</em> vector spaces.</p>
<p>Now, many complex networks exhibit a <strong>hierarchical</strong>, tree-like organization (<a href="https://arxiv.org/abs/cond-mat/0206130">Ravasz–Barabási, 2003</a>). Since a tree is a discrete analogue of <a href="https://en.wikipedia.org/wiki/Hyperbolic_manifold">hyperbolic manifolds</a> (<a href="http://www.ihes.fr/~gromov/PDF/6[57].pdf">Gromov, 1987</a>), it follows that a wide variety of complex datasets can be studied profitably by assuming an underlying hyperbolic structure (<a href="https://arxiv.org/abs/1006.5169">Krioukov–Papadopoulos–Kitsak–Vahdat–Boguñá, 2010</a>). Standard optimization techniques for data analysis, such as gradient descent, generalize to hyperbolic manifolds (<a href="https://arxiv.org/abs/1111.5280">Bonnabel, 2013</a>), and so it is natural to try for a refinement of embedding methods via hyperbolic embedding.</p>
<p>Indeed, datasets with hierarchical organization can be embedded into a hyperbolic space of a much lower dimension without losing the accuracy of representation, compared to the usual vector-space embeddings. Amusingly, two papers, (<a href="https://arxiv.org/abs/1705.08039">Nickel–Kiela, 2017</a>) and (<a href="https://arxiv.org/abs/1705.10359">Chamberlain–Clough–Deisenroth, 2017</a>), exploiting this methodology appeared on ArXiv within a week of each other, developing the same theory independently. They nevertheless tackle different flavors of problems, and the reader is encouraged to check out both papers for experimental results.</p>Mark Hyun-ki KimPoincaré Embeddings for Learning Hierarchical Representations E. Totoni, T. A. Anderson, T. Shpeisman Neural Embeddings of Graphs in Hyperbolic Space B. P. Chamberlain, J. Clough, M. P. DeisenrothQuickselect and Asymptotically Optimal Quicksort2017-07-10T08:00:00-04:002017-07-10T08:00:00-04:00https://markhkim.com/foundtechnicalities/quickselect-and-asymptotically-optimal-quicksort<p><a name="1"></a></p>
<h2 id="1-quicksort">1. Quicksort</h2>
<p>Quicksort is a divide-and-conquer sorting algorithm with a quadratic worst-case running time that nonetheless often outperforms asymptotically optimal comparison sorting algorithms in practice. Quicksort divides a list <script type="math/tex">L</script> into two sublists</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}L_1 &= \{x \in L : x \leq p\} \\ L_2 &= \{x \in L : x > p\} \end{align*} %]]></script>
<p>with respect to a fixed pivot <script type="math/tex">p \in L</script> and recurses on the sublists until it reaches sublists of size at most 1. Here is a simple Python implementation of quicksort, with the pivot of a list always chosen to be its last element.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="k">def</span> <span class="nf">quick_sort_fixed_pivot</span><span class="p">(</span><span class="n">L</span><span class="p">):</span>
<span class="n">n</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">L</span><span class="p">)</span>
<span class="n">sort_fixed_pivot</span><span class="p">(</span><span class="n">L</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="n">n</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="k">return</span> <span class="n">L</span>
<span class="k">def</span> <span class="nf">sort_fixed_pivot</span><span class="p">(</span><span class="n">L</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="n">r</span><span class="p">):</span>
<span class="k">if</span> <span class="n">p</span> <span class="o"><</span> <span class="n">r</span><span class="p">:</span>
<span class="n">pivot_index</span> <span class="o">=</span> <span class="n">partition_fixed_pivot</span><span class="p">(</span><span class="n">L</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="n">r</span><span class="p">)</span>
<span class="n">sort_fixed_pivot</span><span class="p">(</span><span class="n">L</span><span class="p">,</span><span class="n">p</span><span class="p">,</span><span class="n">pivot_index</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="n">sort_fixed_pivot</span><span class="p">(</span><span class="n">L</span><span class="p">,</span><span class="n">pivot_index</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">r</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">partition_fixed_pivot</span><span class="p">(</span><span class="n">L</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="n">r</span><span class="p">):</span>
<span class="n">pivot</span> <span class="o">=</span> <span class="n">L</span><span class="p">[</span><span class="n">r</span><span class="p">]</span>
<span class="n">i</span> <span class="o">=</span> <span class="n">p</span>
<span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">p</span><span class="p">,</span><span class="n">r</span><span class="p">):</span>
<span class="k">if</span> <span class="n">L</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o"><=</span> <span class="n">pivot</span><span class="p">:</span>
<span class="n">L</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">L</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="n">L</span><span class="p">[</span><span class="n">j</span><span class="p">],</span> <span class="n">L</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
<span class="n">i</span> <span class="o">=</span> <span class="n">i</span><span class="o">+</span><span class="mi">1</span>
<span class="n">L</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">L</span><span class="p">[</span><span class="n">r</span><span class="p">]</span> <span class="o">=</span> <span class="n">L</span><span class="p">[</span><span class="n">r</span><span class="p">],</span> <span class="n">L</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
<span class="k">return</span> <span class="n">i</span>
</code></pre>
</div>
<p><code class="highlighter-rouge">sort_fixed_pivot()</code> on a sublist <code class="highlighter-rouge">L[p:r+1]</code> chooses <code class="highlighter-rouge">L[r]</code> to be the pivot, via <code class="highlighter-rouge">partition_fixed_pivot()</code>. <code class="highlighter-rouge">L[p:r+1]</code> is divided into three sublists: the list of elements smaller than or equal to the pivot, the singleton list containing the pivot, and the list of elements larger than the pivot. <code class="highlighter-rouge">sort_fixed_pivot</code> then recurses on the first and third sublists, sorting them.</p>
<p>If <code class="highlighter-rouge">L</code> is already sorted, then <code class="highlighter-rouge">partition_fixed_pivot(L, p, r)</code> always returns <code class="highlighter-rouge">r</code>. Therefore, <code class="highlighter-rouge">sort_fixed_pivot</code> splits <code class="highlighter-rouge">L[p:r+1]</code> into <code class="highlighter-rouge">L[p:r]</code>, <code class="highlighter-rouge">[L[r]]</code>, and <code class="highlighter-rouge">[]</code>. Since <code class="highlighter-rouge">sort_fixed_pivot()</code> can only reduce the size of the list to be sorted by 1, it must go through <script type="math/tex">n</script> levels of recursion to sort an already-sorted list of size <script type="math/tex">n</script>. Specifically, at the <script type="math/tex">k</script>th level, <code class="highlighter-rouge">partition_fixed_pivot(L, 0, n-k)</code> is executed, leaving <code class="highlighter-rouge">L[n-k]</code> in place and reducing the problem to sorting <code class="highlighter-rouge">L[0:n-k]</code>.</p>
<p>It therefore suffices to examine the time complexity of <code class="highlighter-rouge">partition_fixed_pivot(L, 0, n-k)</code>. Since the comparison in the <code class="highlighter-rouge">if</code> statement is executed in every iteration of the loop, the runtime of <code class="highlighter-rouge">partition_fixed_pivot(L, 0, n-k)</code> is bounded below by <script type="math/tex">\Omega(n-k)</script>. It follows that the runtime of <code class="highlighter-rouge">quick_sort_fixed_pivot</code> on a sorted list of length <code class="highlighter-rouge">n</code> is bounded below by</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\sum_{k=1}^{n-1} \Omega(n-k)
&\geq \sum_{k=1}^{n-1} C_k (n-k) \\
&\geq \sum_{k=1}^{n-1} \left(\min_k C_k\right) (n-k) \\
&= \left(\min_k C_k\right) \frac{n(n-1)}{2} \\
&= \Omega(n^2).
\end{align*} %]]></script>
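<p>This quadratic lower bound is easy to observe directly. The following sketch (a hypothetical, self-contained reimplementation mirroring <code class="highlighter-rouge">quick_sort_fixed_pivot</code>; the helper name <code class="highlighter-rouge">count_comparisons_fixed_pivot</code> is not from the original code) counts executions of the comparison <code class="highlighter-rouge">if L[j] <= pivot</code> on an already-sorted list; the count comes out to exactly <script type="math/tex">n(n-1)/2</script>:</p>

```python
def count_comparisons_fixed_pivot(L):
    # Instrumented fixed-pivot quicksort: counts how many times the
    # comparison `L[j] <= pivot` is executed.
    count = 0

    def partition(L, p, r):
        nonlocal count
        pivot = L[r]
        i = p
        for j in range(p, r):
            count += 1               # one execution of `if L[j] <= pivot`
            if L[j] <= pivot:
                L[i], L[j] = L[j], L[i]
                i += 1
        L[i], L[r] = L[r], L[i]
        return i

    def sort(L, p, r):
        if p < r:
            q = partition(L, p, r)
            sort(L, p, q - 1)
            sort(L, q + 1, r)

    sort(L, 0, len(L) - 1)
    return count

# On a sorted list, each partition scans its whole sublist:
# (n-1) + (n-2) + ... + 1 = n(n-1)/2 comparisons.
```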
<p><a name="2"></a></p>
<h2 id="2-randomized-quicksort">2. Randomized Quicksort</h2>
<p>Why, then, does the conventional wisdom dictate that quicksort is more efficient than many <script type="math/tex">O(n \log n)</script>-time sorting algorithms? To understand the efficiency of the average-case runtime, we introduce an element of randomness to quicksort:</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">random</span>
<span class="k">def</span> <span class="nf">quick_sort_randomized_pivot</span><span class="p">(</span><span class="n">L</span><span class="p">):</span>
<span class="n">n</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">L</span><span class="p">)</span>
<span class="n">sort_randomized_pivot</span><span class="p">(</span><span class="n">L</span><span class="p">,</span> <span class="mi">0</span> <span class="p">,</span> <span class="n">n</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="k">return</span> <span class="n">L</span>
<span class="k">def</span> <span class="nf">sort_randomized_pivot</span><span class="p">(</span><span class="n">L</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="n">r</span><span class="p">):</span>
<span class="k">if</span> <span class="n">p</span> <span class="o"><</span> <span class="n">r</span><span class="p">:</span>
<span class="n">pivot_index</span> <span class="o">=</span> <span class="n">partition_randomized_pivot</span><span class="p">(</span><span class="n">L</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="n">r</span><span class="p">)</span>
<span class="n">sort_randomized_pivot</span><span class="p">(</span><span class="n">L</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="n">pivot_index</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="n">sort_randomized_pivot</span><span class="p">(</span><span class="n">L</span><span class="p">,</span> <span class="n">pivot_index</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="n">r</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">partition_randomized_pivot</span><span class="p">(</span><span class="n">L</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="n">r</span><span class="p">):</span>
<span class="n">pivot</span> <span class="o">=</span> <span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">r</span><span class="p">)</span>
<span class="n">L</span><span class="p">[</span><span class="n">pivot</span><span class="p">],</span> <span class="n">L</span><span class="p">[</span><span class="n">r</span><span class="p">]</span> <span class="o">=</span> <span class="n">L</span><span class="p">[</span><span class="n">r</span><span class="p">],</span> <span class="n">L</span><span class="p">[</span><span class="n">pivot</span><span class="p">]</span>
<span class="k">return</span> <span class="n">partition_fixed_pivot</span><span class="p">(</span><span class="n">L</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="n">r</span><span class="p">)</span>
</code></pre>
</div>
<p>To be sure, there is a small but nonzero chance that the random selection of pivots always picks the last element of the list, thus reducing <code class="highlighter-rouge">quick_sort_randomized_pivot</code> to <code class="highlighter-rouge">quick_sort_fixed_pivot</code>. Therefore, the worst-case time complexity is still <script type="math/tex">\Omega(n^2)</script>.</p>
<p>Now, observe that the <a href="https://en.wikipedia.org/wiki/Bottleneck_(software)">bottleneck</a> in <code class="highlighter-rouge">quick_sort_fixed_pivot</code> is <code class="highlighter-rouge">partition_fixed_pivot</code>. In fact, the <code class="highlighter-rouge">for</code> loop in <code class="highlighter-rouge">partition_fixed_pivot</code> is the single biggest contributor to the time complexity of both <code class="highlighter-rouge">quick_sort_fixed_pivot</code> and <code class="highlighter-rouge">quick_sort_randomized_pivot</code>. This implies that the number of times the comparison statement <code class="highlighter-rouge">if L[j] <= pivot</code> is executed is a good indicator of the time complexity of quicksort.</p>
<p>In light of the above observation, we shall show that the average time complexity of randomized quicksort is <script type="math/tex">O(n \log n)</script>. To this end, we fix a list <script type="math/tex">L</script> of length <script type="math/tex">n</script> and assume that</p>
<script type="math/tex; mode=display">a_0 \leq a_1 \leq \cdots \leq a_{n-1}</script>
<p>are the elements of <script type="math/tex">L</script>, sorted. We let <script type="math/tex">C</script> denote the random variable outputting the total number of comparisons <code class="highlighter-rouge">if L[j] <= pivot</code> performed in the course of an execution of <code class="highlighter-rouge">quick_sort_randomized_pivot(L)</code>; the randomness comes from the choices of pivots. For each <script type="math/tex">% <![CDATA[ 0 \leq i < j \leq n-1 %]]></script>, let <script type="math/tex">C_{ij}</script> denote the <a href="https://en.wikipedia.org/wiki/Indicator_function">indicator random variable</a></p>
<script type="math/tex; mode=display">I_{\left\{a_i \mbox{ is compared to } a_j\right\}},</script>
<p>so that</p>
<script type="math/tex; mode=display">\mathbb{E}[C] = \mathbb{E} \left[ \sum_{i=0}^{n-2} \sum_{j=i+1}^{n-1} C_{ij} \right],</script>
<p>where <script type="math/tex">\mathbb{E}</script> denotes the <a href="https://en.wikipedia.org/wiki/Expected_value">expected value</a> of a random variable. Since <script type="math/tex">\mathbb{E}</script> is linear, we see that</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\mathbb{E}[C]
&= \mathbb{E} \left[ \sum_{i=0}^{n-2} \sum_{j=i+1}^{n-1} C_{ij} \right] \\
&= \sum_{i=0}^{n-2} \sum_{j=i+1}^{n-1} \mathbb{E}[C_{ij}] \\
&= \sum_{i=0}^{n-2} \sum_{j=i+1}^{n-1} \mathbb{P}[a_i \mbox{ is compared to } a_j],
\end{align*} %]]></script>
<p>where <script type="math/tex">\mathbb{P}</script> denotes the probability of an event.</p>
<p>We let <script type="math/tex">L_{ij}</script> denote the set <script type="math/tex">\{a_i,a_{i+1},\ldots,a_j\}</script>. Observe that <script type="math/tex">\mathbb{P}[a_i \mbox{ is compared to } a_j]</script> equals the probability that <script type="math/tex">a_i</script> or <script type="math/tex">a_j</script> is the first pivot chosen, out of all the elements of <script type="math/tex">L_{ij}</script>, in the course of an execution of <code class="highlighter-rouge">quick_sort_randomized_pivot(L)</code>. Indeed, <script type="math/tex">a_i</script> and <script type="math/tex">a_j</script> are compared precisely when one of them is the first element of <script type="math/tex">L_{ij}</script> to be chosen as a pivot; if any other element of <script type="math/tex">L_{ij}</script> is chosen first, it separates <script type="math/tex">a_i</script> and <script type="math/tex">a_j</script> into different sublists, never to be compared. Since at most one of <script type="math/tex">a_i</script> and <script type="math/tex">a_j</script> can be the first pivot chosen from <script type="math/tex">L_{ij}</script>, the two events are mutually exclusive, and we have that</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
&\mathbb{P}[a_i \mbox{ or } a_j \mbox{ is the first pivot chosen from } L_{ij}] \\
&= \mathbb{P}[a_i \mbox{ is the first pivot chosen from } L_{ij}] \\
&+ \mathbb{P}[a_j \mbox{ is the first pivot chosen from } L_{ij}] \\
&= \frac{1}{j-i+1} + \frac{1}{j-i+1} = \frac{2}{j-i+1}.
\end{align*} %]]></script>
<p>It follows that</p>
<script type="math/tex; mode=display">\mathbb{E}[C] = \sum_{i=0}^{n-2}\sum_{j=i+1}^{n-1} \frac{2}{j-i+1} = \sum_{i=0}^{n-2}\sum_{k=1}^{n-1-i} \frac{2}{k+1}.</script>
<p>Since the <a href="https://en.wikipedia.org/wiki/Harmonic_number">harmonic numbers</a> are asymptotically bounded by the logarithmic function, we conclude that</p>
<script type="math/tex; mode=display">\mathbb{E}[C] \leq \sum_{i=0}^{n-2} O(\log n) \leq O(n \log n).</script>
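<p>The formula for <script type="math/tex">\mathbb{E}[C]</script> can be checked numerically. The sketch below (hypothetical helper names; the instrumented sort mirrors <code class="highlighter-rouge">quick_sort_randomized_pivot</code>) averages the comparison count over repeated runs and compares it against the exact double sum:</p>

```python
import random

def count_comparisons_randomized(L):
    # Randomized quicksort, instrumented to count executions of `L[j] <= pivot`.
    count = 0

    def partition(L, p, r):
        nonlocal count
        pivot_index = random.randint(p, r)
        L[pivot_index], L[r] = L[r], L[pivot_index]
        pivot = L[r]
        i = p
        for j in range(p, r):
            count += 1
            if L[j] <= pivot:
                L[i], L[j] = L[j], L[i]
                i += 1
        L[i], L[r] = L[r], L[i]
        return i

    def sort(L, p, r):
        if p < r:
            q = partition(L, p, r)
            sort(L, p, q - 1)
            sort(L, q + 1, r)

    sort(L, 0, len(L) - 1)
    return count

def expected_comparisons(n):
    # E[C] = sum over pairs i < j of 2 / (j - i + 1)
    return sum(2.0 / (j - i + 1) for i in range(n - 1) for j in range(i + 1, n))

random.seed(0)
n, trials = 100, 300
avg = sum(count_comparisons_randomized(list(range(n))) for _ in range(trials)) / trials
# avg should be close to expected_comparisons(n), far below n(n-1)/2 = 4950
```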
<p><a name="3"></a></p>
<h2 id="3-quickselect">3. Quickselect</h2>
<p>While randomized quicksort performs quite well “on average” (and in practice!), its worst-case time complexity is still <script type="math/tex">\Theta(n^2)</script>. With carefully chosen pivots, however, we can guarantee a <script type="math/tex">\Theta(n \log n)</script> performance in all cases.</p>
<p>We have witnessed that a pivot selection strategy that divides a list into sublists of uneven sizes tends to perform badly. Therefore, an efficient method of finding the median element of a list would lead to an efficient choice of pivots. The <a href="https://people.csail.mit.edu/rivest/pubs/BFPRT73.pdf">Blum–Floyd–Pratt–Rivest–Tarjan selection algorithm</a>, commonly known as the <strong>median-of-medians selection algorithm</strong> (a deterministic refinement of Hoare’s <strong>quickselect</strong>), can be used to find a median in <script type="math/tex">O(n)</script> time. In fact, it can be used to find the <script type="math/tex">k</script>th smallest element of the list in <script type="math/tex">O(n)</script> time; we refer to it as quickselect below.</p>
<p>The outline of the algorithm is as follows; here, we assume that <script type="math/tex">L</script> consists of distinct elements.</p>
<ol>
<li>Given an input list <script type="math/tex">L</script> of <script type="math/tex">n</script> elements, we divide the list into <script type="math/tex">\lceil \frac{n}{5} \rceil</script> sublists, all of which contain exactly 5 elements except possibly the last, which may contain fewer.</li>
<li>We find the median of each of the <script type="math/tex">\lceil \frac{n}{5} \rceil</script> sublists by sorting them. Insertion sort is a good choice here, since it is fast for small lists.</li>
<li>We take the list of the medians found in Steps 1 and 2 and apply the selection algorithm recursively to find the median of medians <script type="math/tex">m</script>. If there are an even number of medians, we take the lower one.</li>
<li>We rearrange <script type="math/tex">L</script> by putting the terms no larger than <script type="math/tex">m</script> on the front, <script type="math/tex">m</script> in the middle, and terms larger than <script type="math/tex">m</script> in the back.</li>
<li>Let <script type="math/tex">i-1</script> be the index of the pivot <script type="math/tex">m</script> in the rearranged list, so that <script type="math/tex">m</script> is the <script type="math/tex">i</script>th smallest element of <script type="math/tex">L</script>. If <script type="math/tex">i = k</script>, then we have found the <script type="math/tex">k</script>th smallest element. If not, we recurse on one side. Specifically, if <script type="math/tex">i > k</script>, then we apply the selection algorithm on <code class="highlighter-rouge">L[:i]</code> to find the <script type="math/tex">k</script>th smallest element. If <script type="math/tex">% <![CDATA[ i < k %]]></script>, then we apply the selection algorithm on <code class="highlighter-rouge">L[i:]</code> to find the <script type="math/tex">(k-i)</script>th smallest element.</li>
</ol>
<p>In short, the algorithm either selects the correct output or recurses down to a smaller sublist. Since the <script type="math/tex">k</script>th smallest element of a small list can be found in <script type="math/tex">O(1)</script> time, the algorithm always terminates with the correct output.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">math</span>
<span class="c"># find the kth smallest element of L</span>
<span class="k">def</span> <span class="nf">quickselect</span><span class="p">(</span><span class="n">L</span><span class="p">,</span><span class="n">k</span><span class="p">):</span>
<span class="n">n</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">L</span><span class="p">)</span>
<span class="c"># To reduce computational complexity, we compute the median directly</span>
<span class="c"># if the length of the list is small enough.</span>
<span class="k">if</span> <span class="n">n</span> <span class="o"><</span> <span class="mi">10</span><span class="p">:</span>
<span class="k">return</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">L</span><span class="p">)[</span><span class="n">k</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="c"># Divide L into sublists of length 5, sort them, and extract the medians.</span>
<span class="n">medians</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">math</span><span class="o">.</span><span class="n">floor</span><span class="p">(</span><span class="n">n</span><span class="o">/</span><span class="mi">5</span><span class="p">)):</span>
        <span class="n">L</span><span class="p">[</span><span class="mi">5</span><span class="o">*</span><span class="n">i</span><span class="p">:</span> <span class="mi">5</span><span class="o">*</span><span class="n">i</span><span class="o">+</span><span class="mi">5</span><span class="p">]</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">L</span><span class="p">[</span><span class="mi">5</span><span class="o">*</span><span class="n">i</span><span class="p">:</span> <span class="mi">5</span><span class="o">*</span><span class="n">i</span><span class="o">+</span><span class="mi">5</span><span class="p">])</span> <span class="c"># assign back: calling .sort() on a slice would only sort a copy</span>
<span class="n">medians</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">L</span><span class="p">[</span><span class="mi">5</span><span class="o">*</span><span class="n">i</span><span class="o">+</span><span class="mi">2</span><span class="p">])</span>
<span class="k">if</span> <span class="n">n</span> <span class="o">%</span> <span class="mi">5</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">:</span>
        <span class="n">L</span><span class="p">[</span><span class="mi">5</span><span class="o">*</span><span class="n">math</span><span class="o">.</span><span class="n">floor</span><span class="p">(</span><span class="n">n</span><span class="o">/</span><span class="mi">5</span><span class="p">):]</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">L</span><span class="p">[</span><span class="mi">5</span><span class="o">*</span><span class="n">math</span><span class="o">.</span><span class="n">floor</span><span class="p">(</span><span class="n">n</span><span class="o">/</span><span class="mi">5</span><span class="p">):])</span> <span class="c"># again, assign back so the tail is sorted in place</span>
<span class="n">medians</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">L</span><span class="p">[</span><span class="mi">5</span><span class="o">*</span><span class="n">math</span><span class="o">.</span><span class="n">floor</span><span class="p">(</span><span class="n">n</span><span class="o">/</span><span class="mi">5</span><span class="p">)</span> <span class="o">+</span> <span class="n">math</span><span class="o">.</span><span class="n">floor</span><span class="p">((</span><span class="n">n</span> <span class="o">%</span> <span class="mi">5</span><span class="p">)</span><span class="o">/</span><span class="mi">2</span><span class="p">)])</span>
<span class="c"># Find recursively the median of medians</span>
    <span class="n">mm</span> <span class="o">=</span> <span class="n">quickselect</span><span class="p">(</span><span class="n">medians</span><span class="p">,</span> <span class="n">math</span><span class="o">.</span><span class="n">floor</span><span class="p">((</span><span class="nb">len</span><span class="p">(</span><span class="n">medians</span><span class="p">)</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span><span class="o">/</span><span class="mi">2</span><span class="p">))</span> <span class="c"># 1-indexed lower median</span>
<span class="c"># Partition L with mm as the pivot</span>
<span class="n">mm_index</span> <span class="o">=</span> <span class="n">L</span><span class="o">.</span><span class="n">index</span><span class="p">(</span><span class="n">mm</span><span class="p">)</span>
<span class="n">L</span><span class="p">[</span><span class="n">n</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="n">L</span><span class="p">[</span><span class="n">mm_index</span><span class="p">]</span> <span class="o">=</span> <span class="n">L</span><span class="p">[</span><span class="n">mm_index</span><span class="p">],</span> <span class="n">L</span><span class="p">[</span><span class="n">n</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="n">i</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span>
<span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n</span><span class="o">-</span><span class="mi">1</span><span class="p">):</span>
<span class="k">if</span> <span class="n">L</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o"><=</span> <span class="n">mm</span><span class="p">:</span>
<span class="n">i</span> <span class="o">=</span> <span class="n">i</span><span class="o">+</span><span class="mi">1</span>
<span class="n">L</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">L</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="n">L</span><span class="p">[</span><span class="n">j</span><span class="p">],</span> <span class="n">L</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
<span class="n">i</span> <span class="o">=</span> <span class="n">i</span><span class="o">+</span><span class="mi">1</span>
<span class="n">L</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">L</span><span class="p">[</span><span class="n">n</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">L</span><span class="p">[</span><span class="n">n</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="n">L</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="c"># After this, L[i] = mm.</span>
    <span class="n">i</span> <span class="o">=</span> <span class="n">i</span><span class="o">+</span><span class="mi">1</span> <span class="c"># Now mm is the ith smallest element of L</span>
<span class="c"># Determine whether mm is at the correct spot. If not, recurse.</span>
<span class="k">if</span> <span class="n">i</span> <span class="o">==</span> <span class="n">k</span><span class="p">:</span>
<span class="k">return</span> <span class="n">mm</span>
<span class="k">elif</span> <span class="n">i</span> <span class="o">></span> <span class="n">k</span><span class="p">:</span>
<span class="k">return</span> <span class="n">quickselect</span><span class="p">(</span><span class="n">L</span><span class="p">[:</span><span class="n">i</span><span class="p">],</span> <span class="n">k</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span> <span class="c"># i < k</span>
<span class="k">return</span> <span class="n">quickselect</span><span class="p">(</span><span class="n">L</span><span class="p">[</span><span class="n">i</span><span class="p">:],</span> <span class="n">k</span><span class="o">-</span><span class="n">i</span><span class="p">)</span>
</code></pre>
</div>
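<p>For testing purposes, the routine above can be condensed into a purely functional sketch (a hypothetical restatement using slicing rather than in-place partitioning; as before, it assumes distinct elements) and checked against sorting:</p>

```python
import random

def quickselect_sketch(L, k):
    """Return the kth smallest (1-indexed) element of L, assuming distinct elements."""
    if len(L) < 10:
        return sorted(L)[k - 1]
    # Lower median of each group of at most five elements
    medians = [sorted(L[i:i + 5])[(len(L[i:i + 5]) - 1) // 2]
               for i in range(0, len(L), 5)]
    # Lower median of the medians, found recursively
    mm = quickselect_sketch(medians, (len(medians) + 1) // 2)
    smaller = [x for x in L if x < mm]
    larger = [x for x in L if x > mm]
    if k <= len(smaller):
        return quickselect_sketch(smaller, k)
    elif k == len(smaller) + 1:
        return mm
    else:
        return quickselect_sketch(larger, k - len(smaller) - 1)

random.seed(1)
sample = random.sample(range(10000), 500)
```

<p>Each recursive call discards a constant fraction of the list on one side of the pivot, so the linear-time analysis of the next section applies to this sketch as well.</p>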
<p><a name="4"></a></p>
<h2 id="4-proof-of-linear-time-complexity-of-quickselect">4. Proof of Linear-Time Complexity of Quickselect</h2>
<p>To analyze the time complexity of the median-of-medians selection algorithm, we note that at least half of the medians found in Step 2 are no smaller than <script type="math/tex">m</script>. Since the elements of <script type="math/tex">L</script> are distinct, each of the corresponding sublists contains at least 3 elements greater than <script type="math/tex">m</script>, except for the one sublist that may contain fewer than 5 elements and the sublist that contains <script type="math/tex">m</script> itself. It follows that the number of elements of <script type="math/tex">L</script> greater than <script type="math/tex">m</script> is at least</p>
<script type="math/tex; mode=display">3 \left(\left\lceil \frac{1}{2} \left\lceil \frac{n}{5} \right\rceil \right\rceil - 2\right) \geq \frac{3}{10}n - 6.</script>
<p>Similarly, at least <script type="math/tex">\frac{3}{10}n - 6</script> elements are less than <script type="math/tex">m</script>, whence, in the worst-case scenario, Step 5 calls the selection algorithm recursively on at most <script type="math/tex">\frac{7}{10}n+6</script> elements.</p>
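<p>The counting bound can be sanity-checked numerically; the helper name below is hypothetical:</p>

```python
import math

# Exact count from the argument above: at least ceil(ceil(n/5)/2) of the
# sublists have medians no smaller than m; discarding the two exceptional
# sublists, each remaining one contributes at least 3 elements greater than m.
def elements_greater_lower_bound(n):
    return 3 * (math.ceil(math.ceil(n / 5) / 2) - 2)
```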
<p>Let <script type="math/tex">T(n)</script> denote the worst-case running time of the selection algorithm on a list of size <script type="math/tex">n</script>. Step 1 takes <script type="math/tex">O(n)</script> time. Step 2 executes insertion sort on <script type="math/tex">O(n)</script> many lists of length at most 5, and so it takes <script type="math/tex">O(n)</script> time. Step 3 takes <script type="math/tex">T(\lceil n/5 \rceil)</script> time. Step 4 takes <script type="math/tex">O(n)</script> time, as it is a simple variation of the partition algorithm from quicksort. Step 5 takes <script type="math/tex">T(7n/10 + 6)</script> time, as discussed above. It follows that</p>
<script type="math/tex; mode=display">T(n) \leq T \left( \left\lceil\frac{n}{5}\right\rceil \right) + T \left( \frac{7}{10}n + 6 \right) + O(n).</script>
<p>Let us fix a constant <script type="math/tex">A</script> such that</p>
<script type="math/tex; mode=display">T(n) \leq T \left( \left\lceil\frac{n}{5}\right\rceil \right) + T \left( \frac{7}{10}n + 6 \right) + An</script>
<p>for all <script type="math/tex">n</script>.</p>
<p>Our goal is to find a constant <script type="math/tex">C</script> such that <script type="math/tex">T(n) \leq Cn</script> for all large enough <script type="math/tex">n</script>. How large should <script type="math/tex">C</script> be? If <script type="math/tex">T(n) \leq Cn</script> were true for all <script type="math/tex">n > N</script> for some fixed integer <script type="math/tex">N</script>, we would have the estimate</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
T(n)
&\leq C \left\lceil \frac{n}{5} \right\rceil + \frac{7Cn}{10} + 6C + An \\
&\leq \frac{Cn}{5} + C + \frac{7Cn}{10} + 6C + An \\
&= \frac{9Cn}{10} + 7C + An \\
&= Cn + \left( -\frac{Cn}{10} + 7C + An \right).
\end{align*} %]]></script>
<p>In order to have a tight choice of <script type="math/tex">C</script>, we expect to have the upper bound</p>
<script type="math/tex; mode=display">Cn + \left( -\frac{Cn}{10} + 7C + An \right) \leq Cn,</script>
<p>which can hold if and only if</p>
<script type="math/tex; mode=display">-\frac{Cn}{10} + 7C + An \leq 0.</script>
<p>Rearranging the above inequality, we obtain</p>
<script type="math/tex; mode=display">C \left( -\frac{n}{10} + 7 \right) \leq - An.</script>
<p>Now, if we assume that <script type="math/tex">N > 70</script>, then <script type="math/tex">n > 70</script>, so that <script type="math/tex">\frac{n}{10} - 7 > 0</script>, and the above inequality is equivalent to</p>
<script type="math/tex; mode=display">C \geq \frac{An}{\frac{n}{10} - 7} = \frac{10An}{n - 70}.</script>
<p>Since <script type="math/tex">\frac{10An}{n - 70}</script> is decreasing in <script type="math/tex">n</script> for <script type="math/tex">n > 70</script>, it follows that <script type="math/tex">N > 70</script> and</p>
<script type="math/tex; mode=display">C \geq \frac{10AN}{N - 70}</script>
<p>together imply</p>
<script type="math/tex; mode=display">C \geq \frac{10AN}{N - 70} \geq \frac{10An}{n - 70}</script>
<p>for all <script type="math/tex">n > N</script>, and so</p>
<script type="math/tex; mode=display">Cn + \left( -\frac{Cn}{10} + 7C + An \right) \leq Cn,</script>
<p>for all <script type="math/tex">n > N</script>.</p>
<p>Let us now pick <script type="math/tex">N = 140</script> and <script type="math/tex">C = 20A</script>, so that <script type="math/tex">N > 70</script> and <script type="math/tex">C \geq \frac{10AN}{N - 70} = 20A</script>. By choosing a larger <script type="math/tex">A</script> if necessary, we can assume that</p>
<script type="math/tex; mode=display">T(n) \leq T \left( \left\lceil\frac{n}{5}\right\rceil \right) + T \left( \frac{7}{10}n + 6 \right) + An</script>
<p>for all <script type="math/tex">n</script>, and that</p>
<script type="math/tex; mode=display">T(n) \leq Cn</script>
<p>for all <script type="math/tex">n \leq 140</script>.</p>
<p>If</p>
<script type="math/tex; mode=display">n \leq \left\lfloor \frac{10}{7} \times (140 - 6) \right\rfloor = 191,</script>
<p>then <script type="math/tex">\left\lceil\frac{n}{5}\right\rceil \leq 140</script> and <script type="math/tex">\frac{7}{10}n + 6 \leq 140</script>, and so</p>
<script type="math/tex; mode=display">T(n) \leq C \left\lceil \frac{n}{5} \right\rceil + \frac{7Cn}{10} + 6C + An.</script>
<p>By our choice of <script type="math/tex">C</script>, we conclude that</p>
<script type="math/tex; mode=display">T(n) \leq Cn</script>
<p>for all <script type="math/tex">n \leq 191</script>.</p>
<p>In general, if <script type="math/tex">T(n) \leq Cn</script> for all <script type="math/tex">n \leq M</script>, then <script type="math/tex">T(n) \leq Cn</script> for all <script type="math/tex">n \leq \left\lfloor \frac{10}{7} \times (M - 6) \right\rfloor</script>. Since</p>
<script type="math/tex; mode=display">\left\lfloor \frac{10}{7} \times (M - 6) \right\rfloor > M</script>
<p>for all <script type="math/tex">M \geq 140</script>, we conclude that</p>
<script type="math/tex; mode=display">T(n) \leq Cn</script>
<p>for all <script type="math/tex">n \geq 140</script>. It follows that <script type="math/tex">T(n) = O(n)</script>.</p>
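<p>The recurrence can also be iterated numerically to confirm that <script type="math/tex">T(n)/n</script> stays bounded. The following sketch takes <script type="math/tex">A = 1</script> and seeds the recurrence with <script type="math/tex">T(n) = 20n</script> on a base range (the constants 20 and 140 are choices made for this illustration, and the cutoff 21 in the check pads the constant slightly to absorb the ceilings):</p>

```python
import math
from functools import lru_cache

# Worst-case recurrence T(n) = T(ceil(n/5)) + T(ceil(7n/10) + 6) + A*n
# with A = 1, seeded with T(n) = 20n for n <= 140.
@lru_cache(maxsize=None)
def T(n):
    if n <= 140:
        return 20 * n
    return T(math.ceil(n / 5)) + T(math.ceil(7 * n / 10) + 6) + n
```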
<p><a name="5"></a></p>
<h2 id="5-asymptotically-optimal-quicksort">5. Asymptotically Optimal Quicksort</h2>
<p>Having established the linear-time complexity of quickselect, we proceed to construct a variant of quicksort with the worst-case time complexity of <script type="math/tex">O(n \log n)</script>.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">math</span>
<span class="k">def</span> <span class="nf">quick_sort_median_of_medians_pivot</span><span class="p">(</span><span class="n">L</span><span class="p">):</span>
<span class="n">n</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">L</span><span class="p">)</span>
<span class="n">sort_median_of_medians_pivot</span><span class="p">(</span><span class="n">L</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="n">n</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="k">return</span> <span class="n">L</span>
<span class="k">def</span> <span class="nf">sort_median_of_medians_pivot</span><span class="p">(</span><span class="n">L</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="n">r</span><span class="p">):</span>
<span class="k">if</span> <span class="n">p</span> <span class="o"><</span> <span class="n">r</span><span class="p">:</span>
<span class="n">pivot_index</span> <span class="o">=</span> <span class="n">partition_median_of_medians_pivot</span><span class="p">(</span><span class="n">L</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="n">r</span><span class="p">)</span>
<span class="n">sort_median_of_medians_pivot</span><span class="p">(</span><span class="n">L</span><span class="p">,</span><span class="n">p</span><span class="p">,</span><span class="n">pivot_index</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="n">sort_median_of_medians_pivot</span><span class="p">(</span><span class="n">L</span><span class="p">,</span><span class="n">pivot_index</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">r</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">partition_median_of_medians_pivot</span><span class="p">(</span><span class="n">L</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="n">r</span><span class="p">):</span>
    <span class="n">mm</span> <span class="o">=</span> <span class="n">median_of_medians</span><span class="p">(</span><span class="n">L</span><span class="p">[</span><span class="n">p</span><span class="p">:</span><span class="n">r</span><span class="o">+</span><span class="mi">1</span><span class="p">],</span> <span class="n">math</span><span class="o">.</span><span class="n">floor</span><span class="p">((</span><span class="n">r</span><span class="o">+</span><span class="mi">2</span><span class="o">-</span><span class="n">p</span><span class="p">)</span><span class="o">/</span><span class="mi">2</span><span class="p">))</span> <span class="c"># 1-indexed lower median of the sublist</span>
<span class="n">initial_pivot_index</span> <span class="o">=</span> <span class="n">L</span><span class="p">[</span><span class="n">p</span><span class="p">:</span><span class="n">r</span><span class="o">+</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">index</span><span class="p">(</span><span class="n">mm</span><span class="p">)</span> <span class="o">+</span> <span class="n">p</span>
<span class="n">L</span><span class="p">[</span><span class="n">r</span><span class="p">],</span> <span class="n">L</span><span class="p">[</span><span class="n">initial_pivot_index</span><span class="p">]</span> <span class="o">=</span> <span class="n">L</span><span class="p">[</span><span class="n">initial_pivot_index</span><span class="p">],</span> <span class="n">L</span><span class="p">[</span><span class="n">r</span><span class="p">]</span>
<span class="n">pivot</span> <span class="o">=</span> <span class="n">L</span><span class="p">[</span><span class="n">r</span><span class="p">]</span>
<span class="n">i</span> <span class="o">=</span> <span class="n">p</span>
<span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">p</span><span class="p">,</span><span class="n">r</span><span class="p">):</span>
<span class="k">if</span> <span class="n">L</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o"><=</span> <span class="n">pivot</span><span class="p">:</span>
<span class="n">L</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">L</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="n">L</span><span class="p">[</span><span class="n">j</span><span class="p">],</span> <span class="n">L</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
<span class="n">i</span> <span class="o">=</span> <span class="n">i</span><span class="o">+</span><span class="mi">1</span>
<span class="n">L</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">L</span><span class="p">[</span><span class="n">r</span><span class="p">]</span> <span class="o">=</span> <span class="n">L</span><span class="p">[</span><span class="n">r</span><span class="p">],</span> <span class="n">L</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
<span class="k">return</span> <span class="n">i</span>
</code></pre>
</div>
<p>In the above implementation, we have sacrificed a little bit of performance for the sake of clarity. <code class="highlighter-rouge">partition_median_of_medians_pivot</code> spends <script type="math/tex">O(n)</script> time going through <code class="highlighter-rouge">L[p:r+1]</code>, looking for the exact location of the median of medians. Moreover, <code class="highlighter-rouge">median_of_medians</code> itself spends extra <script type="math/tex">O(n)</script> time going through <code class="highlighter-rouge">L</code>, looking for the exact location of the median of medians. These overheads could have been prevented by having <code class="highlighter-rouge">median_of_medians</code> return the ordered pair <code class="highlighter-rouge">(med, med_index)</code> of the median <em>and</em> its index in the original list. The above implementation is simpler, however, and the <script type="math/tex">O(n)</script> overhead does not change the asymptotic runtime of <code class="highlighter-rouge">partition_median_of_medians_pivot</code>.</p>
<p>Let us now show that the worst-case runtime complexity of <code class="highlighter-rouge">quick_sort_median_of_medians_pivot</code> is <script type="math/tex">O(n \log n)</script>. Let <script type="math/tex">T(n)</script> denote the worst-case runtime of <code class="highlighter-rouge">quick_sort_median_of_medians_pivot</code> on a list of length <script type="math/tex">n</script>. Since the pivot is always the median, we see that</p>
<script type="math/tex; mode=display">T(n) \leq T \left( \frac{n}{2} \right) + T \left( \frac{n}{2} \right) + O(n).</script>
<p>It now follows that</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
T(n)
&\leq 2T\left( \frac{n}{2} \right) + O(n) \\
&\leq 4T\left(\frac{n}{4} \right) + 2O(n) \\
&\leq \cdots \\
&\leq 2^{\log n} T\left( \frac{n}{2^{\log n}} \right) + O(n) \log n \\
&= n T(1) + O(n \log n) = O(n \log n),
\end{align*} %]]></script>
<p>as was to be shown.</p>
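<p>For concreteness, the selection strategy can be sketched as a compact functional variant of the routines discussed above. This sketch is illustrative (it is not the post’s in-place implementation) and assumes nothing beyond a plain Python list:</p>

```python
# A compact, functional selection sketch (illustrative; not the
# in-place implementation discussed above).
def median_of_medians(L):
    """Return an approximate median of L in O(len(L)) time: it is
    guaranteed to land between the 30th and 70th percentiles."""
    if len(L) <= 5:
        return sorted(L)[len(L) // 2]
    # Median of each group of (at most) five, then recurse.
    medians = [sorted(group)[len(group) // 2]
               for group in (L[i:i + 5] for i in range(0, len(L), 5))]
    return median_of_medians(medians)

def select(L, k):
    """Return the k-th smallest element of L (0-indexed)."""
    pivot = median_of_medians(L)
    lesser = [x for x in L if x < pivot]
    greater = [x for x in L if x > pivot]
    if k < len(lesser):
        return select(lesser, k)
    if k >= len(L) - len(greater):
        return select(greater, k - (len(L) - len(greater)))
    return pivot  # k falls among the elements equal to the pivot
```

<p>On <code class="highlighter-rouge">select(list(reversed(range(101))), 50)</code>, the sketch returns the median, 50.</p>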
<p><a name="6"></a></p>
<h2 id="6-additional-remarks-and-further-results">6. Additional Remarks and Further Results</h2>
<p><strong>6.1.</strong> As mentioned in [<a href="https://people.csail.mit.edu/rivest/pubs/BFPRT73.pdf">Blum–Floyd–Pratt–Rivest–Tarjan 1973</a>], interest in the selection problem dates back to 1883, when Lewis Carroll published “<a href="https://www.worldcat.org/title/lawn-tennis-tournaments-the-true-method-of-assigning-prizes-with-a-proof-of-the-fallacy-of-the-present-method/oclc/12240161">Lawn Tennis Tournaments: the true method of assigning prizes with a proof of the fallacy of the present method</a>”:</p>
<blockquote>
<p>Let it not be supposed that, in thus proposing to make these Tournaments a game of pure skill … I am altogether eliminating the element of luck … a thousand accidents might occur to prevent his playing best: the 4th best, 5th best, or even a worst Player, need not despair of winning even the 1st prize. Nor, again, let it be supposed that the present system, which allows an inferior player a chance of the 2nd prize … The proposed form of Tournament, though lasting a shorter time than the present one, has a great many more contests going on at once, and consequently furnishes the spectacle-loving public with a great deal more to look at.</p>
</blockquote>
<p><strong>6.2.</strong> Let <script type="math/tex">f(k, n)</script> be the minimum number of comparisons required to select the <script type="math/tex">k</script>th smallest element in a list of <script type="math/tex">n</script> elements. The relative difficulty of computing percentile levels is measured by</p>
<script type="math/tex; mode=display">F(\alpha) = \limsup_{n \to \infty} \frac{f(\lfloor \alpha (n-1) \rfloor + 1, n)}{n}. \hspace{2em} (0 \leq \alpha \leq 1)</script>
<p>To make sense of the above definition, we let <script type="math/tex">\alpha = \frac{k}{n-1}</script> and observe that</p>
<script type="math/tex; mode=display">\frac{f(\lfloor \alpha (n-1) \rfloor + 1, n)}{n} = \frac{f(k+1, n)}{n}.</script>
<p>In other words, <script type="math/tex">F(k/(n-1))</script> is the “tight” asymptotic upper bound on the minimum number of comparisons required to select the <script type="math/tex">(k+1)</script>th element in a list of <script type="math/tex">n</script> elements, normalized by <script type="math/tex">n</script> and taken in the limit of large <script type="math/tex">n</script>. It can be thought of as the value of the constant hidden by the <script type="math/tex">O(n)</script> upper bound for the time complexity of the quickselect algorithm. The <a href="https://en.wikipedia.org/wiki/Limit_superior_and_limit_inferior">limit superior</a> rules out anomalous behavior at small values of <script type="math/tex">n</script>.</p>
<p>[<a href="https://people.csail.mit.edu/rivest/pubs/BFPRT73.pdf">BFPRT 1973</a>] presents the following bounds on <script type="math/tex">F(\alpha)</script>:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*} \max_{0 \leq \alpha \leq 1} F(\alpha) &\leq 5.43 \\ \min_{0 \leq \alpha \leq 1} F(\alpha) &\geq 1.5. \end{align*} %]]></script>
<p>The upper bound was improved in [<a href="Schönhage">Schönhage–Paterson–Pippenger 1976</a>] to <script type="math/tex">3</script> in the case of finding the median, and again to 2.95 in [<a href="http://epubs.siam.org/doi/abs/10.1137/S0097539795288611?journalCode=smjcat">Dor–Zwick 1999</a>]. The current best lower bound on <script type="math/tex">F(\alpha)</script> is 2, as per [<a href="https://dl.acm.org/citation.cfm?id=22169">Bent–John 1985</a>]. In the case of finding the median, [<a href="http://epubs.siam.org/doi/abs/10.1137/S0895480199353895?journalCode=sjdmec">Dor–Zwick 2001</a>] establishes the slightly improved lower bound of <script type="math/tex">2 + 2^{-80}</script>.</p>
<p><strong>6.3.</strong> More sophisticated analysis based on the notion of entropy shows that “no comparison-based algorithm can do better” than quicksort [<a href="https://www.cs.princeton.edu/~rs/talks/QuicksortIsOptimal.pdf">Sedgewick–Bentley 2002</a>]. Wild’s currently-unpublished preprint [<a href="https://arxiv.org/abs/1608.04906">Wild 2017</a>] establishes an improved bound on the quickselect-based quicksort.</p>Mark Hyun-ki Kim1. QuicksortStrassen’s Algorithm2017-07-09T20:00:00-04:002017-07-09T20:00:00-04:00https://markhkim.com/foundtechnicalities/strassen-algorithm<p>Recall that an <script type="math/tex">n</script>-by-<script type="math/tex">m</script> <strong><a href="https://en.wikipedia.org/wiki/Matrix_(mathematics)">matrix</a></strong> is a two-dimensional array</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & & \vdots \\ a_{n1} & \cdots & a_{nm} \end{pmatrix} %]]></script>
<p>equipped with addition operation</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*} &\begin{pmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & & \vdots \\ a_{n1} & \cdots & a_{nm} \end{pmatrix} + \begin{pmatrix} b_{11} & \cdots & b_{1m} \\ \vdots & & \vdots \\ b_{n1} & \cdots & b_{nm} \end{pmatrix} \\ =& \begin{pmatrix} a_{11} + b_{11} & \cdots & a_{1m} + b_{1m} \\ \vdots & & \vdots \\ a_{n1} + b_{n1} & \cdots & a_{nm} + b_{nm} \end{pmatrix} \end{align*} %]]></script>
<p>and scalar multiplication operation</p>
<script type="math/tex; mode=display">% <![CDATA[
\lambda \begin{pmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & & \vdots \\ a_{n1} & \cdots & a_{nm} \end{pmatrix} = \begin{pmatrix} \lambda a_{11} & \cdots & \lambda a_{1m} \\ \vdots & & \vdots \\ \lambda a_{n1} & \cdots & \lambda a_{nm} \end{pmatrix}. %]]></script>
<p>In special circumstances, we can also define the <strong>product</strong> of two matrices. Specifically, the product of an <script type="math/tex">n</script>-by-<script type="math/tex">m</script> matrix <script type="math/tex">(a_{ij})</script> and an <script type="math/tex">m</script>-by-<script type="math/tex">p</script> matrix <script type="math/tex">(b_{kl})</script> is the <script type="math/tex">n</script>-by-<script type="math/tex">p</script> matrix <script type="math/tex">(c_{qr})</script>, where</p>
<script type="math/tex; mode=display">c_{qr} = \sum_{s = 1}^m a_{qs} b_{sr}</script>
<p>for each choice of <script type="math/tex">q</script> and <script type="math/tex">r</script>. Since <script type="math/tex">O(m)</script> operations are required for each entry <script type="math/tex">c_{qr}</script>, we see that it takes <script type="math/tex">O(nmp)</script> operations to compute the product of an <script type="math/tex">n</script>-by-<script type="math/tex">m</script> matrix and an <script type="math/tex">m</script>-by-<script type="math/tex">p</script> matrix. In particular, if we consider square matrices, i.e., <script type="math/tex">n=m=p</script>, then matrix multiplication runs in <strong>cubic</strong> time with respect to the <strong>size</strong> <script type="math/tex">n</script>.</p>
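<p>The defining formula for <script type="math/tex">c_{qr}</script> translates directly into a triple loop; a minimal sketch over a list-of-rows representation (the function name is illustrative):</p>

```python
def mat_mul(A, B):
    """Naive O(n*m*p) product of an n-by-m matrix A and an m-by-p
    matrix B, each represented as a list of rows."""
    n, m, p = len(A), len(B), len(B[0])
    assert all(len(row) == m for row in A), "inner dimensions must agree"
    # c[q][r] = sum over s of a[q][s] * b[s][r]
    return [[sum(A[q][s] * B[s][r] for s in range(m)) for r in range(p)]
            for q in range(n)]
```

<p>For instance, <code class="highlighter-rouge">mat_mul([[1, 2], [3, 4]], [[5, 6], [7, 8]])</code> yields <code class="highlighter-rouge">[[19, 22], [43, 50]]</code>.</p>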
<p>A classic work on improving the asymptotic bound on matrix multiplication is the <strong>Strassen algorithm</strong>, first published <a href="http://gdz.sub.uni-goettingen.de/dms/load/img/?PID=GDZPPN001168215">in 1969</a>. The algorithm relies crucially on <a href="https://en.wikipedia.org/wiki/Block_matrix">block multiplication</a>, a method of computing matrix multiplication <em>en masse</em> by partitioning the matrices into submatrices and computing the matrix product as if each submatrix is a scalar.</p>
<p>By a <strong>submatrix</strong> of a matrix</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & & \vdots \\ a_{n1} & \cdots & a_{nm} \end{pmatrix}, %]]></script>
<p>we mean a matrix</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix} a_{pr} & \cdots & a_{ps} \\ \vdots & & \vdots \\ a_{qr} & \cdots & a_{qs} \end{pmatrix}, %]]></script>
<p>where <script type="math/tex">1 \leq p \leq q \leq n</script> and <script type="math/tex">1 \leq r \leq s \leq m</script>. Now, suppose that we have three matrices <script type="math/tex">A</script>, <script type="math/tex">B</script>, and <script type="math/tex">C</script> of size <script type="math/tex">N = m 2^n</script>, partitioned into equally-sized submatrices, of size <script type="math/tex">\frac{N}{2} = m 2^{n-1}</script>:</p>
<script type="math/tex; mode=display">% <![CDATA[
A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}, \, \, \, B = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}, \, \, \, C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}. %]]></script>
<p>If <script type="math/tex">C = AB</script>, then the submatrices of <script type="math/tex">C</script> can be written in terms of the submatrices of <script type="math/tex">A</script> and <script type="math/tex">B</script> as follows:</p>
<script type="math/tex; mode=display">C_{ij} = A_{i1}B_{1j} + A_{i2}B_{2j}.</script>
<p>We observe that block multiplication by itself does not reduce the time complexity of matrix multiplication. Indeed, computing <script type="math/tex">C_{ij}</script> takes <script type="math/tex">O((N/2)^3 + (N/2)^3) = O(N^3/4)</script> operations, whence computing <script type="math/tex">C</script> via block multiplication takes <script type="math/tex">O(N^3)</script> operations, just as many as standard matrix multiplication. Therefore, performing 8 block matrix multiplications is not an improvement.</p>
<p>If, however, we define</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*} D_1 &= (A_{11}+A_{22})(B_{11}+B_{22}) \\ D_2 &= (A_{21} + A_{22})B_{11} \\ D_3 &= A_{11}(B_{12} - B_{22}) \\ D_4 &= A_{22}(B_{21} - B_{11}) \\ D_5 &= (A_{11} + A_{12}) B_{22} \\ D_6 &= (-A_{11} + A_{21}) (B_{11} + B_{12}) \\ D_7 &= (A_{12} - A_{22}) (B_{21} + B_{22})\end{align*} %]]></script>
<p>then</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*} C_{11} &= D_1 + D_4 - D_5 + D_7 \\ C_{12} &= D_3 + D_5 \\ C_{21} &= D_2 + D_4 \\ C_{22} &= D_1 + D_3 - D_2 + D_6, \end{align*} %]]></script>
<p>and <script type="math/tex">C</script> can be computed with 7 block matrix multiplications. Since the algorithm can be applied recursively, we see that</p>
<script type="math/tex; mode=display">T_m(n) = 7T_m(n-1) + O(4^n),</script>
<p>where <script type="math/tex">T_m(n)</script> is the number of operations the above algorithm takes for matrices of size <script type="math/tex">N = m2^n</script>. Here, <script type="math/tex">O(4^n) = O((2^n)^2)</script> is the number of additions performed for matrices of size <script type="math/tex">N = m2^n</script>.</p>
<p>From the above identity, we conclude that</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*} T_m(n) &= 7T_m(n-1) + O(4^n) \\ &= 7^2 T_m(n-2) + 7 \cdot O(4^{n-1}) + O(4^n) \\ &\vdots \\ &= 7^n T_m(0) + O\left( \sum_{i=0}^{n-1} 7^i 4^{n-i} \right) \\ &= O(7^n) = O\left( 7^{\log_2(N/m)} \right) \\ &= O\left( (N/m)^{\log_2 7} \right) = O(N^{\log_2 7}) \approx O(N^{2.807}), \end{align*} %]]></script>
<p>which is asymptotically smaller than <script type="math/tex">O(N^3)</script>.</p>
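<p>The recursion can be sketched in a few lines. This is a didactic version that assumes square matrices whose size is a power of 2 and omits the usual optimizations (such as cutting over to naive multiplication for small blocks):</p>

```python
def strassen(A, B):
    """Strassen multiplication of square list-of-rows matrices whose
    size is a power of 2; a didactic sketch, not a tuned routine."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    def quad(M):  # split into four h-by-h blocks
        return ([row[:h] for row in M[:h]], [row[h:] for row in M[:h]],
                [row[:h] for row in M[h:]], [row[h:] for row in M[h:]])
    def add(X, Y):
        return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    def sub(X, Y):
        return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    A11, A12, A21, A22 = quad(A)
    B11, B12, B21, B22 = quad(B)
    # The seven products, each a recursive call.
    D1 = strassen(add(A11, A22), add(B11, B22))
    D2 = strassen(add(A21, A22), B11)
    D3 = strassen(A11, sub(B12, B22))
    D4 = strassen(A22, sub(B21, B11))
    D5 = strassen(add(A11, A12), B22)
    D6 = strassen(sub(A21, A11), add(B11, B12))
    D7 = strassen(sub(A12, A22), add(B21, B22))
    C11 = add(sub(add(D1, D4), D5), D7)
    C12 = add(D3, D5)
    C21 = add(D2, D4)
    C22 = add(sub(add(D1, D3), D2), D6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bot = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bot
```

<p>A practical implementation would switch to the standard cubic algorithm once the blocks are small, since the constant factor of the recursion dominates at small sizes.</p>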
<p>Since Strassen’s algorithm is a <a href="https://en.wikipedia.org/wiki/Divide_and_conquer_algorithm">divide-and-conquer algorithm</a>, it is not difficult to parallelize the algorithm. Indeed:</p>
<blockquote>
<p>The divide-and-conquer paradigm improves program modularity, and often leads to simple and efficient algorithms. It has therefore proven to be a powerful tool for sequential algorithm designers. Divide-and-conquer plays an even more prominent role in parallel algorithm design. Because the subproblems created in the first step are typically independent, they can be solved in parallel. Often, the subproblems are solved recursively and thus the next step yields even more subproblems to be solved in parallel. As a consequence, even divide-and-conquer algorithms that were designed for sequential machines typically have some inherent parallelism. (<a href="https://www.cs.cmu.edu/~guyb/papers/BM04.pdf">Blelloch–Maggs, 2004</a>)</p>
</blockquote>
<p>Naïve parallelization of Strassen’s algorithm does not yield much improvement in performance, however. In order to carry out block multiplication of submatrices computed by different processors, a great deal of communication among the processors is needed. This, in fact, is a serious bottleneck.</p>
<blockquote>
<p>The rate at which operands can be brought to the processor is the primary performance bottleneck for many scientific computing codes. … The local and global memory bandwidth bottleneck is expected to become a more serious problem in the future due to the nonuniform scaling of technology … Many supercomputing applications stretch the capabilities of the underlying hardware, and bottlenecks may occur in many different parts of the system. (<a href="https://www.nap.edu/read/11148/chapter/7#105">NRC 2004</a>)</p>
</blockquote>
<p>The <a href="https://markhkim.com/foundtechnicalities/what-is-an-efficient-parallel-algorith/">usual model of parallel computation</a> assumes shared memory and thus does not give the full picture when it comes to algorithms with high <em><a href="https://en.wikipedia.org/wiki/Communication_complexity">communication complexity</a></em>. The <a href="http://www.sciencedirect.com/science/article/pii/030439759090188N">LPRAM model</a>, a direct generalization of the PRAM model, solves this problem by adding local memory to each processor (hence the <em>L</em>), on which a variant of Strassen’s algorithm that minimizes the communication cost has been developed. See [<a href="https://arxiv.org/pdf/1202.3173.pdf">Ballard–Demmel–Holtz–Lipshitz–Schwartz 2012</a>].</p>Mark Hyun-ki KimRecall that an -by- matrix is a two-dimensional arrayWhat Is An Efficient Parallel Algorithm?2017-07-09T19:00:00-04:002017-07-09T19:00:00-04:00https://markhkim.com/foundtechnicalities/what-is-an-efficient-parallel-algorith<p>Most algorithms we encounter in the study of computer science are <strong>sequential</strong>, i.e., designed with the assumption that only one computational operation can be carried out at each step. Indeed, we recall that the abstract model of computation typically employed in the context of the analysis of algorithms is the <a href="https://en.wikipedia.org/wiki/Random-access_machine">random-access machine</a>, which executes basic operations with no <a href="https://en.wikipedia.org/wiki/Concurrency_(computer_science)">concurrency</a>.</p>
<p>Sequential algorithms lend themselves to a straightforward concept of efficiency, which can be computed by simply adding up the number of basic operations in the RAM model. Indeed, it is reasonable to assert that the amount of time it takes to run an algorithm is proportional to the total number of basic operations.</p>
<p>Furthermore, Moore’s Law, the principle that the computational capabilities of <a href="https://en.wikipedia.org/wiki/Microprocessor">microprocessors</a> of a fixed size grow exponentially over time, allowed the subject’s preoccupation with sequential algorithms to linger. Concurrency was poorly understood, and any algorithm that wasn’t efficient was to be salvaged by Moore’s Law.</p>
<p>Moore’s law has now been <a href="https://www.technologyreview.com/s/601441/moores-law-is-dead-now-what/">declared obsolete</a>, and the microprocessor industry produces <a href="https://en.wikipedia.org/wiki/Multi-core_processor">multi-core processors</a>, which can execute several computational operations in parallel. Sequential algorithms cannot make efficient use of multi-core processors, hastening the need for a systematic study of <strong>parallel</strong> algorithms.</p>
<p>A natural extension of the RAM model is the <a href="https://en.wikipedia.org/wiki/Parallel_random-access_machine">parallel random-access machine</a> model, in which multiple processors share the same memory module. Each of the processors can access an arbitrary word, a predetermined size of data, in the memory module in a single step, and different processors can access the memory module in parallel.</p>
<p>In the PRAM model, the total number of basic operations present in an algorithm is no longer a good abstraction of efficiency, as some operations can be carried out in parallel. In fact, the amount of time it takes to run an algorithm is no longer proportional to the total number of basic operations; it is determined instead by the amount of time the slowest of the processors takes to finish the task it was given.</p>
<p>If we assume that all processors in our PRAM are equally efficient, then efficiency depends on how well we can divide up the basic operations in an algorithm. Indeed, if a computational task at hand requires working through a long chain of operations that cannot be parallelized, then one processor must deal with the entire chain, resulting in a less-than-ideal distribution of operations.</p>
<p>To analyze such dependencies among operations, let us construct a <a href="https://en.wikipedia.org/wiki/Graph_(discrete_mathematics)">directed graph</a> of dependencies. We represent each basic operation as a node. The incoming edges of a node represent the input values of the operation, and the outgoing edge represents the output value of the operation. With this abstraction, an algorithm can be thought of as a directed graph of operations and values, leading to one final operation whose outgoing edge represents the output value of the algorithm.</p>
<p>The end result is a <a href="https://en.wikipedia.org/wiki/Tree_(graph_theory)">tree</a> whose leaves represent the <em>input variable</em> operations that merely feed their outgoing edges into the next operations. Since each node of the tree represents one basic operation, the total <strong>work</strong> needed for an algorithm is merely the total number of nodes.</p>
<p>Recall now that the <strong>depth</strong> of a tree is the maximal distance from the root node to a leaf node. A directed path from one node to another represents a chain of dependencies, and so at least one processor must carry out every operation in the path from the farthest leaf node to the root node. It follows that the depth of the tree represents the time it takes to execute the corresponding algorithm.</p>
<p>For this reason, the tree we have constructed above is referred to as the <strong>work-depth model</strong>. A version of <a href="https://en.wikipedia.org/wiki/Richard_P._Brent">Brent</a>’s theorem guarantees that an algorithm that can be modeled in the work-depth model with work <script type="math/tex">w</script> and depth <script type="math/tex">d</script> can be executed in <script type="math/tex">O(\frac{w}{p} + d)</script> steps in the <script type="math/tex">p</script>-processor PRAM model.</p>
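<p>To see the <script type="math/tex">O(\frac{w}{p} + d)</script> bound concretely, consider summing <script type="math/tex">n</script> numbers by a balanced reduction tree, which has work <script type="math/tex">n - 1</script> and depth <script type="math/tex">\lceil \log_2 n \rceil</script>. A toy greedy scheduler (an illustrative sketch, not a PRAM simulator) counts the rounds needed when at most <script type="math/tex">p</script> independent additions run per round:</p>

```python
def greedy_schedule_rounds(n, p):
    """Rounds to sum n values on p processors when each round runs
    up to p independent pairwise additions from a balanced
    reduction tree (work n - 1, depth ceil(log2 n))."""
    rounds, remaining = 0, n
    while remaining > 1:
        pairs = remaining // 2      # independent additions available now
        done = min(pairs, p)        # at most p of them run in parallel
        remaining -= done           # each addition merges two values
        rounds += 1
    return rounds
```

<p>With <code class="highlighter-rouge">n = 8</code>, one processor needs 7 rounds while 8 processors need only 3, matching the work <script type="math/tex">n - 1</script> and the depth <script type="math/tex">\log_2 n</script>, respectively.</p>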
<p>Given an algorithm, we let <script type="math/tex">T_p</script> denote the time it takes to run the algorithm on a <script type="math/tex">p</script>-processor PRAM. In the best-case scenario, there are no abnormally long chains of dependencies, and the work is distributed equally among the <script type="math/tex">p</script> processors, resulting in the lower bound</p>
<script type="math/tex; mode=display">\frac{T_1}{p} \leq T_p,</script>
<p>where <script type="math/tex">T_1</script> denotes the time it takes to run the algorithm on a single-processor RAM.</p>
<p>Brent’s theorem provides an upper bound on <script type="math/tex">T_p</script>, and so</p>
<script type="math/tex; mode=display">\frac{T_1}{p} \leq T_p \leq O\left(\frac{w}{p} + d\right).</script>
<p>Now, <script type="math/tex">T_1</script> is proportional to <script type="math/tex">w</script>, whence it follows that</p>
<script type="math/tex; mode=display">\Omega\left(\frac{w}{p}\right) \leq T_p \leq O\left(\frac{w}{p} + d\right).</script>
<p>Therefore, the work-depth model is a reasonable abstraction of parallel computing, even though the PRAM model is more closely aligned with our intuitive notion of a multi-core processor.</p>
<p>See (<a href="https://www.cs.cmu.edu/~guyb/papers/BM04.pdf">Blelloch–Maggs, 2004</a>) for a survey of basic techniques of parallel algorithm design, as well as their analysis through the work-depth model.</p>Mark Hyun-ki KimMost algorithms we encounter in the study of computer science are sequential, i.e., designed with the assumption that only one computational operation can be carried out at each step. Indeed, we recall that the abstract model of computation typically employed in the context of the analysis of algorithms is the random-access machine, which executes basic operations with no concurrency.Perfect Secrecy, or Not?2017-06-22T12:00:00-04:002017-06-22T12:00:00-04:00https://markhkim.com/foundtechnicalities/perfect-secrecy-or-not<p>Cryptography is the science of designing systems that can withstand malicious attempts to abuse them. Every cryptographic scenario can be illustrated by the story of security’s inseparable couple, <a href="https://dl.acm.org/citation.cfm?doid=359340.359342">Alice and Bob</a>: Alice and Bob want to send messages to each other, while deterring various unwanted interlocutors, eavesdroppers, and tamperers from participating.</p>
<p>In the simplest model, Alice sends Bob a secret message, and Eve the eavesdropper attempts to decode Alice’s message. Alice’s goal is to encrypt the message in such a way that Bob can decrypt but Eve cannot. The formal study of such communication protocols must begin with the construction of a <strong>cryptographic model</strong>, which we can then analyze to determine its functionality and security.</p>
<p>The formalization of the Alice–Eve–Bob scenario consists of the following:</p>
<ul>
<li>A set <script type="math/tex">\mathcal{M}</script> of <strong>messages</strong>, and another set <script type="math/tex">\mathcal{C}</script> of <strong>ciphertexts</strong>;</li>
<li>Alice has an <strong>encryption algorithm</strong> <script type="math/tex">\operatorname{Enc}(m)</script> that takes <script type="math/tex">m \in \mathcal{M}</script> and outputs <script type="math/tex">c \in \mathcal{C}</script>;</li>
<li>Bob has a <strong>decryption algorithm</strong> <script type="math/tex">\operatorname{Dec}(c)</script> that takes <script type="math/tex">c \in \mathcal{C}</script> and outputs <script type="math/tex">m \in \mathcal{M}</script>;</li>
<li>Eve has her own decryption algorithm <script type="math/tex">\mathcal{A}(c)</script> that takes <script type="math/tex">c \in \mathcal{C}</script> and outputs <script type="math/tex">m \in \mathcal{M}</script>.</li>
</ul>
<p>In order for the above model to be functional, we expect to have <script type="math/tex">\operatorname{Dec}(\operatorname{Enc}(m)) = m</script> for all <script type="math/tex">m \in \mathcal{M}</script>. Security would mean that <script type="math/tex">\mathcal{A}(\operatorname{Enc}(m)) \neq m</script> for all <script type="math/tex">m \in \mathcal{M}</script>, or, at least, for a large portion of <script type="math/tex">m \in \mathcal{M}</script>.</p>
<p>This naïve definition of functionality and security quickly turns out to be hopeless. History has shown that one party can often find out which encryption method the other party is using through espionage and other tactics. For Alice to guarantee the security of her communication with Bob, it is imperative for her to use an encryption method that produces messages that cannot be decrypted even when the encryption algorithm is leaked to Eve.</p>
<p>To this end, we construct an encryption method with a <strong>secret key</strong>, without which encrypted messages cannot be decrypted with ease. We update the above cryptographic model accordingly:</p>
<ul>
<li>We now have a set <script type="math/tex">\mathcal{M}</script> of messages, a set <script type="math/tex">\mathcal{C}</script> of ciphertexts, and a set <script type="math/tex">\mathcal{K}</script> of <strong>keys</strong>.</li>
<li>There is a key-generation algorithm <script type="math/tex">\operatorname{Gen}</script> that outputs an element of <script type="math/tex">\mathcal{K}</script>.</li>
<li>Alice is equipped with an encryption algorithm <script type="math/tex">\operatorname{Enc}(k,m) = \operatorname{Enc}_k(m)</script> that takes <script type="math/tex">(k,m) \in \mathcal{K} \times \mathcal{M}</script> and outputs <script type="math/tex">c \in \mathcal{C}</script>.</li>
<li>Bob is equipped with a decryption algorithm <script type="math/tex">\operatorname{Dec}(k,c) = \operatorname{Dec}_k(c) = m</script> that takes <script type="math/tex">(k,c) \in \mathcal{K} \times \mathcal{C}</script> and outputs <script type="math/tex">m \in \mathcal{M}</script>.</li>
<li>Eve is equipped with another decryption algorithm <script type="math/tex">\mathcal{A}(k,c)</script> that takes <script type="math/tex">(k,c) \in \mathcal{K} \times \mathcal{C}</script> and outputs <script type="math/tex">m \in \mathcal{M}</script>.</li>
</ul>
<p>In this context, it is reasonable to say that the model is functional if</p>
<script type="math/tex; mode=display">\operatorname{Dec}_k(\operatorname{Enc}_k(m)) = m</script>
<p>for all <script type="math/tex">(k,m) \in \mathcal{K} \times \mathcal{M}</script>. In other words, Bob should be able to decrypt any message that Alice encrypted, so long as both of them use the same key.</p>
<p>Is this new model secure? We should hope that Eve cannot, in general, guess the key that Alice and Bob choose to use. Therefore, it is reasonable to assume that <script type="math/tex">\operatorname{Gen}</script> is a <strong>randomized algorithm</strong>.</p>
<p>Moreover, if Eve fails to obtain the correct key, then she should not be able to recover the original message. For this, a typical ciphertext must not carry any inherent meaning on its own, thereby deterring Eve from deciphering it without a key.</p>
<p>We formalize these observations as follows:</p>
<blockquote>
<p><strong>Definition 1</strong> (Shannon, <a href="https://en.wikipedia.org/wiki/Communication_Theory_of_Secrecy_Systems">1949</a>). A cryptographic system <script type="math/tex">(\operatorname{Gen}, \operatorname{Enc}, \operatorname{Dec})</script> is <strong>perfectly secret</strong> if, for each probability distribution <script type="math/tex">\mathcal{D}</script> over <script type="math/tex">\mathcal{M}</script>, we have the identity</p>
<script type="math/tex; mode=display">\operatorname{Prob}_{
\substack{
k \leftarrow \operatorname{Gen} \\
m \leftarrow \mathcal{D}
}
}
[
m = \overline{m} \mid \operatorname{Enc}_k(m) = \bar{c}
]
= \operatorname{Prob}_{
m \leftarrow \mathcal{D}
}
[
m = \bar{m}
]</script>
<p>for all choices of <script type="math/tex">(\bar{m},\bar{c}) \in \mathcal{M} \times \mathcal{C}</script>. In other words, the probability of recovering a fixed message <script type="math/tex">\bar{m}</script> does not depend on the choice of the ciphertext <script type="math/tex">\bar{c}</script>.</p>
</blockquote>
<p>An implementation of Shannon’s perfect secrecy model is the <strong>one-time pad</strong> algorithm. To state the algorithm, we recall from <a href="#1-10">Section 1.10</a> that <script type="math/tex">\mathbf{F}_2 = \{0,1\}</script> denotes the finite field of size 2.</p>
<blockquote>
<p><strong>Theorem 2</strong> (Shannon one-time pad, <a href="https://en.wikipedia.org/wiki/Communication_Theory_of_Secrecy_Systems">1949</a>). Fix <script type="math/tex">n \in \mathbb{N}</script> and let <script type="math/tex">\mathcal{M} = \mathcal{C} = \mathcal{K} = \mathbf{F}_2^n</script>, the <script type="math/tex">n</script>-fold <a href="https://en.wikipedia.org/wiki/Cartesian_product">cartesian product</a> of <script type="math/tex">\mathbf{F}_2</script>. We define <script type="math/tex">\operatorname{Gen}</script> to be the algorithm that chooses <script type="math/tex">k</script> uniformly from <script type="math/tex">\mathcal{K}</script>. Let <script type="math/tex">\oplus</script> denote the coordinatewise addition on <script type="math/tex">\mathbf{F}_2^n</script></p>
<script type="math/tex; mode=display">(a_1,\ldots,a_n) \oplus (b_1,\ldots,b_n) = (a_1+b_1,\ldots,a_n+b_n)</script>
<p>and define</p>
<script type="math/tex; mode=display">\operatorname{Enc}_k(m) = k \oplus m \hspace{1em}\mbox{and}\hspace{1em} \operatorname{Dec}_k(c) = c \oplus k</script>
<p>for all <script type="math/tex">(m,c,k) \in \mathcal{M} \times \mathcal{C} \times \mathcal{K}</script>. The resulting system <script type="math/tex">(\operatorname{Gen}, \operatorname{Enc}, \operatorname{Dec})</script>, called the <strong>one-time pad</strong>, is a perfectly secret cryptographic system.</p>
</blockquote>
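<p>A minimal byte-level sketch of <strong>OTP</strong> (grouping the bits of <script type="math/tex">\mathbf{F}_2^n</script> eight at a time), using Python’s <code class="highlighter-rouge">secrets</code> module for the uniform key that <script type="math/tex">\operatorname{Gen}</script> requires; the function names are illustrative:</p>

```python
import secrets

def gen(n):
    """Gen: draw an n-byte key uniformly at random."""
    return secrets.token_bytes(n)

def enc(key, message):
    """Enc_k(m) = k XOR m, byte by byte."""
    assert len(key) == len(message), "key must be as long as the message"
    return bytes(k ^ b for k, b in zip(key, message))

dec = enc  # XOR is its own inverse, so Dec_k(c) = c XOR k
```

<p>Decrypting with the same key recovers the message: <code class="highlighter-rouge">dec(key, enc(key, m)) == m</code>.</p>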
<p>The one-time pad algorithm (<strong>OTP</strong>) is one of the earliest known examples of encryption methods with a secret key. Miller was the first person to describe <strong>OTP</strong> formally (<a href="https://babel.hathitrust.org/cgi/pt?id=nyp.33433019287345;view=1up;seq=7">Miller 1882</a>). Shannon, on the other hand, was the first to formally prove the security of <strong>OTP</strong> (<a href="https://en.wikipedia.org/wiki/Communication_Theory_of_Secrecy_Systems">Shannon 1949</a>).</p>
<p><strong>OTP</strong> is an example of a <strong>symmetric-key encryption scheme</strong>, as the same key is used both for the encryption process and the decryption process. The above theorem shows that <strong>OTP</strong> is essentially unbreakable, but <strong>OTP</strong> is not without problems. We first note that a key <script type="math/tex">k</script> must be at least as large as the message <script type="math/tex">m</script> being sent. In fact, this condition cannot be relaxed:</p>
<blockquote>
<p><strong>Theorem 3</strong> (Shannon, <a href="https://en.wikipedia.org/wiki/Communication_Theory_of_Secrecy_Systems">1949</a>). In every perfectly secret cryptographic system, <script type="math/tex">\vert \mathcal{K} \vert \geq \vert \mathcal{M} \vert</script>.</p>
</blockquote>
<p>Worse, a key can never be recycled. Indeed, if <script type="math/tex">c_1 = m_1 \oplus k</script> and <script type="math/tex">c_2 = m_2 \oplus k</script>, then</p>
<script type="math/tex; mode=display">c_1 \oplus c_2 = m_1 \oplus m_2 \oplus (k \oplus k) = m_1 \oplus m_2.</script>
<p>From <script type="math/tex">m_1 \oplus m_2</script>, the individual messages can be obtained with reasonable certainty. A historical example is the <strong>Venona project</strong>:</p>
<blockquote>
<p>One-time pads used properly only once are unbreakable; however, the KGB’s cryptographic material manufacturing center in the Soviet Union apparently reused some of the pages from one-time pads. This provided Arlington Hall with an opening. Very few of the 1942 KGB messages could be solved because there was very little duplication of one-time pad pages in those messages. The situation was more favorable in 1943, even more so in 1944, and the success rate improved accordingly. (<a href="https://www.nsa.gov/about/_files/cryptologic_heritage/publications/coldwar/venona_story.pdf">Benson 2011</a>)</p>
</blockquote>
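<p>The key-reuse leak is easy to demonstrate; a toy example with made-up key and messages:</p>

```python
# Reusing a pad leaks m1 XOR m2: XORing the two ciphertexts cancels
# the key.  The key and messages below are made up for illustration.
key = bytes([0x13, 0x37, 0x42, 0x99, 0x01])
m1, m2 = b"HELLO", b"WORLD"
c1 = bytes(k ^ b for k, b in zip(key, m1))
c2 = bytes(k ^ b for k, b in zip(key, m2))
leak = bytes(a ^ b for a, b in zip(c1, c2))
assert leak == bytes(a ^ b for a, b in zip(m1, m2))  # key has vanished
# An attacker who guesses m1 (a "crib") recovers m2 immediately:
recovered = bytes(a ^ b for a, b in zip(leak, m1))
assert recovered == m2
```
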
<p>In short, it is hopeless to aim for perfect secrecy. We shall have to settle for something less while maintaining a reasonable level of security, which is the underlying theme of cryptography.</p>
<p>We conclude the section with proofs of the above theorems. To this end, we shall make use of a technical lemma.</p>
<blockquote>
<p><strong>Lemma 4.</strong> The perfect secrecy condition holds if and only if</p>
<script type="math/tex; mode=display">\operatorname{Prob}_{
k \leftarrow \operatorname{Gen}
}
[
\operatorname{Enc}_k(m_0) = \bar{c}
]
= \operatorname{Prob}_{
k \leftarrow \operatorname{Gen}
}
[
\operatorname{Enc}_k(m_1) = \bar{c}
]</script>
<p>for all choices of <script type="math/tex">m_0,m_1 \in \mathcal{M}</script> and <script type="math/tex">\bar{c} \in \mathcal{C}</script>. In other words, the probabilities of obtaining the same ciphertext <script type="math/tex">\bar{c}</script> from two different messages <script type="math/tex">m_0</script> and <script type="math/tex">m_1</script> are the same.</p>
</blockquote>
<p>We defer the proof of the lemma and establish the theorems.</p>
<p>For <strong>OTP</strong>,</p>
<script type="math/tex; mode=display">\operatorname{Dec}_k(\operatorname{Enc}_k(m))
= (m \oplus k) \oplus k
= m \oplus (k \oplus k) = m</script>
<p>for all choices of <script type="math/tex">(m,k) \in \mathcal{M} \times \mathcal{K}</script>, whence <strong>OTP</strong> is functional.</p>
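<p>The functionality computation can be checked on concrete byte strings. The following is a minimal toy implementation of ours, again using byte-wise XOR:</p>

```python
import secrets

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def gen(n: int) -> bytes:
    """Gen: draw an n-byte key uniformly at random."""
    return secrets.token_bytes(n)

def enc(k: bytes, m: bytes) -> bytes:
    """Enc_k(m) = m XOR k."""
    return xor(m, k)

def dec(k: bytes, c: bytes) -> bytes:
    """Dec_k(c) = c XOR k."""
    return xor(c, k)

m = b"attack at dawn"
k = gen(len(m))
assert dec(k, enc(k, m)) == m  # Dec_k(Enc_k(m)) = m
```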
<p>To check that <strong>OTP</strong> is perfectly secret, we observe that, for an arbitrary choice of a probability distribution <script type="math/tex">\mathcal{D}</script> over <script type="math/tex">\mathcal{M}</script>,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\operatorname{Prob}_{
k \leftarrow \operatorname{Gen}
}
[\operatorname{Enc}_k(\bar{m}) = \bar{c}]
=& \operatorname{Prob}_{
k \leftarrow \operatorname{Gen}
}
[(\bar{m} \oplus k) = \bar{c}] \\
=& \operatorname{Prob}_{
k \leftarrow \operatorname{Gen}
}
[k = \bar{c} \oplus \bar{m}] \\
=& 2^{-n}
\end{align*} %]]></script>
<p>regardless of the choice of <script type="math/tex">(\bar{m},\bar{c}) \in \mathcal{M} \times \mathcal{C}</script>. Therefore,</p>
<script type="math/tex; mode=display">\operatorname{Prob}_{
k \leftarrow \operatorname{Gen}
}
[
\operatorname{Enc}_k(m_0) = \bar{c}
]
= 2^{-n}
= \operatorname{Prob}_{
k \leftarrow \operatorname{Gen}
}
[
\operatorname{Enc}_k(m_1) = \bar{c}
]</script>
<p>for all choices of <script type="math/tex">m_0,m_1 \in \mathcal{M}</script> and <script type="math/tex">\bar{c} \in \mathcal{C}</script>. Perfect secrecy of <strong>OTP</strong> now follows from the lemma.</p>
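<p>For small <script type="math/tex">n</script>, the computation above can be verified exhaustively: with a message fixed and the key ranging over all <script type="math/tex">2^n</script> equally likely values, every ciphertext arises for exactly one key. The brute-force sketch below (our own check, over 3-bit strings) confirms this:</p>

```python
from collections import Counter
from itertools import product

n = 3
keys = list(product([0, 1], repeat=n))  # all 2^n keys, each with probability 2^-n

def enc(k, m):
    return tuple(mi ^ ki for mi, ki in zip(m, k))  # bitwise XOR

for m in product([0, 1], repeat=n):  # every possible message m-bar
    counts = Counter(enc(k, m) for k in keys)
    # Each of the 2^n ciphertexts occurs for exactly one key,
    # i.e., with probability 2^-n over the choice of key.
    assert len(counts) == 2 ** n and set(counts.values()) == {1}
```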
<p>To show that any perfectly secret cryptographic system must have a key space at least as large as its message space, we suppose that <script type="math/tex">% <![CDATA[
\vert \mathcal{K} \vert < \vert \mathcal{M} \vert %]]></script>. We fix <script type="math/tex">(m_0,k_0) \in \mathcal{M} \times \mathcal{K}</script>, set <script type="math/tex">c_0 = \operatorname{Enc}_{k_0}(m_0)</script>, and consider the set</p>
<script type="math/tex; mode=display">\mathcal{N} =
\{
\operatorname{Dec}_{k}(c_0) : k \in \mathcal{K}
\}.</script>
<p>Since</p>
<script type="math/tex; mode=display">% <![CDATA[
\vert \mathcal{N} \vert
\leq \vert \mathcal{K} \vert
< \vert \mathcal{M} \vert, %]]></script>
<p>we can find <script type="math/tex">m_1 \in \mathcal{M} \smallsetminus \mathcal{N}</script>. Now,</p>
<script type="math/tex; mode=display">\operatorname{Prob}_{
k \leftarrow \operatorname{Gen}
}
[
\operatorname{Enc}_k(m_0) = c_0
] > 0,</script>
<p>but</p>
<script type="math/tex; mode=display">\operatorname{Prob}_{
k \leftarrow \operatorname{Gen}
}
[
\operatorname{Enc}_k(m_1) = c_0
]
= \operatorname{Prob}_{
k \leftarrow \operatorname{Gen}
}
[
\operatorname{Dec}_k(c_0) = m_1
]
= 0,</script>
<p>and so it follows from the lemma that the cryptographic system in question is not perfectly secret.</p>
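<p>The counting argument can be watched in action on a toy system of our own making, with three messages but only two keys (a shift cipher modulo 3, chosen purely for illustration):</p>

```python
M = [0, 1, 2]  # three messages
K = [0, 1]     # only two keys: |K| < |M|

def enc(k, m):
    return (m + k) % 3

def dec(k, c):
    return (c - k) % 3

c0 = enc(0, 0)               # c0 = Enc_{k0}(m0) with m0 = k0 = 0
N = {dec(k, c0) for k in K}  # the messages recoverable from c0
m1 = next(m for m in M if m not in N)

# No key encrypts m1 to c0, so observing c0 rules m1 out entirely.
assert all(enc(k, m1) != c0 for k in K)
```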
<p>Let us return to the proof of the technical lemma, which we restate below for convenience.</p>
<blockquote>
<p><strong>Lemma 4.</strong> The perfect secrecy condition holds if and only if</p>
<script type="math/tex; mode=display">\operatorname{Prob}_{
k \leftarrow \operatorname{Gen}
}
[
\operatorname{Enc}_k(m_0) = \bar{c}
]
= \operatorname{Prob}_{
k \leftarrow \operatorname{Gen}
}
[
\operatorname{Enc}_k(m_1) = \bar{c}
]</script>
<p>for all choices of <script type="math/tex">m_0,m_1 \in \mathcal{M}</script> and <script type="math/tex">\bar{c} \in \mathcal{C}</script>. In other words, the probabilities of obtaining the same ciphertext <script type="math/tex">\bar{c}</script> from two different messages <script type="math/tex">m_0</script> and <script type="math/tex">m_1</script> are the same.</p>
</blockquote>
<p>For any choice of <script type="math/tex">(\bar{m},\bar{c}) \in \mathcal{M} \times \mathcal{C}</script>, we have that</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
&\operatorname{Prob}_{
\substack{
k \leftarrow \operatorname{Gen} \\
m \leftarrow \mathcal{D}
}
}
[
m = \bar{m} \mid \operatorname{Enc}_k(m) = \bar{c}
] \\
&= \frac{
\operatorname{Prob}_{
\substack{
k \leftarrow \operatorname{Gen} \\
m \leftarrow \mathcal{D}
}
}
[m = \bar{m} \mbox{ and } \operatorname{Enc}_k(m) = \bar{c}]
}
{
\operatorname{Prob}_{
\substack{
k \leftarrow \operatorname{Gen} \\
m \leftarrow \mathcal{D}
}
}
[
\operatorname{Enc}_k(m) = \bar{c}
]
} \\
&= \frac{
\operatorname{Prob}_{
\substack{
k \leftarrow \operatorname{Gen} \\
m \leftarrow \mathcal{D}
}
}
[m = \bar{m} \mbox{ and } \operatorname{Enc}_k(\bar{m}) = \bar{c}]
}
{
\operatorname{Prob}_{
\substack{
k \leftarrow \operatorname{Gen} \\
m \leftarrow \mathcal{D}
}
}
[
\operatorname{Enc}_k(m) = \bar{c}
]
} \\
&= \frac{
\operatorname{Prob}_{
\substack{
k \leftarrow \operatorname{Gen} \\
m \leftarrow \mathcal{D}
}
}
[
m = \bar{m}
]
\operatorname{Prob}_{
k \leftarrow \operatorname{Gen}
}
[
\operatorname{Enc}_k(\bar{m}) = \bar{c}
]
}
{
\operatorname{Prob}_{
\substack{
k \leftarrow \operatorname{Gen} \\
m \leftarrow \mathcal{D}
}
}
[
\operatorname{Enc}_k(m) = \bar{c}
]
}
\end{align*} %]]></script>
<p>Therefore,</p>
<script type="math/tex; mode=display">\operatorname{Prob}_{
\substack{
k \leftarrow \operatorname{Gen} \\
m \leftarrow \mathcal{D}
}
}
[
m = \overline{m} \mid \operatorname{Enc}_k(m) = \bar{c}
]
= \operatorname{Prob}_{
m \leftarrow \mathcal{D}
}
[
m = \bar{m}
]</script>
<p>if and only if</p>
<script type="math/tex; mode=display">\operatorname{Prob}_{
\substack{
k \leftarrow \operatorname{Gen} \\
m \leftarrow \mathcal{D}
}
}
[
\operatorname{Enc}_k(m) = \bar{c}
]
= \operatorname{Prob}_{
k \leftarrow \operatorname{Gen}
}
[
\operatorname{Enc}_k(\bar{m}) = \bar{c}
].</script>
<p>To this end, we observe that</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
&\operatorname{Prob}_{
\substack{
k \leftarrow \operatorname{Gen} \\
m \leftarrow \mathcal{D}
}
}
[
\operatorname{Enc}_k(m) = \bar{c}
] \\
&= \sum_{
m_0 \in \mathcal{M}
}
\operatorname{Prob}_{
k \leftarrow \operatorname{Gen}
}
[
\operatorname{Enc}_k(m_0) = \bar{c}
]
\operatorname{Prob}_{
m \leftarrow \mathcal{D}
}
[
m = m_0
],
\end{align*} %]]></script>
<p>and that</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
&\operatorname{Prob}_{
k \leftarrow \operatorname{Gen}
}
[
\operatorname{Enc}_k(\bar{m}) = \bar{c}
] \\
&= \sum_{
m_0 \in \mathcal{M}
}
\operatorname{Prob}_{
k \leftarrow \operatorname{Gen}
}
[
\operatorname{Enc}_k(\bar{m}) = \bar{c}
]
\operatorname{Prob}_{
m \leftarrow \mathcal{D}
}
[
m = m_0
].
\end{align*} %]]></script>
<p>Since the perfect secrecy condition is required to hold for <em>every</em> distribution <script type="math/tex">\mathcal{D}</script> over <script type="math/tex">\mathcal{M}</script>, and in particular for each point mass concentrated at a single message, we conclude that</p>
<script type="math/tex; mode=display">\operatorname{Prob}_{
\substack{
k \leftarrow \operatorname{Gen} \\
m \leftarrow \mathcal{D}
}
}
[
\operatorname{Enc}_k(m) = \bar{c}
]
= \operatorname{Prob}_{
k \leftarrow \operatorname{Gen}
}
[
\operatorname{Enc}_k(\bar{m}) = \bar{c}
]</script>
<p>if and only if</p>
<script type="math/tex; mode=display">\operatorname{Prob}_{
k \leftarrow \operatorname{Gen}
}
[
\operatorname{Enc}_k(m_0) = \bar{c}
]
= \operatorname{Prob}_{
k \leftarrow \operatorname{Gen}
}
[
\operatorname{Enc}_k(\bar{m}) = \bar{c}
]</script>
<p>regardless of the choice of <script type="math/tex">m_0 \in \mathcal{M}</script>. This holds if and only if</p>
<script type="math/tex; mode=display">\operatorname{Prob}_{
k \leftarrow \operatorname{Gen}
}
[
\operatorname{Enc}_k(m_0) = \bar{c}
]
= \operatorname{Prob}_{
k \leftarrow \operatorname{Gen}
}
[
\operatorname{Enc}_k(m_1) = \bar{c}
],</script>
<p>and the proof of the lemma is now complete. <script type="math/tex">\square</script></p>
<p>We have now seen that perfect secrecy is exceedingly difficult to achieve. In practice, however, it suffices to prevent Eve the eavesdropper from <em>efficiently</em> decrypting the message. To formalize this notion, we must make sense of what it means for Eve to be <strong>computationally bounded</strong>.</p>
<p>Computational boundedness is, in essence, the restriction on the ability to carry out computations that take <em>too long</em> to run. What, then, is computation? We turn to Alan Turing, the father of theoretical computer science (<a href="https://en.wikipedia.org/wiki/Computing_Machinery_and_Intelligence#CITEREFTuring1950">Turing 1950</a>):</p>
<blockquote>
<p>The idea behind digital computers may be explained by saying that these machines are intended to carry out any operations which could be done by a human computer. The human computer is supposed to be following fixed rules; he has no authority to deviate from them in any detail. We may suppose that these rules are supplied in a book, which is altered whenever he is put on to a new job. He has also an unlimited supply of paper on which he does his calculations. He may also do his multiplications and additions on a “desk machine,” but this is not important.</p>
<p>If we use the above explanation as a definition we shall be in danger of circularity of argument. We avoid this by giving an outline of the means by which the desired effect is achieved. A digital computer can usually be regarded as consisting of three parts:</p>
<ul>
<li>Store</li>
<li>Executive unit</li>
<li>Control</li>
</ul>
<p>The store is a store of information, and corresponds to the human computer’s paper, where this is the paper on which he does his calculations or that on which his book of rules is printed. In so far as the human computer does calculations in his head a part of the store will correspond to his memory.</p>
<p>The executive unit is the part which carries out the various individual operations involved in a calculation. What these individual operations are will vary from machine to machine. Usually fairly lengthy operations can be done such as “Multiply 3540675445 by 7076345687” but in some machines only very simple ones such as “Write down 0” are possible.</p>
<p>We have mentioned that the “book of rules” supplied to the computer is replaced in the machine by a part of the store. It is then called the “table of instructions.” It is the duty of the control to see that these instructions are obeyed correctly and in the right order. The control is so constructed that this necessarily happens.</p>
</blockquote>
<p>In short, computation can be modeled with a theoretical machine consisting of three parts:</p>
<ul>
<li>infinitely-long <strong>tapes</strong> with discrete cells, one of which containing input values and the rest providing read-write workspace areas,</li>
<li>a <strong>state register</strong> that specifies the state of the machine at each step, and</li>
<li><strong>heads</strong> that can, in accordance with the state of the machine and the recorded symbols on tapes, read and write symbols on tapes, as well as move the tapes left or right one cell at a time.</li>
</ul>
<p>On such a model, computational boundedness is merely a restriction on how many times the tapes can be moved before a computational task is considered <em>too difficult</em>. It is, then, worthwhile to formalize the model into a precise mathematical construct:</p>
<blockquote>
<p><strong>Definition 5</strong> (Turing, <a href="https://en.wikipedia.org/wiki/Computing_Machinery_and_Intelligence#CITEREFTuring1950">1950</a>). For a fixed <script type="math/tex">k \in \mathbb{N}</script>, a <strong><script type="math/tex">k</script>-tape Turing machine</strong> is defined to be an ordered triple <script type="math/tex">M = (\Gamma, Q, \delta)</script> consisting of the following:</p>
<ul>
<li>A set <script type="math/tex">\Gamma</script> of symbols, called the <strong>alphabet</strong> of <script type="math/tex">M</script>. Typically, we assume that <script type="math/tex">\Gamma</script> contains, at least, <code class="highlighter-rouge">0</code>, <code class="highlighter-rouge">1</code>, <code class="highlighter-rouge">blank</code>, and <code class="highlighter-rouge">start</code>.</li>
<li>A set <script type="math/tex">Q</script> of <strong>states</strong> for <script type="math/tex">M</script>. Typically, we assume that <script type="math/tex">Q</script> contains, at least, <code class="highlighter-rouge">start</code> and <code class="highlighter-rouge">halt</code>.</li>
<li>A <strong>transition function</strong></li>
</ul>
<script type="math/tex; mode=display">\delta:Q \times \Gamma^k \to Q \times \Gamma^{k-1} \times \{\texttt{left},\texttt{right},\texttt{stay}\}^k</script>
<p>that, based on the current state of <script type="math/tex">M</script> and the <script type="math/tex">k</script> symbols at the current locations of the heads, produces output to be recorded on the <script type="math/tex">k-1</script> workspace tapes and moving instructions for the <script type="math/tex">k</script> heads.</p>
</blockquote>
<p>We are, of course, interested in computational processes that terminate in a finite number of steps.</p>
<blockquote>
<p><strong>Definition 6.</strong> A <script type="math/tex">k</script>-tape Turing Machine <script type="math/tex">M = (\Gamma,Q,\delta)</script> is said to be a <strong>deterministic algorithm</strong>, or simply an <strong>algorithm</strong>, if, for each input <script type="math/tex">x = (\texttt{start},\gamma_1,\ldots,\gamma_k) \in Q \times \Gamma^k</script> that starts off <script type="math/tex">M</script> at the <code class="highlighter-rouge">start</code> state, there exists a positive integer <script type="math/tex">n</script> such that</p>
<script type="math/tex; mode=display">\delta^{(n)}(x) = \underbrace{(\delta \circ \cdots \circ \delta)}_{n \text{ times}}(x)</script>
<p>produces the <code class="highlighter-rouge">halt</code> state. In other words, a deterministic algorithm always halts after following the instructions provided by the transition function finitely many times, regardless of the initial configuration of the input tapes.</p>
</blockquote>
<p>The image to keep in mind is as follows: we put <script type="math/tex">k</script> tapes into the Turing machine <script type="math/tex">M</script> as the input, <script type="math/tex">M</script> modifies the tapes until it hits the <code class="highlighter-rouge">halt</code> state, and the final configuration of the tapes is printed as the output.</p>
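<p>A minimal sketch of this picture, written by us for illustration, is a one-tape machine (so <script type="math/tex">k = 1</script>, without the separate input and workspace tapes of the full formalism) whose transition function complements a bit string and halts at the first blank:</p>

```python
# Transition function delta: (state, symbol) -> (new state, symbol to write, head move).
delta = {
    ("start", "0"): ("start", "1", +1),
    ("start", "1"): ("start", "0", +1),
    ("start", "blank"): ("halt", "blank", 0),
}

def run(bits):
    """Apply delta repeatedly until the machine reaches the halt state."""
    tape = list(bits) + ["blank"]
    state, head, steps = "start", 0, 0
    while state != "halt":
        state, tape[head], move = delta[(state, tape[head])]
        head += move
        steps += 1
    return "".join(tape[:-1]), steps

out, steps = run("01001")
assert out == "10110" and steps == len("01001") + 1
```

<p>Since the machine reaches the <code class="highlighter-rouge">halt</code> state on every input, it is a deterministic algorithm in the sense of Definition 6.</p>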
<p>We recall, however, that we are interested in computational <em>efficiency</em>, not mere computability. In order to formalize the notion of computational boundedness, we must work out what it means for an algorithm to have a certain running time.</p>
<p>It is typical to consider bit strings such as</p>
<script type="math/tex; mode=display">0 \, 1 \, 0 \, 0 \, 1 \, 0 \, 0 \, 1 \, 0 \, 0 \, \cdots</script>
<p>as representations of data. Therefore, it makes sense to be able to refer to the collection of all finite bit strings:</p>
<blockquote>
<p><strong>Definition 7.</strong> Let <script type="math/tex">\{0,1\}^n</script> denote the <script type="math/tex">n</script>-fold <a href="https://en.wikipedia.org/wiki/Cartesian_product">cartesian product</a> of the bit set <script type="math/tex">\{0,1\}</script>. We define</p>
<script type="math/tex; mode=display">\{0,1\}^* = \bigcup_{n=1}^\infty \{0,1\}^n,</script>
<p>the set of all finite bit strings. Given a bit string <script type="math/tex">x \in \{0,1\}^*</script>, we define its <strong>length</strong> <script type="math/tex">\vert x \vert</script> to be the unique <script type="math/tex">n \in \mathbb{N}</script> such that</p>
<script type="math/tex; mode=display">x \in \{0,1\}^n.</script>
</blockquote>
<p>With bit strings as our representation of data, it makes sense to think of a computational task as a <em>function</em> on <script type="math/tex">\{0,1\}^*</script>, i.e., a process that outputs a unique bit string for each input bit string. This turns out to be a sufficient abstraction for defining the notion of computational efficiency.</p>
<blockquote>
<p><strong>Definition 8.</strong> Let <script type="math/tex">f:\{0,1\}^* \to \{0,1\}^*</script> and <script type="math/tex">T:\mathbb{N} \to \mathbb{N}</script>. We say that a Turing machine <script type="math/tex">M</script> <strong>computes <script type="math/tex">f</script> in <script type="math/tex">T</script>-time</strong> if, for every <script type="math/tex">x \in \{0,1\}^*</script>, the Turing machine <script type="math/tex">M</script> initialized to the start configuration on input <script type="math/tex">x</script> halts with <script type="math/tex">f(x)</script> as its output within <script type="math/tex">T(\vert x \vert)</script> steps.</p>
</blockquote>
<p>In other words, we can use a function <script type="math/tex">T</script> to provide an upper bound on the run time of a Turing machine <script type="math/tex">M</script> computing <script type="math/tex">f</script>. Computational tasks in real life often take longer with larger input data, and so it makes sense to have the time bound <script type="math/tex">T</script> depend on the size <script type="math/tex">\vert x \vert</script> of the input bit string.</p>
<p>In fact, it would make sense to have <script type="math/tex">T</script> grow with its input size, at a rate sufficiently fast that the Turing machine <script type="math/tex">M</script> is always given the time to read the input. Moreover, we would want <script type="math/tex">T</script> itself to be efficiently computable as well, for otherwise we cannot make use of the information on computational boundedness with ease. We collect these desirable properties into the following definition:</p>
<blockquote>
<p><strong>Definition 9.</strong> A function <script type="math/tex">T:\mathbb{N} \to \mathbb{N}</script> is <strong>time constructible</strong> if <script type="math/tex">T(n) \geq n</script> for all <script type="math/tex">n \in \mathbb{N}</script> and if there exists a Turing machine that computes the function <script type="math/tex">x \mapsto \operatorname{bin}(T(x))</script>, the binary representation of <script type="math/tex">T(x)</script>, in <script type="math/tex">T</script>-time.</p>
</blockquote>
<p>Examples of time constructible functions include <script type="math/tex">n \log n</script>, <script type="math/tex">n^3</script>, and <script type="math/tex">2^n</script>.</p>
<p>Let us now define what it means for a function to be efficiently computable.</p>
<blockquote>
<p><strong>Definition 10.</strong> We define <script type="math/tex">\mathsf{poly}</script> to be the set of all time-constructible functions <script type="math/tex">T:\mathbb{N} \to \mathbb{N}</script> such that <script type="math/tex">T(n) = O(n^c)</script> for some <script type="math/tex">c > 0</script>. A function <script type="math/tex">f:\{0,1\}^* \to \{0,1\}^*</script> is said to be <strong>computable in polynomial time</strong>, or <strong>efficiently computable</strong>, if there exists a Turing machine <script type="math/tex">M</script> and a function <script type="math/tex">T \in \mathsf{poly}</script> such that <script type="math/tex">M</script> computes <script type="math/tex">f</script> in <script type="math/tex">T</script>-time.</p>
</blockquote>
<p>We often write <script type="math/tex">\mathsf{poly}(n)</script> to denote a fixed, unspecified element of <script type="math/tex">\mathsf{poly}</script>.</p>
<p>Examples of efficiently computable functions include <a href="https://markhkim.com/foundtechnicalities/basic-sorting-algorithms/">the usual sorting algorithms</a>, <a href="https://en.wikipedia.org/wiki/AKS_primality_test">primality testing</a>, <a href="https://en.wikipedia.org/wiki/Fast_Fourier_transform">Fast Fourier transform</a>, and so on.</p>
<p>It is often useful to allow our model for computation to make use of randomness. For example, <a href="https://markhkim.com/foundtechnicalities/basic-sorting-algorithms/#6-2">quicksort with a randomized pivot</a> often performs a lot better than <a href="https://markhkim.com/foundtechnicalities/basic-sorting-algorithms/#6-3">quicksort with a median-of-medians pivot</a>, even though the latter has better worst-case runtime than the former. Some widely-used computational methods, such as the <a href="https://en.wikipedia.org/wiki/Monte_Carlo_method">Monte Carlo methods</a>, are always probabilistic in nature and do not have non-probabilistic analogues.</p>
<p>In light of this, we define a probabilistic analogue of the Turing machine.</p>
<p><a name="2-2-7"></a></p>
<blockquote>
<p><strong>Definition 11</strong> (de Leeuw–Moore–Shannon–Shapiro, <a href="https://philpapers.org/rec/DELCBP">1970</a>). A <strong>probabilistic Turing machine</strong> is an ordered quadruple <script type="math/tex">M = (\Gamma,Q,\delta_1,\delta_2)</script> consisting of a set <script type="math/tex">\Gamma</script> of symbols, a set <script type="math/tex">Q</script> of states, and two transition functions <script type="math/tex">\delta_1</script> and <script type="math/tex">\delta_2</script>: see <a href="#2-2-1">Definition 5</a>. Given an input, the probabilistic Turing machine <script type="math/tex">M</script> is executed by applying, at each step, either <script type="math/tex">\delta_1</script> or <script type="math/tex">\delta_2</script>, chosen with equal probability.</p>
<p>A probabilistic Turing machine <script type="math/tex">M</script> is said to be a <strong>probabilistic algorithm</strong> if, for each input</p>
<script type="math/tex; mode=display">x = (\texttt{start},\gamma_1,\ldots,\gamma_k) \in Q \times \Gamma^k,</script>
<p>there exists a positive integer <script type="math/tex">n</script> such that</p>
<script type="math/tex; mode=display">(\delta_{i_1} \circ \cdots \circ \delta_{i_n})(x)</script>
<p>produces the <code class="highlighter-rouge">halt</code> state. Here, each <script type="math/tex">i_k</script> is randomly chosen to be either <code class="highlighter-rouge">1</code> or <code class="highlighter-rouge">2</code>.</p>
</blockquote>
<p>The concept of computational efficiency for Turing machines can be carried over to the context of probabilistic Turing machines with minor modifications.</p>
<blockquote>
<p><strong>Definition 12.</strong> A probabilistic Turing machine <script type="math/tex">M</script> <strong>computes <script type="math/tex">f:\{0,1\}^* \to \{0,1\}^*</script> in <script type="math/tex">T</script>-time</strong> for some <script type="math/tex">T:\mathbb{N} \to \mathbb{N}</script> if, for every choice of bit string <script type="math/tex">x \in \{0,1\}^*</script>, the probabilistic Turing machine <script type="math/tex">M</script> initialized to the <code class="highlighter-rouge">start</code> state on input <script type="math/tex">x</script> halts after at most <script type="math/tex">T(\vert x \vert)</script> steps with <script type="math/tex">f(x)</script> as the output, regardless of the random choices made within <script type="math/tex">M</script>.</p>
</blockquote>
<blockquote>
<p><strong>Definition 13.</strong> A function <script type="math/tex">f:\{0,1\}^* \to \{0,1\}^*</script> is said to be <strong>computable in probabilistic polynomial time</strong> if there exists a probabilistic Turing machine <script type="math/tex">M</script> and a function <script type="math/tex">T \in \mathsf{poly}</script> such that <script type="math/tex">M</script> computes <script type="math/tex">f</script> in <script type="math/tex">T</script>-time.</p>
</blockquote>
<p>Regular Turing machines are sometimes called <strong>deterministic Turing machines</strong> to emphasize their difference from probabilistic Turing machines. Similarly, computability in polynomial time is often referred to as <strong>deterministic polynomial time</strong>.</p>
<p>With the language of computational complexity theory at hand, we can now formalize the notion of a process that is easy to carry out but difficult to revert. To this end, we introduce two preliminary definitions.</p>
<blockquote>
<p><strong>Definition 14.</strong> <script type="math/tex">U_n</script> denotes a random variable distributed uniformly over <script type="math/tex">\{0,1\}^n</script>, i.e.,</p>
<script type="math/tex; mode=display">\operatorname{Prob}[U_n = \alpha] = 2^{-n}</script>
<p>whenever <script type="math/tex">\alpha \in \{0,1\}^n</script> and equals zero otherwise.</p>
<p><strong>Definition 15.</strong> <script type="math/tex">0^n</script> refers to a bit string of length <script type="math/tex">n</script>, consisting entirely of 0. Similarly, <script type="math/tex">1^n</script> refers to a bit string of length <script type="math/tex">n</script>, consisting entirely of 1.</p>
</blockquote>
<p>We are now ready to give the definition of a <strong>one-way function</strong>.</p>
<blockquote>
<p><strong>Definition 16</strong> (Diffie–Hellman, <a href="https://ee.stanford.edu/~hellman/publications/24.pdf">1976</a>). A function <script type="math/tex">f:\{0,1\}^* \to \{0,1\}^*</script> is said to be <strong>one-way</strong> if the following conditions hold:</p>
<ul>
<li><script type="math/tex">f</script> is easy to compute, i.e., <script type="math/tex">f</script> is computable in deterministic polynomial time.</li>
<li><script type="math/tex">f</script> is difficult to invert, i.e., for each probabilistic polynomial-time algorithm <script type="math/tex">\mathcal{A}</script> and every polynomial <script type="math/tex">p</script>,</li>
</ul>
<script type="math/tex; mode=display">% <![CDATA[
\operatorname{Prob}[\mathcal{A}(f(U_n),1^n) \in f^{-1}(f(U_n))] < \frac{1}{p(n)} %]]></script>
<p>for all sufficiently large <script type="math/tex">n</script>.</p>
</blockquote>
<p>Why is the auxiliary input <script type="math/tex">1^n</script> needed? Without it, a function can be considered one-way by merely shrinking its input: if the image is very small, an inverting algorithm simply would not have enough time with respect to the size of its input—i.e., the shrunk output of the original function—to have good computational complexity.</p>
<p>The existence of a one-way function has not been proven. In fact, an existence proof would settle the famous <a href="https://en.wikipedia.org/wiki/P_versus_NP_problem">P versus NP</a> problem. There are, however, plausible candidates for one-way functions, having withstood many attempts at producing efficient inverting algorithms.</p>
<p>The most famous example is the <a href="https://en.wikipedia.org/wiki/Integer_factorization">integer factorization problem</a>, which is widely believed to be difficult. State-of-the-art factoring algorithms such as the <a href="https://en.wikipedia.org/wiki/General_number_field_sieve">general number field sieve</a> run in <a href="https://en.wikipedia.org/wiki/Time_complexity#Sub-exponential_time">subexponential time</a>. In the language of one-way functions, the multiplication function</p>
<script type="math/tex; mode=display">p,q \mapsto pq</script>
<p>is conjectured to be a one-way function.</p>
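<p>The asymmetry can be felt even at toy sizes: multiplying is a single fast operation, while the naive inverse below (trial division, far weaker than the number field sieve mentioned above, shown only as a sketch) takes time exponential in the bit length of its input:</p>

```python
def multiply(p, q):
    return p * q  # the easy direction: polynomial time

def trial_factor(n):
    """Invert by trial division: roughly sqrt(n) steps, exponential in |bin(n)|."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d, n // d
        d += 1
    return 1, n  # n is prime (or 1)

p, q = 10007, 10009  # two small primes
assert trial_factor(multiply(p, q)) == (10007, 10009)
```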
<p>With this assumption, we can construct the famous <a href="https://en.wikipedia.org/wiki/RSA_(cryptosystem)">RSA cryptosystem</a>, which relies on the difficulty of the integer factorization problem.</p>
<p>See <a href="http://www.wisdom.weizmann.ac.il/~oded/foc.html">Goldreich’s two-volume monograph</a> for more information on the foundations of cryptography.</p>Mark Hyun-ki KimThe COUNT Bug for SQL Queries2017-05-26T11:59:59-04:002017-05-26T11:59:59-04:00https://markhkim.com/foundtechnicalities/the-count-bug-for-sql-queries<p>When are two SQL queries the same?</p>
<p>We say that two queries are <strong>semantically equivalent</strong> when their execution results agree on all possible inputs. Some queries are obviously equivalent:</p>
<div class="language-sql highlighter-rouge"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">att1</span> <span class="k">FROM</span> <span class="k">table</span><span class="p">;</span>
</code></pre>
</div>
<div class="language-sql highlighter-rouge"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">att1</span> <span class="k">FROM</span> <span class="p">(</span><span class="k">SELECT</span> <span class="n">att1</span><span class="p">,</span> <span class="n">att2</span> <span class="k">FROM</span> <span class="k">table</span><span class="p">);</span>
</code></pre>
</div>
<p>With complex queries, however, the problem of establishing equivalence becomes a challenging one. The problem of <em>disproving</em> equivalence is often substantially easier: all we need is one counterexample. We cannot, however, experiment on all possible inputs, and so establishing a positive result must rely on a formal proof.</p>
<p>Without a formal proof, an equivalence claim is, at best, an educated guess. And even brilliant people make wrong educated guesses.</p>
<p>The <strong>COUNT bug</strong> was originally discovered in the context of rewriting nested queries for optimization <a href="https://dl.acm.org/citation.cfm?id=38723">(Ganski–Wong, 1987)</a>:</p>
<div class="language-sql highlighter-rouge"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">pnum</span>
<span class="k">FROM</span> <span class="n">parts</span>
<span class="k">WHERE</span> <span class="n">qoh</span> <span class="o">=</span>
<span class="p">(</span><span class="k">SELECT</span> <span class="k">COUNT</span><span class="p">(</span><span class="n">shipdate</span><span class="p">)</span>
<span class="k">FROM</span> <span class="n">supply</span>
<span class="k">WHERE</span> <span class="n">supply</span><span class="p">.</span><span class="n">pnum</span> <span class="o">=</span> <span class="n">parts</span><span class="p">.</span><span class="n">pnum</span>
<span class="k">AND</span> <span class="n">shipdate</span> <span class="o"><</span> <span class="mi">80</span><span class="p">);</span>
</code></pre>
</div>
<div class="language-sql highlighter-rouge"><pre class="highlight"><code><span class="k">WITH</span> <span class="k">temp</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span> <span class="n">pnum</span><span class="p">,</span> <span class="k">COUNT</span><span class="p">(</span><span class="n">shipdate</span><span class="p">)</span> <span class="k">as</span> <span class="n">ct</span>
<span class="k">FROM</span> <span class="n">supply</span>
<span class="k">WHERE</span> <span class="n">shipdate</span> <span class="o"><</span> <span class="mi">80</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="n">pnum</span><span class="p">)</span>
<span class="k">SELECT</span> <span class="n">pnum</span>
<span class="k">FROM</span> <span class="n">parts</span><span class="p">,</span> <span class="k">temp</span>
<span class="k">WHERE</span> <span class="n">parts</span><span class="p">.</span><span class="n">qoh</span> <span class="o">=</span> <span class="k">temp</span><span class="p">.</span><span class="n">ct</span>
<span class="k">AND</span> <span class="n">parts</span><span class="p">.</span><span class="n">pnum</span> <span class="o">=</span> <span class="k">temp</span><span class="p">.</span><span class="n">pnum</span><span class="p">;</span>
</code></pre>
</div>
<p>Ostensibly, both queries retrieve the part numbers (<code class="highlighter-rouge">pnum</code>) of those parts whose quantity on hand (<code class="highlighter-rouge">qoh</code>) equals the number of shipments of the part (<code class="highlighter-rouge">COUNT(shipdate)</code>) before date 80. Nevertheless, the two queries are not semantically equivalent. To see this, consider the following dataset:</p>
<ul>
<li><code class="highlighter-rouge">parts(pnum, qoh)</code> = <script type="math/tex">\{(3,6), (10,1), (8,0)\}</script></li>
<li><code class="highlighter-rouge">supply(pnum, shipdate)</code> = <script type="math/tex">\{ (3, 79), (3, 78), (10, 78), (10, 81), (8, 83)\}</script></li>
</ul>
<p>In the first query, the subquery</p>
<div class="language-sql highlighter-rouge"><pre class="highlight"><code><span class="k">SELECT</span> <span class="k">COUNT</span><span class="p">(</span><span class="n">shipdate</span><span class="p">)</span>
<span class="k">FROM</span> <span class="n">supply</span>
<span class="k">WHERE</span> <span class="n">supply</span><span class="p">.</span><span class="n">pnum</span> <span class="o">=</span> <span class="n">parts</span><span class="p">.</span><span class="n">pnum</span>
<span class="k">AND</span> <span class="n">shipdate</span> <span class="o"><</span> <span class="mi">80</span><span class="p">;</span>
</code></pre>
</div>
<p>returns 2 for <code class="highlighter-rouge">pnum = 3</code>, 1 for <code class="highlighter-rouge">pnum = 10</code>, and 0 for <code class="highlighter-rouge">pnum = 8</code>. Of those, <code class="highlighter-rouge">pnum = 10</code> and <code class="highlighter-rouge">pnum = 8</code> have <code class="highlighter-rouge">qoh</code> values matching their counts, so the full query returns</p>
<table>
<thead>
<tr>
<th>pnum</th>
</tr>
</thead>
<tbody>
<tr>
<td>10</td>
</tr>
<tr>
<td>8</td>
</tr>
</tbody>
</table>
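<p>To see where each count comes from, the correlated subquery’s semantics can be mirrored in a few lines of plain Python (used here purely for illustration, outside the SQL of the original queries):</p>

```python
supply = [(3, 79), (3, 78), (10, 78), (10, 81), (8, 83)]
parts = {3: 6, 10: 1, 8: 0}  # pnum -> qoh

# Correlated-subquery semantics: every part gets a count, even when no
# shipment qualifies, because COUNT over the empty set is 0.
counts = {p: sum(1 for q, d in supply if q == p and d < 80) for p in parts}
print(counts)  # {3: 2, 10: 1, 8: 0}

# Parts whose quantity on hand equals their count survive the WHERE clause.
print([p for p, qoh in parts.items() if qoh == counts[p]])  # [10, 8]
```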
<p>On the other hand, the <code class="highlighter-rouge">temp</code> subquery in the second query returns</p>
<table>
<thead>
<tr>
<th>pnum</th>
<th>ct</th>
</tr>
</thead>
<tbody>
<tr>
<td>3</td>
<td>2</td>
</tr>
<tr>
<td>10</td>
<td>1</td>
</tr>
</tbody>
</table>
<p>Since <code class="highlighter-rouge">supply</code> contains no shipments of part 8 before date 80, <code class="highlighter-rouge">GROUP BY</code> produces no row for <code class="highlighter-rouge">pnum = 8</code> at all, and the join drops it. The full query thus returns</p>
<table>
<thead>
<tr>
<th>pnum</th>
</tr>
</thead>
<tbody>
<tr>
<td>10</td>
</tr>
</tbody>
</table>
<p>It follows that the two queries are inequivalent: a correlated <code class="highlighter-rouge">COUNT</code> over an empty set returns 0, whereas aggregation with <code class="highlighter-rouge">GROUP BY</code> produces no row at all for such parts.</p>
<p>While the above example may seem frivolous, it is worth noting that the database research community failed to discover the bug for 5 years. A typical database management system runs many query rewrite jobs for optimization purposes. So long as we rely on human intuition for correctness, there <em>will</em> be errors.</p>Mark Hyun-ki KimWhen are two SQL queries the same?Intel High Performance Analytics Toolkit and Dataframes2017-04-13T06:00:00-04:002017-06-22T11:53:00-04:00https://markhkim.com/foundtechnicalities/intel-high-performance-analytics-toolkit-and-dataframes<blockquote>
<p>“<a href="https://arxiv.org/abs/1611.04934v2">HPAT: High Performance Analytics with Scripting Ease-of-Use</a>“<br />
E. Totoni, T. A. Anderson, T. Shpeisman</p>
<p>“<a href="https://arxiv.org/abs/1704.02341v1">HiFrames: High Performance Data Frames in a Scripting Language</a>“<br />
E. Totoni, W. U. Hassan, T. A. Anderson, T. Shpeisman</p>
</blockquote>
<p>Long ago, in the ivory tower of high-performance computing, massive parallel computing tasks were carried out by complex, specialized programs written in low-level languages such as C or Fortran. Google’s <a href="https://research.google.com/archive/mapreduce.html">MapReduce</a> paradigm, published in 2004, democratized distributed computing, providing a simple model accessible to data wranglers without domain expertise in parallel and distributed programming.</p>
<p>The MapReduce paradigm deals with problems that can be formulated as coordinate-wise transforms, sorting, and consolidation of <a href="https://www.quora.com/What-is-a-key-value-pair">key-value pairs</a> and is ill-suited for data analytics tasks that require frequent exploratory interactions with the dataset. This led to the development of interactive, in-memory data processing tools such as <a href="https://research.googleblog.com/2009/06/large-scale-graph-computing-at-google.html">Google Pregel</a> and <a href="https://spark.apache.org/">Apache Spark</a>. Such tools are often built atop existing MapReduce clusters such as <a href="https://hadoop.apache.org/">Apache Hadoop</a>, and the many layers introduce substantial overhead, resulting in processing speeds several orders of magnitude slower than parallel programs written in low-level languages.</p>
<p>An outstanding challenge, then, is to develop a productive, interactive data analytics tool with high performance. Intel’s <strong>High-Performance Analytics Toolkit (HPAT)</strong> takes a stab at the challenge by compiling high-level scripting syntax—HPAT uses <a href="https://julialang.org/">Julia</a>—down to high-performance, low-level code, e.g., <a href="http://www.openmp.org/">OpenMP</a>/<a href="https://www.open-mpi.org/">MPI</a>. Automatic parallelization and optimization are achieved by assuming that “the map/reduce parallel pattern inherently underlies the target application domain. … distribution of parallel vectors and matrices is mainly one-dimensional (1D) for analytics tasks … the data-parallel computations are in the form of high-level matrix/vector computations or comprehensions.”</p>
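<p>To illustrate the kind of code these assumptions cover, consider a one-dimensional vector computation: an element-wise map followed by a reduction. The sketch below is schematic Python of my own, not actual HPAT or Julia syntax; the point is only the data-parallel shape that a compiler can partition across MPI ranks without user annotations.</p>

```python
import random

# A 1D parallel vector: each MPI rank would own a contiguous slice.
n = 100_000
x = [random.random() for _ in range(n)]

# Element-wise map: no cross-element dependence, so it partitions freely.
y = [v * v for v in x]

# Reduction: each rank sums its slice locally, then the partial sums are
# combined (in MPI terms, a single Allreduce).
total = sum(y)

# For X ~ Uniform(0, 1), E[X^2] = 1/3, so the mean lands close to 0.333.
print(total / n)
```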
<p>Building on HPAT, <strong>HiFrames</strong> provides an alternative to fast but non-distributed dataframes such as <a href="http://pandas.pydata.org/">Python Pandas</a> and distributed but non-array-like dataframes such as <a href="https://spark.apache.org/sql/">Spark SQL</a>. Unlike Spark’s <a href="https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/zaharia">resilient distributed datasets</a> that underlie Spark SQL, HiFrames does not provide fault tolerance, with the justification that “the portion of [the research group’s] target programs with relational operations to be significantly shorter than the mean time between failure (MTBF) of moderate-sized clusters … in practice most clusters consist of 30-60 machines which is a scale at which fault tolerance is not a big concern.” The omission of fault tolerance allows a significant boost in performance. HiFrames compiles Julia-style code down to HPAT, with additional relational optimizations.</p>
<p>The papers claim that “HPAT is 369x to 2033x faster than Spark on the Cori supercomputer and 20x to 256x times on Amazon AWS,” and that “HiFrames is 3.6x to 70x faster than Spark SQL for basic relational operations, and can be up to 20,000x faster for advanced analytic operations.”</p>
<p><strong>Update, 6/22/2017:</strong> Ehsan Totoni informs me that HPAT is now <a href="https://github.com/IntelLabs/hpat">implemented in Python</a> as well. Here’s an excerpt from his email:</p>
<blockquote>
<p>“We actually moved to Python recently, which will help attract users as you mentioned. … Unfortunately, there is no documentation yet since development started very recently. It requires an MPI installation, and <a href="https://github.com/IntelLabs/numba/tree/prange_up">this branch of Numba</a>. … we are contributing our automatic shared-memory parallelization work to Numba. It’s in the development branch and will be released soon.”</p>
</blockquote>Mark Hyun-ki KimIntel's HPAT compiles high-level scripting syntax down to high-performance, low-level code. Building on HPAT, HiFrames provides an alternative to fast but non-distributed dataframes such as Python Pandas and distributed but non-array-like dataframes such as Spark SQL.