# Basic Sorting Algorithms

51 m

What does it mean to sort a list of objects? Most objects of numeric type can be ordered by the usual ordering of numbers, which certainly qualifies as sorting. Non-numeric objects can also be ordered in various ways. For example, strings can be ordered by the lexicographical order, which compares one letter at a time using the usual ordering of letters in the alphabet.

It is, then, reasonable to assert that we can talk about sorting on any class of objects that comes with the binary operations < , <=, >, >=, ==, and !=. (Formally, this is called a totally ordered set.) While there are some sorting algorithms that do not make use of the ordering on a class of objects, most standard sorting algorithms are, indeed, comparison-based.

We begin with an analysis of a few $O(n^2)$ sorting algorithms, which are simple but inefficient.

## 1. Bubble Sort

The first algorithm we consider is bubble sort. The bubble sort algorithm sorts a list by swapping two adjacent terms as needed.

def bubble_sort(L):
n = len(L)
while True:
swapped = False
for i in range(1,n):
if L[i-1] > L[i]:
L[i-1], L[i] = L[i], L[i-1]
swapped = True
if swapped is False:
break
return L

### 1.1. Correctness Proof

To see that the above algorithm is correct, we fix a list L and assume that

are the elements of L, sorted. We shall show that, after k iterations of the while loop, we have

for all $i \geq n-k$. It then follows that L is sorted after at most n iterations of the while loop, and one more iteration results in the termination of the algorithm.

We prove the assertion via mathematical induction.

we assume for now that k = 1. If

for some $% $, then $L[x] > L[x+1]$, whence $L[x]$ and $L[x+1]$ are swapped. By the same logic, a swap is made at each subsequence index, resulting in $L[n-1] = a_{n-1}$ at the end of the first iteration of the while loop.

We now assume inductively that, for a fixed $% $, the first $k-1$ iterations of the while loop yielded

for all $j \geq n-(k-1)$. At the end of the $(k-1)$th iteration of the while loop, we let x be the smallest index such that

$L[x] = a_{n-k}$.

The inductive hypothesis implies that $x \leq n-k$, and that $L[x]$ is the maximum of the sublist

This means that, after the execution of the for loop from $i = 0$ to $i = n-k$, we end up with

At this point, L[n-k] is the minimum of the sublist

and so no further swaps are made. It follows that

for all $j \geq n-k$ at the end of the $k$ th iteration of the while loop.

The desired result now follows from mathematical induction.

### 1.2. Time Complexity Analysis

We have shown above that the algorithm terminates after at most $n+1$ iterations of the while loop. During the $k$th iteration of the while loop, swaps are made only within the sublist

Therefore, at most $(n-k+1)C + (k-1)$ steps of computations are executed during the $k$th iteration of the while loop, where $C$ is the number of steps of computation required to execute the swapping if block. Let’s say $C = 4$, as swapping typically takes three steps of computation: save the value of $L[i-1]$ in an auxiliary variable $x$; set the value of $L[i-1]$ to be the value of $L[i]$; set the value of $L[i]$ to be the value of $x$.

It follows that that at most

steps of computations are executed during the full execution of bubble_sort().

To see that $O(n^2)$ is the optimal upper bound, we consider the list

the list of integers from $1$ to $n$ in reverse order. In this case, $L[i-1] > L[i]$ is true at every iteration, whence the upper bound calculated above is achieved.

To see that the algorithm does not run in $\Theta(n^2)$ time, we consider the list

the list of integers from $1$ to $n$, ordered. In this case, no swap is made, and so the while loop terminates after one iteration. Since the for loops runs from $1$ to $n-1$, we conclude that the algorithm terminates in $O(n)$ time for this list.

### 1.3. Space Complexity Analysis

All that is needed is an auxiliary variable to facilitate the swapping, and so the space complexity is bounded above by $O(1)$.

### 1.4. In-Place Algorithm

We say that an algorithm is in place if an algorithm can be executed using minimal additional storage beyond saving the input data. The definition of minimal is context-dependent, but an algorithm with the space complexity of $O(1)$ should always be considered in-place. This means, in particular, that bubble sort is in-place.

## 2. Selection Sort

We now consider selection sort, which allows for non-adjacent swaps.

def selection_sort(L):
n = len(L)
for i in range(n):
min_index = i
for j in range(i,n):
if L[j] < L[min_index]:
min_index = j
L[i], L[min_index] = L[min_index], L[i]
return L

### 2.1. Correctness Proof

To see that the above algorithm is correct, we fix a list L and assume that

are the elements of L, sorted. We shall show that, after k iterations of the outer for loop, we have

for all $% $. It then follows that L is sorted after the termination of the outer for loop.

We first note that, given $% $, the outer for loop searches for the minimum value among

and places it at L[i], while leaving

unchanged. Therefore, the first iteration of the outer for loop yields the identity

which remains true thereafter.

We now fix $% $ and assume inductively that the first $k-1$ iterations of the outer for loop yielded the identity

for each $% $. This, in particular, implies that

We then see that the k th iteration of the outer for loop yields the identity

$L[k-1] = a_{k-1}$, while leaving $L[0], \ldots, L[k-2]$ unchanged. The desired result now follows from mathematical induction.

### 2.2. Time Complexity Analysis

During the $k$th iteration of the outer for loop, the conditional

is checked (n-1)-k times. Therefore, the algorithm goes through at least

steps of computation.

On the other hand, min_index is altered at most $(n-1)-k$ times during the $k$th iteration of the outer for loop. Moreover, the swapping step

adds three steps of computation to each iteration of the outer for loop. It follows that selection_sort() goes through at most

steps of computation.

We conclude that the algorithm runs in $\Theta(n^2)$ time.

### 2.3. Space Complexity Analysis

We need two auxiliary variables: min_index, and another to facilitatie swapping. It follows that the space complexity is bounded above by O(1).

## 3. Insertion Sort

Unlike bubble sort and selection sort, insertion sort makes use of insertion, which is an O(n) operation. Despite this drawback, insertion sort tends to be more efficient than bubble sort or selection sort in practice. Moreover, insertion sort also satisfies a number of nice properties: adaptive, stable, in-place, online. We shall discuss these properties in due course.

def insertion_sort(L):
n = len(L)
for i in range(1, n):
x = L[i]
j = i - 1
while (j >= 0) and (L[j] > x):
L[j+1] = L[j]
j = j - 1
L[j+1] = x
return L

### 3.1. Online Algorithm

We say that an algorithm is online if the algorithm does not require all information about the input before it executes. In other words, an online algorithm can process the input data serially, i.e., as it comes in.

Why is insertion sort an online algorithm? During the $k$th iteration of the for loop, we see that the loop reads and modifies only the sublist L[:k+1]. Therefore, to execute insertion_sort(), it suffice sto know one extra term of L at each iteration of the for loop.

### 3.2. Correctness Proof

To see that the above algorithm is correct, we fix a list L. We shall show that the string of inequalities

holds after $k$ iterations of the for loop. It follows that L is sorted after the for loop is terminated.

During the first iteration of the for loop, $x = L[1]$ and $j = 0$. If $L[j] > x$, then $L[0] > L[1]$, and the while loop swaps $L[0]$ and $L[1]$. If $L[j] \leq x$, then $L[0] \leq L[1]$, and the while loop is bypassed. In either case, we end up with $L[0] \leq L[1]$ at the end of the first iteration.

We now fix $% $ and assume inductively that

holds after $k-1$ iterations of the for loop. During the $k$th iteration of the for loop, the while loop goes through the ordered sublist

in reverse order and shifts each term to the right until an index $j$ such that

is found. At this point, $x$ is inserted in between $L[j]$ and $L[j+1]$, and the for loop moves onto its next iteration. It now follows from the inductive hypothesis that

at the end of the $k$th iteration of the for loop. The desired result now follows from mathematical induction.

### 3.3. Stable Sorting Algorithm

We say that a sorting algorithm is stable if every pair of terms with the same value remains in the same order after the sorting process terminates. Insertion sort is stable, as each term is moved to the left without crossing any term with the same value.

Stability is a desirable property when a data set needs to be sorted in accordance with more than one ordering. For example, if we are to sort a list of names, in the alphabetical order, by last name AND first name, we can run a stable sort on last names and then run another stable sort on first names. Running an unstable sort twice might mix up the order of last names.

### 3.4. Time Complexity Analysis

During the k th iteration of the for loop, the while loop runs through at most $k$ iterations. It follows that insertion_sort() goes through at most

steps of computation.

To see that $O(n^2)$ is the optimal upper bound, we consider the list

the list of integers from $1$ to $n$ in reverse order. During each iteration of the for loop, the while loop must run through the maximum number of iterations to perform insertion at the right spot. It follows that the $O(n^2)$ runtime is achieved in sorting L.

To see that the algorithm does not run in $\Theta(n^2)$ time, we consider the list

the list of integers from $1$ to $n$, ordered. In this case, the while conditional is never satisfied, and no insertion is performed. Therefore, each iteration of the for loop contributes three steps of computation, and so the insertion_sort() sorts $L$ in $O(n)$ time.

### 3.5. Runtime Comparison: Bubble Sort, Selection Sort, and Insertion Sort

Even though all three sorting algorithms have the optimal upper bound of $O(n^2)$, they have different running times in practice.

import random

L1 = random.sample(range(100000),1000)
L2 = random.sample(range(100000),1000)
L3 = random.sample(range(100000),1000)
%%timeit
bubble_sort(L1[:])
bubble_sort(L2[:])
bubble_sort(L3[:])
1 loop, best of 3: 226 ms per loop
%%timeit
selection_sort(L1[:])
selection_sort(L2[:])
selection_sort(L3[:])
10 loops, best of 3: 61.5 ms per loop
%%timeit
insertion_sort(L1[:])
insertion_sort(L2[:])
insertion_sort(L3[:])
10 loops, best of 3: 65.5 ms per loop

As you can see above, both selection sort and insertion sort perform far better than bubble sort. This is to be expected, as our computations show bubble sort requires approximately 5 times as many steps of computation as selection sort or insertion sort. The definition of a step of computation we use here is vague, and our computations do not reflect the precise amount of time it takes to run the algorithms. Nevertheless, we are able to gather useful approximations from them.

Between selection sort and insertion sort, which performs better in general?

Intuitively, the answer should be insertion sort: selection sort always runs in $O(n^2)$ time, whereas insertion sort can run even in $O(n)$ time, depending on the input.

The underlying reason for this disparity is that selection sort does not take into account the order already present in the input data, whereas insertion sort does. Evidently, selection sort does not care how ordered the input data is, as its running time is always $O(n^2)$.

To see that insertion sort is sensitive to the order prsent in the input data, we define an inversion in a list $L$ to be a pair $(i,j)$ indices such that $% $ but $L[i] > L[j]$. We can think of the total number of inversions as a measure of disorder in $L$. If $L$ is ordered, then there are no inversions in $L$. If the terms of $L$ are arranged in reverse order, then there are $\frac{n(n-1)}{2}$ inversions in $L$, where $n$ is the length of the list $L$.

Note that, during the $i$th iteration of the for loop, the while conditional (j >= 0) and (L[j] > x) is satisfied if and only if $(i,j)$ is an inversion. Therefore, the number of steps of computation insertion_sort() performs is

where $\operatorname{Inv}(L)$ denotes the total number of inversions in the list $L$.

We say that a sorting algorithm is adaptive if the algorithm can make use of the order already present in the input data to speed up the sorting process. The number of inversions is not the only measure of order in a list: see Estivill-Castro/Wood, “A survey of adaptive sorting algorithms” for details.

### 3.7. Space Complexity Analysis

We keep two auxiliary variables: x and j. It follows that the space complexity is bounded above by $O(1)$, and so insertion sort is an in-place algorithm.

# $n \log n$ Sorting Algorithms

Note that sorting a list $L$ of $n$ terms entails choosing one out of the $n!$ possible arrangements of the terms. By choosing two terms $x$ and $y$ of $L$ and comparing them, we can eliminate half of the $n!$ possibilities. Indeed, if $x \geq y$, then no arrangement of the $n$ terms with $% $ can be the sorted list; if $% $, then no arrangement of the $n$ terms with $x \geq y$ can be the sorted list. Repeatedly applying this line of argument shows that we need to perform at least $\log(n!)$ many comparisons to find the sorted list. Stirling’s approximation

now implies that we need to perform at least

many comparisons to find the sorted list.

Below, we shall study two sorting algorithms with the runtime of $\Theta(n \log n)$. We shall also study an algorithm that, despite its $O(n^2)$ worst-case runtime, often runs faster than the two $\Theta(n \log n)$ sorting algorithms.

## 4. Merge Sort

Unlike the sorting algorithms we have considered so far, merge sort approaches sorting from a recursive angle. If a list consists of two sorted sublists, then it is not too difficult to merge them into an ordered list. In light of this, we can recursively divide a list into sublists, order them, and then merge them back to obtain the ordered list. This algorithm is spelled out below:

import math

def merge_sort(L):
n = len(L)
if n == 1:
return L
else:
midpoint = math.floor(n/2)
left, right = L[:midpoint], L[midpoint:]
return merge(merge_sort(left), merge_sort(right))

def merge(left, right):
l, m = len(left), len(right)
i, j = 0, 0
merged = []

while i < l and j < m:
if left[i] < right[j]:
merged.append(left[i])
i += 1
else:
merged.append(right[j])
j += 1

if i < l:
merged += left[i:]
else:
merged += right[j:]

return merged

### 4.1. Correctness Proof

Note that the only time merge_sort() returns a list without recursively calling itself only if the input list is of length 1. In other words, merge_sort(L) splits the list $L$ into sublists of length 1, and then calls merge() to merge them back. We shall show that, given two sorted lists left and right, the list

is sorted as well. Since all lists of length 1 are trivially sorted, this result implies the correctness of merge_sort().

Assume that left and right are sorted lists. The while loop in merge() checks the smallest terms of both left and right and moves the smaller one into merged. This is continued until we run out of terms of either left or right. Evidently, merged is a sorted list.

From there, whatever remains of left or right is appended to merged. By construction, the remaining terms are larger than all the terms in merged. Therefore, appending the remaining terms, which are sorted by assumption, to the sorted list merged results in a sorted list. This completes the proof.

### 4.2. Time Complexity Analysis

Let $T(n)$ denote the worst-case runtime of merge_sort() on a list of length $n$. Observe that

where $C(n)$ is the number of steps of computation it takes to execute merge() on two lists of length $\frac{n}{2}$. The upper bound comes from the following facts:

1. merge_sort() contains 7 lines of code.
2. merge_sort() on a list of length $n$ calls merge_sort() on lists of length $\frac{n}{2}$ twice.
3. merge_sort() calls merge() on lists of length $\frac{n}{2}$ once.

Now, merge() merely iterates through left and right, so merge() runs in $O(n)$ time. Therefore, we have the inequality

Applying the inequality repeatedly, we obtain the following:

Since $O(n \log n)$ is the universal lower bound for comparison-based sorting algorithms, we conclude that the time complexity of merge_sort() is $\Theta(n \log n)$.

### 4.3. Is Merge Sort Always Fast?

Since Merge sort always runs in $\Theta(n \log n)$ time, it is not an adaptive sorting algorithm. The runtime of insertion sort can be as low as $O(n)$, and so we expect insertion sort to run faster than merge sort for lists that are largely sorted.

%%timeit
insertion_sort(list(range(100000)))
100 loops, best of 3: 12.6 ms per loop
%%timeit
merge_sort(list(range(100000)))
10 loops, best of 3: 187 ms per loop

Moreover, $O(n^2)$ sorting algorithms can run faster than merge sort on small arrays.

%%timeit
insertion_sort(list(range(10,0,-1)))
100000 loops, best of 3: 4.79 µs per loop
%%timeit
selection_sort(list(range(10,0,-1)))
100000 loops, best of 3: 5.41 µs per loop
%%timeit
merge_sort(list(range(10,0,-1)))
100000 loops, best of 3: 9.98 µs per loop

This is because asymtotpic analysis is useful only for large $n$.

What do we mean by this?

Given two positive constants $C_1$ and $C_2$, we observe that

and so there exists a number $N$ such that

whenever $n \geq N$. Therefore,

for all $n \geq N$. This is what we mean when we say $n^2$ asymptotically dominates $n \log n$.

On the other hand, $\frac{n}{\log n}$ attains its minimum at $n = 2$ and is an increasing function on $[2,\infty)$. (Take this for granted if you don’t know differential calculus.) Therefore, so long as

we can find a positive integer $n \geq 2$ such that

Therefore, if $C_1 > 2C_2$, there exists a positive integer $n$ such that

In other words, it is possible for an $O(n^2)$ algorithm to run faster than an $O(n \log n)$ algorithm for a small $n$.

### 4.4. Space Complexity Analysis

merge() builds an auxiliary list merged, which can be as long as the sum of the lengths of the two sublists left and right. Beyond this, there are only a few auxiliary variables, and so we conclude that merge sort requires $O(n)$ memory space to run.

It is possible to construct an in-place variant of merge sort, but the currently existing algorithms are rather complicated. See Denham Coates-Evelyn, “In-Place Merging Algorithms” for details.

## 5. Heapsort

We now address the space complexity problem of merge sort by introducing an in-place $\Theta(n \log n)$-time sorting algorithm, commonly known as heapsort.

### 5.1. Graphs and Trees

In order to describe heapsort, we need to introduce a few mathematical concepts.

A graph $G = (V,E)$ consists of a set $V$ of vertices (also known as nodes) and a set $E$ of edges between pairs of vertices. We say that a graph is connected if, for each pair of vertices $v$ and $w$, there exists a sequence $\{e_1,\ldots,e_m\}$ of edges that connects $v$ and $w$:

A cycle is a sequence of edges that starts and ends at the same vertex.

We say that a graph is a tree $T$ if it is connected and contains no edges. By picking a vertex of a tree to be the root, we can create a rooted tree.

A rooted tree $T$ can be oriented by thinking of the root as the top vertex of the graph, with branches going downwards. Specifically, we measure the distance from the root to a vertex by counting the number of edges it takes to get from the root to that vertex. This number is unique, as $T$ is connected and contains no cycles. We define the depth of the rooted tree $T$ to be the maximal distance from the root to a vertex of $T$.

Given this orientation, we consider a pair $(v,w)$ of vertices connected by an edge. The vertex closer to the root is called a parent, and the vertex further away from the root is called a child. Two children of a same parent are referred to as siblings, and vertices with no children are called leaves.

A binary tree is a rooted tree such that every vertex has up to two children. A complete binary tree is a binary tree such that every parent has exactly two children, and that all of the leaves are of the same distance from the root. An almost complete binary tree is a generalization of a complete binary tree: it is a binary tree such that every parent except the parents of the leaves has exactly two children. Another way to think of a complete binary tree is that it is a complete binary tree with some of the leaves removed.

### 5.2. Heaps

A min-heap is an almost complete binary tree with a value at each node such that $% $ whenever $v$ is a parent of $w$. Similarly, a max-heap is an almost complete binary tree with a value at each node such that $\operatorname{value}(v) > \operatorname{value}(w)$ whenever $v$ is a parent of $w$.

Note that the root node contains the minimum value in a min-heap, and the maximum value in a max-heap. Therefore, if we can build a heap efficiently from a list, then we would have an efficient sorting algorithm.

Before we dig in, let us discuss an implementation of a heap as a list. Let the root node have index 0, let its children have indices 1 and 2, let their children have indices 3, 4, 5, 6, and so on. By this construction, the children of a node at index $i$ are located at index $2i + 1$ and index $2i + 2$. Moreover, the parent of a node at index $j$ is located at index $\lfloor (j-1)/2 \rfloor$.

We shall now refer to each node on a binary tree by their indices. We call the child node with the smaller index the left child node and the child note with the larger index the right child node.

Let us first consider a small binary tree consisting of three nodes: $T = \{3, 2, 5\}$. $T$ fails to be a max heap, as $% $. In order to turn $T$ into a max-heap, we can swap the root with its right child node, resulting in $\{5, 2, 3\}$.

A similar approach works for building a heap from a larger almost-complete binary tree. We let $T$ be a binary tree with a value attached at each node. Fix a node $i_0$ and assume that the subtrees formed by taking $\operatorname{left}(i_0)$ and $\operatorname{right}(i_0)$ as the root nodes, rspectively, are heaps. For the sake of simplicity, we assume that they are maxeaps.

If the subtree formed by taking $i_0$ as the root node is not a max-heap, then we push down the node until a max-heap is formed. By “push down”, we mean swapping a parent node with the smaller of the two child nodes. This process terminates in finitely many steps and results in a heap.

Since every binary tree consisting of three nodes can be rearranged into a heap, it follows from mathematical induction on the height of almost-complete binary trees that the above process turns any almost-complete binary tree into a heap.

We can apply the same process to devise a method of removing a node from a heap while maintaining the heap property: swap the node to be removed with a leaf, remove the leaf, and rearrange the rest into a heap.

### 5.3. Heapsort

Once we have a heap, it is not hard to form a sorted list consisting of the values of the nodes. We start by building a max-heap. The root node contains the maximum, so we swap the root node with the last element of the list.

The list without the last element is an almost-complete binary tree with $n-1$ nodes. The subtrees with the left child node of the root node and the right child node of the root node, respectively, are heaps, but the whole tree may not be a heap. We therefore push down the root node until we obtain a heap. The root node now contains the maximum, so we swap the root node witht the last element of the list. Repeating this process, we obtain a sorted list.

Let us now formalize the above discussion:

import math

def heap_sort(L):
build_max_heap(L)
n = len(L)
for i in range(n-1,0,-1):
L[0], L[i] = L[i], L[0]
push_down(L[:i], 0)
return L

def build_max_heap(L):
n = len(L)
for i in range(math.floor((n+1)/2),-1,-1):
push_down(L,i)

def push_down(L,i):
left, right = 2*i + 1, 2*i + 2
n = len(L)

if left < n and L[left] > L[i]:
max_index = left
else:
max_index = i
if right < n and L[right] > L[max_index]:
max_index= right

if max_index != i:
L[i], L[max_index] = L[max_index], L[i]
L = push_down(L, max_index)

### 5.4. Time Complexity Analysis

We begin by analyzing the time complexit of push_down(). We let $n_i$ be the number of nodes on a subtree rooted at index $i$. We write $T(n_i)$ to denote the worst-case runtime of push_down(L,i). $T(n_i)$ consists of a constant time to check the if-else conditions, and a possible recursion call to $T(n_{\texttt{max_index}})$. If the subtree rooted at max_index has all the leaves and the subtree rooted at the other child node of i has no leaves, then, by the binary property, the number of nodes on the former tree is twice as much as the number of nodes on the latter subtree. It follows that

Applying the inequality repeatedly, we obtain the following:

It follows that push_down runs in $O(\log n)$ time.

Since build_max_heap calls push_down $O(n)$ times, it runs in $O(n \log n)$ time. heap_sort calls build_max_heap once and push_down $n$ times, and so it runs in $O(n \log n)$ time as well. By the universal lower bound on comparison-based sorting algorithms, we conclude that heapsort runs in $\Theta(n \log n)$ time.

### 5.5. Space Complexity Analysis

The functions defined above operate on the same list without making use of an auxiliary list. It follows that the space complexity of heap sort is $O(1)$. Therefore, heapsort addresses merge sort’s lack of the in-place property.

### 5.6. Heapsort, Compared to Merge Sort

In practice, heapsort tends to be slower than merge sort.

import random

L1 = random.sample(range(100000),1000)
L2 = random.sample(range(100000),1000)
L3 = random.sample(range(100000),1000)
%%timeit
merge_sort(L1)
merge_sort(L2)
merge_sort(L3)
100 loops, best of 3: 5.83 ms per loop
%%timeit
heap_sort(L1)
heap_sort(L2)
heap_sort(L3)
100 loops, best of 3: 12.7 ms per loop

As is the case with bubble sort, selection sort, and insertion sort, algorithms with the same big-O bound can differ in running time, based on the constants the big-O bounds are hiding away.

## 6. Quicksort

Let us now discuss quicksort, an in-place sorting algorithm that, for the most part, runs faster than merge sort or insertion sort. The algorithm choose a pivot $p$ and divides the input list $L$ into sublists $\{x \in L: x \leq p\} \text{ and } \{x \in L : x > p\}.$ The algorithm is then recursively applied to the two sublists, resulting in a sorted list.

The choice of pivot severely affects the efficiency of quicksort. Indeed, if the partition process results in a sublist just as big as the original list and an empty sublist, there is no reduction in the complexity of the problem. In what follows, we shall study three methods of choosing a pivot and discuss their effectiveness.

In all variants, the sorting operations are performed directly on the input list, thus the resulting sorting algorithms are in-place algorithms.

### 6.1. Quicksort with a Fixed Pivot

Quicksort with a fixed pivot is simple: each time, choose a pivot to be the element at a fixed location of the list. It is typical to choose either the first element or the last element of the list. Following CLRS, we shall choose the last element as the pivot.

def quick_sort_fixed_pivot(L):
n = len(L)
sort_fixed_pivot(L,0,n-1)
return L

def sort_fixed_pivot(L, p, r):
if p < r:
pivot_index = partition_fixed_pivot(L, p, r)
sort_fixed_pivot(L,p,pivot_index-1)
sort_fixed_pivot(L,pivot_index + 1, r)

def partition_fixed_pivot(L, p, r):
pivot = L[r]
i = p-1
for j in range(p,r):
if L[j] <= pivot:
i = i+1
L[i], L[j] = L[j], L[i]
i = i+1
L[i], L[r] = L[r], L[i]
return i

sort_fixed_pivot() on a sublist L[p:r+1] chooses L[r] to be the pivot, via partition_fixed_pivot(). L[p:r+1] is divided into three sublists: the list of elements smaller than or equal to the pivot, the singleton list containing the pivot, and the list of elements larger than the pivot. sort_fixed_pivot then recurses on to the first and the third sublists, sorting the sublists.

If L is already sorted, then partition_fixed_pivot(L, p, r) always returns r. Therefore, sort_fixed_pivot splits L[p:r+1] into L[p:r], [L[r]], and []. Since sort_fixed_pivot() can only to reduce the size of the list to be sorted by 1, it must go through $n$ iterations to sort an already-sorted list of size $n$. Specifically, at the $k$th iteration, partition_fixed_pivot(L, 0, n-k) is executed, and sort_fixed_pivot(L, 0, n-k-1) + [L[n-k]] is returned.

It therefore suffices to examine the time complexity of partition_fixed_pivot(L, 0, n-k). Since every instance of the if staement is executed, the runtime of partition_fixed_pivot(L, 0, n-k) is bounded below by $\Omega(n-k)$. It follows that the runtime of quick_sort_fixed_pivot on a sorted list of length n is bounded below by

### 6.2. Quicksort with a Randomized Pivot

import random

def quick_sort_randomized_pivot(L):
n = len(L)
sort_randomized_pivot(L, 0 , n-1)
return L

def sort_randomized_pivot(L, p, r):
if p < r:
pivot_index = partition_randomized_pivot(L, p, r)
sort_randomized_pivot(L, p, pivot_index-1)
sort_randomized_pivot(L, pivot_index+1, r)

def partition_randomized_pivot(L, p, r):
pivot = random.randint(p, r)
L[pivot], L[r] = L[r], L[pivot]
return partition_fixed_pivot(L, p, r)

Observe that quick_sort_randomized_pivot performs better than merge_sort, whereas quick_sort_fixed_pivot performs worse than merge_sort.

import random

L1 = random.sample(range(100000),100)
L2 = random.sample(range(100000),100)
L3 = random.sample(range(100000),100)
%%timeit
merge_sort(L1)
merge_sort(L2)
merge_sort(L3)
1000 loops, best of 3: 451 µs per loop
%%timeit
quick_sort_fixed_pivot(L1)
quick_sort_fixed_pivot(L2)
quick_sort_fixed_pivot(L3)
1000 loops, best of 3: 1.44 ms per loop
%%timeit
quick_sort_randomized_pivot(L1)
quick_sort_randomized_pivot(L2)
quick_sort_randomized_pivot(L3)
1000 loops, best of 3: 439 µs per loop

Surely, there is a small but definite chance that the random selection of pivot results in always choosing the last element of the list, thus reducing quick_sort_randomized_pivot to quick_sort_fixed_pivot. Therefore, the worst-case time complexity should still be $\Omega(n^2)$. How do we make sense of the improved performance, then?

Recall that the bottleneck in quick_sort_fixed_pivot is partition_fixed_pivot. In fact, the for loop in partition_fixed_pivot is the single biggest contributor to the time complexity of both quick_sort_fixed_pivot and quick_sort_randomized_pivot. This implies that the number of times the comparison statement if L[j] <= pivot is executed is a good indicator of the time complexity of quicksort.

We fix a list L and assume that

are the elements of $L$, sorted. Given a probability distribution over $L$, we let $C$ denote the random variable outputing the total number of comparisons if L[j] <= pivot performed in the course of an execution of quick_sort_randomized_pivot(L). For each $1 \leq i,j \leq n$, let $C_{ij}$ denote the indicator random variable

so that

where $\mathbb{E}$ denotes the expected value of a random variable. Since $\mathbb{E}$ is linear, we see that

where $\mathbb{P}$ denotes the probability of an event.

We let $L_{ij}$ denote the set $\{a_i,a_{i+1},\ldots,a_j\}$. Observe that $\mathbb{P}[a_i \mbox{ is compared to } a_j]$ equals the probability that $a_i$ or the $a_j$ is the first pivot chosen, out of all the elements of $L_{ij}$, in the course of an execution of quick_sort_randomized_pivot(L). Since each choice of pivot is independent of one another, we have that

It follows that

Since the harmonic numbers are asymptotically bounded by the logarithmic function, we conclude that

### 6.3. Quicksort with a Median-of-Medians Pivot

While quick_sort_randomized_pivot performs quite well “on average” (and in practice!), it still has the worst-case time complexity of $O(n^2)$. With carefully chosen pivots, however, we can guarantee a $\Theta(n \log n)$ performance in all cases. The method of pivot selection, however, is rather complicated, and it is often better to avoid the selection overhead and simply opt for randomized selection. Suffice it to say that the median-of-medians strategy is largely of theoretical importance. (Though some may disagree: see Kurosawa, “Quicksort with median of medians is considered practical”)

We have witnessed that a pivot selection strategy that divides a list into sublists of uneven sizes tends to perform badly. Therefore, an efficient search of the median element of a list would lead to an efficient choice of pivots. The Blum–Floyd–Pratt–Rivest–Tarjan selection algorithm, commonly known as the median-of-medians selection algorithm, can be used to find a median in $O(n)$ time. In fact, the median-of-medians selection algorithm can be used to find the $k$th smallest element of the list in $O(n)$ time.

6.3.1. Outline of Median-of-Medians Selection Algorithm. The outline of the algorithm, taken from CLRS is as follows; here, we assume that $L$ consists of distinct elements.

1. Given an input list $L$ of $n$ elements, we divide the list into $\lfloor \frac{n}{5} \rfloor$ sublists of 5 elements, with at most one sublist possibly containing fewer than 5 elements.
2. We find the median of each of the $\lceil \frac{n}{5} \rceil$ sublists by sorting them. Insertion sort is a good choice here, since it is fast for small lists.
3. We take the list of the medians found in 1-2 and apply the selection algorithm recursively to find the median of medians $m$. If there are an even number of medians, we take the lower one.
4. We rearrange $L$ by putting the terms no larger than $m$ on the front, $m$ in the middle, and terms larger than $m$ in the back.
5. Let $i-1$ be the index of the pivot $m$ in the rearranged list, so that $m$ is the $i$th smallest element of $L$. If $i = k$, then we have found the median. If not, we apply the selection process recursively to find the median. Specifically, if $i > k$, then we apply the selection algorithm on L[:i] to find the $k$th smallest element. If $% $, then we apply the selection algorithm on L[i:] to find the $(k-i)$th smallest element.

In short, the algorithm either selects the correct output or recurses down to a smaller sublist. Since the kth smallest element of a small list can be found in $O(1)$ time, the algorithm always terminates with the correct output.

6.3.2. Overview of Time Complexity Analysis. To analyze the time complexity of the median-of-medians selection algorithm, we note that at least half of the medians found in Step 2 are no smaller than $m$. This implies that at least half of the $\lceil \frac{n}{5} \rceil$ sublists contain at least 3 terms that are greater than $m$, except for the one sublist that may not have 5 elements, and the sublist that contains $m$. It follows that the number of elements of $L$ greater than $m$ is at least

Similarly, at least $\frac{3}{10}n - 6$ elements are less than $x$, whence, in the worst-case scenario, Step 5 calls the selection algorithm recursively on at most $\frac{7}{10}n+6$ elements.

6.3.3. Recurrence Relation. Let $T(n)$ denote the worst-case running time of the selection algorithm on a list of size $n$. Step 1 takes O(n) time. Step 2 executes insertion sort on $O(n)$ many small sets, and so it takes $O(n)$ time. Step 3 takes $T(n/5)$ time. Step 4 takes $O(n)$ time, as it is a simple variation of the partition algorithm from quicksort. Step 5 takes $T(7n/10 + 6)$ time, as discussed above. It follows that

Let us fix a constant $A$ such that

for all $n$.

6.3.4. Constant Computation. Our goal is to find a constant $C$ such that $T(n) \leq Cn$ for all large enough $n$. How large should $C$ be? If $T(n) \leq Cn$ were true for all $n > N$ for some fixed integer $N$, we would have the estimate

In order to have a tight choice of $C$, we expect to have the upper bound

which can hold if and only if

Rearranging the above inequality, we obtain

Now, if we assume that $N > 70$, then $n > 70$, and so the above inequality is equivalent to

It follows that $N > 70$ and

implies

for all $n > N$, and so

for all $n > N$.

6.3.5. Proof of Linear Time Complexity. Let us now pick $N = 140$ and $C = 5A$, so that $N > 70$ and $C \geq \frac{10A}{N - 70}$. By choosing a larger $A$ if necessary, we can assume that

for all $n$, and that

for all $n \leq 140$.

If

then $\left\lceil\frac{n}{5}\right\rceil \leq 140$ and $\frac{7}{10}n + 6 \leq 140$, and so

By our choice of $C$, we conclude that

for all $n \leq 191$.

In general, if $T(n) \leq Cn$ for all $n \leq M$, then $T(n) \leq Cn$ for all $n \leq \left\lfloor \frac{10}{7} \times (M - 6) \right\rfloor$. Since

for all $M \geq 140$, we conclude that

for all $n \geq 140$. It follows that $T(n) \leq O(n)$.

6.3.6. An Implementation of Quicksort with Median-of-Medians Pivot. Having established the linear-time complexity of median-of-medians selection algorithm, we proceed to construct a variant of quicksort with the worst-case time complexity of $O(n \log n)$.

import math

def quick_sort_median_of_medians_pivot(L):
n = len(L)
sort_median_of_medians_pivot(L,0,n-1)
return L

def sort_median_of_medians_pivot(L, p, r):
if p < r:
pivot_index = partition_median_of_medians_pivot(L, p, r)
sort_median_of_medians_pivot(L,p,pivot_index-1)
sort_median_of_medians_pivot(L,pivot_index + 1, r)

def partition_median_of_medians_pivot(L, p, r):
mm = median_of_medians(L[p:r+1], math.floor((r+1-p)/2))
initial_pivot_index = L[p:r+1].index(mm) + p
L[r], L[initial_pivot_index] = L[initial_pivot_index], L[r]
pivot = L[r]

i = p-1
for j in range(p,r):
if L[j] <= pivot:
i = i+1
L[i], L[j] = L[j], L[i]
i = i+1
L[i], L[r] = L[r], L[i]
return i

# find the kth smallest element of L
def median_of_medians(L,k):
n = len(L)

# To reduce computational complexity, we compute the median directly
# if the length of the list is small enough.
if n < 10:
return sorted(L)[k-1]

# Divide L into sublists of length 5, sort them, and extract the medians.
medians = []
for i in range(math.floor(n/5)):
L[5*i: 5*i+5].sort()
medians.append(L[5*i+2])
if n % 5 != 0:
L[5*math.floor(n/5):].sort()
medians.append(L[5*math.floor(n/5) + math.floor((n % 5)/2)])

# Find recursively the median of medians
mm = median_of_medians(medians, math.floor(len(medians)/2))

# Partition L with mm as the pivot
mm_index = L.index(mm)
L[n-1], L[mm_index] = L[mm_index], L[n-1]
i = -1
for j in range(n-1):
if L[j] <= mm:
i = i+1
L[i], L[j] = L[j], L[i]
i = i+1
L[i], L[n-1] = L[n-1], L[i] # After this, L[i] = mm.

i = i+1 # Now mm is the ith largest element of L

# Determine whether mm is at the correct spot. If not, recurse.
if i == k:
return mm
elif i > k:
return median_of_medians(L[:i], k)
else: # i < k
return median_of_medians(L[i:], k-i)

In the above implementation, we have sacrificed a little bit of performance for the sake of clarity. partition_median_of_medians_pivot spends $O(n)$ time going through L[p:r+1], looking for the exact location of the median of medians. Moreover, median_of_medians itself spends extra $O(n)$ time going through L, looking for the exact location of the median of medians. These overheads could have prevented by having median_of_medians return the ordered pair (med, med_index) of the median and the index in the original list. The above implementation is simpler, however, and the $O(n)$ overhead does not change the asymptotic runtime of partition_median_of_pivot.

6.3.7. Time Complexity of Quicksort with Median-of-Medians Pivot. Let us now show that the worst-case runtime complexity of quick_sort_median_of_medians_pivot is $O(n \log n)$. Let $T(n)$ denote the worst-case runutime of quick_sort_median_of_medians_pivot on a list of length $n$. Since the pivot is always the median, we see that

It now follows that

as was to be shown.

Despite the improved worst-case time complexity, quick_sort_median_of_medians_pivot often performs worse than quick_sort_randomized_pivot. This is due to the large constant hidden away by the big-O notation, as we have seen in the derivation of the time complexity of median_of_medians.

import random

L1 = random.sample(range(1000000),10000)
L2 = random.sample(range(1000000),10000)
L3 = random.sample(range(1000000),10000)
%%timeit

quick_sort_randomized_pivot(L1)
quick_sort_randomized_pivot(L2)
quick_sort_randomized_pivot(L3)
10 loops, best of 3: 65.8 ms per loop
%%timeit

quick_sort_median_of_medians_pivot(L1)
quick_sort_median_of_medians_pivot(L2)
quick_sort_median_of_medians_pivot(L3)
10 loops, best of 3: 174 ms per loop

# Non-Comparison Sorting Algorithms

In designing a sorting algorithm, it is possible to beat the information-theoretic lower bound of $\Omega(n \log n)$ if we forgo comparions. In what follows, we discuss non-comparison sorting algorithms that can run in linear time or less under certain circumstances.

## 7. Counting Sort

Instead of comparing the elements of a list, counting sort determines, for each element $x$, the number of elements less than $x$ and inserts $x$ directly into the appropriate position in the output list.

Let L be an array of $n$ nonnegative integers. Let $M$ be the maximal element of L, and create a list Aux of length $M+1$, initially populated by zeros. For each $0 \leq i \leq n-1$, we set

$\operatorname{Aux}[L[i]] = \operatorname{Aux}[L[i]] + 1$,

so that A[j], for each $0 \leq j \leq M$, contains the number of elements of $L$ equal to $j$. Now, we set

for each $1 \leq j \leq k$, so that Aux[j] contains the number of elements of L no larger than $j$.

It now suffices to go through A and build a new, sorted array S.

def counting_sort(L):
M = max(L)
n = len(L)

Aux = [0 for _ in range(M+1)]
for i in range(n):
Aux[L[i]] = Aux[L[i]] + 1
for j in range(1, M+1):
Aux[j] = Aux[j] + Aux[j-1]

Sorted = [0 for _ in range(n)]
for i in range(n-1,-1,-1):
Sorted[Aux[L[i]]-1] = L[i]
Aux[L[i]] = Aux[L[i]] - 1

return Sorted

In the worst-case scenario, the first for loop takes $O(n)$ time, the second for loop takes $O(M)$ time, and the third for loop takes $O(n)$ time. Therefore, counting_sort has the worst-case time complexity of $O(n + M)$. If $M = O(n)$, then counting_sort runs in $O(n)$ time. Similarly, the space complexity of counting_sort is $O(n + M)$, as it constructs a list of length $M+1$ and another list of length $n$.

%%timeit
counting_sort(list(range(10000,0,-1)))
100 loops, best of 3: 3.4 ms per loop
%%timeit
quick_sort_randomized_pivot(list(range(10000,0,-1)))
10 loops, best of 3: 21.9 ms per loop
%%timeit
merge_sort(list(range(10000,0,-1)))
100 loops, best of 3: 17.6 ms per loop

As we can see, counting sort significantly outperforms merge sort and quicksort when the values of the input list are contained within a small interval (with respect to the length of the list).

We conclude this section by remarking that counting_sort is stable. If $% $ and $L[i] = L[j]$, then the third for loop implies that Aux[L[i]] - 1 < Aux[L[j]] - 1, and that

Sorted[Aux[L[i]]-1] = L[i] = L[j] = Sorted[Aux[L[j]]-1],

as was to be shown.

Radix sort refers to a family of digit-based sorting algorithms, its origin predating the invention of computers. As per Sedgewick, we consider sequences of fixed-length characters, not just numbers.

The simplest example of radix sort is the LSD-first radix sort, which goes through digits from right to left. It requires all items to be sorted have the same number of characters. For the implementation below, we assume that the input list L contains strings of characters.

str_len = len(L[0])

for index in range(str_len-1,-1,-1):
sorted_indices = modified_counting_sort(L, index)
L = [L[i] for i in sorted_indices]

return L

def modified_counting_sort(L, index):
n = len(L)
# ord() returns the usual numerical representation of a character.
chars = [ord(w[index]) for w in L] # just the characters.
indices = [ind for ind in range(n)] # just the indices
M = max(chars)

Aux = [0 for _ in range(M+1)]
for i in range(n):
Aux[chars[i]] = Aux[chars[i]] + 1
for j in range(1, M+1):
Aux[j] = Aux[j] + Aux[j-1]

Sorted = [0 for _ in range(n)]
for i in range(n-1,-1,-1):
Sorted[Aux[chars[i]]-1] = indices[i]
Aux[chars[i]] = Aux[chars[i]] - 1

return Sorted

Since counting sort runs in $O(n + M)$ time, LSD-first radix sort runs in $O(w(n+M_*))$ time, where $w$ denotes the length of each string, and $M_*$ denotes the maximum value of the $M$’s computed at all the indices. If $M$ and $w$ are relatively small compared to $n$, then LSD-first radix sort runs in linear time. Observe below that LSD-first radix sort runs comparably fast, or even faster, to $O(n \log n)$ comparison sorting algorithms.

numbers1 = random.sample(range(100000,1000000),60000)
numbers2 = random.sample(range(100000,1000000),60000)
numbers3 = random.sample(range(100000,1000000),60000)
strings1 = [str(x) for x in numbers1]
strings2 = [str(x) for x in numbers2]
strings3 = [str(x) for x in numbers3]
%%timeit
quick_sort_randomized_pivot(numbers1)
quick_sort_randomized_pivot(numbers2)
quick_sort_randomized_pivot(numbers3)
1 loop, best of 3: 473 ms per loop
%%timeit
merge_sort(numbers1)
merge_sort(numbers2)
merge_sort(numbers3)
1 loop, best of 3: 360 ms per loop
%%timeit
1 loop, best of 3: 459 ms per loop

We could also build counting sort directly into LSD-first radix sort, which results in a slight performance gain. Slight, but enough to beat merge sort and quick sort in many cases!

str_len = len(L[0])
n = len(L)

for index in range(str_len-1,-1,-1):
chars = [ord(w[index]) for w in L]
M = max(chars)

Aux = [0 for _ in range(M+1)]
for i in range(n):
Aux[chars[i]] = Aux[chars[i]] + 1
for j in range(1, M+1):
Aux[j] = Aux[j] + Aux[j-1]

Sorted = [None for _ in range(n)]
for i in range(n-1,-1,-1):
Sorted[Aux[chars[i]]-1] = L[i]
Aux[chars[i]] = Aux[chars[i]] - 1

L = Sorted

return L
numbers1 = random.sample(range(100000,1000000),50000)
numbers2 = random.sample(range(100000,1000000),50000)
numbers3 = random.sample(range(100000,1000000),50000)
strings1 = [str(x) for x in numbers1]
strings2 = [str(x) for x in numbers2]
strings3 = [str(x) for x in numbers3]
%%timeit
1 loop, best of 3: 278 ms per loop
%%timeit
1 loop, best of 3: 388 ms per loop
%%timeit
merge_sort(numbers1)
merge_sort(numbers2)
merge_sort(numbers3)
1 loop, best of 3: 439 ms per loop
%%timeit
quick_sort_randomized_pivot(numbers1)
quick_sort_randomized_pivot(numbers2)
quick_sort_randomized_pivot(numbers3)
1 loop, best of 3: 403 ms per loop

Unlike the LSD-first radix sort, the MSD-first radix sort goes through digits from left to right. The algorithm divides the input list into sublists according to the first character of each item and applies itself recursively to each sublist. This has the advantage that strings can be of different lengths. For the implementation below, we assume that the input list L contains strings of characters.

n = len(L)
msd(L, 0, n, 0)
return L

def msd(L, low, high, index):
# Make sure range(low, high) is nontrivial
if high <= low+1:
None

# Collect numerical representations of characters at the given index.
# +1 is there so we can add a 0, instead of a negative number, in case
# there is no character at the given index.
chars = []
blank_count = 0
for i in range(low, high):
try:
chars.append(ord(L[i][index])+1)
except IndexError:
blank_count = blank_count + 1
chars.append(0)

# Make sure there are characters to be sorted
if sum(chars) == 0:
return None

M = max(chars)

# Begin counting sort on characters at the given index
Aux = [0 for _ in range(M+1)]
for i in range(high-low):
Aux[chars[i]] = Aux[chars[i]] + 1
for j in range(1, M+1):
Aux[j] = Aux[j] + Aux[j-1]

Sorted = [None for _ in range(high-low)]
for i in range(high-low-1,-1,-1):
Sorted[Aux[chars[i]]-1] = L[i + low]
Aux[chars[i]] = Aux[chars[i]] - 1

L[low: high] = Sorted
# End counting sort

new_low = low + blank_count
for i in range(new_low+1, high):
# Recurse onto sublists with the same character at the given index.
if L[i][index] != L[i-1][index]:
msd(L, new_low, i, index+1)
new_low = i
if high > new_low+1:
msd(L, new_low, high, index+1)

Similar to the LSD-first Radix sort, the MSD-first radix sort consists of $O(n)$ loops and $O(M)$ loops that can repeat at most $w$ times, and so the worst-case runtime is bounded above by $O(w(M+n))$. Unlike the LSD-first Radix sort, however, the MSD-first radix sort does not necessarily compare every single digit. For example, to establish the order of ['a', 'baaaaaaaaaaa'], there is no need to compare anything beyond the leftmost digit. In such cases, the MSD-first radix sort can run significantly faster than the LSD-first radix sort.

some_chars = [
'!', '@', '#', '', '%', '^', '&', '*', '(', ')',
'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't',
'u', 'v', 'w', 'x', 'y', 'z']
strings = [x +  ('a' * 500) for x in some_chars]
%%timeit