# Basic Data Structures (Draft)

For the computer, the atomic unit of information is a **bit**, which can represent one of the two pieces of information: 0 or 1. At first sight, this does not appear to be all that useful. Nevertheless, Sequences of bits are sufficient to represent all numbers.

Indeed, an arbitrary nonnegative integer can be converted to its **binary representation** as follows:

```
import math
def decimal_to_binary(n):
"""Return the binary representation of a nonnegative integer n."""
if n == 0:
return '0b0'
else:
digits = ''
while n > 0:
digits = str(n % 2) + digits
n = math.floor(n/2)
return '0b' + digits
```

In other words, `0b0`

is 0, `0b1`

is 1, `0b10`

is 2, `0b11`

is 3, and so on. Nonnegative integers can be recovered from their binary representations as follows:

```
def binary_to_decimal(b):
"""Return the decimal representation of a binary number b."""
decimal, index = 0, 1
while b[-index] = 0, 1:
decimal += int(b[-index]) * (2 ** (index - 1))
index += 1
return decimal
```

Negative integers are merely positive integers with the negative sign `-`

attached to them. Therefore, the binary representation of a negative number can be achieved by adding an extra bit to the binary representation of the corresponding positive number.

Simply adding a digit on the left is ambiguous, however. How do we know whether the added bit represents sign, or another digit? To circumvent the issue, we typically specify how many bits we wish to consider in each context. For example, if we are to use 32 bits for representing integers, then we could use 31 bits for storing digits, and 1 bit for storing the sign. The specified number of bits for a computing context, such as 32, is called the **word length** for a data type.

The standard representation scheme of signed integers is **two’s-complement form**, which assigns negative weight to the most significant bit:

```
def decimal_to_2comp(n, w):
"""Represent ints from -2^{w-1} to 2^{w-1}-1 in two's complement form."""
if n > 2 ** (w-1) - 1 or n < -2 ** (w-1):
raise OverflowError
sign = 0 if n >= 0 else 1
twocomp = -sign * (2 ** (w-1)) + abs(n)
binary = decimal_to_binary(abs(twocomp))[2:] # remove '0b'
w_bit_bin = '0' * (w - len(binary) - 1) + binary
return '0b' + str(sign) + w_bit_bin
```

Since rational numbers are ratios of two integers, it follows that we can represent any rational numbers with bits. Representation of real and complex numbers is more delicate and the details are immaterial for what we aim to discuss here. It suffices to say that bits are expressive enough to represent any number system.

While bits are certainly versatile unit, one bit is too small to be a meaningful unit of information. Instead, we typically consider **bytes**, a collection of 8 bits, as the smallest unit of information. The UTF-8 character encoding standard uses up to four bytes per character to represent a significant chunk of alphabets and symbols in current usage around the world, so a byte is certainly a meaningful unit of information.

We think of **memory** as a list of byte-sized cells, with a unique **physical address** assigned to each cell. Without getting into the details of hardware implementations, we shall assume that writing on a memory cell or retrieving the information in a memory cell takes time, as long as we know the physical address in advance. Data are usually larger than one byte, and so a piece of information typically occupies more than one cell. We could, of course, store the data sequentially, byte by byte in adjacent memory cells. Often, however, it is advantageous to spread out data into disjunct memory cells in an organized fashion, with explicit records of how the cells relate to each other. The tool for the job is a **pointer**, which stores the physical address of the data of interest, rather than the data itself.

A **data structure** is a way of weaving together memory cells, possibly with pointers, to suit particular needs that arise from the data at hand. In this post, we examine a variety of classical data structures: arrays, stacks, queues, linked lists, skip lists, hash tables, binary search trees, heaps, red-black trees, tries, and graphs. For each data structure, we study implementations of basic operations on data such as get, search, insert, delete, and find, and how they affect the usage of the data structure.

## 1. Lists

In this section, we study **lists**, which store data in a linear order. The ordering makes it possible to refer to each chunk of data in a list by their **index**. Lists can be either **arrays** or **linked lists**, depending on how the data is stored in memory. Either arrays or linked lists could be used to **stacks** or **queues**, which are lists with extremely simple input-output operations. We also consider **skip lists**, a probabilistic data structure that harnesses the power of randomness to reap the benefits of both arrays and linked lists.

### 1.1. Arrays

The simplest data structure is an **array**, which is little more than a sequence of adjacent memory cells. To define an array, we specify its starting point in memory, the size, in bytes, of each of its element, and how many terms it has. For example, a 2-byte array of length 4 starting at physical address 15 would look like this:

With simple arithmetic, we can compute the physical address of each term in the array. For example, the third element of the above array ranges from to . As we have assumed that writing on a memory cell or retrieving the information in a memory cell of a known physical address takes time, we see that we can access and write on each term of the array in time.

Observe below a minimal Python implementation of an array with access/write features:

```
class Array:
def __init__(self, *args):
"""Variable-length initializer.
For example, we could input Array(1, 2, 3), or Array(1, 2, 3, 4).
"""
self.array = [*args]
self.length = len(args)
def _out_of_bound_error(self, ind, bound):
if ind < 0 or ind >= bound:
raise IndexError(
("Array index out of range. {i} given: "
+ "minimum index is 0; maximum index is {m}").format(
i=ind, m=bound-1))
def __getitem__(self, ind):
"""Return the value at the specified index.
arr[ind], or arr.__getitem__(index), retrieves the value
at index ind.
"""
self._out_of_bound_error(ind, bound=self.length)
return self.array[ind]
def __setitem__(self, ind, val):
"""Set the value at the specified index.
arr[ind] = value, or arr.__setitem__(ind, value)__, replaces
the data stored at index ind with value val.
"""
self._out_of_bound_error(ind, bound=self.length)
self.array[ind] = val
return self
```

Searching for a particular piece of data stored in an array can take longer, as we would have to scan the array sequentially until we find the desired data. In the worst case, we would have to go through the entire array, whence searching takes time, where is the number of terms in an array. Similarly, finding the maximum or the minimum of a numeric array takes time, as we would have to go through the entire array. By the same logic, finding the successor of a term, as well as the predecessor of a term, takes time.

The following Python implementation can be added to the `Array`

class to enable the search features:

```
def search(self, val):
for ind in range(self.length):
if self.array[ind] == val:
return ind
return -1
def minimum(self):
min_ind = 0
for ind in range(self.length):
if self.array[min_ind] > self.array[ind]:
min_ind = ind
return min_ind
def maximum(self):
max_ind = 0
for ind in range(self.length):
if self.array[max_ind] < self.array[ind]:
max_ind = ind
return max_ind
def predecessor(self, ind):
self._out_of_bound_error(ind, self.length)
ind_pred = ind
for running_index in range(self.length):
if self.array[running_index] <= self.array[ind]:
if ind_pred == ind:
ind_pred = running_index
else:
if self.array[ind_pred] < self.array[running_index]:
if running_index != ind:
ind_pred = running_index
if ind_pred == ind:
return -1
else:
return ind_pred
def successor(self, ind):
self._out_of_bound_error(ind, self.length)
ind_succ = ind
for running_index in range(self.length):
if self.array[running_index] >= self.array[ind]:
if ind_succ == ind:
ind_succ = running_index
else:
if self.array[ind_succ] > self.array[running_index]:
if running_index != ind:
ind_succ = running_index
if ind_succ == ind:
return -1
else:
return ind_succ
```

Inserting an item in an array of length entails creating a new array of length . Indeed, insertion at index requires shifting terms at indices by one to the right, as an array must be stored in memory sequentially.

Similarly, deleting an item in an array of length entails creating a new array of length , as it requires shifting elements by one to the left. It follows that both operations take time.

Insertion/deletion in the `Array`

class can be implemented in Python as follows:

```
def __delitem__(self, ind):
self._out_of_bound_error(ind, bound=self.length)
self.array = self.array[:ind] + self.array[ind+1:]
self.length -= 1
return self
def insert(self, ind, val):
self._out_of_bound_error(ind, bound=self.length+1)
self.array = self.array[:ind] + [val] + self.array[ind:]
self.length += 1
return self
```

All in all, the power of an array lies in its random access feature, i.e., access and write in time. All other basic operations take time, which implies that arrays are a poor choice of data structure for situations that involve frequent searches, insertions, or deletions.

### 1.2. Linked Lists

Recall from Section 1.1 that arrays specialize in random access, resulting in slow search, insertion, and deletion operations. This tradeoff comes from how arrays are stored in memory: sequentially, with elements laid out adjacent to one another in memory addresses.

How do we improve the efficiency of insertion and deletion operations? We note that these operations require only that we know what chunks of data come after which chunks. The data structure that focuses on such relations is a **linked list**, which consists of a collection of nodes that contain a chunk of data and a pointer to the next node.

It is customary to keep track of the memory address of the first node, called the **head**, and the last node, called the **tail**, of a linked list, so that those two nodes can be accessed in constant time.

Here is a Python implementation of the linked list data sturcture.

```
class LinkedList:
class Node:
def __init__(self, value=None, nextnode=None):
self.value = value
self.next = nextnode
def __init__(self, head=None):
"""Initialize a linked list of length either 0 or 1.
In both cases, head = tail.
"""
self.head = head
self.tail = head
self.length = 0 if head is None else 1
```

By keeping track of the memory address of the tail node, we can append a node at the tail end of a linked list in constant time. Indeed, it suffices to access the tail node, set its next node to be the node to be appended, and designate the appended node to be the new tail node.

Without storing the memory address of the tail node, we would have to enter the linked list through the head node and go through every node to get to the tail node before appending a node, ending up with an -time operation.

Since a linked list need not be stored sequentially in memory, all that is required to insert a new node inside a linked list is to insert `new_node`

in between `old_node_0`

and `old_node_1`

is to set the next node of `old_node_0`

to be `new_node`

and the next node of `new_node`

to be `old_node_1`

:

Similarly, deleting a node from a linked list takes constant time. To remove `middle_node`

from

we simply set the next node of `left_node`

to be `right_node`

. Note, however, that we need *a priori* knowledge of the memory location of `left_node`

to achieve constant-time deletion. Otherwise, we must scan through the linked list until we find the node to be deleted, ending up with an -time operation.