Basic Data Structures (Draft)

For the computer, the atomic unit of information is a bit, which can represent one of the two pieces of information: 0 or 1. At first sight, this does not appear to be all that useful. Nevertheless, Sequences of bits are sufficient to represent all numbers.

Indeed, an arbitrary nonnegative integer can be converted to its binary representation as follows:

import math

def decimal_to_binary(n):
    """Return the binary representation of a nonnegative integer n."""
    if n == 0:
        return '0b0'
        digits = ''
          while n > 0:
          digits = str(n % 2) + digits
          n = math.floor(n/2)
    return '0b' + digits

(source code)

In other words, 0b0 is 0, 0b1 is 1, 0b10 is 2, 0b11 is 3, and so on. Nonnegative integers can be recovered from their binary representations as follows:

def binary_to_decimal(b):
    """Return the decimal representation of a binary number b."""
    decimal, index = 0, 1
    while b[-index] = 0, 1:
        decimal += int(b[-index]) * (2 ** (index - 1))
        index += 1
    return decimal

(source code)

Negative integers are merely positive integers with the negative sign - attached to them. Therefore, the binary representation of a negative number can be achieved by adding an extra bit to the binary representation of the corresponding positive number.

Simply adding a digit on the left is ambiguous, however. How do we know whether the added bit represents sign, or another digit? To circumvent the issue, we typically specify how many bits we wish to consider in each context. For example, if we are to use 32 bits for representing integers, then we could use 31 bits for storing digits, and 1 bit for storing the sign. The specified number of bits for a computing context, such as 32, is called the word length for a data type.

The standard representation scheme of signed integers is two’s-complement form, which assigns negative weight to the most significant bit:

def decimal_to_2comp(n, w):
    """Represent ints from -2^{w-1} to 2^{w-1}-1 in two's complement form."""
    if n > 2 ** (w-1) - 1 or n < -2 ** (w-1):
        raise OverflowError
    sign = 0 if n >= 0 else 1
    twocomp = -sign * (2 ** (w-1)) + abs(n)
    binary = decimal_to_binary(abs(twocomp))[2:] # remove '0b'
    w_bit_bin = '0' * (w - len(binary) - 1) + binary
    return '0b' + str(sign) + w_bit_bin

(source code)

Since rational numbers are ratios of two integers, it follows that we can represent any rational numbers with bits. Representation of real and complex numbers is more delicate and the details are immaterial for what we aim to discuss here. It suffices to say that bits are expressive enough to represent any number system.

While bits are certainly versatile unit, one bit is too small to be a meaningful unit of information. Instead, we typically consider bytes, a collection of 8 bits, as the smallest unit of information. The UTF-8 character encoding standard uses up to four bytes per character to represent a significant chunk of alphabets and symbols in current usage around the world, so a byte is certainly a meaningful unit of information.

We think of memory as a list of byte-sized cells, with a unique physical address assigned to each cell. Without getting into the details of hardware implementations, we shall assume that writing on a memory cell or retrieving the information in a memory cell takes time, as long as we know the physical address in advance. Data are usually larger than one byte, and so a piece of information typically occupies more than one cell. We could, of course, store the data sequentially, byte by byte in adjacent memory cells. Often, however, it is advantageous to spread out data into disjunct memory cells in an organized fashion, with explicit records of how the cells relate to each other. The tool for the job is a pointer, which stores the physical address of the data of interest, rather than the data itself.

A data structure is a way of weaving together memory cells, possibly with pointers, to suit particular needs that arise from the data at hand. In this post, we examine a variety of classical data structures: arrays, stacks, queues, linked lists, skip lists, hash tables, binary search trees, heaps, red-black trees, tries, and graphs. For each data structure, we study implementations of basic operations on data such as get, search, insert, delete, and find, and how they affect the usage of the data structure.

1. Lists

In this section, we study lists, which store data in a linear order. The ordering makes it possible to refer to each chunk of data in a list by their index. Lists can be either arrays or linked lists, depending on how the data is stored in memory. Either arrays or linked lists could be used to stacks or queues, which are lists with extremely simple input-output operations. We also consider skip lists, a probabilistic data structure that harnesses the power of randomness to reap the benefits of both arrays and linked lists.

1.1. Arrays

The simplest data structure is an array, which is little more than a sequence of adjacent memory cells. To define an array, we specify its starting point in memory, the size, in bytes, of each of its element, and how many terms it has. For example, a 2-byte array of length 4 starting at physical address 15 would look like this:

With simple arithmetic, we can compute the physical address of each term in the array. For example, the third element of the above array ranges from to . As we have assumed that writing on a memory cell or retrieving the information in a memory cell of a known physical address takes time, we see that we can access and write on each term of the array in time.

Observe below a minimal Python implementation of an array with access/write features:

class Array:

    def __init__(self, *args):
        """Variable-length initializer.

        For example, we could input Array(1, 2, 3), or Array(1, 2, 3, 4).
        self.array = [*args]
        self.length = len(args)

    def _out_of_bound_error(self, ind, bound):
        if ind < 0 or ind >= bound:
            raise IndexError(
                ("Array index out of range. {i} given: "
                 + "minimum index is 0; maximum index is {m}").format(
                    i=ind, m=bound-1))

    def __getitem__(self, ind):
        """Return the value at the specified index.

        arr[ind], or arr.__getitem__(index), retrieves the value
        at index ind.
        self._out_of_bound_error(ind, bound=self.length)
        return self.array[ind]

    def __setitem__(self, ind, val):
        """Set the value at the specified index.

        arr[ind] = value, or arr.__setitem__(ind, value)__, replaces
        the data stored at index ind with value val.
        self._out_of_bound_error(ind, bound=self.length)
        self.array[ind] = val
        return self

(source code)

Searching for a particular piece of data stored in an array can take longer, as we would have to scan the array sequentially until we find the desired data. In the worst case, we would have to go through the entire array, whence searching takes time, where is the number of terms in an array. Similarly, finding the maximum or the minimum of a numeric array takes time, as we would have to go through the entire array. By the same logic, finding the successor of a term, as well as the predecessor of a term, takes time.

The following Python implementation can be added to the Array class to enable the search features:

    def search(self, val):
        for ind in range(self.length):
            if self.array[ind] == val:
                return ind
        return -1

    def minimum(self):
        min_ind = 0
        for ind in range(self.length):
            if self.array[min_ind] > self.array[ind]:
                min_ind = ind
        return min_ind

    def maximum(self):
        max_ind = 0
        for ind in range(self.length):
            if self.array[max_ind] < self.array[ind]:
                max_ind = ind
        return max_ind

    def predecessor(self, ind):
        self._out_of_bound_error(ind, self.length)
        ind_pred = ind
        for running_index in range(self.length):
            if self.array[running_index] <= self.array[ind]:
                if ind_pred == ind:
                    ind_pred = running_index
                    if self.array[ind_pred] < self.array[running_index]:
                        if running_index != ind:
                            ind_pred = running_index

        if ind_pred == ind:
            return -1
            return ind_pred

    def successor(self, ind):
        self._out_of_bound_error(ind, self.length)
        ind_succ = ind
        for running_index in range(self.length):
            if self.array[running_index] >= self.array[ind]:
                if ind_succ == ind:
                    ind_succ = running_index
                    if self.array[ind_succ] > self.array[running_index]:
                        if running_index != ind:
                            ind_succ = running_index

        if ind_succ == ind:
            return -1
            return ind_succ

(source code)

Inserting an item in an array of length entails creating a new array of length . Indeed, insertion at index requires shifting terms at indices by one to the right, as an array must be stored in memory sequentially.

Similarly, deleting an item in an array of length entails creating a new array of length , as it requires shifting elements by one to the left. It follows that both operations take time.

Insertion/deletion in the Array class can be implemented in Python as follows:

    def __delitem__(self, ind):
        self._out_of_bound_error(ind, bound=self.length)
        self.array = self.array[:ind] + self.array[ind+1:]
        self.length -= 1
        return self

    def insert(self, ind, val):
        self._out_of_bound_error(ind, bound=self.length+1)
        self.array = self.array[:ind] + [val] + self.array[ind:]
        self.length += 1
        return self

All in all, the power of an array lies in its random access feature, i.e., access and write in time. All other basic operations take time, which implies that arrays are a poor choice of data structure for situations that involve frequent searches, insertions, or deletions.

1.2. Linked Lists

Recall from Section 1.1 that arrays specialize in random access, resulting in slow search, insertion, and deletion operations. This tradeoff comes from how arrays are stored in memory: sequentially, with elements laid out adjacent to one another in memory addresses.

How do we improve the efficiency of insertion and deletion operations? We note that these operations require only that we know what chunks of data come after which chunks. The data structure that focuses on such relations is a linked list, which consists of a collection of nodes that contain a chunk of data and a pointer to the next node.

It is customary to keep track of the memory address of the first node, called the head, and the last node, called the tail, of a linked list, so that those two nodes can be accessed in constant time.

Here is a Python implementation of the linked list data sturcture.

class LinkedList:
    class Node:
        def __init__(self, value=None, nextnode=None):
            self.value = value
   = nextnode

    def __init__(self, head=None):
        """Initialize a linked list of length either 0 or 1.

        In both cases, head = tail.
        self.head = head
        self.tail = head
        self.length = 0 if head is None else 1

By keeping track of the memory address of the tail node, we can append a node at the tail end of a linked list in constant time. Indeed, it suffices to access the tail node, set its next node to be the node to be appended, and designate the appended node to be the new tail node.

Without storing the memory address of the tail node, we would have to enter the linked list through the head node and go through every node to get to the tail node before appending a node, ending up with an -time operation.

Since a linked list need not be stored sequentially in memory, all that is required to insert a new node inside a linked list is to insert new_node in between old_node_0 and old_node_1 is to set the next node of old_node_0 to be new_node and the next node of new_node to be old_node_1:

Similarly, deleting a node from a linked list takes constant time. To remove middle_node from

we simply set the next node of left_node to be right_node. Note, however, that we need a priori knowledge of the memory location of left_node to achieve constant-time deletion. Otherwise, we must scan through the linked list until we find the node to be deleted, ending up with an -time operation.

1.3. Doubly Linked Lists

1.4. Stacks and Queues

1.5. Skip Lists