# Parallel Computing (Draft)

## Modeling Parallel Computing

Most algorithms we encounter in the study of computer science are sequential, i.e., designed with the assumption that only one computational operation can be carried out at each step. Indeed, we recall that the abstract model of computation typically employed in the context of the analysis of algorithms is the random-access machine, which executes basic operations with no concurrency.

Sequential algorithms lend themselves to a straightforward concept of efficiency, which can be computed by simply adding up the number of basic operations in the RAM model. Indeed, it is reasonable to assert that the amount of time it takes to run an algorithm is proportional to the total number of basic operations.

Furthermore, Moore’s Law, the principle that the computational capabilities of microprocessors of a fixed size grows exponentially over time, allowed the subject’s preoccupation on sequential algorithms to linger. Concurrency was poorly understood, and any algorithm that wasn’t efficient was to be salvaged by Moore’s law.

Moore’s law has now been declared obsolete, and the microprocessor industry produces multi-core processors, which can execute several computational operations in parallel. Sequential algorithms cannot make efficient use of multi-core processors, hastening the need for a systematic study of parallel algorithms.

A natural extension of the RAM model is the parallel random-access machine model, in which multiple processors share the same memory module. Each of the processors can access an arbitrary word, a predetermined size of data, in the memory module in a single step, and different processors can access the memory module in parallel.

In the PRAM model, the total number of basic operations present in an algorithm is no longer a good abstraction of efficiency, as some operations can be carried out in parallel. In fact, the amount of time it takes to run an algorithm is no longer proportional to the total number of basic operations, but on the amount of time the slowest of the processors takes to finish the task it was given.

If we assume that all processors in our PRAM are equally efficient, then efficiency depends on how well we can divide up the basic operations in an algorithm. Indeed, if a computational task at hand requires working through a long chain of operations that cannot be parallelized, then one processor must deal with the entire chain, resulting in a less-than-ideal distribution of operations.

To analyze such dependencies among operations, let us construct a directed graph of dependencies. We represent each basic operation as a node. The incoming edges of a node represent the input values of the operation, and the outgoing edge represents the output value of the operation. With this abstraction, an algorithm can be thought of as a directed graph of operations and values, leading to one final operation whose outgoing edge represents the output value of the algorithm.

The end result is a tree whose leaves represent the input variable operations that merely feed their outgoing edges into the next operations. Since each node of the tree represents one basic operation, the total work needed for an algorithm is merely the total number of nodes.

Recall now that the depth of a tree is the maximal distance from the root node to a leaf node. A directed path from one node to another represents a chain of dependencies, and so at least one processor must carry out every operation in the path from the farthest leaf node to the root node. It follows that the depth of the tree represents the time it takes to execute the corresponding algorithm.

For this reason, the tree we have constructed above is referred to as the work-depth model. A version of Brent’s theorem guarantees that an algorithm can be modeled on the work-depth model of work size $w$ and depth $d$ can be executed in $O(\frac{w}{p} + d)$ steps in the $p$-processor PRAM model.

Given an algorithm, we let $T_p$ denote the time it takes to run the algorithm on a $p$-processor PRAM. In the best-case scenario, there is no abnormally long chains of dependencies, and the work is distributed equally among the $p$ processors, resulting in the lower bound

where $T_1$ denotes the time it takes to run the algorithm on a single-processor RAM.

Brent’s theorem provides an upper bound on $T_p$, and so

Now, $T_1$ is proportional to $w$, whence it follows that

Therefore, the work-depth model is a reasonable abstraction of parallel computing, even though the PRAM model is more closely aligned with our intuitive notion of a multi-core processor.

As an aside, TensorFlow, Google’s distributed computing engine, can be thought of as an implementation of the work-depth model. Indeed, we observe that computational tasks are implemented on TensorFlow as directed acyclic graphs, where each node represents a mathematical operation, each incoming node an $n$-dimensional array of input value, and each outgoing node an $n$-dimensional array of output value.

## Case Study: Parallelizing Matrix Multiplication

Recall that an $n$-by-$m$ matrix is a two-dimensional array

equipped with addition operation

and scalar multiplication operation

In special circumstances, we can also define the product of two matrices. Specifically, the product of an $n$-by-$m$ matrix $(a_{ij})$ and an $m$-by-$p$ matrix $(b_{kl})$ is the $n$-by-$p$ matrix $(c_{qr})$, where

for each choice of $q$ and $r$. Since $O(m)$ operations are required for each entry $c_{qr}$, we see that it takes $O(nmp)$ operations to compute the product of an $n$-by-$m$ matrix and an $m$-by-$p$ matrix. In particular, if we consider square matrices, i.e., $n=m=p$, then matrix multiplication runs in cubic time with respect to the size $n$.

A classic work on improving the asymptotic bound on matrix multiplication is the Strassen algorithm, first published in 1969. The algorithm relies crucially on block multiplication, a method of computing matrix multiplication en masse by partitioning the matrices into submatrices and computing the matrix product as if each submatrix is a scalar.

By a submatrix of a matrix

we mean a matrix

where $1 \leq p \leq q \leq n$ and $1 \leq r \leq s \leq m$. Now, suppose that we have three matrices $A$, $B$, and $C$ of size $N = m 2^n$, partitioned into equally-sized submatrices, of size $\frac{N}{2} = m 2^{n-1}$:

If $C = AB$, then the submatrices of $C$ can be written in terms of the submatrices of $A$ and $B$ as follows:

We observe that block multiplication by itself does not reduce the time complexity of matrix multiplication. Indeed, computing $C_{ij}$ takes $O((N/2)^3 + (N/2)^3) = O(N^3/4)$ operations, whence computing $C$ via block multiplication takes $O(N^3)$ operations, just as many as standard matrix multiplication. Therefore, performing 8 block matrix multiplications is not an improvement.

If, however, we define

then

and $C$ can be computed with 7 block matrix multiplications. Since the algorithm can be applied recursively, we see that

where $T_m(n)$ is the number of operations it takes to compute the above algorithm for matrices of size $N = m2^n$. Here, $O(4^n) = O((2^n)^2))$ is the number of additions performed for matrices of size $N = m2^n$.

From the above identity, we conclude that

which is asymptotically smaller than O(N^3)\$

Tags:

Categories:

Updated: