# What Is An Efficient Parallel Algorithm?

Most algorithms we encounter in the study of computer science are sequential, i.e., designed with the assumption that only one computational operation can be carried out at each step. Indeed, we recall that the abstract model of computation typically employed in the context of the analysis of algorithms is the random-access machine, which executes basic operations with no concurrency.

Sequential algorithms lend themselves to a straightforward concept of efficiency, which can be computed by simply adding up the number of basic operations in the RAM model. Indeed, it is reasonable to assert that the amount of time it takes to run an algorithm is proportional to the total number of basic operations.

Furthermore, Moore’s law, the principle that the computational capabilities of microprocessors of a fixed size grow exponentially over time, allowed the subject’s preoccupation with sequential algorithms to linger. Concurrency was poorly understood, and any inefficiency in an algorithm could be expected to be salvaged by the next generation of faster hardware.

Moore’s law has now been declared obsolete, and the microprocessor industry produces multi-core processors, which can execute several computational operations in parallel. Sequential algorithms cannot make efficient use of multi-core processors, hastening the need for a systematic study of parallel algorithms.

A natural extension of the RAM model is the parallel random-access machine model, in which multiple processors share the same memory module. Each of the processors can access an arbitrary word, a predetermined size of data, in the memory module in a single step, and different processors can access the memory module in parallel.

In the PRAM model, the total number of basic operations in an algorithm is no longer a good abstraction of efficiency, as some operations can be carried out in parallel. In fact, the amount of time it takes to run an algorithm is no longer proportional to the total number of basic operations, but to the amount of time the slowest of the processors takes to finish its task.

If we assume that all processors in our PRAM are equally efficient, then efficiency depends on how well we can divide up the basic operations in an algorithm. Indeed, if a computational task at hand requires working through a long chain of operations that cannot be parallelized, then one processor must deal with the entire chain, resulting in a less-than-ideal distribution of operations.
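As a concrete illustration (a sketch not taken from the text): summing $n$ numbers left to right yields a chain of $n - 1$ dependent additions that a single processor must execute one after another, while pairing the values in a balanced fashion leaves only about $\log_2 n$ rounds of dependent additions.

```python
import math

def chain_depth(n):
    # Left-to-right summation: each addition depends on the previous
    # one, so the n - 1 additions form a single dependency chain.
    return n - 1

def tree_depth(n):
    # Balanced pairwise summation: additions within each round are
    # independent, so only the number of halving rounds is sequential.
    return math.ceil(math.log2(n)) if n > 1 else 0

for n in (8, 1024):
    print(n, chain_depth(n), tree_depth(n))
# → 8 7 3
# → 1024 1023 10
```

Both strategies perform the same number of additions; they differ only in how many of those additions are forced to wait on one another.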

To analyze such dependencies among operations, let us construct a directed graph of dependencies. We represent each basic operation as a node. The incoming edges of a node represent the input values of the operation, and the outgoing edge represents the output value of the operation. With this abstraction, an algorithm can be thought of as a directed graph of operations and values, leading to one final operation whose outgoing edge represents the output value of the algorithm.

The end result is a tree whose leaves represent the input operations, which merely feed their outgoing edges into the operations that consume them. Since each node of the tree represents one basic operation, the total work of an algorithm is simply the total number of nodes.

Recall now that the depth of a tree is the maximal distance from the root node to a leaf node. A directed path from one node to another represents a chain of dependencies, so a single processor must carry out, one after another, every operation on the path from the farthest leaf node to the root node. It follows that the depth of the tree represents the minimum time it takes to execute the corresponding algorithm, no matter how many processors are available.
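The work and depth of such a tree can be computed by a simple traversal. Here is a minimal sketch, assuming a hypothetical nested-tuple encoding of operation nodes (not a standard API):

```python
# A node is either a leaf (an input value) or a tuple
# (op_name, child, child, ...) representing one basic operation.

def work(node):
    # Total work = number of operation nodes. Here plain inputs are
    # counted as free; counting leaves as unit-cost operations, as the
    # text does, only shifts the total by the number of inputs.
    if not isinstance(node, tuple):
        return 0
    return 1 + sum(work(child) for child in node[1:])

def depth(node):
    # Depth = longest chain of dependent operations ending at this node.
    if not isinstance(node, tuple):
        return 0
    return 1 + max(depth(child) for child in node[1:])

# (a + b) * (c + d): two independent additions feeding one multiplication.
expr = ("mul", ("add", "a", "b"), ("add", "c", "d"))
print(work(expr), depth(expr))  # → 3 2
```

With two processors, the two additions can run in the same step, so this tree finishes in two steps even though it contains three operations.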

For this reason, the tree we have constructed above is referred to as the work-depth model. A version of Brent’s theorem guarantees that an algorithm that can be modeled in the work-depth model with work $w$ and depth $d$ can be executed in $O(\frac{w}{p} + d)$ steps in the $p$-processor PRAM model.
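Brent’s bound can be illustrated by a greedy level-by-level schedule: with $p$ processors, a level of $k$ independent operations takes $\lceil k/p \rceil$ steps, and summing over all $d$ levels gives at most $\frac{w}{p} + d$ steps. A small sketch under these assumptions:

```python
import math

def greedy_schedule_steps(level_sizes, p):
    # level_sizes[i] = number of independent operations at depth i.
    # Each level is executed in ceil(k / p) parallel steps, since the
    # operations within a level have no dependencies among them.
    return sum(math.ceil(k / p) for k in level_sizes)

# Balanced reduction of 16 values: levels of 8, 4, 2, 1 additions.
levels = [8, 4, 2, 1]
w, d, p = sum(levels), len(levels), 3
steps = greedy_schedule_steps(levels, p)
print(steps, w / p + d)  # → 7 9.0  (Brent: steps <= w/p + d)
```

The hypothetical level sizes above come from the balanced-summation example; any work-depth tree gives such a sequence of levels once its nodes are grouped by depth.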

Given an algorithm, we let $T_p$ denote the time it takes to run the algorithm on a $p$-processor PRAM. In the best-case scenario, there are no abnormally long chains of dependencies, and the work is distributed equally among the $p$ processors, resulting in the lower bound

$$T_p \geq \frac{T_1}{p},$$

where $T_1$ denotes the time it takes to run the algorithm on a single-processor RAM.

Brent’s theorem provides an upper bound on $T_p$, and so

$$T_p \leq O\!\left(\frac{w}{p} + d\right).$$

Now, $T_1$ is proportional to the total work $w$, whence it follows that

$$\frac{T_1}{p} \leq T_p \leq O\!\left(\frac{T_1}{p} + d\right).$$

Therefore, the work-depth model is a reasonable abstraction of parallel computing, even though the PRAM model is more closely aligned with our intuitive notion of a multi-core processor.

See (Blelloch–Maggs, 2004) for a survey of basic techniques of parallel algorithm design, as well as their analysis through the work-depth model.
