Strassen’s Algorithm

Recall that an $m$-by-$n$ matrix is a two-dimensional array

$$A = (a_{ij})_{1 \le i \le m,\, 1 \le j \le n} = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix},$$

equipped with the addition operation

$$(a_{ij}) + (b_{ij}) = (a_{ij} + b_{ij})$$

and the scalar multiplication operation

$$c \cdot (a_{ij}) = (c a_{ij}).$$

In special circumstances, we can also define the product of two matrices. Specifically, the product of an $m$-by-$n$ matrix $A = (a_{ij})$ and an $n$-by-$p$ matrix $B = (b_{ij})$ is the $m$-by-$p$ matrix $AB$, where

$$(AB)_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$$

for each choice of $1 \le i \le m$ and $1 \le j \le p$. Since $O(n)$ operations are required for each entry $(AB)_{ij}$, we see that it takes $O(mnp)$ operations to compute the product of an $m$-by-$n$ matrix and an $n$-by-$p$ matrix. In particular, if we consider square matrices, i.e., $m = n = p$, then matrix multiplication runs in cubic time with respect to the size $n$.
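
To make the operation count concrete, here is a minimal Python sketch of the textbook triple-loop multiplication; the function name `matmul_naive` and the list-of-lists representation are illustrative choices, not anything fixed by the definitions above.

```python
def matmul_naive(A, B):
    """Multiply an m-by-n matrix A by an n-by-p matrix B.

    Matrices are represented as lists of rows. The three nested
    loops do O(n) work per entry of the result, O(m*n*p) in total.
    """
    m, n, p = len(A), len(B), len(B[0])
    assert all(len(row) == n for row in A), "inner dimensions must agree"
    C = [[0] * p for _ in range(m)]
    for i in range(m):
        for j in range(p):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

# Example: a 2-by-3 matrix times a 3-by-2 matrix gives a 2-by-2 matrix.
A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
print(matmul_naive(A, B))  # [[58, 64], [139, 154]]
```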

A classic work on improving the asymptotic bound for matrix multiplication is Strassen's algorithm, first published in 1969. The algorithm relies crucially on block multiplication, a method of computing a matrix product en masse by partitioning the matrices into submatrices and carrying out the multiplication as if each submatrix were a scalar.

By a submatrix of a matrix

$$A = (a_{ij})_{1 \le i \le m,\, 1 \le j \le n}$$

we mean a matrix

$$(a_{ij})_{i \in I,\, j \in J},$$

where $I \subseteq \{1, \dots, m\}$ and $J \subseteq \{1, \dots, n\}$. Now, suppose that we have three matrices $A$, $B$, and $C$ of size $2n$-by-$2n$, partitioned into equally-sized submatrices of size $n$-by-$n$:

$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}, \qquad B = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}, \qquad C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}.$$

If $C = AB$, then the submatrices of $C$ can be written in terms of the submatrices of $A$ and $B$ as follows:

$$\begin{aligned} C_{11} &= A_{11} B_{11} + A_{12} B_{21}, & C_{12} &= A_{11} B_{12} + A_{12} B_{22}, \\ C_{21} &= A_{21} B_{11} + A_{22} B_{21}, & C_{22} &= A_{21} B_{12} + A_{22} B_{22}. \end{aligned}$$

We observe that block multiplication by itself does not reduce the time complexity of matrix multiplication. Indeed, computing each product $A_{ik} B_{kj}$ takes $O(n^3)$ operations, whence computing $C$ via block multiplication takes $8 \cdot O(n^3) = O((2n)^3)$ operations, just as many as standard matrix multiplication. Therefore, performing 8 block matrix multiplications is not an improvement.
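
As a sanity check, here is a small NumPy sketch of one level of block multiplication; the helper name `block_multiply` and the even-size assumption are mine rather than the text's.

```python
import numpy as np

def block_multiply(A, B):
    """One level of block multiplication for 2n-by-2n matrices.

    Each of the eight products below is an n-by-n matrix product,
    so this performs just as many scalar operations as the standard
    algorithm; it only reorganizes the computation.
    """
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
    C11 = A11 @ B11 + A12 @ B21
    C12 = A11 @ B12 + A12 @ B22
    C21 = A21 @ B11 + A22 @ B21
    C22 = A21 @ B12 + A22 @ B22
    return np.block([[C11, C12], [C21, C22]])

rng = np.random.default_rng(0)
A, B = rng.random((4, 4)), rng.random((4, 4))
assert np.allclose(block_multiply(A, B), A @ B)
```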

If, however, we define

$$\begin{aligned}
M_1 &= (A_{11} + A_{22})(B_{11} + B_{22}), \\
M_2 &= (A_{21} + A_{22}) B_{11}, \\
M_3 &= A_{11} (B_{12} - B_{22}), \\
M_4 &= A_{22} (B_{21} - B_{11}), \\
M_5 &= (A_{11} + A_{12}) B_{22}, \\
M_6 &= (A_{21} - A_{11})(B_{11} + B_{12}), \\
M_7 &= (A_{12} - A_{22})(B_{21} + B_{22}),
\end{aligned}$$

then

$$\begin{aligned}
C_{11} &= M_1 + M_4 - M_5 + M_7, \\
C_{12} &= M_3 + M_5, \\
C_{21} &= M_2 + M_4, \\
C_{22} &= M_1 - M_2 + M_3 + M_6,
\end{aligned}$$

and $C$ can be computed with 7 block matrix multiplications. Since the algorithm can be applied recursively, we see that

$$T(2n) = 7\, T(n) + 18 n^2,$$

where $T(n)$ is the number of operations it takes to compute the above algorithm for matrices of size $n$. Here, $18 n^2$ is the cost of the 18 additions and subtractions performed on matrices of size $n$.
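
To see where the exponent comes from, we can unroll the recurrence explicitly; the computation below assumes $T(1) = 1$ and a power-of-two size $n = 2^k$, normalizations chosen here for convenience:

$$T(2^k) = 7\, T(2^{k-1}) + 18 \cdot 4^{k-1} = 7^k\, T(1) + 18 \sum_{i=0}^{k-1} 7^i\, 4^{k-1-i} = 7^k + 6\,(7^k - 4^k) = 7 \cdot 7^k - 6 \cdot 4^k,$$

using $\sum_{i=0}^{k-1} 7^i\, 4^{k-1-i} = (7^k - 4^k)/(7 - 4)$. Since $7^k = (2^k)^{\log_2 7} = n^{\log_2 7}$, this gives $T(n) = 7\, n^{\log_2 7} - 6\, n^2$ exactly.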

From the above identity, we conclude that

$$T(n) = O\!\left(n^{\log_2 7}\right) = O(n^{2.81}),$$

which is asymptotically smaller than $O(n^3)$.
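
Putting the pieces together, the following is a minimal recursive sketch of Strassen's algorithm in Python with NumPy; it assumes square inputs of power-of-two size and cuts over to ordinary multiplication below a small threshold, both of which are implementation choices of mine rather than part of the algorithm proper.

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Multiply square matrices whose size is a power of two.

    Below the cutoff, recursion overhead outweighs the savings,
    so we fall back to the standard algorithm.
    """
    n = A.shape[0]
    if n <= cutoff:
        return A @ B
    m = n // 2
    A11, A12, A21, A22 = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
    B11, B12, B21, B22 = B[:m, :m], B[:m, m:], B[m:, :m], B[m:, m:]
    # The seven recursive products.
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    # Reassemble the four blocks of the product.
    return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                     [M2 + M4, M1 - M2 + M3 + M6]])

rng = np.random.default_rng(1)
A, B = rng.random((128, 128)), rng.random((128, 128))
assert np.allclose(strassen(A, B), A @ B)
```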

Since Strassen's algorithm is a divide-and-conquer algorithm, it is not difficult to parallelize. Indeed:

The divide-and-conquer paradigm improves program modularity, and often leads to simple and efficient algorithms. It has therefore proven to be a powerful tool for sequential algorithm designers. Divide-and-conquer plays an even more prominent role in parallel algorithm design. Because the subproblems created in the first step are typically independent, they can be solved in parallel. Often, the subproblems are solved recursively and thus the next step yields even more subproblems to be solved in parallel. As a consequence, even divide-and-conquer algorithms that were designed for sequential machines typically have some inherent parallelism. (Blelloch–Maggs, 2004)

Naïve parallelization of Strassen's algorithm does not yield much improvement in performance, however. In order to carry out block multiplication of submatrices computed by different processors, a great deal of communication among the processors is needed. This, in fact, is a serious bottleneck.
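
To illustrate what naïve parallelization looks like, here is a sketch that farms the seven top-level products out to a process pool using Python's standard library; the structure is my own illustration, and the cost of serializing and shipping submatrices between worker processes is precisely the kind of communication overhead described above.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def _mult(args):
    X, Y = args
    return X @ Y  # each worker multiplies one pair of submatrices

def strassen_parallel(A, B):
    """One level of Strassen with the 7 products done in parallel.

    Every submatrix is copied to a worker process and every result
    copied back, so interprocess traffic grows with the matrix size.
    """
    m = A.shape[0] // 2
    A11, A12, A21, A22 = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
    B11, B12, B21, B22 = B[:m, :m], B[:m, m:], B[m:, :m], B[m:, m:]
    pairs = [(A11 + A22, B11 + B22), (A21 + A22, B11),
             (A11, B12 - B22), (A22, B21 - B11),
             (A11 + A12, B22), (A21 - A11, B11 + B12),
             (A12 - A22, B21 + B22)]
    with ProcessPoolExecutor(max_workers=7) as pool:
        M1, M2, M3, M4, M5, M6, M7 = pool.map(_mult, pairs)
    return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                     [M2 + M4, M1 - M2 + M3 + M6]])

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    A, B = rng.random((256, 256)), rng.random((256, 256))
    assert np.allclose(strassen_parallel(A, B), A @ B)
```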

The rate at which operands can be brought to the processor is the primary performance bottleneck for many scientific computing codes. … The local and global memory bandwidth bottleneck is expected to become a more serious problem in the future due to the nonuniform scaling of technology … Many supercomputing applications stretch the capabilities of the underlying hardware, and bottlenecks may occur in many different parts of the system. (NRC 2004)

The usual model of parallel computation assumes shared memory and thus does not give the full picture when it comes to algorithms with high communication complexity. The LPRAM model, a direct generalization of the PRAM model, addresses this problem by adding local memory to each processor (hence the L); in this model, a variant of Strassen's algorithm that minimizes the communication cost has been developed. See [Ballard–Demmel–Holtz–Lipshitz–Schwartz 2012].