# Strassen’s Algorithm

Recall that an $n$-by-$m$ matrix is a two-dimensional array

$$A = (a_{ij})_{1 \leq i \leq n, \, 1 \leq j \leq m},$$

equipped with the addition operation

$$(a_{ij}) + (b_{ij}) = (a_{ij} + b_{ij})$$

and the scalar multiplication operation

$$c(a_{ij}) = (ca_{ij}).$$

In special circumstances, we can also define the product of two matrices. Specifically, the product of an $n$-by-$m$ matrix $(a_{ij})$ and an $m$-by-$p$ matrix $(b_{kl})$ is the $n$-by-$p$ matrix $(c_{qr})$, where

$$c_{qr} = \sum_{j=1}^{m} a_{qj} b_{jr}$$
for each choice of $q$ and $r$. Since $O(m)$ operations are required for each entry $c_{qr}$, we see that it takes $O(nmp)$ operations to compute the product of an $n$-by-$m$ matrix and an $m$-by-$p$ matrix. In particular, if we consider square matrices, i.e., $n=m=p$, then matrix multiplication runs in cubic time with respect to the size $n$.
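The definition above translates directly into code. The following Python sketch (function name is my own) computes each entry $c_{qr}$ as an inner product, for a total of $O(nmp)$ operations:

```python
def mat_mul(A, B):
    """Multiply an n-by-m matrix A by an m-by-p matrix B in O(nmp) operations."""
    n, m, p = len(A), len(B), len(B[0])
    assert all(len(row) == m for row in A), "inner dimensions must agree"
    # Each entry c_qr is the inner product of row q of A and column r of B.
    return [[sum(A[q][j] * B[j][r] for j in range(m)) for r in range(p)]
            for q in range(n)]
```

For square matrices ($n = m = p$) this is the cubic-time baseline that Strassen's algorithm improves upon.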

A classic work on improving the asymptotic bound on matrix multiplication is the Strassen algorithm, first published in 1969. The algorithm relies crucially on block multiplication, a method of computing matrix multiplication en masse by partitioning the matrices into submatrices and computing the matrix product as if each submatrix is a scalar.

By a submatrix of a matrix

$$A = (a_{ij})_{1 \leq i \leq n, \, 1 \leq j \leq m},$$

we mean a matrix of the form

$$(a_{ij})_{p \leq i \leq q, \, r \leq j \leq s},$$

where $1 \leq p \leq q \leq n$ and $1 \leq r \leq s \leq m$. Now, suppose that we have three matrices $A$, $B$, and $C$ of size $N = m 2^n$, partitioned into equally-sized submatrices of size $\frac{N}{2} = m 2^{n-1}$:

$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}, \quad B = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}, \quad C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}.$$
If $C = AB$, then the submatrices of $C$ can be written in terms of the submatrices of $A$ and $B$ as follows:

$$C_{ij} = A_{i1} B_{1j} + A_{i2} B_{2j}.$$
We observe that block multiplication by itself does not reduce the time complexity of matrix multiplication. Indeed, computing $C_{ij}$ takes $O((N/2)^3 + (N/2)^3) = O(N^3/4)$ operations, whence computing $C$ via block multiplication takes $O(N^3)$ operations, just as many as standard matrix multiplication. Therefore, performing 8 block matrix multiplications is not an improvement.
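To make the bookkeeping concrete, here is a Python sketch (helper names are my own) that computes $C = AB$ via the 8 block products, delegating the multiplication of quadrants to a supplied routine:

```python
def naive_mul(A, B):
    """Standard O(n^3) multiplication, used here on the quadrants."""
    m = len(B)
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(len(B[0]))]
            for i in range(len(A))]

def split(M):
    """Split an even-sized square matrix into its four quadrants."""
    h = len(M) // 2
    return ([row[:h] for row in M[:h]], [row[h:] for row in M[:h]],
            [row[:h] for row in M[h:]], [row[h:] for row in M[h:]])

def add(X, Y):
    """Entrywise sum of two equally-sized matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def block_mul(A, B, mul):
    """C = AB via 8 block products: C_ij = A_i1 B_1j + A_i2 B_2j."""
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    C11 = add(mul(A11, B11), mul(A12, B21))
    C12 = add(mul(A11, B12), mul(A12, B22))
    C21 = add(mul(A21, B11), mul(A22, B21))
    C22 = add(mul(A21, B12), mul(A22, B22))
    # Reassemble the quadrants into the full matrix.
    return [r1 + r2 for r1, r2 in zip(C11, C12)] + \
           [r1 + r2 for r1, r2 in zip(C21, C22)]
```

Counting operations confirms the observation above: the 8 quadrant products dominate, so `block_mul` with `naive_mul` is still cubic.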

If, however, we define

$$\begin{aligned}
M_1 &= (A_{11} + A_{22})(B_{11} + B_{22}), \\
M_2 &= (A_{21} + A_{22}) B_{11}, \\
M_3 &= A_{11} (B_{12} - B_{22}), \\
M_4 &= A_{22} (B_{21} - B_{11}), \\
M_5 &= (A_{11} + A_{12}) B_{22}, \\
M_6 &= (A_{21} - A_{11})(B_{11} + B_{12}), \\
M_7 &= (A_{12} - A_{22})(B_{21} + B_{22}),
\end{aligned}$$

then

$$\begin{aligned}
C_{11} &= M_1 + M_4 - M_5 + M_7, \\
C_{12} &= M_3 + M_5, \\
C_{21} &= M_2 + M_4, \\
C_{22} &= M_1 - M_2 + M_3 + M_6,
\end{aligned}$$
and $C$ can be computed with 7 block matrix multiplications. Since the algorithm can be applied recursively, we see that

$$T_m(n) = 7\, T_m(n-1) + O(4^n),$$

where $T_m(n)$ is the number of operations it takes to compute the above algorithm for matrices of size $N = m2^n$. Here, $O(4^n) = O((2^n)^2)$ is the number of additions performed for matrices of size $N = m2^n$.

From the above identity, we conclude that

$$T_m(n) = O(7^n) = O\!\left((2^n)^{\log_2 7}\right) = O\!\left(N^{\log_2 7}\right) \approx O\!\left(N^{2.81}\right),$$

which is asymptotically smaller than $O(N^3)$.
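The recursion can be sketched in a few lines of Python. This is a minimal illustration, assuming the matrices are square with size a power of 2; it uses the standard seven Strassen products and recurses down to $1 \times 1$ blocks (a practical implementation would switch to the classical algorithm below some cutoff size):

```python
def strassen(A, B):
    """Strassen multiplication for square matrices whose size is a power of 2."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    add = lambda X, Y: [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    sub = lambda X, Y: [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    # Extract quadrant (i, j) of a matrix.
    quad = lambda M, i, j: [row[j*h:(j+1)*h] for row in M[i*h:(i+1)*h]]
    A11, A12, A21, A22 = quad(A,0,0), quad(A,0,1), quad(A,1,0), quad(A,1,1)
    B11, B12, B21, B22 = quad(B,0,0), quad(B,0,1), quad(B,1,0), quad(B,1,1)
    # The seven recursive products.
    M1 = strassen(add(A11, A22), add(B11, B22))
    M2 = strassen(add(A21, A22), B11)
    M3 = strassen(A11, sub(B12, B22))
    M4 = strassen(A22, sub(B21, B11))
    M5 = strassen(add(A11, A12), B22)
    M6 = strassen(sub(A21, A11), add(B11, B12))
    M7 = strassen(sub(A12, A22), add(B21, B22))
    # Recombine into the quadrants of C using only additions and subtractions.
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(sub(add(M1, M3), M2), M6)
    return [r1 + r2 for r1, r2 in zip(C11, C12)] + \
           [r1 + r2 for r1, r2 in zip(C21, C22)]
```

Each call spawns 7 subproblems of half the size plus a quadratic number of additions, matching the recurrence $T_m(n) = 7\,T_m(n-1) + O(4^n)$ above.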

Since Strassen’s algorithm is a divide-and-conquer algorithm, it is not difficult to parallelize. Indeed:

> The divide-and-conquer paradigm improves program modularity, and often leads to simple and efficient algorithms. It has therefore proven to be a powerful tool for sequential algorithm designers. Divide-and-conquer plays an even more prominent role in parallel algorithm design. Because the subproblems created in the first step are typically independent, they can be solved in parallel. Often, the subproblems are solved recursively and thus the next step yields even more subproblems to be solved in parallel. As a consequence, even divide-and-conquer algorithms that were designed for sequential machines typically have some inherent parallelism. (Blelloch–Maggs, 2004)

Naïve parallelization of Strassen’s algorithm does not yield much improvement in performance, however. In order to carry out block multiplication of submatrices computed by different processors, a great deal of communication among the processors is needed. This, in fact, is a serious bottleneck.

> The rate at which operands can be brought to the processor is the primary performance bottleneck for many scientific computing codes. … The local and global memory bandwidth bottleneck is expected to become a more serious problem in the future due to the nonuniform scaling of technology … Many supercomputing applications stretch the capabilities of the underlying hardware, and bottlenecks may occur in many different parts of the system. (NRC 2004)

The usual model of parallel computation assumes shared memory and thus does not give the full picture when it comes to algorithms with high communication complexity. The LPRAM model, a direct generalization of the PRAM model, addresses this problem by adding local memory to each processor (hence the L). In this model, a variant of Strassen’s algorithm that minimizes the communication cost has been developed; see [Ballard–Demmel–Holtz–Lipshitz–Schwartz 2012].
