### Optimizing QR Decomposition of Tridiagonal Matrices in Julia

Wed 01 January 2014

Recently I decided to give Julia a try. Julia is a relatively new scientific programming language that is free, open source, and fast. Besides the huge appeal of a performant high level scripting language, Julia intrigued me because the language actually has a reasonable type system and seems to be well-designed (unlike Matlab, which Julia is quite similar to). I recently implemented low-rank matrix approximations (in Matlab) for my numerical linear algebra class, so I figured reimplementing part of the algorithm in Julia would be a good way to get my feet wet in Julia land.

Rather than implementing all of the routines required to compute the low-rank approximation to a matrix, I just implemented the QR decomposition of a tridiagonal matrix using Householder reflections. To learn more about this algorithm, I highly recommend checking out section 3.4.1 of "Applied Numerical Linear Algebra" by Demmel. I was quite interested in the performance of Julia, so I iteratively optimized my algorithm and compared each version to an implementation of the same algorithm using Python and numpy.

## Super-naive QR decomposition

This extremely inefficient implementation mainly serves to illustrate how to compute the QR decomposition of a matrix using Householder reflections.

```
function td_qr1(T)
    # naive as possible implementation
    Qt = eye(size(T)...)
    R = copy(T)
    I = copy(Qt)
    for i in 1:(size(R, 1) - 1)
        u = [zeros(i-1), householder!(R[i:end, i])]
        P = I - 2 .* u * u'
        R = P * R
        Qt = P * Qt
    end
    Qt', R
end
```

In iteration $i$ of the loop, the Householder reflection $P$ zeros out the elements below the diagonal of column $i$. By the end of the loop, this leaves $R$ upper triangular.

Note that this loop explicitly forms the full $P$ matrix in memory. This leads
to the poor performance of this implementation: **~0.31s for 200x200 matrix**.
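The `householder!` helper is defined elsewhere in the original code and not shown here. As a sketch of what it computes (written in Python for easy comparison with the numpy version later in the post): given a vector $x$, it returns the unit Householder vector $v$ such that $(I - 2vv^T)x$ is zero below its first entry. The sign choice below is a standard trick to avoid cancellation, not necessarily the exact convention the original helper uses.

```python
import numpy as np

def householder(x):
    # Return the unit Householder vector v such that (I - 2 v v^T) x
    # has zeros below its first entry. The sign of the shift matches
    # the sign of x[0] to avoid cancellation.
    v = x.astype(float).copy()
    v[0] += (1.0 if v[0] >= 0 else -1.0) * np.linalg.norm(x)
    return v / np.linalg.norm(v)

x = np.array([3.0, 4.0])
v = householder(x)
Px = x - 2.0 * v * (v @ x)  # apply the reflection without forming P
# Px is now approximately (-5, 0): the norm of x moved into the
# first entry and everything below it was zeroed
```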

We can slightly optimize this implementation by noting that
`P * R == (I - 2 .* u * u') * R == R - 2 .* u * u' * R` (and the same logic applies
for `Qt`). This lets us replace

```
P = I - 2 .* u * u'
R = P * R
Qt = P * Qt
```

with

```
M = 2 .* u * u'
R -= M * R
Qt -= M * Qt
```

This very small "optimization" caused no significant change in run-time.
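The identity behind this rewrite is easy to verify numerically. In Python terms, with random data and a hypothetical size:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
u = rng.standard_normal((n, 1))
u /= np.linalg.norm(u)           # unit Householder vector
R = rng.standard_normal((n, n))

I = np.eye(n)
M = 2.0 * (u @ u.T)
lhs = (I - M) @ R                # P * R with the explicit reflector
rhs = R - M @ R                  # the rank-one update form
# lhs and rhs agree to machine precision
```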

## Cutting down the number of FLOPs

Here we'll take advantage of the tridiagonal form of input matrix $T$. To recall, a matrix is tridiagonal if it is 0 everywhere except for its superdiagonal, main diagonal, and subdiagonal.
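For concreteness, a small tridiagonal matrix can be assembled from its three diagonals (a Python/numpy sketch with made-up values):

```python
import numpy as np

# main diagonal, subdiagonal, and superdiagonal of a 4x4 tridiagonal matrix
d = np.array([4.0, 4.0, 4.0, 4.0])
e = np.array([1.0, 1.0, 1.0])
T = np.diag(d) + np.diag(e, -1) + np.diag(e, 1)
# every entry more than one position away from the main diagonal is zero
```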

Because of the tridiagonal form of $T$, the column vector $u$ in the above iterations only contains 2 non-zero elements (the diagonal and the subdiagonal). This means $P = I - 2uu^T$ is equal to $I$ on all rows except for 2. Thus, on iteration $i$ of the algorithm, we only need to multiply rows $i$ and $i+1$ of $R$ and $Q^T$. Here is this optimization implemented:

```
function td_qr3(T)
    Qt = eye(size(T)...)
    R = copy(T)
    for i in 1:(size(R, 1) - 1)
        # now M is 2x2
        u = householder!(R[i:i+1, i])
        M = 2 .* u * u'
        # P = I - M
        # only operate on rows i and i+1
        R[i:i+1, :] -= M * R[i:i+1, :]
        Qt[i:i+1, :] -= M * Qt[i:i+1, :]
    end
    Qt', R
end
```

This optimization gives huge performance gains both in terms of forming the matrix
$M$ (as it is just $2 \times 2$ rather than $n \times n$) as well as the matrix
multiplications by $R$ and $Q^T$. This implementation takes only **~0.005s for
200x200 matrix**, making it roughly 60 times faster than the previous implementation.
As the code just became much faster, all future benchmarking will be done with
1500x1500 tridiagonal matrices. This implementation takes **~0.36s for 1500x1500
matrix**.
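To see why a $2 \times 2$ reflector suffices: on iteration $i$, the only entries of column $i$ at or below the diagonal that can be nonzero are $R_{i,i}$ and $R_{i+1,i}$, so reflecting just those two rows zeroes the subdiagonal entry. Here is a Python sketch of a single such step, using the same hypothetical `householder` helper as before:

```python
import numpy as np

def householder(x):
    # unit vector v with (I - 2 v v^T) x = (±||x||, 0, ..., 0)
    v = x.astype(float).copy()
    v[0] += (1.0 if v[0] >= 0 else -1.0) * np.linalg.norm(x)
    return v / np.linalg.norm(v)

T = np.diag([4.0, 4.0, 4.0]) + np.diag([1.0, 1.0], -1) + np.diag([1.0, 1.0], 1)
R = T.copy()
i = 0
u = householder(R[i:i+2, i])    # column i has only 2 nonzeros on/below the diagonal
M = 2.0 * np.outer(u, u)        # M is just 2x2
R[i:i+2, :] -= M @ R[i:i+2, :]  # update only rows i and i+1
# R[i+1, i] is now numerically zero
```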

Let's take even more advantage of the tridiagonal structure of $T$. Our previous optimization only depended on the fact that $T$ was zero beneath its first subdiagonal, but we can still improve our performance by exploiting that $T$ is also zero above its first superdiagonal. Namely, on iteration $i$ of the main loop, only the first $i+2$ columns have at least one non-zero element in rows $i$ and $i+1$. This means on iteration $i$ we only need to deal with exactly 2 rows and $i+2$ columns of $R$ and $Q^T$. Here's what the updated code looks like:

```
function td_qr4(T)
    Qt = eye(size(T)...)
    R = copy(T)
    for i in 1:(size(R, 1) - 1)
        u = householder!(R[i:i+1, i])
        M = 2 .* u * u'
        # only multiply the columns we need to multiply
        upper = min(i+2, size(R, 2))
        R[i:i+1, 1:upper] -= M * R[i:i+1, 1:upper]
        Qt[i:i+1, 1:upper] -= M * Qt[i:i+1, 1:upper]
    end
    Qt', R
end
```

Profiling this code gives **~0.19s for 1500x1500 matrix**.

## A brief Python interlude

For comparison purposes, I implemented the exact same algorithm in Python with numpy. Here's the code:

```
def tridiag_qr(T):
    R = T.copy()
    Qt = np.eye(T.shape[0])
    for i in xrange(T.shape[0] - 1):
        u = householder(R[i:i+2, i])
        M = np.outer(u, u)
        R[i:i+2, :(i+3)] -= 2 * np.dot(M, R[i:i+2, :(i+3)])
        Qt[i:i+2, :(i+3)] -= 2 * np.dot(M, Qt[i:i+2, :(i+3)])
    return Qt.T, R
```

One small difficulty I had with this translation was the difference in indexing
between Julia and Python. Slicing in Python does not include the end index, but
slicing in Julia does. Profiling this code gives **~0.16s for 1500x1500 matrix**
(**edit: This previously read 0.25s, that was for numpy.linalg.qr**).
This is quite similar to the 0.19s from the equivalent Julia code, and this is
not surprising due to most of the run-time being spent in BLAS calls.
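As a sanity check (not part of the original post), the Python version can be verified against its input: $QR$ should reproduce $T$, $Q$ should be orthogonal, and $R$ should be upper triangular. Here is a self-contained Python 3 version (`range` instead of `xrange`, and a hypothetical `householder` helper filled in, since the post defines it elsewhere):

```python
import numpy as np

def householder(x):
    # assumed helper: unit Householder vector for x
    v = x.astype(float).copy()
    v[0] += (1.0 if v[0] >= 0 else -1.0) * np.linalg.norm(x)
    return v / np.linalg.norm(v)

def tridiag_qr(T):
    R = T.copy()
    Qt = np.eye(T.shape[0])
    for i in range(T.shape[0] - 1):
        u = householder(R[i:i+2, i])
        M = np.outer(u, u)
        R[i:i+2, :(i+3)] -= 2 * np.dot(M, R[i:i+2, :(i+3)])
        Qt[i:i+2, :(i+3)] -= 2 * np.dot(M, Qt[i:i+2, :(i+3)])
    return Qt.T, R

n = 50
T = np.diag(4.0 * np.ones(n)) + np.diag(np.ones(n - 1), -1) + np.diag(np.ones(n - 1), 1)
Q, R = tridiag_qr(T)
# Q @ R reconstructs T, Q.T @ Q is the identity, and the strict
# lower triangle of R is zero up to rounding
```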

I do not know of any way to speed up this Python code further (without using either another language or an external library). The optimizations that I make to the Julia code below would not work in Python due to Python's lack of a JIT (and the relatively poor performance of its interpreted loops).

## JIT == Awesome

I've heard that most linear algebra libraries are particularly well optimized for very rectangular matrices (such as the $2 \times n$ matrices we are using) and also that unrolling loops often speeds up Julia code due to Julia's JIT. This convinced me that there might be some performance gains if I unroll the matrix multiplication. Here's the new code:

```
function td_qr5(T)
    Qt = eye(size(T)...)
    R = copy(T)
    for i in 1:(size(R, 1) - 1)
        u = householder!(R[i:i+1, i])
        M = 2 .* u * u'
        # inline the multiplication
        for j in 1:min(i+2, size(R, 2))
            tmp = M[1, 1] .* R[i, j] + M[1, 2] .* R[i+1, j]
            R[i+1, j] -= M[2, 1] .* R[i, j] + M[2, 2] .* R[i+1, j]
            R[i, j] -= tmp
            tmp = M[1, 1] .* Qt[i, j] + M[1, 2] .* Qt[i+1, j]
            Qt[i+1, j] -= M[2, 1] .* Qt[i, j] + M[2, 2] .* Qt[i+1, j]
            Qt[i, j] -= tmp
        end
    end
    Qt', R
end
```

Besides the hopefully optimized machine code the JIT is generating, inlining the
matrix multiplication also allowed me to fuse the loops required for the two
separate matrix multiplications. The profiler confirms that Julia's JIT did a good
job, as this implementation takes **~0.05s for a 1500x1500 matrix**. This is a 4x
speed increase just from unrolling a loop!
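The unrolled inner loop computes exactly the same update as the $2 \times 2$ matrix product. In Python terms (scalar arithmetic only, mirroring the Julia loop body):

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((2, 2))
col = rng.standard_normal(2)   # one column j of rows i and i+1

# matrix form of the update: col -= M @ col
expected = col - M @ col

# unrolled form, as in the Julia loop body; the temporary matters
# because the second row's update needs the first row's OLD value
a, b = col
tmp = M[0, 0] * a + M[0, 1] * b
b -= M[1, 0] * a + M[1, 1] * b
a -= tmp
# (a, b) now equals expected
```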

## Final step: Cache optimality

Julia arrays are stored in column-major order. This means that `A[i, j]` is directly
next to `A[i+1, j]` in memory and that we want to iterate over columns rather than
rows whenever possible to take full advantage of CPU caching. In both the computation
of $R$ and $Q^T$, I was iterating over rows. However, I really want $Q$, not $Q^T$.
By accumulating $Q$ directly, I get a double win: I can iterate over columns rather
than rows, and I no longer need to transpose $Q^T$ to obtain $Q$ after the loop.

Here's the new code:

```
function td_qr6(T)
    Q = eye(size(T)...)
    R = copy(T)
    for i in 1:(size(R, 1) - 1)
        u = householder!(R[i:i+1, i])
        M = 2 .* u * u'
        for j in 1:min(i+2, size(R, 2))
            tmp = M[1, 1] .* R[i, j] + M[1, 2] .* R[i+1, j]
            R[i+1, j] -= M[2, 1] .* R[i, j] + M[2, 2] .* R[i+1, j]
            R[i, j] -= tmp
            # now cache optimal and no need to transpose
            tmp = M[1, 1] .* Q[j, i] + M[1, 2] .* Q[j, i+1]
            Q[j, i+1] -= M[2, 1] .* Q[j, i] + M[2, 2] .* Q[j, i+1]
            Q[j, i] -= tmp
        end
    end
    Q, R
end
```

Benchmarking this code gives **~0.03 seconds for a 1500x1500 matrix**. This is
about 2x as fast as without cache optimizations, and about 6x as fast as td_qr4,
which performs the exact same math.
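The equivalence of the two accumulation schemes is easy to check: subtracting $M$ times rows $i, i{+}1$ of $Q^T$ is the same as subtracting columns $i, i{+}1$ of $Q$ times $M$ from the right, because $M = 2uu^T$ is symmetric. In Python terms:

```python
import numpy as np

rng = np.random.default_rng(2)
n, i = 6, 2
u = rng.standard_normal(2)
u /= np.linalg.norm(u)
M = 2.0 * np.outer(u, u)            # symmetric 2x2 update

Qt = rng.standard_normal((n, n))
Q = Qt.T.copy()

# row-wise update of Q^T (what td_qr5 does)
Qt[i:i+2, :] -= M @ Qt[i:i+2, :]

# column-wise update of Q (what td_qr6 does); since M is symmetric,
# applying it from the right to two columns is equivalent
Q[:, i:i+2] -= Q[:, i:i+2] @ M

# Q now equals Qt.T: no final transpose needed
```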

## Summary and final thoughts

Implementation | Time to decompose 1500x1500 matrix
---|---
td_qr3 | 0.36s
td_qr4 | 0.19s
Python version of td_qr4 | 0.16s
td_qr5 | 0.05s
td_qr6 | 0.03s

Overall, this project gave me a very favorable impression of Julia. The language was no slower to write than Python and was overall pretty pleasant to program in. Critically, I was able to perform optimizations in Julia that were not possible in Python. Julia seems to hit the awesome sweet spot of being a high level scripting language with performance more comparable to C or Fortran. I'm going to try doing the rest of the assignments for my numerical analysis class in Julia (particularly with IJulia, which merges the best aspects of IPython notebooks and Julia).