Solving eigenvalue problems numerically

Cours ENPC — Pratique du calcul scientifique

Objective

Goal of this chapter

Study numerical methods to find the eigenpairs \((\lambda_i, \mathbf{\boldsymbol{v}}_i)\) satisfying \[ \mathsf A \mathbf{\boldsymbol{v}} = \lambda \mathbf{\boldsymbol{v}}, \] where \(\mathsf A \in \mathbf C^{n \times n}.\)

\(~\)

  • Simplifying assumption throughout this chapter: the matrix \(\mathsf A\) is diagonalizable.
  • Notation: eigenvalues \(|\lambda_1| \geqslant\dotsc \geqslant|\lambda_n|\), associated normalized eigenvectors \(\mathbf{\boldsymbol{v}}_1, \dotsc, \mathbf{\boldsymbol{v}}_n\). \[ \mathsf A \mathsf V = \mathsf V \mathsf D, \qquad \mathsf V = \begin{pmatrix} \mathbf{\boldsymbol{v}}_1 & \dots & \mathbf{\boldsymbol{v}}_n \end{pmatrix}, \qquad \mathsf D = {\rm diag}(\lambda_1, \dotsc, \lambda_n). \]
  • Finding all the eigenvalues of a matrix is expensive:

    using LinearAlgebra
    A = rand(2000, 2000)
    @time eigen(A)
    6.694395 seconds

    \({\color{green} \rightarrow}\) We first focus on methods for finding just one or a few eigenpairs.

Program of today

  • Simple vector iterations, for calculating just one eigenpair;

  • Subspace iterations: generalization to calculate multiple eigenpairs at the same time;

  • Projection methods: approximate large eigenvalue problem by a smaller one.

Remark: can we calculate the eigenvalues as roots of the characteristic polynomial?

Cost of calculating the characteristic polynomial by cofactor expansion of the determinant scales as \({\color{red}n!}\)

\({\color{green} \rightarrow}\) not a viable approach…

Before we begin:

  • We present a motivating example;

  • We discuss sparse matrix formats.

Motivating example: Random walk on a graph

Consider a random walk on the graph:

Assumption: jumps to adjacent nodes equi-probable.

  • The associated adjacency matrix \(\mathsf A \in \{0, 1\}^{n \times n}\) is given by \[ a_{ij} = \begin{cases} 1 & \text{if there is an edge between $i$ and $j$,} \\ 0 & \text{otherwise.} \end{cases} \]

    \(~\)

  • Probability of transition \(i \rightarrow j\) given by \[ p_{ij} = \frac{a_{ij}}{n_e(i)}, \] where \(n_e(i)\) is the number of edges connected to node \(i\).

    \({\color{green} \rightarrow}\) note that \(\mathsf A\) is symmetric but \(\mathsf P := (p_{ij})\) is not.

    \(~\)

  • Steady state probability distribution \(\mathbf{\boldsymbol{v}}_1 \in \mathbf R_{\geqslant 0}^n\) satisfies \[ \mathsf P^T\mathbf{\boldsymbol{v}}_1 = \mathbf{\boldsymbol{v}}_1, \qquad \mathbf{\boldsymbol{1}}^T\mathbf{\boldsymbol{v}}_1 = 1. \] (it is possible to show that \(1\) is the dominant eigenvalue)

    \({\color{green} \rightarrow}\) eigenvalue problem for matrix \(\mathsf P^T\).
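
To make these formulas concrete, here is a small sanity check (not taken from the course code) on a 3-node path graph 1–2–3; for a random walk on an undirected graph, the stationary distribution is proportional to the node degrees:

A = [0 1 0; 1 0 1; 0 1 0]        # adjacency matrix of the path graph 1–2–3
P = A ./ sum(A, dims=2)          # p_ij = a_ij / n_e(i)
v₁ = sum(A, dims=2)[:] / sum(A)  # degrees, normalized: [1, 2, 1] / 4
P' * v₁ ≈ v₁                     # true: Pᵀv₁ = v₁ (and sum(v₁) == 1)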

Constructing the adjacency matrix

function adj(m)
  x = vcat([collect(0:i) for i in 0:m-1]...)
  y = vcat([fill(m-i, i) for i in 1:m]...)
  row_lengths = 1:m
  row_indices = [0; cumsum(row_lengths)]
  A = zeros(Int, length(x), length(x))
  for i in 1:m-1
      for j in 1:row_lengths[i]
          r1 = row_indices[i]
          r2 = row_indices[i+1]
          A[r2 + j, r2 + j + 1] = 1  # Horizontal
          A[r1 + j, r2 + j]     = 1  # Vertical
          A[r1 + j, r2 + j + 1] = 1  # Diagonal
      end
  end
  x, y, A + A'
end
  • Here ... is the splat operator (?... for more info)

    sum3(a, b, c) = a + b + c
    numbers = [1, 2, 3]
    sum3(numbers...)
    6
  • Also works with keyword arguments (use ; and NamedTuple)

    plotargs = (color=:blue, label="example")
    plot(sin; plotargs...)
  • Exercise: write function transition(m) that builds \(\mathsf P\).
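
Since transition(m) is used in the experiments that follow, here is a minimal sketch of one possible solution (the only essential step is the row normalization \(p_{ij} = a_{ij}/n_e(i)\)):

# One possible transition(m): divide each row of the adjacency matrix by its sum,
# so that jumps to adjacent nodes are equiprobable.
function transition(m)
    x, y, A = adj(m)
    A ./ sum(A, dims=2)   # p_ij = a_ij / n_e(i)
end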

Notice that \(\mathsf A\) is a sparse matrix:

# Adjacency matrix
x, y, A = adj(10);
Plots.spy(A, xlabel=L"i", ylabel=L"j")
Plots.plot!(size=(600, 600))

This is the case in many applications:

  • Discretization of partial differential equations;

  • Graph applications.

Calculating the steady state

Steady state probability distribution satisfies \[ \mathsf P^T\mathbf{\boldsymbol{v}}_1 = \mathbf{\boldsymbol{v}}_1 \]

Illustration, with values rescaled and rounded to integers:

using LinearAlgebra
v₁ = eigvecs(P', sortby=real)[:, end] |> real
v₁ = v₁ / sum(v₁)  # Normalization
plotgraph(v₁)

Strategy for finding \(\mathbf{\boldsymbol{v}}_1\): evolve probability vector until equilibrium: \[ \mathbf{\boldsymbol{x}}^{(k+1)} = \mathsf P^T\mathbf{\boldsymbol{x}}^{(k)} \qquad \Leftrightarrow \qquad x^{(k+1)}_i = \sum_{j=1}^{n_n} p_{ji} x^{(k)}_j \]

  • \(x^{(k)}_i\) is the probability of being at node \(i\) at iteration \(k\);

  • This is an example of the power iteration;

  • Below we take \(\mathbf{\boldsymbol{x}}^{(0)} = \mathbf{\boldsymbol{e}}_1\).

Efficiency considerations

The power iteration in a few lines of code:

function power_iteration(M, x, niter)
    for i in 1:niter
        x = M*x
        # Sometimes normalization is required
    end
    return x
end

# Dense transition matrix transposed
Pᵗ = transition(20) |> transpose |> Matrix;

# Initial iterate
nn = size(Pᵗ, 1) ; x = zeros(nn) ; x[1] = 1;

\(~\)

Naive approach: use dense data structure for \(\mathsf P^T\)

using GFlops
flops = @count_ops power_iteration(Pᵗ, x, 10)
memory = sizeof(Pᵗ)
println(flops, "\n", "Memory: $memory bytes")
Flop Counter: 882000 flop
┌─────┬─────────┐
│     │ Float64 │
├─────┼─────────┤
│ add │  441000 │
│ mul │  441000 │
└─────┴─────────┘
Memory: 352800 bytes

\({\color{green} \rightarrow}\) time is wasted on 0s. Try zeros(10_000, 10_000)^2.

Better solution: use sparse data structure

import Base: *, transpose
struct coo_matrix
    rows::Array{Int}       # Row indices of ≠0
    cols::Array{Int}       # Column indices of ≠0
    vals::Array{Float64}   # Values of ≠0 entries
    m::Int                 # Number of rows
    n::Int                 # Number of columns
end

function transpose(A::coo_matrix)
    coo_matrix(A.cols, A.rows, A.vals, A.n, A.m)
end

\({\color{green} \rightarrow}\) Exercise: define the * and to_coo functions
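
These are left as an exercise; for reference, here is a minimal sketch of one possible implementation (the loops below are illustrative, not the only option): one multiply–add per stored entry for *, and a scan of the dense matrix for to_coo.

function *(A::coo_matrix, x::AbstractVector)
    y = zeros(A.m)
    for k in eachindex(A.vals)                # one multiply-add per nonzero entry
        y[A.rows[k]] += A.vals[k] * x[A.cols[k]]
    end
    return y
end

function to_coo(B::AbstractMatrix)
    rows, cols, vals = Int[], Int[], Float64[]
    for j in axes(B, 2), i in axes(B, 1)      # scan the dense matrix
        if B[i, j] != 0
            push!(rows, i); push!(cols, j); push!(vals, B[i, j])
        end
    end
    coo_matrix(rows, cols, vals, size(B, 1), size(B, 2))
end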

Results with the sparse data structure:

using GFlops
sparse_Pᵗ = to_coo(Pᵗ)
flops = @count_ops power_iteration(sparse_Pᵗ, x, 10)
memory = sizeof(sparse_Pᵗ)
println(flops, "\n", "Memory: $memory bytes")
Flop Counter: 22800 flop
┌─────┬─────────┐
│     │ Float64 │
├─────┼─────────┤
│ add │   11400 │
│ mul │   11400 │
└─────┴─────────┘
Memory: 40 bytes

Sparse matrix formats

We take the following matrix for illustration: \[ M = \begin{pmatrix} 5 & . & . \\ . & 6 & 7 \\ 8 & . & 9 \end{pmatrix} \]

  • Coordinate list (COO), presented on the previous slide;

    • rows is [1, 2, 2, 3, 3]

    • cols is [1, 2, 3, 1, 3]

    • vals is [5, 6, 7, 8, 9]

  • Compressed sparse rows (CSR):

    • Similar to COO, but often more memory-efficient: rows is compressed;

    • Vector rows contains the indices in cols and vals where rows start…

    • … except for rows[m+1], which is 1 + the number of nonzeros;

    • rows is [1, 2, 4, 6]

    • cols is [1, 2, 3, 1, 3]

    • vals is [5, 6, 7, 8, 9]

  • Compressed sparse columns (CSC):

    • Same as CSR with rows \(\leftrightarrow\) columns;

    • Implemented in Julia by SparseArrays.

using SparseArrays
Z = spzeros(3, 3)
3×3 SparseMatrixCSC{Float64, Int64} with 0 stored entries:
  ⋅    ⋅    ⋅ 
  ⋅    ⋅    ⋅ 
  ⋅    ⋅    ⋅ 

\(~\)

M = [5 0 0; 0 6 7; 8 0 9]
M = sparse(M)
3×3 SparseMatrixCSC{Int64, Int64} with 5 stored entries:
 5  ⋅  ⋅
 ⋅  6  7
 8  ⋅  9

\(~\)

# Constructor using COO format
M = sparse([1, 2, 2, 3, 3],
           [1, 2, 3, 1, 3],
           [5, 6, 7, 8, 9])
3×3 SparseMatrixCSC{Int64, Int64} with 5 stored entries:
 5  ⋅  ⋅
 ⋅  6  7
 8  ⋅  9

\(~\)

println(M.rowval)
println(M.colptr)
println(M.nzval)
[1, 3, 2, 2, 3]
[1, 3, 4, 6]
[5, 8, 6, 7, 9]
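
To connect these three fields to the format, here is a sketch of a matrix–vector product that walks colptr, rowval and nzval directly (SparseArrays of course already provides an optimized *; the function name csc_matvec is illustrative):

# Sketch: mat-vec by walking the CSC arrays directly
function csc_matvec(M::SparseMatrixCSC, x::AbstractVector)
    y = zeros(promote_type(eltype(M), eltype(x)), size(M, 1))
    for j in 1:size(M, 2)                      # loop over columns
        for k in M.colptr[j]:M.colptr[j+1]-1   # stored entries of column j
            y[M.rowval[k]] += M.nzval[k] * x[j]
        end
    end
    return y
end

csc_matvec(M, [1, 1, 1]) == M * [1, 1, 1]      # true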

See also sprand and spdiagm.

Simple vector iterations

The power iteration

The power iteration is a method for finding \(\color{blue}(\lambda_1, v_1)\): \[ \mathbf{\boldsymbol{x}}_{k+1} = \frac{\mathsf A \mathbf{\boldsymbol{x}}_{k}}{{\lVert {\mathsf A \mathbf{\boldsymbol{x}}_{k}} \rVert}}, \]

  • Normalization is necessary to avoid numerical Inf.
  • Given \(\mathbf{\boldsymbol{x}}_k \approx \mathbf{\boldsymbol{v}}_1\), the eigenvalue can be calculated from the Rayleigh quotient: \[ \lambda_{1} \approx \frac{\mathbf{\boldsymbol{x}}_{k}^* \mathsf A \mathbf{\boldsymbol{x}}_k}{\mathbf{\boldsymbol{x}}_k^* \mathbf{\boldsymbol{x}}_k} \]
A = [2 1; 1 2]
x = rand(2)
for i in 1:20
    x = A*x
    x /= sqrt(x'x)
end
println("Eigenvector: $x, eigenvalue: $(x'A*x)")
Eigenvector: [0.7071067812033929, 0.7071067811697023], eigenvalue: 3.0000000000000004

\(~\)

Questions:

  • Does it converge and how quickly?

  • What is a good stopping criterion?

Convergence of the power iteration

Notation: let \(\angle\) denote the acute angle between \(\mathbf{\boldsymbol{x}}\) and \(\mathbf{\boldsymbol{y}}\): \[ \angle(\mathbf{\boldsymbol{x}}, \mathbf{\boldsymbol{y}}) = \arccos\left( \frac{{\lvert {\mathbf{\boldsymbol{x}}^* \mathbf{\boldsymbol{y}}} \rvert}}{\sqrt{\mathbf{\boldsymbol{x}}^* \mathbf{\boldsymbol{x}}} \sqrt{\mathbf{\boldsymbol{y}}^* \mathbf{\boldsymbol{y}}}}\right). \]

Since the eigenvectors of \(\mathsf A\) span \(\mathbf C^n\), any vector \(\mathbf{\boldsymbol{x}}_0\) may be decomposed uniquely as \[ \mathbf{\boldsymbol{x}}_0 = \alpha_1 \mathbf{\boldsymbol{v}}_1 + \dotsb + \alpha_n \mathbf{\boldsymbol{v}}_n. \]

Proposition: convergence of the power iteration started from \(\mathbf{\boldsymbol{x}}_0\)

Suppose that \({\lvert {\lambda_1} \rvert} > {\lvert {\lambda_2} \rvert}\) and \(\alpha_1 \neq 0\). Then \((\mathbf{\boldsymbol{x}}_k)_{k \geqslant 0}\) satisfies \[ \lim_{k \to \infty} \angle(\mathbf{\boldsymbol{x}}_k, \mathbf{\boldsymbol{v}}_1) = 0. \] The sequence \((\mathbf{\boldsymbol{x}}_k)\) is said to converge essentially to \(\mathbf{\boldsymbol{v}}_1\). \(~\)

Proof

By construction \[ \mathbf{\boldsymbol{x}}_k = \frac{\lambda_1^k \alpha_1 \mathbf{\boldsymbol{v}}_1 + \dotsb + \lambda_n^k \alpha_n \mathbf{\boldsymbol{v}}_n}{{\lVert {\lambda_1^k \alpha_1 \mathbf{\boldsymbol{v}}_1 + \dotsb + \lambda_n^k \alpha_n \mathbf{\boldsymbol{v}}_n} \rVert}} = \frac{\lambda_1^k \alpha_1}{|\lambda_1^k \alpha_1|} \frac{\mathbf{\boldsymbol{v}}_1 + \frac{\lambda_2^k \alpha_2}{\lambda_1^k \alpha_1} \mathbf{\boldsymbol{v}}_2 + \dotsb + \frac{\lambda_n^k \alpha_n}{\lambda_1^k \alpha_1} \mathbf{\boldsymbol{v}}_n }{\left\lVert\mathbf{\boldsymbol{v}}_1 + \frac{\lambda_2^k \alpha_2}{\lambda_1^k \alpha_1} \mathbf{\boldsymbol{v}}_2 + \dotsb + \frac{\lambda_n^k \alpha_n}{\lambda_1^k \alpha_1} \mathbf{\boldsymbol{v}}_n\right\rVert}. \] Since \(|\lambda_j/\lambda_1| < 1\) for all \(j \geqslant 2\), the correction terms vanish as \(k \to \infty\), and therefore \[ \mathbf{\boldsymbol{x}}_k \left(\frac{|\lambda_1^k \alpha_1|}{\lambda_1^k \alpha_1}\right) \xrightarrow[k \to \infty]{} \mathbf{\boldsymbol{v}}_1. \] (Notice that the bracketed factor on the left-hand side is a scalar of modulus 1.)

The dominant error term scales as \(\color{red} |\lambda_2/\lambda_1|^k\).
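
A quick numerical check of this rate (not from the course code) on the \(2 \times 2\) matrix used above, for which \(\lambda_1 = 3\), \(\lambda_2 = 1\) and \(\mathbf{\boldsymbol{v}}_1 \propto (1, 1)^T\):

A = [2 1; 1 2]
v_exact = [1, 1] / sqrt(2)
x = rand(2)
for k in 1:8
    x = A*x
    x /= sqrt(x'x)
    err_angle = acos(min(abs(x' * v_exact), 1.0))   # ∠(xₖ, v₁)
    println("k = $k: angle = $err_angle, (λ₂/λ₁)^k = $((1/3)^k)")
end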

Inverse iteration

Suppose that \(\mu \notin \sigma(\mathsf A)\) and \(|\lambda_J - \mu| < |\lambda_K - \mu| \leqslant|\lambda_j - \mu|\) for all \(j \neq J\).

Question: can we find \(\lambda_J\), i.e. the eigenvalue that is nearest \(\mu\)?

Idea: apply the power iteration to \(\mathsf B = (\mathsf A - \mu \mathsf I)^{-1}\): \[ \mathbf{\boldsymbol{x}}_{k+1} = \frac{\mathsf B \mathbf{\boldsymbol{x}}_{k}}{{\lVert {\mathsf B \mathbf{\boldsymbol{x}}_{k}} \rVert}}. \]

  • The dominant eigenvalue of \(\mathsf B\) is given by \((\lambda_J - \mu)^{-1}\).

  • The associated eigenvector is \(\mathbf{\boldsymbol{v}}_J\).

  • Use \ operator instead of calculating the inverse:

    A = [2 1; 1 2]
    x = rand(2)
    μ = 0
    B = A - μ*I
    for i in 1:10
        x = B\x
        x /= sqrt(x'x)
        println("Eigenvector: $x, eigenvalue: $(x'A*x)")
    end
    Eigenvector: [-0.073395069620888, 0.9973029448243622], eigenvalue: 1.8536057618629984
    Eigenvector: [-0.4840913701752976, 0.8750174542955146], eigenvalue: 1.152823203245567
    Eigenvector: [-0.6363924079275455, 0.771365479608843], eigenvalue: 1.0182177300790847
    Eigenvector: [-0.6841587989671516, 0.7293330774041614], eigenvalue: 1.0020407154323043
    Eigenvector: [-0.6995341971283575, 0.7145991233187908], eigenvalue: 1.0002269520011233
    Eigenvector: [-0.7045913752929893, 0.7096132706360092], eigenvalue: 1.0000252194328363
    Eigenvector: [-0.7062692985764518, 0.7079432730723041], eigenvalue: 1.0000028021906129
    Eigenvector: [-0.706827730223213, 0.7073857220692972], eigenvalue: 1.0000003113549005
    Eigenvector: [-0.7070137764235103, 0.7071997737184075], eigenvalue: 1.0000000345949938
    Eigenvector: [-0.7070757809576499, 0.7071377800564256], eigenvalue: 1.0000000038438883

The dominant error term now scales as \(\color{red} \left|\frac{\lambda_J - \mu} {\lambda_K - \mu}\right|^k\).

Rayleigh quotient iteration

Question: can we adapt \(\mu\) during the iteration in order to accelerate convergence?

  • Error of the inverse iteration scales as \(\color{red} \left|\frac{\lambda_J - \mu} {\lambda_K - \mu}\right|^k\).

    \({\color{green} \rightarrow}\) suggests letting

    \[\mu_k = \frac{\mathbf{\boldsymbol{x}}_k^* \mathsf A \mathbf{\boldsymbol{x}}_k}{\mathbf{\boldsymbol{x}}_k^* \mathbf{\boldsymbol{x}}_k}.\]

    This is the Rayleigh quotient iteration.

  • If convergence to an eigenvector occurs, the associated eigenvalue converges cubically:

    \[ |\lambda^{(k+1)} - \lambda_J| \leqslant C |\lambda^{(k)} - \lambda_J|^3. \]

    \({\color{green} \rightarrow}\) Number of significant digits is roughly tripled at each iteration.

A = [2 1; 1 2]
setprecision(500)
x = BigFloat.([1.2; 0.9])
μ = 0
for i in 1:6
    x = (A - μ*I)\x
    x /= sqrt(x'x)
    μ = x'A*x
    println(μ)
end
2.689655172413793214338566074331922284628319727818488081860106227371262853189736760002654726539617400119969038165484095334893321873413803868251681795864
2.9876835222760986149331492272467355588185696109799282648595173771965397075426195063225678761434065353075142377089820027363840168362403721691362285304077
2.9999995241744513496497052218657690803207923292634754813048708145726507175594474194389978335452546331521437999768268695077710762632721314300139734193988
2.9999999999999999999730670707803381925912042248826047017208189162200256216460941091027939474791840614585910098581290345505474157692691833334644060848503
2.9999999999999999999999999999999999999999999999999999999999951158299301653110614230820165226784762374074665384291286744975203198864866424226623297764997
3.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000012

Stopping criterion

A good stopping criterion is \[ \begin{equation} \tag{stopping criterion} {\lVert {\mathsf A \widehat {\mathbf{\boldsymbol{v}}} - \widehat \lambda \widehat {\mathbf{\boldsymbol{v}}}} \rVert} \leqslant\varepsilon {\lVert {\widehat {\mathbf{\boldsymbol{v}}}} \rVert} \label{stopping} \end{equation} \]

Proposition: forward error estimate

Suppose \(\mathsf A\) is Hermitian and \(\eqref{stopping}\) is satisfied. Then there exists \(\lambda \in \sigma(\mathsf A)\) such that \(|\lambda - \widehat \lambda| \leqslant\varepsilon\).

Proof

Let \(\mathbf{\boldsymbol{r}} := \mathsf A \widehat {\mathbf{\boldsymbol{v}}} - \widehat \lambda \widehat {\mathbf{\boldsymbol{v}}}\), and note that \((\mathsf A - \widehat \lambda \mathsf I)^{-1} \mathbf{\boldsymbol{r}} = \widehat {\mathbf{\boldsymbol{v}}}\). Since \(\mathsf A\) is Hermitian, \[ \frac{1}{\min_{\lambda \in \sigma(\mathsf A)} |\lambda - \widehat \lambda|} = {\lVert {(\mathsf A - \widehat \lambda \mathsf I)^{-1}} \rVert} \geqslant\frac{{\lVert {(\mathsf A - \widehat \lambda \mathsf I)^{-1}\mathbf{\boldsymbol{r}}} \rVert}}{{\lVert {\mathbf{\boldsymbol{r}}} \rVert}} = \frac{{\lVert {\widehat {\mathbf{\boldsymbol{v}}}} \rVert}}{{\lVert {\mathbf{\boldsymbol{r}}} \rVert}} \geqslant\frac{1}{\varepsilon}. \] Taking the reciprocal, we deduce that \[ \min_{\lambda \in \sigma(\mathsf A)} |\lambda - \widehat \lambda| \leqslant\varepsilon, \] which concludes the proof.

Proposition: backward error estimate

Suppose that \(\eqref{stopping}\) is satisfied and let \(\mathcal E = \Bigl\{\mathsf E \in \mathbf C^{n \times n}: (\mathsf A + \mathsf E) \widehat {\mathbf{\boldsymbol{v}}} = \widehat \lambda \widehat {\mathbf{\boldsymbol{v}}} \Bigr\}\). Then \[ \min_{\mathsf E \in \mathcal E} {\lVert {\mathsf E} \rVert} \leqslant\varepsilon. \]

Definition of the backward error by numerical analyst Nick Higham:

Backward error is a measure of error associated with an approximate solution to a problem. Whereas the forward error is the distance between the approximate and true solutions, the backward error is how much the data must be perturbed to produce the approximate solution.

Sketch of proof

  • Show first that any \(\mathsf E \in \mathcal E\) satisfies \(\mathsf E \widehat {\mathbf{\boldsymbol{v}}} = - \mathbf{\boldsymbol{r}}\), where \(\mathbf{\boldsymbol{r}} := \mathsf A \widehat {\mathbf{\boldsymbol{v}}} - \widehat \lambda \widehat {\mathbf{\boldsymbol{v}}}\).

  • Deduce from the first item that \[ \inf_{\mathsf E \in \mathcal E} {\lVert {\mathsf E} \rVert} \geqslant\frac{{\lVert {\mathbf{\boldsymbol{r}}} \rVert}}{{\lVert {\widehat {\mathbf{\boldsymbol{v}}}} \rVert}}. \]

  • Find a rank one matrix \(\mathsf E_*\) such that \({\lVert {\mathsf E_*} \rVert} = \frac{{\lVert {\mathbf{\boldsymbol{r}}} \rVert}}{{\lVert {\widehat {\mathbf{\boldsymbol{v}}}} \rVert}}\), and then conclude.

    (Recall that any rank 1 matrix is of the form \(\mathsf E_* = \mathbf{\boldsymbol{u}} \mathbf{\boldsymbol{w}}^*\), with norm \({\lVert {\mathbf{\boldsymbol{u}}} \rVert} {\lVert {\mathbf{\boldsymbol{w}}} \rVert}\).)
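
In code, the criterion can serve as a stopping test for the power iteration; below is a minimal sketch (the function name, the tolerance \(\varepsilon\) and the test matrix are illustrative choices):

using LinearAlgebra

# Power iteration stopped by the residual criterion ‖A v̂ − λ̂ v̂‖ ≤ ε ‖v̂‖
function power_iteration_tol(A, x; ε=1e-12, maxiter=10_000)
    λ = 0.0
    for k in 1:maxiter
        x = A*x
        x /= sqrt(x'x)
        λ = x'A*x                             # Rayleigh quotient (‖x‖ = 1)
        norm(A*x - λ*x) ≤ ε*norm(x) && return λ, x, k
    end
    return λ, x, maxiter
end

λ1, v1, niter = power_iteration_tol([2 1; 1 2], rand(2))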

Subspace iterations

Simultaneous iteration

Suppose that \(\mathsf X_0 \in \mathbf C^{n \times p}\) with \(p < n\) and consider the iteration \[ \mathsf X_{k+1} = {\color{green}\text{normalization}} (\mathsf A \mathsf X_{k}) \]

  • For normalization, apply Gram–Schmidt to the columns;
  • In practice, this can be achieved by QR decomposition (a code sketch follows the remarks below): \[ \mathsf X_{k+1} = \mathsf Q_{k+1}, \qquad \mathsf Q_{k+1} \mathsf R_{k+1} = \mathsf A \mathsf X_k \]

    • \(\mathsf Q_{k+1} \in \mathbf C^{n\times p}\) has orthonormal columns: \(\mathsf Q_{k+1}^* \mathsf Q_{k+1} = \mathsf I_{p \times p}\);

    • \(\mathsf R_{k+1} \in \mathbf C^{p\times p}\) is upper triangular.

Some remarks:

  • The first column undergoes a simple power iteration;

  • Mathematically, it is equivalent to applying the QR decomposition only at the end. Indeed \[ \mathsf X_{k} \underbrace{\mathsf R_{k} \mathsf R_{k-1} \dotsc \mathsf R_1}_{\text{upper tri.}} = \mathsf A \mathsf X_{k-1} \mathsf R_{k-1} \mathsf R_{k-2} \dotsc \mathsf R_{1} = \dots = \mathsf A^k \mathsf X_0. \]

    \({\color{green} \rightarrow}\) in particular, it holds that \(\mathop{\mathrm{col}}(\mathsf X_k) = \mathop{\mathrm{col}}(\mathsf A^k \mathsf X_0)\).
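
A minimal sketch of the simultaneous iteration (the random start, the fixed number of iterations and the symmetric test matrix are illustrative choices):

using LinearAlgebra

# Simultaneous (subspace) iteration: multiply by A, re-orthonormalize by QR
function simultaneous_iteration(A, p; niter=100)
    X = randn(size(A, 1), p)              # random start, columns a.s. independent
    for k in 1:niter
        X = Matrix(qr(A*X).Q)[:, 1:p]     # A Xₖ, then re-orthonormalize the columns
    end
    return X, X'A*X                       # orthonormal basis and projected matrix
end

A = [4 1 0; 1 3 1; 0 1 2]
X, Λ = simultaneous_iteration(A, 2)       # Λ ≈ diag(λ₁, λ₂) for this symmetric A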

Convergence of the simultaneous iteration

Theorem: Convergence of the simultaneous iteration

Assume that

  • \(\mathsf A \in \mathbf C^{n \times n}\) is Hermitian

  • \(\mathsf X_0 \in \mathbf C^{n \times p}\) has linearly independent columns

  • The subspace spanned by the columns of \(\mathsf X_0\) satisfies \[ \mathop{\mathrm{col}}(\mathsf X_0) \cap \mathop{\mathrm{Span}}\{ \mathbf{\boldsymbol{v}}_{p+1}, \dotsc, \mathbf{\boldsymbol{v}}_n \} = \{\mathbf{\boldsymbol{0}}\}. \]

  • The first \(p+1\) eigenvalues are distinct in modulus: \[ |\lambda_1| > |\lambda_2| > \dotsb > |\lambda_p| > |\lambda_{p+1}| \geqslant|\lambda_{p+2}| \geqslant\dotsc \geqslant|\lambda_n|. \] Then the columns of \(\mathsf X_k\) converge essentially to \(\mathbf{\boldsymbol{v}}_1, \dots, \mathbf{\boldsymbol{v}}_p\).

Ideas of the proof:

  • Change coordinates so that \(\mathsf A\) is diagonal;

  • Show convergence of the orthogonal projector: \[ \mathsf X_{k} \mathsf X_k^* \to \mathsf V_1 \mathsf V_1^*, \qquad \mathsf V_1 := \mathsf V\texttt{[:,1:p]}. \]

  • Show by induction the essential convergence of each column.

A particular case of the simultaneous iteration: \(p = n\)

Consider the following simultaneous iteration: \[ {\color{magenta} \mathsf X_{k+1} \mathsf R_{k+1} = \mathsf A \mathsf X_{k}}~\text{(QR decomposition)}, \qquad \mathsf X_0 = \mathsf I_{\color{blue} n\times n}. \] Let \(\mathsf\Lambda_k := \mathsf X_k^* \mathsf A \mathsf X_k\).

  • Under the conditions of the previous slide, it holds that \(\mathsf\Lambda_k \to \mathsf D\).
  • Now notice that

    • \(\mathsf\Lambda_{k} = \mathsf X_{k}^* ({\color{magenta}\mathsf A \mathsf X_{k}}) = \mathsf X_{k}^* ({\color{magenta} \mathsf X_{k+1} \mathsf R_{k+1}}) = {\color{green}(\mathsf X_{k}^* \mathsf X_{k+1})} \color{blue} \mathsf R_{k+1}\)

    • \(\mathsf\Lambda_{k+1} = ({\color{magenta}\mathsf X_{k+1}^* \mathsf A}) \mathsf X_{k+1} = ({\color{magenta} \mathsf R_{k+1} \mathsf X_{k}^*}) \mathsf X_{k+1} = {\color{blue}\mathsf R_{k+1}} {\color{green}(\mathsf X_{k}^* \mathsf X_{k+1})}\)

    Therefore, the eigenvalues may be obtained as follows

    A = [2 1; 1 2]
    Λ = A
    for i in 1:20
        Q, R = qr(Λ)
        Λ = R*Q
    end
    display(Λ)
    2×2 Matrix{Float64}:
     3.0          5.73594e-10
     5.73594e-10  1.0

    This is known as the QR algorithm.

Historical remark: the QR algorithm predates the simultaneous iteration:

  • QR algorithm first proposed in 1959 here.

  • First proof of convergence of the simultaneous iteration in 1976 here.

Projection methods

General projection method

Idea

Find approximate eigenvectors in a subspace \(\mathcal U \subset \mathbf C^n\): \[ \tag{Galerkin condition} \mathsf A \mathbf{\boldsymbol{v}} \approx \lambda \mathbf{\boldsymbol{v}}, \qquad \mathbf{\boldsymbol{v}} \in \mathcal U. \]

Definition: Ritz vectors and Ritz values

A vector \(\widehat{\mathbf{\boldsymbol{v}}} \in \mathcal U\) is a Ritz vector of \(\mathsf A\) relative to \(\mathcal U\) associated with the Ritz value \(\widehat \lambda\) if \[ \mathsf A \widehat{\mathbf{\boldsymbol{v}}} - \widehat \lambda \widehat{\mathbf{\boldsymbol{v}}} \perp \mathcal U. \]

Remark: If \(\mathsf A \mathcal U \subset \mathcal U\), then every Ritz pair is an eigenpair.

Finding Ritz vectors in practice

Let \(\mathsf U \in \mathbf C^{n \times p}\) be a matrix whose columns form an orthonormal basis of \(\mathcal U\).

A pair \((\widehat {\mathbf{\boldsymbol{u}}}, \widehat \lambda) \in \mathbf C^p \times \mathbf C\) is a solution of \[ \tag{Smaller eigenvalue problem} \mathsf U^* \mathsf A \mathsf U \widehat{\mathbf{\boldsymbol{u}}} = \widehat \lambda \widehat{\mathbf{\boldsymbol{u}}}, \] if and only if \((\mathsf U \widehat{\mathbf{\boldsymbol{u}}}, \widehat \lambda)\) is a Ritz pair.
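
In code, once an orthonormal basis \(\mathsf U\) of \(\mathcal U\) is available, the reduction is immediate; a sketch (the function name is illustrative):

using LinearAlgebra

# Ritz pairs of A relative to 𝒰 = col(U), where U has orthonormal columns
function ritz_pairs(A, U)
    F = eigen(U'A*U)               # small p × p eigenvalue problem
    return F.values, U*F.vectors   # Ritz values and Ritz vectors (columns)
end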

General error estimate

Let \(\mathsf A\) be a full rank Hermitian matrix and \(\mathcal U\) a \(p\)-dimensional subspace of \(\mathbf C^n\). Then there are eigenvalues \(\lambda_{i_1}, \dotsc, \lambda_{i_p}\) of \(\mathsf A\) which satisfy \[ \forall j \in \{1, \dotsc, p\}, \qquad {\lvert {\lambda_{i_j} - \widehat \lambda_j} \rvert} \leqslant{\lVert {(\mathsf I - \mathsf P_{\mathcal U}) \mathsf A \mathsf P_{\mathcal U}} \rVert}, \] where \(\mathsf P_{\mathcal U} := \mathsf U \mathsf U^*\) is the orthogonal projector onto \(\mathcal U\).

Proposition: Property of Ritz values in Hermitian setting

In the Hermitian setting, it holds that \(\widehat \lambda_i \leqslant\lambda_i\).

Proof

Let \(\mathsf H := \mathsf U^* \mathsf A \mathsf U\). By the Courant–Fischer theorem, it holds that \[ \widehat \lambda_i = \max_{\mathcal S \subset \mathbf C^p, {\rm dim}(\mathcal S) = i} \left( \min_{\mathbf{\boldsymbol{x}} \in \mathcal S \backslash\{0\}} \frac{\mathbf{\boldsymbol{x}}^* \mathsf H \mathbf{\boldsymbol{x}}}{\mathbf{\boldsymbol{x}}^* \mathbf{\boldsymbol{x}}} \right). \] Letting \(\mathbf{\boldsymbol{y}} = \mathsf U \mathbf{\boldsymbol{x}}\) and then \(\mathcal R = \mathsf U \mathcal S\), we deduce that \[\begin{align*} \widehat \lambda_i &= \max_{\mathcal R \subset \mathcal U, {\rm dim}(\mathcal R) = i} \left( \min_{\mathbf{\boldsymbol{y}} \in \mathcal R \backslash\{0\}} \frac{\mathbf{\boldsymbol{y}}^* \mathsf A \mathbf{\boldsymbol{y}}}{\mathbf{\boldsymbol{y}}^* \mathbf{\boldsymbol{y}}} \right) \leqslant\max_{\mathcal R \subset \mathbf C^n, {\rm dim}(\mathcal R) = i} \left( \min_{\mathbf{\boldsymbol{y}} \in \mathcal R \backslash\{0\}} \frac{\mathbf{\boldsymbol{y}}^* \mathsf A \mathbf{\boldsymbol{y}}}{\mathbf{\boldsymbol{y}}^* \mathbf{\boldsymbol{y}}} \right) = \lambda_i, \end{align*}\] where we used the Courant–Fischer theorem for the matrix \(\mathsf A\) in the last equality.

In particular \[ \widehat \lambda_1 = \max_{\mathbf{\boldsymbol{u}} \in \mathcal U \setminus \{0\}} \frac{\mathbf{\boldsymbol{u}}^* \mathsf A \mathbf{\boldsymbol{u}}}{\mathbf{\boldsymbol{u}}^* \mathbf{\boldsymbol{u}}} \leqslant\max_{\mathbf{\boldsymbol{u}} \in \mathbf C^n \setminus \{0\}} \frac{\mathbf{\boldsymbol{u}}^* \mathsf A \mathbf{\boldsymbol{u}}}{\mathbf{\boldsymbol{u}}^* \mathbf{\boldsymbol{u}}} = \lambda_1. \]

Arnoldi iteration

The Arnoldi iteration is a projection method with \(\mathcal U\) a Krylov subspace \[ \mathcal K_p(\mathsf A, \mathbf{\boldsymbol{u}}_0) = \mathop{\mathrm{Span}}\Bigl\{ \mathbf{\boldsymbol{u}}_0, \mathsf A \mathbf{\boldsymbol{u}}_0, \mathsf A^2 \mathbf{\boldsymbol{u}}_0, \dotsc, \mathsf A^{p-1} \mathbf{\boldsymbol{u}}_0 \Bigr\}. \] More precisely, it is a method for

  • constructing the matrix \(\mathsf U\) by Gram–Schmidt

    (i.e. constructing an orthonormal basis of \(\mathcal K_p(\mathsf A, \mathbf{\boldsymbol{u}}_0)\))

  • calculating the reduced matrix \(\mathsf H := \mathsf U^* \mathsf A \mathsf U\).

function arnoldi(u₀, A, p)
    U = zeros(length(u₀), p)
    U[:, 1] = u₀ / sqrt(u₀'u₀)
    H = zeros(p, p)
    for j in 1:p
        u = A*U[:, j]
        for i in 1:j
            H[i, j] = U[:, i]'u
            u -= H[i, j] * U[:, i]
        end
        if j < p
            H[j+1, j] = sqrt(u'u)
            U[:, j+1] = u / H[j+1, j]
        end
    end
    return H, U
end ;

Remarks:

  • The matrix \(\mathsf H\) is of Hessenberg form.
n, p = 10, 5
A = rand(n, n)
u₀ = rand(n)
H, U = arnoldi(u₀, A, p)
display(H)
5×5 Matrix{Float64}:
 3.92498  0.770511   0.119577   0.948479   0.113864
 2.02973  0.81376    0.175274  -0.180263   0.49816
 0.0      0.550018  -0.346784  -0.156294   0.104327
 0.0      0.0        0.513835   0.488151  -0.00283717
 0.0      0.0        0.0        0.442819   0.128255

\(~\)

  • Computational cost of the orthogonalization scales as \(p^2\).

    \({\color{green} \rightarrow}\) Expensive for large \(p\).

\(~\)

  • Sometimes it is useful to restart the iteration, e.g. when calculating the dominant eigenpair (a code sketch follows the steps below):

    1. Perform \(p\) iterations of Arnoldi;

    2. Calculate the associated Ritz vector \(\widehat {\mathbf{\boldsymbol{v}}}_1\);

    3. Repeat with \(\mathbf{\boldsymbol{u}}_0 = \widehat {\mathbf{\boldsymbol{v}}}_1\).
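
A sketch of such a restart loop, assuming a real matrix with a real, simple dominant eigenvalue (for instance \(\mathsf P^T\) from the random-walk example); the function name and the number of restarts are illustrative:

# Restarted Arnoldi for the dominant eigenpair (real matrix, real dominant eigenvalue)
function restarted_arnoldi(A, u₀, p; nrestarts=10)
    λ, v = 0.0, u₀
    for _ in 1:nrestarts
        H, U = arnoldi(u₀, A, p)             # p steps of Arnoldi
        F = eigen(H)
        i = argmax(abs.(F.values))           # index of the dominant Ritz value
        λ, v = real(F.values[i]), real.(U*F.vectors[:, i])
        u₀ = v                               # restart from the Ritz vector
    end
    return λ, v
end

For the random-walk example, restarted_arnoldi(Pᵗ, x, 5) should return a value close to the dominant eigenvalue \(1\) of \(\mathsf P^T\).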

Performance of Arnoldi iteration in practice

  • Eigenvalues of large modulus tend to converge faster;

  • But this observation still lacks a strong theoretical footing…

Code
n, p = 40, 5
θ = LinRange(0, 6π, n)
A = (@. θ*cos(θ) + im*θ*sin(θ)) |> diagm
u₀ = rand(n)
true_λs = eigvals(A)
xlims = extrema(real.(true_λs)) .+ [-2; 2]
ylims = extrema(imag.(true_λs)) .+ [-2; 2]
anim = @animate for p ∈ 1:n
    H, U = arnoldi(u₀, A, p)
    λs = eigvals(H)
    heatmap(LinRange(xlims..., 100), LinRange(ylims..., 100), (x, y) -> (x^2 + y^2), c=:viridis, alpha=.5, colorbar=:none)
    scatter!(real.(true_λs), imag.(true_λs), shape=:plus, color=:black, ms=6)
    scatter!(real.(λs), imag.(λs), label="Arnoldi")
    plot!(xlims=xlims, ylims=ylims, legend=:none, aspect_ratio=:equal, size=(900, 900))
    annotate!(-15, -17, "p = $p")
end
gif(anim, "arnoldi.gif", fps=2)

Particular case of \(\mathsf A\) Hermitian: Lanczos iteration

The Lanczos iteration is a particular case of Arnoldi’s method when \(\mathsf A\) is Hermitian. \[ \mathcal K_p = \mathop{\mathrm{Span}}\Bigl\{ \mathbf{\boldsymbol{u}}_0, \mathsf A \mathbf{\boldsymbol{u}}_0, \mathsf A^2 \mathbf{\boldsymbol{u}}_0, \dotsc, \mathsf A^{p-1} \mathbf{\boldsymbol{u}}_0 \Bigr\}. \]

  • Arnoldi’s iteration constructs orthonormal \(\mathbf{\boldsymbol{u}}_0, \mathbf{\boldsymbol{u}}_1, \mathbf{\boldsymbol{u}}_2, \dotsc\) recursively as follows: \[ \mathbf{\boldsymbol{u}}_{i+1} = \frac{\widehat {\mathbf{\boldsymbol{u}}}_{i+1}}{{\lVert {\widehat {\mathbf{\boldsymbol{u}}}_{i+1}} \rVert}}, \qquad \widehat {\mathbf{\boldsymbol{u}}}_{i+1} = \mathsf A \mathbf{\boldsymbol{u}}_i - \sum_{j = 0}^i {\langle {\mathsf A \mathbf{\boldsymbol{u}}_i, \mathbf{\boldsymbol{u}}_j} \rangle} \mathbf{\boldsymbol{u}}_j. \]
  • In the Hermitian setting, \[ {\langle {\mathsf A \mathbf{\boldsymbol{u}}_i, \mathbf{\boldsymbol{u}}_j} \rangle} = {\langle {\mathbf{\boldsymbol{u}}_i, \mathsf A \mathbf{\boldsymbol{u}}_j} \rangle}. \] Since \(\mathsf A \mathbf{\boldsymbol{u}}_j \in \mathcal K_{j+2}(\mathsf A, \mathbf{\boldsymbol{u}}_0) = \mathop{\mathrm{Span}}\{\mathbf{\boldsymbol{u}}_0, \dotsc, \mathbf{\boldsymbol{u}}_{j+1}\}\), this inner product is \(0\) if \(i \geqslant j+2\). Therefore \[ \mathbf{\boldsymbol{u}}_{i+1} = \frac{\widehat {\mathbf{\boldsymbol{u}}}_{i+1}}{{\lVert {\widehat {\mathbf{\boldsymbol{u}}}_{i+1}} \rVert}}, \qquad \widehat {\mathbf{\boldsymbol{u}}}_{i+1} = \mathsf A \mathbf{\boldsymbol{u}}_i - {\langle {\mathsf A \mathbf{\boldsymbol{u}}_i, \mathbf{\boldsymbol{u}}_{i-1}} \rangle} \mathbf{\boldsymbol{u}}_{i-1} - {\langle {\mathsf A \mathbf{\boldsymbol{u}}_i, \mathbf{\boldsymbol{u}}_i} \rangle} \mathbf{\boldsymbol{u}}_i. \] The matrix \(\mathsf H := \mathsf U^* \mathsf A \mathsf U\) is tridiagonal in this case.
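
A sketch of the resulting three-term recurrence for a real symmetric matrix (no re-orthogonalization is performed, so orthogonality can degrade in floating point; the function name is illustrative):

using LinearAlgebra

# Lanczos: Arnoldi specialized to (real) symmetric A — only a three-term recurrence
function lanczos(A, u₀, p)
    n = length(u₀)
    U = zeros(n, p); α = zeros(p); β = zeros(p - 1)
    U[:, 1] = u₀ / norm(u₀)
    for i in 1:p
        u = A*U[:, i]
        α[i] = U[:, i]'u                     # ⟨A uᵢ, uᵢ⟩
        u -= α[i]*U[:, i]
        i > 1 && (u -= β[i-1]*U[:, i-1])     # ⟨A uᵢ, uᵢ₋₁⟩ = βᵢ₋₁
        if i < p
            β[i] = norm(u)                   # breakdown if zero: invariant subspace found
            U[:, i+1] = u / β[i]
        end
    end
    return SymTridiagonal(α, β), U           # H = U'AU is tridiagonal
end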

Question: which \(\mathbf{\boldsymbol{v}}\) gives a better approximation \(\frac{\mathbf{\boldsymbol{v}}^* \mathsf A \mathbf{\boldsymbol{v}}}{\mathbf{\boldsymbol{v}}^* \mathbf{\boldsymbol{v}}}\) of the dominant eigenvalue of \(\mathsf A\)?

  • Power iteration: \(\mathbf{\boldsymbol{v}} = \mathsf A^{p-1} \mathbf{\boldsymbol{x}}_0\).

  • Lanczos iteration: \(\mathbf{\boldsymbol{v}} = \widehat {\mathbf{\boldsymbol{v}}}_1\) the Ritz vector relative to \(\mathcal K_{p}\) associated with the dominant Ritz value.