Cours ENPC — Pratique du calcul scientifique
Goal of this chapter
Study numerical methods for the linear equation \mathsf A \mathbf{\boldsymbol{x}} = \mathbf{\boldsymbol{b}}, where \mathsf A \in \mathbf C^{n \times n} and \mathbf{\boldsymbol{b}} \in \mathbf C^n.
Two classes of methods:
Direct methods:
LU decomposition (for general invertible matrices);
Cholesky decomposition (for symmetric positive definite matrices)
Iterative methods:
Conjugate gradients
And many more: GMRES, BiCGStab, etc.
Before discussing these methods, we introduce the concept of conditioning.
Consider for instance the following system, whose exact solution is (1, -1)^T. In Julia, we can calculate an approximate solution with the \ operator:
A = [1 1; 1 (1-1e-12)]
b = [0; 1e-12]
x = A \ b
2-element Vector{Float64}:
  1.0000221222095027
 -1.0000221222095027
Why is the relative error much larger than the machine epsilon?
Question: Can we estimate \frac{{\lVert {\Delta \mathbf{\boldsymbol{x}}} \rVert}}{{\lVert {\mathbf{\boldsymbol{x}}} \rVert}} in terms of \frac{{\lVert {\Delta \mathsf A} \rVert}}{{\lVert {\mathsf A} \rVert}} and \frac{{\lVert {\Delta \mathbf{\boldsymbol{b}}} \rVert}}{{\lVert {\mathbf{\boldsymbol{b}}} \rVert}}?
Given p \in [1, \infty], the p-norm of a vector \mathbf{\boldsymbol{x}} \in \mathbf C^n is defined as follows: {\lVert {\mathbf{\boldsymbol{x}}} \rVert}_p := \begin{cases} \left( \sum_{i=1}^{n} {\lvert {x_i} \rvert}^p \right)^{\frac{1}{p}} & \text{if $p < \infty$}, \\ \max \Bigl\{ {\lvert {x_1} \rvert}, \dotsc, {\lvert {x_n} \rvert} \Bigr\} & \text{if $p = \infty$}. \end{cases}
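In Julia, these vector norms are available through norm(x, p) from the LinearAlgebra standard library; a quick illustration (the values shown are easy to check by hand):
using LinearAlgebra
x = [3.0, -4.0]
norm(x, 1), norm(x, 2), norm(x, Inf)   # (7.0, 5.0, 4.0)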
import Plots
using LaTeXStrings  # provides the L"..." string macro used in the labels below
Plots.plot(aspect_ratio=:equal, xlims=(-1.1, 1.1), ylims=(-1.1, 1.1))
Plots.plot!(framestyle=:origin, grid=true, legend=:outertopright)
Plots.plot!(title="Unit circle in different norms")
Plots.plot!([0, 1, 0, -1, 0], [-1, 0, 1, 0, -1], label=L"$p = 1$ (taxicab norm)")
Plots.plot!([1, 1, -1, -1, 1], [1, -1, -1, 1, 1], label=L"p = \infty")
Plots.plot!(t -> cos(t), t -> sin(t), 0, 2π, label=L"$p = 2$ (Euclidean norm)")
The operator norm on matrices induced by the vector p-norm is given by {\lVert {\mathsf A} \rVert}_{p} := \sup_{{\lVert {\mathbf{\boldsymbol{x}}} \rVert}_{p} \leqslant 1} {\lVert {\mathsf A \mathbf{\boldsymbol{x}}} \rVert}_{p}. It follows from the definition that {\lVert {\mathsf A \mathsf B} \rVert}_p \leqslant{\lVert {\mathsf A} \rVert}_p {\lVert {\mathsf B} \rVert}_p.
Exercises
Show that
The matrix 2-norm is given by \sqrt{\lambda_{\rm max}(\mathsf A^* \mathsf A)}.
The matrix 1-norm is the maximum absolute column sum:
{\lVert {\mathsf A} \rVert}_1 = \max_{1 \leqslant j \leqslant n} \sum_{i=1}^{n} {\lvert {a_{ij}} \rvert}.
The matrix \infty-norm is the maximum absolute row sum:
{\lVert {\mathsf A} \rVert}_{\infty} = \max_{1 \leqslant i \leqslant n} \sum_{j=1}^{n} {\lvert {a_{ij}} \rvert}.
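As a quick numerical sanity check of the three formulas above, the sketch below (our own, using a small random matrix) compares them against opnorm from LinearAlgebra:
using LinearAlgebra
A = randn(4, 4)
@assert opnorm(A, 2) ≈ sqrt(maximum(eigvals(Symmetric(A' * A))))   # √λ_max(AᵀA)
@assert opnorm(A, 1) ≈ maximum(sum(abs.(A), dims = 1))             # maximum absolute column sum
@assert opnorm(A, Inf) ≈ maximum(sum(abs.(A), dims = 2))           # maximum absolute row sum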
From matrix norms, we define the condition number of a matrix as \kappa_p(\mathsf A) = {\lVert {\mathsf A} \rVert}_p {\lVert {\mathsf A^{-1}} \rVert}_p
Properties:
\kappa_p(\mathsf I) = 1
\kappa_p(\mathsf A) \geqslant 1
\kappa_p(\alpha \mathsf A) = \kappa_p(\mathsf A) for every \alpha \neq 0
In Julia, the condition number can be calculated with cond(A, p) from the LinearAlgebra standard library (p = 2 by default), as illustrated below.
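For the matrix of the opening example, the condition number is huge, which explains the large relative error observed there (the values indicated in the comments are approximate):
using LinearAlgebra
A = [1 1; 1 (1-1e-12)]
cond(A)        # ≈ 4e12 in the 2-norm
cond(A, Inf)   # ∞-norm condition number, of the same order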
Proposition
Let \mathbf{\boldsymbol{x}} + \Delta \mathbf{\boldsymbol{x}} denote the solution to \mathsf A (\mathbf{\boldsymbol{x}} + \Delta \mathbf{\boldsymbol{x}}) = \mathbf{\boldsymbol{b}} + \Delta \mathbf{\boldsymbol{b}} The following inequality holds: \frac{{\lVert {\Delta \mathbf{\boldsymbol{x}}} \rVert}}{{\lVert {\mathbf{\boldsymbol{x}}} \rVert}} \leqslant\kappa(\mathsf A) \frac{{\lVert {\Delta \mathbf{\boldsymbol{b}}} \rVert}}{{\lVert {\mathbf{\boldsymbol{b}}} \rVert}}
Proof
It holds by definition of \Delta \mathbf{\boldsymbol{x}} that \mathsf A \Delta \mathbf{\boldsymbol{x}} = \Delta \mathbf{\boldsymbol{b}}. Therefore \begin{aligned} {\lVert {\Delta \mathbf{\boldsymbol{x}}} \rVert} &= {\lVert {\mathsf A^{-1} \Delta \mathbf{\boldsymbol{b}}} \rVert} \leqslant{\lVert {\mathsf A^{-1}} \rVert} {\lVert {\Delta \mathbf{\boldsymbol{b}}} \rVert} \\ &= \frac{{\lVert {\mathsf A \mathbf{\boldsymbol{x}}} \rVert}}{{\lVert {\mathbf{\boldsymbol{b}}} \rVert}} {\lVert {\mathsf A^{-1}} \rVert} {\lVert {\Delta \mathbf{\boldsymbol{b}}} \rVert} \leqslant\frac{{\lVert {\mathsf A} \rVert} {\lVert {\mathbf{\boldsymbol{x}}} \rVert}}{{\lVert {\mathbf{\boldsymbol{b}}} \rVert}} {\lVert {\mathsf A^{-1}} \rVert} {\lVert {\Delta \mathbf{\boldsymbol{b}}} \rVert}. \end{aligned} Rearranging, we obtain the statement.
Proposition
Let \mathbf{\boldsymbol{x}} + \Delta \mathbf{\boldsymbol{x}} denote the solution to (\mathsf A + \Delta \mathsf A) (\mathbf{\boldsymbol{x}} + \Delta \mathbf{\boldsymbol{x}}) = \mathbf{\boldsymbol{b}} If \mathsf A is invertible and {\lVert {\Delta \mathsf A} \rVert} < \frac{1}{2} {\lVert {\mathsf A^{-1}} \rVert}^{-1}, then \frac{{\lVert {\Delta \mathbf{\boldsymbol{x}}} \rVert}}{{\lVert {\mathbf{\boldsymbol{x}}} \rVert}} \leqslant 2\kappa(\mathsf A) \frac{{\lVert {\Delta \mathsf A} \rVert}}{{\lVert {\mathsf A} \rVert}}
Conclusion: \kappa(\mathsf A) measures sensitivity to perturbations:
useful to estimate the impact of round-off errors;
influences convergence speed of methods (see later).
When \kappa_p(\mathsf A) \gg 1, the system is called ill-conditioned.
First calculate the decomposition \mathsf A = \mathsf L \mathsf U with
\mathsf U an upper triangular matrix;
\mathsf L a unit lower triangular matrix.
Then solve \mathsf L \mathbf{\boldsymbol{y}} = \mathbf{\boldsymbol{b}} by forward substitution, followed by \mathsf U \mathbf{\boldsymbol{x}} = \mathbf{\boldsymbol{y}} by backward substitution.
Remark: The LU decomposition may not exist, even when \mathsf A is invertible.
In practice, use decomposition \mathsf P \mathsf A = \mathsf L \mathsf U, with \mathsf P a permutation matrix.
Guaranteed to exist if \mathsf A is invertible;
More numerically stable.
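In Julia, this pivoted factorization is provided by lu from the LinearAlgebra standard library; a minimal usage sketch (the 3 x 3 matrix below is the one used in the worked example that follows):
using LinearAlgebra
A = [2.0 1 -1; -3 -1 2; -2 1 2]
F = lu(A)                 # computes P * A = L * U with partial pivoting
F.L * F.U ≈ A[F.p, :]     # true: F.p is the row permutation vector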
Example: elimination steps for the matrix \underbrace{ \begin{pmatrix} 2 & 1 & -1 \\ -3 & -1 & 2 \\ -2 & 1 & 2 \end{pmatrix} }_{\mathsf A}
\underbrace{ \begin{pmatrix} 1 & 0 & 0 \\ {\color{red}\frac{3}{2}} & 1 & 0 \\ {\color{red}{1}} & 0 & 1 \end{pmatrix} }_{=: \, \mathsf M_1}~ \underbrace{ \begin{pmatrix} 2 & 1 & -1 \\ -3 & -1 & 2 \\ -2 & 1 & 2 \end{pmatrix} }_{\mathsf A} = \begin{pmatrix} 2 & 1 & -1 \\ {\color{red} 0} & \frac{1}{2} & \frac{1}{2} \\ {\color{red} 0} & 2 & 1 \end{pmatrix}
\underbrace{ \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & {\color{blue} -4} & 1 \end{pmatrix} }_{=: \, \mathsf M_2}~ \underbrace{ \begin{pmatrix} 1 & 0 & 0 \\ {\color{red}\frac{3}{2}} & 1 & 0 \\ {\color{red}{1}} & 0 & 1 \end{pmatrix} }_{=: \, \mathsf M_1}~ \underbrace{ \begin{pmatrix} 2 & 1 & -1 \\ -3 & -1 & 2 \\ -2 & 1 & 2 \end{pmatrix} }_{\mathsf A} = \underbrace{ \begin{pmatrix} 2 & 1 & -1 \\ {\color{red} 0} & \frac{1}{2} & \frac{1}{2} \\ {\color{red} 0} & {\color{blue} 0} & -1 \end{pmatrix} }_{=: \, \mathsf U}
Gaussian transformations \mathsf M_1, \mathsf M_2 are simple to invert and multiply. In particular \mathsf A = \mathsf M_1^{-1} \mathsf M_2^{-1} \mathsf U = \underbrace{ \begin{pmatrix} 1 & 0 & 0 \\ {\color{red}-\frac{3}{2}} & 1 & 0 \\ {\color{red}{-1}} & 0 & 1 \end{pmatrix} }_{=: \, \mathsf M_1^{-1}}~ \underbrace{ \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & {\color{blue} 4} & 1 \end{pmatrix} }_{=: \, \mathsf M_2^{-1}}~ \mathsf U = \underbrace{ \begin{pmatrix} 1 & 0 & 0 \\ {\color{red}-\frac{3}{2}} & 1 & 0 \\ {\color{red}{-1}} & {\color{blue} 4} & 1 \end{pmatrix} }_{\mathsf M_1^{-1} \mathsf M_2^{-1} =: \, \mathsf L}~ \mathsf U
# A is assumed to be an invertible matrix of size n x n
n = size(A, 1)
L = [r == c ? 1.0 : 0.0 for r in 1:n, c in 1:n]  # start from the identity
U = float(copy(A))                               # work on a floating-point copy of A
for c in 1:n-1, r in c+1:n
    U[c, c] == 0 && error("Pivotal entry is zero!")
    L[r, c] = U[r, c] / U[c, c]
    U[r, c:end] -= U[c, c:end] * L[r, c]
end
# L is unit lower triangular, U is upper triangular, and L * U ≈ A
Computational cost: \frac{2}{3} n^3 + \mathcal O(n^2) floating point operations (flops);
In comparison, the computational cost of forward/backward substitution scales as \mathcal O(n^2);
The LU decomposition can be reused for different right-hand sides;
If \mathsf A is a banded matrix, so are \mathsf L and \mathsf U.
\mathsf L and \mathsf U are not necessarily sparse when \mathsf A is sparse.
When \mathsf A is Hermitian and positive definite, it is more efficient to calculate the Cholesky factorization:
\mathsf A = \mathsf C \mathsf C^*
with \mathsf C a lower triangular matrix with positive diagonal entries.
Exercise:
Write an algorithm for calculating \mathsf C by identification of \mathsf A and \mathsf C \mathsf C^*.
What is the computational complexity of your implementation?
Bonus points if your implementation is efficient on banded matrices.
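For reference, one possible (unoptimized) sketch of such an algorithm, assuming \mathsf A is real symmetric positive definite, so that \mathsf C \mathsf C^* = \mathsf C \mathsf C^T:
using LinearAlgebra
function cholesky_factor(A)
    n = size(A, 1)
    C = zeros(n, n)
    for j in 1:n
        # Diagonal entry: a_jj = Σ_k c_jk^2 determines c_jj
        C[j, j] = sqrt(A[j, j] - dot(C[j, 1:j-1], C[j, 1:j-1]))
        for i in j+1:n
            # Entries below the diagonal in column j, by identification of a_ij
            C[i, j] = (A[i, j] - dot(C[i, 1:j-1], C[j, 1:j-1])) / C[j, j]
        end
    end
    return C
end
The nested loops and inner dot products amount to roughly n^3/3 floating point operations, about half the cost of the LU decomposition.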
Motivation: General-purpose direct methods are
exact up to roundoff errors
but computationally expensive: \mathcal O(n^3) flops for full matrices…
Additionally, they require the storage of \mathsf L and \mathsf U.
In contrast, iterative methods
are usually approximate (but there are exceptions);
usually have a cost per iteration scaling as \mathcal O(n^2) at most;
\rightarrow often computationally more economical
can be stopped at any point when the residual \mathsf A \mathbf{\boldsymbol{x}} - \mathbf{\boldsymbol{b}} is sufficiently small;
In this section, we focus on basic iterative methods based on a splitting
\mathsf A = \mathsf M - \mathsf N
Richardson’s method
\begin{align*} \mathbf{\boldsymbol{x}}^{(k + 1)} = \mathbf{\boldsymbol{x}}^{(k)} + \omega (\mathbf{\boldsymbol{b}} - \mathsf A \mathbf{\boldsymbol{x}}^{(k)}) \end{align*}
Proposition
Assume \omega \neq 0. If the sequence (\mathbf{\boldsymbol{x}}^{(k)}) generated by the iteration above converges, then its limit is the solution of \begin{align*} \mathsf A \mathbf{\boldsymbol{x}} = \mathbf{\boldsymbol{b}}. \end{align*}
Questions:
Does this method converge and how quickly?
What is a good stopping criterion?
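Before answering these questions, here is a minimal sketch of the iteration in Julia (our own helper; the names richardson, tol and maxiter are not from the course), with an illustrative stopping criterion based on the relative residual:
using LinearAlgebra
function richardson(A, b, ω; x₀ = zero(b), tol = 1e-10, maxiter = 10_000)
    x = copy(x₀)
    for k in 1:maxiter
        r = b - A * x                           # residual b - A x^{(k)}
        norm(r) ≤ tol * norm(b) && return x, k  # stop when the relative residual is small
        x += ω * r                              # x^{(k+1)} = x^{(k)} + ω (b - A x^{(k)})
    end
    return x, maxiter
end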
Definition
The spectral radius of \mathsf A \in \mathbf C^{n \times n} is given by \rho(\mathsf A) := \max_{\lambda \in \mathop{\mathrm{spectrum}}A} {\lvert {\lambda} \rvert}
Remarks:
\rho(A) is not a norm;
\rho(A) \leqslant{\lVert {\mathsf A} \rVert} for any induced matrix norm.
Theorem: Gelfand’s formula
For any \mathsf A \in \mathbf C^{n \times n} and any matrix norm {\lVert {\cdot} \rVert}, \lim_{k \to \infty} {\lVert {\mathsf A^k} \rVert}^{1/k} = \rho(\mathsf A).
In particular {\lVert {\mathsf A^k} \rVert} \to 0 as k \to \infty iff \rho(\mathsf A) < 1.
The smaller \rho(\mathsf A), the faster the convergence.
\begin{align*} \mathbf{\boldsymbol{x}}^{(k + 1)} = \mathbf{\boldsymbol{x}}^{(k)} + \omega (\mathbf{\boldsymbol{b}} - \mathsf A \mathbf{\boldsymbol{x}}^{(k)}) \end{align*}
Proposition
The Richardson iteration converges to the exact solution \mathbf{\boldsymbol{x}}_* for every choice of \mathbf{\boldsymbol{x}}^{(0)} if and only if \begin{align*} \rho(\mathsf I - \omega \mathsf A) = \max_{\lambda \in \mathop{\mathrm{spectrum}}\mathsf A} {\lvert {1 - \omega \lambda} \rvert} < 1. \end{align*}
Proof
We notice that \mathbf{\boldsymbol{x}}^{(k+1)} - \mathbf{\boldsymbol{x}}_* = \mathbf{\boldsymbol{x}}^{(k)} - \mathbf{\boldsymbol{x}}_* + \omega (\mathsf A \mathbf{\boldsymbol{x}}_* - \mathsf A \mathbf{\boldsymbol{x}}^{(k)}) = (\mathsf I - \omega \mathsf A) \left( \mathbf{\boldsymbol{x}}^{(k)} - \mathbf{\boldsymbol{x}}_* \right). Therefore \mathbf{\boldsymbol{x}}^{(k)} - \mathbf{\boldsymbol{x}}_* = (\mathsf I - \omega \mathsf A)^k \left( \mathbf{\boldsymbol{x}}^{(0)} - \mathbf{\boldsymbol{x}}_* \right) The “only if” part of the statement is clear, and the “if” part follows from the equivalence {\lVert {\mathsf B^k} \rVert} \xrightarrow[k \to \infty]{} 0 \quad \Leftrightarrow \quad \rho(B) < 1.
Corollary: It is necessary for convergence that all the eigenvalues have nonzero real parts of the same sign.
By Gelfand’s formula, for all \varepsilon > 0 there is K \in \mathbf N such that \forall k \geqslant K, \qquad {\lVert {\mathbf{\boldsymbol{x}}^{(k)} - \mathbf{\boldsymbol{x}}_*} \rVert} \leqslant\bigl( \rho(\mathsf I - \omega \mathsf A) + \varepsilon \bigr)^k {\lVert {\mathbf{\boldsymbol{x}}^{(0)} - \mathbf{\boldsymbol{x}}_*} \rVert}.
We consider the particular case where \mathsf A \in \mathbf R^{n \times n} is symmetric positive definite.
Remark: link to optimization
When \mathsf A is symmetric positive definite, the solution to \mathsf A \mathbf{\boldsymbol{x}} = \mathbf{\boldsymbol{b}} is the unique minimizer of f(\mathbf{\boldsymbol{x}}) = \frac{1}{2} \mathbf{\boldsymbol{x}}^T\mathsf A \mathbf{\boldsymbol{x}} - \mathbf{\boldsymbol{b}}^T\mathbf{\boldsymbol{x}}. In this case, Richardson’s iteration may be rewritten as \mathbf{\boldsymbol{x}}^{(k+1)} = \mathbf{\boldsymbol{x}}^{(k)} - \omega \nabla f(\mathbf{\boldsymbol{x}}^{(k)}).
Exercise: find \omega that minimizes \rho(\mathsf I - \omega \mathsf A)
Simple calculations lead to \omega_* = \frac{2}{\lambda_{\min}(\mathsf A) + \lambda_{\max}(\mathsf A)}, \qquad \rho(\mathsf I - \omega_* \mathsf A) = \frac{\kappa_2(\mathsf A) - 1}{\kappa_2(\mathsf A) + 1}
Consider the linear system \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 7 \\ 8 \end{pmatrix}
The exact solution is (2, 3)^T;
The eigenvalues of \mathsf A are 1 and 3;
The condition number \kappa_2(\mathsf A) is 3;
The optimal \omega is \frac{1}{2}.
Contour plot of f(\mathbf{\boldsymbol{x}}) = \frac{1}{2} \mathbf{\boldsymbol{x}}^T\mathsf A \mathbf{\boldsymbol{x}} - \mathbf{\boldsymbol{b}}^T\mathbf{\boldsymbol{x}}.
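These numbers can be verified with the richardson helper sketched earlier (our own code, not part of the course material):
A = [2.0 1.0; 1.0 2.0]
b = [7.0, 8.0]
x, niter = richardson(A, b, 1/2)   # optimal step ω = 1/2
x                                  # ≈ [2.0, 3.0], the exact solution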
Consider the splitting \mathsf A = {\color{green}\mathsf M} - {\color{blue} \mathsf N} where \mathsf M is invertible and easy to invert. Then \mathsf A \mathbf{\boldsymbol{x}}_* = \mathbf{\boldsymbol{b}} \quad \Leftrightarrow \quad {\color{green}\mathsf M} \mathbf{\boldsymbol{x}}_* = {\color{blue} \mathsf N} \mathbf{\boldsymbol{x}}_* + \mathbf{\boldsymbol{b}} which suggests the iterative method {\color{green}\mathsf M} \mathbf{\boldsymbol{x}}^{(k+1)} = {\color{blue} \mathsf N} \mathbf{\boldsymbol{x}}^{(k)} + \mathbf{\boldsymbol{b}}
Standard basic iterative methods (here \mathsf A = \mathsf L + \mathsf D + \mathsf U, where \mathsf D is the diagonal part of \mathsf A and \mathsf L, \mathsf U are its strictly lower and strictly upper triangular parts):
Richardson’s iteration: \mathsf A = \underbrace{{\color{green}\frac{1}{\omega} \mathsf I }}_{\mathsf M} - \underbrace{\left({\color{blue} \frac{1}{\omega} \mathsf I - \mathsf A}\right)}_{\mathsf N}.
Jacobi’s iteration: \mathsf A = {\color{green}\mathsf D} - ({\color{blue}-\mathsf L - \mathsf U}).
Gauss-Seidel iteration: \mathsf A = ({\color{green}\mathsf D + \mathsf L}) - ({\color{blue}- \mathsf U}).
Relaxation method: \mathsf A = \left({\color{green}\frac{\mathsf D}{\omega} + \mathsf L }\right) - \left({\color{blue}\frac{1 - \omega}{\omega} \mathsf D - \mathsf U} \right). ~(Coincides with Gauss-Seidel when \omega = 1)
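As an illustration of such a splitting iteration, here is a minimal sketch of Jacobi’s method (\mathsf M = \mathsf D), with function and variable names of our own choosing:
using LinearAlgebra
function jacobi(A, b; x₀ = zero(b), tol = 1e-10, maxiter = 10_000)
    M = Diagonal(A)                # M = D, trivial to invert
    x = copy(x₀)
    for k in 1:maxiter
        x = M \ (b - (A - M) * x)  # M x^{(k+1)} = N x^{(k)} + b, with N = M - A
        norm(A * x - b) ≤ tol * norm(b) && return x, k
    end
    return x, maxiter
end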
Proposition: convergence of the splitting method
The iteration {\color{green}\mathsf M} \mathbf{\boldsymbol{x}}^{(k+1)} = {\color{blue} \mathsf N} \mathbf{\boldsymbol{x}}^{(k)} + \mathbf{\boldsymbol{b}} converges for any initial \mathbf{\boldsymbol{x}}^{(0)} if and only if \rho({\color{green}\mathsf M}^{-1} {\color{blue} \mathsf N}) < 1.
In addition, for any \varepsilon > 0 there exists K > 0 such that \forall k \geqslant K, \qquad {\lVert {\mathbf{\boldsymbol{x}}^{(k)} - \mathbf{\boldsymbol{x}}_*} \rVert} \leqslant\bigl(\rho({\color{green}\mathsf M}^{-1} {\color{blue} \mathsf N}) + \varepsilon\bigr)^k {\lVert {\mathbf{\boldsymbol{x}}^{(0)} - \mathbf{\boldsymbol{x}}_*} \rVert}.
Proof
Subtracting the equations \begin{aligned} {\color{green}\mathsf M} \mathbf{\boldsymbol{x}}^{(k+1)} &= {\color{blue} \mathsf N} \mathbf{\boldsymbol{x}}^{(k)} + \mathbf{\boldsymbol{b}} \\ {\color{green}\mathsf M} \mathbf{\boldsymbol{x}}_* &= {\color{blue} \mathsf N} \mathbf{\boldsymbol{x}}_* + \mathbf{\boldsymbol{b}} \end{aligned} and rearranging gives \mathbf{\boldsymbol{x}}^{(k+1)} - \mathbf{\boldsymbol{x}}_* = {\color{green}\mathsf M}^{-1} {\color{blue} \mathsf N} (\mathbf{\boldsymbol{x}}^{(k)} - \mathbf{\boldsymbol{x}}_*) = \dots = ({\color{green}\mathsf M}^{-1} {\color{blue} \mathsf N})^{k+1} (\mathbf{\boldsymbol{x}}^{(0)} - \mathbf{\boldsymbol{x}}_*). The “only if” part of the first item is simple. The other claims follow from Gelfand’s formula: \forall \mathsf B \in \mathbf C^{n \times n}, \qquad \lim_{k \to \infty} {\lVert {\mathsf B^k} \rVert}^{1/k} = \rho(\mathsf B).
Settings where \rho(\mathsf M^{-1} \mathsf N) < 1 are identified on a case-by-case basis:
Richardson’s iteration, sufficient condition: \mathsf A symmetric positive definite and 0 < \omega < 2/\lambda_{\max}(\mathsf A);
Jacobi’s iteration, sufficient condition: \mathsf A strictly row or column diagonally dominant:
\lvert a_{ii} \rvert > \sum_{j \neq i} \lvert a_{ij} \rvert \quad \forall i \qquad \text{ or } \qquad \lvert a_{jj} \rvert > \sum_{i \neq j} \lvert a_{ij} \rvert \quad \forall j.
Gauss-Seidel iteration, sufficient condition: \mathsf A strictly row or column diagonally dominant;
Relaxation method, sufficient condition: \mathsf A symmetric positive definite and \omega \in (0, 2);
Relaxation method, necessary condition: \omega \in (0, 2);
…
Suppose again that \mathsf A is symmetric positive definite and recall Richardson’s iteration:
\begin{align*} \mathbf{\boldsymbol{x}}^{(k + 1)} = \mathbf{\boldsymbol{x}}^{(k)} - \omega \nabla f(\mathbf{\boldsymbol{x}}^{(k)}), \qquad f(\mathbf{\boldsymbol{x}}) := \frac{1}{2} \mathbf{\boldsymbol{x}}^T\mathsf A \mathbf{\boldsymbol{x}} - \mathbf{\boldsymbol{b}}^T\mathbf{\boldsymbol{x}}. \end{align*}
Proposition
For any \mathbf{\boldsymbol{d}} \in \mathbf R^n, it holds that \mathop{\mathrm{arg\,min}}_{\omega \in \mathbf R} f \left(\mathbf{\boldsymbol{x}} - \omega \mathbf{\boldsymbol{d}} \right) = \frac{\mathbf{\boldsymbol{d}}^T\mathbf{\boldsymbol{r}}}{\mathbf{\boldsymbol{d}}^T\mathsf A \mathbf{\boldsymbol{d}}}, \qquad \mathbf{\boldsymbol{r}} = \mathsf A \mathbf{\boldsymbol{x}} - \mathbf{\boldsymbol{b}}.
We can improve upon Richardson’s iteration by letting \mathbf{\boldsymbol{x}}^{(k + 1)} = \mathbf{\boldsymbol{x}}^{(k)} - \omega_{\color{red}k} \nabla f(\mathbf{\boldsymbol{x}}^{(k)}), \qquad \omega_{\color{red}k} := \frac{\mathbf{\boldsymbol{d}}_k^T\mathbf{\boldsymbol{d}}_k}{\mathbf{\boldsymbol{d}}_k^T\mathsf A \mathbf{\boldsymbol{d}}_k}, \qquad \mathbf{\boldsymbol{d}}_k := \mathsf A \mathbf{\boldsymbol{x}}^{(k)} - \mathbf{\boldsymbol{b}}.
This is the steepest descent method with optimal step.
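A minimal sketch of this method, under the assumption that \mathsf A is symmetric positive definite (function and variable names are our own):
using LinearAlgebra
function steepest_descent(A, b; x₀ = zero(b), tol = 1e-10, maxiter = 10_000)
    x = copy(x₀)
    for k in 1:maxiter
        d = A * x - b                        # d_k = ∇f(x^{(k)})
        norm(d) ≤ tol * norm(b) && return x, k
        ω = dot(d, d) / dot(d, A * d)        # optimal step ω_k along -d_k
        x -= ω * d
    end
    return x, maxiter
end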
First note that f(\mathbf{\boldsymbol{x}}) = \frac{1}{2} \mathbf{\boldsymbol{x}}^T\mathsf A \mathbf{\boldsymbol{x}} - \mathbf{\boldsymbol{b}}^T\mathbf{\boldsymbol{x}} = \frac{1}{2} {\lVert {\mathbf{\boldsymbol{x}} - \mathbf{\boldsymbol{x}}_*} \rVert}_{\mathsf A}^2 + \text{constant}, \qquad {\lVert {\mathbf{\boldsymbol{y}}} \rVert}_{\mathsf A} := \sqrt{{\langle {\mathbf{\boldsymbol{y}}, \mathbf{\boldsymbol{y}}} \rangle}_{\mathsf A}}, \qquad {\langle {\mathbf{\boldsymbol{x}}, \mathbf{\boldsymbol{y}}} \rangle}_{\mathsf A} := \mathbf{\boldsymbol{x}}^T\mathsf A \mathbf{\boldsymbol{y}}. Given n conjugate directions \{\mathbf{\boldsymbol{d}}_0, \dotsc, \mathbf{\boldsymbol{d}}_{n-1}\} such that {\langle {\mathbf{\boldsymbol{d}}_i, \mathbf{\boldsymbol{d}}_j} \rangle}_{\mathsf A} = \delta_{ij}, we have \mathbf{\boldsymbol{x}}_* - \mathbf{\boldsymbol{x}}^{(0)} = \sum_{i=0}^{n-1} \mathbf{\boldsymbol{d}}_i \mathbf{\boldsymbol{d}}_i^T\mathsf A (\mathbf{\boldsymbol{x}}_* - \mathbf{\boldsymbol{x}}^{(0)}) = \sum_{i=0}^{n-1} \mathbf{\boldsymbol{d}}_i \mathbf{\boldsymbol{d}}_i^T({\color{green}\mathbf{\boldsymbol{b}}} - \mathsf A \mathbf{\boldsymbol{x}}^{(0)})
\color{green}\rightarrow the \mathsf A-projections of \mathbf{\boldsymbol{x}}_* - \mathbf{\boldsymbol{x}}^{(0)} can be calculated even though \mathbf{\boldsymbol{x}}_* is unknown!
The conjugate directions method therefore defines \mathbf{\boldsymbol{x}}^{(k)} := \mathbf{\boldsymbol{x}}^{(0)} + \sum_{i=0}^{k-1} \mathbf{\boldsymbol{d}}_i \mathbf{\boldsymbol{d}}_i^T({\color{green}\mathbf{\boldsymbol{b}}} - \mathsf A \mathbf{\boldsymbol{x}}^{(0)}), so that \mathbf{\boldsymbol{x}}^{(n)} = \mathbf{\boldsymbol{x}}_*.
Minimization property: by construction,
\mathbf{\boldsymbol{x}}^{(k)} - \mathbf{\boldsymbol{x}}^{(0)} is the \mathsf A-orthogonal projection of \mathbf{\boldsymbol{x}}_* - \mathbf{\boldsymbol{x}}^{(0)} onto \mathop{\mathrm{Span}}\{\mathbf{\boldsymbol{d}}_0, \dotsc, \mathbf{\boldsymbol{d}}_{k-1}\}.
Therefore, \mathbf{\boldsymbol{x}}^{(k)} minimizes f over the affine subspace \mathbf{\boldsymbol{x}}^{(0)} + \mathop{\mathrm{Span}}\{\mathbf{\boldsymbol{d}}_0, \dotsc, \mathbf{\boldsymbol{d}}_{k-1}\}.
Consider again the linear system \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 7 \\ 8 \end{pmatrix}
Conjugate directions (not normalized) calculated by Gram-Schmidt with respect to the inner product {\langle {\cdot, \cdot} \rangle}_{\mathsf A}: \mathbf{\boldsymbol{d}}_0 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \qquad \mathbf{\boldsymbol{d}}_1 = \begin{pmatrix} -1 \\ 2 \end{pmatrix}.
Question: how can we generate conjugate directions?
{\color{green} \rightarrow} The conjugate gradient method makes it possible to generate conjugate directions on the fly.
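For reference, a minimal sketch of the textbook conjugate gradient iteration, assuming \mathsf A is symmetric positive definite (this is the standard formulation, not necessarily the exact form derived in the course):
using LinearAlgebra
function conjugate_gradient(A, b; x₀ = zero(b), tol = 1e-10, maxiter = length(b))
    x = copy(x₀)
    r = b - A * x                          # residual, equal to -∇f(x)
    d = copy(r)                            # first search direction
    for k in 1:maxiter
        norm(r) ≤ tol * norm(b) && return x, k - 1
        Ad = A * d
        ω = dot(r, r) / dot(d, Ad)         # optimal step along d
        x += ω * d
        r_new = r - ω * Ad                 # updated residual
        d = r_new + (dot(r_new, r_new) / dot(r, r)) * d   # next direction, A-conjugate to the previous ones
        r = r_new
    end
    return x, maxiter
end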
Theorem
Suppose that \mathsf A is symmetric positive definite. Then the iterates of the conjugate gradient method satisfy \forall k \geqslant 0, \qquad {\lVert {\mathbf{\boldsymbol{x}}^{(k)} - \mathbf{\boldsymbol{x}}_*} \rVert}_{\mathsf A} \leqslant 2 \left( \frac{{\color{red}\sqrt{\kappa_2(\mathsf A)}} - 1}{{\color{red}\sqrt{\kappa_2(\mathsf A)}} + 1} \right)^{k} {\lVert {\mathbf{\boldsymbol{x}}^{(0)} - \mathbf{\boldsymbol{x}}_*} \rVert}_{\mathsf A}.
Many iterative methods used at present are based on Krylov subspaces: \mathcal K_k(\mathsf A, \mathbf{\boldsymbol{b}}) = \mathop{\mathrm{Span}}\Bigl\{ \mathbf{\boldsymbol{b}}, \mathsf A \mathbf{\boldsymbol{b}}, \mathsf A^2 \mathbf{\boldsymbol{b}}, \dotsc, \mathsf A^{k-1} \mathbf{\boldsymbol{b}} \Bigr\}.
Full orthogonalization method (FOM): find \mathbf{\boldsymbol{x}}^{(k)} in \mathcal K_k such that the following Galerkin condition is satisfied:
{\langle {\mathsf A \mathbf{\boldsymbol{x}}^{(k)} - \mathbf{\boldsymbol{b}}, \mathcal K_k} \rangle} = 0.
The iterates of this method coincide with those of the conjugate gradient method when \mathsf A is symmetric positive definite.