Conjugate Gradients

Conjugate Gradients

Conjugate gradients (CG) is one of the most widely used algorithms in optimization. What’s taught in schools about CG is a pseudo-code of 5 steps; in reality, it is much more than that. Since the literature on this topic is quite large, it’s hard to find really good references. Frustrated by that for a while I had to search quite extensively for books/articles that would help me better understand CG. My search was rewarded when I came across this article by Schewchuk. This is by far the best I could read about CG. If you’ve never read this, you should do it now!

Here, I have no interest in duplicating Shewchuk’s efforts; instead, I’m aiming  to share my own intuitive understanding. I explain, in words, why CG is so powerful and how it does that.

Whether you need to minimize a function or solve a system of equations, you always need an algorithm (solver) to help you achieve that. For convenience, I define a quadratic function f(x)=\dfrac{1}{2} x^T A x+bx+c, and I denote the system of equations Ax=b. It’s easy to see the equivalence between the two: finding x that minimizes f(x) is similar to finding x that is solution to  \dfrac{\partial{f}}{\partial{x}}=Ax-b=0 which is similar to solving Ax=b. There is a plethora of solvers to choose from, which is both good and bad. Good because however you state the problem, there is most likely already a custom-tailored solver out there waiting to be tried. Bad because what works in theory does not always work in practice; for instance, the mathematically rigorous guarantees on convergence often do not prove useful in practice.

I like to think of solver as cooking recipe. You need some ingredients to start with, the cook pot has limited capacity, and you have to wait for some time before the food is ready. Each solver is unique in the same way that each recipe is. Examples of ingredients you need for solvers include: initial guess, starting direction, tolerance, maximum number of iterations, gradient, Hessian, and the list can go long … Your working environment (cook pot) has limited memory (capacity) to store all these ingredients. A good solver is one which should not ask for too much memory. Then you need to set a temperature which controls cooking time. This is the same as the normal rate of convergence of your solver. You can always increase the temperature and get your food cooked more quickly, in the same way that you can precondition your solver and help it converge faster.

I define the best solvers as analogs to the best chefs. You want a solver that is cheap, efficient, and that works well in practice. By cheap, I mean you don’t want to spend too long preparing to gather all the ingredients, and you certainly don’t want to exhaust memory. If you have too many ingredients to start with, this should be justified by the complexity of your problem. The memory limitation is more obvious. Despite Moore’s law, we soon realize that we can be much less optimistic on memory. One important aspect of many real-world problems is non-uniqueness. This means you can have many possible solutions to your problem. To address that, we often regularize the problem by adding some desired attributes about the solution. In some cases, the non-uniqueness can be so bad that checking the correctness of the solution becomes nontrivial. In these cases, having quantitative measures about the correctness of the solution is challenging despite the best efforts; at the end, we seek maximum objectivity with minimum subjectivity.

In the same way that different recipes belong to different cuisines, different solvers belong to different families. Each family is characterized by its own demands in terms of ingredients, memory and compute time. American cuisine includes fast food and invokes big portions. This helps get you something fast. It’s “good-enough” but it’s certainly not the optimal tasting experience. French cuisine is much more refined. It’s characterized by small portions, reasonable cooking time, and often tasty results. Solvers can be classified into two broad families: direct and iterative. Direct methods are useful when the solution can be expressed analytically (in closed form). This often calls for an inverse of a large matrix, which, besides being prohibitively expensive, can also be dangerous (in terms of noise amplification). In many cases, you can’t express the solution in one single expression. This is where you need to take a series of “careful” steps (iterations) to reach your objective.

The simplest iterative procedure is to start at some x_0, an initial guess of the solution (which can be zero if no guess is available), and then take a series of steps that progressively minimize the objective function f(x) in certain directions. A good direction to search along is one we know should point towards a new x that minimizes the objective function. Back in the era where there was no GPS, people had to rely on other people to find their way. I think of good search directions as people who point you, each time you ask, towards where you get closer to the sought location. Mathematically, we are taught early in school that the gradient of a function helps achieve that. More specifically, the negative of the gradient specifies the direction along which the function decreases fastest. Going back to our objective function f(x), it is clear that a good search direction is along the negative gradient -\dfrac{\partial{f}}{\partial{x}}=b-Ax, which is also known as the residual.

In optimization and inverse problems, the residual plays an important role. It can be interpreted several ways. First, it is the error transformed by A. Accessing the error isn’t possible in real world problems since we don’t know the true answer. Hence, the residual represents an alternative which can sometimes be informative. Matrix A multiplied by any vector rotates and stretches/squeezes this vector. This means that, while residuals are useful, they are not equivalent to the error. Thus, having the residual go to zero is not always a guarantee of convergence. Another argument is non-uniqueness: several solutions can reduce the residual equally well, but, most likely, we’re only interested in one of them. This is why regularization is needed: we add a penalty term (summarizing undesired features in the solution) to the residual in such a way that both should decrease with iteration number.

The second (not so intuitive) interpretation of the residual is that it is also the direction of steepest descent (since it equals the gradient per the equation above). Shewchuck’s advice is straightforward:

“whenever you read “residual”, think “direction of steepest descent.” “.

I find this advice precious when reading the sequence of steps involved in a solver. If x is a vector, the residual might as well be a vector, often not of the same size. Coefficients of the residual that are different from zero indicate where more work is needed. Not only that: since the residual is also the direction of steepest descent, the magnitude of each of its components is related to the rate of decrease in the objective function.

The method that takes a series of steps in the direction opposite to the gradient is called steepest descent. Our first step in the direction of steepest descent is: x_1=x_0+\alpha r_0. Once the search direction is chosen, what remains is how big a step we should take in that direction. We need the step length \alpha that minimizes f along the line x_0+\alpha r_0 (a procedure known as line search). Setting the derivative of f with respect to \alpha at x_1 equal to zero entails r_0 and \deriv{f(x_1)} are orthogonal (i.e. inner product equal to zero). It also expresses \alpha in terms of the residual. The steepest descent algorithm summarizes below:

    \begin{align*} r_k &= b - Ax_k\\ \alpha_k &= \dfrac{r_k^T r_k}{r_k^TAr_k}\\ x_{k+1}&=x_k+\alpha_kr_k \end{align*}

An illustration of steepest descent at work is presented here (figure 8 from Schewchuk’s article):

Convergence path of steepest descent.

The fact that each new direction is orthogonal to the previous one explains the zigzaggy nature of the path. What’s bad is that some new search directions end up being parallel to (i.e. same as) previous ones, meaning the solver ends up doing a lot of repetitive work. This is the main reason why steepest descent is so slow.

In a way, we want our solver to do a better job than this. We want it to be smarter and reach the minimum without much overwork. We want each new search direction not only to be orthogonal to the previous search direction, but to also be orthogonal to all previous search directions. This way, each time a direction is set, we know it will be just right to get us closer to the minimum.

Seeking complete orthogonality is worthy of a little pause. Implicit to complete orthogonality is the notion that new information is being generated, that some king of progress is taking place. When a space has an orthogonal basis, it becomes easy to explore, because each element of this space can be expressed in terms of the orthogonal basis. I see a strong analogy between complete orthogonality and strong argumentation. The best debaters are usually ones that convince the audience by analyzing a topic from various (orthogonal) angles. Orthogonality does not mean contradiction, which happens when two vectors point to opposite search directions. Orthogonality means progress towards something new.

Mathematically, it can be shown that having such an orthogonal set of directions can only be possible if we already know the solution. Hence, we somehow need a milder requirement. A more feasible requirement is to choose the search directions to be Aorthogonal, which is another name for Aconjugate (or simply conjugate) directions. We remind the reader that two vectors d_i and d_j are A-orthogonal if d_i^TAd_j=0. Somehow, this is just a different flavor of orthogonality. To further grasp what A-orthogonality means, I find these lines from Schewchuk’s article very illustrative:

“Figure 22(a) shows what A-orthogonal vectors look like. Imagine if this article were printed on bubble gum, and you grabbed figure 22(a) by the ends and stretched it until the ellipses appreared circular. The vectors would then appear orthogonal, as in Figure 22(b)”.

What A-orthogonality means.

All we need now is a set of A-orthogonal search directions. Fortunately, there is a well known mathematical tool called Gram-Schmidt conjugation that allows to build such set from linearly independent vectors. However, in practice, an important aspect (in fact shortcoming) of such a process is that it works by keeping all (conjugate) search directions in memory! For real-world problems, this can hurt memory very bad. The method based on this process is called conjugate directions.

We’ve so far seen:

  1. Steepest descent uses residuals as search directions. This works but it can be slow. No bad impact on memory.
  2. Conjugate directions uses A-orthongonal (conjugate) search directions. It converges fast. But it hurts memory.

Can we hope for a solver that combines the best of both? The answer is strikingly simple: in the method of conjugate directions, apply the Gram-Schmidt procedure to residuals in order to get the search directions.

You can prove mathematically that residuals have a very attractive property which is very relevant to the conjugate directions methods. It says that if you have a set of k conjugate directions (obtained after k iterations), the residual at step k+1 is orthogonal to all these (previous) directions. This has a significant impact on the Gram-Schmidt procedure. Because of this property, the main operation, which is a summation involving all past search directions, is greatly simplified. In concrete terms, it means we no longer have to store in memory all previous search directions!

In my opinion, this is the essence of CG. It is the simple, and yet not easy to find, idea that you can just apply Gram Schmidt to residuals and there you go!

It is a perfect example of doing more with less: more efficiency, which is retained from conjugate directions, with less memory, which is enabled by the use of residuals. I always like to define technology as our ability to do more with less. In this sense, CG is in my view as a good example of technology.

The method of CG

Now that we have paved the way, we can review the elegant set of instructions in the method of Conjugate Gradients:

    \begin{align*} p_0 &= r_0=b - Ax_0\\ \alpha_k &= \dfrac{r_k^T r_k}{p_k^TAp_k}\\ x_{k+1}&=x_k+\alpha_kp_k\\ r_{k+1}&=r_k-\alpha_kAp_k\\ \beta_{k+1} &= \dfrac{r_{k+1}^T r_{k+1}}{r_{k}^T r_{k}}\\ p_{k+1}&=r_{k+1}+\beta_{k+1}p_k \end{align*}

The key equation in this sequence is the search update  p_{k+1}&=r_{k+1}+\beta_{k+1}p_k. For \beta=0, the search direction will be equal to the residual, which is what steepest descent does. In conjugate gradients, we basically correct the search direction from residual r_{k+1} to a new direction p_{k+1} which helps reach the minimum faster. The search directions obtained are A-orthogonal (conjugate). Thus, CG is just one type of conjugate directions methods.

Some say “Conjugate Gradients” is not an accurate name because gradients are actually not conjugate by nature. The Gram-Schmidt procedure conjugates the gradients, and so they become conjugated. Hence, “Conjugated Gradients” should be a better naming. I would argue the naming can still be made more precise. I find “Conjugated Residuals” describes CG best. The reason being that residuals, as directions of steepest decent, point to the opposite way of gradients, which are direction of steepest ascent.

When exploring the convergence path of both steepest descent and CG, a natural question that arises is can we converge even faster? A family of solvers known as Newton methods can achieve that. For well behaved functions, these methods can hit the minimum in one single iteration. But there are two important considerations:

1- They need some computation of the Hessian (i.e. second derivatives), which is not always easy to obtain

2- They also need more memory to store the Hessian

There is more prep work, technically known as overhead, associated with Newton methods which motivates the use of CG, since CG does not need anything else other than the residual at each iteration.

There is a lot more to say about powers of CG. But I think this is sufficiently long so I will probably tackle more on this topic in later posts.


+ There are no comments

Add yours