Introduction

The inverse of a matrix $A$ is another matrix $A^{-1}$ that has this property:

$$AA^{-1} = I \qquad A^{-1}A = I$$

where $I$ is the identity matrix. This is a nice property for a matrix to have, because then we can work with it in equations just like we might with ordinary numbers. For instance, to solve a linear system of equations $Ax = b$ we can just multiply both sides by the inverse of $A$ to get $x = A^{-1}b$, a unique solution vector $x$. Again, this is just like what we would do to solve a real-number equation like $ax = b$.

Now, a matrix has an inverse whenever it is square and its rows are linearly independent. But not every system of equations we might care about will give us a matrix that satisfies these properties. The coefficient matrix A would fail to be invertible if the system did not have the same number of equations as unknowns (A is not square), or if the system had dependent equations (A has dependent rows).

Generalized inverses are meant to solve this problem. They solve equations like $Ax = b$ in the “best way possible” when $A^{-1}$ fails to exist. There are many kinds of generalized inverses, each with its own “best way.” (They can be used to solve ridge regression problems, for instance.)

The most common is the Moore-Penrose inverse, or sometimes just the pseudoinverse. It solves the least-squares problem for linear systems, and therefore will give us a solution $\hat{x}$ so that $A\hat{x}$ is as close as possible, in ordinary Euclidean distance, to the vector $b$.

The notation for the Moore-Penrose inverse is $A^+$ instead of $A^{-1}$. If $A$ is invertible, then in fact $A^+ = A^{-1}$, and in that case the solution to the least-squares problem is the same as the ordinary solution ($A^+b = A^{-1}b$). So, the MP-inverse is strictly more general than the ordinary inverse: we can always use it and it will always give us the same solution as the ordinary inverse whenever the ordinary inverse exists.
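
As a quick sanity check of this claim, here is a minimal NumPy sketch. The matrix values are just an arbitrary invertible example; the point is only that `np.linalg.pinv` and `np.linalg.inv` agree when the ordinary inverse exists.

```python
import numpy as np

# An arbitrary invertible 2x2 matrix (example values only).
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

# For an invertible matrix, the Moore-Penrose inverse coincides
# with the ordinary inverse.
A_inv = np.linalg.inv(A)
A_pinv = np.linalg.pinv(A)

print(np.allclose(A_inv, A_pinv))  # True
```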

We will look at how we can construct the Moore-Penrose inverse using the SVD. This turns out to be an easy extension of constructing the ordinary matrix inverse with the SVD. We will then see how solving a least-squares problem is just as easy as solving an ordinary equation.

Example - System with an Invertible Matrix

First let’s recall how to solve a system whose coefficient matrix is invertible. In this case, we have the same number of equations as unknowns and the equations are all independent. Then $A^{-1}$ exists and we can find the unique solution for $x$ by multiplying both sides by $A^{-1}$.

For instance, say we have

$$\begin{cases} x_1 - \tfrac{1}{2}x_2 = 1 \\ -\tfrac{1}{2}x_1 + x_2 = -1 \end{cases}$$

Then

$$A = \begin{bmatrix} 1 & -1/2 \\ -1/2 & 1 \end{bmatrix}, \qquad A^{-1} = \begin{bmatrix} 4/3 & 2/3 \\ 2/3 & 4/3 \end{bmatrix}$$

and

$$x = A^{-1}b = \begin{bmatrix} 4/3 & 2/3 \\ 2/3 & 4/3 \end{bmatrix} \begin{bmatrix} 1 \\ -1 \end{bmatrix} = \begin{bmatrix} 2/3 \\ -2/3 \end{bmatrix}$$

So $x_1 = 2/3$ and $x_2 = -2/3$.
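
If you want to verify this numerically, here is a small NumPy sketch of the same calculation, using the $A$ and $b$ from above:

```python
import numpy as np

# Coefficient matrix and right-hand side from the example above.
A = np.array([[1.0, -0.5],
              [-0.5, 1.0]])
b = np.array([1.0, -1.0])

# Solve with the explicit inverse...
x_via_inverse = np.linalg.inv(A) @ b

# ...or, equivalently (and more stably), with a linear solver.
x_via_solve = np.linalg.solve(A, b)

print(x_via_inverse)  # approximately [ 0.667 -0.667], i.e. (2/3, -2/3)
print(x_via_solve)    # same result
```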

Constructing Inverses with the SVD

The singular value decomposition (SVD) gives us an intuitive way of constructing an inverse matrix. We will be able to see how the geometric transformations of $A^{-1}$ undo the transformations of $A$.

The SVD says that for any matrix $A$,

$$A = U \Sigma V^\top$$

where $U$ and $V$ are orthogonal matrices and $\Sigma$ is a diagonal matrix.

Now, if $A$ is invertible, we can use its SVD to find $A^{-1}$ like so:

$$A^{-1} = V \Sigma^{-1} U^\top$$

If we have the SVD of $A$, we can construct its inverse by swapping the orthogonal matrices $U$ and $V$ and finding the inverse of $\Sigma$. Since $\Sigma$ is diagonal, we can do this by just taking reciprocals of its diagonal entries.
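
Here is a minimal NumPy sketch of this recipe, assuming a square, invertible matrix (I reuse the matrix from the earlier example):

```python
import numpy as np

# The invertible matrix from the example above.
A = np.array([[1.0, -0.5],
              [-0.5, 1.0]])

# np.linalg.svd returns U, the singular values s, and V transpose.
U, s, Vt = np.linalg.svd(A)

# Invert Sigma by taking reciprocals of its diagonal entries,
# then swap U and V: A^{-1} = V Sigma^{-1} U^T.
A_inv = Vt.T @ np.diag(1.0 / s) @ U.T

print(np.allclose(A_inv, np.linalg.inv(A)))  # True
```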

Example

Let’s look at our earlier matrix again:

$$A = \begin{bmatrix} 1 & -1/2 \\ -1/2 & 1 \end{bmatrix}$$

It has SVD

$$A = U \Sigma V^\top = \begin{bmatrix} \sqrt{2}/2 & \sqrt{2}/2 \\ -\sqrt{2}/2 & \sqrt{2}/2 \end{bmatrix} \begin{bmatrix} 3/2 & 0 \\ 0 & 1/2 \end{bmatrix} \begin{bmatrix} \sqrt{2}/2 & -\sqrt{2}/2 \\ \sqrt{2}/2 & \sqrt{2}/2 \end{bmatrix}$$

And so,

$$A^{-1} = V \Sigma^{-1} U^\top = \begin{bmatrix} \sqrt{2}/2 & \sqrt{2}/2 \\ -\sqrt{2}/2 & \sqrt{2}/2 \end{bmatrix} \begin{bmatrix} 2/3 & 0 \\ 0 & 2 \end{bmatrix} \begin{bmatrix} \sqrt{2}/2 & -\sqrt{2}/2 \\ \sqrt{2}/2 & \sqrt{2}/2 \end{bmatrix}$$

and after multiplying everything out, we get

$$A^{-1} = \begin{bmatrix} 4/3 & 2/3 \\ 2/3 & 4/3 \end{bmatrix}$$

just like we had before.

In an earlier post, we saw how we could use the SVD to visualize a matrix as a sequence of geometric transformations. Here is the matrix $A$ followed by $A^{-1}$, acting on the unit circle:

The inverse matrix $A^{-1}$ reverses exactly the action of $A$. The matrix $A$ will map any circle to a unique ellipse, with no overlap. So, $A^{-1}$ can map ellipses back to those same circles without any ambiguity. We don’t “lose information” by applying $A$.

Constructing MP-Inverses with the SVD

We can in fact do basically the same thing for any matrix, not just the invertible ones. The SVD always exists, so for some matrix A, first write

$$A = U \Sigma V^\top$$

And then find the MP-inverse by

$$A^+ = V \Sigma^+ U^\top$$

Now, the matrix $A$ might not be invertible. If it is not square, then, to find $\Sigma^+$, we need to take the transpose of $\Sigma$ to make sure all the dimensions are conformable in the multiplication. If $A$ is singular (dependent rows), then $\Sigma$ will have 0’s on its diagonal, so we need to make sure to take reciprocals of only the non-zero entries.

Summarizing, to find the Moore-Penrose inverse of a matrix A:

  1. Find the Singular Value Decomposition: $A = U \Sigma V^\top$ (using R or Python, if you like).
  2. Find $\Sigma^+$ by transposing $\Sigma$ and taking the reciprocal of all its non-zero diagonal entries.
  3. Compute $A^+ = V \Sigma^+ U^\top$.
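
Here is a sketch of these steps in NumPy. The rectangular matrix at the bottom is just a made-up example, and the tolerance argument is an assumption standing in for “non-zero”:

```python
import numpy as np

def pinv_via_svd(A, tol=1e-12):
    """Moore-Penrose inverse built from the SVD, following the steps above."""
    U, s, Vt = np.linalg.svd(A, full_matrices=True)

    # Step 2: Sigma^+ is the transpose of Sigma, with reciprocals of the
    # non-zero singular values on its diagonal.
    sigma_plus = np.zeros((A.shape[1], A.shape[0]))
    for i, sv in enumerate(s):
        if sv > tol:
            sigma_plus[i, i] = 1.0 / sv

    # Step 3: A^+ = V Sigma^+ U^T.
    return Vt.T @ sigma_plus @ U.T

# A non-square example matrix (made up for illustration).
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

print(np.allclose(pinv_via_svd(A), np.linalg.pinv(A)))  # True
```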

Example - An Inconsistent System

Let’s find the MP-inverse of a singular matrix. Let’s take

$$A = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}$$

Because the rows of this matrix are linearly dependent, $A^{-1}$ does not exist. But we can still find the more general MP-inverse by following the procedure above.

So, first we find the SVD of A:

$$A = U \Sigma V^\top = \begin{bmatrix} \sqrt{2}/2 & -\sqrt{2}/2 \\ \sqrt{2}/2 & \sqrt{2}/2 \end{bmatrix} \begin{bmatrix} 2 & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} \sqrt{2}/2 & \sqrt{2}/2 \\ -\sqrt{2}/2 & \sqrt{2}/2 \end{bmatrix}$$

Then we apply the procedure above to find $A^+$:

$$A^+ = V \Sigma^+ U^\top = \begin{bmatrix} \sqrt{2}/2 & -\sqrt{2}/2 \\ \sqrt{2}/2 & \sqrt{2}/2 \end{bmatrix} \begin{bmatrix} 1/2 & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} \sqrt{2}/2 & \sqrt{2}/2 \\ -\sqrt{2}/2 & \sqrt{2}/2 \end{bmatrix}$$

And now we multiply everything out to get:

$$A^+ = \begin{bmatrix} 1/4 & 1/4 \\ 1/4 & 1/4 \end{bmatrix}$$

This is the Moore-Penrose inverse of A.
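
You can confirm this result with NumPy’s built-in pseudoinverse (or with the `pinv_via_svd` sketch above):

```python
import numpy as np

A = np.ones((2, 2))        # the singular matrix [[1, 1], [1, 1]]
print(np.linalg.pinv(A))   # [[0.25 0.25]
                           #  [0.25 0.25]]
```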

Like we did for the invertible matrix before, let’s get an idea of what $A$ and $A^+$ are doing geometrically. Here they are acting on the unit circle:

Notice how $A$ now collapses the circle onto a one-dimensional space. This is a consequence of it having dependent columns. A matrix with dependent columns has an image of smaller dimension than the space it is mapping into. Another way of saying this is that it has a non-trivial null space. It “zeroes out” some of the dimensions in its domain during the transformation.

What if A were the coefficient matrix of a system of equations? We might have:

$$\begin{cases} x_1 + x_2 = b_1 \\ x_1 + x_2 = b_2 \end{cases}$$

for some $b_1$ and $b_2$.

Now, unless $b_1$ and $b_2$ are equal, this system won’t have an exact solution for $x_1$ and $x_2$. It will be inconsistent. But, with $A^+$, we can still find values for $x_1$ and $x_2$ that minimize the distance between $Ax$ and $b$. More specifically, let $\hat{x} = A^+b$. Then $\hat{x}$ will minimize $\lVert b - Ax \rVert^2$, the squared error, and $\hat{b} = A\hat{x} = AA^+b$ is the closest we can come to $b$. (The vector $b - A\hat{x}$ is sometimes called the residual vector.)

We have

$$\hat{x} = A^+b = \begin{bmatrix} \tfrac{1}{4}(b_1+b_2) \\ \tfrac{1}{4}(b_1+b_2) \end{bmatrix}$$

so $x_1 = \tfrac{1}{4}(b_1+b_2)$ and $x_2 = \tfrac{1}{4}(b_1+b_2)$. And the closest we can get to $b$ is

$$\hat{b} = A\hat{x} = \begin{bmatrix} \tfrac{1}{2}(b_1+b_2) \\ \tfrac{1}{2}(b_1+b_2) \end{bmatrix}$$

In other words, if we have to make $x_1 + x_2$ as close as possible to two different values $b_1$ and $b_2$, the best we can do is to choose $x_1$ and $x_2$ so as to get the average of $b_1$ and $b_2$.
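
Here is a short NumPy sketch of this least-squares solution, using $b = (1, 3)$ as in the figure below; `np.linalg.lstsq` is included only as a cross-check:

```python
import numpy as np

A = np.ones((2, 2))            # the singular coefficient matrix
b = np.array([1.0, 3.0])       # an inconsistent right-hand side

# Least-squares solution via the pseudoinverse.
x_hat = np.linalg.pinv(A) @ b  # [1. 1.]
b_hat = A @ x_hat              # [2. 2.], the orthogonal projection of b

# Cross-check with NumPy's least-squares solver.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(x_hat, b_hat)
print(np.allclose(x_hat, x_lstsq))  # True
```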

The vector $b = (1, 3)$ and its orthogonal projection $\hat{b} = (2, 2)$.

Or we could think about this problem geometrically. In order for there to be a solution to $Ax = b$, the vector $b$ has to reside in the image of $A$. The image of $A$ is the span of its columns, which is all vectors like $c(1,1)$ for a scalar $c$. This is just the line through the origin in the picture above. But $b = (b_1, b_2)$ is not on that line if $b_1 \neq b_2$, and so instead we minimize the distance between the two with its orthogonal projection $\hat{b}$. The error or residual is the difference $\epsilon = b - \hat{b}$.