# Introduction

The inverse of a matrix $$A$$ is another matrix $$A^{-1}$$ that has this property:

\begin{align*} A A^{-1} &= I \\ A^{-1} A &= I \end{align*}

where $$I$$ is the identity matrix. This is a nice property for a matrix to have, because then we can work with it in equations just like we might with ordinary numbers. For instance, to solve some linear system of equations $A x = b$ we can just multiply the inverse of $$A$$ to both sides $x = A^{-1} b$ and then we have some unique solution vector $$x$$. Again, this is just like we would do if we were trying to solve a real-number equation like $$a x = b$$.

Now, a matrix has an inverse whenever it is square and its rows are linearly independent. But not every system of equations we might care about will give us a matrix that satisfies these properties. The coefficient matrix $$A$$ would fail to be invertible if the system did not have the same number of equations as unknowns ($$A$$ is not square), or if the system had dependent equations ($$A$$ has dependent rows).

Generalized inverses are meant to solve this problem. They are meant to solve equations like $$A x = b$$ in the “best way possible” when $$A^{-1}$$ fails to exist. There are many kinds of generalized inverses, each with its own “best way.” (They can be used to solve ridge regression problems, for instance.)

The most common is the Moore-Penrose inverse, or sometimes just the pseudoinverse. It solves the least-squares problem for linear systems, and therefore will give us a solution $$\hat{x}$$ so that $$A \hat{x}$$ is as close as possible in ordinary Euclidean distance to the vector $$b$$.

The notation for the Moore-Penrose inverse is $$A^+$$ instead of $$A^{-1}$$. If $$A$$ is invertible, then in fact $$A^+ = A^{-1}$$, and in that case the solution to the least-squares problem is the same as the ordinary solution ($$A^+ b = A^{-1} b$$). So, the MP-inverse is strictly more general than the ordinary inverse: we can always use it and it will always give us the same solution as the ordinary inverse whenever the ordinary inverse exists.

We will look at how we can construct the Moore-Penrose inverse using the SVD. This turns out to be an easy extension to constructing the ordinary matrix inverse with the SVD. We will then see how solving a least-squares problem is just as easy as solving an ordinary equation.

# Example - System with an Invertible Matrix

First let’s recall how to solve a system whose coefficient matrix is invertible. In this case, we have the same number of equations as unknowns and the equations are all independent. Then $$A^{-1}$$ exists and we can find a unique solution for $$x$$ by multiplying $$A^{-1}$$ on both sides.

For instance, say we have

\left\{\begin{align*} x_1 - \frac{1}{2}x_2 &= 1 \\ -\frac{1}{2} x_1 + x_2 &= -1 \end{align*}\right.

Then

$\begin{array}{c c} A = \begin{bmatrix} 1 & -1/2 \\ -1/2 & 1 \end{bmatrix}, &A^{-1} = \begin{bmatrix} 4/3 & 2/3 \\ 2/3 & 4/3 \end{bmatrix} \end{array}$

and

$x = A^{-1}b = \begin{bmatrix} 4/3 & 2/3 \\ 2/3 & 4/3 \end{bmatrix} \begin{bmatrix} 1 \\ -1 \end{bmatrix} = \begin{bmatrix} 2/3 \\ -2/3 \end{bmatrix}$

So $$x_1 = \frac{2}{3}$$ and $$x_2 = -\frac{2}{3}$$.

# Constructing Inverses with the SVD

The singular value decomposition (SVD) gives us an intuitive way constructing an inverse matrix. We will be able to see how the geometric transforms of $$A^{-1}$$ undo the transforms of $$A$$.

The SVD says that for any matrix $$A$$,

$A = U \Sigma V^*$

where $$U$$ and $$V$$ are orthogonal matricies and $$\Sigma$$ is a diagonal matrix.

Now, if $$A$$ is invertible, we can use its SVD to find $$A^{-1}$$ like so:

$A^{-1} = V \Sigma^{-1} U^*$

If we have the SVD of $$A$$, we can construct its inverse by swapping the orthogonal matrices $$U$$ and $$V$$ and finding the inverse of $$\Sigma$$. Since $$\Sigma$$ is diagonal, we can do this by just taking reciprocals of its diagonal entries.

## Example

Let’s look at our earlier matrix again:

$A = \begin{bmatrix} 1 & -1/2 \\ -1/2 & 1 \end{bmatrix}$

It has SVD

$A = U \Sigma V^* = \begin{bmatrix} \sqrt{2}/2 & -\sqrt{2}/2 \\ \sqrt{2}/2 & \sqrt{2}/2 \end{bmatrix} \begin{bmatrix} 3/2 & 0 \\ 0 & 1/2 \end{bmatrix} \begin{bmatrix} \sqrt{2}/2 & \sqrt{2}/2 \\ -\sqrt{2}/2 & \sqrt{2}/2 \end{bmatrix}$

And so,

$A^{-1} = V \Sigma^{-1} U^* = \begin{bmatrix} \sqrt{2}/2 & -\sqrt{2}/2 \\ \sqrt{2}/2 & \sqrt{2}/2 \end{bmatrix} \begin{bmatrix} 2/3 & 0 \\ 0 & 2 \end{bmatrix} \begin{bmatrix} \sqrt{2}/2 & \sqrt{2}/2 \\ -\sqrt{2}/2 & \sqrt{2}/2 \end{bmatrix}$

and after multiplying everything out, we get

$A^{-1} = \begin{bmatrix} 4/3 & 2/3 \\ 2/3 & 4/3 \end{bmatrix}$

In an earlier post, we saw how we could use the SVD to visualize a matrix as a sequence of geometric transformations. Here is the matrix $$A$$ followed by $$A^{-1}$$, acting on the unit circle:

The inverse matrix $$A^{-1}$$ reverses exactly the action of $$A$$. The matrix $$A$$ will map any circle to a unique ellipse, with no overlap. So, $$A^{-1}$$ can map ellipses back to those same circles without any ambiguity. We don’t “lose information” by applying $$A$$.

# Constructing MP-Inverses with the SVD

We can in fact do basically the same thing for any matrix, not just the invertible ones. The SVD always exists, so for some matrix $$A$$, first write

$A = U \Sigma V^*$

And then find the MP-inverse by

$A^+ = V \Sigma^+ U^*$

Now, the matrix $$A$$ might not be invertible. If it is not square, then, to find $$\Sigma^+$$, we need to take the transpose of $$\Sigma$$ to make sure all the dimensions are conformable in the multiplication. It $$A$$ is singular (dependent rows), then $$\Sigma$$ will have 0’s on its diagaonal, so we need to make sure only take reciprocals of the non-zero entries.

Summarizing, to find the Moore-Penrose inverse of a matrix $$A$$:

1. Find the Singular Value Decomposition: $$A = U \Sigma V^*$$ (using R or Python, if you like).
2. Find $$\Sigma^+$$ by transposing $$\Sigma$$ and taking the reciprocal of all its non-zero diagonal entries.
3. Compute $$A^+ = V \Sigma^+ U^*$$

## Example - An Inconsistent System

Let’s find the MP-inverse of a singular matrix. Let’s take

$A = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}$

Because the rows of this matrix are linearly dependent, $$A^{-1}$$ does not exist. But we can still find the more general MP-inverse by following the procedure above.

So, first we find the SVD of $$A$$:

$A = U \Sigma V^* = \begin{bmatrix} \sqrt{2}/2 & -\sqrt{2}/2 \\ \sqrt{2}/2 & \sqrt{2}/2 \end{bmatrix} \begin{bmatrix} 2 & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} \sqrt{2}/2 & \sqrt{2}/2 \\ -\sqrt{2}/2 & \sqrt{2}/2 \end{bmatrix}$

Then we apply the procedure above to find $$A^+$$:

$A^+ = V \Sigma^+ U^* = \begin{bmatrix} \sqrt{2}/2 & -\sqrt{2}/2 \\ \sqrt{2}/2 & \sqrt{2}/2 \end{bmatrix} \begin{bmatrix} 1/2 & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} \sqrt{2}/2 & \sqrt{2}/2 \\ -\sqrt{2}/2 & \sqrt{2}/2 \end{bmatrix}$

And now we multiply everything out to get:

$A^+ = \begin{bmatrix} 1/4 & 1/4 \\ 1/4 & 1/4 \end{bmatrix}$

This is the Moore-Penrose inverse of $$A$$.

Like we did for the invertible matrix before, let’s get an idea of what $$A$$ and $$A^+$$ are doing geometrically. Here they are acting on the unit circle:

Notice how $$A$$ now collapses the circle onto a one-dimensional space. This is a consequence of it having dependent columns. For matricies with dependent columns, its image will be of lesser dimension than the space it’s mapping into. Another way of saying this is that it has a non-trivial null space. It “zeroes out” some of the dimensions in its domain during the transformation.

What if $$A$$ were the coefficient matrix of a system of equations? We might have:

\left\{ \begin{align*} x_1 + x_2 &= b_1 \\ x_1 + x_2 &= b_2 \end{align*} \right.

for some $$b_1$$ and $$b_2$$.

Now, unless $$b_1$$ and $$b_2$$ are equal, this system won’t have an exact solution for $$x_1$$ and $$x_2$$. It will be inconsistent. But, with $$A^+$$, we can still find values for $$x_1$$ and $$x_2$$ that minimize the distance between $$A x$$ and $$b$$. More specifically, let $$\hat{x} = A^{+}b$$. Then $$\hat{x}$$ will minimize $$|| b - A x ||^2$$, the squared error, and $$\hat{b} = A \hat{x} = A A^{+} x$$ is the closest we can come to $$b$$. (The vector $$b - A \hat{x}$$ is sometimes called the residual vector.)

We have

$\hat{x} = A^{+} b = \begin{bmatrix} 1/4 (b_1 + b_2) \\ 1/4 (b_1 + b_2) \end{bmatrix}$

so $$x_1 = \frac{1}{4} (b_1 + b_2)$$ and $$x_2 = \frac{1}{4} (b_1 + b_2)$$. And the closest we can get to $$b$$ is

$\hat{b} = A \hat{x} = \begin{bmatrix} 1/2 (b_1 + b_2) \\ 1/2 (b_1 + b_2) \end{bmatrix}$

In other words, if we have to make $$x_1 + x_2$$ as close as possible to two different values $$b_1$$ and $$b_2$$, the best we can do is to choose $$x_1$$ and $$x_2$$ so as to get the average of $$b_1$$ and $$b_2$$.

Or we could think about this problem geometrically. In order for there to be a solution to $$A x = b$$, the vector $$b$$ has to reside in the image of $$A$$. The image of $$A$$ is the span of its columns, which is all vectors like $$c(1, 1)$$ for a scalar $$c$$. This is just the line through the origin in the picture above. But $$b = (b_1, b_2)$$ is not on that line if $$b_1 \neq b_2$$, and so instead we minimize the distance between the two with its orthogonal projection $$\hat b$$. The error or residual is the difference $$\epsilon = b - \hat{b}$$.

## Understanding Eigenvalues and Singular Values

tutorial, linear-algebra, eigenvalues, singular-values
Published: November 15, 2019

# Introduction

What are eigenvalues? What are singular values? They both describe the behavior of a matrix on a certain set of vectors. The difference is this: The eigenvectors of a matrix describe the directions of its invariant action. The singular vectors of a matrix describe the directions of its maximum action. And the corresponding eigen- and singular values describe the magnitude of that action.

They are defined this way. A scalar $$\lambda$$ is an eigenvalue of a linear transformation $$A$$ if there is a vector $$v$$ such that $$A v = \lambda v$$, and $$v$$ is called an eigenvector of $$\lambda$$. A scalar $$\sigma$$ is a singular value of $$A$$ if there are (unit) vectors $$u$$ and $$v$$ such that $$A v = \sigma u$$ and $$A^* u = \sigma v$$, where $$A^*$$ is the conjugate transpose of $$A$$; the vectors $$u$$ and $$v$$ are singular vectors. The vector $$u$$ is called a left singular vector and $$v$$ a right singular vector.

# Eigenvalues and Eigenvectors

That eigenvectors give the directions of invariant action is obvious from the definition. The definition says that when $$A$$ acts on an eigenvector, it just multiplies it by a constant, the corresponding eigenvalue. In other words, when a linear transformation acts on one of its eigenvectors, it shrinks the vector or stretches it and reverses its direction if $$\lambda$$ is negative, but never changes the direction otherwise. The action is invariant.

Take this matrix, for instance:

$A = \begin{bmatrix} 0 & 2 \\ 2 & 0 \end{bmatrix}$

We can see how the transformation just stretches the red vector by a factor of 2, while the blue vector it stretches but also reflects over the origin.

And this matrix:

$A = \begin{bmatrix} 1 & \frac{1}{3} \\ \frac{4}{3} & 1 \end{bmatrix}$

It stretches the red vector and shrinks the blue vector, but reverses neither.

The point is that in every case, when a matrix acts on one of its eigenvectors, the action is always in a parallel direction.

# Singular Values and Singular Vectors

This invariant direction does not necessarily give the transformation’s direction of greatest effect, however. You can see that in the previous example. But say $$\sigma_1$$ is the largest singular value of $$A$$ with right singular vector $$v$$. Then $$v$$ is a solution to

$\operatorname*{argmax}_{x, ||x||=1} ||A x||$

In other words, $$||A v|| = \sigma_1$$ is at least as big as $$||A x||$$ for any other unit vector $$x$$. It’s not necessarily the case that $$A v$$ is parallel to $$v$$, though.

Compare the eigenvectors of the matrix in the last example to its singular vectors:

The directions of maximum effect will be exactly the semi-axes of the ellipse, the ellipse which is the image of the unit circle under $$A$$.

Let’s extend this idea to 3-dimensional space to get a better idea of what’s going on. Consider this transformation:

$A = \begin{bmatrix} \frac{3}{2} \, \sqrt{2} & -\sqrt{2} & 0 \\ \frac{3}{2} \, \sqrt{2} & \sqrt{2} & 0 \\ 0 & 0 & 1 \end{bmatrix}$

This will have the effect of transforming the unit sphere into an ellipsoid:

Its singular values are 3, 2, and 1. You can see how they again form the semi-axes of the resulting figure.

# Matrix Approximation with SVD

Now, the singular value decomposition (SVD) will tell us what $$A$$’s singular values are:

$A = U \Sigma V^* = \begin{bmatrix} \frac{\sqrt{2}}{2} & -\frac{\sqrt{2}}{2} & 0.0 \\ \frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2} & 0.0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 3 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$

The diagonal entries of the matrix $$\Sigma$$ are the singular values of $$A$$. We can obtain a lower-dimensional approximation to $$A$$ by setting one or more of its singular values to 0.

For instance, say we set the largest singular value, 3, to 0. We then get this matrix:

$A_1 = \begin{bmatrix} \frac{\sqrt{2}}{2} & -\frac{\sqrt{2}}{2} & 0.0 \\ \frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2} & 0.0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 0 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 0 & -\frac{\sqrt{2}}{2} & 0 \\ 0 & \frac{\sqrt{2}}{2} & 0 \\ 0 & 0 & 1 \end{bmatrix}$

which transforms the unit sphere like this:

The resulting figure now lives in a 2-dimensional space. Further, the largest singular value of $$A_1$$ is now 2. Set it to 0:

$A_2 = \begin{bmatrix} \frac{\sqrt{2}}{2} & -\frac{\sqrt{2}}{2} & 0.0 \\ \frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2} & 0.0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}$

And we get a 1-dimensional figure, and a final largest singular value of 1:

This is the point: Each set of singular vectors will form an orthonormal basis for some linear subspace of $$\mathbb{R}^n$$. A singular value and its singular vectors give the direction of maximum action among all directions orthogonal to the singular vectors of any larger singular value.

This has important applications. There are many problems in statistics and machine learning that come down to finding a low-rank approximation to some matrix at hand. Principal component analysis is a problem of this kind. It says: approximate some matrix $$X$$ of observations with a number of its uncorrelated components of maximum variance. This problem is solved by computing its singular value decomposition and setting some of its smallest singular values to 0.

## Visualizing Linear Transformations

tutorial, linear-algebra, matrix-decomposition, geometry
Published: November 12, 2019

# Introduction

Say $$V$$ and $$W$$ are vector spaces with scalars in some field $$\mathbb{F}$$ (the real numbers, maybe). A linear map is a function $$T : V \rightarrow W$$ satisfying two conditions:

• additivity $$T(x + y) = T x + T y$$ for all $$x, y \in V$$
• homogeneity $$T(c x) = c (T x)$$ for all $$c \in \mathbb{F}$$ and all $$x \in V$$

Defining a linear map this way just ensures that anything that acts like a vector in $$V$$ also acts like a vector in $$W$$ after you map it over. It means that the map preserves all the structure of a vector space after it’s applied.

It’s a simple definition – which is good – but doesn’t speak much to the imagination. Since linear algebra is possibly the most useful and most ubiquitous of all the branches of mathematics, we’d like to have some intuition about what linear maps are so we have some idea of what we’re doing when we use it. Though not all vectors live there, the Euclidean plane $$\mathbb{R}^2$$ is certainly the easiest to visualize, and the way we measure distance there is very similar to the way we measure error in statistics, so we can feel that our intuitions will carry over.

It turns out that all linear maps in $$\mathbb{R}^2$$ can be factored into just a few primitive geometric operations: scaling, rotation, and reflection. This isn’t the only way to factor these maps, but I think it’s the easiest to understand. (We could get by without rotations, in fact.)

# Three Primitive Transformations

## Scaling

A (non-uniform) scaling transformation $$D$$ in $$\mathbb{R}^2$$ is given by a diagonal matrix:

$Scl(d1, d2) = \begin{bmatrix} d_1 & 0 \\ 0 & d_2 \\ \end{bmatrix}$

where $$d_1$$ and $$d_2$$ are non-negative. The transformation has the effect of stretching or shrinking a vector along each coordinate axis, and, so long as $$d_1$$ and $$d_2$$ are positive, it will also preserve the orientation of vectors after mapping because in this case $$\det(D) = d_1 d_2 > 0$$.

For instance, here is the effect on a vector of this matrix: $D = \begin{bmatrix} 0.75 & 0 \\ 0 & 1.25 \\ \end{bmatrix}$

It will shrink a vector by a factor of 0.75 along the x-axis and stretch a vector by a factor of 1.25 along the y-axis.

If we think about all the vectors of length 1 as being the points of the unit circle, then we can get an idea of how the transformation will affect any vector. We can see a scaling as a continous transformation beginning at the identity matrix.

If one of the diagonal entries is 0, then it will collapse the circle on the other axis.

$D = \begin{bmatrix} 0 & 0 \\ 0 & 1.25 \\ \end{bmatrix}$

This is an example of a rank-deficient matrix. It maps every vector onto the y-axis, and so its image has a dimension less than the dimension of the full space.

## Rotation

A rotation transformation $$Ref$$ is given by a matrix: $Ref(\theta) = \begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \\ \end{bmatrix}$

This transformation will have the effect of rotating a vector counter-clockwise by an angle $$\theta$$, when $$\theta$$ is positive, and clockwise by $$\theta$$ when $$\theta$$ is negative.

And the unit circle gets mapped onto itself.

It shouldn’t be too hard to convince ourselves that the matrix we’ve written down is the one we want. Take some unit vector and write its coordinates like $$(\cos\gamma, \sin\gamma)$$. Multiply it by $$Ref(\theta)$$ to get $$(\cos\gamma \cos\theta - \sin\gamma \sin\theta, \cos\gamma \sin\theta + \sin\gamma \cos\theta)$$. But by a trigonometric identity, this is exactly the vector $$(\cos(\gamma + \theta), \sin(\gamma + \theta))$$, which is our vector rotated by $$\theta$$.

A rotation should preserve not only orientations, but also distances. Now, recall that the determinant for a $$2\times 2$$ matrix $$\begin{bmatrix} a & b \\ c & d \end{bmatrix}$$ is $$a d - b c$$. So a rotation matrix will have determinant $$\cos^2(\theta) + \sin^2(\theta)$$, which, by the Pythagorean identity, is equal to 1. This, together with the fact that its columns are orthonormal means that it does preserve both. It is a kind of orthogonal matrix, which is a kind of isometry.

## Reflection

A reflection in $$\mathbb{R}^2$$ can be described with matricies like: $Ref(\theta) = \begin{bmatrix} \cos(2\theta) & \sin(2\theta) \\ \sin(2\theta) & -\cos(2\theta) \\ \end{bmatrix}$ where the reflection is through a line crossing the origin and forming an angle $$\theta$$ with the x-axis.

And the unit circle gets mapped onto itself.

Note that the determinant of this matrix is -1, which means that it reverses orientation. But its columns are still orthonormal, and so it too is an isometry.

# Decomposing Matricies into Primitives

The singular value decomposition (SVD) will factor any matrix $$A$$ having like this:

$A = U \Sigma V^*$

We are working with real matricies, so $$U$$ and $$V$$ will both be orthogonal matrices. This means each of these will be either a reflection or a rotation, depending on the pattern of signs in its entries. The matrix $$\Sigma$$ is a diagonal matrix with non-negative entries, which means that it is a scaling transform. (The $$*$$ on the $$V$$ is the conjugate-transpose operator, which just means ordinary transpose when $$V$$ doesn’t contain any imaginary entries. So, for us, $$V^* = V^\top$$.) Now with the SVD we can rewrite any linear transformation as:

1. $$V^*$$: Rotate/Reflect
2. $$\Sigma$$: Scale
3. $$U$$: Rotate/Reflect

## Example

$\begin{bmatrix} 0.5 & 1.5 \\ 1.5 & 0.5 \end{bmatrix} \approx \begin{bmatrix} -0.707 & -0.707 \\ -0.707 & 0.707 \end{bmatrix} \begin{bmatrix} 2.0 & 0.0 \\ 0.0 & 1.0 \end{bmatrix} \begin{bmatrix} -0.707 & -0.707 \\ 0.707 & -0.707 \end{bmatrix}$

This turns out to be:

1. $$V^*$$: Rotate clockwise by $$\theta = \frac{3 \pi}{4}$$.
2. $$\Sigma$$: Scale x-coordinate by $$d_1 = 2$$ and y-coordinate by $$d_2 = 1$$.
3. $$U$$: Reflect over the line with angle $$-\frac{3\pi}{8}$$.

## Example

And here is a shear transform, represented as: rotation, scale, rotation.

$\begin{bmatrix} 1.0 & 1.0 \\ 0.0 & 1.0 \end{bmatrix} \approx \begin{bmatrix} 0.85 & -0.53 \\ 0.53 & 0.85 \end{bmatrix} \begin{bmatrix} 1.62 & 0.0 \\ 0.0 & 0.62 \end{bmatrix} \begin{bmatrix} 0.53 & 0.85 \\ -0.85 & 0.53 \end{bmatrix}$