Introduction

What are eigenvalues? What are singular values? They both describe the behavior of a matrix on a certain set of vectors. The difference is this: The eigenvectors of a matrix describe the directions of its invariant action. The singular vectors of a matrix describe the directions of its maximum action. And the corresponding eigen- and singular values describe the magnitude of that action.

They are defined this way. A scalar \(\lambda\) is an eigenvalue of a linear transformation \(A\) if there is a vector \(v\) such that \(A v = \lambda v\), and \(v\) is called an eigenvector of \(\lambda\). A scalar \(\sigma\) is a singular value of \(A\) if there are (unit) vectors \(u\) and \(v\) such that \(A v = \sigma u\) and \(A^* u = \sigma v\), where \(A^*\) is the conjugate transpose of \(A\); the vectors \(u\) and \(v\) are singular vectors. The vector \(u\) is called a left singular vector and \(v\) a right singular vector.

Eigenvalues and Eigenvectors

That eigenvectors give the directions of invariant action is obvious from the definition. The definition says that when \(A\) acts on an eigenvector, it just multiplies it by a constant, the corresponding eigenvalue. In other words, when a linear transformation acts on one of its eigenvectors, it shrinks the vector or stretches it and reverses its direction if \(\lambda\) is negative, but never changes the direction otherwise. The action is invariant.

Take this matrix, for instance:

\[ A = \begin{bmatrix} 0 & 2 \\ 2 & 0 \end{bmatrix} \]

Eigenvectors of A
Eigenvectors of \(A\)

We can see how the transformation just stretches the red vector by a factor of 2, while the blue vector it stretches but also reflects over the origin.

And this matrix:

\[ A = \begin{bmatrix} 1 & \frac{1}{3} \\ \frac{4}{3} & 1 \end{bmatrix} \]

Eigenvectors of A
Eigenvectors of \(A\)

It stretches the red vector and shrinks the blue vector, but reverses neither.

The point is that in every case, when a matrix acts on one of its eigenvectors, the action is always in a parallel direction.

Singular Values and Singular Vectors

This invariant direction does not necessarily give the transformation’s direction of greatest effect, however. You can see that in the previous example. But say \(\sigma_1\) is the largest singular value of \(A\) with right singular vector \(v\). Then \(v\) is a solution to

\[ \operatorname*{argmax}_{x, ||x||=1} ||A x|| \]

In other words, \( ||A v|| = \sigma_1 \) is at least as big as \( ||A x|| \) for any other unit vector \(x\). It’s not necessarily the case that \(A v\) is parallel to \(v\), though.

Compare the eigenvectors of the matrix in the last example to its singular vectors:

Singular vectors of A
Singular vectors of \(A\)

The directions of maximum effect will be exactly the semi-axes of the ellipse, the ellipse which is the image of the unit circle under \(A\).

Let’s extend this idea to 3-dimensional space to get a better idea of what’s going on. Consider this transformation:

\[A = \begin{bmatrix} \frac{3}{2} \, \sqrt{2} & -\sqrt{2} & 0 \\ \frac{3}{2} \, \sqrt{2} & \sqrt{2} & 0 \\ 0 & 0 & 1 \end{bmatrix} \]

This will have the effect of transforming the unit sphere into an ellipsoid:

The unit sphere transformed into an ellipsoid.
The unit sphere transformed into an ellipsoid.

Its singular values are 3, 2, and 1. You can see how they again form the semi-axes of the resulting figure.

The singular vectors as semi-axes in the ellipsoid.
The singular vectors as semi-axes in the ellipsoid.

Matrix Approximation with SVD

Now, the singular value decomposition (SVD) will tell us what \(A\)’s singular values are:

\[ A = U \Sigma V^* = \begin{bmatrix} \frac{\sqrt{2}}{2} & -\frac{\sqrt{2}}{2} & 0.0 \\ \frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2} & 0.0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 3 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \]

The diagonal entries of the matrix \(\Sigma\) are the singular values of \(A\). We can obtain a lower-dimensional approximation to \(A\) by setting one or more of its singular values to 0.

For instance, say we set the largest singular value, 3, to 0. We then get this matrix:

\[ A_1 = \begin{bmatrix} \frac{\sqrt{2}}{2} & -\frac{\sqrt{2}}{2} & 0.0 \\ \frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2} & 0.0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 0 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 0 & -\frac{\sqrt{2}}{2} & 0 \\ 0 & \frac{\sqrt{2}}{2} & 0 \\ 0 & 0 & 1 \end{bmatrix} \]

which transforms the unit sphere like this:

The transformation with the largest singular value set to 0.
The transformation with the largest singular value set to 0.

The resulting figure now lives in a 2-dimensional space. Further, the largest singular value of \(A_1\) is now 2. Set it to 0:

\[ A_2 = \begin{bmatrix} \frac{\sqrt{2}}{2} & -\frac{\sqrt{2}}{2} & 0.0 \\ \frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2} & 0.0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix} \]

And we get a 1-dimensional figure, and a final largest singular value of 1:

The transformation with the two largest singular values set to 0.
The transformation with the two largest singular values set to 0.

This is the point: Each set of singular vectors will form an orthonormal basis for some linear subspace of \(\mathbb{R}^n\). A singular value and its singular vectors give the direction of maximum action among all directions orthogonal to the singular vectors of any larger singular value.

This has important applications. There are many problems in statistics and machine learning that come down to finding a low-rank approximation to some matrix at hand. Principal component analysis is a problem of this kind. It says: approximate some matrix \(X\) of observations with a number of its uncorrelated components of maximum variance. This problem is solved by computing its singular value decomposition and setting some of its smallest singular values to 0.

Low-rank approximations of A.
Low-rank approximations of \(A\).

Introduction

Say \(V\) and \(W\) are vector spaces with scalars in some field \(\mathbb{F}\) (the real numbers, maybe). A linear map is a function \(T : V \rightarrow W \) satisfying two conditions:

  • additivity \(T(x + y) = T x + T y\) for all \(x, y \in V\)
  • homogeneity \(T(c x) = c (T x)\) for all \(c \in \mathbb{F} \) and all \(x \in V\)

Defining a linear map this way just ensures that anything that acts like a vector in \(V\) also acts like a vector in \(W\) after you map it over. It means that the map preserves all the structure of a vector space after it’s applied.

It’s a simple definition – which is good – but doesn’t speak much to the imagination. Since linear algebra is possibly the most useful and most ubiquitous of all the branches of mathematics, we’d like to have some intuition about what linear maps are so we have some idea of what we’re doing when we use it. Though not all vectors live there, the Euclidean plane \(\mathbb{R}^2\) is certainly the easiest to visualize, and the way we measure distance there is very similar to the way we measure error in statistics, so we can feel that our intuitions will carry over.

It turns out that all linear maps in \(\mathbb{R}^2\) can be factored into just a few primitive geometric operations: scaling, rotation, and reflection. This isn’t the only way to factor these maps, but I think it’s the easiest to understand. (We could get by without rotations, in fact.)

The unit circle, rotated, reflected, and scaled.
The unit circle, rotated, reflected, and scaled.

Three Primitive Transformations

Scaling

A (non-uniform) scaling transformation \(D\) in \(\mathbb{R}^2\) is given by a diagonal matrix:

\[Scl(d1, d2) = \begin{bmatrix} d_1 & 0 \\ 0 & d_2 \\ \end{bmatrix}\]

where \(d_1\) and \(d_2\) are non-negative. The transformation has the effect of stretching or shrinking a vector along each coordinate axis, and, so long as \(d_1\) and \(d_2\) are positive, it will also preserve the orientation of vectors after mapping because in this case \(\det(D) = d_1 d_2 > 0\).

For instance, here is the effect on a vector of this matrix: \[D = \begin{bmatrix} 0.75 & 0 \\ 0 & 1.25 \\ \end{bmatrix}\]

A vector, scaled.
A vector, scaled.

It will shrink a vector by a factor of 0.75 along the x-axis and stretch a vector by a factor of 1.25 along the y-axis.

If we think about all the vectors of length 1 as being the points of the unit circle, then we can get an idea of how the transformation will affect any vector. We can see a scaling as a continous transformation beginning at the identity matrix.

If one of the diagonal entries is 0, then it will collapse the circle on the other axis.

\[D = \begin{bmatrix} 0 & 0 \\ 0 & 1.25 \\ \end{bmatrix}\]

This is an example of a rank-deficient matrix. It maps every vector onto the y-axis, and so its image has a dimension less than the dimension of the full space.

Rotation

A rotation transformation \(Ref\) is given by a matrix: \[Ref(\theta) = \begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \\ \end{bmatrix}\]

This transformation will have the effect of rotating a vector counter-clockwise by an angle \(\theta\), when \(\theta\) is positive, and clockwise by \(\theta\) when \(\theta\) is negative.

A vector, rotated by 3\pi/4
A vector, rotated by \(3\pi/4\)

And the unit circle gets mapped onto itself.

It shouldn’t be too hard to convince ourselves that the matrix we’ve written down is the one we want. Take some unit vector and write its coordinates like \((\cos\gamma, \sin\gamma)\). Multiply it by \(Ref(\theta)\) to get \((\cos\gamma \cos\theta - \sin\gamma \sin\theta, \cos\gamma \sin\theta + \sin\gamma \cos\theta)\). But by a trigonometric identity, this is exactly the vector \((\cos(\gamma + \theta), \sin(\gamma + \theta))\), which is our vector rotated by \(\theta\).

A rotation should preserve not only orientations, but also distances. Now, recall that the determinant for a \(2\times 2\) matrix \(\begin{bmatrix} a & b \\ c & d \end{bmatrix}\) is \(a d - b c\). So a rotation matrix will have determinant \(\cos^2(\theta) + \sin^2(\theta)\), which, by the Pythagorean identity, is equal to 1. This, together with the fact that its columns are orthonormal means that it does preserve both. It is a kind of orthogonal matrix, which is a kind of isometry.

Reflection

A reflection in \(\mathbb{R}^2\) can be described with matricies like: \[Ref(\theta) = \begin{bmatrix} \cos(2\theta) & \sin(2\theta) \\ \sin(2\theta) & -\cos(2\theta) \\ \end{bmatrix}\] where the reflection is through a line crossing the origin and forming an angle \(\theta\) with the x-axis.

A vector, reflected over a line at angle \pi/4.
A vector, reflected over a line at angle \(\pi/4\).

And the unit circle gets mapped onto itself.

Note that the determinant of this matrix is -1, which means that it reverses orientation. But its columns are still orthonormal, and so it too is an isometry.

Decomposing Matricies into Primitives

The singular value decomposition (SVD) will factor any matrix \(A\) having like this:

\[ A = U \Sigma V^* \]

We are working with real matricies, so \(U\) and \(V\) will both be orthogonal matrices. This means each of these will be either a reflection or a rotation, depending on the pattern of signs in its entries. The matrix \(\Sigma\) is a diagonal matrix with non-negative entries, which means that it is a scaling transform. (The \(*\) on the \(V\) is the conjugate-transpose operator, which just means ordinary transpose when \(V\) doesn’t contain any imaginary entries. So, for us, \(V^* = V^\top\).) Now with the SVD we can rewrite any linear transformation as:

  1. \(V^*\): Rotate/Reflect
  2. \(\Sigma\): Scale
  3. \(U\): Rotate/Reflect

Example

\[\begin{bmatrix} 0.5 & 1.5 \\ 1.5 & 0.5 \end{bmatrix} \approx \begin{bmatrix} -0.707 & -0.707 \\ -0.707 & 0.707 \end{bmatrix} \begin{bmatrix} 2.0 & 0.0 \\ 0.0 & 1.0 \end{bmatrix} \begin{bmatrix} -0.707 & -0.707 \\ 0.707 & -0.707 \end{bmatrix} \]

This turns out to be:

  1. \(V^*\): Rotate clockwise by \(\theta = \frac{3 \pi}{4}\).
  2. \(\Sigma\): Scale x-coordinate by \(d_1 = 2\) and y-coordinate by \(d_2 = 1\).
  3. \(U\): Reflect over the line with angle \(-\frac{3\pi}{8}\).

Example

And here is a shear transform, represented as: rotation, scale, rotation.

\[\begin{bmatrix} 1.0 & 1.0 \\ 0.0 & 1.0 \end{bmatrix} \approx \begin{bmatrix} 0.85 & -0.53 \\ 0.53 & 0.85 \end{bmatrix} \begin{bmatrix} 1.62 & 0.0 \\ 0.0 & 0.62 \end{bmatrix} \begin{bmatrix} 0.53 & 0.85 \\ -0.85 & 0.53 \end{bmatrix} \]

Bayesian Aggregation

Yang, Y., & Dunson, D. B., Minimax Optimal Bayesian Aggregation 2014 (arXiv)

Say we have a number of estimators \(\hat f_1, \ldots, \hat f_K\) derived from a number of models \(M_1, \ldots, M_K\) for some regression problem \(Y = f(X) + \epsilon\), but, as is the nature of things when estimating with limited data, we don’t know which estimator represents the true model (assuming the true model is in our list). The Bayesian habit is to stick a prior on the uncertainty, compute posteriors probabilities, and then average across the unknown parameter using the posterior probabilities as weights. Since the posterior probabilities (call them \(\lambda_1, \ldots, \lambda_K\)) have to sum to 1, we obtain a convex combination of our estimators \[ \hat f = \sum_{1\leq i \leq K} \lambda_i \hat f_i \] This is the approach of Bayesian Model Averaging (BMA). Yang et al. propose to find such combinations using a Dirichlet prior on the weights \(\lambda_i\). If we remove the restriction that the weights sum to 1 and instead only ask that they have finite sum in absolute value, then we obtain \(\hat f\) as a linear combination of \(\hat f_i\). The authors then place a Gamma prior on \(A = \sum_i |\lambda_i|\) and a Dirichlet prior on \(\mu_i = \frac{|\lambda_i|}{A}\). In both the linear and the convex cases they show that the resulting estimator is minimax optimal in the sense that it will give the best worst-case predictions for a given number of observations, including the case where a sparsity restriction is placed on the number of estimators \(\hat f_i\); in other words, \(\hat f\) converges to the true estimator as the number of observations increases with minimax optimal risk. The advantage to previous non-bayesian methods of linear or convex aggregation is that the sparsity parameter can be learned from the data. The Dirichlet convex combination gives good performance against Best Model selection, Majority Voting, and SuperLearner, especially when there are both a large number of observations and a large number of estimators.

I implemented the convex case in R for use with brms. The Dirichlet distribution has been reparameterized as a sum of Gamma RVs to aid in sampling. The Dirichlet concentration parameter is \(\frac{\alpha}{K^\gamma}\); the authors recommend choosing \(\alpha = 1\) and \(\gamma = 2\).

convex_regression <- function(formula, data,
                              family = "gaussian",
                              ## Yang (2014) recommends alpha = 1, gamma = 2
                              alpha = 1, gamma = 2,
                              verbose = 0,
                              ...) {
  if (gamma <= 1) {
    warning(paste("Parameter gamma should be greater than 1. Given:", gamma))
  }
  if (alpha <= 0) {
    warning(paste("Parameter alpha should be greater than 0. Given:", alpha))
  }
  ## Set up priors.
  K <- length(terms(formula))
  alpha_K <- alpha / (K^gamma)
  stanvars <-
    stanvar(alpha_K,
      "alpha_K",
      block = "data",
      scode = "  real<lower = 0> alpha_K;  // dirichlet parameter"
    ) +
    stanvar(
      name = "b_raw",
      block = "parameters",
      scode = "  vector<lower = 0>[K] b_raw; "
    ) +
    stanvar(
      name = "b",
      block = "tparameters",
      scode = "  vector[K] b = b_raw / sum(b_raw);"
    )
  prior <- prior("target += gamma_lpdf(b_raw | alpha_K, 1)",
    class = "b_raw", check = FALSE
  )
  f <- update.formula(formula, . ~ . - 1)
  if (verbose > 0) {
    make_stancode(f,
      prior = prior,
      data = data,
      stanvars = stanvars
    ) %>% message()
  }
  fit_dir <- brm(f,
    prior = prior,
    family = family,
    data = data,
    stanvars = stanvars,
    ...
  )
  fit_dir
}

Here is a gist that includes an interface to parsnip.

In my own experiments, I found the performance of the convex aggregator to be comparable to a LASSO SuperLearner at the cost of the lengthier training that goes with MCMC methods and the finicky convergence of sparse priors. So I would likely reserve this for when I had lots of features and lots of estimators to work through, where I presume it would show an advantage. But in that case it would definitely be on my list of things to try.

Bayesian Stacking

Yao, Y., Vehtari, A., Simpson, D., & Gelman, A., Using Stacking to Average Bayesian Predictive Distributions (pdf)

Another approach to model combination is stacking. With stacking, model weights are chosen by cross-validation to minimize RMSE predictive error. Now, BMA finds the aggregated model that best fits the data, while stacking finds the aggregated model that gives the best predictions. Stacking therefore is usually better when predictions are what you want. A drawback is that stacking produces models through point estimates. So, they don’t give you all the information of a full distribution like BMA would. Yao et al. propose a method of stacking that instead finds the optimal predictive distribution by convex combinations of distributions with weights chosen by some scoring rule: the authors use the minimization of KL-divergence. Hence, they choose weights \(w\) empirically through LOO by \[ \max_w \frac{1}{n} \sum_{1\leq i \leq n} \log \sum_{1\leq k \leq K} w_k p(y_i | y_{-i}, M_k) \] where \(y_1, \ldots, y_n\) are the observed data and \(y_{-i}\) is the data with \(y_i\) left out. The following figure shows how stacking of predictive distributions gives the “best of both worlds” for BMA and point prediction stacking.

From Yao (2018)
From Yao (2018)

They have implemented stacking for Stan models in the R package loo.