Cross-validation is often used in machine learning to judge how well a model is fit. Instead of using the entire data set to fit the model, it will use one part of the data set to fit a model and then test the model on the remaining data. This gives an idea of how well the model will generalize to indpendent data.

Leave-one-out prediction uses an entire model fit to all the data except a single point, and then makes a prediction at that point which can be compared to the actual value. It seems like this may be very expensive to do, but it is actually an inexpensive computation for a Gaussian process model, as long as the same parameters are used from the full model. This will bias the predictions to better results than if parameters were re-estimated.

Normally each prediction point requires solving a matrix equation. To predict the output, \(y\), at point \(\mathbf{x}\), given input data in matrix \(X_2\) and output \(\mathbf{y_2}\), we use the equation \[ \hat{y} = \hat{\mu} + R(\mathbf{x},~X_2) R(X_2)^{-1}( \mathbf{y_2} - \mu\mathbf{1_n})) \] For leave-one-out predictions, the matrix \(X_2\) will have all the design points except for the one we are predicting at, and thus will be different for each one. However, we will have the correlation matrix \(R\) for the full data set from estimating the parameters, and there is a shortcut to find the inverse of a matrix leaving out a single row and column.

There is significant speed-up by using a multiplication instead of a matrix solve. The code chunk below shows that solving with a square matrix with 200 rows is over 30 times slower than a matrix multiplication.

```
<- 200
n <- matrix(runif(n*n),ncol=n)
m1 <- runif(n)
b1 if (requireNamespace("microbenchmark", quietly = TRUE)) {
::microbenchmark(solve(m1, b1), m1 %*% b1)
microbenchmark }
```

```
## Unit: microseconds
## expr min lq mean median uq max neval
## solve(m1, b1) 1242.101 1398.751 1509.65801 1500.3510 1603.251 2637.400 100
## m1 %*% b1 32.901 37.001 40.89896 38.6505 42.851 95.301 100
```

Suppose we have a matrix \(K\) and know its inverse \(K^{-1}\). Suppose that \(K\) has block structure \[ K = \begin{bmatrix} A~B \\ C~D \end{bmatrix}\] Now we want to find out how to find \(A^{-1}\) using \(K^{-1}\) instead of doing the full inverse. We can write \(K^{-1}\) in block structure \[K^{-1} = \begin{bmatrix} E~F \\ G~H \end{bmatrix}\]

Now we use the fact that \(K K^{-1} = I\) \[ \begin{bmatrix} I~0 \\ 0~I \end{bmatrix} = \begin{bmatrix} A~B \\ C~D \end{bmatrix}\begin{bmatrix} E~F \\ G~H \end{bmatrix} \]

This gives the equations \[ AE + BG = I\] \[ AF + BH = 0\] \[ CE + DG = 0\] \[ CF + DH = I\]

Solving the first equation gives that \[ A = (I - BG)E^{-1}\] or \[ A^{-1} = E (I - BG) ^{-1}\]

For Gaussian processes we can consider the block matrix for the covariance (or correlation) matrix where a single row and its corresponding column is being removed. Let the first \(n-1\) rows and columns be the covariance of the points in design matrix \(X\), while the last row and column are the covariance for the vector \(\mathbf{x}\) with \(X\) and \(\mathbf{x}\). Then we can have

\[ K = \begin{bmatrix} C(X,X)~ C(X,\mathbf{x}) \\ C(\mathbf{x},X)~C(\mathbf{x},\mathbf{x}) \end{bmatrix}\]

Using the notation from the previous subsection we have \(A = C(X,X)\) and \(B=C(X,\mathbf{x})\), and \(E\) and \(G\) will be submatrices of the full \(K^{-1}\). \(B\) is a column vector, so I’ll write it as a vector \(\mathbf{b}\), and \(G\) is a row vector, so I’ll write it as a vector \(\mathbf{g}^T\). So we have \[ C(X,X)^{-1} = E(I-C(X,x)G)^{-1}\] So if we want to calculate \[ A^{-1} = E (I - \mathbf{b}\mathbf{g}^T) ^{-1}\] we still have to invert \(I-BG\), which is a large matrix. However this can be done efficiently since it is a rank one matrix using the Sherman-Morrison formula. \[ (I - \mathbf{b}\mathbf{g}^T)^{-1} = I^{-1} - \frac{I^{-1}\mathbf{b}\mathbf{g}^TI^{-1}}{1+\mathbf{g}^TI^{-1}\mathbf{b}} = I - \frac{\mathbf{b}\mathbf{g}^T}{1+\mathbf{g}^T\mathbf{b}} \] Thus we have the shortcut for \(A^{-1}\) that is only multiplication \[ A^{-1} = E (I - \frac{\mathbf{b}\mathbf{g}^T}{1+\mathbf{g}^T\mathbf{b}})\]

To speed this up it should be calculated as

\[ A^{-1} = E - \frac{(E\mathbf{b})\mathbf{g}^T}{1+\mathbf{g}^T\mathbf{b}}\] Below demonstrates that we get a speedup of almost twenty by using this shortcut.

```
set.seed(0)
<- function(x,y) {exp(sum(-30*(x-y)^2))}
corr <- 200
n <- 2
d <- matrix(runif(n*d),ncol=2)
X <- outer(1:n,1:n, Vectorize(function(i,j) {corr(X[i,], X[j,])}))
R <- solve(R)
Rinv <- R[-n,-n]
A <- solve(A)
Ainv <- Rinv[-n, -n]
E <- R[n,-n]
b <- Rinv[n,-n]
g <- E + E %*% b %*% g / (1-sum(g*b))
Ainv_shortcut summary(c(Ainv - Ainv_shortcut))
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.768e-03 -2.300e-08 0.000e+00 0.000e+00 2.300e-08 3.838e-03
```

```
if (requireNamespace("microbenchmark", quietly = TRUE)) {
::microbenchmark(solve(A), E + E %*% b %*% g / (1-sum(g*b)))
microbenchmark }
```

```
## Unit: microseconds
## expr min lq mean median
## solve(A) 3820.001 4617.201 4932.1611 4844.801
## E + E %*% b %*% g/(1 - sum(g * b)) 109.201 197.201 216.1549 207.001
## uq max neval
## 5230.7015 8877.901 100
## 223.4505 356.101 100
```

In terms of the covariance matrices already calculated, this is the following, where \(M_{-i}\) is the matrix \(M\) with the ith row and column removed, and \(M_{i,-i}\) is the ith row of the matrix \(M\) with the value from the ith column removed.

\[ R(X_{-i})^{-1} = R(X)_{-i} - \frac{(R(X)_{-i}R(X)_{-i,i}) (R(X)^{-1})_{i,-i}^T }{1 + (R(X)^{-1})_{i,-i}^T R(X)_{-i,i}}\]

Recall that the predicted mean at a new point is \[ \hat{y} = \hat{\mu} + R(\mathbf{x},~X_2) R(X_2)^{-1}( \mathbf{y_2} - \mu\mathbf{1_n})) \]

\[ \hat{y} = \hat{\mu} + R(\mathbf{x_i},~X_{-1}) R(X_{-i})^{-1}( \mathbf{y_{-i}} - \mu\mathbf{1_n})) \]