Interpretability (and, relatedly, communicability), flexibility, and scalability are desirable properties for statistical techniques. Many newer techniques are quite flexible and work well on large data sets but are “black boxes” when it comes to interpretation, making it hard to understand the underlying conditions which make predictions accurate (or how long they will last for). Kernel Regularized Least Squares (KRLS) is a kernel-based, complexity-penalized method developed by Hainmueller and Hazlett (2013), and designed to minimize parametric assumptions while maintaining interpretive clarity. However, the interpretability and flexibility come at the cost of scalability because most of the calculations require comparing each observation to each and every other observation and therefore many computationally-costly **N** x **N** calculations. We introduce *bigKRLS*, an updated version of the original KRLS R package with algorithmic and implementation improvements designed to optimize speed and memory usage. These improvements allow users to straightforwardly fit KRLS models to medium and large data sets (N > ~2,500).

bigKRLS is the workhorse of this package; there are only two basic inputs: a vector of *N* observations on the dependent variable, **y**, and an *N* x *P* matrix **X**, where *P* is the number of independent variables and also ncol(**X**).^{1}

`mtcars[1:5,]`

```
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
```

Suppose we want to regress fuel efficiency on the other observables.

```
reg.out <- bigKRLS(y = as.matrix(mtcars$mpg),
X = as.matrix(mtcars[,-1]), Ncores = 1)
```

```
## ..........................
## All done. You may wish to use summary() for more detail, predict() for out-of-sample forecasts, or shiny.bigKRLS() to interact with results. For an alternative approach, see help(crossvalidate.bigKRLS). Type vignette("bigKRLS_basics") for sample syntax. Use save.bigKRLS() to store results and load.bigKRLS() to re-open them.
```

Unlike classical regression, the algorithm does not directly obtain an estimate of the slope, the “average” marginal effect (the slope). By contrast, KRLS first estimates the the marginal effect d*y*/d*x*_{p} at each observation (given the pairwise distances with each of the other observations). For example, we may want to know the effect the number of gears has on a particular vehicle’s fuel efficiency.

`summary(reg.out)`

```
##
##
## MODEL SUMMARY:
##
## lambda: 0.1306
## N: 32
## N Effective: 15.17858
## R2: 0.9405
## R2AME**: 0.7876
##
## Average Marginal Effects:
##
## Estimate Std. Error t value Pr(>|t|)
## cyl -0.0733 0.0765 -0.9581 0.3806
## disp -0.0037 0.0022 -1.6485 0.1582
## hp -0.0145 0.0040 -3.5903 0.0148
## drat 1.2786 0.6303 2.0287 0.0963
## wt -1.4940 0.3115 -4.7968 0.0045
## qsec 0.3678 0.1741 2.1123 0.0864
## vs* 1.2772 0.6234 2.0488 0.0938
## am* 0.9967 0.5954 1.6739 0.1530
## gear 1.0847 0.2867 3.7828 0.0120
## carb -0.5951 0.1518 -3.9200 0.0104
##
##
## Percentiles of Marginal Effects:
##
## 5% 25% 50% 75% 95%
## cyl -0.3219 -0.1071 -0.0182 0.0065 0.0406
## disp -0.0108 -0.0080 -0.0040 -0.0011 0.0044
## hp -0.0372 -0.0258 -0.0138 -0.0009 0.0038
## drat -0.4892 0.6973 1.1559 2.1076 3.4894
## wt -3.5756 -1.9904 -1.2308 -0.7641 -0.2853
## qsec -0.1250 0.0504 0.2099 0.5900 1.2423
## vs* -0.3757 0.8754 1.2843 1.8167 2.6408
## am* -0.6929 0.3432 0.8767 1.8618 2.7406
## gear -0.4504 -0.1866 0.8420 1.9432 3.4149
## carb -1.1902 -1.0159 -0.5844 -0.2336 0.0107
##
## (*) Reported average and percentiles of dy/dx is for discrete change of the dummy variable from min to max (usually 0 to 1)).
##
##
## (**) Pseudo-R^2 computed using only the Average Marginal Effects.
##
##
## You may also wish to use predict() for out-of-sample forecasts or shiny.bigKRLS() to interact with results. Type vignette("bigKRLS_basics") for sample syntax. Use save.bigKRLS() to store results and load.bigKRLS() to re-open them.
```

The “Percentiles of the Marginal Effects” can be interpreted as evidence about whether *y* is a monotonic function of *x*_{p} and the extent to which the effect of *x*_{p} on *y* is homogeneous, if at all. In this toy data set, the number of cylinders is not a statistically significant predictor of fuel efficiency; perhaps unsurprisingly, the marginal effect of cylinders is negative for about half of the cars investigated. By contrast, horsepower has a more uniformly negative effect on fuel efficiency.

Suppose you wanted to plot how similar a Toyota Corolla is to the other four cylinder cars:

```
s <- reg.out$K[which(mtcars$cyl == 4), grep("Corolla", rownames(mtcars))]
barplot(s, main = "Similarity to a Toyota Corolla",
ylab = "Kernel", sub="Toy Data from mtcars", cex.names = .7,
col = colorRampPalette(c("red", "blue"))(length(s))[rank(s)],
names.arg = lapply(strsplit(rownames(mtcars), split=" "),
function(x) x[2])[which(mtcars$cyl == 4)])
```

Apparently my Corolla is more similar to a Civic than a Porsche 914 but more tests are needed… Note on exceedingly large data sets, you may wish to grab the relevant subset first, standardize that **X** data, and then call *bGaussKernel* on the smaller set.

It appears that fuel efficiency decreases as horsepower increases but that the effect isn’t quite monotonic further. We might first ask whether the outcome is an additive function of horsepower…

```
scatter.smooth(mtcars$hp, reg.out$derivatives[,3], ylab="HP's Effect", xlab="Horsepower", pch = 19, bty = "n",
main="Horsepower's Marginal Effect on Fuel Efficiency",
sub="Toy Data from mtcars",
col = colorRampPalette(c("blue", "red"))(nrow(mtcars))[rank(reg.out$coeffs^2)],
ylim = c(-0.042, 0.015), xlim = c(50, 400))
```

The above graph suggests that though in general lower horsepower helps explain which cars have better fuel efficiency, beyond a certain threshold, that’s no longer the case (or perhaps log horsepower is more relevant).

Suppose you want to cross-validate your data…

```
CV.out <- crossvalidate.bigKRLS(y = as.matrix(mtcars$mpg), seed = 123, Kfolds = 4,
X = as.matrix(mtcars[,-1]), Ncores = 1)
```

```
## .........................
## ...........................
## .........................
## ........................
```

`cor(CV.out$fold_3$tested$predicted, CV.out$fold_3$tested$ytest)`

`## [1] 0.9662644`

The predictions and the holdout data correlate at 0.86. To look at particular models:

`summary(CV.out$fold_1$trained) # not run`

`CV.out`

contains several test statistics including:

`CV.out$MSE_is`

```
## fold1 fold2 fold3 fold4
## 3.408028 1.757684 1.191496 2.173197
```

`CV.out$MSE_oos`

```
## fold1 fold2 fold3 fold4
## 3.425657 4.247855 21.958875 15.206500
```

`CV.out$R2_oos`

```
## fold1 fold2 fold3 fold4
## 0.9062257 0.7065661 0.9336668 0.5525216
```

`CV.out$R2AME_oos`

```
## fold1 fold2 fold3 fold4
## 0.9288903 0.6453527 0.9067781 0.7085692
```

The first two test stats are the mean squared error, in and out of sample (respectively). The second two show the performance of the full model (the N coefficients) vs. the portion that is linear and additive in X. To do a simple train and test, leave `Kfolds`

blank and set `ptesting`

instead.

To interact with your results in a pop up window or your browser, simply call:

`shiny.bigKRLS(reg.out) # not run`

To remove the big square matrices so that you can easily put your results up on a server, use export:

`shiny.bigKRLS(reg.out, export = T) # not run`

The output will describe the new, more compact object that has been created.

Let’s say we wanted to know what percentage of cars would have lower gas mileage if they had 200 horsepower.

```
Xnew <- mtcars[,-1]
Xnew$hp <- 200
forecast <- predict(reg.out, as.matrix(Xnew))
```

`## .`

`mean(forecast$predicted < mtcars$mpg)`

`## [1] 0.6875`

So approximately 68.8% of cars would be less efficient.

If *N* > 2,500 or if you supply big matrices, using save() and load() will crash your R session. Instead you may do one of two things to save:

```
out <- bigKRLS(y, X, model_subfolder_name = "my_results") # not run
save.bigKRLS(out, "my_results") # not run
```

Either will save the model estimates to a new subfolder called “my_results” in your current working directory. To re-load,

`load.bigKRLS("my_results") # not run`

When *N* > 2,500 or the user provides big matrices, big matrices will be returned, which are really just memory addresses.

```
Z <- big.matrix(nrow=5, ncol=5, init=1)
Z
```

```
## An object of class "big.matrix"
## Slot "address":
## <pointer: 0x7fdac20ac490>
```

You do not necessarily need to work with the big square matrices in the output. But if you do and your comfortable they fit in memory, just use the square brackets:

`Z[]`

```
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 1 1 1 1
## [2,] 1 1 1 1 1
## [3,] 1 1 1 1 1
## [4,] 1 1 1 1 1
## [5,] 1 1 1 1 1
```

**X**and**y**should only contain numeric data (no missing data, factors, or vectors of constants) and may be base*R*matrices or “big” matrices (from*bigmemory*).↩