Introduction

Dynamic regression models are an extension of basic linear regression models, where instead of constant but unknown regression coefficients, the underlying coefficients are assumed to vary over “time” according to a random walk. In their simplest form these models can be used for regression with an additional time series component, and they also allow robust modelling of phenomena where the effect sizes of the predictor variables can vary during the period of the study. The R [@R] package walker provides an efficient method for fully Bayesian inference of such models, where the main computations are performed using the Markov chain Monte Carlo (MCMC) algorithms provided by Stan [@stan; @rstan]. This also allows the use of many general diagnostic and graphical tools provided by several Stan-related R packages such as ShinyStan [@shinystan].

More specifically, the dynamic linear regression model is defined as
\[
\begin{aligned}
y_t &= x'_t \beta_t + \epsilon_t, \quad t = 1,\ldots, n,\\
\beta_{t+1} &= \beta_t + \eta_t,
\end{aligned}
\]
where \(y_t\) is the observation at time \(t\), \(x_t\) contains the corresponding predictor variables, \(\beta_t\) is a \(k\)-dimensional vector of regression coefficients at time \(t\), \(\epsilon_t \sim N(0, \sigma^2_{\epsilon})\), and \(\eta_t \sim N(0, D)\), with \(D\) being a \(k \times k\) diagonal matrix with diagonal elements \(\sigma^2_{i,\eta}\), \(i=1,\ldots,k\). Denote the unknown parameters of the model by \(\beta = (\beta'_1, \ldots, \beta'_n)'\) and \(\sigma = (\sigma_{\epsilon}, \sigma_{1, \eta}, \ldots, \sigma_{k, \eta})\). We define the prior for the first \(\beta_1\) as \(N(\mu_{\beta_1}, \sigma_{\beta_1})\), and, with a slight abuse of notation, \(\sigma_i \sim N(\mu_{\sigma_i}, \sigma_{\sigma_i})\), \(i=1,\ldots,k+1\), truncated to positive values.

Although in principle writing the dynamic linear regression model as above in the Stan or BUGS [@lunn-thomas-best-spiegelhalter] language is straightforward, these standard implementations are computationally inefficient and prone to severe problems related to the convergence of the underlying MCMC algorithm. For example, in a (block) Gibbs sampling approach we target the joint posterior \(p(\beta, \sigma | y)\) by sampling from \(p(\beta | \sigma, y)\) and \(p(\sigma | \beta, y)\). But because of the strong autocorrelations between the coefficients \(\beta\) at different time points, as well as with the associated standard deviation parameters, this MCMC scheme can be very inefficient. Also, the total number of parameters to sample increases with the number of data points \(n\). Although the Hamiltonian Monte Carlo algorithms offered by Stan are typically more efficient at exploring the posterior than Gibbs-type algorithms, we still encounter similar problems (see the illustration in the next section).

An alternative solution used by walker is based on the marginalization of the regression coefficients \(\beta\) during the MCMC sampling by using the Kalman filter. This provides fast and accurate inference of the marginal posterior \(p(\sigma | y)\). Then, the corresponding joint posterior \(p(\sigma, \beta | y) = p(\beta | \sigma, y)p(\sigma | y)\) can be obtained by simulating the regression coefficients given the marginal posterior of the standard deviations. This sampling can be performed, for example, with the simulation smoothing algorithm of @durbin-koopman2002. Note that we have opted to sample the \(\beta\) parameters given the \(\sigma\)'s, but it is also possible to obtain somewhat more accurate summary statistics, such as the mean and variance of these parameters, by using the standard Kalman smoother for the computation of \(\textrm{E}(\beta| \sigma, y)\) and \(\textrm{Var}(\beta| \sigma, y)\), and then applying the law of total expectation (and total variance) over the posterior draws of \(\sigma\).
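
To make the last point concrete, the following sketch shows how the smoothed moments could be combined over posterior draws of \(\sigma\); here kalman_smoother() is a hypothetical placeholder for any Kalman smoother implementation (for example one built with the KFAS package), not a function of walker:

# Hypothetical sketch: kalman_smoother() is a placeholder, not a walker function.
combine_moments <- function(sigma_draws, y, x) {
  moments <- lapply(sigma_draws, function(sigma) kalman_smoother(y, x, sigma))
  means <- sapply(moments, "[[", "mean")  # one column per posterior draw of sigma
  vars <- sapply(moments, "[[", "var")
  list(
    mean = rowMeans(means),                       # law of total expectation
    var = rowMeans(vars) + apply(means, 1, var)   # law of total variance
  )
}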

In this vignette we first introduce the basic use of the package with a simple linear regression model with first-order random walk coefficients, and then discuss extensions to the second-order random walk, as well as how to deal with non-Gaussian data.

Illustration

Let us consider observations \(y\) of length \(n=100\), generated by a random walk (i.e. a time-varying intercept) and two predictors. This is a rather small problem, but it was chosen in order to make comparisons with the “naive” Stan implementation feasible. For larger problems (in terms of the number of observations and especially the number of predictors) it is very difficult to get the naive implementation to work properly, as even after tweaking several parameters of the underlying MCMC sampler, one typically ends up with divergent transitions or a low BFMI, meaning that the results are not to be trusted.

First we simulate the coefficients and the predictors:

set.seed(1)
n <- 100
# regression coefficients as random walks
beta1 <- cumsum(c(0.5, rnorm(n - 1, 0, sd = 0.05)))
beta2 <- cumsum(c(-1, rnorm(n - 1, 0, sd = 0.15)))
# predictors
x1 <- rnorm(n, mean = 2)
x2 <- cos(1:n)
# time-varying intercept as a random walk
u <- cumsum(rnorm(n, 0, 0.5))
ts.plot(cbind(u, beta1 * x1, beta2 * x2), col = 1:3)

plot of chunk example

signal <- u + beta1 * x1 + beta2 * x2
y <- rnorm(n, signal, 0.5)
ts.plot(cbind(signal, y), col = 1:2)

plot of chunk observations

Then we can call the function walker. The model is defined as a formula as in lm, and we can give several arguments which are passed to the sampling method of rstan, such as the number of iterations iter and the number of chains chains (the default values for these are 2000 and 4). In addition to these, the priors for \(\beta_1\) and the random walk standard deviations are defined via the beta_prior and sigma_prior arguments of the rw1 term, and the prior for the observation level standard deviation \(\sigma_{\epsilon}\) via the sigma_y_prior argument. Each of these arguments should be a vector of length two, where the first value defines the prior mean and the second value defines the prior standard deviation.

fit <- walker(y ~ -1 + rw1(~ x1 + x2, beta_prior = c(0, 10), sigma_prior = c(0, 10)), 
  refresh = 0, chains = 2, sigma_y_prior = c(0, 10))
## 
## Gradient evaluation took 0.000681 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 6.81 seconds.
## Adjust your expectations accordingly!
## 
## 
## 
##  Elapsed Time: 5.75927 seconds (Warm-up)
##                6.38318 seconds (Sampling)
##                12.1424 seconds (Total)
## 
## 
## Gradient evaluation took 0.000537 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 5.37 seconds.
## Adjust your expectations accordingly!
## 
## 
## 
##  Elapsed Time: 5.54086 seconds (Warm-up)
##                5.05988 seconds (Sampling)
##                10.6007 seconds (Total)

We sometimes get a few (typically one) warning messages about numerical problems as the sampling algorithm warms up, but these are nothing to be concerned about (if more warnings occur, then a GitHub issue for the walker package is more than welcome).

The output of walker is a walker_fit object, which is essentially a list containing the stanfit object from Stan's sampling function, the original observations y, and the covariate matrix xreg. Thus we can use all the ready-made tools for postprocessing stanfit objects:

print(fit$stanfit, pars = c("sigma_y", "sigma_rw1"))
## Inference for Stan model: walker_lm.
## 2 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=2000.
## 
##              mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
## sigma_y      0.50    0.01 0.14 0.18 0.42 0.51 0.60  0.74   678    1
## sigma_rw1[1] 0.59    0.00 0.13 0.36 0.50 0.58 0.67  0.87   896    1
## sigma_rw1[2] 0.08    0.00 0.04 0.01 0.05 0.07 0.10  0.17  1401    1
## sigma_rw1[3] 0.32    0.00 0.10 0.15 0.24 0.30 0.37  0.54  1355    1
## 
## Samples were drawn using NUTS(diag_e) at Wed Jul 12 08:46:34 2017.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
mcmc_areas(as.matrix(fit$stanfit), regex_pars = c("sigma_y", "sigma_rw1"))

plot of chunk pars
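
Since fit$stanfit is a regular stanfit object, we can also use the interactive diagnostics of ShinyStan mentioned in the introduction (not run here):

library(shinystan)
launch_shinystan(fit$stanfit)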

Let's check how well our estimates of \(\beta\) coincide with the true values (the solid lines correspond to the true values):

betas <- summary(fit$stanfit, "beta_rw")$summary[, "mean"]

ts.plot(cbind(u, beta1, beta2, matrix(betas, ncol = 3)),
  col = rep(1:3, 2), lty = rep(1:2, each = 3))

plot of chunk plot_with_true_betas
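
If numerical uncertainty summaries are needed in addition to the posterior means, the corresponding quantiles can be extracted directly from the stanfit object, for example (a small sketch using rstan's summary method):

beta_summary <- summary(fit$stanfit, "beta_rw", probs = c(0.025, 0.975))$summary
head(beta_summary[, c("mean", "2.5%", "97.5%")])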

There is also a simpler (and prettier) ggplot-based plotting function for the coefficients:

plot_coefs(fit)

plot of chunk plot_pretty_betas

Posterior predictive checks

The walker function also returns samples from the posterior predictive distribution \(p(y^{\textrm{rep}} | y) = \int p(y^{\textrm{rep}} | \beta, \sigma, y) p(\beta, \sigma | y) \textrm{d}\beta\textrm{d}\sigma\). These can be used, for example, to assess how well the model fits the data. By comparing the replicated series (mean and 95% quantiles in black) and the original observations (in red) we see a very good overlap, which is not that surprising given that we know the correct model:

pp_check(fit)

plot of chunk ppc
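
Beyond the visual check, simple numerical summaries can be computed from the replications, for example a posterior predictive p-value for the standard deviation of the series. The sketch below assumes the replications are stored in the stanfit object under the name y_rep; this name is an assumption, so check the parameter names of the fitted object first:

# sketch: the parameter name "y_rep" is an assumption
y_rep <- rstan::extract(fit$stanfit, pars = "y_rep")$y_rep  # draws x n matrix
mean(apply(y_rep, 1, sd) > sd(y))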

Out-of-sample prediction

It is also possible to obtain actual forecasts given new covariates \(x^{\textrm{new}}\):

new_data <- data.frame(x1 = rnorm(10, mean = 2), x2 = cos((n + 1):(n + 10)))
pred <- predict(fit, new_data)
plot_predict(pred)

plot of chunk prediction

Extensions: Smoother effects and non-Gaussian models

When modelling regression coefficients as a simple random walk, the posterior estimates of these coefficients can have large short-term variation which might not be realistic in practice. One way of imposing more smoothness on the estimates is to switch from random walk coefficients to integrated second-order random walk coefficients:
\[
\begin{aligned}
\beta_{t+1} &= \beta_t + \nu_t,\\
\nu_{t+1} &= \nu_t + \eta_t.
\end{aligned}
\]
This is essentially a local linear trend model [@harvey] with the restriction that there is no noise on the \(\beta\) level. This model can be estimated by switching the rw1 function inside the walker formula to rw2, with an almost identical interface, but now \(\sigma\) corresponds to the standard deviations of the slope terms \(\nu\). The prior for \(\nu\) must also be defined (via slope_prior). Using the RW2 model, the coefficient estimates of our example model are clearly smoother:

fit_rw2 <- walker(y ~ -1 + 
    rw2(~ x1 + x2, beta_prior = c(0, 10), sigma_prior = c(0, 10), slope_prior = c(0, 10)), 
  refresh = 0, chains = 2, sigma_y_prior = c(0, 10))
## 
## Gradient evaluation took 0.001009 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 10.09 seconds.
## Adjust your expectations accordingly!
## 
## 
## 
##  Elapsed Time: 8.73405 seconds (Warm-up)
##                8.14255 seconds (Sampling)
##                16.8766 seconds (Total)
## 
## 
## Gradient evaluation took 0.000982 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 9.82 seconds.
## Adjust your expectations accordingly!
## 
## 
## 
##  Elapsed Time: 11.2767 seconds (Warm-up)
##                7.91997 seconds (Sampling)
##                19.1966 seconds (Total)
plot_coefs(fit_rw2)

plot of chunk walker_rw2

So far we have focused on simple Gaussian linear regression. The above treatment cannot be directly extended to non-Gaussian models such as Poisson regression, as the marginal log-likelihood is intractable. However, it is possible to use relatively accurate Gaussian approximations, and the resulting approximate posterior can then be weighted using an importance sampling type correction [@vihola-helske-franks], leading to exact inference. Currently the function walker_glm can be used for Poisson observations, but straightforward extensions to binomial and negative binomial distributions will be added in the future.
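
As a rough illustration of the intended interface, a Poisson model with a time-varying coefficient could look something like the sketch below; the argument names beyond the formula are assumptions here, so consult the documentation of walker_glm for the exact interface (not run):

# hypothetical sketch; see ?walker_glm for the actual arguments
y_pois <- rpois(n, exp(beta2 * x2))
fit_pois <- walker_glm(y_pois ~ -1 + rw1(~ x2, beta_prior = c(0, 2), sigma_prior = c(0, 2)),
  distribution = "poisson", refresh = 0, chains = 2)
plot_coefs(fit_pois)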

Comparison with naive approach

We now compare the efficiency of the “naive” implementation and the state space approach. For this, we use the function walker_rw1, which supports only basic random walk models. We can perform the same analysis with the naive implementation by setting the argument naive to TRUE:

naive_walker <- walker_rw1(y ~ x1 + x2, seed = 1, refresh = 0, chains = 2,
  beta_prior = cbind(0, rep(5, 3)), sigma_prior = cbind(0, rep(2, 4)),
  naive = TRUE, control = list(adapt_delta = 0.9, max_treedepth = 15))
## 
## Gradient evaluation took 7e-05 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 0.7 seconds.
## Adjust your expectations accordingly!
## 
## 
## 
##  Elapsed Time: 29.5078 seconds (Warm-up)
##                31.4577 seconds (Sampling)
##                60.9655 seconds (Total)
## 
## 
## Gradient evaluation took 7.1e-05 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 0.71 seconds.
## Adjust your expectations accordingly!
## 
## 
## 
##  Elapsed Time: 34.2561 seconds (Warm-up)
##                31.8394 seconds (Sampling)
##                66.0954 seconds (Total)
## Warning: There were 1 divergent transitions after warmup. Increasing adapt_delta above 0.9 may help. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## Warning: Examine the pairs() plot to diagnose sampling problems
print(naive_walker$stanfit, pars = c("sigma_y", "sigma_b"))
## Inference for Stan model: rw1_model_naive.
## 2 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=2000.
## 
##            mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
## sigma_y    0.51    0.01 0.11 0.29 0.44 0.51 0.59  0.73   121 1.04
## sigma_b[1] 0.58    0.01 0.12 0.38 0.51 0.57 0.66  0.83   266 1.01
## sigma_b[2] 0.08    0.00 0.04 0.01 0.05 0.08 0.10  0.17   385 1.01
## sigma_b[3] 0.31    0.00 0.10 0.15 0.24 0.30 0.37  0.52   405 1.00
## 
## Samples were drawn using NUTS(diag_e) at Wed Jul 12 08:49:25 2017.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

Let's run the state space version again:

kalman_walker <- walker_rw1(y ~ x1 + x2, seed = 1, refresh = 0, chains = 2,
  beta_prior = cbind(0, rep(5, 3)), sigma_prior = cbind(0, rep(2, 4)),
  naive = FALSE)
## 
## Gradient evaluation took 0.000356 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 3.56 seconds.
## Adjust your expectations accordingly!
## 
## 
## 
##  Elapsed Time: 3.45731 seconds (Warm-up)
##                4.18442 seconds (Sampling)
##                7.64173 seconds (Total)
## 
## 
## Gradient evaluation took 0.000388 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 3.88 seconds.
## Adjust your expectations accordingly!
## 
## 
## 
##  Elapsed Time: 6.03657 seconds (Warm-up)
##                4.6773 seconds (Sampling)
##                10.7139 seconds (Total)
print(kalman_walker$stanfit, pars = c("sigma_y", "sigma_b"))
## Inference for Stan model: rw1_model.
## 2 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=2000.
## 
##            mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
## sigma_y    0.50       0 0.14 0.21 0.42 0.51 0.59  0.74   918    1
## sigma_b[1] 0.59       0 0.13 0.36 0.50 0.58 0.67  0.86   930    1
## sigma_b[2] 0.08       0 0.04 0.01 0.05 0.07 0.10  0.16   878    1
## sigma_b[3] 0.31       0 0.10 0.15 0.24 0.30 0.37  0.55   957    1
## 
## Samples were drawn using NUTS(diag_e) at Wed Jul 12 08:49:44 2017.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
sum(get_elapsed_time(kalman_walker$stanfit))
## [1] 18.3556
sum(get_elapsed_time(naive_walker$stanfit))
## [1] 127.061

With the naive implementation we get smaller effective sample sizes and a much higher computation time, as well as some indications of divergence problems, even with an increased target acceptance rate (argument adapt_delta) and maximum tree depth.
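
A rough overall efficiency comparison can be obtained by relating the effective sample sizes to the total sampling times, for example:

# effective sample size per second for the observation level standard deviation
n_eff_naive <- summary(naive_walker$stanfit, pars = "sigma_y")$summary[, "n_eff"]
n_eff_kalman <- summary(kalman_walker$stanfit, pars = "sigma_y")$summary[, "n_eff"]
n_eff_naive / sum(get_elapsed_time(naive_walker$stanfit))
n_eff_kalman / sum(get_elapsed_time(kalman_walker$stanfit))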

Discussion

In this vignette we illustrated the benefits of marginalisation in the context of dynamic regression models. The underlying idea is not new; this approach is typical especially in classic Metropolis-type algorithms for linear-Gaussian state space models, where the marginal likelihood \(p(y | \theta)\) (where \(\theta\) denotes the hyperparameters, i.e. not the latent states such as the \(\beta\)'s in the current context) is used in the computation of the acceptance probability. Here, instead of building specific MCMC machinery, we rely on the readily available Hamiltonian Monte Carlo based Stan software, thus allowing us to enjoy the benefits of the diverse tools of the Stan community.

References