Using custom priors, likelihood, or movements in outbreaker2

Thibaut Jombart

2018-03-06

In this vignette, we show how custom functions for priors, likelihood, or movement of parameters and augmented data can be used in outbreaker2. In all these cases, the process will be similar:

  1. write your own function with the right arguments
  2. pass this function as an argument to a custom... function
  3. pass the result to outbreaker2

Note that steps 2 and 3 can be merged into a single step by passing the function directly to the arguments of outbreaker2. Also note that all priors and likelihoods are expected on a log scale. Finally, while the various custom... functions try to some extent to check that the provided functions are valid, such tests are very difficult to implement. In short: you are using these custom features at your own risk, so make sure these functions work before passing them to outbreaker2.
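As a minimal sketch of such a check, a candidate function can be evaluated on a test parameter list before it is handed over. Everything here is illustrative: is_valid_log_value is a hypothetical helper (not part of outbreaker2), param_test is a plain list standing in for a real parameter list, and f_candidate is just an example prior.

```r
## hypothetical helper, not part of outbreaker2: a log-density should be
## a single, non-missing number, and -Inf is allowed (zero probability)
is_valid_log_value <- function(x) {
  is.numeric(x) && length(x) == 1L && !is.na(x) && x < Inf
}

## stand-in for a parameter list; only pi and mu are filled in here
param_test <- list(pi = 0.9, mu = 1e-4)

## example candidate prior for pi, returned on the log scale
f_candidate <- function(param) {
  dbeta(param$pi, 10, 1, log = TRUE)
}

is_valid_log_value(f_candidate(param_test))
```

A function returning a density on the natural scale would pass this check too, so the scale itself still needs to be verified by reading the code.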


Customising priors

Each prior in outbreaker2 must be a function of an outbreaker_param list (see ?outbreaker_param). Here, we decide to use a step function rather than the default Beta function as a prior for pi, the reporting probability, and a flat prior between 0 and 1 for the mutation rate (which is technically a probability in the basic genetic model used in outbreaker2).

We start by defining two functions: an auxiliary function f which returns values on the natural scale, and which we can use for plotting the prior distribution, and then a function f_pi which will be used for the customisation.

f <- function(pi) {
    ifelse(pi < 0.8, 0, 5)
}

f_pi <- function(param) { 
    log(f(param$pi))
}

plot(f, type = "s", col = "blue", 
     xlab = expression(pi), ylab = expression(p(pi)), 
     main = expression(paste("New prior for ", pi)))
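As a quick sanity check on this step function, the sketch below verifies that f integrates to 1 over the unit interval (so that it is a proper density), and shows the values obtained once it is moved to the log scale. The evaluation points 0.5 and 0.9 are arbitrary choices for illustration.

```r
f <- function(pi) {
    ifelse(pi < 0.8, 0, 5)
}

## integrate each constant piece separately to avoid the jump at 0.8
area <- integrate(f, 0, 0.8)$value + integrate(f, 0.8, 1)$value
area

## on the log scale: -Inf below 0.8, log(5) (about 1.609) above
log(f(c(0.5, 0.9)))
```

The area is 5 x 0.2 = 1, so no further normalisation is needed before taking the log.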

While f is a useful function to visualise the prior, f_pi is the function which will be passed to outbreaker. To do so, we pass it to custom_priors:

library(outbreaker2)

f_mu <- function(param) {
  if (param$mu < 0 || param$mu > 1) {
    return(-Inf)
  } else {
    return(0.0)
  }
  
}

priors <- custom_priors(pi = f_pi, mu = f_mu)
priors
#> 
#> 
#>  ///// outbreaker custom priors ///
#> 
#> class: custom_priors list
#> number of items: 4 
#> 
#> /// custom priors set to NULL (default used) //
#> $eps
#> NULL
#> 
#> $lambda
#> NULL
#> 
#> /// custom priors //
#> $mu
#> function (param) 
#> {
#>     if (param$mu < 0 || param$mu > 1) {
#>         return(-Inf)
#>     }
#>     else {
#>         return(0)
#>     }
#> }
#> 
#> $pi
#> function (param) 
#> {
#>     log(f(param$pi))
#> }

Note that custom_priors does more than just add the custom functions to a list. For instance, the following customisations are all wrong, and rightfully rejected:


## wrong: not a function
## should be pi = function(x){0.0}
custom_priors(pi = 0.0)
#> Error in custom_priors(pi = 0): The following priors are not functions: pi

## wrong: two arguments
custom_priors(pi = function(x, y){0.0})
#> Error in custom_priors(pi = function(x, y) {: The following priors dont' have a single argument: pi

We can now use the new priors to run outbreaker on the fake_outbreak data (see introduction vignette):


dna <- fake_outbreak$dna
dates <- fake_outbreak$sample
w <- fake_outbreak$w
data <- outbreaker_data(dna = dna, dates = dates, w_dens = w)

## we set the seed to ensure results won't change
set.seed(1)


res <- outbreaker(data = data, priors = priors)

We can check the results first by looking at the traces, and then by plotting the posterior distributions of pi and mu, respectively:


plot(res)

plot(res, "pi", burnin = 500)

plot(res, "mu", burnin = 500)

plot(res, "pi", type = "density", burnin = 500)

plot(res, "mu", type = "hist", burnin = 500)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Note that we are using density plots and histograms here for illustrative purposes, but there is no reason to prefer one or the other for a specific parameter.

Interestingly, the trace of pi suggests that the MCMC oscillates between two different states, on either bound of the interval on which the prior is positive (on the log scale, the prior is -Inf for values of pi below 0.8). This may be a consequence of the step function, which causes sharp ‘cliffs’ in the posterior landscape. What should one do to derive good samples from the posterior distribution in this kind of situation? There are several options, which in fact apply to typical cases of multi-modal posterior distributions:

Because we know what the real transmission tree is for this dataset, we can assess how the new priors impacted the inference of the transmission tree.


summary(res, burnin = 500)
#> $step
#>    first     last interval  n_steps 
#>      550    10000       50      190 
#> 
#> $post
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  -699.5  -618.8  -586.0  -580.3  -547.9  -458.3 
#> 
#> $like
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  -701.2  -620.4  -587.6  -581.9  -549.5  -459.9 
#> 
#> $prior
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   1.609   1.609   1.609   1.609   1.609   1.609 
#> 
#> $mu
#>      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
#> 3.838e-05 6.505e-05 7.780e-05 8.419e-05 9.993e-05 1.770e-04 
#> 
#> $pi
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  0.8002  0.8030  0.8093  0.8254  0.8232  0.9940 
#> 
#> $tree
#>    from to time   support generations
#> 1    NA  1   -1        NA          NA
#> 2     1  2    1 0.9947368           1
#> 3     2  3    3 0.9947368           1
#> 4    NA  4    2        NA          NA
#> 5     3  5    4 0.9894737           1
#> 6     9  6    6 0.9894737           1
#> 7     4  7    5 1.0000000           1
#> 8     5  8    6 0.9631579           1
#> 9     4  9    5 0.9789474           1
#> 10    6 10    7 1.0000000           1
#> 11    7 11    7 0.5210526           1
#> 12    5 12    7 0.8210526           1
#> 13    9 13    6 1.0000000           1
#> 14    5 14    8 0.7947368           1
#> 15    5 15    7 0.8315789           1
#> 16    4 16    8 0.5052632           2
#> 17    4 17    7 0.5684211           2
#> 18    5 18    9 0.4473684           2
#> 19    9 19    9 0.9894737           2
#> 20   10 20   11 0.9789474           2
#> 21   11 21   10 0.9842105           2
#> 22   11 22   11 1.0000000           2
#> 23   13 23    9 1.0000000           2
#> 24   13 24   10 0.9947368           2
#> 25   13 25    9 0.9947368           3
#> 26   17 26    9 0.8789474           3
#> 27   17 27   11 1.0000000           3
#> 28   NA 28    9        NA          NA
#> 29   10 29   12 1.0000000           3
#> 30   13 30   11 0.9631579           3
tree <- summary(res, burnin = 500)$tree

comparison <- data.frame(case = 1:30,
                         inferred = paste(tree$from),
                         true = paste(fake_outbreak$ances),
                         stringsAsFactors = FALSE)

comparison$correct <- comparison$inferred == comparison$true
comparison
#>    case inferred true correct
#> 1     1       NA   NA    TRUE
#> 2     2        1    1    TRUE
#> 3     3        2    2    TRUE
#> 4     4       NA   NA    TRUE
#> 5     5        3    3    TRUE
#> 6     6        9    4   FALSE
#> 7     7        4    4    TRUE
#> 8     8        5    5    TRUE
#> 9     9        4    6   FALSE
#> 10   10        6    6    TRUE
#> 11   11        7    7    TRUE
#> 12   12        5    8   FALSE
#> 13   13        9    9    TRUE
#> 14   14        5    5    TRUE
#> 15   15        5    5    TRUE
#> 16   16        4    7   FALSE
#> 17   17        4    7   FALSE
#> 18   18        5    8   FALSE
#> 19   19        9    9    TRUE
#> 20   20       10   10    TRUE
#> 21   21       11   11    TRUE
#> 22   22       11   11    TRUE
#> 23   23       13   13    TRUE
#> 24   24       13   13    TRUE
#> 25   25       13   13    TRUE
#> 26   26       17   17    TRUE
#> 27   27       17   17    TRUE
#> 28   28       NA   NA    TRUE
#> 29   29       10   10    TRUE
#> 30   30       13   13    TRUE
mean(comparison$correct)
#> [1] 0.8
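The comparison above relies on paste() to turn ancestries into character strings, so that imported cases (NA) can be matched. The following sketch illustrates why, and lists the misassigned cases. To keep it self-contained, the two vectors are copied by hand from the table above rather than taken from res and fake_outbreak.

```r
## ancestries copied from the comparison table above (NA = imported case)
inferred <- c(NA, 1, 2, NA, 3, 9, 4, 5, 4, 6, 7, 5, 9, 5, 5,
              4, 4, 5, 9, 10, 11, 11, 13, 13, 13, 17, 17, NA, 10, 13)
true <- c(NA, 1, 2, NA, 3, 4, 4, 5, 6, 6, 7, 8, 9, 5, 5,
          7, 7, 8, 9, 10, 11, 11, 13, 13, 13, 17, 17, NA, 10, 13)

## comparing missing values directly yields NA rather than TRUE
NA == NA

## comparing character strings treats "NA" like any other label
correct <- paste(inferred) == paste(true)

## cases whose inferred ancestor differs from the true one
wrong_cases <- which(!correct)
wrong_cases
```

Reading these cases against the support column of the tree summary shows that a wrong ancestry is not always flagged by low support: cases 6 and 9, for instance, are misassigned with support above 0.97.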


Customising likelihood

Customising likelihood functions works in the same way as customising priors. The only difference is that custom likelihood functions take two arguments (data and param) instead of the single argument used by prior functions. The function used to specify custom likelihoods is custom_likelihoods. Each custom function corresponds to a specific likelihood component:


custom_likelihoods()
#> 
#> 
#>  ///// outbreaker custom likelihoods ///
#> 
#> class: custom_likelihoods list
#> number of items: 5 
#> 
#> /// custom likelihoods set to NULL (default used) //
#> $genetic
#> NULL
#> 
#> $reporting
#> NULL
#> 
#> $timing_infections
#> NULL
#> 
#> $timing_sampling
#> NULL
#> 
#> $contact
#> NULL

See ?custom_likelihoods for details of these components, and see the section ‘Extending the model’ for new, other components. As for custom_priors, a few checks are performed by custom_likelihoods:


## wrong: not a function
custom_likelihoods(genetic = "fubar")
#> Error in custom_likelihoods(genetic = "fubar"): The following likelihoods are not functions: genetic

## wrong: only one argument
custom_likelihoods(genetic = function(x){ 0.0 })
#> Error in custom_likelihoods(genetic = function(x) {: The following likelihoods dont' have two arguments: genetic
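The kind of check illustrated above can be sketched in base R as follows. This is not the package's implementation: check_custom_function is a hypothetical helper written for illustration only, and it simply tests that its input is a function with the expected number of arguments.

```r
## hypothetical helper, not part of outbreaker2
check_custom_function <- function(f, n_args) {
  if (!is.function(f)) {
    return("not a function")
  }
  n_found <- length(formals(f))
  if (n_found != n_args) {
    return(sprintf("expected %d argument(s), found %d",
                   as.integer(n_args), n_found))
  }
  "ok"
}

## mirrors the two rejected likelihoods above, then a valid one
check_custom_function("fubar", n_args = 2)
check_custom_function(function(x) { 0.0 }, n_args = 2)
check_custom_function(function(data, param) { 0.0 }, n_args = 2)
```

With n_args = 1, the same helper covers the checks shown earlier for priors. Note that such tests say nothing about what the function returns, which is why they cannot replace testing the function itself.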

A trivial customisation is to disable some or all of the likelihood components of the model by having them return a finite constant. Here, we apply this to two cases: first, we will disable all likelihood components as a sanity check, making sure that the transmission tree landscape is explored freely by the MCMC. Second, we will recreate the Wallinga & Teunis (2004) model, by disabling specific components.

A null model


f_null <- function(data, param) {
   return(0.0)
}

null_model <- custom_likelihoods(genetic = f_null,
                                 timing_sampling = f_null,
                                 timing_infections = f_null,
                                 reporting = f_null,
                                 contact = f_null)

null_model
#> 
#> 
#>  ///// outbreaker custom likelihoods ///
#> 
#> class: custom_likelihoods list
#> number of items: 5 
#> 
#> /// custom likelihoods //
#> $genetic
#> function (data, param) 
#> {
#>     return(0)
#> }
#> 
#> $reporting
#> function (data, param) 
#> {
#>     return(0)
#> }
#> 
#> $timing_infections
#> function (data, param) 
#> {
#>     return(0)
#> }
#> 
#> $timing_sampling
#> function (data, param) 
#> {
#>     return(0)
#> }
#> 
#> $contact
#> function (data, param) 
#> {
#>     return(0)
#> }

We also specify settings via the config argument to avoid detecting imported cases, reduce the number of iterations, and sample every iteration:


null_config <- list(find_import = FALSE,
                    n_iter = 500,
                    sample_every = 1)

set.seed(1)

res_null <- outbreaker(data = data,
                       config = null_config,
                       likelihoods = null_model)

plot(res_null)

plot(res_null, "pi")

plot(res_null, "mu")

By typical MCMC standards, these traces look appalling: they haven’t reached stationarity (i.e. same mean and variance over time), and are grossly autocorrelated in parts. This is to be expected, as these are only the first 500 iterations of the MCMC. In fact, what we observe here is literally the random walk across the posterior landscape, which in this case is shaped only by the priors.
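To see why a constant likelihood leaves only the priors at play, consider the toy random-walk Metropolis sampler below. It is a sketch written for illustration, not the sampler used in outbreaker2: the proposal standard deviation, chain length and starting value are arbitrary assumptions, and only a single parameter with an exponential prior of rate 1000 is considered.

```r
set.seed(1)

## log-prior: exponential distribution with rate 1000
log_prior <- function(mu) {
  dexp(mu, rate = 1000, log = TRUE)
}

## disabled likelihood: a finite constant on the log scale
log_like <- function(mu) {
  0.0
}

n_iter <- 20000
chain <- numeric(n_iter)
chain[1] <- 1e-4

for (i in 2:n_iter) {
  current <- chain[i - 1]
  proposal <- rnorm(1, mean = current, sd = 1e-3)

  ## the constant likelihood cancels out, leaving the ratio of priors
  log_ratio <- (log_like(proposal) + log_prior(proposal)) -
    (log_like(current) + log_prior(current))

  if (log(runif(1)) < log_ratio) {
    chain[i] <- proposal
  } else {
    chain[i] <- current
  }
}

## should be close to the prior mean, 1 / 1000
mean(chain)
```

Because the likelihood terms cancel in the acceptance ratio, the chain targets the prior itself, and negative proposals are always rejected since their log-prior is -Inf.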

We can check that transmission trees are indeed freely explored:


plot(res_null, type = "alpha")

Do not try to render the corresponding network using plot(..., type = "network"), as the force-directed layout algorithm will struggle with so many edges. However, this network can be visualised using igraph, by extracting the edges and nodes from the plot (without displaying it):


## extract nodes and edges from the visNetwork object
temp <- plot(res_null, type = "network", min_support = 0)
class(temp)
#> [1] "visNetwork" "htmlwidget"
head(temp$x$edges)
#>   from to value arrows   color
#> 1    1  2 0.004     to #CCDDFF
#> 2    1  3 0.002     to #CCDDFF
#> 3    1  4 0.008     to #CCDDFF
#> 4    1  5 0.010     to #CCDDFF
#> 5    1  6 0.004     to #CCDDFF
#> 6    1  7 0.006     to #CCDDFF
head(temp$x$nodes)
#>   id label value   color shape shaped
#> 1  1     1 0.206 #CCDDFF   dot   <NA>
#> 2  2     2 0.276 #B2D9E3   dot   <NA>
#> 3  3     3 0.266 #98D6C7   dot   <NA>
#> 4  4     4 0.336 #7ED2AC   dot   <NA>
#> 5  5     5 0.356 #99CAA9   dot   <NA>
#> 6  6     6 0.384 #C2C0AD   dot   <NA>

## make an igraph object
library(igraph)
#> 
#> Attaching package: 'igraph'
#> The following objects are masked from 'package:stats':
#> 
#>     decompose, spectrum
#> The following object is masked from 'package:base':
#> 
#>     union

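## graph_from_data_frame and layout_in_circle are the current igraph
## names for graph.data.frame and layout.circle
net_null <- graph_from_data_frame(temp$x$edges,
                                  vertices = temp$x$nodes[1:4])

plot(net_null, layout = layout_in_circle,
     main = "Null model, posterior trees")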

We can derive similar diagnostics for the number of generations between cases (kappa), only constrained by default settings to be between 1 and 5, and for the infection dates (t_inf):


plot(res_null, type = "kappa")

plot(res_null, type = "t_inf")

Finally, we can verify that the distributions of mu and pi match their priors, respectively an exponential distribution with rate 1000 and a beta distribution with parameters 10 and 1. Here, we get a qualitative assessment by comparing the observed distributions (histograms) to the densities of similarly sized random samples from the priors:


par(xpd=TRUE)
hist(res_null$mu, prob = TRUE, col = "grey",
     border = "white",
     main = "Distribution of mu")

invisible(replicate(30,
     points(density(rexp(500, 1000)), type = "l", col = "blue")))



hist(res_null$pi, prob = TRUE, col = "grey",
     border = "white", main = "Distribution of pi")

invisible(replicate(30,
     points(density(rbeta(500, 10, 1)), type = "l", col = "blue")))
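A slightly more quantitative complement to these plots is to compare sample quantiles with the theoretical quantiles of the prior. The sketch below does this for the beta prior with parameters 10 and 1. To remain self-contained, it uses direct draws from the prior as a stand-in for res_null$pi, and the probabilities in probs are arbitrary choices.

```r
set.seed(1)

## stand-in for res_null$pi: direct draws from the Beta(10, 1) prior
x <- rbeta(500, 10, 1)

## compare sample quantiles to the theoretical quantiles of the prior
probs <- c(0.05, 0.25, 0.5, 0.75, 0.95)
quantile_check <- data.frame(
  prob = probs,
  sample = as.numeric(quantile(x, probs)),
  prior = qbeta(probs, 10, 1)
)
quantile_check$abs_diff <- abs(quantile_check$sample - quantile_check$prior)
quantile_check
```

Replacing x with actual MCMC samples gives the same table for a real run, keeping in mind that autocorrelated samples carry less information than 500 independent draws.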