This vignette describes general best practices for creating, configuring, and running drake projects. It answers frequently asked questions, clears up common misconceptions, and will continue to develop in response to community feedback.

How to organize your files

Examples

For examples of how to structure your code files, see the beginner-oriented example projects: basic, gsp, and packages. You can generate the code files for these examples with the drake_example() function.

drake_example("basic")
drake_example("gsp")
drake_example("packages")

In practice, you do not need to organize your files the way the examples do, but it does happen to be a reasonable way of doing things.

Where do you put your code?

It is best to write your code as a bunch of functions. You can save those functions in R scripts and then source() them before doing anything else.

# Load functions get_data(), analyze_data(), and summarize_results()
source("my_functions.R")

Then, set up your workflow plan data frame.

good_plan <- drake_plan(
  my_data = get_data(file_in("data.csv")), # External files need to be in commands explicitly. # nolint
  my_analysis = analyze_data(my_data),
  my_summaries = summarize_results(my_data, my_analysis)
)
## Warning: Converting double-quotes to single-quotes because the
## `strings_in_dots` argument is missing. Use the file_in(), file_out(), and
## knitr_in() functions to work with files in your commands. To remove this
## warning, either call `drake_plan()` with `strings_in_dots = "literals"` or
## use `pkgconfig::set_config("drake::strings_in_dots" = "literals")`.

good_plan
## # A tibble: 3 x 2
##   target       command                                
##   <chr>        <chr>                                  
## 1 my_data      get_data(file_in('data.csv'))          
## 2 my_analysis  analyze_data(my_data)                  
## 3 my_summaries summarize_results(my_data, my_analysis)

Drake knows that my_analysis depends on my_data because my_data is an argument to analyze_data(), which is part of the command for my_analysis.

config <- drake_config(good_plan)
vis_drake_graph(config)

Now, you can call make() to build the targets.

make(good_plan)

If your commands are really long, just put them in larger functions. Drake analyzes imported functions for non-file dependencies.
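
For instance, here is a sketch of what analyze_data() might look like internally; the helper names clean_data() and fit_model() are hypothetical, not part of drake. Because drake analyzes imported functions, it knows analyze_data() depends on both helpers.

clean_data <- function(data) {
  data[complete.cases(data), ] # drop rows with missing values
}
fit_model <- function(data) {
  lm(y ~ x, data = data) # a simple placeholder model
}
analyze_data <- function(my_data) {
  cleaned <- clean_data(my_data) # long pipelines can hide behind one function
  fit_model(cleaned)
}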

Remember: your commands are code chunks, not R scripts

Some people are accustomed to dividing their work into R scripts and then calling source() to run each step of the analysis. For example, you might have the scripts get_data.R, analyze_data.R, and summarize_data.R.

If you migrate to drake, you may be tempted to set up a workflow plan like this.

bad_plan <- drake_plan(
  my_data = source(file_in("get_data.R")),
  my_analysis = source(file_in("analyze_data.R")),
  my_summaries = source(file_in("summarize_data.R"))
)
## Warning: Converting double-quotes to single-quotes because the
## `strings_in_dots` argument is missing. Use the file_in(), file_out(), and
## knitr_in() functions to work with files in your commands. To remove this
## warning, either call `drake_plan()` with `strings_in_dots = "literals"` or
## use `pkgconfig::set_config("drake::strings_in_dots" = "literals")`.

bad_plan
## # A tibble: 3 x 2
##   target       command                            
##   <chr>        <chr>                              
## 1 my_data      source(file_in('get_data.R'))      
## 2 my_analysis  source(file_in('analyze_data.R'))  
## 3 my_summaries source(file_in('summarize_data.R'))

But now, the dependency structure of your work is broken. Your R script files are dependencies, but since my_data is not mentioned in a function or command, drake does not know that my_analysis depends on it.

config <- drake_config(bad_plan)
vis_drake_graph(config)

Dangers:

  1. In the first call to make(bad_plan, jobs = 2), drake will try to build my_data and my_analysis at the same time even though my_data must finish before my_analysis begins.
  2. Drake is oblivious to data.csv since it is not explicitly mentioned in a workflow plan command. So when data.csv changes, make(bad_plan) will not rebuild my_data.
  3. my_analysis will not update when my_data changes.
  4. The return value of source() is formatted counter-intuitively (see the small demonstration after this list). If source(file_in("get_data.R")) is the command for my_data, then my_data will always be a list with elements "value" and "visible". In other words, source(file_in("get_data.R"))$value is really what you would want.
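
Here is a minimal demonstration of that return value, using a throwaway expression in place of a real script.

res <- source(textConnection("1 + 1")) # stand-in for source("get_data.R")
str(res) # a list with elements $value (here, 2) and $visible (TRUE)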

In addition, this source()-based approach is simply inconvenient. Drake rebuilds my_data every time get_data.R changes, even when those changes are just extra comments or blank lines. On the other hand, in the previous plan that uses my_data = get_data(), drake does not trigger rebuilds when comments or whitespace in get_data() are modified. Drake is R-focused, not file-focused. If you embrace this viewpoint, your work will be easier.

R Markdown and knitr reports

For a serious project, you should use drake's make() function outside knitr. In other words, treat R Markdown reports and other knitr documents as targets and imports, not as a way to run make(). Viewed as targets, knitr files such as *.Rmd and *.Rnw get special treatment from drake. Not every drake project needs them, but it is good practice to use one to summarize the final results of a project once all the other targets have been built. The basic example, for instance, has an R Markdown report: report.Rmd is knitted to build report.md, which summarizes the final results.

# Load all the functions and the workflow plan data frame, my_plan.
load_basic_example() # Get the code with drake_example("basic").

To see where report.md will be built, look to the right of the dependency graph.

config <- drake_config(my_plan)
vis_drake_graph(config)

Drake treats knitr reports as special cases. Whenever drake sees knit() or render() (from the rmarkdown package) mentioned in a command, it dives into the source file to look for dependencies. Consider report.Rmd from the basic example. When drake sees readd(small) in an active code chunk, it knows report.Rmd depends on the target called small, and it draws the appropriate arrow in the dependency graph above. And if small ever changes, make(my_plan) will re-process report.Rmd to produce the target file report.md.
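
For reference, the report target in such a plan might look like the sketch below. drake notices the knit() call, registers report.Rmd (via knitr_in()) and report.md (via file_out()), and scans the report's active code chunks for readd() and loadd() calls.

report_plan <- drake_plan(
  report = knit(knitr_in("report.Rmd"), file_out("report.md"), quiet = TRUE),
  strings_in_dots = "literals"
)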

knitr reports are the only kind of file that drake analyzes for dependencies. It does not give R scripts the same special treatment.

Workflows as R packages

The R package structure is a great way to organize the files of your project. Writing your own package to contain your data science workflow is a good idea, but you will need to

  1. Use expose_imports() to properly account for all your nested function dependencies, and
  2. If you load the package with devtools::load_all(), set the prework argument of make(), e.g. make(prework = "devtools::load_all()"), as sketched after this list.
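
Putting these two points together, a session for a hypothetical workflow package called yourWorkflowPackage (with a plan object your_plan) might look like this sketch.

library(drake)
devtools::load_all()                # load your workflow package from source
expose_imports(yourWorkflowPackage) # expose its nested functions to drake
make(
  your_plan,
  prework = "devtools::load_all()"  # re-load the package before targets build
)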

Thanks to Jasper Clarkberg for the workaround behind expose_imports().

Advantages of putting workflows in R packages

The problem

For drake, there is one problem: nested functions. Drake always looks for imported functions nested in other imported functions, but only in your environment. When it sees a function from a package, it does not search that function's body for further imports.

To see this, consider the digest() function from the digest package. The digest package is a utility for computing hashes, not a data science workflow, but I will use it to demonstrate how drake treats imports from packages.

library(digest)
g <- function(x){
  digest(x)
}
f <- function(x){
  g(x)
}
plan <- drake_plan(x = f(1))

# Here are the reproducibly tracked objects in the workflow.
tracked(plan)
## [1] "g"      "digest" "f"      "x"

# But the `digest()` function has dependencies too.
# Because `drake` knows `digest()` is from a package,
# it ignores these dependencies by default.
head(deps(digest), 10)
##  [1] ".Call"           ".errorhandler"   "any"            
##  [4] "as.integer"      "as.raw"          "base::serialize"
##  [7] "digest_impl"     "file.access"     "file.exists"    
## [10] "file.info"

The solution

To force drake to dive deeper into the nested functions in a package, you must use expose_imports(). Again, I demonstrate with the digest package, but you should really only do this with a package you write yourself to contain your workflow. For external packages, packrat is a much better solution for package reproducibility.

expose_imports(digest)
## <environment: R_GlobalEnv>
new_objects <- tracked(plan)
head(new_objects, 10)
##  [1] "digest"          "warning"         "as.raw"         
##  [4] ".Call"           ".errorhandler"   "any"            
##  [7] "as.integer"      "base::serialize" "digest_impl"    
## [10] "file.access"
length(new_objects)
## [1] 32

# Now when you call `make()`, `drake` will dive into `digest`
# to import dependencies.

cache <- storr::storr_environment() # just for examples
make(plan, cache = cache)
## target x
head(cached(cache = cache), 10)
##  [1] "any"             "as.integer"      "as.raw"         
##  [4] "base::serialize" "digest"          "digest_impl"    
##  [7] "f"               "file.access"     "file.exists"    
## [10] "file.info"
length(cached(cache = cache))
## [1] 30

Generating workflow plan data frames

Drake has a family of functions for generating workflow plan data frames (the plan argument of make(), where you list your targets and commands), including drake_plan() and evaluate_plan(), both used below.

Except for drake_plan(), they all use wildcards as templates. For example, suppose your workflow checks several metrics of several schools. The idea is to write a workflow plan with your metrics and let the wildcard templating expand over the available schools.

hard_plan <- drake_plan(
  credits = check_credit_hours(school__),
  students = check_students(school__),
  grads = check_graduations(school__),
  public_funds = check_public_funding(school__)
)

evaluate_plan(
  hard_plan,
  rules = list(school__ = c("schoolA", "schoolB", "schoolC"))
)
## # A tibble: 12 x 2
##    target               command                      
##    <chr>                <chr>                        
##  1 credits_schoolA      check_credit_hours(schoolA)  
##  2 credits_schoolB      check_credit_hours(schoolB)  
##  3 credits_schoolC      check_credit_hours(schoolC)  
##  4 students_schoolA     check_students(schoolA)      
##  5 students_schoolB     check_students(schoolB)      
##  6 students_schoolC     check_students(schoolC)      
##  7 grads_schoolA        check_graduations(schoolA)   
##  8 grads_schoolB        check_graduations(schoolB)   
##  9 grads_schoolC        check_graduations(schoolC)   
## 10 public_funds_schoolA check_public_funding(schoolA)
## 11 public_funds_schoolB check_public_funding(schoolB)
## 12 public_funds_schoolC check_public_funding(schoolC)

But what if some metrics do not make sense? For example, what if schoolC is a completely privately-funded school? With no public funds, check_public_funding(schoolC) may quit in error if we are not careful. This is where setting up workflow plans gets tricky. You may need to use multiple wildcards and make sure some combinations of values are left out.

library(magrittr)
rules_grid <- tibble::tibble(
  school_ =  c("schoolA", "schoolB", "schoolC"),
  funding_ = c("public", "public", "private")
) %>%
  tidyr::crossing(cohort_ = c("2012", "2013", "2014", "2015")) %>%
  dplyr::filter(!(school_ == "schoolB" & cohort_ %in% c("2012", "2013"))) %>%
  print()
## # A tibble: 10 x 3
##    school_ funding_ cohort_
##    <chr>   <chr>    <chr>  
##  1 schoolA public   2012   
##  2 schoolA public   2013   
##  3 schoolA public   2014   
##  4 schoolA public   2015   
##  5 schoolB public   2014   
##  6 schoolB public   2015   
##  7 schoolC private  2012   
##  8 schoolC private  2013   
##  9 schoolC private  2014   
## 10 schoolC private  2015

Then, alternately choose expand = TRUE and expand = FALSE when evaluating the wildcards.

drake_plan(
  credits = check_credit_hours("school_", "funding_", "cohort_"),
  students = check_students("school_", "funding_", "cohort_"),
  grads = check_graduations("school_", "funding_", "cohort_"),
  public_funds = check_public_funding("school_", "funding_", "cohort_"),
  strings_in_dots = "literals"
) %>% evaluate_plan(
    wildcard = "school_",
    values = rules_grid$school_,
    expand = TRUE
  ) %>%
  evaluate_plan(
    wildcard = "funding_",
    rules = rules_grid,
    expand = FALSE
  ) %>%
  DT::datatable()

Thanks to Alex Axthelm for this example in issue 235.

Remote data sources

Some workflows rely on remote data from the internet, and the workflow needs to refresh when the datasets change. As an example, let us consider the download logs of CRAN packages.

library(drake)
library(R.utils) # For unzipping the files we download.
library(curl)    # For downloading data.
library(httr)    # For querying websites.

url <- "http://cran-logs.rstudio.com/2018/2018-02-09-r.csv.gz"

How do we know when the data at the URL changed? We get the time that the file was last modified. (Alternatively, we could use an HTTP ETag.)

query <- HEAD(url)
timestamp <- query$headers[["last-modified"]]
timestamp
## [1] "Mon, 12 Feb 2018 16:34:48 GMT"

In our workflow plan, the timestamp is a target and a dependency. When the timestamp changes, so does everything downstream.

cranlogs_plan <- drake_plan(
  timestamp = HEAD(url)$headers[["last-modified"]],
  logs = get_logs(url, timestamp),
  strings_in_dots = "literals"
)
cranlogs_plan
## # A tibble: 2 x 2
##   target    command                                 
##   <chr>     <chr>                                   
## 1 timestamp "HEAD(url)$headers[[\"last-modified\"]]"
## 2 logs      get_logs(url, timestamp)

To make sure we always have the latest timestamp, we use the "always" trigger. (See the debugging vignette for more on triggers.)

cranlogs_plan$trigger <- c("always", "any")
cranlogs_plan
## # A tibble: 2 x 3
##   target    command                                  trigger
##   <chr>     <chr>                                    <chr>  
## 1 timestamp "HEAD(url)$headers[[\"last-modified\"]]" always 
## 2 logs      get_logs(url, timestamp)                 any

Lastly, we define the get_logs() function, which actually downloads the data.

# The ... is just so we can write dependencies as function arguments
# in the workflow plan.
get_logs <- function(url, ...){
  curl_download(url, "logs.csv.gz")       # Get a big file.
  gunzip("logs.csv.gz", overwrite = TRUE) # Unzip it.
  out <- read.csv("logs.csv", nrows = 4)  # Extract the data you need.
  unlink(c("logs.csv.gz", "logs.csv"))    # Remove the big files
  out                                     # Value of the target.
}

When we are ready, we run the workflow.

make(cranlogs_plan)
## Unloading targets from environment:
##   timestamp
## target timestamp: trigger "always"
## target logs
## Used non-default triggers. Some targets may not be up to date.

readd(logs)
##         date     time     size version  os country ip_id
## 1 2018-02-09 13:01:13 82375220   3.4.3 win      RO     1
## 2 2018-02-09 13:02:06 74286541   3.3.3 win      US     2
## 3 2018-02-09 13:02:10 82375216   3.4.3 win      US     3
## 4 2018-02-09 13:03:30 82375220   3.4.3 win      IS     4