This vignette describes general best practices for creating, configuring, and running drake projects.

Where to put your code

It is best to write your code as a bunch of functions. You can save those functions in R scripts and then source() them before doing anything else.

# Load functions get_data(), analyze_data, and summarize_results()
source("my_functions.R")

Then, set up your workflow plan data frame.

good_plan <- drake_plan(
  my_data = get_data('data.csv'), # External files need to be in commands explicitly. # nolint
  my_analysis = analyze_data(my_data),
  my_summaries = summarize_results(my_data, my_analysis)
)

good_plan
##         target                                 command
## 1      my_data                    get_data('data.csv')
## 2  my_analysis                   analyze_data(my_data)
## 3 my_summaries summarize_results(my_data, my_analysis)

Drake knows that my_analysis depends on my_data because my_data is an argument to analyze_data(), which is part of the command for my_analysis.

config <- drake_config(good_plan)
vis_drake_graph(config)

Now, you can call make() to build the targets.

make(good_plan)

If your commands are really long, just put them in larger functions. Drake analyzes imported functions for non-file dependencies.

Remember: your commands are code chunks, not R scripts

Some people are accustomed to dividing their work into R scripts and then calling source() to run each step of the analysis. For example you might have the following files.

If you migrate to drake, you may be tempted to set up a workflow plan like this.

bad_plan <- drake_plan(
  my_data = source('get_data.R'),           # nolint
  my_analysis = source('analyze_data.R'),   # nolint
  my_summaries = source('summarize_data.R') # nolint
)

bad_plan
##         target                    command
## 1      my_data       source('get_data.R')
## 2  my_analysis   source('analyze_data.R')
## 3 my_summaries source('summarize_data.R')

But now, the dependency structure of your work is broken. Your R script files are dependencies, but since my_data is not mentioned in a function or command, drake does not know that my_analysis depends on it.

config <- drake_config(bad_plan)
vis_drake_graph(config)

Dangers:

  1. In the first make(bad_plan, jobs = 2), drake will try to build my_data and my_analysis at the same time even though my_data must finish before my_analysis begins.
  2. Drake is oblivious to data.csv since it is not explicitly mentioned in a workflow plan command. So when data.csv changes, make(bad_plan) will not rebuild my_data.
  3. my_analysis will not update when my_data changes.
  4. The return value of source() is formatted counter-intuitively. If source('get_data.R') is the command for my_data, then my_data will always be a list with elements "value" and "visible". In other words, source('get_data.R')$value is really what you would want.

In addition, this source()-based approach is simply inconvenient. Drake rebuilds my_data every time get_data.R changes, even when those changes are just extra comments or blank lines. On the other hand, in the previous plan that uses my_data = get_data(), drake does not trigger rebuilds when comments or whitespace in get_data() are modified. Drake is R-focused, not file-focused. If you embrace this viewpoint, your work will be easier.

R Markdown and knitr reports

Drake makes special exceptions for R Markdown reports and other knitr reports such as *.Rmd and *.Rnw files. Not every drake project needs them, but it is good practice to use them to summarize the final results of a project once all the other targets have already been built. The basic example, for instance, has an R Markdown report. report.Rmd is knitted to build report.md, which summarizes the final results.

# Load all the functions and the workflow plan data frame, my_plan.
load_basic_example() # Get the code with drake_example("basic").
## cache /tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/.drake
## connect 7 imports: tmp, simulate, reg1, my_plan, reg2, bad_plan, good_plan
## connect 15 targets: 'report.md', small, large, regression1_small, regression1...

To see where report.md will be built, look to the right of the workflow graph.

config <- drake_config(my_plan)
vis_drake_graph(config)

Drake treats knitr report as a special cases. Whenever drake sees knit() or render() (rmarkdown) mentioned in a command, it dives into the source file to look for dependencies. Consider report.Rmd, which you can view here. When drake sees readd(small) in an active code chunk, it knows report.Rmd depends on the target called small, and it draws the appropriate arrow in the workflow graph above. And if small ever changes, make(my_plan) will re-process report.Rmd to produce the target file report.md.

knitr reports are the only kind of file that drake analyzes for dependencies. It does not give R scripts the same special treatment.

## Error in file.remove(): invalid first filename