drake

data frames in R for Make

William Michael Landau

2017-08-04

Data frames in R for Make

Drake is a workflow manager and build system for

  1. Reproducibility.
  2. High-performance computing.

Organize your work in a data frame. Then make() it.

library(drake)
load_basic_example() # Also (over)writes report.Rmd. See example_drake("basic") and vignette("quickstart").
my_plan
##                    target                                      command
## 1             'report.md'   my_knit('report.Rmd', report_dependencies)
## 2                   small                                  simulate(5)
## 3                   large                                 simulate(50)
## 4     report_dependencies      c(small, large, coef_regression2_small)
## 5       regression1_small                                  reg1(small)
## 6       regression1_large                                  reg1(large)
## 7       regression2_small                                  reg2(small)
## 8       regression2_large                                  reg2(large)
## 9  summ_regression1_small suppressWarnings(summary(regression1_small))
## 10 summ_regression1_large suppressWarnings(summary(regression1_large))
## 11 summ_regression2_small suppressWarnings(summary(regression2_small))
## 12 summ_regression2_large suppressWarnings(summary(regression2_large))
## 13 coef_regression1_small                      coef(regression1_small)
## 14 coef_regression1_large                      coef(regression1_large)
## 15 coef_regression2_small                      coef(regression2_small)
## 16 coef_regression2_large                      coef(regression2_large)
make(my_plan)
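
A workflow plan is an ordinary data frame with a target column and a command column, so you can also build one yourself with plan(). A minimal sketch (the commands are illustrative and are not run until make()):

library(drake)
# Each argument name becomes a target, and each value becomes its command.
my_datasets <- plan(
  small = simulate(5),
  large = simulate(50)
)
my_datasets # A data frame with columns `target` and `command`.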

Installation

install.packages("drake") # latest CRAN release
devtools::install_github("wlandau-lilly/drake@v3.1.0", build = TRUE) # latest GitHub release
devtools::install_github("wlandau-lilly/drake", build = TRUE) # development version

For make(..., parallelism = "Makefile"), Windows users need to download and install Rtools.

Quickstart

library(drake)
load_basic_example() # Also (over)writes report.Rmd. See example_drake("basic") and vignette("quickstart").
plot_graph(my_plan) # Hover, click, drag, zoom, pan. Try file = "graph.html" and targets_only = TRUE.
outdated(my_plan) # Which targets need to be (re)built?
missed(my_plan) # Are you missing anything from your workspace?
check(my_plan) # Are you missing files? Is your workflow plan okay?
make(my_plan) # Run the workflow.
outdated(my_plan) # Everything is up to date.
plot_graph(my_plan) # The graph also shows what is up to date.

Dive deeper into the built-in examples.

example_drake("basic") # Write the code files of the canonical tutorial.
examples_drake() # List the other examples.
vignette("quickstart") # Same as https://cran.r-project.org/package=drake/vignettes/quickstart.html

Useful functions

Besides make(), drake has many useful functions. To get to know drake:

load_basic_example()
drake_tip()
examples_drake()
example_drake()

To set up your workflow plan:

plan()
analyses()
summaries()
evaluate()
expand()
gather()
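
These helpers generate and combine plans programmatically. Below is a hedged sketch using evaluate() and gather(); the ..dataset.. wildcard and the argument names follow the v3-era quickstart vignette, so see vignette("quickstart") for the authoritative construction.

library(drake)
methods <- plan(
  regression1 = reg1(..dataset..), # ..dataset.. is a wildcard.
  regression2 = reg2(..dataset..)
)
# Substitute real targets for the wildcard: one new target per value.
my_analyses <- evaluate(methods, wildcard = "..dataset..",
  values = c("small", "large"))
# Combine all the analyses into a single list-valued target.
results <- gather(my_analyses, target = "my_results", gather = "list")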

To explore the dependency network:

outdated()
missed()
plot_graph()
dataframes_graph()
render_graph()
read_graph()
deps()
tracked()
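
deps() reports what drake's static code analysis detects, which is handy for verifying that your dependencies are picked up. A small sketch (the comments describe typical output):

library(drake)
load_basic_example()
deps(reg2)               # Dependencies of the function reg2(), e.g. "lm".
deps(my_plan$command[1]) # Dependencies of one command in the plan.
tracked(my_plan)         # Everything drake will reproducibly track.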

To interact with the cache:

clean()
cached()
imported()
built()
readd()
loadd()
find_project()
find_cache()
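
After a make(), every target lives in a hidden cache (a .drake/ folder by default). For example:

library(drake)
load_basic_example()
make(my_plan, verbose = FALSE)
cached()     # List the targets and imports in the cache.
readd(small) # Return the value of one cached target.
loadd(large) # Load cached targets directly into your workspace.
head(large)
# clean()    # Uncomment to remove everything from the cache.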

To debug your work:

check()
session()
in_progress()
progress()
config()
read_config()
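
These are most useful just before and just after a make(). For example:

library(drake)
load_basic_example()
check(my_plan) # Scan for problems (e.g. missing files) before running.
make(my_plan, verbose = FALSE)
progress()     # Build progress of the targets from the last make().
session()      # sessionInfo() of the last make().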

To speed up your project with parallel computing:

make() # with jobs > 1
max_useful_jobs()
parallelism_choices()
shell_file()

Documentation

The CRAN page links to multiple rendered vignettes.

vignette(package = "drake") # List the vignettes.
vignette("drake") # High-level intro.
vignette("quickstart") # Walk through a simple example.
vignette("caution") # Avoid common pitfalls.

Help and troubleshooting

Please refer to TROUBLESHOOTING.md on the GitHub page for instructions.

Reproducibility

There is room to improve the conversation and the landscape of reproducibility in the R and Statistics communities. At a more basic level than scientific replicability, literate programming, and version control, reproducibility carries an implicit promise that the alleged results of an analysis really do match the code. Drake helps keep this promise by tracking the relationships among the components of the analysis, a rare and effective approach that also saves time.

library(drake)
load_basic_example()
outdated(my_plan) # Which targets need to be (re)built?
make(my_plan) # Build what needs to be built.
outdated(my_plan) # Everything is up to date.
reg2 <- function(d) { # Change one of your functions.
  d$x3 <- d$x^3
  lm(y ~ x3, data = d)
}
outdated(my_plan) # Some targets depend on reg2().
plot_graph(my_plan) # Set targets_only to TRUE for smaller graphs.
make(my_plan) # Rebuild just the outdated targets.
outdated(my_plan) # Everything is up to date again.
plot_graph(my_plan) # The colors changed in the graph.

High-performance computing

Similarly to Make, drake arranges the intermediate steps of your workflow in a dependency network. This network is the key to drake’s parallel computing. For example, consider the dependency graph of the basic example.

library(drake)
load_basic_example()
make(my_plan, jobs = 2, verbose = FALSE) # Parallelize over 2 jobs.
reg2 <- function(d) { # Change a dependency.
  d$x3 <- d$x^3
  lm(y ~ x3, data = d)
}

# Skip the file argument to just plot.
# Hover, click, drag, zoom, pan.
plot_graph(my_plan, width = "100%", height = "500px", 
  file = "drake_graph.html") 
## Unloading targets from environment:
##   report_dependencies
## import 'report.Rmd'
## import c
## import summary
## import suppressWarnings
## import coef
## import knit
## import data.frame
## import rpois
## import stats::rnorm
## import lm
## import my_knit
## import simulate
## import reg1
## import reg2

When you call make(my_plan, jobs = 4), the work proceeds in chronological order from left to right. The items are built or imported column by column in sequence, and up-to-date targets are skipped. Within each column, the targets/objects are all independent of each other conditional on the previous steps, so they are distributed over the 4 available parallel jobs/workers. Assuming the targets are rate-limiting (as opposed to imported objects), the next make(..., jobs = 4) should be faster than make(..., jobs = 1), but it would be superfluous to use more than 4 jobs.

Use max_useful_jobs() to suggest an appropriate number of jobs; it accounts for which targets are already up to date. Try out the following in a fresh R session.

library(drake)
load_basic_example()
plot_graph(my_plan) # Look at the graph to make sense of the output.
max_useful_jobs(my_plan) # 8
max_useful_jobs(my_plan, imports = "files") # 8
max_useful_jobs(my_plan, imports = "all") # 10
max_useful_jobs(my_plan, imports = "none") # 8
make(my_plan)
plot_graph(my_plan)
# Ignore the targets already built.
max_useful_jobs(my_plan) # 1
max_useful_jobs(my_plan, imports = "files") # 1
max_useful_jobs(my_plan, imports = "all") # 10
max_useful_jobs(my_plan, imports = "none") # 0
# Change a function so some targets are now out of date.
reg2 <- function(d) {
  d$x3 <- d$x^3
  lm(y ~ x3, data = d)
}
plot_graph(my_plan)
max_useful_jobs(my_plan) # 4
max_useful_jobs(my_plan, imports = "files") # 4
max_useful_jobs(my_plan, imports = "all") # 10
max_useful_jobs(my_plan, imports = "none") # 4

As for how the parallelism is implemented, you can choose from multiple built-in backends, demonstrated in the sketch after this list.

  1. mclapply: low-overhead, lightweight. make(..., parallelism = "mclapply", jobs = 2) invokes parallel::mclapply() under the hood and distributes the work over at most two independent processes (set via jobs). mclapply is an ideal choice for low-overhead single-node parallelism, but it does not work on Windows.
  2. parLapply: medium-overhead, lightweight. make(..., parallelism = "parLapply", jobs = 2) invokes parallel::parLapply() under the hood. This option is similar to mclapply except that it works on Windows and costs a little extra time up front.
  3. Makefile: high-overhead, heavy-duty. make(..., parallelism = "Makefile", jobs = 2) creates a proper Makefile to distribute the work over multiple independent R sessions. With custom settings, you can distribute the R sessions over different jobs/nodes on a cluster. See the quickstart vignette for more details.
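
In code, the backend is just an argument to make(). A quick sketch with the choices above (the mclapply call assumes a non-Windows machine, and each later call skips targets that are already up to date):

library(drake)
load_basic_example()
parallelism_choices() # List the supported backends.
make(my_plan, parallelism = "mclapply", jobs = 2)  # Unix-like systems only.
make(my_plan, parallelism = "parLapply", jobs = 2) # Also works on Windows.
make(my_plan, parallelism = "Makefile", jobs = 2)  # Distributes work over separate R sessions.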