drake

data frames in R for Make

William Michael Landau

2017-11-05

1 Data frames in R for Make

Drake is a workflow manager and build system for

  1. Reproducibility.
  2. High-performance computing.

Organize your work in a data frame.

library(drake)
load_basic_example() # Also (over)writes report.Rmd.
my_plan              # Each target is a file (single-quoted) or object.
##                    target                                      command
## 1             'report.md'             knit('report.Rmd', quiet = TRUE)
## 2                   small                                  simulate(5)
## 3                   large                                 simulate(50)
## 4       regression1_small                                  reg1(small)
## 5       regression1_large                                  reg1(large)
## 6       regression2_small                                  reg2(small)
## 7       regression2_large                                  reg2(large)
## 8  summ_regression1_small suppressWarnings(summary(regression1_small))
## 9  summ_regression1_large suppressWarnings(summary(regression1_large))
## 10 summ_regression2_small suppressWarnings(summary(regression2_small))
## 11 summ_regression2_large suppressWarnings(summary(regression2_large))
## 12 coef_regression1_small              coefficients(regression1_small)
## 13 coef_regression1_large              coefficients(regression1_large)
## 14 coef_regression2_small              coefficients(regression2_small)
## 15 coef_regression2_large              coefficients(regression2_large)
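
The printout above shows that a plan is an ordinary data frame with a target column and a command column. As a minimal sketch, assuming a hand-built data frame in base R is an acceptable stand-in for the packaged example, the same structure can be written out directly. The name small_plan and the dependency lookup are illustrative only and are not part of drake.

```r
# Hand-built plan with the same two columns as my_plan (base R only).
small_plan <- data.frame(
  target = c("small", "regression1_small", "coef_regression1_small"),
  command = c(
    "simulate(5)",
    "reg1(small)",
    "coefficients(regression1_small)"
  ),
  stringsAsFactors = FALSE
)

# Commands are stored as text, so they can be parsed and inspected as code.
# all.vars() lists the variables a command mentions (function names excluded),
# which is one simple way to see which upstream targets a command refers to.
mentioned <- lapply(small_plan$command, function(cmd) {
  all.vars(parse(text = cmd)[[1]])
})
names(mentioned) <- small_plan$target
```

Here mentioned[["regression1_small"]] is "small", which mirrors how each command in the plan refers to upstream targets by name.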

Then make() it to build all your targets.

make(my_plan) # Run the commands to build the targets.

If a target fails, diagnose it.

failed()                 # Targets that failed in the most recent `make()`
diagnose()               # Targets that failed in any previous `make()`
error <- diagnose(large) # Most recent verbose error log of `large`
str(error)               # Object of class "error"
error$calls              # Call stack / traceback
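
To illustrate what an object of class "error" looks like, here is a base R sketch that captures a failure as a condition object instead of halting the session. It is not how drake records its error logs, and failing_command is a hypothetical stand-in for a target's command.

```r
# Hypothetical command that fails, standing in for a target's command.
failing_command <- function() stop("simulation did not converge")

# Capture the condition object rather than letting the error propagate.
captured <- tryCatch(
  failing_command(),
  error = function(e) e
)

inherits(captured, "error")  # TRUE
conditionMessage(captured)   # "simulation did not converge"
conditionCall(captured)      # The call in which stop() was invoked
```

A base R condition carries a message and a call. The object returned by diagnose() is described above as also exposing the call stack through error$calls.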

2 Installation

install.packages("drake") # latest CRAN release
devtools::install_github(
  "wlandau-lilly/drake@v4.2.0",
  build = TRUE
) # GitHub release
devtools::install_github("wlandau-lilly/drake", build = TRUE) # dev version

For make(..., parallelism = "Makefile"), Windows users need to download and install Rtools.

3 Quickstart

library(drake)
load_basic_example() # Also (over)writes report.Rmd.
plot_graph(my_plan)  # Hover, click, drag, zoom, pan. See args 'from' and 'to'.
outdated(my_plan)    # Which targets need to be (re)built?
missed(my_plan)      # Are you missing anything from your workspace?
check(my_plan)       # Are you missing files? Is your workflow plan okay?
make(my_plan)        # Run the workflow.
diagnose(large)      # View error info if the target "large" failed to build.
outdated(my_plan)    # Everything is up to date.
plot_graph(my_plan)  # The graph also shows what is up to date.

Dive deeper into the built-in examples.

example_drake("basic") # Write the code files of the canonical tutorial.
examples_drake()       # List the other examples.
vignette("quickstart") # See https://cran.r-project.org/package=drake/vignettes

4 Useful functions

make(), workplan(), failed(), and diagnose() are the most important functions. Beyond that, there are functions to learn about drake,

load_basic_example()
drake_tip()
examples_drake()
example_drake()

set up your workflow plan,

analyses()
summaries()
evaluate()
expand()
gather()
wildcard() # from the wildcard package

explore the dependency network,

outdated()
missed()
plot_graph() # Now with subgraphs too.
dataframes_graph()
render_graph()
read_graph()
deps()
knitr_deps()
tracked()

interact with the cache,

clean()
cached()
imported()
built()
readd()
loadd()
find_project()
find_cache()

make use of recorded build times,

build_times()
predict_runtime()
rate_limiting_times()

speed up your project with parallel computing,

make() # with jobs > 1
max_useful_jobs()
parallelism_choices()
shell_file()

finely tune the caching and hashing,

available_hash_algos()
cache_path()
cache_types()
configure_cache()
default_long_hash_algo()
default_short_hash_algo()
long_hash()
short_hash()
new_cache()
recover_cache()
this_cache()
type_of_cache()

and debug your work.

diagnose()
check()
session()
in_progress()
progress()
config()
read_config()

5 Documentation

The CRAN page links to multiple rendered vignettes.

vignette(package = "drake")            # List the vignettes.
vignette("drake")                      # High-level intro.
vignette("graph")                      # Visualize the workflow graph.
vignette("quickstart")                 # Walk through a simple example.
vignette("parallelism")                # Lots of parallel computing support.
vignette("storage")                    # Learn how drake stores your stuff.
vignette("timing")                     # Build times, runtime predictions.
vignette("caution")                    # Avoid common pitfalls.

6 Help and troubleshooting

Please refer to TROUBLESHOOTING.md on the GitHub page for instructions.

7 Reproducibility

There is room to improve the conversation and the landscape of reproducibility in the R and Statistics communities. At a more basic level than scientific replicability, literate programming, and version control, reproducibility carries an implicit promise that the alleged results of an analysis really do match the code. Drake helps keep this promise by tracking the relationships among the components of the analysis, a rare and effective approach that also saves time.

library(drake)
load_basic_example()
outdated(my_plan) # Which targets need to be (re)built?
make(my_plan)     # Build what needs to be built.
outdated(my_plan) # Everything is up to date.
# Change one of your functions.
reg2 <- function(d) {
  d$x3 <- d$x ^ 3
  lm(y ~ x3, data = d)
}
outdated(my_plan)   # Some targets depend on reg2().
plot_graph(my_plan) # Set targets_only to TRUE for smaller graphs.
make(my_plan)       # Rebuild just the outdated targets.
outdated(my_plan)   # Everything is up to date again.
plot_graph(my_plan) # The colors changed in the graph.

Just as it reacts to changes in imported functions like reg2(), drake reacts to changes in

  1. Other imported functions, whether user-defined or from packages.
  2. For imported functions from your environment, any nested functions also in your environment or from packages.
  3. Commands in your workflow plan data frame.
  4. Global variables mentioned in the commands or imported functions.
  5. Upstream targets.
  6. For dynamic knitr reports (with knit('your_report.Rmd') as a command in your workflow plan data frame), targets and imports mentioned in calls to readd() and loadd() in the code chunks to be evaluated. Drake treats these targets and imports as dependencies of the compiled output target (say, report.md).
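
The idea of reacting to a changed function can be sketched in base R by recording a fingerprint of the function's code and comparing it later. This is only an illustration under stated assumptions: fingerprint() is a hypothetical helper, the first definition of reg2() below is invented for the comparison, and drake's actual hashing and dependency tracking are more involved.

```r
# Hypothetical helper: fingerprint a function by its argument names
# and the deparsed text of its body.
fingerprint <- function(f) {
  paste(c(names(formals(f)), deparse(body(f))), collapse = "\n")
}

# Assumed earlier definition (not taken from the basic example).
reg2 <- function(d) {
  d$x2 <- d$x ^ 2
  lm(y ~ x2, data = d)
}
recorded <- fingerprint(reg2) # Stored at the time of the last build.

# Redefine the function, as in the example above.
reg2 <- function(d) {
  d$x3 <- d$x ^ 3
  lm(y ~ x3, data = d)
}

# Targets whose commands call reg2() directly, per the plan.
direct_dependents <- c("regression2_small", "regression2_large")

# A changed fingerprint marks the dependents as needing a rebuild.
needs_rebuild <- if (identical(recorded, fingerprint(reg2))) {
  character(0)
} else {
  direct_dependents
}
```

In this sketch needs_rebuild holds only the two direct dependents. A full implementation would also propagate the change to everything downstream of them, such as the summaries, the coefficients, and the report.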

See the quickstart vignette for demonstrations of drake’s reproducibility and reactivity. See the graph vignette for a walkthrough of the workflow plan visualizations.

vignette("graph")
vignette("quickstart")

You can enhance reproducibility beyond the scope of drake. Packrat creates a tightly-controlled local library of packages to extend the shelf life of your project. And with Docker, you can execute your project inside a container to help ensure platform independence. Together, packrat and Docker can help others reproduce your work even if they have different software and hardware.

8 High-performance computing

Like Make, drake arranges the intermediate steps of your workflow in a dependency network. This network is the key to drake’s parallel computing. For example, consider the network graph of the basic example.

library(drake)
load_basic_example()
make(my_plan, jobs = 2, verbose = FALSE) # Parallelize with 2 jobs.
# Change one of your functions.
reg2 <- function(d) {
  d$x3 <- d$x ^ 3
  lm(y ~ x3, data = d)
}
# Hover, click, drag, zoom, and pan. See args 'from' and 'to'.
plot_graph(my_plan, width = "100%", height = "500px")

When you call make(my_plan, jobs = 4), the work proceeds through the graph from left to right. Items are built or imported one column at a time, and up-to-date targets are skipped. Within each column, the targets and objects are independent of one another given the previous columns, so they are distributed over the 4 available parallel workers. Assuming the targets (rather than the imported objects) are rate-limiting, the next make(..., jobs = 4) should be faster than make(..., jobs = 1), but using more than 4 jobs would be superfluous. Use max_useful_jobs() to get a suggested number of jobs that takes into account which targets are already up to date.
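
The column-by-column ordering can be sketched in base R as a small staging computation over a dependency list. This is a simplified illustration, not drake's scheduler: stage_of() and depends_on are hypothetical names, imports are left out, and only a subset of the basic example's targets is included.

```r
# Dependencies among a few targets from the basic example
# (each name maps to the targets its command mentions).
depends_on <- list(
  small = character(0),
  large = character(0),
  regression1_small = "small",
  regression1_large = "large",
  coef_regression1_small = "regression1_small",
  coef_regression1_large = "regression1_large"
)

# Hypothetical helper: assign each target to a stage (graph column).
# A target's stage is one more than the latest stage among its dependencies.
stage_of <- function(depends_on) {
  stage <- setNames(rep(NA_integer_, length(depends_on)), names(depends_on))
  while (anyNA(stage)) {
    progressed <- FALSE
    for (target in names(depends_on)[is.na(stage)]) {
      upstream <- stage[depends_on[[target]]]
      if (!anyNA(upstream)) {
        stage[target] <- if (length(upstream)) max(upstream) + 1L else 1L
        progressed <- TRUE
      }
    }
    if (!progressed) stop("circular or unknown dependency")
  }
  stage
}

stages <- stage_of(depends_on)
columns <- split(names(stages), stages) # Targets grouped by stage.
widest <- max(lengths(columns))         # Most targets in any one stage.
```

Targets that share a stage do not depend on one another, so they could be handed to separate workers. In this reduced example every stage holds two targets, so widest is 2 and a third worker would sit idle.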

As for the implementation, you can choose from multiple built-in parallel backends, including parLapply(), mclapply(), Makefiles, and the staggering array of backends available through the future and future.batchtools packages. Please see the parallelism vignette for details.

vignette("parallelism")