Graphs with drake

Visualize your workflow.

William Michael Landau

2017-11-05

Drake has powerful visuals to help you understand and plan your workflow. The workflow plan graph is interactive. Click, drag, hover, zoom, and pan. Use either the mouse or the green buttons near the bottom.

1 Dependency reactivity

Initially, your entire project is out of date.

library(drake)
load_basic_example()
plot_graph(my_plan)

In the graph above, every target is out of date. After a make(), everything is up to date, and the graph reflects that.

make(my_plan, jobs = 4)
plot_graph(my_plan)

But when you change a dependency, the targets that depend on it fall out of date until you call make(my_plan) again.

reg2 <- function(d){
  d$x3 <- d$x ^ 3
  lm(y ~ x3, data = d)
}
plot_graph(my_plan)

2 Subgraphs

Graphs can grow enormous for serious projects, so there are multiple ways to focus on a manageable subgraph. The most brute-force way is to just pick a manual subset of nodes to show. However, with the subset argument, plot_graph() is prone to losing intermediate nodes and thus dropping edges.

plot_graph(my_plan, subset = c("regression2_small", "'report.md'"))

The other subgraph functionality is much better at preserving connectedness. Use targets_only to ignore the imports.

plot_graph(my_plan, targets_only = TRUE)

Similarly, you can just show downstream nodes.

plot_graph(my_plan, from = c("regression2_small", "regression2_large"))

Or upstream ones.

plot_graph(my_plan, from = "small", mode = "in")

In fact, let’s just take a small neighborhood around a target in both directions.

plot_graph(my_plan, from = "small", mode = "all", order = 1)

The report.md node is pulled in closer, but it still sits farthest to the right in order to communicate drake’s parallel computing strategy.
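The from, mode, and order arguments used above behave like a neighborhood search on the dependency graph. The sketch below illustrates that idea in base R. It is not drake's implementation: the edge list is written by hand, and neighborhood() is a hypothetical helper introduced only for this example.

```r
# Hypothetical edge list (from -> to), written by hand for illustration.
# It is not extracted from drake or from my_plan.
edges <- data.frame(
  from = c("simulate", "small", "small", "reg2", "regression2_small"),
  to   = c("small", "regression1_small", "regression2_small",
           "regression2_small", "summ_regression2_small"),
  stringsAsFactors = FALSE
)

# Breadth-first search out to `order` steps from the starting nodes.
# mode = "out" follows edges downstream, "in" follows them upstream,
# and "all" goes in both directions.
neighborhood <- function(edges, from, mode = c("out", "in", "all"),
                         order = Inf) {
  mode <- match.arg(mode)
  visited <- from
  frontier <- from
  step <- 0
  while (length(frontier) > 0 && step < order) {
    nxt <- character(0)
    if (mode %in% c("out", "all")) {
      nxt <- c(nxt, edges$to[edges$from %in% frontier])
    }
    if (mode %in% c("in", "all")) {
      nxt <- c(nxt, edges$from[edges$to %in% frontier])
    }
    # Keep only nodes that have not been seen yet.
    frontier <- setdiff(unique(nxt), visited)
    visited <- c(visited, frontier)
    step <- step + 1
  }
  visited
}

downstream <- neighborhood(edges, from = "small", mode = "out")
upstream   <- neighborhood(edges, from = "small", mode = "in")
nearby     <- neighborhood(edges, from = "small", mode = "all", order = 1)
```

In this toy graph, downstream reaches everything built from small, upstream reaches only simulate, and nearby stops after one step in either direction.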

3 Parallel computing laid bare

Drake shows its parallel computing strategy plainly in the graph.

When you call make(my_plan, jobs = 4), the work proceeds in chronological order from left to right in the above graph. The items are built or imported column by column in sequence, and up-to-date targets are skipped. Within each column, the targets/objects are all independent of each other conditional on the previous steps, so they are distributed over the 4 available parallel jobs/workers. Assuming the targets are rate-limiting (as opposed to imported objects), the next make(..., jobs = 4) should be faster than make(..., jobs = 1), but it would be superfluous to use more than 4 jobs.
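The column-by-column idea can be sketched in base R: give each node a stage one greater than the latest stage among its dependencies, so that nodes in the same stage never depend on one another. This is only an illustration under stated assumptions. The dependency list is hand-written, stage_of() is a hypothetical helper, and drake may compute its stages differently.

```r
# Hypothetical dependency graph: names are nodes, values are their direct
# dependencies. Hand-written for illustration, not extracted from drake.
deps <- list(
  simulate = character(0),
  reg1 = character(0),
  small = "simulate",
  large = "simulate",
  regression1_small = c("reg1", "small"),
  regression1_large = c("reg1", "large"),
  report = c("regression1_small", "regression1_large")
)

# Assign each node to a stage: one more than the latest stage among its
# dependencies. Nodes without dependencies go in stage 1.
stage_of <- function(deps) {
  stage <- setNames(rep(NA_integer_, length(deps)), names(deps))
  while (anyNA(stage)) {
    progressed <- FALSE
    for (node in names(deps)) {
      if (!is.na(stage[[node]])) next
      parents <- stage[deps[[node]]]
      if (anyNA(parents)) next  # some dependency is not staged yet
      stage[[node]] <- if (length(parents) == 0) 1L else max(parents) + 1L
      progressed <- TRUE
    }
    if (!progressed) stop("cycle or unknown dependency in the graph")
  }
  stage
}

stages <- stage_of(deps)
by_stage <- split(names(stages), stages)  # nodes grouped by column
widest <- max(table(stages))              # size of the largest stage
```

Under this staging scheme, the size of the largest stage (widest) is the most jobs that can be busy at once, which is why extra jobs beyond that number add nothing.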

The division of targets into parallelizable stages depends on the kind of parallelism you use. Even the small workflow plan below is affected.

f <- function(x){
  x
}
small_plan <- workplan(a = 1, b = f(2))
small_plan
##   target command
## 1      a       1
## 2      b    f(2)
plot_graph(small_plan)

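With the default parallelism, imports and targets are staged together, so b must wait for the import f and lands in a later stage than a. However, for any kind of distributed parallelism option such as "Makefile" or "future_lapply", all the imports are processed before any of the targets are built. For the small workflow, this puts both targets in the same parallelizable stage.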

plot_graph(small_plan, parallelism = "future_lapply")
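The effect on small_plan can be reproduced with a base R sketch. It assumes that local staging treats imports and targets as one graph, while distributed staging handles the imports up front and then stages the targets alone. The deps list, the is_import flags, and stage_of() are hypothetical stand-ins written for this example, not drake internals.

```r
# Hand-written stand-in for the small workflow: f is an import,
# a and b are targets, and b depends on f.
deps <- list(f = character(0), a = character(0), b = "f")
is_import <- c(f = TRUE, a = FALSE, b = FALSE)

# Stage of a node = 1 + the latest stage among its dependencies.
stage_of <- function(deps) {
  depth <- function(node) {
    parents <- deps[[node]]
    if (length(parents) == 0) return(1L)
    1L + max(vapply(parents, depth, integer(1)))
  }
  vapply(names(deps), depth, integer(1))
}

# Local-style staging: imports and targets share one graph.
local_stages <- stage_of(deps)

# Distributed-style staging: imports are already processed, so drop them
# and stage only the targets against each other.
targets <- names(deps)[!is_import[names(deps)]]
target_deps <- lapply(deps[targets], intersect, targets)
distributed_stages <- stage_of(target_deps)
```

Here local_stages places b one stage after a because b waits on f, while distributed_stages places a and b together in the first stage.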

You can list the distributed backends quickly, or you can read the parallelism vignette.

parallelism_choices()
## [1] "parLapply"     "mclapply"      "Makefile"      "future_lapply"
parallelism_choices(distributed_only = TRUE)
## [1] "Makefile"      "future_lapply"

The help file of parallelism_choices() is particularly detailed.

?parallelism_choices

4 Finer control

We have only scratched the surface of plot_graph(). Much more functionality is documented in the help file (?plot_graph). In addition, dataframes_graph() outputs a list of nodes, edges, and legend nodes that you can modify and then feed right into your own visNetwork graph.
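The modify-then-plot step might look like the sketch below. To stay self-contained, it builds a small stand-in list by hand instead of calling dataframes_graph(). The element names (nodes, edges) and the columns (id, label, color, from, to) follow visNetwork's conventions and are assumptions about what drake returns, so check ?dataframes_graph for the actual structure.

```r
# Stand-in for the list that dataframes_graph() is described as returning.
# Element and column names are assumptions based on visNetwork's
# conventions, not a copy of drake's actual output.
graph <- list(
  nodes = data.frame(
    id = c("f", "a", "b"),
    label = c("f", "a", "b"),
    color = c("blue", "gray", "gray"),
    stringsAsFactors = FALSE
  ),
  edges = data.frame(from = "f", to = "b", stringsAsFactors = FALSE)
)

# Modify the node table before plotting: highlight one target.
graph$nodes$color[graph$nodes$id == "b"] <- "red"

# Build the widget only if visNetwork is installed.
# Call print(widget) to display it.
widget <- NULL
if (requireNamespace("visNetwork", quietly = TRUE)) {
  widget <- visNetwork::visNetwork(nodes = graph$nodes, edges = graph$edges)
}
```

With a real project, the same edits would be applied to the data frames returned by dataframes_graph(my_plan) before passing them to visNetwork.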