This vignette explores R package download trends using the cranlogs package.

Get the code.

Write the code files to your workspace.

drake_example("packages")

The new packages folder contains the file structure of a serious drake project, plus an interactive-tutorial.R script that narrates the example. The code is also available online.

Overview

This small data analysis project explores some trends in R package downloads over time. The datasets are downloaded using the cranlogs package.

library(cranlogs)
cran_downloads(packages = "dplyr", when = "last-week")
##         date count package
## 1 2018-04-02 12577   dplyr
## 2 2018-04-03 15520   dplyr
## 3 2018-04-04 16830   dplyr
## 4 2018-04-05 15695   dplyr
## 5 2018-04-06 12978   dplyr
## 7 2018-04-07     0   dplyr
## 6 2018-04-08  8140   dplyr

Above, each count is the number of times dplyr was downloaded from the RStudio CRAN mirror on the given day. To stay up to date with the latest download statistics, we need to refresh the data frequently. With drake, we can bring all our work up to date without restarting everything from scratch.

Analysis

First, we load the required packages. Drake knows about the packages you install and load.

library(drake)
library(cranlogs)
library(ggplot2)
library(knitr)
library(plyr)

We want to explore the daily downloads from these packages.

package_list <- c(
  "knitr",
  "Rcpp",
  "ggplot2"
)

We plan to use the cranlogs package. The data frames older and recent will contain the number of daily downloads for each package from the RStudio CRAN mirror.

data_plan <- drake_plan(
  older = cran_downloads(
    packages = package_list,
    from = "2016-11-01",
    to = "2016-12-01"
  ),
  recent = target(
    command = cran_downloads(
      packages = package_list,
      when = "last-month"
    ),
    trigger = "always"
  ),
  strings_in_dots = "literals"
)

data_plan
## # A tibble: 2 x 3
##   target command                                                   trigger
##   <chr>  <chr>                                                     <chr>  
## 1 older  "cran_downloads(packages = package_list, from = \"2016-1… any    
## 2 recent "cran_downloads(packages = package_list, when = \"last-m… always

Our data_plan data frame has a "trigger" column because the latest download data needs to be refreshed every day. We use triggers to force recent to always build. For more on triggers, see the vignette on debugging and testing. Instead of triggers, we could have just made recent a global variable like package_list instead of a formal target in data_plan.

We want to summarize each set of download statistics in a couple of different ways.

output_types <- drake_plan(
  averages = make_my_table(dataset__),
  plot = make_my_plot(dataset__)
)

output_types
## # A tibble: 2 x 2
##   target   command                 
##   <chr>    <chr>                   
## 1 averages make_my_table(dataset__)
## 2 plot     make_my_plot(dataset__)

We need to define functions to summarize and plot the data.

make_my_table <- function(downloads){
  ddply(downloads, "package", function(package_downloads){
    data.frame(mean_downloads = mean(package_downloads$count))
  })
}

make_my_plot <- function(downloads){
  ggplot(downloads) +
    geom_line(aes(x = date, y = count, group = package, color = package))
}
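
To see the kind of table make_my_table() produces without downloading anything, here is a minimal sketch on a made-up data frame shaped like cran_downloads() output. The name toy_downloads and its values are invented for illustration, and base R's aggregate() stands in for ddply() so the snippet runs without plyr. It is not the function defined above.

```r
# Toy data shaped like cran_downloads() output (values are made up).
toy_downloads <- data.frame(
  date = as.Date(rep(c("2018-04-02", "2018-04-03"), times = 2)),
  count = c(10, 20, 100, 300),
  package = rep(c("knitr", "Rcpp"), each = 2),
  stringsAsFactors = FALSE
)

# Base R stand-in for make_my_table(): mean daily downloads per package.
toy_table <- aggregate(count ~ package, data = toy_downloads, FUN = mean)
names(toy_table)[2] <- "mean_downloads"
toy_table
```

The result has one row per package and a mean_downloads column, which is the same shape as the averages_* targets built later.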

Below, the targets recent and older each take turns substituting the dataset__ wildcard. Thus, output_plan has four rows.

output_plan <- plan_analyses(
  plan = output_types,
  datasets = data_plan
)

output_plan
## # A tibble: 4 x 2
##   target          command              
##   <chr>           <chr>                
## 1 averages_older  make_my_table(older) 
## 2 averages_recent make_my_table(recent)
## 3 plot_older      make_my_plot(older)  
## 4 plot_recent     make_my_plot(recent)
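
The wildcard substitution can be pictured as plain string replacement. The following is a rough base R sketch of the idea, not drake's implementation. The names output_types_df, expand_row, and expanded are hypothetical and exist only for this illustration.

```r
# Names of the dataset targets that take turns replacing the wildcard.
datasets <- c("older", "recent")

# Plain data frame mirroring output_types (targets and commands as strings).
output_types_df <- data.frame(
  target = c("averages", "plot"),
  command = c("make_my_table(dataset__)", "make_my_plot(dataset__)"),
  stringsAsFactors = FALSE
)

# For one row, build one target per dataset: suffix the target name and
# substitute the dataset name for the literal text "dataset__".
expand_row <- function(i) {
  data.frame(
    target = paste(output_types_df$target[i], datasets, sep = "_"),
    command = vapply(
      datasets,
      function(d) gsub("dataset__", d, output_types_df$command[i], fixed = TRUE),
      character(1),
      USE.NAMES = FALSE
    ),
    stringsAsFactors = FALSE
  )
}

expanded <- do.call(rbind, lapply(seq_len(nrow(output_types_df)), expand_row))
expanded
```

Two output types times two datasets gives the four target and command pairs shown in output_plan.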

We plan to weave the results together in a dynamic knitr report.

report_plan <- drake_plan(
  knit(knitr_in("report.Rmd"), file_out("report.md"), quiet = TRUE)
)

report_plan
## # A tibble: 1 x 2
##   target          command                                                 
##   <chr>           <chr>                                                   
## 1 "\"report.md\"" "knit(knitr_in(\"report.Rmd\"), file_out(\"report.md\")…

Because of the mention of knitr_in() above, make() will look for dependencies inside report.Rmd (targets mentioned with loadd() or readd() in active code chunks). That way, whenever a dependency changes, drake will rebuild report.md when you call make(). For that to happen, we need report.Rmd to exist before the call to make(). For this example, you can find report.Rmd here.
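
As a toy illustration of what scanning a report for loadd() and readd() calls could look like, the sketch below applies a regular expression to a few made-up lines of chunk code. This is an assumption-laden simplification: drake's real analysis is more thorough, and chunk_lines, pattern, and report_deps are hypothetical names.

```r
# Made-up lines standing in for the active code chunks of a report.
chunk_lines <- c(
  "library(drake)",
  "loadd(averages_recent)",
  "head(readd(plot_older))"
)

# Match loadd(<name>) or readd(<name>) and capture the target name.
pattern <- "(loadd|readd)\\(([A-Za-z0-9_.]+)\\)"

# Keep only the matching pieces, then reduce each to the captured name.
matched <- regmatches(chunk_lines, regexpr(pattern, chunk_lines))
report_deps <- sub(pattern, "\\2", matched)
report_deps
```

The first line contains neither call, so it contributes no dependency.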

Now, we complete the workflow plan data frame by concatenating the results together. Drake analyzes the plan to figure out the dependency network, so row order does not matter.

whole_plan <- bind_plans(
  data_plan,
  output_plan,
  report_plan
)

whole_plan
## # A tibble: 7 x 3
##   target          command                                          trigger
##   <chr>           <chr>                                            <chr>  
## 1 older           "cran_downloads(packages = package_list, from =… any    
## 2 recent          "cran_downloads(packages = package_list, when =… always 
## 3 averages_older  make_my_table(older)                             any    
## 4 averages_recent make_my_table(recent)                            any    
## 5 plot_older      make_my_plot(older)                              any    
## 6 plot_recent     make_my_plot(recent)                             any    
## 7 "\"report.md\"" "knit(knitr_in(\"report.Rmd\"), file_out(\"repo… any
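
To illustrate why row order does not matter, the sketch below recovers dependencies from the commands themselves by checking which target names appear as symbols in each command. It is a simplified base R illustration, not drake's actual static code analysis. The names toy_plan, find_deps, and deps are hypothetical.

```r
# A small plan with rows deliberately out of dependency order.
toy_plan <- data.frame(
  target = c("averages_older", "older", "plot_older"),
  command = c(
    "make_my_table(older)",
    "cran_downloads(packages = package_list, from = \"2016-11-01\", to = \"2016-12-01\")",
    "make_my_plot(older)"
  ),
  stringsAsFactors = FALSE
)

# Parse a command (without running it) and return the targets it mentions.
find_deps <- function(command, targets) {
  symbols <- all.names(parse(text = command)[[1]])
  intersect(targets, symbols)
}

deps <- lapply(toy_plan$command, find_deps, targets = toy_plan$target)
names(deps) <- toy_plan$target
deps
```

Both averages_older and plot_older are found to depend on older even though older is listed after averages_older, which is why the build order can be derived from the commands rather than from the rows.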

Now, we run the project to download the data and analyze it. The results will be summarized in the knitted report, report.md, but you can also read the results directly from the cache.

make(whole_plan)
## target older
## target recent: trigger "always"
## target averages_older
## target averages_recent
## target plot_older
## target plot_recent
## target file "report.md"
## Used non-default triggers. Some targets may not be up to date.

readd(averages_recent)
##   package mean_downloads
## 1    Rcpp      21870.967
## 2 ggplot2      15225.633
## 3   knitr       9980.433

readd(averages_older)
##   package mean_downloads
## 1    Rcpp       14408.06
## 2 ggplot2       14641.29
## 3   knitr        9068.71

readd(plot_recent)

[Figure: plot_recent, a line plot of daily download counts by date, one colored line per package]


readd(plot_older)

[Figure: plot_older, a line plot of daily download counts by date, one colored line per package]

Because we used triggers, each make() rebuilds the recent target to get the latest download numbers for today. If the newly-downloaded data are the same as last time and nothing else changes, drake skips all the other targets.

make(whole_plan)
## Unloading targets from environment:
##   averages_recent
##   averages_older
##   plot_older
##   plot_recent
## target recent: trigger "always"
## Used non-default triggers. Some targets may not be up to date.
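
The skipping behavior can be pictured with a toy cache: a target is always re-fetched, but downstream work is needed only if the fetched value differs from the stored one. This is a conceptual base R sketch under that assumption. drake's own bookkeeping lives in its cache and tracks more than target values, and toy_cache and update_target() are hypothetical names.

```r
# Toy cache: an environment holding the last value stored for each target.
toy_cache <- new.env()

# Store new_value under name and report whether it differs from the cached copy.
update_target <- function(name, new_value) {
  is_new <- !exists(name, envir = toy_cache, inherits = FALSE)
  changed <- is_new || !identical(get(name, envir = toy_cache), new_value)
  if (changed) assign(name, new_value, envir = toy_cache)
  changed
}

first <- update_target("recent", c(10, 20, 30))   # nothing cached yet
second <- update_target("recent", c(10, 20, 30))  # same data as before
third <- update_target("recent", c(10, 20, 31))   # data changed
c(first, second, third)
```

Only the first and third calls report a change, so only those would require rebuilding anything that depends on recent.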

To visualize the build behavior, plot the dependency network. Target recent and everything that depends on it are always out of date because of the "always" trigger. If you rerun the project tomorrow, the recent dataset will have shifted one day forward, so make() will refresh averages_recent, plot_recent, and report.md. Targets averages_older and plot_older should be unaffected, so drake will skip them.

config <- drake_config(whole_plan)
vis_drake_graph(config)

What about remote data sources in general?

When you rely on data from the internet, you should trigger a new download when the data change remotely. This section of the best practices guide explains how to automatically refresh the data when the online timestamp changes.