Quick start

This package allows you to send function calls as jobs to a computing cluster, with a minimal interface provided by the Q function:

# load the library and create a simple function
library(clustermq)
fx = function(x) x * 2

# queue the function call on your scheduler
Q(fx, x=1:3, n_jobs=1)
# list(2,4,6)

Computations are done entirely via the network, without any temporary files on network-mounted storage, so the only strain on the file system is starting up R once per job. This also makes sending data and results around a lot quicker.

All calculations are load-balanced, i.e. workers that get their jobs done faster will also receive more function calls to work on. This is especially useful if calls take different amounts of time to return, or if one machine is under high load.
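As an illustrative sketch of this (assuming a working scheduler setup; the runtimes here are artificial), calls with uneven runtimes can be spread over two workers, and the worker whose short calls finish first keeps receiving new ones:

library(clustermq)

# one slow call and several fast ones; with two workers, the load balancer
# hands each remaining call to whichever worker is free, so the slow call
# does not hold up the rest
fx = function(x) { Sys.sleep(x); x * 2 }
Q(fx, x=c(5, rep(1, 5)), n_jobs=2)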

Installation

First, we need the ZeroMQ system library. Most likely, your package manager will provide this:

# You can skip this step on Windows and OS-X, since the rzmq binary ships with it
# On a computing cluster, we recommend using Conda or Linuxbrew
brew install zeromq # Linuxbrew, Homebrew on OS-X
conda install zeromq # Conda
sudo apt-get install libzmq3-dev # Ubuntu
sudo yum install zeromq3-devel # Fedora
pacman -S zeromq # Archlinux

Then install the clustermq package in R from CRAN (this automatically installs the rzmq package as well):

install.packages('clustermq')

Alternatively, you can use devtools to install directly from GitHub:

# install.packages('devtools')
devtools::install_github('mschubert/clustermq')
# devtools::install_github('mschubert/clustermq', ref="develop") # dev version

You should be good to go!

By default, clustermq will look for sbatch (SLURM), bsub (LSF), or qsub (SGE) in your $PATH and use the scheduler that is available. If the examples don't run out of the box, you might need to set your scheduler explicitly.

Setting up the scheduler explicitly

An HPC cluster's scheduler ensures that computing jobs are distributed to available worker nodes. Hence, this is what clustermq interfaces with in order to do computations.
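For example, the scheduler can be set explicitly via an R option (this sketch assumes SLURM; adjust the value to match your system):

# in your ~/.Rprofile or at the start of your session
options(clustermq.scheduler = "slurm") # e.g. "lsf", "sge", or "multicore" for local parallelism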

We currently support the following schedulers:

* SLURM (via sbatch)
* LSF (via bsub)
* SGE (via qsub)
* PBS/Torque (via qsub)

You can also access each of these schedulers from your local machine via the SSH connector. Results will be returned to your local session.
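A sketch of the SSH connector setup; "user@login-node" is a placeholder for your cluster's login node:

# on your local machine
options(
    clustermq.scheduler = "ssh",
    clustermq.ssh.host = "user@login-node" # placeholder; use your login node
)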

If you need specific computing environments or containers, you can activate them via the scheduler template.
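For instance, a custom template file can be referenced via an option, and the template itself can load modules or activate environments (the path and module name below are placeholders):

# point clustermq to a custom scheduler template
options(
    clustermq.scheduler = "slurm",
    clustermq.template = "~/slurm.tmpl" # placeholder path
)

# excerpt of a hypothetical ~/slurm.tmpl:
# #SBATCH --job-name={{ job_name }}
# module load R # set up your computing environment here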

Examples

The package is designed to distribute arbitrary function calls on HPC worker nodes. There are, however, a couple of caveats to observe as the R session running on a worker does not share your local memory.

The simplest example is a function call that is completely self-sufficient, with one argument (x) that we iterate through:

fx = function(x) x * 2
Q(fx, x=1:3, n_jobs=1)
## Submitting 1 worker jobs (ID: 6212) ...
## [[1]]
## [1] 2
## 
## [[2]]
## [1] 4
## 
## [[3]]
## [1] 6

Non-iterated arguments are supported by the const argument:

fx = function(x, y) x * 2 + y
Q(fx, x=1:3, const=list(y=10), n_jobs=1)
## Submitting 1 worker jobs (ID: 6855) ...
## [[1]]
## [1] 12
## 
## [[2]]
## [1] 14
## 
## [[3]]
## [1] 16

If a function relies on objects in its environment that are not passed as arguments, they can be exported using the export argument:

fx = function(x) x * 2 + y
Q(fx, x=1:3, export=list(y=10), n_jobs=1)
## Submitting 1 worker jobs (ID: 7140) ...
## [[1]]
## [1] 12
## 
## [[2]]
## [1] 14
## 
## [[3]]
## [1] 16

If we want to use a package function, we need to load the package on the worker via a library() call, or reference the function explicitly as package_name::function_name:

fx = function(x) {
    library(dplyr)
    x %>%
        mutate(area = Sepal.Length * Sepal.Width) %>%
        head()
}
Q(fx, x=list(iris), n_jobs=1)
## Submitting 1 worker jobs (ID: 6776) ...
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## [[1]]
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species  area
## 1          5.1         3.5          1.4         0.2  setosa 17.85
## 2          4.9         3.0          1.4         0.2  setosa 14.70
## 3          4.7         3.2          1.3         0.2  setosa 15.04
## 4          4.6         3.1          1.5         0.2  setosa 14.26
## 5          5.0         3.6          1.4         0.2  setosa 18.00
## 6          5.4         3.9          1.7         0.4  setosa 21.06

Usage

The Q function supports, among others, the arguments shown above: the function to call, its iterated arguments, const for constant arguments, export for objects the function relies on, and n_jobs for the number of jobs to submit. Behavior can be further fine-tuned with additional options; the full documentation is available by typing ?Q.

Comparison to other packages

There are some packages that provide high-level parallelization of R function calls on a computing cluster. A thorough comparison of features and performance is available on the wiki.

Briefly, we compare how long it takes different HPC scheduler tools to submit, run and collect function calls of negligible processing time (multiplying a numeric value by 2). This serves to quantify the maximum throughput we can reach with BatchJobs, batchtools and clustermq.

We find that BatchJobs is unable to process 10^6 calls or more, instead producing a reproducible RSQLite error. batchtools is able to process more function calls, but file-system overhead practically limits it to about 10^6 calls. clustermq has no problems processing 10^9 calls, and is still faster than batchtools at 10^6 calls.

In short, use clustermq if you want:

* fast processing of a large number of function calls with minimal overhead
* no temporary files or per-call I/O on network-mounted storage

Use batchtools if:

* you want a mature and well-tested package
* you don't mind that arguments and results are written to disc for every call

Use Snakemake (or flowr, remake, drake) if:

* you want to design and run a pipeline of different tools

Don't use batch (last updated 2013) or BatchJobs (issues with SQLite on network-mounted storage).