Remote R cluster

Mark Edmondson

2017-09-16

This workflow takes advantage of the integration with the future package to run your local R functions on a cluster of GCE machines.
It lets you offload expensive computations by spinning up a cluster and tearing it down again once you are done.

In summary, this workflow:

  1. Creates a GCE cluster
  2. Lets you perform computations
  3. Stops the VMs

Create the cluster

The example below uses a default r-base template, but you can use the steps above to create a dynamic_template pulled from the Container Registry if required.

Instead of the more generic gce_vm(), which suits interactive use, we create the instances directly with gce_vm_container(). It does not wait for each job to complete before starting the next, which matters when you are launching many VMs. You can then use gce_get_zone_op() to check each job's status.

library(future)
library(googleComputeEngineR)

## names for your cluster
vm_names <- c("vm1","vm2","vm3")

## create the cluster using default template for r-base
## creates jobs that are creating VMs in background
jobs <- lapply(vm_names, function(x) {
    gce_vm_container(file = get_template_file("r-base"),
                     predefined_type = "n1-highmem-2",
                     name = x)
})
jobs
# [[1]]
# ==Operation insert :  PENDING
# Started:  2016-11-16 06:52:58
# [[2]]
# ==Operation insert :  PENDING
# Started:  2016-11-16 06:53:04
# [[3]]
# ==Operation insert :  PENDING
# Started:  2016-11-16 06:53:09

## check status of jobs
lapply(jobs, gce_get_zone_op)
# [[1]]
# ==Operation insert :  DONE
# Started:  2016-11-16 06:52:58
# Ended: 2016-11-16 06:53:14 
# Operation complete in 16 secs 

# [[2]]
# ==Operation insert :  DONE
# Started:  2016-11-16 06:53:04
# Ended: 2016-11-16 06:53:20 
# Operation complete in 16 secs 

# [[3]]
# ==Operation insert :  DONE
# Started:  2016-11-16 06:53:09
# Ended: 2016-11-16 06:53:30 
# Operation complete in 21 secs

## get the VM objects
vms <- lapply(vm_names, gce_vm)
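
Because the VMs are created in the background, it can help to wait until every operation reports DONE before fetching the VM objects. The following is a minimal base-R sketch of such a polling loop. wait_until_done() and fake_status() are hypothetical names, not part of googleComputeEngineR, and the demonstration uses a fake status function so that no GCE calls are made.

```r
## Hypothetical helper: poll each job until `get_status(job)` returns "DONE"
wait_until_done <- function(jobs, get_status, interval = 1, max_tries = 60) {
  pending <- rep(TRUE, length(jobs))
  for (i in seq_len(max_tries)) {
    ## only re-check jobs that are still pending
    for (j in which(pending)) {
      pending[j] <- !identical(get_status(jobs[[j]]), "DONE")
    }
    if (!any(pending)) return(invisible(TRUE))
    Sys.sleep(interval)
  }
  stop("Timed out waiting for ", sum(pending), " job(s)")
}

## Demonstration with a fake status function (no GCE calls are made):
## each fake job reports "PENDING" twice and then "DONE"
calls <- new.env()
fake_status <- function(job) {
  n <- if (is.null(calls[[job]])) 1 else calls[[job]] + 1
  assign(job, n, envir = calls)
  if (n >= 3) "DONE" else "PENDING"
}

all_done <- wait_until_done(c("job1", "job2"), fake_status, interval = 0.1)
```

With real jobs, get_status would be a function that calls gce_get_zone_op() on the job and extracts its status. The printed output above suggests the operation carries such a status, but the exact field name is an assumption to check against the returned object.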

It is safest to set up the SSH keys separately for multiple instances using gce_ssh_setup(). This is normally called for you when you first connect to a VM.

## set up SSH for the VMs
vms <- lapply(vms, gce_ssh_setup)

We now make the VM cluster as per the details given in the future README:

## make a future cluster
plan(cluster, workers = vms)

Using the cluster

The cluster is now ready to receive jobs. You can send them by simply using %<-% instead of <-.

## use %<-% to send functions to work on cluster
## See future README for details: https://github.com/HenrikBengtsson/future
a %<-% Sys.getpid()

## make a big function to run asynchronously
## (placeholder: replace the body with your own computation)
f <- function(my_data, args = NULL){
   ## ....expensive...computations that produce `result`

   result
}

## send to cluster
result %<-% f(my_data) 
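
To try the %<-% operator without starting any VMs, the same pattern can be run against local background R sessions. This is a minimal sketch that assumes the future package is installed. plan(multisession) is only a local stand-in for the GCE cluster plan, and slow_square() is a hypothetical placeholder for an expensive computation.

```r
library(future)

## Stand-in for plan(cluster, workers = vms): two local background R sessions
plan(multisession, workers = 2)

## Hypothetical placeholder for an expensive computation
slow_square <- function(x) {
  Sys.sleep(1)
  x^2
}

## Returns immediately; the work runs in a background session
y %<-% slow_square(4)

## Reading `y` blocks until the worker has finished
y

## The PID evaluated in the future belongs to a worker, not this session
worker_pid %<-% Sys.getpid()
worker_pid != Sys.getpid()

## Return to sequential evaluation and release the workers
plan(sequential)
```

The assignment itself does not block. The session only waits at the point where the variable's value is first needed.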

For long-running jobs you can use future::resolved() to check whether a job has finished.

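## check if resolved without blocking
## (futureOf() retrieves the future behind a variable created with %<-%;
##  passing the variable itself would wait for its value first)
resolved(futureOf(result))
# [1] TRUE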
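
A polling loop built on resolved() can be sketched as follows. It assumes the future package is installed and again uses a local plan(multisession) as a stand-in for the GCE cluster. The variable names and the two-second sleep are placeholders for a real long-running job.

```r
library(future)

plan(multisession, workers = 2)

## Start a slow job with %<-% and grab the future behind the variable
res %<-% {
  Sys.sleep(2)
  "done"
}
fut <- futureOf(res)

## resolved() returns TRUE or FALSE straight away, so the session
## stays free to do other work between checks
polls <- 0
while (!resolved(fut)) {
  polls <- polls + 1
  Sys.sleep(0.5)
}

## The job has finished, so reading the value no longer has to wait
out <- res

plan(sequential)
```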

Cleanup

Remember to shut down your cluster: you are charged per minute of uptime for each instance.

## shutdown instances when finished
lapply(vms, gce_vm_stop)
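
Because running instances are billed until they are stopped, it can be worth guarding the shutdown against errors in the computation. The following is a minimal base-R sketch. with_cleanup() and fake_stop() are hypothetical helpers, not part of googleComputeEngineR, and the demonstration uses stand-in VM names so that no GCE calls are made.

```r
## Hypothetical helper: run `work()` and always call `stop_fn` on every VM,
## whether `work()` succeeds or throws an error
with_cleanup <- function(vms, work, stop_fn) {
  tryCatch(
    work(),
    finally = invisible(lapply(vms, stop_fn))
  )
}

## Demonstration with stand-ins (no GCE calls are made)
stopped <- character()
fake_stop <- function(vm) stopped <<- c(stopped, vm)

## Successful run: the result is returned and both VMs are "stopped"
ok <- with_cleanup(c("vm1", "vm2"), function() 1 + 1, fake_stop)

## Failing run: the error still propagates, and the cleanup still runs
failed <- try(
  with_cleanup(c("vm3"), function() stop("computation failed"), fake_stop),
  silent = TRUE
)
```

With a real cluster the call might look like with_cleanup(vms, function() value(future(f(my_data))), gce_vm_stop), where value() waits for the result before the VMs are stopped. Following it with plan(sequential) keeps later futures from being sent to the stopped workers.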