mlbgameday: Parallel Processing

Kris Eberwein

2017-12-01

Why Parallel?

Due to the large size of the data, it was discovered that parallel processing allows us to gather and transform the data faster than with conventional single-core methods. The get_payload() function is faster by several orders of magnitude when a registered cluster is present.

The package leverages the doParallel and foreach packages for parallelism. These packages were chose due to better performance and portability. Note: There are several other parallel packages in the R universe, but mlbgameday was designed to work specifically with doParallel and foreach, other parallel packages will not work.

Number of Cores

It is recommended that we do not use all the cores in our system. It’s best to keep one or two cores open for background processes. If we are unsure of how many cores our system has, we can use the detectCores() function from doParallel.

library(doParallel)

detectCores()

Register Cluster

Once we have the number of cores we would like to use, we have to register the cluster. The example below registers a cluster of all the cores on our system minus two for background processes.

library(doParallel)

no_cores <- detectCores() - 2
cl <- makeCluster(no_cores)  
registerDoParallel(cl)

After the cluster is registered, we can find out how many “workers” doParallel plans to use.

library(doParallel)

getDoParWorkers()

Closing Clusters

It is important to tell the package to close the cluster after we are finished using it. Otherwise, the individual cores will not be released for the rest of our system to use.

library(doParallel)

stopImplicitCluster()
rm(cl)