Tidy distributed grid search in R and Sun/Open Grid Engine
devtools::install_github("patr1ckm/distributr")
The basic function is grid_apply, which applies a function over a grid of its arguments, built as in expand.grid(...), and returns the results in a list. Function applications can be replicated and run in parallel.
do.one <- function(n, mu, sd){ mean(rnorm(n, mu, sd)) }
sim <- grid_apply(do.one, n = c(50, 100, 500), mu = c(1,5), sd = c(1, 5, 10),
.reps=50, .mc.cores=5)
[[1]]
[1] 1.053669
[[2]]
[1] 1.244468
[[3]]
[1] 1.267939
[[4]]
[1] 1.157546
[[5]]
[1] 0.786027
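To make the idea of applying a function over an argument grid concrete, here is a minimal base-R sketch. It is not the distributr implementation: the variable reps is a hypothetical stand-in for .reps, and the nested-list layout of the results is an assumption made for illustration.

```r
# Illustrative base-R sketch; not how distributr is implemented.
do.one <- function(n, mu, sd) mean(rnorm(n, mu, sd))

# Build the argument grid: one row per combination of argument values.
grid <- expand.grid(n = c(50, 100, 500), mu = c(1, 5), sd = c(1, 5, 10))

reps <- 3  # hypothetical stand-in for .reps

set.seed(1)
# For each grid row, call do.one `reps` times with that row's arguments.
results <- lapply(seq_len(nrow(grid)), function(i) {
  args <- as.list(grid[i, , drop = FALSE])
  replicate(reps, do.call(do.one, args), simplify = FALSE)
})

length(results)       # one element per grid row
length(results[[1]])  # one element per replication
```

The sketch runs serially; the package additionally handles parallel execution and bookkeeping.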
The arguments to grid over must be scalar. Other arguments (such as data) can be passed as a list to .args.
A tidy method is provided that merges the list of results with the argument grid, putting the results in tidy form. This format is convenient for plotting and further data analysis. tidy works with lists of vectors, lists, and data frames.
The function gapply runs grid_apply followed by tidy.
res <- tidy(sim)
res <- gapply(do.one, n = c(50, 100, 500), mu = c(1,5), sd = c(1, 5, 10),
.reps=50, .mc.cores=5)
n mu sd .rep value
1 50 1 1 1 0.9476228
2 50 1 1 2 0.7545730
3 50 1 1 3 0.9154810
4 50 1 1 4 1.0704074
5 50 1 1 5 0.9840148
6 50 1 1 6 1.1933439
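The merge of results with the argument grid can be sketched in base R as follows. This only illustrates the resulting shape (one row per grid row and replication); the variable names and the nested-list input are assumptions, and the package's tidy may work differently internally.

```r
# Illustrative sketch only; distributr's tidy() may differ internally.
grid <- expand.grid(n = c(50, 100), mu = c(1, 5))
reps <- 2  # hypothetical number of replications

set.seed(1)
# Nested list: one element per grid row, each holding `reps` scalar results.
results <- lapply(seq_len(nrow(grid)), function(i) {
  replicate(reps, mean(rnorm(grid$n[i], grid$mu[i])), simplify = FALSE)
})

# Repeat each grid row once per replication, then attach the results.
tidy_res <- grid[rep(seq_len(nrow(grid)), each = reps), , drop = FALSE]
tidy_res$.rep <- rep(seq_len(reps), times = nrow(grid))
tidy_res$value <- unlist(results)
rownames(tidy_res) <- NULL

head(tidy_res)
```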
If results are of varying length, it can be helpful to stack them into key, value pairs with tidy(., stack=TRUE).
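As an illustration of what stacking into key, value pairs means, the base-R sketch below converts named result vectors of different lengths into long form. The example results and the column names key and value are assumptions; the exact output of tidy(., stack=TRUE) may differ.

```r
# Hypothetical results of varying length, one per grid row.
results <- list(
  c(mean = 1.0, sd = 0.5),
  c(mean = 2.0),
  c(mean = 3.0, sd = 0.7, max = 4.2)
)
grid <- data.frame(n = c(50, 100, 500))

# One row per (grid row, result name) pair; the grid value is recycled.
stacked <- do.call(rbind, lapply(seq_along(results), function(i) {
  r <- results[[i]]
  data.frame(n = grid$n[i], key = names(r), value = unname(r),
             stringsAsFactors = FALSE)
}))

stacked
```

Long form avoids padding shorter results with missing values, which is why it suits results of unequal length.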
gapply captures both warnings and errors, so a failing function application does not stop the run. They can be retrieved afterwards with err and warn:
err(sim)
warn(sim)
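One common way to capture warnings and errors around a function call in base R is shown below. The helper capture is hypothetical and is not part of distributr; it only sketches the general technique with tryCatch and withCallingHandlers.

```r
# Hypothetical helper: run f(...) and record any error or warnings.
capture <- function(f, ...) {
  warn <- NULL
  value <- withCallingHandlers(
    tryCatch(f(...), error = function(e) e),  # return the error object
    warning = function(w) {
      warn <<- c(warn, conditionMessage(w))   # record the warning message
      invokeRestart("muffleWarning")          # then continue evaluation
    }
  )
  is_err <- inherits(value, "error")
  list(value   = if (is_err) NULL else value,
       error   = if (is_err) conditionMessage(value) else NULL,
       warning = warn)
}

ok  <- capture(function(x) { warning("check input"); x + 1 }, 1)
bad <- capture(function(x) stop("failed"), 1)
```

Here ok keeps both its value and the warning message, while bad records the error message instead of aborting.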
A compute plan can be set up and executed using the Sun/Open Grid Engine scheduler. Rows of the argument grid are submitted to nodes, and replications are carried out in parallel via mclapply.
sim <- gapply(do.one, n = c(50, 100, 500), mu = c(1,5), sd = c(1, 5, 10), .eval=FALSE)
sim <- setup(sim, .reps=500, .mc.cores = 5)
submit(sim)
res <- tidy(collect())
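The work done on a single node can be pictured as running the replications for one grid row through parallel::mclapply. The sketch below assumes a hypothetical row_args list and does not involve the scheduler; it falls back to one core on Windows, where forking is unavailable.

```r
library(parallel)

do.one <- function(n, mu, sd) mean(rnorm(n, mu, sd))

# Hypothetical arguments for one row of the argument grid.
row_args <- list(n = 50, mu = 1, sd = 1)
reps <- 10

# mclapply forks worker processes; mc.cores > 1 is not supported on Windows.
cores <- if (.Platform$OS.type == "windows") 1L else 2L

reps_out <- mclapply(seq_len(reps),
                     function(r) do.call(do.one, row_args),
                     mc.cores = cores)

length(reps_out)  # one result per replication
```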
The setup function asks for user confirmation if an existing argument grid would be overwritten.
Jobs can be added to the compute plan via add_jobs. A set of jobs can be selected from the argument grid using filter_jobs and the usual dplyr syntax to filter.
jobs(sim) # access jobs grid (argument grid)
add_jobs(sim, n=1000, mu=10, sd=50) # add jobs to plan
filter_jobs(sim, n < 100, .mc.cores=5) # filter jobs as in dplyr
collect(filter="n < 100") # collect results from jobs matching filter
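Since the jobs grid is an argument grid, adding and selecting jobs can be thought of as ordinary data-frame operations. The base-R sketch below is only an analogy using rbind and subset; it does not call add_jobs or filter_jobs and makes no claim about how they are implemented.

```r
# The jobs grid: one row per job (combination of arguments).
jobs <- expand.grid(n = c(50, 100, 500), mu = c(1, 5), sd = c(1, 5, 10))

# Analogous to adding a job: append one more row of arguments.
jobs <- rbind(jobs, data.frame(n = 1000, mu = 10, sd = 50))

# Analogous to filtering jobs: keep rows matching a condition.
# With dplyr installed, dplyr::filter(jobs, n < 100) selects the same rows.
small <- subset(jobs, n < 100)

nrow(jobs)
nrow(small)
```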
More information is available on the wiki, for example, illustrating how chunking, random number control, and caching can all be done transparently via do.one. These slides give a general overview.