pulsar: parallel utilities for model selection

SpiecEasi is now using the pulsar package as the backend for performing model selection. In the default parameter setting, this uses the same StARS procedure as previous versions.

As in the previous version of SpiecEasi, we can supply the ncores argument to the pulsar.params list to break up the subsampled computations into parallel tasks.

Note: On Windows, process forking is unavailable, so parallel::mclapply does not support mc.cores > 1. Windows users should use a snow-style cluster or run in serial mode.

In this example, we set the random seed so that results are comparable across runs:

library(SpiecEasi)
data(amgut1.filt)

## Default settings ##
pargs1 <- list(rep.num=50, seed=10010)
t1 <- system.time(
se1 <- spiec.easi(amgut1.filt, method='mb', lambda.min.ratio=1e-3, nlambda=30,
              sel.criterion='stars', pulsar.select=TRUE, pulsar.params=pargs1)
)

## Platform-specific or environment-specific parallel processing ##
if (.Platform$OS.type == "windows" || Sys.getenv("CI") == "true" || nzchar(Sys.getenv("GITHUB_ACTIONS"))) {
  # On Windows or in a CI environment, run in serial mode
  pargs2 <- list(rep.num=50, seed=10010, ncores=1)  # Serial for Windows/CI
  cat("Running on Windows or CI - using serial processing\n")
} else {
  # On Unix-like systems, use multicore
  # na.rm=TRUE guards against detectCores() returning NA
  n_cores <- min(2, parallel::detectCores(), na.rm=TRUE)
  pargs2 <- list(rep.num=50, seed=10010, ncores=n_cores)
  cat("Running on Unix-like system - using parallel processing\n")
}

t2 <- system.time(
se2 <- spiec.easi(amgut1.filt, method='mb', lambda.min.ratio=1e-3, nlambda=30,
              sel.criterion='stars', pulsar.select=TRUE, pulsar.params=pargs2)
)

We can further speed up StARS using the bounded-StARS (‘bstars’) method. The B-StARS approach computes network stability across the whole lambda path, but only for the first 2 subsamples. This is used to build an initial estimate of the summary statistic, which in turn gives us a lower/upper bound on the optimal lambda. The remaining subsamples are then used to compute the stability over the restricted path only. Since denser networks are more computationally expensive to estimate, skipping the dense end of the path can significantly reduce computation time for datasets with many variables.
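For intuition about the stability summary statistic mentioned above, the following base-R sketch computes a StARS-style instability curve on toy data. It is an illustration only, not pulsar's internal code: the thresholded-correlation estimator `est_graph`, the toy data, and the chosen path are all assumptions made for the example. The 0.05 cutoff is likewise an illustrative choice.

```r
set.seed(10010)
n <- 100; p <- 8
# Toy data: variables 1 and 2 are strongly correlated, the rest are noise
X <- matrix(rnorm(n * p), n, p)
X[, 2] <- X[, 1] + 0.3 * rnorm(n)

lambdas  <- seq(0.9, 0.05, length.out = 10)  # ordered from sparse to dense
rep_num  <- 20                                # number of subsamples
sub_size <- floor(0.8 * n)                    # rows per subsample
thresh   <- 0.05                              # instability cutoff (illustrative)

# Stand-in estimator (an assumption): threshold absolute correlations at lambda
est_graph <- function(x, lambda) {
  A <- abs(cor(x)) > lambda
  diag(A) <- FALSE
  A
}

# Count how often each edge is selected, one p x p matrix per lambda
counts <- lapply(lambdas, function(l) matrix(0, p, p))
for (r in seq_len(rep_num)) {
  idx <- sample(n, sub_size)
  for (k in seq_along(lambdas)) {
    counts[[k]] <- counts[[k]] + est_graph(X[idx, ], lambdas[k])
  }
}

# Instability per lambda: mean of 2 * theta * (1 - theta) over all edges,
# where theta is the selection frequency of an edge across subsamples
instab <- sapply(counts, function(cnt) {
  theta <- cnt[upper.tri(cnt)] / rep_num
  mean(2 * theta * (1 - theta))
})

# Monotonize along the path, then take the densest graph under the cutoff
instab_mono <- cummax(instab)
opt_index   <- max(which(instab_mono <= thresh))
opt_lambda  <- lambdas[opt_index]
print(data.frame(lambda = lambdas, instability = instab_mono))
```

In this sketch every subsample is evaluated on the full path. The bounded variant described above would instead use the first two subsamples to narrow the range of lambda values before processing the rest.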

t3 <- system.time(
se3 <- spiec.easi(amgut1.filt, method='mb', lambda.min.ratio=1e-3, nlambda=30,
               sel.criterion='bstars', pulsar.select=TRUE, pulsar.params=pargs1)
)
t4 <- system.time(
se4 <- spiec.easi(amgut1.filt, method='mb', lambda.min.ratio=1e-3, nlambda=30,
               sel.criterion='bstars', pulsar.select=TRUE, pulsar.params=pargs2)
)

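We can see that, in addition to the computational savings, the refit networks are identical (the timing comparisons depend on the machine, so the elapsed results may differ on other systems):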

## serial vs parallel
identical(getRefit(se1), getRefit(se2))
# [1] TRUE
t1[3] > t2[3]
# elapsed 
#    TRUE
## stars vs bstars
identical(getRefit(se1), getRefit(se3))
# [1] TRUE
t1[3] > t3[3]
# elapsed 
#    TRUE
identical(getRefit(se2), getRefit(se4))
# [1] TRUE
t2[3] > t4[3]
# elapsed 
#    TRUE

Windows-specific considerations

For Windows users, there are several options for parallel processing:

Option 1: Use snow cluster

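# For Windows users who want parallel processing
library(parallel)
# "PSOCK" is the socket cluster type built into the parallel package
cl <- makeCluster(4, type = "PSOCK")
# Note: as written, cl is not passed to spiec.easi(), so the call below
# does not reference the cluster directly
pargs.windows <- list(rep.num=50, seed=10010)
se.windows <- spiec.easi(amgut1.filt, method='mb', lambda.min.ratio=1e-3, nlambda=30,
                         sel.criterion='stars', pulsar.select=TRUE, pulsar.params=pargs.windows)
stopCluster(cl)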

Option 2: Use serial processing

# Simple serial processing for Windows
pargs.serial <- list(rep.num=50, seed=10010, ncores=1)
se.serial <- spiec.easi(amgut1.filt, method='mb', lambda.min.ratio=1e-3, nlambda=30,
                        sel.criterion='stars', pulsar.select=TRUE, pulsar.params=pargs.serial)

Option 3: Use batch mode

# Batch mode works on all platforms
bargs <- list(rep.num=50, seed=10010, conffile="parallel")
se.batch <- spiec.easi(amgut1.filt, method='mb', lambda.min.ratio=1e-3, nlambda=30,
                       sel.criterion='stars', pulsar.select='batch', pulsar.params=bargs)

Batch Mode

Pulsar gives us the option of running stability selection in batch mode, using the batchtools package. This is useful for anyone with access to an HPC or distributed computing system. Each subsample is executed independently using a system-specific cluster function.

This requires an external config file that tells the batchtools registry how to construct the cluster function which executes the individual jobs. batch.pulsar has some built-in config files that are useful for testing purposes (serial mode, “parallel”, “snow”, etc.), but it is advisable to create your own config file and pass in its absolute path. See the batchtools docs for instructions on how to construct config and template files (e.g. to interact with a queueing system such as TORQUE or SGE).

## bargs <- list(rep.num=50, seed=10010, conffile="path/to/conf.txt")
bargs <- list(rep.num=50, seed=10010, conffile="parallel")
## See where the built-in config file is stored:
pulsar::findConfFile('parallel')

## uncomment line below to turn off batchtools reporting
# options(batchtools.verbose=FALSE)
se5 <- spiec.easi(amgut1.filt, method='mb', lambda.min.ratio=1e-3, nlambda=30,
            sel.criterion='stars', pulsar.select='batch', pulsar.params=bargs)
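As a rough illustration of what a custom config file might contain, the fragment below sets the `cluster.functions` variable that batchtools reads from its configuration file. The file location, the worker count, and the template path are hypothetical placeholders. It also assumes that the file given as conffile is handed to the batchtools registry unchanged.

```r
# Hypothetical contents of a batchtools config file,
# e.g. saved as /absolute/path/to/batchtools.conf.R

# Run jobs on local socket workers; ncpus = 2 is an arbitrary choice
cluster.functions <- batchtools::makeClusterFunctionsSocket(ncpus = 2)

# On a cluster with a TORQUE scheduler, a template-based constructor could be
# used instead (the template path is a placeholder):
# cluster.functions <- batchtools::makeClusterFunctionsTORQUE(
#   template = "/absolute/path/to/torque.tmpl")
```

The absolute path of such a file would then be supplied as the conffile entry of the parameter list, in place of "parallel".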

Performance comparison

Let’s compare the performance of different approaches:

# Compare timing
cat("Serial StARS:", t1[3], "seconds\n")
# Serial StARS: 360.816 seconds
cat("Platform-specific StARS:", t2[3], "seconds\n")
# Platform-specific StARS: 244.183 seconds
cat("Serial B-StARS:", t3[3], "seconds\n")
# Serial B-StARS: 79.431 seconds
cat("Platform-specific B-StARS:", t4[3], "seconds\n")
# Platform-specific B-StARS: 61.065 seconds

# Speedup factors (only meaningful on Unix-like systems)
if (.Platform$OS.type != "windows") {
  cat("Parallel speedup (StARS):", t1[3]/t2[3], "\n")
  cat("B-StARS speedup (serial):", t1[3]/t3[3], "\n")
  cat("B-StARS speedup (parallel):", t2[3]/t4[3], "\n")
}
# Parallel speedup (StARS): 1.477646 
# B-StARS speedup (serial): 4.542509 
# B-StARS speedup (parallel): 3.998739
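The same comparison can be collected into a single table. The sketch below hard-codes the elapsed times reported in the output above rather than reading them from t1 to t4, so it runs on its own; the object names `elapsed` and `timing` are chosen for this example.

```r
# Elapsed times in seconds, copied from the reported output above
elapsed <- c(serial_stars    = 360.816,
             parallel_stars  = 244.183,
             serial_bstars   = 79.431,
             parallel_bstars = 61.065)

# One row per run, with speedup relative to the serial StARS baseline
timing <- data.frame(
  run         = names(elapsed),
  elapsed_sec = unname(elapsed),
  speedup     = round(elapsed[["serial_stars"]] / unname(elapsed), 2)
)
print(timing)
```

With live results, the vector could be built from the timing objects instead, for example `c(serial_stars = t1[[3]], ...)`.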

Key parameters

  • rep.num: Number of subsamples for stability selection (default: 50)
  • ncores: Number of cores for parallel processing (use 1 on Windows)
  • sel.criterion: Selection criterion (‘stars’ or ‘bstars’)
  • pulsar.select: Whether to use pulsar for model selection
  • pulsar.params: List of parameters passed to pulsar
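To show how these parameters fit together, here is a minimal sketch that builds a pulsar.params list with a platform-aware core count. The helper `choose_ncores` is hypothetical and not part of SpiecEasi or pulsar, and the cap of 2 cores is an arbitrary default for the example.

```r
# Hypothetical helper: choose a safe value for ncores
choose_ncores <- function(max_cores = 2) {
  # Forking is unavailable on Windows, so fall back to serial
  if (.Platform$OS.type == "windows") return(1L)
  available <- parallel::detectCores()
  # detectCores() can return NA when the count cannot be determined
  if (is.na(available)) return(1L)
  as.integer(max(1, min(max_cores, available)))
}

# Parameter list in the form passed as pulsar.params
pargs <- list(rep.num = 50, seed = 10010, ncores = choose_ncores())
str(pargs)
```

The resulting list can be passed as pulsar.params together with sel.criterion set to 'stars' or 'bstars', as in the calls shown earlier.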

Platform-specific recommendations

Unix-like systems (Linux, macOS):

  1. For small datasets: Use default StARS with serial processing
  2. For medium datasets: Use parallel StARS with 4-8 cores
  3. For large datasets: Use B-StARS with parallel processing
  4. For HPC clusters: Use batch mode with appropriate config files

Windows systems:

  1. For small datasets: Use default StARS with serial processing
  2. For medium datasets: Use B-StARS with serial processing
  3. For large datasets: Use batch mode or snow cluster
  4. For HPC clusters: Use batch mode with appropriate config files

Session info:

sessionInfo()
# R version 4.5.1 (2025-06-13)
# Platform: x86_64-pc-linux-gnu
# Running under: Ubuntu 24.04.3 LTS
# 
# Matrix products: default
# BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
# LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
# 
# locale:
#  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
# [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
# 
# time zone: Etc/UTC
# tzcode source: system (glibc)
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] phyloseq_1.55.0  igraph_2.2.1     Matrix_1.7-4     SpiecEasi_1.99.3
# [5] BiocStyle_2.39.0
# 
# loaded via a namespace (and not attached):
#  [1] gtable_0.3.6        shape_1.4.6.1       xfun_0.54          
#  [4] bslib_0.9.0         ggplot2_4.0.0       rhdf5_2.55.4       
#  [7] Biobase_2.71.0      lattice_0.22-7      vctrs_0.6.5        
# [10] rhdf5filters_1.23.0 tools_4.5.1         generics_0.1.4     
# [13] biomformat_1.39.0   stats4_4.5.1        parallel_4.5.1     
# [16] cluster_2.1.8.1     pkgconfig_2.0.3     huge_1.3.5         
# [19] data.table_1.17.8   RColorBrewer_1.1-3  S7_0.2.0           
# [22] S4Vectors_0.49.0    lifecycle_1.0.4     compiler_4.5.1     
# [25] farver_2.1.2        stringr_1.6.0       Biostrings_2.79.2  
# [28] Seqinfo_1.1.0       codetools_0.2-20    permute_0.9-8      
# [31] htmltools_0.5.8.1   sys_3.4.3           buildtools_1.0.0   
# [34] sass_0.4.10         yaml_2.3.10         glmnet_4.1-10      
# [37] crayon_1.5.3        jquerylib_0.1.4     MASS_7.3-65        
# [40] cachem_1.1.0        vegan_2.7-2         iterators_1.0.14   
# [43] foreach_1.5.2       nlme_3.1-168        digest_0.6.37      
# [46] stringi_1.8.7       reshape2_1.4.4      labeling_0.4.3     
# [49] maketools_1.3.2     splines_4.5.1       ade4_1.7-23        
# [52] fastmap_1.2.0       grid_4.5.1          cli_3.6.5          
# [55] magrittr_2.0.4      survival_3.8-3      ape_5.8-1          
# [58] withr_3.0.2         scales_1.4.0        rmarkdown_2.30     
# [61] XVector_0.51.0      multtest_2.67.0     pulsar_0.3.11      
# [64] VGAM_1.1-13         evaluate_1.0.5      knitr_1.50         
# [67] IRanges_2.45.0      mgcv_1.9-3          rlang_1.1.6        
# [70] Rcpp_1.1.0          glue_1.8.0          BiocManager_1.30.26
# [73] BiocGenerics_0.57.0 jsonlite_2.0.0      R6_2.6.1           
# [76] Rhdf5lib_1.33.0     plyr_1.8.9