Packages like reticulate facilitate the use of Python modules in our R-based data analyses, allowing us to leverage Python’s strengths in fields such as machine learning and image analysis. However, it is notoriously difficult to ensure that a consistent version of Python is available with a consistently versioned set of modules, especially when the system installation of Python is used. As a result, we cannot easily guarantee that some Python code executed via reticulate on one computer will yield the same results as the same code run on another computer. It is also possible that two R packages depend on incompatible versions of Python modules, such that it is impossible to use both packages at the same time. These versioning issues represent a major obstacle to reliable execution of Python code across a variety of systems via R/Bioconductor packages.
basilisk provisions custom Python virtual environments that are managed by the Bioconductor installation machinery. This provides developers of downstream Bioconductor packages (i.e., basilisk “clients”) with more control over how their Python code is executed. Additionally, basilisk provides utilities to manage different Python environments within a single R session, allowing multiple Bioconductor packages to use incompatible versions of Python packages in the course of a single analysis. These features enable reproducible analysis, simplify debugging of code and improve interoperability between compliant packages.
The son.of.basilisk package (in the
inst/example directory of this package) is an
example of how one might write a client package that depends on basilisk.
This is a fully operational example package that can be installed and
run, so prospective developers should use it as a template for their own
packages. We will assume that readers are familiar with general R
package development practices and will limit our discussion to the
basilisk-specific
elements.
StagedInstall: no should be set, to ensure that Python
packages are installed with the correct hard-coded paths within the R
package installation directory.
Imports: basilisk should be set along with appropriate
directives in the NAMESPACE for all basilisk
functions that are used.
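The two DESCRIPTION settings above are easy to forget, so a quick check can be useful. What follows is a minimal sketch in base R, not part of basilisk: the helper name check_basilisk_description() and the throwaway DESCRIPTION file are hypothetical and exist only for illustration.

```r
# Hypothetical helper (not provided by basilisk): reports whether a
# DESCRIPTION file contains the two basilisk-specific settings.
check_basilisk_description <- function(path) {
    desc <- read.dcf(path)
    fields <- colnames(desc)
    staged <- if ("StagedInstall" %in% fields) desc[1, "StagedInstall"] else NA_character_
    imports <- if ("Imports" %in% fields) desc[1, "Imports"] else ""

    # Split the Imports field and drop version constraints such as "(>= 1.0)".
    imported <- trimws(strsplit(imports, ",")[[1]])
    imported <- sub("\\s*\\(.*\\)$", "", imported)

    c(
        staged_install_off = identical(tolower(unname(staged)), "no"),
        imports_basilisk = "basilisk" %in% imported
    )
}

# Write a throwaway DESCRIPTION to demonstrate the check.
tmp_desc <- tempfile()
writeLines(c(
    "Package: ClientPackage",
    "Version: 0.99.0",
    "Imports: basilisk, methods",
    "StagedInstall: no"
), tmp_desc)

result <- check_basilisk_description(tmp_desc)
print(result)
```

Both elements of the result should be TRUE for a correctly configured client package.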
BasiliskEnvironment objects
A basilisk.R file should be present in the
R/ subdirectory containing commands to produce a
BasiliskEnvironment object. These objects define the Python
virtual environments to be constructed by basilisk
on behalf of your client package.
library(basilisk)
my_env <- BasiliskEnvironment(envname="my_env_name",
    pkgname="ClientPackage",
    packages=c("pandas==2.2.3", "scikit-learn==1.6.1")
)
second_env <- BasiliskEnvironment(envname="second_env_name",
    pkgname="ClientPackage",
    packages=c("scipy==1.15.1", "numpy==2.2.2")
)
As shown above, all listed Python packages should have valid version
numbers that can be obtained by pip. It is strongly
recommended to explicitly list the versions of any dependencies so as to
future-proof the installation process. If the package versions are not
known, we suggest using setBasiliskCheckVersions(FALSE) and
listPackages() to identify the appropriate versions.
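A single = in place of == is an easy mistake to make when pinning versions. As a rough sketch, the hypothetical helper below (not part of basilisk) flags entries that are not of the pip-style name==version form. The regular expression is a deliberate simplification and does not cover every valid pip specifier.

```r
# Hypothetical helper (not provided by basilisk): TRUE for entries that
# look like "name==version", FALSE otherwise.
is_pinned <- function(specs) {
    grepl("^[A-Za-z0-9._-]+==[0-9][A-Za-z0-9.]*$", specs)
}

specs <- c("pandas==2.2.3", "scikit-learn==1.6.1", "scipy=1.15.1", "numpy")
pinned <- is_pinned(specs)

# Show the entries that still need an explicit "==" pin.
print(specs[!pinned])
```

Here the last two entries are reported, one for using a single = and one for omitting the version entirely.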
If a different version of Python is required, it should be explicitly
listed in the packages=, e.g., with
python=3.7. Otherwise, basilisk
will automatically use the default specified in
defaultPythonVersion (currently 3.12.10). It is a good idea
to explicitly list a version of Python in packages=, even
if it is already version-compatible with the default; this ensures that
Python environment creation is robust to future changes to the
default.
An executable configure file should be created in the
top level of the client package, containing the command shown below.
This enables creation of Python environments during package installation
if BASILISK_USE_SYSTEM_DIR is set.
For completeness, configure.win should also be
created:
Note that basilisk.R should be executable as a
standalone file and create all BasiliskEnvironments as
named variables in the current R environment. This is because the file
will be directly sourced by
configureBasiliskEnv() for system installation of the
Python environments (see BASILISK_USE_SYSTEM_DIR below). As
such, the file should not assume that the rest of the client package has
been installed or that the client’s various dependencies have been
loaded.
To use methods from the my_env environment that we
previously defined, our hypothetical
ClientPackage package should define functions like:
my_example_function <- function(ARG_VALUE_1, ARG_VALUE_2) {
    proc <- basiliskStart(my_env)
    on.exit(basiliskStop(proc))
    some_useful_thing <- basiliskRun(proc, fun=function(arg1, arg2) {
        # The scikit-learn package is imported under the module name "sklearn".
        mod <- reticulate::import("sklearn")
        output <- mod$some_calculation(arg1, arg2)
        # The return value MUST be a pure R object, i.e., no reticulate
        # Python objects, no pointers to shared memory.
        output
    }, arg1=ARG_VALUE_1, arg2=ARG_VALUE_2)
    some_useful_thing
}
In the above chunk, a developer-defined function fun is
passed to basiliskRun() for execution inside the
proc context where the specified Python environment is
loaded. Developers should not make any assumptions about the nature of
proc, which is dependent on the state of the R session. For
example, basilisk
may choose to run fun in the current R session, or in
another forked/socket process with the same R installation. Any R
functions that use Python code should do so via
basiliskRun(), which ensures that different Bioconductor
packages play nice when their dependencies clash.
basiliskStart() will lazily install the requested
version of Python and packages in my_env if they are not
already present. This can result in some delays when
my_example_function() is first called; afterwards, the
cached environments will simply be re-used. Check out the
use_python() and virtualenv_install()
functions from reticulate
for more details.
Developers should respect several constraints when defining a
function for use in basiliskRun():
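The arguments and return value should be pure R objects that do not contain pointers to external memory (e.g., reticulate Python objects). This is because basiliskRun() may
execute in a different process such that any pointers are no longer
valid when they are transferred back to the parent process. Both the
arguments to the function passed to basiliskRun() and its
return value MUST be amenable to serialization.
Functions from other R packages should be referenced with the :: operator. This ensures that the relevant package will be
loaded during function execution in a separate process.
More details on acceptable function definitions are provided in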
?basiliskRun. Developers can check that their function
behaves correctly in a different process by setting
setBasiliskShared(FALSE) and
setBasiliskFork(FALSE) prior to running
basiliskRun() in their unit tests.
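The serialization requirement can also be explored without starting Python at all. The sketch below uses only base R: a serialize()/unserialize() round trip stands in for the transfer between processes. This is an illustration under that assumption, not a description of how basilisk moves data internally, and the names payload and received are hypothetical.

```r
# A self-contained function: it refers to a non-base function via `::`
# and uses nothing but its own arguments.
fun <- function(arg1, arg2) stats::median(c(arg1, arg2))

# Bundle the function and its arguments, as if preparing to send them
# to another process.
payload <- list(fun = fun, args = list(arg1 = c(1, 2, 3), arg2 = 10))

# Round trip through serialization; pure R objects survive unchanged.
received <- unserialize(serialize(payload, connection = NULL))
stopifnot(identical(received$args, payload$args))

# The received copy can be called just like the original.
result <- do.call(received$fun, received$args)
print(result)
```

Objects that hold pointers to external memory would not come back from such a round trip in a usable state, which is why they should not appear in the arguments or return value.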
Developers can persist variables across multiple calls to
basiliskRun() by setting persist=TRUE. This
instructs basiliskRun() to pass along an R environment to
fun as the store= argument, which can be used
inside fun to set or get variables if
basiliskRun() is called with the same proc.
Stored variables are not subject to the restrictions on the
arguments/return value of fun, but they are strictly
internal to any instance of proc.
my_example_function <- function() {
    proc <- basiliskStart(my_env)
    on.exit(basiliskStop(proc))
    # First call: save a random value in the persistent store.
    basiliskRun(proc, fun=function(store) {
        store$something <- runif(1)
        invisible(NULL)
    }, persist=TRUE)
    # Second call: retrieve the value saved by the first call.
    basiliskRun(proc, fun=function(store) {
        store$something
    }, persist=TRUE)
}
This capability allows developers to modularize complex Python
workflows by splitting up steps across multiple calls to
basiliskRun(). However, it is probably unwise to re-use
proc across user-visible functions, i.e., the end user
should never have an opportunity to interact with proc.
In most cases, end users should not have to read this document. Properly configured basilisk clients should handle all aspects of Python environment creation and loading without requiring user intervention. That said, some system configurations are less cooperative than others: this section contains a list of known issues and possible fixes.
Windows has a limit of 260 characters for its file paths. This is
occasionally exceeded due to deeply nested directories for some
packages, causing installation to silently fail. If this constraint is
causing problems, it may be possible to circumvent them by setting
BASILISK_EXTERNAL_DIR to a shorter path.
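To see whether path length is the likely culprit, it can help to list the offending files. The helper below is a hypothetical sketch in base R, not a basilisk function. It scans a directory and returns any file paths at or beyond a given length, with the demonstration directory and the small limit chosen purely for illustration.

```r
# Hypothetical helper (not provided by basilisk): list files whose full
# path is at least `limit` characters long.
find_long_paths <- function(dir, limit = 260L) {
    paths <- list.files(dir, recursive = TRUE, full.names = TRUE, all.files = TRUE)
    paths <- normalizePath(paths, mustWork = FALSE)
    paths[nchar(paths) >= limit]
}

# Demonstration on a throwaway directory, using an artificially small
# limit so that the single file is reported.
demo_dir <- file.path(tempdir(), "long-path-demo")
dir.create(file.path(demo_dir, "nested"), recursive = TRUE, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, "nested", "module.py")))

too_long <- find_long_paths(demo_dir, limit = 10L)
print(too_long)
```

In practice, one would point the helper at the directory holding the environments and keep the default limit.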
Builds for 32-bit Windows are not supported due to a lack of demand relative to the difficulty of setting it up.
Older versions of Rstudio on MacOSX have some difficulties with the generation of separate processes (see here). As a workaround in such cases, users should set:
basilisk will automatically attempt to remove old Python environments for each client package. However, this removal may not be fast enough on systems with low disk usage quotas, resulting in incomplete or failed installations. In such cases, users can forcibly clear the external directory themselves to free up some space:
# Remove obsolete environments for specific package:
basilisk::clearExternalDir(package = "pkg_name", obsolete.only = TRUE)
# Remove all environments for a specific package:
basilisk::clearExternalDir(package = "pkg_name")
# Remove all basilisk-managed environments:
basilisk::clearExternalDir()
Administrators of an R installation can modify the behavior of basilisk by setting a few environment variables. All environment variables described here must be set at both installation time and run time to have any effect. If any value is changed, it is generally safest to reinstall basilisk and all of its clients.
Setting the BASILISK_EXTERNAL_DIR environment variable
will change where the environments are created by basiliskStart() during lazy
installation. This is usually unnecessary unless the default path
contains spaces or the combination of the default location and
environment’s directory structure exceeds the file path length limit on
Windows.
Setting BASILISK_USE_SYSTEM_DIR to 1 will
instruct basilisk
to install a client package’s environments in the R system directory
during R package installation. This is useful for enterprise-level
deployments as the environments are (i) not duplicated in each user’s
home directory and (ii) always available to any user with access to the
R installation. However, it requires installation from source and thus
is not set by default.
Setting the BASILISK_CUSTOM_PYTHON_X_Y_Z environment
variable will cause all requests for Python version X.Y.Z
to use the Python binary at the specified path. This allows users to
force basilisk
to use their own Python installation instead of installing one via
reticulate.
The same approach can be used with
BASILISK_CUSTOM_PYTHON_X_Y and
BASILISK_CUSTOM_PYTHON_X for Python versions
X.Y or X. Different client packages may need
different Python versions so multiple environment variables may need to
be set for different X/Y/Z
combinations.
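The naming scheme for these variables can be sketched in a few lines of base R. The helper custom_python_vars() below is hypothetical and only constructs the candidate variable names for a given version. It makes no claim about the order in which basilisk itself consults them. The Python path is likewise a made-up example, and the Sys.setenv() call is for demonstration only, as these variables would normally be set before R is started.

```r
# Hypothetical helper (not provided by basilisk): build the candidate
# BASILISK_CUSTOM_PYTHON_* names for a version string like "3.12.10".
custom_python_vars <- function(version) {
    parts <- strsplit(version, ".", fixed = TRUE)[[1]]
    vapply(rev(seq_along(parts)), function(n) {
        paste0("BASILISK_CUSTOM_PYTHON_", paste(parts[seq_len(n)], collapse = "_"))
    }, character(1))
}

candidates <- custom_python_vars("3.12.10")
print(candidates)

# For demonstration only: set one variable to a made-up path.
Sys.setenv(BASILISK_CUSTOM_PYTHON_3_12 = "/opt/python/3.12/bin/python3")

# Report which of the candidate variables are currently set.
configured <- Sys.getenv(candidates, unset = NA)
print(configured[!is.na(configured)])
```

Listing the candidates in this way makes it easier to confirm that every Python version requested by the installed client packages has a corresponding variable.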
Setting the BASILISK_NO_PYENV environment variable will
prevent basilisk
from installing any new Python instances via pyenv. If a requested
Python version has no matching BASILISK_CUSTOM_PYTHON_*
path, basilisk
will throw an error instead of attempting an installation. This allows
administrators to prevent unexpected installation of new Python
instances, e.g., when the requested version of Python is already
available but the associated BASILISK_CUSTOM_PYTHON_*
variable has not been set.
Setting BASILISK_NO_DESTROY to 1 will
instruct basilisk
to not destroy previous environments upon installation of a new
version of basilisk.
This destruction is done by default to avoid accumulating many large
obsolete environments. However, it is not desirable if there are
multiple R instances running different versions of basilisk
from the same Bioconductor release, as installation by one R instance
would delete the installed content for the other. (Multiple R instances
running different Bioconductor releases are not affected.) This option
has no effect if BASILISK_USE_SYSTEM_DIR is set.
While basilisk
is primarily intended for package developers, end users can also take
advantage of its graceful handling of multiple Python environments in
complex workflows. For example, we can easily instantiate a Python
environment in our working directory with
createLocalBasiliskEnv():
tmp <- createLocalBasiliskEnv(
    "basilisk-vignette-test",
    packages=c("scikit-learn==1.6.1", "numpy==2.2.2")
)
## Installing pyenv ...
## Done! pyenv has been installed to '/github/home/.local/share/r-reticulate/pyenv/bin/pyenv'.
## Using Python: /github/home/.pyenv/versions/3.12.10/bin/python3.12
## Creating virtual environment '/tmp/Rtmpfs9UNf/Rbuilda6f1576a55a/basilisk/vignettes/basilisk-vignette-test/1.22.0' ...
## Done!
## Installing packages: pip, wheel, setuptools
## Installing packages: 'scikit-learn==1.6.1', 'numpy==2.2.2'
## Virtual environment '/tmp/Rtmpfs9UNf/Rbuilda6f1576a55a/basilisk/vignettes/basilisk-vignette-test/1.22.0' successfully created.
We can then supply this environment’s path to
basiliskRun() to execute Python-based calculations. To
demonstrate, we’ll apply scikit-learn’s
truncated SVD on a random matrix. Note that the restrictions mentioned
above for fun are still
applicable here.
x <- matrix(rnorm(1000), ncol=10)
basiliskRun(env=tmp, fun=function(mat) {
    module <- reticulate::import("sklearn.decomposition")
    runner <- module$TruncatedSVD()
    output <- runner$fit(mat)
    output$singular_values_
}, mat = x, testload="scipy.optimize")
## [1] 12.02781 11.72442
basiliskRun() can also be used with Python environments
constructed outside of basilisk.
Of course, in this case, it is the user’s responsibility to ensure that
the environment is correctly provisioned.
library(reticulate)
tmp2 <- file.path(getwd(), "basilisk-vignette-test2")
if (!file.exists(tmp2)) {
    py.cmd <- suppressMessages(install_python(defaultPythonVersion))
    virtualenv_install(
        envname=tmp2,
        python_version=py.cmd,
        packages="scipy==1.15.1"
    )
}
## Using Python: /github/home/.pyenv/versions/3.12.10/bin/python3.12
## Creating virtual environment '/tmp/Rtmpfs9UNf/Rbuilda6f1576a55a/basilisk/vignettes/basilisk-vignette-test2' ...
## Done!
## Installing packages: pip, wheel, setuptools
## Virtual environment '/tmp/Rtmpfs9UNf/Rbuilda6f1576a55a/basilisk/vignettes/basilisk-vignette-test2' successfully created.
## Using virtual environment '/tmp/Rtmpfs9UNf/Rbuilda6f1576a55a/basilisk/vignettes/basilisk-vignette-test2' ...
basiliskRun(env=tmp2, fun=function(mat) {
    module <- reticulate::import("scipy.stats")
    norm <- module$norm
    norm$cdf(c(-1, 0, 1))
}, mat = x, testload="scipy.optimize")
## [1] 0.1586553 0.5000000 0.8413447
Notice how we were able to call basiliskRun()
successfully on two different environments within the same R session.
This enables the construction of complex analysis workflows that span
across R and multiple Python environments.
## R version 4.5.2 (2025-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] basilisk_1.22.0 reticulate_1.44.0 BiocStyle_2.39.0
##
## loaded via a namespace (and not attached):
## [1] cli_3.6.5 knitr_1.50 rlang_1.1.6
## [4] xfun_0.54 png_0.1-8 jsonlite_2.0.0
## [7] dir.expiry_1.19.0 buildtools_1.0.0 htmltools_0.5.8.1
## [10] maketools_1.3.2 sys_3.4.3 sass_0.4.10
## [13] rappdirs_0.3.3 rmarkdown_2.30 grid_4.5.2
## [16] filelock_1.0.3 evaluate_1.0.5 jquerylib_0.1.4
## [19] fastmap_1.2.0 yaml_2.3.10 lifecycle_1.0.4
## [22] BiocManager_1.30.26 compiler_4.5.2 Rcpp_1.1.0
## [25] lattice_0.22-7 digest_0.6.37 R6_2.6.1
## [28] parallel_4.5.2 bslib_0.9.0 Matrix_1.7-4
## [31] withr_3.0.2 tools_4.5.2 cachem_1.1.0