Chapter 7 Advanced options
7.1 Preconstructed indices
Advanced users can split the SingleR()
workflow into two separate training and classification steps.
This means that training (e.g., marker detection, assembling of nearest-neighbor indices) only needs to be performed once
for any reference.
The resulting data structure can then be re-used across multiple classifications with different test datasets,
provided the gene annotation in the test dataset is identical to or a superset of the genes in the training set.
To illustrate, we will consider the DICE reference dataset (Schmiedel et al. 2018) from the celldex package.
## class: SummarizedExperiment
## dim: 29914 1561
## metadata(0):
## assays(1): logcounts
## rownames(29914): ENSG00000121410 ENSG00000268895 ... ENSG00000159840
## ENSG00000074755
## rowData names(0):
## colnames(1561): TPM_1 TPM_2 ... TPM_101 TPM_102
## colData names(3): label.main label.fine label.ont
##
## B cells, naive Monocytes, CD14+
## 106 106
## Monocytes, CD16+ NK cells
## 105 105
## T cells, CD4+, TFH T cells, CD4+, Th1
## 104 104
## T cells, CD4+, Th17 T cells, CD4+, Th1_17
## 104 104
## T cells, CD4+, Th2 T cells, CD4+, memory TREG
## 104 104
## T cells, CD4+, naive T cells, CD4+, naive TREG
## 103 104
## T cells, CD4+, naive, stimulated T cells, CD8+, naive
## 102 104
## T cells, CD8+, naive, stimulated
## 102
Let’s say we want to use the DICE reference to annotate the PBMC dataset from Chapter 1.
We use the trainSingleR()
function to do all the necessary calculations
that are independent of the test dataset.
(Almost; see comments below about common
.)
This yields a list of various components that contains all identified marker genes
and precomputed rank indices to be used in the score calculation.
We can also turn on aggregation with aggr.ref=TRUE
(Section @ref(pseudo-bulk aggregation))
to further reduce computational work.
common <- intersect(rownames(sce), rownames(dice))
library(SingleR)
set.seed(2000)
trained <- trainSingleR(dice[common,], labels=dice$label.fine, aggr.ref=TRUE)
We then use the trained
object to annotate our dataset of interest through the classifySingleR()
function.
As we can see, this yields exactly the same result as applying SingleR()
directly.
The advantage here is that trained
can be re-used for multiple classifySingleR()
calls -
possibly on different datasets - without having to repeat unnecessary steps when the reference is unchanged.
##
## B cells, naive Monocytes, CD14+
## 344 515
## Monocytes, CD16+ NK cells
## 187 320
## T cells, CD4+, TFH T cells, CD4+, Th1
## 365 222
## T cells, CD4+, Th17 T cells, CD4+, Th1_17
## 64 62
## T cells, CD4+, Th2 T cells, CD4+, memory TREG
## 69 169
## T cells, CD4+, naive T cells, CD4+, naive TREG
## 115 57
## T cells, CD8+, naive
## 211
# Comparing to the direct approach.
set.seed(2000)
direct <- SingleR(sce, ref=dice, labels=dice$label.fine,
assay.type.test=1, aggr.ref=TRUE)
identical(pred$labels, direct$labels)
## [1] TRUE
The big caveat is that the universe of genes in the test dataset must be a superset of that the reference.
This is the reason behind the intersection to common
genes and the subsequent subsetting of dice
.
Practical use of preconstructed indices is best combined with some prior information about the gene-level annotation;
for example, we might know that we always use a particular version of the Ensembl gene models,
so we would filter out any genes in the reference dataset that are not in our test datasets.
7.2 Parallelization
Parallelization is an obvious approach to increasing annotation throughput.
This is done using the framework in the BiocParallel package,
which provides several options for parallelization depending on the available hardware.
On POSIX-compliant systems (i.e., Linux and MacOS), the simplest method is to use forking
by passing MulticoreParam()
to the BPPARAM=
argument:
library(BiocParallel)
pred2a <- SingleR(sce, ref=dice, assay.type.test=1, labels=dice$label.fine,
BPPARAM=MulticoreParam(8)) # 8 CPUs.
Alternatively, one can use separate processes with SnowParam()
,
which is slower but can be used on all systems - including Windows, our old nemesis.
pred2b <- SingleR(sce, ref=dice, assay.type.test=1, labels=dice$label.fine,
BPPARAM=SnowParam(8))
identical(pred2a$labels, pred2b$labels)
## [1] TRUE
When working on a cluster, passing BatchtoolsParam()
to SingleR()
allows us to
seamlessly interface with various job schedulers like SLURM, LSF and so on.
This permits heavy-duty parallelization across hundreds of CPUs for highly intensive jobs,
though often some configuration is required -
see the vignette for more details.
7.3 Approximate algorithms
It is possible to sacrifice accuracy to squeeze more speed out of SingleR.
The most obvious approach is to simply turn off the fine-tuning with fine.tune=FALSE
,
which avoids the time-consuming fine-tuning iterations.
When the reference labels are well-separated, this is probably an acceptable trade-off.
pred3a <- SingleR(sce, ref=dice, assay.type.test=1,
labels=dice$label.main, fine.tune=FALSE)
table(pred3a$labels)
##
## B cells Monocytes NK cells T cells, CD4+ T cells, CD8+
## 348 705 357 950 340
Another approximation is based on the fact that the initial score calculation is done using a nearest-neighbors search.
By default, this is an exact seach but we can switch to an approximate algorithm via the BNPARAM=
argument.
In the example below, we use the Annoy algorithm
via the BiocNeighbors framework, which yields mostly similar results.
(Note, though, that the Annoy method does involve a considerable amount of overhead,
so for small jobs it will actually be slower than the exact search.)
library(BiocNeighbors)
pred3b <- SingleR(sce, ref=dice, assay.type.test=1,
labels=dice$label.main, fine.tune=FALSE, # for comparison with pred3a.
BNPARAM=AnnoyParam())
table(pred3a$labels, pred3b$labels)
##
## B cells Monocytes NK cells T cells, CD4+ T cells, CD8+
## B cells 348 0 0 0 0
## Monocytes 0 705 0 0 0
## NK cells 0 0 357 0 0
## T cells, CD4+ 0 0 0 950 0
## T cells, CD8+ 0 0 0 0 340
7.4 Cluster-level annotation
The default philosophy of SingleR is to perform annotation of each individual cell in the test dataset. An alternative strategy is to perform annotation of aggregated profiles for groups or clusters of cells. To demonstrate, we will perform a quick-and-dirty clustering of our PBMC dataset with a variety of Bioconductor packages.
library(scuttle)
sce <- logNormCounts(sce)
library(scran)
dec <- modelGeneVarByPoisson(sce)
sce <- denoisePCA(sce, dec, subset.row=getTopHVGs(dec, n=5000))
library(bluster)
colLabels(sce) <- clusterRows(reducedDim(sce), NNGraphParam())
library(scater)
set.seed(117)
sce <- runTSNE(sce, dimred="PCA")
plotTSNE(sce, colour_by="label")
By passing clusters=
to SingleR()
, we direct the function to compute an aggregated profile per cluster.
Annotation is then performed on the cluster-level profiles rather than on the single-cell level.
This has the major advantage of being much faster to compute as there are obviously fewer clusters than cells;
it is also easier to interpret as it directly returns the likely cell type identity of each cluster.
## DataFrame with 11 rows and 5 columns
## scores first.labels tuning.scores
## <matrix> <character> <DataFrame>
## 1 0.1534120:0.261500:0.599638:... NK cells 0.391189:0.359560
## 2 0.2061287:0.232735:0.357030:... T cells, CD4+ 0.551037:0.503613
## 3 0.0526260:0.271140:0.727792:... NK cells 0.727792:0.393413
## 4 0.1419607:0.781275:0.209402:... Monocytes 0.781275:0.209402
## 5 0.1642361:0.763198:0.206301:... Monocytes 0.763198:0.206301
## 6 0.6402007:0.335862:0.223424:... B cells 0.640201:0.335862
## 7 0.2275805:0.602347:0.211547:... Monocytes 0.602347:0.227580
## 8 0.2136143:0.276442:0.404099:... T cells, CD4+ 0.687793:0.580251
## 9 0.1675653:0.753343:0.260288:... Monocytes 0.753343:0.260288
## 10 0.2470206:0.245272:0.333286:... T cells, CD4+ 0.359902:0.304589
## 11 0.0713926:0.223101:0.117047:... Monocytes 0.223101:0.117047
## labels pruned.labels
## <character> <character>
## 1 T cells, CD4+ T cells, CD4+
## 2 T cells, CD4+ T cells, CD4+
## 3 NK cells NK cells
## 4 Monocytes Monocytes
## 5 Monocytes Monocytes
## 6 B cells B cells
## 7 Monocytes Monocytes
## 8 T cells, CD4+ T cells, CD4+
## 9 Monocytes Monocytes
## 10 T cells, CD4+ T cells, CD4+
## 11 Monocytes NA
This approach assumes that each cluster in the test dataset corresponds to exactly one reference label.
If a cluster actually contains a mixture of multiple labels, this will not be reflected in its lone assigned label.
(We note that it would be very difficult to determine the composition of the mixture from the SingleR()
scores.)
Indeed, there is no guarantee that the clustering is driven by the same factors that distinguish the reference labels,
decreasing the reliability of the annotations when novel heterogeneity is present in the test dataset.
The default per-cell strategy is safer and provides more information about the ambiguity of the annotations,
which is important for closely related labels where a close correspondence between clusters and labels cannot be expected.
Session information
R version 4.2.0 RC (2022-04-19 r82224)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS
Matrix products: default
BLAS: /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] scater_1.24.0 ggplot2_3.3.5
[3] bluster_1.6.0 scran_1.24.0
[5] scuttle_1.6.0 BiocNeighbors_1.14.0
[7] BiocParallel_1.30.0 SingleR_1.10.0
[9] TENxPBMCData_1.13.0 HDF5Array_1.24.0
[11] rhdf5_2.40.0 DelayedArray_0.22.0
[13] Matrix_1.4-1 SingleCellExperiment_1.18.0
[15] ensembldb_2.20.0 AnnotationFilter_1.20.0
[17] GenomicFeatures_1.48.0 AnnotationDbi_1.58.0
[19] celldex_1.5.0 SummarizedExperiment_1.26.0
[21] Biobase_2.56.0 GenomicRanges_1.48.0
[23] GenomeInfoDb_1.32.0 IRanges_2.30.0
[25] S4Vectors_0.34.0 BiocGenerics_0.42.0
[27] MatrixGenerics_1.8.0 matrixStats_0.62.0
[29] BiocStyle_2.24.0 rebook_1.6.0
loaded via a namespace (and not attached):
[1] snow_0.4-4 AnnotationHub_3.4.0
[3] BiocFileCache_2.4.0 igraph_1.3.1
[5] lazyeval_0.2.2 digest_0.6.29
[7] htmltools_0.5.2 viridis_0.6.2
[9] fansi_1.0.3 magrittr_2.0.3
[11] memoise_2.0.1 ScaledMatrix_1.4.0
[13] cluster_2.1.3 limma_3.52.0
[15] Biostrings_2.64.0 prettyunits_1.1.1
[17] colorspace_2.0-3 ggrepel_0.9.1
[19] blob_1.2.3 rappdirs_0.3.3
[21] xfun_0.30 dplyr_1.0.8
[23] crayon_1.5.1 RCurl_1.98-1.6
[25] jsonlite_1.8.0 graph_1.74.0
[27] glue_1.6.2 gtable_0.3.0
[29] zlibbioc_1.42.0 XVector_0.36.0
[31] BiocSingular_1.12.0 Rhdf5lib_1.18.0
[33] scales_1.2.0 DBI_1.1.2
[35] edgeR_3.38.0 Rcpp_1.0.8.3
[37] viridisLite_0.4.0 xtable_1.8-4
[39] progress_1.2.2 dqrng_0.3.0
[41] bit_4.0.4 rsvd_1.0.5
[43] metapod_1.4.0 httr_1.4.2
[45] dir.expiry_1.4.0 ellipsis_0.3.2
[47] farver_2.1.0 pkgconfig_2.0.3
[49] XML_3.99-0.9 CodeDepends_0.6.5
[51] sass_0.4.1 dbplyr_2.1.1
[53] locfit_1.5-9.5 utf8_1.2.2
[55] labeling_0.4.2 tidyselect_1.1.2
[57] rlang_1.0.2 later_1.3.0
[59] munsell_0.5.0 BiocVersion_3.15.2
[61] tools_4.2.0 cachem_1.0.6
[63] cli_3.3.0 generics_0.1.2
[65] RSQLite_2.2.12 ExperimentHub_2.4.0
[67] evaluate_0.15 stringr_1.4.0
[69] fastmap_1.1.0 yaml_2.3.5
[71] knitr_1.38 bit64_4.0.5
[73] purrr_0.3.4 KEGGREST_1.36.0
[75] sparseMatrixStats_1.8.0 mime_0.12
[77] xml2_1.3.3 biomaRt_2.52.0
[79] compiler_4.2.0 beeswarm_0.4.0
[81] filelock_1.0.2 curl_4.3.2
[83] png_0.1-7 interactiveDisplayBase_1.34.0
[85] statmod_1.4.36 tibble_3.1.6
[87] bslib_0.3.1 stringi_1.7.6
[89] highr_0.9 lattice_0.20-45
[91] ProtGenerics_1.28.0 vctrs_0.4.1
[93] pillar_1.7.0 lifecycle_1.0.1
[95] rhdf5filters_1.8.0 BiocManager_1.30.17
[97] jquerylib_0.1.4 cowplot_1.1.1
[99] bitops_1.0-7 irlba_2.3.5
[101] httpuv_1.6.5 rtracklayer_1.56.0
[103] R6_2.5.1 BiocIO_1.6.0
[105] bookdown_0.26 promises_1.2.0.1
[107] gridExtra_2.3 vipor_0.4.5
[109] codetools_0.2-18 assertthat_0.2.1
[111] rjson_0.2.21 withr_2.5.0
[113] GenomicAlignments_1.32.0 Rsamtools_2.12.0
[115] GenomeInfoDbData_1.2.8 parallel_4.2.0
[117] hms_1.1.1 grid_4.2.0
[119] beachmat_2.12.0 rmarkdown_2.14
[121] DelayedMatrixStats_1.18.0 Rtsne_0.16
[123] shiny_1.7.1 ggbeeswarm_0.6.0
[125] restfulr_0.0.13
Bibliography
Schmiedel, Benjamin J., Divya Singh, Ariel Madrigal, Alan G. Valdovino-Gonzalez, Brandie M. White, Jose Zapardiel-Gonzalo, Brendan Ha, et al. 2018. “Impact of Genetic Polymorphisms on Human Immune Cell Gene Expression.” Cell 175 (6): 1701–1715.e16. https://doi.org/10.1016/j.cell.2018.10.022.