To illustrate the Coralysis reference-mapping method, we
will use in this vignette two pancreatic islet datasets distributed
through the scRNAseq R package:
MuraroPancreasData (Muraro et al., 2016) and
GrunPancreasData (Grun et al., 2016). The former
dataset will be the reference and latter the query,
i.e., the dataset that will be mapped against the reference.
MuraroPancreasData (reference): 2,122 cells with cell type labels
GrunPancreasData (query): 1,064 cells without cell type labels
The datasets were imported below with the function
MuraroPancreasData() from GrunPancreasData()
package as a SingleCellExperiment object, the object
(class) required by most functions in Coralysis (see Chapter
4 The SingleCellExperiment class - OSCA manual). The
SCE object provided comprises counts (raw
count data), cell colData (cell metadata), among other data
layers. Both datasets were normalised and subsetted to 500 variable
genes with the scuttle/scran functions
logNormCounts and getTopHVGs, respectively
(other preprocessing steps, such as removing unlabeled cells or applying
quality control, were carried out following Chapter
3 Using single-cell references - OSCA manual).
Run the code below to import the R packages and data
required to reproduce this vignette.
# Packages
library("scran")
library("SingleR")
library("ggplot2")
library("scRNAseq")
library("Coralysis")
library("SingleCellExperiment")### Objects used and QC steps from SingleR book:
# https://bioconductor.org/books/release/SingleRBook/sc-mode.html
## Reference: muraro-pancreas-2016
# Import data
ref <- MuraroPancreasData()
# Remove unnecessary data & labels
altExps(ref) <- NULL
ref <- ref[, !is.na(ref$label) & ref$label != "unclear"]
# Normalise
ref <- logNormCounts(ref)
counts(ref) <- NULL
## Query: grun-pancreas-2016
# Import data
query <- GrunPancreasData()
# Remove unnecessary data & labels
query <- addPerCellQC(query)
qc <- quickPerCellQC(colData(query),
percent_subsets = "altexps_ERCC_percent",
batch = query$donor,
subset = query$donor %in% c("D17", "D7", "D2")
)
query <- query[, !qc$discard]
altExps(query) <- NULL
# Normalise
query <- logNormCounts(query)
counts(query) <- NULL
# Feature selection with 'scran' package for reference
nhvg <- 500
set.seed(123)
top.hvg <- getTopHVGs(ref, n = nhvg)
# Subset object
ref <- ref[top.hvg, ]
# Filter out genes for query:
# not required, only done to reduce query size object
query <- query[top.hvg, ]The first step in performing reference mapping with
Coralysis is training a dataset that is representative of
the biological system under study. In this example, ref and
query correspond to the SingleCellExperiment
objects of the reference and query pancreatic islet datasets,
MuraroPancreasData and
GrunPancreasData, respectively. The reference
ref is trained through RunParallelDivisiveICP
function without providing a batch.label. In case the
reference requires integration, a batch.label should be
provided. An higher number of threads can be provided to
speed up computing time depending on the number of cores available. For
this example, the algorithm was run 10 times (L = 10), but
generally, this number should be higher (with the default being
L = 50). A train set can be built to improve the modeling
step in Coralysis (by default,
build.train.set=TRUE), with the respective named
list of parameters provided to
build.train.params. In this case, the number of variable
genes was set to 500, as this was the number selected above
(nhvg = 500). Next, run the function RunPCA to
obtain the main result required for cell type prediction later on. In
addition to cell type prediction, the query dataset(s) can be projected
onto UMAP. To allow this, the argument return.model should
be set to TRUE in both functions RunPCA and
RunUMAP.
# Train the reference
set.seed(123)
ref <- RunParallelDivisiveICP(
object = ref, L = 10, k = 8,
build.train.params = list(nhvg = 500), # default 2000
threads = 2
) # runs without 'batch.label'## WARNING: Setting 'divisive.method' to 'cluster' as 'batch.label=NULL'.
## If 'batch.label=NULL', 'divisive.method' can be one of: 'cluster', 'random'.
##
## Building training set...
## Training set successfully built.
##
## Computing cluster seed.
##
## Initializing divisive ICP clustering...
## | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100%
##
## Divisive ICP clustering completed successfully.
##
## Predicting cell cluster probabilities using ICP models...
## Prediction of cell cluster probabilities completed successfully.
##
## Multi-level integration completed successfully.
## Divisive ICP: selecting ICP tables multiple of 3
# Compute reference UMAP
set.seed(123)
ref <- RunUMAP(ref,
return.model = TRUE,
umap.method = "uwot", dims = 1:15,
n_neighbors = 30, min_dist = 0.3
)Below is the UMAP plot for the reference sample with cell type annotations. In this example, we are using the annotations provided in the object. Ideally, the sample should be annotated after training by performing clustering and manual cluster annotation. Only the resulting manual cell type annotations should be used for prediction. For simplicity, we will use the annotations provided in the object.
# Vizualize reference
ref.celltype.plot <- PlotDimRed(
object = ref,
color.by = "label",
dimred = "UMAP",
point.size = 0.01,
legend.nrow = 3,
seed.color = 17
) +
ggtitle("Reference: ground-truth labels")## Warning in guide_legend(title = legend.title, nrow = legend.nrow, bycol = TRUE, : Arguments in `...` must be used.
## ✖ Problematic argument:
## • bycol = TRUE
## ℹ Did you misspell an argument name?
Perform reference-mapping with Coralysis by running the
function ReferenceMapping. This requires to provide the
trained reference (ref) with cell type annotations intended
for the prediction (ref.label = "label") and the query
dataset (query). The label in the reference aimed to be
used for prediction needs to be available on colData(ref).
In this case, we are providing the cell type labels from the column
label available in the reference ref. Since we
want to project the query onto the reference UMAP, set
project.umap as TRUE. The argument
dimred.name.prefix just sets the name given as prefix of
the low dimensional embeddings stored in
reducedDimNames(map). The SingleCellExperiment
object query will contain the same information as
query, with the predictions and embeddings mapped onto the
reference. The predictions consist in coral_probability,
coral_labels, and pruned_coral_labels stored
in colData(query). The coral_labels correspond
to the cell type predictions obtained against the reference. The
coral_probability represents the proportion of K
neighbors from the winning class (k.nn was set to 5 -
default 10); the higher the value, the better. The
pruned_coral_labels corresponds to the pruned
low-confidence predicted cell labels based on the confidence probability
scores, i.e., predicted cell labels with confidence scores less than or
equal to 0.5 (label.prune.cutoff = 0.5) are considered
unclassified and set to NA.
# Reference-mapping
set.seed(1024)
query <- ReferenceMapping(
ref = ref, query = query, ref.label = "label",
project.umap = TRUE, dimred.name.prefix = "ref",
k.nn = 5, label.prune.cutoff = 0.5
)Below, SingleR is run with the reference to
predict the cell type labels for the query, in order to
compare with the Coralysis predictions, since the
query dataset used does not provide cell type labels. The
accuracy of Coralysis reference-mapping method is presented
below together with a confusion matrix between the predicted
Coralysis labels (rows) versus predicted
SingleR labels (columns).
# Compare against SingleR
preds.singler <- SingleR(test = query, ref = ref, labels = ref$label, de.method = "wilcox")
colData(query) <- cbind(colData(query), preds.singler)
# Confusion matrix 'coral_labels'
preds_x_truth <- table(query$coral_labels, query$labels)
# Accuracy for 'coral_labels'
acc <- sum(diag(preds_x_truth)) / sum(preds_x_truth)
# Print confusion matrix
preds_x_truth##
## acinar alpha beta delta duct endothelial epsilon mesenchymal pp
## acinar 249 0 0 0 0 0 0 0 0
## alpha 4 201 0 1 0 0 0 0 0
## beta 8 1 165 10 0 0 0 0 0
## delta 0 0 2 45 0 0 0 0 0
## duct 22 0 9 0 299 1 0 8 0
## endothelial 0 0 0 0 0 5 0 0 0
## epsilon 0 0 0 0 0 0 1 0 0
## mesenchymal 0 0 0 0 0 0 0 15 0
## pp 0 0 0 0 0 0 0 0 18
The accuracy of Coralysis reference-mapping method was
93.8%.
The accuracy is higher for the pruned Coralysis cell
labels, as the low-confidence cell labels were removed. See below the
confusion matrix between the pruned Coralysis labels (rows)
versus predicted SingleR labels (columns) and respective
accuracy.
# Confusion matrix for 'pruned_coral_labels'
pruned_preds_x_truth <- table(query$pruned_coral_labels, query$labels)
# Accuracy for 'pruned_coral_labels'
pruned_acc <- sum(diag(pruned_preds_x_truth)) / sum(pruned_preds_x_truth)
# Print confusion matrix
pruned_preds_x_truth##
## acinar alpha beta delta duct endothelial epsilon mesenchymal pp
## acinar 245 0 0 0 0 0 0 0 0
## alpha 2 201 0 0 0 0 0 0 0
## beta 2 1 164 10 0 0 0 0 0
## delta 0 0 2 44 0 0 0 0 0
## duct 19 0 9 0 299 1 0 5 0
## endothelial 0 0 0 0 0 5 0 0 0
## epsilon 0 0 0 0 0 0 0 0 0
## mesenchymal 0 0 0 0 0 0 0 10 0
## pp 0 0 0 0 0 0 0 0 18
The accuracy of Coralysis reference-mapping method using
the pruned cell labels was 95.1%. The accuracy is higher using the
pruned labels as expected once the low-confidence cell predictions were
removed.
Visualize below the query cells projected onto reference UMAP and how
well the predictions match the query ground-truth. The
coral_probability is a prediction confidence score.
Predictions with low scores (<0.5) should be carefully inspected.
# Plot query and reference UMAP side-by-side # fig.height=7.25
# with ground-truth & predicted cell labels
use.color <- c("acinar" = "#FF7F00", "alpha" = "#CCCCCC", "beta" = "#FDCDAC",
"delta" = "#E6F5C9", "duct" = "#E31A1C", "endothelial" = "#377EB8",
"epsilon" = "#7FC97F", "mesenchymal" = "#F0027F", "pp" = "#B3CDE3")
query.singler.plot <- PlotDimRed(
object = query,
color.by = "labels",
dimred = "refUMAP",
point.size = 0.25,
point.stroke = 0.25,
legend.nrow = 3,
use.color = use.color
) +
ggtitle("Query: SingleR predictions")## Warning in guide_legend(title = legend.title, nrow = legend.nrow, bycol = TRUE, : Arguments in `...` must be used.
## ✖ Problematic argument:
## • bycol = TRUE
## ℹ Did you misspell an argument name?
query.predicted.plot <- PlotDimRed(
object = query,
color.by = "coral_labels",
dimred = "refUMAP",
point.size = 0.25,
point.stroke = 0.25,
legend.nrow = 3,
use.color = use.color
) +
ggtitle("Query: Coralysis predictions")## Warning in guide_legend(title = legend.title, nrow = legend.nrow, bycol = TRUE, : Arguments in `...` must be used.
## ✖ Problematic argument:
## • bycol = TRUE
## ℹ Did you misspell an argument name?
query.pruned.plot <- PlotDimRed(
object = query,
color.by = "pruned_coral_labels",
dimred = "refUMAP",
point.size = 0.25,
point.stroke = 0.25,
legend.nrow = 3,
use.color = use.color
) +
ggtitle("Query: Coralysis predictions pruned")## Warning in guide_legend(title = legend.title, nrow = legend.nrow, bycol = TRUE, : Arguments in `...` must be used.
## ✖ Problematic argument:
## • bycol = TRUE
## ℹ Did you misspell an argument name?
query.confidence.plot <- PlotExpression(
object = query,
color.by = "coral_probability",
dimred = "refUMAP",
point.size = 0.25,
point.stroke = 0.1,
color.scale = "viridis"
) +
ggtitle("Query: Coralysis confidence")
cowplot::plot_grid(ref.celltype.plot, query.singler.plot,
query.predicted.plot, query.pruned.plot,
query.confidence.plot,
ncol = 2, align = "vh"
)## R version 4.5.2 (2025-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] SingleR_2.13.0 MouseGastrulationData_1.25.0
## [3] SpatialExperiment_1.21.0 scran_1.39.0
## [5] ensembldb_2.35.0 AnnotationFilter_1.34.0
## [7] GenomicFeatures_1.63.1 AnnotationDbi_1.72.0
## [9] ComplexHeatmap_2.27.0 Coralysis_1.0.0
## [11] scRNAseq_2.25.0 scater_1.39.0
## [13] ggplot2_4.0.0 scuttle_1.21.0
## [15] SingleCellExperiment_1.33.0 SummarizedExperiment_1.41.0
## [17] Biobase_2.70.0 GenomicRanges_1.63.0
## [19] Seqinfo_1.1.0 IRanges_2.45.0
## [21] S4Vectors_0.49.0 BiocGenerics_0.57.0
## [23] generics_0.1.4 MatrixGenerics_1.23.0
## [25] matrixStats_1.5.0 dplyr_1.1.4
## [27] BiocStyle_2.38.0
##
## loaded via a namespace (and not attached):
## [1] RcppAnnoy_0.0.22 BiocIO_1.20.0
## [3] bitops_1.0-9 filelock_1.0.3
## [5] tibble_3.3.0 XML_3.99-0.20
## [7] lifecycle_1.0.4 httr2_1.2.1
## [9] aricode_1.0.3 edgeR_4.9.0
## [11] doParallel_1.0.17 lattice_0.22-7
## [13] alabaster.base_1.10.0 magrittr_2.0.4
## [15] limma_3.67.0 sass_0.4.10
## [17] rmarkdown_2.30 jquerylib_0.1.4
## [19] yaml_2.3.10 metapod_1.19.0
## [21] askpass_1.2.1 reticulate_1.44.0
## [23] cowplot_1.2.0 LiblineaR_2.10-24
## [25] DBI_1.2.3 buildtools_1.0.0
## [27] RColorBrewer_1.1-3 abind_1.4-8
## [29] Rtsne_0.17 purrr_1.2.0
## [31] BumpyMatrix_1.18.0 RCurl_1.98-1.17
## [33] rappdirs_0.3.3 circlize_0.4.16
## [35] ggrepel_0.9.6 irlba_2.3.5.1
## [37] alabaster.sce_1.10.0 maketools_1.3.2
## [39] pheatmap_1.0.13 umap_0.2.10.0
## [41] RSpectra_0.16-2 dqrng_0.4.1
## [43] DelayedMatrixStats_1.33.0 codetools_0.2-20
## [45] DelayedArray_0.37.0 shape_1.4.6.1
## [47] tidyselect_1.2.1 UCSC.utils_1.7.0
## [49] farver_2.1.2 ScaledMatrix_1.19.0
## [51] viridis_0.6.5 BiocFileCache_3.0.0
## [53] GenomicAlignments_1.47.0 jsonlite_2.0.0
## [55] GetoptLong_1.0.5 BiocNeighbors_2.4.0
## [57] iterators_1.0.14 foreach_1.5.2
## [59] tools_4.5.2 Rcpp_1.1.0
## [61] glue_1.8.0 gridExtra_2.3
## [63] SparseArray_1.11.1 xfun_0.54
## [65] flexclust_1.5.0 GenomeInfoDb_1.47.0
## [67] HDF5Array_1.39.0 gypsum_1.7.0
## [69] withr_3.0.2 BiocManager_1.30.26
## [71] fastmap_1.2.0 rhdf5filters_1.23.0
## [73] bluster_1.21.0 openssl_2.3.4
## [75] SparseM_1.84-2 digest_0.6.37
## [77] rsvd_1.0.5 R6_2.6.1
## [79] colorspace_2.1-2 Cairo_1.7-0
## [81] RSQLite_2.4.4 cigarillo_1.1.0
## [83] h5mread_1.3.0 rtracklayer_1.69.1
## [85] class_7.3-23 httr_1.4.7
## [87] S4Arrays_1.11.0 uwot_0.2.4
## [89] pkgconfig_2.0.3 gtable_0.3.6
## [91] modeltools_0.2-24 blob_1.2.4
## [93] S7_0.2.0 XVector_0.51.0
## [95] sys_3.4.3 htmltools_0.5.8.1
## [97] clue_0.3-66 ProtGenerics_1.43.0
## [99] scales_1.4.0 alabaster.matrix_1.10.0
## [101] png_0.1-8 knitr_1.50
## [103] reshape2_1.4.4 rjson_0.2.23
## [105] curl_7.0.0 GlobalOptions_0.1.2
## [107] cachem_1.1.0 rhdf5_2.55.7
## [109] stringr_1.6.0 BiocVersion_3.23.1
## [111] parallel_4.5.2 vipor_0.4.7
## [113] ggrastr_1.0.2 restfulr_0.0.16
## [115] pillar_1.11.1 alabaster.schemas_1.10.0
## [117] vctrs_0.6.5 RANN_2.6.2
## [119] BiocSingular_1.27.0 dbplyr_2.5.1
## [121] beachmat_2.26.0 cluster_2.1.8.1
## [123] beeswarm_0.4.0 evaluate_1.0.5
## [125] magick_2.9.0 cli_3.6.5
## [127] locfit_1.5-9.12 compiler_4.5.2
## [129] Rsamtools_2.27.0 rlang_1.1.6
## [131] crayon_1.5.3 labeling_0.4.3
## [133] plyr_1.8.9 ggbeeswarm_0.7.2
## [135] stringi_1.8.7 viridisLite_0.4.2
## [137] alabaster.se_1.10.0 BiocParallel_1.45.0
## [139] Biostrings_2.79.2 lazyeval_0.2.2
## [141] scrapper_1.5.2 Matrix_1.7-4
## [143] ExperimentHub_3.1.0 sparseMatrixStats_1.23.0
## [145] bit64_4.6.0-1 Rhdf5lib_1.33.0
## [147] KEGGREST_1.51.0 statmod_1.5.1
## [149] alabaster.ranges_1.10.0 AnnotationHub_4.0.0
## [151] igraph_2.2.1 memoise_2.0.1
## [153] bslib_0.9.0 bit_4.6.0
Amezquita R, Lun A, Becht E, Carey V, Carpp L, Geistlinger L, Marini F, Rue-Albrecht K, Risso D, Soneson C, Waldron L, Pages H, Smith M, Huber W, Morgan M, Gottardo R, Hicks S (2020). “Orchestrating single-cell analysis with Bioconductor.” Nature Methods, 17, 137-145. doi:10.1038/s41592-019-0654-x.
Aran D, Looney AP, Liu L, Wu E, Fong V, Hsu A, Chak S, Naikawadi RP, Wolters PJ, Abate AR, Butte AJ, Bhattacharya M (2019). “Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage.” Nat. Immunol., 20:163-172. doi:10.1038/s41590-018-0276-y.
Grun D, Muraro MJ, Boisset JC, Wiebrands K, Lyubimova A, Dharmadhikari G, van den Born M, van Es J, Jansen E, Clevers H, de Koning EJP, van Oudenaarden EJP (2016). “De Novo Prediction of Stem Cell Identity using Single-Cell Transcriptome Data.” Cell Stem Cell, 19(2):266–77. doi.org/10.1016/j.stem.2016.05.010
Muraro MJ, Dharmadhikari G, Grun D, Groen N, Dielen T, Jansen E, van Gurp L, Engelse MA, Carlotti F, de Koning EJP, van Oudenaarden A (2016). “A Single-Cell Transcriptome Atlas of the Human Pancreas.” Cell Systems, 3(4):385–94. doi.org/10.1016/j.cels.2016.09.002
Sousa A, Smolander J, Junttila S, Elo L (2025). “Coralysis enables sensitive identification of imbalanced cell types and states in single-cell data via multi-level integration.” bioRxiv. doi:10.1101/2025.02.07.637023
Wickham H (2016). “ggplot2: Elegant Graphics for Data Analysis.” Springer-Verlag New York.