To illustrate the multi-level integration algorithm, we will use in
this vignette two samples from the MouseGastrulationData
R package: 1 and 3. The
samples correspond to different pool and embryonic stages:
1: pool index 1 and embryonic stage E6.5 with 360 cells
3: pool index 3 and embryonic stage E7.5 with 458 cells
The samples were imported below with the function
EmbryoAtlasData() from MouseGastrulationData
package as a SingleCellExperiment object, the object
(class) required by most functions in Coralysis (see Chapter
4 The SingleCellExperiment class - OSCA manual). The
SCE object provided comprises counts (raw
count data), cell colData (which includes batch and cell
labels, designated as pool and celltype,
respectively), among other data layers.
Run the code below to import the R packages and data
required to reproduce this vignette.
# Packages
library("scran")
library("ggplot2")
library("Coralysis")
library("SingleCellExperiment")
library("MouseGastrulationData")Coralysis requires log-normalised data as input. Run the
code below to apply the basic log-normalisation method implemented in
Seurat (see NormalizeData).
## Normalize the data
# log1p normalization
SeuratNormalisation <- function(object) {
# 'SeuratNormalisation()' function applies the basic Seurat normalization to
# a SingleCellExperiment object with a 'counts' assay. Normalized data
# is saved in the 'logcounts' assay.
logcounts(object) <- apply(counts(object), 2, function(x) {
log1p(x / sum(x) * 10000)
}) # log1p normalization w/ 10K scaling factor
logcounts(object) <- as(logcounts(object), "sparseMatrix")
return(object)
}
sce <- SeuratNormalisation(object = sce)
# Remove 'counts' to reduce data size object
counts(sce) <- NULLIn alternative, scran normalization can be performed,
which is particularly beneficial if rare cell types exist (see the
following vignette).
In addition, have a look into Chapter
7 Normalization - OSCA manual.
Highly variable genes (HVG) can be selected using the R package
scran. The variable pool from
colData is used as batch label. The
SingleCellExperiment object allows alternative experiments
to the main experiment. This is important to keep a backup of all genes
in the same SingleCellExperiment object before selecting
HVGs (see SingleCellExperiment
vignette).
# Feature selection with 'scran' package
nhvg <- 500
batch.label <- "pool"
sce[[batch.label]] <- factor(sce[[batch.label]])
m.hvg <- scran::modelGeneVar(sce, block = sce[[batch.label]])
top.hvg <- getTopHVGs(m.hvg, n = nhvg)
# Subset object
sce <- sce[top.hvg, ]
# Rename genes: 'Ensembl ID' to SYMBOL
row.names(sce) <- rowData(sce)$SYMBOL
rowData(sce) <- NULLThe batch effect between assays can be inspected below by projecting
the data onto t-distributed Stochastic Neighbor Embedding
(t-SNE). This can be achieved by running sequentially the
Coralysis functions RunPCA and
RunTSNE. Provide a seed before running each one of these
functions to ensure reproducibility. The function RunPCA
runs by default the PCA method implemented in the R package
irlba (pca.method="irlba"), which requires a
seed to ensure the same PCA result. In addition, the
assay.name argument needs to be provided, otherwise uses by
default the Coralysis probabilities which are obtained only after
integration (after running RunParallelDivisiveICP). The
assay logcounts, corresponding to the log-normalized data,
and number of principal components to use p were provided.
Any categorical variable available in colData(sce), such as
pool or celltype, can be visualized in a low
dimensional embedding stored in reducedDimNames(sce) with
the Coralysis function PlotDimRed.
# Compute PCA & TSNE
set.seed(123)
sce <- RunPCA(
object = sce,
assay.name = "logcounts",
p = 30, dimred.name = "unintPCA"
)
set.seed(123)
sce <- RunTSNE(sce,
dimred.type = "unintPCA",
dimred.name = "unintTSNE"
)
# Plot TSNE highlighting the batch & cell type
unint.batch.plot <- PlotDimRed(
object = sce,
color.by = "pool",
dimred = "unintTSNE",
point.size = 0.01,
legend.nrow = 1,
seed.color = 1024
)## Warning in guide_legend(title = legend.title, nrow = legend.nrow, bycol = TRUE, : Arguments in `...` must be used.
## ✖ Problematic argument:
## • bycol = TRUE
## ℹ Did you misspell an argument name?
unint.cell.plot <- PlotDimRed(
object = sce,
color.by = "celltype",
dimred = "unintTSNE",
point.size = 0.01,
legend.nrow = 14,
seed.color = 7
)## Warning in guide_legend(title = legend.title, nrow = legend.nrow, bycol = TRUE, : Arguments in `...` must be used.
## ✖ Problematic argument:
## • bycol = TRUE
## ℹ Did you misspell an argument name?
Integrate assays with the multi-level integration algorithm
implemented in Coralysis by running the function
RunParallelDivisiveICP. The only arguments required by this
function are object and batch.label. The
object requires a SingleCellExperiment object
with the assay logcounts. The matrix in
logcounts should be sparse, i.e.,
is(logcounts(sce), "dgCMatrix") is TRUE, and
it should not contain non-expressing genes. This is ensured by running
PrepareData before. The batch.label argument
requires a label column name in colData(sce) corresponding
to the batch label that should be used for integration. In the absence
of a batch, the same function, RunParallelDivisiveICP, can
be run without providing batch.label (i.e.,
batch.label = NULL), in which case the data will be modeled
through the algorithm to identify fine-grained populations that do not
required batch correction. An higher number of threads can
be provided to speed up computing time depending on the number of cores
available. For this example, the algorithm was run 10 times
(L = 10), but generally, this number should be higher (with
the default being L = 50). A train set can be built to
improve the modeling step in Coralysis. In this case, the
option was set to FALSE
(build.train.set = FALSE) as it is faster and requires less
memory. Otherwise, using this option is recommended as it can make the
modeling step faster and consume less memory for larger datasets.
# Prepare data for integration:
# remove non-expressing genes & logcounts is from `dgCMatrix` class
sce <- PrepareData(object = sce)## Data in `logcounts` slot already of `dgCMatrix` class...
## 500/500 features remain after filtering features with only zero values.
# Perform integration with Coralysis
set.seed(1024)
sce <- RunParallelDivisiveICP(
object = sce,
batch.label = "pool",
build.train.set = FALSE,
L = 10, threads = 2
)##
## Computing cluster seed.
##
## Initializing divisive ICP clustering...
## | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100%
##
## Divisive ICP clustering completed successfully.
##
## Predicting cell cluster probabilities using ICP models...
## Prediction of cell cluster probabilities completed successfully.
##
## Multi-level integration completed successfully.
The integration result can be visually inspected by running
sequentially the functions RunPCA and RunTSNE.
The assay.name provided to RunPCA must be
joint.probability (the default), the primary output of
integration with Coralysis. The probability matrices from
Coralysis (i.e., joint.probability) can be
used to obtain an integrated embedding by running
RunPCA(..., assay.name = "joint.probability"). This
integrated PCA can, in turn, be used downstream for clustering or
non-linear dimensional reduction techniques, such as
RunTSNE. Below, the integrated PCA was named
intPCA.
# Compute PCA with joint cluster probabilities & TSNE
set.seed(123)
sce <- RunPCA(sce,
assay.name = "joint.probability",
dimred.name = "intPCA"
)## Divisive ICP: selecting ICP tables multiple of 4
set.seed(123)
sce <- RunTSNE(sce,
dimred.type = "intPCA",
dimred.name = "intTSNE"
)
# Plot TSNE highlighting the batch & cell type
int.batch.plot <- PlotDimRed(
object = sce,
color.by = "pool",
dimred = "intTSNE",
point.size = 0.01,
legend.nrow = 1,
seed.color = 1024
)## Warning in guide_legend(title = legend.title, nrow = legend.nrow, bycol = TRUE, : Arguments in `...` must be used.
## ✖ Problematic argument:
## • bycol = TRUE
## ℹ Did you misspell an argument name?
int.cell.plot <- PlotDimRed(
object = sce,
color.by = "celltype",
dimred = "intTSNE",
point.size = 0.01,
legend.nrow = 14,
seed.color = 7
)## Warning in guide_legend(title = legend.title, nrow = legend.nrow, bycol = TRUE, : Arguments in `...` must be used.
## ✖ Problematic argument:
## • bycol = TRUE
## ℹ Did you misspell an argument name?
Run graph-based clustering with the scran function
clusterCells (see Chapter
5 Clustering - OSCA manual).
# Graph-based clustering on the integrated PCA w/ 'scran' package
blusparams <- bluster::SNNGraphParam(k = 15, cluster.fun = "louvain")
set.seed(123)
sce$cluster <- scran::clusterCells(sce,
use.dimred = "intPCA",
BLUSPARAM = blusparams
)
# Plot clustering
clt.plot <- PlotDimRed(
object = sce,
color.by = "cluster",
dimred = "intTSNE",
point.size = 0.01,
legend.nrow = 3,
seed.color = 65
)## Warning in guide_legend(title = legend.title, nrow = legend.nrow, bycol = TRUE, : Arguments in `...` must be used.
## ✖ Problematic argument:
## • bycol = TRUE
## ℹ Did you misspell an argument name?
Identify the cluster markers by running the Coralysis
function FindAllClusterMarkers. Provide the
clustering.label, in this case, the label used above, i.e.,
cluster. The top three positive markers per cluster were
retrieved and plotted below using the Coralysis function
HeatmapFeatures.
# Cluster markers
cluster.markers <- FindAllClusterMarkers(object = sce, clustering.label = "cluster")## -----------------------------------
## testing cluster 1
## 496 features left after min.pct filtering
## 496 features left after min.diff.pct filtering
## 191 features left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 2
## 497 features left after min.pct filtering
## 497 features left after min.diff.pct filtering
## 183 features left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 3
## 499 features left after min.pct filtering
## 499 features left after min.diff.pct filtering
## 189 features left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 4
## 498 features left after min.pct filtering
## 498 features left after min.diff.pct filtering
## 263 features left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 5
## 498 features left after min.pct filtering
## 498 features left after min.diff.pct filtering
## 277 features left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 6
## 498 features left after min.pct filtering
## 498 features left after min.diff.pct filtering
## 399 features left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 7
## 499 features left after min.pct filtering
## 499 features left after min.diff.pct filtering
## 226 features left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 8
## 497 features left after min.pct filtering
## 497 features left after min.diff.pct filtering
## 210 features left after log2fc.threshold filtering
## -----------------------------------
# Select the top 3 positive markers per cluster
top3.markers <- lapply(X = split(x = cluster.markers, f = cluster.markers$cluster), FUN = function(x) {
head(x[order(x$log2FC, decreasing = TRUE), ], n = 3)
})
top3.markers <- do.call(rbind, top3.markers)
top3.markers <- top3.markers[order(as.numeric(top3.markers$cluster)), ]
# Heatmap of the top 3 positive markers per cluster
HeatmapFeatures(
object = sce,
clustering.label = "cluster",
features = top3.markers$marker,
seed.color = 65
)Differential gene expression (DGE) between cluster 3 and 6
corresponding roughly to Visceral endoderm and
ExE endoderm, respectively. The genes Ttr,
Rbp4, Apoa1, Apom and Ctsl are
upregulated in cluster 6, while Tmsb10 is upregulated in
cluster 3.
# DGE analysis: cluster 3 vs 6
dge.clt3vs6 <- FindClusterMarkers(sce,
clustering.label = "cluster",
clusters.1 = "3",
clusters.2 = "6"
)## testing cluster group.1
## 473 features left after min.pct filtering
## 473 features left after min.diff.pct filtering
## 356 features left after log2fc.threshold filtering
## p.value adj.p.value log2FC pct.1 pct.2 diff.pct marker
## Ttr 1.312605e-26 6.563025e-24 -3.066542 0.5500000 1.0000000 0.4500000 Ttr
## Apoa1 2.699740e-25 1.349870e-22 -2.963640 0.5833333 1.0000000 0.4166667 Apoa1
## Rbp4 7.497694e-27 3.748847e-24 -2.814879 0.5333333 1.0000000 0.4666667 Rbp4
## Apom 8.502979e-27 4.251489e-24 -2.400533 0.5666667 1.0000000 0.4333333 Apom
## Tmsb10 9.995506e-26 4.997753e-23 2.137930 1.0000000 0.6315789 0.3684211 Tmsb10
## Ctsl 4.817850e-23 2.408925e-20 -2.052262 0.8666667 1.0000000 0.1333333 Ctsl
top6.degs <- head(dge.clt3vs6[order(abs(dge.clt3vs6$log2FC),
decreasing = TRUE
), "marker"])
exp.plots <- lapply(X = top6.degs, FUN = function(x) {
PlotExpression(
object = sce, color.by = x,
scale.values = TRUE, point.size = 0.5, point.stroke = 0.5
)
})
cowplot::plot_grid(plotlist = exp.plots, align = "vh", ncol = 3)## R version 4.5.2 (2025-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] MouseGastrulationData_1.25.0 SpatialExperiment_1.21.0
## [3] scran_1.39.0 ensembldb_2.35.0
## [5] AnnotationFilter_1.34.0 GenomicFeatures_1.63.1
## [7] AnnotationDbi_1.72.0 ComplexHeatmap_2.27.0
## [9] Coralysis_1.0.0 scRNAseq_2.25.0
## [11] scater_1.39.0 ggplot2_4.0.0
## [13] scuttle_1.21.0 SingleCellExperiment_1.33.0
## [15] SummarizedExperiment_1.41.0 Biobase_2.70.0
## [17] GenomicRanges_1.63.0 Seqinfo_1.1.0
## [19] IRanges_2.45.0 S4Vectors_0.49.0
## [21] BiocGenerics_0.57.0 generics_0.1.4
## [23] MatrixGenerics_1.23.0 matrixStats_1.5.0
## [25] dplyr_1.1.4 BiocStyle_2.38.0
##
## loaded via a namespace (and not attached):
## [1] RcppAnnoy_0.0.22 BiocIO_1.20.0 bitops_1.0-9
## [4] filelock_1.0.3 tibble_3.3.0 XML_3.99-0.20
## [7] lifecycle_1.0.4 httr2_1.2.1 aricode_1.0.3
## [10] edgeR_4.9.0 doParallel_1.0.17 lattice_0.22-7
## [13] alabaster.base_1.10.0 magrittr_2.0.4 limma_3.67.0
## [16] sass_0.4.10 rmarkdown_2.30 jquerylib_0.1.4
## [19] yaml_2.3.10 metapod_1.19.0 askpass_1.2.1
## [22] reticulate_1.44.0 cowplot_1.2.0 LiblineaR_2.10-24
## [25] DBI_1.2.3 buildtools_1.0.0 RColorBrewer_1.1-3
## [28] abind_1.4-8 Rtsne_0.17 purrr_1.2.0
## [31] BumpyMatrix_1.18.0 RCurl_1.98-1.17 rappdirs_0.3.3
## [34] circlize_0.4.16 ggrepel_0.9.6 irlba_2.3.5.1
## [37] alabaster.sce_1.10.0 maketools_1.3.2 pheatmap_1.0.13
## [40] umap_0.2.10.0 RSpectra_0.16-2 dqrng_0.4.1
## [43] codetools_0.2-20 DelayedArray_0.37.0 shape_1.4.6.1
## [46] tidyselect_1.2.1 UCSC.utils_1.7.0 farver_2.1.2
## [49] ScaledMatrix_1.19.0 viridis_0.6.5 BiocFileCache_3.0.0
## [52] GenomicAlignments_1.47.0 jsonlite_2.0.0 GetoptLong_1.0.5
## [55] BiocNeighbors_2.4.0 iterators_1.0.14 foreach_1.5.2
## [58] tools_4.5.2 Rcpp_1.1.0 glue_1.8.0
## [61] gridExtra_2.3 SparseArray_1.11.1 xfun_0.54
## [64] flexclust_1.5.0 GenomeInfoDb_1.47.0 HDF5Array_1.39.0
## [67] gypsum_1.7.0 withr_3.0.2 BiocManager_1.30.26
## [70] fastmap_1.2.0 rhdf5filters_1.23.0 bluster_1.21.0
## [73] openssl_2.3.4 SparseM_1.84-2 digest_0.6.37
## [76] rsvd_1.0.5 R6_2.6.1 colorspace_2.1-2
## [79] Cairo_1.7-0 RSQLite_2.4.4 cigarillo_1.1.0
## [82] h5mread_1.3.0 rtracklayer_1.69.1 class_7.3-23
## [85] httr_1.4.7 S4Arrays_1.11.0 uwot_0.2.4
## [88] pkgconfig_2.0.3 gtable_0.3.6 modeltools_0.2-24
## [91] blob_1.2.4 S7_0.2.0 XVector_0.51.0
## [94] sys_3.4.3 htmltools_0.5.8.1 clue_0.3-66
## [97] ProtGenerics_1.43.0 scales_1.4.0 alabaster.matrix_1.10.0
## [100] png_0.1-8 knitr_1.50 reshape2_1.4.4
## [103] rjson_0.2.23 curl_7.0.0 GlobalOptions_0.1.2
## [106] cachem_1.1.0 rhdf5_2.55.7 stringr_1.6.0
## [109] BiocVersion_3.23.1 parallel_4.5.2 vipor_0.4.7
## [112] ggrastr_1.0.2 restfulr_0.0.16 pillar_1.11.1
## [115] alabaster.schemas_1.10.0 vctrs_0.6.5 RANN_2.6.2
## [118] BiocSingular_1.27.0 dbplyr_2.5.1 beachmat_2.26.0
## [121] cluster_2.1.8.1 beeswarm_0.4.0 evaluate_1.0.5
## [124] magick_2.9.0 cli_3.6.5 locfit_1.5-9.12
## [127] compiler_4.5.2 Rsamtools_2.27.0 rlang_1.1.6
## [130] crayon_1.5.3 labeling_0.4.3 plyr_1.8.9
## [133] ggbeeswarm_0.7.2 stringi_1.8.7 viridisLite_0.4.2
## [136] alabaster.se_1.10.0 BiocParallel_1.45.0 Biostrings_2.79.2
## [139] lazyeval_0.2.2 Matrix_1.7-4 ExperimentHub_3.1.0
## [142] sparseMatrixStats_1.23.0 bit64_4.6.0-1 Rhdf5lib_1.33.0
## [145] KEGGREST_1.51.0 statmod_1.5.1 alabaster.ranges_1.10.0
## [148] AnnotationHub_4.0.0 igraph_2.2.1 memoise_2.0.1
## [151] bslib_0.9.0 bit_4.6.0
Amezquita R, Lun A, Becht E, Carey V, Carpp L, Geistlinger L, Marini F, Rue-Albrecht K, Risso D, Soneson C, Waldron L, Pages H, Smith M, Huber W, Morgan M, Gottardo R, Hicks S (2020). “Orchestrating single-cell analysis with Bioconductor.” Nature Methods, 17, 137-145. https://www.nature.com/articles/s41592-019-0654-x.
Griffiths J, Lun A (2024). “MouseGastrulationData: Single-Cell-omics Data across Mouse Gastrulation and Early Organogenesis”. 10.18129/B9.bioc.MouseGastrulationData. R package version 1.18.0.
Lun ATL, McCarthy DJ, Marioni JC (2016). “A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor.” F1000Res., 5, 2122. doi:10.12688/f1000research.9501.2.
Sousa A, Smolander J, Junttila S, Elo L (2025). “Coralysis enables sensitive identification of imbalanced cell types and states in single-cell data via multi-level integration.” bioRxiv. doi:10.1101/2025.02.07.637023
Wickham H (2016). “ggplot2: Elegant Graphics for Data Analysis.” Springer-Verlag New York.