[["index.html", "Advanced Single-Cell Analysis with Bioconductor Welcome", " Advanced Single-Cell Analysis with Bioconductor Authors: Robert Amezquita [aut], Aaron Lun [aut], Stephanie Hicks [aut], Raphael Gottardo [aut], Ludwig Geistlinger [cre], Peter Hickey [ctb] Version: 1.14.0 Modified: 2024-07-01 Compiled: 2025-01-22 Environment: R version 4.4.2 (2024-10-31), Bioconductor 3.20 License: CC BY 4.0 Copyright: Bioconductor, 2025 Source: https://github.com/OSCA-source/OSCA.advanced Welcome This site contains the advanced analysis chapters for the “Orchestrating Single-Cell Analysis with Bioconductor” book. It describes the more complex steps of a single-cell RNA-seq analysis, covering doublet detection, cell cycle assignment, specific steps for processing droplet data, nuclei-specific analyses, trajectory analyses, integrated analyses with protein abundances, and interactive visualization. It also elaborates on some of the basic analysis steps, focusing on alternative strategies and theoretical considerations. It is intended for readers who are already familiar with basic single-cell analyses, possibly after reading some of the prior books in this collection. "],["quality-control-redux.html", "Chapter 1 Quality control, redux 1.1 Overview 1.2 The isOutlier() function 1.3 Assumptions of outlier detection 1.4 Considering experimental factors 1.5 Diagnosing cell type loss Session Info", " Chapter 1 Quality control, redux 1.1 Overview Basic Chapter 1 introduced the concept of per-cell quality control, focusing on outlier detection to provide an adaptive threshold on our chosen QC metrics. 
This chapter elaborates on the technical details of outlier-based quality control, including some of the underlying assumptions, how to handle multi-batch experiments, and how to diagnose the loss of cell types. We will again demonstrate using the 416B dataset from A. T. L. Lun et al. (2017). View set-up code (Workflow Chapter 1) #--- loading ---# library(scRNAseq) sce.416b &lt;- LunSpikeInData(which=&quot;416b&quot;) sce.416b$block &lt;- factor(sce.416b$block) 1.2 The isOutlier() function The isOutlier() function from the scuttle package is the workhorse function for outlier detection. As previously mentioned, it will define an observation as an outlier if it is more than a specified number of MADs (default 3) from the median in the specified direction. library(AnnotationHub) ens.mm.v97 &lt;- AnnotationHub()[[&quot;AH73905&quot;]] chr.loc &lt;- mapIds(ens.mm.v97, keys=rownames(sce.416b), keytype=&quot;GENEID&quot;, column=&quot;SEQNAME&quot;) is.mito &lt;- which(chr.loc==&quot;MT&quot;) library(scuttle) df &lt;- perCellQCMetrics(sce.416b, subsets=list(Mito=is.mito)) low.lib &lt;- isOutlier(df$sum, type=&quot;lower&quot;, log=TRUE) summary(low.lib) ## Mode FALSE TRUE ## logical 188 4 high.mito &lt;- isOutlier(df$subsets_Mito_percent, type=&quot;higher&quot;) summary(high.mito) ## Mode FALSE TRUE ## logical 185 7 The perCellQCFilters() function mentioned in Basic Chapter 1 is just a convenience wrapper around isOutlier(). Advanced users may prefer to use isOutlier() directly to achieve more control over the fields of df that are used for filtering. reasons &lt;- perCellQCFilters(df, sub.fields=c(&quot;subsets_Mito_percent&quot;, &quot;altexps_ERCC_percent&quot;)) stopifnot(identical(low.lib, reasons$low_lib_size)) We can also alter the directionality of the outlier detection, the number of MADs used, and more advanced parameters related to batch processing (see Section 1.4). For example, we can remove both high and low outliers that are more than 5 MADs from the median. 
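The MAD-based rule used by isOutlier() is easy to reproduce by hand. The sketch below recomputes a lower threshold on the log-scale, mirroring isOutlier(type=&quot;lower&quot;, log=TRUE) up to its internal conventions; the library sizes are simulated (log-normal) purely for illustration.

```r
# Sketch: reproducing a lower MAD-based threshold by hand.
# The library sizes here are simulated for illustration only.
set.seed(1000)
libsizes <- round(2^rnorm(200, mean=17, sd=0.5))
log.sizes <- log2(libsizes)

# Threshold at 3 MADs below the median on the log-scale,
# back-transformed to the original scale.
lower.threshold <- 2^(median(log.sizes) - 3 * mad(log.sizes))
low.manual <- libsizes < lower.threshold
summary(low.manual)
```

Working on the log-scale avoids negative thresholds and accommodates the right-skewed distribution of library sizes.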
The output also contains the thresholds in the attributes for further perusal. low.lib2 &lt;- isOutlier(df$sum, type=&quot;both&quot;, log=TRUE, nmads=5) attr(low.lib, &quot;thresholds&quot;) ## lower higher ## 434083 Inf Incidentally, the is.mito code provides a demonstration of how to obtain the identity of the mitochondrial genes from the gene identifiers. The same approach can be used for gene symbols by simply setting keytype=\"SYMBOL\". 1.3 Assumptions of outlier detection Outlier detection assumes that most cells are of acceptable quality. This is usually reasonable and can be experimentally supported in some situations by visually checking that the cells are intact, e.g., on the microwell plate. If most cells are of (unacceptably) low quality, the adaptive thresholds will fail as they cannot remove the majority of cells by definition - see Figure 1.1 below for a demonstrative example. Of course, what is acceptable or not is in the eye of the beholder - neurons, for example, are notoriously difficult to dissociate, and we would often retain cells in a neuronal scRNA-seq dataset with QC metrics that would be unacceptable in a more amenable system like embryonic stem cells. Another assumption mentioned in Basic Chapter 1 is that the QC metrics are independent of the biological state of each cell. This is most likely to be violated in highly heterogeneous cell populations where some cell types naturally have, e.g., less total RNA (see Figure 3A of Germain, Sonrel, and Robinson (2020)) or more mitochondria. Such cells are more likely to be considered outliers and removed, even in the absence of any technical problems with their capture or sequencing. The use of the MAD mitigates this problem to some extent by accounting for biological variability in the QC metrics. 
A heterogeneous population should have higher variability in the metrics among high-quality cells, increasing the MAD and reducing the chance of incorrectly removing particular cell types (at the cost of reducing power to remove low-quality cells). In general, these assumptions are either reasonable or their violations have little effect on downstream conclusions. Nonetheless, it is helpful to keep them in mind when interpreting the results. 1.4 Considering experimental factors More complex studies may involve batches of cells generated with different experimental parameters (e.g., sequencing depth). In such cases, the adaptive strategy should be applied to each batch separately. It makes little sense to compute medians and MADs from a mixture distribution containing samples from multiple batches. For example, if the sequencing coverage is lower in one batch compared to the others, it will drag down the median and inflate the MAD. This will reduce the suitability of the adaptive threshold for the other batches. If each batch is represented by its own SingleCellExperiment, the perCellQCFilters() function can be directly applied to each batch as previously described. However, if cells from all batches have been merged into a single SingleCellExperiment, the batch= argument should be used to ensure that outliers are identified within each batch. By doing so, the outlier detection algorithm has the opportunity to account for systematic differences in the QC metrics across batches. Diagnostic plots are also helpful here: batches with systematically poor values for any metric can then be quickly identified for further troubleshooting or outright removal. We will again illustrate using the 416B dataset, which contains two experimental factors - plate of origin and oncogene induction status. We combine these factors together and use this in the batch= argument to isOutlier() via quickPerCellQC(). 
This results in the removal of slightly more cells as the MAD is no longer inflated by (i) systematic differences in sequencing depth between batches and (ii) differences in number of genes expressed upon oncogene induction. batch &lt;- paste0(sce.416b$phenotype, &quot;-&quot;, sce.416b$block) batch.reasons &lt;- perCellQCFilters(df, batch=batch, sub.fields=c(&quot;subsets_Mito_percent&quot;, &quot;altexps_ERCC_percent&quot;)) colSums(as.matrix(batch.reasons)) ## low_lib_size low_n_features high_subsets_Mito_percent ## 5 4 2 ## high_altexps_ERCC_percent discard ## 6 9 That said, the use of batch= involves the stronger assumption that most cells in each batch are of high quality. If an entire batch failed, outlier detection will not be able to act as an appropriate QC filter for that batch. For example, two batches in the Grun et al. (2016) human pancreas dataset contain a substantial proportion of putative damaged cells with higher ERCC content than the other batches (Figure 1.1). This inflates the median and MAD within those batches, resulting in a failure to remove the assumed low-quality cells. library(scRNAseq) sce.grun &lt;- GrunPancreasData() sce.grun &lt;- addPerCellQC(sce.grun) # First attempt with batch-specific thresholds. library(scater) discard.ercc &lt;- isOutlier(sce.grun$altexps_ERCC_percent, type=&quot;higher&quot;, batch=sce.grun$donor) plotColData(sce.grun, x=&quot;donor&quot;, y=&quot;altexps_ERCC_percent&quot;, colour_by=I(discard.ercc)) Figure 1.1: Distribution of the proportion of ERCC transcripts in each donor of the Grun pancreas dataset. Each point represents a cell and is coloured according to whether it was identified as an outlier within each batch. In such cases, it is better to compute a shared median and MAD from the other batches and use those estimates to obtain an appropriate filter threshold for cells in the problematic batches. 
This is automatically done by isOutlier() when we subset to cells from those other batches, as shown in Figure 1.2. # Second attempt, sharing information across batches # to avoid dramatically different thresholds for unusual batches. discard.ercc2 &lt;- isOutlier(sce.grun$altexps_ERCC_percent, type=&quot;higher&quot;, batch=sce.grun$donor, subset=sce.grun$donor %in% c(&quot;D17&quot;, &quot;D2&quot;, &quot;D7&quot;)) plotColData(sce.grun, x=&quot;donor&quot;, y=&quot;altexps_ERCC_percent&quot;, colour_by=I(discard.ercc2)) Figure 1.2: Distribution of the proportion of ERCC transcripts in each donor of the Grun pancreas dataset. Each point represents a cell and is coloured according to whether it was identified as an outlier, using a common threshold for the problematic batches. To identify problematic batches, one useful rule of thumb is to find batches with QC thresholds that are themselves outliers compared to the thresholds of other batches. The assumption here is that most batches consist of a majority of high-quality cells such that the threshold value should follow some unimodal distribution across “typical” batches. If we observe a batch with an extreme threshold value, we may suspect that it contains a large number of low-quality cells that inflate the per-batch MAD. We demonstrate this process below for the Grun et al. (2016) data. ercc.thresholds &lt;- attr(discard.ercc, &quot;thresholds&quot;)[&quot;higher&quot;,] ercc.thresholds ## D10 D17 D2 D3 D7 ## 73.611 7.600 6.011 113.106 15.217 names(ercc.thresholds)[isOutlier(ercc.thresholds, type=&quot;higher&quot;)] ## [1] &quot;D10&quot; &quot;D3&quot; If we cannot assume that most batches contain a majority of high-quality cells, then all bets are off; we must revert to the approach of picking an arbitrary threshold value (Basic Section 1.3.1) based on some “sensible” prior expectations and hoping for the best. 
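In that situation, the filter reduces to fixed, manually chosen cutoffs. A minimal sketch is shown below; the metric values and cutoffs are hypothetical placeholders, not recommendations.

```r
# Sketch: fixed QC thresholds as a fallback when no batch can be
# assumed to contain mostly high-quality cells. All numbers here
# are hypothetical placeholders chosen for illustration.
qc.metrics <- data.frame(
    sum = c(250000, 80000, 500000, 120000),
    detected = c(6000, 3000, 8000, 5500),
    subsets_Mito_percent = c(2, 15, 1, 4)
)
fixed.discard <- qc.metrics$sum < 1e5 |
    qc.metrics$detected < 5e3 |
    qc.metrics$subsets_Mito_percent > 10
fixed.discard # only the second cell fails (low coverage, high mito)
```

Unlike the adaptive approach, these cutoffs must be re-justified for every new dataset and protocol.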
1.5 Diagnosing cell type loss The biggest practical concern during QC is whether an entire cell type is inadvertently discarded. There is always some risk of this occurring as the QC metrics are never fully independent of biological state. We can diagnose cell type loss by looking for systematic differences in gene expression between the discarded and retained cells. To demonstrate, we compute the average count across the discarded and retained pools in the 416B dataset and then compute the log-fold change between the pool averages. # Using the non-batched &#39;discard&#39; vector for demonstration purposes, # as it has more cells for stable calculation of &#39;lost&#39;. discard &lt;- reasons$discard lost &lt;- calculateAverage(counts(sce.416b)[,discard]) kept &lt;- calculateAverage(counts(sce.416b)[,!discard]) library(edgeR) logged &lt;- cpm(cbind(lost, kept), log=TRUE, prior.count=2) logFC &lt;- logged[,1] - logged[,2] abundance &lt;- rowMeans(logged) If the discarded pool is enriched for a certain cell type, we should observe increased expression of the corresponding marker genes. No systematic upregulation of genes is apparent in the discarded pool in Figure 1.3, suggesting that the QC step did not inadvertently filter out a cell type in the 416B dataset. plot(abundance, logFC, xlab=&quot;Average count&quot;, ylab=&quot;Log-FC (lost/kept)&quot;, pch=16) points(abundance[is.mito], logFC[is.mito], col=&quot;dodgerblue&quot;, pch=16) Figure 1.3: Log-fold change in expression in the discarded cells compared to the retained cells in the 416B dataset. Each point represents a gene, with mitochondrial transcripts in blue. For comparison, let us consider the QC step for the PBMC dataset from 10X Genomics (Zheng et al. 2017). We’ll apply an arbitrary fixed threshold on the library size to filter cells rather than using any outlier-based method. Specifically, we remove all libraries with a library size below 500. 
View set-up code (Workflow Chapter 3) #--- loading ---# library(DropletTestFiles) raw.path &lt;- getTestFile(&quot;tenx-2.1.0-pbmc4k/1.0.0/raw.tar.gz&quot;) out.path &lt;- file.path(tempdir(), &quot;pbmc4k&quot;) untar(raw.path, exdir=out.path) library(DropletUtils) fname &lt;- file.path(out.path, &quot;raw_gene_bc_matrices/GRCh38&quot;) sce.pbmc &lt;- read10xCounts(fname, col.names=TRUE) #--- gene-annotation ---# library(scater) rownames(sce.pbmc) &lt;- uniquifyFeatureNames( rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol) library(EnsDb.Hsapiens.v86) location &lt;- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce.pbmc)$ID, column=&quot;SEQNAME&quot;, keytype=&quot;GENEID&quot;) #--- cell-detection ---# set.seed(100) e.out &lt;- emptyDrops(counts(sce.pbmc)) sce.pbmc &lt;- sce.pbmc[,which(e.out$FDR &lt;= 0.001)] discard &lt;- colSums(counts(sce.pbmc)) &lt; 500 lost &lt;- calculateAverage(counts(sce.pbmc)[,discard]) kept &lt;- calculateAverage(counts(sce.pbmc)[,!discard]) logged &lt;- edgeR::cpm(cbind(lost, kept), log=TRUE, prior.count=2) logFC &lt;- logged[,1] - logged[,2] abundance &lt;- rowMeans(logged) The presence of a distinct population in the discarded pool manifests in Figure 1.4 as a set of genes that are strongly upregulated in lost. This includes PF4, PPBP and SDPR, which (spoiler alert!) indicates that there is a platelet population that has been discarded by our fixed library size threshold. plot(abundance, logFC, xlab=&quot;Average count&quot;, ylab=&quot;Log-FC (lost/kept)&quot;, pch=16) platelet &lt;- c(&quot;PF4&quot;, &quot;PPBP&quot;, &quot;SDPR&quot;) points(abundance[platelet], logFC[platelet], col=&quot;orange&quot;, pch=16) Figure 1.4: Average counts across all discarded and retained cells in the PBMC dataset, after using a more stringent filter on the total UMI count. Each point represents a gene, with platelet-related genes highlighted in orange. 
If we suspect that cell types have been incorrectly discarded by our QC procedure, the most direct solution is to relax the QC filters for metrics that are associated with genuine biological differences. For example, outlier detection can be relaxed by increasing nmads= in the isOutlier() calls. Of course, this increases the risk of retaining more low-quality cells and encountering the problems discussed in Basic Section 1.1. The logical endpoint of this line of reasoning is to avoid filtering altogether, as discussed in Basic Section 1.5. As an aside, it is worth mentioning that the true technical quality of a cell may also be correlated with its type. (This differs from a correlation between the cell type and the QC metrics, as the latter are our imperfect proxies for quality.) This can arise if some cell types are not amenable to dissociation or microfluidics handling during the scRNA-seq protocol. In such cases, it is possible to “correctly” discard an entire cell type during QC if all of its cells are damaged. Indeed, concerns over the computational removal of cell types during QC are probably minor compared to losses in the experimental protocol. 
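In code, relaxing the filter just means passing a larger nmads= through perCellQCFilters() to the underlying isOutlier() calls. This sketch reuses the df of QC metrics computed in Section 1.2.

```r
# Sketch: relaxing outlier detection from the default 3 MADs to 5,
# trading a higher risk of retaining low-quality cells for a lower
# risk of discarding a biologically distinct cell type.
library(scuttle)
relaxed <- perCellQCFilters(df, nmads=5,
    sub.fields=c("subsets_Mito_percent", "altexps_ERCC_percent"))
summary(relaxed$discard)
```

The log-fold change diagnostic above can then be repeated with the relaxed discard vector to check whether the suspect population is recovered.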
Session Info View session info R version 4.4.2 (2024-10-31) Platform: x86_64-pc-linux-gnu Running under: Ubuntu 24.04.1 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C time zone: America/New_York tzcode source: system (glibc) attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] edgeR_4.4.1 limma_3.62.2 [3] scater_1.34.0 ggplot2_3.5.1 [5] scRNAseq_2.20.0 scuttle_1.16.0 [7] SingleCellExperiment_1.28.1 SummarizedExperiment_1.36.0 [9] MatrixGenerics_1.18.1 matrixStats_1.5.0 [11] ensembldb_2.30.0 AnnotationFilter_1.30.0 [13] GenomicFeatures_1.58.0 AnnotationDbi_1.68.0 [15] Biobase_2.66.0 GenomicRanges_1.58.0 [17] GenomeInfoDb_1.42.1 IRanges_2.40.1 [19] S4Vectors_0.44.0 AnnotationHub_3.14.0 [21] BiocFileCache_2.14.0 dbplyr_2.5.0 [23] BiocGenerics_0.52.0 BiocStyle_2.34.0 [25] rebook_1.16.0 loaded via a namespace (and not attached): [1] jsonlite_1.8.9 CodeDepends_0.6.6 magrittr_2.0.3 [4] ggbeeswarm_0.7.2 gypsum_1.2.0 farver_2.1.2 [7] rmarkdown_2.29 BiocIO_1.16.0 zlibbioc_1.52.0 [10] vctrs_0.6.5 memoise_2.0.1 Rsamtools_2.22.0 [13] RCurl_1.98-1.16 htmltools_0.5.8.1 S4Arrays_1.6.0 [16] curl_6.1.0 BiocNeighbors_2.0.1 Rhdf5lib_1.28.0 [19] SparseArray_1.6.1 rhdf5_2.50.2 sass_0.4.9 [22] alabaster.base_1.6.1 bslib_0.8.0 alabaster.sce_1.6.0 [25] httr2_1.1.0 cachem_1.1.0 GenomicAlignments_1.42.0 [28] mime_0.12 lifecycle_1.0.4 pkgconfig_2.0.3 [31] rsvd_1.0.5 Matrix_1.7-1 R6_2.5.1 [34] fastmap_1.2.0 GenomeInfoDbData_1.2.13 digest_0.6.37 [37] colorspace_2.1-1 irlba_2.3.5.1 ExperimentHub_2.14.0 [40] RSQLite_2.3.9 beachmat_2.22.0 labeling_0.4.3 [43] filelock_1.0.3 httr_1.4.7 
abind_1.4-8 [46] compiler_4.4.2 bit64_4.6.0-1 withr_3.0.2 [49] BiocParallel_1.40.0 viridis_0.6.5 DBI_1.2.3 [52] HDF5Array_1.34.0 alabaster.ranges_1.6.0 alabaster.schemas_1.6.0 [55] rappdirs_0.3.3 DelayedArray_0.32.0 rjson_0.2.23 [58] tools_4.4.2 vipor_0.4.7 beeswarm_0.4.0 [61] glue_1.8.0 restfulr_0.0.15 rhdf5filters_1.18.0 [64] grid_4.4.2 generics_0.1.3 gtable_0.3.6 [67] BiocSingular_1.22.0 ScaledMatrix_1.14.0 XVector_0.46.0 [70] ggrepel_0.9.6 BiocVersion_3.20.0 pillar_1.10.1 [73] dplyr_1.1.4 lattice_0.22-6 rtracklayer_1.66.0 [76] bit_4.5.0.1 tidyselect_1.2.1 locfit_1.5-9.10 [79] Biostrings_2.74.1 knitr_1.49 gridExtra_2.3 [82] bookdown_0.42 ProtGenerics_1.38.0 xfun_0.50 [85] statmod_1.5.0 UCSC.utils_1.2.0 lazyeval_0.2.2 [88] yaml_2.3.10 evaluate_1.0.3 codetools_0.2-20 [91] tibble_3.2.1 alabaster.matrix_1.6.1 BiocManager_1.30.25 [94] graph_1.84.1 cli_3.6.3 munsell_0.5.1 [97] jquerylib_0.1.4 Rcpp_1.0.14 dir.expiry_1.14.0 [100] png_0.1-8 XML_3.99-0.18 parallel_4.4.2 [103] blob_1.2.4 bitops_1.0-9 viridisLite_0.4.2 [106] alabaster.se_1.6.0 scales_1.3.0 purrr_1.0.2 [109] crayon_1.5.3 rlang_1.1.5 cowplot_1.1.3 [112] KEGGREST_1.46.0 References "],["more-norm.html", "Chapter 2 Normalization, redux 2.1 Overview 2.2 Scaling and the pseudo-count 2.3 Downsampling instead of scaling 2.4 Comments on other transformations 2.5 Normalization versus batch correction Session Info", " Chapter 2 Normalization, redux 2.1 Overview Basic Chapter 2 introduced the principles and methodology for scaling normalization of scRNA-seq data. 
This chapter provides commentary on miscellaneous theoretical aspects, including the motivation for the pseudo-count, the use and benefits of downsampling instead of scaling, and some discussion of alternative transformations. 2.2 Scaling and the pseudo-count When log-transforming, logNormCounts() will add a pseudo-count to avoid undefined values at zero. Larger pseudo-counts will shrink the log-fold changes between cells towards zero for low-abundance genes, meaning that downstream high-dimensional analyses will be driven more by differences in expression for high-abundance genes. Conversely, smaller pseudo-counts will increase the relative contribution of low-abundance genes. Common practice is to use a pseudo-count of 1, for the simple pragmatic reason that it preserves sparsity in the original matrix (i.e., zeroes in the input remain zeroes after transformation). This works well in all but the most pathological scenarios (A. Lun 2018). An interesting subtlety of logNormCounts() is that it will center the size factors at unity, if they were not already. This puts the normalized expression values on roughly the same scale as the original counts for easier interpretation. For example, Figure 2.1 shows that interneurons have a median Snap25 log-expression of 5-6; this roughly translates to an original count of 30-60 UMIs in each cell, which gives us some confidence that it is actually expressed. This relationship to the original data would be less obvious - or indeed, lost altogether - if the centering were not performed. 
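The shrinkage effect of the pseudo-count described above can be seen with a toy calculation on hypothetical counts (numbers chosen purely for illustration).

```r
# Sketch: a true two-fold difference between two cells is shrunk
# towards zero by the pseudo-count, most strongly at low counts.
a <- c(2, 20, 200) # hypothetical normalized counts in one cell
b <- c(1, 10, 100) # a second cell at half the expression
log2((a + 1)/(b + 1))   # approaches the true value of 1 at high counts
log2((a + 10)/(b + 10)) # a larger pseudo-count shrinks more aggressively
```

At the lowest counts, the estimated log-fold change is pulled well below the true value of 1, which is exactly the protective behaviour discussed in the text.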
View set-up code (Workflow Chapter 2) #--- loading ---# library(scRNAseq) sce.zeisel &lt;- ZeiselBrainData() library(scater) sce.zeisel &lt;- aggregateAcrossFeatures(sce.zeisel, id=sub(&quot;_loc[0-9]+$&quot;, &quot;&quot;, rownames(sce.zeisel))) #--- gene-annotation ---# library(org.Mm.eg.db) rowData(sce.zeisel)$Ensembl &lt;- mapIds(org.Mm.eg.db, keys=rownames(sce.zeisel), keytype=&quot;SYMBOL&quot;, column=&quot;ENSEMBL&quot;) #--- quality-control ---# stats &lt;- perCellQCMetrics(sce.zeisel, subsets=list( Mt=rowData(sce.zeisel)$featureType==&quot;mito&quot;)) qc &lt;- quickPerCellQC(stats, percent_subsets=c(&quot;altexps_ERCC_percent&quot;, &quot;subsets_Mt_percent&quot;)) sce.zeisel &lt;- sce.zeisel[,!qc$discard] library(scuttle) library(scater) sce.zeisel &lt;- logNormCounts(sce.zeisel) plotExpression(sce.zeisel, x=&quot;level1class&quot;, features=&quot;Snap25&quot;, colour_by=&quot;level1class&quot;) Figure 2.1: Distribution of log-expression values for Snap25 in each cell type of the Zeisel brain dataset. Centering also allows us to interpret a pseudo-count of 1 as an extra read or UMI for each gene. In practical terms, this means that the shrinkage effect of the pseudo-count diminishes as read/UMI counts increase. As a result, any estimates of log-fold changes in expression (e.g., from differences in the log-values between groups of cells) become increasingly accurate with deeper coverage. Conversely, at lower counts, stronger shrinkage avoids inflated differences due to sampling noise, which might otherwise mask interesting features in downstream analyses like clustering. In some sense, the pseudo-count aims to protect later analyses from the lack of information at low counts while trying to minimize its own effect at high counts. For comparison, consider the situation where we applied a constant pseudo-count to some count-per-million-like measure. 
It is easy to see that the accuracy of the subsequent log-fold changes would never improve regardless of how much additional sequencing was performed; scaling to a constant library size of a million means that the pseudo-count will have the same effect for all datasets. This is ironic given that the whole intention of sequencing more deeply is to improve quantification of these differences between cell subpopulations. The same criticism applies to popular metrics like the “counts-per-10K” used in, e.g., Seurat. 2.3 Downsampling instead of scaling In rare cases, direct scaling of the counts is not appropriate due to the effect described by A. Lun (2018). Briefly, this is caused by the fact that the mean of the log-normalized counts is not the same as the log-transformed mean of the normalized counts. The difference between them depends on the mean and variance of the original counts, such that there is a systematic trend in the mean of the log-counts with respect to the count size. This typically manifests as trajectories correlated strongly with library size even after library size normalization, as shown in Figure 2.2 for synthetic scRNA-seq data generated with a pool-and-split approach (Tian et al. 2019). # TODO: move to scRNAseq. library(BiocFileCache) bfc &lt;- BiocFileCache(ask=FALSE) qcdata &lt;- bfcrpath(bfc, &quot;https://github.com/LuyiTian/CellBench_data/blob/master/data/mRNAmix_qc.RData?raw=true&quot;) env &lt;- new.env() load(qcdata, envir=env) sce.8qc &lt;- env$sce8_qc # Library size normalization and log-transformation. sce.8qc &lt;- logNormCounts(sce.8qc) sce.8qc &lt;- runPCA(sce.8qc) gridExtra::grid.arrange( plotPCA(sce.8qc, colour_by=I(factor(sce.8qc$mix))), plotPCA(sce.8qc, colour_by=I(librarySizeFactors(sce.8qc))), ncol=2 ) Figure 2.2: PCA plot of all pool-and-split libraries in the SORT-seq CellBench data, computed from the log-normalized expression values with library size-derived size factors. 
Each point represents a library and is colored by the mixing ratio used to construct it (left) or by the size factor (right). As the problem arises from differences in the sizes of the counts, the most straightforward solution is to downsample the counts of the high-coverage cells to match those of low-coverage cells. This uses the size factors to determine the amount of downsampling for each cell required to reach the 1st percentile of size factors. (The small minority of cells with smaller size factors are simply scaled up. We do not attempt to downsample to the smallest size factor, as this would result in excessive loss of information because of one aberrant cell with a very low size factor.) We can see that this eliminates the library size factor-associated trajectories from the first two PCs, improving resolution of the known differences based on mixing ratios (Figure 2.3). The log-transformation is still necessary but no longer introduces a shift in the means when the sizes of the counts are similar across cells. sce.8qc2 &lt;- logNormCounts(sce.8qc, downsample=TRUE) sce.8qc2 &lt;- runPCA(sce.8qc2) gridExtra::grid.arrange( plotPCA(sce.8qc2, colour_by=I(factor(sce.8qc2$mix))), plotPCA(sce.8qc2, colour_by=I(librarySizeFactors(sce.8qc2))), ncol=2 ) Figure 2.3: PCA plot of pool-and-split libraries in the SORT-seq CellBench data, computed from the log-transformed counts after downsampling in proportion to the library size factors. Each point represents a library and is colored by the mixing ratio used to construct it (left) or by the size factor (right). While downsampling is an expedient solution, it is statistically inefficient as it needs to increase the noise of high-coverage cells in order to avoid differences with low-coverage cells. It is also slower than simple scaling. Thus, we would only recommend using this approach after an initial analysis with scaled counts reveals suspicious trajectories that are strongly correlated with the size factors. 
In such cases, it is a simple matter to re-normalize by downsampling to determine whether the trajectory is an artifact of the log-transformation. 2.4 Comments on other transformations Of course, the log-transformation is not the only possible transformation. Another somewhat common choice is the square root, motivated by the fact that it is the variance stabilizing transformation for Poisson-distributed counts. This assumes that counts are actually Poisson-distributed, which is true enough from the perspective of sequencing noise in UMI counts but ignores biological overdispersion. One may also see the inverse hyperbolic sine (a.k.a., arcsinh) transformation being used on occasion, which is very similar to the log-transformation when considering non-negative values. The main practical difference for scRNA-seq applications is a larger initial jump from zero to non-zero values. Alternatively, we may use more sophisticated approaches for variance stabilizing transformations in genomics data, e.g., DESeq2 or sctransform. These aim to remove the mean-variance trend more effectively than the simpler transformations mentioned above, though one could argue whether this is actually desirable. For low-coverage scRNA-seq data, there will always be a mean-variance trend under any transformation, for the simple reason that the variance must be zero when the mean count is zero. These methods also face the challenge of removing the mean-variance trend while preserving the interesting component of variation, i.e., the log-fold changes between subpopulations; this may or may not be done adequately, depending on the aggressiveness of the algorithm. In practice, the log-transformation is a good default choice due to its simplicity and interpretability, and is what we will be using for all downstream analyses. 2.5 Normalization versus batch correction It is worth noting the difference between normalization and batch correction (Multi-sample Chapter 1). 
Normalization typically refers to removal of technical biases between cells, while batch correction involves removal of both technical biases and biological differences between batches. Technical biases are relatively simple and straightforward to remove, whereas biological differences between batches can be highly unpredictable. On the other hand, batch correction algorithms can share information between cells in the same batch, as all cells in the same batch are assumed to be subject to the same batch effect, whereas most normalization strategies tend to operate on a cell-by-cell basis with less information sharing. The key point here is that normalization and batch correction are different tasks, involve different assumptions and generally require different computational methods (though some packages aim to perform both steps at once, e.g., zinbwave). Thus, it is important to distinguish between “normalized” and “batch-corrected” data, as these usually refer to different stages of processing. Of course, these processes are not exclusive, and most workflows will perform normalization within each batch followed by correction between batches. Interested readers are directed to Multi-sample Chapter 1 for more details. 
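As a closing aside, the log-transformation artifact from Section 2.3 - that the mean of the log-counts is not the log of the mean count - can be demonstrated directly with simulated Poisson counts (a toy illustration, not real scRNA-seq data).

```r
# Sketch: two "cells" with identical underlying expression but
# different sequencing depths. Scaling normalization equalizes the
# means of the normalized counts, but not the means of the log-counts.
set.seed(100)
shallow <- rpois(10000, lambda=2)  # low-coverage cell
deep <- rpois(10000, lambda=20)    # same cell at 10-fold higher depth
mean(deep/10) - mean(shallow)                     # near zero
mean(log2(deep/10 + 1)) - mean(log2(shallow + 1)) # clearly non-zero
```

The residual coverage-dependent shift in the log-values is the source of the size-factor-correlated trajectories in Figure 2.2, and it vanishes when the counts are downsampled to comparable sizes before transformation.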
Session Info View session info R version 4.4.2 (2024-10-31) Platform: x86_64-pc-linux-gnu Running under: Ubuntu 24.04.1 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C time zone: America/New_York tzcode source: system (glibc) attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] BiocFileCache_2.14.0 dbplyr_2.5.0 [3] scater_1.34.0 ggplot2_3.5.1 [5] scuttle_1.16.0 SingleCellExperiment_1.28.1 [7] SummarizedExperiment_1.36.0 Biobase_2.66.0 [9] GenomicRanges_1.58.0 GenomeInfoDb_1.42.1 [11] IRanges_2.40.1 S4Vectors_0.44.0 [13] BiocGenerics_0.52.0 MatrixGenerics_1.18.1 [15] matrixStats_1.5.0 BiocStyle_2.34.0 [17] rebook_1.16.0 loaded via a namespace (and not attached): [1] tidyselect_1.2.1 viridisLite_0.4.2 blob_1.2.4 [4] dplyr_1.1.4 vipor_0.4.7 farver_2.1.2 [7] filelock_1.0.3 viridis_0.6.5 fastmap_1.2.0 [10] XML_3.99-0.18 digest_0.6.37 rsvd_1.0.5 [13] lifecycle_1.0.4 RSQLite_2.3.9 magrittr_2.0.3 [16] compiler_4.4.2 rlang_1.1.5 sass_0.4.9 [19] tools_4.4.2 yaml_2.3.10 knitr_1.49 [22] S4Arrays_1.6.0 labeling_0.4.3 curl_6.1.0 [25] bit_4.5.0.1 DelayedArray_0.32.0 abind_1.4-8 [28] BiocParallel_1.40.0 purrr_1.0.2 withr_3.0.2 [31] CodeDepends_0.6.6 grid_4.4.2 beachmat_2.22.0 [34] colorspace_2.1-1 scales_1.3.0 cli_3.6.3 [37] rmarkdown_2.29 crayon_1.5.3 generics_0.1.3 [40] httr_1.4.7 DBI_1.2.3 ggbeeswarm_0.7.2 [43] cachem_1.1.0 zlibbioc_1.52.0 parallel_4.4.2 [46] BiocManager_1.30.25 XVector_0.46.0 vctrs_0.6.5 [49] Matrix_1.7-1 jsonlite_1.8.9 dir.expiry_1.14.0 [52] bookdown_0.42 BiocSingular_1.22.0 BiocNeighbors_2.0.1 [55] bit64_4.6.0-1 ggrepel_0.9.6 irlba_2.3.5.1 [58] 
beeswarm_0.4.0 jquerylib_0.1.4 glue_1.8.0 [61] codetools_0.2-20 cowplot_1.1.3 gtable_0.3.6 [64] UCSC.utils_1.2.0 ScaledMatrix_1.14.0 munsell_0.5.1 [67] tibble_3.2.1 pillar_1.10.1 rappdirs_0.3.3 [70] htmltools_0.5.8.1 graph_1.84.1 GenomeInfoDbData_1.2.13 [73] R6_2.5.1 evaluate_1.0.3 lattice_0.22-6 [76] memoise_2.0.1 bslib_0.8.0 Rcpp_1.0.14 [79] gridExtra_2.3 SparseArray_1.6.1 xfun_0.50 [82] pkgconfig_2.0.3 References "],["more-hvgs.html", "Chapter 3 Feature selection, redux 3.1 Overview 3.2 Fine-tuning the fitted trend 3.3 Handling covariates with linear models 3.4 Using the coefficient of variation 3.5 More HVG selection strategies Session Info", " Chapter 3 Feature selection, redux 3.1 Overview Basic Chapter 3 introduced the principles and methodology for feature selection in scRNA-seq data. This chapter provides commentary on some additional options at each step, including the fine-tuning of the fitted trend in modelGeneVar(), how to handle additional uninteresting factors of variation with linear models, and the use of the coefficient of variation to quantify variation. We also go through a number of other HVG selection strategies that may be of use. 3.2 Fine-tuning the fitted trend The trend fit has several useful parameters (see ?fitTrendVar) that can be tuned for a more appropriate fit. For example, the defaults can occasionally yield an overfitted trend when a few high-abundance genes are also highly variable. In such cases, users can reduce the contribution of those high-abundance genes by turning off density weights, as demonstrated in Figure 3.1 with a single donor from the Segerstolpe et al. (2016) dataset. 
View set-up code (Workflow Chapter 8) #--- loading ---# library(scRNAseq) sce.seger &lt;- SegerstolpePancreasData() #--- gene-annotation ---# library(AnnotationHub) edb &lt;- AnnotationHub()[[&quot;AH73881&quot;]] symbols &lt;- rowData(sce.seger)$symbol ens.id &lt;- mapIds(edb, keys=symbols, keytype=&quot;SYMBOL&quot;, column=&quot;GENEID&quot;) ens.id &lt;- ifelse(is.na(ens.id), symbols, ens.id) # Removing duplicated rows. keep &lt;- !duplicated(ens.id) sce.seger &lt;- sce.seger[keep,] rownames(sce.seger) &lt;- ens.id[keep] #--- sample-annotation ---# emtab.meta &lt;- colData(sce.seger)[,c(&quot;cell type&quot;, &quot;disease&quot;, &quot;individual&quot;, &quot;single cell well quality&quot;)] colnames(emtab.meta) &lt;- c(&quot;CellType&quot;, &quot;Disease&quot;, &quot;Donor&quot;, &quot;Quality&quot;) colData(sce.seger) &lt;- emtab.meta sce.seger$CellType &lt;- gsub(&quot; cell&quot;, &quot;&quot;, sce.seger$CellType) sce.seger$CellType &lt;- paste0( toupper(substr(sce.seger$CellType, 1, 1)), substring(sce.seger$CellType, 2)) #--- quality-control ---# low.qual &lt;- sce.seger$Quality == &quot;OK, filtered&quot; library(scater) stats &lt;- perCellQCMetrics(sce.seger) qc &lt;- quickPerCellQC(stats, percent_subsets=&quot;altexps_ERCC_percent&quot;, batch=sce.seger$Donor, subset=!sce.seger$Donor %in% c(&quot;H6&quot;, &quot;H5&quot;)) sce.seger &lt;- sce.seger[,!(qc$discard | low.qual)] #--- normalization ---# library(scran) clusters &lt;- quickCluster(sce.seger) sce.seger &lt;- computeSumFactors(sce.seger, clusters=clusters) sce.seger &lt;- logNormCounts(sce.seger) library(scran) sce.seger &lt;- sce.seger[,sce.seger$Donor==&quot;H4&quot;] dec.default &lt;- modelGeneVar(sce.seger) dec.noweight &lt;- modelGeneVar(sce.seger, density.weights=FALSE) fit.default &lt;- metadata(dec.default) plot(fit.default$mean, fit.default$var, xlab=&quot;Mean of log-expression&quot;, ylab=&quot;Variance of log-expression&quot;) curve(fit.default$trend(x), col=&quot;dodgerblue&quot;, 
add=TRUE, lwd=2) fit.noweight &lt;- metadata(dec.noweight) curve(fit.noweight$trend(x), col=&quot;red&quot;, add=TRUE, lwd=2) legend(&quot;topleft&quot;, col=c(&quot;dodgerblue&quot;, &quot;red&quot;), legend=c(&quot;Default&quot;, &quot;No weight&quot;), lwd=2) Figure 3.1: Variance in the Segerstolpe pancreas data set as a function of the mean. Each point represents a gene while the lines represent the trend fitted to all genes with default parameters (blue) or without weights (red). 3.3 Handling covariates with linear models For experiments with multiple batches, the use of block-specific trends with block= in modelGeneVar() is the recommended approach for avoiding unwanted variation. However, this is not possible for experimental designs involving multiple unwanted factors of variation and/or continuous covariates. In such cases, we can use the design= argument to specify a design matrix with uninteresting factors of variation. This fits a linear model to the expression values for each gene to obtain the residual variance. We illustrate again with the 416B data set, blocking on the plate of origin and oncogene induction. (The same argument is available in modelGeneVar() when spike-ins are not available.) 
View set-up code (Workflow Chapter 1) #--- loading ---# library(scRNAseq) sce.416b &lt;- LunSpikeInData(which=&quot;416b&quot;) sce.416b$block &lt;- factor(sce.416b$block) #--- gene-annotation ---# library(AnnotationHub) ens.mm.v97 &lt;- AnnotationHub()[[&quot;AH73905&quot;]] rowData(sce.416b)$ENSEMBL &lt;- rownames(sce.416b) rowData(sce.416b)$SYMBOL &lt;- mapIds(ens.mm.v97, keys=rownames(sce.416b), keytype=&quot;GENEID&quot;, column=&quot;SYMBOL&quot;) rowData(sce.416b)$SEQNAME &lt;- mapIds(ens.mm.v97, keys=rownames(sce.416b), keytype=&quot;GENEID&quot;, column=&quot;SEQNAME&quot;) library(scater) rownames(sce.416b) &lt;- uniquifyFeatureNames(rowData(sce.416b)$ENSEMBL, rowData(sce.416b)$SYMBOL) #--- quality-control ---# mito &lt;- which(rowData(sce.416b)$SEQNAME==&quot;MT&quot;) stats &lt;- perCellQCMetrics(sce.416b, subsets=list(Mt=mito)) qc &lt;- quickPerCellQC(stats, percent_subsets=c(&quot;subsets_Mt_percent&quot;, &quot;altexps_ERCC_percent&quot;), batch=sce.416b$block) sce.416b &lt;- sce.416b[,!qc$discard] #--- normalization ---# library(scran) sce.416b &lt;- computeSumFactors(sce.416b) sce.416b &lt;- logNormCounts(sce.416b) design &lt;- model.matrix(~factor(block) + phenotype, colData(sce.416b)) dec.design.416b &lt;- modelGeneVarWithSpikes(sce.416b, &quot;ERCC&quot;, design=design) dec.design.416b[order(dec.design.416b$bio, decreasing=TRUE),] ## DataFrame with 46604 rows and 6 columns ## mean total tech bio p.value FDR ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## Lyz2 6.61097 8.90513 1.50407 7.40105 1.90610e-172 1.37453e-169 ## Ccnb2 5.97776 9.54373 2.24185 7.30187 8.03683e-77 1.49416e-74 ## Gem 5.90225 9.54358 2.35181 7.19178 5.66284e-68 8.37010e-66 ## Cenpa 5.81349 8.65622 2.48797 6.16824 2.12289e-45 1.55921e-43 ## Idh1 5.99343 8.32113 2.21969 6.10144 2.48746e-55 2.47674e-53 ## ... ... ... ... ... ... ... 
## Gm5054 2.90434 0.463698 6.76999 -6.30630 1 1 ## Gm12191 3.55920 0.170709 6.53294 -6.36223 1 1 ## Gm7429 3.45394 0.248351 6.63466 -6.38631 1 1 ## Gm16378 2.83987 0.208215 6.74662 -6.53840 1 1 ## Rps2-ps2 3.11324 0.202307 6.78485 -6.58255 1 1 This strategy is simple but somewhat inaccurate as it does not consider the mean expression in each blocking level. To illustrate, assume we have an experiment with two equally-sized batches where the mean-variance trend in each batch is the same as that observed in Figure 3.1. Imagine that we have two genes with variances lying on this trend; the first gene has an average expression of 0 in one batch and 6 in the other batch, while the second gene has an average expression of 3 in both batches. Both genes would have the same mean across all cells but quite different variances, making it difficult to fit a single mean-variance trend - despite both genes following the mean-variance trend in each of their respective batches! The block= approach is safer as it handles the trend fitting and decomposition within each batch, and should be preferred in all situations where it is applicable. 3.4 Using the coefficient of variation An alternative approach to quantification uses the squared coefficient of variation (CV2) of the normalized expression values prior to log-transformation. The CV2 is a widely used metric for describing variation in non-negative data and is closely related to the dispersion parameter of the negative binomial distribution in packages like edgeR and DESeq2. We compute the CV2 for each gene in the PBMC dataset using the modelGeneCV2() function, which provides a robust implementation of the approach described by Brennecke et al. (2013). 
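The inflation of the overall variance by batch-specific means can be seen in a toy simulation with base R. The values below are hypothetical, chosen only to mirror the two-gene example in the text:

```r
# Two hypothetical genes with identical within-batch variance: gene 1 has
# batch-specific means of 0 and 6, gene 2 has a mean of 3 in both batches.
set.seed(42)
batch <- rep(c("A", "B"), each = 100)
gene1 <- rnorm(200, mean = ifelse(batch == "A", 0, 6))
gene2 <- rnorm(200, mean = 3)

# Both genes have the same overall mean...
c(gene1 = mean(gene1), gene2 = mean(gene2))

# ...but gene 1's overall variance is inflated by the difference in batch
# means, even though the within-batch variances are the same.
c(gene1 = var(gene1), gene2 = var(gene2))
sapply(list(gene1 = gene1, gene2 = gene2),
    function(y) mean(tapply(y, batch, var)))
```

A single trend fitted to the overall variances would have to reconcile these two genes despite both lying exactly on the per-batch trend, which is the problem that the block= decomposition avoids.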
View set-up code (Workflow Chapter 3) #--- loading ---# library(DropletTestFiles) raw.path &lt;- getTestFile(&quot;tenx-2.1.0-pbmc4k/1.0.0/raw.tar.gz&quot;) out.path &lt;- file.path(tempdir(), &quot;pbmc4k&quot;) untar(raw.path, exdir=out.path) library(DropletUtils) fname &lt;- file.path(out.path, &quot;raw_gene_bc_matrices/GRCh38&quot;) sce.pbmc &lt;- read10xCounts(fname, col.names=TRUE) #--- gene-annotation ---# library(scater) rownames(sce.pbmc) &lt;- uniquifyFeatureNames( rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol) library(EnsDb.Hsapiens.v86) location &lt;- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce.pbmc)$ID, column=&quot;SEQNAME&quot;, keytype=&quot;GENEID&quot;) #--- cell-detection ---# set.seed(100) e.out &lt;- emptyDrops(counts(sce.pbmc)) sce.pbmc &lt;- sce.pbmc[,which(e.out$FDR &lt;= 0.001)] #--- quality-control ---# stats &lt;- perCellQCMetrics(sce.pbmc, subsets=list(Mito=which(location==&quot;MT&quot;))) high.mito &lt;- isOutlier(stats$subsets_Mito_percent, type=&quot;higher&quot;) sce.pbmc &lt;- sce.pbmc[,!high.mito] #--- normalization ---# library(scran) set.seed(1000) clusters &lt;- quickCluster(sce.pbmc) sce.pbmc &lt;- computeSumFactors(sce.pbmc, cluster=clusters) sce.pbmc &lt;- logNormCounts(sce.pbmc) dec.cv2.pbmc &lt;- modelGeneCV2(sce.pbmc) This allows us to model the mean-variance relationship when considering the relevance of each gene (Figure 3.2). Again, our assumption is that most genes contain random noise and that the trend captures mostly technical variation. Large CV2 values that deviate strongly from the trend are likely to represent genes affected by biological structure. If spike-ins are available, we can also fit the trend to the spike-ins via the modelGeneCV2WithSpikes() function. fit.cv2.pbmc &lt;- metadata(dec.cv2.pbmc) plot(fit.cv2.pbmc$mean, fit.cv2.pbmc$cv2, log=&quot;xy&quot;) curve(fit.cv2.pbmc$trend(x), col=&quot;dodgerblue&quot;, add=TRUE, lwd=2) Figure 3.2: CV2 in the PBMC data set as a function of the mean. 
Each point represents a gene while the blue line represents the fitted trend. For each gene, we quantify the deviation from the trend in terms of the ratio of its CV2 to the fitted value of the trend at its abundance. This is more appropriate than directly subtracting the trend from the CV2, as the magnitude of the ratio is not affected by the mean. dec.cv2.pbmc[order(dec.cv2.pbmc$ratio, decreasing=TRUE),] ## DataFrame with 33694 rows and 6 columns ## mean total trend ratio p.value FDR ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## PPBP 2.2437397 132.364 0.803725 164.688 0 0 ## PRTFDC1 0.0658743 3197.564 20.268571 157.760 0 0 ## HIST1H2AC 1.3731487 175.035 1.177003 148.712 0 0 ## FAM81B 0.0477082 3654.419 27.904490 130.962 0 0 ## PF4 1.8333127 109.451 0.935531 116.993 0 0 ## ... ... ... ... ... ... ... ## AC023491.2 0 NaN Inf NaN NaN NaN ## AC233755.2 0 NaN Inf NaN NaN NaN ## AC233755.1 0 NaN Inf NaN NaN NaN ## AC213203.1 0 NaN Inf NaN NaN NaN ## FAM231B 0 NaN Inf NaN NaN NaN We can then select HVGs based on the largest ratios using getTopHVGs(). hvg.pbmc.cv2 &lt;- getTopHVGs(dec.cv2.pbmc, var.field=&quot;ratio&quot;, n=1000) str(hvg.pbmc.cv2) ## chr [1:1000] &quot;PPBP&quot; &quot;PRTFDC1&quot; &quot;HIST1H2AC&quot; &quot;FAM81B&quot; &quot;PF4&quot; &quot;GNG11&quot; ... Both the CV2 and the variance of log-counts are effective metrics for quantifying variation in gene expression. The CV2 tends to give higher rank to low-abundance HVGs driven by upregulation in rare subpopulations, for which the increase in variance on the raw scale is stronger than that on the log-scale. However, the variation described by the CV2 is less directly relevant to downstream procedures operating on the log-counts, and the reliance on the ratio can assign high rank to uninteresting genes with low absolute variance. 
As such, we prefer to use the variance of log-counts for feature selection, though many of the same principles apply to procedures based on the CV2. 3.5 More HVG selection strategies 3.5.1 Keeping all genes above the trend Here, the aim is to only remove the obviously uninteresting genes with variances below the trend. By doing so, we avoid the need to make any judgement calls regarding what level of variation is interesting enough to retain. This approach represents one extreme of the bias-variance trade-off where bias is minimized at the cost of maximizing noise. For modelGeneVar(), it equates to keeping all positive biological components: dec.pbmc &lt;- modelGeneVar(sce.pbmc) hvg.pbmc.var.3 &lt;- getTopHVGs(dec.pbmc, var.threshold=0) length(hvg.pbmc.var.3) ## [1] 12745 For modelGeneCV2(), this involves keeping all ratios above 1: hvg.pbmc.cv2.3 &lt;- getTopHVGs(dec.cv2.pbmc, var.field=&quot;ratio&quot;, var.threshold=1) length(hvg.pbmc.cv2.3) ## [1] 6642 By retaining all potential biological signal, we give secondary population structure the chance to manifest. This is most useful for rare subpopulations where the relevant markers will not exhibit strong overdispersion owing to the small number of affected cells. It will also preserve a weak but consistent effect across many genes with small biological components; admittedly, though, this is not of major interest in most scRNA-seq studies given the difficulty of experimentally validating population structure in the absence of strong marker genes. The obvious cost is that more noise is also captured, which can reduce the resolution of otherwise well-separated populations and mask the secondary signal that we were trying to preserve. The use of more genes also introduces more computational work in each downstream step. 
This strategy is thus best suited to very heterogeneous populations containing many different cell types (possibly across many datasets that are to be merged, as in Multi-sample Chapter 1) where there is a justified fear of ignoring marker genes for low-abundance subpopulations under a competitive top \\(X\\) approach. 3.5.2 Based on significance Another approach to feature selection is to set a fixed threshold on one of the metrics. This is most commonly done with the (adjusted) \\(p\\)-value reported by each of the above methods. The \\(p\\)-value for each gene is generated by testing against the null hypothesis that the variance is equal to the trend. For example, we might define our HVGs as all genes that have adjusted \\(p\\)-values below 0.05. hvg.pbmc.var.2 &lt;- getTopHVGs(dec.pbmc, fdr.threshold=0.05) length(hvg.pbmc.var.2) ## [1] 814 This approach is simple to implement and - if the test holds its size - it controls the false discovery rate (FDR). That is, it returns a subset of genes where the proportion of false positives is expected to be below the specified threshold. This can occasionally be useful in applications where the HVGs themselves are of interest. For example, if we were to use the list of HVGs in further experiments to verify the existence of heterogeneous expression for some of the genes, we would want to control the FDR in that list. The downside of this approach is that it is less predictable than the top \\(X\\) strategy. The number of genes returned depends on the type II error rate of the test and the severity of the multiple testing correction. One might obtain no genes or every gene at a given FDR threshold, depending on the circumstances. Moreover, control of the FDR is usually not helpful at this stage of the analysis. We are not interpreting the individual HVGs themselves but are only using them for feature selection prior to downstream steps. 
There is no reason to think that a 5% threshold on the FDR yields a more suitable compromise between bias and noise compared to the top \\(X\\) selection. As an aside, we might consider ranking genes by the \\(p\\)-value instead of the biological component for use in a top \\(X\\) approach. This results in some counterintuitive behavior due to the nature of the underlying hypothesis test, which is based on the ratio of the total variance to the expected technical variance. Ranking based on \\(p\\)-value tends to prioritize HVGs that are more likely to be true positives but, at the same time, less likely to be biologically interesting. Many of the largest ratios are observed in high-abundance genes and are driven by very low technical variance; the total variance is typically modest for such genes, and they do not contribute much to population heterogeneity in absolute terms. (Note that the same can be said of the ratio of CV2 values, as briefly discussed above.) 3.5.3 Selecting a priori genes of interest A blunt yet effective feature selection strategy is to use pre-defined sets of interesting genes. The aim is to focus on specific aspects of biological heterogeneity that may be masked by other factors when using unsupervised methods for HVG selection. One example application lies in the dissection of transcriptional changes during the earliest stages of cell fate commitment (Messmer et al. 2019), which may be modest relative to activity in other pathways (e.g., cell cycle, metabolism). Indeed, if our aim is to show that there is no meaningful heterogeneity in a given pathway, we would - at the very least - be obliged to repeat our analysis using only the genes in that pathway to maximize power for detecting such heterogeneity. Using scRNA-seq data in this manner is conceptually equivalent to a fluorescence activated cell sorting (FACS) experiment, with the convenience of being able to (re)define the features of interest at any time. 
For example, in the PBMC dataset, we might use some of the C7 immunologic signatures from MSigDB (Godec et al. 2016) to improve resolution of the various T cell subtypes. We stress that there is no shame in leveraging prior biological knowledge to address specific hypotheses in this manner. We say this because a common refrain in genomics is that the data analysis should be “unbiased”, i.e., free from any biological preconceptions. This is admirable but such “biases” are already present at every stage, starting with experimental design and ending with the interpretation of the data. library(msigdbr) c7.sets &lt;- msigdbr(species = &quot;Homo sapiens&quot;, category = &quot;C7&quot;) head(unique(c7.sets$gs_name)) ## [1] &quot;ANDERSON_BLOOD_CN54GP140_ADJUVANTED_WITH_GLA_AF_AGE_18_45YO_1DY_DN&quot; ## [2] &quot;ANDERSON_BLOOD_CN54GP140_ADJUVANTED_WITH_GLA_AF_AGE_18_45YO_1DY_UP&quot; ## [3] &quot;ANDERSON_BLOOD_CN54GP140_ADJUVANTED_WITH_GLA_AF_AGE_18_45YO_3DY_DN&quot; ## [4] &quot;ANDERSON_BLOOD_CN54GP140_ADJUVANTED_WITH_GLA_AF_AGE_18_45YO_3DY_UP&quot; ## [5] &quot;ANDERSON_BLOOD_CN54GP140_ADJUVANTED_WITH_GLA_AF_AGE_18_45YO_6HR_DN&quot; ## [6] &quot;ANDERSON_BLOOD_CN54GP140_ADJUVANTED_WITH_GLA_AF_AGE_18_45YO_6HR_UP&quot; # Using the Goldrath sets to distinguish CD8 subtypes cd8.sets &lt;- c7.sets[grep(&quot;GOLDRATH&quot;, c7.sets$gs_name),] cd8.genes &lt;- rowData(sce.pbmc)$Symbol %in% cd8.sets$human_gene_symbol summary(cd8.genes) ## Mode FALSE TRUE ## logical 32872 822 # Using GSE11924 to distinguish between T helper subtypes th.sets &lt;- c7.sets[grep(&quot;GSE11924&quot;, c7.sets$gs_name),] th.genes &lt;- rowData(sce.pbmc)$Symbol %in% th.sets$human_gene_symbol summary(th.genes) ## Mode FALSE TRUE ## logical 31796 1898 # Using GSE11961 to distinguish between B cell subtypes b.sets &lt;- c7.sets[grep(&quot;GSE11961&quot;, c7.sets$gs_name),] b.genes &lt;- rowData(sce.pbmc)$Symbol %in% b.sets$human_gene_symbol summary(b.genes) ## Mode FALSE TRUE ## logical 28211 5483 
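Having identified these sets, one possible use is to subset the dataset to a single set of interest before repeating the downstream steps. This is a hedged sketch only, reusing the cd8.genes logical vector computed above:

```r
# Sketch: restricting a focused analysis to the CD8-related genes,
# using the 'cd8.genes' logical vector defined above.
sce.pbmc.cd8 <- sce.pbmc[cd8.genes,]
dim(sce.pbmc.cd8)
```

Clustering or dimensionality reduction on sce.pbmc.cd8 would then resolve heterogeneity along the CD8-related axis only, in the spirit of the FACS analogy above.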
Of course, the downside of focusing on pre-defined genes is that it will limit our capacity to detect novel or unexpected aspects of variation. Thus, this kind of focused analysis should be complementary to (rather than a replacement for) the unsupervised feature selection strategies discussed previously. Alternatively, we can invert this reasoning to remove genes that are unlikely to be of interest prior to downstream analyses. This eliminates unwanted variation that could mask relevant biology and interfere with interpretation of the results. Ribosomal protein genes or mitochondrial genes are common candidates for removal, especially in situations with varying levels of cell damage within a population. For immune cell subsets, we might also be inclined to remove immunoglobulin genes and T cell receptor genes for which clonal expression introduces (possibly irrelevant) population structure. # Identifying ribosomal proteins: ribo.discard &lt;- grepl(&quot;^RP[SL]\\\\d+&quot;, rownames(sce.pbmc)) sum(ribo.discard) ## [1] 99 # A more curated approach for identifying ribosomal protein genes: c2.sets &lt;- msigdbr(species = &quot;Homo sapiens&quot;, category = &quot;C2&quot;) ribo.set &lt;- c2.sets[c2.sets$gs_name==&quot;KEGG_RIBOSOME&quot;,]$human_gene_symbol ribo.discard &lt;- rownames(sce.pbmc) %in% ribo.set sum(ribo.discard) ## [1] 87 library(AnnotationHub) edb &lt;- AnnotationHub()[[&quot;AH73881&quot;]] anno &lt;- select(edb, keys=rowData(sce.pbmc)$ID, keytype=&quot;GENEID&quot;, columns=&quot;TXBIOTYPE&quot;) # Removing immunoglobulin variable chains: igv.set &lt;- anno$GENEID[anno$TXBIOTYPE %in% c(&quot;IG_V_gene&quot;, &quot;IG_V_pseudogene&quot;)] igv.discard &lt;- rowData(sce.pbmc)$ID %in% igv.set sum(igv.discard) ## [1] 326 # Removing TCR variable chains: tcr.set &lt;- anno$GENEID[anno$TXBIOTYPE %in% c(&quot;TR_V_gene&quot;, &quot;TR_V_pseudogene&quot;)] tcr.discard &lt;- rowData(sce.pbmc)$ID %in% tcr.set sum(tcr.discard) ## [1] 138 In practice, we tend to 
err on the side of caution and abstain from preemptive filtering on biological function until these genes are demonstrably problematic in downstream analyses. Session Info View session info R version 4.4.2 (2024-10-31) Platform: x86_64-pc-linux-gnu Running under: Ubuntu 24.04.1 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C time zone: America/New_York tzcode source: system (glibc) attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] ensembldb_2.30.0 AnnotationFilter_1.30.0 [3] GenomicFeatures_1.58.0 AnnotationDbi_1.68.0 [5] AnnotationHub_3.14.0 BiocFileCache_2.14.0 [7] dbplyr_2.5.0 msigdbr_7.5.1 [9] scran_1.34.0 scuttle_1.16.0 [11] SingleCellExperiment_1.28.1 SummarizedExperiment_1.36.0 [13] Biobase_2.66.0 GenomicRanges_1.58.0 [15] GenomeInfoDb_1.42.1 IRanges_2.40.1 [17] S4Vectors_0.44.0 BiocGenerics_0.52.0 [19] MatrixGenerics_1.18.1 matrixStats_1.5.0 [21] BiocStyle_2.34.0 rebook_1.16.0 loaded via a namespace (and not attached): [1] bitops_1.0-9 DBI_1.2.3 CodeDepends_0.6.6 [4] rlang_1.1.5 magrittr_2.0.3 compiler_4.4.2 [7] RSQLite_2.3.9 dir.expiry_1.14.0 png_0.1-8 [10] vctrs_0.6.5 ProtGenerics_1.38.0 pkgconfig_2.0.3 [13] crayon_1.5.3 fastmap_1.2.0 XVector_0.46.0 [16] Rsamtools_2.22.0 rmarkdown_2.29 graph_1.84.1 [19] UCSC.utils_1.2.0 purrr_1.0.2 bit_4.5.0.1 [22] xfun_0.50 bluster_1.16.0 zlibbioc_1.52.0 [25] cachem_1.1.0 beachmat_2.22.0 jsonlite_1.8.9 [28] blob_1.2.4 DelayedArray_0.32.0 BiocParallel_1.40.0 [31] irlba_2.3.5.1 parallel_4.4.2 cluster_2.1.8 [34] R6_2.5.1 bslib_0.8.0 rtracklayer_1.66.0 [37] limma_3.62.2 jquerylib_0.1.4 Rcpp_1.0.14 [40] bookdown_0.42 
knitr_1.49 Matrix_1.7-1 [43] igraph_2.1.3 tidyselect_1.2.1 abind_1.4-8 [46] yaml_2.3.10 codetools_0.2-20 curl_6.1.0 [49] lattice_0.22-6 tibble_3.2.1 withr_3.0.2 [52] KEGGREST_1.46.0 evaluate_1.0.3 Biostrings_2.74.1 [55] pillar_1.10.1 BiocManager_1.30.25 filelock_1.0.3 [58] generics_0.1.3 RCurl_1.98-1.16 BiocVersion_3.20.0 [61] glue_1.8.0 metapod_1.14.0 lazyeval_0.2.2 [64] tools_4.4.2 BiocIO_1.16.0 BiocNeighbors_2.0.1 [67] ScaledMatrix_1.14.0 GenomicAlignments_1.42.0 locfit_1.5-9.10 [70] babelgene_22.9 XML_3.99-0.18 grid_4.4.2 [73] edgeR_4.4.1 GenomeInfoDbData_1.2.13 BiocSingular_1.22.0 [76] restfulr_0.0.15 cli_3.6.3 rsvd_1.0.5 [79] rappdirs_0.3.3 S4Arrays_1.6.0 dplyr_1.1.4 [82] sass_0.4.9 digest_0.6.37 SparseArray_1.6.1 [85] dqrng_0.4.1 rjson_0.2.23 memoise_2.0.1 [88] htmltools_0.5.8.1 lifecycle_1.0.4 httr_1.4.7 [91] mime_0.12 statmod_1.5.0 bit64_4.6.0-1 References "],["dimensionality-reduction-redux.html", "Chapter 4 Dimensionality reduction, redux 4.1 Overview 4.2 More choices for the number of PCs 4.3 Count-based dimensionality reduction 4.4 More visualization methods Session Info", " Chapter 4 Dimensionality reduction, redux .rebook-collapse { background-color: #eee; color: #444; cursor: pointer; padding: 18px; width: 100%; border: none; text-align: left; outline: none; font-size: 15px; } .rebook-content { padding: 0 18px; display: none; overflow: hidden; background-color: #f1f1f1; } 4.1 Overview Basic Chapter 4 introduced the key concepts for dimensionality reduction of scRNA-seq data. Here, we describe some data-driven strategies for picking an appropriate number of top PCs for downstream analyses. We also demonstrate some other dimensionality reduction strategies that operate on the raw counts. For the most part, we will be again using the Zeisel et al. 
(2015) dataset: View set-up code (Workflow Chapter 2) #--- loading ---# library(scRNAseq) sce.zeisel &lt;- ZeiselBrainData() library(scater) sce.zeisel &lt;- aggregateAcrossFeatures(sce.zeisel, id=sub(&quot;_loc[0-9]+$&quot;, &quot;&quot;, rownames(sce.zeisel))) #--- gene-annotation ---# library(org.Mm.eg.db) rowData(sce.zeisel)$Ensembl &lt;- mapIds(org.Mm.eg.db, keys=rownames(sce.zeisel), keytype=&quot;SYMBOL&quot;, column=&quot;ENSEMBL&quot;) #--- quality-control ---# stats &lt;- perCellQCMetrics(sce.zeisel, subsets=list( Mt=rowData(sce.zeisel)$featureType==&quot;mito&quot;)) qc &lt;- quickPerCellQC(stats, percent_subsets=c(&quot;altexps_ERCC_percent&quot;, &quot;subsets_Mt_percent&quot;)) sce.zeisel &lt;- sce.zeisel[,!qc$discard] #--- normalization ---# library(scran) set.seed(1000) clusters &lt;- quickCluster(sce.zeisel) sce.zeisel &lt;- computeSumFactors(sce.zeisel, cluster=clusters) sce.zeisel &lt;- logNormCounts(sce.zeisel) #--- variance-modelling ---# dec.zeisel &lt;- modelGeneVarWithSpikes(sce.zeisel, &quot;ERCC&quot;) top.hvgs &lt;- getTopHVGs(dec.zeisel, prop=0.1) library(scran) top.zeisel &lt;- getTopHVGs(dec.zeisel, n=2000) set.seed(100) sce.zeisel &lt;- fixedPCA(sce.zeisel, subset.row=top.zeisel) 4.2 More choices for the number of PCs 4.2.1 Using the elbow point A simple heuristic for choosing the suitable number of PCs \\(d\\) involves identifying the elbow point in the percentage of variance explained by successive PCs. This refers to the “elbow” in the curve of a scree plot as shown in Figure 4.1. # Percentage of variance explained is tucked away in the attributes. percent.var &lt;- attr(reducedDim(sce.zeisel), &quot;percentVar&quot;) library(PCAtools) chosen.elbow &lt;- findElbowPoint(percent.var) chosen.elbow ## [1] 7 plot(percent.var, xlab=&quot;PC&quot;, ylab=&quot;Variance explained (%)&quot;) abline(v=chosen.elbow, col=&quot;red&quot;) Figure 4.1: Percentage of variance explained by successive PCs in the Zeisel brain data. 
The identified elbow point is marked with a red line. Our assumption is that each of the top PCs capturing biological signal should explain much more variance than the remaining PCs. Thus, there should be a sharp drop in the percentage of variance explained when we move past the last “biological” PC. This manifests as an elbow in the scree plot, the location of which serves as a natural choice for \\(d\\). Once this is identified, we can subset the reducedDims() entry to only retain the first \\(d\\) PCs of interest. # Creating a new entry with only the PCs up to the elbow point, # which is useful if we still need the full set of PCs later. reducedDim(sce.zeisel, &quot;PCA.elbow&quot;) &lt;- reducedDim(sce.zeisel)[,1:chosen.elbow] reducedDimNames(sce.zeisel) ## [1] &quot;PCA&quot; &quot;PCA.elbow&quot; From a practical perspective, the use of the elbow point tends to retain fewer PCs compared to other methods. The definition of “much more variance” is relative so, in order to be retained, later PCs must explain an amount of variance that is comparable to that explained by the first few PCs. Strong biological variation in the early PCs will shift the elbow to the left, potentially excluding weaker (but still interesting) variation in the next PCs immediately following the elbow. 4.2.2 Using the technical noise Another strategy is to retain all PCs until the percentage of total variation explained reaches some threshold \\(T\\). For example, we might retain the top set of PCs that explains 80% of the total variation in the data. Of course, it would be pointless to swap one arbitrary parameter \\(d\\) for another \\(T\\). Instead, we derive a suitable value for \\(T\\) by calculating the proportion of variance in the data that is attributed to the biological component. 
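As a rough illustration of what such a proportion looks like - not the exact internal computation, which is handled for us below - the variance decomposition from the set-up code can be summarized directly, assuming the dec.zeisel object from modelGeneVarWithSpikes() is available:

```r
# Illustration only: the proportion of total variance attributed to the
# biological component, from the per-gene decomposition in 'dec.zeisel'.
prop.bio <- sum(dec.zeisel$bio) / sum(dec.zeisel$total)
prop.bio
```

This makes the logic of the threshold concrete: we keep enough PCs to explain the biological share of the variance and treat the remainder as technical noise.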
This is done using the denoisePCA() function with the variance modelling results from modelGeneVarWithSpikes() or related functions, where \\(T\\) is defined as the ratio of the sum of the biological components to the sum of total variances. To illustrate, we use this strategy to pick the number of PCs in the 10X PBMC dataset. View set-up code (Workflow Chapter 3) #--- loading ---# library(DropletTestFiles) raw.path &lt;- getTestFile(&quot;tenx-2.1.0-pbmc4k/1.0.0/raw.tar.gz&quot;) out.path &lt;- file.path(tempdir(), &quot;pbmc4k&quot;) untar(raw.path, exdir=out.path) library(DropletUtils) fname &lt;- file.path(out.path, &quot;raw_gene_bc_matrices/GRCh38&quot;) sce.pbmc &lt;- read10xCounts(fname, col.names=TRUE) #--- gene-annotation ---# library(scater) rownames(sce.pbmc) &lt;- uniquifyFeatureNames( rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol) library(EnsDb.Hsapiens.v86) location &lt;- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce.pbmc)$ID, column=&quot;SEQNAME&quot;, keytype=&quot;GENEID&quot;) #--- cell-detection ---# set.seed(100) e.out &lt;- emptyDrops(counts(sce.pbmc)) sce.pbmc &lt;- sce.pbmc[,which(e.out$FDR &lt;= 0.001)] #--- quality-control ---# stats &lt;- perCellQCMetrics(sce.pbmc, subsets=list(Mito=which(location==&quot;MT&quot;))) high.mito &lt;- isOutlier(stats$subsets_Mito_percent, type=&quot;higher&quot;) sce.pbmc &lt;- sce.pbmc[,!high.mito] #--- normalization ---# library(scran) set.seed(1000) clusters &lt;- quickCluster(sce.pbmc) sce.pbmc &lt;- computeSumFactors(sce.pbmc, cluster=clusters) sce.pbmc &lt;- logNormCounts(sce.pbmc) #--- variance-modelling ---# set.seed(1001) dec.pbmc &lt;- modelGeneVarByPoisson(sce.pbmc) top.pbmc &lt;- getTopHVGs(dec.pbmc, prop=0.1) library(scran) set.seed(111001001) denoised.pbmc &lt;- denoisePCA(sce.pbmc, technical=dec.pbmc, subset.row=top.pbmc) ncol(reducedDim(denoised.pbmc)) ## [1] 9 The dimensionality of the output represents the lower bound on the number of PCs required to retain all biological variation. 
This choice of \\(d\\) is motivated by the fact that any fewer PCs will definitely discard some aspect of biological signal. (Of course, the converse is not true; there is no guarantee that the retained PCs capture all of the signal, which is only generally possible if no dimensionality reduction is performed at all.) From a practical perspective, the denoisePCA() approach usually retains more PCs than the elbow point method as the former does not compare PCs to each other and is less likely to discard PCs corresponding to secondary factors of variation. The downside is that many minor aspects of variation may not be interesting (e.g., transcriptional bursting) and their retention would only add irrelevant noise. Note that denoisePCA() imposes internal caps on the number of PCs that can be chosen in this manner. By default, the number is bounded within the “reasonable” limits of 5 and 50 to avoid selection of too few PCs (when technical noise is high relative to biological variation) or too many PCs (when technical noise is very low). For example, applying this function to the Zeisel brain data hits the upper limit: set.seed(001001001) denoised.zeisel &lt;- denoisePCA(sce.zeisel, technical=dec.zeisel, subset.row=top.zeisel) ncol(reducedDim(denoised.zeisel)) ## [1] 50 This method also tends to perform best when the mean-variance trend reflects the actual technical noise, i.e., estimated by modelGeneVarByPoisson() or modelGeneVarWithSpikes() instead of modelGeneVar() (Basic Section 3.3). Variance modelling results from modelGeneVar() tend to understate the actual biological variation, especially in highly heterogeneous datasets where secondary factors of variation inflate the fitted values of the trend. 
Fewer PCs are subsequently retained because \\(T\\) is artificially lowered, as evidenced by denoisePCA() returning the lower limit of 5 PCs for the PBMC dataset: dec.pbmc2 &lt;- modelGeneVar(sce.pbmc) denoised.pbmc2 &lt;- denoisePCA(sce.pbmc, technical=dec.pbmc2, subset.row=top.pbmc) ncol(reducedDim(denoised.pbmc2)) ## [1] 5 4.2.3 Based on population structure Yet another method to choose \\(d\\) uses information about the number of subpopulations in the data. Consider a situation where each subpopulation differs from the others along a different axis in the high-dimensional space (e.g., because it is defined by a unique set of marker genes). This suggests that we should set \\(d\\) to the number of unique subpopulations minus 1, which guarantees separation of all subpopulations while retaining as few dimensions (and noise) as possible. We can use this reasoning to loosely motivate an a priori choice for \\(d\\) - for example, if we expect around 10 different cell types in our population, we would set \\(d \\approx 10\\). In practice, the number of subpopulations is usually not known in advance. Rather, we use a heuristic approach that uses the number of clusters as a proxy for the number of subpopulations. We perform clustering (graph-based by default, see Basic Section 5.2) on the first \\(d^*\\) PCs and only consider the values of \\(d^*\\) that yield no more than \\(d^*+1\\) clusters. If we detect more clusters with fewer dimensions, we consider this to represent overclustering rather than distinct subpopulations, assuming that multiple subpopulations should not be distinguishable on the same axes. We test a range of \\(d^*\\) and set \\(d\\) to the value that maximizes the number of clusters while satisfying the above condition. This attempts to capture as many distinct (putative) subpopulations as possible by retaining biological signal in later PCs, up until the point that the additional noise reduces resolution. 
pcs &lt;- reducedDim(sce.zeisel) choices &lt;- getClusteredPCs(pcs) val &lt;- metadata(choices)$chosen plot(choices$n.pcs, choices$n.clusters, xlab=&quot;Number of PCs&quot;, ylab=&quot;Number of clusters&quot;) abline(a=1, b=1, col=&quot;red&quot;) abline(v=val, col=&quot;grey80&quot;, lty=2) Figure 4.2: Number of clusters detected in the Zeisel brain dataset as a function of the number of PCs. The red unbroken line represents the theoretical upper constraint on the number of clusters, while the grey dashed line is the number of PCs suggested by getClusteredPCs(). We subset the PC matrix by column to retain the first \\(d\\) PCs and assign the subsetted matrix back into our SingleCellExperiment object. Downstream applications that use the \"PCA.clust\" results in sce.zeisel will subsequently operate on the chosen PCs only. reducedDim(sce.zeisel, &quot;PCA.clust&quot;) &lt;- pcs[,1:val] This strategy is pragmatic as it directly addresses the role of the bias-variance trade-off in downstream analyses, specifically clustering. There is no need to preserve biological signal beyond what is distinguishable in later steps. However, it involves strong assumptions about the nature of the biological differences between subpopulations - and indeed, discrete subpopulations may not even exist in studies of continuous processes like differentiation. It also requires repeated applications of the clustering procedure on increasing numbers of PCs, which may be computationally expensive. 4.2.4 Using random matrix theory We consider the observed (log-)expression matrix to be the sum of (i) a low-rank matrix containing the true biological signal for each cell and (ii) a random matrix representing the technical noise in the data. Under this interpretation, we can use random matrix theory to guide the choice of the number of PCs based on the properties of the noise matrix. The Marchenko-Pastur (MP) distribution defines an upper bound on the singular values of a matrix with random i.i.d. 
entries. Thus, all PCs associated with larger singular values are likely to contain real biological structure - or at least, signal beyond that expected by noise - and should be retained (Shekhar et al. 2016). We can implement this scheme using the chooseMarchenkoPastur() function from the PCAtools package, given the dimensionality of the matrix used for the PCA (noting that we only used the HVG subset); the variance explained by each PC (not the percentage); and the variance of the noise matrix derived from our previous variance decomposition results. # Generating more PCs for demonstration purposes: set.seed(10100101) sce.zeisel2 &lt;- fixedPCA(sce.zeisel, subset.row=top.zeisel, rank=200) # Actual variance explained is also provided in the attributes: mp.choice &lt;- chooseMarchenkoPastur( .dim=c(length(top.zeisel), ncol(sce.zeisel2)), var.explained=attr(reducedDim(sce.zeisel2), &quot;varExplained&quot;), noise=median(dec.zeisel[top.zeisel,&quot;tech&quot;])) mp.choice ## [1] 144 ## attr(,&quot;limit&quot;) ## [1] 2.336 We can then subset the PC coordinate matrix by the first mp.choice columns as previously demonstrated. It is best to treat this as a guideline only; PCs below the MP limit are not necessarily uninteresting, especially in noisy datasets where the higher noise drives a more aggressive choice of \\(d\\). Conversely, many PCs above the limit may not be relevant if they are driven by uninteresting biological processes like transcriptional bursting, cell cycle or metabolic variation. Moreover, the use of the MP distribution is not entirely justified here as the noise distribution differs by abundance for each gene and by sequencing depth for each cell. In a similar vein, Horn’s parallel analysis is commonly used to pick the number of PCs to retain in factor analysis. This involves randomizing the input matrix, repeating the PCA and creating a scree plot of the PCs of the randomized matrix. 
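As a quick sanity check of the MP limit described above, we can simulate a pure-noise matrix in base R and confirm that its leading eigenvalue sits at or below the theoretical edge; the dimensions and unit noise variance here are arbitrary choices for illustration.

```r
# Eigenvalues of a noise-only matrix should fall at or below the MP upper edge.
set.seed(42)
p <- 2000; n <- 200                    # toy "genes" by "cells"
noise <- matrix(rnorm(p * n), nrow=p)  # pure noise with variance 1
evals <- prcomp(t(noise))$sdev^2       # variance explained by each PC
mp.limit <- (1 + sqrt(p / n))^2        # MP upper edge for unit noise variance
max(evals) / mp.limit                  # close to 1, as no PC carries real signal
```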
The desired number of PCs is then chosen based on the intersection of the randomized scree plot with that of the original matrix (Figure 4.3). Here, the reasoning is that PCs are unlikely to be interesting if they explain less variance than that of the corresponding PC of a random matrix. Note that this differs from the MP approach as we are not using the upper bound of randomized singular values to threshold the original PCs. set.seed(100010) horn &lt;- parallelPCA(logcounts(sce.zeisel)[top.zeisel,], BSPARAM=BiocSingular::IrlbaParam(), niters=10) horn$n ## [1] 26 plot(horn$original$variance, type=&quot;b&quot;, log=&quot;y&quot;, pch=16) permuted &lt;- horn$permuted for (i in seq_len(ncol(permuted))) { points(permuted[,i], col=&quot;grey80&quot;, pch=16) lines(permuted[,i], col=&quot;grey80&quot;, pch=16) } abline(v=horn$n, col=&quot;red&quot;) Figure 4.3: Percentage of variance explained by each PC in the original matrix (black) and the PCs in the randomized matrix (grey) across several randomization iterations. The red line marks the chosen number of PCs. The parallelPCA() function helpfully emits the PC coordinates in horn$original$rotated, which we can subset by horn$n and add to the reducedDims() of our SingleCellExperiment. Parallel analysis is reasonably intuitive (as random matrix methods go) and avoids any i.i.d. assumption across genes. However, its obvious disadvantage is the not-insignificant computational cost of randomizing and repeating the PCA. One can also debate whether the scree plot of the randomized matrix is even comparable to that of the original, given that the former includes biological variation and thus cannot be interpreted as purely technical noise. This manifests in Figure 4.3 as a consistently higher curve for the randomized matrix due to the redistribution of biological variation to the later PCs. Another approach is based on optimizing the reconstruction error of the low-rank representation (Gavish and Donoho 2014). 
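The idea of reconstruction error can be illustrated with a base-R toy: a rank-3 signal plus Gaussian noise, where the error of the rank-\\(d\\) approximation against the true signal is smallest near the true rank. All dimensions and noise levels here are invented for illustration.

```r
set.seed(1)
# Rank-3 "biology" (100 cells by 20 genes) plus Gaussian noise.
signal <- tcrossprod(matrix(rnorm(100 * 3), 100), matrix(rnorm(20 * 3), 20))
obs <- signal + matrix(rnorm(100 * 20, sd=0.5), 100)
pc <- prcomp(obs)
recon.error <- sapply(1:10, function(d) {
    recon <- pc$x[, 1:d, drop=FALSE] %*% t(pc$rotation[, 1:d, drop=FALSE])
    recon <- sweep(recon, 2, pc$center, "+")  # undo the column centering
    sum((recon - signal)^2)  # error against the true signal, not the noisy matrix
})
which.min(recon.error)  # smallest near the true rank of 3
```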
Recall that PCA produces both the matrix of per-cell coordinates and a rotation matrix of per-gene loadings, the product of which recovers the original log-expression matrix. If we subset these two matrices to the first \\(d\\) dimensions, the product of the resulting submatrices serves as an approximation of the original matrix. Under certain conditions, the difference between this approximation and the true low-rank signal (i.e., sans the noise matrix) has a defined minimum at a certain number of dimensions. This minimum can be identified using the chooseGavishDonoho() function from PCAtools as shown below. gv.choice &lt;- chooseGavishDonoho( .dim=c(length(top.zeisel), ncol(sce.zeisel2)), var.explained=attr(reducedDim(sce.zeisel2), &quot;varExplained&quot;), noise=median(dec.zeisel[top.zeisel,&quot;tech&quot;])) gv.choice ## [1] 59 ## attr(,&quot;limit&quot;) ## [1] 3.121 The Gavish-Donoho method is appealing as, unlike the other approaches for choosing \\(d\\), the concept of the optimum is rigorously defined. By minimizing the reconstruction error, we can most accurately represent the true biological variation in terms of the distances between cells in PC space. However, there remains some room for difference between “optimal” and “useful”; for example, noisy datasets may find themselves with very low \\(d\\) as including more PCs will only ever increase reconstruction error, regardless of whether they contain relevant biological variation. This approach is also dependent on some strong i.i.d. assumptions about the noise matrix. 4.3 Count-based dimensionality reduction For count matrices, correspondence analysis (CA) is a natural approach to dimensionality reduction. In this procedure, we compute an expected value for each entry in the matrix based on the per-gene abundance and size factors. 
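Concretely, the basic CA calculation can be sketched in base R on a toy count matrix: expected values from the product of the margins, Pearson-style residuals, then an SVD. corral implements a refined and scalable version, so treat this only as a conceptual outline.

```r
set.seed(6)
counts <- matrix(rpois(200, lambda=5), nrow=20)   # toy 20 genes by 10 cells
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)
resid <- (counts - expected) / sqrt(expected)     # Pearson-style residuals
ca <- svd(resid)                                  # SVD of the residual matrix
cell.coords <- ca$v[, 1:2] %*% diag(ca$d[1:2])    # per-cell CA coordinates
```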
Each count is converted into a standardized residual in a manner analogous to the calculation of the statistic in Pearson’s chi-squared tests, i.e., subtraction of the expected value and division by its square root. An SVD is then applied on this matrix of residuals to obtain the necessary low-dimensional coordinates for each cell. To demonstrate, we use the corral package to compute CA factors for the Zeisel dataset. library(corral) sce.corral &lt;- corral_sce(sce.zeisel, subset_row=top.zeisel, col.w=sizeFactors(sce.zeisel)) dim(reducedDim(sce.corral, &quot;corral&quot;)) ## [1] 2816 30 The major advantage of CA is that it avoids difficulties with the mean-variance relationship upon transformation (Figure 2.2). If two cells have the same expression profile but differences in their total counts, CA will return the same expected location for both cells; this avoids artifacts observed in PCA on log-transformed counts (Figure 4.4). However, CA is more sensitive to overdispersion in the random noise due to the nature of its standardization. This may cause problems in some datasets where the CA factors are driven by a few genes with random expression rather than by the underlying biological structure. # TODO: move to scRNAseq. The rm(env) avoids problems with knitr caching inside # rebook&#39;s use of callr::r() during compilation. library(BiocFileCache) bfc &lt;- BiocFileCache(ask=FALSE) qcdata &lt;- bfcrpath(bfc, &quot;https://github.com/LuyiTian/CellBench_data/blob/master/data/mRNAmix_qc.RData?raw=true&quot;) env &lt;- new.env() load(qcdata, envir=env) sce.8qc &lt;- env$sce8_qc rm(env) sce.8qc$mix &lt;- factor(sce.8qc$mix) sce.8qc ## class: SingleCellExperiment ## dim: 15571 296 ## metadata(2): scPipe Biomart ## assays(1): counts ## rownames(15571): ENSG00000245025 ENSG00000257433 ... ENSG00000233117 ## ENSG00000115687 ## rowData names(0): ## colnames(296): L19 A10 ... P8 P9 ## colData names(21): unaligned aligned_unmapped ... 
HCC827_prop ## mRNA_amount ## reducedDimNames(0): ## mainExpName: NULL ## altExpNames(0): # Choosing some HVGs for PCA: sce.8qc &lt;- logNormCounts(sce.8qc) dec.8qc &lt;- modelGeneVar(sce.8qc) hvgs.8qc &lt;- getTopHVGs(dec.8qc, n=1000) sce.8qc &lt;- fixedPCA(sce.8qc, subset.row=hvgs.8qc) # By comparison, corral operates on the raw counts: sce.8qc &lt;- corral_sce(sce.8qc, subset_row=hvgs.8qc, col.w=sizeFactors(sce.8qc)) library(scater) gridExtra::grid.arrange( plotPCA(sce.8qc, colour_by=&quot;mix&quot;) + ggtitle(&quot;PCA&quot;), plotReducedDim(sce.8qc, &quot;corral&quot;, colour_by=&quot;mix&quot;) + ggtitle(&quot;corral&quot;), ncol=2 ) Figure 4.4: Dimensionality reduction results of all pool-and-split libraries in the SORT-seq CellBench data, computed by a PCA on the log-normalized expression values (left) or using the corral package (right). Each point represents a library and is colored by the mixing ratio used to construct it. 4.4 More visualization methods 4.4.1 Fast interpolation-based \\(t\\)-SNE Conventional \\(t\\)-SNE algorithms scale poorly with the number of cells. Fast interpolation-based \\(t\\)-SNE (FIt-SNE) (Linderman et al. 2019) is an alternative algorithm that reduces the computational complexity of the calculations from \\(N\\log N\\) to \\(\\sim 2 p N\\). This is achieved by using interpolation nodes in the high-dimensional space; the bulk of the calculations are performed on the nodes and the embedding of individual cells around each node is determined by interpolation. To use this method, we can simply set use_fitsne=TRUE when calling runTSNE() with scater - this calls the snifter package, which in turn wraps the Python library openTSNE using basilisk. As Figure 4.5 shows, the embeddings produced by this method are qualitatively similar to those produced by other algorithms, supported by some theoretical results from Linderman et al. (2019) showing that any difference from conventional \\(t\\)-SNE implementations is low and bounded. 
set.seed(9000) sce.zeisel &lt;- runTSNE(sce.zeisel) sce.zeisel &lt;- runTSNE(sce.zeisel, use_fitsne = TRUE, name=&quot;FIt-SNE&quot;) gridExtra::grid.arrange( plotReducedDim(sce.zeisel, &quot;TSNE&quot;, colour_by=&quot;level1class&quot;) + ggtitle(&quot;t-SNE&quot;), plotReducedDim(sce.zeisel, &quot;FIt-SNE&quot;, colour_by=&quot;level1class&quot;) + ggtitle(&quot;FIt-SNE&quot;), ncol=2 ) Figure 4.5: FIt-SNE and Barnes-Hut \\(t\\)-SNE embeddings for the Zeisel brain data. By using snifter directly, we can also take advantage of openTSNE’s ability to project new points into an existing embedding. In this process, the existing points remain static while new points are inserted based on their affinities with each other and the points in the existing embedding. For example, cells are generally projected near to cells of a similar type in Figure 4.6. This may be useful as an exploratory step when combining datasets, though the projection may not be sensible for cell types that are not present in the existing embedding. set.seed(1000) ind_test &lt;- as.logical(rbinom(ncol(sce.zeisel), 1, 0.2)) ind_train &lt;- !ind_test library(snifter) olddata &lt;- reducedDim(sce.zeisel[, ind_train], &quot;PCA&quot;) embedding &lt;- fitsne(olddata) newdata &lt;- reducedDim(sce.zeisel[, ind_test], &quot;PCA&quot;) projected &lt;- project(embedding, new = newdata, old = olddata) all &lt;- rbind(embedding, projected) label &lt;- c(sce.zeisel$level1class[ind_train], sce.zeisel$level1class[ind_test]) # Rows of &#39;all&#39; are ordered training-then-test, so the shape aesthetic must follow suit. is_test &lt;- rep(c(FALSE, TRUE), c(sum(ind_train), sum(ind_test))) ggplot() + aes(all[, 1], all[, 2], col = factor(label), shape = is_test) + labs(x = &quot;t-SNE 1&quot;, y = &quot;t-SNE 2&quot;) + geom_point(alpha = 0.5) + scale_colour_brewer(palette = &quot;Set2&quot;, name=&quot;level1class&quot;) + theme_bw() + scale_shape_manual(values = c(8, 19), name = &quot;Set&quot;, labels = c(&quot;Training&quot;, &quot;Test&quot;)) Figure 4.6: \\(t\\)-SNE embedding created with snifter, using 80% of the cells in the Zeisel brain data. 
The remaining 20% of the cells were projected into this pre-existing embedding. 4.4.2 Density-preserving \\(t\\)-SNE and UMAP One downside of \\(t\\)-SNE and UMAP is that they preserve the neighbourhood structure of the data while neglecting the local density of the data. This can result in seemingly compact clusters on a t-SNE or UMAP plot that correspond to very heterogeneous groups in the original data. The dens-SNE and densMAP algorithms mitigate this effect by incorporating information about the average distance to the nearest neighbours when creating the embedding (Narayan 2021). We demonstrate below by applying these approaches on the PCs of the Zeisel dataset using the densvis package. library(densvis) dt &lt;- densne(reducedDim(sce.zeisel, &quot;PCA&quot;), dens_frac = 0.4, dens_lambda = 0.2) reducedDim(sce.zeisel, &quot;dens-SNE&quot;) &lt;- dt dm &lt;- densmap(reducedDim(sce.zeisel, &quot;PCA&quot;), dens_frac = 0.4, dens_lambda = 0.2) reducedDim(sce.zeisel, &quot;densMAP&quot;) &lt;- dm sce.zeisel &lt;- runUMAP(sce.zeisel) # for comparison These methods provide more information about transcriptional heterogeneity within clusters (Figure 4.7), with the astrocyte cluster being less compact in the density-preserving versions. The excessive compactness of this cluster in the conventional embeddings would otherwise imply a lower level of within-population heterogeneity than is actually present. gridExtra::grid.arrange( plotReducedDim(sce.zeisel, &quot;TSNE&quot;, colour_by=&quot;level1class&quot;) + ggtitle(&quot;t-SNE&quot;), plotReducedDim(sce.zeisel, &quot;dens-SNE&quot;, colour_by=&quot;level1class&quot;) + ggtitle(&quot;dens-SNE&quot;), plotReducedDim(sce.zeisel, &quot;UMAP&quot;, colour_by=&quot;level1class&quot;) + ggtitle(&quot;UMAP&quot;), plotReducedDim(sce.zeisel, &quot;densMAP&quot;, colour_by=&quot;level1class&quot;) + ggtitle(&quot;densMAP&quot;), ncol=2 ) Figure 4.7: \\(t\\)-SNE, UMAP, dens-SNE and densMAP embeddings for the Zeisel brain data. 
Session Info View session info R version 4.4.2 (2024-10-31) Platform: x86_64-pc-linux-gnu Running under: Ubuntu 24.04.1 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C time zone: America/New_York tzcode source: system (glibc) attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] densvis_1.16.0 snifter_1.16.0 [3] scater_1.34.0 BiocFileCache_2.14.0 [5] dbplyr_2.5.0 corral_1.16.0 [7] PCAtools_2.18.0 ggrepel_0.9.6 [9] ggplot2_3.5.1 scran_1.34.0 [11] scuttle_1.16.0 SingleCellExperiment_1.28.1 [13] SummarizedExperiment_1.36.0 Biobase_2.66.0 [15] GenomicRanges_1.58.0 GenomeInfoDb_1.42.1 [17] IRanges_2.40.1 S4Vectors_0.44.0 [19] BiocGenerics_0.52.0 MatrixGenerics_1.18.1 [21] matrixStats_1.5.0 BiocStyle_2.34.0 [23] rebook_1.16.0 loaded via a namespace (and not attached): [1] RColorBrewer_1.1-3 jsonlite_1.8.9 [3] CodeDepends_0.6.6 MultiAssayExperiment_1.32.0 [5] magrittr_2.0.3 ggbeeswarm_0.7.2 [7] farver_2.1.2 rmarkdown_2.29 [9] zlibbioc_1.52.0 vctrs_0.6.5 [11] memoise_2.0.1 DelayedMatrixStats_1.28.1 [13] htmltools_0.5.8.1 S4Arrays_1.6.0 [15] curl_6.1.0 BiocNeighbors_2.0.1 [17] SparseArray_1.6.1 sass_0.4.9 [19] bslib_0.8.0 basilisk_1.18.0 [21] plyr_1.8.9 cachem_1.1.0 [23] igraph_2.1.3 lifecycle_1.0.4 [25] pkgconfig_2.0.3 rsvd_1.0.5 [27] Matrix_1.7-1 R6_2.5.1 [29] fastmap_1.2.0 GenomeInfoDbData_1.2.13 [31] digest_0.6.37 colorspace_2.1-1 [33] dqrng_0.4.1 irlba_2.3.5.1 [35] RSQLite_2.3.9 beachmat_2.22.0 [37] filelock_1.0.3 labeling_0.4.3 [39] httr_1.4.7 RMTstat_0.3.1 [41] abind_1.4-8 compiler_4.4.2 [43] bit64_4.6.0-1 withr_3.0.2 [45] BiocParallel_1.40.0 viridis_0.6.5 [47] 
DBI_1.2.3 maps_3.4.2.1 [49] rappdirs_0.3.3 DelayedArray_0.32.0 [51] bluster_1.16.0 tools_4.4.2 [53] vipor_0.4.7 beeswarm_0.4.0 [55] glue_1.8.0 grid_4.4.2 [57] Rtsne_0.17 cluster_2.1.8 [59] reshape2_1.4.4 generics_0.1.3 [61] gtable_0.3.6 data.table_1.16.4 [63] BiocSingular_1.22.0 ScaledMatrix_1.14.0 [65] metapod_1.14.0 XVector_0.46.0 [67] pillar_1.10.1 stringr_1.5.1 [69] limma_3.62.2 pals_1.9 [71] dplyr_1.1.4 lattice_0.22-6 [73] FNN_1.1.4.1 bit_4.5.0.1 [75] tidyselect_1.2.1 locfit_1.5-9.10 [77] transport_0.15-4 knitr_1.49 [79] gridExtra_2.3 bookdown_0.42 [81] edgeR_4.4.1 xfun_0.50 [83] statmod_1.5.0 stringi_1.8.4 [85] UCSC.utils_1.2.0 yaml_2.3.10 [87] evaluate_1.0.3 codetools_0.2-20 [89] tibble_3.2.1 BiocManager_1.30.25 [91] graph_1.84.1 cli_3.6.3 [93] uwot_0.2.2 reticulate_1.40.0 [95] munsell_0.5.1 jquerylib_0.1.4 [97] dichromat_2.0-0.1 Rcpp_1.0.14 [99] dir.expiry_1.14.0 mapproj_1.2.11 [101] png_0.1-8 XML_3.99-0.18 [103] parallel_4.4.2 assertthat_0.2.1 [105] blob_1.2.4 basilisk.utils_1.18.0 [107] sparseMatrixStats_1.18.0 ggthemes_5.1.0 [109] viridisLite_0.4.2 scales_1.3.0 [111] purrr_1.0.2 crayon_1.5.3 [113] rlang_1.1.5 cowplot_1.1.3 References "],["clustering-redux.html", "Chapter 5 Clustering, redux 5.1 Motivation 5.2 Quantifying clustering behavior 5.3 Comparing different clusterings 5.4 Evaluating cluster stability 5.5 Clustering parameter sweeps 5.6 Agglomerating graph communities Session Info", " Chapter 5 Clustering, redux .rebook-collapse { background-color: #eee; color: #444; cursor: pointer; padding: 18px; width: 100%; border: none; text-align: left; outline: none; font-size: 15px; } .rebook-content { padding: 0 18px; display: none; overflow: hidden; background-color: #f1f1f1; } 5.1 Motivation Basic Chapter 5 described the process of clustering cells to obtain discrete summaries of scRNA-seq datasets. 
Here, we describe some diagnostics to evaluate clustering separation and stability, methods to compare clusterings that represent different views of the data, and some strategies to choose the number of clusters. We will again be demonstrating these techniques on the 10X Genomics PBMC dataset, clustered using the default graph-based method in clusterCells(). View set-up code (Workflow Chapter 3) #--- loading ---# library(DropletTestFiles) raw.path &lt;- getTestFile(&quot;tenx-2.1.0-pbmc4k/1.0.0/raw.tar.gz&quot;) out.path &lt;- file.path(tempdir(), &quot;pbmc4k&quot;) untar(raw.path, exdir=out.path) library(DropletUtils) fname &lt;- file.path(out.path, &quot;raw_gene_bc_matrices/GRCh38&quot;) sce.pbmc &lt;- read10xCounts(fname, col.names=TRUE) #--- gene-annotation ---# library(scater) rownames(sce.pbmc) &lt;- uniquifyFeatureNames( rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol) library(EnsDb.Hsapiens.v86) location &lt;- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce.pbmc)$ID, column=&quot;SEQNAME&quot;, keytype=&quot;GENEID&quot;) #--- cell-detection ---# set.seed(100) e.out &lt;- emptyDrops(counts(sce.pbmc)) sce.pbmc &lt;- sce.pbmc[,which(e.out$FDR &lt;= 0.001)] #--- quality-control ---# stats &lt;- perCellQCMetrics(sce.pbmc, subsets=list(Mito=which(location==&quot;MT&quot;))) high.mito &lt;- isOutlier(stats$subsets_Mito_percent, type=&quot;higher&quot;) sce.pbmc &lt;- sce.pbmc[,!high.mito] #--- normalization ---# library(scran) set.seed(1000) clusters &lt;- quickCluster(sce.pbmc) sce.pbmc &lt;- computeSumFactors(sce.pbmc, cluster=clusters) sce.pbmc &lt;- logNormCounts(sce.pbmc) #--- variance-modelling ---# set.seed(1001) dec.pbmc &lt;- modelGeneVarByPoisson(sce.pbmc) top.pbmc &lt;- getTopHVGs(dec.pbmc, prop=0.1) #--- dimensionality-reduction ---# set.seed(10000) sce.pbmc &lt;- denoisePCA(sce.pbmc, subset.row=top.pbmc, technical=dec.pbmc) set.seed(100000) sce.pbmc &lt;- runTSNE(sce.pbmc, dimred=&quot;PCA&quot;) set.seed(1000000) sce.pbmc &lt;- runUMAP(sce.pbmc, 
dimred=&quot;PCA&quot;) library(scran) nn.clust &lt;- clusterCells(sce.pbmc, use.dimred=&quot;PCA&quot;, full=TRUE) colLabels(sce.pbmc) &lt;- nn.clust$clusters sce.pbmc ## class: SingleCellExperiment ## dim: 33694 3985 ## metadata(1): Samples ## assays(2): counts logcounts ## rownames(33694): RP11-34P13.3 FAM138A ... AC213203.1 FAM231B ## rowData names(2): ID Symbol ## colnames(3985): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ... ## TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1 ## colData names(4): Sample Barcode sizeFactor label ## reducedDimNames(3): PCA TSNE UMAP ## mainExpName: NULL ## altExpNames(0): 5.2 Quantifying clustering behavior 5.2.1 Motivation Several metrics are available to quantify the behavior of a clustering method, such as the silhouette width or graph modularity. These metrics are most helpful when they can be computed for each cluster, allowing us to prioritize certain clusters for more careful analysis. Poorly separated clusters can be prioritized for manual inspection to determine the relevance of the differences to their neighboring clusters, while heterogeneous clusters may be subject to further clustering to identify internal structure. We can also quantify the effect of different algorithms or parameter values, providing some insight into how our dataset responds to changes in the clustering method. As an aside: it is also possible to choose the clustering that optimizes for some of these metrics. This can be helpful to automatically obtain a “reasonable” clustering, though in practice, the clustering that yields the strongest separation often does not provide the most biological insight. Poorly separated clusters will often be observed in non-trivial analyses of scRNA-seq data where the aim is to characterize closely related subtypes or states. Indeed, the most well-separated clusters are rarely interesting as these describe obvious differences between known cell types. 
These diagnostics are best used to guide interpretation by highlighting clusters that require more investigation rather than to filter out “bad” clusters altogether. 5.2.2 Silhouette width The silhouette width is an established metric for evaluating cluster separation. For each cell, we compute the average distance to all cells in the same cluster. We also compute the average distance to all cells in another cluster, taking the minimum of the averages across all other clusters. The silhouette width for each cell is defined as the difference between these two values divided by their maximum. Cells with large positive silhouette widths are closer to other cells in the same cluster than to cells in different clusters. Thus, clusters with large positive silhouette widths are well-separated from other clusters. The silhouette width is a natural diagnostic for hierarchical clustering where the distance matrix is already available. For larger datasets, we instead use an approximate approach that uses the root of the average squared distances rather than the average distance itself. The approximation avoids the time-consuming calculation of pairwise distances that would otherwise make this metric impractical. This is implemented in the approxSilhouette() function from bluster, allowing us to quickly identify poorly separated clusters with mostly negative widths (Figure 5.1). # Performing the calculations on the PC coordinates, like before. library(bluster) sil.approx &lt;- approxSilhouette(reducedDim(sce.pbmc, &quot;PCA&quot;), clusters=colLabels(sce.pbmc)) sil.approx ## DataFrame with 3985 rows and 3 columns ## cluster other width ## &lt;factor&gt; &lt;factor&gt; &lt;numeric&gt; ## AAACCTGAGAAGGCCT-1 2 12 0.197754 ## AAACCTGAGACAGACC-1 2 4 0.290793 ## AAACCTGAGGCATGGT-1 9 13 0.456424 ## AAACCTGCAAGGTTCT-1 3 9 -0.305378 ## AAACCTGCAGGCGATA-1 7 10 0.557639 ## ... ... ... ... 
## TTTGGTTTCGCTAGCG-1 2 4 0.316017 ## TTTGTCACACTTAACG-1 11 6 0.307818 ## TTTGTCACAGGTCCAC-1 1 9 -0.455344 ## TTTGTCAGTTAAGACA-1 5 10 0.609735 ## TTTGTCATCCCAAGAT-1 2 4 0.276464 sil.data &lt;- as.data.frame(sil.approx) sil.data$closest &lt;- factor(ifelse(sil.data$width &gt; 0, colLabels(sce.pbmc), sil.data$other)) sil.data$cluster &lt;- colLabels(sce.pbmc) library(ggplot2) ggplot(sil.data, aes(x=cluster, y=width, colour=closest)) + ggbeeswarm::geom_quasirandom(method=&quot;smiley&quot;) Figure 5.1: Distribution of the approximate silhouette width across cells in each cluster of the PBMC dataset. Each point represents a cell and is colored with the identity of its own cluster if its silhouette width is positive and that of the closest other cluster if the width is negative. For a more detailed examination, we identify the closest neighboring cluster for each cell in each cluster. In the table below, each row corresponds to one cluster; large off-diagonal counts indicate that its cells are easily confused with those from another cluster. table(Cluster=colLabels(sce.pbmc), sil.data$closest) ## ## Cluster 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ## 1 172 0 0 0 0 15 0 0 3 1 11 0 0 0 3 ## 2 0 627 0 5 0 1 8 0 0 0 0 54 0 36 0 ## 3 0 0 231 0 0 41 0 0 208 0 42 0 95 0 0 ## 4 0 8 0 46 0 0 0 0 1 0 0 0 0 1 0 ## 5 0 0 0 0 538 1 0 0 1 0 1 0 0 0 0 ## 6 7 0 4 0 0 315 0 0 0 0 26 0 0 0 0 ## 7 0 0 0 0 0 0 125 0 0 0 0 0 0 0 0 ## 8 0 0 0 0 0 0 0 45 0 0 0 1 0 0 0 ## 9 0 0 0 0 0 0 0 0 811 0 0 0 8 0 0 ## 10 0 0 0 0 0 0 0 0 2 45 0 0 0 0 0 ## 11 0 0 0 0 0 1 0 0 0 0 152 0 0 0 0 ## 12 0 0 0 0 0 0 0 0 0 0 0 61 0 0 0 ## 13 0 0 0 0 0 0 0 0 0 0 0 0 129 0 0 ## 14 0 0 0 0 0 0 0 0 0 0 0 0 0 87 0 ## 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16 A useful aspect of the silhouette width is that it naturally captures both underclustering and overclustering. 
Cells in heterogeneous clusters will have a large average distance to other cells in the same cluster; all else being equal, this will decrease their widths compared to cells in homogeneous clusters. Conversely, cells in overclustered datasets will have low average distances to cells in adjacent clusters, again resulting in low widths. One can exploit this to obtain a “sensible” initial choice for the number of clusters by maximizing the average silhouette width (Section 5.5). 5.2.3 Cluster purity Another metric to assess cluster separation is the degree to which cells from multiple clusters intermingle in expression space. The “clustering purity” is defined for each cell as the proportion of neighboring cells that are assigned to the same cluster, after some weighting to adjust for differences in the number of cells between clusters. Well-separated clusters should exhibit little intermingling and thus high purity values for all member cells. pure.pbmc &lt;- neighborPurity(reducedDim(sce.pbmc, &quot;PCA&quot;), colLabels(sce.pbmc)) pure.pbmc ## DataFrame with 3985 rows and 2 columns ## purity maximum ## &lt;numeric&gt; &lt;factor&gt; ## AAACCTGAGAAGGCCT-1 1.000000 2 ## AAACCTGAGACAGACC-1 1.000000 2 ## AAACCTGAGGCATGGT-1 1.000000 9 ## AAACCTGCAAGGTTCT-1 0.558038 3 ## AAACCTGCAGGCGATA-1 1.000000 7 ## ... ... ... ## TTTGGTTTCGCTAGCG-1 1 2 ## TTTGTCACACTTAACG-1 1 11 ## TTTGTCACAGGTCCAC-1 1 1 ## TTTGTCAGTTAAGACA-1 1 5 ## TTTGTCATCCCAAGAT-1 1 2 In Figure 5.2, median purity values are consistently greater than 0.9, indicating that most cells in each cluster are primarily surrounded by other cells from the same cluster. Some clusters have low purity values that may warrant more careful inspection - these probably represent closely related subpopulations. 
pure.data &lt;- as.data.frame(pure.pbmc) pure.data$maximum &lt;- factor(pure.data$maximum) pure.data$cluster &lt;- colLabels(sce.pbmc) ggplot(pure.data, aes(x=cluster, y=purity, colour=maximum)) + ggbeeswarm::geom_quasirandom(method=&quot;smiley&quot;) Figure 5.2: Distribution of cluster purities across cells in each cluster of the PBMC dataset. Each point represents a cell and is colored with the identity of the cluster contributing the largest proportion of its neighbors. To determine which clusters contaminate each other, we can identify the cluster with the most neighbors for each cell. In the table below, each row corresponds to one cluster; large off-diagonal counts indicate that its cells are easily confused with those from another cluster. table(Cluster=colLabels(sce.pbmc), pure.data$maximum) ## ## Cluster 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ## 1 205 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ## 2 0 702 0 1 0 0 5 0 0 0 0 17 0 6 0 ## 3 0 0 521 0 0 9 0 0 41 0 10 0 36 0 0 ## 4 0 0 0 56 0 0 0 0 0 0 0 0 0 0 0 ## 5 0 0 0 0 541 0 0 0 0 0 0 0 0 0 0 ## 6 0 0 0 0 0 349 0 0 0 0 3 0 0 0 0 ## 7 0 0 0 0 0 0 125 0 0 0 0 0 0 0 0 ## 8 0 0 0 0 0 0 0 46 0 0 0 0 0 0 0 ## 9 0 0 2 0 0 0 0 0 812 0 0 0 5 0 0 ## 10 0 0 0 0 0 0 0 0 0 47 0 0 0 0 0 ## 11 0 0 0 0 0 0 0 0 0 0 153 0 0 0 0 ## 12 0 0 0 0 0 0 0 0 0 0 0 61 0 0 0 ## 13 0 0 0 0 0 0 0 0 0 0 0 0 129 0 0 ## 14 0 0 0 0 0 0 0 0 0 0 0 0 0 87 0 ## 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16 The main difference between the cluster purity and silhouette width is that the former ignores the intra-cluster variance. This provides a simpler interpretation of cluster separation; a low silhouette width may still occur in well-separated clusters if the internal heterogeneity is high, while no such complication exists for the cluster purity. Comparing these two metrics can give us an indication of which clusters are heterogeneous, well-separated, or both. 
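The intuition behind the purity statistic can be reproduced in a few lines of base R, computing the proportion of each cell's \\(k\\) nearest neighbors that share its label on a toy dataset of two synthetic clusters. (neighborPurity() additionally weights neighbors to adjust for cluster size, which this sketch omits.)

```r
set.seed(7)
coords <- rbind(matrix(rnorm(60), 30), matrix(rnorm(60, mean=5), 30))
clust <- rep(1:2, each=30)
D <- as.matrix(dist(coords))
diag(D) <- Inf  # exclude each cell from its own neighbor list
k <- 10
purity <- vapply(seq_len(nrow(coords)), function(i) {
    nn <- order(D[i, ])[seq_len(k)]  # indices of the k nearest neighbors
    mean(clust[nn] == clust[i])
}, numeric(1))
summary(purity)  # high purities for these well-separated synthetic clusters
```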
5.2.4 Within-cluster sum of squares The root-mean-squared-deviation (RMSD) for each cluster is defined pretty much as it is named - the root of the mean of the squared differences from the cluster centroid, where the mean is computed across all cells in the cluster. It is closely related to the within-cluster sum of squares (WCSS) and is a natural diagnostic for \\(k\\)-means clustering, given that the algorithm aims to find a clustering that minimizes the WCSS. However, it can be computed for any set of clusters if the original coordinates are available (Figure 5.3). rmsd &lt;- clusterRMSD(reducedDim(sce.pbmc, &quot;PCA&quot;), colLabels(sce.pbmc)) barplot(rmsd, ylab=&quot;RMSD&quot;, xlab=&quot;Cluster&quot;) Figure 5.3: RMSDs for each cluster in the PBMC dataset. A large RMSD suggests that a cluster has some internal structure and should be prioritized for subclustering. Of course, this is subject to a number of caveats. Clusters generated from cells with low library sizes will naturally have larger RMSDs due to the greater impact of sequencing noise (Figure 5.4). Immediately adjacent clusters may also have high RMSDs even if they are not heterogeneous. This occurs when there are many cells on the boundaries between clusters, which would result in a higher sum of squares from the centroid. by.clust &lt;- split(sizeFactors(sce.pbmc), colLabels(sce.pbmc)) sf.by.clust &lt;- vapply(by.clust, mean, 0) plot(rmsd, sf.by.clust, log=&quot;xy&quot;, pch=16, xlab=&quot;RMSD&quot;, ylab=&quot;Average size factor&quot;) Figure 5.4: RMSDs for each cluster in the PBMC dataset as a function of the average size factor. 5.2.5 Using graph modularity For graph-based clustering, the modularity is a natural metric for evaluating the separation between communities. This is defined as the (scaled) difference between the observed total weight of edges between nodes in the same cluster and the expected total weight if edge weights were randomly distributed across all pairs of nodes. 
Larger modularity values indicate that most edges occur within clusters, suggesting that the clusters are sufficiently well separated to avoid edges forming between neighboring cells in different clusters. The standard approach is to report a single modularity value for a clustering on a given graph. This is useful for comparing different clusterings on the same graph - and indeed, some community detection algorithms are designed with the aim of maximizing the modularity - but it is less helpful for interpreting a given clustering. Rather, we use the pairwiseModularity() function from bluster with as.ratio=TRUE, which returns the ratio of the observed to expected sum of weights between each pair of clusters. We use the ratio instead of the difference as the former is less dependent on the number of cells in each cluster. g &lt;- nn.clust$objects$graph ratio &lt;- pairwiseModularity(g, colLabels(sce.pbmc), as.ratio=TRUE) dim(ratio) ## [1] 15 15 In this matrix, each row/column corresponds to a cluster and each entry contains the ratio of the observed to expected total weight of edges between cells in the respective clusters. A dataset containing well-separated clusters should contain most of the observed total weight on the diagonal entries, i.e., most edges occur between cells in the same cluster. Indeed, concentration of the weight on the diagonal (Figure 5.5) indicates that most of the clusters are well-separated, while some modest off-diagonal entries represent closely related clusters with more inter-connecting edges. library(pheatmap) pheatmap(log2(ratio+1), cluster_rows=FALSE, cluster_cols=FALSE, color=colorRampPalette(c(&quot;white&quot;, &quot;blue&quot;))(100)) Figure 5.5: Heatmap of the log2-ratio of the total weight between nodes in the same cluster or in different clusters, relative to the total weight expected under a null model of random links. 
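To make the observed/expected ratio concrete, here is a base-R sketch on a toy weighted adjacency matrix, using total node strength to define the expected weights; the exact null model used by pairwiseModularity() may differ in its details.

```r
# Toy symmetric adjacency matrix: two 3-cell clusters joined by one weak edge.
A <- matrix(0, 6, 6)
A[1:3, 1:3] <- 1; A[4:6, 4:6] <- 1
A[3, 4] <- A[4, 3] <- 0.2
diag(A) <- 0
labels <- rep(c("a", "b"), each=3)

# Observed total weight of edges between each pair of clusters.
obs <- matrix(0, 2, 2, dimnames=list(c("a", "b"), c("a", "b")))
for (i in rownames(obs)) for (j in colnames(obs))
    obs[i, j] <- sum(A[labels == i, labels == j])

# Expected weight if edges were distributed proportionally to node strength.
s <- tapply(rowSums(A), labels, sum)
expected <- outer(s, s) / sum(A)

obs / expected  # large diagonal, small off-diagonal, as in Figure 5.5
```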
One useful approach is to use the ratio matrix to form another graph where the nodes are clusters rather than cells. Edges between nodes are weighted according to the ratio of observed to expected edge weights between cells in those clusters. We can then repeat our graph operations on this new cluster-level graph to explore the relationships between clusters. For example, we could obtain clusters of clusters, or we could simply create a new cluster-based layout for visualization (Figure 5.6). This is analogous to the “graph abstraction” approach described by Wolf et al. (2017), which can be used to identify trajectories in the data based on high-weight paths between clusters. cluster.gr &lt;- igraph::graph_from_adjacency_matrix(log2(ratio+1), mode=&quot;upper&quot;, weighted=TRUE, diag=FALSE) # Increasing the weight to increase the visibility of the lines. set.seed(11001010) plot(cluster.gr, edge.width=igraph::E(cluster.gr)$weight*5, layout=igraph::layout_with_lgl) Figure 5.6: Force-based layout showing the relationships between clusters based on the log-ratio of observed to expected total weights between nodes in different clusters. The thickness of the edge between a pair of clusters is proportional to the corresponding log-ratio. Incidentally, some readers may have noticed that all igraph commands were prefixed with igraph::. We have done this deliberately to avoid bringing igraph::normalize into the global namespace. Rather unfortunately, this normalize function accepts any argument and returns NULL, which causes difficult-to-diagnose bugs when it overwrites normalize from BiocGenerics. 5.3 Comparing different clusterings 5.3.1 Motivation As previously mentioned, clustering’s main purpose is to obtain a discrete summary of the data for further interpretation. The diversity of available methods (and the subsequent variation in the clustering results) reflects the many different “perspectives” that can be derived from a high-dimensional scRNA-seq dataset. 
It is helpful to determine how these perspectives relate to each other by comparing the clustering results. More concretely, we want to know which clusters map to each other across algorithms; inconsistencies may be indicative of complex variation that is summarized differently by each clustering procedure. To illustrate, we will consider different variants of graph-based clustering on our PBMC dataset. clust.walktrap &lt;- clusterCells(sce.pbmc, use.dimred=&quot;PCA&quot;, BLUSPARAM=NNGraphParam(cluster.fun=&quot;walktrap&quot;)) clust.louvain &lt;- clusterCells(sce.pbmc, use.dimred=&quot;PCA&quot;, BLUSPARAM=NNGraphParam(cluster.fun=&quot;louvain&quot;)) 5.3.2 Identifying corresponding clusters The simplest approach for comparing two clusterings of the same dataset is to create a 2-dimensional table of label frequencies. We can then identify how our Walktrap clusters are redistributed when we switch to using Louvain community detection (Figure 5.7). Multiple non-zero entries along a row or column indicate that multiple clusters in one clustering are merged in the other clustering. tab &lt;- table(Walktrap=clust.walktrap, Louvain=clust.louvain) rownames(tab) &lt;- paste(&quot;Walktrap&quot;, rownames(tab)) colnames(tab) &lt;- paste(&quot;Louvain&quot;, colnames(tab)) library(pheatmap) pheatmap(log10(tab+10), color=viridis::viridis(100), cluster_cols=FALSE, cluster_rows=FALSE) Figure 5.7: Heatmap of the number of cells in each pair of clusters from Walktrap (rows) and Louvain clustering (columns) in the PBMC dataset. A more sophisticated approach involves computing the Jaccard index for each pair of clusters. This normalizes for the differences in cluster abundance so that large clusters do not dominate the color scale (Figure 5.8). Jaccard indices of 1 represent a perfect 1:1 mapping between a pair of clusters. 
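For intuition, the Jaccard index for every pair of clusters can be derived from the same kind of contingency table with a few lines of base R:

```r
# Jaccard index between each pair of clusters from two labelings:
# J = |intersection| / |union|, treating each cluster as a set of cells.
clust1 <- c(1, 1, 1, 2, 2, 3, 3, 3)
clust2 <- c("a", "a", "b", "b", "b", "c", "c", "c")

tab <- table(clust1, clust2)                            # pairwise intersections
union <- outer(rowSums(tab), colSums(tab), "+") - tab   # |A| + |B| - |A intersect B|
round(tab / union, 2)  # the (3, "c") entry is 1, a perfect 1:1 mapping
```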
jacc.mat &lt;- linkClustersMatrix(clust.walktrap, clust.louvain) rownames(jacc.mat) &lt;- paste(&quot;Walktrap&quot;, rownames(jacc.mat)) colnames(jacc.mat) &lt;- paste(&quot;Louvain&quot;, colnames(jacc.mat)) pheatmap(jacc.mat, color=viridis::viridis(100), cluster_cols=FALSE, cluster_rows=FALSE) Figure 5.8: Heatmap of the Jaccard indices comparing each Walktrap cluster (rows) to each Louvain cluster (columns) in the PBMC dataset. We identify the best corresponding clusters based on the largest Jaccard index along each row. The magnitude of the index can be used as a measure of strength for the correspondence between those two clusters. A low index for a cluster indicates that no counterpart exists in the other clustering. best &lt;- max.col(jacc.mat, ties.method=&quot;first&quot;) DataFrame( Cluster=rownames(jacc.mat), Corresponding=colnames(jacc.mat)[best], Index=jacc.mat[cbind(seq_len(nrow(jacc.mat)), best)] ) ## DataFrame with 15 rows and 3 columns ## Cluster Corresponding Index ## &lt;character&gt; &lt;character&gt; &lt;numeric&gt; ## 1 Walktrap 1 Louvain 9 0.909910 ## 2 Walktrap 2 Louvain 1 0.559343 ## 3 Walktrap 3 Louvain 3 0.505477 ## 4 Walktrap 4 Louvain 5 0.160819 ## 5 Walktrap 5 Louvain 6 0.996303 ## ... ... ... ... ## 11 Walktrap 11 Louvain 7 0.5103448 ## 12 Walktrap 12 Louvain 1 0.1210317 ## 13 Walktrap 13 Louvain 13 0.8216561 ## 14 Walktrap 14 Louvain 11 0.9666667 ## 15 Walktrap 15 Louvain 9 0.0730594 5.3.3 Visualizing differences The linkClusters() function constructs a graph where nodes are clusters from multiple clusterings and edges are formed between corresponding clusters. More specifically, the weight of each edge is defined from the number of shared cells, with the Jaccard index being used as the weighting scheme when denominator=\"union\". The idea is to obtain “clusters of clusters”, allowing us to visualize the relationships between clusters from two or more clusterings (Figure 5.9). 
As the name suggests, linkClusters() is closely related to the output of the linkClustersMatrix() function; indeed, the latter simply returns part of the adjacency matrix used to create the graph in the former. clust.infomap &lt;- clusterCells(sce.pbmc, use.dimred=&quot;PCA&quot;, BLUSPARAM=NNGraphParam(cluster.fun=&quot;infomap&quot;)) linked &lt;- linkClusters(list( walktrap=clust.walktrap, louvain=clust.louvain, infomap=clust.infomap) ) linked ## IGRAPH 1ada23d UNW- 64 145 -- ## + attr: name (v/c), weight (e/n) ## + edges from 1ada23d (vertex names): ## [1] walktrap.2 --louvain.1 walktrap.12--louvain.1 walktrap.3 --louvain.2 ## [4] walktrap.9 --louvain.2 walktrap.3 --louvain.3 walktrap.6 --louvain.3 ## [7] walktrap.9 --louvain.3 walktrap.7 --louvain.4 walktrap.2 --louvain.5 ## [10] walktrap.4 --louvain.5 walktrap.7 --louvain.5 walktrap.5 --louvain.6 ## [13] walktrap.3 --louvain.7 walktrap.5 --louvain.7 walktrap.6 --louvain.7 ## [16] walktrap.11--louvain.7 walktrap.10--louvain.8 walktrap.1 --louvain.9 ## [19] walktrap.3 --louvain.9 walktrap.15--louvain.9 walktrap.2 --louvain.10 ## [22] walktrap.8 --louvain.10 walktrap.2 --louvain.11 walktrap.4 --louvain.11 ## + ... omitted several edges meta &lt;- igraph::cluster_walktrap(linked) plot(linked, mark.groups=meta) Figure 5.9: Force-directed layout of the graph of the clusters obtained from different variants of community detection on the PBMC dataset. Each node represents a cluster obtained using one community detection method, with colored groupings representing clusters of clusters across different methods. For clusterings that differ primarily in resolution (usually from different parameterizations of the same algorithm), we can use the clustree package to visualize the relationships between them. Here, the aim is to capture the redistribution of cells from one clustering to another at progressively higher resolution, providing a convenient depiction of how clusters split apart (Figure 5.10). 
This approach is most effective when the clusterings exhibit a clear gradation in resolution but is less useful for comparisons involving theoretically distinct clustering procedures. clust.5 &lt;- clusterCells(sce.pbmc, use.dimred=&quot;PCA&quot;, BLUSPARAM=NNGraphParam(k=5)) clust.10 &lt;- clusterCells(sce.pbmc, use.dimred=&quot;PCA&quot;, BLUSPARAM=NNGraphParam(k=10)) clust.50 &lt;- clusterCells(sce.pbmc, use.dimred=&quot;PCA&quot;, BLUSPARAM=NNGraphParam(k=50)) combined &lt;- cbind(k.5=clust.5, k.10=clust.10, k.50=clust.50) library(clustree) set.seed(1111) clustree(combined, prefix=&quot;k.&quot;, edge_arrow=FALSE) Figure 5.10: Graph of the relationships between the Walktrap clusterings of the PBMC dataset, generated with varying \\(k\\) during the nearest-neighbor graph construction. (A higher \\(k\\) generally corresponds to a lower resolution clustering.) The size of the nodes is proportional to the number of cells in each cluster, and the edges depict cells in one cluster that are reassigned to another cluster at a different resolution. The color of the edges is defined according to the number of reassigned cells and the opacity is defined from the corresponding proportion relative to the size of the lower-resolution cluster. 5.3.4 Adjusted Rand index We can quantify the agreement between two clusterings by computing the Rand index with bluster’s pairwiseRand(). This is defined as the proportion of pairs of cells that retain the same status (i.e., both cells in the same cluster, or each cell in different clusters) in both clusterings. In practice, we usually compute the adjusted Rand index (ARI) where we subtract the number of concordant pairs expected under random permutations of the clusterings; this accounts for differences in the size and number of clusters within and between clusterings. A larger ARI indicates that the clusters are preserved, up to a maximum value of 1 for identical clusterings. 
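For reference, the ARI can be computed directly from a contingency table; a base-R sketch of the standard Hubert-Arabie formulation:

```r
# Adjusted Rand index from a contingency table (Hubert-Arabie formulation).
ari <- function(c1, c2) {
    tab <- table(c1, c2)
    comb2 <- function(n) n * (n - 1) / 2        # choose(n, 2)
    concordant <- sum(comb2(tab))
    a <- sum(comb2(rowSums(tab)))
    b <- sum(comb2(colSums(tab)))
    expected <- a * b / comb2(sum(tab))          # chance agreement
    (concordant - expected) / ((a + b) / 2 - expected)
}

ari(c(1, 1, 2, 2), c(1, 1, 2, 2))  # identical clusterings: ARI of 1
ari(c(1, 1, 2, 2), c(1, 2, 1, 2))  # discordant clusterings: ARI below 0
```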
In and of itself, the magnitude of the ARI has little meaning, and it is best used to assess the relative similarities of different clusterings (e.g., “Walktrap is more similar to Louvain than either are to Infomap”). Nonetheless, if one must have a hard-and-fast rule, experience suggests that an ARI greater than 0.5 corresponds to “good” similarity between two clusterings. pairwiseRand(clust.10, clust.5, mode=&quot;index&quot;) ## [1] 0.7731 The same function can also provide a more granular perspective with mode=\"ratio\", where the ARI is broken down into its contributions from each pair of clusters in one of the clusterings. This mode is helpful if one of the clusterings - in this case, clust.10 - is considered to be a “reference”, and the aim is to quantify the extent to which the reference clusters retain their integrity in another clustering. In the breakdown matrix, each entry is a ratio of the adjusted number of concordant pairs to the adjusted total number of pairs. Low values on the diagonal in Figure 5.11 indicate that cells from the corresponding reference cluster in clust.10 are redistributed to multiple other clusters in clust.5. Conversely, low off-diagonal values indicate that the corresponding pair of reference clusters are merged together in clust.5. breakdown &lt;- pairwiseRand(ref=clust.10, alt=clust.5, mode=&quot;ratio&quot;) pheatmap(breakdown, color=viridis::magma(100), cluster_rows=FALSE, cluster_cols=FALSE) Figure 5.11: ARI-based ratio for each pair of clusters in the reference Walktrap clustering compared to a higher-resolution alternative clustering for the PBMC dataset. Rows and columns of the heatmap represent clusters in the reference clustering. Each entry represents the proportion of pairs of cells involving the row/column clusters that retain the same status in the alternative clustering. 
5.4 Evaluating cluster stability A desirable property of a given clustering is that it is stable to perturbations to the input data (Von Luxburg 2010). Stable clusters are logistically convenient as small changes to upstream processing will not change the conclusions; greater stability also increases the likelihood that those conclusions can be reproduced in an independent replicate study. scran uses bootstrapping to evaluate the stability of a clustering algorithm on a given dataset - that is, cells are sampled with replacement to create a “bootstrap replicate” dataset, and clustering is repeated on this replicate to see if the same clusters can be reproduced. We demonstrate below for graph-based clustering on the PCs of the PBMC dataset. myClusterFUN &lt;- function(x) { g &lt;- bluster::makeSNNGraph(x, type=&quot;jaccard&quot;) igraph::cluster_louvain(g)$membership } pcs &lt;- reducedDim(sce.pbmc, &quot;PCA&quot;) originals &lt;- myClusterFUN(pcs) table(originals) # inspecting the cluster sizes. ## originals ## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ## 693 821 295 127 55 541 178 48 200 45 125 446 91 61 259 set.seed(0010010100) ratios &lt;- bootstrapStability(pcs, FUN=myClusterFUN, clusters=originals) dim(ratios) ## [1] 15 15 The function returns a matrix of ARI-derived ratios for every pair of original clusters in originals (Figure 5.12), averaged across bootstrap iterations. High ratios indicate that the clusterings in the bootstrap replicates are highly consistent with that of the original dataset. More specifically, high ratios on the diagonal indicate that cells in the same original cluster are still together in the bootstrap replicates, while high ratios off the diagonal indicate that cells in the corresponding cluster pair are still separated. 
pheatmap(ratios, cluster_row=FALSE, cluster_col=FALSE, color=viridis::magma(100), breaks=seq(-1, 1, length.out=101)) Figure 5.12: Heatmap of ARI-derived ratios from bootstrapping of graph-based clustering in the PBMC dataset. Each row and column represents an original cluster and each entry is colored according to the value of the ARI ratio between that pair of clusters. Bootstrapping is a general approach for evaluating cluster stability that is compatible with any clustering algorithm. The ARI-derived ratio between cluster pairs is also more informative than a single stability measure for all/each cluster as the former considers the relationships between clusters, e.g., unstable separation between \\(X\\) and \\(Y\\) does not penalize the stability of separation between \\(X\\) and another cluster \\(Z\\). However, one should take these metrics with a grain of salt, as bootstrapping only considers the effect of sampling noise and ignores other factors that affect reproducibility in an independent study (e.g., batch effects, donor variation). In addition, it is possible for a poor separation to be highly stable, so a highly stable cluster may not necessarily represent a distinct subpopulation. 5.5 Clustering parameter sweeps The clusterSweep() function allows us to quickly apply our clustering with a range of different parameters. For example, we can iterate across different combinations of k and community detection algorithms for graph-based clustering. We could then use linkClusters(), clustree or similar functions to visualize the relationships between different clusterings. # Parallelizing for some speed. 
out &lt;- clusterSweep(reducedDim(sce.pbmc, &quot;PCA&quot;), NNGraphParam(), k=as.integer(c(5, 10, 15, 20, 25, 30, 35, 40)), cluster.fun=c(&quot;louvain&quot;, &quot;walktrap&quot;, &quot;infomap&quot;), BPPARAM=BiocParallel::MulticoreParam(8)) colnames(out$clusters) ## [1] &quot;k.5_cluster.fun.louvain&quot; &quot;k.10_cluster.fun.louvain&quot; ## [3] &quot;k.15_cluster.fun.louvain&quot; &quot;k.20_cluster.fun.louvain&quot; ## [5] &quot;k.25_cluster.fun.louvain&quot; &quot;k.30_cluster.fun.louvain&quot; ## [7] &quot;k.35_cluster.fun.louvain&quot; &quot;k.40_cluster.fun.louvain&quot; ## [9] &quot;k.5_cluster.fun.walktrap&quot; &quot;k.10_cluster.fun.walktrap&quot; ## [11] &quot;k.15_cluster.fun.walktrap&quot; &quot;k.20_cluster.fun.walktrap&quot; ## [13] &quot;k.25_cluster.fun.walktrap&quot; &quot;k.30_cluster.fun.walktrap&quot; ## [15] &quot;k.35_cluster.fun.walktrap&quot; &quot;k.40_cluster.fun.walktrap&quot; ## [17] &quot;k.5_cluster.fun.infomap&quot; &quot;k.10_cluster.fun.infomap&quot; ## [19] &quot;k.15_cluster.fun.infomap&quot; &quot;k.20_cluster.fun.infomap&quot; ## [21] &quot;k.25_cluster.fun.infomap&quot; &quot;k.30_cluster.fun.infomap&quot; ## [23] &quot;k.35_cluster.fun.infomap&quot; &quot;k.40_cluster.fun.infomap&quot; table(out$clusters$k.5_cluster.fun.walktrap) ## ## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ## 287 523 125 508 45 172 244 283 446 847 97 152 38 18 63 38 30 16 15 9 ## 21 22 ## 16 13 Parameter sweeps are particularly useful when combined with one of the metrics described in Section 5.2. This provides a high-level summary of the behavior of the clustering method as the parameters change, as demonstrated in Figure 5.13. We can then make some decisions on which clustering(s) to use for further analyses. 
For example, we might choose a few clusterings that span several different resolutions, so as to obtain a greater diversity of summaries of the data; conversely, we might be able to save some time and effort by ignoring redundant clusterings with similar values for our metrics. df &lt;- as.data.frame(out$parameters) df$num.clusters &lt;- vapply(as.list(out$clusters), function(cluster) { length(unique(cluster)) }, 0L) all.sil &lt;- lapply(as.list(out$clusters), function(cluster) { sil &lt;- approxSilhouette(reducedDim(sce.pbmc), cluster) mean(sil$width) }) df$silhouette &lt;- unlist(all.sil) all.wcss &lt;- lapply(as.list(out$clusters), function(cluster) { sum(clusterRMSD(reducedDim(sce.pbmc), cluster, sum=TRUE), na.rm=TRUE) }) df$wcss &lt;- unlist(all.wcss) library(ggplot2) gridExtra::grid.arrange( ggplot(df, aes(x=k, y=num.clusters, group=cluster.fun, color=cluster.fun)) + geom_line(lwd=2) + scale_y_log10(), ggplot(df, aes(x=k, y=silhouette, group=cluster.fun, color=cluster.fun)) + geom_line(lwd=2), ggplot(df, aes(x=k, y=wcss, group=cluster.fun, color=cluster.fun)) + geom_line(lwd=2), ncol=3 ) Figure 5.13: Behavior of graph-based clustering as quantified by the number of clusters (left), silhouette width (middle) and the within-cluster sum of squares (right), in response to changes in the number of neighbors k and the community detection algorithm. We could even use the sweep to automatically choose the “best” clustering by optimizing one or more of these metrics. The simplest strategy is to maximize the silhouette width, though one can imagine more complex scores involving combinations of metrics. This approach is valid but any automatic choice should be treated as a suggestion rather than a rule. As previously discussed, the clustering at the optimal value of a metric may not be the most scientifically informative clustering, given that well-separated clusters typically correspond to cell types that are already known. 
5.6 Agglomerating graph communities Some community detection algorithms operate by agglomeration and thus can be used to construct a hierarchical dendrogram based on the pattern of merges between clusters. The dendrogram itself is not particularly informative as it simply describes the order of merge steps performed by the algorithm; unlike the dendrograms produced by hierarchical clustering (Basic Section 5.4), it does not capture the magnitude of differences between subpopulations. However, it does provide a convenient avenue for manually tuning the clustering resolution by generating nested clusterings using the cut_at() function, as shown below. community.walktrap &lt;- igraph::cluster_walktrap(g) table(igraph::cut_at(community.walktrap, n=5)) ## ## 1 2 3 4 5 ## 3546 221 125 46 47 table(igraph::cut_at(community.walktrap, n=20)) ## ## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ## 541 352 125 330 46 401 819 408 67 47 153 138 163 61 40 129 87 46 16 16 If cut_at()-like functionality is desired for non-hierarchical methods, bluster provides a mergeCommunities() function to retrospectively tune the clustering resolution. This function will greedily merge pairs of clusters until a specified number of clusters is achieved, where pairs are chosen to maximize the modularity at each merge step. community.louvain &lt;- igraph::cluster_louvain(g) table(community.louvain$membership) ## ## 1 2 3 4 5 6 7 8 9 10 11 12 13 ## 759 855 408 125 53 540 156 48 215 47 122 374 283 try(igraph::cut_at(community.louvain, n=10)) # Not hierarchical. 
## Error in igraph::cut_at(community.louvain, n = 10) : ## Not a hierarchical communitity structure merged &lt;- mergeCommunities(g, community.louvain$membership, number=10) table(merged) ## merged ## 1 2 3 6 7 9 10 11 12 13 ## 759 855 408 540 156 215 220 175 374 283 Session Info View session info R version 4.4.2 (2024-10-31) Platform: x86_64-pc-linux-gnu Running under: Ubuntu 24.04.1 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C time zone: America/New_York tzcode source: system (glibc) attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] clustree_0.5.1 ggraph_2.2.1 [3] pheatmap_1.0.12 ggplot2_3.5.1 [5] bluster_1.16.0 scran_1.34.0 [7] scuttle_1.16.0 SingleCellExperiment_1.28.1 [9] SummarizedExperiment_1.36.0 Biobase_2.66.0 [11] GenomicRanges_1.58.0 GenomeInfoDb_1.42.1 [13] IRanges_2.40.1 S4Vectors_0.44.0 [15] BiocGenerics_0.52.0 MatrixGenerics_1.18.1 [17] matrixStats_1.5.0 BiocStyle_2.34.0 [19] rebook_1.16.0 loaded via a namespace (and not attached): [1] gridExtra_2.3 CodeDepends_0.6.6 rlang_1.1.5 [4] magrittr_2.0.3 compiler_4.4.2 dir.expiry_1.14.0 [7] vctrs_0.6.5 pkgconfig_2.0.3 crayon_1.5.3 [10] fastmap_1.2.0 backports_1.5.0 XVector_0.46.0 [13] labeling_0.4.3 rmarkdown_2.29 graph_1.84.1 [16] UCSC.utils_1.2.0 ggbeeswarm_0.7.2 purrr_1.0.2 [19] xfun_0.50 zlibbioc_1.52.0 cachem_1.1.0 [22] beachmat_2.22.0 jsonlite_1.8.9 DelayedArray_0.32.0 [25] tweenr_2.0.3 BiocParallel_1.40.0 irlba_2.3.5.1 [28] parallel_4.4.2 cluster_2.1.8 R6_2.5.1 [31] bslib_0.8.0 RColorBrewer_1.1-3 limma_3.62.2 [34] jquerylib_0.1.4 Rcpp_1.0.14 bookdown_0.42 [37] knitr_1.49 Matrix_1.7-1 
igraph_2.1.3 [40] tidyselect_1.2.1 abind_1.4-8 yaml_2.3.10 [43] viridis_0.6.5 codetools_0.2-20 lattice_0.22-6 [46] tibble_3.2.1 withr_3.0.2 evaluate_1.0.3 [49] polyclip_1.10-7 pillar_1.10.1 BiocManager_1.30.25 [52] filelock_1.0.3 checkmate_2.3.2 generics_0.1.3 [55] munsell_0.5.1 scales_1.3.0 glue_1.8.0 [58] metapod_1.14.0 tools_4.4.2 BiocNeighbors_2.0.1 [61] ScaledMatrix_1.14.0 locfit_1.5-9.10 graphlayouts_1.2.1 [64] XML_3.99-0.18 tidygraph_1.3.1 grid_4.4.2 [67] tidyr_1.3.1 edgeR_4.4.1 colorspace_2.1-1 [70] GenomeInfoDbData_1.2.13 ggforce_0.4.2 beeswarm_0.4.0 [73] BiocSingular_1.22.0 vipor_0.4.7 cli_3.6.3 [76] rsvd_1.0.5 rappdirs_0.3.3 S4Arrays_1.6.0 [79] viridisLite_0.4.2 dplyr_1.1.4 gtable_0.3.6 [82] sass_0.4.9 digest_0.6.37 ggrepel_0.9.6 [85] SparseArray_1.6.1 dqrng_0.4.1 farver_2.1.2 [88] memoise_2.0.1 htmltools_0.5.8.1 lifecycle_1.0.4 [91] httr_1.4.7 statmod_1.5.0 MASS_7.3-64 References "],["marker-detection-redux.html", "Chapter 6 Marker detection, redux 6.1 Motivation 6.2 Properties of each effect size 6.3 Using custom DE methods 6.4 Invalidity of \\(p\\)-values 6.5 Further comments Session information", " Chapter 6 Marker detection, redux .rebook-collapse { background-color: #eee; color: #444; cursor: pointer; padding: 18px; width: 100%; border: none; text-align: left; outline: none; font-size: 15px; } .rebook-content { padding: 0 18px; display: none; overflow: hidden; background-color: #f1f1f1; } 6.1 Motivation Basic Chapter 6 described the basic process of marker detection with pairwise comparisons between clusters. Here, we describe some of the properties of each effect size, how to use custom DE methods, and a few justifications for the omission of \\(p\\)-values from the marker detection process. 6.2 Properties of each effect size One of the AUC’s advantages is its robustness to the shape of the distribution of expression values within each cluster. 
A gene is not penalized for having a large variance so long as it does not compromise the separation between clusters. This property is demonstrated in Figure 6.1, where the AUC is not affected by the presence of an outlier subpopulation in cluster 1. By comparison, Cohen’s \\(d\\) decreases in the second gene, despite the fact that the outlier subpopulation does not affect the interpretation of the difference between clusters. Figure 6.1: Distribution of log-expression values for two simulated genes in a pairwise comparison between clusters, in the scenario where the second gene is highly expressed in a subpopulation of cluster 1. On the other hand, Cohen’s \\(d\\) accounts for the magnitude of the change in expression. All else being equal, a gene with a larger log-fold change will have a larger Cohen’s \\(d\\) and be prioritized during marker ranking. By comparison, the relationship between the AUC and the log-fold change is less direct. A larger log-fold change implies stronger separation and will usually lead to a larger AUC, but only up to a point - two perfectly separated distributions will have an AUC of 1 regardless of the difference in means (Figure 6.2). This reduces the resolution of the ranking and makes it more difficult to distinguish between good and very good markers. Figure 6.2: Distribution of log-expression values for two simulated genes in a pairwise comparison between clusters, in the scenario where both genes are upregulated in cluster 1 but by different magnitudes. The log-fold change in the detected proportions is specifically designed to look for on/off changes in expression patterns. It is relatively stringent compared to the AUC and Cohen’s \\(d\\), which can lead to the loss of good candidate markers in general applications. For example, GCG is a known marker for pancreatic alpha cells but is expressed in almost every other cell of the Lawlor et al. 
(2017) pancreas data (Figure 6.3) and would not be highly ranked with logFC.detected. View set-up code (Workflow Chapter 7) #--- loading ---# library(scRNAseq) sce.lawlor &lt;- LawlorPancreasData() #--- gene-annotation ---# library(AnnotationHub) edb &lt;- AnnotationHub()[[&quot;AH73881&quot;]] anno &lt;- select(edb, keys=rownames(sce.lawlor), keytype=&quot;GENEID&quot;, columns=c(&quot;SYMBOL&quot;, &quot;SEQNAME&quot;)) rowData(sce.lawlor) &lt;- anno[match(rownames(sce.lawlor), anno[,1]),-1] #--- quality-control ---# library(scater) stats &lt;- perCellQCMetrics(sce.lawlor, subsets=list(Mito=which(rowData(sce.lawlor)$SEQNAME==&quot;MT&quot;))) qc &lt;- quickPerCellQC(stats, percent_subsets=&quot;subsets_Mito_percent&quot;, batch=sce.lawlor$`islet unos id`) sce.lawlor &lt;- sce.lawlor[,!qc$discard] #--- normalization ---# library(scran) set.seed(1000) clusters &lt;- quickCluster(sce.lawlor) sce.lawlor &lt;- computeSumFactors(sce.lawlor, clusters=clusters) sce.lawlor &lt;- logNormCounts(sce.lawlor) plotExpression(sce.lawlor, x=&quot;cell type&quot;, features=&quot;ENSG00000115263&quot;) Figure 6.3: Distribution of log-normalized expression values for GCG across different pancreatic cell types in the Lawlor pancreas data. All of these effect sizes have different interactions with log-normalization. For Cohen’s \\(d\\) on the log-expression values, we are directly subject to the effects described in A. Lun (2018), which can lead to spurious differences between groups. Similarly, for the AUC, we can obtain unexpected results due to the fact that normalization only equalizes the means of two distributions and not their shape. The log-fold change in the detected proportions is completely unresponsive to scaling normalization, as a zero remains so after any scaling. 
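A minimal sketch of this invariance on simulated counts (the pseudo-count here is an illustrative assumption, not scran's exact definition):

```r
# The detected-proportion log-fold change is unaffected by scaling
# normalization, because a zero count remains zero after division by
# any size factor. The 0.05 pseudo-count is an illustrative choice.
counts1 <- c(0, 0, 4, 7, 2, 0, 1, 5)  # one gene's counts in cluster 1
counts2 <- c(0, 0, 0, 1, 0, 0, 0, 0)  # the same gene in cluster 2

lfc.det <- function(a, b) log2((mean(a > 0) + 0.05) / (mean(b > 0) + 0.05))
lfc.det(counts1, counts2)
lfc.det(counts1 / 2, counts2 / 5)  # identical value under any scaling
```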
However, this is not necessarily problematic for marker gene detection - users can interpret this effect as retaining information about the total RNA content, analogous to spike-in normalization in Basic Section 2.4. 6.3 Using custom DE methods We can also detect marker genes from precomputed DE statistics, allowing us to take advantage of more sophisticated tests in other Bioconductor packages such as edgeR and DESeq2. This functionality is not commonly used - see below for an explanation - but nonetheless, we will demonstrate how one would go about applying it to the PBMC dataset. Our strategy is to loop through each pair of clusters, performing a more-or-less standard DE analysis between pairs using the voom() approach from the limma package (Law et al. 2014). (Specifically, we use the TREAT strategy (McCarthy and Smyth 2009) to test for log-fold changes that are significantly greater than 0.5.) View set-up code (Workflow Chapter 3) #--- loading ---# library(DropletTestFiles) raw.path &lt;- getTestFile(&quot;tenx-2.1.0-pbmc4k/1.0.0/raw.tar.gz&quot;) out.path &lt;- file.path(tempdir(), &quot;pbmc4k&quot;) untar(raw.path, exdir=out.path) library(DropletUtils) fname &lt;- file.path(out.path, &quot;raw_gene_bc_matrices/GRCh38&quot;) sce.pbmc &lt;- read10xCounts(fname, col.names=TRUE) #--- gene-annotation ---# library(scater) rownames(sce.pbmc) &lt;- uniquifyFeatureNames( rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol) library(EnsDb.Hsapiens.v86) location &lt;- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce.pbmc)$ID, column=&quot;SEQNAME&quot;, keytype=&quot;GENEID&quot;) #--- cell-detection ---# set.seed(100) e.out &lt;- emptyDrops(counts(sce.pbmc)) sce.pbmc &lt;- sce.pbmc[,which(e.out$FDR &lt;= 0.001)] #--- quality-control ---# stats &lt;- perCellQCMetrics(sce.pbmc, subsets=list(Mito=which(location==&quot;MT&quot;))) high.mito &lt;- isOutlier(stats$subsets_Mito_percent, type=&quot;higher&quot;) sce.pbmc &lt;- sce.pbmc[,!high.mito] #--- normalization ---# 
library(scran) set.seed(1000) clusters &lt;- quickCluster(sce.pbmc) sce.pbmc &lt;- computeSumFactors(sce.pbmc, cluster=clusters) sce.pbmc &lt;- logNormCounts(sce.pbmc) #--- variance-modelling ---# set.seed(1001) dec.pbmc &lt;- modelGeneVarByPoisson(sce.pbmc) top.pbmc &lt;- getTopHVGs(dec.pbmc, prop=0.1) #--- dimensionality-reduction ---# set.seed(10000) sce.pbmc &lt;- denoisePCA(sce.pbmc, subset.row=top.pbmc, technical=dec.pbmc) set.seed(100000) sce.pbmc &lt;- runTSNE(sce.pbmc, dimred=&quot;PCA&quot;) set.seed(1000000) sce.pbmc &lt;- runUMAP(sce.pbmc, dimred=&quot;PCA&quot;) #--- clustering ---# g &lt;- buildSNNGraph(sce.pbmc, k=10, use.dimred = &#39;PCA&#39;) clust &lt;- igraph::cluster_walktrap(g)$membership colLabels(sce.pbmc) &lt;- factor(clust) library(limma) dge &lt;- convertTo(sce.pbmc) uclust &lt;- unique(dge$samples$label) all.results &lt;- all.pairs &lt;- list() counter &lt;- 1L for (x in uclust) { for (y in uclust) { if (x==y) break # avoid redundant comparisons. # Factor ordering ensures that &#39;x&#39; is not the intercept, # so resulting fold changes can be interpreted as x/y. subdge &lt;- dge[,dge$samples$label %in% c(x, y)] subdge$samples$label &lt;- factor(subdge$samples$label, c(y, x)) design &lt;- model.matrix(~label, subdge$samples) # No need to normalize as we are using the size factors # transferred from &#39;sce.pbmc&#39; and converted to norm.factors. # We also relax the filtering for the lower UMI counts. subdge &lt;- subdge[calculateAverage(subdge$counts) &gt; 0.1,] # Standard voom-limma pipeline starts here. v &lt;- voom(subdge, design) fit &lt;- lmFit(v, design) fit &lt;- treat(fit, lfc=0.5) res &lt;- topTreat(fit, n=Inf, sort.by=&quot;none&quot;) # Filling out the genes that got filtered out with NA&#39;s. res &lt;- res[rownames(dge),] rownames(res) &lt;- rownames(dge) all.results[[counter]] &lt;- res all.pairs[[counter]] &lt;- c(x, y) counter &lt;- counter+1L # Also filling the reverse comparison. 
res$logFC &lt;- -res$logFC all.results[[counter]] &lt;- res all.pairs[[counter]] &lt;- c(y, x) counter &lt;- counter+1L } } For each comparison, we store the corresponding data frame of statistics in all.results, along with the identities of the clusters involved in all.pairs. We consolidate the pairwise DE statistics into a single marker list for each cluster with the combineMarkers() function, yielding a per-cluster DataFrame that can be interpreted in the same manner as discussed previously. We can also specify pval.type= and direction= to control the consolidation procedure, e.g., setting pval.type=\"all\" and direction=\"up\" will prioritize genes that are significantly upregulated in each cluster against all other clusters. all.pairs &lt;- do.call(rbind, all.pairs) combined &lt;- combineMarkers(all.results, all.pairs, pval.field=&quot;P.Value&quot;) # Inspecting the results for one of the clusters. interesting.voom &lt;- combined[[&quot;1&quot;]] colnames(interesting.voom) ## [1] &quot;Top&quot; &quot;p.value&quot; &quot;FDR&quot; &quot;summary.logFC&quot; ## [5] &quot;logFC.2&quot; &quot;logFC.9&quot; &quot;logFC.3&quot; &quot;logFC.7&quot; ## [9] &quot;logFC.4&quot; &quot;logFC.5&quot; &quot;logFC.11&quot; &quot;logFC.10&quot; ## [13] &quot;logFC.8&quot; &quot;logFC.14&quot; &quot;logFC.6&quot; &quot;logFC.12&quot; ## [17] &quot;logFC.13&quot; &quot;logFC.15&quot; head(interesting.voom[,1:4]) ## DataFrame with 6 rows and 4 columns ## Top p.value FDR summary.logFC ## &lt;integer&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## FO538757.2 1 0 0 7.21233 ## NOC2L 1 0 0 7.14916 ## RPL22 1 0 0 10.52793 ## RPL11 1 0 0 11.97721 ## ISG15 2 0 0 7.62768 ## PARK7 2 0 0 8.01179 We do not routinely use custom DE methods to perform marker detection, for several reasons. Many of these methods rely on empirical Bayes shrinkage to share information across genes in the presence of limited replication. 
This is unnecessary when there are large numbers of “replicate” cells in each group, and does nothing to solve the fundamental \\(n=1\\) problem in these comparisons (Section 6.4.2). These methods also make stronger assumptions about the data (e.g., equal variances for linear models, the distribution of variances during empirical Bayes) that are more likely to be violated in noisy scRNA-seq contexts. From a practical perspective, they require more work to set up and take more time to run. That said, some custom methods (e.g., MAST) may provide a useful point of difference from the simpler tests, in which case they can be converted into a marker detection scheme by modifying the above code. Indeed, the same code chunk can be directly applied (after switching back to the standard filtering and normalization steps inside the loop) to bulk RNA-seq experiments involving a large number of different conditions. This allows us to recycle the scran machinery to consolidate results across many pairwise comparisons for easier interpretation. 6.4 Invalidity of \\(p\\)-values 6.4.1 From data snooping Given that scoreMarkers() already reports effect sizes, it is tempting to take the next step and obtain \\(p\\)-values for the pairwise comparisons. Unfortunately, the \\(p\\)-values from the relevant tests cannot be reliably used to reject the null hypothesis. This is because DE analysis is performed on the same data used to obtain the clusters, which represents “data dredging” (also known as fishing or data snooping). The hypothesis of interest - are there differences between clusters? - is formulated from the data, so we are more likely to get a positive result when we re-use the data set to test that hypothesis. The practical effect of data dredging is best illustrated with a simple simulation. We simulate i.i.d. normal values, perform \\(k\\)-means clustering and test for DE between clusters of cells with pairwiseTTests(). 
The resulting distribution of \\(p\\)-values is heavily skewed towards low values (Figure 6.4). Thus, we can detect “significant” differences between clusters even in the absence of any real substructure in the data. This effect arises from the fact that clustering, by definition, yields groups of cells that are separated in expression space. Testing for DE genes between clusters will inevitably yield some significant results as that is how the clusters were defined. library(scran) set.seed(0) y &lt;- matrix(rnorm(100000), ncol=200) clusters &lt;- kmeans(t(y), centers=2)$cluster out &lt;- pairwiseTTests(y, clusters) hist(out$statistics[[1]]$p.value, col=&quot;grey80&quot;, xlab=&quot;p-value&quot;, main=&quot;&quot;) Figure 6.4: Distribution of \\(p\\)-values from a DE analysis between two clusters in a simulation with no true subpopulation structure. For marker gene detection, this effect is largely harmless as the \\(p\\)-values are used only for ranking. However, it becomes an issue when the \\(p\\)-values are used to claim some statistically significant separation between clusters. Indeed, the concept of statistical significance has no obvious meaning if the clusters are empirical and cannot be stably reproduced across replicate experiments. 6.4.2 Nature of replication The naive application of DE analysis methods will treat counts from the same cluster of cells as replicate observations. This is not the most relevant level of replication when cells are derived from the same biological sample (i.e., cell culture, animal or patient). DE analyses that treat cells as replicates fail to properly model the sample-to-sample variability (A. T. L. Lun and Marioni 2017). The latter is arguably the more important level of replication as different samples will necessarily be generated if the experiment is to be replicated. Indeed, the use of cells as replicates only masks the fact that the sample size is actually one in an experiment involving a single biological sample. 
This reinforces the inappropriateness of using the marker gene \\(p\\)-values to perform statistical inference. Once subpopulations are identified, it is prudent to select some markers for use in validation studies with an independent replicate population of cells. A typical strategy is to identify a corresponding subset of cells that express the upregulated markers and do not express the downregulated markers. Ideally, a different technique for quantifying expression would also be used during validation, e.g., fluorescent in situ hybridisation or quantitative PCR. This confirms that the subpopulation genuinely exists and is not an artifact of the scRNA-seq protocol or the computational analysis. 6.5 Further comments One consequence of the DE analysis strategy is that markers are defined relative to subpopulations in the same dataset. Biologically meaningful genes will not be detected if they are expressed uniformly throughout the population, e.g., T cell markers will not be detected if only T cells are present in the dataset. In practice, this is usually only a problem when the experimental data are provided without any biological context - certainly, we would hope to have some a priori idea about what cells have been captured. For most applications, it is actually desirable to avoid detecting such genes as we are interested in characterizing heterogeneity within the context of a known cell population. Continuing from the example above, the failure to detect T cell markers is of little consequence if we already know we are working with T cells. Nonetheless, if “absolute” identification of cell types is desired, some strategies for doing so are described in Basic Chapter 7. Alternatively, marker detection can be performed by treating gene expression as a predictor variable for cluster assignment. 
For a pair of clusters, we can find genes that discriminate between them by performing inference with a logistic model where the outcome for each cell is whether it was assigned to the first cluster and the lone predictor is the expression of each gene. Treating the cluster assignment as the dependent variable is more philosophically pleasing in some sense, as the clusters are indeed defined from the expression data rather than being known in advance. (Note that this does not solve the data snooping problem.) In practice, this approach effectively does the same task as a Wilcoxon rank sum test in terms of quantifying separation between clusters. Logistic models have the advantage in that they can easily be extended to block on multiple nuisance variables, though this is not typically necessary in most use cases. Even more complex strategies use machine learning methods to determine which features contribute most to successful cluster classification, but this is probably unnecessary for routine analyses. 
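To make the logistic approach concrete: for each gene, we would model the probability of membership in the first cluster as a function of its expression. The snippet below is an illustrative sketch only, assuming the sce.pbmc object and cluster labels from the earlier set-up code, with clusters "1" and "2" chosen arbitrarily.

```r
# Illustrative sketch: per-gene logistic regression of cluster membership
# on log-expression, for an arbitrary pair of clusters "1" and "2".
keep <- colLabels(sce.pbmc) %in% c("1", "2")
in.first <- as.integer(colLabels(sce.pbmc)[keep] == "1")
expr <- as.matrix(logcounts(sce.pbmc)[, keep])

# Fitting one model per gene (first 100 genes only, for brevity) and
# extracting the Wald p-value for the expression coefficient.
pvals <- apply(expr[1:100, ], 1, function(x) {
    fit <- glm(in.first ~ x, family=binomial)
    summary(fit)$coefficients["x", "Pr(>|z|)"]
})
head(sort(pvals))
```

Note that strong markers will often cause perfect or near-perfect separation, triggering convergence warnings from glm(); in a sense, those warnings are themselves evidence of separation between the clusters.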
Session information sessionInfo() ## R version 4.4.2 (2024-10-31) ## Platform: x86_64-pc-linux-gnu ## Running under: Ubuntu 24.04.1 LTS ## ## Matrix products: default ## BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so ## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 ## ## locale: ## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C ## [3] LC_TIME=en_GB LC_COLLATE=C ## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 ## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C ## [9] LC_ADDRESS=C LC_TELEPHONE=C ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C ## ## time zone: America/New_York ## tzcode source: system (glibc) ## ## attached base packages: ## [1] stats4 stats graphics grDevices utils datasets methods ## [8] base ## ## other attached packages: ## [1] limma_3.62.2 scran_1.34.0 ## [3] scater_1.34.0 ggplot2_3.5.1 ## [5] scuttle_1.16.0 SingleCellExperiment_1.28.1 ## [7] SummarizedExperiment_1.36.0 Biobase_2.66.0 ## [9] GenomicRanges_1.58.0 GenomeInfoDb_1.42.1 ## [11] IRanges_2.40.1 S4Vectors_0.44.0 ## [13] BiocGenerics_0.52.0 MatrixGenerics_1.18.1 ## [15] matrixStats_1.5.0 BiocStyle_2.34.0 ## [17] rebook_1.16.0 ## ## loaded via a namespace (and not attached): ## [1] tidyselect_1.2.1 viridisLite_0.4.2 farver_2.1.2 ## [4] dplyr_1.1.4 vipor_0.4.7 filelock_1.0.3 ## [7] viridis_0.6.5 fastmap_1.2.0 bluster_1.16.0 ## [10] XML_3.99-0.18 digest_0.6.37 rsvd_1.0.5 ## [13] lifecycle_1.0.4 cluster_2.1.8 statmod_1.5.0 ## [16] magrittr_2.0.3 compiler_4.4.2 rlang_1.1.5 ## [19] sass_0.4.9 tools_4.4.2 igraph_2.1.3 ## [22] yaml_2.3.10 knitr_1.49 labeling_0.4.3 ## [25] dqrng_0.4.1 S4Arrays_1.6.0 DelayedArray_0.32.0 ## [28] abind_1.4-8 BiocParallel_1.40.0 withr_3.0.2 ## [31] CodeDepends_0.6.6 grid_4.4.2 beachmat_2.22.0 ## [34] colorspace_2.1-1 edgeR_4.4.1 scales_1.3.0 ## [37] cli_3.6.3 rmarkdown_2.29 crayon_1.5.3 ## [40] generics_0.1.3 metapod_1.14.0 httr_1.4.7 ## [43] ggbeeswarm_0.7.2 cachem_1.1.0 zlibbioc_1.52.0 ## [46] parallel_4.4.2 BiocManager_1.30.25 XVector_0.46.0 ## [49] 
vctrs_0.6.5 Matrix_1.7-1 jsonlite_1.8.9 ## [52] dir.expiry_1.14.0 bookdown_0.42 BiocSingular_1.22.0 ## [55] BiocNeighbors_2.0.1 ggrepel_0.9.6 irlba_2.3.5.1 ## [58] beeswarm_0.4.0 locfit_1.5-9.10 jquerylib_0.1.4 ## [61] glue_1.8.0 codetools_0.2-20 cowplot_1.1.3 ## [64] gtable_0.3.6 UCSC.utils_1.2.0 ScaledMatrix_1.14.0 ## [67] munsell_0.5.1 tibble_3.2.1 pillar_1.10.1 ## [70] rappdirs_0.3.3 htmltools_0.5.8.1 graph_1.84.1 ## [73] GenomeInfoDbData_1.2.13 R6_2.5.1 evaluate_1.0.3 ## [76] lattice_0.22-6 bslib_0.8.0 Rcpp_1.0.14 ## [79] gridExtra_2.3 SparseArray_1.6.1 xfun_0.50 ## [82] pkgconfig_2.0.3 References "],["droplet-processing.html", "Chapter 7 Droplet processing 7.1 Motivation 7.2 Calling cells from empty droplets 7.3 Removing ambient contamination 7.4 Demultiplexing cell hashes 7.5 Removing swapped molecules Session Info", " Chapter 7 Droplet processing .rebook-collapse { background-color: #eee; color: #444; cursor: pointer; padding: 18px; width: 100%; border: none; text-align: left; outline: none; font-size: 15px; } .rebook-content { padding: 0 18px; display: none; overflow: hidden; background-color: #f1f1f1; } 7.1 Motivation Droplet-based single-cell protocols aim to isolate each cell inside its own droplet in a water-in-oil emulsion, such that each droplet serves as a miniature reaction chamber for highly multiplexed library preparation (Macosko et al. 2015; Klein et al. 2015). Upon sequencing, reads are assigned to individual cells based on the presence of droplet-specific barcodes. This enables a massive increase in the number of cells that can be processed in typical scRNA-seq experiments, contributing to the dominance1 of technologies such as the 10X Genomics platform (Zheng et al. 2017). However, as the allocation of cells to droplets is not known in advance, the data analysis requires some special steps to determine what each droplet actually contains. 
This chapter explores some of the more common preprocessing procedures that might be applied to the count matrices generated from droplet protocols. 7.2 Calling cells from empty droplets 7.2.1 Background A unique aspect of droplet-based data is that we have no prior knowledge about whether a particular library (i.e., cell barcode) corresponds to cell-containing or empty droplets. Thus, we need to call cells from empty droplets based on the observed expression profiles. This is not entirely straightforward as empty droplets can contain ambient (i.e., extracellular) RNA that can be captured and sequenced, resulting in non-zero counts for libraries that do not contain any cells. To demonstrate, we obtain the unfiltered count matrix for the PBMC dataset from 10X Genomics. View set-up code (Workflow Chapter 3) #--- loading ---# library(DropletTestFiles) raw.path &lt;- getTestFile(&quot;tenx-2.1.0-pbmc4k/1.0.0/raw.tar.gz&quot;) out.path &lt;- file.path(tempdir(), &quot;pbmc4k&quot;) untar(raw.path, exdir=out.path) library(DropletUtils) fname &lt;- file.path(out.path, &quot;raw_gene_bc_matrices/GRCh38&quot;) sce.pbmc &lt;- read10xCounts(fname, col.names=TRUE) sce.pbmc ## class: SingleCellExperiment ## dim: 33694 737280 ## metadata(1): Samples ## assays(1): counts ## rownames(33694): ENSG00000243485 ENSG00000237613 ... ENSG00000277475 ## ENSG00000268674 ## rowData names(2): ID Symbol ## colnames(737280): AAACCTGAGAAACCAT-1 AAACCTGAGAAACCGC-1 ... ## TTTGTCATCTTTAGTC-1 TTTGTCATCTTTCCTC-1 ## colData names(2): Sample Barcode ## reducedDimNames(0): ## mainExpName: NULL ## altExpNames(0): The distribution of total counts exhibits a sharp transition between barcodes with large and small total counts (Figure 7.1), probably corresponding to cell-containing and empty droplets respectively. A simple approach would be to apply a threshold on the total count to only retain those barcodes with large totals. 
However, this unnecessarily discards libraries derived from cell types with low RNA content. library(DropletUtils) bcrank &lt;- barcodeRanks(counts(sce.pbmc)) # Only showing unique points for plotting speed. uniq &lt;- !duplicated(bcrank$rank) plot(bcrank$rank[uniq], bcrank$total[uniq], log=&quot;xy&quot;, xlab=&quot;Rank&quot;, ylab=&quot;Total UMI count&quot;, cex.lab=1.2) abline(h=metadata(bcrank)$inflection, col=&quot;darkgreen&quot;, lty=2) abline(h=metadata(bcrank)$knee, col=&quot;dodgerblue&quot;, lty=2) legend(&quot;bottomleft&quot;, legend=c(&quot;Inflection&quot;, &quot;Knee&quot;), col=c(&quot;darkgreen&quot;, &quot;dodgerblue&quot;), lty=2, cex=1.2) Figure 7.1: Total UMI count for each barcode in the PBMC dataset, plotted against its rank (in decreasing order of total counts). The inferred locations of the inflection and knee points are also shown. 7.2.2 Testing for empty droplets We use the emptyDrops() function to test whether the expression profile for each cell barcode is significantly different from the ambient RNA pool (Lun et al. 2019). Any significant deviation indicates that the barcode corresponds to a cell-containing droplet. This allows us to discriminate between well-sequenced empty droplets and droplets derived from cells with little RNA, both of which would have similar total counts in Figure 7.1. We call cells at a false discovery rate (FDR) of 0.1%, meaning that no more than 0.1% of our called barcodes should be empty droplets on average. # emptyDrops performs Monte Carlo simulations to compute p-values, # so we need to set the seed to obtain reproducible results. set.seed(100) e.out &lt;- emptyDrops(counts(sce.pbmc)) # See ?emptyDrops for an explanation of why there are NA values. summary(e.out$FDR &lt;= 0.001) ## Mode FALSE TRUE NA&#39;s ## logical 989 4300 731991 emptyDrops() uses Monte Carlo simulations to compute \\(p\\)-values for the multinomial sampling of transcripts from the ambient pool. 
The number of Monte Carlo iterations determines the lower bound for the \\(p\\)-values (Phipson and Smyth 2010). The Limited field in the output indicates whether or not the computed \\(p\\)-value for a particular barcode is bounded by the number of iterations. If any non-significant barcodes are TRUE for Limited, we may need to increase the number of iterations. A larger number of iterations will result in a lower \\(p\\)-value for these barcodes, which may allow them to be detected after correcting for multiple testing. table(Sig=e.out$FDR &lt;= 0.001, Limited=e.out$Limited) ## Limited ## Sig FALSE TRUE ## FALSE 989 0 ## TRUE 1728 2572 As mentioned above, emptyDrops() assumes that barcodes with low total UMI counts are empty droplets. Thus, the null hypothesis should be true for all of these barcodes. We can check whether the hypothesis testing procedure holds its size by examining the distribution of \\(p\\)-values for low-total barcodes with test.ambient=TRUE. Ideally, the distribution should be close to uniform (Figure 7.2). Large peaks near zero indicate that barcodes with total counts below lower are not all ambient in origin. This can be resolved by decreasing lower further to ensure that barcodes corresponding to droplets with very small cells are not used to estimate the ambient profile. set.seed(100) limit &lt;- 100 all.out &lt;- emptyDrops(counts(sce.pbmc), lower=limit, test.ambient=TRUE) hist(all.out$PValue[all.out$Total &lt;= limit &amp; all.out$Total &gt; 0], xlab=&quot;P-value&quot;, main=&quot;&quot;, col=&quot;grey80&quot;) Figure 7.2: Distribution of \\(p\\)-values for the assumed empty droplets. Once we are satisfied with the performance of emptyDrops(), we subset our SingleCellExperiment object to retain only the detected cells. Discerning readers will notice the use of which(), which conveniently removes the NAs prior to the subsetting. 
sce.pbmc &lt;- sce.pbmc[,which(e.out$FDR &lt;= 0.001)] It usually only makes sense to call cells using a count matrix involving libraries from a single sample. The composition of transcripts in the ambient solution will usually vary between samples, so the same ambient profile cannot be reused. If multiple samples are present in a dataset, their counts should only be combined after cell calling is performed on each matrix. 7.2.3 Relationship with other QC metrics While emptyDrops() will distinguish cells from empty droplets, it makes no statement about the quality of the cells. It is entirely possible for droplets to contain damaged or dying cells, which need to be removed prior to downstream analysis. This is achieved using the same outlier-based strategy described in Basic Section 1.3.2. Filtering on the mitochondrial proportion provides the most additional benefit in this situation, provided that we check that we are not removing a subpopulation of metabolically active cells (Figure 7.3). library(scuttle) is.mito &lt;- grep(&quot;^MT-&quot;, rowData(sce.pbmc)$Symbol) pbmc.qc &lt;- perCellQCMetrics(sce.pbmc, subsets=list(MT=is.mito)) discard.mito &lt;- isOutlier(pbmc.qc$subsets_MT_percent, type=&quot;higher&quot;) summary(discard.mito) ## Mode FALSE TRUE ## logical 3985 315 plot(pbmc.qc$sum, pbmc.qc$subsets_MT_percent, log=&quot;x&quot;, xlab=&quot;Total count&quot;, ylab=&#39;Mitochondrial %&#39;) abline(h=attr(discard.mito, &quot;thresholds&quot;)[&quot;higher&quot;], col=&quot;red&quot;) Figure 7.3: Percentage of reads assigned to mitochondrial transcripts, plotted against the library size. The red line represents the upper threshold used for QC filtering. emptyDrops() already removes cells with very low library sizes or (by association) low numbers of expressed genes. Thus, further filtering on these metrics is not strictly necessary. 
It may still be desirable to filter on both of these metrics to remove non-empty droplets containing cell fragments or stripped nuclei that were not caught by the mitochondrial filter. However, this should be weighed against the risk of losing genuine cell types as discussed in Section 1.3. Note that CellRanger version 3 automatically performs cell calling using an algorithm similar to emptyDrops(). If we had started our analysis with the filtered count matrix, we could go straight to computing other QC metrics. We would not need to run emptyDrops() manually as shown here, and indeed, attempting to do so would lead to nonsensical results if not outright software errors. Nonetheless, it may still be desirable to load the unfiltered matrix and apply emptyDrops() ourselves, on occasions where more detailed inspection or control of the cell-calling statistics is desired. 7.3 Removing ambient contamination For routine analyses, there is usually no need to remove the ambient contamination from each library. A consistent level of contamination across the dataset does not introduce much spurious heterogeneity, so dimensionality reduction and clustering on the original (log-)expression matrix remain valid. For genes that are highly abundant in the ambient solution, we can expect some loss of signal due to shrinkage of the log-fold changes between clusters towards zero, but this effect should be negligible for any genes that are so strongly upregulated that they are able to contribute to the ambient solution in the first place. This suggests that ambient removal can generally be omitted from most analyses, though we will describe it here regardless as it can be useful in specific situations. Effective removal of ambient contamination involves tackling a number of issues. 
We need to know how much contamination is present in each cell, which usually requires some prior biological knowledge about genes that should not be expressed in the dataset (e.g., mitochondrial genes in single-nuclei datasets, see Section 11.4) or genes with mutually exclusive expression profiles (Young and Behjati 2018). Those same genes must be highly abundant in the ambient solution to have enough counts in each cell for precise estimation of the scale of the contamination. The actual subtraction of the ambient contribution also must be done in a manner that respects the mean-variance relationship of the count data. Unfortunately, these issues are difficult to address for single-cell data due to the imprecision of low counts. Rather than attempting to remove contamination from individual cells, a more measured approach is to operate on clusters of related cells. The removeAmbience() function from DropletUtils will remove the contamination from the cluster-level profiles and propagate the effect of those changes back to the individual cells. Specifically, given a count matrix for a single sample and its associated ambient profile, removeAmbience() will: Aggregate counts in each cluster to obtain an average profile per cluster. Estimate the contamination proportion in each cluster with maximumAmbience() (see Multi-sample Chapter 5). This has the useful property of not requiring any prior knowledge of control or mutually exclusive expression profiles, albeit at the cost of some statistical rigor. Subtract the estimated contamination from the cluster-level average. Perform quantile-quantile mapping of each individual cell’s counts from the old average to the new subtracted average. This preserves the mean-variance relationship while yielding corrected single-cell profiles. We demonstrate this process on our PBMC dataset below. 
View set-up code (Workflow Chapter 3) #--- loading ---# library(DropletTestFiles) raw.path &lt;- getTestFile(&quot;tenx-2.1.0-pbmc4k/1.0.0/raw.tar.gz&quot;) out.path &lt;- file.path(tempdir(), &quot;pbmc4k&quot;) untar(raw.path, exdir=out.path) library(DropletUtils) fname &lt;- file.path(out.path, &quot;raw_gene_bc_matrices/GRCh38&quot;) sce.pbmc &lt;- read10xCounts(fname, col.names=TRUE) #--- gene-annotation ---# library(scater) rownames(sce.pbmc) &lt;- uniquifyFeatureNames( rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol) library(EnsDb.Hsapiens.v86) location &lt;- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce.pbmc)$ID, column=&quot;SEQNAME&quot;, keytype=&quot;GENEID&quot;) #--- cell-detection ---# set.seed(100) e.out &lt;- emptyDrops(counts(sce.pbmc)) sce.pbmc &lt;- sce.pbmc[,which(e.out$FDR &lt;= 0.001)] #--- quality-control ---# stats &lt;- perCellQCMetrics(sce.pbmc, subsets=list(Mito=which(location==&quot;MT&quot;))) high.mito &lt;- isOutlier(stats$subsets_Mito_percent, type=&quot;higher&quot;) sce.pbmc &lt;- sce.pbmc[,!high.mito] #--- normalization ---# library(scran) set.seed(1000) clusters &lt;- quickCluster(sce.pbmc) sce.pbmc &lt;- computeSumFactors(sce.pbmc, cluster=clusters) sce.pbmc &lt;- logNormCounts(sce.pbmc) #--- variance-modelling ---# set.seed(1001) dec.pbmc &lt;- modelGeneVarByPoisson(sce.pbmc) top.pbmc &lt;- getTopHVGs(dec.pbmc, prop=0.1) #--- dimensionality-reduction ---# set.seed(10000) sce.pbmc &lt;- denoisePCA(sce.pbmc, subset.row=top.pbmc, technical=dec.pbmc) set.seed(100000) sce.pbmc &lt;- runTSNE(sce.pbmc, dimred=&quot;PCA&quot;) set.seed(1000000) sce.pbmc &lt;- runUMAP(sce.pbmc, dimred=&quot;PCA&quot;) #--- clustering ---# g &lt;- buildSNNGraph(sce.pbmc, k=10, use.dimred = &#39;PCA&#39;) clust &lt;- igraph::cluster_walktrap(g)$membership colLabels(sce.pbmc) &lt;- factor(clust) # Not all genes are reported in the ambient profile from emptyDrops, # as genes with counts of zero across all droplets are just removed. 
# So for convenience, we will restrict our analysis to genes with # non-zero counts in at least one droplet (empty or otherwise). amb &lt;- metadata(e.out)$ambient[,1] stripped &lt;- sce.pbmc[names(amb),] out &lt;- removeAmbience(counts(stripped), ambient=amb, groups=colLabels(stripped)) dim(out) ## [1] 20112 3985 We can visualize the effects of ambient removal on a gene like IGKC, which presumably should only be expressed in the B cell lineage. This gene has some level of expression in each cluster in the original dataset but is “zeroed” in most clusters after removal (Figure 7.4). library(scater) counts(stripped, withDimnames=FALSE) &lt;- out stripped &lt;- logNormCounts(stripped) gridExtra::grid.arrange( plotExpression(sce.pbmc, x=&quot;label&quot;, colour_by=&quot;label&quot;, features=&quot;IGKC&quot;) + ggtitle(&quot;Before&quot;), plotExpression(stripped, x=&quot;label&quot;, colour_by=&quot;label&quot;, features=&quot;IGKC&quot;) + ggtitle(&quot;After&quot;), ncol=2 ) Figure 7.4: Distribution of IGKC log-expression values in each cluster of the PBMC dataset, before and after removal of ambient contamination. We observe a similar phenomenon with the LYZ gene (Figure 7.5), which should only be expressed in macrophages and neutrophils. In fact, if we knew this beforehand, we could specify these two mutually exclusive sets - i.e., LYZ and IGKC and their related genes - in the features= argument to removeAmbience(). This knowledge is subsequently used to estimate the contamination in each cluster, an approach that is more conceptually similar to the methods in the SoupX package. 
gridExtra::grid.arrange( plotExpression(sce.pbmc, x=&quot;label&quot;, colour_by=&quot;label&quot;, features=&quot;LYZ&quot;) + ggtitle(&quot;Before&quot;), plotExpression(stripped, x=&quot;label&quot;, colour_by=&quot;label&quot;, features=&quot;LYZ&quot;) + ggtitle(&quot;After&quot;), ncol=2 ) Figure 7.5: Distribution of LYZ log-expression values in each cluster of the PBMC dataset, before and after removal of ambient contamination. While these results look impressive, discerning readers will note that the method relies on having sensible clusters. This limits the function’s applicability to the end of an analysis after all the characterization has already been done. As such, the stripped matrix can really only be used in downstream steps like the DE analysis (where it is unlikely to have much effect beyond inflating already-large log-fold changes) or - most importantly - in visualization, where users can improve the aesthetics of their plots by eliminating harmless background expression. Of course, one could repeat the entire analysis on the stripped count matrix to obtain new clusters, but this seems unnecessarily circuitous, especially if the clusters were deemed good enough for use in removeAmbience() in the first place. Finally, it may be worth considering whether a corrected per-cell count matrix is really necessary. In removeAmbience(), counts for each gene are assumed to follow a negative binomial distribution with a fixed dispersion. This is necessary to perform the quantile-quantile remapping to obtain a corrected version of each individual cell’s counts, but violations of these distributional assumptions will introduce inaccuracies in downstream models. Some analyses may have specific remedies to ambient contamination that do not require corrected per-cell counts (Multi-sample Chapter 5), so we can avoid these assumptions altogether if such remedies are available. 7.4 Demultiplexing cell hashes 7.4.1 Background Cell hashing (Stoeckius et al. 
2018) is a useful technique that allows cells from different samples to be processed in a single run of a droplet-based protocol. Cells from a single sample are first labelled with a unique hashing tag oligo (HTO), usually via conjugation of the HTO to an antibody against a ubiquitous surface marker or a membrane-binding compound like cholesterol (McGinnis et al. 2019). Cells from different samples are then mixed together and the multiplexed pool is used for droplet-based library preparation; each cell is assigned back to its sample of origin based on its most abundant HTO. By processing multiple samples together, we can avoid batch effects and simplify the logistics of studies with a large number of samples. Sequencing of the HTO-derived cDNA library yields a count matrix where each row corresponds to an HTO and each column corresponds to a cell barcode. This can be stored as an alternative Experiment in our SingleCellExperiment, alongside the main experiment containing the counts for the actual genes. We demonstrate on some data from the original Stoeckius et al. (2018) study, which contains counts for a mixture of 4 cell lines across 12 samples. library(scRNAseq) hto.sce &lt;- StoeckiusHashingData(type=&quot;mixed&quot;) hto.sce # The full dataset ## class: SingleCellExperiment ## dim: 25339 25088 ## metadata(0): ## assays(1): counts ## rownames(25339): A1BG A1BG-AS1 ... snoU2-30 snoZ178 ## rowData names(0): ## colnames(25088): CAGATCAAGTAGGCCA CCTTTCTGTCGGATCC ... CGGTTATCCATCTGCT ## CGGTTCACACGTCAGC ## colData names(0): ## reducedDimNames(0): ## mainExpName: NULL ## altExpNames(1): hto altExp(hto.sce) # Contains the HTO counts ## class: SingleCellExperiment ## dim: 12 25088 ## metadata(0): ## assays(1): counts ## rownames(12): HEK_A HEK_B ... KG1_B KG1_C ## rowData names(2): cell_line replicate ## colnames(25088): CAGATCAAGTAGGCCA CCTTTCTGTCGGATCC ... 
CGGTTATCCATCTGCT ## CGGTTCACACGTCAGC ## colData names(1): metrics ## reducedDimNames(0): ## mainExpName: NULL ## altExpNames(0): counts(altExp(hto.sce))[,1:3] # Preview of the count profiles ## 12 x 3 sparse Matrix of class &quot;dgCMatrix&quot; ## CAGATCAAGTAGGCCA CCTTTCTGTCGGATCC CATATGGCATGGAATA ## HEK_A 1 . . ## HEK_B 111 . 1 ## HEK_C 7 1 7 ## THP1_A 15 19 16 ## THP1_B 10 8 5 ## THP1_C 4 3 6 ## K562_A 118 . 245 ## K562_B 5 530 131 ## K562_C 1 3 4 ## KG1_A 30 25 24 ## KG1_B 40 14 239 ## KG1_C 32 36 38 7.4.2 Cell calling options Our first task is to identify the libraries corresponding to cell-containing droplets. This can be performed using either the gene count matrix or the HTO count matrix, depending on what information we have available. We start with the usual application of emptyDrops() on the gene count matrix of hto.sce (Figure 7.6). set.seed(10010) e.out.gene &lt;- emptyDrops(counts(hto.sce)) is.cell &lt;- e.out.gene$FDR &lt;= 0.001 summary(is.cell) ## Mode FALSE TRUE NA&#39;s ## logical 1384 7934 15770 par(mfrow=c(1,2)) r &lt;- rank(-e.out.gene$Total) plot(r, e.out.gene$Total, log=&quot;xy&quot;, xlab=&quot;Rank&quot;, ylab=&quot;Total gene count&quot;, main=&quot;&quot;) abline(h=metadata(e.out.gene)$retain, col=&quot;darkgrey&quot;, lty=2, lwd=2) hist(log10(e.out.gene$Total[is.cell]), xlab=&quot;Log[10] gene count&quot;, main=&quot;&quot;) Figure 7.6: Cell-calling statistics from running emptyDrops() on the gene count matrix in the cell line mixture data. Left: Barcode rank plot with the estimated knee point in grey. Right: distribution of log-total counts for libraries identified as cells. Alternatively, we could also apply emptyDrops() to the HTO count matrix but this is slightly more complicated. As HTOs are sequenced separately from the endogenous transcripts, the coverage of the former is less predictable across studies; this makes it difficult to determine an appropriate default value of lower= for estimation of the initial ambient profile.
We instead estimate the ambient profile by excluding the top by.rank= barcodes with the largest totals, under the assumption that no more than by.rank= cells were loaded. Here we have chosen 12000, which is largely a guess to ensure that we can directly pick the knee point (Figure 7.7) in this somewhat pre-filtered dataset. set.seed(10010) # Setting lower= for correct knee point detection, # as the coverage in this dataset is particularly low. e.out.hto &lt;- emptyDrops(counts(altExp(hto.sce)), by.rank=12000, lower=10) summary(is.cell.hto &lt;- e.out.hto$FDR &lt;= 0.001) ## Mode FALSE TRUE NA&#39;s ## logical 7067 4933 13088 par(mfrow=c(1,2)) r &lt;- rank(-e.out.hto$Total) plot(r, e.out.hto$Total, log=&quot;xy&quot;, xlab=&quot;Rank&quot;, ylab=&quot;Total HTO count&quot;, main=&quot;&quot;) abline(h=metadata(e.out.hto)$retain, col=&quot;darkgrey&quot;, lty=2, lwd=2) hist(log10(e.out.hto$Total[is.cell.hto]), xlab=&quot;Log[10] HTO count&quot;, main=&quot;&quot;) Figure 7.7: Cell-calling statistics from running emptyDrops() on the HTO counts in the cell line mixture data. Left: Barcode rank plot with the knee point shown in grey. Right: distribution of log-total counts for libraries identified as cells. While both approaches are valid, we tend to favor the cell calls derived from the gene matrix as this directly indicates that a cell is present in the droplet. Indeed, at least a few libraries have very high total HTO counts yet very low total gene counts (Figure 7.8), suggesting that the presence of HTOs may not always equate to successful capture of that cell’s transcriptome. HTO counts also tend to exhibit stronger overdispersion (i.e., lower alpha in the emptyDrops() calculations), increasing the risk of violating emptyDrops()’s distributional assumptions. 
table(HTO=is.cell.hto, Genes=is.cell, useNA=&quot;always&quot;) ## Genes ## HTO FALSE TRUE &lt;NA&gt; ## FALSE 504 2834 3729 ## TRUE 59 4757 117 ## &lt;NA&gt; 821 343 11924 plot(e.out.gene$Total, e.out.hto$Total, log=&quot;xy&quot;, xlab=&quot;Total gene count&quot;, ylab=&quot;Total HTO count&quot;) abline(v=metadata(e.out.gene)$lower, col=&quot;red&quot;, lwd=2, lty=2) abline(h=metadata(e.out.hto)$lower, col=&quot;blue&quot;, lwd=2, lty=2) Figure 7.8: Total HTO counts plotted against the total gene counts for each library in the cell line mixture dataset. Each point represents a library while the dotted lines represent the thresholds below which libraries were assumed to be empty droplets. Again, note that if we are picking up our analysis after processing with pipelines like CellRanger, it may be that the count matrix has already been subsetted to the cell-containing libraries. If so, we can skip this section entirely and proceed straight to demultiplexing. 7.4.3 Demultiplexing on HTO abundance We run hashedDrops() to demultiplex the HTO count matrix for the subset of cell-containing libraries. This reports the likely sample of origin for each library based on its most abundant HTO after adjusting those abundances for ambient contamination. For quality control, it returns the log-fold change between the first and second-most abundant HTOs in each barcode library (Figure 7.9), allowing us to quantify the certainty of each assignment. hto.mat &lt;- counts(altExp(hto.sce))[,which(is.cell)] hash.stats &lt;- hashedDrops(hto.mat) hist(hash.stats$LogFC, xlab=&quot;Log fold-change from best to second HTO&quot;, main=&quot;&quot;) Figure 7.9: Distribution of log-fold changes from the first to second-most abundant HTO in each cell. Confidently assigned cells should have large log-fold changes between the best and second-best HTO abundances as there should be exactly one dominant HTO per cell.
These are marked as such by the Confident field in the output of hashedDrops(), which can be used to filter out ambiguous assignments prior to downstream analyses. # Raw assignments: table(hash.stats$Best) ## ## 1 2 3 4 5 6 7 8 9 10 11 12 ## 664 732 636 595 629 655 570 684 603 726 662 778 # Confident assignments based on (i) a large log-fold change # and (ii) not being a doublet. table(hash.stats$Best[hash.stats$Confident]) ## ## 1 2 3 4 5 6 7 8 9 10 11 12 ## 580 619 560 524 553 573 379 605 427 640 556 607 In the absence of an a priori ambient profile, hashedDrops() will attempt to automatically estimate it from the count matrix. This is done by assuming that each HTO has a bimodal distribution where the lower peak corresponds to ambient contamination in cells that do not belong to that HTO’s sample. Counts are then averaged across all cells in the lower mode to obtain the relative abundance of that HTO (Figure 7.10). hashedDrops() uses the ambient profile to adjust for systematic differences in HTO concentrations that could otherwise skew the log-fold changes - for example, this particular dataset exhibits order-of-magnitude differences in the concentration of different HTOs. The adjustment process itself involves a fair number of assumptions that we will not discuss here; see ?hashedDrops for more details. barplot(metadata(hash.stats)$ambient, las=2, ylab=&quot;Inferred proportion of counts in the ambient solution&quot;) Figure 7.10: Proportion of each HTO in the ambient solution for the cell line mixture data, estimated from the HTO counts of cell-containing droplets. If we are dealing with unfiltered data, we have the opportunity to improve the inferences by defining the ambient profile beforehand based on the empty droplets. This simply involves summing the counts for each HTO across all known empty droplets, marked as those libraries with NA FDR values in the emptyDrops() output. 
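In code, the empty-droplet summation just described amounts to little more than a rowSums() call over the untested barcodes - a minimal sketch, assuming the e.out.gene object from the earlier emptyDrops() call on the gene count matrix is still in the session:

```r
# Barcodes with NA FDR fell below lower= and were never tested by
# emptyDrops(); these are presumed to be empty droplets.
presumed.empty <- is.na(e.out.gene$FDR)

# Summing HTO counts across the presumed-empty droplets and converting
# to proportions yields an a priori estimate of the ambient profile.
ambient.est <- proportions(rowSums(counts(altExp(hto.sce))[, presumed.empty]))
```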
Alternatively, if we had called emptyDrops() directly on the HTO count matrix, we could just extract the ambient profile from the output’s metadata(). For this dataset, all methods agree well (Figure 7.11) though providing an a priori profile can be helpful in more extreme situations where the automatic method fails, e.g., if there are too few cells in the lower mode for accurate estimation of a HTO’s ambient concentration. estimates &lt;- rbind( `Bimodal`=proportions(metadata(hash.stats)$ambient), `Empty (genes)`=proportions(rowSums(counts(altExp(hto.sce))[,is.na(e.out.gene$FDR)])), `Empty (HTO)`=metadata(e.out.hto)$ambient[,1] ) barplot(estimates, beside=TRUE, ylab=&quot;Proportion of counts in the ambient solution&quot;) legend(&quot;topleft&quot;, fill=gray.colors(3), legend=rownames(estimates)) Figure 7.11: Proportion of each HTO in the ambient solution for the cell line mixture data, estimated using the bimodal method in hashedDrops() or by computing the average abundance across all empty droplets (where the empty state is defined by using emptyDrops() on either the genes or the HTO matrix). Given an estimate of the ambient profile - say, the one derived from empty droplets detected using the HTO count matrix - we can easily use it in hashedDrops() via the ambient= argument. This yields very similar results to those obtained with the automatic method, as expected from the similarity in the profiles. hash.stats2 &lt;- hashedDrops(hto.mat, ambient=metadata(e.out.hto)$ambient[,1]) table(hash.stats2$Best[hash.stats2$Confident]) ## ## 1 2 3 4 5 6 7 8 9 10 11 12 ## 575 602 565 526 551 559 354 597 411 632 553 589 7.4.4 Further comments After demultiplexing, it is a simple matter to subset the SingleCellExperiment to the confident assignments. This actually involves two steps - the first is to subset to the libraries that were actually used in hashedDrops(), and the second is to subset to the libraries that were confidently assigned to a single sample.
Of course, we also include the putative sample of origin for each cell. sce &lt;- hto.sce[,rownames(hash.stats)] sce$sample &lt;- hash.stats$Best sce &lt;- sce[,hash.stats$Confident] We examine the success of the demultiplexing by performing a quick analysis. Recall that this experiment involved 4 cell lines that were multiplexed together; we see that the separation between cell lines is preserved in Figure 7.12, indicating that the cells were assigned to their correct samples of origin. library(scran) library(scater) sce &lt;- logNormCounts(sce) dec &lt;- modelGeneVar(sce) set.seed(100) sce &lt;- runPCA(sce, subset_row=getTopHVGs(dec, n=5000)) sce &lt;- runTSNE(sce, dimred=&quot;PCA&quot;) cell.lines &lt;- sub(&quot;_.*&quot;, &quot;&quot;, rownames(altExp(sce))) sce$cell.line &lt;- cell.lines[sce$sample] plotTSNE(sce, colour_by=&quot;cell.line&quot;) Figure 7.12: The usual \\(t\\)-SNE plot of the cell line mixture data, where each point is a cell and is colored by the cell line corresponding to its sample of origin. Cell hashing information can also be used to detect doublets - see Chapter 8 for more details. 7.5 Removing swapped molecules Some of the more recent DNA sequencing machines released by Illumina (e.g., HiSeq 3000/4000/X, X-Ten, and NovaSeq) use patterned flow cells to improve throughput and cost efficiency. However, in multiplexed pools, the use of these flow cells can lead to the mislabelling of DNA molecules with the incorrect library barcode (Sinha et al. 2017), a phenomenon known as “barcode swapping”. This leads to contamination of each library with reads from other libraries - for droplet sequencing experiments, this is particularly problematic as it manifests as the appearance of artificial cells that are low-coverage copies of their originals from other samples (Griffiths et al. 2018). Fortunately, it is easy enough to remove affected reads from droplet experiments with the swappedDrops() function from DropletUtils. 
Given a multiplexed pool of samples, we identify potential swapping events as transcript molecules that share the same combination of UMI sequence, assigned gene and cell barcode across samples. We only keep the molecule if it has dominant coverage in a single sample, which is likely to be its original sample; we remove all (presumably swapped) instances of that molecule in the other samples. Our assumption is that it is highly unlikely that two molecules would have the same combination of values by chance. To demonstrate, we will use some multiplexed 10X Genomics data from an attempted study of the mouse mammary gland (Griffiths et al. 2018). This experiment consists of 8 scRNA-seq samples from various stages of mammary gland development, sequenced using the HiSeq 4000. We use the DropletTestFiles package to obtain the molecule information files produced by the CellRanger software suite; as its name suggests, this file format contains information on each individual transcript molecule in each sample. library(DropletTestFiles) swap.files &lt;- listTestFiles(dataset=&quot;bach-mammary-swapping&quot;) swap.files &lt;- swap.files[dirname(swap.files$file.name)==&quot;hiseq_4000&quot;,] swap.files &lt;- vapply(swap.files$rdatapath, getTestFile, prefix=FALSE, &quot;&quot;) names(swap.files) &lt;- sub(&quot;.*_(.*)\\\\.h5&quot;, &quot;\\\\1&quot;, names(swap.files)) We examine the barcode rank plots before making any attempt to remove swapped molecules (Figure 7.13), using the get10xMolInfoStats() function to efficiently obtain summary statistics from each molecule information file. We see that samples E1 and F1 have different curves but this is not cause for alarm given that they also correspond to a different developmental stage compared to the other samples. 
library(DropletUtils) before.stats &lt;- lapply(swap.files, get10xMolInfoStats) max.umi &lt;- vapply(before.stats, function(x) max(x$num.umis), 0) ylim &lt;- c(1, max(max.umi)) max.ncells &lt;- vapply(before.stats, nrow, 0L) xlim &lt;- c(1, max(max.ncells)) plot(0,0,type=&quot;n&quot;, xlab=&quot;Rank&quot;, ylab=&quot;Number of UMIs&quot;, log=&quot;xy&quot;, xlim=xlim, ylim=ylim) for (i in seq_along(before.stats)) { u &lt;- sort(before.stats[[i]]$num.umis, decreasing=TRUE) lines(seq_along(u), u, col=i, lwd=5) } legend(&quot;topright&quot;, col=seq_along(before.stats), lwd=5, legend=names(before.stats)) Figure 7.13: Barcode rank curves for all samples in the HiSeq 4000-sequenced mammary gland dataset, before removing any swapped molecules. We apply the swappedDrops() function to the molecule information files to identify and remove swapped molecules from each sample. While all samples have some percentage of removed molecules, the majority of molecules in samples E1 and F1 are considered to be swapping artifacts. The most likely cause is that these samples contain no real cells or highly damaged cells with little RNA, which frees up sequencing resources for deeper coverage of swapped molecules. after.mat &lt;- swappedDrops(swap.files, get.swapped=TRUE) cleaned.sum &lt;- vapply(after.mat$cleaned, sum, 0) swapped.sum &lt;- vapply(after.mat$swapped, sum, 0) swapped.sum / (swapped.sum + cleaned.sum) ## A1 B1 C1 D1 E1 F1 G1 H1 ## 0.02761 0.02767 0.12274 0.03797 0.82535 0.86561 0.02241 0.03648 After removing the swapped molecules, the barcode rank curves for E1 and F1 drop dramatically (Figure 7.14). This represents the worst-case outcome of the swapping phenomenon, where cells are “carbon-copied” from the other multiplexed samples. Proceeding with the cleaned matrices protects us from these egregious artifacts as well as the more subtle effects of contamination.
plot(0,0,type=&quot;n&quot;, xlab=&quot;Rank&quot;, ylab=&quot;Number of UMIs&quot;, log=&quot;xy&quot;, xlim=xlim, ylim=ylim) for (i in seq_along(after.mat$cleaned)) { cur.stats &lt;- barcodeRanks(after.mat$cleaned[[i]]) u &lt;- sort(cur.stats$total, decreasing=TRUE) lines(seq_along(u), u, col=i, lwd=5) } legend(&quot;topright&quot;, col=seq_along(after.mat$cleaned), lwd=5, legend=names(after.mat$cleaned)) Figure 7.14: Barcode rank curves for all samples in the HiSeq 4000-sequenced mammary gland dataset, after removing any swapped molecules. Session Info View session info R version 4.4.2 (2024-10-31) Platform: x86_64-pc-linux-gnu Running under: Ubuntu 24.04.1 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C time zone: America/New_York tzcode source: system (glibc) attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] DropletTestFiles_1.16.0 scran_1.34.0 [3] scRNAseq_2.20.0 scater_1.34.0 [5] ggplot2_3.5.1 scuttle_1.16.0 [7] DropletUtils_1.26.0 SingleCellExperiment_1.28.1 [9] SummarizedExperiment_1.36.0 Biobase_2.66.0 [11] GenomicRanges_1.58.0 GenomeInfoDb_1.42.1 [13] IRanges_2.40.1 S4Vectors_0.44.0 [15] BiocGenerics_0.52.0 MatrixGenerics_1.18.1 [17] matrixStats_1.5.0 BiocStyle_2.34.0 [19] rebook_1.16.0 loaded via a namespace (and not attached): [1] jsonlite_1.8.9 CodeDepends_0.6.6 [3] magrittr_2.0.3 gypsum_1.2.0 [5] ggbeeswarm_0.7.2 GenomicFeatures_1.58.0 [7] farver_2.1.2 rmarkdown_2.29 [9] BiocIO_1.16.0 zlibbioc_1.52.0 [11] vctrs_0.6.5 memoise_2.0.1 [13] Rsamtools_2.22.0 DelayedMatrixStats_1.28.1 [15] RCurl_1.98-1.16 htmltools_0.5.8.1 [17] S4Arrays_1.6.0 
AnnotationHub_3.14.0 [19] curl_6.1.0 BiocNeighbors_2.0.1 [21] Rhdf5lib_1.28.0 SparseArray_1.6.1 [23] rhdf5_2.50.2 sass_0.4.9 [25] alabaster.base_1.6.1 bslib_0.8.0 [27] alabaster.sce_1.6.0 httr2_1.1.0 [29] cachem_1.1.0 GenomicAlignments_1.42.0 [31] igraph_2.1.3 mime_0.12 [33] lifecycle_1.0.4 pkgconfig_2.0.3 [35] rsvd_1.0.5 Matrix_1.7-1 [37] R6_2.5.1 fastmap_1.2.0 [39] GenomeInfoDbData_1.2.13 digest_0.6.37 [41] colorspace_2.1-1 AnnotationDbi_1.68.0 [43] dqrng_0.4.1 irlba_2.3.5.1 [45] ExperimentHub_2.14.0 RSQLite_2.3.9 [47] beachmat_2.22.0 filelock_1.0.3 [49] labeling_0.4.3 httr_1.4.7 [51] abind_1.4-8 compiler_4.4.2 [53] bit64_4.6.0-1 withr_3.0.2 [55] BiocParallel_1.40.0 viridis_0.6.5 [57] DBI_1.2.3 alabaster.ranges_1.6.0 [59] HDF5Array_1.34.0 alabaster.schemas_1.6.0 [61] R.utils_2.12.3 rappdirs_0.3.3 [63] DelayedArray_0.32.0 bluster_1.16.0 [65] rjson_0.2.23 tools_4.4.2 [67] vipor_0.4.7 beeswarm_0.4.0 [69] R.oo_1.27.0 glue_1.8.0 [71] restfulr_0.0.15 rhdf5filters_1.18.0 [73] grid_4.4.2 Rtsne_0.17 [75] cluster_2.1.8 generics_0.1.3 [77] gtable_0.3.6 R.methodsS3_1.8.2 [79] ensembldb_2.30.0 metapod_1.14.0 [81] BiocSingular_1.22.0 ScaledMatrix_1.14.0 [83] XVector_0.46.0 ggrepel_0.9.6 [85] BiocVersion_3.20.0 pillar_1.10.1 [87] limma_3.62.2 dplyr_1.1.4 [89] BiocFileCache_2.14.0 lattice_0.22-6 [91] rtracklayer_1.66.0 bit_4.5.0.1 [93] tidyselect_1.2.1 locfit_1.5-9.10 [95] Biostrings_2.74.1 knitr_1.49 [97] gridExtra_2.3 bookdown_0.42 [99] ProtGenerics_1.38.0 edgeR_4.4.1 [101] xfun_0.50 statmod_1.5.0 [103] UCSC.utils_1.2.0 lazyeval_0.2.2 [105] yaml_2.3.10 evaluate_1.0.3 [107] codetools_0.2-20 tibble_3.2.1 [109] alabaster.matrix_1.6.1 BiocManager_1.30.25 [111] graph_1.84.1 cli_3.6.3 [113] munsell_0.5.1 jquerylib_0.1.4 [115] Rcpp_1.0.14 dir.expiry_1.14.0 [117] dbplyr_2.5.0 png_0.1-8 [119] XML_3.99-0.18 parallel_4.4.2 [121] blob_1.2.4 AnnotationFilter_1.30.0 [123] sparseMatrixStats_1.18.0 bitops_1.0-9 [125] alabaster.se_1.6.0 viridisLite_0.4.2 [127] scales_1.3.0 purrr_1.0.2 [129] 
crayon_1.5.3 rlang_1.1.5 [131] cowplot_1.1.3 KEGGREST_1.46.0 References "],["doublet-detection.html", "Chapter 8 Doublet detection 8.1 Overview 8.2 Doublet detection with clusters 8.3 Doublet detection by simulation 8.4 Doublet detection in multiplexed experiments 8.5 Further comments Session Info", " Chapter 8 Doublet detection 8.1 Overview In single-cell RNA sequencing (scRNA-seq) experiments, doublets are artifactual libraries generated from two cells. They typically arise due to errors in cell sorting or capture, especially in droplet-based protocols (Zheng et al. 2017) involving thousands of cells. Doublets are obviously undesirable when the aim is to characterize populations at the single-cell level. In particular, doublets can be mistaken for intermediate populations or transitory states that do not actually exist. Thus, it is desirable to identify and remove doublet libraries so that they do not compromise interpretation of the results. Several experimental strategies are available for doublet removal. One approach exploits natural genetic variation when pooling cells from multiple donor individuals (Kang et al. 2018). Doublets can be identified as libraries with allele combinations that do not exist in any single donor. Another approach is to mark a subset of cells (e.g., all cells from one sample) with an antibody conjugated to a different oligonucleotide (Stoeckius et al. 2018). Upon pooling, libraries that are observed to have different oligonucleotides are considered to be doublets and removed. These approaches can be highly effective but rely on experimental information that may not be available. A more general approach is to infer doublets from the expression profiles alone (Dahlin et al.
2018). In this workflow, we will describe two purely computational approaches for detecting doublets from scRNA-seq data. The main difference between these two methods is whether or not they need cluster information beforehand. We will demonstrate the use of these methods on 10X Genomics data from a droplet-based scRNA-seq study of the mouse mammary gland (Bach et al. 2017). View set-up code (Workflow Chapter 12) #--- loading ---# library(scRNAseq) sce.mam &lt;- BachMammaryData(samples=&quot;G_1&quot;) #--- gene-annotation ---# library(scater) rownames(sce.mam) &lt;- uniquifyFeatureNames( rowData(sce.mam)$Ensembl, rowData(sce.mam)$Symbol) library(AnnotationHub) ens.mm.v97 &lt;- AnnotationHub()[[&quot;AH73905&quot;]] rowData(sce.mam)$SEQNAME &lt;- mapIds(ens.mm.v97, keys=rowData(sce.mam)$Ensembl, keytype=&quot;GENEID&quot;, column=&quot;SEQNAME&quot;) #--- quality-control ---# is.mito &lt;- rowData(sce.mam)$SEQNAME == &quot;MT&quot; stats &lt;- perCellQCMetrics(sce.mam, subsets=list(Mito=which(is.mito))) qc &lt;- quickPerCellQC(stats, percent_subsets=&quot;subsets_Mito_percent&quot;) sce.mam &lt;- sce.mam[,!qc$discard] #--- normalization ---# library(scran) set.seed(101000110) clusters &lt;- quickCluster(sce.mam) sce.mam &lt;- computeSumFactors(sce.mam, clusters=clusters) sce.mam &lt;- logNormCounts(sce.mam) #--- variance-modelling ---# set.seed(00010101) dec.mam &lt;- modelGeneVarByPoisson(sce.mam) top.mam &lt;- getTopHVGs(dec.mam, prop=0.1) #--- dimensionality-reduction ---# library(BiocSingular) set.seed(101010011) sce.mam &lt;- denoisePCA(sce.mam, technical=dec.mam, subset.row=top.mam) sce.mam &lt;- runTSNE(sce.mam, dimred=&quot;PCA&quot;) #--- clustering ---# snn.gr &lt;- buildSNNGraph(sce.mam, use.dimred=&quot;PCA&quot;, k=25) colLabels(sce.mam) &lt;- factor(igraph::cluster_walktrap(snn.gr)$membership) sce.mam ## class: SingleCellExperiment ## dim: 27998 2772 ## metadata(0): ## assays(2): counts logcounts ## rownames(27998): Xkr4 Gm1992 ... 
Vmn2r122 CAAA01147332.1 ## rowData names(3): Ensembl Symbol SEQNAME ## colnames: NULL ## colData names(5): Barcode Sample Condition sizeFactor label ## reducedDimNames(2): PCA TSNE ## mainExpName: NULL ## altExpNames(0): 8.2 Doublet detection with clusters The findDoubletClusters() function from the scDblFinder package identifies clusters with expression profiles lying between two other clusters (Bach et al. 2017). We consider every possible triplet of clusters consisting of a query cluster and two putative “source” clusters. Under the null hypothesis that the query consists of doublets from the two sources, we compute the number of genes (num.de) that are differentially expressed in the same direction in the query cluster compared to both of the source clusters. Such genes would be unique markers for the query cluster and provide evidence against the null hypothesis. For each query cluster, the best pair of putative sources is identified based on the lowest num.de. Clusters are then ranked by num.de where those with the fewest unique genes are more likely to be composed of doublets. # Like &#39;findMarkers&#39;, this function will automatically # retrieve cluster assignments from &#39;colLabels&#39;.
library(scDblFinder) dbl.out &lt;- findDoubletClusters(sce.mam) dbl.out ## DataFrame with 10 rows and 9 columns ## source1 source2 num.de median.de best p.value ## &lt;character&gt; &lt;character&gt; &lt;integer&gt; &lt;numeric&gt; &lt;character&gt; &lt;numeric&gt; ## 6 3 1 8 522.5 Fabp3 1.70911e-03 ## 2 10 4 76 737.0 Xist 3.84476e-17 ## 8 10 5 135 490.5 Gde1 1.80429e-11 ## 3 8 6 140 975.0 Cotl1 1.10778e-07 ## 10 8 7 193 392.0 Gpx3 1.10908e-19 ## 5 8 7 270 771.5 C1qb 9.41842e-49 ## 9 8 7 300 518.0 Fabp4 2.21523e-32 ## 7 10 9 388 687.5 Col1a1 6.82664e-32 ## 4 8 2 468 1604.5 Cdc20 7.00502e-72 ## 1 7 6 539 1845.5 Acta2 2.75356e-25 ## lib.size1 lib.size2 prop ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## 6 0.792415 0.523830 0.03174603 ## 2 0.613116 1.401351 0.30555556 ## 8 1.125474 1.167854 0.00793651 ## 3 0.572162 1.261965 0.23051948 ## 10 0.888514 0.888432 0.00865801 ## 5 0.856271 0.856192 0.01948052 ## 9 0.655685 0.655624 0.01154401 ## 7 1.125578 1.525264 0.01406926 ## 4 0.388741 0.713597 0.17207792 ## 1 0.865449 1.909018 0.19841270 If a more concrete threshold is necessary, we can identify clusters that have unusually low num.de using an outlier-based approach. library(scater) chosen.doublet &lt;- rownames(dbl.out)[isOutlier(dbl.out$num.de, type=&quot;lower&quot;, log=TRUE)] chosen.doublet ## [1] &quot;6&quot; The function also reports the ratio of the median library size in each source to the median library size in the query (lib.size fields). Ideally, a potential doublet cluster would have ratios lower than unity; this is because doublet libraries are generated from a larger initial pool of RNA compared to libraries for single cells, and thus the former should have larger library sizes. The proportion of cells in the query cluster should also be reasonable - typically less than 5% of all cells, depending on how many cells were loaded onto the 10X Genomics device. 
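The criteria above - an outlier-low num.de, library size ratios below unity, and a modest cluster proportion - can be combined into a single programmatic filter. This is a sketch rather than an official interface, and the 5% cutoff on prop is only a rule of thumb:

```r
# Flag clusters that satisfy all three heuristics at once: few unique
# markers, larger libraries than both putative sources (ratios < 1),
# and a modest share of all cells (0.05 is an arbitrary guideline).
suspect <- isOutlier(dbl.out$num.de, type="lower", log=TRUE) &
    dbl.out$lib.size1 < 1 & dbl.out$lib.size2 < 1 &
    dbl.out$prop < 0.05
rownames(dbl.out)[suspect]
```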
Examination of the findDoubletClusters() output indicates that cluster 6 has the fewest unique genes and library sizes that are comparable to or greater than its sources. We see that every gene detected in this cluster is also expressed in either of the two proposed source clusters (Figure 8.1). library(scran) markers &lt;- findMarkers(sce.mam, direction=&quot;up&quot;) dbl.markers &lt;- markers[[chosen.doublet]] library(scater) chosen &lt;- rownames(dbl.markers)[dbl.markers$Top &lt;= 10] plotHeatmap(sce.mam, order_columns_by=&quot;label&quot;, features=chosen, center=TRUE, symmetric=TRUE, zlim=c(-5, 5)) Figure 8.1: Heatmap of mean-centered and normalized log-expression values for the top set of markers for cluster 6 in the mammary gland dataset. Column colours represent the cluster to which each cell is assigned, as indicated by the legend. Closer examination of some known markers suggests that the offending cluster consists of doublets of basal cells (Acta2) and alveolar cells (Csn2) (Figure 8.2). Indeed, no cell type is known to strongly express both of these genes at the same time, which supports the hypothesis that this cluster consists solely of doublets rather than being an entirely novel cell type. plotExpression(sce.mam, features=c(&quot;Acta2&quot;, &quot;Csn2&quot;), x=&quot;label&quot;, colour_by=&quot;label&quot;) Figure 8.2: Distribution of log-normalized expression values for Acta2 and Csn2 in each cluster. Each point represents a cell. The strength of findDoubletClusters() lies in its simplicity and ease of interpretation. Suspect clusters can be quickly flagged based on the metrics returned by the function. However, it is obviously dependent on the quality of the clustering. Clusters that are too coarse will fail to separate doublets from other cells, while clusters that are too fine will complicate interpretation. The method is also somewhat biased towards clusters with fewer cells, where the reduction in power is more likely to result in a low num.de.
(Fortunately, this is a desirable effect as doublets should be rare in a properly performed scRNA-seq experiment.) 8.3 Doublet detection by simulation 8.3.1 Computing doublet densities The other doublet detection strategy involves in silico simulation of doublets from the single-cell expression profiles (Dahlin et al. 2018). This is performed using the computeDoubletDensity() function from scDblFinder, which will: Simulate thousands of doublets by adding together two randomly chosen single-cell profiles. For each original cell, compute the density of simulated doublets in the surrounding neighborhood. For each original cell, compute the density of other observed cells in the neighborhood. Return the ratio between the two densities as a “doublet score” for each cell. This approach assumes that the simulated doublets are good approximations for real doublets. The use of random selection accounts for the relative abundances of different subpopulations, which affect the likelihood of their involvement in doublets; and the calculation of a ratio avoids high scores for non-doublet cells in highly abundant subpopulations. We see the function in action below. To speed up the density calculations, computeDoubletDensity() will perform a PCA on the log-expression matrix, and we perform some (optional) parametrization to ensure that the computed PCs are consistent with that from our previous analysis on this dataset. library(BiocSingular) set.seed(100) # Setting up the parameters for consistency with denoisePCA(); # this can be changed depending on your feature selection scheme. dbl.dens &lt;- computeDoubletDensity(sce.mam, subset.row=top.mam, d=ncol(reducedDim(sce.mam))) summary(dbl.dens) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.000 0.249 0.543 1.029 1.175 14.974 The highest doublet scores are concentrated in a single cluster of cells in the center of Figure 8.3. 
sce.mam$DoubletScore &lt;- dbl.dens plotTSNE(sce.mam, colour_by=&quot;DoubletScore&quot;) Figure 8.3: t-SNE plot of the mammary gland data set. Each point is a cell coloured according to its doublet density. We can explicitly convert this into doublet calls by identifying large outliers for the score within each sample (Pijuan-Sala et al. 2019). Here, we only have one sample so the call below is fairly simple, but it can also support multiple samples if the input data.frame has a sample column. dbl.calls &lt;- doubletThresholding(data.frame(score=dbl.dens), method=&quot;griffiths&quot;, returnType=&quot;call&quot;) summary(dbl.calls) ## singlet doublet ## 2378 394 From the clustering information, we see that the affected cells belong to the same cluster that was identified using findDoubletClusters() (Figure 8.4), which is reassuring. plotColData(sce.mam, x=&quot;label&quot;, y=&quot;DoubletScore&quot;, colour_by=I(dbl.calls)) Figure 8.4: Distribution of doublet scores for each cluster in the mammary gland data set. Each point is a cell and is colored according to whether it was called as a doublet. The advantage of computeDoubletDensity() is that it does not depend on clusters, reducing the sensitivity of the results to clustering quality. The downside is that it requires some strong assumptions about how doublets form, such as the combining proportions and the sampling from pure subpopulations. In particular, computeDoubletDensity() treats the library size of each cell as an accurate proxy for its total RNA content. If this is not true, the simulation will not combine expression profiles from different cells in the correct proportions. This means that the simulated doublets will be systematically shifted away from the real doublets, resulting in doublet scores that are too low. (As an aside, the issue of unknown combining proportions can be solved if spike-in information is available, e.g., in plate-based protocols. 
This will provide an accurate estimate of the total RNA content of each cell. To this end, spike-in-based size factors from Basic Section 2.4 can be supplied to the computeDoubletDensity() function via the size.factors.content= argument. This will use the spike-in size factors to scale the contribution of each cell to a doublet library.) 8.3.2 Doublet classification The scDblFinder() function (Germain et al. 2021) combines the simulated doublet density with an iterative classification scheme. For each observed cell, an initial score is computed by combining the fraction of simulated doublets in its neighborhood with another score based on co-expression of mutually exclusive gene pairs (Bais and Kostka 2020). A threshold is chosen that best distinguishes between the real and simulated cells, allowing us to obtain putative doublet calls among the real cells. The threshold and scores are then iteratively refined by training a classifier on the putative calls with a variety of metrics to characterize the doublet neighborhood. These metrics include the low-dimensional embeddings in PC space, the fraction of doublets among the \\(k\\) nearest neighbors for a variety of \\(k\\), the distance to the closest real cell, the expected within-cluster doublet formation rate, the observed number of cells in each cluster, and so on. The classifier distills all of these metrics into a single score based on their learnt importance (Figure 8.5). set.seed(10010101) sce.mam.dbl &lt;- scDblFinder(sce.mam, clusters=colLabels(sce.mam)) plotTSNE(sce.mam.dbl, colour_by=&quot;scDblFinder.score&quot;) Figure 8.5: t-SNE plot of the mammary gland data set where each point is a cell coloured according to its scDblFinder() score. We can also extract explicit doublet calls for each cell based on the final threshold from the iterative process. 
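The idea of choosing a threshold that best distinguishes real from simulated cells can be sketched with base R on synthetic score distributions; the actual scDblFinder() procedure is considerably more involved, so this is only a conceptual illustration.

```r
# Toy sketch of threshold selection: scan candidate cutoffs and keep the
# one that best separates real-cell scores from simulated-doublet scores.
# Both distributions below are synthetic.
set.seed(3)
real.scores &lt;- rbeta(500, 1, 5) # real cells: mostly low scores
sim.scores &lt;- rbeta(500, 5, 1)  # simulated doublets: mostly high scores
candidates &lt;- sort(c(real.scores, sim.scores))
accuracy &lt;- vapply(candidates, function(t) {
    (sum(real.scores &lt; t) + sum(sim.scores &gt;= t)) / 1000
}, numeric(1))
threshold &lt;- candidates[which.max(accuracy)]
```

Cells with scores above such a threshold would be treated as putative doublets in the subsequent refinement iterations.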
table(sce.mam.dbl$scDblFinder.class) ## ## singlet doublet ## 2462 310 We set clusters= to instruct scDblFinder() to exclusively simulate doublets between different clusters in Figure 8.5. This improves the efficiency of the simulation process by eliminating intra-cluster doublets that are indistinguishable from singlets. Doing so is entirely optional and can be omitted if clustering information is not available or the clusters are not well-defined. The default of clusters=NULL will cause scDblFinder() to fall back to a random simulation scheme, similar to that used by computeDoubletDensity(). Compared to computeDoubletDensity(), scDblFinder() provides a more sophisticated approach that makes greater use of the information in the simulated doublets. This can improve the performance of doublet identification in difficult scenarios, e.g., when one of the two cells contributes considerably more to the doublet, or when doublets are transcriptionally similar to real intermediate stages. However, as with all algorithms of this class, accuracy is dependent on the correctness of the simulation process. 8.3.3 Further comments Simply removing cells with high doublet scores will typically not be sufficient to eliminate real doublets from the data set. In some cases, only a subset of the cells in the putative doublet cluster actually have high scores, and removing these would still leave enough cells in that cluster to mislead downstream analyses. In fact, even defining a threshold on the doublet score is difficult as the interpretation of the score is relative. There is no general definition for a fixed threshold above which libraries are to be considered doublets. We recommend interpreting these scores in the context of cluster annotation. All cells from a cluster with a large average doublet score should be considered suspect, and close neighbors of problematic clusters should be treated with caution. 
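The per-cluster interpretation recommended above amounts to summarizing the scores by cluster label - with the objects from this chapter, something like tapply(sce.mam$DoubletScore, colLabels(sce.mam), mean). A self-contained sketch with synthetic stand-ins:

```r
# Hypothetical sketch of flagging suspect clusters by their average
# doublet score; labels and scores below are synthetic stand-ins.
set.seed(42)
labels &lt;- factor(sample(1:4, 500, replace=TRUE))
score &lt;- rexp(500) + 3 * (labels == "3") # cluster 3 artificially inflated
by.cluster &lt;- tapply(score, labels, mean)
names(which.max(by.cluster)) # cluster 3 stands out as suspect
```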
A cluster containing only a small proportion of high-scoring cells is safer, though this prognosis comes with the caveat that true doublets often lie immediately adjacent to their source populations and end up being assigned to the same cluster. It is worth confirming that any interesting results of downstream analyses are not being driven by those cells, e.g., by checking that DE in an interesting gene is not driven solely by cells with high doublet scores. While clustering is still required for interpretation, the simulation-based strategy is more robust than findDoubletClusters() to the quality of the clustering as the scores are computed on a per-cell basis. 8.4 Doublet detection in multiplexed experiments 8.4.1 Background For multiplexed samples (Kang et al. 2018; Stoeckius et al. 2018), we can identify doublet cells based on the cells that have multiple labels. The idea here is that cells from the same sample are labelled in a unique manner, either implicitly with genotype information or experimentally with hashing tag oligos (HTOs). Cells from all samples are then mixed together and the multiplexed pool is subjected to scRNA-seq, avoiding batch effects and simplifying the logistics of processing a large number of samples. Importantly, most per-cell libraries are expected to contain one label that can be used to assign that cell to its sample of origin. Cell libraries containing two labels are thus likely to be doublets of cells from different samples. To demonstrate, we will use some data from the original cell hashing study (Stoeckius et al. 2018). Each sample’s cells were stained with an antibody against a ubiquitous surface protein, where the antibody was conjugated to a sample-specific HTO. Sequencing of the HTO-derived cDNA library ultimately yields a count matrix where each row corresponds to a HTO and each column corresponds to a cell barcode. 
library(scRNAseq) hto.sce &lt;- StoeckiusHashingData(mode=&quot;hto&quot;) dim(hto.sce) ## [1] 8 65000 8.4.2 Identifying inter-sample doublets Before we proceed to doublet detection, we simplify the problem by first identifying the barcodes that contain cells. This is most conventionally done using the gene expression matrix for the same set of barcodes, as shown in Section 7.2. Here, though, we will keep things simple and apply emptyDrops() directly on the HTO count matrix. The considerations are largely the same as those for gene expression matrices; the main difference is that the default lower= is often too low for deeply sequenced HTOs, so we instead estimate the ambient profile by excluding the top by.rank= barcodes with the largest totals (under the assumption that no more than by.rank= cells were loaded). The barcode-rank plots are quite similar to what one might expect from gene expression data (Figure 8.6). library(DropletUtils) set.seed(101) hash.calls &lt;- emptyDrops(counts(hto.sce), by.rank=40000) is.cell &lt;- which(hash.calls$FDR &lt;= 0.001) length(is.cell) ## [1] 21780 par(mfrow=c(1,2)) r &lt;- rank(-hash.calls$Total) plot(r, hash.calls$Total, log=&quot;xy&quot;, xlab=&quot;Rank&quot;, ylab=&quot;Total HTO count&quot;, main=&quot;&quot;) hist(log10(hash.calls$Total[is.cell]), xlab=&quot;Log[10] HTO count&quot;, main=&quot;&quot;) Figure 8.6: Cell-calling statistics from running emptyDrops() on the HTO counts in the cell hashing study. Left: Barcode rank plot of the HTO counts in the cell hashing study. Right: distribution of log-total counts for libraries identified as cells. We then run hashedDrops() on the subset of cell barcode libraries that actually contain cells. This returns the likely sample of origin for each barcode library based on its most abundant HTO, using abundances adjusted for ambient contamination in the ambient= argument. 
(The adjustment process itself involves a fair number of assumptions that we will not discuss here; see ?hashedDrops for more details.) For quality control, it returns the log-fold change between the first and second-most abundant HTOs in each barcode library (Figure 7.9), allowing us to quantify the certainty of each assignment. Confidently assigned singlets are marked using the Confident field in the output. hash.stats &lt;- hashedDrops(counts(hto.sce)[,is.cell], ambient=metadata(hash.calls)$ambient) hist(hash.stats$LogFC, xlab=&quot;Log fold-change from best to second HTO&quot;, main=&quot;&quot;) Figure 7.9: Distribution of log-fold changes from the first to second-most abundant HTO in each cell. # Raw assignments: table(hash.stats$Best) ## ## 1 2 3 4 5 6 7 8 ## 2703 3276 2752 2782 2493 2381 2586 2807 # Confident assignments based on (i) a large log-fold change # and (ii) not being a doublet, see below. table(hash.stats$Best[hash.stats$Confident]) ## ## 1 2 3 4 5 6 7 8 ## 2349 2779 2457 2275 2090 1994 2176 2458 Of greater interest here is how we can use the hashing information to detect doublets. This is achieved by reporting the log-fold change between the count for the second HTO and the estimated contribution from ambient contamination. A large log-fold change indicates that the second HTO still has an above-expected abundance, consistent with a doublet containing HTOs from two samples. We use outlier detection to explicitly identify putative doublets as those barcode libraries that have large log-fold changes; this is visualized in Figure 8.7, which shows a clear separation between the putative singlets and doublets. 
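An outlier call on a vector of log-fold changes can be sketched with base R's median() and mad(); this mirrors the MAD-based logic used throughout these chapters, though it is not the verbatim internal behavior of hashedDrops(), and the values below are synthetic stand-ins for the second-HTO log-fold changes.

```r
# Base-R sketch of flagging large log-fold changes as outliers;
# 'lfc2' is a synthetic stand-in for the second-HTO log-fold changes.
set.seed(1)
lfc2 &lt;- c(rnorm(9000, mean=0.1, sd=0.2), # singlets: near-ambient second HTO
          rnorm(1000, mean=3, sd=0.5))   # doublets: strongly elevated second HTO
cutoff &lt;- median(lfc2) + 3 * mad(lfc2)
is.doublet &lt;- lfc2 &gt; cutoff
table(is.doublet)
```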
summary(hash.stats$Doublet) ## Mode FALSE TRUE ## logical 18742 3038 colors &lt;- rep(&quot;grey&quot;, nrow(hash.stats)) colors[hash.stats$Doublet] &lt;- &quot;red&quot; colors[hash.stats$Confident] &lt;- &quot;black&quot; plot(hash.stats$LogFC, hash.stats$LogFC2, xlab=&quot;Log fold-change from best to second HTO&quot;, ylab=&quot;Log fold-change of second HTO over ambient&quot;, col=colors) Figure 8.7: Log-fold change of the second-most abundant HTO over ambient contamination, compared to the log-fold change of the first HTO over the second HTO. Each point represents a cell where potential doublets are shown in red while confidently assigned singlets are shown in black. 8.4.3 Guilt by association for unmarked doublets One obvious limitation of this approach is that doublets of cells marked with the same HTO are not detected. In a simple multiplexing experiment involving \\(N\\) samples with similar numbers of cells, we would expect around \\(1/N\\) of all doublets to involve cells from the same sample. For typical values of \\(N\\) of 5 to 12, this may still be enough to cause the formation of misleading doublet clusters even after the majority of known doublets are removed. To avoid this, we recover the remaining intra-sample doublets based on their similarity with known doublets in gene expression space (hence, “guilt by association”). We illustrate by loading the gene expression data for this study: sce.hash &lt;- StoeckiusHashingData(mode=&quot;human&quot;) # Subsetting to all barcodes detected as cells. Requires an intersection, # because `hto.sce` and `sce.hash` are not the same dimensions! common &lt;- intersect(colnames(sce.hash), rownames(hash.stats)) sce.hash &lt;- sce.hash[,common] colData(sce.hash) &lt;- hash.stats[common,] sce.hash ## class: SingleCellExperiment ## dim: 27679 20828 ## metadata(0): ## assays(1): counts ## rownames(27679): A1BG A1BG-AS1 ... 
hsa-mir-8072 snoU2-30 ## rowData names(0): ## colnames(20828): ACTGCTCAGGTGTTAA ATGAGGGAGATGTTAG ... CACCAGGCACACAGAG ## CTCGGAGTCTAACTCT ## colData names(7): Total Best ... Doublet Confident ## reducedDimNames(0): ## mainExpName: NULL ## altExpNames(0): For each cell, we calculate the proportion of its nearest neighbors that are known doublets. Intra-sample doublets should have high proportions under the assumption that their gene expression profiles are similar to inter-sample doublets involving the same combination of cell states/types. Unlike in Section 8.3, the use of experimentally derived doublet calls avoids any assumptions about the relative quantity of total RNA or the probability of doublet formation across different cell types. # Performing a quick-and-dirty analysis to get some PCs to use # for nearest neighbor detection inside recoverDoublets(). library(scran) sce.hash &lt;- logNormCounts(sce.hash) dec.hash &lt;- modelGeneVar(sce.hash) top.hash &lt;- getTopHVGs(dec.hash, n=1000) set.seed(1011110) sce.hash &lt;- runPCA(sce.hash, subset_row=top.hash, ncomponents=20) # Recovering the intra-sample doublets: hashed.doublets &lt;- recoverDoublets(sce.hash, use.dimred=&quot;PCA&quot;, doublets=sce.hash$Doublet, samples=table(sce.hash$Best)) hashed.doublets ## DataFrame with 20828 rows and 3 columns ## proportion known predicted ## &lt;numeric&gt; &lt;logical&gt; &lt;logical&gt; ## 1 0.10 TRUE FALSE ## 2 0.02 FALSE FALSE ## 3 0.14 FALSE FALSE ## 4 0.08 FALSE FALSE ## 5 0.18 FALSE FALSE ## ... ... ... ... ## 20824 0.04 FALSE FALSE ## 20825 0.04 FALSE FALSE ## 20826 0.02 FALSE FALSE ## 20827 0.04 FALSE FALSE ## 20828 0.10 FALSE FALSE The recoverDoublets() function also returns explicit intra-sample doublet predictions based on the doublet neighbor proportions. Given the distribution of cells across multiplexed samples in samples=, we estimate the fraction of doublets that would not be observed from the HTO counts. 
This is converted into an absolute number based on the number of observed doublets; the top set of libraries with the highest proportions are then marked as intra-sample doublets (Figure 8.8). set.seed(1000101001) sce.hash &lt;- runTSNE(sce.hash, dimred=&quot;PCA&quot;) sce.hash$proportion &lt;- hashed.doublets$proportion sce.hash$predicted &lt;- hashed.doublets$predicted gridExtra::grid.arrange( plotTSNE(sce.hash, colour_by=&quot;proportion&quot;) + ggtitle(&quot;Doublet proportions&quot;), plotTSNE(sce.hash, colour_by=&quot;Doublet&quot;) + ggtitle(&quot;Known doublets&quot;), ggcells(sce.hash) + geom_point(aes(x=TSNE.1, y=TSNE.2), color=&quot;grey&quot;) + geom_point(aes(x=TSNE.1, y=TSNE.2), color=&quot;red&quot;, data=function(x) x[x$predicted,]) + ggtitle(&quot;Predicted intra-sample doublets&quot;), ncol=2 ) Figure 8.8: \\(t\\)-SNE plots for gene expression data from the cell hashing study, where each point is a cell and is colored by the doublet proportion (top left), whether or not it is a known inter-sample doublet (top right) and whether it is a predicted intra-sample doublet (bottom left). As an aside, it is worth noting that even known doublets may not necessarily have high doublet neighbor proportions. This is typically observed for doublets involving cells of the same type or state, which are effectively intermixed in gene expression space with the corresponding singlets. The latter are much more abundant in most (well-controlled) experiments, which results in low proportions for the doublets involved (Figure 8.9). This effect can generally be ignored given the mostly harmless nature of these doublets. 
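The doublet neighbor proportion at the heart of recoverDoublets() - the fraction of each cell's k nearest neighbors that are known doublets - can be sketched in base R. This toy version uses a brute-force distance matrix on synthetic coordinates; the real function uses efficient neighbor search on the PCs.

```r
# Toy sketch of the doublet neighbor proportion: for each cell, the
# fraction of its k nearest neighbors flagged as known doublets.
# Coordinates and doublet flags below are synthetic.
set.seed(7)
coords &lt;- matrix(rnorm(200 * 2), ncol=2)  # 200 cells in 2 dimensions
known &lt;- rbinom(200, 1, 0.1) == 1         # ~10% known inter-sample doublets
k &lt;- 10
d &lt;- as.matrix(dist(coords))
diag(d) &lt;- Inf                            # a cell is not its own neighbor
prop &lt;- apply(d, 1, function(x) mean(known[order(x)[seq_len(k)]]))
summary(prop)
```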
state &lt;- ifelse(hashed.doublets$predicted, &quot;predicted&quot;, ifelse(hashed.doublets$known, &quot;known&quot;, &quot;singlet&quot;)) ggplot(as.data.frame(hashed.doublets)) + geom_violin(aes(x=state, y=proportion)) Figure 8.9: Distribution of doublet neighbor proportions for all cells in the cell hashing study, stratified by doublet detection status. 8.5 Further comments Doublet detection procedures should only be applied to libraries generated in the same experimental batch. It is obviously impossible for doublets to form between two cells that were captured separately. Thus, some understanding of the experimental design is required prior to the use of the above functions. This avoids unnecessary concerns about the validity of batch-specific clusters that cannot possibly consist of doublets. It is also difficult to interpret doublet predictions in data containing cellular trajectories. By definition, cells in the middle of a trajectory are always intermediate between other cells and are liable to be incorrectly detected as doublets. Some protection is provided by the non-linear nature of many real trajectories, which reduces the risk of simulated doublets coinciding with real cells in computeDoubletDensity(). One can also put more weight on the relative library sizes in findDoubletClusters() instead of relying solely on N, under the assumption that sudden spikes in RNA content are unlikely in a continuous biological process. The best solution to the doublet problem is experimental - that is, to avoid generating them in the first place. This should be a consideration when designing scRNA-seq experiments, where the desire to obtain large numbers of cells at minimum cost should be weighed against the general deterioration in data quality and reliability when doublets become more frequent. 
Session Info View session info R version 4.4.2 (2024-10-31) Platform: x86_64-pc-linux-gnu Running under: Ubuntu 24.04.1 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C time zone: America/New_York tzcode source: system (glibc) attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] DropletUtils_1.26.0 scRNAseq_2.20.0 [3] BiocSingular_1.22.0 scran_1.34.0 [5] scater_1.34.0 ggplot2_3.5.1 [7] scuttle_1.16.0 scDblFinder_1.20.0 [9] SingleCellExperiment_1.28.1 SummarizedExperiment_1.36.0 [11] Biobase_2.66.0 GenomicRanges_1.58.0 [13] GenomeInfoDb_1.42.1 IRanges_2.40.1 [15] S4Vectors_0.44.0 BiocGenerics_0.52.0 [17] MatrixGenerics_1.18.1 matrixStats_1.5.0 [19] BiocStyle_2.34.0 rebook_1.16.0 loaded via a namespace (and not attached): [1] BiocIO_1.16.0 bitops_1.0-9 [3] filelock_1.0.3 tibble_3.2.1 [5] R.oo_1.27.0 CodeDepends_0.6.6 [7] graph_1.84.1 XML_3.99-0.18 [9] lifecycle_1.0.4 httr2_1.1.0 [11] edgeR_4.4.1 lattice_0.22-6 [13] ensembldb_2.30.0 MASS_7.3-64 [15] alabaster.base_1.6.1 magrittr_2.0.3 [17] limma_3.62.2 sass_0.4.9 [19] rmarkdown_2.29 jquerylib_0.1.4 [21] yaml_2.3.10 metapod_1.14.0 [23] cowplot_1.1.3 DBI_1.2.3 [25] RColorBrewer_1.1-3 abind_1.4-8 [27] zlibbioc_1.52.0 Rtsne_0.17 [29] purrr_1.0.2 R.utils_2.12.3 [31] AnnotationFilter_1.30.0 RCurl_1.98-1.16 [33] rappdirs_0.3.3 GenomeInfoDbData_1.2.13 [35] ggrepel_0.9.6 irlba_2.3.5.1 [37] alabaster.sce_1.6.0 pheatmap_1.0.12 [39] dqrng_0.4.1 DelayedMatrixStats_1.28.1 [41] codetools_0.2-20 DelayedArray_0.32.0 [43] tidyselect_1.2.1 UCSC.utils_1.2.0 [45] farver_2.1.2 ScaledMatrix_1.14.0 [47] viridis_0.6.5 BiocFileCache_2.14.0 
[49] GenomicAlignments_1.42.0 jsonlite_1.8.9 [51] BiocNeighbors_2.0.1 tools_4.4.2 [53] Rcpp_1.0.14 glue_1.8.0 [55] gridExtra_2.3 SparseArray_1.6.1 [57] xfun_0.50 dplyr_1.1.4 [59] HDF5Array_1.34.0 gypsum_1.2.0 [61] withr_3.0.2 BiocManager_1.30.25 [63] fastmap_1.2.0 rhdf5filters_1.18.0 [65] bluster_1.16.0 digest_0.6.37 [67] rsvd_1.0.5 R6_2.5.1 [69] mime_0.12 colorspace_2.1-1 [71] RSQLite_2.3.9 R.methodsS3_1.8.2 [73] generics_0.1.3 data.table_1.16.4 [75] rtracklayer_1.66.0 httr_1.4.7 [77] S4Arrays_1.6.0 pkgconfig_2.0.3 [79] gtable_0.3.6 blob_1.2.4 [81] XVector_0.46.0 htmltools_0.5.8.1 [83] bookdown_0.42 ProtGenerics_1.38.0 [85] scales_1.3.0 alabaster.matrix_1.6.1 [87] png_0.1-8 knitr_1.49 [89] rjson_0.2.23 curl_6.1.0 [91] cachem_1.1.0 rhdf5_2.50.2 [93] BiocVersion_3.20.0 parallel_4.4.2 [95] vipor_0.4.7 AnnotationDbi_1.68.0 [97] restfulr_0.0.15 pillar_1.10.1 [99] grid_4.4.2 alabaster.schemas_1.6.0 [101] vctrs_0.6.5 dbplyr_2.5.0 [103] beachmat_2.22.0 cluster_2.1.8 [105] beeswarm_0.4.0 evaluate_1.0.3 [107] GenomicFeatures_1.58.0 cli_3.6.3 [109] locfit_1.5-9.10 compiler_4.4.2 [111] Rsamtools_2.22.0 rlang_1.1.5 [113] crayon_1.5.3 labeling_0.4.3 [115] ggbeeswarm_0.7.2 viridisLite_0.4.2 [117] alabaster.se_1.6.0 BiocParallel_1.40.0 [119] munsell_0.5.1 Biostrings_2.74.1 [121] lazyeval_0.2.2 Matrix_1.7-1 [123] dir.expiry_1.14.0 ExperimentHub_2.14.0 [125] sparseMatrixStats_1.18.0 bit64_4.6.0-1 [127] Rhdf5lib_1.28.0 KEGGREST_1.46.0 [129] statmod_1.5.0 alabaster.ranges_1.6.0 [131] AnnotationHub_3.14.0 igraph_2.1.3 [133] memoise_2.0.1 bslib_0.8.0 [135] bit_4.5.0.1 xgboost_1.7.8.1 References "],["cell-cycle-assignment.html", "Chapter 9 Cell cycle assignment 9.1 Motivation 9.2 Using the cyclins 9.3 Using reference profiles 9.4 Using the cyclone() classifier 9.5 Removing cell cycle effects Session Info", " Chapter 9 Cell cycle assignment .rebook-collapse { background-color: #eee; color: #444; cursor: pointer; padding: 18px; width: 100%; border: none; text-align: left; outline: none; 
font-size: 15px; } .rebook-content { padding: 0 18px; display: none; overflow: hidden; background-color: #f1f1f1; } 9.1 Motivation On occasion, it can be desirable to determine cell cycle activity from scRNA-seq data. In and of itself, the distribution of cells across phases of the cell cycle is not usually informative, but we can use this to determine if there are differences in proliferation between subpopulations or across treatment conditions. Many of the key events in the cell cycle (e.g., passage through checkpoints) are driven by post-translational mechanisms and thus not directly visible in transcriptomic data; nonetheless, there are enough changes in expression that can be exploited to determine cell cycle phase. We demonstrate using the 416B dataset, which is known to contain actively cycling cells after oncogene induction. View set-up code (Workflow Chapter 1) #--- loading ---# library(scRNAseq) sce.416b &lt;- LunSpikeInData(which=&quot;416b&quot;) sce.416b$block &lt;- factor(sce.416b$block) #--- gene-annotation ---# library(AnnotationHub) ens.mm.v97 &lt;- AnnotationHub()[[&quot;AH73905&quot;]] rowData(sce.416b)$ENSEMBL &lt;- rownames(sce.416b) rowData(sce.416b)$SYMBOL &lt;- mapIds(ens.mm.v97, keys=rownames(sce.416b), keytype=&quot;GENEID&quot;, column=&quot;SYMBOL&quot;) rowData(sce.416b)$SEQNAME &lt;- mapIds(ens.mm.v97, keys=rownames(sce.416b), keytype=&quot;GENEID&quot;, column=&quot;SEQNAME&quot;) library(scater) rownames(sce.416b) &lt;- uniquifyFeatureNames(rowData(sce.416b)$ENSEMBL, rowData(sce.416b)$SYMBOL) #--- quality-control ---# mito &lt;- which(rowData(sce.416b)$SEQNAME==&quot;MT&quot;) stats &lt;- perCellQCMetrics(sce.416b, subsets=list(Mt=mito)) qc &lt;- quickPerCellQC(stats, percent_subsets=c(&quot;subsets_Mt_percent&quot;, &quot;altexps_ERCC_percent&quot;), batch=sce.416b$block) sce.416b &lt;- sce.416b[,!qc$discard] #--- normalization ---# library(scran) sce.416b &lt;- computeSumFactors(sce.416b) sce.416b &lt;- logNormCounts(sce.416b) 
#--- variance-modelling ---# dec.416b &lt;- modelGeneVarWithSpikes(sce.416b, &quot;ERCC&quot;, block=sce.416b$block) chosen.hvgs &lt;- getTopHVGs(dec.416b, prop=0.1) #--- batch-correction ---# library(limma) assay(sce.416b, &quot;corrected&quot;) &lt;- removeBatchEffect(logcounts(sce.416b), design=model.matrix(~sce.416b$phenotype), batch=sce.416b$block) #--- dimensionality-reduction ---# sce.416b &lt;- runPCA(sce.416b, ncomponents=10, subset_row=chosen.hvgs, exprs_values=&quot;corrected&quot;, BSPARAM=BiocSingular::ExactParam()) set.seed(1010) sce.416b &lt;- runTSNE(sce.416b, dimred=&quot;PCA&quot;, perplexity=10) #--- clustering ---# my.dist &lt;- dist(reducedDim(sce.416b, &quot;PCA&quot;)) my.tree &lt;- hclust(my.dist, method=&quot;ward.D2&quot;) library(dynamicTreeCut) my.clusters &lt;- unname(cutreeDynamic(my.tree, distM=as.matrix(my.dist), minClusterSize=10, verbose=0)) colLabels(sce.416b) &lt;- factor(my.clusters) sce.416b ## class: SingleCellExperiment ## dim: 46604 185 ## metadata(0): ## assays(3): counts logcounts corrected ## rownames(46604): 4933401J01Rik Gm26206 ... CAAA01147332.1 ## CBFB-MYH11-mcherry ## rowData names(4): Length ENSEMBL SYMBOL SEQNAME ## colnames(185): SLX-9555.N701_S502.C89V9ANXX.s_1.r_1 ## SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 ... ## SLX-11312.N712_S507.H5H5YBBXX.s_8.r_1 ## SLX-11312.N712_S517.H5H5YBBXX.s_8.r_1 ## colData names(10): cell line cell type ... sizeFactor label ## reducedDimNames(2): PCA TSNE ## mainExpName: endogenous ## altExpNames(2): ERCC SIRV 9.2 Using the cyclins The cyclins control progression through the cell cycle and have well-characterized patterns of expression across cell cycle phases. Cyclin D is expressed throughout but peaks at G1; cyclin E is expressed highest in the G1/S transition; cyclin A is expressed across S and G2; and cyclin B is expressed highest in late G2 and mitosis (Morgan 2007). The expression of cyclins can help to determine the relative cell cycle activity in each cluster (Figure 9.1). 
For example, most cells in cluster 1 are likely to be in G1 while the other clusters are scattered across the later phases. library(scater) cyclin.genes &lt;- grep(&quot;^Ccn[abde][0-9]$&quot;, rowData(sce.416b)$SYMBOL) cyclin.genes &lt;- rownames(sce.416b)[cyclin.genes] cyclin.genes ## [1] &quot;Ccnb3&quot; &quot;Ccna2&quot; &quot;Ccna1&quot; &quot;Ccne2&quot; &quot;Ccnd2&quot; &quot;Ccne1&quot; &quot;Ccnd1&quot; &quot;Ccnb2&quot; &quot;Ccnb1&quot; ## [10] &quot;Ccnd3&quot; plotHeatmap(sce.416b, order_columns_by=&quot;label&quot;, cluster_rows=FALSE, features=sort(cyclin.genes)) Figure 9.1: Heatmap of the log-normalized expression values of the cyclin genes in the 416B dataset. Each column represents a cell that is sorted by the cluster of origin. We quantify these observations with standard DE methods (Basic Chapter 6) to test for upregulation of each cyclin between clusters, which would imply that a subpopulation contains more cells in the corresponding cell cycle phase. The same logic applies to comparisons between treatment conditions as described in Multi-sample Chapter 4. For example, we can infer that cluster 4 has the highest proportion of cells in the S and G2 phases based on higher expression of cyclins A2 and B1, respectively. 
library(scran) markers &lt;- findMarkers(sce.416b, subset.row=cyclin.genes, test.type=&quot;wilcox&quot;, direction=&quot;up&quot;) markers[[4]] ## DataFrame with 10 rows and 7 columns ## Top p.value FDR summary.AUC AUC.1 AUC.2 ## &lt;integer&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## Ccna2 1 4.47082e-09 4.47082e-08 0.996337 0.996337 0.641822 ## Ccnd1 1 2.27713e-04 5.69283e-04 0.822981 0.368132 0.822981 ## Ccnb1 1 1.19027e-07 5.95137e-07 0.949634 0.949634 0.519669 ## Ccnb2 2 3.87799e-07 1.29266e-06 0.934066 0.934066 0.781573 ## Ccna1 4 2.96992e-02 5.93985e-02 0.535714 0.535714 0.495342 ## Ccne2 5 6.56983e-02 1.09497e-01 0.641941 0.641941 0.447205 ## Ccne1 6 5.85979e-01 8.37113e-01 0.564103 0.564103 0.366460 ## Ccnd3 7 9.94578e-01 1.00000e+00 0.402930 0.402930 0.283644 ## Ccnd2 8 9.99993e-01 1.00000e+00 0.306548 0.134615 0.327122 ## Ccnb3 10 1.00000e+00 1.00000e+00 0.500000 0.500000 0.500000 ## AUC.3 ## &lt;numeric&gt; ## Ccna2 0.925595 ## Ccnd1 0.776786 ## Ccnb1 0.934524 ## Ccnb2 0.898810 ## Ccna1 0.535714 ## Ccne2 0.455357 ## Ccne1 0.473214 ## Ccnd3 0.273810 ## Ccnd2 0.306548 ## Ccnb3 0.500000 While straightforward to implement and interpret, this approach assumes that cyclin expression is unaffected by biological processes other than the cell cycle. This is a strong assumption in highly heterogeneous populations where cyclins may perform cell-type-specific roles. For example, using the Grun HSC dataset (Grun et al. 2016), we see an upregulation of cyclin D2 in sorted HSCs (Figure 9.2) that is consistent with a particular reliance on D-type cyclins in these cells (Steinman 2002; Kozar et al. 2004). Similar arguments apply to other genes with annotated functions in cell cycle, e.g., from relevant Gene Ontology terms. 
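The AUC columns in the findMarkers() output above have a direct interpretation: the probability that a randomly chosen cell from the cluster of interest expresses the gene more highly than a randomly chosen cell from the other cluster, which is just the Wilcoxon rank-sum statistic scaled by the number of pairs. A self-contained sketch with synthetic expression values:

```r
# The AUC is the fraction of between-group pairs where the first group's
# value exceeds the second's (ties counted as half). The two vectors
# below are synthetic log-expression values.
set.seed(9)
in.cluster &lt;- rnorm(50, mean=2)  # higher expression in the cluster of interest
out.cluster &lt;- rnorm(80, mean=0)
cmp &lt;- outer(in.cluster, out.cluster, "&gt;") +
    0.5 * outer(in.cluster, out.cluster, "==")
auc &lt;- mean(cmp)
auc # well above 0.5, i.e. consistent upregulation
```

An AUC of 0.5 indicates no separation, matching the uninformative rows (e.g., Ccnb3) in the table above.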
View set-up code (Workflow Chapter 9) #--- data-loading ---# library(scRNAseq) sce.grun.hsc &lt;- GrunHSCData(ensembl=TRUE) #--- gene-annotation ---# library(AnnotationHub) ens.mm.v97 &lt;- AnnotationHub()[[&quot;AH73905&quot;]] anno &lt;- select(ens.mm.v97, keys=rownames(sce.grun.hsc), keytype=&quot;GENEID&quot;, columns=c(&quot;SYMBOL&quot;, &quot;SEQNAME&quot;)) rowData(sce.grun.hsc) &lt;- anno[match(rownames(sce.grun.hsc), anno$GENEID),] #--- quality-control ---# library(scuttle) stats &lt;- perCellQCMetrics(sce.grun.hsc) qc &lt;- quickPerCellQC(stats, batch=sce.grun.hsc$protocol, subset=grepl(&quot;sorted&quot;, sce.grun.hsc$protocol)) sce.grun.hsc &lt;- sce.grun.hsc[,!qc$discard] #--- normalization ---# library(scran) set.seed(101000110) clusters &lt;- quickCluster(sce.grun.hsc) sce.grun.hsc &lt;- computeSumFactors(sce.grun.hsc, clusters=clusters) sce.grun.hsc &lt;- logNormCounts(sce.grun.hsc) #--- variance-modelling ---# set.seed(00010101) dec.grun.hsc &lt;- modelGeneVarByPoisson(sce.grun.hsc) top.grun.hsc &lt;- getTopHVGs(dec.grun.hsc, prop=0.1) #--- dimensionality-reduction ---# set.seed(101010011) sce.grun.hsc &lt;- denoisePCA(sce.grun.hsc, technical=dec.grun.hsc, subset.row=top.grun.hsc) sce.grun.hsc &lt;- runTSNE(sce.grun.hsc, dimred=&quot;PCA&quot;) #--- clustering ---# snn.gr &lt;- buildSNNGraph(sce.grun.hsc, use.dimred=&quot;PCA&quot;) colLabels(sce.grun.hsc) &lt;- factor(igraph::cluster_walktrap(snn.gr)$membership) # Switching the row names for a nicer plot. rownames(sce.grun.hsc) &lt;- uniquifyFeatureNames(rownames(sce.grun.hsc), rowData(sce.grun.hsc)$SYMBOL) cyclin.genes &lt;- grep(&quot;^Ccn[abde][0-9]$&quot;, rowData(sce.grun.hsc)$SYMBOL) cyclin.genes &lt;- rownames(sce.grun.hsc)[cyclin.genes] plotHeatmap(sce.grun.hsc, order_columns_by=&quot;label&quot;, cluster_rows=FALSE, features=sort(cyclin.genes), colour_columns_by=&quot;protocol&quot;) Figure 9.2: Heatmap of the log-normalized expression values of the cyclin genes in the Grun HSC dataset. 
Each column represents a cell that is sorted by the cluster of origin and extraction protocol. Admittedly, this is merely a symptom of a more fundamental issue - that the cell cycle is not independent of the other processes that are occurring in a cell. This will be a recurring theme throughout the chapter, which suggests that cell cycle inferences are best used in comparisons between closely related cell types where there are fewer changes elsewhere that might interfere with interpretation. 9.3 Using reference profiles Cell cycle assignment can be considered a specialized case of cell annotation, which suggests that the strategies described in Basic Chapter 7 can also be applied here. Given a reference dataset containing cells of known cell cycle phase, we could use methods like SingleR to determine the phase of each cell in a test dataset. We demonstrate on a reference of mouse ESCs from Buettner et al. (2015) that were sorted by cell cycle phase prior to scRNA-seq. library(scRNAseq) sce.ref &lt;- BuettnerESCData() sce.ref &lt;- logNormCounts(sce.ref) sce.ref ## class: SingleCellExperiment ## dim: 38293 288 ## metadata(0): ## assays(2): counts logcounts ## rownames(38293): ENSMUSG00000000001 ENSMUSG00000000003 ... ## ENSMUSG00000097934 ENSMUSG00000097935 ## rowData names(3): EnsemblTranscriptID AssociatedGeneName GeneLength ## colnames(288): G1_cell1_count G1_cell2_count ... G2M_cell95_count ## G2M_cell96_count ## colData names(3): phase metrics sizeFactor ## reducedDimNames(0): ## mainExpName: gene ## altExpNames(1): ERCC We will restrict the annotation process to a subset of genes with a priori known roles in cell cycle. This aims to avoid detecting markers for other biological processes that happen to be correlated with the cell cycle in the reference dataset, which would reduce classification performance if those processes are absent or uncorrelated in the test dataset. # Find genes that are cell cycle-related. 
library(org.Mm.eg.db) cycle.anno &lt;- select(org.Mm.eg.db, keytype=&quot;GOALL&quot;, keys=&quot;GO:0007049&quot;, columns=&quot;ENSEMBL&quot;)[,&quot;ENSEMBL&quot;] str(cycle.anno) ## chr [1:3321] &quot;ENSMUSG00000026842&quot; &quot;ENSMUSG00000026842&quot; ... We use the SingleR() function to assign labels to the 416B data based on the cell cycle phases in the ESC reference. Cluster 1 mostly consists of G1 cells while the other clusters have more cells in the other phases, which is broadly consistent with our conclusions from the cyclin-based analysis. Unlike the cyclin-based analysis, this approach yields “absolute” assignments of cell cycle phase that do not need to be interpreted relative to other cells in the same dataset. # Switching row names back to Ensembl to match the reference. test.data &lt;- logcounts(sce.416b) rownames(test.data) &lt;- rowData(sce.416b)$ENSEMBL library(SingleR) assignments &lt;- SingleR(test.data, ref=sce.ref, label=sce.ref$phase, de.method=&quot;wilcox&quot;, restrict=cycle.anno) tab &lt;- table(assignments$labels, colLabels(sce.416b)) tab ## ## 1 2 3 4 ## G1 61 4 13 0 ## G2M 3 63 2 13 ## S 14 2 9 1 The key assumption here is that the cell cycle effect is orthogonal to other aspects of biological heterogeneity like cell type. This justifies the use of a reference involving cell types that are quite different from the cells in the test dataset, provided that the cell cycle transcriptional program is conserved across datasets (Bertoli, Skotheim, and Bruin 2013; Conboy et al. 2007). However, it is not difficult to find holes in this reasoning - for example, Lef1 is detected as one of the top markers to distinguish G1 from G2/M in the reference but has no detectable expression in the 416B dataset (Figure 9.3). 
More generally, non-orthogonality can introduce biases where, e.g., one cell type is consistently misclassified as being in a particular phase because it happens to be more similar to that phase’s profile in the reference. gridExtra::grid.arrange( plotExpression(sce.ref, features=&quot;ENSMUSG00000027985&quot;, x=&quot;phase&quot;), plotExpression(sce.416b, features=&quot;Lef1&quot;, x=&quot;label&quot;), ncol=2) Figure 9.3: Distribution of log-normalized expression values for Lef1 in the reference dataset (left) and in the 416B dataset (right). Thus, a healthy dose of skepticism is required when interpreting these assignments. Our hope is that any systematic assignment error is consistent across clusters and conditions such that it cancels out in comparisons of phase frequencies, which is the more interesting analysis anyway. Indeed, while the availability of absolute phase calls may be more appealing, it may not make much practical difference to the conclusions if the frequencies are ultimately interpreted in a relative sense (e.g., using a chi-squared test). # Test for differences in phase distributions between clusters 1 and 2. chisq.test(tab[,1:2]) ## ## Pearson&#39;s Chi-squared test ## ## data: tab[, 1:2] ## X-squared = 113, df = 2, p-value &lt;2e-16 9.4 Using the cyclone() classifier The method described by Scialdone et al. (2015) is yet another approach for classifying cells into cell cycle phases. Using a reference dataset, we first compute the sign of the difference in expression between each pair of genes. Pairs with changes in the sign across cell cycle phases are chosen as markers. Cells in a test dataset can then be classified into the appropriate phase, based on whether the observed sign for each marker pair is consistent with one phase or another. This approach is implemented in the cyclone() function from the scran package, which also contains a pre-trained set of marker pairs for mouse and human data.
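Before running the real classifier, the pair-based scoring logic can be illustrated with a toy sketch. Everything below (the gene names, the pairs, the cell) is made up for illustration and does not reflect cyclone()'s actual implementation, which also involves randomization to obtain a score between 0 and 1.

```r
# Toy sketch: for each marker pair (first, second) chosen because
# sign(first - second) is stable within a phase in the reference,
# score a cell by the fraction of pairs whose observed sign matches.
score_phase <- function(expr, pairs) {
    # 'expr' is a named vector of expression values for one cell;
    # 'pairs' is a data.frame with columns 'first' and 'second'.
    mean(expr[pairs$first] > expr[pairs$second])
}

g1.pairs <- data.frame(first=c("geneA", "geneC"), second=c("geneB", "geneD"))
cell <- c(geneA=5, geneB=1, geneC=3, geneD=0)
score_phase(cell, g1.pairs)  # both pairs consistent with G1, so the score is 1
```

A cell is then assigned to the phase whose marker pairs it agrees with most strongly.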
set.seed(100) library(scran) mm.pairs &lt;- readRDS(system.file(&quot;exdata&quot;, &quot;mouse_cycle_markers.rds&quot;, package=&quot;scran&quot;)) # Using Ensembl IDs to match up with the annotation in &#39;mm.pairs&#39;. assignments &lt;- cyclone(sce.416b, mm.pairs, gene.names=rowData(sce.416b)$ENSEMBL) The phase assignment result for each cell in the 416B dataset is shown in Figure 9.4. For each cell, a higher score for a phase corresponds to a higher probability that the cell is in that phase. We focus on the G1 and G2/M scores as these are the most informative for classification. plot(assignments$score$G1, assignments$score$G2M, xlab=&quot;G1 score&quot;, ylab=&quot;G2/M score&quot;, pch=16) Figure 9.4: Cell cycle phase scores from applying the pair-based classifier on the 416B dataset. Each point represents a cell, plotted according to its scores for G1 and G2/M phases. Cells are classified as being in G1 phase if the G1 score is above 0.5 and greater than the G2/M score; in G2/M phase if the G2/M score is above 0.5 and greater than the G1 score; and in S phase if neither score is above 0.5. We see that the results are quite similar to those from SingleR(), which is reassuring. table(assignments$phases, colLabels(sce.416b)) ## ## 1 2 3 4 ## G1 74 8 20 0 ## G2M 1 48 0 13 ## S 3 13 4 1 The same considerations and caveats described for the SingleR-based approach are also applicable here. From a practical perspective, cyclone() takes much longer but does not require an explicit reference as the marker pairs are already computed. 9.5 Removing cell cycle effects 9.5.1 Comments For some time, it was popular to regress out the cell cycle phase prior to downstream analyses like clustering. The aim was to remove uninteresting variation due to cell cycle, thus improving resolution of other biological processes. With the benefit of hindsight, we do not consider cell cycle adjustment to be necessary for routine applications. 
In most scenarios, the cell cycle is a minor factor of variation, secondary to stronger factors like cell type identity. Moreover, most strategies for removal run into problems when cell cycle activity varies across cell types or conditions; this is not uncommon with, e.g., increased proliferation of T cells upon activation (Richard et al. 2018), changes in cell cycle phase progression across development (Roccio et al. 2013) and correlations between cell cycle and fate decisions (Soufi and Dalton 2016). Nonetheless, we will discuss some approaches for mitigating the cell cycle effect in this section. 9.5.2 With linear regression and friends Here, we treat each phase as a separate batch and apply any of the batch correction strategies described in Multi-sample Chapter 1. The most common approach is to use a linear model to simply regress out any effect associated with the assigned phases, as shown below in Figure 9.5 via regressBatches(). Similarly, any functions that support blocking can use the phase assignments as a blocking factor, e.g., block= in modelGeneVarWithSpikes(). library(batchelor) dec.nocycle &lt;- modelGeneVarWithSpikes(sce.416b, &quot;ERCC&quot;, block=assignments$phases) reg.nocycle &lt;- regressBatches(sce.416b, batch=assignments$phases) set.seed(100011) reg.nocycle &lt;- runPCA(reg.nocycle, exprs_values=&quot;corrected&quot;, subset_row=getTopHVGs(dec.nocycle, prop=0.1)) # Shape points by induction status. relabel &lt;- c(&quot;onco&quot;, &quot;WT&quot;)[factor(sce.416b$phenotype)] scaled &lt;- scale_shape_manual(values=c(onco=4, WT=16)) gridExtra::grid.arrange( plotPCA(sce.416b, colour_by=I(assignments$phases), shape_by=I(relabel)) + ggtitle(&quot;Before&quot;) + scaled, plotPCA(reg.nocycle, colour_by=I(assignments$phases), shape_by=I(relabel)) + ggtitle(&quot;After&quot;) + scaled, ncol=2 ) Figure 9.5: PCA plots before and after regressing out the cell cycle effect in the 416B dataset, based on the phase assignments from cyclone(). 
Each point is a cell and is colored by its inferred phase and shaped by oncogene induction status. Alternatively, one could regress on the classification scores to account for any ambiguity in assignment. An example using cyclone() scores is shown below in Figure 9.6 but the same procedure can be used with any classification step that yields some confidence per label - for example, the correlation-based scores from SingleR(). design &lt;- model.matrix(~as.matrix(assignments$scores)) dec.nocycle2 &lt;- modelGeneVarWithSpikes(sce.416b, &quot;ERCC&quot;, design=design) reg.nocycle2 &lt;- regressBatches(sce.416b, design=design) set.seed(100011) reg.nocycle2 &lt;- runPCA(reg.nocycle2, exprs_values=&quot;corrected&quot;, subset_row=getTopHVGs(dec.nocycle2, prop=0.1)) plotPCA(reg.nocycle2, colour_by=I(assignments$phases), point_size=3, shape_by=I(relabel)) + scaled Figure 9.6: PCA plot on the residuals after regression on the cell cycle phase scores from cyclone() in the 416B dataset. Each point is a cell and is colored by its inferred phase and shaped by oncogene induction status. The main assumption of regression is that the cell cycle is consistent across different aspects of cellular heterogeneity (Multi-sample Section 1.5). In particular, we assume that each cell type contains the same distribution of cells across phases as well as a constant magnitude of the cell cycle effect on expression. Violations will lead to incomplete removal or, at worst, overcorrection that introduces spurious signal - even in the absence of any cell cycle effect! For example, if two subpopulations differ in their cell cycle phase distribution, regression will always apply a non-zero adjustment to all DE genes between those subpopulations. If this type of adjustment is truly necessary, it is safest to apply it separately to the subset of cells in each cluster. This weakens the consistency assumptions as we do not require the same behavior across all cell types in the population. 
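A per-cluster adjustment can be sketched as follows, assuming the sce.416b object and the cyclone() assignments from the code above. This is only an outline of the strategy; in particular, the corrected values from different clusters are no longer directly comparable to each other.

```r
# Sketch only: regress out the phase assignments within each cluster
# separately, so the consistency assumption only needs to hold
# within (not across) clusters.
library(batchelor)

per.cluster <- lapply(split(seq_len(ncol(sce.416b)), colLabels(sce.416b)),
    function(idx) {
        phase <- assignments$phases[idx]
        if (length(unique(phase)) < 2L) {
            return(sce.416b[, idx]) # nothing to regress out here.
        }
        regressBatches(sce.416b[, idx], batch=phase)
    })
# 'per.cluster' now holds one corrected object per cluster.
```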
Alternatively, we could use other methods that are more robust to differences in composition (Figure 9.7), though this becomes somewhat complicated if we want to correct for both cell cycle and batch at the same time. Gene-based analyses should use the uncorrected data with blocking where possible (Multi-sample Chapter 3), which provides a sanity check that protects against distortions introduced by the adjustment. set.seed(100011) reg.nocycle3 &lt;- fastMNN(sce.416b, batch=assignments$phases) plotReducedDim(reg.nocycle3, dimred=&quot;corrected&quot;, point_size=3, colour_by=I(assignments$phases), shape_by=I(relabel)) + scaled Figure 9.7: Plot of the corrected PCs after applying fastMNN() with respect to the cell cycle phase assignments from cyclone() in the 416B dataset. Each point is a cell and is colored by its inferred phase and shaped by oncogene induction status. 9.5.3 Removing cell cycle-related genes A gentler alternative to regression is to remove the genes that are associated with cell cycle. Here, we compute the percentage of variance explained by the cell cycle phase in the expression profile for each gene, and we remove genes with high percentages from the dataset prior to further downstream analyses. We demonstrate below with the Leng et al. (2015) dataset containing phase-sorted ESCs, where removal of marker genes detected between phases eliminates the separation between G1 and S populations (Figure 9.8). library(scRNAseq) sce.leng &lt;- LengESCData(ensembl=TRUE) # Performing a default analysis without any removal: sce.leng &lt;- logNormCounts(sce.leng, assay.type=&quot;normcounts&quot;) dec.leng &lt;- modelGeneVar(sce.leng) top.hvgs &lt;- getTopHVGs(dec.leng, n=1000) sce.leng &lt;- runPCA(sce.leng, subset_row=top.hvgs) # Identifying the likely cell cycle genes between phases, # using an arbitrary threshold of 5%. 
library(scater) diff &lt;- getVarianceExplained(sce.leng, &quot;Phase&quot;) discard &lt;- diff &gt; 5 summary(discard) ## Phase ## Mode :logical ## FALSE:13027 ## TRUE :2717 ## NA&#39;s :1801 # ... and repeating the PCA without them. top.hvgs2 &lt;- getTopHVGs(dec.leng[which(!discard),], n=1000) sce.nocycle &lt;- runPCA(sce.leng, subset_row=top.hvgs2) fill &lt;- geom_point(pch=21, colour=&quot;grey&quot;) # Color the NA points. gridExtra::grid.arrange( plotPCA(sce.leng, colour_by=&quot;Phase&quot;) + ggtitle(&quot;Before&quot;) + fill, plotPCA(sce.nocycle, colour_by=&quot;Phase&quot;) + ggtitle(&quot;After&quot;) + fill, ncol=2 ) Figure 9.8: PCA plots of the Leng ESC dataset, generated before and after removal of cell cycle-related genes. Each point corresponds to a cell that is colored by the sorted cell cycle phase. The same procedure can also be applied to the inferred phases or classification scores from, e.g., cyclone(). This is demonstrated in Figure 9.9 with our trusty 416B dataset, where the cell cycle variation is removed without sacrificing the differences due to oncogene induction. # Need to wrap the phase vector in a DataFrame: diff &lt;- getVarianceExplained(sce.416b, DataFrame(assignments$phases)) discard &lt;- diff &gt; 5 summary(discard) ## assignments.phases ## Mode :logical ## FALSE:19590 ## TRUE :4207 ## NA&#39;s :22807 set.seed(100011) top.discard &lt;- getTopHVGs(dec.416b[which(!discard),], n=1000) sce.416b.discard &lt;- runPCA(sce.416b, subset_row=top.discard) plotPCA(sce.416b.discard, colour_by=I(assignments$phases), shape_by=I(relabel), point_size=3) + scaled Figure 9.9: PCA plots of the 416B dataset, generated before and after removal of cell cycle-related genes. Each point corresponds to a cell that is colored by the inferred phase and shaped by oncogene induction status. This approach discards any gene with significant cell cycle variation, regardless of how much interesting variation it might also contain from other processes. 
In this respect, it is more conservative than regression as no attempt is made to salvage any information from such genes, possibly resulting in the loss of relevant biological signal. However, gene removal is more amenable to fine-tuning: any lost heterogeneity can be easily identified by examining the discarded genes, and users can choose to recover interesting genes even if they are correlated with known/inferred cell cycle phase. Most importantly, direct removal of genes is much less likely to introduce spurious signal compared to regression when the consistency assumptions are not applicable. 9.5.4 Using contrastive PCA Alternatively, we might consider a more sophisticated approach called contrastive PCA (Abid et al. 2018). This aims to identify patterns that are enriched in our test dataset - in this case, the 416B data - compared to a control dataset in which cell cycle is the dominant factor of variation. We demonstrate below using the scPCA package (Boileau, Hejazi, and Dudoit 2020) where we use the subset of wild-type 416B cells as our control, based on the expectation that an untreated cell line in culture has little else to do but divide. This yields low-dimensional coordinates in which the cell cycle effect within the oncogene-induced and wild-type groups is reduced without removing the difference between groups (Figure 9.10). top.hvgs &lt;- getTopHVGs(dec.416b, p=0.1) wild &lt;- sce.416b$phenotype==&quot;wild type phenotype&quot; set.seed(100) library(scPCA) con.out &lt;- scPCA( target=t(logcounts(sce.416b)[top.hvgs,]), background=t(logcounts(sce.416b)[top.hvgs,wild]), penalties=0, n_eigen=10, contrasts=100) # Visualizing the results in a t-SNE. sce.con &lt;- sce.416b reducedDim(sce.con, &quot;cPCA&quot;) &lt;- con.out$x sce.con &lt;- runTSNE(sce.con, dimred=&quot;cPCA&quot;) # Making the labels easier to read. 
relabel &lt;- c(&quot;onco&quot;, &quot;WT&quot;)[factor(sce.416b$phenotype)] scaled &lt;- scale_color_manual(values=c(onco=&quot;red&quot;, WT=&quot;black&quot;)) gridExtra::grid.arrange( plotTSNE(sce.416b, colour_by=I(assignments$phases)) + ggtitle(&quot;Before (416b)&quot;), plotTSNE(sce.416b, colour_by=I(relabel)) + scaled, plotTSNE(sce.con, colour_by=I(assignments$phases)) + ggtitle(&quot;After (416b)&quot;), plotTSNE(sce.con, colour_by=I(relabel)) + scaled, ncol=2 ) Figure 9.10: \\(t\\)-SNE plots for the 416B dataset before and after contrastive PCA. Each point is a cell and is colored according to its inferred cell cycle phase (left) or oncogene induction status (right). The strength of this approach lies in its ability to accurately remove the cell cycle effect based on its magnitude in the control dataset. This avoids loss of heterogeneity associated with other processes that happen to be correlated with the cell cycle. The requirements for the control dataset are also quite loose - there is no need to know the cell cycle phase of each cell a priori, and indeed, we can manufacture a like-for-like control by subsetting our dataset to a homogeneous cluster in which the only detectable factor of variation is the cell cycle. (See Workflow Chapter 13 for another demonstration of cPCA to remove the cell cycle effect.) In fact, any consistent but uninteresting variation can be eliminated in this manner as long as it is captured by the control. The downside is that the magnitude of variation in the control dataset must accurately reflect that in the test dataset, requiring more care in choosing the former. As a result, the procedure is more sensitive to quantitative differences between datasets compared to SingleR() or cyclone() during cell cycle phase assignment. This makes it difficult to use control datasets from different scRNA-seq technologies or biological systems, as a mismatch in the covariance structure may lead to insufficient or excessive correction. 
At worst, any interesting variation that is inadvertently contained in the control will also be removed. Session Info View session info R version 4.4.2 (2024-10-31) Platform: x86_64-pc-linux-gnu Running under: Ubuntu 24.04.1 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C time zone: America/New_York tzcode source: system (glibc) attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] scPCA_1.20.0 batchelor_1.22.0 [3] bluster_1.16.0 SingleR_2.8.0 [5] org.Mm.eg.db_3.20.0 ensembldb_2.30.0 [7] AnnotationFilter_1.30.0 GenomicFeatures_1.58.0 [9] AnnotationDbi_1.68.0 scRNAseq_2.20.0 [11] scran_1.34.0 scater_1.34.0 [13] ggplot2_3.5.1 scuttle_1.16.0 [15] SingleCellExperiment_1.28.1 SummarizedExperiment_1.36.0 [17] Biobase_2.66.0 GenomicRanges_1.58.0 [19] GenomeInfoDb_1.42.1 IRanges_2.40.1 [21] S4Vectors_0.44.0 BiocGenerics_0.52.0 [23] MatrixGenerics_1.18.1 matrixStats_1.5.0 [25] BiocStyle_2.34.0 rebook_1.16.0 loaded via a namespace (and not attached): [1] BiocIO_1.16.0 bitops_1.0-9 [3] filelock_1.0.3 tibble_3.2.1 [5] CodeDepends_0.6.6 graph_1.84.1 [7] XML_3.99-0.18 lifecycle_1.0.4 [9] httr2_1.1.0 Rdpack_2.6.2 [11] edgeR_4.4.1 globals_0.16.3 [13] lattice_0.22-6 alabaster.base_1.6.1 [15] magrittr_2.0.3 limma_3.62.2 [17] sass_0.4.9 rmarkdown_2.29 [19] jquerylib_0.1.4 yaml_2.3.10 [21] metapod_1.14.0 cowplot_1.1.3 [23] DBI_1.2.3 RColorBrewer_1.1-3 [25] ResidualMatrix_1.16.0 abind_1.4-8 [27] zlibbioc_1.52.0 Rtsne_0.17 [29] purrr_1.0.2 RCurl_1.98-1.16 [31] rappdirs_0.3.3 GenomeInfoDbData_1.2.13 [33] ggrepel_0.9.6 irlba_2.3.5.1 [35] listenv_0.9.1 alabaster.sce_1.6.0 [37] 
pheatmap_1.0.12 RSpectra_0.16-2 [39] parallelly_1.41.0 dqrng_0.4.1 [41] DelayedMatrixStats_1.28.1 codetools_0.2-20 [43] DelayedArray_0.32.0 tidyselect_1.2.1 [45] UCSC.utils_1.2.0 farver_2.1.2 [47] ScaledMatrix_1.14.0 viridis_0.6.5 [49] BiocFileCache_2.14.0 GenomicAlignments_1.42.0 [51] jsonlite_1.8.9 BiocNeighbors_2.0.1 [53] tools_4.4.2 Rcpp_1.0.14 [55] glue_1.8.0 gridExtra_2.3 [57] SparseArray_1.6.1 xfun_0.50 [59] dplyr_1.1.4 HDF5Array_1.34.0 [61] gypsum_1.2.0 withr_3.0.2 [63] BiocManager_1.30.25 fastmap_1.2.0 [65] rhdf5filters_1.18.0 digest_0.6.37 [67] rsvd_1.0.5 R6_2.5.1 [69] mime_0.12 colorspace_2.1-1 [71] RSQLite_2.3.9 generics_0.1.3 [73] data.table_1.16.4 rtracklayer_1.66.0 [75] httr_1.4.7 S4Arrays_1.6.0 [77] pkgconfig_2.0.3 gtable_0.3.6 [79] blob_1.2.4 XVector_0.46.0 [81] htmltools_0.5.8.1 bookdown_0.42 [83] ProtGenerics_1.38.0 scales_1.3.0 [85] alabaster.matrix_1.6.1 png_0.1-8 [87] knitr_1.49 rjson_0.2.23 [89] curl_6.1.0 cachem_1.1.0 [91] rhdf5_2.50.2 stringr_1.5.1 [93] BiocVersion_3.20.0 parallel_4.4.2 [95] vipor_0.4.7 restfulr_0.0.15 [97] pillar_1.10.1 grid_4.4.2 [99] alabaster.schemas_1.6.0 vctrs_0.6.5 [101] origami_1.0.7 BiocSingular_1.22.0 [103] dbplyr_2.5.0 beachmat_2.22.0 [105] cluster_2.1.8 beeswarm_0.4.0 [107] evaluate_1.0.3 cli_3.6.3 [109] locfit_1.5-9.10 compiler_4.4.2 [111] Rsamtools_2.22.0 rlang_1.1.5 [113] crayon_1.5.3 future.apply_1.11.3 [115] labeling_0.4.3 ggbeeswarm_0.7.2 [117] stringi_1.8.4 viridisLite_0.4.2 [119] alabaster.se_1.6.0 BiocParallel_1.40.0 [121] assertthat_0.2.1 munsell_0.5.1 [123] Biostrings_2.74.1 coop_0.6-3 [125] lazyeval_0.2.2 Matrix_1.7-1 [127] dir.expiry_1.14.0 ExperimentHub_2.14.0 [129] future_1.34.0 sparseMatrixStats_1.18.0 [131] bit64_4.6.0-1 Rhdf5lib_1.28.0 [133] KEGGREST_1.46.0 statmod_1.5.0 [135] alabaster.ranges_1.6.0 AnnotationHub_3.14.0 [137] kernlab_0.9-33 rbibutils_2.3 [139] igraph_2.1.3 memoise_2.0.1 [141] bslib_0.8.0 bit_4.5.0.1 References "],["trajectory-analysis.html", "Chapter 10 Trajectory Analysis 10.1 
Overview 10.2 Obtaining pseudotime orderings 10.3 Characterizing trajectories 10.4 Finding the root Session Info", " Chapter 10 Trajectory Analysis 10.1 Overview Many biological processes manifest as a continuum of dynamic changes in the cellular state. The most obvious example is that of differentiation into increasingly specialized cell subtypes, but we might also consider phenomena like the cell cycle or immune cell activation that are accompanied by gradual changes in the cell’s transcriptome. We characterize these processes from single-cell expression data by identifying a “trajectory”, i.e., a path through the high-dimensional expression space that traverses the various cellular states associated with a continuous process like differentiation. In the simplest case, a trajectory will be a simple path from one point to another, but we can also observe more complex trajectories that branch to multiple endpoints. The “pseudotime” is defined as the positioning of cells along the trajectory that quantifies the relative activity or progression of the underlying biological process. For example, the pseudotime for a differentiation trajectory might represent the degree of differentiation from a pluripotent cell to a terminal state where cells with larger pseudotime values are more differentiated. This metric allows us to tackle questions related to the global population structure in a more quantitative manner. The most common application is to fit models to gene expression against the pseudotime to identify the genes responsible for generating the trajectory in the first place, especially around interesting branch events.
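The idea of fitting models to expression against the pseudotime can be sketched with base R. The matrix, pseudotime vector and gene names below are simulated stand-ins; in practice, dedicated functions with appropriate error models (e.g., TSCAN's testPseudotime()) would be used instead.

```r
# Sketch only: test each gene for a trend along a given pseudotime
# by comparing a spline fit against an intercept-only model.
library(splines)

set.seed(42)
pt <- runif(100)                        # pseudotime for 100 simulated cells
mat <- rbind(
    trend=2 * pt + rnorm(100, sd=0.1),  # gene that follows the pseudotime
    flat=rnorm(100, sd=0.1)             # gene with no trend
)

pvals <- apply(mat, 1, function(y) {
    fit <- lm(y ~ ns(pt, df=3))
    # F-test p-value for the spline terms, from the second row of the table.
    anova(lm(y ~ 1), fit)$"Pr(>F)"[2]
})
pvals  # very small for 'trend', large for 'flat'
```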
In this section, we will demonstrate several different approaches to trajectory analysis using the haematopoietic stem cell (HSC) dataset from Nestorowa et al. (2016). View set-up code (Workflow Chapter 10) #--- data-loading ---# library(scRNAseq) sce.nest &lt;- NestorowaHSCData() #--- gene-annotation ---# library(AnnotationHub) ens.mm.v97 &lt;- AnnotationHub()[[&quot;AH73905&quot;]] anno &lt;- select(ens.mm.v97, keys=rownames(sce.nest), keytype=&quot;GENEID&quot;, columns=c(&quot;SYMBOL&quot;, &quot;SEQNAME&quot;)) rowData(sce.nest) &lt;- anno[match(rownames(sce.nest), anno$GENEID),] #--- quality-control ---# library(scater) stats &lt;- perCellQCMetrics(sce.nest) qc &lt;- quickPerCellQC(stats, percent_subsets=&quot;altexps_ERCC_percent&quot;) sce.nest &lt;- sce.nest[,!qc$discard] #--- normalization ---# library(scran) set.seed(101000110) clusters &lt;- quickCluster(sce.nest) sce.nest &lt;- computeSumFactors(sce.nest, clusters=clusters) sce.nest &lt;- logNormCounts(sce.nest) #--- variance-modelling ---# set.seed(00010101) dec.nest &lt;- modelGeneVarWithSpikes(sce.nest, &quot;ERCC&quot;) top.nest &lt;- getTopHVGs(dec.nest, prop=0.1) #--- dimensionality-reduction ---# set.seed(101010011) sce.nest &lt;- denoisePCA(sce.nest, technical=dec.nest, subset.row=top.nest) sce.nest &lt;- runTSNE(sce.nest, dimred=&quot;PCA&quot;) #--- clustering ---# snn.gr &lt;- buildSNNGraph(sce.nest, use.dimred=&quot;PCA&quot;) colLabels(sce.nest) &lt;- factor(igraph::cluster_walktrap(snn.gr)$membership) sce.nest ## class: SingleCellExperiment ## dim: 46078 1656 ## metadata(0): ## assays(2): counts logcounts ## rownames(46078): ENSMUSG00000000001 ENSMUSG00000000003 ... ## ENSMUSG00000107391 ENSMUSG00000107392 ## rowData names(3): GENEID SYMBOL SEQNAME ## colnames(1656): HSPC_025 HSPC_031 ... Prog_852 Prog_810 ## colData names(11): gate broad ... 
sizeFactor label ## reducedDimNames(3): diffusion PCA TSNE ## mainExpName: endogenous ## altExpNames(2): ERCC FACS 10.2 Obtaining pseudotime orderings 10.2.1 Overview The pseudotime is simply a number describing the relative position of a cell in the trajectory, where cells with larger values are considered to be “after” their counterparts with smaller values. Branched trajectories will typically be associated with multiple pseudotimes, one per path through the trajectory; these values are not usually comparable across paths. It is worth noting that “pseudotime” is a rather unfortunate term as it may not have much to do with real-life time. For example, one can imagine a continuum of stress states where cells move in either direction (or not) over time but the pseudotime simply describes the transition from one end of the continuum to the other. In trajectories describing time-dependent processes like differentiation, a cell’s pseudotime value may be used as a proxy for its relative age, but only if directionality can be inferred (see Section 10.4). The big question is how to identify the trajectory from high-dimensional expression data and map individual cells onto it. A massive variety of different algorithms are available for doing so (Saelens et al. 2019), and while we will demonstrate only a few specific methods below, many of the concepts apply generally to all trajectory inference strategies. A more philosophical question is whether a trajectory even exists in the dataset. One can interpret a continuum of states as a series of closely related (but distinct) subpopulations, or two well-separated clusters as the endpoints of a trajectory with rare intermediates. The choice between these two perspectives is left to the analyst based on which is more useful, convenient or biologically sensible. 10.2.2 Cluster-based minimum spanning tree 10.2.2.1 Basic steps The TSCAN algorithm uses a simple yet effective approach to trajectory reconstruction.
It uses the clustering to summarize the data into a smaller set of discrete units, computes cluster centroids by averaging the coordinates of their member cells, and then forms the minimum spanning tree (MST) across those centroids. The MST is simply an undirected acyclic graph that passes through each centroid exactly once and is thus the most parsimonious structure that captures the transitions between clusters. We demonstrate below on the Nestorowa et al. (2016) dataset, computing the cluster centroids in the low-dimensional PC space to take advantage of data compaction and denoising (Basic Chapter 4). library(scater) by.cluster &lt;- aggregateAcrossCells(sce.nest, ids=colLabels(sce.nest)) centroids &lt;- reducedDim(by.cluster, &quot;PCA&quot;) # Set clusters=NULL as we have already aggregated above. library(TSCAN) mst &lt;- createClusterMST(centroids, clusters=NULL) mst ## IGRAPH 536cb43 UNW- 10 9 -- ## + attr: name (v/c), coordinates (v/x), weight (e/n), gain (e/n) ## + edges from 536cb43 (vertex names): ## [1] 1--2 1--9 2--5 2--6 2--8 3--5 4--9 4--10 6--7 For reference, we can draw the same lines between the centroids in a \\(t\\)-SNE plot (Figure 10.1). This allows us to identify interesting clusters such as those at bifurcations or endpoints. Note that the MST in mst was generated from distances in the PC space and is merely being visualized here in the \\(t\\)-SNE space, for the same reasons as discussed in Basic Section 4.5.3. This may occasionally result in some visually unappealing plots if the original ordering of clusters in the PC space is not preserved in the \\(t\\)-SNE space. line.data &lt;- reportEdges(by.cluster, mst=mst, clusters=NULL, use.dimred=&quot;TSNE&quot;) plotTSNE(sce.nest, colour_by=&quot;label&quot;) + geom_line(data=line.data, mapping=aes(x=TSNE1, y=TSNE2, group=edge)) Figure 10.1: \\(t\\)-SNE plot of the Nestorowa HSC dataset, where each point is a cell and is colored according to its cluster assignment.
The MST obtained using a TSCAN-like algorithm is overlaid on top. We obtain a pseudotime ordering by projecting the cells onto the MST with mapCellsToEdges(). More specifically, we move each cell onto the closest edge of the MST; the pseudotime is then calculated as the distance along the MST to this new position from a “root node” with orderCells(). For our purposes, we will arbitrarily pick one of the endpoint nodes as the root, though a more careful choice based on the biological annotation of each node may yield more relevant orderings (e.g., picking a node corresponding to a more pluripotent state). map.tscan &lt;- mapCellsToEdges(sce.nest, mst=mst, use.dimred=&quot;PCA&quot;) tscan.pseudo &lt;- orderCells(map.tscan, mst) head(tscan.pseudo) ## class: PseudotimeOrdering ## dim: 6 3 ## metadata(1): start ## pathStats(1): &#39;&#39; ## cellnames(6): HSPC_025 HSPC_031 ... HSPC_014 HSPC_020 ## cellData names(4): left.cluster right.cluster left.distance ## right.distance ## pathnames(3): 8 7 10 ## pathData names(0): Here, multiple sets of pseudotimes are reported for a branched trajectory. Each column contains one pseudotime ordering and corresponds to one path from the root node to one of the terminal nodes - the name of the terminal node that defines this path is recorded in the column names of tscan.pseudo. Some cells may be shared across multiple paths, in which case they will have the same pseudotime in those paths. We can then examine the pseudotime ordering on our desired visualization as shown in Figure 10.2. common.pseudo &lt;- averagePseudotime(tscan.pseudo) plotTSNE(sce.nest, colour_by=I(common.pseudo), text_by=&quot;label&quot;, text_colour=&quot;red&quot;) + geom_line(data=line.data, mapping=aes(x=TSNE1, y=TSNE2, group=edge)) Figure 10.2: \\(t\\)-SNE plot of the Nestorowa HSC dataset, where each point is a cell and is colored according to its pseudotime value. The MST obtained using TSCAN is overlaid on top. 
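The geometric operation underlying this projection can be sketched for a single edge. The function and coordinates below are invented for illustration; mapCellsToEdges() itself considers all edges of the MST and records the distances needed for the subsequent pseudotime calculation.

```r
# Sketch only: orthogonal projection of a cell onto the segment
# between two cluster centroids 'a' and 'b'.
project_to_edge <- function(cell, a, b) {
    edge <- b - a
    # Fractional position along the edge, clamped to [0, 1] so that
    # the projection stays between the two centroids.
    t <- sum((cell - a) * edge) / sum(edge^2)
    t <- min(max(t, 0), 1)
    a + t * edge
}

a <- c(0, 0); b <- c(4, 0)    # two centroids in a 2-dimensional "PC space"
cell <- c(1, 2)               # a cell lying off the edge
project_to_edge(cell, a, b)   # lands at (1, 0) on the edge
```

The pseudotime of the cell is then the distance from the root node to this projected position, accumulated along the edges of the MST.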
Alternatively, this entire series of calculations can be conveniently performed with the quickPseudotime() wrapper. This executes all steps from aggregateAcrossCells() to orderCells() and returns a list with the output from each step. pseudo.all &lt;- quickPseudotime(sce.nest, use.dimred=&quot;PCA&quot;) head(pseudo.all$ordering) ## class: PseudotimeOrdering ## dim: 6 3 ## metadata(1): start ## pathStats(1): &#39;&#39; ## cellnames(6): HSPC_025 HSPC_031 ... HSPC_014 HSPC_020 ## cellData names(4): left.cluster right.cluster left.distance ## right.distance ## pathnames(3): 8 7 10 ## pathData names(0): 10.2.2.2 Tweaking the MST The MST can be constructed with an “outgroup” to avoid connecting unrelated populations in the dataset. Based on the OMEGA cluster concept from Street et al. (2018), the outgroup is an artificial cluster that is equidistant from all real clusters at some threshold value. If the original MST sans the outgroup contains an edge that is longer than twice the threshold, the addition of the outgroup will cause the MST to instead be routed through the outgroup. We can subsequently break up the MST into subcomponents (i.e., a minimum spanning forest) by removing the outgroup. We set outgroup=TRUE to introduce an outgroup with an automatically determined threshold distance, which breaks up our previous MST into two components (Figure 10.3). pseudo.og &lt;- quickPseudotime(sce.nest, use.dimred=&quot;PCA&quot;, outgroup=TRUE) set.seed(10101) plot(pseudo.og$mst) Figure 10.3: Minimum spanning tree of the Nestorowa clusters after introducing an outgroup. Another option is to construct the MST based on distances between mutual nearest neighbor (MNN) pairs between clusters (Multi-sample Section 1.6). This exploits the fact that MNN pairs occur at the boundaries of two clusters, with short distances between paired cells meaning that the clusters are “touching”. 
In this mode, the MST focuses on the connectivity between clusters, which can be different from the shortest distance between centroids (Figure 10.4). Consider, for example, a pair of elongated clusters that are immediately adjacent to each other. A large distance between their centroids precludes the formation of the obvious edge with the default MST construction; in contrast, the MNN distance is very low and encourages the MST to create a connection between the two clusters. pseudo.mnn &lt;- quickPseudotime(sce.nest, use.dimred=&quot;PCA&quot;, with.mnn=TRUE) mnn.pseudo &lt;- averagePseudotime(pseudo.mnn$ordering) plotTSNE(sce.nest, colour_by=I(mnn.pseudo), text_by=&quot;label&quot;, text_colour=&quot;red&quot;) + geom_line(data=pseudo.mnn$connected$TSNE, mapping=aes(x=TSNE1, y=TSNE2, group=edge)) Figure 10.4: \\(t\\)-SNE plot of the Nestorowa HSC dataset, where each point is a cell and is colored according to its pseudotime value. The MST obtained using TSCAN with MNN distances is overlaid on top. 10.2.2.3 Further comments The TSCAN approach derives several advantages from using clusters to form the MST. The most obvious is that of computational speed as calculations are performed over clusters rather than cells. The relative coarseness of clusters protects against the per-cell noise that would otherwise reduce the stability of the MST. The interpretation of the MST is also straightforward as it uses the same clusters as the rest of the analysis, allowing us to recycle previous knowledge about the biological annotations assigned to each cluster. However, the reliance on clustering is a double-edged sword. If the clusters are not sufficiently granular, it is possible for TSCAN to overlook variation that occurs inside a single cluster. 
The MST is obliged to pass through each cluster exactly once, which can lead to excessively circuitous paths in overclustered datasets as well as the formation of irrelevant paths between distinct cell subpopulations if the outgroup threshold is too high. The MST also fails to handle more complex events such as “bubbles” (i.e., a bifurcation and then a merging) or cycles. 10.2.3 Principal curves To identify a trajectory, one might imagine simply “fitting” a one-dimensional curve so that it passes through the cloud of cells in the high-dimensional expression space. This is the idea behind principal curves (Hastie and Stuetzle 1989), effectively a non-linear generalization of PCA where the axes of most variation are allowed to bend. We use the slingshot package (Street et al. 2018) to fit a single principal curve to the Nestorowa dataset, again using the low-dimensional PC coordinates for denoising and speed. This yields a pseudotime ordering of cells based on their relative positions when projected onto the curve. library(slingshot) sce.sling &lt;- slingshot(sce.nest, reducedDim=&#39;PCA&#39;) head(sce.sling$slingPseudotime_1) ## [1] 60.82 44.63 58.82 47.29 53.10 43.75 We can then visualize the path taken by the fitted curve in any desired space with embedCurves(). For example, Figure 10.5 shows the behavior of the principal curve on the \\(t\\)-SNE plot. Again, users should note that this may not always yield aesthetically pleasing plots if the \\(t\\)-SNE algorithm decides to arrange clusters so that they no longer match the ordering of the pseudotimes. embedded &lt;- embedCurves(sce.sling, &quot;TSNE&quot;) embedded &lt;- slingCurves(embedded)[[1]] # only 1 path. 
embedded &lt;- data.frame(embedded$s[embedded$ord,]) plotTSNE(sce.sling, colour_by=&quot;slingPseudotime_1&quot;) + geom_path(data=embedded, aes(x=TSNE1, y=TSNE2), size=1.2) Figure 10.5: \\(t\\)-SNE plot of the Nestorowa HSC dataset where each point is a cell and is colored by the slingshot pseudotime ordering. The fitted principal curve is shown in black. The previous call to slingshot() assumed that all cells in the dataset were part of a single curve. To accommodate more complex events like bifurcations, we use our previously computed cluster assignments to build a rough sketch for the global structure in the form of a MST across the cluster centroids. Each path through the MST from a designated root node is treated as a lineage that contains cells from the associated clusters. Principal curves are then simultaneously fitted to all lineages with some averaging across curves to encourage consistency in shared clusters across lineages. This process yields a matrix of pseudotimes where each column corresponds to a lineage and contains the pseudotimes of all cells assigned to that lineage. sce.sling2 &lt;- slingshot(sce.nest, cluster=colLabels(sce.nest), reducedDim=&#39;PCA&#39;) pseudo.paths &lt;- slingPseudotime(sce.sling2) head(pseudo.paths) ## Lineage1 Lineage2 Lineage3 Lineage4 ## HSPC_025 102.28 NA NA NA ## HSPC_031 NA 127.9 NA NA ## HSPC_037 NA 117.6 NA NA ## HSPC_008 96.26 108.0 109.6 107.45 ## HSPC_014 100.26 110.5 106.5 108.74 ## HSPC_020 93.11 95.7 106.8 99.78 By using the MST as a scaffold for the global structure, slingshot() can accommodate branching events based on divergence in the principal curves (Figure 10.6). However, unlike TSCAN, the MST here is only used as a rough guide and does not define the final pseudotime. sce.nest &lt;- runUMAP(sce.nest, dimred=&quot;PCA&quot;) reducedDim(sce.sling2, &quot;UMAP&quot;) &lt;- reducedDim(sce.nest, &quot;UMAP&quot;) # Taking the rowMeans just gives us a single pseudo-time for all cells. 
# Cells in segments that are shared across paths have similar pseudo-time values in all paths anyway, so taking the rowMeans is not particularly controversial. shared.pseudo &lt;- rowMeans(pseudo.paths, na.rm=TRUE) # Need to loop over the paths and add each one separately. gg &lt;- plotUMAP(sce.sling2, colour_by=I(shared.pseudo)) embedded &lt;- embedCurves(sce.sling2, &quot;UMAP&quot;) embedded &lt;- slingCurves(embedded) for (path in embedded) { embedded &lt;- data.frame(path$s[path$ord,]) gg &lt;- gg + geom_path(data=embedded, aes(x=UMAP1, y=UMAP2), size=1.2) } gg Figure 10.6: UMAP plot of the Nestorowa HSC dataset where each point is a cell and is colored by the average slingshot pseudotime across paths. The principal curves fitted to each lineage are shown in black. We can use slingBranchID() to determine whether a particular cell is shared across multiple curves or is unique to a subset of curves (i.e., is located “after” branching). In this case, we can see that most cells jump directly from a global common segment (1,2,3,4) to one of the curves (1, 2, 3, 4) without any further hierarchy, i.e., no noticeable internal branch points. curve.assignments &lt;- slingBranchID(sce.sling2) table(curve.assignments) ## curve.assignments ## 1 1,2 1,2,3 1,2,3,4 1,2,4 1,3,4 2 2,3,4 2,4 3 ## 425 13 3 776 7 1 198 24 4 142 ## 3,4 4 ## 39 24 For larger datasets, we can speed up the algorithm by approximating each principal curve with a fixed number of points. By default, slingshot() uses one point per cell to define the curve, which is unnecessarily precise when the number of cells is large. Applying an approximation with approx_points= reduces computational work without any major loss of precision in the pseudotime estimates. 
sce.sling3 &lt;- slingshot(sce.nest, cluster=colLabels(sce.nest), reducedDim=&#39;PCA&#39;, approx_points=100) pseudo.paths3 &lt;- slingPseudotime(sce.sling3) head(pseudo.paths3) ## Lineage1 Lineage2 Lineage3 Lineage4 ## HSPC_025 102.01 NA NA NA ## HSPC_031 NA 128.11 NA NA ## HSPC_037 NA 117.38 NA NA ## HSPC_008 95.72 107.96 109.1 107.27 ## HSPC_014 99.91 110.43 106.1 108.45 ## HSPC_020 92.40 95.38 106.0 99.88 The MST can also be constructed with an OMEGA cluster to avoid connecting unrelated trajectories. This operates in the same manner as (and was the inspiration for) the outgroup for TSCAN’s MST. Principal curves are fitted through each component individually, manifesting in the pseudotime matrix as paths that do not share any cells. sce.sling4 &lt;- slingshot(sce.nest, cluster=colLabels(sce.nest), reducedDim=&#39;PCA&#39;, approx_points=100, omega=TRUE) pseudo.paths4 &lt;- slingPseudotime(sce.sling4) head(pseudo.paths4) ## Lineage1 Lineage2 Lineage3 Lineage4 ## HSPC_025 102.12 NA NA NA ## HSPC_031 NA 127.50 NA NA ## HSPC_037 NA 117.07 NA NA ## HSPC_008 96.02 107.87 109.0 NA ## HSPC_014 100.14 110.25 106.2 NA ## HSPC_020 92.57 95.56 106.5 NA shared.pseudo &lt;- rowMeans(pseudo.paths4, na.rm=TRUE) gg &lt;- plotUMAP(sce.sling4, colour_by=I(shared.pseudo)) embedded &lt;- embedCurves(sce.sling4, &quot;UMAP&quot;) embedded &lt;- slingCurves(embedded) for (path in embedded) { embedded &lt;- data.frame(path$s[path$ord,]) gg &lt;- gg + geom_path(data=embedded, aes(x=UMAP1, y=UMAP2), size=1.2) } gg Figure 10.7: UMAP plot of the Nestorowa HSC dataset where each point is a cell and is colored by the average slingshot pseudotime across paths. The principal curves (black lines) were constructed with an OMEGA cluster. The use of principal curves adds an extra layer of sophistication that complements the deficiencies of the cluster-based MST. 
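To build intuition for the curve fitting itself, here is a deliberately simplified from-scratch sketch of a single principal-curve iteration on simulated 2D data: order points along the first principal component, smooth each coordinate against that ordering, and use arc length along the smoothed curve as the pseudotime. The real algorithm iterates projection and smoothing to convergence; this one-pass lowess() version is only a cartoon for illustration.

```r
set.seed(4)
# Noisy points along a quarter-circle arc in 2D; t.true is the true position.
t.true <- sort(runif(200, 0, pi / 2))
pts <- cbind(cos(t.true), sin(t.true)) + matrix(rnorm(400, sd=0.05), ncol=2)

# One simplified round of principal-curve fitting: order cells along PC1,
# smooth each coordinate against that ordering...
pc1 <- prcomp(pts)$x[, 1]
ord <- order(pc1)
smooth.x <- lowess(seq_along(ord), pts[ord, 1], f=0.3)$y
smooth.y <- lowess(seq_along(ord), pts[ord, 2], f=0.3)$y

# ...then assign each cell the arc length along the smoothed curve.
seg <- sqrt(diff(smooth.x)^2 + diff(smooth.y)^2)
pseudotime <- numeric(200)
pseudotime[ord] <- c(0, cumsum(seg))

# The recovered ordering largely agrees with the true arc position (up to sign).
agreement <- abs(cor(pseudotime, t.true))
```

This captures the key point of the next paragraph: the curve bends with the data, so the pseudotime tracks position along the arc rather than along a straight PCA axis.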
The principal curve has the opportunity to model variation within clusters that would otherwise be overlooked; for example, slingshot could build a trajectory out of one cluster while TSCAN cannot. Conversely, the principal curves can “smooth out” circuitous paths in the MST for overclustered data, ignoring small differences between fine clusters that are unlikely to be relevant to the overall trajectory. That said, the structure of the initial MST is still fundamentally dependent on the resolution of the clusters. One can arbitrarily change the number of branches from slingshot by tuning the cluster granularity, making it difficult to use the output as evidence for the presence/absence of subtle branch events. If the variation within clusters is uninteresting, the greater sensitivity of the curve fitting to such variation may yield irrelevant trajectories where the differences between clusters are masked. Moreover, slingshot is no longer obliged to separate clusters in pseudotime, which may complicate interpretation of the trajectory with respect to existing cluster annotations. 10.3 Characterizing trajectories 10.3.1 Overview Once we have constructed a trajectory, the next step is to characterize the underlying biology based on its DE genes. The aim here is to find the genes that exhibit significant changes in expression across pseudotime, as these are the most likely to have driven the formation of the trajectory in the first place. The overall strategy is to fit a model to the per-gene expression with respect to pseudotime, allowing us to obtain inferences about the significance of any association. We can then prioritize interesting genes as those with low \\(p\\)-values for further investigation. A wide range of options are available for model fitting but we will focus on the simplest approach of fitting a linear model to the log-expression values with respect to the pseudotime; we will discuss some of the more advanced models later. 
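As a toy illustration of this simplest strategy (on simulated data, not the Nestorowa objects), the per-gene fit reduces to lm(), with the slope serving as a per-unit-pseudotime log-fold change and the t-test \(p\)-value measuring the association:

```r
set.seed(0)
# Simulated pseudotime and log-expression for one gene with a linear trend.
pt <- runif(200, 0, 100)
logexpr <- 0.02 * pt + rnorm(200, sd=0.5)

# Fit log-expression against pseudotime; the slope is the per-unit logFC
# and its t-test p-value quantifies the association.
fit <- lm(logexpr ~ pt)
logFC <- coef(fit)["pt"]
pval <- summary(fit)$coefficients["pt", "Pr(>|t|)"]
```

In a real analysis this fit would be repeated for every gene, with the resulting \(p\)-values adjusted for multiple testing before prioritization.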
10.3.2 Changes along a trajectory To demonstrate, we will identify genes with significant changes with respect to one of the TSCAN pseudotimes in the Nestorowa data. We use the testPseudotime() utility to fit a natural spline to the expression of each gene, allowing us to model a range of non-linear relationships in the data. We then perform an analysis of variance (ANOVA) to determine if any of the spline coefficients are significantly non-zero, i.e., there is some significant trend with respect to pseudotime. library(TSCAN) pseudo &lt;- testPseudotime(sce.nest, pseudotime=tscan.pseudo[,2])[[1]] pseudo$SYMBOL &lt;- rowData(sce.nest)$SYMBOL pseudo[order(pseudo$p.value),] ## DataFrame with 46078 rows and 4 columns ## logFC p.value FDR SYMBOL ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;character&gt; ## ENSMUSG00000029322 -0.0965470 4.02656e-282 1.68846e-277 Plac8 ## ENSMUSG00000105231 0.0177180 6.13349e-281 1.28598e-276 Iglj3 ## ENSMUSG00000009350 -0.1154449 2.45843e-261 3.43631e-257 Mpo ## ENSMUSG00000040314 -0.1138909 2.87506e-248 3.01400e-244 Ctsg ## ENSMUSG00000106668 0.0172294 9.98159e-240 8.37116e-236 Iglj1 ## ... ... ... ... ... ## ENSMUSG00000107367 0 NaN NaN Mir192 ## ENSMUSG00000107372 0 NaN NaN NA ## ENSMUSG00000107381 0 NaN NaN NA ## ENSMUSG00000107382 0 NaN NaN Gm37714 ## ENSMUSG00000107391 0 NaN NaN Rian In practice, it is helpful to pair the spline-based ANOVA results with a fit from a much simpler model where we assume that there exists a linear relationship between expression and the pseudotime. This yields an interpretable summary of the overall direction of change in the logFC field above, complementing the more powerful spline-based model used to populate the p.value field. In contrast, the magnitude and sign of the spline coefficients cannot be easily interpreted. To simplify the results, we will repeat our DE analysis after filtering out cluster 7. 
This cluster seems to contain a set of B cell precursors that are located at one end of the trajectory, causing immunoglobulins to dominate the set of DE genes and mask other interesting effects. Incidentally, this is the same cluster that was split into a separate component in the outgroup-based MST. # Making a copy of our SCE and including the pseudotimes in the colData. sce.nest2 &lt;- sce.nest sce.nest2$TSCAN.first &lt;- pathStat(tscan.pseudo)[,1] sce.nest2$TSCAN.second &lt;- pathStat(tscan.pseudo)[,2] # Discarding the offending cluster. discard &lt;- &quot;7&quot; keep &lt;- colLabels(sce.nest)!=discard sce.nest2 &lt;- sce.nest2[,keep] # Testing against the second path again. pseudo &lt;- testPseudotime(sce.nest2, pseudotime=sce.nest2$TSCAN.second) pseudo$SYMBOL &lt;- rowData(sce.nest2)$SYMBOL sorted &lt;- pseudo[order(pseudo$p.value),] Examination of the top downregulated genes suggests that this pseudotime represents a transition away from myeloid identity, based on the decrease in expression of genes such as Mpo and Plac8 (Figure 10.8). 
up.left &lt;- sorted[sorted$logFC &lt; 0,] head(up.left, 10) ## DataFrame with 10 rows and 4 columns ## logFC p.value FDR SYMBOL ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;character&gt; ## ENSMUSG00000029322 -0.1109816 2.40090e-282 1.00501e-277 Plac8 ## ENSMUSG00000009350 -0.1324239 1.33503e-256 2.79422e-252 Mpo ## ENSMUSG00000040314 -0.1332196 9.28809e-244 1.29600e-239 Ctsg ## ENSMUSG00000020125 -0.1116214 1.56577e-209 1.63857e-205 Elane ## ENSMUSG00000015355 -0.1103926 1.34676e-191 9.39592e-188 Cd48 ## ENSMUSG00000045799 -0.0280381 1.11162e-180 5.81655e-177 Gm9800 ## ENSMUSG00000026238 -0.0269643 6.69605e-180 3.11441e-176 Ptma ## ENSMUSG00000090164 -0.1085666 1.23212e-177 5.15767e-174 BC035044 ## ENSMUSG00000024681 -0.0918238 2.66639e-177 1.01468e-173 Ms4a3 ## ENSMUSG00000015937 -0.0504447 6.36784e-176 2.22132e-172 H2afy best &lt;- head(up.left$SYMBOL, 10) plotExpression(sce.nest2, features=best, swap_rownames=&quot;SYMBOL&quot;, x=&quot;TSCAN.second&quot;, colour_by=&quot;label&quot;) Figure 10.8: Expression of the top 10 genes that decrease in expression with increasing pseudotime along the second path in the MST of the Nestorowa dataset. Each point represents a cell that is mapped to this path and is colored by the assigned cluster. Conversely, the later parts of the pseudotime may correspond to a more stem-like state based on upregulation of genes like Hlf. There is also increased expression of genes associated with the lymphoid lineage (e.g., Ltb), consistent with reduced commitment to the myeloid lineage at earlier pseudotime values. 
up.right &lt;- sorted[sorted$logFC &gt; 0,] head(up.right, 10) ## DataFrame with 10 rows and 4 columns ## logFC p.value FDR SYMBOL ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;character&gt; ## ENSMUSG00000006389 0.1268583 5.43001e-192 4.54600e-188 Mpl ## ENSMUSG00000028716 0.1166947 1.14794e-182 6.86469e-179 Pdzk1ip1 ## ENSMUSG00000086567 0.0302618 4.92895e-173 1.37550e-169 Gm2830 ## ENSMUSG00000027562 0.0692752 2.66072e-163 5.56890e-160 Car2 ## ENSMUSG00000047867 0.0897000 1.99174e-161 3.78973e-158 Gimap6 ## ENSMUSG00000003949 0.1011850 2.77392e-142 4.00400e-139 Hlf ## ENSMUSG00000024399 0.1080444 5.31309e-137 6.73958e-134 Ltb ## ENSMUSG00000075602 0.1022186 2.93946e-135 3.51559e-132 Ly6a ## ENSMUSG00000061232 0.0195871 9.04679e-116 7.14526e-113 H2-K1 ## ENSMUSG00000107061 0.0679604 5.26426e-109 3.33881e-106 Gm19590 best &lt;- head(up.right$SYMBOL, 10) plotExpression(sce.nest2, features=best, swap_rownames=&quot;SYMBOL&quot;, x=&quot;TSCAN.second&quot;, colour_by=&quot;label&quot;) Figure 10.9: Expression of the top 10 genes that increase in expression with increasing pseudotime along the second path in the MST of the Nestorowa dataset. Each point represents a cell that is mapped to this path and is colored by the assigned cluster. Alternatively, a heatmap can be used to provide a more compact visualization (Figure 10.10). on.second.path &lt;- !is.na(sce.nest2$TSCAN.second) plotHeatmap(sce.nest2[,on.second.path], order_columns_by=&quot;TSCAN.second&quot;, colour_columns_by=&quot;label&quot;, features=head(up.right$SYMBOL, 50), center=TRUE, swap_rownames=&quot;SYMBOL&quot;) Figure 10.10: Heatmap of the expression of the top 50 genes that increase in expression with increasing pseudotime along the second path in the MST of the Nestorowa HSC dataset. Each column represents a cell that is mapped to this path and is ordered by its pseudotime value. 
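The spline-plus-ANOVA strategy used throughout this section can be mimicked on simulated data with splines::ns() and a standard model comparison. This sketch is not testPseudotime()'s actual implementation (the real function adds further refinements), but it captures the core test of a spline fit against an intercept-only null:

```r
library(splines)
set.seed(1)
pt <- runif(200, 0, 100)
# A non-monotonic (hump-shaped) trend that a straight line would miss.
logexpr <- sin(pt / 100 * pi) + rnorm(200, sd=0.3)

# Natural spline fit versus an intercept-only null; the ANOVA asks
# whether any spline coefficient is non-zero, i.e., any trend at all.
fit <- lm(logexpr ~ ns(pt, df=5))
null <- lm(logexpr ~ 1)
spline.p <- anova(null, fit)[2, "Pr(>F)"]

# A simple linear fit supplies an interpretable direction of change, but
# has little power here because the trend rises and then falls.
linear.p <- summary(lm(logexpr ~ pt))$coefficients["pt", "Pr(>|t|)"]
```

The contrast between the two \(p\)-values illustrates why the spline model populates the p.value field while the linear model is only used for the logFC summary.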
10.3.3 Changes between paths A more advanced analysis involves looking for differences in expression between paths of a branched trajectory. This is most interesting for cells close to the branch point between two or more paths where the differential expression analysis may highlight genes responsible for the branching event. The general strategy here is to fit one trend to the unique part of each path immediately following the branch point, followed by a comparison of the fits between paths. To this end, a particularly tempting approach is to perform another ANOVA with our spline-based model and test for significant differences in the spline parameters between paths. While this can be done with testPseudotime(), the magnitude of the pseudotime has little comparability across paths. A pseudotime value in one path of the MST does not, in general, have any relation to the same value in another path; the pseudotime can be arbitrarily “stretched” by factors such as the magnitude of DE or the density of cells, depending on the algorithm. This compromises any comparison of trends as we cannot reliably say that they are being fitted to comparable \\(x\\)-axes. Rather, we employ the much simpler ad hoc approach of fitting a spline to each trajectory and comparing the sets of DE genes. To demonstrate, we focus on the cluster containing the branch point in the Nestorowa-derived MST (Figure 10.2). We recompute the pseudotimes so that the root lies at the cluster center, allowing us to detect genes that are associated with the divergence of the branches. starter &lt;- &quot;2&quot; tscan.pseudo2 &lt;- orderCells(map.tscan, mst, start=starter) We visualize the reordered pseudotimes using only the cells in our branch point cluster (Figure 10.11), which allows us to see the correspondence between each pseudotime and the projected edges of the MST. 
A more precise determination of the identity of each pseudotime can be achieved by examining the column names of tscan.pseudo2, which contains the name of the terminal node for the path of the MST corresponding to each column. # Making a copy and giving the paths more friendly names. sub.nest &lt;- sce.nest sub.nest$TSCAN.first &lt;- pathStat(tscan.pseudo2)[,1] sub.nest$TSCAN.second &lt;- pathStat(tscan.pseudo2)[,2] sub.nest$TSCAN.third &lt;- pathStat(tscan.pseudo2)[,3] # Subsetting to the desired cluster containing the branch point. keep &lt;- colLabels(sce.nest) == starter sub.nest &lt;- sub.nest[,keep] # Showing only the lines to/from our cluster of interest. line.data.sub &lt;- line.data[grepl(&quot;^2--&quot;, line.data$edge) | grepl(&quot;--2$&quot;, line.data$edge),] ggline &lt;- geom_line(data=line.data.sub, mapping=aes(x=TSNE1, y=TSNE2, group=edge)) gridExtra::grid.arrange( plotTSNE(sub.nest, colour_by=&quot;TSCAN.first&quot;) + ggline, plotTSNE(sub.nest, colour_by=&quot;TSCAN.second&quot;) + ggline, plotTSNE(sub.nest, colour_by=&quot;TSCAN.third&quot;) + ggline, ncol=3 ) Figure 10.11: TSCAN-derived pseudotimes around cluster 2 in the Nestorowa HSC dataset. Each point is a cell in this cluster and is colored by its pseudotime value along the path to which it was assigned. The overlaid lines represent the relevant edges of the MST. We then apply testPseudotime() to each path involving cluster 2. Because we are operating over a relatively short pseudotime interval, we do not expect complex trends and so we set df=1 (i.e., a linear trend) to avoid problems from overfitting. 
pseudo1 &lt;- testPseudotime(sub.nest, df=1, pseudotime=sub.nest$TSCAN.first) pseudo1$SYMBOL &lt;- rowData(sce.nest)$SYMBOL pseudo1[order(pseudo1$p.value),] ## DataFrame with 46078 rows and 4 columns ## logFC p.value FDR SYMBOL ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;character&gt; ## ENSMUSG00000030559 -0.373227 4.76987e-11 1.62767e-06 Rab38 ## ENSMUSG00000006389 -0.360497 3.38068e-08 5.76812e-04 Mpl ## ENSMUSG00000001946 -0.365870 5.12474e-07 5.82922e-03 Esam ## ENSMUSG00000032586 -0.236584 8.01901e-07 6.84102e-03 Traip ## ENSMUSG00000050071 -0.235045 1.35907e-06 7.03349e-03 Bex1 ## ... ... ... ... ... ## ENSMUSG00000107379 0 NaN NaN Gm43126 ## ENSMUSG00000107381 0 NaN NaN NA ## ENSMUSG00000107382 0 NaN NaN Gm37714 ## ENSMUSG00000107384 0 NaN NaN Gm42557 ## ENSMUSG00000107391 0 NaN NaN Rian pseudo2 &lt;- testPseudotime(sub.nest, df=1, pseudotime=sub.nest$TSCAN.second) pseudo2$SYMBOL &lt;- rowData(sce.nest)$SYMBOL pseudo2[order(pseudo2$p.value),] ## DataFrame with 46078 rows and 4 columns ## logFC p.value FDR SYMBOL ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;character&gt; ## ENSMUSG00000042462 0.161764 2.36240e-11 7.80773e-07 Dctpp1 ## ENSMUSG00000101878 0.123094 6.75797e-10 1.07843e-05 Gm8203 ## ENSMUSG00000027203 0.165513 9.78911e-10 1.07843e-05 Dut ## ENSMUSG00000027342 0.167364 2.41459e-09 1.99506e-05 Pcna ## ENSMUSG00000037894 0.114120 8.86855e-09 5.86211e-05 H2afz ## ... ... ... ... ... 
## ENSMUSG00000107382 0 NaN NaN Gm37714 ## ENSMUSG00000107384 0 NaN NaN Gm42557 ## ENSMUSG00000107385 0 NaN NaN C330024D21Rik ## ENSMUSG00000107391 0 NaN NaN Rian ## ENSMUSG00000107392 0 NaN NaN Gm7792 pseudo3 &lt;- testPseudotime(sub.nest, df=1, pseudotime=sub.nest$TSCAN.third) pseudo3$SYMBOL &lt;- rowData(sce.nest)$SYMBOL pseudo3[order(pseudo3$p.value),] ## DataFrame with 46078 rows and 4 columns ## logFC p.value FDR SYMBOL ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;character&gt; ## ENSMUSG00000001056 -0.2457275 3.86556e-10 1.33872e-05 Nhp2 ## ENSMUSG00000064284 -0.2700519 3.47773e-07 5.50809e-03 Cdpf1 ## ENSMUSG00000026234 -0.1233114 4.77139e-07 5.50809e-03 Ncl ## ENSMUSG00000061232 0.0430956 7.51271e-07 6.50450e-03 H2-K1 ## ENSMUSG00000035443 -0.2399334 1.01763e-06 6.94257e-03 Thyn1 ## ... ... ... ... ... ## ENSMUSG00000107379 0 NaN NaN Gm43126 ## ENSMUSG00000107381 0 NaN NaN NA ## ENSMUSG00000107382 0 NaN NaN Gm37714 ## ENSMUSG00000107384 0 NaN NaN Gm42557 ## ENSMUSG00000107391 0 NaN NaN Rian We want to find genes that are significant in our path of interest (for this demonstration, the third path reported by TSCAN) and are not significant and/or changing in the opposite direction in the other paths. We use the raw \\(p\\)-values to look for non-significant genes in order to increase the stringency of the definition of unique genes in our path. 
only3 &lt;- pseudo3[which(pseudo3$FDR &lt;= 0.05 &amp; (pseudo1$p.value &gt;= 0.05 | sign(pseudo1$logFC)!=sign(pseudo3$logFC)) &amp; (pseudo2$p.value &gt;= 0.05 | sign(pseudo2$logFC)!=sign(pseudo3$logFC))),] only3[order(only3$p.value),] ## DataFrame with 25 rows and 4 columns ## logFC p.value FDR SYMBOL ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;character&gt; ## ENSMUSG00000001056 -0.2457275 3.86556e-10 1.33872e-05 Nhp2 ## ENSMUSG00000064284 -0.2700519 3.47773e-07 5.50809e-03 Cdpf1 ## ENSMUSG00000061232 0.0430956 7.51271e-07 6.50450e-03 H2-K1 ## ENSMUSG00000035443 -0.2399334 1.01763e-06 6.94257e-03 Thyn1 ## ENSMUSG00000030609 -0.2846594 1.20280e-06 6.94257e-03 Aen ## ... ... ... ... ... ## ENSMUSG00000021361 -0.1641263 4.26345e-05 0.0466556 Tmem14c ## ENSMUSG00000082741 0.1116316 4.30399e-05 0.0466556 Gm9703 ## ENSMUSG00000018583 -0.1190385 4.44310e-05 0.0466556 G3bp1 ## ENSMUSG00000019832 -0.2757749 4.50850e-05 0.0466556 Rab32 ## ENSMUSG00000087775 0.0248296 4.58041e-05 0.0466556 Rprl2 We observe upregulation of interesting genes such as Gata2, Cd9 and Apoe in this path, along with downregulation of Flt3 (Figure 10.12). One might speculate that this path leads to a less differentiated HSC state compared to the other directions. gridExtra::grid.arrange( plotTSNE(sub.nest, colour_by=&quot;Flt3&quot;, swap_rownames=&quot;SYMBOL&quot;) + ggline, plotTSNE(sub.nest, colour_by=&quot;Apoe&quot;, swap_rownames=&quot;SYMBOL&quot;) + ggline, plotTSNE(sub.nest, colour_by=&quot;Gata2&quot;, swap_rownames=&quot;SYMBOL&quot;) + ggline, plotTSNE(sub.nest, colour_by=&quot;Cd9&quot;, swap_rownames=&quot;SYMBOL&quot;) + ggline ) Figure 10.12: \\(t\\)-SNE plots of cells in the cluster containing the branch point of the MST in the Nestorowa dataset. Each point is a cell colored by the expression of a gene of interest and the relevant edges of the MST are overlaid on top. While simple and practical, this comparison strategy is even less statistically defensible than usual. 
The differential testing machinery is not suited to making inferences on the absence of differences, and we should not have used the non-significant genes to draw any conclusions. Another limitation is that this approach cannot detect differences in the magnitude of the gradient of the trend between paths; a gene that is significantly upregulated in each of two paths but with a sharper gradient in one of the paths will not be DE. (Of course, this is only a limitation if the pseudotimes were comparable in the first place.) 10.3.4 Further comments The magnitudes of the \\(p\\)-values reported here should be treated with some skepticism. The same fundamental problems discussed in Section 6.4 remain; the \\(p\\)-values are computed from the same data used to define the trajectory, and there is only a sample size of 1 in this analysis regardless of the number of cells. Nonetheless, the \\(p\\)-value is still useful for prioritizing interesting genes in the same manner that it is used to identify markers between clusters. The previous sections have focused on a very simple and efficient approach to trend fitting that is nonetheless largely effective. Alternatively, we can use more complex strategies that involve various generalizations of linear models. For example, generalized additive models (GAMs) are quite popular for pseudotime-based DE analyses as they are able to handle non-normal noise distributions and a greater diversity of non-linear trends. We demonstrate the use of the GAM implementation from the tradeSeq package on the Nestorowa dataset below. Specifically, we will take a leap of faith and assume that our pseudotime values are comparable across paths of the MST, allowing us to use the patternTest() function to test for significant differences in expression between paths. # Getting rid of the NA&#39;s; using the cell weights # to indicate which cell belongs on which path. 
nonna.pseudo &lt;- pathStat(tscan.pseudo) not.on.path &lt;- is.na(nonna.pseudo) nonna.pseudo[not.on.path] &lt;- 0 cell.weights &lt;- !not.on.path storage.mode(cell.weights) &lt;- &quot;numeric&quot; # Fitting a GAM on the subset of genes for speed. library(tradeSeq) fit &lt;- fitGAM(counts(sce.nest)[1:100,], pseudotime=nonna.pseudo, cellWeights=cell.weights) res &lt;- patternTest(fit) res$Symbol &lt;- rowData(sce.nest)[1:100,&quot;SYMBOL&quot;] res &lt;- res[order(res$pvalue),] head(res, 10) ## waldStat df pvalue fcMedian Symbol ## ENSMUSG00000000028 549.3 12 0 0.8014 Cdc45 ## ENSMUSG00000000031 116.0 12 0 1.4459 H19 ## ENSMUSG00000000058 281.7 12 0 1.2202 Cav2 ## ENSMUSG00000000078 608.5 12 0 0.5642 Klf6 ## ENSMUSG00000000088 282.0 12 0 0.2803 Cox5a ## ENSMUSG00000000094 121.3 12 0 0.6265 Tbx4 ## ENSMUSG00000000120 169.5 11 0 0.1649 Ngfr ## ENSMUSG00000000171 102.7 11 0 0.1038 Sdhd ## ENSMUSG00000000184 351.9 12 0 0.2315 Ccnd2 ## ENSMUSG00000000244 148.7 12 0 0.2628 Tspan32 From a statistical perspective, the GAM is superior to linear models as the former uses the raw counts. This accounts for the idiosyncrasies of the mean-variance relationship for low counts and avoids some problems with spurious trajectories introduced by the log-transformation (Basic Section 2.5). However, this sophistication comes at the cost of increased complexity and compute time, requiring parallelization via BiocParallel even for relatively small datasets. When a trajectory consists of a series of clusters (as in the Nestorowa dataset), pseudotime-based DE tests can be considered a continuous generalization of cluster-based marker detection. One would expect to identify similar genes by performing an ANOVA on the per-cluster expression values, and indeed, this may be a more interpretable approach as it avoids imposing the assumption that a trajectory exists at all. 
The main benefit of pseudotime-based tests is that they encourage expression to be a smooth function of pseudotime, assuming that the degrees of freedom in the trend fit prevents overfitting. This smoothness reflects an expectation that changes in expression along a trajectory should be gradual. 10.4 Finding the root 10.4.1 Overview The pseudotime calculations rely on some specification of the root of the trajectory to define “position zero”. In some cases, this choice has little effect beyond flipping the sign of the gradients of the DE genes. In other cases, this choice may necessarily be arbitrary depending on the questions being asked, e.g., what are the genes driving the transition to or from a particular part of the trajectory? However, in situations where the trajectory is associated with a time-dependent biological process, the position on the trajectory corresponding to the earliest timepoint is clearly the best default choice for the root. This simplifies interpretation by allowing the pseudotime to be treated as a proxy for real time. 10.4.2 Entropy-based methods Trajectories are commonly used to characterize differentiation where branches are interpreted as multiple lineages. In this setting, the root of the trajectory is best set to the “start” of the differentiation process, i.e., the most undifferentiated state that is observed in the dataset. It is usually possible to identify this state based on the genes that are expressed at each point of the trajectory. However, when such prior biological knowledge is not available, we can fall back to the more general concept that undifferentiated cells have more diverse expression profiles (Gulati et al. 2020). The assumption is that terminally differentiated cells have expression profiles that are highly specialized for their function while multipotent cells have no such constraints - and indeed, may need to have active expression programs for many lineages in preparation for commitment to any of them. 
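This notion of diversity can be made concrete as the Shannon entropy of each cell's expression proportions. The toy base-R sketch below (with made-up counts and a cruder normalization than the real TSCAN implementation) shows that a cell spreading its counts across many genes scores higher than one concentrating them in a few:

```r
set.seed(2)
# Toy counts: 50 genes x 2 cells. The first cell spreads expression evenly
# (multipotent-like); the second concentrates it in 5 genes (specialized).
counts <- cbind(
    diverse = rpois(50, lambda=20),
    specialized = c(rpois(5, lambda=200), rpois(45, lambda=1))
)

# Shannon entropy of each cell's expression proportions.
shannon <- function(x) {
    p <- x / sum(x)
    p <- p[p > 0]
    -sum(p * log(p))
}
ent <- apply(counts, 2, shannon)
```

The "diverse" cell approaches the maximum possible entropy of log(50) nats, while the "specialized" cell sits far below it.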
We quantify the diversity of expression by computing the entropy of each cell’s expression profile (Grun et al. 2016; Guo et al. 2017; Teschendorff and Enver 2017), with higher entropies representing greater diversity. We demonstrate on the Nestorowa HSC dataset (Figure 10.13) where clusters 4 and 10 have the highest entropies, suggesting that they represent the least differentiated states within the trajectory. It is also reassuring that these two clusters are adjacent on the MST (Figure 10.1), which is consistent with branched differentiation “away” from a single root. library(TSCAN) entropy &lt;- perCellEntropy(sce.nest) ent.data &lt;- data.frame(cluster=colLabels(sce.nest), entropy=entropy) ggplot(ent.data, aes(x=cluster, y=entropy)) + geom_violin() + coord_cartesian(ylim=c(7, NA)) + stat_summary(fun=median, geom=&quot;point&quot;) Figure 10.13: Distribution of per-cell entropies for each cluster in the Nestorowa dataset. The median entropy for each cluster is shown as a point in the violin plot. Of course, this interpretation is fully dependent on whether the underlying assumption is reasonable. While the association between diversity and differentiation potential is likely to be generally applicable, it may not be sufficiently precise to enable claims on the relative potency of closely related subpopulations. Indeed, other processes such as stress or metabolic responses may interfere with the entropy comparisons. Furthermore, at low counts, the magnitude of the entropy is dependent on sequencing depth in a manner that cannot be corrected by scaling normalization. Cells with lower coverage will have lower entropy even if the underlying transcriptional diversity is the same, which may confound the interpretation of entropy as a measure of potency. 10.4.3 RNA velocity Another strategy is to use the concept of “RNA velocity” to identify the root (La Manno et al. 2018). 
For a given gene, a high ratio of unspliced to spliced transcripts indicates that that gene is being actively upregulated, under the assumption that the increase in transcription exceeds the capability of the splicing machinery to process the pre-mRNA. Conversely, a low ratio indicates that the gene is being downregulated as the rate of production and processing of pre-mRNAs cannot compensate for the degradation of mature transcripts. Thus, we can infer that cells with high and low ratios are moving towards a high- and low-expression state, respectively, allowing us to assign directionality to any trajectory or even individual cells. To demonstrate, we will use matrices of spliced and unspliced counts from Hermann et al. (2018). The unspliced count matrix is most typically generated by counting reads across intronic regions, thus quantifying the abundance of nascent transcripts for each gene in each cell. The spliced counts are obtained in a more standard manner by counting reads aligned to exonic regions; however, some extra thought is required to deal with reads spanning exon-intron boundaries, as well as reads mapping to regions that can be either intronic or exonic depending on the isoform (Soneson et al. 2020). Conveniently, both matrices have the same shape and thus can be stored as separate assays in our usual SingleCellExperiment. library(scRNAseq) sce.sperm &lt;- HermannSpermatogenesisData(strip=TRUE, location=TRUE) assayNames(sce.sperm) ## [1] &quot;spliced&quot; &quot;unspliced&quot; We run through a quick-and-dirty analysis on the spliced counts, which can - by and large - be treated in the same manner as the standard exonic gene counts used in non-velocity-aware analyses. Alternatively, if the standard exonic count matrix was available, we could just use it directly in these steps and restrict the involvement of the spliced/unspliced matrices to the velocity calculations. 
The latter approach is logistically convenient when adding an RNA velocity section to an existing analysis, such that the prior steps (and the interpretation of their results) do not have to be repeated on the spliced count matrix. # Quality control: library(scuttle) is.mito &lt;- which(seqnames(sce.sperm)==&quot;MT&quot;) sce.sperm &lt;- addPerCellQC(sce.sperm, subsets=list(Mt=is.mito), assay.type=&quot;spliced&quot;) qc &lt;- quickPerCellQC(colData(sce.sperm), sub.fields=TRUE) sce.sperm &lt;- sce.sperm[,!qc$discard] # Normalization: set.seed(10000) library(scran) sce.sperm &lt;- logNormCounts(sce.sperm, assay.type=&quot;spliced&quot;) dec &lt;- modelGeneVarByPoisson(sce.sperm, assay.type=&quot;spliced&quot;) hvgs &lt;- getTopHVGs(dec, n=2500) # Dimensionality reduction: set.seed(1000101) library(scater) sce.sperm &lt;- runPCA(sce.sperm, ncomponents=25, subset_row=hvgs) sce.sperm &lt;- runTSNE(sce.sperm, dimred=&quot;PCA&quot;) We use the velociraptor package to perform the velocity calculations on this dataset via the scvelo Python package (Bergen et al. 2019). scvelo offers some improvements over the original implementation of RNA velocity by La Manno et al. (2018), most notably eliminating the need for observed subpopulations at steady state (i.e., where the rates of transcription, splicing and degradation are equal). velociraptor conveniently wraps this functionality by providing a function that accepts a SingleCellExperiment object such as sce.sperm and returns a similar object decorated with the velocity statistics. 
library(velociraptor) velo.out &lt;- scvelo(sce.sperm, assay.X=&quot;spliced&quot;, subset.row=hvgs, use.dimred=&quot;PCA&quot;) ## computing moments based on connectivities ## finished (0:00:00) --&gt; added ## &#39;Ms&#39; and &#39;Mu&#39;, moments of un/spliced abundances (adata.layers) ## computing velocities ## finished (0:00:00) --&gt; added ## &#39;velocity&#39;, velocity vectors for each individual cell (adata.layers) ## computing velocity graph (using 1/72 cores) ## 0%| | 0/2175 [00:00&lt;?, ?cells/s] ## finished (0:00:01) --&gt; added ## &#39;velocity_graph&#39;, sparse matrix with cosine correlations (adata.uns) ## computing terminal states ## identified 2 regions of root cells and 3 regions of end points . ## finished (0:00:00) --&gt; added ## &#39;root_cells&#39;, root cells of Markov diffusion process (adata.obs) ## &#39;end_points&#39;, end points of Markov diffusion process (adata.obs) ## --&gt; added &#39;velocity_length&#39; (adata.obs) ## --&gt; added &#39;velocity_confidence&#39; (adata.obs) ## --&gt; added &#39;velocity_confidence_transition&#39; (adata.obs) velo.out ## class: SingleCellExperiment ## dim: 2500 2175 ## metadata(4): neighbors velocity_params velocity_graph ## velocity_graph_neg ## assays(6): X spliced ... Mu velocity ## rownames(2500): ENSMUSG00000038015 ENSMUSG00000022501 ... ## ENSMUSG00000095650 ENSMUSG00000002524 ## rowData names(4): velocity_gamma velocity_qreg_ratio velocity_r2 ## velocity_genes ## colnames(2175): CCCATACTCCGAAGAG AATCCAGTCATCTGCC ... ATCCACCCACCACCAG ## ATTGGTGGTTACCGAT ## colData names(7): velocity_self_transition root_cells ... ## velocity_confidence velocity_confidence_transition ## reducedDimNames(1): X_pca ## mainExpName: NULL ## altExpNames(0): The primary output is the matrix of velocity vectors that describe the direction and magnitude of transcriptional change for each cell. To construct an ordering, we extrapolate from the vector for each cell to determine its future state. 
Roughly speaking, if a cell’s future state is close to the observed state of another cell, we place the former behind the latter in the ordering. This yields a “velocity pseudotime” that provides directionality without the need to explicitly define a root in our trajectory. We visualize this procedure in Figure 10.14 by embedding the estimated velocities into any low-dimensional representation of the dataset. sce.sperm$pseudotime &lt;- velo.out$velocity_pseudotime # Also embedding the velocity vectors, for some verisimilitude. embedded &lt;- embedVelocity(reducedDim(sce.sperm, &quot;TSNE&quot;), velo.out) ## computing velocity embedding ## finished (0:00:00) --&gt; added ## &#39;velocity_target&#39;, embedded velocity vectors (adata.obsm) grid.df &lt;- gridVectors(reducedDim(sce.sperm, &quot;TSNE&quot;), embedded, resolution=30) library(ggplot2) plotTSNE(sce.sperm, colour_by=&quot;pseudotime&quot;, point_alpha=0.3) + geom_segment(data=grid.df, mapping=aes(x=start.TSNE1, y=start.TSNE2, xend=end.TSNE1, yend=end.TSNE2), arrow=arrow(length=unit(0.05, &quot;inches&quot;), type=&quot;closed&quot;)) Figure 10.14: \\(t\\)-SNE plot of the Hermann spermatogenesis dataset, where each point is a cell and is colored by its velocity pseudotime. Arrows indicate the direction and magnitude of the velocity vectors, averaged over nearby cells. While we could use the velocity pseudotimes directly in our downstream analyses, it is often helpful to pair this information with other trajectory analyses. This is because the velocity calculations are done on a per-cell basis but interpretation is typically performed at a lower granularity, e.g., per cluster or lineage. For example, we can overlay the average velocity pseudotime for each cluster onto our TSCAN-derived MST (Figure 10.15) to identify the likely root clusters. More complex analyses can also be performed (e.g., to identify the likely fate of each cell in the intermediate clusters) but will not be discussed here. 
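The averaging performed by gridVectors() can be sketched from scratch: cells are binned on the embedding and the projected velocity vectors are averaged within each occupied bin. The function below is our own simplified re-implementation, not velociraptor's, and uses equal-width bins for brevity:

```r
# Average 2D velocity vectors over a coarse grid, as done when drawing
# velocity arrows on an embedding. `pos` holds the cells' embedding
# coordinates and `vec` the matching projected velocity vectors.
grid_vectors <- function(pos, vec, resolution = 2) {
  bx <- cut(pos[, 1], resolution)
  by <- cut(pos[, 2], resolution)
  f <- interaction(bx, by, drop = TRUE)  # one factor level per occupied bin
  list(
    start = apply(pos, 2, function(z) tapply(z, f, mean)),
    delta = apply(vec, 2, function(z) tapply(z, f, mean))
  )
}

pos <- rbind(c(0, 0), c(1, 0), c(9, 9), c(10, 10))
vec <- rbind(c(1, 0), c(1, 0), c(0, 2), c(0, 2))
g <- grid_vectors(pos, vec)  # two occupied bins, one averaged arrow each
```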
library(bluster) colLabels(sce.sperm) &lt;- clusterRows(reducedDim(sce.sperm, &quot;PCA&quot;), NNGraphParam()) library(TSCAN) mst &lt;- TSCAN::createClusterMST(sce.sperm, use.dimred=&quot;PCA&quot;, outgroup=TRUE) # Could also use velo.out$root_cell here, for a more direct measure of &#39;rootness&#39;. by.cluster &lt;- split(sce.sperm$pseudotime, colLabels(sce.sperm)) mean.by.cluster &lt;- vapply(by.cluster, mean, 0) mean.by.cluster &lt;- mean.by.cluster[names(igraph::V(mst))] color.by.cluster &lt;- viridis::viridis(21)[cut(mean.by.cluster, 21)] set.seed(1001) plot(mst, vertex.color=color.by.cluster) Figure 10.15: TSCAN-derived MST created from the Hermann spermatogenesis dataset. Each node is a cluster and is colored by the average velocity pseudotime of all cells in that cluster, from lowest (purple) to highest (yellow). Needless to say, this lunch is not entirely free. The inferences rely on a sophisticated mathematical model that makes a few assumptions, the most obvious of which is that the transcriptional dynamics are the same across subpopulations. The use of unspliced counts increases the sensitivity of the analysis to unannotated transcripts (e.g., microRNAs in the gene body), intron retention events, annotation errors or quantification ambiguities (Soneson et al. 2020) that could interfere with the velocity calculations. There is also the question of whether there is enough intronic coverage to reliably estimate the velocity for the genes relevant to the process of interest, and if not, whether this lack of information may bias the resulting velocity estimates. From a purely practical perspective, the main difficulty with RNA velocity is that the unspliced counts are often unavailable. 10.4.4 Real timepoints There does, however, exist a gold-standard approach to rooting a trajectory: simply collect multiple real-life timepoints over the course of a biological process and use the population(s) at the earliest time point as the root. 
This approach experimentally defines a link between pseudotime and real time without requiring any further assumptions. To demonstrate, we will use the activated T cell dataset from Richard et al. (2018) where they collected CD8+ T cells at various time points after ovalbumin stimulation. library(scRNAseq) sce.richard &lt;- RichardTCellData() sce.richard &lt;- sce.richard[,sce.richard$`single cell quality`==&quot;OK&quot;] # Only using cells treated with the highest affinity peptide # plus the unstimulated cells as time zero. sub.richard &lt;- sce.richard[,sce.richard$stimulus %in% c(&quot;OT-I high affinity peptide N4 (SIINFEKL)&quot;, &quot;unstimulated&quot;)] sub.richard$time[is.na(sub.richard$time)] &lt;- 0 table(sub.richard$time) ## ## 0 1 3 6 ## 44 51 64 91 We run through the standard workflow for single-cell data with spike-ins - see Basic Section 2.4 and Basic Section 3.3 for more details. library(scran) sub.richard &lt;- computeSpikeFactors(sub.richard, &quot;ERCC&quot;) sub.richard &lt;- logNormCounts(sub.richard) dec.richard &lt;- modelGeneVarWithSpikes(sub.richard, &quot;ERCC&quot;) top.hvgs &lt;- getTopHVGs(dec.richard, prop=0.2) sub.richard &lt;- denoisePCA(sub.richard, technical=dec.richard, subset.row=top.hvgs) We can then run our trajectory inference method of choice. As we expect a fairly simple trajectory, we will keep matters simple and use slingshot() without any clusters. This yields a pseudotime that is strongly associated with real time (Figure 10.16) and from which it is straightforward to identify the best location of the root. The rooted trajectory can then be used to determine the “real time equivalent” of other activation stimuli; see Richard et al. (2018) for more details. 
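Rooting against real time can be reduced to a simple check: if a pseudotime runs "backwards" relative to the experimental timepoints, flip it so that the earliest timepoint sits at the root. The helper below is a hypothetical sketch of this idea in base R; the use of a Spearman correlation as the orientation criterion is our own choice:

```r
# Orient a pseudotime using real collection times: a negative rank
# correlation with the timepoints indicates the pseudotime is reversed,
# so we flip it to place the earliest timepoint at pseudotime zero.
orient_pseudotime <- function(pseudotime, timepoint) {
  if (cor(pseudotime, timepoint, method = "spearman") < 0) {
    pseudotime <- max(pseudotime) - pseudotime
  }
  pseudotime
}

# Synthetic example: pseudotime decreases with time, so it gets flipped.
time <- c(0, 0, 1, 3, 6, 6)
pt <- c(10, 9, 7, 5, 2, 1)
oriented <- orient_pseudotime(pt, time)
```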
sub.richard &lt;- slingshot(sub.richard, reducedDim=&quot;PCA&quot;) plot(sub.richard$time, sub.richard$slingPseudotime_1, xlab=&quot;Time (hours)&quot;, ylab=&quot;Pseudotime&quot;) Figure 10.16: Pseudotime as a function of real time in the Richard T cell dataset. Of course, this strategy relies on careful experimental design to ensure that multiple timepoints are actually collected. This requires more planning and resources (i.e., cost!) and is frequently absent from many scRNA-seq studies that only consider a single “snapshot” of the system. Generation of multiple timepoints also requires an amenable experimental system where the initiation of the process of interest can be tightly controlled. This is often more complex to set up than a strictly observational study, though having causal information arguably makes the data more useful for making inferences. Session Info View session info R version 4.4.2 (2024-10-31) Platform: x86_64-pc-linux-gnu Running under: Ubuntu 24.04.1 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C time zone: America/New_York tzcode source: system (glibc) attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] bluster_1.16.0 velociraptor_1.16.0 [3] scran_1.34.0 ensembldb_2.30.0 [5] AnnotationFilter_1.30.0 GenomicFeatures_1.58.0 [7] AnnotationDbi_1.68.0 scRNAseq_2.20.0 [9] tradeSeq_1.20.0 slingshot_2.14.0 [11] princurve_2.1.6 TSCAN_1.44.0 [13] TrajectoryUtils_1.14.0 scater_1.34.0 [15] ggplot2_3.5.1 scuttle_1.16.0 [17] SingleCellExperiment_1.28.1 SummarizedExperiment_1.36.0 [19] Biobase_2.66.0 GenomicRanges_1.58.0 [21] GenomeInfoDb_1.42.1 
IRanges_2.40.1 [23] S4Vectors_0.44.0 BiocGenerics_0.52.0 [25] MatrixGenerics_1.18.1 matrixStats_1.5.0 [27] BiocStyle_2.34.0 rebook_1.16.0 loaded via a namespace (and not attached): [1] splines_4.4.2 later_1.4.1 [3] BiocIO_1.16.0 bitops_1.0-9 [5] filelock_1.0.3 tibble_3.2.1 [7] CodeDepends_0.6.6 basilisk.utils_1.18.0 [9] graph_1.84.1 XML_3.99-0.18 [11] httr2_1.1.0 lifecycle_1.0.4 [13] edgeR_4.4.1 lattice_0.22-6 [15] alabaster.base_1.6.1 magrittr_2.0.3 [17] limma_3.62.2 sass_0.4.9 [19] rmarkdown_2.29 jquerylib_0.1.4 [21] yaml_2.3.10 metapod_1.14.0 [23] httpuv_1.6.15 reticulate_1.40.0 [25] cowplot_1.1.3 pbapply_1.7-2 [27] DBI_1.2.3 RColorBrewer_1.1-3 [29] abind_1.4-8 zlibbioc_1.52.0 [31] Rtsne_0.17 purrr_1.0.2 [33] RCurl_1.98-1.16 rappdirs_0.3.3 [35] GenomeInfoDbData_1.2.13 ggrepel_0.9.6 [37] irlba_2.3.5.1 alabaster.sce_1.6.0 [39] pheatmap_1.0.12 dqrng_0.4.1 [41] DelayedMatrixStats_1.28.1 codetools_0.2-20 [43] DelayedArray_0.32.0 tidyselect_1.2.1 [45] UCSC.utils_1.2.0 farver_2.1.2 [47] ScaledMatrix_1.14.0 viridis_0.6.5 [49] BiocFileCache_2.14.0 GenomicAlignments_1.42.0 [51] jsonlite_1.8.9 BiocNeighbors_2.0.1 [53] tools_4.4.2 Rcpp_1.0.14 [55] glue_1.8.0 gridExtra_2.3 [57] SparseArray_1.6.1 xfun_0.50 [59] mgcv_1.9-1 HDF5Array_1.34.0 [61] gypsum_1.2.0 dplyr_1.1.4 [63] withr_3.0.2 combinat_0.0-8 [65] BiocManager_1.30.25 fastmap_1.2.0 [67] basilisk_1.18.0 rhdf5filters_1.18.0 [69] caTools_1.18.3 digest_0.6.37 [71] rsvd_1.0.5 R6_2.5.1 [73] mime_0.12 colorspace_2.1-1 [75] gtools_3.9.5 RSQLite_2.3.9 [77] generics_0.1.3 FNN_1.1.4.1 [79] rtracklayer_1.66.0 httr_1.4.7 [81] S4Arrays_1.6.0 uwot_0.2.2 [83] pkgconfig_2.0.3 gtable_0.3.6 [85] blob_1.2.4 XVector_0.46.0 [87] htmltools_0.5.8.1 bookdown_0.42 [89] ProtGenerics_1.38.0 alabaster.matrix_1.6.1 [91] scales_1.3.0 png_0.1-8 [93] knitr_1.49 rjson_0.2.23 [95] nlme_3.1-166 curl_6.1.0 [97] rhdf5_2.50.2 cachem_1.1.0 [99] BiocVersion_3.20.0 KernSmooth_2.23-26 [101] parallel_4.4.2 vipor_0.4.7 [103] zellkonverter_1.16.0 restfulr_0.0.15 
[105] alabaster.schemas_1.6.0 pillar_1.10.1 [107] grid_4.4.2 fastICA_1.2-7 [109] vctrs_0.6.5 gplots_3.2.0 [111] promises_1.3.2 BiocSingular_1.22.0 [113] dbplyr_2.5.0 beachmat_2.22.0 [115] xtable_1.8-4 cluster_2.1.8 [117] beeswarm_0.4.0 evaluate_1.0.3 [119] Rsamtools_2.22.0 cli_3.6.3 [121] locfit_1.5-9.10 compiler_4.4.2 [123] rlang_1.1.5 crayon_1.5.3 [125] labeling_0.4.3 mclust_6.1.1 [127] plyr_1.8.9 ggbeeswarm_0.7.2 [129] alabaster.se_1.6.0 viridisLite_0.4.2 [131] BiocParallel_1.40.0 munsell_0.5.1 [133] Biostrings_2.74.1 lazyeval_0.2.2 [135] Matrix_1.7-1 dir.expiry_1.14.0 [137] ExperimentHub_2.14.0 sparseMatrixStats_1.18.0 [139] bit64_4.6.0-1 Rhdf5lib_1.28.0 [141] KEGGREST_1.46.0 statmod_1.5.0 [143] shiny_1.10.0 alabaster.ranges_1.6.0 [145] AnnotationHub_3.14.0 igraph_2.1.3 [147] memoise_2.0.1 bslib_0.8.0 [149] bit_4.5.0.1 References "],["single-nuclei-rna-seq-processing.html", "Chapter 11 Single-nuclei RNA-seq processing 11.1 Introduction 11.2 Quality control for stripped nuclei 11.3 Comments on downstream analyses 11.4 Tricks with ambient contamination Session Info", " Chapter 11 Single-nuclei RNA-seq processing .rebook-collapse { background-color: #eee; color: #444; cursor: pointer; padding: 18px; width: 100%; border: none; text-align: left; outline: none; font-size: 15px; } .rebook-content { padding: 0 18px; display: none; overflow: hidden; background-color: #f1f1f1; } 11.1 Introduction Single-nuclei RNA-seq (snRNA-seq) provides another strategy for performing single-cell transcriptomics where individual nuclei instead of cells are captured and sequenced. The major advantage of snRNA-seq over scRNA-seq is that the former does not require the preservation of cellular integrity during sample preparation, especially dissociation. We only need to extract nuclei in an intact state, meaning that snRNA-seq can be applied to cell types, tissues and samples that are not amenable to dissociation and later processing. 
The cost of this flexibility is the loss of transcripts that are primarily located in the cytoplasm, potentially limiting the availability of biological signal for genes with little nuclear localization. The computational analysis of snRNA-seq data is very much like that of scRNA-seq data. We have a matrix of (UMI) counts for genes by cells that requires quality control, normalization and so on. (Technically, the columns correspond to nuclei but we will use these two terms interchangeably in this chapter.) In fact, the biggest difference in processing occurs in the construction of the count matrix itself, where intronic regions must be included in the annotation for each gene to account for the increased abundance of unspliced transcripts. The rest of the analysis only requires a few minor adjustments to account for the loss of cytoplasmic transcripts. We demonstrate using a dataset from Wu et al. (2019) involving snRNA-seq on healthy and fibrotic mouse kidneys. library(scRNAseq) sce &lt;- WuKidneyData() sce &lt;- sce[,sce$Technology==&quot;sNuc-10x&quot;] sce ## class: SingleCellExperiment ## dim: 18249 8231 ## metadata(0): ## assays(1): counts ## rownames(18249): mt-Cytb mt-Nd6 ... Gm44613 Gm38304 ## rowData names(0): ## colnames: NULL ## colData names(4): CellBarcode CellType Technology Status ## reducedDimNames(0): ## mainExpName: NULL ## altExpNames(0): 11.2 Quality control for stripped nuclei The loss of the cytoplasm means that the stripped nuclei should not contain any mitochondrial transcripts. This means that the mitochondrial proportion becomes an excellent QC metric for the efficacy of the stripping process. Unlike scRNA-seq, there is no need to worry about variations in mitochondrial content due to genuine biology. 
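The metric itself is just the percentage of each cell's counts assigned to mitochondrial genes, which we can sketch in base R. The "mt-" prefix assumption matches the mouse gene symbols in this dataset; the function name is our own:

```r
# Per-cell mitochondrial percentage from a genes-by-cells count matrix.
# Mouse mitochondrial genes are identified by their "mt-" name prefix.
mito_percent <- function(counts) {
  is.mito <- grepl("^mt-", rownames(counts))
  100 * colSums(counts[is.mito, , drop = FALSE]) / colSums(counts)
}

counts <- matrix(
  c(0, 90,   # cell 1: fully stripped nucleus (no mitochondrial counts)
    10, 90), # cell 2: incomplete stripping (10% mitochondrial)
  nrow = 2,
  dimnames = list(c("mt-Nd1", "Actb"), c("stripped", "leaky"))
)
pct <- mito_percent(counts)
```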
High-quality nuclei should not contain any mitochondrial transcripts; the presence of any mitochondrial counts in a library indicates that the removal of the cytoplasm was not complete, possibly introducing irrelevant heterogeneity in downstream analyses. library(scuttle) sce &lt;- addPerCellQC(sce, subsets=list(Mt=grep(&quot;^mt-&quot;, rownames(sce)))) summary(sce$subsets_Mt_percent == 0) ## Mode FALSE TRUE ## logical 2264 5967 We apply a simple filter to remove libraries corresponding to incompletely stripped nuclei. The outlier-based approach described in Section 12.3 can be used here, but some caution is required in low-coverage experiments where a majority of cells have zero mitochondrial counts. In such cases, the MAD may also be zero such that other libraries with very low but non-zero mitochondrial counts are removed. This is typically too conservative as such transcripts may be present due to sporadic ambient contamination rather than incomplete stripping. stats &lt;- quickPerCellQC(colData(sce), sub.fields=&quot;subsets_Mt_percent&quot;) colSums(as.matrix(stats)) ## low_lib_size low_n_features high_subsets_Mt_percent ## 0 0 2264 ## discard ## 2264 Instead, we enforce a minimum difference between the threshold and the median in isOutlier() (Figure 11.1). We arbitrarily choose +0.5% here, which takes precedence over the outlier-based threshold if the latter is too low. In this manner, we avoid discarding libraries with a very modest amount of contamination; the same code will automatically fall back to the outlier-based threshold in datasets where the stripping was systematically less effective. 
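The effect of the minimum difference is easy to see with a simplified sketch of the higher-tail outlier rule — threshold at the median plus a multiple of the MAD, floored at min.diff above the median. This toy function ignores the log transform and batch handling of the real isOutlier():

```r
# Simplified higher-tail outlier call: threshold at median + nmads * MAD,
# but never closer to the median than `min.diff`. With mostly-zero
# mitochondrial percentages the MAD collapses to zero, so the floor
# stops cells with tiny non-zero values from being discarded.
is_outlier_higher <- function(x, nmads = 3, min.diff = 0) {
  med <- median(x)
  threshold <- med + max(nmads * mad(x), min.diff)
  x > threshold
}

mito <- c(rep(0, 8), 0.2, 45)  # mostly zeros, one speck, one bad library
naive <- is_outlier_higher(mito)                    # MAD is zero: flags both non-zero cells
floored <- is_outlier_higher(mito, min.diff = 0.5)  # retains the 0.2% cell
```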
stats$high_subsets_Mt_percent &lt;- isOutlier(sce$subsets_Mt_percent, type=&quot;higher&quot;, min.diff=0.5) stats$discard &lt;- Reduce(&quot;|&quot;, stats[,colnames(stats)!=&quot;discard&quot;]) colSums(as.matrix(stats)) ## low_lib_size low_n_features high_subsets_Mt_percent ## 0 0 42 ## discard ## 42 library(scater) plotColData(sce, x=&quot;Status&quot;, y=&quot;subsets_Mt_percent&quot;, colour_by=I(stats$high_subsets_Mt_percent)) Figure 11.1: Distribution of the mitochondrial proportions in the Wu kidney dataset. Each point represents a cell and is colored according to whether it was considered to be of low quality and discarded. 11.3 Comments on downstream analyses The rest of the analysis can then be performed using the same strategies discussed for scRNA-seq (Figure 11.2). Despite the loss of cytoplasmic transcripts, there is usually still enough biological signal to characterize population heterogeneity (Bakken et al. 2018; Wu et al. 2019). In fact, one could even say that snRNA-seq has a higher signal-to-noise ratio as sequencing coverage is not spent on highly abundant but typically uninteresting transcripts for mitochondrial and ribosomal protein genes. It also has the not inconsiderable advantage of being able to recover subpopulations that are not amenable to dissociation and would be lost by scRNA-seq protocols. library(scran) set.seed(111) sce &lt;- logNormCounts(sce[,!stats$discard]) dec &lt;- modelGeneVarByPoisson(sce) sce &lt;- runPCA(sce, subset_row=getTopHVGs(dec, n=4000)) sce &lt;- runTSNE(sce, dimred=&quot;PCA&quot;) library(bluster) colLabels(sce) &lt;- clusterRows(reducedDim(sce, &quot;PCA&quot;), NNGraphParam()) gridExtra::grid.arrange( plotTSNE(sce, colour_by=&quot;label&quot;, text_by=&quot;label&quot;), plotTSNE(sce, colour_by=&quot;Status&quot;), ncol=2 ) Figure 11.2: \\(t\\)-SNE plots of the Wu kidney dataset. Each point is a cell and is colored by its cluster assignment (left) or its disease status (right). 
We can also apply more complex procedures such as batch correction (Multi-sample Chapter 1). Here, we eliminate the disease effect to identify shared clusters (Figure 11.3). library(batchelor) set.seed(1101) merged &lt;- multiBatchNorm(sce, batch=sce$Status) merged &lt;- correctExperiments(merged, batch=merged$Status, PARAM=FastMnnParam()) merged &lt;- runTSNE(merged, dimred=&quot;corrected&quot;) colLabels(merged) &lt;- clusterRows(reducedDim(merged, &quot;corrected&quot;), NNGraphParam()) gridExtra::grid.arrange( plotTSNE(merged, colour_by=&quot;label&quot;, text_by=&quot;label&quot;), plotTSNE(merged, colour_by=&quot;batch&quot;), ncol=2 ) Figure 11.3: More \\(t\\)-SNE plots of the Wu kidney dataset after applying MNN correction across diseases. Similarly, we can perform marker detection on the snRNA-seq expression values as discussed in Basic Chapter 6. For the most part, interpretation of these DE results makes the simplifying assumption that nuclear abundances are a good proxy for the overall expression profile. This is generally reasonable but may not always be true, resulting in some discrepancies in the marker sets between snRNA-seq and scRNA-seq datasets. For example, transcripts for strongly expressed genes might localize to the cytoplasm for efficient translation and subsequently be lost upon stripping, while genes with the same overall expression but differences in the rate of nuclear export may appear to be differentially expressed between clusters. In the most pathological case, higher snRNA-seq abundances may indicate nuclear sequestration of transcripts for protein-coding genes and reduced activity of the relevant biological process, contrary to the usual interpretation of the effect of upregulation. 
markers &lt;- findMarkers(merged, block=merged$Status, direction=&quot;up&quot;) markers[[&quot;3&quot;]][1:10,1:3] ## DataFrame with 10 rows and 3 columns ## Top p.value FDR ## &lt;integer&gt; &lt;numeric&gt; &lt;numeric&gt; ## Sorcs1 1 8.31936e-262 1.51820e-258 ## Ltn1 1 7.85490e-83 1.07778e-80 ## Bmp6 1 0.00000e+00 0.00000e+00 ## Il34 1 0.00000e+00 0.00000e+00 ## Them7 1 9.46508e-208 8.22516e-205 ## Pak1 1 1.35170e-184 8.22241e-182 ## Kcnip4 1 0.00000e+00 0.00000e+00 ## Mecom 1 5.30105e-20 7.11838e-19 ## Pakap 1 0.00000e+00 0.00000e+00 ## Wdr17 1 1.34668e-202 1.11707e-199 plotTSNE(merged, colour_by=&quot;Kcnip4&quot;) Other analyses described for scRNA-seq require more care when they are applied to snRNA-seq data. Most obviously, cell type annotation based on reference profiles (Basic Chapter 7) should be treated with some caution as the majority of existing references are constructed from bulk or single-cell datasets with cytoplasmic transcripts. Interpretation of RNA velocity results may also be complicated by variation in the rate of nuclear export of spliced transcripts. 11.4 Tricks with ambient contamination The expected absence of genuine mitochondrial expression can also be exploited to estimate the level of ambient contamination (Multi-sample Chapter 5). We demonstrate on mouse brain snRNA-seq data from 10X Genomics (Zheng et al. 2017), using the raw count matrix prior to any filtering for nuclei-containing barcodes. library(DropletTestFiles) raw.path &lt;- getTestFile(&quot;tenx-2.0.1-nuclei_900/1.0.0/raw.tar.gz&quot;) out.path &lt;- file.path(tempdir(), &quot;nuclei&quot;) untar(raw.path, exdir=out.path) library(DropletUtils) fname &lt;- file.path(out.path, &quot;raw_gene_bc_matrices/mm10&quot;) sce.brain &lt;- read10xCounts(fname, col.names=TRUE) sce.brain ## class: SingleCellExperiment ## dim: 27998 737280 ## metadata(1): Samples ## assays(1): counts ## rownames(27998): ENSMUSG00000051951 ENSMUSG00000089699 ... 
## ENSMUSG00000096730 ENSMUSG00000095742 ## rowData names(2): ID Symbol ## colnames(737280): AAACCTGAGAAACCAT-1 AAACCTGAGAAACCGC-1 ... ## TTTGTCATCTTTAGTC-1 TTTGTCATCTTTCCTC-1 ## colData names(2): Sample Barcode ## reducedDimNames(0): ## mainExpName: NULL ## altExpNames(0): We call non-empty droplets using emptyDrops() as previously described (Section 7.2). library(DropletUtils) e.out &lt;- emptyDrops(counts(sce.brain)) summary(e.out$FDR &lt;= 0.001) ## Mode FALSE TRUE NA&#39;s ## logical 2307 1729 733244 If our libraries are of high quality, we can assume that any mitochondrial “expression” is due to contamination from the ambient solution. We then use the controlAmbience() function to estimate the proportion of ambient contamination for each gene, allowing us to mark potentially problematic genes in the DE results (Figure 11.4). In fact, we can use this information even earlier to remove these genes during dimensionality reduction and clustering. This is not generally possible for scRNA-seq as any notable contaminating transcripts may originate from a subpopulation that actually expresses that gene and thus cannot be blindly removed. ambient &lt;- estimateAmbience(counts(sce.brain), round=FALSE, good.turing=FALSE) nuclei &lt;- rowSums(counts(sce.brain)[,which(e.out$FDR &lt;= 0.001)]) is.mito &lt;- grepl(&quot;mt-&quot;, rowData(sce.brain)$Symbol) contam &lt;- controlAmbience(nuclei, ambient, features=is.mito, mode=&quot;proportion&quot;) plot(log10(nuclei+1), contam*100, col=ifelse(is.mito, &quot;red&quot;, &quot;grey&quot;), pch=16, xlab=&quot;Log-nuclei expression&quot;, ylab=&quot;Contamination (%)&quot;) Figure 11.4: Percentage of counts in the nuclei of the 10X brain dataset that are attributed to contamination from the ambient solution. Each point represents a gene and mitochondrial genes are highlighted in red. 
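The core of this estimate can be sketched in base R: scale the ambient profile so that it fully accounts for the counts of the control features (here, mitochondrial genes, which should be absent from true nuclei), then express the scaled ambient counts as a proportion of each gene's observed counts. This is a deliberate simplification of what controlAmbience() does, with made-up numbers:

```r
# Per-gene ambient contamination proportion, assuming all counts for the
# control features are ambient in origin. The scaling factor is chosen
# so the ambient profile explains 100% of the control-feature counts.
ambient_proportion <- function(observed, ambient, is.control) {
  scale <- sum(observed[is.control]) / sum(ambient[is.control])
  pmin(scale * ambient / observed, 1)  # cap at 100% contamination
}

observed <- c(`mt-Nd1` = 50, Actb = 1000, Malat1 = 500)
ambient  <- c(`mt-Nd1` = 5,  Actb = 10,   Malat1 = 1)
contam <- ambient_proportion(observed, ambient,
    is.control = c(TRUE, FALSE, FALSE))
```

Genes with a high contamination proportion would then be flagged or removed before clustering and marker detection.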
Session Info View session info R version 4.4.2 (2024-10-31) Platform: x86_64-pc-linux-gnu Running under: Ubuntu 24.04.1 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C time zone: America/New_York tzcode source: system (glibc) attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] DropletUtils_1.26.0 DropletTestFiles_1.16.0 [3] batchelor_1.22.0 bluster_1.16.0 [5] scran_1.34.0 scater_1.34.0 [7] ggplot2_3.5.1 scuttle_1.16.0 [9] scRNAseq_2.20.0 SingleCellExperiment_1.28.1 [11] SummarizedExperiment_1.36.0 Biobase_2.66.0 [13] GenomicRanges_1.58.0 GenomeInfoDb_1.42.1 [15] IRanges_2.40.1 S4Vectors_0.44.0 [17] BiocGenerics_0.52.0 MatrixGenerics_1.18.1 [19] matrixStats_1.5.0 BiocStyle_2.34.0 [21] rebook_1.16.0 loaded via a namespace (and not attached): [1] jsonlite_1.8.9 CodeDepends_0.6.6 [3] magrittr_2.0.3 ggbeeswarm_0.7.2 [5] GenomicFeatures_1.58.0 gypsum_1.2.0 [7] farver_2.1.2 rmarkdown_2.29 [9] BiocIO_1.16.0 zlibbioc_1.52.0 [11] vctrs_0.6.5 DelayedMatrixStats_1.28.1 [13] memoise_2.0.1 Rsamtools_2.22.0 [15] RCurl_1.98-1.16 htmltools_0.5.8.1 [17] S4Arrays_1.6.0 AnnotationHub_3.14.0 [19] curl_6.1.0 BiocNeighbors_2.0.1 [21] Rhdf5lib_1.28.0 SparseArray_1.6.1 [23] rhdf5_2.50.2 sass_0.4.9 [25] alabaster.base_1.6.1 bslib_0.8.0 [27] alabaster.sce_1.6.0 httr2_1.1.0 [29] cachem_1.1.0 ResidualMatrix_1.16.0 [31] GenomicAlignments_1.42.0 igraph_2.1.3 [33] mime_0.12 lifecycle_1.0.4 [35] pkgconfig_2.0.3 rsvd_1.0.5 [37] Matrix_1.7-1 R6_2.5.1 [39] fastmap_1.2.0 GenomeInfoDbData_1.2.13 [41] digest_0.6.37 colorspace_2.1-1 [43] AnnotationDbi_1.68.0 dqrng_0.4.1 [45] irlba_2.3.5.1 
ExperimentHub_2.14.0 [47] RSQLite_2.3.9 beachmat_2.22.0 [49] labeling_0.4.3 filelock_1.0.3 [51] httr_1.4.7 abind_1.4-8 [53] compiler_4.4.2 bit64_4.6.0-1 [55] withr_3.0.2 BiocParallel_1.40.0 [57] viridis_0.6.5 DBI_1.2.3 [59] R.utils_2.12.3 HDF5Array_1.34.0 [61] alabaster.ranges_1.6.0 alabaster.schemas_1.6.0 [63] rappdirs_0.3.3 DelayedArray_0.32.0 [65] rjson_0.2.23 tools_4.4.2 [67] vipor_0.4.7 beeswarm_0.4.0 [69] R.oo_1.27.0 glue_1.8.0 [71] restfulr_0.0.15 rhdf5filters_1.18.0 [73] grid_4.4.2 Rtsne_0.17 [75] cluster_2.1.8 generics_0.1.3 [77] gtable_0.3.6 R.methodsS3_1.8.2 [79] ensembldb_2.30.0 metapod_1.14.0 [81] BiocSingular_1.22.0 ScaledMatrix_1.14.0 [83] XVector_0.46.0 ggrepel_0.9.6 [85] BiocVersion_3.20.0 pillar_1.10.1 [87] limma_3.62.2 dplyr_1.1.4 [89] BiocFileCache_2.14.0 lattice_0.22-6 [91] rtracklayer_1.66.0 bit_4.5.0.1 [93] tidyselect_1.2.1 locfit_1.5-9.10 [95] Biostrings_2.74.1 knitr_1.49 [97] gridExtra_2.3 bookdown_0.42 [99] ProtGenerics_1.38.0 edgeR_4.4.1 [101] xfun_0.50 statmod_1.5.0 [103] UCSC.utils_1.2.0 lazyeval_0.2.2 [105] yaml_2.3.10 evaluate_1.0.3 [107] codetools_0.2-20 tibble_3.2.1 [109] alabaster.matrix_1.6.1 BiocManager_1.30.25 [111] graph_1.84.1 cli_3.6.3 [113] munsell_0.5.1 jquerylib_0.1.4 [115] Rcpp_1.0.14 dir.expiry_1.14.0 [117] dbplyr_2.5.0 png_0.1-8 [119] XML_3.99-0.18 parallel_4.4.2 [121] blob_1.2.4 AnnotationFilter_1.30.0 [123] sparseMatrixStats_1.18.0 bitops_1.0-9 [125] viridisLite_0.4.2 alabaster.se_1.6.0 [127] scales_1.3.0 purrr_1.0.2 [129] crayon_1.5.3 rlang_1.1.5 [131] cowplot_1.1.3 KEGGREST_1.46.0 References "],["integrating-with-protein-abundance.html", "Chapter 12 Integrating with protein abundance 12.1 Motivation 12.2 Setting up the data 12.3 Quality control 12.4 Normalization 12.5 Comments on downstream analyses 12.6 Integration with gene expression data 12.7 Finding correlations between features Session Info", " Chapter 12 Integrating with protein abundance .rebook-collapse { background-color: #eee; color: #444; cursor: 
pointer; padding: 18px; width: 100%; border: none; text-align: left; outline: none; font-size: 15px; } .rebook-content { padding: 0 18px; display: none; overflow: hidden; background-color: #f1f1f1; } 12.1 Motivation Cellular indexing of transcriptomes and epitopes by sequencing (CITE-seq) is a technique that quantifies both gene expression and the abundance of selected surface proteins in each cell simultaneously (Stoeckius et al. 2017). In this approach, cells are first labelled with antibodies that have been conjugated to synthetic RNA tags. A cell with a higher abundance of a target protein will be bound by more antibodies, causing more molecules of the corresponding antibody-derived tag (ADT) to be attached to that cell. Cells are then separated into their own reaction chambers using droplet-based microfluidics (Zheng et al. 2017). Both the ADTs and endogenous transcripts are reverse-transcribed and captured into a cDNA library; the abundance of each protein or expression of each gene is subsequently quantified by sequencing of each set of features. This provides a powerful tool for interrogating aspects of the proteome (such as post-translational modifications) and other cellular features that would normally be invisible to transcriptomic studies. How should the ADT data be incorporated into the analysis? While we have counts for both ADTs and transcripts, there are fundamental differences in the nature of the data that make it difficult to treat the former as additional features in the latter. Most experiments involve only a small number (20-200) of antibodies that are chosen by the researcher because they are of a priori interest, in contrast to gene expression data that captures the entire transcriptome regardless of the study. The coverage of the ADTs is also much deeper as they are sequenced separately from the transcripts, allowing the sequencing resources to be concentrated into a smaller number of features. 
And, of course, the use of antibodies against protein targets involves consideration of separate biases compared to those observed for transcripts. This chapter will describe some strategies for integrated analysis of ADT and transcript data in CITE-seq experiments. 12.2 Setting up the data We will demonstrate using a PBMC dataset from 10X Genomics that contains quantified abundances for a number of interesting surface proteins. We obtain the dataset using the DropletTestFiles package, after which we can create a SingleCellExperiment as shown below. library(DropletTestFiles) path &lt;- getTestFile(&quot;tenx-3.0.0-pbmc_10k_protein_v3/1.0.0/filtered.tar.gz&quot;) dir &lt;- tempfile() untar(path, exdir=dir) # Loading it in as a SingleCellExperiment object. library(DropletUtils) sce &lt;- read10xCounts(file.path(dir, &quot;filtered_feature_bc_matrix&quot;)) sce ## class: SingleCellExperiment ## dim: 33555 7865 ## metadata(1): Samples ## assays(1): counts ## rownames(33555): ENSG00000243485 ENSG00000237613 ... IgG1 IgG2b ## rowData names(3): ID Symbol Type ## colnames: NULL ## colData names(2): Sample Barcode ## reducedDimNames(0): ## mainExpName: NULL ## altExpNames(0): The SingleCellExperiment class provides the concept of an “alternative Experiment” to store data for different sets of features but the same cells. This involves storing another SummarizedExperiment (or an instance of a subclass) inside our SingleCellExperiment where the rows (features) can differ but the columns (cells) are the same. In previous chapters, we were using the alternative Experiments to store spike-in data, but here we will use the concept to split off the ADT data. This isolates the two sets of features to ensure that analyses on one set do not inadvertently use data from the other set, and vice versa. sce &lt;- splitAltExps(sce, rowData(sce)$Type) altExpNames(sce) ## [1] &quot;Antibody Capture&quot; altExp(sce) # Can be used like any other SingleCellExperiment. 
## class: SingleCellExperiment ## dim: 17 7865 ## metadata(1): Samples ## assays(1): counts ## rownames(17): CD3 CD4 ... IgG1 IgG2b ## rowData names(3): ID Symbol Type ## colnames: NULL ## colData names(0): ## reducedDimNames(0): ## mainExpName: NULL ## altExpNames(0): counts(altExp(sce))[,1:10] # sneak peek ## 17 x 10 sparse Matrix of class &quot;dgCMatrix&quot; ## ## CD3 18 30 18 18 5 21 34 48 4522 2910 ## CD4 138 119 207 11 14 1014 324 1127 3479 2900 ## CD8a 13 19 10 17 14 29 27 43 38 28 ## CD14 491 472 1289 20 19 2428 1958 2189 55 41 ## CD15 61 102 128 124 156 204 607 128 111 130 ## CD16 17 155 72 1227 1873 148 676 75 44 37 ## CD56 17 248 26 491 458 29 29 29 30 15 ## CD19 3 3 8 5 4 7 15 4 6 6 ## CD25 9 5 15 15 16 52 85 17 13 18 ## CD45RA 110 125 5268 4743 4108 227 175 523 4044 1081 ## CD45RO 74 156 28 28 21 492 517 316 26 43 ## PD-1 9 9 20 25 28 16 26 16 28 16 ## TIGIT 4 9 11 59 76 11 12 12 9 8 ## CD127 7 8 12 16 17 15 11 10 231 179 ## IgG2a 5 4 12 12 7 9 6 3 19 14 ## IgG1 2 8 19 16 14 10 12 7 16 10 ## IgG2b 3 3 6 4 9 8 50 2 8 2 12.3 Quality control 12.3.1 Issues with RNA-based metrics As in the RNA-based analysis, we want to remove cells that have failed to capture/sequence the ADTs. Ideally, we would re-use the QC metrics and strategies described in Basic Chapter 1. Unfortunately, this is complicated by some key differences between ADT and RNA count data. In particular, the features in a CITE-seq experiment are usually chosen because they capture interesting variation. This enriches for biology, which is desirable, but also makes it difficult to separate biological and technical effects. We start with the most obvious QC metric, the total count across all ADTs in a cell. The presence of a targeted protein can lead to a several-fold increase in the total ADT count given the binary nature of most surface markers. For example, the most abundant marker in each cell is variable and associated with order-of-magnitude changes in the total ADT count (Figure 12.1).
Removing cells with low total ADT counts could inadvertently eliminate cell types that do not express many - or indeed, any - of the selected protein targets. adt.counts &lt;- counts(altExp(sce)) top.marker &lt;- rownames(adt.counts)[max.col(t(adt.counts))] total.count &lt;- colSums(adt.counts) boxplot(split(log10(total.count), top.marker), ylab=&quot;Log-total ADT count&quot;, las=2) Figure 12.1: Distribution of the log10-total ADT count across cells, stratified by the identity of the most abundant marker in each cell. The other common metric is the number of features detected (i.e., with non-zero counts) in each cell, but again, this is not as helpful for ADTs compared to RNA data. Recall that droplet-based libraries will contain contamination from the ambient solution (Section 7.2), in this case containing conjugated antibodies that are either free in solution or bound to cell fragments. As the ADTs are relatively deeply sequenced, we expect non-zero counts for most ADTs in each cell due to these contaminating tags. This may result in near-identical values for all cells (Figure 12.2) regardless of the efficacy of cDNA capture. adt.detected &lt;- colSums(adt.counts &gt; 0) hist(adt.detected, col=&#39;grey&#39;, main=&quot;&quot;, xlab=&quot;Number of detected ADTs&quot;) Figure 12.2: Distribution of the number of detected ADTs across all cells in the PBMC dataset. Contrast this with the utility of the same QC metrics for the RNA counts. Most counts in an RNA-seq library are assigned to constitutively expressed genes that should not exhibit major changes across subpopulations, allowing us to use the total RNA count as a QC metric that is somewhat independent of biological state. Similarly, the larger number of features keeps the counts close to zero and ensures that the number of detected genes changes appreciably in response to technical effects.
This suggests that, for a larger antibody panel (including some constitutively abundant proteins) and shallower sequencing, the ADT count matrix may be sufficiently “RNA-like” that these metrics can be used in the same way. Obviously, the mitochondrial and spike-in proportions are not applicable here. 12.3.2 Applying custom QC filters To replace the standard QC metrics, we take advantage of one of the more obvious technical effects in ADT data - namely, contamination from the ambient solution. As previously mentioned, we expect non-zero counts for most ADTs due to this contamination, even if the cell does not exhibit any actual expression of the corresponding protein marker. If most ADTs for a cell instead have zero counts, we may suspect a problem during library preparation. We identify such libraries that have no estimated contribution from the ambient solution and mark them for removal during QC filtering. controls &lt;- grep(&quot;^Ig&quot;, rownames(altExp(sce))) # see below for details. qc.stats &lt;- cleanTagCounts(altExp(sce), controls=controls) summary(qc.stats$zero.ambient) # libraries removed with no ambient contamination ## Mode FALSE TRUE ## logical 7864 1 This experiment also includes isotype control (IgG) antibodies that have similar properties to a primary antibody but lack a specific target in the cell. The coverage of these control ADTs serves as a measure of non-specific binding in each cell, most notably protein aggregates. Inclusion of antibody aggregates in a droplet causes a large increase in the counts for all tags in the corresponding library, which interferes with the downstream analysis by masking the cell’s true surface marker phenotype. We identify affected cells as high outliers for the total control count (Figure 12.3); hard thresholds are more difficult to specify due to experiment-by-experiment variation in the expected coverage of ADTs. 
summary(qc.stats$high.controls) ## Mode FALSE TRUE ## logical 7730 135 hist(log10(qc.stats$sum.controls + 1), col=&#39;grey&#39;, breaks=50, main=&quot;&quot;, xlab=&quot;Log-total count for controls per cell&quot;) thresholds &lt;- attr(qc.stats$high.controls, &quot;thresholds&quot;) abline(v=log10(thresholds[&quot;higher&quot;]+1), col=&quot;red&quot;, lty=2) Figure 12.3: Distribution of the log-coverage of IgG control ADTs across all cells in the PBMC dataset. The red dotted line indicates the threshold above which cells were removed. If we want to analyze gene expression and ADT data together, we still need to apply quality control on the gene counts. It is possible for a cell to have satisfactory ADT counts but poor QC metrics from the endogenous genes - for example, cell damage manifesting as high mitochondrial proportions would not be captured in the ADT data. More generally, the ADTs are synthetic, external to the cell and conjugated to antibodies, so it is not surprising that they would experience different cell-specific technical effects than the endogenous transcripts. This motivates a distinct QC step on the genes to ensure that we only retain high-quality cells in both feature spaces. Here, the count matrix has already been filtered by Cellranger to remove empty droplets so we only filter on the mitochondrial proportions to remove putative low-quality cells. library(scuttle) mito &lt;- grep(&quot;^MT-&quot;, rowData(sce)$Symbol) df &lt;- perCellQCMetrics(sce, subsets=list(Mito=mito)) mito.discard &lt;- isOutlier(df$subsets_Mito_percent, type=&quot;higher&quot;) summary(mito.discard) ## Mode FALSE TRUE ## logical 7569 296 Finally, to remove the low-quality cells, we subset the SingleCellExperiment as previously described. This automatically applies the filtering to both the transcript and ADT data; such coordination is one of the advantages of storing both datasets in a single object. 
Helpfully, cleanTagCounts() also provides a discard field containing the intersection of the ADT-specific filters described above. unfiltered &lt;- sce discard &lt;- qc.stats$discard | mito.discard sce &lt;- sce[,!discard] ncol(sce) ## [1] 7472 12.3.3 Other comments An additional motivation for the initial zero.ambient filter is to ensure that the size factors in Section 12.4.3 will always be positive. For different normalization schemes, another filter may be preferred, e.g., removing cells with low total control counts to facilitate control-based normalization. However, the overall goal remains the same - to remove cells with unusually low coverage, presumably due to failed capture or sequencing. In the absence of control ADTs, protein aggregates can be difficult to distinguish from cells that genuinely express multiple markers. A workaround is to assume that most markers are not actually expressed by each cell - we then estimate the amount of ambient contamination in each droplet, allowing us to remove high outliers that presumably represent droplets contaminated by protein aggregates. (This is done automatically for us by cleanTagCounts() if controls are not supplied.) However, this strategy is not appropriate for small antibody panels where a majority of markers may be expressed. qc.stats.amb &lt;- cleanTagCounts(altExp(sce)) summary(qc.stats.amb$high.ambient) ## Mode FALSE TRUE ## logical 7240 232 A more assumption-laden approach can be implemented by passing a set of mutually exclusive markers to cleanTagCounts(). This eliminates cells that exhibit high expression for more than one such marker, presumably due to contamination by aggregates (or doublet formation). In this manner, we eliminate marker combinations that are nonsensical based on prior biological knowledge. Needless to say, this is not a comprehensive approach but at least the worst offenders can be filtered out. 
qc.stats.amb &lt;- cleanTagCounts(altExp(sce), exclusive=c(&quot;CD3&quot;, &quot;CD19&quot;)) summary(qc.stats.amb$high.ambient) ## Mode FALSE TRUE ## logical 7429 43 It is also worth mentioning that, for 10X Genomics datasets, Cellranger applies its own algorithm to remove protein aggregates. 12.4 Normalization 12.4.1 Overview Counts for the ADTs are subject to several biases that must be normalized prior to further analysis. Capture efficiency varies from cell to cell, though the differences in biophysical properties between endogenous transcripts and the (much shorter) ADTs mean that the capture-related biases for the two sets of features are unlikely to be identical. Composition biases are also much more pronounced in ADT data due to (i) the binary nature of target protein abundances, where any increase in protein abundance manifests as a large increase to the total tag count; and (ii) the a priori selection of interesting protein targets, which enriches for features that are more likely to be differentially abundant across the population. As in Basic Chapter 2, we assume that these are scaling biases and compute ADT-specific size factors to remove them. To this end, several strategies are again available to calculate a size factor for each cell. 12.4.2 Library size normalization The simplest approach is to normalize on the total ADT counts, effectively the library size for the ADTs. As in Basic Section 2.2, these “ADT library size factors” are adequate for clustering but will introduce composition biases that interfere with interpretation of the fold-changes between clusters. This is especially true for relatively subtle (e.g., ~2-fold) changes in the abundances of markers associated with functional activity rather than cell type. sf.lib &lt;- librarySizeFactors(altExp(sce)) summary(sf.lib) ## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.028 0.541 0.933 1.000 1.301 5.264 We might instead consider the related approach of taking the geometric mean of all counts as the size factor for each cell (Stoeckius et al. 2017). The geometric mean is a reasonable estimator of the scaling biases for large counts with the added benefit that it mitigates the effects of composition biases by dampening the impact of one or two highly abundant ADTs. While more robust than the ADT library size factors, these geometric mean-based factors are still not entirely correct and will progressively become less accurate as upregulation increases in strength. sf.geo &lt;- geometricSizeFactors(altExp(sce)) summary(sf.geo) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.075 0.708 0.912 1.000 1.148 6.637 12.4.3 Median-based normalization Ideally, we would like to compute size factors that adjust for the composition biases. This usually requires an assumption that most ADTs are not differentially expressed between cell types/states. At first glance, this appears to be a strong assumption - the target proteins were specifically chosen as they exhibit interesting heterogeneity across the population, meaning that a non-differential majority across ADTs would be unlikely. However, we can make it work by assuming that each cell only upregulates a minority of the targeted proteins, with the remaining ADTs exhibiting some low baseline abundance - a combination of weak constitutive expression and ambient contamination - that should be constant across the population. We estimate the baseline abundance profile from the ADT count matrix by assuming that the distribution of abundances for each ADT should be bimodal. In this model, one population of cells exhibits low baseline expression while another population upregulates the corresponding protein target. We use all cells in the lower mode to compute the baseline abundance for that ADT, as shown below with the ambientProfileBimodal() function. 
We can also use other baseline profiles here - one alternative would be the ambient solution computed from empty droplets (Section 7.2). baseline &lt;- ambientProfileBimodal(altExp(sce)) head(baseline) ## CD3 CD4 CD8a CD14 CD15 CD16 ## 30.43 30.06 29.33 32.18 107.78 44.54 library(scater) plotExpression(altExp(sce), features=rownames(altExp(sce)), exprs_values=&quot;counts&quot;) + scale_y_log10() + geom_point(data=data.frame(x=names(baseline), y=baseline), mapping=aes(x=x, y=y), cex=3) Figure 12.4: Distribution of (log-)counts for each ADT in the PBMC dataset, with the inferred ambient abundance marked by the black dot. We then compute size factors to equalize the coverage of the non-upregulated majority, thus eliminating cell-to-cell differences in capture efficiency. We use a DESeq2-like approach where the size factor for each cell is defined as the median of the ratios of that cell’s counts to the baseline profile. If the abundances for most ADTs in each cell are baseline-derived, they should be roughly constant across cells; any systematic differences in the ratios correspond to cell-specific biases in sequencing coverage and are captured by the size factor. The use of the median protects against the minority of ADTs corresponding to genuinely expressed targets. sf.amb &lt;- medianSizeFactors(altExp(sce), reference=baseline) summary(sf.amb) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.025 0.723 0.877 1.000 1.091 12.640 In one subpopulation, the size factors are consistently larger than the ADT library size factors, whereas the opposite is true for most of the other subpopulations (Figure 12.5). This is consistent with the presence of composition biases due to differential abundance of the targeted proteins between subpopulations. Here, composition biases would introduce a spurious 2-fold change in normalized ADT abundance if the library size factors were used. 
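For intuition, the median-of-ratios calculation underlying medianSizeFactors() can be sketched in base R. The toy matrix below (three tags by three cells) and the baseline vector are invented for illustration only:

```r
# Hypothetical toy data: 3 tags (rows) x 3 cells (columns).
counts <- matrix(
    c(10, 20, 5,     # cell 1: matches the baseline exactly
      20, 40, 50,    # cell 2: twice the coverage, plus genuine CD3 signal
      5, 10, 2),     # cell 3: half the coverage
    nrow=3,
    dimnames=list(c("tagA", "tagB", "CD3"), NULL)
)
baseline <- c(tagA=10, tagB=20, CD3=5) # assumed baseline/ambient profile

# Size factor per cell: the median ratio of its counts to the baseline.
# The median ignores the minority of genuinely upregulated tags (CD3 in cell 2),
# so only the cell-wide coverage difference is captured.
sf <- apply(counts / baseline, 2, median)
sf
## [1] 1.0 2.0 0.5
```

Note that medianSizeFactors() additionally centers the resulting factors to a mean of 1 across cells, which this sketch omits.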
# We&#39;re getting a little ahead of ourselves here, in order to define some # clusters to make this figure; see the next section for more details. library(scran) tagdata &lt;- logNormCounts(altExp(sce)) clusters &lt;- clusterCells(tagdata, assay.type=&quot;logcounts&quot;) by.clust &lt;- split(log2(sf.amb/sf.lib), clusters) boxplot(by.clust, xlab=&quot;Cluster&quot;, ylab=&quot;Log-ratio (median-based/library size factors)&quot;) Figure 12.5: Distribution of the log-ratio of the median-based size factor to the corresponding ADT library size factor for each cell in the PBMC dataset, stratified according to the cluster identity defined from normalized ADT data. 12.4.4 Control-based normalization If control ADTs are available, we could make the assumption that they should not be differentially abundant between cells. Any difference thus represents some bias that should be normalized by defining control-based size factors from the sum of counts over all control ADTs, analogous to spike-in normalization (Basic Section 2.4). We demonstrate this approach below by computing size factors from the IgG controls (Figure 12.6). controls &lt;- grep(&quot;^Ig&quot;, rownames(altExp(sce))) sf.control &lt;- librarySizeFactors(altExp(sce), subset_row=controls) summary(sf.control) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.000 0.688 0.930 1.000 1.213 2.993 plot(sf.amb, sf.control, log=&quot;xy&quot;, xlab=&quot;Median-based size factors (tag)&quot;, ylab=&quot;Control size factors (tag)&quot;) abline(0, 1, col=&quot;grey&quot;, lty=2) Figure 12.6: IgG control-derived size factors for each cell in the PBMC dataset, compared to the median-based size factors. The dashed line represents equality. This approach exchanges the previous assumption of a non-differential majority for another assumption about the lack of differential abundance in the control tags.
We might feel that the latter is a generally weaker assumption, but it is possible for non-specific binding to vary due to biology (e.g., when the cell surface area increases), at which point this normalization strategy may not be appropriate. It also relies on sufficient coverage of the control ADTs, which is not always possible in well-executed experiments where there is little non-specific binding to provide such counts. 12.4.5 Computing log-normalized values We suggest using the median-based size factors by default, as they are generally applicable and eliminate most problems with composition biases. We set the size factors for the ADT data by calling sizeFactors() on the relevant altExp(). In contrast, sizeFactors(sce) refers to size factors for the gene counts, which are usually quite different. sizeFactors(altExp(sce)) &lt;- sf.amb Regardless of which size factors are chosen, running logNormCounts() will then perform scaling normalization and log-transformation for both the endogenous transcripts and the ADTs using their respective size factors. Alternatively, we could run logNormCounts() directly on altExp(sce) if we only wanted to compute normalized log-abundances for the ADT data (though in this case, we want to normalize the RNA data as we will be using it later in this chapter). sce &lt;- applySCE(sce, logNormCounts) # Checking that we have normalized values: assayNames(sce) ## [1] &quot;counts&quot; &quot;logcounts&quot; assayNames(altExp(sce)) ## [1] &quot;counts&quot; &quot;logcounts&quot; 12.5 Comments on downstream analyses 12.5.1 Feature selection Unlike transcript-based counts, feature selection is largely unnecessary for analyzing ADT data. This is because feature selection has already occurred during experimental design where the manual choice of target proteins means that all ADTs correspond to interesting features by definition. In this particular dataset, we have fewer than 20 features so there is little scope for further filtering. 
Even larger datasets will only have 100-200 features - small numbers compared to our previous selections of 1000-5000 HVGs in Basic Chapter 3. That said, larger datasets may benefit from a PCA step to compact the data from &gt;100 ADTs to 10-20 PCs. This can be easily achieved by running runPCA() without passing in a vector of HVGs, as demonstrated below with another CITE-seq dataset from Kotliarov et al. (2020). library(scRNAseq) adt.kotliarov &lt;- KotliarovPBMCData(mode = &quot;adt&quot;) adt.kotliarov ## class: SingleCellExperiment ## dim: 87 58654 ## metadata(0): ## assays(1): counts ## rownames(87): AnnexinV_PROT BTLA_PROT ... CD34_PROT CD20_PROT ## rowData names(0): ## colnames(58654): AAACCTGAGAGCCCAA_H1B1ln1 AAACCTGAGGCGTACA_H1B1ln1 ... ## TTTGTCATCGGTTCGG_H1B2ln6 TTTGTCATCTACCTGC_H1B2ln6 ## colData names(24): nGene nUMI ... dmx_hto_match timepoint ## reducedDimNames(0): ## mainExpName: NULL ## altExpNames(0): # QC. controls &lt;- grep(&quot;isotype&quot;, rownames(adt.kotliarov), ignore.case=TRUE) stats.kotliarov &lt;- cleanTagCounts(adt.kotliarov, controls=controls) adt.kotliarov &lt;- adt.kotliarov[,!stats.kotliarov$discard] summary(stats.kotliarov$discard) ## Mode FALSE TRUE ## logical 58410 244 # Normalization. ambient &lt;- ambientProfileBimodal(adt.kotliarov) sizeFactors(adt.kotliarov) &lt;- medianSizeFactors(adt.kotliarov, ref=ambient) adt.kotliarov &lt;- logNormCounts(adt.kotliarov) # PCA. adt.kotliarov &lt;- runPCA(adt.kotliarov, ncomponents=25) reducedDimNames(adt.kotliarov) ## [1] &quot;PCA&quot; 12.5.2 Clustering and interpretation We can apply downstream procedures like clustering and visualization on the log-normalized abundance matrix for the ADTs (Figure 12.7). Alternatively, if we had generated a matrix of PCs, we could use that as well. clusters.adt &lt;- clusterCells(altExp(sce), assay.type=&quot;logcounts&quot;) # Generating a t-SNE plot.
library(scater) set.seed(1010010) altExp(sce) &lt;- runTSNE(altExp(sce)) colLabels(altExp(sce)) &lt;- factor(clusters.adt) plotTSNE(altExp(sce), colour_by=&quot;label&quot;, text_by=&quot;label&quot;, text_color=&quot;red&quot;) Figure 12.7: t-SNE plot generated from the log-normalized abundance of each ADT in the PBMC dataset. Each point is a cell and is labelled according to its assigned cluster. With only a few ADTs, characterization of each cluster is most efficiently achieved by creating a heatmap of the average log-abundance of each tag (Figure 12.8). For this experiment, we can easily identify B cells (CD19+), various subsets of T cells (CD3+, CD4+, CD8+), monocytes and macrophages (CD14+, CD16+), to name a few. More detailed examination of the distribution of abundances within each cluster is easily performed with plotExpression() where strong bimodality may indicate that finer clustering is required to resolve cell subtypes. se.averaged &lt;- sumCountsAcrossCells(altExp(sce), clusters.adt, exprs_values=&quot;logcounts&quot;, average=TRUE) library(pheatmap) averaged &lt;- assay(se.averaged) pheatmap(averaged - rowMeans(averaged), breaks=seq(-3, 3, length.out=101)) Figure 12.8: Heatmap of the average log-normalized abundance of each ADT in each cluster of the PBMC dataset. Colors represent the log2-fold change from the grand average across all clusters. Of course, this provides little information beyond what we could have obtained from a mass cytometry experiment; the real value of this data lies in the integration of protein abundance with gene expression. 12.6 Integration with gene expression data 12.6.1 By subclustering In the simplest approach to integration, we take cells in each of the ADT-derived clusters and perform subclustering using the transcript data. This is an in silico equivalent to an experiment that performs FACS to isolate cell types followed by scRNA-seq for further characterization.
We exploit the fact that the ADT abundances are cleaner (larger counts, stronger signal) for more robust identification of broad cell types, and use the gene expression data to identify more subtle structure that manifests in the transcriptome. We demonstrate below by using quickSubCluster() to loop over all of the ADT-derived clusters and subcluster on gene expression (Figure 12.9). set.seed(101010) all.sce &lt;- quickSubCluster(sce, clusters.adt, prepFUN=function(x) { dec &lt;- modelGeneVar(x) top &lt;- getTopHVGs(dec, prop=0.1) x &lt;- runPCA(x, subset_row=top, ncomponents=25) }, clusterFUN=function(x) { clusterCells(x, use.dimred=&quot;PCA&quot;) } ) # Summarizing the number of subclusters in each tag-derived parent cluster, # compared to the number of cells in that parent cluster. ncells &lt;- vapply(all.sce, ncol, 0L) nsubclusters &lt;- vapply(all.sce, FUN=function(x) length(unique(x$subcluster)), 0L) plot(ncells, nsubclusters, xlab=&quot;Number of cells&quot;, type=&quot;n&quot;, ylab=&quot;Number of subclusters&quot;, log=&quot;xy&quot;) text(ncells, nsubclusters, names(all.sce)) Figure 12.9: Number of subclusters identified from the gene expression data within each ADT-derived parent cluster. Another benefit of subclustering is that we can use the annotation on the ADT-derived clusters to facilitate annotation of each subcluster. If we knew that cluster X contained T cells from the ADT-derived data, there is no need to identify subclusters X.1, X.2, etc. as T cells from scratch; rather, we can focus on the more subtle (and interesting) differences between the subclusters using findMarkers(). For example, cluster 3 contains CD8+ T cells according to Figure 12.8, in which we further identify internal subclusters based on a variety of markers (Figure 12.10). Subclustering is also conceptually appealing as it avoids comparing log-fold changes in protein abundances with log-fold changes in gene expression. 
This ensures that variation (or noise) from the transcript counts does not compromise cell type/state identification from the relatively cleaner ADT counts. of.interest &lt;- &quot;3&quot; markers &lt;- c(&quot;GZMH&quot;, &quot;IL7R&quot;, &quot;KLRB1&quot;) plotExpression(all.sce[[of.interest]], x=&quot;subcluster&quot;, features=markers, swap_rownames=&quot;Symbol&quot;, ncol=3) Figure 12.10: Distribution of log-normalized expression values of several markers in transcript-derived subclusters of an ADT-derived subpopulation of CD8+ T cells. The downside is that relying on previous results increases the risk of misleading conclusions when ambiguities in those results are not considered, as previously discussed in Basic Section 5.5. It is a good idea to perform some additional checks to ensure that each subcluster has similar protein abundances, e.g., using a heatmap as in Figure 12.8 or with a series of plots like in Figure 12.11. If so, this allows the subcluster to “inherit” the annotation attached to the parent cluster for easier interpretation. sce.cd8 &lt;- all.sce[[of.interest]] plotExpression(altExp(sce.cd8), x=I(sce.cd8$subcluster), features=c(&quot;CD3&quot;, &quot;CD8a&quot;)) Figure 12.11: Distribution of log-normalized abundances of ADTs for CD3 and CD8a in each subcluster of the CD8+ T cell population. 12.6.2 By intersecting clusters In contrast to subclustering, we can “intersect” the clusters generated independently for each set of features. Two cells are only considered to be in the same intersected cluster if they belong to the same cluster in every feature set. In other words, strong separation between cells in any feature set is sufficient to cause the formation of a new cluster. The goal of the intersection is to capture population heterogeneity across all features in a single set of clusters. To illustrate, we first perform some standard steps on the transcript count matrix to obtain an RNA-only clustering.
sce.main &lt;- logNormCounts(sce) dec.main &lt;- modelGeneVar(sce.main) top.main &lt;- getTopHVGs(dec.main, prop=0.1) set.seed(100010) # for IRLBA. sce.main &lt;- runPCA(sce.main, subset_row=top.main, ncomponents=25) library(bluster) clusters.rna &lt;- clusterRows(reducedDim(sce.main, &quot;PCA&quot;), NNGraphParam()) table(clusters.rna) ## clusters.rna ## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ## 145 1638 56 1510 156 84 97 134 1619 296 73 42 750 75 371 181 ## 17 18 19 20 21 ## 144 33 18 27 23 A naive implementation of this intersection concept would involve simply concatenating the cluster identifiers. However, this is not ideal as it generates many small clusters. Here, most intersected clusters would contain fewer than 10 cells each, which is not helpful for downstream interpretation. naive.intersection &lt;- paste0(clusters.rna, &quot;.&quot;, clusters.adt) summary(table(naive.intersection) &lt; 10) ## Mode FALSE TRUE ## logical 47 76 A more careful approach is implemented in the intersectClusters() function from the mumosa package. Given the per-feature-set clusterings and the coordinates used to generate them, the function starts with the naive intersection but performs a series of merges to eliminate the smaller clusters. Specifically, we choose a pair of clusters to merge that minimizes the gain in the within-cluster sum of squares (WCSS); this is repeated until the WCSS of the intersected clusters exceeds the WCSS of any per-feature-set clustering. In this manner, we remove irrelevant partitionings without forcibly merging clusters that are well-separated in one of the feature sets.
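The merging criterion can be illustrated with a minimal base-R sketch on invented one-dimensional coordinates; the wcss() helper and all values below are hypothetical, and intersectClusters() implements the full iterative procedure:

```r
# Within-cluster sum of squares for one coordinate matrix (cells in rows).
wcss <- function(coords, clusters) {
    sum(vapply(split(as.data.frame(coords), clusters), function(df) {
        sum(scale(df, scale=FALSE)^2)  # squared deviations from cluster means
    }, 0))
}

# Toy 1-dimensional coordinates and a naive intersection of two clusterings.
coords <- matrix(c(0, 0.1, 0.2, 5, 5.1, 10), ncol=1)
naive <- c("1.1", "1.1", "1.2", "2.1", "2.1", "2.2")

# Gain in WCSS from merging "1.1" with "1.2" (small: the cells are close).
merged <- replace(naive, naive == "1.2", "1.1")
gain.close <- wcss(coords, merged) - wcss(coords, naive)

# Gain from merging "2.1" with "2.2" (large: the cells are well separated).
merged2 <- replace(naive, naive == "2.2", "2.1")
gain.far <- wcss(coords, merged2) - wcss(coords, naive)

gain.close < gain.far  # TRUE: the close pair would be merged first
```

Each iteration picks the cheapest merge in this sense, stopping once further merges would make the intersected clustering worse (by WCSS) than any of the per-feature-set clusterings.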
library(mumosa) clusters.intersect &lt;- intersectClusters( clusters=list(clusters.rna, clusters.adt), coords=list( reducedDim(sce.main, &quot;PCA&quot;), t(assay(altExp(sce.main))) ) ) table(clusters.intersect) ## clusters.intersect ## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ## 1618 162 7 54 1154 56 319 362 41 191 323 248 65 745 206 69 ## 17 18 19 20 21 22 23 24 25 ## 79 25 151 81 38 38 39 216 1185 Intersecting clusters is more convenient than subclustering as each feature set can be analyzed independently in the former. Changes in the analysis strategy for one feature set do not affect the processing of other sets; similarly, there is no need to choose one feature set to define the initial clusters. The disadvantage is that there is no opportunity to improve resolution by choosing a more appropriate set of HVGs to separate subpopulations within each cluster. 12.6.3 With rescaled expression matrices Alternatively, we can combine the data from both modalities into a single matrix for use in downstream analyses. The simplest version of this idea involves literally combining the log-normalized abundance matrix for the ADTs with the log-expression matrix (or the same with the corresponding matrices of PCs). This requires some reweighting to balance the contribution of the transcript and ADT data to variance in the combined matrix, especially given that the former has at least an order of magnitude more features than the latter. To do so, we compute the average nearest-neighbor distance in each feature set and use it as a proxy for uninteresting noise within each subpopulation. Each matrix is then scaled to equalize the nearest-neighbor distances, ensuring that all modalities have comparable “baseline” variation without forcing them to have the same total variance (which would not be desirable if they captured different aspects of biology). 
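In outline, this amounts to dividing each matrix by its typical nearest-neighbor distance before concatenation. A minimal base-R sketch on simulated data follows; the nn.dist() helper and the toy matrices are invented for illustration, and rescaleByNeighbors() itself uses the average distance to several nearest neighbors rather than this simplified median:

```r
set.seed(42)
rna.pcs <- matrix(rnorm(50 * 10), nrow=50)        # 50 cells x 10 RNA-derived PCs
adt.mat <- matrix(rnorm(50 * 5, sd=10), nrow=50)  # 50 cells x 5 ADTs, noisier scale

# Typical nearest-neighbor distance: a proxy for uninteresting within-population noise.
nn.dist <- function(x) {
    d <- as.matrix(dist(x))
    diag(d) <- Inf                 # ignore each cell's distance to itself
    median(apply(d, 1, min))       # median distance to the nearest neighbor
}

# Scale each modality so its typical nearest-neighbor distance is 1,
# then combine into a single matrix for downstream clustering.
combined <- cbind(rna.pcs / nn.dist(rna.pcs), adt.mat / nn.dist(adt.mat))
dim(combined)
## [1] 50 15
```

After rescaling, neither modality dominates the combined distances simply because its raw values happen to be on a larger scale.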
We demonstrate below by combining the log-abundance ADT matrix with the matrix of PCs derived from the log-expression matrix via the rescaleByNeighbors() function. # Re-using &#39;sce.main&#39; from above. rescaled.combined &lt;- rescaleByNeighbors(sce.main, dimreds=&quot;PCA&quot;, altexps=&quot;Antibody Capture&quot;) dim(rescaled.combined) ## [1] 7472 42 This approach is logistically convenient as the combined structure is compatible with the same analysis workflows used for transcript-only data. For example, we can use the rescaled.combined matrix for clustering or dimensionality reduction (Figure 12.12) as if we were dealing with our usual matrix of PCs. library(bluster) clusters.rescale &lt;- clusterRows(rescaled.combined, NNGraphParam()) table(clusters.rescale) ## clusters.rescale ## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ## 89 1366 486 1042 512 80 130 84 1708 40 23 19 57 339 750 39 ## 17 18 19 20 21 22 23 24 25 26 27 ## 80 36 67 13 22 26 377 24 16 13 34 reducedDim(sce.main, &quot;RescaledNeighbors&quot;) &lt;- rescaled.combined set.seed(1000) sce.main &lt;- runUMAP(sce.main, dimred=&quot;RescaledNeighbors&quot;, name=&quot;UMAP_rescaled&quot;) plotReducedDim(sce.main, &quot;UMAP_rescaled&quot;, colour_by=I(clusters.rescale), text_by=I(clusters.rescale)) Figure 12.12: UMAP plot of the PBMC data generated from rescaled ADT and RNA data values. Each point is a cell and is colored according to the assigned cluster from a graph-based approach. However, this strategy implicitly makes assumptions about the importance of heterogeneity in the ADT data relative to the transcript data. In this case, we are equalizing the within-population variance across feature sets but there is no reason to think that this is the best choice - one might instead give the ADT data twice as much weight, for example. 
More generally, any calculations involving multiple sets of features must consider the potential for uninteresting noise in one set to interfere with biological signal in the other set. This concern is largely avoided by subclustering or intersections, where a clearer separation exists between the information contributed by each feature set. 12.6.4 With multi-metric UMAP A more sophisticated approach to combining matrices uses the UMAP algorithm (McInnes, Healy, and Melville 2018) to integrate information from two or more sets of features. Loosely speaking, we can imagine this as an intersection of the nearest-neighbor graphs formed from each set, which effectively encourages the formation of communities of cells that are close in both feature spaces. The UMAP algorithm then uses this intersection to generate a matrix of coordinates that can be directly used in downstream analyses. We demonstrate this process below using the calculateMultiUMAP() function, setting n_components=20 to return a reasonably faithful high-dimensional representation of the input data. set.seed(1001010) umap.combined &lt;- mumosa::calculateMultiUMAP(sce.main, dimreds=&quot;PCA&quot;, altexps=&quot;Antibody Capture&quot;, n_components=20) dim(umap.combined) ## [1] 7472 20 Graph-based clustering on umap.combined generates a series of fine-grained clusters. This is attributable to the stringency of intersection operations for defining the local neighborhood; cells are only placed close together in the UMAP coordinate space if they are also close together in each of the feature sets. We perform another round of UMAP to generate a 2-dimensional representation for visualization of the same intersected graph (Figure 12.13). 
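As a caricature of this intersection (UMAP's fuzzy set intersection is soft rather than all-or-nothing), we can build exact k-nearest-neighbor edge lists in base R and keep only the edges present in both modalities; the `knn.edges()` helper and mock matrices below are purely illustrative.

```r
# Directed k-NN edge list for a matrix of coordinates, encoded as strings.
knn.edges <- function(mat, k) {
    d <- as.matrix(dist(mat))
    diag(d) <- Inf
    nbrs <- apply(d, 1, function(x) order(x)[seq_len(k)])  # k x n matrix
    paste(rep(seq_len(nrow(mat)), each=k), as.vector(nbrs))
}

set.seed(7)
modality1 <- matrix(rnorm(100), ncol=2)
modality2 <- modality1 + matrix(rnorm(100, sd=0.1), ncol=2)  # correlated

# Keep an edge only if the two cells are neighbors in *both* feature spaces.
shared <- intersect(knn.edges(modality1, k=5), knn.edges(modality2, k=5))
length(shared) / (nrow(modality1) * 5)  # fraction of edges surviving
```

When the modalities agree, as here, most edges survive; for cells that are close in only one space, the edge is dropped, which is what drives the fine-grained clusters described above.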
clusters.umap &lt;- clusterRows(umap.combined, NNGraphParam(k=20)) table(clusters.umap) ## clusters.umap ## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ## 132 582 315 297 393 947 796 100 1055 161 130 653 67 75 745 66 ## 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 ## 62 49 48 89 37 47 93 25 33 51 28 22 40 23 39 30 ## 33 34 35 36 37 38 39 40 ## 23 33 45 26 33 34 27 21 set.seed(0101110) sce.main &lt;- mumosa::runMultiUMAP(sce.main, dimreds=&quot;PCA&quot;, altexps=&quot;Antibody Capture&quot;, name=&quot;MultiUMAP2&quot;) colLabels(sce.main) &lt;- clusters.umap plotReducedDim(sce.main, &quot;MultiUMAP2&quot;, colour_by=&quot;label&quot;, text_by=&quot;label&quot;) Figure 12.13: UMAP plot obtained by combining transcript and ADT data in the PBMC dataset using a multi-metric UMAP embedding. Each point represents a cell and is colored according to its assigned cluster. The same concerns mentioned for rescaleByNeighbors() are also relevant here. For example, both sets of features contribute equally to the edge weights in the intersected graph, which may not be appropriate if the biology of interest is concentrated in only one set. The use of UMAP for both clustering and visualization may also reduce the effectiveness of plots like Figure 12.13 as diagnostics, as discussed in Basic Section 4.5.3. 12.7 Finding correlations between features Another interesting analysis involves finding correlations between gene expression and the abundance of surface markers. This can provide some information about the regulatory networks or pathways that might be driving a phenotype of interest - for example, which genes are up- or down-regulated in response to increasing PD-1 abundance? We use the computeCorrelations() function to quickly compute all pairwise Spearman correlations between two feature sets. 
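For intuition, each entry reported by computeCorrelations() boils down to a Spearman correlation with an associated p-value, which base R can compute directly; the mock `rna` and `adt` matrices below are illustrative stand-ins, not the PBMC data.

```r
set.seed(123)
rna <- matrix(rnorm(500), ncol=5, dimnames=list(NULL, paste0("gene", 1:5)))
adt <- matrix(rnorm(300), ncol=3, dimnames=list(NULL, paste0("adt", 1:3)))
adt[, 1] <- adt[, 1] + 0.5 * rna[, 1]  # plant one correlated pair

# All pairwise Spearman correlations between the two feature sets.
rho <- cor(rna, adt, method="spearman")  # genes x ADTs

# A single pair, with its p-value (exact=FALSE gives the usual
# large-sample approximation).
res <- cor.test(rna[, 1], adt[, 1], method="spearman", exact=FALSE)
c(rho=unname(res$estimate), p=res$p.value)
```

The planted gene1/adt1 pair stands out with a positive rho, while the remaining pairs hover around zero, mirroring how real gene/protein associations surface in the full table.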
all.correlations &lt;- computeCorrelations(sce.main, altExp(sce.main), use.names=c(&quot;Symbol&quot;, NA)) with.pd1 &lt;- all.correlations[all.correlations$feature2==&quot;PD-1&quot;,] with.pd1 &lt;- with.pd1[order(with.pd1$p.value),] head(with.pd1[which(with.pd1$rho &gt; 0),]) ## DataFrame with 6 rows and 5 columns ## feature1 feature2 rho p.value FDR ## &lt;character&gt; &lt;character&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## 1 IL32 PD-1 0.238841 2.04014e-97 2.43020e-95 ## 2 TRAC PD-1 0.236039 4.00402e-95 4.63760e-93 ## 3 CD3D PD-1 0.208997 1.57594e-74 1.38727e-72 ## 4 LTB PD-1 0.196742 4.32165e-66 3.31255e-64 ## 5 TRBC2 PD-1 0.193617 5.00589e-64 3.69993e-62 ## 6 B2M PD-1 0.191855 7.04541e-63 5.09311e-61 head(with.pd1[which(with.pd1$rho &lt; 0),]) ## DataFrame with 6 rows and 5 columns ## feature1 feature2 rho p.value FDR ## &lt;character&gt; &lt;character&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## 1 FCER1G PD-1 -0.287557 3.19929e-142 6.04856e-140 ## 2 CST3 PD-1 -0.285696 2.47246e-140 4.60590e-138 ## 3 TYROBP PD-1 -0.275635 2.25594e-130 3.83876e-128 ## 4 MNDA PD-1 -0.274989 9.52972e-130 1.61082e-127 ## 5 AIF1 PD-1 -0.274390 3.60656e-129 6.06450e-127 ## 6 CSTA PD-1 -0.271784 1.13681e-126 1.86058e-124 Of course, computing correlations across all cell types in the population is not particularly interesting. Strong correlations may be driven by cell type differences, while cell type-specific correlations may be “diluted” by noise in other cell types. Rather, it may be better to compute correlations for each cluster separately, which allows us to focus on the more subtle regulatory effects within each cell type. For example, in cluster 3, we observe that CD127 abundance is positively correlated with TPT1 but negatively correlated with GZMH (Figure 12.14). 
# Focusing on cluster 3, as discussed above. of.interest &lt;- 3 chosen &lt;- clusters.adt == of.interest chosen.correlations &lt;- computeCorrelations(sce.main, altExp(sce.main), subset.cols=chosen, use.names=c(&quot;Symbol&quot;, NA)) sorted.correlations &lt;- chosen.correlations[order(chosen.correlations$p.value),] head(sorted.correlations[which(sorted.correlations$rho &gt; 0),]) ## DataFrame with 6 rows and 5 columns ## feature1 feature2 rho p.value FDR ## &lt;character&gt; &lt;character&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## 1 IL7R CD127 0.785613 3.38336e-155 8.63677e-150 ## 2 KLRB1 CD127 0.746670 4.36446e-132 5.57062e-127 ## 3 HLA-DRB1 TIGIT 0.624707 6.62063e-81 2.11258e-76 ## 4 TPT1 CD127 0.605186 9.77116e-75 2.49430e-70 ## 5 CD8B CD8a 0.593515 3.00970e-71 6.40245e-67 ## 6 GZMH TIGIT 0.591745 9.88881e-71 1.94180e-66 head(sorted.correlations[which(sorted.correlations$rho &lt; 0),]) ## DataFrame with 6 rows and 5 columns ## feature1 feature2 rho p.value FDR ## &lt;character&gt; &lt;character&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## 1 GZMH CD127 -0.711815 9.86062e-115 8.39047e-110 ## 2 KLRB1 TIGIT -0.683086 2.88695e-102 1.84239e-97 ## 3 TMSB10 CD127 -0.677936 3.50310e-100 1.78849e-95 ## 4 IL7R TIGIT -0.652214 2.21674e-90 9.43121e-86 ## 5 HLA-DRB1 CD127 -0.647804 8.53002e-89 3.11068e-84 ## 6 TMSB4X CD127 -0.623470 1.67726e-80 4.75732e-76 plotExpression(sce.main[,chosen], x=&quot;CD127&quot;, show_smooth=TRUE, show_se=FALSE, features=c(&quot;TPT1&quot;, &quot;GZMH&quot;), swap_rownames=&quot;Symbol&quot;) Figure 12.14: Expression of GZMH and TPT1 with respect to CD127 abundance in each cell of cluster 3 in the PBMC dataset. For experiments with multiple batches (or other uninteresting categorical factors), we can use the block= option to only compute correlations within each batch. The results are then consolidated across batches to obtain a single DataFrame of statistics as before. 
This ensures that correlations are not driven by differences between batches; it also focuses on feature pairs that have correlations with the same sign across batches. To demonstrate, assume that we are interested in correlations that are consistently present in all cell types. As such, clustering is an uninteresting factor of variation that can be specified in block=. This yields a number of gene/protein pairs with strong correlations - many of which, unsurprisingly, involve the gene responsible for producing the protein, e.g., IL7R and CD127 (Figure 12.15). # Using multiple cores for speed. library(BiocParallel) blocked.correlations &lt;- computeCorrelations(sce.main, altExp(sce.main), block=clusters.adt, BPPARAM=MulticoreParam(4), use.names=c(&quot;Symbol&quot;, NA)) blocked.correlations[order(blocked.correlations$p.value),] ## DataFrame with 570146 rows and 5 columns ## feature1 feature2 rho p.value FDR ## &lt;character&gt; &lt;character&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## 1 IL7R CD127 0.445280 1.68084e-214 5.98058e-209 ## 2 TIGIT TIGIT 0.202662 2.90230e-45 5.16333e-40 ## 3 EEF1A1 CD127 0.207421 7.86897e-42 9.33286e-37 ## 4 CD8B CD8a 0.220594 2.28886e-39 2.03599e-34 ## 5 RPS12 CD127 0.189224 1.40630e-36 1.00075e-31 ## ... ... ... ... ... ... ## 570142 hsa-mir-1253 TIGIT NaN NA NA ## 570143 hsa-mir-1253 CD127 NaN NA NA ## 570144 hsa-mir-1253 IgG2a NaN NA NA ## 570145 hsa-mir-1253 IgG1 NaN NA NA ## 570146 hsa-mir-1253 IgG2b NaN NA NA plotExpression(sce.main, x=&quot;CD127&quot;, show_smooth=TRUE, show_se=FALSE, features=c(&quot;IL7R&quot;), swap_rownames=&quot;Symbol&quot;, other_fields=list(data.frame(clusters=clusters.adt))) + facet_wrap(~clusters) Figure 12.15: Expression of IL7R with respect to CD127 abundance in each cluster of the PBMC dataset. Each facet represents a cluster and each point represents a cell in that cluster. For large feature sets, computing all pairwise correlations may not be computationally feasible. 
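A nearest-neighbor search offers a way out of this feasibility problem thanks to a classical identity: in the absence of ties, Spearman's rho is a decreasing function of the squared Euclidean distance between rank vectors, so the genes closest to a marker in rank space are exactly its most positively correlated partners. A quick base-R check of the identity on mock data:

```r
set.seed(99)
x <- rnorm(50)
y <- rnorm(50)
n <- length(x)

# Spearman's rho from the squared distance between rank vectors
# (exact when there are no ties, as is the case for continuous draws).
d2 <- sum((rank(x) - rank(y))^2)
rho.from.dist <- 1 - 6 * d2 / (n * (n^2 - 1))

all.equal(rho.from.dist, unname(cor(x, y, method="spearman")))
```

Because rho is monotone in this distance, a nearest-neighbor index built on (scaled) rank vectors recovers the top correlations without scanning every pair.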
Instead, we can use the findTopCorrelations() function to find the most correlated features in one set for each feature in the other set. This uses a nearest-neighbors approach in rank space to identify the closest genes for each protein marker, which is equivalent to an exhaustive pairwise search; we further speed it up by performing a PCA to compact the data, approximating the distances between genes in a lower-dimensional space. The example below will find the top 10 genes in the RNA data that are correlated with the abundance of the markers in the ADT data. (Though in this case, the number of features in the ADT data is small enough that an exhaustive search would actually be faster.) set.seed(100) # For the IRLBA step. top.correlations &lt;- findTopCorrelations(x=altExp(sce.main), y=sce.main, number=10, use.names=c(NA, &quot;Symbol&quot;), BPPARAM=MulticoreParam(4)) top.correlations$positive ## DataFrame with 170 rows and 5 columns ## feature1 feature2 rho p.value FDR ## &lt;character&gt; &lt;character&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## 1 CD3 CD3G 0.653505 0 0 ## 2 CD3 LDHB 0.620759 0 0 ## 3 CD3 CD3D 0.704559 0 0 ## 4 CD3 TRAC 0.707055 0 0 ## 5 CD3 TCF7 0.550894 0 0 ## ... ... ... ... ... ... 
## 166 IgG2b AP004609.1 0.02217055 0.0276600 1 ## 167 IgG2b SLC30A4 0.02121528 0.0333445 1 ## 168 IgG2b ADAMTS1 0.00630101 0.2930218 1 ## 169 IgG2b LIX1-AS1 0.00563289 0.3131881 1 ## 170 IgG2b SPATC1 0.00538439 0.3208383 1 Session Info View session info R version 4.4.2 (2024-10-31) Platform: x86_64-pc-linux-gnu Running under: Ubuntu 24.04.1 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C time zone: America/New_York tzcode source: system (glibc) attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] BiocParallel_1.40.0 mumosa_1.14.0 [3] bluster_1.16.0 pheatmap_1.0.12 [5] scRNAseq_2.20.0 scran_1.34.0 [7] scater_1.34.0 ggplot2_3.5.1 [9] scuttle_1.16.0 DropletUtils_1.26.0 [11] SingleCellExperiment_1.28.1 SummarizedExperiment_1.36.0 [13] Biobase_2.66.0 GenomicRanges_1.58.0 [15] GenomeInfoDb_1.42.1 IRanges_2.40.1 [17] S4Vectors_0.44.0 BiocGenerics_0.52.0 [19] MatrixGenerics_1.18.1 matrixStats_1.5.0 [21] DropletTestFiles_1.16.0 BiocStyle_2.34.0 [23] rebook_1.16.0 loaded via a namespace (and not attached): [1] RcppAnnoy_0.0.22 splines_4.4.2 [3] batchelor_1.22.0 BiocIO_1.16.0 [5] bitops_1.0-9 filelock_1.0.3 [7] tibble_3.2.1 R.oo_1.27.0 [9] CodeDepends_0.6.6 graph_1.84.1 [11] XML_3.99-0.18 lifecycle_1.0.4 [13] httr2_1.1.0 edgeR_4.4.1 [15] lattice_0.22-6 ensembldb_2.30.0 [17] alabaster.base_1.6.1 magrittr_2.0.3 [19] limma_3.62.2 sass_0.4.9 [21] rmarkdown_2.29 jquerylib_0.1.4 [23] yaml_2.3.10 metapod_1.14.0 [25] cowplot_1.1.3 DBI_1.2.3 [27] RColorBrewer_1.1-3 ResidualMatrix_1.16.0 [29] abind_1.4-8 zlibbioc_1.52.0 [31] Rtsne_0.17 purrr_1.0.2 [33] R.utils_2.12.3 
AnnotationFilter_1.30.0 [35] RCurl_1.98-1.16 rappdirs_0.3.3 [37] GenomeInfoDbData_1.2.13 ggrepel_0.9.6 [39] irlba_2.3.5.1 alabaster.sce_1.6.0 [41] dqrng_0.4.1 DelayedMatrixStats_1.28.1 [43] codetools_0.2-20 DelayedArray_0.32.0 [45] tidyselect_1.2.1 UCSC.utils_1.2.0 [47] farver_2.1.2 ScaledMatrix_1.14.0 [49] viridis_0.6.5 BiocFileCache_2.14.0 [51] GenomicAlignments_1.42.0 jsonlite_1.8.9 [53] BiocNeighbors_2.0.1 tools_4.4.2 [55] Rcpp_1.0.14 glue_1.8.0 [57] gridExtra_2.3 SparseArray_1.6.1 [59] xfun_0.50 mgcv_1.9-1 [61] dplyr_1.1.4 HDF5Array_1.34.0 [63] gypsum_1.2.0 withr_3.0.2 [65] BiocManager_1.30.25 fastmap_1.2.0 [67] rhdf5filters_1.18.0 digest_0.6.37 [69] rsvd_1.0.5 R6_2.5.1 [71] mime_0.12 colorspace_2.1-1 [73] RSQLite_2.3.9 R.methodsS3_1.8.2 [75] generics_0.1.3 rtracklayer_1.66.0 [77] httr_1.4.7 S4Arrays_1.6.0 [79] uwot_0.2.2 pkgconfig_2.0.3 [81] gtable_0.3.6 blob_1.2.4 [83] XVector_0.46.0 htmltools_0.5.8.1 [85] bookdown_0.42 ProtGenerics_1.38.0 [87] scales_1.3.0 alabaster.matrix_1.6.1 [89] png_0.1-8 knitr_1.49 [91] rjson_0.2.23 nlme_3.1-166 [93] curl_6.1.0 cachem_1.1.0 [95] rhdf5_2.50.2 BiocVersion_3.20.0 [97] parallel_4.4.2 vipor_0.4.7 [99] AnnotationDbi_1.68.0 restfulr_0.0.15 [101] pillar_1.10.1 grid_4.4.2 [103] alabaster.schemas_1.6.0 vctrs_0.6.5 [105] BiocSingular_1.22.0 dbplyr_2.5.0 [107] beachmat_2.22.0 cluster_2.1.8 [109] beeswarm_0.4.0 evaluate_1.0.3 [111] GenomicFeatures_1.58.0 cli_3.6.3 [113] locfit_1.5-9.10 compiler_4.4.2 [115] Rsamtools_2.22.0 rlang_1.1.5 [117] crayon_1.5.3 labeling_0.4.3 [119] ggbeeswarm_0.7.2 viridisLite_0.4.2 [121] alabaster.se_1.6.0 munsell_0.5.1 [123] Biostrings_2.74.1 lazyeval_0.2.2 [125] Matrix_1.7-1 dir.expiry_1.14.0 [127] ExperimentHub_2.14.0 sparseMatrixStats_1.18.0 [129] bit64_4.6.0-1 Rhdf5lib_1.28.0 [131] KEGGREST_1.46.0 statmod_1.5.0 [133] alabaster.ranges_1.6.0 AnnotationHub_3.14.0 [135] igraph_2.1.3 memoise_2.0.1 [137] bslib_0.8.0 bit_4.5.0.1 References "],["interactive-sharing.html", "Chapter 13 Interactive data 
exploration 13.1 Motivation 13.2 Quick start 13.3 Usage examples 13.4 Reproducible visualizations 13.5 Dissemination of analysis results 13.6 Additional resources Session Info", " Chapter 13 Interactive data exploration .rebook-collapse { background-color: #eee; color: #444; cursor: pointer; padding: 18px; width: 100%; border: none; text-align: left; outline: none; font-size: 15px; } .rebook-content { padding: 0 18px; display: none; overflow: hidden; background-color: #f1f1f1; } 13.1 Motivation Exploratory data analysis (EDA) and visualization are crucial for many aspects of data analysis such as quality control, hypothesis generation and contextual result interpretation. Single-cell ’omics datasets generated with modern high-throughput technologies are no exception, especially given their increasing size and complexity. The need for flexible and interactive platforms to explore those data from various perspectives has contributed to the increasing popularity of graphical user interfaces (GUIs) for interactive visualization. In this chapter, we illustrate how the Bioconductor package iSEE can be used to perform some common exploratory tasks during single-cell analysis workflows. We note that these are examples only; in practice, EDA is often context-dependent and driven by distinct motivations and hypotheses for every new data set. To this end, iSEE provides a flexible framework that is immediately compatible with a wide range of genomics data modalities and can be easily customized to focus on key aspects of individual data sets. 13.2 Quick start An instance of an interactive iSEE application can be launched with any data set that is stored in an object of the SummarizedExperiment class (or any class that extends it, e.g., SingleCellExperiment, DESeqDataSet, MethylSet). In its simplest form, this is done simply by calling iSEE(sce) with the sce data object as the sole argument, as demonstrated here with the 10X PBMC dataset (Figure 13.1). 
View set-up code (Workflow Chapter 3) #--- loading ---# library(DropletTestFiles) raw.path &lt;- getTestFile(&quot;tenx-2.1.0-pbmc4k/1.0.0/raw.tar.gz&quot;) out.path &lt;- file.path(tempdir(), &quot;pbmc4k&quot;) untar(raw.path, exdir=out.path) library(DropletUtils) fname &lt;- file.path(out.path, &quot;raw_gene_bc_matrices/GRCh38&quot;) sce.pbmc &lt;- read10xCounts(fname, col.names=TRUE) #--- gene-annotation ---# library(scater) rownames(sce.pbmc) &lt;- uniquifyFeatureNames( rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol) library(EnsDb.Hsapiens.v86) location &lt;- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce.pbmc)$ID, column=&quot;SEQNAME&quot;, keytype=&quot;GENEID&quot;) #--- cell-detection ---# set.seed(100) e.out &lt;- emptyDrops(counts(sce.pbmc)) sce.pbmc &lt;- sce.pbmc[,which(e.out$FDR &lt;= 0.001)] #--- quality-control ---# stats &lt;- perCellQCMetrics(sce.pbmc, subsets=list(Mito=which(location==&quot;MT&quot;))) high.mito &lt;- isOutlier(stats$subsets_Mito_percent, type=&quot;higher&quot;) sce.pbmc &lt;- sce.pbmc[,!high.mito] #--- normalization ---# library(scran) set.seed(1000) clusters &lt;- quickCluster(sce.pbmc) sce.pbmc &lt;- computeSumFactors(sce.pbmc, cluster=clusters) sce.pbmc &lt;- logNormCounts(sce.pbmc) #--- variance-modelling ---# set.seed(1001) dec.pbmc &lt;- modelGeneVarByPoisson(sce.pbmc) top.pbmc &lt;- getTopHVGs(dec.pbmc, prop=0.1) #--- dimensionality-reduction ---# set.seed(10000) sce.pbmc &lt;- denoisePCA(sce.pbmc, subset.row=top.pbmc, technical=dec.pbmc) set.seed(100000) sce.pbmc &lt;- runTSNE(sce.pbmc, dimred=&quot;PCA&quot;) set.seed(1000000) sce.pbmc &lt;- runUMAP(sce.pbmc, dimred=&quot;PCA&quot;) #--- clustering ---# g &lt;- buildSNNGraph(sce.pbmc, k=10, use.dimred = &#39;PCA&#39;) clust &lt;- igraph::cluster_walktrap(g)$membership colLabels(sce.pbmc) &lt;- factor(clust) library(iSEE) app &lt;- iSEE(sce.pbmc) Figure 13.1: Screenshot of the iSEE application with its default initialization. 
The default interface contains up to eight built-in panels, each displaying a particular aspect of the data set. The layout of panels in the interface may be altered interactively - panels can be added, removed, resized or repositioned using the “Organize panels” menu in the top right corner of the interface. The initial layout of the application can also be altered programmatically as described in the rest of this chapter. To familiarize themselves with the GUI, users can launch an interactive tour from the menu in the top right corner. In addition, custom tours can be written to substitute the default built-in tour. This feature is particularly useful to disseminate new data sets with accompanying bespoke explanations guiding users through the salient features of any given data set (see Section 13.5). It is also possible to deploy “empty” instances of iSEE apps, where any SummarizedExperiment object stored in an RDS file may be uploaded to the running application. Once the file is uploaded, the application will import the object and initialize the GUI panels with its contents for interactive exploration. This type of iSEE application is launched without specifying the sce argument, as shown in Figure 13.2. app &lt;- iSEE() Figure 13.2: Screenshot of the iSEE application with a landing page. 13.3 Usage examples 13.3.1 Quality control In this example, we demonstrate that an iSEE app can be configured to focus on quality control metrics. Here, we are interested in two plots: The library size of each cell in decreasing order. An elbow in this plot generally reveals the transition between good quality cells and low quality cells or empty droplets. A dimensionality reduction result (in this case, we will pick \\(t\\)-SNE) where cells are colored by the log-library size. This view identifies trajectories or clusters associated with library size and can be used to diagnose QC/normalization problems. 
Alternatively, it could also indicate the presence of multiple cell types or states that differ in total RNA content. In addition, by setting the ColumnSelectionSource parameter, any point selection made in the Column data plot panel will highlight the corresponding points in the Reduced dimension plot panel. A user can then select the cells with either large or small library sizes to inspect their distribution in low-dimensional space. copy.pbmc &lt;- sce.pbmc # Computing various QC metrics; in particular, the log10-transformed library # size for each cell and the log-rank by decreasing library size. library(scater) copy.pbmc &lt;- addPerCellQC(copy.pbmc, exprs_values=&quot;counts&quot;) copy.pbmc$log10_total_counts &lt;- log10(copy.pbmc$total) copy.pbmc$total_counts_rank &lt;- rank(-copy.pbmc$total) initial.state &lt;- list( # Configure a &quot;Column data plot&quot; panel ColumnDataPlot(YAxis=&quot;log10_total_counts&quot;, XAxis=&quot;Column data&quot;, XAxisColumnData=&quot;total_counts_rank&quot;, DataBoxOpen=TRUE, PanelId=1L), # Configure a &quot;Reduced dimension plot&quot; panel ReducedDimensionPlot( Type=&quot;TSNE&quot;, VisualBoxOpen=TRUE, DataBoxOpen=TRUE, ColorBy=&quot;Column data&quot;, ColorByColumnData=&quot;log10_total_counts&quot;, SelectionBoxOpen=TRUE, ColumnSelectionSource=&quot;ColumnDataPlot1&quot;) ) # Prepare the app app &lt;- iSEE(copy.pbmc, initial=initial.state) The configured Shiny app can then be launched with the runApp() function or by simply printing the app object (Figure 13.3). Figure 13.3: Screenshot of an iSEE application for interactive exploration of quality control metrics. This app remains fully interactive, i.e., users can interactively control the settings and layout of the panels. For instance, users may choose to color data points by percentage of UMI mapped to mitochondrial genes (\"pct_counts_Mito\") in the Reduced dimension plot. 
Using the transfer of point selection between panels, users could select cells with small library sizes in the Column data plot and highlight them in the Reduced dimension plot, to investigate a possible relation between library size, clustering and proportion of reads mapped to mitochondrial genes. 13.3.2 Annotation of cell populations In this example, we use iSEE to interactively examine marker genes and conveniently determine cell identities. We identify upregulated markers in each cluster (Basic Chapter 6) and collect the log-\\(p\\)-value for each gene in each cluster. These are stored in the rowData slot of the SingleCellExperiment object for access by iSEE. copy.pbmc &lt;- sce.pbmc library(scran) markers.pbmc.up &lt;- findMarkers(copy.pbmc, direction=&quot;up&quot;, log.p=TRUE, sorted=FALSE) # Collate the log-p-value for each marker in a single table all.p &lt;- lapply(markers.pbmc.up, FUN = &quot;[[&quot;, i=&quot;log.p.value&quot;) all.p &lt;- DataFrame(all.p, check.names=FALSE) colnames(all.p) &lt;- paste0(&quot;cluster&quot;, colnames(all.p)) # Store the table of results as row metadata rowData(copy.pbmc) &lt;- cbind(rowData(copy.pbmc), all.p) The next code chunk sets up an app that contains: A table of feature statistics, including the log-transformed \\(p\\)-values of cluster markers computed above. A plot showing the distribution of expression values for a chosen gene in each cluster. A plot showing the result of the UMAP dimensionality reduction method overlaid with the expression value of a chosen gene. Moreover, we configure the second and third panels to use the gene (i.e., row) selected in the first panel. This enables convenient examination of important markers when combined with sorting by \\(p\\)-value for a cluster of interest. 
initial.state &lt;- list( RowDataTable(PanelId=1L), # Configure a &quot;Feature assay plot&quot; panel FeatureAssayPlot( YAxisFeatureSource=&quot;RowDataTable1&quot;, XAxis=&quot;Column data&quot;, XAxisColumnData=&quot;label&quot;, Assay=&quot;logcounts&quot;, DataBoxOpen=TRUE ), # Configure a &quot;Reduced dimension plot&quot; panel ReducedDimensionPlot( Type=&quot;UMAP&quot;, ColorBy=&quot;Feature name&quot;, ColorByFeatureSource=&quot;RowDataTable1&quot;, ColorByFeatureNameAssay=&quot;logcounts&quot; ) ) # Prepare the app app &lt;- iSEE(copy.pbmc, initial=initial.state) After launching the application (Figure 13.4), we can then sort the table by ascending values of cluster1 to identify genes that are strong markers for cluster 1. Then, users may select the first row in the Row statistics table and watch the second and third panels automatically update to display the most significant marker gene on the y-axis (Feature assay plot) or as a color scale overlaid on the data points (Reduced dimension plot). Alternatively, users can simply search the table for arbitrary gene names and select known markers for visualization. Figure 13.4: Screenshot of the iSEE application initialized for interactive exploration of population-specific marker expression. 13.3.3 Querying features of interest So far, the plots that we have examined have represented each column (i.e., cell) as a point. However, it is straightforward to instead represent rows as points that can be selected and transmitted to eligible panels. This is useful for more gene-centric exploratory analyses. To illustrate, we will add variance modelling statistics to the rowData() of our SingleCellExperiment object. copy.pbmc &lt;- sce.pbmc # Adding some mean-variance information. dec &lt;- modelGeneVarByPoisson(copy.pbmc) rowData(copy.pbmc) &lt;- cbind(rowData(copy.pbmc), dec) The next code chunk sets up an app (Figure 13.5) that contains: A plot showing the mean-variance trend, where each point represents a gene. 
A table of feature statistics, similar to that generated in the previous example. A heatmap for the genes in the first plot. We again configure the second and third panels to respond to the selection of points in the first panel. This allows the user to select several highly variable genes at once and examine their statistics or expression profiles. More advanced users can even configure the app to start with a brush or lasso to define a selection of genes at initialization. initial.state &lt;- list( # Configure a &quot;Row data plot&quot; panel RowDataPlot( YAxis=&quot;total&quot;, XAxis=&quot;Row data&quot;, XAxisRowData=&quot;mean&quot;, PanelId=1L ), RowDataTable( RowSelectionSource=&quot;RowDataPlot1&quot; ), # Configure a &quot;ComplexHeatmap&quot; panel ComplexHeatmapPlot( RowSelectionSource=&quot;RowDataPlot1&quot;, CustomRows=FALSE, ColumnData=&quot;label&quot;, Assay=&quot;logcounts&quot;, ClusterRows=TRUE, PanelHeight=800L, AssayCenterRows=TRUE ) ) # Prepare the app app &lt;- iSEE(copy.pbmc, initial=initial.state) Figure 13.5: Screenshot of the iSEE application initialized for examining highly variable genes. It is entirely possible for these row-centric panels to exist alongside the column-centric panels discussed previously. The only limitation is that row-based panels cannot transmit multi-row selections to column-based panels and vice versa. That said, a row-based panel can still transmit a single row selection to a column-based panel for, e.g., coloring by expression; this allows us to set up an app where selecting a single HVG in the mean-variance plot causes the neighboring \\(t\\)-SNE to be colored by the expression of the selected gene (Figure 13.6). 
initial.state &lt;- list( # Configure a &quot;Row data plot&quot; panel RowDataPlot( YAxis=&quot;total&quot;, XAxis=&quot;Row data&quot;, XAxisRowData=&quot;mean&quot;, PanelId=1L ), # Configure a &quot;Reduced dimension plot&quot; panel ReducedDimensionPlot( Type=&quot;TSNE&quot;, ColorBy=&quot;Feature name&quot;, ColorByFeatureSource=&quot;RowDataPlot1&quot;, ColorByFeatureNameAssay=&quot;logcounts&quot; ) ) # Prepare the app app &lt;- iSEE(copy.pbmc, initial=initial.state) Figure 13.6: Screenshot of the iSEE application containing both row- and column-based panels. 13.4 Reproducible visualizations The state of the iSEE application can be saved at any point to provide a snapshot of the current view of the dataset. This is achieved by clicking on the “Display panel settings” button under the “Export” dropdown menu in the top right corner and saving an RDS file containing a serialized list of panel parameters. Anyone with access to this file and the original SingleCellExperiment can then run iSEE to recover the same application state. Alternatively, the code required to construct the panel parameters can be returned, which is more transparent and amenable to further modification. This facility is most obviously useful for reproducing a perspective on the data that leads to a particular scientific conclusion; it is also helpful for collaborations whereby different views of the same dataset can be easily transferred between analysts. iSEE also keeps a record of the R commands used to generate each figure and table in the app. This information is readily available via the “Extract the R code” button under the “Export” dropdown menu. By copying the code displayed in the modal window and executing it in the R session from which the iSEE app was launched, a user can exactly reproduce all plots currently displayed in the GUI. 
In this manner, a user can use iSEE to rapidly prototype plots of interest without having to write the associated boilerplate, after which they can then copy the code in an R script for fine-tuning. Of course, the user can also save the plots and tables directly for further adjustment with other tools. 13.5 Dissemination of analysis results iSEE provides a powerful avenue for disseminating results through a “guided tour” of the dataset. This involves writing a step-by-step walkthrough of the different panels with explanations to facilitate their interpretation. All that is needed to add a tour to an iSEE instance is a data frame with two columns named “element” and “intro”; the first column declares the UI element to highlight in each step of the tour, and the second one contains the text to display at that step. This data frame must then be provided to the iSEE() function via the tour argument. Below we demonstrate the implementation of a simple tour that takes users through the two panels that compose a GUI and trains them to use the collapsible boxes. tour &lt;- data.frame( element = c( &quot;#Welcome&quot;, &quot;#ReducedDimensionPlot1&quot;, &quot;#ColumnDataPlot1&quot;, &quot;#ColumnDataPlot1_DataBoxOpen&quot;, &quot;#Conclusion&quot;), intro = c( &quot;Welcome to this tour!&quot;, &quot;This is a &lt;i&gt;Reduced dimension plot.&lt;/i&gt;&quot;, &quot;And this is a &lt;i&gt;Column data plot.&lt;/i&gt;&quot;, &quot;&lt;b&gt;Action:&lt;/b&gt; Click on this collapsible box to open and close it.&quot;, &quot;Thank you for taking this tour!&quot;), stringsAsFactors = FALSE) initial.state &lt;- list( ReducedDimensionPlot(PanelWidth=6L), ColumnDataPlot(PanelWidth=6L) ) The preconfigured Shiny app can then be loaded with the tour and launched to obtain Figure 13.7. Note that the viewer is free to leave the interactive tour at any time and explore the data from their own perspective. 
Examples of advanced tours showcasing a selection of published data sets can be found at https://github.com/iSEE/iSEE2018. app &lt;- iSEE(sce.pbmc, initial = initial.state, tour = tour) Figure 13.7: Screenshot of the iSEE application initialized with a tour. 13.6 Additional resources For demonstration and inspiration, we refer readers to the following examples of deployed applications: Use cases accompanying the published article: https://marionilab.cruk.cam.ac.uk/ (source code: https://github.com/iSEE/iSEE2018) Examples of iSEE in production: http://www.teichlab.org/singlecell-treg Other examples as source code: Gallery of example notebooks to reproduce analyses on public data: https://github.com/iSEE/iSEE_instances Gallery of example custom panels: https://github.com/iSEE/iSEE_custom Session Info View session info R version 4.4.2 (2024-10-31) Platform: x86_64-pc-linux-gnu Running under: Ubuntu 24.04.1 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C time zone: America/New_York tzcode source: system (glibc) attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] scran_1.34.0 scater_1.34.0 [3] ggplot2_3.5.1 scuttle_1.16.0 [5] iSEE_2.18.0 SingleCellExperiment_1.28.1 [7] SummarizedExperiment_1.36.0 Biobase_2.66.0 [9] GenomicRanges_1.58.0 GenomeInfoDb_1.42.1 [11] IRanges_2.40.1 S4Vectors_0.44.0 [13] BiocGenerics_0.52.0 MatrixGenerics_1.18.1 [15] matrixStats_1.5.0 BiocStyle_2.34.0 [17] rebook_1.16.0 loaded via a namespace (and not attached): [1] gridExtra_2.3 CodeDepends_0.6.6 rlang_1.1.5 [4] magrittr_2.0.3 shinydashboard_0.7.2 clue_0.3-66 [7] GetoptLong_1.0.5 
compiler_4.4.2 mgcv_1.9-1 [10] dir.expiry_1.14.0 png_0.1-8 vctrs_0.6.5 [13] pkgconfig_2.0.3 shape_1.4.6.1 crayon_1.5.3 [16] fastmap_1.2.0 XVector_0.46.0 fontawesome_0.5.3 [19] promises_1.3.2 rmarkdown_2.29 ggbeeswarm_0.7.2 [22] graph_1.84.1 UCSC.utils_1.2.0 shinyAce_0.4.3 [25] bluster_1.16.0 xfun_0.50 beachmat_2.22.0 [28] zlibbioc_1.52.0 cachem_1.1.0 jsonlite_1.8.9 [31] listviewer_4.0.0 later_1.4.1 DelayedArray_0.32.0 [34] BiocParallel_1.40.0 irlba_2.3.5.1 parallel_4.4.2 [37] cluster_2.1.8 R6_2.5.1 bslib_0.8.0 [40] RColorBrewer_1.1-3 limma_3.62.2 jquerylib_0.1.4 [43] Rcpp_1.0.14 bookdown_0.42 iterators_1.0.14 [46] knitr_1.49 httpuv_1.6.15 Matrix_1.7-1 [49] splines_4.4.2 igraph_2.1.3 tidyselect_1.2.1 [52] viridis_0.6.5 abind_1.4-8 yaml_2.3.10 [55] doParallel_1.0.17 codetools_0.2-20 miniUI_0.1.1.1 [58] lattice_0.22-6 tibble_3.2.1 withr_3.0.2 [61] shiny_1.10.0 evaluate_1.0.3 circlize_0.4.16 [64] pillar_1.10.1 BiocManager_1.30.25 filelock_1.0.3 [67] DT_0.33 foreach_1.5.2 shinyjs_2.1.0 [70] generics_0.1.3 munsell_0.5.1 scales_1.3.0 [73] xtable_1.8-4 glue_1.8.0 metapod_1.14.0 [76] tools_4.4.2 BiocNeighbors_2.0.1 ScaledMatrix_1.14.0 [79] locfit_1.5-9.10 colourpicker_1.3.0 XML_3.99-0.18 [82] grid_4.4.2 edgeR_4.4.1 colorspace_2.1-1 [85] nlme_3.1-166 GenomeInfoDbData_1.2.13 beeswarm_0.4.0 [88] BiocSingular_1.22.0 vipor_0.4.7 rsvd_1.0.5 [91] cli_3.6.3 rappdirs_0.3.3 viridisLite_0.4.2 [94] S4Arrays_1.6.0 ComplexHeatmap_2.22.0 dplyr_1.1.4 [97] gtable_0.3.6 rintrojs_0.3.4 sass_0.4.9 [100] digest_0.6.37 dqrng_0.4.1 SparseArray_1.6.1 [103] ggrepel_0.9.6 rjson_0.2.23 htmlwidgets_1.6.4 [106] memoise_2.0.1 htmltools_0.5.8.1 lifecycle_1.0.4 [109] httr_1.4.7 shinyWidgets_0.8.7 statmod_1.5.0 [112] GlobalOptions_0.1.2 mime_0.12 "],["dealing-with-big-data.html", "Chapter 14 Dealing with big data 14.1 Motivation 14.2 Fast approximations 14.3 Parallelization 14.4 Out of memory representations Session Info", " Chapter 14 Dealing with big data 
14.1 Motivation Advances in scRNA-seq technologies have increased the number of cells that can be assayed in routine experiments. Public databases such as GEO are continually expanding with more scRNA-seq studies, while large-scale projects such as the Human Cell Atlas are expected to generate data for billions of cells. For effective data analysis, the computational methods need to scale with the increasing size of scRNA-seq data sets. This section discusses how we can use various aspects of the Bioconductor ecosystem to tune our analysis pipelines for greater speed and efficiency. 14.2 Fast approximations 14.2.1 Nearest neighbor searching Identification of neighbouring cells in PC or expression space is a common procedure that is used in many functions, e.g., buildSNNGraph(), doubletCells(). The default is to favour accuracy over speed by using an exact nearest neighbour (NN) search, implemented with the \\(k\\)-means for \\(k\\)-nearest neighbours algorithm (Wang 2012). However, for large data sets, it may be preferable to use a faster approximate approach. The BiocNeighbors framework makes it easy to switch between search options by simply changing the BNPARAM= argument in compatible functions. 
To demonstrate, we will use the 10X PBMC data: View set-up code (Workflow Chapter 3) #--- loading ---# library(DropletTestFiles) raw.path &lt;- getTestFile(&quot;tenx-2.1.0-pbmc4k/1.0.0/raw.tar.gz&quot;) out.path &lt;- file.path(tempdir(), &quot;pbmc4k&quot;) untar(raw.path, exdir=out.path) library(DropletUtils) fname &lt;- file.path(out.path, &quot;raw_gene_bc_matrices/GRCh38&quot;) sce.pbmc &lt;- read10xCounts(fname, col.names=TRUE) #--- gene-annotation ---# library(scater) rownames(sce.pbmc) &lt;- uniquifyFeatureNames( rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol) library(EnsDb.Hsapiens.v86) location &lt;- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce.pbmc)$ID, column=&quot;SEQNAME&quot;, keytype=&quot;GENEID&quot;) #--- cell-detection ---# set.seed(100) e.out &lt;- emptyDrops(counts(sce.pbmc)) sce.pbmc &lt;- sce.pbmc[,which(e.out$FDR &lt;= 0.001)] #--- quality-control ---# stats &lt;- perCellQCMetrics(sce.pbmc, subsets=list(Mito=which(location==&quot;MT&quot;))) high.mito &lt;- isOutlier(stats$subsets_Mito_percent, type=&quot;higher&quot;) sce.pbmc &lt;- sce.pbmc[,!high.mito] #--- normalization ---# library(scran) set.seed(1000) clusters &lt;- quickCluster(sce.pbmc) sce.pbmc &lt;- computeSumFactors(sce.pbmc, cluster=clusters) sce.pbmc &lt;- logNormCounts(sce.pbmc) #--- variance-modelling ---# set.seed(1001) dec.pbmc &lt;- modelGeneVarByPoisson(sce.pbmc) top.pbmc &lt;- getTopHVGs(dec.pbmc, prop=0.1) #--- dimensionality-reduction ---# set.seed(10000) sce.pbmc &lt;- denoisePCA(sce.pbmc, subset.row=top.pbmc, technical=dec.pbmc) set.seed(100000) sce.pbmc &lt;- runTSNE(sce.pbmc, dimred=&quot;PCA&quot;) set.seed(1000000) sce.pbmc &lt;- runUMAP(sce.pbmc, dimred=&quot;PCA&quot;) #--- clustering ---# g &lt;- buildSNNGraph(sce.pbmc, k=10, use.dimred = &#39;PCA&#39;) clust &lt;- igraph::cluster_walktrap(g)$membership colLabels(sce.pbmc) &lt;- factor(clust) sce.pbmc ## class: SingleCellExperiment ## dim: 33694 3985 ## metadata(1): Samples ## assays(2): counts logcounts ## 
rownames(33694): RP11-34P13.3 FAM138A ... AC213203.1 FAM231B ## rowData names(2): ID Symbol ## colnames(3985): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ... ## TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1 ## colData names(4): Sample Barcode sizeFactor label ## reducedDimNames(3): PCA TSNE UMAP ## mainExpName: NULL ## altExpNames(0): We had previously clustered on a shared nearest neighbor graph generated with an exact neighbour search (Basic Section 5.2). We repeat this below using an approximate search, implemented using the Annoy algorithm. This involves constructing an AnnoyParam object to specify the search algorithm and then passing it to the buildSNNGraph() function. The results from the exact and approximate searches are consistent, with most clusters from the former re-appearing in the latter. This suggests that the inaccuracy from the approximation can be largely ignored. library(scran) library(BiocNeighbors) snn.gr &lt;- buildSNNGraph(sce.pbmc, BNPARAM=AnnoyParam(), use.dimred=&quot;PCA&quot;) clusters &lt;- igraph::cluster_walktrap(snn.gr) table(Exact=colLabels(sce.pbmc), Approx=clusters$membership) ## Approx ## Exact 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ## 1 205 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ## 2 0 727 0 0 0 0 0 0 0 0 0 4 0 0 0 ## 3 0 0 599 0 0 0 0 0 11 0 0 0 7 0 0 ## 4 0 3 1 51 0 0 0 0 0 0 0 0 0 1 0 ## 5 0 0 1 0 540 0 0 0 0 0 0 0 0 0 0 ## 6 0 0 2 0 0 350 0 0 0 0 0 0 0 0 0 ## 7 0 0 0 0 0 0 125 0 0 0 0 0 0 0 0 ## 8 0 0 0 0 0 0 0 46 0 0 0 0 0 0 0 ## 9 0 0 1 0 0 0 0 0 818 0 0 0 0 0 0 ## 10 0 0 0 0 0 0 0 0 0 47 0 0 0 0 0 ## 11 0 0 0 0 0 4 0 0 0 0 149 0 0 0 0 ## 12 0 0 0 0 0 0 0 0 0 0 0 61 0 0 0 ## 13 0 0 0 0 0 0 0 0 0 0 0 0 129 0 0 ## 14 0 0 0 0 0 0 0 0 0 0 0 0 0 87 0 ## 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16 Note that Annoy writes the NN index to disk prior to performing the search. Thus, it may not actually be faster than the default exact algorithm for small datasets, depending on whether the overhead of disk write is offset by the computational complexity of the search. 
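The trade-off at work here can be sketched in a few lines: an exact search scans every candidate, while an approximate search restricts the scan and occasionally reports the wrong neighbour. A toy language-agnostic illustration (a crude random-subset "approximation", emphatically not the Annoy algorithm), computing the same kind of disagreement fraction as mean(exact$index != approx$index):

```python
import random

# Exact NN: scan every candidate for the smallest squared Euclidean distance.
def nearest(query, points, candidates):
    return min(candidates,
               key=lambda i: sum((q - p) ** 2 for q, p in zip(query, points[i])))

random.seed(42)
points = [(random.random(), random.random()) for _ in range(500)]

exact = [nearest(p, points, [i for i in range(len(points)) if i != j])
         for j, p in enumerate(points)]

# "Approximate" NN: only half the candidates are ever inspected.
subset = random.sample(range(len(points)), 250)
approx = [nearest(p, points, [i for i in subset if i != j])
          for j, p in enumerate(points)]

# Fraction of points whose reported neighbour differs from the truth.
err = sum(e != a for e, a in zip(exact, approx)) / len(points)
print(err)
```

With half the candidates searched, roughly half the reported neighbours are wrong; real approximate schemes achieve far lower error for the same speed-up by indexing the space instead of subsampling it.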
It is also not difficult to find situations where the approximation deteriorates, especially at high dimensions, though this may not have an appreciable impact on the biological conclusions. set.seed(1000) y1 &lt;- matrix(rnorm(50000), nrow=1000) y2 &lt;- matrix(rnorm(50000), nrow=1000) Y &lt;- rbind(y1, y2) exact &lt;- findKNN(Y, k=20) approx &lt;- findKNN(Y, k=20, BNPARAM=AnnoyParam()) mean(exact$index!=approx$index) ## [1] 0.5619 14.2.2 Singular value decomposition The singular value decomposition (SVD) underlies the PCA used throughout our analyses, e.g., in denoisePCA(), fastMNN(), doubletCells(). (Briefly, the right singular vectors are the eigenvectors of the gene-gene covariance matrix, where each eigenvector represents the axis of maximum remaining variation in the PCA.) The default base::svd() function performs an exact SVD that is not performant for large datasets. Instead, we use fast approximate methods from the irlba and rsvd packages, conveniently wrapped into the BiocSingular package for ease of use and package development. Specifically, we can change the SVD algorithm used in any of these functions by simply specifying an alternative value for the BSPARAM= argument. library(scater) library(BiocSingular) # As the name suggests, it is random, so we need to set the seed. set.seed(101000) r.out &lt;- runPCA(sce.pbmc, ncomponents=20, BSPARAM=RandomParam()) str(reducedDim(r.out)) ## num [1:3985, 1:20] 15.05 13.43 -8.67 -7.74 6.45 ... ## - attr(*, &quot;dimnames&quot;)=List of 2 ## ..$ : chr [1:3985] &quot;AAACCTGAGAAGGCCT-1&quot; &quot;AAACCTGAGACAGACC-1&quot; &quot;AAACCTGAGGCATGGT-1&quot; &quot;AAACCTGCAAGGTTCT-1&quot; ... ## ..$ : chr [1:20] &quot;PC1&quot; &quot;PC2&quot; &quot;PC3&quot; &quot;PC4&quot; ... ## - attr(*, &quot;varExplained&quot;)= num [1:20] 85.36 40.43 23.22 8.99 6.66 ... ## - attr(*, &quot;percentVar&quot;)= num [1:20] 19.85 9.4 5.4 2.09 1.55 ... 
## - attr(*, &quot;rotation&quot;)= num [1:500, 1:20] 0.203 0.1834 0.1779 0.1063 0.0647 ... ## ..- attr(*, &quot;dimnames&quot;)=List of 2 ## .. ..$ : chr [1:500] &quot;LYZ&quot; &quot;S100A9&quot; &quot;S100A8&quot; &quot;HLA-DRA&quot; ... ## .. ..$ : chr [1:20] &quot;PC1&quot; &quot;PC2&quot; &quot;PC3&quot; &quot;PC4&quot; ... set.seed(101001) i.out &lt;- runPCA(sce.pbmc, ncomponents=20, BSPARAM=IrlbaParam()) str(reducedDim(i.out)) ## num [1:3985, 1:20] 15.05 13.43 -8.67 -7.74 6.45 ... ## - attr(*, &quot;dimnames&quot;)=List of 2 ## ..$ : chr [1:3985] &quot;AAACCTGAGAAGGCCT-1&quot; &quot;AAACCTGAGACAGACC-1&quot; &quot;AAACCTGAGGCATGGT-1&quot; &quot;AAACCTGCAAGGTTCT-1&quot; ... ## ..$ : chr [1:20] &quot;PC1&quot; &quot;PC2&quot; &quot;PC3&quot; &quot;PC4&quot; ... ## - attr(*, &quot;varExplained&quot;)= num [1:20] 85.36 40.43 23.22 8.99 6.66 ... ## - attr(*, &quot;percentVar&quot;)= num [1:20] 19.85 9.4 5.4 2.09 1.55 ... ## - attr(*, &quot;rotation&quot;)= num [1:500, 1:20] 0.203 0.1834 0.1779 0.1063 0.0647 ... ## ..- attr(*, &quot;dimnames&quot;)=List of 2 ## .. ..$ : chr [1:500] &quot;LYZ&quot; &quot;S100A9&quot; &quot;S100A8&quot; &quot;HLA-DRA&quot; ... ## .. ..$ : chr [1:20] &quot;PC1&quot; &quot;PC2&quot; &quot;PC3&quot; &quot;PC4&quot; ... Both IRLBA and randomized SVD (RSVD) are much faster than the exact SVD with negligible loss of accuracy. This motivates their default use in many scran and scater functions, at the cost of requiring users to set the seed to guarantee reproducibility. IRLBA can occasionally fail to converge and require more iterations (passed via maxit= in IrlbaParam()), while RSVD involves an explicit trade-off between accuracy and speed based on its oversampling parameter (p=) and number of power iterations (q=). We tend to prefer IRLBA as its default behavior is more accurate, though RSVD is much faster for file-backed matrices (Section 14.4). 
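Both methods belong to the family of iterative schemes that recover only the leading part of the decomposition without ever forming the full SVD. The simplest member of that family is power iteration on \(A^T A\), sketched below for the leading singular value only (IRLBA and RSVD are considerably more sophisticated than this; the sketch just shows why a handful of matrix-vector products suffice):

```python
import math, random

def leading_singular_value(A, iters=100, seed=0):
    """Power iteration on A^T A; returns the largest singular value of A."""
    rng = random.Random(seed)
    n = len(A[0])
    v = [rng.random() for _ in range(n)]
    for _ in range(iters):
        Av = [sum(row[j] * v[j] for j in range(n)) for row in A]              # A v
        w = [sum(A[i][j] * Av[i] for i in range(len(A))) for j in range(n)]   # A^T (A v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]                                             # renormalize
    Av = [sum(row[j] * v[j] for j in range(n)) for row in A]
    return math.sqrt(sum(x * x for x in Av))                                  # ||A v|| = sigma_1

A = [[3.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
print(leading_singular_value(A))  # -> 3.0
```

Each iteration costs only two matrix-vector products, which is what makes such schemes cheap for large, sparse expression matrices relative to an exact SVD.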
14.3 Parallelization Parallelization of calculations across genes or cells is an obvious strategy for speeding up scRNA-seq analysis workflows. The BiocParallel package provides a common interface for parallel computing throughout the Bioconductor ecosystem, manifesting as a BPPARAM= argument in compatible functions. We can pick from a diverse range of parallelization backends depending on the available hardware and operating system. For example, we might use forking across 2 cores to parallelize the variance calculations on a Unix system: library(BiocParallel) dec.pbmc.mc &lt;- modelGeneVar(sce.pbmc, BPPARAM=MulticoreParam(2)) dec.pbmc.mc ## DataFrame with 33694 rows and 6 columns ## mean total tech bio p.value ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## RP11-34P13.3 0.000000000 0.000000000 0.000000000 0.00000e+00 NaN ## FAM138A 0.000000000 0.000000000 0.000000000 0.00000e+00 NaN ## OR4F5 0.000000000 0.000000000 0.000000000 0.00000e+00 NaN ## RP11-34P13.7 0.002166050 0.002227438 0.002227459 -2.15811e-08 0.500027 ## RP11-34P13.8 0.000522431 0.000549601 0.000537242 1.23586e-05 0.436496 ## ... ... ... ... ... ... ## AC233755.2 0.0000000 0.0000000 0.0000000 0.00000000 NaN ## AC233755.1 0.0000000 0.0000000 0.0000000 0.00000000 NaN ## AC240274.1 0.0102893 0.0121099 0.0105809 0.00152901 0.157639 ## AC213203.1 0.0000000 0.0000000 0.0000000 0.00000000 NaN ## FAM231B 0.0000000 0.0000000 0.0000000 0.00000000 NaN ## FDR ## &lt;numeric&gt; ## RP11-34P13.3 NaN ## FAM138A NaN ## OR4F5 NaN ## RP11-34P13.7 0.756443 ## RP11-34P13.8 0.756443 ## ... ... 
## AC233755.2 NaN ## AC233755.1 NaN ## AC240274.1 0.756443 ## AC213203.1 NaN ## FAM231B NaN Another approach would be to distribute jobs across a network of computers, which yields the same result: dec.pbmc.snow &lt;- modelGeneVar(sce.pbmc, BPPARAM=SnowParam(5)) dec.pbmc.snow ## DataFrame with 33694 rows and 6 columns ## mean total tech bio p.value ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## RP11-34P13.3 0.000000000 0.000000000 0.000000000 0.00000e+00 NaN ## FAM138A 0.000000000 0.000000000 0.000000000 0.00000e+00 NaN ## OR4F5 0.000000000 0.000000000 0.000000000 0.00000e+00 NaN ## RP11-34P13.7 0.002166050 0.002227438 0.002227459 -2.15811e-08 0.500027 ## RP11-34P13.8 0.000522431 0.000549601 0.000537242 1.23586e-05 0.436496 ## ... ... ... ... ... ... ## AC233755.2 0.0000000 0.0000000 0.0000000 0.00000000 NaN ## AC233755.1 0.0000000 0.0000000 0.0000000 0.00000000 NaN ## AC240274.1 0.0102893 0.0121099 0.0105809 0.00152901 0.157639 ## AC213203.1 0.0000000 0.0000000 0.0000000 0.00000000 NaN ## FAM231B 0.0000000 0.0000000 0.0000000 0.00000000 NaN ## FDR ## &lt;numeric&gt; ## RP11-34P13.3 NaN ## FAM138A NaN ## OR4F5 NaN ## RP11-34P13.7 0.756443 ## RP11-34P13.8 0.756443 ## ... ... ## AC233755.2 NaN ## AC233755.1 NaN ## AC240274.1 0.756443 ## AC213203.1 NaN ## FAM231B NaN For high-performance computing (HPC) systems with a cluster of compute nodes, we can distribute jobs via the job scheduler using the BatchtoolsParam class. The example below assumes a SLURM cluster, though the settings can be easily configured for a particular system (see here for details). # 2 hours, 8 GB, 1 CPU per task, for 10 tasks. bpp &lt;- BatchtoolsParam(10, cluster=&quot;slurm&quot;, resources=list(walltime=7200, memory=8000, ncpus=1)) Parallelization is best suited for CPU-intensive calculations where the division of labor results in a concomitant reduction in compute time. 
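The benefit of adding workers is bounded, which can be made concrete with Amdahl's law plus a fixed per-worker setup cost (session startup, package loading, data transfer). The numbers below are illustrative assumptions, not measurements:

```python
def parallel_time(serial_time, parallel_fraction, workers, setup_per_worker=0.0):
    """Wall time when `parallel_fraction` of the work is split across `workers`."""
    return (serial_time * (1 - parallel_fraction)      # serial portion
            + serial_time * parallel_fraction / workers  # parallel portion
            + setup_per_worker * workers)                # per-worker overhead

# A 100 s task, 90% parallelizable, 5 s to launch each worker session:
for n in (1, 2, 4, 8, 16):
    print(n, parallel_time(100, 0.9, n, setup_per_worker=5))
# Wall time falls up to ~4 workers, after which the setup cost dominates.
```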
It is not suited for tasks that are bounded by other compute resources, e.g., memory or file I/O (though the latter is less of an issue on HPC systems with parallel read/write). In particular, R itself is inherently single-core, so many of the parallelization backends involve (i) setting up one or more separate R sessions, (ii) loading the relevant packages and (iii) transmitting the data to that session. Depending on the nature and size of the task, this overhead may outweigh any benefit from parallel computing. 14.4 Out of memory representations The count matrix is the central structure around which our analyses are based. In most of the previous chapters, this has been held fully in memory as a dense matrix or as a sparse dgCMatrix. However, in-memory representations may not be feasible for very large data sets, especially on machines with limited memory. For example, the 1.3 million brain cell data set from 10X Genomics (Zheng et al. 2017) would require over 100 GB of RAM to hold as a matrix and around 30 GB as a dgCMatrix. This makes it challenging to explore the data on anything less than an HPC system. The obvious solution is to use a file-backed matrix representation where the data are held on disk and subsets are retrieved into memory as requested. While a number of implementations of file-backed matrices are available (e.g., bigmemory, matter), we will be using the implementation from the HDF5Array package. This uses the popular HDF5 format as the underlying data store, which provides a measure of standardization and portability across systems. We demonstrate with a subset of 20,000 cells from the 1.3 million brain cell data set, as provided by the TENxBrainData package. 
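The memory figures quoted above can be checked on the back of an envelope. The sketch below assumes roughly 6% non-zero entries, which is a plausible density for this dataset but not an exact figure:

```python
# 1.3 million brain cell dataset: 27998 genes x 1306127 cells.
genes, cells, density = 27998, 1306127, 0.06

dense_bytes = genes * cells * 8                    # dense matrix of doubles
nnz = int(genes * cells * density)                 # number of non-zero counts
sparse_bytes = nnz * (8 + 4) + (cells + 1) * 4     # dgCMatrix: x (double), i (int), p (int)

print(round(dense_bytes / 1e9), "GB dense")        # hundreds of GB
print(round(sparse_bytes / 1e9), "GB sparse")      # tens of GB
```

The sparse estimate charges 8 bytes per value plus 4 bytes per row index and one column pointer per column, which is where the roughly tenfold saving over the dense form comes from.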
library(TENxBrainData) sce.brain &lt;- TENxBrainData20k() sce.brain ## class: SingleCellExperiment ## dim: 27998 20000 ## metadata(0): ## assays(1): counts ## rownames: NULL ## rowData names(2): Ensembl Symbol ## colnames: NULL ## colData names(4): Barcode Sequence Library Mouse ## reducedDimNames(0): ## mainExpName: NULL ## altExpNames(0): Examination of the SingleCellExperiment object indicates that the count matrix is a HDF5Matrix. From a comparison of the memory usage, it is clear that this matrix object is simply a stub that points to the much larger HDF5 file that actually contains the data. This avoids the need for large RAM availability during analyses. counts(sce.brain) ## &lt;27998 x 20000&gt; HDF5Matrix object of type &quot;integer&quot;: ## [,1] [,2] [,3] [,4] ... [,19997] [,19998] [,19999] ## [1,] 0 0 0 0 . 0 0 0 ## [2,] 0 0 0 0 . 0 0 0 ## [3,] 0 0 0 0 . 0 0 0 ## [4,] 0 0 0 0 . 0 0 0 ## [5,] 0 0 0 0 . 0 0 0 ## ... . . . . . . . . ## [27994,] 0 0 0 0 . 0 0 0 ## [27995,] 0 0 0 1 . 0 2 0 ## [27996,] 0 0 0 0 . 0 1 0 ## [27997,] 0 0 0 0 . 0 0 0 ## [27998,] 0 0 0 0 . 0 0 0 ## [,20000] ## [1,] 0 ## [2,] 0 ## [3,] 0 ## [4,] 0 ## [5,] 0 ## ... . ## [27994,] 0 ## [27995,] 0 ## [27996,] 0 ## [27997,] 0 ## [27998,] 0 object.size(counts(sce.brain)) ## 2496 bytes file.info(path(counts(sce.brain)))$size ## [1] 76264332 Manipulation of the count matrix will generally result in the creation of a DelayedArray object from the DelayedArray package. This remembers the operations to be applied to the counts and stores them in the object, to be executed when the modified matrix values are realized for use in calculations. The use of delayed operations avoids the need to write the modified values to a new file at every operation, which would unnecessarily require time-consuming disk I/O. tmp &lt;- counts(sce.brain) tmp &lt;- log2(tmp + 1) tmp ## &lt;27998 x 20000&gt; DelayedMatrix object of type &quot;double&quot;: ## [,1] [,2] [,3] ... [,19999] [,20000] ## [1,] 0 0 0 . 
0 0 ## [2,] 0 0 0 . 0 0 ## [3,] 0 0 0 . 0 0 ## [4,] 0 0 0 . 0 0 ## [5,] 0 0 0 . 0 0 ## ... . . . . . . ## [27994,] 0 0 0 . 0 0 ## [27995,] 0 0 0 . 0 0 ## [27996,] 0 0 0 . 0 0 ## [27997,] 0 0 0 . 0 0 ## [27998,] 0 0 0 . 0 0 Many functions described in the previous workflows are capable of accepting HDF5Matrix objects. This is powered by the availability of common methods for all matrix representations (e.g., subsetting, combining, methods from DelayedMatrixStats) as well as representation-agnostic C++ code using beachmat (A. T. L. Lun, Pages, and Smith 2018). For example, we compute QC metrics below with the same perCellQCMetrics() function that we used in the other workflows. library(scater) is.mito &lt;- grepl(&quot;^mt-&quot;, rowData(sce.brain)$Symbol) qcstats &lt;- perCellQCMetrics(sce.brain, subsets=list(Mt=is.mito)) qcstats ## DataFrame with 20000 rows and 6 columns ## sum detected subsets_Mt_sum subsets_Mt_detected subsets_Mt_percent ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## 1 3060 1546 123 10 4.01961 ## 2 3500 1694 118 11 3.37143 ## 3 3092 1613 58 9 1.87581 ## 4 4420 2050 131 10 2.96380 ## 5 3771 1813 100 8 2.65182 ## ... ... ... ... ... ... ## 19996 4431 2050 127 9 2.866170 ## 19997 6988 2704 60 9 0.858615 ## 19998 8749 2988 305 11 3.486113 ## 19999 3842 1711 129 8 3.357626 ## 20000 1775 945 26 6 1.464789 ## total ## &lt;numeric&gt; ## 1 3060 ## 2 3500 ## 3 3092 ## 4 4420 ## 5 3771 ## ... ... ## 19996 4431 ## 19997 6988 ## 19998 8749 ## 19999 3842 ## 20000 1775 Needless to say, data access from file-backed representations is slower than that from in-memory representations. The time spent retrieving data from disk is an unavoidable cost of reducing memory usage. Whether this is tolerable depends on the application. 
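The delayed-operation idea can be captured in miniature: record each operation instead of applying it, and replay the queue only when a value is actually requested. A toy sketch of that pattern (nothing like the real DelayedArray implementation, which also handles chunked block access):

```python
import math

class DelayedMatrix:
    """Toy lazy matrix: queues elementwise ops, applies them only on access."""
    def __init__(self, data, ops=()):
        self.data, self.ops = data, tuple(ops)

    def apply(self, fn):                  # queue an op; no computation happens
        return DelayedMatrix(self.data, self.ops + (fn,))

    def realize(self, i, j):              # fetch one value and replay the queue
        value = self.data[i][j]
        for fn in self.ops:
            value = fn(value)
        return value

m = DelayedMatrix([[0, 3], [7, 0]])
logged = m.apply(lambda x: math.log2(x + 1))  # like log2(tmp + 1) above
print(logged.realize(0, 1))  # -> 2.0
```

Nothing is recomputed or written back for untouched entries, which is exactly why the delayed approach avoids a round of disk I/O per operation.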
One example usage pattern involves performing the heavy computing quickly with in-memory representations on HPC systems with plentiful memory, and then distributing file-backed counterparts to individual users for exploration and visualization on their personal machines. Session Info View session info R version 4.4.2 (2024-10-31) Platform: x86_64-pc-linux-gnu Running under: Ubuntu 24.04.1 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C time zone: America/New_York tzcode source: system (glibc) attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] TENxBrainData_1.26.0 HDF5Array_1.34.0 [3] rhdf5_2.50.2 DelayedArray_0.32.0 [5] SparseArray_1.6.1 S4Arrays_1.6.0 [7] abind_1.4-8 Matrix_1.7-1 [9] BiocParallel_1.40.0 BiocSingular_1.22.0 [11] scater_1.34.0 ggplot2_3.5.1 [13] bluster_1.16.0 BiocNeighbors_2.0.1 [15] scran_1.34.0 scuttle_1.16.0 [17] SingleCellExperiment_1.28.1 SummarizedExperiment_1.36.0 [19] Biobase_2.66.0 GenomicRanges_1.58.0 [21] GenomeInfoDb_1.42.1 IRanges_2.40.1 [23] S4Vectors_0.44.0 BiocGenerics_0.52.0 [25] MatrixGenerics_1.18.1 matrixStats_1.5.0 [27] BiocStyle_2.34.0 rebook_1.16.0 loaded via a namespace (and not attached): [1] DBI_1.2.3 gridExtra_2.3 CodeDepends_0.6.6 [4] rlang_1.1.5 magrittr_2.0.3 RSQLite_2.3.9 [7] compiler_4.4.2 dir.expiry_1.14.0 png_0.1-8 [10] vctrs_0.6.5 pkgconfig_2.0.3 crayon_1.5.3 [13] fastmap_1.2.0 dbplyr_2.5.0 XVector_0.46.0 [16] rmarkdown_2.29 graph_1.84.1 UCSC.utils_1.2.0 [19] ggbeeswarm_0.7.2 purrr_1.0.2 bit_4.5.0.1 [22] xfun_0.50 zlibbioc_1.52.0 cachem_1.1.0 [25] beachmat_2.22.0 jsonlite_1.8.9 blob_1.2.4 [28] 
rhdf5filters_1.18.0 Rhdf5lib_1.28.0 irlba_2.3.5.1 [31] parallel_4.4.2 cluster_2.1.8 R6_2.5.1 [34] bslib_0.8.0 limma_3.62.2 jquerylib_0.1.4 [37] Rcpp_1.0.14 bookdown_0.42 knitr_1.49 [40] snow_0.4-4 igraph_2.1.3 tidyselect_1.2.1 [43] yaml_2.3.10 viridis_0.6.5 codetools_0.2-20 [46] curl_6.1.0 lattice_0.22-6 tibble_3.2.1 [49] KEGGREST_1.46.0 withr_3.0.2 evaluate_1.0.3 [52] BiocFileCache_2.14.0 ExperimentHub_2.14.0 Biostrings_2.74.1 [55] pillar_1.10.1 BiocManager_1.30.25 filelock_1.0.3 [58] generics_0.1.3 BiocVersion_3.20.0 munsell_0.5.1 [61] scales_1.3.0 glue_1.8.0 metapod_1.14.0 [64] tools_4.4.2 AnnotationHub_3.14.0 ScaledMatrix_1.14.0 [67] locfit_1.5-9.10 XML_3.99-0.18 grid_4.4.2 [70] AnnotationDbi_1.68.0 edgeR_4.4.1 colorspace_2.1-1 [73] GenomeInfoDbData_1.2.13 beeswarm_0.4.0 vipor_0.4.7 [76] cli_3.6.3 rsvd_1.0.5 rappdirs_0.3.3 [79] viridisLite_0.4.2 dplyr_1.1.4 gtable_0.3.6 [82] sass_0.4.9 digest_0.6.37 ggrepel_0.9.6 [85] dqrng_0.4.1 memoise_2.0.1 htmltools_0.5.8.1 [88] lifecycle_1.0.4 httr_1.4.7 mime_0.12 [91] statmod_1.5.0 bit64_4.6.0-1 "]]
