[["index.html", "Multi-Sample Single-Cell Analyses with Bioconductor Welcome", " Multi-Sample Single-Cell Analyses with Bioconductor Authors: Robert Amezquita [aut], Aaron Lun [aut, cre], Stephanie Hicks [aut], Raphael Gottardo [aut] Version: 1.1.0 Modified: 2021-05-17 Compiled: 2021-05-25 Environment: R version 4.1.0 beta (2021-05-03 r80259), Bioconductor 3.14 License: CC BY Copyright: Bioconductor, 2021 Source: https://github.com/LTLA/OSCA.multisample Welcome This site contains the advanced analysis chapters for the “Orchestrating Single-Cell Analysis with Bioconductor” book. This describes the handling of multiple samples in a single-cell RNA-seq analysis, starting with integration of multiple datasets into a common space for consistent analyses, differential expression comparisons between conditions based on pseudo-bulk samples, and differential abundance analyses for cell subpopulations. It is intended for readers who are already familiar with basic single-cell analyses, possibly after reading some of the prior books in this collection. "],["integrating-datasets.html", "Chapter 1 Correcting batch effects 1.1 Motivation 1.2 Quick start 1.3 Explaining the data preparation 1.4 No correction 1.5 Linear regression 1.6 MNN correction 1.7 Further options Session Info", " Chapter 1 Correcting batch effects 1.1 Motivation Large single-cell RNA sequencing (scRNA-seq) projects usually need to generate data across multiple batches due to logistical constraints. However, the processing of different batches is often subject to uncontrollable differences, e.g., changes in operator, differences in reagent quality. 
This results in systematic differences in the observed expression in cells from different batches, which we refer to as “batch effects”. Batch effects are problematic as they can be major drivers of heterogeneity in the data, masking the relevant biological differences and complicating interpretation of the results. Computational removal of batch-to-batch variation allows us to combine data across multiple batches for a consolidated downstream analysis. However, existing methods based on linear models (Ritchie et al. 2015; Leek et al. 2012) assume that the composition of cell populations is either known or the same across batches. To overcome these limitations, bespoke methods have been developed for batch correction of single-cell data (Haghverdi et al. 2018; Butler et al. 2018; Lin et al. 2019) that do not require a priori knowledge about the composition of the population. This allows them to be used in workflows for exploratory analyses of scRNA-seq data where such knowledge is usually unavailable. 1.2 Quick start To demonstrate, we will use two separate 10X Genomics PBMC datasets generated in two different batches. Each dataset was obtained from the TENxPBMCData package and separately subjected to basic processing steps such as quality control and normalization. As a general rule, these upstream processing steps should be done within each batch where possible. For example, outlier-based QC on the cells is more effective when performed within a batch (Advanced Section 1.4), and we can more effectively model the mean-variance relationship on each batch separately (Basic Section 3.4). 
View set-up code (Chapter 7) #--- loading ---# library(TENxPBMCData) all.sce &lt;- list( pbmc3k=TENxPBMCData(&#39;pbmc3k&#39;), pbmc4k=TENxPBMCData(&#39;pbmc4k&#39;), pbmc8k=TENxPBMCData(&#39;pbmc8k&#39;) ) #--- quality-control ---# library(scater) stats &lt;- high.mito &lt;- list() for (n in names(all.sce)) { current &lt;- all.sce[[n]] is.mito &lt;- grep(&quot;MT&quot;, rowData(current)$Symbol_TENx) stats[[n]] &lt;- perCellQCMetrics(current, subsets=list(Mito=is.mito)) high.mito[[n]] &lt;- isOutlier(stats[[n]]$subsets_Mito_percent, type=&quot;higher&quot;) all.sce[[n]] &lt;- current[,!high.mito[[n]]] } #--- normalization ---# all.sce &lt;- lapply(all.sce, logNormCounts) #--- variance-modelling ---# library(scran) all.dec &lt;- lapply(all.sce, modelGeneVar) all.hvgs &lt;- lapply(all.dec, getTopHVGs, prop=0.1) #--- dimensionality-reduction ---# library(BiocSingular) set.seed(10000) all.sce &lt;- mapply(FUN=runPCA, x=all.sce, subset_row=all.hvgs, MoreArgs=list(ncomponents=25, BSPARAM=RandomParam()), SIMPLIFY=FALSE) set.seed(100000) all.sce &lt;- lapply(all.sce, runTSNE, dimred=&quot;PCA&quot;) set.seed(1000000) all.sce &lt;- lapply(all.sce, runUMAP, dimred=&quot;PCA&quot;) #--- clustering ---# for (n in names(all.sce)) { g &lt;- buildSNNGraph(all.sce[[n]], k=10, use.dimred=&#39;PCA&#39;) clust &lt;- igraph::cluster_walktrap(g)$membership colLabels(all.sce[[n]]) &lt;- factor(clust) } pbmc3k &lt;- all.sce$pbmc3k dec3k &lt;- all.dec$pbmc3k pbmc3k ## class: SingleCellExperiment ## dim: 32738 2609 ## metadata(0): ## assays(2): counts logcounts ## rownames(32738): ENSG00000243485 ENSG00000237613 ... ENSG00000215616 ## ENSG00000215611 ## rowData names(3): ENSEMBL_ID Symbol_TENx Symbol ## colnames: NULL ## colData names(13): Sample Barcode ... 
sizeFactor label ## reducedDimNames(3): PCA TSNE UMAP ## mainExpName: NULL ## altExpNames(0): pbmc4k &lt;- all.sce$pbmc4k dec4k &lt;- all.dec$pbmc4k pbmc4k ## class: SingleCellExperiment ## dim: 33694 4182 ## metadata(0): ## assays(2): counts logcounts ## rownames(33694): ENSG00000243485 ENSG00000237613 ... ENSG00000277475 ## ENSG00000268674 ## rowData names(3): ENSEMBL_ID Symbol_TENx Symbol ## colnames: NULL ## colData names(13): Sample Barcode ... sizeFactor label ## reducedDimNames(3): PCA TSNE UMAP ## mainExpName: NULL ## altExpNames(0): We then use the quickCorrect() function from the batchelor package to compute corrected values across the two objects. This performs all the steps to set up the data for correction (Section 1.3), followed by the MNN correction itself (Section 1.6). Alternatively, we could use one of the other correction algorithms described in this chapter by modifying PARAM= appropriately. library(batchelor) quick.corrected &lt;- quickCorrect(pbmc3k, pbmc4k, precomputed=list(dec3k, dec4k), PARAM=FastMnnParam(BSPARAM=BiocSingular::RandomParam())) quick.sce &lt;- quick.corrected$corrected quick.sce ## class: SingleCellExperiment ## dim: 31232 6791 ## metadata(2): merge.info pca.info ## assays(1): reconstructed ## rownames(31232): ENSG00000243485 ENSG00000237613 ... ENSG00000198695 ## ENSG00000198727 ## rowData names(1): rotation ## colnames: NULL ## colData names(1): batch ## reducedDimNames(1): corrected ## mainExpName: NULL ## altExpNames(0): This yields low-dimensional corrected values for use in downstream analyses (Figure 1.1). library(scater) set.seed(00101010) quick.sce &lt;- runTSNE(quick.sce, dimred=&quot;corrected&quot;) quick.sce$batch &lt;- factor(quick.sce$batch) plotTSNE(quick.sce, colour_by=&quot;batch&quot;) Figure 1.1: \\(t\\)-SNE plot of the PBMC datasets after MNN correction with quickCorrect(). Each point is a cell that is colored according to its batch of origin. 
1.3 Explaining the data preparation The quickCorrect() function wraps a number of steps that are required to prepare the data for batch correction. The first and most obvious is to subset all batches to the common “universe” of features. In this case, it is straightforward as both batches use Ensembl gene annotation; more difficult integrations will require some mapping of identifiers using packages like org.Mm.eg.db. universe &lt;- intersect(rownames(pbmc3k), rownames(pbmc4k)) length(universe) ## [1] 31232 # Subsetting the SingleCellExperiment object. pbmc3k &lt;- pbmc3k[universe,] pbmc4k &lt;- pbmc4k[universe,] # Also subsetting the variance modelling results, for convenience. dec3k &lt;- dec3k[universe,] dec4k &lt;- dec4k[universe,] The second step is to rescale each batch to adjust for differences in sequencing depth between batches. The multiBatchNorm() function recomputes log-normalized expression values after adjusting the size factors for systematic differences in coverage between SingleCellExperiment objects. (Size factors only remove biases between cells within a single batch.) This improves the quality of the correction by removing one aspect of the technical differences between batches. rescaled &lt;- multiBatchNorm(pbmc3k, pbmc4k) pbmc3k &lt;- rescaled[[1]] pbmc4k &lt;- rescaled[[2]] Finally, we perform feature selection by averaging the variance components across all batches with the combineVar() function. We compute the average as it is responsive to batch-specific HVGs while still preserving the within-batch ranking of genes. This allows us to use the same strategies described in Basic Section 3.5 to select genes of interest. In contrast, approaches based on taking the intersection or union of HVGs across batches become increasingly conservative or liberal, respectively, with an increasing number of batches. 
library(scran) combined.dec &lt;- combineVar(dec3k, dec4k) chosen.hvgs &lt;- combined.dec$bio &gt; 0 sum(chosen.hvgs) ## [1] 13431 When integrating datasets of variable composition, it is generally safer to err on the side of including more HVGs than are used in a single dataset analysis, to ensure that markers are retained for any dataset-specific subpopulations that might be present. For a top \\(X\\) selection, this means using a larger \\(X\\) (e.g., quickCorrect() defaults to 5000), or in this case, we simply take all genes above the trend. That said, many of the signal-to-noise considerations described in Basic Section 3.5 still apply here, so some experimentation may be necessary for best results. 1.4 No correction Before we actually perform any correction, it is worth examining whether there is any batch effect in this dataset. We combine the two SingleCellExperiments and perform a PCA on the log-expression values for our selected subset of HVGs. In this example, our datasets are file-backed and so we instruct runPCA() to use randomized PCA for greater efficiency - see Advanced Section 14.2.2 for more details - though the default IRLBA will suffice for more common in-memory representations. # Synchronizing the metadata for cbind()ing. # TODO: replace with combineCols when that comes out. rowData(pbmc3k) &lt;- rowData(pbmc4k) pbmc3k$batch &lt;- &quot;3k&quot; pbmc4k$batch &lt;- &quot;4k&quot; uncorrected &lt;- cbind(pbmc3k, pbmc4k) # Using RandomParam() as it is more efficient for file-backed matrices. library(scater) set.seed(0010101010) uncorrected &lt;- runPCA(uncorrected, subset_row=chosen.hvgs, BSPARAM=BiocSingular::RandomParam()) We use graph-based clustering on the components to obtain a summary of the population structure. As our two PBMC populations should be replicates, each cluster should ideally consist of cells from both batches. However, we instead see clusters that are comprised of cells from a single batch. 
This indicates that cells of the same type are artificially separated due to technical differences between batches. library(scran) snn.gr &lt;- buildSNNGraph(uncorrected, use.dimred=&quot;PCA&quot;) clusters &lt;- igraph::cluster_walktrap(snn.gr)$membership tab &lt;- table(Cluster=clusters, Batch=uncorrected$batch) tab ## Batch ## Cluster 3k 4k ## 1 1 781 ## 2 0 1309 ## 3 0 535 ## 4 14 51 ## 5 0 605 ## 6 489 0 ## 7 0 184 ## 8 1272 0 ## 9 0 414 ## 10 151 0 ## 11 0 50 ## 12 155 0 ## 13 0 65 ## 14 0 61 ## 15 0 88 ## 16 30 0 ## 17 339 0 ## 18 145 0 ## 19 11 3 ## 20 2 36 This is supported by the \\(t\\)-SNE visualization (Figure 1.2), where the strong separation between cells from different batches is consistent with the clustering results. set.seed(1111001) uncorrected &lt;- runTSNE(uncorrected, dimred=&quot;PCA&quot;) plotTSNE(uncorrected, colour_by=&quot;batch&quot;) Figure 1.2: \\(t\\)-SNE plot of the PBMC datasets without any batch correction. Each point is a cell that is colored according to its batch of origin. Of course, the other explanation for batch-specific clusters is that there are cell types that are unique to each batch. The degree of intermingling of cells from different batches is not an effective diagnostic when the batches involved might actually contain unique cell subpopulations (which is not a consideration in the PBMC dataset, but the same cannot be said in general). If a cluster only contains cells from a single batch, one can always debate whether that is caused by a failure of the correction method or if there is truly a batch-specific subpopulation. For example, do batch-specific metabolic or differentiation states represent distinct subpopulations? Or should they be merged together? We will not attempt to answer this here, only noting that each batch correction algorithm will make different (and possibly inappropriate) decisions on what constitutes “shared” and “unique” populations. 
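As a rough numerical companion to the clustering table in this section, the following base R sketch computes the fraction of each cluster that is contributed by its dominant batch. The counts are copied from the output above; the 0.95 cutoff and the names dominant and batch.specific are illustrative assumptions, not part of batchelor or scran. A high fraction only flags a cluster as batch-specific and cannot tell us whether the cause is technical or biological.

```r
# Counts per cluster (rows) and batch (columns), copied from the table above.
tab <- cbind(
    `3k`=c(1, 0, 0, 14, 0, 489, 0, 1272, 0, 151, 0, 155, 0, 0, 0, 30, 339, 145, 11, 2),
    `4k`=c(781, 1309, 535, 51, 605, 0, 184, 0, 414, 0, 50, 0, 65, 61, 88, 0, 0, 0, 3, 36)
)
rownames(tab) <- seq_len(nrow(tab))

# Fraction of each cluster contributed by its most abundant batch.
dominant <- apply(prop.table(tab, margin=1), 1, max)

# Clusters drawn almost entirely from one batch.
# The 0.95 cutoff is an arbitrary choice for illustration.
batch.specific <- names(dominant)[dominant > 0.95]
length(batch.specific)
```

The same summary can be recomputed after each correction method to see how many clusters remain dominated by a single batch.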
1.5 Linear regression 1.5.1 By rescaling the counts Batch effects in bulk RNA sequencing studies are commonly removed with linear regression. This involves fitting a linear model to each gene’s expression profile, setting the undesirable batch term to zero and recomputing the observations sans the batch effect, yielding a set of corrected expression values for downstream analyses. Linear modelling is the basis of the removeBatchEffect() function from the limma package (Ritchie et al. 2015) as well as the ComBat() function from the sva package (Leek et al. 2012). To use this approach in a scRNA-seq context, we assume that the composition of cell subpopulations is the same across batches. We also assume that the batch effect is additive, i.e., any batch-induced fold-change in expression is the same across different cell subpopulations for any given gene. These are strong assumptions as batches derived from different individuals will naturally exhibit variation in cell type abundances and expression. Nonetheless, they may be acceptable when dealing with batches that are technical replicates generated from the same population of cells. (In fact, when its assumptions hold, linear regression is the most statistically efficient as it uses information from all cells to compute the common batch vector.) Linear modelling can also accommodate situations where the composition is known a priori by including the cell type as a factor in the linear model, but this situation is even less common. We use the rescaleBatches() function from the batchelor package to remove the batch effect. This is roughly equivalent to applying a linear regression to the log-expression values per gene, with some adjustments to improve performance and efficiency. For each gene, the mean expression in each batch is scaled down until it is equal to the lowest mean across all batches. 
We deliberately choose to scale all expression values down as this mitigates differences in variance when batches lie at different positions on the mean-variance trend. (Specifically, the shrinkage effect of the pseudo-count is greater for smaller counts, suppressing any differences in variance across batches.) An additional feature of rescaleBatches() is that it will preserve sparsity in the input matrix for greater efficiency, whereas other methods like removeBatchEffect() will always return a dense matrix. library(batchelor) rescaled &lt;- rescaleBatches(pbmc3k, pbmc4k) rescaled ## class: SingleCellExperiment ## dim: 31232 6791 ## metadata(0): ## assays(1): corrected ## rownames(31232): ENSG00000243485 ENSG00000237613 ... ENSG00000198695 ## ENSG00000198727 ## rowData names(0): ## colnames: NULL ## colData names(1): batch ## reducedDimNames(0): ## mainExpName: NULL ## altExpNames(0): The corrected expression values can be used in place of the \"logcounts\" assay in PCA and clustering (see Chapter 3). After clustering, we observe that most clusters consist of mixtures of cells from the two replicate batches, consistent with the removal of the batch effect. This conclusion is supported by the apparent mixing of cells from different batches in Figure 1.3. However, at least one batch-specific cluster is still present, indicating that the correction is not entirely complete. This is attributable to violation of one of the aforementioned assumptions, even in this simple case involving replicated batches. # To ensure reproducibility of the randomized PCA. 
set.seed(1010101010) rescaled &lt;- runPCA(rescaled, subset_row=chosen.hvgs, exprs_values=&quot;corrected&quot;, BSPARAM=BiocSingular::RandomParam()) snn.gr &lt;- buildSNNGraph(rescaled, use.dimred=&quot;PCA&quot;) clusters.resc &lt;- igraph::cluster_walktrap(snn.gr)$membership tab.resc &lt;- table(Cluster=clusters.resc, Batch=rescaled$batch) tab.resc ## Batch ## Cluster 1 2 ## 1 278 525 ## 2 16 23 ## 3 337 606 ## 4 43 748 ## 5 604 529 ## 6 22 71 ## 7 188 48 ## 8 25 49 ## 9 263 0 ## 10 123 135 ## 11 16 85 ## 12 11 57 ## 13 116 6 ## 14 455 1035 ## 15 6 31 ## 16 89 187 ## 17 3 36 ## 18 3 8 ## 19 11 3 rescaled &lt;- runTSNE(rescaled, dimred=&quot;PCA&quot;) rescaled$batch &lt;- factor(rescaled$batch) plotTSNE(rescaled, colour_by=&quot;batch&quot;) Figure 1.3: \\(t\\)-SNE plot of the PBMC datasets after correction with rescaleBatches(). Each point represents a cell and is colored according to the batch of origin. 1.5.2 By fitting a linear model Alternatively, we could use the regressBatches() function to perform a more conventional linear regression for batch correction. This is subject to the same assumptions as described above for rescaleBatches(), though it has the additional disadvantage of discarding sparsity in the matrix of residuals. To avoid this, we avoid explicit calculation of the residuals during matrix multiplication (see ?ResidualMatrix for details), allowing us to perform an approximate PCA more efficiently. Advanced users can set design= and specify which coefficients to retain in the output matrix, reminiscent of limma’s removeBatchEffect() function. set.seed(10001) residuals &lt;- regressBatches(pbmc3k, pbmc4k, d=50, subset.row=chosen.hvgs, correct.all=TRUE, BSPARAM=BiocSingular::RandomParam()) We set d=50 to instruct regressBatches() to automatically perform a PCA for us. The PCs derived from the residuals can then be used in clustering and further dimensionality reduction, as demonstrated in Figure 1.4. 
snn.gr &lt;- buildSNNGraph(residuals, use.dimred=&quot;corrected&quot;) clusters.resid &lt;- igraph::cluster_walktrap(snn.gr)$membership tab.resid &lt;- table(Cluster=clusters.resid, Batch=residuals$batch) tab.resid ## Batch ## Cluster 1 2 ## 1 478 2 ## 2 142 179 ## 3 22 41 ## 4 298 566 ## 5 340 606 ## 6 0 138 ## 7 404 376 ## 8 145 91 ## 9 2 636 ## 10 22 73 ## 11 6 51 ## 12 629 1110 ## 13 3 36 ## 14 91 211 ## 15 12 55 ## 16 4 8 ## 17 11 3 residuals &lt;- runTSNE(residuals, dimred=&quot;corrected&quot;) residuals$batch &lt;- factor(residuals$batch) plotTSNE(residuals, colour_by=&quot;batch&quot;) Figure 1.4: \\(t\\)-SNE plot of the PBMC datasets after correction with regressBatches(). Each point represents a cell and is colored according to the batch of origin. 1.6 MNN correction Consider a cell \\(a\\) in batch \\(A\\), and identify the cells in batch \\(B\\) that are nearest neighbors to \\(a\\) in the expression space defined by the selected features. Repeat this for a cell \\(b\\) in batch \\(B\\), identifying its nearest neighbors in \\(A\\). Mutual nearest neighbors are pairs of cells from different batches that belong in each other’s set of nearest neighbors. The reasoning is that MNN pairs represent cells from the same biological state prior to the application of a batch effect - see Haghverdi et al. (2018) for full theoretical details. Thus, the difference between cells in MNN pairs can be used as an estimate of the batch effect, the subtraction of which yields batch-corrected values. Compared to linear regression, MNN correction does not assume that the population composition is the same or known beforehand. This is because it learns the shared population structure via identification of MNN pairs and uses this information to obtain an appropriate estimate of the batch effect. Instead, the key assumption of MNN-based approaches is that the batch effect is orthogonal to the biology in high-dimensional expression space. 
Violations reduce the effectiveness and accuracy of the correction, with the most common case arising from variations in the direction of the batch effect between clusters. Nonetheless, the assumption is usually reasonable as a random vector is very likely to be orthogonal in high-dimensional space. The batchelor package provides an implementation of the MNN approach via the fastMNN() function. (Unlike the MNN method originally described by Haghverdi et al. (2018), the fastMNN() function performs PCA to reduce the dimensions beforehand and speed up the downstream neighbor detection steps.) We apply it to our two PBMC batches to remove the batch effect across the highly variable genes in chosen.hvgs. To reduce computational work and technical noise, all cells in all batches are projected into the low-dimensional space defined by the top d principal components. Identification of MNNs and calculation of correction vectors are then performed in this low-dimensional space. # Again, using randomized SVD here, as this is faster than IRLBA for # file-backed matrices. We set deferred=TRUE for greater speed. set.seed(1000101001) mnn.out &lt;- fastMNN(pbmc3k, pbmc4k, d=50, k=20, subset.row=chosen.hvgs, BSPARAM=BiocSingular::RandomParam(deferred=TRUE)) mnn.out ## class: SingleCellExperiment ## dim: 13431 6791 ## metadata(2): merge.info pca.info ## assays(1): reconstructed ## rownames(13431): ENSG00000239945 ENSG00000228463 ... ENSG00000198695 ## ENSG00000198727 ## rowData names(1): rotation ## colnames: NULL ## colData names(1): batch ## reducedDimNames(1): corrected ## mainExpName: NULL ## altExpNames(0): The function returns a SingleCellExperiment object containing corrected values for downstream analyses like clustering or visualization. Each column of mnn.out corresponds to a cell in one of the batches, while each row corresponds to an input gene in chosen.hvgs. The batch field in the column metadata contains a vector specifying the batch of origin of each cell. 
head(mnn.out$batch) ## [1] 1 1 1 1 1 1 The corrected matrix in the reducedDims() contains the low-dimensional corrected coordinates for all cells, which we will use in place of the PCs in our downstream analyses. dim(reducedDim(mnn.out, &quot;corrected&quot;)) ## [1] 6791 50 A reconstructed matrix in the assays() contains the corrected expression values for each gene in each cell, obtained by projecting the low-dimensional coordinates in corrected back into gene expression space. We do not recommend using this for anything other than visualization (Chapter 3). assay(mnn.out, &quot;reconstructed&quot;) ## &lt;13431 x 6791&gt; matrix of class LowRankMatrix and type &quot;double&quot;: ## [,1] [,2] [,3] ... [,6790] [,6791] ## ENSG00000239945 -2.522e-06 -1.851e-06 -1.199e-05 . 1.832e-06 -3.641e-06 ## ENSG00000228463 -6.627e-04 -6.724e-04 -4.820e-04 . -8.531e-04 -3.999e-04 ## ENSG00000237094 -8.077e-05 -8.038e-05 -9.631e-05 . 7.261e-06 -4.094e-05 ## ENSG00000229905 3.838e-06 6.180e-06 5.432e-06 . 8.534e-06 3.485e-06 ## ENSG00000237491 -4.527e-04 -3.178e-04 -1.510e-04 . -3.491e-04 -2.082e-04 ## ... . . . . . . ## ENSG00000198840 -0.0296508 -0.0340101 -0.0502385 . -0.0362884 -0.0183084 ## ENSG00000212907 -0.0041681 -0.0056570 -0.0106420 . -0.0083837 0.0005996 ## ENSG00000198886 0.0145358 0.0200517 -0.0307131 . -0.0109254 -0.0070064 ## ENSG00000198695 0.0014427 0.0013490 0.0001493 . -0.0009826 -0.0022712 ## ENSG00000198727 0.0152570 0.0106167 -0.0256450 . -0.0227962 -0.0022898 The most relevant parameter for tuning fastMNN() is k, which specifies the number of nearest neighbors to consider when defining MNN pairs. This can be interpreted as the minimum anticipated frequency of any shared cell type or state in each batch. Increasing k will generally result in more aggressive merging as the algorithm is more generous in matching subpopulations across batches. 
It can occasionally be desirable to increase k if one clearly sees that the same cell types are not being adequately merged across batches. We cluster on the low-dimensional corrected coordinates to obtain a partitioning of the cells that serves as a proxy for the population structure. If the batch effect is successfully corrected, clusters corresponding to shared cell types or states should contain cells from multiple batches. We see that all clusters contain contributions from each batch after correction, consistent with our expectation that the two batches are replicates of each other. library(scran) snn.gr &lt;- buildSNNGraph(mnn.out, use.dimred=&quot;corrected&quot;) clusters.mnn &lt;- igraph::cluster_walktrap(snn.gr)$membership tab.mnn &lt;- table(Cluster=clusters.mnn, Batch=mnn.out$batch) tab.mnn ## Batch ## Cluster 1 2 ## 1 337 606 ## 2 289 542 ## 3 152 181 ## 4 12 4 ## 5 517 467 ## 6 17 19 ## 7 313 661 ## 8 162 118 ## 9 11 56 ## 10 547 1083 ## 11 17 59 ## 12 16 58 ## 13 144 93 ## 14 67 191 ## 15 4 36 ## 16 4 8 We can also visualize the corrected coordinates using a \\(t\\)-SNE plot (Figure 1.5). The presence of visual clusters containing cells from both batches provides a comforting illusion that the correction was successful. library(scater) set.seed(0010101010) mnn.out &lt;- runTSNE(mnn.out, dimred=&quot;corrected&quot;) mnn.out$batch &lt;- factor(mnn.out$batch) plotTSNE(mnn.out, colour_by=&quot;batch&quot;) Figure 1.5: \\(t\\)-SNE plot of the PBMC datasets after MNN correction with fastMNN(). Each point is a cell that is colored according to its batch of origin. See also Chapter 8 for a case study using MNN correction on a series of human pancreas datasets. 1.7 Further options All of the batchelor functions can operate on a single SingleCellExperiment containing data from all batches. For example, if we were to recycle the uncorrected object from Section 1.4, we could apply MNN correction without splitting the object into multiple parts. 
set.seed(10000) single.correct &lt;- fastMNN(uncorrected, batch=uncorrected$batch, subset.row=chosen.hvgs, BSPARAM=BiocSingular::RandomParam()) single.correct ## class: SingleCellExperiment ## dim: 13431 6791 ## metadata(2): merge.info pca.info ## assays(1): reconstructed ## rownames(13431): ENSG00000239945 ENSG00000228463 ... ENSG00000198695 ## ENSG00000198727 ## rowData names(1): rotation ## colnames: NULL ## colData names(1): batch ## reducedDimNames(1): corrected ## mainExpName: NULL ## altExpNames(0): It is similarly straightforward to simultaneously perform correction across &gt;2 batches, either by having multiple levels in batch= or by providing more SingleCellExperiment objects (or even raw matrices of expression values). This is demonstrated below for MNN correction with an additional PBMC dataset (Figure 1.6). pbmc8k &lt;- all.sce$pbmc8k dec8k &lt;- all.dec$pbmc8k quick.corrected2 &lt;- quickCorrect(`3k`=pbmc3k, `4k`=pbmc4k, `8k`=pbmc8k, precomputed=list(dec3k, dec4k, dec8k), PARAM=FastMnnParam(BSPARAM=BiocSingular::RandomParam(), auto.merge=TRUE)) quick.sce2 &lt;- quick.corrected2$corrected set.seed(00101010) quick.sce2 &lt;- runTSNE(quick.sce2, dimred=&quot;corrected&quot;) plotTSNE(quick.sce2, colour_by=&quot;batch&quot;) Figure 1.6: Yet another \\(t\\)-SNE plot of the PBMC datasets after MNN correction. Each point is a cell that is colored according to its batch of origin. In the specific case of MNN correction, we can also set auto.merge=TRUE to allow it to choose the “best” order in which to perform the merges. This is slower but can occasionally be useful when the batches involved have very different cell type compositions. For example, if one batch contained only B cells, another batch contained only T cells and a third batch contained B and T cells, it would be unwise to try to merge the first two batches together as the wrong MNN pairs would be identified. 
With auto.merge=TRUE, the function would automatically recognize that the third batch should be used as the reference to which the others should be merged. Session Info View session info R version 4.1.0 beta (2021-05-03 r80259) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] scran_1.21.1 scater_1.21.0 [3] ggplot2_3.3.3 scuttle_1.3.0 [5] batchelor_1.9.0 SingleCellExperiment_1.15.1 [7] SummarizedExperiment_1.23.0 Biobase_2.53.0 [9] GenomicRanges_1.45.0 GenomeInfoDb_1.29.0 [11] HDF5Array_1.21.0 rhdf5_2.37.0 [13] DelayedArray_0.19.0 IRanges_2.27.0 [15] S4Vectors_0.31.0 MatrixGenerics_1.5.0 [17] matrixStats_0.58.0 BiocGenerics_0.39.0 [19] Matrix_1.3-3 BiocStyle_2.21.0 [21] rebook_1.3.0 loaded via a namespace (and not attached): [1] bitops_1.0-7 filelock_1.0.2 [3] tools_4.1.0 bslib_0.2.5.1 [5] utf8_1.2.1 R6_2.5.0 [7] irlba_2.3.3 ResidualMatrix_1.3.0 [9] vipor_0.4.5 DBI_1.1.1 [11] colorspace_2.0-1 rhdf5filters_1.5.0 [13] withr_2.4.2 gridExtra_2.3 [15] tidyselect_1.1.1 compiler_4.1.0 [17] graph_1.71.0 BiocNeighbors_1.11.0 [19] labeling_0.4.2 bookdown_0.22 [21] sass_0.4.0 scales_1.1.1 [23] stringr_1.4.0 digest_0.6.27 [25] rmarkdown_2.8 XVector_0.33.0 [27] pkgconfig_2.0.3 htmltools_0.5.1.1 [29] sparseMatrixStats_1.5.0 highr_0.9 [31] limma_3.49.0 rlang_0.4.11 [33] DelayedMatrixStats_1.15.0 farver_2.1.0 [35] generics_0.1.0 jquerylib_0.1.4 [37] jsonlite_1.7.2 BiocParallel_1.27.0 [39] dplyr_1.0.6 RCurl_1.98-1.3 [41] magrittr_2.0.1 BiocSingular_1.9.0 
[43] GenomeInfoDbData_1.2.6 ggbeeswarm_0.6.0 [45] Rcpp_1.0.6 munsell_0.5.0 [47] Rhdf5lib_1.15.0 fansi_0.4.2 [49] viridis_0.6.1 lifecycle_1.0.0 [51] stringi_1.6.2 yaml_2.2.1 [53] edgeR_3.35.0 zlibbioc_1.39.0 [55] Rtsne_0.15 grid_4.1.0 [57] dqrng_0.3.0 crayon_1.4.1 [59] dir.expiry_1.1.0 lattice_0.20-44 [61] cowplot_1.1.1 beachmat_2.9.0 [63] locfit_1.5-9.4 CodeDepends_0.6.5 [65] metapod_1.1.0 knitr_1.33 [67] pillar_1.6.1 igraph_1.2.6 [69] codetools_0.2-18 ScaledMatrix_1.1.0 [71] XML_3.99-0.6 glue_1.4.2 [73] evaluate_0.14 BiocManager_1.30.15 [75] vctrs_0.3.8 purrr_0.3.4 [77] gtable_0.3.0 assertthat_0.2.1 [79] xfun_0.23 rsvd_1.0.5 [81] viridisLite_0.4.0 tibble_3.1.2 [83] beeswarm_0.3.1 cluster_2.1.2 [85] bluster_1.3.0 statmod_1.4.36 [87] ellipsis_0.3.2 References "],["correction-diagnostics.html", "Chapter 2 Correction diagnostics 2.1 Motivation 2.2 Mixing between batches 2.3 Preserving biological heterogeneity 2.4 MNN-specific diagnostics Session Info", " Chapter 2 Correction diagnostics 2.1 Motivation Ideally, batch correction would remove the differences between batches while preserving the heterogeneity within batches. In the corrected data, cells of the same type should be intermingled and indistinguishable even if they come from different batches, while cells of different types should remain well-separated. Unfortunately, we rarely have prior knowledge of the underlying types of the cells, making it difficult to unambiguously determine whether differences between batches represent genuine biology or incomplete correction. Indeed, it could be said that all correction methods are at least somewhat incorrect (Section 6.4.2), though that does not preclude them from being useful. 
In this chapter, we will describe a few diagnostics that, when combined with biological context, can be used to identify potential problems with the correction. We will recycle the mnn.out, tab.mnn and clusters.mnn objects that were produced in Section 1.6. For the sake of brevity, we will not reproduce the relevant code - see Chapter 1 for more details. mnn.out ## class: SingleCellExperiment ## dim: 13431 6791 ## metadata(2): merge.info pca.info ## assays(1): reconstructed ## rownames(13431): ENSG00000239945 ENSG00000228463 ... ENSG00000198695 ## ENSG00000198727 ## rowData names(1): rotation ## colnames: NULL ## colData names(1): batch ## reducedDimNames(1): corrected ## mainExpName: NULL ## altExpNames(0): tab.mnn ## Batch ## Cluster 1 2 ## 1 337 606 ## 2 289 542 ## 3 152 181 ## 4 12 4 ## 5 517 467 ## 6 17 19 ## 7 313 661 ## 8 162 118 ## 9 11 56 ## 10 547 1083 ## 11 17 59 ## 12 16 58 ## 13 144 93 ## 14 67 191 ## 15 4 36 ## 16 4 8 2.2 Mixing between batches The simplest way to quantify the degree of mixing across batches is to test each cluster for imbalances in the contribution from each batch (Büttner et al. 2019). This is done by applying Pearson’s chi-squared test to each row of tab.mnn, where the expected proportions under the null hypothesis are proportional to the total number of cells per batch. Low \\(p\\)-values indicate that there are significant imbalances. In practice, this strategy is most suited to experiments where the batches are technical replicates with identical population composition; it is usually too stringent for batches with more biological variation, where proportions can genuinely vary even in the absence of any batch effect. 
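To make the test described above concrete, the following base R sketch applies a chi-squared goodness-of-fit test to each row of a cluster-by-batch table. The counts are copied from the tab.mnn output printed earlier; the object names (tab, expected, p.manual) are ours, and this is only an illustration of the idea, not a statement of how clusterAbundanceTest() is implemented internally.

```r
# Cluster-by-batch counts copied from the tab.mnn output shown above.
tab <- cbind(
    `1`=c(337, 289, 152, 12, 517, 17, 313, 162, 11, 547, 17, 16, 144, 67, 4, 4),
    `2`=c(606, 542, 181, 4, 467, 19, 661, 118, 56, 1083, 59, 58, 93, 191, 36, 8)
)
rownames(tab) <- seq_len(nrow(tab))

# Expected proportions under the null: each batch's share of all cells.
expected <- colSums(tab) / sum(tab)

# Goodness-of-fit test for each cluster (row). Warnings about small
# expected counts are suppressed here for brevity; they are relevant
# for the smallest clusters, where the approximation is less reliable.
p.manual <- suppressWarnings(
    apply(tab, 1, function(counts) chisq.test(counts, p=expected)$p.value)
)
signif(p.manual, 4)
```

Because the null proportions are derived from the batch totals, a cluster is only flagged when its batch composition departs from the overall composition of the dataset, which is why the test is best suited to replicate batches.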
library(batchelor) p.values &lt;- clusterAbundanceTest(tab.mnn) p.values ## 1 2 3 4 5 6 7 8 ## 9.047e-02 3.093e-02 6.700e-03 2.627e-03 8.424e-20 2.775e-01 5.546e-05 2.274e-11 ## 9 10 11 12 13 14 15 16 ## 2.136e-04 5.480e-05 4.019e-03 2.972e-03 1.538e-12 3.936e-05 2.197e-04 7.172e-01 We favor a more qualitative approach where we compute the variance in the log-normalized abundances across batches for each cluster. A highly variable cluster has large relative differences in cell abundance across batches; this may be an indicator for incomplete batch correction, e.g., if the same cell type in two batches was not combined into a single cluster in the corrected data. We can then focus our attention on these clusters to determine whether they might pose a problem for downstream interpretation. Of course, a large variance can also be caused by genuinely batch-specific populations, so some prior knowledge about the biological context is necessary to distinguish between these two possibilities. For the PBMC dataset, none of the most variable clusters are overtly batch-specific, consistent with the fact that our batches are effectively replicates. rv &lt;- clusterAbundanceVar(tab.mnn) # Also printing the percentage of cells in each cluster in each batch: percent &lt;- t(t(tab.mnn)/colSums(tab.mnn)) * 100 df &lt;- DataFrame(Batch=unclass(percent), var=rv) df[order(df$var, decreasing=TRUE),] ## DataFrame with 16 rows and 3 columns ## Batch.1 Batch.2 var ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## 15 0.153315 0.860832 0.934778 ## 13 5.519356 2.223816 0.728465 ## 9 0.421617 1.339072 0.707757 ## 8 6.209276 2.821616 0.563419 ## 4 0.459946 0.095648 0.452565 ## ... ... ... ... 
## 6 0.651591 0.454328 0.05689945 ## 10 20.965887 25.896700 0.04527468 ## 2 11.077041 12.960306 0.02443988 ## 1 12.916826 14.490674 0.01318296 ## 16 0.153315 0.191296 0.00689661 2.3 Preserving biological heterogeneity Another useful diagnostic check is to compare the pre-correction clustering of each batch to the clustering of the same cells in the corrected data. Accurate data integration should preserve population structure within each batch as there is no batch effect to remove between cells in the same batch. This check complements the previously mentioned diagnostics that only focus on the removal of differences between batches. Specifically, it protects us against scenarios where the correction method simply aggregates all cells together, which would achieve perfect mixing but also discard the biological heterogeneity of interest. To illustrate, we will use clustering results from the analysis of each batch of the PBMC dataset: View set-up code (Chapter 7) #--- loading ---# library(TENxPBMCData) all.sce &lt;- list( pbmc3k=TENxPBMCData(&#39;pbmc3k&#39;), pbmc4k=TENxPBMCData(&#39;pbmc4k&#39;), pbmc8k=TENxPBMCData(&#39;pbmc8k&#39;) ) #--- quality-control ---# library(scater) stats &lt;- high.mito &lt;- list() for (n in names(all.sce)) { current &lt;- all.sce[[n]] is.mito &lt;- grep(&quot;MT&quot;, rowData(current)$Symbol_TENx) stats[[n]] &lt;- perCellQCMetrics(current, subsets=list(Mito=is.mito)) high.mito[[n]] &lt;- isOutlier(stats[[n]]$subsets_Mito_percent, type=&quot;higher&quot;) all.sce[[n]] &lt;- current[,!high.mito[[n]]] } #--- normalization ---# all.sce &lt;- lapply(all.sce, logNormCounts) #--- variance-modelling ---# library(scran) all.dec &lt;- lapply(all.sce, modelGeneVar) all.hvgs &lt;- lapply(all.dec, getTopHVGs, prop=0.1) #--- dimensionality-reduction ---# library(BiocSingular) set.seed(10000) all.sce &lt;- mapply(FUN=runPCA, x=all.sce, subset_row=all.hvgs, MoreArgs=list(ncomponents=25, BSPARAM=RandomParam()), SIMPLIFY=FALSE) set.seed(100000) 
all.sce &lt;- lapply(all.sce, runTSNE, dimred=&quot;PCA&quot;) set.seed(1000000) all.sce &lt;- lapply(all.sce, runUMAP, dimred=&quot;PCA&quot;) #--- clustering ---# for (n in names(all.sce)) { g &lt;- buildSNNGraph(all.sce[[n]], k=10, use.dimred=&#39;PCA&#39;) clust &lt;- igraph::cluster_walktrap(g)$membership colLabels(all.sce[[n]]) &lt;- factor(clust) } pbmc3k &lt;- all.sce$pbmc3k table(colLabels(pbmc3k)) ## ## 1 2 3 4 5 6 7 8 9 10 ## 487 154 603 514 31 150 179 333 147 11 pbmc4k &lt;- all.sce$pbmc4k table(colLabels(pbmc4k)) ## ## 1 2 3 4 5 6 7 8 9 10 11 12 13 ## 497 185 569 786 373 232 44 1023 77 218 88 54 36 Ideally, we should see a many-to-1 mapping where the post-correction clustering is nested inside the pre-correction clustering. This indicates that any within-batch structure was preserved after correction while acknowledging that greater resolution is possible with more cells. We quantify this mapping using the nestedClusters() function from the bluster package, which identifies the nesting of post-correction clusters within the pre-correction clusters. Well-nested clusters have high max values, indicating that most of their cells are derived from a single pre-correction cluster. library(bluster) tab3k &lt;- nestedClusters(ref=paste(&quot;before&quot;, colLabels(pbmc3k)), alt=paste(&quot;after&quot;, clusters.mnn[mnn.out$batch==1])) tab3k$alt.mapping ## DataFrame with 16 rows and 2 columns ## max which ## &lt;numeric&gt; &lt;character&gt; ## after 1 0.985163 before 8 ## after 10 0.919561 before 3 ## after 11 1.000000 before 5 ## after 12 0.812500 before 5 ## after 13 0.993056 before 9 ## ... ... ... ## after 5 0.874275 before 4 ## after 6 0.882353 before 4 ## after 7 1.000000 before 1 ## after 8 0.981481 before 1 ## after 9 1.000000 before 1 We can visualize this mapping for the PBMC dataset in Figure 2.1. Ideally, each row should have a single dominant entry close to unity. 
Horizontal stripes are more concerning as these indicate that multiple pre-correction clusters were merged together, though the exact level of concern will depend on whether specific clusters of interest are gained or lost. In practice, more discrepancies can be expected even when the correction is perfect, due to the existence of closely related clusters that were arbitrarily separated in the within-batch clustering. library(pheatmap) # For the first batch: heat3k &lt;- pheatmap(tab3k$proportions, cluster_row=FALSE, cluster_col=FALSE, main=&quot;PBMC 3K comparison&quot;, silent=TRUE) # For the second batch: tab4k &lt;- nestedClusters(ref=paste(&quot;before&quot;, colLabels(pbmc4k)), alt=paste(&quot;after&quot;, clusters.mnn[mnn.out$batch==2])) heat4k &lt;- pheatmap(tab4k$proportions, cluster_row=FALSE, cluster_col=FALSE, main=&quot;PBMC 4K comparison&quot;, silent=TRUE) gridExtra::grid.arrange(heat3k[[4]], heat4k[[4]]) Figure 2.1: Comparison between the clusterings obtained before (columns) and after MNN correction (rows). One heatmap is generated for each of the PBMC 3K and 4K datasets, where each entry is colored according to the proportion of cells distributed along each row (i.e., the row sums equal unity). We use the adjusted Rand index (Advanced Section 5.3) to quantify the agreement between the clusterings before and after batch correction. Recall that larger indices are more desirable as this indicates that within-batch heterogeneity is preserved, though this must be balanced against the ability of each method to actually perform batch correction. library(bluster) ri3k &lt;- pairwiseRand(clusters.mnn[mnn.out$batch==1], colLabels(pbmc3k), mode=&quot;index&quot;) ri3k ## [1] 0.7361 ri4k &lt;- pairwiseRand(clusters.mnn[mnn.out$batch==2], colLabels(pbmc4k), mode=&quot;index&quot;) ri4k ## [1] 0.8301 We can also break down the ARI into per-cluster ratios for more detailed diagnostics (Figure 2.2). 
For example, we could see low ratios off the diagonal if distinct clusters in the within-batch clustering were incorrectly aggregated in the merged clustering. Conversely, we might see low ratios on the diagonal if the correction inflated or introduced spurious heterogeneity inside a within-batch cluster. # For the first batch. tab &lt;- pairwiseRand(colLabels(pbmc3k), clusters.mnn[mnn.out$batch==1]) heat3k &lt;- pheatmap(tab, cluster_row=FALSE, cluster_col=FALSE, col=rev(viridis::magma(100)), main=&quot;PBMC 3K probabilities&quot;, silent=TRUE) # For the second batch. tab &lt;- pairwiseRand(colLabels(pbmc4k), clusters.mnn[mnn.out$batch==2]) heat4k &lt;- pheatmap(tab, cluster_row=FALSE, cluster_col=FALSE, col=rev(viridis::magma(100)), main=&quot;PBMC 4K probabilities&quot;, silent=TRUE) gridExtra::grid.arrange(heat3k[[4]], heat4k[[4]]) Figure 2.2: ARI-derived ratios for the within-batch clusters after comparison to the merged clusters obtained after MNN correction. One heatmap is generated for each of the PBMC 3K and 4K datasets. 2.4 MNN-specific diagnostics For fastMNN(), one useful diagnostic is the proportion of variance within each batch that is lost during MNN correction. Specifically, this refers to the within-batch variance that is removed during orthogonalization with respect to the average correction vector at each merge step. This is returned via the lost.var field in the metadata of mnn.out, which contains a matrix of the variance lost in each batch (column) at each merge step (row). metadata(mnn.out)$merge.info$lost.var ## [,1] [,2] ## [1,] 0.006617 0.003315 Large proportions of lost variance (&gt;10%) suggest that correction is removing genuine biological heterogeneity. This would occur due to violations of the assumption of orthogonality between the batch effect and the biological subspace (Haghverdi et al. 2018). In this case, the proportion of lost variance is small, indicating that non-orthogonality is not a major concern. 
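As a rough illustration of what "lost variance" means here, the sketch below removes the component of some simulated coordinates along a single unit vector and reports the fraction of total variance that disappears. Everything in it (the simulated matrix x, the vector v, the 10% threshold variable) is hypothetical and chosen for illustration; it is not the computation performed inside fastMNN().

```r
set.seed(1)

# Hypothetical low-dimensional coordinates for one batch (cells x dimensions).
x <- matrix(rnorm(200 * 5), ncol=5)

# Hypothetical average correction vector, scaled to unit length.
v <- c(1, 0.5, 0, 0, 0)
v <- v / sqrt(sum(v^2))

# Orthogonalization: subtract each cell's component along v.
proj <- drop(x %*% v)
x.orth <- x - outer(proj, v)

# Proportion of the total variance removed by this step.
total.before <- sum(apply(x, 2, var))
total.after <- sum(apply(x.orth, 2, var))
lost <- 1 - total.after / total.before
lost

# Flag the batch if the loss exceeds the 10% rule of thumb mentioned above.
threshold <- 0.1
lost > threshold
```

In this toy example the data have no preferred direction, so a sizeable share of the variance lies along v by construction; in real data, a small loss suggests that little within-batch structure is aligned with the batch vector.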
Another MNN-related diagnostic involves examining the variance in the differences in expression between MNN pairs. A small variance indicates that the correction had little effect - either there was no batch effect, or any batch effect was simply a constant shift across all cells. On the other hand, a large variance indicates that the correction was highly non-linear, most likely involving subpopulation-specific batch effects. This computation is achieved using the mnnDeltaVariance() function on the MNN pairings produced by fastMNN(). library(batchelor) common &lt;- rownames(mnn.out) vars &lt;- mnnDeltaVariance(pbmc3k[common,], pbmc4k[common,], pairs=metadata(mnn.out)$merge.info$pairs) vars[order(vars$adjusted, decreasing=TRUE),] ## DataFrame with 13431 rows and 4 columns ## mean total trend adjusted ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## ENSG00000111796 0.404086 1.487536 0.657367 0.830169 ## ENSG00000257764 0.320638 1.044040 0.548209 0.495831 ## ENSG00000170345 2.543318 1.342864 0.948955 0.393909 ## ENSG00000187118 0.313542 0.916383 0.538257 0.378126 ## ENSG00000177606 1.799190 1.400792 1.033322 0.367470 ## ... ... ... ... ... ## ENSG00000204482 0.786304 0.537895 0.97769 -0.439796 ## ENSG00000011600 1.176075 0.628421 1.07374 -0.445319 ## ENSG00000101439 1.135830 0.626148 1.07300 -0.446850 ## ENSG00000158869 0.927395 0.575360 1.03902 -0.463658 ## ENSG00000105374 1.219511 0.536416 1.07336 -0.536943 Such genes with large variances are particularly interesting as they exhibit complex differences between batches that may reflect real biology. For example, in Figure 2.3, the KLRB1-positive clusters in the second batch lack any counterpart in the first batch, despite the two batches being replicates. This may represent some kind of batch-specific state in two otherwise identical populations, though whether this is biological or technical in nature is open for interpretation. 
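The interpretation of the variance of MNN pair differences can be checked on a toy example. The sketch below simulates one gene across 100 hypothetical MNN pairs under two scenarios, a constant shift between batches and a shift that affects only half of the pairs. All names and values are invented for illustration, and the calculation is a plain variance rather than the trend-adjusted statistic reported by mnnDeltaVariance().

```r
set.seed(2)
n <- 100

# Hypothetical log-expression of one gene in the first cell of each MNN pair.
first <- rnorm(n)

# Scenario A: the batch effect is a constant shift for all pairs.
second.const <- first + 2

# Scenario B: the shift only affects pairs from one subpopulation.
subpop <- rep(c(TRUE, FALSE), each=n/2)
second.sub <- first + ifelse(subpop, 2, 0)

# Variance of the per-pair differences in each scenario.
var.const <- var(second.const - first)
var.sub <- var(second.sub - first)
c(constant=var.const, subpopulation=var.sub)
```

The constant shift yields a variance that is zero up to floating-point error, while the subpopulation-specific shift yields a clearly positive variance, matching the reasoning given for this diagnostic.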
library(scater) top &lt;- rownames(vars)[order(vars$adjusted, decreasing=TRUE)[1]] gridExtra::grid.arrange( plotExpression(pbmc3k, x=&quot;label&quot;, features=top) + ggtitle(&quot;3k&quot;), plotExpression(pbmc4k, x=&quot;label&quot;, features=top) + ggtitle(&quot;4k&quot;), ncol=2 ) Figure 2.3: Distribution of the expression of the gene with the largest variance of MNN pair differences in each batch of the PBMC dataset. Session Info View session info R version 4.1.0 beta (2021-05-03 r80259) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] scater_1.21.0 ggplot2_3.3.3 [3] scuttle_1.3.0 HDF5Array_1.21.0 [5] rhdf5_2.37.0 DelayedArray_0.19.0 [7] Matrix_1.3-3 pheatmap_1.0.12 [9] bluster_1.3.0 batchelor_1.9.0 [11] BiocSingular_1.9.0 SingleCellExperiment_1.15.1 [13] SummarizedExperiment_1.23.0 Biobase_2.53.0 [15] GenomicRanges_1.45.0 GenomeInfoDb_1.29.0 [17] IRanges_2.27.0 S4Vectors_0.31.0 [19] BiocGenerics_0.39.0 MatrixGenerics_1.5.0 [21] matrixStats_0.58.0 BiocStyle_2.21.0 [23] rebook_1.3.0 loaded via a namespace (and not attached): [1] bitops_1.0-7 filelock_1.0.2 [3] RColorBrewer_1.1-2 tools_4.1.0 [5] bslib_0.2.5.1 utf8_1.2.1 [7] R6_2.5.0 irlba_2.3.3 [9] ResidualMatrix_1.3.0 vipor_0.4.5 [11] DBI_1.1.1 colorspace_2.0-1 [13] rhdf5filters_1.5.0 withr_2.4.2 [15] tidyselect_1.1.1 gridExtra_2.3 [17] compiler_4.1.0 graph_1.71.0 [19] BiocNeighbors_1.11.0 labeling_0.4.2 [21] bookdown_0.22 sass_0.4.0 [23] scales_1.1.1 stringr_1.4.0 [25] 
digest_0.6.27 rmarkdown_2.8 [27] XVector_0.33.0 pkgconfig_2.0.3 [29] htmltools_0.5.1.1 sparseMatrixStats_1.5.0 [31] limma_3.49.0 highr_0.9 [33] rlang_0.4.11 DelayedMatrixStats_1.15.0 [35] farver_2.1.0 generics_0.1.0 [37] jquerylib_0.1.4 jsonlite_1.7.2 [39] BiocParallel_1.27.0 dplyr_1.0.6 [41] RCurl_1.98-1.3 magrittr_2.0.1 [43] GenomeInfoDbData_1.2.6 ggbeeswarm_0.6.0 [45] Rhdf5lib_1.15.0 Rcpp_1.0.6 [47] munsell_0.5.0 fansi_0.4.2 [49] viridis_0.6.1 lifecycle_1.0.0 [51] stringi_1.6.2 yaml_2.2.1 [53] edgeR_3.35.0 zlibbioc_1.39.0 [55] grid_4.1.0 dqrng_0.3.0 [57] crayon_1.4.1 dir.expiry_1.1.0 [59] lattice_0.20-44 cowplot_1.1.1 [61] beachmat_2.9.0 locfit_1.5-9.4 [63] CodeDepends_0.6.5 metapod_1.1.0 [65] knitr_1.33 pillar_1.6.1 [67] igraph_1.2.6 codetools_0.2-18 [69] ScaledMatrix_1.1.0 XML_3.99-0.6 [71] glue_1.4.2 evaluate_0.14 [73] scran_1.21.1 BiocManager_1.30.15 [75] vctrs_0.3.8 purrr_0.3.4 [77] gtable_0.3.0 assertthat_0.2.1 [79] xfun_0.23 rsvd_1.0.5 [81] viridisLite_0.4.0 tibble_3.1.2 [83] beeswarm_0.3.1 cluster_2.1.2 [85] statmod_1.4.36 ellipsis_0.3.2 References "],["using-corrected-values.html", "Chapter 3 Using the corrected values 3.1 Background 3.2 For within-batch comparisons 3.3 After blocking on the batch 3.4 For between-batch comparisons Session Info", " Chapter 3 Using the corrected values 3.1 Background The greatest value of batch correction lies in facilitating cell-based analysis of population heterogeneity in a consistent manner across batches. Cluster 1 in batch A is the same as cluster 1 in batch B when the clustering is performed on the merged data. There is no need to identify mappings between separate clusterings, which might not even be possible when the clusters are not well-separated. 
By generating a single set of clusters for all batches, rather than requiring separate examination of each batch’s clusters, we avoid repeatedly paying the cost of manual interpretation. Another benefit is that the available number of cells is increased when all batches are combined, which allows for greater resolution of population structure in downstream analyses. We previously demonstrated the application of clustering methods to the batch-corrected data, but the same principles apply for other analyses like trajectory reconstruction. In general, cell-based analyses are safe to apply on corrected data; indeed, the whole purpose of the correction is to place all cells in the same coordinate space. However, the same cannot be easily said for gene-based procedures like DE analyses or marker gene detection. An arbitrary correction algorithm is not obliged to preserve relative differences in per-gene expression when attempting to align multiple batches. For example, cosine normalization in fastMNN() shrinks the magnitude of the expression values so that the computed log-fold changes have no obvious interpretation. This chapter will elaborate on some of the problems with using corrected values for gene-based analyses. We consider both within-batch analyses like marker detection as well as between-batch comparisons. 3.2 For within-batch comparisons Correction is not guaranteed to preserve relative differences between cells in the same batch. This complicates the interpretation of corrected values for within-batch analyses such as marker detection. To demonstrate, consider the two pancreas datasets from Grun et al. (2016) and Muraro et al. (2016). 
View set-up code (Workflow Chapter 6) #--- loading ---# library(scRNAseq) sce.muraro &lt;- MuraroPancreasData() #--- gene-annotation ---# library(AnnotationHub) edb &lt;- AnnotationHub()[[&quot;AH73881&quot;]] gene.symb &lt;- sub(&quot;__chr.*$&quot;, &quot;&quot;, rownames(sce.muraro)) gene.ids &lt;- mapIds(edb, keys=gene.symb, keytype=&quot;SYMBOL&quot;, column=&quot;GENEID&quot;) # Removing duplicated genes or genes without Ensembl IDs. keep &lt;- !is.na(gene.ids) &amp; !duplicated(gene.ids) sce.muraro &lt;- sce.muraro[keep,] rownames(sce.muraro) &lt;- gene.ids[keep] #--- quality-control ---# library(scater) stats &lt;- perCellQCMetrics(sce.muraro) qc &lt;- quickPerCellQC(stats, percent_subsets=&quot;altexps_ERCC_percent&quot;, batch=sce.muraro$donor, subset=sce.muraro$donor!=&quot;D28&quot;) sce.muraro &lt;- sce.muraro[,!qc$discard] #--- normalization ---# library(scran) set.seed(1000) clusters &lt;- quickCluster(sce.muraro) sce.muraro &lt;- computeSumFactors(sce.muraro, clusters=clusters) sce.muraro &lt;- logNormCounts(sce.muraro) #--- variance-modelling ---# block &lt;- paste0(sce.muraro$plate, &quot;_&quot;, sce.muraro$donor) dec.muraro &lt;- modelGeneVarWithSpikes(sce.muraro, &quot;ERCC&quot;, block=block) top.muraro &lt;- getTopHVGs(dec.muraro, prop=0.1) sce.muraro ## class: SingleCellExperiment ## dim: 16940 2299 ## metadata(0): ## assays(2): counts logcounts ## rownames(16940): ENSG00000268895 ENSG00000121410 ... ENSG00000159840 ## ENSG00000074755 ## rowData names(2): symbol chr ## colnames(2299): D28-1_1 D28-1_2 ... 
D30-8_93 D30-8_94 ## colData names(4): label donor plate sizeFactor ## reducedDimNames(0): ## mainExpName: endogenous ## altExpNames(1): ERCC View set-up code (Workflow Chapter 5) #--- loading ---# library(scRNAseq) sce.grun &lt;- GrunPancreasData() #--- gene-annotation ---# library(org.Hs.eg.db) gene.ids &lt;- mapIds(org.Hs.eg.db, keys=rowData(sce.grun)$symbol, keytype=&quot;SYMBOL&quot;, column=&quot;ENSEMBL&quot;) keep &lt;- !is.na(gene.ids) &amp; !duplicated(gene.ids) sce.grun &lt;- sce.grun[keep,] rownames(sce.grun) &lt;- gene.ids[keep] #--- quality-control ---# library(scater) stats &lt;- perCellQCMetrics(sce.grun) qc &lt;- quickPerCellQC(stats, percent_subsets=&quot;altexps_ERCC_percent&quot;, batch=sce.grun$donor, subset=sce.grun$donor %in% c(&quot;D17&quot;, &quot;D7&quot;, &quot;D2&quot;)) sce.grun &lt;- sce.grun[,!qc$discard] #--- normalization ---# library(scran) set.seed(1000) # for irlba. clusters &lt;- quickCluster(sce.grun) sce.grun &lt;- computeSumFactors(sce.grun, clusters=clusters) sce.grun &lt;- logNormCounts(sce.grun) #--- variance-modelling ---# block &lt;- paste0(sce.grun$sample, &quot;_&quot;, sce.grun$donor) dec.grun &lt;- modelGeneVarWithSpikes(sce.grun, spikes=&quot;ERCC&quot;, block=block) top.grun &lt;- getTopHVGs(dec.grun, prop=0.1) # Applying cell type labels for downstream interpretation. library(SingleR) training &lt;- sce.muraro[,!is.na(sce.muraro$label)] assignments &lt;- SingleR(sce.grun, training, labels=training$label) sce.grun$label &lt;- assignments$labels sce.grun ## class: SingleCellExperiment ## dim: 17398 1063 ## metadata(0): ## assays(2): counts logcounts ## rownames(17398): ENSG00000268895 ENSG00000121410 ... ENSG00000074755 ## ENSG00000036549 ## rowData names(2): symbol chr ## colnames(1063): D2ex_1 D2ex_2 ... 
D17TGFB_94 D17TGFB_95 ## colData names(4): donor sample sizeFactor label ## reducedDimNames(0): ## mainExpName: endogenous ## altExpNames(1): ERCC If we look at the expression of the INS-IGF2 transcript, we can see that there is a major difference between the two batches (Figure 3.1). This is most likely due to some difference in read mapping stringency between the two studies, but the exact cause is irrelevant to this example. library(scater) gridExtra::grid.arrange( plotExpression(sce.grun, x=&quot;label&quot;, features=&quot;ENSG00000129965&quot;) + ggtitle(&quot;Grun&quot;), plotExpression(sce.muraro, x=&quot;label&quot;, features=&quot;ENSG00000129965&quot;) + ggtitle(&quot;Muraro&quot;) ) Figure 3.1: Distribution of uncorrected expression values for INS-IGF2 across the cell types in the Grun and Muraro pancreas datasets. A “perfect” batch correction algorithm must eliminate differences in the expression of this gene between batches. Failing to do so would result in an incomplete merging of cell types - in this case, beta cells - across batches as they would still be separated on the dimension defined by INS-IGF2. Exactly how this is done can vary; Figure 3.2 presents one possible outcome from MNN correction, though another algorithm may choose to align the profiles by setting INS-IGF2 expression to zero for all cells in both batches. library(batchelor) set.seed(1011011) mnn.pancreas &lt;- quickCorrect(grun=sce.grun, muraro=sce.muraro, precomputed=list(dec.grun, dec.muraro)) corrected &lt;- mnn.pancreas$corrected corrected$label &lt;- c(sce.grun$label, sce.muraro$label) plotExpression(corrected, x=&quot;label&quot;, features=&quot;ENSG00000129965&quot;, exprs_values=&quot;reconstructed&quot;, other_fields=&quot;batch&quot;) + facet_wrap(~batch) Figure 3.2: Distribution of MNN-corrected expression values for INS-IGF2 across the cell types in the Grun and Muraro pancreas datasets. 
In this manner, we have introduced artificial DE between the cell types in the Muraro batch in order to align with the DE present in the Grun dataset. We would be misled into believing that beta cells upregulate INS-IGF2 in both batches when in fact this is only true for the Grun batch. At best, this is only a minor error - after all, we do actually have INS-IGF2-high beta cells, they are just limited to batch 2, which limits the utility of this gene as a general marker. At worst, this can change the conclusions, e.g., if batch 1 was drug-treated and batch 2 was a control, we might mistakenly conclude that our drug has no effect on INS-IGF2 expression in beta cells. (This is discussed further in Section ??.) 3.3 After blocking on the batch For per-gene analyses that involve comparisons within batches, we prefer to use the uncorrected expression values and to block on the batch in our statistical model. For marker detection, this is done by performing comparisons within each batch and combining statistics across batches (Basic Section 6.7). This strategy is based on the expectation that any genuine DE between clusters should still be present in a within-batch comparison where batch effects are absent. It penalizes genes that exhibit inconsistent DE across batches, thus protecting against misleading conclusions when a population in one batch is aligned to a similar-but-not-identical population in another batch. We demonstrate this approach below using a blocked \\(t\\)-test to detect markers in the PBMC dataset, where the presence of the same pattern across clusters within each batch (Figure 3.3) is reassuring. 
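The logic of blocking can be sketched in base R on simulated data: test within each batch, then combine the per-batch results. The simulated gene, the batch and cluster labels, and the choice to weight each batch by its number of cells are all assumptions made for illustration. Stouffer's method is used here as one common way of combining p-values, and the sketch does not claim to reproduce the exact calculations in findMarkers().

```r
set.seed(3)

# Hypothetical design: two batches, two clusters, 50 cells per combination.
batch <- rep(c("A", "B"), each=100)
cluster <- rep(rep(c("1", "2"), each=50), 2)

# Hypothetical gene: upregulated in cluster 1 in both batches,
# with an additional shift of 3 for every cell in batch B.
expr <- rnorm(200) + ifelse(cluster == "1", 1, 0) + ifelse(batch == "B", 3, 0)

# One-sided Welch t-test within each batch, so that the batch shift
# never enters any comparison.
per.batch <- vapply(split(seq_along(expr), batch), function(i) {
    t.test(expr[i][cluster[i] == "1"], expr[i][cluster[i] == "2"],
        alternative="greater")$p.value
}, 0)

# Combine the per-batch p-values with Stouffer's weighted Z-score method.
# Weights are the number of cells per batch (an illustrative choice).
w <- as.numeric(table(batch)[names(per.batch)])
z <- qnorm(per.batch, lower.tail=FALSE)
combined <- pnorm(sum(w * z) / sqrt(sum(w^2)), lower.tail=FALSE)

per.batch
combined
```

A gene that is only DE in one batch would contribute a large p-value from the other batch, pulling the combined statistic towards non-significance; this is the penalty for inconsistent DE described above.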
View set-up code (Chapter 7) #--- loading ---# library(TENxPBMCData) all.sce &lt;- list( pbmc3k=TENxPBMCData(&#39;pbmc3k&#39;), pbmc4k=TENxPBMCData(&#39;pbmc4k&#39;), pbmc8k=TENxPBMCData(&#39;pbmc8k&#39;) ) #--- quality-control ---# library(scater) stats &lt;- high.mito &lt;- list() for (n in names(all.sce)) { current &lt;- all.sce[[n]] is.mito &lt;- grep(&quot;MT&quot;, rowData(current)$Symbol_TENx) stats[[n]] &lt;- perCellQCMetrics(current, subsets=list(Mito=is.mito)) high.mito[[n]] &lt;- isOutlier(stats[[n]]$subsets_Mito_percent, type=&quot;higher&quot;) all.sce[[n]] &lt;- current[,!high.mito[[n]]] } #--- normalization ---# all.sce &lt;- lapply(all.sce, logNormCounts) #--- variance-modelling ---# library(scran) all.dec &lt;- lapply(all.sce, modelGeneVar) all.hvgs &lt;- lapply(all.dec, getTopHVGs, prop=0.1) #--- dimensionality-reduction ---# library(BiocSingular) set.seed(10000) all.sce &lt;- mapply(FUN=runPCA, x=all.sce, subset_row=all.hvgs, MoreArgs=list(ncomponents=25, BSPARAM=RandomParam()), SIMPLIFY=FALSE) set.seed(100000) all.sce &lt;- lapply(all.sce, runTSNE, dimred=&quot;PCA&quot;) set.seed(1000000) all.sce &lt;- lapply(all.sce, runUMAP, dimred=&quot;PCA&quot;) #--- clustering ---# for (n in names(all.sce)) { g &lt;- buildSNNGraph(all.sce[[n]], k=10, use.dimred=&#39;PCA&#39;) clust &lt;- igraph::cluster_walktrap(g)$membership colLabels(all.sce[[n]]) &lt;- factor(clust) } #--- data-integration ---# # Intersecting the common genes. universe &lt;- Reduce(intersect, lapply(all.sce, rownames)) all.sce2 &lt;- lapply(all.sce, &quot;[&quot;, i=universe,) all.dec2 &lt;- lapply(all.dec, &quot;[&quot;, i=universe,) # Renormalizing to adjust for differences in depth. library(batchelor) normed.sce &lt;- do.call(multiBatchNorm, all.sce2) # Identifying a set of HVGs using stats from all batches. 
combined.dec &lt;- do.call(combineVar, all.dec2) combined.hvg &lt;- getTopHVGs(combined.dec, n=5000) set.seed(1000101) merged.pbmc &lt;- do.call(fastMNN, c(normed.sce, list(subset.row=combined.hvg, BSPARAM=RandomParam()))) #--- merged-clustering ---# g &lt;- buildSNNGraph(merged.pbmc, use.dimred=&quot;corrected&quot;) colLabels(merged.pbmc) &lt;- factor(igraph::cluster_louvain(g)$membership) table(colLabels(merged.pbmc), merged.pbmc$batch) # TODO: make this process a one-liner. all.sce2 &lt;- lapply(all.sce2, function(x) { rowData(x) &lt;- rowData(all.sce2[[1]]) x }) combined &lt;- do.call(cbind, all.sce2) combined$batch &lt;- rep(c(&quot;3k&quot;, &quot;4k&quot;, &quot;8k&quot;), vapply(all.sce2, ncol, 0L)) clusters.mnn &lt;- colLabels(merged.pbmc) # Marker detection with block= set to the batch factor. library(scran) m.out &lt;- findMarkers(combined, clusters.mnn, block=combined$batch, direction=&quot;up&quot;, lfc=1, row.data=rowData(combined)[,3,drop=FALSE]) # Seems like CD8+ T cells: demo &lt;- m.out[[&quot;1&quot;]] as.data.frame(demo[1:10,c(&quot;Symbol&quot;, &quot;Top&quot;, &quot;p.value&quot;, &quot;FDR&quot;)]) ## Symbol Top p.value FDR ## ENSG00000172116 CD8B 1 7.352e-100 8.831e-97 ## ENSG00000167286 CD3D 1 1.545e-204 4.825e-201 ## ENSG00000111716 LDHB 1 5.445e-146 9.448e-143 ## ENSG00000213741 RPS29 1 0.000e+00 0.000e+00 ## ENSG00000171858 RPS21 1 0.000e+00 0.000e+00 ## ENSG00000171223 JUNB 1 8.880e-235 4.622e-231 ## ENSG00000177954 RPS27 2 1.045e-296 6.529e-293 ## ENSG00000153563 CD8A 2 1.000e+00 1.000e+00 ## ENSG00000136942 RPL35 2 0.000e+00 0.000e+00 ## ENSG00000198851 CD3E 2 6.773e-174 1.410e-170 plotExpression(combined, x=I(factor(clusters.mnn)), swap_rownames=&quot;Symbol&quot;, features=c(&quot;CD3D&quot;, &quot;CD8B&quot;), colour_by=&quot;batch&quot;) + facet_wrap(Feature~colour_by) Figure 3.3: Distributions of uncorrected log-expression values for CD8B and CD3D within each cluster in each batch of the merged PBMC dataset. 
In contrast, we suggest limiting the use of per-gene corrected values to visualization, e.g., when coloring points on a \\(t\\)-SNE plot by per-cell expression. This can be more aesthetically pleasing than uncorrected expression values that may contain large shifts on the colour scale between cells in different batches. Use of the corrected values in any quantitative procedure should be treated with caution, and should be backed up by similar results from an analysis on the uncorrected values. 3.4 For between-batch comparisons Here, the main problem is that correction will inevitably introduce artificial agreement across batches. Removal of biological differences between batches in the corrected data is unavoidable if we want to mix cells from different batches. To illustrate, we shall consider the pancreas dataset from Segerstolpe et al. (2016), involving both healthy and diabetic donors. Each donor has been treated as a separate batch for the purpose of removing donor effects. View set-up code (Workflow Chapter 8) #--- loading ---# library(scRNAseq) sce.seger &lt;- SegerstolpePancreasData() #--- gene-annotation ---# library(AnnotationHub) edb &lt;- AnnotationHub()[[&quot;AH73881&quot;]] symbols &lt;- rowData(sce.seger)$symbol ens.id &lt;- mapIds(edb, keys=symbols, keytype=&quot;SYMBOL&quot;, column=&quot;GENEID&quot;) ens.id &lt;- ifelse(is.na(ens.id), symbols, ens.id) # Removing duplicated rows. 
keep &lt;- !duplicated(ens.id) sce.seger &lt;- sce.seger[keep,] rownames(sce.seger) &lt;- ens.id[keep] #--- sample-annotation ---# emtab.meta &lt;- colData(sce.seger)[,c(&quot;cell type&quot;, &quot;disease&quot;, &quot;individual&quot;, &quot;single cell well quality&quot;)] colnames(emtab.meta) &lt;- c(&quot;CellType&quot;, &quot;Disease&quot;, &quot;Donor&quot;, &quot;Quality&quot;) colData(sce.seger) &lt;- emtab.meta sce.seger$CellType &lt;- gsub(&quot; cell&quot;, &quot;&quot;, sce.seger$CellType) sce.seger$CellType &lt;- paste0( toupper(substr(sce.seger$CellType, 1, 1)), substring(sce.seger$CellType, 2)) #--- quality-control ---# low.qual &lt;- sce.seger$Quality == &quot;low quality cell&quot; library(scater) stats &lt;- perCellQCMetrics(sce.seger) qc &lt;- quickPerCellQC(stats, percent_subsets=&quot;altexps_ERCC_percent&quot;, batch=sce.seger$Donor, subset=!sce.seger$Donor %in% c(&quot;HP1504901&quot;, &quot;HP1509101&quot;)) sce.seger &lt;- sce.seger[,!(qc$discard | low.qual)] #--- normalization ---# library(scran) clusters &lt;- quickCluster(sce.seger) sce.seger &lt;- computeSumFactors(sce.seger, clusters=clusters) sce.seger &lt;- logNormCounts(sce.seger) #--- variance-modelling ---# for.hvg &lt;- sce.seger[,librarySizeFactors(altExp(sce.seger)) &gt; 0 &amp; sce.seger$Donor!=&quot;AZ&quot;] dec.seger &lt;- modelGeneVarWithSpikes(for.hvg, &quot;ERCC&quot;, block=for.hvg$Donor) chosen.hvgs &lt;- getTopHVGs(dec.seger, n=2000) #--- dimensionality-reduction ---# library(BiocSingular) set.seed(101011001) sce.seger &lt;- runPCA(sce.seger, subset_row=chosen.hvgs, ncomponents=25) sce.seger &lt;- runTSNE(sce.seger, dimred=&quot;PCA&quot;) #--- clustering ---# library(bluster) clust.out &lt;- clusterRows(reducedDim(sce.seger, &quot;PCA&quot;), NNGraphParam(), full=TRUE) snn.gr &lt;- clust.out$objects$graph colLabels(sce.seger) &lt;- clust.out$clusters #--- data-integration ---# library(batchelor) set.seed(10001010) corrected &lt;- fastMNN(sce.seger, 
batch=sce.seger$Donor, subset.row=chosen.hvgs) set.seed(10000001) corrected &lt;- runTSNE(corrected, dimred=&quot;corrected&quot;) colLabels(corrected) &lt;- clusterRows(reducedDim(corrected, &quot;corrected&quot;), NNGraphParam()) tab &lt;- table(Cluster=colLabels(corrected), Donor=corrected$batch) tab sce.seger ## class: SingleCellExperiment ## dim: 25454 2090 ## metadata(0): ## assays(2): counts logcounts ## rownames(25454): ENSG00000118473 ENSG00000142920 ... ENSG00000278306 ## eGFP ## rowData names(2): symbol refseq ## colnames(2090): HP1502401_H13 HP1502401_J14 ... HP1526901T2D_N8 ## HP1526901T2D_A8 ## colData names(6): CellType Disease ... sizeFactor label ## reducedDimNames(2): PCA TSNE ## mainExpName: endogenous ## altExpNames(1): ERCC We examine the expression of INS in beta cells across donors (Figure 3.4). We observe some variation across donors with a modest downregulation in the set of diabetic patients. library(scater) sce.beta &lt;- sce.seger[,sce.seger$CellType==&quot;Beta&quot;] by.cell &lt;- plotExpression(sce.beta, features=&quot;INS&quot;, swap_rownames=&quot;symbol&quot;, colour_by=&quot;Disease&quot;, # Arrange donors by disease status, for a prettier plot. x=I(reorder(sce.beta$Donor, sce.beta$Disease, FUN=unique))) ave.beta &lt;- aggregateAcrossCells(sce.beta, statistics=&quot;mean&quot;, use.assay.type=&quot;logcounts&quot;, ids=sce.beta$Donor, use.altexps=FALSE) by.sample &lt;- plotExpression(ave.beta, features=&quot;INS&quot;, swap_rownames=&quot;symbol&quot;, x=&quot;Disease&quot;, colour_by=&quot;Disease&quot;) gridExtra::grid.arrange(by.cell, by.sample, ncol=2) Figure 3.4: Distribution of log-expression values for INS in beta cells across donors in the Segerstolpe pancreas dataset. Each point represents a cell in each donor (left) or the average of all cells in each donor (right), and is colored according to disease status of the donor. 
We repeat this examination on the MNN-corrected values, where the relative differences are largely eliminated (Figure 3.5). Note that the change in the y-axis scale can largely be ignored as the corrected values are on a different scale after cosine normalization. corr.beta &lt;- corrected[,sce.seger$CellType==&quot;Beta&quot;] corr.beta$Donor &lt;- sce.beta$Donor corr.beta$Disease &lt;- sce.beta$Disease by.cell &lt;- plotExpression(corr.beta, features=&quot;ENSG00000254647&quot;, x=I(reorder(sce.beta$Donor, sce.beta$Disease, FUN=unique)), exprs_values=&quot;reconstructed&quot;, colour_by=&quot;Disease&quot;) ave.beta &lt;- aggregateAcrossCells(corr.beta, statistics=&quot;mean&quot;, use.assay.type=&quot;reconstructed&quot;, ids=sce.beta$Donor) by.sample &lt;- plotExpression(ave.beta, features=&quot;ENSG00000254647&quot;, exprs_values=&quot;reconstructed&quot;, x=&quot;Disease&quot;, colour_by=&quot;Disease&quot;) gridExtra::grid.arrange(by.cell, by.sample, ncol=2) Figure 3.5: Distribution of MNN-corrected log-expression values for INS in beta cells across donors in the Segerstolpe pancreas dataset. Each point represents a cell in each donor (left) or the average of all cells in each donor (right), and is colored according to disease status of the donor. We will not attempt to determine whether the INS downregulation represents genuine biology or a batch effect (see Workflow Section 8.9 for a formal analysis). The real issue is that the analyst never has a chance to consider this question when the corrected values are used. Moreover, the variation in expression across donors is understated, which is problematic if we want to make conclusions about population variability. We suggest performing cross-batch comparisons on the original expression values wherever possible. Rather than performing correction, we rely on the statistical model to account for batch-to-batch variation when making inferences. 
This preserves any differences between conditions and does not distort the variance structure. Some further consequences of correction in the context of multi-condition comparisons are discussed in Section 6.4.2. Session Info View session info R version 4.1.0 beta (2021-05-03 r80259) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] scran_1.21.1 HDF5Array_1.21.0 [3] rhdf5_2.37.0 DelayedArray_0.19.0 [5] Matrix_1.3-3 batchelor_1.9.0 [7] scater_1.21.0 ggplot2_3.3.3 [9] scuttle_1.3.0 SingleCellExperiment_1.15.1 [11] SingleR_1.7.0 SummarizedExperiment_1.23.0 [13] Biobase_2.53.0 GenomicRanges_1.45.0 [15] GenomeInfoDb_1.29.0 IRanges_2.27.0 [17] S4Vectors_0.31.0 BiocGenerics_0.39.0 [19] MatrixGenerics_1.5.0 matrixStats_0.58.0 [21] BiocStyle_2.21.0 rebook_1.3.0 loaded via a namespace (and not attached): [1] bitops_1.0-7 filelock_1.0.2 [3] tools_4.1.0 bslib_0.2.5.1 [5] ResidualMatrix_1.3.0 utf8_1.2.1 [7] R6_2.5.0 irlba_2.3.3 [9] vipor_0.4.5 DBI_1.1.1 [11] colorspace_2.0-1 rhdf5filters_1.5.0 [13] withr_2.4.2 gridExtra_2.3 [15] tidyselect_1.1.1 compiler_4.1.0 [17] graph_1.71.0 BiocNeighbors_1.11.0 [19] labeling_0.4.2 bookdown_0.22 [21] sass_0.4.0 scales_1.1.1 [23] rappdirs_0.3.3 stringr_1.4.0 [25] digest_0.6.27 rmarkdown_2.8 [27] XVector_0.33.0 pkgconfig_2.0.3 [29] htmltools_0.5.1.1 sparseMatrixStats_1.5.0 [31] highr_0.9 limma_3.49.0 [33] rlang_0.4.11 DelayedMatrixStats_1.15.0 [35] farver_2.1.0 jquerylib_0.1.4 [37] generics_0.1.0 jsonlite_1.7.2 [39] 
BiocParallel_1.27.0 dplyr_1.0.6 [41] RCurl_1.98-1.3 magrittr_2.0.1 [43] BiocSingular_1.9.0 GenomeInfoDbData_1.2.6 [45] Rhdf5lib_1.15.0 ggbeeswarm_0.6.0 [47] Rcpp_1.0.6 munsell_0.5.0 [49] fansi_0.4.2 viridis_0.6.1 [51] lifecycle_1.0.0 stringi_1.6.2 [53] yaml_2.2.1 edgeR_3.35.0 [55] zlibbioc_1.39.0 grid_4.1.0 [57] dqrng_0.3.0 crayon_1.4.1 [59] dir.expiry_1.1.0 lattice_0.20-44 [61] cowplot_1.1.1 beachmat_2.9.0 [63] locfit_1.5-9.4 CodeDepends_0.6.5 [65] metapod_1.1.0 knitr_1.33 [67] pillar_1.6.1 igraph_1.2.6 [69] codetools_0.2-18 ScaledMatrix_1.1.0 [71] XML_3.99-0.6 glue_1.4.2 [73] evaluate_0.14 BiocManager_1.30.15 [75] vctrs_0.3.8 gtable_0.3.0 [77] purrr_0.3.4 assertthat_0.2.1 [79] xfun_0.23 rsvd_1.0.5 [81] viridisLite_0.4.0 tibble_3.1.2 [83] beeswarm_0.3.1 cluster_2.1.2 [85] bluster_1.3.0 statmod_1.4.36 [87] ellipsis_0.3.2 References "],["multi-sample-comparisons.html", "Chapter 4 DE analyses between conditions 4.1 Motivation 4.2 Setting up the data 4.3 Creating pseudo-bulk samples 4.4 Performing the DE analysis 4.5 Putting it all together 4.6 Testing for between-label differences Session Info", " Chapter 4 DE analyses between conditions 4.1 Motivation A powerful use of scRNA-seq technology lies in the design of replicated multi-condition experiments to detect population-specific changes in expression between conditions. For example, a researcher could use this strategy to detect gene expression changes for each cell type after drug treatment (Richard et al. 2018) or genetic modifications (Scialdone et al. 2016). 
This provides more biological insight than conventional scRNA-seq experiments involving only one biological condition, especially if we can relate population changes to specific experimental perturbations. Here, we will focus on differential expression analyses of replicated multi-condition scRNA-seq experiments. Our aim is to find significant changes in expression between conditions for cells of the same type that are present in both conditions. 4.2 Setting up the data Our demonstration scRNA-seq dataset was generated from chimeric mouse embryos at the E8.5 developmental stage (Pijuan-Sala et al. 2019). Each chimeric embryo was generated by injecting td-Tomato-positive embryonic stem cells (ESCs) into a wild-type (WT) blastocyst. Unlike in previous experiments (Scialdone et al. 2016), there is no genetic difference between the injected and background cells other than the expression of td-Tomato in the former. Instead, the aim of this “wild-type chimera” study is to determine whether the injection procedure itself introduces differences in lineage commitment compared to the background cells. The experiment used a paired design with three replicate batches of two samples each. Specifically, each batch contains one sample consisting of td-Tomato positive cells and another consisting of negative cells, obtained by fluorescence-activated cell sorting from a single pool of dissociated cells from 6-7 chimeric embryos. For each sample, scRNA-seq data was generated using the 10X Genomics protocol (Zheng et al. 2017) to obtain 2000-7000 cells. 
View set-up code (Chapter 10) #--- loading ---# library(MouseGastrulationData) sce.chimera &lt;- WTChimeraData(samples=5:10) sce.chimera #--- feature-annotation ---# library(scater) rownames(sce.chimera) &lt;- uniquifyFeatureNames( rowData(sce.chimera)$ENSEMBL, rowData(sce.chimera)$SYMBOL) #--- quality-control ---# drop &lt;- sce.chimera$celltype.mapped %in% c(&quot;stripped&quot;, &quot;Doublet&quot;) sce.chimera &lt;- sce.chimera[,!drop] #--- normalization ---# sce.chimera &lt;- logNormCounts(sce.chimera) #--- variance-modelling ---# library(scran) dec.chimera &lt;- modelGeneVar(sce.chimera, block=sce.chimera$sample) chosen.hvgs &lt;- dec.chimera$bio &gt; 0 #--- merging ---# library(batchelor) set.seed(01001001) merged &lt;- correctExperiments(sce.chimera, batch=sce.chimera$sample, subset.row=chosen.hvgs, PARAM=FastMnnParam( merge.order=list( list(1,3,5), # WT (3 replicates) list(2,4,6) # td-Tomato (3 replicates) ) ) ) #--- clustering ---# g &lt;- buildSNNGraph(merged, use.dimred=&quot;corrected&quot;) clusters &lt;- igraph::cluster_louvain(g) colLabels(merged) &lt;- factor(clusters$membership) #--- dimensionality-reduction ---# merged &lt;- runTSNE(merged, dimred=&quot;corrected&quot;, external_neighbors=TRUE) merged &lt;- runUMAP(merged, dimred=&quot;corrected&quot;, external_neighbors=TRUE) merged ## class: SingleCellExperiment ## dim: 14699 19426 ## metadata(2): merge.info pca.info ## assays(3): reconstructed counts logcounts ## rownames(14699): Xkr4 Rp1 ... Vmn2r122 CAAA01147332.1 ## rowData names(3): rotation ENSEMBL SYMBOL ## colnames(19426): cell_9769 cell_9770 ... cell_30701 cell_30702 ## colData names(13): batch cell ... sizeFactor label ## reducedDimNames(5): corrected pca.corrected.E7.5 pca.corrected.E8.5 ## TSNE UMAP ## mainExpName: NULL ## altExpNames(0): The differential analyses in this chapter will be predicated on many of the pre-processing steps covered previously. 
For brevity, we will not explicitly repeat them here, only noting that we have already merged cells from all samples into the same coordinate system (Chapter 1) and clustered the merged dataset to obtain a common partitioning across all samples (Basic Chapter 5). A brief inspection of the results indicates that clusters contain similar contributions from all batches with only modest differences associated with td-Tomato expression (Figure 4.1). library(scater) table(colLabels(merged), merged$tomato) ## ## FALSE TRUE ## 1 546 401 ## 2 60 52 ## 3 470 398 ## 4 469 211 ## 5 335 271 ## 6 258 249 ## 7 1241 967 ## 8 203 221 ## 9 630 629 ## 10 71 181 ## 11 417 310 ## 12 47 57 ## 13 58 0 ## 14 209 214 ## 15 414 630 ## 16 363 509 ## 17 234 198 ## 18 657 607 ## 19 151 303 ## 20 579 443 ## 21 137 74 ## 22 82 78 ## 23 155 1 ## 24 762 878 ## 25 363 497 ## 26 1420 716 table(colLabels(merged), merged$pool) ## ## 3 4 5 ## 1 224 173 550 ## 2 26 30 56 ## 3 226 172 470 ## 4 78 162 440 ## 5 99 227 280 ## 6 187 116 204 ## 7 300 909 999 ## 8 69 134 221 ## 9 229 423 607 ## 10 114 54 84 ## 11 179 169 379 ## 12 16 31 57 ## 13 2 51 5 ## 14 77 97 249 ## 15 114 289 641 ## 16 183 242 447 ## 17 157 81 194 ## 18 123 308 833 ## 19 106 118 230 ## 20 236 238 548 ## 21 3 10 198 ## 22 27 29 104 ## 23 6 84 66 ## 24 217 455 968 ## 25 132 172 556 ## 26 194 870 1072 gridExtra::grid.arrange( plotTSNE(merged, colour_by=&quot;tomato&quot;, text_by=&quot;label&quot;), plotTSNE(merged, colour_by=data.frame(pool=factor(merged$pool))), ncol=2 ) Figure 4.1: \\(t\\)-SNE plot of the WT chimeric dataset, where each point represents a cell and is colored according to td-Tomato expression (left) or batch of origin (right). Cluster numbers are superimposed based on the median coordinate of cells assigned to that cluster. Ordinarily, we would be obliged to perform marker detection to assign biological meaning to these clusters. 
For simplicity, we will skip this step by directly using the existing cell type labels provided by Pijuan-Sala et al. (2019). These were obtained by mapping the cells in this dataset to a larger, pre-annotated “atlas” of mouse early embryonic development. While there are obvious similarities, we see that many of our clusters map to multiple labels and vice versa (Figure 4.2), which reflects the difficulties in unambiguously resolving cell types undergoing differentiation. library(bluster) pairwiseRand(colLabels(merged), merged$celltype.mapped, &quot;index&quot;) ## [1] 0.5514 by.label &lt;- table(colLabels(merged), merged$celltype.mapped) pheatmap::pheatmap(log2(by.label+1), color=viridis::viridis(101)) Figure 4.2: Heatmap showing the abundance of cells with each combination of cluster (row) and cell type label (column). The color scale represents the log2-count for each combination. 4.3 Creating pseudo-bulk samples The most obvious differential analysis is to look for changes in expression between conditions. We perform the DE analysis separately for each label to identify cell type-specific transcriptional effects of injection. The actual DE testing is performed on “pseudo-bulk” expression profiles (Tung et al. 2017), generated by summing counts together for all cells with the same combination of label and sample. This leverages the resolution offered by single-cell technologies to define the labels, and combines it with the statistical rigor of existing methods for DE analyses involving a small number of samples. # Using &#39;label&#39; and &#39;sample&#39; as our two factors; each column of the output # corresponds to one unique combination of these two factors. summed &lt;- aggregateAcrossCells(merged, id=colData(merged)[,c(&quot;celltype.mapped&quot;, &quot;sample&quot;)]) summed ## class: SingleCellExperiment ## dim: 14699 186 ## metadata(2): merge.info pca.info ## assays(1): counts ## rownames(14699): Xkr4 Rp1 ... 
Vmn2r122 CAAA01147332.1 ## rowData names(3): rotation ENSEMBL SYMBOL ## colnames: NULL ## colData names(16): batch cell ... sample ncells ## reducedDimNames(5): corrected pca.corrected.E7.5 pca.corrected.E8.5 ## TSNE UMAP ## mainExpName: NULL ## altExpNames(0): At this point, it is worth reflecting on the motivations behind the use of pseudo-bulking: Larger counts are more amenable to standard DE analysis pipelines designed for bulk RNA-seq data. Normalization is more straightforward and certain statistical approximations are more accurate, e.g., the saddlepoint approximation for quasi-likelihood methods or normality for linear models. Collapsing cells into samples reflects the fact that our biological replication occurs at the sample level (Lun and Marioni 2017). Each sample is represented no more than once for each condition, avoiding problems from unmodelled correlations between samples. Supplying the per-cell counts directly to a DE analysis pipeline would imply that each cell is an independent biological replicate, which is not true from an experimental perspective. (A mixed effects model can handle this variance structure but involves extra statistical and computational complexity for little benefit; see Crowell et al. (2019).) Variance between cells within each sample is masked, provided it does not affect variance across (replicate) samples. This avoids penalizing DEGs that are not uniformly up- or down-regulated for all cells in all samples of one condition. Masking is generally desirable as DEGs - unlike marker genes - do not need to have low within-sample variance to be interesting, e.g., if the treatment effect is consistent across replicate populations but heterogeneous on a per-cell basis. Of course, high per-cell variability will still result in weaker DE if it affects the variability across populations, while homogeneous per-cell responses will result in stronger DE due to a larger population-level log-fold change. 
These effects are also largely desirable. 4.4 Performing the DE analysis Our DE analysis will be performed using quasi-likelihood (QL) methods from the edgeR package (Robinson, McCarthy, and Smyth 2010; Chen, Lun, and Smyth 2016). This uses a negative binomial generalized linear model (NB GLM) to handle overdispersed count data in experiments with limited replication. In our case, we have biological variation with three paired replicates per condition, so edgeR (or one of its contemporaries) is a natural choice for the analysis. We do not fit a single GLM to all labels at once, as the strong DE between labels makes it difficult to compute a sensible average abundance to model the mean-dispersion trend. Moreover, label-specific batch effects would not be easily handled with a single additive term in the design matrix for the batch. Instead, we arbitrarily pick one of the labels to use for this demonstration. label &lt;- &quot;Mesenchyme&quot; current &lt;- summed[,label==summed$celltype.mapped] # Creating a DGEList object for use in edgeR: library(edgeR) y &lt;- DGEList(counts(current), samples=colData(current)) y ## An object of class &quot;DGEList&quot; ## $counts ## Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 ## Xkr4 2 0 0 0 3 0 ## Rp1 0 0 1 0 0 0 ## Sox17 7 0 3 0 14 9 ## Mrpl15 1420 271 1009 379 1578 749 ## Rgs20 3 0 1 1 0 0 ## 14694 more rows ... 
## ## $samples ## group lib.size norm.factors batch cell barcode sample stage tomato pool ## Sample1 1 4607053 1 5 &lt;NA&gt; &lt;NA&gt; 5 E8.5 TRUE 3 ## Sample2 1 1064970 1 6 &lt;NA&gt; &lt;NA&gt; 6 E8.5 FALSE 3 ## Sample3 1 2494010 1 7 &lt;NA&gt; &lt;NA&gt; 7 E8.5 TRUE 4 ## Sample4 1 1028668 1 8 &lt;NA&gt; &lt;NA&gt; 8 E8.5 FALSE 4 ## Sample5 1 4290221 1 9 &lt;NA&gt; &lt;NA&gt; 9 E8.5 TRUE 5 ## Sample6 1 1950840 1 10 &lt;NA&gt; &lt;NA&gt; 10 E8.5 FALSE 5 ## stage.mapped celltype.mapped closest.cell doub.density sizeFactor label ## Sample1 &lt;NA&gt; Mesenchyme &lt;NA&gt; NA NA &lt;NA&gt; ## Sample2 &lt;NA&gt; Mesenchyme &lt;NA&gt; NA NA &lt;NA&gt; ## Sample3 &lt;NA&gt; Mesenchyme &lt;NA&gt; NA NA &lt;NA&gt; ## Sample4 &lt;NA&gt; Mesenchyme &lt;NA&gt; NA NA &lt;NA&gt; ## Sample5 &lt;NA&gt; Mesenchyme &lt;NA&gt; NA NA &lt;NA&gt; ## Sample6 &lt;NA&gt; Mesenchyme &lt;NA&gt; NA NA &lt;NA&gt; ## celltype.mapped.1 sample.1 ncells ## Sample1 Mesenchyme 5 286 ## Sample2 Mesenchyme 6 55 ## Sample3 Mesenchyme 7 243 ## Sample4 Mesenchyme 8 134 ## Sample5 Mesenchyme 9 478 ## Sample6 Mesenchyme 10 299 A typical step in bulk RNA-seq data analyses is to remove samples with very low library sizes due to failed library preparation or sequencing. The very low counts in these samples can be troublesome in downstream steps such as normalization (Basic Chapter 2) or for some statistical approximations used in the DE analysis. In our situation, this is equivalent to removing label-sample combinations that have very few or lowly-sequenced cells. The exact definition of “very low” will vary, but in this case, we remove combinations containing fewer than 10 cells (Crowell et al. 2019). Alternatively, we could apply the outlier-based strategy described in Basic Chapter 1, but this makes the strong assumption that all label-sample combinations have similar numbers of cells that are sequenced to similar depth. 
We defer to the usual diagnostics for bulk DE analyses to decide whether a particular pseudo-bulk profile should be removed. discarded &lt;- current$ncells &lt; 10 y &lt;- y[,!discarded] summary(discarded) ## Mode FALSE ## logical 6 Another typical step in bulk RNA-seq analyses is to remove genes that are lowly expressed. This reduces computational work, improves the accuracy of mean-variance trend modelling and decreases the severity of the multiple testing correction. Here, we use the filterByExpr() function from edgeR to remove genes that are not expressed above a log-CPM threshold in a minimum number of samples (determined from the size of the smallest treatment group in the experimental design). # Taking the grouping from &#39;y&#39; itself, so that it stays in sync with any discarded samples. keep &lt;- filterByExpr(y, group=y$samples$tomato) y &lt;- y[keep,] summary(keep) ## Mode FALSE TRUE ## logical 9011 5688 Finally, we correct for composition biases by computing normalization factors with the trimmed mean of M-values method (Robinson and Oshlack 2010). We do not need the bespoke single-cell methods described in Basic Chapter 2, as the counts for our pseudo-bulk samples are large enough to apply bulk normalization methods. (Note that edgeR normalization factors are closely related to, but not the same as, the size factors described elsewhere in this book. Size factors are proportional to the product of the normalization factors and the library sizes.) 
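The relationship between normalization factors and size factors can be checked by hand. The following is a minimal base-R sketch using the library sizes and TMM normalization factors reported for the six Mesenchyme pseudo-bulk samples in this section. The variable names are illustrative, and scaling the size factors to unit mean is an assumption that follows the usual convention rather than anything edgeR does itself.

```r
# Library sizes and TMM normalization factors for the six Mesenchyme
# pseudo-bulk samples, copied from the y$samples output in this section.
lib.size <- c(4607053, 1064970, 2494010, 1028668, 4290221, 1950840)
norm.factors <- c(1.0683, 1.0487, 0.9582, 0.9774, 0.9707, 0.9817)

# The "effective library size" is the product of the two quantities.
eff.lib.size <- lib.size * norm.factors

# Size factors are proportional to the effective library sizes.
# Centering them at unit mean is an assumed convention for this sketch.
size.factors <- eff.lib.size / mean(eff.lib.size)

round(size.factors, 3)
```

Because the normalization factors are all close to 1 here, the resulting size factors are dominated by the differences in library size between samples.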
y &lt;- calcNormFactors(y) y$samples ## group lib.size norm.factors batch cell barcode sample stage tomato pool ## Sample1 1 4607053 1.0683 5 &lt;NA&gt; &lt;NA&gt; 5 E8.5 TRUE 3 ## Sample2 1 1064970 1.0487 6 &lt;NA&gt; &lt;NA&gt; 6 E8.5 FALSE 3 ## Sample3 1 2494010 0.9582 7 &lt;NA&gt; &lt;NA&gt; 7 E8.5 TRUE 4 ## Sample4 1 1028668 0.9774 8 &lt;NA&gt; &lt;NA&gt; 8 E8.5 FALSE 4 ## Sample5 1 4290221 0.9707 9 &lt;NA&gt; &lt;NA&gt; 9 E8.5 TRUE 5 ## Sample6 1 1950840 0.9817 10 &lt;NA&gt; &lt;NA&gt; 10 E8.5 FALSE 5 ## stage.mapped celltype.mapped closest.cell doub.density sizeFactor label ## Sample1 &lt;NA&gt; Mesenchyme &lt;NA&gt; NA NA &lt;NA&gt; ## Sample2 &lt;NA&gt; Mesenchyme &lt;NA&gt; NA NA &lt;NA&gt; ## Sample3 &lt;NA&gt; Mesenchyme &lt;NA&gt; NA NA &lt;NA&gt; ## Sample4 &lt;NA&gt; Mesenchyme &lt;NA&gt; NA NA &lt;NA&gt; ## Sample5 &lt;NA&gt; Mesenchyme &lt;NA&gt; NA NA &lt;NA&gt; ## Sample6 &lt;NA&gt; Mesenchyme &lt;NA&gt; NA NA &lt;NA&gt; ## celltype.mapped.1 sample.1 ncells ## Sample1 Mesenchyme 5 286 ## Sample2 Mesenchyme 6 55 ## Sample3 Mesenchyme 7 243 ## Sample4 Mesenchyme 8 134 ## Sample5 Mesenchyme 9 478 ## Sample6 Mesenchyme 10 299 As part of the usual diagnostics for a bulk RNA-seq DE analysis, we generate a mean-difference (MD) plot for each normalized pseudo-bulk profile (Figure 4.3). This should exhibit a trumpet shape centered at zero indicating that the normalization successfully removed systematic bias between profiles. Lack of zero-centering or dominant discrete patterns at low abundances may be symptomatic of deeper problems with normalization, possibly due to insufficient cells/reads/UMIs composing a particular pseudo-bulk profile. par(mfrow=c(2,3)) for (i in seq_len(ncol(y))) { plotMD(y, column=i) } Figure 4.3: Mean-difference plots of the normalized expression values for each pseudo-bulk sample against the average of all other samples. We also generate a multi-dimensional scaling (MDS) plot for the pseudo-bulk profiles (Figure 4.4). 
This is closely related to PCA and allows us to visualize the structure of the data in a manner similar to that described in Basic Chapter 4 (though we rarely have enough pseudo-bulk profiles to make use of techniques like \\(t\\)-SNE). Here, the aim is to check whether samples separate by our known factors of interest - in this case, injection status. Strong separation foreshadows a large number of DEGs in the subsequent analysis. plotMDS(cpm(y, log=TRUE), col=ifelse(y$samples$tomato, &quot;red&quot;, &quot;blue&quot;)) Figure 4.4: MDS plot of the pseudo-bulk log-normalized CPMs, where each point represents a sample and is colored by the tomato status. We set up the design matrix to block on the batch-to-batch differences across different embryo pools, while retaining an additive term that represents the effect of injection. The latter is represented in our model as the log-fold change in gene expression in td-Tomato-positive cells over their negative counterparts within the same label. Our aim is to test whether this log-fold change is significantly different from zero. design &lt;- model.matrix(~factor(pool) + factor(tomato), y$samples) design ## (Intercept) factor(pool)4 factor(pool)5 factor(tomato)TRUE ## Sample1 1 0 0 1 ## Sample2 1 0 0 0 ## Sample3 1 1 0 1 ## Sample4 1 1 0 0 ## Sample5 1 0 1 1 ## Sample6 1 0 1 0 ## attr(,&quot;assign&quot;) ## [1] 0 1 1 2 ## attr(,&quot;contrasts&quot;) ## attr(,&quot;contrasts&quot;)$`factor(pool)` ## [1] &quot;contr.treatment&quot; ## ## attr(,&quot;contrasts&quot;)$`factor(tomato)` ## [1] &quot;contr.treatment&quot; We estimate the negative binomial (NB) dispersions with estimateDisp(). The role of the NB dispersion is to model the mean-variance trend (Figure 4.5), which is not easily accommodated by QL dispersions alone due to the quadratic nature of the NB mean-variance trend. y &lt;- estimateDisp(y, design) summary(y$trended.dispersion) ## Min. 1st Qu. Median Mean 3rd Qu. Max. 
## 0.0103 0.0167 0.0213 0.0202 0.0235 0.0266 plotBCV(y) Figure 4.5: Biological coefficient of variation (BCV) for each gene as a function of the average abundance. The BCV is computed as the square root of the NB dispersion after empirical Bayes shrinkage towards the trend. Trended and common BCV estimates are shown in blue and red, respectively. We also estimate the quasi-likelihood dispersions with glmQLFit() (Chen, Lun, and Smyth 2016). This fits a GLM to the counts for each gene and estimates the QL dispersion from the GLM deviance. We set robust=TRUE to avoid distortions from highly variable clusters (Phipson et al. 2016). The QL dispersion models the uncertainty and variability of the per-gene variance (Figure 4.6) - which is not well handled by the NB dispersions, so the two dispersion types complement each other in the final analysis. fit &lt;- glmQLFit(y, design, robust=TRUE) summary(fit$var.prior) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.318 0.714 0.854 0.804 0.913 1.067 summary(fit$df.prior) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.227 12.675 12.675 12.339 12.675 12.675 plotQLDisp(fit) Figure 4.6: QL dispersion estimates for each gene as a function of abundance. Raw estimates (black) are shrunk towards the trend (blue) to yield squeezed estimates (red). We test for differences in expression due to injection using glmQLFTest(). DEGs are defined as those with non-zero log-fold changes at a false discovery rate of 5%. Very few genes are significantly DE, indicating that injection has little effect on the transcriptome of mesenchyme cells. (Note that this logic is somewhat circular, as a large transcriptional effect may have caused cells of this type to be re-assigned to a different label. We discuss this in more detail in Section 6.4.1 below.) 
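The 5% FDR threshold is applied to Benjamini-Hochberg adjusted p-values, which is the default adjustment in both decideTests() and topTags(). As a rough illustration, the base-R sketch below recomputes the adjusted values for the ten smallest p-values reported by topTags() for the Mesenchyme analysis in this section. It assumes that 5688 genes were tested (the number retained by filterByExpr()) and that these ten are indeed the smallest p-values overall; the variable names are illustrative.

```r
# Ten smallest p-values for the Mesenchyme comparison, copied (already
# rounded) from the topTags() output reported in this section.
pvals <- c(Phlda2=1.812e-16, Erdr1=1.061e-11, Mid1=1.844e-08,
    H13=2.373e-07, Kcnq1ot1=2.392e-07, Akr1e1=2.665e-07,
    Zdbf2=6.809e-07, Asb4=2.918e-06, Impact=4.145e-06, Lum=1.205e-05)

# Assumed total number of tests, i.e., genes surviving the abundance filter.
n.tests <- 5688

# Benjamini-Hochberg adjustment. Passing 'n' tells p.adjust() that these
# are the smallest p-values from a larger set of 'n.tests' hypotheses.
fdr <- p.adjust(pvals, method="BH", n=n.tests)
signif(fdr, 4)

# Number of these genes passing the 5% FDR threshold.
sum(fdr <= 0.05)
```

The adjusted values for H13, Kcnq1ot1 and Akr1e1 come out identical because the BH procedure enforces monotonicity across the ranked p-values, which is also why these three genes share the same FDR in the reported table.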
res &lt;- glmQLFTest(fit, coef=ncol(design)) summary(decideTests(res)) ## factor(tomato)TRUE ## Down 8 ## NotSig 5672 ## Up 8 topTags(res) ## Coefficient: factor(tomato)TRUE ## logFC logCPM F PValue FDR ## Phlda2 -4.3874 9.934 1638.59 1.812e-16 1.031e-12 ## Erdr1 2.0691 8.833 356.37 1.061e-11 3.017e-08 ## Mid1 1.5191 6.931 120.15 1.844e-08 3.497e-05 ## H13 -1.0596 7.540 80.80 2.373e-07 2.527e-04 ## Kcnq1ot1 1.3763 7.242 83.31 2.392e-07 2.527e-04 ## Akr1e1 -1.7206 5.128 79.31 2.665e-07 2.527e-04 ## Zdbf2 1.8008 6.797 83.66 6.809e-07 5.533e-04 ## Asb4 -0.9235 7.341 53.45 2.918e-06 2.075e-03 ## Impact 0.8516 7.353 50.31 4.145e-06 2.620e-03 ## Lum -0.6031 9.275 41.67 1.205e-05 6.851e-03 4.5 Putting it all together 4.5.1 Looping across labels Now that we have laid out the theory underlying the DE analysis, we repeat this process for each of the labels to identify injection-induced DE in each cell type. This is conveniently done using the pseudoBulkDGE() function from scran, which will loop over all labels and apply the exact analysis described above to each label. Users can also set method=\"voom\" to perform an equivalent analysis using the voom() pipeline from limma - see Workflow Section 8.9 for the full set of function calls. # Removing all pseudo-bulk samples with &#39;insufficient&#39; cells. summed.filt &lt;- summed[,summed$ncells &gt;= 10] library(scran) de.results &lt;- pseudoBulkDGE(summed.filt, label=summed.filt$celltype.mapped, design=~factor(pool) + tomato, coef=&quot;tomatoTRUE&quot;, condition=summed.filt$tomato ) The function returns a list of DataFrames containing the DE results for each label. Each DataFrame also contains the intermediate edgeR objects used in the DE analyses, which can be used to generate any of the previously described diagnostic plots (Figure 4.7). It is often wise to generate these plots to ensure that any interesting results are not compromised by technical issues. 
cur.results &lt;- de.results[[&quot;Allantois&quot;]] cur.results[order(cur.results$PValue),] ## DataFrame with 14699 rows and 5 columns ## logFC logCPM F PValue FDR ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## Phlda2 -2.489508 12.58150 1207.016 3.33486e-21 1.60507e-17 ## Xist -7.978532 8.00166 1092.831 1.27783e-17 3.07510e-14 ## Erdr1 1.947170 9.07321 296.937 1.58009e-14 2.53500e-11 ## Slc22a18 -4.347153 4.04380 117.389 1.92517e-10 2.31647e-07 ## Slc38a4 0.891849 10.24094 113.899 2.52208e-10 2.42776e-07 ## ... ... ... ... ... ... ## Ccl27a_ENSMUSG00000095247 NA NA NA NA NA ## CR974586.5 NA NA NA NA NA ## AC132444.6 NA NA NA NA NA ## Vmn2r122 NA NA NA NA NA ## CAAA01147332.1 NA NA NA NA NA y.allantois &lt;- metadata(cur.results)$y plotBCV(y.allantois) Figure 4.7: Biological coefficient of variation (BCV) for each gene as a function of the average abundance for the allantois pseudo-bulk analysis. Trended and common BCV estimates are shown in blue and red, respectively. We list the labels that were skipped due to the absence of replicates or contrasts. If it is necessary to extract statistics in the absence of replicates, several strategies can be applied such as reducing the complexity of the model or using a predefined value for the NB dispersion. We refer readers to the edgeR user’s guide for more details. metadata(de.results)$failed ## [1] &quot;Blood progenitors 1&quot; &quot;Caudal epiblast&quot; &quot;Caudal neurectoderm&quot; ## [4] &quot;ExE ectoderm&quot; &quot;Parietal endoderm&quot; &quot;Stripped&quot; 4.5.2 Cross-label meta-analyses We examine the numbers of DEGs at a FDR of 5% for each label using the decideTestsPerLabel() function. In general, there seems to be very little differential expression that is introduced by injection. 
Note that genes listed as NA were either filtered out as low-abundance genes for a given label’s analysis, or the comparison of interest was not possible for a particular label, e.g., due to lack of residual degrees of freedom or an absence of samples from both conditions. is.de &lt;- decideTestsPerLabel(de.results, threshold=0.05) summarizeTestsPerLabel(is.de) ## -1 0 1 NA ## Allantois 23 4766 24 9886 ## Blood progenitors 2 1 2472 2 12224 ## Cardiomyocytes 6 4361 5 10327 ## Caudal Mesoderm 2 1742 0 12955 ## Def. endoderm 7 1392 2 13298 ## Endothelium 3 3222 6 11468 ## Erythroid1 12 2777 15 11895 ## Erythroid2 5 3389 8 11297 ## Erythroid3 13 5048 16 9622 ## ExE mesoderm 2 5097 10 9590 ## Forebrain/Midbrain/Hindbrain 8 6226 11 8454 ## Gut 5 4482 6 10206 ## Haematoendothelial progenitors 4 4103 10 10582 ## Intermediate mesoderm 4 3072 4 11619 ## Mesenchyme 8 5672 8 9011 ## NMP 6 4107 10 10576 ## Neural crest 6 3311 8 11374 ## Paraxial mesoderm 4 4756 5 9934 ## Pharyngeal mesoderm 2 5082 9 9606 ## Rostral neurectoderm 5 3334 4 11356 ## Somitic mesoderm 7 2948 13 11731 ## Spinal cord 7 4591 7 10094 ## Surface ectoderm 9 5556 8 9126 For each gene, we compute the percentage of cell types in which that gene is upregulated or downregulated upon injection. Here, we consider a gene to be non-DE if it is not retained after filtering. We see that Xist is consistently downregulated in the injected cells; this is consistent with the fact that the injected cells are male while the background cells are derived from pools of male and female embryos, due to experimental difficulties with resolving sex at this stage. The consistent downregulation of Phlda2 and Cdkn1c in the injected cells is also interesting given that both are imprinted genes. However, some of these commonalities may be driven by shared contamination from ambient RNA - we discuss this further in Section 5. # Upregulated across most cell types. 
up.de &lt;- is.de &gt; 0 &amp; !is.na(is.de) head(sort(rowMeans(up.de), decreasing=TRUE), 10) ## Mid1 Erdr1 Impact Mcts2 Kcnq1ot1 Nnat Slc38a4 Zdbf2 ## 0.9130 0.7391 0.6087 0.5652 0.5652 0.5217 0.4348 0.3913 ## Hopx Peg3 ## 0.3913 0.2609 # Downregulated across cell types. down.de &lt;- is.de &lt; 0 &amp; !is.na(is.de) head(sort(rowMeans(down.de), decreasing=TRUE), 10) ## Xist Phlda2 Akr1e1 Cdkn1c H13 ## 0.73913 0.73913 0.73913 0.69565 0.52174 ## Wfdc2 B930036N10Rik B230312C02Rik Pink1 Mfap2 ## 0.21739 0.08696 0.08696 0.08696 0.08696 To identify label-specific DE, we use the pseudoBulkSpecific() function to test for significant differences from the average log-fold change over all other labels. More specifically, the null hypothesis for each label and gene is that the log-fold change lies between zero and the average log-fold change of the other labels. If a gene rejects this null for our label of interest, we can conclude that it exhibits DE that is more extreme or of the opposite sign compared to that in the majority of other labels (Figure 4.8). This approach is effectively a poor man’s interaction model that sacrifices the uncertainty of the average for an easier compute. We note that, while the difference from the average is a good heuristic, there is no guarantee that the top genes are truly label-specific; comparable DE in a subset of the other labels may be offset by weaker effects when computing the average. 
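The null interval described above can be illustrated with a small base R sketch. The log-fold changes, gene names and the excess column are hypothetical and chosen only for illustration; the sketch looks at point estimates alone and ignores the statistical uncertainty that pseudoBulkSpecific() accounts for, so it is not a reimplementation of that function.

```r
# Hypothetical log-fold changes (rows = genes, columns = labels).
lfc <- rbind(
    shared   = c(Allantois=-1.9, Gut=-2.1, Mesenchyme=-1.9, NMP=-2.0),
    specific = c(Allantois= 1.8, Gut= 0.1, Mesenchyme=-0.1, NMP= 0.0),
    masked   = c(Allantois= 1.8, Gut= 1.7, Mesenchyme=-0.2, NMP= 0.0)
)

label <- 'Allantois'
x <- lfc[, label]

# Average log-fold change over all other labels.
other.average <- rowMeans(lfc[, colnames(lfc) != label, drop=FALSE])

# The null interval spans zero and the average of the other labels.
lower <- pmin(0, other.average)
upper <- pmax(0, other.average)

# A gene is a candidate for label-specific DE if it falls outside the interval.
outside <- x < lower | x > upper

# Distance from the interval (zero if the log-fold change lies inside it).
excess <- pmax(lower - x, x - upper, 0)

data.frame(logFC=x, OtherAverage=other.average, outside, excess)
```

The masked gene shows the caveat about averaging: it falls outside the interval for the allantois even though one other label (Gut) has a comparable log-fold change, because the weaker effects in the remaining labels pull the average towards zero.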
de.specific &lt;- pseudoBulkSpecific(summed.filt, label=summed.filt$celltype.mapped, design=~factor(pool) + tomato, coef=&quot;tomatoTRUE&quot;, condition=summed.filt$tomato ) cur.specific &lt;- de.specific[[&quot;Allantois&quot;]] cur.specific &lt;- cur.specific[order(cur.specific$PValue),] cur.specific ## DataFrame with 14699 rows and 6 columns ## logFC logCPM F PValue FDR ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## Slc22a18 -4.347153 4.04380 117.3889 1.92517e-10 9.26587e-07 ## Acta2 -0.829713 9.12472 55.6350 4.67332e-07 1.12463e-03 ## Mxd4 -1.421473 5.64606 50.2112 2.03567e-06 3.26589e-03 ## Rbp4 1.874290 4.35449 29.8731 1.53998e-05 1.85298e-02 ## Myl9 -0.985541 6.24833 30.6689 5.54072e-05 4.62274e-02 ## ... ... ... ... ... ... ## Ccl27a_ENSMUSG00000095247 NA NA NA NA NA ## CR974586.5 NA NA NA NA NA ## AC132444.6 NA NA NA NA NA ## Vmn2r122 NA NA NA NA NA ## CAAA01147332.1 NA NA NA NA NA ## OtherAverage ## &lt;numeric&gt; ## Slc22a18 NA ## Acta2 -0.0267428 ## Mxd4 -0.1565876 ## Rbp4 -0.1052237 ## Myl9 -0.1068453 ## ... ... ## Ccl27a_ENSMUSG00000095247 NA ## CR974586.5 NA ## AC132444.6 NA ## Vmn2r122 NA ## CAAA01147332.1 NA sizeFactors(summed.filt) &lt;- NULL plotExpression(logNormCounts(summed.filt), features=&quot;Rbp4&quot;, x=&quot;tomato&quot;, colour_by=&quot;tomato&quot;, other_fields=&quot;celltype.mapped&quot;) + facet_wrap(~celltype.mapped) Figure 4.8: Distribution of summed log-expression values for Rbp4 in each label of the chimeric embryo dataset. Each facet represents a label with distributions stratified by injection status. For greater control over the identification of label-specific DE, we can use the output of decideTestsPerLabel() to identify genes that are significant in our label of interest yet not DE in any other label. 
As hypothesis tests are not typically geared towards identifying genes that are not DE, we use an ad hoc approach where we consider a gene to be consistent with the null hypothesis for a label if it fails to be detected at a generous FDR threshold of 50%. We demonstrate this approach below by identifying injection-induced DE genes that are unique to the allantois. It is straightforward to tune the selection, e.g., to genes that are DE in no more than 90% of other labels by simply relaxing the threshold used to construct not.de.other, or to genes that are DE across multiple labels of interest but not in the rest, and so on. # Finding all genes that are not remotely DE in all other labels. remotely.de &lt;- decideTestsPerLabel(de.results, threshold=0.5) not.de &lt;- remotely.de==0 | is.na(remotely.de) not.de.other &lt;- rowMeans(not.de[,colnames(not.de)!=&quot;Allantois&quot;])==1 # Intersecting with genes that are DE in the allantois. unique.degs &lt;- is.de[,&quot;Allantois&quot;]!=0 &amp; not.de.other unique.degs &lt;- names(which(unique.degs)) # Inspecting the results. de.allantois &lt;- de.results$Allantois de.allantois &lt;- de.allantois[unique.degs,] de.allantois &lt;- de.allantois[order(de.allantois$PValue),] de.allantois ## DataFrame with 5 rows and 5 columns ## logFC logCPM F PValue FDR ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## Slc22a18 -4.347153 4.04380 117.3889 1.92517e-10 2.31647e-07 ## Rbp4 1.874290 4.35449 29.8731 1.53998e-05 3.36906e-03 ## Cfc1 -0.950562 5.74762 23.1430 7.68376e-05 1.23215e-02 ## H3f3b 0.321634 12.04012 21.2710 1.25666e-04 1.63468e-02 ## Cryab -0.995629 5.28422 19.5921 1.99204e-04 2.45838e-02 The main caveat is that differences in power between labels require some caution when interpreting label specificity. 
For example, Figure 4.9 shows that the top-ranked allantois-specific gene exhibits some evidence of DE in other labels but was not detected for various reasons like low abundance or insufficient replicates. A more correct but complex approach would be to fit an interaction model to the pseudo-bulk profiles for each pair of labels, where the interaction is between the coefficient of interest and the label identity; this is left as an exercise for the reader. plotExpression(logNormCounts(summed.filt), features=&quot;Slc22a18&quot;, x=&quot;tomato&quot;, colour_by=&quot;tomato&quot;, other_fields=&quot;celltype.mapped&quot;) + facet_wrap(~celltype.mapped) Figure 4.9: Distribution of summed log-expression values for Slc22a18 in each label of the chimeric embryo dataset. Each facet represents a label with distributions stratified by injection status. 4.6 Testing for between-label differences The above examples focus on testing for differences in expression between conditions for the same cell type or label. However, the same methodology can be applied to test for differences between cell types across samples. This kind of DE analysis overcomes the lack of suitable replication discussed in Advanced Section 6.4.2. To demonstrate, say we want to test for DEGs between the neural crest and notochord samples. We subset our summed counts to those two cell types and we run the edgeR workflow via pseudoBulkDGE(). summed.sub &lt;- summed[,summed$celltype.mapped %in% c(&quot;Neural crest&quot;, &quot;Notochord&quot;)] # Using a dummy value for the label to allow us to include multiple cell types # in the fitted model; otherwise, each cell type will be processed separately. 
between.res &lt;- pseudoBulkDGE(summed.sub, label=rep(&quot;dummy&quot;, ncol(summed.sub)), design=~factor(sample) + celltype.mapped, coef=&quot;celltype.mappedNotochord&quot;)[[1]] table(Sig=between.res$FDR &lt;= 0.05, Sign=sign(between.res$logFC)) ## Sign ## Sig -1 1 ## FALSE 2235 1614 ## TRUE 683 228 between.res[order(between.res$PValue),] ## DataFrame with 14699 rows and 5 columns ## logFC logCPM F PValue FDR ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## T 10.86559 7.08020 386.650 2.06989e-84 9.85269e-81 ## Krt19 8.14164 6.16734 264.874 9.19802e-59 2.18913e-55 ## Krt8 4.39442 8.43458 228.663 4.56130e-51 7.23726e-48 ## Mest -4.86549 11.98755 215.417 3.03569e-48 3.61247e-45 ## Krt18 4.71055 7.65385 183.064 2.49978e-41 2.37979e-38 ## ... ... ... ... ... ... ## Ccl27a_ENSMUSG00000095247 NA NA NA NA NA ## CR974586.5 NA NA NA NA NA ## AC132444.6 NA NA NA NA NA ## Vmn2r122 NA NA NA NA NA ## CAAA01147332.1 NA NA NA NA NA We inspect some of the top hits in more detail (Figure 4.10). As one might expect, these two cell types are quite different. summed.sub &lt;- logNormCounts(summed.sub, size.factors=NULL) plotExpression(summed.sub, features=head(rownames(between.res)[order(between.res$PValue)]), x=&quot;celltype.mapped&quot;, colour_by=I(factor(summed.sub$sample))) Figure 4.10: Distribution of the log-expression values for the top DEGs between the neural crest and notochord. Each point represents a pseudo-bulk profile and is colored by the sample of origin. Whether or not this is a scientifically meaningful comparison depends on the nature of the labels. These particular labels were defined by clustering, which means that the presence of DEGs is a foregone conclusion (Advanced Section 6.4). Nonetheless, it may have some utility for applications where the labels are defined using independent information, e.g., from FACS. 
The same approach can also be used to test whether the log-fold changes between two labels are significantly different between conditions. This is equivalent to testing for a significant interaction between each cell’s label and the condition of its sample of origin. The \\(p\\)-values are likely to be more sensible here; any artificial differences induced by clustering should cancel out between conditions, leaving behind real (and interesting) differences. Some extra effort is usually required to obtain a full-rank design matrix - this is demonstrated below to test for a significant interaction between the notochord/neural crest separation and injection status (tomato). inter.res &lt;- pseudoBulkDGE(summed.sub, label=rep(&quot;dummy&quot;, ncol(summed.sub)), design=function(df) { combined &lt;- with(df, paste0(tomato, &quot;.&quot;, celltype.mapped)) combined &lt;- make.names(combined) design &lt;- model.matrix(~0 + factor(sample) + combined, df) design[,!grepl(&quot;Notochord&quot;, colnames(design))] }, coef=&quot;combinedTRUE.Neural.crest&quot; )[[1]] table(Sig=inter.res$FDR &lt;= 0.05, Sign=sign(inter.res$logFC)) ## Sign ## Sig -1 0 1 ## FALSE 1443 12 3305 inter.res[order(inter.res$PValue),] ## DataFrame with 14699 rows and 5 columns ## logFC logCPM F PValue FDR ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## Cyb561 -10.46015 5.21092 12.9380 0.000330620 0.870112 ## Efhc1 -9.29150 3.52921 12.5622 0.000395525 0.870112 ## Epha2 -8.87723 4.24164 11.8490 0.000690709 0.870112 ## Spaca9 -6.32026 4.13089 10.7170 0.001138839 0.870112 ## Foxj1 -10.99035 6.66323 10.4632 0.001310048 0.870112 ## ... ... ... ... ... ... 
## Ccl27a_ENSMUSG00000095247 NA NA NA NA NA ## CR974586.5 NA NA NA NA NA ## AC132444.6 NA NA NA NA NA ## Vmn2r122 NA NA NA NA NA ## CAAA01147332.1 NA NA NA NA NA Session Info View session info R version 4.1.0 beta (2021-05-03 r80259) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] scran_1.21.1 edgeR_3.35.0 [3] limma_3.49.0 bluster_1.3.0 [5] scater_1.21.0 ggplot2_3.3.3 [7] scuttle_1.3.0 BiocSingular_1.9.0 [9] SingleCellExperiment_1.15.1 SummarizedExperiment_1.23.0 [11] Biobase_2.53.0 GenomicRanges_1.45.0 [13] GenomeInfoDb_1.29.0 IRanges_2.27.0 [15] S4Vectors_0.31.0 BiocGenerics_0.39.0 [17] MatrixGenerics_1.5.0 matrixStats_0.58.0 [19] BiocStyle_2.21.0 rebook_1.3.0 loaded via a namespace (and not attached): [1] bitops_1.0-7 RColorBrewer_1.1-2 [3] filelock_1.0.2 tools_4.1.0 [5] bslib_0.2.5.1 utf8_1.2.1 [7] R6_2.5.0 irlba_2.3.3 [9] vipor_0.4.5 DBI_1.1.1 [11] colorspace_2.0-1 withr_2.4.2 [13] tidyselect_1.1.1 gridExtra_2.3 [15] compiler_4.1.0 graph_1.71.0 [17] BiocNeighbors_1.11.0 DelayedArray_0.19.0 [19] labeling_0.4.2 bookdown_0.22 [21] sass_0.4.0 scales_1.1.1 [23] stringr_1.4.0 digest_0.6.27 [25] rmarkdown_2.8 XVector_0.33.0 [27] pkgconfig_2.0.3 htmltools_0.5.1.1 [29] sparseMatrixStats_1.5.0 highr_0.9 [31] rlang_0.4.11 DelayedMatrixStats_1.15.0 [33] farver_2.1.0 jquerylib_0.1.4 [35] generics_0.1.0 jsonlite_1.7.2 [37] BiocParallel_1.27.0 dplyr_1.0.6 [39] RCurl_1.98-1.3 magrittr_2.0.1 [41] GenomeInfoDbData_1.2.6 Matrix_1.3-3 [43] Rcpp_1.0.6 
ggbeeswarm_0.6.0 [45] munsell_0.5.0 fansi_0.4.2 [47] viridis_0.6.1 lifecycle_1.0.0 [49] stringi_1.6.2 yaml_2.2.1 [51] zlibbioc_1.39.0 grid_4.1.0 [53] dqrng_0.3.0 crayon_1.4.1 [55] dir.expiry_1.1.0 lattice_0.20-44 [57] splines_4.1.0 cowplot_1.1.1 [59] beachmat_2.9.0 locfit_1.5-9.4 [61] CodeDepends_0.6.5 metapod_1.1.0 [63] knitr_1.33 pillar_1.6.1 [65] igraph_1.2.6 codetools_0.2-18 [67] ScaledMatrix_1.1.0 XML_3.99-0.6 [69] glue_1.4.2 evaluate_0.14 [71] BiocManager_1.30.15 vctrs_0.3.8 [73] gtable_0.3.0 purrr_0.3.4 [75] assertthat_0.2.1 xfun_0.23 [77] rsvd_1.0.5 viridisLite_0.4.0 [79] pheatmap_1.0.12 tibble_3.1.2 [81] beeswarm_0.3.1 cluster_2.1.2 [83] statmod_1.4.36 ellipsis_0.3.2 References "],["ambient-problems.html", "Chapter 5 Problems with ambient RNA 5.1 Background 5.2 Filtering out affected DEGs 5.3 Subtracting ambient counts Session Info", " Chapter 5 Problems with ambient RNA 5.1 Background Ambient contamination is a phenomenon that is generally most pronounced in massively multiplexed scRNA-seq protocols. Briefly, extracellular RNA (most commonly released upon cell lysis) is captured along with each cell in its reaction chamber, contributing counts to genes that are not otherwise expressed in that cell (see Advanced Section 7.2). Differences in the ambient profile across samples are not uncommon when dealing with strong experimental perturbations where strong expression of a gene in a condition-specific cell type can “bleed over” into all other cell types in the same sample. This is problematic for DE analyses between conditions, as DEGs detected for a particular cell type may be driven by differences in the ambient profiles rather than any intrinsic change in gene regulation. 
To illustrate, we consider the Tal1-knockout (KO) chimera data from Pijuan-Sala et al. (2019). This is very similar to the WT chimera dataset we previously examined, only differing in that the Tal1 gene was knocked out in the injected cells. Tal1 is a transcription factor that has known roles in erythroid differentiation; the aim of the experiment was to determine if blocking of the erythroid lineage diverted cells to other developmental fates. (To cut a long story short: yes, it did.) library(MouseGastrulationData) sce.tal1 &lt;- Tal1ChimeraData() library(scuttle) rownames(sce.tal1) &lt;- uniquifyFeatureNames( rowData(sce.tal1)$ENSEMBL, rowData(sce.tal1)$SYMBOL ) sce.tal1 ## class: SingleCellExperiment ## dim: 29453 56122 ## metadata(0): ## assays(1): counts ## rownames(29453): Xkr4 Gm1992 ... CAAA01147332.1 tomato-td ## rowData names(2): ENSEMBL SYMBOL ## colnames(56122): cell_1 cell_2 ... cell_56121 cell_56122 ## colData names(9): cell barcode ... pool sizeFactor ## reducedDimNames(1): pca.corrected ## mainExpName: NULL ## altExpNames(0): We will perform a DE analysis between WT and KO cells labelled as “neural crest”. We observe that the strongest DEGs are the hemoglobins, which are downregulated in the injected cells. This is rather surprising as these cells are distinct from the erythroid lineage and should not express hemoglobins at all. The most sober explanation is that the background samples contain more hemoglobin transcripts in the ambient solution due to leakage from erythrocytes (or their precursors) during sorting and dissociation. library(scran) summed.tal1 &lt;- aggregateAcrossCells(sce.tal1, ids=DataFrame(sample=sce.tal1$sample, label=sce.tal1$celltype.mapped) ) summed.tal1$block &lt;- summed.tal1$sample %% 2 == 0 # Add blocking factor. # Subset to our neural crest cells. 
summed.neural &lt;- summed.tal1[,summed.tal1$label==&quot;Neural crest&quot;] summed.neural ## class: SingleCellExperiment ## dim: 29453 4 ## metadata(0): ## assays(1): counts ## rownames(29453): Xkr4 Gm1992 ... CAAA01147332.1 tomato-td ## rowData names(2): ENSEMBL SYMBOL ## colnames: NULL ## colData names(13): cell barcode ... ncells block ## reducedDimNames(1): pca.corrected ## mainExpName: NULL ## altExpNames(0): # Standard edgeR analysis, as described in previous chapters. res.neural &lt;- pseudoBulkDGE(summed.neural, label=summed.neural$label, design=~factor(block) + tomato, coef=&quot;tomatoTRUE&quot;, condition=summed.neural$tomato) summarizeTestsPerLabel(decideTestsPerLabel(res.neural)) ## -1 0 1 NA ## Neural crest 351 9818 481 18803 # Summary of the direction of log-fold changes. tab.neural &lt;- res.neural[[1]] tab.neural &lt;- tab.neural[order(tab.neural$PValue),] head(tab.neural, 10) ## DataFrame with 10 rows and 5 columns ## logFC logCPM F PValue FDR ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## Xist -7.555686 8.21232 6657.298 0.00000e+00 0.00000e+00 ## Hbb-bh1 -8.091042 9.15972 10758.256 0.00000e+00 0.00000e+00 ## Hbb-y -8.415622 8.35705 7364.290 0.00000e+00 0.00000e+00 ## Hba-x -7.724803 8.53284 7896.457 0.00000e+00 0.00000e+00 ## Hba-a1 -8.596706 6.74429 2756.573 0.00000e+00 0.00000e+00 ## Hba-a2 -8.866232 5.81300 1517.726 1.72378e-310 3.05972e-307 ## Erdr1 1.889536 7.61593 1407.112 2.34678e-289 3.57046e-286 ## Cdkn1c -8.864528 4.96097 814.936 8.79979e-173 1.17147e-169 ## Uba52 -0.879668 8.38618 424.191 1.86585e-92 2.20792e-89 ## Grb10 -1.403427 6.58314 401.353 1.13898e-87 1.21302e-84 As an aside, it is worth mentioning that the “replicates” in this study are more technical than biological, so some exaggeration of the significance of the effects is to be expected. Nonetheless, it is a useful dataset to demonstrate some strategies for mitigating issues caused by ambient contamination. 
5.2 Filtering out affected DEGs 5.2.1 By estimating ambient contamination As shown above, the presence of ambient contamination makes it difficult to interpret multi-condition DE analyses. To mitigate its effects, we need to obtain an estimate of the ambient “expression” profile from the raw count matrix for each sample. We follow the approach used in emptyDrops() (Lun et al. 2019) and consider all barcodes with total counts below 100 to represent empty droplets. We then sum the counts for each gene across these barcodes to obtain an expression vector representing the ambient profile for each sample. library(DropletUtils) ambient &lt;- vector(&quot;list&quot;, ncol(summed.neural)) # Looping over all raw (unfiltered) count matrices and # computing the ambient profile based on its low-count barcodes. # Turning off rounding, as we know this is count data. for (s in seq_along(ambient)) { raw.tal1 &lt;- Tal1ChimeraData(type=&quot;raw&quot;, samples=s)[[1]] ambient[[s]] &lt;- ambientProfileEmpty(counts(raw.tal1), good.turing=FALSE, round=FALSE) } # Cleaning up the output for pretty printing. ambient &lt;- do.call(cbind, ambient) colnames(ambient) &lt;- seq_len(ncol(ambient)) rownames(ambient) &lt;- uniquifyFeatureNames( rowData(raw.tal1)$ENSEMBL, rowData(raw.tal1)$SYMBOL ) head(ambient) ## 1 2 3 4 ## Xkr4 1 0 0 0 ## Gm1992 0 0 0 0 ## Gm37381 1 0 1 0 ## Rp1 0 1 0 1 ## Sox17 76 76 31 53 ## Gm37323 0 0 0 0 For each sample, we determine the maximum proportion of the count for each gene that could be attributed to ambient contamination. This is done by scaling the ambient profile in ambient to obtain a per-gene expected count from ambient contamination, with which we compute the \\(p\\)-value for observing a count equal to or lower than that in summed.neural. We perform this for a range of scaling factors and identify the largest factor that yields a \\(p\\)-value above a given threshold. 
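The search over scaling factors can be sketched in base R as follows. This is a simplified illustration under stated assumptions: the counts are hypothetical, a Poisson model is used for the per-gene probabilities, and the per-gene p-values are combined with a Simes-style correction. The simes() helper and all variable names are inventions of this sketch, and ambientContribMaximum() may differ in its details (for example, in how overdispersion is handled).

```r
# Hypothetical ambient profile and pseudo-bulk counts for a single sample.
ambient  <- c(GeneA=100, GeneB=50, GeneC=10, Hb=840)
observed <- c(GeneA=30, GeneB=400, GeneC=5, Hb=200)

# Simes combination of per-gene p-values (an assumption of this sketch).
simes <- function(p) min(sort(p) * length(p) / seq_along(p))

# For each candidate scaling factor, compute the per-gene probability of
# observing a count at or below the observed value if every count in the
# sample were derived from the scaled ambient profile.
scales <- seq(0.01, 2, by=0.01)
combined <- vapply(scales, function(s) {
    simes(ppois(observed, lambda=s * ambient))
}, numeric(1))

# Largest scaling factor that is not rejected at the chosen threshold.
max.scale <- max(scales[combined > 0.1])

# Upper bound on the ambient-derived proportion of each gene, capped at 1.
max.prop <- pmin(1, max.scale * ambient / observed)
max.prop
```

In this toy example the limiting gene is Hb, whose observed count is low relative to its share of the ambient profile, so the scaling factor cannot grow much further without predicting more Hb counts than were actually observed.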
The scaled ambient profile represents the upper bound of the contribution to each sample from ambient contamination. We deliberately use an upper bound so that our next step will aggressively remove any gene that is potentially problematic. max.ambient &lt;- ambientContribMaximum(counts(summed.neural), ambient, mode=&quot;proportion&quot;) head(max.ambient) ## [,1] [,2] [,3] [,4] ## Xkr4 NaN NaN NaN NaN ## Gm1992 NaN NaN NaN NaN ## Gm37381 NaN NaN NaN NaN ## Rp1 NaN NaN NaN NaN ## Sox17 0.1775 0.1833 0.468 1 ## Gm37323 NaN NaN NaN NaN Genes in which over 10% of the counts are ambient-derived are subsequently discarded from our analysis. For balanced designs, this threshold prevents ambient contribution from biasing the true fold-change by more than 10%, which is a tolerable margin of error for most applications. (Unbalanced designs may warrant the use of a weighted average to account for sample size differences between groups.) This approach yields a slightly smaller list of DEGs without the hemoglobins, which is encouraging as it suggests that any other, less obvious effects of ambient contamination have also been removed. # Averaging the ambient contribution across samples. 
contamination &lt;- rowMeans(max.ambient, na.rm=TRUE) non.ambient &lt;- contamination &lt;= 0.1 summary(non.ambient) ## Mode FALSE TRUE NA&#39;s ## logical 1475 15306 12672 okay.genes &lt;- names(non.ambient)[which(non.ambient)] tab.neural2 &lt;- tab.neural[rownames(tab.neural) %in% okay.genes,] table(Direction=tab.neural2$logFC &gt; 0, Significant=tab.neural2$FDR &lt;= 0.05) ## Significant ## Direction FALSE TRUE ## FALSE 4820 317 ## TRUE 4781 452 head(tab.neural2, 10) ## DataFrame with 10 rows and 5 columns ## logFC logCPM F PValue FDR ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## Xist -7.555686 8.21232 6657.298 0.00000e+00 0.00000e+00 ## Erdr1 1.889536 7.61593 1407.112 2.34678e-289 3.57046e-286 ## Uba52 -0.879668 8.38618 424.191 1.86585e-92 2.20792e-89 ## Grb10 -1.403427 6.58314 401.353 1.13898e-87 1.21302e-84 ## Gt(ROSA)26Sor 1.481294 5.71617 351.940 2.80072e-77 2.71160e-74 ## Fdps 0.981388 7.21805 337.159 3.67655e-74 3.26294e-71 ## Mest 0.549349 10.98269 319.697 1.79833e-70 1.47324e-67 ## Impact 1.396666 5.71801 314.700 2.05057e-69 1.55990e-66 ## H13 -1.481658 5.90902 301.675 1.17372e-66 8.33343e-64 ## Msmo1 1.493771 5.43923 301.066 1.57983e-66 1.05158e-63 A softer approach is to simply report the average contaminating percentage for each gene in the table of DE statistics. Readers can then make up their own minds as to whether a particular DEG’s effect is driven by ambient contamination. Indeed, it is worth remembering that ambientContribMaximum() will report the maximum possible contamination rather than attempting to estimate the actual level of contamination, and filtering on the former may be too conservative. This is especially true for cell populations that are contributing to the differences in the ambient pool; in the most extreme case, the reported maximum contamination would be 100% for cell types with an expression profile that is identical to the ambient pool. 
tab.neural3 &lt;- tab.neural tab.neural3$contamination &lt;- contamination[rownames(tab.neural3)] head(tab.neural3) ## DataFrame with 6 rows and 6 columns ## logFC logCPM F PValue FDR contamination ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## Xist -7.55569 8.21232 6657.30 0.00000e+00 0.00000e+00 0.0605735 ## Hbb-bh1 -8.09104 9.15972 10758.26 0.00000e+00 0.00000e+00 0.9900717 ## Hbb-y -8.41562 8.35705 7364.29 0.00000e+00 0.00000e+00 0.9674483 ## Hba-x -7.72480 8.53284 7896.46 0.00000e+00 0.00000e+00 0.9945348 ## Hba-a1 -8.59671 6.74429 2756.57 0.00000e+00 0.00000e+00 0.8626846 ## Hba-a2 -8.86623 5.81300 1517.73 1.72378e-310 3.05972e-307 0.7351403 5.2.2 With prior knowledge Another strategy for estimating the ambient proportions involves the use of prior knowledge of mutually exclusive gene expression profiles (Young and Behjati 2018). In this case, we assume (reasonably) that hemoglobins should not be expressed in neural crest cells and use this to estimate the contamination in each sample. This is achieved with the ambientContribNegative() function, which scales the ambient profile so that the hemoglobin coverage is the same as the corresponding sample of summed.neural. From these profiles, we compute proportions of ambient contamination that are used to mark or filter out affected genes in the same manner as described above. 
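The scaling step is easiest to follow on a toy example. The sketch below uses hypothetical counts and gene names and reproduces only the basic arithmetic of matching the control coverage; it should not be read as the implementation used by ambientContribNegative().

```r
# Hypothetical ambient profile and pseudo-bulk counts for one sample.
ambient  <- c(GeneA=100, GeneB=50, HbA=800, HbB=200)
observed <- c(GeneA=500, GeneB=20, HbA=80, HbB=20)

# Control features assumed to be absent from the cell type of interest,
# so that all of their counts must come from the ambient solution.
controls <- c('HbA', 'HbB')

# Scale the ambient profile so that its control coverage matches the sample.
scaling <- sum(observed[controls]) / sum(ambient[controls])
scaled.ambient <- ambient * scaling

# Proportion of each gene's count attributable to ambient RNA, capped at 1.
prop <- pmin(1, scaled.ambient / observed)
prop

# Genes exceeding the 10% threshold would be marked or filtered out.
names(prop)[prop > 0.1]
```

Here the scaling factor is 0.1, so GeneA is retained (2% ambient) while GeneB (25%) and the control genes themselves (100%) are flagged.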
is.hbb &lt;- grep(&quot;^Hb[ab]-&quot;, rownames(summed.neural)) ctrl.ambient &lt;- ambientContribNegative(counts(summed.neural), ambient, features=is.hbb, mode=&quot;proportion&quot;) head(ctrl.ambient) ## [,1] [,2] [,3] [,4] ## Xkr4 NaN NaN NaN NaN ## Gm1992 NaN NaN NaN NaN ## Gm37381 NaN NaN NaN NaN ## Rp1 NaN NaN NaN NaN ## Sox17 0.06774 0.08798 0.4796 1 ## Gm37323 NaN NaN NaN NaN ctrl.non.ambient &lt;- rowMeans(ctrl.ambient, na.rm=TRUE) &lt;= 0.1 summary(ctrl.non.ambient) ## Mode FALSE TRUE NA&#39;s ## logical 1388 15393 12672 okay.genes &lt;- names(ctrl.non.ambient)[which(ctrl.non.ambient)] tab.neural4 &lt;- tab.neural[rownames(tab.neural) %in% okay.genes,] head(tab.neural4) ## DataFrame with 6 rows and 5 columns ## logFC logCPM F PValue FDR ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## Xist -7.555686 8.21232 6657.298 0.00000e+00 0.00000e+00 ## Erdr1 1.889536 7.61593 1407.112 2.34678e-289 3.57046e-286 ## Uba52 -0.879668 8.38618 424.191 1.86585e-92 2.20792e-89 ## Grb10 -1.403427 6.58314 401.353 1.13898e-87 1.21302e-84 ## Gt(ROSA)26Sor 1.481294 5.71617 351.940 2.80072e-77 2.71160e-74 ## Fdps 0.981388 7.21805 337.159 3.67655e-74 3.26294e-71 Any highly expressed cell type-specific gene is a candidate for this procedure, most typically in cell types that are highly specialized towards manufacturing a protein product. Aside from hemoglobin, we could use immunoglobulins in populations containing B cells, or insulin and glucagon in pancreas datasets (Advanced Figure 6.3). The experimental setting may also provide some genes that must only be present in the ambient solution; for example, the mitochondrial transcripts can be used to estimate ambient contamination in single-nucleus RNA-seq, while Xist can be used for datasets involving mixtures of male and female cells (where the contaminating percentages are estimated from the profiles of male cells only). 
If appropriate control features are available, this approach allows us to obtain a more accurate estimate of the contamination in each pseudo-bulk sample compared to the upper bound provided by ambientContribMaximum(). This avoids the removal of genuine DEGs due to overestimation of the ambient contamination from the latter. However, the performance of this approach is fully dependent on the suitability of the control features - if a “control” feature is actually genuinely expressed in a cell type, the ambient contribution will be overestimated. A simple mitigating strategy is to take the lower of the proportions from ambientContribNegative() and ambientContribMaximum(), with the idea being that the latter will avoid egregious overestimation when the control set is misspecified. 5.2.3 Without an ambient profile An estimate of the ambient profile is rarely available for public datasets where only the per-cell count matrices are provided. In such cases, we must instead use the rest of the dataset to infer something about the effects of ambient contamination. The most obvious approach is to construct a proxy ambient profile by summing the counts for all cells from each sample, which can be used in place of the actual profile in the previous calculations. proxy.ambient &lt;- aggregateAcrossCells(summed.tal1, ids=summed.tal1$sample) # Using &#39;proxy.ambient&#39; instead of the estimated &#39;ambient&#39;. 
max.ambient.proxy &lt;- ambientContribMaximum(counts(summed.neural), counts(proxy.ambient), mode=&quot;proportion&quot;) head(max.ambient.proxy) ## [,1] [,2] [,3] [,4] ## Xkr4 NaN NaN NaN NaN ## Gm1992 NaN NaN NaN NaN ## Gm37381 NaN NaN NaN NaN ## Rp1 NaN NaN NaN NaN ## Sox17 0.7427 0.9891 0.5283 0.9067 ## Gm37323 NaN NaN NaN NaN con.ambient.proxy &lt;- ambientContribNegative(counts(summed.neural), counts(proxy.ambient), features=is.hbb, mode=&quot;proportion&quot;) head(con.ambient.proxy) ## [,1] [,2] [,3] [,4] ## Xkr4 NaN NaN NaN NaN ## Gm1992 NaN NaN NaN NaN ## Gm37381 NaN NaN NaN NaN ## Rp1 NaN NaN NaN NaN ## Sox17 1 1 0.6032 1 ## Gm37323 NaN NaN NaN NaN This assumes equal contributions from all labels to the ambient pool, which is not entirely unrealistic (Figure 5.1) though some discrepancies can be expected due to the presence of particularly fragile cell types or extracellular RNA. par(mfrow=c(2,2)) for (i in seq_len(ncol(proxy.ambient))) { true &lt;- ambient[,i] proxy &lt;- assay(proxy.ambient)[,i] logged &lt;- edgeR::cpm(cbind(proxy, true), log=TRUE, prior.count=2) logFC &lt;- logged[,1] - logged[,2] abundance &lt;- rowMeans(logged) plot(abundance, logFC, main=paste(&quot;Sample&quot;, i)) } Figure 5.1: MA plots of the log-fold change of the proxy ambient profile over the real profile for each sample in the Tal1 chimera dataset. Alternatively, we may choose to mitigate the effect of ambient contamination by focusing on label-specific DEGs. Contamination-driven DEGs should be systematically present in comparisons for all labels, and thus can be eliminated by simply ignoring all genes that are significant in a majority of these comparisons (Section 4.5.2). The obvious drawback of this approach is that it discounts genuine DEGs that have a consistent effect in most/all labels, though one could perhaps argue that such “global” DEGs are not particularly interesting anyway. 
It is also complicated by fluctuations in detection power across comparisons involving different numbers of cells - or replicates, after filtering pseudo-bulk profiles by the number of cells. res.tal1 &lt;- pseudoBulkSpecific(summed.tal1, label=summed.tal1$label, design=~factor(block) + tomato, coef=&quot;tomatoTRUE&quot;, condition=summed.tal1$tomato) # Inspecting our neural crest results again. tab.neural.again &lt;- res.tal1[[&quot;Neural crest&quot;]] head(tab.neural.again[order(tab.neural.again$PValue),], 10) ## DataFrame with 10 rows and 6 columns ## logFC logCPM F PValue FDR ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## Fdps 0.981388 7.21805 337.1586 3.67655e-74 3.91553e-70 ## Msmo1 1.493771 5.43923 301.0658 3.12891e-64 1.66614e-60 ## Hmgcs1 1.250024 5.70837 252.1105 3.95670e-56 1.40463e-52 ## Idi1 1.173709 5.37688 180.8890 6.68049e-41 1.77868e-37 ## Gt(ROSA)26Sor 1.481294 5.71617 351.9402 2.50809e-32 5.34222e-29 ## Sox9 0.537460 7.17373 99.3336 2.69822e-23 4.78934e-20 ## Nkd1 0.719043 5.92690 93.9636 2.74595e-20 4.17777e-17 ## Fdft1 0.841061 5.32293 89.8826 1.03053e-19 1.37189e-16 ## Insig1 1.257339 4.06887 82.2931 1.37966e-19 1.63260e-16 ## Acat2 0.508862 6.80012 73.8163 9.77138e-18 1.00391e-14 ## OtherAverage ## &lt;numeric&gt; ## Fdps -0.0913554 ## Msmo1 0.0269652 ## Hmgcs1 -0.0628281 ## Idi1 -0.0820383 ## Gt(ROSA)26Sor 0.5412288 ## Sox9 -0.0419813 ## Nkd1 0.0332706 ## Fdft1 0.0336446 ## Insig1 -0.2899829 ## Acat2 -0.0522834 # By comparison, the hemoglobins are all the way at the bottom. 
head(tab.neural.again[is.hbb,], 10) ## DataFrame with 8 rows and 6 columns ## logFC logCPM F PValue FDR OtherAverage ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## Hbb-bt -7.76718 1.33059 58.7089 1.000000 1.000000 -7.86812 ## Hbb-bs -5.84817 3.42835 238.2627 1.000000 1.000000 -8.15469 ## Hbb-bh2 NA NA NA NA NA -8.98446 ## Hbb-bh1 -8.09104 9.15972 10758.2563 0.937101 1.000000 -8.08037 ## Hbb-y -8.41562 8.35705 7364.2897 0.502375 1.000000 -8.21261 ## Hba-x -7.72480 8.53284 7896.4567 0.138483 0.940591 -7.38504 ## Hba-a1 -8.59671 6.74429 2756.5730 0.326266 1.000000 -8.05722 ## Hba-a2 -8.86623 5.81300 1517.7259 0.254178 1.000000 -7.96508 The common theme here is that, in the absence of an ambient profile, we are using all labels as a proxy for the ambient effect. This can have unpredictable consequences as the results for each label are now dependent on the behavior of the entire dataset. For example, the metrics are susceptible to the idiosyncrasies of clustering where one cell type may be represented in multple related clusters that distort the percentages in up.de and down.de or the average log-fold change. The metrics may also be invalidated in analyses of a subset of the data - for example, a subclustering analysis focusing on a particular cell type may mark all relevant DEGs as problematic because they are consistently DE in all subtypes. 5.3 Subtracting ambient counts It is worth commenting on the seductive idea of subtracting the ambient counts from the pseudo-bulk samples. This may seem like the most obvious approach for removing ambient contamination, but unfortunately, subtracted counts have unpredictable statistical properties due the distortion of the mean-variance relationship. Minor relative fluctuations at very large counts become large fold-changes after subtraction, manifesting as spurious DE in genes where a substantial proportion of counts is derived from the ambient solution. 
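A toy calculation makes the preceding point concrete. The counts below are hypothetical and the same ambient estimate is assumed for both samples; the sketch only shows the arithmetic of how a small relative difference is inflated by subtraction.

```r
# Hypothetical gene whose counts are almost entirely ambient-derived.
ambient.estimate <- 1000         # scaled ambient count assumed for both samples
observed <- c(A=1010, B=1040)    # the two samples differ by about 3%

# The log2-fold change before subtraction is negligible.
lfc.before <- log2(observed[['B']] / observed[['A']])

# After subtraction, the same absolute difference becomes a 4-fold change.
subtracted <- observed - ambient.estimate
lfc.after <- log2(subtracted[['B']] / subtracted[['A']])

c(before=lfc.before, after=lfc.after)
```

A difference of 30 counts is well within the expected fluctuation at this count size, yet after subtraction it presents as a log2-fold change of 2.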
For example, several hemoglobin genes retain strong DE even after subtraction of the scaled ambient profile. scaled.ambient &lt;- controlAmbience(counts(summed.neural), ambient, features=is.hbb, mode=&quot;profile&quot;) subtracted &lt;- counts(summed.neural) - scaled.ambient subtracted &lt;- round(subtracted) subtracted[subtracted &lt; 0] &lt;- 0 subtracted[is.hbb,] ## [,1] [,2] [,3] [,4] ## Hbb-bt 0 0 7 18 ## Hbb-bs 1 2 31 42 ## Hbb-bh2 0 0 0 0 ## Hbb-bh1 2 0 0 0 ## Hbb-y 0 0 39 107 ## Hba-x 1 1 0 0 ## Hba-a1 0 0 365 452 ## Hba-a2 0 0 314 329 Another tempting approach is to use interaction models to implicitly subtract the ambient effect during GLM fitting. The assumption is that, for a genuine DEG, the log-fold change within cells is larger in magnitude than that in the ambient solution. This is based on the expectation that any DE in the latter is “diluted” by contributions from cell types where that gene is not DE. Unfortunately, this is not always the case; a DE analysis of the ambient counts indicates that the hemoglobin log-fold change is actually stronger in the neural crest cells compared to the ambient solution, which leads to the rather awkward conclusion that the WT neural crest cells are expressing hemoglobin beyond that explained by ambient contamination. (This is probably an artifact of how cell calling is performed.) 
library(edgeR) y.ambient &lt;- DGEList(ambient, samples=colData(summed.neural)) y.ambient &lt;- y.ambient[filterByExpr(y.ambient, group=y.ambient$samples$tomato),] y.ambient &lt;- calcNormFactors(y.ambient) design &lt;- model.matrix(~factor(block) + tomato, y.ambient$samples) y.ambient &lt;- estimateDisp(y.ambient, design) fit.ambient &lt;- glmQLFit(y.ambient, design, robust=TRUE) res.ambient &lt;- glmQLFTest(fit.ambient, coef=ncol(design)) summary(decideTests(res.ambient)) ## tomatoTRUE ## Down 1910 ## NotSig 7683 ## Up 1645 topTags(res.ambient, n=10) ## Coefficient: tomatoTRUE ## logFC logCPM F PValue FDR ## Hbb-y -5.267 12.803 15115 3.523e-81 3.959e-77 ## Hbb-bh1 -5.075 13.725 14002 8.892e-80 4.996e-76 ## Hba-x -4.827 13.122 13317 3.135e-79 1.175e-75 ## Hba-a1 -4.662 10.734 11095 1.146e-76 3.220e-73 ## Hba-a2 -4.521 9.480 8411 1.246e-72 2.800e-69 ## Blvrb -4.319 7.649 4129 3.066e-62 5.742e-59 ## Xist -4.376 7.484 3891 1.864e-61 2.993e-58 ## Gypa -5.138 7.213 3808 3.833e-61 5.384e-58 ## Hbb-bs -4.941 7.209 3604 3.728e-60 4.655e-57 ## Car2 -3.499 8.534 4448 5.589e-60 6.281e-57 In addition, there are other issues with implicit subtraction in the fitted GLM that warrant caution with its use. This strategy precludes detection of DEGs that are common to all cell types as there is no longer a dilution effect being applied to the log-fold change in the ambient solution. It requires inclusion of the ambient profiles in the model, which is cause for at least some concern as they are unlikely to have the same degree of variability as the cell-derived pseudo-bulk profiles. Interpretation is also complicated by the fact that we are only interested in log-fold changes that are more extreme in the cells compared to the ambient solution; a non-zero interaction term is not sufficient for removing spurious DE. See Advanced Section 7.3 for further comments on the removal of ambient contamination, mostly for visualization purposes.
Session Info View session info R version 4.1.0 beta (2021-05-03 r80259) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] edgeR_3.35.0 limma_3.49.0 [3] DropletUtils_1.13.0 scran_1.21.1 [5] scuttle_1.3.0 MouseGastrulationData_1.7.0 [7] SpatialExperiment_1.3.0 SingleCellExperiment_1.15.1 [9] SummarizedExperiment_1.23.0 Biobase_2.53.0 [11] GenomicRanges_1.45.0 GenomeInfoDb_1.29.0 [13] IRanges_2.27.0 S4Vectors_0.31.0 [15] BiocGenerics_0.39.0 MatrixGenerics_1.5.0 [17] matrixStats_0.58.0 BiocStyle_2.21.0 [19] rebook_1.3.0 loaded via a namespace (and not attached): [1] rjson_0.2.20 ellipsis_0.3.2 [3] bluster_1.3.0 XVector_0.33.0 [5] BiocNeighbors_1.11.0 bit64_4.0.5 [7] interactiveDisplayBase_1.31.0 AnnotationDbi_1.55.0 [9] fansi_0.4.2 splines_4.1.0 [11] codetools_0.2-18 R.methodsS3_1.8.1 [13] sparseMatrixStats_1.5.0 cachem_1.0.5 [15] knitr_1.33 jsonlite_1.7.2 [17] cluster_2.1.2 dbplyr_2.1.1 [19] png_0.1-7 R.oo_1.24.0 [21] graph_1.71.0 shiny_1.6.0 [23] HDF5Array_1.21.0 BiocManager_1.30.15 [25] compiler_4.1.0 httr_1.4.2 [27] dqrng_0.3.0 assertthat_0.2.1 [29] Matrix_1.3-3 fastmap_1.1.0 [31] later_1.2.0 BiocSingular_1.9.0 [33] htmltools_0.5.1.1 tools_4.1.0 [35] igraph_1.2.6 rsvd_1.0.5 [37] glue_1.4.2 GenomeInfoDbData_1.2.6 [39] dplyr_1.0.6 rappdirs_0.3.3 [41] Rcpp_1.0.6 jquerylib_0.1.4 [43] vctrs_0.3.8 Biostrings_2.61.0 [45] rhdf5filters_1.5.0 ExperimentHub_2.1.0 [47] DelayedMatrixStats_1.15.0 BumpyMatrix_1.1.0 [49] xfun_0.23 stringr_1.4.0 [51] 
beachmat_2.9.0 irlba_2.3.3 [53] mime_0.10 lifecycle_1.0.0 [55] statmod_1.4.36 XML_3.99-0.6 [57] AnnotationHub_3.1.0 zlibbioc_1.39.0 [59] promises_1.2.0.1 rhdf5_2.37.0 [61] yaml_2.2.1 curl_4.3.1 [63] memoise_2.0.0 sass_0.4.0 [65] stringi_1.6.2 RSQLite_2.2.7 [67] highr_0.9 BiocVersion_3.14.0 [69] ScaledMatrix_1.1.0 filelock_1.0.2 [71] BiocParallel_1.27.0 rlang_0.4.11 [73] pkgconfig_2.0.3 bitops_1.0-7 [75] evaluate_0.14 lattice_0.20-44 [77] purrr_0.3.4 Rhdf5lib_1.15.0 [79] CodeDepends_0.6.5 bit_4.0.4 [81] tidyselect_1.1.1 magrittr_2.0.1 [83] bookdown_0.22 R6_2.5.0 [85] magick_2.7.2 generics_0.1.0 [87] metapod_1.1.0 DelayedArray_0.19.0 [89] DBI_1.1.1 pillar_1.6.1 [91] withr_2.4.2 KEGGREST_1.33.0 [93] RCurl_1.98-1.3 tibble_3.1.2 [95] dir.expiry_1.1.0 crayon_1.4.1 [97] utf8_1.2.1 BiocFileCache_2.1.0 [99] rmarkdown_2.8 locfit_1.5-9.4 [101] grid_4.1.0 blob_1.2.1 [103] digest_0.6.27 xtable_1.8-4 [105] httpuv_1.6.1 R.utils_2.10.1 [107] bslib_0.2.5.1 References "],["differential-abundance.html", "Chapter 6 Changes in cluster abundance 6.1 Overview 6.2 Performing the DA analysis 6.3 Handling composition effects 6.4 Comments on interpretation Session Info", " Chapter 6 Changes in cluster abundance 6.1 Overview In a DA analysis, we test for significant changes in per-label cell abundance across conditions. This will reveal which cell types are depleted or enriched upon treatment, which is arguably just as interesting as changes in expression within each cell type. The DA analysis has a long history in flow cytometry (Finak et al. 2014; Lun, Richard, and Marioni 2017) where it is routinely used to examine the effects of different conditions on the composition of complex cell populations.
By performing it here, we effectively treat scRNA-seq as a “super-FACS” technology for defining relevant subpopulations using the entire transcriptome. We prepare for the DA analysis by quantifying the number of cells assigned to each label (or cluster) in our WT chimeric experiment (Pijuan-Sala et al. 2019). In this case, we are aiming to identify labels that change in abundance among the compartment of injected cells compared to the background. View set-up code (Chapter 10) #--- loading ---# library(MouseGastrulationData) sce.chimera &lt;- WTChimeraData(samples=5:10) sce.chimera #--- feature-annotation ---# library(scater) rownames(sce.chimera) &lt;- uniquifyFeatureNames( rowData(sce.chimera)$ENSEMBL, rowData(sce.chimera)$SYMBOL) #--- quality-control ---# drop &lt;- sce.chimera$celltype.mapped %in% c(&quot;stripped&quot;, &quot;Doublet&quot;) sce.chimera &lt;- sce.chimera[,!drop] #--- normalization ---# sce.chimera &lt;- logNormCounts(sce.chimera) #--- variance-modelling ---# library(scran) dec.chimera &lt;- modelGeneVar(sce.chimera, block=sce.chimera$sample) chosen.hvgs &lt;- dec.chimera$bio &gt; 0 #--- merging ---# library(batchelor) set.seed(01001001) merged &lt;- correctExperiments(sce.chimera, batch=sce.chimera$sample, subset.row=chosen.hvgs, PARAM=FastMnnParam( merge.order=list( list(1,3,5), # WT (3 replicates) list(2,4,6) # td-Tomato (3 replicates) ) ) ) #--- clustering ---# g &lt;- buildSNNGraph(merged, use.dimred=&quot;corrected&quot;) clusters &lt;- igraph::cluster_louvain(g) colLabels(merged) &lt;- factor(clusters$membership) #--- dimensionality-reduction ---# merged &lt;- runTSNE(merged, dimred=&quot;corrected&quot;, external_neighbors=TRUE) merged &lt;- runUMAP(merged, dimred=&quot;corrected&quot;, external_neighbors=TRUE) abundances &lt;- table(merged$celltype.mapped, merged$sample) abundances &lt;- unclass(abundances) head(abundances) ## ## 5 6 7 8 9 10 ## Allantois 97 15 139 127 318 259 ## Blood progenitors 1 6 3 16 6 8 17 ## Blood progenitors 2 
31 8 28 21 43 114 ## Cardiomyocytes 85 21 79 31 174 211 ## Caudal Mesoderm 10 10 9 3 10 29 ## Caudal epiblast 2 2 0 0 22 45 6.2 Performing the DA analysis Our DA analysis will again be performed with the edgeR package. This allows us to take advantage of the NB GLM methods to model overdispersed count data in the presence of limited replication - except that the counts are not of reads per gene, but of cells per label (Lun, Richard, and Marioni 2017). The aim is to share information across labels to improve our estimates of the biological variability in cell abundance between replicates. library(edgeR) # Attaching some column metadata. extra.info &lt;- colData(merged)[match(colnames(abundances), merged$sample),] y.ab &lt;- DGEList(abundances, samples=extra.info) y.ab ## An object of class &quot;DGEList&quot; ## $counts ## ## 5 6 7 8 9 10 ## Allantois 97 15 139 127 318 259 ## Blood progenitors 1 6 3 16 6 8 17 ## Blood progenitors 2 31 8 28 21 43 114 ## Cardiomyocytes 85 21 79 31 174 211 ## Caudal Mesoderm 10 10 9 3 10 29 ## 29 more rows ... 
## ## $samples ## group lib.size norm.factors batch cell barcode sample stage ## 5 1 2298 1 5 cell_9769 AAACCTGAGACTGTAA 5 E8.5 ## 6 1 1026 1 6 cell_12180 AAACCTGCAGATGGCA 6 E8.5 ## 7 1 2740 1 7 cell_13227 AAACCTGAGACAAGCC 7 E8.5 ## 8 1 2904 1 8 cell_16234 AAACCTGCAAACCCAT 8 E8.5 ## 9 1 4057 1 9 cell_19332 AAACCTGCAACGATCT 9 E8.5 ## 10 1 6401 1 10 cell_23875 AAACCTGAGGCATGTG 10 E8.5 ## tomato pool stage.mapped celltype.mapped closest.cell ## 5 TRUE 3 E8.25 Mesenchyme cell_24159 ## 6 FALSE 3 E8.25 Somitic mesoderm cell_63247 ## 7 TRUE 4 E8.5 Somitic mesoderm cell_25454 ## 8 FALSE 4 E8.25 ExE mesoderm cell_139075 ## 9 TRUE 5 E8.0 ExE mesoderm cell_116116 ## 10 FALSE 5 E8.5 Forebrain/Midbrain/Hindbrain cell_39343 ## doub.density sizeFactor label ## 5 0.029850 1.6349 19 ## 6 0.291916 2.5981 6 ## 7 0.601740 1.5939 17 ## 8 0.004733 0.8707 9 ## 9 0.079415 0.8933 15 ## 10 0.040747 0.3947 1 We filter out low-abundance labels as previously described. This avoids cluttering the result table with very rare subpopulations that contain only a handful of cells. For a DA analysis of cluster abundances, filtering is generally not required as most clusters will not be of low-abundance (otherwise there would not have been enough evidence to define the cluster in the first place). keep &lt;- filterByExpr(y.ab, group=y.ab$samples$tomato) y.ab &lt;- y.ab[keep,] summary(keep) ## Mode FALSE TRUE ## logical 10 24 Unlike DE analyses, we do not perform an additional normalization step with calcNormFactors(). This means that we are only normalizing based on the “library size”, i.e., the total number of cells in each sample. Any changes we detect between conditions will subsequently represent differences in the proportion of cells in each cluster. The motivation behind this decision is discussed in more detail in Section 6.3. We formulate the design matrix with a blocking factor for the batch of origin for each sample and an additive term for the td-Tomato status (i.e., injection effect). 
Here, the log-fold change in our model refers to the change in cell abundance after injection, rather than the change in gene expression. design &lt;- model.matrix(~factor(pool) + factor(tomato), y.ab$samples) We use the estimateDisp() function to estimate the NB dispersion for each cluster (Figure 6.1). We turn off the trend as we do not have enough points for its stable estimation. y.ab &lt;- estimateDisp(y.ab, design, trend=&quot;none&quot;) summary(y.ab$common.dispersion) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.0614 0.0614 0.0614 0.0614 0.0614 0.0614 plotBCV(y.ab, cex=1) Figure 6.1: Biological coefficient of variation (BCV) for each label with respect to its average abundance. BCVs are defined as the square root of the NB dispersion. Common dispersion estimates are shown in red. We repeat this process with the QL dispersion, again disabling the trend (Figure 6.2). fit.ab &lt;- glmQLFit(y.ab, design, robust=TRUE, abundance.trend=FALSE) summary(fit.ab$var.prior) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 1.25 1.25 1.25 1.25 1.25 1.25 summary(fit.ab$df.prior) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## Inf Inf Inf Inf Inf Inf plotQLDisp(fit.ab, cex=1) Figure 6.2: QL dispersion estimates for each label with respect to its average abundance. Quarter-root values of the raw estimates are shown in black while the shrunken estimates are shown in red. Shrinkage is performed towards the common dispersion in blue. We test for differences in abundance between td-Tomato-positive and negative samples using glmQLFTest(). We see that extra-embryonic ectoderm is strongly depleted in the injected cells. This is consistent with the expectation that cells injected into the blastocyst should not contribute to extra-embryonic tissue. The injected cells also contribute more to the mesenchyme, which may also be of interest. 
res &lt;- glmQLFTest(fit.ab, coef=ncol(design)) summary(decideTests(res)) ## factor(tomato)TRUE ## Down 1 ## NotSig 22 ## Up 1 topTags(res) ## Coefficient: factor(tomato)TRUE ## logFC logCPM F PValue FDR ## ExE ectoderm -6.5663 13.02 66.267 1.352e-10 3.245e-09 ## Mesenchyme 1.1652 16.29 11.291 1.535e-03 1.841e-02 ## Allantois 0.8345 15.51 5.312 2.555e-02 1.621e-01 ## Cardiomyocytes 0.8484 14.86 5.204 2.701e-02 1.621e-01 ## Neural crest -0.7706 14.76 4.106 4.830e-02 2.149e-01 ## Endothelium 0.7519 14.29 3.912 5.371e-02 2.149e-01 ## Erythroid3 -0.6431 17.28 3.604 6.367e-02 2.183e-01 ## Haematoendothelial progenitors 0.6581 14.72 3.124 8.351e-02 2.505e-01 ## ExE mesoderm 0.3805 15.68 1.181 2.827e-01 6.258e-01 ## Pharyngeal mesoderm 0.3793 15.72 1.169 2.850e-01 6.258e-01 6.3 Handling composition effects 6.3.1 Background As mentioned above, we do not use calcNormFactors() in our default DA analysis. This normalization step assumes that most of the input features are not different between conditions. While this assumption is reasonable for most types of gene expression data, it is generally too strong for cell type abundance - most experiments consist of only a few cell types that may all change in abundance upon perturbation. Thus, our default approach is to only normalize based on the total number of cells in each sample, which means that we are effectively testing for differential proportions between conditions. Unfortunately, the use of the total number of cells leaves us susceptible to composition effects. For example, a large increase in abundance for one cell subpopulation will introduce decreases in proportion for all other subpopulations - which is technically correct, but may be misleading if one concludes that those other subpopulations are decreasing in abundance of their own volition. 
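A toy calculation illustrates the effect. This is a sketch with invented cell counts for four hypothetical labels A to D, where only A changes in absolute abundance between two samples.

```r
# Hypothetical cell counts per label in two samples; all values are invented.
before <- c(A = 100, B = 100, C = 100, D = 100)
after  <- before
after[["A"]] <- 700   # only label A changes in absolute abundance

# Proportions, i.e., counts normalized by the total number of cells.
prop.before <- before / sum(before)
prop.after  <- after / sum(after)

# Log2-fold change in proportion for each label.
lfc <- log2(prop.after / prop.before)
round(lfc, 2)
```

Label A shows a log2-fold increase in proportion of about 1.49, while B, C and D each show a decrease of about 1.32 despite having identical cell counts in both samples.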
If composition biases are proving problematic for interpretation of DA results, we have several avenues for removing them or mitigating their impact by leveraging a priori biological knowledge. 6.3.2 Assuming most labels do not change If it is possible to assume that most labels (i.e., cell types) do not change in abundance, we can use calcNormFactors() to compute normalization factors. This seems to be a fairly reasonable assumption for the WT chimeras where the injection is expected to have only a modest effect at most. y.ab2 &lt;- calcNormFactors(y.ab) y.ab2$samples$norm.factors ## [1] 1.0055 1.0833 1.1658 0.7614 1.0616 0.9743 We then proceed with the remainder of the edgeR analysis, shown below in condensed format. Many of the positive log-fold changes are shifted towards zero, consistent with the removal of composition biases from the presence of extra-embryonic ectoderm in only background cells. In particular, the mesenchyme is no longer significantly DA after injection. y.ab2 &lt;- estimateDisp(y.ab2, design, trend=&quot;none&quot;) fit.ab2 &lt;- glmQLFit(y.ab2, design, robust=TRUE, abundance.trend=FALSE) res2 &lt;- glmQLFTest(fit.ab2, coef=ncol(design)) topTags(res2, n=10) ## Coefficient: factor(tomato)TRUE ## logFC logCPM F PValue FDR ## ExE ectoderm -6.9215 13.17 70.364 5.738e-11 1.377e-09 ## Mesenchyme 0.9513 16.27 6.787 1.219e-02 1.143e-01 ## Neural crest -1.0032 14.78 6.464 1.429e-02 1.143e-01 ## Erythroid3 -0.8504 17.35 5.517 2.299e-02 1.380e-01 ## Cardiomyocytes 0.6400 14.84 2.735 1.047e-01 4.809e-01 ## Allantois 0.6054 15.51 2.503 1.202e-01 4.809e-01 ## Forebrain/Midbrain/Hindbrain -0.4943 16.55 1.928 1.713e-01 5.178e-01 ## Endothelium 0.5482 14.27 1.917 1.726e-01 5.178e-01 ## Erythroid2 -0.4818 16.00 1.677 2.015e-01 5.373e-01 ## Haematoendothelial progenitors 0.4262 14.73 1.185 2.818e-01 6.240e-01 6.3.3 Removing the offending labels Another approach is to repeat the analysis after removing DA clusters containing many cells. 
This provides a clearer picture of the changes in abundance among the remaining clusters. Here, we remove the extra-embryonic ectoderm and reset the total number of cells for all samples with keep.lib.sizes=FALSE. offenders &lt;- &quot;ExE ectoderm&quot; y.ab3 &lt;- y.ab[setdiff(rownames(y.ab), offenders),, keep.lib.sizes=FALSE] y.ab3$samples ## group lib.size norm.factors batch cell barcode sample stage ## 5 1 2268 1 5 cell_9769 AAACCTGAGACTGTAA 5 E8.5 ## 6 1 993 1 6 cell_12180 AAACCTGCAGATGGCA 6 E8.5 ## 7 1 2708 1 7 cell_13227 AAACCTGAGACAAGCC 7 E8.5 ## 8 1 2749 1 8 cell_16234 AAACCTGCAAACCCAT 8 E8.5 ## 9 1 4009 1 9 cell_19332 AAACCTGCAACGATCT 9 E8.5 ## 10 1 6224 1 10 cell_23875 AAACCTGAGGCATGTG 10 E8.5 ## tomato pool stage.mapped celltype.mapped closest.cell ## 5 TRUE 3 E8.25 Mesenchyme cell_24159 ## 6 FALSE 3 E8.25 Somitic mesoderm cell_63247 ## 7 TRUE 4 E8.5 Somitic mesoderm cell_25454 ## 8 FALSE 4 E8.25 ExE mesoderm cell_139075 ## 9 TRUE 5 E8.0 ExE mesoderm cell_116116 ## 10 FALSE 5 E8.5 Forebrain/Midbrain/Hindbrain cell_39343 ## doub.density sizeFactor label ## 5 0.029850 1.6349 19 ## 6 0.291916 2.5981 6 ## 7 0.601740 1.5939 17 ## 8 0.004733 0.8707 9 ## 9 0.079415 0.8933 15 ## 10 0.040747 0.3947 1 y.ab3 &lt;- estimateDisp(y.ab3, design, trend=&quot;none&quot;) fit.ab3 &lt;- glmQLFit(y.ab3, design, robust=TRUE, abundance.trend=FALSE) res3 &lt;- glmQLFTest(fit.ab3, coef=ncol(design)) topTags(res3, n=10) ## Coefficient: factor(tomato)TRUE ## logFC logCPM F PValue FDR ## Mesenchyme 1.1274 16.32 11.501 0.001438 0.03308 ## Allantois 0.7950 15.54 5.231 0.026836 0.18284 ## Cardiomyocytes 0.8104 14.90 5.152 0.027956 0.18284 ## Neural crest -0.8085 14.80 4.903 0.031798 0.18284 ## Erythroid3 -0.6808 17.32 4.387 0.041743 0.19202 ## Endothelium 0.7151 14.32 3.830 0.056443 0.21636 ## Haematoendothelial progenitors 0.6189 14.76 2.993 0.090338 0.29683 ## Def. 
endoderm 0.4911 12.43 1.084 0.303347 0.67818 ## ExE mesoderm 0.3419 15.71 1.036 0.314058 0.67818 ## Pharyngeal mesoderm 0.3407 15.76 1.025 0.316623 0.67818 A similar strategy can be used to focus on proportional changes within a single subpopulation of a very heterogeneous data set. For example, if we collected a whole blood data set, we could subset to T cells and test for changes in T cell subtypes (memory, killer, regulatory, etc.) using the total number of T cells in each sample as the library size. This avoids detecting changes in T cell subsets that are driven by compositional effects from changes in abundance of, say, B cells in the same sample. 6.3.4 Testing against a log-fold change threshold Here, we assume that composition bias introduces a spurious log2-fold change of no more than \\(\\tau\\) for a non-DA label. This can be roughly interpreted as the maximum log-fold change in the total number of cells caused by DA in other labels. (By comparison, fold-differences in the totals due to differences in capture efficiency or the size of the original cell population are not attributable to composition bias and should not be considered when choosing \\(\\tau\\).) We then mitigate the effect of composition biases by testing each label for changes in abundance beyond \\(\\tau\\) (McCarthy and Smyth 2009; Lun, Richard, and Marioni 2017). res.lfc &lt;- glmTreat(fit.ab, coef=ncol(design), lfc=1) summary(decideTests(res.lfc)) ## factor(tomato)TRUE ## Down 1 ## NotSig 23 ## Up 0 topTags(res.lfc) ## Coefficient: factor(tomato)TRUE ## logFC unshrunk.logFC logCPM PValue ## ExE ectoderm -6.5663 -7.0015 13.02 2.626e-09 ## Mesenchyme 1.1652 1.1658 16.29 1.323e-01 ## Cardiomyocytes 0.8484 0.8498 14.86 3.796e-01 ## Allantois 0.8345 0.8354 15.51 3.975e-01 ## Neural crest -0.7706 -0.7719 14.76 4.501e-01 ## Endothelium 0.7519 0.7536 14.29 4.665e-01 ## Haematoendothelial progenitors 0.6581 0.6591 14.72 5.622e-01 ## Def. 
endoderm 0.5262 0.5311 12.40 5.934e-01 ## Erythroid3 -0.6431 -0.6432 17.28 6.118e-01 ## Caudal Mesoderm -0.3996 -0.4036 12.09 6.827e-01 ## FDR ## ExE ectoderm 6.303e-08 ## Mesenchyme 9.950e-01 ## Cardiomyocytes 9.950e-01 ## Allantois 9.950e-01 ## Neural crest 9.950e-01 ## Endothelium 9.950e-01 ## Haematoendothelial progenitors 9.950e-01 ## Def. endoderm 9.950e-01 ## Erythroid3 9.950e-01 ## Caudal Mesoderm 9.950e-01 The choice of \\(\\tau\\) can be loosely motivated by external experimental data. For example, if we observe a doubling of cell numbers in an in vitro system after treatment, we might be inclined to set \\(\\tau=1\\). This ensures that any non-DA subpopulation is not reported as being depleted after treatment. Some caution is still required, though - even if the external numbers are accurate, we need to assume that cell capture efficiency is (on average) equal between conditions to justify their use as \\(\\tau\\). And obviously, the use of a non-zero \\(\\tau\\) will reduce power to detect real changes when the composition bias is not present. 6.4 Comments on interpretation 6.4.1 DE or DA? Two sides of the same coin While useful, the distinction between DA and DE analyses is inherently artificial for scRNA-seq data. This is because the labels used in the former are defined based on the genes to be tested in the latter. To illustrate, consider a scRNA-seq experiment involving two biological conditions with several shared cell types. We focus on a cell type \\(X\\) that is present in both conditions but contains some DEGs between conditions. This leads to two possible outcomes: The DE between conditions causes \\(X\\) to form two separate clusters (say, \\(X_1\\) and \\(X_2\\)) in expression space. This manifests as DA where \\(X_1\\) is enriched in one condition and \\(X_2\\) is enriched in the other condition. 
The DE between conditions is not sufficient to split \\(X\\) into two separate clusters, e.g., because the data integration procedure identifies them as corresponding cell types and merges them together. This means that the differences between conditions manifest as DE within the single cluster corresponding to \\(X\\). We have described the example above in terms of clustering, but the same arguments apply for any labelling strategy based on the expression profiles, e.g., automated cell type assignment (Basic Chapter 7). Moreover, the choice between outcomes 1 and 2 is made implicitly by the combined effect of the data merging, clustering and label assignment procedures. For example, differences between conditions are more likely to manifest as DE for coarser clusters and as DA for finer clusters, but this is difficult to predict reliably. The moral of the story is that DA and DE analyses are simply two different perspectives on the same phenomena. For any comprehensive characterization of differences between populations, it is usually necessary to consider both analyses. Indeed, they complement each other almost by definition, e.g., clustering parameters that reduce DE will increase DA and vice versa. 6.4.2 Sacrificing biology by integration Earlier in this chapter, we defined clusters from corrected values after applying fastMNN() to cells from all samples in the chimera dataset. Alert readers may realize that this would result in the removal of biological differences between our conditions. Any systematic difference in expression caused by injection would be treated as a batch effect and lost when cells from different samples are aligned to the same coordinate space. Now, one may not consider injection to be an interesting biological effect, but the same reasoning applies for other conditions, e.g., integration of wild-type and knock-out samples (Section 5) would result in the loss of any knock-out effect in the corrected values. 
This loss is both expected and desirable. As we mentioned in Section 3, the main motivation for performing batch correction is to enable us to characterize population heterogeneity in a consistent manner across samples. This remains true in situations with multiple conditions where we would like one set of clusters and annotations that can be used as common labels for the DE or DA analyses described above. The alternative would be to cluster each condition separately and to attempt to identify matching clusters across conditions - not straightforward for poorly separated clusters in contexts like differentiation. It may seem distressing to some that a (potentially very interesting) biological difference between conditions is lost during correction. However, this concern is largely misplaced as the correction is only ever used for defining common clusters and annotations. The DE analysis itself is performed on pseudo-bulk samples created from the uncorrected counts, preserving the biological difference and ensuring that it manifests in the list of DE genes for affected cell types. Of course, if the DE is strong enough, it may result in a new condition-specific cluster that would be captured by a DA analysis as discussed in Section 6.4.1. One final consideration is the interaction of condition-specific expression with the assumptions of each batch correction method. For example, MNN correction assumes that the differences between samples are orthogonal to the variation within samples. Arguably, this assumption becomes more questionable if the between-sample differences are biological in nature, e.g., a treatment effect that makes one cell type seem more transcriptionally similar to another may cause the wrong clusters to be aligned across conditions. As usual, users will benefit from the diagnostics described in Chapter 1 and a healthy dose of skepticism.
Session Info View session info R version 4.1.0 beta (2021-05-03 r80259) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] edgeR_3.35.0 limma_3.49.0 [3] SingleCellExperiment_1.15.1 SummarizedExperiment_1.23.0 [5] Biobase_2.53.0 GenomicRanges_1.45.0 [7] GenomeInfoDb_1.29.0 IRanges_2.27.0 [9] S4Vectors_0.31.0 BiocGenerics_0.39.0 [11] MatrixGenerics_1.5.0 matrixStats_0.58.0 [13] BiocStyle_2.21.0 rebook_1.3.0 loaded via a namespace (and not attached): [1] statmod_1.4.36 locfit_1.5-9.4 xfun_0.23 [4] bslib_0.2.5.1 splines_4.1.0 lattice_0.20-44 [7] htmltools_0.5.1.1 yaml_2.2.1 XML_3.99-0.6 [10] rlang_0.4.11 jquerylib_0.1.4 CodeDepends_0.6.5 [13] GenomeInfoDbData_1.2.6 stringr_1.4.0 zlibbioc_1.39.0 [16] codetools_0.2-18 evaluate_0.14 knitr_1.33 [19] highr_0.9 Rcpp_1.0.6 filelock_1.0.2 [22] BiocManager_1.30.15 DelayedArray_0.19.0 graph_1.71.0 [25] jsonlite_1.7.2 XVector_0.33.0 dir.expiry_1.1.0 [28] digest_0.6.27 stringi_1.6.2 bookdown_0.22 [31] grid_4.1.0 tools_4.1.0 bitops_1.0-7 [34] magrittr_2.0.1 sass_0.4.0 RCurl_1.98-1.3 [37] Matrix_1.3-3 rmarkdown_2.8 R6_2.5.0 [40] compiler_4.1.0 References "],["human-pbmcs-10x-genomics.html", "Chapter 7 Human PBMCs (10X Genomics) 7.1 Introduction 7.2 Data loading 7.3 Quality control 7.4 Normalization 7.5 Variance modelling 7.6 Dimensionality reduction 7.7 Clustering 7.8 Data integration Session Info", " Chapter 7 Human PBMCs (10X Genomics)
7.1 Introduction This performs an analysis of the public PBMC ID dataset generated by 10X Genomics (Zheng et al. 2017), starting from the filtered count matrix. 7.2 Data loading library(TENxPBMCData) all.sce &lt;- list( pbmc3k=TENxPBMCData(&#39;pbmc3k&#39;), pbmc4k=TENxPBMCData(&#39;pbmc4k&#39;), pbmc8k=TENxPBMCData(&#39;pbmc8k&#39;) ) 7.3 Quality control unfiltered &lt;- all.sce Cell calling implicitly serves as a QC step to remove libraries with low total counts and number of detected genes. Thus, we will only filter on the mitochondrial proportion. library(scater) stats &lt;- high.mito &lt;- list() for (n in names(all.sce)) { current &lt;- all.sce[[n]] is.mito &lt;- grep(&quot;MT&quot;, rowData(current)$Symbol_TENx) stats[[n]] &lt;- perCellQCMetrics(current, subsets=list(Mito=is.mito)) high.mito[[n]] &lt;- isOutlier(stats[[n]]$subsets_Mito_percent, type=&quot;higher&quot;) all.sce[[n]] &lt;- current[,!high.mito[[n]]] } qcplots &lt;- list() for (n in names(all.sce)) { current &lt;- unfiltered[[n]] colData(current) &lt;- cbind(colData(current), stats[[n]]) current$discard &lt;- high.mito[[n]] qcplots[[n]] &lt;- plotColData(current, x=&quot;sum&quot;, y=&quot;subsets_Mito_percent&quot;, colour_by=&quot;discard&quot;) + scale_x_log10() } do.call(gridExtra::grid.arrange, c(qcplots, ncol=3)) Figure 7.1: Percentage of mitochondrial reads in each cell in each of the 10X PBMC datasets, compared to the total count. Each point represents a cell and is colored according to whether that cell was discarded.
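The isOutlier() call above flags cells with unusually high mitochondrial percentages. As a rough sketch of the idea in base R, assuming a threshold at the median plus three median absolute deviations (the exact defaults of isOutlier() should be checked in its documentation), we can apply the same kind of rule to simulated values. The simulated percentages and the variable names below are invented for illustration.

```r
set.seed(100)
# Simulated mitochondrial percentages: invented values, not from the PBMC data.
mito.percent <- c(rnorm(500, mean = 3, sd = 0.8), 12, 15, 20)

# Threshold at median + 3 MADs, flagging only the upper tail.
threshold <- median(mito.percent) + 3 * mad(mito.percent)
discard <- mito.percent > threshold

summary(discard)
```

Only the upper tail is flagged because a low mitochondrial proportion is not a sign of a damaged cell.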
lapply(high.mito, summary) ## $pbmc3k ## Mode FALSE TRUE ## logical 2609 91 ## ## $pbmc4k ## Mode FALSE TRUE ## logical 4182 158 ## ## $pbmc8k ## Mode FALSE TRUE ## logical 8157 224 7.4 Normalization We perform library size normalization, simply for convenience when dealing with file-backed matrices. all.sce &lt;- lapply(all.sce, logNormCounts) lapply(all.sce, function(x) summary(sizeFactors(x))) ## $pbmc3k ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.234 0.748 0.926 1.000 1.157 6.604 ## ## $pbmc4k ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.315 0.711 0.890 1.000 1.127 11.027 ## ## $pbmc8k ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.296 0.704 0.877 1.000 1.118 6.794 7.5 Variance modelling library(scran) all.dec &lt;- lapply(all.sce, modelGeneVar) all.hvgs &lt;- lapply(all.dec, getTopHVGs, prop=0.1) par(mfrow=c(1,3)) for (n in names(all.dec)) { curdec &lt;- all.dec[[n]] plot(curdec$mean, curdec$total, pch=16, cex=0.5, main=n, xlab=&quot;Mean of log-expression&quot;, ylab=&quot;Variance of log-expression&quot;) curfit &lt;- metadata(curdec) curve(curfit$trend(x), col=&#39;dodgerblue&#39;, add=TRUE, lwd=2) } Figure 7.2: Per-gene variance as a function of the mean for the log-expression values in each PBMC dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to the variances. 7.6 Dimensionality reduction For various reasons, we will first analyze each PBMC dataset separately rather than merging them together. We use randomized SVD, which is more efficient for file-backed matrices. 
library(BiocSingular) set.seed(10000) all.sce &lt;- mapply(FUN=runPCA, x=all.sce, subset_row=all.hvgs, MoreArgs=list(ncomponents=25, BSPARAM=RandomParam()), SIMPLIFY=FALSE) set.seed(100000) all.sce &lt;- lapply(all.sce, runTSNE, dimred=&quot;PCA&quot;) set.seed(1000000) all.sce &lt;- lapply(all.sce, runUMAP, dimred=&quot;PCA&quot;) 7.7 Clustering for (n in names(all.sce)) { g &lt;- buildSNNGraph(all.sce[[n]], k=10, use.dimred=&#39;PCA&#39;) clust &lt;- igraph::cluster_walktrap(g)$membership colLabels(all.sce[[n]]) &lt;- factor(clust) } lapply(all.sce, function(x) table(colLabels(x))) ## $pbmc3k ## ## 1 2 3 4 5 6 7 8 9 10 ## 487 154 603 514 31 150 179 333 147 11 ## ## $pbmc4k ## ## 1 2 3 4 5 6 7 8 9 10 11 12 13 ## 497 185 569 786 373 232 44 1023 77 218 88 54 36 ## ## $pbmc8k ## ## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ## 1004 759 1073 1543 367 150 201 2067 59 154 244 67 76 285 20 15 ## 17 18 ## 64 9 all.tsne &lt;- list() for (n in names(all.sce)) { all.tsne[[n]] &lt;- plotTSNE(all.sce[[n]], colour_by=&quot;label&quot;) + ggtitle(n) } do.call(gridExtra::grid.arrange, c(all.tsne, list(ncol=2))) Figure 7.3: Obligatory \\(t\\)-SNE plots of each PBMC dataset, where each point represents a cell in the corresponding dataset and is colored according to the assigned cluster. 7.8 Data integration With the per-dataset analyses out of the way, we will now repeat the analysis after merging together the three batches. # Intersecting the common genes. universe &lt;- Reduce(intersect, lapply(all.sce, rownames)) all.sce2 &lt;- lapply(all.sce, &quot;[&quot;, i=universe,) all.dec2 &lt;- lapply(all.dec, &quot;[&quot;, i=universe,) # Renormalizing to adjust for differences in depth. library(batchelor) normed.sce &lt;- do.call(multiBatchNorm, all.sce2) # Identifying a set of HVGs using stats from all batches. 
combined.dec &lt;- do.call(combineVar, all.dec2) combined.hvg &lt;- getTopHVGs(combined.dec, n=5000) set.seed(1000101) merged.pbmc &lt;- do.call(fastMNN, c(normed.sce, list(subset.row=combined.hvg, BSPARAM=RandomParam()))) We use the percentage of lost variance as a diagnostic measure. metadata(merged.pbmc)$merge.info$lost.var ## pbmc3k pbmc4k pbmc8k ## [1,] 7.003e-03 3.126e-03 0.000000 ## [2,] 7.137e-05 5.125e-05 0.003003 We proceed to clustering: g &lt;- buildSNNGraph(merged.pbmc, use.dimred=&quot;corrected&quot;) colLabels(merged.pbmc) &lt;- factor(igraph::cluster_louvain(g)$membership) table(colLabels(merged.pbmc), merged.pbmc$batch) ## ## pbmc3k pbmc4k pbmc8k ## 1 113 387 825 ## 2 507 395 806 ## 3 175 344 581 ## 4 295 539 1018 ## 5 346 638 1210 ## 6 11 3 9 ## 7 17 27 111 ## 8 33 113 185 ## 9 423 754 1546 ## 10 4 36 67 ## 11 197 124 221 ## 12 150 180 293 ## 13 327 588 1125 ## 14 11 54 160 And visualization: set.seed(10101010) merged.pbmc &lt;- runTSNE(merged.pbmc, dimred=&quot;corrected&quot;) gridExtra::grid.arrange( plotTSNE(merged.pbmc, colour_by=&quot;label&quot;, text_by=&quot;label&quot;, text_colour=&quot;red&quot;), plotTSNE(merged.pbmc, colour_by=&quot;batch&quot;) ) Figure 7.4: Obligatory \\(t\\)-SNE plots for the merged PBMC datasets, where each point represents a cell and is colored by cluster (top) or batch (bottom). 
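The cluster-by-batch table above is easier to read once the counts are rescaled by the size of each batch, as pbmc8k contributes roughly three times as many cells as pbmc3k. The sketch below does this for three rows copied from that table, using the post-QC batch sizes reported earlier in this chapter (2609, 4182 and 8157 cells). The `imbalance` summary is a hypothetical diagnostic of our own, not part of any package.

```r
# Three rows copied from the cluster-by-batch table (clusters 1, 2 and 6).
toy.tab <- rbind(
    `1`=c(pbmc3k=113, pbmc4k=387, pbmc8k=825),
    `2`=c(pbmc3k=507, pbmc4k=395, pbmc8k=806),
    `6`=c(pbmc3k=11, pbmc4k=3, pbmc8k=9)
)

# Number of cells retained in each batch after quality control.
batch.size <- c(pbmc3k=2609, pbmc4k=4182, pbmc8k=8157)

# Fraction of each batch that falls into each cluster.
frac <- sweep(toy.tab, 2, batch.size, "/")

# Hypothetical summary: log2-ratio of the largest to smallest fraction.
# Large values point to clusters with uneven representation across batches.
imbalance <- apply(frac, 1, function(x) log2(max(x) / min(x)))
round(frac, 4)
imbalance
```

Uneven representation is not necessarily a sign of failed correction, as it may also reflect genuine differences in composition between batches.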
Session Info View session info R version 4.1.0 beta (2021-05-03 r80259) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] batchelor_1.9.0 BiocSingular_1.9.0 [3] scran_1.21.1 scater_1.21.0 [5] ggplot2_3.3.3 scuttle_1.3.0 [7] TENxPBMCData_1.11.0 HDF5Array_1.21.0 [9] rhdf5_2.37.0 DelayedArray_0.19.0 [11] Matrix_1.3-3 SingleCellExperiment_1.15.1 [13] SummarizedExperiment_1.23.0 Biobase_2.53.0 [15] GenomicRanges_1.45.0 GenomeInfoDb_1.29.0 [17] IRanges_2.27.0 S4Vectors_0.31.0 [19] BiocGenerics_0.39.0 MatrixGenerics_1.5.0 [21] matrixStats_0.58.0 BiocStyle_2.21.0 [23] rebook_1.3.0 loaded via a namespace (and not attached): [1] AnnotationHub_3.1.0 BiocFileCache_2.1.0 [3] igraph_1.2.6 BiocParallel_1.27.0 [5] digest_0.6.27 htmltools_0.5.1.1 [7] viridis_0.6.1 fansi_0.4.2 [9] magrittr_2.0.1 memoise_2.0.0 [11] ScaledMatrix_1.1.0 cluster_2.1.2 [13] limma_3.49.0 Biostrings_2.61.0 [15] colorspace_2.0-1 blob_1.2.1 [17] rappdirs_0.3.3 xfun_0.23 [19] dplyr_1.0.6 crayon_1.4.1 [21] RCurl_1.98-1.3 jsonlite_1.7.2 [23] graph_1.71.0 glue_1.4.2 [25] gtable_0.3.0 zlibbioc_1.39.0 [27] XVector_0.33.0 Rhdf5lib_1.15.0 [29] scales_1.1.1 DBI_1.1.1 [31] edgeR_3.35.0 Rcpp_1.0.6 [33] viridisLite_0.4.0 xtable_1.8-4 [35] dqrng_0.3.0 bit_4.0.4 [37] rsvd_1.0.5 ResidualMatrix_1.3.0 [39] metapod_1.1.0 httr_1.4.2 [41] FNN_1.1.3 dir.expiry_1.1.0 [43] ellipsis_0.3.2 pkgconfig_2.0.3 [45] XML_3.99-0.6 farver_2.1.0 [47] CodeDepends_0.6.5 sass_0.4.0 [49] uwot_0.1.10 dbplyr_2.1.1 [51] 
locfit_1.5-9.4 utf8_1.2.1 [53] tidyselect_1.1.1 labeling_0.4.2 [55] rlang_0.4.11 later_1.2.0 [57] AnnotationDbi_1.55.0 munsell_0.5.0 [59] BiocVersion_3.14.0 tools_4.1.0 [61] cachem_1.0.5 generics_0.1.0 [63] RSQLite_2.2.7 ExperimentHub_2.1.0 [65] evaluate_0.14 stringr_1.4.0 [67] fastmap_1.1.0 yaml_2.2.1 [69] knitr_1.33 bit64_4.0.5 [71] purrr_0.3.4 KEGGREST_1.33.0 [73] sparseMatrixStats_1.5.0 mime_0.10 [75] compiler_4.1.0 beeswarm_0.3.1 [77] filelock_1.0.2 curl_4.3.1 [79] png_0.1-7 interactiveDisplayBase_1.31.0 [81] tibble_3.1.2 statmod_1.4.36 [83] bslib_0.2.5.1 stringi_1.6.2 [85] highr_0.9 RSpectra_0.16-0 [87] lattice_0.20-44 bluster_1.3.0 [89] vctrs_0.3.8 pillar_1.6.1 [91] lifecycle_1.0.0 rhdf5filters_1.5.0 [93] BiocManager_1.30.15 jquerylib_0.1.4 [95] RcppAnnoy_0.0.18 BiocNeighbors_1.11.0 [97] cowplot_1.1.1 bitops_1.0-7 [99] irlba_2.3.3 httpuv_1.6.1 [101] R6_2.5.0 bookdown_0.22 [103] promises_1.2.0.1 gridExtra_2.3 [105] vipor_0.4.5 codetools_0.2-18 [107] assertthat_0.2.1 withr_2.4.2 [109] GenomeInfoDbData_1.2.6 grid_4.1.0 [111] beachmat_2.9.0 rmarkdown_2.8 [113] DelayedMatrixStats_1.15.0 Rtsne_0.15 [115] shiny_1.6.0 ggbeeswarm_0.6.0 References "],["merged-pancreas.html", "Chapter 8 Human pancreas (multiple technologies) 8.1 Introduction 8.2 The good 8.3 The bad 8.4 The ugly Session Info", " Chapter 8 Human pancreas (multiple technologies) 8.1 Introduction For a period in 2016, there was a great deal of interest in using scRNA-seq to profile the human pancreas at cellular resolution (Muraro et al. 2016; Grun et al. 2016; Lawlor et al. 2017; Segerstolpe et al. 2016). 
As a consequence, we have a surplus of human pancreas datasets generated by different authors with different technologies, which provides an ideal use case for demonstrating more complex data integration strategies. This represents a more challenging application than the PBMC dataset in Chapter 1 as it involves different sequencing protocols and different patients, most likely with differences in cell type composition. 8.2 The good We start by considering only two datasets from Muraro et al. (2016) and Grun et al. (2016). This is a relatively simple scenario involving very similar protocols (CEL-seq and CEL-seq2) and a similar set of authors. View set-up code (Workflow Chapter 5) #--- loading ---# library(scRNAseq) sce.grun &lt;- GrunPancreasData() #--- gene-annotation ---# library(org.Hs.eg.db) gene.ids &lt;- mapIds(org.Hs.eg.db, keys=rowData(sce.grun)$symbol, keytype=&quot;SYMBOL&quot;, column=&quot;ENSEMBL&quot;) keep &lt;- !is.na(gene.ids) &amp; !duplicated(gene.ids) sce.grun &lt;- sce.grun[keep,] rownames(sce.grun) &lt;- gene.ids[keep] #--- quality-control ---# library(scater) stats &lt;- perCellQCMetrics(sce.grun) qc &lt;- quickPerCellQC(stats, percent_subsets=&quot;altexps_ERCC_percent&quot;, batch=sce.grun$donor, subset=sce.grun$donor %in% c(&quot;D17&quot;, &quot;D7&quot;, &quot;D2&quot;)) sce.grun &lt;- sce.grun[,!qc$discard] #--- normalization ---# library(scran) set.seed(1000) # for irlba. clusters &lt;- quickCluster(sce.grun) sce.grun &lt;- computeSumFactors(sce.grun, clusters=clusters) sce.grun &lt;- logNormCounts(sce.grun) #--- variance-modelling ---# block &lt;- paste0(sce.grun$sample, &quot;_&quot;, sce.grun$donor) dec.grun &lt;- modelGeneVarWithSpikes(sce.grun, spikes=&quot;ERCC&quot;, block=block) top.grun &lt;- getTopHVGs(dec.grun, prop=0.1) sce.grun ## class: SingleCellExperiment ## dim: 17398 1063 ## metadata(0): ## assays(2): counts logcounts ## rownames(17398): ENSG00000268895 ENSG00000121410 ... 
ENSG00000074755 ## ENSG00000036549 ## rowData names(2): symbol chr ## colnames(1063): D2ex_1 D2ex_2 ... D17TGFB_94 D17TGFB_95 ## colData names(3): donor sample sizeFactor ## reducedDimNames(0): ## mainExpName: endogenous ## altExpNames(1): ERCC View set-up code (Workflow Chapter 6) #--- loading ---# library(scRNAseq) sce.muraro &lt;- MuraroPancreasData() #--- gene-annotation ---# library(AnnotationHub) edb &lt;- AnnotationHub()[[&quot;AH73881&quot;]] gene.symb &lt;- sub(&quot;__chr.*$&quot;, &quot;&quot;, rownames(sce.muraro)) gene.ids &lt;- mapIds(edb, keys=gene.symb, keytype=&quot;SYMBOL&quot;, column=&quot;GENEID&quot;) # Removing duplicated genes or genes without Ensembl IDs. keep &lt;- !is.na(gene.ids) &amp; !duplicated(gene.ids) sce.muraro &lt;- sce.muraro[keep,] rownames(sce.muraro) &lt;- gene.ids[keep] #--- quality-control ---# library(scater) stats &lt;- perCellQCMetrics(sce.muraro) qc &lt;- quickPerCellQC(stats, percent_subsets=&quot;altexps_ERCC_percent&quot;, batch=sce.muraro$donor, subset=sce.muraro$donor!=&quot;D28&quot;) sce.muraro &lt;- sce.muraro[,!qc$discard] #--- normalization ---# library(scran) set.seed(1000) clusters &lt;- quickCluster(sce.muraro) sce.muraro &lt;- computeSumFactors(sce.muraro, clusters=clusters) sce.muraro &lt;- logNormCounts(sce.muraro) #--- variance-modelling ---# block &lt;- paste0(sce.muraro$plate, &quot;_&quot;, sce.muraro$donor) dec.muraro &lt;- modelGeneVarWithSpikes(sce.muraro, &quot;ERCC&quot;, block=block) top.muraro &lt;- getTopHVGs(dec.muraro, prop=0.1) sce.muraro ## class: SingleCellExperiment ## dim: 16940 2299 ## metadata(0): ## assays(2): counts logcounts ## rownames(16940): ENSG00000268895 ENSG00000121410 ... ENSG00000159840 ## ENSG00000074755 ## rowData names(2): symbol chr ## colnames(2299): D28-1_1 D28-1_2 ... 
D30-8_93 D30-8_94 ## colData names(4): label donor plate sizeFactor ## reducedDimNames(0): ## mainExpName: endogenous ## altExpNames(1): ERCC We subset both batches to their common universe of genes; adjust their scaling to equalize sequencing coverage (not really necessary in this case, as the coverage is already similar, but we will do so anyway for consistency); and select those genes with positive average biological components for further use. universe &lt;- intersect(rownames(sce.grun), rownames(sce.muraro)) sce.grun2 &lt;- sce.grun[universe,] dec.grun2 &lt;- dec.grun[universe,] sce.muraro2 &lt;- sce.muraro[universe,] dec.muraro2 &lt;- dec.muraro[universe,] library(batchelor) normed.pancreas &lt;- multiBatchNorm(sce.grun2, sce.muraro2) sce.grun2 &lt;- normed.pancreas[[1]] sce.muraro2 &lt;- normed.pancreas[[2]] library(scran) combined.pan &lt;- combineVar(dec.grun2, dec.muraro2) chosen.genes &lt;- rownames(combined.pan)[combined.pan$bio &gt; 0] We observe that rescaleBatches() is unable to align cells from different batches in Figure 8.1. This is attributable to differences in population composition between batches, with additional complications from non-linearities in the batch effect, e.g., when the magnitude or direction of the batch effect differs between cell types. library(scater) rescaled.pancreas &lt;- rescaleBatches(sce.grun2, sce.muraro2) set.seed(100101) rescaled.pancreas &lt;- runPCA(rescaled.pancreas, subset_row=chosen.genes, exprs_values=&quot;corrected&quot;) rescaled.pancreas &lt;- runTSNE(rescaled.pancreas, dimred=&quot;PCA&quot;) plotTSNE(rescaled.pancreas, colour_by=&quot;batch&quot;) Figure 8.1: \\(t\\)-SNE plot of the two pancreas datasets after correction with rescaleBatches(). Each point represents a cell and is colored according to the batch of origin. Here, we use fastMNN() to merge together the two human pancreas datasets described earlier. 
Clustering on the merged datasets yields fewer batch-specific clusters, which is recapitulated as greater intermingling between batches in Figure 8.2. This improvement over Figure 8.1 represents the ability of fastMNN() to adapt to more complex situations involving differences in population composition between batches. set.seed(1011011) mnn.pancreas &lt;- fastMNN(sce.grun2, sce.muraro2, subset.row=chosen.genes) snn.gr &lt;- buildSNNGraph(mnn.pancreas, use.dimred=&quot;corrected&quot;) clusters &lt;- igraph::cluster_walktrap(snn.gr)$membership tab &lt;- table(Cluster=clusters, Batch=mnn.pancreas$batch) tab ## Batch ## Cluster 1 2 ## 1 242 280 ## 2 304 250 ## 3 205 846 ## 4 37 2 ## 5 56 194 ## 6 24 108 ## 7 119 399 ## 8 53 71 ## 9 18 113 ## 10 0 17 ## 11 5 19 mnn.pancreas &lt;- runTSNE(mnn.pancreas, dimred=&quot;corrected&quot;) plotTSNE(mnn.pancreas, colour_by=&quot;batch&quot;) Figure 8.2: \\(t\\)-SNE plot of the two pancreas datasets after correction with fastMNN(). Each point represents a cell and is colored according to the batch of origin. 8.3 The bad Flushed with our previous success, we now attempt to merge the other datasets from Lawlor et al. (2017) and Segerstolpe et al. (2016). This is a more challenging task as it involves different technologies, mixtures of UMI and read count data and a more diverse set of authors (presumably with greater differences in the patient population). 
View set-up code (Workflow Chapter 7) #--- loading ---# library(scRNAseq) sce.lawlor &lt;- LawlorPancreasData() #--- gene-annotation ---# library(AnnotationHub) edb &lt;- AnnotationHub()[[&quot;AH73881&quot;]] anno &lt;- select(edb, keys=rownames(sce.lawlor), keytype=&quot;GENEID&quot;, columns=c(&quot;SYMBOL&quot;, &quot;SEQNAME&quot;)) rowData(sce.lawlor) &lt;- anno[match(rownames(sce.lawlor), anno[,1]),-1] #--- quality-control ---# library(scater) stats &lt;- perCellQCMetrics(sce.lawlor, subsets=list(Mito=which(rowData(sce.lawlor)$SEQNAME==&quot;MT&quot;))) qc &lt;- quickPerCellQC(stats, percent_subsets=&quot;subsets_Mito_percent&quot;, batch=sce.lawlor$`islet unos id`) sce.lawlor &lt;- sce.lawlor[,!qc$discard] #--- normalization ---# library(scran) set.seed(1000) clusters &lt;- quickCluster(sce.lawlor) sce.lawlor &lt;- computeSumFactors(sce.lawlor, clusters=clusters) sce.lawlor &lt;- logNormCounts(sce.lawlor) #--- variance-modelling ---# dec.lawlor &lt;- modelGeneVar(sce.lawlor, block=sce.lawlor$`islet unos id`) chosen.genes &lt;- getTopHVGs(dec.lawlor, n=2000) sce.lawlor ## class: SingleCellExperiment ## dim: 26616 604 ## metadata(0): ## assays(2): counts logcounts ## rownames(26616): ENSG00000229483 ENSG00000232849 ... ENSG00000251576 ## ENSG00000082898 ## rowData names(2): SYMBOL SEQNAME ## colnames(604): 10th_C11_S96 10th_C13_S61 ... 9th-C96_S81 9th-C9_S13 ## colData names(9): title age ... Sex sizeFactor ## reducedDimNames(0): ## mainExpName: NULL ## altExpNames(0): View set-up code (Workflow Chapter 8) #--- loading ---# library(scRNAseq) sce.seger &lt;- SegerstolpePancreasData() #--- gene-annotation ---# library(AnnotationHub) edb &lt;- AnnotationHub()[[&quot;AH73881&quot;]] symbols &lt;- rowData(sce.seger)$symbol ens.id &lt;- mapIds(edb, keys=symbols, keytype=&quot;SYMBOL&quot;, column=&quot;GENEID&quot;) ens.id &lt;- ifelse(is.na(ens.id), symbols, ens.id) # Removing duplicated rows. 
keep &lt;- !duplicated(ens.id) sce.seger &lt;- sce.seger[keep,] rownames(sce.seger) &lt;- ens.id[keep] #--- sample-annotation ---# emtab.meta &lt;- colData(sce.seger)[,c(&quot;cell type&quot;, &quot;disease&quot;, &quot;individual&quot;, &quot;single cell well quality&quot;)] colnames(emtab.meta) &lt;- c(&quot;CellType&quot;, &quot;Disease&quot;, &quot;Donor&quot;, &quot;Quality&quot;) colData(sce.seger) &lt;- emtab.meta sce.seger$CellType &lt;- gsub(&quot; cell&quot;, &quot;&quot;, sce.seger$CellType) sce.seger$CellType &lt;- paste0( toupper(substr(sce.seger$CellType, 1, 1)), substring(sce.seger$CellType, 2)) #--- quality-control ---# low.qual &lt;- sce.seger$Quality == &quot;low quality cell&quot; library(scater) stats &lt;- perCellQCMetrics(sce.seger) qc &lt;- quickPerCellQC(stats, percent_subsets=&quot;altexps_ERCC_percent&quot;, batch=sce.seger$Donor, subset=!sce.seger$Donor %in% c(&quot;HP1504901&quot;, &quot;HP1509101&quot;)) sce.seger &lt;- sce.seger[,!(qc$discard | low.qual)] #--- normalization ---# library(scran) clusters &lt;- quickCluster(sce.seger) sce.seger &lt;- computeSumFactors(sce.seger, clusters=clusters) sce.seger &lt;- logNormCounts(sce.seger) #--- variance-modelling ---# for.hvg &lt;- sce.seger[,librarySizeFactors(altExp(sce.seger)) &gt; 0 &amp; sce.seger$Donor!=&quot;AZ&quot;] dec.seger &lt;- modelGeneVarWithSpikes(for.hvg, &quot;ERCC&quot;, block=for.hvg$Donor) chosen.hvgs &lt;- getTopHVGs(dec.seger, n=2000) sce.seger ## class: SingleCellExperiment ## dim: 25454 2090 ## metadata(0): ## assays(2): counts logcounts ## rownames(25454): ENSG00000118473 ENSG00000142920 ... ENSG00000278306 ## eGFP ## rowData names(2): symbol refseq ## colnames(2090): HP1502401_H13 HP1502401_J14 ... HP1526901T2D_N8 ## HP1526901T2D_A8 ## colData names(5): CellType Disease Donor Quality sizeFactor ## reducedDimNames(0): ## mainExpName: endogenous ## altExpNames(1): ERCC We perform the usual routine to obtain re-normalized values and a set of HVGs. 
Here, we put all the objects into a list to avoid having to type each of their names explicitly. all.sce &lt;- list(Grun=sce.grun, Muraro=sce.muraro, Lawlor=sce.lawlor, Seger=sce.seger) all.dec &lt;- list(Grun=dec.grun, Muraro=dec.muraro, Lawlor=dec.lawlor, Seger=dec.seger) universe &lt;- Reduce(intersect, lapply(all.sce, rownames)) all.sce &lt;- lapply(all.sce, &quot;[&quot;, i=universe,) all.dec &lt;- lapply(all.dec, &quot;[&quot;, i=universe,) normed.pancreas &lt;- do.call(multiBatchNorm, all.sce) combined.pan &lt;- do.call(combineVar, all.dec) chosen.genes &lt;- rownames(combined.pan)[combined.pan$bio &gt; 0] We observe that the merge is generally successful, with many clusters containing contributions from each batch (Figure 8.3). There are a few clusters that are specific to the Segerstolpe dataset, and if we were naive, we might consider them to represent interesting subpopulations that are not present in the other datasets. set.seed(1011110) mnn.pancreas &lt;- fastMNN(normed.pancreas) # Bumping up &#39;k&#39; to get broader clusters for this demonstration. 
snn.gr &lt;- buildSNNGraph(mnn.pancreas, use.dimred=&quot;corrected&quot;, k=20) clusters &lt;- igraph::cluster_walktrap(snn.gr)$membership clusters &lt;- factor(clusters) tab &lt;- table(Cluster=clusters, Batch=mnn.pancreas$batch) tab ## Batch ## Cluster Grun Lawlor Muraro Seger ## 1 304 28 256 384 ## 2 54 12 599 196 ## 3 0 0 0 238 ## 4 106 244 390 176 ## 5 226 17 250 152 ## 6 35 0 1 0 ## 7 0 0 0 47 ## 8 146 212 234 121 ## 9 56 17 196 108 ## 10 67 24 83 18 ## 11 0 0 0 50 ## 12 24 18 107 55 ## 13 0 1 16 4 ## 14 19 18 118 157 ## 15 0 0 0 193 ## 16 0 0 0 26 ## 17 0 0 0 112 ## 18 21 5 30 36 ## 19 5 8 19 17 mnn.pancreas &lt;- runTSNE(mnn.pancreas, dimred=&quot;corrected&quot;) gridExtra::grid.arrange( plotTSNE(mnn.pancreas, colour_by=&quot;batch&quot;, text_by=I(clusters)), plotTSNE(mnn.pancreas, colour_by=I(clusters), text_by=I(clusters)), ncol=2 ) Figure 8.3: \\(t\\)-SNE plots of the four pancreas datasets after correction with fastMNN(). Each point represents a cell and is colored according to the batch of origin (left) or the assigned cluster (right). The cluster label is shown at the median location across all cells in the cluster. Fortunately, we are battle-hardened and cynical, so we are sure to check for other sources of variation. The most obvious candidate is the donor of origin for each cell (Figure 8.4), which correlates strongly with these Segerstolpe-only clusters. This is not surprising given the large differences between humans in the wild, but donor-level variation is not interesting for the purposes of cell type characterization. (That said, preservation of the within-dataset donor effects is the technically correct course of action here, as a batch correction method should try to avoid removing heterogeneity within each of its defined batches.) 
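The dataset-specific clusters discussed above can also be picked out of the cluster-by-batch table programmatically. The following is a minimal base-R sketch using five rows copied from that table. The 95% cutoff for calling a cluster "specific" is an arbitrary assumption made for this illustration.

```r
# Five rows copied from the cluster-by-batch table for the four-dataset merge.
toy.tab <- rbind(
    `1`=c(Grun=304, Lawlor=28, Muraro=256, Seger=384),
    `3`=c(0, 0, 0, 238),
    `6`=c(35, 0, 1, 0),
    `7`=c(0, 0, 0, 47),
    `15`=c(0, 0, 0, 193)
)

# Proportion of cells in each cluster that come from each dataset.
prop <- toy.tab / rowSums(toy.tab)

# Dataset contributing the most cells to each cluster.
dominant <- colnames(toy.tab)[max.col(prop, ties.method='first')]

# Clusters where one dataset supplies more than 95% of cells (assumed cutoff).
specific <- rownames(toy.tab)[apply(prop, 1, max) > 0.95]
data.frame(cluster=rownames(toy.tab), dominant=dominant,
    specific=rownames(toy.tab) %in% specific)
```

A cluster flagged in this manner is only a candidate for closer inspection; whether it reflects biology, a donor effect or incomplete correction still has to be judged from other evidence.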
donors &lt;- c( normed.pancreas$Grun$donor, normed.pancreas$Muraro$donor, normed.pancreas$Lawlor$`islet unos id`, normed.pancreas$Seger$Donor ) seger.donors &lt;- donors seger.donors[mnn.pancreas$batch!=&quot;Seger&quot;] &lt;- NA plotTSNE(mnn.pancreas, colour_by=I(seger.donors)) Figure 8.4: \\(t\\)-SNE plots of the four pancreas datasets after correction with fastMNN(). Each point represents a cell and is colored according to the donor of origin for the Segerstolpe dataset. 8.4 The ugly Given these results, the most prudent course of action is to remove the donor effects within each dataset in addition to the batch effects across datasets. This involves a bit more work to properly specify the two levels of unwanted heterogeneity. To make our job a bit easier, we use the noCorrect() utility to combine all batches into a single SingleCellExperiment object. combined &lt;- noCorrect(normed.pancreas) assayNames(combined) &lt;- &quot;logcounts&quot; combined$donor &lt;- donors We then call fastMNN() on the combined object with our chosen HVGs, using the batch= argument to specify which cells belong to which donors. This will progressively merge cells from each donor in each batch until all cells are mapped onto a common coordinate space. For some extra sophistication, we also set the weights= argument to ensure that each batch contributes equally to the PCA, regardless of the number of donors present in that batch; see ?multiBatchPCA for more details. 
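The motivation for the weighting can be illustrated with some simple arithmetic. The sketch below uses the number of donors in each of the four pancreas datasets and compares the share of influence that each dataset would have if every donor counted equally against the share when each dataset is given the same total weight. This reflects our reading of the intent behind weights=, and is not a description of what multiBatchPCA() does internally.

```r
# Number of donors in each of the four pancreas datasets.
n.donors <- c(Grun=5, Lawlor=8, Muraro=4, Seger=10)

# If every donor counted equally, datasets with more donors would dominate.
unweighted <- n.donors / sum(n.donors)

# Giving each dataset the same total weight, split evenly among its donors.
per.dataset <- rep(1 / length(n.donors), length(n.donors))
names(per.dataset) <- names(n.donors)
per.donor <- per.dataset / n.donors

round(rbind(unweighted, weighted=per.dataset), 3)
round(per.donor, 3)
```

Under the equal-donor scheme, the dataset with the most donors has the largest say in defining the low-dimensional space, whereas the weighted scheme gives each dataset a quarter of the total regardless of its number of donors.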
donors.per.batch &lt;- split(combined$donor, combined$batch) donors.per.batch &lt;- lapply(donors.per.batch, unique) donors.per.batch ## $Grun ## [1] &quot;D2&quot; &quot;D3&quot; &quot;D7&quot; &quot;D10&quot; &quot;D17&quot; ## ## $Lawlor ## [1] &quot;ACIW009&quot; &quot;ACJV399&quot; &quot;ACCG268&quot; &quot;ACCR015A&quot; &quot;ACEK420A&quot; &quot;ACEL337&quot; &quot;ACHY057&quot; ## [8] &quot;ACIB065&quot; ## ## $Muraro ## [1] &quot;D28&quot; &quot;D29&quot; &quot;D31&quot; &quot;D30&quot; ## ## $Seger ## [1] &quot;HP1502401&quot; &quot;HP1504101T2D&quot; &quot;AZ&quot; &quot;HP1508501T2D&quot; &quot;HP1506401&quot; ## [6] &quot;HP1507101&quot; &quot;HP1509101&quot; &quot;HP1504901&quot; &quot;HP1525301T2D&quot; &quot;HP1526901T2D&quot; set.seed(1010100) multiout &lt;- fastMNN(combined, batch=combined$donor, subset.row=chosen.genes, weights=donors.per.batch) # Renaming metadata fields for easier communication later. multiout$dataset &lt;- combined$batch multiout$donor &lt;- multiout$batch multiout$batch &lt;- NULL With this approach, we see that the Segerstolpe-only clusters have disappeared (Figure 8.5). Visually, there also seems to be much greater mixing between cells from different Segerstolpe donors. This suggests that we have removed most of the donor effect, which simplifies the interpretation of our clusters. 
library(scater) g &lt;- buildSNNGraph(multiout, use.dimred=1, k=20) clusters &lt;- igraph::cluster_walktrap(g)$membership tab &lt;- table(clusters, multiout$dataset) tab ## ## clusters Grun Lawlor Muraro Seger ## 1 248 20 278 186 ## 2 338 27 258 388 ## 3 171 251 458 270 ## 4 201 242 851 887 ## 5 57 17 193 108 ## 6 24 18 108 55 ## 7 5 9 19 17 ## 8 0 1 17 4 ## 9 19 19 117 175 multiout &lt;- runTSNE(multiout, dimred=&quot;corrected&quot;) gridExtra::grid.arrange( plotTSNE(multiout, colour_by=&quot;dataset&quot;, text_by=I(clusters)), plotTSNE(multiout, colour_by=I(seger.donors)), ncol=2 ) Figure 8.5: \\(t\\)-SNE plots of the four pancreas datasets after donor-level correction with fastMNN(). Each point represents a cell and is colored according to the batch of origin (left) or the donor of origin for the Segerstolpe-derived cells (right). The cluster label is shown at the median location across all cells in the cluster. Our clusters compare well to the published annotations, indicating that we did not inadvertently discard important factors of variation during correction. (Though in this case, the cell types are so well defined, it would be quite a feat to fail to separate them!) 
proposed &lt;- c(rep(NA, ncol(sce.grun)), sce.muraro$label, sce.lawlor$`cell type`, sce.seger$CellType) proposed &lt;- tolower(proposed) proposed[proposed==&quot;gamma/pp&quot;] &lt;- &quot;gamma&quot; proposed[proposed==&quot;pp&quot;] &lt;- &quot;gamma&quot; proposed[proposed==&quot;duct&quot;] &lt;- &quot;ductal&quot; proposed[proposed==&quot;psc&quot;] &lt;- &quot;stellate&quot; table(proposed, clusters) ## clusters ## proposed 1 2 3 4 5 6 7 8 9 ## acinar 421 2 0 1 0 0 1 0 0 ## alpha 1 7 6 1869 1 0 1 0 2 ## beta 3 4 918 4 0 2 1 1 7 ## co-expression 0 0 17 22 0 0 0 0 0 ## delta 0 2 5 1 306 0 1 0 1 ## ductal 6 613 4 0 0 6 0 10 1 ## endothelial 0 0 0 0 0 1 33 0 0 ## epsilon 0 0 1 1 0 0 0 0 6 ## gamma 2 0 0 0 0 0 0 0 280 ## mesenchymal 0 1 0 0 0 79 0 0 0 ## mhc class ii 0 0 0 0 0 0 0 4 0 ## nana 2 8 1 13 1 0 1 0 0 ## none/other 0 3 1 2 0 0 4 1 1 ## stellate 0 0 0 0 0 71 1 0 0 ## unclassified 0 0 0 0 0 2 0 0 0 ## unclassified endocrine 0 0 3 3 0 0 0 0 0 ## unclear 0 4 0 0 0 0 0 0 0 Session Info View session info R version 4.1.0 beta (2021-05-03 r80259) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] bluster_1.3.0 scater_1.21.0 [3] ggplot2_3.3.3 scran_1.21.1 [5] scuttle_1.3.0 batchelor_1.9.0 [7] SingleCellExperiment_1.15.1 SummarizedExperiment_1.23.0 [9] Biobase_2.53.0 GenomicRanges_1.45.0 [11] GenomeInfoDb_1.29.0 IRanges_2.27.0 [13] S4Vectors_0.31.0 BiocGenerics_0.39.0 [15] MatrixGenerics_1.5.0 matrixStats_0.58.0 [17] BiocStyle_2.21.0 
rebook_1.3.0 loaded via a namespace (and not attached): [1] bitops_1.0-7 filelock_1.0.2 [3] tools_4.1.0 bslib_0.2.5.1 [5] utf8_1.2.1 R6_2.5.0 [7] irlba_2.3.3 ResidualMatrix_1.3.0 [9] vipor_0.4.5 DBI_1.1.1 [11] colorspace_2.0-1 withr_2.4.2 [13] gridExtra_2.3 tidyselect_1.1.1 [15] compiler_4.1.0 graph_1.71.0 [17] BiocNeighbors_1.11.0 DelayedArray_0.19.0 [19] labeling_0.4.2 bookdown_0.22 [21] sass_0.4.0 scales_1.1.1 [23] rappdirs_0.3.3 stringr_1.4.0 [25] digest_0.6.27 rmarkdown_2.8 [27] XVector_0.33.0 pkgconfig_2.0.3 [29] htmltools_0.5.1.1 sparseMatrixStats_1.5.0 [31] highr_0.9 limma_3.49.0 [33] rlang_0.4.11 DelayedMatrixStats_1.15.0 [35] farver_2.1.0 jquerylib_0.1.4 [37] generics_0.1.0 jsonlite_1.7.2 [39] BiocParallel_1.27.0 dplyr_1.0.6 [41] RCurl_1.98-1.3 magrittr_2.0.1 [43] BiocSingular_1.9.0 GenomeInfoDbData_1.2.6 [45] Matrix_1.3-3 ggbeeswarm_0.6.0 [47] Rcpp_1.0.6 munsell_0.5.0 [49] fansi_0.4.2 viridis_0.6.1 [51] lifecycle_1.0.0 stringi_1.6.2 [53] yaml_2.2.1 edgeR_3.35.0 [55] zlibbioc_1.39.0 Rtsne_0.15 [57] grid_4.1.0 dqrng_0.3.0 [59] crayon_1.4.1 dir.expiry_1.1.0 [61] lattice_0.20-44 cowplot_1.1.1 [63] beachmat_2.9.0 locfit_1.5-9.4 [65] CodeDepends_0.6.5 metapod_1.1.0 [67] knitr_1.33 pillar_1.6.1 [69] igraph_1.2.6 codetools_0.2-18 [71] ScaledMatrix_1.1.0 XML_3.99-0.6 [73] glue_1.4.2 evaluate_0.14 [75] BiocManager_1.30.15 vctrs_0.3.8 [77] purrr_0.3.4 gtable_0.3.0 [79] assertthat_0.2.1 xfun_0.23 [81] rsvd_1.0.5 viridisLite_0.4.0 [83] tibble_3.1.2 beeswarm_0.3.1 [85] cluster_2.1.2 statmod_1.4.36 [87] ellipsis_0.3.2 References "],["merged-hsc.html", "Chapter 9 Mouse HSC (multiple technologies) 9.1 Introduction 9.2 Data loading 9.3 Setting up the merge 9.4 Merging the datasets 9.5 Combined analyses Session Info", " Chapter 9 Mouse HSC (multiple technologies) .rebook-collapse { background-color: #eee; color: #444; cursor: pointer; padding: 18px; width: 100%; border: none; text-align: left; outline: none; font-size: 15px; } .rebook-content { padding: 0 18px; display: 
none; overflow: hidden; background-color: #f1f1f1; } 9.1 Introduction The blood is probably the most well-studied tissue in the single-cell field, mostly because everything is already dissociated “for free”. Of particular interest has been the use of single-cell genomics to study cell fate decisions in haematopoiesis. Indeed, it was not long ago that dueling interpretations of haematopoietic stem cell (HSC) datasets were a mainstay of single-cell conferences. Sadly, these times have mostly passed, so we will instead entertain ourselves by combining a small number of these datasets into a single analysis. 9.2 Data loading View set-up code (Workflow Chapter 10) #--- data-loading ---# library(scRNAseq) sce.nest &lt;- NestorowaHSCData() #--- gene-annotation ---# library(AnnotationHub) ens.mm.v97 &lt;- AnnotationHub()[[&quot;AH73905&quot;]] anno &lt;- select(ens.mm.v97, keys=rownames(sce.nest), keytype=&quot;GENEID&quot;, columns=c(&quot;SYMBOL&quot;, &quot;SEQNAME&quot;)) rowData(sce.nest) &lt;- anno[match(rownames(sce.nest), anno$GENEID),] #--- quality-control ---# library(scater) stats &lt;- perCellQCMetrics(sce.nest) qc &lt;- quickPerCellQC(stats, percent_subsets=&quot;altexps_ERCC_percent&quot;) sce.nest &lt;- sce.nest[,!qc$discard] #--- normalization ---# library(scran) set.seed(101000110) clusters &lt;- quickCluster(sce.nest) sce.nest &lt;- computeSumFactors(sce.nest, clusters=clusters) sce.nest &lt;- logNormCounts(sce.nest) #--- variance-modelling ---# set.seed(00010101) dec.nest &lt;- modelGeneVarWithSpikes(sce.nest, &quot;ERCC&quot;) top.nest &lt;- getTopHVGs(dec.nest, prop=0.1) sce.nest ## class: SingleCellExperiment ## dim: 46078 1656 ## metadata(0): ## assays(2): counts logcounts ## rownames(46078): ENSMUSG00000000001 ENSMUSG00000000003 ... ## ENSMUSG00000107391 ENSMUSG00000107392 ## rowData names(3): GENEID SYMBOL SEQNAME ## colnames(1656): HSPC_025 HSPC_031 ... 
Prog_852 Prog_810 ## colData names(3): cell.type FACS sizeFactor ## reducedDimNames(1): diffusion ## mainExpName: endogenous ## altExpNames(1): ERCC The Grun dataset requires a little bit of subsetting and re-analysis to only consider the sorted HSCs. View set-up code (Workflow Chapter 9) #--- data-loading ---# library(scRNAseq) sce.grun.hsc &lt;- GrunHSCData(ensembl=TRUE) #--- gene-annotation ---# library(AnnotationHub) ens.mm.v97 &lt;- AnnotationHub()[[&quot;AH73905&quot;]] anno &lt;- select(ens.mm.v97, keys=rownames(sce.grun.hsc), keytype=&quot;GENEID&quot;, columns=c(&quot;SYMBOL&quot;, &quot;SEQNAME&quot;)) rowData(sce.grun.hsc) &lt;- anno[match(rownames(sce.grun.hsc), anno$GENEID),] #--- quality-control ---# library(scuttle) stats &lt;- perCellQCMetrics(sce.grun.hsc) qc &lt;- quickPerCellQC(stats, batch=sce.grun.hsc$protocol, subset=grepl(&quot;sorted&quot;, sce.grun.hsc$protocol)) sce.grun.hsc &lt;- sce.grun.hsc[,!qc$discard] library(scuttle) sce.grun.hsc &lt;- sce.grun.hsc[,sce.grun.hsc$protocol==&quot;sorted hematopoietic stem cells&quot;] sce.grun.hsc &lt;- logNormCounts(sce.grun.hsc) set.seed(11001) library(scran) dec.grun.hsc &lt;- modelGeneVarByPoisson(sce.grun.hsc) Finally, we will grab the Paul dataset, which we will also subset to only consider the unsorted myeloid population. This removes the various knockout conditions, which would only complicate matters. 
View set-up code (Workflow Chapter 11) #--- data-loading ---# library(scRNAseq) sce.paul &lt;- PaulHSCData(ensembl=TRUE) #--- gene-annotation ---# library(AnnotationHub) ens.mm.v97 &lt;- AnnotationHub()[[&quot;AH73905&quot;]] anno &lt;- select(ens.mm.v97, keys=rownames(sce.paul), keytype=&quot;GENEID&quot;, columns=c(&quot;SYMBOL&quot;, &quot;SEQNAME&quot;)) rowData(sce.paul) &lt;- anno[match(rownames(sce.paul), anno$GENEID),] #--- quality-control ---# library(scater) stats &lt;- perCellQCMetrics(sce.paul) qc &lt;- quickPerCellQC(stats, batch=sce.paul$Plate_ID) # Detecting batches with unusually low threshold values. lib.thresholds &lt;- attr(qc$low_lib_size, &quot;thresholds&quot;)[&quot;lower&quot;,] nfeat.thresholds &lt;- attr(qc$low_n_features, &quot;thresholds&quot;)[&quot;lower&quot;,] ignore &lt;- union(names(lib.thresholds)[lib.thresholds &lt; 100], names(nfeat.thresholds)[nfeat.thresholds &lt; 100]) # Repeating the QC using only the &quot;high-quality&quot; batches. qc2 &lt;- quickPerCellQC(stats, batch=sce.paul$Plate_ID, subset=!sce.paul$Plate_ID %in% ignore) sce.paul &lt;- sce.paul[,!qc2$discard] sce.paul &lt;- sce.paul[,sce.paul$Batch_desc==&quot;Unsorted myeloid&quot;] sce.paul &lt;- logNormCounts(sce.paul) set.seed(00010010) dec.paul &lt;- modelGeneVarByPoisson(sce.paul) 9.3 Setting up the merge common &lt;- Reduce(intersect, list(rownames(sce.nest), rownames(sce.grun.hsc), rownames(sce.paul))) length(common) ## [1] 17147 We combine the per-dataset variance decompositions to obtain a single set of HVGs. combined.dec &lt;- combineVar( dec.nest[common,], dec.grun.hsc[common,], dec.paul[common,] ) hvgs &lt;- getTopHVGs(combined.dec, n=5000) We then adjust for gross differences in sequencing depth between the batches. library(batchelor) normed.sce &lt;- multiBatchNorm( Nestorowa=sce.nest[common,], Grun=sce.grun.hsc[common,], Paul=sce.paul[common,] ) 9.4 Merging the datasets We turn on auto.merge=TRUE to instruct fastMNN() to choose, at each step, the merge that offers the largest number of MNNs. 
This aims to perform the “easiest” merges first, i.e., between the most replicate-like batches, before tackling merges between batches that have greater differences in their population composition. set.seed(1000010) merged &lt;- fastMNN(normed.sce, subset.row=hvgs, auto.merge=TRUE) The proportion of variance lost inside each batch is small, staying below 2% at each merge step. We also observe that the algorithm chose to merge the more diverse Nestorowa and Paul datasets before dealing with the HSC-only Grun dataset. metadata(merged)$merge.info[,c(&quot;left&quot;, &quot;right&quot;, &quot;lost.var&quot;)] ## DataFrame with 2 rows and 3 columns ## left right lost.var ## &lt;List&gt; &lt;List&gt; &lt;matrix&gt; ## 1 Paul Nestorowa 0.01069374:0.0000000:0.00739465 ## 2 Paul,Nestorowa Grun 0.00562344:0.0178334:0.00702615 9.5 Combined analyses The Grun dataset does not contribute to many clusters, consistent with a pure undifferentiated HSC population. Most of the other clusters contain contributions from the Nestorowa and Paul datasets, though some are unique to the Paul dataset. This may be due to incomplete correction, though we tend to think that these are Paul-specific subpopulations, given that the Nestorowa dataset does not have similarly sized unique clusters that might represent their uncorrected counterparts. library(bluster) colLabels(merged) &lt;- clusterRows(reducedDim(merged), NNGraphParam(cluster.fun=&quot;louvain&quot;)) table(Cluster=colLabels(merged), Batch=merged$batch) ## Batch ## Cluster Grun Nestorowa Paul ## 1 0 40 206 ## 2 0 19 0 ## 3 39 353 146 ## 4 0 6 29 ## 5 0 217 487 ## 6 0 162 522 ## 7 0 133 191 ## 8 22 411 94 ## 9 230 315 348 ## 10 0 0 385 ## 11 0 0 397 While I prefer \\(t\\)-SNE plots, we’ll switch to a UMAP plot to highlight some of the trajectory-like structure across clusters (Figure 9.1). 
library(scater) set.seed(101010101) merged &lt;- runUMAP(merged, dimred=&quot;corrected&quot;) gridExtra::grid.arrange( plotUMAP(merged, colour_by=&quot;label&quot;), plotUMAP(merged, colour_by=&quot;batch&quot;), ncol=2 ) Figure 9.1: Obligatory UMAP plot of the merged HSC datasets, where each point represents a cell and is colored by its assigned cluster (left) or the batch of origin (right). In fact, we might as well compute a trajectory right now. TSCAN constructs a reasonable minimum spanning tree but the path choices are somewhat incongruent with the UMAP coordinates (Figure 9.2). This is most likely due to the fact that TSCAN operates on cluster centroids, which is simple and efficient but does not consider the variance of cells within each cluster. It is entirely possible for the centroids of two well-separated clusters to be closer than the centroids of two adjacent clusters if the latter span a wider region of the coordinate space. library(TSCAN) pseudo.out &lt;- quickPseudotime(merged, use.dimred=&quot;corrected&quot;, outgroup=TRUE) common.pseudo &lt;- averagePseudotime(pseudo.out$ordering) plotUMAP(merged, colour_by=I(common.pseudo), text_by=&quot;label&quot;, text_colour=&quot;red&quot;) + geom_line(data=pseudo.out$connected$UMAP, mapping=aes(x=dim1, y=dim2, group=edge)) Figure 9.2: Another UMAP plot of the merged HSC datasets, where each point represents a cell and is colored by its TSCAN pseudotime. The lines correspond to the edges of the MST across cluster centers. To fix this, we construct the minimum spanning tree using distances based on pairs of mutual nearest neighbors between clusters. This focuses on the closeness of the boundaries of each pair of clusters rather than their centroids, ensuring that adjacent clusters are connected even if their centroids are far apart. Doing so yields a trajectory that is more consistent with the visual connections on the UMAP plot (Figure 9.3). 
pseudo.out2 &lt;- quickPseudotime(merged, use.dimred=&quot;corrected&quot;, dist.method=&quot;mnn&quot;, outgroup=TRUE) common.pseudo2 &lt;- averagePseudotime(pseudo.out2$ordering) plotUMAP(merged, colour_by=I(common.pseudo2), text_by=&quot;label&quot;, text_colour=&quot;red&quot;) + geom_line(data=pseudo.out2$connected$UMAP, mapping=aes(x=dim1, y=dim2, group=edge)) Figure 9.3: Yet another UMAP plot of the merged HSC datasets, where each point represents a cell and is colored by its TSCAN pseudotime. The lines correspond to the edges of the MST across cluster centers. Session Info View session info R version 4.1.0 beta (2021-05-03 r80259) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] TSCAN_1.31.0 TrajectoryUtils_1.1.0 [3] scater_1.21.0 ggplot2_3.3.3 [5] bluster_1.3.0 batchelor_1.9.0 [7] scran_1.21.1 scuttle_1.3.0 [9] SingleCellExperiment_1.15.1 SummarizedExperiment_1.23.0 [11] Biobase_2.53.0 GenomicRanges_1.45.0 [13] GenomeInfoDb_1.29.0 IRanges_2.27.0 [15] S4Vectors_0.31.0 BiocGenerics_0.39.0 [17] MatrixGenerics_1.5.0 matrixStats_0.58.0 [19] BiocStyle_2.21.0 rebook_1.3.0 loaded via a namespace (and not attached): [1] ggbeeswarm_0.6.0 colorspace_2.0-1 [3] ellipsis_0.3.2 mclust_5.4.7 [5] XVector_0.33.0 BiocNeighbors_1.11.0 [7] farver_2.1.0 fansi_0.4.2 [9] splines_4.1.0 codetools_0.2-18 [11] sparseMatrixStats_1.5.0 knitr_1.33 [13] jsonlite_1.7.2 ResidualMatrix_1.3.0 [15] cluster_2.1.2 graph_1.71.0 [17] uwot_0.1.10 shiny_1.6.0 [19] 
BiocManager_1.30.15 compiler_4.1.0 [21] dqrng_0.3.0 fastmap_1.1.0 [23] assertthat_0.2.1 Matrix_1.3-3 [25] limma_3.49.0 later_1.2.0 [27] BiocSingular_1.9.0 htmltools_0.5.1.1 [29] tools_4.1.0 rsvd_1.0.5 [31] igraph_1.2.6 gtable_0.3.0 [33] glue_1.4.2 GenomeInfoDbData_1.2.6 [35] dplyr_1.0.6 rappdirs_0.3.3 [37] Rcpp_1.0.6 jquerylib_0.1.4 [39] vctrs_0.3.8 nlme_3.1-152 [41] DelayedMatrixStats_1.15.0 xfun_0.23 [43] stringr_1.4.0 beachmat_2.9.0 [45] mime_0.10 lifecycle_1.0.0 [47] irlba_2.3.3 gtools_3.8.2 [49] statmod_1.4.36 XML_3.99-0.6 [51] edgeR_3.35.0 zlibbioc_1.39.0 [53] scales_1.1.1 promises_1.2.0.1 [55] yaml_2.2.1 gridExtra_2.3 [57] sass_0.4.0 fastICA_1.2-2 [59] stringi_1.6.2 highr_0.9 [61] ScaledMatrix_1.1.0 caTools_1.18.2 [63] filelock_1.0.2 BiocParallel_1.27.0 [65] rlang_0.4.11 pkgconfig_2.0.3 [67] bitops_1.0-7 evaluate_0.14 [69] lattice_0.20-44 purrr_0.3.4 [71] CodeDepends_0.6.5 labeling_0.4.2 [73] cowplot_1.1.1 tidyselect_1.1.1 [75] RcppAnnoy_0.0.18 plyr_1.8.6 [77] magrittr_2.0.1 bookdown_0.22 [79] R6_2.5.0 gplots_3.1.1 [81] generics_0.1.0 metapod_1.1.0 [83] combinat_0.0-8 DelayedArray_0.19.0 [85] DBI_1.1.1 mgcv_1.8-35 [87] pillar_1.6.1 withr_2.4.2 [89] RCurl_1.98-1.3 tibble_3.1.2 [91] dir.expiry_1.1.0 crayon_1.4.1 [93] KernSmooth_2.23-20 utf8_1.2.1 [95] rmarkdown_2.8 viridis_0.6.1 [97] locfit_1.5-9.4 grid_4.1.0 [99] digest_0.6.27 xtable_1.8-4 [101] httpuv_1.6.1 munsell_0.5.0 [103] beeswarm_0.3.1 viridisLite_0.4.0 [105] vipor_0.4.5 bslib_0.2.5.1 "],["chimeric-mouse-embryo-10x-genomics.html", "Chapter 10 Chimeric mouse embryo (10X Genomics) 10.1 Introduction 10.2 Data loading 10.3 Quality control 10.4 Normalization 10.5 Variance modelling 10.6 Merging 10.7 Clustering 10.8 Dimensionality reduction Session Info", " Chapter 10 Chimeric mouse embryo (10X Genomics) .rebook-collapse { background-color: #eee; color: #444; cursor: pointer; padding: 18px; width: 100%; border: none; text-align: left; outline: none; font-size: 15px; } .rebook-content { padding: 0 18px; 
display: none; overflow: hidden; background-color: #f1f1f1; } 10.1 Introduction This performs an analysis of the Pijuan-Sala et al. (2019) dataset on mouse gastrulation. Here, we examine chimeric embryos at the E8.5 stage of development where td-Tomato-positive embryonic stem cells (ESCs) were injected into a wild-type blastocyst. 10.2 Data loading library(MouseGastrulationData) sce.chimera &lt;- WTChimeraData(samples=5:10) sce.chimera ## class: SingleCellExperiment ## dim: 29453 20935 ## metadata(0): ## assays(1): counts ## rownames(29453): ENSMUSG00000051951 ENSMUSG00000089699 ... ## ENSMUSG00000095742 tomato-td ## rowData names(2): ENSEMBL SYMBOL ## colnames(20935): cell_9769 cell_9770 ... cell_30702 cell_30703 ## colData names(11): cell barcode ... doub.density sizeFactor ## reducedDimNames(2): pca.corrected.E7.5 pca.corrected.E8.5 ## mainExpName: NULL ## altExpNames(0): library(scater) rownames(sce.chimera) &lt;- uniquifyFeatureNames( rowData(sce.chimera)$ENSEMBL, rowData(sce.chimera)$SYMBOL) 10.3 Quality control Quality control on the cells has already been performed by the authors, so we will not repeat it here. We additionally remove cells that are labelled as stripped nuclei or doublets. drop &lt;- sce.chimera$celltype.mapped %in% c(&quot;stripped&quot;, &quot;Doublet&quot;) sce.chimera &lt;- sce.chimera[,!drop] 10.4 Normalization We use the pre-computed size factors in sce.chimera. sce.chimera &lt;- logNormCounts(sce.chimera) 10.5 Variance modelling We retain all genes with any positive biological component, to preserve as much signal as possible across a very heterogeneous dataset. 
library(scran) dec.chimera &lt;- modelGeneVar(sce.chimera, block=sce.chimera$sample) chosen.hvgs &lt;- dec.chimera$bio &gt; 0 par(mfrow=c(1,2)) blocked.stats &lt;- dec.chimera$per.block for (i in colnames(blocked.stats)) { current &lt;- blocked.stats[[i]] plot(current$mean, current$total, main=i, pch=16, cex=0.5, xlab=&quot;Mean of log-expression&quot;, ylab=&quot;Variance of log-expression&quot;) curfit &lt;- metadata(current) curve(curfit$trend(x), col=&#39;dodgerblue&#39;, add=TRUE, lwd=2) } Figures 10.1 to 10.3: Per-gene variance as a function of the mean for the log-expression values in the Pijuan-Sala chimeric mouse embryo dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to the variances. 10.6 Merging We use a hierarchical merge to first merge together replicates with the same genotype, and then merge samples across different genotypes. 
library(batchelor) set.seed(01001001) merged &lt;- correctExperiments(sce.chimera, batch=sce.chimera$sample, subset.row=chosen.hvgs, PARAM=FastMnnParam( merge.order=list( list(1,3,5), # WT (3 replicates) list(2,4,6) # td-Tomato (3 replicates) ) ) ) We use the percentage of variance lost as a diagnostic: metadata(merged)$merge.info$lost.var ## 5 6 7 8 9 10 ## [1,] 0.000e+00 0.0204433 0.000e+00 0.0169567 0.000000 0.000000 ## [2,] 0.000e+00 0.0007389 0.000e+00 0.0004409 0.000000 0.015474 ## [3,] 3.090e-02 0.0000000 2.012e-02 0.0000000 0.000000 0.000000 ## [4,] 9.024e-05 0.0000000 8.272e-05 0.0000000 0.018047 0.000000 ## [5,] 4.321e-03 0.0072518 4.124e-03 0.0078280 0.003831 0.007786 10.7 Clustering g &lt;- buildSNNGraph(merged, use.dimred=&quot;corrected&quot;) clusters &lt;- igraph::cluster_louvain(g) colLabels(merged) &lt;- factor(clusters$membership) We examine the distribution of cells across clusters and samples. table(Cluster=colLabels(merged), Sample=merged$sample) ## Sample ## Cluster 5 6 7 8 9 10 ## 1 152 72 85 88 164 386 ## 2 19 7 13 17 20 36 ## 3 130 96 109 63 159 311 ## 4 43 35 81 81 87 353 ## 5 68 31 120 107 83 197 ## 6 122 65 64 52 63 141 ## 7 187 113 322 587 458 541 ## 8 47 22 84 50 90 131 ## 9 182 47 231 192 216 391 ## 10 95 19 36 18 50 34 ## 11 110 69 73 96 127 252 ## 12 9 7 18 13 30 27 ## 13 0 2 0 51 0 5 ## 14 38 39 50 47 126 123 ## 15 98 16 164 125 368 273 ## 16 146 37 132 110 231 216 ## 17 114 43 44 37 40 154 ## 18 78 45 189 119 340 493 ## 19 86 20 64 54 153 77 ## 20 159 77 137 101 147 401 ## 21 2 1 7 3 65 133 ## 22 11 16 20 9 47 57 ## 23 1 5 0 84 0 66 ## 24 170 47 282 173 426 542 ## 25 109 23 117 55 271 285 ## 26 122 72 298 572 296 776 10.8 Dimensionality reduction We use an external algorithm to compute nearest neighbors for greater speed. 
merged &lt;- runTSNE(merged, dimred=&quot;corrected&quot;, external_neighbors=TRUE) merged &lt;- runUMAP(merged, dimred=&quot;corrected&quot;, external_neighbors=TRUE) gridExtra::grid.arrange( plotTSNE(merged, colour_by=&quot;label&quot;, text_by=&quot;label&quot;, text_col=&quot;red&quot;), plotTSNE(merged, colour_by=&quot;batch&quot;) ) Figure 10.4: Obligatory \\(t\\)-SNE plots of the Pijuan-Sala chimeric mouse embryo dataset, where each point represents a cell and is colored according to the assigned cluster (top) or sample of origin (bottom). Session Info View session info R version 4.1.0 beta (2021-05-03 r80259) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] batchelor_1.9.0 scran_1.21.1 [3] scater_1.21.0 ggplot2_3.3.3 [5] scuttle_1.3.0 MouseGastrulationData_1.7.0 [7] SpatialExperiment_1.3.0 SingleCellExperiment_1.15.1 [9] SummarizedExperiment_1.23.0 Biobase_2.53.0 [11] GenomicRanges_1.45.0 GenomeInfoDb_1.29.0 [13] IRanges_2.27.0 S4Vectors_0.31.0 [15] BiocGenerics_0.39.0 MatrixGenerics_1.5.0 [17] matrixStats_0.58.0 BiocStyle_2.21.0 [19] rebook_1.3.0 loaded via a namespace (and not attached): [1] AnnotationHub_3.1.0 BiocFileCache_2.1.0 [3] igraph_1.2.6 BiocParallel_1.27.0 [5] digest_0.6.27 BumpyMatrix_1.1.0 [7] htmltools_0.5.1.1 viridis_0.6.1 [9] magick_2.7.2 fansi_0.4.2 [11] magrittr_2.0.1 memoise_2.0.0 [13] ScaledMatrix_1.1.0 cluster_2.1.2 [15] limma_3.49.0 Biostrings_2.61.0 [17] R.utils_2.10.1 colorspace_2.0-1 [19] blob_1.2.1 
rappdirs_0.3.3 [21] xfun_0.23 dplyr_1.0.6 [23] crayon_1.4.1 RCurl_1.98-1.3 [25] jsonlite_1.7.2 graph_1.71.0 [27] glue_1.4.2 gtable_0.3.0 [29] zlibbioc_1.39.0 XVector_0.33.0 [31] DelayedArray_0.19.0 BiocSingular_1.9.0 [33] DropletUtils_1.13.0 Rhdf5lib_1.15.0 [35] HDF5Array_1.21.0 scales_1.1.1 [37] DBI_1.1.1 edgeR_3.35.0 [39] Rcpp_1.0.6 viridisLite_0.4.0 [41] xtable_1.8-4 dqrng_0.3.0 [43] bit_4.0.4 rsvd_1.0.5 [45] ResidualMatrix_1.3.0 metapod_1.1.0 [47] httr_1.4.2 dir.expiry_1.1.0 [49] ellipsis_0.3.2 farver_2.1.0 [51] pkgconfig_2.0.3 XML_3.99-0.6 [53] R.methodsS3_1.8.1 uwot_0.1.10 [55] CodeDepends_0.6.5 sass_0.4.0 [57] dbplyr_2.1.1 locfit_1.5-9.4 [59] utf8_1.2.1 labeling_0.4.2 [61] tidyselect_1.1.1 rlang_0.4.11 [63] later_1.2.0 AnnotationDbi_1.55.0 [65] munsell_0.5.0 BiocVersion_3.14.0 [67] tools_4.1.0 cachem_1.0.5 [69] generics_0.1.0 RSQLite_2.2.7 [71] ExperimentHub_2.1.0 evaluate_0.14 [73] stringr_1.4.0 fastmap_1.1.0 [75] yaml_2.2.1 knitr_1.33 [77] bit64_4.0.5 purrr_0.3.4 [79] KEGGREST_1.33.0 sparseMatrixStats_1.5.0 [81] mime_0.10 R.oo_1.24.0 [83] compiler_4.1.0 beeswarm_0.3.1 [85] filelock_1.0.2 curl_4.3.1 [87] png_0.1-7 interactiveDisplayBase_1.31.0 [89] statmod_1.4.36 tibble_3.1.2 [91] bslib_0.2.5.1 stringi_1.6.2 [93] highr_0.9 lattice_0.20-44 [95] bluster_1.3.0 Matrix_1.3-3 [97] vctrs_0.3.8 pillar_1.6.1 [99] lifecycle_1.0.0 rhdf5filters_1.5.0 [101] BiocManager_1.30.15 jquerylib_0.1.4 [103] BiocNeighbors_1.11.0 cowplot_1.1.1 [105] bitops_1.0-7 irlba_2.3.3 [107] httpuv_1.6.1 R6_2.5.0 [109] bookdown_0.22 promises_1.2.0.1 [111] gridExtra_2.3 vipor_0.4.5 [113] codetools_0.2-18 assertthat_0.2.1 [115] rhdf5_2.37.0 rjson_0.2.20 [117] withr_2.4.2 GenomeInfoDbData_1.2.6 [119] grid_4.1.0 beachmat_2.9.0 [121] rmarkdown_2.8 DelayedMatrixStats_1.15.0 [123] Rtsne_0.15 shiny_1.6.0 [125] ggbeeswarm_0.6.0 "]]