[["index.html", "Single-Cell Analysis Workflows with Bioconductor Welcome", " Single-Cell Analysis Workflows with Bioconductor Authors: Robert Amezquita [aut], Aaron Lun [aut], Stephanie Hicks [aut], Raphael Gottardo [aut], Peter Hickey [cre] Version: 1.8.0 Modified: 2022-10-12 Compiled: 2023-04-26 Environment: R version 4.3.0 RC (2023-04-13 r84269), Bioconductor 3.17 License: CC BY 4.0 Copyright: Bioconductor, 2020 Source: https://github.com/OSCA-source/OSCA.workflows Welcome This site contains the workflow chapters of the “Orchestrating Single-Cell Analysis with Bioconductor” book. It presents worked case studies of analyses of a variety of single-cell datasets, each proceeding from a SingleCellExperiment object. Exposition is generally minimal other than for dataset-specific justifications for parameter tweaks; refer to the other books in the OSCA collection for a detailed explanation of the theoretical basis of each step. It is intended for readers who already know the background and just want some code to copy and paste into their own analyses. "],["lun-416b-cell-line-smart-seq2.html", "Chapter 1 Lun 416B cell line (Smart-seq2) 1.1 Introduction 1.2 Data loading 1.3 Quality control 1.4 Normalization 1.5 Variance modelling 1.6 Batch correction 1.7 Dimensionality reduction 1.8 Clustering 1.9 Interpretation Session Info", " Chapter 1 Lun 416B cell line (Smart-seq2) 1.1 Introduction The Lun et al. (2017) dataset contains two 96-well plates of 416B cells (an immortalized mouse myeloid progenitor cell line), processed using the Smart-seq2 protocol (Picelli et al. 2014). 
A constant amount of spike-in RNA from the External RNA Controls Consortium (ERCC) was also added to each cell’s lysate prior to library preparation. High-throughput sequencing was performed and the expression of each gene was quantified by counting the total number of reads mapped to its exonic regions. Similarly, the quantity of each spike-in transcript was measured by counting the number of reads mapped to the spike-in reference sequences. 1.2 Data loading We convert the blocking factor to a factor so that downstream steps do not treat it as an integer. library(scRNAseq) sce.416b &lt;- LunSpikeInData(which=&quot;416b&quot;) sce.416b$block &lt;- factor(sce.416b$block) We rename the rows of our SingleCellExperiment with the symbols, reverting to Ensembl identifiers for missing or duplicate symbols. library(AnnotationHub) ens.mm.v97 &lt;- AnnotationHub()[[&quot;AH73905&quot;]] rowData(sce.416b)$ENSEMBL &lt;- rownames(sce.416b) rowData(sce.416b)$SYMBOL &lt;- mapIds(ens.mm.v97, keys=rownames(sce.416b), keytype=&quot;GENEID&quot;, column=&quot;SYMBOL&quot;) rowData(sce.416b)$SEQNAME &lt;- mapIds(ens.mm.v97, keys=rownames(sce.416b), keytype=&quot;GENEID&quot;, column=&quot;SEQNAME&quot;) library(scater) rownames(sce.416b) &lt;- uniquifyFeatureNames(rowData(sce.416b)$ENSEMBL, rowData(sce.416b)$SYMBOL) 1.3 Quality control We save an unfiltered copy of the SingleCellExperiment for later use. unfiltered &lt;- sce.416b Technically, we do not need to use the mitochondrial proportions as we already have the spike-in proportions (which serve a similar purpose) for this dataset. However, it probably doesn’t do any harm to include it anyway. 
mito &lt;- which(rowData(sce.416b)$SEQNAME==&quot;MT&quot;) stats &lt;- perCellQCMetrics(sce.416b, subsets=list(Mt=mito)) qc &lt;- quickPerCellQC(stats, percent_subsets=c(&quot;subsets_Mt_percent&quot;, &quot;altexps_ERCC_percent&quot;), batch=sce.416b$block) sce.416b &lt;- sce.416b[,!qc$discard] colData(unfiltered) &lt;- cbind(colData(unfiltered), stats) unfiltered$block &lt;- factor(unfiltered$block) unfiltered$discard &lt;- qc$discard gridExtra::grid.arrange( plotColData(unfiltered, x=&quot;block&quot;, y=&quot;sum&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Total count&quot;), plotColData(unfiltered, x=&quot;block&quot;, y=&quot;detected&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Detected features&quot;), plotColData(unfiltered, x=&quot;block&quot;, y=&quot;subsets_Mt_percent&quot;, colour_by=&quot;discard&quot;) + ggtitle(&quot;Mito percent&quot;), plotColData(unfiltered, x=&quot;block&quot;, y=&quot;altexps_ERCC_percent&quot;, colour_by=&quot;discard&quot;) + ggtitle(&quot;ERCC percent&quot;), nrow=2, ncol=2 ) Figure 1.1: Distribution of each QC metric across cells in the 416B dataset, stratified by the plate of origin. Each point represents a cell and is colored according to whether that cell was discarded. gridExtra::grid.arrange( plotColData(unfiltered, x=&quot;sum&quot;, y=&quot;subsets_Mt_percent&quot;, colour_by=&quot;discard&quot;) + scale_x_log10(), plotColData(unfiltered, x=&quot;altexps_ERCC_percent&quot;, y=&quot;subsets_Mt_percent&quot;, colour_by=&quot;discard&quot;), ncol=2 ) Figure 1.2: Percentage of mitochondrial reads in each cell in the 416B dataset, compared to the total count (left) or the percentage of spike-in reads (right). Each point represents a cell and is colored according to whether that cell was discarded. We also examine the number of cells removed for each reason. 
colSums(as.matrix(qc)) ## low_lib_size low_n_features high_subsets_Mt_percent ## 5 0 2 ## high_altexps_ERCC_percent discard ## 2 7 1.4 Normalization No pre-clustering is performed here, as the dataset is small and all cells are derived from the same cell line anyway. library(scran) sce.416b &lt;- computeSumFactors(sce.416b) sce.416b &lt;- logNormCounts(sce.416b) summary(sizeFactors(sce.416b)) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.347 0.711 0.921 1.000 1.152 3.604 We see that the induced cells have size factors that are systematically shifted from the uninduced cells, consistent with the presence of a composition bias (Figure 1.3). plot(librarySizeFactors(sce.416b), sizeFactors(sce.416b), pch=16, xlab=&quot;Library size factors&quot;, ylab=&quot;Deconvolution factors&quot;, col=c(&quot;black&quot;, &quot;red&quot;)[grepl(&quot;induced&quot;, sce.416b$phenotype)+1], log=&quot;xy&quot;) Figure 1.3: Relationship between the library size factors and the deconvolution size factors in the 416B dataset. Each cell is colored according to its oncogene induction status. 1.5 Variance modelling We block on the plate of origin to minimize plate effects, and then we take the top 10% of genes with the largest biological components. dec.416b &lt;- modelGeneVarWithSpikes(sce.416b, &quot;ERCC&quot;, block=sce.416b$block) chosen.hvgs &lt;- getTopHVGs(dec.416b, prop=0.1) par(mfrow=c(1,2)) blocked.stats &lt;- dec.416b$per.block for (i in colnames(blocked.stats)) { current &lt;- blocked.stats[[i]] plot(current$mean, current$total, main=i, pch=16, cex=0.5, xlab=&quot;Mean of log-expression&quot;, ylab=&quot;Variance of log-expression&quot;) curfit &lt;- metadata(current) points(curfit$mean, curfit$var, col=&quot;red&quot;, pch=16) curve(curfit$trend(x), col=&#39;dodgerblue&#39;, add=TRUE, lwd=2) } Figure 1.4: Per-gene variance as a function of the mean for the log-expression values in the 416B dataset. 
Each point represents a gene (black) with the mean-variance trend (blue) fitted to the spike-in transcripts (red). This was performed separately for each plate. 1.6 Batch correction The composition of cells is expected to be the same across the two plates, hence the use of removeBatchEffect() rather than more complex methods. For larger datasets, consider using regressBatches() from the batchelor package. library(limma) assay(sce.416b, &quot;corrected&quot;) &lt;- removeBatchEffect(logcounts(sce.416b), design=model.matrix(~sce.416b$phenotype), batch=sce.416b$block) 1.7 Dimensionality reduction We do not expect a great deal of heterogeneity in this dataset, so we only request 10 PCs. We use an exact SVD to avoid warnings from irlba about handling small datasets. sce.416b &lt;- runPCA(sce.416b, ncomponents=10, subset_row=chosen.hvgs, exprs_values=&quot;corrected&quot;, BSPARAM=BiocSingular::ExactParam()) set.seed(1010) sce.416b &lt;- runTSNE(sce.416b, dimred=&quot;PCA&quot;, perplexity=10) 1.8 Clustering my.dist &lt;- dist(reducedDim(sce.416b, &quot;PCA&quot;)) my.tree &lt;- hclust(my.dist, method=&quot;ward.D2&quot;) library(dynamicTreeCut) my.clusters &lt;- unname(cutreeDynamic(my.tree, distM=as.matrix(my.dist), minClusterSize=10, verbose=0)) colLabels(sce.416b) &lt;- factor(my.clusters) We compare the clusters to the plate of origin. Each cluster contains cells from both batches, indicating that the clustering is not driven by a batch effect. table(Cluster=colLabels(sce.416b), Plate=sce.416b$block) ## Plate ## Cluster 20160113 20160325 ## 1 40 38 ## 2 37 32 ## 3 10 14 ## 4 6 8 We compare the clusters to the oncogene induction status. We observe differences in the composition of each cluster, consistent with a biological effect of oncogene induction. 
table(Cluster=colLabels(sce.416b), Oncogene=sce.416b$phenotype) ## Oncogene ## Cluster induced CBFB-MYH11 oncogene expression wild type phenotype ## 1 78 0 ## 2 0 69 ## 3 1 23 ## 4 14 0 plotTSNE(sce.416b, colour_by=&quot;label&quot;) Figure 1.5: Obligatory \\(t\\)-SNE plot of the 416B dataset, where each point represents a cell and is colored according to the assigned cluster. Most cells have relatively small positive widths in Figure 1.6, indicating that the separation between clusters is weak. This may be symptomatic of over-clustering where clusters that are clearly defined on oncogene induction status are further split into subsets that are less well separated. Nonetheless, we will proceed with the current clustering scheme as it provides reasonable partitions for further characterization of heterogeneity. library(cluster) clust.col &lt;- scater:::.get_palette(&quot;tableau10medium&quot;) # hidden scater colours sil &lt;- silhouette(my.clusters, dist = my.dist) sil.cols &lt;- clust.col[ifelse(sil[,3] &gt; 0, sil[,1], sil[,2])] sil.cols &lt;- sil.cols[order(-sil[,1], sil[,3])] plot(sil, main = paste(length(unique(my.clusters)), &quot;clusters&quot;), border=sil.cols, col=sil.cols, do.col.sort=FALSE) Figure 1.6: Silhouette plot for the hierarchical clustering of the 416B dataset. Each bar represents the silhouette width for a cell and is colored according to the assigned cluster (if positive) or the closest cluster (if negative). 
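The silhouette widths plotted in Figure 1.6 can also be summarized as one average per cluster. The following is a minimal sketch on simulated data, not on the 416B dataset: the matrix `toy.pcs`, the two simulated groups and the choice of `k=2` are assumptions made purely for illustration. It relies only on `dist()`, `hclust()`, `cutree()` and `silhouette()` from the cluster package.

```r
library(cluster)

# Hypothetical toy data: two well-separated groups of 20 'cells' in 5 dimensions.
set.seed(42)
toy.pcs <- rbind(
    matrix(rnorm(100, mean=0), ncol=5),
    matrix(rnorm(100, mean=10), ncol=5)
)

# Same distance and linkage choices as in the clustering step above.
toy.dist <- dist(toy.pcs)
toy.tree <- hclust(toy.dist, method='ward.D2')
toy.clusters <- cutree(toy.tree, k=2)

# Column 1 of the silhouette object holds the assigned cluster,
# column 3 holds the silhouette width for each observation.
toy.sil <- silhouette(toy.clusters, dist=toy.dist)
avg.width <- tapply(toy.sil[,3], toy.sil[,1], mean)
avg.width
```

Average widths close to 1 point to well-separated clusters, whereas values near zero correspond to the weak separation described for many cells in Figure 1.6.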
1.9 Interpretation markers &lt;- findMarkers(sce.416b, my.clusters, block=sce.416b$block) marker.set &lt;- markers[[&quot;1&quot;]] head(marker.set, 10) ## DataFrame with 10 rows and 7 columns ## Top p.value FDR summary.logFC logFC.2 logFC.3 ## &lt;integer&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## Ccna2 1 9.85422e-67 4.59246e-62 -7.13310 -7.13310 -2.20632 ## Cdca8 1 1.01449e-41 1.52514e-38 -7.26175 -6.00378 -2.03841 ## Pirb 1 4.16555e-33 1.95516e-30 5.87820 5.28149 5.87820 ## Cks1b 2 2.98233e-40 3.23229e-37 -6.43381 -6.43381 -4.15385 ## Aurkb 2 2.41436e-64 5.62593e-60 -6.94063 -6.94063 -1.65534 ## Myh11 2 1.28865e-46 3.75353e-43 4.38182 4.38182 4.29290 ## Mcm6 3 1.15877e-28 3.69887e-26 -5.44558 -5.44558 -5.82130 ## Cdca3 3 5.02047e-45 1.23144e-41 -6.22179 -6.22179 -2.10502 ## Top2a 3 7.25965e-61 1.12776e-56 -7.07811 -7.07811 -2.39123 ## Mcm2 4 1.50854e-33 7.98908e-31 -5.54197 -5.54197 -6.09178 ## logFC.4 ## &lt;numeric&gt; ## Ccna2 -7.3451052 ## Cdca8 -7.2617478 ## Pirb 0.0352849 ## Cks1b -6.4385323 ## Aurkb -6.4162126 ## Myh11 0.9410499 ## Mcm6 -3.5804973 ## Cdca3 -7.0539510 ## Top2a -6.8297343 ## Mcm2 -3.8238103 We visualize the expression profiles of the top candidates in Figure 1.7 to verify that the DE signature is robust. Most of the top markers have strong and consistent up- or downregulation in cells of cluster 1 compared to some or all of the other clusters. A cursory examination of the heatmap indicates that cluster 1 contains oncogene-induced cells with strong downregulation of DNA replication and cell cycle genes. This is consistent with the potential induction of senescence as an anti-tumorigenic response (Wajapeyee et al. 2010). 
top.markers &lt;- rownames(marker.set)[marker.set$Top &lt;= 10] plotHeatmap(sce.416b, features=top.markers, order_columns_by=&quot;label&quot;, colour_columns_by=c(&quot;label&quot;, &quot;block&quot;, &quot;phenotype&quot;), center=TRUE, symmetric=TRUE, zlim=c(-5, 5)) Figure 1.7: Heatmap of the top marker genes for cluster 1 in the 416B dataset, stratified by cluster. The plate of origin and oncogene induction status are also shown for each cell. Session Info View session info R version 4.3.0 RC (2023-04-13 r84269) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 22.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C time zone: America/New_York tzcode source: system (glibc) attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] cluster_2.1.4 dynamicTreeCut_1.63-1 [3] limma_3.56.0 scran_1.28.0 [5] scater_1.28.0 ggplot2_3.4.2 [7] scuttle_1.10.0 AnnotationHub_3.8.0 [9] BiocFileCache_2.8.0 dbplyr_2.3.2 [11] ensembldb_2.24.0 AnnotationFilter_1.24.0 [13] GenomicFeatures_1.52.0 AnnotationDbi_1.62.0 [15] scRNAseq_2.13.0 SingleCellExperiment_1.22.0 [17] SummarizedExperiment_1.30.0 Biobase_2.60.0 [19] GenomicRanges_1.52.0 GenomeInfoDb_1.36.0 [21] IRanges_2.34.0 S4Vectors_0.38.0 [23] BiocGenerics_0.46.0 MatrixGenerics_1.12.0 [25] matrixStats_0.63.0 BiocStyle_2.28.0 [27] rebook_1.10.0 loaded via a namespace (and not attached): [1] RColorBrewer_1.1-3 jsonlite_1.8.4 [3] CodeDepends_0.6.5 magrittr_2.0.3 [5] ggbeeswarm_0.7.1 farver_2.1.1 [7] rmarkdown_2.21 BiocIO_1.10.0 [9] zlibbioc_1.46.0 vctrs_0.6.2 [11] memoise_2.0.1 Rsamtools_2.16.0 [13] 
DelayedMatrixStats_1.22.0 RCurl_1.98-1.12 [15] htmltools_0.5.5 progress_1.2.2 [17] curl_5.0.0 BiocNeighbors_1.18.0 [19] sass_0.4.5 bslib_0.4.2 [21] cachem_1.0.7 GenomicAlignments_1.36.0 [23] igraph_1.4.2 mime_0.12 [25] lifecycle_1.0.3 pkgconfig_2.0.3 [27] rsvd_1.0.5 Matrix_1.5-4 [29] R6_2.5.1 fastmap_1.1.1 [31] GenomeInfoDbData_1.2.10 shiny_1.7.4 [33] digest_0.6.31 colorspace_2.1-0 [35] dqrng_0.3.0 irlba_2.3.5.1 [37] ExperimentHub_2.8.0 RSQLite_2.3.1 [39] beachmat_2.16.0 labeling_0.4.2 [41] filelock_1.0.2 fansi_1.0.4 [43] httr_1.4.5 compiler_4.3.0 [45] bit64_4.0.5 withr_2.5.0 [47] BiocParallel_1.34.0 viridis_0.6.2 [49] DBI_1.1.3 highr_0.10 [51] biomaRt_2.56.0 rappdirs_0.3.3 [53] DelayedArray_0.26.0 bluster_1.10.0 [55] rjson_0.2.21 tools_4.3.0 [57] vipor_0.4.5 beeswarm_0.4.0 [59] interactiveDisplayBase_1.38.0 httpuv_1.6.9 [61] glue_1.6.2 restfulr_0.0.15 [63] promises_1.2.0.1 grid_4.3.0 [65] Rtsne_0.16 generics_0.1.3 [67] gtable_0.3.3 hms_1.1.3 [69] metapod_1.8.0 BiocSingular_1.16.0 [71] ScaledMatrix_1.8.0 xml2_1.3.3 [73] utf8_1.2.3 XVector_0.40.0 [75] ggrepel_0.9.3 BiocVersion_3.17.1 [77] pillar_1.9.0 stringr_1.5.0 [79] later_1.3.0 dplyr_1.1.2 [81] lattice_0.21-8 rtracklayer_1.60.0 [83] bit_4.0.5 tidyselect_1.2.0 [85] locfit_1.5-9.7 Biostrings_2.68.0 [87] knitr_1.42 gridExtra_2.3 [89] bookdown_0.33 ProtGenerics_1.32.0 [91] edgeR_3.42.0 xfun_0.39 [93] statmod_1.5.0 pheatmap_1.0.12 [95] stringi_1.7.12 lazyeval_0.2.2 [97] yaml_2.3.7 evaluate_0.20 [99] codetools_0.2-19 tibble_3.2.1 [101] BiocManager_1.30.20 graph_1.78.0 [103] cli_3.6.1 xtable_1.8-4 [105] munsell_0.5.0 jquerylib_0.1.4 [107] Rcpp_1.0.10 dir.expiry_1.8.0 [109] png_0.1-8 XML_3.99-0.14 [111] parallel_4.3.0 ellipsis_0.3.2 [113] blob_1.2.4 prettyunits_1.1.1 [115] sparseMatrixStats_1.12.0 bitops_1.0-7 [117] viridisLite_0.4.1 scales_1.2.1 [119] purrr_1.0.1 crayon_1.5.2 [121] rlang_1.1.0 cowplot_1.1.1 [123] KEGGREST_1.40.0 References "],["zeisel-mouse-brain-strt-seq.html", "Chapter 2 Zeisel mouse brain (STRT-Seq) 
2.1 Introduction 2.2 Data loading 2.3 Quality control 2.4 Normalization 2.5 Variance modelling 2.6 Dimensionality reduction 2.7 Clustering 2.8 Interpretation Session Info", " Chapter 2 Zeisel mouse brain (STRT-Seq) 2.1 Introduction Here, we examine a heterogeneous dataset from a study of cell types in the mouse brain (Zeisel et al. 2015). This contains approximately 3000 cells of varying types such as oligodendrocytes, microglia and neurons. Individual cells were isolated using the Fluidigm C1 microfluidics system (Pollen et al. 2014) and library preparation was performed on each cell using a UMI-based protocol. After sequencing, expression was quantified by counting the number of unique molecular identifiers (UMIs) mapped to each gene. 2.2 Data loading We obtain a SingleCellExperiment object for this dataset using the relevant function from the scRNAseq package. The idiosyncrasies of the published dataset mean that we need to do some extra work to merge together redundant rows corresponding to alternative genomic locations for the same gene. library(scRNAseq) sce.zeisel &lt;- ZeiselBrainData() library(scater) sce.zeisel &lt;- aggregateAcrossFeatures(sce.zeisel, id=sub(&quot;_loc[0-9]+$&quot;, &quot;&quot;, rownames(sce.zeisel))) We also fetch the Ensembl gene IDs, just in case we need them later. library(org.Mm.eg.db) rowData(sce.zeisel)$Ensembl &lt;- mapIds(org.Mm.eg.db, keys=rownames(sce.zeisel), keytype=&quot;SYMBOL&quot;, column=&quot;ENSEMBL&quot;) 2.3 Quality control unfiltered &lt;- sce.zeisel The original authors of the study have already removed low-quality cells prior to data publication. Nonetheless, we compute some quality control metrics to check whether the remaining cells are satisfactory. 
stats &lt;- perCellQCMetrics(sce.zeisel, subsets=list( Mt=rowData(sce.zeisel)$featureType==&quot;mito&quot;)) qc &lt;- quickPerCellQC(stats, percent_subsets=c(&quot;altexps_ERCC_percent&quot;, &quot;subsets_Mt_percent&quot;)) sce.zeisel &lt;- sce.zeisel[,!qc$discard] colData(unfiltered) &lt;- cbind(colData(unfiltered), stats) unfiltered$discard &lt;- qc$discard gridExtra::grid.arrange( plotColData(unfiltered, y=&quot;sum&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Total count&quot;), plotColData(unfiltered, y=&quot;detected&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Detected features&quot;), plotColData(unfiltered, y=&quot;altexps_ERCC_percent&quot;, colour_by=&quot;discard&quot;) + ggtitle(&quot;ERCC percent&quot;), plotColData(unfiltered, y=&quot;subsets_Mt_percent&quot;, colour_by=&quot;discard&quot;) + ggtitle(&quot;Mito percent&quot;), ncol=2 ) Figure 2.1: Distribution of each QC metric across cells in the Zeisel brain dataset. Each point represents a cell and is colored according to whether that cell was discarded. gridExtra::grid.arrange( plotColData(unfiltered, x=&quot;sum&quot;, y=&quot;subsets_Mt_percent&quot;, colour_by=&quot;discard&quot;) + scale_x_log10(), plotColData(unfiltered, x=&quot;altexps_ERCC_percent&quot;, y=&quot;subsets_Mt_percent&quot;, colour_by=&quot;discard&quot;), ncol=2 ) Figure 2.2: Percentage of mitochondrial reads in each cell in the Zeisel brain dataset, compared to the total count (left) or the percentage of spike-in reads (right). Each point represents a cell and is colored according to whether that cell was discarded. We also examine the number of cells removed for each reason. 
colSums(as.matrix(qc)) ## low_lib_size low_n_features high_altexps_ERCC_percent ## 0 3 65 ## high_subsets_Mt_percent discard ## 128 189 2.4 Normalization library(scran) set.seed(1000) clusters &lt;- quickCluster(sce.zeisel) sce.zeisel &lt;- computeSumFactors(sce.zeisel, cluster=clusters) sce.zeisel &lt;- logNormCounts(sce.zeisel) summary(sizeFactors(sce.zeisel)) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.119 0.486 0.831 1.000 1.321 4.509 plot(librarySizeFactors(sce.zeisel), sizeFactors(sce.zeisel), pch=16, xlab=&quot;Library size factors&quot;, ylab=&quot;Deconvolution factors&quot;, log=&quot;xy&quot;) Figure 2.3: Relationship between the library size factors and the deconvolution size factors in the Zeisel brain dataset. 2.5 Variance modelling In theory, we should block on the plate of origin for each cell. However, only 20-40 cells are available on each plate, and the population is also highly heterogeneous. This means that we cannot assume that the distribution of sampled cell types on each plate is the same. Thus, to avoid regressing out potential biology, we will not block on any factors in this analysis. dec.zeisel &lt;- modelGeneVarWithSpikes(sce.zeisel, &quot;ERCC&quot;) top.hvgs &lt;- getTopHVGs(dec.zeisel, prop=0.1) We see from Figure 2.4 that the technical and total variances are much smaller than those in the read-based datasets. This is due to the use of UMIs, which reduces the noise caused by variable PCR amplification. Furthermore, the spike-in trend is consistently lower than the variances of the endogenous genes, which reflects the heterogeneity in gene expression across cells of different types. 
plot(dec.zeisel$mean, dec.zeisel$total, pch=16, cex=0.5, xlab=&quot;Mean of log-expression&quot;, ylab=&quot;Variance of log-expression&quot;) curfit &lt;- metadata(dec.zeisel) points(curfit$mean, curfit$var, col=&quot;red&quot;, pch=16) curve(curfit$trend(x), col=&#39;dodgerblue&#39;, add=TRUE, lwd=2) Figure 2.4: Per-gene variance as a function of the mean for the log-expression values in the Zeisel brain dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to the spike-in transcripts (red). 2.6 Dimensionality reduction library(BiocSingular) set.seed(101011001) sce.zeisel &lt;- denoisePCA(sce.zeisel, technical=dec.zeisel, subset.row=top.hvgs) sce.zeisel &lt;- runTSNE(sce.zeisel, dimred=&quot;PCA&quot;) We have a look at the number of PCs retained by denoisePCA(). ncol(reducedDim(sce.zeisel, &quot;PCA&quot;)) ## [1] 50 2.7 Clustering snn.gr &lt;- buildSNNGraph(sce.zeisel, use.dimred=&quot;PCA&quot;) colLabels(sce.zeisel) &lt;- factor(igraph::cluster_walktrap(snn.gr)$membership) table(colLabels(sce.zeisel)) ## ## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ## 283 451 114 143 599 167 191 128 350 70 199 58 39 24 plotTSNE(sce.zeisel, colour_by=&quot;label&quot;) Figure 2.5: Obligatory \\(t\\)-SNE plot of the Zeisel brain dataset, where each point represents a cell and is colored according to the assigned cluster. 2.8 Interpretation We focus on upregulated marker genes as these can quickly provide positive identification of cell type in a heterogeneous population. We examine the table for cluster 1, in which log-fold changes are reported between cluster 1 and every other cluster. The same output is provided for each cluster in order to identify genes that discriminate between clusters. 
markers &lt;- findMarkers(sce.zeisel, direction=&quot;up&quot;) marker.set &lt;- markers[[&quot;1&quot;]] head(marker.set[,1:8], 10) # only first 8 columns, for brevity ## DataFrame with 10 rows and 8 columns ## Top p.value FDR summary.logFC logFC.2 logFC.3 ## &lt;integer&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## Atp1a3 1 1.45982e-282 7.24035e-279 3.45669 0.0398568 0.0893943 ## Celf4 1 2.27030e-246 4.50404e-243 3.10465 0.3886716 0.6145023 ## Gad1 1 7.44925e-232 1.34351e-228 4.57719 4.5392751 4.3003280 ## Gad2 1 2.88086e-207 3.57208e-204 4.25393 4.2322487 3.8884654 ## Mllt11 1 1.72982e-249 3.81309e-246 2.88363 0.5782719 1.4933128 ## Ndrg4 1 0.00000e+00 0.00000e+00 3.84337 0.8887239 1.0183408 ## Slc32a1 1 2.38276e-110 4.04030e-108 1.92859 1.9196173 1.8252062 ## Syngr3 1 3.68257e-143 1.30462e-140 2.55531 1.0981258 1.1994793 ## Atp6v1g2 2 3.04451e-204 3.55295e-201 2.50875 0.0981706 0.5203760 ## Napb 2 1.10402e-231 1.82522e-228 2.81533 0.1774508 0.3046901 ## logFC.4 logFC.5 ## &lt;numeric&gt; &lt;numeric&gt; ## Atp1a3 1.241388 3.45669 ## Celf4 0.869334 3.10465 ## Gad1 4.050305 4.47236 ## Gad2 3.769556 4.16902 ## Mllt11 0.951649 2.88363 ## Ndrg4 1.140041 3.84337 ## Slc32a1 1.804311 1.92426 ## Syngr3 1.188856 2.47696 ## Atp6v1g2 0.616391 2.50875 ## Napb 0.673772 2.81533 Figure 2.6 indicates that most of the top markers are strongly DE in cells of cluster 1 compared to some or all of the other clusters. We can use these markers to identify cells from cluster 1 in validation studies with an independent population of cells. A quick look at the markers suggests that cluster 1 represents interneurons based on expression of Gad1 and Slc6a1 (Zeng et al. 2012). top.markers &lt;- rownames(marker.set)[marker.set$Top &lt;= 10] plotHeatmap(sce.zeisel, features=top.markers, order_columns_by=&quot;label&quot;) Figure 2.6: Heatmap of the log-expression of the top markers for cluster 1 compared to each other cluster. 
Cells are ordered by cluster and the color is scaled to the log-expression of each gene in each cell. An alternative visualization approach is to plot the log-fold changes to all other clusters directly (Figure 2.7). This is more concise and is useful in situations involving many clusters that contain different numbers of cells. library(pheatmap) logFCs &lt;- getMarkerEffects(marker.set[1:50,]) pheatmap(logFCs, breaks=seq(-5, 5, length.out=101)) Figure 2.7: Heatmap of the log-fold changes of the top markers for cluster 1 compared to each other cluster. Session Info View session info R version 4.3.0 RC (2023-04-13 r84269) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 22.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C time zone: America/New_York tzcode source: system (glibc) attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] pheatmap_1.0.12 BiocSingular_1.16.0 [3] scran_1.28.0 org.Mm.eg.db_3.17.0 [5] AnnotationDbi_1.62.0 scater_1.28.0 [7] ggplot2_3.4.2 scuttle_1.10.0 [9] scRNAseq_2.13.0 SingleCellExperiment_1.22.0 [11] SummarizedExperiment_1.30.0 Biobase_2.60.0 [13] GenomicRanges_1.52.0 GenomeInfoDb_1.36.0 [15] IRanges_2.34.0 S4Vectors_0.38.0 [17] BiocGenerics_0.46.0 MatrixGenerics_1.12.0 [19] matrixStats_0.63.0 BiocStyle_2.28.0 [21] rebook_1.10.0 loaded via a namespace (and not attached): [1] RColorBrewer_1.1-3 jsonlite_1.8.4 [3] CodeDepends_0.6.5 magrittr_2.0.3 [5] ggbeeswarm_0.7.1 GenomicFeatures_1.52.0 [7] farver_2.1.1 rmarkdown_2.21 [9] BiocIO_1.10.0 zlibbioc_1.46.0 [11] vctrs_0.6.2 memoise_2.0.1 [13] Rsamtools_2.16.0 
DelayedMatrixStats_1.22.0 [15] RCurl_1.98-1.12 htmltools_0.5.5 [17] progress_1.2.2 AnnotationHub_3.8.0 [19] curl_5.0.0 BiocNeighbors_1.18.0 [21] sass_0.4.5 bslib_0.4.2 [23] cachem_1.0.7 GenomicAlignments_1.36.0 [25] igraph_1.4.2 mime_0.12 [27] lifecycle_1.0.3 pkgconfig_2.0.3 [29] rsvd_1.0.5 Matrix_1.5-4 [31] R6_2.5.1 fastmap_1.1.1 [33] GenomeInfoDbData_1.2.10 shiny_1.7.4 [35] digest_0.6.31 colorspace_2.1-0 [37] dqrng_0.3.0 irlba_2.3.5.1 [39] ExperimentHub_2.8.0 RSQLite_2.3.1 [41] beachmat_2.16.0 labeling_0.4.2 [43] filelock_1.0.2 fansi_1.0.4 [45] httr_1.4.5 compiler_4.3.0 [47] bit64_4.0.5 withr_2.5.0 [49] BiocParallel_1.34.0 viridis_0.6.2 [51] DBI_1.1.3 highr_0.10 [53] biomaRt_2.56.0 rappdirs_0.3.3 [55] DelayedArray_0.26.0 bluster_1.10.0 [57] rjson_0.2.21 tools_4.3.0 [59] vipor_0.4.5 beeswarm_0.4.0 [61] interactiveDisplayBase_1.38.0 httpuv_1.6.9 [63] glue_1.6.2 restfulr_0.0.15 [65] promises_1.2.0.1 grid_4.3.0 [67] Rtsne_0.16 cluster_2.1.4 [69] generics_0.1.3 gtable_0.3.3 [71] ensembldb_2.24.0 hms_1.1.3 [73] metapod_1.8.0 ScaledMatrix_1.8.0 [75] xml2_1.3.3 utf8_1.2.3 [77] XVector_0.40.0 ggrepel_0.9.3 [79] BiocVersion_3.17.1 pillar_1.9.0 [81] stringr_1.5.0 limma_3.56.0 [83] later_1.3.0 dplyr_1.1.2 [85] BiocFileCache_2.8.0 lattice_0.21-8 [87] rtracklayer_1.60.0 bit_4.0.5 [89] tidyselect_1.2.0 locfit_1.5-9.7 [91] Biostrings_2.68.0 knitr_1.42 [93] gridExtra_2.3 bookdown_0.33 [95] ProtGenerics_1.32.0 edgeR_3.42.0 [97] xfun_0.39 statmod_1.5.0 [99] stringi_1.7.12 lazyeval_0.2.2 [101] yaml_2.3.7 evaluate_0.20 [103] codetools_0.2-19 tibble_3.2.1 [105] BiocManager_1.30.20 graph_1.78.0 [107] cli_3.6.1 xtable_1.8-4 [109] munsell_0.5.0 jquerylib_0.1.4 [111] Rcpp_1.0.10 dir.expiry_1.8.0 [113] dbplyr_2.3.2 png_0.1-8 [115] XML_3.99-0.14 parallel_4.3.0 [117] ellipsis_0.3.2 blob_1.2.4 [119] prettyunits_1.1.1 AnnotationFilter_1.24.0 [121] sparseMatrixStats_1.12.0 bitops_1.0-7 [123] viridisLite_0.4.1 scales_1.2.1 [125] purrr_1.0.1 crayon_1.5.2 [127] rlang_1.1.0 cowplot_1.1.1 [129] 
KEGGREST_1.40.0 References "],["unfiltered-human-pbmcs-10x-genomics.html", "Chapter 3 Unfiltered human PBMCs (10X Genomics) 3.1 Introduction 3.2 Data loading 3.3 Quality control 3.4 Normalization 3.5 Variance modelling 3.6 Dimensionality reduction 3.7 Clustering 3.8 Interpretation Session Info", " Chapter 3 Unfiltered human PBMCs (10X Genomics) 3.1 Introduction Here, we describe a brief analysis of the peripheral blood mononuclear cell (PBMC) dataset from 10X Genomics (Zheng et al. 2017). The data are publicly available from the 10X Genomics website, from which we download the raw gene/barcode count matrices, i.e., before cell calling from the CellRanger pipeline. 3.2 Data loading library(DropletTestFiles) raw.path &lt;- getTestFile(&quot;tenx-2.1.0-pbmc4k/1.0.0/raw.tar.gz&quot;) out.path &lt;- file.path(tempdir(), &quot;pbmc4k&quot;) untar(raw.path, exdir=out.path) library(DropletUtils) fname &lt;- file.path(out.path, &quot;raw_gene_bc_matrices/GRCh38&quot;) sce.pbmc &lt;- read10xCounts(fname, col.names=TRUE) library(scater) rownames(sce.pbmc) &lt;- uniquifyFeatureNames( rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol) library(EnsDb.Hsapiens.v86) location &lt;- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce.pbmc)$ID, column=&quot;SEQNAME&quot;, keytype=&quot;GENEID&quot;) 3.3 Quality control We perform cell detection using the emptyDrops() algorithm, as discussed in Advanced Section 7.2. set.seed(100) e.out &lt;- emptyDrops(counts(sce.pbmc)) sce.pbmc &lt;- sce.pbmc[,which(e.out$FDR &lt;= 0.001)] unfiltered &lt;- sce.pbmc We use a relaxed QC strategy and only remove cells with large mitochondrial proportions, using them as a proxy for cell damage. 
This reduces the risk of removing cell types with low RNA content, especially in a heterogeneous PBMC population with many different cell types. stats &lt;- perCellQCMetrics(sce.pbmc, subsets=list(Mito=which(location==&quot;MT&quot;))) high.mito &lt;- isOutlier(stats$subsets_Mito_percent, type=&quot;higher&quot;) sce.pbmc &lt;- sce.pbmc[,!high.mito] summary(high.mito) ## Mode FALSE TRUE ## logical 3985 315 colData(unfiltered) &lt;- cbind(colData(unfiltered), stats) unfiltered$discard &lt;- high.mito gridExtra::grid.arrange( plotColData(unfiltered, y=&quot;sum&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Total count&quot;), plotColData(unfiltered, y=&quot;detected&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Detected features&quot;), plotColData(unfiltered, y=&quot;subsets_Mito_percent&quot;, colour_by=&quot;discard&quot;) + ggtitle(&quot;Mito percent&quot;), ncol=2 ) Figure 3.1: Distribution of various QC metrics in the PBMC dataset after cell calling. Each point is a cell and is colored according to whether it was discarded by the mitochondrial filter. plotColData(unfiltered, x=&quot;sum&quot;, y=&quot;subsets_Mito_percent&quot;, colour_by=&quot;discard&quot;) + scale_x_log10() Figure 3.2: Proportion of mitochondrial reads in each cell of the PBMC dataset compared to its total count. 3.4 Normalization library(scran) set.seed(1000) clusters &lt;- quickCluster(sce.pbmc) sce.pbmc &lt;- computeSumFactors(sce.pbmc, cluster=clusters) sce.pbmc &lt;- logNormCounts(sce.pbmc) summary(sizeFactors(sce.pbmc)) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.007 0.712 0.875 1.000 1.099 12.254 plot(librarySizeFactors(sce.pbmc), sizeFactors(sce.pbmc), pch=16, xlab=&quot;Library size factors&quot;, ylab=&quot;Deconvolution factors&quot;, log=&quot;xy&quot;) Figure 3.3: Relationship between the library size factors and the deconvolution size factors in the PBMC dataset. 
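For reference, the library size factors on the x-axis of Figure 3.3 are the per-cell total counts scaled to have a mean of one. A minimal sketch in base R on a simulated count matrix illustrates the calculation; `toy.counts` and its dimensions are hypothetical and unrelated to the PBMC data.

```r
# Hypothetical toy count matrix: 100 genes (rows) by 6 cells (columns).
set.seed(1)
toy.counts <- matrix(rpois(600, lambda=5), nrow=100, ncol=6)

# The library size of each cell is its total count across all genes.
lib.sizes <- colSums(toy.counts)

# Dividing by the mean library size centers the size factors at unity.
lib.sf <- lib.sizes / mean(lib.sizes)
lib.sf
mean(lib.sf)
```

Under its default settings, `librarySizeFactors()` should give the same values for a matrix like this. The deconvolution factors on the y-axis differ because they are estimated from pools of cells to reduce the impact of composition biases.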
3.5 Variance modelling set.seed(1001) dec.pbmc &lt;- modelGeneVarByPoisson(sce.pbmc) top.pbmc &lt;- getTopHVGs(dec.pbmc, prop=0.1) plot(dec.pbmc$mean, dec.pbmc$total, pch=16, cex=0.5, xlab=&quot;Mean of log-expression&quot;, ylab=&quot;Variance of log-expression&quot;) curfit &lt;- metadata(dec.pbmc) curve(curfit$trend(x), col=&#39;dodgerblue&#39;, add=TRUE, lwd=2) Figure 3.4: Per-gene variance as a function of the mean for the log-expression values in the PBMC dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to simulated Poisson counts. 3.6 Dimensionality reduction set.seed(10000) sce.pbmc &lt;- denoisePCA(sce.pbmc, subset.row=top.pbmc, technical=dec.pbmc) set.seed(100000) sce.pbmc &lt;- runTSNE(sce.pbmc, dimred=&quot;PCA&quot;) set.seed(1000000) sce.pbmc &lt;- runUMAP(sce.pbmc, dimred=&quot;PCA&quot;) We verify that a reasonable number of PCs is retained. ncol(reducedDim(sce.pbmc, &quot;PCA&quot;)) ## [1] 9 3.7 Clustering g &lt;- buildSNNGraph(sce.pbmc, k=10, use.dimred = &#39;PCA&#39;) clust &lt;- igraph::cluster_walktrap(g)$membership colLabels(sce.pbmc) &lt;- factor(clust) table(colLabels(sce.pbmc)) ## ## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ## 205 508 541 56 374 125 46 432 302 867 47 155 166 61 84 16 plotTSNE(sce.pbmc, colour_by=&quot;label&quot;) Figure 3.5: Obligatory \\(t\\)-SNE plot of the PBMC dataset, where each point represents a cell and is colored according to the assigned cluster. 3.8 Interpretation markers &lt;- findMarkers(sce.pbmc, pval.type=&quot;some&quot;, direction=&quot;up&quot;) We examine the markers for cluster 8 in more detail. High expression of CD14, CD68 and MNDA combined with low expression of CD16 suggests that this cluster contains monocytes, compared to macrophages in cluster 15 (Figure 3.6). 
marker.set &lt;- markers[[&quot;8&quot;]] as.data.frame(marker.set[1:30,1:3]) ## p.value FDR summary.logFC ## CSTA 7.171e-222 2.016e-217 2.4179 ## MNDA 1.197e-221 2.016e-217 2.6615 ## FCN1 2.376e-213 2.669e-209 2.6381 ## S100A12 4.393e-212 3.701e-208 3.0809 ## VCAN 1.711e-199 1.153e-195 2.2604 ## TYMP 1.174e-154 6.590e-151 2.0238 ## AIF1 3.674e-149 1.768e-145 2.4604 ## LGALS2 4.005e-137 1.687e-133 1.8928 ## MS4A6A 5.640e-134 2.111e-130 1.5457 ## FGL2 2.045e-124 6.889e-121 1.3859 ## RP11-1143G9.4 6.892e-122 2.111e-118 2.8042 ## AP1S2 1.786e-112 5.015e-109 1.7704 ## CD14 1.195e-110 3.098e-107 1.4260 ## CFD 6.870e-109 1.654e-105 1.3560 ## GPX1 9.049e-107 2.033e-103 2.4014 ## TNFSF13B 3.920e-95 8.256e-92 1.1151 ## KLF4 3.310e-94 6.560e-91 1.2049 ## GRN 4.801e-91 8.987e-88 1.3815 ## NAMPT 2.490e-90 4.415e-87 1.1439 ## CLEC7A 7.736e-88 1.303e-84 1.0616 ## S100A8 3.125e-84 5.014e-81 4.8052 ## SERPINA1 1.580e-82 2.420e-79 1.3843 ## CD36 8.018e-79 1.175e-75 1.0538 ## MPEG1 8.482e-79 1.191e-75 0.9778 ## CD68 5.119e-78 6.899e-75 0.9481 ## CYBB 1.201e-77 1.556e-74 1.0300 ## S100A11 1.175e-72 1.466e-69 1.8962 ## RBP7 2.467e-71 2.969e-68 0.9666 ## BLVRB 3.763e-71 4.372e-68 0.9701 ## CD302 9.859e-71 1.107e-67 0.8792 plotExpression(sce.pbmc, features=c(&quot;CD14&quot;, &quot;CD68&quot;, &quot;MNDA&quot;, &quot;FCGR3A&quot;), x=&quot;label&quot;, colour_by=&quot;label&quot;) Figure 3.6: Distribution of expression values for monocyte and macrophage markers across clusters in the PBMC dataset. 
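The FDR column in the marker table above is consistent with a Benjamini-Hochberg adjustment of the p-values across all tested genes. As a rough check, we can adjust the two smallest p-values with base R. The total number of genes (33694) is an assumption inferred from the table, not a value stated in the text, and the p-values are the rounded ones printed above.

```r
# Two smallest p-values for cluster 8, copied (rounded) from the table above.
p <- c(CSTA = 7.171e-222, MNDA = 1.197e-221)

# ASSUMPTION: total number of genes tested; inferred, not stated in the text.
n.genes <- 33694

# Benjamini-Hochberg adjustment, telling p.adjust() the total number of tests.
fdr <- p.adjust(p, method = 'BH', n = n.genes)
print(fdr)
```

Both genes receive the same adjusted value because the procedure enforces monotonicity, which matches the identical FDR entries for CSTA and MNDA in the table.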
Session Info View session info R version 4.3.0 RC (2023-04-13 r84269) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 22.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C time zone: America/New_York tzcode source: system (glibc) attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] scran_1.28.0 EnsDb.Hsapiens.v86_2.99.0 [3] ensembldb_2.24.0 AnnotationFilter_1.24.0 [5] GenomicFeatures_1.52.0 AnnotationDbi_1.62.0 [7] scater_1.28.0 ggplot2_3.4.2 [9] scuttle_1.10.0 DropletUtils_1.20.0 [11] SingleCellExperiment_1.22.0 SummarizedExperiment_1.30.0 [13] Biobase_2.60.0 GenomicRanges_1.52.0 [15] GenomeInfoDb_1.36.0 IRanges_2.34.0 [17] S4Vectors_0.38.0 BiocGenerics_0.46.0 [19] MatrixGenerics_1.12.0 matrixStats_0.63.0 [21] DropletTestFiles_1.9.0 BiocStyle_2.28.0 [23] rebook_1.10.0 loaded via a namespace (and not attached): [1] later_1.3.0 BiocIO_1.10.0 [3] bitops_1.0-7 filelock_1.0.2 [5] tibble_3.2.1 R.oo_1.25.0 [7] CodeDepends_0.6.5 graph_1.78.0 [9] XML_3.99-0.14 lifecycle_1.0.3 [11] edgeR_3.42.0 lattice_0.21-8 [13] magrittr_2.0.3 limma_3.56.0 [15] sass_0.4.5 rmarkdown_2.21 [17] jquerylib_0.1.4 yaml_2.3.7 [19] metapod_1.8.0 httpuv_1.6.9 [21] cowplot_1.1.1 DBI_1.1.3 [23] zlibbioc_1.46.0 Rtsne_0.16 [25] purrr_1.0.1 R.utils_2.12.2 [27] RCurl_1.98-1.12 rappdirs_0.3.3 [29] GenomeInfoDbData_1.2.10 ggrepel_0.9.3 [31] irlba_2.3.5.1 dqrng_0.3.0 [33] DelayedMatrixStats_1.22.0 codetools_0.2-19 [35] DelayedArray_0.26.0 xml2_1.3.3 [37] tidyselect_1.2.0 farver_2.1.1 [39] ScaledMatrix_1.8.0 viridis_0.6.2 [41] BiocFileCache_2.8.0 GenomicAlignments_1.36.0 [43] 
jsonlite_1.8.4 BiocNeighbors_1.18.0 [45] ellipsis_0.3.2 tools_4.3.0 [47] progress_1.2.2 Rcpp_1.0.10 [49] glue_1.6.2 gridExtra_2.3 [51] xfun_0.39 dplyr_1.1.2 [53] HDF5Array_1.28.0 withr_2.5.0 [55] BiocManager_1.30.20 fastmap_1.1.1 [57] rhdf5filters_1.12.0 bluster_1.10.0 [59] fansi_1.0.4 digest_0.6.31 [61] rsvd_1.0.5 R6_2.5.1 [63] mime_0.12 colorspace_2.1-0 [65] biomaRt_2.56.0 RSQLite_2.3.1 [67] R.methodsS3_1.8.2 utf8_1.2.3 [69] generics_0.1.3 rtracklayer_1.60.0 [71] FNN_1.1.3.2 prettyunits_1.1.1 [73] httr_1.4.5 uwot_0.1.14 [75] pkgconfig_2.0.3 gtable_0.3.3 [77] blob_1.2.4 XVector_0.40.0 [79] htmltools_0.5.5 bookdown_0.33 [81] ProtGenerics_1.32.0 scales_1.2.1 [83] png_0.1-8 knitr_1.42 [85] rjson_0.2.21 curl_5.0.0 [87] cachem_1.0.7 rhdf5_2.44.0 [89] stringr_1.5.0 BiocVersion_3.17.1 [91] parallel_4.3.0 vipor_0.4.5 [93] restfulr_0.0.15 pillar_1.9.0 [95] grid_4.3.0 vctrs_0.6.2 [97] promises_1.2.0.1 BiocSingular_1.16.0 [99] dbplyr_2.3.2 beachmat_2.16.0 [101] xtable_1.8-4 cluster_2.1.4 [103] beeswarm_0.4.0 evaluate_0.20 [105] cli_3.6.1 locfit_1.5-9.7 [107] compiler_4.3.0 Rsamtools_2.16.0 [109] rlang_1.1.0 crayon_1.5.2 [111] labeling_0.4.2 ggbeeswarm_0.7.1 [113] stringi_1.7.12 viridisLite_0.4.1 [115] BiocParallel_1.34.0 munsell_0.5.0 [117] Biostrings_2.68.0 lazyeval_0.2.2 [119] Matrix_1.5-4 dir.expiry_1.8.0 [121] ExperimentHub_2.8.0 hms_1.1.3 [123] sparseMatrixStats_1.12.0 bit64_4.0.5 [125] Rhdf5lib_1.22.0 KEGGREST_1.40.0 [127] statmod_1.5.0 shiny_1.7.4 [129] interactiveDisplayBase_1.38.0 highr_0.10 [131] AnnotationHub_3.8.0 igraph_1.4.2 [133] memoise_2.0.1 bslib_0.4.2 [135] bit_4.0.5 References "],["human-pbmc-with-surface-proteins-10x-genomics.html", "Chapter 4 Human PBMC with surface proteins (10X Genomics) 4.1 Introduction 4.2 Data loading 4.3 Quality control 4.4 Normalization 4.5 Dimensionality reduction 4.6 Clustering Session Info", " Chapter 4 Human PBMC with surface proteins (10X Genomics) .rebook-collapse { background-color: #eee; color: #444; cursor: pointer; 
padding: 18px; width: 100%; border: none; text-align: left; outline: none; font-size: 15px; } .rebook-content { padding: 0 18px; display: none; overflow: hidden; background-color: #f1f1f1; } 4.1 Introduction Here, we describe a brief analysis of yet another peripheral blood mononuclear cell (PBMC) dataset from 10X Genomics (Zheng et al. 2017). Data are publicly available from the 10X Genomics website, from which we download the filtered gene/barcode count matrices for gene expression and cell surface proteins. 4.2 Data loading library(BiocFileCache) bfc &lt;- BiocFileCache(ask=FALSE) exprs.data &lt;- bfcrpath(bfc, file.path( &quot;http://cf.10xgenomics.com/samples/cell-vdj/3.1.0&quot;, &quot;vdj_v1_hs_pbmc3&quot;, &quot;vdj_v1_hs_pbmc3_filtered_feature_bc_matrix.tar.gz&quot;)) untar(exprs.data, exdir=tempdir()) library(DropletUtils) sce.pbmc &lt;- read10xCounts(file.path(tempdir(), &quot;filtered_feature_bc_matrix&quot;)) sce.pbmc &lt;- splitAltExps(sce.pbmc, rowData(sce.pbmc)$Type) 4.3 Quality control unfiltered &lt;- sce.pbmc We discard cells with high mitochondrial proportions and few detectable ADT counts. library(scater) is.mito &lt;- grep(&quot;^MT-&quot;, rowData(sce.pbmc)$Symbol) stats &lt;- perCellQCMetrics(sce.pbmc, subsets=list(Mito=is.mito)) high.mito &lt;- isOutlier(stats$subsets_Mito_percent, type=&quot;higher&quot;) low.adt &lt;- stats$`altexps_Antibody Capture_detected` &lt; nrow(altExp(sce.pbmc))/2 discard &lt;- high.mito | low.adt sce.pbmc &lt;- sce.pbmc[,!discard] We examine some of the statistics: summary(high.mito) ## Mode FALSE TRUE ## logical 6660 571 summary(low.adt) ## Mode FALSE ## logical 7231 summary(discard) ## Mode FALSE TRUE ## logical 6660 571 We examine the distribution of each QC metric (Figure 4.1). 
colData(unfiltered) &lt;- cbind(colData(unfiltered), stats) unfiltered$discard &lt;- discard gridExtra::grid.arrange( plotColData(unfiltered, y=&quot;sum&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Total count&quot;), plotColData(unfiltered, y=&quot;detected&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Detected features&quot;), plotColData(unfiltered, y=&quot;subsets_Mito_percent&quot;, colour_by=&quot;discard&quot;) + ggtitle(&quot;Mito percent&quot;), plotColData(unfiltered, y=&quot;altexps_Antibody Capture_detected&quot;, colour_by=&quot;discard&quot;) + ggtitle(&quot;ADT detected&quot;), ncol=2 ) Figure 4.1: Distribution of each QC metric in the PBMC dataset, where each point is a cell and is colored by whether or not it was discarded by the outlier-based QC approach. We also plot the mitochondrial proportion against the total count for each cell, as one does (Figure 4.2). plotColData(unfiltered, x=&quot;sum&quot;, y=&quot;subsets_Mito_percent&quot;, colour_by=&quot;discard&quot;) + scale_x_log10() Figure 4.2: Percentage of UMIs mapped to mitochondrial genes against the total count for each cell. 4.4 Normalization We compute size factors for the gene expression and ADT counts. library(scran) set.seed(1000) clusters &lt;- quickCluster(sce.pbmc) sce.pbmc &lt;- computeSumFactors(sce.pbmc, cluster=clusters) altExp(sce.pbmc) &lt;- computeMedianFactors(altExp(sce.pbmc)) sce.pbmc &lt;- applySCE(sce.pbmc, logNormCounts) We generate some summary statistics for both sets of size factors: summary(sizeFactors(sce.pbmc)) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.074 0.719 0.908 1.000 1.133 8.858 summary(sizeFactors(altExp(sce.pbmc))) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.10 0.70 0.83 1.00 1.03 227.36 We also look at the distribution of size factors compared to the library size for each set of features (Figure 4.3). 
par(mfrow=c(1,2)) plot(librarySizeFactors(sce.pbmc), sizeFactors(sce.pbmc), pch=16, xlab=&quot;Library size factors&quot;, ylab=&quot;Deconvolution factors&quot;, main=&quot;Gene expression&quot;, log=&quot;xy&quot;) plot(librarySizeFactors(altExp(sce.pbmc)), sizeFactors(altExp(sce.pbmc)), pch=16, xlab=&quot;Library size factors&quot;, ylab=&quot;Median-based factors&quot;, main=&quot;Antibody capture&quot;, log=&quot;xy&quot;) Figure 4.3: Plot of the deconvolution size factors for the gene expression values (left) or the median-based size factors for the ADT expression values (right) compared to the library size-derived factors for the corresponding set of features. Each point represents a cell. 4.5 Dimensionality reduction We omit the PCA step for the ADT expression matrix, given that it is already so low-dimensional, and progress directly to \\(t\\)-SNE and UMAP visualizations. set.seed(100000) altExp(sce.pbmc) &lt;- runTSNE(altExp(sce.pbmc)) set.seed(1000000) altExp(sce.pbmc) &lt;- runUMAP(altExp(sce.pbmc)) 4.6 Clustering We perform graph-based clustering on the ADT data and use the assignments as the column labels of the alternative Experiment. g.adt &lt;- buildSNNGraph(altExp(sce.pbmc), k=10, d=NA) clust.adt &lt;- igraph::cluster_walktrap(g.adt)$membership colLabels(altExp(sce.pbmc)) &lt;- factor(clust.adt) We examine some basic statistics about the size of each cluster, their separation (Figure 4.4) and their distribution in our \\(t\\)-SNE plot (Figure 4.5). 
table(colLabels(altExp(sce.pbmc))) ## ## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ## 160 507 662 39 691 1415 32 650 76 1037 121 47 68 25 15 562 ## 17 18 19 20 21 22 23 24 ## 139 32 44 120 84 65 52 17 library(bluster) mod &lt;- pairwiseModularity(g.adt, clust.adt, as.ratio=TRUE) library(pheatmap) pheatmap::pheatmap(log10(mod + 10), cluster_row=FALSE, cluster_col=FALSE, color=colorRampPalette(c(&quot;white&quot;, &quot;blue&quot;))(101)) Figure 4.4: Heatmap of the pairwise cluster modularity scores in the PBMC dataset, computed based on the shared nearest neighbor graph derived from the ADT expression values. plotTSNE(altExp(sce.pbmc), colour_by=&quot;label&quot;, text_by=&quot;label&quot;, text_colour=&quot;red&quot;) Figure 4.5: Obligatory \\(t\\)-SNE plot of the PBMC dataset based on its ADT expression values, where each point is a cell and is colored by the cluster of origin. Cluster labels are also overlaid at the median coordinates across all cells in the cluster. We perform some additional subclustering using the expression data to mimic an in silico FACS experiment. set.seed(1010010) subclusters &lt;- quickSubCluster(sce.pbmc, clust.adt, prepFUN=function(x) { dec &lt;- modelGeneVarByPoisson(x) top &lt;- getTopHVGs(dec, prop=0.1) denoisePCA(x, dec, subset.row=top) }, clusterFUN=function(x) { g.gene &lt;- buildSNNGraph(x, k=10, use.dimred = &#39;PCA&#39;) igraph::cluster_walktrap(g.gene)$membership } ) We count the number of gene expression-derived subclusters in each ADT-derived parent cluster. 
data.frame( Cluster=names(subclusters), Ncells=vapply(subclusters, ncol, 0L), Nsub=vapply(subclusters, function(x) length(unique(x$subcluster)), 0L) ) ## Cluster Ncells Nsub ## 1 1 160 3 ## 2 2 507 4 ## 3 3 662 5 ## 4 4 39 1 ## 5 5 691 5 ## 6 6 1415 7 ## 7 7 32 1 ## 8 8 650 7 ## 9 9 76 2 ## 10 10 1037 8 ## 11 11 121 2 ## 12 12 47 1 ## 13 13 68 2 ## 14 14 25 1 ## 15 15 15 1 ## 16 16 562 9 ## 17 17 139 3 ## 18 18 32 1 ## 19 19 44 1 ## 20 20 120 4 ## 21 21 84 3 ## 22 22 65 2 ## 23 23 52 3 ## 24 24 17 1 Session Info View session info R version 4.3.0 RC (2023-04-13 r84269) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 22.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C time zone: America/New_York tzcode source: system (glibc) attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] pheatmap_1.0.12 bluster_1.10.0 [3] scran_1.28.0 scater_1.28.0 [5] ggplot2_3.4.2 scuttle_1.10.0 [7] DropletUtils_1.20.0 SingleCellExperiment_1.22.0 [9] SummarizedExperiment_1.30.0 Biobase_2.60.0 [11] GenomicRanges_1.52.0 GenomeInfoDb_1.36.0 [13] IRanges_2.34.0 S4Vectors_0.38.0 [15] BiocGenerics_0.46.0 MatrixGenerics_1.12.0 [17] matrixStats_0.63.0 BiocFileCache_2.8.0 [19] dbplyr_2.3.2 BiocStyle_2.28.0 [21] rebook_1.10.0 loaded via a namespace (and not attached): [1] DBI_1.1.3 bitops_1.0-7 [3] gridExtra_2.3 CodeDepends_0.6.5 [5] rlang_1.1.0 magrittr_2.0.3 [7] RcppAnnoy_0.0.20 compiler_4.3.0 [9] RSQLite_2.3.1 dir.expiry_1.8.0 [11] DelayedMatrixStats_1.22.0 vctrs_0.6.2 [13] pkgconfig_2.0.3 fastmap_1.1.1 [15] XVector_0.40.0 labeling_0.4.2 [17] utf8_1.2.3 rmarkdown_2.21 [19] 
graph_1.78.0 ggbeeswarm_0.7.1 [21] purrr_1.0.1 bit_4.0.5 [23] xfun_0.39 zlibbioc_1.46.0 [25] cachem_1.0.7 beachmat_2.16.0 [27] jsonlite_1.8.4 blob_1.2.4 [29] highr_0.10 rhdf5filters_1.12.0 [31] DelayedArray_0.26.0 Rhdf5lib_1.22.0 [33] BiocParallel_1.34.0 cluster_2.1.4 [35] irlba_2.3.5.1 parallel_4.3.0 [37] R6_2.5.1 RColorBrewer_1.1-3 [39] bslib_0.4.2 limma_3.56.0 [41] jquerylib_0.1.4 Rcpp_1.0.10 [43] bookdown_0.33 knitr_1.42 [45] R.utils_2.12.2 igraph_1.4.2 [47] Matrix_1.5-4 tidyselect_1.2.0 [49] viridis_0.6.2 yaml_2.3.7 [51] codetools_0.2-19 curl_5.0.0 [53] lattice_0.21-8 tibble_3.2.1 [55] withr_2.5.0 Rtsne_0.16 [57] evaluate_0.20 pillar_1.9.0 [59] BiocManager_1.30.20 filelock_1.0.2 [61] generics_0.1.3 RCurl_1.98-1.12 [63] sparseMatrixStats_1.12.0 munsell_0.5.0 [65] scales_1.2.1 glue_1.6.2 [67] metapod_1.8.0 tools_4.3.0 [69] BiocNeighbors_1.18.0 ScaledMatrix_1.8.0 [71] locfit_1.5-9.7 XML_3.99-0.14 [73] cowplot_1.1.1 rhdf5_2.44.0 [75] grid_4.3.0 edgeR_3.42.0 [77] colorspace_2.1-0 GenomeInfoDbData_1.2.10 [79] beeswarm_0.4.0 BiocSingular_1.16.0 [81] HDF5Array_1.28.0 vipor_0.4.5 [83] cli_3.6.1 rsvd_1.0.5 [85] fansi_1.0.4 viridisLite_0.4.1 [87] dplyr_1.1.2 uwot_0.1.14 [89] gtable_0.3.3 R.methodsS3_1.8.2 [91] sass_0.4.5 digest_0.6.31 [93] ggrepel_0.9.3 dqrng_0.3.0 [95] farver_2.1.1 memoise_2.0.1 [97] htmltools_0.5.5 R.oo_1.25.0 [99] lifecycle_1.0.3 httr_1.4.5 [101] statmod_1.5.0 bit64_4.0.5 References "],["grun-human-pancreas-cel-seq2.html", "Chapter 5 Grun human pancreas (CEL-seq2) 5.1 Introduction 5.2 Data loading 5.3 Quality control 5.4 Normalization 5.5 Variance modelling 5.6 Data integration 5.7 Dimensionality reduction 5.8 Clustering Session Info", " Chapter 5 Grun human pancreas (CEL-seq2) .rebook-collapse { background-color: #eee; color: #444; cursor: pointer; padding: 18px; width: 100%; border: none; text-align: left; outline: none; font-size: 15px; } .rebook-content { padding: 0 18px; display: none; overflow: hidden; background-color: #f1f1f1; } 5.1 
Introduction This workflow performs an analysis of the Grun et al. (2016) CEL-seq2 dataset consisting of human pancreas cells from various donors. 5.2 Data loading library(scRNAseq) sce.grun &lt;- GrunPancreasData() We convert to Ensembl identifiers, and we remove duplicated genes or genes without Ensembl IDs. library(org.Hs.eg.db) gene.ids &lt;- mapIds(org.Hs.eg.db, keys=rowData(sce.grun)$symbol, keytype=&quot;SYMBOL&quot;, column=&quot;ENSEMBL&quot;) keep &lt;- !is.na(gene.ids) &amp; !duplicated(gene.ids) sce.grun &lt;- sce.grun[keep,] rownames(sce.grun) &lt;- gene.ids[keep] 5.3 Quality control unfiltered &lt;- sce.grun This dataset lacks mitochondrial genes so we will do without them for quality control. We compute the median and MAD while blocking on the donor; for donors where the assumption of a majority of high-quality cells seems to be violated (Figure 5.1), we compute an appropriate threshold using the other donors as specified in the subset= argument. library(scater) stats &lt;- perCellQCMetrics(sce.grun) qc &lt;- quickPerCellQC(stats, percent_subsets=&quot;altexps_ERCC_percent&quot;, batch=sce.grun$donor, subset=sce.grun$donor %in% c(&quot;D17&quot;, &quot;D7&quot;, &quot;D2&quot;)) sce.grun &lt;- sce.grun[,!qc$discard] colData(unfiltered) &lt;- cbind(colData(unfiltered), stats) unfiltered$discard &lt;- qc$discard gridExtra::grid.arrange( plotColData(unfiltered, x=&quot;donor&quot;, y=&quot;sum&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Total count&quot;), plotColData(unfiltered, x=&quot;donor&quot;, y=&quot;detected&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Detected features&quot;), plotColData(unfiltered, x=&quot;donor&quot;, y=&quot;altexps_ERCC_percent&quot;, colour_by=&quot;discard&quot;) + ggtitle(&quot;ERCC percent&quot;), ncol=2 ) Figure 5.1: Distribution of each QC metric across cells from each donor of the Grun pancreas dataset. 
Each point represents a cell and is colored according to whether that cell was discarded. colSums(as.matrix(qc), na.rm=TRUE) ## low_lib_size low_n_features high_altexps_ERCC_percent ## 452 510 606 ## discard ## 665 5.4 Normalization library(scran) set.seed(1000) # for irlba. clusters &lt;- quickCluster(sce.grun) sce.grun &lt;- computeSumFactors(sce.grun, clusters=clusters) sce.grun &lt;- logNormCounts(sce.grun) summary(sizeFactors(sce.grun)) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.094 0.507 0.794 1.000 1.235 10.953 plot(librarySizeFactors(sce.grun), sizeFactors(sce.grun), pch=16, xlab=&quot;Library size factors&quot;, ylab=&quot;Deconvolution factors&quot;, log=&quot;xy&quot;) Figure 5.2: Relationship between the library size factors and the deconvolution size factors in the Grun pancreas dataset. 5.5 Variance modelling We block on a combined plate and donor factor. block &lt;- paste0(sce.grun$sample, &quot;_&quot;, sce.grun$donor) dec.grun &lt;- modelGeneVarWithSpikes(sce.grun, spikes=&quot;ERCC&quot;, block=block) top.grun &lt;- getTopHVGs(dec.grun, prop=0.1) We examine the number of cells in each level of the blocking factor. 
table(block) ## block ## CD13+ sorted cells_D17 CD24+ CD44+ live sorted cells_D17 ## 86 87 ## CD63+ sorted cells_D10 TGFBR3+ sorted cells_D17 ## 40 90 ## exocrine fraction, live sorted cells_D2 exocrine fraction, live sorted cells_D3 ## 82 7 ## live sorted cells, library 1_D10 live sorted cells, library 1_D17 ## 33 88 ## live sorted cells, library 1_D3 live sorted cells, library 1_D7 ## 25 85 ## live sorted cells, library 2_D10 live sorted cells, library 2_D17 ## 35 83 ## live sorted cells, library 2_D3 live sorted cells, library 2_D7 ## 27 84 ## live sorted cells, library 3_D3 live sorted cells, library 3_D7 ## 16 83 ## live sorted cells, library 4_D3 live sorted cells, library 4_D7 ## 29 83 par(mfrow=c(6,3)) blocked.stats &lt;- dec.grun$per.block for (i in colnames(blocked.stats)) { current &lt;- blocked.stats[[i]] plot(current$mean, current$total, main=i, pch=16, cex=0.5, xlab=&quot;Mean of log-expression&quot;, ylab=&quot;Variance of log-expression&quot;) curfit &lt;- metadata(current) points(curfit$mean, curfit$var, col=&quot;red&quot;, pch=16) curve(curfit$trend(x), col=&#39;dodgerblue&#39;, add=TRUE, lwd=2) } Figure 1.4: Per-gene variance as a function of the mean for the log-expression values in the Grun pancreas dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to the spike-in transcripts (red) separately for each donor. 
5.6 Data integration library(batchelor) set.seed(1001010) merged.grun &lt;- fastMNN(sce.grun, subset.row=top.grun, batch=sce.grun$donor) metadata(merged.grun)$merge.info$lost.var ## D10 D17 D2 D3 D7 ## [1,] 0.029789 0.031754 0.000000 0.00000 0.00000 ## [2,] 0.008008 0.012371 0.039101 0.00000 0.00000 ## [3,] 0.004108 0.005397 0.008157 0.05204 0.00000 ## [4,] 0.013393 0.016061 0.016364 0.01510 0.05522 5.7 Dimensionality reduction set.seed(100111) merged.grun &lt;- runTSNE(merged.grun, dimred=&quot;corrected&quot;) 5.8 Clustering snn.gr &lt;- buildSNNGraph(merged.grun, use.dimred=&quot;corrected&quot;) colLabels(merged.grun) &lt;- factor(igraph::cluster_walktrap(snn.gr)$membership) table(Cluster=colLabels(merged.grun), Donor=merged.grun$batch) ## Donor ## Cluster D10 D17 D2 D3 D7 ## 1 32 71 33 80 29 ## 2 11 119 0 0 55 ## 3 2 7 3 3 6 ## 4 3 43 0 0 11 ## 5 5 14 0 0 10 ## 6 4 4 2 4 2 ## 7 11 69 29 3 69 ## 8 16 37 12 10 46 ## 9 14 31 3 2 66 ## 10 1 9 0 0 7 ## 11 4 13 0 0 1 ## 12 5 17 0 2 33 gridExtra::grid.arrange( plotTSNE(merged.grun, colour_by=&quot;label&quot;), plotTSNE(merged.grun, colour_by=&quot;batch&quot;), ncol=2 ) Figure 5.3: Obligatory \\(t\\)-SNE plots of the Grun pancreas dataset. Each point represents a cell that is colored by cluster (left) or batch (right). 
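The cluster-by-donor table above can be summarised further to spot clusters that are missing cells from one or more donors. The following base R sketch uses the first five rows of that table; the object names are ours and the choice of rows is only for brevity.

```r
# Counts copied from the first five rows of the cluster-by-donor table above.
tab <- matrix(c(32,  71, 33, 80, 29,
                11, 119,  0,  0, 55,
                 2,   7,  3,  3,  6,
                 3,  43,  0,  0, 11,
                 5,  14,  0,  0, 10),
              nrow = 5, byrow = TRUE,
              dimnames = list(Cluster = as.character(1:5),
                              Donor = c('D10', 'D17', 'D2', 'D3', 'D7')))

# Fraction of each cluster contributed by each donor (rows sum to 1).
donor.frac <- prop.table(tab, margin = 1)

# Clusters in which at least one donor contributes no cells.
missing.donor <- rownames(tab)[rowSums(tab == 0) > 0]

print(round(donor.frac, 2))
print(missing.donor)
```

A donor that is absent from a cluster may reflect genuine differences in cell type composition or incomplete correction; the table alone cannot distinguish between the two.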
Session Info View session info R version 4.3.0 RC (2023-04-13 r84269) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 22.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C time zone: America/New_York tzcode source: system (glibc) attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] batchelor_1.16.0 scran_1.28.0 [3] scater_1.28.0 ggplot2_3.4.2 [5] scuttle_1.10.0 org.Hs.eg.db_3.17.0 [7] AnnotationDbi_1.62.0 scRNAseq_2.13.0 [9] SingleCellExperiment_1.22.0 SummarizedExperiment_1.30.0 [11] Biobase_2.60.0 GenomicRanges_1.52.0 [13] GenomeInfoDb_1.36.0 IRanges_2.34.0 [15] S4Vectors_0.38.0 BiocGenerics_0.46.0 [17] MatrixGenerics_1.12.0 matrixStats_0.63.0 [19] BiocStyle_2.28.0 rebook_1.10.0 loaded via a namespace (and not attached): [1] jsonlite_1.8.4 CodeDepends_0.6.5 [3] magrittr_2.0.3 ggbeeswarm_0.7.1 [5] GenomicFeatures_1.52.0 farver_2.1.1 [7] rmarkdown_2.21 BiocIO_1.10.0 [9] zlibbioc_1.46.0 vctrs_0.6.2 [11] memoise_2.0.1 Rsamtools_2.16.0 [13] DelayedMatrixStats_1.22.0 RCurl_1.98-1.12 [15] htmltools_0.5.5 progress_1.2.2 [17] AnnotationHub_3.8.0 curl_5.0.0 [19] BiocNeighbors_1.18.0 sass_0.4.5 [21] bslib_0.4.2 cachem_1.0.7 [23] ResidualMatrix_1.10.0 GenomicAlignments_1.36.0 [25] igraph_1.4.2 mime_0.12 [27] lifecycle_1.0.3 pkgconfig_2.0.3 [29] rsvd_1.0.5 Matrix_1.5-4 [31] R6_2.5.1 fastmap_1.1.1 [33] GenomeInfoDbData_1.2.10 shiny_1.7.4 [35] digest_0.6.31 colorspace_2.1-0 [37] dqrng_0.3.0 irlba_2.3.5.1 [39] ExperimentHub_2.8.0 RSQLite_2.3.1 [41] beachmat_2.16.0 labeling_0.4.2 [43] filelock_1.0.2 fansi_1.0.4 [45] httr_1.4.5 compiler_4.3.0 [47] bit64_4.0.5 
withr_2.5.0 [49] BiocParallel_1.34.0 viridis_0.6.2 [51] DBI_1.1.3 highr_0.10 [53] biomaRt_2.56.0 rappdirs_0.3.3 [55] DelayedArray_0.26.0 bluster_1.10.0 [57] rjson_0.2.21 tools_4.3.0 [59] vipor_0.4.5 beeswarm_0.4.0 [61] interactiveDisplayBase_1.38.0 httpuv_1.6.9 [63] glue_1.6.2 restfulr_0.0.15 [65] promises_1.2.0.1 grid_4.3.0 [67] Rtsne_0.16 cluster_2.1.4 [69] generics_0.1.3 gtable_0.3.3 [71] ensembldb_2.24.0 hms_1.1.3 [73] metapod_1.8.0 BiocSingular_1.16.0 [75] ScaledMatrix_1.8.0 xml2_1.3.3 [77] utf8_1.2.3 XVector_0.40.0 [79] ggrepel_0.9.3 BiocVersion_3.17.1 [81] pillar_1.9.0 stringr_1.5.0 [83] limma_3.56.0 later_1.3.0 [85] dplyr_1.1.2 BiocFileCache_2.8.0 [87] lattice_0.21-8 rtracklayer_1.60.0 [89] bit_4.0.5 tidyselect_1.2.0 [91] locfit_1.5-9.7 Biostrings_2.68.0 [93] knitr_1.42 gridExtra_2.3 [95] bookdown_0.33 ProtGenerics_1.32.0 [97] edgeR_3.42.0 xfun_0.39 [99] statmod_1.5.0 stringi_1.7.12 [101] lazyeval_0.2.2 yaml_2.3.7 [103] evaluate_0.20 codetools_0.2-19 [105] tibble_3.2.1 BiocManager_1.30.20 [107] graph_1.78.0 cli_3.6.1 [109] xtable_1.8-4 munsell_0.5.0 [111] jquerylib_0.1.4 Rcpp_1.0.10 [113] dir.expiry_1.8.0 dbplyr_2.3.2 [115] png_0.1-8 XML_3.99-0.14 [117] parallel_4.3.0 ellipsis_0.3.2 [119] blob_1.2.4 prettyunits_1.1.1 [121] AnnotationFilter_1.24.0 sparseMatrixStats_1.12.0 [123] bitops_1.0-7 viridisLite_0.4.1 [125] scales_1.2.1 purrr_1.0.1 [127] crayon_1.5.2 rlang_1.1.0 [129] cowplot_1.1.1 KEGGREST_1.40.0 References "],["muraro-human-pancreas-cel-seq.html", "Chapter 6 Muraro human pancreas (CEL-seq) 6.1 Introduction 6.2 Data loading 6.3 Quality control 6.4 Normalization 6.5 Variance modelling 6.6 Data integration 6.7 Dimensionality reduction 6.8 Clustering Session Info", " Chapter 6 Muraro human pancreas (CEL-seq) .rebook-collapse { background-color: #eee; color: #444; cursor: pointer; padding: 18px; width: 100%; border: none; text-align: left; outline: none; font-size: 15px; } .rebook-content { padding: 0 18px; display: none; overflow: hidden; 
background-color: #f1f1f1; } 6.1 Introduction This workflow performs an analysis of the Muraro et al. (2016) CEL-seq dataset, consisting of human pancreas cells from various donors. 6.2 Data loading library(scRNAseq) sce.muraro &lt;- MuraroPancreasData() We convert back to Ensembl identifiers. library(AnnotationHub) edb &lt;- AnnotationHub()[[&quot;AH73881&quot;]] gene.symb &lt;- sub(&quot;__chr.*$&quot;, &quot;&quot;, rownames(sce.muraro)) gene.ids &lt;- mapIds(edb, keys=gene.symb, keytype=&quot;SYMBOL&quot;, column=&quot;GENEID&quot;) # Removing duplicated genes or genes without Ensembl IDs. keep &lt;- !is.na(gene.ids) &amp; !duplicated(gene.ids) sce.muraro &lt;- sce.muraro[keep,] rownames(sce.muraro) &lt;- gene.ids[keep] 6.3 Quality control unfiltered &lt;- sce.muraro This dataset lacks mitochondrial genes so we will do without them. For the one batch that seems to have a high proportion of low-quality cells, we compute an appropriate filter threshold using a shared median and MAD from the other batches (Figure 6.1). library(scater) stats &lt;- perCellQCMetrics(sce.muraro) qc &lt;- quickPerCellQC(stats, percent_subsets=&quot;altexps_ERCC_percent&quot;, batch=sce.muraro$donor, subset=sce.muraro$donor!=&quot;D28&quot;) sce.muraro &lt;- sce.muraro[,!qc$discard] colData(unfiltered) &lt;- cbind(colData(unfiltered), stats) unfiltered$discard &lt;- qc$discard gridExtra::grid.arrange( plotColData(unfiltered, x=&quot;donor&quot;, y=&quot;sum&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Total count&quot;), plotColData(unfiltered, x=&quot;donor&quot;, y=&quot;detected&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Detected features&quot;), plotColData(unfiltered, x=&quot;donor&quot;, y=&quot;altexps_ERCC_percent&quot;, colour_by=&quot;discard&quot;) + ggtitle(&quot;ERCC percent&quot;), ncol=2 ) Figure 6.1: Distribution of each QC metric across cells from each donor in the Muraro pancreas dataset. 
Each point represents a cell and is colored according to whether that cell was discarded. We have a look at the causes of removal: colSums(as.matrix(qc)) ## low_lib_size low_n_features high_altexps_ERCC_percent ## 663 700 738 ## discard ## 773 6.4 Normalization library(scran) set.seed(1000) clusters &lt;- quickCluster(sce.muraro) sce.muraro &lt;- computeSumFactors(sce.muraro, clusters=clusters) sce.muraro &lt;- logNormCounts(sce.muraro) summary(sizeFactors(sce.muraro)) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.088 0.541 0.821 1.000 1.211 13.987 plot(librarySizeFactors(sce.muraro), sizeFactors(sce.muraro), pch=16, xlab=&quot;Library size factors&quot;, ylab=&quot;Deconvolution factors&quot;, log=&quot;xy&quot;) Figure 6.2: Relationship between the library size factors and the deconvolution size factors in the Muraro pancreas dataset. 6.5 Variance modelling We block on a combined plate and donor factor. block &lt;- paste0(sce.muraro$plate, &quot;_&quot;, sce.muraro$donor) dec.muraro &lt;- modelGeneVarWithSpikes(sce.muraro, &quot;ERCC&quot;, block=block) top.muraro &lt;- getTopHVGs(dec.muraro, prop=0.1) par(mfrow=c(8,4)) blocked.stats &lt;- dec.muraro$per.block for (i in colnames(blocked.stats)) { current &lt;- blocked.stats[[i]] plot(current$mean, current$total, main=i, pch=16, cex=0.5, xlab=&quot;Mean of log-expression&quot;, ylab=&quot;Variance of log-expression&quot;) curfit &lt;- metadata(current) points(curfit$mean, curfit$var, col=&quot;red&quot;, pch=16) curve(curfit$trend(x), col=&#39;dodgerblue&#39;, add=TRUE, lwd=2) } Figure 6.3: Per-gene variance as a function of the mean for the log-expression values in the Muraro pancreas dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to the spike-in transcripts (red) separately for each donor. 
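The QC step in Section 6.3 borrows a median and MAD from the trusted batches to set the threshold for the problematic one. The base R sketch below illustrates the idea on made-up log10 library sizes. Averaging the trusted medians and MADs is an assumption chosen for simplicity, not necessarily how quickPerCellQC() combines them, and the batch names and values are hypothetical.

```r
# Made-up log10 library sizes for three batches; batch C is dominated by
# low-quality cells, so its own median and MAD would be unreliable.
log.sum <- list(
    A = c(4.0, 4.1, 3.9, 4.2, 3.8, 2.5),
    B = c(4.3, 4.4, 4.2, 4.5, 4.1, 2.8),
    C = c(2.4, 2.6, 2.5, 4.1, 2.7, 2.3)
)
trusted <- c('A', 'B')

# Per-batch median and MAD for the trusted batches only.
med <- vapply(log.sum[trusted], median, 0)
dev <- vapply(log.sum[trusted], mad, 0)

# Lower threshold = median - 3 MADs, computed within each trusted batch.
nmads <- 3
lower <- med - nmads * dev

# ASSUMPTION: for the untrusted batch, borrow the average of the trusted
# medians and MADs. This is one simple way of sharing; the library may differ.
lower[['C']] <- mean(med) - nmads * mean(dev)

# Flag cells that fall below the threshold of their own batch.
discard <- Map(function(x, thr) x < thr, log.sum, lower[names(log.sum)])

print(lower)
print(vapply(discard, sum, 0L))
```

With its own statistics, batch C would place its median among the low-quality cells and discard almost nothing; the borrowed threshold removes them instead.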
6.6 Data integration library(batchelor) set.seed(1001010) merged.muraro &lt;- fastMNN(sce.muraro, subset.row=top.muraro, batch=sce.muraro$donor) We use the proportion of variance lost as a diagnostic measure: metadata(merged.muraro)$merge.info$lost.var ## D28 D29 D30 D31 ## [1,] 0.060847 0.024121 0.000000 0.00000 ## [2,] 0.002646 0.003018 0.062421 0.00000 ## [3,] 0.003449 0.002641 0.002598 0.08162 6.7 Dimensionality reduction set.seed(100111) merged.muraro &lt;- runTSNE(merged.muraro, dimred=&quot;corrected&quot;) 6.8 Clustering snn.gr &lt;- buildSNNGraph(merged.muraro, use.dimred=&quot;corrected&quot;) colLabels(merged.muraro) &lt;- factor(igraph::cluster_walktrap(snn.gr)$membership) tab &lt;- table(Cluster=colLabels(merged.muraro), CellType=sce.muraro$label) library(pheatmap) pheatmap(log10(tab+10), color=viridis::viridis(100)) Figure 6.4: Heatmap of the frequency of cells from each cell type label in each cluster. table(Cluster=colLabels(merged.muraro), Donor=merged.muraro$batch) ## Donor ## Cluster D28 D29 D30 D31 ## 1 104 6 57 112 ## 2 59 21 77 97 ## 3 12 75 64 43 ## 4 28 149 126 120 ## 5 87 261 277 214 ## 6 21 7 54 26 ## 7 1 6 6 37 ## 8 6 6 5 2 ## 9 11 68 5 30 ## 10 4 2 5 8 gridExtra::grid.arrange( plotTSNE(merged.muraro, colour_by=&quot;label&quot;), plotTSNE(merged.muraro, colour_by=&quot;batch&quot;), ncol=2 ) Figure 6.5: Obligatory \\(t\\)-SNE plots of the Muraro pancreas dataset. Each point represents a cell that is colored by cluster (left) or batch (right). 
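The lost-variance diagnostic from Section 6.6 can be reduced to a single check per batch. The sketch below copies the printed matrix and compares the largest loss for each donor against a warning level. The 10% level is an assumption used here as a rule of thumb, not a value taken from this chapter.

```r
# Values copied from the lost.var matrix printed in Section 6.6
# (rows are merge steps, columns are donors).
lost.var <- matrix(c(0.060847, 0.024121, 0.000000, 0.00000,
                     0.002646, 0.003018, 0.062421, 0.00000,
                     0.003449, 0.002641, 0.002598, 0.08162),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(NULL, c('D28', 'D29', 'D30', 'D31')))

# Largest proportion of variance lost for each batch over all merge steps.
max.lost <- apply(lost.var, 2, max)

# ASSUMPTION: 10% is used here as a rule-of-thumb warning level.
threshold <- 0.1

print(max.lost)
print(names(max.lost)[max.lost > threshold])
```

No donor exceeds the assumed level here; the largest loss is for D31 at its own merge step.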
Session Info View session info R version 4.3.0 RC (2023-04-13 r84269) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 22.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C time zone: America/New_York tzcode source: system (glibc) attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] pheatmap_1.0.12 batchelor_1.16.0 [3] scran_1.28.0 scater_1.28.0 [5] ggplot2_3.4.2 scuttle_1.10.0 [7] ensembldb_2.24.0 AnnotationFilter_1.24.0 [9] GenomicFeatures_1.52.0 AnnotationDbi_1.62.0 [11] AnnotationHub_3.8.0 BiocFileCache_2.8.0 [13] dbplyr_2.3.2 scRNAseq_2.13.0 [15] SingleCellExperiment_1.22.0 SummarizedExperiment_1.30.0 [17] Biobase_2.60.0 GenomicRanges_1.52.0 [19] GenomeInfoDb_1.36.0 IRanges_2.34.0 [21] S4Vectors_0.38.0 BiocGenerics_0.46.0 [23] MatrixGenerics_1.12.0 matrixStats_0.63.0 [25] BiocStyle_2.28.0 rebook_1.10.0 loaded via a namespace (and not attached): [1] RColorBrewer_1.1-3 jsonlite_1.8.4 [3] CodeDepends_0.6.5 magrittr_2.0.3 [5] ggbeeswarm_0.7.1 farver_2.1.1 [7] rmarkdown_2.21 BiocIO_1.10.0 [9] zlibbioc_1.46.0 vctrs_0.6.2 [11] memoise_2.0.1 Rsamtools_2.16.0 [13] DelayedMatrixStats_1.22.0 RCurl_1.98-1.12 [15] htmltools_0.5.5 progress_1.2.2 [17] curl_5.0.0 BiocNeighbors_1.18.0 [19] sass_0.4.5 bslib_0.4.2 [21] cachem_1.0.7 ResidualMatrix_1.10.0 [23] GenomicAlignments_1.36.0 igraph_1.4.2 [25] mime_0.12 lifecycle_1.0.3 [27] pkgconfig_2.0.3 rsvd_1.0.5 [29] Matrix_1.5-4 R6_2.5.1 [31] fastmap_1.1.1 GenomeInfoDbData_1.2.10 [33] shiny_1.7.4 digest_0.6.31 [35] colorspace_2.1-0 dqrng_0.3.0 [37] irlba_2.3.5.1 ExperimentHub_2.8.0 [39] RSQLite_2.3.1 
beachmat_2.16.0 [41] labeling_0.4.2 filelock_1.0.2 [43] fansi_1.0.4 httr_1.4.5 [45] compiler_4.3.0 bit64_4.0.5 [47] withr_2.5.0 BiocParallel_1.34.0 [49] viridis_0.6.2 DBI_1.1.3 [51] highr_0.10 biomaRt_2.56.0 [53] rappdirs_0.3.3 DelayedArray_0.26.0 [55] bluster_1.10.0 rjson_0.2.21 [57] tools_4.3.0 vipor_0.4.5 [59] beeswarm_0.4.0 interactiveDisplayBase_1.38.0 [61] httpuv_1.6.9 glue_1.6.2 [63] restfulr_0.0.15 promises_1.2.0.1 [65] grid_4.3.0 Rtsne_0.16 [67] cluster_2.1.4 generics_0.1.3 [69] gtable_0.3.3 hms_1.1.3 [71] metapod_1.8.0 BiocSingular_1.16.0 [73] ScaledMatrix_1.8.0 xml2_1.3.3 [75] utf8_1.2.3 XVector_0.40.0 [77] ggrepel_0.9.3 BiocVersion_3.17.1 [79] pillar_1.9.0 stringr_1.5.0 [81] limma_3.56.0 later_1.3.0 [83] dplyr_1.1.2 lattice_0.21-8 [85] rtracklayer_1.60.0 bit_4.0.5 [87] tidyselect_1.2.0 locfit_1.5-9.7 [89] Biostrings_2.68.0 knitr_1.42 [91] gridExtra_2.3 bookdown_0.33 [93] ProtGenerics_1.32.0 edgeR_3.42.0 [95] xfun_0.39 statmod_1.5.0 [97] stringi_1.7.12 lazyeval_0.2.2 [99] yaml_2.3.7 evaluate_0.20 [101] codetools_0.2-19 tibble_3.2.1 [103] BiocManager_1.30.20 graph_1.78.0 [105] cli_3.6.1 xtable_1.8-4 [107] munsell_0.5.0 jquerylib_0.1.4 [109] Rcpp_1.0.10 dir.expiry_1.8.0 [111] png_0.1-8 XML_3.99-0.14 [113] parallel_4.3.0 ellipsis_0.3.2 [115] blob_1.2.4 prettyunits_1.1.1 [117] sparseMatrixStats_1.12.0 bitops_1.0-7 [119] viridisLite_0.4.1 scales_1.2.1 [121] purrr_1.0.1 crayon_1.5.2 [123] rlang_1.1.0 cowplot_1.1.1 [125] KEGGREST_1.40.0 References "],["lawlor-human-pancreas-smarter.html", "Chapter 7 Lawlor human pancreas (SMARTer) 7.1 Introduction 7.2 Data loading 7.3 Quality control 7.4 Normalization 7.5 Variance modelling 7.6 Dimensionality reduction 7.7 Clustering Session Info", " Chapter 7 Lawlor human pancreas (SMARTer) .rebook-collapse { background-color: #eee; color: #444; cursor: pointer; padding: 18px; width: 100%; border: none; text-align: left; outline: none; font-size: 15px; } .rebook-content { padding: 0 18px; display: none; overflow: hidden; 
background-color: #f1f1f1; } 7.1 Introduction This performs an analysis of the Lawlor et al. (2017) dataset, consisting of human pancreas cells from various donors. 7.2 Data loading library(scRNAseq) sce.lawlor &lt;- LawlorPancreasData() library(AnnotationHub) edb &lt;- AnnotationHub()[[&quot;AH73881&quot;]] anno &lt;- select(edb, keys=rownames(sce.lawlor), keytype=&quot;GENEID&quot;, columns=c(&quot;SYMBOL&quot;, &quot;SEQNAME&quot;)) rowData(sce.lawlor) &lt;- anno[match(rownames(sce.lawlor), anno[,1]),-1] 7.3 Quality control unfiltered &lt;- sce.lawlor library(scater) stats &lt;- perCellQCMetrics(sce.lawlor, subsets=list(Mito=which(rowData(sce.lawlor)$SEQNAME==&quot;MT&quot;))) qc &lt;- quickPerCellQC(stats, percent_subsets=&quot;subsets_Mito_percent&quot;, batch=sce.lawlor$`islet unos id`) sce.lawlor &lt;- sce.lawlor[,!qc$discard] colData(unfiltered) &lt;- cbind(colData(unfiltered), stats) unfiltered$discard &lt;- qc$discard gridExtra::grid.arrange( plotColData(unfiltered, x=&quot;islet unos id&quot;, y=&quot;sum&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Total count&quot;) + theme(axis.text.x = element_text(angle = 90)), plotColData(unfiltered, x=&quot;islet unos id&quot;, y=&quot;detected&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Detected features&quot;) + theme(axis.text.x = element_text(angle = 90)), plotColData(unfiltered, x=&quot;islet unos id&quot;, y=&quot;subsets_Mito_percent&quot;, colour_by=&quot;discard&quot;) + ggtitle(&quot;Mito percent&quot;) + theme(axis.text.x = element_text(angle = 90)), ncol=2 ) Figure 7.1: Distribution of each QC metric across cells from each donor of the Lawlor pancreas dataset. Each point represents a cell and is colored according to whether that cell was discarded. 
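The discard calls plotted above come from outlier detection on each QC metric. As a minimal base R sketch, assuming a threshold of three median absolute deviations (MADs) above the median for a percentage metric, the idea looks roughly like this. The simulated percentages and the helper high_outlier() are hypothetical, and quickPerCellQC() has further options (such as the per-batch thresholds used here) that are not reproduced.

```r
set.seed(100)
# Hypothetical mitochondrial percentages: mostly low, plus a few damaged cells.
mito <- c(rnorm(95, mean = 5, sd = 1), c(20, 25, 30, 35, 40))

# Flag values more than 'nmads' MADs above the median.
high_outlier <- function(x, nmads = 3) {
    threshold <- median(x) + nmads * mad(x)
    x > threshold
}

discard <- high_outlier(mito)
table(discard)
```

Because the median and MAD are robust statistics, the handful of extreme values has little influence on the threshold that is used to remove them.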
plotColData(unfiltered, x=&quot;sum&quot;, y=&quot;subsets_Mito_percent&quot;, colour_by=&quot;discard&quot;) + scale_x_log10() Figure 7.2: Percentage of mitochondrial reads in each cell in the Lawlor pancreas dataset compared to the total count. Each point represents a cell and is colored according to whether that cell was discarded. colSums(as.matrix(qc)) ## low_lib_size low_n_features high_subsets_Mito_percent ## 9 5 25 ## discard ## 34 7.4 Normalization library(scran) set.seed(1000) clusters &lt;- quickCluster(sce.lawlor) sce.lawlor &lt;- computeSumFactors(sce.lawlor, clusters=clusters) sce.lawlor &lt;- logNormCounts(sce.lawlor) summary(sizeFactors(sce.lawlor)) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.295 0.781 0.963 1.000 1.182 2.629 plot(librarySizeFactors(sce.lawlor), sizeFactors(sce.lawlor), pch=16, xlab=&quot;Library size factors&quot;, ylab=&quot;Deconvolution factors&quot;, log=&quot;xy&quot;) Figure 7.3: Relationship between the library size factors and the deconvolution size factors in the Lawlor pancreas dataset. 7.5 Variance modelling We block on the donor of origin, using the islet UNOS ID as the blocking factor. dec.lawlor &lt;- modelGeneVar(sce.lawlor, block=sce.lawlor$`islet unos id`) chosen.genes &lt;- getTopHVGs(dec.lawlor, n=2000) par(mfrow=c(4,2)) blocked.stats &lt;- dec.lawlor$per.block for (i in colnames(blocked.stats)) { current &lt;- blocked.stats[[i]] plot(current$mean, current$total, main=i, pch=16, cex=0.5, xlab=&quot;Mean of log-expression&quot;, ylab=&quot;Variance of log-expression&quot;) curfit &lt;- metadata(current) curve(curfit$trend(x), col=&#39;dodgerblue&#39;, add=TRUE, lwd=2) } Figure 7.4: Per-gene variance as a function of the mean for the log-expression values in the Lawlor pancreas dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted separately for each donor.
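For readers unfamiliar with the library size factors plotted in the normalization step, the sketch below shows the underlying arithmetic in base R on a tiny hypothetical count matrix: each factor is the total count of the cell divided by the mean total across cells, so the factors average to 1. The last step mirrors the default log-transformation (log2 with a pseudo-count of 1). The deconvolution factors from computeSumFactors() are estimated differently and are not reproduced here.

```r
# Hypothetical count matrix: 4 genes (rows) by 3 cells (columns).
counts <- matrix(c(10, 0, 5, 5,
                   20, 2, 8, 10,
                   5, 1, 2, 2),
                 nrow = 4,
                 dimnames = list(paste0('gene', 1:4), paste0('cell', 1:3)))

# Library size = total count per cell.
lib.sizes <- colSums(counts)

# Library size factors: library sizes scaled to have a mean of 1.
lib.sf <- lib.sizes / mean(lib.sizes)
lib.sf

# Normalized log-expression: log2 of the scaled counts plus a pseudo-count of 1.
logcounts <- log2(t(t(counts) / lib.sf) + 1)
```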
7.6 Dimensionality reduction library(BiocSingular) set.seed(101011001) sce.lawlor &lt;- runPCA(sce.lawlor, subset_row=chosen.genes, ncomponents=25) sce.lawlor &lt;- runTSNE(sce.lawlor, dimred=&quot;PCA&quot;) 7.7 Clustering snn.gr &lt;- buildSNNGraph(sce.lawlor, use.dimred=&quot;PCA&quot;) colLabels(sce.lawlor) &lt;- factor(igraph::cluster_walktrap(snn.gr)$membership) table(colLabels(sce.lawlor), sce.lawlor$`cell type`) ## ## Acinar Alpha Beta Delta Ductal Gamma/PP None/Other Stellate ## 1 1 0 0 13 2 16 2 0 ## 2 0 1 76 1 0 0 0 0 ## 3 0 161 1 0 0 1 2 0 ## 4 0 1 0 1 0 0 5 19 ## 5 0 0 175 4 1 0 1 0 ## 6 22 0 0 0 0 0 0 0 ## 7 0 75 0 0 0 0 0 0 ## 8 0 0 0 1 20 0 2 0 table(colLabels(sce.lawlor), sce.lawlor$`islet unos id`) ## ## ACCG268 ACCR015A ACEK420A ACEL337 ACHY057 ACIB065 ACIW009 ACJV399 ## 1 8 2 2 4 4 4 9 1 ## 2 14 3 2 33 3 2 4 17 ## 3 36 23 14 13 14 14 21 30 ## 4 7 1 0 1 0 4 9 4 ## 5 34 10 4 39 7 23 24 40 ## 6 0 2 13 0 0 0 5 2 ## 7 32 12 0 5 6 7 4 9 ## 8 1 1 2 1 2 1 12 3 gridExtra::grid.arrange( plotTSNE(sce.lawlor, colour_by=&quot;label&quot;), plotTSNE(sce.lawlor, colour_by=&quot;islet unos id&quot;), ncol=2 ) Figure 7.5: Obligatory \\(t\\)-SNE plots of the Lawlor pancreas dataset. Each point represents a cell that is colored by cluster (left) or batch (right).
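The cluster-by-cell-type table above can be condensed into a per-cluster summary. The base R sketch below re-enters the printed counts and reports the most frequent author-provided label in each cluster along with the fraction of the cluster it accounts for. The names dominant and purity are hypothetical and used only for this illustration.

```r
cell.types <- c('Acinar', 'Alpha', 'Beta', 'Delta', 'Ductal',
                'Gamma/PP', 'None/Other', 'Stellate')
# Counts copied from the cluster-by-cell-type table shown above.
tab <- matrix(
    c( 1,   0,   0, 13,  2, 16, 2,  0,
       0,   1,  76,  1,  0,  0, 0,  0,
       0, 161,   1,  0,  0,  1, 2,  0,
       0,   1,   0,  1,  0,  0, 5, 19,
       0,   0, 175,  4,  1,  0, 1,  0,
      22,   0,   0,  0,  0,  0, 0,  0,
       0,  75,   0,  0,  0,  0, 0,  0,
       0,   0,   0,  1, 20,  0, 2,  0),
    nrow = 8, byrow = TRUE,
    dimnames = list(Cluster = as.character(1:8), CellType = cell.types))

# Most frequent annotated cell type in each cluster.
dominant <- cell.types[apply(tab, 1, which.max)]

# Fraction of each cluster made up by its dominant cell type.
purity <- apply(tab, 1, max) / rowSums(tab)

data.frame(Cluster = rownames(tab), Dominant = dominant,
           Purity = round(purity, 2))
```

Most clusters are dominated by a single label, while cluster 1 is a mixture of mainly delta and gamma/PP cells, which is the kind of cluster that would merit subclustering or closer inspection.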
Session Info View session info R version 4.3.0 RC (2023-04-13 r84269) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 22.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C time zone: America/New_York tzcode source: system (glibc) attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] BiocSingular_1.16.0 scran_1.28.0 [3] scater_1.28.0 ggplot2_3.4.2 [5] scuttle_1.10.0 ensembldb_2.24.0 [7] AnnotationFilter_1.24.0 GenomicFeatures_1.52.0 [9] AnnotationDbi_1.62.0 AnnotationHub_3.8.0 [11] BiocFileCache_2.8.0 dbplyr_2.3.2 [13] scRNAseq_2.13.0 SingleCellExperiment_1.22.0 [15] SummarizedExperiment_1.30.0 Biobase_2.60.0 [17] GenomicRanges_1.52.0 GenomeInfoDb_1.36.0 [19] IRanges_2.34.0 S4Vectors_0.38.0 [21] BiocGenerics_0.46.0 MatrixGenerics_1.12.0 [23] matrixStats_0.63.0 BiocStyle_2.28.0 [25] rebook_1.10.0 loaded via a namespace (and not attached): [1] jsonlite_1.8.4 CodeDepends_0.6.5 [3] magrittr_2.0.3 ggbeeswarm_0.7.1 [5] farver_2.1.1 rmarkdown_2.21 [7] BiocIO_1.10.0 zlibbioc_1.46.0 [9] vctrs_0.6.2 memoise_2.0.1 [11] Rsamtools_2.16.0 DelayedMatrixStats_1.22.0 [13] RCurl_1.98-1.12 htmltools_0.5.5 [15] progress_1.2.2 curl_5.0.0 [17] BiocNeighbors_1.18.0 sass_0.4.5 [19] bslib_0.4.2 cachem_1.0.7 [21] GenomicAlignments_1.36.0 igraph_1.4.2 [23] mime_0.12 lifecycle_1.0.3 [25] pkgconfig_2.0.3 rsvd_1.0.5 [27] Matrix_1.5-4 R6_2.5.1 [29] fastmap_1.1.1 GenomeInfoDbData_1.2.10 [31] shiny_1.7.4 digest_0.6.31 [33] colorspace_2.1-0 dqrng_0.3.0 [35] irlba_2.3.5.1 ExperimentHub_2.8.0 [37] RSQLite_2.3.1 beachmat_2.16.0 [39] labeling_0.4.2 filelock_1.0.2 [41] fansi_1.0.4 
httr_1.4.5 [43] compiler_4.3.0 bit64_4.0.5 [45] withr_2.5.0 BiocParallel_1.34.0 [47] viridis_0.6.2 DBI_1.1.3 [49] highr_0.10 biomaRt_2.56.0 [51] rappdirs_0.3.3 DelayedArray_0.26.0 [53] bluster_1.10.0 rjson_0.2.21 [55] tools_4.3.0 vipor_0.4.5 [57] beeswarm_0.4.0 interactiveDisplayBase_1.38.0 [59] httpuv_1.6.9 glue_1.6.2 [61] restfulr_0.0.15 promises_1.2.0.1 [63] grid_4.3.0 Rtsne_0.16 [65] cluster_2.1.4 generics_0.1.3 [67] gtable_0.3.3 hms_1.1.3 [69] metapod_1.8.0 ScaledMatrix_1.8.0 [71] xml2_1.3.3 utf8_1.2.3 [73] XVector_0.40.0 ggrepel_0.9.3 [75] BiocVersion_3.17.1 pillar_1.9.0 [77] stringr_1.5.0 limma_3.56.0 [79] later_1.3.0 dplyr_1.1.2 [81] lattice_0.21-8 rtracklayer_1.60.0 [83] bit_4.0.5 tidyselect_1.2.0 [85] locfit_1.5-9.7 Biostrings_2.68.0 [87] knitr_1.42 gridExtra_2.3 [89] bookdown_0.33 ProtGenerics_1.32.0 [91] edgeR_3.42.0 xfun_0.39 [93] statmod_1.5.0 stringi_1.7.12 [95] lazyeval_0.2.2 yaml_2.3.7 [97] evaluate_0.20 codetools_0.2-19 [99] tibble_3.2.1 BiocManager_1.30.20 [101] graph_1.78.0 cli_3.6.1 [103] xtable_1.8-4 munsell_0.5.0 [105] jquerylib_0.1.4 Rcpp_1.0.10 [107] dir.expiry_1.8.0 png_0.1-8 [109] XML_3.99-0.14 parallel_4.3.0 [111] ellipsis_0.3.2 blob_1.2.4 [113] prettyunits_1.1.1 sparseMatrixStats_1.12.0 [115] bitops_1.0-7 viridisLite_0.4.1 [117] scales_1.2.1 purrr_1.0.1 [119] crayon_1.5.2 rlang_1.1.0 [121] cowplot_1.1.1 KEGGREST_1.40.0 References "],["segerstolpe-human-pancreas-smart-seq2.html", "Chapter 8 Segerstolpe human pancreas (Smart-seq2) 8.1 Introduction 8.2 Data loading 8.3 Quality control 8.4 Normalization 8.5 Variance modelling 8.6 Dimensionality reduction 8.7 Clustering 8.8 Data integration 8.9 Multi-sample comparisons Session Info", " Chapter 8 Segerstolpe human pancreas (Smart-seq2) .rebook-collapse { background-color: #eee; color: #444; cursor: pointer; padding: 18px; width: 100%; border: none; text-align: left; outline: none; font-size: 15px; } .rebook-content { padding: 0 18px; display: none; overflow: hidden; background-color: #f1f1f1; 
} 8.1 Introduction This performs an analysis of the Segerstolpe et al. (2016) dataset, consisting of human pancreas cells from various donors. 8.2 Data loading library(scRNAseq) sce.seger &lt;- SegerstolpePancreasData() library(AnnotationHub) edb &lt;- AnnotationHub()[[&quot;AH73881&quot;]] symbols &lt;- rowData(sce.seger)$symbol ens.id &lt;- mapIds(edb, keys=symbols, keytype=&quot;SYMBOL&quot;, column=&quot;GENEID&quot;) ens.id &lt;- ifelse(is.na(ens.id), symbols, ens.id) # Removing duplicated rows. keep &lt;- !duplicated(ens.id) sce.seger &lt;- sce.seger[keep,] rownames(sce.seger) &lt;- ens.id[keep] We simplify the names of some of the relevant column metadata fields for ease of access. Some editing of the cell type labels is necessary for consistency with other data sets. emtab.meta &lt;- colData(sce.seger)[,c(&quot;cell type&quot;, &quot;disease&quot;, &quot;individual&quot;, &quot;single cell well quality&quot;)] colnames(emtab.meta) &lt;- c(&quot;CellType&quot;, &quot;Disease&quot;, &quot;Donor&quot;, &quot;Quality&quot;) colData(sce.seger) &lt;- emtab.meta sce.seger$CellType &lt;- gsub(&quot; cell&quot;, &quot;&quot;, sce.seger$CellType) sce.seger$CellType &lt;- paste0( toupper(substr(sce.seger$CellType, 1, 1)), substring(sce.seger$CellType, 2)) 8.3 Quality control unfiltered &lt;- sce.seger We remove low quality cells that were marked by the authors. We then perform additional quality control as some of the remaining cells still have very low counts and numbers of detected features. For some batches that seem to have a majority of low-quality cells (Figure 8.1), we use the other batches to define an appropriate threshold via subset=. 
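To see why thresholds are borrowed from other batches, consider the base R sketch below. It simulates log-transformed library sizes for two good batches and one batch where nearly all cells are poor, then compares a lower threshold computed from the poor batch itself with one computed from the good batches only. The data, the three-MAD rule as written, and the helper lower_threshold() are assumptions for illustration; this is not the exact logic behind the subset= argument.

```r
set.seed(1)
# Hypothetical log10 library sizes: two good batches and one mostly poor batch.
batch <- rep(c('good1', 'good2', 'poor'), each = 100)
log.lib <- c(rnorm(200, mean = 5, sd = 0.3),   # good batches
             rnorm(100, mean = 3, sd = 0.3))   # poor batch

# Lower threshold at three MADs below the median of a reference set.
lower_threshold <- function(x, nmads = 3) median(x) - nmads * mad(x)

is.poor <- batch == 'poor'
# Threshold from the poor batch itself vs. from the trusted batches only.
thresh.own <- lower_threshold(log.lib[is.poor])
thresh.subset <- lower_threshold(log.lib[!is.poor])

# Number of cells from the poor batch discarded under each threshold.
c(own = sum(log.lib[is.poor] < thresh.own),
  subset = sum(log.lib[is.poor] < thresh.subset))
```

When most cells in a batch are of low quality, the median of that batch is itself low, so outlier detection within the batch removes almost nothing. A threshold defined from the other batches avoids this problem.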
low.qual &lt;- sce.seger$Quality == &quot;low quality cell&quot; library(scater) stats &lt;- perCellQCMetrics(sce.seger) qc &lt;- quickPerCellQC(stats, percent_subsets=&quot;altexps_ERCC_percent&quot;, batch=sce.seger$Donor, subset=!sce.seger$Donor %in% c(&quot;HP1504901&quot;, &quot;HP1509101&quot;)) sce.seger &lt;- sce.seger[,!(qc$discard | low.qual)] colData(unfiltered) &lt;- cbind(colData(unfiltered), stats) unfiltered$discard &lt;- qc$discard gridExtra::grid.arrange( plotColData(unfiltered, x=&quot;Donor&quot;, y=&quot;sum&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Total count&quot;) + theme(axis.text.x = element_text(angle = 90)), plotColData(unfiltered, x=&quot;Donor&quot;, y=&quot;detected&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Detected features&quot;) + theme(axis.text.x = element_text(angle = 90)), plotColData(unfiltered, x=&quot;Donor&quot;, y=&quot;altexps_ERCC_percent&quot;, colour_by=&quot;discard&quot;) + ggtitle(&quot;ERCC percent&quot;) + theme(axis.text.x = element_text(angle = 90)), ncol=2 ) Figure 8.1: Distribution of each QC metric across cells from each donor of the Segerstolpe pancreas dataset. Each point represents a cell and is colored according to whether that cell was discarded. colSums(as.matrix(qc)) ## low_lib_size low_n_features high_altexps_ERCC_percent ## 788 1056 1031 ## discard ## 1246 8.4 Normalization We don’t normalize the spike-ins at this point as there are some cells with no spike-in counts. library(scran) clusters &lt;- quickCluster(sce.seger) sce.seger &lt;- computeSumFactors(sce.seger, clusters=clusters) sce.seger &lt;- logNormCounts(sce.seger) summary(sizeFactors(sce.seger)) ## Min. 1st Qu. Median Mean 3rd Qu. Max. 
## 0.014 0.390 0.708 1.000 1.332 11.182 plot(librarySizeFactors(sce.seger), sizeFactors(sce.seger), pch=16, xlab=&quot;Library size factors&quot;, ylab=&quot;Deconvolution factors&quot;, log=&quot;xy&quot;) Figure 8.2: Relationship between the library size factors and the deconvolution size factors in the Segerstolpe pancreas dataset. 8.5 Variance modelling We do not use cells with no spike-ins for variance modelling. Donor AZ also has very low spike-in counts and is subsequently ignored. for.hvg &lt;- sce.seger[,librarySizeFactors(altExp(sce.seger)) &gt; 0 &amp; sce.seger$Donor!=&quot;AZ&quot;] dec.seger &lt;- modelGeneVarWithSpikes(for.hvg, &quot;ERCC&quot;, block=for.hvg$Donor) chosen.hvgs &lt;- getTopHVGs(dec.seger, n=2000) par(mfrow=c(3,3)) blocked.stats &lt;- dec.seger$per.block for (i in colnames(blocked.stats)) { current &lt;- blocked.stats[[i]] plot(current$mean, current$total, main=i, pch=16, cex=0.5, xlab=&quot;Mean of log-expression&quot;, ylab=&quot;Variance of log-expression&quot;) curfit &lt;- metadata(current) points(curfit$mean, curfit$var, col=&quot;red&quot;, pch=16) curve(curfit$trend(x), col=&#39;dodgerblue&#39;, add=TRUE, lwd=2) } Figure 8.3: Per-gene variance as a function of the mean for the log-expression values in the Segerstolpe pancreas dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to the spike-in transcripts (red) separately for each donor. 8.6 Dimensionality reduction We pick the first 25 PCs for downstream analyses, as it’s a nice square number. library(BiocSingular) set.seed(101011001) sce.seger &lt;- runPCA(sce.seger, subset_row=chosen.hvgs, ncomponents=25) sce.seger &lt;- runTSNE(sce.seger, dimred=&quot;PCA&quot;) 8.7 Clustering library(bluster) clust.out &lt;- clusterRows(reducedDim(sce.seger, &quot;PCA&quot;), NNGraphParam(), full=TRUE) snn.gr &lt;- clust.out$objects$graph colLabels(sce.seger) &lt;- clust.out$clusters We see a strong donor effect in Figures 8.4 and 8.5.
This might be due to differences in cell type composition between donors, but the more likely explanation is that of a technical difference in plate processing or uninteresting genotypic differences. The implication is that we should have called fastMNN() at some point. tab &lt;- table(Cluster=colLabels(sce.seger), Donor=sce.seger$Donor) library(pheatmap) pheatmap(log10(tab+10), color=viridis::viridis(100)) Figure 8.4: Heatmap of the frequency of cells from each donor in each cluster. gridExtra::grid.arrange( plotTSNE(sce.seger, colour_by=&quot;label&quot;), plotTSNE(sce.seger, colour_by=&quot;Donor&quot;), ncol=2 ) Figure 8.5: Obligatory \\(t\\)-SNE plots of the Segerstolpe pancreas dataset. Each point represents a cell that is colored by cluster (left) or batch (right). 8.8 Data integration We repeat the clustering after running fastMNN() on the donors. This yields a more coherent set of clusters in Figure 8.6 where each cluster contains contributions from all donors. library(batchelor) set.seed(10001010) corrected &lt;- fastMNN(sce.seger, batch=sce.seger$Donor, subset.row=chosen.hvgs) set.seed(10000001) corrected &lt;- runTSNE(corrected, dimred=&quot;corrected&quot;) colLabels(corrected) &lt;- clusterRows(reducedDim(corrected, &quot;corrected&quot;), NNGraphParam()) tab &lt;- table(Cluster=colLabels(corrected), Donor=corrected$batch) tab ## Donor ## Cluster AZ HP1502401 HP1504101T2D HP1504901 HP1506401 HP1507101 HP1508501T2D ## 1 3 19 3 11 67 8 78 ## 2 14 53 13 19 37 41 20 ## 3 2 2 1 1 44 1 1 ## 4 2 18 7 3 36 2 28 ## 5 29 114 140 72 26 136 121 ## 6 8 21 9 6 2 6 6 ## 7 1 1 1 9 0 1 2 ## 8 2 1 3 10 2 6 12 ## 9 4 20 70 8 16 2 8 ## Donor ## Cluster HP1509101 HP1525301T2D HP1526901T2D ## 1 27 124 46 ## 2 14 11 70 ## 3 0 1 4 ## 4 2 23 9 ## 5 49 85 96 ## 6 11 5 34 ## 7 2 2 1 ## 8 3 13 4 ## 9 1 10 34 gridExtra::grid.arrange( plotTSNE(corrected, colour_by=&quot;label&quot;), plotTSNE(corrected, colour_by=&quot;batch&quot;), ncol=2 ) Figure 8.6: Yet another \\(t\\)-SNE 
plot of the Segerstolpe dataset, this time after batch correction across donors. Each point represents a cell and is colored by the assigned cluster identity (left) or the donor of origin (right). 8.9 Multi-sample comparisons This particular dataset contains both healthy donors and those with type II diabetes. It is thus of some interest to identify genes that are differentially expressed upon disease in each cell type. To keep things simple, we use the author-provided annotation rather than determining the cell type for each of our clusters. summed &lt;- aggregateAcrossCells(sce.seger, ids=colData(sce.seger)[,c(&quot;Donor&quot;, &quot;CellType&quot;)]) summed ## class: SingleCellExperiment ## dim: 25454 105 ## metadata(0): ## assays(1): counts ## rownames(25454): ENSG00000118473 ENSG00000142920 ... ENSG00000278306 ## eGFP ## rowData names(2): symbol refseq ## colnames: NULL ## colData names(9): CellType Disease ... CellType ncells ## reducedDimNames(2): PCA TSNE ## mainExpName: endogenous ## altExpNames(0): Here, we will use the voom pipeline from the limma package instead of the QL approach with edgeR. This allows us to use sample weights to better account for the variation in the precision of each pseudo-bulk profile. We see that insulin is downregulated in beta cells in the disease state, which is sensible enough.
summed.beta &lt;- summed[,summed$CellType==&quot;Beta&quot;] library(edgeR) y.beta &lt;- DGEList(counts(summed.beta), samples=colData(summed.beta), genes=rowData(summed.beta)[,&quot;symbol&quot;,drop=FALSE]) y.beta &lt;- y.beta[filterByExpr(y.beta, group=y.beta$samples$Disease),] y.beta &lt;- calcNormFactors(y.beta) design &lt;- model.matrix(~Disease, y.beta$samples) v.beta &lt;- voomWithQualityWeights(y.beta, design) fit.beta &lt;- lmFit(v.beta) fit.beta &lt;- eBayes(fit.beta, robust=TRUE) res.beta &lt;- topTable(fit.beta, sort.by=&quot;p&quot;, n=Inf, coef=&quot;Diseasetype II diabetes mellitus&quot;) head(res.beta) ## symbol logFC AveExpr t P.Value adj.P.Val B ## ENSG00000254647 INS -2.728 16.680 -7.671 3.191e-06 0.03902 4.842 ## ENSG00000137731 FXYD2 -2.595 7.265 -6.705 1.344e-05 0.08219 3.353 ## ENSG00000169297 NR0B1 -2.092 6.790 -5.789 5.810e-05 0.09916 1.984 ## ENSG00000181029 TRAPPC5 -2.127 7.046 -5.678 7.007e-05 0.09916 1.877 ## ENSG00000105707 HPN -1.803 6.118 -5.654 7.298e-05 0.09916 1.740 ## LOC284889 LOC284889 -2.113 6.652 -5.515 9.259e-05 0.09916 1.571 We also create some diagnostic plots to check for potential problems in the analysis. The MA plots exhibit the expected shape (Figure 8.7) while the differences in the sample weights in Figure 8.8 justify the use of voom() in this context. par(mfrow=c(5, 2)) for (i in colnames(y.beta)) { plotMD(y.beta, column=i) } Figure 8.7: MA plots for the beta cell pseudo-bulk profiles. Each MA plot is generated by comparing the corresponding pseudo-bulk profile against the average of all other profiles # Easier to just re-run it with plot=TRUE than # to try to make the plot from &#39;v.beta&#39;. voomWithQualityWeights(y.beta, design, plot=TRUE) Figure 8.8: Diagnostic plots for voom after estimating observation and quality weights from the beta cell pseudo-bulk profiles. The left plot shows the mean-variance trend used to estimate the observation weights, while the right plot shows the per-sample quality weights. 
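The pseudo-bulk profiles used in this section are sums of counts across all cells sharing the same donor and cell type. A minimal base R sketch of that aggregation is shown below, using a small simulated count matrix. The object names and annotations are hypothetical; aggregateAcrossCells() additionally carries over the column metadata, which is not reproduced here.

```r
set.seed(10)
# Hypothetical counts: 3 genes by 8 cells.
counts <- matrix(rpois(24, lambda = 5), nrow = 3,
                 dimnames = list(paste0('gene', 1:3), paste0('cell', 1:8)))
# Hypothetical donor and cell type annotation for each cell.
donor <- rep(c('D1', 'D2'), each = 4)
celltype <- rep(c('Alpha', 'Beta'), times = 4)

# One pseudo-bulk profile per donor/cell type combination.
ids <- paste(donor, celltype, sep = '_')
pseudo <- t(rowsum(t(counts), group = ids))
pseudo

# Number of cells contributing to each profile.
ncells <- table(ids)
```

Profiles built from few cells are less precise than those built from many, which is one motivation for the sample weights estimated by voomWithQualityWeights().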
Session Info View session info R version 4.3.0 RC (2023-04-13 r84269) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 22.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C time zone: America/New_York tzcode source: system (glibc) attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] edgeR_3.42.0 limma_3.56.0 [3] batchelor_1.16.0 pheatmap_1.0.12 [5] bluster_1.10.0 BiocSingular_1.16.0 [7] scran_1.28.0 scater_1.28.0 [9] ggplot2_3.4.2 scuttle_1.10.0 [11] ensembldb_2.24.0 AnnotationFilter_1.24.0 [13] GenomicFeatures_1.52.0 AnnotationDbi_1.62.0 [15] AnnotationHub_3.8.0 BiocFileCache_2.8.0 [17] dbplyr_2.3.2 scRNAseq_2.13.0 [19] SingleCellExperiment_1.22.0 SummarizedExperiment_1.30.0 [21] Biobase_2.60.0 GenomicRanges_1.52.0 [23] GenomeInfoDb_1.36.0 IRanges_2.34.0 [25] S4Vectors_0.38.0 BiocGenerics_0.46.0 [27] MatrixGenerics_1.12.0 matrixStats_0.63.0 [29] BiocStyle_2.28.0 rebook_1.10.0 loaded via a namespace (and not attached): [1] RColorBrewer_1.1-3 jsonlite_1.8.4 [3] CodeDepends_0.6.5 magrittr_2.0.3 [5] ggbeeswarm_0.7.1 farver_2.1.1 [7] rmarkdown_2.21 BiocIO_1.10.0 [9] zlibbioc_1.46.0 vctrs_0.6.2 [11] memoise_2.0.1 Rsamtools_2.16.0 [13] DelayedMatrixStats_1.22.0 RCurl_1.98-1.12 [15] htmltools_0.5.5 progress_1.2.2 [17] curl_5.0.0 BiocNeighbors_1.18.0 [19] sass_0.4.5 bslib_0.4.2 [21] cachem_1.0.7 ResidualMatrix_1.10.0 [23] GenomicAlignments_1.36.0 igraph_1.4.2 [25] mime_0.12 lifecycle_1.0.3 [27] pkgconfig_2.0.3 rsvd_1.0.5 [29] Matrix_1.5-4 R6_2.5.1 [31] fastmap_1.1.1 GenomeInfoDbData_1.2.10 [33] shiny_1.7.4 digest_0.6.31 [35] colorspace_2.1-0 dqrng_0.3.0 
[37] irlba_2.3.5.1 ExperimentHub_2.8.0 [39] RSQLite_2.3.1 beachmat_2.16.0 [41] labeling_0.4.2 filelock_1.0.2 [43] fansi_1.0.4 httr_1.4.5 [45] compiler_4.3.0 bit64_4.0.5 [47] withr_2.5.0 BiocParallel_1.34.0 [49] viridis_0.6.2 DBI_1.1.3 [51] highr_0.10 biomaRt_2.56.0 [53] rappdirs_0.3.3 DelayedArray_0.26.0 [55] rjson_0.2.21 tools_4.3.0 [57] vipor_0.4.5 beeswarm_0.4.0 [59] interactiveDisplayBase_1.38.0 httpuv_1.6.9 [61] glue_1.6.2 restfulr_0.0.15 [63] promises_1.2.0.1 grid_4.3.0 [65] Rtsne_0.16 cluster_2.1.4 [67] generics_0.1.3 gtable_0.3.3 [69] hms_1.1.3 metapod_1.8.0 [71] ScaledMatrix_1.8.0 xml2_1.3.3 [73] utf8_1.2.3 XVector_0.40.0 [75] ggrepel_0.9.3 BiocVersion_3.17.1 [77] pillar_1.9.0 stringr_1.5.0 [79] later_1.3.0 dplyr_1.1.2 [81] lattice_0.21-8 rtracklayer_1.60.0 [83] bit_4.0.5 tidyselect_1.2.0 [85] locfit_1.5-9.7 Biostrings_2.68.0 [87] knitr_1.42 gridExtra_2.3 [89] bookdown_0.33 ProtGenerics_1.32.0 [91] xfun_0.39 statmod_1.5.0 [93] stringi_1.7.12 lazyeval_0.2.2 [95] yaml_2.3.7 evaluate_0.20 [97] codetools_0.2-19 tibble_3.2.1 [99] BiocManager_1.30.20 graph_1.78.0 [101] cli_3.6.1 xtable_1.8-4 [103] munsell_0.5.0 jquerylib_0.1.4 [105] Rcpp_1.0.10 dir.expiry_1.8.0 [107] png_0.1-8 XML_3.99-0.14 [109] parallel_4.3.0 ellipsis_0.3.2 [111] blob_1.2.4 prettyunits_1.1.1 [113] sparseMatrixStats_1.12.0 bitops_1.0-7 [115] viridisLite_0.4.1 scales_1.2.1 [117] purrr_1.0.1 crayon_1.5.2 [119] rlang_1.1.0 cowplot_1.1.1 [121] KEGGREST_1.40.0 References "],["grun-mouse-hsc-cel-seq.html", "Chapter 9 Grun mouse HSC (CEL-seq) 9.1 Introduction 9.2 Data loading 9.3 Quality control 9.4 Normalization 9.5 Variance modelling 9.6 Dimensionality reduction 9.7 Clustering 9.8 Marker gene detection Session Info", " Chapter 9 Grun mouse HSC (CEL-seq) .rebook-collapse { background-color: #eee; color: #444; cursor: pointer; padding: 18px; width: 100%; border: none; text-align: left; outline: none; font-size: 15px; } .rebook-content { padding: 0 18px; display: none; overflow: hidden; 
background-color: #f1f1f1; } 9.1 Introduction This performs an analysis of the mouse haematopoietic stem cell (HSC) dataset generated with CEL-seq (Grun et al. 2016). Despite its name, this dataset actually contains both sorted HSCs and a population of micro-dissected bone marrow cells. 9.2 Data loading library(scRNAseq) sce.grun.hsc &lt;- GrunHSCData(ensembl=TRUE) library(AnnotationHub) ens.mm.v97 &lt;- AnnotationHub()[[&quot;AH73905&quot;]] anno &lt;- select(ens.mm.v97, keys=rownames(sce.grun.hsc), keytype=&quot;GENEID&quot;, columns=c(&quot;SYMBOL&quot;, &quot;SEQNAME&quot;)) rowData(sce.grun.hsc) &lt;- anno[match(rownames(sce.grun.hsc), anno$GENEID),] After loading and annotation, we inspect the resulting SingleCellExperiment object: sce.grun.hsc ## class: SingleCellExperiment ## dim: 21817 1915 ## metadata(0): ## assays(1): counts ## rownames(21817): ENSMUSG00000109644 ENSMUSG00000007777 ... ## ENSMUSG00000055670 ENSMUSG00000039068 ## rowData names(3): GENEID SYMBOL SEQNAME ## colnames(1915): JC4_349_HSC_FE_S13_ JC4_350_HSC_FE_S13_ ... ## JC48P6_1203_HSC_FE_S8_ JC48P6_1204_HSC_FE_S8_ ## colData names(2): sample protocol ## reducedDimNames(0): ## mainExpName: NULL ## altExpNames(0): 9.3 Quality control unfiltered &lt;- sce.grun.hsc For some reason, no mitochondrial transcripts are available, and we have no spike-in transcripts, so we only use the number of detected genes and the library size for quality control. We block on the protocol used for cell extraction, ignoring the micro-dissected cells when computing this threshold. This is based on our judgement that most of the micro-dissected plates consist mostly of low-quality cells, compromising the assumptions of outlier detection.
library(scuttle) stats &lt;- perCellQCMetrics(sce.grun.hsc) qc &lt;- quickPerCellQC(stats, batch=sce.grun.hsc$protocol, subset=grepl(&quot;sorted&quot;, sce.grun.hsc$protocol)) sce.grun.hsc &lt;- sce.grun.hsc[,!qc$discard] We examine the number of cells discarded for each reason. colSums(as.matrix(qc)) ## low_lib_size low_n_features discard ## 465 482 488 We create some diagnostic plots for each metric (Figure 9.1). The library sizes are unusually low for many plates of micro-dissected cells; this may be attributable to damage induced by the extraction protocol compared to cell sorting. colData(unfiltered) &lt;- cbind(colData(unfiltered), stats) unfiltered$discard &lt;- qc$discard library(scater) gridExtra::grid.arrange( plotColData(unfiltered, y=&quot;sum&quot;, x=&quot;sample&quot;, colour_by=&quot;discard&quot;, other_fields=&quot;protocol&quot;) + scale_y_log10() + ggtitle(&quot;Total count&quot;) + facet_wrap(~protocol), plotColData(unfiltered, y=&quot;detected&quot;, x=&quot;sample&quot;, colour_by=&quot;discard&quot;, other_fields=&quot;protocol&quot;) + scale_y_log10() + ggtitle(&quot;Detected features&quot;) + facet_wrap(~protocol), ncol=1 ) Figure 9.1: Distribution of each QC metric across cells in the Grun HSC dataset. Each point represents a cell and is colored according to whether that cell was discarded. 9.4 Normalization library(scran) set.seed(101000110) clusters &lt;- quickCluster(sce.grun.hsc) sce.grun.hsc &lt;- computeSumFactors(sce.grun.hsc, clusters=clusters) sce.grun.hsc &lt;- logNormCounts(sce.grun.hsc) We examine some key metrics for the distribution of size factors, and compare it to the library sizes as a sanity check (Figure 9.2). summary(sizeFactors(sce.grun.hsc)) ## Min. 1st Qu. Median Mean 3rd Qu. Max. 
## 0.027 0.290 0.603 1.000 1.201 16.433 plot(librarySizeFactors(sce.grun.hsc), sizeFactors(sce.grun.hsc), pch=16, xlab=&quot;Library size factors&quot;, ylab=&quot;Deconvolution factors&quot;, log=&quot;xy&quot;) Figure 9.2: Relationship between the library size factors and the deconvolution size factors in the Grun HSC dataset. 9.5 Variance modelling We create a mean-variance trend based on the expectation that UMI counts have Poisson technical noise. We do not block on sample here as we want to preserve any difference between the micro-dissected cells and the sorted HSCs. set.seed(00010101) dec.grun.hsc &lt;- modelGeneVarByPoisson(sce.grun.hsc) top.grun.hsc &lt;- getTopHVGs(dec.grun.hsc, prop=0.1) The lack of a typical “bump” shape in Figure 9.3 is caused by the low counts. plot(dec.grun.hsc$mean, dec.grun.hsc$total, pch=16, cex=0.5, xlab=&quot;Mean of log-expression&quot;, ylab=&quot;Variance of log-expression&quot;) curfit &lt;- metadata(dec.grun.hsc) curve(curfit$trend(x), col=&#39;dodgerblue&#39;, add=TRUE, lwd=2) Figure 9.3: Per-gene variance as a function of the mean for the log-expression values in the Grun HSC dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to the simulated Poisson-distributed noise. 9.6 Dimensionality reduction set.seed(101010011) sce.grun.hsc &lt;- denoisePCA(sce.grun.hsc, technical=dec.grun.hsc, subset.row=top.grun.hsc) sce.grun.hsc &lt;- runTSNE(sce.grun.hsc, dimred=&quot;PCA&quot;) We check that the number of retained PCs is sensible. 
ncol(reducedDim(sce.grun.hsc, &quot;PCA&quot;)) ## [1] 9 9.7 Clustering snn.gr &lt;- buildSNNGraph(sce.grun.hsc, use.dimred=&quot;PCA&quot;) colLabels(sce.grun.hsc) &lt;- factor(igraph::cluster_walktrap(snn.gr)$membership) table(colLabels(sce.grun.hsc)) ## ## 1 2 3 4 5 6 7 8 9 10 11 12 ## 259 148 221 103 177 108 48 122 98 63 62 18 short &lt;- ifelse(grepl(&quot;micro&quot;, sce.grun.hsc$protocol), &quot;micro&quot;, &quot;sorted&quot;) gridExtra::grid.arrange( plotTSNE(sce.grun.hsc, colour_by=&quot;label&quot;), plotTSNE(sce.grun.hsc, colour_by=I(short)), ncol=2 ) Figure 9.4: Obligatory \\(t\\)-SNE plot of the Grun HSC dataset, where each point represents a cell and is colored according to the assigned cluster (left) or extraction protocol (right). 9.8 Marker gene detection markers &lt;- findMarkers(sce.grun.hsc, test.type=&quot;wilcox&quot;, direction=&quot;up&quot;, row.data=rowData(sce.grun.hsc)[,&quot;SYMBOL&quot;,drop=FALSE]) To illustrate the manual annotation process, we examine the marker genes for one of the clusters. Upregulation of Camp, Lcn2, Ltf and lysozyme genes indicates that this cluster contains cells of the neutrophil lineage. chosen &lt;- markers[[&#39;6&#39;]] best &lt;- chosen[chosen$Top &lt;= 10,] aucs &lt;- getMarkerEffects(best, prefix=&quot;AUC&quot;) rownames(aucs) &lt;- best$SYMBOL library(pheatmap) pheatmap(aucs, color=viridis::plasma(100)) Figure 9.5: Heatmap of the AUCs for the top marker genes in cluster 6 compared to all other clusters in the Grun HSC dataset.
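The AUC reported by a Wilcoxon-based marker test can be read as the probability that a randomly chosen cell from the cluster of interest has higher expression than a randomly chosen cell from the other cluster, with ties counted as one half. The following is a minimal base-R sketch of that quantity on made-up expression values; `auc_from_wilcox` and `auc_direct` are hypothetical helper names used for illustration and are not part of scran.

```r
# Hypothetical helper: AUC from the Wilcoxon rank-sum statistic.
# W counts the pairs (x_i, y_j) with x_i > y_j, plus half of the tied pairs,
# so dividing by the number of pairs gives the AUC.
auc_from_wilcox <- function(x, y) {
    # Ties trigger a warning about exact p-values; only the statistic is used here.
    w <- suppressWarnings(wilcox.test(x, y)$statistic)
    unname(w) / (length(x) * length(y))
}

# Hypothetical helper: the same quantity computed directly from all pairs.
auc_direct <- function(x, y) {
    mean(outer(x, y, ">") + 0.5 * outer(x, y, "=="))
}

# Made-up log-expression values for one gene (not taken from the dataset).
in.cluster <- c(5, 6, 7, 8)
other <- c(1, 2, 3, 6)

a1 <- auc_from_wilcox(in.cluster, other)
a2 <- auc_direct(in.cluster, other)
```

An AUC close to 1 means the gene is almost always more highly expressed in the cluster of interest, while 0.5 means no separation, which is why the heatmap of AUCs highlights genes that distinguish the cluster from each of the others.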
Session Info View session info R version 4.3.0 RC (2023-04-13 r84269) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 22.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C time zone: America/New_York tzcode source: system (glibc) attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] pheatmap_1.0.12 scran_1.28.0 [3] scater_1.28.0 ggplot2_3.4.2 [5] scuttle_1.10.0 AnnotationHub_3.8.0 [7] BiocFileCache_2.8.0 dbplyr_2.3.2 [9] ensembldb_2.24.0 AnnotationFilter_1.24.0 [11] GenomicFeatures_1.52.0 AnnotationDbi_1.62.0 [13] scRNAseq_2.13.0 SingleCellExperiment_1.22.0 [15] SummarizedExperiment_1.30.0 Biobase_2.60.0 [17] GenomicRanges_1.52.0 GenomeInfoDb_1.36.0 [19] IRanges_2.34.0 S4Vectors_0.38.0 [21] BiocGenerics_0.46.0 MatrixGenerics_1.12.0 [23] matrixStats_0.63.0 BiocStyle_2.28.0 [25] rebook_1.10.0 loaded via a namespace (and not attached): [1] RColorBrewer_1.1-3 jsonlite_1.8.4 [3] CodeDepends_0.6.5 magrittr_2.0.3 [5] ggbeeswarm_0.7.1 farver_2.1.1 [7] rmarkdown_2.21 BiocIO_1.10.0 [9] zlibbioc_1.46.0 vctrs_0.6.2 [11] memoise_2.0.1 Rsamtools_2.16.0 [13] DelayedMatrixStats_1.22.0 RCurl_1.98-1.12 [15] htmltools_0.5.5 progress_1.2.2 [17] curl_5.0.0 BiocNeighbors_1.18.0 [19] sass_0.4.5 bslib_0.4.2 [21] cachem_1.0.7 GenomicAlignments_1.36.0 [23] igraph_1.4.2 mime_0.12 [25] lifecycle_1.0.3 pkgconfig_2.0.3 [27] rsvd_1.0.5 Matrix_1.5-4 [29] R6_2.5.1 fastmap_1.1.1 [31] GenomeInfoDbData_1.2.10 shiny_1.7.4 [33] digest_0.6.31 colorspace_2.1-0 [35] dqrng_0.3.0 irlba_2.3.5.1 [37] ExperimentHub_2.8.0 RSQLite_2.3.1 [39] beachmat_2.16.0 labeling_0.4.2 [41] filelock_1.0.2 
fansi_1.0.4 [43] httr_1.4.5 compiler_4.3.0 [45] bit64_4.0.5 withr_2.5.0 [47] BiocParallel_1.34.0 viridis_0.6.2 [49] DBI_1.1.3 highr_0.10 [51] biomaRt_2.56.0 rappdirs_0.3.3 [53] DelayedArray_0.26.0 bluster_1.10.0 [55] rjson_0.2.21 tools_4.3.0 [57] vipor_0.4.5 beeswarm_0.4.0 [59] interactiveDisplayBase_1.38.0 httpuv_1.6.9 [61] glue_1.6.2 restfulr_0.0.15 [63] promises_1.2.0.1 grid_4.3.0 [65] Rtsne_0.16 cluster_2.1.4 [67] generics_0.1.3 gtable_0.3.3 [69] hms_1.1.3 metapod_1.8.0 [71] BiocSingular_1.16.0 ScaledMatrix_1.8.0 [73] xml2_1.3.3 utf8_1.2.3 [75] XVector_0.40.0 ggrepel_0.9.3 [77] BiocVersion_3.17.1 pillar_1.9.0 [79] stringr_1.5.0 limma_3.56.0 [81] later_1.3.0 dplyr_1.1.2 [83] lattice_0.21-8 rtracklayer_1.60.0 [85] bit_4.0.5 tidyselect_1.2.0 [87] locfit_1.5-9.7 Biostrings_2.68.0 [89] knitr_1.42 gridExtra_2.3 [91] bookdown_0.33 ProtGenerics_1.32.0 [93] edgeR_3.42.0 xfun_0.39 [95] statmod_1.5.0 stringi_1.7.12 [97] lazyeval_0.2.2 yaml_2.3.7 [99] evaluate_0.20 codetools_0.2-19 [101] tibble_3.2.1 BiocManager_1.30.20 [103] graph_1.78.0 cli_3.6.1 [105] xtable_1.8-4 munsell_0.5.0 [107] jquerylib_0.1.4 Rcpp_1.0.10 [109] dir.expiry_1.8.0 png_0.1-8 [111] XML_3.99-0.14 parallel_4.3.0 [113] ellipsis_0.3.2 blob_1.2.4 [115] prettyunits_1.1.1 sparseMatrixStats_1.12.0 [117] bitops_1.0-7 viridisLite_0.4.1 [119] scales_1.2.1 purrr_1.0.1 [121] crayon_1.5.2 rlang_1.1.0 [123] cowplot_1.1.1 KEGGREST_1.40.0 References "],["nestorowa-mouse-hsc-smart-seq2.html", "Chapter 10 Nestorowa mouse HSC (Smart-seq2) 10.1 Introduction 10.2 Data loading 10.3 Quality control 10.4 Normalization 10.5 Variance modelling 10.6 Dimensionality reduction 10.7 Clustering 10.8 Marker gene detection 10.9 Cell type annotation 10.10 Miscellaneous analyses Session Info", " Chapter 10 Nestorowa mouse HSC (Smart-seq2) .rebook-collapse { background-color: #eee; color: #444; cursor: pointer; padding: 18px; width: 100%; border: none; text-align: left; outline: none; font-size: 15px; } .rebook-content { padding: 0 18px; 
display: none; overflow: hidden; background-color: #f1f1f1; } 10.1 Introduction This performs an analysis of the mouse haematopoietic stem cell (HSC) dataset generated with Smart-seq2 (Nestorowa et al. 2016). 10.2 Data loading library(scRNAseq) sce.nest &lt;- NestorowaHSCData() library(AnnotationHub) ens.mm.v97 &lt;- AnnotationHub()[[&quot;AH73905&quot;]] anno &lt;- select(ens.mm.v97, keys=rownames(sce.nest), keytype=&quot;GENEID&quot;, columns=c(&quot;SYMBOL&quot;, &quot;SEQNAME&quot;)) rowData(sce.nest) &lt;- anno[match(rownames(sce.nest), anno$GENEID),] After loading and annotation, we inspect the resulting SingleCellExperiment object: sce.nest ## class: SingleCellExperiment ## dim: 46078 1920 ## metadata(0): ## assays(1): counts ## rownames(46078): ENSMUSG00000000001 ENSMUSG00000000003 ... ## ENSMUSG00000107391 ENSMUSG00000107392 ## rowData names(3): GENEID SYMBOL SEQNAME ## colnames(1920): HSPC_007 HSPC_013 ... Prog_852 Prog_810 ## colData names(2): cell.type FACS ## reducedDimNames(1): diffusion ## mainExpName: endogenous ## altExpNames(1): ERCC 10.3 Quality control unfiltered &lt;- sce.nest For some reason, no mitochondrial transcripts are available, so we will perform quality control using the spike-in proportions only. library(scater) stats &lt;- perCellQCMetrics(sce.nest) qc &lt;- quickPerCellQC(stats, percent_subsets=&quot;altexps_ERCC_percent&quot;) sce.nest &lt;- sce.nest[,!qc$discard] We examine the number of cells discarded for each reason. colSums(as.matrix(qc)) ## low_lib_size low_n_features high_altexps_ERCC_percent ## 146 28 241 ## discard ## 264 We create some diagnostic plots for each metric (Figure 10.1). 
colData(unfiltered) &lt;- cbind(colData(unfiltered), stats) unfiltered$discard &lt;- qc$discard gridExtra::grid.arrange( plotColData(unfiltered, y=&quot;sum&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Total count&quot;), plotColData(unfiltered, y=&quot;detected&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Detected features&quot;), plotColData(unfiltered, y=&quot;altexps_ERCC_percent&quot;, colour_by=&quot;discard&quot;) + ggtitle(&quot;ERCC percent&quot;), ncol=2 ) Figure 10.1: Distribution of each QC metric across cells in the Nestorowa HSC dataset. Each point represents a cell and is colored according to whether that cell was discarded. 10.4 Normalization library(scran) set.seed(101000110) clusters &lt;- quickCluster(sce.nest) sce.nest &lt;- computeSumFactors(sce.nest, clusters=clusters) sce.nest &lt;- logNormCounts(sce.nest) We examine some key metrics for the distribution of size factors, and compare it to the library sizes as a sanity check (Figure 10.2). summary(sizeFactors(sce.nest)) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.044 0.422 0.748 1.000 1.249 15.927 plot(librarySizeFactors(sce.nest), sizeFactors(sce.nest), pch=16, xlab=&quot;Library size factors&quot;, ylab=&quot;Deconvolution factors&quot;, log=&quot;xy&quot;) Figure 10.2: Relationship between the library size factors and the deconvolution size factors in the Nestorowa HSC dataset. 10.5 Variance modelling We use the spike-in transcripts to model the technical noise as a function of the mean (Figure 10.3). 
set.seed(00010101) dec.nest &lt;- modelGeneVarWithSpikes(sce.nest, &quot;ERCC&quot;) top.nest &lt;- getTopHVGs(dec.nest, prop=0.1) plot(dec.nest$mean, dec.nest$total, pch=16, cex=0.5, xlab=&quot;Mean of log-expression&quot;, ylab=&quot;Variance of log-expression&quot;) curfit &lt;- metadata(dec.nest) curve(curfit$trend(x), col=&#39;dodgerblue&#39;, add=TRUE, lwd=2) points(curfit$mean, curfit$var, col=&quot;red&quot;) Figure 10.3: Per-gene variance as a function of the mean for the log-expression values in the Nestorowa HSC dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to the spike-ins (red). 10.6 Dimensionality reduction set.seed(101010011) sce.nest &lt;- denoisePCA(sce.nest, technical=dec.nest, subset.row=top.nest) sce.nest &lt;- runTSNE(sce.nest, dimred=&quot;PCA&quot;) We check that the number of retained PCs is sensible. ncol(reducedDim(sce.nest, &quot;PCA&quot;)) ## [1] 9 10.7 Clustering snn.gr &lt;- buildSNNGraph(sce.nest, use.dimred=&quot;PCA&quot;) colLabels(sce.nest) &lt;- factor(igraph::cluster_walktrap(snn.gr)$membership) table(colLabels(sce.nest)) ## ## 1 2 3 4 5 6 7 8 9 ## 203 472 258 175 142 229 20 83 74 plotTSNE(sce.nest, colour_by=&quot;label&quot;) Figure 10.4: Obligatory \\(t\\)-SNE plot of the Nestorowa HSC dataset, where each point represents a cell and is colored according to the assigned cluster. 10.8 Marker gene detection markers &lt;- findMarkers(sce.nest, colLabels(sce.nest), test.type=&quot;wilcox&quot;, direction=&quot;up&quot;, lfc=0.5, row.data=rowData(sce.nest)[,&quot;SYMBOL&quot;,drop=FALSE]) To illustrate the manual annotation process, we examine the marker genes for one of the clusters. Upregulation of Car2, Hebp1 and hemoglobins indicates that cluster 8 contains erythroid precursors.
chosen &lt;- markers[[&#39;8&#39;]] best &lt;- chosen[chosen$Top &lt;= 10,] aucs &lt;- getMarkerEffects(best, prefix=&quot;AUC&quot;) rownames(aucs) &lt;- best$SYMBOL library(pheatmap) pheatmap(aucs, color=viridis::plasma(100)) Figure 10.5: Heatmap of the AUCs for the top marker genes in cluster 8 compared to all other clusters. 10.9 Cell type annotation library(SingleR) mm.ref &lt;- MouseRNAseqData() # Renaming to symbols to match with reference row names. renamed &lt;- sce.nest rownames(renamed) &lt;- uniquifyFeatureNames(rownames(renamed), rowData(sce.nest)$SYMBOL) labels &lt;- SingleR(renamed, mm.ref, labels=mm.ref$label.fine) Most clusters are not assigned to any single lineage (Figure 10.6), which is perhaps unsurprising given that HSCs are quite different from their terminal fates. Cluster 8 is considered to contain erythrocytes, which is roughly consistent with our conclusions from the marker gene analysis above. tab &lt;- table(labels$labels, colLabels(sce.nest)) pheatmap(log10(tab+10), color=viridis::viridis(100)) Figure 10.6: Heatmap of the distribution of cells for each cluster in the Nestorowa HSC dataset, based on their assignment to each label in the mouse RNA-seq references from the SingleR package. 10.10 Miscellaneous analyses This dataset also contains information about the protein abundances in each cell from FACS. There is barely any heterogeneity in the chosen markers across the clusters (Figure 10.7); this is perhaps unsurprising given that all cells should be HSCs of some sort. Y &lt;- colData(sce.nest)$FACS keep &lt;- rowSums(is.na(Y))==0 # Removing NA intensities. 
se.averaged &lt;- sumCountsAcrossCells(t(Y[keep,]), colLabels(sce.nest)[keep], average=TRUE) averaged &lt;- assay(se.averaged) log.intensities &lt;- log2(averaged+1) centered &lt;- log.intensities - rowMeans(log.intensities) pheatmap(centered, breaks=seq(-1, 1, length.out=101)) Figure 10.7: Heatmap of the centered log-average intensity for each target protein quantified by FACS in the Nestorowa HSC dataset. Session Info View session info R version 4.3.0 RC (2023-04-13 r84269) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 22.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C time zone: America/New_York tzcode source: system (glibc) attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] celldex_1.9.0 SingleR_2.2.0 [3] pheatmap_1.0.12 scran_1.28.0 [5] scater_1.28.0 ggplot2_3.4.2 [7] scuttle_1.10.0 AnnotationHub_3.8.0 [9] BiocFileCache_2.8.0 dbplyr_2.3.2 [11] ensembldb_2.24.0 AnnotationFilter_1.24.0 [13] GenomicFeatures_1.52.0 AnnotationDbi_1.62.0 [15] scRNAseq_2.13.0 SingleCellExperiment_1.22.0 [17] SummarizedExperiment_1.30.0 Biobase_2.60.0 [19] GenomicRanges_1.52.0 GenomeInfoDb_1.36.0 [21] IRanges_2.34.0 S4Vectors_0.38.0 [23] BiocGenerics_0.46.0 MatrixGenerics_1.12.0 [25] matrixStats_0.63.0 BiocStyle_2.28.0 [27] rebook_1.10.0 loaded via a namespace (and not attached): [1] RColorBrewer_1.1-3 jsonlite_1.8.4 [3] CodeDepends_0.6.5 magrittr_2.0.3 [5] ggbeeswarm_0.7.1 farver_2.1.1 [7] rmarkdown_2.21 BiocIO_1.10.0 [9] zlibbioc_1.46.0 vctrs_0.6.2 [11] memoise_2.0.1 Rsamtools_2.16.0 [13] DelayedMatrixStats_1.22.0 RCurl_1.98-1.12 [15] htmltools_0.5.5 
progress_1.2.2 [17] curl_5.0.0 BiocNeighbors_1.18.0 [19] sass_0.4.5 bslib_0.4.2 [21] cachem_1.0.7 GenomicAlignments_1.36.0 [23] igraph_1.4.2 mime_0.12 [25] lifecycle_1.0.3 pkgconfig_2.0.3 [27] rsvd_1.0.5 Matrix_1.5-4 [29] R6_2.5.1 fastmap_1.1.1 [31] GenomeInfoDbData_1.2.10 shiny_1.7.4 [33] digest_0.6.31 colorspace_2.1-0 [35] dqrng_0.3.0 irlba_2.3.5.1 [37] ExperimentHub_2.8.0 RSQLite_2.3.1 [39] beachmat_2.16.0 labeling_0.4.2 [41] filelock_1.0.2 fansi_1.0.4 [43] httr_1.4.5 compiler_4.3.0 [45] bit64_4.0.5 withr_2.5.0 [47] BiocParallel_1.34.0 viridis_0.6.2 [49] DBI_1.1.3 highr_0.10 [51] biomaRt_2.56.0 rappdirs_0.3.3 [53] DelayedArray_0.26.0 bluster_1.10.0 [55] rjson_0.2.21 tools_4.3.0 [57] vipor_0.4.5 beeswarm_0.4.0 [59] interactiveDisplayBase_1.38.0 httpuv_1.6.9 [61] glue_1.6.2 restfulr_0.0.15 [63] promises_1.2.0.1 grid_4.3.0 [65] Rtsne_0.16 cluster_2.1.4 [67] generics_0.1.3 gtable_0.3.3 [69] hms_1.1.3 metapod_1.8.0 [71] BiocSingular_1.16.0 ScaledMatrix_1.8.0 [73] xml2_1.3.3 utf8_1.2.3 [75] XVector_0.40.0 ggrepel_0.9.3 [77] BiocVersion_3.17.1 pillar_1.9.0 [79] stringr_1.5.0 limma_3.56.0 [81] later_1.3.0 dplyr_1.1.2 [83] lattice_0.21-8 rtracklayer_1.60.0 [85] bit_4.0.5 tidyselect_1.2.0 [87] locfit_1.5-9.7 Biostrings_2.68.0 [89] knitr_1.42 gridExtra_2.3 [91] bookdown_0.33 ProtGenerics_1.32.0 [93] edgeR_3.42.0 xfun_0.39 [95] statmod_1.5.0 stringi_1.7.12 [97] lazyeval_0.2.2 yaml_2.3.7 [99] evaluate_0.20 codetools_0.2-19 [101] tibble_3.2.1 BiocManager_1.30.20 [103] graph_1.78.0 cli_3.6.1 [105] xtable_1.8-4 munsell_0.5.0 [107] jquerylib_0.1.4 Rcpp_1.0.10 [109] dir.expiry_1.8.0 png_0.1-8 [111] XML_3.99-0.14 parallel_4.3.0 [113] ellipsis_0.3.2 blob_1.2.4 [115] prettyunits_1.1.1 sparseMatrixStats_1.12.0 [117] bitops_1.0-7 viridisLite_0.4.1 [119] scales_1.2.1 purrr_1.0.1 [121] crayon_1.5.2 rlang_1.1.0 [123] cowplot_1.1.1 KEGGREST_1.40.0 References "],["paul-mouse-hsc-mars-seq.html", "Chapter 11 Paul mouse HSC (MARS-seq) 11.1 Introduction 11.2 Data loading 11.3 Quality control 
11.4 Normalization 11.5 Variance modelling 11.6 Dimensionality reduction 11.7 Clustering Session Info", " Chapter 11 Paul mouse HSC (MARS-seq) 11.1 Introduction This performs an analysis of the mouse haematopoietic stem cell (HSC) dataset generated with MARS-seq (Paul et al. 2015). Cells were extracted from multiple mice under different experimental conditions (i.e., sorting protocols) and libraries were prepared using a series of 384-well plates. 11.2 Data loading library(scRNAseq) sce.paul &lt;- PaulHSCData(ensembl=TRUE) library(AnnotationHub) ens.mm.v97 &lt;- AnnotationHub()[[&quot;AH73905&quot;]] anno &lt;- select(ens.mm.v97, keys=rownames(sce.paul), keytype=&quot;GENEID&quot;, columns=c(&quot;SYMBOL&quot;, &quot;SEQNAME&quot;)) rowData(sce.paul) &lt;- anno[match(rownames(sce.paul), anno$GENEID),] After loading and annotation, we inspect the resulting SingleCellExperiment object: sce.paul ## class: SingleCellExperiment ## dim: 17483 10368 ## metadata(0): ## assays(1): counts ## rownames(17483): ENSMUSG00000007777 ENSMUSG00000107002 ... ## ENSMUSG00000039068 ENSMUSG00000064363 ## rowData names(3): GENEID SYMBOL SEQNAME ## colnames(10368): W29953 W29954 ... W76335 W76336 ## colData names(13): Well_ID Seq_batch_ID ... CD34_measurement ## FcgR3_measurement ## reducedDimNames(0): ## mainExpName: NULL ## altExpNames(0): 11.3 Quality control unfiltered &lt;- sce.paul For some reason, only one mitochondrial transcript is available, so we will perform quality control using only the library size and number of detected features.
Ideally, we would simply block on the plate of origin to account for differences in processing, but unfortunately, it seems that many plates have a large proportion (if not outright majority) of cells with poor values for both metrics. We identify such plates based on the presence of very low outlier thresholds, for some arbitrary definition of “low”; we then redefine thresholds using information from the other (presumably high-quality) plates. library(scater) stats &lt;- perCellQCMetrics(sce.paul) qc &lt;- quickPerCellQC(stats, batch=sce.paul$Plate_ID) # Detecting batches with unusually low threshold values. lib.thresholds &lt;- attr(qc$low_lib_size, &quot;thresholds&quot;)[&quot;lower&quot;,] nfeat.thresholds &lt;- attr(qc$low_n_features, &quot;thresholds&quot;)[&quot;lower&quot;,] ignore &lt;- union(names(lib.thresholds)[lib.thresholds &lt; 100], names(nfeat.thresholds)[nfeat.thresholds &lt; 100]) # Repeating the QC using only the &quot;high-quality&quot; batches. qc2 &lt;- quickPerCellQC(stats, batch=sce.paul$Plate_ID, subset=!sce.paul$Plate_ID %in% ignore) sce.paul &lt;- sce.paul[,!qc2$discard] We examine the number of cells discarded for each reason. colSums(as.matrix(qc2)) ## low_lib_size low_n_features discard ## 1695 1781 1783 We create some diagnostic plots for each metric (Figure 11.1). colData(unfiltered) &lt;- cbind(colData(unfiltered), stats) unfiltered$discard &lt;- qc2$discard unfiltered$Plate_ID &lt;- factor(unfiltered$Plate_ID) gridExtra::grid.arrange( plotColData(unfiltered, y=&quot;sum&quot;, x=&quot;Plate_ID&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Total count&quot;), plotColData(unfiltered, y=&quot;detected&quot;, x=&quot;Plate_ID&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Detected features&quot;), ncol=1 ) Figure 11.1: Distribution of each QC metric across cells in the Paul HSC dataset. Each point represents a cell and is colored according to whether that cell was discarded. 
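The two-pass strategy above can be illustrated with a simplified base-R sketch of lower-tail outlier detection on the log scale, computed per plate. This is not scuttle's implementation: `lower_outliers` is a hypothetical helper, the data are simulated, and the fallback used for plates with no retained cells (pooling all retained cells) is an assumption made for illustration only.

```r
# Hypothetical helper: flag cells whose value lies more than `nmads` MADs
# below the median of their batch, on the log scale.
# `keep` marks the cells that may be used to define the thresholds.
lower_outliers <- function(x, batch, keep = rep(TRUE, length(x)), nmads = 3) {
    lx <- log(x)
    batch <- as.character(batch)
    pooled <- lx[keep]
    # Assumed fallback for batches with no retained cells: pool all retained cells.
    fallback <- median(pooled) - nmads * mad(pooled)
    discard <- logical(length(x))
    thresholds <- numeric(0)
    for (b in unique(batch)) {
        in.batch <- batch == b
        ref <- lx[in.batch & keep]
        limit <- if (length(ref)) median(ref) - nmads * mad(ref) else fallback
        thresholds[b] <- exp(limit)
        discard[in.batch] <- lx[in.batch] < limit
    }
    list(discard = discard, thresholds = thresholds)
}

set.seed(1)
# Two simulated plates: "good" has healthy library sizes, "bad" consists of failed wells.
libsize <- c(rlnorm(100, meanlog = 8, sdlog = 0.3),
             rlnorm(100, meanlog = 3, sdlog = 0.3))
plate <- rep(c("good", "bad"), each = 100)

# Pass 1: thresholds from every plate; the bad plate defines its own (very low) threshold.
naive <- lower_outliers(libsize, plate)

# Pass 2: thresholds only from plates that were not flagged as suspicious.
rescued <- lower_outliers(libsize, plate, keep = plate != "bad")
```

In the first pass the failed plate looks internally consistent, so few of its cells are flagged; in the second pass its threshold is borrowed from the good plate and its cells are discarded, which mirrors the reasoning behind repeating the QC with `subset=`.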
11.4 Normalization library(scran) set.seed(101000110) clusters &lt;- quickCluster(sce.paul) sce.paul &lt;- computeSumFactors(sce.paul, clusters=clusters) sce.paul &lt;- logNormCounts(sce.paul) We examine some key metrics for the distribution of size factors, and compare it to the library sizes as a sanity check (Figure 11.2). summary(sizeFactors(sce.paul)) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.057 0.422 0.775 1.000 1.335 9.654 plot(librarySizeFactors(sce.paul), sizeFactors(sce.paul), pch=16, xlab=&quot;Library size factors&quot;, ylab=&quot;Deconvolution factors&quot;, log=&quot;xy&quot;) Figure 11.2: Relationship between the library size factors and the deconvolution size factors in the Paul HSC dataset. 11.5 Variance modelling We fit a mean-variance trend to the endogenous genes to detect highly variable genes. Unfortunately, the plates are confounded with an experimental treatment (Batch_desc) so we cannot block on the plate of origin. set.seed(00010101) dec.paul &lt;- modelGeneVarByPoisson(sce.paul) top.paul &lt;- getTopHVGs(dec.paul, prop=0.1) plot(dec.paul$mean, dec.paul$total, pch=16, cex=0.5, xlab=&quot;Mean of log-expression&quot;, ylab=&quot;Variance of log-expression&quot;) curve(metadata(dec.paul)$trend(x), col=&quot;blue&quot;, add=TRUE) Figure 11.3: Per-gene variance as a function of the mean for the log-expression values in the Paul HSC dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to simulated Poisson noise. 11.6 Dimensionality reduction set.seed(101010011) sce.paul &lt;- denoisePCA(sce.paul, technical=dec.paul, subset.row=top.paul) sce.paul &lt;- runTSNE(sce.paul, dimred=&quot;PCA&quot;) We check that the number of retained PCs is sensible. 
ncol(reducedDim(sce.paul, &quot;PCA&quot;)) ## [1] 13 11.7 Clustering snn.gr &lt;- buildSNNGraph(sce.paul, use.dimred=&quot;PCA&quot;, type=&quot;jaccard&quot;) colLabels(sce.paul) &lt;- factor(igraph::cluster_louvain(snn.gr)$membership) There is a strong relationship between the cluster and the experimental treatment (Figure 11.4), which is to be expected. Of course, this may also be attributable to some batch effect; the confounded nature of the experimental design makes it difficult to make any confident statements either way. tab &lt;- table(colLabels(sce.paul), sce.paul$Batch_desc) rownames(tab) &lt;- paste(&quot;Cluster&quot;, rownames(tab)) pheatmap::pheatmap(log10(tab+10), color=viridis::viridis(100)) Figure 11.4: Heatmap of the distribution of cells across clusters (rows) for each experimental treatment (column). plotTSNE(sce.paul, colour_by=&quot;label&quot;) Figure 11.5: Obligatory \\(t\\)-SNE plot of the Paul HSC dataset, where each point represents a cell and is colored according to the assigned cluster. plotTSNE(sce.paul, colour_by=&quot;label&quot;, other_fields=&quot;Batch_desc&quot;) + facet_wrap(~Batch_desc) Figure 11.6: Obligatory \\(t\\)-SNE plot of the Paul HSC dataset faceted by the treatment condition, where each point represents a cell and is colored according to the assigned cluster.
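The cluster-by-treatment heatmap is built from a plain contingency table, with a pseudo-count of 10 added before taking logarithms. A minimal base-R sketch on made-up labels shows what that transformation does; the vectors below are invented for illustration and do not come from the Paul dataset.

```r
# Made-up cluster and treatment labels for 400 cells.
treatment <- rep(c("control", "knockout"), each = 200)
cluster <- factor(c(rep(1, 150), rep(2, 50),    # control cells
                    rep(2, 20), rep(3, 180)))   # knockout cells

# Cross-tabulate clusters (rows) against treatments (columns).
tab <- table(cluster, treatment)

# The pseudo-count keeps empty combinations finite (log10(0 + 10) = 1)
# and compresses differences among small counts.
shrunk <- log10(tab + 10)

# Fraction of each cluster contributed by each treatment.
prop <- prop.table(tab, margin = 1)
```

Clusters whose rows in `prop` are dominated by one treatment are the ones driving the apparent relationship. With a confounded design, such a table cannot distinguish a treatment effect from a batch effect.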
Session Info View session info R version 4.3.0 RC (2023-04-13 r84269) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 22.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C time zone: America/New_York tzcode source: system (glibc) attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] scran_1.28.0 scater_1.28.0 [3] ggplot2_3.4.2 scuttle_1.10.0 [5] AnnotationHub_3.8.0 BiocFileCache_2.8.0 [7] dbplyr_2.3.2 ensembldb_2.24.0 [9] AnnotationFilter_1.24.0 GenomicFeatures_1.52.0 [11] AnnotationDbi_1.62.0 scRNAseq_2.13.0 [13] SingleCellExperiment_1.22.0 SummarizedExperiment_1.30.0 [15] Biobase_2.60.0 GenomicRanges_1.52.0 [17] GenomeInfoDb_1.36.0 IRanges_2.34.0 [19] S4Vectors_0.38.0 BiocGenerics_0.46.0 [21] MatrixGenerics_1.12.0 matrixStats_0.63.0 [23] BiocStyle_2.28.0 rebook_1.10.0 loaded via a namespace (and not attached): [1] RColorBrewer_1.1-3 jsonlite_1.8.4 [3] CodeDepends_0.6.5 magrittr_2.0.3 [5] ggbeeswarm_0.7.1 farver_2.1.1 [7] rmarkdown_2.21 BiocIO_1.10.0 [9] zlibbioc_1.46.0 vctrs_0.6.2 [11] memoise_2.0.1 Rsamtools_2.16.0 [13] DelayedMatrixStats_1.22.0 RCurl_1.98-1.12 [15] htmltools_0.5.5 progress_1.2.2 [17] curl_5.0.0 BiocNeighbors_1.18.0 [19] sass_0.4.5 bslib_0.4.2 [21] cachem_1.0.7 GenomicAlignments_1.36.0 [23] igraph_1.4.2 mime_0.12 [25] lifecycle_1.0.3 pkgconfig_2.0.3 [27] rsvd_1.0.5 Matrix_1.5-4 [29] R6_2.5.1 fastmap_1.1.1 [31] GenomeInfoDbData_1.2.10 shiny_1.7.4 [33] digest_0.6.31 colorspace_2.1-0 [35] dqrng_0.3.0 irlba_2.3.5.1 [37] ExperimentHub_2.8.0 RSQLite_2.3.1 [39] beachmat_2.16.0 labeling_0.4.2 [41] filelock_1.0.2 fansi_1.0.4 [43] 
httr_1.4.5 compiler_4.3.0 [45] bit64_4.0.5 withr_2.5.0 [47] BiocParallel_1.34.0 viridis_0.6.2 [49] DBI_1.1.3 highr_0.10 [51] biomaRt_2.56.0 rappdirs_0.3.3 [53] DelayedArray_0.26.0 bluster_1.10.0 [55] rjson_0.2.21 tools_4.3.0 [57] vipor_0.4.5 beeswarm_0.4.0 [59] interactiveDisplayBase_1.38.0 httpuv_1.6.9 [61] glue_1.6.2 restfulr_0.0.15 [63] promises_1.2.0.1 grid_4.3.0 [65] Rtsne_0.16 cluster_2.1.4 [67] generics_0.1.3 gtable_0.3.3 [69] hms_1.1.3 metapod_1.8.0 [71] BiocSingular_1.16.0 ScaledMatrix_1.8.0 [73] xml2_1.3.3 utf8_1.2.3 [75] XVector_0.40.0 ggrepel_0.9.3 [77] BiocVersion_3.17.1 pillar_1.9.0 [79] stringr_1.5.0 limma_3.56.0 [81] later_1.3.0 dplyr_1.1.2 [83] lattice_0.21-8 rtracklayer_1.60.0 [85] bit_4.0.5 tidyselect_1.2.0 [87] locfit_1.5-9.7 Biostrings_2.68.0 [89] knitr_1.42 gridExtra_2.3 [91] bookdown_0.33 ProtGenerics_1.32.0 [93] edgeR_3.42.0 xfun_0.39 [95] statmod_1.5.0 pheatmap_1.0.12 [97] stringi_1.7.12 lazyeval_0.2.2 [99] yaml_2.3.7 evaluate_0.20 [101] codetools_0.2-19 tibble_3.2.1 [103] BiocManager_1.30.20 graph_1.78.0 [105] cli_3.6.1 xtable_1.8-4 [107] munsell_0.5.0 jquerylib_0.1.4 [109] Rcpp_1.0.10 dir.expiry_1.8.0 [111] png_0.1-8 XML_3.99-0.14 [113] parallel_4.3.0 ellipsis_0.3.2 [115] blob_1.2.4 prettyunits_1.1.1 [117] sparseMatrixStats_1.12.0 bitops_1.0-7 [119] viridisLite_0.4.1 scales_1.2.1 [121] purrr_1.0.1 crayon_1.5.2 [123] rlang_1.1.0 cowplot_1.1.1 [125] KEGGREST_1.40.0 References "],["bach-mouse-mammary-gland-10x-genomics.html", "Chapter 12 Bach mouse mammary gland (10X Genomics) 12.1 Introduction 12.2 Data loading 12.3 Quality control 12.4 Normalization 12.5 Variance modelling 12.6 Dimensionality reduction 12.7 Clustering Session Info", " Chapter 12 Bach mouse mammary gland (10X Genomics) .rebook-collapse { background-color: #eee; color: #444; cursor: pointer; padding: 18px; width: 100%; border: none; text-align: left; outline: none; font-size: 15px; } .rebook-content { padding: 0 18px; display: none; overflow: hidden; background-color: 
#f1f1f1; } 12.1 Introduction This performs an analysis of the Bach et al. (2017) 10X Genomics dataset, from which we will consider a single sample of epithelial cells from the mouse mammary gland during gestation. 12.2 Data loading library(scRNAseq) sce.mam &lt;- BachMammaryData(samples=&quot;G_1&quot;) library(scater) rownames(sce.mam) &lt;- uniquifyFeatureNames( rowData(sce.mam)$Ensembl, rowData(sce.mam)$Symbol) library(AnnotationHub) ens.mm.v97 &lt;- AnnotationHub()[[&quot;AH73905&quot;]] rowData(sce.mam)$SEQNAME &lt;- mapIds(ens.mm.v97, keys=rowData(sce.mam)$Ensembl, keytype=&quot;GENEID&quot;, column=&quot;SEQNAME&quot;) 12.3 Quality control unfiltered &lt;- sce.mam is.mito &lt;- rowData(sce.mam)$SEQNAME == &quot;MT&quot; stats &lt;- perCellQCMetrics(sce.mam, subsets=list(Mito=which(is.mito))) qc &lt;- quickPerCellQC(stats, percent_subsets=&quot;subsets_Mito_percent&quot;) sce.mam &lt;- sce.mam[,!qc$discard] colData(unfiltered) &lt;- cbind(colData(unfiltered), stats) unfiltered$discard &lt;- qc$discard gridExtra::grid.arrange( plotColData(unfiltered, y=&quot;sum&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Total count&quot;), plotColData(unfiltered, y=&quot;detected&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Detected features&quot;), plotColData(unfiltered, y=&quot;subsets_Mito_percent&quot;, colour_by=&quot;discard&quot;) + ggtitle(&quot;Mito percent&quot;), ncol=2 ) Figure 12.1: Distribution of each QC metric across cells in the Bach mammary gland dataset. Each point represents a cell and is colored according to whether that cell was discarded. plotColData(unfiltered, x=&quot;sum&quot;, y=&quot;subsets_Mito_percent&quot;, colour_by=&quot;discard&quot;) + scale_x_log10() Figure 12.2: Percentage of mitochondrial reads in each cell in the Bach mammary gland dataset compared to its total count. Each point represents a cell and is colored according to whether that cell was discarded. 
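The mitochondrial filter used in this section can be sketched in base R as a per-cell percentage followed by an upper-tail outlier threshold. This is an illustrative approximation under stated assumptions, not the code run by `perCellQCMetrics()` or `quickPerCellQC()`: the count matrix is simulated, and the threshold of three MADs above the median on the untransformed percentages is assumed here for the sake of the example.

```r
set.seed(7)
# Simulated count matrix: 20 genes by 50 cells.
counts <- matrix(rpois(20 * 50, lambda = 5), nrow = 20,
                 dimnames = list(paste0("gene", 1:20), paste0("cell", 1:50)))

# Gene annotation: three mitochondrial genes, two with missing annotation (NA).
is.mito <- c(rep(TRUE, 3), rep(NA, 2), rep(FALSE, 15))

# Mimic three damaged cells by inflating their mitochondrial counts.
counts[1:3, 1:3] <- counts[1:3, 1:3] + 100L

# which() drops the NA entries, so unannotated genes are not treated as mitochondrial.
mito.idx <- which(is.mito)

# Percentage of each cell's counts assigned to mitochondrial genes.
mito.percent <- 100 * colSums(counts[mito.idx, , drop = FALSE]) / colSums(counts)

# Assumed rule: discard cells more than 3 MADs above the median percentage.
limit <- median(mito.percent) + 3 * mad(mito.percent)
discard <- mito.percent > limit
```

Wrapping the logical vector in `which()` matters whenever the annotation lookup returns `NA` for some genes, since subsetting with a logical vector that contains `NA` would otherwise introduce missing rows.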
colSums(as.matrix(qc)) ## low_lib_size low_n_features high_subsets_Mito_percent ## 0 0 143 ## discard ## 143 12.4 Normalization library(scran) set.seed(101000110) clusters &lt;- quickCluster(sce.mam) sce.mam &lt;- computeSumFactors(sce.mam, clusters=clusters) sce.mam &lt;- logNormCounts(sce.mam) summary(sizeFactors(sce.mam)) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.271 0.522 0.758 1.000 1.204 10.958 plot(librarySizeFactors(sce.mam), sizeFactors(sce.mam), pch=16, xlab=&quot;Library size factors&quot;, ylab=&quot;Deconvolution factors&quot;, log=&quot;xy&quot;) Figure 12.3: Relationship between the library size factors and the deconvolution size factors in the Bach mammary gland dataset. 12.5 Variance modelling We use a Poisson-based technical trend to capture more genuine biological variation in the biological component. set.seed(00010101) dec.mam &lt;- modelGeneVarByPoisson(sce.mam) top.mam &lt;- getTopHVGs(dec.mam, prop=0.1) plot(dec.mam$mean, dec.mam$total, pch=16, cex=0.5, xlab=&quot;Mean of log-expression&quot;, ylab=&quot;Variance of log-expression&quot;) curfit &lt;- metadata(dec.mam) curve(curfit$trend(x), col=&#39;dodgerblue&#39;, add=TRUE, lwd=2) Figure 12.4: Per-gene variance as a function of the mean for the log-expression values in the Bach mammary gland dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to simulated Poisson counts. 12.6 Dimensionality reduction library(BiocSingular) set.seed(101010011) sce.mam &lt;- denoisePCA(sce.mam, technical=dec.mam, subset.row=top.mam) sce.mam &lt;- runTSNE(sce.mam, dimred=&quot;PCA&quot;) ncol(reducedDim(sce.mam, &quot;PCA&quot;)) ## [1] 15 12.7 Clustering We use a higher k to obtain coarser clusters (for use in doubletCluster() later). 
snn.gr &lt;- buildSNNGraph(sce.mam, use.dimred=&quot;PCA&quot;, k=25) colLabels(sce.mam) &lt;- factor(igraph::cluster_walktrap(snn.gr)$membership) table(colLabels(sce.mam)) ## ## 1 2 3 4 5 6 7 8 9 10 ## 550 799 716 452 24 84 52 39 32 24 plotTSNE(sce.mam, colour_by=&quot;label&quot;) Figure 12.5: Obligatory \\(t\\)-SNE plot of the Bach mammary gland dataset, where each point represents a cell and is colored according to the assigned cluster. Session Info View session info R version 4.3.0 RC (2023-04-13 r84269) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 22.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C time zone: America/New_York tzcode source: system (glibc) attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] BiocSingular_1.16.0 scran_1.28.0 [3] AnnotationHub_3.8.0 BiocFileCache_2.8.0 [5] dbplyr_2.3.2 scater_1.28.0 [7] ggplot2_3.4.2 scuttle_1.10.0 [9] ensembldb_2.24.0 AnnotationFilter_1.24.0 [11] GenomicFeatures_1.52.0 AnnotationDbi_1.62.0 [13] scRNAseq_2.13.0 SingleCellExperiment_1.22.0 [15] SummarizedExperiment_1.30.0 Biobase_2.60.0 [17] GenomicRanges_1.52.0 GenomeInfoDb_1.36.0 [19] IRanges_2.34.0 S4Vectors_0.38.0 [21] BiocGenerics_0.46.0 MatrixGenerics_1.12.0 [23] matrixStats_0.63.0 BiocStyle_2.28.0 [25] rebook_1.10.0 loaded via a namespace (and not attached): [1] jsonlite_1.8.4 CodeDepends_0.6.5 [3] magrittr_2.0.3 ggbeeswarm_0.7.1 [5] farver_2.1.1 rmarkdown_2.21 [7] BiocIO_1.10.0 zlibbioc_1.46.0 [9] vctrs_0.6.2 memoise_2.0.1 [11] Rsamtools_2.16.0 DelayedMatrixStats_1.22.0 [13] RCurl_1.98-1.12 htmltools_0.5.5 [15] progress_1.2.2 
curl_5.0.0 [17] BiocNeighbors_1.18.0 sass_0.4.5 [19] bslib_0.4.2 cachem_1.0.7 [21] GenomicAlignments_1.36.0 igraph_1.4.2 [23] mime_0.12 lifecycle_1.0.3 [25] pkgconfig_2.0.3 rsvd_1.0.5 [27] Matrix_1.5-4 R6_2.5.1 [29] fastmap_1.1.1 GenomeInfoDbData_1.2.10 [31] shiny_1.7.4 digest_0.6.31 [33] colorspace_2.1-0 dqrng_0.3.0 [35] irlba_2.3.5.1 ExperimentHub_2.8.0 [37] RSQLite_2.3.1 beachmat_2.16.0 [39] labeling_0.4.2 filelock_1.0.2 [41] fansi_1.0.4 httr_1.4.5 [43] compiler_4.3.0 bit64_4.0.5 [45] withr_2.5.0 BiocParallel_1.34.0 [47] viridis_0.6.2 DBI_1.1.3 [49] highr_0.10 biomaRt_2.56.0 [51] rappdirs_0.3.3 DelayedArray_0.26.0 [53] bluster_1.10.0 rjson_0.2.21 [55] tools_4.3.0 vipor_0.4.5 [57] beeswarm_0.4.0 interactiveDisplayBase_1.38.0 [59] httpuv_1.6.9 glue_1.6.2 [61] restfulr_0.0.15 promises_1.2.0.1 [63] grid_4.3.0 Rtsne_0.16 [65] cluster_2.1.4 generics_0.1.3 [67] gtable_0.3.3 hms_1.1.3 [69] metapod_1.8.0 ScaledMatrix_1.8.0 [71] xml2_1.3.3 utf8_1.2.3 [73] XVector_0.40.0 ggrepel_0.9.3 [75] BiocVersion_3.17.1 pillar_1.9.0 [77] stringr_1.5.0 limma_3.56.0 [79] later_1.3.0 dplyr_1.1.2 [81] lattice_0.21-8 rtracklayer_1.60.0 [83] bit_4.0.5 tidyselect_1.2.0 [85] locfit_1.5-9.7 Biostrings_2.68.0 [87] knitr_1.42 gridExtra_2.3 [89] bookdown_0.33 ProtGenerics_1.32.0 [91] edgeR_3.42.0 xfun_0.39 [93] statmod_1.5.0 stringi_1.7.12 [95] lazyeval_0.2.2 yaml_2.3.7 [97] evaluate_0.20 codetools_0.2-19 [99] tibble_3.2.1 BiocManager_1.30.20 [101] graph_1.78.0 cli_3.6.1 [103] xtable_1.8-4 munsell_0.5.0 [105] jquerylib_0.1.4 Rcpp_1.0.10 [107] dir.expiry_1.8.0 png_0.1-8 [109] XML_3.99-0.14 parallel_4.3.0 [111] ellipsis_0.3.2 blob_1.2.4 [113] prettyunits_1.1.1 sparseMatrixStats_1.12.0 [115] bitops_1.0-7 viridisLite_0.4.1 [117] scales_1.2.1 purrr_1.0.1 [119] crayon_1.5.2 rlang_1.1.0 [121] cowplot_1.1.1 KEGGREST_1.40.0 References "],["messmer-hesc.html", "Chapter 13 Messmer human ESC (Smart-seq2) 13.1 Introduction 13.2 Data loading 13.3 Quality control 13.4 Normalization 13.5 Cell cycle phase 
assignment 13.6 Feature selection 13.7 Batch correction 13.8 Dimensionality Reduction Session Info", " Chapter 13 Messmer human ESC (Smart-seq2) 13.1 Introduction This chapter analyzes the human embryonic stem cell (hESC) dataset generated with Smart-seq2 (Messmer et al. 2019), which contains several plates of naive and primed hESCs. The chapter’s code is based on the steps in the paper’s GitHub repository, with some additional steps for cell cycle effect removal contributed by Philippe Boileau. 13.2 Data loading We convert the batch to a factor to simplify downstream steps. library(scRNAseq) sce.mess &lt;- MessmerESCData() sce.mess$`experiment batch` &lt;- factor(sce.mess$`experiment batch`) library(AnnotationHub) ens.hs.v97 &lt;- AnnotationHub()[[&quot;AH73881&quot;]] anno &lt;- select(ens.hs.v97, keys=rownames(sce.mess), keytype=&quot;GENEID&quot;, columns=c(&quot;SYMBOL&quot;)) rowData(sce.mess) &lt;- anno[match(rownames(sce.mess), anno$GENEID),] 13.3 Quality control Let’s have a look at the QC statistics. 
colSums(as.matrix(filtered)) ## low_lib_size low_n_features high_subsets_Mito_percent ## 107 99 22 ## high_altexps_ERCC_percent discard ## 117 156 gridExtra::grid.arrange( plotColData(original, x=&quot;experiment batch&quot;, y=&quot;sum&quot;, colour_by=I(filtered$discard), other_field=&quot;phenotype&quot;) + facet_wrap(~phenotype) + scale_y_log10(), plotColData(original, x=&quot;experiment batch&quot;, y=&quot;detected&quot;, colour_by=I(filtered$discard), other_field=&quot;phenotype&quot;) + facet_wrap(~phenotype) + scale_y_log10(), plotColData(original, x=&quot;experiment batch&quot;, y=&quot;subsets_Mito_percent&quot;, colour_by=I(filtered$discard), other_field=&quot;phenotype&quot;) + facet_wrap(~phenotype), plotColData(original, x=&quot;experiment batch&quot;, y=&quot;altexps_ERCC_percent&quot;, colour_by=I(filtered$discard), other_field=&quot;phenotype&quot;) + facet_wrap(~phenotype), ncol=1 ) Figure 13.1: Distribution of QC metrics across batches (x-axis) and phenotypes (facets) for cells in the Messmer hESC dataset. Each point is a cell and is colored by whether it was discarded. 
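The code that produced `filtered` is not shown in this section. As a rough illustration of how per-metric flags of this kind are commonly derived, the sketch below marks cells whose metric lies more than three median absolute deviations (MADs) from the median and then combines the flags into a single discard vector. It uses base R only. The toy values, the three-MAD threshold and all variable names are assumptions made for illustration and are not the chapter's actual QC code.

```r
# Toy QC metrics for eight cells (hypothetical values, not from the dataset).
lib.sizes <- c(52000, 61000, 48000, 55000, 900, 58000, 50000, 47000)
mito.pct <- c(2, 3, 2.5, 4, 3, 30, 2, 3.5)

# Library sizes are compared on the log scale so that small libraries stand out.
log.sizes <- log10(lib.sizes)
low.lib <- log.sizes < median(log.sizes) - 3 * mad(log.sizes)

# Mitochondrial percentages are flagged when unusually high.
high.mito <- mito.pct > median(mito.pct) + 3 * mad(mito.pct)

# A cell is discarded if it fails any individual check.
discard <- low.lib | high.mito

# Per-reason counts, analogous in shape to the colSums() summary above.
colSums(cbind(low_lib_size=low.lib, high_mito=high.mito, discard=discard))
```

Because `discard` is the union of the individual flags, its count can be smaller than the sum of the per-metric counts whenever one cell fails several checks, which is the pattern seen in the summary above.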
13.4 Normalization library(scran) set.seed(10000) clusters &lt;- quickCluster(sce.mess) sce.mess &lt;- computeSumFactors(sce.mess, clusters=clusters) sce.mess &lt;- logNormCounts(sce.mess) par(mfrow=c(1,2)) plot(sce.mess$sum/1e6, sizeFactors(sce.mess), log = &quot;xy&quot;, pch=16, xlab = &quot;Library size (millions)&quot;, ylab = &quot;Size factor&quot;, col = ifelse(sce.mess$phenotype == &quot;naive&quot;, &quot;black&quot;, &quot;grey&quot;)) spike.sf &lt;- librarySizeFactors(altExp(sce.mess, &quot;ERCC&quot;)) plot(sizeFactors(sce.mess), spike.sf, log = &quot;xy&quot;, pch=16, ylab = &quot;Spike-in size factor&quot;, xlab = &quot;Deconvolution size factor&quot;, col = ifelse(sce.mess$phenotype == &quot;naive&quot;, &quot;black&quot;, &quot;grey&quot;)) Figure 13.2: Deconvolution size factors plotted against the library size (left) and spike-in size factors plotted against the deconvolution size factors (right). Each point is a cell and is colored by its phenotype (black for naive, grey otherwise). 13.5 Cell cycle phase assignment Here, we use multiple cores to speed up the processing. set.seed(10001) hs_pairs &lt;- readRDS(system.file(&quot;exdata&quot;, &quot;human_cycle_markers.rds&quot;, package=&quot;scran&quot;)) assigned &lt;- cyclone(sce.mess, pairs=hs_pairs, gene.names=rownames(sce.mess), BPPARAM=BiocParallel::MulticoreParam(10)) sce.mess$phase &lt;- assigned$phases table(sce.mess$phase) ## ## G1 G2M S ## 460 406 322 smoothScatter(assigned$scores$G1, assigned$scores$G2M, xlab=&quot;G1 score&quot;, ylab=&quot;G2/M score&quot;, pch=16) Figure 13.3: G1 cyclone() phase scores against the G2/M phase scores for each cell in the Messmer hESC dataset. 
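To make the link between the plotted scores and the phase table more concrete, the sketch below applies the assignment rule as we understand it from the cyclone() documentation: a cell is called S phase when neither the G1 nor the G2/M score reaches 0.5, and otherwise takes the phase with the larger score. The toy scores are hypothetical, and the handling of ties is an assumption of this sketch.

```r
# Toy phase scores for four cells (hypothetical values, not from the dataset).
scores <- data.frame(G1 = c(0.9, 0.2, 0.3, 0.6),
                     G2M = c(0.1, 0.8, 0.4, 0.7))

# S phase when neither score reaches 0.5; otherwise the larger score wins.
neither <- scores$G1 < 0.5 & scores$G2M < 0.5
phase <- ifelse(neither, 'S', ifelse(scores$G1 >= scores$G2M, 'G1', 'G2M'))

# Tabulate the calls, as done for sce.mess$phase above.
table(phase)
```

Under this rule, cells in the lower-left corner of the score plot are the ones labeled S, which is why S phase calls are made by exclusion rather than from a dedicated S score.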
13.6 Feature selection dec &lt;- modelGeneVarWithSpikes(sce.mess, &quot;ERCC&quot;, block = sce.mess$`experiment batch`) top.hvgs &lt;- getTopHVGs(dec, prop = 0.1) par(mfrow=c(1,3)) for (i in seq_along(dec$per.block)) { current &lt;- dec$per.block[[i]] plot(current$mean, current$total, xlab=&quot;Mean log-expression&quot;, ylab=&quot;Variance&quot;, pch=16, cex=0.5, main=paste(&quot;Batch&quot;, i)) fit &lt;- metadata(current) points(fit$mean, fit$var, col=&quot;red&quot;, pch=16) curve(fit$trend(x), col=&#39;dodgerblue&#39;, add=TRUE, lwd=2) } Figure 13.4: Per-gene variance of the log-normalized expression values in the Messmer hESC dataset, plotted against the mean for each batch. Each point represents a gene with spike-ins shown in red and the fitted trend shown in blue. 13.7 Batch correction We remove the obvious batch effect with linear regression, which is possible due to the replicated nature of the experimental design. We set keep=1:2 to retain the effect of the first two coefficients in design corresponding to our phenotype of interest. library(batchelor) sce.mess &lt;- correctExperiments(sce.mess, PARAM = RegressParam( design = model.matrix(~sce.mess$phenotype + sce.mess$`experiment batch`), keep = 1:2 ) ) 13.8 Dimensionality Reduction We could have set d= and subset.row= in correctExperiments() to automatically perform a PCA on the residual matrix with the subset of HVGs, but we’ll just explicitly call runPCA() here to keep things simple. set.seed(1101001) sce.mess &lt;- runPCA(sce.mess, subset_row = top.hvgs, exprs_values = &quot;corrected&quot;) sce.mess &lt;- runTSNE(sce.mess, dimred = &quot;PCA&quot;, perplexity = 40) From a naive PCA, the cell cycle appears to be a major source of biological variation within each phenotype. 
gridExtra::grid.arrange( plotTSNE(sce.mess, colour_by = &quot;phenotype&quot;) + ggtitle(&quot;By phenotype&quot;), plotTSNE(sce.mess, colour_by = &quot;experiment batch&quot;) + ggtitle(&quot;By batch &quot;), plotTSNE(sce.mess, colour_by = &quot;CDK1&quot;, swap_rownames=&quot;SYMBOL&quot;) + ggtitle(&quot;By CDK1&quot;), plotTSNE(sce.mess, colour_by = &quot;phase&quot;) + ggtitle(&quot;By phase&quot;), ncol = 2 ) Figure 13.5: Obligatory \\(t\\)-SNE plots of the Messmer hESC dataset, where each point is a cell and is colored by various attributes. We perform contrastive PCA (cPCA) and sparse cPCA (scPCA) on the corrected log-expression data to obtain the same number of PCs. Given that the naive hESCs are actually reprogrammed primed hESCs, we will use the single batch of primed-only hESCs as the “background” dataset to remove the cell cycle effect. library(scPCA) is.bg &lt;- sce.mess$`experiment batch`==&quot;3&quot; target &lt;- sce.mess[,!is.bg] background &lt;- sce.mess[,is.bg] mat.target &lt;- t(assay(target, &quot;corrected&quot;)[top.hvgs,]) mat.background &lt;- t(assay(background, &quot;corrected&quot;)[top.hvgs,]) set.seed(1010101001) con_out &lt;- scPCA( target = mat.target, background = mat.background, penalties = 0, # no penalties = non-sparse cPCA. n_eigen = 50, contrasts = 100 ) reducedDim(target, &quot;cPCA&quot;) &lt;- con_out$x set.seed(101010101) sparse_con_out &lt;- scPCA( target = mat.target, background = mat.background, penalties = 1e-4, n_eigen = 50, contrasts = 100, alg = &quot;rand_var_proj&quot; # for speed. ) reducedDim(target, &quot;scPCA&quot;) &lt;- sparse_con_out$x We see greater intermingling between phases within both the naive and primed cells after cPCA and scPCA. 
set.seed(1101001) target &lt;- runTSNE(target, dimred = &quot;cPCA&quot;, perplexity = 40, name=&quot;cPCA+TSNE&quot;) target &lt;- runTSNE(target, dimred = &quot;scPCA&quot;, perplexity = 40, name=&quot;scPCA+TSNE&quot;) gridExtra::grid.arrange( plotReducedDim(target, &quot;cPCA+TSNE&quot;, colour_by = &quot;phase&quot;) + ggtitle(&quot;After cPCA&quot;), plotReducedDim(target, &quot;scPCA+TSNE&quot;, colour_by = &quot;phase&quot;) + ggtitle(&quot;After scPCA&quot;), ncol=2 ) Figure 13.6: More \\(t\\)-SNE plots of the Messmer hESC dataset after cPCA and scPCA, where each point is a cell and is colored by its assigned cell cycle phase. We can quantify the change in the separation between phases within each phenotype using the silhouette coefficient. library(bluster) naive &lt;- target[,target$phenotype==&quot;naive&quot;] primed &lt;- target[,target$phenotype==&quot;primed&quot;] N &lt;- approxSilhouette(reducedDim(naive, &quot;PCA&quot;), naive$phase) P &lt;- approxSilhouette(reducedDim(primed, &quot;PCA&quot;), primed$phase) c(naive=mean(N$width), primed=mean(P$width)) ## naive primed ## 0.02032 0.03025 cN &lt;- approxSilhouette(reducedDim(naive, &quot;cPCA&quot;), naive$phase) cP &lt;- approxSilhouette(reducedDim(primed, &quot;cPCA&quot;), primed$phase) c(naive=mean(cN$width), primed=mean(cP$width)) ## naive primed ## 0.007696 0.011941 scN &lt;- approxSilhouette(reducedDim(naive, &quot;scPCA&quot;), naive$phase) scP &lt;- approxSilhouette(reducedDim(primed, &quot;scPCA&quot;), primed$phase) c(naive=mean(scN$width), primed=mean(scP$width)) ## naive primed ## 0.006614 0.014601 Session Info View session info R version 4.3.0 RC (2023-04-13 r84269) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 22.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 
LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C time zone: America/New_York tzcode source: system (glibc) attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] bluster_1.10.0 scPCA_1.14.0 [3] batchelor_1.16.0 scran_1.28.0 [5] scater_1.28.0 ggplot2_3.4.2 [7] scuttle_1.10.0 AnnotationHub_3.8.0 [9] BiocFileCache_2.8.0 dbplyr_2.3.2 [11] ensembldb_2.24.0 AnnotationFilter_1.24.0 [13] GenomicFeatures_1.52.0 AnnotationDbi_1.62.0 [15] scRNAseq_2.13.0 SingleCellExperiment_1.22.0 [17] SummarizedExperiment_1.30.0 Biobase_2.60.0 [19] GenomicRanges_1.52.0 GenomeInfoDb_1.36.0 [21] IRanges_2.34.0 S4Vectors_0.38.0 [23] BiocGenerics_0.46.0 MatrixGenerics_1.12.0 [25] matrixStats_0.63.0 BiocStyle_2.28.0 [27] rebook_1.10.0 loaded via a namespace (and not attached): [1] later_1.3.0 BiocIO_1.10.0 [3] bitops_1.0-7 filelock_1.0.2 [5] tibble_3.2.1 CodeDepends_0.6.5 [7] graph_1.78.0 XML_3.99-0.14 [9] lifecycle_1.0.3 Rdpack_2.4 [11] edgeR_3.42.0 globals_0.16.2 [13] lattice_0.21-8 magrittr_2.0.3 [15] limma_3.56.0 sass_0.4.5 [17] rmarkdown_2.21 jquerylib_0.1.4 [19] yaml_2.3.7 metapod_1.8.0 [21] httpuv_1.6.9 cowplot_1.1.1 [23] DBI_1.1.3 ResidualMatrix_1.10.0 [25] abind_1.4-5 zlibbioc_1.46.0 [27] Rtsne_0.16 purrr_1.0.1 [29] RCurl_1.98-1.12 rappdirs_0.3.3 [31] GenomeInfoDbData_1.2.10 ggrepel_0.9.3 [33] irlba_2.3.5.1 listenv_0.9.0 [35] RSpectra_0.16-1 parallelly_1.35.0 [37] dqrng_0.3.0 DelayedMatrixStats_1.22.0 [39] codetools_0.2-19 DelayedArray_0.26.0 [41] xml2_1.3.3 tidyselect_1.2.0 [43] farver_2.1.1 ScaledMatrix_1.8.0 [45] viridis_0.6.2 GenomicAlignments_1.36.0 [47] jsonlite_1.8.4 BiocNeighbors_1.18.0 [49] ellipsis_0.3.2 tools_4.3.0 [51] progress_1.2.2 Rcpp_1.0.10 [53] glue_1.6.2 gridExtra_2.3 [55] xfun_0.39 dplyr_1.1.2 [57] withr_2.5.0 BiocManager_1.30.20 [59] fastmap_1.1.1 sparsepca_0.1.2 [61] fansi_1.0.4 digest_0.6.31 [63] rsvd_1.0.5 
R6_2.5.1 [65] mime_0.12 colorspace_2.1-0 [67] biomaRt_2.56.0 RSQLite_2.3.1 [69] utf8_1.2.3 generics_0.1.3 [71] data.table_1.14.8 rtracklayer_1.60.0 [73] prettyunits_1.1.1 httr_1.4.5 [75] pkgconfig_2.0.3 gtable_0.3.3 [77] blob_1.2.4 XVector_0.40.0 [79] htmltools_0.5.5 bookdown_0.33 [81] ProtGenerics_1.32.0 scales_1.2.1 [83] png_0.1-8 knitr_1.42 [85] rjson_0.2.21 curl_5.0.0 [87] cachem_1.0.7 stringr_1.5.0 [89] BiocVersion_3.17.1 KernSmooth_2.23-20 [91] parallel_4.3.0 vipor_0.4.5 [93] restfulr_0.0.15 pillar_1.9.0 [95] grid_4.3.0 vctrs_0.6.2 [97] promises_1.2.0.1 origami_1.0.7 [99] BiocSingular_1.16.0 beachmat_2.16.0 [101] xtable_1.8-4 cluster_2.1.4 [103] beeswarm_0.4.0 evaluate_0.20 [105] cli_3.6.1 locfit_1.5-9.7 [107] compiler_4.3.0 Rsamtools_2.16.0 [109] rlang_1.1.0 crayon_1.5.2 [111] future.apply_1.10.0 labeling_0.4.2 [113] ggbeeswarm_0.7.1 stringi_1.7.12 [115] viridisLite_0.4.1 BiocParallel_1.34.0 [117] assertthat_0.2.1 munsell_0.5.0 [119] Biostrings_2.68.0 lazyeval_0.2.2 [121] coop_0.6-3 Matrix_1.5-4 [123] dir.expiry_1.8.0 ExperimentHub_2.8.0 [125] hms_1.1.3 future_1.32.0 [127] sparseMatrixStats_1.12.0 bit64_4.0.5 [129] KEGGREST_1.40.0 statmod_1.5.0 [131] shiny_1.7.4 interactiveDisplayBase_1.38.0 [133] highr_0.10 kernlab_0.9-32 [135] rbibutils_2.2.13 igraph_1.4.2 [137] memoise_2.0.1 bslib_0.4.2 [139] bit_4.0.5 References "],["hca-human-bone-marrow-10x-genomics.html", "Chapter 14 HCA human bone marrow (10X Genomics) 14.1 Introduction 14.2 Data loading 14.3 Quality control 14.4 Normalization 14.5 Variance modeling 14.6 Data integration 14.7 Dimensionality reduction 14.8 Clustering 14.9 Differential expression 14.10 Cell type classification Session Info", " Chapter 14 HCA human bone marrow (10X Genomics) 
14.1 Introduction Here, we use an example dataset from the Human Cell Atlas immune cell profiling project on bone marrow, which contains scRNA-seq data for 380,000 cells generated using the 10X Genomics technology. This is a fairly big dataset that represents a good use case for the techniques in Advanced Chapter 14. 14.2 Data loading This dataset is loaded via the HCAData package, which provides a ready-to-use SingleCellExperiment object. library(HCAData) sce.bone &lt;- HCAData(&#39;ica_bone_marrow&#39;, as.sparse=TRUE) sce.bone$Donor &lt;- sub(&quot;_.*&quot;, &quot;&quot;, sce.bone$Barcode) We use symbols in place of IDs for easier interpretation later. library(EnsDb.Hsapiens.v86) rowData(sce.bone)$Chr &lt;- mapIds(EnsDb.Hsapiens.v86, keys=rownames(sce.bone), column=&quot;SEQNAME&quot;, keytype=&quot;GENEID&quot;) library(scater) rownames(sce.bone) &lt;- uniquifyFeatureNames(rowData(sce.bone)$ID, names = rowData(sce.bone)$Symbol) 14.3 Quality control Cell calling was not performed (see here) so we will perform QC using all metrics and block on the donor of origin during outlier detection. We perform the calculation across multiple cores to speed things up. 
library(BiocParallel) bpp &lt;- MulticoreParam(8) sce.bone &lt;- unfiltered &lt;- addPerCellQC(sce.bone, BPPARAM=bpp, subsets=list(Mito=which(rowData(sce.bone)$Chr==&quot;MT&quot;))) qc &lt;- quickPerCellQC(colData(sce.bone), batch=sce.bone$Donor, sub.fields=&quot;subsets_Mito_percent&quot;) sce.bone &lt;- sce.bone[,!qc$discard] unfiltered$discard &lt;- qc$discard gridExtra::grid.arrange( plotColData(unfiltered, x=&quot;Donor&quot;, y=&quot;sum&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Total count&quot;), plotColData(unfiltered, x=&quot;Donor&quot;, y=&quot;detected&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Detected features&quot;), plotColData(unfiltered, x=&quot;Donor&quot;, y=&quot;subsets_Mito_percent&quot;, colour_by=&quot;discard&quot;) + ggtitle(&quot;Mito percent&quot;), ncol=2 ) Figure 14.1: Distribution of QC metrics in the HCA bone marrow dataset. Each point represents a cell and is colored according to whether it was discarded. plotColData(unfiltered, x=&quot;sum&quot;, y=&quot;subsets_Mito_percent&quot;, colour_by=&quot;discard&quot;) + scale_x_log10() Figure 14.2: Percentage of mitochondrial reads in each cell in the HCA bone marrow dataset compared to its total count. Each point represents a cell and is colored according to whether that cell was discarded. 14.4 Normalization For a minor speed-up, we use already-computed library sizes rather than re-computing them from the column sums. sce.bone &lt;- logNormCounts(sce.bone, size_factors = sce.bone$sum) summary(sizeFactors(sce.bone)) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.05 0.47 0.65 1.00 0.89 42.38 14.5 Variance modeling We block on the donor of origin to mitigate batch effects during HVG selection. We select a larger number of HVGs to capture any batch-specific variation that might be present. 
library(scran) set.seed(1010010101) dec.bone &lt;- modelGeneVarByPoisson(sce.bone, block=sce.bone$Donor, BPPARAM=bpp) top.bone &lt;- getTopHVGs(dec.bone, n=5000) par(mfrow=c(4,2)) blocked.stats &lt;- dec.bone$per.block for (i in colnames(blocked.stats)) { current &lt;- blocked.stats[[i]] plot(current$mean, current$total, main=i, pch=16, cex=0.5, xlab=&quot;Mean of log-expression&quot;, ylab=&quot;Variance of log-expression&quot;) curfit &lt;- metadata(current) curve(curfit$trend(x), col=&#39;dodgerblue&#39;, add=TRUE, lwd=2) } Figure 14.3: Per-gene variance as a function of the mean for the log-expression values in the HCA bone marrow dataset, with one panel per donor. Each point represents a gene (black) with the mean-variance trend (blue) fitted to simulated Poisson counts. 14.6 Data integration Here we use multiple cores, randomized SVD and approximate nearest-neighbor detection to speed up this step. library(batchelor) library(BiocNeighbors) set.seed(1010001) merged.bone &lt;- fastMNN(sce.bone, batch = sce.bone$Donor, subset.row = top.bone, BSPARAM=BiocSingular::RandomParam(deferred = TRUE), BNPARAM=AnnoyParam(), BPPARAM=bpp) reducedDim(sce.bone, &#39;MNN&#39;) &lt;- reducedDim(merged.bone, &#39;corrected&#39;) We use the proportion of variance lost as a diagnostic measure: metadata(merged.bone)$merge.info$lost.var ## MantonBM1 MantonBM2 MantonBM3 MantonBM4 MantonBM5 MantonBM6 MantonBM7 ## [1,] 0.006922 0.006392 0.000000 0.000000 0.000000 0.000000 0.000000 ## [2,] 0.006380 0.006863 0.023049 0.000000 0.000000 0.000000 0.000000 ## [3,] 0.005068 0.003084 0.005178 0.019496 0.000000 0.000000 0.000000 ## [4,] 0.002009 0.001891 0.001901 0.001786 0.023105 0.000000 0.000000 ## [5,] 0.002452 0.002003 0.001770 0.002926 0.002646 0.023852 0.000000 ## [6,] 0.003167 0.003222 0.003169 0.002636 0.003362 0.003442 0.024650 ## [7,] 0.001968 0.001701 0.002441 0.002045 0.001585 0.002312 0.002003 ## MantonBM8 ## [1,] 0.00000 ## [2,] 0.00000 ## [3,] 0.00000 ## [4,] 0.00000 ## [5,] 0.00000 ## [6,] 0.00000 ## [7,] 0.03216 
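The lost-variance matrix above has one row per merge step and one column per donor, so a compact way to read it is to take the largest loss each donor experiences across all steps. The sketch below does this in base R on a small excerpt, using values rounded from the first three merge steps and four donors shown above. The 10% cut-off is an assumed rule of thumb for illustration rather than a threshold stated in this chapter, and the variable names are hypothetical.

```r
# Values rounded from the first three merge steps and four donors above.
lost.var <- rbind(
    c(0.0069, 0.0064, 0.0000, 0.0000),
    c(0.0064, 0.0069, 0.0230, 0.0000),
    c(0.0051, 0.0031, 0.0052, 0.0195)
)
colnames(lost.var) <- paste0('MantonBM', 1:4)

# Largest loss experienced by each donor at any merge step.
max.lost <- apply(lost.var, 2, max)

# Assumed rule of thumb: flag donors losing more than 10% of their variance.
flagged <- names(max.lost)[max.lost > 0.1]
length(flagged)
```

In this excerpt no donor comes close to the assumed cut-off, and the largest values sit on the step where a donor is first merged, matching the pattern in the full matrix.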
14.7 Dimensionality reduction We set external_neighbors=TRUE to replace the internal nearest neighbor search in the UMAP implementation with our parallelized approximate search. We also set the number of threads to be used in the UMAP iterations. set.seed(01010100) sce.bone &lt;- runUMAP(sce.bone, dimred=&quot;MNN&quot;, external_neighbors=TRUE, BNPARAM=AnnoyParam(), BPPARAM=bpp, n_threads=bpnworkers(bpp)) 14.8 Clustering Graph-based clustering generates an excessively large intermediate graph so we will instead use a two-step approach with \\(k\\)-means. We generate 1000 small clusters that are subsequently aggregated into more interpretable groups with a graph-based method. If more resolution is required, we can increase centers in addition to using a lower k during graph construction. library(bluster) set.seed(1000) colLabels(sce.bone) &lt;- clusterRows(reducedDim(sce.bone, &quot;MNN&quot;), TwoStepParam(KmeansParam(centers=1000), NNGraphParam(k=5))) table(colLabels(sce.bone)) ## ## 1 2 3 4 5 6 7 8 9 10 11 12 13 ## 20331 11161 55464 47426 15731 10581 64721 26493 18703 15043 17097 4992 3157 ## 14 15 ## 3403 2422 We observe mostly balanced contributions from different samples to each cluster (Figure 14.4), consistent with the expectation that all samples are replicates from different donors. tab &lt;- table(Cluster=colLabels(sce.bone), Donor=sce.bone$Donor) library(pheatmap) pheatmap(log10(tab+10), color=viridis::viridis(100)) Figure 14.4: Heatmap of log10-number of cells in each cluster (row) from each sample (column). # TODO: add scrambling option in scater&#39;s plotting functions. scrambled &lt;- sample(ncol(sce.bone)) gridExtra::grid.arrange( plotUMAP(sce.bone, colour_by=&quot;label&quot;, text_by=&quot;label&quot;), plotUMAP(sce.bone[,scrambled], colour_by=&quot;Donor&quot;) ) Figure 14.5: UMAP plots of the HCA bone marrow dataset after merging. 
Each point represents a cell and is colored according to the assigned cluster (top) or the donor of origin (bottom). 14.9 Differential expression We identify marker genes for each cluster while blocking on the donor. markers.bone &lt;- findMarkers(sce.bone, block = sce.bone$Donor, direction = &#39;up&#39;, lfc = 1, BPPARAM=bpp) We visualize the top markers for one cluster (cluster 4) using a heatmap in Figure 14.6. The presence of upregulated genes like LYZ, S100A8 and VCAN is consistent with a monocyte identity for this cluster. top.markers &lt;- markers.bone[[&quot;4&quot;]] best &lt;- top.markers[top.markers$Top &lt;= 10,] lfcs &lt;- getMarkerEffects(best) library(pheatmap) pheatmap(lfcs, breaks=seq(-5, 5, length.out=101)) Figure 14.6: Heatmap of log2-fold changes for the top marker genes (rows) of cluster 4 compared to all other clusters (columns). 14.10 Cell type classification We perform automated cell type classification using a reference dataset to annotate each cluster based on its pseudo-bulk profile. This is faster than the per-cell approaches described in Chapter 10.9, at the cost of the resolution required to detect heterogeneity inside a cluster. Nonetheless, it is often sufficient for a quick assignment of cluster identity, and indeed, cluster 4 is also identified as consisting of monocytes in this analysis. se.aggregated &lt;- sumCountsAcrossCells(sce.bone, id=colLabels(sce.bone), BPPARAM=bpp) library(celldex) hpc &lt;- HumanPrimaryCellAtlasData() library(SingleR) anno.single &lt;- SingleR(se.aggregated, ref = hpc, labels = hpc$label.main, assay.type.test=&quot;sum&quot;) anno.single ## DataFrame with 15 rows and 4 columns ## scores labels delta.next ## &lt;matrix&gt; &lt;character&gt; &lt;numeric&gt; ## 1 0.366050:0.741975:0.637800:... GMP 0.3366255 ## 2 0.399229:0.715314:0.628516:... Pro-B_cell_CD34+ 0.0690916 ## 3 0.325812:0.654869:0.570039:... T_cells 0.1292127 ## 4 0.296455:0.742466:0.529404:... 
Monocyte 0.3099254 ## 5 0.345773:0.565378:0.479722:... T_cells 0.4943226 ## ... ... ... ... ## 11 0.326882:0.646229:0.561060:... T_cells 0.027835817 ## 12 0.380710:0.684123:0.784540:... BM &amp; Prog. 0.000499754 ## 13 0.368546:0.652935:0.580330:... B_cell 0.201098545 ## 14 0.294361:0.706019:0.527282:... Monocyte 0.355212614 ## 15 0.339786:0.687074:0.569933:... GMP 0.131830602 ## pruned.labels ## &lt;character&gt; ## 1 GMP ## 2 Pro-B_cell_CD34+ ## 3 T_cells ## 4 Monocyte ## 5 T_cells ## ... ... ## 11 T_cells ## 12 BM &amp; Prog. ## 13 NA ## 14 Monocyte ## 15 GMP Session Info View session info R version 4.3.0 RC (2023-04-13 r84269) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 22.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C time zone: America/New_York tzcode source: system (glibc) attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] SingleR_2.2.0 celldex_1.9.0 [3] pheatmap_1.0.12 bluster_1.10.0 [5] BiocNeighbors_1.18.0 batchelor_1.16.0 [7] scran_1.28.0 BiocParallel_1.34.0 [9] scater_1.28.0 ggplot2_3.4.2 [11] scuttle_1.10.0 EnsDb.Hsapiens.v86_2.99.0 [13] ensembldb_2.24.0 AnnotationFilter_1.24.0 [15] GenomicFeatures_1.52.0 AnnotationDbi_1.62.0 [17] rhdf5_2.44.0 HCAData_1.15.0 [19] SingleCellExperiment_1.22.0 SummarizedExperiment_1.30.0 [21] Biobase_2.60.0 GenomicRanges_1.52.0 [23] GenomeInfoDb_1.36.0 IRanges_2.34.0 [25] S4Vectors_0.38.0 BiocGenerics_0.46.0 [27] MatrixGenerics_1.12.0 matrixStats_0.63.0 [29] BiocStyle_2.28.0 rebook_1.10.0 loaded via a namespace (and not attached): [1] later_1.3.0 BiocIO_1.10.0 [3] bitops_1.0-7 filelock_1.0.2 
[5] tibble_3.2.1 CodeDepends_0.6.5 [7] graph_1.78.0 XML_3.99-0.14 [9] lifecycle_1.0.3 edgeR_3.42.0 [11] lattice_0.21-8 magrittr_2.0.3 [13] limma_3.56.0 sass_0.4.5 [15] rmarkdown_2.21 jquerylib_0.1.4 [17] yaml_2.3.7 metapod_1.8.0 [19] httpuv_1.6.9 cowplot_1.1.1 [21] DBI_1.1.3 RColorBrewer_1.1-3 [23] ResidualMatrix_1.10.0 zlibbioc_1.46.0 [25] purrr_1.0.1 RCurl_1.98-1.12 [27] rappdirs_0.3.3 GenomeInfoDbData_1.2.10 [29] ggrepel_0.9.3 irlba_2.3.5.1 [31] dqrng_0.3.0 DelayedMatrixStats_1.22.0 [33] codetools_0.2-19 DelayedArray_0.26.0 [35] xml2_1.3.3 tidyselect_1.2.0 [37] farver_2.1.1 ScaledMatrix_1.8.0 [39] viridis_0.6.2 BiocFileCache_2.8.0 [41] GenomicAlignments_1.36.0 jsonlite_1.8.4 [43] ellipsis_0.3.2 tools_4.3.0 [45] progress_1.2.2 Rcpp_1.0.10 [47] glue_1.6.2 gridExtra_2.3 [49] xfun_0.39 dplyr_1.1.2 [51] HDF5Array_1.28.0 withr_2.5.0 [53] BiocManager_1.30.20 fastmap_1.1.1 [55] rhdf5filters_1.12.0 fansi_1.0.4 [57] digest_0.6.31 rsvd_1.0.5 [59] R6_2.5.1 mime_0.12 [61] colorspace_2.1-0 biomaRt_2.56.0 [63] RSQLite_2.3.1 utf8_1.2.3 [65] generics_0.1.3 rtracklayer_1.60.0 [67] prettyunits_1.1.1 httr_1.4.5 [69] uwot_0.1.14 pkgconfig_2.0.3 [71] gtable_0.3.3 blob_1.2.4 [73] XVector_0.40.0 htmltools_0.5.5 [75] bookdown_0.33 ProtGenerics_1.32.0 [77] scales_1.2.1 png_0.1-8 [79] knitr_1.42 rjson_0.2.21 [81] curl_5.0.0 cachem_1.0.7 [83] stringr_1.5.0 BiocVersion_3.17.1 [85] parallel_4.3.0 vipor_0.4.5 [87] restfulr_0.0.15 pillar_1.9.0 [89] grid_4.3.0 vctrs_0.6.2 [91] promises_1.2.0.1 BiocSingular_1.16.0 [93] dbplyr_2.3.2 beachmat_2.16.0 [95] xtable_1.8-4 cluster_2.1.4 [97] beeswarm_0.4.0 evaluate_0.20 [99] cli_3.6.1 locfit_1.5-9.7 [101] compiler_4.3.0 Rsamtools_2.16.0 [103] rlang_1.1.0 crayon_1.5.2 [105] labeling_0.4.2 ggbeeswarm_0.7.1 [107] stringi_1.7.12 viridisLite_0.4.1 [109] munsell_0.5.0 Biostrings_2.68.0 [111] lazyeval_0.2.2 Matrix_1.5-4 [113] dir.expiry_1.8.0 ExperimentHub_2.8.0 [115] hms_1.1.3 sparseMatrixStats_1.12.0 [117] bit64_4.0.5 Rhdf5lib_1.22.0 [119] KEGGREST_1.40.0 
statmod_1.5.0 [121] shiny_1.7.4 interactiveDisplayBase_1.38.0 [123] highr_0.10 AnnotationHub_3.8.0 [125] igraph_1.4.2 memoise_2.0.1 [127] bslib_0.4.2 bit_4.0.5 "]]
