[["index.html", "Orchestrating Single-Cell Analysis with Bioconductor Welcome", " Orchestrating Single-Cell Analysis with Bioconductor Authors: Robert Amezquita [aut], Aaron Lun [aut, cre], Stephanie Hicks [aut], Raphael Gottardo [aut] Version: 1.0.6 Modified: 2020-11-13 Compiled: 2021-03-24 Environment: R version 4.0.4 (2021-02-15), Bioconductor 3.12 License: CC BY-NC-ND 3.0 US Copyright: Bioconductor, 2020 Source: https://github.com/Bioconductor/OrchestratingSingleCellAnalysis Welcome This is the website for “Orchestrating Single-Cell Analysis with Bioconductor”, a book that teaches users some common workflows for the analysis of single-cell RNA-seq data (scRNA-seq). This book will teach you how to make use of cutting-edge Bioconductor tools to process, analyze, visualize, and explore scRNA-seq data. Additionally, it serves as an online companion for the manuscript “Orchestrating Single-Cell Analysis with Bioconductor”. While we focus here on scRNA-seq data, a newer technology that profiles transcriptomes at the single-cell level, many of the tools, conventions, and analysis strategies utilized throughout this book are broadly applicable to other types of assays. By learning the grammar of Bioconductor workflows, we hope to provide you a starting point for the exploration of your own data, whether it be scRNA-seq or otherwise. This book is organized into three parts. In the Preamble, we introduce the book and dive into resources for learning R and Bioconductor (both at a beginner and developer level). Part I ends with a tutorial for a key data infrastructure, the SingleCellExperiment class, that is used throughout Bioconductor for single-cell analysis and in the subsequent section. The second part, Focus Topics, begins with an overview of the framework for analysis of scRNA-seq data, with deeper dives into specific topics are presented in each subsequent chapter. 
The third part, Workflows, primarily provides code detailing the analysis of various datasets throughout the book. Finally, the Appendix highlights our contributors. If you would like to cite this work, please use the reference “Orchestrating Single-Cell Analysis with Bioconductor”. The book is written in R Markdown with bookdown. OSCA is a collaborative effort, supported by various folks from the Bioconductor team who have contributed workflows, fixes, and improvements. This website is free to use, and is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 License. "],["introduction.html", "Chapter 1 Introduction 1.1 What you will learn 1.2 What you won’t learn 1.3 Who we wrote this for 1.4 Why we wrote this 1.5 Acknowledgements", " Chapter 1 Introduction Bioconductor is an open source, open development software project to provide tools for the analysis and comprehension of high-throughput genomic data. It is based primarily on the R programming language. 1.1 What you will learn The goal of this book is to provide a solid foundation in the usage of Bioconductor tools for single-cell RNA-seq analysis by walking through various steps of typical workflows using example datasets. We strive to tackle key concepts covered in the manuscript, “Orchestrating Single-Cell Analysis with Bioconductor”, with each workflow covering these in varying detail, as well as essential preliminaries that are important for following along with the workflows on your own. 1.1.1 Preliminaries For those unfamiliar with R (and those looking to learn more), we recommend reading the Learning R and More chapter, which first and foremost covers how to get started with R. 
We point to many great online resources for learning R, as well as related tools that are nice to know for bioinformatic analysis. For advanced users, we also point to some extra resources that go beyond the basics. While we provide an extensive list of learning resources for the interested audience in this chapter, we only ask for some familiarity with R before going on to the next section. We then briefly cover getting started with Using R and Bioconductor. Bioconductor, being its own repository, has a unique set of tools, documentation, resources, and practices that benefit from some extra explanation. Data Infrastructure merits a separate chapter because common data containers are an essential part of Bioconductor workflows: they enable interoperability across packages, allowing for “plug and play” usage of cutting-edge tools. Specifically, here we cover the SingleCellExperiment class in depth, as it has become the working standard for Bioconductor-based single-cell analysis packages. Finally, before diving into the various workflows, armed with knowledge about the SingleCellExperiment class, we briefly discuss the datasets that will be used throughout the book in About the Data. 1.1.2 Workflows All workflows begin with data import and subsequent quality control and normalization, going from a raw (count) expression matrix to a clean one. This includes adjusting for experimental factors and possibly even latent factors. Using the clean expression matrix, feature selection strategies can be applied to select the features (genes) driving heterogeneity. Furthermore, these features can then be used to perform dimensionality reduction, which enables downstream analyses that would not otherwise be possible, as well as visualization in 2 or 3 dimensions. From there, the workflows largely focus on differing downstream analyses. 
Clustering details how to segment a scRNA-seq dataset, and differential expression provides a means to determine what drives the differences between different groups of cells. Integrating datasets walks through merging scRNA-seq datasets, an increasingly important task as the number of scRNA-seq datasets continues to grow and comparisons between datasets become necessary. Finally, we touch upon how to work with large-scale data, specifically where it becomes impractical or impossible to work with the data solely in memory. As an added bonus, we dedicate a chapter to interactive visualization, which focuses on using the iSEE package to enable active exploration of a single-cell experiment’s data. 1.2 What you won’t learn The field of bioinformatic analysis is large and filled with many potential trajectories depending on the biological system being studied and the technology being deployed. Here, we only briefly survey some of the many tools available for the analysis of scRNA-seq, focusing on Bioconductor packages. It is impossible to thoroughly review the plethora of tools available through R and Bioconductor for biological analysis in one book, but we hope to provide the means for further exploration on your own. Thus, it goes without saying that you may not learn the optimal workflow for your own data from our examples - while we strive to provide high quality templates, they should be treated as just that: templates from which to extend your own analyses. 1.3 Who we wrote this for We’ve written this book with the interested experimental biologist in mind, and do our best to make few assumptions about previous programming or statistical experience. Likewise, we also welcome more seasoned bioinformaticians who are looking for a starting point from which to dive into single-cell RNA-seq analysis. As such, we welcome any and all feedback for improving this book to help increase accessibility and refine technical details. 
1.4 Why we wrote this This book was conceived in the fall of 2018, as single-cell RNA-seq analysis continued its rise in prominence in the field of biology. With its rapid growth, and the ongoing developments within Bioconductor tailored specifically for scRNA-seq, it became apparent that an update to the Orchestrating high-throughput genomic analysis with Bioconductor paper was necessary for the age of single-cell studies. We strive to highlight the fantastic software by people who call Bioconductor home for their tools, and in the process hope to showcase the Bioconductor community at large in continually pushing forward the field of biological analysis. 1.5 Acknowledgements We would like to thank all Bioconductor contributors for their efforts in creating the definitive leading-edge repository of software for biological analysis. It is truly extraordinary to chart the growth of Bioconductor over the years. We are thankful for the wonderful community of scientists and developers who together make Bioconductor special. We would first and foremost like to thank the Bioconductor core team and the emerging targets subcommittee for commissioning this work, Stephanie Hicks and Raphael Gottardo for their continuous mentorship, and all our contributors to the companion manuscript of this book. We’d also like to thank Garrett Grolemund and Hadley Wickham for their book, R for Data Science, from which we drew stylistic and teaching inspiration. We also thank Levi Waldron and Aaron Lun for advice on the code-related aspects of managing the online version of this book. "],["learning-r-and-more.html", "Chapter 2 Learning R and Bioconductor 2.1 The Benefits of R and Bioconductor 2.2 Learning R Online 2.3 Running R Locally 2.4 Getting Help In (and Out) of R 2.5 Bioconductor Help", " Chapter 2 Learning R and Bioconductor In this chapter, we outline various resources for learning R and Bioconductor. 
We provide a brief set of instructions for installing R on your own machine, and then cover how to get help for functions, packages, and Bioconductor-specific resources for learning more. 2.1 The Benefits of R and Bioconductor R is a high-level programming language that was initially designed for statistical applications. While there is much to be said about R as a programming language, one of the key advantages of using R is that it is highly extensible through packages. Packages are collections of functions, data, and documentation that extend the capabilities of base R. The ease of development and distribution of packages for R has made it a rich environment for many fields of study and application. One of the primary ways in which packages are distributed is through centralized repositories. The first R repository a user typically runs into is the Comprehensive R Archive Network (CRAN), which hosts over 13,000 packages to date, and is home to many of the most popular R packages. Similar to CRAN, Bioconductor is a repository of R packages as well. However, whereas CRAN is a general purpose repository, Bioconductor focuses on software tailored for genomic analysis. Furthermore, Bioconductor has stricter requirements for a package to be accepted into the repository. Of particular interest to us is the inclusion of high quality documentation and the use of common data infrastructure to promote package interoperability. In order to use these packages from CRAN and Bioconductor, and start programming with R to follow along in these workflows, some knowledge of R is helpful. Here we outline resources to guide you through learning the basics. 2.2 Learning R Online To learn more about programming with R, we highly recommend checking out online courses offered by groups such as Codecademy, specifically the Learn R series. Codecademy is completely web-based, with a code editor/console that promotes an interactive learning experience. 
This makes it easy to get started without worrying about installing any software. Beyond just Codecademy, a foundational textbook resource for learning R is the R for Data Science book. This book illustrates R programming through the exploration of various data science concepts - transformation, visualization, exploration, and more. While it primarily focuses on the tidyverse ecosystem of packages, the concepts are translatable to any programming style. 2.3 Running R Locally While learning R through online resources is a great way to start, as it requires minimal setup, at some point it will be desirable to have a local installation of R on your own hardware. This will allow you to install and maintain your own software and code, and furthermore allow you to create a personalized workspace. 2.3.1 Installing R Prior to getting started with this book, some programming experience with R is helpful. Check out the Learning R and More chapter for a list of resources to get started with R and other useful tools for bioinformatic analysis. To follow along with the analysis workflows in this book on your personal computer, it is first necessary to install the R programming language. Additionally, we recommend a graphical user interface such as RStudio for programming in R and visualization. RStudio features many helpful tools, such as code completion and an interactive data viewer, to name but two. Please see the prerequisites section of the online book R for Data Science for more information about installing R and using RStudio. 2.3.1.1 For MacOS/Linux Users A special note for MacOS/Linux users: we highly recommend using a package manager to manage your R installation. This differs across Linux distributions, but for MacOS we highly recommend the Homebrew package manager. 
Follow the website directions to install Homebrew, then install R via the command line with brew install R, and it will automatically configure your installation for you. Upgrading to new R versions can be done by running brew upgrade. 2.3.2 Installing R & Bioconductor Packages After installing R, the next step is to install R packages. In the R console, you can install packages from CRAN via the install.packages() function. In order to install Bioconductor packages, we will first need the BiocManager package, which is hosted on CRAN. This can be done by running: install.packages(\"BiocManager\") The BiocManager package makes it easy to install packages from the Bioconductor repository. For example, to install the SingleCellExperiment package, we run: ## the command below is a one-line shortcut for: ## library(BiocManager) ## install(\"SingleCellExperiment\") BiocManager::install(\"SingleCellExperiment\") Throughout the book, we can load packages via the library() function, which by convention usually comes at the top of scripts to alert readers as to what packages are required. For example, to load the SingleCellExperiment package, we run: library(SingleCellExperiment) Many packages will be referenced throughout the book within the workflows, and similar to the above, can be installed using the BiocManager::install() function. 2.4 Getting Help In (and Out) of R One of the most helpful parts of R is being able to get help inside of R. For example, to get the manual associated with a function, class, dataset, or package, you can prepend the code of interest with a ? to retrieve the relevant help page. For example, to get information about the data.frame() function, the SingleCellExperiment class, the in-built iris dataset, or the BiocManager package, you can type: ?data.frame ?SingleCellExperiment ?iris ?BiocManager Beyond the R console, there are myriad online resources for getting help. 
The R for Data Science book has a great section dedicated to looking for help outside of R. In particular, Stack Overflow’s R tag is a helpful resource for asking and exploring general R programming questions. 2.5 Bioconductor Help One of the key tenets of Bioconductor software that makes it stand out from CRAN is the required documentation of packages and workflows. In addition, Bioconductor hosts a Bioconductor-specific support site that has grown into a valuable resource of its own, thanks to the work of dedicated volunteers. 2.5.1 Bioconductor Packages Each package hosted on Bioconductor has a dedicated page with various resources. As an example, looking at the scater package page on Bioconductor, we see that it contains: a brief description of the package at the top, in addition to the authors, maintainer, and an associated citation installation instructions that can be copied and pasted into your R console documentation - vignettes, reference manual, news Here, the most important information comes from the documentation section. Every package in Bioconductor is required to be submitted with a vignette - a document showcasing the basic functionality of the package. Typically, these vignettes have a descriptive title that summarizes the main objective of the vignette. These vignettes are a great resource for learning how to operate the essential functionality of the package. The reference manual contains a comprehensive listing of all the functions available in the package. This is a compilation of each function’s manual, a.k.a. help pages, which can be accessed programmatically in the R console via ?<function>. Finally, the NEWS file contains notes from the authors which highlight changes across different versions of the package. This is a great way of tracking changes, especially functions that are added, removed, or deprecated, in order to keep your scripts current with new versions of dependent packages. 
Below this, the Details section covers finer nuances of the package, mostly relating to its relationship to other packages: upstream dependencies (Depends, Imports, Suggests fields): packages that are imported upon loading the given package downstream dependencies (Depends On Me, Imports Me, Suggests Me): packages that import the given package when loaded For example, we can see that an entry called simpleSingle in the Depends On Me field on the scater page takes us to a step-by-step workflow for low-level analysis of single-cell RNA-seq data. One additional Details entry, the biocViews, is helpful for looking at how the authors annotate their package. For example, for the scater package, we see that it is associated with DataImport, DimensionReduction, GeneExpression, RNASeq, and SingleCell, to name but some of its many annotations. We cover biocViews in more detail in the next section. 2.5.2 biocViews To find packages via the Bioconductor website, one useful resource is the BiocViews page, which provides a hierarchically organized view of annotations associated with Bioconductor packages. Under the “Software” label for example (which comprises most of the Bioconductor packages), there exist many different views to explore packages. For example, we can inspect based on the associated “Technology”, explore “Sequencing”-associated packages, and furthermore subset based on “RNASeq”. Another area of particular interest is the “Workflow” view, which provides Bioconductor packages that illustrate an analytical workflow. For example, the “SingleCellWorkflow” contains the aforementioned tutorial, encapsulated in the simpleSingleCell package. 2.5.3 Bioconductor Forums The Bioconductor support site is a Stack Overflow-style question-and-answer forum that is actively contributed to by both users and package developers. Thanks to the work of dedicated volunteers, there are ample questions to explore to learn more about Bioconductor-specific workflows. 
Another way to connect with the Bioconductor community is through Slack, which hosts various channels dedicated to packages and workflows. The Bioc-community Slack is a great way to stay in the loop on the latest developments happening across Bioconductor, and we recommend exploring the “Channels” section to find topics of interest. "],["beyond-r-basics.html", "Chapter 3 Beyond R Basics 3.1 Becoming an R Expert 3.2 Becoming an R/Bioconductor Developer 3.3 Nice Companions for R", " Chapter 3 Beyond R Basics Here we briefly outline resources for taking your R programming to the next level, including resources for learning about package development. We also outline some companions to R that are good to know not only for package development, but also for running your own bioinformatic pipelines, enabling you to use a broader array of tools to go from raw data to preprocessed data before working in R. 3.1 Becoming an R Expert A deeper dive into the finer details of the R programming language is provided by the book Advanced R. While targeted at more experienced R users and programmers, this book represents a comprehensive compendium of more advanced concepts, and touches on some of the paradigms used extensively by developers throughout Bioconductor, specifically programming with S4. Eventually, you’ll reach the point where you have your own collection of functions and datasets, and where you will be writing your own packages. Luckily, there’s a guide for just that, with the book R Packages. Packages are great even if just for personal use, and of course, with some polishing, can eventually become available on CRAN or Bioconductor. 
Furthermore, they are also a great way of putting together code associated with a manuscript, promoting reproducible, accessible computing practices, something we all strive for in our work. For many of the little details that are oft forgotten when learning R, the aptly named What They Forgot to Teach You About R is a great read, covering little things such as file naming, maintaining an R installation, and reproducible analysis habits. We save the most intriguing resource for last - another book for those on the road to becoming an R expert is R Inferno, which dives into many of the unique quirks of R. Warning: this book goes very deep into the painstaking details of R. 3.2 Becoming an R/Bioconductor Developer While learning to use Bioconductor tools is a very welcoming experience, unfortunately there is no central resource for navigating the plethora of gotchas and paradigms associated with developing for Bioconductor. Based on conversations with folks involved in developing for Bioconductor, much of this knowledge is hard won and fairly spread out. This, however, is beginning to change with more recent efforts led by the Bioconductor team, and while this book represents an earnest effort towards addressing the user perspective, it is currently out of scope to include a deep dive into the developer side. For those looking to get started with developing packages for Bioconductor, it is important to first become acquainted with developing standalone R packages. To this end, the R Packages book provides a deep dive into the details of constructing your own package, as well as details regarding submission of a package to CRAN. With that, one resource worth looking into to get started is the BiocWorkshops repository under the Bioconductor GitHub organization, which provides a book composed of workshops that have been hosted by Bioconductor team members and contributors. 
These workshops center around learning, using, and developing for Bioconductor. A host of topics are also available via the Learn module on the Bioconductor website. Finally, the Bioconductor developers portal contains a bevy of individual resources and guides for experienced R developers. 3.3 Nice Companions for R While not essential for our purposes, many bioinformatic tools for processing raw sequencing data require knowledge beyond just R to install, run, and import their results into R for further analysis. The most important of these is basic knowledge of the Shell/Bash utilities, for working with bioinformatic pipelines and troubleshooting (R package) installation issues. Additionally, for working with packages or software that are still in development and not hosted on an official repository like CRAN or Bioconductor, knowledge of Git - a version control system - and the popular GitHub hosting service is helpful. This enables you to not only work with other people’s code, but also better manage your own code to keep track of changes. 3.3.1 Shell/Bash DataCamp and other interactive online resources such as Codecademy are great places to learn some of these extra skills. We highly recommend learning Shell/Bash, as it is the starting point for most bioinformatic processing pipelines. 3.3.2 Git We would recommend learning Git next, a version control system which underlies the popular GitHub hosting service, where many of the most popular open source tools are hosted. Learning Git is essential not only for keeping track of your own code, but also for using, managing, and contributing to open source software projects. For a more R-centric look at using Git (and GitHub), we highly recommend checking out Happy Git and GitHub for the useR. 
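As a taste of the basic Git cycle described above, a minimal command-line session might look like the following (a sketch; the directory and file names are hypothetical):

```shell
# Start a fresh repository in a scratch directory and record one change.
rm -rf /tmp/git-demo && mkdir -p /tmp/git-demo && cd /tmp/git-demo
git init -q .                               # begin tracking this directory
git config user.email "you@example.com"     # identity recorded with commits
git config user.name "Demo User"
echo 'analysis <- function(x) mean(x)' > analysis.R
git add analysis.R                          # stage the new file
git commit -q -m "Add analysis function"    # record a snapshot
git log --oneline                           # one line per recorded commit
```

From here, git push would publish the recorded commits to a hosting service such as GitHub; Happy Git and GitHub for the useR walks through that setup in detail.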
3.3.3 Other Languages A frequent question that comes up is “What else should I learn besides R?” Firstly, we believe that honing your R skills comes first and foremost; beyond just R, the Shell/Bash and Git skills covered in the Nice Companions for R section are already a great start. For those just getting started, these skills should become comfortable in practice before moving on. However, there are indeed benefits to going beyond just R. At a basic level, learning other programming languages helps broaden one’s perspective - similar to learning multiple spoken or written languages, learning about other programming languages (even if only in a cursory manner) helps one identify broader patterns that may be applicable across languages. At an applied level, developments within and outside of R have made multi-lingual setups and teams friendlier than ever before, enabling the use of the best tool for the job at hand. For example, Python is another popular language used in data science and in a broader array of applications as well. R now supports a native Python interface via the reticulate package, enabling access to tools developed originally in Python such as the popular TensorFlow framework for machine learning applications. C++ is frequently used natively in R as well via Rcpp in packages to massively accelerate computations. Finally, multiple languages are supported in code documents and reports through R Markdown. "],["data-infrastructure.html", "Chapter 4 Data Infrastructure 4.1 Background 4.2 Storing primary experimental data 4.3 Handling metadata 4.4 Single-cell-specific fields 4.5 Conclusion", " Chapter 4 Data Infrastructure 4.1 Background One of the main strengths of the Bioconductor project lies in the use of a common data infrastructure that powers interoperability across packages. Users should be able to analyze their data using functions from different Bioconductor packages without the need to convert between formats. 
To this end, the SingleCellExperiment class (from the SingleCellExperiment package) serves as the common currency for data exchange across 70+ single-cell-related Bioconductor packages. This class implements a data structure that stores all aspects of our single-cell data - gene-by-cell expression data, per-cell metadata and per-gene annotation (Figure 4.1) - and allows them to be manipulated in a synchronized manner. Figure 4.1: Overview of the structure of the SingleCellExperiment class. Each row of the assays corresponds to a row of the rowData (pink shading), while each column of the assays corresponds to a column of the colData and reducedDims (yellow shading). The SingleCellExperiment package is implicitly installed and loaded when using any package that depends on the SingleCellExperiment class, but it can also be explicitly installed (and loaded) as follows: BiocManager::install('SingleCellExperiment') Additionally, we use some functions from the scater and scran packages, as well as the CRAN package uwot (which conveniently can also be installed through BiocManager::install). These functions will be accessed through the <package>::<function> convention as needed. BiocManager::install(c('scater', 'scran', 'uwot')) We then load the SingleCellExperiment package into our R session. This avoids the need to prefix our function calls with ::, especially for packages that are heavily used throughout a workflow. library(SingleCellExperiment) Each piece of (meta)data in the SingleCellExperiment is represented by a separate “slot”. (This terminology comes from the S4 class system, but that’s not important right now.) If we imagine the SingleCellExperiment object to be a cargo ship, the slots can be thought of as individual cargo boxes with different contents, e.g., certain slots expect numeric matrices whereas others may expect data frames. 
In the rest of this chapter, we will discuss the available slots, their expected formats, and how we can interact with them. More experienced readers may note the similarity with the SummarizedExperiment class, and if you are such a reader, you may wish to jump directly to the end of this chapter for the single-cell-specific aspects of this class. 4.2 Storing primary experimental data 4.2.1 Filling the assays slot To construct a rudimentary SingleCellExperiment object, we only need to fill the assays slot. This contains primary data such as a matrix of sequencing counts where rows correspond to features (genes) and columns correspond to samples (cells) (Figure 4.1, blue box). Let’s start simple by generating three cells’ worth of count data across ten genes: counts_matrix <- data.frame(cell_1 = rpois(10, 10), cell_2 = rpois(10, 10), cell_3 = rpois(10, 30)) rownames(counts_matrix) <- paste0(\"gene_\", 1:10) counts_matrix <- as.matrix(counts_matrix) # must be a matrix object! From this, we can now construct our first SingleCellExperiment object using the SingleCellExperiment() function. Note that we provide our data as a named list where each entry of the list is a matrix. Here, we name the counts_matrix entry as simply \"counts\". sce <- SingleCellExperiment(assays = list(counts = counts_matrix)) To inspect the object, we can simply type sce into the console to see some pertinent information, which will display an overview of the various slots available to us (which may or may not have any data). sce ## class: SingleCellExperiment ## dim: 10 3 ## metadata(0): ## assays(1): counts ## rownames(10): gene_1 gene_2 ... gene_9 gene_10 ## rowData names(0): ## colnames(3): cell_1 cell_2 cell_3 ## colData names(0): ## reducedDimNames(0): ## altExpNames(0): To access the count data we just supplied, we can do any one of the following: assay(sce, \"counts\") - this is the most general method, where we can supply the name of the assay as the second argument. 
counts(sce) - this is a short-cut for the above, but only works for assays with the special name \"counts\". counts(sce) ## cell_1 cell_2 cell_3 ## gene_1 8 16 25 ## gene_2 9 13 30 ## gene_3 9 13 27 ## gene_4 12 8 35 ## gene_5 12 12 27 ## gene_6 14 11 32 ## gene_7 11 8 34 ## gene_8 6 11 30 ## gene_9 6 18 26 ## gene_10 7 13 23 4.2.2 Adding more assays What makes the assays slot especially powerful is that it can hold multiple representations of the primary data. This is especially useful for storing the raw count matrix as well as a normalized version of the data. We can do just that as shown below, using the scater package to compute a normalized and log-transformed representation of the initial primary data. sce <- scater::logNormCounts(sce) Note that, at each step, we overwrite our previous sce by reassigning the results back to sce. This is possible because these particular functions return a SingleCellExperiment object that contains the results in addition to the original data. (Some functions - especially those outside of single-cell oriented Bioconductor packages - do not, in which case you will need to append your results to the sce object - see below for an example.) Viewing the object again, we see that these functions added some new entries: sce ## class: SingleCellExperiment ## dim: 10 3 ## metadata(0): ## assays(2): counts logcounts ## rownames(10): gene_1 gene_2 ... gene_9 gene_10 ## rowData names(0): ## colnames(3): cell_1 cell_2 cell_3 ## colData names(1): sizeFactor ## reducedDimNames(0): ## altExpNames(0): Specifically, we see that the assays slot has grown to contain two entries: \"counts\" (our initial data) and \"logcounts\" (the log-transformed normalized data). Similar to \"counts\", the \"logcounts\" assay can be conveniently accessed using logcounts(sce), although the longhand version works just as well. 
logcounts(sce)
## cell_1 cell_2 cell_3
## gene_1 3.940600 4.519817 3.962599
## gene_2 4.100047 4.234697 4.210128
## gene_3 4.100047 4.234697 4.066760
## gene_4 4.493898 3.581374 4.421342
## gene_5 4.493898 4.125592 4.066760
## gene_6 4.707114 4.007555 4.298357
## gene_7 4.374176 3.581374 4.381501
## gene_8 3.556547 4.007555 4.210128
## gene_9 3.556547 4.682738 4.015619
## gene_10 3.761315 4.234697 3.850329

To look at all the available assays within sce, we can use the assays() accessor. By comparison, assay() only returns a single assay of interest.

assays(sce)
## List of length 2
## names(2): counts logcounts

While the functions above automatically add assays to our sce object, there may be cases where we want to perform our own calculations and save the result into the assays slot. This is often necessary when using functions that do not return a SingleCellExperiment object. To illustrate, let's append a new version of the data that has been offset by adding 100 to all values.

counts_100 <- counts(sce) + 100
assay(sce, "counts_100") <- counts_100 # assign a new entry to the assays slot
assays(sce) # new assay has now been added.
## List of length 3
## names(3): counts logcounts counts_100

4.3 Handling metadata

4.3.1 On the columns

To further annotate our SingleCellExperiment object, we can add metadata to describe the columns of our primary data, e.g., the samples or cells of our experiment. This data is entered into the colData slot, a data.frame or DataFrame object where rows correspond to cells and columns correspond to metadata fields, e.g., batch of origin, treatment condition (Figure 4.1, orange box). Let's come up with some metadata for the cells, starting with a batch variable where cells 1 and 2 are in batch 1 and cell 3 is from batch 2.
cell_metadata <- data.frame(batch = c(1, 1, 2))
rownames(cell_metadata) <- paste0("cell_", 1:3)

Now, we can take two approaches - either append cell_metadata to our existing sce, or start from scratch via the SingleCellExperiment() constructor. We'll start from scratch for now:

sce <- SingleCellExperiment(assays = list(counts = counts_matrix),
    colData = cell_metadata)

Similar to assays, we can see that our colData is now populated:

sce
## class: SingleCellExperiment 
## dim: 10 3
## metadata(0):
## assays(1): counts
## rownames(10): gene_1 gene_2 ... gene_9 gene_10
## rowData names(0):
## colnames(3): cell_1 cell_2 cell_3
## colData names(1): batch
## reducedDimNames(0):
## altExpNames(0):

We can access our column data with the colData() function:

colData(sce)
## DataFrame with 3 rows and 1 column
## batch
## <numeric>
## cell_1 1
## cell_2 1
## cell_3 2

Or even more simply, we can extract a single field using the $ shortcut:

sce$batch
## [1] 1 1 2

Some functions automatically add column metadata by returning a SingleCellExperiment with extra fields in the colData slot. For example, the scater package contains the addPerCellQC() function that appends a number of quality control metrics. Here, we show colData(sce) with the quality control metrics appended to it.

sce <- scater::addPerCellQC(sce)
colData(sce)
## DataFrame with 3 rows and 4 columns
## batch sum detected total
## <numeric> <numeric> <numeric> <numeric>
## cell_1 1 94 10 94
## cell_2 1 123 10 123
## cell_3 2 289 10 289

Alternatively, we might want to manually add more fields to the column metadata:

sce$more_stuff <- runif(ncol(sce))
colnames(colData(sce))
## [1] "batch" "sum" "detected" "total" "more_stuff"

A common operation with colData is to use its values for subsetting. For example, if we only wanted cells within batch 1, we could subset our sce object as shown below.
(Remember, we subset on the columns in this case because we are filtering by cells/samples here.)

sce[, sce$batch == 1]
## class: SingleCellExperiment 
## dim: 10 2
## metadata(0):
## assays(1): counts
## rownames(10): gene_1 gene_2 ... gene_9 gene_10
## rowData names(0):
## colnames(2): cell_1 cell_2
## colData names(5): batch sum detected total more_stuff
## reducedDimNames(0):
## altExpNames(0):

4.3.2 On the rows

To store feature-level annotation, the SingleCellExperiment has the rowData slot, containing a DataFrame where each row corresponds to a gene and contains annotations like the transcript length or gene symbol. Furthermore, there is a special rowRanges slot to hold genomic coordinates in the form of a GRanges or GRangesList. This describes the chromosome, start, and end coordinates of the features (genes, genomic regions) in a manner that is easy to query and manipulate via the GenomicRanges framework. Both of these slots can be accessed via their respective accessors, rowRanges() and rowData(). In our case, rowRanges(sce) produces an empty list because we did not fill it with any coordinate information.

rowRanges(sce) # empty
## GRangesList object of length 10:
## $gene_1
## GRanges object with 0 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## -------
## seqinfo: no sequences
##
## $gene_2
## GRanges object with 0 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## -------
## seqinfo: no sequences
##
## $gene_3
## GRanges object with 0 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## -------
## seqinfo: no sequences
##
## ...
## <7 more elements>

Currently the rowData slot is also empty.
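To make the rowRanges slot concrete, here is a small hand-made sketch that attaches invented coordinates to our ten toy genes. The coordinates and chromosome name are entirely fabricated for illustration - real analyses would pull them from an annotation resource:

```r
library(GenomicRanges)
# Invented coordinates for illustration only: ten 100-bp "genes" on chr1.
toy_ranges <- GRanges(seqnames = "chr1",
                      ranges = IRanges(start = seq(1, 1000, length.out = 10),
                                       width = 100))
names(toy_ranges) <- paste0("gene_", 1:10)
# The replacement vector must have one range per row of the object.
rowRanges(sce) <- toy_ranges
```

After this assignment, rowRanges(sce) would return a populated GRanges instead of the empty list shown above.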
However, analogous to our call to addPerCellQC() in the prior section, the addPerFeatureQC() function will insert values into the rowData slot of our sce object:

sce <- scater::addPerFeatureQC(sce)
rowData(sce)
## DataFrame with 10 rows and 2 columns
## mean detected
## <numeric> <numeric>
## gene_1 16.3333 100
## gene_2 17.3333 100
## gene_3 16.3333 100
## gene_4 18.3333 100
## gene_5 17.0000 100
## gene_6 19.0000 100
## gene_7 17.6667 100
## gene_8 15.6667 100
## gene_9 16.6667 100
## gene_10 14.3333 100

In a similar fashion to the colData slot, such feature metadata could be provided at the outset when creating the SingleCellExperiment object. Exactly how this is done depends on the organism and the annotation available during alignment and quantification; for example, given Ensembl identifiers, we might use AnnotationHub resources to pull down an Ensembl annotation object and extract the gene bodies to store in the rowRanges of our SingleCellExperiment.

library(AnnotationHub)
edb <- AnnotationHub()[["AH73881"]] # Human, Ensembl v97.
genes(edb)[,2]
## GRanges object with 67667 ranges and 1 metadata column:
## seqnames ranges strand | gene_name
## <Rle> <IRanges> <Rle> | <character>
## ENSG00000223972 1 11869-14409 + | DDX11L1
## ENSG00000227232 1 14404-29570 - | WASH7P
## ENSG00000278267 1 17369-17436 - | MIR6859-1
## ENSG00000243485 1 29554-31109 + | MIR1302-2HG
## ENSG00000284332 1 30366-30503 + | MIR1302-2
## ... ... ... ... . ...
## ENSG00000224240 Y 26549425-26549743 + | CYCSP49
## ENSG00000227629 Y 26586642-26591601 - | SLC25A15P1
## ENSG00000237917 Y 26594851-26634652 - | PARP4P1
## ENSG00000231514 Y 26626520-26627159 - | CCNQP2
## ENSG00000235857 Y 56855244-56855488 + | CTBP2P1
## -------
## seqinfo: 424 sequences from GRCh38 genome

To subset a SingleCellExperiment object at the feature/gene level, we can do a row subsetting operation similar to other R objects, by supplying either numeric indices or a vector of names:

sce[c("gene_1", "gene_4"), ]
## class: SingleCellExperiment 
## dim: 2 3
## metadata(0):
## assays(1): counts
## rownames(2): gene_1 gene_4
## rowData names(2): mean detected
## colnames(3): cell_1 cell_2 cell_3
## colData names(5): batch sum detected total more_stuff
## reducedDimNames(0):
## altExpNames(0):

sce[c(1, 4), ] # same as above in this case
## class: SingleCellExperiment 
## dim: 2 3
## metadata(0):
## assays(1): counts
## rownames(2): gene_1 gene_4
## rowData names(2): mean detected
## colnames(3): cell_1 cell_2 cell_3
## colData names(5): batch sum detected total more_stuff
## reducedDimNames(0):
## altExpNames(0):

4.3.3 Other metadata

Some analyses generate results or annotations that do not fit into the aforementioned slots, e.g., study metadata. Thankfully, there is a slot just for this type of messy data - the metadata slot, a named list of entries where each entry in the list can be anything you want it to be. For example, say we have some favorite genes (e.g., highly variable genes) that we want to store inside sce for use in our analysis at a later point.
We can do this simply by appending to the metadata slot as follows:

my_genes <- c("gene_1", "gene_5")
metadata(sce) <- list(favorite_genes = my_genes)
metadata(sce)
## $favorite_genes
## [1] "gene_1" "gene_5"

Similarly, we can append more information via the $ operator:

your_genes <- c("gene_4", "gene_8")
metadata(sce)$your_genes <- your_genes
metadata(sce)
## $favorite_genes
## [1] "gene_1" "gene_5"
##
## $your_genes
## [1] "gene_4" "gene_8"

4.4 Single-cell-specific fields

4.4.1 Background

So far, we have covered the assays (primary data), colData (cell metadata), rowData/rowRanges (feature metadata), and metadata slots (other) of the SingleCellExperiment class. These slots are actually inherited from the SummarizedExperiment parent class (see here for details), so any method that works on a SummarizedExperiment will also work on a SingleCellExperiment object. But why do we need a separate SingleCellExperiment class? This is motivated by the desire to streamline some single-cell-specific operations, which we discuss in the rest of this section.

4.4.2 Dimensionality reduction results

The reducedDims slot is specially designed to store reduced dimensionality representations of the primary data obtained by methods such as PCA and \(t\)-SNE (see Chapter 9 for more details). This slot contains a list of numeric matrices of low-dimensional representations of the primary data, where the rows represent the columns of the primary data (i.e., the cells) and the columns represent the dimensions. As this slot holds a list, we can store multiple PCA/\(t\)-SNE/etc. results for the same dataset. In our example, we can calculate a PCA representation of our data using the runPCA() function from scater. We see that sce now shows a new reducedDim entry that can be retrieved with the accessor reducedDim().
sce <- scater::logNormCounts(sce)
sce <- scater::runPCA(sce)
reducedDim(sce, "PCA")
## PC1 PC2
## cell_1 -0.8778588 -0.3323026
## cell_2 1.1512369 -0.1409999
## cell_3 -0.2733781 0.4733024
## attr(,"varExplained")
## [1] 1.0853591 0.1771606
## attr(,"percentVar")
## [1] 85.9677 14.0323
## attr(,"rotation")
## PC1 PC2
## gene_9 0.53945842 0.16506870
## gene_4 -0.47481793 0.26621235
## gene_7 -0.42138376 0.32527437
## gene_6 -0.31953155 -0.26763287
## gene_8 0.15688029 0.69357794
## gene_1 0.30441645 -0.20110929
## gene_10 0.23984703 -0.06947464
## gene_5 -0.14153714 -0.42400672
## gene_3 0.07560372 -0.09804875
## gene_2 0.05754807 0.09346258

We can also calculate a \(t\)-SNE representation using the scater package function runTSNE():

sce <- scater::runTSNE(sce, perplexity = 0.1)
## Perplexity should be lower than K!
reducedDim(sce, "TSNE")
## [,1] [,2]
## cell_1 -5077.5985 -2558.224
## cell_2 4761.2737 -3124.250
## cell_3 316.3247 5682.474

We can view the names of all entries in the reducedDims slot via the accessor reducedDims(). Note that this is plural and returns a list of all results, whereas reducedDim() only returns a single result.

reducedDims(sce)
## List of length 2
## names(2): PCA TSNE

We can also manually add content to the reducedDims() slot, much like how we added matrices to the assays slot previously. To illustrate, we run the umap() function directly from the uwot package to generate a matrix of UMAP coordinates that is added to the reducedDims of our sce object. (In practice, scater has a runUMAP() wrapper function that adds the results for us, but we will manually call umap() here for demonstration purposes.)

u <- uwot::umap(t(logcounts(sce)), n_neighbors = 2)
reducedDim(sce, "UMAP_uwot") <- u
reducedDims(sce) # Now stored in the object.
## List of length 3
## names(3): PCA TSNE UMAP_uwot

reducedDim(sce, "UMAP_uwot")
## [,1] [,2]
## cell_1 -0.762756 -0.7608464
## cell_2 0.140951 0.1457694
## cell_3 0.621805 0.6150770
## attr(,"scaled:center")
## [1] -5.317163 3.390621

4.4.3 Alternative Experiments

The SingleCellExperiment class provides the concept of "alternative Experiments", where we have data for a distinct set of features but the same set of samples/cells. The classic application is to store per-cell counts for spike-in transcripts; this allows us to retain this data for downstream use but separate it from the assays holding the counts for endogenous genes. The separation is particularly important as such alternative features often need to be processed separately - see Chapter 20 for examples with antibody-derived tags. If we have data for alternative feature sets, we can store it in our SingleCellExperiment as an alternative Experiment. For example, if we have some data for spike-in transcripts, we first create a separate SummarizedExperiment object:

spike_counts <- cbind(cell_1 = rpois(5, 10),
    cell_2 = rpois(5, 10),
    cell_3 = rpois(5, 30))
rownames(spike_counts) <- paste0("spike_", 1:5)
spike_se <- SummarizedExperiment(list(counts=spike_counts))
spike_se
## class: SummarizedExperiment
## dim: 5 3
## metadata(0):
## assays(1): counts
## rownames(5): spike_1 spike_2 spike_3 spike_4 spike_5
## rowData names(0):
## colnames(3): cell_1 cell_2 cell_3
## colData names(0):

Then we store this SummarizedExperiment in our sce object via the altExp() setter. Like assays() and reducedDims(), we can retrieve all of the available alternative Experiments with altExps().

altExp(sce, "spike") <- spike_se
altExps(sce)
## List of length 1
## names(1): spike

The alternative Experiment concept ensures that all relevant aspects of a single-cell dataset can be held in a single object.
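Once stored, an alternative Experiment can be pulled back out for its own processing; a minimal sketch, assuming the sce and "spike" alternative Experiment created above:

```r
# Retrieve the alternative Experiment and access its assays as usual.
spike_data <- altExp(sce, "spike")
assay(spike_data, "counts")  # the 5-by-3 spike-in count matrix
# swapAltExp() can promote an alternative Experiment to the main Experiment,
# e.g., if we wanted to run a workflow on the spike-ins directly.
```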
It is also convenient as it ensures that our spike-in data is synchronized with the data for the endogenous genes. For example, if we subsetted sce, the spike-in data would be subsetted to match:

sub <- sce[,1:2] # retain only two samples.
altExp(sub, "spike")
## class: SummarizedExperiment
## dim: 5 2
## metadata(0):
## assays(1): counts
## rownames(5): spike_1 spike_2 spike_3 spike_4 spike_5
## rowData names(0):
## colnames(2): cell_1 cell_2
## colData names(0):

Any SummarizedExperiment object can be stored as an alternative Experiment, including another SingleCellExperiment! This allows power users to perform tricks like those described in Section 8.5.

4.4.4 Size factors

The sizeFactors() function allows us to get or set a numeric vector of per-cell scaling factors used for normalization (see Chapter 7 for more details). This is typically added automatically by normalization functions, as shown below for scran's deconvolution-based size factors:

sce <- scran::computeSumFactors(sce)
sizeFactors(sce)
## [1] 0.5573123 0.7292490 1.7134387

Alternatively, we can manually add the size factors, as shown below for library size-derived factors:

sizeFactors(sce) <- scater::librarySizeFactors(sce)
sizeFactors(sce)
## cell_1 cell_2 cell_3
## 0.5573123 0.7292490 1.7134387

Technically speaking, the size factor concept is not unique to single-cell analyses. Nonetheless, we mention it here as it is an extension beyond what is available in the SummarizedExperiment parent class.

4.4.5 Column labels

The colLabels() function allows us to get or set a vector or factor of per-cell labels, typically corresponding to groupings assigned by unsupervised clustering (see Chapter 10) or predicted cell type identities from classification algorithms (Chapter 12).
colLabels(sce) <- LETTERS[1:3]
colLabels(sce)
## [1] "A" "B" "C"

This is a convenient field to set, as several functions (e.g., scran::findMarkers) will attempt to retrieve the labels automatically via colLabels(). We can thus avoid the few extra keystrokes that would otherwise be necessary to specify, say, the cluster assignments in the function call.

4.5 Conclusion

The widespread use of the SingleCellExperiment class provides the foundation for interoperability between single-cell-related packages in the Bioconductor ecosystem. SingleCellExperiment objects generated by one package can be used as input to another package, encouraging synergies that enable our analysis to be greater than the sum of its parts. Each step of the analysis will also add new entries to the assays, colData, reducedDims, etc., meaning that the final SingleCellExperiment object effectively serves as a self-contained record of the analysis. This is convenient as the object can be saved for future use or transferred to collaborators for further analysis. Thus, for the rest of this book, we will be using the SingleCellExperiment as our basic data structure.

Chapter 5 Overview

5.1 Introduction

This chapter provides an overview of the framework of a typical scRNA-seq analysis workflow (Figure 5.1). Subsequent chapters will describe each analysis step in more detail.

Figure 5.1: Schematic of a typical scRNA-seq analysis workflow.
Each stage (separated by dashed lines) consists of a number of specific steps, many of which operate on and modify a SingleCellExperiment instance.

5.2 Experimental Design

Before starting the analysis itself, some comments on experimental design may be helpful. The most obvious question is the choice of technology, which can be roughly divided into:

Droplet-based: 10X Genomics, inDrop, Drop-seq
Plate-based with unique molecular identifiers (UMIs): CEL-seq, MARS-seq
Plate-based with reads: Smart-seq2
Other: sci-RNA-seq, Seq-Well

Each of these methods has its own advantages and weaknesses that are discussed extensively elsewhere (Mereu et al. 2019; Ziegenhain et al. 2017). In practical terms, droplet-based technologies are the current de facto standard due to their throughput and low cost per cell. Plate-based methods can capture other phenotypic information (e.g., morphology) and are more amenable to customization. Read-based methods provide whole-transcript coverage, which is useful in some applications (e.g., splicing, exome mutations); otherwise, UMI-based methods are more popular as they mitigate the effects of PCR amplification noise. The choice of method is left to the reader's circumstances - we will simply note that most aspects of our analysis pipeline are technology-agnostic.

The next question is how many cells should be captured, and to what depth they should be sequenced. The short answer is "as much as you can afford to spend". The long answer is that it depends on the aim of the analysis. If we are aiming to discover rare cell subpopulations, then we need more cells. If we are aiming to characterize subtle differences, then we need more sequencing depth. At the time of writing, an informal survey of the literature suggests that typical droplet-based experiments capture anywhere from 10,000 to 100,000 cells, sequenced at anywhere from 1,000 to 10,000 UMIs per cell (usually in inverse proportion to the number of cells).
Droplet-based methods also have a trade-off between throughput and doublet rate that affects the true efficiency of sequencing. For studies involving multiple samples or conditions, the design considerations are the same as those for bulk RNA-seq experiments. There should be multiple biological replicates for each condition, and conditions should not be confounded with batch. Note that individual cells are not replicates; rather, we are referring to samples derived from replicate donors or cultures.

5.3 Obtaining a count matrix

Sequencing data from scRNA-seq experiments must be converted into a matrix of expression values that can be used for statistical analysis. Given the discrete nature of sequencing data, this is usually a count matrix containing the number of UMIs or reads mapped to each gene in each cell. The exact procedure for quantifying expression tends to be technology-dependent:

For 10X Genomics data, the CellRanger software suite provides a custom pipeline to obtain a count matrix. This uses STAR to align reads to the reference genome and then counts the number of unique UMIs mapped to each gene.
Pseudo-alignment methods such as alevin can be used to obtain a count matrix from the same data with greater efficiency. This avoids the need for explicit alignment, which reduces compute time and memory usage.
For other highly multiplexed protocols, the scPipe package provides a more general pipeline for processing scRNA-seq data. This uses the Rsubread aligner to align reads and then counts UMIs per gene.
For CEL-seq or CEL-seq2 data, the scruff package provides a dedicated pipeline for quantification.
For read-based protocols, we can generally re-use the same pipelines as for processing bulk RNA-seq data.
For any data involving spike-in transcripts, the spike-in sequences should be included as part of the reference genome during alignment and quantification.

After quantification, we import the count matrix into R and create a SingleCellExperiment object.
This can be done with base methods (e.g., read.table()) followed by applying the SingleCellExperiment() constructor. Alternatively, for specific file formats, we can use dedicated methods from the DropletUtils package (for 10X data) or the tximport/tximeta packages (for pseudo-alignment methods). Depending on the origin of the data, this requires some vigilance:

Some feature-counting tools will report mapping statistics in the count matrix (e.g., the number of unaligned or unassigned reads). While these values can be useful for quality control, they would be misleading if treated as gene expression values. Thus, they should be removed (or at least moved to the colData) prior to further analyses.
Be careful of using the ^ERCC regular expression to detect spike-in rows in human data where the row names of the count matrix are gene symbols. An ERCC gene family actually exists in human annotation, so this would result in incorrect identification of genes as spike-in transcripts. This problem can be avoided by using count matrices with standard identifiers (e.g., Ensembl, Entrez).

5.4 Data processing and downstream analysis

In the simplest case, the workflow has the following form:

We compute quality control metrics to remove low-quality cells that would interfere with downstream analyses. These cells may have been damaged during processing or may not have been fully captured by the sequencing protocol. Common metrics include the total counts per cell, the proportion of spike-in or mitochondrial reads, and the number of detected features.
We convert the counts into normalized expression values to eliminate cell-specific biases (e.g., in capture efficiency). This allows us to perform explicit comparisons across cells in downstream steps like clustering. We also apply a transformation, typically log, to adjust for the mean-variance relationship.
We perform feature selection to pick a subset of interesting features for downstream analysis.
This is done by modelling the variance across cells for each gene and retaining genes that are highly variable. The aim is to reduce computational overhead and noise from uninteresting genes.
We apply dimensionality reduction to compact the data and further reduce noise. Principal components analysis is typically used to obtain an initial low-rank representation for further computational work, followed by more aggressive methods like \(t\)-stochastic neighbor embedding for visualization purposes.
We cluster cells into groups according to similarities in their (normalized) expression profiles. This aims to obtain groupings that serve as empirical proxies for distinct biological states. We typically interpret these groupings by identifying differentially expressed marker genes between clusters.

Additional steps such as data integration and cell annotation will be discussed in their respective chapters.

5.5 Quick start

Here, we use a droplet-based retina dataset from Macosko et al. (2015), provided in the scRNAseq package. This starts from a count matrix and finishes with clusters (Figure 5.2) in preparation for biological interpretation. Similar workflows are available in abbreviated form in the Workflows part of the book.

library(scRNAseq)
sce <- MacoskoRetinaData()

# Quality control.
library(scater)
is.mito <- grepl("^MT-", rownames(sce))
qcstats <- perCellQCMetrics(sce, subsets=list(Mito=is.mito))
filtered <- quickPerCellQC(qcstats, percent_subsets="subsets_Mito_percent")
sce <- sce[, !filtered$discard]

# Normalization.
sce <- logNormCounts(sce)

# Feature selection.
library(scran)
dec <- modelGeneVar(sce)
hvg <- getTopHVGs(dec, prop=0.1)

# Dimensionality reduction.
set.seed(1234)
sce <- runPCA(sce, ncomponents=25, subset_row=hvg)
sce <- runUMAP(sce, dimred = 'PCA', external_neighbors=TRUE)

# Clustering.
g <- buildSNNGraph(sce, use.dimred = 'PCA')
colLabels(sce) <- factor(igraph::cluster_louvain(g)$membership)

# Visualization.
plotUMAP(sce, colour_by="label")

Figure 5.2: UMAP plot of the retina dataset, where each point is a cell and is colored by the cluster identity.

Session Info

R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

other attached packages:
scran_1.18.5, scater_1.18.6, ggplot2_3.3.3, scRNAseq_2.4.0, SingleCellExperiment_1.12.0, SummarizedExperiment_1.20.0, Biobase_2.50.0, GenomicRanges_1.42.0, GenomeInfoDb_1.26.4, IRanges_2.24.1, S4Vectors_0.28.1, BiocGenerics_0.36.0, MatrixGenerics_1.2.1, matrixStats_0.58.0, BiocStyle_2.18.1, rebook_1.0.0
Bibliography

Chapter 6 Quality Control
6.1 Motivation

Low-quality libraries in scRNA-seq data can arise from a variety of sources, such as cell damage during dissociation or failure in library preparation (e.g., inefficient reverse transcription or PCR amplification). These usually manifest as "cells" with low total counts, few expressed genes, and high mitochondrial or spike-in proportions. These low-quality libraries are problematic as they can contribute to misleading results in downstream analyses:

They form their own distinct cluster(s), complicating interpretation of the results. This is most obviously driven by increased mitochondrial proportions or enrichment for nuclear RNAs after cell damage. In the worst case, low-quality libraries generated from different cell types can cluster together based on similarities in the damage-induced expression profiles, creating artificial intermediate states or trajectories between otherwise distinct subpopulations. Additionally, very small libraries can form their own clusters due to shifts in the mean upon transformation (A. Lun 2018).
They distort the characterization of population heterogeneity during variance estimation or principal components analysis. The first few principal components will capture differences in quality rather than biology, reducing the effectiveness of dimensionality reduction. Similarly, genes with the largest variances will be driven by differences between low- and high-quality cells. The most obvious example involves low-quality libraries with very low counts, where scaling normalization inflates the apparent variance of genes that happen to have a non-zero count in those libraries.
They contain genes that appear to be strongly "upregulated" due to aggressive scaling to normalize for small library sizes. This is most problematic for contaminating transcripts (e.g., from the ambient solution) that are present in all libraries at low but constant levels.
Increased scaling in low-quality libraries transforms small counts for these transcripts into large normalized expression values, resulting in apparent upregulation compared to other cells. This can be misleading as the affected genes are often biologically sensible but are actually expressed in another subpopulation.

To avoid - or at least mitigate - these problems, we need to remove these cells at the start of the analysis. This step is commonly referred to as quality control (QC) on the cells. (We will use "library" and "cell" rather interchangeably here, though the distinction will become important when dealing with droplet-based data.) We will demonstrate using a small scRNA-seq dataset from A. T. L. Lun et al. (2017), which is provided with no prior QC so that we can apply our own procedures.

#--- loading ---#
library(scRNAseq)
sce.416b <- LunSpikeInData(which="416b")
sce.416b$block <- factor(sce.416b$block)
sce.416b
## class: SingleCellExperiment
## dim: 46604 192
## metadata(0):
## assays(1): counts
## rownames(46604): ENSMUSG00000102693 ENSMUSG00000064842 ...
## ENSMUSG00000095742 CBFB-MYH11-mcherry
## rowData names(1): Length
## colnames(192): SLX-9555.N701_S502.C89V9ANXX.s_1.r_1
## SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 ...
## SLX-11312.N712_S508.H5H5YBBXX.s_8.r_1
## SLX-11312.N712_S517.H5H5YBBXX.s_8.r_1
## colData names(9): Source Name cell line ... spike-in addition block
## reducedDimNames(0):
## altExpNames(2): ERCC SIRV

6.2 Choice of QC metrics

We use several common QC metrics to identify low-quality cells based on their expression profiles. These metrics are described below in terms of reads for SMART-seq2 data, but the same definitions apply to UMI data generated by other technologies like MARS-seq and droplet-based protocols.

The library size is defined as the total sum of counts across all relevant features for each cell. Here, we will consider the relevant features to be the endogenous genes.
Cells with small library sizes are of low quality as the RNA has been lost at some point during library preparation, either due to cell lysis or inefficient cDNA capture and amplification. The number of expressed features in each cell is defined as the number of endogenous genes with non-zero counts for that cell. Any cell with very few expressed genes is likely to be of poor quality as the diverse transcript population has not been successfully captured. The proportion of reads mapped to spike-in transcripts is calculated relative to the total count across all features (including spike-ins) for each cell. As the same amount of spike-in RNA should have been added to each cell, any enrichment in spike-in counts is symptomatic of loss of endogenous RNA. Thus, high proportions are indicative of poor-quality cells where endogenous RNA has been lost due to, e.g., partial cell lysis or RNA degradation during dissociation. In the absence of spike-in transcripts, the proportion of reads mapped to genes in the mitochondrial genome can be used. High proportions are indicative of poor-quality cells (Islam et al. 2014; Ilicic et al. 2016), presumably because of loss of cytoplasmic RNA from perforated cells. The reasoning is that, in the presence of modest damage, the holes in the cell membrane permit efflux of individual transcript molecules but are too small to allow mitochondria to escape, leading to a relative enrichment of mitochondrial transcripts. For single-nuclei RNA-seq experiments, high proportions are also useful as they can mark cells where the cytoplasm has not been successfully stripped. For each cell, we calculate these QC metrics using the perCellQCMetrics() function from the scater package (McCarthy et al. 2017). The sum column contains the total count for each cell and the detected column contains the number of detected genes. The subsets_Mito_percent column contains the percentage of reads mapped to mitochondrial transcripts. 
(For demonstration purposes, we show two different approaches of determining the genomic location of each transcript.) Finally, the altexps_ERCC_percent column contains the percentage of reads mapped to ERCC transcripts. # Retrieving the mitochondrial transcripts using genomic locations included in # the row-level annotation for the SingleCellExperiment. location &lt;- rowRanges(sce.416b) is.mito &lt;- any(seqnames(location)==&quot;MT&quot;) # ALTERNATIVELY: using resources in AnnotationHub to retrieve chromosomal # locations given the Ensembl IDs; this should yield the same result. library(AnnotationHub) ens.mm.v97 &lt;- AnnotationHub()[[&quot;AH73905&quot;]] chr.loc &lt;- mapIds(ens.mm.v97, keys=rownames(sce.416b), keytype=&quot;GENEID&quot;, column=&quot;SEQNAME&quot;) is.mito.alt &lt;- which(chr.loc==&quot;MT&quot;) library(scater) df &lt;- perCellQCMetrics(sce.416b, subsets=list(Mito=is.mito)) df ## DataFrame with 192 rows and 12 columns ## sum detected subsets_Mito_sum ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## SLX-9555.N701_S502.C89V9ANXX.s_1.r_1 865936 7618 78790 ## SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 1076277 7521 98613 ## SLX-9555.N701_S504.C89V9ANXX.s_1.r_1 1180138 8306 100341 ## SLX-9555.N701_S505.C89V9ANXX.s_1.r_1 1342593 8143 104882 ## SLX-9555.N701_S506.C89V9ANXX.s_1.r_1 1668311 7154 129559 ## ... ... ... ... ## SLX-11312.N712_S505.H5H5YBBXX.s_8.r_1 776622 8174 48126 ## SLX-11312.N712_S506.H5H5YBBXX.s_8.r_1 1299950 8956 112225 ## SLX-11312.N712_S507.H5H5YBBXX.s_8.r_1 1800696 9530 135693 ## SLX-11312.N712_S508.H5H5YBBXX.s_8.r_1 46731 6649 3505 ## SLX-11312.N712_S517.H5H5YBBXX.s_8.r_1 1866692 10964 150375 ## subsets_Mito_detected ## &lt;numeric&gt; ## SLX-9555.N701_S502.C89V9ANXX.s_1.r_1 20 ## SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 20 ## SLX-9555.N701_S504.C89V9ANXX.s_1.r_1 19 ## SLX-9555.N701_S505.C89V9ANXX.s_1.r_1 20 ## SLX-9555.N701_S506.C89V9ANXX.s_1.r_1 22 ## ... ... 
## SLX-11312.N712_S505.H5H5YBBXX.s_8.r_1 20 ## SLX-11312.N712_S506.H5H5YBBXX.s_8.r_1 25 ## SLX-11312.N712_S507.H5H5YBBXX.s_8.r_1 23 ## SLX-11312.N712_S508.H5H5YBBXX.s_8.r_1 16 ## SLX-11312.N712_S517.H5H5YBBXX.s_8.r_1 29 ## subsets_Mito_percent altexps_ERCC_sum ## &lt;numeric&gt; &lt;numeric&gt; ## SLX-9555.N701_S502.C89V9ANXX.s_1.r_1 9.09882 65278 ## SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 9.16242 74748 ## SLX-9555.N701_S504.C89V9ANXX.s_1.r_1 8.50248 60878 ## SLX-9555.N701_S505.C89V9ANXX.s_1.r_1 7.81190 60073 ## SLX-9555.N701_S506.C89V9ANXX.s_1.r_1 7.76588 136810 ## ... ... ... ## SLX-11312.N712_S505.H5H5YBBXX.s_8.r_1 6.19684 61575 ## SLX-11312.N712_S506.H5H5YBBXX.s_8.r_1 8.63302 94982 ## SLX-11312.N712_S507.H5H5YBBXX.s_8.r_1 7.53559 113707 ## SLX-11312.N712_S508.H5H5YBBXX.s_8.r_1 7.50037 7580 ## SLX-11312.N712_S517.H5H5YBBXX.s_8.r_1 8.05569 48664 ## altexps_ERCC_detected ## &lt;numeric&gt; ## SLX-9555.N701_S502.C89V9ANXX.s_1.r_1 39 ## SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 40 ## SLX-9555.N701_S504.C89V9ANXX.s_1.r_1 42 ## SLX-9555.N701_S505.C89V9ANXX.s_1.r_1 42 ## SLX-9555.N701_S506.C89V9ANXX.s_1.r_1 44 ## ... ... ## SLX-11312.N712_S505.H5H5YBBXX.s_8.r_1 39 ## SLX-11312.N712_S506.H5H5YBBXX.s_8.r_1 41 ## SLX-11312.N712_S507.H5H5YBBXX.s_8.r_1 40 ## SLX-11312.N712_S508.H5H5YBBXX.s_8.r_1 44 ## SLX-11312.N712_S517.H5H5YBBXX.s_8.r_1 39 ## altexps_ERCC_percent altexps_SIRV_sum ## &lt;numeric&gt; &lt;numeric&gt; ## SLX-9555.N701_S502.C89V9ANXX.s_1.r_1 6.80658 27828 ## SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 6.28030 39173 ## SLX-9555.N701_S504.C89V9ANXX.s_1.r_1 4.78949 30058 ## SLX-9555.N701_S505.C89V9ANXX.s_1.r_1 4.18567 32542 ## SLX-9555.N701_S506.C89V9ANXX.s_1.r_1 7.28887 71850 ## ... ... ... 
## SLX-11312.N712_S505.H5H5YBBXX.s_8.r_1 7.17620 19848 ## SLX-11312.N712_S506.H5H5YBBXX.s_8.r_1 6.65764 31729 ## SLX-11312.N712_S507.H5H5YBBXX.s_8.r_1 5.81467 41116 ## SLX-11312.N712_S508.H5H5YBBXX.s_8.r_1 13.48898 1883 ## SLX-11312.N712_S517.H5H5YBBXX.s_8.r_1 2.51930 16289 ## altexps_SIRV_detected ## &lt;numeric&gt; ## SLX-9555.N701_S502.C89V9ANXX.s_1.r_1 7 ## SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 7 ## SLX-9555.N701_S504.C89V9ANXX.s_1.r_1 7 ## SLX-9555.N701_S505.C89V9ANXX.s_1.r_1 7 ## SLX-9555.N701_S506.C89V9ANXX.s_1.r_1 7 ## ... ... ## SLX-11312.N712_S505.H5H5YBBXX.s_8.r_1 7 ## SLX-11312.N712_S506.H5H5YBBXX.s_8.r_1 7 ## SLX-11312.N712_S507.H5H5YBBXX.s_8.r_1 7 ## SLX-11312.N712_S508.H5H5YBBXX.s_8.r_1 7 ## SLX-11312.N712_S517.H5H5YBBXX.s_8.r_1 7 ## altexps_SIRV_percent total ## &lt;numeric&gt; &lt;numeric&gt; ## SLX-9555.N701_S502.C89V9ANXX.s_1.r_1 2.90165 959042 ## SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 3.29130 1190198 ## SLX-9555.N701_S504.C89V9ANXX.s_1.r_1 2.36477 1271074 ## SLX-9555.N701_S505.C89V9ANXX.s_1.r_1 2.26741 1435208 ## SLX-9555.N701_S506.C89V9ANXX.s_1.r_1 3.82798 1876971 ## ... ... ... ## SLX-11312.N712_S505.H5H5YBBXX.s_8.r_1 2.313165 858045 ## SLX-11312.N712_S506.H5H5YBBXX.s_8.r_1 2.224004 1426661 ## SLX-11312.N712_S507.H5H5YBBXX.s_8.r_1 2.102562 1955519 ## SLX-11312.N712_S508.H5H5YBBXX.s_8.r_1 3.350892 56194 ## SLX-11312.N712_S517.H5H5YBBXX.s_8.r_1 0.843271 1931645 Alternatively, users may prefer to use the addPerCellQC() function. This computes and appends the per-cell QC statistics to the colData of the SingleCellExperiment object, allowing us to retain all relevant information in a single object for later manipulation. 
sce.416b &lt;- addPerCellQC(sce.416b, subsets=list(Mito=is.mito)) colnames(colData(sce.416b)) ## [1] &quot;Source Name&quot; &quot;cell line&quot; ## [3] &quot;cell type&quot; &quot;single cell well quality&quot; ## [5] &quot;genotype&quot; &quot;phenotype&quot; ## [7] &quot;strain&quot; &quot;spike-in addition&quot; ## [9] &quot;block&quot; &quot;sum&quot; ## [11] &quot;detected&quot; &quot;subsets_Mito_sum&quot; ## [13] &quot;subsets_Mito_detected&quot; &quot;subsets_Mito_percent&quot; ## [15] &quot;altexps_ERCC_sum&quot; &quot;altexps_ERCC_detected&quot; ## [17] &quot;altexps_ERCC_percent&quot; &quot;altexps_SIRV_sum&quot; ## [19] &quot;altexps_SIRV_detected&quot; &quot;altexps_SIRV_percent&quot; ## [21] &quot;total&quot; A key assumption here is that the QC metrics are independent of the biological state of each cell. Poor values (e.g., low library sizes, high mitochondrial proportions) are presumed to be driven by technical factors rather than biological processes, meaning that the subsequent removal of cells will not misrepresent the biology in downstream analyses. Major violations of this assumption would potentially result in the loss of cell types that have, say, systematically low RNA content or high numbers of mitochondria. We can check for such violations using some diagnostics described in Sections 6.4 and 6.5. 6.3 Identifying low-quality cells 6.3.1 With fixed thresholds The simplest approach to identifying low-quality cells is to apply thresholds on the QC metrics. For example, we might consider cells to be low quality if they have library sizes below 100,000 reads; express fewer than 5,000 genes; have spike-in proportions above 10%; or have mitochondrial proportions above 10%. qc.lib &lt;- df$sum &lt; 1e5 qc.nexprs &lt;- df$detected &lt; 5e3 qc.spike &lt;- df$altexps_ERCC_percent &gt; 10 qc.mito &lt;- df$subsets_Mito_percent &gt; 10 discard &lt;- qc.lib | qc.nexprs | qc.spike | qc.mito # Summarize the number of cells removed for each reason. 
DataFrame(LibSize=sum(qc.lib), NExprs=sum(qc.nexprs), SpikeProp=sum(qc.spike), MitoProp=sum(qc.mito), Total=sum(discard)) ## DataFrame with 1 row and 5 columns ## LibSize NExprs SpikeProp MitoProp Total ## &lt;integer&gt; &lt;integer&gt; &lt;integer&gt; &lt;integer&gt; &lt;integer&gt; ## 1 3 0 19 14 33 While simple, this strategy requires considerable experience to determine appropriate thresholds for each experimental protocol and biological system. Thresholds for read count-based data are simply not applicable for UMI-based data, and vice versa. Differences in mitochondrial activity or total RNA content require constant adjustment of the mitochondrial and spike-in thresholds, respectively, for different biological systems. Indeed, even with the same protocol and system, the appropriate threshold can vary from run to run due to the vagaries of cDNA capture efficiency and sequencing depth per cell. 6.3.2 With adaptive thresholds 6.3.2.1 Identifying outliers To obtain an adaptive threshold, we assume that most of the dataset consists of high-quality cells. We then identify cells that are outliers for the various QC metrics, based on the median absolute deviation (MAD) from the median value of each metric across all cells. Specifically, a value is considered an outlier if it is more than 3 MADs from the median in the “problematic” direction. This is loosely motivated by the fact that such a filter will retain 99% of non-outlier values that follow a normal distribution. For the 416B data, we identify cells with log-transformed library sizes that are more than 3 MADs below the median. A log-transformation is used to improve resolution at small values when type=\"lower\". Specifically, it guarantees that the threshold is not a negative value, which would be meaningless for a non-negative metric. 
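To make the rule concrete, here is a minimal base-R sketch of the “more than 3 MADs below the median on the log scale” filter, using made-up library sizes rather than values from the 416B data. Note that stats::mad() applies a normal-consistency scaling by default, matching the rationale above.

```r
# Toy library sizes; the last cell is suspiciously small.
# (Made-up values, not taken from the 416B dataset.)
libsizes <- c(9e5, 1.1e6, 1.2e6, 1.3e6, 1.7e6, 5e4)

# Work on the log scale, as in isOutlier(log=TRUE, type="lower").
log.sizes <- log(libsizes)
lower.limit <- median(log.sizes) - 3 * mad(log.sizes)

# Back-transforming guarantees a positive threshold on the original scale.
threshold <- exp(lower.limit)
outlier <- libsizes < threshold
```

In this toy example, only the last cell falls below the back-transformed threshold; a negative threshold is impossible as exp() of any finite value is positive.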
Furthermore, it is not uncommon for the distribution of library sizes to exhibit a heavy right tail; the log-transformation avoids inflation of the MAD in a manner that might compromise outlier detection on the left tail. (More generally, it makes the distribution seem more normal to justify the 99% rationale mentioned above.) qc.lib2 &lt;- isOutlier(df$sum, log=TRUE, type=&quot;lower&quot;) We do the same for the log-transformed number of expressed genes. qc.nexprs2 &lt;- isOutlier(df$detected, log=TRUE, type=&quot;lower&quot;) isOutlier() will also return the exact filter thresholds for each metric in the attributes of the output vector. These are useful for checking whether the automatically selected thresholds are appropriate. attr(qc.lib2, &quot;thresholds&quot;) ## lower higher ## 434083 Inf attr(qc.nexprs2, &quot;thresholds&quot;) ## lower higher ## 5231 Inf We identify outliers for the proportion-based metrics with the same function. These distributions frequently exhibit a heavy right tail, but unlike the two previous metrics, it is the right tail itself that contains the putative low-quality cells. Thus, we do not perform any transformation to shrink the tail - rather, our hope is that the cells in the tail are identified as large outliers. (While it is theoretically possible to obtain a meaningless threshold above 100%, this is rare enough to not be of practical concern.) qc.spike2 &lt;- isOutlier(df$altexps_ERCC_percent, type=&quot;higher&quot;) attr(qc.spike2, &quot;thresholds&quot;) ## lower higher ## -Inf 14.15 qc.mito2 &lt;- isOutlier(df$subsets_Mito_percent, type=&quot;higher&quot;) attr(qc.mito2, &quot;thresholds&quot;) ## lower higher ## -Inf 11.92 A cell that is an outlier for any of these metrics is considered to be of low quality and discarded. discard2 &lt;- qc.lib2 | qc.nexprs2 | qc.spike2 | qc.mito2 # Summarize the number of cells removed for each reason. 
DataFrame(LibSize=sum(qc.lib2), NExprs=sum(qc.nexprs2), SpikeProp=sum(qc.spike2), MitoProp=sum(qc.mito2), Total=sum(discard2)) ## DataFrame with 1 row and 5 columns ## LibSize NExprs SpikeProp MitoProp Total ## &lt;integer&gt; &lt;integer&gt; &lt;integer&gt; &lt;integer&gt; &lt;integer&gt; ## 1 4 0 1 2 6 Alternatively, this entire process can be done in a single step using the quickPerCellQC() function. This is a wrapper that simply calls isOutlier() with the settings described above. reasons &lt;- quickPerCellQC(df, sub.fields=c(&quot;subsets_Mito_percent&quot;, &quot;altexps_ERCC_percent&quot;)) colSums(as.matrix(reasons)) ## low_lib_size low_n_features high_subsets_Mito_percent ## 4 0 2 ## high_altexps_ERCC_percent discard ## 1 6 With this strategy, the thresholds adapt to both the location and spread of the distribution of values for a given metric. This allows the QC procedure to adjust to changes in sequencing depth, cDNA capture efficiency, mitochondrial content, etc. without requiring any user intervention or prior experience. However, it does require some implicit assumptions that are discussed below in more detail. 6.3.2.2 Assumptions of outlier detection Outlier detection assumes that most cells are of acceptable quality. This is usually reasonable and can be experimentally supported in some situations by visually checking that the cells are intact, e.g., on the microwell plate. If most cells are of (unacceptably) low quality, the adaptive thresholds will obviously fail as they cannot remove the majority of cells. Of course, what is acceptable or not is in the eye of the beholder - neurons, for example, are notoriously difficult to dissociate, and we would often retain cells in a neuronal scRNA-seq dataset with QC metrics that would be unacceptable in a more amenable system like embryonic stem cells. Another assumption mentioned earlier is that the QC metrics are independent of the biological state of each cell. 
This is most likely to be violated in highly heterogeneous cell populations where some cell types naturally have, e.g., less total RNA (see Figure 3A of Germain, Sonrel, and Robinson (2020)) or more mitochondria. Such cells are more likely to be considered outliers and removed, even in the absence of any technical problems with their capture or sequencing. The use of the MAD mitigates this problem to some extent by accounting for biological variability in the QC metrics. A heterogeneous population should have higher variability in the metrics among high-quality cells, increasing the MAD and reducing the chance of incorrectly removing particular cell types (at the cost of reducing power to remove low-quality cells). In general, these assumptions are either reasonable or their violations have little effect on downstream conclusions. Nonetheless, it is helpful to keep them in mind when interpreting the results. 6.3.2.3 Considering experimental factors More complex studies may involve batches of cells generated with different experimental parameters (e.g., sequencing depth). In such cases, the adaptive strategy should be applied to each batch separately. It makes little sense to compute medians and MADs from a mixture distribution containing samples from multiple batches. For example, if the sequencing coverage is lower in one batch compared to the others, it will drag down the median and inflate the MAD. This will reduce the suitability of the adaptive threshold for the other batches. If each batch is represented by its own SingleCellExperiment, the isOutlier() function can be directly applied to each batch as shown above. However, if cells from all batches have been merged into a single SingleCellExperiment, the batch= argument should be used to ensure that outliers are identified within each batch. This allows isOutlier() to accommodate systematic differences in the QC metrics across batches. 
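The inflation of the MAD by a mixture of batches can be illustrated with a small base-R sketch (entirely made-up values): the pooled threshold falls far below both per-batch thresholds, so it would fail to flag small libraries in either batch.

```r
# Two hypothetical batches with different sequencing depths.
libsizes <- c(1e6, 1.2e6, 9e5, 1.1e6,    # batch A: deeply sequenced
              2e5, 2.4e5, 1.8e5, 2.2e5)  # batch B: shallower coverage
batch <- rep(c("A", "B"), each=4)

# Batch-specific lower thresholds, mimicking isOutlier(..., batch=batch).
per.batch <- tapply(log(libsizes), batch,
    function(x) exp(median(x) - 3 * mad(x)))

# Pooling both batches inflates the MAD and drags the threshold far down.
pooled <- exp(median(log(libsizes)) - 3 * mad(log(libsizes)))
```

Here pooled lies well below min(per.batch), so a single pooled filter would be too lenient for both batches.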
We will again illustrate using the 416B dataset, which contains two experimental factors - plate of origin and oncogene induction status. We combine these factors together and use this in the batch= argument to isOutlier() via quickPerCellQC(). This results in the removal of slightly more cells as the MAD is no longer inflated by (i) systematic differences in sequencing depth between batches and (ii) differences in number of genes expressed upon oncogene induction. batch &lt;- paste0(sce.416b$phenotype, &quot;-&quot;, sce.416b$block) batch.reasons &lt;- quickPerCellQC(df, batch=batch, sub.fields=c(&quot;subsets_Mito_percent&quot;, &quot;altexps_ERCC_percent&quot;)) colSums(as.matrix(batch.reasons)) ## low_lib_size low_n_features high_subsets_Mito_percent ## 5 4 2 ## high_altexps_ERCC_percent discard ## 6 9 That said, the use of batch= involves the stronger assumption that most cells in each batch are of high quality. If an entire batch failed, outlier detection will not be able to act as an appropriate QC filter for that batch. For example, two batches in the Grun et al. (2016) human pancreas dataset contain a substantial proportion of putative damaged cells with higher ERCC content than the other batches (Figure 6.1). This inflates the median and MAD within those batches, resulting in a failure to remove the assumed low-quality cells. In such cases, it is better to compute a shared median and MAD from the other batches and use those estimates to obtain an appropriate filter threshold for cells in the problematic batches, as shown below. library(scRNAseq) sce.grun &lt;- GrunPancreasData() sce.grun &lt;- addPerCellQC(sce.grun) # First attempt with batch-specific thresholds. 
discard.ercc &lt;- isOutlier(sce.grun$altexps_ERCC_percent, type=&quot;higher&quot;, batch=sce.grun$donor) with.blocking &lt;- plotColData(sce.grun, x=&quot;donor&quot;, y=&quot;altexps_ERCC_percent&quot;, colour_by=I(discard.ercc)) # Second attempt, sharing information across batches # to avoid dramatically different thresholds for unusual batches. discard.ercc2 &lt;- isOutlier(sce.grun$altexps_ERCC_percent, type=&quot;higher&quot;, batch=sce.grun$donor, subset=sce.grun$donor %in% c(&quot;D17&quot;, &quot;D2&quot;, &quot;D7&quot;)) without.blocking &lt;- plotColData(sce.grun, x=&quot;donor&quot;, y=&quot;altexps_ERCC_percent&quot;, colour_by=I(discard.ercc2)) gridExtra::grid.arrange(with.blocking, without.blocking, ncol=2) Figure 6.1: Distribution of the proportion of ERCC transcripts in each donor of the Grun pancreas dataset. Each point represents a cell and is coloured according to whether it was identified as an outlier within each batch (left) or using a common threshold (right). To identify problematic batches, one useful rule of thumb is to find batches with QC thresholds that are themselves outliers compared to the thresholds of other batches. The assumption here is that most batches consist of a majority of high quality cells such that the threshold value should follow some unimodal distribution across “typical” batches. If we observe a batch with an extreme threshold value, we may suspect that it contains a large number of low-quality cells that inflate the per-batch MAD. We demonstrate this process below for the Grun et al. (2016) data. 
ercc.thresholds &lt;- attr(discard.ercc, &quot;thresholds&quot;)[&quot;higher&quot;,] ercc.thresholds ## D10 D17 D2 D3 D7 ## 73.611 7.600 6.011 113.106 15.217 names(ercc.thresholds)[isOutlier(ercc.thresholds, type=&quot;higher&quot;)] ## [1] &quot;D10&quot; &quot;D3&quot; If we cannot assume that most batches contain a majority of high-quality cells, then all bets are off; we must revert to the approach of picking an arbitrary threshold value (Section 6.3.1) and hoping for the best. 6.3.3 Other approaches Another strategy is to identify outliers in high-dimensional space based on the QC metrics for each cell. We use methods from robustbase to quantify the “outlyingness” of each cell based on its QC metrics, and then use isOutlier() to identify low-quality cells that exhibit unusually high levels of outlyingness. stats &lt;- cbind(log10(df$sum), log10(df$detected), df$subsets_Mito_percent, df$altexps_ERCC_percent) library(robustbase) outlying &lt;- adjOutlyingness(stats, only.outlyingness = TRUE) multi.outlier &lt;- isOutlier(outlying, type = &quot;higher&quot;) summary(multi.outlier) ## Mode FALSE TRUE ## logical 183 9 This and related approaches like PCA-based outlier detection and support vector machines can provide more power to distinguish low-quality cells from high-quality counterparts (Ilicic et al. 2016) as they can exploit patterns across many QC metrics. However, this comes at some cost to interpretability, as the reason for removing a given cell may not always be obvious. For completeness, we note that outliers can also be identified from the gene expression profiles, rather than QC metrics. We consider this to be a risky strategy as it can remove high-quality cells in rare populations. 6.4 Checking diagnostic plots It is good practice to inspect the distributions of QC metrics (Figure 6.2) to identify possible problems. In the most ideal case, we would see normal distributions that would justify the 3 MAD threshold used in outlier detection. 
A large proportion of cells in another mode suggests that the QC metrics might be correlated with some biological state, potentially leading to the loss of distinct cell types during filtering; or that there were inconsistencies with library preparation for a subset of cells, a not-uncommon phenomenon in plate-based protocols. Batches with systematically poor values for any metric can then be quickly identified for further troubleshooting or outright removal, much like in Figure 6.1 above. colData(sce.416b) &lt;- cbind(colData(sce.416b), df) sce.416b$block &lt;- factor(sce.416b$block) sce.416b$phenotype &lt;- ifelse(grepl(&quot;induced&quot;, sce.416b$phenotype), &quot;induced&quot;, &quot;wild type&quot;) sce.416b$discard &lt;- reasons$discard gridExtra::grid.arrange( plotColData(sce.416b, x=&quot;block&quot;, y=&quot;sum&quot;, colour_by=&quot;discard&quot;, other_fields=&quot;phenotype&quot;) + facet_wrap(~phenotype) + scale_y_log10() + ggtitle(&quot;Total count&quot;), plotColData(sce.416b, x=&quot;block&quot;, y=&quot;detected&quot;, colour_by=&quot;discard&quot;, other_fields=&quot;phenotype&quot;) + facet_wrap(~phenotype) + scale_y_log10() + ggtitle(&quot;Detected features&quot;), plotColData(sce.416b, x=&quot;block&quot;, y=&quot;subsets_Mito_percent&quot;, colour_by=&quot;discard&quot;, other_fields=&quot;phenotype&quot;) + facet_wrap(~phenotype) + ggtitle(&quot;Mito percent&quot;), plotColData(sce.416b, x=&quot;block&quot;, y=&quot;altexps_ERCC_percent&quot;, colour_by=&quot;discard&quot;, other_fields=&quot;phenotype&quot;) + facet_wrap(~phenotype) + ggtitle(&quot;ERCC percent&quot;), ncol=1 ) Figure 6.2: Distribution of QC metrics for each batch and phenotype in the 416B dataset. Each point represents a cell and is colored according to whether it was discarded. Another useful diagnostic involves plotting the proportion of mitochondrial counts against some of the other QC metrics. 
The aim is to confirm that there are no cells with both large total counts and large mitochondrial counts, to ensure that we are not inadvertently removing high-quality cells that happen to be highly metabolically active (e.g., hepatocytes). We demonstrate using data from a larger experiment involving the mouse brain (Zeisel et al. 2015); in this case, we do not observe any points in the top-right corner in Figure 6.3 that might potentially correspond to metabolically active, undamaged cells. #--- loading ---# library(scRNAseq) sce.zeisel &lt;- ZeiselBrainData() library(scater) sce.zeisel &lt;- aggregateAcrossFeatures(sce.zeisel, id=sub(&quot;_loc[0-9]+$&quot;, &quot;&quot;, rownames(sce.zeisel))) #--- gene-annotation ---# library(org.Mm.eg.db) rowData(sce.zeisel)$Ensembl &lt;- mapIds(org.Mm.eg.db, keys=rownames(sce.zeisel), keytype=&quot;SYMBOL&quot;, column=&quot;ENSEMBL&quot;) sce.zeisel &lt;- addPerCellQC(sce.zeisel, subsets=list(Mt=rowData(sce.zeisel)$featureType==&quot;mito&quot;)) qc &lt;- quickPerCellQC(colData(sce.zeisel), sub.fields=c(&quot;altexps_ERCC_percent&quot;, &quot;subsets_Mt_percent&quot;)) sce.zeisel$discard &lt;- qc$discard plotColData(sce.zeisel, x=&quot;sum&quot;, y=&quot;subsets_Mt_percent&quot;, colour_by=&quot;discard&quot;) Figure 6.3: Percentage of UMIs assigned to mitochondrial transcripts in the Zeisel brain dataset, plotted against the total number of UMIs (top). Each point represents a cell and is colored according to whether it was considered low-quality and discarded. Comparison of the ERCC and mitochondrial percentages can also be informative (Figure 6.4). Low-quality cells with small mitochondrial percentages, large spike-in percentages and small library sizes are likely to be stripped nuclei, i.e., they have been so extensively damaged that they have lost all cytoplasmic content. 
On the other hand, cells with high mitochondrial percentages and low ERCC percentages may represent undamaged cells that are metabolically active. This interpretation also applies for single-nuclei studies but with a switch of focus: the stripped nuclei become the libraries of interest while the undamaged cells are considered to be low quality. plotColData(sce.zeisel, x=&quot;altexps_ERCC_percent&quot;, y=&quot;subsets_Mt_percent&quot;, colour_by=&quot;discard&quot;) Figure 6.4: Percentage of UMIs assigned to mitochondrial transcripts in the Zeisel brain dataset, plotted against the percentage of UMIs assigned to spike-in transcripts (bottom). Each point represents a cell and is colored according to whether it was considered low-quality and discarded. We see that all of these metrics exhibit weak correlations to each other, presumably a manifestation of a common underlying effect of cell damage. The weakness of the correlations motivates the use of several metrics to capture different aspects of technical quality. Of course, the flipside is that these metrics may also represent different aspects of biology, increasing the risk of discarding entire cell types as discussed in Section 6.3.2.2. 6.5 Removing low-quality cells Once low-quality cells have been identified, we can choose to either remove them or mark them. Removal is the most straightforward option and is achieved by subsetting the SingleCellExperiment by column. In this case, we use the low-quality calls from Section 6.3.2.3 to generate a subsetted SingleCellExperiment that we would use for downstream analyses. # Keeping the columns we DON&#39;T want to discard. filtered &lt;- sce.416b[,!reasons$discard] The biggest practical concern during QC is whether an entire cell type is inadvertently discarded. There is always some risk of this occurring as the QC metrics are never fully independent of biological state. 
We can diagnose cell type loss by looking for systematic differences in gene expression between the discarded and retained cells. To demonstrate, we compute the average count across the discarded and retained pools in the 416B dataset, and we compute the log-fold change between the pool averages. # Using the &#39;discard&#39; vector for demonstration purposes, # as it has more cells for stable calculation of &#39;lost&#39;. lost &lt;- calculateAverage(counts(sce.416b)[,discard]) kept &lt;- calculateAverage(counts(sce.416b)[,!discard]) library(edgeR) logged &lt;- cpm(cbind(lost, kept), log=TRUE, prior.count=2) logFC &lt;- logged[,1] - logged[,2] abundance &lt;- rowMeans(logged) If the discarded pool is enriched for a certain cell type, we should observe increased expression of the corresponding marker genes. No systematic upregulation of genes is apparent in the discarded pool in Figure 6.5, suggesting that the QC step did not inadvertently filter out a cell type in the 416B dataset. plot(abundance, logFC, xlab=&quot;Average count&quot;, ylab=&quot;Log-FC (lost/kept)&quot;, pch=16) points(abundance[is.mito], logFC[is.mito], col=&quot;dodgerblue&quot;, pch=16) Figure 6.5: Log-fold change in expression in the discarded cells compared to the retained cells in the 416B dataset. Each point represents a gene with mitochondrial transcripts in blue. For comparison, let us consider the QC step for the PBMC dataset from 10X Genomics (Zheng et al. 2017). We’ll apply an arbitrary fixed threshold on the library size to filter cells rather than using any outlier-based method. Specifically, we remove all libraries with a library size below 500. 
#--- loading ---# library(DropletTestFiles) raw.path &lt;- getTestFile(&quot;tenx-2.1.0-pbmc4k/1.0.0/raw.tar.gz&quot;) out.path &lt;- file.path(tempdir(), &quot;pbmc4k&quot;) untar(raw.path, exdir=out.path) library(DropletUtils) fname &lt;- file.path(out.path, &quot;raw_gene_bc_matrices/GRCh38&quot;) sce.pbmc &lt;- read10xCounts(fname, col.names=TRUE) #--- gene-annotation ---# library(scater) rownames(sce.pbmc) &lt;- uniquifyFeatureNames( rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol) library(EnsDb.Hsapiens.v86) location &lt;- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce.pbmc)$ID, column=&quot;SEQNAME&quot;, keytype=&quot;GENEID&quot;) #--- cell-detection ---# set.seed(100) e.out &lt;- emptyDrops(counts(sce.pbmc)) sce.pbmc &lt;- sce.pbmc[,which(e.out$FDR &lt;= 0.001)] discard &lt;- colSums(counts(sce.pbmc)) &lt; 500 lost &lt;- calculateAverage(counts(sce.pbmc)[,discard]) kept &lt;- calculateAverage(counts(sce.pbmc)[,!discard]) logged &lt;- edgeR::cpm(cbind(lost, kept), log=TRUE, prior.count=2) logFC &lt;- logged[,1] - logged[,2] abundance &lt;- rowMeans(logged) The presence of a distinct population in the discarded pool manifests in Figure 6.6 as a set of genes that are strongly upregulated in lost. This includes PF4, PPBP and SDPR, which (spoiler alert!) indicates that there is a platelet population that has been discarded by this filter. plot(abundance, logFC, xlab=&quot;Average count&quot;, ylab=&quot;Log-FC (lost/kept)&quot;, pch=16) platelet &lt;- c(&quot;PF4&quot;, &quot;PPBP&quot;, &quot;SDPR&quot;) points(abundance[platelet], logFC[platelet], col=&quot;orange&quot;, pch=16) Figure 6.6: Average counts across all discarded and retained cells in the PBMC dataset, after using a more stringent filter on the total UMI count. Each point represents a gene, with platelet-related genes highlighted in orange. 
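Outlier-based filters have a tunable stringency via the nmads= argument of isOutlier() (default 3), which controls how far from the median a value must lie before it is flagged. A base-R sketch of the effect, with toy numbers (not from any dataset in this chapter) and the same threshold arithmetic as Section 6.3.2:

```r
# Toy library sizes with one borderline cell (last value).
libsizes <- c(9e5, 1.1e6, 1.2e6, 1.3e6, 1.7e6, 4e5)
log.sizes <- log(libsizes)

# Lower thresholds at 3 MADs (default) versus a more lenient 5 MADs.
thr3 <- exp(median(log.sizes) - 3 * mad(log.sizes))
thr5 <- exp(median(log.sizes) - 5 * mad(log.sizes))

flagged3 <- sum(libsizes < thr3)  # borderline cell is discarded
flagged5 <- sum(libsizes < thr5)  # borderline cell is retained
```

Widening the window from 3 to 5 MADs rescues the borderline cell in this example, at the cost of admitting more genuinely low-quality libraries.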
If we suspect that cell types have been incorrectly discarded by our QC procedure, the most direct solution is to relax the QC filters for metrics that are associated with genuine biological differences. For example, outlier detection can be relaxed by increasing nmads= in the isOutlier() calls. Of course, this increases the risk of retaining more low-quality cells and encountering the problems discussed in Section 6.1. The logical endpoint of this line of reasoning is to avoid filtering altogether, as discussed in Section 6.6. As an aside, it is worth mentioning that the true technical quality of a cell may also be correlated with its type. (This differs from a correlation between the cell type and the QC metrics, as the latter are our imperfect proxies for quality.) This can arise if some cell types are not amenable to dissociation or microfluidics handling during the scRNA-seq protocol. In such cases, it is possible to “correctly” discard an entire cell type during QC if all of its cells are damaged. Indeed, concerns over the computational removal of cell types during QC are probably minor compared to losses in the experimental protocol. 6.6 Marking low-quality cells The other option is to simply mark the low-quality cells as such and retain them in the downstream analysis. The aim here is to allow clusters of low-quality cells to form, and then to identify and ignore such clusters during interpretation of the results. This approach avoids discarding cell types that have poor values for the QC metrics, giving users an opportunity to decide whether a cluster of such cells represents a genuine biological state. marked &lt;- sce.416b marked$discard &lt;- batch.reasons$discard The downside is that it shifts the burden of QC to the interpretation of the clusters, which is already the bottleneck in scRNA-seq data analysis (Chapters 10, 11 and 12). 
Indeed, if we do not trust the QC metrics, we would have to distinguish between genuine cell types and low-quality cells based only on marker genes, and this is not always easy due to the tendency of the latter to “express” interesting genes (Section 6.1). Retention of low-quality cells also compromises the accuracy of the variance modelling, requiring, e.g., use of more PCs to offset the fact that the early PCs are driven by differences between low-quality and other cells. For routine analyses, we suggest performing removal by default to avoid complications from low-quality cells. This allows most of the population structure to be characterized with no - or, at least, fewer - concerns about its validity. Once the initial analysis is done, and if there are any concerns about discarded cell types (Section 6.5), a more thorough re-analysis can be performed where the low-quality cells are only marked. This recovers cell types with low RNA content, high mitochondrial proportions, etc. that only need to be interpreted insofar as they “fill the gaps” in the initial analysis. 
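To make the outlier-detection logic discussed above concrete, here is a self-contained base R sketch of MAD-based filtering. The is_outlier() function below is an illustrative re-implementation of the idea behind scater's isOutlier(), not the real API, and the toy QC metric is invented for demonstration:

```r
# Illustrative re-implementation of MAD-based outlier detection
# (mimics the logic of scater's isOutlier(); not the real function).
is_outlier <- function(x, nmads=3, type=c("both", "lower", "higher")) {
    type <- match.arg(type)
    med <- median(x)
    dev <- mad(x)  # median absolute deviation, scaled to be comparable to the sd
    lower <- if (type %in% c("both", "lower")) med - nmads * dev else -Inf
    upper <- if (type %in% c("both", "higher")) med + nmads * dev else Inf
    x < lower | x > upper
}

# Toy per-cell totals: most cells around 10000 counts, plus two damaged cells.
set.seed(42)
total_counts <- c(rnorm(100, mean=10000, sd=1000), 500, 400)

# A larger nmads= gives a more lenient filter that discards fewer cells.
sum(is_outlier(log10(total_counts), nmads=3, type="lower"))
sum(is_outlier(log10(total_counts), nmads=5, type="lower"))
```

Increasing nmads= widens the acceptance interval around the median, which is exactly the relaxation suggested above for rescuing cell types with genuinely extreme QC metrics.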
Session Info

R version 4.0.4 (2021-02-15), running under Ubuntu 20.04.2 LTS, with Bioconductor 3.12 packages including edgeR 3.32.1, scater 1.18.6, scRNAseq 2.4.0 and SingleCellExperiment 1.12.0 (full package list omitted).

Chapter 7 Normalization

7.1 Motivation

Systematic differences in sequencing coverage between libraries are often observed in single-cell RNA sequencing data (Stegle, Teichmann, and Marioni 2015).
They typically arise from technical differences in cDNA capture or PCR amplification efficiency across cells, attributable to the difficulty of achieving consistent library preparation with minimal starting material. Normalization aims to remove these differences such that they do not interfere with comparisons of the expression profiles between cells. This ensures that any observed heterogeneity or differential expression within the cell population are driven by biology and not technical biases. At this point, it is worth noting the difference between normalization and batch correction (Chapter 28.8). Normalization occurs regardless of the batch structure and only considers technical biases, while batch correction - as the name suggests - only occurs across batches and must consider both technical biases and biological differences. Technical biases tend to affect genes in a similar manner, or at least in a manner related to their biophysical properties (e.g., length, GC content), while biological differences between batches can be highly unpredictable. As such, these two tasks involve different assumptions and generally involve different computational methods (though some packages aim to perform both steps at once, e.g., zinbwave). Thus, it is important to avoid conflating “normalized” and “batch-corrected” data, as these usually refer to different things. We will mostly focus our attention on scaling normalization, which is the simplest and most commonly used class of normalization strategies. This involves dividing all counts for each cell by a cell-specific scaling factor, often called a “size factor” (Anders and Huber 2010). The assumption here is that any cell-specific bias (e.g., in capture or amplification efficiency) affects all genes equally via scaling of the expected mean count for that cell. The size factor for each cell represents the estimate of the relative bias in that cell, so division of its counts by its size factor should remove that bias. 
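To make the scaling idea concrete, here is a minimal base R sketch on a toy count matrix (all object names and numbers below are illustrative, not from the real datasets):

```r
# Minimal sketch of scaling normalization on toy data.
counts <- matrix(c(10, 20, 30,    # cell A
                   20, 40, 60),   # cell B: same profile, 2x capture efficiency
    nrow=3, dimnames=list(paste0("gene", 1:3), c("cellA", "cellB")))

# Suppose we have estimated that cell B has twice the cell-specific bias.
size.factors <- c(cellA=1, cellB=2)

# Dividing each cell's counts by its size factor removes the bias,
# making the two expression profiles directly comparable.
normalized <- t(t(counts) / size.factors)
all(normalized[,"cellA"] == normalized[,"cellB"])
```

The entire class of scaling methods differs only in how the size factors are estimated; the division step itself is always this simple.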
The resulting “normalized expression values” can then be used for downstream analyses such as clustering and dimensionality reduction. To demonstrate, we will use the Zeisel et al. (2015) dataset from the scRNAseq package.

```r
#--- loading ---#
library(scRNAseq)
sce.zeisel <- ZeiselBrainData()

library(scater)
sce.zeisel <- aggregateAcrossFeatures(sce.zeisel,
    id=sub("_loc[0-9]+$", "", rownames(sce.zeisel)))

#--- gene-annotation ---#
library(org.Mm.eg.db)
rowData(sce.zeisel)$Ensembl <- mapIds(org.Mm.eg.db,
    keys=rownames(sce.zeisel), keytype="SYMBOL", column="ENSEMBL")

#--- quality-control ---#
stats <- perCellQCMetrics(sce.zeisel, subsets=list(
    Mt=rowData(sce.zeisel)$featureType=="mito"))
qc <- quickPerCellQC(stats, percent_subsets=c("altexps_ERCC_percent",
    "subsets_Mt_percent"))
sce.zeisel <- sce.zeisel[,!qc$discard]
```

```r
sce.zeisel
## class: SingleCellExperiment 
## dim: 19839 2816 
## metadata(0):
## assays(1): counts
## rownames(19839): 0610005C13Rik 0610007N19Rik ... mt-Tw mt-Ty
## rowData names(2): featureType Ensembl
## colnames(2816): 1772071015_C02 1772071017_G12 ... 1772063068_D01
##   1772066098_A12
## colData names(10): tissue group # ... level1class level2class
## reducedDimNames(0):
## altExpNames(2): ERCC repeat
```

7.2 Library size normalization

Library size normalization is the simplest strategy for performing scaling normalization. We define the library size as the total sum of counts across all genes for each cell, the expected value of which is assumed to scale with any cell-specific biases. The “library size factor” for each cell is then directly proportional to its library size, where the proportionality constant is defined such that the mean size factor across all cells is equal to 1.
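This definition is easy to compute by hand; the sketch below mirrors the logic of librarySizeFactors() on a toy count matrix (illustrative data and names):

```r
# Library size factors by hand on toy data (20 genes x 10 cells).
set.seed(0)
counts <- matrix(rpois(200, lambda=5), nrow=20)

lib.sizes <- colSums(counts)            # library size: total count per cell
lib.sf <- lib.sizes / mean(lib.sizes)   # proportional to the library size...

mean(lib.sf)                            # ...and centered at unity by construction
```

Centering at unity ensures that the average normalized count stays on the scale of the raw counts, a property that becomes important when pseudo-counts are added for log-transformation (Section 7.5.1).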
This definition ensures that the normalized expression values are on the same scale as the original counts, which is useful for interpretation - especially when dealing with transformed data (see Section 7.5.1).

```r
library(scater)
lib.sf.zeisel <- librarySizeFactors(sce.zeisel)
summary(lib.sf.zeisel)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.176   0.568   0.868   1.000   1.278   4.084
```

In the Zeisel brain data, the library size factors differ by up to 10-fold across cells (Figure 7.1). This is typical of the variability in coverage in scRNA-seq data.

```r
hist(log10(lib.sf.zeisel), xlab="Log10[Size factor]", col='grey80')
```

Figure 7.1: Distribution of size factors derived from the library size in the Zeisel brain dataset.

Strictly speaking, the use of library size factors assumes that there is no “imbalance” in the differentially expressed (DE) genes between any pair of cells. That is, any upregulation for a subset of genes is cancelled out by the same magnitude of downregulation in a different subset of genes. This ensures that the library size is an unbiased estimate of the relative cell-specific bias by avoiding composition effects (Robinson and Oshlack 2010). However, balanced DE is not generally present in scRNA-seq applications, which means that library size normalization may not yield accurate normalized expression values for downstream analyses.

In practice, normalization accuracy is not a major consideration for exploratory scRNA-seq data analyses. Composition biases do not usually affect the separation of clusters, only the magnitude - and to a lesser extent, direction - of the log-fold changes between clusters or cell types. As such, library size normalization is usually sufficient in many applications where the aim is to identify clusters and the top markers that define each cluster.

7.3 Normalization by deconvolution

As previously mentioned, composition biases will be present when any unbalanced differential expression exists between samples.
Consider the simple example of two cells where a single gene \\(X\\) is upregulated in one cell \\(A\\) compared to the other cell \\(B\\). This upregulation means that either (i) more sequencing resources are devoted to \\(X\\) in \\(A\\), thus decreasing coverage of all other non-DE genes when the total library size of each cell is experimentally fixed (e.g., due to library quantification); or (ii) the library size of \\(A\\) increases when \\(X\\) is assigned more reads or UMIs, increasing the library size factor and yielding smaller normalized expression values for all non-DE genes. In both cases, the net effect is that non-DE genes in \\(A\\) will incorrectly appear to be downregulated compared to \\(B\\). The removal of composition biases is a well-studied problem for bulk RNA sequencing data analysis. Normalization can be performed with the estimateSizeFactorsFromMatrix() function in the DESeq2 package (Anders and Huber 2010; Love, Huber, and Anders 2014) or with the calcNormFactors() function (Robinson and Oshlack 2010) in the edgeR package. These assume that most genes are not DE between cells. Any systematic difference in count size across the non-DE majority of genes between two cells is assumed to represent bias that is used to compute an appropriate size factor for its removal. However, single-cell data can be problematic for these bulk normalization methods due to the dominance of low and zero counts. To overcome this, we pool counts from many cells to increase the size of the counts for accurate size factor estimation (Lun, Bach, and Marioni 2016). Pool-based size factors are then “deconvolved” into cell-based factors for normalization of each cell’s expression profile. This is performed using the calculateSumFactors() function from scran, as shown below. 
```r
library(scran)
set.seed(100)
clust.zeisel <- quickCluster(sce.zeisel)
table(clust.zeisel)
## clust.zeisel
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14 
## 170 254 441 178 393 148 219 240 189 123 112 103 135 111
```

```r
deconv.sf.zeisel <- calculateSumFactors(sce.zeisel, cluster=clust.zeisel)
summary(deconv.sf.zeisel)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.119   0.486   0.831   1.000   1.321   4.509
```

We use a pre-clustering step with quickCluster() where cells in each cluster are normalized separately and the size factors are rescaled to be comparable across clusters. This avoids the assumption that most genes are non-DE across the entire population - only a non-DE majority is required between pairs of clusters, which is a weaker assumption for highly heterogeneous populations. By default, quickCluster() will use an approximate algorithm for PCA based on methods from the irlba package. The approximation relies on stochastic initialization so we need to set the random seed (via set.seed()) for reproducibility.

We see that the deconvolution size factors exhibit cell type-specific deviations from the library size factors in Figure 7.2. This is consistent with the presence of composition biases that are introduced by strong differential expression between cell types. Use of the deconvolution size factors adjusts for these biases to improve normalization accuracy for downstream applications.

```r
plot(lib.sf.zeisel, deconv.sf.zeisel, xlab="Library size factor",
    ylab="Deconvolution size factor", log='xy', pch=16,
    col=as.integer(factor(sce.zeisel$level1class)))
abline(a=0, b=1, col="red")
```

Figure 7.2: Deconvolution size factor for each cell in the Zeisel brain dataset, compared to the equivalent size factor derived from the library size. The red line corresponds to identity between the two size factors.

Accurate normalization is most important for procedures that involve estimation and interpretation of per-gene statistics.
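The composition effect in the two-cell example above can be reproduced with a small numerical sketch (toy counts and names, purely illustrative):

```r
# Toy illustration of composition bias.
# Ten genes; gene 1 is truly upregulated 10-fold in cell A, the rest are non-DE.
true.A <- c(100, rep(10, 9))
true.B <- c(10,  rep(10, 9))

# Both libraries are sequenced to the same fixed depth of 1000 reads,
# so gene 1 eats up coverage in cell A.
obs.A <- round(1000 * true.A / sum(true.A))
obs.B <- round(1000 * true.B / sum(true.B))

# Under library size normalization, the size factors are identical (equal
# depth), so the truly non-DE gene 2 appears downregulated in cell A.
obs.A[2] / obs.B[2]

# A size factor based on the non-DE majority (here, the median of the
# per-gene ratios) recovers the correct fold change of 1 for gene 2.
sf.A <- median(obs.A / obs.B)
(obs.A[2] / sf.A) / obs.B[2]
```

The median-ratio trick shown here is the same principle used by the bulk methods and, via pooling, by the deconvolution approach; the pooling step exists only to make the counts large enough for the ratios to be stable.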
For example, composition biases can compromise DE analyses by systematically shifting the log-fold changes in one direction or another. However, accurate normalization tends to provide less benefit over simple library size normalization for cell-based analyses such as clustering. The presence of composition biases already implies strong differences in expression profiles, so changing the normalization strategy is unlikely to affect the outcome of a clustering procedure.

7.4 Normalization by spike-ins

Spike-in normalization is based on the assumption that the same amount of spike-in RNA was added to each cell (A. T. L. Lun et al. 2017). Systematic differences in the coverage of the spike-in transcripts can only be due to cell-specific biases, e.g., in capture efficiency or sequencing depth. To remove these biases, we equalize spike-in coverage across cells by scaling with “spike-in size factors”.

Compared to the previous methods, spike-in normalization requires no assumption about the biology of the system (i.e., the absence of many DE genes). Instead, it assumes that the spike-in transcripts were (i) added at a constant level to each cell, and (ii) respond to biases in the same relative manner as endogenous genes.

Practically, spike-in normalization should be used if differences in the total RNA content of individual cells are of interest and must be preserved in downstream analyses. For a given cell, an increase in its overall amount of endogenous RNA will not increase its spike-in size factor. This ensures that the effects of total RNA content on expression across the population will not be removed upon scaling. By comparison, the other normalization methods described above will simply interpret any change in total RNA content as part of the bias and remove it.

We demonstrate the use of spike-in normalization on a different dataset involving T cell activation after stimulation with T cell receptor ligands of varying affinity (Richard et al. 2018).
```r
library(scRNAseq)
sce.richard <- RichardTCellData()
sce.richard <- sce.richard[,sce.richard$`single cell quality`=="OK"]
sce.richard
## class: SingleCellExperiment 
## dim: 46603 528 
## metadata(0):
## assays(1): counts
## rownames(46603): ENSMUSG00000102693 ENSMUSG00000064842 ...
##   ENSMUSG00000096730 ENSMUSG00000095742
## rowData names(0):
## colnames(528): SLX-12611.N701_S502. SLX-12611.N702_S502. ...
##   SLX-12612.i712_i522. SLX-12612.i714_i522.
## colData names(13): age individual ... stimulus time
## reducedDimNames(0):
## altExpNames(1): ERCC
```

We apply the computeSpikeFactors() method to estimate spike-in size factors for all cells. This is defined by converting the total spike-in count per cell into a size factor, using the same reasoning as in librarySizeFactors(). Scaling will subsequently remove any differences in spike-in coverage across cells.

```r
sce.richard <- computeSpikeFactors(sce.richard, "ERCC")
summary(sizeFactors(sce.richard))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.125   0.428   0.627   1.000   1.070  23.316
```

We observe a positive correlation between the spike-in size factors and deconvolution size factors within each treatment condition (Figure 7.3), indicating that they are capturing similar technical biases in sequencing depth and capture efficiency. However, we also observe that increasing stimulation of the T cell receptor - in terms of increasing affinity or time - results in a decrease in the spike-in factors relative to the library size factors. This is consistent with an increase in biosynthetic activity and total RNA content during stimulation, which reduces the relative spike-in coverage in each library (thereby decreasing the spike-in size factors) but increases the coverage of endogenous genes (thus increasing the library size factors).
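The contrast between spike-in and library size factors can be sketched in base R on toy data (illustrative counts and names): spike-in factors are computed from the spike-in rows only, so differences in endogenous RNA content pass through untouched.

```r
# Two cells with identical spike-in coverage; cell2 has twice the
# endogenous RNA content of cell1 (toy data).
endo  <- cbind(cell1=c(10, 20, 30), cell2=c(20, 40, 60))
spike <- cbind(cell1=c(5, 5),       cell2=c(5, 5))

# Library size factors absorb the difference in total RNA content,
# so scaling would remove it...
lib.sf <- colSums(endo) / mean(colSums(endo))

# ...whereas spike-in size factors, computed from the spike-ins alone,
# are identical for the two cells, preserving the biological difference.
spike.sf <- colSums(spike) / mean(colSums(spike))

lib.sf
spike.sf
```

After dividing by spike.sf, cell2's endogenous counts remain twice those of cell1, which is exactly the behavior described above for processes like T cell activation where total RNA content is itself of interest.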
```r
to.plot <- data.frame(
    DeconvFactor=calculateSumFactors(sce.richard),
    SpikeFactor=sizeFactors(sce.richard),
    Stimulus=sce.richard$stimulus,
    Time=sce.richard$time
)
ggplot(to.plot, aes(x=DeconvFactor, y=SpikeFactor, color=Time)) +
    geom_point() + facet_wrap(~Stimulus) +
    scale_x_log10() + scale_y_log10() +
    geom_abline(intercept=0, slope=1, color="red")
```

Figure 7.3: Size factors from spike-in normalization, plotted against the library size factors for all cells in the T cell dataset. Each plot represents a different ligand treatment and each point is a cell, colored by the time from stimulation.

The differences between these two sets of size factors have real consequences for downstream interpretation. If the spike-in size factors were applied to the counts, the expression values in unstimulated cells would be scaled up while expression in stimulated cells would be scaled down. However, the opposite would occur if the deconvolution size factors were used. This can manifest as shifts in the magnitude and direction of DE between conditions when we switch between normalization strategies, as shown below for Malat1 (Figure 7.4).

# See below for explanation of logNormCounts().
```r
sce.richard.deconv <- logNormCounts(sce.richard,
    size_factors=to.plot$DeconvFactor)
sce.richard.spike <- logNormCounts(sce.richard,
    size_factors=to.plot$SpikeFactor)

gridExtra::grid.arrange(
    plotExpression(sce.richard.deconv, x="stimulus", colour_by="time",
        features="ENSMUSG00000092341") +
        theme(axis.text.x = element_text(angle = 90)) +
        ggtitle("After deconvolution"),
    plotExpression(sce.richard.spike, x="stimulus", colour_by="time",
        features="ENSMUSG00000092341") +
        theme(axis.text.x = element_text(angle = 90)) +
        ggtitle("After spike-in normalization"),
    ncol=2
)
```

Figure 7.4: Distribution of log-normalized expression values for Malat1 after normalization with the deconvolution size factors (left) or spike-in size factors (right). Cells are stratified by the ligand affinity and colored by the time after stimulation.

Whether or not total RNA content is relevant - and thus, the choice of normalization strategy - depends on the biological hypothesis. In most cases, changes in total RNA content are not interesting and can be normalized out by applying the library size or deconvolution factors. However, this may not always be appropriate if differences in total RNA are associated with a biological process of interest, e.g., cell cycle activity or T cell activation. Spike-in normalization will preserve these differences such that any changes in expression between biological groups have the correct sign.

However! Regardless of whether we care about total RNA content, it is critical that the spike-in transcripts are normalized using the spike-in size factors. Size factors computed from the counts for endogenous genes should not be applied to the spike-in transcripts, precisely because the former captures differences in total RNA content that are not experienced by the latter.
Attempting to normalize the spike-in counts with the gene-based size factors will lead to over-normalization and incorrect quantification. Thus, if normalized spike-in data is required, we must compute a separate set of size factors for the spike-in transcripts; this is automatically performed by functions such as modelGeneVarWithSpikes().

7.5 Applying the size factors

7.5.1 Scaling and log-transforming

Once we have computed the size factors, we use the logNormCounts() function from scater to compute normalized expression values for each cell. This is done by dividing the count for each gene/spike-in transcript with the appropriate size factor for that cell. The function also log-transforms the normalized values, creating a new assay called "logcounts". (Technically, these are “log-transformed normalized expression values”, but that’s too much of a mouthful to fit into the assay name.) These log-values will be the basis of our downstream analyses in the following chapters.

```r
set.seed(100)
clust.zeisel <- quickCluster(sce.zeisel)
sce.zeisel <- computeSumFactors(sce.zeisel, cluster=clust.zeisel, min.mean=0.1)
sce.zeisel <- logNormCounts(sce.zeisel)
assayNames(sce.zeisel)
## [1] "counts"    "logcounts"
```

The log-transformation is useful as differences in the log-values represent log-fold changes in expression. This is important in downstream procedures based on Euclidean distances, which includes many forms of clustering and dimensionality reduction. By operating on log-transformed data, we ensure that these procedures are measuring distances between cells based on log-fold changes in expression. Or in other words, which is more interesting - a gene that is expressed at an average count of 50 in cell type \(A\) and 10 in cell type \(B\), or a gene that is expressed at an average count of 1100 in \(A\) and 1000 in \(B\)? Log-transformation focuses on the former by promoting contributions from genes with strong relative differences.
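The comparison above can be made explicit with a two-line calculation: on the log2 scale used for distances, the 5-fold change at low abundance contributes far more than the 10% change at high abundance.

```r
# Contribution of each gene to a log-scale (Euclidean) distance between
# the two cell types, using the counts from the example in the text.
lowab.fc  <- log2(50) - log2(10)      # log2(5): the 5-fold change
highab.fc <- log2(1100) - log2(1000)  # log2(1.1): the 10% change

c(lowab.fc, highab.fc)
```

On the raw count scale the second gene would dominate (a difference of 100 versus 40), which is exactly the behavior the log-transformation is designed to avoid.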
When log-transforming, we typically add a pseudo-count to avoid undefined values at zero. Larger pseudo-counts will effectively shrink the log-fold changes between cells towards zero for low-abundance genes, meaning that downstream high-dimensional analyses will be driven more by differences in expression for high-abundance genes. Conversely, smaller pseudo-counts will increase the relative contribution of low-abundance genes. Common practice is to use a pseudo-count of 1, for the simple pragmatic reason that it preserves sparsity in the original matrix (i.e., zeroes in the input remain zeroes after transformation). This works well in all but the most pathological scenarios (A. Lun 2018). Incidentally, the addition of the pseudo-count is the motivation for the centering of the size factors at unity. This ensures that both the pseudo-count and the normalized expression values are on the same scale; a pseudo-count of 1 can be interpreted as an extra read or UMI for each gene. In practical terms, centering means that the shrinkage effect of the pseudo-count diminishes as sequencing depth improves. This correctly ensures that estimates of the log-fold change in expression (e.g., from differences in the log-values between groups of cells) become increasingly accurate with deeper coverage. In contrast, if we applied a constant pseudo-count to some count-per-million-like measure, accuracy of the subsequent log-fold changes would never improve regardless of how much additional sequencing we performed. 7.5.2 Downsampling and log-transforming In rare cases, direct scaling of the counts is not appropriate due to the effect described by A. Lun (2018). Briefly, this is caused by the fact that the mean of the log-normalized counts is not the same as the log-transformed mean of the normalized counts. 
The difference between them depends on the mean and variance of the original counts, such that there is a systematic trend in the mean of the log-counts with respect to the count size. This typically manifests as trajectories correlated strongly with library size even after library size normalization, as shown in Figure 7.5 for synthetic scRNA-seq data generated with a pool-and-split approach (Tian et al. 2019).

```r
# TODO: move to scRNAseq.
library(BiocFileCache)
bfc <- BiocFileCache(ask=FALSE)
qcdata <- bfcrpath(bfc, "https://github.com/LuyiTian/CellBench_data/blob/master/data/mRNAmix_qc.RData?raw=true")

env <- new.env()
load(qcdata, envir=env)
sce.8qc <- env$sce8_qc

# Library size normalization and log-transformation.
sce.8qc <- logNormCounts(sce.8qc)
sce.8qc <- runPCA(sce.8qc)
gridExtra::grid.arrange(
    plotPCA(sce.8qc, colour_by=I(factor(sce.8qc$mix))),
    plotPCA(sce.8qc, colour_by=I(librarySizeFactors(sce.8qc))),
    ncol=2
)
```

Figure 7.5: PCA plot of all pool-and-split libraries in the SORT-seq CellBench data, computed from the log-normalized expression values with library size-derived size factors. Each point represents a library and is colored by the mixing ratio used to construct it (left) or by the size factor (right).

As the problem arises from differences in the sizes of the counts, the most straightforward solution is to downsample the counts of the high-coverage cells to match those of low-coverage cells. This uses the size factors to determine the amount of downsampling for each cell required to reach the 1st percentile of size factors. (The small minority of cells with smaller size factors are simply scaled up. We do not attempt to downsample to the smallest size factor, as this would result in excessive loss of information for one aberrant cell with very low size factors.)
We can see that this eliminates the library size factor-associated trajectories from the first two PCs, improving resolution of the known differences based on mixing ratios (Figure 7.6). The log-transformation is still necessary but no longer introduces a shift in the means when the sizes of the counts are similar across cells.

```r
sce.8qc2 <- logNormCounts(sce.8qc, downsample=TRUE)
sce.8qc2 <- runPCA(sce.8qc2)
gridExtra::grid.arrange(
    plotPCA(sce.8qc2, colour_by=I(factor(sce.8qc2$mix))),
    plotPCA(sce.8qc2, colour_by=I(librarySizeFactors(sce.8qc2))),
    ncol=2
)
```

Figure 7.6: PCA plot of pool-and-split libraries in the SORT-seq CellBench data, computed from the log-transformed counts after downsampling in proportion to the library size factors. Each point represents a library and is colored by the mixing ratio used to construct it (left) or by the size factor (right).

While downsampling is an expedient solution, it is statistically inefficient as it needs to increase the noise of high-coverage cells in order to avoid differences with low-coverage cells. It is also slower than simple scaling. Thus, we would only recommend using this approach after an initial analysis with scaled counts reveals suspicious trajectories that are strongly correlated with the size factors. In such cases, it is a simple matter to re-normalize by downsampling to determine whether the trajectory is an artifact of the log-transformation.

7.5.3 Other options

Of course, log-transformation is not the only possible transformation. More sophisticated approaches can be used such as dedicated variance stabilizing transformations (e.g., from the DESeq2 or sctransform packages), which out-perform the log-transformation for removal of the mean-variance trend. In practice, though, the log-transformation is a good default choice due to its simplicity (a.k.a., reliability, predictability and computational efficiency) and interpretability.
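For intuition about the downsampling strategy of Section 7.5.2, it amounts to binomial thinning of each cell's counts towards a common target coverage. A base R sketch on toy data follows (illustrative names; in the real analysis this is handled internally by logNormCounts(..., downsample=TRUE)):

```r
# Binomial thinning: subsample each cell's counts so that all cells
# reach (approximately) the coverage of the shallowest cell.
set.seed(100)
counts <- cbind(deep=rpois(100, lambda=50),     # high-coverage cell
                shallow=rpois(100, lambda=5))   # low-coverage cell

target <- min(colSums(counts))
down <- apply(counts, 2, function(x)
    rbinom(length(x), size=x, prob=target / sum(x)))

colSums(down)  # both cells now sit at roughly the target coverage
```

Because each retained read is an independent Bernoulli draw, the thinned counts of the deep cell now have the same count sizes, and hence the same log-transformation bias, as the shallow cell.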
Session Info

R version 4.0.4 (2021-02-15), running under Ubuntu 20.04.2 LTS, with Bioconductor 3.12 packages including scran 1.18.5, scater 1.18.6, scRNAseq 2.4.0 and BiocFileCache 1.14.0 (full package list omitted).

Chapter 8 Feature selection

8.1 Motivation

We often use scRNA-seq data in exploratory analyses to characterize heterogeneity across cells.
Procedures like clustering and dimensionality reduction compare cells based on their gene expression profiles, which involves aggregating per-gene differences into a single (dis)similarity metric between a pair of cells. The choice of genes to use in this calculation has a major impact on the behavior of the metric and the performance of downstream methods. We want to select genes that contain useful information about the biology of the system while removing genes that contain random noise. This aims to preserve interesting biological structure without the variance that obscures that structure, and to reduce the size of the data to improve computational efficiency of later steps. The simplest approach to feature selection is to select the most variable genes based on their expression across the population. This assumes that genuine biological differences will manifest as increased variation in the affected genes, compared to other genes that are only affected by technical noise or a baseline level of “uninteresting” biological variation (e.g., from transcriptional bursting). Several methods are available to quantify the variation per gene and to select an appropriate set of highly variable genes (HVGs). 
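To make the idea concrete before introducing the scran machinery, here is a deliberately naive sketch in base R that ranks simulated log-counts by their raw variance. This is not the approach used in this chapter, which also models the mean-variance trend; the matrix and gene names below are simulated, not taken from any real dataset.

```r
# Naive HVG selection on a simulated log-count matrix (2000 genes x 100 cells).
# This ignores the mean-variance relationship discussed in the next section.
set.seed(42)
logcounts <- matrix(rnorm(2000 * 100), nrow=2000,
    dimnames=list(paste0("gene", 1:2000), NULL))

# Per-gene variance across all cells.
gene.var <- apply(logcounts, 1, var)

# Keep the 100 genes with the largest variances.
naive.hvgs <- names(sort(gene.var, decreasing=TRUE))[1:100]
head(naive.hvgs)
```

The rest of this chapter refines this ranking so that a gene's abundance does not dominate its apparent variability.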
We will discuss these below using the 10X PBMC dataset for demonstration: View history #--- loading ---# library(DropletTestFiles) raw.path &lt;- getTestFile(&quot;tenx-2.1.0-pbmc4k/1.0.0/raw.tar.gz&quot;) out.path &lt;- file.path(tempdir(), &quot;pbmc4k&quot;) untar(raw.path, exdir=out.path) library(DropletUtils) fname &lt;- file.path(out.path, &quot;raw_gene_bc_matrices/GRCh38&quot;) sce.pbmc &lt;- read10xCounts(fname, col.names=TRUE) #--- gene-annotation ---# library(scater) rownames(sce.pbmc) &lt;- uniquifyFeatureNames( rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol) library(EnsDb.Hsapiens.v86) location &lt;- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce.pbmc)$ID, column=&quot;SEQNAME&quot;, keytype=&quot;GENEID&quot;) #--- cell-detection ---# set.seed(100) e.out &lt;- emptyDrops(counts(sce.pbmc)) sce.pbmc &lt;- sce.pbmc[,which(e.out$FDR &lt;= 0.001)] #--- quality-control ---# stats &lt;- perCellQCMetrics(sce.pbmc, subsets=list(Mito=which(location==&quot;MT&quot;))) high.mito &lt;- isOutlier(stats$subsets_Mito_percent, type=&quot;higher&quot;) sce.pbmc &lt;- sce.pbmc[,!high.mito] #--- normalization ---# library(scran) set.seed(1000) clusters &lt;- quickCluster(sce.pbmc) sce.pbmc &lt;- computeSumFactors(sce.pbmc, cluster=clusters) sce.pbmc &lt;- logNormCounts(sce.pbmc) sce.pbmc ## class: SingleCellExperiment ## dim: 33694 3985 ## metadata(1): Samples ## assays(2): counts logcounts ## rownames(33694): RP11-34P13.3 FAM138A ... AC213203.1 FAM231B ## rowData names(2): ID Symbol ## colnames(3985): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ... 
## TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1 ## colData names(3): Sample Barcode sizeFactor ## reducedDimNames(0): ## altExpNames(0): As well as the 416B dataset: View history #--- loading ---# library(scRNAseq) sce.416b &lt;- LunSpikeInData(which=&quot;416b&quot;) sce.416b$block &lt;- factor(sce.416b$block) #--- gene-annotation ---# library(AnnotationHub) ens.mm.v97 &lt;- AnnotationHub()[[&quot;AH73905&quot;]] rowData(sce.416b)$ENSEMBL &lt;- rownames(sce.416b) rowData(sce.416b)$SYMBOL &lt;- mapIds(ens.mm.v97, keys=rownames(sce.416b), keytype=&quot;GENEID&quot;, column=&quot;SYMBOL&quot;) rowData(sce.416b)$SEQNAME &lt;- mapIds(ens.mm.v97, keys=rownames(sce.416b), keytype=&quot;GENEID&quot;, column=&quot;SEQNAME&quot;) library(scater) rownames(sce.416b) &lt;- uniquifyFeatureNames(rowData(sce.416b)$ENSEMBL, rowData(sce.416b)$SYMBOL) #--- quality-control ---# mito &lt;- which(rowData(sce.416b)$SEQNAME==&quot;MT&quot;) stats &lt;- perCellQCMetrics(sce.416b, subsets=list(Mt=mito)) qc &lt;- quickPerCellQC(stats, percent_subsets=c(&quot;subsets_Mt_percent&quot;, &quot;altexps_ERCC_percent&quot;), batch=sce.416b$block) sce.416b &lt;- sce.416b[,!qc$discard] #--- normalization ---# library(scran) sce.416b &lt;- computeSumFactors(sce.416b) sce.416b &lt;- logNormCounts(sce.416b) sce.416b ## class: SingleCellExperiment ## dim: 46604 185 ## metadata(0): ## assays(2): counts logcounts ## rownames(46604): 4933401J01Rik Gm26206 ... CAAA01147332.1 ## CBFB-MYH11-mcherry ## rowData names(4): Length ENSEMBL SYMBOL SEQNAME ## colnames(185): SLX-9555.N701_S502.C89V9ANXX.s_1.r_1 ## SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 ... ## SLX-11312.N712_S507.H5H5YBBXX.s_8.r_1 ## SLX-11312.N712_S517.H5H5YBBXX.s_8.r_1 ## colData names(10): Source Name cell line ... 
block sizeFactor ## reducedDimNames(0): ## altExpNames(2): ERCC SIRV 8.2 Quantifying per-gene variation 8.2.1 Variance of the log-counts The simplest approach to quantifying per-gene variation is to simply compute the variance of the log-normalized expression values (referred to as “log-counts” for simplicity) for each gene across all cells in the population (A. T. L. Lun, McCarthy, and Marioni 2016). This has an advantage in that the feature selection is based on the same log-values that are used for later downstream steps. In particular, genes with the largest variances in log-values will contribute the most to the Euclidean distances between cells. By using log-values here, we ensure that our quantitative definition of heterogeneity is consistent throughout the entire analysis. Calculation of the per-gene variance is simple but feature selection requires modelling of the mean-variance relationship. As discussed briefly in Section 7.5.1, the log-transformation does not achieve perfect variance stabilization, which means that the variance of a gene is driven more by its abundance than its underlying biological heterogeneity. To account for this effect, we use the modelGeneVar() function to fit a trend to the variance with respect to abundance across all genes (Figure 8.1). library(scran) dec.pbmc &lt;- modelGeneVar(sce.pbmc) # Visualizing the fit: fit.pbmc &lt;- metadata(dec.pbmc) plot(fit.pbmc$mean, fit.pbmc$var, xlab=&quot;Mean of log-expression&quot;, ylab=&quot;Variance of log-expression&quot;) curve(fit.pbmc$trend(x), col=&quot;dodgerblue&quot;, add=TRUE, lwd=2) Figure 8.1: Variance in the PBMC data set as a function of the mean. Each point represents a gene while the blue line represents the trend fitted to all genes. At any given abundance, we assume that the expression profiles of most genes are dominated by random technical noise (see Section 8.2.3 for details). 
Under this assumption, our trend represents an estimate of the technical noise as a function of abundance. We then break down the total variance of each gene into the technical component, i.e., the fitted value of the trend at that gene’s abundance; and the biological component, defined as the difference between the total variance and the technical component. This biological component represents the “interesting” variation for each gene and can be used as the metric for HVG selection. # Ordering by most interesting genes for inspection. dec.pbmc[order(dec.pbmc$bio, decreasing=TRUE),] ## DataFrame with 33694 rows and 6 columns ## mean total tech bio p.value FDR ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## LYZ 1.95605 5.05854 0.835343 4.22320 1.10534e-270 2.17409e-266 ## S100A9 1.93416 4.53551 0.835439 3.70007 2.71036e-208 7.61572e-205 ## S100A8 1.69961 4.41084 0.824342 3.58650 4.31570e-201 9.43173e-198 ## HLA-DRA 2.09785 3.75174 0.831239 2.92050 5.93940e-132 4.86758e-129 ## CD74 2.90176 3.36879 0.793188 2.57560 4.83929e-113 2.50484e-110 ## ... ... ... ... ... ... ... ## TMSB4X 6.08142 0.441718 0.679215 -0.237497 0.992447 1 ## PTMA 3.82978 0.486454 0.731275 -0.244821 0.990002 1 ## HLA-B 4.50032 0.486130 0.739577 -0.253447 0.991376 1 ## EIF1 3.23488 0.482869 0.768946 -0.286078 0.995135 1 ## B2M 5.95196 0.314948 0.654228 -0.339280 0.999843 1 (Careful readers will notice that some genes have negative biological components, which have no obvious interpretation and can be ignored in most applications. They are inevitable when fitting a trend to the per-gene variances as approximately half of the genes will lie below the trend.) The trend fit has several useful parameters (see ?fitTrendVar) that can be tuned for a more appropriate fit. For example, the defaults can occasionally yield an overfitted trend when the few high-abundance genes are also highly variable. 
In such cases, users can reduce the contribution of those high-abundance genes by turning off density weights, as demonstrated in Figure 8.2 with a single donor from the Segerstolpe et al. (2016) dataset. View history #--- loading ---# library(scRNAseq) sce.seger &lt;- SegerstolpePancreasData() #--- gene-annotation ---# library(AnnotationHub) edb &lt;- AnnotationHub()[[&quot;AH73881&quot;]] symbols &lt;- rowData(sce.seger)$symbol ens.id &lt;- mapIds(edb, keys=symbols, keytype=&quot;SYMBOL&quot;, column=&quot;GENEID&quot;) ens.id &lt;- ifelse(is.na(ens.id), symbols, ens.id) # Removing duplicated rows. keep &lt;- !duplicated(ens.id) sce.seger &lt;- sce.seger[keep,] rownames(sce.seger) &lt;- ens.id[keep] #--- sample-annotation ---# emtab.meta &lt;- colData(sce.seger)[,c(&quot;cell type&quot;, &quot;disease&quot;, &quot;individual&quot;, &quot;single cell well quality&quot;)] colnames(emtab.meta) &lt;- c(&quot;CellType&quot;, &quot;Disease&quot;, &quot;Donor&quot;, &quot;Quality&quot;) colData(sce.seger) &lt;- emtab.meta sce.seger$CellType &lt;- gsub(&quot; cell&quot;, &quot;&quot;, sce.seger$CellType) sce.seger$CellType &lt;- paste0( toupper(substr(sce.seger$CellType, 1, 1)), substring(sce.seger$CellType, 2)) #--- quality-control ---# low.qual &lt;- sce.seger$Quality == &quot;low quality cell&quot; library(scater) stats &lt;- perCellQCMetrics(sce.seger) qc &lt;- quickPerCellQC(stats, percent_subsets=&quot;altexps_ERCC_percent&quot;, batch=sce.seger$Donor, subset=!sce.seger$Donor %in% c(&quot;HP1504901&quot;, &quot;HP1509101&quot;)) sce.seger &lt;- sce.seger[,!(qc$discard | low.qual)] #--- normalization ---# library(scran) clusters &lt;- quickCluster(sce.seger) sce.seger &lt;- computeSumFactors(sce.seger, clusters=clusters) sce.seger &lt;- logNormCounts(sce.seger) sce.seger &lt;- sce.seger[,sce.seger$Donor==&quot;HP1507101&quot;] dec.default &lt;- modelGeneVar(sce.seger) dec.noweight &lt;- modelGeneVar(sce.seger, density.weights=FALSE) fit.default &lt;- 
metadata(dec.default) plot(fit.default$mean, fit.default$var, xlab=&quot;Mean of log-expression&quot;, ylab=&quot;Variance of log-expression&quot;) curve(fit.default$trend(x), col=&quot;dodgerblue&quot;, add=TRUE, lwd=2) fit.noweight &lt;- metadata(dec.noweight) curve(fit.noweight$trend(x), col=&quot;red&quot;, add=TRUE, lwd=2) legend(&quot;topleft&quot;, col=c(&quot;dodgerblue&quot;, &quot;red&quot;), legend=c(&quot;Default&quot;, &quot;No weight&quot;), lwd=2) Figure 8.2: Variance in the Segerstolpe pancreas data set as a function of the mean. Each point represents a gene while the lines represent the trend fitted to all genes with default parameters (blue) or without weights (red). 8.2.2 Coefficient of variation An alternative approach to quantification uses the squared coefficient of variation (CV2) of the normalized expression values prior to log-transformation. The CV2 is a widely used metric for describing variation in non-negative data and is closely related to the dispersion parameter of the negative binomial distribution in packages like edgeR and DESeq2. We compute the CV2 for each gene in the PBMC dataset using the modelGeneCV2() function, which provides a robust implementation of the approach described by Brennecke et al. (2013). dec.cv2.pbmc &lt;- modelGeneCV2(sce.pbmc) This allows us to model the mean-variance relationship when considering the relevance of each gene (Figure 8.3). Again, our assumption is that most genes contain random noise and that the trend captures mostly technical variation. Large CV2 values that deviate strongly from the trend are likely to represent genes affected by biological structure. fit.cv2.pbmc &lt;- metadata(dec.cv2.pbmc) plot(fit.cv2.pbmc$mean, fit.cv2.pbmc$cv2, log=&quot;xy&quot;) curve(fit.cv2.pbmc$trend(x), col=&quot;dodgerblue&quot;, add=TRUE, lwd=2) Figure 8.3: CV2 in the PBMC data set as a function of the mean. Each point represents a gene while the blue line represents the fitted trend. 
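The deviation-from-trend idea can also be sketched without any model fitting. In this toy simulation the counts are pure Poisson noise, so the theoretical Poisson CV2 (1/mean) serves as a crude stand-in for the curve that modelGeneCV2() would actually fit.

```r
# Toy CV2 calculation on simulated Poisson counts (1000 genes x 200 cells).
# The reference trend here is the theoretical Poisson CV2 (1/mean),
# a stand-in for the fitted trend, not scran's actual implementation.
set.seed(2)
lambda <- runif(1000, 0.1, 10)
norm.counts <- matrix(rpois(1000 * 200, lambda=rep(lambda, 200)), nrow=1000)

m <- rowMeans(norm.counts)
cv2 <- apply(norm.counts, 1, var) / m^2

ratio <- cv2 / (1/m)   # deviation from the reference trend
summary(ratio)         # hovers around 1 for pure technical noise
```

Genes whose ratio greatly exceeds 1 would be the CV2-based HVG candidates.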
For each gene, we quantify the deviation from the trend in terms of the ratio of its CV2 to the fitted value of the trend at its abundance. This is more appropriate than directly subtracting the trend from the CV2, as the magnitude of the ratio is not affected by the mean. dec.cv2.pbmc[order(dec.cv2.pbmc$ratio, decreasing=TRUE),] ## DataFrame with 33694 rows and 6 columns ## mean total trend ratio p.value FDR ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## PPBP 2.2437397 132.364 0.803689 164.696 0 0 ## PRTFDC1 0.0658743 3197.564 20.266829 157.773 0 0 ## HIST1H2AC 1.3731487 175.035 1.176934 148.721 0 0 ## FAM81B 0.0477082 3654.419 27.902078 130.973 0 0 ## PF4 1.8333127 109.451 0.935484 116.999 0 0 ## ... ... ... ... ... ... ... ## AC023491.2 0 NaN Inf NaN NaN NaN ## AC233755.2 0 NaN Inf NaN NaN NaN ## AC233755.1 0 NaN Inf NaN NaN NaN ## AC213203.1 0 NaN Inf NaN NaN NaN ## FAM231B 0 NaN Inf NaN NaN NaN Both the CV2 and the variance of log-counts are effective metrics for quantifying variation in gene expression. The CV2 tends to give higher rank to low-abundance HVGs driven by upregulation in rare subpopulations, for which the increase in variance on the raw scale is stronger than that on the log-scale. However, the variation described by the CV2 is less directly relevant to downstream procedures operating on the log-counts, and the reliance on the ratio can assign high rank to uninteresting genes with low absolute variance. We prefer the use of the variance of log-counts and will use it in the following sections, though many of the same principles apply to procedures based on the CV2. 8.2.3 Quantifying technical noise Strictly speaking, the use of a trend fitted to endogenous genes assumes that the expression profiles of most genes are dominated by random technical noise. In practice, all expressed genes will exhibit some non-zero level of biological variability due to events like transcriptional bursting. 
This suggests that our estimates of the technical component are likely to be inflated. It would be more appropriate to consider these estimates as technical noise plus “uninteresting” biological variation, under the assumption that most genes are unaffected by the relevant heterogeneity in the population. This revised assumption is generally reasonable but may be problematic in some scenarios where many genes at a particular abundance are affected by a biological process. For example, strong upregulation of cell type-specific genes may result in an enrichment of HVGs at high abundances. This would inflate the fitted trend in that abundance interval and compromise the detection of the relevant genes. We can avoid this problem by fitting a mean-dependent trend to the variance of the spike-in transcripts (Figure 8.4), if they are available. The premise here is that spike-ins should not be affected by biological variation, so the fitted value of the spike-in trend should represent a better estimate of the technical component for each gene. dec.spike.416b &lt;- modelGeneVarWithSpikes(sce.416b, &quot;ERCC&quot;) dec.spike.416b[order(dec.spike.416b$bio, decreasing=TRUE),] ## DataFrame with 46604 rows and 6 columns ## mean total tech bio p.value FDR ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## Lyz2 6.61097 13.8497 1.57131 12.2784 1.48993e-186 1.54156e-183 ## Ccl9 6.67846 13.1869 1.50035 11.6866 2.21855e-185 2.19979e-182 ## Top2a 5.81024 14.1787 2.54776 11.6310 3.80016e-65 1.13040e-62 ## Cd200r3 4.83180 15.5613 4.22984 11.3314 9.46221e-24 6.08574e-22 ## Ccnb2 5.97776 13.1393 2.30177 10.8375 3.68706e-69 1.20193e-66 ## ... ... ... ... ... ... ... 
## Rpl5-ps2 3.60625 0.612623 6.32853 -5.71590 0.999616 0.999726 ## Gm11942 3.38768 0.798570 6.51473 -5.71616 0.999459 0.999726 ## Gm12816 2.91276 0.838670 6.57364 -5.73497 0.999422 0.999726 ## Gm13623 2.72844 0.708071 6.45448 -5.74641 0.999544 0.999726 ## Rps12l1 3.15420 0.746615 6.59332 -5.84670 0.999522 0.999726 plot(dec.spike.416b$mean, dec.spike.416b$total, xlab=&quot;Mean of log-expression&quot;, ylab=&quot;Variance of log-expression&quot;) fit.spike.416b &lt;- metadata(dec.spike.416b) points(fit.spike.416b$mean, fit.spike.416b$var, col=&quot;red&quot;, pch=16) curve(fit.spike.416b$trend(x), col=&quot;dodgerblue&quot;, add=TRUE, lwd=2) Figure 8.4: Variance in the 416B data set as a function of the mean. Each point represents a gene (black) or spike-in transcript (red) and the blue line represents the trend fitted to all spike-ins. In the absence of spike-in data, one can attempt to create a trend by making some distributional assumptions about the noise. For example, UMI counts typically exhibit near-Poisson variation if we only consider technical noise from library preparation and sequencing. This can be used to construct a mean-variance trend in the log-counts (Figure 8.5) with the modelGeneVarByPoisson() function. Note the increased residuals of the high-abundance genes, which can be interpreted as the amount of biological variation that was assumed to be “uninteresting” when fitting the gene-based trend in Figure 8.1. 
set.seed(0010101) dec.pois.pbmc &lt;- modelGeneVarByPoisson(sce.pbmc) dec.pois.pbmc &lt;- dec.pois.pbmc[order(dec.pois.pbmc$bio, decreasing=TRUE),] head(dec.pois.pbmc) ## DataFrame with 6 rows and 6 columns ## mean total tech bio p.value FDR ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## LYZ 1.95605 5.05854 0.631190 4.42735 0 0 ## S100A9 1.93416 4.53551 0.635102 3.90040 0 0 ## S100A8 1.69961 4.41084 0.671491 3.73935 0 0 ## HLA-DRA 2.09785 3.75174 0.604448 3.14730 0 0 ## CD74 2.90176 3.36879 0.444928 2.92386 0 0 ## CST3 1.47546 2.95646 0.691386 2.26507 0 0 plot(dec.pois.pbmc$mean, dec.pois.pbmc$total, pch=16, xlab=&quot;Mean of log-expression&quot;, ylab=&quot;Variance of log-expression&quot;) curve(metadata(dec.pois.pbmc)$trend(x), col=&quot;dodgerblue&quot;, add=TRUE) Figure 8.5: Variance of normalized log-expression values for each gene in the PBMC dataset, plotted against the mean log-expression. The blue line represents the mean-variance relationship corresponding to Poisson noise. Interestingly, trends based purely on technical noise tend to yield large biological components for highly-expressed genes. This often includes so-called “house-keeping” genes coding for essential cellular components such as ribosomal proteins, which are considered uninteresting for characterizing cellular heterogeneity. These observations suggest that a more accurate noise model does not necessarily yield a better ranking of HVGs, though one should keep an open mind - house-keeping genes are regularly DE in a variety of conditions (Glare et al. 2002; Nazari, Parham, and Maleki 2015; Guimaraes and Zavolan 2016), and the fact that they have large biological components indicates that there is strong variation across cells that may not be completely irrelevant. 
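The near-Poisson premise behind modelGeneVarByPoisson() is easy to check by simulation; in this sketch (simulated UMI-like counts only, no Bioconductor calls), the raw-scale variance of pure technical noise tracks the mean across all abundances.

```r
# Variance/mean ratio for simulated Poisson "UMI" counts (500 genes x 300 cells).
set.seed(3)
mu <- 2^runif(500, -3, 4)   # gene means spanning a couple of orders of magnitude
umi <- matrix(rpois(500 * 300, lambda=rep(mu, 300)), nrow=500)

raw.mean <- rowMeans(umi)
raw.var <- apply(umi, 1, var)

# For pure Poisson noise, variance/mean sits near 1 regardless of abundance.
summary(raw.var / raw.mean)
```

Real UMI data will show some overdispersion on top of this, which is precisely the variation that the Poisson-based trend attributes to biology.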
8.2.4 Accounting for blocking factors 8.2.4.1 Fitting block-specific trends Data containing multiple batches will often exhibit batch effects (see Chapter 28.8 for more details). We are usually not interested in HVGs that are driven by batch effects. Rather, we want to focus on genes that are highly variable within each batch. This is naturally achieved by performing trend fitting and variance decomposition separately for each batch. We demonstrate this approach by treating each plate (block) in the 416B dataset as a different batch, using the modelGeneVarWithSpikes() function. (The same argument is available in all other variance-modelling functions.) dec.block.416b &lt;- modelGeneVarWithSpikes(sce.416b, &quot;ERCC&quot;, block=sce.416b$block) head(dec.block.416b[order(dec.block.416b$bio, decreasing=TRUE),1:6]) ## DataFrame with 6 rows and 6 columns ## mean total tech bio p.value FDR ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## Lyz2 6.61235 13.8619 1.58416 12.2777 0.00000e+00 0.00000e+00 ## Ccl9 6.67841 13.2599 1.44553 11.8143 0.00000e+00 0.00000e+00 ## Top2a 5.81275 14.0192 2.74571 11.2734 3.89855e-137 8.43398e-135 ## Cd200r3 4.83305 15.5909 4.31892 11.2719 1.17783e-54 7.00722e-53 ## Ccnb2 5.97999 13.0256 2.46647 10.5591 1.20380e-151 2.98405e-149 ## Hbb-bt 4.91683 14.6539 4.12156 10.5323 2.52639e-49 1.34197e-47 The use of a batch-specific trend fit is useful as it accommodates differences in the mean-variance trends between batches. This is especially important if batches exhibit systematic technical differences, e.g., differences in coverage or in the amount of spike-in RNA added. In this case, there are only minor differences between the trends in Figure 8.6, which indicates that the experiment was tightly replicated across plates. 
The analysis of each plate yields estimates of the biological and technical components for each gene, which are averaged across plates to take advantage of information from multiple batches. par(mfrow=c(1,2)) blocked.stats &lt;- dec.block.416b$per.block for (i in colnames(blocked.stats)) { current &lt;- blocked.stats[[i]] plot(current$mean, current$total, main=i, pch=16, cex=0.5, xlab=&quot;Mean of log-expression&quot;, ylab=&quot;Variance of log-expression&quot;) curfit &lt;- metadata(current) points(curfit$mean, curfit$var, col=&quot;red&quot;, pch=16) curve(curfit$trend(x), col=&#39;dodgerblue&#39;, add=TRUE, lwd=2) } Figure 8.6: Variance in the 416B data set as a function of the mean after blocking on the plate of origin. Each plot represents the results for a single plate, each point represents a gene (black) or spike-in transcript (red) and the blue line represents the trend fitted to all spike-ins. As an aside, the wave-like shape observed above is typical of the mean-variance trend for log-expression values. (The same wave is present but much less pronounced for UMI data.) A linear increase in the variance is observed as the mean increases from zero, as larger variances are obviously possible when the counts are not all equal to zero. In contrast, the relative contribution of sampling noise decreases at high abundances, resulting in a downward trend. The peak represents the point at which these two competing effects cancel each other out. 8.2.4.2 Using a design matrix The use of block-specific trends is the recommended approach for experiments with a single blocking factor. However, this is not practical for studies involving a large number of blocking factors and/or covariates. In such cases, we can use the design= argument to specify a design matrix with uninteresting factors of variation. We illustrate again with the 416B data set, blocking on the plate of origin and oncogene induction. 
(The same argument is available in modelGeneVar() when spike-ins are not available.) design &lt;- model.matrix(~factor(block) + phenotype, colData(sce.416b)) dec.design.416b &lt;- modelGeneVarWithSpikes(sce.416b, &quot;ERCC&quot;, design=design) dec.design.416b[order(dec.design.416b$bio, decreasing=TRUE),] ## DataFrame with 46604 rows and 6 columns ## mean total tech bio p.value FDR ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## Lyz2 6.61097 8.90513 1.50405 7.40107 1.78185e-172 1.28493e-169 ## Ccnb2 5.97776 9.54373 2.24180 7.30192 7.77223e-77 1.44497e-74 ## Gem 5.90225 9.54358 2.35175 7.19183 5.49587e-68 8.12330e-66 ## Cenpa 5.81349 8.65622 2.48792 6.16830 2.08035e-45 1.52796e-43 ## Idh1 5.99343 8.32113 2.21965 6.10148 2.42819e-55 2.41772e-53 ## ... ... ... ... ... ... ... ## Gm5054 2.90434 0.463698 6.77000 -6.30630 1 1 ## Gm12191 3.55920 0.170709 6.53285 -6.36214 1 1 ## Gm7429 3.45394 0.248351 6.63458 -6.38623 1 1 ## Gm16378 2.83987 0.208215 6.74663 -6.53841 1 1 ## Rps2-ps2 3.11324 0.202307 6.78484 -6.58253 1 1 This strategy is simple but somewhat inaccurate as it does not consider the mean expression in each blocking level. Recall that the technical component is estimated as the fitted value of the trend at the average abundance for each gene. However, the true technical component is the average of the fitted values at the per-block means, which may be quite different for strong batch effects and non-linear mean-variance relationships. The block= approach is safer and should be preferred in all situations where it is applicable. 8.3 Selecting highly variable genes 8.3.1 Overview Once we have quantified the per-gene variation, the next step is to select the subset of HVGs to use in downstream analyses. 
A larger subset will reduce the risk of discarding interesting biological signal by retaining more potentially relevant genes, at the cost of increasing noise from irrelevant genes that might obscure said signal. It is difficult to determine the optimal trade-off for any given application as noise in one context may be useful signal in another. For example, heterogeneity in T cell activation responses is an interesting phenomenon (Richard et al. 2018) but may be irrelevant noise in studies that only care about distinguishing the major immunophenotypes. That said, there are several common strategies that are routinely used to guide HVG selection, which we shall discuss here. 8.3.2 Based on the largest metrics The simplest HVG selection strategy is to take the top \\(X\\) genes with the largest values for the relevant variance metric. The main advantage of this approach is that the user can directly control the number of genes retained, which ensures that the computational complexity of downstream calculations is easily predicted. For modelGeneVar() and modelGeneVarWithSpikes(), we would select the genes with the largest biological components: # Taking the top 1000 genes here: hvg.pbmc.var &lt;- getTopHVGs(dec.pbmc, n=1000) str(hvg.pbmc.var) ## chr [1:1000] &quot;LYZ&quot; &quot;S100A9&quot; &quot;S100A8&quot; &quot;HLA-DRA&quot; &quot;CD74&quot; &quot;CST3&quot; &quot;TYROBP&quot; ... For modelGeneCV2() (and its relative, modelGeneCV2WithSpikes()), this would instead be the genes with the largest ratios: hvg.pbmc.cv2 &lt;- getTopHVGs(dec.cv2.pbmc, var.field=&quot;ratio&quot;, n=1000) str(hvg.pbmc.cv2) ## chr [1:1000] &quot;PPBP&quot; &quot;PRTFDC1&quot; &quot;HIST1H2AC&quot; &quot;FAM81B&quot; &quot;PF4&quot; &quot;GNG11&quot; ... The choice of \\(X\\) also has a fairly straightforward biological interpretation. 
Recall our trend-fitting assumption that most genes do not exhibit biological heterogeneity; this implies that they are not differentially expressed between cell types or states in our population. If we quantify this assumption into a statement that, e.g., no more than 5% of genes are differentially expressed, we can naturally set \\(X\\) to 5% of the number of genes. In practice, we usually do not know the proportion of DE genes beforehand so this interpretation just exchanges one unknown for another. Nonetheless, it is still useful as it implies that we should lower \\(X\\) for less heterogeneous datasets, retaining most of the biological signal without unnecessary noise from irrelevant genes. Conversely, more heterogeneous datasets should use larger values of \\(X\\) to preserve secondary factors of variation beyond those driving the most obvious HVGs. The main disadvantage of this approach is that it turns HVG selection into a competition between genes, whereby a subset of very highly variable genes can push other informative genes out of the top set. This can be problematic for analyses of highly heterogeneous populations if the loss of important markers prevents the resolution of certain subpopulations. In the most extreme example, consider a situation where a single subpopulation is very different from the others. In such cases, the top set will be dominated by differentially expressed genes involving that distinct subpopulation, compromising resolution of heterogeneity between the other populations. (This can be salvaged with a nested analysis, as discussed in Section 10.7, but we would prefer to avoid the problem in the first place.) Another possible concern with this approach is the fact that the choice of \\(X\\) is fairly arbitrary, with any value from 500 to 5000 considered “reasonable”. We have chosen \\(X=1000\\) in the code above though there is no particular a priori reason for doing so. 
Our recommendation is to simply pick an arbitrary \\(X\\) and proceed with the rest of the analysis, with the intention of testing other choices later, rather than spending much time worrying about obtaining the “optimal” value. 8.3.3 Based on significance Another approach to feature selection is to set a fixed threshold of one of the metrics. This is most commonly done with the (adjusted) \\(p\\)-value reported by each of the above methods. The \\(p\\)-value for each gene is generated by testing against the null hypothesis that the variance is equal to the trend. For example, we might define our HVGs as all genes that have adjusted \\(p\\)-values below 0.05. hvg.pbmc.var.2 &lt;- getTopHVGs(dec.pbmc, fdr.threshold=0.05) length(hvg.pbmc.var.2) ## [1] 813 This approach is simple to implement and - if the test holds its size - it controls the false discovery rate (FDR). That is, it returns a subset of genes where the proportion of false positives is expected to be below the specified threshold. This can occasionally be useful in applications where the HVGs themselves are of interest. For example, if we were to use the list of HVGs in further experiments to verify the existence of heterogeneous expression for some of the genes, we would want to control the FDR in that list. The downside of this approach is that it is less predictable than the top \\(X\\) strategy. The number of genes returned depends on the type II error rate of the test and the severity of the multiple testing correction. One might obtain no genes or every gene at a given FDR threshold, depending on the circumstances. Moreover, control of the FDR is usually not helpful at this stage of the analysis. We are not interpreting the individual HVGs themselves but are only using them for feature selection prior to downstream steps. There is no reason to think that a 5% threshold on the FDR yields a more suitable compromise between bias and noise compared to the top \\(X\\) selection. 
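Mechanically, this kind of significance-based selection is just a Benjamini-Hochberg adjustment followed by a threshold. A base-R sketch on simulated p-values makes this explicit; the beta-distributed component is an arbitrary stand-in for genuinely variable genes, not a model of any real dataset.

```r
# FDR-thresholded selection on simulated p-values: 900 nulls plus
# 100 "true HVGs" with p-values concentrated near zero.
set.seed(5)
p <- c(runif(900), rbeta(100, 0.5, 200))

fdr <- p.adjust(p, method="BH")   # Benjamini-Hochberg adjustment
hvg.idx <- which(fdr <= 0.05)     # analogous to fdr.threshold=0.05 above
length(hvg.idx)
```

As the text notes, the number of selections here depends on the strength of the signal and the multiple testing burden rather than on any property we directly control.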
As an aside, we might consider ranking genes by the \\(p\\)-value instead of the biological component for use in a top \\(X\\) approach. This results in some counterintuitive behavior due to the nature of the underlying hypothesis test, which is based on the ratio of the total variance to the expected technical variance. Ranking based on \\(p\\)-value tends to prioritize HVGs that are more likely to be true positives but, at the same time, less likely to be biologically interesting. Many of the largest ratios are observed in high-abundance genes and are driven by very low technical variance; the total variance is typically modest for such genes, and they do not contribute much to population heterogeneity in absolute terms. (Note that the same can be said of the ratio of CV2 values, as briefly discussed above.) 8.3.4 Keeping all genes above the trend Here, the aim is to only remove the obviously uninteresting genes with variances below the trend. By doing so, we avoid the need to make any judgement calls regarding what level of variation is interesting enough to retain. This approach represents one extreme of the bias-variance trade-off where bias is minimized at the cost of maximizing noise. For modelGeneVar(), it equates to keeping all positive biological components: hvg.pbmc.var.3 &lt;- getTopHVGs(dec.pbmc, var.threshold=0) length(hvg.pbmc.var.3) ## [1] 12745 For modelGeneCV2(), this involves keeping all ratios above 1: hvg.pbmc.cv2.3 &lt;- getTopHVGs(dec.cv2.pbmc, var.field=&quot;ratio&quot;, var.threshold=1) length(hvg.pbmc.cv2.3) ## [1] 6643 By retaining all potential biological signal, we give secondary population structure the chance to manifest. This is most useful for rare subpopulations where the relevant markers will not exhibit strong overdispersion owing to the small number of affected cells. 
It will also preserve a weak but consistent effect across many genes with small biological components; admittedly, though, this is not of major interest in most scRNA-seq studies given the difficulty of experimentally validating population structure in the absence of strong marker genes. The obvious cost is that more noise is also captured, which can reduce the resolution of otherwise well-separated populations and mask the secondary signal that we were trying to preserve. The use of more genes also introduces more computational work in each downstream step. This strategy is thus best suited to very heterogeneous populations containing many different cell types (possibly across many datasets that are to be merged, as in Chapter 13) where there is a justified fear of ignoring marker genes for low-abundance subpopulations under a competitive top \\(X\\) approach.

8.4 Selecting a priori genes of interest

A blunt yet effective feature selection strategy is to use pre-defined sets of interesting genes. The aim is to focus on specific aspects of biological heterogeneity that may be masked by other factors when using unsupervised methods for HVG selection. One example application lies in the dissection of transcriptional changes during the earliest stages of cell fate commitment (Messmer et al. 2019), which may be modest relative to activity in other pathways (e.g., cell cycle, metabolism). Indeed, if our aim is to show that there is no meaningful heterogeneity in a given pathway, we would - at the very least - be obliged to repeat our analysis using only the genes in that pathway to maximize power for detecting such heterogeneity. Using scRNA-seq data in this manner is conceptually equivalent to a fluorescence activated cell sorting (FACS) experiment, with the convenience of being able to (re)define the features of interest at any time. For example, in the PBMC dataset, we might use some of the C7 immunologic signatures from MSigDB (Godec et al.
2016) to improve resolution of the various T cell subtypes. We stress that there is no shame in leveraging prior biological knowledge to address specific hypotheses in this manner. We say this because a common refrain in genomics is that the data analysis should be “unbiased”, i.e., free from any biological preconceptions. Attempting to derive biological insight ab initio is admirable but such “biases” are already present at every stage, starting from experimental design (why are we interested in this cell population in the first place?) and continuing through to interpretation of marker genes (Section 11).

library(msigdbr)
c7.sets <- msigdbr(species = "Homo sapiens", category = "C7")
head(unique(c7.sets$gs_name))
## [1] "GOLDRATH_EFF_VS_MEMORY_CD8_TCELL_DN"
## [2] "GOLDRATH_EFF_VS_MEMORY_CD8_TCELL_UP"
## [3] "GOLDRATH_NAIVE_VS_EFF_CD8_TCELL_DN"
## [4] "GOLDRATH_NAIVE_VS_EFF_CD8_TCELL_UP"
## [5] "GOLDRATH_NAIVE_VS_MEMORY_CD8_TCELL_DN"
## [6] "GOLDRATH_NAIVE_VS_MEMORY_CD8_TCELL_UP"

# Using the Goldrath sets to distinguish CD8 subtypes
cd8.sets <- c7.sets[grep("GOLDRATH", c7.sets$gs_name),]
cd8.genes <- rowData(sce.pbmc)$Symbol %in% cd8.sets$human_gene_symbol
summary(cd8.genes)
##    Mode   FALSE    TRUE 
## logical   32869     825

# Using GSE11924 to distinguish between T helper subtypes
th.sets <- c7.sets[grep("GSE11924", c7.sets$gs_name),]
th.genes <- rowData(sce.pbmc)$Symbol %in% th.sets$human_gene_symbol
summary(th.genes)
##    Mode   FALSE    TRUE 
## logical   31785    1909

# Using GSE11961 to distinguish between B cell subtypes
b.sets <- c7.sets[grep("GSE11961", c7.sets$gs_name),]
b.genes <- rowData(sce.pbmc)$Symbol %in% b.sets$human_gene_symbol
summary(b.genes)
##    Mode   FALSE    TRUE 
## logical   28192    5502

Of course, the downside of focusing on pre-defined genes is that it will limit our capacity to detect novel or unexpected aspects of variation.
Thus, this kind of focused analysis should be complementary to (rather than a replacement for) the unsupervised feature selection strategies discussed previously. Alternatively, we can invert this reasoning to remove genes that are unlikely to be of interest prior to downstream analyses. This eliminates unwanted variation that could mask relevant biology and interfere with interpretation of the results. Ribosomal protein genes or mitochondrial genes are common candidates for removal, especially in situations with varying levels of cell damage within a population. For immune cell subsets, we might also be inclined to remove immunoglobulin genes and T cell receptor genes for which clonal expression introduces (possibly irrelevant) population structure.

# Identifying ribosomal proteins:
ribo.discard <- grepl("^RP[SL]\\d+", rownames(sce.pbmc))
sum(ribo.discard)
## [1] 99

# A more curated approach for identifying ribosomal protein genes:
c2.sets <- msigdbr(species = "Homo sapiens", category = "C2")
ribo.set <- c2.sets[c2.sets$gs_name=="KEGG_RIBOSOME",]$human_gene_symbol
ribo.discard <- rownames(sce.pbmc) %in% ribo.set
sum(ribo.discard)
## [1] 87

library(AnnotationHub)
edb <- AnnotationHub()[["AH73881"]]
anno <- select(edb, keys=rowData(sce.pbmc)$ID, 
    keytype="GENEID", columns="TXBIOTYPE")

# Removing immunoglobulin variable chains:
igv.set <- anno$GENEID[anno$TXBIOTYPE %in% c("IG_V_gene", "IG_V_pseudogene")]
igv.discard <- rowData(sce.pbmc)$ID %in% igv.set
sum(igv.discard)
## [1] 326

# Removing TCR variable chains:
tcr.set <- anno$GENEID[anno$TXBIOTYPE %in% c("TR_V_gene", "TR_V_pseudogene")]
tcr.discard <- rowData(sce.pbmc)$ID %in% tcr.set
sum(tcr.discard)
## [1] 138

In practice, we tend to err on the side of caution and abstain from preemptive filtering on biological function until these genes are demonstrably problematic in
downstream analyses.

8.5 Putting it all together

The few lines of code below will select the top 10% of genes with the highest biological components.

dec.pbmc <- modelGeneVar(sce.pbmc)
chosen <- getTopHVGs(dec.pbmc, prop=0.1)
str(chosen)
## chr [1:1274] "LYZ" "S100A9" "S100A8" "HLA-DRA" "CD74" "CST3" "TYROBP" ...

We then have several options to enforce our HVG selection on the rest of the analysis. We can subset the SingleCellExperiment to only retain our selection of HVGs. This ensures that downstream methods will only use these genes for their calculations. The downside is that the non-HVGs are discarded from the new SingleCellExperiment, making it slightly more inconvenient to interrogate the full dataset for interesting genes that are not HVGs.

sce.pbmc.hvg <- sce.pbmc[chosen,]
dim(sce.pbmc.hvg)
## [1] 1274 3985

We can keep the original SingleCellExperiment object and specify the genes to use for downstream functions via an extra argument like subset.row=. This is useful if the analysis uses multiple sets of HVGs at different steps, whereby one set of HVGs can be easily swapped for another in specific steps.

# Performing PCA only on the chosen HVGs.
library(scater)
sce.pbmc <- runPCA(sce.pbmc, subset_row=chosen)
reducedDimNames(sce.pbmc)
## [1] "PCA"

This approach is facilitated by the rowSubset() utility, which allows us to easily store one or more sets of interest in our SingleCellExperiment. By doing so, we avoid the need to keep track of a separate chosen variable and ensure that our HVG set is synchronized with any downstream row subsetting of sce.pbmc.

rowSubset(sce.pbmc) <- chosen # stored in the default 'subset'.
rowSubset(sce.pbmc, "HVGs.more") <- getTopHVGs(dec.pbmc, prop=0.2)
rowSubset(sce.pbmc, "HVGs.less") <- getTopHVGs(dec.pbmc, prop=0.3)
colnames(rowData(sce.pbmc))
## [1] "ID"        "Symbol"    "subset"    "HVGs.more" "HVGs.less"

It can be inconvenient to repeatedly specify the desired feature set across steps, so some downstream functions will automatically subset to the default rowSubset() if present in the SingleCellExperiment. However, we find that it is generally safest to be explicit about which set is being used for a particular step. We can have our cake and eat it too by (ab)using the “alternative Experiment” system in the SingleCellExperiment class. Initially designed for storing alternative features like spike-ins or antibody tags, we can instead use it to hold our full dataset while we perform our downstream operations conveniently on the HVG subset. This avoids book-keeping problems in long analyses when the original dataset is not synchronized with the HVG subsetted data.

# Recycling the class above.
altExp(sce.pbmc.hvg, "original") <- sce.pbmc
altExpNames(sce.pbmc.hvg)
## [1] "original"

# No need for explicit subset_row= specification in downstream operations.
sce.pbmc.hvg <- runPCA(sce.pbmc.hvg)

# Recover original data:
sce.pbmc.original <- altExp(sce.pbmc.hvg, "original", withColData=TRUE)

Chapter 9 Dimensionality reduction

9.1 Overview

Many scRNA-seq analysis procedures involve comparing cells based on their expression values across multiple genes.
For example, clustering aims to identify cells with similar transcriptomic profiles by computing Euclidean distances across genes. In these applications, each individual gene represents a dimension of the data. More intuitively, if we had a scRNA-seq data set with two genes, we could make a two-dimensional plot where each axis represents the expression of one gene and each point in the plot represents a cell. This concept can be extended to data sets with thousands of genes where each cell’s expression profile defines its location in the high-dimensional expression space. As the name suggests, dimensionality reduction aims to reduce the number of separate dimensions in the data. This is possible because different genes are correlated if they are affected by the same biological process. Thus, we do not need to store separate information for individual genes, but can instead compress multiple features into a single dimension, e.g., an “eigengene” (Langfelder and Horvath 2007). This reduces computational work in downstream analyses like clustering, as calculations only need to be performed for a few dimensions rather than thousands of genes; reduces noise by averaging across multiple genes to obtain a more precise representation of the patterns in the data; and enables effective plotting of the data, for those of us who are not capable of visualizing more than 3 dimensions. We will use the Zeisel et al. (2015) dataset to demonstrate the applications of various dimensionality reduction methods in this chapter. 
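The two-gene picture described above can be made concrete with a small base-R sketch (a toy matrix, not real data):

# Two 'genes' (rows) measured in three 'cells' (columns).
mat <- rbind(geneA=c(1, 5, 5), geneB=c(2, 6, 1))
colnames(mat) <- paste0("cell", 1:3)

# Each cell is a point in two-dimensional 'gene space'; clustering-style
# comparisons reduce to Euclidean distances between these points.
dist(t(mat))

With thousands of genes, the same calculation applies, just in a much higher-dimensional space.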
#--- loading ---#
library(scRNAseq)
sce.zeisel <- ZeiselBrainData()

library(scater)
sce.zeisel <- aggregateAcrossFeatures(sce.zeisel, 
    id=sub("_loc[0-9]+$", "", rownames(sce.zeisel)))

#--- gene-annotation ---#
library(org.Mm.eg.db)
rowData(sce.zeisel)$Ensembl <- mapIds(org.Mm.eg.db, 
    keys=rownames(sce.zeisel), keytype="SYMBOL", column="ENSEMBL")

#--- quality-control ---#
stats <- perCellQCMetrics(sce.zeisel, subsets=list(
    Mt=rowData(sce.zeisel)$featureType=="mito"))
qc <- quickPerCellQC(stats, percent_subsets=c("altexps_ERCC_percent", 
    "subsets_Mt_percent"))
sce.zeisel <- sce.zeisel[,!qc$discard]

#--- normalization ---#
library(scran)
set.seed(1000)
clusters <- quickCluster(sce.zeisel)
sce.zeisel <- computeSumFactors(sce.zeisel, cluster=clusters)
sce.zeisel <- logNormCounts(sce.zeisel)

#--- variance-modelling ---#
dec.zeisel <- modelGeneVarWithSpikes(sce.zeisel, "ERCC")
top.hvgs <- getTopHVGs(dec.zeisel, prop=0.1)

sce.zeisel
## class: SingleCellExperiment 
## dim: 19839 2816 
## metadata(0):
## assays(2): counts logcounts
## rownames(19839): 0610005C13Rik 0610007N19Rik ... mt-Tw mt-Ty
## rowData names(2): featureType Ensembl
## colnames(2816): 1772071015_C02 1772071017_G12 ... 1772063068_D01
##   1772066098_A12
## colData names(11): tissue group # ... level2class sizeFactor
## reducedDimNames(0):
## altExpNames(2): ERCC repeat

9.2 Principal components analysis

Principal components analysis (PCA) discovers axes in high-dimensional space that capture the largest amount of variation. This is best understood by imagining each axis as a line. Say we draw a line anywhere, and we move all cells in our data set onto this line by the shortest path. The variance captured by this axis is defined as the variance across cells along that line.
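The idea of variance captured along successive axes can be illustrated with base R's prcomp() on a toy matrix (a sketch only; scater's runPCA() is the appropriate tool for SingleCellExperiment objects):

set.seed(0)
toy <- matrix(rnorm(200), ncol=10) # 20 'cells' by 10 'genes'
pc <- prcomp(toy)

# Variance captured by each axis, as a proportion of the total;
# successive PCs capture progressively less variation.
head(pc$sdev^2 / sum(pc$sdev^2))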
In PCA, the first axis (or “principal component”, PC) is chosen such that it captures the greatest variance across cells. The next PC is chosen such that it is orthogonal to the first and captures the greatest remaining amount of variation, and so on. By definition, the top PCs capture the dominant factors of heterogeneity in the data set. Thus, we can perform dimensionality reduction by restricting downstream analyses to the top PCs. This strategy is simple, highly effective and widely used throughout the data sciences. It takes advantage of the well-studied theoretical properties of the PCA - namely, that a low-rank approximation formed from the top PCs is the optimal approximation of the original data for a given matrix rank. It also allows us to use a wide range of fast PCA implementations for scalable and efficient data analysis. When applying PCA to scRNA-seq data, our assumption is that biological processes affect multiple genes in a coordinated manner. This means that the earlier PCs are likely to represent biological structure as more variation can be captured by considering the correlated behavior of many genes. By comparison, random technical or biological noise is expected to affect each gene independently. There is unlikely to be an axis that can capture random variation across many genes, meaning that noise should mostly be concentrated in the later PCs. This motivates the use of the earlier PCs in our downstream analyses, which concentrates the biological signal to simultaneously reduce computational work and remove noise. We perform the PCA on the log-normalized expression values using the runPCA() function from scater. By default, runPCA() will compute the first 50 PCs and store them in the reducedDims() of the output SingleCellExperiment object, as shown below. Here, we use only the top 2000 genes with the largest biological components to reduce both computational work and high-dimensional random noise. 
In particular, while PCA is robust to random noise, an excess of it may cause the earlier PCs to capture noise instead of biological structure (Johnstone and Lu 2009). This effect can be mitigated by restricting the PCA to a subset of HVGs, for which we can use any of the strategies described in Chapter 8.

library(scran)
top.zeisel <- getTopHVGs(dec.zeisel, n=2000)

library(scater)
set.seed(100) # See below.
sce.zeisel <- runPCA(sce.zeisel, subset_row=top.zeisel)
reducedDimNames(sce.zeisel)
## [1] "PCA"

dim(reducedDim(sce.zeisel, "PCA"))
## [1] 2816   50

For large data sets, greater efficiency is obtained by using approximate SVD algorithms that only compute the top PCs. By default, most PCA-related functions in scater and scran will use methods from the irlba or rsvd packages to perform the SVD. We can explicitly specify the SVD algorithm to use by passing a BiocSingularParam object (from the BiocSingular package) to the BSPARAM= argument (see Section 23.2.2 for more details). Many of these approximate algorithms are based on randomization and thus require set.seed() to obtain reproducible results.

library(BiocSingular)
set.seed(1000)
sce.zeisel <- runPCA(sce.zeisel, subset_row=top.zeisel, 
    BSPARAM=RandomParam(), name="IRLBA")
reducedDimNames(sce.zeisel)
## [1] "PCA"   "IRLBA"

9.3 Choosing the number of PCs

9.3.1 Motivation

How many of the top PCs should we retain for downstream analyses? The choice of the number of PCs \\(d\\) is a decision that is analogous to the choice of the number of HVGs to use. Using more PCs will retain more biological signal at the cost of including more noise that might mask said signal. Much like the choice of the number of HVGs, it is hard to determine whether an “optimal” choice exists for the number of PCs.
Even if we put aside the technical variation that is almost always uninteresting, there is no straightforward way to automatically determine which aspects of biological variation are relevant; one analyst’s biological signal may be irrelevant noise to another analyst with a different scientific question. Most practitioners will simply set \\(d\\) to a “reasonable” but arbitrary value, typically ranging from 10 to 50. This is often satisfactory as the later PCs explain so little variance that their inclusion or omission has no major effect. For example, in the Zeisel dataset, few PCs explain more than 1% of the variance in the entire dataset (Figure 9.1) and using, say, 30 \\(\\pm\\) 10 PCs would not even amount to four percentage points’ worth of difference in variance. In fact, the main consequence of using more PCs is simply that downstream calculations take longer as they need to compute over more dimensions, but most PC-related calculations are fast enough that this is not a practical concern.

percent.var <- attr(reducedDim(sce.zeisel), "percentVar")
plot(percent.var, log="y", xlab="PC", ylab="Variance explained (%)")

Figure 9.1: Percentage of variance explained by successive PCs in the Zeisel dataset, shown on a log-scale for visualization purposes.

Nonetheless, we will describe some more data-driven strategies to guide a suitable choice of \\(d\\). These automated choices are best treated as guidelines as they make some strong assumptions about what variation is “interesting”. More diligent readers may consider repeating the analysis with a variety of choices of \\(d\\) to explore other perspectives of the dataset at a different bias-variance trade-off, though this tends to be more work than necessary for most questions.

9.3.2 Using the elbow point

A simple heuristic for choosing \\(d\\) involves identifying the elbow point in the percentage of variance explained by successive PCs.
This refers to the “elbow” in the curve of a scree plot as shown in Figure 9.2.

# Percentage of variance explained is tucked away in the attributes.
percent.var <- attr(reducedDim(sce.zeisel), "percentVar")
chosen.elbow <- PCAtools::findElbowPoint(percent.var)
chosen.elbow
## [1] 7

plot(percent.var, xlab="PC", ylab="Variance explained (%)")
abline(v=chosen.elbow, col="red")

Figure 9.2: Percentage of variance explained by successive PCs in the Zeisel brain data. The identified elbow point is marked with a red line.

Our assumption is that each of the top PCs capturing biological signal should explain much more variance than the remaining PCs. Thus, there should be a sharp drop in the percentage of variance explained when we move past the last “biological” PC. This manifests as an elbow in the scree plot, the location of which serves as a natural choice for \\(d\\). Once this is identified, we can subset the reducedDims() entry to only retain the first \\(d\\) PCs of interest.

# Creating a new entry with only the first few PCs,
# useful if we still need the full set of PCs later.
reducedDim(sce.zeisel, "PCA.elbow") <- reducedDim(sce.zeisel)[,1:chosen.elbow]
reducedDimNames(sce.zeisel)
## [1] "PCA"       "IRLBA"     "PCA.elbow"

# Alternatively, just overwriting the original PCA entry. For demonstration
# purposes, we'll do this to a copy so that we still have full PCs later on.
sce.zeisel.copy <- sce.zeisel
reducedDim(sce.zeisel.copy) <- reducedDim(sce.zeisel.copy)[,1:chosen.elbow]
ncol(reducedDim(sce.zeisel.copy))
## [1] 7

From a practical perspective, the use of the elbow point tends to retain fewer PCs compared to other methods. The definition of “much more variance” is relative so, in order to be retained, later PCs must explain an amount of variance that is comparable to that explained by the first few PCs.
Strong biological variation in the early PCs will shift the elbow to the left, potentially excluding weaker (but still interesting) variation in the next PCs immediately following the elbow.

9.3.3 Using the technical noise

Another strategy is to retain all PCs until the percentage of total variation explained reaches some threshold \\(T\\). For example, we might retain the top set of PCs that explains 80% of the total variation in the data. Of course, it would be pointless to swap one arbitrary parameter \\(d\\) for another \\(T\\). Instead, we derive a suitable value for \\(T\\) by calculating the proportion of variance in the data that is attributed to the biological component. This is done using the denoisePCA() function with the variance modelling results from modelGeneVarWithSpikes() or related functions, where \\(T\\) is defined as the ratio of the sum of the biological components to the sum of total variances. To illustrate, we use this strategy to pick the number of PCs in the 10X PBMC dataset.
#--- loading ---#
library(DropletTestFiles)
raw.path <- getTestFile("tenx-2.1.0-pbmc4k/1.0.0/raw.tar.gz")
out.path <- file.path(tempdir(), "pbmc4k")
untar(raw.path, exdir=out.path)

library(DropletUtils)
fname <- file.path(out.path, "raw_gene_bc_matrices/GRCh38")
sce.pbmc <- read10xCounts(fname, col.names=TRUE)

#--- gene-annotation ---#
library(scater)
rownames(sce.pbmc) <- uniquifyFeatureNames(
    rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol)

library(EnsDb.Hsapiens.v86)
location <- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce.pbmc)$ID, 
    column="SEQNAME", keytype="GENEID")

#--- cell-detection ---#
set.seed(100)
e.out <- emptyDrops(counts(sce.pbmc))
sce.pbmc <- sce.pbmc[,which(e.out$FDR <= 0.001)]

#--- quality-control ---#
stats <- perCellQCMetrics(sce.pbmc, subsets=list(Mito=which(location=="MT")))
high.mito <- isOutlier(stats$subsets_Mito_percent, type="higher")
sce.pbmc <- sce.pbmc[,!high.mito]

#--- normalization ---#
library(scran)
set.seed(1000)
clusters <- quickCluster(sce.pbmc)
sce.pbmc <- computeSumFactors(sce.pbmc, cluster=clusters)
sce.pbmc <- logNormCounts(sce.pbmc)

#--- variance-modelling ---#
set.seed(1001)
dec.pbmc <- modelGeneVarByPoisson(sce.pbmc)
top.pbmc <- getTopHVGs(dec.pbmc, prop=0.1)

library(scran)
set.seed(111001001)
denoised.pbmc <- denoisePCA(sce.pbmc, technical=dec.pbmc, subset.row=top.pbmc)
ncol(reducedDim(denoised.pbmc))
## [1] 9

The dimensionality of the output represents the lower bound on the number of PCs required to retain all biological variation. This choice of \\(d\\) is motivated by the fact that any fewer PCs will definitely discard some aspect of biological signal. (Of course, the converse is not true; there is no guarantee that the retained PCs capture all of the signal, which is only generally possible if no dimensionality reduction is performed at all.)
From a practical perspective, the denoisePCA() approach usually retains more PCs than the elbow point method as the former does not compare PCs to each other and is less likely to discard PCs corresponding to secondary factors of variation. The downside is that many minor aspects of variation may not be interesting (e.g., transcriptional bursting) and their retention would only add irrelevant noise. Note that denoisePCA() imposes internal caps on the number of PCs that can be chosen in this manner. By default, the number is bounded within the “reasonable” limits of 5 and 50 to avoid selection of too few PCs (when technical noise is high relative to biological variation) or too many PCs (when technical noise is very low). For example, applying this function to the Zeisel brain data hits the upper limit:

set.seed(001001001)
denoised.zeisel <- denoisePCA(sce.zeisel, technical=dec.zeisel, 
    subset.row=top.zeisel)
ncol(reducedDim(denoised.zeisel))
## [1] 50

This method also tends to perform best when the mean-variance trend reflects the actual technical noise, i.e., estimated by modelGeneVarByPoisson() or modelGeneVarWithSpikes() instead of modelGeneVar() (Chapter 8). Variance modelling results from modelGeneVar() tend to understate the actual biological variation, especially in highly heterogeneous datasets where secondary factors of variation inflate the fitted values of the trend. Fewer PCs are subsequently retained because \\(T\\) is artificially lowered, as evidenced by denoisePCA() returning the lower limit of 5 PCs for the PBMC dataset:

dec.pbmc2 <- modelGeneVar(sce.pbmc)
denoised.pbmc2 <- denoisePCA(sce.pbmc, technical=dec.pbmc2, 
    subset.row=top.pbmc)
ncol(reducedDim(denoised.pbmc2))
## [1] 5

9.3.4 Based on population structure

Yet another method to choose \\(d\\) uses information about the number of subpopulations in the data.
Consider a situation where each subpopulation differs from the others along a different axis in the high-dimensional space (e.g., because it is defined by a unique set of marker genes). This suggests that we should set \\(d\\) to the number of unique subpopulations minus 1, which guarantees separation of all subpopulations while retaining as few dimensions (and noise) as possible. We can use this reasoning to loosely motivate an a priori choice for \\(d\\) - for example, if we expect around 10 different cell types in our population, we would set \\(d \\approx 10\\). In practice, the number of subpopulations is usually not known in advance. Rather, we use a heuristic approach that uses the number of clusters as a proxy for the number of subpopulations. We perform clustering (graph-based by default, see Chapter 10) on the first \\(d^*\\) PCs and only consider the values of \\(d^*\\) that yield no more than \\(d^*+1\\) clusters. If we detect more clusters with fewer dimensions, we consider this to represent overclustering rather than distinct subpopulations, assuming that multiple subpopulations should not be distinguishable on the same axes. We test a range of \\(d^*\\) and set \\(d\\) to the value that maximizes the number of clusters while satisfying the above condition. This attempts to capture as many distinct (putative) subpopulations as possible by retaining biological signal in later PCs, up until the point that the additional noise reduces resolution.

pcs <- reducedDim(sce.zeisel)
choices <- getClusteredPCs(pcs)
val <- metadata(choices)$chosen

plot(choices$n.pcs, choices$n.clusters,
    xlab="Number of PCs", ylab="Number of clusters")
abline(a=1, b=1, col="red")
abline(v=val, col="grey80", lty=2)

Figure 9.3: Number of clusters detected in the Zeisel brain dataset as a function of the number of PCs.
The red unbroken line represents the theoretical upper constraint on the number of clusters, while the grey dashed line is the number of PCs suggested by getClusteredPCs(). We subset the PC matrix by column to retain the first \\(d\\) PCs and assign the subsetted matrix back into our SingleCellExperiment object. Downstream applications that use the "PCA.clust" results in sce.zeisel will subsequently operate on the chosen PCs only.

reducedDim(sce.zeisel, "PCA.clust") <- pcs[,1:val]

This strategy is the most pragmatic as it directly addresses the role of the bias-variance trade-off in downstream analyses, specifically clustering. There is no need to preserve biological signal beyond what is distinguishable in later steps. However, it involves strong assumptions about the nature of the biological differences between subpopulations - and indeed, discrete subpopulations may not even exist in studies of continuous processes like differentiation. It also requires repeated applications of the clustering procedure on increasing numbers of PCs, which may be computationally expensive.

9.3.5 Using random matrix theory

We consider the observed (log-)expression matrix to be the sum of (i) a low-rank matrix containing the true biological signal for each cell and (ii) a random matrix representing the technical noise in the data. Under this interpretation, we can use random matrix theory to guide the choice of the number of PCs based on the properties of the noise matrix. The Marchenko-Pastur (MP) distribution defines an upper bound on the singular values of a matrix with random i.i.d. entries. Thus, all PCs associated with larger singular values are likely to contain real biological structure - or at least, signal beyond that expected by noise - and should be retained (Shekhar et al. 2016).
We can implement this scheme using the chooseMarchenkoPastur() function from the PCAtools package, given the dimensionality of the matrix used for the PCA (noting that we only used the HVG subset); the variance explained by each PC (not the percentage); and the variance of the noise matrix derived from our previous variance decomposition results. # Generating more PCs for demonstration purposes: set.seed(10100101) sce.zeisel2 &lt;- runPCA(sce.zeisel, subset_row=top.hvgs, ncomponents=200) mp.choice &lt;- PCAtools::chooseMarchenkoPastur( .dim=c(length(top.hvgs), ncol(sce.zeisel2)), var.explained=attr(reducedDim(sce.zeisel2), &quot;varExplained&quot;), noise=median(dec.zeisel[top.hvgs,&quot;tech&quot;])) mp.choice ## [1] 142 ## attr(,&quot;limit&quot;) ## [1] 2.236 We can then subset the PC coordinate matrix by the first mp.choice columns as previously demonstrated. It is best to treat this as a guideline only; PCs below the MP limit are not necessarily uninteresting, especially in noisy datasets where the higher noise drives a more aggressive choice of \\(d\\). Conversely, many PCs above the limit may not be relevant if they are driven by uninteresting biological processes like transcriptional bursting, cell cycle or metabolic variation. Moreover, the use of the MP distribution is not entirely justified here as the noise distribution differs by abundance for each gene and by sequencing depth for each cell. In a similar vein, Horn’s parallel analysis is commonly used to pick the number of PCs to retain in factor analysis. This involves randomizing the input matrix, repeating the PCA and creating a scree plot of the PCs of the randomized matrix. The desired number of PCs is then chosen based on the intersection of the randomized scree plot with that of the original matrix (Figure 9.4). Here, the reasoning is that PCs are unlikely to be interesting if they explain less variance than that of the corresponding PC of a random matrix. 
Note that this differs from the MP approach as we are not using the upper bound of randomized singular values to threshold the original PCs. set.seed(100010) horn &lt;- PCAtools::parallelPCA(logcounts(sce.zeisel)[top.hvgs,], BSPARAM=BiocSingular::IrlbaParam(), niters=10) horn$n ## [1] 24 plot(horn$original$variance, type=&quot;b&quot;, log=&quot;y&quot;, pch=16) permuted &lt;- horn$permuted for (i in seq_len(ncol(permuted))) { points(permuted[,i], col=&quot;grey80&quot;, pch=16) lines(permuted[,i], col=&quot;grey80&quot;, pch=16) } abline(v=horn$n, col=&quot;red&quot;) Figure 9.4: Percentage of variance explained by each PC in the original matrix (black) and the PCs in the randomized matrix (grey) across several randomization iterations. The red line marks the chosen number of PCs. The parallelPCA() function helpfully emits the PC coordinates in horn$original$rotated, which we can subset by horn$n and add to the reducedDims() of our SingleCellExperiment. Parallel analysis is reasonably intuitive (as random matrix methods go) and avoids any i.i.d. assumption across genes. However, its obvious disadvantage is the not-insignificant computational cost of randomizing and repeating the PCA. One can also debate whether the scree plot of the randomized matrix is even comparable to that of the original, given that the former includes biological variation and thus cannot be interpreted as purely technical noise. This manifests in Figure 9.4 as a consistently higher curve for the randomized matrix due to the redistribution of biological variation to the later PCs. Another approach is based on optimizing the reconstruction error of the low-rank representation. Recall that PCA produces both the matrix of per-cell coordinates and a rotation matrix of per-gene loadings, the product of which recovers the original log-expression matrix. 
If we subset these two matrices to the first \\(d\\) dimensions, the product of the resulting submatrices serves as an approximation of the original matrix. Under certain conditions, the difference between this approximation and the true low-rank signal (i.e., sans the noise matrix) has a defined minimum at a certain number of dimensions. This minimum can be defined using the chooseGavishDonoho() function from PCAtools as shown below. gv.choice &lt;- PCAtools::chooseGavishDonoho( .dim=c(length(top.hvgs), ncol(sce.zeisel2)), var.explained=attr(reducedDim(sce.zeisel2), &quot;varExplained&quot;), noise=median(dec.zeisel[top.hvgs,&quot;tech&quot;])) gv.choice ## [1] 59 ## attr(,&quot;limit&quot;) ## [1] 2.992 The Gavish-Donoho method is appealing as, unlike the other approaches for choosing \\(d\\), the concept of the optimum is rigorously defined. By minimizing the reconstruction error, we can most accurately represent the true biological variation in terms of the distances between cells in PC space. However, there remains some room for difference between “optimal” and “useful”; for example, noisy datasets may find themselves with very low \\(d\\) as including more PCs will only ever increase reconstruction error, regardless of whether they contain relevant biological variation. This approach is also dependent on some strong i.i.d. assumptions about the noise matrix. 9.4 Count-based dimensionality reduction For count matrices, correspondence analysis (CA) is a natural approach to dimensionality reduction. In this procedure, we compute an expected value for each entry in the matrix based on the per-gene abundance and size factors. Each count is converted into a standardized residual in a manner analogous to the calculation of the statistic in Pearson’s chi-squared tests, i.e., subtraction of the expected value and division by its square root. An SVD is then applied on this matrix of residuals to obtain the necessary low-dimensional coordinates for each cell. 
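The residual calculation described above can be sketched in a few lines of base R. This is a conceptual illustration only - not the corral implementation - and the function name is ours; the expected values here come from simple row and column totals rather than the size factors used by corral.

```r
# Conceptual sketch of CA-style standardized residuals:
# expected counts from the outer product of row/column totals,
# then (observed - expected) / sqrt(expected), as in the
# Pearson chi-squared statistic.
chisqResiduals <- function(counts) {
    expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)
    (counts - expected) / sqrt(expected)
}

# An SVD of the residual matrix would then yield low-dimensional
# per-cell coordinates, e.g.: svd(chisqResiduals(m))$v[, 1:2]
```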
To demonstrate, we use the corral package to compute CA factors for the Zeisel dataset. library(corral) sce.corral &lt;- corral_sce(sce.zeisel, subset_row=top.hvgs, col.w=sizeFactors(sce.zeisel)) dim(reducedDim(sce.corral, &quot;corral&quot;)) ## [1] 2816 30 The major advantage of CA is that it avoids difficulties with the mean-variance relationship upon transformation (Section 7.5.1). If two cells have the same expression profile but differences in their total counts, CA will return the same expected location for both cells; this avoids artifacts observed in PCA on log-transformed counts (Figure 9.5). However, CA is more sensitive to overdispersion in the random noise due to the nature of its standardization. This may cause some problems in some datasets where the CA factors may be driven by a few genes with random expression rather than the underlying biological structure. # TODO: move to scRNAseq. library(BiocFileCache) bfc &lt;- BiocFileCache(ask=FALSE) qcdata &lt;- bfcrpath(bfc, &quot;https://github.com/LuyiTian/CellBench_data/blob/master/data/mRNAmix_qc.RData?raw=true&quot;) env &lt;- new.env() load(qcdata, envir=env) sce.8qc &lt;- env$sce8_qc sce.8qc$mix &lt;- factor(sce.8qc$mix) # Choosing some HVGs for PCA: sce.8qc &lt;- logNormCounts(sce.8qc) dec.8qc &lt;- modelGeneVar(sce.8qc) hvgs.8qc &lt;- getTopHVGs(dec.8qc, n=1000) sce.8qc &lt;- runPCA(sce.8qc, subset_row=hvgs.8qc) # By comparison, corral operates on the raw counts: sce.8qc &lt;- corral_sce(sce.8qc, subset_row=hvgs.8qc, col.w=sizeFactors(sce.8qc)) gridExtra::grid.arrange( plotPCA(sce.8qc, colour_by=&quot;mix&quot;) + ggtitle(&quot;PCA&quot;), plotReducedDim(sce.8qc, &quot;corral&quot;, colour_by=&quot;mix&quot;) + ggtitle(&quot;corral&quot;), ncol=2 ) Figure 9.5: Dimensionality reduction results of all pool-and-split libraries in the SORT-seq CellBench data, computed by a PCA on the log-normalized expression values (left) or using the corral package (right). 
Each point represents a library and is colored by the mixing ratio used to construct it. 9.5 Dimensionality reduction for visualization 9.5.1 Motivation Another application of dimensionality reduction is to compress the data into 2 (sometimes 3) dimensions for plotting. This serves a separate purpose to the PCA-based dimensionality reduction described above. Algorithms are more than happy to operate on 10-50 PCs, but these are still too many dimensions for human comprehension. Further dimensionality reduction strategies are required to pack the most salient features of the data into 2 or 3 dimensions, which we will discuss below. 9.5.2 Visualizing with PCA The simplest visualization approach is to plot the top 2 PCs (Figure 9.6): plotReducedDim(sce.zeisel, dimred=&quot;PCA&quot;, colour_by=&quot;level1class&quot;) Figure 9.6: PCA plot of the first two PCs in the Zeisel brain data. Each point is a cell, coloured according to the annotation provided by the original authors. The problem is that PCA is a linear technique, i.e., only variation along a line in high-dimensional space is captured by each PC. As such, it cannot efficiently pack differences in \\(d\\) dimensions into the first 2 PCs. This is demonstrated in Figure 9.6 where the top two PCs fail to resolve some subpopulations identified by Zeisel et al. (2015). If the first PC is devoted to resolving the biggest difference between subpopulations, and the second PC is devoted to resolving the next biggest difference, then the remaining differences will not be visible in the plot. One workaround is to plot several of the top PCs against each other in pairwise plots (Figure 9.7). However, it is difficult to interpret multiple plots simultaneously, and even this approach is not sufficient to separate some of the annotated subpopulations. plotReducedDim(sce.zeisel, dimred=&quot;PCA&quot;, ncomponents=4, colour_by=&quot;level1class&quot;) Figure 9.7: Pairwise PCA plots of the first four PCs in the Zeisel brain data. 
Each point is a cell, coloured according to the annotation provided by the original authors. There are some advantages to using PCA for visualization. It is predictable and will not introduce artificial structure in the visualization. It is also deterministic and robust to small changes in the input values. However, as shown above, PCA is usually not satisfactory for visualization of complex populations. 9.5.3 \\(t\\)-stochastic neighbor embedding The de facto standard for visualization of scRNA-seq data is the \\(t\\)-stochastic neighbor embedding (\\(t\\)-SNE) method (Van der Maaten and Hinton 2008). This attempts to find a low-dimensional representation of the data that preserves the distances between each point and its neighbors in the high-dimensional space. Unlike PCA, it is not restricted to linear transformations, nor is it obliged to accurately represent distances between distant populations. This means that it has much more freedom in how it arranges cells in low-dimensional space, enabling it to separate many distinct clusters in a complex population (Figure 9.8). set.seed(00101001101) # runTSNE() stores the t-SNE coordinates in the reducedDims # for re-use across multiple plotReducedDim() calls. sce.zeisel &lt;- runTSNE(sce.zeisel, dimred=&quot;PCA&quot;) plotReducedDim(sce.zeisel, dimred=&quot;TSNE&quot;, colour_by=&quot;level1class&quot;) Figure 9.8: \\(t\\)-SNE plots constructed from the top PCs in the Zeisel brain dataset. Each point represents a cell, coloured according to the published annotation. One of the main disadvantages of \\(t\\)-SNE is that it is much more computationally intensive than other visualization methods. We mitigate this effect by setting dimred=\"PCA\" in runTSNE(), which instructs the function to perform the \\(t\\)-SNE calculations on the top PCs to exploit the data compaction and noise removal provided by the PCA. It is possible to run \\(t\\)-SNE on the original expression matrix but this is less efficient. 
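For comparison, running \\(t\\)-SNE directly on the (HVG-subsetted) log-expression values might look as follows. This is a sketch only; the \"TSNE.direct\" name is our own choice, and we assume subset_row and name are passed through to the underlying scater machinery as in other run*() functions.

```r
# Slower alternative: t-SNE on the log-expression values directly,
# skipping the PCA compression and denoising step.
set.seed(00101001101)
sce.zeisel <- runTSNE(sce.zeisel, subset_row=top.hvgs, name="TSNE.direct")
plotReducedDim(sce.zeisel, dimred="TSNE.direct", colour_by="level1class")
```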
Another issue with \\(t\\)-SNE is that it requires the user to be aware of additional parameters (discussed here in some depth). It involves a random initialization so we need to (i) repeat the visualization several times to ensure that the results are representative and (ii) set the seed to ensure that the chosen results are reproducible. The “perplexity” is another important parameter that determines the granularity of the visualization (Figure 9.9). Low perplexities will favor resolution of finer structure, possibly to the point that the visualization is compromised by random noise. Thus, it is advisable to test different perplexity values to ensure that the choice of perplexity does not drive the interpretation of the plot. set.seed(100) sce.zeisel &lt;- runTSNE(sce.zeisel, dimred=&quot;PCA&quot;, perplexity=5) out5 &lt;- plotReducedDim(sce.zeisel, dimred=&quot;TSNE&quot;, colour_by=&quot;level1class&quot;) + ggtitle(&quot;perplexity = 5&quot;) set.seed(100) sce.zeisel &lt;- runTSNE(sce.zeisel, dimred=&quot;PCA&quot;, perplexity=20) out20 &lt;- plotReducedDim(sce.zeisel, dimred=&quot;TSNE&quot;, colour_by=&quot;level1class&quot;) + ggtitle(&quot;perplexity = 20&quot;) set.seed(100) sce.zeisel &lt;- runTSNE(sce.zeisel, dimred=&quot;PCA&quot;, perplexity=80) out80 &lt;- plotReducedDim(sce.zeisel, dimred=&quot;TSNE&quot;, colour_by=&quot;level1class&quot;) + ggtitle(&quot;perplexity = 80&quot;) gridExtra::grid.arrange(out5, out20, out80, ncol=3) Figure 9.9: \\(t\\)-SNE plots constructed from the top PCs in the Zeisel brain dataset, using a range of perplexity values. Each point represents a cell, coloured according to its annotation. Finally, it is tempting to interpret the \\(t\\)-SNE results as a “map” of single-cell identities. This is generally unwise as any such interpretation is easily misled by the size and positions of the visual clusters. 
Specifically, \\(t\\)-SNE will inflate dense clusters and compress sparse ones, such that we cannot use the size as a measure of subpopulation heterogeneity. Similarly, \\(t\\)-SNE is not obliged to preserve the relative locations of non-neighboring clusters, such that we cannot use their positions to determine relationships between distant clusters. We provide some suggestions on how to interpret these plots in Section 9.5.5. Despite its shortcomings, \\(t\\)-SNE is a proven tool for general-purpose visualization of scRNA-seq data and remains a popular choice in many analysis pipelines. 9.5.4 Uniform manifold approximation and projection The uniform manifold approximation and projection (UMAP) method (McInnes, Healy, and Melville 2018) is an alternative to \\(t\\)-SNE for non-linear dimensionality reduction. It is roughly similar to \\(t\\)-SNE in that it also tries to find a low-dimensional representation that preserves relationships between neighbors in high-dimensional space. However, the two methods are based on different theory, represented by differences in the various graph weighting equations. This manifests as a different visualization as shown in Figure 9.10. set.seed(1100101001) sce.zeisel &lt;- runUMAP(sce.zeisel, dimred=&quot;PCA&quot;) plotReducedDim(sce.zeisel, dimred=&quot;UMAP&quot;, colour_by=&quot;level1class&quot;) Figure 9.10: UMAP plots constructed from the top PCs in the Zeisel brain dataset. Each point represents a cell, coloured according to the published annotation. Compared to \\(t\\)-SNE, the UMAP visualization tends to have more compact visual clusters with more empty space between them. It also attempts to preserve more of the global structure than \\(t\\)-SNE. From a practical perspective, UMAP is much faster than \\(t\\)-SNE, which may be an important consideration for large datasets. (Nonetheless, we have still run UMAP on the top PCs here for consistency.) 
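As with \\(t\\)-SNE's perplexity, the granularity of the UMAP embedding can be tuned. A sketch varying two commonly adjusted parameters passed through to uwot (n_neighbors and min_dist) is shown below; the specific values and the \"UMAP.broad\" name are arbitrary choices for illustration.

```r
# Sketch: a broader UMAP view with more neighbors and a larger
# minimum distance, stored separately from the default "UMAP" entry.
set.seed(1100101001)
sce.zeisel <- runUMAP(sce.zeisel, dimred="PCA",
    n_neighbors=30, min_dist=0.5, name="UMAP.broad")
plotReducedDim(sce.zeisel, dimred="UMAP.broad", colour_by="level1class")
```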
UMAP also involves a series of randomization steps so setting the seed is critical. Like \\(t\\)-SNE, UMAP has its own suite of hyperparameters that affect the visualization. Of these, the number of neighbors (n_neighbors) and the minimum distance between embedded points (min_dist) have the greatest effect on the granularity of the output. If these values are too low, random noise will be incorrectly treated as high-resolution structure, while values that are too high will discard fine structure altogether in favor of obtaining an accurate overview of the entire dataset. Again, it is a good idea to test a range of values for these parameters to ensure that they do not compromise any conclusions drawn from a UMAP plot. It is arguable whether the UMAP or \\(t\\)-SNE visualizations are more useful or aesthetically pleasing. UMAP aims to preserve more global structure but this necessarily reduces resolution within each visual cluster. However, UMAP is unarguably much faster, and for that reason alone, it is increasingly displacing \\(t\\)-SNE as the method of choice for visualizing large scRNA-seq data sets. 9.5.5 Interpreting the plots Dimensionality reduction for visualization necessarily involves discarding information and distorting the distances between cells in order to fit high-dimensional data into a 2-dimensional space. One might wonder whether the results of such extreme data compression can be trusted. Indeed, some of our more quantitative colleagues consider such visualizations to be more artistic than scientific, fit for little but impressing collaborators and reviewers! Perhaps this perspective is not entirely invalid, but we suggest that there is some value to be extracted from them provided that they are accompanied by an analysis of a higher-rank representation. As a general rule, focusing on local neighborhoods provides the safest interpretation of \\(t\\)-SNE and UMAP plots. 
These methods spend considerable effort to ensure that each cell’s nearest neighbors in high-dimensional space are still its neighbors in the two-dimensional embedding. Thus, if we see multiple cell types or clusters in a single unbroken “island” in the embedding, we could infer that those populations were also close neighbors in higher-dimensional space. However, less can be said about the distances between non-neighboring cells; there is no guarantee that large distances are faithfully recapitulated in the embedding, given the distortions necessary for this type of dimensionality reduction. It would be courageous to use the distances between islands (measured, on occasion, with a ruler!) to make statements about the relative similarity of distinct cell types. On a related note, we prefer to reserve the \\(t\\)-SNE/UMAP coordinates for visualization and use the higher-rank representation for any quantitative analyses. To illustrate, consider the interaction between clustering and \\(t\\)-SNE. We do not perform clustering on the \\(t\\)-SNE coordinates, but rather, we cluster on the first 10-50 PCs (Chapter 10) and then visualize the cluster identities on \\(t\\)-SNE plots like that in Figure 9.8. This ensures that clustering makes use of the information that was lost during compression into two dimensions for visualization. The plot can then be used for a diagnostic inspection of the clustering output, e.g., to check which clusters are close neighbors or whether a cluster can be split into further subclusters; this follows the aforementioned theme of focusing on local structure. From a naive perspective, using the \\(t\\)-SNE coordinates directly for clustering is tempting as it ensures that any results are immediately consistent with the visualization. 
Given that clustering is rather arbitrary anyway, there is nothing inherently wrong with this strategy - in fact, it can be treated as a rather circuitous implementation of graph-based clustering (Section 10.3). However, the enforced consistency can actually be considered a disservice as it masks the ambiguity of the conclusions, either due to the loss of information from dimensionality reduction or the uncertainty of the clustering. Rather than being errors, major discrepancies can instead be useful for motivating further investigation into the less obvious aspects of the dataset; conversely, the lack of discrepancies increases trust in the conclusions. Or perhaps more bluntly: do not let the tail (of visualization) wag the dog (of quantitative analysis). Session Info View session info R version 4.0.4 (2021-02-15) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.12-books/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.12-books/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] BiocFileCache_1.14.0 dbplyr_2.1.0 [3] corral_1.0.0 BiocSingular_1.6.0 [5] scater_1.18.6 ggplot2_3.3.3 [7] scran_1.18.5 SingleCellExperiment_1.12.0 [9] SummarizedExperiment_1.20.0 Biobase_2.50.0 [11] GenomicRanges_1.42.0 GenomeInfoDb_1.26.4 [13] IRanges_2.24.1 S4Vectors_0.28.1 [15] BiocGenerics_0.36.0 MatrixGenerics_1.2.1 [17] matrixStats_0.58.0 BiocStyle_2.18.1 [19] rebook_1.0.0 loaded via a namespace (and not attached): [1] Rtsne_0.15 ggbeeswarm_0.6.0 [3] colorspace_2.0-0 ellipsis_0.3.1 [5] scuttle_1.0.4 bluster_1.0.0 [7] XVector_0.30.0 BiocNeighbors_1.8.2 [9] 
dichromat_2.0-0 farver_2.1.0 [11] bit64_4.0.5 MultiAssayExperiment_1.16.0 [13] ggrepel_0.9.1 RSpectra_0.16-0 [15] fansi_0.4.2 codetools_0.2-18 [17] sparseMatrixStats_1.2.1 cachem_1.0.4 [19] knitr_1.31 jsonlite_1.7.2 [21] uwot_0.1.10 graph_1.68.0 [23] httr_1.4.2 BiocManager_1.30.10 [25] mapproj_1.2.7 compiler_4.0.4 [27] dqrng_0.2.1 fastmap_1.1.0 [29] assertthat_0.2.1 Matrix_1.3-2 [31] limma_3.46.0 htmltools_0.5.1.1 [33] tools_4.0.4 rsvd_1.0.3 [35] igraph_1.2.6 gtable_0.3.0 [37] glue_1.4.2 GenomeInfoDbData_1.2.4 [39] reshape2_1.4.4 dplyr_1.0.5 [41] rappdirs_0.3.3 ggthemes_4.2.4 [43] maps_3.3.0 Rcpp_1.0.6 [45] jquerylib_0.1.3 vctrs_0.3.6 [47] DelayedMatrixStats_1.12.3 xfun_0.22 [49] stringr_1.4.0 ps_1.6.0 [51] beachmat_2.6.4 lifecycle_1.0.0 [53] irlba_2.3.3 PCAtools_2.2.0 [55] statmod_1.4.35 XML_3.99-0.6 [57] edgeR_3.32.1 zlibbioc_1.36.0 [59] scales_1.1.1 curl_4.3 [61] yaml_2.2.1 memoise_2.0.0 [63] gridExtra_2.3 sass_0.3.1 [65] RSQLite_2.2.4 stringi_1.5.3 [67] highr_0.8 BiocParallel_1.24.1 [69] pals_1.6 rlang_0.4.10 [71] pkgconfig_2.0.3 bitops_1.0-6 [73] RMTstat_0.3 evaluate_0.14 [75] lattice_0.20-41 purrr_0.3.4 [77] labeling_0.4.2 CodeDepends_0.6.5 [79] bit_4.0.4 cowplot_1.1.1 [81] transport_0.12-2 processx_3.4.5 [83] tidyselect_1.1.0 plyr_1.8.6 [85] magrittr_2.0.1 bookdown_0.21 [87] R6_2.5.0 generics_0.1.0 [89] DelayedArray_0.16.2 DBI_1.1.1 [91] pillar_1.5.1 withr_2.4.1 [93] RCurl_1.98-1.3 tibble_3.1.0 [95] crayon_1.4.1 utf8_1.2.1 [97] rmarkdown_2.7 viridis_0.5.1 [99] locfit_1.5-9.4 grid_4.0.4 [101] data.table_1.14.0 FNN_1.1.3 [103] blob_1.2.1 callr_3.5.1 [105] digest_0.6.27 munsell_0.5.0 [107] beeswarm_0.3.1 viridisLite_0.3.0 [109] vipor_0.4.5 bslib_0.2.4 Bibliography "],["clustering.html", "Chapter 10 Clustering 10.1 Motivation 10.2 What is the “true clustering”? 
10.3 Graph-based clustering 10.4 \\(k\\)-means clustering 10.5 Hierarchical clustering 10.6 General-purpose cluster diagnostics 10.7 Subclustering Session Info", " Chapter 10 Clustering 10.1 Motivation Clustering is an unsupervised learning procedure that is used in scRNA-seq data analysis to empirically define groups of cells with similar expression profiles. Its primary purpose is to summarize the data in a digestible format for human interpretation. This allows us to describe population heterogeneity in terms of discrete labels that are easily understood, rather than attempting to comprehend the high-dimensional manifold on which the cells truly reside. After annotation based on marker genes, the clusters can be treated as proxies for more abstract biological concepts such as cell types or states. Clustering is thus a critical step for extracting biological insights from scRNA-seq data. Here, we demonstrate the application of several commonly used methods with the 10X PBMC dataset. 
View history #--- loading ---# library(DropletTestFiles) raw.path &lt;- getTestFile(&quot;tenx-2.1.0-pbmc4k/1.0.0/raw.tar.gz&quot;) out.path &lt;- file.path(tempdir(), &quot;pbmc4k&quot;) untar(raw.path, exdir=out.path) library(DropletUtils) fname &lt;- file.path(out.path, &quot;raw_gene_bc_matrices/GRCh38&quot;) sce.pbmc &lt;- read10xCounts(fname, col.names=TRUE) #--- gene-annotation ---# library(scater) rownames(sce.pbmc) &lt;- uniquifyFeatureNames( rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol) library(EnsDb.Hsapiens.v86) location &lt;- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce.pbmc)$ID, column=&quot;SEQNAME&quot;, keytype=&quot;GENEID&quot;) #--- cell-detection ---# set.seed(100) e.out &lt;- emptyDrops(counts(sce.pbmc)) sce.pbmc &lt;- sce.pbmc[,which(e.out$FDR &lt;= 0.001)] #--- quality-control ---# stats &lt;- perCellQCMetrics(sce.pbmc, subsets=list(Mito=which(location==&quot;MT&quot;))) high.mito &lt;- isOutlier(stats$subsets_Mito_percent, type=&quot;higher&quot;) sce.pbmc &lt;- sce.pbmc[,!high.mito] #--- normalization ---# library(scran) set.seed(1000) clusters &lt;- quickCluster(sce.pbmc) sce.pbmc &lt;- computeSumFactors(sce.pbmc, cluster=clusters) sce.pbmc &lt;- logNormCounts(sce.pbmc) #--- variance-modelling ---# set.seed(1001) dec.pbmc &lt;- modelGeneVarByPoisson(sce.pbmc) top.pbmc &lt;- getTopHVGs(dec.pbmc, prop=0.1) #--- dimensionality-reduction ---# set.seed(10000) sce.pbmc &lt;- denoisePCA(sce.pbmc, subset.row=top.pbmc, technical=dec.pbmc) set.seed(100000) sce.pbmc &lt;- runTSNE(sce.pbmc, dimred=&quot;PCA&quot;) set.seed(1000000) sce.pbmc &lt;- runUMAP(sce.pbmc, dimred=&quot;PCA&quot;) sce.pbmc ## class: SingleCellExperiment ## dim: 33694 3985 ## metadata(1): Samples ## assays(2): counts logcounts ## rownames(33694): RP11-34P13.3 FAM138A ... AC213203.1 FAM231B ## rowData names(2): ID Symbol ## colnames(3985): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ... 
## TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1 ## colData names(3): Sample Barcode sizeFactor ## reducedDimNames(3): PCA TSNE UMAP ## altExpNames(0): 10.2 What is the “true clustering”? At this point, it is worth stressing the distinction between clusters and cell types. The former is an empirical construct while the latter is a biological truth (albeit a vaguely defined one). For this reason, questions like “what is the true number of clusters?” are usually meaningless. We can define as many clusters as we like, with whatever algorithm we like - each clustering will represent its own partitioning of the high-dimensional expression space, and is as “real” as any other clustering. A more relevant question is “how well do the clusters approximate the cell types?” Unfortunately, this is difficult to answer given the context-dependent interpretation of biological truth. Some analysts will be satisfied with resolution of the major cell types; other analysts may want resolution of subtypes; and others still may require resolution of different states (e.g., metabolic activity, stress) within those subtypes. Moreover, two clusterings can be highly inconsistent yet both valid, simply partitioning the cells based on different aspects of biology. Indeed, asking for an unqualified “best” clustering is akin to asking for the best magnification on a microscope without any context. It is helpful to realize that clustering, like a microscope, is simply a tool to explore the data. We can zoom in and out by changing the resolution of the clustering parameters, and we can experiment with different clustering algorithms to obtain alternative perspectives of the data. This iterative approach is entirely permissible for data exploration, which constitutes the majority of all scRNA-seq data analyses. 10.3 Graph-based clustering 10.3.1 Background Popularized by its use in Seurat, graph-based clustering is a flexible and scalable technique for clustering large scRNA-seq datasets. 
We first build a graph where each node is a cell that is connected to its nearest neighbors in the high-dimensional space. Edges are weighted based on the similarity between the cells involved, with higher weight given to cells that are more closely related. We then apply algorithms to identify “communities” of cells that are more connected to cells in the same community than they are to cells of different communities. Each community represents a cluster that we can use for downstream interpretation. The major advantage of graph-based clustering lies in its scalability. It only requires a \\(k\\)-nearest neighbor search that can be done in log-linear time on average, in contrast to hierarchical clustering methods with runtimes that are quadratic with respect to the number of cells. Graph construction avoids making strong assumptions about the shape of the clusters or the distribution of cells within each cluster, compared to other methods like \\(k\\)-means (that favor spherical clusters) or Gaussian mixture models (that require normality). From a practical perspective, each cell is forcibly connected to a minimum number of neighboring cells, which reduces the risk of generating many uninformative clusters consisting of one or two outlier cells. The main drawback of graph-based methods is that, after graph construction, no information is retained about relationships beyond the neighboring cells. This has some practical consequences in datasets that exhibit differences in cell density, as more steps through the graph are required to move the same distance through a region of higher cell density. From the perspective of community detection algorithms, this effect “inflates” the high-density regions such that any internal substructure or noise is more likely to cause formation of subclusters. The resolution of clustering thus becomes dependent on the density of cells, which can occasionally be misleading if it overstates the heterogeneity in the data. 
10.3.2 Implementation There are several considerations in the practical execution of a graph-based clustering method: How many neighbors are considered when constructing the graph. What scheme is used to weight the edges. Which community detection algorithm is used to define the clusters. For example, the following code uses the 10 nearest neighbors of each cell to construct a shared nearest neighbor graph. Two cells are connected by an edge if any of their nearest neighbors are shared, with the edge weight defined from the highest average rank of the shared neighbors (Xu and Su 2015). The Walktrap method from the igraph package is then used to identify communities. All calculations are performed using the top PCs to take advantage of data compression and denoising. library(scran) g &lt;- buildSNNGraph(sce.pbmc, k=10, use.dimred = &#39;PCA&#39;) clust &lt;- igraph::cluster_walktrap(g)$membership table(clust) ## clust ## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ## 205 508 541 56 374 125 46 432 302 867 47 155 166 61 84 16 Alternatively, users may prefer to use the clusterRows() function from the bluster package. This calls the exact same series of functions when run in graph-based mode with the NNGraphParam() argument; however, it is often more convenient if we want to try out different clustering procedures, as we can simply change the second argument to use a different set of parameters or a different algorithm altogether. library(bluster) clust2 &lt;- clusterRows(reducedDim(sce.pbmc, &quot;PCA&quot;), NNGraphParam()) table(clust2) # same as above. ## clust2 ## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ## 205 508 541 56 374 125 46 432 302 867 47 155 166 61 84 16 We assign the cluster assignments back into our SingleCellExperiment object as a factor in the column metadata. This allows us to conveniently visualize the distribution of clusters in a \\(t\\)-SNE plot (Figure 10.1). 
library(scater)
colLabels(sce.pbmc) <- factor(clust)
plotReducedDim(sce.pbmc, "TSNE", colour_by="label")

Figure 10.1: \(t\)-SNE plot of the 10X PBMC dataset, where each point represents a cell and is coloured according to the identity of the assigned cluster from graph-based clustering.

One of the most important parameters is k, the number of nearest neighbors used to construct the graph. This controls the resolution of the clustering, where higher k yields a more inter-connected graph and broader clusters. Users can exploit this by experimenting with different values of k to obtain a satisfactory resolution.

# More resolved.
g.5 <- buildSNNGraph(sce.pbmc, k=5, use.dimred = 'PCA')
clust.5 <- igraph::cluster_walktrap(g.5)$membership
table(clust.5)

## clust.5
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
## 523 302 125  45 172 573 249 439 293  95 772 142  38  18  62  38  30  16  15   9
##  21  22
##  16  13

# Less resolved.
g.50 <- buildSNNGraph(sce.pbmc, k=50, use.dimred = 'PCA')
clust.50 <- igraph::cluster_walktrap(g.50)$membership
table(clust.50)

## clust.50
##   1   2   3   4   5   6   7   8   9  10
## 869 514 194 478 539 944 138 175  89  45

The graph itself can be visualized using a force-directed layout (Figure 10.2). This yields a dimensionality reduction result that is closely related to \(t\)-SNE and UMAP, though which of these is the most aesthetically pleasing is left to the eye of the beholder.

set.seed(11000)
reducedDim(sce.pbmc, "force") <- igraph::layout_with_fr(g)
plotReducedDim(sce.pbmc, colour_by="label", dimred="force")

Figure 10.2: Force-directed layout for the shared nearest-neighbor graph of the PBMC dataset. Each point represents a cell and is coloured according to its assigned cluster identity.

10.3.3 Other parameters

Further tweaking can be performed by changing the edge weighting scheme during graph construction.
Setting type="number" will weight edges based on the number of nearest neighbors that are shared between two cells. Similarly, type="jaccard" will weight edges according to the Jaccard index of the two sets of neighbors. We can also disable weighting altogether by using buildKNNGraph(), which is occasionally useful for downstream graph operations that do not support weights.

g.num <- buildSNNGraph(sce.pbmc, use.dimred="PCA", type="number")
g.jaccard <- buildSNNGraph(sce.pbmc, use.dimred="PCA", type="jaccard")
g.none <- buildKNNGraph(sce.pbmc, use.dimred="PCA")

All of these g variables are graph objects from the igraph package and can be used with any of the community detection algorithms provided by igraph. We have already mentioned the Walktrap approach, but many others are available to choose from:

clust.louvain <- igraph::cluster_louvain(g)$membership
clust.infomap <- igraph::cluster_infomap(g)$membership
clust.fast <- igraph::cluster_fast_greedy(g)$membership
clust.labprop <- igraph::cluster_label_prop(g)$membership
clust.eigen <- igraph::cluster_leading_eigen(g)$membership

It is then straightforward to compare two clustering strategies to see how they differ. For example, Figure 10.3 suggests that Infomap yields finer clusters than Walktrap while fast-greedy yields coarser clusters.

library(pheatmap)

# Using a large pseudo-count for a smoother color transition
# between 0 and 1 cell in each 'tab'.
tab <- table(paste("Infomap", clust.infomap), paste("Walktrap", clust))
ivw <- pheatmap(log10(tab+10), main="Infomap vs Walktrap",
    color=viridis::viridis(100), silent=TRUE)

tab <- table(paste("Fast", clust.fast), paste("Walktrap", clust))
fvw <- pheatmap(log10(tab+10), main="Fast-greedy vs Walktrap",
    color=viridis::viridis(100), silent=TRUE)

gridExtra::grid.arrange(ivw[[4]], fvw[[4]])

Figure 10.3: Number of cells assigned to combinations of cluster labels with different community detection algorithms in the PBMC dataset. Each entry of each heatmap represents a pair of labels, coloured proportionally to the log-number of cells with those labels.

Pipelines involving scran default to rank-based weights followed by Walktrap clustering. In contrast, Seurat uses Jaccard-based weights followed by Louvain clustering. Both of these strategies work well, and it is likely that the same could be said for many other combinations of weighting schemes and community detection algorithms.

Some community detection algorithms operate by agglomeration and thus can be used to construct a hierarchical dendrogram based on the pattern of merges between clusters. The dendrogram itself is not particularly informative as it simply describes the order of merge steps performed by the algorithm; unlike the dendrograms produced by hierarchical clustering (Section 10.5), it does not capture the magnitude of differences between subpopulations. However, it does provide a convenient avenue for manually tuning the clustering resolution by generating nested clusterings using the cut_at() function, as shown below.
community.walktrap <- igraph::cluster_walktrap(g)
table(igraph::cut_at(community.walktrap, n=5))

##
##    1    2    3    4    5
## 3546  221  125   46   47

table(igraph::cut_at(community.walktrap, n=20))

##
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
## 462 374 125 437  46 432 302 173 867  47 155 166 104  40  61  84  46  32  16  16

If cut_at()-like functionality is desired for non-hierarchical methods, bluster provides a mergeCommunities() function to retrospectively tune the clustering resolution. This function will greedily merge pairs of clusters until a specified number of clusters is achieved, where pairs are chosen to maximize the modularity at each merge step.

community.louvain <- igraph::cluster_louvain(g)
table(community.louvain$membership)

##
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14
## 353 152 376 857  48  46  68 661 287  53 541 198 219 126

try(igraph::cut_at(community.louvain, n=10)) # Not hierarchical.

## Error in igraph::cut_at(community.louvain, n = 10) :
##   Not a hierarchical communitity structure

merged <- mergeCommunities(g, community.louvain$membership, number=10)
table(merged)

## merged
##   1   3   4   7   8   9  10  11  12  14
## 353 528 857 162 661 287 272 541 198 126

10.3.4 Assessing cluster separation

When dealing with graphs, the modularity is a natural metric for evaluating the separation between communities/clusters. This is defined as the (scaled) difference between the observed total weight of edges between nodes in the same cluster and the expected total weight if edge weights were randomly distributed across all pairs of nodes. Larger modularity values indicate that most edges occur within clusters, suggesting that the clusters are sufficiently well separated to avoid edges forming between neighboring cells in different clusters. The standard approach is to report a single modularity value for a clustering on a given graph.
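For reference, such a single value can be computed directly with igraph; this sketch reuses the g and clust objects from above.

```r
# Overall modularity of the Walktrap partition on the SNN graph.
# Values closer to 1 indicate that edge weight is concentrated within clusters.
igraph::modularity(g, clust, weights=igraph::E(g)$weight)
```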
This is useful for comparing different clusterings on the same graph - and indeed, some community detection algorithms are designed with the aim of maximizing the modularity - but it is less helpful for interpreting a given clustering. Rather, we use the pairwiseModularity() function from bluster with as.ratio=TRUE, which returns the ratio of the observed to expected sum of weights between each pair of clusters. We use the ratio instead of the difference as the former is less dependent on the number of cells in each cluster.

ratio <- pairwiseModularity(g, clust, as.ratio=TRUE)
dim(ratio)

## [1] 16 16

In this matrix, each row/column corresponds to a cluster and each entry contains the ratio of the observed to expected total weight of edges between cells in the respective clusters. A dataset containing well-separated clusters should contain most of the observed total weight on the diagonal entries, i.e., most edges occur between cells in the same cluster. Indeed, concentration of the weight on the diagonal (Figure 10.4) indicates that most of the clusters are well-separated, while some modest off-diagonal entries represent closely related clusters with more inter-connecting edges.

library(pheatmap)
pheatmap(log2(ratio+1), cluster_rows=FALSE, cluster_cols=FALSE,
    color=colorRampPalette(c("white", "blue"))(100))

Figure 10.4: Heatmap of the log2-ratio of the total weight between nodes in the same cluster or in different clusters, relative to the total weight expected under a null model of random links.

One useful approach is to use the ratio matrix to form another graph where the nodes are clusters rather than cells. Edges between nodes are weighted according to the ratio of observed to expected edge weights between cells in those clusters. We can then repeat our graph operations on this new cluster-level graph to explore the relationships between clusters.
For example, we could obtain clusters of clusters, or we could simply create a new cluster-based layout for visualization (Figure 10.5). This is analogous to the “graph abstraction” approach described by Wolf et al. (2017), which can be used to identify trajectories in the data based on high-weight paths between clusters.

cluster.gr <- igraph::graph_from_adjacency_matrix(log2(ratio+1),
    mode="upper", weighted=TRUE, diag=FALSE)

# Increasing the weight to increase the visibility of the lines.
set.seed(11001010)
plot(cluster.gr, edge.width=igraph::E(cluster.gr)$weight*5,
    layout=igraph::layout_with_lgl)

Figure 10.5: Force-based layout showing the relationships between clusters based on the log-ratio of observed to expected total weights between nodes in different clusters. The thickness of the edge between a pair of clusters is proportional to the corresponding log-ratio.

Incidentally, some readers may have noticed that all igraph commands were prefixed with igraph::. We have done this deliberately to avoid bringing igraph::normalize into the global namespace. Rather unfortunately, this normalize function accepts any argument and returns NULL, which causes difficult-to-diagnose bugs when it overwrites normalize from BiocGenerics.

10.4 \(k\)-means clustering

10.4.1 Background

\(k\)-means clustering is a classic technique that aims to partition cells into \(k\) clusters. Each cell is assigned to the cluster with the closest centroid, which is done by minimizing the within-cluster sum of squares using a random starting configuration for the \(k\) centroids. The main advantage of this approach lies in its speed, given the simplicity and ease of implementation of the algorithm. However, it suffers from a number of serious shortcomings that reduce its appeal for obtaining interpretable clusters:

It implicitly favors spherical clusters of equal radius.
This can lead to unintuitive partitionings on real datasets that contain groupings with irregular sizes and shapes.

The number of clusters \(k\) must be specified beforehand and represents a hard cap on the resolution of the clustering. For example, setting \(k\) to be below the number of cell types will always lead to co-clustering of two cell types, regardless of how well separated they are. In contrast, other methods like graph-based clustering will respect strong separation even if the relevant resolution parameter is set to a low value.

It is dependent on the randomly chosen initial coordinates. This requires multiple runs to verify that the clustering is stable.

That said, \(k\)-means clustering is still one of the best approaches for sample-based data compression. In this application, we set \(k\) to a large value such as the square root of the number of cells to obtain fine-grained clusters. These are not meant to be interpreted directly, but rather, the centroids are treated as “samples” for further analyses. The idea here is to obtain a single representative of each region of the expression space, reducing the number of samples and computational work in later steps like, e.g., trajectory reconstruction (Ji and Ji 2016). This approach will also eliminate differences in cell density across the expression space, ensuring that the most abundant cell type does not dominate downstream results.

10.4.2 Base implementation

Base R provides the kmeans() function that does as its name suggests. We call this on our top PCs to obtain a clustering for a specified number of clusters in the centers= argument, after setting the random seed to ensure that the results are reproducible. In general, the \(k\)-means clusters correspond to the visual clusters on the \(t\)-SNE plot in Figure 10.6, though there are some divergences that are not observed in, say, Figure 10.1.
(This is at least partially due to the fact that \(t\)-SNE is itself graph-based and so will naturally agree more with a graph-based clustering strategy.)

set.seed(100)
clust.kmeans <- kmeans(reducedDim(sce.pbmc, "PCA"), centers=10)
table(clust.kmeans$cluster)

##
##   1   2   3   4   5   6   7   8   9  10
## 548  46 408 270 539 199 148 783 163 881

colLabels(sce.pbmc) <- factor(clust.kmeans$cluster)
plotReducedDim(sce.pbmc, "TSNE", colour_by="label")

Figure 10.6: \(t\)-SNE plot of the 10X PBMC dataset, where each point represents a cell and is coloured according to the identity of the assigned cluster from \(k\)-means clustering.

If we were so inclined, we could obtain a “reasonable” choice of \(k\) by computing the gap statistic using methods from the cluster package. This is the log-ratio of the expected to observed within-cluster sum of squares, where the expected value is computed by randomly distributing cells within the minimum bounding box of the original data. A larger gap statistic represents a lower observed sum of squares - and thus better clustering - compared to a population with no structure. Ideally, we would choose the \(k\) that maximizes the gap statistic, but this is often unhelpful as the tendency of \(k\)-means to favor spherical clusters drives a large \(k\) to capture different cluster shapes. Instead, we choose the most parsimonious \(k\) beyond which the increases in the gap statistic are considered insignificant (Figure 10.7). It must be said, though, that this process is time-consuming and the resulting choice of \(k\) is not always stable.
library(cluster)
set.seed(110010101)
gaps <- clusGap(reducedDim(sce.pbmc, "PCA"), kmeans, K.max=20)
best.k <- maxSE(gaps$Tab[,"gap"], gaps$Tab[,"SE.sim"])
best.k

## [1] 8

plot(gaps$Tab[,"gap"], xlab="Number of clusters", ylab="Gap statistic")
abline(v=best.k, col="red")

Figure 10.7: Gap statistic with respect to increasing number of \(k\)-means clusters in the 10X PBMC dataset. The red line represents the chosen \(k\).

A more practical use of \(k\)-means is to deliberately set \(k\) to a large value to achieve overclustering. This will forcibly partition cells inside broad clusters that do not have well-defined internal structure. For example, we might be interested in the change in expression from one “side” of a cluster to the other, but the lack of any clear separation within the cluster makes it difficult to separate with graph-based methods, even at the highest resolution. \(k\)-means has no such problems and will readily split these broad clusters for greater resolution, though obviously one must be prepared for the additional work involved in interpreting a greater number of clusters.

set.seed(100)
clust.kmeans2 <- kmeans(reducedDim(sce.pbmc, "PCA"), centers=20)
table(clust.kmeans2$cluster)

##
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
## 243  28 202 361 282 166 388 150 114 537 170  96  46 131 162 118 201 257 288  45

colLabels(sce.pbmc) <- factor(clust.kmeans2$cluster)
plotTSNE(sce.pbmc, colour_by="label", text_by="label")

Figure 10.8: \(t\)-SNE plot of the 10X PBMC dataset, where each point represents a cell and is coloured according to the identity of the assigned cluster from \(k\)-means clustering with \(k=20\).

As an aside: if we were already using clusterRows() from bluster, we can easily switch to \(k\)-means clustering by supplying a KmeansParam() as the second argument.
This requires the number of clusters as a fixed integer or as a function of the number of cells - the example below sets the number of clusters to the square root of the number of cells, which is an effective rule-of-thumb for vector quantization.

set.seed(10000)
sq.clusts <- clusterRows(reducedDim(sce.pbmc, "PCA"), KmeansParam(centers=sqrt))
nlevels(sq.clusts)

## [1] 63

10.4.3 Assessing cluster separation

The within-cluster sum of squares (WCSS) for each cluster is the most relevant diagnostic for \(k\)-means, given that the algorithm aims to find a clustering that minimizes the WCSS. Specifically, we use the WCSS to compute the root-mean-squared deviation (RMSD) that represents the spread of cells within each cluster. A cluster is more likely to have a low RMSD if it has no internal structure and is separated from other clusters (such that there are not many cells on the boundaries between clusters, which would result in a higher sum of squares from the centroid).

ncells <- tabulate(clust.kmeans2$cluster)
tab <- data.frame(wcss=clust.kmeans2$withinss, ncells=ncells)
tab$rms <- sqrt(tab$wcss/tab$ncells)
tab

##    wcss ncells    rms
## 1  3270    243  3.669
## 2  2837     28 10.066
## 3  3240    202  4.005
## 4  3499    361  3.113
## 5  4483    282  3.987
## 6  3325    166  4.476
## 7  6834    388  4.197
## 8  3843    150  5.062
## 9  2277    114  4.470
## 10 4439    537  2.875
## 11 2003    170  3.433
## 12 3342     96  5.900
## 13 6531     46 11.915
## 14 2130    131  4.032
## 15 3627    162  4.731
## 16 3108    118  5.132
## 17 4790    201  4.882
## 18 4663    257  4.260
## 19 6966    288  4.918
## 20 1205     45  5.175

(As an aside, the RMSDs of the clusters are poorly correlated with their sizes in Figure 10.8. This highlights the risks of attempting to quantitatively interpret the sizes of visual clusters in \(t\)-SNE plots.)

To explore the relationships between \(k\)-means clusters, a natural approach is to compute distances between their centroids.
This directly lends itself to visualization as a tree after hierarchical clustering (Figure 10.9).

cent.tree <- hclust(dist(clust.kmeans2$centers), "ward.D2")
plot(cent.tree)

Figure 10.9: Hierarchy of \(k\)-means cluster centroids, using Ward’s minimum variance method.

10.4.4 In two-step procedures

As previously mentioned, \(k\)-means is most effective in its role of vector quantization, i.e., compressing adjacent cells into a single representative point. This allows \(k\)-means to be used as a prelude to more sophisticated and interpretable - but expensive - clustering algorithms. The clusterRows() function supports a “two-step” mode where \(k\)-means is initially used to obtain representative centroids that are subjected to graph-based clustering. Each cell is then placed in the same graph-based cluster that its \(k\)-means centroid was assigned to (Figure 10.10).

# Setting the seed due to the randomness of k-means.
set.seed(0101010)
kgraph.clusters <- clusterRows(reducedDim(sce.pbmc, "PCA"),
    TwoStepParam(
        first=KmeansParam(centers=1000),
        second=NNGraphParam(k=5)
    )
)
table(kgraph.clusters)

## kgraph.clusters
##   1   2   3   4   5   6   7   8   9  10  11  12
## 191 854 506 541 541 892  46 120  29 132  47  86

plotTSNE(sce.pbmc, colour_by=I(kgraph.clusters))

Figure 10.10: \(t\)-SNE plot of the PBMC dataset, where each point represents a cell and is coloured according to the identity of the assigned cluster from combined \(k\)-means/graph-based clustering.

The obvious benefit of this approach over direct graph-based clustering is the speed improvement. We avoid the need to identify nearest neighbors for each cell and the construction of a large intermediate graph, while benefiting from the relative interpretability of graph-based clusters compared to those from \(k\)-means. This approach also mitigates the “inflation” effect discussed in Section 10.3.
Each centroid serves as a representative of a region of space that is roughly similar in volume, ameliorating differences in cell density that can cause (potentially undesirable) differences in resolution.

The choice of the number of \(k\)-means clusters (defined here by the centers= argument in KmeansParam()) determines the trade-off between speed and fidelity. Larger values provide a more faithful representation of the underlying distribution of cells, at the cost of requiring more computational work by the second-stage clustering procedure. Note that the second step operates on the centroids, so increasing the number of centers may have further implications if the second-stage procedure is sensitive to the total number of input observations. For example, increasing the number of centroids would require a concomitant increase in k= (the number of neighbors in graph construction) to maintain the same level of resolution in the final output.

10.5 Hierarchical clustering

10.5.1 Background

Hierarchical clustering is an ancient technique that aims to generate a dendrogram containing a hierarchy of samples. This is most commonly done by greedily agglomerating samples into clusters, then agglomerating those clusters into larger clusters, and so on until all samples belong to a single cluster. Variants of hierarchical clustering methods primarily differ in how they choose to perform the agglomerations. For example, complete linkage aims to merge clusters with the smallest maximum distance between their elements, while Ward’s method aims to minimize the increase in within-cluster variance. In the context of scRNA-seq, the main advantage of hierarchical clustering lies in the production of the dendrogram. This is a rich summary that describes the relationships between cells and subpopulations at various resolutions and in a quantitative manner based on the branch lengths.
Users can easily “cut” the tree at different heights to define clusters with different granularity, where clusters defined at high resolution are guaranteed to be nested within those defined at a lower resolution. (Guaranteed nesting can be helpful for interpretation, as discussed in Section 10.7.) The dendrogram is also a natural representation of the data in situations where cells have descended from a relatively recent common ancestor.

In practice, hierarchical clustering is too slow to be used for anything but the smallest scRNA-seq datasets. Most variants require a cell-cell distance matrix that is prohibitively expensive to compute for many cells. Greedy agglomeration is also likely to result in a quantitatively suboptimal partitioning (as defined by the agglomeration measure) at higher levels of the dendrogram when the number of cells and merge steps is high. Nonetheless, we will still demonstrate the application of hierarchical clustering here, as it can occasionally be useful for squeezing more information out of datasets with very few cells.

10.5.2 Implementation

As the PBMC dataset is too large, we will demonstrate on the 416B dataset instead.
#--- loading ---#
library(scRNAseq)
sce.416b <- LunSpikeInData(which="416b") 
sce.416b$block <- factor(sce.416b$block)

#--- gene-annotation ---#
library(AnnotationHub)
ens.mm.v97 <- AnnotationHub()[["AH73905"]]
rowData(sce.416b)$ENSEMBL <- rownames(sce.416b)
rowData(sce.416b)$SYMBOL <- mapIds(ens.mm.v97, keys=rownames(sce.416b),
    keytype="GENEID", column="SYMBOL")
rowData(sce.416b)$SEQNAME <- mapIds(ens.mm.v97, keys=rownames(sce.416b),
    keytype="GENEID", column="SEQNAME")

library(scater)
rownames(sce.416b) <- uniquifyFeatureNames(rowData(sce.416b)$ENSEMBL,
    rowData(sce.416b)$SYMBOL)

#--- quality-control ---#
mito <- which(rowData(sce.416b)$SEQNAME=="MT")
stats <- perCellQCMetrics(sce.416b, subsets=list(Mt=mito))
qc <- quickPerCellQC(stats, percent_subsets=c("subsets_Mt_percent",
    "altexps_ERCC_percent"), batch=sce.416b$block)
sce.416b <- sce.416b[,!qc$discard]

#--- normalization ---#
library(scran)
sce.416b <- computeSumFactors(sce.416b)
sce.416b <- logNormCounts(sce.416b)

#--- variance-modelling ---#
dec.416b <- modelGeneVarWithSpikes(sce.416b, "ERCC", block=sce.416b$block)
chosen.hvgs <- getTopHVGs(dec.416b, prop=0.1)

#--- batch-correction ---#
library(limma)
assay(sce.416b, "corrected") <- removeBatchEffect(logcounts(sce.416b),
    design=model.matrix(~sce.416b$phenotype), batch=sce.416b$block)

#--- dimensionality-reduction ---#
sce.416b <- runPCA(sce.416b, ncomponents=10, subset_row=chosen.hvgs,
    exprs_values="corrected", BSPARAM=BiocSingular::ExactParam())

set.seed(1010)
sce.416b <- runTSNE(sce.416b, dimred="PCA", perplexity=10)

sce.416b

## class: SingleCellExperiment
## dim: 46604 185
## metadata(0):
## assays(3): counts logcounts corrected
## rownames(46604): 4933401J01Rik Gm26206 ...
##   CAAA01147332.1 CBFB-MYH11-mcherry
## rowData names(4): Length ENSEMBL SYMBOL SEQNAME
## colnames(185): SLX-9555.N701_S502.C89V9ANXX.s_1.r_1
##   SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 ...
##   SLX-11312.N712_S507.H5H5YBBXX.s_8.r_1
##   SLX-11312.N712_S517.H5H5YBBXX.s_8.r_1
## colData names(10): Source Name cell line ... block sizeFactor
## reducedDimNames(2): PCA TSNE
## altExpNames(2): ERCC SIRV

We compute a cell-cell distance matrix using the top PCs and we apply hierarchical clustering with Ward’s method. The resulting tree in Figure 10.11 shows a clear split in the population caused by oncogene induction. While both Ward’s method and complete linkage (hclust()’s default) yield compact clusters, we prefer the former as it is less affected by differences in variance between clusters.

dist.416b <- dist(reducedDim(sce.416b, "PCA"))
tree.416b <- hclust(dist.416b, "ward.D2")

# Making a prettier dendrogram.
library(dendextend)
tree.416b$labels <- seq_along(tree.416b$labels)
dend <- as.dendrogram(tree.416b, hang=0.1)

combined.fac <- paste0(sce.416b$block, ".",
    sub(" .*", "", sce.416b$phenotype))
labels_colors(dend) <- c(
    `20160113.wild`="blue",
    `20160113.induced`="red",
    `20160325.wild`="dodgerblue",
    `20160325.induced`="salmon"
)[combined.fac][order.dendrogram(dend)]

plot(dend)

Figure 10.11: Hierarchy of cells in the 416B data set after hierarchical clustering, where each leaf node is a cell that is coloured according to its oncogene induction status (red is induced, blue is control) and plate of origin (light or dark).

To obtain explicit clusters, we “cut” the tree by removing internal branches such that every subtree represents a distinct cluster. This is most simply done by removing internal branches above a certain height of the tree, as performed by the cutree() function.
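As a quick sketch of that simpler approach, we can cut the Ward dendrogram into a fixed number of clusters; the choice of k=4 here is arbitrary and purely for illustration.

```r
# Cut the dendrogram at the height that yields exactly 4 clusters.
clust.cutree <- cutree(tree.416b, k=4)
table(clust.cutree)
```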
A more sophisticated variant of this approach is implemented in the dynamicTreeCut package, which uses the shape of the branches to obtain a better partitioning for complex dendrograms (Figure 10.12).

library(dynamicTreeCut)

# minClusterSize needs to be turned down for small datasets.
# deepSplit controls the resolution of the partitioning.
clust.416b <- cutreeDynamic(tree.416b, distM=as.matrix(dist.416b),
    minClusterSize=10, deepSplit=1)

## ..cutHeight not given, setting it to 783 ===> 99% of the (truncated) height range in dendro.
## ..done.

table(clust.416b)

## clust.416b
##  1  2  3  4
## 78 69 24 14

labels_colors(dend) <- clust.416b[order.dendrogram(dend)]
plot(dend)

Figure 10.12: Hierarchy of cells in the 416B data set after hierarchical clustering, where each leaf node is a cell that is coloured according to its assigned cluster identity from a dynamic tree cut.

This generally corresponds well to the grouping of cells on a \(t\)-SNE plot (Figure 10.13). The exception is cluster 2, which is split across two visual clusters in the plot. We attribute this to a distortion introduced by \(t\)-SNE rather than inappropriate behavior of the clustering algorithm, based on the examination of some later diagnostics.

colLabels(sce.416b) <- factor(clust.416b)
plotReducedDim(sce.416b, "TSNE", colour_by="label")

Figure 10.13: \(t\)-SNE plot of the 416B dataset, where each point represents a cell and is coloured according to the identity of the assigned cluster from hierarchical clustering.

Note that the series of calls required to obtain the clusters is also wrapped by clusterRows() for more convenient execution.
In this case, we can reproduce clust.416b with the following:

clust.416b.again <- clusterRows(reducedDim(sce.416b, "PCA"),
    HclustParam(method="ward.D2", cut.dynamic=TRUE,
        minClusterSize=10, deepSplit=1))
table(clust.416b.again)

## clust.416b.again
##  1  2  3  4
## 78 69 24 14

10.5.3 Assessing cluster separation

We check the separation of the clusters using the silhouette width (Figure 10.14). For each cell, we compute the average distance to all cells in the same cluster. We also compute the average distance to all cells in another cluster, taking the minimum of the averages across all other clusters. The silhouette width for each cell is defined as the difference between these two values divided by their maximum. Cells with large positive silhouette widths are closer to other cells in the same cluster than to cells in different clusters.

Each cluster would ideally contain large positive silhouette widths, indicating that it is well-separated from other clusters. This is indeed the case in Figure 10.14 - and in fact, cluster 2 has the largest width of all, indicating that it is a more coherent cluster than portrayed in Figure 10.13. Smaller widths can arise from the presence of internal subclusters, which inflates the within-cluster distance; or overclustering, where cells at the boundary of a partition are closer to the neighboring cluster than their own cluster.

sil <- silhouette(clust.416b, dist = dist.416b)
plot(sil)

Figure 10.14: Silhouette widths for cells in each cluster in the 416B dataset. Each bar represents a cell, grouped by the cluster to which it is assigned.

For a more detailed examination, we identify the closest neighboring cluster for cells with negative widths. This provides a perspective on the relationships between clusters that is closer to the raw data than the dendrogram in Figure 10.12.
neg.widths <- sil[,3] < 0
table(Cluster=sil[neg.widths,1], Neighbor=sil[neg.widths,2])

##        Neighbor
## Cluster 1 2 3
##       2 0 0 3
##       3 1 3 0

The average silhouette width across all cells can also be used to choose clustering parameters. The aim is to maximize the average silhouette width in order to obtain well-separated clusters. This can be helpful to automatically obtain a “reasonable” clustering, though in practice, the clustering that yields the strongest separation often does not provide the most biological insight.

10.6 General-purpose cluster diagnostics

10.6.1 Cluster separation, redux

We previously introduced the silhouette width in the context of hierarchical clustering (Section 10.5.3). While this can be applied with other clustering algorithms, it requires calculation of all pairwise distances between cells and is not scalable for larger datasets. In such cases, we instead use an approximate approach that replaces the average of the distances with the distance to the average (i.e., centroid) of each cluster, with some tweaks to account for the distance due to the within-cluster variance. This is implemented in the approxSilhouette() function from bluster, allowing us to quickly identify poorly separated clusters with mostly negative widths (Figure 10.15).

# Performing the calculations on the PC coordinates, like before.
sil.approx <- approxSilhouette(reducedDim(sce.pbmc, "PCA"), clusters=clust)
sil.data <- as.data.frame(sil.approx)
sil.data$closest <- factor(ifelse(sil.data$width > 0, clust, sil.data$other))
sil.data$cluster <- factor(clust)

ggplot(sil.data, aes(x=cluster, y=width, colour=closest)) +
    ggbeeswarm::geom_quasirandom(method="smiley")

Figure 10.15: Distribution of the approximate silhouette width across cells in each cluster of the PBMC dataset.
Each point represents a cell and is colored with the identity of its own cluster if its silhouette width is positive and that of the closest other cluster if the width is negative. Alternatively, we can quantify the degree to which cells from multiple clusters intermingle in expression space. The “clustering purity” is defined for each cell as the proportion of neighboring cells that are assigned to the same cluster. Well-separated clusters should exhibit little intermingling and thus high purity values for all member cells, as demonstrated below in Figure 10.16. Median purity values are consistently greater than 0.9, indicating that most cells in each cluster are primarily surrounded by other cells from the same cluster. pure.pbmc &lt;- neighborPurity(reducedDim(sce.pbmc, &quot;PCA&quot;), clusters=clust) pure.data &lt;- as.data.frame(pure.pbmc) pure.data$maximum &lt;- factor(pure.data$maximum) pure.data$cluster &lt;- factor(clust) ggplot(pure.data, aes(x=cluster, y=purity, colour=maximum)) + ggbeeswarm::geom_quasirandom(method=&quot;smiley&quot;) Figure 10.16: Distribution of cluster purities across cells in each cluster of the PBMC dataset. Each point represents a cell and is colored with the identity of the cluster contributing the largest proportion of its neighbors. The main difference between these two methods is that the purity is ignorant of the intra-cluster variance. This may or may not be desirable depending on what level of heterogeneity is of interest. In addition, the purity will - on average - only decrease with increasing cluster number/resolution, making it less effective for choosing between different clusterings. However, regardless of the chosen method, it is worth keeping in mind that poor separation is not synonymous with poor quality. In fact, poorly separated clusters will often be observed in non-trivial analyses of scRNA-seq data where the aim is to characterize closely related subtypes or states. 
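The purity calculation is straightforward to sketch. The version below (Python/NumPy, for illustration; `neighborPurity()` uses a weighted hypersphere-based definition rather than the fixed-k nearest neighbors used here, and `knn_purity` is our own name) scores each cell by the fraction of its k nearest neighbors that share its cluster label:

```python
import numpy as np

def knn_purity(X, labels, k=10):
    X, labels = np.asarray(X, float), np.asarray(labels)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    np.fill_diagonal(D, np.inf)          # a cell is not its own neighbour
    nn = np.argsort(D, axis=1)[:, :k]    # indices of the k nearest neighbours
    return (labels[nn] == labels[:, None]).mean(axis=1)

# Well-separated blobs: every neighbourhood is pure.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (25, 2)), rng.normal(5, 0.2, (25, 2))])
labels = np.array([0] * 25 + [1] * 25)
print(np.median(knn_purity(X, labels)))  # 1.0
```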
These diagnostics are best used to guide interpretation by highlighting clusters that require more investigation rather than to rule out poorly separated clusters altogether. 10.6.2 Comparing different clusterings As previously mentioned, clustering’s main purpose is to obtain a discrete summary of the data for further interpretation. The diversity of available methods (and the subsequent variation in the clustering results) reflects the many different “perspectives” that can be derived from a high-dimensional scRNA-seq dataset. It is helpful to determine how these perspectives relate to each other by comparing the clustering results. More concretely, we want to know which clusters map to each other across algorithms; inconsistencies may be indicative of complex variation that is summarized differently by each clustering procedure. A simple yet effective approach for comparing two clusterings of the same dataset is to create a 2-dimensional table of label frequencies (Figure 10.3). We can further improve the interpretability of this table by computing the proportions of cell assignments, which avoids difficulties with dynamic range when visualizing clusters of differing abundances. For example, we may be interested in how our Walktrap clusters from Section 10.3 are redistributed when we switch to using Louvain community detection (Figure 10.17). Note that this heatmap is best interpreted on a row-by-row basis as the proportions are computed within each row and cannot be easily compared between rows. tab &lt;- table(Walktrap=clust, Louvain=clust.louvain) tab &lt;- tab/rowSums(tab) pheatmap(tab, color=viridis::viridis(100), cluster_cols=FALSE, cluster_rows=FALSE) Figure 10.17: Heatmap of the proportions of cells from each Walktrap cluster (rows) across the Louvain clusters (columns) in the PBMC dataset. Each row represents the distribution of cells across Louvain clusters for a given Walktrap cluster. 
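The heatmap above is simply a row-normalized contingency table. A minimal sketch of the computation (Python/NumPy for illustration, with toy labels rather than the real PBMC clusters; the R code uses `table()` and `rowSums()`):

```python
import numpy as np

def label_proportions(ref, alt):
    # Entry [i, j]: proportion of cells in reference cluster i
    # that fall in alternative cluster j (rows sum to 1).
    ref_ids, ref_idx = np.unique(ref, return_inverse=True)
    alt_ids, alt_idx = np.unique(alt, return_inverse=True)
    tab = np.zeros((len(ref_ids), len(alt_ids)))
    np.add.at(tab, (ref_idx, alt_idx), 1)
    return tab / tab.sum(axis=1, keepdims=True)

walktrap = [1, 1, 1, 2, 2, 3, 3, 3, 3]   # hypothetical labels
louvain  = [1, 1, 2, 2, 2, 3, 3, 3, 1]
print(label_proportions(walktrap, louvain))
```

Because each row is normalized independently, comparisons are only meaningful within a row, which is exactly the caveat noted for Figure 10.17.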
For clusterings that differ primarily in resolution (usually from different parameterizations of the same algorithm), we can use the clustree package to visualize the relationships between them. Here, the aim is to capture the redistribution of cells from one clustering to another at progressively higher resolution, providing a convenient depiction of how clusters split apart (Figure 10.18). This approach is most effective when the clusterings exhibit a clear gradation in resolution but is less useful for comparisons involving theoretically distinct clustering procedures. library(clustree) combined &lt;- cbind(k.50=clust.50, k.10=clust, k.5=clust.5) clustree(combined, prefix=&quot;k.&quot;, edge_arrow=FALSE) Figure 10.18: Graph of the relationships between the Walktrap clusterings of the PBMC dataset, generated with varying \\(k\\) during the nearest-neighbor graph construction. (A higher \\(k\\) generally corresponds to a lower resolution clustering.) The size of the nodes is proportional to the number of cells in each cluster, and the edges depict cells in one cluster that are reassigned to another cluster at a different resolution. The color of the edges is defined according to the number of reassigned cells and the opacity is defined from the corresponding proportion relative to the size of the lower-resolution cluster. We can quantify the agreement between two clusterings by computing the Rand index with bluster’s pairwiseRand(). This is defined as the proportion of pairs of cells that retain the same status (i.e., both cells in the same cluster, or each cell in different clusters) in both clusterings. In practice, we usually compute the adjusted Rand index (ARI) where we subtract the number of concordant pairs expected under random permutations of the clusterings; this accounts for differences in the size and number of clusters within and between clusterings. 
A larger ARI indicates that the clusters are preserved, up to a maximum value of 1 for identical clusterings. In and of itself, the magnitude of the ARI has little meaning, and it is best used to assess the relative similarities of different clusterings (e.g., “Walktrap is more similar to Louvain than either is to Infomap”). Nonetheless, if one must have a hard-and-fast rule, experience suggests that an ARI greater than 0.5 corresponds to “good” similarity between two clusterings. pairwiseRand(clust, clust.5, mode=&quot;index&quot;) ## [1] 0.7796 The same function can also provide a more granular perspective with mode=\"ratio\", where the ARI is broken down into its contributions from each pair of clusters in one of the clusterings. This mode is helpful if one of the clusterings - in this case, clust - is considered to be a “reference”, and the aim is to quantify the extent to which the reference clusters retain their integrity in another clustering. In the breakdown matrix, each entry is a ratio of the adjusted number of concordant pairs to the adjusted total number of pairs. Low values on the diagonal in Figure 10.19 indicate that cells from the corresponding reference cluster in clust are redistributed to multiple other clusters in clust.5. Conversely, low off-diagonal values indicate that the corresponding pair of reference clusters are merged together in clust.5. breakdown &lt;- pairwiseRand(ref=clust, alt=clust.5, mode=&quot;ratio&quot;) pheatmap(breakdown, color=viridis::magma(100), cluster_rows=FALSE, cluster_cols=FALSE) Figure 10.19: ARI-based ratio for each pair of clusters in the reference Walktrap clustering compared to a higher-resolution alternative clustering for the PBMC dataset. Rows and columns of the heatmap represent clusters in the reference clustering. Each entry represents the proportion of pairs of cells involving the row/column clusters that retain the same status in the alternative clustering. 
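Both quantities are pair-counting exercises. This standalone sketch (Python, for illustration only; `pairwiseRand()` computes its adjusted ratios per cluster pair rather than a single value) implements the plain Rand index directly from its definition and the ARI via the usual contingency-table formula:

```python
from math import comb
import numpy as np

def rand_index(x, y):
    # Proportion of cell pairs with the same status - together or
    # apart - in both clusterings.
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    same = 0
    for i in range(n):
        for j in range(i + 1, n):
            same += (x[i] == x[j]) == (y[i] == y[j])
    return same / comb(n, 2)

def adjusted_rand_index(x, y):
    # Rand index corrected for chance agreement via the contingency table.
    x, y = np.asarray(x), np.asarray(y)
    xs, xi = np.unique(x, return_inverse=True)
    ys, yi = np.unique(y, return_inverse=True)
    tab = np.zeros((len(xs), len(ys)), dtype=int)
    np.add.at(tab, (xi, yi), 1)
    s_ij = sum(comb(v, 2) for v in tab.ravel())
    s_a = sum(comb(v, 2) for v in tab.sum(axis=1))
    s_b = sum(comb(v, 2) for v in tab.sum(axis=0))
    expected = s_a * s_b / comb(len(x), 2)
    return (s_ij - expected) / ((s_a + s_b) / 2 - expected)

x = [1, 1, 1, 2, 2, 3]  # toy clusterings for demonstration
y = [1, 1, 2, 2, 2, 3]
print(rand_index(x, y), adjusted_rand_index(x, y))
```

Note how the ARI for these toy labels is much lower than the raw Rand index: most cell pairs are "apart" in both clusterings by chance alone, and the adjustment removes that baseline agreement.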
10.6.3 Evaluating cluster stability A desirable property of a given clustering is that it is stable to perturbations to the input data (Von Luxburg 2010). Stable clusters are logistically convenient as small changes to upstream processing will not change the conclusions; greater stability also increases the likelihood that those conclusions can be reproduced in an independent replicate study. scran uses bootstrapping to evaluate the stability of a clustering algorithm on a given dataset - that is, cells are sampled with replacement to create a “bootstrap replicate” dataset, and clustering is repeated on this replicate to see if the same clusters can be reproduced. We demonstrate below for graph-based clustering on the PCs of the PBMC dataset. myClusterFUN &lt;- function(x) { g &lt;- bluster::makeSNNGraph(x, type=&quot;jaccard&quot;) igraph::cluster_louvain(g)$membership } pcs &lt;- reducedDim(sce.pbmc, &quot;PCA&quot;) originals &lt;- myClusterFUN(pcs) table(originals) # inspecting the cluster sizes. ## originals ## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ## 127 62 48 343 45 56 124 94 848 290 200 459 233 143 541 372 set.seed(0010010100) ratios &lt;- bootstrapStability(pcs, FUN=myClusterFUN, clusters=originals) dim(ratios) ## [1] 16 16 The function returns a matrix of ARI-derived ratios for every pair of original clusters in originals (Figure 10.20), averaged across bootstrap iterations. High ratios indicate that the clusterings in the bootstrap replicates are highly consistent with that of the original dataset. More specifically, high ratios on the diagonal indicate that cells in the same original cluster are still together in the bootstrap replicates, while high ratios off the diagonal indicate that cells in the corresponding cluster pair are still separated. 
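The bootstrap procedure itself is easy to sketch independently of scran. The toy version below (Python/NumPy, for illustration; `bootstrapStability()` breaks the ARI down into per-cluster-pair ratios, which this sketch collapses into a single ARI per replicate) resamples cells with replacement, reclusters each replicate, and compares the replicate labels against the original labels of the sampled cells:

```python
from math import comb
import numpy as np

def ari(x, y):
    # adjusted Rand index from the contingency table of two labelings
    xs, xi = np.unique(x, return_inverse=True)
    ys, yi = np.unique(y, return_inverse=True)
    tab = np.zeros((len(xs), len(ys)), dtype=int)
    np.add.at(tab, (xi, yi), 1)
    s_ij = sum(comb(v, 2) for v in tab.ravel())
    s_a = sum(comb(v, 2) for v in tab.sum(axis=1))
    s_b = sum(comb(v, 2) for v in tab.sum(axis=0))
    expected = s_a * s_b / comb(len(x), 2)
    return (s_ij - expected) / ((s_a + s_b) / 2 - expected)

def bootstrap_stability(X, cluster_fun, n_boot=20, seed=42):
    rng = np.random.default_rng(seed)
    original = np.asarray(cluster_fun(X))
    aris = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(X), len(X))  # sample cells with replacement
        aris.append(ari(original[idx], cluster_fun(X[idx])))
    return float(np.mean(aris))

# A deterministic stand-in "clusterer" on two obvious blobs:
# resampling cannot perturb it, so the mean ARI should be exactly 1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 1)), rng.normal(5, 0.3, (30, 1))])
print(bootstrap_stability(X, lambda Z: (Z[:, 0] > 2.5).astype(int)))
```

A less trivial clustering function would yield ARIs below 1 whenever boundary cells flip between clusters across replicates, which is precisely the instability the diagnostic is designed to expose.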
pheatmap(ratios, cluster_row=FALSE, cluster_col=FALSE, color=viridis::magma(100), breaks=seq(-1, 1, length.out=101)) Figure 10.20: Heatmap of ARI-derived ratios from bootstrapping of graph-based clustering in the PBMC dataset. Each row and column represents an original cluster and each entry is colored according to the value of the ARI ratio between that pair of clusters. Bootstrapping is a general approach for evaluating cluster stability that is compatible with any clustering algorithm. The ARI-derived ratio between cluster pairs is also more informative than a single stability measure for all/each cluster as the former considers the relationships between clusters, e.g., unstable separation between \\(X\\) and \\(Y\\) does not penalize the stability of separation between \\(X\\) and another cluster \\(Z\\). Of course, one should take these metrics with a grain of salt, as bootstrapping only considers the effect of sampling noise and ignores other factors that affect reproducibility in an independent study (e.g., batch effects, donor variation). In addition, it is possible for a poor separation to be highly stable, so a highly stable cluster may not necessarily represent some distinct subpopulation. 10.7 Subclustering Another simple approach to improving resolution is to repeat the feature selection and clustering within a single cluster. This aims to select HVGs and PCs that are more relevant to internal structure, improving resolution by avoiding noise from unnecessary features. Subsetting also encourages clustering methods to separate cells according to more modest heterogeneity in the absence of distinct subpopulations. We demonstrate with a cluster of putative memory T cells from the PBMC dataset, identified according to several markers (Figure 10.21). 
g.full &lt;- buildSNNGraph(sce.pbmc, use.dimred = &#39;PCA&#39;) clust.full &lt;- igraph::cluster_walktrap(g.full)$membership plotExpression(sce.pbmc, features=c(&quot;CD3E&quot;, &quot;CCR7&quot;, &quot;CD69&quot;, &quot;CD44&quot;), x=I(factor(clust.full)), colour_by=I(factor(clust.full))) Figure 10.21: Distribution of log-normalized expression values for several T cell markers within each cluster in the 10X PBMC dataset. Each cluster is color-coded for convenience. # Repeating modelling and PCA on the subset. memory &lt;- 10L sce.memory &lt;- sce.pbmc[,clust.full==memory] dec.memory &lt;- modelGeneVar(sce.memory) sce.memory &lt;- denoisePCA(sce.memory, technical=dec.memory, subset.row=getTopHVGs(dec.memory, n=5000)) We apply graph-based clustering within this memory subset to obtain CD4+ and CD8+ subclusters (Figure 10.22). Admittedly, the expression of CD4 is so low that the change is rather modest, but the interpretation is clear enough. g.memory &lt;- buildSNNGraph(sce.memory, use.dimred=&quot;PCA&quot;) clust.memory &lt;- igraph::cluster_walktrap(g.memory)$membership plotExpression(sce.memory, features=c(&quot;CD8A&quot;, &quot;CD4&quot;), x=I(factor(clust.memory))) Figure 10.22: Distribution of CD4 and CD8A log-normalized expression values within each cluster in the memory T cell subset of the 10X PBMC dataset. For subclustering analyses, it is helpful to define a customized function that calls our desired algorithms to obtain a clustering from a given SingleCellExperiment. This function can then be applied multiple times on different subsets without having to repeatedly copy and modify the code for each subset. For example, quickSubCluster() loops over all subsets and executes this user-specified function to generate a list of SingleCellExperiment objects containing the subclustering results. (Of course, the downside is that this assumes that a similar analysis is appropriate for each subset. 
If different subsets require extensive reparametrization, copying the code may actually be more straightforward.) set.seed(1000010) subcluster.out &lt;- quickSubCluster(sce.pbmc, groups=clust.full, prepFUN=function(x) { # Preparing the subsetted SCE for clustering. dec &lt;- modelGeneVar(x) input &lt;- denoisePCA(x, technical=dec, subset.row=getTopHVGs(dec, prop=0.1), BSPARAM=BiocSingular::IrlbaParam()) }, clusterFUN=function(x) { # Performing the subclustering in the subset. g &lt;- buildSNNGraph(x, use.dimred=&quot;PCA&quot;, k=20) igraph::cluster_walktrap(g)$membership } ) # One SingleCellExperiment object per parent cluster: names(subcluster.out) ## [1] &quot;1&quot; &quot;2&quot; &quot;3&quot; &quot;4&quot; &quot;5&quot; &quot;6&quot; &quot;7&quot; &quot;8&quot; &quot;9&quot; &quot;10&quot; &quot;11&quot; &quot;12&quot; &quot;13&quot; &quot;14&quot; &quot;15&quot; ## [16] &quot;16&quot; # Looking at the subclustering for one example: table(subcluster.out[[1]]$subcluster) ## ## 1.1 1.2 1.3 1.4 1.5 1.6 ## 28 22 34 62 11 48 Subclustering is a general and conceptually straightforward procedure for increasing resolution. It can also simplify the interpretation of the subclusters, which only need to be considered in the context of the parent cluster’s identity - for example, we did not have to re-identify the cells in cluster 10 as T cells. However, this is a double-edged sword as it is difficult for practitioners to consider the uncertainty of identification for parent clusters when working with deep nesting. If cell types or states span cluster boundaries, conditioning on the putative cell type identity of the parent cluster can encourage the construction of a “house of cards” of cell type assignments, e.g., where a subcluster of one parent cluster is actually contamination from a cell type in a separate parent cluster. 
Session Info R version 4.0.4 (2021-02-15) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.12-books/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.12-books/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] clustree_0.4.3 ggraph_2.0.5 [3] dynamicTreeCut_1.63-1 dendextend_1.14.0 [5] cluster_2.1.0 pheatmap_1.0.12 [7] scater_1.18.6 ggplot2_3.3.3 [9] bluster_1.0.0 scran_1.18.5 [11] SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0 [13] Biobase_2.50.0 GenomicRanges_1.42.0 [15] GenomeInfoDb_1.26.4 IRanges_2.24.1 [17] S4Vectors_0.28.1 BiocGenerics_0.36.0 [19] MatrixGenerics_1.2.1 matrixStats_0.58.0 [21] BiocStyle_2.18.1 rebook_1.0.0 loaded via a namespace (and not attached): [1] ggbeeswarm_0.6.0 colorspace_2.0-0 [3] ellipsis_0.3.1 scuttle_1.0.4 [5] XVector_0.30.0 BiocNeighbors_1.8.2 [7] farver_2.1.0 graphlayouts_0.7.1 [9] ggrepel_0.9.1 fansi_0.4.2 [11] codetools_0.2-18 sparseMatrixStats_1.2.1 [13] knitr_1.31 polyclip_1.10-0 [15] jsonlite_1.7.2 graph_1.68.0 [17] ggforce_0.3.3 BiocManager_1.30.10 [19] compiler_4.0.4 backports_1.2.1 [21] dqrng_0.2.1 assertthat_0.2.1 [23] Matrix_1.3-2 limma_3.46.0 [25] tweenr_1.0.1 BiocSingular_1.6.0 [27] htmltools_0.5.1.1 tools_4.0.4 [29] rsvd_1.0.3 igraph_1.2.6 [31] gtable_0.3.0 glue_1.4.2 [33] GenomeInfoDbData_1.2.4 dplyr_1.0.5 [35] Rcpp_1.0.6 jquerylib_0.1.3 [37] vctrs_0.3.6 DelayedMatrixStats_1.12.3 [39] xfun_0.22 stringr_1.4.0 [41] ps_1.6.0 beachmat_2.6.4 [43] lifecycle_1.0.0 irlba_2.3.3 [45] statmod_1.4.35 XML_3.99-0.6 [47] edgeR_3.32.1 zlibbioc_1.36.0 [49] MASS_7.3-53 scales_1.1.1 [51] 
tidygraph_1.2.0 RColorBrewer_1.1-2 [53] yaml_2.2.1 gridExtra_2.3 [55] sass_0.3.1 stringi_1.5.3 [57] highr_0.8 checkmate_2.0.0 [59] BiocParallel_1.24.1 rlang_0.4.10 [61] pkgconfig_2.0.3 bitops_1.0-6 [63] evaluate_0.14 lattice_0.20-41 [65] purrr_0.3.4 CodeDepends_0.6.5 [67] labeling_0.4.2 cowplot_1.1.1 [69] processx_3.4.5 tidyselect_1.1.0 [71] magrittr_2.0.1 bookdown_0.21 [73] R6_2.5.0 generics_0.1.0 [75] DelayedArray_0.16.2 DBI_1.1.1 [77] pillar_1.5.1 withr_2.4.1 [79] RCurl_1.98-1.3 tibble_3.1.0 [81] crayon_1.4.1 utf8_1.2.1 [83] rmarkdown_2.7 viridis_0.5.1 [85] locfit_1.5-9.4 grid_4.0.4 [87] callr_3.5.1 digest_0.6.27 [89] tidyr_1.1.3 munsell_0.5.0 [91] beeswarm_0.3.1 viridisLite_0.3.0 [93] vipor_0.4.5 bslib_0.2.4 Bibliography "],["marker-detection.html", "Chapter 11 Marker gene detection 11.1 Motivation 11.2 Pairwise tests between clusters 11.3 Alternative testing regimes 11.4 Handling blocking factors 11.5 Invalidity of \\(p\\)-values 11.6 Further comments Session Info", " Chapter 11 Marker gene detection 11.1 Motivation To interpret our clustering results from Chapter 10, we identify the genes that drive separation between clusters. These marker genes allow us to assign biological meaning to each cluster based on their functional annotation. In the most obvious case, the marker genes for each cluster are a priori associated with particular cell types, allowing us to treat the clustering as a proxy for cell type identity. The same principle can be applied to discover more subtle differences between clusters (e.g., changes in activation or differentiation state) based on the behavior of genes in the affected pathways. 
Identification of marker genes is usually based around the retrospective detection of differential expression between clusters. Genes that are more strongly DE are more likely to have caused separate clustering of cells in the first place. Several different statistical tests are available to quantify the differences in expression profiles, and different approaches can be used to consolidate test results into a single ranking of genes for each cluster. These choices parametrize the theoretical differences between the various marker detection strategies presented in this chapter. We will demonstrate using the 10X PBMC dataset: #--- loading ---# library(DropletTestFiles) raw.path &lt;- getTestFile(&quot;tenx-2.1.0-pbmc4k/1.0.0/raw.tar.gz&quot;) out.path &lt;- file.path(tempdir(), &quot;pbmc4k&quot;) untar(raw.path, exdir=out.path) library(DropletUtils) fname &lt;- file.path(out.path, &quot;raw_gene_bc_matrices/GRCh38&quot;) sce.pbmc &lt;- read10xCounts(fname, col.names=TRUE) #--- gene-annotation ---# library(scater) rownames(sce.pbmc) &lt;- uniquifyFeatureNames( rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol) library(EnsDb.Hsapiens.v86) location &lt;- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce.pbmc)$ID, column=&quot;SEQNAME&quot;, keytype=&quot;GENEID&quot;) #--- cell-detection ---# set.seed(100) e.out &lt;- emptyDrops(counts(sce.pbmc)) sce.pbmc &lt;- sce.pbmc[,which(e.out$FDR &lt;= 0.001)] #--- quality-control ---# stats &lt;- perCellQCMetrics(sce.pbmc, subsets=list(Mito=which(location==&quot;MT&quot;))) high.mito &lt;- isOutlier(stats$subsets_Mito_percent, type=&quot;higher&quot;) sce.pbmc &lt;- sce.pbmc[,!high.mito] #--- normalization ---# library(scran) set.seed(1000) clusters &lt;- quickCluster(sce.pbmc) sce.pbmc &lt;- computeSumFactors(sce.pbmc, cluster=clusters) sce.pbmc &lt;- logNormCounts(sce.pbmc) #--- variance-modelling ---# set.seed(1001) dec.pbmc &lt;- modelGeneVarByPoisson(sce.pbmc) top.pbmc &lt;- getTopHVGs(dec.pbmc, prop=0.1) #--- 
dimensionality-reduction ---# set.seed(10000) sce.pbmc &lt;- denoisePCA(sce.pbmc, subset.row=top.pbmc, technical=dec.pbmc) set.seed(100000) sce.pbmc &lt;- runTSNE(sce.pbmc, dimred=&quot;PCA&quot;) set.seed(1000000) sce.pbmc &lt;- runUMAP(sce.pbmc, dimred=&quot;PCA&quot;) #--- clustering ---# g &lt;- buildSNNGraph(sce.pbmc, k=10, use.dimred = &#39;PCA&#39;) clust &lt;- igraph::cluster_walktrap(g)$membership colLabels(sce.pbmc) &lt;- factor(clust) sce.pbmc ## class: SingleCellExperiment ## dim: 33694 3985 ## metadata(1): Samples ## assays(2): counts logcounts ## rownames(33694): RP11-34P13.3 FAM138A ... AC213203.1 FAM231B ## rowData names(2): ID Symbol ## colnames(3985): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ... ## TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1 ## colData names(4): Sample Barcode sizeFactor label ## reducedDimNames(3): PCA TSNE UMAP ## altExpNames(0): 11.2 Pairwise tests between clusters 11.2.1 Motivation Our general strategy is to perform DE tests between pairs of clusters and then combine results into a single ranking of marker genes for each cluster. We deliberately use pairwise comparisons rather than comparing each cluster to the average of all other cells; the latter approach is sensitive to the population composition, which introduces an element of unpredictability to the marker sets due to variation in cell type abundances. (In the worst case, the presence of one subpopulation containing a majority of the cells will drive the selection of top markers for every other cluster, pushing out useful genes that can distinguish between the smaller subpopulations.) Moreover, pairwise comparisons naturally provide more information with which to interpret the utility of a marker, e.g., by providing log-fold changes to indicate which clusters are distinguished by each gene. For this section, we will use the Welch \\(t\\)-test to perform our DE testing between clusters. 
This is an easy choice as it is quickly computed and has good statistical properties for large numbers of cells (Soneson and Robinson 2018). However, the same approach can also be applied with any pairwise statistical test, as discussed in Section 11.3. 11.2.2 Combining pairwise statistics per cluster 11.2.2.1 Looking for any differences We perform pairwise \\(t\\)-tests between clusters for each gene using the findMarkers() function, which returns a list of DataFrames containing ranked candidate markers for each cluster. The function will automatically retrieve the cluster identities from sce.pbmc using the colLabels() function, though we can easily specify other clustering schemes by explicitly supplying them via the groups= argument. library(scran) markers.pbmc &lt;- findMarkers(sce.pbmc) markers.pbmc ## List of length 16 ## names(16): 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 The default philosophy of findMarkers() is to identify a combination of marker genes that - together - uniquely define one cluster against the rest. To this end, we collect the top DE genes from each pairwise comparison involving a particular cluster to assemble a set of candidate markers for that cluster. We will demonstrate on cluster 7; the relevant DataFrame contains log2-fold changes of expression in cluster 7 over each other cluster, along with several statistics obtained by combining \\(p\\)-values (Simes 1986) across the pairwise comparisons involving 7. 
chosen &lt;- &quot;7&quot; interesting &lt;- markers.pbmc[[chosen]] colnames(interesting) ## [1] &quot;Top&quot; &quot;p.value&quot; &quot;FDR&quot; &quot;summary.logFC&quot; ## [5] &quot;logFC.1&quot; &quot;logFC.2&quot; &quot;logFC.3&quot; &quot;logFC.4&quot; ## [9] &quot;logFC.5&quot; &quot;logFC.6&quot; &quot;logFC.8&quot; &quot;logFC.9&quot; ## [13] &quot;logFC.10&quot; &quot;logFC.11&quot; &quot;logFC.12&quot; &quot;logFC.13&quot; ## [17] &quot;logFC.14&quot; &quot;logFC.15&quot; &quot;logFC.16&quot; Of particular interest is the Top field. The set of genes with Top \\(\\le X\\) is the union of the top \\(X\\) genes (ranked by \\(p\\)-value) from each pairwise comparison involving cluster 7. For example, the set of all genes with Top values of 1 contains the gene with the lowest \\(p\\)-value from each comparison. Similarly, the set of genes with Top values less than or equal to 10 contains the top 10 genes from each comparison. The Top field represents findMarkers()’s approach to consolidating multiple pairwise comparisons into a single ranking for each cluster; each DataFrame produced by findMarkers() will order genes based on the Top value by default. interesting[1:10,1:4] ## DataFrame with 10 rows and 4 columns ## Top p.value FDR summary.logFC ## &lt;integer&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## S100A4 1 2.59737e-38 1.27018e-36 -4.27560 ## TAGLN2 1 8.65033e-28 2.44722e-26 5.07327 ## FCGR3A 1 8.84356e-63 1.15048e-60 -3.07121 ## GZMA 1 1.15392e-120 7.20000e-118 -1.92877 ## HLA-DQA1 1 3.43640e-83 8.90663e-81 -3.54890 ## TMSB4X 1 9.83227e-36 4.25820e-34 4.28970 ## FCN1 1 1.74313e-239 9.78883e-236 -2.77594 ## TRAC 1 0.00000e+00 0.00000e+00 -2.44793 ## RPL17 1 2.95529e-71 5.18622e-69 -2.86310 ## CD79A 1 0.00000e+00 0.00000e+00 -2.98030 We use the Top field to identify a set of genes that is guaranteed to distinguish cluster 7 from any other cluster. Here, we examine the top 6 genes from each pairwise comparison (Figure 11.1). 
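The construction of Top can be captured in a couple of lines. Given a hypothetical matrix of p-values with one row per gene and one column per pairwise comparison, a gene's Top value is the best rank it achieves in any comparison (a sketch in Python/NumPy for illustration; ties are resolved arbitrarily here, and `top_values` is our own name, not a findMarkers() internal):

```python
import numpy as np

def top_values(p):
    # p[g, c]: p-value for gene g in pairwise comparison c.
    # Rank genes within each comparison, then take each gene's best rank;
    # genes with Top <= X are then the union of the top X genes
    # from every pairwise comparison.
    ranks = np.argsort(np.argsort(p, axis=0), axis=0) + 1
    return ranks.min(axis=1)

p = np.array([[0.001, 0.9],    # gene A: best in comparison 1
              [0.5,   0.002],  # gene B: best in comparison 2
              [0.01,  0.01]])  # gene C: runner-up in both
print(top_values(p))  # [1 1 2]
```

This makes the guarantee explicit: filtering on `Top <= X` keeps at least the X strongest genes from every comparison, so no pairwise distinction involving the chosen cluster is left without a candidate marker.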
Some inspection of the most upregulated genes suggests that cluster 7 contains platelets or their precursors, based on the expression of platelet factor 4 (PF4) and pro-platelet basic protein (PPBP). best.set &lt;- interesting[interesting$Top &lt;= 6,] logFCs &lt;- getMarkerEffects(best.set) library(pheatmap) pheatmap(logFCs, breaks=seq(-5, 5, length.out=101)) Figure 11.1: Heatmap of log-fold changes for cluster 7 over all other clusters. Colours are capped at -5 and 5 to preserve dynamic range. Each DataFrame also contains several other statistics that may be of interest. The summary.logFC field provides a convenient summary of the direction and effect size for each gene, and is defined here as the log-fold change from the comparison with the lowest \\(p\\)-value. The p.value field contains the combined \\(p\\)-value that is obtained by applying Simes’ method to the pairwise \\(p\\)-values for each gene and represents the evidence against the joint null hypothesis, i.e., that the gene is not DE between cluster 7 and any other cluster. Examination of these statistics permits a quick evaluation of the suitability of a candidate marker; if both of these metrics are poor (small log-fold change, large \\(p\\)-value), the gene can most likely be dismissed. 11.2.2.2 Finding cluster-specific markers By default, findMarkers() will give a high ranking to genes that are differentially expressed in any pairwise comparison. This is because a gene only needs a very low \\(p\\)-value in a single pairwise comparison to achieve a low Top value. A more stringent approach would only consider genes that are differentially expressed in all pairwise comparisons involving the cluster of interest. To achieve this, we set pval.type=\"all\" in findMarkers() to use an intersection-union test (Berger and Hsu 1996) where the combined \\(p\\)-value for each gene is the maximum of the \\(p\\)-values from all pairwise comparisons. 
A gene will only achieve a low combined \\(p\\)-value if it is strongly DE in all comparisons to other clusters. # Set direction=&#39;up&#39; to only consider upregulated genes as potential markers. markers.pbmc.up3 &lt;- findMarkers(sce.pbmc, pval.type=&quot;all&quot;, direction=&quot;up&quot;) interesting.up3 &lt;- markers.pbmc.up3[[chosen]] interesting.up3[1:10,1:3] ## DataFrame with 10 rows and 3 columns ## p.value FDR summary.logFC ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## SDPR 2.86451e-23 9.65166e-19 5.49695 ## NRGN 5.91075e-23 9.95784e-19 4.71034 ## TAGLN2 4.41617e-21 4.95995e-17 3.61864 ## PPBP 1.36171e-20 1.14703e-16 6.34717 ## GNG11 1.23155e-19 8.29918e-16 5.35904 ## HIST1H2AC 3.94013e-19 2.21265e-15 5.46400 ## TUBB1 9.64049e-19 4.64038e-15 4.92204 ## PF4 1.87045e-14 7.87785e-11 6.49672 ## CLU 8.75900e-13 3.27918e-09 3.90273 ## RGS18 8.88042e-12 2.99217e-08 3.63236 This strategy will only report genes that are highly specific to the cluster of interest. When it works, it can be highly effective as it generates a small focused set of candidate markers. However, any gene that is expressed at the same level in two or more clusters will simply not be detected. This is likely to discard many interesting genes, especially if the clusters are finely resolved with weak separation. To give a concrete example, consider a mixed population of CD4+-only, CD8+-only, double-positive and double-negative T cells. With pval.type=\"all\", neither Cd4 nor Cd8 would be detected as subpopulation-specific markers because each gene is expressed in two subpopulations. In comparison, pval.type=\"any\" will detect both of these genes as they will be DE between at least one pair of subpopulations. 11.2.2.3 Balancing stringency and generality If pval.type=\"all\" is too stringent yet pval.type=\"any\" is too generous, a compromise is to set pval.type=\"some\". 
For each gene, we apply the Holm-Bonferroni correction across its \\(p\\)-values and take the middle-most value as the combined \\(p\\)-value. This effectively tests the global null hypothesis that at least 50% of the individual pairwise comparisons exhibit no DE. We then rank the genes by their combined \\(p\\)-values to obtain an ordered set of marker candidates. The aim is to improve the conciseness of the top markers for defining a cluster while mitigating the risk of discarding useful genes that are not DE to all other clusters. The downside is that taking this compromise position sacrifices the theoretical guarantees offered at the other two extremes. markers.pbmc.up4 &lt;- findMarkers(sce.pbmc, pval.type=&quot;some&quot;, direction=&quot;up&quot;) interesting.up4 &lt;- markers.pbmc.up4[[chosen]] interesting.up4[1:10,1:3] ## DataFrame with 10 rows and 3 columns ## p.value FDR summary.logFC ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## PF4 5.23414e-32 1.76359e-27 6.86288 ## TMSB4X 4.52854e-25 7.62923e-21 2.90202 ## TAGLN2 2.31252e-24 2.59727e-20 4.88268 ## NRGN 1.08964e-22 9.17861e-19 5.00827 ## SDPR 2.47896e-22 1.67052e-18 5.60445 ## PPBP 8.57360e-20 4.81465e-16 6.50103 ## CCL5 5.66181e-19 2.72527e-15 5.30774 ## GNG11 8.98381e-19 3.59373e-15 5.47403 ## GPX1 9.59922e-19 3.59373e-15 4.86299 ## HIST1H2AC 2.85071e-18 9.60519e-15 5.53275 In both cases, a different method is used to compute the summary effect size compared to pval.type=\"any\". For pval.type=\"all\", the summary log-fold change is defined as that corresponding to the pairwise comparison with the largest \\(p\\)-value, while for pval.type=\"some\", it is defined as the log-fold change for the comparison with the middle-most \\(p\\)-value. This reflects the calculation of the combined \\(p\\)-value and avoids focusing on genes with strong changes in only one comparison. 
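The three combining rules differ only in how a gene's vector of pairwise p-values is collapsed to a single number. A compact sketch (Python/NumPy, for illustration only; the "middle-most" rule below is one reading of the description above, whereas findMarkers() controls the exact quantile more precisely, and the function names are our own):

```python
import numpy as np

def combine_any(p):   # pval.type="any": Simes' method
    p = np.sort(np.asarray(p, float))
    n = len(p)
    return float(np.min(n * p / np.arange(1, n + 1)))

def combine_all(p):   # pval.type="all": intersection-union test
    return float(np.max(p))

def combine_some(p):  # pval.type="some": middle-most Holm-adjusted value
    p = np.sort(np.asarray(p, float))
    n = len(p)
    holm = np.minimum(np.maximum.accumulate(p * np.arange(n, 0, -1)), 1)
    return float(holm[n // 2])

# A gene that is strongly DE in one pairwise comparison only:
p = [1e-8, 0.2, 0.3, 0.8]
print(combine_any(p))   # tiny: one strong comparison suffices
print(combine_all(p))   # 0.8: one non-DE comparison sinks the gene
print(combine_some(p))  # in between
```

The toy vector makes the trade-off tangible: "any" rewards a single strong comparison, "all" is vetoed by the weakest one, and "some" sits between the two extremes.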
11.2.3 Using the log-fold change The default findMarkers() call considers both up- and downregulated genes to be potential markers. However, downregulated genes are less appealing as markers as it is more difficult to interpret and experimentally validate an absence of expression. To focus on up-regulated markers, we can instead perform a one-sided \\(t\\)-test to identify genes that are upregulated in each cluster compared to the others. This is achieved by setting direction=\"up\" in the findMarkers() call. markers.pbmc.up &lt;- findMarkers(sce.pbmc, direction=&quot;up&quot;) interesting.up &lt;- markers.pbmc.up[[chosen]] interesting.up[1:10,1:4] ## DataFrame with 10 rows and 4 columns ## Top p.value FDR summary.logFC ## &lt;integer&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## TAGLN2 1 4.32517e-28 4.85774e-24 5.07327 ## PF4 1 4.78929e-35 8.06851e-31 6.71811 ## TMSB4X 1 4.91613e-36 1.65644e-31 4.28970 ## NRGN 2 1.35810e-23 9.15195e-20 4.86347 ## B2M 2 8.10863e-25 6.83030e-21 2.40365 ## SDPR 3 2.32759e-23 1.30710e-19 5.54225 ## GPX1 4 4.74597e-21 2.28444e-17 5.71604 ## PPBP 5 8.85410e-21 3.31478e-17 6.41411 ## ACTB 6 6.22981e-21 2.62384e-17 3.79868 ## GNG11 6 9.05522e-20 3.05107e-16 5.48735 The \\(t\\)-test also allows us to specify a non-zero log-fold change as the null hypothesis. This allows us to consider the magnitude of the log-fold change in our \\(p\\)-value calculations, in a manner that is more rigorous than simply filtering directly on the log-fold changes (McCarthy and Smyth 2009). (Specifically, a simple threshold does not consider the variance and can enrich for genes that have both large log-fold changes and large variances.) 
We perform this by setting lfc= in our findMarkers() call - when combined with direction=, this tests for genes with log-fold changes that are significantly greater than 1: markers.pbmc.up2 &lt;- findMarkers(sce.pbmc, direction=&quot;up&quot;, lfc=1) interesting.up2 &lt;- markers.pbmc.up2[[chosen]] interesting.up2[1:10,1:4] ## DataFrame with 10 rows and 4 columns ## Top p.value FDR summary.logFC ## &lt;integer&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## TAGLN2 1 9.48392e-23 1.06517e-18 5.07327 ## PF4 1 2.19317e-31 7.38966e-27 6.71811 ## SDPR 2 5.42215e-20 4.56735e-16 5.54225 ## TMSB4X 2 9.90003e-28 1.66786e-23 4.28970 ## NRGN 3 9.24786e-20 6.23195e-16 4.95866 ## GPX1 4 7.73653e-18 3.72392e-14 5.71604 ## PPBP 4 5.53317e-18 3.10725e-14 6.51025 ## GNG11 6 1.56486e-16 6.59079e-13 5.48735 ## CCL5 6 2.72759e-16 1.02115e-12 5.39815 ## HIST1H2AC 7 5.56042e-16 1.87353e-12 5.57765 These two settings yield a more focused set of candidate marker genes that are upregulated in cluster 7 (Figure 11.2). best.set &lt;- interesting.up2[interesting.up2$Top &lt;= 5,] logFCs &lt;- getMarkerEffects(best.set) library(pheatmap) pheatmap(logFCs, breaks=seq(-5, 5, length.out=101)) Figure 11.2: Heatmap of log-fold changes for cluster 7 over all other clusters. Colours are capped at -5 and 5 to preserve dynamic range. Of course, this increased stringency is not without cost. If only upregulated genes are requested from findMarkers(), any cluster defined by downregulation of a marker gene will not contain that gene among the top set of features in its DataFrame. This is occasionally relevant for subtypes or other states that are defined by low expression of particular genes2. Similarly, setting an excessively high log-fold change threshold may discard otherwise useful genes. For example, a gene upregulated in a small proportion of cells of a cluster will have a small log-fold change but can still be an effective marker if the focus is on specificity rather than sensitivity. 
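As a simplified base R analogue of the thresholded null (simulated values, not scran's actual implementation), we can shift the null value of a one-sided t.test() rather than filtering on the observed log-fold change afterwards:

```r
# Simulated log-expression values for one gene in two clusters.
set.seed(100)
x <- rnorm(60, mean=2.5)  # cluster of interest
y <- rnorm(60, mean=0)    # another cluster

# H0: log-fold change <= 1, versus H1: log-fold change > 1.
t.test(x, y, alternative="greater", mu=1)$p.value
```

This incorporates both the size of the difference and its uncertainty into the \(p\)-value, unlike a post hoc filter on the log-fold change.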
11.3 Alternative testing regimes 11.3.1 Using the Wilcoxon rank sum test The Wilcoxon rank sum test (also known as the Wilcoxon-Mann-Whitney test, or WMW test) is another widely used method for pairwise comparisons between groups of observations. Its strength lies in the fact that it directly assesses separation between the expression distributions of different clusters. The WMW test statistic is proportional to the area-under-the-curve (AUC), i.e., the concordance probability, which is the probability of a random cell from one cluster having higher expression than a random cell from another cluster. In a pairwise comparison, AUCs of 1 or 0 indicate that the two clusters have perfectly separated expression distributions. Thus, the WMW test directly addresses the most desirable property of a candidate marker gene, while the \\(t\\) test only does so indirectly via the difference in the means and the intra-group variance. We perform WMW tests by again using the findMarkers() function, this time with test=\"wilcox\". This returns a list of DataFrames containing ranked candidate markers for each cluster. The direction=, lfc= and pval.type= arguments can be specified and have the same interpretation as described for \\(t\\)-tests. We demonstrate below by detecting upregulated genes in each cluster with direction=\"up\". markers.pbmc.wmw &lt;- findMarkers(sce.pbmc, test=&quot;wilcox&quot;, direction=&quot;up&quot;) names(markers.pbmc.wmw) ## [1] &quot;1&quot; &quot;2&quot; &quot;3&quot; &quot;4&quot; &quot;5&quot; &quot;6&quot; &quot;7&quot; &quot;8&quot; &quot;9&quot; &quot;10&quot; &quot;11&quot; &quot;12&quot; &quot;13&quot; &quot;14&quot; &quot;15&quot; ## [16] &quot;16&quot; To explore the results in more detail, we focus on the DataFrame for cluster 7. The interpretation of Top is the same as described for \\(t\\)-tests, and Simes’ method is again used to combine \\(p\\)-values across pairwise comparisons. 
If we want more focused sets, we can also change pval.type= as previously described. interesting.wmw &lt;- markers.pbmc.wmw[[chosen]] interesting.wmw[1:10,1:4] ## DataFrame with 10 rows and 4 columns ## Top p.value FDR summary.AUC ## &lt;integer&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## PF4 1 1.02312e-179 3.44731e-175 0.989080 ## TMSB4X 1 3.14604e-29 1.20457e-26 0.998195 ## SDPR 2 1.36598e-159 2.30126e-155 0.956221 ## NRGN 2 2.77288e-142 1.86859e-138 0.966865 ## TAGLN2 3 1.58373e-29 6.20491e-27 0.967680 ## PPBP 3 1.35961e-147 1.52702e-143 0.934256 ## GNG11 3 4.00798e-139 2.25075e-135 0.934030 ## TUBB1 3 3.23282e-146 2.72317e-142 0.923386 ## HIST1H2AC 5 2.49447e-97 5.60325e-94 0.932300 ## B2M 5 3.15826e-25 9.85320e-23 0.968938 The DataFrame contains the AUCs from comparing cluster 7 to every other cluster (Figure 11.3). A value greater than 0.5 indicates that the gene is upregulated in the current cluster compared to the other cluster, while values less than 0.5 correspond to downregulation. We would typically expect AUCs of 0.7-0.8 for a strongly upregulated candidate marker. best.set &lt;- interesting.wmw[interesting.wmw$Top &lt;= 5,] AUCs &lt;- getMarkerEffects(best.set, prefix=&quot;AUC&quot;) library(pheatmap) pheatmap(AUCs, breaks=seq(0, 1, length.out=21), color=viridis::viridis(21)) Figure 11.3: Heatmap of AUCs for cluster 7 compared to all other clusters. One practical advantage of the WMW test over the Welch \\(t\\)-test is that it is symmetric with respect to differences in the size of the groups being compared. This means that, all else being equal, the top-ranked genes on each side of a DE comparison will have similar expression profiles regardless of the number of cells in each group. In contrast, the \\(t\\)-test will favor genes where the larger group has the higher relative variance as this increases the estimated degrees of freedom and decreases the resulting \\(p\\)-value. 
This can lead to unappealing rankings when the aim is to identify genes upregulated in smaller groups. The WMW test is not completely immune to variance effects - for example, it will slightly favor detection of DEGs at low average abundance, where the greater number of ties at zero deflates the approximate variance of the rank sum statistic - but this is relatively benign as the selected genes are still fairly interesting. We observe both of these effects in a comparison between alpha and gamma cells in the human pancreas data set from Lawlor et al. (2017) (Figure 11.4).

```r
#--- loading ---#
library(scRNAseq)
sce.lawlor <- LawlorPancreasData()

#--- gene-annotation ---#
library(AnnotationHub)
edb <- AnnotationHub()[["AH73881"]]
anno <- select(edb, keys=rownames(sce.lawlor), keytype="GENEID",
    columns=c("SYMBOL", "SEQNAME"))
rowData(sce.lawlor) <- anno[match(rownames(sce.lawlor), anno[,1]),-1]

#--- quality-control ---#
library(scater)
stats <- perCellQCMetrics(sce.lawlor,
    subsets=list(Mito=which(rowData(sce.lawlor)$SEQNAME=="MT")))
qc <- quickPerCellQC(stats, percent_subsets="subsets_Mito_percent",
    batch=sce.lawlor$`islet unos id`)
sce.lawlor <- sce.lawlor[,!qc$discard]

#--- normalization ---#
library(scran)
set.seed(1000)
clusters <- quickCluster(sce.lawlor)
sce.lawlor <- computeSumFactors(sce.lawlor, clusters=clusters)
sce.lawlor <- logNormCounts(sce.lawlor)
```

```r
marker.lawlor.t <- findMarkers(sce.lawlor, groups=sce.lawlor$`cell type`,
    direction="up", restrict=c("Alpha", "Gamma/PP"))
marker.lawlor.w <- findMarkers(sce.lawlor, groups=sce.lawlor$`cell type`,
    direction="up", restrict=c("Alpha", "Gamma/PP"), test.type="wilcox")

# Upregulated in alpha:
marker.alpha.t <- marker.lawlor.t$Alpha
marker.alpha.w <- marker.lawlor.w$Alpha
chosen.alpha.t <- rownames(marker.alpha.t)[1:20]
chosen.alpha.w <- rownames(marker.alpha.w)[1:20]
u.alpha.t <- setdiff(chosen.alpha.t, chosen.alpha.w)
u.alpha.w <- setdiff(chosen.alpha.w, chosen.alpha.t)

# Upregulated in gamma:
marker.gamma.t <- marker.lawlor.t$`Gamma/PP`
marker.gamma.w <- marker.lawlor.w$`Gamma/PP`
chosen.gamma.t <- rownames(marker.gamma.t)[1:20]
chosen.gamma.w <- rownames(marker.gamma.w)[1:20]
u.gamma.t <- setdiff(chosen.gamma.t, chosen.gamma.w)
u.gamma.w <- setdiff(chosen.gamma.w, chosen.gamma.t)

# Examining all uniquely detected markers in each direction.
library(scater)
subset <- sce.lawlor[,sce.lawlor$`cell type` %in% c("Alpha", "Gamma/PP")]
gridExtra::grid.arrange(
    plotExpression(subset, x="cell type", features=u.alpha.t, ncol=2) +
        ggtitle("Upregulated in alpha, t-test-only"),
    plotExpression(subset, x="cell type", features=u.alpha.w, ncol=2) +
        ggtitle("Upregulated in alpha, WMW-test-only"),
    plotExpression(subset, x="cell type", features=u.gamma.t, ncol=2) +
        ggtitle("Upregulated in gamma, t-test-only"),
    plotExpression(subset, x="cell type", features=u.gamma.w, ncol=2) +
        ggtitle("Upregulated in gamma, WMW-test-only"),
    ncol=2
)
```

Figure 11.4: Distribution of expression values for alpha or gamma cell-specific markers in the GSE86469 human pancreas dataset. Each panel focuses on the genes that were uniquely ranked in the top 20 candidate markers by either the t-test or WMW test.

The main disadvantage of the WMW test is that the AUCs are much slower to compute compared to \(t\)-statistics. This may be inconvenient for interactive analyses involving multiple iterations of marker detection. We can mitigate this to some extent by parallelizing these calculations using the BPPARAM= argument in findMarkers().

11.3.2 Using a binomial test

The binomial test identifies genes that differ in the proportion of expressing cells between clusters.
(For the purposes of this section, a cell is considered to express a gene simply if it has non-zero expression for that gene.) This represents a much more stringent definition of marker genes compared to the other methods, as differences in expression between clusters are effectively ignored if both distributions of expression values are not near zero. The premise is that genes are more likely to contribute to important biological decisions if they were active in one cluster and silent in another, compared to more subtle “tuning” effects from changing the expression of an active gene. From a practical perspective, a binary measure of presence/absence is easier to validate. We perform pairwise binomial tests between clusters using the findMarkers() function with test=\"binom\". This returns a list of DataFrames containing marker statistics for each cluster such as the Top rank and its \\(p\\)-value. Here, the effect size is reported as the log-fold change in this proportion between each pair of clusters. Large positive log-fold changes indicate that the gene is more frequently expressed in one cluster compared to the other. We focus on genes that are upregulated in each cluster compared to the others by setting direction=\"up\". 
```r
markers.pbmc.binom <- findMarkers(sce.pbmc, test="binom", direction="up")
names(markers.pbmc.binom)
```

```
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
## [16] "16"
```

```r
interesting.binom <- markers.pbmc.binom[[chosen]]
colnames(interesting.binom)
```

```
##  [1] "Top"           "p.value"       "FDR"           "summary.logFC"
##  [5] "logFC.1"       "logFC.2"       "logFC.3"       "logFC.4"
##  [9] "logFC.5"       "logFC.6"       "logFC.8"       "logFC.9"
## [13] "logFC.10"      "logFC.11"      "logFC.12"      "logFC.13"
## [17] "logFC.14"      "logFC.15"      "logFC.16"
```

Figure 11.5 confirms that the top genes exhibit strong differences in the proportion of expressing cells in cluster 7 compared to the others.

```r
library(scater)
top.genes <- head(rownames(interesting.binom))
plotExpression(sce.pbmc, x="label", features=top.genes)
```

Figure 11.5: Distribution of log-normalized expression values for the top DE genes involving cluster 7 with the binomial test, stratified by cluster assignment.

The disadvantage of the binomial test is that its increased stringency can lead to the loss of good candidate markers. For example, GCG is a known marker for pancreatic alpha cells but is expressed in almost every other cell of the Lawlor et al. (2017) pancreas data (Figure 11.6) and would not be highly ranked by the binomial test.

```r
plotExpression(sce.lawlor, x="cell type", features="ENSG00000115263")
```

Figure 11.6: Distribution of log-normalized expression values for GCG across different pancreatic cell types in the Lawlor pancreas data.
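At its core, test="binom" asks whether the proportion of expressing (non-zero) cells differs between two clusters. The base R sketch below uses hypothetical counts, and Fisher's exact test stands in for the binomial test that scran actually performs; the effect size is the log-fold change in proportions, as described above.

```r
# Hypothetical counts of expressing cells in two clusters.
expressing <- c(A=90, B=5)    # cells with non-zero counts for the gene
totals <- c(A=100, B=120)     # total cells in each cluster

# Log-fold change in the proportion of expressing cells.
logFC.prop <- log2((expressing["A"]/totals["A"]) / (expressing["B"]/totals["B"]))

# Exact test on the 2x2 table of expressing vs non-expressing cells.
pval <- fisher.test(cbind(expressing, totals - expressing))$p.value
```

A gene like this one - expressed in 90% of one cluster but only ~4% of the other - is exactly the kind of presence/absence marker this test prioritizes.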
Another property of the binomial test is that it will not respond to scaling normalization. Systematic differences in library size between clusters will not be considered when computing \\(p\\)-values or effect sizes. This is not necessarily problematic for marker gene detection - users can treat this as retaining information about the total RNA content, analogous to spike-in normalization in Section 7.4. 11.3.3 Using custom DE methods We can also detect marker genes from precomputed DE statistics, allowing us to take advantage of more sophisticated tests in other Bioconductor packages such as edgeR and DESeq2. This functionality is not commonly used - see below for an explanation - but nonetheless, we will demonstrate how one would go about applying it to the PBMC dataset. Our strategy is to loop through each pair of clusters, performing a more-or-less standard DE analysis between pairs using the voom() approach from the limma package (Law et al. 2014). (Specifically, we use the TREAT strategy (McCarthy and Smyth 2009) to test for log-fold changes that are significantly greater than 0.5.) library(limma) dge &lt;- convertTo(sce.pbmc) uclust &lt;- unique(dge$samples$label) all.results &lt;- all.pairs &lt;- list() counter &lt;- 1L for (x in uclust) { for (y in uclust) { if (x==y) break # avoid redundant comparisons. # Factor ordering ensures that &#39;x&#39; is the not the intercept, # so resulting fold changes can be interpreted as x/y. subdge &lt;- dge[,dge$samples$label %in% c(x, y)] subdge$samples$label &lt;- factor(subdge$samples$label, c(y, x)) design &lt;- model.matrix(~label, subdge$samples) # No need to normalize as we are using the size factors # transferred from &#39;sce.pbmc&#39; and converted to norm.factors. # We also relax the filtering for the lower UMI counts. subdge &lt;- subdge[calculateAverage(subdge$counts) &gt; 0.1,] # Standard voom-limma pipeline starts here. 
v &lt;- voom(subdge, design) fit &lt;- lmFit(v, design) fit &lt;- treat(fit, lfc=0.5) res &lt;- topTreat(fit, n=Inf, sort.by=&quot;none&quot;) # Filling out the genes that got filtered out with NA&#39;s. res &lt;- res[rownames(dge),] rownames(res) &lt;- rownames(dge) all.results[[counter]] &lt;- res all.pairs[[counter]] &lt;- c(x, y) counter &lt;- counter+1L # Also filling the reverse comparison. res$logFC &lt;- -res$logFC all.results[[counter]] &lt;- res all.pairs[[counter]] &lt;- c(y, x) counter &lt;- counter+1L } } For each comparison, we store the corresponding data frame of statistics in all.results, along with the identities of the clusters involved in all.pairs. We consolidate the pairwise DE statistics into a single marker list for each cluster with the combineMarkers() function, yielding a per-cluster DataFrame that can be interpreted in the same manner as discussed previously. We can also specify pval.type= and direction= to control the consolidation procedure, e.g., setting pval.type=\"all\" and direction=\"up\" will prioritize genes that are significantly upregulated in each cluster against all other clusters. all.pairs &lt;- do.call(rbind, all.pairs) combined &lt;- combineMarkers(all.results, all.pairs, pval.field=&quot;P.Value&quot;) # Inspecting the results for one of the clusters. 
```r
interesting.voom <- combined[[chosen]]
colnames(interesting.voom)
```

```
##  [1] "Top"           "p.value"       "FDR"           "summary.logFC"
##  [5] "logFC.8"       "logFC.10"      "logFC.2"       "logFC.6"
##  [9] "logFC.4"       "logFC.3"       "logFC.12"      "logFC.11"
## [13] "logFC.1"       "logFC.15"      "logFC.9"       "logFC.5"
## [17] "logFC.14"      "logFC.13"      "logFC.16"
```

```r
head(interesting.voom[,1:4])
```

```
## DataFrame with 6 rows and 4 columns
##                  Top   p.value       FDR summary.logFC
##            <integer> <numeric> <numeric>     <numeric>
## FO538757.2         1         0         0        7.4010
## AP006222.2         1         0         0        7.2696
## RPL22              1         0         0       10.5080
## RPL11              1         0         0       11.9058
## RPS27              1         0         0       12.6731
## H3F3A              1         0         0       11.0945
```

We do not routinely use custom DE methods to perform marker detection, for several reasons. Many of these methods rely on empirical Bayes shrinkage to share information across genes in the presence of limited replication. This is unnecessary when there are large numbers of "replicate" cells in each group, and it does nothing to solve the fundamental \(n=1\) problem in these comparisons (Section 11.5.2). These methods also make stronger assumptions about the data (e.g., equal variances for linear models, the distribution of variances during empirical Bayes) that are more likely to be violated in noisy scRNA-seq contexts. From a practical perspective, they require more work to set up and take more time to run. That said, some custom methods (e.g., MAST) may provide a useful point of difference from the simpler tests, in which case they can be converted into a marker detection scheme by modifying the above code. Indeed, the same code chunk can be directly applied (after switching back to the standard filtering and normalization steps inside the loop) to bulk RNA-seq experiments involving a large number of different conditions.
This allows us to recycle the scran machinery to consolidate results across many pairwise comparisons for easier interpretation.

11.3.4 Combining multiple marker statistics

On occasion, we might want to combine marker statistics from several testing regimes into a single DataFrame. This allows us to easily inspect multiple statistics at once to verify that a particular gene is a strong candidate marker. For example, a large AUC from the WMW test indicates that the expression distributions are well-separated between clusters, while the log-fold change reported with the \(t\)-test provides a more interpretable measure of the magnitude of the change in expression. We use the multiMarkerStats() function to merge the results of separate findMarkers() calls into one DataFrame per cluster, with statistics interleaved to facilitate a direct comparison between the different test regimes.

```r
combined <- multiMarkerStats(
    t=findMarkers(sce.pbmc, direction="up"),
    wilcox=findMarkers(sce.pbmc, test="wilcox", direction="up"),
    binom=findMarkers(sce.pbmc, test="binom", direction="up")
)

# Interleaved marker statistics from each test for each cluster.
```
colnames(combined[[&quot;1&quot;]]) ## [1] &quot;Top&quot; &quot;p.value&quot; &quot;FDR&quot; ## [4] &quot;t.Top&quot; &quot;wilcox.Top&quot; &quot;binom.Top&quot; ## [7] &quot;t.p.value&quot; &quot;wilcox.p.value&quot; &quot;binom.p.value&quot; ## [10] &quot;t.FDR&quot; &quot;wilcox.FDR&quot; &quot;binom.FDR&quot; ## [13] &quot;t.summary.logFC&quot; &quot;wilcox.summary.AUC&quot; &quot;binom.summary.logFC&quot; ## [16] &quot;t.logFC.2&quot; &quot;wilcox.AUC.2&quot; &quot;binom.logFC.2&quot; ## [19] &quot;t.logFC.3&quot; &quot;wilcox.AUC.3&quot; &quot;binom.logFC.3&quot; ## [22] &quot;t.logFC.4&quot; &quot;wilcox.AUC.4&quot; &quot;binom.logFC.4&quot; ## [25] &quot;t.logFC.5&quot; &quot;wilcox.AUC.5&quot; &quot;binom.logFC.5&quot; ## [28] &quot;t.logFC.6&quot; &quot;wilcox.AUC.6&quot; &quot;binom.logFC.6&quot; ## [31] &quot;t.logFC.7&quot; &quot;wilcox.AUC.7&quot; &quot;binom.logFC.7&quot; ## [34] &quot;t.logFC.8&quot; &quot;wilcox.AUC.8&quot; &quot;binom.logFC.8&quot; ## [37] &quot;t.logFC.9&quot; &quot;wilcox.AUC.9&quot; &quot;binom.logFC.9&quot; ## [40] &quot;t.logFC.10&quot; &quot;wilcox.AUC.10&quot; &quot;binom.logFC.10&quot; ## [43] &quot;t.logFC.11&quot; &quot;wilcox.AUC.11&quot; &quot;binom.logFC.11&quot; ## [46] &quot;t.logFC.12&quot; &quot;wilcox.AUC.12&quot; &quot;binom.logFC.12&quot; ## [49] &quot;t.logFC.13&quot; &quot;wilcox.AUC.13&quot; &quot;binom.logFC.13&quot; ## [52] &quot;t.logFC.14&quot; &quot;wilcox.AUC.14&quot; &quot;binom.logFC.14&quot; ## [55] &quot;t.logFC.15&quot; &quot;wilcox.AUC.15&quot; &quot;binom.logFC.15&quot; ## [58] &quot;t.logFC.16&quot; &quot;wilcox.AUC.16&quot; &quot;binom.logFC.16&quot; head(combined[[&quot;1&quot;]][,1:9]) ## DataFrame with 6 rows and 9 columns ## Top p.value FDR t.Top wilcox.Top binom.Top ## &lt;integer&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;integer&gt; &lt;integer&gt; &lt;integer&gt; ## TYROBP 1 1.36219e-37 1.31136e-34 1 1 1 ## FCER1G 2 5.54939e-48 8.90386e-45 1 1 2 ## GZMA 2 7.10783e-83 2.39491e-78 1 2 1 
## HOPX 2 1.25041e-79 2.10656e-75 2 1 1 ## CTSW 3 2.51098e-71 1.20864e-67 1 1 3 ## KLRF1 3 5.69193e-66 2.39730e-62 3 1 1 ## t.p.value wilcox.p.value binom.p.value ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## TYROBP 3.93768e-112 2.99215e-124 1.36219e-37 ## FCER1G 4.67496e-82 1.73332e-116 5.54939e-48 ## GZMA 1.06381e-88 1.00829e-165 7.10783e-83 ## HOPX 1.25041e-79 3.23816e-190 2.40034e-111 ## CTSW 1.13373e-107 7.90522e-131 2.51098e-71 ## KLRF1 5.69193e-66 2.73030e-184 2.77818e-113 In addition, multiMarkerStats() will compute a number of new statistics by combining the per-regime statistics. The combined Top value is obtained by simply taking the largest Top value across all tests for a given gene, while the reported p.value is obtained by taking the largest \\(p\\)-value. Ranking on either metric focuses on genes with robust differences that are highly ranked and detected by each of the individual testing regimes. Of course, this might be considered an overly conservative approach in practice, so it is entirely permissible to re-rank the DataFrame according to the Top or p.value for an individual regime (effectively limiting the use of the other regimes’ statistics to diagnostics only). 11.4 Handling blocking factors 11.4.1 Using the block= argument Large studies may contain factors of variation that are known and not interesting (e.g., batch effects, sex differences). If these are not modelled, they can interfere with marker gene detection - most obviously by inflating the variance within each cluster, but also by distorting the log-fold changes if the cluster composition varies across levels of the blocking factor. To avoid these issues, we set the block= argument in the findMarkers() call, as demonstrated below for the 416B data set. 
```r
#--- loading ---#
library(scRNAseq)
sce.416b <- LunSpikeInData(which="416b")
sce.416b$block <- factor(sce.416b$block)

#--- gene-annotation ---#
library(AnnotationHub)
ens.mm.v97 <- AnnotationHub()[["AH73905"]]
rowData(sce.416b)$ENSEMBL <- rownames(sce.416b)
rowData(sce.416b)$SYMBOL <- mapIds(ens.mm.v97, keys=rownames(sce.416b),
    keytype="GENEID", column="SYMBOL")
rowData(sce.416b)$SEQNAME <- mapIds(ens.mm.v97, keys=rownames(sce.416b),
    keytype="GENEID", column="SEQNAME")

library(scater)
rownames(sce.416b) <- uniquifyFeatureNames(rowData(sce.416b)$ENSEMBL,
    rowData(sce.416b)$SYMBOL)

#--- quality-control ---#
mito <- which(rowData(sce.416b)$SEQNAME=="MT")
stats <- perCellQCMetrics(sce.416b, subsets=list(Mt=mito))
qc <- quickPerCellQC(stats, percent_subsets=c("subsets_Mt_percent",
    "altexps_ERCC_percent"), batch=sce.416b$block)
sce.416b <- sce.416b[,!qc$discard]

#--- normalization ---#
library(scran)
sce.416b <- computeSumFactors(sce.416b)
sce.416b <- logNormCounts(sce.416b)

#--- variance-modelling ---#
dec.416b <- modelGeneVarWithSpikes(sce.416b, "ERCC", block=sce.416b$block)
chosen.hvgs <- getTopHVGs(dec.416b, prop=0.1)

#--- batch-correction ---#
library(limma)
assay(sce.416b, "corrected") <- removeBatchEffect(logcounts(sce.416b),
    design=model.matrix(~sce.416b$phenotype), batch=sce.416b$block)

#--- dimensionality-reduction ---#
sce.416b <- runPCA(sce.416b, ncomponents=10, subset_row=chosen.hvgs,
    exprs_values="corrected", BSPARAM=BiocSingular::ExactParam())

set.seed(1010)
sce.416b <- runTSNE(sce.416b, dimred="PCA", perplexity=10)

#--- clustering ---#
my.dist <- dist(reducedDim(sce.416b, "PCA"))
my.tree <- hclust(my.dist, method="ward.D2")

library(dynamicTreeCut)
my.clusters <- unname(cutreeDynamic(my.tree, distM=as.matrix(my.dist),
    minClusterSize=10, verbose=0))
colLabels(sce.416b) <- factor(my.clusters)
```

```r
m.out <- findMarkers(sce.416b, block=sce.416b$block, direction="up")
```

For each gene, each pairwise comparison between clusters is performed separately in each level of the blocking factor - in this case, the plate of origin. The function will then combine \(p\)-values from different plates using Stouffer's Z method to obtain a single \(p\)-value per pairwise comparison. (These \(p\)-values are further combined across comparisons to obtain a single \(p\)-value per gene, using either Simes' method or an intersection-union test depending on the value of pval.type=.) This approach favours genes that exhibit consistent DE in the same direction in each plate.

```r
demo <- m.out[["1"]]
demo[demo$Top <= 5,1:4]
```

```
## DataFrame with 13 rows and 4 columns
##                          Top     p.value         FDR summary.logFC
##                    <integer>   <numeric>   <numeric>     <numeric>
## Foxs1                      1 1.37387e-12 4.35563e-10       3.07058
## Pirb                       1 2.08277e-33 1.21332e-29       5.87820
## Myh11                      1 6.44327e-47 3.00282e-42       4.38182
## Tmsb4x                     2 3.22944e-44 7.52525e-40       1.47689
## Ctsd                       2 6.78109e-38 7.90065e-34       2.89152
## ...                      ...         ...         ...           ...
## Tob1                       4 6.63870e-09 1.18088e-06       2.74161
## Pi16                       4 1.69247e-32 7.88758e-29       5.76914
## Cd53                       5 1.08574e-27 2.97646e-24       5.75200
## Alox5ap                    5 1.33791e-28 4.15679e-25       1.36676
## CBFB-MYH11-mcherry         5 3.75556e-35 3.50049e-31       3.01677
```

The block= argument works with all tests shown above and is robust to differences in the log-fold changes or variances between batches. However, it assumes that each pair of clusters is present in at least one batch. In scenarios where cells from two clusters never co-occur in the same batch, the comparison will be impossible and NAs will be reported in the output.

11.4.2 Using the design= argument

Another approach is to define a design matrix containing the batch of origin as the sole factor.
findMarkers() will then fit a linear model to the log-expression values, similar to the use of limma for bulk RNA sequencing data (Ritchie et al. 2015). This handles situations where multiple batches contain unique clusters, as comparisons can be implicitly performed via shared cell types in each batch. There is also a slight increase in power when information is shared across clusters for variance estimation. # Setting up the design matrix (we remove intercept for full rank # in the final design matrix with the cluster-specific terms). design &lt;- model.matrix(~sce.416b$block) design &lt;- design[,-1,drop=FALSE] m.alt &lt;- findMarkers(sce.416b, design=design, direction=&quot;up&quot;) demo &lt;- m.alt[[&quot;1&quot;]] demo[demo$Top &lt;= 5,1:4] ## DataFrame with 12 rows and 4 columns ## Top p.value FDR summary.logFC ## &lt;integer&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## Gm6977 1 7.15187e-24 8.77120e-21 0.810553 ## Myh11 1 4.56882e-64 2.12925e-59 4.381806 ## Tmsb4x 2 9.48997e-46 2.21135e-41 1.478213 ## Cd63 2 1.80446e-15 7.85933e-13 0.813016 ## Cd200r3 2 2.40861e-45 3.74170e-41 6.684003 ## ... ... ... ... ... ## Actb 4 5.61751e-36 2.90887e-32 0.961762 ## Ctsd 4 2.08646e-42 2.43094e-38 2.893014 ## Fth1 4 1.83949e-23 2.14319e-20 0.797407 ## Ccl9 5 1.75378e-30 3.71514e-27 5.396347 ## CBFB-MYH11-mcherry 5 9.09026e-39 8.47285e-35 3.017758 The use of a linear model makes some strong assumptions, necessitating some caution when interpreting the results. If the batch effect is not consistent across clusters, the variance will be inflated and the log-fold change estimates will be distorted. Variances are also assumed to be equal across groups, which is not true in general. In particular, the presence of clusters in which a gene is silent will shrink the residual variance towards zero, preventing the model from penalizing genes with high variance in other clusters. Thus, we generally recommend the use of block= where possible. 
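To make the Stouffer combination used by block= concrete, here is a base R sketch with hypothetical one-sided p-values, one per plate. This version is unweighted for simplicity; findMarkers() may weight the contribution of each block, e.g., by the number of cells involved.

```r
# Hypothetical one-sided p-values for one pairwise comparison in each plate.
p.per.block <- c(plate1=0.01, plate2=0.03, plate3=0.20)

z <- qnorm(p.per.block, lower.tail=FALSE)   # convert p-values to Z-scores
z.combined <- sum(z) / sqrt(length(z))      # Stouffer's combined Z
p.combined <- pnorm(z.combined, lower.tail=FALSE)
```

The combined \(p\)-value (~0.002 here) is smaller than any individual one, as the evidence points in the same direction in every plate - exactly the consistency that block= rewards.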
11.5 Invalidity of \\(p\\)-values 11.5.1 From data snooping All of our DE strategies for detecting marker genes between clusters are statistically flawed to some extent. The DE analysis is performed on the same data used to obtain the clusters, which represents “data dredging” (also known as fishing or data snooping). The hypothesis of interest - are there differences between clusters? - is formulated from the data, so we are more likely to get a positive result when we re-use the data set to test that hypothesis. The practical effect of data dredging is best illustrated with a simple simulation. We simulate i.i.d. normal values, perform \\(k\\)-means clustering and test for DE between clusters of cells with findMarkers(). The resulting distribution of \\(p\\)-values is heavily skewed towards low values (Figure 11.7). Thus, we can detect “significant” differences between clusters even in the absence of any real substructure in the data. This effect arises from the fact that clustering, by definition, yields groups of cells that are separated in expression space. Testing for DE genes between clusters will inevitably yield some significant results as that is how the clusters were defined. library(scran) set.seed(0) y &lt;- matrix(rnorm(100000), ncol=200) clusters &lt;- kmeans(t(y), centers=2)$cluster out &lt;- findMarkers(y, clusters) hist(out[[1]]$p.value, col=&quot;grey80&quot;, xlab=&quot;p-value&quot;) Figure 11.7: Distribution of \\(p\\)-values from a DE analysis between two clusters in a simulation with no true subpopulation structure. For marker gene detection, this effect is largely harmless as the \\(p\\)-values are used only for ranking. However, it becomes an issue when the \\(p\\)-values are used to define “significant differences” between clusters with respect to an error rate threshold. 
Meaningful interpretation of error rates requires consideration of the long-run behavior, i.e., the rate of incorrect rejections if the experiment were repeated many times. The concept of statistical significance for differences between clusters is not applicable if clusters and their interpretations are not stably reproducible across (hypothetical) replicate experiments. 11.5.2 Nature of replication The naive application of DE analysis methods will treat counts from the same cluster of cells as replicate observations. This is not the most relevant level of replication when cells are derived from the same biological sample (i.e., cell culture, animal or patient). DE analyses that treat cells as replicates fail to properly model the sample-to-sample variability (A. T. L. Lun and Marioni 2017). The latter is arguably the more important level of replication as different samples will necessarily be generated if the experiment is to be replicated. Indeed, the use of cells as replicates only masks the fact that the sample size is actually one in an experiment involving a single biological sample. This reinforces the inappropriateness of using the marker gene \\(p\\)-values to perform statistical inference. We strongly recommend selecting some markers for use in validation studies with an independent replicate population of cells. A typical strategy is to identify a corresponding subset of cells that express the upregulated markers and do not express the downregulated markers. Ideally, a different technique for quantifying expression would also be used during validation, e.g., fluorescent in situ hybridisation or quantitative PCR. This confirms that the subpopulation genuinely exists and is not an artifact of the scRNA-seq protocol or the computational analysis. 11.6 Further comments One consequence of the DE analysis strategy is that markers are defined relative to subpopulations in the same dataset.
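When multiple biological samples are available, sample-level replication can be respected by summing counts across cells from the same cluster and sample into "pseudo-bulk" profiles, which are then analyzed with established bulk methods. A hedged sketch using scuttle and scran, assuming a SingleCellExperiment sce with hypothetical colData fields label, sample and a condition factor with a "treated" level:

```r
library(scuttle)
library(scran)

# Sum counts for all cells with the same cluster/sample combination,
# yielding one pseudo-bulk profile per cluster per biological sample.
summed <- aggregateAcrossCells(sce,
    ids=colData(sce)[,c("label", "sample")])

# Test for DE across the condition within each cluster, treating samples
# (not cells) as the replicates; this dispatches to edgeR's
# quasi-likelihood machinery internally.
de.results <- pseudoBulkDGE(summed, label=summed$label,
    design=~condition, coef="conditiontreated",
    condition=summed$condition)
```

The field names here are placeholders for whatever sample-level metadata the experiment actually provides; the key point is that the residual degrees of freedom come from the number of samples, not the number of cells.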
Biologically meaningful genes will not be detected if they are expressed uniformly throughout the population, e.g., T cell markers will not be detected if only T cells are present in the dataset. In practice, this is usually only a problem when the experimental data are provided without any biological context - certainly, we would hope to have some a priori idea about what cells have been captured. For most applications, it is actually desirable to avoid detecting such genes as we are interested in characterizing heterogeneity within the context of a known cell population. Continuing from the example above, the failure to detect T cell markers is of little consequence if we already know we are working with T cells. Nonetheless, if “absolute” identification of cell types is necessary, we discuss some strategies for doing so in Chapter 12. Alternatively, marker detection can be performed by treating gene expression as a predictor variable for cluster assignment. For a pair of clusters, we can find genes that discriminate between them by performing inference with a logistic model where the outcome for each cell is whether it was assigned to the first cluster and the lone predictor is the expression of each gene. Treating the cluster assignment as the dependent variable is more philosophically pleasing in some sense, as the clusters are indeed defined from the expression data rather than being known in advance. (Note that this does not solve the data snooping problem.) In practice, this approach effectively does the same task as a Wilcoxon rank sum test in terms of quantifying separation between clusters. Logistic models have the advantage in that they can easily be extended to block on multiple nuisance variables, though this is not typically necessary in most use cases. Even more complex strategies use machine learning methods to determine which features contribute most to successful cluster classification, but this is probably unnecessary for routine analyses. 
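As a sketch of the logistic approach described above - not the method used elsewhere in this chapter - one could fit a per-gene model for a given pair of clusters, assuming a hypothetical genes-by-cells log-expression matrix logcounts and a vector of cluster assignments clusters:

```r
# Restrict to the two clusters being compared.
pair <- clusters %in% c("1", "2")
is.first <- as.integer(clusters[pair] == "1")

# For each gene, test whether its expression predicts membership in the
# first cluster; small p-values indicate discriminative genes.
pvals <- apply(logcounts[, pair], 1, function(expr) {
    fit <- glm(is.first ~ expr, family=binomial)
    summary(fit)$coefficients["expr", "Pr(>|z|)"]
})
```

In practice, strongly discriminative genes cause perfect separation and unstable coefficient estimates, which is one reason this approach offers little benefit over the rank-based tests above for routine marker detection.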
Session Info

R version 4.0.4 (2021-02-15) on x86_64-pc-linux-gnu, running under Ubuntu 20.04.2 LTS, with Bioconductor 3.12 packages attached, including limma 3.46.0, scater 1.18.6, scran 1.18.5, SingleCellExperiment 1.12.0, pheatmap 1.0.12 and rebook 1.0.0.

Bibliography

Chapter 12 Cell type annotation

12.1 Motivation The most challenging task in scRNA-seq data analysis is arguably the interpretation of the results. Obtaining clusters of cells is fairly straightforward, but it is more difficult to determine what biological state is represented by each of those clusters. Doing so requires us to bridge the gap between the current dataset and prior biological knowledge, and the latter is not always available in a consistent and quantitative manner. Indeed, even the concept of a “cell type” is not clearly defined, with most practitioners possessing an “I’ll know it when I see it” intuition that is not amenable to computational analysis. As such, interpretation of scRNA-seq data is often manual and a common bottleneck in the analysis workflow. To expedite this step, we can use various computational approaches that exploit prior information to assign meaning to an uncharacterized scRNA-seq dataset.
The most obvious sources of prior information are the curated gene sets associated with particular biological processes, e.g., from the Gene Ontology (GO) or the Kyoto Encyclopedia of Genes and Genomes (KEGG) collections. Alternatively, we can directly compare our expression profiles to published reference datasets where each sample or cell has already been annotated with its putative biological state by domain experts. Here, we will demonstrate both approaches with several different scRNA-seq datasets. 12.2 Assigning cell labels from reference data 12.2.1 Overview A conceptually straightforward annotation approach is to compare the single-cell expression profiles with previously annotated reference datasets. Labels can then be assigned to each cell in our uncharacterized test dataset based on the most similar reference sample(s), for some definition of “similar”. This is a standard classification challenge that can be tackled by standard machine learning techniques such as random forests and support vector machines. Any published and labelled RNA-seq dataset (bulk or single-cell) can be used as a reference, though its reliability depends greatly on the expertise of the original authors who assigned the labels in the first place. In this section, we will demonstrate the use of the SingleR method (Aran et al. 2019) for cell type annotation. This method assigns labels to cells based on the reference samples with the highest Spearman rank correlations, using only the marker genes between pairs of labels to focus on the relevant differences between cell types. It also performs a fine-tuning step for each cell where the correlations are recomputed with just the marker genes for the top-scoring labels. This aims to resolve any ambiguity between those labels by removing noise from irrelevant markers for other labels. Further details can be found in the SingleR book from which most of the examples here are derived. 
12.2.2 Using existing references For demonstration purposes, we will use one of the 10X PBMC datasets as our test. While we have already applied quality control, normalization and clustering for this dataset, this is not strictly necessary. It is entirely possible to run SingleR() on the raw counts without any a priori quality control and filter on the annotation results at one’s leisure - see the book for an explanation. View history #--- loading ---# library(DropletTestFiles) raw.path &lt;- getTestFile(&quot;tenx-2.1.0-pbmc4k/1.0.0/raw.tar.gz&quot;) out.path &lt;- file.path(tempdir(), &quot;pbmc4k&quot;) untar(raw.path, exdir=out.path) library(DropletUtils) fname &lt;- file.path(out.path, &quot;raw_gene_bc_matrices/GRCh38&quot;) sce.pbmc &lt;- read10xCounts(fname, col.names=TRUE) #--- gene-annotation ---# library(scater) rownames(sce.pbmc) &lt;- uniquifyFeatureNames( rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol) library(EnsDb.Hsapiens.v86) location &lt;- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce.pbmc)$ID, column=&quot;SEQNAME&quot;, keytype=&quot;GENEID&quot;) #--- cell-detection ---# set.seed(100) e.out &lt;- emptyDrops(counts(sce.pbmc)) sce.pbmc &lt;- sce.pbmc[,which(e.out$FDR &lt;= 0.001)] #--- quality-control ---# stats &lt;- perCellQCMetrics(sce.pbmc, subsets=list(Mito=which(location==&quot;MT&quot;))) high.mito &lt;- isOutlier(stats$subsets_Mito_percent, type=&quot;higher&quot;) sce.pbmc &lt;- sce.pbmc[,!high.mito] #--- normalization ---# library(scran) set.seed(1000) clusters &lt;- quickCluster(sce.pbmc) sce.pbmc &lt;- computeSumFactors(sce.pbmc, cluster=clusters) sce.pbmc &lt;- logNormCounts(sce.pbmc) #--- variance-modelling ---# set.seed(1001) dec.pbmc &lt;- modelGeneVarByPoisson(sce.pbmc) top.pbmc &lt;- getTopHVGs(dec.pbmc, prop=0.1) #--- dimensionality-reduction ---# set.seed(10000) sce.pbmc &lt;- denoisePCA(sce.pbmc, subset.row=top.pbmc, technical=dec.pbmc) set.seed(100000) sce.pbmc &lt;- runTSNE(sce.pbmc, dimred=&quot;PCA&quot;) set.seed(1000000) 
sce.pbmc &lt;- runUMAP(sce.pbmc, dimred=&quot;PCA&quot;) #--- clustering ---# g &lt;- buildSNNGraph(sce.pbmc, k=10, use.dimred = &#39;PCA&#39;) clust &lt;- igraph::cluster_walktrap(g)$membership colLabels(sce.pbmc) &lt;- factor(clust) sce.pbmc ## class: SingleCellExperiment ## dim: 33694 3985 ## metadata(1): Samples ## assays(2): counts logcounts ## rownames(33694): RP11-34P13.3 FAM138A ... AC213203.1 FAM231B ## rowData names(2): ID Symbol ## colnames(3985): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ... ## TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1 ## colData names(4): Sample Barcode sizeFactor label ## reducedDimNames(3): PCA TSNE UMAP ## altExpNames(0): The celldex package contains a number of curated reference datasets, mostly assembled from bulk RNA-seq or microarray data of sorted cell types. These references are often good enough for most applications provided that they contain the cell types that are expected in the test population. Here, we will use a reference constructed from Blueprint and ENCODE data (Martens and Stunnenberg 2013; The ENCODE Project Consortium 2012); this is obtained by calling the BlueprintEncodeData() function to construct a SummarizedExperiment containing log-expression values with curated labels for each sample. library(celldex) ref &lt;- BlueprintEncodeData() ref ## class: SummarizedExperiment ## dim: 19859 259 ## metadata(0): ## assays(1): logcounts ## rownames(19859): TSPAN6 TNMD ... LINC00550 GIMAP1-GIMAP5 ## rowData names(0): ## colnames(259): mature.neutrophil ## CD14.positive..CD16.negative.classical.monocyte ... ## epithelial.cell.of.umbilical.artery.1 ## dermis.lymphatic.vessel.endothelial.cell.1 ## colData names(3): label.main label.fine label.ont We call the SingleR() function to annotate each of our PBMCs with the main cell type labels from the Blueprint/ENCODE reference. This returns a DataFrame where each row corresponds to a cell in the test dataset and contains its label assignments.
Alternatively, we could use the labels in ref$label.fine, which provide more resolution at the cost of speed and increased ambiguity in the assignments. library(SingleR) pred &lt;- SingleR(test=sce.pbmc, ref=ref, labels=ref$label.main) table(pred$labels) ## ## B-cells CD4+ T-cells CD8+ T-cells DC Eosinophils Erythrocytes ## 549 773 1274 1 1 5 ## HSC Monocytes NK cells ## 14 1117 251 We inspect the results using a heatmap of the per-cell and label scores (Figure 12.1). Ideally, each cell should exhibit a high score in one label relative to all of the others, indicating that the assignment to that label was unambiguous. This is largely the case for monocytes and B cells, whereas we see more ambiguity between CD4+ and CD8+ T cells (and to a lesser extent, NK cells). plotScoreHeatmap(pred) Figure 12.1: Heatmap of the assignment score for each cell (column) and label (row). Scores are shown before any fine-tuning and are normalized to [0, 1] within each cell. We compare the assignments with the clustering results to determine the identity of each cluster. Here, several clusters are nested within the monocyte and B cell labels (Figure 12.2), indicating that the clustering represents finer subdivisions within the cell types. Interestingly, our clustering does not effectively distinguish between CD4+ and CD8+ T cell labels. This is probably due to the presence of other factors of heterogeneity within the T cell subpopulation (e.g., activation) that have a stronger influence on unsupervised methods than the a priori expected CD4+/CD8+ distinction. tab &lt;- table(Assigned=pred$pruned.labels, Cluster=colLabels(sce.pbmc)) # Adding a pseudo-count of 10 to avoid strong color jumps with just 1 cell. library(pheatmap) pheatmap(log2(tab+10), color=colorRampPalette(c(&quot;white&quot;, &quot;blue&quot;))(101)) Figure 12.2: Heatmap of the distribution of cells across labels and clusters in the 10X PBMC dataset. 
Color scale is reported in the log2-number of cells for each cluster-label combination. This episode highlights some of the differences between reference-based annotation and unsupervised clustering. The former explicitly focuses on aspects of the data that are known to be interesting, simplifying the process of biological interpretation. However, the cost is that the downstream analysis is restricted by the diversity and resolution of the available labels, a problem that is largely avoided by de novo identification of clusters. We suggest applying both strategies to examine the agreement (or lack thereof) between reference label and cluster assignments. Any inconsistencies are not necessarily problematic due to the conceptual differences between the two approaches; indeed, one could use those discrepancies as the basis for further investigation to discover novel factors of variation in the data. 12.2.3 Using custom references We can also apply SingleR to single-cell reference datasets that are curated and supplied by the user. This is most obviously useful when we have an existing dataset that was previously (manually) annotated and we want to use that knowledge to annotate a new dataset in an automated manner. To illustrate, we will use the Muraro et al. (2016) human pancreas dataset as our reference. View history #--- loading ---# library(scRNAseq) sce.muraro &lt;- MuraroPancreasData() #--- gene-annotation ---# library(AnnotationHub) edb &lt;- AnnotationHub()[[&quot;AH73881&quot;]] gene.symb &lt;- sub(&quot;__chr.*$&quot;, &quot;&quot;, rownames(sce.muraro)) gene.ids &lt;- mapIds(edb, keys=gene.symb, keytype=&quot;SYMBOL&quot;, column=&quot;GENEID&quot;) # Removing duplicated genes or genes without Ensembl IDs.
keep &lt;- !is.na(gene.ids) &amp; !duplicated(gene.ids) sce.muraro &lt;- sce.muraro[keep,] rownames(sce.muraro) &lt;- gene.ids[keep] #--- quality-control ---# library(scater) stats &lt;- perCellQCMetrics(sce.muraro) qc &lt;- quickPerCellQC(stats, percent_subsets=&quot;altexps_ERCC_percent&quot;, batch=sce.muraro$donor, subset=sce.muraro$donor!=&quot;D28&quot;) sce.muraro &lt;- sce.muraro[,!qc$discard] #--- normalization ---# library(scran) set.seed(1000) clusters &lt;- quickCluster(sce.muraro) sce.muraro &lt;- computeSumFactors(sce.muraro, clusters=clusters) sce.muraro &lt;- logNormCounts(sce.muraro) sce.muraro ## class: SingleCellExperiment ## dim: 16940 2299 ## metadata(0): ## assays(2): counts logcounts ## rownames(16940): ENSG00000268895 ENSG00000121410 ... ENSG00000159840 ## ENSG00000074755 ## rowData names(2): symbol chr ## colnames(2299): D28-1_1 D28-1_2 ... D30-8_93 D30-8_94 ## colData names(4): label donor plate sizeFactor ## reducedDimNames(0): ## altExpNames(1): ERCC # Pruning out unknown or unclear labels. sce.muraro &lt;- sce.muraro[,!is.na(sce.muraro$label) &amp; sce.muraro$label!=&quot;unclear&quot;] table(sce.muraro$label) ## ## acinar alpha beta delta duct endothelial ## 217 795 442 189 239 18 ## epsilon mesenchymal pp ## 3 80 96 Our aim is to assign labels to our test dataset from Segerstolpe et al. (2016). We use the same call to SingleR() but with de.method=\"wilcox\" to identify markers via pairwise Wilcoxon rank sum tests between labels in the reference Muraro dataset. This re-uses the same machinery from Chapter 11; further options to fine-tune the test procedure can be passed via the de.args argument.
View history #--- loading ---# library(scRNAseq) sce.seger &lt;- SegerstolpePancreasData() #--- gene-annotation ---# library(AnnotationHub) edb &lt;- AnnotationHub()[[&quot;AH73881&quot;]] symbols &lt;- rowData(sce.seger)$symbol ens.id &lt;- mapIds(edb, keys=symbols, keytype=&quot;SYMBOL&quot;, column=&quot;GENEID&quot;) ens.id &lt;- ifelse(is.na(ens.id), symbols, ens.id) # Removing duplicated rows. keep &lt;- !duplicated(ens.id) sce.seger &lt;- sce.seger[keep,] rownames(sce.seger) &lt;- ens.id[keep] #--- sample-annotation ---# emtab.meta &lt;- colData(sce.seger)[,c(&quot;cell type&quot;, &quot;disease&quot;, &quot;individual&quot;, &quot;single cell well quality&quot;)] colnames(emtab.meta) &lt;- c(&quot;CellType&quot;, &quot;Disease&quot;, &quot;Donor&quot;, &quot;Quality&quot;) colData(sce.seger) &lt;- emtab.meta sce.seger$CellType &lt;- gsub(&quot; cell&quot;, &quot;&quot;, sce.seger$CellType) sce.seger$CellType &lt;- paste0( toupper(substr(sce.seger$CellType, 1, 1)), substring(sce.seger$CellType, 2)) #--- quality-control ---# low.qual &lt;- sce.seger$Quality == &quot;low quality cell&quot; library(scater) stats &lt;- perCellQCMetrics(sce.seger) qc &lt;- quickPerCellQC(stats, percent_subsets=&quot;altexps_ERCC_percent&quot;, batch=sce.seger$Donor, subset=!sce.seger$Donor %in% c(&quot;HP1504901&quot;, &quot;HP1509101&quot;)) sce.seger &lt;- sce.seger[,!(qc$discard | low.qual)] #--- normalization ---# library(scran) clusters &lt;- quickCluster(sce.seger) sce.seger &lt;- computeSumFactors(sce.seger, clusters=clusters) sce.seger &lt;- logNormCounts(sce.seger) # Converting to FPKM for a more like-for-like comparison to UMI counts. # However, results are often still good even when this step is skipped. 
library(AnnotationHub) hs.db &lt;- AnnotationHub()[[&quot;AH73881&quot;]] hs.exons &lt;- exonsBy(hs.db, by=&quot;gene&quot;) hs.exons &lt;- reduce(hs.exons) hs.len &lt;- sum(width(hs.exons)) library(scuttle) available &lt;- intersect(rownames(sce.seger), names(hs.len)) fpkm.seger &lt;- calculateFPKM(sce.seger[available,], hs.len[available]) pred.seger &lt;- SingleR(test=fpkm.seger, ref=sce.muraro, labels=sce.muraro$label, de.method=&quot;wilcox&quot;) table(pred.seger$labels) ## ## acinar alpha beta delta duct endothelial ## 192 892 273 106 381 18 ## epsilon mesenchymal pp ## 5 52 171 As it so happens, we are in the fortunate position where our test dataset also contains independently defined labels. We see strong consistency between the two sets of labels (Figure 12.3), indicating that our automatic annotation is comparable to that generated manually by domain experts. tab &lt;- table(pred.seger$pruned.labels, sce.seger$CellType) library(pheatmap) pheatmap(log2(tab+10), color=colorRampPalette(c(&quot;white&quot;, &quot;blue&quot;))(101)) Figure 12.3: Heatmap of the confusion matrix between the predicted labels (rows) and the independently defined labels (columns) in the Segerstolpe dataset. The color is proportional to the log-transformed number of cells with a given combination of labels from each set. An interesting question is - given a single-cell reference dataset, is it better to use it directly or convert it to pseudo-bulk values? A single-cell reference preserves the “shape” of the subpopulation in high-dimensional expression space, potentially yielding more accurate predictions when the differences between labels are subtle (or at least capturing ambiguity more accurately to avoid grossly incorrect predictions). However, it also requires more computational work to assign each cell in the test dataset. We refer to the other book for more details on how to achieve a compromise between these two concerns.
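One option along these lines is SingleR's aggregateReference() function, which collapses each label's cells into a handful of pseudo-bulk profiles. A sketch, assuming the sce.muraro reference and fpkm.seger test objects from above:

```r
library(SingleR)

# Collapse each label's cells into a few pseudo-bulk profiles via
# k-means, trading some within-label resolution for cheaper assignment.
set.seed(100) # k-means is stochastic
aggr <- aggregateReference(sce.muraro, labels=sce.muraro$label)

pred.aggr <- SingleR(test=fpkm.seger, ref=aggr,
    labels=aggr$label, de.method="wilcox")
table(pred.aggr$labels)
```

The number of profiles per label is tunable; fewer profiles approach the behavior of a bulk reference, while more profiles retain more of the single-cell "shape".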
12.3 Assigning cell labels from gene sets A related strategy is to explicitly identify sets of marker genes that are highly expressed in each individual cell. This does not require matching of individual cells to the expression values of the reference dataset, which is faster and more convenient when only the identities of the markers are available. We demonstrate this approach using neuronal cell type markers derived from the Zeisel et al. (2015) study. View history #--- loading ---# library(scRNAseq) sce.zeisel &lt;- ZeiselBrainData() library(scater) sce.zeisel &lt;- aggregateAcrossFeatures(sce.zeisel, id=sub(&quot;_loc[0-9]+$&quot;, &quot;&quot;, rownames(sce.zeisel))) #--- gene-annotation ---# library(org.Mm.eg.db) rowData(sce.zeisel)$Ensembl &lt;- mapIds(org.Mm.eg.db, keys=rownames(sce.zeisel), keytype=&quot;SYMBOL&quot;, column=&quot;ENSEMBL&quot;) #--- quality-control ---# stats &lt;- perCellQCMetrics(sce.zeisel, subsets=list( Mt=rowData(sce.zeisel)$featureType==&quot;mito&quot;)) qc &lt;- quickPerCellQC(stats, percent_subsets=c(&quot;altexps_ERCC_percent&quot;, &quot;subsets_Mt_percent&quot;)) sce.zeisel &lt;- sce.zeisel[,!qc$discard] #--- normalization ---# library(scran) set.seed(1000) clusters &lt;- quickCluster(sce.zeisel) sce.zeisel &lt;- computeSumFactors(sce.zeisel, cluster=clusters) sce.zeisel &lt;- logNormCounts(sce.zeisel) library(scran) wilcox.z &lt;- pairwiseWilcox(sce.zeisel, sce.zeisel$level1class, lfc=1, direction=&quot;up&quot;) markers.z &lt;- getTopMarkers(wilcox.z$statistics, wilcox.z$pairs, pairwise=FALSE, n=50) lengths(markers.z) ## astrocytes_ependymal endothelial-mural interneurons ## 79 83 118 ## microglia oligodendrocytes pyramidal CA1 ## 69 81 125 ## pyramidal SS ## 149 Our test dataset will be another brain scRNA-seq experiment from Tasic et al. (2016). 
library(scRNAseq) sce.tasic &lt;- TasicBrainData() sce.tasic ## class: SingleCellExperiment ## dim: 24058 1809 ## metadata(0): ## assays(1): counts ## rownames(24058): 0610005C13Rik 0610007C21Rik ... mt_X57780 tdTomato ## rowData names(0): ## colnames(1809): Calb2_tdTpositive_cell_1 Calb2_tdTpositive_cell_2 ... ## Rbp4_CTX_250ng_2 Trib2_CTX_250ng_1 ## colData names(13): sample_title mouse_line ... secondary_type ## aibs_vignette_id ## reducedDimNames(0): ## altExpNames(1): ERCC We use the AUCell package to identify marker sets that are highly expressed in each cell. This method ranks genes by their expression values within each cell and constructs a response curve of the number of genes from each marker set that are present with increasing rank. It then computes the area under the curve (AUC) for each marker set, quantifying the enrichment of those markers among the most highly expressed genes in that cell. This is roughly similar to performing a Wilcoxon rank sum test between genes in and outside of the set, but involving only the top ranking genes by expression in each cell. 
library(GSEABase) all.sets &lt;- lapply(names(markers.z), function(x) { GeneSet(markers.z[[x]], setName=x) }) all.sets &lt;- GeneSetCollection(all.sets) library(AUCell) rankings &lt;- AUCell_buildRankings(counts(sce.tasic), plotStats=FALSE, verbose=FALSE) cell.aucs &lt;- AUCell_calcAUC(all.sets, rankings) results &lt;- t(assay(cell.aucs)) head(results) ## gene sets ## cells astrocytes_ependymal endothelial-mural interneurons ## Calb2_tdTpositive_cell_1 0.1387 0.04264 0.5306 ## Calb2_tdTpositive_cell_2 0.1366 0.04885 0.4538 ## Calb2_tdTpositive_cell_3 0.1087 0.07270 0.3459 ## Calb2_tdTpositive_cell_4 0.1322 0.04993 0.5113 ## Calb2_tdTpositive_cell_5 0.1513 0.07161 0.4930 ## Calb2_tdTpositive_cell_6 0.1342 0.09161 0.3378 ## gene sets ## cells microglia oligodendrocytes pyramidal CA1 ## Calb2_tdTpositive_cell_1 0.04845 0.1318 0.2318 ## Calb2_tdTpositive_cell_2 0.02683 0.1211 0.2063 ## Calb2_tdTpositive_cell_3 0.03583 0.1567 0.3219 ## Calb2_tdTpositive_cell_4 0.05388 0.1481 0.2547 ## Calb2_tdTpositive_cell_5 0.06656 0.1386 0.2088 ## Calb2_tdTpositive_cell_6 0.03201 0.1553 0.4011 ## gene sets ## cells pyramidal SS ## Calb2_tdTpositive_cell_1 0.3477 ## Calb2_tdTpositive_cell_2 0.2762 ## Calb2_tdTpositive_cell_3 0.5244 ## Calb2_tdTpositive_cell_4 0.3506 ## Calb2_tdTpositive_cell_5 0.3010 ## Calb2_tdTpositive_cell_6 0.5393 We assign cell type identity to each cell in the test dataset by taking the marker set with the top AUC as the label for that cell. Our new labels mostly agree with the original annotation from Tasic et al. (2016), which is encouraging. The only exception involves misassignment of oligodendrocyte precursors to astrocytes, which may be understandable given that they are derived from a common lineage. 
In the absence of prior annotation, a more general diagnostic check is to compare the assigned labels to cluster identities, under the expectation that most cells of a single cluster would have the same label (or, if multiple labels are present, they should at least represent closely related cell states). new.labels &lt;- colnames(results)[max.col(results)] tab &lt;- table(new.labels, sce.tasic$broad_type) tab ## ## new.labels Astrocyte Endothelial Cell GABA-ergic Neuron ## astrocytes_ependymal 43 2 0 ## endothelial-mural 0 27 0 ## interneurons 0 0 759 ## microglia 0 0 0 ## oligodendrocytes 0 0 1 ## pyramidal SS 0 0 1 ## ## new.labels Glutamatergic Neuron Microglia Oligodendrocyte ## astrocytes_ependymal 0 0 0 ## endothelial-mural 0 0 0 ## interneurons 2 0 0 ## microglia 0 22 0 ## oligodendrocytes 0 0 38 ## pyramidal SS 810 0 0 ## ## new.labels Oligodendrocyte Precursor Cell Unclassified ## astrocytes_ependymal 20 4 ## endothelial-mural 0 2 ## interneurons 0 15 ## microglia 0 1 ## oligodendrocytes 2 0 ## pyramidal SS 0 60 As a diagnostic measure, we examine the distribution of AUCs across cells for each label (Figure 12.4). In heterogeneous populations, the distribution for each label should be bimodal with one high-scoring peak containing cells of that cell type and a low-scoring peak containing cells of other types. The gap between these two peaks can be used to derive a threshold for whether a label is “active” for a particular cell. (In this case, we simply take the single highest-scoring label per cell as the labels should be mutually exclusive.) In populations where a particular cell type is expected, lack of clear bimodality for the corresponding label may indicate that its gene set is not sufficiently informative. par(mfrow=c(3,3)) AUCell_exploreThresholds(cell.aucs, plotHist=TRUE, assign=TRUE) Figure 12.4: Distribution of AUCs in the Tasic brain dataset for each label in the Zeisel dataset. 
The blue curve represents the density estimate, the red curve represents a fitted two-component mixture of normals, the pink curve represents a fitted three-component mixture, and the grey curve represents a fitted normal distribution. Vertical lines represent threshold estimates corresponding to each estimate of the distribution. Interpretation of the AUCell results is most straightforward when the marker sets are mutually exclusive, as shown above for the cell type markers. In other applications, one might consider computing AUCs for gene sets associated with signalling or metabolic pathways. It is likely that multiple pathways will be active in any given cell, and it is tempting to use the AUCs to quantify this activity for comparison across cells. However, such comparisons must be interpreted with much caution as the AUCs are competitive values - any increase in one pathway’s activity will naturally reduce the AUCs for all other pathways, potentially resulting in spurious differences across the population. As we mentioned previously, the advantage of the AUCell approach is that it does not require reference expression values. This is particularly useful when dealing with gene sets derived from the literature or other qualitative forms of biological knowledge. For example, we might instead use single-cell signatures defined from MSigDB, obtained as shown below. # Downloading the signatures and caching them locally. library(BiocFileCache) bfc &lt;- BiocFileCache(ask=FALSE) scsig.path &lt;- bfcrpath(bfc, file.path(&quot;http://software.broadinstitute.org&quot;, &quot;gsea/msigdb/supplemental/scsig.all.v1.0.symbols.gmt&quot;)) scsigs &lt;- getGmt(scsig.path) The flipside is that information on relative expression is lost when only the marker identities are used. 
The net effect of ignoring expression values is difficult to predict; for example, it may reduce performance for resolving more subtle cell types, but may also improve performance if the per-cell expression was too noisy to be useful. Performance is also highly dependent on the gene sets themselves, which may not be defined in the same context in which they are used. For example, applying all of the MSigDB signatures on the Muraro dataset is rather disappointing (Figure 12.5), while restricting to the subset of pancreas signatures is more promising. muraro.mat &lt;- counts(sce.muraro) rownames(muraro.mat) &lt;- rowData(sce.muraro)$symbol muraro.rankings &lt;- AUCell_buildRankings(muraro.mat, plotStats=FALSE, verbose=FALSE) # Applying MsigDB to the Muraro dataset, because it&#39;s human: scsig.aucs &lt;- AUCell_calcAUC(scsigs, muraro.rankings) scsig.results &lt;- t(assay(scsig.aucs)) full.labels &lt;- colnames(scsig.results)[max.col(scsig.results)] tab &lt;- table(full.labels, sce.muraro$label) fullheat &lt;- pheatmap(log10(tab+10), color=viridis::viridis(100), silent=TRUE) # Restricting to the subset of Muraro-derived gene sets: scsigs.sub &lt;- scsigs[grep(&quot;Pancreas&quot;, names(scsigs))] sub.aucs &lt;- AUCell_calcAUC(scsigs.sub, muraro.rankings) sub.results &lt;- t(assay(sub.aucs)) sub.labels &lt;- colnames(sub.results)[max.col(sub.results)] tab &lt;- table(sub.labels, sce.muraro$label) subheat &lt;- pheatmap(log10(tab+10), color=viridis::viridis(100), silent=TRUE) gridExtra::grid.arrange(fullheat[[4]], subheat[[4]]) Figure 12.5: Heatmaps of the log-number of cells with each combination of known labels (columns) and assigned MSigDB signatures (rows) in the Muraro data set. The signature assigned to each cell was defined as that with the highest AUC across all (top) or all pancreas-related signatures (bottom). 
12.4 Assigning cluster labels from markers Yet another strategy for annotation is to perform a gene set enrichment analysis on the marker genes defining each cluster. This identifies the pathways and processes that are (relatively) active in each cluster based on upregulation of the associated genes compared to other clusters. We demonstrate on the mouse mammary dataset from Bach et al. (2017), using markers that are identified by findMarkers() as being upregulated at a log-fold change threshold of 1. View history #--- loading ---# library(scRNAseq) sce.mam &lt;- BachMammaryData(samples=&quot;G_1&quot;) #--- gene-annotation ---# library(scater) rownames(sce.mam) &lt;- uniquifyFeatureNames( rowData(sce.mam)$Ensembl, rowData(sce.mam)$Symbol) library(AnnotationHub) ens.mm.v97 &lt;- AnnotationHub()[[&quot;AH73905&quot;]] rowData(sce.mam)$SEQNAME &lt;- mapIds(ens.mm.v97, keys=rowData(sce.mam)$Ensembl, keytype=&quot;GENEID&quot;, column=&quot;SEQNAME&quot;) #--- quality-control ---# is.mito &lt;- rowData(sce.mam)$SEQNAME == &quot;MT&quot; stats &lt;- perCellQCMetrics(sce.mam, subsets=list(Mito=which(is.mito))) qc &lt;- quickPerCellQC(stats, percent_subsets=&quot;subsets_Mito_percent&quot;) sce.mam &lt;- sce.mam[,!qc$discard] #--- normalization ---# library(scran) set.seed(101000110) clusters &lt;- quickCluster(sce.mam) sce.mam &lt;- computeSumFactors(sce.mam, clusters=clusters) sce.mam &lt;- logNormCounts(sce.mam) #--- variance-modelling ---# set.seed(00010101) dec.mam &lt;- modelGeneVarByPoisson(sce.mam) top.mam &lt;- getTopHVGs(dec.mam, prop=0.1) #--- dimensionality-reduction ---# library(BiocSingular) set.seed(101010011) sce.mam &lt;- denoisePCA(sce.mam, technical=dec.mam, subset.row=top.mam) sce.mam &lt;- runTSNE(sce.mam, dimred=&quot;PCA&quot;) #--- clustering ---# snn.gr &lt;- buildSNNGraph(sce.mam, use.dimred=&quot;PCA&quot;, k=25) colLabels(sce.mam) &lt;- factor(igraph::cluster_walktrap(snn.gr)$membership) markers.mam &lt;- findMarkers(sce.mam, 
direction=&quot;up&quot;, lfc=1) As an example, we obtain annotations for the marker genes that define cluster 2. We will use gene sets defined by the Gene Ontology (GO) project, which describe a comprehensive range of biological processes and functions. We define our subset of relevant marker genes at an FDR of 5% and apply the goana() function from the limma package. This performs a hypergeometric test to identify GO terms that are overrepresented in our marker subset. (The log-fold change threshold mentioned above is useful here, as it avoids including an excessive number of genes due to the overpowered nature of per-cell DE comparisons.) chosen &lt;- &quot;2&quot; cur.markers &lt;- markers.mam[[chosen]] is.de &lt;- cur.markers$FDR &lt;= 0.05 summary(is.de) ## Mode FALSE TRUE ## logical 27819 179 # goana() requires Entrez IDs, some of which map to multiple # symbols - hence the unique() in the call below. library(org.Mm.eg.db) entrez.ids &lt;- mapIds(org.Mm.eg.db, keys=rownames(cur.markers), column=&quot;ENTREZID&quot;, keytype=&quot;SYMBOL&quot;) library(limma) go.out &lt;- goana(unique(entrez.ids[is.de]), species=&quot;Mm&quot;, universe=unique(entrez.ids)) # Only keeping biological process terms that are not overly general. 
go.out &lt;- go.out[order(go.out$P.DE),] go.useful &lt;- go.out[go.out$Ont==&quot;BP&quot; &amp; go.out$N &lt;= 200,] head(go.useful, 20) ## Term Ont ## GO:0006641 triglyceride metabolic process BP ## GO:0006639 acylglycerol metabolic process BP ## GO:0006638 neutral lipid metabolic process BP ## GO:0006119 oxidative phosphorylation BP ## GO:0042775 mitochondrial ATP synthesis coupled electron transport BP ## GO:0042773 ATP synthesis coupled electron transport BP ## GO:0046390 ribose phosphate biosynthetic process BP ## GO:0022408 negative regulation of cell-cell adhesion BP ## GO:0009152 purine ribonucleotide biosynthetic process BP ## GO:0035148 tube formation BP ## GO:0022904 respiratory electron transport chain BP ## GO:0009260 ribonucleotide biosynthetic process BP ## GO:0050729 positive regulation of inflammatory response BP ## GO:0022900 electron transport chain BP ## GO:0045333 cellular respiration BP ## GO:0006164 purine nucleotide biosynthetic process BP ## GO:0072522 purine-containing compound biosynthetic process BP ## GO:2001236 regulation of extrinsic apoptotic signaling pathway BP ## GO:0071404 cellular response to low-density lipoprotein particle stimulus BP ## GO:0019432 triglyceride biosynthetic process BP ## N DE P.DE ## GO:0006641 105 11 2.800e-10 ## GO:0006639 135 11 4.174e-09 ## GO:0006638 137 11 4.876e-09 ## GO:0006119 94 9 2.723e-08 ## GO:0042775 51 7 8.032e-08 ## GO:0042773 52 7 9.226e-08 ## GO:0046390 147 10 1.203e-07 ## GO:0022408 187 11 1.226e-07 ## GO:0009152 131 9 4.818e-07 ## GO:0035148 174 10 5.775e-07 ## GO:0022904 71 7 8.173e-07 ## GO:0009260 140 9 8.443e-07 ## GO:0050729 142 9 9.511e-07 ## GO:0022900 74 7 1.086e-06 ## GO:0045333 145 9 1.133e-06 ## GO:0006164 150 9 1.504e-06 ## GO:0072522 155 9 1.975e-06 ## GO:2001236 163 9 2.992e-06 ## GO:0071404 16 4 4.468e-06 ## GO:0019432 37 5 6.700e-06 We see an enrichment for genes involved in lipid synthesis, cell adhesion and tube formation. 
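The overrepresentation p-values above come from a standard hypergeometric calculation, which can be reproduced approximately with base R's phyper(). In the sketch below, N=105 and DE=11 for GO:0006641 are taken from the table above, but the universe size of 20000 is an assumed round number (the true universe is the set of uniquely mapped Entrez IDs), so this p-value is only in the same ballpark as goana()'s exact result.

```r
n.universe <- 20000  # assumed number of genes in the universe
n.set <- 105         # genes annotated to the GO term (N)
n.de <- 179          # DE genes in the universe
x <- 11              # DE genes inside the GO term (DE)

# Upper tail P(X >= x), where X ~ Hypergeometric with m genes in the set,
# n genes outside it, and k draws (the DE genes).
p <- phyper(x - 1, m=n.set, n=n.universe - n.set, k=n.de, lower.tail=FALSE)
p
```

Under these numbers, fewer than one DE gene would be expected in the set by chance, so observing 11 yields a very small p-value.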
Given that this is a mammary gland experiment, we might guess that cluster 2 contains luminal epithelial cells responsible for milk production and secretion. Indeed, a closer examination of the marker list indicates that this cluster upregulates milk proteins Csn2 and Csn3 (Figure 12.6). library(scater) plotExpression(sce.mam, features=c(&quot;Csn2&quot;, &quot;Csn3&quot;), x=&quot;label&quot;, colour_by=&quot;label&quot;) Figure 12.6: Distribution of log-expression values for Csn2 and Csn3 in each cluster. Further inspection of interesting GO terms is achieved by extracting the relevant genes. This is usually desirable to confirm that the interpretation of the annotated biological process is appropriate. Many terms have overlapping gene sets, so a term may only be highly ranked because it shares genes with a more relevant term that represents the active pathway. # Extract symbols for each GO term; done once. tab &lt;- select(org.Mm.eg.db, keytype=&quot;SYMBOL&quot;, keys=rownames(sce.mam), columns=&quot;GOALL&quot;) by.go &lt;- split(tab[,1], tab[,2]) # Identify genes associated with an interesting term. adhesion &lt;- unique(by.go[[&quot;GO:0022408&quot;]]) head(cur.markers[rownames(cur.markers) %in% adhesion,1:4], 10) ## DataFrame with 10 rows and 4 columns ## Top p.value FDR summary.logFC ## &lt;integer&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## Spint2 11 3.28234e-34 1.37163e-31 2.39280 ## Epcam 17 8.86978e-94 7.09531e-91 2.32968 ## Cebpb 21 6.76957e-16 2.03800e-13 1.80192 ## Cd24a 21 3.24195e-33 1.29669e-30 1.72318 ## Btn1a1 24 2.16574e-13 6.12488e-11 1.26343 ## Cd9 51 1.41373e-11 3.56592e-09 2.73785 ## Ceacam1 52 1.66948e-38 7.79034e-36 1.56912 ## Sdc4 59 9.15001e-07 1.75467e-04 1.84014 ## Anxa1 68 2.58840e-06 4.76777e-04 1.29724 ## Cdh1 69 1.73658e-07 3.54897e-05 1.31265 Gene set testing of marker lists is a reliable approach for determining if pathways are up- or down-regulated between clusters. 
As the top marker genes are simply DEGs, we can directly apply well-established procedures for testing gene enrichment in DEG lists (see here for relevant packages). This contrasts with the AUCell approach where scores are not easily comparable across cells. The downside is that all conclusions are made relative to the other clusters, making it more difficult to determine cell identity if an “outgroup” is not present in the same study. 12.5 Computing gene set activities For the sake of completeness, we should mention that we can also quantify gene set activity on a per-cell level and test for differences in activity. This inverts the standard gene set testing procedure by combining information across genes first and then testing for differences afterwards. To avoid the pitfalls mentioned previously for the AUCs, we simply compute the average of the log-expression values across all genes in the set for each cell. This is less sensitive to the behavior of other genes in that cell (aside from composition biases, as discussed in Chapter 7). aggregated &lt;- sumCountsAcrossFeatures(sce.mam, by.go, exprs_values=&quot;logcounts&quot;, average=TRUE) dim(aggregated) # rows are gene sets, columns are cells ## [1] 22440 2772 aggregated[1:10,1:5] ## [,1] [,2] [,3] [,4] [,5] ## GO:0000002 0.35837 0.3213 0.08054 0.2624 0.2376 ## GO:0000003 0.24919 0.2536 0.20097 0.1736 0.2017 ## GO:0000009 0.00000 0.0000 0.00000 0.0000 0.0000 ## GO:0000010 0.00000 0.0000 0.00000 0.0000 0.0000 ## GO:0000012 0.00000 0.1565 0.09245 0.0000 0.2212 ## GO:0000014 0.07068 0.2630 0.22187 0.2366 0.3539 ## GO:0000015 0.76686 0.4295 0.47544 0.5071 0.8043 ## GO:0000016 0.00000 0.0000 0.00000 0.0000 0.0000 ## GO:0000017 0.00000 0.0000 0.00000 0.0000 0.6636 ## GO:0000018 0.24770 0.2479 0.05839 0.1281 0.1440 We can then identify “differential gene set activity” between clusters by looking for significant differences in the per-set averages of the relevant cells. 
For example, we observe that cluster 2 has the highest average expression for the triacylglycerol biosynthesis GO term (Figure 12.7), consistent with the proposed identity of those cells. plotColData(sce.mam, y=I(aggregated[&quot;GO:0019432&quot;,]), x=&quot;label&quot;) Figure 12.7: Distribution of average log-normalized expression for genes involved in triacylglycerol biosynthesis, for all cells in each cluster of the mammary gland dataset. The obvious disadvantage of this approach is that not all genes in the set may exhibit the same pattern of differences. Non-DE genes will add noise to the per-set average, “diluting” the strength of any differences compared to an analysis that focuses directly on the DE genes (Figure 12.8). At worst, a gene set may contain subsets of DE genes that change in opposite directions, cancelling out any differences in the per-set average. This is not uncommon for gene sets that contain both positive and negative regulators of a particular biological process or pathway. # Choose the top-ranking gene in GO:0019432. plotExpression(sce.mam, &quot;Thrsp&quot;, x=&quot;label&quot;) Figure 12.8: Distribution of log-normalized expression values for Thrsp across all cells in each cluster of the mammary gland dataset. We could attempt to use the per-set averages to identify gene sets of interest via differential testing across all possible sets, e.g., with findMarkers(). However, the highest ranking gene sets in this approach tend to be very small and uninteresting because - by definition - the pitfalls mentioned above are avoided when there is only one gene in the set. This is compounded by the fact that the log-fold changes in the per-set averages are difficult to interpret. For these reasons, we generally reserve the use of this gene set summary statistic for visualization rather than any real statistical analysis. 
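The per-set average itself is straightforward to compute without sumCountsAcrossFeatures(). A minimal base-R sketch, using a made-up log-expression matrix and two hypothetical gene sets:

```r
# Per-set averages of log-expression, mirroring the behaviour of
# sumCountsAcrossFeatures(..., average=TRUE). All values here are simulated.
set.seed(100)
logmat <- matrix(rnorm(40, mean=1), nrow=8,
    dimnames=list(paste0("Gene", 1:8), paste0("Cell", 1:5)))
sets <- list(setA=c("Gene1", "Gene2", "Gene3"), setB=c("Gene4", "Gene8"))

# For each set, average the log-expression of its genes in each cell.
aggregated <- t(sapply(sets, function(s) colMeans(logmat[s, , drop=FALSE])))
dim(aggregated)  # rows are gene sets, columns are cells
```

Each row of the result can then be treated like the per-cell values plotted above.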
Session Info R version 4.0.4 (2021-02-15) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Bibliography "],["integrating-datasets.html", "Chapter 13 Integrating Datasets 13.1 Motivation 13.2 Setting up the data 13.3 Diagnosing batch effects 13.4 Linear regression 13.5 Performing MNN correction 13.6 Correction diagnostics 13.7 Encouraging consistency with marker genes 13.8 Using the corrected values Session Info", " Chapter 13 Integrating Datasets 13.1 Motivation Large single-cell RNA sequencing (scRNA-seq) projects usually need to generate data across multiple batches due to logistical constraints. However, the processing of different batches is often subject to uncontrollable differences, e.g., changes in operator, differences in reagent quality. This results in systematic differences in the observed expression in cells from different batches, which we refer to as “batch effects”. Batch effects are problematic as they can be major drivers of heterogeneity in the data, masking the relevant biological differences and complicating interpretation of the results. Computational correction of these effects is critical for eliminating batch-to-batch variation, allowing data across multiple batches to be combined for common downstream analysis. However, existing methods based on linear models (Ritchie et al. 2015; Leek et al. 2012) assume that the composition of cell populations is either known or the same across batches. To overcome these limitations, bespoke methods have been developed for batch correction of single-cell data (Haghverdi et al. 2018; Butler et al. 2018; Lin et al. 2019) that do not require a priori knowledge about the composition of the population. This allows them to be used in workflows for exploratory analyses of scRNA-seq data where such knowledge is usually unavailable. 13.2 Setting up the data To demonstrate, we will use two separate 10X Genomics PBMC datasets generated in two different batches. Each dataset was obtained from the TENxPBMCData package and separately subjected to basic processing steps. Separate processing prior to the batch correction step is more convenient, scalable and (on occasion) more reliable. For example, outlier-based QC on the cells is more effective when performed within a batch (Section 6.3.2.3). 
The same can also be said for trend fitting when modelling the mean-variance relationship (Section 8.2.4.1). View history #--- loading ---# library(TENxPBMCData) all.sce &lt;- list( pbmc3k=TENxPBMCData(&#39;pbmc3k&#39;), pbmc4k=TENxPBMCData(&#39;pbmc4k&#39;), pbmc8k=TENxPBMCData(&#39;pbmc8k&#39;) ) #--- quality-control ---# library(scater) stats &lt;- high.mito &lt;- list() for (n in names(all.sce)) { current &lt;- all.sce[[n]] is.mito &lt;- grep(&quot;MT&quot;, rowData(current)$Symbol_TENx) stats[[n]] &lt;- perCellQCMetrics(current, subsets=list(Mito=is.mito)) high.mito[[n]] &lt;- isOutlier(stats[[n]]$subsets_Mito_percent, type=&quot;higher&quot;) all.sce[[n]] &lt;- current[,!high.mito[[n]]] } #--- normalization ---# all.sce &lt;- lapply(all.sce, logNormCounts) #--- variance-modelling ---# library(scran) all.dec &lt;- lapply(all.sce, modelGeneVar) all.hvgs &lt;- lapply(all.dec, getTopHVGs, prop=0.1) #--- dimensionality-reduction ---# library(BiocSingular) set.seed(10000) all.sce &lt;- mapply(FUN=runPCA, x=all.sce, subset_row=all.hvgs, MoreArgs=list(ncomponents=25, BSPARAM=RandomParam()), SIMPLIFY=FALSE) set.seed(100000) all.sce &lt;- lapply(all.sce, runTSNE, dimred=&quot;PCA&quot;) set.seed(1000000) all.sce &lt;- lapply(all.sce, runUMAP, dimred=&quot;PCA&quot;) #--- clustering ---# for (n in names(all.sce)) { g &lt;- buildSNNGraph(all.sce[[n]], k=10, use.dimred=&#39;PCA&#39;) clust &lt;- igraph::cluster_walktrap(g)$membership colLabels(all.sce[[n]]) &lt;- factor(clust) } pbmc3k &lt;- all.sce$pbmc3k dec3k &lt;- all.dec$pbmc3k pbmc3k ## class: SingleCellExperiment ## dim: 32738 2609 ## metadata(0): ## assays(2): counts logcounts ## rownames(32738): ENSG00000243485 ENSG00000237613 ... ENSG00000215616 ## ENSG00000215611 ## rowData names(3): ENSEMBL_ID Symbol_TENx Symbol ## colnames: NULL ## colData names(13): Sample Barcode ... 
sizeFactor label ## reducedDimNames(3): PCA TSNE UMAP ## altExpNames(0): pbmc4k &lt;- all.sce$pbmc4k dec4k &lt;- all.dec$pbmc4k pbmc4k ## class: SingleCellExperiment ## dim: 33694 4182 ## metadata(0): ## assays(2): counts logcounts ## rownames(33694): ENSG00000243485 ENSG00000237613 ... ENSG00000277475 ## ENSG00000268674 ## rowData names(3): ENSEMBL_ID Symbol_TENx Symbol ## colnames: NULL ## colData names(13): Sample Barcode ... sizeFactor label ## reducedDimNames(3): PCA TSNE UMAP ## altExpNames(0): To prepare for the batch correction: We subset all batches to the common “universe” of features. In this case, it is straightforward as both batches use Ensembl gene annotation. universe &lt;- intersect(rownames(pbmc3k), rownames(pbmc4k)) length(universe) ## [1] 31232 # Subsetting the SingleCellExperiment object. pbmc3k &lt;- pbmc3k[universe,] pbmc4k &lt;- pbmc4k[universe,] # Also subsetting the variance modelling results, for convenience. dec3k &lt;- dec3k[universe,] dec4k &lt;- dec4k[universe,] We rescale each batch to adjust for differences in sequencing depth between batches. The multiBatchNorm() function recomputes log-normalized expression values after adjusting the size factors for systematic differences in coverage between SingleCellExperiment objects. (Size factors only remove biases between cells within a single batch.) This improves the quality of the correction by removing one aspect of the technical differences between batches. library(batchelor) rescaled &lt;- multiBatchNorm(pbmc3k, pbmc4k) pbmc3k &lt;- rescaled[[1]] pbmc4k &lt;- rescaled[[2]] We perform feature selection by averaging the variance components across all batches with the combineVar() function. We compute the average as it is responsive to batch-specific HVGs while still preserving the within-batch ranking of genes. This allows us to use the same strategies described in Section 8.3 to select genes of interest. 
In contrast, approaches based on taking the intersection or union of HVGs across batches become increasingly conservative or liberal, respectively, with an increasing number of batches. library(scran) combined.dec &lt;- combineVar(dec3k, dec4k) chosen.hvgs &lt;- combined.dec$bio &gt; 0 sum(chosen.hvgs) ## [1] 13431 When integrating datasets of variable composition, it is generally safer to err on the side of including more genes than are used in a single dataset analysis, to ensure that markers are retained for any dataset-specific subpopulations that might be present. For a top \\(X\\) selection, this means using a larger \\(X\\) (say, ~5000), or in this case, we simply take all genes above the trend. That said, many of the signal-to-noise considerations described in Section 8.3 still apply here, so some experimentation may be necessary for best results. Alternatively, a more forceful approach to feature selection can be used based on marker genes from within-batch comparisons; this is discussed in more detail in Section 13.7. 13.3 Diagnosing batch effects Before we actually perform any correction, it is worth examining whether there is any batch effect in this dataset. We combine the two SingleCellExperiments and perform a PCA on the log-expression values for all genes with positive (average) biological components. In this example, our datasets are file-backed and so we instruct runPCA() to use randomized PCA for greater efficiency - see Section 23.2.2 for more details - though the default IRLBA will suffice for more common in-memory representations. # Synchronizing the metadata for cbind()ing. rowData(pbmc3k) &lt;- rowData(pbmc4k) pbmc3k$batch &lt;- &quot;3k&quot; pbmc4k$batch &lt;- &quot;4k&quot; uncorrected &lt;- cbind(pbmc3k, pbmc4k) # Using RandomParam() as it is more efficient for file-backed matrices. 
library(scater) set.seed(0010101010) uncorrected &lt;- runPCA(uncorrected, subset_row=chosen.hvgs, BSPARAM=BiocSingular::RandomParam()) We use graph-based clustering on the components to obtain a summary of the population structure. As our two PBMC populations should be replicates, each cluster should ideally consist of cells from both batches. However, we instead see clusters that are composed of cells from a single batch. This indicates that cells of the same type are artificially separated due to technical differences between batches. library(scran) snn.gr &lt;- buildSNNGraph(uncorrected, use.dimred=&quot;PCA&quot;) clusters &lt;- igraph::cluster_walktrap(snn.gr)$membership tab &lt;- table(Cluster=clusters, Batch=uncorrected$batch) tab ## Batch ## Cluster 3k 4k ## 1 1 781 ## 2 0 1309 ## 3 0 535 ## 4 14 51 ## 5 0 605 ## 6 489 0 ## 7 0 184 ## 8 1272 0 ## 9 0 414 ## 10 151 0 ## 11 0 50 ## 12 155 0 ## 13 0 65 ## 14 0 61 ## 15 0 88 ## 16 30 0 ## 17 339 0 ## 18 145 0 ## 19 11 3 ## 20 2 36 We can also visualize the uncorrected coordinates using a \\(t\\)-SNE plot (Figure 13.1). The strong separation between cells from different batches is consistent with the clustering results. set.seed(1111001) uncorrected &lt;- runTSNE(uncorrected, dimred=&quot;PCA&quot;) plotTSNE(uncorrected, colour_by=&quot;batch&quot;) Figure 13.1: \\(t\\)-SNE plot of the PBMC datasets without any batch correction. Each point is a cell that is colored according to its batch of origin. Of course, the other explanation for batch-specific clusters is that there are cell types that are unique to each batch. The degree of intermingling of cells from different batches is not an effective diagnostic when the batches involved might actually contain unique cell subpopulations (which is not a consideration in the PBMC dataset, but the same cannot be said in general). 
If a cluster only contains cells from a single batch, one can always debate whether that is caused by a failure of the correction method or if there is truly a batch-specific subpopulation. For example, do batch-specific metabolic or differentiation states represent distinct subpopulations? Or should they be merged together? We will not attempt to answer this here, only noting that each batch correction algorithm will make different (and possibly inappropriate) decisions on what constitutes “shared” and “unique” populations. 13.4 Linear regression Batch effects in bulk RNA sequencing studies are commonly removed with linear regression. This involves fitting a linear model to each gene’s expression profile, setting the undesirable batch term to zero and recomputing the observations sans the batch effect, yielding a set of corrected expression values for downstream analyses. Linear modelling is the basis of the removeBatchEffect() function from the limma package (Ritchie et al. 2015) as well as the ComBat() function from the sva package (Leek et al. 2012). To use this approach in a scRNA-seq context, we assume that the composition of cell subpopulations is the same across batches. We also assume that the batch effect is additive, i.e., any batch-induced fold-change in expression is the same across different cell subpopulations for any given gene. These are strong assumptions as batches derived from different individuals will naturally exhibit variation in cell type abundances and expression. Nonetheless, they may be acceptable when dealing with batches that are technical replicates generated from the same population of cells. (In fact, when its assumptions hold, linear regression is the most statistically efficient as it uses information from all cells to compute the common batch vector.) 
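The regression logic described above is easy to sketch in base R for a single simulated gene. This shows the same idea as removeBatchEffect(), not its actual implementation: fit a model with a batch term, then subtract only the fitted batch component.

```r
# Batch correction by linear regression for one gene's expression values.
set.seed(42)
batch <- factor(rep(c("A", "B"), each=50))
y <- rnorm(100) + ifelse(batch == "B", 2, 0)  # batch B shifted upwards

design <- model.matrix(~batch)
fit <- lm.fit(design, y)

# Zero out the batch coefficient and recompute the observations: the
# intercept (and any biological terms) is kept, the batch shift is removed.
corrected <- drop(y - design[, -1, drop=FALSE] %*% fit$coefficients[-1])
mean(corrected[batch == "A"]) - mean(corrected[batch == "B"])  # effectively zero
```

With OLS, the fitted batch coefficient is exactly the difference in batch means, so the corrected means coincide.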
Linear modelling can also accommodate situations where the composition is known a priori by including the cell type as a factor in the linear model, but this situation is even less common. We use the rescaleBatches() function from the batchelor package to remove the batch effect. This is roughly equivalent to applying a linear regression to the log-expression values per gene, with some adjustments to improve performance and efficiency. For each gene, the mean expression in each batch is scaled down until it is equal to the lowest mean across all batches. We deliberately choose to scale all expression values down as this mitigates differences in variance when batches lie at different positions on the mean-variance trend. (Specifically, the shrinkage effect of the pseudo-count is greater for smaller counts, suppressing any differences in variance across batches.) An additional feature of rescaleBatches() is that it will preserve sparsity in the input matrix for greater efficiency, whereas other methods like removeBatchEffect() will always return a dense matrix. library(batchelor) rescaled &lt;- rescaleBatches(pbmc3k, pbmc4k) rescaled ## class: SingleCellExperiment ## dim: 31232 6791 ## metadata(0): ## assays(1): corrected ## rownames(31232): ENSG00000243485 ENSG00000237613 ... ENSG00000198695 ## ENSG00000198727 ## rowData names(0): ## colnames: NULL ## colData names(1): batch ## reducedDimNames(0): ## altExpNames(0): After clustering, we observe that most clusters consist of mixtures of cells from the two replicate batches, consistent with the removal of the batch effect. This conclusion is supported by the apparent mixing of cells from different batches in Figure 13.2. However, at least one batch-specific cluster is still present, indicating that the correction is not entirely complete. This is attributable to violation of one of the aforementioned assumptions, even in this simple case involving replicated batches. # To ensure reproducibility of the randomized PCA. 
set.seed(1010101010) rescaled &lt;- runPCA(rescaled, subset_row=chosen.hvgs, exprs_values=&quot;corrected&quot;, BSPARAM=BiocSingular::RandomParam()) snn.gr &lt;- buildSNNGraph(rescaled, use.dimred=&quot;PCA&quot;) clusters.resc &lt;- igraph::cluster_walktrap(snn.gr)$membership tab.resc &lt;- table(Cluster=clusters.resc, Batch=rescaled$batch) tab.resc ## Batch ## Cluster 1 2 ## 1 278 525 ## 2 16 23 ## 3 337 606 ## 4 43 748 ## 5 604 529 ## 6 22 71 ## 7 188 48 ## 8 25 49 ## 9 263 0 ## 10 123 135 ## 11 16 85 ## 12 11 57 ## 13 116 6 ## 14 455 1035 ## 15 6 31 ## 16 89 187 ## 17 3 36 ## 18 3 8 ## 19 11 3 rescaled &lt;- runTSNE(rescaled, dimred=&quot;PCA&quot;) rescaled$batch &lt;- factor(rescaled$batch) plotTSNE(rescaled, colour_by=&quot;batch&quot;) Figure 13.2: \\(t\\)-SNE plot of the PBMC datasets after correction with rescaleBatches(). Each point represents a cell and is colored according to the batch of origin. Alternatively, we could use the regressBatches() function to perform a more conventional linear regression for batch correction. This is subject to the same assumptions as described above for rescaleBatches(), though it has the additional disadvantage of discarding sparsity in the matrix of residuals. To avoid this, we avoid explicit calculation of the residuals during matrix multiplication (see ?ResidualMatrix for details), allowing us to perform approximate PCA more efficiently. Advanced users can set design= and specify which coefficients to retain in the output matrix, reminiscent of limma’s removeBatchEffect() function. 
set.seed(10001) residuals &lt;- regressBatches(pbmc3k, pbmc4k, d=50, subset.row=chosen.hvgs, correct.all=TRUE, BSPARAM=BiocSingular::RandomParam()) snn.gr &lt;- buildSNNGraph(residuals, use.dimred=&quot;corrected&quot;) clusters.resid &lt;- igraph::cluster_walktrap(snn.gr)$membership tab.resid &lt;- table(Cluster=clusters.resid, Batch=residuals$batch) tab.resid ## Batch ## Cluster 1 2 ## 1 478 2 ## 2 142 179 ## 3 22 41 ## 4 298 566 ## 5 340 606 ## 6 0 138 ## 7 404 376 ## 8 145 91 ## 9 2 636 ## 10 22 73 ## 11 6 51 ## 12 629 1110 ## 13 3 36 ## 14 91 211 ## 15 12 55 ## 16 4 8 ## 17 11 3 residuals &lt;- runTSNE(residuals, dimred=&quot;corrected&quot;) residuals$batch &lt;- factor(residuals$batch) plotTSNE(residuals, colour_by=&quot;batch&quot;) Figure 13.3: \\(t\\)-SNE plot of the PBMC datasets after correction with regressBatches(). Each point represents a cell and is colored according to the batch of origin. 13.5 Performing MNN correction Consider a cell \\(a\\) in batch \\(A\\), and identify the cells in batch \\(B\\) that are nearest neighbors to \\(a\\) in the expression space defined by the selected features. Repeat this for a cell \\(b\\) in batch \\(B\\), identifying its nearest neighbors in \\(A\\). Mutual nearest neighbors are pairs of cells from different batches that belong in each other’s set of nearest neighbors. The reasoning is that MNN pairs represent cells from the same biological state prior to the application of a batch effect - see Haghverdi et al. (2018) for full theoretical details. Thus, the difference between cells in MNN pairs can be used as an estimate of the batch effect, the subtraction of which yields batch-corrected values. Compared to linear regression, MNN correction does not assume that the population composition is the same or known beforehand. This is because it learns the shared population structure via identification of MNN pairs and uses this information to obtain an appropriate estimate of the batch effect. 
Instead, the key assumption of MNN-based approaches is that the batch effect is orthogonal to the biology in high-dimensional expression space. Violations reduce the effectiveness and accuracy of the correction, with the most common case arising from variations in the direction of the batch effect between clusters. Nonetheless, the assumption is usually reasonable as a random vector is very likely to be orthogonal in high-dimensional space. The batchelor package provides an implementation of the MNN approach via the fastMNN() function. (Unlike the MNN method originally described by Haghverdi et al. (2018), the fastMNN() function performs PCA to reduce the dimensions beforehand and speed up the downstream neighbor detection steps.) We apply it to our two PBMC batches to remove the batch effect across the highly variable genes in chosen.hvgs. To reduce computational work and technical noise, all cells in all batches are projected into the low-dimensional space defined by the top d principal components. Identification of MNNs and calculation of correction vectors are then performed in this low-dimensional space. # Again, using randomized SVD here, as this is faster than IRLBA for # file-backed matrices. We set deferred=TRUE for greater speed. set.seed(1000101001) mnn.out &lt;- fastMNN(pbmc3k, pbmc4k, d=50, k=20, subset.row=chosen.hvgs, BSPARAM=BiocSingular::RandomParam(deferred=TRUE)) mnn.out ## class: SingleCellExperiment ## dim: 13431 6791 ## metadata(2): merge.info pca.info ## assays(1): reconstructed ## rownames(13431): ENSG00000239945 ENSG00000228463 ... ENSG00000198695 ## ENSG00000198727 ## rowData names(1): rotation ## colnames: NULL ## colData names(1): batch ## reducedDimNames(1): corrected ## altExpNames(0): The function returns a SingleCellExperiment object containing corrected values for downstream analyses like clustering or visualization. 
Each column of mnn.out corresponds to a cell in one of the batches, while each row corresponds to an input gene in chosen.hvgs. The batch field in the column metadata contains a vector specifying the batch of origin of each cell. head(mnn.out$batch) ## [1] 1 1 1 1 1 1 The corrected matrix in the reducedDims() contains the low-dimensional corrected coordinates for all cells, which we will use in place of the PCs in our downstream analyses. dim(reducedDim(mnn.out, &quot;corrected&quot;)) ## [1] 6791 50 A reconstructed matrix in the assays() contains the corrected expression values for each gene in each cell, obtained by projecting the low-dimensional coordinates in corrected back into gene expression space. We do not recommend using this for anything other than visualization (Section 13.8). assay(mnn.out, &quot;reconstructed&quot;) ## &lt;13431 x 6791&gt; matrix of class LowRankMatrix and type &quot;double&quot;: ## [,1] [,2] [,3] ... [,6790] [,6791] ## ENSG00000239945 -2.522e-06 -1.851e-06 -1.199e-05 . 1.832e-06 -3.641e-06 ## ENSG00000228463 -6.627e-04 -6.724e-04 -4.820e-04 . -8.531e-04 -3.999e-04 ## ENSG00000237094 -8.077e-05 -8.038e-05 -9.631e-05 . 7.261e-06 -4.094e-05 ## ENSG00000229905 3.838e-06 6.180e-06 5.432e-06 . 8.534e-06 3.485e-06 ## ENSG00000237491 -4.527e-04 -3.178e-04 -1.510e-04 . -3.491e-04 -2.082e-04 ## ... . . . . . . ## ENSG00000198840 -0.0296508 -0.0340101 -0.0502385 . -0.0362884 -0.0183084 ## ENSG00000212907 -0.0041681 -0.0056570 -0.0106420 . -0.0083837 0.0005996 ## ENSG00000198886 0.0145358 0.0200517 -0.0307131 . -0.0109254 -0.0070064 ## ENSG00000198695 0.0014427 0.0013490 0.0001493 . -0.0009826 -0.0022712 ## ENSG00000198727 0.0152570 0.0106167 -0.0256450 . -0.0227962 -0.0022898 The most relevant parameter for tuning fastMNN() is k, which specifies the number of nearest neighbors to consider when defining MNN pairs. This can be interpreted as the minimum anticipated frequency of any shared cell type or state in each batch. 
Increasing k will generally result in more aggressive merging as the algorithm is more generous in matching subpopulations across batches. It can occasionally be desirable to increase k if one clearly sees that the same cell types are not being adequately merged across batches. We cluster on the low-dimensional corrected coordinates to obtain a partitioning of the cells that serves as a proxy for the population structure. If the batch effect is successfully corrected, clusters corresponding to shared cell types or states should contain cells from multiple batches. We see that all clusters contain contributions from each batch after correction, consistent with our expectation that the two batches are replicates of each other. library(scran) snn.gr &lt;- buildSNNGraph(mnn.out, use.dimred=&quot;corrected&quot;) clusters.mnn &lt;- igraph::cluster_walktrap(snn.gr)$membership tab.mnn &lt;- table(Cluster=clusters.mnn, Batch=mnn.out$batch) tab.mnn ## Batch ## Cluster 1 2 ## 1 337 606 ## 2 289 542 ## 3 152 181 ## 4 12 4 ## 5 517 467 ## 6 17 19 ## 7 313 661 ## 8 162 118 ## 9 11 56 ## 10 547 1083 ## 11 17 59 ## 12 16 58 ## 13 144 93 ## 14 67 191 ## 15 4 36 ## 16 4 8 See Chapter 34 for an example of a more complex fastMNN() merge involving several human pancreas datasets generated by different authors on different patients with different technologies. 13.6 Correction diagnostics 13.6.1 Mixing between batches It is possible to quantify the degree of mixing across batches by testing each cluster for imbalances in the contribution from each batch (Büttner et al. 2019). This is done by applying Pearson’s chi-squared test to each row of tab.mnn, where the expected proportions under the null hypothesis are proportional to the total number of cells per batch. 
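To make the test concrete, here is a toy calculation for a single made-up cluster drawn from two batches of unequal size; the analysis below applies the same chisq.test() call to every row of tab.mnn.

```r
# One hypothetical cluster containing 40 cells from batch 1 and 160 from batch 2.
cluster.counts <- c(batch1=40, batch2=160)

# The overall batch sizes define the expected proportions under the null.
batch.totals <- c(batch1=1000, batch2=3000)
expected.prop <- batch.totals/sum(batch.totals)

# The cluster's 20:80 split is tested against the 25:75 split of batch sizes.
chisq.test(cluster.counts, p=expected.prop)$p.value
```

Here the deviation is mild and the p-value is not significant at the usual thresholds, so this toy cluster would be considered well mixed.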
Low \\(p\\)-values indicate that there are significant imbalances. In practice, this strategy is most suited to technical replicates with identical population composition; it is usually too stringent for batches with more biological variation, where proportions can genuinely vary even in the absence of any batch effect. chi.prop &lt;- colSums(tab.mnn)/sum(tab.mnn) chi.results &lt;- apply(tab.mnn, 1, FUN=chisq.test, p=chi.prop) p.values &lt;- vapply(chi.results, &quot;[[&quot;, i=&quot;p.value&quot;, 0) p.values ## 1 2 3 4 5 6 7 8 ## 9.047e-02 3.093e-02 6.700e-03 2.627e-03 8.424e-20 2.775e-01 5.546e-05 2.274e-11 ## 9 10 11 12 13 14 15 16 ## 2.136e-04 5.480e-05 4.019e-03 2.972e-03 1.538e-12 3.936e-05 2.197e-04 7.172e-01 We favor a more qualitative approach whereby we compute the variation in the log-abundances to rank the clusters with the greatest variability in their proportional abundances across batches. We can then focus on batch-specific clusters that may be indicative of incomplete batch correction. Obviously, though, this diagnostic is subject to interpretation as the same outcome can be caused by batch-specific populations; some prior knowledge about the biological context is necessary to distinguish between these two possibilities. For the PBMC dataset, none of the most variable clusters are overtly batch-specific, consistent with the fact that our batches are effectively replicates. # Avoid minor difficulties with the &#39;table&#39; class. tab.mnn &lt;- unclass(tab.mnn) # Using a large pseudo.count to avoid unnecessarily # large variances when the counts are low. norm &lt;- normalizeCounts(tab.mnn, pseudo.count=10) # Ranking clusters by the largest variances. 
rv &lt;- rowVars(norm) DataFrame(Batch=tab.mnn, var=rv)[order(rv, decreasing=TRUE),] ## DataFrame with 16 rows and 3 columns ## Batch.1 Batch.2 var ## &lt;integer&gt; &lt;integer&gt; &lt;numeric&gt; ## 15 4 36 0.934778 ## 13 144 93 0.728465 ## 9 11 56 0.707757 ## 8 162 118 0.563419 ## 4 12 4 0.452565 ## ... ... ... ... ## 6 17 19 0.05689945 ## 10 547 1083 0.04527468 ## 2 289 542 0.02443988 ## 1 337 606 0.01318296 ## 16 4 8 0.00689661 We can also visualize the corrected coordinates using a \\(t\\)-SNE plot (Figure 13.4). The presence of visual clusters containing cells from both batches provides a comforting illusion that the correction was successful. library(scater) set.seed(0010101010) mnn.out &lt;- runTSNE(mnn.out, dimred=&quot;corrected&quot;) mnn.out$batch &lt;- factor(mnn.out$batch) plotTSNE(mnn.out, colour_by=&quot;batch&quot;) Figure 13.4: \\(t\\)-SNE plot of the PBMC datasets after MNN correction. Each point is a cell that is colored according to its batch of origin. For fastMNN(), one useful diagnostic is the proportion of variance within each batch that is lost during MNN correction. Specifically, this refers to the within-batch variance that is removed during orthogonalization with respect to the average correction vector at each merge step. This is returned via the lost.var field in the metadata of mnn.out, which contains a matrix of the variance lost in each batch (column) at each merge step (row). metadata(mnn.out)$merge.info$lost.var ## [,1] [,2] ## [1,] 0.006617 0.003315 Large proportions of lost variance (&gt;10%) suggest that correction is removing genuine biological heterogeneity. This would occur due to violations of the assumption of orthogonality between the batch effect and the biological subspace (Haghverdi et al. 2018). In this case, the proportion of lost variance is small, indicating that non-orthogonality is not a major concern. 
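The &gt;10% guideline can be checked programmatically; flag_lost_var() below is a hypothetical helper sketched in base R, not part of batchelor, and the threshold of 0.1 is just the rule of thumb stated above.

```r
# Hypothetical helper: flag (merge step, batch) entries where more than
# 'threshold' of the within-batch variance was lost during orthogonalization.
flag_lost_var <- function(lost.var, threshold=0.1) {
    which(lost.var > threshold, arr.ind=TRUE)
}

# Using the lost.var values reported above for our single merge step:
lost <- matrix(c(0.006617, 0.003315), nrow=1)
flag_lost_var(lost)  # zero rows flagged, so orthogonality looks reasonable
```

A non-empty result would prompt closer inspection of the corresponding merge step, e.g., by comparing within-batch clusterings before and after correction as described in the next section of diagnostics.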
13.6.2 Preserving biological heterogeneity Another useful diagnostic check is to compare the clustering within each batch to the clustering of the merged data. Accurate data integration should preserve variance within each batch as there should be nothing to remove between cells in the same batch. This check complements the previously mentioned diagnostics that only focus on the removal of differences between batches. Specifically, it protects us against cases where the correction method simply aggregates all cells together, which would achieve perfect mixing but also discard the biological heterogeneity of interest. Ideally, we should see a many-to-1 mapping where the across-batch clustering is nested inside the within-batch clusterings. This indicates that any within-batch structure was preserved after correction while acknowledging that greater resolution is possible with more cells. In practice, more discrepancies can be expected even when the correction is perfect, due to the existence of closely related clusters that were arbitrarily separated in the within-batch clustering. As a general rule, we can be satisfied with the correction if the vast majority of entries in Figure 13.5 are zero, though this may depend on whether specific clusters of interest are gained or lost. library(pheatmap) # For the first batch (adding +10 for a smoother color transition # from zero to non-zero counts for any given matrix entry). tab &lt;- table(paste(&quot;after&quot;, clusters.mnn[rescaled$batch==1]), paste(&quot;before&quot;, colLabels(pbmc3k))) heat3k &lt;- pheatmap(log10(tab+10), cluster_row=FALSE, cluster_col=FALSE, main=&quot;PBMC 3K comparison&quot;, silent=TRUE) # For the second batch. 
tab &lt;- table(paste(&quot;after&quot;, clusters.mnn[rescaled$batch==2]), paste(&quot;before&quot;, colLabels(pbmc4k))) heat4k &lt;- pheatmap(log10(tab+10), cluster_row=FALSE, cluster_col=FALSE, main=&quot;PBMC 4K comparison&quot;, silent=TRUE) gridExtra::grid.arrange(heat3k[[4]], heat4k[[4]]) Figure 13.5: Comparison between the within-batch clusters and the across-batch clusters obtained after MNN correction. One heatmap is generated for each of the PBMC 3K and 4K datasets, where each entry is colored according to the number of cells with each pair of labels (before and after correction). We use the adjusted Rand index (Section 10.6.2) to quantify the agreement between the clusterings before and after batch correction. Recall that larger indices are more desirable as this indicates that within-batch heterogeneity is preserved, though this must be balanced against the ability of each method to actually perform batch correction. library(bluster) ri3k &lt;- pairwiseRand(clusters.mnn[rescaled$batch==1], colLabels(pbmc3k), mode=&quot;index&quot;) ri3k ## [1] 0.7361 ri4k &lt;- pairwiseRand(clusters.mnn[rescaled$batch==2], colLabels(pbmc4k), mode=&quot;index&quot;) ri4k ## [1] 0.8301 We can also break down the ARI into per-cluster ratios for more detailed diagnostics (Figure 13.6). For example, we could see low ratios off the diagonal if distinct clusters in the within-batch clustering were incorrectly aggregated in the merged clustering. Conversely, we might see low ratios on the diagonal if the correction inflated or introduced spurious heterogeneity inside a within-batch cluster. # For the first batch. tab &lt;- pairwiseRand(colLabels(pbmc3k), clusters.mnn[rescaled$batch==1]) heat3k &lt;- pheatmap(tab, cluster_row=FALSE, cluster_col=FALSE, col=rev(viridis::magma(100)), main=&quot;PBMC 3K probabilities&quot;, silent=TRUE) # For the second batch. 
tab &lt;- pairwiseRand(colLabels(pbmc4k), clusters.mnn[rescaled$batch==2]) heat4k &lt;- pheatmap(tab, cluster_row=FALSE, cluster_col=FALSE, col=rev(viridis::magma(100)), main=&quot;PBMC 4K probabilities&quot;, silent=TRUE) gridExtra::grid.arrange(heat3k[[4]], heat4k[[4]]) Figure 13.6: ARI-derived ratios for the within-batch clusters after comparison to the merged clusters obtained after MNN correction. One heatmap is generated for each of the PBMC 3K and 4K datasets. 13.7 Encouraging consistency with marker genes In some situations, we will already have performed within-batch analyses to characterize salient aspects of population heterogeneity. This is not uncommon when merging datasets from different sources where each dataset has already been analyzed, annotated and interpreted separately. It is subsequently desirable for the integration procedure to retain these “known interesting” aspects of each dataset in the merged dataset. We can encourage this outcome by using the marker genes within each dataset as our selected feature set for fastMNN() and related methods. This focuses on the relevant heterogeneity and represents a semi-supervised approach that is a natural extension of the strategy described in Section 8.4. To illustrate, we apply this strategy to our PBMC datasets. We identify the top marker genes from pairwise Wilcoxon rank sum tests between every pair of clusters within each batch, analogous to the method used by SingleR (Chapter 12). In this case, we use the top 10 marker genes but any value can be used depending on the acceptable trade-off between signal and noise (and speed). We then take the union across all comparisons in all batches and use that in place of our HVG set in fastMNN(). # Recall that groups for marker detection # are automatically defined from &#39;colLabels()&#39;. 
stats3 &lt;- pairwiseWilcox(pbmc3k, direction=&quot;up&quot;) markers3 &lt;- getTopMarkers(stats3[[1]], stats3[[2]], n=10) stats4 &lt;- pairwiseWilcox(pbmc4k, direction=&quot;up&quot;) markers4 &lt;- getTopMarkers(stats4[[1]], stats4[[2]], n=10) marker.set &lt;- unique(unlist(c(unlist(markers3), unlist(markers4)))) length(marker.set) # getting the total number of genes selected in this manner. ## [1] 314 set.seed(1000110) mnn.out2 &lt;- fastMNN(pbmc3k, pbmc4k, subset.row=marker.set, BSPARAM=BiocSingular::RandomParam(deferred=TRUE)) A quick inspection of Figure 13.7 indicates that the original within-batch structure is indeed preserved in the corrected data. This highlights the utility of a marker-based feature set for integrating datasets that have already been characterized separately in a manner that preserves existing interpretations of each dataset. We note that some within-batch clusters have merged, most likely due to the lack of robust separation in the first place, though this may also be treated as a diagnostic on the appropriateness of the integration depending on the context. mnn.out2 &lt;- runTSNE(mnn.out2, dimred=&quot;corrected&quot;) gridExtra::grid.arrange( plotTSNE(mnn.out2[,mnn.out2$batch==1], colour_by=I(colLabels(pbmc3k))), plotTSNE(mnn.out2[,mnn.out2$batch==2], colour_by=I(colLabels(pbmc4k))), ncol=2 ) Figure 13.7: \\(t\\)-SNE plots of the merged PBMC datasets, where the merge was performed using only marker genes identified within each batch. Each point represents a cell that is colored by the assigned cluster from the within-batch analysis for the 3K (left) and 4K dataset (right). 13.8 Using the corrected values The greatest value of batch correction lies in facilitating cell-based analysis of population heterogeneity in a consistent manner across batches. Cluster 1 in batch A is the same as cluster 1 in batch B when the clustering is performed on the merged data. 
There is no need to identify mappings between separate clusterings, which might not even be possible when the clusters are not well-separated. The burden of interpretation is consolidated by generating a single set of clusters for all batches, rather than requiring separate examination of each batch’s clusters. Another benefit is that the available number of cells is increased when all batches are combined, which allows for greater resolution of population structure in downstream analyses. We previously demonstrated the application of clustering methods to the batch-corrected data, but the same principles apply for other analyses like trajectory reconstruction. At this point, it is also tempting to use the corrected expression values for gene-based analyses like DE-based marker gene detection. This is not generally recommended as an arbitrary correction algorithm is not obliged to preserve the magnitude (or even direction) of differences in per-gene expression when attempting to align multiple batches. For example, cosine normalization in fastMNN() shrinks the magnitude of the expression values so that the computed log-fold changes have no obvious interpretation. Of greater concern is the possibility that the correction introduces artificial agreement across batches. To illustrate: Consider a dataset (first batch) with two cell types, \\(A\\) and \\(B\\). Consider a second batch with the same cell types, denoted as \\(A&#39;\\) and \\(B&#39;\\). Assume that, for some reason, gene \\(X\\) is expressed in \\(A\\) but not in \\(A&#39;\\), \\(B\\) or \\(B&#39;\\) - possibly due to some difference in how the cells were treated, or maybe due to a donor effect. We then merge the batches together based on the shared cell types. This yields a result where \\(A\\) and \\(A&#39;\\) cells are intermingled and the difference due to \\(X\\) is eliminated. 
One can debate whether this should be the case, but in general, it is necessary for batch correction methods to smooth over small biological differences (as discussed in Section 13.3). Now, if we corrected the second batch to the first, we must have coerced the expression values of \\(X\\) in \\(A&#39;\\) to non-zero values to align with those of \\(A\\), while leaving the expression of \\(X\\) in \\(B&#39;\\) and \\(B\\) at zero. Thus, we have artificially introduced DE between \\(A&#39;\\) and \\(B&#39;\\) for \\(X\\) in the second batch to align with the DE between \\(A\\) and \\(B\\) in the first batch. (The converse is also possible where DE in the first batch is artificially removed to align with the second batch, depending on the order of merges.) The artificial DE has implications for the identification of the cell types and interpretation of the results. We would be misled into believing that both \\(A\\) and \\(A&#39;\\) are \\(X\\)-positive, when in fact this is only true for \\(A\\). At best, this is only a minor error - after all, we do actually have \\(X\\)-positive cells of that overall type, we simply do not see that \\(A&#39;\\) is \\(X\\)-negative. At worst, this can compromise the conclusions, e.g., if the first batch was drug treated and the second batch was a control, we might mistakenly think that a \\(X\\)-positive population exists in the latter and conclude that our drug has no effect. Rather, it is preferable to perform DE analyses using the uncorrected expression values with blocking on the batch, as discussed in Section 11.4. This strategy is based on the expectation that any genuine DE between clusters should still be present in a within-batch comparison where batch effects are absent. It penalizes genes that exhibit inconsistent DE across batches, thus protecting against misleading conclusions when a population in one batch is aligned to a similar-but-not-identical population in another batch. 
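The scenario above reduces to a toy calculation with made-up numbers, where the "correction" is a caricature that simply shifts each population of the second batch onto its counterpart in the first.

```r
# Mean expression of gene X in each population of each batch.
batch1 <- c(A=5, B=0)   # X is expressed only in cell type A
batch2 <- c(A=0, B=0)   # X is silent everywhere in the second batch

# Aligning A' (batch 2) to A (batch 1) implies a per-population shift:
shift <- c(A=batch1[["A"]] - batch2[["A"]], B=0)
batch2.corrected <- batch2 + shift
batch2.corrected
## A B
## 5 0
# X now appears "expressed" in A', creating A'-versus-B' DE that did not
# exist anywhere in the uncorrected second batch.
```

Before correction, the second batch had no DE in X between its two populations; after correction, it inherits the first batch's DE, which is exactly the artificial agreement described above.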
We demonstrate this approach below using a blocked \\(t\\)-test to detect markers in the PBMC dataset, where the presence of the same pattern across clusters within each batch (Figure 13.8) is reassuring. If integration is performed across multiple conditions, it is even more important to use the uncorrected expression values for downstream analyses - see Section 14.6.2 for a discussion. m.out &lt;- findMarkers(uncorrected, clusters.mnn, block=uncorrected$batch, direction=&quot;up&quot;, lfc=1, row.data=rowData(uncorrected)[,3,drop=FALSE]) # A (probably activated?) T cell subtype of some sort: demo &lt;- m.out[[&quot;10&quot;]] as.data.frame(demo[1:20,c(&quot;Symbol&quot;, &quot;Top&quot;, &quot;p.value&quot;, &quot;FDR&quot;)]) ## Symbol Top p.value FDR ## ENSG00000177954 RPS27 1 3.399e-168 1.061e-163 ## ENSG00000227507 LTB 1 1.238e-157 1.934e-153 ## ENSG00000167286 CD3D 1 9.136e-89 4.076e-85 ## ENSG00000111716 LDHB 1 8.699e-44 1.811e-40 ## ENSG00000008517 IL32 1 4.880e-31 6.928e-28 ## ENSG00000172809 RPL38 1 8.727e-143 6.814e-139 ## ENSG00000171223 JUNB 1 8.762e-72 2.737e-68 ## ENSG00000071082 RPL31 2 8.612e-78 2.989e-74 ## ENSG00000121966 CXCR4 2 2.370e-07 1.322e-04 ## ENSG00000251562 MALAT1 2 3.618e-33 5.650e-30 ## ENSG00000133639 BTG1 2 6.847e-12 4.550e-09 ## ENSG00000170345 FOS 2 2.738e-46 6.108e-43 ## ENSG00000129824 RPS4Y1 2 1.075e-108 6.713e-105 ## ENSG00000177606 JUN 3 1.039e-37 1.910e-34 ## ENSG00000112306 RPS12 3 1.656e-33 2.722e-30 ## ENSG00000110700 RPS13 3 7.600e-18 7.657e-15 ## ENSG00000198851 CD3E 3 1.058e-36 1.836e-33 ## ENSG00000213741 RPS29 3 1.494e-148 1.555e-144 ## ENSG00000116251 RPL22 4 3.992e-25 4.796e-22 ## ENSG00000144713 RPL32 4 1.224e-32 1.820e-29 plotExpression(uncorrected, x=I(factor(clusters.mnn)), features=&quot;ENSG00000177954&quot;, colour_by=&quot;batch&quot;) + facet_wrap(~colour_by) Figure 13.8: Distributions of RPS27 uncorrected log-expression values within each cluster in each batch of the merged PBMC dataset. 
We suggest limiting the use of per-gene corrected values to visualization, e.g., when coloring points on a \\(t\\)-SNE plot by per-cell expression. This can be more aesthetically pleasing than uncorrected expression values that may contain large shifts on the colour scale between cells in different batches. Use of the corrected values in any quantitative procedure should be treated with caution, and should be backed up by similar results from an analysis on the uncorrected values. Session Info R version 4.0.4 (2021-02-15) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.12-books/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.12-books/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] bluster_1.0.0 pheatmap_1.0.12 [3] scater_1.18.6 ggplot2_3.3.3 [5] scran_1.18.5 batchelor_1.6.2 [7] SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0 [9] Biobase_2.50.0 GenomicRanges_1.42.0 [11] GenomeInfoDb_1.26.4 HDF5Array_1.18.1 [13] rhdf5_2.34.0 DelayedArray_0.16.2 [15] IRanges_2.24.1 S4Vectors_0.28.1 [17] MatrixGenerics_1.2.1 matrixStats_0.58.0 [19] BiocGenerics_0.36.0 Matrix_1.3-2 [21] BiocStyle_2.18.1 rebook_1.0.0 loaded via a namespace (and not attached): [1] bitops_1.0-6 RColorBrewer_1.1-2 [3] tools_4.0.4 bslib_0.2.4 [5] utf8_1.2.1 R6_2.5.0 [7] irlba_2.3.3 ResidualMatrix_1.0.0 [9] vipor_0.4.5 DBI_1.1.1 [11] colorspace_2.0-0 rhdf5filters_1.2.0 [13] withr_2.4.1 gridExtra_2.3 [15] tidyselect_1.1.0 processx_3.4.5 [17] compiler_4.0.4 graph_1.68.0 [19] BiocNeighbors_1.8.2 labeling_0.4.2 [21] bookdown_0.21 sass_0.3.1 [23] scales_1.1.1 
callr_3.5.1 [25] stringr_1.4.0 digest_0.6.27 [27] rmarkdown_2.7 XVector_0.30.0 [29] pkgconfig_2.0.3 htmltools_0.5.1.1 [31] sparseMatrixStats_1.2.1 highr_0.8 [33] limma_3.46.0 rlang_0.4.10 [35] DelayedMatrixStats_1.12.3 farver_2.1.0 [37] jquerylib_0.1.3 generics_0.1.0 [39] jsonlite_1.7.2 BiocParallel_1.24.1 [41] dplyr_1.0.5 RCurl_1.98-1.3 [43] magrittr_2.0.1 BiocSingular_1.6.0 [45] GenomeInfoDbData_1.2.4 scuttle_1.0.4 [47] ggbeeswarm_0.6.0 Rcpp_1.0.6 [49] munsell_0.5.0 Rhdf5lib_1.12.1 [51] fansi_0.4.2 viridis_0.5.1 [53] lifecycle_1.0.0 stringi_1.5.3 [55] yaml_2.2.1 edgeR_3.32.1 [57] zlibbioc_1.36.0 Rtsne_0.15 [59] grid_4.0.4 dqrng_0.2.1 [61] crayon_1.4.1 lattice_0.20-41 [63] cowplot_1.1.1 beachmat_2.6.4 [65] locfit_1.5-9.4 CodeDepends_0.6.5 [67] knitr_1.31 ps_1.6.0 [69] pillar_1.5.1 igraph_1.2.6 [71] codetools_0.2-18 XML_3.99-0.6 [73] glue_1.4.2 evaluate_0.14 [75] BiocManager_1.30.10 vctrs_0.3.6 [77] purrr_0.3.4 gtable_0.3.0 [79] assertthat_0.2.1 xfun_0.22 [81] rsvd_1.0.3 viridisLite_0.3.0 [83] tibble_3.1.0 beeswarm_0.3.1 [85] statmod_1.4.35 ellipsis_0.3.1 Bibliography Chapter 14 Multi-sample comparisons 14.1 Motivation A powerful use of scRNA-seq technology lies in the design of replicated multi-condition experiments to detect changes in composition or expression between conditions. For example, a researcher could use this strategy to detect changes in cell type abundance after drug treatment (Richard et al. 
2018) or genetic modifications (Scialdone et al. 2016). This provides more biological insight than conventional scRNA-seq experiments involving only one biological condition, especially if we can relate population changes to specific experimental perturbations. Differential analyses of multi-condition scRNA-seq experiments can be broadly split into two categories - differential expression (DE) and differential abundance (DA) analyses. The former tests for changes in expression between conditions for cells of the same type that are present in both conditions, while the latter tests for changes in the composition of cell types (or states, etc.) between conditions. In this chapter, we will demonstrate both analyses using data from a study of the early mouse embryo (Pijuan-Sala et al. 2019). 14.2 Setting up the data Our demonstration scRNA-seq dataset was generated from chimeric mouse embryos at the E8.5 developmental stage. Each chimeric embryo was generated by injecting td-Tomato-positive embryonic stem cells (ESCs) into a wild-type (WT) blastocyst. Unlike in previous experiments (Scialdone et al. 2016), there is no genetic difference between the injected and background cells other than the expression of td-Tomato in the former. Instead, the aim of this “wild-type chimera” study is to determine whether the injection procedure itself introduces differences in lineage commitment compared to the background cells. The experiment used a paired design with three replicate batches of two samples each. Specifically, each batch contains one sample consisting of td-Tomato positive cells and another consisting of negative cells, obtained by fluorescence-activated cell sorting from a single pool of dissociated cells from 6-7 chimeric embryos. For each sample, scRNA-seq data was generated using the 10X Genomics protocol (Zheng et al. 2017) to obtain 2000-7000 cells. 
#--- loading ---# library(MouseGastrulationData) sce.chimera &lt;- WTChimeraData(samples=5:10) sce.chimera #--- feature-annotation ---# library(scater) rownames(sce.chimera) &lt;- uniquifyFeatureNames( rowData(sce.chimera)$ENSEMBL, rowData(sce.chimera)$SYMBOL) #--- quality-control ---# drop &lt;- sce.chimera$celltype.mapped %in% c(&quot;stripped&quot;, &quot;Doublet&quot;) sce.chimera &lt;- sce.chimera[,!drop] #--- normalization ---# sce.chimera &lt;- logNormCounts(sce.chimera) #--- variance-modelling ---# library(scran) dec.chimera &lt;- modelGeneVar(sce.chimera, block=sce.chimera$sample) chosen.hvgs &lt;- dec.chimera$bio &gt; 0 #--- merging ---# library(batchelor) set.seed(01001001) merged &lt;- correctExperiments(sce.chimera, batch=sce.chimera$sample, subset.row=chosen.hvgs, PARAM=FastMnnParam( merge.order=list( list(1,3,5), # WT (3 replicates) list(2,4,6) # td-Tomato (3 replicates) ) ) ) #--- clustering ---# g &lt;- buildSNNGraph(merged, use.dimred=&quot;corrected&quot;) clusters &lt;- igraph::cluster_louvain(g) colLabels(merged) &lt;- factor(clusters$membership) #--- dimensionality-reduction ---# merged &lt;- runTSNE(merged, dimred=&quot;corrected&quot;, external_neighbors=TRUE) merged &lt;- runUMAP(merged, dimred=&quot;corrected&quot;, external_neighbors=TRUE) merged ## class: SingleCellExperiment ## dim: 14699 19426 ## metadata(2): merge.info pca.info ## assays(3): reconstructed counts logcounts ## rownames(14699): Xkr4 Rp1 ... Vmn2r122 CAAA01147332.1 ## rowData names(3): rotation ENSEMBL SYMBOL ## colnames(19426): cell_9769 cell_9770 ... cell_30701 cell_30702 ## colData names(13): batch cell ... sizeFactor label ## reducedDimNames(5): corrected pca.corrected.E7.5 pca.corrected.E8.5 ## TSNE UMAP ## altExpNames(0): The differential analyses in this chapter will be predicated on many of the pre-processing steps covered previously. 
For brevity, we will not explicitly repeat them here, only noting that we have already merged cells from all samples into the same coordinate system (Chapter 28.8) and clustered the merged dataset to obtain a common partitioning across all samples (Chapter 10). A brief inspection of the results indicates that clusters contain similar contributions from all batches with only modest differences associated with td-Tomato expression (Figure 14.1). library(scater) table(colLabels(merged), merged$tomato) ## ## FALSE TRUE ## 1 546 401 ## 2 60 52 ## 3 470 398 ## 4 469 211 ## 5 335 271 ## 6 258 249 ## 7 1241 967 ## 8 203 221 ## 9 630 629 ## 10 71 181 ## 11 47 57 ## 12 417 310 ## 13 58 0 ## 14 209 214 ## 15 414 630 ## 16 363 509 ## 17 234 198 ## 18 657 607 ## 19 151 303 ## 20 579 443 ## 21 137 74 ## 22 82 78 ## 23 155 1 ## 24 762 878 ## 25 363 497 ## 26 1420 716 table(colLabels(merged), merged$pool) ## ## 3 4 5 ## 1 224 173 550 ## 2 26 30 56 ## 3 226 172 470 ## 4 78 162 440 ## 5 99 227 280 ## 6 187 116 204 ## 7 300 909 999 ## 8 69 134 221 ## 9 229 423 607 ## 10 114 54 84 ## 11 16 31 57 ## 12 179 169 379 ## 13 2 51 5 ## 14 77 97 249 ## 15 114 289 641 ## 16 183 242 447 ## 17 157 81 194 ## 18 123 308 833 ## 19 106 118 230 ## 20 236 238 548 ## 21 3 10 198 ## 22 27 29 104 ## 23 6 84 66 ## 24 217 455 968 ## 25 132 172 556 ## 26 194 870 1072 gridExtra::grid.arrange( plotTSNE(merged, colour_by=&quot;tomato&quot;, text_by=&quot;label&quot;), plotTSNE(merged, colour_by=data.frame(pool=factor(merged$pool))), ncol=2 ) Figure 14.1: \\(t\\)-SNE plot of the WT chimeric dataset, where each point represents a cell and is colored according to td-Tomato expression (left) or batch of origin (right). Cluster numbers are superimposed based on the median coordinate of cells assigned to that cluster. Ordinarily, we would be obliged to perform marker detection to assign biological meaning to these clusters. 
For simplicity, we will skip this step by directly using the existing cell type labels provided by Pijuan-Sala et al. (2019). These were obtained by mapping the cells in this dataset to a larger, pre-annotated “atlas” of mouse early embryonic development. While there are obvious similarities, we see that many of our clusters map to multiple labels and vice versa (Figure 14.2), which reflects the difficulties in unambiguously resolving cell types undergoing differentiation. library(bluster) pairwiseRand(colLabels(merged), merged$celltype.mapped, &quot;index&quot;) ## [1] 0.5514 by.label &lt;- table(colLabels(merged), merged$celltype.mapped) pheatmap::pheatmap(log2(by.label+1), color=viridis::viridis(101)) Figure 14.2: Heatmap showing the abundance of cells with each combination of cluster (row) and cell type label (column). The color scale represents the log2-count for each combination. 14.3 Differential expression between conditions 14.3.1 Creating pseudo-bulk samples The most obvious differential analysis is to look for changes in expression between conditions. We perform the DE analysis separately for each label to identify cell type-specific transcriptional effects of injection. The actual DE testing is performed on “pseudo-bulk” expression profiles (Tung et al. 2017), generated by summing counts together for all cells with the same combination of label and sample. This leverages the resolution offered by single-cell technologies to define the labels, and combines it with the statistical rigor of existing methods for DE analyses involving a small number of samples. # Using &#39;label&#39; and &#39;sample&#39; as our two factors; each column of the output # corresponds to one unique combination of these two factors. 
summed &lt;- aggregateAcrossCells(merged, id=colData(merged)[,c(&quot;celltype.mapped&quot;, &quot;sample&quot;)]) summed ## class: SingleCellExperiment ## dim: 14699 186 ## metadata(2): merge.info pca.info ## assays(1): counts ## rownames(14699): Xkr4 Rp1 ... Vmn2r122 CAAA01147332.1 ## rowData names(3): rotation ENSEMBL SYMBOL ## colnames: NULL ## colData names(16): batch cell ... sample ncells ## reducedDimNames(5): corrected pca.corrected.E7.5 pca.corrected.E8.5 ## TSNE UMAP ## altExpNames(0): At this point, it is worth reflecting on the motivations behind the use of pseudo-bulking: Larger counts are more amenable to standard DE analysis pipelines designed for bulk RNA-seq data. Normalization is more straightforward and certain statistical approximations are more accurate e.g., the saddlepoint approximation for quasi-likelihood methods or normality for linear models. Collapsing cells into samples reflects the fact that our biological replication occurs at the sample level (A. T. L. Lun and Marioni 2017). Each sample is represented no more than once for each condition, avoiding problems from unmodelled correlations between samples. Supplying the per-cell counts directly to a DE analysis pipeline would imply that each cell is an independent biological replicate, which is not true from an experimental perspective. (A mixed effects model can handle this variance structure but involves extra statistical and computational complexity for little benefit, see Crowell et al. (2019).) Variance between cells within each sample is masked, provided it does not affect variance across (replicate) samples. This avoids penalizing DEGs that are not uniformly up- or down-regulated for all cells in all samples of one condition. Masking is generally desirable as DEGs - unlike marker genes - do not need to have low within-sample variance to be interesting, e.g., if the treatment effect is consistent across replicate populations but heterogeneous on a per-cell basis. 
(Of course, high per-cell variability will still result in weaker DE if it affects the variability across populations, while homogeneous per-cell responses will result in stronger DE due to a larger population-level log-fold change. These effects are also largely desirable.) 14.3.2 Performing the DE analysis 14.3.2.1 Introduction The DE analysis will be performed using quasi-likelihood (QL) methods from the edgeR package (Robinson, McCarthy, and Smyth 2010; Chen, Lun, and Smyth 2016). This uses a negative binomial generalized linear model (NB GLM) to handle overdispersed count data in experiments with limited replication. In our case, we have biological variation with three paired replicates per condition, so edgeR (or its contemporaries) is a natural choice for the analysis. We do not use all labels for GLM fitting as the strong DE between labels makes it difficult to compute a sensible average abundance to model the mean-dispersion trend. Moreover, label-specific batch effects would not be easily handled with a single additive term in the design matrix for the batch. Instead, we arbitrarily pick one of the labels to use for this demonstration. label <- "Mesenchyme" current <- summed[,label==summed$celltype.mapped] # Creating a DGEList object for use in edgeR: library(edgeR) y <- DGEList(counts(current), samples=colData(current)) y ## An object of class "DGEList" ## $counts ## Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 ## Xkr4 2 0 0 0 3 0 ## Rp1 0 0 1 0 0 0 ## Sox17 7 0 3 0 14 9 ## Mrpl15 1420 271 1009 379 1578 749 ## Rgs20 3 0 1 1 0 0 ## 14694 more rows ... 
## ## $samples ## group lib.size norm.factors batch cell barcode sample stage tomato pool ## Sample1 1 4607053 1 5 &lt;NA&gt; &lt;NA&gt; 5 E8.5 TRUE 3 ## Sample2 1 1064970 1 6 &lt;NA&gt; &lt;NA&gt; 6 E8.5 FALSE 3 ## Sample3 1 2494010 1 7 &lt;NA&gt; &lt;NA&gt; 7 E8.5 TRUE 4 ## Sample4 1 1028668 1 8 &lt;NA&gt; &lt;NA&gt; 8 E8.5 FALSE 4 ## Sample5 1 4290221 1 9 &lt;NA&gt; &lt;NA&gt; 9 E8.5 TRUE 5 ## Sample6 1 1950840 1 10 &lt;NA&gt; &lt;NA&gt; 10 E8.5 FALSE 5 ## stage.mapped celltype.mapped closest.cell doub.density sizeFactor label ## Sample1 &lt;NA&gt; Mesenchyme &lt;NA&gt; NA NA &lt;NA&gt; ## Sample2 &lt;NA&gt; Mesenchyme &lt;NA&gt; NA NA &lt;NA&gt; ## Sample3 &lt;NA&gt; Mesenchyme &lt;NA&gt; NA NA &lt;NA&gt; ## Sample4 &lt;NA&gt; Mesenchyme &lt;NA&gt; NA NA &lt;NA&gt; ## Sample5 &lt;NA&gt; Mesenchyme &lt;NA&gt; NA NA &lt;NA&gt; ## Sample6 &lt;NA&gt; Mesenchyme &lt;NA&gt; NA NA &lt;NA&gt; ## celltype.mapped.1 sample.1 ncells ## Sample1 Mesenchyme 5 286 ## Sample2 Mesenchyme 6 55 ## Sample3 Mesenchyme 7 243 ## Sample4 Mesenchyme 8 134 ## Sample5 Mesenchyme 9 478 ## Sample6 Mesenchyme 10 299 14.3.2.2 Pre-processing A typical step in bulk RNA-seq data analyses is to remove samples with very low library sizes due to failed library preparation or sequencing. The very low counts in these samples can be troublesome in downstream steps such as normalization (Chapter 7) or for some statistical approximations used in the DE analysis. In our situation, this is equivalent to removing label-sample combinations that have very few or lowly-sequenced cells. The exact definition of “very low” will vary, but in this case, we remove combinations containing fewer than 10 cells (Crowell et al. 2019). Alternatively, we could apply the outlier-based strategy described in Chapter 6, but this makes the strong assumption that all label-sample combinations have similar numbers of cells that are sequenced to similar depth. 
We defer to the usual diagnostics for bulk DE analyses to decide whether a particular pseudo-bulk profile should be removed. discarded &lt;- current$ncells &lt; 10 y &lt;- y[,!discarded] summary(discarded) ## Mode FALSE ## logical 6 Another typical step in bulk RNA-seq analyses is to remove genes that are lowly expressed. This reduces computational work, improves the accuracy of mean-variance trend modelling and decreases the severity of the multiple testing correction. Here, we use the filterByExpr() function from edgeR to remove genes that are not expressed above a log-CPM threshold in a minimum number of samples (determined from the size of the smallest treatment group in the experimental design). keep &lt;- filterByExpr(y, group=current$tomato) y &lt;- y[keep,] summary(keep) ## Mode FALSE TRUE ## logical 9011 5688 Finally, we correct for composition biases by computing normalization factors with the trimmed mean of M-values method (Robinson and Oshlack 2010). We do not need the bespoke single-cell methods described in Chapter 7, as the counts for our pseudo-bulk samples are large enough to apply bulk normalization methods. (Note that edgeR normalization factors are closely related but not the same as the size factors described elsewhere in this book. Size factors are proportional to the product of the normalization factors and the library sizes.) 
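The note above on the relationship between edgeR normalization factors and size factors can be made concrete with a few lines of base R. The library sizes below are taken from the earlier sample table, but the normalization factors are made-up placeholder values, not computed by TMM:

```r
# Illustration only: size factors are proportional to the product of the
# normalization factors and the library sizes.
# The normalization factors here are hypothetical placeholders.
lib.size <- c(4607053, 1064970, 2494010)
norm.factors <- c(1.07, 1.05, 0.96)

# 'Effective' library sizes combine both quantities; scaling them to unit
# mean yields values that play the same role as size factors.
effective <- lib.size * norm.factors
size.factors <- effective / mean(effective)
stopifnot(isTRUE(all.equal(mean(size.factors), 1)))
```

This is only meant to make the proportionality concrete; the actual TMM factors are computed from the counts themselves.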
y &lt;- calcNormFactors(y) y$samples ## group lib.size norm.factors batch cell barcode sample stage tomato pool ## Sample1 1 4607053 1.0683 5 &lt;NA&gt; &lt;NA&gt; 5 E8.5 TRUE 3 ## Sample2 1 1064970 1.0487 6 &lt;NA&gt; &lt;NA&gt; 6 E8.5 FALSE 3 ## Sample3 1 2494010 0.9582 7 &lt;NA&gt; &lt;NA&gt; 7 E8.5 TRUE 4 ## Sample4 1 1028668 0.9774 8 &lt;NA&gt; &lt;NA&gt; 8 E8.5 FALSE 4 ## Sample5 1 4290221 0.9707 9 &lt;NA&gt; &lt;NA&gt; 9 E8.5 TRUE 5 ## Sample6 1 1950840 0.9817 10 &lt;NA&gt; &lt;NA&gt; 10 E8.5 FALSE 5 ## stage.mapped celltype.mapped closest.cell doub.density sizeFactor label ## Sample1 &lt;NA&gt; Mesenchyme &lt;NA&gt; NA NA &lt;NA&gt; ## Sample2 &lt;NA&gt; Mesenchyme &lt;NA&gt; NA NA &lt;NA&gt; ## Sample3 &lt;NA&gt; Mesenchyme &lt;NA&gt; NA NA &lt;NA&gt; ## Sample4 &lt;NA&gt; Mesenchyme &lt;NA&gt; NA NA &lt;NA&gt; ## Sample5 &lt;NA&gt; Mesenchyme &lt;NA&gt; NA NA &lt;NA&gt; ## Sample6 &lt;NA&gt; Mesenchyme &lt;NA&gt; NA NA &lt;NA&gt; ## celltype.mapped.1 sample.1 ncells ## Sample1 Mesenchyme 5 286 ## Sample2 Mesenchyme 6 55 ## Sample3 Mesenchyme 7 243 ## Sample4 Mesenchyme 8 134 ## Sample5 Mesenchyme 9 478 ## Sample6 Mesenchyme 10 299 As part of the usual diagnostics for a bulk RNA-seq DE analysis, we generate a mean-difference (MD) plot for each normalized pseudo-bulk profile (Figure 14.3). This should exhibit a trumpet shape centered at zero indicating that the normalization successfully removed systematic bias between profiles. Lack of zero-centering or dominant discrete patterns at low abundances may be symptomatic of deeper problems with normalization, possibly due to insufficient cells/reads/UMIs composing a particular pseudo-bulk profile. par(mfrow=c(2,3)) for (i in seq_len(ncol(y))) { plotMD(y, column=i) } Figure 14.3: Mean-difference plots of the normalized expression values for each pseudo-bulk sample against the average of all other samples. We also generate a multi-dimensional scaling (MDS) plot for the pseudo-bulk profiles (Figure 14.4). 
This is closely related to PCA and allows us to visualize the structure of the data in a manner similar to that described in Chapter 9 (though we rarely have enough pseudo-bulk profiles to make use of techniques like t-SNE). Here, the aim is to check whether samples separate by our known factors of interest - in this case, injection status. Strong separation foreshadows a large number of DEGs in the subsequent analysis. plotMDS(cpm(y, log=TRUE), col=ifelse(y$samples$tomato, "red", "blue")) Figure 14.4: MDS plot of the pseudo-bulk log-normalized CPMs, where each point represents a sample and is colored by the tomato status. 14.3.2.3 Statistical modelling We set up the design matrix to block on the batch-to-batch differences across different embryo pools, while retaining an additive term that represents the effect of injection. The latter is represented in our model as the log-fold change in gene expression in td-Tomato-positive cells over their negative counterparts within the same label. Our aim is to test whether this log-fold change is significantly different from zero. design <- model.matrix(~factor(pool) + factor(tomato), y$samples) design ## (Intercept) factor(pool)4 factor(pool)5 factor(tomato)TRUE ## Sample1 1 0 0 1 ## Sample2 1 0 0 0 ## Sample3 1 1 0 1 ## Sample4 1 1 0 0 ## Sample5 1 0 1 1 ## Sample6 1 0 1 0 ## attr(,"assign") ## [1] 0 1 1 2 ## attr(,"contrasts") ## attr(,"contrasts")$`factor(pool)` ## [1] "contr.treatment" ## ## attr(,"contrasts")$`factor(tomato)` ## [1] "contr.treatment" We estimate the negative binomial (NB) dispersions with estimateDisp(). The role of the NB dispersion is to model the mean-variance trend (Figure 14.5), which is not easily accommodated by QL dispersions alone due to the quadratic nature of the NB mean-variance trend. y <- estimateDisp(y, design) summary(y$trended.dispersion) ## Min. 1st Qu. Median Mean 3rd Qu. Max. 
## 0.0103 0.0167 0.0213 0.0202 0.0235 0.0266 plotBCV(y) Figure 14.5: Biological coefficient of variation (BCV) for each gene as a function of the average abundance. The BCV is computed as the square root of the NB dispersion after empirical Bayes shrinkage towards the trend. Trended and common BCV estimates are shown in blue and red, respectively. We also estimate the quasi-likelihood dispersions with glmQLFit() (Chen, Lun, and Smyth 2016). This fits a GLM to the counts for each gene and estimates the QL dispersion from the GLM deviance. We set robust=TRUE to avoid distortions from highly variable clusters (Phipson et al. 2016). The QL dispersion models the uncertainty and variability of the per-gene variance (Figure 14.6) - which is not well handled by the NB dispersions, so the two dispersion types complement each other in the final analysis. fit &lt;- glmQLFit(y, design, robust=TRUE) summary(fit$var.prior) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.318 0.714 0.854 0.804 0.913 1.067 summary(fit$df.prior) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.227 12.675 12.675 12.339 12.675 12.675 plotQLDisp(fit) Figure 14.6: QL dispersion estimates for each gene as a function of abundance. Raw estimates (black) are shrunk towards the trend (blue) to yield squeezed estimates (red). We test for differences in expression due to injection using glmQLFTest(). DEGs are defined as those with non-zero log-fold changes at a false discovery rate of 5%. Very few genes are significantly DE, indicating that injection has little effect on the transcriptome of mesenchyme cells. (Note that this logic is somewhat circular, as a large transcriptional effect may have caused cells of this type to be re-assigned to a different label. We discuss this in more detail in Section 14.6.1 below.) 
res <- glmQLFTest(fit, coef=ncol(design)) summary(decideTests(res)) ## factor(tomato)TRUE ## Down 8 ## NotSig 5672 ## Up 8 topTags(res) ## Coefficient: factor(tomato)TRUE ## logFC logCPM F PValue FDR ## Phlda2 -4.3874 9.934 1638.59 1.812e-16 1.031e-12 ## Erdr1 2.0691 8.833 356.37 1.061e-11 3.017e-08 ## Mid1 1.5191 6.931 120.15 1.844e-08 3.497e-05 ## H13 -1.0596 7.540 80.80 2.373e-07 2.527e-04 ## Kcnq1ot1 1.3763 7.242 83.31 2.392e-07 2.527e-04 ## Akr1e1 -1.7206 5.128 79.31 2.665e-07 2.527e-04 ## Zdbf2 1.8008 6.797 83.66 6.809e-07 5.533e-04 ## Asb4 -0.9235 7.341 53.45 2.918e-06 2.075e-03 ## Impact 0.8516 7.353 50.31 4.145e-06 2.620e-03 ## Lum -0.6031 9.275 41.67 1.205e-05 6.851e-03 14.3.3 Putting it all together 14.3.3.1 Looping across labels Now that we have laid out the theory underlying the DE analysis, we repeat this process for each of the labels to identify injection-induced DE in each cell type. This is conveniently done using the pseudoBulkDGE() function from scran, which will loop over all labels and apply the exact analysis described above to each label. (Users can also set method="voom" to perform an equivalent analysis using the voom() pipeline from limma - see Chapter 33.9 for the full set of function calls.) # Removing all pseudo-bulk samples with 'insufficient' cells. summed.filt <- summed[,summed$ncells >= 10] library(scran) de.results <- pseudoBulkDGE(summed.filt, label=summed.filt$celltype.mapped, design=~factor(pool) + tomato, coef="tomatoTRUE", condition=summed.filt$tomato ) The function returns a list of DataFrames containing the DE results for each label. Each DataFrame also contains the intermediate edgeR objects used in the DE analyses, which can be used to generate any of the previously described diagnostic plots (Figure 14.7). It is often wise to generate these plots to ensure that any interesting results are not compromised by technical issues. 
cur.results <- de.results[["Allantois"]] cur.results[order(cur.results$PValue),] ## DataFrame with 14699 rows and 5 columns ## logFC logCPM F PValue FDR ## <numeric> <numeric> <numeric> <numeric> <numeric> ## Phlda2 -2.489508 12.58150 1207.016 3.33486e-21 1.60507e-17 ## Xist -7.978532 8.00166 1092.831 1.27783e-17 3.07510e-14 ## Erdr1 1.947170 9.07321 296.937 1.58009e-14 2.53500e-11 ## Slc22a18 -4.347153 4.04380 117.389 1.92517e-10 2.31647e-07 ## Slc38a4 0.891849 10.24094 113.899 2.52208e-10 2.42776e-07 ## ... ... ... ... ... ... ## Ccl27a_ENSMUSG00000095247 NA NA NA NA NA ## CR974586.5 NA NA NA NA NA ## AC132444.6 NA NA NA NA NA ## Vmn2r122 NA NA NA NA NA ## CAAA01147332.1 NA NA NA NA NA y.allantois <- metadata(cur.results)$y plotBCV(y.allantois) Figure 14.7: Biological coefficient of variation (BCV) for each gene as a function of the average abundance for the allantois pseudo-bulk analysis. Trended and common BCV estimates are shown in blue and red, respectively. We list the labels that were skipped due to the absence of replicates or contrasts. If it is necessary to extract statistics in the absence of replicates, several strategies can be applied, such as reducing the complexity of the model or using a predefined value for the NB dispersion. We refer readers to the edgeR user’s guide for more details. metadata(de.results)$failed ## [1] "Blood progenitors 1" "Caudal epiblast" "Caudal neurectoderm" ## [4] "ExE ectoderm" "Parietal endoderm" "Stripped" 14.3.3.2 Cross-label meta-analyses We examine the numbers of DEGs at an FDR of 5% for each label using the decideTestsPerLabel() function. In general, there seems to be very little differential expression that is introduced by injection. 
Note that genes listed as NA were either filtered out as low-abundance genes for a given label’s analysis, or the comparison of interest was not possible for a particular label, e.g., due to lack of residual degrees of freedom or an absence of samples from both conditions. is.de &lt;- decideTestsPerLabel(de.results, threshold=0.05) summarizeTestsPerLabel(is.de) ## -1 0 1 NA ## Allantois 23 4766 24 9886 ## Blood progenitors 2 1 2472 2 12224 ## Cardiomyocytes 6 4361 5 10327 ## Caudal Mesoderm 2 1742 0 12955 ## Def. endoderm 7 1392 2 13298 ## Endothelium 3 3222 6 11468 ## Erythroid1 12 2777 15 11895 ## Erythroid2 5 3389 8 11297 ## Erythroid3 13 5048 16 9622 ## ExE mesoderm 2 5097 10 9590 ## Forebrain/Midbrain/Hindbrain 8 6226 11 8454 ## Gut 5 4482 6 10206 ## Haematoendothelial progenitors 4 4103 10 10582 ## Intermediate mesoderm 4 3072 4 11619 ## Mesenchyme 8 5672 8 9011 ## NMP 6 4107 10 10576 ## Neural crest 6 3311 8 11374 ## Paraxial mesoderm 4 4756 5 9934 ## Pharyngeal mesoderm 2 5082 9 9606 ## Rostral neurectoderm 5 3334 4 11356 ## Somitic mesoderm 7 2948 13 11731 ## Spinal cord 7 4591 7 10094 ## Surface ectoderm 9 5556 8 9126 For each gene, we compute the percentage of cell types in which that gene is upregulated or downregulated upon injection. (Here, we consider a gene to be non-DE if it is not retained after filtering.) We see that Xist is consistently downregulated in the injected cells; this is consistent with the fact that the injected cells are male while the background cells are derived from pools of male and female embryos, due to experimental difficulties with resolving sex at this stage. The consistent downregulation of Phlda2 and Cdkn1c in the injected cells is also interesting given that both are imprinted genes. However, some of these commonalities may be driven by shared contamination from ambient RNA - we discuss this further in Section 14.4. # Upregulated across most cell types. 
up.de &lt;- is.de &gt; 0 &amp; !is.na(is.de) head(sort(rowMeans(up.de), decreasing=TRUE), 10) ## Mid1 Erdr1 Impact Mcts2 Kcnq1ot1 Nnat Slc38a4 Zdbf2 ## 0.9130 0.7391 0.6087 0.5652 0.5652 0.5217 0.4348 0.3913 ## Hopx Peg3 ## 0.3913 0.2609 # Downregulated across cell types. down.de &lt;- is.de &lt; 0 &amp; !is.na(is.de) head(sort(rowMeans(down.de), decreasing=TRUE), 10) ## Xist Phlda2 Akr1e1 Cdkn1c H13 ## 0.73913 0.73913 0.73913 0.69565 0.52174 ## Wfdc2 B930036N10Rik B230312C02Rik Pink1 Mfap2 ## 0.21739 0.08696 0.08696 0.08696 0.08696 To identify label-specific DE, we use the pseudoBulkSpecific() function to test for significant differences from the average log-fold change over all other labels. More specifically, the null hypothesis for each label and gene is that the log-fold change lies between zero and the average log-fold change of the other labels. If a gene rejects this null for our label of interest, we can conclude that it exhibits DE that is more extreme or of the opposite sign compared to that in the majority of other labels (Figure 14.8). This approach is effectively a poor man’s interaction model that sacrifices the uncertainty of the average for an easier compute. We note that, while the difference from the average is a good heuristic, there is no guarantee that the top genes are truly label-specific; comparable DE in a subset of the other labels may be offset by weaker effects when computing the average. 
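For intuition, the design of a full interaction model for a single pair of labels can be sketched with model.matrix(); the sample metadata below is fabricated for illustration and does not correspond to the real samples:

```r
# Sketch of an interaction design for comparing the injection effect between
# two labels. The metadata here is made up for demonstration purposes.
samples <- data.frame(
    label=rep(c("Allantois", "Mesenchyme"), each=4),
    tomato=rep(c(TRUE, FALSE), times=4),
    pool=factor(rep(c(3, 4), times=4))
)

# The label:tomato interaction captures the label-specific injection effect,
# i.e., the difference in log-fold changes between the two labels.
design <- model.matrix(~pool + label*tomato, samples)
stopifnot("labelMesenchyme:tomatoTRUE" %in% colnames(design))
```

Testing the interaction coefficient asks whether the injection effect differs between the two labels, at the cost of fitting one model per pair of labels and explicitly propagating the uncertainty of both log-fold change estimates.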
de.specific &lt;- pseudoBulkSpecific(summed.filt, label=summed.filt$celltype.mapped, design=~factor(pool) + tomato, coef=&quot;tomatoTRUE&quot;, condition=summed.filt$tomato ) cur.specific &lt;- de.specific[[&quot;Allantois&quot;]] cur.specific &lt;- cur.specific[order(cur.specific$PValue),] cur.specific ## DataFrame with 14699 rows and 6 columns ## logFC logCPM F PValue FDR ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## Slc22a18 -4.347153 4.04380 117.3889 1.92517e-10 9.26587e-07 ## Acta2 -0.829713 9.12472 55.6350 4.67332e-07 1.12463e-03 ## Mxd4 -1.421473 5.64606 50.2112 2.03567e-06 3.26589e-03 ## Rbp4 1.874290 4.35449 29.8731 1.53998e-05 1.85298e-02 ## Myl9 -0.985541 6.24833 30.6689 5.54072e-05 4.62274e-02 ## ... ... ... ... ... ... ## Ccl27a_ENSMUSG00000095247 NA NA NA NA NA ## CR974586.5 NA NA NA NA NA ## AC132444.6 NA NA NA NA NA ## Vmn2r122 NA NA NA NA NA ## CAAA01147332.1 NA NA NA NA NA ## OtherAverage ## &lt;numeric&gt; ## Slc22a18 NA ## Acta2 -0.0267428 ## Mxd4 -0.1565876 ## Rbp4 -0.1052237 ## Myl9 -0.1068453 ## ... ... ## Ccl27a_ENSMUSG00000095247 NA ## CR974586.5 NA ## AC132444.6 NA ## Vmn2r122 NA ## CAAA01147332.1 NA sizeFactors(summed.filt) &lt;- NULL plotExpression(logNormCounts(summed.filt), features=&quot;Rbp4&quot;, x=&quot;tomato&quot;, colour_by=&quot;tomato&quot;, other_fields=&quot;celltype.mapped&quot;) + facet_wrap(~celltype.mapped) Figure 14.8: Distribution of summed log-expression values for Rbp4 in each label of the chimeric embryo dataset. Each facet represents a label with distributions stratified by injection status. For greater control over the identification of label-specific DE, we can use the output of decideTestsPerLabel() to identify genes that are significant in our label of interest yet not DE in any other label. 
As hypothesis tests are not typically geared towards identifying genes that are not DE, we use an ad hoc approach where we consider a gene to be consistent with the null hypothesis for a label if it fails to be detected at a generous FDR threshold of 50%. We demonstrate this approach below by identifying injection-induced DE genes that are unique to the allantois. It is straightforward to tune the selection, e.g., to genes that are DE in no more than 90% of other labels by simply relaxing the threshold used to construct not.de.other, or to genes that are DE across multiple labels of interest but not in the rest, and so on. # Finding all genes that are not remotely DE in all other labels. remotely.de <- decideTestsPerLabel(de.results, threshold=0.5) not.de <- remotely.de==0 | is.na(remotely.de) not.de.other <- rowMeans(not.de[,colnames(not.de)!="Allantois"])==1 # Intersecting with genes that are DE in the allantois. unique.degs <- is.de[,"Allantois"]!=0 & not.de.other unique.degs <- names(which(unique.degs)) # Inspecting the results. de.allantois <- de.results$Allantois de.allantois <- de.allantois[unique.degs,] de.allantois <- de.allantois[order(de.allantois$PValue),] de.allantois ## DataFrame with 5 rows and 5 columns ## logFC logCPM F PValue FDR ## <numeric> <numeric> <numeric> <numeric> <numeric> ## Slc22a18 -4.347153 4.04380 117.3889 1.92517e-10 2.31647e-07 ## Rbp4 1.874290 4.35449 29.8731 1.53998e-05 3.36906e-03 ## Cfc1 -0.950562 5.74762 23.1430 7.68376e-05 1.23215e-02 ## H3f3b 0.321634 12.04012 21.2710 1.25666e-04 1.63468e-02 ## Cryab -0.995629 5.28422 19.5921 1.99204e-04 2.45838e-02 The main caveat is that differences in power between labels require some caution when interpreting label specificity. 
For example, Figure 14.9 shows that the top-ranked allantois-specific gene exhibits some evidence of DE in other labels but was not detected for various reasons like low abundance or insufficient replicates. A more correct but complex approach would be to fit an interaction model to the pseudo-bulk profiles for each pair of labels, where the interaction is between the coefficient of interest and the label identity; this is left as an exercise for the reader. plotExpression(logNormCounts(summed.filt), features="Slc22a18", x="tomato", colour_by="tomato", other_fields="celltype.mapped") + facet_wrap(~celltype.mapped) Figure 14.9: Distribution of summed log-expression values for Slc22a18 in each label of the chimeric embryo dataset. Each facet represents a label with distributions stratified by injection status. 14.4 Avoiding problems with ambient RNA 14.4.1 Motivation Ambient contamination is a phenomenon that is generally most pronounced in massively multiplexed scRNA-seq protocols. Briefly, extracellular RNA (most commonly released upon cell lysis) is captured along with each cell in its reaction chamber, contributing counts to genes that are not otherwise expressed in that cell (see Section 15.2). Differences in the ambient profile across samples are not uncommon when dealing with strong experimental perturbations, where high expression of a gene in a condition-specific cell type can “bleed over” into all other cell types in the same sample. This is problematic for DE analyses between conditions, as DEGs detected for a particular cell type may be driven by differences in the ambient profiles rather than any intrinsic change in gene regulation. To illustrate, we consider the Tal1-knockout (KO) chimera data from Pijuan-Sala et al. (2019). This is very similar to the WT chimera dataset we previously examined, only differing in that the Tal1 gene was knocked out in the injected cells. 
Tal1 is a transcription factor that has known roles in erythroid differentiation; the aim of the experiment was to determine if blocking of the erythroid lineage diverted cells to other developmental fates. (To cut a long story short: yes, it did.) library(MouseGastrulationData) sce.tal1 &lt;- Tal1ChimeraData() library(scuttle) rownames(sce.tal1) &lt;- uniquifyFeatureNames( rowData(sce.tal1)$ENSEMBL, rowData(sce.tal1)$SYMBOL ) sce.tal1 ## class: SingleCellExperiment ## dim: 29453 56122 ## metadata(0): ## assays(1): counts ## rownames(29453): Xkr4 Gm1992 ... CAAA01147332.1 tomato-td ## rowData names(2): ENSEMBL SYMBOL ## colnames(56122): cell_1 cell_2 ... cell_56121 cell_56122 ## colData names(9): cell barcode ... pool sizeFactor ## reducedDimNames(1): pca.corrected ## altExpNames(0): We will perform a DE analysis between WT and KO cells labelled as “neural crest”. We observe that the strongest DEGs are the hemoglobins, which are downregulated in the injected cells. This is rather surprising as these cells are distinct from the erythroid lineage and should not express hemoglobins at all. The most sober explanation is that the background samples contain more hemoglobin transcripts in the ambient solution due to leakage from erythrocytes (or their precursors) during sorting and dissociation. summed.tal1 &lt;- aggregateAcrossCells(sce.tal1, ids=DataFrame(sample=sce.tal1$sample, label=sce.tal1$celltype.mapped) ) summed.tal1$block &lt;- summed.tal1$sample %% 2 == 0 # Add blocking factor. # Subset to our neural crest cells. summed.neural &lt;- summed.tal1[,summed.tal1$label==&quot;Neural crest&quot;] summed.neural ## class: SingleCellExperiment ## dim: 29453 4 ## metadata(0): ## assays(1): counts ## rownames(29453): Xkr4 Gm1992 ... CAAA01147332.1 tomato-td ## rowData names(2): ENSEMBL SYMBOL ## colnames: NULL ## colData names(13): cell barcode ... ncells block ## reducedDimNames(1): pca.corrected ## altExpNames(0): # Standard edgeR analysis, as described above. 
res.neural &lt;- pseudoBulkDGE(summed.neural, label=summed.neural$label, design=~factor(block) + tomato, coef=&quot;tomatoTRUE&quot;, condition=summed.neural$tomato) summarizeTestsPerLabel(decideTestsPerLabel(res.neural)) ## -1 0 1 NA ## Neural crest 351 9818 481 18803 # Summary of the direction of log-fold changes. tab.neural &lt;- res.neural[[1]] tab.neural &lt;- tab.neural[order(tab.neural$PValue),] head(tab.neural, 10) ## DataFrame with 10 rows and 5 columns ## logFC logCPM F PValue FDR ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## Xist -7.555686 8.21232 6657.298 0.00000e+00 0.00000e+00 ## Hbb-bh1 -8.091042 9.15972 10758.256 0.00000e+00 0.00000e+00 ## Hbb-y -8.415622 8.35705 7364.290 0.00000e+00 0.00000e+00 ## Hba-x -7.724803 8.53284 7896.457 0.00000e+00 0.00000e+00 ## Hba-a1 -8.596706 6.74429 2756.573 0.00000e+00 0.00000e+00 ## Hba-a2 -8.866232 5.81300 1517.726 1.72378e-310 3.05972e-307 ## Erdr1 1.889536 7.61593 1407.112 2.34678e-289 3.57046e-286 ## Cdkn1c -8.864528 4.96097 814.936 8.79979e-173 1.17147e-169 ## Uba52 -0.879668 8.38618 424.191 1.86585e-92 2.20792e-89 ## Grb10 -1.403427 6.58314 401.353 1.13898e-87 1.21302e-84 As an aside, it is worth mentioning that the “replicates” in this study are more technical than biological, so some exaggeration of the significance of the effects is to be expected. Nonetheless, it is a useful dataset to demonstrate some strategies for mitigating issues caused by ambient contamination. 14.4.2 Finding affected DEGs 14.4.2.1 By estimating ambient contamination As shown above, the presence of ambient contamination makes it difficult to interpret multi-condition DE analyses. To mitigate its effects, we need to obtain an estimate of the ambient “expression” profile from the raw count matrix for each sample. We follow the approach used in emptyDrops() (Lun et al. 2019) and consider all barcodes with total counts below 100 to represent empty droplets. 
We then sum the counts for each gene across these barcodes to obtain an expression vector representing the ambient profile for each sample. library(DropletUtils) ambient <- vector("list", ncol(summed.neural)) # Looping over all raw (unfiltered) count matrices and # computing the ambient profile based on its low-count barcodes. # Turning off rounding, as we know this is count data. for (s in seq_along(ambient)) { raw.tal1 <- Tal1ChimeraData(type="raw", samples=s)[[1]] ambient[[s]] <- estimateAmbience(counts(raw.tal1), good.turing=FALSE, round=FALSE) } # Cleaning up the output for pretty printing. ambient <- do.call(cbind, ambient) colnames(ambient) <- seq_len(ncol(ambient)) rownames(ambient) <- uniquifyFeatureNames( rowData(raw.tal1)$ENSEMBL, rowData(raw.tal1)$SYMBOL ) head(ambient) ## 1 2 3 4 ## Xkr4 1 0 0 0 ## Gm1992 0 0 0 0 ## Gm37381 1 0 1 0 ## Rp1 0 1 0 1 ## Sox17 76 76 31 53 ## Gm37323 0 0 0 0 For each sample, we determine the maximum proportion of the count for each gene that could be attributed to ambient contamination. This is done by scaling the ambient profile in ambient to obtain a per-gene expected count from ambient contamination, with which we compute the p-value for observing a count equal to or lower than that in summed.neural. We perform this for a range of scaling factors and identify the largest factor that yields a p-value above a given threshold. The scaled ambient profile represents the upper bound of the contribution to each sample from ambient contamination. We deliberately use an upper bound so that our next step will aggressively remove any gene that is potentially problematic. 
max.ambient &lt;- maximumAmbience(counts(summed.neural), ambient, mode=&quot;proportion&quot;) head(max.ambient) ## [,1] [,2] [,3] [,4] ## Xkr4 NaN NaN NaN NaN ## Gm1992 NaN NaN NaN NaN ## Gm37381 NaN NaN NaN NaN ## Rp1 NaN NaN NaN NaN ## Sox17 0.1775 0.1833 0.468 1 ## Gm37323 NaN NaN NaN NaN Genes in which over 10% of the counts are ambient-derived (averaged across samples) are subsequently discarded from our analysis. For balanced designs, this threshold prevents ambient contribution from biasing the true fold-change by more than 10%, which is a tolerable margin of error for most applications. (Unbalanced designs may warrant the use of a weighted average to account for sample size differences between groups.) This approach yields a slightly smaller list of DEGs without the hemoglobins, which is encouraging as it suggests that any other (less obvious) effects of ambient contamination have also been removed. contamination &lt;- rowMeans(max.ambient, na.rm=TRUE) non.ambient &lt;- contamination &lt;= 0.1 summary(non.ambient) ## Mode FALSE TRUE NA&#39;s ## logical 1475 15306 12672 okay.genes &lt;- names(non.ambient)[which(non.ambient)] tab.neural2 &lt;- tab.neural[rownames(tab.neural) %in% okay.genes,] table(Direction=tab.neural2$logFC &gt; 0, Significant=tab.neural2$FDR &lt;= 0.05) ## Significant ## Direction FALSE TRUE ## FALSE 4820 317 ## TRUE 4781 452 head(tab.neural2, 10) ## DataFrame with 10 rows and 5 columns ## logFC logCPM F PValue FDR ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## Xist -7.555686 8.21232 6657.298 0.00000e+00 0.00000e+00 ## Erdr1 1.889536 7.61593 1407.112 2.34678e-289 3.57046e-286 ## Uba52 -0.879668 8.38618 424.191 1.86585e-92 2.20792e-89 ## Grb10 -1.403427 6.58314 401.353 1.13898e-87 1.21302e-84 ## Gt(ROSA)26Sor 1.481294 5.71617 351.940 2.80072e-77 2.71160e-74 ## Fdps 0.981388 7.21805 337.159 3.67655e-74 3.26294e-71 ## Mest 0.549349 10.98269 319.697 1.79833e-70 1.47324e-67 ## Impact 1.396666 5.71801 
314.700 2.05057e-69 1.55990e-66 ## H13 -1.481658 5.90902 301.675 1.17372e-66 8.33343e-64 ## Msmo1 1.493771 5.43923 301.066 1.57983e-66 1.05158e-63 A softer approach is to simply report the average contaminating percentage for each gene in the table of DE statistics. This allows readers to make up their own minds as to whether a particular DEG’s effect is driven by ambient contamination, in much the same way as described for low-quality cells in Section 6.6. Indeed, it is worth remembering that maximumAmbience() will report the maximum possible contamination rather than attempting to estimate the actual level of contamination, and filtering on the former may be overly conservative. This is especially true for cell populations that are contributing to the differences in the ambient pool; in the most extreme case, the reported maximum contamination would be 100% for cell types with an expression profile that is identical to the ambient pool. tab.neural3 <- tab.neural tab.neural3$contamination <- contamination[rownames(tab.neural3)] head(tab.neural3) ## DataFrame with 6 rows and 6 columns ## logFC logCPM F PValue FDR contamination ## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> ## Xist -7.55569 8.21232 6657.30 0.00000e+00 0.00000e+00 0.0605735 ## Hbb-bh1 -8.09104 9.15972 10758.26 0.00000e+00 0.00000e+00 0.9900717 ## Hbb-y -8.41562 8.35705 7364.29 0.00000e+00 0.00000e+00 0.9674483 ## Hba-x -7.72480 8.53284 7896.46 0.00000e+00 0.00000e+00 0.9945348 ## Hba-a1 -8.59671 6.74429 2756.57 0.00000e+00 0.00000e+00 0.8626846 ## Hba-a2 -8.86623 5.81300 1517.73 1.72378e-310 3.05972e-307 0.7351403 14.4.2.2 With prior knowledge Another strategy for estimating the ambient proportions involves the use of prior knowledge of mutually exclusive gene expression profiles (Young and Behjati 2018). 
In this case, we assume (reasonably) that hemoglobins should not be expressed in neural crest cells and use this to estimate the contamination in each sample. This is achieved with the controlAmbience() function, which scales the ambient profile so that the hemoglobin coverage is the same as the corresponding sample of summed.neural. From these profiles, we compute proportions of ambient contamination that are used to mark or filter out affected genes in the same manner as described above. is.hbb &lt;- grep(&quot;^Hb[ab]-&quot;, rownames(summed.neural)) alt.prop &lt;- controlAmbience(counts(summed.neural), ambient, features=is.hbb, mode=&quot;proportion&quot;) head(alt.prop) ## 1 2 3 4 ## Xkr4 NaN NaN NaN NaN ## Gm1992 NaN NaN NaN NaN ## Gm37381 NaN NaN NaN NaN ## Rp1 NaN NaN NaN NaN ## Sox17 0.06774 0.08798 0.4796 1 ## Gm37323 NaN NaN NaN NaN alt.non.ambient &lt;- rowMeans(alt.prop, na.rm=TRUE) &lt;= 0.1 summary(alt.non.ambient) ## Mode FALSE TRUE NA&#39;s ## logical 1388 15393 12672 okay.genes &lt;- names(alt.non.ambient)[which(alt.non.ambient)] tab.neural4 &lt;- tab.neural[rownames(tab.neural) %in% okay.genes,] head(tab.neural4) ## DataFrame with 6 rows and 5 columns ## logFC logCPM F PValue FDR ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## Xist -7.555686 8.21232 6657.298 0.00000e+00 0.00000e+00 ## Erdr1 1.889536 7.61593 1407.112 2.34678e-289 3.57046e-286 ## Uba52 -0.879668 8.38618 424.191 1.86585e-92 2.20792e-89 ## Grb10 -1.403427 6.58314 401.353 1.13898e-87 1.21302e-84 ## Gt(ROSA)26Sor 1.481294 5.71617 351.940 2.80072e-77 2.71160e-74 ## Fdps 0.981388 7.21805 337.159 3.67655e-74 3.26294e-71 Any highly expressed cell type-specific gene is a candidate for this procedure, most typically in cell types that are highly specialized towards manufacturing a protein product. Aside from hemoglobin, we could use immunoglobulins in populations containing B cells, or insulin and glucagon in pancreas datasets (Figure 11.6). 
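The scaling logic behind controlAmbience() can be illustrated with a toy base-R calculation. The gene names, counts, and ambient profile below are made up for illustration only, and the real function handles edge cases (such as zero counts) more carefully:

```r
# Toy data: 3 genes x 2 samples, plus a hypothetical ambient profile.
counts <- matrix(c(100, 50,
                   900, 400,
                    10, 20), nrow=3, byrow=TRUE,
    dimnames=list(c("GeneA", "Hbb-like", "GeneC"), c("s1", "s2")))
ambient <- c(GeneA=1, `Hbb-like`=50, GeneC=2)

# Treat the hemoglobin-like gene as a control feature whose counts
# should be entirely derived from ambient contamination.
is.control <- rownames(counts) == "Hbb-like"

# Scale the ambient profile so that its control-feature coverage
# matches the observed control counts in each sample.
scaling <- colSums(counts[is.control, , drop=FALSE]) / sum(ambient[is.control])
scaled.ambient <- outer(ambient, scaling)

# Estimated proportion of each gene's counts that is ambient-derived,
# capped at 1 as in the proportions shown in the outputs above.
props <- pmin(scaled.ambient / counts, 1)
```

By construction, the control features themselves end up with an estimated proportion of 1, mirroring the behavior seen for the hemoglobins in the real data.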
The experimental setting may also provide some genes that must only be present in the ambient solution; for example, the mitochondrial transcripts can be used to estimate ambient contamination in single-nucleus RNA-seq, while Xist can be used for datasets involving mixtures of male and female cells (where the contaminating percentages are estimated from the profiles of male cells only). If appropriate control features are available, this approach allows us to obtain a more accurate estimate of the contamination in each pseudo-bulk sample compared to the upper bound provided by maximumAmbience(). This avoids the removal of genuine DEGs due to overestimation of the ambient contamination from the latter. However, the performance of this approach is fully dependent on the suitability of the control features - if a “control” feature is genuinely expressed in a cell type, the ambient contribution will be overestimated. A simple mitigating strategy is to take the lower of the proportions from controlAmbience() and maximumAmbience(), with the idea being that the latter will avoid egregious overestimation when the control set is misspecified. 14.4.2.3 Without an ambient profile An estimate of the ambient profile is rarely available for public datasets where only the per-cell count matrices are provided. In such cases, we must instead use the rest of the dataset to infer something about the effects of ambient contamination. The most obvious approach is to construct a proxy ambient profile by summing the counts for all cells from each sample, which can be used in place of the actual profile in the previous calculations. This assumes equal contributions from all labels to the ambient pool, which is not entirely unrealistic (Figure 14.10), though some discrepancies can be expected due to the presence of particularly fragile cell types or extracellular RNA. 
proxy.ambient &lt;- aggregateAcrossCells(summed.tal1, ids=summed.tal1$sample) par(mfrow=c(2,2)) for (i in seq_len(ncol(proxy.ambient))) { true &lt;- ambient[,i] proxy &lt;- assay(proxy.ambient)[,i] logged &lt;- edgeR::cpm(cbind(proxy, true), log=TRUE, prior.count=2) logFC &lt;- logged[,1] - logged[,2] abundance &lt;- rowMeans(logged) plot(abundance, logFC, main=paste(&quot;Sample&quot;, i)) } Figure 14.10: MA plots of the log-fold change of the proxy ambient profile over the real profile for each sample in the Tal1 chimera dataset. We can also mitigate the effect of ambient contamination by focusing on label-specific DEGs. Contamination-driven DEGs should be systematically present in comparisons for all labels, and thus can be eliminated by simply ignoring all genes that are significant in a majority of these comparisons (Section 14.3.3.2). The obvious drawback of this approach is that it discounts genuine DEGs that have a consistent effect in most/all labels, though one could perhaps argue that such “global” DEGs are not the main features of interest in label-specific analyses. It is also complicated by fluctuations in detection power across comparisons involving different numbers of cells - or replicates, after filtering pseudo-bulk profiles by the number of cells. res.tal1 &lt;- pseudoBulkDGE(summed.tal1, label=summed.tal1$label, design=~factor(block) + tomato, coef=&quot;tomatoTRUE&quot;, condition=summed.tal1$tomato) # DE in the same direction across most labels. tal1.de &lt;- decideTestsPerLabel(res.tal1) up.de &lt;- rowMeans(tal1.de &gt; 0 &amp; !is.na(tal1.de)) down.de &lt;- rowMeans(tal1.de &lt; 0 &amp; !is.na(tal1.de)) # Inspecting our neural crest results again. 
tab.neural.again &lt;- res.tal1[[&quot;Neural crest&quot;]] tab.neural.again$OtherUp &lt;- up.de tab.neural.again$OtherDown &lt;- down.de head(tab.neural.again[order(tab.neural.again$PValue),], 10) ## DataFrame with 10 rows and 7 columns ## logFC logCPM F PValue FDR OtherUp ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## Xist -7.555686 8.21232 6657.298 0.00000e+00 0.00000e+00 0.000000 ## Hbb-bh1 -8.091042 9.15972 10758.256 0.00000e+00 0.00000e+00 0.000000 ## Hbb-y -8.415622 8.35705 7364.290 0.00000e+00 0.00000e+00 0.000000 ## Hba-x -7.724803 8.53284 7896.457 0.00000e+00 0.00000e+00 0.000000 ## Hba-a1 -8.596706 6.74429 2756.573 0.00000e+00 0.00000e+00 0.000000 ## Hba-a2 -8.866232 5.81300 1517.726 1.72378e-310 3.05972e-307 0.000000 ## Erdr1 1.889536 7.61593 1407.112 2.34678e-289 3.57046e-286 0.444444 ## Cdkn1c -8.864528 4.96097 814.936 8.79979e-173 1.17147e-169 0.000000 ## Uba52 -0.879668 8.38618 424.191 1.86585e-92 2.20792e-89 0.000000 ## Grb10 -1.403427 6.58314 401.353 1.13898e-87 1.21302e-84 0.000000 ## OtherDown ## &lt;numeric&gt; ## Xist 0.851852 ## Hbb-bh1 0.925926 ## Hbb-y 0.851852 ## Hba-x 0.814815 ## Hba-a1 0.777778 ## Hba-a2 0.777778 ## Erdr1 0.000000 ## Cdkn1c 0.814815 ## Uba52 0.814815 ## Grb10 0.777778 # Actually removing those with a DE majority across labels. 
ignore &lt;- up.de &gt; 0.5 | down.de &gt; 0.5 tab.neural.again &lt;- tab.neural.again[!ignore,] head(tab.neural.again[order(tab.neural.again$PValue),], 10) ## DataFrame with 10 rows and 7 columns ## logFC logCPM F PValue FDR OtherUp ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## Erdr1 1.889536 7.61593 1407.1122 2.34678e-289 3.57046e-286 0.444444 ## Fdps 0.981388 7.21805 337.1586 3.67655e-74 3.26294e-71 0.222222 ## Msmo1 1.493771 5.43923 301.0658 1.57983e-66 1.05158e-63 0.222222 ## Hmgcs1 1.250024 5.70837 252.1105 3.95670e-56 2.21783e-53 0.222222 ## Idi1 1.173709 5.37688 180.8890 6.68049e-41 2.73643e-38 0.148148 ## Ddit4 0.844702 5.59699 109.5505 1.63456e-25 3.95637e-23 0.444444 ## Scd2 0.798049 5.70852 103.7377 2.98274e-24 7.05915e-22 0.259259 ## Sox9 0.537460 7.17373 99.3336 2.69822e-23 5.74720e-21 0.148148 ## Nkd1 0.719043 5.92690 93.9636 3.96635e-22 8.28267e-20 0.185185 ## Fdft1 0.841061 5.32293 89.8826 3.06448e-21 5.93394e-19 0.185185 ## OtherDown ## &lt;numeric&gt; ## Erdr1 0.0000000 ## Fdps 0.1851852 ## Msmo1 0.1851852 ## Hmgcs1 0.1111111 ## Idi1 0.1111111 ## Ddit4 0.0000000 ## Scd2 0.0740741 ## Sox9 0.0740741 ## Nkd1 0.0740741 ## Fdft1 0.1111111 The common theme here is that, in the absence of an ambient profile, we are using all labels as a proxy for the ambient effect. This can have unpredictable consequences as the results for each label are now dependent on the behavior of the entire dataset. For example, the metrics are susceptible to the idiosyncrasies of clustering, where one cell type may be represented in multiple related clusters that distort the percentages in up.de and down.de or the average log-fold change. The metrics may also be invalidated in analyses of a subset of the data - for example, a subclustering analysis focusing on a particular cell type may mark all relevant DEGs as problematic because they are consistently DE in all subtypes. 
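To make the mechanics of this majority-vote filter concrete, here is a toy example in base R on a decideTestsPerLabel()-style matrix; the gene names and DE calls are invented for illustration:

```r
# Rows are genes, columns are labels; entries are 1 (up), -1 (down),
# 0 (not significant) or NA (filtered out), as in decideTestsPerLabel().
de.codes <- rbind(
    AmbientGene  = c(-1, -1, -1, -1,  0),  # down in most labels
    SpecificGene = c( 1,  0,  0, NA,  0),  # up in one label only
    GlobalGene   = c( 1,  1,  1,  1,  1))  # genuinely DE everywhere

# Fraction of labels in which each gene is DE in each direction.
up.frac   <- rowMeans(de.codes > 0 & !is.na(de.codes))
down.frac <- rowMeans(de.codes < 0 & !is.na(de.codes))

# Discard genes that are DE in the same direction in a majority of labels.
ignore <- up.frac > 0.5 | down.frac > 0.5
```

Note that the genuinely global DEG is discarded along with the ambient-driven one, which is exactly the drawback discussed above.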
14.4.3 Subtracting ambient counts It is worth commenting on the seductive idea of subtracting the ambient counts from the pseudo-bulk samples. This may seem like the most obvious approach for removing ambient contamination, but unfortunately, subtracted counts have unpredictable statistical properties due to the distortion of the mean-variance relationship. Minor relative fluctuations at very large counts become large fold-changes after subtraction, manifesting as spurious DE in genes where a substantial proportion of counts is derived from the ambient solution. For example, several hemoglobin genes retain strong DE even after subtraction of the scaled ambient profile. scaled.ambient &lt;- controlAmbience(counts(summed.neural), ambient, features=is.hbb, mode=&quot;profile&quot;) subtracted &lt;- counts(summed.neural) - scaled.ambient subtracted &lt;- round(subtracted) subtracted[subtracted &lt; 0] &lt;- 0 subtracted[is.hbb,] ## [,1] [,2] [,3] [,4] ## Hbb-bt 0 0 7 18 ## Hbb-bs 1 2 31 42 ## Hbb-bh2 0 0 0 0 ## Hbb-bh1 2 0 0 0 ## Hbb-y 0 0 39 107 ## Hba-x 1 1 0 0 ## Hba-a1 0 0 365 452 ## Hba-a2 0 0 314 329 Another tempting approach is to use interaction models to implicitly subtract the ambient effect during GLM fitting. The assumption is that, for a genuine DEG, the log-fold change within cells is larger in magnitude than that in the ambient solution. This is based on the expectation that any DE in the latter is “diluted” by contributions from cell types where that gene is not DE. Unfortunately, this is not always the case; a DE analysis of the ambient counts indicates that the hemoglobin log-fold change is actually stronger in the neural crest cells compared to the ambient solution, which leads to the rather awkward conclusion that the WT neural crest cells are expressing hemoglobin beyond that explained by ambient contamination. 
y.ambient &lt;- DGEList(ambient, samples=colData(summed.neural)) y.ambient &lt;- y.ambient[filterByExpr(y.ambient, group=y.ambient$samples$tomato),] y.ambient &lt;- calcNormFactors(y.ambient) design &lt;- model.matrix(~factor(block) + tomato, y.ambient$samples) y.ambient &lt;- estimateDisp(y.ambient, design) fit.ambient &lt;- glmQLFit(y.ambient, design, robust=TRUE) res.ambient &lt;- glmQLFTest(fit.ambient, coef=ncol(design)) summary(decideTests(res.ambient)) ## tomatoTRUE ## Down 1910 ## NotSig 7683 ## Up 1645 topTags(res.ambient, n=10) ## Coefficient: tomatoTRUE ## logFC logCPM F PValue FDR ## Hbb-y -5.267 12.803 15115 3.523e-81 3.959e-77 ## Hbb-bh1 -5.075 13.725 14002 8.892e-80 4.996e-76 ## Hba-x -4.827 13.122 13317 3.135e-79 1.175e-75 ## Hba-a1 -4.662 10.734 11095 1.146e-76 3.220e-73 ## Hba-a2 -4.521 9.480 8411 1.246e-72 2.800e-69 ## Blvrb -4.319 7.649 4129 3.066e-62 5.742e-59 ## Xist -4.376 7.484 3891 1.864e-61 2.993e-58 ## Gypa -5.138 7.213 3808 3.833e-61 5.384e-58 ## Hbb-bs -4.941 7.209 3604 3.728e-60 4.655e-57 ## Car2 -3.499 8.534 4448 5.589e-60 6.281e-57 (One possible explanation is that erythrocyte fragments are present in the cell-containing libraries but are not used to estimate the ambient profile, presumably because the UMI counts are too high for fragment-containing libraries to be treated as empty. Technically speaking, this is not incorrect as, after all, those libraries are not actually empty (Section 15.2). In effect, every cell in the WT sample is a fractional multiplet with partial erythrocyte identity from the included fragments, which results in stronger log-fold changes between genotypes for hemoglobin compared to those for the ambient solution.) That aside, there are other issues with implicit subtraction in the fitted GLM that warrant caution with its use. This strategy precludes detection of DEGs that are common to all cell types as there is no longer a dilution effect being applied to the log-fold change in the ambient solution. 
It requires inclusion of the ambient profiles in the model, which is cause for at least some concern as they are unlikely to have the same degree of variability as the cell-derived pseudo-bulk profiles. Interpretation is also complicated by the fact that we are only interested in log-fold changes that are more extreme in the cells compared to the ambient solution; a non-zero interaction term is not sufficient for removing spurious DE. 14.5 Differential abundance between conditions 14.5.1 Overview In a DA analysis, we test for significant changes in per-label cell abundance across conditions. This will reveal which cell types are depleted or enriched upon treatment, which is arguably just as interesting as changes in expression within each cell type. The DA analysis has a long history in flow cytometry (Finak et al. 2014; A. T. L. Lun, Richard, and Marioni 2017) where it is routinely used to examine the effects of different conditions on the composition of complex cell populations. By performing it here, we effectively treat scRNA-seq as a “super-FACS” technology for defining relevant subpopulations using the entire transcriptome. We prepare for the DA analysis by quantifying the number of cells assigned to each label (or cluster) in our WT chimeric experiment. In this case, we are aiming to identify labels that change in abundance among the compartment of injected cells compared to the background. abundances &lt;- table(merged$celltype.mapped, merged$sample) abundances &lt;- unclass(abundances) head(abundances) ## ## 5 6 7 8 9 10 ## Allantois 97 15 139 127 318 259 ## Blood progenitors 1 6 3 16 6 8 17 ## Blood progenitors 2 31 8 28 21 43 114 ## Cardiomyocytes 85 21 79 31 174 211 ## Caudal Mesoderm 10 10 9 3 10 29 ## Caudal epiblast 2 2 0 0 22 45 14.5.2 Performing the DA analysis Our DA analysis will again be performed with the edgeR package. 
This allows us to take advantage of the NB GLM methods to model overdispersed count data in the presence of limited replication - except that the counts are not of reads per gene, but of cells per label (A. T. L. Lun, Richard, and Marioni 2017). The aim is to share information across labels to improve our estimates of the biological variability in cell abundance between replicates. # Attaching some column metadata. extra.info &lt;- colData(merged)[match(colnames(abundances), merged$sample),] y.ab &lt;- DGEList(abundances, samples=extra.info) y.ab ## An object of class &quot;DGEList&quot; ## $counts ## ## 5 6 7 8 9 10 ## Allantois 97 15 139 127 318 259 ## Blood progenitors 1 6 3 16 6 8 17 ## Blood progenitors 2 31 8 28 21 43 114 ## Cardiomyocytes 85 21 79 31 174 211 ## Caudal Mesoderm 10 10 9 3 10 29 ## 29 more rows ... ## ## $samples ## group lib.size norm.factors batch cell barcode sample stage ## 5 1 2298 1 5 cell_9769 AAACCTGAGACTGTAA 5 E8.5 ## 6 1 1026 1 6 cell_12180 AAACCTGCAGATGGCA 6 E8.5 ## 7 1 2740 1 7 cell_13227 AAACCTGAGACAAGCC 7 E8.5 ## 8 1 2904 1 8 cell_16234 AAACCTGCAAACCCAT 8 E8.5 ## 9 1 4057 1 9 cell_19332 AAACCTGCAACGATCT 9 E8.5 ## 10 1 6401 1 10 cell_23875 AAACCTGAGGCATGTG 10 E8.5 ## tomato pool stage.mapped celltype.mapped closest.cell ## 5 TRUE 3 E8.25 Mesenchyme cell_24159 ## 6 FALSE 3 E8.25 Somitic mesoderm cell_63247 ## 7 TRUE 4 E8.5 Somitic mesoderm cell_25454 ## 8 FALSE 4 E8.25 ExE mesoderm cell_139075 ## 9 TRUE 5 E8.0 ExE mesoderm cell_116116 ## 10 FALSE 5 E8.5 Forebrain/Midbrain/Hindbrain cell_39343 ## doub.density sizeFactor label ## 5 0.029850 1.6349 19 ## 6 0.291916 2.5981 6 ## 7 0.601740 1.5939 17 ## 8 0.004733 0.8707 9 ## 9 0.079415 0.8933 15 ## 10 0.040747 0.3947 1 We filter out low-abundance labels as previously described. This avoids cluttering the result table with very rare subpopulations that contain only a handful of cells. 
For a DA analysis of cluster abundances, filtering is generally not required as most clusters will not be of low-abundance (otherwise there would not have been enough evidence to define the cluster in the first place). keep &lt;- filterByExpr(y.ab, group=y.ab$samples$tomato) y.ab &lt;- y.ab[keep,] summary(keep) ## Mode FALSE TRUE ## logical 10 24 Unlike DE analyses, we do not perform an additional normalization step with calcNormFactors(). This means that we are only normalizing based on the “library size”, i.e., the total number of cells in each sample. Any changes we detect between conditions will subsequently represent differences in the proportion of cells in each cluster. The motivation behind this decision is discussed in more detail in Section 14.5.3. We formulate the design matrix with a blocking factor for the batch of origin for each sample and an additive term for the td-Tomato status (i.e., injection effect). Here, the log-fold change in our model refers to the change in cell abundance after injection, rather than the change in gene expression. design &lt;- model.matrix(~factor(pool) + factor(tomato), y.ab$samples) We use the estimateDisp() function to estimate the NB dispersion for each cluster (Figure 14.11). We turn off the trend as we do not have enough points for its stable estimation. y.ab &lt;- estimateDisp(y.ab, design, trend=&quot;none&quot;) summary(y.ab$common.dispersion) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.0614 0.0614 0.0614 0.0614 0.0614 0.0614 plotBCV(y.ab, cex=1) Figure 14.11: Biological coefficient of variation (BCV) for each label with respect to its average abundance. BCVs are defined as the square root of the NB dispersion. Common dispersion estimates are shown in red. We repeat this process with the QL dispersion, again disabling the trend (Figure 14.12). fit.ab &lt;- glmQLFit(y.ab, design, robust=TRUE, abundance.trend=FALSE) summary(fit.ab$var.prior) ## Min. 1st Qu. Median Mean 3rd Qu. Max. 
## 1.25 1.25 1.25 1.25 1.25 1.25 summary(fit.ab$df.prior) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## Inf Inf Inf Inf Inf Inf plotQLDisp(fit.ab, cex=1) Figure 14.12: QL dispersion estimates for each label with respect to its average abundance. Quarter-root values of the raw estimates are shown in black while the shrunken estimates are shown in red. Shrinkage is performed towards the common dispersion in blue. We test for differences in abundance between td-Tomato-positive and negative samples using glmQLFTest(). We see that extra-embryonic ectoderm is strongly depleted in the injected cells. This is consistent with the expectation that cells injected into the blastocyst should not contribute to extra-embryonic tissue. The injected cells also contribute more to the mesenchyme, which may also be of interest. res &lt;- glmQLFTest(fit.ab, coef=ncol(design)) summary(decideTests(res)) ## factor(tomato)TRUE ## Down 1 ## NotSig 22 ## Up 1 topTags(res) ## Coefficient: factor(tomato)TRUE ## logFC logCPM F PValue FDR ## ExE ectoderm -6.5663 13.02 66.267 1.352e-10 3.245e-09 ## Mesenchyme 1.1652 16.29 11.291 1.535e-03 1.841e-02 ## Allantois 0.8345 15.51 5.312 2.555e-02 1.621e-01 ## Cardiomyocytes 0.8484 14.86 5.204 2.701e-02 1.621e-01 ## Neural crest -0.7706 14.76 4.106 4.830e-02 2.149e-01 ## Endothelium 0.7519 14.29 3.912 5.371e-02 2.149e-01 ## Erythroid3 -0.6431 17.28 3.604 6.367e-02 2.183e-01 ## Haematoendothelial progenitors 0.6581 14.72 3.124 8.351e-02 2.505e-01 ## ExE mesoderm 0.3805 15.68 1.181 2.827e-01 6.258e-01 ## Pharyngeal mesoderm 0.3793 15.72 1.169 2.850e-01 6.258e-01 14.5.3 Handling composition effects 14.5.3.1 Background As mentioned above, we do not use calcNormFactors() in our default DA analysis. This normalization step assumes that most of the input features are not different between conditions. 
While this assumption is reasonable for most types of gene expression data, it is generally too strong for cell type abundance - most experiments consist of only a few cell types that may all change in abundance upon perturbation. Thus, our default approach is to only normalize based on the total number of cells in each sample, which means that we are effectively testing for differential proportions between conditions. Unfortunately, the use of the total number of cells leaves us susceptible to composition effects. For example, a large increase in abundance for one cell subpopulation will introduce decreases in proportion for all other subpopulations - which is technically correct, but may be misleading if one concludes that those other subpopulations are decreasing in abundance of their own volition. If composition biases are proving problematic for interpretation of DA results, we have several avenues for removing them or mitigating their impact by leveraging a priori biological knowledge. 14.5.3.2 Assuming most labels do not change If it is possible to assume that most labels (i.e., cell types) do not change in abundance, we can use calcNormFactors() to compute normalization factors. This seems to be a fairly reasonable assumption for the WT chimeras where the injection is expected to have only a modest effect at most. y.ab2 &lt;- calcNormFactors(y.ab) y.ab2$samples$norm.factors ## [1] 1.0055 1.0833 1.1658 0.7614 1.0616 0.9743 We then proceed with the remainder of the edgeR analysis, shown below in condensed format. Many of the positive log-fold changes are shifted towards zero, consistent with the removal of composition biases from the presence of extra-embryonic ectoderm in only background cells. In particular, the mesenchyme is no longer significantly DA after injection. 
y.ab2 &lt;- estimateDisp(y.ab2, design, trend=&quot;none&quot;) fit.ab2 &lt;- glmQLFit(y.ab2, design, robust=TRUE, abundance.trend=FALSE) res2 &lt;- glmQLFTest(fit.ab2, coef=ncol(design)) topTags(res2, n=10) ## Coefficient: factor(tomato)TRUE ## logFC logCPM F PValue FDR ## ExE ectoderm -6.9215 13.17 70.364 5.738e-11 1.377e-09 ## Mesenchyme 0.9513 16.27 6.787 1.219e-02 1.143e-01 ## Neural crest -1.0032 14.78 6.464 1.429e-02 1.143e-01 ## Erythroid3 -0.8504 17.35 5.517 2.299e-02 1.380e-01 ## Cardiomyocytes 0.6400 14.84 2.735 1.047e-01 4.809e-01 ## Allantois 0.6054 15.51 2.503 1.202e-01 4.809e-01 ## Forebrain/Midbrain/Hindbrain -0.4943 16.55 1.928 1.713e-01 5.178e-01 ## Endothelium 0.5482 14.27 1.917 1.726e-01 5.178e-01 ## Erythroid2 -0.4818 16.00 1.677 2.015e-01 5.373e-01 ## Haematoendothelial progenitors 0.4262 14.73 1.185 2.818e-01 6.240e-01 14.5.3.3 Removing the offending labels Another approach is to repeat the analysis after removing DA clusters containing many cells. This provides a clearer picture of the changes in abundance among the remaining clusters. Here, we remove the extra-embryonic ectoderm and reset the total number of cells for all samples with keep.lib.sizes=FALSE. 
offenders &lt;- &quot;ExE ectoderm&quot; y.ab3 &lt;- y.ab[setdiff(rownames(y.ab), offenders),, keep.lib.sizes=FALSE] y.ab3$samples ## group lib.size norm.factors batch cell barcode sample stage ## 5 1 2268 1 5 cell_9769 AAACCTGAGACTGTAA 5 E8.5 ## 6 1 993 1 6 cell_12180 AAACCTGCAGATGGCA 6 E8.5 ## 7 1 2708 1 7 cell_13227 AAACCTGAGACAAGCC 7 E8.5 ## 8 1 2749 1 8 cell_16234 AAACCTGCAAACCCAT 8 E8.5 ## 9 1 4009 1 9 cell_19332 AAACCTGCAACGATCT 9 E8.5 ## 10 1 6224 1 10 cell_23875 AAACCTGAGGCATGTG 10 E8.5 ## tomato pool stage.mapped celltype.mapped closest.cell ## 5 TRUE 3 E8.25 Mesenchyme cell_24159 ## 6 FALSE 3 E8.25 Somitic mesoderm cell_63247 ## 7 TRUE 4 E8.5 Somitic mesoderm cell_25454 ## 8 FALSE 4 E8.25 ExE mesoderm cell_139075 ## 9 TRUE 5 E8.0 ExE mesoderm cell_116116 ## 10 FALSE 5 E8.5 Forebrain/Midbrain/Hindbrain cell_39343 ## doub.density sizeFactor label ## 5 0.029850 1.6349 19 ## 6 0.291916 2.5981 6 ## 7 0.601740 1.5939 17 ## 8 0.004733 0.8707 9 ## 9 0.079415 0.8933 15 ## 10 0.040747 0.3947 1 y.ab3 &lt;- estimateDisp(y.ab3, design, trend=&quot;none&quot;) fit.ab3 &lt;- glmQLFit(y.ab3, design, robust=TRUE, abundance.trend=FALSE) res3 &lt;- glmQLFTest(fit.ab3, coef=ncol(design)) topTags(res3, n=10) ## Coefficient: factor(tomato)TRUE ## logFC logCPM F PValue FDR ## Mesenchyme 1.1274 16.32 11.501 0.001438 0.03308 ## Allantois 0.7950 15.54 5.231 0.026836 0.18284 ## Cardiomyocytes 0.8104 14.90 5.152 0.027956 0.18284 ## Neural crest -0.8085 14.80 4.903 0.031798 0.18284 ## Erythroid3 -0.6808 17.32 4.387 0.041743 0.19202 ## Endothelium 0.7151 14.32 3.830 0.056443 0.21636 ## Haematoendothelial progenitors 0.6189 14.76 2.993 0.090338 0.29683 ## Def. endoderm 0.4911 12.43 1.084 0.303347 0.67818 ## ExE mesoderm 0.3419 15.71 1.036 0.314058 0.67818 ## Pharyngeal mesoderm 0.3407 15.76 1.025 0.316623 0.67818 A similar strategy can be used to focus on proportional changes within a single subpopulation of a very heterogeneous data set. 
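The effect of using within-subpopulation totals as the library size can be seen in a toy base-R calculation; the subtype names and counts below are invented for illustration:

```r
# Invented counts of three subtypes within a compartment of interest,
# plus the number of all other cells in each sample.
sub.counts <- matrix(c(200, 180,
                        50,  80,
                        30,  40), nrow=3, byrow=TRUE,
    dimnames=list(c("subtype1", "subtype2", "subtype3"), c("s1", "s2")))
other.cells <- c(s1=1000, s2=500)

# Proportions within the compartment: unaffected by 'other.cells'.
within.prop <- sweep(sub.counts, 2, colSums(sub.counts), "/")

# Proportions of the whole sample: a shift in 'other.cells' alone
# would change these, i.e., a composition effect.
overall.prop <- sweep(sub.counts, 2, colSums(sub.counts) + other.cells, "/")
```

Using the within-compartment totals (here, colSums(sub.counts)) as the library sizes means that the test only responds to changes in the first set of proportions.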
For example, if we collected a whole blood data set, we could subset to T cells and test for changes in T cell subtypes (memory, killer, regulatory, etc.) using the total number of T cells in each sample as the library size. This avoids detecting changes in T cell subsets that are driven by compositional effects from changes in abundance of, say, B cells in the same sample. 14.5.3.4 Testing against a log-fold change threshold Here, we assume that composition bias introduces a spurious log2-fold change of no more than \\(\\tau\\) for a non-DA label. This can be roughly interpreted as the maximum log-fold change in the total number of cells caused by DA in other labels. (By comparison, fold-differences in the totals due to differences in capture efficiency or the size of the original cell population are not attributable to composition bias and should not be considered when choosing \\(\\tau\\).) We then mitigate the effect of composition biases by testing each label for changes in abundance beyond \\(\\tau\\) (McCarthy and Smyth 2009; A. T. L. Lun, Richard, and Marioni 2017). res.lfc &lt;- glmTreat(fit.ab, coef=ncol(design), lfc=1) summary(decideTests(res.lfc)) ## factor(tomato)TRUE ## Down 1 ## NotSig 23 ## Up 0 topTags(res.lfc) ## Coefficient: factor(tomato)TRUE ## logFC unshrunk.logFC logCPM PValue ## ExE ectoderm -6.5663 -7.0015 13.02 2.626e-09 ## Mesenchyme 1.1652 1.1658 16.29 1.323e-01 ## Cardiomyocytes 0.8484 0.8498 14.86 3.796e-01 ## Allantois 0.8345 0.8354 15.51 3.975e-01 ## Neural crest -0.7706 -0.7719 14.76 4.501e-01 ## Endothelium 0.7519 0.7536 14.29 4.665e-01 ## Haematoendothelial progenitors 0.6581 0.6591 14.72 5.622e-01 ## Def. 
endoderm 0.5262 0.5311 12.40 5.934e-01 ## Erythroid3 -0.6431 -0.6432 17.28 6.118e-01 ## Caudal Mesoderm -0.3996 -0.4036 12.09 6.827e-01 ## FDR ## ExE ectoderm 6.303e-08 ## Mesenchyme 9.950e-01 ## Cardiomyocytes 9.950e-01 ## Allantois 9.950e-01 ## Neural crest 9.950e-01 ## Endothelium 9.950e-01 ## Haematoendothelial progenitors 9.950e-01 ## Def. endoderm 9.950e-01 ## Erythroid3 9.950e-01 ## Caudal Mesoderm 9.950e-01 The choice of \\(\\tau\\) can be loosely motivated by external experimental data. For example, if we observe a doubling of cell numbers in an in vitro system after treatment, we might be inclined to set \\(\\tau=1\\). This ensures that any non-DA subpopulation is not reported as being depleted after treatment. Some caution is still required, though - even if the external numbers are accurate, we need to assume that cell capture efficiency is (on average) equal between conditions to justify their use as \\(\\tau\\). And obviously, the use of a non-zero \\(\\tau\\) will reduce power to detect real changes when the composition bias is not present. 14.6 Comments on interpretation 14.6.1 DE or DA? Two sides of the same coin While useful, the distinction between DA and DE analyses is inherently artificial for scRNA-seq data. This is because the labels used in the former are defined based on the genes to be tested in the latter. To illustrate, consider a scRNA-seq experiment involving two biological conditions with several shared cell types. We focus on a cell type \\(X\\) that is present in both conditions but contains some DEGs between conditions. This leads to two possible outcomes: The DE between conditions causes \\(X\\) to form two separate clusters (say, \\(X_1\\) and \\(X_2\\)) in expression space. This manifests as DA where \\(X_1\\) is enriched in one condition and \\(X_2\\) is enriched in the other condition. 
The DE between conditions is not sufficient to split \\(X\\) into two separate clusters, e.g., because the data integration procedure identifies them as corresponding cell types and merges them together. This means that the differences between conditions manifest as DE within the single cluster corresponding to \\(X\\). We have described the example above in terms of clustering, but the same arguments apply for any labelling strategy based on the expression profiles, e.g., automated cell type assignment (Chapter 12). Moreover, the choice between outcomes 1 and 2 is made implicitly by the combined effect of the data merging, clustering and label assignment procedures. For example, differences between conditions are more likely to manifest as DE for coarser clusters and as DA for finer clusters, but this is difficult to predict reliably. The moral of the story is that DA and DE analyses are simply two different perspectives on the same phenomena. For any comprehensive characterization of differences between populations, it is usually necessary to consider both analyses. Indeed, they complement each other almost by definition, e.g., clustering parameters that reduce DE will increase DA and vice versa. 14.6.2 Sacrificing biology by integration Earlier in this chapter, we defined clusters from corrected values after applying fastMNN() to cells from all samples in the chimera dataset. Alert readers may realize that this would result in the removal of biological differences between our conditions. Any systematic difference in expression caused by injection would be treated as a batch effect and lost when cells from different samples are aligned to the same coordinate space. Now, one may not consider injection to be an interesting biological effect, but the same reasoning applies for other conditions, e.g., integration of wild-type and knock-out samples (Section 14.4) would result in the loss of any knock-out effect in the corrected values. 
This loss is both expected and desirable. As we mentioned in Section 13.8, the main motivation for performing batch correction is to enable us to characterize population heterogeneity in a consistent manner across samples. This remains true in situations with multiple conditions where we would like one set of clusters and annotations that can be used as common labels for the DE or DA analyses described above. The alternative would be to cluster each condition separately and to attempt to identify matching clusters across conditions - not straightforward for poorly separated clusters in contexts like differentiation. It may seem distressing to some that a (potentially very interesting) biological difference between conditions is lost during correction. However, this concern is largely misplaced as the correction is only ever used for defining common clusters and annotations. The DE analysis itself is performed on pseudo-bulk samples created from the uncorrected counts, preserving the biological difference and ensuring that it manifests in the list of DE genes for affected cell types. Of course, if the DE is strong enough, it may result in a new condition-specific cluster that would be captured by a DA analysis as discussed in Section 14.6.1. One final consideration is the interaction of condition-specific expression with the assumptions of each batch correction method. For example, MNN correction assumes that the differences between samples are orthogonal to the variation within samples. Arguably, this assumption becomes more questionable if the between-sample differences are biological in nature, e.g., a treatment effect that makes one cell type seem more transcriptionally similar to another may cause the wrong clusters to be aligned across conditions. As usual, users will benefit from the diagnostics described in Chapter 13 and a healthy dose of skepticism. 
Session Info View session info R version 4.0.4 (2021-02-15) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.12-books/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.12-books/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] DropletUtils_1.10.3 scuttle_1.0.4 [3] MouseGastrulationData_1.4.0 scran_1.18.5 [5] edgeR_3.32.1 limma_3.46.0 [7] bluster_1.0.0 scater_1.18.6 [9] ggplot2_3.3.3 BiocSingular_1.6.0 [11] SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0 [13] Biobase_2.50.0 GenomicRanges_1.42.0 [15] GenomeInfoDb_1.26.4 IRanges_2.24.1 [17] S4Vectors_0.28.1 BiocGenerics_0.36.0 [19] MatrixGenerics_1.2.1 matrixStats_0.58.0 [21] BiocStyle_2.18.1 rebook_1.0.0 loaded via a namespace (and not attached): [1] AnnotationHub_2.22.0 BiocFileCache_1.14.0 [3] igraph_1.2.6 splines_4.0.4 [5] BiocParallel_1.24.1 digest_0.6.27 [7] htmltools_0.5.1.1 viridis_0.5.1 [9] fansi_0.4.2 magrittr_2.0.1 [11] memoise_2.0.0 R.utils_2.10.1 [13] colorspace_2.0-0 blob_1.2.1 [15] rappdirs_0.3.3 xfun_0.22 [17] dplyr_1.0.5 callr_3.5.1 [19] crayon_1.4.1 RCurl_1.98-1.3 [21] jsonlite_1.7.2 graph_1.68.0 [23] glue_1.4.2 gtable_0.3.0 [25] zlibbioc_1.36.0 XVector_0.30.0 [27] DelayedArray_0.16.2 Rhdf5lib_1.12.1 [29] HDF5Array_1.18.1 scales_1.1.1 [31] pheatmap_1.0.12 DBI_1.1.1 [33] Rcpp_1.0.6 viridisLite_0.3.0 [35] xtable_1.8-4 dqrng_0.2.1 [37] bit_4.0.4 rsvd_1.0.3 [39] httr_1.4.2 RColorBrewer_1.1-2 [41] ellipsis_0.3.1 pkgconfig_2.0.3 [43] XML_3.99-0.6 R.methodsS3_1.8.1 [45] farver_2.1.0 CodeDepends_0.6.5 [47] sass_0.3.1 dbplyr_2.1.0 [49] locfit_1.5-9.4 utf8_1.2.1 [51] 
tidyselect_1.1.0 labeling_0.4.2 [53] rlang_0.4.10 later_1.1.0.1 [55] AnnotationDbi_1.52.0 munsell_0.5.0 [57] BiocVersion_3.12.0 tools_4.0.4 [59] cachem_1.0.4 generics_0.1.0 [61] RSQLite_2.2.4 ExperimentHub_1.16.0 [63] evaluate_0.14 stringr_1.4.0 [65] fastmap_1.1.0 yaml_2.2.1 [67] processx_3.4.5 knitr_1.31 [69] bit64_4.0.5 purrr_0.3.4 [71] sparseMatrixStats_1.2.1 mime_0.10 [73] R.oo_1.24.0 compiler_4.0.4 [75] beeswarm_0.3.1 curl_4.3 [77] interactiveDisplayBase_1.28.0 tibble_3.1.0 [79] statmod_1.4.35 bslib_0.2.4 [81] stringi_1.5.3 highr_0.8 [83] ps_1.6.0 lattice_0.20-41 [85] Matrix_1.3-2 vctrs_0.3.6 [87] pillar_1.5.1 lifecycle_1.0.0 [89] rhdf5filters_1.2.0 BiocManager_1.30.10 [91] jquerylib_0.1.3 BiocNeighbors_1.8.2 [93] cowplot_1.1.1 bitops_1.0-6 [95] irlba_2.3.3 httpuv_1.5.5 [97] R6_2.5.0 bookdown_0.21 [99] promises_1.2.0.1 gridExtra_2.3 [101] vipor_0.4.5 codetools_0.2-18 [103] assertthat_0.2.1 rhdf5_2.34.0 [105] withr_2.4.1 GenomeInfoDbData_1.2.4 [107] grid_4.0.4 beachmat_2.6.4 [109] rmarkdown_2.7 DelayedMatrixStats_1.12.3 [111] shiny_1.6.0 ggbeeswarm_0.6.0 Bibliography "],["droplet-processing.html", "Chapter 15 Droplet processing 15.1 Motivation 15.2 Calling cells from empty droplets 15.3 Removing ambient contamination 15.4 Demultiplexing cell hashes 15.5 Removing swapped molecules Session Info", " Chapter 15 Droplet processing 15.1 Motivation Droplet-based single-cell protocols aim to isolate each cell inside its own droplet in a water-in-oil emulsion, such that each droplet serves as a miniature reaction chamber for highly multiplexed library preparation (Macosko et al. 2015; Klein et al. 2015). Upon sequencing, reads are assigned to individual cells based on the presence of droplet-specific barcodes.
This enables a massive increase in the number of cells that can be processed in typical scRNA-seq experiments, contributing to the dominance of technologies such as the 10X Genomics platform (Zheng et al. 2017). However, as the allocation of cells to droplets is not known in advance, the data analysis requires some special steps to determine what each droplet actually contains. This chapter explores some of the more common preprocessing procedures that might be applied to the count matrices generated from droplet protocols. 15.2 Calling cells from empty droplets 15.2.1 Background A unique aspect of droplet-based data is that we have no prior knowledge about whether a particular library (i.e., cell barcode) corresponds to cell-containing or empty droplets. Thus, we need to call cells from empty droplets based on the observed expression profiles. This is not entirely straightforward as empty droplets can contain ambient (i.e., extracellular) RNA that can be captured and sequenced, resulting in non-zero counts for libraries that do not contain any cells. To demonstrate, we obtain the unfiltered count matrix for the PBMC dataset from 10X Genomics. #--- loading ---# library(DropletTestFiles) raw.path <- getTestFile("tenx-2.1.0-pbmc4k/1.0.0/raw.tar.gz") out.path <- file.path(tempdir(), "pbmc4k") untar(raw.path, exdir=out.path) library(DropletUtils) fname <- file.path(out.path, "raw_gene_bc_matrices/GRCh38") sce.pbmc <- read10xCounts(fname, col.names=TRUE) sce.pbmc ## class: SingleCellExperiment ## dim: 33694 737280 ## metadata(1): Samples ## assays(1): counts ## rownames(33694): ENSG00000243485 ENSG00000237613 ... ENSG00000277475 ## ENSG00000268674 ## rowData names(2): ID Symbol ## colnames(737280): AAACCTGAGAAACCAT-1 AAACCTGAGAAACCGC-1 ...
## TTTGTCATCTTTAGTC-1 TTTGTCATCTTTCCTC-1 ## colData names(2): Sample Barcode ## reducedDimNames(0): ## altExpNames(0): The distribution of total counts exhibits a sharp transition between barcodes with large and small total counts (Figure 15.1), probably corresponding to cell-containing and empty droplets respectively. A simple approach would be to apply a threshold on the total count to only retain those barcodes with large totals. However, this unnecessarily discards libraries derived from cell types with low RNA content. library(DropletUtils) bcrank <- barcodeRanks(counts(sce.pbmc)) # Only showing unique points for plotting speed. uniq <- !duplicated(bcrank$rank) plot(bcrank$rank[uniq], bcrank$total[uniq], log="xy", xlab="Rank", ylab="Total UMI count", cex.lab=1.2) abline(h=metadata(bcrank)$inflection, col="darkgreen", lty=2) abline(h=metadata(bcrank)$knee, col="dodgerblue", lty=2) legend("bottomleft", legend=c("Inflection", "Knee"), col=c("darkgreen", "dodgerblue"), lty=2, cex=1.2) Figure 15.1: Total UMI count for each barcode in the PBMC dataset, plotted against its rank (in decreasing order of total counts). The inferred locations of the inflection and knee points are also shown. 15.2.2 Testing for empty droplets We use the emptyDrops() function to test whether the expression profile for each cell barcode is significantly different from the ambient RNA pool (Lun et al. 2019). Any significant deviation indicates that the barcode corresponds to a cell-containing droplet. This allows us to discriminate between well-sequenced empty droplets and droplets derived from cells with little RNA, both of which would have similar total counts in Figure 15.1. We call cells at a false discovery rate (FDR) of 0.1%, meaning that no more than 0.1% of our called barcodes should be empty droplets on average.
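The knee point can be understood as the location of most negative curvature on the log-log barcode rank curve. The following base-R sketch illustrates that idea on simulated totals; all simulation parameters are invented for illustration, and barcodeRanks() performs a more careful version of this estimation on real data.

```r
# Simulate a mixture of high-coverage (cell-containing) and
# low-coverage (ambient-only) droplets; parameters are arbitrary.
set.seed(1)
cells <- rnbinom(2000, mu=5000, size=2)
empties <- rnbinom(20000, mu=10, size=1)
totals <- sort(c(cells, empties), decreasing=TRUE)
totals <- totals[totals > 0]

x <- log10(seq_along(totals))  # log-rank
y <- log10(totals)             # log-total

# Fit a smooth curve and locate the point of most negative curvature.
fit <- smooth.spline(x, y, df=10)
d1 <- predict(fit, x, deriv=1)$y
d2 <- predict(fit, x, deriv=2)$y
curvature <- d2 / (1 + d1^2)^1.5
knee.rank <- which.min(curvature)
c(rank=knee.rank, total=totals[knee.rank])
```

In this simulation the detected knee falls near the transition between the two droplet populations, mirroring the dashed lines in Figure 15.1.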
# emptyDrops performs Monte Carlo simulations to compute p-values, # so we need to set the seed to obtain reproducible results. set.seed(100) e.out <- emptyDrops(counts(sce.pbmc)) # See ?emptyDrops for an explanation of why there are NA values. summary(e.out$FDR <= 0.001) ## Mode FALSE TRUE NA's ## logical 989 4300 731991 emptyDrops() uses Monte Carlo simulations to compute \\(p\\)-values for the multinomial sampling of transcripts from the ambient pool. The number of Monte Carlo iterations determines the lower bound for the \\(p\\)-values (Phipson and Smyth 2010). The Limited field in the output indicates whether or not the computed \\(p\\)-value for a particular barcode is bounded by the number of iterations. If any non-significant barcodes are TRUE for Limited, we may need to increase the number of iterations. A larger number of iterations will result in a lower \\(p\\)-value for these barcodes, which may allow them to be detected after correcting for multiple testing. table(Sig=e.out$FDR <= 0.001, Limited=e.out$Limited) ## Limited ## Sig FALSE TRUE ## FALSE 989 0 ## TRUE 1728 2572 As mentioned above, emptyDrops() assumes that barcodes with low total UMI counts are empty droplets. Thus, the null hypothesis should be true for all of these barcodes. We can check whether the hypothesis testing procedure holds its size by examining the distribution of \\(p\\)-values for low-total barcodes with test.ambient=TRUE. Ideally, the distribution should be close to uniform (Figure 15.2). Large peaks near zero indicate that barcodes with total counts below lower are not all ambient in origin. This can be resolved by decreasing lower further to ensure that barcodes corresponding to droplets with very small cells are not used to estimate the ambient profile.
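To see why the number of iterations bounds the \\(p\\)-values, recall the Phipson and Smyth (2010) estimator: with b simulated statistics at least as extreme as the observed one out of niters iterations, the \\(p\\)-value is (b + 1)/(niters + 1). Its smallest achievable value, at b = 0, shrinks as niters grows, which is why increasing the iterations can rescue Limited barcodes. A quick base-R illustration:

```r
# Phipson & Smyth (2010) Monte Carlo p-value: (b + 1) / (niters + 1),
# where b counts simulated statistics at least as extreme as observed.
mc.pvalue <- function(b, niters) (b + 1) / (niters + 1)

mc.pvalue(0, 9999)   # lower bound with 9,999 iterations: 1e-4
mc.pvalue(0, 99999)  # ten times more iterations, ten times smaller bound
```

A barcode flagged as Limited sits at this b = 0 floor, so its \\(p\\)-value can only decrease if niters is raised.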
set.seed(100) limit <- 100 all.out <- emptyDrops(counts(sce.pbmc), lower=limit, test.ambient=TRUE) hist(all.out$PValue[all.out$Total <= limit & all.out$Total > 0], xlab="P-value", main="", col="grey80") Figure 15.2: Distribution of \\(p\\)-values for the assumed empty droplets. Once we are satisfied with the performance of emptyDrops(), we subset our SingleCellExperiment object to retain only the detected cells. Discerning readers will notice the use of which(), which conveniently removes the NAs prior to the subsetting. sce.pbmc <- sce.pbmc[,which(e.out$FDR <= 0.001)] It usually only makes sense to call cells using a count matrix involving libraries from a single sample. The composition of transcripts in the ambient solution will usually differ between samples, so the same ambient profile cannot be reused. If multiple samples are present in a dataset, their counts should only be combined after cell calling is performed on each matrix. 15.2.3 Relationship with other QC metrics While emptyDrops() will distinguish cells from empty droplets, it makes no statement about the quality of the cells. It is entirely possible for droplets to contain damaged or dying cells, which need to be removed prior to downstream analysis. This is achieved using the same outlier-based strategy described in Section 6.3.2. Filtering on the mitochondrial proportion provides the most additional benefit in this situation, provided that we check that we are not removing a subpopulation of metabolically active cells (Figure 15.3).
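The role of which() in that subsetting step is worth spelling out, since comparisons involving NA propagate the NA. A small base-R example with invented FDR values:

```r
# Logical subsetting propagates NAs, whereas which() drops them.
fdr <- c(0.0001, NA, 0.5, 0.0005, NA)

keep.logical <- fdr <= 0.001   # TRUE NA FALSE TRUE NA
keep.which <- which(fdr <= 0.001)

keep.which  # only the positions that are definitely below the threshold
```

Subsetting a SingleCellExperiment with keep.logical would retain all-NA columns for the untested barcodes, while which() cleanly keeps only the called cells.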
library(scuttle) is.mito <- grep("^MT-", rowData(sce.pbmc)$Symbol) pbmc.qc <- perCellQCMetrics(sce.pbmc, subsets=list(MT=is.mito)) discard.mito <- isOutlier(pbmc.qc$subsets_MT_percent, type="higher") summary(discard.mito) ## Mode FALSE TRUE ## logical 3985 315 plot(pbmc.qc$sum, pbmc.qc$subsets_MT_percent, log="x", xlab="Total count", ylab='Mitochondrial %') abline(h=attr(discard.mito, "thresholds")["higher"], col="red") Figure 15.3: Percentage of reads assigned to mitochondrial transcripts, plotted against the library size. The red line represents the upper threshold used for QC filtering. emptyDrops() already removes cells with very low library sizes or (by association) low numbers of expressed genes. Thus, further filtering on these metrics is not strictly necessary. It may still be desirable to filter on both of these metrics to remove non-empty droplets containing cell fragments or stripped nuclei that were not caught by the mitochondrial filter. However, this should be weighed against the risk of losing genuine cell types as discussed in Section 6.3.2.2. Note that CellRanger version 3 automatically performs cell calling using an algorithm similar to emptyDrops(). If we had started our analysis with the filtered count matrix, we could go straight to computing other QC metrics. We would not need to run emptyDrops() manually as shown here, and indeed, attempting to do so would lead to nonsensical results if not outright software errors. Nonetheless, it may still be desirable to load the unfiltered matrix and apply emptyDrops() ourselves, on occasions where more detailed inspection or control of the cell-calling statistics is desired. 15.3 Removing ambient contamination For routine analyses, there is usually no need to remove the ambient contamination from each library.
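The outlier rule applied by isOutlier(type="higher") can be reproduced in base R: flag cells whose metric lies more than a few median absolute deviations above the median. The mitochondrial percentages below are simulated for illustration, not taken from the PBMC data.

```r
# Sketch of the MAD-based upper-outlier rule used for QC filtering.
set.seed(42)
mito.pct <- c(rnorm(100, mean=5, sd=1),  # mostly healthy cells
              30, 45)                    # two obviously damaged cells

# Default-style rule: more than 3 MADs above the median.
threshold <- median(mito.pct) + 3 * mad(mito.pct)
discard <- mito.pct > threshold
sum(discard)
```

Because both the center and spread are estimated robustly, the two extreme cells barely influence the threshold, so they are reliably flagged.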
A consistent level of contamination across the dataset does not introduce much spurious heterogeneity, so dimensionality reduction and clustering on the original (log-)expression matrix remain valid. For genes that are highly abundant in the ambient solution, we can expect some loss of signal due to shrinkage of the log-fold changes between clusters towards zero, but this effect should be negligible for any genes that are so strongly upregulated that they are able to contribute to the ambient solution in the first place. This suggests that ambient removal can generally be omitted from most analyses, though we will describe it here regardless as it can be useful in specific situations. Effective removal of ambient contamination involves tackling a number of issues. We need to know how much contamination is present in each cell, which usually requires some prior biological knowledge about genes that should not be expressed in the dataset (e.g., mitochondrial genes in single-nuclei datasets, see Section 19.4) or genes with mutually exclusive expression profiles (Young and Behjati 2018). Those same genes must be highly abundant in the ambient solution to have enough counts in each cell for precise estimation of the scale of the contamination. The actual subtraction of the ambient contribution also must be done in a manner that respects the mean-variance relationship of the count data. Unfortunately, these issues are difficult to address for single-cell data due to the imprecision of low counts. Rather than attempting to remove contamination from individual cells, a more measured approach is to operate on clusters of related cells. The removeAmbience() function from DropletUtils will remove the contamination from the cluster-level profiles and propagate the effect of those changes back to the individual cells.
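The estimation step at the heart of this cluster-level approach can be illustrated with a toy base-R calculation: the ambient profile is scaled up as far as possible without the scaled count for any gene exceeding the cluster average, and the scaled profile is then subtracted. The gene names and counts below are invented, and the real maximumAmbience() estimate in DropletUtils is more sophisticated than this sketch.

```r
# Toy cluster-level profile and ambient profile (invented counts).
cluster.profile <- c(IGKC=20, LYZ=5, CD3E=50, NKG7=30)
ambient.profile <- c(IGKC=10, LYZ=4, CD3E=1, NKG7=1)

# Largest scaling factor such that no gene's scaled ambient count
# exceeds its cluster average.
scaling <- min(cluster.profile / ambient.profile)

# Subtract the maximal ambient contribution from the cluster profile.
corrected <- cluster.profile - scaling * ambient.profile
corrected
```

Here LYZ is fully attributed to the ambient pool (its corrected count is zero), while genes barely present in the ambient solution, like CD3E, are almost untouched.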
Specifically, given a count matrix for a single sample and its associated ambient profile, removeAmbience() will: Aggregate counts in each cluster to obtain an average profile per cluster. Estimate the contamination proportion in each cluster with maximumAmbience() (see Section 14.4). This has the useful property of not requiring any prior knowledge of control or mutually exclusive expression profiles, albeit at the cost of some statistical rigor. Subtract the estimated contamination from the cluster-level average. Perform quantile-quantile mapping of each individual cell’s counts from the old average to the new subtracted average. This preserves the mean-variance relationship while yielding corrected single-cell profiles. We demonstrate this process on our PBMC dataset below. View history #--- loading ---# library(DropletTestFiles) raw.path &lt;- getTestFile(&quot;tenx-2.1.0-pbmc4k/1.0.0/raw.tar.gz&quot;) out.path &lt;- file.path(tempdir(), &quot;pbmc4k&quot;) untar(raw.path, exdir=out.path) library(DropletUtils) fname &lt;- file.path(out.path, &quot;raw_gene_bc_matrices/GRCh38&quot;) sce.pbmc &lt;- read10xCounts(fname, col.names=TRUE) #--- gene-annotation ---# library(scater) rownames(sce.pbmc) &lt;- uniquifyFeatureNames( rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol) library(EnsDb.Hsapiens.v86) location &lt;- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce.pbmc)$ID, column=&quot;SEQNAME&quot;, keytype=&quot;GENEID&quot;) #--- cell-detection ---# set.seed(100) e.out &lt;- emptyDrops(counts(sce.pbmc)) sce.pbmc &lt;- sce.pbmc[,which(e.out$FDR &lt;= 0.001)] #--- quality-control ---# stats &lt;- perCellQCMetrics(sce.pbmc, subsets=list(Mito=which(location==&quot;MT&quot;))) high.mito &lt;- isOutlier(stats$subsets_Mito_percent, type=&quot;higher&quot;) sce.pbmc &lt;- sce.pbmc[,!high.mito] #--- normalization ---# library(scran) set.seed(1000) clusters &lt;- quickCluster(sce.pbmc) sce.pbmc &lt;- computeSumFactors(sce.pbmc, cluster=clusters) sce.pbmc &lt;- 
logNormCounts(sce.pbmc) #--- variance-modelling ---# set.seed(1001) dec.pbmc &lt;- modelGeneVarByPoisson(sce.pbmc) top.pbmc &lt;- getTopHVGs(dec.pbmc, prop=0.1) #--- dimensionality-reduction ---# set.seed(10000) sce.pbmc &lt;- denoisePCA(sce.pbmc, subset.row=top.pbmc, technical=dec.pbmc) set.seed(100000) sce.pbmc &lt;- runTSNE(sce.pbmc, dimred=&quot;PCA&quot;) set.seed(1000000) sce.pbmc &lt;- runUMAP(sce.pbmc, dimred=&quot;PCA&quot;) #--- clustering ---# g &lt;- buildSNNGraph(sce.pbmc, k=10, use.dimred = &#39;PCA&#39;) clust &lt;- igraph::cluster_walktrap(g)$membership colLabels(sce.pbmc) &lt;- factor(clust) # Not all genes are reported in the ambient profile from emptyDrops, # as genes with counts of zero across all droplets are just removed. # So for convenience, we will restrict our analysis to genes with # non-zero counts in at least one droplet (empty or otherwise). amb &lt;- metadata(e.out)$ambient[,1] stripped &lt;- sce.pbmc[names(amb),] out &lt;- removeAmbience(counts(stripped), ambient=amb, groups=colLabels(stripped)) dim(out) ## [1] 20112 3985 We can visualize the effects of ambient removal on a gene like IGKC, which presumably should only be expressed in the B cell lineage. This gene has some level of expression in each cluster in the original dataset but is “zeroed” in most clusters after removal (Figure 15.4). library(scater) counts(stripped, withDimnames=FALSE) &lt;- out stripped &lt;- logNormCounts(stripped) gridExtra::grid.arrange( plotExpression(sce.pbmc, x=&quot;label&quot;, colour_by=&quot;label&quot;, features=&quot;IGKC&quot;) + ggtitle(&quot;Before&quot;), plotExpression(stripped, x=&quot;label&quot;, colour_by=&quot;label&quot;, features=&quot;IGKC&quot;) + ggtitle(&quot;After&quot;), ncol=2 ) Figure 15.4: Distribution of IGKC log-expression values in each cluster of the PBMC dataset, before and after removal of ambient contamination. 
We observe a similar phenomenon with the LYZ gene (Figure 15.5), which should only be expressed in macrophages and neutrophils. In fact, if we knew this beforehand, we could specify these two mutually exclusive sets - i.e., LYZ and IGKC and their related genes - in the features= argument to removeAmbience(). This knowledge is subsequently used to estimate the contamination in each cluster, an approach that is more conceptually similar to the methods in the SoupX package. gridExtra::grid.arrange( plotExpression(sce.pbmc, x="label", colour_by="label", features="LYZ") + ggtitle("Before"), plotExpression(stripped, x="label", colour_by="label", features="LYZ") + ggtitle("After"), ncol=2 ) Figure 15.5: Distribution of LYZ log-expression values in each cluster of the PBMC dataset, before and after removal of ambient contamination. While these results look impressive, discerning readers will note that the method relies on having sensible clusters. This limits the function’s applicability to the end of an analysis after all the characterization has already been done. As such, the stripped matrix can really only be used in downstream steps like the DE analysis (where it is unlikely to have much effect beyond inflating already-large log-fold changes) or - most importantly - in visualization, where users can improve the aesthetics of their plots by eliminating harmless background expression. Of course, one could repeat the entire analysis on the stripped count matrix to obtain new clusters, but this seems unnecessarily circuitous, especially if the clusters were deemed good enough for use in removeAmbience() in the first place. Finally, it may be worth considering whether a corrected per-cell count matrix is really necessary. In removeAmbience(), counts for each gene are assumed to follow a negative binomial distribution with a fixed dispersion.
This is necessary to perform the quantile-quantile remapping to obtain a corrected version of each individual cell’s counts, but violations of these distributional assumptions will introduce inaccuracies in downstream models. Some analyses may have specific remedies to ambient contamination that do not require corrected per-cell counts (Section 14.4), so we can avoid these assumptions altogether if such remedies are available. 15.4 Demultiplexing cell hashes 15.4.1 Background Cell hashing (Stoeckius et al. 2018) is a useful technique that allows cells from different samples to be processed in a single run of a droplet-based protocol. Cells from a single sample are first labelled with a unique hashtag oligo (HTO), usually via conjugation of the HTO to an antibody against a ubiquitous surface marker or a membrane-binding compound like cholesterol (McGinnis et al. 2019). Cells from different samples are then mixed together and the multiplexed pool is used for droplet-based library preparation; each cell is assigned back to its sample of origin based on its most abundant HTO. By processing multiple samples together, we can avoid batch effects and simplify the logistics of studies with a large number of samples. Sequencing of the HTO-derived cDNA library yields a count matrix where each row corresponds to an HTO and each column corresponds to a cell barcode. This can be stored as an alternative Experiment in our SingleCellExperiment, alongside the main experiment containing the counts for the actual genes. We demonstrate on some data from the original Stoeckius et al. (2018) study, which contains counts for a mixture of 4 cell lines across 12 samples. library(scRNAseq) hto.sce <- StoeckiusHashingData(type="mixed") hto.sce # The full dataset ## class: SingleCellExperiment ## dim: 25339 25088 ## metadata(0): ## assays(1): counts ## rownames(25339): A1BG A1BG-AS1 ...
snoU2-30 snoZ178 ## rowData names(0): ## colnames(25088): CAGATCAAGTAGGCCA CCTTTCTGTCGGATCC ... CGGTTATCCATCTGCT ## CGGTTCACACGTCAGC ## colData names(0): ## reducedDimNames(0): ## altExpNames(1): hto altExp(hto.sce) # Contains the HTO counts ## class: SingleCellExperiment ## dim: 12 25088 ## metadata(0): ## assays(1): counts ## rownames(12): HEK_A HEK_B ... KG1_B KG1_C ## rowData names(0): ## colnames(25088): CAGATCAAGTAGGCCA CCTTTCTGTCGGATCC ... CGGTTATCCATCTGCT ## CGGTTCACACGTCAGC ## colData names(0): ## reducedDimNames(0): ## altExpNames(0): counts(altExp(hto.sce))[,1:3] # Preview of the count profiles ## CAGATCAAGTAGGCCA CCTTTCTGTCGGATCC CATATGGCATGGAATA ## HEK_A 1 0 0 ## HEK_B 111 0 1 ## HEK_C 7 1 7 ## THP1_A 15 19 16 ## THP1_B 10 8 5 ## THP1_C 4 3 6 ## K562_A 118 0 245 ## K562_B 5 530 131 ## K562_C 1 3 4 ## KG1_A 30 25 24 ## KG1_B 40 14 239 ## KG1_C 32 36 38 15.4.2 Cell calling options Our first task is to identify the libraries corresponding to cell-containing droplets. This can be applied on the gene count matrix or the HTO count matrix, depending on what information we have available. We start with the usual application of emptyDrops() on the gene count matrix of hto.sce (Figure 15.6). set.seed(10010) e.out.gene &lt;- emptyDrops(counts(hto.sce)) is.cell &lt;- e.out.gene$FDR &lt;= 0.001 summary(is.cell) ## Mode FALSE TRUE NA&#39;s ## logical 1384 7934 15770 par(mfrow=c(1,2)) r &lt;- rank(-e.out.gene$Total) plot(r, e.out.gene$Total, log=&quot;xy&quot;, xlab=&quot;Rank&quot;, ylab=&quot;Total gene count&quot;, main=&quot;&quot;) abline(h=metadata(e.out.gene)$retain, col=&quot;darkgrey&quot;, lty=2, lwd=2) hist(log10(e.out.gene$Total[is.cell]), xlab=&quot;Log[10] gene count&quot;, main=&quot;&quot;) Figure 15.6: Cell-calling statistics from running emptyDrops() on the gene count in the cell line mixture data. Left: Barcode rank plot with the estimated knee point in grey. Right: distribution of log-total counts for libraries identified as cells. 
Alternatively, we could also apply emptyDrops() to the HTO count matrix but this is slightly more complicated. As HTOs are sequenced separately from the endogenous transcripts, the coverage of the former is less predictable across studies; this makes it difficult to determine an appropriate default value of lower= for estimation of the initial ambient profile. We instead estimate the ambient profile by excluding the top by.rank= barcodes with the largest totals, under the assumption that no more than by.rank= cells were loaded. Here we have chosen 12000, which is largely a guess to ensure that we can directly pick the knee point (Figure 15.7) in this somewhat pre-filtered dataset. set.seed(10010) # Setting lower= for correct knee point detection, # as the coverage in this dataset is particularly low. e.out.hto <- emptyDrops(counts(altExp(hto.sce)), by.rank=12000, lower=10) summary(is.cell.hto <- e.out.hto$FDR <= 0.001) ## Mode FALSE TRUE NA's ## logical 2967 8955 13166 par(mfrow=c(1,2)) r <- rank(-e.out.hto$Total) plot(r, e.out.hto$Total, log="xy", xlab="Rank", ylab="Total HTO count", main="") abline(h=metadata(e.out.hto)$retain, col="darkgrey", lty=2, lwd=2) hist(log10(e.out.hto$Total[is.cell.hto]), xlab="Log[10] HTO count", main="") Figure 15.7: Cell-calling statistics from running emptyDrops() on the HTO counts in the cell line mixture data. Left: Barcode rank plot with the knee point shown in grey. Right: distribution of log-total counts for libraries identified as cells. While both approaches are valid, we tend to favor the cell calls derived from the gene matrix as this directly indicates that a cell is present in the droplet. Indeed, at least a few libraries have very high total HTO counts yet very low total gene counts (Figure 15.8), suggesting that the presence of HTOs may not always equate to successful capture of that cell’s transcriptome.
HTO counts also tend to exhibit stronger overdispersion (i.e., lower alpha in the emptyDrops() calculations), increasing the risk of violating emptyDrops()’s distributional assumptions. table(HTO=is.cell.hto, Genes=is.cell, useNA="always") ## Genes ## HTO FALSE TRUE <NA> ## FALSE 342 798 1827 ## TRUE 200 6777 1978 ## <NA> 842 359 11965 plot(e.out.gene$Total, e.out.hto$Total, log="xy", xlab="Total gene count", ylab="Total HTO count") abline(v=metadata(e.out.gene)$lower, col="red", lwd=2, lty=2) abline(h=metadata(e.out.hto)$lower, col="blue", lwd=2, lty=2) Figure 15.8: Total HTO counts plotted against the total gene counts for each library in the cell line mixture dataset. Each point represents a library while the dotted lines represent the thresholds below which libraries were assumed to be empty droplets. Again, note that if we are picking up our analysis after processing with pipelines like CellRanger, it may be that the count matrix has already been subsetted to the cell-containing libraries. If so, we can skip this section entirely and proceed straight to demultiplexing. 15.4.3 Demultiplexing on HTO abundance We run hashedDrops() to demultiplex the HTO count matrix for the subset of cell-containing libraries. This reports the likely sample of origin for each library based on its most abundant HTO after adjusting those abundances for ambient contamination. For quality control, it returns the log-fold change between the first and second-most abundant HTOs in each barcode library (Figure 15.9), allowing us to quantify the certainty of each assignment. hto.mat <- counts(altExp(hto.sce))[,which(is.cell)] hash.stats <- hashedDrops(hto.mat) hist(hash.stats$LogFC, xlab="Log fold-change from best to second HTO", main="") Figure 15.9: Distribution of log-fold changes from the first to second-most abundant HTO in each cell.
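The core of this demultiplexing logic can be sketched in base R: assign each cell to its most abundant HTO and record the log-fold change over the runner-up. The tiny count matrix below is invented, and hashedDrops() additionally adjusts for the ambient profile before making these comparisons.

```r
# Toy HTO count matrix: 3 HTOs (samples) x 3 cells, invented counts.
hto <- matrix(c(100, 2, 3,    # cell1: clearly sample 1
                5, 80, 60,    # cell2: ambiguous between samples 2 and 3
                1, 1, 90),    # cell3: clearly sample 3
              nrow=3, dimnames=list(paste0("HTO", 1:3), paste0("cell", 1:3)))

# Best HTO per cell, and log-fold change of best over second-best.
best <- apply(hto, 2, which.max)
logfc <- apply(hto, 2, function(x) {
    s <- sort(x, decreasing=TRUE)
    log2((s[1] + 1) / (s[2] + 1))  # pseudo-count for stability
})

best
logfc > 1  # a crude "confident assignment" cutoff
```

Cell 2 is assigned to sample 2 but with a small log-fold change over sample 3, so a filter on this statistic would discard it as ambiguous, mirroring the role of the Confident field discussed below.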
Confidently assigned cells should have large log-fold changes between the best and second-best HTO abundances as there should be exactly one dominant HTO per cell. These are marked as such by the Confident field in the output of hashedDrops(), which can be used to filter out ambiguous assignments prior to downstream analyses. # Raw assignments: table(hash.stats$Best) ## ## 1 2 3 4 5 6 7 8 9 10 11 12 ## 664 732 636 595 629 655 570 684 603 726 662 778 # Confident assignments based on (i) a large log-fold change # and (ii) not being a doublet. table(hash.stats$Best[hash.stats$Confident]) ## ## 1 2 3 4 5 6 7 8 9 10 11 12 ## 580 619 560 524 553 573 379 605 427 640 556 607 In the absence of an a priori ambient profile, hashedDrops() will attempt to automatically estimate it from the count matrix. This is done by assuming that each HTO has a bimodal distribution where the lower peak corresponds to ambient contamination in cells that do not belong to that HTO’s sample. Counts are then averaged across all cells in the lower mode to obtain the relative abundance of that HTO (Figure 15.10). hashedDrops() uses the ambient profile to adjust for systematic differences in HTO concentrations that could otherwise skew the log-fold changes - for example, this particular dataset exhibits order-of-magnitude differences in the concentration of different HTOs. The adjustment process itself involves a fair number of assumptions that we will not discuss here; see ?hashedDrops for more details. barplot(metadata(hash.stats)$ambient, las=2, ylab=&quot;Inferred proportion of counts in the ambient solution&quot;) Figure 15.10: Proportion of each HTO in the ambient solution for the cell line mixture data, estimated from the HTO counts of cell-containing droplets. If we are dealing with unfiltered data, we have the opportunity to improve the inferences by defining the ambient profile beforehand based on the empty droplets. 
This simply involves summing the counts for each HTO across all known empty droplets, marked as those libraries with NA FDR values in the emptyDrops() output. Alternatively, if we had called emptyDrops() directly on the HTO count matrix, we could just extract the ambient profile from the output’s metadata(). For this dataset, all methods agree well (Figure 15.10) though providing an a priori profile can be helpful in more extreme situations where the automatic method fails, e.g., if there are too few cells in the lower mode for accurate estimation of a HTO’s ambient concentration. estimates &lt;- rbind( `Bimodal`=proportions(metadata(hash.stats)$ambient), `Empty (genes)`=proportions(rowSums(counts(altExp(hto.sce))[,is.na(e.out.gene$FDR)])), `Empty (HTO)`=metadata(e.out.hto)$ambient[,1] ) barplot(estimates, beside=TRUE, ylab=&quot;Proportion of counts in the ambient solution&quot;) legend(&quot;topleft&quot;, fill=gray.colors(3), legend=rownames(estimates)) Figure 15.11: Proportion of each HTO in the ambient solution for the cell line mixture data, estimated using the bimodal method in hashedDrops() or by computing the average abundance across all empty droplets (where the empty state is defined by using emptyDrops() on either the genes or the HTO matrix). Given an estimate of the ambient profile - say, the one derived from empty droplets detected using the HTO count matrix - we can easily use it in hashedDrops() via the ambient= argument. This yields very similar results to those obtained with the automatic method, as expected from the similarity in the profiles. hash.stats2 &lt;- hashedDrops(hto.mat, ambient=metadata(e.out.hto)$ambient[,1]) table(hash.stats2$Best[hash.stats2$Confident]) ## ## 1 2 3 4 5 6 7 8 9 10 11 12 ## 575 602 565 526 551 559 354 598 411 632 553 589 15.4.4 Further comments After demultiplexing, it is a simple matter to subset the SingleCellExperiment to the confident assignments. 
This actually involves two steps: the first is to subset to the libraries that were actually used in hashedDrops(), and the second is to subset to the libraries that were confidently assigned to a single sample. Of course, we also include the putative sample of origin for each cell.

```r
sce <- hto.sce[,rownames(hash.stats)]
sce$sample <- hash.stats$Best
sce <- sce[,hash.stats$Confident]
```

We examine the success of the demultiplexing by performing a quick analysis. Recall that this experiment involved 4 cell lines that were multiplexed together; we see that the separation between cell lines is preserved in Figure 15.12, indicating that the cells were assigned to their correct samples of origin.

```r
library(scran)
library(scater)
sce <- logNormCounts(sce)
dec <- modelGeneVar(sce)

set.seed(100)
sce <- runPCA(sce, subset_row=getTopHVGs(dec, n=5000))
sce <- runTSNE(sce, dimred="PCA")

cell.lines <- sub("_.*", "", rownames(altExp(sce)))
sce$cell.line <- cell.lines[sce$sample]
plotTSNE(sce, colour_by="cell.line")
```

Figure 15.12: The usual t-SNE plot of the cell line mixture data, where each point is a cell and is colored by the cell line corresponding to its sample of origin.

Cell hashing information can also be used to detect doublets; see Chapter 16 for more details.

15.5 Removing swapped molecules

Some of the more recent DNA sequencing machines released by Illumina (e.g., HiSeq 3000/4000/X, X-Ten, and NovaSeq) use patterned flow cells to improve throughput and cost efficiency. However, in multiplexed pools, the use of these flow cells can lead to the mislabelling of DNA molecules with the incorrect library barcode (Sinha et al. 2017), a phenomenon known as "barcode swapping".
This leads to contamination of each library with reads from other libraries - for droplet sequencing experiments, this is particularly problematic as it manifests as the appearance of artificial cells that are low-coverage copies of their originals from other samples (Griffiths et al. 2018). Fortunately, it is easy enough to remove affected reads from droplet experiments with the swappedDrops() function from DropletUtils. Given a multiplexed pool of samples, we identify potential swapping events as transcript molecules that share the same combination of UMI sequence, assigned gene and cell barcode across samples. We only keep the molecule if it has dominant coverage in a single sample, which is likely to be its original sample; we remove all (presumably swapped) instances of that molecule in the other samples. Our assumption is that it is highly unlikely that two molecules would have the same combination of values by chance. To demonstrate, we will use some multiplexed 10X Genomics data from an attempted study of the mouse mammary gland (Griffiths et al. 2018). This experiment consists of 8 scRNA-seq samples from various stages of mammary gland development, sequenced using the HiSeq 4000. We use the DropletTestFiles package to obtain the molecule information files produced by the CellRanger software suite; as its name suggests, this file format contains information on each individual transcript molecule in each sample. 
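Before loading the real molecule information files, the dominance rule described above can be sketched with base R on a toy molecule table. This is only an illustration of the logic, with a hypothetical remove_swapped() helper and min.frac cutoff; the actual swappedDrops() implementation works directly on the 10X molecule information files.

```r
# Toy sketch of the swapping-removal rule: molecules sharing the same
# (cell barcode, UMI, gene) combination across samples are kept only in
# the sample with the dominant share of reads.
remove_swapped <- function(mols, min.frac=0.8) {
    # 'mols' has one row per molecule per sample, with columns
    # 'sample', 'cell', 'umi', 'gene' and 'reads'.
    key <- paste(mols$cell, mols$umi, mols$gene, sep=":")
    total <- tapply(mols$reads, key, sum)
    frac <- mols$reads / total[key]
    mols[frac >= min.frac, , drop=FALSE]
}

mols <- data.frame(
    sample = c("A", "B", "A"),
    cell   = c("ACGT", "ACGT", "TTTT"),
    umi    = c("GG", "GG", "CC"),
    gene   = c("Xkr4", "Xkr4", "Actb"),
    reads  = c(95, 5, 50)
)

# The low-coverage copy of the first molecule in sample B is discarded
# as a presumed swapping artifact; everything else is retained.
cleaned <- remove_swapped(mols)
```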
```r
library(DropletTestFiles)
swap.files <- listTestFiles(dataset="bach-mammary-swapping")
swap.files <- swap.files[dirname(swap.files$file.name)=="hiseq_4000",]
swap.files <- vapply(swap.files$rdatapath, getTestFile, prefix=FALSE, "")
names(swap.files) <- sub(".*_(.*)\\.h5", "\\1", names(swap.files))
```

We examine the barcode rank plots before making any attempt to remove swapped molecules (Figure 15.13), using the get10xMolInfoStats() function to efficiently obtain summary statistics from each molecule information file. We see that samples E1 and F1 have different curves, but this is not cause for alarm given that they also correspond to a different developmental stage compared to the other samples.

```r
library(DropletUtils)
before.stats <- lapply(swap.files, get10xMolInfoStats)

max.umi <- vapply(before.stats, function(x) max(x$num.umis), 0)
ylim <- c(1, max(max.umi))
max.ncells <- vapply(before.stats, nrow, 0L)
xlim <- c(1, max(max.ncells))

plot(0,0,type="n", xlab="Rank", ylab="Number of UMIs",
    log="xy", xlim=xlim, ylim=ylim)
for (i in seq_along(before.stats)) {
    u <- sort(before.stats[[i]]$num.umis, decreasing=TRUE)
    lines(seq_along(u), u, col=i, lwd=5)
}
legend("topright", col=seq_along(before.stats), lwd=5,
    legend=names(before.stats))
```

Figure 15.13: Barcode rank curves for all samples in the HiSeq 4000-sequenced mammary gland dataset, before removing any swapped molecules.

We apply the swappedDrops() function to the molecule information files to identify and remove swapped molecules from each sample. While all samples have some percentage of removed molecules, the majority of molecules in samples E1 and F1 are considered to be swapping artifacts. The most likely cause is that these samples contain no real cells or highly damaged cells with little RNA, which frees up sequencing resources for deeper coverage of swapped molecules.
```r
after.mat <- swappedDrops(swap.files, get.swapped=TRUE)
cleaned.sum <- vapply(after.mat$cleaned, sum, 0)
swapped.sum <- vapply(after.mat$swapped, sum, 0)
swapped.sum / (swapped.sum + cleaned.sum)
##      A1      B1      C1      D1      E1      F1      G1      H1 
## 0.02761 0.02767 0.12274 0.03797 0.82535 0.86561 0.02241 0.03648
```

After removing the swapped molecules, the barcode rank curves for E1 and F1 drop dramatically (Figure 15.14). This represents the worst-case outcome of the swapping phenomenon, where cells are "carbon-copied" from the other multiplexed samples. Proceeding with the cleaned matrices protects us from these egregious artifacts as well as the more subtle effects of contamination.

```r
plot(0,0,type="n", xlab="Rank", ylab="Number of UMIs",
    log="xy", xlim=xlim, ylim=ylim)
for (i in seq_along(after.mat$cleaned)) {
    cur.stats <- barcodeRanks(after.mat$cleaned[[i]])
    u <- sort(cur.stats$total, decreasing=TRUE)
    lines(seq_along(u), u, col=i, lwd=5)
}
legend("topright", col=seq_along(after.mat$cleaned), lwd=5,
    legend=names(after.mat$cleaned))
```

Figure 15.14: Barcode rank curves for all samples in the HiSeq 4000-sequenced mammary gland dataset, after removing any swapped molecules.
Session Info

```
R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.12-books/R/lib/libRblas.so
LAPACK: /home/biocbuild/bbs-3.12-books/R/lib/libRlapack.so

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] DropletTestFiles_1.0.0      scran_1.18.5
 [3] scRNAseq_2.4.0              scater_1.18.6
 [5] ggplot2_3.3.3               scuttle_1.0.4
 [7] DropletUtils_1.10.3         SingleCellExperiment_1.12.0
 [9] SummarizedExperiment_1.20.0 Biobase_2.50.0
[11] GenomicRanges_1.42.0        GenomeInfoDb_1.26.4
[13] IRanges_2.24.1              S4Vectors_0.28.1
[15] BiocGenerics_0.36.0         MatrixGenerics_1.2.1
[17] matrixStats_0.58.0          BiocStyle_2.18.1
[19] rebook_1.0.0
```

Chapter 16 Doublet detection

16.1 Overview

In single-cell RNA sequencing (scRNA-seq) experiments, doublets are
artifactual libraries generated from two cells. They typically arise due to errors in cell sorting or capture, especially in droplet-based protocols (Zheng et al. 2017) involving thousands of cells. Doublets are obviously undesirable when the aim is to characterize populations at the single-cell level. In particular, doublets can be mistaken for intermediate populations or transitory states that do not actually exist. Thus, it is desirable to identify and remove doublet libraries so that they do not compromise interpretation of the results. Several experimental strategies are available for doublet removal. One approach exploits natural genetic variation when pooling cells from multiple donor individuals (Kang et al. 2018). Doublets can be identified as libraries with allele combinations that do not exist in any single donor. Another approach is to mark a subset of cells (e.g., all cells from one sample) with an antibody conjugated to a different oligonucleotide (Stoeckius et al. 2018). Upon pooling, libraries that are observed to have different oligonucleotides are considered to be doublets and removed. These approaches can be highly effective but rely on experimental information that may not be available. A more general approach is to infer doublets from the expression profiles alone (Dahlin et al. 2018). In this workflow, we will describe two purely computational approaches for detecting doublets from scRNA-seq data. The main difference between these two methods is whether or not they need cluster information beforehand. We will demonstrate the use of these methods on 10X Genomics data from a droplet-based scRNA-seq study of the mouse mammary gland (Bach et al. 2017). 
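The risk of mistaking doublets for intermediate states can be seen with a back-of-the-envelope calculation using made-up compositions for two cell types; the numbers below are purely illustrative.

```r
# Hypothetical percentage compositions for two cell types across
# three transcript categories.
basal    <- c(Acta2=90, Csn2=5,  Other=5)
alveolar <- c(Acta2=5,  Csn2=90, Other=5)

# A doublet library contains the transcripts of both cells, so its
# composition lies between the two parent profiles (47.5% for each
# marker here), mimicking a non-existent intermediate population.
doublet <- basal + alveolar
doublet.prop <- 100 * doublet / sum(doublet)
```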
```r
#--- loading ---#
library(scRNAseq)
sce.mam <- BachMammaryData(samples="G_1")

#--- gene-annotation ---#
library(scater)
rownames(sce.mam) <- uniquifyFeatureNames(
    rowData(sce.mam)$Ensembl, rowData(sce.mam)$Symbol)

library(AnnotationHub)
ens.mm.v97 <- AnnotationHub()[["AH73905"]]
rowData(sce.mam)$SEQNAME <- mapIds(ens.mm.v97,
    keys=rowData(sce.mam)$Ensembl,
    keytype="GENEID", column="SEQNAME")

#--- quality-control ---#
is.mito <- rowData(sce.mam)$SEQNAME == "MT"
stats <- perCellQCMetrics(sce.mam, subsets=list(Mito=which(is.mito)))
qc <- quickPerCellQC(stats, percent_subsets="subsets_Mito_percent")
sce.mam <- sce.mam[,!qc$discard]

#--- normalization ---#
library(scran)
set.seed(101000110)
clusters <- quickCluster(sce.mam)
sce.mam <- computeSumFactors(sce.mam, clusters=clusters)
sce.mam <- logNormCounts(sce.mam)

#--- variance-modelling ---#
set.seed(00010101)
dec.mam <- modelGeneVarByPoisson(sce.mam)
top.mam <- getTopHVGs(dec.mam, prop=0.1)

#--- dimensionality-reduction ---#
library(BiocSingular)
set.seed(101010011)
sce.mam <- denoisePCA(sce.mam, technical=dec.mam, subset.row=top.mam)
sce.mam <- runTSNE(sce.mam, dimred="PCA")

#--- clustering ---#
snn.gr <- buildSNNGraph(sce.mam, use.dimred="PCA", k=25)
colLabels(sce.mam) <- factor(igraph::cluster_walktrap(snn.gr)$membership)
```

```r
sce.mam
## class: SingleCellExperiment 
## dim: 27998 2772 
## metadata(0):
## assays(2): counts logcounts
## rownames(27998): Xkr4 Gm1992 ... Vmn2r122 CAAA01147332.1
## rowData names(3): Ensembl Symbol SEQNAME
## colnames: NULL
## colData names(5): Barcode Sample Condition sizeFactor label
## reducedDimNames(2): PCA TSNE
## altExpNames(0):
```

16.2 Doublet detection with clusters

The findDoubletClusters() function from the scDblFinder package identifies clusters with expression profiles lying between two other clusters (Bach et al. 2017).
We consider every possible triplet of clusters consisting of a query cluster and two putative "source" clusters. Under the null hypothesis that the query consists of doublets from the two sources, we compute the number of genes (num.de) that are differentially expressed in the same direction in the query cluster compared to both of the source clusters. Such genes would be unique markers for the query cluster and provide evidence against the null hypothesis. For each query cluster, the best pair of putative sources is identified based on the lowest num.de. Clusters are then ranked by num.de, where those with the fewest unique genes are more likely to be composed of doublets.

```r
# Like 'findMarkers', this function will automatically
# retrieve cluster assignments from 'colLabels'.
library(scDblFinder)
dbl.out <- findDoubletClusters(sce.mam)
dbl.out
## DataFrame with 10 rows and 9 columns
##        source1     source2    num.de median.de        best     p.value
##    <character> <character> <integer> <numeric> <character>   <numeric>
## 6            2           1        13     507.5       Pcbp2 1.28336e-03
## 2           10           3       109     710.5        Pigr 4.34790e-21
## 4            6           5       111     599.5       Cotl1 1.09709e-08
## 5           10           7       139     483.5        Gde1 9.30195e-12
## 10           8           5       191     392.0       Krt18 5.54539e-20
## 7            8           5       270     784.5    AF251705 3.29661e-24
## 9            8           5       295     514.5       Fabp4 2.21523e-32
## 8           10           9       388     658.5      Col1a1 6.82664e-32
## 1            8           6       513    1834.0       Acta2 1.07294e-24
## 3            6           5       530    1751.5      Sapcd2 6.08574e-16
##    lib.size1 lib.size2       prop
##    <numeric> <numeric>  <numeric>
## 6   0.811531  0.516399 0.03030303
## 2   0.619865  1.411579 0.28823954
## 4   1.540751  0.688651 0.16305916
## 5   1.125474  1.167854 0.00865801
## 10  0.888432  0.888514 0.00865801
## 7   0.856192  0.856271 0.01875902
## 9   0.655624  0.655685 0.01154401
## 8   1.125578  1.525264 0.01406926
## 1   0.865449  1.936489 0.19841270
## 3   0.872951  0.390173 0.25829726
```

If a more concrete threshold is necessary, we can identify clusters that have unusually low num.de using an outlier-based approach.
```r
library(scater)
chosen.doublet <- rownames(dbl.out)[isOutlier(dbl.out$num.de,
    type="lower", log=TRUE)]
chosen.doublet
## [1] "6"
```

The function also reports the ratio of the median library size in each source to the median library size in the query (lib.size fields). Ideally, a potential doublet cluster would have ratios lower than unity; this is because doublet libraries are generated from a larger initial pool of RNA compared to libraries for single cells, and thus the former should have larger library sizes. The proportion of cells in the query cluster should also be reasonable, typically less than 5% of all cells, depending on how many cells were loaded onto the 10X Genomics device.

Examination of the findDoubletClusters() output indicates that cluster 6 has the fewest unique genes and library sizes that are comparable to or greater than its sources. We see that every gene detected in this cluster is also expressed in either of the two proposed source clusters (Figure 16.1).

```r
library(scran)
markers <- findMarkers(sce.mam, direction="up")
dbl.markers <- markers[[chosen.doublet]]

library(scater)
chosen <- rownames(dbl.markers)[dbl.markers$Top <= 10]
plotHeatmap(sce.mam, order_columns_by="label", features=chosen,
    center=TRUE, symmetric=TRUE, zlim=c(-5, 5))
```

Figure 16.1: Heatmap of mean-centered and normalized log-expression values for the top set of markers for cluster 6 in the mammary gland dataset. Column colours represent the cluster to which each cell is assigned, as indicated by the legend.

Closer examination of some known markers suggests that the offending cluster consists of doublets of basal cells (Acta2) and alveolar cells (Csn2) (Figure 16.2). Indeed, no cell type is known to strongly express both of these genes at the same time, which supports the hypothesis that this cluster consists solely of doublets rather than being an entirely novel cell type.
```r
plotExpression(sce.mam, features=c("Acta2", "Csn2"),
    x="label", colour_by="label")
```

Figure 16.2: Distribution of log-normalized expression values for Acta2 and Csn2 in each cluster. Each point represents a cell.

The strength of findDoubletClusters() lies in its simplicity and ease of interpretation. Suspect clusters can be quickly flagged based on the metrics returned by the function. However, it is obviously dependent on the quality of the clustering. Clusters that are too coarse will fail to separate doublets from other cells, while clusters that are too fine will complicate interpretation. The method is also somewhat biased towards clusters with fewer cells, where the reduction in power is more likely to result in a low num.de. (Fortunately, this is a desirable effect as doublets should be rare in a properly performed scRNA-seq experiment.)

16.3 Doublet detection by simulation

The other doublet detection strategy involves in silico simulation of doublets from the single-cell expression profiles (Dahlin et al. 2018). This is performed using the computeDoubletDensity() function from scDblFinder, which will:

1. Simulate thousands of doublets by adding together two randomly chosen single-cell profiles.
2. For each original cell, compute the density of simulated doublets in the surrounding neighborhood.
3. For each original cell, compute the density of other observed cells in the neighborhood.
4. Return the ratio between the two densities as a "doublet score" for each cell.

This approach assumes that the simulated doublets are good approximations for real doublets. The use of random selection accounts for the relative abundances of different subpopulations, which affect the likelihood of their involvement in doublets; and the calculation of a ratio avoids high scores for non-doublet cells in highly abundant subpopulations. We see the function in action below.
To speed up the density calculations, computeDoubletDensity() will perform a PCA on the log-expression matrix, and we perform some (optional) parametrization to ensure that the computed PCs are consistent with those from our previous analysis on this dataset.

```r
library(BiocSingular)
set.seed(100)

# Setting up the parameters for consistency with denoisePCA();
# this can be changed depending on your feature selection scheme.
dbl.dens <- computeDoubletDensity(sce.mam, subset.row=top.mam,
    d=ncol(reducedDim(sce.mam)))
summary(dbl.dens)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.249   0.527   1.041   1.188  14.570
```

The highest doublet scores are concentrated in a single cluster of cells in the center of Figure 16.3.

```r
sce.mam$DoubletScore <- dbl.dens
plotTSNE(sce.mam, colour_by="DoubletScore")
```

Figure 16.3: t-SNE plot of the mammary gland data set. Each point is a cell coloured according to its doublet density.

From the clustering information, we see that the affected cells belong to the same cluster that was identified using findDoubletClusters() (Figure 16.4), which is reassuring.

```r
plotColData(sce.mam, x="label", y="DoubletScore", colour_by="label")
```

Figure 16.4: Distribution of doublet scores for each cluster in the mammary gland data set. Each point is a cell.

The advantage of computeDoubletDensity() is that it does not depend on clusters, reducing the sensitivity of the results to clustering quality. The downside is that it requires some strong assumptions about how doublets form, such as the combining proportions and the sampling from pure subpopulations. In particular, computeDoubletDensity() treats the library size of each cell as an accurate proxy for its total RNA content. If this is not true, the simulation will not combine expression profiles from different cells in the correct proportions.
This means that the simulated doublets will be systematically shifted away from the real doublets, resulting in doublet scores that are too low. Simply removing cells with high doublet scores will not be sufficient to eliminate real doublets from the data set. In some cases, only a subset of the cells in the putative doublet cluster actually have high scores, and removing these would still leave enough cells in that cluster to mislead downstream analyses. In fact, even defining a threshold on the doublet score is difficult as the interpretation of the score is relative. There is no general definition for a fixed threshold above which libraries are to be considered doublets.

We recommend interpreting the computeDoubletDensity() scores in the context of cluster annotation. All cells from a cluster with a large average doublet score should be considered suspect, and close neighbors of problematic clusters should also be treated with caution. In contrast, a cluster containing a small proportion of high-scoring cells is probably safe provided that any interesting results are not being driven by those cells (e.g., checking that DE in an interesting gene is not driven solely by cells with high doublet scores). While clustering is still required, this approach is more robust than findDoubletClusters() to the quality of the clustering as the scores are computed on a per-cell basis.

(As an aside, the issue of unknown combining proportions can be solved completely if spike-in information is available, e.g., in plate-based protocols. This will provide an accurate estimate of the total RNA content of each cell. To this end, spike-in-based size factors from Section 7.4 can be supplied to the computeDoubletDensity() function via the size.factors.content= argument. This will use the spike-in size factors to scale the contribution of each cell to a doublet library.)

16.4 Doublet detection in multiplexed experiments

16.4.1 Background

For multiplexed samples (Kang et al.
2018; Stoeckius et al. 2018), we can identify doublets based on the cells that have multiple labels. The idea here is that cells from the same sample are labelled in a unique manner, either implicitly with genotype information or experimentally with hashtag oligonucleotides (HTOs). Cells from all samples are then mixed together and the multiplexed pool is subjected to scRNA-seq, avoiding batch effects and simplifying the logistics of processing a large number of samples. Importantly, most per-cell libraries are expected to contain one label that can be used to assign that cell to its sample of origin. Cell libraries containing two labels are thus likely to be doublets of cells from different samples.

To demonstrate, we will use some data from the original cell hashing study (Stoeckius et al. 2018). Each sample's cells were stained with an antibody against a ubiquitous surface protein, where the antibody was conjugated to a sample-specific HTO. Sequencing of the HTO-derived cDNA library ultimately yields a count matrix where each row corresponds to a HTO and each column corresponds to a cell barcode.

```r
library(scRNAseq)
hto.sce <- StoeckiusHashingData(mode="hto")
dim(hto.sce)
## [1]     8 65000
```

16.4.2 Identifying inter-sample doublets

Before we proceed to doublet detection, we simplify the problem by first identifying the barcodes that contain cells. This is most conventionally done using the gene expression matrix for the same set of barcodes, as shown in Section 15.2. Here, though, we will keep things simple and apply emptyDrops() directly on the HTO count matrix. The considerations are largely the same as those for gene expression matrices; the main difference is that the default lower= is often too low for deeply sequenced HTOs, so we instead estimate the ambient profile by excluding the top by.rank= barcodes with the largest totals (under the assumption that no more than by.rank= cells were loaded).
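The by.rank= logic can be mimicked in a few lines of base R; this is a simplified sketch with a hypothetical estimate_ambient_by_rank() helper, whereas emptyDrops() additionally tests each barcode against the estimated profile.

```r
# Barcodes ranked below the top 'by.rank' totals are assumed to be empty,
# and their summed counts define the relative ambient profile.
estimate_ambient_by_rank <- function(mat, by.rank) {
    totals <- colSums(mat)
    empty <- rank(-totals, ties.method="first") > by.rank
    proportions(rowSums(mat[, empty, drop=FALSE]))
}

# Hypothetical matrix of 2 HTOs by 6 barcodes, of which the
# first two are real cells.
mat <- rbind(HTO1=c(500, 10, 2, 1, 3, 2),
             HTO2=c(20, 400, 1, 2, 1, 2))
ambient.est <- estimate_ambient_by_rank(mat, by.rank=2)
```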
The barcode-rank plots are quite similar to what one might expect from gene expression data (Figure 16.5).

```r
library(DropletUtils)
set.seed(101)
hash.calls <- emptyDrops(counts(hto.sce), by.rank=40000)
is.cell <- which(hash.calls$FDR <= 0.001)
length(is.cell)
## [1] 21782
```

```r
par(mfrow=c(1,2))
r <- rank(-hash.calls$Total)
plot(r, hash.calls$Total, log="xy", xlab="Rank", ylab="Total HTO count", main="")
hist(log10(hash.calls$Total[is.cell]), xlab="Log[10] HTO count", main="")
```

Figure 16.5: Cell-calling statistics from running emptyDrops() on the HTO counts in the cell hashing study. Left: barcode rank plot of the HTO counts. Right: distribution of log-total counts for libraries identified as cells.

We then run hashedDrops() on the subset of cell barcode libraries that actually contain cells. This returns the likely sample of origin for each barcode library based on its most abundant HTO, using abundances adjusted for ambient contamination in the ambient= argument. (The adjustment process itself involves a fair number of assumptions that we will not discuss here; see ?hashedDrops for more details.) For quality control, it returns the log-fold change between the first and second-most abundant HTOs in each barcode library (Figure 15.9), allowing us to quantify the certainty of each assignment. Confidently assigned singlets are marked using the Confident field in the output.

```r
hash.stats <- hashedDrops(counts(hto.sce)[,is.cell],
    ambient=metadata(hash.calls)$ambient)

hist(hash.stats$LogFC, xlab="Log fold-change from best to second HTO", main="")
```

Figure 15.9: Distribution of log-fold changes from the first to second-most abundant HTO in each cell.

```r
# Raw assignments:
table(hash.stats$Best)
## 
##    1    2    3    4    5    6    7    8 
## 2703 3276 2753 2782 2494 2381 2586 2807

# Confident assignments based on (i) a large log-fold change
# and (ii) not being a doublet, see below.
```
```r
table(hash.stats$Best[hash.stats$Confident])
## 
##    1    2    3    4    5    6    7    8 
## 2349 2779 2458 2275 2091 1994 2176 2458
```

Of greater interest here is how we can use the hashing information to detect doublets. This is achieved by reporting the log-fold change between the count for the second HTO and the estimated contribution from ambient contamination. A large log-fold change indicates that the second HTO still has an above-expected abundance, consistent with a doublet containing HTOs from two samples. We use outlier detection to explicitly identify putative doublets as those barcode libraries that have large log-fold changes; this is visualized in Figure 16.6, which shows a clear separation between the putative singlets and doublets.

```r
summary(hash.stats$Doublet)
##    Mode   FALSE    TRUE 
## logical   18744    3038

colors <- rep("grey", nrow(hash.stats))
colors[hash.stats$Doublet] <- "red"
colors[hash.stats$Confident] <- "black"

plot(hash.stats$LogFC, hash.stats$LogFC2,
    xlab="Log fold-change from best to second HTO",
    ylab="Log fold-change of second HTO over ambient",
    col=colors)
```

Figure 16.6: Log-fold change of the second-most abundant HTO over ambient contamination, compared to the log-fold change of the first HTO over the second HTO. Each point represents a cell; potential doublets are shown in red while confidently assigned singlets are shown in black.

16.4.3 Guilt by association for unmarked doublets

One obvious limitation of this approach is that doublets of cells marked with the same HTO are not detected. In a simple multiplexing experiment involving N samples with similar numbers of cells, we would expect around 1/N of all doublets to involve cells from the same sample. For typical values of N of 5 to 12, this may still be enough to cause the formation of misleading doublet clusters even after the majority of known doublets are removed.
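Under the usual assumption that a doublet pairs two randomly chosen cells, the expected fraction of these hashing-invisible doublets is sum(p^2) for sample proportions p, which reduces to 1/N for N equally sized samples. A quick base-R check (the same_sample_fraction() helper below is just for illustration):

```r
# Probability that both cells of a random doublet come from the same
# sample, i.e., the fraction of doublets invisible to hashing.
same_sample_fraction <- function(n.per.sample) {
    p <- proportions(n.per.sample)
    sum(p^2)
}

frac.equal  <- same_sample_fraction(rep(1000, 8))           # 1/8 = 0.125
frac.skewed <- same_sample_fraction(c(5000, rep(1000, 7)))  # worse: ~0.22
```

Skewed pool sizes increase this fraction, which is one more reason to load similar numbers of cells per sample.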
To avoid this, we recover the remaining intra-sample doublets based on their similarity with known doublets in gene expression space (hence, "guilt by association"). We illustrate by loading the gene expression data for this study:

```r
sce.hash <- StoeckiusHashingData(mode="human")

# Subsetting to all barcodes detected as cells. Requires an intersection,
# because `hto.sce` and `sce.hash` are not the same dimensions!
common <- intersect(colnames(sce.hash), rownames(hash.stats))
sce.hash <- sce.hash[,common]
colData(sce.hash) <- hash.stats[common,]

sce.hash
## class: SingleCellExperiment 
## dim: 27679 20830 
## metadata(0):
## assays(1): counts
## rownames(27679): A1BG A1BG-AS1 ... hsa-mir-8072 snoU2-30
## rowData names(0):
## colnames(20830): ACTGCTCAGGTGTTAA ATGAGGGAGATGTTAG ... CACCAGGCACACAGAG
##   CTCGGAGTCTAACTCT
## colData names(7): Total Best ... Doublet Confident
## reducedDimNames(0):
## altExpNames(0):
```

For each cell, we calculate the proportion of its nearest neighbors that are known doublets. Intra-sample doublets should have high proportions under the assumption that their gene expression profiles are similar to inter-sample doublets involving the same combination of cell states/types. Unlike in Section 16.3, the use of experimentally derived doublet calls avoids any assumptions about the relative quantity of total RNA or the probability of doublet formation across different cell types.

```r
# Performing a quick-and-dirty analysis to get some PCs to use
# for nearest neighbor detection inside recoverDoublets().
```
```r
library(scran)
sce.hash <- logNormCounts(sce.hash)
dec.hash <- modelGeneVar(sce.hash)
top.hash <- getTopHVGs(dec.hash, n=1000)
set.seed(1011110)
sce.hash <- runPCA(sce.hash, subset_row=top.hash, ncomponents=20)

# Recovering the intra-sample doublets:
hashed.doublets <- recoverDoublets(sce.hash, use.dimred="PCA",
    doublets=sce.hash$Doublet, samples=table(sce.hash$Best))
hashed.doublets
## DataFrame with 20830 rows and 3 columns
##       proportion     known predicted
##        <numeric> <logical> <logical>
## 1           0.10      TRUE     FALSE
## 2           0.02     FALSE     FALSE
## 3           0.14     FALSE     FALSE
## 4           0.08     FALSE     FALSE
## 5           0.18     FALSE     FALSE
## ...          ...       ...       ...
## 20826       0.04     FALSE     FALSE
## 20827       0.04     FALSE     FALSE
## 20828       0.02     FALSE     FALSE
## 20829       0.04     FALSE     FALSE
## 20830       0.10     FALSE     FALSE
```

The recoverDoublets() function also returns explicit intra-sample doublet predictions based on the doublet neighbor proportions. Given the distribution of cells across multiplexed samples in samples=, we estimate the fraction of doublets that would not be observed from the HTO counts.
This is converted into an absolute number based on the number of observed doublets; the top set of libraries with the highest proportions are then marked as intra-sample doublets (Figure 16.7).

set.seed(1000101001)
sce.hash <- runTSNE(sce.hash, dimred="PCA")
sce.hash$proportion <- hashed.doublets$proportion
sce.hash$predicted <- hashed.doublets$predicted

gridExtra::grid.arrange(
    plotTSNE(sce.hash, colour_by="proportion") + ggtitle("Doublet proportions"),
    plotTSNE(sce.hash, colour_by="Doublet") + ggtitle("Known doublets"),
    ggcells(sce.hash) +
        geom_point(aes(x=TSNE.1, y=TSNE.2), color="grey") +
        geom_point(aes(x=TSNE.1, y=TSNE.2), color="red",
            data=function(x) x[x$predicted,]) +
        ggtitle("Predicted intra-sample doublets"),
    ncol=2
)

Figure 16.7: t-SNE plots for gene expression data from the cell hashing study, where each point is a cell and is colored by the doublet proportion (top left), whether or not it is a known inter-sample doublet (top right), and whether it is a predicted intra-sample doublet (bottom left).

As an aside, it is worth noting that even known doublets may not necessarily have high doublet neighbor proportions. This is typically observed for doublets involving cells of the same type or state, which are effectively intermixed in gene expression space with the corresponding singlets. The latter are much more abundant in most (well-controlled) experiments, which results in low proportions for the doublets involved (Figure 16.8). This effect can generally be ignored given the mostly harmless nature of these doublets.
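The neighbor proportion itself is simple to compute: for each cell, take its k nearest neighbors in PC space and record the fraction that are known doublets. The following is an illustrative base-R sketch of that quantity on toy coordinates, not the actual recoverDoublets() internals (which use approximate nearest-neighbor search on real PCs):

```r
# Toy PC coordinates: two clusters of 50 singlets each, plus 10 known
# (inter-sample) doublets lying between them.
set.seed(1)
pcs <- rbind(
    matrix(rnorm(100, mean=0), ncol=2),   # cluster A singlets
    matrix(rnorm(100, mean=5), ncol=2),   # cluster B singlets
    matrix(rnorm(20, mean=2.5), ncol=2)   # known doublets
)
known <- c(rep(FALSE, 100), rep(TRUE, 10))

# For each cell, the fraction of its k nearest neighbors that are known doublets.
k <- 10
d <- as.matrix(dist(pcs))
diag(d) <- Inf  # exclude each cell from its own neighbor list
prop <- apply(d, 1, function(x) mean(known[order(x)[seq_len(k)]]))

# Cells near the doublet cluster get high proportions; singlets get low ones.
mean(prop[known]) > mean(prop[!known])
```

Cells without a doublet label but with high values of prop are the intra-sample doublet candidates.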
state <- ifelse(hashed.doublets$predicted, "predicted",
    ifelse(hashed.doublets$known, "known", "singlet"))
ggplot(as.data.frame(hashed.doublets)) +
    geom_violin(aes(x=state, y=proportion))

Figure 16.8: Distribution of doublet neighbor proportions for all cells in the cell hashing study, stratified by doublet detection status.

16.5 Further comments

Doublet detection procedures should only be applied to libraries generated in the same experimental batch. It is obviously impossible for doublets to form between two cells that were captured separately. Thus, some understanding of the experimental design is required prior to the use of the above functions. This avoids unnecessary concerns about the validity of batch-specific clusters that cannot possibly consist of doublets.

It is also difficult to interpret doublet predictions in data containing cellular trajectories. By definition, cells in the middle of a trajectory are always intermediate between other cells and are liable to be incorrectly detected as doublets. Some protection is provided by the non-linear nature of many real trajectories, which reduces the risk of simulated doublets coinciding with real cells in computeDoubletDensity(). One can also put more weight on the relative library sizes in findDoubletClusters() instead of relying solely on N, under the assumption that sudden spikes in RNA content are unlikely in a continuous biological process.

The best solution to the doublet problem is experimental - that is, to avoid generating them in the first place. This should be a consideration when designing scRNA-seq experiments, where the desire to obtain large numbers of cells at minimum cost should be weighed against the general deterioration in data quality and reliability when doublets become more frequent.
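The scaling from observed inter-sample doublets to expected intra-sample doublets (used earlier to convert proportions into an absolute number) follows from a simple argument: if both cells of a doublet are drawn independently with sample proportions p_i, the doublet is invisible to the HTOs with probability sum(p_i^2). This back-of-the-envelope base-R sketch illustrates the idea under that independence assumption; recoverDoublets() performs an analogous calculation from samples= internally:

```r
# Cells per multiplexed sample (as would be supplied via samples=);
# the counts here are made up for illustration.
n.cells <- c(sampleA=5000, sampleB=3000, sampleC=2000)
p <- n.cells / sum(n.cells)

# Probability that both cells of a random doublet come from the same sample
# (undetectable by HTOs) versus from different samples (detectable).
p.intra <- sum(p^2)
p.inter <- 1 - p.intra

# Scale an observed inter-sample doublet count to an expected intra-sample count.
observed.inter <- 500
expected.intra <- observed.inter * p.intra / p.inter
round(expected.intra)
```

With more balanced samples, p.intra shrinks towards 1/(number of samples), so multiplexing more samples leaves fewer doublets unaccounted for.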
Session Info View session info R version 4.0.4 (2021-02-15) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.12-books/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.12-books/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] DropletUtils_1.10.3 scRNAseq_2.4.0 [3] BiocSingular_1.6.0 scran_1.18.5 [5] scater_1.18.6 ggplot2_3.3.3 [7] scDblFinder_1.4.0 SingleCellExperiment_1.12.0 [9] SummarizedExperiment_1.20.0 Biobase_2.50.0 [11] GenomicRanges_1.42.0 GenomeInfoDb_1.26.4 [13] IRanges_2.24.1 S4Vectors_0.28.1 [15] BiocGenerics_0.36.0 MatrixGenerics_1.2.1 [17] matrixStats_0.58.0 BiocStyle_2.18.1 [19] rebook_1.0.0 loaded via a namespace (and not attached): [1] AnnotationHub_2.22.0 BiocFileCache_1.14.0 [3] igraph_1.2.6 lazyeval_0.2.2 [5] BiocParallel_1.24.1 digest_0.6.27 [7] ensembldb_2.14.0 htmltools_0.5.1.1 [9] viridis_0.5.1 fansi_0.4.2 [11] magrittr_2.0.1 memoise_2.0.0 [13] limma_3.46.0 Biostrings_2.58.0 [15] R.utils_2.10.1 askpass_1.1 [17] prettyunits_1.1.1 colorspace_2.0-0 [19] blob_1.2.1 rappdirs_0.3.3 [21] xfun_0.22 dplyr_1.0.5 [23] callr_3.5.1 crayon_1.4.1 [25] RCurl_1.98-1.3 jsonlite_1.7.2 [27] graph_1.68.0 glue_1.4.2 [29] gtable_0.3.0 zlibbioc_1.36.0 [31] XVector_0.30.0 DelayedArray_0.16.2 [33] Rhdf5lib_1.12.1 HDF5Array_1.18.1 [35] scales_1.1.1 pheatmap_1.0.12 [37] DBI_1.1.1 edgeR_3.32.1 [39] Rcpp_1.0.6 viridisLite_0.3.0 [41] xtable_1.8-4 progress_1.2.2 [43] dqrng_0.2.1 bit_4.0.4 [45] rsvd_1.0.3 httr_1.4.2 [47] RColorBrewer_1.1-2 ellipsis_0.3.1 [49] R.methodsS3_1.8.1 pkgconfig_2.0.3 [51] XML_3.99-0.6 farver_2.1.0 [53] scuttle_1.0.4 
CodeDepends_0.6.5 [55] sass_0.3.1 dbplyr_2.1.0 [57] locfit_1.5-9.4 utf8_1.2.1 [59] tidyselect_1.1.0 labeling_0.4.2 [61] rlang_0.4.10 later_1.1.0.1 [63] AnnotationDbi_1.52.0 munsell_0.5.0 [65] BiocVersion_3.12.0 tools_4.0.4 [67] cachem_1.0.4 xgboost_1.3.2.1 [69] generics_0.1.0 RSQLite_2.2.4 [71] ExperimentHub_1.16.0 evaluate_0.14 [73] stringr_1.4.0 fastmap_1.1.0 [75] yaml_2.2.1 processx_3.4.5 [77] knitr_1.31 bit64_4.0.5 [79] purrr_0.3.4 AnnotationFilter_1.14.0 [81] sparseMatrixStats_1.2.1 mime_0.10 [83] R.oo_1.24.0 xml2_1.3.2 [85] biomaRt_2.46.3 compiler_4.0.4 [87] beeswarm_0.3.1 curl_4.3 [89] interactiveDisplayBase_1.28.0 tibble_3.1.0 [91] statmod_1.4.35 bslib_0.2.4 [93] stringi_1.5.3 highr_0.8 [95] ps_1.6.0 GenomicFeatures_1.42.2 [97] lattice_0.20-41 bluster_1.0.0 [99] ProtGenerics_1.22.0 Matrix_1.3-2 [101] vctrs_0.3.6 rhdf5filters_1.2.0 [103] pillar_1.5.1 lifecycle_1.0.0 [105] BiocManager_1.30.10 jquerylib_0.1.3 [107] BiocNeighbors_1.8.2 data.table_1.14.0 [109] cowplot_1.1.1 bitops_1.0-6 [111] irlba_2.3.3 rtracklayer_1.50.0 [113] httpuv_1.5.5 R6_2.5.0 [115] bookdown_0.21 promises_1.2.0.1 [117] gridExtra_2.3 vipor_0.4.5 [119] codetools_0.2-18 assertthat_0.2.1 [121] rhdf5_2.34.0 openssl_1.4.3 [123] withr_2.4.1 GenomicAlignments_1.26.0 [125] Rsamtools_2.6.0 GenomeInfoDbData_1.2.4 [127] hms_1.0.0 grid_4.0.4 [129] beachmat_2.6.4 rmarkdown_2.7 [131] DelayedMatrixStats_1.12.3 Rtsne_0.15 [133] shiny_1.6.0 ggbeeswarm_0.6.0

Chapter 17 Cell cycle assignment

17.1 Motivation

On
occasion, it can be desirable to determine cell cycle activity from scRNA-seq data. In and of itself, the distribution of cells across phases of the cell cycle is not usually informative, but we can use this to determine if there are differences in proliferation between subpopulations or across treatment conditions. Many of the key events in the cell cycle (e.g., passage through checkpoints) are driven by post-translational mechanisms and thus not directly visible in transcriptomic data; nonetheless, there are enough changes in expression that can be exploited to determine cell cycle phase. We demonstrate using the 416B dataset, which is known to contain actively cycling cells after oncogene induction.

#--- loading ---#
library(scRNAseq)
sce.416b <- LunSpikeInData(which="416b")
sce.416b$block <- factor(sce.416b$block)

#--- gene-annotation ---#
library(AnnotationHub)
ens.mm.v97 <- AnnotationHub()[["AH73905"]]
rowData(sce.416b)$ENSEMBL <- rownames(sce.416b)
rowData(sce.416b)$SYMBOL <- mapIds(ens.mm.v97, keys=rownames(sce.416b),
    keytype="GENEID", column="SYMBOL")
rowData(sce.416b)$SEQNAME <- mapIds(ens.mm.v97, keys=rownames(sce.416b),
    keytype="GENEID", column="SEQNAME")

library(scater)
rownames(sce.416b) <- uniquifyFeatureNames(rowData(sce.416b)$ENSEMBL,
    rowData(sce.416b)$SYMBOL)

#--- quality-control ---#
mito <- which(rowData(sce.416b)$SEQNAME=="MT")
stats <- perCellQCMetrics(sce.416b, subsets=list(Mt=mito))
qc <- quickPerCellQC(stats, percent_subsets=c("subsets_Mt_percent",
    "altexps_ERCC_percent"), batch=sce.416b$block)
sce.416b <- sce.416b[,!qc$discard]

#--- normalization ---#
library(scran)
sce.416b <- computeSumFactors(sce.416b)
sce.416b <- logNormCounts(sce.416b)

#--- variance-modelling ---#
dec.416b <- modelGeneVarWithSpikes(sce.416b, "ERCC", block=sce.416b$block)
chosen.hvgs <- getTopHVGs(dec.416b,
prop=0.1)

#--- batch-correction ---#
library(limma)
assay(sce.416b, "corrected") <- removeBatchEffect(logcounts(sce.416b),
    design=model.matrix(~sce.416b$phenotype), batch=sce.416b$block)

#--- dimensionality-reduction ---#
sce.416b <- runPCA(sce.416b, ncomponents=10, subset_row=chosen.hvgs,
    exprs_values="corrected", BSPARAM=BiocSingular::ExactParam())
set.seed(1010)
sce.416b <- runTSNE(sce.416b, dimred="PCA", perplexity=10)

#--- clustering ---#
my.dist <- dist(reducedDim(sce.416b, "PCA"))
my.tree <- hclust(my.dist, method="ward.D2")
library(dynamicTreeCut)
my.clusters <- unname(cutreeDynamic(my.tree, distM=as.matrix(my.dist),
    minClusterSize=10, verbose=0))
colLabels(sce.416b) <- factor(my.clusters)
sce.416b

## class: SingleCellExperiment 
## dim: 46604 185
## metadata(0):
## assays(3): counts logcounts corrected
## rownames(46604): 4933401J01Rik Gm26206 ... CAAA01147332.1
##   CBFB-MYH11-mcherry
## rowData names(4): Length ENSEMBL SYMBOL SEQNAME
## colnames(185): SLX-9555.N701_S502.C89V9ANXX.s_1.r_1
##   SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 ...
##   SLX-11312.N712_S507.H5H5YBBXX.s_8.r_1
##   SLX-11312.N712_S517.H5H5YBBXX.s_8.r_1
## colData names(11): Source Name cell line ... sizeFactor label
## reducedDimNames(2): PCA TSNE
## altExpNames(2): ERCC SIRV

17.2 Using the cyclins

The cyclins control progression through the cell cycle and have well-characterized patterns of expression across cell cycle phases. Cyclin D is expressed throughout but peaks at G1; cyclin E is expressed highest in the G1/S transition; cyclin A is expressed across S and G2; and cyclin B is expressed highest in late G2 and mitosis (Morgan 2007). The expression of cyclins can help to determine the relative cell cycle activity in each cluster (Figure 17.1). For example, most cells in cluster 1 are likely to be in G1 while the other clusters are scattered across the later phases.
library(scater)
cyclin.genes <- grep("^Ccn[abde][0-9]$", rowData(sce.416b)$SYMBOL)
cyclin.genes <- rownames(sce.416b)[cyclin.genes]
cyclin.genes

##  [1] "Ccnb3" "Ccna2" "Ccna1" "Ccne2" "Ccnd2" "Ccne1" "Ccnd1" "Ccnb2" "Ccnb1"
## [10] "Ccnd3"

plotHeatmap(sce.416b, order_columns_by="label",
    cluster_rows=FALSE, features=sort(cyclin.genes))

Figure 17.1: Heatmap of the log-normalized expression values of the cyclin genes in the 416B dataset. Each column represents a cell that is sorted by the cluster of origin.

We quantify these observations with standard DE methods (Chapter 11) to test for upregulation of each cyclin between clusters, which would imply that a subpopulation contains more cells in the corresponding cell cycle phase. The same logic applies to comparisons between treatment conditions as described in Chapter 14. For example, we can infer that cluster 4 has the highest proportion of cells in the S and G2 phases based on higher expression of cyclins A2 and B1, respectively.
library(scran)
markers <- findMarkers(sce.416b, subset.row=cyclin.genes,
    test.type="wilcox", direction="up")
markers[[4]]

## DataFrame with 10 rows and 7 columns
##             Top     p.value         FDR summary.AUC     AUC.1     AUC.2
##       <integer>   <numeric>   <numeric>   <numeric> <numeric> <numeric>
## Ccna2         1 4.47082e-09 4.47082e-08    0.996337  0.996337  0.641822
## Ccnd1         1 2.27713e-04 5.69283e-04    0.822981  0.368132  0.822981
## Ccnb1         1 1.19027e-07 5.95137e-07    0.949634  0.949634  0.519669
## Ccnb2         2 3.87799e-07 1.29266e-06    0.934066  0.934066  0.781573
## Ccna1         4 2.96992e-02 5.93985e-02    0.535714  0.535714  0.495342
## Ccne2         5 6.56983e-02 1.09497e-01    0.641941  0.641941  0.447205
## Ccne1         6 5.85979e-01 8.37113e-01    0.564103  0.564103  0.366460
## Ccnd3         7 9.94578e-01 1.00000e+00    0.402930  0.402930  0.283644
## Ccnd2         8 9.99993e-01 1.00000e+00    0.306548  0.134615  0.327122
## Ccnb3        10 1.00000e+00 1.00000e+00    0.500000  0.500000  0.500000
##           AUC.3
##       <numeric>
## Ccna2  0.925595
## Ccnd1  0.776786
## Ccnb1  0.934524
## Ccnb2  0.898810
## Ccna1  0.535714
## Ccne2  0.455357
## Ccne1  0.473214
## Ccnd3  0.273810
## Ccnd2  0.306548
## Ccnb3  0.500000

While straightforward to implement and interpret, this approach assumes that cyclin expression is unaffected by biological processes other than the cell cycle. This is a strong assumption in highly heterogeneous populations where cyclins may perform cell-type-specific roles. For example, using the Grun HSC dataset (Grun et al. 2016), we see an upregulation of cyclin D2 in sorted HSCs (Figure 17.2) that is consistent with a particular reliance on D-type cyclins in these cells (Steinman 2002; Kozar et al. 2004). Similar arguments apply to other genes with annotated functions in cell cycle, e.g., from relevant Gene Ontology terms.
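The AUC columns above have a direct probabilistic reading: the chance that a randomly chosen cell from the cluster of interest expresses the gene more highly than a randomly chosen cell from the other cluster. For a two-group comparison, the AUC is simply the Wilcoxon rank-sum statistic divided by the number of cell pairs, as this small base-R sketch (on toy expression vectors, not the 416B data) demonstrates:

```r
set.seed(42)
x <- rnorm(50, mean=2)  # expression in the cluster of interest
y <- rnorm(80, mean=0)  # expression in another cluster

# AUC = W / (n1 * n2), where W counts pairs with x > y (ties counting 0.5).
W <- wilcox.test(x, y)$statistic
auc <- as.numeric(W) / (length(x) * length(y))

# Equivalent direct pairwise computation:
auc2 <- mean(outer(x, y, ">") + 0.5 * outer(x, y, "=="))
stopifnot(isTRUE(all.equal(auc, auc2)))
```

An AUC of 0.5 corresponds to no separation between the groups, which is why Ccnb3 (undetected in all clusters) sits at exactly 0.5 in the table above.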
#--- data-loading ---#
library(scRNAseq)
sce.grun.hsc <- GrunHSCData(ensembl=TRUE)

#--- gene-annotation ---#
library(AnnotationHub)
ens.mm.v97 <- AnnotationHub()[["AH73905"]]
anno <- select(ens.mm.v97, keys=rownames(sce.grun.hsc),
    keytype="GENEID", columns=c("SYMBOL", "SEQNAME"))
rowData(sce.grun.hsc) <- anno[match(rownames(sce.grun.hsc), anno$GENEID),]

#--- quality-control ---#
library(scuttle)
stats <- perCellQCMetrics(sce.grun.hsc)
qc <- quickPerCellQC(stats, batch=sce.grun.hsc$protocol,
    subset=grepl("sorted", sce.grun.hsc$protocol))
sce.grun.hsc <- sce.grun.hsc[,!qc$discard]

#--- normalization ---#
library(scran)
set.seed(101000110)
clusters <- quickCluster(sce.grun.hsc)
sce.grun.hsc <- computeSumFactors(sce.grun.hsc, clusters=clusters)
sce.grun.hsc <- logNormCounts(sce.grun.hsc)

#--- variance-modelling ---#
set.seed(00010101)
dec.grun.hsc <- modelGeneVarByPoisson(sce.grun.hsc)
top.grun.hsc <- getTopHVGs(dec.grun.hsc, prop=0.1)

#--- dimensionality-reduction ---#
set.seed(101010011)
sce.grun.hsc <- denoisePCA(sce.grun.hsc, technical=dec.grun.hsc,
    subset.row=top.grun.hsc)
sce.grun.hsc <- runTSNE(sce.grun.hsc, dimred="PCA")

#--- clustering ---#
snn.gr <- buildSNNGraph(sce.grun.hsc, use.dimred="PCA")
colLabels(sce.grun.hsc) <- factor(igraph::cluster_walktrap(snn.gr)$membership)

# Switching the row names for a nicer plot.
rownames(sce.grun.hsc) <- uniquifyFeatureNames(rownames(sce.grun.hsc),
    rowData(sce.grun.hsc)$SYMBOL)

cyclin.genes <- grep("^Ccn[abde][0-9]$", rowData(sce.grun.hsc)$SYMBOL)
cyclin.genes <- rownames(sce.grun.hsc)[cyclin.genes]

plotHeatmap(sce.grun.hsc, order_columns_by="label",
    cluster_rows=FALSE, features=sort(cyclin.genes),
    colour_columns_by="protocol")

Figure 17.2: Heatmap of the log-normalized expression values of the cyclin genes in the Grun HSC dataset.
Each column represents a cell that is sorted by the cluster of origin and extraction protocol.

Admittedly, this is merely a symptom of a more fundamental issue - that the cell cycle is not independent of the other processes that are occurring in a cell. This will be a recurring theme throughout the chapter, which suggests that cell cycle inferences are best used in comparisons between closely related cell types where there are fewer changes elsewhere that might interfere with interpretation.

17.3 Using reference profiles

Cell cycle assignment can be considered a specialized case of cell annotation, which suggests that the strategies described in Chapter 12 can also be applied here. Given a reference dataset containing cells of known cell cycle phase, we could use methods like SingleR to determine the phase of each cell in a test dataset. We demonstrate on a reference of mouse ESCs from Buettner et al. (2015) that were sorted by cell cycle phase prior to scRNA-seq.

library(scRNAseq)
sce.ref <- BuettnerESCData()
sce.ref <- logNormCounts(sce.ref)
sce.ref

## class: SingleCellExperiment 
## dim: 38293 288
## metadata(0):
## assays(2): counts logcounts
## rownames(38293): ENSMUSG00000000001 ENSMUSG00000000003 ...
##   ENSMUSG00000097934 ENSMUSG00000097935
## rowData names(3): EnsemblTranscriptID AssociatedGeneName GeneLength
## colnames(288): G1_cell1_count G1_cell2_count ... G2M_cell95_count
##   G2M_cell96_count
## colData names(2): phase sizeFactor
## reducedDimNames(0):
## altExpNames(1): ERCC

We will restrict the annotation process to a subset of genes with a priori known roles in cell cycle. This aims to avoid detecting markers for other biological processes that happen to be correlated with the cell cycle in the reference dataset, which would reduce classification performance if those processes are absent or uncorrelated in the test dataset.

# Find genes that are cell cycle-related.
library(org.Mm.eg.db)
cycle.anno <- select(org.Mm.eg.db, keytype="GOALL", keys="GO:0007049",
    columns="ENSEMBL")[,"ENSEMBL"]
str(cycle.anno)

##  chr [1:2830] "ENSMUSG00000026842" "ENSMUSG00000026842" ...

We use the SingleR() function to assign labels to the 416B data based on the cell cycle phases in the ESC reference. Cluster 1 mostly consists of G1 cells while the other clusters have more cells in the other phases, which is broadly consistent with our conclusions from the cyclin-based analysis. Unlike the cyclin-based analysis, this approach yields “absolute” assignments of cell cycle phase that do not need to be interpreted relative to other cells in the same dataset.

# Switching row names back to Ensembl to match the reference.
test.data <- logcounts(sce.416b)
rownames(test.data) <- rowData(sce.416b)$ENSEMBL

library(SingleR)
assignments <- SingleR(test.data, ref=sce.ref,
    label=sce.ref$phase, de.method="wilcox", restrict=cycle.anno)
tab <- table(assignments$labels, colLabels(sce.416b))
tab

##      
##        1  2  3  4
##   G1  71  5 18  1
##   G2M  2 60  2 13
##   S    5  4  4  0

The key assumption here is that the cell cycle effect is orthogonal to other aspects of biological heterogeneity like cell type. This justifies the use of a reference involving cell types that are quite different from the cells in the test dataset, provided that the cell cycle transcriptional program is conserved across datasets (Bertoli, Skotheim, and Bruin 2013; Conboy et al. 2007). However, it is not difficult to find holes in this reasoning - for example, Lef1 is detected as one of the top markers to distinguish G1 from G2/M in the reference but has no detectable expression in the 416B dataset (Figure 17.3).
More generally, non-orthogonality can introduce biases where, e.g., one cell type is consistently misclassified as being in a particular phase because it happens to be more similar to that phase's profile in the reference.

gridExtra::grid.arrange(
    plotExpression(sce.ref, features="ENSMUSG00000027985", x="phase"),
    plotExpression(sce.416b, features="Lef1", x="label"),
    ncol=2)

Figure 17.3: Distribution of log-normalized expression values for Lef1 in the reference dataset (left) and in the 416B dataset (right).

Thus, a healthy dose of skepticism is required when interpreting these assignments. Our hope is that any systematic assignment error is consistent across clusters and conditions such that it cancels out in comparisons of phase frequencies, which is the more interesting analysis anyway. Indeed, while the availability of absolute phase calls may be more appealing, it may not make much practical difference to the conclusions if the frequencies are ultimately interpreted in a relative sense (e.g., using a chi-squared test).

# Test for differences in phase distributions between clusters 1 and 2.
chisq.test(tab[,1:2])

## 
##  Pearson's Chi-squared test
## 
## data:  tab[, 1:2]
## X-squared = 112, df = 2, p-value <2e-16

17.4 Using the cyclone() classifier

The method described by Scialdone et al. (2015) is yet another approach for classifying cells into cell cycle phases. Using a reference dataset, we first compute the sign of the difference in expression between each pair of genes. Pairs with changes in the sign across cell cycle phases are chosen as markers. Cells in a test dataset can then be classified into the appropriate phase, based on whether the observed sign for each marker pair is consistent with one phase or another. This approach is implemented in the cyclone() function from the scran package, which also contains a pre-trained set of marker pairs for mouse and human data.
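The marker-pair idea can be illustrated in a few lines of base R: a pair (A, B) is informative if the sign of A - B is consistent within each phase but flips between phases, and a new cell then votes for the phase whose sign pattern it matches. This is a deliberately simplified sketch (two phases, a tiny hand-made reference), not the trained classifier or scoring scheme shipped with scran:

```r
# Toy reference: rows are genes, columns are cells of known phase.
ref <- cbind(
    G1.1  = c(geneA=5, geneB=1, geneC=2),
    G1.2  = c(geneA=6, geneB=2, geneC=1),
    G2M.1 = c(geneA=1, geneB=4, geneC=5),
    G2M.2 = c(geneA=2, geneB=5, geneC=6)
)
phase <- c("G1", "G1", "G2M", "G2M")

# A pair is a marker if its sign is consistent within each phase
# (0 marks inconsistency) and flips between the two phases.
pairs <- t(combn(rownames(ref), 2))
signs.by.phase <- sapply(c("G1", "G2M"), function(p) {
    apply(pairs, 1, function(pr) {
        s <- sign(ref[pr[1], phase == p] - ref[pr[2], phase == p])
        if (length(unique(s)) == 1) s[1] else 0
    })
})
is.marker <- signs.by.phase[, "G1"] * signs.by.phase[, "G2M"] == -1

# Classify a new cell by majority vote over the marker pairs.
cell <- c(geneA=4, geneB=1, geneC=2)  # constructed to resemble G1
votes <- apply(pairs[is.marker, , drop=FALSE], 1,
    function(pr) sign(cell[pr[1]] - cell[pr[2]]))
called <- if (mean(votes == signs.by.phase[is.marker, "G1"]) > 0.5) "G1" else "G2M"
called
```

Because only the within-cell ordering of gene pairs is used, the classification is invariant to monotonic transformations of each cell's expression profile, which is what makes the pre-trained pairs transferable across datasets.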
set.seed(100)
library(scran)
mm.pairs <- readRDS(system.file("exdata", "mouse_cycle_markers.rds",
    package="scran"))

# Using Ensembl IDs to match up with the annotation in 'mm.pairs'.
assignments <- cyclone(sce.416b, mm.pairs, gene.names=rowData(sce.416b)$ENSEMBL)

The phase assignment result for each cell in the 416B dataset is shown in Figure 17.4. For each cell, a higher score for a phase corresponds to a higher probability that the cell is in that phase. We focus on the G1 and G2/M scores as these are the most informative for classification.

plot(assignments$score$G1, assignments$score$G2M,
    xlab="G1 score", ylab="G2/M score", pch=16)

Figure 17.4: Cell cycle phase scores from applying the pair-based classifier on the 416B dataset. Each point represents a cell, plotted according to its scores for G1 and G2/M phases.

Cells are classified as being in G1 phase if the G1 score is above 0.5 and greater than the G2/M score; in G2/M phase if the G2/M score is above 0.5 and greater than the G1 score; and in S phase if neither score is above 0.5. We see that the results are quite similar to those from SingleR(), which is reassuring.

table(assignments$phases, colLabels(sce.416b))

##      
##        1  2  3  4
##   G1  74  8 20  0
##   G2M  1 48  0 13
##   S    3 13  4  1

The same considerations and caveats described for the SingleR-based approach are also applicable here. From a practical perspective, cyclone() takes much longer but does not require an explicit reference as the marker pairs are already computed.

17.5 Removing cell cycle effects

17.5.1 Comments

For some time, it was popular to regress out the cell cycle phase prior to downstream analyses like clustering. The aim was to remove uninteresting variation due to cell cycle, thus improving resolution of other biological processes. With the benefit of hindsight, we do not consider cell cycle adjustment to be necessary for routine applications.
In most scenarios, the cell cycle is a minor factor of variation, secondary to stronger factors like cell type identity. Moreover, most strategies for removal run into problems when cell cycle activity varies across cell types or conditions; this is not uncommon with, e.g., increased proliferation of T cells upon activation (Richard et al. 2018), changes in cell cycle phase progression across development (Roccio et al. 2013) and correlations between cell cycle and fate decisions (Soufi and Dalton 2016). Nonetheless, we will discuss some approaches for mitigating the cell cycle effect in this section.

17.5.2 With linear regression and friends

Here, we treat each phase as a separate batch and apply any of the batch correction strategies described in Chapter 28.8. The most common approach is to use a linear model to simply regress out any effect associated with the assigned phases, as shown below in Figure 17.5 via regressBatches(). Similarly, any functions that support blocking can use the phase assignments as a blocking factor, e.g., block= in modelGeneVarWithSpikes().

library(batchelor)
dec.nocycle <- modelGeneVarWithSpikes(sce.416b, "ERCC",
    block=assignments$phases)
reg.nocycle <- regressBatches(sce.416b, batch=assignments$phases)

set.seed(100011)
reg.nocycle <- runPCA(reg.nocycle, exprs_values="corrected",
    subset_row=getTopHVGs(dec.nocycle, prop=0.1))

# Shape points by induction status.
relabel <- c("onco", "WT")[factor(sce.416b$phenotype)]
scaled <- scale_shape_manual(values=c(onco=4, WT=16))

gridExtra::grid.arrange(
    plotPCA(sce.416b, colour_by=I(assignments$phases),
        shape_by=I(relabel)) + ggtitle("Before") + scaled,
    plotPCA(reg.nocycle, colour_by=I(assignments$phases),
        shape_by=I(relabel)) + ggtitle("After") + scaled,
    ncol=2
)

Figure 17.5: PCA plots before and after regressing out the cell cycle effect in the 416B dataset, based on the phase assignments from cyclone().
Each point is a cell and is colored by its inferred phase and shaped by oncogene induction status.

Alternatively, one could regress on the classification scores to account for any ambiguity in assignment. An example using cyclone() scores is shown below in Figure 17.6 but the same procedure can be used with any classification step that yields some confidence per label - for example, the correlation-based scores from SingleR().

design <- model.matrix(~as.matrix(assignments$scores))
dec.nocycle2 <- modelGeneVarWithSpikes(sce.416b, "ERCC", design=design)
reg.nocycle2 <- regressBatches(sce.416b, design=design)

set.seed(100011)
reg.nocycle2 <- runPCA(reg.nocycle2, exprs_values="corrected",
    subset_row=getTopHVGs(dec.nocycle2, prop=0.1))

plotPCA(reg.nocycle2, colour_by=I(assignments$phases),
    point_size=3, shape_by=I(relabel)) + scaled

Figure 17.6: PCA plot on the residuals after regression on the cell cycle phase scores from cyclone() in the 416B dataset. Each point is a cell and is colored by its inferred phase and shaped by oncogene induction status.

The main assumption of regression is that the cell cycle is consistent across different aspects of cellular heterogeneity (Section 13.4). In particular, we assume that each cell type contains the same distribution of cells across phases as well as a constant magnitude of the cell cycle effect on expression. Violations will lead to incomplete removal or, at worst, overcorrection that introduces spurious signal - even in the absence of any cell cycle effect! For example, if two subpopulations differ in their cell cycle phase distribution, regression will always apply a non-zero adjustment to all DE genes between those subpopulations.

If this type of adjustment is truly necessary, it is safest to apply it separately to the subset of cells in each cluster. This weakens the consistency assumptions as we do not require the same behavior across all cell types in the population.
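Conceptually, regressing out the phase is just a per-gene linear model fit followed by keeping the residuals. This base-R sketch shows the idea on a toy log-expression matrix; it is not the actual regressBatches() implementation, which additionally handles centering, sparse matrices and other details:

```r
set.seed(7)
phase <- factor(sample(c("G1", "S", "G2M"), 100, replace=TRUE))

# Toy log-expression for 3 genes: gene1 carries a phase effect, the others do not.
logexpr <- rbind(
    gene1 = rnorm(100) + 2 * (phase == "G2M"),
    gene2 = rnorm(100),
    gene3 = rnorm(100)
)

# Fit expression ~ phase for every gene at once and keep the residuals.
design <- model.matrix(~phase)
fit <- lm.fit(design, t(logexpr))
corrected <- t(fit$residuals)

# The phase-associated shift in gene1 is removed: per-phase means of the
# corrected values are zero (up to floating-point error).
means.before <- tapply(logexpr["gene1", ], phase, mean)
means.after <- tapply(corrected["gene1", ], phase, mean)
```

The within-phase means of the residuals are exactly zero by construction, which is precisely why overcorrection occurs when a genuine cell type difference is confounded with the phase composition: that difference is flattened along with the cycle effect.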
Alternatively, we could use other methods that are more robust to differences in composition (Figure 17.7), though this becomes somewhat complicated if we want to correct for both cell cycle and batch at the same time. Gene-based analyses should use the uncorrected data with blocking where possible (Section 13.8), which provides a sanity check that protects against distortions introduced by the adjustment.

set.seed(100011)
reg.nocycle3 <- fastMNN(sce.416b, batch=assignments$phases)

plotReducedDim(reg.nocycle3, dimred="corrected", point_size=3,
    colour_by=I(assignments$phases), shape_by=I(relabel)) + scaled

Figure 17.7: Plot of the corrected PCs after applying fastMNN() with respect to the cell cycle phase assignments from cyclone() in the 416B dataset. Each point is a cell and is colored by its inferred phase and shaped by oncogene induction status.

17.5.3 Removing cell cycle-related genes

A gentler alternative to regression is to remove the genes that are associated with cell cycle. Here, we compute the percentage of variance explained by the cell cycle phase in the expression profile for each gene, and we remove genes with high percentages from the dataset prior to further downstream analyses. We demonstrate below with the Leng et al. (2015) dataset containing phase-sorted ESCs, where removal of marker genes detected between phases eliminates the separation between G1 and S populations (Figure 17.8).

library(scRNAseq)
sce.leng <- LengESCData(ensembl=TRUE)

# Performing a default analysis without any removal:
sce.leng <- logNormCounts(sce.leng, assay.type="normcounts")
dec.leng <- modelGeneVar(sce.leng)
top.hvgs <- getTopHVGs(dec.leng, n=1000)
sce.leng <- runPCA(sce.leng, subset_row=top.hvgs)

# Identifying the likely cell cycle genes between phases,
# using an arbitrary threshold of 5%.
library(scater)
diff <- getVarianceExplained(sce.leng, "Phase")
discard <- diff > 5
summary(discard)

##    Phase        
##  Mode :logical  
##  FALSE:12730    
##  TRUE :2758     
##  NA's :2057

# ... and repeating the PCA without them.
top.hvgs2 <- getTopHVGs(dec.leng[which(!discard),], n=1000)
sce.nocycle <- runPCA(sce.leng, subset_row=top.hvgs2)

fill <- geom_point(pch=21, colour="grey") # Color the NA points.
gridExtra::grid.arrange(
    plotPCA(sce.leng, colour_by="Phase") + ggtitle("Before") + fill,
    plotPCA(sce.nocycle, colour_by="Phase") + ggtitle("After") + fill,
    ncol=2
)

Figure 17.8: PCA plots of the Leng ESC dataset, generated before and after removal of cell cycle-related genes. Each point corresponds to a cell that is colored by the sorted cell cycle phase.

The same procedure can also be applied to the inferred phases or classification scores from, e.g., cyclone(). This is demonstrated in Figure 17.9 with our trusty 416B dataset, where the cell cycle variation is removed without sacrificing the differences due to oncogene induction.

# Need to wrap the phase vector in a DataFrame:
diff <- getVarianceExplained(sce.416b, DataFrame(assignments$phases))
discard <- diff > 5
summary(discard)

##  assignments.phases
##  Mode :logical     
##  FALSE:19590       
##  TRUE :4207        
##  NA's :22807

set.seed(100011)
top.discard <- getTopHVGs(dec.416b[which(!discard),], n=1000)
sce.416b.discard <- runPCA(sce.416b, subset_row=top.discard)

plotPCA(sce.416b.discard, colour_by=I(assignments$phases),
    shape_by=I(relabel), point_size=3) + scaled

Figure 17.9: PCA plots of the 416B dataset, generated before and after removal of cell cycle-related genes. Each point corresponds to a cell that is colored by the inferred phase and shaped by oncogene induction status.

This approach discards any gene with significant cell cycle variation, regardless of how much interesting variation it might also contain from other processes.
In this respect, it is more conservative than regression as no attempt is made to salvage any information from such genes, possibly resulting in the loss of relevant biological signal. However, gene removal is more amenable to fine-tuning: any lost heterogeneity can be easily identified by examining the discarded genes, and users can choose to recover interesting genes even if they are correlated with known/inferred cell cycle phase. Most importantly, direct removal of genes is much less likely to introduce spurious signal compared to regression when the consistency assumptions are not applicable.

17.5.4 Using contrastive PCA

Alternatively, we might consider a more sophisticated approach called contrastive PCA (Abid et al. 2018). This aims to identify patterns that are enriched in our test dataset - in this case, the 416B data - compared to a control dataset in which cell cycle is the dominant factor of variation. We demonstrate below using the scPCA package (Boileau, Hejazi, and Dudoit 2020) where we use the subset of wild-type 416B cells as our control, based on the expectation that an untreated cell line in culture has little else to do but divide. This yields low-dimensional coordinates in which the cell cycle effect within the oncogene-induced and wild-type groups is reduced without removing the difference between groups (Figure 17.10).

top.hvgs <- getTopHVGs(dec.416b, prop=0.1)
wild <- sce.416b$phenotype=="wild type phenotype"

set.seed(100)
library(scPCA)
con.out <- scPCA(
    target=t(logcounts(sce.416b)[top.hvgs,]),
    background=t(logcounts(sce.416b)[top.hvgs,wild]),
    penalties=0, n_eigen=10, contrasts=100)

# Visualizing the results in a t-SNE.
sce.con <- sce.416b
reducedDim(sce.con, "cPCA") <- con.out$x
sce.con <- runTSNE(sce.con, dimred="cPCA")

# Making the labels easier to read.
relabel <- c("onco", "WT")[factor(sce.416b$phenotype)]
scaled <- scale_color_manual(values=c(onco="red", WT="black"))

gridExtra::grid.arrange(
    plotTSNE(sce.416b, colour_by=I(assignments$phases)) + ggtitle("Before (416b)"),
    plotTSNE(sce.416b, colour_by=I(relabel)) + scaled,
    plotTSNE(sce.con, colour_by=I(assignments$phases)) + ggtitle("After (416b)"),
    plotTSNE(sce.con, colour_by=I(relabel)) + scaled,
    ncol=2
)

Figure 17.10: \(t\)-SNE plots for the 416B dataset before and after contrastive PCA. Each point is a cell and is colored according to its inferred cell cycle phase (left) or oncogene induction status (right).

The strength of this approach lies in its ability to accurately remove the cell cycle effect based on its magnitude in the control dataset. This avoids loss of heterogeneity associated with other processes that happen to be correlated with the cell cycle. The requirements for the control dataset are also quite loose - there is no need to know the cell cycle phase of each cell a priori, and indeed, we can manufacture a like-for-like control by subsetting our dataset to a homogeneous cluster in which the only detectable factor of variation is the cell cycle. (See Chapter 41 for another demonstration of cPCA to remove the cell cycle effect.) In fact, any consistent but uninteresting variation can be eliminated in this manner as long as it is captured by the control.

The downside is that the magnitude of variation in the control dataset must accurately reflect that in the test dataset, requiring more care in choosing the former. As a result, the procedure is more sensitive to quantitative differences between datasets compared to SingleR() or cyclone() during cell cycle phase assignment. This makes it difficult to use control datasets from different scRNA-seq technologies or biological systems, as a mismatch in the covariance structure may lead to insufficient or excessive correction.
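The core idea behind contrastive PCA can be sketched from scratch: find the top eigenvectors of the difference between the target covariance and an `alpha`-scaled background covariance. This is only a toy illustration on simulated two-dimensional data; scPCA's actual implementation additionally handles sparsity penalties and tuning of the contrastive parameter.

```r
# Nuisance variation (e.g., cell cycle) dominates dimension 1 in both datasets;
# interesting variation lives in dimension 2 of the target only.
set.seed(100)
background <- cbind(rnorm(500, sd = 3), rnorm(500, sd = 1))
target <- cbind(rnorm(500, sd = 3), rnorm(500, sd = 2))

# Contrastive PCA: eigenvectors of cov(target) - alpha * cov(background).
contrastive_pca <- function(target, background, alpha = 1, k = 1) {
    delta <- cov(target) - alpha * cov(background)
    eigen(delta)$vectors[, seq_len(k), drop = FALSE]
}

v <- contrastive_pca(target, background)

# Ordinary PCA on the target alone would favor dimension 1 (sd 3 > sd 2).
pc1 <- eigen(cov(target))$vectors[, 1]
```

Subtracting the background covariance cancels the shared nuisance axis, so the top contrastive direction loads on dimension 2 even though ordinary PCA would pick dimension 1.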
At worst, any interesting variation that is inadvertently contained in the control will also be removed.

Session Info

R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.12-books/R/lib/libRblas.so
LAPACK: /home/biocbuild/bbs-3.12-books/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4  stats  graphics  grDevices  utils  datasets
[8] methods   base

other attached packages:
 [1] scPCA_1.4.0                 batchelor_1.6.2            
 [3] bluster_1.0.0               SingleR_1.4.1              
 [5] org.Mm.eg.db_3.12.0         ensembldb_2.14.0           
 [7] AnnotationFilter_1.14.0     GenomicFeatures_1.42.2     
 [9] AnnotationDbi_1.52.0        scRNAseq_2.4.0             
[11] scran_1.18.5                scater_1.18.6              
[13] ggplot2_3.3.3               SingleCellExperiment_1.12.0
[15] SummarizedExperiment_1.20.0 Biobase_2.50.0             
[17] GenomicRanges_1.42.0        GenomeInfoDb_1.26.4        
[19] IRanges_2.24.1              S4Vectors_0.28.1           
[21] BiocGenerics_0.36.0         MatrixGenerics_1.2.1       
[23] matrixStats_0.58.0          BiocStyle_2.18.1           
[25] rebook_1.0.0               

loaded via a namespace (and not attached):
  [1] AnnotationHub_2.22.0          BiocFileCache_1.14.0         
  [3] igraph_1.2.6                  lazyeval_0.2.2               
  [5] listenv_0.8.0                 BiocParallel_1.24.1          
  [7] digest_0.6.27                 htmltools_0.5.1.1            
  [9] viridis_0.5.1                 fansi_0.4.2                  
 [11] magrittr_2.0.1                memoise_2.0.0                
 [13] cluster_2.1.0                 limma_3.46.0                 
 [15] globals_0.14.0                Biostrings_2.58.0            
 [17] askpass_1.1                   prettyunits_1.1.1            
 [19] colorspace_2.0-0              blob_1.2.1                   
 [21] rappdirs_0.3.3                rbibutils_2.0                
 [23] xfun_0.22                     dplyr_1.0.5                  
 [25] callr_3.5.1                   crayon_1.4.1                 
 [27] RCurl_1.98-1.3                jsonlite_1.7.2               
 [29] graph_1.68.0                  glue_1.4.2                   
 [31] gtable_0.3.0                  zlibbioc_1.36.0              
 [33] XVector_0.30.0                DelayedArray_0.16.2          
 [35] coop_0.6-2                    kernlab_0.9-29               
 [37] BiocSingular_1.6.0            future.apply_1.7.0           
 [39] abind_1.4-5                   scales_1.1.1                 
 [41] pheatmap_1.0.12               DBI_1.1.1                    
 [43] edgeR_3.32.1                  Rcpp_1.0.6                   
 [45] viridisLite_0.3.0             xtable_1.8-4                 
 [47] progress_1.2.2                dqrng_0.2.1                  
 [49] bit_4.0.4                     rsvd_1.0.3                   
 [51] ResidualMatrix_1.0.0          httr_1.4.2                   
 [53] RColorBrewer_1.1-2            ellipsis_0.3.1               
 [55] pkgconfig_2.0.3               XML_3.99-0.6                 
 [57] farver_2.1.0                  scuttle_1.0.4                
 [59] CodeDepends_0.6.5             sass_0.3.1                   
 [61] dbplyr_2.1.0                  locfit_1.5-9.4               
 [63] utf8_1.2.1                    labeling_0.4.2               
 [65] tidyselect_1.1.0              rlang_0.4.10                 
 [67] later_1.1.0.1                 munsell_0.5.0                
 [69] BiocVersion_3.12.0            tools_4.0.4                  
 [71] cachem_1.0.4                  generics_0.1.0               
 [73] RSQLite_2.2.4                 ExperimentHub_1.16.0         
 [75] evaluate_0.14                 stringr_1.4.0                
 [77] fastmap_1.1.0                 yaml_2.2.1                   
 [79] processx_3.4.5                knitr_1.31                   
 [81] bit64_4.0.5                   purrr_0.3.4                  
 [83] future_1.21.0                 sparseMatrixStats_1.2.1      
 [85] mime_0.10                     origami_1.0.3                
 [87] xml2_1.3.2                    biomaRt_2.46.3               
 [89] compiler_4.0.4                beeswarm_0.3.1               
 [91] curl_4.3                      interactiveDisplayBase_1.28.0
 [93] tibble_3.1.0                  statmod_1.4.35               
 [95] bslib_0.2.4                   stringi_1.5.3                
 [97] highr_0.8                     ps_1.6.0                     
 [99] RSpectra_0.16-0               lattice_0.20-41              
[101] ProtGenerics_1.22.0           Matrix_1.3-2                 
[103] vctrs_0.3.6                   pillar_1.5.1                 
[105] lifecycle_1.0.0               BiocManager_1.30.10          
[107] Rdpack_2.1.1                  jquerylib_0.1.3              
[109] BiocNeighbors_1.8.2           data.table_1.14.0            
[111] cowplot_1.1.1                 bitops_1.0-6                 
[113] irlba_2.3.3                   httpuv_1.5.5                 
[115] rtracklayer_1.50.0            R6_2.5.0                     
[117] bookdown_0.21                 promises_1.2.0.1             
[119] gridExtra_2.3                 parallelly_1.24.0            
[121] vipor_0.4.5                   codetools_0.2-18             
[123] assertthat_0.2.1              openssl_1.4.3                
[125] withr_2.4.1                   GenomicAlignments_1.26.0     
[127] Rsamtools_2.6.0               GenomeInfoDbData_1.2.4       
[129] hms_1.0.0                     grid_4.0.4                   
[131] beachmat_2.6.4                rmarkdown_2.7                
[133] DelayedMatrixStats_1.12.3     Rtsne_0.15                   
[135] shiny_1.6.0                   ggbeeswarm_0.6.0             

Chapter 18 Trajectory Analysis

18.1 Overview

Many biological processes manifest as a continuum of dynamic changes in the cellular state. The most obvious example is that of differentiation into increasingly specialized cell subtypes, but we might also consider phenomena like the cell cycle or immune cell activation that are accompanied by gradual changes in the cell's transcriptome. We characterize these processes from single-cell expression data by identifying a "trajectory", i.e., a path through the high-dimensional expression space that traverses the various cellular states associated with a continuous process like differentiation. In the simplest case, a trajectory will be a simple path from one point to another, but we can also observe more complex trajectories that branch to multiple endpoints.

The "pseudotime" is defined as the positioning of cells along the trajectory that quantifies the relative activity or progression of the underlying biological process. For example, the pseudotime for a differentiation trajectory might represent the degree of differentiation from a pluripotent cell to a terminal state where cells with larger pseudotime values are more differentiated. This metric allows us to tackle questions related to the global population structure in a more quantitative manner. The most common application is to fit models to gene expression against the pseudotime to identify the genes responsible for generating the trajectory in the first place, especially around interesting branch events.

In this section, we will demonstrate several different approaches to trajectory analysis using the haematopoietic stem cell (HSC) dataset from Nestorowa et al. (2016).
#--- data-loading ---#
library(scRNAseq)
sce.nest <- NestorowaHSCData()

#--- gene-annotation ---#
library(AnnotationHub)
ens.mm.v97 <- AnnotationHub()[["AH73905"]]
anno <- select(ens.mm.v97, keys=rownames(sce.nest),
    keytype="GENEID", columns=c("SYMBOL", "SEQNAME"))
rowData(sce.nest) <- anno[match(rownames(sce.nest), anno$GENEID),]

#--- quality-control-grun ---#
library(scater)
stats <- perCellQCMetrics(sce.nest)
qc <- quickPerCellQC(stats, percent_subsets="altexps_ERCC_percent")
sce.nest <- sce.nest[,!qc$discard]

#--- normalization ---#
library(scran)
set.seed(101000110)
clusters <- quickCluster(sce.nest)
sce.nest <- computeSumFactors(sce.nest, clusters=clusters)
sce.nest <- logNormCounts(sce.nest)

#--- variance-modelling ---#
set.seed(00010101)
dec.nest <- modelGeneVarWithSpikes(sce.nest, "ERCC")
top.nest <- getTopHVGs(dec.nest, prop=0.1)

#--- dimensionality-reduction ---#
set.seed(101010011)
sce.nest <- denoisePCA(sce.nest, technical=dec.nest, subset.row=top.nest)
sce.nest <- runTSNE(sce.nest, dimred="PCA")

#--- clustering ---#
snn.gr <- buildSNNGraph(sce.nest, use.dimred="PCA")
colLabels(sce.nest) <- factor(igraph::cluster_walktrap(snn.gr)$membership)

sce.nest
## class: SingleCellExperiment 
## dim: 46078 1656 
## metadata(0):
## assays(2): counts logcounts
## rownames(46078): ENSMUSG00000000001 ENSMUSG00000000003 ...
##   ENSMUSG00000107391 ENSMUSG00000107392
## rowData names(3): GENEID SYMBOL SEQNAME
## colnames(1656): HSPC_025 HSPC_031 ... Prog_852 Prog_810
## colData names(4): cell.type FACS sizeFactor label
## reducedDimNames(3): diffusion PCA TSNE
## altExpNames(1): ERCC

18.2 Obtaining pseudotime orderings

18.2.1 Overview

The pseudotime is simply a number describing the relative position of a cell in the trajectory, where cells with larger values are considered to be "after" their counterparts with smaller values.
Branched trajectories will typically be associated with multiple pseudotimes, one per path through the trajectory; these values are not usually comparable across paths. It is worth noting that "pseudotime" is a rather unfortunate term as it may not have much to do with real-life time. For example, one can imagine a continuum of stress states where cells move in either direction (or not) over time, but the pseudotime simply describes the transition from one end of the continuum to the other. In trajectories describing time-dependent processes like differentiation, a cell's pseudotime value may be used as a proxy for its relative age, but only if directionality can be inferred (see Section 18.4).

The big question is how to identify the trajectory from high-dimensional expression data and map individual cells onto it. A vast array of algorithms is available for doing so (Saelens et al. 2019), and while we will demonstrate only a few specific methods below, many of the concepts apply generally to all trajectory inference strategies. A more philosophical question is whether a trajectory even exists in the dataset. One can interpret a continuum of states as a series of closely related (but distinct) subpopulations, or two well-separated clusters as the endpoints of a trajectory with rare intermediates. The choice between these two perspectives is left to the analyst based on which is more useful, convenient or biologically sensible.

18.2.2 Cluster-based minimum spanning tree

18.2.2.1 Basic steps

The TSCAN algorithm uses a simple yet effective approach to trajectory reconstruction. It uses the clustering to summarize the data into a smaller set of discrete units, computes a centroid for each cluster by averaging the coordinates of its member cells, and then forms the minimum spanning tree (MST) across those centroids.
The MST is simply an undirected acyclic graph that passes through each centroid exactly once and is thus the most parsimonious structure that captures the transitions between clusters. We demonstrate below on the Nestorowa et al. (2016) dataset, computing the cluster centroids in the low-dimensional PC space to take advantage of data compaction and denoising (Chapter 9).

library(scater)
by.cluster <- aggregateAcrossCells(sce.nest, ids=colLabels(sce.nest))
centroids <- reducedDim(by.cluster, "PCA")

# Set clusters=NULL as we have already aggregated above.
library(TSCAN)
mst <- createClusterMST(centroids, clusters=NULL)
mst
## IGRAPH 0b84c0a UNW- 9 8 -- 
## + attr: name (v/c), coordinates (v/x), weight (e/n), gain (e/n)
## + edges from 0b84c0a (vertex names):
## [1] 1--3 1--9 2--3 2--6 3--4 5--8 5--9 6--7

For reference, we can draw the same lines between the centroids in a \(t\)-SNE plot (Figure 18.1). This allows us to identify interesting clusters such as those at bifurcations or endpoints. Note that the MST in mst was generated from distances in the PC space and is merely being visualized here in the \(t\)-SNE space, for the same reasons as discussed in Section 9.5.5. This may occasionally result in some visually unappealing plots if the original ordering of clusters in the PC space is not preserved in the \(t\)-SNE space.

line.data <- reportEdges(by.cluster, mst=mst, clusters=NULL, use.dimred="TSNE")

plotTSNE(sce.nest, colour_by="label") + 
    geom_line(data=line.data, mapping=aes(x=dim1, y=dim2, group=edge))

Figure 18.1: \(t\)-SNE plot of the Nestorowa HSC dataset, where each point is a cell and is colored according to its cluster assignment. The MST obtained using a TSCAN-like algorithm is overlaid on top.

We obtain a pseudotime ordering by projecting the cells onto the MST with mapCellsToEdges().
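To make the MST construction concrete, here is a from-scratch sketch of Prim's algorithm over a handful of made-up 2D centroids. This is only an illustration of the underlying graph computation; createClusterMST() does this (and more, such as edge gain calculations) for real cluster centroids.

```r
# Toy cluster centroids in two dimensions; names are purely illustrative.
centroids <- rbind(A = c(0, 0), B = c(1, 0), C = c(2, 0), D = c(1, 1))
d <- as.matrix(dist(centroids))

# Prim's algorithm: repeatedly attach the closest outside node to the tree.
prim_mst <- function(d) {
    n <- nrow(d)
    in_tree <- c(TRUE, rep(FALSE, n - 1))
    edges <- NULL
    while (any(!in_tree)) {
        sub <- d[in_tree, !in_tree, drop = FALSE]
        hit <- which(sub == min(sub), arr.ind = TRUE)[1, ]
        from <- rownames(sub)[hit[1]]
        to <- colnames(sub)[hit[2]]
        edges <- rbind(edges, c(from, to))
        in_tree[match(to, rownames(d))] <- TRUE
    }
    edges
}

mst_edges <- prim_mst(d)
```

The result is the minimum set of edges (one fewer than the number of centroids) connecting all clusters with the smallest total distance, which is exactly the parsimony property described above.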
More specifically, we move each cell onto the closest edge of the MST; the pseudotime is then calculated as the distance along the MST to this new position from a "root node" with orderCells(). For our purposes, we will arbitrarily pick one of the endpoint nodes as the root, though a more careful choice based on the biological annotation of each node may yield more relevant orderings (e.g., picking a node corresponding to a more pluripotent state).

map.tscan <- mapCellsToEdges(sce.nest, mst=mst, use.dimred="PCA")
tscan.pseudo <- orderCells(map.tscan, mst)
head(tscan.pseudo)
##          7  8
## [1,] 33.90 NA
## [2,] 53.34 NA
## [3,] 47.95 NA
## [4,] 59.92 NA
## [5,] 54.36 NA
## [6,] 70.73 NA

Here, multiple sets of pseudotimes are reported for a branched trajectory. Each column contains one pseudotime ordering and corresponds to one path from the root node to one of the terminal nodes - the name of the terminal node that defines this path is recorded in the column names of tscan.pseudo. Some cells may be shared across multiple paths, in which case they will have the same pseudotime in those paths. We can then examine the pseudotime ordering on our desired visualization as shown in Figure 18.2.

# Taking the rowMeans just gives us a single pseudo-time for all cells. Cells
# in segments that are shared across paths have the same pseudo-time value for
# those paths anyway, so the rowMeans doesn't change anything.
common.pseudo <- rowMeans(tscan.pseudo, na.rm=TRUE)

plotTSNE(sce.nest, colour_by=I(common.pseudo), 
        text_by="label", text_colour="red") +
    geom_line(data=line.data, mapping=aes(x=dim1, y=dim2, group=edge))

Figure 18.2: \(t\)-SNE plot of the Nestorowa HSC dataset, where each point is a cell and is colored according to its pseudotime value. The MST obtained using TSCAN is overlaid on top.

Alternatively, this entire series of calculations can be conveniently performed with the quickPseudotime() wrapper.
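The geometry of the cell-to-edge projection is simple enough to sketch directly: each cell is projected onto the edge (a line segment between two centroids) and clamped to the segment's ends, and its pseudotime on that edge is the distance from the root endpoint. The coordinates below are invented for illustration; mapCellsToEdges() performs this over all MST edges and picks the closest one.

```r
# Project a point onto the segment from v1 to v2, clamping to the endpoints.
project_onto_edge <- function(cell, v1, v2) {
    edge <- v2 - v1
    t <- sum((cell - v1) * edge) / sum(edge^2)
    t <- min(max(t, 0), 1) # clamp so the projection stays on the segment
    v1 + t * edge
}

# A toy cell near an edge running from the root (0,0) to a centroid (2,0).
pos <- project_onto_edge(cell = c(0.5, 1), v1 = c(0, 0), v2 = c(2, 0))

# Pseudotime along this edge is the distance from the root endpoint.
pseudotime <- sqrt(sum((pos - c(0, 0))^2))
```

Here the cell lands at (0.5, 0) on the edge, giving a pseudotime of 0.5 from the root.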
This executes all steps from aggregateAcrossCells() to orderCells() and returns a list with the output from each step.

pseudo.all <- quickPseudotime(sce.nest, use.dimred="PCA")
head(pseudo.all$ordering)
##          7  8
## [1,] 33.90 NA
## [2,] 53.34 NA
## [3,] 47.95 NA
## [4,] 59.92 NA
## [5,] 54.36 NA
## [6,] 70.73 NA

18.2.2.2 Tweaking the MST

The MST can be constructed with an "outgroup" to avoid connecting unrelated populations in the dataset. Based on the OMEGA cluster concept from Street et al. (2018), the outgroup is an artificial cluster that is equidistant from all real clusters at some threshold value. If the original MST sans the outgroup contains an edge that is longer than twice the threshold, the addition of the outgroup will cause the MST to instead be routed through the outgroup. We can subsequently break up the MST into subcomponents (i.e., a minimum spanning forest) by removing the outgroup. We set outgroup=TRUE to introduce an outgroup with an automatically determined threshold distance, which breaks up our previous MST into two components (Figure 18.3).

pseudo.og <- quickPseudotime(sce.nest, use.dimred="PCA", outgroup=TRUE)

set.seed(10101)
plot(pseudo.og$mst)

Figure 18.3: Minimum spanning tree of the Nestorowa clusters after introducing an outgroup.

Another option is to construct the MST based on distances between mutual nearest neighbor (MNN) pairs between clusters (Section 13.5). This exploits the fact that MNN pairs occur at the boundaries of two clusters, with short distances between paired cells meaning that the clusters are "touching". In this mode, the MST focuses on the connectivity between clusters, which can be different from the shortest distance between centroids (Figure 18.4). Consider, for example, a pair of elongated clusters that are immediately adjacent to each other.
A large distance between their centroids precludes the formation of the obvious edge with the default MST construction; in contrast, the MNN distance is very low and encourages the MST to create a connection between the two clusters.

pseudo.mnn <- quickPseudotime(sce.nest, use.dimred="PCA", with.mnn=TRUE)
mnn.pseudo <- rowMeans(pseudo.mnn$ordering, na.rm=TRUE)

plotTSNE(sce.nest, colour_by=I(mnn.pseudo), text_by="label", text_colour="red") +
    geom_line(data=pseudo.mnn$connected$TSNE, 
        mapping=aes(x=dim1, y=dim2, group=edge))

Figure 18.4: \(t\)-SNE plot of the Nestorowa HSC dataset, where each point is a cell and is colored according to its pseudotime value. The MST obtained using TSCAN with MNN distances is overlaid on top.

18.2.2.3 Further comments

The TSCAN approach derives several advantages from using clusters to form the MST. The most obvious is that of computational speed as calculations are performed over clusters rather than cells. The relative coarseness of clusters protects against the per-cell noise that would otherwise reduce the stability of the MST. The interpretation of the MST is also straightforward as it uses the same clusters as the rest of the analysis, allowing us to recycle previous knowledge about the biological annotations assigned to each cluster.

However, the reliance on clustering is a double-edged sword. If the clusters are not sufficiently granular, it is possible for TSCAN to overlook variation that occurs inside a single cluster. The MST is obliged to pass through each cluster exactly once, which can lead to excessively circuitous paths in overclustered datasets as well as the formation of irrelevant paths between distinct cell subpopulations if the outgroup threshold is too high. The MST also fails to handle more complex events such as "bubbles" (i.e., a bifurcation and then a merging) or cycles.
18.2.3 Principal curves

To identify a trajectory, one might imagine simply "fitting" a one-dimensional curve so that it passes through the cloud of cells in the high-dimensional expression space. This is the idea behind principal curves (Hastie and Stuetzle 1989), effectively a non-linear generalization of PCA where the axes of most variation are allowed to bend. We use the slingshot package (Street et al. 2018) to fit a single principal curve to the Nestorowa dataset, again using the low-dimensional PC coordinates for denoising and speed. This yields a pseudotime ordering of cells based on their relative positions when projected onto the curve.

library(slingshot)
sce.sling <- slingshot(sce.nest, reducedDim='PCA')
head(sce.sling$slingPseudotime_1)
## [1] 89.44 76.34 87.88 76.93 82.41 72.10

We can then visualize the path taken by the fitted curve in any desired space with embedCurves(). For example, Figure 18.5 shows the behavior of the principal curve on the \(t\)-SNE plot. Again, users should note that this may not always yield aesthetically pleasing plots if the \(t\)-SNE algorithm decides to arrange clusters so that they no longer match the ordering of the pseudotimes.

embedded <- embedCurves(sce.sling, "TSNE")
embedded <- slingCurves(embedded)[[1]] # only 1 path.
embedded <- data.frame(embedded$s[embedded$ord,])

plotTSNE(sce.sling, colour_by="slingPseudotime_1") +
    geom_path(data=embedded, aes(x=Dim.1, y=Dim.2), size=1.2)

Figure 18.5: \(t\)-SNE plot of the Nestorowa HSC dataset where each point is a cell and is colored by the slingshot pseudotime ordering. The fitted principal curve is shown in black.

The previous call to slingshot() assumed that all cells in the dataset were part of a single curve. To accommodate more complex events like bifurcations, we use our previously computed cluster assignments to build a rough sketch for the global structure in the form of a MST across the cluster centroids.
Each path through the MST from a designated root node is treated as a lineage that contains cells from the associated clusters. Principal curves are then simultaneously fitted to all lineages with some averaging across curves to encourage consistency in shared clusters across lineages. This process yields a matrix of pseudotimes where each column corresponds to a lineage and contains the pseudotimes of all cells assigned to that lineage.

sce.sling2 <- slingshot(sce.nest, cluster=colLabels(sce.nest), reducedDim='PCA')
pseudo.paths <- slingPseudotime(sce.sling2)
head(pseudo.paths)
##          curve1 curve2 curve3
## HSPC_025 107.11     NA     NA
## HSPC_031  95.38  101.6  117.1
## HSPC_037 103.74  104.1  109.3
## HSPC_008  99.25  115.7  103.9
## HSPC_014 103.07  111.0  105.7
## HSPC_020     NA  124.0     NA

By using the MST as a scaffold for the global structure, slingshot() can accommodate branching events based on divergence in the principal curves (Figure 18.6). However, unlike TSCAN, the MST here is only used as a rough guide and does not define the final pseudotime.

sce.nest <- runUMAP(sce.nest, dimred="PCA")
reducedDim(sce.sling2, "UMAP") <- reducedDim(sce.nest, "UMAP")

shared.pseudo <- rowMeans(pseudo.paths, na.rm=TRUE)

# Need to loop over the paths and add each one separately.
gg <- plotUMAP(sce.sling2, colour_by=I(shared.pseudo))
embedded <- embedCurves(sce.sling2, "UMAP")
embedded <- slingCurves(embedded)
for (path in embedded) {
    embedded <- data.frame(path$s[path$ord,])
    gg <- gg + geom_path(data=embedded, aes(x=Dim.1, y=Dim.2), size=1.2)
}

gg

Figure 18.6: UMAP plot of the Nestorowa HSC dataset where each point is a cell and is colored by the average slingshot pseudotime across paths. The principal curves fitted to each lineage are shown in black.
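Stripped of its branching and bending, the machinery above reduces to ordinary PCA: a principal "curve" that stays straight is just the first principal component, and the pseudotime is each cell's position along it. A toy sketch on simulated data (three made-up "genes" varying along a known progression) makes this degenerate case explicit.

```r
# Simulate 100 cells whose expression of three genes changes linearly
# along a known progression; variable names are purely illustrative.
set.seed(7)
traj <- seq(0, 1, length.out = 100)                  # true progression
expr <- cbind(traj * 2, traj * -1, traj * 0.5) +     # three "genes"
    matrix(rnorm(300, sd = 0.05), ncol = 3)

pc <- prcomp(expr)
pseudo <- pc$x[, 1] # position along the straight "curve" = PC1 projection
```

The recovered ordering matches the true progression up to sign (the orientation of a PC, like that of a principal curve, is arbitrary); real principal curve fitting generalizes this by iteratively bending the line through the data.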
We can use slingBranchID() to determine whether a particular cell is shared across multiple curves or is unique to a subset of curves (i.e., is located "after" branching). In this case, we can see that most cells jump directly from a global common segment (1,2,3) to one of the curves (1, 2, 3) without any further hierarchy, i.e., no noticeable internal branch points.

curve.assignments <- slingBranchID(sce.sling2)
table(curve.assignments)
## curve.assignments
##     1   1,2 1,2,3   1,3     2   2,3     3 
##   435     6   892     2   222    39    60

For larger datasets, we can speed up the algorithm by approximating each principal curve with a fixed number of points. By default, slingshot() uses one point per cell to define the curve, which is unnecessarily precise when the number of cells is large. Applying an approximation with approx_points= reduces computational work without any major loss of precision in the pseudotime estimates.

sce.sling3 <- slingshot(sce.nest, cluster=colLabels(sce.nest), 
    reducedDim='PCA', approx_points=100)
pseudo.paths3 <- slingPseudotime(sce.sling3)
head(pseudo.paths3)
##          curve1 curve2 curve3
## HSPC_025 106.85     NA     NA
## HSPC_031  95.38  101.8  117.1
## HSPC_037 103.08  104.1  109.0
## HSPC_008  98.72  115.5  103.7
## HSPC_014 103.08  110.9  105.3
## HSPC_020     NA  123.5     NA

The MST can also be constructed with an OMEGA cluster to avoid connecting unrelated trajectories. This operates in the same manner as (and was the inspiration for) the outgroup for TSCAN's MST. Principal curves are fitted through each component individually, manifesting in the pseudotime matrix as paths that do not share any cells.
sce.sling4 <- slingshot(sce.nest, cluster=colLabels(sce.nest), 
    reducedDim='PCA', approx_points=100, omega=TRUE)
pseudo.paths4 <- slingPseudotime(sce.sling4)
head(pseudo.paths4)
##          curve1 curve2 curve3
## HSPC_025 111.83     NA     NA
## HSPC_031  96.16  99.78     NA
## HSPC_037 105.49 105.08     NA
## HSPC_008 102.00 117.28     NA
## HSPC_014 105.49 112.70     NA
## HSPC_020     NA 126.08     NA

shared.pseudo <- rowMeans(pseudo.paths, na.rm=TRUE)
gg <- plotUMAP(sce.sling4, colour_by=I(shared.pseudo))
embedded <- embedCurves(sce.sling4, "UMAP")
embedded <- slingCurves(embedded)
for (path in embedded) {
    embedded <- data.frame(path$s[path$ord,])
    gg <- gg + geom_path(data=embedded, aes(x=Dim.1, y=Dim.2), size=1.2)
}

gg

Figure 18.7: UMAP plot of the Nestorowa HSC dataset where each point is a cell and is colored by the average slingshot pseudotime across paths. The principal curves (black lines) were constructed with an OMEGA cluster.

The use of principal curves adds an extra layer of sophistication that complements the deficiencies of the cluster-based MST. The principal curve has the opportunity to model variation within clusters that would otherwise be overlooked; for example, slingshot could build a trajectory out of one cluster while TSCAN cannot. Conversely, the principal curves can "smooth out" circuitous paths in the MST for overclustered data, ignoring small differences between fine clusters that are unlikely to be relevant to the overall trajectory.

That said, the structure of the initial MST is still fundamentally dependent on the resolution of the clusters. One can arbitrarily change the number of branches from slingshot by tuning the cluster granularity, making it difficult to use the output as evidence for the presence/absence of subtle branch events. If the variation within clusters is uninteresting, the greater sensitivity of the curve fitting to such variation may yield irrelevant trajectories where the differences between clusters are masked.
Moreover, slingshot is no longer obliged to separate clusters in pseudotime, which may complicate interpretation of the trajectory with respect to existing cluster annotations.

18.3 Characterizing trajectories

18.3.1 Overview

Once we have constructed a trajectory, the next step is to characterize the underlying biology based on its DE genes. The aim here is to find the genes that exhibit significant changes in expression across pseudotime, as these are the most likely to have driven the formation of the trajectory in the first place. The overall strategy is to fit a model to the per-gene expression with respect to pseudotime, allowing us to obtain inferences about the significance of any association. We can then prioritize interesting genes as those with low \(p\)-values for further investigation. A wide range of options are available for model fitting but we will focus on the simplest approach of fitting a linear model to the log-expression values with respect to the pseudotime; we will discuss some of the more advanced models later.

18.3.2 Changes along a trajectory

To demonstrate, we will identify genes with significant changes with respect to one of the TSCAN pseudotimes in the Nestorowa data. We use the testPseudotime() utility to fit a natural spline to the expression of each gene, allowing us to model a range of non-linear relationships in the data. We then perform an analysis of variance (ANOVA) to determine if any of the spline coefficients are significantly non-zero, i.e., there is some significant trend with respect to pseudotime.
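The logic of this per-gene test can be sketched on simulated data, assuming nothing about the Nestorowa dataset itself: fit a natural spline of log-expression against pseudotime and compare it to a constant-expression null model via ANOVA.

```r
# Per-gene spline test in miniature: one gene trending along pseudotime,
# one gene that does not. All values here are simulated for illustration.
library(splines)
set.seed(42)
pt <- runif(200, 0, 100)                         # pseudotimes for 200 cells
trending <- sin(pt / 30) + rnorm(200, sd = 0.3)  # expression follows pseudotime
flat <- rnorm(200, sd = 0.3)                     # no trend

# ANOVA comparing a spline fit against a constant-expression null.
spline_pval <- function(y, pt, df = 3) {
    anova(lm(y ~ 1), lm(y ~ ns(pt, df = df)))$"Pr(>F)"[2]
}

p.trend <- spline_pval(trending, pt)
p.flat <- spline_pval(flat, pt)
```

The trending gene yields a vanishingly small \(p\)-value while the flat gene does not; testPseudotime() applies this kind of fit and test across every gene at once.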
library(TSCAN)
pseudo <- testPseudotime(sce.nest, pseudotime=tscan.pseudo[,1])
pseudo$SYMBOL <- rowData(sce.nest)$SYMBOL
pseudo[order(pseudo$p.value),]
## DataFrame with 46078 rows and 4 columns
##                         logFC      p.value          FDR      SYMBOL
##                     <numeric>    <numeric>    <numeric> <character>
## ENSMUSG00000029322 -0.0872517  0.00000e+00  0.00000e+00       Plac8
## ENSMUSG00000105231  0.0158450  0.00000e+00  0.00000e+00       Iglj3
## ENSMUSG00000076608  0.0118768 2.66618e-310 3.78002e-306       Igkj5
## ENSMUSG00000106668  0.0153919 2.54019e-300 2.70105e-296       Iglj1
## ENSMUSG00000022496  0.0229337 4.84822e-297 4.12418e-293    Tnfrsf17
## ...                       ...          ...          ...         ...
## ENSMUSG00000107367          0          NaN          NaN      Mir192
## ENSMUSG00000107372          0          NaN          NaN          NA
## ENSMUSG00000107381          0          NaN          NaN          NA
## ENSMUSG00000107382          0          NaN          NaN     Gm37714
## ENSMUSG00000107391          0          NaN          NaN        Rian

In practice, it is helpful to pair the spline-based ANOVA results with a fit from a much simpler model where we assume that there exists a linear relationship between expression and the pseudotime. This yields an interpretable summary of the overall direction of change in the logFC field above, complementing the more powerful spline-based model used to populate the p.value field. In contrast, the magnitude and sign of the spline coefficients cannot be easily interpreted.

To simplify the results, we will repeat our DE analysis after filtering out cluster 7. This cluster seems to contain a set of B cell precursors that are located at one end of the trajectory, causing immunoglobulins to dominate the set of DE genes and mask other interesting effects. Incidentally, this is the same cluster that was split into a separate component in the outgroup-based MST.

# Making a copy of our SCE and including the pseudotimes in the colData.
sce.nest2 <- sce.nest
sce.nest2$TSCAN.first <- tscan.pseudo[,1]
sce.nest2$TSCAN.second <- tscan.pseudo[,2]

# Discarding the offending cluster.
discard <- "7"
keep <- colLabels(sce.nest)!=discard
sce.nest2 <- sce.nest2[,keep]

# Testing against the first path again.
pseudo <- testPseudotime(sce.nest2, pseudotime=sce.nest2$TSCAN.first)
pseudo$SYMBOL <- rowData(sce.nest2)$SYMBOL
sorted <- pseudo[order(pseudo$p.value),]

Examination of the top downregulated genes suggests that this pseudotime represents a transition away from myeloid identity, based on the decrease in expression of genes such as Mpo and Plac8 (Figure 18.8).

up.left <- sorted[sorted$logFC < 0,]
head(up.left, 10)
## DataFrame with 10 rows and 4 columns
##                         logFC      p.value          FDR      SYMBOL
##                     <numeric>    <numeric>    <numeric> <character>
## ENSMUSG00000029322 -0.0951619  0.00000e+00  0.00000e+00       Plac8
## ENSMUSG00000009350 -0.1230460 6.07026e-245 1.28963e-240         Mpo
## ENSMUSG00000040314 -0.1247572 5.29679e-231 7.50202e-227        Ctsg
## ENSMUSG00000031722 -0.0772702 3.46925e-217 3.68521e-213          Hp
## ENSMUSG00000020125 -0.1055643 2.21357e-211 1.88109e-207       Elane
## ENSMUSG00000015937 -0.0439171 8.35182e-204 5.91448e-200       H2afy
## ENSMUSG00000035004 -0.0770322 8.34215e-201 5.06369e-197       Igsf6
## ENSMUSG00000045799 -0.0270218 8.85762e-197 4.70450e-193      Gm9800
## ENSMUSG00000026238 -0.0255206 1.31491e-194 6.20783e-191        Ptma
## ENSMUSG00000096544 -0.0264184 3.73314e-177 1.58621e-173      Gm4617

best <- head(up.left$SYMBOL, 10)
plotExpression(sce.nest2, features=best, swap_rownames="SYMBOL",
    x="TSCAN.first", colour_by="label")

Figure 18.8: Expression of the top 10 genes that decrease in expression with increasing pseudotime along the first path in the MST of the Nestorowa dataset. Each point represents a cell that is mapped to this path and is colored by the assigned cluster.

Conversely, the later parts of the pseudotime may correspond to a more stem-like state based on upregulation of genes like Hlf.
There is also increased expression of genes associated with the lymphoid lineage (e.g., Ltb), consistent with reduced commitment to the myeloid lineage at earlier pseudotime values. up.right &lt;- sorted[sorted$logFC &gt; 0,] head(up.right, 10) ## DataFrame with 10 rows and 4 columns ## logFC p.value FDR SYMBOL ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;character&gt; ## ENSMUSG00000047867 0.0869463 1.06721e-173 4.12235e-170 Gimap6 ## ENSMUSG00000028716 0.1023233 4.76874e-172 1.68853e-168 Pdzk1ip1 ## ENSMUSG00000086567 0.0294706 9.89947e-165 2.62893e-161 Gm2830 ## ENSMUSG00000027562 0.0646994 5.91659e-156 1.04748e-152 Car2 ## ENSMUSG00000006389 0.1096438 4.69440e-151 7.67174e-148 Mpl ## ENSMUSG00000037820 0.0702660 1.80467e-135 1.78327e-132 Tgm2 ## ENSMUSG00000003949 0.0934931 3.07633e-126 2.37661e-123 Hlf ## ENSMUSG00000061232 0.0191498 1.24511e-125 9.44725e-123 H2-K1 ## ENSMUSG00000044258 0.0557909 3.49882e-121 2.28715e-118 Ctla2a ## ENSMUSG00000024399 0.0998322 5.53699e-116 3.17928e-113 Ltb best &lt;- head(up.right$SYMBOL, 10) plotExpression(sce.nest2, features=best, swap_rownames=&quot;SYMBOL&quot;, x=&quot;TSCAN.first&quot;, colour_by=&quot;label&quot;) Figure 18.9: Expression of the top 10 genes that increase in expression with increasing pseudotime along the first path in the MST of the Nestorowa dataset. Each point represents a cell that is mapped to this path and is colored by the assigned cluster. Alternatively, a heatmap can be used to provide a more compact visualization (Figure 18.10). on.first.path &lt;- !is.na(sce.nest2$TSCAN.first) plotHeatmap(sce.nest2[,on.first.path], order_columns_by=&quot;TSCAN.first&quot;, colour_columns_by=&quot;label&quot;, features=head(up.right$SYMBOL, 50), center=TRUE, swap_rownames=&quot;SYMBOL&quot;) Figure 18.10: Heatmap of the expression of the top 50 genes that increase in expression with increasing pseudotime along the first path in the MST of the Nestorowa HSC dataset. 
Each column represents a cell that is mapped to this path and is ordered by its pseudotime value. 18.3.3 Changes between paths A more advanced analysis involves looking for differences in expression between paths of a branched trajectory. This is most interesting for cells close to the branch point between two or more paths, where the differential expression analysis may highlight the genes responsible for the branching event. The general strategy here is to fit one trend to the unique part of each path immediately following the branch point, followed by a comparison of the fits between paths. To this end, a particularly tempting approach is to perform another ANOVA with our spline-based model and test for significant differences in the spline parameters between paths. While this can be done with testPseudotime(), the magnitude of the pseudotime has little comparability across paths. A pseudotime value in one path of the MST does not, in general, have any relation to the same value in another path; the pseudotime can be arbitrarily “stretched” by factors such as the magnitude of DE or the density of cells, depending on the algorithm. This compromises any comparison of trends as we cannot reliably say that they are being fitted to comparable \\(x\\)-axes. Rather, we employ the much simpler ad hoc approach of fitting a spline to each trajectory and comparing the sets of DE genes. To demonstrate, we focus on the cluster containing the branch point in the Nestorowa-derived MST (Figure 18.2). We recompute the pseudotimes so that the root lies at the cluster center, allowing us to detect genes that are associated with the divergence of the branches. starter &lt;- &quot;3&quot; tscan.pseudo2 &lt;- orderCells(map.tscan, mst, start=starter) We visualize the reordered pseudotimes using only the cells in our branch point cluster (Figure 18.11), which allows us to see the correspondence between each pseudotime and the projected edges of the MST.
A more precise determination of the identity of each pseudotime can be achieved by examining the column names of tscan.pseudo2, which contains the name of the terminal node for the path of the MST corresponding to each column. # Making a copy and giving the paths more friendly names. sub.nest &lt;- sce.nest sub.nest$TSCAN.first &lt;- tscan.pseudo2[,1] sub.nest$TSCAN.second &lt;- tscan.pseudo2[,2] sub.nest$TSCAN.third &lt;- tscan.pseudo2[,3] # Subsetting to the desired cluster containing the branch point. keep &lt;- colLabels(sce.nest) == starter sub.nest &lt;- sub.nest[,keep] # Showing only the lines to/from our cluster of interest. line.data.sub &lt;- line.data[grepl(&quot;^3--&quot;, line.data$edge) | grepl(&quot;--3$&quot;, line.data$edge),] ggline &lt;- geom_line(data=line.data.sub, mapping=aes(x=dim1, y=dim2, group=edge)) gridExtra::grid.arrange( plotTSNE(sub.nest, colour_by=&quot;TSCAN.first&quot;) + ggline, plotTSNE(sub.nest, colour_by=&quot;TSCAN.second&quot;) + ggline, plotTSNE(sub.nest, colour_by=&quot;TSCAN.third&quot;) + ggline, ncol=3 ) Figure 18.11: TSCAN-derived pseudotimes around cluster 3 in the Nestorowa HSC dataset. Each point is a cell in this cluster and is colored by its pseudotime value along the path to which it was assigned. The overlaid lines represent the relevant edges of the MST. We then apply testPseudotime() to each path involving cluster 3. Because we are operating over a relatively short pseudotime interval, we do not expect complex trends and so we set df=1 (i.e., a linear trend) to avoid problems from overfitting. 
pseudo1 &lt;- testPseudotime(sub.nest, df=1, pseudotime=sub.nest$TSCAN.first) pseudo1$SYMBOL &lt;- rowData(sce.nest)$SYMBOL pseudo1[order(pseudo1$p.value),] ## DataFrame with 46078 rows and 5 columns ## logFC logFC.1 p.value FDR SYMBOL ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;character&gt; ## ENSMUSG00000009350 0.332855 0.332855 2.67471e-18 9.59018e-14 Mpo ## ENSMUSG00000040314 0.475509 0.475509 5.65148e-16 1.01317e-11 Ctsg ## ENSMUSG00000064147 0.449444 0.449444 3.76156e-15 4.49569e-11 Rab44 ## ENSMUSG00000026581 0.379946 0.379946 3.86978e-14 3.46877e-10 Sell ## ENSMUSG00000085611 0.266637 0.266637 7.51248e-12 5.38720e-08 Ap3s1-ps1 ## ... ... ... ... ... ... ## ENSMUSG00000107380 0 0 NaN NaN Vmn1r-ps6 ## ENSMUSG00000107381 0 0 NaN NaN NA ## ENSMUSG00000107382 0 0 NaN NaN Gm37714 ## ENSMUSG00000107387 0 0 NaN NaN 5430435K18Rik ## ENSMUSG00000107391 0 0 NaN NaN Rian pseudo2 &lt;- testPseudotime(sub.nest, df=1, pseudotime=sub.nest$TSCAN.second) pseudo2$SYMBOL &lt;- rowData(sce.nest)$SYMBOL pseudo2[order(pseudo2$p.value),] ## DataFrame with 46078 rows and 5 columns ## logFC logFC.1 p.value FDR SYMBOL ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;character&gt; ## ENSMUSG00000027342 -0.1265815 -0.1265815 1.14035e-11 4.01425e-07 Pcna ## ENSMUSG00000025747 -0.3693852 -0.3693852 5.06241e-09 6.43725e-05 Tyms ## ENSMUSG00000020358 -0.1001289 -0.1001289 6.95055e-09 6.43725e-05 Hnrnpab ## ENSMUSG00000035198 -0.4166721 -0.4166721 7.31465e-09 6.43725e-05 Tubg1 ## ENSMUSG00000045799 -0.0452833 -0.0452833 5.43487e-08 3.19298e-04 Gm9800 ## ... ... ... ... ... ... 
## ENSMUSG00000107380 0 0 NaN NaN Vmn1r-ps6 ## ENSMUSG00000107381 0 0 NaN NaN NA ## ENSMUSG00000107382 0 0 NaN NaN Gm37714 ## ENSMUSG00000107386 0 0 NaN NaN Gm42800 ## ENSMUSG00000107391 0 0 NaN NaN Rian pseudo3 &lt;- testPseudotime(sub.nest, df=1, pseudotime=sub.nest$TSCAN.third) pseudo3$SYMBOL &lt;- rowData(sce.nest)$SYMBOL pseudo3[order(pseudo3$p.value),] ## DataFrame with 46078 rows and 5 columns ## logFC logFC.1 p.value FDR SYMBOL ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;character&gt; ## ENSMUSG00000042817 -0.411375 -0.411375 7.83960e-14 2.00860e-09 Flt3 ## ENSMUSG00000015937 -0.163091 -0.163091 1.18901e-13 2.00860e-09 H2afy ## ENSMUSG00000002985 0.351661 0.351661 7.64160e-13 8.60597e-09 Apoe ## ENSMUSG00000053168 -0.398684 -0.398684 8.17626e-12 6.90608e-08 9030619P08Rik ## ENSMUSG00000029247 -0.137079 -0.137079 1.78997e-10 1.06448e-06 Paics ## ... ... ... ... ... ... ## ENSMUSG00000107381 0 0 NaN NaN NA ## ENSMUSG00000107382 0 0 NaN NaN Gm37714 ## ENSMUSG00000107384 0 0 NaN NaN Gm42557 ## ENSMUSG00000107387 0 0 NaN NaN 5430435K18Rik ## ENSMUSG00000107391 0 0 NaN NaN Rian We want to find genes that are significant in our path of interest (for this demonstration, the third path reported by TSCAN) and are not significant and/or changing in the opposite direction in the other paths. We use the raw \\(p\\)-values to look for non-significant genes in order to increase the stringency of the definition of unique genes in our path. 
only3 &lt;- pseudo3[which(pseudo3$FDR &lt;= 0.05 &amp; (pseudo1$p.value &gt;= 0.05 | sign(pseudo1$logFC)!=sign(pseudo3$logFC)) &amp; (pseudo2$p.value &gt;= 0.05 | sign(pseudo2$logFC)!=sign(pseudo3$logFC))),] only3[order(only3$p.value),] ## DataFrame with 64 rows and 5 columns ## logFC logFC.1 p.value FDR SYMBOL ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;character&gt; ## ENSMUSG00000042817 -0.411375 -0.411375 7.83960e-14 2.00860e-09 Flt3 ## ENSMUSG00000002985 0.351661 0.351661 7.64160e-13 8.60597e-09 Apoe ## ENSMUSG00000016494 -0.248953 -0.248953 1.89039e-10 1.06448e-06 Cd34 ## ENSMUSG00000000486 -0.217213 -0.217213 1.24423e-09 5.25468e-06 Sept1 ## ENSMUSG00000021728 -0.293032 -0.293032 3.56762e-09 1.20535e-05 Emb ## ... ... ... ... ... ... ## ENSMUSG00000004609 -0.205262 -0.205262 0.000118937 0.0422992 Cd33 ## ENSMUSG00000083657 0.100788 0.100788 0.000145710 0.0484448 Gm12245 ## ENSMUSG00000023942 -0.144269 -0.144269 0.000146255 0.0484448 Slc29a1 ## ENSMUSG00000091408 -0.149411 -0.149411 0.000157634 0.0499154 Gm6728 ## ENSMUSG00000053559 0.135833 0.135833 0.000159559 0.0499154 Smagp We observe upregulation of interesting genes such as Gata2, Cd9 and Apoe in this path, along with downregulation of Flt3 (Figure 18.12). One might speculate that this path leads to a less differentiated HSC state compared to the other directions. gridExtra::grid.arrange( plotTSNE(sub.nest, colour_by=&quot;Flt3&quot;, swap_rownames=&quot;SYMBOL&quot;) + ggline, plotTSNE(sub.nest, colour_by=&quot;Apoe&quot;, swap_rownames=&quot;SYMBOL&quot;) + ggline, plotTSNE(sub.nest, colour_by=&quot;Gata2&quot;, swap_rownames=&quot;SYMBOL&quot;) + ggline, plotTSNE(sub.nest, colour_by=&quot;Cd9&quot;, swap_rownames=&quot;SYMBOL&quot;) + ggline ) Figure 18.12: \\(t\\)-SNE plots of cells in the cluster containing the branch point of the MST in the Nestorowa dataset.
Each point is a cell colored by the expression of a gene of interest and the relevant edges of the MST are overlaid on top. While simple and practical, this comparison strategy is even less statistically defensible than usual. The differential testing machinery is not suited to making inferences on the absence of differences, so a lack of significance in another path cannot be treated as evidence that no change occurs there. Another limitation is that this approach cannot detect differences in the magnitude of the gradient of the trend between paths; a gene that is significantly upregulated in each of two paths will not be reported as DE, even if the gradient is much sharper in one of the paths. (Of course, this is only a limitation if the pseudotimes were comparable in the first place.) 18.3.4 Further comments The magnitudes of the \\(p\\)-values reported here should be treated with some skepticism. The same fundamental problems discussed in Section 11.5 remain; the \\(p\\)-values are computed from the same data used to define the trajectory, and there is only a sample size of 1 in this analysis regardless of the number of cells. Nonetheless, the \\(p\\)-value is still useful for prioritizing interesting genes in the same manner that it is used to identify markers between clusters. The previous sections have focused on a very simple and efficient - yet largely effective - approach to trend fitting. Alternatively, we can use more complex strategies based on various generalizations of the linear model. For example, generalized additive models (GAMs) are quite popular for pseudotime-based DE analyses as they are able to handle non-normal noise distributions and a greater diversity of non-linear trends. We demonstrate the use of the GAM implementation from the tradeSeq package on the Nestorowa dataset below.
Specifically, we will take a leap of faith and assume that our pseudotime values are comparable across paths of the MST, allowing us to use the patternTest() function to test for significant differences in expression between paths. # Getting rid of the NA&#39;s; using the cell weights # to indicate which cell belongs on which path. nonna.pseudo &lt;- tscan.pseudo nonna.pseudo[is.na(nonna.pseudo)] &lt;- 0 cell.weights &lt;- !is.na(tscan.pseudo) storage.mode(cell.weights) &lt;- &quot;numeric&quot; # Fitting a GAM on the subset of genes for speed. library(tradeSeq) fit &lt;- fitGAM(counts(sce.nest)[1:100,], pseudotime=nonna.pseudo, cellWeights=cell.weights) res &lt;- patternTest(fit) res$Symbol &lt;- rowData(sce.nest)[1:100,&quot;SYMBOL&quot;] res &lt;- res[order(res$pvalue),] head(res, 10) ## waldStat df pvalue fcMedian Symbol ## ENSMUSG00000000028 275.03 6 0 1.5507 Cdc45 ## ENSMUSG00000000058 124.99 6 0 1.3323 Cav2 ## ENSMUSG00000000078 188.82 6 0 0.9602 Klf6 ## ENSMUSG00000000088 122.82 6 0 0.5421 Cox5a ## ENSMUSG00000000184 216.14 6 0 0.2182 Ccnd2 ## ENSMUSG00000000247 108.03 6 0 0.2142 Lhx2 ## ENSMUSG00000000248 131.13 6 0 1.2077 Clec2g ## ENSMUSG00000000278 201.74 6 0 2.0016 Scpep1 ## ENSMUSG00000000303 111.59 6 0 1.1410 Cdh1 ## ENSMUSG00000000318 89.31 6 0 1.1405 Clec10a From a statistical perspective, the GAM is superior to linear models as the former uses the raw counts. This accounts for the idiosyncrasies of the mean-variance relationship for low counts and avoids some problems with spurious trajectories introduced by the log-transformation (Section 7.5.1). However, this sophistication comes at the cost of increased complexity and compute time, requiring parallelization via BiocParallel even for relatively small datasets. When a trajectory consists of a series of clusters (as in the Nestorowa dataset), pseudotime-based DE tests can be considered a continuous generalization of cluster-based marker detection. 
One would expect to identify similar genes by performing an ANOVA on the per-cluster expression values, and indeed, this may be a more interpretable approach as it avoids imposing the assumption that a trajectory exists at all. The main benefit of pseudotime-based tests is that they encourage expression to be a smooth function of pseudotime, assuming that the degrees of freedom of the trend fit are chosen to avoid overfitting. This smoothness reflects an expectation that changes in expression along a trajectory should be gradual. 18.4 Finding the root 18.4.1 Overview The pseudotime calculations rely on some specification of the root of the trajectory to define “position zero”. In some cases, this choice has little effect beyond flipping the sign of the gradients of the DE genes. In other cases, this choice may necessarily be arbitrary depending on the questions being asked, e.g., what are the genes driving the transition to or from a particular part of the trajectory? However, in situations where the trajectory is associated with a time-dependent biological process, the position on the trajectory corresponding to the earliest timepoint is clearly the best default choice for the root. This simplifies interpretation by allowing the pseudotime to be treated as a proxy for real time. 18.4.2 Entropy-based methods Trajectories are commonly used to characterize differentiation where branches are interpreted as multiple lineages. In this setting, the root of the trajectory is best set to the “start” of the differentiation process, i.e., the most undifferentiated state that is observed in the dataset. It is usually possible to identify this state based on the genes that are expressed at each point of the trajectory. However, when such prior biological knowledge is not available, we can fall back to the more general concept that undifferentiated cells have more diverse expression profiles (Gulati et al. 2020).
The assumption is that terminally differentiated cells have expression profiles that are highly specialized for their function while multipotent cells have no such constraints - and indeed, may need to have active expression programs for many lineages in preparation for commitment to any of them. We quantify the diversity of expression by computing the entropy of each cell’s expression profile (Grun et al. 2016; Guo et al. 2017; Teschendorff and Enver 2017), with higher entropies representing greater diversity. We demonstrate on the Nestorowa HSC dataset (Figure 18.13) where clusters 5 and 8 have the highest entropies, suggesting that they represent the least differentiated states within the trajectory. It is also reassuring that these two clusters are adjacent on the MST (Figure 18.1), which is consistent with branched differentiation “away” from a single root. library(TSCAN) entropy &lt;- perCellEntropy(sce.nest) ent.data &lt;- data.frame(cluster=colLabels(sce.nest), entropy=entropy) ggplot(ent.data, aes(x=cluster, y=entropy)) + geom_violin() + coord_cartesian(ylim=c(7, NA)) + stat_summary(fun=median, geom=&quot;point&quot;) Figure 18.13: Distribution of per-cell entropies for each cluster in the Nestorowa dataset. The median entropy for each cluster is shown as a point in the violin plot. Of course, this interpretation is fully dependent on whether the underlying assumption is reasonable. While the association between diversity and differentiation potential is likely to be generally applicable, it may not be sufficiently precise to enable claims on the relative potency of closely related subpopulations. Indeed, other processes such as stress or metabolic responses may interfere with the entropy comparisons. Furthermore, at low counts, the magnitude of the entropy is dependent on sequencing depth in a manner that cannot be corrected by scaling normalization. 
Cells with lower coverage will have lower entropy even if the underlying transcriptional diversity is the same, which may confound the interpretation of entropy as a measure of potency. 18.4.3 RNA velocity Another strategy is to use the concept of “RNA velocity” to identify the root (La Manno et al. 2018). For a given gene, a high ratio of unspliced to spliced transcripts indicates that that gene is being actively upregulated, under the assumption that the increase in transcription exceeds the capability of the splicing machinery to process the pre-mRNA. Conversely, a low ratio indicates that the gene is being downregulated as the rate of production and processing of pre-mRNAs cannot compensate for the degradation of mature transcripts. Thus, we can infer that cells with high and low ratios are moving towards a high- and low-expression state, respectively, allowing us to assign directionality to any trajectory or even individual cells. To demonstrate, we will use matrices of spliced and unspliced counts from Hermann et al. (2018). The unspliced count matrix is most typically generated by counting reads across intronic regions, thus quantifying the abundance of nascent transcripts for each gene in each cell. The spliced counts are obtained in a more standard manner by counting reads aligned to exonic regions; however, some extra thought is required to deal with reads spanning exon-intron boundaries, as well as reads mapping to regions that can be either intronic or exonic depending on the isoform (???). Conveniently, both matrices have the same shape and thus can be stored as separate assays in our usual SingleCellExperiment. 
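The intuition behind the unspliced:spliced ratio can be sketched with some invented counts. This is only a caricature - the actual velocity model estimates per-gene kinetic parameters rather than thresholding a raw ratio - but it conveys why the ratio carries directional information.

```r
# Invented unspliced/spliced counts for one gene in three cells:
# cell 1 at steady state, cell 2 being induced, cell 3 being repressed.
unspliced <- c(10, 40, 2)
spliced <- c(100, 100, 100)

# Transcription outpacing splicing inflates the ratio (upregulation);
# a collapsed ratio means production has slowed while mature transcripts
# continue to be degraded (downregulation).
unspliced / spliced # 0.10, 0.40, 0.02
```

Here the second cell's elevated ratio would mark the gene as moving towards a higher-expression state, and the third cell's depressed ratio the opposite.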
library(scRNAseq) sce.sperm &lt;- HermannSpermatogenesisData(strip=TRUE, location=TRUE) assayNames(sce.sperm) ## [1] &quot;spliced&quot; &quot;unspliced&quot; We run through a quick-and-dirty analysis on the spliced counts, which can - by and large - be treated in the same manner as the standard exonic gene counts used in non-velocity-aware analyses. Alternatively, if the standard exonic count matrix was available, we could just use it directly in these steps and restrict the involvement of the spliced/unspliced matrices to the velocity calculations. The latter approach is logistically convenient when adding an RNA velocity section to an existing analysis, such that the prior steps (and the interpretation of their results) do not have to be repeated on the spliced count matrix. # Quality control: library(scuttle) is.mito &lt;- which(seqnames(sce.sperm)==&quot;MT&quot;) sce.sperm &lt;- addPerCellQC(sce.sperm, subsets=list(Mt=is.mito), assay.type=&quot;spliced&quot;) qc &lt;- quickPerCellQC(colData(sce.sperm), sub.fields=TRUE) sce.sperm &lt;- sce.sperm[,!qc$discard] # Normalization: set.seed(10000) library(scran) sce.sperm &lt;- logNormCounts(sce.sperm, assay.type=&quot;spliced&quot;) dec &lt;- modelGeneVarByPoisson(sce.sperm, assay.type=&quot;spliced&quot;) hvgs &lt;- getTopHVGs(dec, n=2500) # Dimensionality reduction: set.seed(1000101) library(scater) sce.sperm &lt;- runPCA(sce.sperm, ncomponents=25, subset_row=hvgs) sce.sperm &lt;- runTSNE(sce.sperm, dimred=&quot;PCA&quot;) We use the velociraptor package to perform the velocity calculations on this dataset via the scvelo Python package (Bergen et al. 2019). scvelo offers some improvements over the original implementation of RNA velocity by La Manno et al. (2018), most notably eliminating the need for observed subpopulations at steady state (i.e., where the rates of transcription, splicing and degradation are equal). 
velociraptor conveniently wraps this functionality by providing a function that accepts a SingleCellExperiment object such as sce.sperm and returns a similar object decorated with the velocity statistics. library(velociraptor) velo.out &lt;- scvelo(sce.sperm, assay.X=&quot;spliced&quot;, subset.row=hvgs, use.dimred=&quot;PCA&quot;) velo.out ## class: SingleCellExperiment ## dim: 2500 2175 ## metadata(4): neighbors velocity_params velocity_graph ## velocity_graph_neg ## assays(6): X spliced ... Mu velocity ## rownames(2500): ENSMUSG00000038015 ENSMUSG00000022501 ... ## ENSMUSG00000095650 ENSMUSG00000002524 ## rowData names(3): velocity_gamma velocity_r2 velocity_genes ## colnames(2175): CCCATACTCCGAAGAG AATCCAGTCATCTGCC ... ATCCACCCACCACCAG ## ATTGGTGGTTACCGAT ## colData names(7): velocity_self_transition root_cells ... ## velocity_confidence velocity_confidence_transition ## reducedDimNames(1): X_pca ## altExpNames(0): The primary output is the matrix of velocity vectors that describe the direction and magnitude of transcriptional change for each cell. To construct an ordering, we extrapolate from the vector for each cell to determine its future state. Roughly speaking, if a cell’s future state is close to the observed state of another cell, we place the former behind the latter in the ordering. This yields a “velocity pseudotime” that provides directionality without the need to explicitly define a root in our trajectory. We visualize this procedure in Figure 18.14 by embedding the estimated velocities into any low-dimensional representation of the dataset. sce.sperm$pseudotime &lt;- velo.out$velocity_pseudotime # Also embedding the velocity vectors, for some verisimilitude. 
embedded &lt;- embedVelocity(reducedDim(sce.sperm, &quot;TSNE&quot;), velo.out) grid.df &lt;- gridVectors(reducedDim(sce.sperm, &quot;TSNE&quot;), embedded, resolution=30) library(ggplot2) plotTSNE(sce.sperm, colour_by=&quot;pseudotime&quot;, point_alpha=0.3) + geom_segment(data=grid.df, mapping=aes(x=start.1, y=start.2, xend=end.1, yend=end.2), arrow=arrow(length=unit(0.05, &quot;inches&quot;), type=&quot;closed&quot;)) Figure 18.14: \\(t\\)-SNE plot of the Hermann spermatogenesis dataset, where each point is a cell and is colored by its velocity pseudotime. Arrows indicate the direction and magnitude of the velocity vectors, averaged over nearby cells. While we could use the velocity pseudotimes directly in our downstream analyses, it is often helpful to pair this information with other trajectory analyses. This is because the velocity calculations are done on a per-cell basis but interpretation is typically performed at a lower granularity, e.g., per cluster or lineage. For example, we can overlay the average velocity pseudotime for each cluster onto our TSCAN-derived MST (Figure 18.15) to identify the likely root clusters. More complex analyses can also be performed (e.g., to identify the likely fate of each cell in the intermediate clusters) but will not be discussed here. library(bluster) colLabels(sce.sperm) &lt;- clusterRows(reducedDim(sce.sperm, &quot;PCA&quot;), NNGraphParam()) library(TSCAN) mst &lt;- TSCAN::createClusterMST(sce.sperm, use.dimred=&quot;PCA&quot;, outgroup=TRUE) # Could also use velo.out$root_cell here, for a more direct measure of &#39;rootness&#39;. by.cluster &lt;- split(sce.sperm$pseudotime, colLabels(sce.sperm)) mean.by.cluster &lt;- vapply(by.cluster, mean, 0) mean.by.cluster &lt;- mean.by.cluster[names(igraph::V(mst))] color.by.cluster &lt;- viridis::viridis(21)[cut(mean.by.cluster, 21)] set.seed(1001) plot(mst, vertex.color=color.by.cluster) Figure 18.15: TSCAN-derived MST created from the Hermann spermatogenesis dataset. 
Each node is a cluster and is colored by the average velocity pseudotime of all cells in that cluster, from lowest (purple) to highest (yellow). Needless to say, this lunch is not entirely free. The inferences rely on a sophisticated mathematical model that has a few assumptions, the most obvious of which is that the transcriptional dynamics are the same across subpopulations. The use of unspliced counts increases the sensitivity of the analysis to unannotated transcripts (e.g., microRNAs in the gene body), intron retention events, annotation errors or quantification ambiguities (Soneson et al. 2020) that could interfere with the velocity calculations. There is also the question of whether there is enough intronic coverage to reliably estimate the velocity for the genes relevant to the process of interest, and if not, whether this lack of information may bias the resulting velocity estimates. From a purely practical perspective, the main difficulty with RNA velocity is that the unspliced counts are often unavailable. 18.4.4 Real timepoints There does, however, exist a gold-standard approach to rooting a trajectory: simply collect multiple real-life timepoints over the course of a biological process and use the population(s) at the earliest time point as the root. This approach experimentally defines a link between pseudotime and real time without requiring any further assumptions. To demonstrate, we will use the activated T cell dataset from Richard et al. (2018) where they collected CD8+ T cells at various time points after ovalbumin stimulation. library(scRNAseq) sce.richard &lt;- RichardTCellData() sce.richard &lt;- sce.richard[,sce.richard$`single cell quality`==&quot;OK&quot;] # Only using cells treated with the highest affinity peptide # plus the unstimulated cells as time zero.
sub.richard &lt;- sce.richard[,sce.richard$stimulus %in% c(&quot;OT-I high affinity peptide N4 (SIINFEKL)&quot;, &quot;unstimulated&quot;)] sub.richard$time[is.na(sub.richard$time)] &lt;- 0 table(sub.richard$time) ## ## 0 1 3 6 ## 44 51 64 91 We run through the standard workflow for single-cell data with spike-ins - see Sections 7.4 and 8.2.3 for more details. library(scran) sub.richard &lt;- computeSpikeFactors(sub.richard, &quot;ERCC&quot;) sub.richard &lt;- logNormCounts(sub.richard) dec.richard &lt;- modelGeneVarWithSpikes(sub.richard, &quot;ERCC&quot;) top.hvgs &lt;- getTopHVGs(dec.richard, prop=0.2) sub.richard &lt;- denoisePCA(sub.richard, technical=dec.richard, subset.row=top.hvgs) We can then run our trajectory inference method of choice. As we are expecting a fairly simple trajectory, we will keep matters simple and use slingshot() without any clusters. This yields a pseudotime that is strongly associated with real time (Figure 18.16) and from which it is straightforward to identify the best location of the root. The rooted trajectory can then be used to determine the “real time equivalent” of other activation stimuli; see Richard et al. (2018) for more details. sub.richard &lt;- slingshot(sub.richard, reducedDim=&quot;PCA&quot;) plot(sub.richard$time, sub.richard$slingPseudotime_1, xlab=&quot;Time (hours)&quot;, ylab=&quot;Pseudotime&quot;) Figure 18.16: Pseudotime as a function of real time in the Richard T cell dataset. Of course, this strategy relies on careful experimental design to ensure that multiple timepoints are actually collected. This requires more planning and resources (i.e., cost!) and is frequently absent from many scRNA-seq studies that only consider a single “snapshot” of the system. Generation of multiple timepoints also requires an amenable experimental system where the initiation of the process of interest can be tightly controlled.
This is often more complex to set up than a strictly observational study, though having causal information arguably makes the data more useful for making inferences. Session Info View session info R version 4.0.4 (2021-02-15); Platform: x86_64-pc-linux-gnu (64-bit); Running under: Ubuntu 20.04.2 LTS. (Full list of attached and loaded packages omitted.) Bibliography "],["single-nuclei-rna-seq-processing.html", "Chapter 19 Single-nuclei RNA-seq processing 19.1 Introduction 19.2 Quality control for stripped nuclei 19.3 Comments on downstream analyses 19.4 Tricks with ambient contamination Session Info", " Chapter 19 Single-nuclei RNA-seq processing 19.1 Introduction Single-nuclei RNA-seq (snRNA-seq) provides another strategy for performing single-cell transcriptomics, where individual nuclei instead of whole cells are captured and sequenced. The major advantage of snRNA-seq over scRNA-seq is that the former does not require the preservation of cellular integrity during sample preparation, especially dissociation. We only need to extract nuclei in an intact state, meaning that snRNA-seq can be applied to cell types, tissues and samples that are not amenable to dissociation and later processing. The cost of this flexibility is the loss of transcripts that are primarily located in the cytoplasm, potentially limiting the availability of biological signal for genes with little nuclear localization. The computational analysis of snRNA-seq data is very much like that of scRNA-seq data. We have a matrix of (UMI) counts for genes by cells that requires quality control, normalization and so on. (Technically, the columns correspond to nuclei, but we will use the two terms interchangeably in this chapter.) In fact, the biggest difference in processing occurs in the construction of the count matrix itself, where intronic regions must be included in the annotation for each gene to account for the increased abundance of unspliced transcripts.
The rest of the analysis only requires a few minor adjustments to account for the loss of cytoplasmic transcripts. We demonstrate using a dataset from Wu et al. (2019) involving snRNA-seq on healthy and fibrotic mouse kidneys. library(scRNAseq) sce &lt;- WuKidneyData() sce &lt;- sce[,sce$Technology==&quot;sNuc-10x&quot;] sce ## class: SingleCellExperiment ## dim: 18249 8231 ## metadata(0): ## assays(1): counts ## rownames(18249): mt-Cytb mt-Nd6 ... Gm44613 Gm38304 ## rowData names(0): ## colnames(8231): sNuc-10x_AAACCTGAGTCCGGTC sNuc-10x_AAACCTGCACAGACAG ... ## UUO_TTGCCGTCACAAGACG UUO_TTTGTCATCTGCTGTC ## colData names(2): Technology Status ## reducedDimNames(0): ## altExpNames(0): 19.2 Quality control for stripped nuclei The loss of the cytoplasm means that the stripped nuclei should not contain any mitochondrial transcripts. This means that the mitochondrial proportion becomes an excellent QC metric for the efficacy of the stripping process. Unlike scRNA-seq, there is no need to worry about variations in mitochondrial content due to genuine biology. High-quality nuclei should not contain any mitochondrial transcripts; the presence of any mitochondrial counts in a library indicates that the removal of the cytoplasm was not complete, possibly introducing irrelevant heterogeneity in downstream analyses. library(scuttle) sce &lt;- addPerCellQC(sce, subsets=list(Mt=grep(&quot;^mt-&quot;, rownames(sce)))) summary(sce$subsets_Mt_percent == 0) ## Mode FALSE TRUE ## logical 2264 5967 We apply a simple filter to remove libraries corresponding to incompletely stripped nuclei. The outlier-based approach described in Section 6 can be used here, but some caution is required in low-coverage experiments where a majority of cells have zero mitochondrial counts. In such cases, the MAD may also be zero such that other libraries with very low but non-zero mitochondrial counts are removed. 
This is typically too conservative as such transcripts may be present due to sporadic ambient contamination rather than incomplete stripping. stats &lt;- quickPerCellQC(colData(sce), sub.fields=&quot;subsets_Mt_percent&quot;) colSums(as.matrix(stats)) ## low_lib_size low_n_features high_subsets_Mt_percent ## 0 0 2264 ## discard ## 2264 Instead, we enforce a minimum difference between the threshold and the median in isOutlier() (Figure 19.1). We arbitrarily choose +0.5% here, which takes precedence over the outlier-based threshold if the latter is too low. In this manner, we avoid discarding libraries with a very modest amount of contamination; the same code will automatically fall back to the outlier-based threshold in datasets where the stripping was systematically less effective. stats$high_subsets_Mt_percent &lt;- isOutlier(sce$subsets_Mt_percent, type=&quot;higher&quot;, min.diff=0.5) stats$discard &lt;- Reduce(&quot;|&quot;, stats[,colnames(stats)!=&quot;discard&quot;]) colSums(as.matrix(stats)) ## low_lib_size low_n_features high_subsets_Mt_percent ## 0 0 42 ## discard ## 42 library(scater) plotColData(sce, x=&quot;Status&quot;, y=&quot;subsets_Mt_percent&quot;, colour_by=I(stats$high_subsets_Mt_percent)) Figure 19.1: Distribution of the mitochondrial proportions in the Wu kidney dataset. Each point represents a cell and is colored according to whether it was considered to be of low quality and discarded. 19.3 Comments on downstream analyses The rest of the analysis can then be performed using the same strategies discussed for scRNA-seq (Figure 19.2). Despite the loss of cytoplasmic transcripts, there is usually still enough biological signal to characterize population heterogeneity (Bakken et al. 2018; Wu et al. 2019). In fact, one could even say that snRNA-seq has a higher signal-to-noise ratio as sequencing coverage is not spent on highly abundant but typically uninteresting transcripts for mitochondrial and ribosomal protein genes. 
It also has the not inconsiderable advantage of being able to recover subpopulations that are not amenable to dissociation and would be lost by scRNA-seq protocols. library(scran) set.seed(111) sce &lt;- logNormCounts(sce[,!stats$discard]) dec &lt;- modelGeneVarByPoisson(sce) sce &lt;- runPCA(sce, subset_row=getTopHVGs(dec, n=4000)) sce &lt;- runTSNE(sce, dimred=&quot;PCA&quot;) library(bluster) colLabels(sce) &lt;- clusterRows(reducedDim(sce, &quot;PCA&quot;), NNGraphParam()) gridExtra::grid.arrange( plotTSNE(sce, colour_by=&quot;label&quot;, text_by=&quot;label&quot;), plotTSNE(sce, colour_by=&quot;Status&quot;), ncol=2 ) Figure 19.2: \\(t\\)-SNE plots of the Wu kidney dataset. Each point is a cell and is colored by its cluster assignment (left) or its disease status (right). We can also apply more complex procedures such as batch correction (Section 13). Here, we eliminate the disease effect to identify shared clusters (Figure 19.3). library(batchelor) set.seed(1101) merged &lt;- multiBatchNorm(sce, batch=sce$Status) merged &lt;- correctExperiments(merged, batch=merged$Status, PARAM=FastMnnParam()) merged &lt;- runTSNE(merged, dimred=&quot;corrected&quot;) colLabels(merged) &lt;- clusterRows(reducedDim(merged, &quot;corrected&quot;), NNGraphParam()) gridExtra::grid.arrange( plotTSNE(merged, colour_by=&quot;label&quot;, text_by=&quot;label&quot;), plotTSNE(merged, colour_by=&quot;batch&quot;), ncol=2 ) Figure 19.3: More \\(t\\)-SNE plots of the Wu kidney dataset after applying MNN correction across diseases. Similarly, we can perform marker detection on the snRNA-seq expression values as discussed in Section 11. For the most part, interpretation of these DE results makes the simplifying assumption that nuclear abundances are a good proxy for the overall expression profile. This is generally reasonable but may not always be true, resulting in some discrepancies in the marker sets between snRNA-seq and scRNA-seq datasets. 
For example, transcripts for strongly expressed genes might localize to the cytoplasm for efficient translation and subsequently be lost upon stripping, while genes with the same overall expression but differences in the rate of nuclear export may appear to be differentially expressed between clusters. In the most pathological case, higher snRNA-seq abundances may indicate nuclear sequestration of transcripts for protein-coding genes and reduced activity of the relevant biological process, contrary to the usual interpretation of the effect of upregulation. markers &lt;- findMarkers(merged, block=merged$Status, direction=&quot;up&quot;) markers[[&quot;3&quot;]][1:10,1:3] ## DataFrame with 10 rows and 3 columns ## Top p.value FDR ## &lt;integer&gt; &lt;numeric&gt; &lt;numeric&gt; ## Sorcs1 1 8.31936e-262 1.51820e-258 ## Ltn1 1 7.85490e-83 1.07778e-80 ## Bmp6 1 0.00000e+00 0.00000e+00 ## Il34 1 0.00000e+00 0.00000e+00 ## Them7 1 9.46508e-208 8.22516e-205 ## Pak1 1 1.35170e-184 8.22241e-182 ## Kcnip4 1 0.00000e+00 0.00000e+00 ## Mecom 1 5.30105e-20 7.11838e-19 ## Pakap 1 0.00000e+00 0.00000e+00 ## Wdr17 1 1.34668e-202 1.11707e-199 plotTSNE(merged, colour_by=&quot;Kcnip4&quot;) Other analyses described for scRNA-seq require more care when they are applied to snRNA-seq data. Most obviously, cell type annotation based on reference profiles (Section 12) should be treated with some caution as the majority of existing references are constructed from bulk or single-cell datasets with cytoplasmic transcripts. Interpretation of RNA velocity results may also be complicated by variation in the rate of nuclear export of spliced transcripts. 19.4 Tricks with ambient contamination The expected absence of genuine mitochondrial expression can also be exploited to estimate the level of ambient contamination (Section 14.4). We demonstrate on mouse brain snRNA-seq data from 10X Genomics (Zheng et al. 2017), using the raw count matrix prior to any filtering for nuclei-containing barcodes. 
library(DropletTestFiles) raw.path &lt;- getTestFile(&quot;tenx-2.0.1-nuclei_900/1.0.0/raw.tar.gz&quot;) out.path &lt;- file.path(tempdir(), &quot;nuclei&quot;) untar(raw.path, exdir=out.path) library(DropletUtils) fname &lt;- file.path(out.path, &quot;raw_gene_bc_matrices/mm10&quot;) sce.brain &lt;- read10xCounts(fname, col.names=TRUE) sce.brain ## class: SingleCellExperiment ## dim: 27998 737280 ## metadata(1): Samples ## assays(1): counts ## rownames(27998): ENSMUSG00000051951 ENSMUSG00000089699 ... ## ENSMUSG00000096730 ENSMUSG00000095742 ## rowData names(2): ID Symbol ## colnames(737280): AAACCTGAGAAACCAT-1 AAACCTGAGAAACCGC-1 ... ## TTTGTCATCTTTAGTC-1 TTTGTCATCTTTCCTC-1 ## colData names(2): Sample Barcode ## reducedDimNames(0): ## altExpNames(0): We call non-empty droplets using emptyDrops() as previously described (Section 15.2). library(DropletUtils) e.out &lt;- emptyDrops(counts(sce.brain)) summary(e.out$FDR &lt;= 0.001) ## Mode FALSE TRUE NA&#39;s ## logical 2324 1712 733244 If our libraries are of high quality, we can assume that any mitochondrial “expression” is due to contamination from the ambient solution. We then use the controlAmbience() function to estimate the proportion of ambient contamination for each gene, allowing us to mark potentially problematic genes in the DE results (Figure 19.4). In fact, we can use this information even earlier to remove these genes during dimensionality reduction and clustering. This is not generally possible for scRNA-seq as any notable contaminating transcripts may originate from a subpopulation that actually expresses that gene and thus cannot be blindly removed. 
ambient &lt;- estimateAmbience(counts(sce.brain), round=FALSE, good.turing=FALSE) nuclei &lt;- rowSums(counts(sce.brain)[,which(e.out$FDR &lt;= 0.001)]) is.mito &lt;- grepl(&quot;mt-&quot;, rowData(sce.brain)$Symbol) contam &lt;- controlAmbience(nuclei, ambient, features=is.mito, mode=&quot;proportion&quot;) plot(log10(nuclei+1), contam*100, col=ifelse(is.mito, &quot;red&quot;, &quot;grey&quot;), pch=16, xlab=&quot;Log-nuclei expression&quot;, ylab=&quot;Contamination (%)&quot;) Figure 19.4: Percentage of counts in the nuclei of the 10X brain dataset that are attributed to contamination from the ambient solution. Each point represents a gene and mitochondrial genes are highlighted in red. Session Info View session info R version 4.0.4 (2021-02-15); Platform: x86_64-pc-linux-gnu (64-bit); Running under: Ubuntu 20.04.2 LTS. (Full list of attached and loaded packages omitted.) Bibliography "],["integrating-with-protein-abundance.html", "Chapter 20 Integrating with protein abundance 20.1 Motivation 20.2 Setting up the data 20.3 Quality control 20.4 Normalization 20.5 Clustering and interpretation 20.6 Integration with gene expression data Session Info", " Chapter 20 Integrating with protein abundance 20.1 Motivation Cellular indexing of transcriptomes and epitopes by sequencing (CITE-seq) is a technique that quantifies both gene expression and the abundance of selected surface proteins in each cell simultaneously (Stoeckius et al. 2017). In this approach, cells are first labelled with antibodies that have been conjugated to synthetic RNA tags. A cell with a higher abundance of a target protein will be bound by more antibodies, causing more molecules of the corresponding antibody-derived tag (ADT) to be attached to that cell. Cells are then separated into their own reaction chambers using droplet-based microfluidics (Zheng et al. 2017). Both the ADTs and endogenous transcripts are reverse-transcribed and captured into a cDNA library; the abundance of each protein or expression of each gene is subsequently quantified by sequencing of each set of features. This provides a powerful tool for interrogating aspects of the proteome (such as post-translational modifications) and other cellular features that would normally be invisible to transcriptomic studies. How should the ADT data be incorporated into the analysis?
While we have counts for both ADTs and transcripts, there are fundamental differences in the nature of the data that make it difficult to treat the former as additional features in the latter. Most experiments involve only a small number of antibodies (&lt;20) that are chosen by the researcher because they are of a priori interest, in contrast to gene expression data that captures the entire transcriptome regardless of the study. The coverage of the ADTs is also much deeper as they are sequenced separately from the transcripts, allowing the sequencing resources to be concentrated into a smaller number of features. And, of course, the use of antibodies against protein targets involves consideration of separate biases compared to those observed for transcripts. In this chapter, we will describe some strategies for integrated analysis of ADT and transcript data in CITE-seq experiments. We will demonstrate using a PBMC dataset from 10X Genomics that contains quantified abundances for a number of interesting surface proteins. We conveniently obtain the dataset using the DropletTestFiles package, after which we can create a SingleCellExperiment as shown below. library(DropletTestFiles) path &lt;- getTestFile(&quot;tenx-3.0.0-pbmc_10k_protein_v3/1.0.0/filtered.tar.gz&quot;) dir &lt;- tempfile() untar(path, exdir=dir) # Loading it in as a SingleCellExperiment object. library(DropletUtils) sce &lt;- read10xCounts(file.path(dir, &quot;filtered_feature_bc_matrix&quot;)) sce ## class: SingleCellExperiment ## dim: 33555 7865 ## metadata(1): Samples ## assays(1): counts ## rownames(33555): ENSG00000243485 ENSG00000237613 ... IgG1 IgG2b ## rowData names(3): ID Symbol Type ## colnames: NULL ## colData names(2): Sample Barcode ## reducedDimNames(0): ## altExpNames(0): 20.2 Setting up the data The SingleCellExperiment class provides the concept of an “alternative Experiment” to store data for different sets of features but the same cells.
This involves storing another SummarizedExperiment (or an instance of a subclass) inside our SingleCellExperiment where the rows (features) can differ but the columns (cells) are the same. In previous chapters, we were using the alternative Experiments to store spike-in data, but here we will use the concept to split off the ADT data. This isolates the two sets of features to ensure that analyses on one set do not inadvertently use data from the other set, and vice versa. sce &lt;- splitAltExps(sce, rowData(sce)$Type) altExpNames(sce) ## [1] &quot;Antibody Capture&quot; altExp(sce) # Can be used like any other SingleCellExperiment. ## class: SingleCellExperiment ## dim: 17 7865 ## metadata(1): Samples ## assays(1): counts ## rownames(17): CD3 CD4 ... IgG1 IgG2b ## rowData names(3): ID Symbol Type ## colnames: NULL ## colData names(0): ## reducedDimNames(0): ## altExpNames(0): At this point, it is also helpful to coerce the sparse matrix for ADTs into a dense matrix. The ADT counts are usually not sparse so storage as a sparse matrix provides no advantage; in fact, it actually increases memory usage and computational time as the indices of non-zero entries must be unnecessarily stored and processed. From a practical perspective, this avoids unnecessary incompatibilities with downstream applications that do not accept sparse inputs. 
counts(altExp(sce)) &lt;- as.matrix(counts(altExp(sce))) counts(altExp(sce))[,1:10] # sneak peek ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] ## CD3 18 30 18 18 5 21 34 48 4522 2910 ## CD4 138 119 207 11 14 1014 324 1127 3479 2900 ## CD8a 13 19 10 17 14 29 27 43 38 28 ## CD14 491 472 1289 20 19 2428 1958 2189 55 41 ## CD15 61 102 128 124 156 204 607 128 111 130 ## CD16 17 155 72 1227 1873 148 676 75 44 37 ## CD56 17 248 26 491 458 29 29 29 30 15 ## CD19 3 3 8 5 4 7 15 4 6 6 ## CD25 9 5 15 15 16 52 85 17 13 18 ## CD45RA 110 125 5268 4743 4108 227 175 523 4044 1081 ## CD45RO 74 156 28 28 21 492 517 316 26 43 ## PD-1 9 9 20 25 28 16 26 16 28 16 ## TIGIT 4 9 11 59 76 11 12 12 9 8 ## CD127 7 8 12 16 17 15 11 10 231 179 ## IgG2a 5 4 12 12 7 9 6 3 19 14 ## IgG1 2 8 19 16 14 10 12 7 16 10 ## IgG2b 3 3 6 4 9 8 50 2 8 2 20.3 Quality control As with the endogenous genes, we want to remove cells that have failed to capture/sequence the ADTs. Recall that droplet-based libraries will contain contamination from ambient solution (Section 15.2), in this case containing conjugated antibodies that are either free in solution or bound to cell fragments. As the ADTs are (relatively) deeply sequenced, we can expect non-zero counts for most ADTs in each cell due to contamination (Figure 20.1); if this is not the case, we might suspect some failure of ADT processing for that cell. We thus remove cells that have unusually low numbers of detected ADTs, defined here as less than or equal to half of the total number of tags.
# Applied on the alternative experiment containing the ADT counts: library(scuttle) df.ab &lt;- perCellQCMetrics(altExp(sce)) n.nonzero &lt;- sum(!rowAlls(counts(altExp(sce)), value=0L)) ab.discard &lt;- df.ab$detected &lt;= n.nonzero/2 summary(ab.discard) ## Mode FALSE TRUE ## logical 7864 1 hist(df.ab$detected, col=&#39;grey&#39;, main=&quot;&quot;, xlab=&quot;Number of detected ADTs&quot;) abline(v=n.nonzero/2, col=&quot;red&quot;, lty=2) Figure 20.1: Distribution of the number of detected ADTs across all cells in the PBMC dataset. The red dotted line indicates the threshold below which cells were removed. To elaborate on the filter above: we use n.nonzero to exclude ADTs with zero counts across all cells, which avoids inflation of the total number of ADTs due to barcodes that were erroneously included in the annotation. The use of the specific threshold of 50% aims to avoid zero size factors during median-based normalization (Section 20.4.3), though it is straightforward to use more stringent thresholds if so desired. We do not rely only on isOutlier() as the MAD is often zero in deeply sequenced datasets - where most cells contain non-zero counts for all ADTs - such that filtering would discard useful cells that detect almost, but not quite, all of the ADTs. Some experiments include isotype control (IgG) antibodies that have similar properties to a primary antibody but lack a specific target in the cell, thus providing a measure of non-specific binding. If we assume that the magnitude of non-specific binding is constant across cells, we could define low-quality cells as those with unusually low coverage of the control ADTs, presumably due to failed capture or sequencing. We use the outlier-based approach described in Chapter 6 to choose an appropriate filter threshold (Figure 20.2); hard thresholds are more difficult to specify due to experiment-by-experiment variation in the expected coverage of ADTs.
Paradoxically, though, this QC metric becomes less useful in well-executed experiments where there is not enough non-specific binding to obtain meaningful coverage of the control ADTs. # We could have specified subsets= in the above perCellQCMetrics call, but it&#39;s # easier to explain one concept at a time, so we&#39;ll just repeat it here: controls &lt;- grep(&quot;^Ig&quot;, rownames(altExp(sce))) df.ab2 &lt;- perCellQCMetrics(altExp(sce), subsets=list(controls=controls)) con.discard &lt;- isOutlier(df.ab2$subsets_controls_sum, log=TRUE, type=&quot;lower&quot;) summary(con.discard) ## Mode FALSE TRUE ## logical 7785 80 hist(log1p(df.ab2$subsets_controls_sum), col=&#39;grey&#39;, breaks=50, main=&quot;&quot;, xlab=&quot;Log-total count for controls per cell&quot;) abline(v=log1p(attr(con.discard, &quot;thresholds&quot;)[&quot;lower&quot;]), col=&quot;red&quot;, lty=2) Figure 20.2: Distribution of the log-coverage of IgG control ADTs across all cells in the PBMC dataset. The red dotted line indicates the threshold below which cells were removed. As with scRNA-seq, the total count across all ADTs can also be used as a QC metric (Figure 20.3). However, it is a riskier proposition for CITE-seq as there is a greater chance that the total count will be strongly correlated with the biological state of the cell. The presence of a targeted protein can lead to a several-fold increase in the total ADT count given the binary nature of most surface protein markers. Removing cells with low totals could inadvertently eliminate cell types that do not express many - or indeed, any - of the selected protein targets. Admittedly, this problem is the same as that discussed previously for the library size in scRNA-seq data (Section 6.3.2.2) but can be more pronounced in CITE-seq where there is no guaranteed set of constitutively expressed features to “buffer” cell-type-specific variation in the total count.
sum.discard &lt;- isOutlier(df.ab2$sum, log=TRUE, type=&quot;lower&quot;) summary(sum.discard) ## Mode FALSE TRUE ## logical 7820 45 hist(log1p(df.ab2$sum), col=&#39;grey&#39;, breaks=50, main=&quot;&quot;, xlab=&quot;Log-total count per cell&quot;) abline(v=log1p(attr(sum.discard, &quot;thresholds&quot;)[&quot;lower&quot;]), col=&quot;red&quot;, lty=2) Figure 20.3: Distribution of the log-total count of all ADTs across all cells in the PBMC dataset. The red dotted line indicates the threshold below which cells are considered to be low outliers. Of course, if we want to analyze gene expression and ADT data together, we still need to apply quality control on the gene counts. It is possible for a cell to have satisfactory ADT counts but poor QC metrics from the endogenous genes - for example, cell damage manifesting as high mitochondrial proportions would not be captured in the ADT data. More generally, the ADTs are synthetic, external to the cell and conjugated to antibodies, so it is not surprising that they would experience different cell-specific technical effects than the endogenous transcripts. This motivates a distinct QC step on the genes to ensure that we only retain high-quality cells in both feature spaces. Here, the count matrix has already been filtered by Cellranger to remove empty droplets so we only filter on the mitochondrial proportions to remove putative low-quality cells. mito &lt;- grep(&quot;^MT-&quot;, rowData(sce)$Symbol) df &lt;- perCellQCMetrics(sce, subsets=list(Mito=mito)) mito.discard &lt;- isOutlier(df$subsets_Mito_percent, type=&quot;higher&quot;) summary(mito.discard) ## Mode FALSE TRUE ## logical 7569 296 Finally, to remove the low-quality cells, we subset the SingleCellExperiment as previously described. This automatically applies the filtering to both the transcript and ADT data; such coordination is one of the advantages of storing both datasets in a single object. 
We omit the total count filter to reduce the risk of discarding an all-negative cell type, at least for the first round of analysis. discard &lt;- ab.discard | con.discard | mito.discard sce &lt;- sce[,!discard] 20.4 Normalization 20.4.1 Overview Counts for the ADTs are subject to several biases that must be normalized prior to further analysis. Capture efficiency varies from cell to cell, though the differences in biophysical properties between endogenous transcripts and the (much shorter) ADTs mean that the capture-related biases for the two sets of features are unlikely to be identical. Composition biases are also much more pronounced in ADT data due to (i) the binary nature of target protein abundances, where any increase in protein abundance manifests as a large increase to the total tag count; and (ii) the a priori selection of interesting protein targets, which enriches for features that are more likely to be differentially abundant across the population. As in Chapter 7, we assume that these are scaling biases and compute ADT-specific size factors to remove them. To this end, several strategies are again available to calculate a size factor for each cell. 20.4.2 Library size normalization The simplest approach is to normalize on the total ADT counts, effectively the library size for the ADTs. Like in Section 7.2, these “ADT library size factors” are adequate for clustering but will introduce composition biases that interfere with interpretation of the fold-changes between clusters. This is especially true for relatively subtle (e.g., ~2-fold) changes in the abundances of markers associated with functional activity rather than cell type. sf.lib &lt;- librarySizeFactors(altExp(sce)) summary(sf.lib) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.031 0.528 0.906 1.000 1.266 22.672 We might instead consider the related approach of taking the geometric mean of all counts as the size factor for each cell (Stoeckius et al. 2017).
The geometric mean is a reasonable estimator of the scaling biases for large counts with the added benefit that it mitigates the effects of composition biases by dampening the impact of one or two highly abundant ADTs. While more robust than the ADT library size factors, these geometric mean-based factors are still not entirely correct and will progressively become less accurate as upregulation increases in strength. library(scuttle) sf.geo &lt;- geometricSizeFactors(altExp(sce)) summary(sf.geo) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.16 0.66 0.85 1.00 1.07 45.22 20.4.3 Median-based normalization Ideally, we would like to compute size factors that adjust for the composition biases. This usually requires an assumption that most ADTs are not differentially expressed between cell types/states. At first glance, this appears to be a strong assumption - the target proteins were specifically chosen as they exhibit interesting heterogeneity across the population, meaning that a non-differential majority across ADTs would be unlikely. However, we can make it work by assuming that each cell only upregulates a minority of the targeted proteins, with the remaining ADTs exhibiting some low baseline abundance that is constant across all cells. (Of course, the identity of the upregulated subset can differ across cells, otherwise the dataset would not be very interesting.) We then compute size factors to equalize the coverage of the non-upregulated majority, thus eliminating cell-to-cell differences in capture efficiency. We consider the baseline ADT profile to be a combination of weak constitutive expression and ambient contamination, both of which should be constant across the population. We estimate this profile by assuming that the distribution of abundances for each ADT should be bimodal, where one population of cells exhibits low baseline expression and another population upregulates the corresponding protein target. 
We then use all cells in the lower mode to compute the baseline abundance for that ADT. This entire calculation is performed by the inferAmbience() function, which was originally designed to estimate HTO ambient concentrations but can be repurposed for more general use (Figure 20.4). baseline &lt;- inferAmbience(counts(altExp(sce))) head(baseline) ## CD3 CD4 CD8a CD14 CD15 CD16 ## 30.99 29.97 30.09 32.61 108.65 45.25 library(scater) plotExpression(altExp(sce), features=rownames(altExp(sce)), exprs_values=&quot;counts&quot;) + scale_y_log10() + geom_point(data=data.frame(x=names(baseline), y=baseline), mapping=aes(x=x, y=y), cex=3) Figure 20.4: Distribution of (log-)counts for each ADT in the PBMC dataset, with the inferred ambient abundance marked by the black dot. We use a DESeq2-like approach to compute size factors against the baseline profile. Specifically, the size factor for each cell is defined as the median of the ratios of that cell’s counts to the baseline profile. If the abundances for most ADTs in each cell are baseline-derived, they should be roughly constant across cells; any systematic differences in the ratios correspond to cell-specific biases in sequencing coverage and are captured by the size factor. The use of the median protects against the minority of ADTs corresponding to genuinely expressed targets. sf.amb &lt;- medianSizeFactors(altExp(sce), reference=baseline) summary(sf.amb) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.20 0.66 0.78 1.00 1.00 71.00 In one subpopulation, the size factors are consistently larger than the ADT library size factors, whereas the opposite is true for most of the other subpopulations (Figure 20.5). This is consistent with the presence of composition biases due to differential abundance of the targeted proteins between subpopulations. Here, composition biases would introduce a spurious 2-fold change in normalized ADT abundance if the library size factors were used. 
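To make the median-of-ratios logic concrete, here is a minimal base R sketch on a made-up two-cell toy matrix; the counts, cell names and baseline values are invented for illustration, and in practice medianSizeFactors() performs this calculation (and additionally centers the factors to unit mean).

```r
# Toy illustration of the DESeq-like calculation: each cell's size factor
# is the median ratio of its ADT counts to the fixed baseline profile.
toy <- cbind(cellA = c(10, 20, 30, 400),  # last ADT genuinely upregulated
             cellB = c(5, 10, 15, 20))    # half the coverage of the baseline
baseline <- c(10, 20, 30, 40)
sf <- apply(toy, 2, function(x) median(x / baseline))
sf
#> cellA cellB
#>   1.0   0.5
```

Note that cellA's upregulated ADT contributes a ratio of 10 but does not shift the median, illustrating how this approach protects against the minority of genuinely expressed targets.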
```r
# Coloring by cluster to highlight the composition biases.
# We set k=20 to get fewer, broader clusters for a clearer picture.
library(scran)
tagdata <- logNormCounts(altExp(sce)) # library size factors by default.
g <- buildSNNGraph(tagdata, k=20, d=NA) # no need for PCA, see below.
clusters <- igraph::cluster_walktrap(g)$membership

plot(sf.lib, sf.amb, log="xy", col=clusters,
    xlab="Library size factors (tag)",
    ylab="DESeq-like size factors (tag)")
abline(0, 1, col="grey", lty=2)
```

Figure 20.5: DESeq-like size factors for each cell in the PBMC dataset, compared to ADT library size factors. Each point is a cell and is colored according to the cluster identity defined from normalized ADT data.

20.4.4 Control-based normalization

If control ADTs are available, we can assume that they are not differentially abundant between cells. Any difference thus represents some bias that should be normalized away; we define control-based size factors from the sum of counts over all control ADTs, analogous to spike-in normalization (Section 7.4). We demonstrate this approach below by computing size factors from the IgG controls (Figure 20.6).

```r
controls <- grep("^Ig", rownames(altExp(sce)))
sf.control <- librarySizeFactors(altExp(sce), subset_row=controls)
summary(sf.control)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.30    0.68    0.87    1.00    1.13   43.67

plot(sf.amb, sf.control, log="xy",
    xlab="DESeq-like size factors (tag)",
    ylab="Control size factors (tag)")
abline(0, 1, col="grey", lty=2)
```

Figure 20.6: IgG control-derived size factors for each cell in the PBMC dataset, compared to the DESeq-like size factors.

This approach exchanges the previous assumption of a non-differential majority for an assumption about the lack of differential abundance in the control tags.
We might feel that the latter is a generally weaker assumption, but non-specific binding can vary with the biology (e.g., when the cell surface area increases), at which point this normalization strategy would not be appropriate. It also relies on sufficient coverage of the control ADTs, which may not be achieved in well-executed experiments where there is little non-specific binding to provide such counts.

20.4.5 Computing log-normalized values

We suggest using the median-based size factors by default, as they are generally applicable and eliminate most problems with composition biases. We set the size factors for the ADT data by calling sizeFactors() on the relevant altExp(). In contrast, sizeFactors(sce) refers to the size factors for the gene counts, which are usually quite different.

```r
sizeFactors(altExp(sce)) <- sf.amb
```

Regardless of which size factors are chosen, running logNormCounts() will then perform scaling normalization and log-transformation for both the endogenous transcripts and the ADTs using their respective size factors.

```r
sce <- logNormCounts(sce, use.altexps=TRUE)

# Checking that we have normalized values:
assayNames(sce)
## [1] "counts"    "logcounts"
assayNames(altExp(sce))
## [1] "counts"    "logcounts"
```

20.5 Clustering and interpretation

Unlike transcript-based counts, feature selection is largely unnecessary for analyzing ADT data. This is because feature selection has effectively already been performed during experimental design: the manual choice of protein targets means that all ADTs correspond to interesting features by definition. From a practical perspective, the ADT count matrix is already small, so there is no need for data compaction from using HVGs or PCs. Moreover, each ADT is often chosen to capture some orthogonal biological signal, so there is not much extraneous noise in higher dimensions that can be readily removed.
This suggests we should directly apply downstream procedures like clustering and visualization to the log-normalized abundance matrix for the ADTs (Figure 20.7).

```r
# Set d=NA so that the function does not perform PCA.
g.adt <- buildSNNGraph(altExp(sce), d=NA)
clusters.adt <- igraph::cluster_walktrap(g.adt)$membership

# Generating a t-SNE plot.
library(scater)
set.seed(1010010)
altExp(sce) <- runTSNE(altExp(sce))
colLabels(altExp(sce)) <- factor(clusters.adt)
plotTSNE(altExp(sce), colour_by="label", text_by="label", text_col="red")
```

Figure 20.7: t-SNE plot generated from the log-normalized abundance of each ADT in the PBMC dataset. Each point is a cell and is labelled according to its assigned cluster.

With only a few ADTs, characterization of each cluster is most efficiently achieved by creating a heatmap of the average log-abundance of each tag (Figure 20.8). For this experiment, we can easily identify B cells (CD19+), various subsets of T cells (CD3+, CD4+, CD8+), and monocytes and macrophages (CD14+, CD16+), to name a few. More detailed examination of the distribution of abundances within each cluster is easily performed with plotExpression(), where strong bimodality may indicate that finer clustering is required to resolve cell subtypes.

```r
se.averaged <- sumCountsAcrossCells(altExp(sce), clusters.adt,
    exprs_values="logcounts", average=TRUE)

library(pheatmap)
averaged <- assay(se.averaged)
pheatmap(averaged - rowMeans(averaged),
    breaks=seq(-3, 3, length.out=101))
```

Figure 20.8: Heatmap of the average log-normalized abundance of each ADT in each cluster of the PBMC dataset. Colors represent the log2-fold change from the grand average across all clusters.

Of course, this provides little information beyond what we could have obtained from a mass cytometry experiment; the real value of this data lies in the integration of protein abundance with gene expression.
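The quantity plotted in such a heatmap, the per-cluster average log-abundance centered on its row mean, is a simple computation. A minimal Python sketch with hypothetical toy values (mirroring `averaged - rowMeans(averaged)` above):

```python
import numpy as np

def centered_cluster_averages(logcounts, clusters):
    """Per-cluster mean log-abundance, centered on the grand average.

    logcounts: (n_adts, n_cells); clusters: length-n_cells integer labels.
    Returns an (n_adts, n_clusters) matrix where each entry is the deviation
    of a cluster's average from the mean across all clusters for that ADT.
    """
    labels = np.unique(clusters)
    averaged = np.column_stack(
        [logcounts[:, clusters == k].mean(axis=1) for k in labels]
    )
    # Subtract the per-ADT mean across clusters, like rowMeans(averaged).
    return averaged - averaged.mean(axis=1, keepdims=True)

# Hypothetical data: 2 ADTs, 4 cells in 2 clusters. ADT 1 is higher in
# cluster 1; ADT 2 is flat, so it contributes nothing after centering.
lc = np.array([[1.0, 1.0, 3.0, 3.0],
               [2.0, 2.0, 2.0, 2.0]])
cl = np.array([0, 0, 1, 1])
dev = centered_cluster_averages(lc, cl)
```

Centering per ADT is what makes the heatmap readable: each row shows relative enrichment across clusters rather than the (arbitrary) absolute abundance scale of each tag.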
20.6 Integration with gene expression data

20.6.1 By subclustering

In the simplest approach to integration, we take cells in each of the ADT-derived clusters and perform subclustering using the transcript data. This is an in silico equivalent of an experiment that performs FACS to isolate cell types followed by scRNA-seq for further characterization. We exploit the fact that the ADT abundances are cleaner (larger counts, stronger signal) for more robust identification of broad cell types, and use the gene expression data to identify the more subtle structure that manifests in the transcriptome. We demonstrate below by using quickSubCluster() to loop over all of the ADT-derived clusters and subcluster on gene expression (Figure 20.9).

```r
set.seed(101010)
all.sce <- quickSubCluster(sce, clusters.adt,
    prepFUN=function(x) {
        dec <- modelGeneVar(x)
        top <- getTopHVGs(dec, prop=0.1)
        x <- runPCA(x, subset_row=top, ncomponents=25)
    },
    clusterFUN=function(x) {
        g.trans <- buildSNNGraph(x, use.dimred="PCA")
        igraph::cluster_walktrap(g.trans)$membership
    }
)

# Summarizing the number of subclusters in each tag-derived parent cluster,
# compared to the number of cells in that parent cluster.
ncells <- vapply(all.sce, ncol, 0L)
nsubclusters <- vapply(all.sce,
    FUN=function(x) length(unique(x$subcluster)), 0L)
plot(ncells, nsubclusters, xlab="Number of cells", type="n",
    ylab="Number of subclusters", log="xy")
text(ncells, nsubclusters, names(all.sce))
```

Figure 20.9: Number of subclusters identified from the gene expression data within each ADT-derived parent cluster.

Another benefit of subclustering is that we can use the annotation of the ADT-derived clusters to facilitate annotation of each subcluster. If we knew that cluster X contained T cells from the ADT-derived data, there is no need to identify subclusters X.1, X.2, etc.
as T cells from scratch; rather, we can focus on the more subtle (and interesting) differences between the subclusters using findMarkers(). For example, cluster 11 contains CD8+ T cells according to Figure 20.8, within which we can further identify internal subclusters based on granzyme expression (Figure 20.10). Subclustering is also conceptually appealing as it avoids comparing log-fold changes in protein abundances with log-fold changes in gene expression. This ensures that variation (or noise) in the transcript counts does not compromise cell type/state identification from the relatively cleaner ADT counts.

```r
of.interest <- "11"
plotExpression(all.sce[[of.interest]], x="subcluster",
    features=c("ENSG00000100450", "ENSG00000113088"))
```

Figure 20.10: Distribution of log-normalized expression values of GZMH (left) and GZMK (right) in transcript-derived subclusters of an ADT-derived subpopulation of CD8+ T cells.

The downside is that relying on previous results increases the risk of misleading conclusions when ambiguities in those results are not considered, as previously discussed in Section 10.7. It is a good idea to perform some additional checks to ensure that each subcluster has similar protein abundances, e.g., using a heatmap as in Figure 20.8 or a series of plots as in Figure 20.11. If so, the subcluster can "inherit" the annotation attached to the parent cluster for easier interpretation.

```r
sce.cd8 <- all.sce[[of.interest]]
plotExpression(altExp(sce.cd8), x=I(sce.cd8$subcluster),
    features=c("CD3", "CD8a"))
```

Figure 20.11: Distribution of log-normalized abundances of ADTs for CD3 and CD8a in each subcluster of the CD8+ T cell population.

20.6.2 By combined clustering

Alternatively, we can combine the information from both sets of features into a single matrix for use in downstream analyses.
This is logistically convenient as the combined structure is compatible with routine analysis workflows for transcript-only data. To illustrate, we first perform some standard steps on the transcript count matrix:

```r
sce.main <- logNormCounts(sce)
dec.main <- modelGeneVar(sce.main)
top.main <- getTopHVGs(dec.main, prop=0.1)
sce.main <- runPCA(sce.main, subset_row=top.main, ncomponents=25)
```

The simplest version of this idea involves literally combining the log-normalized abundance matrix for the ADTs with the log-expression matrix (or its compacted form, the matrix of PCs) to obtain a single matrix for use in downstream procedures. This requires some reweighting to balance the contribution of the transcript and ADT data to the total variance in the combined matrix, especially given that the former has around 100-fold more features than the latter. We see that the number of clusters is slightly higher than that from the ADT data alone, consistent with the introduction of additional heterogeneity when the two feature sets are combined.

```r
# TODO: push this into a function somewhere.
library(DelayedMatrixStats)
transcript.data <- logcounts(sce.main)[top.main,,drop=FALSE]
transcript.var <- sum(rowVars(DelayedArray(transcript.data)))
tag.data <- logcounts(altExp(sce.main))
tag.var <- sum(rowVars(DelayedArray(tag.data)))

reweight <- sqrt(transcript.var/tag.var)
combined <- rbind(transcript.data, tag.data*reweight)

# 'buildSNNGraph' conveniently performs the PCA for us if requested. We use
# more PCs in 'd' to capture more variance in both sets of features. Note that
# this uses IRLBA by default so we need to set the seed.
```
```r
set.seed(100010)
g.com <- buildSNNGraph(combined, d=50)
clusters.com <- igraph::cluster_walktrap(g.com)$membership
table(clusters.com)
## clusters.com
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
##   51   70  347  116   77  842  504 1121 1679   60   52  340  167   47   19   53 
##   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32 
##   71  981   40   36   16   68   25   25  388   73   35   11   67   20   47   27 
##   33 
##   17
```

A more sophisticated approach uses the UMAP algorithm (McInnes, Healy, and Melville 2018) to integrate information from the two sets of features. Very loosely speaking, we can imagine this as an intersection of the nearest-neighbor graphs formed from each set, which effectively encourages the formation of communities of cells that are close in both feature spaces. Here, we perform two rounds of UMAP: one round retains high dimensionality for a faithful representation of the data during clustering, while the other performs dimensionality reduction for a pretty visualization. This yields an extremely fine-grained clustering in Figure 20.12, which is attributable to the stringency of intersection operations for defining the local neighborhood.

```r
set.seed(1001010)
sce.main <- runMultiUMAP(sce.main, dimred="PCA",
    altexp="Antibody Capture", n_components=50)

# Bumping up 'k' for less granularity.
```
```r
g.com2 <- buildSNNGraph(sce.main, k=20, use.dimred="MultiUMAP")
clusters.com2 <- igraph::cluster_walktrap(g.com2)$membership
table(clusters.com2)
## clusters.com2
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
##  324  173  220  704  790  123  395  733 1022  426   78   67  103  181   62  763 
##   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32 
##   50   96   87  159   90   53   33   71   62   65   33   53   42   28   61   34 
##   33   34   35   36   37   38   39   40   41   42 
##   47   33   29   23   33   36   20   32   26   32

# Combining again for visualization:
set.seed(0101110)
sce.main <- runMultiUMAP(sce.main, dimred="PCA",
    altexp="Antibody Capture", name="MultiUMAP2")
colLabels(sce.main) <- clusters.com2
plotReducedDim(sce.main, "MultiUMAP2", colour_by="label", text_by="label")
```

Figure 20.12: UMAP plot obtained by combining transcript and ADT data in the PBMC dataset. Each point represents a cell and is colored according to its assigned cluster.

An even more sophisticated approach uses factor analysis to identify common and unique factors of variation in each feature set. The set of factors can then be used as low-dimensional coordinates for each cell in downstream analyses, though a number of additional statistics are also computed that may be useful, e.g., the contribution of each feature to each factor.

```r
# Waiting for MOFA2.
```

These combined strategies are convenient but do not consider (or implicitly make assumptions about) the importance of heterogeneity in the ADT data relative to the transcript data. For example, the UMAP approach takes equal contributions from both sets of features to the intersection, which may not be appropriate if the biology of interest is concentrated in only one set. More generally, a combined analysis must consider the potential for uninteresting noise in one set to interfere with biological signal in the other, a concern that is largely avoided by subclustering.
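The square-root variance reweighting used in the combined-matrix approach above is easy to verify numerically: scaling the ADT matrix by sqrt(transcript.var/tag.var) makes its total variance equal to that of the transcript matrix. A minimal Python sketch with hypothetical random data:

```python
import numpy as np

def reweight_and_combine(rna, adt):
    """Rescale the ADT matrix so both feature sets contribute equal total variance.

    rna: (n_genes, n_cells); adt: (n_adts, n_cells). Mirrors the
    sqrt(transcript.var / tag.var) factor applied before row-binding above.
    """
    var_rna = rna.var(axis=1, ddof=1).sum()   # total per-feature variance, transcripts
    var_adt = adt.var(axis=1, ddof=1).sum()   # total per-feature variance, tags
    w = np.sqrt(var_rna / var_adt)            # variance scales as w^2
    return np.vstack([rna, adt * w]), w

rng = np.random.default_rng(0)
rna = rng.normal(scale=3.0, size=(100, 50))   # many high-variance transcript features
adt = rng.normal(scale=0.5, size=(10, 50))    # few low-variance tag features
combined, w = reweight_and_combine(rna, adt)
```

Without this rescaling, Euclidean distances in the combined space would be dominated by whichever feature set happens to have the larger total variance, which is usually the transcripts simply because there are far more of them.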
20.6.3 By differential testing

In more interesting applications of this technology, the protein targets are chosen to reflect some functional activity rather than cell type. (Because, frankly, the latter is not particularly hard to infer from transcript data in most cases.) A particularly elegant example involves quantification of the immune response with antibodies that target influenza peptide-MHCII complexes in T cells, albeit for mass cytometry (Fehlings et al. 2018). If the aim is to test for differences in the functional readout, a natural analysis strategy is to use the transcript data for clustering (Figure 20.13) and perform differential testing between clusters or conditions for the relevant ADTs.

```r
# Performing a quick analysis of the gene expression data.
sce <- logNormCounts(sce)
dec <- modelGeneVar(sce)
top <- getTopHVGs(dec, prop=0.1)

set.seed(1001010)
sce <- runPCA(sce, subset_row=top, ncomponents=25)

g <- buildSNNGraph(sce, use.dimred="PCA")
clusters <- igraph::cluster_walktrap(g)$membership
colLabels(sce) <- factor(clusters)

set.seed(1000010)
sce <- runTSNE(sce, dimred="PCA")
plotTSNE(sce, colour_by="label", text_by="label")
```

Figure 20.13: t-SNE plot of the PBMC dataset based on the transcript data. Each point is a cell and is colored according to the assigned cluster.

We demonstrate this approach below, using findMarkers() to test for differences in tag abundance between clusters (Chapter 11). For example, if the PD-1 level was a readout for some interesting phenotype such as T cell exhaustion (Pauken and Wherry 2015), we might be interested in its upregulation in cluster 13 compared to all other clusters (Figure 20.14). Methods from Chapter 14 can be similarly used to test for differences between conditions based on pseudo-bulk ADT counts.
```r
markers <- findMarkers(altExp(sce), colLabels(sce))
of.interest <- markers[[13]]
pheatmap(getMarkerEffects(of.interest),
    breaks=seq(-3, 3, length.out=101))
```

Figure 20.14: Heatmap of log-fold changes in tag abundances in cluster 13 compared to all other clusters identified from the transcript data in the PBMC dataset.

The main appeal of this approach is that it avoids data snooping (Section 11.5.1), as the clusters are defined without knowledge of the ADTs. This improves the statistical rigor of the subsequent differential testing on the ADT abundances (though only to some extent; other problems are still present, such as the lack of true replication in between-cluster comparisons). From a practical perspective, this approach yields fewer clusters and reduces the amount of work involved in manual annotation, especially if there are multiple functional states (e.g., stressed, apoptotic, stimulated) for each cell type. However, it is fundamentally limited to per-tag inferences; if we want to identify subpopulations with interesting combinations of target proteins, we must resort to high-dimensional analyses like clustering on the ADT abundances.
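The pseudo-bulk ADT counts mentioned above are simply sums of counts within each cluster-sample combination. A minimal sketch in Python with hypothetical labels (not the aggregation functions used by the Bioconductor packages):

```python
import numpy as np

def pseudobulk(counts, clusters, samples):
    """Sum counts within each (cluster, sample) combination.

    counts: (n_adts, n_cells). Returns a dict mapping (cluster, sample)
    to a length-n_adts vector of summed counts, one pseudo-bulk profile
    per combination that actually contains cells.
    """
    out = {}
    for k in np.unique(clusters):
        for s in np.unique(samples):
            mask = (clusters == k) & (samples == s)
            if mask.any():
                out[(k, s)] = counts[:, mask].sum(axis=1)
    return out

# Toy data: 2 ADTs, 4 cells, 2 clusters crossed with 2 samples.
counts = np.array([[1, 2, 3, 4],
                   [5, 6, 7, 8]])
clusters = np.array([0, 0, 1, 1])
samples = np.array(["a", "b", "a", "b"])
pb = pseudobulk(counts, clusters, samples)
```

Summing across cells yields one count profile per cluster-sample combination, so replication is then assessed across samples rather than across (non-independent) cells.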
Session Info

```r
R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS
```

Chapter 21 Analyzing repertoire sequencing data

Figure 21.1: This page is under construction.

21.1 Motivation

An organism's immune repertoire is defined as the set of T and B cell subtypes that contain genetic diversity in their T cell receptor (TCR) components or immunoglobulin chains, respectively.
This diversity is important for ensuring that the adaptive immune system can respond effectively to a wide range of antigens. We can profile the immune repertoire by simply sequencing the relevant transcripts (Georgiou et al. 2014; Rosati et al. 2017), a procedure that can be combined with previously mentioned technologies (Zheng et al. 2017) to achieve single-cell resolution. This data can then be used to characterize an individual's immune response based on the expansion of T or B cell clones, i.e., multiple cells with the same sequences for each TCR component or immunoglobulin chain.

By itself, single-cell repertoire sequencing data can be readily analyzed with a variety of tools such as those from the ImmCantation suite. For example, the alakazam package provides functions to perform common analyses that quantify clonal diversity, reconstruct lineages, examine amino acid properties, and so on. We will not attempt to regurgitate their documentation in this chapter; rather, we will focus on how to integrate repertoire sequencing data structures into our existing SingleCellExperiment framework. This is not entirely trivial, as each cell may have zero, one or multiple sequences for any given repertoire component, whereas we only obtain a single expression profile for that cell.

We would like to define a single data structure that captures both the expression profile and repertoire state for each cell. This ensures synchronization during operations like subsetting (as previously discussed for the SingleCellExperiment class) and reduces book-keeping errors throughout the course of an interactive analysis. We achieve this using the SplitDataFrameList class from the IRanges package, which allows us to accommodate repertoire sequencing data within existing Bioconductor classes while retaining compatibility with functions from external analysis tools.
We demonstrate on a publicly available 10X Genomics dataset of human PBMCs, for which the expression and ADT data have already been processed as shown below:

```r
#--- loading ---#
library(BiocFileCache)
bfc <- BiocFileCache(ask=FALSE)
exprs.data <- bfcrpath(bfc, file.path(
    "http://cf.10xgenomics.com/samples/cell-vdj/3.1.0",
    "vdj_v1_hs_pbmc3",
    "vdj_v1_hs_pbmc3_filtered_feature_bc_matrix.tar.gz"))
untar(exprs.data, exdir=tempdir())

library(DropletUtils)
sce.pbmc <- read10xCounts(file.path(tempdir(), "filtered_feature_bc_matrix"))
sce.pbmc <- splitAltExps(sce.pbmc, rowData(sce.pbmc)$Type)

#--- quality-control ---#
library(scater)
is.mito <- grep("^MT-", rowData(sce.pbmc)$Symbol)
stats <- perCellQCMetrics(sce.pbmc, subsets=list(Mito=is.mito))
high.mito <- isOutlier(stats$subsets_Mito_percent, type="higher")
low.adt <- stats$`altexps_Antibody Capture_detected` < nrow(altExp(sce.pbmc))/2
discard <- high.mito | low.adt
sce.pbmc <- sce.pbmc[,!discard]

#--- normalization ---#
library(scran)
set.seed(1000)
clusters <- quickCluster(sce.pbmc)
sce.pbmc <- computeSumFactors(sce.pbmc, cluster=clusters)
altExp(sce.pbmc) <- computeMedianFactors(altExp(sce.pbmc))
sce.pbmc <- logNormCounts(sce.pbmc, use_altexps=TRUE)

#--- dimensionality-reduction ---#
set.seed(100000)
altExp(sce.pbmc) <- runTSNE(altExp(sce.pbmc))
set.seed(1000000)
altExp(sce.pbmc) <- runUMAP(altExp(sce.pbmc))

#--- clustering ---#
g.adt <- buildSNNGraph(altExp(sce.pbmc), k=10, d=NA)
clust.adt <- igraph::cluster_walktrap(g.adt)$membership
colLabels(altExp(sce.pbmc)) <- factor(clust.adt)

sce.pbmc
## class: SingleCellExperiment 
## dim: 33538 6660 
## metadata(1): Samples
## assays(2): counts logcounts
## rownames(33538): ENSG00000243485 ENSG00000237613 ...
```
```r
##   ENSG00000277475 ENSG00000268674
## rowData names(3): ID Symbol Type
## colnames: NULL
## colData names(3): Sample Barcode sizeFactor
## reducedDimNames(0):
## altExpNames(1): Antibody Capture

# Moving the ADT-based clustering to the top level for convenience.
colLabels(sce.pbmc) <- colLabels(altExp(sce.pbmc))
```

21.2 Loading the TCR repertoire

First, we obtain the filtered TCR contig annotations for the same set of cells. Each row of the resulting data frame contains information about a single TCR component sequence in one cell, broken down into the alleles of the V(D)J genes making up that component (v_gene, d_gene, j_gene) where possible. The numbers of reads and UMIs supporting the set of allele assignments for a cell are also shown, though only the UMI count should be used for quantifying expression of a particular TCR sequence. Each cell is assigned to a clonotype (raw_clonotype_id) based on the combination of the α-chain (TRA) and β-chain (TRB) sequences in that cell.
```r
library(BiocFileCache)
bfc <- BiocFileCache(ask=FALSE)
tcr.data <- bfcrpath(bfc, file.path(
    "http://cf.10xgenomics.com/samples/cell-vdj/3.1.0",
    "vdj_v1_hs_pbmc3/vdj_v1_hs_pbmc3_t_filtered_contig_annotations.csv"))
tcr <- read.csv(tcr.data, stringsAsFactors=FALSE)
nrow(tcr)
## [1] 10121

head(tcr)
##              barcode is_cell                   contig_id high_confidence length
## 1 AAACCTGAGATCTGAA-1    True AAACCTGAGATCTGAA-1_contig_1            True    521
## 2 AAACCTGAGATCTGAA-1    True AAACCTGAGATCTGAA-1_contig_2            True    474
## 3 AAACCTGAGGAACTGC-1    True AAACCTGAGGAACTGC-1_contig_1            True    496
## 4 AAACCTGAGGAACTGC-1    True AAACCTGAGGAACTGC-1_contig_2            True    505
## 5 AAACCTGAGGAGTCTG-1    True AAACCTGAGGAGTCTG-1_contig_1            True    495
## 6 AAACCTGAGGAGTCTG-1    True AAACCTGAGGAGTCTG-1_contig_2            True    526
##   chain     v_gene d_gene  j_gene c_gene full_length productive
## 1   TRB   TRBV20-1   None TRBJ2-7  TRBC2        True       True
## 2   TRA   TRAV13-1   None  TRAJ44   TRAC        True       True
## 3   TRB    TRBV7-2   None TRBJ2-1  TRBC2        True       True
## 4   TRA TRAV23/DV6   None  TRAJ34   TRAC        True       True
## 5   TRA      TRAV2   None  TRAJ38   TRAC        True       True
## 6   TRB    TRBV6-2   None TRBJ1-1  TRBC1        True       True
##                 cdr3                                                cdr3_nt
## 1     CSARDKGLSYEQYF             TGCAGTGCTAGAGACAAGGGGCTTAGCTACGAGCAGTACTTC
## 2 CAASIGPLGTGTASKLTF TGTGCAGCAAGTATCGGCCCCCTAGGAACCGGCACTGCCAGTAAACTCACCTTT
## 3      CASSLGPSGEQFF                TGTGCCAGCAGCTTGGGACCATCGGGTGAGCAGTTCTTC
## 4       CAASDNTDKLIF                   TGTGCAGCAAGCGATAACACCGACAAGCTCATCTTT
## 5   CAVEANNAGNNRKLIW       TGTGCTGTGGAGGCTAATAATGCTGGCAACAACCGTAAGCTGATTTGG
## 6      CASSRTGGTEAFF                TGTGCCAGCAGTCGGACAGGGGGCACTGAAGCTTTCTTT
##   reads umis raw_clonotype_id         raw_consensus_id
## 1  9327   12     clonotype100 clonotype100_consensus_1
## 2  3440    3     clonotype100 clonotype100_consensus_2
## 3 32991   29     clonotype101 clonotype101_consensus_2
## 4 10714    9     clonotype101 clonotype101_consensus_1
## 5  1734    3     clonotype102 clonotype102_consensus_1
## 6 15530   13     clonotype102 clonotype102_consensus_2
```

The challenge in incorporating all of these data structures into a single object lies in the fact that each cell may have zero, one or
many TCR sequences. This precludes direct storage of the repertoire information in the colData() of the SingleCellExperiment, which expects a 1:1 mapping from each cell to each repertoire sequence. Instead, we store the repertoire data as a SplitDataFrameList object, where each cell is represented by a variable-row DataFrame containing information for zero-to-many sequences. This is easily done by:

1. Converting our data.frame to a DataFrame from the S4Vectors package. We demonstrate this process below for the alpha chain:

```r
tra <- tcr[tcr$chain=="TRA",]
tra <- DataFrame(tra)
```

2. Defining a factor of cell identities, using the cells present in sce.pbmc as the levels. This ensures that, in the resulting SplitDataFrameList, we always have one entry per cell in sce.pbmc, even if those entries consist of zero-row DataFrames.

```r
cell.id <- factor(tra$barcode, sce.pbmc$Barcode)
```

3. Using the split() function to break up tra into one DataFrame per cell, forming the desired SplitDataFrameList.

```r
tra.list <- split(tra, cell.id)
class(tra.list)
## [1] "CompressedSplitDFrameList"
## attr(,"package")
## [1] "IRanges"
```

Put together, this is as simple as the following stretch of code (repeated for the beta chain):

```r
trb <- tcr[tcr$chain=="TRB",]
trb.list <- split(DataFrame(trb), factor(trb$barcode, sce.pbmc$Barcode))
length(trb.list)
## [1] 6660
```

Both of these objects are guaranteed to have a 1:1 mapping to the columns of sce.pbmc, allowing us to store them directly in the colData as additional metadata fields. This ensures that any subsetting applied to the SingleCellExperiment is synchronized across both the gene expression and repertoire data. For example, an obvious step might be to subset sce.pbmc to only contain T cells so that the TCR analysis is not distorted by other irrelevant cell types.
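The key property of this split() call, that every cell gets an entry even when it has no sequences, is language-agnostic. A minimal Python sketch with hypothetical barcodes shows the same idiom:

```python
def split_by_cell(rows, barcodes, all_cells):
    """Group rows by barcode, keeping an (empty) entry for every cell.

    Mimics split(DataFrame(tra), factor(barcode, levels=all_cells)):
    the result always has one group per cell in all_cells, preserving a
    1:1 mapping to the columns of the expression object.
    """
    groups = {cell: [] for cell in all_cells}   # pre-create empty groups
    for row, bc in zip(rows, barcodes):
        if bc in groups:                        # drop rows for filtered-out cells
            groups[bc].append(row)
    return groups

# Three cells survive QC, but only the first has any TCR contigs.
cells = ["AAAC-1", "CCCT-1", "GGGA-1"]
rows = [{"chain": "TRA", "umis": 3}, {"chain": "TRA", "umis": 2}]
bcs = ["AAAC-1", "AAAC-1"]
g = split_by_cell(rows, bcs, cells)
# g has exactly three entries; g["GGGA-1"] is an empty list, the analogue
# of a zero-row DataFrame in the SplitDataFrameList.
```

A naive group-by on the barcodes alone would silently drop cells with zero sequences and break the 1:1 mapping; pre-seeding the groups with all cell identities is what guarantees synchronization with the expression data.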
```r
sce.pbmc$TRA <- tra.list
sce.pbmc$TRB <- trb.list
```

21.3 Leveraging List semantics

At this point, it is worth spending some time on the power of the SplitDataFrameList and the List grammar. In the simplest case, the SplitDataFrameList can be treated as an ordinary list of DataFrames with one entry per cell.

```r
sce.pbmc$TRA[[1]] # for the first cell.
## DataFrame with 1 row and 18 columns
##              barcode     is_cell              contig_id high_confidence
##          <character> <character>            <character>     <character>
## 1 AAACCTGAGATCTGAA-1        True AAACCTGAGATCTGAA-1_c..            True
##      length       chain      v_gene      d_gene      j_gene      c_gene
##   <integer> <character> <character> <character> <character> <character>
## 1       474         TRA    TRAV13-1        None      TRAJ44        TRAC
##   full_length  productive               cdr3                cdr3_nt     reads
##   <character> <character>        <character>            <character> <integer>
## 1        True        True CAASIGPLGTGTASKLTF TGTGCAGCAAGTATCGGCCC..      3440
##        umis raw_clonotype_id       raw_consensus_id
##   <integer>      <character>            <character>
## 1         3     clonotype100 clonotype100_consens..

sce.pbmc$TRA[[2]] # for the second cell.
## DataFrame with 1 row and 18 columns
##              barcode     is_cell              contig_id high_confidence
##          <character> <character>            <character>     <character>
## 1 AAACCTGAGGAACTGC-1        True AAACCTGAGGAACTGC-1_c..            True
##      length       chain      v_gene      d_gene      j_gene      c_gene
##   <integer> <character> <character> <character> <character> <character>
## 1       505         TRA  TRAV23/DV6        None      TRAJ34        TRAC
##   full_length  productive         cdr3                cdr3_nt     reads
##   <character> <character>  <character>            <character> <integer>
## 1        True        True CAASDNTDKLIF TGTGCAGCAAGCGATAACAC..     10714
##        umis raw_clonotype_id       raw_consensus_id
##   <integer>      <character>            <character>
## 1         9     clonotype101 clonotype101_consens..

sce.pbmc$TRA[3:5] # for the third-to-fifth cells.
```
## SplitDataFrameList of length 3 ## $`AAACCTGAGGAGTCTG-1` ## DataFrame with 1 row and 18 columns ## barcode is_cell contig_id high_confidence ## &lt;character&gt; &lt;character&gt; &lt;character&gt; &lt;character&gt; ## 1 AAACCTGAGGAGTCTG-1 True AAACCTGAGGAGTCTG-1_c.. True ## length chain v_gene d_gene j_gene c_gene ## &lt;integer&gt; &lt;character&gt; &lt;character&gt; &lt;character&gt; &lt;character&gt; &lt;character&gt; ## 1 495 TRA TRAV2 None TRAJ38 TRAC ## full_length productive cdr3 cdr3_nt reads ## &lt;character&gt; &lt;character&gt; &lt;character&gt; &lt;character&gt; &lt;integer&gt; ## 1 True True CAVEANNAGNNRKLIW TGTGCTGTGGAGGCTAATAA.. 1734 ## umis raw_clonotype_id raw_consensus_id ## &lt;integer&gt; &lt;character&gt; &lt;character&gt; ## 1 3 clonotype102 clonotype102_consens.. ## ## $`AAACCTGAGGCTCTTA-1` ## DataFrame with 2 rows and 18 columns ## barcode is_cell contig_id high_confidence ## &lt;character&gt; &lt;character&gt; &lt;character&gt; &lt;character&gt; ## 1 AAACCTGAGGCTCTTA-1 True AAACCTGAGGCTCTTA-1_c.. True ## 2 AAACCTGAGGCTCTTA-1 True AAACCTGAGGCTCTTA-1_c.. True ## length chain v_gene d_gene j_gene c_gene ## &lt;integer&gt; &lt;character&gt; &lt;character&gt; &lt;character&gt; &lt;character&gt; &lt;character&gt; ## 1 501 TRA TRAV12-3 None TRAJ37 TRAC ## 2 493 TRA TRAV25 None TRAJ29 TRAC ## full_length productive cdr3 cdr3_nt reads ## &lt;character&gt; &lt;character&gt; &lt;character&gt; &lt;character&gt; &lt;integer&gt; ## 1 True True CAMSSSGNTGKLIF TGTGCAATGAGCTCCTCTGG.. 4325 ## 2 True False None None 2600 ## umis raw_clonotype_id raw_consensus_id ## &lt;integer&gt; &lt;character&gt; &lt;character&gt; ## 1 3 clonotype103 clonotype103_consens.. ## 2 2 clonotype103 None ## ## $`AAACCTGAGTACGTTC-1` ## DataFrame with 0 rows and 18 columns head(lengths(sce.pbmc$TRA)) # number of sequences per cell. 
## AAACCTGAGATCTGAA-1 AAACCTGAGGAACTGC-1 AAACCTGAGGAGTCTG-1 AAACCTGAGGCTCTTA-1 ## 1 1 1 2 ## AAACCTGAGTACGTTC-1 AAACCTGGTCTTGTCC-1 ## 0 0 However, it is also possible to treat a SplitDataFrameList like a giant DataFrame with respect to its columns. When constructed in the manner described above, all entries of the SplitDataFrameList have the same columns; this allows us to use column-subsetting semantics to extract the same column from all of the internal DataFrames. # Create a new SplitDataFrameList consisting only # of the columns &#39;reads&#39; and &#39;umis&#39;. sce.pbmc$TRA[,c(&quot;reads&quot;, &quot;umis&quot;)] ## SplitDataFrameList of length 6660 ## $`AAACCTGAGATCTGAA-1` ## DataFrame with 1 row and 2 columns ## reads umis ## &lt;integer&gt; &lt;integer&gt; ## 1 3440 3 ## ## $`AAACCTGAGGAACTGC-1` ## DataFrame with 1 row and 2 columns ## reads umis ## &lt;integer&gt; &lt;integer&gt; ## 1 10714 9 ## ## $`AAACCTGAGGAGTCTG-1` ## DataFrame with 1 row and 2 columns ## reads umis ## &lt;integer&gt; &lt;integer&gt; ## 1 1734 3 ## ## ... ## &lt;6657 more elements&gt; # Extract a single column as a new List. sce.pbmc$TRA[,&quot;reads&quot;] ## IntegerList of length 6660 ## [[&quot;AAACCTGAGATCTGAA-1&quot;]] 3440 ## [[&quot;AAACCTGAGGAACTGC-1&quot;]] 10714 ## [[&quot;AAACCTGAGGAGTCTG-1&quot;]] 1734 ## [[&quot;AAACCTGAGGCTCTTA-1&quot;]] 4325 2600 ## [[&quot;AAACCTGAGTACGTTC-1&quot;]] integer(0) ## [[&quot;AAACCTGGTCTTGTCC-1&quot;]] integer(0) ## [[&quot;AAACCTGGTTGCGTTA-1&quot;]] integer(0) ## [[&quot;AAACCTGTCAACGGGA-1&quot;]] 5355 1686 ## [[&quot;AAACCTGTCACTGGGC-1&quot;]] integer(0) ## [[&quot;AAACCTGTCAGCTGGC-1&quot;]] integer(0) ## ... ## &lt;6650 more elements&gt; For the \"reads\" column, the above subsetting yields an IntegerList where each entry corresponds to the integer vector that would have been extracted from the corresponding entry of the SplitDataFrameList. 
This is equivalent to looping over the SplitDataFrameList and column-subsetting each individual DataFrame, though the actual implementation is much more efficient. The IntegerList and its type counterparts (e.g., CharacterList, LogicalList) are convenient structures as they support a number of vector-like operations in the expected manner. For example, a boolean operation will convert an IntegerList into a LogicalList: sce.pbmc$TRA[,&quot;umis&quot;] &gt; 2 ## LogicalList of length 6660 ## [[&quot;AAACCTGAGATCTGAA-1&quot;]] TRUE ## [[&quot;AAACCTGAGGAACTGC-1&quot;]] TRUE ## [[&quot;AAACCTGAGGAGTCTG-1&quot;]] TRUE ## [[&quot;AAACCTGAGGCTCTTA-1&quot;]] TRUE FALSE ## [[&quot;AAACCTGAGTACGTTC-1&quot;]] logical(0) ## [[&quot;AAACCTGGTCTTGTCC-1&quot;]] logical(0) ## [[&quot;AAACCTGGTTGCGTTA-1&quot;]] logical(0) ## [[&quot;AAACCTGTCAACGGGA-1&quot;]] TRUE FALSE ## [[&quot;AAACCTGTCACTGGGC-1&quot;]] logical(0) ## [[&quot;AAACCTGTCAGCTGGC-1&quot;]] logical(0) ## ... ## &lt;6650 more elements&gt; This is where the final mode of SplitDataFrameList subsetting comes into play. If we use a LogicalList to subset a SplitDataFrameList, we can subset on the individual sequences within each cell. This is functionally equivalent to looping over both the LogicalList and SplitDataFrameList simultaneously and using each logical vector to subset the corresponding DataFrame. For example, we can filter our SplitDataFrameList so that each per-cell DataFrame only contains sequences with more than 2 UMIs. more.than.2 &lt;- sce.pbmc$TRA[sce.pbmc$TRA[,&quot;umis&quot;] &gt; 2] head(lengths(more.than.2)) ## AAACCTGAGATCTGAA-1 AAACCTGAGGAACTGC-1 AAACCTGAGGAGTCTG-1 AAACCTGAGGCTCTTA-1 ## 1 1 1 1 ## AAACCTGAGTACGTTC-1 AAACCTGGTCTTGTCC-1 ## 0 0 We can exploit these semantics to quickly assemble complex queries without the need for any other packages. 
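The looping equivalence described above can be sketched with plain base R, which may help build intuition before working with the real classes. This is only a toy stand-in (the barcodes and counts below are made up): ordinary named lists play the role of the IntegerList and LogicalList, whereas the actual List implementation performs the same operation in a vectorized fashion. ```r # Toy stand-in for the List classes: an ordinary named list of integer # vectors, one entry per cell (names mimic cell barcodes; values invented). umis <- list( "cell-1" = c(3L, 1L), # a cell with two sequences "cell-2" = integer(0), # a cell with no sequences "cell-3" = 9L # a cell with one sequence ) # Mimic `umis > 2` on an IntegerList: one logical vector per cell. keep <- lapply(umis, function(u) u > 2L) # Mimic subsetting a SplitDataFrameList by a LogicalList: subset each # cell's sequences by the matching logical vector. filtered <- Map(`[`, umis, keep) lengths(filtered) # cell-1: 1, cell-2: 0, cell-3: 1 ``` The List classes achieve the same result without an explicit per-cell loop, which matters at the scale of thousands of cells.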
Say we want to determine the proportion of cells in each cluster that have at least one productive sequence of a TCR component, i.e., contigs that are likely to produce a functional protein. Clusters with large counts are most likely to be T cells, though some background level of TCR expression may be observed in other clusters due to a mixture of clustering uncertainty, ambient contamination, doublet formation and potential expression in other cell types. This quantification is easily achieved with the built-in List semantics for both the alpha and beta chains (Figure 21.2). # Generate a LogicalList class where each entry corresponds to a cell and is # a logical vector specifying which of that cell&#39;s sequences are productive. is.prod.A &lt;- sce.pbmc$TRA[,&quot;productive&quot;]==&quot;True&quot; # We can apply operations to this LogicalList that mimic looping over a logical # vector. For example, `any()` will return a logical vector of length equal to # the number of cells, with a value of TRUE if any sequence is productive. has.prod.A &lt;- any(is.prod.A) # And then we simply count the number of cells in each cluster. tra.counts.prod &lt;- table(colLabels(sce.pbmc)[has.prod.A]) is.prod.B &lt;- sce.pbmc$TRB[,&quot;productive&quot;]==&quot;True&quot; has.prod.B &lt;- any(is.prod.B) trb.counts.prod &lt;- table(colLabels(sce.pbmc)[has.prod.B]) ncells &lt;- table(colLabels(sce.pbmc)) barplot(rbind(TRA=tra.counts.prod/ncells, TRB=trb.counts.prod/ncells), legend=TRUE, beside=TRUE) Figure 21.2: Proportion of cells in each cluster that express at least one productive sequence of the TCR \\(\\alpha\\) (dark) or \\(\\beta\\)-chains (light). Alternatively, we may wish to determine the proportion of UMIs assigned to the most abundant sequence in each cell. This is easily achieved by applying the max() and sum() functions on the UMI count “column” (Figure 21.3). 
umi.data &lt;- sce.pbmc$TRA[,&quot;umis&quot;] hist(max(umi.data)/sum(umi.data), xlab=&quot;Proportion in most abundant&quot;) Figure 21.3: Proportion of UMIs assigned to the most abundant sequence in each cell. We can also apply boolean operations on our LogicalList objects to perform per-sequence queries. The example below filters to retain sequences that are full-length, productive and have the largest UMI count in the cell. tra &lt;- sce.pbmc$TRA # assigning to a variable for brevity keep &lt;- tra[,&quot;full_length&quot;]==&quot;True&quot; &amp; tra[,&quot;productive&quot;]==&quot;True&quot; &amp; tra[,&quot;umis&quot;] == max(tra[,&quot;umis&quot;]) tra.sub &lt;- tra[keep] # How many cells have at least one sequence satisfying all requirements? summary(sum(keep) &gt;= 1) ## Mode FALSE TRUE ## logical 3369 3291 21.4 Converting back to DataFrames If an operation must be performed on the original sequence-level data frame, we can efficiently recover it by calling unlist() on our SplitDataFrameList. It is similarly straightforward to regenerate our SplitDataFrameList from the data frame by using the relist() function. This framework permits users to quickly switch between sequence level and cell level perspectives of the repertoire data depending on which is most convenient at any given point in the analysis. tra.seq &lt;- unlist(tra) dim(tra.seq) # Each row represents a sequence now. ## [1] 4863 18 # Adding some arbitrary extra annotation (mocked up here). extra.anno &lt;- DataFrame(anno=sample(LETTERS, nrow(tra.seq), replace=TRUE)) tra.seq &lt;- cbind(tra.seq, extra.anno) # Regenerating the SplitDataFrameList from the modified DataFrame. tra2 &lt;- relist(tra.seq, tra) length(tra2) # Each element represents a cell again. ## [1] 6660 While the SplitDataFrameList provides a natural representation of the 1:many mapping of cells to sequences, many applications require a 1:1 relationship in order to function properly. 
This includes plotting where each cell is a point that is to be colored by a single property, or any function that requires cells to be grouped by their characteristics (e.g., DE analysis, aggregation). We can use a combination of List and DataFrame semantics to choose a single representative sequence in cells where multiple sequences are available and fill in the row with NAs for cells where no sequences are available. # We identify the sequence with the most UMIs per cell. best.per.cell &lt;- which.max(tra[,&quot;umis&quot;]) # We convert this into an IntegerList: best.per.cell &lt;- as(best.per.cell, &quot;IntegerList&quot;) # And then we use it to subset our SplitDataFrameList: collapsed &lt;- tra[best.per.cell] # Finally unlisting to obtain a DataFrame: collapsed &lt;- unlist(collapsed) collapsed[,1:5] ## DataFrame with 6660 rows and 5 columns ## barcode is_cell contig_id ## &lt;character&gt; &lt;character&gt; &lt;character&gt; ## AAACCTGAGATCTGAA-1 AAACCTGAGATCTGAA-1 True AAACCTGAGATCTGAA-1_c.. ## AAACCTGAGGAACTGC-1 AAACCTGAGGAACTGC-1 True AAACCTGAGGAACTGC-1_c.. ## AAACCTGAGGAGTCTG-1 AAACCTGAGGAGTCTG-1 True AAACCTGAGGAGTCTG-1_c.. ## AAACCTGAGGCTCTTA-1 AAACCTGAGGCTCTTA-1 True AAACCTGAGGCTCTTA-1_c.. ## AAACCTGAGTACGTTC-1 NA NA NA ## ... ... ... ... ## TTTGTCATCACGGTTA-1 TTTGTCATCACGGTTA-1 True TTTGTCATCACGGTTA-1_c.. ## TTTGTCATCCCTTGTG-1 TTTGTCATCCCTTGTG-1 True TTTGTCATCCCTTGTG-1_c.. ## TTTGTCATCGGAAATA-1 NA NA NA ## TTTGTCATCTAGAGTC-1 NA NA NA ## TTTGTCATCTCGCATC-1 NA NA NA ## high_confidence length ## &lt;character&gt; &lt;integer&gt; ## AAACCTGAGATCTGAA-1 True 474 ## AAACCTGAGGAACTGC-1 True 505 ## AAACCTGAGGAGTCTG-1 True 495 ## AAACCTGAGGCTCTTA-1 True 501 ## AAACCTGAGTACGTTC-1 NA NA ## ... ... ... ## TTTGTCATCACGGTTA-1 True 541 ## TTTGTCATCCCTTGTG-1 True 522 ## TTTGTCATCGGAAATA-1 NA NA ## TTTGTCATCTAGAGTC-1 NA NA ## TTTGTCATCTCGCATC-1 NA NA This can be easily combined with more sophisticated strategies for choosing a representative sequence per cell. 
For example, we might only consider sequences that are full-length and productive to be valid representatives. # The above code compressed into a one-liner: collapsed2 &lt;- unlist(tra.sub[as(which.max(tra.sub[,&quot;umis&quot;]), &quot;IntegerList&quot;)]) nrow(collapsed2) ## [1] 6660 The collapsed objects can then be stored in the colData of our SingleCellExperiment alongside the SplitDataFrameLists for easy retrieval in downstream functions. We assume that downstream applications are tolerant of NA values for cells that have no sequences. 21.5 Case study for clonotype analyses Quantification of clonal expansion is the most obvious application of repertoire sequencing data. Cells with the same T cell clonotype are assumed to target the same antigen, and any increase in the frequency of a clonotype provides evidence for T cell activation and proliferation upon stimulation by the corresponding antigen. Thus, we can gain some insights into the immune activity of each T cell cluster by counting the number of expanded clonotypes in each cluster. # INSERT ALAKAZAM CODE HERE. 21.6 Repeating for immunoglobulins The process for the immunoglobulin (Ig) repertoire is largely the same as that for the TCR chains. The biggest difference is that now we have three chains - heavy (IGH), lambda (IGL) and kappa (IGK). We first pull down the dataset and load in the data frame, noting that it contains one sequence per row. 
library(BiocFileCache) bfc &lt;- BiocFileCache(ask=FALSE) ig.data &lt;- bfcrpath(bfc, file.path( &quot;http://cf.10xgenomics.com/samples/cell-vdj/3.1.0&quot;, &quot;vdj_v1_hs_pbmc3/vdj_v1_hs_pbmc3_b_filtered_contig_annotations.csv&quot;)) ig &lt;- read.csv(ig.data, stringsAsFactors=FALSE) nrow(ig) ## [1] 2059 head(ig) ## barcode is_cell contig_id high_confidence length ## 1 AAACCTGTCACTGGGC-1 True AAACCTGTCACTGGGC-1_contig_1 True 556 ## 2 AAACCTGTCACTGGGC-1 True AAACCTGTCACTGGGC-1_contig_2 True 516 ## 3 AAACCTGTCACTGGGC-1 True AAACCTGTCACTGGGC-1_contig_3 True 569 ## 4 AAACCTGTCAGGTAAA-1 True AAACCTGTCAGGTAAA-1_contig_1 True 548 ## 5 AAACCTGTCAGGTAAA-1 True AAACCTGTCAGGTAAA-1_contig_2 True 555 ## 6 AAACCTGTCAGGTAAA-1 True AAACCTGTCAGGTAAA-1_contig_3 True 518 ## chain v_gene d_gene j_gene c_gene full_length productive cdr3 ## 1 IGK IGKV1D-33 None IGKJ4 IGKC True True CQQYDNLPLTF ## 2 IGH IGHV4-59 None IGHJ5 IGHM True True CARGGNSGLDPW ## 3 IGK IGKV2D-30 None IGKJ1 IGKC True False CWGLLLHARYTLAWTF ## 4 IGK IGKV1-12 None IGKJ4 IGKC True True CQQANSFPLTF ## 5 IGK IGKV3-15 None IGKJ2 IGKC True True CQQYDNWPPYTF ## 6 IGL IGLV5-45 None IGLJ2 IGLC2 True False CMIWHSSASVF ## cdr3_nt reads umis raw_clonotype_id ## 1 TGTCAACAGTATGATAATCTCCCGCTCACTTTC 7410 61 clonotype7 ## 2 TGTGCGAGAGGCGGGAACAGTGGCTTAGACCCCTGG 1458 18 clonotype7 ## 3 TGTTGGGGTTTATTACTGCATGCAAGGTACACACTGGCCTGGACGTTC 2045 17 clonotype7 ## 4 TGTCAACAGGCTAACAGTTTCCCGCTCACTTTC 140 1 clonotype2 ## 5 TGTCAGCAGTATGATAACTGGCCTCCGTACACTTTT 274 2 clonotype2 ## 6 TGTATGATTTGGCACAGCAGTGCTTCGGTATTC 46 2 clonotype2 ## raw_consensus_id ## 1 clonotype7_consensus_1 ## 2 clonotype7_consensus_2 ## 3 None ## 4 clonotype2_consensus_2 ## 5 clonotype2_consensus_1 ## 6 None We then loop over all of the chains and create a SplitDataFrameList for each chain, storing it in the colData of our SingleCellExperiment as previously described. 
for (chain in c(&quot;IGH&quot;, &quot;IGL&quot;, &quot;IGK&quot;)) { current &lt;- ig[ig$chain==chain,] x &lt;- split(DataFrame(current), factor(current$barcode, sce.pbmc$Barcode)) colData(sce.pbmc)[[chain]] &lt;- x } colnames(colData(sce.pbmc)) ## [1] &quot;Sample&quot; &quot;Barcode&quot; &quot;sizeFactor&quot; &quot;label&quot; &quot;TRA&quot; ## [6] &quot;TRB&quot; &quot;IGH&quot; &quot;IGL&quot; &quot;IGK&quot; One can see how we have managed to pack all of the information about this experiment - gene expression, protein abundance and repertoire information for both TCR and Ig chains - into a single object for convenient handling throughout the rest of the analysis. sce.pbmc ## class: SingleCellExperiment ## dim: 33538 6660 ## metadata(1): Samples ## assays(2): counts logcounts ## rownames(33538): ENSG00000243485 ENSG00000237613 ... ENSG00000277475 ## ENSG00000268674 ## rowData names(3): ID Symbol Type ## colnames: NULL ## colData names(9): Sample Barcode ... IGL IGK ## reducedDimNames(0): ## altExpNames(1): Antibody Capture Many of the analyses that can be performed on the TCR data can also be applied to the Ig repertoire. The most interesting difference between the two stems from the fact that immunoglobulins undergo somatic hypermutation, providing additional sequence variation beyond that of the V(D)J rearrangements. This allows us to use sequence similarity to create lineage trees involving cells of the same clonotype, particularly useful for characterizing Ig microevolution in response to immune challenges like vaccination. In practice, this is difficult to achieve in single-cell data as it requires enough cells per clonotype to create a reasonably interesting tree. # INSERT ALAKAZAM CODE HERE? 
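The split-with-levels trick used in the loop above (and throughout this chapter) can be distilled into a base R sketch with made-up barcodes. The key point is that supplying the full set of cells as the factor levels guarantees one entry per cell in the result, including zero-row entries for cells without any sequences. ```r # Sequence-level table: one row per contig (made-up barcodes and counts). seqs <- data.frame( barcode = c("cellA", "cellA", "cellC"), umis = c(3L, 2L, 5L) ) # The full set of cells, playing the role of sce.pbmc$Barcode. all.cells <- c("cellA", "cellB", "cellC", "cellD") # Splitting by a factor with explicit levels yields one entry per cell; # "cellB" and "cellD" get zero-row data.frames rather than being dropped. per.cell <- split(seqs, factor(seqs$barcode, levels = all.cells)) sapply(per.cell, nrow) # cellA: 2, cellB: 0, cellC: 1, cellD: 0 length(per.cell) # 4, matching the number of cells ``` The SplitDataFrameList version behaves the same way, with the added guarantee of a 1:1 mapping to the columns of the SingleCellExperiment when it is stored in the colData.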
Session Info View session info R version 4.0.4 (2021-02-15) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.12-books/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.12-books/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] BiocFileCache_1.14.0 dbplyr_2.1.0 [3] SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0 [5] Biobase_2.50.0 GenomicRanges_1.42.0 [7] GenomeInfoDb_1.26.4 IRanges_2.24.1 [9] S4Vectors_0.28.1 BiocGenerics_0.36.0 [11] MatrixGenerics_1.2.1 matrixStats_0.58.0 [13] BiocStyle_2.18.1 rebook_1.0.0 loaded via a namespace (and not attached): [1] httr_1.4.2 sass_0.3.1 bit64_4.0.5 [4] jsonlite_1.7.2 bslib_0.2.4 assertthat_0.2.1 [7] BiocManager_1.30.10 highr_0.8 blob_1.2.1 [10] GenomeInfoDbData_1.2.4 yaml_2.2.1 pillar_1.5.1 [13] RSQLite_2.2.4 lattice_0.20-41 glue_1.4.2 [16] digest_0.6.27 XVector_0.30.0 htmltools_0.5.1.1 [19] Matrix_1.3-2 XML_3.99-0.6 pkgconfig_2.0.3 [22] bookdown_0.21 zlibbioc_1.36.0 purrr_0.3.4 [25] processx_3.4.5 tibble_3.1.0 generics_0.1.0 [28] ellipsis_0.3.1 cachem_1.0.4 withr_2.4.1 [31] magrittr_2.0.1 crayon_1.4.1 CodeDepends_0.6.5 [34] memoise_2.0.0 evaluate_0.14 ps_1.6.0 [37] fansi_0.4.2 graph_1.68.0 tools_4.0.4 [40] lifecycle_1.0.0 stringr_1.4.0 DelayedArray_0.16.2 [43] callr_3.5.1 compiler_4.0.4 jquerylib_0.1.3 [46] rlang_0.4.10 grid_4.0.4 RCurl_1.98-1.3 [49] rappdirs_0.3.3 bitops_1.0-6 rmarkdown_2.7 [52] codetools_0.2-18 DBI_1.1.1 curl_4.3 [55] R6_2.5.0 knitr_1.31 dplyr_1.0.5 [58] fastmap_1.1.0 bit_4.0.4 utf8_1.2.1 [61] stringi_1.5.3 Rcpp_1.0.6 vctrs_0.3.6 [64] tidyselect_1.1.0 xfun_0.22 Bibliography 
"],["interactive-sharing.html", "Chapter 22 Interactive data exploration 22.1 Motivation 22.2 Quick start 22.3 Usage examples 22.4 Reproducible visualizations 22.5 Dissemination of analysis results 22.6 Additional resources Session Info", " Chapter 22 Interactive data exploration 22.1 Motivation Exploratory data analysis (EDA) and visualization are crucial for many aspects of data analysis such as quality control, hypothesis generation and contextual result interpretation. Single-cell ’omics datasets generated with modern high-throughput technologies are no exception, especially given their increasing size and complexity. The need for flexible and interactive platforms to explore those data from various perspectives has contributed to the increasing popularity of graphical user interfaces (GUIs) for interactive visualization. In this chapter, we illustrate how the Bioconductor package iSEE can be used to perform some common exploratory tasks during single-cell analysis workflows. We note that these are examples only; in practice, EDA is often context-dependent and driven by distinct motivations and hypotheses for every new data set. To this end, iSEE provides a flexible framework that is immediately compatible with a wide range of genomics data modalities and can be easily customized to focus on key aspects of individual data sets. 22.2 Quick start An instance of an interactive iSEE application can be launched with any data set that is stored in an object of the SummarizedExperiment class (or any class that extends it, e.g., SingleCellExperiment, DESeqDataSet, MethylSet). 
In its simplest form, this is done simply by calling iSEE(sce) with the sce data object as the sole argument, as demonstrated here with the 10X PBMC dataset (Figure 22.1). View history #--- loading ---# library(DropletTestFiles) raw.path &lt;- getTestFile(&quot;tenx-2.1.0-pbmc4k/1.0.0/raw.tar.gz&quot;) out.path &lt;- file.path(tempdir(), &quot;pbmc4k&quot;) untar(raw.path, exdir=out.path) library(DropletUtils) fname &lt;- file.path(out.path, &quot;raw_gene_bc_matrices/GRCh38&quot;) sce.pbmc &lt;- read10xCounts(fname, col.names=TRUE) #--- gene-annotation ---# library(scater) rownames(sce.pbmc) &lt;- uniquifyFeatureNames( rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol) library(EnsDb.Hsapiens.v86) location &lt;- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce.pbmc)$ID, column=&quot;SEQNAME&quot;, keytype=&quot;GENEID&quot;) #--- cell-detection ---# set.seed(100) e.out &lt;- emptyDrops(counts(sce.pbmc)) sce.pbmc &lt;- sce.pbmc[,which(e.out$FDR &lt;= 0.001)] #--- quality-control ---# stats &lt;- perCellQCMetrics(sce.pbmc, subsets=list(Mito=which(location==&quot;MT&quot;))) high.mito &lt;- isOutlier(stats$subsets_Mito_percent, type=&quot;higher&quot;) sce.pbmc &lt;- sce.pbmc[,!high.mito] #--- normalization ---# library(scran) set.seed(1000) clusters &lt;- quickCluster(sce.pbmc) sce.pbmc &lt;- computeSumFactors(sce.pbmc, cluster=clusters) sce.pbmc &lt;- logNormCounts(sce.pbmc) #--- variance-modelling ---# set.seed(1001) dec.pbmc &lt;- modelGeneVarByPoisson(sce.pbmc) top.pbmc &lt;- getTopHVGs(dec.pbmc, prop=0.1) #--- dimensionality-reduction ---# set.seed(10000) sce.pbmc &lt;- denoisePCA(sce.pbmc, subset.row=top.pbmc, technical=dec.pbmc) set.seed(100000) sce.pbmc &lt;- runTSNE(sce.pbmc, dimred=&quot;PCA&quot;) set.seed(1000000) sce.pbmc &lt;- runUMAP(sce.pbmc, dimred=&quot;PCA&quot;) #--- clustering ---# g &lt;- buildSNNGraph(sce.pbmc, k=10, use.dimred = &#39;PCA&#39;) clust &lt;- igraph::cluster_walktrap(g)$membership colLabels(sce.pbmc) &lt;- factor(clust) library(iSEE) app 
&lt;- iSEE(sce.pbmc) Figure 22.1: Screenshot of the iSEE application with its default initialization. The default interface contains up to eight built-in panels, each displaying a particular aspect of the data set. The layout of panels in the interface may be altered interactively - panels can be added, removed, resized or repositioned using the “Organize panels” menu in the top right corner of the interface. The initial layout of the application can also be altered programmatically as described in the rest of this chapter. To familiarize themselves with the GUI, users can launch an interactive tour from the menu in the top right corner. In addition, custom tours can be written to replace the default built-in tour. This feature is particularly useful for disseminating new data sets with accompanying bespoke explanations guiding users through the salient features of any given data set (see Section 22.5). It is also possible to deploy “empty” instances of iSEE apps, where any SummarizedExperiment object stored in an RDS file may be uploaded to the running application. Once the file is uploaded, the application will import the sce object and initialize the GUI panels with the contents of the object for interactive exploration. This type of iSEE application is launched without specifying the sce argument, as shown in Figure 22.2. app &lt;- iSEE() Figure 22.2: Screenshot of the iSEE application with a landing page. 22.3 Usage examples 22.3.1 Quality control In this example, we demonstrate how an iSEE app can be configured to focus on quality control metrics. Here, we are interested in two plots: The library size of each cell in decreasing order. An elbow in this plot generally reveals the transition between good quality cells and low quality cells or empty droplets. A dimensionality reduction result (in this case, we will pick \\(t\\)-SNE) where cells are colored by the log-library size. 
This view identifies trajectories or clusters associated with library size and can be used to diagnose QC/normalization problems. Alternatively, it could also indicate the presence of multiple cell types or states that differ in total RNA content. In addition, by setting the ColumnSelectionSource parameter, any point selection made in the Column data plot panel will highlight the corresponding points in the Reduced dimension plot panel. A user can then select the cells with either large or small library sizes to inspect their distribution in low-dimensional space. copy.pbmc &lt;- sce.pbmc # Computing various QC metrics; in particular, the log10-transformed library # size for each cell and the log-rank by decreasing library size. library(scater) copy.pbmc &lt;- addPerCellQC(copy.pbmc, exprs_values=&quot;counts&quot;) copy.pbmc$log10_total_counts &lt;- log10(copy.pbmc$total) copy.pbmc$total_counts_rank &lt;- rank(-copy.pbmc$total) initial.state &lt;- list( # Configure a &quot;Column data plot&quot; panel ColumnDataPlot(YAxis=&quot;log10_total_counts&quot;, XAxis=&quot;Column data&quot;, XAxisColumnData=&quot;total_counts_rank&quot;, DataBoxOpen=TRUE, PanelId=1L), # Configure a &quot;Reduced dimension plot&quot; panel ReducedDimensionPlot( Type=&quot;TSNE&quot;, VisualBoxOpen=TRUE, DataBoxOpen=TRUE, ColorBy=&quot;Column data&quot;, ColorByColumnData=&quot;log10_total_counts&quot;, SelectionBoxOpen=TRUE, ColumnSelectionSource=&quot;ColumnDataPlot1&quot;) ) # Prepare the app app &lt;- iSEE(copy.pbmc, initial=initial.state) The configured Shiny app can then be launched with the runApp() function or by simply printing the app object (Figure 22.3). Figure 22.3: Screenshot of an iSEE application for interactive exploration of quality control metrics. This app remains fully interactive, i.e., users can interactively control the settings and layout of the panels. 
For instance, users may choose to color data points by the percentage of UMIs mapped to mitochondrial genes (\"pct_counts_Mito\") in the Reduced dimension plot. Using the transfer of point selections between panels, users could select cells with small library sizes in the Column data plot and highlight them in the Reduced dimension plot, to investigate a possible relationship between library size, clustering and the proportion of reads mapped to mitochondrial genes. 22.3.2 Annotation of cell populations In this example, we use iSEE to interactively examine marker genes and conveniently determine cell identities. We identify upregulated markers in each cluster (Chapter 11) and collect the log-\\(p\\)-value for each gene in each cluster. These are stored in the rowData slot of the SingleCellExperiment object for access by iSEE. copy.pbmc &lt;- sce.pbmc library(scran) markers.pbmc.up &lt;- findMarkers(copy.pbmc, direction=&quot;up&quot;, log.p=TRUE, sorted=FALSE) # Collate the log-p-values for each marker into a single table all.p &lt;- lapply(markers.pbmc.up, FUN = &quot;[[&quot;, i=&quot;log.p.value&quot;) all.p &lt;- DataFrame(all.p, check.names=FALSE) colnames(all.p) &lt;- paste0(&quot;cluster&quot;, colnames(all.p)) # Store the table of results as row metadata rowData(copy.pbmc) &lt;- cbind(rowData(copy.pbmc), all.p) The next code chunk sets up an app that contains: A table of feature statistics, including the log-transformed \\(p\\)-values of cluster markers computed above. A plot showing the distribution of expression values for a chosen gene in each cluster. A plot showing the result of the UMAP dimensionality reduction method overlaid with the expression value of a chosen gene. Moreover, we configure the second and third panels to use the gene (i.e., row) selected in the first panel. This enables convenient examination of important markers when combined with sorting by \\(p\\)-value for a cluster of interest. 
initial.state &lt;- list( RowDataTable(PanelId=1L), # Configure a &quot;Feature assay plot&quot; panel FeatureAssayPlot( YAxisFeatureSource=&quot;RowDataTable1&quot;, XAxis=&quot;Column data&quot;, XAxisColumnData=&quot;label&quot;, Assay=&quot;logcounts&quot;, DataBoxOpen=TRUE ), # Configure a &quot;Reduced dimension plot&quot; panel ReducedDimensionPlot( Type=&quot;UMAP&quot;, ColorBy=&quot;Feature name&quot;, ColorByFeatureSource=&quot;RowDataTable1&quot;, ColorByFeatureNameAssay=&quot;logcounts&quot; ) ) # Prepare the app app &lt;- iSEE(copy.pbmc, initial=initial.state) After launching the application (Figure 22.4), we can then sort the table by ascending values of cluster1 to identify genes that are strong markers for cluster 1. Then, users may select the first row in the Row statistics table and watch the second and third panels automatically update to display the most significant marker gene on the y-axis (Feature assay plot) or as a color scale overlaid on the data points (Reduced dimension plot). Alternatively, users can simply search the table for arbitrary gene names and select known markers for visualization. Figure 22.4: Screenshot of the iSEE application initialized for interactive exploration of population-specific marker expression. 22.3.3 Querying features of interest So far, the plots that we have examined have represented each column (i.e., cell) as a point. However, it is straightforward to instead represent rows as points that can be selected and transmitted to eligible panels. This is useful for more gene-centric exploratory analyses. To illustrate, we will add variance modelling statistics to the rowData() of our SingleCellExperiment object. copy.pbmc &lt;- sce.pbmc # Adding some mean-variance information. dec &lt;- modelGeneVarByPoisson(copy.pbmc) rowData(copy.pbmc) &lt;- cbind(rowData(copy.pbmc), dec) The next code chunk sets up an app (Figure 22.5) that contains: A plot showing the mean-variance trend, where each point represents a gene. 
A table of feature statistics, similar to that generated in the previous example. A heatmap for the genes in the first plot. We again configure the second and third panels to respond to the selection of points in the first panel. This allows the user to select several highly variable genes at once and examine their statistics or expression profiles. More advanced users can even configure the app to start with a brush or lasso to define a selection of genes at initialization. initial.state &lt;- list( # Configure a &quot;Feature assay plot&quot; panel RowDataPlot( YAxis=&quot;total&quot;, XAxis=&quot;Row data&quot;, XAxisRowData=&quot;mean&quot;, PanelId=1L ), RowDataTable( RowSelectionSource=&quot;RowDataPlot1&quot; ), # Configure a &quot;ComplexHeatmap&quot; panel ComplexHeatmapPlot( RowSelectionSource=&quot;RowDataPlot1&quot;, CustomRows=FALSE, ColumnData=&quot;label&quot;, Assay=&quot;logcounts&quot;, ClusterRows=TRUE, PanelHeight=800L, AssayCenterRows=TRUE ) ) # Prepare the app app &lt;- iSEE(copy.pbmc, initial=initial.state) Figure 22.5: Screenshot of the iSEE application initialized for examining highly variable genes. It is entirely possible for these row-centric panels to exist alongside the column-centric panels discussed previously. The only limitation is that row-based panels cannot transmit multi-row selections to column-based panels and vice versa. That said, a row-based panel can still transmit a single row selection to a column-based panel for, e.g., coloring by expression; this allows us to set up an app where selecting a single HVG in the mean-variance plot causes the neighboring \\(t\\)-SNE to be colored by the expression of the selected gene (Figure 22.6). 
initial.state &lt;- list( # Configure a &quot;Row data plot&quot; panel RowDataPlot( YAxis=&quot;total&quot;, XAxis=&quot;Row data&quot;, XAxisRowData=&quot;mean&quot;, PanelId=1L ), # Configure a &quot;Reduced dimension plot&quot; panel ReducedDimensionPlot( Type=&quot;TSNE&quot;, ColorBy=&quot;Feature name&quot;, ColorByFeatureSource=&quot;RowDataPlot1&quot;, ColorByFeatureNameAssay=&quot;logcounts&quot; ) ) # Prepare the app app &lt;- iSEE(copy.pbmc, initial=initial.state) Figure 22.6: Screenshot of the iSEE application containing both row- and column-based panels. 22.4 Reproducible visualizations The state of the iSEE application can be saved at any point to provide a snapshot of the current view of the dataset. This is achieved by clicking on the “Display panel settings” button under the “Export” dropdown menu in the top right corner and saving an RDS file containing a serialized list of panel parameters. Anyone with access to this file and the original SingleCellExperiment can then run iSEE to recover the same application state. Alternatively, the code required to construct the panel parameters can be returned, which is more transparent and amenable to further modification. This facility is most obviously useful for reproducing a perspective on the data that leads to a particular scientific conclusion; it is also helpful for collaborations whereby different views of the same dataset can be easily transferred between analysts. iSEE also keeps a record of the R commands used to generate each figure and table in the app. This information is readily available via the “Extract the R code” button under the “Export” dropdown menu. By copying the code displayed in the modal window and executing it in the R session from which the iSEE app was launched, a user can exactly reproduce all plots currently displayed in the GUI. 
In this manner, a user can use iSEE to rapidly prototype plots of interest without having to write the associated boilerplate, after which they can then copy the code into an R script for fine-tuning. Of course, the user can also save the plots and tables directly for further adjustment with other tools. 22.5 Dissemination of analysis results iSEE provides a powerful avenue for disseminating results through a “guided tour” of the dataset. This involves writing a step-by-step walkthrough of the different panels with explanations to facilitate their interpretation. All that is needed to add a tour to an iSEE instance is a data frame with two columns named “element” and “intro”; the first column declares the UI element to highlight in each step of the tour, and the second one contains the text to display at that step. This data frame must then be provided to the iSEE() function via the tour argument. Below we demonstrate the implementation of a simple tour that takes users through the two panels that compose the GUI and trains them to use the collapsible boxes. tour &lt;- data.frame( element = c( &quot;#Welcome&quot;, &quot;#ReducedDimensionPlot1&quot;, &quot;#ColumnDataPlot1&quot;, &quot;#ColumnDataPlot1_DataBoxOpen&quot;, &quot;#Conclusion&quot;), intro = c( &quot;Welcome to this tour!&quot;, &quot;This is a &lt;i&gt;Reduced dimension plot.&lt;/i&gt;&quot;, &quot;And this is a &lt;i&gt;Column data plot.&lt;/i&gt;&quot;, &quot;&lt;b&gt;Action:&lt;/b&gt; Click on this collapsible box to open and close it.&quot;, &quot;Thank you for taking this tour!&quot;), stringsAsFactors = FALSE) initial.state &lt;- list( ReducedDimensionPlot(PanelWidth=6L), ColumnDataPlot(PanelWidth=6L) ) The preconfigured Shiny app can then be loaded with the tour and launched to obtain Figure 22.7. Note that the viewer is free to leave the interactive tour at any time and explore the data from their own perspective.
Examples of advanced tours showcasing a selection of published data sets can be found at https://github.com/iSEE/iSEE2018. app &lt;- iSEE(sce.pbmc, initial = initial.state, tour = tour) Figure 22.7: Screenshot of the iSEE application initialized with a tour. 22.6 Additional resources For demonstration and inspiration, we refer readers to the following examples of deployed applications: Use cases accompanying the published article: https://marionilab.cruk.cam.ac.uk/ (source code: https://github.com/iSEE/iSEE2018) Examples of iSEE in production: http://www.teichlab.org/singlecell-treg Other examples as source code: Gallery of examples notebooks to reproduce analyses on public data: https://github.com/iSEE/iSEE_instances Gallery of example custom panels: https://github.com/iSEE/iSEE_custom Session Info View session info R version 4.0.4 (2021-02-15) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.12-books/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.12-books/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] scran_1.18.5 scater_1.18.6 [3] ggplot2_3.3.3 iSEE_2.2.4 [5] SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0 [7] Biobase_2.50.0 GenomicRanges_1.42.0 [9] GenomeInfoDb_1.26.4 IRanges_2.24.1 [11] S4Vectors_0.28.1 BiocGenerics_0.36.0 [13] MatrixGenerics_1.2.1 matrixStats_0.58.0 [15] BiocStyle_2.18.1 rebook_1.0.0 loaded via a namespace (and not attached): [1] ggbeeswarm_0.6.0 colorspace_2.0-0 [3] rjson_0.2.20 ellipsis_0.3.1 [5] circlize_0.4.12 scuttle_1.0.4 [7] bluster_1.0.0 XVector_0.30.0 [9] BiocNeighbors_1.8.2 GlobalOptions_0.1.2 [11] 
clue_0.3-58 ggrepel_0.9.1 [13] DT_0.17 fansi_0.4.2 [15] codetools_0.2-18 splines_4.0.4 [17] sparseMatrixStats_1.2.1 knitr_1.31 [19] jsonlite_1.7.2 Cairo_1.5-12.2 [21] cluster_2.1.0 png_0.1-7 [23] shinydashboard_0.7.1 graph_1.68.0 [25] shiny_1.6.0 BiocManager_1.30.10 [27] compiler_4.0.4 dqrng_0.2.1 [29] assertthat_0.2.1 Matrix_1.3-2 [31] fastmap_1.1.0 limma_3.46.0 [33] BiocSingular_1.6.0 later_1.1.0.1 [35] htmltools_0.5.1.1 tools_4.0.4 [37] rsvd_1.0.3 igraph_1.2.6 [39] gtable_0.3.0 glue_1.4.2 [41] GenomeInfoDbData_1.2.4 dplyr_1.0.5 [43] Rcpp_1.0.6 jquerylib_0.1.3 [45] vctrs_0.3.6 nlme_3.1-152 [47] rintrojs_0.2.2 DelayedMatrixStats_1.12.3 [49] xfun_0.22 stringr_1.4.0 [51] ps_1.6.0 beachmat_2.6.4 [53] irlba_2.3.3 mime_0.10 [55] miniUI_0.1.1.1 lifecycle_1.0.0 [57] statmod_1.4.35 XML_3.99-0.6 [59] shinyAce_0.4.1 edgeR_3.32.1 [61] zlibbioc_1.36.0 scales_1.1.1 [63] colourpicker_1.1.0 promises_1.2.0.1 [65] RColorBrewer_1.1-2 ComplexHeatmap_2.6.2 [67] yaml_2.2.1 gridExtra_2.3 [69] sass_0.3.1 stringi_1.5.3 [71] highr_0.8 BiocParallel_1.24.1 [73] shape_1.4.5 rlang_0.4.10 [75] pkgconfig_2.0.3 bitops_1.0-6 [77] evaluate_0.14 lattice_0.20-41 [79] purrr_0.3.4 CodeDepends_0.6.5 [81] htmlwidgets_1.5.3 processx_3.4.5 [83] tidyselect_1.1.0 magrittr_2.0.1 [85] bookdown_0.21 R6_2.5.0 [87] generics_0.1.0 DelayedArray_0.16.2 [89] DBI_1.1.1 pillar_1.5.1 [91] withr_2.4.1 mgcv_1.8-33 [93] RCurl_1.98-1.3 tibble_3.1.0 [95] crayon_1.4.1 shinyWidgets_0.6.0 [97] utf8_1.2.1 rmarkdown_2.7 [99] viridis_0.5.1 GetoptLong_1.0.5 [101] locfit_1.5-9.4 grid_4.0.4 [103] callr_3.5.1 digest_0.6.27 [105] xtable_1.8-4 httpuv_1.5.5 [107] munsell_0.5.0 beeswarm_0.3.1 [109] viridisLite_0.3.0 vipor_0.4.5 [111] bslib_0.2.4 shinyjs_2.0.0 "],["dealing-with-big-data.html", "Chapter 23 Dealing with big data 23.1 Motivation 23.2 Fast approximations 23.3 Parallelization 23.4 Out of memory representations Session Info", " Chapter 23 Dealing with big data 23.1 Motivation Advances in scRNA-seq technologies have increased the number of cells that can be assayed in routine experiments. Public databases such as GEO are continually expanding with more scRNA-seq studies, while large-scale projects such as the Human Cell Atlas are expected to generate data for billions of cells. For effective data analysis, the computational methods need to scale with the increasing size of scRNA-seq data sets. This section discusses how we can use various aspects of the Bioconductor ecosystem to tune our analysis pipelines for greater speed and efficiency. 23.2 Fast approximations 23.2.1 Nearest neighbor searching Identification of neighbouring cells in PC or expression space is a common procedure that is used in many functions, e.g., buildSNNGraph(), doubletCells(). The default is to favour accuracy over speed by using an exact nearest neighbour (NN) search, implemented with the \\(k\\)-means for \\(k\\)-nearest neighbours algorithm (Wang 2012). However, for large data sets, it may be preferable to use a faster approximate approach. The BiocNeighbors framework makes it easy to switch between search options by simply changing the BNPARAM= argument in compatible functions.
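Conceptually, an exact NN search simply ranks all pairwise distances for each point. The brute-force base-R sketch below (purely illustrative — the function name bruteKNN is ours, and BiocNeighbors' KMKNN and Annoy backends avoid materializing the full distance matrix) makes clear what the approximate methods are trading away:

```r
# Brute-force exact k-nearest neighbours, for illustration only.
# (BiocNeighbors' KmknnParam/AnnoyParam backends avoid the full distance matrix.)
bruteKNN <- function(mat, k) {
    d <- as.matrix(dist(mat))            # all pairwise Euclidean distances
    diag(d) <- Inf                       # a point is not its own neighbour
    index <- t(apply(d, 1, order))[, seq_len(k), drop=FALSE]
    distance <- t(apply(d, 1, function(x) sort(x)[seq_len(k)]))
    list(index=index, distance=distance)
}

set.seed(42)
y <- matrix(rnorm(200), ncol=5)          # 40 'cells' in 5 dimensions
res <- bruteKNN(y, k=3)
dim(res$index)                           # one row per cell, one column per neighbour
```

The quadratic cost of the distance matrix is exactly why exact searches become painful at scale, and why approximate indexing schemes like Annoy are attractive.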
To demonstrate, we will use the 10X PBMC data: View history #--- loading ---# library(DropletTestFiles) raw.path &lt;- getTestFile(&quot;tenx-2.1.0-pbmc4k/1.0.0/raw.tar.gz&quot;) out.path &lt;- file.path(tempdir(), &quot;pbmc4k&quot;) untar(raw.path, exdir=out.path) library(DropletUtils) fname &lt;- file.path(out.path, &quot;raw_gene_bc_matrices/GRCh38&quot;) sce.pbmc &lt;- read10xCounts(fname, col.names=TRUE) #--- gene-annotation ---# library(scater) rownames(sce.pbmc) &lt;- uniquifyFeatureNames( rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol) library(EnsDb.Hsapiens.v86) location &lt;- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce.pbmc)$ID, column=&quot;SEQNAME&quot;, keytype=&quot;GENEID&quot;) #--- cell-detection ---# set.seed(100) e.out &lt;- emptyDrops(counts(sce.pbmc)) sce.pbmc &lt;- sce.pbmc[,which(e.out$FDR &lt;= 0.001)] #--- quality-control ---# stats &lt;- perCellQCMetrics(sce.pbmc, subsets=list(Mito=which(location==&quot;MT&quot;))) high.mito &lt;- isOutlier(stats$subsets_Mito_percent, type=&quot;higher&quot;) sce.pbmc &lt;- sce.pbmc[,!high.mito] #--- normalization ---# library(scran) set.seed(1000) clusters &lt;- quickCluster(sce.pbmc) sce.pbmc &lt;- computeSumFactors(sce.pbmc, cluster=clusters) sce.pbmc &lt;- logNormCounts(sce.pbmc) #--- variance-modelling ---# set.seed(1001) dec.pbmc &lt;- modelGeneVarByPoisson(sce.pbmc) top.pbmc &lt;- getTopHVGs(dec.pbmc, prop=0.1) #--- dimensionality-reduction ---# set.seed(10000) sce.pbmc &lt;- denoisePCA(sce.pbmc, subset.row=top.pbmc, technical=dec.pbmc) set.seed(100000) sce.pbmc &lt;- runTSNE(sce.pbmc, dimred=&quot;PCA&quot;) set.seed(1000000) sce.pbmc &lt;- runUMAP(sce.pbmc, dimred=&quot;PCA&quot;) #--- clustering ---# g &lt;- buildSNNGraph(sce.pbmc, k=10, use.dimred = &#39;PCA&#39;) clust &lt;- igraph::cluster_walktrap(g)$membership colLabels(sce.pbmc) &lt;- factor(clust) sce.pbmc ## class: SingleCellExperiment ## dim: 33694 3985 ## metadata(1): Samples ## assays(2): counts logcounts ## rownames(33694): 
RP11-34P13.3 FAM138A ... AC213203.1 FAM231B ## rowData names(2): ID Symbol ## colnames(3985): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ... ## TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1 ## colData names(4): Sample Barcode sizeFactor label ## reducedDimNames(3): PCA TSNE UMAP ## altExpNames(0): We had previously clustered on a shared nearest neighbor graph generated with an exact neighbour search (Section 10.3). We repeat this below using an approximate search, implemented using the Annoy algorithm. This involves constructing an AnnoyParam object to specify the search algorithm and then passing it to the buildSNNGraph() function. The results from the exact and approximate searches are consistent with most clusters from the former re-appearing in the latter. This suggests that the inaccuracy from the approximation can be largely ignored. library(scran) library(BiocNeighbors) snn.gr &lt;- buildSNNGraph(sce.pbmc, BNPARAM=AnnoyParam(), use.dimred=&quot;PCA&quot;) clusters &lt;- igraph::cluster_walktrap(snn.gr) table(Exact=colLabels(sce.pbmc), Approx=clusters$membership) ## Approx ## Exact 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ## 1 205 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ## 2 0 479 0 0 2 0 0 0 0 27 0 0 0 0 0 0 ## 3 0 0 540 0 1 0 0 0 0 0 0 0 0 0 0 0 ## 4 0 0 0 55 0 0 0 0 0 1 0 0 0 0 0 0 ## 5 0 25 0 0 349 0 0 0 0 0 0 0 0 0 0 0 ## 6 0 0 0 0 0 125 0 0 0 0 0 0 0 0 0 0 ## 7 0 0 0 0 0 0 46 0 0 0 0 0 0 0 0 0 ## 8 0 0 0 0 0 0 0 432 0 0 0 0 0 0 0 0 ## 9 0 0 0 1 0 0 0 10 291 0 0 0 0 0 0 0 ## 10 0 28 0 0 0 0 0 0 0 839 0 0 0 0 0 0 ## 11 0 0 0 0 0 0 0 0 0 0 47 0 0 0 0 0 ## 12 0 0 0 0 0 0 0 0 0 0 0 155 0 0 0 0 ## 13 0 0 0 0 0 0 0 0 0 0 0 0 166 0 0 0 ## 14 0 0 0 0 0 0 0 0 0 0 0 0 0 61 0 0 ## 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 84 0 ## 16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16 Note that Annoy writes the NN index to disk prior to performing the search.
Thus, it may not actually be faster than the default exact algorithm for small datasets, depending on whether the overhead of disk write is offset by the computational complexity of the search. It is also not difficult to find situations where the approximation deteriorates, especially at high dimensions, though this may not have an appreciable impact on the biological conclusions. set.seed(1000) y1 &lt;- matrix(rnorm(50000), nrow=1000) y2 &lt;- matrix(rnorm(50000), nrow=1000) Y &lt;- rbind(y1, y2) exact &lt;- findKNN(Y, k=20) approx &lt;- findKNN(Y, k=20, BNPARAM=AnnoyParam()) mean(exact$index!=approx$index) ## [1] 0.5619 23.2.2 Singular value decomposition The singular value decomposition (SVD) underlies the PCA used throughout our analyses, e.g., in denoisePCA(), fastMNN(), doubletCells(). (Briefly, the right singular vectors are the eigenvectors of the gene-gene covariance matrix, where each eigenvector represents the axis of maximum remaining variation in the PCA.) The default base::svd() function performs an exact SVD that is not performant for large datasets. Instead, we use fast approximate methods from the irlba and rsvd packages, conveniently wrapped into the BiocSingular package for ease of use and package development. Specifically, we can change the SVD algorithm used in any of these functions by simply specifying an alternative value for the BSPARAM= argument. library(scater) library(BiocSingular) # As the name suggests, it is random, so we need to set the seed. set.seed(101000) r.out &lt;- runPCA(sce.pbmc, ncomponents=20, BSPARAM=RandomParam()) str(reducedDim(r.out)) ## num [1:3985, 1:20] 15.05 13.43 -8.67 -7.74 6.45 ... ## - attr(*, &quot;dimnames&quot;)=List of 2 ## ..$ : chr [1:3985] &quot;AAACCTGAGAAGGCCT-1&quot; &quot;AAACCTGAGACAGACC-1&quot; &quot;AAACCTGAGGCATGGT-1&quot; &quot;AAACCTGCAAGGTTCT-1&quot; ... ## ..$ : chr [1:20] &quot;PC1&quot; &quot;PC2&quot; &quot;PC3&quot; &quot;PC4&quot; ... 
## - attr(*, &quot;varExplained&quot;)= num [1:20] 85.36 40.43 23.22 8.99 6.66 ... ## - attr(*, &quot;percentVar&quot;)= num [1:20] 19.85 9.4 5.4 2.09 1.55 ... ## - attr(*, &quot;rotation&quot;)= num [1:500, 1:20] 0.203 0.1834 0.1779 0.1063 0.0647 ... ## ..- attr(*, &quot;dimnames&quot;)=List of 2 ## .. ..$ : chr [1:500] &quot;LYZ&quot; &quot;S100A9&quot; &quot;S100A8&quot; &quot;HLA-DRA&quot; ... ## .. ..$ : chr [1:20] &quot;PC1&quot; &quot;PC2&quot; &quot;PC3&quot; &quot;PC4&quot; ... set.seed(101001) i.out &lt;- runPCA(sce.pbmc, ncomponents=20, BSPARAM=IrlbaParam()) str(reducedDim(i.out)) ## num [1:3985, 1:20] 15.05 13.43 -8.67 -7.74 6.45 ... ## - attr(*, &quot;dimnames&quot;)=List of 2 ## ..$ : chr [1:3985] &quot;AAACCTGAGAAGGCCT-1&quot; &quot;AAACCTGAGACAGACC-1&quot; &quot;AAACCTGAGGCATGGT-1&quot; &quot;AAACCTGCAAGGTTCT-1&quot; ... ## ..$ : chr [1:20] &quot;PC1&quot; &quot;PC2&quot; &quot;PC3&quot; &quot;PC4&quot; ... ## - attr(*, &quot;varExplained&quot;)= num [1:20] 85.36 40.43 23.22 8.99 6.66 ... ## - attr(*, &quot;percentVar&quot;)= num [1:20] 19.85 9.4 5.4 2.09 1.55 ... ## - attr(*, &quot;rotation&quot;)= num [1:500, 1:20] 0.203 0.1834 0.1779 0.1063 0.0647 ... ## ..- attr(*, &quot;dimnames&quot;)=List of 2 ## .. ..$ : chr [1:500] &quot;LYZ&quot; &quot;S100A9&quot; &quot;S100A8&quot; &quot;HLA-DRA&quot; ... ## .. ..$ : chr [1:20] &quot;PC1&quot; &quot;PC2&quot; &quot;PC3&quot; &quot;PC4&quot; ... Both IRLBA and randomized SVD (RSVD) are much faster than the exact SVD with negligible loss of accuracy. This motivates their default use in many scran and scater functions, at the cost of requiring users to set the seed to guarantee reproducibility. IRLBA can occasionally fail to converge and require more iterations (passed via maxit= in IrlbaParam()), while RSVD involves an explicit trade-off between accuracy and speed based on its oversampling parameter (p=) and number of power iterations (q=). 
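To build some intuition for the oversampling (p=) and power iteration (q=) parameters, the randomized SVD idea can be sketched in a few lines of base R — project onto a small random subspace, sharpen it with power iterations, then solve the reduced problem exactly. This is an illustration of the general algorithm (the function name randomizedSVD is ours), not the actual rsvd implementation:

```r
# Minimal randomized SVD sketch: illustrative only.
randomizedSVD <- function(x, k, p=10, q=2) {
    omega <- matrix(rnorm(ncol(x) * (k + p)), ncol=k + p)  # random test matrix
    Y <- x %*% omega                     # sample the column space of x
    for (i in seq_len(q)) {
        Y <- x %*% crossprod(x, Y)       # power iterations (the q= parameter)
    }
    Q <- qr.Q(qr(Y))                     # orthonormal basis for that subspace
    B <- crossprod(Q, x)                 # small (k+p) x ncol(x) problem
    s <- svd(B, nu=k, nv=k)
    list(d=s$d[seq_len(k)], u=Q %*% s$u, v=s$v)
}

set.seed(99)
x <- tcrossprod(matrix(rnorm(500), 100), matrix(rnorm(100), 20))  # low-rank toy matrix
approx <- randomizedSVD(x, k=5)
exact <- svd(x)
max(abs(approx$d - exact$d[1:5]))        # top singular values agree closely
```

Only the small (k+p)-dimensional problem is ever decomposed exactly, which is where the speed-up comes from; larger p= and q= buy accuracy at the cost of more work.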
We tend to prefer IRLBA as its default behavior is more accurate, though RSVD is much faster for file-backed matrices (Section 28.8). 23.3 Parallelization Parallelization of calculations across genes or cells is an obvious strategy for speeding up scRNA-seq analysis workflows. The BiocParallel package provides a common interface for parallel computing throughout the Bioconductor ecosystem, manifesting as a BPPARAM= argument in compatible functions. We can pick from a diverse range of parallelization backends depending on the available hardware and operating system. For example, we might use forking across 2 cores to parallelize the variance calculations on a Unix system: library(BiocParallel) dec.pbmc.mc &lt;- modelGeneVar(sce.pbmc, BPPARAM=MulticoreParam(2)) dec.pbmc.mc ## DataFrame with 33694 rows and 6 columns ## mean total tech bio p.value ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## RP11-34P13.3 0.000000000 0.000000000 0.000000000 0.00000e+00 NaN ## FAM138A 0.000000000 0.000000000 0.000000000 0.00000e+00 NaN ## OR4F5 0.000000000 0.000000000 0.000000000 0.00000e+00 NaN ## RP11-34P13.7 0.002166050 0.002227438 0.002227466 -2.81903e-08 0.500035 ## RP11-34P13.8 0.000522431 0.000549601 0.000537244 1.23571e-05 0.436506 ## ... ... ... ... ... ... ## AC233755.2 0.0000000 0.0000000 0.0000000 0.00000000 NaN ## AC233755.1 0.0000000 0.0000000 0.0000000 0.00000000 NaN ## AC240274.1 0.0102893 0.0121099 0.0105809 0.00152898 0.157651 ## AC213203.1 0.0000000 0.0000000 0.0000000 0.00000000 NaN ## FAM231B 0.0000000 0.0000000 0.0000000 0.00000000 NaN ## FDR ## &lt;numeric&gt; ## RP11-34P13.3 NaN ## FAM138A NaN ## OR4F5 NaN ## RP11-34P13.7 0.756451 ## RP11-34P13.8 0.756451 ## ... ... 
## AC233755.2 NaN ## AC233755.1 NaN ## AC240274.1 0.756451 ## AC213203.1 NaN ## FAM231B NaN Another approach would be to distribute jobs across a network of computers, which yields the same result: dec.pbmc.snow &lt;- modelGeneVar(sce.pbmc, BPPARAM=SnowParam(5)) dec.pbmc.snow ## DataFrame with 33694 rows and 6 columns ## mean total tech bio p.value ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## RP11-34P13.3 0.000000000 0.000000000 0.000000000 0.00000e+00 NaN ## FAM138A 0.000000000 0.000000000 0.000000000 0.00000e+00 NaN ## OR4F5 0.000000000 0.000000000 0.000000000 0.00000e+00 NaN ## RP11-34P13.7 0.002166050 0.002227438 0.002227466 -2.81903e-08 0.500035 ## RP11-34P13.8 0.000522431 0.000549601 0.000537244 1.23571e-05 0.436506 ## ... ... ... ... ... ... ## AC233755.2 0.0000000 0.0000000 0.0000000 0.00000000 NaN ## AC233755.1 0.0000000 0.0000000 0.0000000 0.00000000 NaN ## AC240274.1 0.0102893 0.0121099 0.0105809 0.00152898 0.157651 ## AC213203.1 0.0000000 0.0000000 0.0000000 0.00000000 NaN ## FAM231B 0.0000000 0.0000000 0.0000000 0.00000000 NaN ## FDR ## &lt;numeric&gt; ## RP11-34P13.3 NaN ## FAM138A NaN ## OR4F5 NaN ## RP11-34P13.7 0.756451 ## RP11-34P13.8 0.756451 ## ... ... ## AC233755.2 NaN ## AC233755.1 NaN ## AC240274.1 0.756451 ## AC213203.1 NaN ## FAM231B NaN For high-performance computing (HPC) systems with a cluster of compute nodes, we can distribute jobs via the job scheduler using the BatchtoolsParam class. The example below assumes a SLURM cluster, though the settings can be easily configured for a particular system (see here for details). # 2 hours, 8 GB, 1 CPU per task, for 10 tasks. bpp &lt;- BatchtoolsParam(10, cluster=&quot;slurm&quot;, resources=list(walltime=7200, memory=8000, ncpus=1)) Parallelization is best suited for CPU-intensive calculations where the division of labor results in a concomitant reduction in compute time. 
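The benefit of splitting work across workers can be seen with base R's parallel package, which underlies forking backends like MulticoreParam. The toy per-gene task below (not how modelGeneVar() itself is parallelized) also confirms that the division of labor does not change the result:

```r
# Forking with base R's 'parallel' package, which underlies MulticoreParam().
library(parallel)

m <- matrix(rnorm(1e4), nrow=100)        # toy 'genes' x 'cells' matrix
rows <- asplit(m, 1)                     # one element per gene

serial <- lapply(rows, var)
ncores <- if (.Platform$OS.type == "windows") 1L else 2L  # no forking on Windows
forked <- mclapply(rows, var, mc.cores=ncores)

identical(unlist(serial), unlist(forked))  # same values, regardless of backend
```

The same pattern holds for BPPARAM=: switching backends should only change how the work is scheduled, never what is computed.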
It is not suited for tasks that are bounded by other compute resources, e.g., memory or file I/O (though the latter is less of an issue on HPC systems with parallel read/write). In particular, R itself is inherently single-core, so many of the parallelization backends involve (i) setting up one or more separate R sessions, (ii) loading the relevant packages and (iii) transmitting the data to that session. Depending on the nature and size of the task, this overhead may outweigh any benefit from parallel computing. 23.4 Out of memory representations The count matrix is the central structure around which our analyses are based. In most of the previous chapters, this has been held fully in memory as a dense matrix or as a sparse dgCMatrix. However, in-memory representations may not be feasible for very large data sets, especially on machines with limited memory. For example, the 1.3 million brain cell data set from 10X Genomics (Zheng et al. 2017) would require over 100 GB of RAM to hold as a matrix and around 30 GB as a dgCMatrix. This makes it challenging to explore the data on anything less than an HPC system. The obvious solution is to use a file-backed matrix representation where the data are held on disk and subsets are retrieved into memory as requested. While a number of implementations of file-backed matrices are available (e.g., bigmemory, matter), we will be using the implementation from the HDF5Array package. This uses the popular HDF5 format as the underlying data store, which provides a measure of standardization and portability across systems. We demonstrate with a subset of 20,000 cells from the 1.3 million brain cell data set, as provided by the TENxBrainData package.
library(TENxBrainData) sce.brain &lt;- TENxBrainData20k() sce.brain ## class: SingleCellExperiment ## dim: 27998 20000 ## metadata(0): ## assays(1): counts ## rownames: NULL ## rowData names(2): Ensembl Symbol ## colnames: NULL ## colData names(4): Barcode Sequence Library Mouse ## reducedDimNames(0): ## altExpNames(0): Examination of the SingleCellExperiment object indicates that the count matrix is a HDF5Matrix. From a comparison of the memory usage, it is clear that this matrix object is simply a stub that points to the much larger HDF5 file that actually contains the data. This avoids the need for large RAM availability during analyses. counts(sce.brain) ## &lt;27998 x 20000&gt; matrix of class HDF5Matrix and type &quot;integer&quot;: ## [,1] [,2] [,3] [,4] ... [,19997] [,19998] [,19999] ## [1,] 0 0 0 0 . 0 0 0 ## [2,] 0 0 0 0 . 0 0 0 ## [3,] 0 0 0 0 . 0 0 0 ## [4,] 0 0 0 0 . 0 0 0 ## [5,] 0 0 0 0 . 0 0 0 ## ... . . . . . . . . ## [27994,] 0 0 0 0 . 0 0 0 ## [27995,] 0 0 0 1 . 0 2 0 ## [27996,] 0 0 0 0 . 0 1 0 ## [27997,] 0 0 0 0 . 0 0 0 ## [27998,] 0 0 0 0 . 0 0 0 ## [,20000] ## [1,] 0 ## [2,] 0 ## [3,] 0 ## [4,] 0 ## [5,] 0 ## ... . ## [27994,] 0 ## [27995,] 0 ## [27996,] 0 ## [27997,] 0 ## [27998,] 0 object.size(counts(sce.brain)) ## 2496 bytes file.info(path(counts(sce.brain)))$size ## [1] 76264332 Manipulation of the count matrix will generally result in the creation of a DelayedArray object from the DelayedArray package. This remembers the operations to be applied to the counts and stores them in the object, to be executed when the modified matrix values are realized for use in calculations. The use of delayed operations avoids the need to write the modified values to a new file at every operation, which would unnecessarily require time-consuming disk I/O. tmp &lt;- counts(sce.brain) tmp &lt;- log2(tmp + 1) tmp ## &lt;27998 x 20000&gt; matrix of class DelayedMatrix and type &quot;double&quot;: ## [,1] [,2] [,3] ... [,19999] [,20000] ## [1,] 0 0 0 . 
0 0 ## [2,] 0 0 0 . 0 0 ## [3,] 0 0 0 . 0 0 ## [4,] 0 0 0 . 0 0 ## [5,] 0 0 0 . 0 0 ## ... . . . . . . ## [27994,] 0 0 0 . 0 0 ## [27995,] 0 0 0 . 0 0 ## [27996,] 0 0 0 . 0 0 ## [27997,] 0 0 0 . 0 0 ## [27998,] 0 0 0 . 0 0 Many functions described in the previous workflows are capable of accepting HDF5Matrix objects. This is powered by the availability of common methods for all matrix representations (e.g., subsetting, combining, methods from DelayedMatrixStats) as well as representation-agnostic C++ code using beachmat (A. T. L. Lun, Pages, and Smith 2018). For example, we compute QC metrics below with the same perCellQCMetrics() function that we used in the other workflows. library(scater) is.mito &lt;- grepl(&quot;^mt-&quot;, rowData(sce.brain)$Symbol) qcstats &lt;- perCellQCMetrics(sce.brain, subsets=list(Mt=is.mito)) qcstats ## DataFrame with 20000 rows and 6 columns ## sum detected subsets_Mt_sum subsets_Mt_detected subsets_Mt_percent ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## 1 3060 1546 123 10 4.01961 ## 2 3500 1694 118 11 3.37143 ## 3 3092 1613 58 9 1.87581 ## 4 4420 2050 131 10 2.96380 ## 5 3771 1813 100 8 2.65182 ## ... ... ... ... ... ... ## 19996 4431 2050 127 9 2.866170 ## 19997 6988 2704 60 9 0.858615 ## 19998 8749 2988 305 11 3.486113 ## 19999 3842 1711 129 8 3.357626 ## 20000 1775 945 26 6 1.464789 ## total ## &lt;numeric&gt; ## 1 3060 ## 2 3500 ## 3 3092 ## 4 4420 ## 5 3771 ## ... ... ## 19996 4431 ## 19997 6988 ## 19998 8749 ## 19999 3842 ## 20000 1775 Needless to say, data access from file-backed representations is slower than that from in-memory representations. The time spent retrieving data from disk is an unavoidable cost of reducing memory usage. Whether this is tolerable depends on the application.
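The underlying idea — pulling blocks of the matrix into memory on demand — can be sketched in base R without HDF5 at all. The function below (illustrative only; the name blockColSums is ours, and DelayedArray's actual block-processing machinery adds chunking, caching and delayed operations on top) computes column sums of an on-disk matrix while holding only one block in memory at a time:

```r
# Sketch of block processing: column sums of an on-disk matrix, reading
# one block of columns at a time.
set.seed(1)
m <- matrix(rpois(2e4, lambda=1), nrow=100)   # toy 100 x 200 count matrix
path <- tempfile(fileext=".bin")
writeBin(as.double(m), path)                  # stored column-major on disk

blockColSums <- function(path, nrow, ncol, block=50) {
    con <- file(path, "rb")
    on.exit(close(con))
    out <- numeric(0)
    done <- 0
    while (done < ncol) {
        n <- min(block, ncol - done)          # next block of columns
        vals <- readBin(con, "double", n * nrow)
        out <- c(out, colSums(matrix(vals, nrow=nrow)))
        done <- done + n
    }
    out
}

all.equal(blockColSums(path, 100, 200), colSums(m))
```

Peak memory here is one block rather than the whole matrix, which is the same trade-off HDF5Array makes: less RAM in exchange for repeated disk reads.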
One example usage pattern involves performing the heavy computing quickly with in-memory representations on HPC systems with plentiful memory, and then distributing file-backed counterparts to individual users for exploration and visualization on their personal machines. Session Info View session info R version 4.0.4 (2021-02-15) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.12-books/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.12-books/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] TENxBrainData_1.10.0 HDF5Array_1.18.1 [3] rhdf5_2.34.0 DelayedArray_0.16.2 [5] Matrix_1.3-2 BiocParallel_1.24.1 [7] BiocSingular_1.6.0 scater_1.18.6 [9] ggplot2_3.3.3 bluster_1.0.0 [11] BiocNeighbors_1.8.2 scran_1.18.5 [13] SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0 [15] Biobase_2.50.0 GenomicRanges_1.42.0 [17] GenomeInfoDb_1.26.4 IRanges_2.24.1 [19] S4Vectors_0.28.1 BiocGenerics_0.36.0 [21] MatrixGenerics_1.2.1 matrixStats_0.58.0 [23] BiocStyle_2.18.1 rebook_1.0.0 loaded via a namespace (and not attached): [1] ggbeeswarm_0.6.0 colorspace_2.0-0 [3] ellipsis_0.3.1 scuttle_1.0.4 [5] XVector_0.30.0 bit64_4.0.5 [7] AnnotationDbi_1.52.0 interactiveDisplayBase_1.28.0 [9] fansi_0.4.2 codetools_0.2-18 [11] sparseMatrixStats_1.2.1 cachem_1.0.4 [13] knitr_1.31 jsonlite_1.7.2 [15] dbplyr_2.1.0 graph_1.68.0 [17] shiny_1.6.0 BiocManager_1.30.10 [19] compiler_4.0.4 httr_1.4.2 [21] dqrng_0.2.1 assertthat_0.2.1 [23] fastmap_1.1.0 limma_3.46.0 [25] later_1.1.0.1 htmltools_0.5.1.1 [27] tools_4.0.4 rsvd_1.0.3 [29] igraph_1.2.6 gtable_0.3.0 [31] glue_1.4.2 
GenomeInfoDbData_1.2.4 [33] dplyr_1.0.5 rappdirs_0.3.3 [35] Rcpp_1.0.6 jquerylib_0.1.3 [37] vctrs_0.3.6 rhdf5filters_1.2.0 [39] ExperimentHub_1.16.0 DelayedMatrixStats_1.12.3 [41] xfun_0.22 stringr_1.4.0 [43] ps_1.6.0 beachmat_2.6.4 [45] mime_0.10 lifecycle_1.0.0 [47] irlba_2.3.3 statmod_1.4.35 [49] XML_3.99-0.6 AnnotationHub_2.22.0 [51] edgeR_3.32.1 zlibbioc_1.36.0 [53] scales_1.1.1 promises_1.2.0.1 [55] curl_4.3 yaml_2.2.1 [57] memoise_2.0.0 gridExtra_2.3 [59] sass_0.3.1 stringi_1.5.3 [61] RSQLite_2.2.4 BiocVersion_3.12.0 [63] rlang_0.4.10 pkgconfig_2.0.3 [65] bitops_1.0-6 evaluate_0.14 [67] lattice_0.20-41 purrr_0.3.4 [69] Rhdf5lib_1.12.1 CodeDepends_0.6.5 [71] bit_4.0.4 processx_3.4.5 [73] tidyselect_1.1.0 magrittr_2.0.1 [75] bookdown_0.21 R6_2.5.0 [77] snow_0.4-3 generics_0.1.0 [79] DBI_1.1.1 pillar_1.5.1 [81] withr_2.4.1 RCurl_1.98-1.3 [83] tibble_3.1.0 crayon_1.4.1 [85] utf8_1.2.1 BiocFileCache_1.14.0 [87] rmarkdown_2.7 viridis_0.5.1 [89] locfit_1.5-9.4 grid_4.0.4 [91] blob_1.2.1 callr_3.5.1 [93] digest_0.6.27 xtable_1.8-4 [95] httpuv_1.5.5 munsell_0.5.0 [97] beeswarm_0.3.1 viridisLite_0.3.0 [99] vipor_0.4.5 bslib_0.2.4 "],["interoperability.html", "Chapter 24 Interoperability 24.1 Motivation 24.2 Interchanging with Seurat 24.3 Interchanging with scanpy Session Info", " Chapter 24 Interoperability 24.1 Motivation The Bioconductor single-cell ecosystem is but one of many popular frameworks for scRNA-seq data analysis. Seurat is very widely used for analysis of droplet-based datasets while scanpy provides an option for users who prefer working in Python.
In many scenarios, these frameworks provide useful functionality that we might want to use from a Bioconductor-centric analysis (or vice versa). For example, Python has well-established machine learning libraries while R has a large catalogue of statistical tools, and it would be much better to use this functionality directly than to attempt to transplant it into a new framework. However, effective re-use requires some consideration towards interoperability during development of the relevant software tools. In an ideal world, everyone would agree on a common data structure that could be seamlessly and faithfully exchanged between frameworks. In the real world, though, each framework uses a different structure for various pragmatic or historical reasons. (This obligatory xkcd sums up the situation.) Most single cell-related Bioconductor packages use the SingleCellExperiment class, as previously discussed; Seurat defines its own SeuratObject class; and scanpy has its AnnData class. This inevitably introduces some friction if we are forced to convert from one structure to another in order to use another framework’s methods. In the absence of coordination of data structures, the next best solution is for each framework to provide methods that can operate on its most basic data object. Depending on the method, this might be the count matrix, the normalized expression matrix, a matrix of PCs or a graph object. If such methods are available, we can simply extract the relevant component from our SingleCellExperiment and call an external method directly without having to assemble that framework’s structure. Indeed, it is for this purpose that almost all scran functions and many scater functions are capable of accepting matrix objects or equivalents (e.g., sparse matrices) in addition to SingleCellExperiments. In this chapter, we will provide some examples of using functionality from frameworks outside of the SingleCellExperiment ecosystem in a single-cell analysis. 
We will focus on Seurat and scanpy as these are two of the most popular analysis frameworks in the field. However, the principles of interoperability are generally applicable and are worth keeping in mind when developing or evaluating any type of analysis software. 24.2 Interchanging with Seurat Figure 24.1: Need to add this at some point. 24.3 Interchanging with scanpy Figure 24.2: Need to add this at some point. Session Info View session info R version 4.0.4 (2021-02-15) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.12-books/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.12-books/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] BiocStyle_2.18.1 rebook_1.0.0 loaded via a namespace (and not attached): [1] graph_1.68.0 knitr_1.31 magrittr_2.0.1 [4] BiocGenerics_0.36.0 R6_2.5.0 rlang_0.4.10 [7] highr_0.8 stringr_1.4.0 tools_4.0.4 [10] parallel_4.0.4 xfun_0.22 jquerylib_0.1.3 [13] htmltools_0.5.1.1 CodeDepends_0.6.5 yaml_2.2.1 [16] digest_0.6.27 bookdown_0.21 processx_3.4.5 [19] callr_3.5.1 BiocManager_1.30.10 ps_1.6.0 [22] codetools_0.2-18 sass_0.3.1 evaluate_0.14 [25] rmarkdown_2.7 stringi_1.5.3 compiler_4.0.4 [28] bslib_0.2.4 XML_3.99-0.6 stats4_4.0.4 [31] jsonlite_1.7.2 "],["lun-416b-cell-line-smart-seq2.html", "Chapter 25 Lun 416B cell line (Smart-seq2) 25.1 Introduction 25.2 Data loading 25.3 Quality control 25.4 Normalization 25.5 Variance modelling 25.6 Batch correction 25.7 Dimensionality reduction 25.8 Clustering 25.9 Interpretation Session Info", " Chapter 25 Lun 416B cell line (Smart-seq2) 25.1 Introduction The A. T. L. Lun et al. (2017) dataset contains two 96-well plates of 416B cells (an immortalized mouse myeloid progenitor cell line), processed using the Smart-seq2 protocol (Picelli et al. 2014). A constant amount of spike-in RNA from the External RNA Controls Consortium (ERCC) was also added to each cell’s lysate prior to library preparation. High-throughput sequencing was performed and the expression of each gene was quantified by counting the total number of reads mapped to its exonic regions. Similarly, the quantity of each spike-in transcript was measured by counting the number of reads mapped to the spike-in reference sequences. 25.2 Data loading We convert the blocking variable to a factor so that downstream steps do not treat it as an integer. library(scRNAseq) sce.416b &lt;- LunSpikeInData(which=&quot;416b&quot;) sce.416b$block &lt;- factor(sce.416b$block) We rename the rows of our SingleCellExperiment with the symbols, reverting to Ensembl identifiers for missing or duplicate symbols. library(AnnotationHub) ens.mm.v97 &lt;- AnnotationHub()[[&quot;AH73905&quot;]] rowData(sce.416b)$ENSEMBL &lt;- rownames(sce.416b) rowData(sce.416b)$SYMBOL &lt;- mapIds(ens.mm.v97, keys=rownames(sce.416b), keytype=&quot;GENEID&quot;, column=&quot;SYMBOL&quot;) rowData(sce.416b)$SEQNAME &lt;- mapIds(ens.mm.v97, keys=rownames(sce.416b), keytype=&quot;GENEID&quot;, column=&quot;SEQNAME&quot;) library(scater) rownames(sce.416b) &lt;- uniquifyFeatureNames(rowData(sce.416b)$ENSEMBL, rowData(sce.416b)$SYMBOL) 25.3 Quality control We save an unfiltered copy of the SingleCellExperiment for later use.
unfiltered &lt;- sce.416b Technically, we do not need to use the mitochondrial proportions as we already have the spike-in proportions (which serve a similar purpose) for this dataset. However, it probably doesn’t do any harm to include it anyway. mito &lt;- which(rowData(sce.416b)$SEQNAME==&quot;MT&quot;) stats &lt;- perCellQCMetrics(sce.416b, subsets=list(Mt=mito)) qc &lt;- quickPerCellQC(stats, percent_subsets=c(&quot;subsets_Mt_percent&quot;, &quot;altexps_ERCC_percent&quot;), batch=sce.416b$block) sce.416b &lt;- sce.416b[,!qc$discard] colData(unfiltered) &lt;- cbind(colData(unfiltered), stats) unfiltered$block &lt;- factor(unfiltered$block) unfiltered$discard &lt;- qc$discard gridExtra::grid.arrange( plotColData(unfiltered, x=&quot;block&quot;, y=&quot;sum&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Total count&quot;), plotColData(unfiltered, x=&quot;block&quot;, y=&quot;detected&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Detected features&quot;), plotColData(unfiltered, x=&quot;block&quot;, y=&quot;subsets_Mt_percent&quot;, colour_by=&quot;discard&quot;) + ggtitle(&quot;Mito percent&quot;), plotColData(unfiltered, x=&quot;block&quot;, y=&quot;altexps_ERCC_percent&quot;, colour_by=&quot;discard&quot;) + ggtitle(&quot;ERCC percent&quot;), nrow=2, ncol=2 ) Figure 25.1: Distribution of each QC metric across cells in the 416B dataset, stratified by the plate of origin. Each point represents a cell and is colored according to whether that cell was discarded. gridExtra::grid.arrange( plotColData(unfiltered, x=&quot;sum&quot;, y=&quot;subsets_Mt_percent&quot;, colour_by=&quot;discard&quot;) + scale_x_log10(), plotColData(unfiltered, x=&quot;altexps_ERCC_percent&quot;, y=&quot;subsets_Mt_percent&quot;, colour_by=&quot;discard&quot;), ncol=2 ) Figure 25.2: Percentage of mitochondrial reads in each cell in the 416B dataset, compared to the total count (left) or the percentage of spike-in reads (right). 
Each point represents a cell and is colored according to whether that cell was discarded. We also examine the number of cells removed for each reason. colSums(as.matrix(qc)) ## low_lib_size low_n_features high_subsets_Mt_percent ## 5 0 2 ## high_altexps_ERCC_percent discard ## 2 7 25.4 Normalization No pre-clustering is performed here, as the dataset is small and all cells are derived from the same cell line anyway. library(scran) sce.416b &lt;- computeSumFactors(sce.416b) sce.416b &lt;- logNormCounts(sce.416b) summary(sizeFactors(sce.416b)) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.347 0.711 0.921 1.000 1.152 3.604 We see that the induced cells have size factors that are systematically shifted from the uninduced cells, consistent with the presence of a composition bias (Figure 25.3). plot(librarySizeFactors(sce.416b), sizeFactors(sce.416b), pch=16, xlab=&quot;Library size factors&quot;, ylab=&quot;Deconvolution factors&quot;, col=c(&quot;black&quot;, &quot;red&quot;)[grepl(&quot;induced&quot;, sce.416b$phenotype)+1], log=&quot;xy&quot;) Figure 25.3: Relationship between the library size factors and the deconvolution size factors in the 416B dataset. Each cell is colored according to its oncogene induction status. 25.5 Variance modelling We block on the plate of origin to minimize plate effects, and then we take the top 10% of genes with the largest biological components. 
dec.416b &lt;- modelGeneVarWithSpikes(sce.416b, &quot;ERCC&quot;, block=sce.416b$block) chosen.hvgs &lt;- getTopHVGs(dec.416b, prop=0.1) par(mfrow=c(1,2)) blocked.stats &lt;- dec.416b$per.block for (i in colnames(blocked.stats)) { current &lt;- blocked.stats[[i]] plot(current$mean, current$total, main=i, pch=16, cex=0.5, xlab=&quot;Mean of log-expression&quot;, ylab=&quot;Variance of log-expression&quot;) curfit &lt;- metadata(current) points(curfit$mean, curfit$var, col=&quot;red&quot;, pch=16) curve(curfit$trend(x), col=&#39;dodgerblue&#39;, add=TRUE, lwd=2) } Figure 25.4: Per-gene variance as a function of the mean for the log-expression values in the 416B dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to the spike-in transcripts (red). This was performed separately for each plate. 25.6 Batch correction The composition of cells is expected to be the same across the two plates, hence the use of removeBatchEffect() rather than more complex methods. For larger datasets, consider using regressBatches() from the batchelor package. library(limma) assay(sce.416b, &quot;corrected&quot;) &lt;- removeBatchEffect(logcounts(sce.416b), design=model.matrix(~sce.416b$phenotype), batch=sce.416b$block) 25.7 Dimensionality reduction We do not expect a great deal of heterogeneity in this dataset, so we only request 10 PCs. We use an exact SVD to avoid warnings from irlba about handling small datasets. 
sce.416b &lt;- runPCA(sce.416b, ncomponents=10, subset_row=chosen.hvgs, exprs_values=&quot;corrected&quot;, BSPARAM=BiocSingular::ExactParam()) set.seed(1010) sce.416b &lt;- runTSNE(sce.416b, dimred=&quot;PCA&quot;, perplexity=10) 25.8 Clustering my.dist &lt;- dist(reducedDim(sce.416b, &quot;PCA&quot;)) my.tree &lt;- hclust(my.dist, method=&quot;ward.D2&quot;) library(dynamicTreeCut) my.clusters &lt;- unname(cutreeDynamic(my.tree, distM=as.matrix(my.dist), minClusterSize=10, verbose=0)) colLabels(sce.416b) &lt;- factor(my.clusters) We compare the clusters to the plate of origin. Each cluster is composed of cells from both batches, indicating that the clustering is not driven by a batch effect. table(Cluster=colLabels(sce.416b), Plate=sce.416b$block) ## Plate ## Cluster 20160113 20160325 ## 1 40 38 ## 2 37 32 ## 3 10 14 ## 4 6 8 We compare the clusters to the oncogene induction status. We observe differences in the composition of each cluster, consistent with a biological effect of oncogene induction. table(Cluster=colLabels(sce.416b), Oncogene=sce.416b$phenotype) ## Oncogene ## Cluster induced CBFB-MYH11 oncogene expression wild type phenotype ## 1 78 0 ## 2 0 69 ## 3 1 23 ## 4 14 0 plotTSNE(sce.416b, colour_by=&quot;label&quot;) Figure 25.5: Obligatory \\(t\\)-SNE plot of the 416B dataset, where each point represents a cell and is colored according to the assigned cluster. Most cells have relatively small positive widths in Figure 25.6, indicating that the separation between clusters is weak. This may be symptomatic of over-clustering, where clusters that are clearly defined on oncogene induction status are further split into subsets that are less well separated. Nonetheless, we will proceed with the current clustering scheme as it provides reasonable partitions for further characterization of heterogeneity. 
library(cluster) clust.col &lt;- scater:::.get_palette(&quot;tableau10medium&quot;) # hidden scater colours sil &lt;- silhouette(my.clusters, dist = my.dist) sil.cols &lt;- clust.col[ifelse(sil[,3] &gt; 0, sil[,1], sil[,2])] sil.cols &lt;- sil.cols[order(-sil[,1], sil[,3])] plot(sil, main = paste(length(unique(my.clusters)), &quot;clusters&quot;), border=sil.cols, col=sil.cols, do.col.sort=FALSE) Figure 25.6: Silhouette plot for the hierarchical clustering of the 416B dataset. Each bar represents the silhouette width for a cell and is colored according to the assigned cluster (if positive) or the closest cluster (if negative). 25.9 Interpretation markers &lt;- findMarkers(sce.416b, my.clusters, block=sce.416b$block) marker.set &lt;- markers[[&quot;1&quot;]] head(marker.set, 10) ## DataFrame with 10 rows and 7 columns ## Top p.value FDR summary.logFC logFC.2 logFC.3 ## &lt;integer&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## Ccna2 1 9.85422e-67 4.59246e-62 -7.13310 -7.13310 -2.20632 ## Cdca8 1 1.01449e-41 1.52514e-38 -7.26175 -6.00378 -2.03841 ## Pirb 1 4.16555e-33 1.95516e-30 5.87820 5.28149 5.87820 ## Cks1b 2 2.98233e-40 3.23229e-37 -6.43381 -6.43381 -4.15385 ## Aurkb 2 2.41436e-64 5.62593e-60 -6.94063 -6.94063 -1.65534 ## Myh11 2 1.28865e-46 3.75353e-43 4.38182 4.38182 4.29290 ## Mcm6 3 1.15877e-28 3.69887e-26 -5.44558 -5.44558 -5.82130 ## Cdca3 3 5.02047e-45 1.23144e-41 -6.22179 -6.22179 -2.10502 ## Top2a 3 7.25965e-61 1.12776e-56 -7.07811 -7.07811 -2.39123 ## Mcm2 4 1.50854e-33 7.98908e-31 -5.54197 -5.54197 -6.09178 ## logFC.4 ## &lt;numeric&gt; ## Ccna2 -7.3451052 ## Cdca8 -7.2617478 ## Pirb 0.0352849 ## Cks1b -6.4385323 ## Aurkb -6.4162126 ## Myh11 0.9410499 ## Mcm6 -3.5804973 ## Cdca3 -7.0539510 ## Top2a -6.8297343 ## Mcm2 -3.8238103 We visualize the expression profiles of the top candidates in Figure 25.7 to verify that the DE signature is robust. 
Most of the top markers have strong and consistent up- or downregulation in cells of cluster 1 compared to some or all of the other clusters. A cursory examination of the heatmap indicates that cluster 1 contains oncogene-induced cells with strong downregulation of DNA replication and cell cycle genes. This is consistent with the potential induction of senescence as an anti-tumorigenic response (Wajapeyee et al. 2010). top.markers &lt;- rownames(marker.set)[marker.set$Top &lt;= 10] plotHeatmap(sce.416b, features=top.markers, order_columns_by=&quot;label&quot;, colour_columns_by=c(&quot;label&quot;, &quot;block&quot;, &quot;phenotype&quot;), center=TRUE, symmetric=TRUE, zlim=c(-5, 5)) Figure 25.7: Heatmap of the top marker genes for cluster 1 in the 416B dataset, stratified by cluster. The plate of origin and oncogene induction status are also shown for each cell. Session Info View session info R version 4.0.4 (2021-02-15) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.12-books/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.12-books/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] cluster_2.1.0 dynamicTreeCut_1.63-1 [3] limma_3.46.0 scran_1.18.5 [5] scater_1.18.6 ggplot2_3.3.3 [7] AnnotationHub_2.22.0 BiocFileCache_1.14.0 [9] dbplyr_2.1.0 ensembldb_2.14.0 [11] AnnotationFilter_1.14.0 GenomicFeatures_1.42.2 [13] AnnotationDbi_1.52.0 scRNAseq_2.4.0 [15] SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0 [17] Biobase_2.50.0 GenomicRanges_1.42.0 [19] GenomeInfoDb_1.26.4 IRanges_2.24.1 [21] S4Vectors_0.28.1 BiocGenerics_0.36.0 [23] 
MatrixGenerics_1.2.1 matrixStats_0.58.0 [25] BiocStyle_2.18.1 rebook_1.0.0 loaded via a namespace (and not attached): [1] igraph_1.2.6 lazyeval_0.2.2 [3] BiocParallel_1.24.1 digest_0.6.27 [5] htmltools_0.5.1.1 viridis_0.5.1 [7] fansi_0.4.2 magrittr_2.0.1 [9] memoise_2.0.0 Biostrings_2.58.0 [11] askpass_1.1 prettyunits_1.1.1 [13] colorspace_2.0-0 blob_1.2.1 [15] rappdirs_0.3.3 xfun_0.22 [17] dplyr_1.0.5 callr_3.5.1 [19] crayon_1.4.1 RCurl_1.98-1.3 [21] jsonlite_1.7.2 graph_1.68.0 [23] glue_1.4.2 gtable_0.3.0 [25] zlibbioc_1.36.0 XVector_0.30.0 [27] DelayedArray_0.16.2 BiocSingular_1.6.0 [29] scales_1.1.1 pheatmap_1.0.12 [31] edgeR_3.32.1 DBI_1.1.1 [33] Rcpp_1.0.6 viridisLite_0.3.0 [35] xtable_1.8-4 progress_1.2.2 [37] dqrng_0.2.1 bit_4.0.4 [39] rsvd_1.0.3 httr_1.4.2 [41] RColorBrewer_1.1-2 ellipsis_0.3.1 [43] pkgconfig_2.0.3 XML_3.99-0.6 [45] farver_2.1.0 scuttle_1.0.4 [47] CodeDepends_0.6.5 sass_0.3.1 [49] locfit_1.5-9.4 utf8_1.2.1 [51] tidyselect_1.1.0 labeling_0.4.2 [53] rlang_0.4.10 later_1.1.0.1 [55] munsell_0.5.0 BiocVersion_3.12.0 [57] tools_4.0.4 cachem_1.0.4 [59] generics_0.1.0 RSQLite_2.2.4 [61] ExperimentHub_1.16.0 evaluate_0.14 [63] stringr_1.4.0 fastmap_1.1.0 [65] yaml_2.2.1 processx_3.4.5 [67] knitr_1.31 bit64_4.0.5 [69] purrr_0.3.4 sparseMatrixStats_1.2.1 [71] mime_0.10 xml2_1.3.2 [73] biomaRt_2.46.3 compiler_4.0.4 [75] beeswarm_0.3.1 curl_4.3 [77] interactiveDisplayBase_1.28.0 statmod_1.4.35 [79] tibble_3.1.0 bslib_0.2.4 [81] stringi_1.5.3 highr_0.8 [83] ps_1.6.0 lattice_0.20-41 [85] bluster_1.0.0 ProtGenerics_1.22.0 [87] Matrix_1.3-2 vctrs_0.3.6 [89] pillar_1.5.1 lifecycle_1.0.0 [91] BiocManager_1.30.10 jquerylib_0.1.3 [93] BiocNeighbors_1.8.2 cowplot_1.1.1 [95] bitops_1.0-6 irlba_2.3.3 [97] httpuv_1.5.5 rtracklayer_1.50.0 [99] R6_2.5.0 bookdown_0.21 [101] promises_1.2.0.1 gridExtra_2.3 [103] vipor_0.4.5 codetools_0.2-18 [105] assertthat_0.2.1 openssl_1.4.3 [107] withr_2.4.1 GenomicAlignments_1.26.0 [109] Rsamtools_2.6.0 GenomeInfoDbData_1.2.4 [111] 
hms_1.0.0 grid_4.0.4 [113] beachmat_2.6.4 rmarkdown_2.7 [115] DelayedMatrixStats_1.12.3 Rtsne_0.15 [117] shiny_1.6.0 ggbeeswarm_0.6.0 Bibliography "],["zeisel-mouse-brain-strt-seq.html", "Chapter 26 Zeisel mouse brain (STRT-Seq) 26.1 Introduction 26.2 Data loading 26.3 Quality control 26.4 Normalization 26.5 Variance modelling 26.6 Dimensionality reduction 26.7 Clustering 26.8 Interpretation Session Info", " Chapter 26 Zeisel mouse brain (STRT-Seq) 26.1 Introduction Here, we examine a heterogeneous dataset from a study of cell types in the mouse brain (Zeisel et al. 2015). This contains approximately 3000 cells of varying types such as oligodendrocytes, microglia and neurons. Individual cells were isolated using the Fluidigm C1 microfluidics system (Pollen et al. 2014) and library preparation was performed on each cell using a UMI-based protocol. After sequencing, expression was quantified by counting the number of unique molecular identifiers (UMIs) mapped to each gene. 26.2 Data loading We obtain a SingleCellExperiment object for this dataset using the relevant function from the scRNAseq package. The idiosyncrasies of the published dataset mean that we need to do some extra work to merge together redundant rows corresponding to alternative genomic locations for the same gene. library(scRNAseq) sce.zeisel &lt;- ZeiselBrainData() library(scater) sce.zeisel &lt;- aggregateAcrossFeatures(sce.zeisel, id=sub(&quot;_loc[0-9]+$&quot;, &quot;&quot;, rownames(sce.zeisel))) We also fetch the Ensembl gene IDs, just in case we need them later. 
library(org.Mm.eg.db) rowData(sce.zeisel)$Ensembl &lt;- mapIds(org.Mm.eg.db, keys=rownames(sce.zeisel), keytype=&quot;SYMBOL&quot;, column=&quot;ENSEMBL&quot;) 26.3 Quality control unfiltered &lt;- sce.zeisel The original authors of the study have already removed low-quality cells prior to data publication. Nonetheless, we compute some quality control metrics to check whether the remaining cells are satisfactory. stats &lt;- perCellQCMetrics(sce.zeisel, subsets=list( Mt=rowData(sce.zeisel)$featureType==&quot;mito&quot;)) qc &lt;- quickPerCellQC(stats, percent_subsets=c(&quot;altexps_ERCC_percent&quot;, &quot;subsets_Mt_percent&quot;)) sce.zeisel &lt;- sce.zeisel[,!qc$discard] colData(unfiltered) &lt;- cbind(colData(unfiltered), stats) unfiltered$discard &lt;- qc$discard gridExtra::grid.arrange( plotColData(unfiltered, y=&quot;sum&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Total count&quot;), plotColData(unfiltered, y=&quot;detected&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Detected features&quot;), plotColData(unfiltered, y=&quot;altexps_ERCC_percent&quot;, colour_by=&quot;discard&quot;) + ggtitle(&quot;ERCC percent&quot;), plotColData(unfiltered, y=&quot;subsets_Mt_percent&quot;, colour_by=&quot;discard&quot;) + ggtitle(&quot;Mito percent&quot;), ncol=2 ) Figure 26.1: Distribution of each QC metric across cells in the Zeisel brain dataset. Each point represents a cell and is colored according to whether that cell was discarded. gridExtra::grid.arrange( plotColData(unfiltered, x=&quot;sum&quot;, y=&quot;subsets_Mt_percent&quot;, colour_by=&quot;discard&quot;) + scale_x_log10(), plotColData(unfiltered, x=&quot;altexps_ERCC_percent&quot;, y=&quot;subsets_Mt_percent&quot;, colour_by=&quot;discard&quot;), ncol=2 ) Figure 26.2: Percentage of mitochondrial reads in each cell in the Zeisel brain dataset, compared to the total count (left) or the percentage of spike-in reads (right). 
Each point represents a cell and is colored according to whether that cell was discarded. We also examine the number of cells removed for each reason. colSums(as.matrix(qc)) ## low_lib_size low_n_features high_altexps_ERCC_percent ## 0 3 65 ## high_subsets_Mt_percent discard ## 128 189 26.4 Normalization library(scran) set.seed(1000) clusters &lt;- quickCluster(sce.zeisel) sce.zeisel &lt;- computeSumFactors(sce.zeisel, cluster=clusters) sce.zeisel &lt;- logNormCounts(sce.zeisel) summary(sizeFactors(sce.zeisel)) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.119 0.486 0.831 1.000 1.321 4.509 plot(librarySizeFactors(sce.zeisel), sizeFactors(sce.zeisel), pch=16, xlab=&quot;Library size factors&quot;, ylab=&quot;Deconvolution factors&quot;, log=&quot;xy&quot;) Figure 26.3: Relationship between the library size factors and the deconvolution size factors in the Zeisel brain dataset. 26.5 Variance modelling In theory, we should block on the plate of origin for each cell. However, only 20-40 cells are available on each plate, and the population is also highly heterogeneous. This means that we cannot assume that the distribution of sampled cell types on each plate is the same. Thus, to avoid regressing out potential biology, we will not block on any factors in this analysis. dec.zeisel &lt;- modelGeneVarWithSpikes(sce.zeisel, &quot;ERCC&quot;) top.hvgs &lt;- getTopHVGs(dec.zeisel, prop=0.1) We see from Figure 26.4 that the technical and total variances are much smaller than those in the read-based datasets. This is due to the use of UMIs, which reduces the noise caused by variable PCR amplification. Furthermore, the spike-in trend is consistently lower than the variances of the endogenous genes, which reflects the heterogeneity in gene expression across cells of different types. 
plot(dec.zeisel$mean, dec.zeisel$total, pch=16, cex=0.5, xlab=&quot;Mean of log-expression&quot;, ylab=&quot;Variance of log-expression&quot;) curfit &lt;- metadata(dec.zeisel) points(curfit$mean, curfit$var, col=&quot;red&quot;, pch=16) curve(curfit$trend(x), col=&#39;dodgerblue&#39;, add=TRUE, lwd=2) Figure 26.4: Per-gene variance as a function of the mean for the log-expression values in the Zeisel brain dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to the spike-in transcripts (red). 26.6 Dimensionality reduction library(BiocSingular) set.seed(101011001) sce.zeisel &lt;- denoisePCA(sce.zeisel, technical=dec.zeisel, subset.row=top.hvgs) sce.zeisel &lt;- runTSNE(sce.zeisel, dimred=&quot;PCA&quot;) We have a look at the number of PCs retained by denoisePCA(). ncol(reducedDim(sce.zeisel, &quot;PCA&quot;)) ## [1] 50 26.7 Clustering snn.gr &lt;- buildSNNGraph(sce.zeisel, use.dimred=&quot;PCA&quot;) colLabels(sce.zeisel) &lt;- factor(igraph::cluster_walktrap(snn.gr)$membership) table(colLabels(sce.zeisel)) ## ## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ## 283 451 114 143 599 167 191 128 350 70 199 58 39 24 plotTSNE(sce.zeisel, colour_by=&quot;label&quot;) Figure 26.5: Obligatory \\(t\\)-SNE plot of the Zeisel brain dataset, where each point represents a cell and is colored according to the assigned cluster. 26.8 Interpretation We focus on upregulated marker genes as these can quickly provide positive identification of cell type in a heterogeneous population. We examine the table for cluster 1, in which log-fold changes are reported between cluster 1 and every other cluster. The same output is provided for each cluster in order to identify genes that discriminate between clusters. 
markers &lt;- findMarkers(sce.zeisel, direction=&quot;up&quot;) marker.set &lt;- markers[[&quot;1&quot;]] head(marker.set[,1:8], 10) # only first 8 columns, for brevity ## DataFrame with 10 rows and 8 columns ## Top p.value FDR summary.logFC logFC.2 logFC.3 ## &lt;integer&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## Atp1a3 1 1.45982e-282 7.24035e-279 3.45669 0.0398568 0.0893943 ## Celf4 1 2.27030e-246 4.50404e-243 3.10465 0.3886716 0.6145023 ## Gad1 1 7.44925e-232 1.34351e-228 4.57719 4.5392751 4.3003280 ## Gad2 1 2.88086e-207 3.57208e-204 4.25393 4.2322487 3.8884654 ## Mllt11 1 1.72982e-249 3.81309e-246 2.88363 0.5782719 1.4933128 ## Ndrg4 1 0.00000e+00 0.00000e+00 3.84337 0.8887239 1.0183408 ## Slc32a1 1 2.38276e-110 4.04030e-108 1.92859 1.9196173 1.8252062 ## Syngr3 1 3.68257e-143 1.30462e-140 2.55531 1.0981258 1.1994793 ## Atp6v1g2 2 3.04451e-204 3.55295e-201 2.50875 0.0981706 0.5203760 ## Napb 2 1.10402e-231 1.82522e-228 2.81533 0.1774508 0.3046901 ## logFC.4 logFC.5 ## &lt;numeric&gt; &lt;numeric&gt; ## Atp1a3 1.241388 3.45669 ## Celf4 0.869334 3.10465 ## Gad1 4.050305 4.47236 ## Gad2 3.769556 4.16902 ## Mllt11 0.951649 2.88363 ## Ndrg4 1.140041 3.84337 ## Slc32a1 1.804311 1.92426 ## Syngr3 1.188856 2.47696 ## Atp6v1g2 0.616391 2.50875 ## Napb 0.673772 2.81533 Figure 26.6 indicates that most of the top markers are strongly DE in cells of cluster 1 compared to some or all of the other clusters. We can use these markers to identify cells from cluster 1 in validation studies with an independent population of cells. A quick look at the markers suggests that cluster 1 represents interneurons based on expression of Gad1 and Slc6a1 (Zeng et al. 2012). top.markers &lt;- rownames(marker.set)[marker.set$Top &lt;= 10] plotHeatmap(sce.zeisel, features=top.markers, order_columns_by=&quot;label&quot;) Figure 26.6: Heatmap of the log-expression of the top markers for cluster 1 compared to each other cluster. 
Cells are ordered by cluster and the color is scaled to the log-expression of each gene in each cell. An alternative visualization approach is to plot the log-fold changes to all other clusters directly (Figure 26.7). This is more concise and is useful in situations involving many clusters that contain different numbers of cells. library(pheatmap) logFCs &lt;- getMarkerEffects(marker.set[1:50,]) pheatmap(logFCs, breaks=seq(-5, 5, length.out=101)) Figure 26.7: Heatmap of the log-fold changes of the top markers for cluster 1 compared to each other cluster. Session Info View session info R version 4.0.4 (2021-02-15) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.12-books/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.12-books/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] pheatmap_1.0.12 BiocSingular_1.6.0 [3] scran_1.18.5 org.Mm.eg.db_3.12.0 [5] AnnotationDbi_1.52.0 scater_1.18.6 [7] ggplot2_3.3.3 scRNAseq_2.4.0 [9] SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0 [11] Biobase_2.50.0 GenomicRanges_1.42.0 [13] GenomeInfoDb_1.26.4 IRanges_2.24.1 [15] S4Vectors_0.28.1 BiocGenerics_0.36.0 [17] MatrixGenerics_1.2.1 matrixStats_0.58.0 [19] BiocStyle_2.18.1 rebook_1.0.0 loaded via a namespace (and not attached): [1] AnnotationHub_2.22.0 BiocFileCache_1.14.0 [3] igraph_1.2.6 lazyeval_0.2.2 [5] BiocParallel_1.24.1 digest_0.6.27 [7] ensembldb_2.14.0 htmltools_0.5.1.1 [9] viridis_0.5.1 fansi_0.4.2 [11] magrittr_2.0.1 memoise_2.0.0 [13] limma_3.46.0 Biostrings_2.58.0 [15] askpass_1.1 prettyunits_1.1.1 [17] colorspace_2.0-0 blob_1.2.1 [19] 
rappdirs_0.3.3 xfun_0.22 [21] dplyr_1.0.5 callr_3.5.1 [23] crayon_1.4.1 RCurl_1.98-1.3 [25] jsonlite_1.7.2 graph_1.68.0 [27] glue_1.4.2 gtable_0.3.0 [29] zlibbioc_1.36.0 XVector_0.30.0 [31] DelayedArray_0.16.2 scales_1.1.1 [33] edgeR_3.32.1 DBI_1.1.1 [35] Rcpp_1.0.6 viridisLite_0.3.0 [37] xtable_1.8-4 progress_1.2.2 [39] dqrng_0.2.1 bit_4.0.4 [41] rsvd_1.0.3 httr_1.4.2 [43] RColorBrewer_1.1-2 ellipsis_0.3.1 [45] pkgconfig_2.0.3 XML_3.99-0.6 [47] farver_2.1.0 scuttle_1.0.4 [49] CodeDepends_0.6.5 sass_0.3.1 [51] dbplyr_2.1.0 locfit_1.5-9.4 [53] utf8_1.2.1 tidyselect_1.1.0 [55] labeling_0.4.2 rlang_0.4.10 [57] later_1.1.0.1 munsell_0.5.0 [59] BiocVersion_3.12.0 tools_4.0.4 [61] cachem_1.0.4 generics_0.1.0 [63] RSQLite_2.2.4 ExperimentHub_1.16.0 [65] evaluate_0.14 stringr_1.4.0 [67] fastmap_1.1.0 yaml_2.2.1 [69] processx_3.4.5 knitr_1.31 [71] bit64_4.0.5 purrr_0.3.4 [73] AnnotationFilter_1.14.0 sparseMatrixStats_1.2.1 [75] mime_0.10 xml2_1.3.2 [77] biomaRt_2.46.3 compiler_4.0.4 [79] beeswarm_0.3.1 curl_4.3 [81] interactiveDisplayBase_1.28.0 statmod_1.4.35 [83] tibble_3.1.0 bslib_0.2.4 [85] stringi_1.5.3 highr_0.8 [87] ps_1.6.0 GenomicFeatures_1.42.2 [89] lattice_0.20-41 bluster_1.0.0 [91] ProtGenerics_1.22.0 Matrix_1.3-2 [93] vctrs_0.3.6 pillar_1.5.1 [95] lifecycle_1.0.0 BiocManager_1.30.10 [97] jquerylib_0.1.3 BiocNeighbors_1.8.2 [99] cowplot_1.1.1 bitops_1.0-6 [101] irlba_2.3.3 httpuv_1.5.5 [103] rtracklayer_1.50.0 R6_2.5.0 [105] bookdown_0.21 promises_1.2.0.1 [107] gridExtra_2.3 vipor_0.4.5 [109] codetools_0.2-18 assertthat_0.2.1 [111] openssl_1.4.3 withr_2.4.1 [113] GenomicAlignments_1.26.0 Rsamtools_2.6.0 [115] GenomeInfoDbData_1.2.4 hms_1.0.0 [117] grid_4.0.4 beachmat_2.6.4 [119] rmarkdown_2.7 DelayedMatrixStats_1.12.3 [121] Rtsne_0.15 shiny_1.6.0 [123] ggbeeswarm_0.6.0 Bibliography "],["unfiltered-human-pbmcs-10x-genomics.html", "Chapter 27 Unfiltered human PBMCs (10X Genomics) 27.1 Introduction 27.2 Data loading 27.3 Quality control 27.4 Normalization 27.5 
Variance modelling 27.6 Dimensionality reduction 27.7 Clustering 27.8 Interpretation Session Info", " Chapter 27 Unfiltered human PBMCs (10X Genomics) 27.1 Introduction Here, we describe a brief analysis of the peripheral blood mononuclear cell (PBMC) dataset from 10X Genomics (Zheng et al. 2017). The data are publicly available from the 10X Genomics website, from which we download the raw gene/barcode count matrices, i.e., before cell calling from the CellRanger pipeline. 27.2 Data loading library(DropletTestFiles) raw.path &lt;- getTestFile(&quot;tenx-2.1.0-pbmc4k/1.0.0/raw.tar.gz&quot;) out.path &lt;- file.path(tempdir(), &quot;pbmc4k&quot;) untar(raw.path, exdir=out.path) library(DropletUtils) fname &lt;- file.path(out.path, &quot;raw_gene_bc_matrices/GRCh38&quot;) sce.pbmc &lt;- read10xCounts(fname, col.names=TRUE) library(scater) rownames(sce.pbmc) &lt;- uniquifyFeatureNames( rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol) library(EnsDb.Hsapiens.v86) location &lt;- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce.pbmc)$ID, column=&quot;SEQNAME&quot;, keytype=&quot;GENEID&quot;) 27.3 Quality control We perform cell detection using the emptyDrops() algorithm, as discussed in Section 15.2. set.seed(100) e.out &lt;- emptyDrops(counts(sce.pbmc)) sce.pbmc &lt;- sce.pbmc[,which(e.out$FDR &lt;= 0.001)] unfiltered &lt;- sce.pbmc We use a relaxed QC strategy and only remove cells with large mitochondrial proportions, using these as a proxy for cell damage. This reduces the risk of removing cell types with low RNA content, especially in a heterogeneous PBMC population with many different cell types. 
stats &lt;- perCellQCMetrics(sce.pbmc, subsets=list(Mito=which(location==&quot;MT&quot;))) high.mito &lt;- isOutlier(stats$subsets_Mito_percent, type=&quot;higher&quot;) sce.pbmc &lt;- sce.pbmc[,!high.mito] summary(high.mito) ## Mode FALSE TRUE ## logical 3985 315 colData(unfiltered) &lt;- cbind(colData(unfiltered), stats) unfiltered$discard &lt;- high.mito gridExtra::grid.arrange( plotColData(unfiltered, y=&quot;sum&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Total count&quot;), plotColData(unfiltered, y=&quot;detected&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Detected features&quot;), plotColData(unfiltered, y=&quot;subsets_Mito_percent&quot;, colour_by=&quot;discard&quot;) + ggtitle(&quot;Mito percent&quot;), ncol=2 ) Figure 27.1: Distribution of various QC metrics in the PBMC dataset after cell calling. Each point is a cell and is colored according to whether it was discarded by the mitochondrial filter. plotColData(unfiltered, x=&quot;sum&quot;, y=&quot;subsets_Mito_percent&quot;, colour_by=&quot;discard&quot;) + scale_x_log10() Figure 27.2: Proportion of mitochondrial reads in each cell of the PBMC dataset compared to its total count. 27.4 Normalization library(scran) set.seed(1000) clusters &lt;- quickCluster(sce.pbmc) sce.pbmc &lt;- computeSumFactors(sce.pbmc, cluster=clusters) sce.pbmc &lt;- logNormCounts(sce.pbmc) summary(sizeFactors(sce.pbmc)) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.007 0.712 0.875 1.000 1.099 12.254 plot(librarySizeFactors(sce.pbmc), sizeFactors(sce.pbmc), pch=16, xlab=&quot;Library size factors&quot;, ylab=&quot;Deconvolution factors&quot;, log=&quot;xy&quot;) Figure 27.3: Relationship between the library size factors and the deconvolution size factors in the PBMC dataset. 
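To build intuition for the comparison in Figure 27.3, the following toy sketch (not part of the book's workflow; the data and variable names are invented for illustration) shows in base R what library size factors are: per-cell totals rescaled to a mean of 1, which logNormCounts() then uses to divide each cell's counts before the log2(x + 1) transformation. The deconvolution factors from computeSumFactors() replace these totals with pool-based estimates that are more robust to composition biases, but the centering and log-transformation steps are the same.

```r
## Toy illustration only: simulate a small count matrix of
## 20 genes by 10 cells, with no composition bias.
set.seed(42)
counts <- matrix(rpois(200, lambda=5), nrow=20)

## Library size factors: per-cell totals, scaled to have mean 1.
lib.sf <- colSums(counts) / mean(colSums(counts))
stopifnot(isTRUE(all.equal(mean(lib.sf), 1)))

## Normalized log-expression: divide each cell (column) by its
## size factor, then apply the log2(x + 1) transformation.
logcounts <- log2(t(t(counts) / lib.sf) + 1)
stopifnot(identical(dim(logcounts), dim(counts)))
```

In the real workflow, a systematic difference between these library size factors and the deconvolution factors for a subset of cells (as in Figure 27.3) is what indicates a composition bias.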
27.5 Variance modelling set.seed(1001) dec.pbmc &lt;- modelGeneVarByPoisson(sce.pbmc) top.pbmc &lt;- getTopHVGs(dec.pbmc, prop=0.1) plot(dec.pbmc$mean, dec.pbmc$total, pch=16, cex=0.5, xlab=&quot;Mean of log-expression&quot;, ylab=&quot;Variance of log-expression&quot;) curfit &lt;- metadata(dec.pbmc) curve(curfit$trend(x), col=&#39;dodgerblue&#39;, add=TRUE, lwd=2) Figure 27.4: Per-gene variance as a function of the mean for the log-expression values in the PBMC dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to simulated Poisson counts. 27.6 Dimensionality reduction set.seed(10000) sce.pbmc &lt;- denoisePCA(sce.pbmc, subset.row=top.pbmc, technical=dec.pbmc) set.seed(100000) sce.pbmc &lt;- runTSNE(sce.pbmc, dimred=&quot;PCA&quot;) set.seed(1000000) sce.pbmc &lt;- runUMAP(sce.pbmc, dimred=&quot;PCA&quot;) We verify that a reasonable number of PCs is retained. ncol(reducedDim(sce.pbmc, &quot;PCA&quot;)) ## [1] 9 27.7 Clustering g &lt;- buildSNNGraph(sce.pbmc, k=10, use.dimred = &#39;PCA&#39;) clust &lt;- igraph::cluster_walktrap(g)$membership colLabels(sce.pbmc) &lt;- factor(clust) table(colLabels(sce.pbmc)) ## ## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ## 205 508 541 56 374 125 46 432 302 867 47 155 166 61 84 16 plotTSNE(sce.pbmc, colour_by=&quot;label&quot;) Figure 27.5: Obligatory \\(t\\)-SNE plot of the PBMC dataset, where each point represents a cell and is colored according to the assigned cluster. 27.8 Interpretation markers &lt;- findMarkers(sce.pbmc, pval.type=&quot;some&quot;, direction=&quot;up&quot;) We examine the markers for cluster 8 in more detail. High expression of CD14, CD68 and MNDA combined with low expression of CD16 suggests that this cluster contains monocytes, compared to macrophages in cluster 15 (Figure 27.6). 
marker.set &lt;- markers[[&quot;8&quot;]] as.data.frame(marker.set[1:30,1:3]) ## p.value FDR summary.logFC ## CSTA 7.171e-222 2.016e-217 2.4179 ## MNDA 1.197e-221 2.016e-217 2.6615 ## FCN1 2.376e-213 2.669e-209 2.6381 ## S100A12 4.393e-212 3.701e-208 3.0809 ## VCAN 1.711e-199 1.153e-195 2.2604 ## TYMP 1.174e-154 6.590e-151 2.0238 ## AIF1 3.674e-149 1.768e-145 2.4604 ## LGALS2 4.005e-137 1.687e-133 1.8928 ## MS4A6A 5.640e-134 2.111e-130 1.5457 ## FGL2 2.045e-124 6.889e-121 1.3859 ## RP11-1143G9.4 6.892e-122 2.111e-118 2.8042 ## AP1S2 1.786e-112 5.015e-109 1.7704 ## CD14 1.195e-110 3.098e-107 1.4260 ## CFD 6.870e-109 1.654e-105 1.3560 ## GPX1 9.049e-107 2.033e-103 2.4014 ## TNFSF13B 3.920e-95 8.256e-92 1.1151 ## KLF4 3.310e-94 6.560e-91 1.2049 ## GRN 4.801e-91 8.987e-88 1.3815 ## NAMPT 2.490e-90 4.415e-87 1.1439 ## CLEC7A 7.736e-88 1.303e-84 1.0616 ## S100A8 3.125e-84 5.014e-81 4.8052 ## SERPINA1 1.580e-82 2.420e-79 1.3843 ## CD36 8.018e-79 1.175e-75 1.0538 ## MPEG1 8.482e-79 1.191e-75 0.9778 ## CD68 5.119e-78 6.899e-75 0.9481 ## CYBB 1.201e-77 1.556e-74 1.0300 ## S100A11 1.175e-72 1.466e-69 1.8962 ## RBP7 2.467e-71 2.969e-68 0.9666 ## BLVRB 3.763e-71 4.372e-68 0.9701 ## CD302 9.859e-71 1.107e-67 0.8792 plotExpression(sce.pbmc, features=c(&quot;CD14&quot;, &quot;CD68&quot;, &quot;MNDA&quot;, &quot;FCGR3A&quot;), x=&quot;label&quot;, colour_by=&quot;label&quot;) Figure 27.6: Distribution of expression values for monocyte and macrophage markers across clusters in the PBMC dataset. 
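To make the marker-ranking idea concrete, here is a hedged base-R sketch of a one-versus-rest test for upregulation. scran::findMarkers() is considerably more sophisticated (pairwise comparisons between clusters, blocking, log-fold-change thresholds), so treat this purely as an illustration; all names below are invented for the toy example.

```r
# Rank genes by a one-sided Welch t-test: cluster of interest vs all others.
rank_markers <- function(logcounts, labels, cluster) {
  in.clust <- labels == cluster
  pvals <- apply(logcounts, 1, function(gene)
    t.test(gene[in.clust], gene[!in.clust], alternative = "greater")$p.value)
  sort(pvals)  # smallest p-value first, i.e., strongest upregulated markers
}

set.seed(1)
mat <- matrix(rnorm(5 * 40), nrow = 5,
              dimnames = list(paste0("gene", 1:5), NULL))
labs <- rep(c("A", "B"), each = 20)
mat["gene3", labs == "A"] <- mat["gene3", labs == "A"] + 3  # spike in a marker
names(rank_markers(mat, labs, "A"))[1]
## [1] "gene3"
```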
Session Info View session info R version 4.0.4 (2021-02-15) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.12-books/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.12-books/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] scran_1.18.5 EnsDb.Hsapiens.v86_2.99.0 [3] ensembldb_2.14.0 AnnotationFilter_1.14.0 [5] GenomicFeatures_1.42.2 AnnotationDbi_1.52.0 [7] scater_1.18.6 ggplot2_3.3.3 [9] DropletUtils_1.10.3 SingleCellExperiment_1.12.0 [11] SummarizedExperiment_1.20.0 Biobase_2.50.0 [13] GenomicRanges_1.42.0 GenomeInfoDb_1.26.4 [15] IRanges_2.24.1 S4Vectors_0.28.1 [17] BiocGenerics_0.36.0 MatrixGenerics_1.2.1 [19] matrixStats_0.58.0 DropletTestFiles_1.0.0 [21] BiocStyle_2.18.1 rebook_1.0.0 loaded via a namespace (and not attached): [1] AnnotationHub_2.22.0 BiocFileCache_1.14.0 [3] igraph_1.2.6 lazyeval_0.2.2 [5] BiocParallel_1.24.1 digest_0.6.27 [7] htmltools_0.5.1.1 viridis_0.5.1 [9] fansi_0.4.2 magrittr_2.0.1 [11] memoise_2.0.0 limma_3.46.0 [13] Biostrings_2.58.0 R.utils_2.10.1 [15] askpass_1.1 prettyunits_1.1.1 [17] colorspace_2.0-0 blob_1.2.1 [19] rappdirs_0.3.3 xfun_0.22 [21] dplyr_1.0.5 callr_3.5.1 [23] crayon_1.4.1 RCurl_1.98-1.3 [25] jsonlite_1.7.2 graph_1.68.0 [27] glue_1.4.2 gtable_0.3.0 [29] zlibbioc_1.36.0 XVector_0.30.0 [31] DelayedArray_0.16.2 BiocSingular_1.6.0 [33] Rhdf5lib_1.12.1 HDF5Array_1.18.1 [35] scales_1.1.1 DBI_1.1.1 [37] edgeR_3.32.1 Rcpp_1.0.6 [39] viridisLite_0.3.0 xtable_1.8-4 [41] progress_1.2.2 dqrng_0.2.1 [43] bit_4.0.4 rsvd_1.0.3 [45] httr_1.4.2 FNN_1.1.3 [47] ellipsis_0.3.1 farver_2.1.0 [49] pkgconfig_2.0.3 
XML_3.99-0.6 [51] R.methodsS3_1.8.1 scuttle_1.0.4 [53] uwot_0.1.10 CodeDepends_0.6.5 [55] sass_0.3.1 dbplyr_2.1.0 [57] locfit_1.5-9.4 utf8_1.2.1 [59] labeling_0.4.2 tidyselect_1.1.0 [61] rlang_0.4.10 later_1.1.0.1 [63] munsell_0.5.0 BiocVersion_3.12.0 [65] tools_4.0.4 cachem_1.0.4 [67] generics_0.1.0 RSQLite_2.2.4 [69] ExperimentHub_1.16.0 evaluate_0.14 [71] stringr_1.4.0 fastmap_1.1.0 [73] yaml_2.2.1 processx_3.4.5 [75] knitr_1.31 bit64_4.0.5 [77] purrr_0.3.4 sparseMatrixStats_1.2.1 [79] mime_0.10 R.oo_1.24.0 [81] xml2_1.3.2 biomaRt_2.46.3 [83] compiler_4.0.4 beeswarm_0.3.1 [85] curl_4.3 interactiveDisplayBase_1.28.0 [87] statmod_1.4.35 tibble_3.1.0 [89] bslib_0.2.4 stringi_1.5.3 [91] highr_0.8 ps_1.6.0 [93] RSpectra_0.16-0 bluster_1.0.0 [95] lattice_0.20-41 ProtGenerics_1.22.0 [97] Matrix_1.3-2 vctrs_0.3.6 [99] pillar_1.5.1 lifecycle_1.0.0 [101] rhdf5filters_1.2.0 BiocManager_1.30.10 [103] jquerylib_0.1.3 BiocNeighbors_1.8.2 [105] cowplot_1.1.1 bitops_1.0-6 [107] irlba_2.3.3 httpuv_1.5.5 [109] rtracklayer_1.50.0 R6_2.5.0 [111] bookdown_0.21 promises_1.2.0.1 [113] gridExtra_2.3 vipor_0.4.5 [115] codetools_0.2-18 assertthat_0.2.1 [117] rhdf5_2.34.0 openssl_1.4.3 [119] withr_2.4.1 GenomicAlignments_1.26.0 [121] Rsamtools_2.6.0 GenomeInfoDbData_1.2.4 [123] hms_1.0.0 grid_4.0.4 [125] beachmat_2.6.4 rmarkdown_2.7 [127] DelayedMatrixStats_1.12.3 Rtsne_0.15 [129] shiny_1.6.0 ggbeeswarm_0.6.0 Bibliography "],["filtered-human-pbmcs-10x-genomics.html", "Chapter 28 Filtered human PBMCs (10X Genomics) 28.1 Introduction 28.2 Data loading 28.3 Quality control 28.4 Normalization 28.5 Variance modelling 28.6 Dimensionality reduction 28.7 Clustering 28.8 Data integration Session Info", " Chapter 28 Filtered human PBMCs (10X Genomics) .rebook-collapse { background-color: #eee; color: #444; cursor: pointer; padding: 18px; width: 100%; border: none; text-align: left; outline: none; font-size: 15px; } .rebook-content { padding: 0 18px; display: none; overflow: hidden; 
background-color: #f1f1f1; } 28.1 Introduction This performs an analysis of the public PBMC datasets generated by 10X Genomics (Zheng et al. 2017), starting from the filtered count matrices. 28.2 Data loading library(TENxPBMCData) all.sce &lt;- list( pbmc3k=TENxPBMCData(&#39;pbmc3k&#39;), pbmc4k=TENxPBMCData(&#39;pbmc4k&#39;), pbmc8k=TENxPBMCData(&#39;pbmc8k&#39;) ) 28.3 Quality control unfiltered &lt;- all.sce Cell calling implicitly serves as a QC step to remove libraries with low total counts and few detected genes. Thus, we will only filter on the mitochondrial proportion. library(scater) stats &lt;- high.mito &lt;- list() for (n in names(all.sce)) { current &lt;- all.sce[[n]] is.mito &lt;- grep(&quot;MT&quot;, rowData(current)$Symbol_TENx) stats[[n]] &lt;- perCellQCMetrics(current, subsets=list(Mito=is.mito)) high.mito[[n]] &lt;- isOutlier(stats[[n]]$subsets_Mito_percent, type=&quot;higher&quot;) all.sce[[n]] &lt;- current[,!high.mito[[n]]] } qcplots &lt;- list() for (n in names(all.sce)) { current &lt;- unfiltered[[n]] colData(current) &lt;- cbind(colData(current), stats[[n]]) current$discard &lt;- high.mito[[n]] qcplots[[n]] &lt;- plotColData(current, x=&quot;sum&quot;, y=&quot;subsets_Mito_percent&quot;, colour_by=&quot;discard&quot;) + scale_x_log10() } do.call(gridExtra::grid.arrange, c(qcplots, ncol=3)) Figure 28.1: Percentage of mitochondrial reads in each cell in each of the 10X PBMC datasets, compared to the total count. Each point represents a cell and is colored according to whether that cell was discarded. lapply(high.mito, summary) ## $pbmc3k ## Mode FALSE TRUE ## logical 2609 91 ## ## $pbmc4k ## Mode FALSE TRUE ## logical 4182 158 ## ## $pbmc8k ## Mode FALSE TRUE ## logical 8157 224 28.4 Normalization We perform library size normalization, simply for convenience when dealing with file-backed matrices. all.sce &lt;- lapply(all.sce, logNormCounts) lapply(all.sce, function(x) summary(sizeFactors(x))) ## $pbmc3k ## Min. 1st Qu.
Median Mean 3rd Qu. Max. ## 0.234 0.748 0.926 1.000 1.157 6.604 ## ## $pbmc4k ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.315 0.711 0.890 1.000 1.127 11.027 ## ## $pbmc8k ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.296 0.704 0.877 1.000 1.118 6.794 28.5 Variance modelling library(scran) all.dec &lt;- lapply(all.sce, modelGeneVar) all.hvgs &lt;- lapply(all.dec, getTopHVGs, prop=0.1) par(mfrow=c(1,3)) for (n in names(all.dec)) { curdec &lt;- all.dec[[n]] plot(curdec$mean, curdec$total, pch=16, cex=0.5, main=n, xlab=&quot;Mean of log-expression&quot;, ylab=&quot;Variance of log-expression&quot;) curfit &lt;- metadata(curdec) curve(curfit$trend(x), col=&#39;dodgerblue&#39;, add=TRUE, lwd=2) } Figure 28.2: Per-gene variance as a function of the mean for the log-expression values in each PBMC dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to the variances. 28.6 Dimensionality reduction For various reasons, we will first analyze each PBMC dataset separately rather than merging them together. We use randomized SVD, which is more efficient for file-backed matrices. 
library(BiocSingular) set.seed(10000) all.sce &lt;- mapply(FUN=runPCA, x=all.sce, subset_row=all.hvgs, MoreArgs=list(ncomponents=25, BSPARAM=RandomParam()), SIMPLIFY=FALSE) set.seed(100000) all.sce &lt;- lapply(all.sce, runTSNE, dimred=&quot;PCA&quot;) set.seed(1000000) all.sce &lt;- lapply(all.sce, runUMAP, dimred=&quot;PCA&quot;) 28.7 Clustering for (n in names(all.sce)) { g &lt;- buildSNNGraph(all.sce[[n]], k=10, use.dimred=&#39;PCA&#39;) clust &lt;- igraph::cluster_walktrap(g)$membership colLabels(all.sce[[n]]) &lt;- factor(clust) } lapply(all.sce, function(x) table(colLabels(x))) ## $pbmc3k ## ## 1 2 3 4 5 6 7 8 9 10 ## 487 154 603 514 31 150 179 333 147 11 ## ## $pbmc4k ## ## 1 2 3 4 5 6 7 8 9 10 11 12 13 ## 497 185 569 786 373 232 44 1023 77 218 88 54 36 ## ## $pbmc8k ## ## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ## 1004 759 1073 1543 367 150 201 2067 59 154 244 67 76 285 20 15 ## 17 18 ## 64 9 all.tsne &lt;- list() for (n in names(all.sce)) { all.tsne[[n]] &lt;- plotTSNE(all.sce[[n]], colour_by=&quot;label&quot;) + ggtitle(n) } do.call(gridExtra::grid.arrange, c(all.tsne, list(ncol=2))) Figure 28.3: Obligatory \\(t\\)-SNE plots of each PBMC dataset, where each point represents a cell in the corresponding dataset and is colored according to the assigned cluster. 28.8 Data integration With the per-dataset analyses out of the way, we will now repeat the analysis after merging together the three batches. # Intersecting the common genes. universe &lt;- Reduce(intersect, lapply(all.sce, rownames)) all.sce2 &lt;- lapply(all.sce, &quot;[&quot;, i=universe,) all.dec2 &lt;- lapply(all.dec, &quot;[&quot;, i=universe,) # Renormalizing to adjust for differences in depth. library(batchelor) normed.sce &lt;- do.call(multiBatchNorm, all.sce2) # Identifying a set of HVGs using stats from all batches. 
combined.dec &lt;- do.call(combineVar, all.dec2) combined.hvg &lt;- getTopHVGs(combined.dec, n=5000) set.seed(1000101) merged.pbmc &lt;- do.call(fastMNN, c(normed.sce, list(subset.row=combined.hvg, BSPARAM=RandomParam()))) We use the percentage of lost variance as a diagnostic measure. metadata(merged.pbmc)$merge.info$lost.var ## pbmc3k pbmc4k pbmc8k ## [1,] 7.003e-03 3.126e-03 0.000000 ## [2,] 7.137e-05 5.125e-05 0.003003 We proceed to clustering: g &lt;- buildSNNGraph(merged.pbmc, use.dimred=&quot;corrected&quot;) colLabels(merged.pbmc) &lt;- factor(igraph::cluster_louvain(g)$membership) table(colLabels(merged.pbmc), merged.pbmc$batch) ## ## pbmc3k pbmc4k pbmc8k ## 1 113 387 825 ## 2 507 395 806 ## 3 175 344 581 ## 4 295 539 1018 ## 5 346 638 1210 ## 6 11 3 9 ## 7 17 27 111 ## 8 33 113 185 ## 9 423 754 1546 ## 10 4 36 67 ## 11 197 124 221 ## 12 150 180 293 ## 13 327 588 1125 ## 14 11 54 160 And visualization: set.seed(10101010) merged.pbmc &lt;- runTSNE(merged.pbmc, dimred=&quot;corrected&quot;) gridExtra::grid.arrange( plotTSNE(merged.pbmc, colour_by=&quot;label&quot;, text_by=&quot;label&quot;, text_colour=&quot;red&quot;), plotTSNE(merged.pbmc, colour_by=&quot;batch&quot;) ) Figure 28.4: Obligatory \\(t\\)-SNE plots for the merged PBMC datasets, where each point represents a cell and is colored by cluster (top) or batch (bottom). 
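Earlier in this section, per-batch variance statistics were pooled before HVG selection. A hedged sketch of that idea: combineVar() (roughly) averages the per-batch variance components for each gene, after which getTopHVGs(n=...) simply takes the n genes with the largest combined biological component. The code below is a simplified stand-in, not the actual implementation; gene and batch names are made up.

```r
# Per-batch biological variance components for three genes.
per.batch.bio <- list(
  pbmc3k = c(geneA = 0.5, geneB = 0.1, geneC = 0.9),
  pbmc4k = c(geneA = 0.4, geneB = 0.2, geneC = 0.8)
)

# Average across batches, then keep the top-n genes.
combined.bio <- rowMeans(do.call(cbind, per.batch.bio))
top.hvgs <- names(sort(combined.bio, decreasing = TRUE))[1:2]
top.hvgs
## [1] "geneC" "geneA"
```

Averaging rewards genes that are consistently variable in every batch, which is what we want for features driving a merged analysis.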
Session Info View session info R version 4.0.4 (2021-02-15) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.12-books/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.12-books/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] batchelor_1.6.2 BiocSingular_1.6.0 [3] scran_1.18.5 scater_1.18.6 [5] ggplot2_3.3.3 TENxPBMCData_1.8.0 [7] HDF5Array_1.18.1 rhdf5_2.34.0 [9] DelayedArray_0.16.2 Matrix_1.3-2 [11] SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0 [13] Biobase_2.50.0 GenomicRanges_1.42.0 [15] GenomeInfoDb_1.26.4 IRanges_2.24.1 [17] S4Vectors_0.28.1 BiocGenerics_0.36.0 [19] MatrixGenerics_1.2.1 matrixStats_0.58.0 [21] BiocStyle_2.18.1 rebook_1.0.0 loaded via a namespace (and not attached): [1] AnnotationHub_2.22.0 BiocFileCache_1.14.0 [3] igraph_1.2.6 BiocParallel_1.24.1 [5] digest_0.6.27 htmltools_0.5.1.1 [7] viridis_0.5.1 fansi_0.4.2 [9] magrittr_2.0.1 memoise_2.0.0 [11] limma_3.46.0 colorspace_2.0-0 [13] blob_1.2.1 rappdirs_0.3.3 [15] xfun_0.22 dplyr_1.0.5 [17] callr_3.5.1 crayon_1.4.1 [19] RCurl_1.98-1.3 jsonlite_1.7.2 [21] graph_1.68.0 glue_1.4.2 [23] gtable_0.3.0 zlibbioc_1.36.0 [25] XVector_0.30.0 Rhdf5lib_1.12.1 [27] scales_1.1.1 DBI_1.1.1 [29] edgeR_3.32.1 Rcpp_1.0.6 [31] viridisLite_0.3.0 xtable_1.8-4 [33] dqrng_0.2.1 bit_4.0.4 [35] rsvd_1.0.3 ResidualMatrix_1.0.0 [37] httr_1.4.2 FNN_1.1.3 [39] ellipsis_0.3.1 pkgconfig_2.0.3 [41] XML_3.99-0.6 farver_2.1.0 [43] scuttle_1.0.4 CodeDepends_0.6.5 [45] sass_0.3.1 uwot_0.1.10 [47] dbplyr_2.1.0 locfit_1.5-9.4 [49] utf8_1.2.1 tidyselect_1.1.0 [51] labeling_0.4.2 rlang_0.4.10 [53] 
later_1.1.0.1 AnnotationDbi_1.52.0 [55] munsell_0.5.0 BiocVersion_3.12.0 [57] tools_4.0.4 cachem_1.0.4 [59] generics_0.1.0 RSQLite_2.2.4 [61] ExperimentHub_1.16.0 evaluate_0.14 [63] stringr_1.4.0 fastmap_1.1.0 [65] yaml_2.2.1 processx_3.4.5 [67] knitr_1.31 bit64_4.0.5 [69] purrr_0.3.4 sparseMatrixStats_1.2.1 [71] mime_0.10 compiler_4.0.4 [73] beeswarm_0.3.1 curl_4.3 [75] interactiveDisplayBase_1.28.0 tibble_3.1.0 [77] statmod_1.4.35 bslib_0.2.4 [79] stringi_1.5.3 highr_0.8 [81] ps_1.6.0 RSpectra_0.16-0 [83] lattice_0.20-41 bluster_1.0.0 [85] vctrs_0.3.6 pillar_1.5.1 [87] lifecycle_1.0.0 rhdf5filters_1.2.0 [89] BiocManager_1.30.10 jquerylib_0.1.3 [91] RcppAnnoy_0.0.18 BiocNeighbors_1.8.2 [93] cowplot_1.1.1 bitops_1.0-6 [95] irlba_2.3.3 httpuv_1.5.5 [97] R6_2.5.0 bookdown_0.21 [99] promises_1.2.0.1 gridExtra_2.3 [101] vipor_0.4.5 codetools_0.2-18 [103] assertthat_0.2.1 withr_2.4.1 [105] GenomeInfoDbData_1.2.4 grid_4.0.4 [107] beachmat_2.6.4 rmarkdown_2.7 [109] DelayedMatrixStats_1.12.3 Rtsne_0.15 [111] shiny_1.6.0 ggbeeswarm_0.6.0 Bibliography "],["human-pbmc-with-surface-proteins-10x-genomics.html", "Chapter 29 Human PBMC with surface proteins (10X Genomics) 29.1 Introduction 29.2 Data loading 29.3 Quality control 29.4 Normalization 29.5 Dimensionality reduction 29.6 Clustering Session Info", " Chapter 29 Human PBMC with surface proteins (10X Genomics) .rebook-collapse { background-color: #eee; color: #444; cursor: pointer; padding: 18px; width: 100%; border: none; text-align: left; outline: none; font-size: 15px; } .rebook-content { padding: 0 18px; display: none; overflow: hidden; background-color: #f1f1f1; } 29.1 Introduction Here, we describe a brief analysis of yet another peripheral blood mononuclear cell (PBMC) dataset from 10X Genomics (Zheng et al. 2017). Data are publicly available from the 10X Genomics website, from which we download the filtered gene/barcode count matrices for gene expression and cell surface proteins. 
Note that most of the repertoire-related steps will be discussed in Chapter 21; this workflow mostly provides the baseline analysis for the expression data. 29.2 Data loading library(BiocFileCache) bfc &lt;- BiocFileCache(ask=FALSE) exprs.data &lt;- bfcrpath(bfc, file.path( &quot;http://cf.10xgenomics.com/samples/cell-vdj/3.1.0&quot;, &quot;vdj_v1_hs_pbmc3&quot;, &quot;vdj_v1_hs_pbmc3_filtered_feature_bc_matrix.tar.gz&quot;)) untar(exprs.data, exdir=tempdir()) library(DropletUtils) sce.pbmc &lt;- read10xCounts(file.path(tempdir(), &quot;filtered_feature_bc_matrix&quot;)) sce.pbmc &lt;- splitAltExps(sce.pbmc, rowData(sce.pbmc)$Type) 29.3 Quality control unfiltered &lt;- sce.pbmc We discard cells with high mitochondrial proportions and few detectable ADT counts. library(scater) is.mito &lt;- grep(&quot;^MT-&quot;, rowData(sce.pbmc)$Symbol) stats &lt;- perCellQCMetrics(sce.pbmc, subsets=list(Mito=is.mito)) high.mito &lt;- isOutlier(stats$subsets_Mito_percent, type=&quot;higher&quot;) low.adt &lt;- stats$`altexps_Antibody Capture_detected` &lt; nrow(altExp(sce.pbmc))/2 discard &lt;- high.mito | low.adt sce.pbmc &lt;- sce.pbmc[,!discard] We examine some of the statistics: summary(high.mito) ## Mode FALSE TRUE ## logical 6660 571 summary(low.adt) ## Mode FALSE ## logical 7231 summary(discard) ## Mode FALSE TRUE ## logical 6660 571 We examine the distribution of each QC metric (Figure 29.1).
colData(unfiltered) &lt;- cbind(colData(unfiltered), stats) unfiltered$discard &lt;- discard gridExtra::grid.arrange( plotColData(unfiltered, y=&quot;sum&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Total count&quot;), plotColData(unfiltered, y=&quot;detected&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Detected features&quot;), plotColData(unfiltered, y=&quot;subsets_Mito_percent&quot;, colour_by=&quot;discard&quot;) + ggtitle(&quot;Mito percent&quot;), plotColData(unfiltered, y=&quot;altexps_Antibody Capture_detected&quot;, colour_by=&quot;discard&quot;) + ggtitle(&quot;ADT detected&quot;), ncol=2 ) Figure 29.1: Distribution of each QC metric in the PBMC dataset, where each point is a cell and is colored by whether or not it was discarded by the outlier-based QC approach. We also plot the mitochondrial proportion against the total count for each cell, as one does (Figure 29.2). plotColData(unfiltered, x=&quot;sum&quot;, y=&quot;subsets_Mito_percent&quot;, colour_by=&quot;discard&quot;) + scale_x_log10() Figure 29.2: Percentage of UMIs mapped to mitochondrial genes against the total count for each cell. 29.4 Normalization Computing size factors for the gene expression and ADT counts. library(scran) set.seed(1000) clusters &lt;- quickCluster(sce.pbmc) sce.pbmc &lt;- computeSumFactors(sce.pbmc, cluster=clusters) altExp(sce.pbmc) &lt;- computeMedianFactors(altExp(sce.pbmc)) sce.pbmc &lt;- logNormCounts(sce.pbmc, use_altexps=TRUE) We generate some summary statistics for both sets of size factors: summary(sizeFactors(sce.pbmc)) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.074 0.719 0.908 1.000 1.133 8.858 summary(sizeFactors(altExp(sce.pbmc))) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.10 0.70 0.83 1.00 1.03 227.36 We also look at the distribution of size factors compared to the library size for each set of features (Figure 29.3).
par(mfrow=c(1,2)) plot(librarySizeFactors(sce.pbmc), sizeFactors(sce.pbmc), pch=16, xlab=&quot;Library size factors&quot;, ylab=&quot;Deconvolution factors&quot;, main=&quot;Gene expression&quot;, log=&quot;xy&quot;) plot(librarySizeFactors(altExp(sce.pbmc)), sizeFactors(altExp(sce.pbmc)), pch=16, xlab=&quot;Library size factors&quot;, ylab=&quot;Median-based factors&quot;, main=&quot;Antibody capture&quot;, log=&quot;xy&quot;) Figure 29.3: Plot of the deconvolution size factors for the gene expression values (left) or the median-based size factors for the ADT expression values (right) compared to the library size-derived factors for the corresponding set of features. Each point represents a cell. 29.5 Dimensionality reduction We omit the PCA step for the ADT expression matrix, given that it is already so low-dimensional, and progress directly to \\(t\\)-SNE and UMAP visualizations. set.seed(100000) altExp(sce.pbmc) &lt;- runTSNE(altExp(sce.pbmc)) set.seed(1000000) altExp(sce.pbmc) &lt;- runUMAP(altExp(sce.pbmc)) 29.6 Clustering We perform graph-based clustering on the ADT data and use the assignments as the column labels of the alternative Experiment. g.adt &lt;- buildSNNGraph(altExp(sce.pbmc), k=10, d=NA) clust.adt &lt;- igraph::cluster_walktrap(g.adt)$membership colLabels(altExp(sce.pbmc)) &lt;- factor(clust.adt) We examine some basic statistics about the size of each cluster, their separation (Figure 29.4) and their distribution in our \\(t\\)-SNE plot (Figure 29.5). 
table(colLabels(altExp(sce.pbmc))) ## ## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ## 160 507 662 39 691 1415 32 650 76 1037 121 47 68 25 15 562 ## 17 18 19 20 21 22 23 24 ## 139 32 44 120 84 65 52 17 library(bluster) mod &lt;- pairwiseModularity(g.adt, clust.adt, as.ratio=TRUE) library(pheatmap) pheatmap::pheatmap(log10(mod + 10), cluster_row=FALSE, cluster_col=FALSE, color=colorRampPalette(c(&quot;white&quot;, &quot;blue&quot;))(101)) Figure 29.4: Heatmap of the pairwise cluster modularity scores in the PBMC dataset, computed based on the shared nearest neighbor graph derived from the ADT expression values. plotTSNE(altExp(sce.pbmc), colour_by=&quot;label&quot;, text_by=&quot;label&quot;, text_col=&quot;red&quot;) Figure 29.5: Obligatory \\(t\\)-SNE plot of the PBMC dataset based on its ADT expression values, where each point is a cell and is colored by the cluster of origin. Cluster labels are also overlaid at the median coordinates across all cells in the cluster. We perform some additional subclustering using the expression data to mimic an in silico FACS experiment. set.seed(1010010) subclusters &lt;- quickSubCluster(sce.pbmc, clust.adt, prepFUN=function(x) { dec &lt;- modelGeneVarByPoisson(x) top &lt;- getTopHVGs(dec, prop=0.1) denoisePCA(x, dec, subset.row=top) }, clusterFUN=function(x) { g.gene &lt;- buildSNNGraph(x, k=10, use.dimred = &#39;PCA&#39;) igraph::cluster_walktrap(g.gene)$membership } ) We count the number of gene expression-derived subclusters in each ADT-derived parent cluster.
data.frame( Cluster=names(subclusters), Ncells=vapply(subclusters, ncol, 0L), Nsub=vapply(subclusters, function(x) length(unique(x$subcluster)), 0L) ) ## Cluster Ncells Nsub ## 1 1 160 3 ## 2 2 507 4 ## 3 3 662 5 ## 4 4 39 1 ## 5 5 691 5 ## 6 6 1415 7 ## 7 7 32 1 ## 8 8 650 7 ## 9 9 76 2 ## 10 10 1037 8 ## 11 11 121 2 ## 12 12 47 1 ## 13 13 68 2 ## 14 14 25 1 ## 15 15 15 1 ## 16 16 562 9 ## 17 17 139 3 ## 18 18 32 1 ## 19 19 44 1 ## 20 20 120 4 ## 21 21 84 3 ## 22 22 65 2 ## 23 23 52 3 ## 24 24 17 1 Session Info View session info R version 4.0.4 (2021-02-15) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.12-books/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.12-books/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] pheatmap_1.0.12 bluster_1.0.0 [3] scran_1.18.5 scater_1.18.6 [5] ggplot2_3.3.3 DropletUtils_1.10.3 [7] SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0 [9] Biobase_2.50.0 GenomicRanges_1.42.0 [11] GenomeInfoDb_1.26.4 IRanges_2.24.1 [13] S4Vectors_0.28.1 BiocGenerics_0.36.0 [15] MatrixGenerics_1.2.1 matrixStats_0.58.0 [17] BiocFileCache_1.14.0 dbplyr_2.1.0 [19] BiocStyle_2.18.1 rebook_1.0.0 loaded via a namespace (and not attached): [1] Rtsne_0.15 ggbeeswarm_0.6.0 [3] colorspace_2.0-0 ellipsis_0.3.1 [5] scuttle_1.0.4 XVector_0.30.0 [7] BiocNeighbors_1.8.2 farver_2.1.0 [9] bit64_4.0.5 RSpectra_0.16-0 [11] fansi_0.4.2 codetools_0.2-18 [13] R.methodsS3_1.8.1 sparseMatrixStats_1.2.1 [15] cachem_1.0.4 knitr_1.31 [17] jsonlite_1.7.2 R.oo_1.24.0 [19] uwot_0.1.10 graph_1.68.0 [21] HDF5Array_1.18.1 BiocManager_1.30.10 [23] 
compiler_4.0.4 httr_1.4.2 [25] dqrng_0.2.1 assertthat_0.2.1 [27] Matrix_1.3-2 fastmap_1.1.0 [29] limma_3.46.0 BiocSingular_1.6.0 [31] htmltools_0.5.1.1 tools_4.0.4 [33] igraph_1.2.6 rsvd_1.0.3 [35] gtable_0.3.0 glue_1.4.2 [37] GenomeInfoDbData_1.2.4 dplyr_1.0.5 [39] rappdirs_0.3.3 Rcpp_1.0.6 [41] jquerylib_0.1.3 vctrs_0.3.6 [43] rhdf5filters_1.2.0 DelayedMatrixStats_1.12.3 [45] xfun_0.22 stringr_1.4.0 [47] ps_1.6.0 beachmat_2.6.4 [49] lifecycle_1.0.0 irlba_2.3.3 [51] statmod_1.4.35 XML_3.99-0.6 [53] edgeR_3.32.1 zlibbioc_1.36.0 [55] scales_1.1.1 rhdf5_2.34.0 [57] RColorBrewer_1.1-2 yaml_2.2.1 [59] curl_4.3 memoise_2.0.0 [61] gridExtra_2.3 sass_0.3.1 [63] stringi_1.5.3 RSQLite_2.2.4 [65] highr_0.8 BiocParallel_1.24.1 [67] rlang_0.4.10 pkgconfig_2.0.3 [69] bitops_1.0-6 evaluate_0.14 [71] lattice_0.20-41 purrr_0.3.4 [73] Rhdf5lib_1.12.1 CodeDepends_0.6.5 [75] labeling_0.4.2 cowplot_1.1.1 [77] bit_4.0.4 processx_3.4.5 [79] tidyselect_1.1.0 RcppAnnoy_0.0.18 [81] magrittr_2.0.1 bookdown_0.21 [83] R6_2.5.0 generics_0.1.0 [85] DelayedArray_0.16.2 DBI_1.1.1 [87] pillar_1.5.1 withr_2.4.1 [89] RCurl_1.98-1.3 tibble_3.1.0 [91] crayon_1.4.1 utf8_1.2.1 [93] rmarkdown_2.7 viridis_0.5.1 [95] locfit_1.5-9.4 grid_4.0.4 [97] blob_1.2.1 callr_3.5.1 [99] digest_0.6.27 R.utils_2.10.1 [101] munsell_0.5.0 beeswarm_0.3.1 [103] viridisLite_0.3.0 vipor_0.4.5 [105] bslib_0.2.4 Bibliography "],["grun-human-pancreas-cel-seq2.html", "Chapter 30 Grun human pancreas (CEL-seq2) 30.1 Introduction 30.2 Data loading 30.3 Quality control 30.4 Normalization 30.5 Variance modelling 30.6 Data integration 30.7 Dimensionality reduction 30.8 Clustering Session Info", " Chapter 30 Grun human pancreas (CEL-seq2) .rebook-collapse { background-color: #eee; color: #444; cursor: pointer; padding: 18px; width: 100%; border: none; text-align: left; outline: none; font-size: 15px; } .rebook-content { padding: 0 18px; display: none; overflow: hidden; background-color: #f1f1f1; } 30.1 Introduction This workflow 
performs an analysis of the Grun et al. (2016) CEL-seq2 dataset consisting of human pancreas cells from various donors. 30.2 Data loading library(scRNAseq) sce.grun &lt;- GrunPancreasData() We convert to Ensembl identifiers, and we remove duplicated genes or genes without Ensembl IDs. library(org.Hs.eg.db) gene.ids &lt;- mapIds(org.Hs.eg.db, keys=rowData(sce.grun)$symbol, keytype=&quot;SYMBOL&quot;, column=&quot;ENSEMBL&quot;) keep &lt;- !is.na(gene.ids) &amp; !duplicated(gene.ids) sce.grun &lt;- sce.grun[keep,] rownames(sce.grun) &lt;- gene.ids[keep] 30.3 Quality control unfiltered &lt;- sce.grun This dataset lacks mitochondrial genes so we will do without them for quality control. We compute the median and MAD while blocking on the donor; for donors where the assumption of a majority of high-quality cells seems to be violated (Figure 30.1), we compute an appropriate threshold using the other donors as specified in the subset= argument. library(scater) stats &lt;- perCellQCMetrics(sce.grun) qc &lt;- quickPerCellQC(stats, percent_subsets=&quot;altexps_ERCC_percent&quot;, batch=sce.grun$donor, subset=sce.grun$donor %in% c(&quot;D17&quot;, &quot;D7&quot;, &quot;D2&quot;)) sce.grun &lt;- sce.grun[,!qc$discard] colData(unfiltered) &lt;- cbind(colData(unfiltered), stats) unfiltered$discard &lt;- qc$discard gridExtra::grid.arrange( plotColData(unfiltered, x=&quot;donor&quot;, y=&quot;sum&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Total count&quot;), plotColData(unfiltered, x=&quot;donor&quot;, y=&quot;detected&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Detected features&quot;), plotColData(unfiltered, x=&quot;donor&quot;, y=&quot;altexps_ERCC_percent&quot;, colour_by=&quot;discard&quot;) + ggtitle(&quot;ERCC percent&quot;), ncol=2 ) Figure 30.1: Distribution of each QC metric across cells from each donor of the Grun pancreas dataset. 
Each point represents a cell and is colored according to whether that cell was discarded. colSums(as.matrix(qc), na.rm=TRUE) ## low_lib_size low_n_features high_altexps_ERCC_percent ## 450 512 606 ## discard ## 665 30.4 Normalization library(scran) set.seed(1000) # for irlba. clusters &lt;- quickCluster(sce.grun) sce.grun &lt;- computeSumFactors(sce.grun, clusters=clusters) sce.grun &lt;- logNormCounts(sce.grun) summary(sizeFactors(sce.grun)) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.099 0.511 0.796 1.000 1.231 8.838 plot(librarySizeFactors(sce.grun), sizeFactors(sce.grun), pch=16, xlab=&quot;Library size factors&quot;, ylab=&quot;Deconvolution factors&quot;, log=&quot;xy&quot;) Figure 30.2: Relationship between the library size factors and the deconvolution size factors in the Grun pancreas dataset. 30.5 Variance modelling We block on a combined plate and donor factor. block &lt;- paste0(sce.grun$sample, &quot;_&quot;, sce.grun$donor) dec.grun &lt;- modelGeneVarWithSpikes(sce.grun, spikes=&quot;ERCC&quot;, block=block) top.grun &lt;- getTopHVGs(dec.grun, prop=0.1) We examine the number of cells in each level of the blocking factor. 
table(block) ## block ## CD13+ sorted cells_D17 CD24+ CD44+ live sorted cells_D17 ## 86 87 ## CD63+ sorted cells_D10 TGFBR3+ sorted cells_D17 ## 41 90 ## exocrine fraction, live sorted cells_D2 exocrine fraction, live sorted cells_D3 ## 82 7 ## live sorted cells, library 1_D10 live sorted cells, library 1_D17 ## 33 88 ## live sorted cells, library 1_D3 live sorted cells, library 1_D7 ## 24 85 ## live sorted cells, library 2_D10 live sorted cells, library 2_D17 ## 35 83 ## live sorted cells, library 2_D3 live sorted cells, library 2_D7 ## 27 84 ## live sorted cells, library 3_D3 live sorted cells, library 3_D7 ## 16 83 ## live sorted cells, library 4_D3 live sorted cells, library 4_D7 ## 29 83 par(mfrow=c(6,3)) blocked.stats &lt;- dec.grun$per.block for (i in colnames(blocked.stats)) { current &lt;- blocked.stats[[i]] plot(current$mean, current$total, main=i, pch=16, cex=0.5, xlab=&quot;Mean of log-expression&quot;, ylab=&quot;Variance of log-expression&quot;) curfit &lt;- metadata(current) points(curfit$mean, curfit$var, col=&quot;red&quot;, pch=16) curve(curfit$trend(x), col=&#39;dodgerblue&#39;, add=TRUE, lwd=2) } Figure 30.3: Per-gene variance as a function of the mean for the log-expression values in the Grun pancreas dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to the spike-in transcripts (red) separately for each donor.
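The trend-fitting step above can be sketched in base R: fit a smooth trend of variance against mean, and treat each gene's residual above the trend as its "biological" component. modelGeneVarWithSpikes() fits the trend to the spike-in transcripts and adds distributional machinery; this loess-based toy, with simulated means and variances, is only meant to convey the idea.

```r
set.seed(42)
means <- runif(200, 0, 5)
vars <- 0.5 * means + rnorm(200, sd = 0.05)  # technical mean-variance trend
vars[7] <- vars[7] + 2                       # one highly variable gene

fit <- loess(vars ~ means)                   # fitted trend
bio <- vars - predict(fit, means)            # residual = biological component
which.max(bio)
## [1] 7
```

Genes with large positive residuals sit far above the trend and are the ones getTopHVGs() would prioritize.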
30.6 Data integration library(batchelor) set.seed(1001010) merged.grun &lt;- fastMNN(sce.grun, subset.row=top.grun, batch=sce.grun$donor) metadata(merged.grun)$merge.info$lost.var ## D10 D17 D2 D3 D7 ## [1,] 0.030626 0.032123 0.000000 0.00000 0.00000 ## [2,] 0.007151 0.011372 0.036091 0.00000 0.00000 ## [3,] 0.003905 0.005135 0.007729 0.05239 0.00000 ## [4,] 0.011862 0.014643 0.013594 0.01235 0.05387 30.7 Dimensionality reduction set.seed(100111) merged.grun &lt;- runTSNE(merged.grun, dimred=&quot;corrected&quot;) 30.8 Clustering snn.gr &lt;- buildSNNGraph(merged.grun, use.dimred=&quot;corrected&quot;) colLabels(merged.grun) &lt;- factor(igraph::cluster_walktrap(snn.gr)$membership) table(Cluster=colLabels(merged.grun), Donor=merged.grun$batch) ## Donor ## Cluster D10 D17 D2 D3 D7 ## 1 32 70 31 80 28 ## 2 14 34 3 2 67 ## 3 12 71 31 2 71 ## 4 5 4 2 4 2 ## 5 11 119 0 0 55 ## 6 2 8 3 3 6 ## 7 3 40 0 0 10 ## 8 1 9 0 0 7 ## 9 15 36 12 11 45 ## 10 5 13 0 0 10 ## 11 4 13 0 0 1 ## 12 5 17 0 1 33 gridExtra::grid.arrange( plotTSNE(merged.grun, colour_by=&quot;label&quot;), plotTSNE(merged.grun, colour_by=&quot;batch&quot;), ncol=2 ) Figure 30.4: Obligatory \\(t\\)-SNE plots of the Grun pancreas dataset. Each point represents a cell that is colored by cluster (left) or batch (right).
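For intuition about what fastMNN() is doing, the following base-R sketch finds mutual nearest neighbours (with k = 1) between two toy batches; cells paired this way are assumed to be of the same type, and batchelor estimates the batch correction from the differences between such pairs. This is a simplification of the real algorithm, which uses larger k, cosine normalization, and operates in PCA space.

```r
# A pair (i, j) is mutual if j is i's nearest neighbour in batch 2
# AND i is j's nearest neighbour in batch 1.
mnn.pairs <- function(b1, b2) {
  d <- as.matrix(dist(rbind(b1, b2)))
  d12 <- d[seq_len(nrow(b1)), nrow(b1) + seq_len(nrow(b2)), drop = FALSE]
  nn.of.1 <- apply(d12, 1, which.min)  # nearest batch-2 cell per batch-1 cell
  nn.of.2 <- apply(d12, 2, which.min)  # nearest batch-1 cell per batch-2 cell
  mutual <- nn.of.2[nn.of.1] == seq_len(nrow(b1))
  cbind(batch1 = which(mutual), batch2 = nn.of.1[mutual])
}

b1 <- rbind(c(0, 0), c(10, 10))  # two "cell types" in batch 1
b2 <- rbind(c(1, 1), c(11, 11))  # the same types, shifted, in batch 2
mnn.pairs(b1, b2)
##      batch1 batch2
## [1,]      1      1
## [2,]      2      2
```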
Session Info View session info R version 4.0.4 (2021-02-15) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.12-books/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.12-books/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] batchelor_1.6.2 scran_1.18.5 [3] scater_1.18.6 ggplot2_3.3.3 [5] org.Hs.eg.db_3.12.0 AnnotationDbi_1.52.0 [7] scRNAseq_2.4.0 SingleCellExperiment_1.12.0 [9] SummarizedExperiment_1.20.0 Biobase_2.50.0 [11] GenomicRanges_1.42.0 GenomeInfoDb_1.26.4 [13] IRanges_2.24.1 S4Vectors_0.28.1 [15] BiocGenerics_0.36.0 MatrixGenerics_1.2.1 [17] matrixStats_0.58.0 BiocStyle_2.18.1 [19] rebook_1.0.0 loaded via a namespace (and not attached): [1] AnnotationHub_2.22.0 BiocFileCache_1.14.0 [3] igraph_1.2.6 lazyeval_0.2.2 [5] BiocParallel_1.24.1 digest_0.6.27 [7] ensembldb_2.14.0 htmltools_0.5.1.1 [9] viridis_0.5.1 fansi_0.4.2 [11] magrittr_2.0.1 memoise_2.0.0 [13] limma_3.46.0 Biostrings_2.58.0 [15] askpass_1.1 prettyunits_1.1.1 [17] colorspace_2.0-0 blob_1.2.1 [19] rappdirs_0.3.3 xfun_0.22 [21] dplyr_1.0.5 callr_3.5.1 [23] crayon_1.4.1 RCurl_1.98-1.3 [25] jsonlite_1.7.2 graph_1.68.0 [27] glue_1.4.2 gtable_0.3.0 [29] zlibbioc_1.36.0 XVector_0.30.0 [31] DelayedArray_0.16.2 BiocSingular_1.6.0 [33] scales_1.1.1 edgeR_3.32.1 [35] DBI_1.1.1 Rcpp_1.0.6 [37] viridisLite_0.3.0 xtable_1.8-4 [39] progress_1.2.2 dqrng_0.2.1 [41] bit_4.0.4 rsvd_1.0.3 [43] ResidualMatrix_1.0.0 httr_1.4.2 [45] ellipsis_0.3.1 pkgconfig_2.0.3 [47] XML_3.99-0.6 farver_2.1.0 [49] scuttle_1.0.4 CodeDepends_0.6.5 [51] sass_0.3.1 dbplyr_2.1.0 [53] locfit_1.5-9.4 utf8_1.2.1 
[55] tidyselect_1.1.0 labeling_0.4.2 [57] rlang_0.4.10 later_1.1.0.1 [59] munsell_0.5.0 BiocVersion_3.12.0 [61] tools_4.0.4 cachem_1.0.4 [63] generics_0.1.0 RSQLite_2.2.4 [65] ExperimentHub_1.16.0 evaluate_0.14 [67] stringr_1.4.0 fastmap_1.1.0 [69] yaml_2.2.1 processx_3.4.5 [71] knitr_1.31 bit64_4.0.5 [73] purrr_0.3.4 AnnotationFilter_1.14.0 [75] sparseMatrixStats_1.2.1 mime_0.10 [77] xml2_1.3.2 biomaRt_2.46.3 [79] compiler_4.0.4 beeswarm_0.3.1 [81] curl_4.3 interactiveDisplayBase_1.28.0 [83] statmod_1.4.35 tibble_3.1.0 [85] bslib_0.2.4 stringi_1.5.3 [87] highr_0.8 ps_1.6.0 [89] GenomicFeatures_1.42.2 lattice_0.20-41 [91] bluster_1.0.0 ProtGenerics_1.22.0 [93] Matrix_1.3-2 vctrs_0.3.6 [95] pillar_1.5.1 lifecycle_1.0.0 [97] BiocManager_1.30.10 jquerylib_0.1.3 [99] BiocNeighbors_1.8.2 cowplot_1.1.1 [101] bitops_1.0-6 irlba_2.3.3 [103] httpuv_1.5.5 rtracklayer_1.50.0 [105] R6_2.5.0 bookdown_0.21 [107] promises_1.2.0.1 gridExtra_2.3 [109] vipor_0.4.5 codetools_0.2-18 [111] assertthat_0.2.1 openssl_1.4.3 [113] withr_2.4.1 GenomicAlignments_1.26.0 [115] Rsamtools_2.6.0 GenomeInfoDbData_1.2.4 [117] hms_1.0.0 grid_4.0.4 [119] beachmat_2.6.4 rmarkdown_2.7 [121] DelayedMatrixStats_1.12.3 Rtsne_0.15 [123] shiny_1.6.0 ggbeeswarm_0.6.0 Bibliography "],["muraro-human-pancreas-cel-seq.html", "Chapter 31 Muraro human pancreas (CEL-seq) 31.1 Introduction 31.2 Data loading 31.3 Quality control 31.4 Normalization 31.5 Variance modelling 31.6 Data integration 31.7 Dimensionality reduction 31.8 Clustering Session Info", " Chapter 31 Muraro human pancreas (CEL-seq) .rebook-collapse { background-color: #eee; color: #444; cursor: pointer; padding: 18px; width: 100%; border: none; text-align: left; outline: none; font-size: 15px; } .rebook-content { padding: 0 18px; display: none; overflow: hidden; background-color: #f1f1f1; } 31.1 Introduction This performs an analysis of the Muraro et al. (2016) CEL-seq dataset, consisting of human pancreas cells from various donors. 
31.2 Data loading library(scRNAseq) sce.muraro &lt;- MuraroPancreasData() Converting back to Ensembl identifiers. library(AnnotationHub) edb &lt;- AnnotationHub()[[&quot;AH73881&quot;]] gene.symb &lt;- sub(&quot;__chr.*$&quot;, &quot;&quot;, rownames(sce.muraro)) gene.ids &lt;- mapIds(edb, keys=gene.symb, keytype=&quot;SYMBOL&quot;, column=&quot;GENEID&quot;) # Removing duplicated genes or genes without Ensembl IDs. keep &lt;- !is.na(gene.ids) &amp; !duplicated(gene.ids) sce.muraro &lt;- sce.muraro[keep,] rownames(sce.muraro) &lt;- gene.ids[keep] 31.3 Quality control unfiltered &lt;- sce.muraro This dataset lacks mitochondrial genes so we will do without. For the one batch that seems to have a high proportion of low-quality cells, we compute an appropriate filter threshold using a shared median and MAD from the other batches (Figure 31.1). library(scater) stats &lt;- perCellQCMetrics(sce.muraro) qc &lt;- quickPerCellQC(stats, percent_subsets=&quot;altexps_ERCC_percent&quot;, batch=sce.muraro$donor, subset=sce.muraro$donor!=&quot;D28&quot;) sce.muraro &lt;- sce.muraro[,!qc$discard] colData(unfiltered) &lt;- cbind(colData(unfiltered), stats) unfiltered$discard &lt;- qc$discard gridExtra::grid.arrange( plotColData(unfiltered, x=&quot;donor&quot;, y=&quot;sum&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Total count&quot;), plotColData(unfiltered, x=&quot;donor&quot;, y=&quot;detected&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Detected features&quot;), plotColData(unfiltered, x=&quot;donor&quot;, y=&quot;altexps_ERCC_percent&quot;, colour_by=&quot;discard&quot;) + ggtitle(&quot;ERCC percent&quot;), ncol=2 ) Figure 31.1: Distribution of each QC metric across cells from each donor in the Muraro pancreas dataset. Each point represents a cell and is colored according to whether that cell was discarded. 
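The thresholding rule that quickPerCellQC() applies (via isOutlier()) flags cells more than 3 median absolute deviations (MADs) from the median of each metric, on the log scale for count-based metrics; with batch= and subset=, the median and MAD are estimated only from the specified batches and then applied everywhere. A minimal base R sketch of that rule on simulated totals (toy numbers, not the actual scater implementation):

```r
# Sketch of the outlier rule underlying quickPerCellQC()/isOutlier():
# flag cells more than 3 MADs below the median log-total count.
# Simulated totals stand in for the real per-cell QC metrics.
set.seed(100)
totals <- c(rpois(95, lambda = 5000), rpois(5, lambda = 50)) # last 5 are low-quality

log.totals <- log10(totals)
lower.limit <- median(log.totals) - 3 * mad(log.totals)
discard <- log.totals < lower.limit

sum(discard) # the low-quality cells fall well below the threshold
```

Using the median and MAD rather than the mean and standard deviation keeps the threshold robust to the very outliers being detected.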
We have a look at the causes of removal: colSums(as.matrix(qc)) ## low_lib_size low_n_features high_altexps_ERCC_percent ## 663 700 738 ## discard ## 773 31.4 Normalization library(scran) set.seed(1000) clusters &lt;- quickCluster(sce.muraro) sce.muraro &lt;- computeSumFactors(sce.muraro, clusters=clusters) sce.muraro &lt;- logNormCounts(sce.muraro) summary(sizeFactors(sce.muraro)) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.088 0.541 0.821 1.000 1.211 13.987 plot(librarySizeFactors(sce.muraro), sizeFactors(sce.muraro), pch=16, xlab=&quot;Library size factors&quot;, ylab=&quot;Deconvolution factors&quot;, log=&quot;xy&quot;) Figure 31.2: Relationship between the library size factors and the deconvolution size factors in the Muraro pancreas dataset. 31.5 Variance modelling We block on a combined plate and donor factor. block &lt;- paste0(sce.muraro$plate, &quot;_&quot;, sce.muraro$donor) dec.muraro &lt;- modelGeneVarWithSpikes(sce.muraro, &quot;ERCC&quot;, block=block) top.muraro &lt;- getTopHVGs(dec.muraro, prop=0.1) par(mfrow=c(8,4)) blocked.stats &lt;- dec.muraro$per.block for (i in colnames(blocked.stats)) { current &lt;- blocked.stats[[i]] plot(current$mean, current$total, main=i, pch=16, cex=0.5, xlab=&quot;Mean of log-expression&quot;, ylab=&quot;Variance of log-expression&quot;) curfit &lt;- metadata(current) points(curfit$mean, curfit$var, col=&quot;red&quot;, pch=16) curve(curfit$trend(x), col=&#39;dodgerblue&#39;, add=TRUE, lwd=2) } Figure 31.3: Per-gene variance as a function of the mean for the log-expression values in the Muraro pancreas dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to the spike-in transcripts (red) separately for each donor. 
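The library size factors on the x-axis of Figure 31.2 are, up to centering, just the per-cell total counts; deconvolution (computeSumFactors()) refines these to protect against composition biases. A base R sketch of the library-size baseline on a toy matrix (illustrative names only):

```r
# Library size factors, the baseline that deconvolution improves upon:
# each cell's total count, scaled so the factors average to unity.
# Toy count matrix (genes x cells) stands in for counts(sce.muraro).
set.seed(7)
counts <- matrix(rpois(50, lambda = 10), nrow = 10, ncol = 5)

lib.sizes <- colSums(counts)
size.factors <- lib.sizes / mean(lib.sizes)

round(size.factors, 3)
```

Centering at unity is why the Mean column of summary(sizeFactors(...)) is always 1.000.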
31.6 Data integration library(batchelor) set.seed(1001010) merged.muraro &lt;- fastMNN(sce.muraro, subset.row=top.muraro, batch=sce.muraro$donor) We use the proportion of variance lost as a diagnostic measure: metadata(merged.muraro)$merge.info$lost.var ## D28 D29 D30 D31 ## [1,] 0.060847 0.024121 0.000000 0.00000 ## [2,] 0.002646 0.003018 0.062421 0.00000 ## [3,] 0.003449 0.002641 0.002598 0.08162 31.7 Dimensionality reduction set.seed(100111) merged.muraro &lt;- runTSNE(merged.muraro, dimred=&quot;corrected&quot;) 31.8 Clustering snn.gr &lt;- buildSNNGraph(merged.muraro, use.dimred=&quot;corrected&quot;) colLabels(merged.muraro) &lt;- factor(igraph::cluster_walktrap(snn.gr)$membership) tab &lt;- table(Cluster=colLabels(merged.muraro), CellType=sce.muraro$label) library(pheatmap) pheatmap(log10(tab+10), color=viridis::viridis(100)) Figure 31.4: Heatmap of the frequency of cells from each cell type label in each cluster. table(Cluster=colLabels(merged.muraro), Donor=merged.muraro$batch) ## Donor ## Cluster D28 D29 D30 D31 ## 1 104 6 57 112 ## 2 59 21 77 97 ## 3 12 75 64 43 ## 4 28 149 126 120 ## 5 87 261 277 214 ## 6 21 7 54 26 ## 7 1 6 6 37 ## 8 6 6 5 2 ## 9 11 68 5 30 ## 10 4 2 5 8 gridExtra::grid.arrange( plotTSNE(merged.muraro, colour_by=&quot;label&quot;), plotTSNE(merged.muraro, colour_by=&quot;batch&quot;), ncol=2 ) Figure 31.5: Obligatory \\(t\\)-SNE plots of the Muraro pancreas dataset. Each point represents a cell that is colored by cluster (left) or batch (right). 
Session Info View session info R version 4.0.4 (2021-02-15) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.12-books/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.12-books/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] pheatmap_1.0.12 batchelor_1.6.2 [3] scran_1.18.5 scater_1.18.6 [5] ggplot2_3.3.3 ensembldb_2.14.0 [7] AnnotationFilter_1.14.0 GenomicFeatures_1.42.2 [9] AnnotationDbi_1.52.0 AnnotationHub_2.22.0 [11] BiocFileCache_1.14.0 dbplyr_2.1.0 [13] scRNAseq_2.4.0 SingleCellExperiment_1.12.0 [15] SummarizedExperiment_1.20.0 Biobase_2.50.0 [17] GenomicRanges_1.42.0 GenomeInfoDb_1.26.4 [19] IRanges_2.24.1 S4Vectors_0.28.1 [21] BiocGenerics_0.36.0 MatrixGenerics_1.2.1 [23] matrixStats_0.58.0 BiocStyle_2.18.1 [25] rebook_1.0.0 loaded via a namespace (and not attached): [1] igraph_1.2.6 lazyeval_0.2.2 [3] BiocParallel_1.24.1 digest_0.6.27 [5] htmltools_0.5.1.1 viridis_0.5.1 [7] fansi_0.4.2 magrittr_2.0.1 [9] memoise_2.0.0 limma_3.46.0 [11] Biostrings_2.58.0 askpass_1.1 [13] prettyunits_1.1.1 colorspace_2.0-0 [15] blob_1.2.1 rappdirs_0.3.3 [17] xfun_0.22 dplyr_1.0.5 [19] callr_3.5.1 crayon_1.4.1 [21] RCurl_1.98-1.3 jsonlite_1.7.2 [23] graph_1.68.0 glue_1.4.2 [25] gtable_0.3.0 zlibbioc_1.36.0 [27] XVector_0.30.0 DelayedArray_0.16.2 [29] BiocSingular_1.6.0 scales_1.1.1 [31] edgeR_3.32.1 DBI_1.1.1 [33] Rcpp_1.0.6 viridisLite_0.3.0 [35] xtable_1.8-4 progress_1.2.2 [37] dqrng_0.2.1 bit_4.0.4 [39] rsvd_1.0.3 ResidualMatrix_1.0.0 [41] httr_1.4.2 RColorBrewer_1.1-2 [43] ellipsis_0.3.1 pkgconfig_2.0.3 [45] XML_3.99-0.6 farver_2.1.0 [47] 
scuttle_1.0.4 CodeDepends_0.6.5 [49] sass_0.3.1 locfit_1.5-9.4 [51] utf8_1.2.1 tidyselect_1.1.0 [53] labeling_0.4.2 rlang_0.4.10 [55] later_1.1.0.1 munsell_0.5.0 [57] BiocVersion_3.12.0 tools_4.0.4 [59] cachem_1.0.4 generics_0.1.0 [61] RSQLite_2.2.4 ExperimentHub_1.16.0 [63] evaluate_0.14 stringr_1.4.0 [65] fastmap_1.1.0 yaml_2.2.1 [67] processx_3.4.5 knitr_1.31 [69] bit64_4.0.5 purrr_0.3.4 [71] sparseMatrixStats_1.2.1 mime_0.10 [73] xml2_1.3.2 biomaRt_2.46.3 [75] compiler_4.0.4 beeswarm_0.3.1 [77] curl_4.3 interactiveDisplayBase_1.28.0 [79] statmod_1.4.35 tibble_3.1.0 [81] bslib_0.2.4 stringi_1.5.3 [83] highr_0.8 ps_1.6.0 [85] lattice_0.20-41 bluster_1.0.0 [87] ProtGenerics_1.22.0 Matrix_1.3-2 [89] vctrs_0.3.6 pillar_1.5.1 [91] lifecycle_1.0.0 BiocManager_1.30.10 [93] jquerylib_0.1.3 BiocNeighbors_1.8.2 [95] cowplot_1.1.1 bitops_1.0-6 [97] irlba_2.3.3 httpuv_1.5.5 [99] rtracklayer_1.50.0 R6_2.5.0 [101] bookdown_0.21 promises_1.2.0.1 [103] gridExtra_2.3 vipor_0.4.5 [105] codetools_0.2-18 assertthat_0.2.1 [107] openssl_1.4.3 withr_2.4.1 [109] GenomicAlignments_1.26.0 Rsamtools_2.6.0 [111] GenomeInfoDbData_1.2.4 hms_1.0.0 [113] grid_4.0.4 beachmat_2.6.4 [115] rmarkdown_2.7 DelayedMatrixStats_1.12.3 [117] Rtsne_0.15 shiny_1.6.0 [119] ggbeeswarm_0.6.0 Bibliography "],["lawlor-human-pancreas-smarter.html", "Chapter 32 Lawlor human pancreas (SMARTer) 32.1 Introduction 32.2 Data loading 32.3 Quality control 32.4 Normalization 32.5 Variance modelling 32.6 Dimensionality reduction 32.7 Clustering Session Info", " Chapter 32 Lawlor human pancreas (SMARTer) .rebook-collapse { background-color: #eee; color: #444; cursor: pointer; padding: 18px; width: 100%; border: none; text-align: left; outline: none; font-size: 15px; } .rebook-content { padding: 0 18px; display: none; overflow: hidden; background-color: #f1f1f1; } 32.1 Introduction This performs an analysis of the Lawlor et al. (2017) dataset, consisting of human pancreas cells from various donors. 
32.2 Data loading library(scRNAseq) sce.lawlor &lt;- LawlorPancreasData() library(AnnotationHub) edb &lt;- AnnotationHub()[[&quot;AH73881&quot;]] anno &lt;- select(edb, keys=rownames(sce.lawlor), keytype=&quot;GENEID&quot;, columns=c(&quot;SYMBOL&quot;, &quot;SEQNAME&quot;)) rowData(sce.lawlor) &lt;- anno[match(rownames(sce.lawlor), anno[,1]),-1] 32.3 Quality control unfiltered &lt;- sce.lawlor library(scater) stats &lt;- perCellQCMetrics(sce.lawlor, subsets=list(Mito=which(rowData(sce.lawlor)$SEQNAME==&quot;MT&quot;))) qc &lt;- quickPerCellQC(stats, percent_subsets=&quot;subsets_Mito_percent&quot;, batch=sce.lawlor$`islet unos id`) sce.lawlor &lt;- sce.lawlor[,!qc$discard] colData(unfiltered) &lt;- cbind(colData(unfiltered), stats) unfiltered$discard &lt;- qc$discard gridExtra::grid.arrange( plotColData(unfiltered, x=&quot;islet unos id&quot;, y=&quot;sum&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Total count&quot;) + theme(axis.text.x = element_text(angle = 90)), plotColData(unfiltered, x=&quot;islet unos id&quot;, y=&quot;detected&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Detected features&quot;) + theme(axis.text.x = element_text(angle = 90)), plotColData(unfiltered, x=&quot;islet unos id&quot;, y=&quot;subsets_Mito_percent&quot;, colour_by=&quot;discard&quot;) + ggtitle(&quot;Mito percent&quot;) + theme(axis.text.x = element_text(angle = 90)), ncol=2 ) Figure 32.1: Distribution of each QC metric across cells from each donor of the Lawlor pancreas dataset. Each point represents a cell and is colored according to whether that cell was discarded. plotColData(unfiltered, x=&quot;sum&quot;, y=&quot;subsets_Mito_percent&quot;, colour_by=&quot;discard&quot;) + scale_x_log10() Figure 32.2: Percentage of mitochondrial reads in each cell in the Lawlor pancreas dataset compared to the total count. Each point represents a cell and is colored according to whether that cell was discarded. 
colSums(as.matrix(qc)) ## low_lib_size low_n_features high_subsets_Mito_percent ## 9 5 25 ## discard ## 34 32.4 Normalization library(scran) set.seed(1000) clusters &lt;- quickCluster(sce.lawlor) sce.lawlor &lt;- computeSumFactors(sce.lawlor, clusters=clusters) sce.lawlor &lt;- logNormCounts(sce.lawlor) summary(sizeFactors(sce.lawlor)) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.295 0.781 0.963 1.000 1.182 2.629 plot(librarySizeFactors(sce.lawlor), sizeFactors(sce.lawlor), pch=16, xlab=&quot;Library size factors&quot;, ylab=&quot;Deconvolution factors&quot;, log=&quot;xy&quot;) Figure 32.3: Relationship between the library size factors and the deconvolution size factors in the Lawlor pancreas dataset. 32.5 Variance modelling We block on the donor of origin, using the `islet unos id` field. dec.lawlor &lt;- modelGeneVar(sce.lawlor, block=sce.lawlor$`islet unos id`) chosen.genes &lt;- getTopHVGs(dec.lawlor, n=2000) par(mfrow=c(4,2)) blocked.stats &lt;- dec.lawlor$per.block for (i in colnames(blocked.stats)) { current &lt;- blocked.stats[[i]] plot(current$mean, current$total, main=i, pch=16, cex=0.5, xlab=&quot;Mean of log-expression&quot;, ylab=&quot;Variance of log-expression&quot;) curfit &lt;- metadata(current) curve(curfit$trend(x), col=&#39;dodgerblue&#39;, add=TRUE, lwd=2) } Figure 32.4: Per-gene variance as a function of the mean for the log-expression values in the Lawlor pancreas dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted separately for each donor. 
32.6 Dimensionality reduction library(BiocSingular) set.seed(101011001) sce.lawlor &lt;- runPCA(sce.lawlor, subset_row=chosen.genes, ncomponents=25) sce.lawlor &lt;- runTSNE(sce.lawlor, dimred=&quot;PCA&quot;) 32.7 Clustering snn.gr &lt;- buildSNNGraph(sce.lawlor, use.dimred=&quot;PCA&quot;) colLabels(sce.lawlor) &lt;- factor(igraph::cluster_walktrap(snn.gr)$membership) table(colLabels(sce.lawlor), sce.lawlor$`cell type`) ## ## Acinar Alpha Beta Delta Ductal Gamma/PP None/Other Stellate ## 1 1 0 0 13 2 16 2 0 ## 2 0 1 76 1 0 0 0 0 ## 3 0 161 1 0 0 1 2 0 ## 4 0 1 0 1 0 0 5 19 ## 5 0 0 175 4 1 0 1 0 ## 6 22 0 0 0 0 0 0 0 ## 7 0 75 0 0 0 0 0 0 ## 8 0 0 0 1 20 0 2 0 table(colLabels(sce.lawlor), sce.lawlor$`islet unos id`) ## ## ACCG268 ACCR015A ACEK420A ACEL337 ACHY057 ACIB065 ACIW009 ACJV399 ## 1 8 2 2 4 4 4 9 1 ## 2 14 3 2 33 3 2 4 17 ## 3 36 23 14 13 14 14 21 30 ## 4 7 1 0 1 0 4 9 4 ## 5 34 10 4 39 7 23 24 40 ## 6 0 2 13 0 0 0 5 2 ## 7 32 12 0 5 6 7 4 9 ## 8 1 1 2 1 2 1 12 3 gridExtra::grid.arrange( plotTSNE(sce.lawlor, colour_by=&quot;label&quot;), plotTSNE(sce.lawlor, colour_by=&quot;islet unos id&quot;), ncol=2 ) Figure 32.5: Obligatory \\(t\\)-SNE plots of the Lawlor pancreas dataset. Each point represents a cell that is colored by cluster (left) or batch (right). 
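The choice of 25 PCs here (as in the other workflows) is somewhat arbitrary; one common alternative is to keep the smallest number of components that explains a target fraction of the variance. A self-contained base R sketch with prcomp() on toy data illustrates the idea; in the workflows themselves, scater's runPCA() is understood to store the equivalent percentages in attr(reducedDim(sce, "PCA"), "percentVar") (worth verifying against the scater documentation).

```r
# One way to justify the number of PCs: keep enough components to
# explain a target fraction of the variance. Toy cells-x-genes matrix;
# real analyses would use the log-normalized expression of the HVGs.
set.seed(101)
mat <- matrix(rnorm(200 * 50), nrow = 200, ncol = 50)

pcs <- prcomp(mat)
percent.var <- pcs$sdev^2 / sum(pcs$sdev^2) * 100
chosen <- which(cumsum(percent.var) >= 80)[1] # smallest d explaining 80%
chosen
```

On pure noise like this, variance is spread almost evenly across components, so many PCs are needed; real data with structure concentrates variance in the first few.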
Session Info View session info R version 4.0.4 (2021-02-15) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.12-books/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.12-books/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] BiocSingular_1.6.0 scran_1.18.5 [3] scater_1.18.6 ggplot2_3.3.3 [5] ensembldb_2.14.0 AnnotationFilter_1.14.0 [7] GenomicFeatures_1.42.2 AnnotationDbi_1.52.0 [9] AnnotationHub_2.22.0 BiocFileCache_1.14.0 [11] dbplyr_2.1.0 scRNAseq_2.4.0 [13] SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0 [15] Biobase_2.50.0 GenomicRanges_1.42.0 [17] GenomeInfoDb_1.26.4 IRanges_2.24.1 [19] S4Vectors_0.28.1 BiocGenerics_0.36.0 [21] MatrixGenerics_1.2.1 matrixStats_0.58.0 [23] BiocStyle_2.18.1 rebook_1.0.0 loaded via a namespace (and not attached): [1] igraph_1.2.6 lazyeval_0.2.2 [3] BiocParallel_1.24.1 digest_0.6.27 [5] htmltools_0.5.1.1 viridis_0.5.1 [7] fansi_0.4.2 magrittr_2.0.1 [9] memoise_2.0.0 limma_3.46.0 [11] Biostrings_2.58.0 askpass_1.1 [13] prettyunits_1.1.1 colorspace_2.0-0 [15] blob_1.2.1 rappdirs_0.3.3 [17] xfun_0.22 dplyr_1.0.5 [19] callr_3.5.1 crayon_1.4.1 [21] RCurl_1.98-1.3 jsonlite_1.7.2 [23] graph_1.68.0 glue_1.4.2 [25] gtable_0.3.0 zlibbioc_1.36.0 [27] XVector_0.30.0 DelayedArray_0.16.2 [29] scales_1.1.1 edgeR_3.32.1 [31] DBI_1.1.1 Rcpp_1.0.6 [33] viridisLite_0.3.0 xtable_1.8-4 [35] progress_1.2.2 dqrng_0.2.1 [37] bit_4.0.4 rsvd_1.0.3 [39] httr_1.4.2 ellipsis_0.3.1 [41] pkgconfig_2.0.3 XML_3.99-0.6 [43] farver_2.1.0 scuttle_1.0.4 [45] CodeDepends_0.6.5 sass_0.3.1 [47] locfit_1.5-9.4 utf8_1.2.1 [49] 
tidyselect_1.1.0 labeling_0.4.2 [51] rlang_0.4.10 later_1.1.0.1 [53] munsell_0.5.0 BiocVersion_3.12.0 [55] tools_4.0.4 cachem_1.0.4 [57] generics_0.1.0 RSQLite_2.2.4 [59] ExperimentHub_1.16.0 evaluate_0.14 [61] stringr_1.4.0 fastmap_1.1.0 [63] yaml_2.2.1 processx_3.4.5 [65] knitr_1.31 bit64_4.0.5 [67] purrr_0.3.4 sparseMatrixStats_1.2.1 [69] mime_0.10 xml2_1.3.2 [71] biomaRt_2.46.3 compiler_4.0.4 [73] beeswarm_0.3.1 curl_4.3 [75] interactiveDisplayBase_1.28.0 statmod_1.4.35 [77] tibble_3.1.0 bslib_0.2.4 [79] stringi_1.5.3 highr_0.8 [81] ps_1.6.0 lattice_0.20-41 [83] bluster_1.0.0 ProtGenerics_1.22.0 [85] Matrix_1.3-2 vctrs_0.3.6 [87] pillar_1.5.1 lifecycle_1.0.0 [89] BiocManager_1.30.10 jquerylib_0.1.3 [91] BiocNeighbors_1.8.2 cowplot_1.1.1 [93] bitops_1.0-6 irlba_2.3.3 [95] httpuv_1.5.5 rtracklayer_1.50.0 [97] R6_2.5.0 bookdown_0.21 [99] promises_1.2.0.1 gridExtra_2.3 [101] vipor_0.4.5 codetools_0.2-18 [103] assertthat_0.2.1 openssl_1.4.3 [105] withr_2.4.1 GenomicAlignments_1.26.0 [107] Rsamtools_2.6.0 GenomeInfoDbData_1.2.4 [109] hms_1.0.0 grid_4.0.4 [111] beachmat_2.6.4 rmarkdown_2.7 [113] DelayedMatrixStats_1.12.3 Rtsne_0.15 [115] shiny_1.6.0 ggbeeswarm_0.6.0 Bibliography "],["segerstolpe-human-pancreas-smart-seq2.html", "Chapter 33 Segerstolpe human pancreas (Smart-seq2) 33.1 Introduction 33.2 Data loading 33.3 Quality control 33.4 Normalization 33.5 Variance modelling 33.6 Dimensionality reduction 33.7 Clustering 33.8 Data integration 33.9 Multi-sample comparisons Session Info", " Chapter 33 Segerstolpe human pancreas (Smart-seq2) .rebook-collapse { background-color: #eee; color: #444; cursor: pointer; padding: 18px; width: 100%; border: none; text-align: left; outline: none; font-size: 15px; } .rebook-content { padding: 0 18px; display: none; overflow: hidden; background-color: #f1f1f1; } 33.1 Introduction This performs an analysis of the Segerstolpe et al. (2016) dataset, consisting of human pancreas cells from various donors. 
33.2 Data loading library(scRNAseq) sce.seger &lt;- SegerstolpePancreasData() library(AnnotationHub) edb &lt;- AnnotationHub()[[&quot;AH73881&quot;]] symbols &lt;- rowData(sce.seger)$symbol ens.id &lt;- mapIds(edb, keys=symbols, keytype=&quot;SYMBOL&quot;, column=&quot;GENEID&quot;) ens.id &lt;- ifelse(is.na(ens.id), symbols, ens.id) # Removing duplicated rows. keep &lt;- !duplicated(ens.id) sce.seger &lt;- sce.seger[keep,] rownames(sce.seger) &lt;- ens.id[keep] We simplify the names of some of the relevant column metadata fields for ease of access. Some editing of the cell type labels is necessary for consistency with other data sets. emtab.meta &lt;- colData(sce.seger)[,c(&quot;cell type&quot;, &quot;disease&quot;, &quot;individual&quot;, &quot;single cell well quality&quot;)] colnames(emtab.meta) &lt;- c(&quot;CellType&quot;, &quot;Disease&quot;, &quot;Donor&quot;, &quot;Quality&quot;) colData(sce.seger) &lt;- emtab.meta sce.seger$CellType &lt;- gsub(&quot; cell&quot;, &quot;&quot;, sce.seger$CellType) sce.seger$CellType &lt;- paste0( toupper(substr(sce.seger$CellType, 1, 1)), substring(sce.seger$CellType, 2)) 33.3 Quality control unfiltered &lt;- sce.seger We remove low quality cells that were marked by the authors. We then perform additional quality control as some of the remaining cells still have very low counts and numbers of detected features. For some batches that seem to have a majority of low-quality cells (Figure 33.1), we use the other batches to define an appropriate threshold via subset=. 
low.qual &lt;- sce.seger$Quality == &quot;low quality cell&quot; library(scater) stats &lt;- perCellQCMetrics(sce.seger) qc &lt;- quickPerCellQC(stats, percent_subsets=&quot;altexps_ERCC_percent&quot;, batch=sce.seger$Donor, subset=!sce.seger$Donor %in% c(&quot;HP1504901&quot;, &quot;HP1509101&quot;)) sce.seger &lt;- sce.seger[,!(qc$discard | low.qual)] colData(unfiltered) &lt;- cbind(colData(unfiltered), stats) unfiltered$discard &lt;- qc$discard gridExtra::grid.arrange( plotColData(unfiltered, x=&quot;Donor&quot;, y=&quot;sum&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Total count&quot;) + theme(axis.text.x = element_text(angle = 90)), plotColData(unfiltered, x=&quot;Donor&quot;, y=&quot;detected&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Detected features&quot;) + theme(axis.text.x = element_text(angle = 90)), plotColData(unfiltered, x=&quot;Donor&quot;, y=&quot;altexps_ERCC_percent&quot;, colour_by=&quot;discard&quot;) + ggtitle(&quot;ERCC percent&quot;) + theme(axis.text.x = element_text(angle = 90)), ncol=2 ) Figure 33.1: Distribution of each QC metric across cells from each donor of the Segerstolpe pancreas dataset. Each point represents a cell and is colored according to whether that cell was discarded. colSums(as.matrix(qc)) ## low_lib_size low_n_features high_altexps_ERCC_percent ## 788 1056 1031 ## discard ## 1246 33.4 Normalization We don’t normalize the spike-ins at this point as there are some cells with no spike-in counts. library(scran) clusters &lt;- quickCluster(sce.seger) sce.seger &lt;- computeSumFactors(sce.seger, clusters=clusters) sce.seger &lt;- logNormCounts(sce.seger) summary(sizeFactors(sce.seger)) ## Min. 1st Qu. Median Mean 3rd Qu. Max. 
## 0.014 0.390 0.708 1.000 1.332 11.182 plot(librarySizeFactors(sce.seger), sizeFactors(sce.seger), pch=16, xlab=&quot;Library size factors&quot;, ylab=&quot;Deconvolution factors&quot;, log=&quot;xy&quot;) Figure 33.2: Relationship between the library size factors and the deconvolution size factors in the Segerstolpe pancreas dataset. 33.5 Variance modelling We do not use cells with no spike-ins for variance modelling. Donor AZ also has very low spike-in counts and is subsequently ignored. for.hvg &lt;- sce.seger[,librarySizeFactors(altExp(sce.seger)) &gt; 0 &amp; sce.seger$Donor!=&quot;AZ&quot;] dec.seger &lt;- modelGeneVarWithSpikes(for.hvg, &quot;ERCC&quot;, block=for.hvg$Donor) chosen.hvgs &lt;- getTopHVGs(dec.seger, n=2000) par(mfrow=c(3,3)) blocked.stats &lt;- dec.seger$per.block for (i in colnames(blocked.stats)) { current &lt;- blocked.stats[[i]] plot(current$mean, current$total, main=i, pch=16, cex=0.5, xlab=&quot;Mean of log-expression&quot;, ylab=&quot;Variance of log-expression&quot;) curfit &lt;- metadata(current) points(curfit$mean, curfit$var, col=&quot;red&quot;, pch=16) curve(curfit$trend(x), col=&#39;dodgerblue&#39;, add=TRUE, lwd=2) } Figure 33.3: Per-gene variance as a function of the mean for the log-expression values in the Segerstolpe pancreas dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to the spike-in transcripts (red) separately for each donor. 33.6 Dimensionality reduction We pick the first 25 PCs for downstream analyses, as it’s a nice square number. library(BiocSingular) set.seed(101011001) sce.seger &lt;- runPCA(sce.seger, subset_row=chosen.hvgs, ncomponents=25) sce.seger &lt;- runTSNE(sce.seger, dimred=&quot;PCA&quot;) 33.7 Clustering library(bluster) clust.out &lt;- clusterRows(reducedDim(sce.seger, &quot;PCA&quot;), NNGraphParam(), full=TRUE) snn.gr &lt;- clust.out$objects$graph colLabels(sce.seger) &lt;- clust.out$clusters We see a strong donor effect in Figures 33.4 and 33.5. 
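The reason cells with no spike-in counts must be excluded before spike-in-based steps is mechanical: a cell with zero spike-in counts would receive a spike-in size factor of zero, which cannot be used to scale its counts. A tiny base R sketch (toy matrix, illustrative only):

```r
# Why cells with no spike-ins are dropped before spike-in normalization:
# their spike-in size factor is zero. Toy spike-ins x cells matrix.
spikes <- cbind(c(5, 3, 2), c(8, 1, 4), c(0, 0, 0)) # third cell: no spike-ins

spike.totals <- colSums(spikes)
spike.sf <- spike.totals / mean(spike.totals)
spike.sf # third factor is zero, so scaling that cell is undefined
```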
This might be due to differences in cell type composition between donors, but the more likely explanation is that of a technical difference in plate processing or uninteresting genotypic differences. The implication is that we should have called fastMNN() at some point. tab &lt;- table(Cluster=colLabels(sce.seger), Donor=sce.seger$Donor) library(pheatmap) pheatmap(log10(tab+10), color=viridis::viridis(100)) Figure 33.4: Heatmap of the frequency of cells from each donor in each cluster. gridExtra::grid.arrange( plotTSNE(sce.seger, colour_by=&quot;label&quot;), plotTSNE(sce.seger, colour_by=&quot;Donor&quot;), ncol=2 ) Figure 33.5: Obligatory \\(t\\)-SNE plots of the Segerstolpe pancreas dataset. Each point represents a cell that is colored by cluster (left) or batch (right). 33.8 Data integration We repeat the clustering after running fastMNN() on the donors. This yields a more coherent set of clusters in Figure 33.6 where each cluster contains contributions from all donors. library(batchelor) set.seed(10001010) corrected &lt;- fastMNN(sce.seger, batch=sce.seger$Donor, subset.row=chosen.hvgs) set.seed(10000001) corrected &lt;- runTSNE(corrected, dimred=&quot;corrected&quot;) colLabels(corrected) &lt;- clusterRows(reducedDim(corrected, &quot;corrected&quot;), NNGraphParam()) tab &lt;- table(Cluster=colLabels(corrected), Donor=corrected$batch) tab ## Donor ## Cluster AZ HP1502401 HP1504101T2D HP1504901 HP1506401 HP1507101 HP1508501T2D ## 1 3 19 3 11 67 8 78 ## 2 14 53 13 19 37 41 20 ## 3 2 2 1 1 44 1 1 ## 4 2 18 7 3 36 2 28 ## 5 29 114 140 72 26 136 121 ## 6 8 21 9 6 2 6 6 ## 7 1 1 1 9 0 1 2 ## 8 2 1 3 10 2 6 12 ## 9 4 20 70 8 16 2 8 ## Donor ## Cluster HP1509101 HP1525301T2D HP1526901T2D ## 1 27 124 46 ## 2 14 11 70 ## 3 0 1 4 ## 4 2 23 9 ## 5 49 85 96 ## 6 11 5 34 ## 7 2 2 1 ## 8 3 13 4 ## 9 1 10 34 gridExtra::grid.arrange( plotTSNE(corrected, colour_by=&quot;label&quot;), plotTSNE(corrected, colour_by=&quot;batch&quot;), ncol=2 ) Figure 33.6: Yet another 
\\(t\\)-SNE plot of the Segerstolpe dataset, this time after batch correction across donors. Each point represents a cell and is colored by the assigned cluster identity. 33.9 Multi-sample comparisons This particular dataset contains both healthy donors and those with type II diabetes. It is thus of some interest to identify genes that are differentially expressed upon disease in each cell type. To keep things simple, we use the author-provided annotation rather than determining the cell type for each of our clusters. summed &lt;- aggregateAcrossCells(sce.seger, ids=colData(sce.seger)[,c(&quot;Donor&quot;, &quot;CellType&quot;)]) summed ## class: SingleCellExperiment ## dim: 25454 105 ## metadata(0): ## assays(1): counts ## rownames(25454): ENSG00000118473 ENSG00000142920 ... ENSG00000278306 ## eGFP ## rowData names(2): symbol refseq ## colnames: NULL ## colData names(9): CellType Disease ... CellType ncells ## reducedDimNames(2): PCA TSNE ## altExpNames(1): ERCC Here, we will use the voom pipeline from the limma package instead of the QL approach with edgeR. This allows us to use sample weights to better account for the variation in the precision of each pseudo-bulk profile. We see that insulin is downregulated in beta cells in the disease state, which is sensible enough. 
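Conceptually, aggregateAcrossCells() forms these pseudo-bulk profiles by summing the counts over all cells sharing each (donor, cell type) combination. The base R sketch below reproduces the summation with rowsum() on a toy matrix; all names are made up, and the real call operates on the counts and colData of sce.seger.

```r
# Sketch of pseudo-bulk aggregation: sum counts over all cells sharing
# a (donor, cell type) combination. Toy genes x cells matrix.
set.seed(999)
counts <- matrix(rpois(40, lambda = 5), nrow = 4,
    dimnames = list(paste0("gene", 1:4), paste0("cell", 1:10)))
donor <- rep(c("D1", "D2"), each = 5)
celltype <- rep(c("Alpha", "Beta"), times = 5)

groups <- paste(donor, celltype, sep = ".")
# rowsum() sums rows, so transpose to sum over cells, then transpose back.
pseudobulk <- t(rowsum(t(counts), group = groups))
pseudobulk # genes x (donor.celltype) matrix of summed counts
```

Summing (rather than averaging) preserves the count nature of the data, which is what lets the profiles be fed into count-based tools like edgeR and voom.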
summed.beta &lt;- summed[,summed$CellType==&quot;Beta&quot;] library(edgeR) y.beta &lt;- DGEList(counts(summed.beta), samples=colData(summed.beta), genes=rowData(summed.beta)[,&quot;symbol&quot;,drop=FALSE]) y.beta &lt;- y.beta[filterByExpr(y.beta, group=y.beta$samples$Disease),] y.beta &lt;- calcNormFactors(y.beta) design &lt;- model.matrix(~Disease, y.beta$samples) v.beta &lt;- voomWithQualityWeights(y.beta, design) fit.beta &lt;- lmFit(v.beta) fit.beta &lt;- eBayes(fit.beta, robust=TRUE) res.beta &lt;- topTable(fit.beta, sort.by=&quot;p&quot;, n=Inf, coef=&quot;Diseasetype II diabetes mellitus&quot;) head(res.beta) ## symbol logFC AveExpr t P.Value adj.P.Val B ## ENSG00000254647 INS -2.728 16.680 -7.671 3.191e-06 0.03902 4.842 ## ENSG00000137731 FXYD2 -2.595 7.265 -6.705 1.344e-05 0.08219 3.353 ## ENSG00000169297 NR0B1 -2.092 6.790 -5.789 5.810e-05 0.09916 1.984 ## ENSG00000181029 TRAPPC5 -2.127 7.046 -5.678 7.007e-05 0.09916 1.877 ## ENSG00000105707 HPN -1.803 6.118 -5.654 7.298e-05 0.09916 1.740 ## LOC284889 LOC284889 -2.113 6.652 -5.515 9.259e-05 0.09916 1.571 We also create some diagnostic plots to check for potential problems in the analysis. The MA plots exhibit the expected shape (Figure 33.7) while the differences in the sample weights in Figure 33.8 justify the use of voom() in this context. par(mfrow=c(5, 2)) for (i in colnames(y.beta)) { plotMD(y.beta, column=i) } Figure 33.7: MA plots for the beta cell pseudo-bulk profiles. Each MA plot is generated by comparing the corresponding pseudo-bulk profile against the average of all other profiles # Easier to just re-run it with plot=TRUE than # to try to make the plot from &#39;v.beta&#39;. voomWithQualityWeights(y.beta, design, plot=TRUE) Figure 33.8: Diagnostic plots for voom after estimating observation and quality weights from the beta cell pseudo-bulk profiles. 
The left plot shows the mean-variance trend used to estimate the observation weights, while the right plot shows the per-sample quality weights. Session Info View session info R version 4.0.4 (2021-02-15) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.12-books/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.12-books/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] edgeR_3.32.1 limma_3.46.0 [3] batchelor_1.6.2 pheatmap_1.0.12 [5] bluster_1.0.0 BiocSingular_1.6.0 [7] scran_1.18.5 scater_1.18.6 [9] ggplot2_3.3.3 ensembldb_2.14.0 [11] AnnotationFilter_1.14.0 GenomicFeatures_1.42.2 [13] AnnotationDbi_1.52.0 AnnotationHub_2.22.0 [15] BiocFileCache_1.14.0 dbplyr_2.1.0 [17] scRNAseq_2.4.0 SingleCellExperiment_1.12.0 [19] SummarizedExperiment_1.20.0 Biobase_2.50.0 [21] GenomicRanges_1.42.0 GenomeInfoDb_1.26.4 [23] IRanges_2.24.1 S4Vectors_0.28.1 [25] BiocGenerics_0.36.0 MatrixGenerics_1.2.1 [27] matrixStats_0.58.0 BiocStyle_2.18.1 [29] rebook_1.0.0 loaded via a namespace (and not attached): [1] igraph_1.2.6 lazyeval_0.2.2 [3] BiocParallel_1.24.1 digest_0.6.27 [5] htmltools_0.5.1.1 viridis_0.5.1 [7] fansi_0.4.2 magrittr_2.0.1 [9] memoise_2.0.0 Biostrings_2.58.0 [11] askpass_1.1 prettyunits_1.1.1 [13] colorspace_2.0-0 blob_1.2.1 [15] rappdirs_0.3.3 xfun_0.22 [17] dplyr_1.0.5 callr_3.5.1 [19] crayon_1.4.1 RCurl_1.98-1.3 [21] jsonlite_1.7.2 graph_1.68.0 [23] glue_1.4.2 gtable_0.3.0 [25] zlibbioc_1.36.0 XVector_0.30.0 [27] DelayedArray_0.16.2 scales_1.1.1 [29] DBI_1.1.1 Rcpp_1.0.6 [31] viridisLite_0.3.0 xtable_1.8-4 [33] progress_1.2.2 dqrng_0.2.1 [35] 
bit_4.0.4 rsvd_1.0.3 [37] ResidualMatrix_1.0.0 httr_1.4.2 [39] RColorBrewer_1.1-2 ellipsis_0.3.1 [41] pkgconfig_2.0.3 XML_3.99-0.6 [43] farver_2.1.0 scuttle_1.0.4 [45] CodeDepends_0.6.5 sass_0.3.1 [47] locfit_1.5-9.4 utf8_1.2.1 [49] tidyselect_1.1.0 labeling_0.4.2 [51] rlang_0.4.10 later_1.1.0.1 [53] munsell_0.5.0 BiocVersion_3.12.0 [55] tools_4.0.4 cachem_1.0.4 [57] generics_0.1.0 RSQLite_2.2.4 [59] ExperimentHub_1.16.0 evaluate_0.14 [61] stringr_1.4.0 fastmap_1.1.0 [63] yaml_2.2.1 processx_3.4.5 [65] knitr_1.31 bit64_4.0.5 [67] purrr_0.3.4 sparseMatrixStats_1.2.1 [69] mime_0.10 xml2_1.3.2 [71] biomaRt_2.46.3 compiler_4.0.4 [73] beeswarm_0.3.1 curl_4.3 [75] interactiveDisplayBase_1.28.0 statmod_1.4.35 [77] tibble_3.1.0 bslib_0.2.4 [79] stringi_1.5.3 highr_0.8 [81] ps_1.6.0 lattice_0.20-41 [83] ProtGenerics_1.22.0 Matrix_1.3-2 [85] vctrs_0.3.6 pillar_1.5.1 [87] lifecycle_1.0.0 BiocManager_1.30.10 [89] jquerylib_0.1.3 BiocNeighbors_1.8.2 [91] cowplot_1.1.1 bitops_1.0-6 [93] irlba_2.3.3 httpuv_1.5.5 [95] rtracklayer_1.50.0 R6_2.5.0 [97] bookdown_0.21 promises_1.2.0.1 [99] gridExtra_2.3 vipor_0.4.5 [101] codetools_0.2-18 assertthat_0.2.1 [103] openssl_1.4.3 withr_2.4.1 [105] GenomicAlignments_1.26.0 Rsamtools_2.6.0 [107] GenomeInfoDbData_1.2.4 hms_1.0.0 [109] grid_4.0.4 beachmat_2.6.4 [111] rmarkdown_2.7 DelayedMatrixStats_1.12.3 [113] Rtsne_0.15 shiny_1.6.0 [115] ggbeeswarm_0.6.0 Bibliography "],["merged-pancreas.html", "Chapter 34 Merged human pancreas datasets 34.1 Introduction 34.2 The good 34.3 The bad 34.4 The ugly Session Info", " Chapter 34 Merged human pancreas datasets 34.1 Introduction For a period in 2016, there was a great deal of interest in using scRNA-seq to profile the human pancreas at
cellular resolution (Muraro et al. 2016; Grun et al. 2016; Lawlor et al. 2017; Segerstolpe et al. 2016). As a consequence, we have a surplus of human pancreas datasets generated by different authors with different technologies, which provides an ideal use case for demonstrating more complex data integration strategies. This represents a more challenging application than the PBMC dataset in Chapter 13 as it involves different sequencing protocols and different patients, most likely with differences in cell type composition. 34.2 The good We start by considering only two datasets from Muraro et al. (2016) and Grun et al. (2016). This is a relatively simple scenario involving very similar protocols (CEL-seq and CEL-seq2) and a similar set of authors. View history #--- loading ---# library(scRNAseq) sce.grun &lt;- GrunPancreasData() #--- gene-annotation ---# library(org.Hs.eg.db) gene.ids &lt;- mapIds(org.Hs.eg.db, keys=rowData(sce.grun)$symbol, keytype=&quot;SYMBOL&quot;, column=&quot;ENSEMBL&quot;) keep &lt;- !is.na(gene.ids) &amp; !duplicated(gene.ids) sce.grun &lt;- sce.grun[keep,] rownames(sce.grun) &lt;- gene.ids[keep] #--- quality-control ---# library(scater) stats &lt;- perCellQCMetrics(sce.grun) qc &lt;- quickPerCellQC(stats, percent_subsets=&quot;altexps_ERCC_percent&quot;, batch=sce.grun$donor, subset=sce.grun$donor %in% c(&quot;D17&quot;, &quot;D7&quot;, &quot;D2&quot;)) sce.grun &lt;- sce.grun[,!qc$discard] #--- normalization ---# library(scran) set.seed(1000) # for irlba. 
clusters &lt;- quickCluster(sce.grun) sce.grun &lt;- computeSumFactors(sce.grun, clusters=clusters) sce.grun &lt;- logNormCounts(sce.grun) #--- variance-modelling ---# block &lt;- paste0(sce.grun$sample, &quot;_&quot;, sce.grun$donor) dec.grun &lt;- modelGeneVarWithSpikes(sce.grun, spikes=&quot;ERCC&quot;, block=block) top.grun &lt;- getTopHVGs(dec.grun, prop=0.1) sce.grun ## class: SingleCellExperiment ## dim: 17474 1063 ## metadata(0): ## assays(2): counts logcounts ## rownames(17474): ENSG00000268895 ENSG00000121410 ... ENSG00000074755 ## ENSG00000036549 ## rowData names(2): symbol chr ## colnames(1063): D2ex_1 D2ex_2 ... D17TGFB_94 D17TGFB_95 ## colData names(3): donor sample sizeFactor ## reducedDimNames(0): ## altExpNames(1): ERCC View history #--- loading ---# library(scRNAseq) sce.muraro &lt;- MuraroPancreasData() #--- gene-annotation ---# library(AnnotationHub) edb &lt;- AnnotationHub()[[&quot;AH73881&quot;]] gene.symb &lt;- sub(&quot;__chr.*$&quot;, &quot;&quot;, rownames(sce.muraro)) gene.ids &lt;- mapIds(edb, keys=gene.symb, keytype=&quot;SYMBOL&quot;, column=&quot;GENEID&quot;) # Removing duplicated genes or genes without Ensembl IDs. 
keep &lt;- !is.na(gene.ids) &amp; !duplicated(gene.ids) sce.muraro &lt;- sce.muraro[keep,] rownames(sce.muraro) &lt;- gene.ids[keep] #--- quality-control ---# library(scater) stats &lt;- perCellQCMetrics(sce.muraro) qc &lt;- quickPerCellQC(stats, percent_subsets=&quot;altexps_ERCC_percent&quot;, batch=sce.muraro$donor, subset=sce.muraro$donor!=&quot;D28&quot;) sce.muraro &lt;- sce.muraro[,!qc$discard] #--- normalization ---# library(scran) set.seed(1000) clusters &lt;- quickCluster(sce.muraro) sce.muraro &lt;- computeSumFactors(sce.muraro, clusters=clusters) sce.muraro &lt;- logNormCounts(sce.muraro) #--- variance-modelling ---# block &lt;- paste0(sce.muraro$plate, &quot;_&quot;, sce.muraro$donor) dec.muraro &lt;- modelGeneVarWithSpikes(sce.muraro, &quot;ERCC&quot;, block=block) top.muraro &lt;- getTopHVGs(dec.muraro, prop=0.1) sce.muraro ## class: SingleCellExperiment ## dim: 16940 2299 ## metadata(0): ## assays(2): counts logcounts ## rownames(16940): ENSG00000268895 ENSG00000121410 ... ENSG00000159840 ## ENSG00000074755 ## rowData names(2): symbol chr ## colnames(2299): D28-1_1 D28-1_2 ... D30-8_93 D30-8_94 ## colData names(4): label donor plate sizeFactor ## reducedDimNames(0): ## altExpNames(1): ERCC We subset both batches to their common universe of genes; adjust their scaling to equalize sequencing coverage (not really necessary in this case, as the coverage is already similar, but we will do so anyway for consistency); and select those genes with positive average biological components for further use. 
universe &lt;- intersect(rownames(sce.grun), rownames(sce.muraro)) sce.grun2 &lt;- sce.grun[universe,] dec.grun2 &lt;- dec.grun[universe,] sce.muraro2 &lt;- sce.muraro[universe,] dec.muraro2 &lt;- dec.muraro[universe,] library(batchelor) normed.pancreas &lt;- multiBatchNorm(sce.grun2, sce.muraro2) sce.grun2 &lt;- normed.pancreas[[1]] sce.muraro2 &lt;- normed.pancreas[[2]] library(scran) combined.pan &lt;- combineVar(dec.grun2, dec.muraro2) chosen.genes &lt;- rownames(combined.pan)[combined.pan$bio &gt; 0] We observe that rescaleBatches() is unable to align cells from different batches in Figure 34.1. This is attributable to differences in population composition between batches, with additional complications from non-linearities in the batch effect, e.g., when the magnitude or direction of the batch effect differs between cell types. library(scater) rescaled.pancreas &lt;- rescaleBatches(sce.grun2, sce.muraro2) set.seed(100101) rescaled.pancreas &lt;- runPCA(rescaled.pancreas, subset_row=chosen.genes, exprs_values=&quot;corrected&quot;) rescaled.pancreas &lt;- runTSNE(rescaled.pancreas, dimred=&quot;PCA&quot;) plotTSNE(rescaled.pancreas, colour_by=&quot;batch&quot;) Figure 34.1: \\(t\\)-SNE plot of the two pancreas datasets after correction with rescaleBatches(). Each point represents a cell and is colored according to the batch of origin. Here, we use fastMNN() to merge together the two human pancreas datasets described earlier. Clustering on the merged datasets yields fewer batch-specific clusters, which is recapitulated as greater intermingling between batches in Figure 34.2. This improvement over Figure 34.1 represents the ability of fastMNN() to adapt to more complex situations involving differences in population composition between batches. 
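The failure of a uniform correction can be made concrete with toy numbers: if the magnitude of the batch effect differs between cell types, no single per-batch offset can remove it everywhere at once. The following base-R sketch uses made-up expression values purely for illustration; it is not part of the workflow.

```r
# Toy illustration (made-up numbers): two cell types whose batch effect
# differs in magnitude. A single per-batch offset, as applied by a
# rescaling-style correction, cannot remove both shifts at once.
batch1 <- c(typeA = 1, typeB = 5)
batch2 <- c(typeA = 3, typeB = 11)      # typeA shifted by +2, typeB by +6
offset <- mean(batch2) - mean(batch1)   # one global correction: +4
residual <- (batch2 - offset) - batch1  # leftover batch effect per cell type
residual
## typeA typeB 
##    -2     2
```

An MNN-based method instead estimates correction vectors locally from matched pairs of cells, which is why fastMNN() copes better with batch effects that vary across cell types.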
set.seed(1011011) mnn.pancreas &lt;- fastMNN(sce.grun2, sce.muraro2, subset.row=chosen.genes) snn.gr &lt;- buildSNNGraph(mnn.pancreas, use.dimred=&quot;corrected&quot;) clusters &lt;- igraph::cluster_walktrap(snn.gr)$membership tab &lt;- table(Cluster=clusters, Batch=mnn.pancreas$batch) tab ## Batch ## Cluster 1 2 ## 1 243 280 ## 2 315 255 ## 3 203 843 ## 4 56 194 ## 5 24 108 ## 6 116 398 ## 7 27 0 ## 8 56 72 ## 9 18 113 ## 10 0 17 ## 11 5 19 mnn.pancreas &lt;- runTSNE(mnn.pancreas, dimred=&quot;corrected&quot;) plotTSNE(mnn.pancreas, colour_by=&quot;batch&quot;) Figure 34.2: \\(t\\)-SNE plot of the two pancreas datasets after correction with fastMNN(). Each point represents a cell and is colored according to the batch of origin. 34.3 The bad Flushed with our previous success, we now attempt to merge the other datasets from Lawlor et al. (2017) and Segerstolpe et al. (2016). This is a more challenging task as it involves different technologies, mixtures of UMI and read count data and a more diverse set of authors (presumably with greater differences in the patient population). 
View history #--- loading ---# library(scRNAseq) sce.lawlor &lt;- LawlorPancreasData() #--- gene-annotation ---# library(AnnotationHub) edb &lt;- AnnotationHub()[[&quot;AH73881&quot;]] anno &lt;- select(edb, keys=rownames(sce.lawlor), keytype=&quot;GENEID&quot;, columns=c(&quot;SYMBOL&quot;, &quot;SEQNAME&quot;)) rowData(sce.lawlor) &lt;- anno[match(rownames(sce.lawlor), anno[,1]),-1] #--- quality-control ---# library(scater) stats &lt;- perCellQCMetrics(sce.lawlor, subsets=list(Mito=which(rowData(sce.lawlor)$SEQNAME==&quot;MT&quot;))) qc &lt;- quickPerCellQC(stats, percent_subsets=&quot;subsets_Mito_percent&quot;, batch=sce.lawlor$`islet unos id`) sce.lawlor &lt;- sce.lawlor[,!qc$discard] #--- normalization ---# library(scran) set.seed(1000) clusters &lt;- quickCluster(sce.lawlor) sce.lawlor &lt;- computeSumFactors(sce.lawlor, clusters=clusters) sce.lawlor &lt;- logNormCounts(sce.lawlor) #--- variance-modelling ---# dec.lawlor &lt;- modelGeneVar(sce.lawlor, block=sce.lawlor$`islet unos id`) chosen.genes &lt;- getTopHVGs(dec.lawlor, n=2000) sce.lawlor ## class: SingleCellExperiment ## dim: 26616 604 ## metadata(0): ## assays(2): counts logcounts ## rownames(26616): ENSG00000229483 ENSG00000232849 ... ENSG00000251576 ## ENSG00000082898 ## rowData names(2): SYMBOL SEQNAME ## colnames(604): 10th_C11_S96 10th_C13_S61 ... 9th-C96_S81 9th-C9_S13 ## colData names(9): title age ... Sex sizeFactor ## reducedDimNames(0): ## altExpNames(0): View history #--- loading ---# library(scRNAseq) sce.seger &lt;- SegerstolpePancreasData() #--- gene-annotation ---# library(AnnotationHub) edb &lt;- AnnotationHub()[[&quot;AH73881&quot;]] symbols &lt;- rowData(sce.seger)$symbol ens.id &lt;- mapIds(edb, keys=symbols, keytype=&quot;SYMBOL&quot;, column=&quot;GENEID&quot;) ens.id &lt;- ifelse(is.na(ens.id), symbols, ens.id) # Removing duplicated rows. 
keep &lt;- !duplicated(ens.id) sce.seger &lt;- sce.seger[keep,] rownames(sce.seger) &lt;- ens.id[keep] #--- sample-annotation ---# emtab.meta &lt;- colData(sce.seger)[,c(&quot;cell type&quot;, &quot;disease&quot;, &quot;individual&quot;, &quot;single cell well quality&quot;)] colnames(emtab.meta) &lt;- c(&quot;CellType&quot;, &quot;Disease&quot;, &quot;Donor&quot;, &quot;Quality&quot;) colData(sce.seger) &lt;- emtab.meta sce.seger$CellType &lt;- gsub(&quot; cell&quot;, &quot;&quot;, sce.seger$CellType) sce.seger$CellType &lt;- paste0( toupper(substr(sce.seger$CellType, 1, 1)), substring(sce.seger$CellType, 2)) #--- quality-control ---# low.qual &lt;- sce.seger$Quality == &quot;low quality cell&quot; library(scater) stats &lt;- perCellQCMetrics(sce.seger) qc &lt;- quickPerCellQC(stats, percent_subsets=&quot;altexps_ERCC_percent&quot;, batch=sce.seger$Donor, subset=!sce.seger$Donor %in% c(&quot;HP1504901&quot;, &quot;HP1509101&quot;)) sce.seger &lt;- sce.seger[,!(qc$discard | low.qual)] #--- normalization ---# library(scran) clusters &lt;- quickCluster(sce.seger) sce.seger &lt;- computeSumFactors(sce.seger, clusters=clusters) sce.seger &lt;- logNormCounts(sce.seger) #--- variance-modelling ---# for.hvg &lt;- sce.seger[,librarySizeFactors(altExp(sce.seger)) &gt; 0 &amp; sce.seger$Donor!=&quot;AZ&quot;] dec.seger &lt;- modelGeneVarWithSpikes(for.hvg, &quot;ERCC&quot;, block=for.hvg$Donor) chosen.hvgs &lt;- getTopHVGs(dec.seger, n=2000) sce.seger ## class: SingleCellExperiment ## dim: 25454 2090 ## metadata(0): ## assays(2): counts logcounts ## rownames(25454): ENSG00000118473 ENSG00000142920 ... ENSG00000278306 ## eGFP ## rowData names(2): symbol refseq ## colnames(2090): HP1502401_H13 HP1502401_J14 ... HP1526901T2D_N8 ## HP1526901T2D_A8 ## colData names(5): CellType Disease Donor Quality sizeFactor ## reducedDimNames(0): ## altExpNames(1): ERCC We perform the usual routine to obtain re-normalized values and a set of HVGs. 
Here, we put all the objects into a list to avoid having to explicitly type their names separately. all.sce &lt;- list(Grun=sce.grun, Muraro=sce.muraro, Lawlor=sce.lawlor, Seger=sce.seger) all.dec &lt;- list(Grun=dec.grun, Muraro=dec.muraro, Lawlor=dec.lawlor, Seger=dec.seger) universe &lt;- Reduce(intersect, lapply(all.sce, rownames)) all.sce &lt;- lapply(all.sce, &quot;[&quot;, i=universe,) all.dec &lt;- lapply(all.dec, &quot;[&quot;, i=universe,) normed.pancreas &lt;- do.call(multiBatchNorm, all.sce) combined.pan &lt;- do.call(combineVar, all.dec) chosen.genes &lt;- rownames(combined.pan)[combined.pan$bio &gt; 0] We observe that the merge is generally successful, with many clusters containing contributions from each batch (Figure 34.3). However, there are a few clusters that are specific to the Segerstolpe dataset, and if we were naive, we might consider them to represent interesting subpopulations that are not present in the other datasets. set.seed(1011110) mnn.pancreas &lt;- fastMNN(normed.pancreas) # Bumping up &#39;k&#39; to get broader clusters for this demonstration.
snn.gr &lt;- buildSNNGraph(mnn.pancreas, use.dimred=&quot;corrected&quot;, k=20) clusters &lt;- igraph::cluster_walktrap(snn.gr)$membership clusters &lt;- factor(clusters) tab &lt;- table(Cluster=clusters, Batch=mnn.pancreas$batch) tab ## Batch ## Cluster Grun Lawlor Muraro Seger ## 1 304 28 256 383 ## 2 103 244 382 174 ## 3 0 0 0 55 ## 4 219 16 241 140 ## 5 166 231 357 149 ## 6 34 0 1 0 ## 7 56 17 196 109 ## 8 70 9 80 11 ## 9 28 6 39 50 ## 10 24 18 107 55 ## 11 35 6 483 171 ## 12 0 0 0 118 ## 13 0 1 16 4 ## 14 18 18 117 158 ## 15 0 0 0 26 ## 16 1 2 5 11 ## 17 0 0 0 47 ## 18 0 0 0 196 ## 19 0 0 0 185 ## 20 5 8 19 17 ## 21 0 0 0 31 mnn.pancreas &lt;- runTSNE(mnn.pancreas, dimred=&quot;corrected&quot;) gridExtra::grid.arrange( plotTSNE(mnn.pancreas, colour_by=&quot;batch&quot;, text_by=I(clusters)), plotTSNE(mnn.pancreas, colour_by=I(clusters), text_by=I(clusters)), ncol=2 ) Figure 34.3: \\(t\\)-SNE plots of the four pancreas datasets after correction with fastMNN(). Each point represents a cell and is colored according to the batch of origin (left) or the assigned cluster (right). The cluster label is shown at the median location across all cells in the cluster. Fortunately, we are battle-hardened and cynical, so we are sure to check for other sources of variation. The most obvious candidate is the donor of origin for each cell (Figure 34.4), which correlates strongly to these Segerstolpe-only clusters. This is not surprising given the large differences between humans in the wild, but donor-level variation is not interesting for the purposes of cell type characterization. (That said, preservation of the within-dataset donor effects is the technically correct course of action here, as a batch correction method should try to avoid removing heterogeneity within each of its defined batches.) 
donors &lt;- c( normed.pancreas$Grun$donor, normed.pancreas$Muraro$donor, normed.pancreas$Lawlor$`islet unos id`, normed.pancreas$Seger$Donor ) seger.donors &lt;- donors seger.donors[mnn.pancreas$batch!=&quot;Seger&quot;] &lt;- NA plotTSNE(mnn.pancreas, colour_by=I(seger.donors)) Figure 34.4: \\(t\\)-SNE plots of the four pancreas datasets after correction with fastMNN(). Each point represents a cell and is colored according to the donor of origin for the Segerstolpe dataset. 34.4 The ugly Given these results, the most prudent course of action is to remove the donor effects within each dataset in addition to the batch effects across datasets. This involves a bit more work to properly specify the two levels of unwanted heterogeneity. To make our job a bit easier, we use the noCorrect() utility to combine all batches into a single SingleCellExperiment object. combined &lt;- noCorrect(normed.pancreas) assayNames(combined) &lt;- &quot;logcounts&quot; combined$donor &lt;- donors We then call fastMNN() on the combined object with our chosen HVGs, using the batch= argument to specify which cells belong to which donors. This will progressively merge cells from each donor in each batch until all cells are mapped onto a common coordinate space. For some extra sophistication, we also set the weights= argument to ensure that each batch contributes equally to the PCA, regardless of the number of donors present in that batch; see ?multiBatchPCA for more details. 
donors.per.batch &lt;- split(combined$donor, combined$batch) donors.per.batch &lt;- lapply(donors.per.batch, unique) donors.per.batch ## $Grun ## [1] &quot;D2&quot; &quot;D3&quot; &quot;D7&quot; &quot;D10&quot; &quot;D17&quot; ## ## $Lawlor ## [1] &quot;ACIW009&quot; &quot;ACJV399&quot; &quot;ACCG268&quot; &quot;ACCR015A&quot; &quot;ACEK420A&quot; &quot;ACEL337&quot; &quot;ACHY057&quot; ## [8] &quot;ACIB065&quot; ## ## $Muraro ## [1] &quot;D28&quot; &quot;D29&quot; &quot;D31&quot; &quot;D30&quot; ## ## $Seger ## [1] &quot;HP1502401&quot; &quot;HP1504101T2D&quot; &quot;AZ&quot; &quot;HP1508501T2D&quot; &quot;HP1506401&quot; ## [6] &quot;HP1507101&quot; &quot;HP1509101&quot; &quot;HP1504901&quot; &quot;HP1525301T2D&quot; &quot;HP1526901T2D&quot; set.seed(1010100) multiout &lt;- fastMNN(combined, batch=combined$donor, subset.row=chosen.genes, weights=donors.per.batch) # Renaming metadata fields for easier communication later. multiout$dataset &lt;- combined$batch multiout$donor &lt;- multiout$batch multiout$batch &lt;- NULL With this approach, we see that the Segerstolpe-only clusters have disappeared (Figure 34.5). Visually, there also seems to be much greater mixing between cells from different Segerstolpe donors. This suggests that we have removed most of the donor effect, which simplifies the interpretation of our clusters. 
library(scater) g &lt;- buildSNNGraph(multiout, use.dimred=1, k=20) clusters &lt;- igraph::cluster_walktrap(g)$membership tab &lt;- table(clusters, multiout$dataset) tab ## ## clusters Grun Lawlor Muraro Seger ## 1 247 20 278 186 ## 2 338 26 257 387 ## 3 171 250 453 267 ## 4 202 245 857 891 ## 5 57 17 193 108 ## 6 24 17 108 55 ## 7 5 9 19 17 ## 8 0 1 17 4 ## 9 19 19 117 175 multiout &lt;- runTSNE(multiout, dimred=&quot;corrected&quot;) gridExtra::grid.arrange( plotTSNE(multiout, colour_by=&quot;dataset&quot;, text_by=I(clusters)), plotTSNE(multiout, colour_by=I(seger.donors)), ncol=2 ) Figure 34.5: \\(t\\)-SNE plots of the four pancreas datasets after donor-level correction with fastMNN(). Each point represents a cell and is colored according to the batch of origin (left) or the donor of origin for the Segerstolpe-derived cells (right). The cluster label is shown at the median location across all cells in the cluster. Our clusters compare well to the published annotations, indicating that we did not inadvertently discard important factors of variation during correction. (Though in this case, the cell types are so well defined, it would be quite a feat to fail to separate them!) 
proposed &lt;- c(rep(NA, ncol(sce.grun)), sce.muraro$label, sce.lawlor$`cell type`, sce.seger$CellType) proposed &lt;- tolower(proposed) proposed[proposed==&quot;gamma/pp&quot;] &lt;- &quot;gamma&quot; proposed[proposed==&quot;pp&quot;] &lt;- &quot;gamma&quot; proposed[proposed==&quot;duct&quot;] &lt;- &quot;ductal&quot; proposed[proposed==&quot;psc&quot;] &lt;- &quot;stellate&quot; table(proposed, clusters) ## clusters ## proposed 1 2 3 4 5 6 7 8 9 ## acinar 421 1 0 2 0 0 1 0 0 ## alpha 1 7 4 1871 1 0 1 0 2 ## beta 3 4 917 5 0 2 1 1 7 ## co-expression 0 0 17 22 0 0 0 0 0 ## delta 0 2 3 3 306 0 1 0 1 ## ductal 6 612 4 1 0 6 0 10 1 ## endothelial 0 0 0 0 0 1 33 0 0 ## epsilon 0 0 0 2 0 0 0 0 6 ## gamma 2 0 0 0 0 0 0 0 280 ## mesenchymal 0 1 0 0 0 79 0 0 0 ## mhc class ii 0 0 0 0 0 0 0 4 0 ## nana 2 8 1 13 1 0 1 0 0 ## none/other 0 3 1 2 0 0 4 1 1 ## stellate 0 0 0 1 0 70 1 0 0 ## unclassified 0 0 0 0 0 2 0 0 0 ## unclassified endocrine 0 0 2 4 0 0 0 0 0 ## unclear 0 4 0 0 0 0 0 0 0 Session Info View session info R version 4.0.4 (2021-02-15) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.12-books/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.12-books/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] bluster_1.0.0 scater_1.18.6 [3] ggplot2_3.3.3 scran_1.18.5 [5] batchelor_1.6.2 SingleCellExperiment_1.12.0 [7] SummarizedExperiment_1.20.0 Biobase_2.50.0 [9] GenomicRanges_1.42.0 GenomeInfoDb_1.26.4 [11] IRanges_2.24.1 S4Vectors_0.28.1 [13] BiocGenerics_0.36.0 MatrixGenerics_1.2.1 [15] matrixStats_0.58.0 BiocStyle_2.18.1 [17] rebook_1.0.0 loaded 
via a namespace (and not attached): [1] bitops_1.0-6 tools_4.0.4 [3] bslib_0.2.4 utf8_1.2.1 [5] R6_2.5.0 irlba_2.3.3 [7] ResidualMatrix_1.0.0 vipor_0.4.5 [9] DBI_1.1.1 colorspace_2.0-0 [11] withr_2.4.1 gridExtra_2.3 [13] tidyselect_1.1.0 processx_3.4.5 [15] compiler_4.0.4 graph_1.68.0 [17] BiocNeighbors_1.8.2 DelayedArray_0.16.2 [19] labeling_0.4.2 bookdown_0.21 [21] sass_0.3.1 scales_1.1.1 [23] callr_3.5.1 stringr_1.4.0 [25] digest_0.6.27 rmarkdown_2.7 [27] XVector_0.30.0 pkgconfig_2.0.3 [29] htmltools_0.5.1.1 sparseMatrixStats_1.2.1 [31] highr_0.8 limma_3.46.0 [33] rlang_0.4.10 DelayedMatrixStats_1.12.3 [35] farver_2.1.0 jquerylib_0.1.3 [37] generics_0.1.0 jsonlite_1.7.2 [39] BiocParallel_1.24.1 dplyr_1.0.5 [41] RCurl_1.98-1.3 magrittr_2.0.1 [43] BiocSingular_1.6.0 GenomeInfoDbData_1.2.4 [45] scuttle_1.0.4 Matrix_1.3-2 [47] ggbeeswarm_0.6.0 Rcpp_1.0.6 [49] munsell_0.5.0 fansi_0.4.2 [51] viridis_0.5.1 lifecycle_1.0.0 [53] stringi_1.5.3 yaml_2.2.1 [55] edgeR_3.32.1 zlibbioc_1.36.0 [57] Rtsne_0.15 grid_4.0.4 [59] dqrng_0.2.1 crayon_1.4.1 [61] lattice_0.20-41 cowplot_1.1.1 [63] beachmat_2.6.4 locfit_1.5-9.4 [65] CodeDepends_0.6.5 knitr_1.31 [67] ps_1.6.0 pillar_1.5.1 [69] igraph_1.2.6 codetools_0.2-18 [71] XML_3.99-0.6 glue_1.4.2 [73] evaluate_0.14 BiocManager_1.30.10 [75] vctrs_0.3.6 gtable_0.3.0 [77] purrr_0.3.4 assertthat_0.2.1 [79] xfun_0.22 rsvd_1.0.3 [81] viridisLite_0.3.0 tibble_3.1.0 [83] beeswarm_0.3.1 statmod_1.4.35 [85] ellipsis_0.3.1 Bibliography "],["grun-mouse-hsc-cel-seq.html", "Chapter 35 Grun mouse HSC (CEL-seq) 35.1 Introduction 35.2 Data loading 35.3 Quality control 35.4 Normalization 35.5 Variance modelling 35.6 Dimensionality reduction 35.7 Clustering 35.8 Marker gene detection Session Info", " Chapter 35 Grun mouse HSC (CEL-seq) 35.1 Introduction This chapter presents an analysis of the mouse haematopoietic stem cell (HSC) dataset generated with CEL-seq (Grun et al. 2016). Despite its name, this dataset actually contains both sorted HSCs and a population of micro-dissected bone marrow cells. 35.2 Data loading library(scRNAseq) sce.grun.hsc &lt;- GrunHSCData(ensembl=TRUE) library(AnnotationHub) ens.mm.v97 &lt;- AnnotationHub()[[&quot;AH73905&quot;]] anno &lt;- select(ens.mm.v97, keys=rownames(sce.grun.hsc), keytype=&quot;GENEID&quot;, columns=c(&quot;SYMBOL&quot;, &quot;SEQNAME&quot;)) rowData(sce.grun.hsc) &lt;- anno[match(rownames(sce.grun.hsc), anno$GENEID),] After loading and annotation, we inspect the resulting SingleCellExperiment object: sce.grun.hsc ## class: SingleCellExperiment ## dim: 21817 1915 ## metadata(0): ## assays(1): counts ## rownames(21817): ENSMUSG00000109644 ENSMUSG00000007777 ... ## ENSMUSG00000055670 ENSMUSG00000039068 ## rowData names(3): GENEID SYMBOL SEQNAME ## colnames(1915): JC4_349_HSC_FE_S13_ JC4_350_HSC_FE_S13_ ... ## JC48P6_1203_HSC_FE_S8_ JC48P6_1204_HSC_FE_S8_ ## colData names(2): sample protocol ## reducedDimNames(0): ## altExpNames(0): 35.3 Quality control unfiltered &lt;- sce.grun.hsc No mitochondrial transcripts are available in this dataset, and there are no spike-in transcripts, so we only use the number of detected genes and the library size for quality control. We block on the protocol used for cell extraction, ignoring the micro-dissected cells when computing the thresholds. This is based on our judgement that the micro-dissected plates consist mostly of low-quality cells, compromising the assumptions of outlier detection.
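The outlier rule underlying quickPerCellQC() can be sketched in base R: by default, a cell is flagged if a log-transformed QC metric falls more than three median absolute deviations below the median of its batch. The library sizes and the three-MAD/log-scale defaults below are assumptions for illustration only; the real function evaluates several metrics at once and respects the batch= and subset= arguments.

```r
# Minimal sketch of the lower-tail, 3-MADs-from-the-median rule assumed
# to underlie quickPerCellQC()/isOutlier(); library sizes are made up.
libsizes <- c(1000, 1200, 900, 1100, 50)  # last cell is clearly damaged
lx <- log10(libsizes)                     # outlier calls on the log scale
lower <- median(lx) - 3 * mad(lx)         # lower threshold only
discard <- lx < lower
discard
## [1] FALSE FALSE FALSE FALSE  TRUE
```

Blocking on batch= simply applies this rule within each level of the blocking factor, and subset= restricts which cells are used to estimate the thresholds, which is how the micro-dissected plates are excluded above.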
library(scuttle) stats &lt;- perCellQCMetrics(sce.grun.hsc) qc &lt;- quickPerCellQC(stats, batch=sce.grun.hsc$protocol, subset=grepl(&quot;sorted&quot;, sce.grun.hsc$protocol)) sce.grun.hsc &lt;- sce.grun.hsc[,!qc$discard] We examine the number of cells discarded for each reason. colSums(as.matrix(qc)) ## low_lib_size low_n_features discard ## 465 482 488 We create some diagnostic plots for each metric (Figure 35.1). The library sizes are unusually low for many plates of micro-dissected cells; this may be attributable to damage induced by the extraction protocol compared to cell sorting. colData(unfiltered) &lt;- cbind(colData(unfiltered), stats) unfiltered$discard &lt;- qc$discard library(scater) gridExtra::grid.arrange( plotColData(unfiltered, y=&quot;sum&quot;, x=&quot;sample&quot;, colour_by=&quot;discard&quot;, other_fields=&quot;protocol&quot;) + scale_y_log10() + ggtitle(&quot;Total count&quot;) + facet_wrap(~protocol), plotColData(unfiltered, y=&quot;detected&quot;, x=&quot;sample&quot;, colour_by=&quot;discard&quot;, other_fields=&quot;protocol&quot;) + scale_y_log10() + ggtitle(&quot;Detected features&quot;) + facet_wrap(~protocol), ncol=1 ) Figure 35.1: Distribution of each QC metric across cells in the Grun HSC dataset. Each point represents a cell and is colored according to whether that cell was discarded. 35.4 Normalization library(scran) set.seed(101000110) clusters &lt;- quickCluster(sce.grun.hsc) sce.grun.hsc &lt;- computeSumFactors(sce.grun.hsc, clusters=clusters) sce.grun.hsc &lt;- logNormCounts(sce.grun.hsc) We examine some key metrics for the distribution of size factors, and compare it to the library sizes as a sanity check (Figure 35.2). summary(sizeFactors(sce.grun.hsc)) ## Min. 1st Qu. Median Mean 3rd Qu. Max. 
## 0.027 0.290 0.603 1.000 1.201 16.433 plot(librarySizeFactors(sce.grun.hsc), sizeFactors(sce.grun.hsc), pch=16, xlab=&quot;Library size factors&quot;, ylab=&quot;Deconvolution factors&quot;, log=&quot;xy&quot;) Figure 35.2: Relationship between the library size factors and the deconvolution size factors in the Grun HSC dataset. 35.5 Variance modelling We create a mean-variance trend based on the expectation that UMI counts have Poisson technical noise. We do not block on sample here as we want to preserve any difference between the micro-dissected cells and the sorted HSCs. set.seed(00010101) dec.grun.hsc &lt;- modelGeneVarByPoisson(sce.grun.hsc) top.grun.hsc &lt;- getTopHVGs(dec.grun.hsc, prop=0.1) The lack of a typical “bump” shape in Figure 35.3 is caused by the low counts. plot(dec.grun.hsc$mean, dec.grun.hsc$total, pch=16, cex=0.5, xlab=&quot;Mean of log-expression&quot;, ylab=&quot;Variance of log-expression&quot;) curfit &lt;- metadata(dec.grun.hsc) curve(curfit$trend(x), col=&#39;dodgerblue&#39;, add=TRUE, lwd=2) Figure 35.3: Per-gene variance as a function of the mean for the log-expression values in the Grun HSC dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to the simulated Poisson-distributed noise. 35.6 Dimensionality reduction set.seed(101010011) sce.grun.hsc &lt;- denoisePCA(sce.grun.hsc, technical=dec.grun.hsc, subset.row=top.grun.hsc) sce.grun.hsc &lt;- runTSNE(sce.grun.hsc, dimred=&quot;PCA&quot;) We check that the number of retained PCs is sensible. 
ncol(reducedDim(sce.grun.hsc, &quot;PCA&quot;)) ## [1] 9 35.7 Clustering snn.gr &lt;- buildSNNGraph(sce.grun.hsc, use.dimred=&quot;PCA&quot;) colLabels(sce.grun.hsc) &lt;- factor(igraph::cluster_walktrap(snn.gr)$membership) table(colLabels(sce.grun.hsc)) ## ## 1 2 3 4 5 6 7 8 9 10 11 12 ## 259 148 221 103 177 108 48 122 98 63 62 18 short &lt;- ifelse(grepl(&quot;micro&quot;, sce.grun.hsc$protocol), &quot;micro&quot;, &quot;sorted&quot;) gridExtra::grid.arrange( plotTSNE(sce.grun.hsc, colour_by=&quot;label&quot;), plotTSNE(sce.grun.hsc, colour_by=I(short)), ncol=2 ) Figure 35.4: Obligatory \\(t\\)-SNE plot of the Grun HSC dataset, where each point represents a cell and is colored according to the assigned cluster (left) or extraction protocol (right). 35.8 Marker gene detection markers &lt;- findMarkers(sce.grun.hsc, test.type=&quot;wilcox&quot;, direction=&quot;up&quot;, row.data=rowData(sce.grun.hsc)[,&quot;SYMBOL&quot;,drop=FALSE]) To illustrate the manual annotation process, we examine the marker genes for one of the clusters. Upregulation of Camp, Lcn2, Ltf and lysozyme genes indicates that this cluster contains cells of neutrophil origin. chosen &lt;- markers[[&#39;6&#39;]] best &lt;- chosen[chosen$Top &lt;= 10,] aucs &lt;- getMarkerEffects(best, prefix=&quot;AUC&quot;) rownames(aucs) &lt;- best$SYMBOL library(pheatmap) pheatmap(aucs, color=viridis::plasma(100)) Figure 35.5: Heatmap of the AUCs for the top marker genes in cluster 6 compared to all other clusters in the Grun HSC dataset.
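The AUCs in the heatmap have a simple interpretation: for each gene, the AUC from the Wilcoxon test is the probability that a randomly chosen cell in the cluster expresses the gene more highly than a randomly chosen cell in the other group, with ties counted as one half. The sketch below computes this quantity directly on toy expression values; the function name and values are invented for illustration.

```r
# Sketch of the AUC reported by findMarkers(test.type="wilcox"): the
# probability that a random cell from the first group beats a random
# cell from the second, counting ties as 0.5. Toy values only.
auc <- function(in.group, out.group) {
    mean(outer(in.group, out.group, ">") +
         0.5 * outer(in.group, out.group, "=="))
}
auc(c(5, 6, 7), c(1, 2, 3))  # complete separation
## [1] 1
auc(c(1, 2, 3), c(1, 2, 3))  # indistinguishable groups
## [1] 0.5
```

An AUC near 1 therefore indicates strong upregulation in the cluster of interest, 0.5 indicates no difference, and values near 0 indicate downregulation.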
Session Info View session info R version 4.0.4 (2021-02-15) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.12-books/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.12-books/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] pheatmap_1.0.12 scran_1.18.5 [3] scater_1.18.6 ggplot2_3.3.3 [5] scuttle_1.0.4 AnnotationHub_2.22.0 [7] BiocFileCache_1.14.0 dbplyr_2.1.0 [9] ensembldb_2.14.0 AnnotationFilter_1.14.0 [11] GenomicFeatures_1.42.2 AnnotationDbi_1.52.0 [13] scRNAseq_2.4.0 SingleCellExperiment_1.12.0 [15] SummarizedExperiment_1.20.0 Biobase_2.50.0 [17] GenomicRanges_1.42.0 GenomeInfoDb_1.26.4 [19] IRanges_2.24.1 S4Vectors_0.28.1 [21] BiocGenerics_0.36.0 MatrixGenerics_1.2.1 [23] matrixStats_0.58.0 BiocStyle_2.18.1 [25] rebook_1.0.0 loaded via a namespace (and not attached): [1] igraph_1.2.6 lazyeval_0.2.2 [3] BiocParallel_1.24.1 digest_0.6.27 [5] htmltools_0.5.1.1 viridis_0.5.1 [7] fansi_0.4.2 magrittr_2.0.1 [9] memoise_2.0.0 limma_3.46.0 [11] Biostrings_2.58.0 askpass_1.1 [13] prettyunits_1.1.1 colorspace_2.0-0 [15] blob_1.2.1 rappdirs_0.3.3 [17] xfun_0.22 dplyr_1.0.5 [19] callr_3.5.1 crayon_1.4.1 [21] RCurl_1.98-1.3 jsonlite_1.7.2 [23] graph_1.68.0 glue_1.4.2 [25] gtable_0.3.0 zlibbioc_1.36.0 [27] XVector_0.30.0 DelayedArray_0.16.2 [29] BiocSingular_1.6.0 scales_1.1.1 [31] edgeR_3.32.1 DBI_1.1.1 [33] Rcpp_1.0.6 viridisLite_0.3.0 [35] xtable_1.8-4 progress_1.2.2 [37] dqrng_0.2.1 bit_4.0.4 [39] rsvd_1.0.3 httr_1.4.2 [41] RColorBrewer_1.1-2 ellipsis_0.3.1 [43] pkgconfig_2.0.3 XML_3.99-0.6 [45] farver_2.1.0 CodeDepends_0.6.5 [47] sass_0.3.1 
locfit_1.5-9.4 [49] utf8_1.2.1 labeling_0.4.2 [51] tidyselect_1.1.0 rlang_0.4.10 [53] later_1.1.0.1 munsell_0.5.0 [55] BiocVersion_3.12.0 tools_4.0.4 [57] cachem_1.0.4 generics_0.1.0 [59] RSQLite_2.2.4 ExperimentHub_1.16.0 [61] evaluate_0.14 stringr_1.4.0 [63] fastmap_1.1.0 yaml_2.2.1 [65] processx_3.4.5 knitr_1.31 [67] bit64_4.0.5 purrr_0.3.4 [69] sparseMatrixStats_1.2.1 mime_0.10 [71] xml2_1.3.2 biomaRt_2.46.3 [73] compiler_4.0.4 beeswarm_0.3.1 [75] curl_4.3 interactiveDisplayBase_1.28.0 [77] statmod_1.4.35 tibble_3.1.0 [79] bslib_0.2.4 stringi_1.5.3 [81] highr_0.8 ps_1.6.0 [83] lattice_0.20-41 bluster_1.0.0 [85] ProtGenerics_1.22.0 Matrix_1.3-2 [87] vctrs_0.3.6 pillar_1.5.1 [89] lifecycle_1.0.0 BiocManager_1.30.10 [91] jquerylib_0.1.3 BiocNeighbors_1.8.2 [93] cowplot_1.1.1 bitops_1.0-6 [95] irlba_2.3.3 httpuv_1.5.5 [97] rtracklayer_1.50.0 R6_2.5.0 [99] bookdown_0.21 promises_1.2.0.1 [101] gridExtra_2.3 vipor_0.4.5 [103] codetools_0.2-18 assertthat_0.2.1 [105] openssl_1.4.3 withr_2.4.1 [107] GenomicAlignments_1.26.0 Rsamtools_2.6.0 [109] GenomeInfoDbData_1.2.4 hms_1.0.0 [111] grid_4.0.4 beachmat_2.6.4 [113] rmarkdown_2.7 DelayedMatrixStats_1.12.3 [115] Rtsne_0.15 shiny_1.6.0 [117] ggbeeswarm_0.6.0 Bibliography "],["nestorowa-mouse-hsc-smart-seq2.html", "Chapter 36 Nestorowa mouse HSC (Smart-seq2) 36.1 Introduction 36.2 Data loading 36.3 Quality control 36.4 Normalization 36.5 Variance modelling 36.6 Dimensionality reduction 36.7 Clustering 36.8 Marker gene detection 36.9 Cell type annotation 36.10 Miscellaneous analyses Session Info", " Chapter 36 Nestorowa mouse HSC (Smart-seq2) .rebook-collapse { background-color: #eee; color: #444; cursor: pointer; padding: 18px; width: 100%; border: none; text-align: left; outline: none; font-size: 15px; } .rebook-content { padding: 0 18px; display: none; overflow: hidden; background-color: #f1f1f1; } 36.1 Introduction This performs an analysis of the mouse haematopoietic stem cell (HSC) dataset generated with Smart-seq2 
(Nestorowa et al. 2016). 36.2 Data loading library(scRNAseq) sce.nest &lt;- NestorowaHSCData() library(AnnotationHub) ens.mm.v97 &lt;- AnnotationHub()[[&quot;AH73905&quot;]] anno &lt;- select(ens.mm.v97, keys=rownames(sce.nest), keytype=&quot;GENEID&quot;, columns=c(&quot;SYMBOL&quot;, &quot;SEQNAME&quot;)) rowData(sce.nest) &lt;- anno[match(rownames(sce.nest), anno$GENEID),] After loading and annotation, we inspect the resulting SingleCellExperiment object: sce.nest ## class: SingleCellExperiment ## dim: 46078 1920 ## metadata(0): ## assays(1): counts ## rownames(46078): ENSMUSG00000000001 ENSMUSG00000000003 ... ## ENSMUSG00000107391 ENSMUSG00000107392 ## rowData names(3): GENEID SYMBOL SEQNAME ## colnames(1920): HSPC_007 HSPC_013 ... Prog_852 Prog_810 ## colData names(2): cell.type FACS ## reducedDimNames(1): diffusion ## altExpNames(1): ERCC 36.3 Quality control unfiltered &lt;- sce.nest For some reason, no mitochondrial transcripts are available, so we will perform quality control using the spike-in proportions only. library(scater) stats &lt;- perCellQCMetrics(sce.nest) qc &lt;- quickPerCellQC(stats, percent_subsets=&quot;altexps_ERCC_percent&quot;) sce.nest &lt;- sce.nest[,!qc$discard] We examine the number of cells discarded for each reason. colSums(as.matrix(qc)) ## low_lib_size low_n_features high_altexps_ERCC_percent ## 146 28 241 ## discard ## 264 We create some diagnostic plots for each metric (Figure 36.1). 
colData(unfiltered) &lt;- cbind(colData(unfiltered), stats) unfiltered$discard &lt;- qc$discard gridExtra::grid.arrange( plotColData(unfiltered, y=&quot;sum&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Total count&quot;), plotColData(unfiltered, y=&quot;detected&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Detected features&quot;), plotColData(unfiltered, y=&quot;altexps_ERCC_percent&quot;, colour_by=&quot;discard&quot;) + ggtitle(&quot;ERCC percent&quot;), ncol=2 ) Figure 36.1: Distribution of each QC metric across cells in the Nestorowa HSC dataset. Each point represents a cell and is colored according to whether that cell was discarded. 36.4 Normalization library(scran) set.seed(101000110) clusters &lt;- quickCluster(sce.nest) sce.nest &lt;- computeSumFactors(sce.nest, clusters=clusters) sce.nest &lt;- logNormCounts(sce.nest) We examine some key metrics for the distribution of size factors, and compare it to the library sizes as a sanity check (Figure 36.2). summary(sizeFactors(sce.nest)) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.044 0.422 0.748 1.000 1.249 15.927 plot(librarySizeFactors(sce.nest), sizeFactors(sce.nest), pch=16, xlab=&quot;Library size factors&quot;, ylab=&quot;Deconvolution factors&quot;, log=&quot;xy&quot;) Figure 36.2: Relationship between the library size factors and the deconvolution size factors in the Nestorowa HSC dataset. 36.5 Variance modelling We use the spike-in transcripts to model the technical noise as a function of the mean (Figure 36.3). 
set.seed(00010101) dec.nest &lt;- modelGeneVarWithSpikes(sce.nest, &quot;ERCC&quot;) top.nest &lt;- getTopHVGs(dec.nest, prop=0.1) plot(dec.nest$mean, dec.nest$total, pch=16, cex=0.5, xlab=&quot;Mean of log-expression&quot;, ylab=&quot;Variance of log-expression&quot;) curfit &lt;- metadata(dec.nest) curve(curfit$trend(x), col=&#39;dodgerblue&#39;, add=TRUE, lwd=2) points(curfit$mean, curfit$var, col=&quot;red&quot;) Figure 36.3: Per-gene variance as a function of the mean for the log-expression values in the Nestorowa HSC dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to the spike-ins (red). 36.6 Dimensionality reduction set.seed(101010011) sce.nest &lt;- denoisePCA(sce.nest, technical=dec.nest, subset.row=top.nest) sce.nest &lt;- runTSNE(sce.nest, dimred=&quot;PCA&quot;) We check that the number of retained PCs is sensible. ncol(reducedDim(sce.nest, &quot;PCA&quot;)) ## [1] 9 36.7 Clustering snn.gr &lt;- buildSNNGraph(sce.nest, use.dimred=&quot;PCA&quot;) colLabels(sce.nest) &lt;- factor(igraph::cluster_walktrap(snn.gr)$membership) table(colLabels(sce.nest)) ## ## 1 2 3 4 5 6 7 8 9 ## 203 472 258 175 142 229 20 83 74 plotTSNE(sce.nest, colour_by=&quot;label&quot;) Figure 36.4: Obligatory \\(t\\)-SNE plot of the Nestorowa HSC dataset, where each point represents a cell and is colored according to the assigned cluster. 36.8 Marker gene detection markers &lt;- findMarkers(sce.nest, colLabels(sce.nest), test.type=&quot;wilcox&quot;, direction=&quot;up&quot;, lfc=0.5, row.data=rowData(sce.nest)[,&quot;SYMBOL&quot;,drop=FALSE]) To illustrate the manual annotation process, we examine the marker genes for one of the clusters. Upregulation of Car2, Hebp1 and hemoglobins indicates that cluster 8 contains erythroid precursors. 
chosen &lt;- markers[[&#39;8&#39;]] best &lt;- chosen[chosen$Top &lt;= 10,] aucs &lt;- getMarkerEffects(best, prefix=&quot;AUC&quot;) rownames(aucs) &lt;- best$SYMBOL library(pheatmap) pheatmap(aucs, color=viridis::plasma(100)) Figure 36.5: Heatmap of the AUCs for the top marker genes in cluster 8 compared to all other clusters. 36.9 Cell type annotation library(SingleR) mm.ref &lt;- MouseRNAseqData() # Renaming to symbols to match with reference row names. renamed &lt;- sce.nest rownames(renamed) &lt;- uniquifyFeatureNames(rownames(renamed), rowData(sce.nest)$SYMBOL) labels &lt;- SingleR(renamed, mm.ref, labels=mm.ref$label.fine) Most clusters are not assigned to any single lineage (Figure 36.6), which is perhaps unsurprising given that HSCs are quite different from their terminal fates. Cluster 8 is considered to contain erythrocytes, which is roughly consistent with our conclusions from the marker gene analysis above. tab &lt;- table(labels$labels, colLabels(sce.nest)) pheatmap(log10(tab+10), color=viridis::viridis(100)) Figure 36.6: Heatmap of the distribution of cells for each cluster in the Nestorowa HSC dataset, based on their assignment to each label in the mouse RNA-seq references from the SingleR package. 36.10 Miscellaneous analyses This dataset also contains information about the protein abundances in each cell from FACS. There is barely any heterogeneity in the chosen markers across the clusters (Figure 36.7); this is perhaps unsurprising given that all cells should be HSCs of some sort. Y &lt;- colData(sce.nest)$FACS keep &lt;- rowSums(is.na(Y))==0 # Removing NA intensities. 
se.averaged &lt;- sumCountsAcrossCells(t(Y[keep,]), colLabels(sce.nest)[keep], average=TRUE) averaged &lt;- assay(se.averaged) log.intensities &lt;- log2(averaged+1) centered &lt;- log.intensities - rowMeans(log.intensities) pheatmap(centered, breaks=seq(-1, 1, length.out=101)) Figure 36.7: Heatmap of the centered log-average intensity for each target protein quantified by FACS in the Nestorowa HSC dataset. Session Info View session info R version 4.0.4 (2021-02-15) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.12-books/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.12-books/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] celldex_1.0.0 SingleR_1.4.1 [3] pheatmap_1.0.12 scran_1.18.5 [5] scater_1.18.6 ggplot2_3.3.3 [7] AnnotationHub_2.22.0 BiocFileCache_1.14.0 [9] dbplyr_2.1.0 ensembldb_2.14.0 [11] AnnotationFilter_1.14.0 GenomicFeatures_1.42.2 [13] AnnotationDbi_1.52.0 scRNAseq_2.4.0 [15] SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0 [17] Biobase_2.50.0 GenomicRanges_1.42.0 [19] GenomeInfoDb_1.26.4 IRanges_2.24.1 [21] S4Vectors_0.28.1 BiocGenerics_0.36.0 [23] MatrixGenerics_1.2.1 matrixStats_0.58.0 [25] BiocStyle_2.18.1 rebook_1.0.0 loaded via a namespace (and not attached): [1] igraph_1.2.6 lazyeval_0.2.2 [3] BiocParallel_1.24.1 digest_0.6.27 [5] htmltools_0.5.1.1 viridis_0.5.1 [7] fansi_0.4.2 magrittr_2.0.1 [9] memoise_2.0.0 limma_3.46.0 [11] Biostrings_2.58.0 askpass_1.1 [13] prettyunits_1.1.1 colorspace_2.0-0 [15] blob_1.2.1 rappdirs_0.3.3 [17] xfun_0.22 dplyr_1.0.5 [19] callr_3.5.1 crayon_1.4.1 [21] RCurl_1.98-1.3 
jsonlite_1.7.2 [23] graph_1.68.0 glue_1.4.2 [25] gtable_0.3.0 zlibbioc_1.36.0 [27] XVector_0.30.0 DelayedArray_0.16.2 [29] BiocSingular_1.6.0 scales_1.1.1 [31] edgeR_3.32.1 DBI_1.1.1 [33] Rcpp_1.0.6 viridisLite_0.3.0 [35] xtable_1.8-4 progress_1.2.2 [37] dqrng_0.2.1 bit_4.0.4 [39] rsvd_1.0.3 httr_1.4.2 [41] RColorBrewer_1.1-2 ellipsis_0.3.1 [43] pkgconfig_2.0.3 XML_3.99-0.6 [45] farver_2.1.0 scuttle_1.0.4 [47] CodeDepends_0.6.5 sass_0.3.1 [49] locfit_1.5-9.4 utf8_1.2.1 [51] tidyselect_1.1.0 labeling_0.4.2 [53] rlang_0.4.10 later_1.1.0.1 [55] munsell_0.5.0 BiocVersion_3.12.0 [57] tools_4.0.4 cachem_1.0.4 [59] generics_0.1.0 RSQLite_2.2.4 [61] ExperimentHub_1.16.0 evaluate_0.14 [63] stringr_1.4.0 fastmap_1.1.0 [65] yaml_2.2.1 processx_3.4.5 [67] knitr_1.31 bit64_4.0.5 [69] purrr_0.3.4 sparseMatrixStats_1.2.1 [71] mime_0.10 xml2_1.3.2 [73] biomaRt_2.46.3 compiler_4.0.4 [75] beeswarm_0.3.1 curl_4.3 [77] interactiveDisplayBase_1.28.0 statmod_1.4.35 [79] tibble_3.1.0 bslib_0.2.4 [81] stringi_1.5.3 highr_0.8 [83] ps_1.6.0 lattice_0.20-41 [85] bluster_1.0.0 ProtGenerics_1.22.0 [87] Matrix_1.3-2 vctrs_0.3.6 [89] pillar_1.5.1 lifecycle_1.0.0 [91] BiocManager_1.30.10 jquerylib_0.1.3 [93] BiocNeighbors_1.8.2 cowplot_1.1.1 [95] bitops_1.0-6 irlba_2.3.3 [97] httpuv_1.5.5 rtracklayer_1.50.0 [99] R6_2.5.0 bookdown_0.21 [101] promises_1.2.0.1 gridExtra_2.3 [103] vipor_0.4.5 codetools_0.2-18 [105] assertthat_0.2.1 openssl_1.4.3 [107] withr_2.4.1 GenomicAlignments_1.26.0 [109] Rsamtools_2.6.0 GenomeInfoDbData_1.2.4 [111] hms_1.0.0 grid_4.0.4 [113] beachmat_2.6.4 rmarkdown_2.7 [115] DelayedMatrixStats_1.12.3 Rtsne_0.15 [117] shiny_1.6.0 ggbeeswarm_0.6.0 Bibliography "],["paul-mouse-hsc-mars-seq.html", "Chapter 37 Paul mouse HSC (MARS-seq) 37.1 Introduction 37.2 Data loading 37.3 Quality control 37.4 Normalization 37.5 Variance modelling 37.6 Dimensionality reduction 37.7 Clustering Session Info", " Chapter 37 Paul mouse HSC (MARS-seq) .rebook-collapse { background-color: #eee; color: 
#444; cursor: pointer; padding: 18px; width: 100%; border: none; text-align: left; outline: none; font-size: 15px; } .rebook-content { padding: 0 18px; display: none; overflow: hidden; background-color: #f1f1f1; } 37.1 Introduction This performs an analysis of the mouse haematopoietic stem cell (HSC) dataset generated with MARS-seq (Paul et al. 2015). Cells were extracted from multiple mice under different experimental conditions (i.e., sorting protocols) and libraries were prepared using a series of 384-well plates. 37.2 Data loading library(scRNAseq) sce.paul &lt;- PaulHSCData(ensembl=TRUE) library(AnnotationHub) ens.mm.v97 &lt;- AnnotationHub()[[&quot;AH73905&quot;]] anno &lt;- select(ens.mm.v97, keys=rownames(sce.paul), keytype=&quot;GENEID&quot;, columns=c(&quot;SYMBOL&quot;, &quot;SEQNAME&quot;)) rowData(sce.paul) &lt;- anno[match(rownames(sce.paul), anno$GENEID),] After loading and annotation, we inspect the resulting SingleCellExperiment object: sce.paul ## class: SingleCellExperiment ## dim: 17483 10368 ## metadata(0): ## assays(1): counts ## rownames(17483): ENSMUSG00000007777 ENSMUSG00000107002 ... ## ENSMUSG00000039068 ENSMUSG00000064363 ## rowData names(3): GENEID SYMBOL SEQNAME ## colnames(10368): W29953 W29954 ... W76335 W76336 ## colData names(13): Well_ID Seq_batch_ID ... CD34_measurement ## FcgR3_measurement ## reducedDimNames(0): ## altExpNames(0): 37.3 Quality control unfiltered &lt;- sce.paul For some reason, only one mitochondrial transcript is available, so we will perform quality control using only the library size and number of detected features. Ideally, we would simply block on the plate of origin to account for differences in processing, but unfortunately, it seems that many plates have a large proportion (if not outright majority) of cells with poor values for both metrics. 
We identify such plates based on the presence of very low outlier thresholds, for some arbitrary definition of “low”; we then redefine thresholds using information from the other (presumably high-quality) plates. library(scater) stats &lt;- perCellQCMetrics(sce.paul) qc &lt;- quickPerCellQC(stats, batch=sce.paul$Plate_ID) # Detecting batches with unusually low threshold values. lib.thresholds &lt;- attr(qc$low_lib_size, &quot;thresholds&quot;)[&quot;lower&quot;,] nfeat.thresholds &lt;- attr(qc$low_n_features, &quot;thresholds&quot;)[&quot;lower&quot;,] ignore &lt;- union(names(lib.thresholds)[lib.thresholds &lt; 100], names(nfeat.thresholds)[nfeat.thresholds &lt; 100]) # Repeating the QC using only the &quot;high-quality&quot; batches. qc2 &lt;- quickPerCellQC(stats, batch=sce.paul$Plate_ID, subset=!sce.paul$Plate_ID %in% ignore) sce.paul &lt;- sce.paul[,!qc2$discard] We examine the number of cells discarded for each reason. colSums(as.matrix(qc2)) ## low_lib_size low_n_features discard ## 1695 1781 1783 We create some diagnostic plots for each metric (Figure 37.1). colData(unfiltered) &lt;- cbind(colData(unfiltered), stats) unfiltered$discard &lt;- qc2$discard unfiltered$Plate_ID &lt;- factor(unfiltered$Plate_ID) gridExtra::grid.arrange( plotColData(unfiltered, y=&quot;sum&quot;, x=&quot;Plate_ID&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Total count&quot;), plotColData(unfiltered, y=&quot;detected&quot;, x=&quot;Plate_ID&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Detected features&quot;), ncol=1 ) Figure 37.1: Distribution of each QC metric across cells in the Paul HSC dataset. Each point represents a cell and is colored according to whether that cell was discarded. 
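The two-pass strategy above can be sketched in base R on simulated library sizes. This is only an illustration of the idea, using a single metric and a cutoff of median minus 3 MADs on the log10 scale computed per plate; the actual thresholds come from `quickPerCellQC()`, which also handles the detected-features metric, and the 100-count cutoff for "absurdly low" is the same arbitrary choice used above.

```r
# Outlier-based "lower" threshold: median - 3 MADs on the log10 scale.
lower_threshold <- function(libsizes) {
    lx <- log10(libsizes)
    10^(median(lx) - 3 * mad(lx))
}

set.seed(100)
libsizes <- c(rnbinom(200, mu=5000, size=2),   # two healthy plates
              rnbinom(200, mu=6000, size=2),
              rnbinom(200, mu=50, size=2)) + 1 # one mostly-failed plate
plate <- rep(c("P1", "P2", "P3"), each=200)

# Pass 1: per-plate thresholds; plates with very low cutoffs are
# dominated by low-quality cells and cannot calibrate the filter.
per.plate <- vapply(split(libsizes, plate), lower_threshold, numeric(1))
ignore <- names(per.plate)[per.plate < 100]

# Pass 2: recompute the threshold from the remaining plates only,
# then apply it to every cell.
final <- lower_threshold(libsizes[!plate %in% ignore])
discard <- libsizes < final
```

The failed plate's own threshold sits below its (uniformly poor) cells, so only the second pass flags them for removal.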
37.4 Normalization library(scran) set.seed(101000110) clusters &lt;- quickCluster(sce.paul) sce.paul &lt;- computeSumFactors(sce.paul, clusters=clusters) sce.paul &lt;- logNormCounts(sce.paul) We examine some key metrics for the distribution of size factors, and compare it to the library sizes as a sanity check (Figure 37.2). summary(sizeFactors(sce.paul)) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.057 0.422 0.775 1.000 1.335 9.654 plot(librarySizeFactors(sce.paul), sizeFactors(sce.paul), pch=16, xlab=&quot;Library size factors&quot;, ylab=&quot;Deconvolution factors&quot;, log=&quot;xy&quot;) Figure 37.2: Relationship between the library size factors and the deconvolution size factors in the Paul HSC dataset. 37.5 Variance modelling We fit a mean-variance trend to the endogenous genes to detect highly variable genes. Unfortunately, the plates are confounded with an experimental treatment (Batch_desc) so we cannot block on the plate of origin. set.seed(00010101) dec.paul &lt;- modelGeneVarByPoisson(sce.paul) top.paul &lt;- getTopHVGs(dec.paul, prop=0.1) plot(dec.paul$mean, dec.paul$total, pch=16, cex=0.5, xlab=&quot;Mean of log-expression&quot;, ylab=&quot;Variance of log-expression&quot;) curve(metadata(dec.paul)$trend(x), col=&quot;blue&quot;, add=TRUE) Figure 37.3: Per-gene variance as a function of the mean for the log-expression values in the Paul HSC dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to simulated Poisson noise. 37.6 Dimensionality reduction set.seed(101010011) sce.paul &lt;- denoisePCA(sce.paul, technical=dec.paul, subset.row=top.paul) sce.paul &lt;- runTSNE(sce.paul, dimred=&quot;PCA&quot;) We check that the number of retained PCs is sensible. 
ncol(reducedDim(sce.paul, &quot;PCA&quot;)) ## [1] 13 37.7 Clustering snn.gr &lt;- buildSNNGraph(sce.paul, use.dimred=&quot;PCA&quot;, type=&quot;jaccard&quot;) colLabels(sce.paul) &lt;- factor(igraph::cluster_louvain(snn.gr)$membership) There is a strong relationship between the cluster and the experimental treatment (Figure 37.4), which is to be expected. Of course, this may also be attributable to some batch effect; the confounded nature of the experimental design makes it difficult to make any confident statements either way. tab &lt;- table(colLabels(sce.paul), sce.paul$Batch_desc) rownames(tab) &lt;- paste(&quot;Cluster&quot;, rownames(tab)) pheatmap::pheatmap(log10(tab+10), color=viridis::viridis(100)) Figure 37.4: Heatmap of the distribution of cells across clusters (rows) for each experimental treatment (column). plotTSNE(sce.paul, colour_by=&quot;label&quot;) Figure 37.5: Obligatory \\(t\\)-SNE plot of the Paul HSC dataset, where each point represents a cell and is colored according to the assigned cluster. plotTSNE(sce.paul, colour_by=&quot;label&quot;, other_fields=&quot;Batch_desc&quot;) + facet_wrap(~Batch_desc) Figure 37.6: Obligatory \\(t\\)-SNE plot of the Paul HSC dataset faceted by the treatment condition, where each point represents a cell and is colored according to the assigned cluster. 
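For intuition about the `type="jaccard"` weighting used in this chapter's clustering, here is a brute-force base-R sketch of a shared nearest-neighbour graph on toy coordinates. It is only a sketch: the real `buildSNNGraph()` uses efficient neighbour search and returns an igraph object, neither of which is reproduced here.

```r
# Shared nearest-neighbour weights: each cell's neighbour set is
# itself plus its k nearest neighbours, and the edge weight between
# two cells is the Jaccard similarity of their sets.
jaccard_snn <- function(coords, k=3) {
    d <- as.matrix(dist(coords))
    n <- nrow(coords)
    nn <- lapply(seq_len(n), function(i) head(order(d[i, ]), k + 1))
    w <- matrix(0, n, n)
    for (i in seq_len(n - 1)) {
        for (j in seq(i + 1, n)) {
            shared <- length(intersect(nn[[i]], nn[[j]]))
            total <- length(union(nn[[i]], nn[[j]]))
            w[i, j] <- w[j, i] <- shared / total
        }
    }
    w
}

set.seed(1)
coords <- rbind(matrix(rnorm(20), ncol=2),           # blob around (0, 0)
                matrix(rnorm(20, mean=10), ncol=2))  # blob around (10, 10)
w <- jaccard_snn(coords, k=3)
```

Cells in different blobs share no neighbours, so all between-blob weights are zero; community detection on such a graph (e.g. with `igraph::cluster_louvain()`) trivially separates the two groups.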
Session Info View session info R version 4.0.4 (2021-02-15) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.12-books/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.12-books/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] scran_1.18.5 scater_1.18.6 [3] ggplot2_3.3.3 AnnotationHub_2.22.0 [5] BiocFileCache_1.14.0 dbplyr_2.1.0 [7] ensembldb_2.14.0 AnnotationFilter_1.14.0 [9] GenomicFeatures_1.42.2 AnnotationDbi_1.52.0 [11] scRNAseq_2.4.0 SingleCellExperiment_1.12.0 [13] SummarizedExperiment_1.20.0 Biobase_2.50.0 [15] GenomicRanges_1.42.0 GenomeInfoDb_1.26.4 [17] IRanges_2.24.1 S4Vectors_0.28.1 [19] BiocGenerics_0.36.0 MatrixGenerics_1.2.1 [21] matrixStats_0.58.0 BiocStyle_2.18.1 [23] rebook_1.0.0 loaded via a namespace (and not attached): [1] igraph_1.2.6 lazyeval_0.2.2 [3] BiocParallel_1.24.1 digest_0.6.27 [5] htmltools_0.5.1.1 viridis_0.5.1 [7] fansi_0.4.2 magrittr_2.0.1 [9] memoise_2.0.0 limma_3.46.0 [11] Biostrings_2.58.0 askpass_1.1 [13] prettyunits_1.1.1 colorspace_2.0-0 [15] blob_1.2.1 rappdirs_0.3.3 [17] xfun_0.22 dplyr_1.0.5 [19] callr_3.5.1 crayon_1.4.1 [21] RCurl_1.98-1.3 jsonlite_1.7.2 [23] graph_1.68.0 glue_1.4.2 [25] gtable_0.3.0 zlibbioc_1.36.0 [27] XVector_0.30.0 DelayedArray_0.16.2 [29] BiocSingular_1.6.0 scales_1.1.1 [31] pheatmap_1.0.12 edgeR_3.32.1 [33] DBI_1.1.1 Rcpp_1.0.6 [35] viridisLite_0.3.0 xtable_1.8-4 [37] progress_1.2.2 dqrng_0.2.1 [39] bit_4.0.4 rsvd_1.0.3 [41] httr_1.4.2 RColorBrewer_1.1-2 [43] ellipsis_0.3.1 pkgconfig_2.0.3 [45] XML_3.99-0.6 farver_2.1.0 [47] scuttle_1.0.4 CodeDepends_0.6.5 [49] sass_0.3.1 
locfit_1.5-9.4 [51] utf8_1.2.1 labeling_0.4.2 [53] tidyselect_1.1.0 rlang_0.4.10 [55] later_1.1.0.1 munsell_0.5.0 [57] BiocVersion_3.12.0 tools_4.0.4 [59] cachem_1.0.4 generics_0.1.0 [61] RSQLite_2.2.4 ExperimentHub_1.16.0 [63] evaluate_0.14 stringr_1.4.0 [65] fastmap_1.1.0 yaml_2.2.1 [67] processx_3.4.5 knitr_1.31 [69] bit64_4.0.5 purrr_0.3.4 [71] sparseMatrixStats_1.2.1 mime_0.10 [73] xml2_1.3.2 biomaRt_2.46.3 [75] compiler_4.0.4 beeswarm_0.3.1 [77] curl_4.3 interactiveDisplayBase_1.28.0 [79] statmod_1.4.35 tibble_3.1.0 [81] bslib_0.2.4 stringi_1.5.3 [83] highr_0.8 ps_1.6.0 [85] lattice_0.20-41 bluster_1.0.0 [87] ProtGenerics_1.22.0 Matrix_1.3-2 [89] vctrs_0.3.6 pillar_1.5.1 [91] lifecycle_1.0.0 BiocManager_1.30.10 [93] jquerylib_0.1.3 BiocNeighbors_1.8.2 [95] cowplot_1.1.1 bitops_1.0-6 [97] irlba_2.3.3 httpuv_1.5.5 [99] rtracklayer_1.50.0 R6_2.5.0 [101] bookdown_0.21 promises_1.2.0.1 [103] gridExtra_2.3 vipor_0.4.5 [105] codetools_0.2-18 assertthat_0.2.1 [107] openssl_1.4.3 withr_2.4.1 [109] GenomicAlignments_1.26.0 Rsamtools_2.6.0 [111] GenomeInfoDbData_1.2.4 hms_1.0.0 [113] grid_4.0.4 beachmat_2.6.4 [115] rmarkdown_2.7 DelayedMatrixStats_1.12.3 [117] Rtsne_0.15 shiny_1.6.0 [119] ggbeeswarm_0.6.0 Bibliography "],["merged-hsc.html", "Chapter 38 Merged mouse HSC datasets 38.1 Introduction 38.2 Data loading 38.3 Setting up the merge 38.4 Merging the datasets 38.5 Combined analyses Session Info", " Chapter 38 Merged mouse HSC datasets .rebook-collapse { background-color: #eee; color: #444; cursor: pointer; padding: 18px; width: 100%; border: none; text-align: left; outline: none; font-size: 15px; } .rebook-content { padding: 0 18px; display: none; overflow: hidden; background-color: #f1f1f1; } 38.1 Introduction The blood is probably the most well-studied tissue in the single-cell field, mostly because everything is already dissociated “for free”. Of particular interest has been the use of single-cell genomics to study cell fate decisions in haematopoiesis. 
Indeed, it was not long ago that dueling interpretations of haematopoietic stem cell (HSC) datasets were a mainstay of single-cell conferences. Sadly, these times have mostly passed so we will instead entertain ourselves by combining a small number of these datasets into a single analysis. 38.2 Data loading View history #--- data-loading ---# library(scRNAseq) sce.nest &lt;- NestorowaHSCData() #--- gene-annotation ---# library(AnnotationHub) ens.mm.v97 &lt;- AnnotationHub()[[&quot;AH73905&quot;]] anno &lt;- select(ens.mm.v97, keys=rownames(sce.nest), keytype=&quot;GENEID&quot;, columns=c(&quot;SYMBOL&quot;, &quot;SEQNAME&quot;)) rowData(sce.nest) &lt;- anno[match(rownames(sce.nest), anno$GENEID),] #--- quality-control ---# library(scater) stats &lt;- perCellQCMetrics(sce.nest) qc &lt;- quickPerCellQC(stats, percent_subsets=&quot;altexps_ERCC_percent&quot;) sce.nest &lt;- sce.nest[,!qc$discard] #--- normalization ---# library(scran) set.seed(101000110) clusters &lt;- quickCluster(sce.nest) sce.nest &lt;- computeSumFactors(sce.nest, clusters=clusters) sce.nest &lt;- logNormCounts(sce.nest) #--- variance-modelling ---# set.seed(00010101) dec.nest &lt;- modelGeneVarWithSpikes(sce.nest, &quot;ERCC&quot;) top.nest &lt;- getTopHVGs(dec.nest, prop=0.1) sce.nest ## class: SingleCellExperiment ## dim: 46078 1656 ## metadata(0): ## assays(2): counts logcounts ## rownames(46078): ENSMUSG00000000001 ENSMUSG00000000003 ... ## ENSMUSG00000107391 ENSMUSG00000107392 ## rowData names(3): GENEID SYMBOL SEQNAME ## colnames(1656): HSPC_025 HSPC_031 ... Prog_852 Prog_810 ## colData names(3): cell.type FACS sizeFactor ## reducedDimNames(1): diffusion ## altExpNames(1): ERCC The Grun dataset requires a little bit of subsetting and re-analysis to only consider the sorted HSCs. 
View history #--- data-loading ---# library(scRNAseq) sce.grun.hsc &lt;- GrunHSCData(ensembl=TRUE) #--- gene-annotation ---# library(AnnotationHub) ens.mm.v97 &lt;- AnnotationHub()[[&quot;AH73905&quot;]] anno &lt;- select(ens.mm.v97, keys=rownames(sce.grun.hsc), keytype=&quot;GENEID&quot;, columns=c(&quot;SYMBOL&quot;, &quot;SEQNAME&quot;)) rowData(sce.grun.hsc) &lt;- anno[match(rownames(sce.grun.hsc), anno$GENEID),] #--- quality-control ---# library(scuttle) stats &lt;- perCellQCMetrics(sce.grun.hsc) qc &lt;- quickPerCellQC(stats, batch=sce.grun.hsc$protocol, subset=grepl(&quot;sorted&quot;, sce.grun.hsc$protocol)) sce.grun.hsc &lt;- sce.grun.hsc[,!qc$discard] library(scuttle) sce.grun.hsc &lt;- sce.grun.hsc[,sce.grun.hsc$protocol==&quot;sorted hematopoietic stem cells&quot;] sce.grun.hsc &lt;- logNormCounts(sce.grun.hsc) set.seed(11001) library(scran) dec.grun.hsc &lt;- modelGeneVarByPoisson(sce.grun.hsc) Finally, we will grab the Paul dataset, which we will also subset to only consider the unsorted myeloid population. This removes the various knockout conditions that just complicate matters. View history #--- data-loading ---# library(scRNAseq) sce.paul &lt;- PaulHSCData(ensembl=TRUE) #--- gene-annotation ---# library(AnnotationHub) ens.mm.v97 &lt;- AnnotationHub()[[&quot;AH73905&quot;]] anno &lt;- select(ens.mm.v97, keys=rownames(sce.paul), keytype=&quot;GENEID&quot;, columns=c(&quot;SYMBOL&quot;, &quot;SEQNAME&quot;)) rowData(sce.paul) &lt;- anno[match(rownames(sce.paul), anno$GENEID),] #--- quality-control ---# library(scater) stats &lt;- perCellQCMetrics(sce.paul) qc &lt;- quickPerCellQC(stats, batch=sce.paul$Plate_ID) # Detecting batches with unusually low threshold values. 
lib.thresholds &lt;- attr(qc$low_lib_size, &quot;thresholds&quot;)[&quot;lower&quot;,] nfeat.thresholds &lt;- attr(qc$low_n_features, &quot;thresholds&quot;)[&quot;lower&quot;,] ignore &lt;- union(names(lib.thresholds)[lib.thresholds &lt; 100], names(nfeat.thresholds)[nfeat.thresholds &lt; 100]) # Repeating the QC using only the &quot;high-quality&quot; batches. qc2 &lt;- quickPerCellQC(stats, batch=sce.paul$Plate_ID, subset=!sce.paul$Plate_ID %in% ignore) sce.paul &lt;- sce.paul[,!qc2$discard] sce.paul &lt;- sce.paul[,sce.paul$Batch_desc==&quot;Unsorted myeloid&quot;] sce.paul &lt;- logNormCounts(sce.paul) set.seed(00010010) dec.paul &lt;- modelGeneVarByPoisson(sce.paul) 38.3 Setting up the merge common &lt;- Reduce(intersect, list(rownames(sce.nest), rownames(sce.grun.hsc), rownames(sce.paul))) length(common) ## [1] 17147 Combining variances to obtain a single set of HVGs. combined.dec &lt;- combineVar( dec.nest[common,], dec.grun.hsc[common,], dec.paul[common,] ) hvgs &lt;- getTopHVGs(combined.dec, n=5000) Adjusting for gross differences in sequencing depth. library(batchelor) normed.sce &lt;- multiBatchNorm( Nestorowa=sce.nest[common,], Grun=sce.grun.hsc[common,], Paul=sce.paul[common,] ) 38.4 Merging the datasets We turn on auto.merge=TRUE to instruct fastMNN() to merge the batch that offers the largest number of MNNs. This aims to perform the “easiest” merges first, i.e., between the most replicate-like batches, before tackling merges between batches that have greater differences in their population composition. set.seed(1000010) merged &lt;- fastMNN(normed.sce, subset.row=hvgs, auto.merge=TRUE) Not too much variance lost inside each batch, hopefully. We also observe that the algorithm chose to merge the more diverse Nestorowa and Paul datasets before dealing with the HSC-only Grun dataset. 
metadata(merged)$merge.info[,c(&quot;left&quot;, &quot;right&quot;, &quot;lost.var&quot;)] ## DataFrame with 2 rows and 3 columns ## left right lost.var ## &lt;List&gt; &lt;List&gt; &lt;matrix&gt; ## 1 Paul Nestorowa 0.01069374:0.0000000:0.00739465 ## 2 Paul,Nestorowa Grun 0.00562344:0.0178334:0.00702615 38.5 Combined analyses The Grun dataset does not contribute to many clusters, consistent with a pure undifferentiated HSC population. Most of the other clusters contain contributions from the Nestorowa and Paul datasets, though some are unique to the Paul dataset. This may be due to incomplete correction, though we tend to think that these are Paul-specific subpopulations, given that the Nestorowa dataset does not have similarly sized unique clusters that might represent their uncorrected counterparts. library(bluster) colLabels(merged) &lt;- clusterRows(reducedDim(merged), NNGraphParam(cluster.fun=&quot;louvain&quot;)) table(Cluster=colLabels(merged), Batch=merged$batch) ## Batch ## Cluster Grun Nestorowa Paul ## 1 0 40 206 ## 2 0 19 0 ## 3 39 353 146 ## 4 0 6 29 ## 5 0 217 487 ## 6 0 162 522 ## 7 0 133 191 ## 8 22 411 94 ## 9 230 315 348 ## 10 0 0 385 ## 11 0 0 397 While I prefer \\(t\\)-SNE plots, we’ll switch to a UMAP plot to highlight some of the trajectory-like structure across clusters (Figure 38.1). library(scater) set.seed(101010101) merged &lt;- runUMAP(merged, dimred=&quot;corrected&quot;) gridExtra::grid.arrange( plotUMAP(merged, colour_by=&quot;label&quot;), plotUMAP(merged, colour_by=&quot;batch&quot;), ncol=2 ) Figure 38.1: Obligatory UMAP plot of the merged HSC datasets, where each point represents a cell and is colored by its assigned cluster (left) or the batch of origin (right). In fact, we might as well compute a trajectory right now. TSCAN constructs a reasonable minimum spanning tree but the path choices are somewhat incongruent with the UMAP coordinates (Figure 38.2). 
This is most likely due to the fact that TSCAN operates on cluster centroids, which is simple and efficient but does not consider the variance of cells within each cluster. It is entirely possible for two well-separated clusters to be closer than two adjacent clusters if the latter span a wider region of the coordinate space. library(TSCAN) pseudo.out &lt;- quickPseudotime(merged, use.dimred=&quot;corrected&quot;, outgroup=TRUE) common.pseudo &lt;- rowMeans(pseudo.out$ordering, na.rm=TRUE) plotUMAP(merged, colour_by=I(common.pseudo), text_by=&quot;label&quot;, text_colour=&quot;red&quot;) + geom_line(data=pseudo.out$connected$UMAP, mapping=aes(x=dim1, y=dim2, group=edge)) Figure 38.2: Another UMAP plot of the merged HSC datasets, where each point represents a cell and is colored by its TSCAN pseudotime. The lines correspond to the edges of the MST across cluster centers. To fix this, we construct the minimum spanning tree using distances based on pairs of mutual nearest neighbors between clusters. This focuses on the closeness of the boundaries of each pair of clusters rather than their centroids, ensuring that adjacent clusters are connected even if their centroids are far apart. Doing so yields a trajectory that is more consistent with the visual connections on the UMAP plot (Figure 38.3). pseudo.out2 &lt;- quickPseudotime(merged, use.dimred=&quot;corrected&quot;, with.mnn=TRUE, outgroup=TRUE) common.pseudo2 &lt;- rowMeans(pseudo.out2$ordering, na.rm=TRUE) plotUMAP(merged, colour_by=I(common.pseudo2), text_by=&quot;label&quot;, text_colour=&quot;red&quot;) + geom_line(data=pseudo.out2$connected$UMAP, mapping=aes(x=dim1, y=dim2, group=edge)) Figure 38.3: Yet another UMAP plot of the merged HSC datasets, where each point represents a cell and is colored by its TSCAN pseudotime. The lines correspond to the edges of the MST across cluster centers. 
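The centroid-based MST at the heart of this discussion can be sketched in base R with Prim's algorithm. `mst_edges()` below is a hypothetical helper on toy centroids, not part of TSCAN; the real `quickPseudotime()` call additionally orders cells along the tree to produce pseudotimes.

```r
# Minimum spanning tree over cluster centroids via Prim's algorithm:
# repeatedly attach the out-of-tree centroid closest to the tree.
mst_edges <- function(centroids) {
    d <- as.matrix(dist(centroids))
    n <- nrow(d)
    in.tree <- 1L
    edges <- matrix(0L, 0, 2)
    while (length(in.tree) < n) {
        out <- setdiff(seq_len(n), in.tree)
        sub <- d[in.tree, out, drop=FALSE]
        hit <- which(sub == min(sub), arr.ind=TRUE)[1, ]
        edges <- rbind(edges, c(in.tree[hit[1]], out[hit[2]]))
        in.tree <- c(in.tree, out[hit[2]])
    }
    edges
}

# Three clusters along a line: the tree links 1-2 and 2-3, never 1-3,
# as the MST always prefers the shorter centroid-to-centroid hop.
centroids <- rbind(c(0, 0), c(5, 0), c(10, 0))
mst_edges(centroids)
```

Because only the centroids enter the distance matrix, the within-cluster spread that motivates the MNN-based alternative above is invisible to this construction.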
Session Info View session info R version 4.0.4 (2021-02-15) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.12-books/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.12-books/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] TSCAN_1.28.0 scater_1.18.6 [3] ggplot2_3.3.3 bluster_1.0.0 [5] batchelor_1.6.2 scran_1.18.5 [7] scuttle_1.0.4 SingleCellExperiment_1.12.0 [9] SummarizedExperiment_1.20.0 Biobase_2.50.0 [11] GenomicRanges_1.42.0 GenomeInfoDb_1.26.4 [13] IRanges_2.24.1 S4Vectors_0.28.1 [15] BiocGenerics_0.36.0 MatrixGenerics_1.2.1 [17] matrixStats_0.58.0 BiocStyle_2.18.1 [19] rebook_1.0.0 loaded via a namespace (and not attached): [1] ggbeeswarm_0.6.0 colorspace_2.0-0 [3] ellipsis_0.3.1 mclust_5.4.7 [5] XVector_0.30.0 BiocNeighbors_1.8.2 [7] farver_2.1.0 fansi_0.4.2 [9] codetools_0.2-18 splines_4.0.4 [11] sparseMatrixStats_1.2.1 knitr_1.31 [13] jsonlite_1.7.2 ResidualMatrix_1.0.0 [15] graph_1.68.0 uwot_0.1.10 [17] shiny_1.6.0 BiocManager_1.30.10 [19] compiler_4.0.4 dqrng_0.2.1 [21] fastmap_1.1.0 assertthat_0.2.1 [23] Matrix_1.3-2 limma_3.46.0 [25] later_1.1.0.1 BiocSingular_1.6.0 [27] htmltools_0.5.1.1 tools_4.0.4 [29] rsvd_1.0.3 igraph_1.2.6 [31] gtable_0.3.0 glue_1.4.2 [33] GenomeInfoDbData_1.2.4 dplyr_1.0.5 [35] Rcpp_1.0.6 jquerylib_0.1.3 [37] vctrs_0.3.6 nlme_3.1-152 [39] DelayedMatrixStats_1.12.3 xfun_0.22 [41] stringr_1.4.0 ps_1.6.0 [43] beachmat_2.6.4 mime_0.10 [45] lifecycle_1.0.0 irlba_2.3.3 [47] gtools_3.8.2 statmod_1.4.35 [49] XML_3.99-0.6 edgeR_3.32.1 [51] zlibbioc_1.36.0 scales_1.1.1 [53] promises_1.2.0.1 yaml_2.2.1 [55] 
gridExtra_2.3 sass_0.3.1 [57] fastICA_1.2-2 stringi_1.5.3 [59] highr_0.8 caTools_1.18.1 [61] BiocParallel_1.24.1 rlang_0.4.10 [63] pkgconfig_2.0.3 bitops_1.0-6 [65] evaluate_0.14 lattice_0.20-41 [67] purrr_0.3.4 CodeDepends_0.6.5 [69] labeling_0.4.2 cowplot_1.1.1 [71] processx_3.4.5 tidyselect_1.1.0 [73] RcppAnnoy_0.0.18 plyr_1.8.6 [75] magrittr_2.0.1 bookdown_0.21 [77] R6_2.5.0 gplots_3.1.1 [79] generics_0.1.0 combinat_0.0-8 [81] DelayedArray_0.16.2 DBI_1.1.1 [83] pillar_1.5.1 withr_2.4.1 [85] mgcv_1.8-33 RCurl_1.98-1.3 [87] tibble_3.1.0 crayon_1.4.1 [89] KernSmooth_2.23-18 utf8_1.2.1 [91] rmarkdown_2.7 viridis_0.5.1 [93] locfit_1.5-9.4 grid_4.0.4 [95] callr_3.5.1 digest_0.6.27 [97] xtable_1.8-4 httpuv_1.5.5 [99] munsell_0.5.0 beeswarm_0.3.1 [101] viridisLite_0.3.0 vipor_0.4.5 [103] bslib_0.2.4 "],["pijuan-sala-chimeric-mouse-embryo-10x-genomics.html", "Chapter 39 Pijuan-Sala chimeric mouse embryo (10X Genomics) 39.1 Introduction 39.2 Data loading 39.3 Quality control 39.4 Normalization 39.5 Variance modelling 39.6 Merging 39.7 Clustering 39.8 Dimensionality reduction Session Info", " Chapter 39 Pijuan-Sala chimeric mouse embryo (10X Genomics) .rebook-collapse { background-color: #eee; color: #444; cursor: pointer; padding: 18px; width: 100%; border: none; text-align: left; outline: none; font-size: 15px; } .rebook-content { padding: 0 18px; display: none; overflow: hidden; background-color: #f1f1f1; } 39.1 Introduction This performs an analysis of the Pijuan-Sala et al. (2019) dataset on mouse gastrulation. Here, we examine chimeric embryos at the E8.5 stage of development where td-Tomato-positive embryonic stem cells (ESCs) were injected into a wild-type blastocyst. 39.2 Data loading library(MouseGastrulationData) sce.chimera &lt;- WTChimeraData(samples=5:10) sce.chimera ## class: SingleCellExperiment ## dim: 29453 20935 ## metadata(0): ## assays(1): counts ## rownames(29453): ENSMUSG00000051951 ENSMUSG00000089699 ... 
## ENSMUSG00000095742 tomato-td ## rowData names(2): ENSEMBL SYMBOL ## colnames(20935): cell_9769 cell_9770 ... cell_30702 cell_30703 ## colData names(11): cell barcode ... doub.density sizeFactor ## reducedDimNames(2): pca.corrected.E7.5 pca.corrected.E8.5 ## altExpNames(0): library(scater) rownames(sce.chimera) &lt;- uniquifyFeatureNames( rowData(sce.chimera)$ENSEMBL, rowData(sce.chimera)$SYMBOL) 39.3 Quality control Quality control on the cells has already been performed by the authors, so we will not repeat it here. We additionally remove cells that are labelled as stripped nuclei or doublets. drop &lt;- sce.chimera$celltype.mapped %in% c(&quot;stripped&quot;, &quot;Doublet&quot;) sce.chimera &lt;- sce.chimera[,!drop] 39.4 Normalization We use the pre-computed size factors in sce.chimera. sce.chimera &lt;- logNormCounts(sce.chimera) 39.5 Variance modelling We retain all genes with any positive biological component, to preserve as much signal as possible across a very heterogeneous dataset. library(scran) dec.chimera &lt;- modelGeneVar(sce.chimera, block=sce.chimera$sample) chosen.hvgs &lt;- dec.chimera$bio &gt; 0 par(mfrow=c(1,2)) blocked.stats &lt;- dec.chimera$per.block for (i in colnames(blocked.stats)) { current &lt;- blocked.stats[[i]] plot(current$mean, current$total, main=i, pch=16, cex=0.5, xlab=&quot;Mean of log-expression&quot;, ylab=&quot;Variance of log-expression&quot;) curfit &lt;- metadata(current) curve(curfit$trend(x), col=&#39;dodgerblue&#39;, add=TRUE, lwd=2) } Figure 39.1: Per-gene variance as a function of the mean for the log-expression values in the Pijuan-Sala chimeric mouse embryo dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to the variances. Figure 39.2: Per-gene variance as a function of the mean for the log-expression values in the Pijuan-Sala chimeric mouse embryo dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to the variances. 
Figure 39.3: Per-gene variance as a function of the mean for the log-expression values in the Pijuan-Sala chimeric mouse embryo dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to the variances. 39.6 Merging We use a hierarchical merge to first merge together replicates with the same genotype, and then merge samples across different genotypes. library(batchelor) set.seed(01001001) merged &lt;- correctExperiments(sce.chimera, batch=sce.chimera$sample, subset.row=chosen.hvgs, PARAM=FastMnnParam( merge.order=list( list(1,3,5), # WT (3 replicates) list(2,4,6) # td-Tomato (3 replicates) ) ) ) We use the percentage of variance lost as a diagnostic: metadata(merged)$merge.info$lost.var ## 5 6 7 8 9 10 ## [1,] 0.000e+00 0.0204433 0.000e+00 0.0169567 0.000000 0.000000 ## [2,] 0.000e+00 0.0007389 0.000e+00 0.0004409 0.000000 0.015474 ## [3,] 3.090e-02 0.0000000 2.012e-02 0.0000000 0.000000 0.000000 ## [4,] 9.024e-05 0.0000000 8.272e-05 0.0000000 0.018047 0.000000 ## [5,] 4.321e-03 0.0072518 4.124e-03 0.0078280 0.003831 0.007786 39.7 Clustering g &lt;- buildSNNGraph(merged, use.dimred=&quot;corrected&quot;) clusters &lt;- igraph::cluster_louvain(g) colLabels(merged) &lt;- factor(clusters$membership) We examine the distribution of cells across clusters and samples. 
table(Cluster=colLabels(merged), Sample=merged$sample) ## Sample ## Cluster 5 6 7 8 9 10 ## 1 152 72 85 88 164 386 ## 2 19 7 13 17 20 36 ## 3 130 96 109 63 159 311 ## 4 43 35 81 81 87 353 ## 5 68 31 120 107 83 197 ## 6 122 65 64 52 63 141 ## 7 187 113 322 587 458 541 ## 8 47 22 84 50 90 131 ## 9 182 47 231 192 216 391 ## 10 95 19 36 18 50 34 ## 11 9 7 18 13 30 27 ## 12 110 69 73 96 127 252 ## 13 0 2 0 51 0 5 ## 14 38 39 50 47 126 123 ## 15 98 16 164 125 368 273 ## 16 146 37 132 110 231 216 ## 17 114 43 44 37 40 154 ## 18 78 45 189 119 340 493 ## 19 86 20 64 54 153 77 ## 20 159 77 137 101 147 401 ## 21 2 1 7 3 65 133 ## 22 11 16 20 9 47 57 ## 23 1 5 0 84 0 66 ## 24 170 47 282 173 426 542 ## 25 109 23 117 55 271 285 ## 26 122 72 298 572 296 776 39.8 Dimensionality reduction We use an external algorithm to compute nearest neighbors for greater speed. merged &lt;- runTSNE(merged, dimred=&quot;corrected&quot;, external_neighbors=TRUE) merged &lt;- runUMAP(merged, dimred=&quot;corrected&quot;, external_neighbors=TRUE) gridExtra::grid.arrange( plotTSNE(merged, colour_by=&quot;label&quot;, text_by=&quot;label&quot;, text_col=&quot;red&quot;), plotTSNE(merged, colour_by=&quot;batch&quot;) ) Figure 39.4: Obligatory \\(t\\)-SNE plots of the Pijuan-Sala chimeric mouse embryo dataset, where each point represents a cell and is colored according to the assigned cluster (top) or sample of origin (bottom). 
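As a quick sanity check on the merge, we can total each sample's lost variance across all merge steps by summing the columns of the lost.var matrix reported in the merging section (values transcribed from that printed output):

```r
# Rows are merge steps, columns are samples 5-10; values are the
# proportion of within-batch variance removed at each step.
lost.var <- rbind(
    c(0.000e+00, 0.0204433, 0.000e+00, 0.0169567, 0.000000, 0.000000),
    c(0.000e+00, 0.0007389, 0.000e+00, 0.0004409, 0.000000, 0.015474),
    c(3.090e-02, 0.0000000, 2.012e-02, 0.0000000, 0.000000, 0.000000),
    c(9.024e-05, 0.0000000, 8.272e-05, 0.0000000, 0.018047, 0.000000),
    c(4.321e-03, 0.0072518, 4.124e-03, 0.0078280, 0.003831, 0.007786))
colnames(lost.var) <- c("5", "6", "7", "8", "9", "10")
round(colSums(lost.var), 4)  # total proportion of variance lost per sample
```

The totals are all below 4%, suggesting that the correction has not removed much genuine within-batch structure.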
Session Info View session info R version 4.0.4 (2021-02-15) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.12-books/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.12-books/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] batchelor_1.6.2 scran_1.18.5 [3] scater_1.18.6 ggplot2_3.3.3 [5] MouseGastrulationData_1.4.0 SingleCellExperiment_1.12.0 [7] SummarizedExperiment_1.20.0 Biobase_2.50.0 [9] GenomicRanges_1.42.0 GenomeInfoDb_1.26.4 [11] IRanges_2.24.1 S4Vectors_0.28.1 [13] BiocGenerics_0.36.0 MatrixGenerics_1.2.1 [15] matrixStats_0.58.0 BiocStyle_2.18.1 [17] rebook_1.0.0 loaded via a namespace (and not attached): [1] Rtsne_0.15 ggbeeswarm_0.6.0 [3] colorspace_2.0-0 ellipsis_0.3.1 [5] scuttle_1.0.4 bluster_1.0.0 [7] XVector_0.30.0 BiocNeighbors_1.8.2 [9] farver_2.1.0 bit64_4.0.5 [11] interactiveDisplayBase_1.28.0 AnnotationDbi_1.52.0 [13] fansi_0.4.2 codetools_0.2-18 [15] sparseMatrixStats_1.2.1 cachem_1.0.4 [17] knitr_1.31 jsonlite_1.7.2 [19] ResidualMatrix_1.0.0 dbplyr_2.1.0 [21] uwot_0.1.10 graph_1.68.0 [23] shiny_1.6.0 BiocManager_1.30.10 [25] compiler_4.0.4 httr_1.4.2 [27] dqrng_0.2.1 assertthat_0.2.1 [29] Matrix_1.3-2 fastmap_1.1.0 [31] limma_3.46.0 later_1.1.0.1 [33] BiocSingular_1.6.0 htmltools_0.5.1.1 [35] tools_4.0.4 rsvd_1.0.3 [37] igraph_1.2.6 gtable_0.3.0 [39] glue_1.4.2 GenomeInfoDbData_1.2.4 [41] dplyr_1.0.5 rappdirs_0.3.3 [43] Rcpp_1.0.6 jquerylib_0.1.3 [45] vctrs_0.3.6 ExperimentHub_1.16.0 [47] DelayedMatrixStats_1.12.3 xfun_0.22 [49] stringr_1.4.0 ps_1.6.0 [51] beachmat_2.6.4 mime_0.10 [53] lifecycle_1.0.0 irlba_2.3.3 [55] 
statmod_1.4.35 XML_3.99-0.6 [57] edgeR_3.32.1 AnnotationHub_2.22.0 [59] zlibbioc_1.36.0 scales_1.1.1 [61] promises_1.2.0.1 yaml_2.2.1 [63] curl_4.3 memoise_2.0.0 [65] gridExtra_2.3 sass_0.3.1 [67] stringi_1.5.3 RSQLite_2.2.4 [69] highr_0.8 BiocVersion_3.12.0 [71] BiocParallel_1.24.1 rlang_0.4.10 [73] pkgconfig_2.0.3 bitops_1.0-6 [75] evaluate_0.14 lattice_0.20-41 [77] purrr_0.3.4 labeling_0.4.2 [79] CodeDepends_0.6.5 cowplot_1.1.1 [81] bit_4.0.4 processx_3.4.5 [83] tidyselect_1.1.0 magrittr_2.0.1 [85] bookdown_0.21 R6_2.5.0 [87] generics_0.1.0 DelayedArray_0.16.2 [89] DBI_1.1.1 pillar_1.5.1 [91] withr_2.4.1 RCurl_1.98-1.3 [93] tibble_3.1.0 crayon_1.4.1 [95] utf8_1.2.1 BiocFileCache_1.14.0 [97] rmarkdown_2.7 viridis_0.5.1 [99] locfit_1.5-9.4 grid_4.0.4 [101] blob_1.2.1 callr_3.5.1 [103] digest_0.6.27 xtable_1.8-4 [105] httpuv_1.5.5 munsell_0.5.0 [107] beeswarm_0.3.1 viridisLite_0.3.0 [109] vipor_0.4.5 bslib_0.2.4 Bibliography "],["bach-mouse-mammary-gland-10x-genomics.html", "Chapter 40 Bach mouse mammary gland (10X Genomics) 40.1 Introduction 40.2 Data loading 40.3 Quality control 40.4 Normalization 40.5 Variance modelling 40.6 Dimensionality reduction 40.7 Clustering Session Info", " Chapter 40 Bach mouse mammary gland (10X Genomics) .rebook-collapse { background-color: #eee; color: #444; cursor: pointer; padding: 18px; width: 100%; border: none; text-align: left; outline: none; font-size: 15px; } .rebook-content { padding: 0 18px; display: none; overflow: hidden; background-color: #f1f1f1; } 40.1 Introduction This performs an analysis of the Bach et al. (2017) 10X Genomics dataset, from which we will consider a single sample of epithelial cells from the mouse mammary gland during gestation. 
40.2 Data loading library(scRNAseq) sce.mam &lt;- BachMammaryData(samples=&quot;G_1&quot;) library(scater) rownames(sce.mam) &lt;- uniquifyFeatureNames( rowData(sce.mam)$Ensembl, rowData(sce.mam)$Symbol) library(AnnotationHub) ens.mm.v97 &lt;- AnnotationHub()[[&quot;AH73905&quot;]] rowData(sce.mam)$SEQNAME &lt;- mapIds(ens.mm.v97, keys=rowData(sce.mam)$Ensembl, keytype=&quot;GENEID&quot;, column=&quot;SEQNAME&quot;) 40.3 Quality control unfiltered &lt;- sce.mam is.mito &lt;- rowData(sce.mam)$SEQNAME == &quot;MT&quot; stats &lt;- perCellQCMetrics(sce.mam, subsets=list(Mito=which(is.mito))) qc &lt;- quickPerCellQC(stats, percent_subsets=&quot;subsets_Mito_percent&quot;) sce.mam &lt;- sce.mam[,!qc$discard] colData(unfiltered) &lt;- cbind(colData(unfiltered), stats) unfiltered$discard &lt;- qc$discard gridExtra::grid.arrange( plotColData(unfiltered, y=&quot;sum&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Total count&quot;), plotColData(unfiltered, y=&quot;detected&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Detected features&quot;), plotColData(unfiltered, y=&quot;subsets_Mito_percent&quot;, colour_by=&quot;discard&quot;) + ggtitle(&quot;Mito percent&quot;), ncol=2 ) Figure 40.1: Distribution of each QC metric across cells in the Bach mammary gland dataset. Each point represents a cell and is colored according to whether that cell was discarded. plotColData(unfiltered, x=&quot;sum&quot;, y=&quot;subsets_Mito_percent&quot;, colour_by=&quot;discard&quot;) + scale_x_log10() Figure 40.2: Percentage of mitochondrial reads in each cell in the Bach mammary gland dataset compared to its total count. Each point represents a cell and is colored according to whether that cell was discarded. 
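The quickPerCellQC() call above is built around MAD-based outlier detection: a cell is discarded if a metric lies more than 3 median absolute deviations from the median, on the log scale for library sizes and in the upper tail only for mitochondrial percentages. A minimal base-R sketch of that idea (not scater's exact implementation) with made-up toy values:

```r
# Sketch of MAD-based outlier calls in the spirit of scater::isOutlier():
# 3 MADs from the median, log scale for totals, upper tail for mito.
low_outlier <- function(x, nmads=3) {
    lx <- log(x)
    lx < median(lx) - nmads * mad(lx)
}
high_outlier <- function(x, nmads=3) {
    x > median(x) + nmads * mad(x)
}

total <- c(seq(9000, 11000, length.out=50), 100)  # one cell with a tiny library
mito  <- c(seq(1, 5, length.out=50), 60)          # one cell with high mito content
discard <- low_outlier(total) | high_outlier(mito)
which(discard)  # -> 51, the aberrant cell
```

Because the thresholds are derived from the bulk of the distribution, the single aberrant cell is flagged without any hard-coded cutoffs.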
colSums(as.matrix(qc)) ## low_lib_size low_n_features high_subsets_Mito_percent ## 0 0 143 ## discard ## 143 40.4 Normalization library(scran) set.seed(101000110) clusters &lt;- quickCluster(sce.mam) sce.mam &lt;- computeSumFactors(sce.mam, clusters=clusters) sce.mam &lt;- logNormCounts(sce.mam) summary(sizeFactors(sce.mam)) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.271 0.522 0.758 1.000 1.204 10.958 plot(librarySizeFactors(sce.mam), sizeFactors(sce.mam), pch=16, xlab=&quot;Library size factors&quot;, ylab=&quot;Deconvolution factors&quot;, log=&quot;xy&quot;) Figure 40.3: Relationship between the library size factors and the deconvolution size factors in the Bach mammary gland dataset. 40.5 Variance modelling We use a Poisson-based technical trend to capture more genuine biological variation in the biological component. set.seed(00010101) dec.mam &lt;- modelGeneVarByPoisson(sce.mam) top.mam &lt;- getTopHVGs(dec.mam, prop=0.1) plot(dec.mam$mean, dec.mam$total, pch=16, cex=0.5, xlab=&quot;Mean of log-expression&quot;, ylab=&quot;Variance of log-expression&quot;) curfit &lt;- metadata(dec.mam) curve(curfit$trend(x), col=&#39;dodgerblue&#39;, add=TRUE, lwd=2) Figure 40.4: Per-gene variance as a function of the mean for the log-expression values in the Bach mammary gland dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to simulated Poisson counts. 40.6 Dimensionality reduction library(BiocSingular) set.seed(101010011) sce.mam &lt;- denoisePCA(sce.mam, technical=dec.mam, subset.row=top.mam) sce.mam &lt;- runTSNE(sce.mam, dimred=&quot;PCA&quot;) ncol(reducedDim(sce.mam, &quot;PCA&quot;)) ## [1] 15 40.7 Clustering We use a higher k to obtain coarser clusters (for use in doubletCluster() later). 
snn.gr &lt;- buildSNNGraph(sce.mam, use.dimred=&quot;PCA&quot;, k=25) colLabels(sce.mam) &lt;- factor(igraph::cluster_walktrap(snn.gr)$membership) table(colLabels(sce.mam)) ## ## 1 2 3 4 5 6 7 8 9 10 ## 550 799 716 452 24 84 52 39 32 24 plotTSNE(sce.mam, colour_by=&quot;label&quot;) Figure 40.5: Obligatory \\(t\\)-SNE plot of the Bach mammary gland dataset, where each point represents a cell and is colored according to the assigned cluster. Session Info View session info R version 4.0.4 (2021-02-15) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.12-books/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.12-books/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] BiocSingular_1.6.0 scran_1.18.5 [3] AnnotationHub_2.22.0 BiocFileCache_1.14.0 [5] dbplyr_2.1.0 scater_1.18.6 [7] ggplot2_3.3.3 ensembldb_2.14.0 [9] AnnotationFilter_1.14.0 GenomicFeatures_1.42.2 [11] AnnotationDbi_1.52.0 scRNAseq_2.4.0 [13] SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0 [15] Biobase_2.50.0 GenomicRanges_1.42.0 [17] GenomeInfoDb_1.26.4 IRanges_2.24.1 [19] S4Vectors_0.28.1 BiocGenerics_0.36.0 [21] MatrixGenerics_1.2.1 matrixStats_0.58.0 [23] BiocStyle_2.18.1 rebook_1.0.0 loaded via a namespace (and not attached): [1] igraph_1.2.6 lazyeval_0.2.2 [3] BiocParallel_1.24.1 digest_0.6.27 [5] htmltools_0.5.1.1 viridis_0.5.1 [7] fansi_0.4.2 magrittr_2.0.1 [9] memoise_2.0.0 limma_3.46.0 [11] Biostrings_2.58.0 askpass_1.1 [13] prettyunits_1.1.1 colorspace_2.0-0 [15] blob_1.2.1 rappdirs_0.3.3 [17] xfun_0.22 dplyr_1.0.5 [19] callr_3.5.1 crayon_1.4.1 [21] RCurl_1.98-1.3 
jsonlite_1.7.2 [23] graph_1.68.0 glue_1.4.2 [25] gtable_0.3.0 zlibbioc_1.36.0 [27] XVector_0.30.0 DelayedArray_0.16.2 [29] scales_1.1.1 edgeR_3.32.1 [31] DBI_1.1.1 Rcpp_1.0.6 [33] viridisLite_0.3.0 xtable_1.8-4 [35] progress_1.2.2 dqrng_0.2.1 [37] bit_4.0.4 rsvd_1.0.3 [39] httr_1.4.2 ellipsis_0.3.1 [41] pkgconfig_2.0.3 XML_3.99-0.6 [43] farver_2.1.0 scuttle_1.0.4 [45] CodeDepends_0.6.5 sass_0.3.1 [47] locfit_1.5-9.4 utf8_1.2.1 [49] tidyselect_1.1.0 labeling_0.4.2 [51] rlang_0.4.10 later_1.1.0.1 [53] munsell_0.5.0 BiocVersion_3.12.0 [55] tools_4.0.4 cachem_1.0.4 [57] generics_0.1.0 RSQLite_2.2.4 [59] ExperimentHub_1.16.0 evaluate_0.14 [61] stringr_1.4.0 fastmap_1.1.0 [63] yaml_2.2.1 processx_3.4.5 [65] knitr_1.31 bit64_4.0.5 [67] purrr_0.3.4 sparseMatrixStats_1.2.1 [69] mime_0.10 xml2_1.3.2 [71] biomaRt_2.46.3 compiler_4.0.4 [73] beeswarm_0.3.1 curl_4.3 [75] interactiveDisplayBase_1.28.0 statmod_1.4.35 [77] tibble_3.1.0 bslib_0.2.4 [79] stringi_1.5.3 highr_0.8 [81] ps_1.6.0 lattice_0.20-41 [83] bluster_1.0.0 ProtGenerics_1.22.0 [85] Matrix_1.3-2 vctrs_0.3.6 [87] pillar_1.5.1 lifecycle_1.0.0 [89] BiocManager_1.30.10 jquerylib_0.1.3 [91] BiocNeighbors_1.8.2 cowplot_1.1.1 [93] bitops_1.0-6 irlba_2.3.3 [95] httpuv_1.5.5 rtracklayer_1.50.0 [97] R6_2.5.0 bookdown_0.21 [99] promises_1.2.0.1 gridExtra_2.3 [101] vipor_0.4.5 codetools_0.2-18 [103] assertthat_0.2.1 openssl_1.4.3 [105] withr_2.4.1 GenomicAlignments_1.26.0 [107] Rsamtools_2.6.0 GenomeInfoDbData_1.2.4 [109] hms_1.0.0 grid_4.0.4 [111] beachmat_2.6.4 rmarkdown_2.7 [113] DelayedMatrixStats_1.12.3 Rtsne_0.15 [115] shiny_1.6.0 ggbeeswarm_0.6.0 Bibliography "],["messmer-hesc.html", "Chapter 41 Messmer human ESC (Smart-seq2) 41.1 Introduction 41.2 Data loading 41.3 Quality control 41.4 Normalization 41.5 Cell cycle phase assignment 41.6 Feature selection 41.7 Batch correction 41.8 Dimensionality Reduction Session Info", " Chapter 41 Messmer human ESC (Smart-seq2) .rebook-collapse { background-color: #eee; color: #444; 
cursor: pointer; padding: 18px; width: 100%; border: none; text-align: left; outline: none; font-size: 15px; } .rebook-content { padding: 0 18px; display: none; overflow: hidden; background-color: #f1f1f1; } 41.1 Introduction This performs an analysis of the human embryonic stem cell (hESC) dataset generated with Smart-seq2 (Messmer et al. 2019), which contains several plates of naive and primed hESCs. The chapter’s code is based on the steps in the paper’s GitHub repository, with some additional steps for cell cycle effect removal contributed by Philippe Boileau. 41.2 Data loading Converting the batch to a factor, to make life easier later on. library(scRNAseq) sce.mess &lt;- MessmerESCData() sce.mess$`experiment batch` &lt;- factor(sce.mess$`experiment batch`) library(AnnotationHub) ens.hs.v97 &lt;- AnnotationHub()[[&quot;AH73881&quot;]] anno &lt;- select(ens.hs.v97, keys=rownames(sce.mess), keytype=&quot;GENEID&quot;, columns=c(&quot;SYMBOL&quot;)) rowData(sce.mess) &lt;- anno[match(rownames(sce.mess), anno$GENEID),] 41.3 Quality control Let’s have a look at the QC statistics. 
colSums(as.matrix(filtered)) ## low_lib_size low_n_features high_subsets_Mito_percent ## 107 99 22 ## high_altexps_ERCC_percent discard ## 117 156 gridExtra::grid.arrange( plotColData(original, x=&quot;experiment batch&quot;, y=&quot;sum&quot;, colour_by=I(filtered$discard), other_field=&quot;phenotype&quot;) + facet_wrap(~phenotype) + scale_y_log10(), plotColData(original, x=&quot;experiment batch&quot;, y=&quot;detected&quot;, colour_by=I(filtered$discard), other_field=&quot;phenotype&quot;) + facet_wrap(~phenotype) + scale_y_log10(), plotColData(original, x=&quot;experiment batch&quot;, y=&quot;subsets_Mito_percent&quot;, colour_by=I(filtered$discard), other_field=&quot;phenotype&quot;) + facet_wrap(~phenotype), plotColData(original, x=&quot;experiment batch&quot;, y=&quot;altexps_ERCC_percent&quot;, colour_by=I(filtered$discard), other_field=&quot;phenotype&quot;) + facet_wrap(~phenotype), ncol=1 ) Figure 41.1: Distribution of QC metrics across batches (x-axis) and phenotypes (facets) for cells in the Messmer hESC dataset. Each point is a cell and is colored by whether it was discarded. 
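Note that the per-reason counts above (107 + 99 + 22 + 117) exceed the 156 discarded cells: a cell can fail several filters at once, and the discard column is simply the union of the individual flags. A tiny base-R illustration with toy flags:

```r
# Each QC reason is a logical vector; discard is their elementwise union,
# so sum(discard) is at most the sum of the individual counts.
low_lib   <- c(TRUE,  FALSE, FALSE, TRUE)
high_mito <- c(TRUE,  TRUE,  FALSE, FALSE)
discard   <- low_lib | high_mito
colSums(cbind(low_lib, high_mito, discard))  # 2, 2, 3 -- union smaller than 2 + 2
```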
41.4 Normalization library(scran) set.seed(10000) clusters &lt;- quickCluster(sce.mess) sce.mess &lt;- computeSumFactors(sce.mess, clusters=clusters) sce.mess &lt;- logNormCounts(sce.mess) par(mfrow=c(1,2)) plot(sce.mess$sum, sizeFactors(sce.mess), log = &quot;xy&quot;, pch=16, xlab = &quot;Library size (millions)&quot;, ylab = &quot;Size factor&quot;, col = ifelse(sce.mess$phenotype == &quot;naive&quot;, &quot;black&quot;, &quot;grey&quot;)) spike.sf &lt;- librarySizeFactors(altExp(sce.mess, &quot;ERCC&quot;)) plot(sizeFactors(sce.mess), spike.sf, log = &quot;xy&quot;, pch=16, ylab = &quot;Spike-in size factor&quot;, xlab = &quot;Deconvolution size factor&quot;, col = ifelse(sce.mess$phenotype == &quot;naive&quot;, &quot;black&quot;, &quot;grey&quot;)) 41.5 Cell cycle phase assignment Here, we use multiple cores to speed up the processing. set.seed(10001) hs_pairs &lt;- readRDS(system.file(&quot;exdata&quot;, &quot;human_cycle_markers.rds&quot;, package=&quot;scran&quot;)) assigned &lt;- cyclone(sce.mess, pairs=hs_pairs, gene.names=rownames(sce.mess), BPPARAM=BiocParallel::MulticoreParam(10)) sce.mess$phase &lt;- assigned$phases table(sce.mess$phase) ## ## G1 G2M S ## 460 406 322 smoothScatter(assigned$scores$G1, assigned$scores$G2M, xlab=&quot;G1 score&quot;, ylab=&quot;G2/M score&quot;, pch=16) Figure 41.3: cyclone() G1 phase scores against the G2/M phase scores for each cell in the Messmer hESC dataset. 
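cyclone() classifies cells using pairs of marker genes whose relative expression flips between cell cycle phases. A toy sketch of the underlying pair-based score, greatly simplified relative to the real classifier (the gene names and matrices below are invented for illustration):

```r
# Toy pair-based score: for each cell, the fraction of marker pairs
# (A, B) in which gene A is expressed more highly than gene B.
pair_score <- function(expr, pairs) {
    # expr: genes x cells matrix with rownames; pairs: 2-column matrix of gene names
    apply(expr, 2, function(cell) mean(cell[pairs[, 1]] > cell[pairs[, 2]]))
}

expr <- rbind(g1=c(5, 0), g2=c(0, 5), g3=c(4, 1), g4=c(1, 4))
colnames(expr) <- c("cellA", "cellB")
pairs <- cbind(c("g1", "g3"), c("g2", "g4"))
pair_score(expr, pairs)  # cellA scores 1, cellB scores 0
```

Because the score depends only on within-cell rankings, it is insensitive to differences in sequencing depth between cells.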
41.6 Feature selection dec &lt;- modelGeneVarWithSpikes(sce.mess, &quot;ERCC&quot;, block = sce.mess$`experiment batch`) top.hvgs &lt;- getTopHVGs(dec, prop = 0.1) par(mfrow=c(1,3)) for (i in seq_along(dec$per.block)) { current &lt;- dec$per.block[[i]] plot(current$mean, current$total, xlab=&quot;Mean log-expression&quot;, ylab=&quot;Variance&quot;, pch=16, cex=0.5, main=paste(&quot;Batch&quot;, i)) fit &lt;- metadata(current) points(fit$mean, fit$var, col=&quot;red&quot;, pch=16) curve(fit$trend(x), col=&#39;dodgerblue&#39;, add=TRUE, lwd=2) } Figure 41.4: Per-gene variance of the log-normalized expression values in the Messmer hESC dataset, plotted against the mean for each batch. Each point represents a gene with spike-ins shown in red and the fitted trend shown in blue. 41.7 Batch correction We eliminate the obvious batch effect with linear regression, which is possible due to the replicated nature of the experimental design. We set keep=1:2 to retain the effect of the first two coefficients in design corresponding to our phenotype of interest. library(batchelor) sce.mess &lt;- correctExperiments(sce.mess, PARAM = RegressParam( design = model.matrix(~sce.mess$phenotype + sce.mess$`experiment batch`), keep = 1:2 ) ) 41.8 Dimensionality Reduction We could have set d= and subset.row= in correctExperiments() to automatically perform a PCA on the residual matrix with the subset of HVGs, but we’ll just explicitly call runPCA() here to keep things simple. set.seed(1101001) sce.mess &lt;- runPCA(sce.mess, subset_row = top.hvgs, exprs_values = &quot;corrected&quot;) sce.mess &lt;- runTSNE(sce.mess, dimred = &quot;PCA&quot;, perplexity = 40) From a naive PCA, the cell cycle appears to be a major source of biological variation within each phenotype. 
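The keep= logic of the regression-based correction above can be mimicked in base R: fit the full model, then reconstitute the values from the kept coefficients plus the residuals, so the phenotype effect survives while the batch effect is regressed out. This is a sketch on invented toy data, not batchelor's implementation:

```r
# Toy data: a phenotype effect of +4 and a batch effect of +2.
set.seed(1)
phenotype <- factor(rep(c("naive", "primed"), each=4))
batch <- factor(rep(c("1", "2"), times=4))
y <- c(1, 3, 1, 3, 5, 7, 5, 7) + rnorm(8, sd=0.01)

design <- model.matrix(~phenotype + batch)
fit <- lm.fit(design, y)
keep <- 1:2  # intercept and phenotype coefficient, as in RegressParam(keep=1:2)

# Rebuild from the kept coefficients plus residuals: the batch term is dropped.
corrected <- drop(design[, keep] %*% fit$coefficients[keep]) + fit$residuals
tapply(corrected, batch, mean)  # batch means now (almost) agree
```

The phenotype difference of about 4 is preserved in corrected, while the batch means become essentially identical.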
gridExtra::grid.arrange( plotTSNE(sce.mess, colour_by = &quot;phenotype&quot;) + ggtitle(&quot;By phenotype&quot;), plotTSNE(sce.mess, colour_by = &quot;experiment batch&quot;) + ggtitle(&quot;By batch &quot;), plotTSNE(sce.mess, colour_by = &quot;CDK1&quot;, swap_rownames=&quot;SYMBOL&quot;) + ggtitle(&quot;By CDK1&quot;), plotTSNE(sce.mess, colour_by = &quot;phase&quot;) + ggtitle(&quot;By phase&quot;), ncol = 2 ) Figure 41.5: Obligatory \\(t\\)-SNE plots of the Messmer hESC dataset, where each point is a cell and is colored by various attributes. We perform contrastive PCA (cPCA) and sparse cPCA (scPCA) on the corrected log-expression data to obtain the same number of PCs. Given that the naive hESCs are actually reprogrammed primed hESCs, we will use the single batch of primed-only hESCs as the “background” dataset to remove the cell cycle effect. library(scPCA) is.bg &lt;- sce.mess$`experiment batch`==&quot;3&quot; target &lt;- sce.mess[,!is.bg] background &lt;- sce.mess[,is.bg] mat.target &lt;- t(assay(target, &quot;corrected&quot;)[top.hvgs,]) mat.background &lt;- t(assay(background, &quot;corrected&quot;)[top.hvgs,]) set.seed(1010101001) con_out &lt;- scPCA( target = mat.target, background = mat.background, penalties = 0, # no penalties = non-sparse cPCA. n_eigen = 50, contrasts = 100 ) reducedDim(target, &quot;cPCA&quot;) &lt;- con_out$x set.seed(101010101) sparse_con_out &lt;- scPCA( target = mat.target, background = mat.background, penalties = 1e-4, n_eigen = 50, contrasts = 100, alg = &quot;rand_var_proj&quot; # for speed. ) reducedDim(target, &quot;scPCA&quot;) &lt;- sparse_con_out$x We see greater intermingling between phases within both the naive and primed cells after cPCA and scPCA. 
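At its core, cPCA takes the top eigenvectors of the difference between the target and background covariance matrices, so that variation shared with the background (here, the cell cycle) is down-weighted. A minimal base-R sketch with a fixed contrast parameter alpha and invented toy data; the scPCA package instead searches over a grid of contrasts (contrasts=100 above):

```r
# Toy setup: dimensions 1-2 carry variation that dominates the background
# (cell-cycle-like), dimension 3 carries target-unique variation.
set.seed(1)
target <- cbind(matrix(rnorm(200), 100, 2), rnorm(100, sd=2))
background <- cbind(matrix(rnorm(200), 100, 2) * 3, rnorm(100, sd=0.1))

alpha <- 1  # contrast strength (fixed here; tuned automatically by scPCA)
contrast <- cov(target) - alpha * cov(background)
evec <- eigen(contrast, symmetric=TRUE)$vectors[, 1]
cpc1 <- target %*% evec       # leading contrastive PC of the target cells
which.max(abs(evec))          # the loading concentrates on the target-unique dimension
```

Directions with high background variance get negative eigenvalues in the contrast matrix and are suppressed, which is why the cell cycle structure fades after cPCA.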
set.seed(1101001) target &lt;- runTSNE(target, dimred = &quot;cPCA&quot;, perplexity = 40, name=&quot;cPCA+TSNE&quot;) target &lt;- runTSNE(target, dimred = &quot;scPCA&quot;, perplexity = 40, name=&quot;scPCA+TSNE&quot;) gridExtra::grid.arrange( plotReducedDim(target, &quot;cPCA+TSNE&quot;, colour_by = &quot;phase&quot;) + ggtitle(&quot;After cPCA&quot;), plotReducedDim(target, &quot;scPCA+TSNE&quot;, colour_by = &quot;phase&quot;) + ggtitle(&quot;After scPCA&quot;), ncol=2 ) Figure 41.6: More \\(t\\)-SNE plots of the Messmer hESC dataset after cPCA and scPCA, where each point is a cell and is colored by its assigned cell cycle phase. We can quantify the change in the separation between phases within each phenotype using the silhouette coefficient. library(bluster) naive &lt;- target[,target$phenotype==&quot;naive&quot;] primed &lt;- target[,target$phenotype==&quot;primed&quot;] N &lt;- approxSilhouette(reducedDim(naive, &quot;PCA&quot;), naive$phase) P &lt;- approxSilhouette(reducedDim(primed, &quot;PCA&quot;), primed$phase) c(naive=mean(N$width), primed=mean(P$width)) ## naive primed ## 0.02032 0.03025 cN &lt;- approxSilhouette(reducedDim(naive, &quot;cPCA&quot;), naive$phase) cP &lt;- approxSilhouette(reducedDim(primed, &quot;cPCA&quot;), primed$phase) c(naive=mean(cN$width), primed=mean(cP$width)) ## naive primed ## 0.007696 0.011941 scN &lt;- approxSilhouette(reducedDim(naive, &quot;scPCA&quot;), naive$phase) scP &lt;- approxSilhouette(reducedDim(primed, &quot;scPCA&quot;), primed$phase) c(naive=mean(scN$width), primed=mean(scP$width)) ## naive primed ## 0.006614 0.014601 Session Info View session info R version 4.0.4 (2021-02-15) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.12-books/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.12-books/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 
LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] bluster_1.0.0 scPCA_1.4.0 [3] batchelor_1.6.2 scran_1.18.5 [5] scater_1.18.6 ggplot2_3.3.3 [7] AnnotationHub_2.22.0 BiocFileCache_1.14.0 [9] dbplyr_2.1.0 ensembldb_2.14.0 [11] AnnotationFilter_1.14.0 GenomicFeatures_1.42.2 [13] AnnotationDbi_1.52.0 scRNAseq_2.4.0 [15] SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0 [17] Biobase_2.50.0 GenomicRanges_1.42.0 [19] GenomeInfoDb_1.26.4 IRanges_2.24.1 [21] S4Vectors_0.28.1 BiocGenerics_0.36.0 [23] MatrixGenerics_1.2.1 matrixStats_0.58.0 [25] BiocStyle_2.18.1 rebook_1.0.0 loaded via a namespace (and not attached): [1] igraph_1.2.6 lazyeval_0.2.2 [3] listenv_0.8.0 BiocParallel_1.24.1 [5] digest_0.6.27 htmltools_0.5.1.1 [7] viridis_0.5.1 fansi_0.4.2 [9] magrittr_2.0.1 memoise_2.0.0 [11] cluster_2.1.0 limma_3.46.0 [13] globals_0.14.0 Biostrings_2.58.0 [15] askpass_1.1 prettyunits_1.1.1 [17] colorspace_2.0-0 blob_1.2.1 [19] rappdirs_0.3.3 rbibutils_2.0 [21] xfun_0.22 dplyr_1.0.5 [23] callr_3.5.1 crayon_1.4.1 [25] RCurl_1.98-1.3 jsonlite_1.7.2 [27] graph_1.68.0 glue_1.4.2 [29] gtable_0.3.0 zlibbioc_1.36.0 [31] XVector_0.30.0 DelayedArray_0.16.2 [33] coop_0.6-2 kernlab_0.9-29 [35] BiocSingular_1.6.0 future.apply_1.7.0 [37] abind_1.4-5 scales_1.1.1 [39] edgeR_3.32.1 DBI_1.1.1 [41] Rcpp_1.0.6 viridisLite_0.3.0 [43] xtable_1.8-4 progress_1.2.2 [45] dqrng_0.2.1 bit_4.0.4 [47] rsvd_1.0.3 ResidualMatrix_1.0.0 [49] httr_1.4.2 ellipsis_0.3.1 [51] pkgconfig_2.0.3 XML_3.99-0.6 [53] farver_2.1.0 scuttle_1.0.4 [55] CodeDepends_0.6.5 sass_0.3.1 [57] locfit_1.5-9.4 utf8_1.2.1 [59] tidyselect_1.1.0 labeling_0.4.2 [61] rlang_0.4.10 later_1.1.0.1 [63] munsell_0.5.0 BiocVersion_3.12.0 [65] tools_4.0.4 cachem_1.0.4 [67] generics_0.1.0 RSQLite_2.2.4 [69] 
ExperimentHub_1.16.0 evaluate_0.14 [71] stringr_1.4.0 fastmap_1.1.0 [73] yaml_2.2.1 processx_3.4.5 [75] knitr_1.31 bit64_4.0.5 [77] purrr_0.3.4 future_1.21.0 [79] sparseMatrixStats_1.2.1 mime_0.10 [81] origami_1.0.3 xml2_1.3.2 [83] biomaRt_2.46.3 compiler_4.0.4 [85] beeswarm_0.3.1 curl_4.3 [87] interactiveDisplayBase_1.28.0 statmod_1.4.35 [89] tibble_3.1.0 bslib_0.2.4 [91] stringi_1.5.3 highr_0.8 [93] ps_1.6.0 RSpectra_0.16-0 [95] lattice_0.20-41 ProtGenerics_1.22.0 [97] Matrix_1.3-2 vctrs_0.3.6 [99] pillar_1.5.1 lifecycle_1.0.0 [101] BiocManager_1.30.10 Rdpack_2.1.1 [103] jquerylib_0.1.3 BiocNeighbors_1.8.2 [105] data.table_1.14.0 cowplot_1.1.1 [107] bitops_1.0-6 irlba_2.3.3 [109] httpuv_1.5.5 rtracklayer_1.50.0 [111] R6_2.5.0 bookdown_0.21 [113] promises_1.2.0.1 KernSmooth_2.23-18 [115] gridExtra_2.3 parallelly_1.24.0 [117] vipor_0.4.5 codetools_0.2-18 [119] assertthat_0.2.1 openssl_1.4.3 [121] sparsepca_0.1.2 withr_2.4.1 [123] GenomicAlignments_1.26.0 Rsamtools_2.6.0 [125] GenomeInfoDbData_1.2.4 hms_1.0.0 [127] grid_4.0.4 beachmat_2.6.4 [129] rmarkdown_2.7 DelayedMatrixStats_1.12.3 [131] Rtsne_0.15 shiny_1.6.0 [133] ggbeeswarm_0.6.0 Bibliography "],["hca-human-bone-marrow-10x-genomics.html", "Chapter 42 HCA human bone marrow (10X Genomics) 42.1 Introduction 42.2 Data loading 42.3 Quality control 42.4 Normalization 42.5 Variance modeling 42.6 Data integration 42.7 Dimensionality reduction 42.8 Clustering 42.9 Differential expression 42.10 Cell type classification Session Info", " Chapter 42 HCA human bone marrow (10X Genomics) .rebook-collapse { background-color: #eee; color: #444; cursor: pointer; padding: 18px; width: 100%; border: none; text-align: left; outline: none; font-size: 15px; } .rebook-content { padding: 0 18px; display: none; overflow: hidden; background-color: #f1f1f1; } 42.1 Introduction Here, we use an example dataset from the Human Cell Atlas immune cell profiling project on bone marrow, which contains scRNA-seq data for 380,000 cells generated 
using the 10X Genomics technology. This is a fairly big dataset that represents a good use case for the techniques in Chapter 23. 42.2 Data loading This dataset is loaded via the HCAData package, which provides a ready-to-use SingleCellExperiment object. library(HCAData) sce.bone &lt;- HCAData(&#39;ica_bone_marrow&#39;) sce.bone$Donor &lt;- sub(&quot;_.*&quot;, &quot;&quot;, sce.bone$Barcode) We use gene symbols in place of Ensembl IDs for easier interpretation later. library(EnsDb.Hsapiens.v86) rowData(sce.bone)$Chr &lt;- mapIds(EnsDb.Hsapiens.v86, keys=rownames(sce.bone), column=&quot;SEQNAME&quot;, keytype=&quot;GENEID&quot;) library(scater) rownames(sce.bone) &lt;- uniquifyFeatureNames(rowData(sce.bone)$ID, names = rowData(sce.bone)$Symbol) 42.3 Quality control Cell calling was not performed, so we will perform QC using all metrics and block on the donor of origin during outlier detection. We perform the calculation across multiple cores to speed things up. library(BiocParallel) bpp &lt;- MulticoreParam(8) sce.bone &lt;- unfiltered &lt;- addPerCellQC(sce.bone, BPPARAM=bpp, subsets=list(Mito=which(rowData(sce.bone)$Chr==&quot;MT&quot;))) qc &lt;- quickPerCellQC(colData(sce.bone), batch=sce.bone$Donor, sub.fields=&quot;subsets_Mito_percent&quot;) sce.bone &lt;- sce.bone[,!qc$discard] unfiltered$discard &lt;- qc$discard gridExtra::grid.arrange( plotColData(unfiltered, x=&quot;Donor&quot;, y=&quot;sum&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Total count&quot;), plotColData(unfiltered, x=&quot;Donor&quot;, y=&quot;detected&quot;, colour_by=&quot;discard&quot;) + scale_y_log10() + ggtitle(&quot;Detected features&quot;), plotColData(unfiltered, x=&quot;Donor&quot;, y=&quot;subsets_Mito_percent&quot;, colour_by=&quot;discard&quot;) + ggtitle(&quot;Mito percent&quot;), ncol=2 ) Figure 42.1: Distribution of QC metrics in the HCA bone marrow dataset. Each point represents a cell and is colored according to whether it was discarded.
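As a quick check on the QC step above, we can count how many cells fail each filter and confirm that no single donor dominates the discards. This is a sketch that assumes the qc DataFrame and the unfiltered object computed above:

```r
# Number of cells discarded for each reason; quickPerCellQC() returns
# one logical column per filter plus the overall "discard" column.
colSums(as.matrix(qc))

# Cross-tabulate discards against the donor of origin to check that
# outlier detection has not removed one donor's cells wholesale.
table(Discarded=qc$discard, Donor=unfiltered$Donor)
```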
plotColData(unfiltered, x=&quot;sum&quot;, y=&quot;subsets_Mito_percent&quot;, colour_by=&quot;discard&quot;) + scale_x_log10() Figure 42.2: Percentage of mitochondrial reads in each cell in the HCA bone marrow dataset compared to its total count. Each point represents a cell and is colored according to whether that cell was discarded. 42.4 Normalization For a minor speed-up, we use already-computed library sizes rather than re-computing them from the column sums. sce.bone &lt;- logNormCounts(sce.bone, size_factors = sce.bone$sum) summary(sizeFactors(sce.bone)) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.05 0.47 0.65 1.00 0.89 42.38 42.5 Variance modeling We block on the donor of origin to mitigate batch effects during HVG selection. We select a larger number of HVGs to capture any batch-specific variation that might be present. library(scran) set.seed(1010010101) dec.bone &lt;- modelGeneVarByPoisson(sce.bone, block=sce.bone$Donor, BPPARAM=bpp) top.bone &lt;- getTopHVGs(dec.bone, n=5000) par(mfrow=c(4,2)) blocked.stats &lt;- dec.bone$per.block for (i in colnames(blocked.stats)) { current &lt;- blocked.stats[[i]] plot(current$mean, current$total, main=i, pch=16, cex=0.5, xlab=&quot;Mean of log-expression&quot;, ylab=&quot;Variance of log-expression&quot;) curfit &lt;- metadata(current) curve(curfit$trend(x), col=&#39;dodgerblue&#39;, add=TRUE, lwd=2) } Figure 42.3: Per-gene variance as a function of the mean for the log-expression values in the HCA bone marrow dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to the variances. 42.6 Data integration Here we use multiple cores, randomized SVD and approximate nearest-neighbor detection to speed up this step. 
library(batchelor) library(BiocNeighbors) set.seed(1010001) merged.bone &lt;- fastMNN(sce.bone, batch = sce.bone$Donor, subset.row = top.bone, BSPARAM=BiocSingular::RandomParam(deferred = TRUE), BNPARAM=AnnoyParam(), BPPARAM=bpp) reducedDim(sce.bone, &#39;MNN&#39;) &lt;- reducedDim(merged.bone, &#39;corrected&#39;) We use the percentage of variance lost as a diagnostic measure: metadata(merged.bone)$merge.info$lost.var ## MantonBM1 MantonBM2 MantonBM3 MantonBM4 MantonBM5 MantonBM6 MantonBM7 ## [1,] 0.006925 0.006387 0.000000 0.000000 0.000000 0.000000 0.000000 ## [2,] 0.006376 0.006853 0.023099 0.000000 0.000000 0.000000 0.000000 ## [3,] 0.005068 0.003070 0.005132 0.019471 0.000000 0.000000 0.000000 ## [4,] 0.002006 0.001873 0.001898 0.001780 0.023112 0.000000 0.000000 ## [5,] 0.002445 0.002009 0.001780 0.002923 0.002634 0.023804 0.000000 ## [6,] 0.003106 0.003178 0.003068 0.002581 0.003283 0.003363 0.024562 ## [7,] 0.001939 0.001677 0.002427 0.002013 0.001555 0.002270 0.001969 ## MantonBM8 ## [1,] 0.00000 ## [2,] 0.00000 ## [3,] 0.00000 ## [4,] 0.00000 ## [5,] 0.00000 ## [6,] 0.00000 ## [7,] 0.03213 42.7 Dimensionality reduction We set external_neighbors=TRUE to replace the internal nearest neighbor search in the UMAP implementation with our parallelized approximate search. We also set the number of threads to be used in the UMAP iterations. set.seed(01010100) sce.bone &lt;- runUMAP(sce.bone, dimred=&quot;MNN&quot;, external_neighbors=TRUE, BNPARAM=AnnoyParam(), BPPARAM=bpp, n_threads=bpnworkers(bpp)) 42.8 Clustering Graph-based clustering generates an excessively large intermediate graph so we will instead use a two-step approach with \\(k\\)-means. We generate 1000 small clusters that are subsequently aggregated into more interpretable groups with a graph-based method. If more resolution is required, we can increase centers in addition to using a lower k during graph construction. 
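The resolution adjustment mentioned above only involves changing the parameters of the two-step procedure. As a sketch (with hypothetical parameter values, not evaluated here), a higher-resolution clustering might be parametrized as:

```r
library(bluster)
# More k-means centers for finer granularity in the first step, and a
# smaller k to weaken the graph-based merging in the second step.
hires <- TwoStepParam(KmeansParam(centers=2000), NNGraphParam(k=3))
```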
library(bluster) set.seed(1000) colLabels(sce.bone) &lt;- clusterRows(reducedDim(sce.bone, &quot;MNN&quot;), TwoStepParam(KmeansParam(centers=1000), NNGraphParam(k=5))) table(colLabels(sce.bone)) ## ## 1 2 3 4 5 6 7 8 9 10 11 12 13 ## 65896 47244 30659 37110 7039 10193 18585 17064 26262 8870 7993 7968 3732 ## 14 15 16 17 18 19 ## 3685 4993 11048 3161 3024 2199 We observe mostly balanced contributions from different samples to each cluster (Figure 42.4), consistent with the expectation that all samples are replicates from different donors. tab &lt;- table(Cluster=colLabels(sce.bone), Donor=sce.bone$Donor) library(pheatmap) pheatmap(log10(tab+10), color=viridis::viridis(100)) Figure 42.4: Heatmap of log10-number of cells in each cluster (row) from each sample (column). # TODO: add scrambling option in scater&#39;s plotting functions. scrambled &lt;- sample(ncol(sce.bone)) gridExtra::grid.arrange( plotUMAP(sce.bone, colour_by=&quot;label&quot;, text_by=&quot;label&quot;), plotUMAP(sce.bone[,scrambled], colour_by=&quot;Donor&quot;) ) Figure 42.5: UMAP plots of the HCA bone marrow dataset after merging. Each point represents a cell and is colored according to the assigned cluster (top) or the donor of origin (bottom). 42.9 Differential expression We identify marker genes for each cluster while blocking on the donor. markers.bone &lt;- findMarkers(sce.bone, block = sce.bone$Donor, direction = &#39;up&#39;, lfc = 1, BPPARAM=bpp) We visualize the top markers for cluster 2 using a heatmap of log2-fold changes in Figure 42.6. The presence of upregulated genes like LYZ, S100A8 and VCAN is consistent with a monocyte identity for this cluster. top.markers &lt;- markers.bone[[&quot;2&quot;]] best &lt;- top.markers[top.markers$Top &lt;= 10,] lfcs &lt;- getMarkerEffects(best) library(pheatmap) pheatmap(lfcs, breaks=seq(-5, 5, length.out=101)) Figure 42.6: Heatmap of log2-fold changes for the top marker genes (rows) of cluster 2 compared to all other clusters (columns).
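To follow up on the genes named above, their rows can be pulled directly out of the marker table; a sketch, assuming the top.markers DataFrame for cluster 2 computed above:

```r
# Ranks and effect-size summaries for the named monocyte markers,
# as reported by findMarkers().
top.markers[c("LYZ", "S100A8", "VCAN"),
    c("Top", "p.value", "FDR", "summary.logFC")]
```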
42.10 Cell type classification We perform automated cell type classification using a reference dataset to annotate each cluster based on its pseudo-bulk profile. This is faster than the per-cell approaches described in Chapter 12 at the cost of the resolution required to detect heterogeneity inside a cluster. Nonetheless, it is often sufficient for a quick assignment of cluster identity, and indeed, cluster 2 is also identified as consisting of monocytes from this analysis. se.aggregated &lt;- sumCountsAcrossCells(sce.bone, id=colLabels(sce.bone)) library(celldex) hpc &lt;- HumanPrimaryCellAtlasData() library(SingleR) anno.single &lt;- SingleR(se.aggregated, ref = hpc, labels = hpc$label.main, assay.type.test=&quot;sum&quot;) anno.single ## DataFrame with 19 rows and 5 columns ## scores first.labels tuning.scores ## &lt;matrix&gt; &lt;character&gt; &lt;DataFrame&gt; ## 1 0.325925:0.652679:0.575563:... T_cells 0.691160:0.1929391 ## 2 0.296605:0.741579:0.529038:... Pre-B_cell_CD34- 0.565504:0.2473908 ## 3 0.311871:0.672003:0.539219:... B_cell 0.621882:0.0147466 ## 4 0.335035:0.658920:0.580926:... T_cells 0.643977:0.3972014 ## 5 0.333016:0.614727:0.522212:... T_cells 0.603456:0.4050120 ## ... ... ... ... ## 15 0.380959:0.683493:0.784153:... MEP 0.376779:0.374257 ## 16 0.309959:0.666660:0.548568:... B_cell 0.775892:0.716429 ## 17 0.367825:0.654503:0.580287:... B_cell 0.530394:0.327330 ## 18 0.409967:0.708583:0.643723:... Pro-B_cell_CD34+ 0.851223:0.780534 ## 19 0.331292:0.671789:0.555484:... Pre-B_cell_CD34- 0.139321:0.121342 ## labels pruned.labels ## &lt;character&gt; &lt;character&gt; ## 1 T_cells T_cells ## 2 Monocyte Monocyte ## 3 B_cell B_cell ## 4 T_cells T_cells ## 5 T_cells T_cells ## ... ... ... ## 15 BM &amp; Prog. BM &amp; Prog. 
## 16 B_cell B_cell ## 17 B_cell NA ## 18 Pro-B_cell_CD34+ Pro-B_cell_CD34+ ## 19 GMP GMP Session Info View session info R version 4.0.4 (2021-02-15) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.12-books/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.12-books/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] SingleR_1.4.1 celldex_1.0.0 [3] pheatmap_1.0.12 bluster_1.0.0 [5] BiocNeighbors_1.8.2 batchelor_1.6.2 [7] scran_1.18.5 BiocParallel_1.24.1 [9] scater_1.18.6 ggplot2_3.3.3 [11] EnsDb.Hsapiens.v86_2.99.0 ensembldb_2.14.0 [13] AnnotationFilter_1.14.0 GenomicFeatures_1.42.2 [15] AnnotationDbi_1.52.0 rhdf5_2.34.0 [17] HCAData_1.6.0 SingleCellExperiment_1.12.0 [19] SummarizedExperiment_1.20.0 Biobase_2.50.0 [21] GenomicRanges_1.42.0 GenomeInfoDb_1.26.4 [23] IRanges_2.24.1 S4Vectors_0.28.1 [25] BiocGenerics_0.36.0 MatrixGenerics_1.2.1 [27] matrixStats_0.58.0 BiocStyle_2.18.1 [29] rebook_1.0.0 loaded via a namespace (and not attached): [1] AnnotationHub_2.22.0 BiocFileCache_1.14.0 [3] igraph_1.2.6 lazyeval_0.2.2 [5] digest_0.6.27 htmltools_0.5.1.1 [7] viridis_0.5.1 fansi_0.4.2 [9] magrittr_2.0.1 memoise_2.0.0 [11] limma_3.46.0 Biostrings_2.58.0 [13] askpass_1.1 prettyunits_1.1.1 [15] colorspace_2.0-0 blob_1.2.1 [17] rappdirs_0.3.3 xfun_0.22 [19] dplyr_1.0.5 callr_3.5.1 [21] crayon_1.4.1 RCurl_1.98-1.3 [23] jsonlite_1.7.2 graph_1.68.0 [25] glue_1.4.2 gtable_0.3.0 [27] zlibbioc_1.36.0 XVector_0.30.0 [29] DelayedArray_0.16.2 BiocSingular_1.6.0 [31] Rhdf5lib_1.12.1 HDF5Array_1.18.1 [33] scales_1.1.1 edgeR_3.32.1 [35] DBI_1.1.1 Rcpp_1.0.6 [37] 
viridisLite_0.3.0 xtable_1.8-4 [39] progress_1.2.2 dqrng_0.2.1 [41] bit_4.0.4 rsvd_1.0.3 [43] ResidualMatrix_1.0.0 httr_1.4.2 [45] RColorBrewer_1.1-2 ellipsis_0.3.1 [47] pkgconfig_2.0.3 XML_3.99-0.6 [49] farver_2.1.0 scuttle_1.0.4 [51] uwot_0.1.10 CodeDepends_0.6.5 [53] sass_0.3.1 dbplyr_2.1.0 [55] locfit_1.5-9.4 utf8_1.2.1 [57] labeling_0.4.2 tidyselect_1.1.0 [59] rlang_0.4.10 later_1.1.0.1 [61] munsell_0.5.0 BiocVersion_3.12.0 [63] tools_4.0.4 cachem_1.0.4 [65] generics_0.1.0 RSQLite_2.2.4 [67] ExperimentHub_1.16.0 evaluate_0.14 [69] stringr_1.4.0 fastmap_1.1.0 [71] yaml_2.2.1 processx_3.4.5 [73] knitr_1.31 bit64_4.0.5 [75] purrr_0.3.4 sparseMatrixStats_1.2.1 [77] mime_0.10 xml2_1.3.2 [79] biomaRt_2.46.3 compiler_4.0.4 [81] beeswarm_0.3.1 curl_4.3 [83] interactiveDisplayBase_1.28.0 statmod_1.4.35 [85] tibble_3.1.0 bslib_0.2.4 [87] stringi_1.5.3 highr_0.8 [89] ps_1.6.0 RSpectra_0.16-0 [91] lattice_0.20-41 ProtGenerics_1.22.0 [93] Matrix_1.3-2 vctrs_0.3.6 [95] pillar_1.5.1 lifecycle_1.0.0 [97] rhdf5filters_1.2.0 BiocManager_1.30.10 [99] jquerylib_0.1.3 cowplot_1.1.1 [101] bitops_1.0-6 irlba_2.3.3 [103] httpuv_1.5.5 rtracklayer_1.50.0 [105] R6_2.5.0 bookdown_0.21 [107] promises_1.2.0.1 gridExtra_2.3 [109] vipor_0.4.5 codetools_0.2-18 [111] assertthat_0.2.1 openssl_1.4.3 [113] withr_2.4.1 GenomicAlignments_1.26.0 [115] Rsamtools_2.6.0 GenomeInfoDbData_1.2.4 [117] hms_1.0.0 grid_4.0.4 [119] beachmat_2.6.4 rmarkdown_2.7 [121] DelayedMatrixStats_1.12.3 shiny_1.6.0 [123] ggbeeswarm_0.6.0 "],["contributors.html", "Chapter 43 Contributors", " Chapter 43 Contributors Aaron Lun, PhD When one thinks of single-cell bioinformatics, one thinks of several titans who bestride the field.
Unfortunately, they weren’t available, so we had to make do with Aaron instead. He likes long walks on the beach (as long as there’s Wifi) and travelling (but only in business class). His friends say that he is “absolutely insane” and “needs to get a life”, or they would if they weren’t mostly imaginary. His GitHub account is his Facebook and his Slack is his Twitter. He maintains more Bioconductor packages than he has phone numbers on his cell. He has a recurring event on his Google Calendar to fold his laundry. He is… the most boring man in the world. (“I don’t often cry when I watch anime, but when I do, my tears taste like Dos Equis.”) He currently works as a Scientist at Genentech after a stint as a research associate in John Marioni’s group at the CRUK Cambridge Institute, which was preceded by a PhD with Gordon Smyth at the Walter and Eliza Hall Institute of Medical Research. Robert Amezquita, PhD Robert Amezquita is a Postdoctoral Fellow in the Immunotherapy Integrated Research Center (IIRC) at Fred Hutch under the mentorship of Raphael Gottardo. His current research focuses on utilizing computational approaches leveraging transcriptional and epigenomic profiling at single-cell resolution to understand how novel anti-cancer therapeutics - ranging from small molecule therapies to biologics such as CAR-T cells - affect immune response dynamics. Extending from his graduate work at Yale’s Dept. of Immunobiology, Robert’s research aims to better understand the process of immune cell differentiation under the duress of cancer as a means to inform future immunotherapies. To accomplish this, Robert works collaboratively across the Fred Hutch IIRC with experimental collaborators, extensively utilizing R and Bioconductor for data analysis. Stephanie Hicks, PhD Stephanie Hicks is an Assistant Professor in the Department of Biostatistics at Johns Hopkins Bloomberg School of Public Health.
Her research interests focus on developing statistical methods, tools and software for the analysis of genomics data. Specifically, her research addresses statistical challenges in epigenomics, functional genomics and single-cell genomics, such as the pre-processing, normalization, and analysis of noisy high-throughput data, leading to improved quantification and understanding of biological variability. She actively contributes software packages to Bioconductor and is involved in teaching courses for data science and the analysis of genomics data. She is also a faculty member of the Johns Hopkins Data Science Lab, co-host of The Corresponding Author podcast and co-founder of R-Ladies Baltimore. For more information, please see http://www.stephaniehicks.com Raphael Gottardo, PhD Raphael Gottardo is the Scientific Director of the Translational Data Science Integrated Research Center (TDS IRC) at Fred Hutch, J. Orin Edson Foundation Endowed Chair, and Full Member within the Vaccine and Infectious Disease and Public Health Sciences Divisions. He is a pioneer in developing and applying statistical methods and software tools to distill actionable insights from large and complex biological data sets. In partnership with scientists and clinicians, he works to understand such diseases as cancer, HIV, malaria, and tuberculosis and inform the development of vaccines and treatments. He is a leader in forming interdisciplinary collaborations across the Hutch, as well as nationally and internationally, to address important research questions, particularly in the areas of vaccine research, human immunology, and immunotherapy. As director of the Translational Data Science Integrated Research Center, he fosters interaction between the Hutch’s experimental and clinical researchers and their computational and quantitative science colleagues with the goal of transforming patient care through data-driven research. Dr.
Gottardo partners closely with the cancer immunotherapy program at Fred Hutch to improve treatments. For example, his team is harnessing cutting-edge computational methods to determine how cancers evade immunotherapy. He has made significant contributions to vaccine research and is the principal investigator of the Vaccine and Immunology Statistical Center of the Collaboration for AIDS Vaccine Discovery. Other notable entities Kevin Rue-Albrecht (University of Oxford, United Kingdom), for contributing the interactive data analysis chapter. Charlotte Soneson (Friedrich Miescher Institute, Switzerland), for many formatting and typographical fixes. Al J Abadi (University of Melbourne, Australia), for bringing the log-normalization corner case to our attention. Philippe Boileau (University of California Berkeley, USA), for demonstrations on how to use scPCA. "],["bibliography.html", "Chapter 44 Bibliography", " Chapter 44 Bibliography "]]
