[["index.html", "Introduction to Single-Cell Analysis with Bioconductor Welcome", " Introduction to Single-Cell Analysis with Bioconductor Authors: Robert Amezquita [aut], Aaron Lun [aut], Stephanie Hicks [aut], Raphael Gottardo [aut], Ludwig Geistlinger [cre] Version: 1.16.1 Modified: 2025-06-26 Compiled: 2025-06-27 Environment: R version 4.5.1 (2025-06-13), Bioconductor 3.21 License: CC BY 4.0 Copyright: Bioconductor, 2025 Source: https://github.com/OSCA-source/OSCA.intro Welcome This site contains the introductory chapters for the “Orchestrating Single-Cell Analysis with Bioconductor” book. It describes how to install R and Bioconductor packages, links out to some resources to learn R, describes how to load datasets into an R session, provides an overview of the SingleCellExperiment class, and performs a “quick start” demonstration for basic single-cell RNA-seq analyses. It is intended for readers with little-to-no computational background who are just getting started with analyses in R. "],["installation.html", "Chapter 1 Installation 1.1 Overview 1.2 Installing software 1.3 Installing packages 1.4 Comments on Bioconductor versioning 1.5 Finding relevant packages 1.6 Staying up to date", " Chapter 1 Installation 1.1 Overview So, you want to be the very best, like no one ever wa– oh wait, wrong tutorial. So, you want to learn how to do single-cell RNA-seq data analyses with Bioconductor? This chapter will describe the very first step in this process: getting up and running with an R/Bioconductor installation on your local computer. If you already know how to do this, or are using a centrally-managed installation (e.g., on an institutional server), feel free to skip ahead to the next chapter. 
What is R, anyway? R is a high-level programming language that provides an integrated environment for analyzing all kinds of data. One of its key advantages is the ease with which it can be extended via packages. For example, some of these packages implement statistical/computational methods (e.g., lme4 for mixed effect modelling), while other packages provide programming utilities for general use (e.g., ggplot2 for visualization). The diverse package ecosystem provides R with the capabilities needed to develop useful applications and answer important scientific questions across many fields of study. Within this ecosystem, the Bioconductor project provides tools for the analysis and comprehension of high-throughput genomics data. The scope of the project covers microarray data, various forms of sequencing (RNA-seq, ChIP-seq, bisulfite, genotyping, etc.), proteomics, flow cytometry and more. One of Bioconductor’s main selling points is the use of common data structures to promote interoperability between packages, allowing code written by different people (from different organizations, in different countries) to work together seamlessly in complex analyses. By extending R to genomics, Bioconductor serves as a powerful addition to the computational biologist’s toolkit. 1.2 Installing software Our first task is to get R installed on our computer by following the instructions at https://www.r-project.org. In brief: we select a local mirror from https://cran.r-project.org/mirrors.html and then we choose the appropriate link in “Download R for…” for our operating system. This will download installers for Mac OS X and Windows, which can be opened and run in the usual way. 
For Linux, the link provides distribution-specific instructions that use the relevant package manager - for example: sudo apt-get install r-base # Debian/Ubuntu sudo dnf install R # Fedora/CentOS/RHEL sudo yum install R # older Fedora/CentOS/RHEL Users of Homebrew can also do: brew install R We suggest installing the latest version of R to ensure that you have access to the most up-to-date functionality and bugfixes. For example, this book’s contents were generated using R 4.5, which is the version that should be installed if you want to reproduce the results shown in later chapters. For most users, we also recommend installing a graphical user interface such as RStudio. This features many helpful tools such as code completion and an interactive data viewer. Starting an R session becomes as simple as opening up RStudio and typing commands into the console. Of course, this is not essential and more advanced users may prefer to work with R directly from the command line. (This author does.) 1.3 Installing packages Once R is installed, we can install packages that extend R’s capabilities. The default repository is the Comprehensive R Archive Network (CRAN), which is home to over 13,000 different R packages. We can easily install packages from CRAN - say, the popular ggplot2 package for data visualization - by opening up R and typing in: install.packages(&quot;ggplot2&quot;) In our case, we want to install Bioconductor packages. These packages are located in a separate repository (see comments below) so we first install the BiocManager package to easily connect to the Bioconductor servers. install.packages(&quot;BiocManager&quot;) After that, we can use BiocManager’s install() function to install any package from Bioconductor. For example, the code chunk below uses this approach to install the SingleCellExperiment package. (The same command also works for any CRAN package; install() will automatically call install.packages() for us, as a matter of convenience.) 
## The command below is a one-line shortcut for: ## library(BiocManager) ## install(&quot;SingleCellExperiment&quot;) BiocManager::install(&quot;SingleCellExperiment&quot;) Should we forget, the same instructions are present on the landing page of any Bioconductor package. For example, looking at the scater package page on Bioconductor, we can see the following copy-pasteable instructions: if (!requireNamespace(&quot;BiocManager&quot;, quietly = TRUE)) install.packages(&quot;BiocManager&quot;) BiocManager::install(&quot;scater&quot;) In fact, each Bioconductor book is itself a package that can be installed via BiocManager. This will automatically install all of the individual packages that are used in the book. We illustrate below with OSCA.intro, which is the package corresponding to this particular book. BiocManager::install(&quot;OSCA.intro&quot;) Packages only need to be installed once, and then they are available for all subsequent uses of a particular R installation. There is no need to repeat the installation every time we start R. 1.4 Comments on Bioconductor versioning Unlike CRAN, Bioconductor releases its packages as a cohort on a half-yearly cycle. This comes with the guarantee that different packages will work together smoothly if they belong to the same cohort - as mentioned above, this is one of Bioconductor’s main selling points. For a particular installation, the version of the cohort release can be easily obtained from BiocManager: BiocManager::version() ## [1] &#39;3.21&#39; Each Bioconductor release relies on the latest release version of R, which in turn has yearly updates. For example, Bioconductor 3.11 and 3.12 would use R 4.0, while Bioconductor 3.13 and 3.14 will use R 4.1, and so on. Thus, getting the latest Bioconductor release usually requires us to install the latest release version of R; BiocManager::install() will then take care of the rest. 
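Because each Bioconductor release is tied to a particular R version, it can be useful to confirm which R we are running before installing anything. The chunk below is a minimal sketch using base R only; the 4.5.0 threshold is an assumption taken from the R 4.5 used to compile this book, and the variable names are purely illustrative.

```r
# Report the version of the running R session (base R only).
r.version <- getRversion()
print(r.version)

# Hypothetical threshold: this book was compiled with R 4.5.
# Substitute the R version required by the Bioconductor release of interest.
min.version <- '4.5.0'

# Versions are compared component by component rather than as text,
# so that 4.10.0 is correctly treated as newer than 4.9.0.
is.recent <- r.version >= min.version
if (is.recent) {
    message('R is recent enough for this release.')
} else {
    message('Consider updating R before calling BiocManager::install().')
}
```

If the check fails, the usual remedy is to install a newer R and then re-run BiocManager::install() so that packages are drawn from the matching release.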
The interoperability guarantee mentioned above only extends to packages from the same version of Bioconductor. Packages from different Bioconductor releases may not necessarily work together, e.g., due to updates in the data structures or function arguments. Normally, BiocManager::install() will prevent us from installing packages from different releases, but if it does happen, we can identify the mismatched packages (and obtain a suggested command to fix them) with: BiocManager::valid() 1.5 Finding relevant packages To find relevant Bioconductor packages, one useful resource is the BiocViews page. This provides a hierarchically organized view of annotations associated with each Bioconductor package. For example, under the “Software” label, we might be interested in a particular “Technology” such as… say, “SingleCell”. This gives us a listing of all Bioconductor packages that might be useful for our single-cell data analyses. CRAN uses a similar concept of “Task views”, though this is understandably more general than genomics. For example, the Cluster task view page lists an assortment of packages that are relevant to cluster analyses. 1.6 Staying up to date Updating all R/Bioconductor packages is as simple as running BiocManager::install() without any arguments. This will check for more recent versions of each package (within a Bioconductor release) and prompt the user to update if any are available. BiocManager::install() If we want to update to a more recent Bioconductor release, we can use the version= argument to explicitly state the version number. This assumes that we have a version of R that is capable of handling the requested Bioconductor release. BiocManager::install(version=&#39;3.21&#39;) It is a good idea to make sure that you are using the latest versions of all packages, at least at the start of any analysis project. This ensures that you have the most recent functionality and bugfixes. 
The only exception is if there is a need to recover historical results, in which case we might prefer to use older versions of all packages: # Installing CRAN packages as of 29th April, 2020; # see https://packagemanager.rstudio.com/client/#/repos/1/overview for available dates. options(repos = c(CRAN = &quot;https://packagemanager.rstudio.com/all/277&quot;)) # Using packages from Bioconductor version 3.10, see below. BiocManager::install(version=&quot;3.10&quot;) More advanced users may consider using packrat, Conda or Docker to create separate R environments for different analysis projects. These approaches ensure that package updates for one project do not affect the reproducibility of results in other projects; they also make it easier to share environments between users. "],["learning-r.html", "Chapter 2 Learning R 2.1 Links to R tutorials 2.2 Code formatting 2.3 Getting help 2.4 Beyond the basics Session Info", " Chapter 2 Learning R 2.1 Links to R tutorials There are many, many online resources available for learning how to use R. To name a few: The R for Data Science book, which is a fairly enjoyable read though it focuses heavily on a specific dialect of the R language. A free course from Codecademy, which uses a web-based console; this allows people to start learning without actually installing R on their own computers. A free course from EdX, which focuses on the use of R’s statistical functionality. An Introduction to R, a definitive description of R that is best read after some basic familiarity has been established. We will not attempt to repeat the contents of these resources here, as they already do a good job of explaining themselves. 
2.2 Code formatting This book contains code chunks interspersed with results, plots and explanatory text. Code chunks contain R code that is to be evaluated, and interested readers can copy-paste these lines into the R console to try them out themselves. Each code chunk looks like this: a &lt;- 2 * 5 print(a) Terms are colored differently depending on their category - this is mostly aesthetic and can be ignored for the time being. If a code chunk produces any visible output, it is shown in another chunk like so: ## [1] 10 Alternatively, as a figure: plot(1:10, 1:10, pch=16, main=&quot;I am a figure&quot;) Any text after a # is considered a comment and is ignored when running the code. The content of output chunks is always prefixed with # so that users can just copy-paste sections of code without having to explicitly remove the lines containing the results. In some chapters, chunks may also be hidden in collapsible boxes. These usually contain code that sets up objects for later steps but is otherwise not particularly interesting (e.g., downloading files, formatting data), and so they are hidden to avoid distracting the reader. Click me! message &lt;- &quot;I am hidden!&quot; All chapters will finish with a printout of the session information. This describes the system on which the chapter was compiled and the versions of all packages that were used, which is useful for reproducing old results and diagnosing changes due to package updates. 2.3 Getting help If you have a question about how a function works, it can often be answered by the function’s documentation. This is accessible by prepending the function name with ?. More general questions on how to use a package may be answered by the package’s vignette, if it is available. (One aspect of Bioconductor software that distinguishes it from CRAN packages is the required documentation of packages and workflows.) 
vignette(package=&#39;SingleCellExperiment&#39;) # list all available vignettes vignette(package=&#39;SingleCellExperiment&#39;, topic=&#39;intro&#39;) # open specific vignette Beyond the R console, there are myriad online resources to get help. The R for Data Science book has a great section dedicated to looking for help outside of R. For example, Stack Overflow’s R tag is a helpful resource for asking and exploring general R programming questions. For Bioconductor specifically, the support site is a question-and-answer forum that is actively updated by both users and package developers. This should generally be the first port of call for questions that are not answered by any existing documentation. Users can also connect to the Bioconductor community through our Slack group, which hosts various channels dedicated to packages and workflows. The Bioc-community Slack is a great way to stay in the loop on the latest developments happening across Bioconductor, and we recommend exploring the “Channels” section to find topics of interest. 2.4 Beyond the basics Once comfortable with the basic concepts of the language, we can take things to the next level: Advanced R, as its name suggests, goes through some of the more advanced concepts in the language. The aptly named What They Forgot to Teach You About R discusses topics such as file naming, maintaining an R installation, and reproducible analysis habits. The R Inferno dives into many of the unique quirks of R and some of the common user mistakes. Happy Git and GitHub for the useR describes how to use the Git version control system with R. Over time, you may accumulate a collection of your own functions that you might want to re-use across projects or even share with other people. This can be done easily by creating your own R package. 
The R Packages book provides a user-friendly guide for doing so; more experienced developers will consult Writing R extensions, the definitive documentation for the R packaging system. Bioconductor itself also provides some educational resources for package development within the Bioconductor context. Session Info View session info R version 4.5.1 (2025-06-13) Platform: x86_64-pc-linux-gnu Running under: Ubuntu 24.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C time zone: America/New_York tzcode source: system (glibc) attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] BiocStyle_2.36.0 rebook_1.18.0 loaded via a namespace (and not attached): [1] cli_3.6.5 knitr_1.50 rlang_1.1.6 [4] xfun_0.52 generics_0.1.4 CodeDepends_0.6.6 [7] jsonlite_2.0.0 dir.expiry_1.16.0 htmltools_0.5.8.1 [10] XML_3.99-0.18 graph_1.86.0 sass_0.4.10 [13] stats4_4.5.1 rmarkdown_2.29 filelock_1.0.3 [16] evaluate_1.0.4 jquerylib_0.1.4 fastmap_1.2.0 [19] yaml_2.3.10 lifecycle_1.0.4 bookdown_0.43 [22] BiocManager_1.30.26 compiler_4.5.1 codetools_0.2-20 [25] digest_0.6.37 R6_2.6.1 bslib_0.9.0 [28] tools_4.5.1 BiocGenerics_0.54.0 cachem_1.1.0 "],["getting-scrna-seq-datasets.html", "Chapter 3 Getting scRNA-seq datasets 3.1 Overview 3.2 Some comments on experimental design 3.3 Creating a count matrix 3.4 Reading counts into R Session Info", " Chapter 3 Getting scRNA-seq datasets .rebook-collapse { background-color: #eee; color: #444; cursor: pointer; padding: 18px; width: 100%; border: none; text-align: left; outline: none; font-size: 15px; } .rebook-content { padding: 0 18px; 
display: none; overflow: hidden; background-color: #f1f1f1; } 3.1 Overview Sequencing data from single-cell RNA-seq experiments must be converted into a matrix of expression values. This is usually a count matrix containing the number of reads mapped to each gene (row) in each cell (column). Alternatively, the matrix may contain the number of unique molecular identifiers (UMIs) for each gene in each cell; these are interpreted in the same manner as read counts but are less affected by PCR artifacts during library preparation (Islam et al. 2014). Once this quantification is complete, we can proceed with our downstream statistical analyses in R. Constructing a count matrix from raw scRNA-seq data requires some thought as the term “single-cell RNA-seq” encompasses a variety of different experimental protocols. This includes droplet-based protocols like 10X Genomics, inDrop and Drop-seq; plate-based protocols with UMIs like CEL-seq(2) and MARS-seq; plate-based protocols with reads (mostly Smart-seq2); and others like sciRNA-seq, to name a few. Each approach requires a different processing pipeline to deal with cell demultiplexing and UMI deduplication (if applicable). This chapter will briefly describe some of the methods used to generate a count matrix and read it into R. 3.2 Some comments on experimental design Each scRNA-seq protocol has its own advantages and weaknesses that are discussed extensively elsewhere (Mereu et al. 2019; Ziegenhain et al. 2017). In practical terms, droplet-based technologies are the current de facto standard due to their throughput and low cost per cell. Plate-based methods can capture other phenotypic information (e.g., morphology) and are more amenable to customization. Read-based methods provide whole-transcript coverage, which is useful in some applications (e.g., splicing, exome mutations); otherwise, UMI-based methods are more popular as they mitigate the effects of PCR amplification noise. 
The choice of method is left to the reader’s circumstances - we will simply note that most of the downstream analysis is agnostic to the exact technology being used. Another question is how many cells should be captured, and to what depth they should be sequenced. The best trade-off between these two factors is an active topic of research (Zhang, Ntranos, and Tse 2020; Svensson, Veiga Beltrame, and Pachter 2019), though ultimately, much depends on the scientific aims of the experiment. If we are aiming to discover rare cell subpopulations, we would need more cells, whereas if we are aiming to quantify subtle differences, we would need more sequencing depth. As of time of writing, an informal survey of the literature suggests that typical droplet-based experiments would capture anywhere from 10,000 to 100,000 cells, sequenced at anywhere from 1,000 to 10,000 UMIs per cell (usually in inverse proportion to the number of cells). Droplet-based methods also have a trade-off between throughput and doublet rate that affects the true efficiency of sequencing. For studies involving multiple samples or conditions, the design considerations are the same as those for bulk RNA-seq experiments. There should be multiple biological replicates for each condition and conditions should not be confounded with batch. Note that individual cells are not replicates; rather, we are referring to samples derived from replicate donors or cultures. In fact, this adds another dimension into the resourcing equation - should we obtain more cells per sample at the cost of being able to sequence fewer samples? The best answer depends on the sizes of the subpopulations involved, the ease with which they are distinguished from others, and their variability across different samples and conditions. Such factors are rarely known ahead of time, so an informed decision on the design will often benefit from pilot experiments. 
3.3 Creating a count matrix As mentioned above, the exact procedure for quantifying expression depends on the technology involved: For 10X Genomics data, the Cellranger software suite (Zheng et al. 2017) provides a custom pipeline to obtain a count matrix. This uses STAR to align reads to the reference genome and then counts the number of unique UMIs mapped to each gene. Alternatively, pseudo-alignment methods such as alevin (Srivastava et al. 2019) can be used to obtain a count matrix from the same data. This avoids the need for explicit alignment, which reduces the compute time and memory usage. For other highly multiplexed protocols, the scPipe package provides a more general pipeline for processing scRNA-seq data. This uses the Rsubread aligner to align reads and then counts reads or UMIs per gene. For CEL-seq or CEL-seq2 data, the scruff package provides a dedicated pipeline for quantification. For read-based protocols, we can generally re-use the same pipelines for processing bulk RNA-seq data. For any data involving spike-in transcripts, the spike-in sequences should be included as part of the reference genome during alignment and quantification. In all cases, the identity of the genes in the count matrix should be defined with standard identifiers from Ensembl or Entrez. These provide an unambiguous mapping between each row of the matrix and the corresponding gene. In contrast, a single gene symbol may be used by multiple loci, or the mapping between symbols and genes may change over time, e.g., if the gene is renamed. This makes it difficult to re-use the count matrix as we cannot be confident in the meaning of the symbols. (Of course, identifiers can be easily converted to gene symbols later on in the analysis. This is the recommended approach as it allows us to document how the conversion was performed and to backtrack to the stable identifiers if the symbols are ambiguous.) 
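To make the identifier-to-symbol conversion concrete, the chunk below gives a minimal sketch in base R. The identifiers, symbols and the annotation table are all made up for illustration (they are not real genes and do not come from any annotation package); the point is only to show a lookup that keeps the stable identifier whenever the symbol is missing or ambiguous.

```r
# Toy count matrix with Ensembl-style identifiers as row names (values are made up).
counts <- matrix(c(5L, 0L, 2L, 1L, 3L, 0L), nrow=3,
    dimnames=list(
        c('ENSG00000000001', 'ENSG00000000002', 'ENSG00000000003'),
        c('cell1', 'cell2')))

# Hypothetical annotation table; in practice this would come from an
# annotation resource. Note the missing symbol and the duplicated symbol.
anno <- data.frame(
    id=c('ENSG00000000003', 'ENSG00000000001',
        'ENSG00000000002', 'ENSG00000000004'),
    symbol=c('GeneA', 'GeneA', NA, 'GeneB')
)

# Look up the symbol for each row of the matrix by its stable identifier.
ids <- rownames(counts)
symbol <- anno$symbol[match(ids, anno$id)]

# Fall back to the identifier if the symbol is missing, and append the
# identifier to any duplicated symbol to keep the row names unambiguous.
new.names <- ifelse(is.na(symbol), ids, symbol)
dup <- new.names %in% new.names[duplicated(new.names)]
new.names[dup] <- paste0(new.names[dup], '_', ids[dup])

# Keep the original identifiers alongside the new names for backtracking.
conversion <- data.frame(id=ids, name=new.names)
rownames(counts) <- new.names
conversion
```

Storing the conversion table next to the matrix documents how each row name was derived, so the stable identifier can always be recovered later.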
Depending on the process involved, there may be additional points of concern: Some feature-counting tools (e.g., HTSeq) will report mapping statistics in the count matrix, such as the number of unaligned or unassigned reads. While these values can be useful for quality control, they would be misleading if treated as gene expression values. Thus, they should be removed (or at least moved somewhere else) prior to further analyses. The most common spike-ins are those developed by the External RNA Controls Consortium (ERCC), which have names along the lines of ERCC-00002. For human data, one should be careful to distinguish these rows from an actual ERCC gene family that has gene symbols like ERCC1. This issue can be avoided altogether by using standard identifiers that are not susceptible to these naming conflicts. 3.4 Reading counts into R 3.4.1 From tabular formats The next step is to import the count matrix into R. Again, this depends on the output format of the aforementioned processing pipeline. In the simplest case, the pipeline will produce a matrix in tabular format, which can be read in with standard methods like read.delim(). We demonstrate below using a pancreas scRNA-seq dataset from Muraro et al. (2016) (GSE85241): Code to download file library(BiocFileCache) bfc &lt;- BiocFileCache(ask=FALSE) url &lt;- file.path(&quot;ftp://ftp.ncbi.nlm.nih.gov/geo/series&quot;, &quot;GSE85nnn/GSE85241/suppl&quot;, &quot;GSE85241%5Fcellsystems%5Fdataset%5F4donors%5Fupdated%2Ecsv%2Egz&quot;) # Making a symbolic link so that the later code can pretend # that we downloaded the file into the local directory. 
muraro.fname &lt;- bfcrpath(bfc, url) local.name &lt;- URLdecode(basename(url)) unlink(local.name) if (.Platform$OS.type==&quot;windows&quot;) { file.copy(muraro.fname, local.name) } else { file.symlink(muraro.fname, local.name) } mat &lt;- as.matrix(read.delim(&quot;GSE85241_cellsystems_dataset_4donors_updated.csv.gz&quot;)) dim(mat) # number of rows, number of columns ## [1] 19140 3072 In practice, a more efficient approach is to read in the table in sparse format using the readSparseCounts() function from the scuttle package. This only stores the non-zero values and avoids spending memory on the majority of zeros in lowly-sequenced scRNA-seq experiments. library(scuttle) sparse.mat &lt;- readSparseCounts(&quot;GSE85241_cellsystems_dataset_4donors_updated.csv.gz&quot;) dim(sparse.mat) ## [1] 19140 3072 # We can see that it uses less memory compared to &#39;mat&#39;. object.size(sparse.mat) ## 150978872 bytes object.size(mat) ## 471999152 bytes On occasion, we may encounter count data stored in Excel files. These can be extracted into a matrix using functions from the readxl package, as demonstrated for a dataset from Wilson et al. (2015) (GSE61533): Code to download file bfc &lt;- BiocFileCache(&quot;raw_data&quot;, ask=FALSE) wilson.fname &lt;- bfcrpath(bfc, file.path(&quot;ftp://ftp.ncbi.nlm.nih.gov/geo/series&quot;, &quot;GSE61nnn/GSE61533/suppl/GSE61533_HTSEQ_count_results.xls.gz&quot;)) library(R.utils) wilson.name2 &lt;- &quot;GSE61533_HTSEQ_count_results.xls&quot; gunzip(wilson.fname, destname=wilson.name2, remove=FALSE, overwrite=TRUE) library(readxl) all.counts &lt;- read_excel(&quot;GSE61533_HTSEQ_count_results.xls&quot;) gene.names &lt;- all.counts$ID all.counts &lt;- as.matrix(all.counts[,-1]) rownames(all.counts) &lt;- gene.names dim(all.counts) ## [1] 38498 96 3.4.2 From Cellranger output For 10X Genomics data, the Cellranger software suite will produce an output directory containing counts and feature/barcode annotations. 
We can read this into R by supplying the directory path to read10xCounts() from the DropletUtils package, as demonstrated below using a 4000 peripheral blood mononuclear cell dataset. Note that the function produces a SingleCellExperiment object containing the matrix, which we will discuss in more detail in the next chapter. Code to download file library(DropletTestFiles) cached &lt;- getTestFile(&quot;tenx-2.1.0-pbmc4k/1.0.0/filtered.tar.gz&quot;) fpath &lt;- &quot;tenx-2.1.0-pbmc4k&quot; untar(cached, exdir=fpath) library(DropletUtils) sce &lt;- read10xCounts(&quot;tenx-2.1.0-pbmc4k/filtered_gene_bc_matrices/GRCh38&quot;) sce ## class: SingleCellExperiment ## dim: 33694 4340 ## metadata(1): Samples ## assays(1): counts ## rownames(33694): ENSG00000243485 ENSG00000237613 ... ENSG00000277475 ## ENSG00000268674 ## rowData names(2): ID Symbol ## colnames: NULL ## colData names(2): Sample Barcode ## reducedDimNames(0): ## mainExpName: NULL ## altExpNames(0): We can also read in multiple count matrices by passing multiple directory paths to read10xCounts(). Provided that all datasets have the same gene annotation, the function will be able to combine them into a single object. Code to download file # Making a copy and pretending it&#39;s a different sample, # for demonstration purposes. # TODO: actually get a different sample. target &lt;- paste0(fpath, &#39;-2&#39;) unlink(target) if (.Platform$OS.type==&quot;windows&quot;) { file.copy(fpath, target) } else { file.symlink(fpath, target) } dirA &lt;- &quot;tenx-2.1.0-pbmc4k/filtered_gene_bc_matrices/GRCh38&quot; dirB &lt;- &quot;tenx-2.1.0-pbmc4k-2/filtered_gene_bc_matrices/GRCh38&quot; sce &lt;- read10xCounts(c(dirA, dirB)) sce ## class: SingleCellExperiment ## dim: 33694 8680 ## metadata(1): Samples ## assays(1): counts ## rownames(33694): ENSG00000243485 ENSG00000237613 ... 
ENSG00000277475 ## ENSG00000268674 ## rowData names(2): ID Symbol ## colnames: NULL ## colData names(2): Sample Barcode ## reducedDimNames(0): ## mainExpName: NULL ## altExpNames(0): It is worth noting that the Cellranger software suite is not the only approach to processing 10X Genomics data. For example, alevin output can be read into R using the tximeta package, while kallisto-bustools output can be read using the BUSpaRse package. 3.4.3 From HDF5-based formats A family of scRNA-seq storage formats is based around Hierarchical Data Format version 5 (HDF5). These formats offer the ability to store, in the same file, both the expression values and associated gene and cell annotations. One flavor of this approach is the H5AD format, which can be read into R as a SingleCellExperiment using the zellkonverter package. We demonstrate below with an example dataset that is built into the package: library(zellkonverter) demo &lt;- system.file(&quot;extdata&quot;, &quot;krumsiek11.h5ad&quot;, package = &quot;zellkonverter&quot;) sce &lt;- readH5AD(demo) sce ## class: SingleCellExperiment ## dim: 11 640 ## metadata(2): highlights iroot ## assays(1): X ## rownames(11): Gata2 Gata1 ... EgrNab Gfi1 ## rowData names(0): ## colnames(640): 0 1 ... 158-3 159-3 ## colData names(1): cell_type ## reducedDimNames(0): ## mainExpName: NULL ## altExpNames(0): Another flavor is the Loom file format, which we can read into R with the LoomExperiment package. In this case, the procedure creates a SingleCellLoomExperiment, which is effectively a plug-and-play equivalent to the SingleCellExperiment. 
library(LoomExperiment) demo &lt;- system.file(&quot;extdata&quot;, &quot;L1_DRG_20_example.loom&quot;, package = &quot;LoomExperiment&quot;) scle &lt;- import(demo, type=&quot;SingleCellLoomExperiment&quot;) scle ## class: SingleCellLoomExperiment ## dim: 20 20 ## metadata(4): CreatedWith LOOM_SPEC_VERSION LoomExperiment-class ## MatrixName ## assays(1): matrix ## rownames: NULL ## rowData names(7): Accession Gene ... X_Total X_Valid ## colnames: NULL ## colData names(103): Age AnalysisPool ... cDNA_Lib_Ok ngperul_cDNA ## reducedDimNames(0): ## mainExpName: NULL ## altExpNames(0): ## rowGraphs(0): NULL ## colGraphs(2): KNN MKNN The HDF5-based formats have an additional advantage in that Bioconductor-based analyses can be performed without reading all of the data into R. This allows us to analyze very large datasets in the presence of limited computer memory, a functionality that we will discuss in more detail in Advanced Chapter 14. Session Info View session info R version 4.5.1 (2025-06-13) Platform: x86_64-pc-linux-gnu Running under: Ubuntu 24.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C time zone: America/New_York tzcode source: system (glibc) attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] LoomExperiment_1.26.1 BiocIO_1.18.0 [3] rhdf5_2.52.1 zellkonverter_1.18.0 [5] DropletUtils_1.28.0 DropletTestFiles_1.18.0 [7] readxl_1.4.5 R.utils_2.13.0 [9] R.oo_1.27.1 R.methodsS3_1.8.2 [11] scuttle_1.18.0 SingleCellExperiment_1.30.1 [13] SummarizedExperiment_1.38.1 Biobase_2.68.0 [15] GenomicRanges_1.60.0 GenomeInfoDb_1.44.0 [17] 
IRanges_2.42.0 S4Vectors_0.46.0 [19] BiocGenerics_0.54.0 generics_0.1.4 [21] MatrixGenerics_1.20.0 matrixStats_1.5.0 [23] BiocFileCache_2.16.0 dbplyr_2.5.0 [25] BiocStyle_2.36.0 rebook_1.18.0 loaded via a namespace (and not attached): [1] DBI_1.2.3 CodeDepends_0.6.6 [3] rlang_1.1.6 magrittr_2.0.3 [5] compiler_4.5.1 RSQLite_2.4.1 [7] dir.expiry_1.16.0 DelayedMatrixStats_1.30.0 [9] png_0.1-8 vctrs_0.6.5 [11] stringr_1.5.1 pkgconfig_2.0.3 [13] crayon_1.5.3 fastmap_1.2.0 [15] XVector_0.48.0 rmarkdown_2.29 [17] graph_1.86.0 UCSC.utils_1.4.0 [19] purrr_1.0.4 bit_4.6.0 [21] xfun_0.52 cachem_1.1.0 [23] beachmat_2.24.0 jsonlite_2.0.0 [25] blob_1.2.4 rhdf5filters_1.20.0 [27] DelayedArray_0.34.1 Rhdf5lib_1.30.0 [29] BiocParallel_1.42.1 parallel_4.5.1 [31] R6_2.6.1 stringi_1.8.7 [33] bslib_0.9.0 reticulate_1.42.0 [35] limma_3.64.1 jquerylib_0.1.4 [37] cellranger_1.1.0 Rcpp_1.0.14 [39] bookdown_0.43 knitr_1.50 [41] Matrix_1.7-3 tidyselect_1.2.1 [43] abind_1.4-8 yaml_2.3.10 [45] codetools_0.2-20 curl_6.4.0 [47] lattice_0.22-7 tibble_3.3.0 [49] basilisk.utils_1.20.0 withr_3.0.2 [51] KEGGREST_1.48.1 evaluate_1.0.4 [53] ExperimentHub_2.16.0 Biostrings_2.76.0 [55] pillar_1.10.2 BiocManager_1.30.26 [57] filelock_1.0.3 BiocVersion_3.21.1 [59] sparseMatrixStats_1.20.0 glue_1.8.0 [61] tools_4.5.1 AnnotationHub_3.16.0 [63] locfit_1.5-9.12 XML_3.99-0.18 [65] grid_4.5.1 AnnotationDbi_1.70.0 [67] edgeR_4.6.2 GenomeInfoDbData_1.2.14 [69] basilisk_1.20.0 HDF5Array_1.36.0 [71] cli_3.6.5 rappdirs_0.3.3 [73] S4Arrays_1.8.1 dplyr_1.1.4 [75] sass_0.4.10 digest_0.6.37 [77] SparseArray_1.8.0 dqrng_0.4.1 [79] memoise_2.0.1 htmltools_0.5.8.1 [81] lifecycle_1.0.4 h5mread_1.0.1 [83] httr_1.4.7 statmod_1.5.0 [85] mime_0.13 bit64_4.6.0-1 References "],["the-singlecellexperiment-class.html", "Chapter 4 The SingleCellExperiment class 4.1 Overview 4.2 Installing required packages 4.3 Storing primary experimental data 4.4 Handling metadata 4.5 Subsetting and combining 4.6 Single-cell-specific fields 4.7 
Conclusion Session Info", " Chapter 4 The SingleCellExperiment class 4.1 Overview One of the main strengths of the Bioconductor project lies in the use of a common data infrastructure that powers interoperability across packages. Users should be able to analyze their data using functions from different Bioconductor packages without the need to convert between formats. To this end, the SingleCellExperiment class (from the SingleCellExperiment package) serves as the common currency for data exchange across 70+ single-cell-related Bioconductor packages. This class implements a data structure that stores all aspects of our single-cell data - gene-by-cell expression data, per-cell metadata and per-gene annotation (Figure 4.1) - and manipulates them in a synchronized manner. Figure 4.1: Overview of the structure of the SingleCellExperiment class. Each row of the assays corresponds to a row of the rowData (pink shading), while each column of the assays corresponds to a column of the colData and reducedDims (yellow shading). Each piece of (meta)data in the SingleCellExperiment is represented by a separate “slot”. (This terminology comes from the S4 class system, but that’s not important right now.) If we imagine the SingleCellExperiment object to be a cargo ship, the slots can be thought of as individual cargo boxes with different contents, e.g., certain slots expect numeric matrices whereas others may expect data frames. In the rest of this chapter, we will discuss the available slots, their expected formats, and how we can interact with them. 
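As a rough sketch of the cargo-ship picture, the snippet below builds a toy object with three filled slots and subsets it. The matrix, the gene and cell names, and the batch and length fields are all made up for illustration; they are not part of the dataset used in this chapter.

```r
library(SingleCellExperiment)

# Hypothetical toy data: 3 genes (rows) by 4 cells (columns).
toy.counts <- matrix(1:12, nrow = 3,
    dimnames = list(paste0('gene', 1:3), paste0('cell', 1:4)))

toy <- SingleCellExperiment(
    assays = list(counts = toy.counts),                  # primary data
    colData = DataFrame(batch = c('a', 'a', 'b', 'b')),  # one row per cell
    rowData = DataFrame(length = c(100L, 200L, 300L))    # one row per gene
)

# Subsetting the object by column subsets the assay and the colData together.
sub <- toy[, toy$batch == 'b']
dim(sub)
colData(sub)
```

The point of the sketch is only that one subsetting call keeps the assay and the per-cell annotation aligned; the individual slots are covered in detail in the sections that follow.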
More experienced readers may note the similarity with the SummarizedExperiment class, and if you are such a reader, you may wish to jump directly to the end of this chapter for the single-cell-specific aspects of this class. 4.2 Installing required packages The SingleCellExperiment package is automatically installed when using any package that depends on the SingleCellExperiment class, but it can also be explicitly installed: BiocManager::install(&#39;SingleCellExperiment&#39;) We then load the SingleCellExperiment package into our R session. This avoids the need to prefix our function calls with ::, especially for packages that are heavily used throughout a workflow. library(SingleCellExperiment) For the demonstrations in this chapter, we will use some functions from a variety of other packages, installed as shown below. These functions will be accessed through the &lt;package&gt;::&lt;function&gt; convention as needed. BiocManager::install(c(&#39;scuttle&#39;, &#39;scran&#39;, &#39;scater&#39;, &#39;uwot&#39;, &#39;rtracklayer&#39;)) 4.3 Storing primary experimental data 4.3.1 Filling the assays slot To construct a rudimentary SingleCellExperiment object, we only need to fill the assays slot. This contains primary data such as a matrix of sequencing counts where rows correspond to features (genes) and columns correspond to samples (cells) (Figure 4.1, blue box). To demonstrate, we will use a small scRNA-seq count matrix downloaded from the ArrayExpress servers (Lun et al. 2017): Code to download file library(BiocFileCache) bfc &lt;- BiocFileCache(&quot;raw_data&quot;, ask = FALSE) calero.counts &lt;- bfcrpath(bfc, file.path(&quot;https://www.ebi.ac.uk/biostudies&quot;, &quot;files/E-MTAB-5522/counts_Calero_20160113.tsv&quot;)) mat &lt;- read.delim(calero.counts, header=TRUE, row.names=1, check.names=FALSE) # Only considering endogenous genes for now. 
spike.mat &lt;- mat[grepl(&quot;^ERCC-&quot;, rownames(mat)),] mat &lt;- mat[grepl(&quot;^ENSMUSG&quot;, rownames(mat)),] # Splitting off the gene length column. gene.length &lt;- mat[,1] mat &lt;- as.matrix(mat[,-1]) dim(mat) ## [1] 46603 96 From this, we can now construct our first SingleCellExperiment object using the SingleCellExperiment() function. Note that we provide our data as a named list where each entry of the list is a matrix - in this case, named \"counts\". sce &lt;- SingleCellExperiment(assays = list(counts = mat)) To inspect the object, we can simply type sce into the console to see some pertinent information. This will print an overview of the various slots available to us, which may or may not have any data. sce ## class: SingleCellExperiment ## dim: 46603 96 ## metadata(0): ## assays(1): counts ## rownames(46603): ENSMUSG00000102693 ENSMUSG00000064842 ... ## ENSMUSG00000096730 ENSMUSG00000095742 ## rowData names(0): ## colnames(96): SLX-9555.N701_S502.C89V9ANXX.s_1.r_1 ## SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 ... ## SLX-9555.N712_S508.C89V9ANXX.s_1.r_1 ## SLX-9555.N712_S517.C89V9ANXX.s_1.r_1 ## colData names(0): ## reducedDimNames(0): ## mainExpName: NULL ## altExpNames(0): To access the count data we just supplied, we can do any one of the following: assay(sce, \"counts\") - this is the most general method, where we can supply the name of the assay as the second argument. counts(sce) - this is a short-cut for the above, but only works for assays with the special name \"counts\". mat2 &lt;- counts(sce) NOTE: For exported S4 classes, it is rarely a good idea to directly access the slots with the @ operator. This is considered bad practice as the class developers are free to alter the internal structure of the class, at which point any code using @ may no longer work. Rather, it is best to use the provided getter functions like assay() and counts() to extract data from the object. 
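To make the two getters concrete, here is a minimal sketch on a throwaway object. The toy object and its random counts are assumptions for illustration only and are unrelated to the sce built above.

```r
library(SingleCellExperiment)

# Hypothetical toy object with a single assay named 'counts' (5 genes, 4 cells).
toy <- SingleCellExperiment(
    assays = list(counts = matrix(rpois(20, lambda = 5), nrow = 5))
)

by.name <- assay(toy, 'counts')  # general getter, works for any assay name
by.shortcut <- counts(toy)       # convenience getter for the 'counts' assay

# Both getters should hand back the same matrix.
identical(by.name, by.shortcut)
```

Neither call touches the object's internals with @, so the code should keep working even if the internal layout of the class changes.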
Similarly, setting slots should be done via the provided setter functions, which are discussed in the next section. 4.3.2 Adding more assays One of the strengths of the assays slot is that it can hold multiple representations of the primary data. This is particularly useful for storing the raw count matrix as well as a normalized version of the data. For example, the logNormCounts() function from scuttle will compute a log-transformed normalized expression matrix and store it as another assay. sce &lt;- scuttle::logNormCounts(sce) sce ## class: SingleCellExperiment ## dim: 46603 96 ## metadata(0): ## assays(2): counts logcounts ## rownames(46603): ENSMUSG00000102693 ENSMUSG00000064842 ... ## ENSMUSG00000096730 ENSMUSG00000095742 ## rowData names(0): ## colnames(96): SLX-9555.N701_S502.C89V9ANXX.s_1.r_1 ## SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 ... ## SLX-9555.N712_S508.C89V9ANXX.s_1.r_1 ## SLX-9555.N712_S517.C89V9ANXX.s_1.r_1 ## colData names(1): sizeFactor ## reducedDimNames(0): ## mainExpName: NULL ## altExpNames(0): We overwrite our previous sce by reassigning the result of logNormCounts() back to sce. This is possible because these particular functions return a SingleCellExperiment object that contains the results in addition to the original data. Indeed, viewing the object again, we see that this function added a new assay entry \"logcounts\". sce ## class: SingleCellExperiment ## dim: 46603 96 ## metadata(0): ## assays(2): counts logcounts ## rownames(46603): ENSMUSG00000102693 ENSMUSG00000064842 ... ## ENSMUSG00000096730 ENSMUSG00000095742 ## rowData names(0): ## colnames(96): SLX-9555.N701_S502.C89V9ANXX.s_1.r_1 ## SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 ... ## SLX-9555.N712_S508.C89V9ANXX.s_1.r_1 ## SLX-9555.N712_S517.C89V9ANXX.s_1.r_1 ## colData names(1): sizeFactor ## reducedDimNames(0): ## mainExpName: NULL ## altExpNames(0): Similar to \"counts\", the \"logcounts\" assay can be conveniently accessed using logcounts(sce). 
dim(logcounts(sce)) ## [1] 46603 96 While logNormCounts() will automatically add assays to our sce object and return the modified object, other functions may return a matrix directly. In such cases, we need to manually insert the matrix into the assays, which we can do with the corresponding setter function. To illustrate, we can create a new assay where we add 100 to all the counts: counts_100 &lt;- counts(sce) + 100 assay(sce, &quot;counts_100&quot;) &lt;- counts_100 # assign a new entry to assays slot assays(sce) # new assay has now been added. ## List of length 3 ## names(3): counts logcounts counts_100 4.3.3 Further comments To retrieve all the available assays within sce, we can use the assays() getter. By comparison, assay() only returns a single assay of interest. assays(sce) ## List of length 3 ## names(3): counts logcounts counts_100 The setter of the same name can be used to modify the set of available assays: # Only keeping the first two assays assays(sce) &lt;- assays(sce)[1:2] sce ## class: SingleCellExperiment ## dim: 46603 96 ## metadata(0): ## assays(2): counts logcounts ## rownames(46603): ENSMUSG00000102693 ENSMUSG00000064842 ... ## ENSMUSG00000096730 ENSMUSG00000095742 ## rowData names(0): ## colnames(96): SLX-9555.N701_S502.C89V9ANXX.s_1.r_1 ## SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 ... ## SLX-9555.N712_S508.C89V9ANXX.s_1.r_1 ## SLX-9555.N712_S517.C89V9ANXX.s_1.r_1 ## colData names(1): sizeFactor ## reducedDimNames(0): ## mainExpName: NULL ## altExpNames(0): Alternatively, if we just want the names of the assays, we can get or set them with assayNames(): assayNames(sce) ## [1] &quot;counts&quot; &quot;logcounts&quot; names(assays(sce)) # same result, but slightly less efficient. ## [1] &quot;counts&quot; &quot;logcounts&quot; 4.4 Handling metadata 4.4.1 On the columns To further annotate our SingleCellExperiment object, we can add metadata to describe the columns of our primary data, e.g., the samples or cells of our experiment. 
This is stored in the colData slot, a DataFrame object where rows correspond to cells and columns correspond to metadata fields, e.g., batch of origin, treatment condition (Figure 4.1, orange box). We demonstrate using some of the metadata accompanying the counts from Lun et al. (2017). Code to download file # Downloading the SDRF file containing the metadata. lun.sdrf &lt;- bfcrpath(bfc, file.path(&quot;https://www.ebi.ac.uk/arrayexpress/files&quot;, &quot;E-MTAB-5522/E-MTAB-5522.sdrf.txt&quot;)) coldata &lt;- read.delim(lun.sdrf, check.names=FALSE) # Only keeping the cells involved in the count matrix in &#39;mat&#39;. coldata &lt;- coldata[coldata[,&quot;Derived Array Data File&quot;]==&quot;counts_Calero_20160113.tsv&quot;,] # Only keeping interesting columns, and setting the library names as the row names. coldata &lt;- DataFrame( genotype=coldata[,&quot;Characteristics[genotype]&quot;], phenotype=coldata[,&quot;Characteristics[phenotype]&quot;], spike_in=coldata[,&quot;Factor Value[spike-in addition]&quot;], row.names=coldata[,&quot;Source Name&quot;] ) coldata ## DataFrame with 96 rows and 3 columns ## genotype ## &lt;character&gt; ## SLX-9555.N701_S502.C89V9ANXX.s_1.r_1 Doxycycline-inducibl.. ## SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 Doxycycline-inducibl.. ## SLX-9555.N701_S504.C89V9ANXX.s_1.r_1 Doxycycline-inducibl.. ## SLX-9555.N701_S505.C89V9ANXX.s_1.r_1 Doxycycline-inducibl.. ## SLX-9555.N701_S506.C89V9ANXX.s_1.r_1 Doxycycline-inducibl.. ## ... ... ## SLX-9555.N712_S505.C89V9ANXX.s_1.r_1 Doxycycline-inducibl.. ## SLX-9555.N712_S506.C89V9ANXX.s_1.r_1 Doxycycline-inducibl.. ## SLX-9555.N712_S507.C89V9ANXX.s_1.r_1 Doxycycline-inducibl.. ## SLX-9555.N712_S508.C89V9ANXX.s_1.r_1 Doxycycline-inducibl.. ## SLX-9555.N712_S517.C89V9ANXX.s_1.r_1 Doxycycline-inducibl.. 
## phenotype spike_in ## &lt;character&gt; &lt;character&gt; ## SLX-9555.N701_S502.C89V9ANXX.s_1.r_1 wild type phenotype ERCC+SIRV ## SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 wild type phenotype ERCC+SIRV ## SLX-9555.N701_S504.C89V9ANXX.s_1.r_1 wild type phenotype ERCC+SIRV ## SLX-9555.N701_S505.C89V9ANXX.s_1.r_1 induced CBFB-MYH11 o.. ERCC+SIRV ## SLX-9555.N701_S506.C89V9ANXX.s_1.r_1 induced CBFB-MYH11 o.. ERCC+SIRV ## ... ... ... ## SLX-9555.N712_S505.C89V9ANXX.s_1.r_1 induced CBFB-MYH11 o.. Premixed ## SLX-9555.N712_S506.C89V9ANXX.s_1.r_1 induced CBFB-MYH11 o.. Premixed ## SLX-9555.N712_S507.C89V9ANXX.s_1.r_1 induced CBFB-MYH11 o.. Premixed ## SLX-9555.N712_S508.C89V9ANXX.s_1.r_1 induced CBFB-MYH11 o.. Premixed ## SLX-9555.N712_S517.C89V9ANXX.s_1.r_1 wild type phenotype Premixed Now, we can take two approaches - either add the coldata to our existing sce, or start from scratch via the SingleCellExperiment() constructor. If we start from scratch: sce &lt;- SingleCellExperiment(assays = list(counts=mat), colData=coldata) Similar to assays, we can see our colData is now populated: sce ## class: SingleCellExperiment ## dim: 46603 96 ## metadata(0): ## assays(1): counts ## rownames(46603): ENSMUSG00000102693 ENSMUSG00000064842 ... ## ENSMUSG00000096730 ENSMUSG00000095742 ## rowData names(0): ## colnames(96): SLX-9555.N701_S502.C89V9ANXX.s_1.r_1 ## SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 ... ## SLX-9555.N712_S508.C89V9ANXX.s_1.r_1 ## SLX-9555.N712_S517.C89V9ANXX.s_1.r_1 ## colData names(3): genotype phenotype spike_in ## reducedDimNames(0): ## mainExpName: NULL ## altExpNames(0): We can access our column data with the colData() function: colData(sce) ## DataFrame with 96 rows and 3 columns ## genotype ## &lt;character&gt; ## SLX-9555.N701_S502.C89V9ANXX.s_1.r_1 Doxycycline-inducibl.. ## SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 Doxycycline-inducibl.. ## SLX-9555.N701_S504.C89V9ANXX.s_1.r_1 Doxycycline-inducibl.. ## SLX-9555.N701_S505.C89V9ANXX.s_1.r_1 Doxycycline-inducibl.. 
## SLX-9555.N701_S506.C89V9ANXX.s_1.r_1 Doxycycline-inducibl.. ## ... ... ## SLX-9555.N712_S505.C89V9ANXX.s_1.r_1 Doxycycline-inducibl.. ## SLX-9555.N712_S506.C89V9ANXX.s_1.r_1 Doxycycline-inducibl.. ## SLX-9555.N712_S507.C89V9ANXX.s_1.r_1 Doxycycline-inducibl.. ## SLX-9555.N712_S508.C89V9ANXX.s_1.r_1 Doxycycline-inducibl.. ## SLX-9555.N712_S517.C89V9ANXX.s_1.r_1 Doxycycline-inducibl.. ## phenotype spike_in ## &lt;character&gt; &lt;character&gt; ## SLX-9555.N701_S502.C89V9ANXX.s_1.r_1 wild type phenotype ERCC+SIRV ## SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 wild type phenotype ERCC+SIRV ## SLX-9555.N701_S504.C89V9ANXX.s_1.r_1 wild type phenotype ERCC+SIRV ## SLX-9555.N701_S505.C89V9ANXX.s_1.r_1 induced CBFB-MYH11 o.. ERCC+SIRV ## SLX-9555.N701_S506.C89V9ANXX.s_1.r_1 induced CBFB-MYH11 o.. ERCC+SIRV ## ... ... ... ## SLX-9555.N712_S505.C89V9ANXX.s_1.r_1 induced CBFB-MYH11 o.. Premixed ## SLX-9555.N712_S506.C89V9ANXX.s_1.r_1 induced CBFB-MYH11 o.. Premixed ## SLX-9555.N712_S507.C89V9ANXX.s_1.r_1 induced CBFB-MYH11 o.. Premixed ## SLX-9555.N712_S508.C89V9ANXX.s_1.r_1 induced CBFB-MYH11 o.. Premixed ## SLX-9555.N712_S517.C89V9ANXX.s_1.r_1 wild type phenotype Premixed Or more simply, we can extract a single field using the $ shortcut: head(sce$phenotype) Alternatively, we can add colData to an existing object, either en masse: sce &lt;- SingleCellExperiment(list(counts=mat)) colData(sce) &lt;- coldata sce ## class: SingleCellExperiment ## dim: 46603 96 ## metadata(0): ## assays(1): counts ## rownames(46603): ENSMUSG00000102693 ENSMUSG00000064842 ... ## ENSMUSG00000096730 ENSMUSG00000095742 ## rowData names(0): ## colnames(96): SLX-9555.N701_S502.C89V9ANXX.s_1.r_1 ## SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 ... 
## SLX-9555.N712_S508.C89V9ANXX.s_1.r_1 ## SLX-9555.N712_S517.C89V9ANXX.s_1.r_1 ## colData names(3): genotype phenotype spike_in ## reducedDimNames(0): ## mainExpName: NULL ## altExpNames(0): Or piece by piece: sce &lt;- SingleCellExperiment(list(counts=mat)) sce$phenotype &lt;- coldata$phenotype colData(sce) ## DataFrame with 96 rows and 1 column ## phenotype ## &lt;character&gt; ## SLX-9555.N701_S502.C89V9ANXX.s_1.r_1 wild type phenotype ## SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 wild type phenotype ## SLX-9555.N701_S504.C89V9ANXX.s_1.r_1 wild type phenotype ## SLX-9555.N701_S505.C89V9ANXX.s_1.r_1 induced CBFB-MYH11 o.. ## SLX-9555.N701_S506.C89V9ANXX.s_1.r_1 induced CBFB-MYH11 o.. ## ... ... ## SLX-9555.N712_S505.C89V9ANXX.s_1.r_1 induced CBFB-MYH11 o.. ## SLX-9555.N712_S506.C89V9ANXX.s_1.r_1 induced CBFB-MYH11 o.. ## SLX-9555.N712_S507.C89V9ANXX.s_1.r_1 induced CBFB-MYH11 o.. ## SLX-9555.N712_S508.C89V9ANXX.s_1.r_1 induced CBFB-MYH11 o.. ## SLX-9555.N712_S517.C89V9ANXX.s_1.r_1 wild type phenotype Regardless of which strategy you pick, it is very important to make sure that the rows of your colData actually refer to the same cells as the columns of your count matrix. It is usually a good idea to check that the names are consistent before constructing the SingleCellExperiment: stopifnot(identical(rownames(coldata), colnames(mat))) Some functions automatically add column metadata by returning a SingleCellExperiment with extra fields in the colData slot. 
For example, the scuttle package contains the addPerCellQC() function that appends a number of quality control metrics to the colData: sce &lt;- scuttle::addPerCellQC(sce) colData(sce) ## DataFrame with 96 rows and 4 columns ## phenotype sum detected ## &lt;character&gt; &lt;numeric&gt; &lt;numeric&gt; ## SLX-9555.N701_S502.C89V9ANXX.s_1.r_1 wild type phenotype 854171 7617 ## SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 wild type phenotype 1044243 7520 ## SLX-9555.N701_S504.C89V9ANXX.s_1.r_1 wild type phenotype 1152450 8305 ## SLX-9555.N701_S505.C89V9ANXX.s_1.r_1 induced CBFB-MYH11 o.. 1193876 8142 ## SLX-9555.N701_S506.C89V9ANXX.s_1.r_1 induced CBFB-MYH11 o.. 1521472 7153 ## ... ... ... ... ## SLX-9555.N712_S505.C89V9ANXX.s_1.r_1 induced CBFB-MYH11 o.. 203221 5608 ## SLX-9555.N712_S506.C89V9ANXX.s_1.r_1 induced CBFB-MYH11 o.. 1059853 6948 ## SLX-9555.N712_S507.C89V9ANXX.s_1.r_1 induced CBFB-MYH11 o.. 1672343 6879 ## SLX-9555.N712_S508.C89V9ANXX.s_1.r_1 induced CBFB-MYH11 o.. 1939537 7213 ## SLX-9555.N712_S517.C89V9ANXX.s_1.r_1 wild type phenotype 1436899 8469 ## total ## &lt;numeric&gt; ## SLX-9555.N701_S502.C89V9ANXX.s_1.r_1 854171 ## SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 1044243 ## SLX-9555.N701_S504.C89V9ANXX.s_1.r_1 1152450 ## SLX-9555.N701_S505.C89V9ANXX.s_1.r_1 1193876 ## SLX-9555.N701_S506.C89V9ANXX.s_1.r_1 1521472 ## ... ... ## SLX-9555.N712_S505.C89V9ANXX.s_1.r_1 203221 ## SLX-9555.N712_S506.C89V9ANXX.s_1.r_1 1059853 ## SLX-9555.N712_S507.C89V9ANXX.s_1.r_1 1672343 ## SLX-9555.N712_S508.C89V9ANXX.s_1.r_1 1939537 ## SLX-9555.N712_S517.C89V9ANXX.s_1.r_1 1436899 4.4.2 On the rows We store feature-level annotation in the rowData slot, a DataFrame where each row corresponds to a gene and contains annotations like the transcript length or gene symbol. 
We can manually get and set this with the rowData() function, as shown below: rowData(sce)$Length &lt;- gene.length rowData(sce) ## DataFrame with 46603 rows and 1 column ## Length ## &lt;integer&gt; ## ENSMUSG00000102693 1070 ## ENSMUSG00000064842 110 ## ENSMUSG00000051951 6094 ## ENSMUSG00000102851 480 ## ENSMUSG00000103377 2819 ## ... ... ## ENSMUSG00000094431 100 ## ENSMUSG00000094621 121 ## ENSMUSG00000098647 99 ## ENSMUSG00000096730 3077 ## ENSMUSG00000095742 243 Some functions will return a SingleCellExperiment with the rowData populated with relevant bits of information, such as the addPerFeatureQC() function: sce &lt;- scuttle::addPerFeatureQC(sce) rowData(sce) ## DataFrame with 46603 rows and 3 columns ## Length mean detected ## &lt;integer&gt; &lt;numeric&gt; &lt;numeric&gt; ## ENSMUSG00000102693 1070 0.0000000 0.00000 ## ENSMUSG00000064842 110 0.0000000 0.00000 ## ENSMUSG00000051951 6094 0.0000000 0.00000 ## ENSMUSG00000102851 480 0.0000000 0.00000 ## ENSMUSG00000103377 2819 0.0208333 1.04167 ## ... ... ... ... ## ENSMUSG00000094431 100 0 0 ## ENSMUSG00000094621 121 0 0 ## ENSMUSG00000098647 99 0 0 ## ENSMUSG00000096730 3077 0 0 ## ENSMUSG00000095742 243 0 0 Furthermore, there is a special rowRanges slot to hold genomic coordinates in the form of a GRanges or GRangesList. This stores the chromosome, start, and end coordinates of the features (genes, genomic regions) that can be easily queried and manipulated via the GenomicRanges framework. In our case, rowRanges(sce) produces an empty list because we did not fill it with any coordinate information. 
rowRanges(sce) # empty ## GRangesList object of length 46603: ## $ENSMUSG00000102693 ## GRanges object with 0 ranges and 0 metadata columns: ## seqnames ranges strand ## &lt;Rle&gt; &lt;IRanges&gt; &lt;Rle&gt; ## ------- ## seqinfo: no sequences ## ## $ENSMUSG00000064842 ## GRanges object with 0 ranges and 0 metadata columns: ## seqnames ranges strand ## &lt;Rle&gt; &lt;IRanges&gt; &lt;Rle&gt; ## ------- ## seqinfo: no sequences ## ## $ENSMUSG00000051951 ## GRanges object with 0 ranges and 0 metadata columns: ## seqnames ranges strand ## &lt;Rle&gt; &lt;IRanges&gt; &lt;Rle&gt; ## ------- ## seqinfo: no sequences ## ## ... ## &lt;46600 more elements&gt; The manner in which we populate the rowRanges depends on the organism and annotation used during alignment and quantification. Here, we have Ensembl identifiers, so we might use rtracklayer to load in a GRanges from a GTF file containing the Ensembl annotation used in this dataset: Code to download file # The paper uses Ensembl 82. mm10.gtf &lt;- bfcrpath(bfc, file.path(&quot;http://ftp.ensembl.org/pub/release-82&quot;, &quot;gtf/mus_musculus/Mus_musculus.GRCm38.82.gtf.gz&quot;)) gene.data &lt;- rtracklayer::import(mm10.gtf) # Cleaning up the object. 
gene.data &lt;- gene.data[gene.data$type==&quot;gene&quot;] names(gene.data) &lt;- gene.data$gene_id is.gene.related &lt;- grep(&quot;gene_&quot;, colnames(mcols(gene.data))) mcols(gene.data) &lt;- mcols(gene.data)[,is.gene.related] rowRanges(sce) &lt;- gene.data[rownames(sce)] rowRanges(sce)[1:10,] ## GRanges object with 10 ranges and 6 metadata columns: ## seqnames ranges strand | gene_id ## &lt;Rle&gt; &lt;IRanges&gt; &lt;Rle&gt; | &lt;character&gt; ## ENSMUSG00000102693 1 3073253-3074322 + | ENSMUSG00000102693 ## ENSMUSG00000064842 1 3102016-3102125 + | ENSMUSG00000064842 ## ENSMUSG00000051951 1 3205901-3671498 - | ENSMUSG00000051951 ## ENSMUSG00000102851 1 3252757-3253236 + | ENSMUSG00000102851 ## ENSMUSG00000103377 1 3365731-3368549 - | ENSMUSG00000103377 ## ENSMUSG00000104017 1 3375556-3377788 - | ENSMUSG00000104017 ## ENSMUSG00000103025 1 3464977-3467285 - | ENSMUSG00000103025 ## ENSMUSG00000089699 1 3466587-3513553 + | ENSMUSG00000089699 ## ENSMUSG00000103201 1 3512451-3514507 - | ENSMUSG00000103201 ## ENSMUSG00000103147 1 3531795-3532720 + | ENSMUSG00000103147 ## gene_version gene_name gene_source ## &lt;character&gt; &lt;character&gt; &lt;character&gt; ## ENSMUSG00000102693 1 4933401J01Rik havana ## ENSMUSG00000064842 1 Gm26206 ensembl ## ENSMUSG00000051951 5 Xkr4 ensembl_havana ## ENSMUSG00000102851 1 Gm18956 havana ## ENSMUSG00000103377 1 Gm37180 havana ## ENSMUSG00000104017 1 Gm37363 havana ## ENSMUSG00000103025 1 Gm37686 havana ## ENSMUSG00000089699 1 Gm1992 havana ## ENSMUSG00000103201 1 Gm37329 havana ## ENSMUSG00000103147 1 Gm7341 havana ## gene_biotype havana_gene_version ## &lt;character&gt; &lt;character&gt; ## ENSMUSG00000102693 TEC 1 ## ENSMUSG00000064842 snRNA &lt;NA&gt; ## ENSMUSG00000051951 protein_coding 2 ## ENSMUSG00000102851 processed_pseudogene 1 ## ENSMUSG00000103377 TEC 1 ## ENSMUSG00000104017 TEC 1 ## ENSMUSG00000103025 TEC 1 ## ENSMUSG00000089699 antisense 1 ## ENSMUSG00000103201 TEC 1 ## ENSMUSG00000103147 processed_pseudogene 1 
## ------- ## seqinfo: 45 sequences from an unspecified genome; no seqlengths 4.4.3 Other metadata Some analyses contain results or annotations that do not fit into the aforementioned slots, e.g., study metadata. This can be stored in the metadata slot, a named list of arbitrary objects. For example, say we have some favorite genes (e.g., highly variable genes) that we want to store inside of sce for use in our analysis at a later point. We can do this simply by appending to the metadata slot as follows: my_genes &lt;- c(&quot;gene_1&quot;, &quot;gene_5&quot;) metadata(sce) &lt;- list(favorite_genes = my_genes) metadata(sce) ## $favorite_genes ## [1] &quot;gene_1&quot; &quot;gene_5&quot; Similarly, we can append more information via the $ operator: your_genes &lt;- c(&quot;gene_4&quot;, &quot;gene_8&quot;) metadata(sce)$your_genes &lt;- your_genes metadata(sce) ## $favorite_genes ## [1] &quot;gene_1&quot; &quot;gene_5&quot; ## ## $your_genes ## [1] &quot;gene_4&quot; &quot;gene_8&quot; The main disadvantage of storing content in the metadata() is that it will not be synchronized with the rows or columns when subsetting or combining (see the next section). Thus, if an annotation field is related to the rows or columns, we suggest storing it in the rowData() or colData() instead. 4.5 Subsetting and combining One of the major advantages of using the SingleCellExperiment is that operations on the rows or columns of the expression data are synchronized with the associated annotation. This avoids embarrassing bookkeeping errors where we, e.g., subset the count matrix but forget to subset the corresponding feature/sample annotation. For example, if we subset our sce object to the first 10 cells, we can see that both the count matrix and the colData() are automatically subsetted: first.10 &lt;- sce[,1:10] ncol(counts(first.10)) # only 10 columns. ## [1] 10 colData(first.10) # only 10 rows. 
## DataFrame with 10 rows and 4 columns ## phenotype sum detected ## &lt;character&gt; &lt;numeric&gt; &lt;numeric&gt; ## SLX-9555.N701_S502.C89V9ANXX.s_1.r_1 wild type phenotype 854171 7617 ## SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 wild type phenotype 1044243 7520 ## SLX-9555.N701_S504.C89V9ANXX.s_1.r_1 wild type phenotype 1152450 8305 ## SLX-9555.N701_S505.C89V9ANXX.s_1.r_1 induced CBFB-MYH11 o.. 1193876 8142 ## SLX-9555.N701_S506.C89V9ANXX.s_1.r_1 induced CBFB-MYH11 o.. 1521472 7153 ## SLX-9555.N701_S507.C89V9ANXX.s_1.r_1 induced CBFB-MYH11 o.. 866705 6828 ## SLX-9555.N701_S508.C89V9ANXX.s_1.r_1 induced CBFB-MYH11 o.. 608581 6966 ## SLX-9555.N701_S517.C89V9ANXX.s_1.r_1 wild type phenotype 1113526 8634 ## SLX-9555.N702_S502.C89V9ANXX.s_1.r_1 wild type phenotype 1308250 8364 ## SLX-9555.N702_S503.C89V9ANXX.s_1.r_1 wild type phenotype 778605 8665 ## total ## &lt;numeric&gt; ## SLX-9555.N701_S502.C89V9ANXX.s_1.r_1 854171 ## SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 1044243 ## SLX-9555.N701_S504.C89V9ANXX.s_1.r_1 1152450 ## SLX-9555.N701_S505.C89V9ANXX.s_1.r_1 1193876 ## SLX-9555.N701_S506.C89V9ANXX.s_1.r_1 1521472 ## SLX-9555.N701_S507.C89V9ANXX.s_1.r_1 866705 ## SLX-9555.N701_S508.C89V9ANXX.s_1.r_1 608581 ## SLX-9555.N701_S517.C89V9ANXX.s_1.r_1 1113526 ## SLX-9555.N702_S502.C89V9ANXX.s_1.r_1 1308250 ## SLX-9555.N702_S503.C89V9ANXX.s_1.r_1 778605 Similarly, if we only wanted wild-type cells, we could subset our sce object based on its colData() entries: wt.only &lt;- sce[, sce$phenotype == &quot;wild type phenotype&quot;] ncol(counts(wt.only)) ## [1] 48 colData(wt.only) ## DataFrame with 48 rows and 4 columns ## phenotype sum detected ## &lt;character&gt; &lt;numeric&gt; &lt;numeric&gt; ## SLX-9555.N701_S502.C89V9ANXX.s_1.r_1 wild type phenotype 854171 7617 ## SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 wild type phenotype 1044243 7520 ## SLX-9555.N701_S504.C89V9ANXX.s_1.r_1 wild type phenotype 1152450 8305 ## SLX-9555.N701_S517.C89V9ANXX.s_1.r_1 wild type phenotype 1113526 8634 
## SLX-9555.N702_S502.C89V9ANXX.s_1.r_1 wild type phenotype 1308250 8364 ## ... ... ... ... ## SLX-9555.N711_S517.C89V9ANXX.s_1.r_1 wild type phenotype 1317671 8581 ## SLX-9555.N712_S502.C89V9ANXX.s_1.r_1 wild type phenotype 1736189 9687 ## SLX-9555.N712_S503.C89V9ANXX.s_1.r_1 wild type phenotype 1521132 8983 ## SLX-9555.N712_S504.C89V9ANXX.s_1.r_1 wild type phenotype 1759166 8480 ## SLX-9555.N712_S517.C89V9ANXX.s_1.r_1 wild type phenotype 1436899 8469 ## total ## &lt;numeric&gt; ## SLX-9555.N701_S502.C89V9ANXX.s_1.r_1 854171 ## SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 1044243 ## SLX-9555.N701_S504.C89V9ANXX.s_1.r_1 1152450 ## SLX-9555.N701_S517.C89V9ANXX.s_1.r_1 1113526 ## SLX-9555.N702_S502.C89V9ANXX.s_1.r_1 1308250 ## ... ... ## SLX-9555.N711_S517.C89V9ANXX.s_1.r_1 1317671 ## SLX-9555.N712_S502.C89V9ANXX.s_1.r_1 1736189 ## SLX-9555.N712_S503.C89V9ANXX.s_1.r_1 1521132 ## SLX-9555.N712_S504.C89V9ANXX.s_1.r_1 1759166 ## SLX-9555.N712_S517.C89V9ANXX.s_1.r_1 1436899 The same logic applies to the rowData(). Say we only want to keep protein-coding genes: coding.only &lt;- sce[rowData(sce)$gene_biotype == &quot;protein_coding&quot;,] nrow(counts(coding.only)) ## [1] 22013 rowData(coding.only) ## DataFrame with 22013 rows and 6 columns ## gene_id gene_version gene_name ## &lt;character&gt; &lt;character&gt; &lt;character&gt; ## ENSMUSG00000051951 ENSMUSG00000051951 5 Xkr4 ## ENSMUSG00000025900 ENSMUSG00000025900 9 Rp1 ## ENSMUSG00000025902 ENSMUSG00000025902 12 Sox17 ## ENSMUSG00000033845 ENSMUSG00000033845 12 Mrpl15 ## ENSMUSG00000025903 ENSMUSG00000025903 13 Lypla1 ## ... ... ... ... 
## ENSMUSG00000079808 ENSMUSG00000079808 3 AC168977.1 ## ENSMUSG00000095041 ENSMUSG00000095041 6 PISD ## ENSMUSG00000063897 ENSMUSG00000063897 3 DHRSX ## ENSMUSG00000096730 ENSMUSG00000096730 6 Vmn2r122 ## ENSMUSG00000095742 ENSMUSG00000095742 1 CAAA01147332.1 ## gene_source gene_biotype havana_gene_version ## &lt;character&gt; &lt;character&gt; &lt;character&gt; ## ENSMUSG00000051951 ensembl_havana protein_coding 2 ## ENSMUSG00000025900 ensembl_havana protein_coding 2 ## ENSMUSG00000025902 ensembl_havana protein_coding 6 ## ENSMUSG00000033845 ensembl_havana protein_coding 3 ## ENSMUSG00000025903 ensembl_havana protein_coding 3 ## ... ... ... ... ## ENSMUSG00000079808 ensembl protein_coding NA ## ENSMUSG00000095041 ensembl protein_coding NA ## ENSMUSG00000063897 ensembl protein_coding NA ## ENSMUSG00000096730 ensembl protein_coding NA ## ENSMUSG00000095742 ensembl protein_coding NA Conversely, if we were to combine multiple SingleCellExperiment objects, the class would take care of combining both the expression values and the associated annotation in a coherent manner. We can use cbind() to combine objects by column, assuming that all objects involved have the same row annotation values and compatible column annotation fields. sce2 &lt;- cbind(sce, sce) ncol(counts(sce2)) # twice as many columns ## [1] 192 colData(sce2) # twice as many rows ## DataFrame with 192 rows and 4 columns ## phenotype sum detected ## &lt;character&gt; &lt;numeric&gt; &lt;numeric&gt; ## SLX-9555.N701_S502.C89V9ANXX.s_1.r_1 wild type phenotype 854171 7617 ## SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 wild type phenotype 1044243 7520 ## SLX-9555.N701_S504.C89V9ANXX.s_1.r_1 wild type phenotype 1152450 8305 ## SLX-9555.N701_S505.C89V9ANXX.s_1.r_1 induced CBFB-MYH11 o.. 1193876 8142 ## SLX-9555.N701_S506.C89V9ANXX.s_1.r_1 induced CBFB-MYH11 o.. 1521472 7153 ## ... ... ... ... ## SLX-9555.N712_S505.C89V9ANXX.s_1.r_1 induced CBFB-MYH11 o.. 
203221 5608 ## SLX-9555.N712_S506.C89V9ANXX.s_1.r_1 induced CBFB-MYH11 o.. 1059853 6948 ## SLX-9555.N712_S507.C89V9ANXX.s_1.r_1 induced CBFB-MYH11 o.. 1672343 6879 ## SLX-9555.N712_S508.C89V9ANXX.s_1.r_1 induced CBFB-MYH11 o.. 1939537 7213 ## SLX-9555.N712_S517.C89V9ANXX.s_1.r_1 wild type phenotype 1436899 8469 ## total ## &lt;numeric&gt; ## SLX-9555.N701_S502.C89V9ANXX.s_1.r_1 854171 ## SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 1044243 ## SLX-9555.N701_S504.C89V9ANXX.s_1.r_1 1152450 ## SLX-9555.N701_S505.C89V9ANXX.s_1.r_1 1193876 ## SLX-9555.N701_S506.C89V9ANXX.s_1.r_1 1521472 ## ... ... ## SLX-9555.N712_S505.C89V9ANXX.s_1.r_1 203221 ## SLX-9555.N712_S506.C89V9ANXX.s_1.r_1 1059853 ## SLX-9555.N712_S507.C89V9ANXX.s_1.r_1 1672343 ## SLX-9555.N712_S508.C89V9ANXX.s_1.r_1 1939537 ## SLX-9555.N712_S517.C89V9ANXX.s_1.r_1 1436899 Similarly, we can use rbind() to combine objects by row, assuming all objects have the same column annotation values and compatible row annotation fields. sce2 &lt;- rbind(sce, sce) nrow(counts(sce2)) # twice as many rows ## [1] 93206 rowData(sce2) # twice as many rows ## DataFrame with 93206 rows and 6 columns ## gene_id gene_version gene_name ## &lt;character&gt; &lt;character&gt; &lt;character&gt; ## ENSMUSG00000102693 ENSMUSG00000102693 1 4933401J01Rik ## ENSMUSG00000064842 ENSMUSG00000064842 1 Gm26206 ## ENSMUSG00000051951 ENSMUSG00000051951 5 Xkr4 ## ENSMUSG00000102851 ENSMUSG00000102851 1 Gm18956 ## ENSMUSG00000103377 ENSMUSG00000103377 1 Gm37180 ## ... ... ... ... 
## ENSMUSG00000094431 ENSMUSG00000094431 1 CAAA01205117.1 ## ENSMUSG00000094621 ENSMUSG00000094621 1 CAAA01098150.1 ## ENSMUSG00000098647 ENSMUSG00000098647 1 CAAA01064564.1 ## ENSMUSG00000096730 ENSMUSG00000096730 6 Vmn2r122 ## ENSMUSG00000095742 ENSMUSG00000095742 1 CAAA01147332.1 ## gene_source gene_biotype havana_gene_version ## &lt;character&gt; &lt;character&gt; &lt;character&gt; ## ENSMUSG00000102693 havana TEC 1 ## ENSMUSG00000064842 ensembl snRNA NA ## ENSMUSG00000051951 ensembl_havana protein_coding 2 ## ENSMUSG00000102851 havana processed_pseudogene 1 ## ENSMUSG00000103377 havana TEC 1 ## ... ... ... ... ## ENSMUSG00000094431 ensembl miRNA NA ## ENSMUSG00000094621 ensembl miRNA NA ## ENSMUSG00000098647 ensembl miRNA NA ## ENSMUSG00000096730 ensembl protein_coding NA ## ENSMUSG00000095742 ensembl protein_coding NA 4.6 Single-cell-specific fields 4.6.1 Background So far, we have covered the assays (primary data), colData (cell metadata), rowData/rowRanges (feature metadata), and metadata (other information) slots of the SingleCellExperiment class. These slots are actually inherited from the SummarizedExperiment parent class (see the SummarizedExperiment documentation for details), so any method that works on a SummarizedExperiment will also work on a SingleCellExperiment object. But why do we need a separate SingleCellExperiment class? This is motivated by the desire to streamline some single-cell-specific operations, which we will discuss in the rest of this section. 4.6.2 Dimensionality reduction results The reducedDims slot is specially designed to store reduced dimensionality representations of the primary data obtained by methods such as PCA and \\(t\\)-SNE (see Basic Chapter 4 for more details). This slot contains a list of numeric matrices of low-dimensional representations of the primary data, where the rows represent the columns of the primary data (i.e., cells), and columns represent the dimensions. As this slot holds a list, we can store multiple PCA/\\(t\\)-SNE/etc.
results for the same dataset. In our example, we can calculate a PCA representation of our data using the runPCA() function from scater. We see that the sce now shows a new reducedDim that can be retrieved with the accessor reducedDim(). sce &lt;- scater::logNormCounts(sce) sce &lt;- scater::runPCA(sce) dim(reducedDim(sce, &quot;PCA&quot;)) ## [1] 96 50 We can also calculate a tSNE representation using the scater package function runTSNE(): sce &lt;- scater::runTSNE(sce, perplexity = 0.1) ## Perplexity should be lower than K! head(reducedDim(sce, &quot;TSNE&quot;)) ## TSNE1 TSNE2 ## SLX-9555.N701_S502.C89V9ANXX.s_1.r_1 -110.3 -530.98 ## SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 -453.8 -32.11 ## SLX-9555.N701_S504.C89V9ANXX.s_1.r_1 -586.5 -1210.52 ## SLX-9555.N701_S505.C89V9ANXX.s_1.r_1 1004.6 265.02 ## SLX-9555.N701_S506.C89V9ANXX.s_1.r_1 736.9 297.17 ## SLX-9555.N701_S507.C89V9ANXX.s_1.r_1 1191.2 -495.58 We can view the names of all our entries in the reducedDims slot via the accessor, reducedDims(). Note that this is plural and returns a list of all results, whereas reducedDim() only returns a single result. reducedDims(sce) ## List of length 2 ## names(2): PCA TSNE We can also manually add content to the reducedDims() slot, much like how we added matrices to the assays slot previously. To illustrate, we run the umap() function directly from the uwot package to generate a matrix of UMAP coordinates that is added to the reducedDims of our sce object. (In practice, scater has a runUMAP() wrapper function that adds the results for us, but we will manually call umap() here for demonstration purposes.) u &lt;- uwot::umap(t(logcounts(sce)), n_neighbors = 2) reducedDim(sce, &quot;UMAP_uwot&quot;) &lt;- u reducedDims(sce) # Now stored in the object. 
## List of length 3 ## names(3): PCA TSNE UMAP_uwot head(reducedDim(sce, &quot;UMAP_uwot&quot;)) ## [,1] [,2] ## SLX-9555.N701_S502.C89V9ANXX.s_1.r_1 4.1734 -1.084 ## SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 2.5674 -1.680 ## SLX-9555.N701_S504.C89V9ANXX.s_1.r_1 4.0490 -2.482 ## SLX-9555.N701_S505.C89V9ANXX.s_1.r_1 -0.9967 -3.251 ## SLX-9555.N701_S506.C89V9ANXX.s_1.r_1 -0.5958 5.437 ## SLX-9555.N701_S507.C89V9ANXX.s_1.r_1 -3.2816 3.713 4.6.3 Alternative Experiments The SingleCellExperiment class provides the concept of “alternative Experiments” where we have data for a distinct set of features but the same set of samples/cells. The classic application would be to store the per-cell counts for spike-in transcripts; this allows us to retain this data for downstream use but separate it from the assays holding the counts for endogenous genes. The separation is particularly important as such alternative features often need to be processed separately; see Advanced Chapter 12 for examples with antibody-derived tags. If we have data for alternative feature sets, we can store it in our SingleCellExperiment as an alternative Experiment. For example, if we have some data for spike-in transcripts, we first create a separate SummarizedExperiment object: # -1 to get rid of the first gene length column. spike_se &lt;- SummarizedExperiment(list(counts=spike.mat[,-1])) spike_se ## class: SummarizedExperiment ## dim: 92 96 ## metadata(0): ## assays(1): counts ## rownames(92): ERCC-00002 ERCC-00003 ... ERCC-00170 ERCC-00171 ## rowData names(0): ## colnames(96): SLX-9555.N701_S502.C89V9ANXX.s_1.r_1 ## SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 ... ## SLX-9555.N712_S508.C89V9ANXX.s_1.r_1 ## SLX-9555.N712_S517.C89V9ANXX.s_1.r_1 ## colData names(0): Then we store this SummarizedExperiment in our sce object via the altExp() setter. Like assays() and reducedDims(), we can also retrieve all of the available alternative Experiments with altExps().
altExp(sce, &quot;spike&quot;) &lt;- spike_se altExps(sce) ## List of length 1 ## names(1): spike The alternative Experiment concept ensures that all relevant aspects of a single-cell dataset can be held in a single object. It is also convenient as it ensures that our spike-in data is synchronized with the data for the endogenous genes. For example, if we subsetted sce, the spike-in data would be subsetted to match: sub &lt;- sce[,1:2] # retain only two samples. altExp(sub, &quot;spike&quot;) ## class: SummarizedExperiment ## dim: 92 2 ## metadata(0): ## assays(1): counts ## rownames(92): ERCC-00002 ERCC-00003 ... ERCC-00170 ERCC-00171 ## rowData names(0): ## colnames(2): SLX-9555.N701_S502.C89V9ANXX.s_1.r_1 ## SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 ## colData names(0): Any SummarizedExperiment object can be stored as an alternative Experiment, including another SingleCellExperiment! This allows power users to perform tricks like those described in Basic Section 3.6. 4.6.4 Size factors The sizeFactors() function allows us to get or set a numeric vector of per-cell scaling factors used for normalization (see Basic Chapter 2 for more details). This is typically automatically added by normalization functions, as shown below for scran’s deconvolution-based size factors: sce &lt;- scran::computeSumFactors(sce) summary(sizeFactors(sce)) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.151 0.742 0.948 1.000 1.124 3.558 Alternatively, we can manually add the size factors, as shown below for library size-derived factors: sizeFactors(sce) &lt;- scater::librarySizeFactors(sce) summary(sizeFactors(sce)) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.170 0.766 0.951 1.000 1.106 3.605 Technically speaking, the sizeFactors concept is not unique to single-cell analyses. Nonetheless, we mention it here as it is an extension beyond what is available in the SummarizedExperiment parent class. 
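The manual route can be sketched on a toy object. This is only an illustration under stated assumptions: toy.counts and toy are hypothetical names with made-up values, not objects from this chapter, and the factors are computed by hand as each cell's total count scaled to a mean of 1, which is the usual definition of library size factors.

```r
library(SingleCellExperiment)

# Toy count matrix: 5 genes x 4 cells (made-up values for illustration).
set.seed(100)
toy.counts <- matrix(rpois(20, lambda = 10), nrow = 5,
    dimnames = list(paste0('gene', 1:5), paste0('cell', 1:4)))
toy <- SingleCellExperiment(assays = list(counts = toy.counts))

# Library size factors: total count per cell, scaled to a mean of 1.
lib.sizes <- unname(colSums(counts(toy)))
sizeFactors(toy) <- lib.sizes / mean(lib.sizes)

# The getter returns the numeric vector that we just stored.
sizeFactors(toy)
mean(sizeFactors(toy))
```

Scaling to unit mean keeps the normalized values on roughly the same scale as the original counts, which is why the summaries above report a mean of 1.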
4.6.5 Column labels The colLabels() function allows us to get or set a vector or factor of per-cell labels, typically corresponding to groupings assigned by unsupervised clustering (Basic Chapter 5) or predicted cell type identities from classification algorithms (Basic Chapter 7). colLabels(sce) &lt;- scran::clusterCells(sce, use.dimred=&quot;PCA&quot;) table(colLabels(sce)) ## ## 1 2 ## 47 49 This is a convenient field to set as several functions (e.g., scran::findMarkers) will attempt to automatically retrieve the labels via colLabels(). We can thus avoid the few extra keystrokes that would otherwise be necessary to specify, say, the cluster assignments in the function call. 4.7 Conclusion The widespread use of the SingleCellExperiment class provides the foundation for interoperability between single-cell-related packages in the Bioconductor ecosystem. SingleCellExperiment objects generated by one package can be used as input into another package, encouraging synergies that enable our analysis to be greater than the sum of its parts. Each step of the analysis will also add new entries to the assays, colData, reducedDims, etc., meaning that the final SingleCellExperiment object effectively serves as a self-contained record of the analysis. This is convenient as the object can be saved for future use or transferred to collaborators for further analysis. Thus, for the rest of this book, we will be using the SingleCellExperiment as our basic data structure.
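To make the point about saving concrete, here is a minimal sketch that writes a small object to disk with base R's saveRDS() and reads it back with readRDS(). The toy object, its labels and the temporary file path are hypothetical and exist only for this illustration; for a real analysis we would save the full sce in the same way.

```r
library(SingleCellExperiment)

# Hypothetical toy object: 3 genes x 4 cells with made-up labels.
toy <- SingleCellExperiment(assays = list(counts = matrix(1:12, nrow = 3)))
colLabels(toy) <- factor(c('a', 'a', 'b', 'b'))

# Serialize to a temporary file and restore it.
path <- tempfile(fileext = '.rds')
saveRDS(toy, path)
restored <- readRDS(path)

# Assays and per-cell labels travel with the object.
dim(restored)
identical(colLabels(restored), colLabels(toy))
```

Because the labels, assays and other fields live in one object, a single file is enough to hand the state of the analysis to a collaborator.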
Session Info View session info R version 4.5.1 (2025-06-13) Platform: x86_64-pc-linux-gnu Running under: Ubuntu 24.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C time zone: America/New_York tzcode source: system (glibc) attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] BiocFileCache_2.16.0 dbplyr_2.5.0 [3] SingleCellExperiment_1.30.1 SummarizedExperiment_1.38.1 [5] Biobase_2.68.0 GenomicRanges_1.60.0 [7] GenomeInfoDb_1.44.0 IRanges_2.42.0 [9] S4Vectors_0.46.0 BiocGenerics_0.54.0 [11] generics_0.1.4 MatrixGenerics_1.20.0 [13] matrixStats_1.5.0 BiocStyle_2.36.0 [15] rebook_1.18.0 loaded via a namespace (and not attached): [1] DBI_1.2.3 bitops_1.0-9 gridExtra_2.3 [4] CodeDepends_0.6.6 rlang_1.1.6 magrittr_2.0.3 [7] scater_1.36.0 compiler_4.5.1 RSQLite_2.4.1 [10] dir.expiry_1.16.0 png_0.1-8 vctrs_0.6.5 [13] pkgconfig_2.0.3 crayon_1.5.3 fastmap_1.2.0 [16] XVector_0.48.0 scuttle_1.18.0 Rsamtools_2.24.0 [19] rmarkdown_2.29 graph_1.86.0 UCSC.utils_1.4.0 [22] ggbeeswarm_0.7.2 purrr_1.0.4 bit_4.6.0 [25] bluster_1.18.0 xfun_0.52 cachem_1.1.0 [28] beachmat_2.24.0 jsonlite_2.0.0 blob_1.2.4 [31] DelayedArray_0.34.1 BiocParallel_1.42.1 cluster_2.1.8.1 [34] irlba_2.3.5.1 parallel_4.5.1 R6_2.6.1 [37] bslib_0.9.0 RColorBrewer_1.1-3 limma_3.64.1 [40] rtracklayer_1.68.0 jquerylib_0.1.4 Rcpp_1.0.14 [43] bookdown_0.43 knitr_1.50 FNN_1.1.4.1 [46] igraph_2.1.4 Matrix_1.7-3 tidyselect_1.2.1 [49] dichromat_2.0-0.1 abind_1.4-8 yaml_2.3.10 [52] viridis_0.6.5 codetools_0.2-20 curl_6.4.0 [55] lattice_0.22-7 tibble_3.3.0 withr_3.0.2 [58] evaluate_1.0.4 
Rtsne_0.17 Biostrings_2.76.0 [61] pillar_1.10.2 BiocManager_1.30.26 filelock_1.0.3 [64] RCurl_1.98-1.17 ggplot2_3.5.2 scales_1.4.0 [67] glue_1.8.0 metapod_1.16.0 tools_4.5.1 [70] BiocIO_1.18.0 BiocNeighbors_2.2.0 ScaledMatrix_1.16.0 [73] locfit_1.5-9.12 GenomicAlignments_1.44.0 scran_1.36.0 [76] XML_3.99-0.18 grid_4.5.1 edgeR_4.6.2 [79] GenomeInfoDbData_1.2.14 beeswarm_0.4.0 BiocSingular_1.24.0 [82] restfulr_0.0.15 vipor_0.4.7 cli_3.6.5 [85] rsvd_1.0.5 S4Arrays_1.8.1 viridisLite_0.4.2 [88] dplyr_1.1.4 uwot_0.2.3 gtable_0.3.6 [91] sass_0.4.10 digest_0.6.37 dqrng_0.4.1 [94] SparseArray_1.8.0 ggrepel_0.9.6 rjson_0.2.23 [97] farver_2.1.2 memoise_2.0.1 htmltools_0.5.8.1 [100] lifecycle_1.0.4 httr_1.4.7 statmod_1.5.0 [103] bit64_4.6.0-1 References "],["analysis-overview.html", "Chapter 5 Analysis overview 5.1 Outline 5.2 Quick start (simple) 5.3 Quick start (multiple batches) Session Info", " Chapter 5 Analysis overview 5.1 Outline This chapter provides an overview of the framework of a typical scRNA-seq analysis workflow (Figure 5.1). Figure 5.1: Schematic of a typical scRNA-seq analysis workflow. Each stage (separated by dashed lines) consists of a number of specific steps, many of which operate on and modify a SingleCellExperiment instance. In the simplest case, the workflow has the following form: We compute quality control metrics to remove low-quality cells that would interfere with downstream analyses. These cells may have been damaged during processing or may not have been fully captured by the sequencing protocol. Common metrics include the total counts per cell, the proportion of spike-in or mitochondrial reads, and the number of detected features.
We convert the counts into normalized expression values to eliminate cell-specific biases (e.g., in capture efficiency). This allows us to perform explicit comparisons across cells in downstream steps like clustering. We also apply a transformation, typically log, to adjust for the mean-variance relationship. We perform feature selection to pick a subset of interesting features for downstream analysis. This is done by modelling the variance across cells for each gene and retaining genes that are highly variable. The aim is to reduce computational overhead and noise from uninteresting genes. We apply dimensionality reduction to compact the data and further reduce noise. Principal components analysis is typically used to obtain an initial low-rank representation for more computational work, followed by more aggressive methods like \\(t\\)-stochastic neighbor embedding for visualization purposes. We cluster cells into groups according to similarities in their (normalized) expression profiles. This aims to obtain groupings that serve as empirical proxies for distinct biological states. We typically interpret these groupings by identifying differentially expressed marker genes between clusters. Subsequent chapters will describe each analysis step in more detail. 5.2 Quick start (simple) Here, we use a droplet-based retina dataset from Macosko et al. (2015), provided in the scRNAseq package. This starts from a count matrix and finishes with clusters (Figure 5.2) in preparation for biological interpretation. Similar workflows are available in abbreviated form in later parts of the book. library(scRNAseq) sce &lt;- MacoskoRetinaData() # Quality control (using mitochondrial genes). library(scater) is.mito &lt;- grepl(&quot;^MT-&quot;, rownames(sce)) qcstats &lt;- perCellQCMetrics(sce, subsets=list(Mito=is.mito)) filtered &lt;- quickPerCellQC(qcstats, percent_subsets=&quot;subsets_Mito_percent&quot;) sce &lt;- sce[, !filtered$discard] # Normalization.
sce &lt;- logNormCounts(sce) # Feature selection. library(scran) dec &lt;- modelGeneVar(sce) hvg &lt;- getTopHVGs(dec, prop=0.1) # PCA. library(scater) set.seed(1234) sce &lt;- runPCA(sce, ncomponents=25, subset_row=hvg) # Clustering. library(bluster) colLabels(sce) &lt;- clusterCells(sce, use.dimred=&#39;PCA&#39;, BLUSPARAM=NNGraphParam(cluster.fun=&quot;louvain&quot;)) # Visualization. sce &lt;- runUMAP(sce, dimred = &#39;PCA&#39;) plotUMAP(sce, colour_by=&quot;label&quot;) Figure 5.2: UMAP plot of the retina dataset, where each point is a cell and is colored by the assigned cluster identity. # Marker detection. markers &lt;- findMarkers(sce, test.type=&quot;wilcox&quot;, direction=&quot;up&quot;, lfc=1) 5.3 Quick start (multiple batches) Here we use the pancreas Smart-seq2 dataset from Segerstolpe et al. (2016), again provided in the scRNAseq package. This starts from a count matrix and finishes with clusters (Figure 5.3), with some additional tweaks to eliminate uninteresting batch effects between individuals. Note that a more elaborate analysis of the same dataset with justifications for each step is available in Workflow Chapter 8. sce &lt;- SegerstolpePancreasData() # Quality control (using ERCCs). qcstats &lt;- perCellQCMetrics(sce) filtered &lt;- quickPerCellQC(qcstats, percent_subsets=&quot;altexps_ERCC_percent&quot;) sce &lt;- sce[, !filtered$discard] # Normalization. sce &lt;- logNormCounts(sce) # Feature selection, blocking on the individual of origin. dec &lt;- modelGeneVar(sce, block=sce$individual) hvg &lt;- getTopHVGs(dec, prop=0.1) # Batch correction. library(batchelor) set.seed(1234) sce &lt;- correctExperiments(sce, batch=sce$individual, subset.row=hvg, correct.all=TRUE) # Clustering. colLabels(sce) &lt;- clusterCells(sce, use.dimred=&#39;corrected&#39;) # Visualization.
sce &lt;- runUMAP(sce, dimred = &#39;corrected&#39;) gridExtra::grid.arrange( plotUMAP(sce, colour_by=&quot;label&quot;), plotUMAP(sce, colour_by=&quot;individual&quot;), ncol=2 ) Figure 5.3: UMAP plot of the pancreas dataset, where each point is a cell and is colored by the assigned cluster identity (left) or the individual of origin (right). # Marker detection, blocking on the individual of origin. markers &lt;- findMarkers(sce, test.type=&quot;wilcox&quot;, direction=&quot;up&quot;, lfc=1, block=sce$individual) Session Info View session info R version 4.5.1 (2025-06-13) Platform: x86_64-pc-linux-gnu Running under: Ubuntu 24.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C time zone: America/New_York tzcode source: system (glibc) attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] batchelor_1.24.0 bluster_1.18.0 [3] scran_1.36.0 scater_1.36.0 [5] ggplot2_3.5.2 scuttle_1.18.0 [7] scRNAseq_2.22.0 SingleCellExperiment_1.30.1 [9] SummarizedExperiment_1.38.1 Biobase_2.68.0 [11] GenomicRanges_1.60.0 GenomeInfoDb_1.44.0 [13] IRanges_2.42.0 S4Vectors_0.46.0 [15] BiocGenerics_0.54.0 generics_0.1.4 [17] MatrixGenerics_1.20.0 matrixStats_1.5.0 [19] BiocStyle_2.36.0 rebook_1.18.0 loaded via a namespace (and not attached): [1] RColorBrewer_1.1-3 jsonlite_2.0.0 [3] CodeDepends_0.6.6 magrittr_2.0.3 [5] ggbeeswarm_0.7.2 GenomicFeatures_1.60.0 [7] gypsum_1.4.0 farver_2.1.2 [9] rmarkdown_2.29 BiocIO_1.18.0 [11] vctrs_0.6.5 DelayedMatrixStats_1.30.0 [13] memoise_2.0.1 Rsamtools_2.24.0 [15] RCurl_1.98-1.17 htmltools_0.5.8.1 [17] S4Arrays_1.8.1 AnnotationHub_3.16.0 [19]
curl_6.4.0 BiocNeighbors_2.2.0 [21] Rhdf5lib_1.30.0 SparseArray_1.8.0 [23] rhdf5_2.52.1 sass_0.4.10 [25] alabaster.base_1.8.0 bslib_0.9.0 [27] alabaster.sce_1.8.0 httr2_1.1.2 [29] cachem_1.1.0 ResidualMatrix_1.18.0 [31] GenomicAlignments_1.44.0 igraph_2.1.4 [33] lifecycle_1.0.4 pkgconfig_2.0.3 [35] rsvd_1.0.5 Matrix_1.7-3 [37] R6_2.6.1 fastmap_1.2.0 [39] GenomeInfoDbData_1.2.14 digest_0.6.37 [41] AnnotationDbi_1.70.0 dqrng_0.4.1 [43] irlba_2.3.5.1 ExperimentHub_2.16.0 [45] RSQLite_2.4.1 beachmat_2.24.0 [47] labeling_0.4.3 filelock_1.0.3 [49] httr_1.4.7 abind_1.4-8 [51] compiler_4.5.1 bit64_4.6.0-1 [53] withr_3.0.2 BiocParallel_1.42.1 [55] viridis_0.6.5 DBI_1.2.3 [57] HDF5Array_1.36.0 alabaster.ranges_1.8.0 [59] alabaster.schemas_1.8.0 rappdirs_0.3.3 [61] DelayedArray_0.34.1 rjson_0.2.23 [63] tools_4.5.1 vipor_0.4.7 [65] beeswarm_0.4.0 glue_1.8.0 [67] h5mread_1.0.1 restfulr_0.0.15 [69] rhdf5filters_1.20.0 grid_4.5.1 [71] cluster_2.1.8.1 gtable_0.3.6 [73] ensembldb_2.32.0 metapod_1.16.0 [75] BiocSingular_1.24.0 ScaledMatrix_1.16.0 [77] XVector_0.48.0 RcppAnnoy_0.0.22 [79] ggrepel_0.9.6 BiocVersion_3.21.1 [81] pillar_1.10.2 limma_3.64.1 [83] dplyr_1.1.4 BiocFileCache_2.16.0 [85] lattice_0.22-7 FNN_1.1.4.1 [87] rtracklayer_1.68.0 bit_4.6.0 [89] tidyselect_1.2.1 locfit_1.5-9.12 [91] Biostrings_2.76.0 knitr_1.50 [93] gridExtra_2.3 bookdown_0.43 [95] ProtGenerics_1.40.0 edgeR_4.6.2 [97] xfun_0.52 statmod_1.5.0 [99] UCSC.utils_1.4.0 lazyeval_0.2.2 [101] yaml_2.3.10 evaluate_1.0.4 [103] codetools_0.2-20 tibble_3.3.0 [105] alabaster.matrix_1.8.0 BiocManager_1.30.26 [107] graph_1.86.0 cli_3.6.5 [109] uwot_0.2.3 jquerylib_0.1.4 [111] dichromat_2.0-0.1 Rcpp_1.0.14 [113] dir.expiry_1.16.0 dbplyr_2.5.0 [115] png_0.1-8 XML_3.99-0.18 [117] parallel_4.5.1 blob_1.2.4 [119] AnnotationFilter_1.32.0 sparseMatrixStats_1.20.0 [121] bitops_1.0-9 viridisLite_0.4.2 [123] alabaster.se_1.8.0 scales_1.4.0 [125] crayon_1.5.3 rlang_1.1.6 [127] cowplot_1.1.3 KEGGREST_1.48.1 "]]
