--- title: "Co-visualizing tissue and single cell embedding plots" author: "Authors: Jianhai Zhang, Jordan Hayes, Le Zhang, Bing Yang, Wolf B. Frommer, Julia Bailey-Serres, and Thomas Girke" date: "Last update: `r format(Sys.time(), '%d %B, %Y')`" output: BiocStyle::html_document: css: file/custom.css toc: true toc_float: collapsed: true smooth_scroll: true toc_depth: 4 fig_caption: yes code_folding: show number_sections: true self_contained: true fontsize: 14pt bibliography: bibtex.bib package: spatialHeatmap vignette: > %\VignetteIndexEntry{Co-visualizing spatial heatmaps with single cell embedding plots} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{css, echo=FALSE} pre code { white-space: pre !important; overflow-x: scroll !important; word-break: keep-all !important; word-wrap: initial !important; } ``` ```{r global_options, include=FALSE} ## ThG: chunk added to enable global knitr options. The below turns on ## caching for faster vignette re-build during text editing. knitr::opts_chunk$set(cache=TRUE) ``` ```{r setup0, eval=TRUE, echo=FALSE, message=FALSE, warning=FALSE} library(knitr); opts_chunk$set(message=FALSE, warning=FALSE) ``` # Introduction ## Overview The primary utility of the _spatialHeatmap_ package is the generation of _spatial heatmaps_ (SHM) for visualizing cell-, tissue- and organ-specific abundance patterns of biological molecules (_e.g._ RNAs) in anatomical images [@shm]. This is useful for identifying molecules with spatially enriched (SE) abundance patterns as well as clusters and/or network modules composed of molecules sharing similar abundance patterns such as similar gene expression patterns. These functionalities are introduced in the [main vignette](https://bioconductor.org/packages/release/bioc/html/spatialHeatmap.html) of the _spatialHeatmap_ package. The following describes extended functionalities for integrating tissue with single cell data by co-visualizing them in composite plots that combine spatial heatmaps with embedding plots of high-dimensional data. The resulting spatial context information is important for gaining insights into the tissue-level organization of single cell data. The required quantitative assay data, such as gene expression values, can be provided in a variety of widely used tabular data structures (_e.g._ `data.frame`, `SummarizedExperiment` or `SingleCellExperiment`). The corresponding anatomic images need to be supplied as annotated SVG (aSVG) images and can be stored in a specific S4 class `SVG`. The creation of aSVGs is described in the main vignette of this package. For the embedding plots of single cell data, several dimensionality reduction algorithms (_e.g._ PCA, UMAP or tSNE) are supported. To associate single cells with their source tissues, the user can choose among three major methods including annotation-based, manual and automated methods (Figure \@ref(fig:covizOver)). Similar to other functionalities in _spatialHeatmap_, these functionalities are available within R as well as the corresponding [Shiny app](#autoSCE) [@shiny]. ## Methods for Associating Cells and Bulk Tissues To co-visualize single cell data with tissue features (Figure \@ref(fig:covizOver)), the individual cells of the single cell data are mapped via their group labels to the corresponding tissue features in an aSVG image. If the feature labels in an aSVG are different than the corresponding group labels used for the single cell data, _e.g._ due to variable terminologies, a translation map can be used to avoid manual relabelling. Throughout this vignette the usage of the term feature is a generalization referring in most cases to tissues or organs. For the implementation of the co-visualization tool, _spatialHeatmap_ takes advantage of efficient and reusable S4 classes for both assay data and aSVGs respectively. The former includes the Bioconductor core data structures such as the widely used `SingleCellExperiment` (`SCE`) container illustrated in Figure \@ref(fig:covizOver).1 [@sce]. The slots `assays`, `colData`, `rowData` and `reducedDims` in an `SCE` contain expression data, cell metadata, molecule metadata and reduced dimensionality embedding results, respectively. The cell group labels are stored in the `colData` slot as shown in Figure \@ref(fig:covizOver).1. The S4 class `SVG` (Figure \@ref(fig:covizOver).3) is developed specifically in `spatialHeatmap` for storing aSVG instances. The two most important slots `coordinate` and `attribute` stores the aSVG feature coordinates and respective attributes such as fill color, line withs, etc. respectively, while other slots `dimension`, `svg`, and `raster` stores image dimension, aSVG file paths, and raster image paths respectively. For handling cell-to-tissue grouping information, three general methods are available including (a) annotation-based, (b) manual and (c) automated. The annotation-based and manual methods are similar by using known cell group labels. The main difference is how the cell labels are provided. In the annotation-based method, existing group labels are available and can be uploaded and/or stored in the `SCE` object, as is the case in some of the `SCE` instances provided by the `scRNAseq` package [@scrnaseq]. The manual method allows users to create the cell to tissue associations one-by-one or import them from a tabular file. In contrast to this, the automated method aims to assign single cells to the corresponding source tissues computationally by a co-clustering algorithm (Figure \@ref(fig:coclusOver)). This co-clustering is experimental and requires bulk expression data that are obtained from the tissues represented in the single cell data. The grouping information is visualized by using for each group the same color in both the single cell embedding plot and the tissue spatial heatmap plot (Figure \@ref(fig:covizOver).5). The colors can represent any type of custom or numeric information. In a typical use case, either fixed tissue-specific colors or a heat color gradient is used that is proportional to the numeric expression information obtained from the single cell or bulk expression data of a chosen gene. When the expression values among groups are very similar, toggling between the two coloring option is important to track the tissue origin in the single cell data. To color by single cell data, one often wants to first summarize the expression of a given gene across the cells within each group via a meaningful summary statistics, such as mean or median. Cells and tissues with the same group label will be colored the same. When coloring by tissues the color used for each tissue feature will be applied to the corresponding cell groups represented in the embedding plot. ```{r covizOver, echo=FALSE, fig.wide=TRUE, out.width="100%", fig.cap=("Overview of single cell and bulk tissue co-visualization. (1) Single cell data are organized in a `SingleCellExperiment` object where cell grouping information is stored in the `colData` slot. (2) Bulk tissue expression data are stored in a `SummarizedExperiment` object, where tissues are labelled by aSVG feature identifiers (3) in the `colData` slot. (3) The aSVG instance is stored in an `SVG` class. The `coordinate` slot defines spatial feature shapes in spatial heatmap (5) while the `attribute` slot defines styling of these features such as color, line with, etc. (4) Bi-directional mapping between cell or tissue groups are supported. Cell groups and tissues can be colored by fixed colors or by expression values of selected genes. (5) Cells in (1) and bulk tissues in (2) are co-visualized in and embedding plot and spatial heatmap plot respectively, where mapped cell groups and respective tissues are indicated by same colors. ")} include_graphics('img/covisualize.jpg') ``` # Getting Started ## Installation The `spatialHeatmap` package can be installed with the `BiocManager::install` command. ```{ eval=FALSE, echo=TRUE, warnings=FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("spatialHeatmap") ``` ## Packages and Documentation Next, the packages required for running the sample code in this vignette need to be loaded. ```{r, eval=TRUE, echo=TRUE, warnings=FALSE, results='hide'} library(spatialHeatmap); library(SummarizedExperiment); library(scran); library(scater); library(igraph); library(SingleCellExperiment); library(BiocParallel) library(kableExtra) ``` The following lists the vignette(s) of this package in an HTML browser. Clicking the name of the corresponding vignette will open it. ```{r, eval=FALSE, echo=TRUE, warnings=FALSE} browseVignettes('spatialHeatmap') ``` To reduce runtime, intermediate results can be cached under `~/.cache/shm`. ```{r eval=TRUE, echo=TRUE, message=FALSE, warnings=FALSE } cache.pa <- '~/.cache/shm' # Set path of the cache directory ``` # Quick Start {#test} To obtain for examples with randomized data or parameters always the same results, a fixed seed is set. ```{r , eval=TRUE, echo=TRUE, warnings=FALSE} set.seed(10) ``` This quick start example is demonstrated on cell-to-tissue mapping by using a single cell data set from oligodendrocytes of mouse brain [@Marques2016-ff]. This data set is obtained from the `scRNAseq` [@scrnaseq] package with minor modificatons, which is included in `spatialHeatmap`. Prior to co-visualization the single cell data is pre-processed by the `process_cell_meta` function that applies common QC, normalization and dimensionality reduction routines. The details of these pre-processing methods are described in the corresponding help file. Additional background information on these topics can be found in the [OSCA](http://bioconductor.org/books/3.14/OSCA.workflows/zeisel-mouse-brain-strt-seq.html){target='blank'} tutorial. ```{r , eval=TRUE, echo=TRUE, warnings=FALSE} sce.pa <- system.file("extdata/shinyApp/example", "cell_mouse_brain.rds", package="spatialHeatmap") sce <- readRDS(sce.pa) sce.dimred.quick <- process_cell_meta(sce, qc.metric=list(subsets=list(Mt=rowData(sce)$featureType=='mito'), threshold=1)) colData(sce.dimred.quick)[1:3, 1:2] ``` To visualize single cell data in spatial heatmaps, the single cell expression values are summarized using common summary statistics such as mean or median by their source tissue grouping, which is in the `label` column of `colData` slot. ```{r , eval=TRUE, echo=TRUE, warnings=FALSE} sce.aggr.quick <- aggr_rep(sce.dimred.quick, assay.na='logcounts', sam.factor='label', aggr='mean') ``` The summarized values are then used to color the tissue features of the corresponding aSVG file of mouse brain that is included in the `spatialHeatmap` package. The function `read_svg` is used for parsing the aSVG and storing it in an `SVG` object `svg.mus.brain`. ```{r , eval=TRUE, echo=TRUE, warnings=FALSE, results='hide'} svg.mus.brain.pa <- system.file("extdata/shinyApp/example", "mus_musculus.brain.svg", package="spatialHeatmap") svg.mus.brain <- read_svg(svg.mus.brain.pa) ``` A subset of features and related attributes are returned from `svg.mus.brain`, where `fill` and `stroke` refer to color and line width respectively. ```{r , eval=TRUE, echo=TRUE, warnings=FALSE} tail(attribute(svg.mus.brain)[[1]])[, 1:4] ``` To map cell group labels to aSVG features, a `list` with named components is used, where cell labels are in name slots. Note, each cell label can be matched to multiple aSVG features but not vice versa. ```{r , eval=TRUE, echo=TRUE, warnings=FALSE} lis.match.quick <- list(hypothalamus=c('hypothalamus'), cortex.S1=c('cerebral.cortex', 'nose')) ``` The co-visualization is created on gene `Eif5b` using the function `covis`. The embedding plot includes all cells before aggregation. The `hypothalamus` and `cortex.S1` cells are colored according to their respecitive aggregated expression values in `data=sce.aggr.quick`. In the nearby spatial heatmap plot, aSVG features are filled by the same color as the matching cells defined in `lis.match.quick`. The `cell.group` argument indicates cell group labels in the `colData` slot of `sce.aggr.quick`, `tar.cell` suggests the target cell groups to visualize, and `dimred` specifies the embedding plot. ```{r , eval=TRUE, echo=TRUE, warnings=FALSE, results='hide', fig.cap=('Co-visualization of cell-to-tissue mapping. The co-visualization is created on gene `Eif5b`. Single cells in the embedding plot and their matching aSVG features in the spatial heatmap are filled by the same color according to aggregated expression values of `Eif5b` in single cell data.')} shm.lis.quick <- covis(svg=svg.mus.brain, data=sce.aggr.quick, ID=c('Eif5b'), sce.dimred=sce.dimred.quick, dimred='PCA', cell.group='label', tar.cell=names(lis.match.quick), lis.rematch=lis.match.quick, assay.na='logcounts', bar.width=0.11, dim.lgd.nrow=1, height=0.7, legend.r=1.5, legend.key.size=0.02, legend.text.size=12, legend.nrow=3) ``` # Mappings between cells and tissues ## Annotation-based {#ann} The annotation-based mapping uses existing cell-to-tissue labels. Those can be imported (_e.g._ from a tabular file) and then stored in the `colData` slot of an `SCE` object. Alternatively, an `SCE` object might already contain this information. This is the case in some of the `SCE` objects provided by the `scRNAseq` package [@scrnaseq]. The [Quick Start](#test) is domenstrated using the annotation-based method, since the single cell data are downloaded from `scRNAseq` package. In this example, the tissue-to-cell mapping will be showcased. The single cell data and aSVG file are the same as the Quick Start, and the bulk data are modified from a research on mouse cerebellar development[@Vacher2021-xg]. The bulk count data are imported in a `SummarizedExperiment` and partial are shown. Note, replicates are indicated by the same sample names (_e.g._ `cerebral.cortex`). ```{r eval=TRUE, echo=TRUE, warnings=FALSE} blk.mus.pa <- system.file("extdata/shinyApp/example", "bulk_mouse_cocluster.rds", package="spatialHeatmap") blk.mus <- readRDS(blk.mus.pa) assay(blk.mus)[1:3, 1:5] colData(blk.mus)[1:3, , drop=FALSE] ``` The bulk count data are normalized using the `ESF` option, which refers to `estimateSizeFactors` from DESeq2 [@deseq2]. The expression values for each gene are summarized by mean across replicates (here `aggr='mean'`). ```{r eval=TRUE, echo=TRUE, warnings=FALSE, results='hide'} # Normalization. blk.mus.nor <- norm_data(data=blk.mus, norm.fun='ESF', log2.trans=TRUE) # Aggregation. blk.mus.aggr <- aggr_rep(blk.mus.nor, sam.factor='tissue', aggr='mean') assay(blk.mus.aggr)[1:2, ] ``` The single cell data from the Quick Start are used in this example. Its metadata in `colData` are partially shown. ```{r eval=TRUE, echo=TRUE, warnings=FALSE} colData(sce.dimred.quick)[1:3, 1:2] ``` The single cell data are plotted at the UMAP dimensions using `plot_dim`, where cells are represented by dots and are colored by the grouping information stored in the `colData` slot of the `SCE` object, here `color.by="label"`. ```{r eval=TRUE, echo=TRUE, warnings=FALSE, fig.wide=TRUE, fig.cap=('Embedding plot of single cell data. The cells (dots) are colored by the grouping information stored in the `colData` slot of the corresponding `SCE` object'), out.width="100%", fig.show='show'} plot_dim(sce.dimred.quick, color.by="label", dim='UMAP') ``` The aSVG instance of mouse brain from the Quick Start are used. Partial of the aSVG features are shown. ```{r eval=TRUE, echo=TRUE, warnings=FALSE} tail(attribute(svg.mus.brain)[[1]])[1:3, 1:4] ``` Same with conventions in the main vignette, at least one bulk tissue identifier should match with aSVG features so as to successfully plot spatial heatmaps. ```{r eval=TRUE, echo=TRUE, warnings=FALSE} colnames(blk.mus) %in% attribute(svg.mus.brain)[[1]]$feature ``` In practice the labels used for features in the aSVG image and the cell group labels of the single cell data may not be the same. To resolve this without manual relabeling, a translation list is used to make them match. In tissue-to-cell mapping, the feature and cell labels are expected to be the names and elements of the list components, respectively. ```{r scLabList, eval=TRUE, echo=TRUE, warnings=FALSE} lis.match.blk <- list(cerebral.cortex=c('cortex.S1'), hypothalamus=c('corpus.callosum', 'hypothalamus')) ``` The following plots the corresponding co-visualization for sample gene Actr3b. The legend under the embedding plot shows the cell labels in the matching list (`lis.match.blk`). The source tissue information is indicated by using the same colors in the embedding and tissue plots on the left and right, respectively. In contrast to the Quick Start, the `tar.bulk` indicates target bulk tissues to show. ```{r eval=TRUE, echo=TRUE, warnings=FALSE, fig.wide=TRUE, fig.cap=('Co-visualization plot through tissue-to-cell mapping. In this plot, `Actr3b` is used as an example. Each cell population is colored by its summarized expression value in the embedding plot on the left, and the corresponding tissue shares the same color in the spatial heatmap on the right.'), out.width="100%", fig.show='show', results='hide'} covis(svg=svg.mus.brain, data=blk.mus.aggr, ID=c('Actr3b'), sce.dimred=sce.dimred.quick, dimred='UMAP', cell.group='label', tar.bulk=names(lis.match.blk), lis.rematch=lis.match.blk, bar.width=0.09, dim.lgd.nrow=2, dim.lgd.text.size=12, height=0.7, legend.r=1.5, legend.key.size=0.02, legend.text.size=12, legend.nrow=2) ``` In scenarios where expression values are similar across tissues, the mapping between cells and tissues can be indicated by fixed contrasting colors by setting `profile=FALSE`. ```{r eval=TRUE, echo=TRUE, warnings=FALSE, fig.wide=TRUE, fig.cap=('Co-visualization plot through tissue-to-cell mapping without expression values. In this plot, mapping beween cell groups and tissues are indicated by fixed colors instead of expression values. '), out.width="100%", fig.show='show', results='hide'} covis(svg=svg.mus.brain, data=blk.mus.aggr, ID=c('Actr3b'), profile=FALSE, sce.dimred=sce.dimred.quick, dimred='UMAP', cell.group='label', lis.rematch=lis.match.blk, tar.bulk=names(lis.match.blk), bar.width=0.09, dim.lgd.nrow=2, dim.lgd.text.size=12, height=0.7, legend.r=1.5, legend.key.size=0.02, legend.text.size=12, legend.nrow=2) ``` ## Manual Method To provide additional flexibility for defining cell groupings, several manual options are provided. Here users can assign cell groups manually or by clustering methods for single cell embedding data that are often used in the analysis of single cell data. The resulting cell grouping or cluster information needs to be stored in a tabular file, that will be imported into an `SCE` object (here `manual_group` function). The following demonstration uses the same single cell and aSVG instance as the annotation example above. The only difference is an additional clustering step. For demonstration purposes a small example of a cluster file is included in the `spatialHeatmap` package. In this case the group labels were created by the `cluster_cell` function. The details of this function are available in its help file. The cluster file contains at least two columns: a column (here `cell`) with single cell identifiers used under `colData` and a column (here `cluster`) with the cell group labels. For practical reasons of building this vignette a pure manual example could not be used here. However, the chosen clustering example can be easily adapted to manual or hybrid grouping approaches since the underlying tabular data structure is the same for both that can be generated in most text or spreadsheet programs. ```{r eval=TRUE, echo=TRUE, warnings=FALSE} manual.clus.mus.sc.pa <- system.file("extdata/shinyApp/example", "manual_cluster_mouse_brain.txt", package="spatialHeatmap") manual.clus.mus.sc <- read.table(manual.clus.mus.sc.pa, header=TRUE, sep='\t') manual.clus.mus.sc[1:3, ] ``` The `manual_group` function can be used to append the imported group labels to the `colData` slot of an `SCE` object without interfering with other functions and methods operating on `SCE` objects. ```{r eval=TRUE, echo=TRUE, warnings=FALSE} sce.clus <- manual_group(sce=sce.dimred.quick, df.group=manual.clus.mus.sc, cell='cell', cell.group='cluster') colData(sce.clus)[1:3, c('cluster', 'label', 'expVar')] ``` An embedding plot of single cell data is created. The cells represented as dots are colored by the grouping information stored in the `cluster` column of the `colData` slot of `SCE`. ```{r eval=TRUE, echo=TRUE, warnings=FALSE, fig.wide=TRUE, fig.cap=('Embedding plot of single cells. The cells (dots) are colored by the grouping information stored in the `colData` slot of the corresponding `SCE` object .'), out.width="100%", fig.show='show'} plot_dim(sce.clus, color.by="cluster", dim='UMAP') ``` The same mouse brain aSVG as above is used here. ```{r eval=TRUE, echo=TRUE, warnings=FALSE} tail(attribute(svg.mus.brain)[[1]])[1:3, 1:4] ``` Similarly as above, a mapping list is used to match the cell clusters with aSVG features. ```{r eval=TRUE, echo=TRUE, warnings=FALSE} lis.match.clus <- list(clus1=c('cerebral.cortex'), clus3=c('brainstem', 'medulla.oblongata')) ``` This example is demonstrated through cell-to-tissue mapping, so gene expression values need to be summarized for the cells within each group label. Any grouping column in `colData` can be used as labels for summarizing. In this manual method, the `cluster` labels are chosen. If additional experimental variables are provided, the summarizing will consider them as well (here `expVar`). The following example uses the `clsuter` and `expVar` columns as group labels and experimental variables, respectively. ```{r eval=TRUE, echo=TRUE, warnings=FALSE} sce.clus.aggr <- aggr_rep(sce.clus, assay.na='logcounts', sam.factor='cluster', con.factor='expVar', aggr='mean') colData(sce.clus.aggr)[1:3, c('cluster', 'label', 'expVar')] ``` The co-visualization is plotted for gene `Tcea1`. In this example the coloring is based on the gene expression summary for each cell group obtained by clustering. Completely manual groupings can be provided the same way. ```{r eval=TRUE, echo=TRUE, warnings=FALSE, fig.wide=TRUE, fig.cap=('Co-visualization of single cell data and tissue features with cluster groupings. Gene `Tcea1` is used as an example and the cell groupings were obtained by clustering.'), out.width="100%", fig.show='show', results='hide'} covis(svg=svg.mus.brain, data=sce.clus.aggr, ID=c('Tcea1'), sce.dimred=sce.clus, dimred='UMAP', cell.group='cluster', assay.na='logcounts', tar.cell=names(lis.match.clus), lis.rematch=lis.match.clus, bar.width=0.09, dim.lgd.nrow=1, height=0.7, legend.r=1.5, legend.key.size=0.02, legend.text.size=12, legend.nrow=3) ``` ## Automated Method If both single cell and bulk gene expression data are available for the same or overlapping tissues then co-clustering can be used to assign cells to tissues automatically (Figure \@ref(fig:coclusOver)). Subsequently, the predicted cell-to-tissue assignments can be used for co-visualizing single cell embedding data with tissue level bulk data. This approach is useful for predicting the source tissues of unassigned cells without prior knowledge as is required for the annotation and manual approaches introduced above. While attractive there are various challenges to overcome to reliably co-cluster single cell data with the corresponding tissue-level bulk data. This is due to the different properties of single cell and bulk gene expression data, such as lower sensitivity and higher sparsity in single cell compared to bulk data. This section introduces a co-clustering method that is largely based on parameter optimization including three major steps. First, both data are preprocessed to retain the most reliable expression values (Figure \@ref(fig:coclusOver).1a-b). Second, the genes in the bulk data are reduced to those robustly expressed in the single cell data (Figure \@ref(fig:coclusOver).1c). Third, bulk and cell data are co-clustered by using optimal default settings (Table \@ref(tab:optPar)) that are obtained through optimization on real data with known cell-to-tissue assignments. The following introduces the three steps of this method in more detail using the example of RNA-Seq data. 1. The raw count matrices of bulk and single cells are column-wise combined for joint normalization (Figure \@ref(fig:coclusOver).1a). After separated from bulk data, the single cell data are reduced to genes with robust expression across X% cells and to cells with robust expression across Y% genes (Figure \@ref(fig:coclusOver).1b). In the bulk data, genes are filtered according to expression values $\ge$ A at a proportion of $\ge$ p across bulk samples and a coefficient of variance (CV) between CV1 and CV2 (Figure \@ref(fig:coclusOver).1b). 2. The bulk data are subsetted to the same genes as the single cell data (Figure \@ref(fig:coclusOver).1c). This and the previous filtering steps in single cell data reduce the sparsity in the single cell data and the bulk data are made more compareable to the single cell data by subsetting it to the same genes. 3. Bulk and single cell data are column-wise combined for joint embedding using PCA (UMAP, or other). Co-clustering is performed on the embedding data and three types of clusters are produced. First, only one bulk tissue is co-clustered with cells ((Figure \@ref(fig:coclusOver).2a). This bulk is assigned to all cells in the same cocluster. Second, multiple bulk tissues are co-clustered with multiple cells (Figure \@ref(fig:coclusOver).2b). The bulk tissues are assigned according to the nearest neighbor bulk of each cell, which is measured by Spearman's correlation coefficients. Third, no bulk tissue is co-clustered with cells (Figure \@ref(fig:coclusOver).2c). All these cells are un-labeled, which are candidates for discovering novel cell types. After co-clustering, cells are labeled by bulk tissues or un-labeled (Figure \@ref(fig:coclusOver).3) and these labels are used for co-visualization (Figure \@ref(fig:coclusOver).4), where cells in embedding plot and matching tissues in spatial heatmap are filled by same colors. ```{r coclusOver, echo=FALSE, fig.wide=TRUE, out.width="100%", fig.cap=("Overview of co-clustering. (1) Raw count data are pre-processed. (2) Co-clustering is performed on embedding dimensions of combined bulk and single cell data. (3) Cells are labeled by bulk assignments. (4) Cells and bulk tissues are co-visualized according to (3). ")} include_graphics('img/coclustering.jpg') ``` After revising the method outline for the co-clustering please revise the figure 7 above accordingly. The legend can just refer to the main text to avoid duplication. PLEASE keep it simple and clear. If a figure raises more questions than it answers then it defeats the purpose of guiding the reader and providing clarity. To obtain reasonably robust default settings for co-clustering, four parameters shown in Table \@ref(tab:optPar) are optimized, where bold text indicates optimal settings that are treated as robust default settings. The reason to choose these parameters is they are most relevant to the co-clustering step. The details of this optimization are given [here](https://jianhaizhang.github.io/spatialHeatmap_supplement/cocluster_optimize.html). The following demonstration applies the default settings (bold in Table \@ref(tab:optPar)) using single cell and bulk data from mouse brain [@Vacher2021-xg; @Ortiz2020-yt]. Both data sets have been simplified for demonstraton purposes. ```{r optPar, eval=TRUE, echo=FALSE, warnings=FALSE} df.opt <- cbind( Parameter=c('dimensionReduction', 'topDimensions', 'graphBuilding', 'clusterDetection'), Settings=c( '**denoisePCA** (**PCA**, scran), runUMAP (UMAP, scater)', '5 to 80 (**11**, **19**, **33**, **22**)', '**buildKNNGraph** (**knn**), buildSNNGraph (snn) (scran)', 'cluster_walktrap (wt), **cluster_fast_greedy** (**fg**), cluster_leading_eigen (le) (igraph)' ), Description=c( 'Dimension reduction methods', 'Number of top dimensions selected for co-clustering', 'Methods for building a graph where nodes are cells and edges are connections between nearest neighbors', 'Methods for partitioning the graph to generate clusters' ) ) #write.table(df.opt, 'cocluster_para.txt', col.names=TRUE, row.names=TRUE, sep='\t') kable(df.opt, caption='Parameter settings to optimize for co-clustering. Optimal settings are indicated by bold text.', col.names=colnames(df.opt), row.names=FALSE, escape=TRUE) %>% kable_styling("striped", full_width = TRUE) %>% scroll_box(width = "700px", height = "230px") ``` ### Pre-processing {#proCoclus} To obtain reproducible results, a fixed seed is set for generating random numbers. ```{r , eval=TRUE, echo=TRUE, warnings=FALSE} set.seed(10) ``` The bulk data (`blk.mus`) is the same with the tissue-to-cell mapping in the [Annotation-based method](#ann). The following imports example single cell from the `spatialHeatmap` package and shows its partial metadata in `colData` slot. ```{r eval=TRUE, echo=TRUE, warnings=FALSE} sc.mus.pa <- system.file("extdata/shinyApp/example", "cell_mouse_cocluster.rds", package="spatialHeatmap") sc.mus <- readRDS(sc.mus.pa) colData(sc.mus)[1:3, , drop=FALSE] ``` Bulk and single cell raw count data are jointly normalized by the function `norm_cell`, which is a wrapper of `computeSumFactors` from `scran` [@scran]. `com=FALSE` means bulk and single cells are separated after normalization for subsequent separate filtering. ```{r eval=TRUE, echo=TRUE, warnings=FALSE, results='hide'} #mus.lis.nor <- read_cache(cache.pa, 'mus.lis.nor') #if (is.null(mus.lis.nor)) { mus.lis.nor <- norm_cell(sce=sc.mus, bulk=blk.mus, com=FALSE) # save_cache(dir=cache.pa, overwrite=TRUE, mus.lis.nor) #} ``` The normalized single cell and bulk data (log2-scale) are filtered to reduce sparsity and low expression values. In the bulk data, replicates are first aggregated by taking means using the function `aggr_rep`. Then the filtering retains genes in the bulk data to have expression values of $\ge$ 1 at a proportion of $\ge 10\%$ (`pOA`) across bulk samples and a coefficient of variance (`CV`) between $0.1-50$ [@Gentleman2018-xj]. In the single cell data the filtering is set to retain genes with expression values of $\ge 1$ (`cutoff=1`) in $\ge 1\%$ (`p.in.gen=0.01`) of cells. In addition, only those cells are retained that have expression values $\ge 1$ (`cutoff=1`) in $\ge 10\%$ (`p.in.cell=0.1`) of their genes. ```{r , eval=TRUE, echo=TRUE, warnings=FALSE} # Aggregate bulk replicates blk.mus.aggr <- aggr_rep(data=mus.lis.nor$bulk, assay.na='logcounts', sam.factor='sample', aggr='mean') # Filter bulk blk.mus.fil <- filter_data(data=blk.mus.aggr, pOA=c(0.1, 1), CV=c(0.1, 50), verbose=FALSE) # Filter cell and subset bulk to genes in cell blk.sc.mus.fil <- filter_cell(sce=mus.lis.nor$cell, bulk=blk.mus.fil, cutoff=1, p.in.cell=0.1, p.in.gen=0.01, verbose=FALSE) ``` Compared to bulk RNA-Seq data, single cell data has a much higher level of sparsity. This difference is reduced by the above filtering and then subsetting the bulk data to the genes remaining in the filtered single cell data. This entire process is accomplished by the `filter_cell` function. The same aSVG instance of mouse brain as in the [Quick Start](#test) section above is used here and the aSVG importing is omitted for brevity. ```{r , eval=TRUE, echo=TRUE, warnings=FALSE} tail(attribute(svg.mus.brain)[[1]])[1:3, 1:4] # Partial features are shown. ``` ### Co-clustering The co-clustering process is performed by calling the function `cocluster`. In the following, default settings obtained from optimization are shown, where `min.dim`, `dimred`, `graph.meth`, and `cluster` refers to `topDimensions`, `dimensionReduction`, `graphBuilding`, and `clusterDetection` in Table \@ref(tab:optPar) respectively. ```{r eval=TRUE, echo=TRUE, warnings=FALSE, results='hide'} #coclus.mus <- read_cache(cache.pa, 'coclus.mus') #if (is.null(coclus.mus)) { coclus.mus <- cocluster(bulk=blk.sc.mus.fil$bulk, cell=blk.sc.mus.fil$cell, min.dim=11, dimred='PCA', graph.meth='knn', cluster='fg') # save_cache(dir=cache.pa, overwrite=TRUE, coclus.mus) #} ``` The co-clustering results are stored in the `colData` slot of `coclus.mus`. The `cluster` indicates cluster labels, the `bulkCell` indicates bulk tissues or single cells, the `sample` suggests original labels of bulk and cells, the `assignedBulk` refers to bulk tissues assigned to cells with `none` suggesting un-assigned, and the `similarity` refers to Spearman's correlation coefficients for assignments between bulk and cells, which is a measure of assignment strigency. ```{r eval=TRUE, echo=TRUE, warnings=FALSE} colData(coclus.mus) ``` The strigency of assignments between bulk and cells can be controled by filtering the `similarity`. This utility is impletmented in function `filter_asg`, where `min.sim` is the `similarity` cutoff. Utilities are also developed to manually tailor the assignments and are explained in the [Supplementary Section](#tailor). ```{r eval=TRUE, echo=TRUE, warnings=FALSE} coclus.mus <- filter_asg(coclus.mus, min.sim=0.1) ``` The co-clusters including bulk and cells can be visualized with the function `plot_dim`. The `dim` argument specifies embedding methods and `cocluster.only=TRUE` sets cells without bulk assignments in gray. In the embedding plot, bulk tissues and cells are indicated by large and small circles respectively. ```{r , eval=TRUE, echo=TRUE, warnings=FALSE, fig.wide=TRUE, fig.cap=('Embedding plot of co-clusters. Large and small circles refer to bulk tissues and single cells respectively. '), out.width="80%", fig.show='show'} plot_dim(coclus.mus, dim='PCA', color.by='cluster', cocluster.only=TRUE) ``` ### Co-visualization {#covisAuto} Bulk assignments are treated as group labels for cells in co-visualization. Similar with annotation-based and manual methods, co-visualization options include cell-to-bulk mapping, bulk-to-cell mapping, inclusion or exclusion of expression values. In cell-to-bulk mapping, single cell data are separated from bulk and expression values are summarized by means within each cell group, _i.e._ bulk assignement. ```{r , eval=TRUE, echo=TRUE, warnings=FALSE} # Separate single cell data. coclus.sc <- subset(coclus.mus, , bulkCell=='cell') # Summarize expression values in each cell group. sc.aggr.coclus <- aggr_rep(data=coclus.sc, assay.na='logcounts', sam.factor='assignedBulk', aggr='mean') colData(sc.aggr.coclus) ``` The co-visualization through cell-to-bulk mapping is built on gene `Adcy1`. Cells in the embedding plot and respective assigned bulk tissues in spatial heatmap are revealed by same colors. ```{r , eval=TRUE, echo=TRUE, warnings=FALSE, fig.wide=TRUE, fig.cap=('Co-visualization through cell-to-bulk mapping in automated method. This plot is created on gene Adcy1. Colors between the embedding plot and spatial heatmap indicate matching of cells with bulk tissues. '), out.width="100%", fig.show='show', results='hide'} covis(svg=svg.mus.brain, data=sc.aggr.coclus, ID=c('Adcy1'), sce.dimred=coclus.sc, dimred='UMAP', tar.cell=c('hippocampus', 'hypothalamus', 'cerebellum', 'cerebral.cortex'), dim.lgd.text.size=10, dim.lgd.nrow=2, bar.width=0.1, legend.nrow=4) ``` In bulk-to-cell mapping, bulk data are separated from cell, where replicates are already aggregated in preprocessing. ```{r , eval=TRUE, echo=TRUE, warnings=FALSE} coclus.blk <- subset(coclus.mus, , bulkCell=='bulk') ``` Same with conventions in the main vignette, at least one bulk tissue identifier should match with aSVG features so as to successfully plot spatial heatmaps. ```{r eval=TRUE, echo=TRUE, warnings=FALSE} colnames(coclus.blk) %in% attribute(svg.mus.brain)[[1]]$feature ``` The co-visualization through bulk-to-cell mapping is built on gene `Adcy1`. Cells in the embedding plot and respective assigned bulk tissues in spatial heatmap are revealed by same colors. ```{r eval=TRUE, echo=TRUE, warnings=FALSE, fig.wide=TRUE, fig.cap=('Co-visualization through bulk-to-cell mapping in automated method. This plot is created on gene Adcy1. Colors between the embedding plot and spatial heatmap indicate matching of cells with bulk tissues.'), out.width="100%", fig.show='show', results='hide'} covis(svg=svg.mus.brain, data=coclus.blk, ID=c('Adcy1'), sce.dimred=coclus.sc, dimred='UMAP', tar.bulk=colnames(coclus.blk), assay.na='logcounts', legend.nrow=4, dim.lgd.text.size=10, dim.lgd.nrow=2, bar.width=0.08) ``` By setting `profile=FALSE`, the co-visualization is created without expression values, where colors in the embedding plot and spatial heatmap only indicate matching between cells and bulk tissues. ```{r eval=TRUE, echo=TRUE, warnings=FALSE, fig.wide=TRUE, fig.cap=('Co-visualization without expression values in automated method. Colors between the embedding plot and spatial heatmap only indicate matching of cells with bulk tissues.'), out.width="100%", fig.show='show', results='hide'} covis(svg=svg.mus.brain, data=coclus.blk, ID=c('Adcy1'), sce.dimred=coclus.sc, dimred='UMAP', profile=FALSE, assay.na='logcounts', legend.nrow=4, dim.lgd.text.size=10, dim.lgd.nrow=2, bar.width=0.1) ``` # Shiny App {#autoSCE} The co-visualization feature is included in the integrated Shiny app that is an GUI implementation of `spatialHeatmap`, including annotation-based, manual, and automated methods. To start this app, simply call `shiny_shm()` in R. Below is a screenshot of the co-visulization output. ```{r echo=FALSE, fig.wide=TRUE, out.width="100%", fig.cap=('Screenshot of the co-visualization output in Shiny app. The co-visualization plot is generated by the automatic method.')} include_graphics('img/shiny_coviz.png') ``` When using the Shiny app, if bulk and single cell raw count data are provided, column-wise combine them in a `SCE` object, and format the metadata in the `colData` slot according to the following rules: 1. In the `bulkCell` column, use `bulk` and `cell` to indicate bulk and cell samples respectively. If no `bulk` is included in this column, all samples are considered cells. 2. If multiple cell group labels (annotation-based, or manually created) are provided, include them in columns of `label`, `label1`, `label2`, and so on respectively. In each of these label columns include corresponding aSVG features as bulk tissue labels. After formatting the metadata, save the `SCE` object as an `.rds` file using `saveRDS`, then upload the `.rds` and aSVG file to the app. An example of bulk and single cell data for use in the Shiny app are included in `spatialHeatmap` and shown below. ```{r eval=TRUE, echo=TRUE, warnings=FALSE} shiny.dat.pa <- system.file("extdata/shinyApp/example", "shiny_covis_bulk_cell_mouse_brain.rds", package="spatialHeatmap") shiny.dat <- readRDS(shiny.dat.pa) colData(shiny.dat) ``` # Supplementary Section {#sup} ## Tailoring Co-clustering Results {#tailor} ### Assigning Desired Bulk in R The bulk-cell assignments in the automated method can be manually tailored, which is optional. The tailoring can be performed in command line or on a [Shiny app](#asgBulkShiny). This section illustrates the command-line based tailoring. First visualize single cells in an embedding plot as shown below. In order to define more accurate coordinates in the next step, tune the x-y axis breaks (`x.break`, `y.break`) and set `panel.grid=TRUE`. ```{r , eval=TRUE, echo=TRUE, warnings=FALSE, fig.wide=TRUE, fig.cap=('UMAP embedding plot of mouse brain bulk tissues and single cells. Bulk tissues and single cells are indicated by large and small circles respectively. '), out.width="100%", fig.show='show'} plot_dim(coclus.mus, dim='UMAP', color.by='sample', x.break=seq(-10, 10, 1), y.break=seq(-10, 10, 1), panel.grid=TRUE) ``` Define desired bulk tissues (`desiredBulk`) for cells selected by x-y coordinate ranges (`x.min`, `x.max`, `y.min`, `y.max`) in the embedding plot in form of a `data.frame` (`df.desired.bulk`). The `dimred` reveals where the coordinates come from and is required. In this example, cell populations near the bulk `hippocampus` are selected and `hippocampus` is the desired bulk tissue. ```{r , eval=TRUE, echo=TRUE, warnings=FALSE} df.desired.bulk <- data.frame(x.min=c(6), x.max=c(8), y.min=c(-1), y.max=c(0.5), desiredBulk=c('hippocampus'), dimred='UMAP') df.desired.bulk ``` Incorporate desired bulk assignments to co-clustering results by calling `refine_asg`. The similarities corresponding to desired bulk are internally set at the maximum of 1. After inclusion of desired bulk, separate single cells from bulk for co-visualization. ```{r , eval=TRUE, echo=TRUE, warnings=FALSE, results='hide'} # Incorporate desired bulk coclus.mus.tailor <- refine_asg(sce.all=coclus.mus, df.desired.bulk=df.desired.bulk) # Separate cells from bulk coclus.sc.tailor <- subset(coclus.mus.tailor, , bulkCell=='cell') ``` After tailoring, the co-visualization through bulk-to-cell mapping is built on gene `Adcy1`. To reveal the tailoring result, only `hippocampus` is shown through the argument `tar.bulk` in the plot, where cells defined in `df.desired.bulk` have the same color as the desired bulk `hippocampus`. `hippocampus` cells in the embedding plot include tailored cells in `df.desired.bulk` and those labeled `hippocampus` in coclustering. ```{r eval=TRUE, echo=TRUE, warnings=FALSE, fig.wide=TRUE, fig.cap=('Co-visualization through bulk-to-cell mapping after tailoring. This plot is created on gene Adcy1. Only hippocampus cells and tissue are shown to display the tailoring result. '), out.width="100%", fig.show='show', results='hide'} covis(svg=svg.mus.brain, data=coclus.blk, ID=c('Adcy1'), sce.dimred=coclus.sc.tailor, dimred='UMAP', tar.bulk=c('hippocampus'), assay.na='logcounts', legend.nrow=4, dim.lgd.text.size=10, dim.lgd.nrow=2, bar.width=0.08) ``` ### Assigning Desired Bulk on Shiny App {#asgBulkShiny} This section describes [tailoring](#tailor) co-clustering results on the convenience Shiny app, which is lauched by calling `desired_bulk_shiny`. Figure \@ref(fig:tailorShiny) is the screenshot of the Shiny app. The file to upload is the co-clustering result returned by `cocluster`, here `coclus.mus`. It should be saved in an `.rds` file by using `saveRDS` before uploaded to the app. On the left embedding plot, cells are selected with the "Lasso Select" tool. On the right, selected cells and their coordinates are listed in a table, and the desired bulk tissues (aSVG features) can be selected from the dropdown list, here `hippocampus`. To download the table just click the "Download" button. The "Help" button gives more instructions. ```{r tailorShiny, echo=FALSE, fig.wide=TRUE, out.width="100%", fig.cap=('Screenshot of the Shiny app for selecting desired bulk tissues. On the left is the embedding plot of single cells, where target cells are selected with the "Lasso Select" tool. On the right, desired bulk tissues are assigned for selected cell.')} include_graphics('img/assign_bulk.png') ``` An example of desired bulk downloaded from the convenience Shiny app is shown below. The x-y coordinates refer to single cells in embbeding plots (`dimred`). The `df.desired.bulk` is ready to use in the [tailoring](#tailor) section. ```{r eval=TRUE, echo=TRUE, warnings=FALSE} desired.blk.pa <- system.file("extdata/shinyApp/example", "selected_cells_with_desired_bulk.txt", package="spatialHeatmap") df.desired.bulk <- read.table(desired.blk.pa, header=TRUE, row.names=1, sep='\t') df.desired.bulk[1:3, ] ```
# Version Informaion ```{r eval=TRUE, echo=TRUE} sessionInfo() ``` # Funding This project has been funded by NSF awards: [PGRP-1546879](https://www.nsf.gov/awardsearch/showAward?AWD_ID=1546879&HistoricalAwards=false){target="_blank"}, [PGRP-1810468](https://www.nsf.gov/awardsearch/showAward?AWD_ID=1810468){target="_blank"}, [PGRP-1936492](https://www.nsf.gov/awardsearch/showAward?AWD_ID=1936492&HistoricalAwards=false){target="_blank"}. # References