---
title: "scECODA tutorial"
author: "Christian Halter"
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc_float: true
  BiocStyle::pdf_document: default
editor_options: 
  markdown: 
    wrap: 80
bibliography: references.bib
vignette: >
  %\VignetteIndexEntry{scECODA tutorial}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
# Solution for the "magick package is required to crop" error:
knitr::opts_chunk$set(
  echo = TRUE, # keep showing code chunks
  message = FALSE, 
  warning = FALSE,
  # This is the line that fixes the cropping issue:
  crop = NULL
)
```

# Introduction

The scECODA package provides functions for *s*ingle-*c*ell *E*xploratory
*CO*mpositional *D*ata *A*nalysis at the population scale. It enables users to
explore their dataset by visualizing the sample structure and how groups of
samples relate to each other by providing numerous convenient plotting and
clustering functions. Additionally, it provides metrics to quantify how strongly
groups of samples separate and which cell types are driving this separation
(differential abundance analysis).

## Motivation for Bioconductor Integration

While many tools exist for supervised comparative single-cell analysis, there is
a distinct need for a standardized framework that handles unsupervised
**exploratory compositional data analysis (ECODA)** specifically for large-scale
cohorts. `scECODA` is motivated by the following goals for the Bioconductor
ecosystem:

-   **Interoperability**: The package natively supports core Bioconductor
    classes, including `SingleCellExperiment` and `SummarizedExperiment`,
    allowing seamless integration into existing pipelines.
-   **Infrastructure Synergy**: It leverages the established Bioconductor tool
    `DESeq2` for pseudobulk normalization.

--------------------------------------------------------------------------------

# Installation

You can install the latest stable release of `scECODA` from Bioconductor.

```{r Install the scECODA package, eval=FALSE}
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("scECODA")
```

```{r Load scECODA}
library(scECODA)
```

--------------------------------------------------------------------------------

# Create SummarizedExperiment objects for ECODA

You can create an SummarizedExperiment object from a single-cell data (Seurat or
SingleCellExperiment object) or from a data frame containing cell counts or
frequencies (in % per sample) with `ecoda()`.

Based on the input, the following data and transformations will be automatically
calculated:

-   Cell counts per sample ("counts")
-   Relative abundance in % ("freq")
-   Centered log-ratio (CLR) transformed abundance ("clr")
-   Arc-sine square root transformed abundance ("asin_sqrt")
-   Optionally: you can get the pseudobulk gene expression per sample,
    normalized with DESeq2 ("pb")
-   Metadata

Additionally, it will calculate useful metrics, such as:

-   The variance of each cell type across all samples (in CLR value) in a data
    frame. It also contains their average abundance (in CLR) and cumulative
    variance explained.
-   The top *n* most highly variable cell types (HVCs) ("top_n_hvcs"). The
    number of HVCs (*n*) is automatically determined by selecting the most
    highly variable cell types explaining approximately 50% of the total
    variance (sum of variances of all cell types). Alternatively, the user can
    choose the number of cell types *n* or a variance explained threshold.
-   The names of the top *n* most highly variable cell types (HVCs) ("hvcs")
-   The variance explained by the top_n_hvcs ("variance_explained")
-   Sample distance matrix ("sample_distances")

**NOTE:** To calculate CLR-transformed cell type abundances, zero counts or
relative abundances are not allowed (log of zero is not defined). By default,
scECODA adds a pseudocount of +0.5 to all values if a counts matrix is provided;
when the input is a frequency matrix, we add 2/3 times the smallest frequency
value as suggested in [@martín-fernández2003].

## From SingleCellExperiment object

Load an example dataset from the `scRNAseq` BioC library.

```{r Load example SingleCellExperiment dataset, message=FALSE, warning=FALSE}
# Loading an example SingleCellExperiment dataset
library(scRNAseq)

all.ds <- as.data.frame(surveyDatasets())
ds <- 1
sce_object <- fetchDataset(
  all.ds[["name"]][ds],
  all.ds[["version"]][ds]
)

names(colData(sce_object))
```

To represent this dataset in terms of cell type composition, we will use the
cell type labels provided by the authors and stored in the "cluster" metadata
field. We will do this for each individual sample ("sample" metadata field). The
`ecoda()` function will do this for you:

```{r Create ECODA object}
# Create ECODA object
se <- ecoda(
  sce_object,
  sample_col = "sample",
  celltype_col = "cluster"
)
```

The resulting object stores row counts, frequencies and several transformations
for the cell type composition by sample. For instance, the `clr` assay contains
centered log-ratio values for the cell types in each sample:

```{r Show example of CLR df}
assay(se, "clr")[1:5,1:5]
```

## From Seurat object

Alternatively, scECODA also supports generating compositional matrices from
Seurat objects:

```{r Create ECODA object from Seurat object}
# Commented out because Seurat depends on igraph which has not been updated
# to work with R 4.6 yet.

# library(Seurat)
# 
# counts <- counts(sce_object)
# metadata <- as.data.frame(SummarizedExperiment::colData(sce_object))
# 
# seurat_object <- CreateSeuratObject(counts = counts, meta.data = metadata)
# 
# se <- ecoda(
#   seurat_object,
#   sample_col = "sample",
#   celltype_col = "cluster"
# )
```

## From counts or frequency data frame

If you have a pre-calculated sample x cell type compositional matrix (either as
counts or raw frequencies), you can use them directly as input to scECODA.
Metadata can be optionally provided and used downstream for visualization.

In this example, we load the immune cell composition in samples from a breast
cancer cohort [@zhang2021]

**NOTES:**

-   If you create an SummarizedExperiment ECODA object from a data frame
    containing relative abundances (in %), please use percentage values (e.g. 30
    for 30%) and not as decimals (NOT 0.3 for 30%). If the relative abundances
    sum up to 100%, they will be automatically re-scaled to 100 and you will be
    informed with a message.
-   If you do add metadata, **it is important to make sure that the
    counts/frequency and the metadata data frames have the same column names**.

```{r Load example datasets}
# Load example datasets
data(example_data)

Zhang <- example_data$Zhang
counts <- Zhang$cell_counts_lowresolution
freq <- calc_freq(counts)
metadata <- Zhang$metadata

se <- ecoda(data = counts, metadata = metadata)

se <- ecoda(data = freq, metadata = metadata)
```

--------------------------------------------------------------------------------

# Visualization and quantification of sample separation

## PCA

For explorative analysis, Principal Component Analysis (PCA) can be very useful
to find the major axes of variation between samples. You can plot a PCA of your
se with `plot_pca(se)`.

```{r Show PCA based on cell type composition, message=FALSE, warning=FALSE}
se <- ecoda(
  data = example_data$Zhang$cell_counts_lowresolution,
  metadata = example_data$Zhang$metadata,
)

plot_pca(
  se,
  label_col = "Tissue",
  title = "PCA based on cell type composition",
  n_hv_feat_show = 5 # Shows the most highly variable features (cell types)
)
```

Not all cell types have informative composition (some may be very rare, others
constant across samples). It is often more robust to restrict the analysis on
highly-variable cell types (HVCs); for example, PCA can be calculated on HVCs:

```{r PCA based on highly variable cell types}
plot_pca(
  se,
  assay = "clr_hvc",
  label_col = "Tissue",
  title = "PCA based on highly variable cell types",
  n_hv_feat_show = nrow(metadata(se)$clr_hvc)
)
```

Optionally, we can also add an extra dimension and plot the first 3 principal
components:

```{r 3D PCA}
plot_pca3d(se, label_col = "Tissue")
```

## Quantifying group separation

Sample separation (based on a user-defined grouping) can be quantified with
different metrics, for example:

-   Analysis of similarities (ANOSIM)
-   Adjusted rand index (ARI)
-   Modularity score
-   Average silhouette score (not recommended as it expects gaussian cluster
    shapes and is not robust to outliers. ANOSIM and modularity are more
    robust.)

```{r Quantifying group separation, message=FALSE, warning=FALSE}
# You can get the separation scores like this:
dist_mat <- dist(assay(se, "clr"))
# Or use the pre-calculated distance matrix based on the CLR abundances:
dist_mat <- metadata(se)$sample_distances

labels <- colData(se)$Tissue

anosim_r_score <- calc_anosim(dist_mat, labels)
ari_score <- calc_ari(dist_mat, labels)
# modularity_score <- calc_modularity(dist_mat, labels)
silhouette_score <- calc_sil(dist_mat, labels)

print(paste("ANOSIM R:", anosim_r_score))
print(paste("ARI:", ari_score))
# print(paste("Modularity:", modularity_score))
print(paste("Silhouette:", silhouette_score))
```

## Box and bar plots

Box plots are another useful tool to visualize the distribution of cell types
across samples:

```{r Box plot}
plot_boxplot(se, title = "CLR Abundance by Cell Type")
```

If there are biological groupings of interest, we can directly perform
statistical tests between the per-sample cell type compositions:

```{r Grouped box plot}
plot_boxplot(
  se,
  label_col = "Tissue",
  title = "CLR Abundance by Cell Type (with Wilcoxon Test)"
)
```

This can also be plotted as stacked bar plots but it is a bit harder to read.

```{r Bar plot by group}
# Plotting average cell type abundance by experimental group
plot_barplot(
  se,
  label_col = "Tissue",
  plot_by = "group",
  title = "Mean Relative Abundance by Condition"
)
```

```{r Bar plot by sample}
# Plotting cell type abundance for each sample separately
plot_barplot(
  se,
  label_col = "Tissue",
  plot_by = "sample",
  title = "Relative Abundance for Each Sample"
)
```

**NOTE:** It is highly recommended to only do statistical testing on
log-ratio-transformed compositional data and not on counts or relative
abundances in % [@greenacre2021].

The main focus of scECODA is **exploratory** (unsupervised) analysis of samples
based on cell type composition. If you want to do down-stream comparative
(supervised) differential abundance testing of *a priori known groups*
(including statistical designs with multiple known covariates), there are other
tools available for that purpose, e.g. scCODA [@büttner2021] or tascCODA
[@tasccoda].

## Heatmap

The clustered heatmap plot is a good alternative to not only visualize how
samples and cell types correlate but also to cluster samples. It is important
not to re-scale in order to avoid amplifying tiny differences.

Also, it is possible that you see your samples forming separate clusters in the
PCA or heatmap even though you do not find any (or only few) significant
differences in cell type composition. However, there might be small differences
among a large number of cell types which, taken all together, can still drive
sample separation and explain differences between your biological groups.

NOTE: you can specify multiple metadata columns in "label_col" to visualize how
samples group across multiple covariates simultaneously.

Here is an example for a simple dataset:

```{r Plot heatmap example 1}
plot_heatmap(
  se,
  label_col = c("Clinical.efficacy.", "Tissue"),
  cutree_rows = 3,
  cutree_cols = 3
)
```

We see, for example, an enrichment of plasma cells in the tumor samples, while
neutrophils and platelets are more abundant in blood samples.

As a second example, here's the same analysis using a large healthy cohort with
868 PBMC samples and 69 cell types [@gong2024]:

```{r Plot heatmap example 2, fig.height=16, fig.width=12}
se <- ecoda(
  data = example_data$GongSharma_full$cell_counts_highresolution,
  metadata = example_data$GongSharma_full$metadata
)

plot_heatmap(
  se,
  label_col = c("subject.cmv", "age_group"),
  # Additional arguments for pheatmap:
  cutree_rows = 3,
  cutree_cols = 5,
  show_colnames = FALSE
)
```

Subsetting to only the most highly variable cell types (HVCs) yields a very
similar sample clustering, while being much more interpretable.

```{r Plot heatmap example 2 only HVCs}
plot_heatmap(
  se,
  assay = "clr_hvc",
  label_col = c("subject.cmv", "age_group"),
  # Additional arguments for pheatmap:
  cutree_rows = 3,
  cutree_cols = 4,
  show_colnames = FALSE
)
```

Interestingly, we see that cell type composition tends to cluster subjects, in
an unsupervised manner, based on their age group and their CMV status. For
example, naive cell types appear to be enriched in younger subjects that were
not exposed to CMV infection.

## Cell type correlation

The cell type correlation plot offers the possibility to visualize which cell
types correlate with each other, highlighting possible microenvironment hubs
that might be of relevance to explain possible groups of samples in your data:

```{r Correlation plot, fig.height=16, fig.width=16}
plot_corr(se)
```

While the correlation plot below showing only the HVCs is easier to read, it is
nevertheless interesting to see the full correlation plot above. It seems that
there is additional information that might be overlooked:

Besides the main cell type clusters of effector memory cells (separating samples
by cytomegalovirus (CMV) infection status) and SOX4+ naive T cells (separating
samples by age), there seems to be a cluster of ISG+ cell types, memory T cells
and monocytes/DCs whose abundance strongly correlate, respectively.

```{r Correlation plot only HVCs, fig.height=16, fig.width=16}
plot_corr(se, assay = "clr_hvc")
```

## Highly variable cell types

As demonstrated in the heatmap of the extensive Gong & Sharma dataset, not all
cell subtypes contribute equally to sample separation.

Cell types at the top exhibit **high variance** across samples, indicating they
are the main drivers of group separation, while those toward the bottom show
minor variation.

This principle is directly analogous to the selection of highly variable genes
(HVGs) in (sc)RNA-seq data: **high variance is a powerful proxy for biological
relevance.**

We expect cell types that are differentially abundant between key biological
groups to display this increased variance. scECODA provides convenient functions
to calculate and visualize these cell type variances and mean abundance across
all samples:

```{r Mean-variance plot 50 percent}
library(ggplot2)
plot_varmean(se, labels = "none") +
  ggtitle("Default - cell types explaining 50% variance")
```

```{r Mean-variance plot 80 percent}
se_new <- find_hvcs(se, variance_explained = 0.8)
plot_varmean(se_new) +
  ggtitle("Cell types explaining 80% variance")
```

```{r Mean-variance plot top 5 HVCs}
se_new <- find_hvcs(se, top_n_hvcs = 5)
plot_varmean(se_new) +
  ggtitle("Top 5 HVCs")
```

# scECODA and pseudobulk

**ECODA is much more interpretable** (10-100 cell types) compared to gene
expression (2'000 - 20'000 genes). However, pseudobulk analysis has the
advantage of not requiring cell type annotations.

The scECODA package provides convenient functions to explore sample structure at
the population level based on cell type composition but also by default
automatically calculates DESeq2-normalized [@love2015] sample pseudobulk gene
expression.

A PCA plot based on pseudobulk can be plotted like this:

```{r Load dataset including pseudobulk calculation, message=FALSE, warning=FALSE}
# Loading an example SingleCellExperiment dataset
library(scRNAseq)

all.ds <- as.data.frame(surveyDatasets())
ds <- 1
sce_object <- fetchDataset(
  all.ds[["name"]][ds],
  all.ds[["version"]][ds]
)

se <- ecoda(
  data = sce_object,
  sample_col = "sample",
  celltype_col = "cluster",
  get_pb = TRUE
)
```

```{r PCA plot showing top 5 features}
plot_pca(
  se,
  label_col = "Condition",
  title = "PCA based on cell type composition",
  n_hv_feat_show = 5
)
```

```{r PCA plot based on pseudobulk}
plot_pca(
  se,
  assay = "pb",
  label_col = "Condition",
  title = "PCA based on pseudobulk gene expression"
  #, n_hv_feat_show = 5 # optionally, show top n most highly variable genes
)
```

# FAQ

## Can ECODA be affected by batch effect?

Yes, ECODA can be influenced by technical variation between batches. A "batch
effect" can manifests as an artificial inflation or depletion of specific cell
types due to sample handling rather than biology.

Common examples include:

-   Technical artifacts: Red blood cells or platelets resulting from tissue
    contamination or varied bleeding during surgery/dissection.
-   High-variance/Fragile types: Some cell types, e.g. neutrophils, can be
    sensitive to dissociation protocols, resulting in inconsistent recovery
    rates across samples.

Removing these "nuisance" populations can help mitigating noise and/or batch
effect, and to better capture true biological variation.

## Should I remove red blood cells, platelets and neutrophils?

Yes, we recommended to exclude cell types that introduce technical noise rather
than biological signal.

## How can I remove specific cell types?

The recommended workflow depends on your data source and whether you require
pseudobulk analysis:

-   From a Count Matrix: Simply remove the unwanted columns (cell types) from
    your matrix before passing it to ecoda().
-   From Seurat/SCE (with Pseudobulk): If you set get_pb = TRUE, you must subset
    your Seurat or SCE object first. This ensures that the gene expression
    aggregation (pseudobulk) only includes the cells you are interested in.
-   From an existing ECODA object: You can easily extract the data, filter it,
    and re-initialize a clean object:

```{r Removing cell types, eval=FALSE}
# Load example datasets
data(example_data)
Stephenson <- example_data$Stephenson
counts <- Stephenson$cell_counts_lowresolution
metadata <- Stephenson$metadata
se <- ecoda(data = counts, metadata = metadata)

# Specify cell types to drop
cts_to_drop <- c("RBC", "Platelets", "Neutrophils")
keep <- !rownames(assay(se, "counts")) %in% cts_to_drop
counts_subset <- assay(se, "counts")[, keep]
colData_subset <- as.data.frame(colData(se))[keep, ]

# Re-initialize a new subset object
# This ensures all transformations (CLR, HVCs, etc.) are recalculated correctly
se_subset <- ecoda(data = counts_subset, metadata = colData_subset)
```

# References

# Session Info

```{r sessionInfo}
sessionInfo()
```