---
title: Flexible clustering for Bioconductor 
author: 
- name: Aaron Lun
  email: infinite.monkeys.with.keyboards@gmail.com
output: BiocStyle::html_document
package: bluster
vignette: >
  %\VignetteIndexEntry{1. Clustering algorithms}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}    
---

```{r, echo=FALSE}
knitr::opts_chunk$set(error=FALSE, message=FALSE, warnings=FALSE)
library(BiocStyle)
```

# Introduction

The `r Biocpkg("bluster")` package provides a flexible and extensible framework for clustering in Bioconductor packages/workflows.
At its core is the `clusterRows()` generic that controls dispatch to different clustering algorithms. 
We will demonstrate on some single-cell RNA sequencing data from the `r Biocpkg("scRNAseq")` package;
our aim is to cluster cells into cell populations based on their PC coordinates.

```{r}
library(scRNAseq)
sce <- ZeiselBrainData()

# Trusting the authors' quality control, and going straight to normalization.
library(scuttle)
sce <- logNormCounts(sce)

# Feature selection based on highly variable genes.
library(scran)
dec <- modelGeneVar(sce)
hvgs <- getTopHVGs(dec, n=1000)

# Dimensionality reduction for work (PCA) and pleasure (t-SNE).
set.seed(1000)
library(scater)
sce <- runPCA(sce, ncomponents=20, subset_row=hvgs)
sce <- runUMAP(sce, dimred="PCA")

mat <- reducedDim(sce, "PCA")
dim(mat)
```

# Hierarchical clustering

Our first algorithm is good old hierarchical clustering, as implemented using `hclust()` from the _stats_ package.
This automatically sets the cut height to half the dendrogram height.

```{r}
library(bluster)
hclust.out <- clusterRows(mat, HclustParam())
plotUMAP(sce, colour_by=I(hclust.out))
```

Advanced users can achieve greater control of the procedure by passing more parameters to the `HclustParam()` constructor.
Here, we use Ward's criterion for the agglomeration with a dynamic tree cut from the `r CRANpkg("dynamicTreeCut")` package.

```{r}
hp2 <- HclustParam(method="ward.D2", cut.dynamic=TRUE)
hp2
hclust.out <- clusterRows(mat, hp2)
plotUMAP(sce, colour_by=I(hclust.out))
```

# $k$-means clustering

Our next algorithm is $k$-means clustering, as implemented using the `kmeans()` function.
This requires us to pass in the number of clusters, either as a number: 

```{r}
set.seed(100)
kmeans.out <- clusterRows(mat, KmeansParam(10))
plotUMAP(sce, colour_by=I(kmeans.out))
```

Or as a function of the number of observations, which is useful for vector quantization purposes:

```{r}
kp <- KmeansParam(sqrt)
kp
set.seed(100)
kmeans.out <- clusterRows(mat, kp)
plotUMAP(sce, colour_by=I(kmeans.out))
```

# Graph-based clustering

We can build shared or direct nearest neighbor graphs and perform community detection with `r CRANpkg("igraph")`.
Here, the number of neighbors `k` controls the resolution of the clusters.

```{r}
set.seed(101) # just in case there are ties.
graph.out <- clusterRows(mat, NNGraphParam(k=10))
plotUMAP(sce, colour_by=I(graph.out))
```

It is again straightforward to tune the procedure by passing more arguments such as the community detection algorithm to use.

```{r}
set.seed(101) # just in case there are ties.
np <- NNGraphParam(k=20, cluster.fun="louvain")
np
graph.out <- clusterRows(mat, np)
plotUMAP(sce, colour_by=I(graph.out))
```

# Two-phase clustering

We also provide a wrapper for a hybrid "two-step" approach for handling larger datasets.
Here, a fast agglomeration is performed with $k$-means to compact the data,
followed by a slower graph-based clustering step to obtain interpretable meta-clusters.
(This dataset is not, by and large, big enough for this approach to work particularly well.)

```{r}
set.seed(100) # for the k-means
two.out <- clusterRows(mat, TwoStepParam())
plotUMAP(sce, colour_by=I(two.out))
```

Each step is itself parametrized by `BlusterParam` objects, so it is possible to tune them individually:

```{r}
twop <- TwoStepParam(second=NNGraphParam(k=5))
twop
set.seed(100) # for the k-means
two.out <- clusterRows(mat, TwoStepParam())
plotUMAP(sce, colour_by=I(two.out))
```

# Obtaining full clustering statistics

Sometimes the vector of cluster assignments is not enough.
We can obtain more information about the clustering procedure by setting `full=TRUE` in `clusterRows()`.
For example, we could obtain the actual graph generated by `NNGraphParam()`:

```{r}
nn.out <- clusterRows(mat, NNGraphParam(), full=TRUE)
nn.out$objects$graph
```

The assignment vector is still reported in the `clusters` entry:

```{r}
table(nn.out$clusters)
```

# Further comments

`clusterRows()` enables users or developers to easily switch between clustering algorithms by changing a single argument.
Indeed, by passing the `BlusterParam` object across functions, we can ensure that the same algorithm is used through a workflow.
It is also helpful for package functions as it provides diverse functionality without compromising a clean function signature.
However, the true power of this framework lies in its extensibility.
Anyone can write a `clusterRows()` method for a new clustering algorithm with an associated `BlusterParam` subclass,
and that procedure is immediately compatible with any workflow or function that was already using `clusterRows()`.

# Session information {-}

```{r}
sessionInfo()
```