--- title: Flexible clustering for Bioconductor author: - name: Aaron Lun email: infinite.monkeys.with.keyboards@gmail.com output: BiocStyle::html_document package: bluster vignette: > %\VignetteIndexEntry{1. Clustering algorithms} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo=FALSE} knitr::opts_chunk$set(error=FALSE, message=FALSE, warnings=FALSE) library(BiocStyle) ``` # Introduction The `r Biocpkg("bluster")` package provides a flexible and extensible framework for clustering in Bioconductor packages/workflows. At its core is the `clusterRows()` generic that controls dispatch to different clustering algorithms. We will demonstrate on some single-cell RNA sequencing data from the `r Biocpkg("scRNAseq")` package; our aim is to cluster cells into cell populations based on their PC coordinates. ```{r} library(scRNAseq) sce <- ZeiselBrainData() # Trusting the authors' quality control, and going straight to normalization. library(scuttle) sce <- logNormCounts(sce) # Feature selection based on highly variable genes. library(scran) dec <- modelGeneVar(sce) hvgs <- getTopHVGs(dec, n=1000) # Dimensionality reduction for work (PCA) and pleasure (t-SNE). set.seed(1000) library(scater) sce <- runPCA(sce, ncomponents=20, subset_row=hvgs) sce <- runUMAP(sce, dimred="PCA") mat <- reducedDim(sce, "PCA") dim(mat) ``` # Hierarchical clustering Our first algorithm is good old hierarchical clustering, as implemented using `hclust()` from the _stats_ package. This automatically sets the cut height to half the dendrogram height. ```{r} library(bluster) hclust.out <- clusterRows(mat, HclustParam()) plotUMAP(sce, colour_by=I(hclust.out)) ``` Advanced users can achieve greater control of the procedure by passing more parameters to the `HclustParam()` constructor. Here, we use Ward's criterion for the agglomeration with a dynamic tree cut from the `r CRANpkg("dynamicTreeCut")` package. ```{r} hp2 <- HclustParam(method="ward.D2", cut.dynamic=TRUE) hp2 hclust.out <- clusterRows(mat, hp2) plotUMAP(sce, colour_by=I(hclust.out)) ``` # $k$-means clustering Our next algorithm is $k$-means clustering, as implemented using the `kmeans()` function. This requires us to pass in the number of clusters, either as a number: ```{r} set.seed(100) kmeans.out <- clusterRows(mat, KmeansParam(10)) plotUMAP(sce, colour_by=I(kmeans.out)) ``` Or as a function of the number of observations, which is useful for vector quantization purposes: ```{r} kp <- KmeansParam(sqrt) kp set.seed(100) kmeans.out <- clusterRows(mat, kp) plotUMAP(sce, colour_by=I(kmeans.out)) ``` # Graph-based clustering We can build shared or direct nearest neighbor graphs and perform community detection with `r CRANpkg("igraph")`. Here, the number of neighbors `k` controls the resolution of the clusters. ```{r} set.seed(101) # just in case there are ties. graph.out <- clusterRows(mat, NNGraphParam(k=10)) plotUMAP(sce, colour_by=I(graph.out)) ``` It is again straightforward to tune the procedure by passing more arguments such as the community detection algorithm to use. ```{r} set.seed(101) # just in case there are ties. np <- NNGraphParam(k=20, cluster.fun="louvain") np graph.out <- clusterRows(mat, np) plotUMAP(sce, colour_by=I(graph.out)) ``` # Two-phase clustering We also provide a wrapper for a hybrid "two-step" approach for handling larger datasets. Here, a fast agglomeration is performed with $k$-means to compact the data, followed by a slower graph-based clustering step to obtain interpretable meta-clusters. (This dataset is not, by and large, big enough for this approach to work particularly well.) ```{r} set.seed(100) # for the k-means two.out <- clusterRows(mat, TwoStepParam()) plotUMAP(sce, colour_by=I(two.out)) ``` Each step is itself parametrized by `BlusterParam` objects, so it is possible to tune them individually: ```{r} twop <- TwoStepParam(second=NNGraphParam(k=5)) twop set.seed(100) # for the k-means two.out <- clusterRows(mat, TwoStepParam()) plotUMAP(sce, colour_by=I(two.out)) ``` # Obtaining full clustering statistics Sometimes the vector of cluster assignments is not enough. We can obtain more information about the clustering procedure by setting `full=TRUE` in `clusterRows()`. For example, we could obtain the actual graph generated by `NNGraphParam()`: ```{r} nn.out <- clusterRows(mat, NNGraphParam(), full=TRUE) nn.out$objects$graph ``` The assignment vector is still reported in the `clusters` entry: ```{r} table(nn.out$clusters) ``` # Further comments `clusterRows()` enables users or developers to easily switch between clustering algorithms by changing a single argument. Indeed, by passing the `BlusterParam` object across functions, we can ensure that the same algorithm is used through a workflow. It is also helpful for package functions as it provides diverse functionality without compromising a clean function signature. However, the true power of this framework lies in its extensibility. Anyone can write a `clusterRows()` method for a new clustering algorithm with an associated `BlusterParam` subclass, and that procedure is immediately compatible with any workflow or function that was already using `clusterRows()`. # Session information {-} ```{r} sessionInfo() ```