---
title: "4. Stratified Analysis of Mutational Signatures"
author: "Daniel Huebschmann"
date: "30/12/2019"
vignette: >
  %\VignetteIndexEntry{4. Stratified Analysis of Mutational Signatures}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}  
output: 
  BiocStyle::html_document:
    toc: yes
---

```{r loadStyle, warning=FALSE, echo=FALSE, message=FALSE, results="hide"}
library(BiocStyle)
```

```{r packages, include=FALSE}
library(YAPSA)
library(Biostrings)
library(BSgenome.Hsapiens.UCSC.hg19)
library(knitr)
opts_chunk$set(echo=TRUE)
opts_chunk$set(fig.show='asis')
```

# Introduction {#introduction}

For many biological questions, it is of relevance to group mutations, or more 
specifically, single nucleotide variants (SNVs), into different **strata** or 
categories. These strata may be defined by genomic coordinates or other measures 
derived therefrom and annotated to the variants, e.g., replication timing, 
chromatin state, coding vs. non-coding part of the genome etc. Other choices of 
stratification are also possible, as shown below for the example of mutation 
density as stratification axis.

It is noteworthy to say that it would be neither sufficient nor appropriate to 
perform completely separate and independent NNLS analyses of mutational 
signatures on the different strata. This artificially reduces the statistical 
power provided by the higher number of SNVs in the unstratified setting. 
Instead, the results of the unstratified analysis should be used as input and 
supplied to a constrained analysis for the strata.

In YAPSA, these strata or categories have to be exclusive, i.e. one SNV must be 
in one and only one category. The number of strata will be denoted by $s$. For 
example one could evaluate whether an SNV falls into a region of high, 
intermediate or low mutation density by applying meaningful cutoffs on 
intermutation distance. Following this choice, there are three strata: high, 
intermediate and low mutation density. If we have already performed an analysis 
of mutational signatures for the whole (unstratified) mutational catalog of
the cohort, we have identified a set of signatures of interest and the 
corresponding exposures. We now could ask the question if some of the 
signatures are enriched or depleted in one or the other stratum, yielding a 
strata-specific signature enrichment and depletion pattern. In YAPSA, the 
function `SMC` (Stratification of the Mutational Catalogue) solves the 
stratified optimization problem:

(@SMC_formula) $$
\begin{aligned}
\min_{H_{(\cdot j)}^k \in \mathbb{R}^l}||W \cdot H_{(\cdot j)}^{k} - 
V_{(\cdot j)}^{k}|| \quad \forall j,k \\
\textrm{under the constraint of non-negativity:} \quad H_{(ij)}^{k} >= 0 
\quad \forall i,j,k \\
\textrm{and the additional contraint:} \quad \sum_{k=1}^{s} H^{k} = H \\
\textrm{where H is defined by the optimization:} \quad \min_{H_{(\cdot j)} 
\in \mathbb{R}^l}||W \cdot H_{(\cdot j)} - V_{(\cdot j)}|| \quad \forall j \\
\textrm{also under the constraint of non-negativity:} \quad H_{(ij)} >= 0 
\quad \forall i,j \quad \textrm{and} \quad V = \sum_{k=1}^{s} V^{k}
\end{aligned}
$$

As defined in 
[1. Usage of YAPSA](http://bioconductor.org/packages/3.11/bioc/vignettes/YAPSA/inst/doc/YAPSA.html), 
$j$ is the index over samples, $m$ is the number of samples, $i$ 
is the index over signatures and $l$ is the number of signatures. In addition, 
$k$ is the index over strata and $s$ is the number of strata. Note that the 
last two lines of equation (@SMC_formula) correspond to the unstratified 
optimization problem. The very last part of equation (@SMC_formula) reflects the 
additivity of the stratified mutational catalogs $V^{k}$ which is due to the 
fact that by definition the sets of SNVs they were constructed from (i.e. the 
strata) are exclusive.

The SMC-procedure can also be applied when an NMF analysis has been performed 
and the exposures $\widetilde{H}$ of this NMF analysis should be used as input 
and constraint for the SMC. It then solves the task:

(@SMC_NMF_input_formula) $$
\begin{aligned}
\min_{H_{(\cdot j)}^k \in \mathbb{R}^l}||W \cdot H_{(\cdot j)}^{k} - 
V_{(\cdot j)}^{k}|| \quad \forall j,k \\
\textrm{under the constraint of non-negativity:} \quad H_{(ij)}^{k} >= 0 
\quad \forall i,j,k \\
\textrm{and the additional contraint:} \quad 
\sum_{k=1}^{s} H^{k} = \widetilde{H} \\
\end{aligned}$$


Applying `SMC` that way, the initial LCD decomposition of the unstratified 
mutational catalog is omitted and it's result replaced by the exposures 
extracted by the NMF analysis.



# Recap: compute signature exposures (without stratification)

In the following section, we briefly recapitulate the analysis of SNV 
mutational signatures on an example data set as performed in 
[1. Usage of YAPSA](http://bioconductor.org/packages/3.11/bioc/vignettes/YAPSA/inst/doc/YAPSA.html).
We thus first load the example data stored in the package:

```{r, loadStoredSigData}
data(sigs)
data(cutoffs)
data("lymphomaNature2013_mutCat_df")
current_cutoff_vector <- cutoffCosmicValid_abs_df[6,]
```

We then perform a supervised analysis of SNV mutational signatures using 
signature-specific cutoffs:

```{r LCDwithCutoffs}
lymphoma_COSMIC_listsList <-
  LCD_complex_cutoff_combined(
      in_mutation_catalogue_df = lymphomaNature2013_mutCat_df,
      in_cutoff_vector = current_cutoff_vector, 
      in_signatures_df = AlexCosmicValid_sig_df, 
      in_sig_ind_df = AlexCosmicValid_sigInd_df)
```

We assign subgroups to the different samples:

```{r subrgroupAnnotation}
data(lymphoma_PID)
colnames(lymphoma_PID_df) <- "SUBGROUP"
lymphoma_PID_df$PID <- rownames(lymphoma_PID_df)
COSMIC_subgroups_df <- 
  make_subgroups_df(lymphoma_PID_df,
                    lymphoma_COSMIC_listsList$cohort$exposures)
```

And finally plot the obtained result:

```{r captionExposures, echo=FALSE}
cap <- "Exposures to SNV mutational signatures"
```

```{r exposuresCutoffs, warning=FALSE, fig.width=8, fig.height=6, fig.cap= cap}
exposures_barplot(
  in_exposures_df = lymphoma_COSMIC_listsList$cohort$exposures,
  in_signatures_ind_df = lymphoma_COSMIC_listsList$cohort$out_sig_ind_df,
  in_subgroups_df = COSMIC_subgroups_df)               
```


# Stratified analysis of mutational signatures

## Preparation

The functions for the stratified analysis access the raw data in order to 
provide the highest degree of flexibility for the definition of the strata 
themselves. Therefore we have to repeat the formatting steps already performed 
in the [first vignette](http://bioconductor.org/packages/3.11/bioc/vignettes/YAPSA/inst/doc/YAPSA.html). 

```{r formatRaw, warning=FALSE, fig.width=8, fig.height=6, fig.cap= cap}
data("lymphoma_Nature2013_raw")
names(lymphoma_PID_df) <- gsub("SUBGROUP", "subgroup", names(lymphoma_PID_df))
names(lymphoma_Nature2013_raw_df) <- c("PID","TYPE","CHROM","START",
                                       "STOP","REF","ALT","FLAG")
lymphoma_Nature2013_df <- subset(lymphoma_Nature2013_raw_df,TYPE=="subs",
                                 select=c(CHROM,START,REF,ALT,PID))
names(lymphoma_Nature2013_df)[2] <- "POS"
lymphoma_Nature2013_df$SUBGROUP <- "unknown"
DLBCL_ind <- grep("^DLBCL.*",lymphoma_Nature2013_df$PID)
lymphoma_Nature2013_df$SUBGROUP[DLBCL_ind] <- "DLBCL_other"
MMML_ind <- grep("^41[0-9]+$",lymphoma_Nature2013_df$PID)
lymphoma_Nature2013_df <- lymphoma_Nature2013_df[MMML_ind,]
for(my_PID in rownames(lymphoma_PID_df)) {
  PID_ind <- which(as.character(lymphoma_Nature2013_df$PID)==my_PID)
  lymphoma_Nature2013_df$SUBGROUP[PID_ind] <-
    lymphoma_PID_df$subgroup[which(rownames(lymphoma_PID_df)==my_PID)]
}
lymphoma_Nature2013_df$SUBGROUP <- factor(lymphoma_Nature2013_df$SUBGROUP)
lymphoma_Nature2013_df <- translate_to_hg19(lymphoma_Nature2013_df,"CHROM")
lymphoma_Nature2013_df$change <- 
  attribute_nucleotide_exchanges(lymphoma_Nature2013_df)
lymphoma_Nature2013_df <- 
  lymphoma_Nature2013_df[order(lymphoma_Nature2013_df$PID,
                               lymphoma_Nature2013_df$CHROM,
                               lymphoma_Nature2013_df$POS),]
lymphoma_Nature2013_df <- annotate_intermut_dist_cohort(lymphoma_Nature2013_df,
                                                        in_PID.field="PID")
data("exchange_colour_vector")
lymphoma_Nature2013_df$col <- 
  exchange_colour_vector[lymphoma_Nature2013_df$change]
```


## Performing a stratification based on mutation density

We will now use the intermutational distances computed above. We set cutoffs 
for the intermutational distance at 1000 and 100000 bp, leading to three 
strata. We annotate to every variant in which stratum it falls.

```{r stratifyMutDensity}
lymphoma_Nature2013_df$density_cat <- cut(lymphoma_Nature2013_df$dist,
                                          c(0,1001,100001,Inf),
                                          right=FALSE,
                                          labels=c("high","intermediate",
                                                   "background"))
```

The following table shows the distribution of variants over strata:

```{r tableStrataMutDensity}
temp_df <- data.frame(table(lymphoma_Nature2013_df$density_cat))
names(temp_df) <- c("Stratum","Cohort-wide counts")
kable(temp_df, caption=paste0("Strata for the SMC of mutation density"))
```

We now have everything at hand to carry out a stratified signature analysis:

```{r captionSMC, echo=FALSE}
cap="SMC (Stratification of the Mutational Catalogue)
        based on mutation density."
```

```{r SMC, results="hide",warning=FALSE,fig.width=8,fig.height=7,fig.cap=cap}
strata_order_ind <- c(1,3,2)
mut_density_list <- run_SMC(lymphoma_Nature2013_df,
                            lymphoma_COSMIC_listsList$cohort$signatures,
                            lymphoma_COSMIC_listsList$cohort$out_sig_ind_df,
                            COSMIC_subgroups_df,
                            column_name="density_cat",
                            refGenome=BSgenome.Hsapiens.UCSC.hg19,
                            cohort_method_flag="norm_PIDs",
                            in_strata_order_ind=strata_order_ind)
```

This produces a multi-panel figure with 
`r length(unique(lymphoma_Nature2013_df$density_cat))+1` rows of plots. The 
first row visualizes the signature distribution over the whole cohort without 
stratification, followed by one row of plots per stratum. Hence in our example 
we have four rows of graphs with three (exclusive) strata as input. Each row 
consists of three plots. The left plots show absolute exposures in the 
respective stratum as stacked barplots on a per sample basis. The middle plots 
show relative exposures in the respective stratum on a per sample basis as 
stacked barplots. The right plots shows cohort-wide averages of the relative 
exposures in the respective stratum. The error bars indicate the standard error
of the mean (SEM).

These results can also be displayed as a dodged barplot containing the 
information of the third column of plots in the above figure, which needs 
the execution of an additional code block for its generation:

```{r captionSMCdodged, echo=FALSE}
cap = "SMC results displayed as dodged plot."
```

```{r SMCdodged, results="hide",warning=FALSE,fig.width=8,fig.height=7,fig.cap=cap}
dodged_df <- do.call(rbind, mut_density_list$cohort)
names(dodged_df) <- gsub("variable","stratum", names(dodged_df))
names(dodged_df) <- gsub("sig","signature", names(dodged_df))
dodged_df$stratum <- gsub("_rel", "", as.character(dodged_df$stratum))
dodged_df <- dodged_df[which(dodged_df$stratum != "all"),]
dodged_df$stratum <-
  factor(as.character(dodged_df$stratum),
         levels = sort(unique(dodged_df$stratum))[rev(strata_order_ind)])
dodged_plot <- ggplot() +
  geom_bar(data = dodged_df,
           aes_string(x = "signature", y = "exposure",
                      group = "stratum", fill = "stratum"),
           stat = "identity", position = "dodge", size = 1.5) +
  geom_errorbar(data = dodged_df,
                aes_string(x = "signature", ymin = "exposure_min",
                           ymax = "exposure_max",
                           group = "stratum"),
                position = position_dodge(width = 0.9), width = 0.3) +
  labs(y = "relative exposures")
if(exists("current_strata_colVector")){
  dodged_plot <- dodged_plot +
    scale_fill_manual(values = current_strata_colVector[-1],
                      labels = current_labelVector[-1]
                                [current_strata_order_ind])
}
print(dodged_plot)
```


## Testing for statistical significance of enrichment and depletion patterns

To test for statistical significance of potential differences in the signature 
exposures (i.e. signature enrichment and depletion patterns) between the 
different strata, we can use the Kruskal-Wallis test, as the data is grouped 
into (potentially more than two) categories and might not follow a normal 
distribution. As we are testing the different signatures on the same 
stratification, we have to correct for multiple testing. In order to control 
the false discovery rate (FDR), the Benjamini-Hochberg correction is 
appropriate.

```{r statSMCmutDensity, warning=FALSE, message=FALSE}
stat_mut_density_list <- stat_test_SMC(mut_density_list,in_flag="norm")
kable(stat_mut_density_list$kruskal_df,
      caption=paste0("Results of Kruskal tests for cohort-wide exposures over",
                     " strata per signature without and with correction for ",
                     "multiple testing."))
```

In the following paragraph we perform post-hoc tests for those signatures where
the Kruskal-Wallis test, as evaluated above, has given a significant result.
\newpage

```{r postHocMutDensity}
significance_level <- 0.05
for(i in seq_len(dim(stat_mut_density_list$kruskal_df)[1])){
  if(stat_mut_density_list$kruskal_df$Kruskal_p_val_BH[i]<significance_level){
    print(paste0("Signature: ",rownames(stat_mut_density_list$kruskal_df)[i]))
    print(stat_mut_density_list$kruskal_posthoc_list[[i]])
  }
}
```


From this analysis, we can see that a distinct signature enrichment and 
depletion pattern emerges:

1. Stratum of high mutation density: Enrichment of signatures AC5 and depletion 
of signatures AC1 (significant) and AC17
2. Background: signature distribution very similar to the one of the complete 
mutational catalog (first row)
3. Stratum of intermediate mutation density: intermediate signature enrichment 
and depletion pattern between the strata of high mutation density and 
background.