---
title: "mitch"
author: "Antony Kaspi & Mark Ziemann"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{mitch Workflow}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

## Background

mitch is an R package for multi-contrast enrichment analysis. At
itâ€™s heart, it uses a rank-MANOVA based statistical approach to detect sets
of genes that exhibit enrichment in the multidimensional space as compared
to the background. The rank-MANOVA concept dates to work by Cox and Mann
(https://doi.org/10.1186/1471-2105-13-S16-S12). mitch is useful for
pathway analysis of profiling studies with one, two or more contrasts,
or in studies with multiple omics profiling, for example proteomic,
transcriptomic, epigenomic analysis of the same samples. mitch is perfectly
suited for pathway level differential analysis of scRNA-seq data.

The main strengths of mitch are that it can import datasets easily
from many upstream tools and has advanced plotting features to 
visualise these enrichments. mitch consists of five functions. A
typical mitch workflow would consist of: 

* 1: Import gene sets with `gmt_import()`

* 2: Import profiling data with `mitch_import()`

* 3: Calculate enrichments with `mitch_calc()`

* 4: And generate plots and reports with `mitch_plots()` and `mitch_report()`



```{r, install, eval = FALSE}
if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("mitch")
```

```{r, lib}
library("mitch")
```

## Importing gene sets
mitch has a function to import GMT files to R lists (adapted from Yu et al, 2012
in the [clusterProfiler](https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html) 
package). For example we can grab some gene sets from Reactome.org:

```{r,gsets}
download.file("https://reactome.org/download/current/ReactomePathways.gmt.zip",
    destfile="ReactomePathways.gmt.zip")
unzip("ReactomePathways.gmt.zip")
genesets<-gmt_import("ReactomePathways.gmt")
```

In this cut down example we will be using a sample of 200 Reactome gene sets:

```{r,genesetsExample}
data(genesetsExample)
head(genesetsExample,3)
```

## Importing profiling data
mitch accepts pre-ranked data supplied by the user, but also has a function
called `mitch_import` for importing tables generated by limma, edgeR, DESeq2,
ABSSeq, Sleuth, Seurat, Muscat and several other upstream tools. By default,
only the genes that are detected in all contrasts are included, but this 
behaviour can be modified for sparse data setting `joinType=full`. The below
example imports two edgeR tables called "rna" and "k9a" Where gene 
identifiers are present as row names. Note that if there is more than one
profile being imported, they need to be part of a list.

```{r,import11}
data(rna,k9a)
x<-list("rna"=rna,"k9a"=k9a)
y<-mitch_import(x,"edgeR")
head(y)
```

mitch can do unidimensional analysis if you provide it a single profile as a
dataframe (not in a list).

```{r,import4}
y<-mitch_import(rna,DEtype="edger")
head(y)
```

If the gene identifiers are not given in the rownames, then the column can be
specified with the `geneIDcol` parameter like this:

```{r,import5}
# first rearrange cols
rna_mod<-rna
rna_mod$MyGeneIDs<-rownames(rna_mod)
rownames(rna_mod)<-seq(nrow(rna_mod))
head(rna_mod)
# now import with geneIDcol
y<-mitch_import(rna_mod,DEtype="edgeR",geneIDcol="MyGeneIDs")
head(y)
```

By default, differential gene activity is scored using a supplied test
statistic or directional p-value (D):

D = sgn(logFC) * -log10(p-value)

If this is not desired, then users can perform their own custom scoring
procedure and import with `DEtype="prescored"`.

There are many cases where the gene IDs don't match the gene sets. To overcome
this, `mitch_import` also accepts a two-column table (gt here) that relates gene
identifiers in the profiling data to those in the gene sets. In this example
we can create some fake gene accession numbers to demonstrate this feature.

```{r,import6}
library("stringi")
# obtain vector of gene names
genenames<-rownames(rna)
# create fake accession numbers
accessions<-paste("Gene0",stri_rand_strings(nrow(rna)*2, 6, pattern = "[0-9]"),sep="")
accessions<-head(unique(accessions),nrow(rna))
# create a gene table file that relates gene names to accession numbers
gt<-data.frame(genenames,accessions)

# now swap gene names for accessions
rna2<-merge(rna,gt,by.x=0,by.y="genenames")
rownames(rna2)<-rna2$accessions
rna2<-rna2[,2:5]

k9a2<-merge(k9a,gt,by.x=0,by.y="genenames")
rownames(k9a2)<-k9a2$accessions
k9a2<-k9a2[,2:5]

# now have a peek at the input data before importing
head(rna2,3)
head(k9a2,3)
head(gt,3)
x<-list("rna2"=rna2,"k9a2"=k9a2)
y<-mitch_import(x,DEtype="edgeR",geneTable=gt)
head(y,3)
```

`?mitch_import` provides more instructions on using this feature.

## Calculating enrichment
The `mitch_calc` function performs multivariate enrichment analysis of the
supplied gene sets in the scored profiling data. At its simpest form 
`mitch_calc` function accepts the scored data as the first argument and the 
genesets as the second argument. Users can prioritise enrichments based on
small adjusted p-values, by the observed effect size (magnitude of "s", the
enrichment score) or the standard deviation of the s scores. Note that the
number of parallel cores is set here to cores=2 but the default is to use all
but one available cores.

```{r,calc1,results="hide"}
# prioritisation by significance
res<-mitch_calc(y,genesetsExample,priority="significance",cores=2)
```
```{r,calc2}
# peek at the results
head(res$enrichment_result)
```
```{r,calc3,results="hide"}
# prioritisation by effect size
res<-mitch_calc(y,genesetsExample,priority="effect",cores=2)
```
```{r,calc4}
head(res$enrichment_result)
```

By default, gene sets with fewer than 10 members present in the profiling data
are discarded. This threshold can be modified using the minsetsize option.
There is no upper limit of gene set size.

```{r,calc5,results="hide"}
res<-mitch_calc(y,genesetsExample,priority="significance",minsetsize=5,cores=2)
```

By default, in downstream visualisation steps, charts are made from the top 50 gene sets, but this can be 
modified using the resrows option.

```{r,calc6,results="hide"}
res<-mitch_calc(y,genesetsExample,priority="significance",resrows=3,cores=2)
```

## Generate a HTML report
The HTML reports contain several plots as raster images and interactive charts
which are useful as a first-pass visualisation. These can be generated like
this:

```{r,report,results="hide"}
mitch_report(res,"myreport.html")
```

## Generate high resolution plots
In case you want the charts in PDF format, for example for publications, these
can be generated as such:

```{r,plots,results="hide"}
mitch_plots(res,outfile="mycharts.pdf")
```

## Session Info
```{r,sessioninfo,message=FALSE}
sessionInfo()
```