---
title: "Analyze and visualize RNA-Seq data with pathlinkR"
author:
- Andy An
- Travis Blimkie
output:
    BiocStyle::html_document:
        toc: true
        toc_depth: 3
vignette: >
  %\VignetteIndexEntry{Analyze and visualize RNA-Seq data with pathlinkR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

<head>
<link rel="shortcut icon" href="../man/figures/pathlinkR_logo_32.svg">
<link rel="shortcut icon" href="../man/figures/pathlinkR_logo_16.svg">
</head>

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo=TRUE, comment="", dev="png")
```

This vignette contains some minimal examples for the main **pathlinkR** 
functions; for more complete documentation, please see our 
[Github pages](https://hancockinformatics.github.io/pathlinkR/).


# Introduction
Often times, gene expression studies such as microarrays and RNA-Seq result in
hundreds to thousands of differentially expressed genes (DEGs). It becomes very
difficult to understand the biological significance of such massive data sets,
especially when there are multiple conditions and comparisons being analyzed.
This package facilitates visualization and downstream analyses of differential
gene expression results, using pathway enrichment and protein-protein
interaction networks, to aid researchers in uncovering underlying biology and
pathophysiology from their gene expression studies.

We have included an example data set of gene expression results in this package
as the object `exampleDESeqResults`. This is a list of 2 data frames, generated
using the `results()` functions from the package 
`r BiocStyle::Biocpkg("DESeq2")` (Love et al. 2014). The data is from an RNA-Seq
study investigating COVID-19 and non-COVID-19 sepsis patients at admission (T1)
compared to  approximately1 week later (T2) in the ICU, indexed over time (i.e.,
T2 vs T1) (An et al. 2023).


# Installation
To install and load the package:
```{r load_library, message=FALSE}
# We'll also be using some functions from dplyr
# BiocManager::install("pathlinkR", version="devel")
library(dplyr)
library(pathlinkR)
```


# Visualizing RNA-Seq data with volcano plots
One of the first visualizations commonly performed with gene expression studies
is to identify the number of DEGs. These are typically defined using specific
cutoffs for both fold change and statistical significance. Thresholds of
adjusted p-value <0.05 and absolute fold change >1.5 are used as the default,
though any value can be specified. **pathlinkR** includes the function
`eruption()` to create a volcano plot.
```{r datasets_and_preliminary_volcano_plot, message=FALSE}
## A quick look at the DESeq2 results table
data("exampleDESeqResults")
knitr::kable(head(exampleDESeqResults[[1]]))

## Generate a volcano plot from the first data frame, with default thresholds
eruption(
    rnaseqResult=exampleDESeqResults[[1]],
    title=names(exampleDESeqResults[1])
)
```

There are multiple options available for customizing this volcano plot,
including:

- Adjust the cutoffs
- Switch the colours of significant and non-significant genes
- Change the number of labelled genes
- Adjust the x- and y-axis limits
- Plot either log2 or non-log2 fold change (for a better sense of magnitude)
- Highlight specific gene sets of interest


# Visualizing fold changes across comparisons
In addition to creating volcano plots, we can also visualize our DEGs using
heatmaps of genes involved in a specific pathways (e.g. one identified as
significant by `pathwayEnrichment()`). The function `plotFoldChange()`
accomplishes this by taking in an input list of `DESeq2::results()` data frames,
just like `pathwayEnrichment()`, and creating a heatmap of fold changes for the
constituent genes.
```{r plot_fold_changes_I, fig.height=8}
plotFoldChange(
    inputList=exampleDESeqResults,
    pathName="Interferon alpha/beta signaling"
)
```

A number of options are provided for customization, including:

- Shorten the names of each comparison
- Provide a custom list of genes to visualize, and add an informative title
- Use log2 fold changes in the heatmap cells
- Swap the rows and columns from the default, and add a split between the rows
    (comparisons)


# Building and visualizing PPI networks
**pathlinkR** includes tools for constructing and visualizing Protein-Protein
Interaction (PPI) networks. Here we leverage PPI data gathered from
[InnateDB](https://www.innatedb.com/) to generate a list of interactions among
DE genes identified in gene expression analyses. These interactions can then be
used to build PPI networks within R, with multiple options for controlling the
type of network, such as support for first, minimum, or zero order networks. The
two main functions used to accomplish this are `ppiBuildNetwork()` and
`ppiPlotNetwork()`.

Let's continue looking at the DEGs from the COVID positive patients over time,
using the significant DEGs to build a PPI network. Since the data frame we're
inputting includes all measured genes (not just the significant ones), we'll use
the `filterInput=TRUE` option to ensure the network is made only with those
genes which pass the standard thresholds (defined above). Since we're
visualizing a network of DEGs, let's colour the nodes to indicate the direction
of their dysregulation (i.e. up- or down-regulated) by specifying
`fillType="foldChange"`.
```{r ppi_networks, fig.width=10, fig.height=8, warning=FALSE}
exNetwork <- ppiBuildNetwork(
    rnaseqResult=exampleDESeqResults[[1]],
    filterInput=TRUE,
    order="zero"
)

ppiPlotNetwork(
    network=exNetwork,
    title=names(exampleDESeqResults)[1],
    fillColumn=LogFoldChange,
    fillType="foldChange",
    label=TRUE,
    labelColumn=hgncSymbol,
    legend=TRUE
)
```

The nodes with blue labels (e.g. STAT1, FBXO6, CDH1, etc.) are hubs within the
network; i.e. those genes which have a high betweenness score. The statistic
used to determine hub nodes can be set in `ppiBuildNetwork()` with the
"hubMeasure" option.

## Enriching networks and extracting subnetworks
**pathlinkR** includes two functions for further analyzing PPI networks. First, 
`ppiEnrichNetwork()` will use the node table from a network to test for enriched
Reactome pathways or Hallmark gene sets (see the next section for more detail on
the pathway enrichment methods):
```{r ppiEnrichNetwork, message=FALSE}
exNetworkPathways <- ppiEnrichNetwork(
    network=exNetwork,
    analysis="hallmark",
    filterResults="default",
    geneUniverse = rownames(exampleDESeqResults[[1]])
)
```

Second, the function `ppiExtractSubnetwork()` can extract a minimally-connected
subnetwork from a starting network, using the genes from an enriched pathway as
the "starting" nodes for extraction. For example, below we use the results from
the Hallmark enrichment above to pull out a subnetwork of genes from the
"Interferon Gamma Response" term, then plot this reduced network while
highlighting the genes from the pathway:
```{r ppiExtractSubnetwork, warning=FALSE, message=FALSE}
exSubnetwork <- ppiExtractSubnetwork(
    network=exNetwork,
    pathwayEnrichmentResult=exNetworkPathways,
    pathwayToExtract="INTERFERON GAMMA RESPONSE"
)

ppiPlotNetwork(
    network=exSubnetwork,
    fillType="oneSided",
    fillColumn=degree,
    label=TRUE,
    labelColumn=hgncSymbol,
    legendTitle="Degree"
)
```

Alternatively you can use the "genesToExtract" argument in
`ppiExtractSubnetwork()` to supply your own set of genes (a character vector of
Ensembl IDs) to extract as a subnetwork.


# Performing pathway enrichment
The essence of pathway enrichment is the concept of over-representation: that
is, are there more genes belonging to a specific pathway present in our DEG list
than we would expect to find by chance? To calculate this, the simplest method
is to compare the ratio of DEGs in some pathway to all DEGs, and all genes in
tat pathway to all genes in all pathways in a database. **pathlinkR** mainly
uses the [Reactome](https://reactome.org/) database (Fabregat et al. 2017) for
this purpose.

One issue that can occur with over-representation analysis is the assumption
that each gene in each pathway has "equal" value in belonging to that pathway.
In reality, a single protein can have multiple (and sometimes very different)
functions, and belong to multiple pathways, like protein kinases. There are also
pathways that have substantial overlap with cellular machinery, like the TLR
pathways. This can lead to enrichment of multiple similar pathways or even
unrelated "false-positives" that make parsing through the results very
difficult.

One solution is to use unique gene pairs, as described by the creators of the
package `r BiocStyle::CRANpkg("sigora")` (Foroushani et al.
2013). This methodology decreases the number of similar and unrelated pathways
from promiscuous genes, focusing more on the pathways that are likely related to
the underlying biology. This approach is the default used in the **pathlinkR**
function  `pathwayEnrichment()`.

The `pathwayEnrichment()` function takes as input a **list of data frames**
(each from `DESeq2::results()`), and by default will split the genes into up-
and down-regulated before performing pathway enrichment one each set. The name
of the data frames in the list should indicate the comparison that was made in
the DESeq2 results, as it will be used to identify the results. For
`analysis="sigora"` we also need to provide a Gene Pair Signature Repository
(`gpsRepo`) which contains the pathways and gene pairs to be tested. Leaving
this argument to "default" will use the `reaH` GPS repository from
**sigora**, containing human Reactome pathways. Alternatively one can supply
their own GPS repository; see `??sigora::makeGPS()` for details on how to make
one.
```{r sigora_enrichment}
## Note the structure of `exampleDESeqResults`: a named list of results from
## DESeq2
exampleDESeqResults

enrichedResultsSigora <- pathwayEnrichment(
    inputList=exampleDESeqResults,
    analysis="sigora",
    filterInput=TRUE,
    gpsRepo="default"
)

head(enrichedResultsSigora)
```

For those who still prefer traditional over-representation analysis, we include
the option of doing so by setting `analysis="reactome"`, which uses 
`r BiocStyle::Biocpkg("ReactomePA")` (Yu et al. 2016). When using this method,
we recommend providing a gene universe to serve as a background for the
enrichment test; here we'll use all the genes which were tested for significance
by DESeq2 (i.e. all genes from the count matrix), converting them to Entrez gene
IDs before running the test. See the full vignette at our 
[Github pages](https://hancockinformatics.github.io/pathlinkR/) for details.

In addition to the Reactome database used when setting `analysis` to "sigora" or
"reactome", we also provide over-representation analysis using the [Hallmark
gene sets](https://www.gsea-msigdb.org/gsea/msigdb/human/collections.jsp) from
the Molecular Signatures Database (MSigDb). These are 50 gene sets that
represent "specific, well-defined biological states or processes with coherent
expression" (Liberzon et al. 2015). This database provides a more high-level
summary of key biological processes compared to the more granular Reactome
pathways.
```{r hallmark_enrichment}
enrichedResultsHm <- pathwayEnrichment(
    inputList=exampleDESeqResults,
    analysis="hallmark",
    filterInput=TRUE,
    split=TRUE
)

head(enrichedResultsHm)
```

Finally, users may also use data from [KEGG](https://www.kegg.jp/) when
performing enrichment analyses. This database is supported with traditional
over-representation analysis by setting `analysis="kegg"`, or with a gene
pair-based approach by specifying `analysis="sigora"` and `gpsRepo="kegH"`.


# Plotting pathway enrichment results
Now that we have (a lot of) pathway enrichment results from multiple
comparisons, its time to visualize them. The function `plotPathways()` does this
by grouping Reactome pathways (or Hallmark gene sets) under parent groups, and
indicates if each pathway is up- or down-regulated in each comparison, making it
easy to identify which pathways are shared or unique to different DEG lists.
Because there are often many pathways, you can split the plot into multiple
columns (up to 3), and truncate the pathway names to make the results fit more
easily.

Sometimes a pathway may be enriched in both up- and down-regulated genes from
the same DEG list (these usually occur with larger pathways). Such occurrences
are indicated by a white asterisk where the more significant (lower adjusted
p-value) direction is displayed. You can also change the angle/labels of the
comparisons, or add the number of DEGs in each comparison below the labels.
Lastly, you can specify which pathways or top pathway groups to include for
visualization.
```{r plot_pathways_I, fig.width=10, fig.height=8}
pathwayPlots(
    pathwayEnrichmentResults=enrichedResultsSigora, 
    columns=2
)
```

A variety of tweaks can be applied to these plot as well:

- Only show immune-related pathways based on the top pathway grouping
- Change the colour scaling used for the adjusted p-values
- Condense the comparison names to better fit onto the plot, and make them
    display horizontally
- Add the number of DEGs used for enrichment below the comparison name
- Increase the truncation cutoff value of pathway names for more words

From these results, you can see that while many of the immune system pathways
change in the same direction over time in COVID-19 and non-COVID-19 sepsis
patients, a few unique ones stand out, mostly related to interferon signaling
("Interferon Signaling", "Interferon gamma signaling", "Interferon alpha/beta
signaling", "ISG15 antiviral mechanism"). This likely reflects an elevated early
antiviral response in COVID-19 patients that decreased over time, compared to no
change in non-COVID-19 sepsis patients.


# Generating networks from enriched pathways
**pathlinkR** includes functions for turning the pathway enrichment results
from either Reactome-based method ("sigora" or "reactomepa") into networks,
using the overlap of the genes assigned to each pathway to determine their
similarity to one other. In these networks, each pathway is a node, with
connections or edges between them determined via a distance measure. A threshold
can be set, where two pathways with a minimum similarity measure are considered
connected, and would have an edge drawn between their nodes.

We provide a pre-computed distance matrix of Reactome pathways, generated using
Jaccard distance, but there is support for multiple distance measures to be
used. Once this "foundation" of pathway interactions is created, a pathway
network can be built using the `createPathnet()` function:
```{r pathway_network_I, fig.width=12, fig.height=10, warning=FALSE, message=FALSE}
data("sigoraDatabase")

pathwayDistancesJaccard <- getPathwayDistances(pathwayData = sigoraDatabase)

startingPathways <- pathnetFoundation(
    mat=pathwayDistancesJaccard,
    maxDistance=0.8
)

# Get the enriched pathways from the "COVID Pos Over Time" comparison
exPathwayNetworkInput <- enrichedResultsSigora %>% 
    filter(comparison == "COVID Pos Over Time")

myPathwayNetwork <- pathnetCreate(
    pathwayEnrichmentResult=exPathwayNetworkInput,
    foundation=startingPathways
)
```

There are two options for visualization, the first being a static network:
```{r plot_pathnet_I, warning=FALSE, fig.height=7}
pathnetGGraph(
    myPathwayNetwork,
    labelProp=0.1,
    nodeLabelSize=3,
    nodeLabelOverlaps=8,
    segColour="red",
    themeBaseSize = 12
)
```

Nodes (pathways) which are filled in are enriched pathways (i.e. those output by
`pathwayEnrichment()`). Size of nodes is correlated with statistical
significance, while edge thickness relates to the similarity of two connected
pathways.

Though this type of visualization is useful, we can also display this network
using an alternate method that creates an interactive display, with the function
`pathnetVisNetwork()`; see our 
[Github pages](https://hancockinformatics.github.io/pathlinkR/) for these
details.


# Supplemental materials

## Gene-pair signatures
**sigora** uses a Gene-Pair Signature (GPS) Repository that stores information
on which gene pairs are unique for which pathways. We recommend using the one
provided by sigora, which can be loaded via `data("reaH", package = "sigora")`
(Reactome Human). You can also generate your own GPS repo using sigora's own
function and a custom set of pathways (e.g. from another pathway database like
GO or KEGG). Please consult the sigora documentation on how to generate your
custom GPS repository.

## Why are there different p-value cut-offs for sigora vs. ReactomePA/Hallmark?
Because there are now multiple gene pairs vs. single genes, the gene-pair
"universe" is greatly increased and it is more likely for a result to be
significant. Therefore, the cutoff threshold for significance is more stringent
(adjusted p-value < 0.001) and a more conservative adjustment method
(Bonferroni) is used. For regular over-representation analysis, a less
conservative adjustment method (Benjamini-Hochberg) is used with adjusted
p-value < 0.05. These are automatically set with `filterResults="default"`. You
can adjust these cut-offs by setting `filterResults` to different values between
0 and 1, or 1 if you want all the pathways (this may be useful for comparing
which enriched genes appear in which comparisons, even if the enrichment is not
significant).


# Citations
An AY, Baghela AS, Falsafi R, Lee AH, Trahtemberg U, Baker AJ, dos Santos CC,
Hancock REW. Severe COVID-19 and non-COVID-19 severe sepsis converge
transcriptionally after a week in the intensive care unit, indicating common
disease mechanisms. Front Immunol. 2023;6(14):1167917.

Fabregat A, Sidiropoulos K, Viteri G, Forner O, Marin-Garcia P, Arnau V,
Dâ€™Eustachio P, Stein L, Hermjakob H. Reactome pathway analysis: a
high-performance in-memory approach. BMC Bioinform. 2017;18:142.

Foroushani ABK, Brinkman FSL, Lynn DJ. Pathway-GPS and sigora: identifying
relevant pathways based on the over-representation of their gene-pair
signatures. PeerJ. 2013;1:e229.

Liberzon A, Birger C, ThorvaldsdÃ³ttir H, Ghandi M, Mesirov JP, Tamayo P. The
Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell Syst.
2015;1(6):417â€“25.

Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion
for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550.

Yu G, He QY. ReactomePA: an R/Bioconductor package for reactome pathway analysis
and visualization. Mol Biosyst. 2016;12(2):477-9.


# Session information
```{r session_information, echo=FALSE}
sessionInfo()
```