---
title: "The transcriptogramer user's guide"
author:
- name: Diego Morais
  affiliation: 
  - Bioinformatics Multidisciplinary Environment, Federal University of Rio Grande do Norte, Natal, RN, Brazil
  email: vinx@ufrn.edu.br
- name: Rodrigo Dalmolin
  affiliation:
  - Bioinformatics Multidisciplinary Environment, Federal University of Rio Grande do Norte, Natal, RN, Brazil
  - Department of Biochemistry, Federal University of Rio Grande do Norte, Natal, RN, Brazil
  email: rodrigo.dalmolin@imd.ufrn.br
output: 
  BiocStyle::html_document
bibliography: cite.bib
date: '`r NULL`'
vignette: >
  %\VignetteIndexEntry{The transcriptogramer user's guide}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
**Version**: `r packageDescription("transcriptogramer")[["Version"]]`<br />

![](logo.png)

# Overview
The transcriptogramer package [@Morais2019] is designed for transcriptional analysis
based on transcriptograms,
a method to analyze transcriptomes that projects
expression values on a set of
ordered proteins, arranged such that the probability that gene products
participate
in the same metabolic pathway exponentially decreases with the increase of
the distance between two proteins of the ordering. Transcriptograms are,
hence, genome wide gene expression profiles that provide a global view for
the cellular metabolism, while indicating gene sets whose expression are
altered [@Silva2014; @Filho2011; @Almeida2016; @Ferrareze2017; @Xavier2017].

Methods are provided to analyze topological properties of a
Protein-Protein Interaction (PPI) network, to generate transcriptograms, to detect and to display
differentially expressed gene clusters, and to perform a Gene Ontology 
Enrichment Analysis on these clusters.

As a set of ordered proteins is required in order to run the methods,
datasets are available for four species (*Homo sapiens*, *Mus musculus*,
*Saccharomyces cerevisiae* and *Rattus norvegicus*). Each species has three 
datasets, originated from [STRINGdb](https://string-db.org/) release 10.5
protein network data, with combined scores greater than or equal to 700, 800
and 900 (see **Hs900**, **Hs800**, **Hs700**, **Mm900**, **Mm800**, **Mm700**,
**Sc900**, **Sc800**, **Sc700**, **Rn900**, **Rn800** and **Rn700** datasets).
Custom sets of ordered proteins can be generated from protein network data using
[The transcriptogramer](https://lief.if.ufrgs.br/pub/biosoftwares/transcriptogramer/)
on Windows, or [an implementation of the seriation algorithm](https://github.com/arthurvinx/seriation)
on Linux.

# Quick start
The first step is to create a Transcriptogram object by running the
`transcriptogramPreprocess()` method. This example uses a subset of the
*Homo sapiens* protein network data, from STRINGdb release 10.5, containing only
associations of proteins of combined score greater than or equal to 900
(see **Hs900** and **association** datasets).

```{r message = FALSE}
library(transcriptogramer)
t <- transcriptogramPreprocess(association = association, ordering = Hs900)
```

## Topological analysis
There are two methods to perform topological analysis,
`connectivityProperties()` calculates average graph properties as function
of node connectivity, and `orderingProperties()` calculates graph
properties projected on the ordered proteins. Some methods, such as
orderingProperties(), uses a window, region of n (radius * 2 + 1) proteins
centered at a protein, whose radius changes the output. The Transcriptogram
object has a **radius** slot that can be setted during, or after, its
preprocessing (see **Transcriptogram-class** documentation).

```{r message = FALSE}
## during the preprocessing

## creating the object and setting the radius as 0
t <- transcriptogramPreprocess(association = association, ordering = Hs900)

## creating the object and setting the radius as 80
t <- transcriptogramPreprocess(association = association, ordering = Hs900,
                               radius = 80)
```
```{r message = FALSE}
## after the preprocessing

## modifying the radius of an existing Transcriptogram object
radius(object = t) <- 50

## getting the radius of an existing Transcriptogram object
r <- radius(object = t)
```

As window related metrics are affected by the radius, the output of the orderingProperties() method changes depending on the content of
the radius slot. A window modularity value close to 1 indicates dense connections between the genes inside the window, as well as sparse connections between these genes and the other genes in the network. Note that the sum of the window modularity increased using the radius 80. 

```{r message = FALSE, comment = ""}
oPropertiesR50 <- orderingProperties(object = t, nCores = 1)

## slight change of radius
radius(object = t) <- 80

## this output is partially different comparing to oPropertiesR50
oPropertiesR80 <- orderingProperties(object = t, nCores = 1)

sum(oPropertiesR50$windowModularity)
sum(oPropertiesR80$windowModularity)
```

However, the connectivityProperties() method does not use a window, thus, its output is not
affected by the radius slot.

```{r message = FALSE}
cProperties <- connectivityProperties(object = t)
```

## Transcriptogram
A transcriptogram is generated in two steps and requires expression values,
from microarray or RNA-Seq assays (log2-counts-per-million), and a dictionary. This example uses the
datasets **GSE9988**,
which contains normalized expression values of 3 cases and 3 controls (GSM252443, GSM252444, GSM252445, GSM252465, GSM252466 and GSM252467 respectively),
and **GPL570**, a mapping
between ENSEMBL Peptide ID and Affymetrix Human Genome U133 Plus 2.0 Array
probe identifier.

The methods to generate a transcriptogram are `transcriptogramStep1()` and
`transcriptogramStep2()`. The transcriptogramStep1() assigns to each protein,
of each transcriptome sample, the average of the expression values of all the
identifiers related to it.

```{r message = FALSE}
t <- transcriptogramStep1(object = t, expression = GSE9988,
                          dictionary = GPL570, nCores = 1)
t2 <- t
```

To each position of the ordering, the transcriptogramStep2() method assigns a
value equal to the average of the expression values inside a window, which
considers periodic boundary conditions to deal with proteins near the ends of
the ordering, in order to reduce random noise.

```{r message = FALSE}
t <- transcriptogramStep2(object = t, nCores = 1)
```

The Transcriptogram object has slots to store the outputs of the
transcriptogramStep1() and transcriptogramStep2() methods, called
transcriptogramS1 and transcriptogramS2 respectively. As the output of some
methods are affected by the content of the transcriptogramS2 slot, it can be
recalculated using the content of the transcriptogramS1 slot.

```{r message = FALSE}
radius(object = t2) <- 50
t2 <- transcriptogramStep2(object = t2, nCores = 1)
```

## Gene Ontology enrichment analysis
As nearby genes of a transcriptogram have a high probability to interact with
each other, gene sets whose expression are altered can be identified using the
`r Biocpkg("limma")` package. The `differentiallyExpressed()` method uses the
limma package to identify differentially expressed genes (the approaches voom and trend are supported for RNA-Seq), for the contrast
"case-control", grouping as a cluster
a set of genes which positions are within a radius range specified by the
content of the radius slot.

For this example, the p-value threshold for false
discovery rate will be set as 0.01. If a species name is provided, the
`r Biocpkg("biomaRt")` package is used to translate the
ENSEMBL Peptide ID to Symbol (Gene Name), alternatively, a data.frame with such
content can be provided.
The levels argument classify the columns of
the transcriptogramS2 slot referring to samples, as there are 6 columns (see
dataset **GSE9988**), is created a logical vector that uses TRUE to label the
columns referring to controls samples, and FALSE to label the
columns referring to case samples.

```{r message = FALSE, fig.show = "hide"}
## trend = FALSE for microarray data or voom log2-counts-per-million
## the default value for trend is FALSE
levels <- c(rep(FALSE, 3), rep(TRUE, 3))
t <- differentiallyExpressed(object = t, levels = levels, pValue = 0.01,
                             trend = FALSE, title = "radius 80")

## the radius 50 will affect the output significantly
t2 <- differentiallyExpressed(object = t2, levels = levels, pValue = 0.01,
                             species = DEsymbols, title = "radius 50")
```
```{r eval = FALSE}
## using the species argument to translate ENSEMBL Peptide IDs to Symbols
## Internet connection is required for this command
t <- differentiallyExpressed(object = t, levels = levels, pValue = 0.01,
                             species = "Homo sapiens", title = "radius 80")

## translating ENSEMBL Peptide IDs to Symbols using the DEsymbols dataset
t <- differentiallyExpressed(object = t, levels = levels, pValue = 0.01,
                             species = DEsymbols, title = "radius 80")
```

This method also produces a plot referring to its output. Each cluster detected is represented by a color. The genes
that are above the horizontal black line are upregulated, and the genes that
are below are downregulated.

![](Rplot1-a.png)

![](Rplot1-b.png)

The differentially expressed genes identified by this method are stored in the
DE slot of the Transcriptogram object, its content can be obtained using the
DE method. By default, the p-values are adjusted by the Benjamini-Hochberg
procedure. Note that the differential expression on the object of radius 80 detected less clusters, but there are more windows centers significantly altered on it, thus, the clusters are more consistent. Therefore, the next methods will be performed only on the object of radius 80.

```{r message = FALSE, comment = ""}
DE <- DE(object = t)
DE2 <- DE(object = t2)
nrow(DE)
nrow(DE)/length(unique(DE$ClusterNumber))
nrow(DE2)
nrow(DE2)/length(unique(DE2$ClusterNumber))
```

The `clusterVisualization()` method uses the `r Biocpkg("RedeR")` package to
display graphs of the differentially expressed clusters and returns an
object of the RedPort Class, allowing interactions through methods
of the RedeR package. This method requires the Java Runtime Environment
(>= 6).


```{r eval = FALSE}
rdp <- clusterVisualization(object = t)
```

![](Rplot2.png)

The `clusterEnrichment()` method performs a Gene Ontology enrichment analysis
using the `r Biocpkg("topGO")` package. By default, the universe is composed by
all the proteins present in the ordering slot, the ontology is setted
to biological process, the algorithm is setted to classic, the statistic is
setted to fisher, and the p-values are adjusted by the Benjamini-Hochberg
procedure. For this example, the p-value threshold for false
discovery rate will be set as 0.005. This method uses the biomaRt package to
build a Protein2GO data.frame using the given species name, or data.frame.

```{r message = FALSE}
## using the HsBPTerms dataset to create the Protein2GO data.frame
t <- clusterEnrichment(object = t, species = HsBPTerms,
                           pValue = 0.005, nCores = 1)
```
```{r eval = FALSE}
## using the species argument to create the Protein2GO data.frame
## Internet connection is required for this command
t <- clusterEnrichment(object = t, species = "Homo sapiens",
                           pValue = 0.005, nCores = 1)
```
```{r eval = FALSE}
head(Terms(t))
```
```{r echo = FALSE, comment = ""}
load("terms.RData")
head(terms)
```
```{r eval = FALSE}
enrichmentPlot(t)
```

![](Rplot3.png)

# Frequently asked questions
> How does transcriptogramer deal in cases where Protein-Protein Interactions (PPIs) for expressed proteins are missing?

The `transcriptogramStep1()` method removes rows from the expression data which are not present in the set of ordered proteins. As the association and ordering arguments of the `transcriptogramPreprocess()` method must contain the same proteins, if a row of the expression data cannot be mapped to one of these proteins, it will be discarded.

***

> How does transcriptogramer deal in cases where there’s no expression data for some PPIs in the network?

When there is no expression data related to a given protein of the set of ordered proteins, this protein expression value is treated as a missing value. In this case, the sliding window will discard missing values, sum the remaining ones, and divide this value by `radius * 2 + 1` while running the `transcriptogramStep2()` method.

***

> Does the transcriptogramer accept/convert protein/gene IDs other than the ones used in the vignette?

This is not possible in the current transcriptogramer version. The transcriptogramer is only able to automatically convert ENSEMBL Peptide IDs to Symbols. As this can be done by the biomaRt package, arguments will be added in the future to allow the automatic conversion of other IDs, such as ENTREZ to ENSEMBL Gene ID. This change will be reported here and in the package NEWS.

# Session info

```{r}
sessionInfo()
```

```{r}
warnings()
```
# References