---
title: "RIVER"
author: "Yungil Kim"
output:
  BiocStyle::html_document
date: "`r doc_date()`"
package: "`r pkg_ver('RIVER')`"
abstract: >
  A probabilistic modeling framework that jointly analyzes personal genome 
  and transcriptome data to estimate the probability that a variant has 
  regulatory impact in that individual.
vignette: >
  %\VignetteIndexEntry{RIVER}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

<a id="top"></a>

# Introduction

`RIVER` is an `R` package of a probabilistic modeling framework, called 
*RIVER (RNA-Informed Variant Effect on Regulation)*, that jointly analyzes 
personal genome (WGS) and transcriptome data (RNA-seq) to estimate the 
probability that a variant has regulatory impact in that individual. 
It is based on a generative model that assumes that genomic annotations, 
such as the location of a variant with respect to regulatory elements, 
determine the prior probability that the variant is a functional regulatory 
variant, which is an unobserved variable. The functional regulatory variant 
status then influences whether nearby genes are likely to display outlier 
levels of gene expression in that person.

*RIVER* is a hierarchical Bayesian model that predicts the regulatory 
effects of rare variants by integrating gene expression with genomic 
annotations. *RIVER* consists of three layers: a set of nodes 
$G = G_{1}, ..., G_{P}$ in the topmost layer representing $P$ observed 
genomic annotations over all rare variants near a particular gene, 
a latent binary variable $FR$ in the middle layer representing the unobserved 
functional regulatory status of the rare variants, and one binary node $E$ 
in the final layer representing the expression outlier status of the nearby gene. 
We model each conditional probability distribution as follows:
$$ FR | G \sim Bernoulli(\psi), \psi = logit^{-1}(\beta^T G) $$
$$E | FR \sim Categorical(\theta_{FR}) $$
$$ \beta_i \sim N(0, \frac{1}{\lambda})$$
$$\theta_{FR} \sim Beta(C,C)$$
with parameters $\beta$ and $\theta$ and 
hyper-parameters $\lambda$ and $C$.
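
To make these conditional distributions concrete, the following minimal sketch 
(hypothetical values chosen for illustration, not code from the package) draws 
a single instance from the generative model, using `plogis` as the inverse logit:
```{r, eval=FALSE}
## Schematic draw from the generative model (hypothetical values, not RIVER internals)
set.seed(1)
beta  <- c(0.5, -1.2, 0.8)                    # hypothetical effect sizes for P = 3 annotations
G     <- c(1.0, 0.2, 0.7)                     # observed genomic features for one instance
theta <- matrix(c(.99, .01, .3, .7), nrow=2)  # theta[e + 1, fr + 1] = P(E = e | FR = fr)

psi <- plogis(sum(beta * G))                  # prior P(FR = 1 | G) = logit^{-1}(beta' G)
FR  <- rbinom(1, 1, psi)                      # latent functional regulatory status
E   <- rbinom(1, 1, theta[2, FR + 1])         # outlier status drawn given FR
c(psi = psi, FR = FR, E = E)
```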

Because $FR$ is unobserved, the log-likelihood objective 
of *RIVER* over instances $n = 1, ..., N$,
$$
  \sum_{n=1}^{N} \log \sum_{FR_n = 0}^{1} P(E_n, G_n, FR_n | \beta, \theta),
$$
is non-convex. We therefore optimize model parameters via 
Expectation-Maximization (EM) as follows:

In the E-step, we compute the posterior probabilities ($\omega_n^{(i)}$) 
of the latent variables $FR_n$ given current parameters and observed data. 
For example, at the $i$-th iteration, the posterior probability 
of $FR_n = 1$ for the $n$-th instance is
$$
  \omega_{1n}^{(i)} = P(FR_n = 1 | G_n, \beta^{(i)}, E_n, \theta^{(i)})
=\frac{P(FR_n = 1 | G_n, \beta^{(i)}) \cdotp P(E_n | FR_n = 1, \theta^{(i)})}{\sum_{FR_n = 0}^1 P(FR_n | G_n, \beta^{(i)}) \cdotp P(E_n | FR_n, \theta^{(i)})},
$$
$$\omega_{0n}^{(i)} = 1 - \omega_{1n}^{(i)}.$$

In the M-step, at the $i$-th iteration, given the current 
estimates $\omega^{(i)}$, the parameters $\beta^{(i+1)}$ 
are estimated as
$$
  \beta^{(i+1)} = \arg\max_{\beta} \sum_{n = 1}^N \sum_{FR_n = 0}^1 \omega_{FR_n, n}^{(i)} \cdotp \log P(FR_n | G_n, \beta) - \frac{\lambda}{2}\|\beta\|_2^2,
$$
where $\lambda$ is an L2 penalty hyper-parameter derived 
from the Gaussian prior on $\beta$.
The parameters $\theta$ are updated as
$$
  \theta_{s,t}^{(i+1)} \propto \sum_{n = 1}^{N} I(E_n = t) \cdotp \omega_{s,n}^{(i)} + C,
$$
where $I$ is the indicator function, $t$ is the binary value 
of expression $E_n$, $s$ indexes the binary values of $FR_n$, 
and $C$ is a pseudo count derived from the Beta prior on $\theta$. 
The E- and M-steps are applied iteratively until convergence.
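
For intuition, the E-step posterior for a single instance can be written out 
directly from the formula above. The sketch below again uses hypothetical values 
and is not the estimation code used inside the package:
```{r, eval=FALSE}
## Schematic E-step for one instance (hypothetical values, not RIVER internals)
beta  <- c(0.5, -1.2, 0.8)                    # hypothetical current beta^(i)
G     <- c(1.0, 0.2, 0.7)                     # genomic features of the instance
E     <- 1                                    # observed outlier status (0 or 1)
theta <- matrix(c(.99, .01, .3, .7), nrow=2)  # theta[e + 1, fr + 1] = P(E = e | FR = fr)

psi    <- plogis(sum(beta * G))               # P(FR = 1 | G, beta)
num1   <- psi       * theta[E + 1, 2]         # P(FR = 1 | G) * P(E | FR = 1)
num0   <- (1 - psi) * theta[E + 1, 1]         # P(FR = 0 | G) * P(E | FR = 0)
omega1 <- num1 / (num0 + num1)                # posterior P(FR = 1 | G, E)
omega1
```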

[Back to Top](#top)

# Quick Start

This section gives users a general sense of the package, `RIVER`, 
including its components, their behavior, and typical applications. 
We briefly go over the main functions, basic operations, and 
the corresponding outcomes. By the end of this section, users should have 
a better idea of which functions are available, which values 
to choose for the necessary parameters, and 
where to seek help. More detailed descriptions are given in later sections.

First, we load `RIVER`:
```{r ultaQuick}
library("RIVER")
```
`RIVER` consists of several functions built around two main entry points, 
`evaRIVER` and `appRIVER`, whose use we demonstrate here. 
We first load simulated data created beforehand for illustration.
    
```{r}
filename <- system.file("extdata", "simulation_RIVER.gz", 
                       package="RIVER")
dataInput <- getData(filename) # import experimental data
```
`getData` combines different resources, including genomic features, 
outlier status of gene expression, and N2 pairs sharing the same rare variants, 
into a standardized data structure of class `ExpressionSet`.
```{r}
print(dataInput)
Feat <- t(Biobase::exprs(dataInput)) # genomic features (G)
Out <- as.numeric(unlist(dataInput$Outlier))-1 # outlier status (E)
```
In the simulated data, the input object `dataInput` contains 
18 genomic features and expression outlier status for 6122 samples, 
together with information on which samples belong to N2 pairs.
```{r}
head(Feat)
head(Out)
```
`Feat` contains continuous values of the genomic features (defined as $G$ 
in the objective function), while `Out` contains binary values 
representing the outlier status of the same samples as `Feat` (defined as $E$ 
in the objective function).
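
As a quick sanity check on the objects created above, you can inspect 
their dimensions and the outlier class balance:
```{r}
dim(Feat)    # number of instances x number of genomic features
table(Out)   # counts of non-outlier (0) vs outlier (1) instances
```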

For evaluation, we hold out pairs of individuals at genes 
where only those two individuals share the same rare variants. 
We train *RIVER* on all remaining instances, 
compute the *RIVER* score (the posterior probability of having a functional 
regulatory variant given both WGS and RNA-seq data) for one individual of each pair, 
and assess accuracy with respect to the second individual's held-out 
expression levels. Since there are currently very few gold-standard sets of 
functional rare variants, this labeled test data allows us 
to evaluate the predictive accuracy of *RIVER* against a genomic annotation model, 
*GAM*, that uses genomic annotations alone. We observe 
a significant improvement from incorporating expression data. 
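
Before running the evaluation, you may want to check how many instances belong 
to held-out N2 pairs. The sketch below assumes the pair identifiers are exposed 
as a phenotype column named `N2pair`, mirroring the column of the raw input file 
described later in this vignette:
```{r, eval=FALSE}
## Count instances that belong to held-out N2 pairs
## (assumes a phenotype column named "N2pair", as in the raw input file)
n2 <- dataInput$N2pair
table(is.na(n2))                 # FALSE = member of a held-out N2 pair
length(unique(n2[!is.na(n2)]))   # number of distinct held-out pairs
```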

To run this evaluation, we simply use `evaRIVER`:
```{r}
evaROC <- evaRIVER(dataInput)
```
`evaROC` is an S4 object of class `evaRIVER` which contains 
the AUC values from *RIVER* and *GAM*, the specificity and sensitivity 
measures from the two models, and a p-value for the difference between the two AUCs. 
```{r}
cat('AUC (GAM-genomic annotation model) = ', round(evaROC$GAM_auc,3), '\n')
cat('AUC (RIVER) = ', round(evaROC$RIVER_auc,3), '\n')
cat('P-value ', format.pval(evaROC$pvalue, digits=2, eps=0.001), '***\n')
```

We can visualize the ROC curves with AUC values:
```{r}
par(mar=c(6.1, 6.1, 4.1, 4.1))
plot(NULL, xlim=c(0,1), ylim=c(0,1), 
     xlab="False positive rate", ylab="True positive rate", 
     cex.axis=1.3, cex.lab=1.6)
abline(0, 1, col="gray")
lines(1-evaROC$RIVER_spec, evaROC$RIVER_sens, 
      type="s", col='dodgerblue', lwd=2)
lines(1-evaROC$GAM_spec, evaROC$GAM_sens, 
      type="s", col='mediumpurple', lwd=2)
legend(0.7,0.2,c("RIVER","GAM"), lty=c(1,1), lwd=c(2,2),
       col=c("dodgerblue","mediumpurple"), cex=1.2, 
       pt.cex=1.2, bty="n")
title(main=paste("AUC: RIVER = ", round(evaROC$RIVER_auc,3), 
                 ", GAM = ", round(evaROC$GAM_auc,3), 
                 ", P = ", format.pval(evaROC$pvalue, digits=2, 
                                       eps=0.001),sep=""))
```
Each ROC curve, for either *RIVER* or *GAM*, is computed by 
comparing the posterior probability given the available data 
for the first individual with the outlier status of the second individual 
in the list of held-out pairs, and vice versa.

To extract posterior probabilities for prioritizing functional rare variants 
in any downstream analysis, such as finding pathogenic rare variants in disease, 
simply run `appRIVER`:
```{r}
postprobs <- appRIVER(dataInput)
```
`postprobs` is an S4 object of class `appRIVER` which contains 
subject IDs, gene names, $P(FR = 1 | G)$, $P(FR = 1 | G, E)$, and `fitRIVER`, 
which stores all the relevant information of the fitted *RIVER* model, 
including hyperparameters, for further use.

Probabilities of rare variants being functional from *RIVER* and *GAM* 
for a few samples are shown below:
```{r}
example_probs <- data.frame(Subject=postprobs$indiv_name, 
                           Gene=postprobs$gene_name, 
                           RIVERpp=postprobs$RIVER_posterior, 
                           GAMpp=postprobs$GAM_posterior)
head(example_probs)
```
From left to right, the columns show subject ID, gene name, the posterior 
probability from *RIVER*, and the posterior probability from *GAM*.
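
For example, to prioritize candidates you might sort this table by the 
*RIVER* posterior probability:
```{r}
## Rank instances by their RIVER posterior probability, highest first
head(example_probs[order(example_probs$RIVERpp, decreasing=TRUE), ])
```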

The following figure shows how much additional information on functional 
rare variants we gain by integrating the outlier status of gene expression 
into *RIVER*.
```{r}
plotPosteriors(postprobs, outliers=as.numeric(unlist(dataInput$Outlier))-1)
```
As shown in this figure, integrating genomic features with 
expression outliers indeed provides greater quantitative power 
for prioritizing functional rare variants. A few examples 
of pathogenic regulatory variants prioritized by posterior probabilities 
from *RIVER* are discussed in our paper (http://biorxiv.org/content/early/2016/09/09/074443).

[Back to Top](#top)

# Two Main Functions

## Evaluation of *RIVER*

The function `evaRIVER` assesses how much additional information we gain 
by integrating the outlier status of gene expression into the model. 
Prioritization of functional rare variants is difficult to evaluate 
because there is no gold standard set of functional rare variants. 
To address this limitation, we extract pairs of individuals that share 
the same rare variants near a gene and hold them out for evaluation. In other words, 
we train *RIVER* on all instances except the held-out pairs of individuals, 
calculate the posterior probability of a functional regulatory variant 
given genomic features and outlier status for the first individual, and 
compare it with the second individual's outlier status, and vice versa. 
You can follow the entire evaluation procedure for both 
*RIVER* and *GAM* by calling `evaRIVER` with `verbose=TRUE`:

```{r}
filename <- system.file("extdata", "simulation_RIVER.gz", 
                       package="RIVER")
dataInput <- getData(filename) # import experimental data
evaROC <- evaRIVER(dataInput, pseudoc=50, 
                   theta_init=matrix(c(.99, .01, .3, .7), nrow=2), 
                   costs=c(100, 10, 1, .1, .01, 1e-3, 1e-4), 
                   verbose=TRUE)
```
`evaRIVER` requires an `ExpressionSet` object containing genomic features, 
outlier status, and a list of N2 pairs as input, plus four optional parameters: 
a pseudo count, initial values of theta, a list of candidate $\lambda$ values, and verbose. 
The input object is obtained by running `getData` on the original gzipped file. 
If you would like to know which format to follow when generating 
the original compressed file, refer to the section **Generation of customized data for RIVER** below. 
Most of the optional parameters should be set according to your input data. 
`pseudoc` is a hyperparameter for estimating `theta`, the parameters 
connecting the unobserved `FR` node to the observed outlier node `E`. 
A lower `pseudoc` places more weight on the observed data. 
`theta_init` is the initial set of theta parameters. From left to right, 
its elements are $P(E = 0 | FR = 0)$, $P(E = 1 | FR = 0)$, $P(E = 0 | FR = 1)$, 
and $P(E = 1 | FR = 1)$, respectively. `costs` is the list of 
candidate $\lambda$ values for selecting the best L2 penalty hyperparameter 
for both *GAM* and *RIVER*. For more information, 
see the Appendix on optional parameters and the Appendix on parameter 
stability across different initializations.

To train *RIVER* on the training data (all instances except the N2 pairs), 
we first select the best $\lambda$ value based on 10-fold cross-validation on 
the training set via `glmnet`. The selected $\lambda$ is reported 
in the first line of output. The initial $\beta$ parameters of *RIVER* 
are set to the $\beta$ estimates from *GAM*. 
In each EM iteration, `evaRIVER` reports both the top 10% threshold 
of the expected $P(FR = 1 | G, E)$ and the norms of the difference between the previous and 
current parameter estimates. The EM algorithm iteratively 
refines the estimates of $\beta$ and $\theta$ until it converges 
within a predefined tolerance on the norm ($0.001$ for both $\beta$ and $\theta$). 
After the parameter estimates converge, `evaRIVER` reports the AUC values 
of both models and the p-value for the difference between them.

[Back to Top](#top)

## Application of *RIVER*

The function `appRIVER` trains *RIVER* (and *GAM*) on all instances and 
computes their posterior probabilities for downstream analyses (e.g. finding 
pathogenic rare variants in disease). Like `evaRIVER`, this function 
requires an `ExpressionSet` object as input along with the same optional parameters, 
which you can again set based on your data. If you chose particular values 
for the optional parameters of `evaRIVER`, you would typically use the same ones here.
```{r}
postprobs <- appRIVER(dataInput, pseudoc=50, 
                      theta_init=matrix(c(.99, .01, .3, .7), nrow=2), 
                      costs=c(100, 10, 1, .1, .01, 1e-3, 1e-4), 
                      verbose=TRUE)
```
As with `evaRIVER`, the output reports which $\lambda$ is selected, 
as well as the top 10% threshold of the expected $P(FR = 1 | G, E)$ and 
the norms of the parameter differences during each EM iteration.

If you would like to inspect the estimated parameters associated with the genomic features 
($\beta$) and the outliers ($\theta$), simply use `print` on the corresponding 
parameters of interest.
```{r}
print(postprobs$fitRIVER$beta)
```
```{r}
print(postprobs$fitRIVER$theta)
```
These parameters can be used for computing test posterior probabilities of 
new instances given their $G$ and $E$ for further analyses.
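
As a rough sketch of how such a computation could look, the posterior for a new 
instance can be assembled from the fitted parameters with the formula from the 
Introduction. The sketch assumes that `fitRIVER$beta` holds one coefficient per 
genomic feature (no intercept) and that `fitRIVER$theta` follows the same 2 x 2 
layout as `theta_init`, and it ignores any feature scaling the package may apply 
internally, so treat it as illustrative only:
```{r, eval=FALSE}
## Illustrative posterior P(FR = 1 | G, E) for a new instance
## (assumes beta has one coefficient per feature with no intercept and
##  theta[e + 1, fr + 1] = P(E = e | FR = fr), as in theta_init;
##  any internal feature scaling is ignored here)
beta  <- postprobs$fitRIVER$beta
theta <- postprobs$fitRIVER$theta

newG <- Feat[1, ]   # genomic features of a (new) instance
newE <- Out[1]      # its observed outlier status (0 or 1)

psi   <- plogis(sum(beta * newG))   # prior P(FR = 1 | G)
post1 <- (psi * theta[newE + 1, 2]) /
         (psi * theta[newE + 1, 2] + (1 - psi) * theta[newE + 1, 1])
post1                               # posterior P(FR = 1 | G, E)
```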

[Back to Top](#top)

# Generation of customized data for *RIVER*

The original compressed file contains all the necessary processed data, 
including genomic features derived from various genomic annotations, Z-scores 
of gene expression, and a list of N2 pairs identified from WGS.
```{r}
filename <- system.file("extdata", "simulation_RIVER.gz", 
                       package = "RIVER")
system(paste('zcat ', filename, " | head  -2", sep=""), 
       ignore.stderr=TRUE)
```
From left to right, the columns in each row are subject ID, 
gene name, the values of the genomic features of interest (18 features here), 
the Z-score of the corresponding gene expression, and either an integer value or 
*NA* indicating membership in an N2 pair sharing the same rare variants. 
If a subject has a unique set of rare variants near a particular gene 
compared to all other subjects, *NA* is assigned in the N2pair column. Otherwise, 
the two subjects sharing the same rare variants near that gene carry the same integer 
as a unique identifier of their N2 pair. 

If you would like to train RIVER with your own data, you need to generate 
your own compressed file in the same format as explained above. 
Then, simply pass the full path of your compressed data file 
to `getData`, which generates an `ExpressionSet` object 
(`YourInputToRIVER`) with all the information needed to run RIVER 
on your own data. 
```{r}
YourInputToRIVER <- getData(filename) # import experimental data
```
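
For reference, a toy input file in this layout could be mocked up as follows; 
the column names, header line, and tab separator below are assumptions made for 
illustration, so match them to the example file shown above before passing the 
result to `getData`:
```{r, eval=FALSE}
## Illustrative mock-up of two rows in the layout described above
## (column names, the header line, and the tab separator are assumptions;
##  check them against the example file printed earlier)
toy <- data.frame(
  SubjectID = c("indiv1", "indiv2"),
  GeneName  = c("geneA", "geneA"),
  Feature1  = c(0.2, 0.8),   # one column per genomic feature ...
  Feature2  = c(1.0, 0.1),
  Zscore    = c(2.5, -0.3),  # expression Z-score of the gene
  N2pair    = c(1, 1)        # shared integer for an N2 pair, or NA if unique
)
gz <- gzfile("my_input_RIVER.gz", "w")
write.table(toy, gz, quote=FALSE, row.names=FALSE, sep="\t")
close(gz)
```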
For our paper, genomic features were generated from various genomic annotations, 
including conservation scores such as GERP, chromatin states from chromHMM or Segway, 
and summary scores such as CADD and DANN. The instances were selected 
based on two criteria: (1) genes having at least one individual outlier 
in terms of gene expression Z-scores, and (2) individuals having 
at least one rare variant within specific regions near each gene. 
For each instance, the feature values within the regions of interest were aggregated 
into one summary statistic by applying a relevant mathematical operation such as max. 
For more details on the list of genomic annotations used for constructing features and 
how to generate the features and outlier status, please refer to our publication 
[pre-print](http://biorxiv.org/content/early/2016/09/09/074443).

[Back to Top](#top)

# Basics

## Installation of `RIVER`

`R` is an open-source statistical environment which can be easily modified 
to enhance its functionality via packages. `RIVER` is an `R` package available 
via the [Bioconductor](https://www.bioconductor.org/packages/release/BiocViews.html#___Software) 
repository for packages. `R` can be installed on any operating system from 
[CRAN](https://cran.r-project.org/) after which you can install `RIVER` 
by using the following commands in your `R` session:

```{r, eval=FALSE}
## try http:// if https:// URLs are not supported
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("RIVER")
```

[Back to Top](#top)

## Session Info

Here is the output of `session_info()` (from the `devtools` package) on the system 
on which this document was compiled:

```{r}
## Session info
library('devtools')
options(width=120)
session_info()
```

[Back to Top](#top)

## Asking for Help

As package developers, we try to explain clearly how to use our packages 
and in which order to use the functions. But `R` and `Bioconductor` have 
a steep learning curve, so it is critical to know where to ask for help. 
We would like to highlight 
the [Bioconductor support site](https://support.bioconductor.org/) 
as the main resource for getting help. Other alternatives are available, 
such as creating GitHub issues. However, please note that 
if you want to receive help you should adhere to the 
[posting guidelines](http://www.bioconductor.org/help/support/posting-guide/). 
It is particularly critical that you provide a small reproducible example 
and your session information so package developers can track down 
the source of the error.

[Back to Top](#top)

## References

<p> Xin Li$^{*}$, Yungil Kim$^{*}$, Emily K. Tsang$^{*}$, Joe R. Davis$^{*}$, Farhan N. Damani, Colby Chiang, Zachary Zappala, Benjamin J. Strober, Alexandra J. Scott, Andrea Ganna, Jason Merker, GTEx Consortium, Ira M. Hall, Alexis Battle$^{\#}$, Stephen B. Montgomery$^{\#}$ (2016). <br><a href="http://biorxiv.org/content/early/2016/09/09/074443">The impact of rare variation on gene expression across tissues</a><br><i>(bioRxiv preprint, submitted; *: equal contribution, #: corresponding authors)</i></p>

[Back to Top](#top)

# Appendices

## Optional parameters

Functions within `RIVER` have a set of optional parameters which control 
some aspects of the computation of *RIVER* scores. The *factory default* 
settings are expected to serve in many cases, but users might need 
to make changes based on the input data.

There are four parameters that users can change:

`pseudoc` - Pseudo count (hyperparameter) of the Beta prior on $\theta$; lower values place more weight on the observed data (see the example at the end of this section); *factory default = 50*

`theta_init` - Initial values of $\theta$; *factory default = (P(E = 0 | FR = 0), P(E = 1 | FR = 0), P(E = 0 | FR = 1), P(E = 1 | FR = 1)) = (0.99, 0.01, 0.3, 0.7)*

`costs` - List of candidate $\lambda$ values for finding the best $\lambda$ (hyperparameter). The best $\lambda$ value among the candidates is selected from L2-regularized logistic regression (*GAM*) via 10-fold cross-validation; *factory default = (100, 10, 1, .1, .01, 1e-3, 1e-4)*

`verbose` - If you set this parameter to `TRUE`, you can observe 
how the parameters $\theta$ and $\beta$ converge, 
until their updates at each EM iteration are within predefined tolerance levels 
(norm of the difference between current and previous parameters < 1e-3); *factory default = `FALSE`*

Note that initial values of $\beta$ are generated from 
L2-regularized logistic regression (*GAM*) with 
the $\lambda$ pre-selected via 10-fold cross-validation.
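
For example, lowering `pseudoc` from its factory default makes the $\theta$ 
estimates rely more heavily on the observed outlier data; a minimal call with 
the simulated data loaded earlier would look like this:
```{r, eval=FALSE}
## A lower pseudo count weakens the Beta prior on theta,
## so the estimates rely more heavily on the observed outlier data
postprobs_lowC <- appRIVER(dataInput, pseudoc=10)
```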

[Back to Top](#top)

## Stability of Estimated Parameters with Different Parameter Initializations

In this section, we report how different parameter initializations 
for either $\beta$ or $\theta$ affect the estimated parameters. 
We initialized a noisy $\beta$ by adding K% Gaussian noise relative to the mean 
of $\beta$ with $\theta$ fixed (for K = 10, 25, 50, 100, 200, 400, 800). 
For $\theta$, we fixed P(E = 1 | FR = 0) and P(E = 0 | FR = 0) as 0.01 and 0.99, 
respectively, and initialized (P(E = 0 | FR = 1), P(E = 1 | FR = 1)) 
as (0.1, 0.9), (0.4, 0.6), and (0.45, 0.55) instead of the default (0.3, 0.7), with $\beta$ fixed. 
For each parameter initialization, we computed Spearman rank correlations 
between parameters from *RIVER* using the original initialization and 
the alternative initializations. We also investigated how many instances 
within the top 10% of posterior probabilities from *RIVER* under the original settings 
were replicated in the top 10% of posterior probabilities under the alternative 
initializations. We also tried five different values of `pseudoc` (10, 20, 30, 75, and 100) 
with the default settings of $\beta$ and $\theta$ and computed 
both Spearman rank correlations and accuracy as explained above; a sketch of such 
a comparison is given after the table.

| Parameter	| Initialization | Spearman ρ	| Accuracy |
|:----------:|---------------:|-----------:|----------:|
|         	|    10% noise	 |   > .999	  |  0.880   |
|         	|    25% noise	 |   > .999	  |  0.862   |
|         	|    50% noise	 |   > .999	  |  0.849   |
|  $\beta$	|    100% noise	 |   > .999	  |  0.848  |
|         	|    200% noise	 |   > .999	  |  0.843   |
|         	|    400% noise	 |   > .999	  |  0.846   |
|         	|    800% noise	 |   > .999	  |  0.846   |
|         	|   [0.1, 0.9]   |   > .999	  |  0.841   |
| $\theta$	|   [0.4, 0.6]	 |   > .999	  |  1.000   |
|         	|  [0.45, 0.55]	 |   > .999	  |  1.000   |
|         	|    10	 |   .988	  |  0.934   |
|         	|    20	 |   .996	  |  0.955   |
| pseudoc	|    30	 |   .999	  |  0.972   |
|         	|    75	 |   .999	  |  0.979   |
|         	|  100	 |   .998	  |  0.967   |
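
A comparison along these lines can be sketched as follows; for simplicity this 
version varies only `theta_init` and compares the resulting posteriors, whereas 
the table above also perturbs $\beta$ and `pseudoc`:
```{r, eval=FALSE}
## Refit RIVER with an alternative theta initialization:
## (P(E = 0 | FR = 1), P(E = 1 | FR = 1)) = (0.4, 0.6) instead of the default (0.3, 0.7)
pp_default <- appRIVER(dataInput)
pp_alt     <- appRIVER(dataInput,
                       theta_init=matrix(c(.99, .01, .4, .6), nrow=2))

## Spearman rank correlation between the two sets of RIVER posteriors
cor(pp_default$RIVER_posterior, pp_alt$RIVER_posterior, method="spearman")

## Fraction of the default top 10% posteriors replicated under the alternative
topn <- ceiling(0.1 * length(pp_default$RIVER_posterior))
top_default <- order(pp_default$RIVER_posterior, decreasing=TRUE)[seq_len(topn)]
top_alt     <- order(pp_alt$RIVER_posterior, decreasing=TRUE)[seq_len(topn)]
mean(top_default %in% top_alt)
```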

[Back to Top](#top)