---
title: "periodicDNA"
author: "Jacques Serizay"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
    %\VignetteIndexEntry{Introduction to periodicDNA}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

```{r, eval = TRUE, echo=FALSE, results="hide", warning=FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
suppressPackageStartupMessages({
    library(GenomicRanges)
    library(ggplot2)
    library(magrittr)
    library(periodicDNA)
})
BiocParallel::register(setUpBPPARAM(1), default = TRUE)
```

### Introduction to periodicDNA 

Short DNA sequence motifs provide key information for interpreting the 
instructions in DNA, for example by providing binding sites for 
proteins or altering the structure of the double-helix. A less studied but 
important feature of DNA sequence motifs is their periodicity. 
A famous example is the 10-bp periodicity of many k-mers in nucleosome 
positioning (reviewed in Travers et al. 2010 or in Struhl and Segal 2013).  

periodicDNA provides a framework to quantify the periodicity of k-mers of 
interest in DNA sequences. For a chosen k-mer, periodicDNA can identify which 
periods are statistically enriched in a set of sequences, by using a 
randomized shuffling approach to compute an empirical p-value. It can also 
generate continuous linear tracks of k-mer periodicity strength over 
genomic loci.

### Internal steps of periodicDNA

To estimate the periodicity strength of a given k-mer in one or several
sequences, periodicDNA performs the following steps: 

1. The k-mer occurrences are mapped and their pairwise distances 
   are calculated.
2. The distribution of all the resulting pairwise distances 
   (also called "distogram") is generated.
3. The distogram is transformed into a frequency histogram and smoothed 
   using a moving window of 3 to mask the universal three-base genomic 
   periodicity. To normalize the frequency for distance decay, 
   the local average 
   (obtained by averaging the frequency with a moving window of 10) is then
   subtracted from the smoothed frequency.
4. Finally, the power spectral density (PSD) is estimated by applying a 
   Fast Fourier Transform (Figure 1F) over the normalized frequency histogram. 
   The magnitude of the PSD values indicates the contribution of a given 
   period to the overall periodicity of the k-mer of interest.

```{r, eval = TRUE, echo=FALSE, out.width='100%'}
knitr::include_graphics(
   "https://raw.githubusercontent.com/js2264/periodicDNA/master/man/figures/periodicityDNA_principle.png"
)
```

### Quantifying k-mer periodicity over a set of sequences

#### Basic usage 

The main goal of periodicDNA is to quantify periodicity of a given k-mer in a
set of sequences. For instance, one can assess the periodicity of TT
dinucleotides in sequences around TSSs of ubiquitous promoters using 
`getPeriodicity()`.  

In the following example, `getPeriodicity()` is directly ran using 
a GRanges object, specifying from which genome this GRanges comes from. 

```{r, eval = TRUE}
library(ggplot2)
library(magrittr)
library(periodicDNA)
#
data(ce11_TSSs)
periodicity_result <- getPeriodicity(
    ce11_TSSs[['Ubiq.']][1:500],
    genome = 'BSgenome.Celegans.UCSC.ce11',
    motif = 'TT', 
    BPPARAM = setUpBPPARAM(1)
)
```

The main output of `getPeriodicity()` is a table of power spectral density 
(PSD) values associated with discrete frequencies, computed using a 
Fast Fourier Transform. For a given frequency, a high PSD score 
indicates a high periodicity of the k-mer of interest.  

In the previous example, TT dinucleotides in sequences around TSSs 
of ubiquitous promoters are highly periodic, with a periodicity of 10 bp. 

```{r, eval = TRUE}
head(periodicity_result$PSD)
subset(periodicity_result$periodicityMetrics, Period == 10)
```

Graphical output of `getPeriodicity()` can be obtained using the 
`plotPeriodicityResults()` function: 

```{r, eval = TRUE, fig.width = 9, fig.height = 3.2, out.width='100%'}
plotPeriodicityResults(periodicity_result)
```

The first plot shows the raw distribution of distances between pairs of 'TT' 
in the sequences of the provided GRanges. The second plot shows the 
decay-normalised distribution. Finally, the third plot shows the PSD scores 
of the 'TT' k-mer, measured from the normalised distribution.  

#### Repeated shuffling of input sequences 

periodicDNA provides an approach to compare the periodicity of a given k-mer
in a set of sequences compared to background. For a given k-mer at a period T 
in a set of input sequences, the fold-change over background of its PSD 
is estimated by iteratively shuffling the input sequences and estimating 
the resulting PSD values.  
Eventually, the log2 fold-change (l2FC) between the observed PSD and the 
median of the PSD values measured after shuffling is computed as follows:  

l2FC = log2(observed PSD / median(shuffled PSDs)).

```{r, eval = TRUE, fig.width = 9, fig.height = 3.2, out.width='100%'}
periodicity_result <- getPeriodicity(
    ce11_TSSs[['Ubiq.']][1:500],
    genome = 'BSgenome.Celegans.UCSC.ce11',
    motif = 'TT', 
    n_shuffling = 5
)
head(periodicity_result$periodicityMetrics)
subset(periodicity_result$periodicityMetrics, Period == 10)
plotPeriodicityResults(periodicity_result)
```

If `n_shuffling >= 100`, an associated empirical p-value is calculated as 
well (North et al 2002). This metric indicates, for each individual period T, 
whether the observed PSD is significantly greater than the PSD values measured 
after shuffling the input sequences. Note that empirical p-values are only an 
estimation of the real p-value. Notably, small p-values are systematically 
under-estimated (North et al 2002).

#### Note

`getPeriodicity()` can also be ran directly on a set of sequences of interest 
as follows: 

```{r, eval = TRUE}
data(ce11_proms_seqs)
periodicity_result <- getPeriodicity(
    ce11_proms_seqs,
    motif = 'TT', 
    BPPARAM = setUpBPPARAM(1)
)
subset(periodicity_result$periodicityMetrics, Period == 10)
```

### Track of periodicity over a set of Genomic Ranges

The other aim of periodicDNA is to generate continuous linear tracks of 
k-mer periodicity strength over genomic loci of interest. 
`getPeriodicityTrack()` can be used to generate suck genomic tracks. In the 
following example,  

```{r, eval = FALSE}
WW_10bp_track <- getPeriodicityTrack(
    genome = 'BSgenome.Celegans.UCSC.ce11',
    granges = ce11_proms, 
    motif = 'WW',
    period = 10,
    BPPARAM = setUpBPPARAM(1),
    bw_file = 'WW-10-bp-periodicity_over-proms.bw'
)
```

When plotted over sets of ubiquitous, germline or somatic TSSs, the resulting 
track clearly shows increase of WW 10-bp periodicity above the ubiquitous and
germline TSSs, whereas somatic TSSs do not show such increase.

```{r, eval = FALSE, results="hide", warning=FALSE}
data(ce11_TSSs)
plotAggregateCoverage(
    WW_10bp_track, 
    ce11_TSSs, 
    xlab = 'Distance from TSS',
    ylab = '10-bp periodicity strength (forward proms.)'
)
```

```{r, eval = TRUE, echo=FALSE, out.width='100%'}
knitr::include_graphics(
    "https://raw.githubusercontent.com/js2264/periodicDNA/master/man/figures/TT-10bp-periodicity_tissue-spe-TSSs.png"
)
```

## Session info

```{r, eval = TRUE}
sessionInfo()
```