% \VignetteIndexEntry{HelloRanges Tutorial}
% \VignetteKeywords{ranges, bedtools, tutorial}
% \VignettePackage{HelloRanges}
\documentclass[10pt]{article}

<<style-Sweave, eval=TRUE, echo=FALSE, results=tex>>=
BiocStyle::latex2()
@

\bioctitle[HelloRanges Tutorial]{Saying Hello to the Bioconductor
  Ranges Infrastructure}

\author{Michael Lawrence\thanks{\email{michafla@gene.com}}\\Genentech}
\date{\today}

\begin{document}

\maketitle
\packageVersion{\Sexpr{BiocStyle::pkg_ver("HelloRanges")}}
\tableofcontents
\newpage

% TODO:
% - Demonstrate visualizing some of the tracks with ggbio?

\section{Introduction}

The primary purpose of the \Biocpkg{HelloRanges} package, from the
pedagogical perspective, is to map \software{bedtools} ``recipes'' to
\R{} scripts. The package was born out of the recognition
that \Bioconductor{} lacks a cookbook that explains how to achieve
common, specific tasks using the Ranges infrastructure. And that
writing a compiler is more fun (and hopefully more useful) than
writing documentation. The goal is to enable those who already use
\R{}/\Bioconductor{} for modeling and plotting to unify their workflow
by integrating their upstream processing.

\Biocpkg{HelloRanges} provides an \R{} function corresponding to each
\software{bedtools} command. The output is an \R{} language object,
and we can print the object to copy and integrate the code into an R
script.  Ideally, the code is first integrated (by copy/paste) into an
R script. Unlike bedtools, the result of evaluation is an R object
that is suitable for further analysis. There is no automatic output to
files, because once we are in \R{}, we want to stay there. Assuming
that the reader is interested in learning Bioconductor, we encourage
the reader to inspect the code prior to evaluation by printing the
language object. The generated code is meant to be correct, readable,
conformant to best practices and performant (in that order). While the
code is much more verbose than the corresponding \software{bedtools}
call, we argue that the explicit, low-level Ranges API has the
advantage of being self-documenting and more flexible. And, of course,
it directly integrates with the rest of \Bioconductor{}.

Obviously, performing I/O with each operation will have a negative
impact on performance, so it is recommended to import the data once,
and perform subsequent operations on in-memory data structures. If
memory is exhausted, consider distributing computations.

For the sake of comparison, this tutorial closely follows that of
\software{bedtools} itself
(\url{http://quinlanlab.org/tutorials/bedtools/bedtools.html}). We
will analyze the data from Maurano et al \cite{maurano2012systematic}
assessment of DnaseI hypersensitivy across a range of fetal tissues
(20 samples). The \software{bedtools} tutorial mostly consists of
arbitrary range operations on the annotation tracks. Near the end, we
will compare samples from the Maurano study in terms of their mutual
overlap.

\section{Data}

The data are provided via the \Biocpkg{HelloRangesData} package, which
we load presently:
<<loadData>>=
library(HelloRanges)
library(HelloRangesData)
@ 

To have convenient paths to the data, we switch our working directory
to the one with the data files:
<<setwd>>=
oldwd <- setwd(system.file("extdata", package="HelloRangesData"))
@ 

In our working directory are 20 BED files from the DnaseI study, as
well as BED files representing the CpG islands (\file{cpg.bed}),
Refseq exons (\file{exons.bed}), disease-associated SNPs
(\file{gwas.bed}), and functional annotations output by chromHMM given
ENCODE human embrionic stem cell ChIP-seq data
(\file{hesc.chromHmm.bed}). There is also a \file{hg19.genome} file
indicating the chromosome lengths of the hg19 genome build.

One of the advantages of \R{}, compared to the shell, is its unified
package management system. \R{} packages can contain data, and even
completed analyses, in addition to libraries of
functions. \Bioconductor{} provides many annotations and sample
datasets through packages. Packages make obtaining data easy, and they
also help with reproducibility and provenance through
versioning. Thus, they are more convenient and safer compared to
downloading from live URLs. Some packages provide APIs to download
data from repositories, and we can use them to programmatically
generate data packages in a reproducible way.

For example, the TxDb packages provide transcript annotations for an
individual reference genome. \Bioconductor{} provides pre-built TxDb
packages for the commonly used genome builds, and it is easy to
generate a custom TxDb package using annotations from Biomart, UCSC,
or just a GFF file. The \Biocpkg{rtracklayer} package supports
automated retrieval of any dataset from the UCSC table browser,
without any pointing-and-clicking involved. We will demonstrate some
of these tools later in the tutorial.

% Data (hg19):
% - DnaseI hypersensitivity: bedtools tutorial
% - Exons: GenomicFeatures package
% - CpG Islands: cpgIslandExt
% - Flagged SNPs: snp146Flagged
% - ChromHMM: wgEncodeBroadHmmH1hescHMM

\section{Overlap and Intersection}
\includegraphics{fig/intersect.jpg}

One of the most useful ways to compare two tracks is to determine
which ranges overlap, and where they intersect (see the above image
from the \software{bedtools} tutorial).

By default, \software{bedtools} outputs the region of intersection for
each overlap between the two tracks. We compile the code for
intersecting the CpG islands and the exons.
<<intersect-default>>=
code <- bedtools_intersect("-a cpg.bed -b exons.bed -g hg19.genome")
code
@
This code should be integrated into an R script that implements a
larger workflow. For the purposes of this tutorial, we will call
\Rfunction{eval} on the language object to yield the result:
<<intersect-eval>>=
ans <- eval(code)
mcols(ans)$hit <- NULL
ans
@ 
%
The result is an instance of \Rclass{GRanges}, the central data
structure in \Bioconductor{} for genomic data.  A \Rclass{GRanges}
takes the form of a table and resembles a BED file, with a column for
the chromosome, start, end, strand.  We will see \Rclass{GRanges} a
lot, along with its cousin, \Rclass{GRangesList}, which stores what
\software{bedtools} calls ``split'' ranges.

\subsection{Sequence information}

Consider the simplest invocation of \software{bedtools intersect}:
<<intersect-simple>>=
code <- bedtools_intersect("-a cpg.bed -b exons.bed")
code
@ 

The first line creates an object representing the structure of the
genome build:
<<seqinfo>>=
genome <- eval(code[[2L]])
genome
@
%
It is an empty object, because the genome identifier is unspecified
(\Robject{NA\_character\_}). Having unspecified genome bounds is
dangerous and leads to accidents involving incompatible genome
builds. Besides error checking, the bounds are useful when computing
coverage and finding the gaps (see below). Luckily,
\software{bedtools} lets us specify the genome as an argument:
<<intersect-genome>>=
code <- bedtools_intersect("-a cpg.bed -b exons.bed -g hg19.genome")
code
genome <- eval(code[[2L]])
genome
@ 
We now have a populated genome, since the tutorial had provided the
\file{hg19.genome} file. However, in general, information on the
genome build should be centralized, not sitting somewhere in a
file. \Bioconductor{} provides \Biocpkg{GenomeInfoDb} as its central
source of sequence information. We can hook into that by passing the
genome identifier instead of a file name:
<<intersect-genome-id>>=
code <- bedtools_intersect("-a cpg.bed -b exons.bed -g hg19")
genome <- eval(code[[2L]])
genome
@ 

\subsection{Annotations}

Now our code looks like:
<<annotations-review>>=
code
@
The next step is the import of the CpG islands and exons using the
\Rfunction{import} function from \Biocpkg{rtracklayer}, which can
load pretty much any kind of genomic data into the appropriate
type of \Bioconductor{} object. In this case, we are loading the data
as a \Rclass{GRanges}:
<<granges>>=
gr_a
@ 

The \Biocpkg{rtracklayer} package can also download data directly from
the UCSC table browser. For example, we could get the CpG islands directly:
<<rtracklayer-cpgs, eval=FALSE>>=
ucsc <- browserSession()
genome(ucsc) <- "hg19"
cpgs <- ucsc[["CpG Islands"]]
@ 

Gene annotations, including exon coordinates, should also be stored
more formally than in a file, and Bioconductor provides them through
its TxDb family of packages:
<<txdb-exons, eval=FALSE>>=
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
exons <- exons(TxDb.Hsapiens.UCSC.hg19.knownGene)
@ 

\subsection{Finding Overlaps}

A review of the code:
<<findOverlaps-review>>=
code
@ 

The next step is to find all of the overlaps. The workhorse function
is \Rfunction{findOverlaps} from the \Biocpkg{IRanges} package. Here,
we use the variant \Rfunction{findOverlapPairs}, a convenience for
creating a \Rclass{Pairs} object that matches up the overlapping
ranges:
<<intersect-pairs>>=
pairs
@ 
Although the ranges are displayed as succinct strings, they data are
still represented as \Rclass{GRanges} objects.

Users of \software{bedtools} will be familiar with \Rclass{Pairs} as
the analog of the BEDPE file format. We can use \Biocpkg{rtracklayer}
to export \Rclass{Pairs} to BEDPE:
<<export-pairs, eval=FALSE>>=
export(pairs, "pairs.bedpe")
@ 

A key parameter to \Rfunction{findOverlapPairs} is
\Rcode{ignore.strand=TRUE}. By default, all operations on
\Rclass{GRanges} take strand into account when determining whether two
ranges overlap, and deciding on the orientation of a range. This is
surprising to many novice users, particularly to those with
\software{bedtools} experience. Most functions take the
\Robject{ignore.strand} argument to control this behavior. To avoid
confusion, the code generated by \Biocpkg{HelloRanges} is always
explicit about how it is treating strand. Users are encouraged to
follow the same practice.

\subsection{Computing intersections}
One last look at the code:
<<findOverlaps-review>>=
code
@ 

The final step is to find the actual intersecting region between the
member of each overlapping pair. We do this with the
\Rfunction{pintersect} function, which is the ``parallel'' or
``pairwise'' version of the default \Rfunction{intersect} function. If
we had just called \Rcode{intersect(gr\_a, gr\_b)} instead, the entire
set of ranges would have been treated as a set, and overlapping ranges
in \Robject{gr\_a} and \Robject{gr\_b} would have been merged (this is
rarely desirable and requires an extra merge step in
\software{bedtools}).

Notice again the importance of \Rcode{ignore.strand=TRUE}. Without
that, ranges on opposite strands would have zero intersection.

And here is our result:
<<intersect-ans>>=
ans
@ 
Again, a \Rclass{GRanges} object. The \Robject{hit} column indicates
whether the pair overlapped at all (as opposed to one range being of
zero width). It's useless in this case.

\subsection{Keeping the original features}

To keep the original form of the overlapping features, the generated
code simply neglect to call \Rfunction{pintersect} and end up with the
\Robject{pairs} object introduced previously:
<<intersect-wa-wb>>=
code <- bedtools_intersect("-a cpg.bed -b exons.bed -g hg19 -wa -wb")
code
eval(code)
@ 

\subsection{Computing the amount of overlap}

To compute the width of the overlapping regions, we query the initial
result for its width and store as an annotation on the pairs:
<<intersect-wo>>=
code <- bedtools_intersect("-a cpg.bed -b exons.bed -g hg19 -wo")
code
eval(code)
@ 

This code reveals that \Rclass{GRanges}, along with every other
vector-like object in the Ranges infrastructure, is capable of storing
tabular annotations, accessible via \Rfunction{mcols}. We actually saw
this before with the ``name'' column on the Cpg Islands. Here, we use
it to store the overlap width.

\subsection{Counting the number of overlaps}

A common query, particularly in RNA-seq analysis, is how many ranges
in the subject overlap each query range. The \Rfunction{countOverlaps}
function serves this particular purpose:
<<intersect-c>>=
code <- bedtools_intersect("-a cpg.bed -b exons.bed -g hg19 -c")
code
eval(code)
@ 

\subsection{Excluding queries with overlaps}

We might instead want to exclude all query ranges that overlap any
subject range, i.e., any CpG island that overlaps an exon. The
\Rfunction{subsetByOverlaps} function is tasked with restricting by
overlap. By passing \Rcode{invert=TRUE}, we exclude ranges with
overlaps.
<<intersect-v>>=
code <- bedtools_intersect("-a cpg.bed -b exons.bed -g hg19 -v")
code
@ 

\subsection{Restricting by fraction of overlap}

The \software{bedtools} suite has deep support for restricting
overlaps by the fraction of the query/subject range that is
overlapped. This is not directly supported by the \Bioconductor{}
infrastructure, but we can filter post-hoc:
<<intersect-f>>=
code <- bedtools_intersect("-a cpg.bed -b exons.bed -g hg19 -f 0.5 -wo")
code
@ 

\subsection{Performance}

Comparing the performance of \software{bedtools} and \Biocpkg{IRanges}
is like comparing apples and oranges. The typical \Bioconductor{}
workflow imports the data once, paying an upfront cost, and then
operates efficiently on in-memory data structures. The BED parser is
implemented in R code and will not compete with the parsing
performance of special purpose C code. The intersect operation itself
is also slower than \software{bedtools}, but it's still reasonably
close for being mostly implemented in R.
<<intersect-performance>>=
a <- import("exons.bed")
b <- import("hesc.chromHmm.bed")
system.time(pintersect(findOverlapPairs(a, b, ignore.strand=TRUE),
                       ignore.strand=TRUE))
@ 

\subsection{Multiple subjects}

Often, we are interested in intersections with mutiple annotation
tracks, or multiple samples. Note that the command line parser used by
\Biocpkg{helloRanges} requires that the filenames be comma-separated,
instead of space-separated. This is probably more readable anyway.
<<intersect-multiple>>=
code <- bedtools_intersect(
    paste("-a exons.bed",
          "-b cpg.bed,gwas.bed,hesc.chromHmm.bed -wa -wb -g hg19",
          "-names cpg,gwas,chromhmm"))
ans <- eval(code)
code
@ 
Inspecting the code, we see that we need to loop over the database
files and then \Rfunction{stack} them into a single \Rclass{GRanges}
grouped by the column ``b'':
<<intersect-multiple-second>>=
second(ans)
@
The ``b'' column is an \Rclass{Rle} object, a run-length encoded form
of an ordinary R vector, in this case a factor.  Since the data are
sorted into groups, this encoding is more efficient than a dense
representation. The ``seqnames'' and ``strand'' columns also benefit
from run-length encoding. Not only can we fit more data into memory,
many operations become faster.

\section{Merge}
\includegraphics{fig/merge.jpg}

There are many ways to summarize interval data. In the Ranges
infrastructure, we call some of them \Rfunction{range} (min start to
max end), \Rfunction{reduce} (\software{bedtools merge}),
\Rfunction{disjoin} (involved in \software{bedtools multiinter}) and
\Rfunction{coverage} (counting the number of ranges overlapping each
position, \software{bedtools genomecov}). We are presently concerned
with \Rfunction{reduce}, which combines overlapping and adjacent
ranges into a single range. The corresponding \software{bedtools
  merge} command requires the data to be sorted; however,
\Rfunction{reduce} does not have this constraint.

<<merge>>=
code <- bedtools_merge("-i exons.bed")
code
@ 

\subsection{Aggregation}

As with any reduction, we often want to simultaneously aggregate
associated variables and report the summaries in the result.

We count the number of ranges overlapping each merged range:
<<merge-count-overlaps>>=
code <- bedtools_merge("-i exons.bed -c 1 -o count")
code
@

The key to aggregation with \Rfunction{reduce} is the
\Rcode{with.revmap=TRUE} argument. That yields a ``revmap'' column on
the result. It is an \Rclass{IntegerList} holding the subscripts
corresponding to each group. We pass it to \Rfunction{aggregate} to
indicate the grouping. The named arguments to \Rfunction{aggregate},
in this case \Robject{seqnames.count}, are effectively evaluated with
respect to each group (although they are actually evaluated only once).

This yields the result:
<<merge-count-overlaps-ans>>=
eval(code)
@ 
We see that the grouping has been preserved on the object, in case we
wish to aggregate further through joins.

Counting the overlaps by counting the ``seqnames'' is a little
circuitous. Instead, we could have just counted the elements in each group:
<<merge-count-overlaps-direct>>=
identical(lengths(ans$grouping), ans$seqnames.count)
@ 
Note that this counting is very fast, because the ``revmap''
\Rclass{IntegerList} is not actually a list, but a partitioned vector,
and the partitioning already encodes the counts. This is an example of
where the flexibility and efficient in-memory representations of
\Bioconductor{} are particularly effective.

\subsection{Merging close features}

By default, features are merged if they are overlapping or adjacent,
i.e., the \Robject{min.gapwidth} (the minimum gap allowed to not be
merged) is 1. To merge features that are up to, say, 1000bp away, we
need to pass \Rcode{min.gapwidth=1001}:
<<merge-close>>=
code <- bedtools_merge("-i exons.bed -d 1000")
code
@ 

Here is another example showing how to merge multiple columns at once:
<<merge-multiple>>=
code <- bedtools_merge("-i exons.bed -d 90 -c 1,4 -o count,collapse")
code
@

\section{Finding the Gaps}
\includegraphics{fig/complement.jpg}

The \software{bedtools complement} tool finds the gaps in the
sequence, i.e., the regions of sequence the track does not cover. This
is where having the sequence bounds is critical.

<<complement>>=
code <- bedtools_complement("-i exons.bed -g hg19.genome")
code
@ 

The call to \Rfunction{setdiff} is a set operation, along with
\Rfunction{intersect} and \Rfunction{union}. Set operations behave a
bit surprisingly with respect to strand. The ``unstranded'' features,
those with ``*'' for their strand, are considered to be in a separate
space from the stranded features. If we pass
\Rcode{ignore.strand=TRUE}, both arguments are unstranded and the
result is unstranded (strand information is discarded). This makes
sense, because there is no obvious way to merge a stranded and
unstranded feature. Since we are looking for the gaps, we do not care
about the strand, so discarding the strand is acceptable. Best
practice is to make this explicit by calling \Rfunction{unstrand}
instead of assuming the reader understands the behavior of
\Rcode{ignore.strand=TRUE}.

\section{Computing Genomic Coverage}
\includegraphics{fig/genomecov.jpg}

One of the useful ways to summarize ranges, particularly alignments,
is to count how many times a position is overlapped by a range. This
is called the coverage. Unlike \software{bedtools genomecov}, we do
not require the data to be sorted.

\subsection{Coverage vector}

To compute the coverage, we just call \Rfunction{coverage}. For
consistency with \software{bedtools}, which drops zero runs with the
``-bg'' option, we convert the coverage vector to a \Rclass{GRanges}
and subset:
<<genomecov-bg>>=
code <- bedtools_genomecov("-i exons.bed -g hg19.genome -bg")
code
@ 

\subsection{Coverage histogram}

The default behavior of \software{genomecov} is to compute a histogram
showing the number and fraction of positions covered at each level. It
does this for the individual chromosome, and the entire genome. While
computing the coverage vector is as simple as calling
\Rfunction{coverage}, forming the histogram is a bit
complicated. This is a bit esoteric, but it lets us
demonstrate how to aggregate data in R:
<<genomecov>>=
code <- bedtools_genomecov("-i exons.bed -g hg19.genome")
ans <- eval(code)
code
@
%

The \Robject{cov} object is an \Rclass{RleList}, with one \Rclass{Rle}
per sequence (chromosome). 
<<genomecov-cov>>=
cov
@ 

We tabulate each coverage vector individually, then stack the tables
into an initial histogram. Then, we aggregate over the entire genome
and combine the genome histogram with the per-chromosome
histogram. The call to \Rfunction{NumericList} is only to avoid
integer overflow. Finally, we compute the fraction covered and end up
with:
<<genomecov-ans>>=
ans
@ 
%
This takes 3 minutes for bedtools, but closer to 3 seconds for us, probably
because it is working too hard to conserve memory.

\section{Combining operations}

\subsection{Chaining}

Most real-world workflows involve multiple operations, chained
together. The R objects produced \Biocpkg{HelloRanges} can be passed
directly to existing \R{} functions, and \Biocpkg{HelloRanges} defines
an ordinary \R{} function corresponding to each bedtools operation.
The arguments of the function correspond to \software{bedtools}
arguments, except they can be \R{} objects, like \Rclass{GRanges}, in
addition to filenames. These functions with ordinary R semantics are
prefixed by \Rfunction{R\_}, so the analog to
\Rfunction{bedtools\_intersect} is \Rfunction{R\_bedtools\_intersect}.

Consider a use case similar to the one mentioned in the
\software{bedtools} tutorial: find the regions of the CpG islands that
are not covered by exons. We could do this directly with
\Rfunction{bedtools\_subtract}, but let us instead compute the coverage
of the exons, find the regions of zero coverage, and intersect those
with the CpG islands. 

First, we generate the code for the coverage operation (and ideally
copy it to a script). The result of evaluating that code is a
\Rclass{GRanges}, which we subset for the regions with zero score.
<<combine-genomecov>>=
code <- bedtools_genomecov("-i exons.bed -g hg19.genome -bga")
gr0 <- subset(eval(code), score == 0L) # compare to: awk '$4==0'
gr0
@ 

Next, we pass \Robject{gr0} directly to the R analog of
\software{intersect}, \Rfunction{R\_bedtools\_intersect}:
<<combine-intersect>>=
code <- R_bedtools_intersect("cpg.bed", gr0)
code
@ 
The generated code already refers to \Robject{gr0} explicitly, so it
is easy to copy this into the script.

To generalize, the chaining workflow is:
\begin{enumerate}
\item Generate code for first operation,
\item Integrate and evaluate the code,
\item Interactively inspect the result of evaluation,
\item Perform intermediate operations, while inspecting results,
\item Call \Rfunction{R\_} analog to generate second stage code.
\end{enumerate}

Generating and integrating \R{} code is the best way to learn, and the
best way to produce a readable, flexible and performant
script. However, there are probably those who are tempted to evaluate
the code directly, as we have done in this vignette. Further, there
are those who wish to chain these operations together with the
so-called ``pipe'' operator, because it would come so tantalizing
close to the syntax of the shell. We thought it would be entertaining
to observe people who, while ``living in the moment'', obfuscate their
code by using both the obscure \software{bedtools} API and
non-standard syntax. Thus, we created a third family of functions,
prefixed by \Rfunction{do\_}, which provide the same interface as the
\Rfunction{R\_} family, except they evaluate the generated code:
<<combine-pipe, eval=FALSE>>=
do_bedtools_genomecov("exons.bed", g="hg19.genome", bga=TRUE) %>% 
    subset(score > 0L) %>%
    do_bedtools_intersect("cpg.bed", .)    
@ 
Enjoy.

\subsection{Coalescence}

In the previous section, we chained together independent
operations. Having access to the underlying code gives us the
flexibility to merge operations so that they are faster than the sum
of their parts. We call this coalescence.

Consider a use case cited by the \software{bedtools} tutorial: compute
the distribution of coverage over all exons. To integrate better with
this tutorial, we adapt that to finding the distribution of exon
coverage over all CpG islands. 

We could mimic the example by computing the coverage complete
histogram and extracting only the margin:
<<coalescence-naive>>=
code <- bedtools_coverage("-a cpg.bed -b exons.bed -hist -g hg19.genome")
code
@ 
The code is quite complex, because the Ranges infrastructure does
not attempt to generate high-level summaries of the data. The
rationale, which is validated in this case, is that the desired
summary depends on the specific question, and the number of questions
is effectively infinite.

In this case, we only care about the margin, i.e., the \Robject{all}
component:
<<coalescence-all>>=
ans <- eval(code)
metadata(ans)$coverage
@ 

Thus, we can simplify the code. We begin with the same lines:
<<coalescence-simplify>>=
genome <- import("hg19.genome")
gr_a <- import("cpg.bed", genome = genome)
gr_b <- import("exons.bed", genome = genome)
cov <- unname(coverage(gr_b)[gr_a])
@ 
And summarize all of the coverage at once:
<<coalescence-custom>>=
all_cov <- unlist(cov)
df <- as.data.frame(table(coverage=all_cov))
df$fraction <- df$Freq / length(all_cov)
@ 
This is much faster, because we are only computing one table, not
30,000, and the \Rfunction{table} method for \Rclass{Rle} is very
efficient.

We now have a simple \Rclass{data.frame} that we can plot as an
inverted ECDF:
<<echo=FALSE>>=
setwd(oldwd)
@ 
<<coalescence-plot,fig=TRUE,png=TRUE>>=
plot((1-cumsum(fraction)) ~ as.integer(coverage), df, type="s",
     ylab = "fraction of bp > coverage", xlab="coverage")
@ 
<<echo=FALSE>>=
setwd(system.file("extdata", package="HelloRangesData"))
@ 

\section{Jaccard Statistic}

In order to compare the DnaseI hypersenstivity across tissues, we will
employ the \software{bedtools jaccard} statistic, a measure of
similarity between two tracks. It is defined as the total width of
their intersection over the total width of their union.

We might expect, for example, that the similarity within a tissue is
higher than that between two tissues, and this is indeed the case:
<<jaccard-example>>=
code <- bedtools_jaccard(
    paste("-a fHeart-DS16621.hotspot.twopass.fdr0.05.merge.bed",
          "-b fHeart-DS15839.hotspot.twopass.fdr0.05.merge.bed"))
heart_heart <- eval(code)
code <- bedtools_jaccard(
    paste("-a fHeart-DS16621.hotspot.twopass.fdr0.05.merge.bed",
          "-b fSkin_fibro_bicep_R-DS19745.hg19.hotspot.twopass.fdr0.05.merge.bed"))
heart_skin <- eval(code)
mstack(heart_heart=heart_heart, heart_skin=heart_skin)
@
%
The generated code makes the statistic self-documenting:
<<jaccard-code>>=
code
@ 

We can compute the statistic over all pairs of samples using
functionality included with R, through the \Rpackage{parallel}
package. There is no need to learn yet another syntax, such as that of
the \software{parallel} UNIX utility. Nor do we need to download a
custom python script, and repeatedly call perl and awk.
<<jaccard-all>>=
files <- Sys.glob("*.merge.bed")
names(files) <- sub("\\..*", "", files)
ans <- outer(files, files, 
             function(a, b) mcmapply(do_bedtools_jaccard, a, b, 
                                     mc.cores=4))
jaccard <- apply(ans, 1:2, function(x) x[[1]]$jaccard)
@ 

Since we are already in R, it is easy to create a simple plot:
<<echo=FALSE>>=
setwd(oldwd)
@ 
<<jaccard-plot,fig=TRUE,png=TRUE>>=
palette <- colorRampPalette(c("lightblue", "darkblue"))(9)
heatmap(jaccard, col=palette, margin=c(14, 14))
@ 
<<echo=FALSE>>=
setwd(system.file("extdata", package="HelloRangesData"))
@ 

\section{Exercises}

These were adapted from the \software{bedtools} tutorial. Try to
complete these exercises using \Bioconductor{} directly.

\begin{enumerate}
\item Create a \Rclass{GRanges} containing the non-exonic regions of
  the genome.
\item Compute the average distance from the GWAS SNPs to the closest
  exon (Hint: \Rcode{?bedtools\_closest} and
  \Rcode{?distanceToNearest}).
\item Compute the exon coverage in 500kb windows across the genome
  (Hint: \Rcode{?bedtools\_makewindows} and \Rcode{?tileGenome}).
\item How many exons are completely overlapped by an enhancer (from
  \file{hesc.chromHmm.bed}) (Hint: \Rcode{?`\%within\%`})?
\item What fraction of the disease-associated SNPs are exonic 
  (Hint: (Hint: \Rcode{?`\%over\%`}))?
\item Create intervals representing the canonical 2bp splice sites on
  either side of each exon (bonus: exclude splice sites at the first
  and last exons) (Hint: \Rcode{?bedtools\_flank},
  \Rcode{?intronsByTranscript}).
\item Which hESC ChromHMM state represents the most number of base
  pairs in the genome? (Hint: \Rcode{?xtabs}).
\end{enumerate}

\subsection{Answers}
Below, we give the \software{bedtools}-style answer first, followed by
the essential call against the \Bioconductor{} API.

First, we load the files into \R{} objects for convenience:
<<answers-load>>=
genome <- import("hg19.genome")
exons <- import("exons.bed", genome=genome)
gwas <- import("gwas.bed", genome=genome)
hesc.chromHmm <- import("hesc.chromHmm.bed", genome=genome)
@ 

Here are the numbered answers:

\begin{enumerate}
\item 
<<answer-1>>=
bedtools_complement("-i exons.bed -g hg19.genome")
## or without HelloRanges:
setdiff(as(seqinfo(exons), "GRanges"), unstrand(exons))
@ 
\item 
<<answer-2>>=
bedtools_closest("-a gwas.bed -b exons.bed -d")
## or 
distanceToNearest(gwas, exons)
@
\item
<<answer-3>>=
code <- bedtools_makewindows("-g hg19.genome -w 500000")
code
windows <- unlist(eval(code))
R_bedtools_intersect(windows, exons, c=TRUE)
## or
str(countOverlaps(tileGenome(seqinfo(exons), tilewidth=500000), 
                  exons))
@   
\item 
<<answer-4>>=
bedtools_intersect(
    paste("-a exons.bed -b <\"grep Enhancer hesc.chromHmm.bed\"",
          "-f 1.0 -wa -u"))
quote(length(ans))
## or
sum(exons %within% 
    subset(hesc.chromHmm, grepl("Enhancer", name)))
@
\item 
<<answer-5>>=
bedtools_intersect("-a gwas.bed -b exons.bed -u")
quote(length(gr_a)/length(ans))
## or
mean(gwas %over% exons)
@   
\item 
<<answer-6>>=
bedtools_flank("-l 2 -r 2 -i exons.bed -g genome.txt")
## or, bonus:
exons <- keepStandardChromosomes(exons)
txid <- sub("_exon.*", "", exons$name)
tx <- split(exons, txid)
bounds <- range(tx)
transpliced <- lengths(bounds) > 1
introns <- unlist(psetdiff(unlist(bounds[!transpliced]), 
                           tx[!transpliced]))
Pairs(resize(introns, 2L), resize(introns, 2L, fix="end"))
## better way to get introns:
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
introns <- unlist(intronsByTranscript(txdb))
@
\item
<<answer-7>>=
system.time(names(which.max(xtabs(width ~ name, 
                                  hesc.chromHmm))))
## or
names(which.max(sum(with(hesc.chromHmm, 
                         splitAsList(width, name)))))
## or
df <- aggregate(hesc.chromHmm, ~ name, totalWidth=sum(width))
df$name[which.max(df$totalWidth)]
@ 
\end{enumerate}

<<restore-wd, echo=FALSE>>=
setwd(oldwd)
@ 

\bibliography{tutorial}

\end{document}