%\VignetteIndexEntry{An introduction to Rsamtools}
%\VignetteDepends{}
%\VignetteKeywords{Short read, I/0, samtools}
%\VignettePackage{Rsamtools}
\documentclass{article}

\usepackage[utf8]{inputenc}

<<style, echo=FALSE, results=tex>>=
BiocStyle::latex()
@
\newcommand{\bam}{\texttt{BAM}}

\title{An Introduction to \Biocpkg{Rsamtools}}
\author{Martin Morgan}
\date{Modified: 18 March, 2010. Compiled: \today}

\begin{document}

\maketitle

\tableofcontents

<<options,echo=FALSE>>=
options(width=60)
@ 

<<preliminaries>>=
library(Rsamtools)
@ 

\section{Introduction}

The \Biocpkg{Rsamtools} package provides an interface to \bam{} files. \bam{}
files are produced by \software{samtools} and other software, and
represent a flexible format for storing `short' reads aligned to
reference genomes. \bam{} files typically contain sequence and base
qualities, and alignment coordinates and quality measures. \bam{}
files are appealing for several reasons. The format is flexible enough
to represent reads generated and aligned using diverse
technologies. The files are binary so that file access is relatively
efficient. \bam{} files can be indexed, allowing ready access to
localized chromosomal regions. \bam{} files can be accessed remotely,
provided the remote hosting site supports such access and a local
index is available. This means that specific regions of remote files
can be accessed without retrieving the entire (large!) file.  A full
description is available in the \bam{} format specification
(\url{http://samtools.sourceforge.net/SAM1.pdf})

The main purpose of the \Biocpkg{Rsamtools} package is to import \bam{} files
into \R{}. \Biocpkg{Rsamtools} also provides some facility for file access
such as record counting, index file creation, and filtering to create
new files containing subsets of the original. An important use case
for \Biocpkg{Rsamtools} is as a starting point for creating \R{} objects
suitable for a diversity of work flows, e.g., \Rclass{AlignedRead}
objects in the \Rpackage{ShortRead} package (for quality assessment
and read manipulation), or \Rclass{GAlignments} objects in
\Rpackage{GenomicAlignments} package (for RNA-seq and other
applications). Those desiring more functionality are encouraged to
explore \software{samtools} and related software efforts.

\section{Input}

The essential capability provided by \Biocpkg{Rsamtools} is \bam{}
input. This is accomplished with the \Rfunction{scanBam}
function. \Rfunction{scanBam} takes as input the name of the \bam{}
file to be parsed.  In addition, the \Rcode{param} argument
determines which genomic coordinates of the \bam{} file, and what
components of each record, will be input. \R{param} is an
instance of the \Rclass{ScanBamParam} class. To create a
\Rcode{param} object, call \Rfunction{ScanBamParam}. Here we create
a \Rcode{param} object to extract reads aligned to three distinct
ranges (one on \Rcode{seq1}, two on \Rcode{seq2}). From each of read
in those ranges, we specify that we would like to extract the
reference name (\Rcode{rname}, e.g., \Rcode{seq1}), strand, alignment
position, query (i.e., read) width, and query sequence:
<<ScanBamParam>>=
which <- IRangesList(seq1=IRanges(1000, 2000),
                     seq2=IRanges(c(100, 1000), c(1000, 2000)))
what <- c("rname", "strand", "pos", "qwidth", "seq")
param <- ScanBamParam(which=which, what=what)
@ 
%% 
Additional information can be found on the help page for
\Rfunction{ScanBamParam}.  Reading the relevant records from the
\bam{} file is accomplished with
<<scanBam, keep.source=TRUE>>=
bamFile <- 
    system.file("extdata", "ex1.bam", package="Rsamtools")
bam <- scanBam(bamFile, param=param)
@ 

Like \Rfunction{scan}, \Rfunction{scanBam} returns a \Robject{list} of
values. Each element of the list corresponds to a range specified by
the \Rcode{which} argument to \Rfunction{ScanBamParam}. 
<<scanBam-elts-1>>=
class(bam)
names(bam)
@ 
%% 
Each element is itself a list, containing the elements specified by
the \Rcode{what} and \Rcode{tag} arguments to
\Rfunction{ScanBamParam}. 
<<scanBam-elts-2>>=
class(bam[[1]])
names(bam[[1]])
@ 
%% 
The elements are either basic \R{} or \Rpackage{IRanges} data types
<<scanBam-elts-class>>=
sapply(bam[[1]], class)
@ 
%% 
A paradigm for collapsing the list-of-lists into a single list is
<<scanBam-to-IRanges>>=
.unlist <- function (x)
{
    ## do.call(c, ...) coerces factor to integer, which is undesired
    x1 <- x[[1L]]
    if (is.factor(x1)) {
        structure(unlist(x), class = "factor", levels = levels(x1))
    } else {
        do.call(c, x)
    }
}
bam <- unname(bam) # names not useful in unlisted result
elts <- setNames(bamWhat(param), bamWhat(param))
lst <- lapply(elts, function(elt) .unlist(lapply(bam, "[[", elt)))
@ 
%% 
This might be further transformed, e.g., to a \Rclass{DataFrame}, with
<<lst-to-DataFrame>>=
head(do.call("DataFrame", lst))
@ 
Often, an alternative is to use a \Rclass{ScanBamParam} object with
desired fields specified in \Rcode{what} as an argument to
\Rfunction{GenomicAlignments::readGAlignments}; the specified fields
are added as columns to the returned \Rclass{GAlignments}.

The \bam{} file in the previous example includes an index, represented
by a separate file with extension \texttt{.bai}:
<<indexed-file>>=
list.files(dirname(bamFile), pattern="ex1.bam(.bai)?")
@ 
%% 
Indexing provides two significant benefits. First, an index allows a
\bam{} file to be efficiently accessed by range. A corollary is that
providing a \Rcode{which} argument to \Rfunction{ScanBamParam}
requires an index. Second, coordinates for extracting information from
a \bam{} file can be derived from the index, so a portion of a remote
\bam{} file can be retrieved with local access only to the index. For
instance, provided an index file exists on the local computer, it is
possible to retrieve a small portion of a \bam{} file residing on the
1000 genomes HTTP server. The url
\url{ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/data/NA19240/alignment/NA19240.chrom6.SLX.maq.SRP000032.2009_07.bam}
points to the \bam{} file corresponding to individual NA19240
chromosome~6 Solexa (Illumina) sequences aligned using MAQ. The remote
file is very large (about 10 GB), but the corresponding index file is
small (about 500 KB). With \Rcode{na19240url} set to the above
address, the following retrieves just those reads in the specified
range
<<bam-remote,eval=FALSE>>=
which <- IRangesList("6"=IRanges(100000L, 110000L))
param <- ScanBamParam(which=which, what=scanBamWhat())
na19240bam <- scanBam(na19240url, param=param)
@ 
%% 
Invoking \Rfunction{scanBam} without an index file, as above, first
retrieves the index file from the remote location, and then queries
the remote file using the index; for repeated queries, it is more
efficient to retrieve the index file first (e.g., with
\Rfunction{download.file}) and then use the local index as an argument
to \Rfunction{scanBam}. Many \bam{} files were created in a way that
causes \Rfunction{scanBam} to report that the ``EOF marker is
absent''; this message can safely be ignored.

\bam{} files may be read by functions in packages other than
\Biocpkg{Rsamtools}, in particular the \Rfunction{readGAlignments}
family of functions in \Rpackage{GenomicAlignments}.

Additional ways of interacting with \bam{} files include
\Rfunction{scanBamHeader} (to extract header information) and
\Rfunction{countBam} (to count records matching \Rcode{param}).
\Rfunction{filterBam} filters reads from the source file according to
the criteria of the \Rclass{ScanBamParam} parameter, writing reads
passing the filter to a new file.  The function \Rfunction{sortBam}
sorts a previously unsorted \bam{}, while The function
\Rfunction{indexBam} creates an index file from a sorted \bam{} file.

\Rfunction{readPileup} reads a \texttt{pileup} file created by
\software{samtools}, importing SNP, indel, or all variants into a
\Rclass{GRanges} object.

\subsection{Large bam files}

\bam{} files can be large, containing more information on more genomic
regions than are of immediate interest or than can fit in memory. The
first strategy for dealing with this is to select, using the
\Rcode{what} and \Rcode{which} arguments to
\Rfunction{ScanBamParam}, just those portions of the \bam{} file that
are essential to the current analysis, e.g., specifying
\Rcode{what=c(`rname', 'qname', 'pos')} when wishing to calculate
coverage of ungapped reads. 

When selective input of \bam{} files is still too memory-intensive,
the file can be processed in chunks, with each chunk
distilled to the derived information of interest. Chromosomes will
often be the natural chunk to process. For instance, here we write a
summary function that takes a single sequence name (chromosome) as
input, reads in specific information from the \bam{} file, and calculates
coverage over that sequence.
<<summaryFunction>>=
summaryFunction <- 
    function(seqname, bamFile, ...)
{
    param <- ScanBamParam(what=c('pos', 'qwidth'),
                          which=GRanges(seqname, IRanges(1, 1e7)),
                          flag=scanBamFlag(isUnmappedQuery=FALSE))
    x <- scanBam(bamFile, ..., param=param)[[1]]
    coverage(IRanges(x[["pos"]], width=x[["qwidth"]]))
}
@ 
%% 
This might be used as follows; it is an ideal candidate for evaluation
in parallel, e.g., using the \Rpackage{parallel} package and
\Rfunction{srapply} function in \Rpackage{ShortRead}.  
<<summaryByChromosome>>=
seqnames <- paste("seq", 1:2, sep="")
cvg <- lapply(seqnames, summaryFunction, bamFile)
names(cvg) <- seqnames
cvg
@ 
%% 
The result of the function (a coverage vector, in this case) will
often be much smaller than the input.

\section{Views}

The functions described in the previous section import data in to
\R{}. However, sequence data can be very large, and it does not always
make sense to read the data in immediately. Instead, it can be useful
to marshal \emph{references} to the data into a container and then act
on components of the container.  The \Rclass{BamViews} class provides
a mechanism for creating `views' into a set of \bam{} files. The view
itself is light-weight, containing references to the relevant \bam{}
files and metadata about the view (e.g., the phenotypic samples
corresponding to each \bam{} file).

One way of understanding a \Rclass{BamViews} instance is as a
rectangular data structure. The columns represent \bam{} files (e.g.,
distinct samples). The rows represent ranges (i.e., genomic
coordinates). For instance, a ChIP-seq experiment might identify a
number of peaks of high read counts.

\subsection{Assembling a \Rclass{BamViews} instance}

To illustrate, suppose we have an interest in caffeine metabolism in
humans. The `rows' contain coordinates of genomic regions associated
with genes in a KEGG caffeine metabolism pathway. The `columns'
represent individuals in the 1000 genomes project.  

To create the `rows', we identify possible genes that KEGG associates
with caffeine metabolism:
<<caffeine-kegg>>=
library(KEGG.db)
kid <- revmap(KEGGPATHID2NAME)[["Caffeine metabolism"]]
egid <- KEGGPATHID2EXTID[[sprintf("hsa%s", kid)]]
@ 
%% 
Then we use the appropriate \Rpackage{TxDb} package to
translate Entrez identifiers to obtain ranges of interest (one could
also use \Rpackage{biomaRt} to retrieve coordinates for non-model
organisms, perhaps making a \Robject{TxDb} object as outlined
in the \Rpackage{GenomicFeatures} vignette). We'll find that the names
used for chromosomes in the alignments differ from those used at
Ensembl, so \Rcode{seqlevels<-} is used to map between naming schemes
and to drop unused levels.
<<txdb-transcripts>>=
library(TxDb.Hsapiens.UCSC.hg18.knownGene)
bamRanges <- transcripts(TxDb.Hsapiens.UCSC.hg18.knownGene, 
                         filter=list(gene_id=egid))
seqlevels(bamRanges) <-                 # translate seqlevels
    sub("chr", "", seqlevels(bamRanges))
lvls <- seqlevels(bamRanges)            # drop unused levels
seqlevels(bamRanges) <- lvls[lvls %in% as.character(seqnames(bamRanges))]
@ 
%% 
Note that \Robject{bamRanges} `knows' the genome for which the
ranges are defined
<<bamRanges-genome>>=
unique(genome(bamRanges))
@ 

The details of creating the `columns' of BAM files. Here we retrieve a
vector of \bam{} file URLs (\Robject{slxMaq09}) from the package.
<<BamViews-parts>>=
slxMaq09 <- local({
    fl <- system.file("extdata", "slxMaq09_urls.txt", 
                      package="Rsamtools")
    readLines(fl)
})
@ 
%% 
We now assemble the \Rclass{BamViews} instance from these objects; we
also provide information to annotated the \bam{} files (with the
\Rcode{bamSamples} function argument, which is a \Rclass{DataFrame}
instance with each row corresponding to a \bam{} file) and the
instance as a whole (with \Rcode{bamExperiment}, a simple named
\Rclass{list} containing information structured as the user sees fit).
<<BamViews-construct, keep.source=TRUE>>=
bamExperiment <- 
    list(description="Caffeine metabolism views on 1000 genomes samples",
         created=date())
bv <- BamViews(slxMaq09, bamRanges=bamRanges, 
               bamExperiment=bamExperiment)
metadata(bamSamples(bv)) <- 
    list(description="Solexa/MAQ samples, August 2009",
         created="Thu Mar 25 14:08:42 2010")
@ 

\subsection{Using \Rclass{BamViews} instances}

The \Rclass{BamViews} object can be queried for its component parts, e.g.,
<<BamViews-query>>=
bamExperiment(bv)
@ 
%% 
More usefully, Methods in \Biocpkg{Rsamtools} are designed to work
with \Rclass{BamViews} objects, retrieving data from all files in the
view. These operations can take substantial time and require reliable
network access.

To illustrate, the following code (not evaluated when this vignette
was created) downloads the index files associated with the
\Robject{bv} object
<<bamIndicies,eval=FALSE>>=
bamIndexDir <- tempfile()
indexFiles <- paste(bamPaths(bv), "bai", sep=".")
dir.create(bamIndexDir)
idxFiles <- mapply(download.file, indexFiles,
                   file.path(bamIndexDir, basename(indexFiles)) ,
                   MoreArgs=list(method="curl"))
@ 
%% 
and then queries the 1000 genomes project for reads overlapping our
transcripts; by loading the \Rpackage{parallel} package, we tell
\Biocpkg{Rsamtools} to use as many cores as are available on our machine (the
\Rpackage{parallel} package does not provide parallel functionality on
 Windows computers)
<<readGAlignments, eval=FALSE>>=
library(GenomicAlignments)
library(parallel)
options(srapply_fapply="parallel", mc.cores=detectCores())
olaps <- readGAlignments(bv)
@ 
%% 
The resulting object is about 11 MB in size. To avoid having to
download this data each time the vignette is run, we instead load it
from disk
%% 
<<olaps>>=
library(GenomicAlignments)
load(system.file("extdata", "olaps.Rda", package="Rsamtools"))
olaps
head(olaps[[1]])
@ 
%% 
There are \Sexpr{length(olaps[[1]])} reads in
\verb|\Sexpr{names(olaps)[[1]]}| overlapping at least one of our
transcripts. It is easy to explore this object, for instance
discovering the range of read widths in each individual.
<<read-lengths>>=
head(t(sapply(olaps, function(elt) range(qwidth(elt)))))
@ 

Suppose we were particularly interested in the first transcript, which
has a transcript id
\Sexpr{mcols(bamRanges(bv))[["tx_name"]][1]}. Here we
extract reads overlapping this transcript from each of our samples. As
a consequence of the protocol used, reads aligning to either strand
could be derived from this transcript. For this reason, we set the
strand of our range of interest to \texttt{*}. We use the
\Rfunction{endoapply} function, which is like \Rfunction{lapply} but
returns an object of the same class (in this case,
\Sexpr{as.vector(class(olaps))}) as its first argument.
<<focal>>=
rng <- bamRanges(bv)[1]
strand(rng) <- "*"
olap1 <- endoapply(olaps, subsetByOverlaps, rng)
olap1 <- lapply(olap1, "seqlevels<-", value=as.character(seqnames(rng)))
head(olap1[[24]])
@ 
%% 
To carry the example a little further, we calculate coverage of each
sample:
<<olap-cvg, keep.source=TRUE>>=
minw <- min(sapply(olap1, function(elt) min(start(elt))))
maxw <- max(sapply(olap1, function(elt) max(end(elt))))
cvg <- endoapply(olap1, coverage,
                 shift=-start(ranges(bamRanges[1])),
                 width=width(ranges(bamRanges[1])))
cvg[[1]]
@ 
%% 
Since the example includes a single region of uniform width across all
samples, we can easily create a coverage matrix with rows representing
nucleotide and columns sample and, e.g., document variability between
samples and nucleotides
<<olap-cvg-as-m>>=
m <- matrix(unlist(lapply(cvg, lapply, as.vector)),
            ncol=length(cvg))
summary(rowSums(m))
summary(colSums(m))
@ 

\section{Directions}

This vignette has summarized facilities in the \Biocpkg{Rsamtools}
package. Important additional packages include \Rpackage{GenomicRanges}
(for representing and manipulating gapped alignments),
\Rpackage{ShortRead} for I/O and quality assessment of ungapped short
read alignments, \Rpackage{Biostrings} and \Rpackage{BSgenome} for DNA
sequence and whole-genome manipulation, \Rpackage{IRanges} for
range-based manipulation, and \Rpackage{rtracklayer} for I/O related
to the UCSC genome browser. Users might also find the interface to the
integrative genome browser (IGV) in \Rpackage{SRAdb} useful for
visualizing \bam{} files.

<<sessionInfo>>=
packageDescription("Rsamtools")
sessionInfo()
@ 

\appendix

\section{Assembling a \Rclass{BamViews} instance}

\subsection{Genomic ranges of interest}

\subsection{BAM files}

\emph{Note}: The following operations were performed at the time the
vignette was written; location of on-line resources, in particular the
organization of the 1000 genomes \bam{} files, may have changed.

We are interested in collecting the URLs of a number of \bam{} files
from the 1000 genomes project. Our first goal is to identify files
that might make for an interesting comparison.  First, let's visit the
1000 genomes FTP site and discover available files. We'll use the
\Rpackage{RCurl} package to retrieve the names of all files in an
appropriate directory
<<bam-avail, keep.source=TRUE, eval=FALSE>>=
library(RCurl)
ftpBase <- 
    "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/data/"
indivDirs <- 
    strsplit(getURL(ftpBase, ftplistonly=TRUE), "\n")[[1]]
alnDirs <- 
    paste(ftpBase, indivDirs, "/alignment/", sep="")
urls0 <- 
    strsplit(getURL(alnDirs, dirlistonly=TRUE), "\n")
@ 
%% 
From these, we exclude directories without any files in them, select
only the \bam{} index (extension \texttt{.bai}) files, and choose those
files that exactly six \texttt{'.'} characters in their name.
<<bam-index,eval=FALSE>>=
urls <- urls[sapply(urls0, length) != 0]
fls0 <- unlist(unname(urls0))
fls1 <- fls0[grepl("bai$", fls0)]
fls <- fls1[sapply(strsplit(fls1, "\\."), length)==7]
@ 
%% 
After a little exploration, we focus on those files obtained form
Solexa sequencing, aligned using \texttt{MAQ}, and archived in August,
2009; we remove the \texttt{.bai} extension, so that the URL refers to
the underlying file rather than index. There are 24 files.
<<slxMaq09,eval=FALSE>>=
urls1 <- 
    Filter(function(x) length(x) != 0,
           lapply(urls, 
                  function(x) x[grepl("SLX.maq.*2009_08.*bai$", x)]))
slxMaq09.bai <- 
   mapply(paste, names(urls1), urls1, sep="", USE.NAMES=FALSE)
slxMaq09 <- sub(".bai$", "", slxMaq09.bai) #$
@ 

As a final step to prepare for using a \Rclass{BamViews} file, we
create local copies of the \emph{index} files. The index files are
relatively small (about 190 Mb total).
<<bamIndicies,eval=FALSE>>=
bamIndexDir <- tempfile()
dir.create(bamIndexDir)
idxFiles <- mapply(download.file, slxMaq09.bai, 
                   file.path(bamIndexDir, basename(slxMaq09.bai)) ,
                   MoreArgs=list(method="curl"))
@ 

\end{document}