Bioconductor enables the analysis and comprehension of high- throughput genomic data. We have a vast number of packages that allow rigorous statistical analysis of large data while keeping technological artifacts in mind. Bioconductor helps users place their analytic results into biological context, with rich opportunities for visualization. Reproducibility is an important goal in Bioconductor analyses. Different types of analysis can be carried out using Bioconductor, for example
For these analyses, one typically imports and works with diverse sequence-related file types, including fasta, fastq, BAM, gtf, bed, and wig files, among others. Bioconductor packages support import, common and advanced sequence manipulation operations such as trimming, transformation, and alignment including quality assessment.
Here is a illustrative description elaborating the different file types at various stages in a typical analysis, with the package names (in pink boxes) that one will use for each stage.
The following packages illustrate the diversity of functionality available; all are in the release version of Bioconductor.
IRanges and GenomicRanges for range-based (e.g., chromosomal regions) calculation, data manipulation, and general-purpose data representation. Biostrings for DNA and amino acid sequence representation, alignment, pattern matching (e.g., primer removal), and data manipulation of large biological sequences or sets of sequences. ShortRead for working with FASTQ files of short reads and their quality scores.
Rsamtools and GenomicAlignments for aligned read (BAM file) I/O and data manipulation. rtracklayer for import and export of diverse data formats (e.g., BED, WIG, bigWig, GTF, GFF) and manipualtion of tracks on the UCSC genome browser.
BSgenome for accessing and manipulating curated whole-genome representations. GenomicFeatures for annotation of sequence features across common genomes, biomaRt for access to Biomart databases.
SRAdb for querying and retrieving data from the Sequence Read Archive.
Bioconductor packages are organized by biocViews. Some of the entries under Sequencing and other terms, and representative packages, include:
ChIPSeq, e.g.,DiffBind, csaw, ChIPseeker, ChIPQC.
SNPs and other variants, e.g., VariantAnnotation, VariantFiltering, h5vc.
CopyNumberVariation e.g., DNAcopy, crlmm, fastseg.
Microbiome and metagenome sequencing, e.g., metagenomeSeq, phyloseq, DirichletMultinomial.
Many Bioconductor packages rely heavily on the IRanges / GenomicRanges infrastructure. Thus we will begin with a quick introduction to these and then cover different file types.
The GenomicRanges
package allows us to associate a range of chromosome coordinates with a
sequence name (e.g., chromosome) and a strand. Such genomic ranges are
very useful for describing both data (e.g., the coordinates of aligned
reads, called ChIP peaks, SNPs, or copy number variants) and annotations
(e.g., gene models, Roadmap Epigenomics regulatory elements, known
clinically relevant variants from dbSNP). GRanges
is an object
representing a vector of genomic locations and associated annotations.
Each element in the vector is comprised of a sequence name, a range, a
strand, and optional metadata (e.g. score, GC content, etc.).
library(GenomicRanges)
GRanges(seqnames=Rle(c('chr1', 'chr2', 'chr3'), c(3, 3, 4)),
IRanges(1:10, width=5), strand='-',
score=101:110, GC = runif(10))
Genomic ranges can be created ‘by hand’, as above, but are often the
result of importing data (e.g., via
GenomicAlignments::readGAlignments()
) or annotation (e.g., via
GenomicFeatures::select()
or rtracklayer::import()
of BED, WIG, GTF,
and other common file formats). Use help()
to list the help pages in
the GenomicRanges
package, and vignettes()
to view and access available vignettes.
help(package="GenomicRanges")
vignette(package="GenomicRanges")
Some of the common operations on GRanges
include
findOverlaps(query, subject)
and nearest(query, subject)
, which
identify the ranges in query
that overlap ranges in subject
, or the
range in subject
nearest to `query. These operations are useful both
in data analysis (e.g., counting overlaps between aligned reads and gene
models in RNAseq) and comprehension (e.g., annotating genes near ChIP
binding sites).
Biostrings classes
(e.g., DNAStringSet
) are used to represent DNA or amino acid
sequences. In the example below we will construct a DNAString and show
some manipulations.
library(Biostrings)
d <- DNAString("TTGAAAA-CTC-N")
length(d) #no of letters in the DNAString
## [1] 13
We will download all Homo sapiens cDNA sequences from the FASTA file ‘Homo_sapiens.GRCh38.cdna.all.fa’ from Ensembl using AnnotationHub.
library(AnnotationHub)
ah <- AnnotationHub()
This file is downloaded as a FaFile object
ah2 <- query(ah, c("fasta", "homo sapiens", "Ensembl"))
fa <- ah2[["AH18522"]]
fa
## class: FaFile
## path: /var/lib/jenkins//.AnnotationHub/22617
## index: /var/lib/jenkins//.AnnotationHub/25666
## isOpen: FALSE
## yieldSize: NA
The sequences in the file can be read in using readDNAStringSet()
from
the Biostrings package.
The sequences are returned as a DNAStringSet object.
readDNAStringSet(path(fa))
## A DNAStringSet instance of length 170893
## width seq names
## [1] 9 CCTTCCTAC ENST00000434970 h...
## [2] 8 GAAATAGT ENST00000415118 h...
## [3] 13 ACTGGGGGATACG ENST00000448914 h...
## [4] 16 TGACTACGGTGACTAC ENST00000431870 h...
## [5] 16 TGACTACAGTAACTAC ENST00000414852 h...
## ... ... ...
## [170889] 3808 CACGCTGGCCACTAGCTGCCTCACTCTTAT...CTACTGCACCAGTCAGAGGCACCAAGCCTG ENST00000444082 h...
## [170890] 859 CAGCACCATGTCGCTCATGGTCATCATCAT...CTGCAGGGAACAGAACAGTGAACAGCGAGG ENST00000615390 h...
## [170891] 115 TGTGTCCAGAGCATGTTGGCTGAAGTCCCA...AAATGTACAAGTCTTCAGGGCTGGGTGCGG ENST00000512197 h...
## [170892] 204 ATGTCCGTGTGAGGGTTCAAGGCCAAGCTG...CGTCTAGAAGATACTAAGGAGAACAGTTCG ENST00000414573 h...
## [170893] 797 GGTGACAATCTCAAAGCTAAAACTTCCAAA...TTGGCAATGGAGGAATGCATCCTGCAGGCT ENST00000428912 h...
Alternatively, we can use tools from the Rsamtools package to selectively load a subset of the sequences only.
library(Rsamtools)
idx <- scanFaIndex(fa)
idx
## GRanges object with 170893 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] ENST00000434970 [1, 9] *
## [2] ENST00000415118 [1, 8] *
## [3] ENST00000448914 [1, 13] *
## [4] ENST00000431870 [1, 16] *
## [5] ENST00000414852 [1, 16] *
## ... ... ... ...
## [170889] ENST00000444082 [1, 3808] *
## [170890] ENST00000615390 [1, 859] *
## [170891] ENST00000512197 [1, 115] *
## [170892] ENST00000414573 [1, 204] *
## [170893] ENST00000428912 [1, 797] *
## -------
## seqinfo: 170893 sequences from an unspecified genome
The index is returned as a GRanges object. Then we use getSeq()
to
load only the sequences indicated by param
.
long <- idx[width(idx) > 82000]
getSeq(fa, param=long)
## A DNAStringSet instance of length 7
## width seq names
## [1] 101518 GCAGTCGTGCATTCCCAGCCTCGCCTCGGGTGT...ACAAAATAAAGCAAGCTATCTGCACCTCAAAA ENST00000342992
## [2] 82029 GAGCAGTCGTGCATTCCCAGCCTCGCCTCGGGT...ACAAAATAAAGCAAGCTATCTGCACCTCAAAA ENST00000460472
## [3] 109224 GAGCAGTCGTGCATTCCCAGCCTCGCCTCGGGT...ACAAAATAAAGCAAGCTATCTGCACCTCAAAA ENST00000589042
## [4] 104301 GAGCAGTCGTGCATTCCCAGCCTCGCCTCGGGT...ACAAAATAAAGCAAGCTATCTGCACCTCAAAA ENST00000591111
## [5] 104301 GAGCAGTCGTGCATTCCCAGCCTCGCCTCGGGT...ACAAAATAAAGCAAGCTATCTGCACCTCAAAA ENST00000615779
## [6] 82605 GAGCAGTCGTGCATTCCCAGCCTCGCCTCGGGT...ACAAAATAAAGCAAGCTATCTGCACCTCAAAA ENST00000342175
## [7] 82404 GAGCAGTCGTGCATTCCCAGCCTCGCCTCGGGT...ACAAAATAAAGCAAGCTATCTGCACCTCAAAA ENST00000359218
BSgenome packages inside
Bioconductor contain whole genome sequences as distributed by ENSEMBL,
NCBI and others. In this next example we will load the whole genome
sequence for Homo sapiens from UCSC’s hg19
build, and calculate the
GC content across chromosome 14.
library(BSgenome.Hsapiens.UCSC.hg19)
chr14_range = GRanges("chr14", IRanges(1, seqlengths(Hsapiens)["chr14"]))
chr14_dna <- getSeq(Hsapiens, chr14_range)
letterFrequency(chr14_dna, "GC", as.prob=TRUE)
## G|C
## [1,] 0.336276
ShortRead package from Bioconductor can be used for working with fastq files. Here we illustrate a quick example where one can read in multiple fasta files, collect some statistics and generate a report about the same.
BiocParallel is another package from Bioconductor which parallelizes this task and speeds up the process.
## 1. attach ShortRead and BiocParallel
library(ShortRead)
library(BiocParallel)
## 2. create a vector of file paths
fls <- dir("~/fastq", pattern="*fastq", full=TRUE)
## 3. collect statistics
stats0 <- qa(fls)
## 4. generate and browse the report
if (interactive())
browseURL(report(stats0))
Two useful functions in
ShortRead are
trimTails()
for processing FASTQ files, and FastqStreamer()
for
iterating through FASTQ files in manageable chunks (e.g., 1,000,000
records at a time).
The GenomicAlignments package is used to input reads aligned to a reference genome.
In this next example, we will read in a BAM file and specifically read in reads supporting an apparent exon splice junction spanning position 19653773 of chromosome 14.
The package
RNAseqData.HNRNPC.bam.chr14_BAMFILES
contains 8 BAM files. We will use only the first BAM file. We will load
the software packages and the data package, construct a GRanges with
our region of interest, and use summarizeJunctions()
to find reads in
our region of interest.
## 1. load software packages
library(GenomicRanges)
library(GenomicAlignments)
## 2. load sample data
library('RNAseqData.HNRNPC.bam.chr14')
bf <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[[1]], asMates=TRUE)
## 3. define our 'region of interest'
roi <- GRanges("chr14", IRanges(19653773, width=1))
## 4. alignments, junctions, overlapping our roi
paln <- readGAlignmentsList(bf)
j <- summarizeJunctions(paln, with.revmap=TRUE)
j_overlap <- j[j %over% roi]
## 5. supporting reads
paln[j_overlap$revmap[[1]]]
## GAlignmentsList object of length 8:
## [[1]]
## GAlignments object with 2 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end width njunc
## [1] chr14 - 66M120N6M 72 19653707 19653898 192 1
## [2] chr14 + 7M1270N65M 72 19652348 19653689 1342 1
##
## [[2]]
## GAlignments object with 2 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end width njunc
## [1] chr14 - 66M120N6M 72 19653707 19653898 192 1
## [2] chr14 + 72M 72 19653686 19653757 72 0
##
## [[3]]
## GAlignments object with 2 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end width njunc
## [1] chr14 + 72M 72 19653675 19653746 72 0
## [2] chr14 - 65M120N7M 72 19653708 19653899 192 1
##
## ...
## <5 more elements>
## -------
## seqinfo: 93 sequences from an unspecified genome
For a detailed tutorial on working with BAM files do check out this detailed Overlap Encodings vignette of GenomicAlignments.
VCF (Variant Call Files) describe SNP and other variants. The files contain meta-information lines, a header line with column names, and then (many!) data lines, each with information about a position in the genome, and optional genotype information on samples for each position.
Data are parsed into a VCF object with readVcf()
from
VariantAnnoation
library(VariantAnnotation)
fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")
An excellent workflow on working with Variants can be found
here. In particular
it is possible to read in specific components of the VCF file (e.g.,
readInfo()
, readGeno()
) and parts of the VCF at specific genomic
locations (using GRanges and the param = ScanVcfParam()
argument to
input functions).
rtracklayer import and export functions can read in many common file types, e.g., BED, WIG, GTF, …, in addition to querying and navigating the UCSC genome browser.
rtracklayer contains a ‘test’ BED file which we will read in here
library(rtracklayer)
test_path <- system.file("tests", package = "rtracklayer")
test_bed <- file.path(test_path, "test.bed")
test <- import(test_bed, format = "bed")
test
## UCSC track 'ItemRGBDemo'
## UCSCData object with 5 ranges and 5 metadata columns:
## seqnames ranges strand | name score itemRgb
## <Rle> <IRanges> <Rle> | <character> <numeric> <character>
## [1] chr7 [127471197, 127472363] + | Pos1 0 #FF0000
## [2] chr7 [127472364, 127473530] + | Pos2 2 #FF0000
## [3] chr7 [127473531, 127474697] - | Neg1 0 #FF0000
## [4] chr9 [127474698, 127475864] + | Pos3 5 #FF0000
## [5] chr9 [127475865, 127477031] - | Neg2 5 #0000FF
## thick blocks
## <IRanges> <IRangesList>
## [1] [127471197, 127472363] [ 1, 300] [ 501, 700] [1068, 1167]
## [2] [127472364, 127473530] [ 1, 250] [668, 1167]
## [3] [127473531, 127474697] [1, 1167]
## [4] [127474698, 127475864] [1, 1167]
## [5] [127475865, 127477031] [1, 1167]
## -------
## seqinfo: 2 sequences from an unspecified genome; no seqlengths
The file is returned to the user as a GRanges instance. A more detailed tutorial can be found here
AnnotationHub also contains a variety of genomic annotation files (eg BED, GTF, BigWig) which use import() from rtracklayer behind the scenes. For a detailed tutorial the user is referred to Annotation workflow and AnnotationHub HOW TO vignette
sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/local/lib/R/lib/libRblas.so
## LAPACK: /usr/local/lib/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
## [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] RNAseqData.HNRNPC.bam.chr14_0.15.0 BSgenome.Hsapiens.UCSC.hg19_1.4.0
## [3] BSgenome_1.45.3 AnnotationHub_2.9.20
## [5] VariantAnnotation_1.23.9 rtracklayer_1.37.3
## [7] ShortRead_1.35.3 BiocParallel_1.11.11
## [9] GenomicAlignments_1.13.6 Rsamtools_1.29.1
## [11] Biostrings_2.45.4 XVector_0.17.2
## [13] SummarizedExperiment_1.7.10 DelayedArray_0.3.21
## [15] matrixStats_0.52.2 Biobase_2.37.2
## [17] GenomicRanges_1.29.15 GenomeInfoDb_1.13.5
## [19] IRanges_2.11.19 S4Vectors_0.15.14
## [21] BiocGenerics_0.23.3 BiocStyle_2.5.41
## [23] rmarkdown_1.6 knitr_1.17
##
## loaded via a namespace (and not attached):
## [1] progress_1.1.2 lattice_0.20-35 htmltools_0.3.6
## [4] yaml_2.1.14 GenomicFeatures_1.29.13 interactiveDisplayBase_1.15.0
## [7] blob_1.1.0 XML_3.98-1.9 rlang_0.1.2
## [10] DBI_0.7 bit64_0.9-7 RColorBrewer_1.1-2
## [13] GenomeInfoDbData_0.99.1 stringr_1.2.0 zlibbioc_1.23.0
## [16] hwriter_1.3.2 evaluate_0.10.1 memoise_1.1.0
## [19] latticeExtra_0.6-28 biomaRt_2.33.4 BiocInstaller_1.27.5
## [22] httpuv_1.3.5 curl_3.0 AnnotationDbi_1.39.4
## [25] Rcpp_0.12.13 xtable_1.8-2 backports_1.1.1
## [28] mime_0.5 bit_1.1-12 digest_0.6.12
## [31] stringi_1.1.5 shiny_1.0.5 grid_3.4.2
## [34] rprojroot_1.2 tools_3.4.2 bitops_1.0-6
## [37] magrittr_1.5 RCurl_1.95-4.8 RSQLite_2.0
## [40] tibble_1.3.4 pkgconfig_2.0.1 Matrix_1.2-11
## [43] prettyunits_1.0.2 assertthat_0.2.0 httr_1.3.1
## [46] R6_2.2.2 compiler_3.4.2