%\VignetteIndexEntry{Making and Utilizing TranscriptDb Objects} %\VignetteKeywords{annotation} %\VignettePackage{GenomicFeatures} \documentclass[11pt]{article} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \title{Making and Utilizing TranscriptDb Objects} \author{Marc Carlson, Patrick Aboyoun, Herve Pages, Seth Falcon and Martin Morgaan} \SweaveOpts{keep.source=TRUE} \begin{document} \maketitle \section{Introduction} The \Rpackage{GenomicFeatures} package retrieves and manages transcript-related features from UCSC Genome Bioinformatics and BioMart data resources. The package is useful for ChIP-chip, ChIP-seq, and RNA-seq analyses. <>= library(GenomicFeatures) @ \section{Transcript Metadata} \subsection{\Rclass{TranscriptDb} Objects} The \Rpackage{GenomicFeatures} package uses \Rclass{TranscriptDb} objects to store transcript metadata. This class is designed to map the 5' untranslated regions (UTRs), protein coding sequence (CDSs), and 3' UTRs for a set mRNA transcripts to their associated genome, where the combined 5' UTR, CDS, and 3' UTR region for an mRNA transcript either originated from a single exon or from multiple exons that were post-transcriptionally spliced. As the suffix of the class name suggests, \Rclass{TranscriptDb} objects are backed by a SQLite database. This database manages genomic locations and the relationships between pre-processed mRNA transcripts, exons, protein coding sequences, and their related gene IDs. \subsection{Creating New \Rclass{TranscriptDb} Objects} There are three methods for creating new \Rclass{TranscriptDb} objects in the \Rpackage{GenomicFeatures} package: \begin{enumerate} \item Use \Rfunction{makeTranscriptDbFromUCSC} to download from UCSC Genome Bioinformatics. \item Use \Rfunction{makeTranscriptDbFromBiomart} to download from BioMart. \item Use a \Rclass{data.frame} containing transcript metadata with \Rfunction{makeTranscriptDb} to make a custom database. \end{enumerate} The function \Rfunction{makeTranscriptDbFromUCSC} downloads UCSC Genome Bioinformatics transcript tables (e.g. \Rcode{"knownGene"}, \Rcode{"refGene"}, \Rcode{"ensGene"}) for a genome build (e.g. \Rcode{"mm9"}, \Rcode{"hg19"}). Use the \Rfunction{supportedUCSCtables} utility function to get the list of supported tables. %% <>= supportedUCSCtables() @ <>= mm9KG <- makeTranscriptDbFromUCSC(genome = "mm9", tablename = "knownGene") @ Retrieve data from BioMart by specifying the `mart' and data set (not all BioMart data sets are currently supported): %% <>= mmusculusEnsembl <- makeTranscriptDbFromBiomart(biomart = "ensembl", dataset = "mmusculus_gene_ensembl") @ The function \Rfunction{makeTranscriptDb} creates \Rclass{TransctriptDb} objects from \Rclass{data.frame} objects. \subsection{Saving and Loading a \Rclass{TranscriptDb} Object} \Rclass{TranscriptDb} objects can be saved as SQLite files for future access (e.g., to easily reproduce results with identical genomic feature data at a later date, or for access from programs other than R). %% <>= saveFeatures(mm9KG, file="fileName.sqlite") @ %% Load a saved \Rclass{TranscriptDb} object with \Rfunction{loadFeatures}: <>= mm9KG <- loadFeatures("fileName.sqlite") @ %% For instance, a sample of UCSC known genes is included in \Rpackage{GenomicFeatures}. <>= exampleFile <- system.file("extdata", "UCSC_knownGene_sample.sqlite", package="GenomicFeatures") txdb <- loadFeatures(exampleFile) txdb @ %% In addition to transcript data, the object contains information about how it was created (e.g., the number of transcripts, exons, and coding sequence rows) and about software versions and creation dates. \section{Retrieving Transcript, Exon, and Coding Sequence Ranges} \subsection{Working with Basic Features} The most basic operation for getting data out of a transcriptDb object is to simple retrieve the ranges of exons, transcripts or coding sequences into a GRanges object. For this purpose, the functions \Rfunction{transcripts}, \Rfunction{exons}, and \Rfunction{cds} have been provided. So as an example, you can use \Rfunction{transcripts} to simply retrieve all the transcripts present in a \Rclass{TranscriptDb} object and return them as a single \Rclass{GRanges} object like this: <>= txdb <- loadFeatures(system.file("extdata", "UCSC_knownGene_sample.sqlite", package="GenomicFeatures")) GR <- transcripts(txdb) GR @ Now suppose that you wanted to further refine things and retrieve only the things that are present on the plus strand of chromosome 1. Well the \Rfunction{transcripts} function will allow you to do things like that too. <>= GR <- transcripts(txdb, vals <- list(tx_chrom = "chr1", tx_strand = "+")) GR @ The \Rfunction{exons}, and \Rfunction{cds} functions have a similar role to play and are used to retrieve just the exons or just the coding sequences. \subsection{Working with Grouped Features} Often it will not be enough to just get exons, cds or transcripts only. Sometimes you will want to give greater consideration to the context of the data. In these cases, you will want to think of the ranged annotation data as being grouped in a specific way, for example, you might want to consider that the transcripts are each associated with a specific gene or the exon composition of specific transcripts. These relationships are maintained with \Rclass{TranscriptDb} objects and can be accessed through the \Rfunction{transcriptsBy}, \Rfunction{exonsBy}, and \Rfunction{cdsBy} functions. So for example, to extract all the transcripts grouped by their exons you can call: <>= txdb <- loadFeatures(system.file("extdata", "UCSC_knownGene_sample.sqlite", package="GenomicFeatures")) GRList <- transcriptsBy(txdb, "gene") GRList @ Similarly, to extract all the exons for each transcript you can call: <>= txdb <- loadFeatures(system.file("extdata", "UCSC_knownGene_sample.sqlite", package="GenomicFeatures")) GRList <- exonsBy(txdb, "tx") GRList @ These functions return \Rclass{GRangesList} objects that contain locations and identifiers all grouped according to the type of feature specified. These objects can be used in downstream analyses such as using \Rfunction{findOverlaps} contextualize the alignments from high-throughput sequencing. It is important to consider the context created when grouping by a particular feature. For example, in the 1st example, where we grouped by genes, the name of the features is an Entrez Gene ID. If the database had been based instead on Ensembl sources, it would be an Ensembl Gene ID. However, in the second example where we group by transcript, we see that the groups are labeled by an ID that is not a traditional transcript ID. In this second case, we have been given an internally assigned database ID. This is because some sources may choose to overload their use of traditional transcript IDs in ways that would make the existence of our database schema impossible. So in the second case we have to use an internal id to guarantee uniqueness. This will happen whenever you group by anything that is not a gene. So when you want to use traditional transcript IDs you can look them up using the appropriate basic accessors described in the preceding section. Another case where context can matter is when considering the order of the elements returned. In most cases the grouped elements will be listed in the order that they occur along the chromosome. But in the context where you have grouped exons or CDS by transcripts, they will instead be grouped according to their position along the transcript itself. This is important because alternative splicing can mean that the order along the transcript can be different from that along the chromosome. \subsection{Prespecfied grouping functions} An important kind of grouping functions are the ones that group in a prespecified manner. The \Rfunction{intronsByTranscript}, \Rfunction{fiveUTRsByTranscript} and \Rfunction{threeUTRsByTranscript} functions are like this. These functions retrieve the \Rclass{GRangesList} objects for the introns, 5' UTR's, and 3' UTR's grouped by transcript respectively. <>= intronsByTranscript(txdb) fiveUTRsByTranscript(txdb) threeUTRsByTranscript(txdb) @ \subsection{Convenience functions} The \Rfunction{transcriptsByOverlaps}, \Rfunction{exonsByOverlaps} and \Rfunction{cdsByOverlaps} functions return a \Rclass{GRangesList} object containing data about transcripts, exons, or coding sequences that overlap genomic coordinates specified by a \Rclass{GRanges} object. %% <>= library(org.Mm.eg.db) set.seed(0L) idx <- sample(length(org.Mm.egCHRLOC), 50) tbl <- unique(merge(toTable(org.Mm.egCHRLOC[idx]), toTable(org.Mm.egCHRLOCEND[idx]))) gr <- with(tbl, { lvls <- paste("chr", c(1:19, "X", "Y", "MT", "Un"), sep="") GRanges(seqnames=factor(paste("chr", Chromosome, sep=""), levels=lvls), ranges=IRanges(abs(start_location), abs(end_location)), strand=ifelse(start_location >= 0, "+", "-"), egid=gene_id) }) transcriptsByOverlaps(txdb, gr) @ The convenience functions can be a great shortcut, but because they have to make assumptions about how the results are compared and represented, they are ultimately not as flexible as just using the basic and grouping accessors in combination with \Rfunction{findOverlaps}. \section{Examples} Let's suppose that you have run an experiment. After mapping all your reads to a genome and collapsing them into a set of ranges, you want to find out what genomic Features a particular range overlaps with. How would be the usual way to proceed? Here is an example: \subsection{Retrieving data from an RNA-seq experiment} Let's consider the case where you have some RNA-seq data and you want to convert your ranges into counts representing how hits per transcript. For this example, let's also assume that you are only interested in counting ranges that overlap with exons (not introns). First lets say that this is your data: <>= gr <- GRanges(seqnames = rep("chr5",4), ranges = IRanges(start = c(244620,244670,245804,247502), end = c(244652,244702,245836,247534)), strand = rep("+",4)) @ From our \Rclass{TranscriptDb} object, we want to recover the annotations for all of the relevant exons, but grouped according to their transcripts. Therefore, we want to use \Rfunction{exonsby} and group them by transcripts. <>= annotGr <- exonsBy(txdb, "tx") @ Then we need to used \Rfunction{findOverlaps} to learn which of our data ranges, \Robject{gr}, will overlap with the in exons that we have grouped by transcripts. <>= OL <- findOverlaps(annotGr, gr) @ Finally, once we have called \Rfunction{findOverlaps} we can subset out the annotations that meet our criteria. The \Rfunction{subjectHits} method will allow us to retrieve only the things that overlapped with the subject in our original \Rfunction{findOverlaps} call. And once we have subsetted out annotations in this way, the length of the resulting \Rclass{GRangesList} object is also the number of transcripts that overlap with our data. <>= tdata <- annotGr[subjectHits(OL),] tdata length(tdata) @ By using \Rfunction{findOverlaps} along with the different accessors in this way, it is possible to connect any data that has been represented as a \Rclass{GRanges} object with the annotations stored in a \Rclass{TranscriptDb} object. Calling \Rfunction{findOverlaps} along with the appropriate \Rclass{GRanges} object not only allows users to quickly determine what has overlapped, but also controls what criteria are used for determining whether an overlap has occurred. This can be done by passing in an alternate \Rfunarg{type} parameter to \Rfunction{findOverlaps}. In addition, because the basic accesors allow for the users to retrieve data grouped in different ways, the user has control over which parts of a transcript or gene are included in the overlap. \section{Session Information} The version number of R and packages loaded for generating the vignette were: <>= sessionInfo() @ \end{document}