%
% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%
%\VignetteIndexEntry{Annotation Overview}
%\VignetteDepends{Biobase, genefilter, annotate, hgu95av2.db}
%\VignetteKeywords{Expression Analysis, Annotation}
%\VignettePackage{annotate}
\documentclass{article}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}

\usepackage{hyperref}
\usepackage[authoryear,round]{natbib}
\usepackage{times}

\bibliographystyle{plainnat}

\begin{document}

\title{Bioconductor: Annotation Package Overview}
\maketitle{}

\section{Overview}

In its current state the basic purpose of \Rpackage{annotate} is to
supply interface routines that support user actions that rely on the
different meta-data packages provided through the Bioconductor Project.
There are currently four basic categories of functions in
\Rpackage{annotate}.
\begin{itemize}
\item Interface functions for getting data out of specific meta-data
  libraries.
\item Functions that support querying the different web services
  provided by the National Library of Medicine (NLM) and the National
  Center for Biotechnology Information (NCBI).
\item Functions that support organizing and structuring chromosomal
  location data to support some of the gene plotting and gene finding
  routines in \Rpackage{geneplotter}.
\item Functions that produce HTML output, hyperlinked to different web
  resources, from gene lists.
\end{itemize}

We will briefly describe the second, third and fourth of these and then,
for the remainder of this vignette, concentrate on the first category.
The other three have their own vignettes.

There are two different but complementary strategies for accessing
meta-data. One is to use highly curated data that have been assembled
from many different sources, and a second is to rely on on-line sources.
The former tends to be less current but more comprehensive, while the
latter tends to be current but can be less comprehensive and difficult
to reproduce, as the sources themselves are constantly evolving.

To address the second of these we develop and distribute software that
can take advantage of the web services that are provided. Perhaps the
richest source of these is provided by the National Library of Medicine.
Further details on accessing these resources are provided in
\cite{PubMedRnews} and \cite{PubMedVignette}.

While the chromosomal location is not always of interest, in certain
situations, especially the study of genetic diseases, it is important
that we be able to associate particular genes with locations on
chromosomes. We provide a complete set of functions that map from
LocusLink identifiers to chromosomal location. Examples and further
discussion are provided in \cite{ChromLocVignette}.

Producing output that helps users navigate and understand the results
of an analysis is a very important aspect of any data analysis. Since
one of the primary tasks carried out when analysing gene expression
data is to create lists of interesting genes, we have provided some
simple infrastructure that will help produce a hyperlinked output page
for any given set of genes. A more substantial and comprehensive
approach has been taken by C. Smith in the \Rpackage{annaffy} package.
A vignette for using the functions provided in the \Rpackage{annotate}
package is provided in \cite{HTMLVignette}.

We now turn our attention to interfaces to the meta-data packages and
how and when they will be useful.

The \Rpackage{annotate} package provides interface support for the
different meta-data packages
(\url{http://www.bioconductor.org/data/metaData.html}) that are
available through the Bioconductor Project.
We have tried to make these different meta-data packages modular, in
the sense that all of them should have similar sets of mappings from
manufacturer IDs to specific biological data such as chromosomal
location, GO, and LocusLink identifiers.

Annotation in the Bioconductor Project is handled by two systems. One,
\Rpackage{AnnBuilder}, is a system for assembling and relating the data
from various sources. It is much more {\em industrial} and takes
advantage of many different non-R tools such as Perl and XML. The
second is \Rpackage{annotate}. This package is designed to provide
experiment-level annotation resources and to help users associate the
outputs of \Rpackage{AnnBuilder} with their particular data.

The structure of the meta-data packages will need to evolve over time,
and by making use of the functions provided in \Rpackage{annotate},
users and developers should shield themselves from many of the changes.
We will endeavor to keep the \Rpackage{annotate} interfaces constant.

Any given experiment typically involves a set of known identifiers
(probes in the case of a microarray experiment). These identifiers are
typically unique (for any manufacturer). This holds true for any of the
standard databases such as LocusLink. However, when the identifiers
from one source are linked to the identifiers from another, there does
not need to be a one-to-one relationship. For example, several
different Affymetrix accession numbers correspond to a single LocusLink
identifier. Thus, when going in one direction (Affymetrix to LocusLink)
we have no problem, but when going in the other we need some mechanism
for dealing with the multiplicity of matches.

\subsection*{A Short Example}

We will consider the Affymetrix human gene chip, \texttt{hgu95av2},
for our example. We first load this chip's package and
\Rpackage{annotate}.
<<>>=
library("annotate")
library("hgu95av2.db")
ls("package:hgu95av2.db")
@

We see the listing of twenty or thirty different R objects in this
package. Most of them represent mappings from the identifiers on the
Affymetrix chip to the different biological resources. You can find
out more about them by using the R help system, since each has a manual
page that describes the data together with other information such as
where, when and what files were used to construct the mappings. Also,
each meta-data package has one object that has the same name as the
package basename; in this case it is \Robject{hgu95av2}. This is a
function, and it can be invoked to find out some of the different
statistics regarding the mappings that were done. Its manual page lists
all data resources that were used to create the meta-data package.

<<>>=
hgu95av2()
@

Now let's consider a specific object, say the
\Robject{hgu95av2ENTREZID} object.

<<>>=
hgu95av2ENTREZID
@

If we type its name we see that it is an R \texttt{environment}; all
this means is that it is a special data type that is efficient at
storing and retrieving mappings between symbols (the Affymetrix
identifiers) and associated values (the LocusLink IDs, now known as
Entrez Gene IDs).

We can retrieve values from this environment in many different ways
(the reader is referred to the R help pages for more details on the
functions described briefly here). Suppose that we are interested in
finding the LocusLink ID for the Affymetrix probe \texttt{1000\_at}.
Then we can do it in any one of the following ways:

<<>>=
get("1000_at", envir=hgu95av2ENTREZID)
hgu95av2ENTREZID[["1000_at"]]
hgu95av2ENTREZID$"1000_at"
@

And in all cases we see that the LocusLink identifier is
\Sexpr{hgu95av2ENTREZID[["1000_at"]]}.

If you want to get more than one object from an environment you also
have a number of choices. You can extract them one at a time using
either a \texttt{for} loop, a member of the \Rfunction{apply} family
such as \Rfunction{sapply}, or \Rfunction{eapply}.
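
The one-at-a-time approaches can be sketched on a small hand-built
environment. The object \Robject{toyMap}, together with its keys and
values, is a hypothetical stand-in and not data from
\Rpackage{hgu95av2.db}; \Rfunction{sapply} is used here as one member
of the apply family.

<<>>=
## Toy stand-in for a probe-to-gene mapping; the keys and values are
## made up for illustration and are not taken from hgu95av2.db.
toyMap <- new.env()
assign("probeA_at", "101", envir = toyMap)
assign("probeB_at", "102", envir = toyMap)
assign("probeC_at", "103", envir = toyMap)

wanted <- c("probeA_at", "probeC_at")

## One at a time with a for loop.
byLoop <- character(length(wanted))
names(byLoop) <- wanted
for (id in wanted) {
    byLoop[id] <- get(id, envir = toyMap)
}

## One at a time with sapply over the keys of interest.
bySapply <- sapply(wanted, function(id) get(id, envir = toyMap))

## eapply visits every entry in the environment, so subset afterwards.
byEapply <- unlist(eapply(toyMap, identity))[wanted]
@
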
It will be more efficient to use \Rfunction{mget} since it does the
retrieval using internal code and is optimized. You may also want to
turn the environment into a named list so that you can perform
different list operations on it; this can be done using the function
\Rfunction{contents} or \Rfunction{as.list}.

<<>>=
LLs = as.list(hgu95av2ENTREZID)
length(LLs)
names(LLs)[1:10]
@

\section{Developer Tips}

Software developers who are building other tools that might make use of
the infrastructure produced here should make use of the
\Rfunction{get*} family of functions. Examples include
\Rfunction{getEG}, \Rfunction{getGO} and so on. There are two reasons
for using these functions. First, they allow you to implement
functionality that is independent of the meta-data packages. Since each
of these functions takes two arguments, one a list of the
manufacturer's IDs and the second the \textit{basename} of the
annotation package, these functions will provide the correct results
for all annotation packages.

A second reason to make use of these interface functions is that they
are guaranteed not to change. The underlying structure of the meta-data
packages may change, and developers who access it directly will find
that their code needs to be updated regularly. But developers who make
use of these interface functions will find that their code needs much
less maintenance.

\section{Session Information}

The version number of R and packages loaded for generating the vignette
were:

<<>>=
sessionInfo()
@

\bibliography{annotate}

\end{document}