%
% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%
%\VignetteIndexEntry{annaffy Primer}
%\VignetteDepends{Biobase, GO.db, KEGG.db, multtest, hgu95av2.db}
%\VignetteKeywords{HTML, URL}
%\VignettePackage{annaffy}
\documentclass[12pt]{article}

\usepackage{hyperref}

\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\begin{document}
\title{Textual Description of annaffy}
\author{Colin A. Smith}
\maketitle

\section*{Introduction}

\verb+annaffy+ is part of the Bioconductor project. It is designed
to help interface between Affymetrix analysis results and web-based
databases. It provides classes and functions for accessing those
resources both interactively and through statically generated HTML
pages.

The core functionality of \verb+annaffy+ depends on annotation
contained in Bioconductor data packages. The data packages are created
by the \verb+SQLForge+ code inside of another package called
\verb+AnnotationDbi+. It gathers annotation data from many diverse
sources and makes the information easily processed by
R. Preconstructed packages for most Affymetrix chips are available on
the Bioconductor web site.

\section{Loading Annotation Data}

\verb+annaffy+ represents each type of annotation data as a different
class. Currently implemented classes include:

\begin{description}
\item[aafSymbol] gene symbol
\item[aafDescription] gene description/name
\item[aafFunction] gene functional description
\item[aafChromosome] genomic chromosome
\item[aafChromLoc] location on the chromosome (in base pairs)
\item[aafGenBank] GenBank accession number
\item[aafLocusLink] LocusLink ids (almost never more than one)
\item[aafCytoband] mapped cytoband location
\item[aafUniGene] UniGene cluster ids (almost never more than one)
\item[aafPubMed] PubMed ids
\item[aafGO] Gene Ontology identifiers, names, types, and evidence
codes
\item[aafPathway] KEGG pathway identifiers and names
\end{description}

For each class, there is a constructor function with the same name.
It takes as arguments a vector of Affymetrix probe ids as well as
the chip name. The chip name corresponds to the name of the data
package that contains the annotation. If the data package for the
chip is not already loaded, the constructor will attempt to load
it. The constructor returns a list of the corresponding objects
populated with annotation data. (\verb+NA+ values in the annotation
package are mapped to empty objects.)

<<results=hide>>=
library("annaffy")
@

(For the purpose of demonstration, we will use the \verb+hgu95av2.db+
metadata package and probe ids from the \verb+aafExpr+ dataset.)

<<>>=
data(aafExpr)
probeids <- featureNames(aafExpr)
symbols <- aafSymbol(probeids, "hgu95av2.db")
symbols[[54]]
symbols[55:57]
@

All annotation constructors return their results as \verb+aafList+
objects, which act like normal lists but have special behavior when
used with certain methods. One such method is \verb+getText()+,
which returns a simple textual representation of most \verb+annaffy+
objects. Note the differing ways \verb+annaffy+ handles missing
annotation data.

<<>>=
getText(symbols[54:57])
@

Other annotation constructors return more complex data structures:

<<>>=
gos <- aafGO(probeids, "hgu95av2.db")
gos[[3]]
@

The gene ontology constructor, \verb+aafGO()+, returns \verb+aafList+s
of \verb+aafGO+ objects, which are in turn lists of \verb+aafGOItem+
objects. Within each of those objects, there are four slots: id,
name, type, and evidence code. The individual slots can be accessed
with the \verb+@+ operator.

<<>>=
gos[[3]][[1]]@name
@

If the reader is not already aware, \verb+R+ includes two subsetting
operators, which can be the source of some confusion at first.
Single brackets (\verb+[]+) always return an object of the same
type that they are used to subset. For example, using single brackets
to subset an \verb+aafList+ will return another \verb+aafList+,
even if it only contains one item. On the other hand, double brackets
(\verb+[[]]+) return just a single item which is not enclosed in
a list. Thus the above statement first picks out the third \verb+aafGO+
object, then the first \verb+aafGOItem+, and finally the name slot.

\section{Linking to Online Databases}

One of the most important features of the \verb+annaffy+ package
its ability to link to various public online databases. Most of
the annotation classes in annaffy have a \verb+getURL()+ method
which returns single or multiple URLs, depending on the object
type.

The simplest annotation class which produces a URL is \verb+aafGenBank+.
Because Affymetrix chips are generally based off GenBank, all probes
have a corresponding GenBank accession number, even those missing
other annotation data. The GenBank database provides information
about the expressed sequence that the Affymetrix chip detects.
Additionally, it helps break down the functional parts of the
sequence and provides information about the authors that initially
sequenced the gene fragment. (See this URL
\href{http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=search&db=nucleotide&term=U41068%5BACCN%5D&doptcmdl=GenBank}{here}.)

<<>>=
gbs <- aafGenBank(probeids, "hgu95av2.db")
getURL(gbs[[1]])
@

In most distributions of R, you can open URLs in your browser with
the \verb+browseURL()+ function. Many other types of URLs are also
possible. Entrez Gene (formerly LocusLink) is a very useful online
database that links to many other data sources not referenced by
Bioconductor. One worthy of note is OMIM, which provides relatively
concise gene function and mutant phenotype information. (See this
URL
\href{http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=2322}{here}.)

<<>>=
lls <- aafLocusLink(probeids, "hgu95av2.db") 
getURL(lls[[2]])
@

If you are interested in exploring the area of the genome surrounding
a probe, the \verb+aafCytoband+ provides a link to NCBI's online
genome viewer. It includes adjacent genes and other genomic
annotations. (See this URL
\href{http://www.ncbi.nlm.nih.gov/mapview/map_search.cgi?direct=on&query=U02687%5BACCN%5D}{here}.)

<<>>=
bands <- aafCytoband(probeids, "hgu95av2.db") 
getURL(bands[[2]])
@

For primary literature information about a gene, use the \verb+aafPubMed+
class. It will provide a link to a list of abstracts on PubMed
which describe the gene of interest. The list abstracts that
Bioconductor provides are by no means complete and will sometimes
only include the paper in which the gene was cloned. (See this URL
\href{http://www.ncbi.nih.gov/entrez/query.fcgi?tool=bioconductor&cmd=Retrieve&db=PubMed&list_uids=15059064%2c14759363%2c14737077%2c14670924%2c14630076%2c14604974%2c14562119%2c14504097%2c12969963%2c12935959%2c12926083%2c12854887%2c12842996%2c12816873%2c12691136%2c12676789%2c12481903%2c12468433%2c12393674%2c12239147%2c12239146%2c12070009%2c12060771%2c12036858%2c11983110%2c11971190%2c8394751%2c7507245}{here}.)

<<>>=
pmids <- aafPubMed(probeids, "hgu95av2.db")
getURL(pmids[[2]])
@

A number of interesting queries can be done with the gene ontology
class. You can display the gene ontology family hierarchy for an
entire probe id at once, including multiple GO ids. The usefulness
of such a query may be dubious, but it is possible. See this URL
\href{http://godatabase.org/cgi-bin/go.cgi?open_0=GO:0007155&open_0=GO:0005581&open_0=GO:0030199&open_0=GO:0005592&open_0=GO:0005737&open_0=GO:0030020&open_0=GO:0007605&open_0=GO:0006817&open_0=GO:0001501}{here}.

<<>>=
getURL(gos[[1]])
@

You can also show the family hierarchy for a single GO id. (See
this URL
\href{http://godatabase.org/cgi-bin/go.cgi?open_0=GO:0005592}{here}.)

<<>>=
getURL(gos[[1]][[4]])
@

The last link type of note is that for KEGG Pathway information.
Most genes are not annotated with pathway data. However, for those
that are, it is possible to retrieve schematics of the biochemical
pathways a gene is involved in. (See this URL
\href{http://www.genome.ad.jp/dbget-bin/show_pathway?MAP00480+2.5.1.18}{here}.
Look for the enzyme in question to be highlighted in red.)

<<>>=
paths <- aafPathway(probeids, "hgu95av2.db")
getURL(paths[[4]])
@

\section{Building HTML Pages}

In addition to using \verb+annaffy+ interactively through R, it
may also be desirable to generate annotated reports summarizing
your microarray analysis results. Such a report can be utilized by
a scientist collaborator with no knowledge of either R or Bioconductor.
Additionally, by having all the annotation and statistical data
presented together on one page, connections between and generalizations
about the data can be made in a more efficient manner.

The primary intent of the \verb+annaffy+ package is to produce such
reports in HTML. Additionally, it can easily format the same report
as tab-delimited text for import into a table, spreadsheet, or
database. It supports nearly all the annotation data available
through Bioconductor. Additionally, it has facilities for including
and colorizing user data in an informative manner.

The rest of this tutorial will make use of an \verb+ExpressionSet+
generated for demonstration purposes. It contains anonymized data
from a microarray experiment which used the Affymetrix hgu95av2
chip. There are eight total samples in the set, four control samples
and four experimental samples. 250 expression measures were selected
at random from the results, and another 250 probe ids were selected
at random and assigned to those expression measures. The data
therefore has no real biological significance, but can still fully
show the capabilities of \verb+annaffy+.

\subsection{Limiting the Results}

HTML reports generated by \verb+annaffy+ can grow to become quite
large unless some measures are taken to limit the results.
Multi-megabyte web pages are unwieldy and should thus be avoided.
Doing a ranked statistical analysis is one way to limit results,
and will be shown here. We will rank the expression measures by
putting their two-sample Welch t-statistics in order of decreasing
absolute value.

The first step is to load the \verb+multtest+ package which will
be used for the t-test. (It is also part of the Bioconductor
project.)

<<>>=
library(multtest)
@

The \verb+mt.teststat()+ function requires a vector that specifies
which samples belong to the different observation classes.  Using
a few R tricks, that vector can be produced directly from the first
covariate of \verb+pData+.

<<print=true>>=
class <- as.integer(pData(aafExpr)$covar1) - 1
@

Using the class vector, we calculate the t-statistic for each of
the probes. We then generate an index vector which can be used to
order the probes themselves in increasing order. As a last step,
we produce the vector of ranked probe ids. Latter annotation steps
will only use the first 50 of those probes.

<<>>=
teststat <- mt.teststat(exprs(aafExpr), class)
index <- order(abs(teststat), decreasing = TRUE)
probeids <- featureNames(aafExpr)[index]
@

\subsection{Annotating the Probes}

Once there is a list of probes, annotation is quite simple. The
only decision that needs to be made is which classes of annotation
to include in the table. Including all the annotation classes,
which is the default, may not be a good idea.  If the table grows
too wide, its usefulness may decrease. To see which columns of data
can be included, use the \verb+aaf.handler()+ function. When called
with no arguments, it returns the annotation types \verb+annaffy+
can handle.

<<>>=
aaf.handler()
@

To help avoid typing errors, subset the vector instead of retyping
each column name.

<<print=true>>=
anncols <- aaf.handler()[c(1:3,8:9,11:13)]
@

This may be too many columns, but it is possible at a later stage
to choose to either not show some of the columns or remove them
altogether. Note that by using the \verb+widget=TRUE+ option in
the next function, it is also possible select data columns with a
graphical widget. See Figure \ref{fig:widget.selector}.

\begin{figure}[htbp]
\begin{center}
\includegraphics{selector}
\caption{\label{fig:widget.selector}Graphical display for selecting
annotation data columns.
}
\end{center}
\end{figure}

Now we generate the annotation table with the \verb+aafTableAnn()+
function. Note that for this tutorial, \verb+annaffy+ is acting as
its own data package. If you wish to annotate results for other
chips, download the appropriate data package from the Bioconductor
website.

<<>>=
anntable <- aafTableAnn(probeids[1:50], "hgu95av2.db", anncols)
@

To see what has been produced so far, use the \verb+saveHTML()+
method to generate the HTML report. Using the optional argument
\verb+open=TRUE+ will open the resulting file in your browser.

<<>>=
saveHTML(anntable, "example1.html", title = "Example Table without Data")
@

See this page online
\href{http://genome.nasa.gov/downloads/annaffy/example1.html}{here}.

\subsection{Adding Other Data}

To add other data to the table, just use any of the other table
constructors to generate your own table, and then merge the two.
For instance, listing the t-test results along with the annotation
data is quite useful. \verb+annaffy+ provides the option of colorizing
signed data, making it easier to assimilate.

<<>>=
testtable <- aafTable("t-statistic" = teststat[index[1:50]], signed = TRUE)
table <- merge(anntable, testtable)
@

After HTML generation, a one line change to the style sheet header
will change the colors used to show the positive and negative
values. In fact, with a bit of CSS it is possible to heavily
customize the appearance of the tables very quickly, even on a
column by column basis.

\verb+annaffy+ also provides an easy way to include expression data
in the table. It colorizes the cells with varrying intensities of
green to show relative expression values. Additionally, because of
the way \verb+merge+ works, it will always match probe id rows
together, regardless of their order. This allows a quick "sanity
check" on the other statistics produced, and can help decrease user
error.  (Check, for example, that the t-statistics and ranking seem
reasonable given the expression data.)

<<>>=
exprtable <- aafTableInt(aafExpr, probeids = probeids[1:50])
table <- merge(table, exprtable)
saveHTML(table, "example2.html", title = "Example Table with Data")
@

See this page online
\href{http://genome.nasa.gov/downloads/annaffy/example2.html}{here}.

Producing a tab-delimited text version uses the \verb+saveText()+
method.  The text ouput also includes more digits of precision than
HTML.

<<>>=
saveText(table, "example2.txt")
@

\section{Searching Metadata}

Often a biologist will make hypotheses about changes in gene
expression either before or after the microarray experiment. In
order to facilitate the formulation and testing of such hypotheses,
\verb+annaffy+ includes functions to search annotation metadata
using various criteria. All search functions return character
vectors of Affymetrix probe ids that can be used to subset data
and annotation.

\subsection{Text}

The two currently implemented search functions are simple and easy
to use. The first is a text search that matches against the textual
representation of biological metadata. Recall that textual
representations are extracted using the \verb+getText()+ method.
For complex annotation structures, the textual representation can
include a variety of information, including numeric identifiers
and textual descriptions.

For the purposes of demonstration, we will use the hgu95av2.db annotation
data package available through Bioconductor.

<<>>=
library(hgu95av2.db)
probeids <- ls(hgu95av2SYMBOL)
gos <- aafGO(probeids[1:2], "hgu95av2.db")
getText(gos)
@

The textual search is probably best applied to the Symbol, Description,
and Pathway metadata types. (A specialized Gene Ontology search
will be discussed later.) For instance, to find all the kinases on
a chip, simply do a text search of Description for kinases.

<<>>=
kinases <- aafSearchText("hgu95av2.db", "Description", "kinase")
kinases[1:5]
print(length(kinases))
@

One can search multiple metadata types with multiple queries all
with a single function call. For instance, to find all genes with
"ribosome" or "polymerase" in the Description or Pathway annotation,
use the following function call.

<<>>=
probes <- aafSearchText("hgu95av2.db", c("Description", "Pathway"),
                        c("ribosome", "polymerase"))
print(length(probes))
@

When doing searches of multiple annotation data types or multiple
terms, by default the search returns all probe ids matching any of
the search criteria. That can be altered by changing the logical
operator from OR to AND using the \verb+logic="AND"+ argument. This
is useful because \verb+aafSearchText()+ does not automatically
tokenize a search query as Google and many other search engines
do. For example, "DNA polymerase" finds all occurrences of that
exact string. To find all probes whose description contains both
"DNA" and "polymerase", use the following function call.

<<>>=
probes <- aafSearchText("hgu95av2.db", "Description",
                        c("DNA", "polymerase"), logic = "AND")
print(length(probes))
@

Another useful application of the text search is to map a vector
of GenBank accession numbers onto a vector of probe ids. This comes
in handy if you wish to filter microarray data based on the results
of a BLAST job.

<<>>=
gbs <- c("AF035121", "AL021546", "AJ006123", "AL080082", "AI289489")
aafSearchText("hgu95av2.db", "GenBank", gbs)
@

Lastly, two points for power users. One, the text search is always
case insensitive. Second, individual search terms are treated as
Perl compatible regular expressions. This means that you should be
cautious of special regular expression characters. See the Perl
documentation\footnote{\url{http://www.perldoc.com/perl5.8.0/pod/perlre.html}}
for further information about how to use regular expressions.

\subsection{Gene Ontology}

The second type of search available is a Gene Ontology search. It
takes a vector of Gene Ontology identifiers and maps them onto a
list of probe ids. Gene Ontology is a tree and you can include or
exclude descendents with the \verb+descendents+ argument. The search
also supports the \verb+logic+ argument. Because the Bioconductor
metadata packages include pre-indexed Gene Ontology mappings, this
search is very fast.

The input format for Gene Ontology ids is very flexible. You may
use numeric or character vectors, either excluding or including
the "GO:" prefix and leading zeros.

<<>>=
aafSearchGO("hgu95av2.db", c("GO:0000002", "GO:0000008"))
aafSearchGO("hgu95av2.db", c("2", "8"))
aafSearchGO("hgu95av2.db", c(2, 8))
@

A good source for finding relevant Gene Ontology identifiers is
the AmiGO website\footnote{\url{http://www.godatabase.org/}},
operated by the Gene Ontology Consortium.

\end{document}