Lab4.Rnw
% % NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % % \VignetteIndexEntry{Lab 4} %\VignetteDepends{Biobase,ctest,multtest,bioclabs} %\VignetteKeywords{Microarray} \documentclass[12pt]{article}
\usepackage{amsmath,pstricks} \usepackage[authoryear,round]{natbib} \usepackage{hyperref}
\textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in
\newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle}
\newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}}
\bibliographystyle{plainnat}
\title{Lab 4: Differential Gene Expression}
\begin{document}
\maketitle
In this lab, we demonstrate how to use R to find genes that are
differentially expressed in two populations. We demonstrate two useful
plots, the MA-plot and the volcano plot. For a more formal assessment,
we use the \verb+multtest+ package for obtaining adjusted $p$-values.
<
We use expression data from an
experiment where sixteen genes were spiked in at different known
concentrations in different hybridizations and are thus differentially
expressed. Expression measures are stored in an \Robject{exprSet}
object available through the \Rpackage{bioclabs} package.
<
Let us create a matrix containing for each of the 12626 genes on the HGU95a chip (note this really is A and not Av2), its {\em A-value} or average log intensity, its {\em M-value} or difference of log intensities (log ratio), its two-sample $t$-statistic, and its nominal $p$-value from the $t$-distribution. The rows of the \Robject{scores} matrix correspond to genes and the columns to the four different types of statistics.
The data have already been transformed to the log scale. Base 2 logarithms were used.
<
The following commands produce an {\em MA-plot} of the differences of
log intensities in the two populations, M, vs. the average log
intensities, A.
<
In the MA-plot, points with large vertical deviations (absolute M-values) suggest differentially expressed genes. The horizontal lines show the typical two-fold-change cutoff (because the expression data are on a log 2 scale). The colored symbols correspond to the sixteen genes that are truly differentially expressed, i.e., were spiked in at different concentrations. Other points with large M-values correspond to false positives. One of the sixteen known genes has a low M-value, corresponding to a false negative ({\tt 1597\_at}). Based on our knowledge of the concentrations for each of the sixteen genes, one expects only one of these genes ({\tt 684\_at}) to have a negative M-value, i.e., higher expression measure in population 0 than 1 (infinitely more abundant in population 0). This is indeed the point with the very low M-value.
Should we take the variability of the estimates into account? There are only 3 replicates but we can try a $t$-test. The following is a so-called {\em volcano plot} of the $t$-statistic vs. the numerator of the $t$-statistic, or M-value. Genes in the top left and right corners of the plot correspond to genes with both large absolute differences and large relative (to standard error) differences in expression between the two populations. Although the sixteen known genes tend to have large absolute $t$-statistics, they standout more in terms of their M-values.
In other situations you may have seen the volcano plot defined as a
plot of minus the base 10 logarithm of the data versus fold
change. The two versions (ours and this one) are equivalent since all
$t$-statistics are based on the same number of degrees of freedom.
<
Note that the gene with small M value really distorts our visual
impression of the data. In order to remove that effect we replot the
data with new limits. For a close-up, simply change the \verb+xlim+
and \verb+ylim+.
<
How many genes have $p$-values less than 0.05? How about 0.01?
<
One can adjust the $p$-values to account for multiple hypothesis
testing using the
\Rpackage{multtest} package. The function
\Robject{mt.rawp2adjp} gives adjusted $p$-values according to various
methods using only the raw $p$-values.
<
Let's see how we fared with the sixteen known to be differentially
expressed genes.
%'
<
One can make some pretty substantial arguments against this procedure. Make an adjustment for all 12626 probes is probably not appropriate in any situation and is certainly hard to support here. These data arose from a designed experiment where we know which genes are going to change. Not all 12626 were candidates for change. In a real experiment you are likely to find that approximately 40\% of the genes are not expressed in the tissue that you are studying. In that case you have corrected for something which should have been excluded from the analysis. There is of course an issue in determining which genes are not differentially expressed is very difficult but the gains from an approximate solution are likely to be quite large.
With 3 replicates we don't expect to have much power. Should we even
use $t$-statistics over the more simple fold change estimates? Let's
see which one does better at ranking the sixteen truly differentially
expressed genes. Ideally, one would like the sixteen genes to have
ranks 1 through 16. It seems like the simple M-value or fold change
measure was more successful at identifying the sixteen known genes.
%'
<
We can easily repeat the whole procedure with the dataset comprising 12 replicates, by simply using \verb+data(eset12)+ instead of \verb+data(eset3)+.
\end{document}