% -*- mode: noweb; noweb-default-code-mode: R-mode; -*- %\VignetteIndexEntry{affycomp primer} %\VignetteKeywords{Preprocessing, Affymetrix} %\VignetteDepends{Biobase} %\VignettePackage{affycomp} %documentclass[12pt, a4paper]{article} \documentclass[12pt]{article} \usepackage{amsmath} \usepackage{hyperref} \usepackage[authoryear,round]{natbib} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \author{Rafael Irizarry and Leslie Cope} \begin{document} \title{Bioconductor Expression Assessment Tool for Affymetrix Oligonucleotide Arrays (affycomp)} \maketitle \tableofcontents \section{Introduction} <>= library(affycomp) @ \section{Introduction} The defining feature of oligonucleotide expression arrays is the use of several probes to assay each targeted transcript. This is a bonanza for the statistical geneticist, offering a great opportunity to create probeset summaries with specific characteristics. There are now several methods available for summarizing probe level data from the popular Affymetrix GeneChips. It is harder to identify the method best suited to a given inquiry. This package provides a {\it graphical tool} for summaries of Affymetrix probe level data. Plots and summary statistics offer a picture of how an expression measure performs in several important areas. This picture facilitates the comparison of competing expression measures and the selection of methods suitable for a specific investigation. The key is a benchmark dataset consisting of a dilution study and a spike-in study. Because the {\it truth} is known for this data, it is possible to identify statistical features of the data for which the expected outcome is known in advance. Those features highlighted in our suite of graphs are justified by questions of biological interest, and motivated by the presence of appropriate data. The benchmark data used is freely available to the public. The assessment data sets are the following: For the dilution study by \url{http://qolotus02.genelogic.com/datasets.nsf/}{GeneLogic}, two sources of cRNA, human liver tissue and central nervous system cell line (CNS), were hybridized to human arrays (HG-U95Av2) in a range of dilutions and proportions \cite{RMA}. For the spike-in study, different cRNA fragments were added to the hybridization mixture of the arrays at different picoMolar concentrations. The cRNAs were spiked-in at a different concentration on each array (apart from replicates) arranged in a cyclic Latin square design with each concentration appearing once in each row and column. All arrays had a common background cRNA. The data can be obtained form \url{http://www.affymetrix.com/analysis/download_center2.affx}. Two phenoData objects are included in the package that give more details: \verb+dilution.phenodata+ and \verb+spikein.phenodata+. \section{What's new in version 1.2?} A new sets of assessment has been added. The wrapper is called \verb+assessSpikeIn2+ and one can send it an \verb+ExpressionSet+ with the spikein as \verb+assessSpikeIn+ but it only uses columns 1,...,13,17 and 21,...,33,37. So one only needs to get expression measures for these 28 arrays if one plans to only use \verb+assessSpikeIn2+. The engine functions are \verb+assessLS+, \verb+assessSpikeInSD+, and \verb+assessMA2+. All spike in related assessments now work with both the HGU95A (this is the one used in version 1.1) and HGU133A. The function \verb+read.newspikein+ will read the HGU133A expression values. \section{The Image Report} If you have a file named \verb+dilfilename.csv+ with the dilution expression values and \verb+sifilename.csv+ with the spikein expression measures, say RMA, you can easily obtain the graphs and summary statistics in the image report: \begin{Sinput} R> library(affycomp) ##load the affy package R> d <- read.dilution("dilfilename.csv") R> s <- read.spikein("sifilename.csv") R> rma.assessment <- affycomp(d, s, method.name="RMA") \end{Sinput} \verb+res+ will have the necessary information to recreate the graphs without having to wait for the assessment. See below. \verb+affycomp+ is a wrapper for other \verb+assessAll+ and \verb+affycompTable+. Here are some examples based <<>>= data(mas5.assessment) data(rma.assessment) @ were created using the function \verb+assessAll+ on \verb+ExpressionSet+s created using the RMA and MAS 5.0 methods. These are lists of lists. <<>>= names(mas5.assessment) @ Each component is the result of a specific assessment. The names tell us what they are for. \verb+Dilution+ are the assessment based on the dilution data and can be used to create Figures 2, 3, and 4b. \verb+MA+ has the necessary information for the MA plot or Figure 1. \verb+Signal+ has the necessary information to create Figure 4a. \verb+FC+ has assessments related to fold change and can be used to create Figures 5a, 6a, and 6b. Finally \verb+FC2+ has the necessary information to create Figure 5b. The captions for these Figures will give you an idea of what they are for. There are two kinds of plots the basic and the comparative. In the basic plots based on the expression measure being assessed are shown. In the comparative plots, the expression measure is compared to other measures, MAS 5.0 by default. Table are also automatically created with assessment statistics. Finally, a simple assessment of standard error estimates can be done. These are described in the following subsections. \subsection{Basic Plots} You can use \verb+affycompPlot+ which will automatically know what to do \begin{figure}[htbp] \begin{center} <>= affycompPlot(mas5.assessment$MA) @ \end{center} {Figure 1: The MA plot shows log fold change as a function of mean log expression level. A set of 14 arrays representing a single experiment from the Affymetrix spike-in data are used for this plot. A total of 13 sets of fold changes are generated by comparing the first array in the set to each of the others. Genes are symbolized by numbers representing the nominal $\log_2$ fold change for the gene. Non-differentially expressed genes with observed fold changes larger than 2 are plotted in red. All other probesets are represented with black dots.} \end{figure} Or you can use the auxiliary figure functions that will need to have a specific assessment list \begin{figure}[htbp] \begin{center} <>= affycomp.figure2(mas5.assessment$Dilution) @ \end{center} {Figure 2: For each gene, and each experimental condition, we calculate the mean log expression and the observed standard deviation across 5 replicates. The resulting scatterplot is smoothed to generate a single curve representing mean standard deviation as a function of mean log expression.} \end{figure} \begin{figure}[htbp] \begin{center} <>= affycomp.figure3(mas5.assessment$Dilution) @ \end{center} {Figure 3: This plot, using the GeneLogic dilution data, shows the sensitivity of fold change calculations to total RNA abundance. Average log fold-changes between liver and CNS for the lowest concentration and the highest in the dilution data set are computed. Orange and red color is used to denote genes with $M_{6g}-M_{1g}$ bigger than $\log_2(2)$ and $\log_2(3)$ respectively. The rest are denoted with black.} \end{figure} \begin{figure}[htbp] \begin{center} <>= par(mfrow=c(2,1)) affycomp.figure4a(mas5.assessment$Signal) affycomp.figure4b(mas5.assessment$Dilution) @ \end{center} {Figure 4: a) Average observed $log_2$ intensity plotted against nominal $log_2$ concentration for each spiked-in gene for all arrays in Affymetrix spike-In experiment. b) For the GeneLogic dilution data, log expression values are regressed against their log nominal concentration. The slope estimates are plotted against average log intensity across all concentrations.} \end{figure} \begin{figure}[htbp] \begin{center} <>= par(mfrow=c(2,1)) affycomp.figure5a(mas5.assessment$FC) affycomp.figure5b(mas5.assessment$FC) @ \end{center} {Figure 5: A typical identification rule for differential expression filters genes with fold change exceeding a given threshold. This figure shows average ROC curves which offer a graphical representation of both specificity and sensitivity for such a detection rule. a) Average ROC curves based on comparisons with nominal fold changes ranging from 2 to 4096. b) As a) but with nominal fold changes equal to 2.} \end{figure} \begin{figure}[htbp] \begin{center} <>= par(mfrow=c(2,1)) affycomp.figure6a(mas5.assessment$FC) affycomp.figure6b(mas5.assessment$FC) @ \end{center} {Figure 6: a) Observed log fold changes plotted against nominal log fold changes. The dashed lines represent highest, 25th highest, 100th highest, 25th percentile, 75th percentile, smallest 100th, smallest 25th, and smallest log fold change for the genes that were not differentially expressed. b) Like a) but the observed fold changes were calculated for spiked in genes with nominal concentrations no higher than 2pM.} \end{figure} \subsection{Comparative Plots} You can use \verb+affycompPlot+ which will automatically know what to do \begin{figure}[htbp] \begin{center} <>= par(mfrow=c(2,1)) affycompPlot(mas5.assessment$MA, rma.assessment$MA) @ \end{center} {Figure 1: The MA plot shows log fold change as a function of mean log expression level. A set of 14 arrays representing a single experiment from the Affymetrix spike-in data are used for this plot. A total of 13 sets of fold changes are generated by comparing the first array in the set to each of the others. Genes are symbolized by numbers representing the nominal $\log_2$ fold change for the gene. Non-differentially expressed genes with observed fold changes larger than 2 are plotted in red. All other probesets are represented with black dots.} \end{figure} Or you can use the auxiliary figure functions that will need to have a specific assessment list \begin{figure}[htbp] \begin{center} <>= affycomp.compfig2(list(mas5.assessment$Dilution, rma.assessment$Dilution), method.names=c("MAS 5.0","RMA")) @ \end{center} {Figure 2: For each gene, and each experimental condition, we calculate the mean log expression and the observed standard deviation across 5 replicates. The resulting scatterplot is smoothed to generate a single curve representing mean standard deviation as a function of mean log expression.} \end{figure} \begin{figure}[htbp] \begin{center} <>= affycomp.compfig3(list(mas5.assessment$Dilution, rma.assessment$Dilution), method.names=c("MAS 5.0","RMA")) @ \end{center} {Figure 3: This plot, using the GeneLogic dilution data, shows the sensitivity of fold change calculations to total RNA abundance. Average log fold-changes between liver and CNS for the lowest concentration and the highest in the dilution data set are computed. Orange and red color is used to denote genes with $M_{6g}-M_{1g}$ bigger than $\log_2(2)$ and $\log_2(3)$ respectively. The rest are denoted with black.} \end{figure} \begin{figure}[htbp] \begin{center} <>= par(mfrow=c(2,1)) affycomp.compfig4a(list(mas5.assessment$Signal, rma.assessment$Signal), method.names=c("MAS 5.0","RMA")) affycomp.compfig4b(list(mas5.assessment$Dilution, rma.assessment$Dilution), method.names=c("MAS 5.0","RMA")) @ \end{center} {Figure 4: a) Average observed $log_2$ intensity plotted against nominal $log_2$ concentration for each spiked-in gene for all arrays in Affymetrix spike-In experiment. b) For the GeneLogic dilution data, log expression values are regressed against their log nominal concentration. The slope estimates are plotted against average log intensity across all concentrations.} \end{figure} \begin{figure}[htbp] \begin{center} <>= par(mfrow=c(2,1)) affycomp.compfig5a(list(mas5.assessment$FC, rma.assessment$FC), method.names=c("MAS 5.0","RMA")) affycomp.compfig5b(list(mas5.assessment$FC2, rma.assessment$FC2), method.names=c("MAS 5.0","RMA")) @ \end{center} {Figure 5: A typical identification rule for differential expression filters genes with fold change exceeding a given threshold. This figure shows average ROC curves which offer a graphical representation of both specificity and sensitivity for such a detection rule. a) Average ROC curves based on comparisons with nominal fold changes ranging from 2 to 4096. b) As a) but with nominal fold changes equal to 2.} \end{figure} \begin{figure}[htbp] \begin{center} <>= par(mfrow=c(2,2)) affycomp.figure6a(mas5.assessment$FC, main="a) MAS 5.0", ylim=c(-12,12)) affycomp.figure6a(rma.assessment$FC, main="a) RMA", ylim=c(-12,12)) affycomp.figure6b(mas5.assessment$FC, main="b) MAS 5.0", ylim=c(-6,6)) affycomp.figure6b(rma.assessment$FC, main="b) RMA", ylim=c(-6,6)) @ \end{center} {Figure 6: a) Observed log fold changes plotted against nominal log fold changes. The dashed lines represent highest, 25th highest, 100th highest, 25th percentile, 75th percentile, smallest 100th, smallest 25th, and smallest log fold change for the genes that were not differentially expressed. b) Like a) but the observed fold changes were calculated for spiked in genes with nominal concentrations no higher than 2pM.} \end{figure} \subsection{Tables} The function \verb+tableAll+ returns a matrix with assessment statistics. Once the assessment function are run all one needs to type is <<>>= data(rma.assessment) data(mas5.assessment) tableAll(rma.assessment, mas5.assessment) @ The function \verb+affycompTable+ make a minimal table (that is also more informative). <<>>= affycompTable(rma.assessment, mas5.assessment) @ \subsection{Standard deviation assessment} The package also contains a simple tool to assess standard error estimates. For this to work the \verb+ExpressionSet+ object used for the assessment must have standard error estimates for the dilution data. We include two examples in the package. <<>>= data(rma.sd.assessment) data(lw.sd.assessment) tableAll(rma.sd.assessment, lw.sd.assessment) @ For the SD assessment, there are also comparison plots as well as simple plots. \begin{figure}[htbp] \begin{center} <>= affycompPlot(lw.sd.assessment, rma.sd.assessment) @ \end{center} {Figure 7: Using the replicates from the dilution data, we calculate the mean predicted variance for each gene, tissue and dilution by squaring the estimated standard error. The usual sample variance $s^2_{tdg} = \sum_r (y_{tdrg} - y_{td\cdot g})^2/4$ are calculated as well. These boxplots are of the log ratios of the predicted and observed variance. } \end{figure} There are also comparison plots as well as simple plots. \begin{figure}[htbp] \begin{center} <>= affycompPlot(lw.sd.assessment) @ \end{center} {Figure 7: Using the replicates from the dilution data, we calculate the mean predicted variance for each gene, tissue and dilution by squaring the estimated standard error. The usual sample variance $s^2_{tdg} = \sum_r (y_{tdrg} - y_{td\cdot g})^2/4$ are calculated as well. These boxplots are of the log ratios of the predicted and observed variance. } \end{figure} Enjoy! \end{document}