%\VignetteIndexEntry{Introduction: microarray quality assessment with arrayQualityMetrics}
%\VignetteDepends{CCl4,ALLMLL,vsn}
%\VignettePackage{arrayQualityMetrics}
%\VignetteEngine{knitr::knitr}

\documentclass[11pt]{article}

<<style, eval=TRUE, echo=FALSE, results="asis">>=
knitr::opts_chunk$set(message=FALSE, warning=FALSE, error=FALSE, tidy=FALSE) # turn off verbosity
BiocStyle:::latex()
@

\begin{document}

\bioctitle[Introduction]%
    {Introduction: microarray quality assessment with arrayQualityMetrics}
\author{Audrey Kauffmann, Wolfgang Huber}
\maketitle
\tableofcontents

\section*{Introduction}

The \Rpackage{arrayQualityMetrics} package produces, through a single
function call, a comprehensive HTML report of \emph{quality metrics}
about a microarray
dataset~\cite{Kauffmann:2009a:Bioinf,Kauffmann:2010:Genomics,McCall2010}.
The quality metrics are mainly on the \emph{per array} level,
i.\,e.\ they can be used to assess the relative quality of different
arrays within a dataset. Some of the metrics can also be used to
diagnose batch effects, and thus the quality of the overall dataset.

The report can be extended to contain further diagnostics through
additional arguments, and we will see examples for this in Section~\ref{sec:ext}.

The aim of the \Rpackage{arrayQualityMetrics} package is to produce
information that is relevant for your decision making - not, to make
the decision. It will often be applied to two, somewhat distinct, use
cases: (i) assessing quality of a ``raw'' dataset, in order to get
feedback on the experimental procedures that produced the data; (ii)
assessing quality of a normalised dataset, in order to decide whether
and how to use the dataset (or subsets of arrays in it) for subsequent
data analysis.

Different types of microarray data (one colour, two colour,
Affymetrix, Illumina) are represented by different object classes
in Bioconductor. The function \Rfunction{arrayQualityMetrics} will
work in the same way for all of them. Further information about
its arguments can be found in its manual page.

When the function \Rfunction{arrayQualityMetrics} is finished, a
report is produced in the directory specified by the function's
\texttt{outdir} argument. By default, a directory with a suitable name
is created in the current working directory. This directory contains
an HTML page \texttt{index.html} that can be opened by a browser. The
report contains a series of plots explained by text.  Some of the
plots are interactive (see Figure~\ref{fig:svgplots}). Technically, this
is achieved by the use of SVG (scalable vector graphics) and
JavaScript, and it requires that you use a recent (HTML5 capable) web
browser\footnote{If in doubt, please see the notes about browser
  compatibility at the top of the report; or contact the Bioconductor
  mailing list.}.  Other plots, where interactivity is less relevant,
are provided as bitmaps (PNG format) and are also linked to PDF files
that provide high resolution versions e.\,g.\ for publication.

Plus (+) or minus (-) symbols at the begin of different section
headings of the report (as in the left panel of
Figure~\ref{fig:svgplots}) indicate that you can show or hide these
sections by clicking on the heading. After (re)loading, all sections
are shown except for the \emph{Outlier detection} barplots, which are
hidden and can be expanded by clicking on them.

\begin{figure}[t]\begin{center}
  \includegraphics[width=0.49\textwidth]{pcaplot.png}
  \includegraphics[width=0.49\textwidth]{arraymetadatatable.png}
  \caption{\label{fig:svgplots}\textit{Left:} An example plot from the
    report. The plot shows the arrays (points) in a two-dimensional
    plot area spanned by the first two axes of a principal component
    analysis (PCA). By moving the mouse over the points, the
    corresponding array's metadata is displayed in the table to the
    right of the plot. By clicking on a point, it can be selected or
    deselected. Selected arrays are indicated by larger points or
    wider lines in the plots and by ticked checkboxes in the array
    table shown in the \textit{right} panel. Arrays can also be
    (de)selected by clicking the checkboxes.  Initially, when the
    report is loaded (or reloaded) by the browser, all arrays are
    selected that were called outliers by at least one criterion.}
\end{center}\end{figure}

Metadata about the arrays is shown at the top of the report as a table
(see Figure~\ref{fig:svgplots}).  It is extracted from the
\Robject{phenoData} slot of the data object supplied to
\Rfunction{arrayQualityMetrics}.  It can be useful to adjust the
contents this slot before producing the report, and to make sure it
contains the right quantity of information to make an informative
report - not too much, not too little.

In the case of \Rclass{AffyBatch} input, some Affymetrix specific
sections are added to the standard report. Also for other types of
arrays, sections can be added to the standard report if certain
metadata are present in the input object (see Section~\ref{sec:ext}).

The function \Rfunction{arrayQualityMetrics} also produces an R object
(essentially, a big list) with all the information contained in the
report, and this object can be used by downstream tools for
programmatic analysis of the report. This is discussed in the vignette
\emph{Advanced topics: Customizing arrayQualityMetrics reports and
  programmatic processing of the output}

\section{Basic use}
%
<<options, include=FALSE>>=
##options(error = recover, warn = 2)
options(bitmapType = "cairo")
@

%

\subsection{Affymetrix data - before preprocessing}
If you are working with Affymetrix GeneChips, an \Rclass{AffyBatch} object
is the most appropriate way to import your raw data into Bioconductor.
Starting from CEL files, this is typically done using the function
\Rfunction{ReadAffy} from the \Rpackage{affy} package\footnote{For
more information on how to produce an \Rclass{AffyBatch} from your data, please see
the documentation of the \Rpackage{affy} package.}.  Here, we use the
dataset \emph{MLL.A}, an object of class \Rclass{AffyBatch}
provided in the data package \Rpackage{ALLMLL}.
%
<<DataLoading>>=
library("ALLMLL")
data("MLL.A")
@

Now that the data are loaded, we can call
\Rfunction{arrayQualityMetrics}\footnote{For this vignette, in order to save
computation time, we only call the function on the first 5 arrays; in
your own application, you can call it on the complete data object.}.

%
<<AffyBatchQM, results="hide">>=
library("arrayQualityMetrics")
arrayQualityMetrics(expressionset = MLL.A[, 1:5],
                    outdir = "Report_for_MLL_A",
                    force = TRUE,
                    do.logtransform = TRUE)
@
%

This is the simplest way of calling the function. We give a name to
the directory (\texttt{outdir}) and we overwrite the possibly existing
files of this directory (\texttt{force}). Finally, we set
\texttt{do.logtransform} to logarithm transform the intensities. You
can then view the report by directing your browser to the file
\texttt{index.html} in the directory whose name is indicated by
\Robject{outdir}.

%--------------------------------------------------
\subsection{Affymetrix data - after preprocessing}
%--------------------------------------------------
We can call the RMA algorithm on \emph{MLL.A} to obtain a preprocessed
dataset.  The preprocessing includes background correction, between
array intensity adjustment (normalisation) and probeset
summarisation. The resulting object \Robject{nMLL} is of class
\Rclass{ExpressionSet} and
contains one value (expression estimate) for each gene
for each array.
%
<<Normalisation, results="hide">>=
nMLL = rma(MLL.A)
@
%
We can then call again the function \Rfunction{arrayQualityMetrics}.
%
<<ExpressionSet>>=
arrayQualityMetrics(expressionset = nMLL,
                    outdir = "Report_for_nMLL",
                    force = TRUE)
@
%
We do not need to set \texttt{do.logtransform} as after
\Rfunction{rma} the data are already logarithm transformed.

%----------------------------------------------------
\subsection{ExpressionSet and ExpressionSetIllumina}
%----------------------------------------------------
If you are working on one colour arrays other than Affymetrix
genechips, you can load your data into Bioconductor as an
\Rclass{ExpressionSet} object \footnote{See the documentation of the
  \Rpackage{Biobase} package.}, or if you work with Illumina data and the
\Rpackage{beadarray} package, as an \Rclass{ExpressionSetIllumina} object.
You can then proceed exactly as above.

%--------------------------------------------------
\subsection{Two colour arrays, NChannelSet, RGList, MAList}
%--------------------------------------------------
The package \Rpackage{limma} imports a wide range of data formats used
for two colour arrays and produces objects of class \Rclass{RGList} or
\Rclass{MAList}. When presented with an object of these classes,
\Rfunction{arrayQualityMetrics} tries to convert them into an
\Rclass{NChannelSet} and then proceeds with calling its
\Rclass{NChannelSet} method.

Alternatively, you can create an \Rclass{NChannelSet} to contain your
data ``from scratch''. The documentation of the \Rpackage{Biobase}
package gives instructions on how to do so.

The \Rfunction{arrayQualityMetrics} function expects the
\Robject{assayData} slot of the \Rclass{NChannelSet} object to contain
the elements \Robject{R} and \Robject{G}, for the ``red'' and the
``green'' intensities. Optionally, it can contain elements
\Robject{Rb} and \Robject{Gb} for associated ``background''
intensities.  As an alternative to all that, the
\Rfunction{arrayQualityMetrics} function also accepts
\Rclass{NChannelSet} objects with a single slot \Robject{exprs}, and
will then simply behave like it does for (single-colour)
\Rclass{ExpressionSet} objects.

As an example, we consider the dataset \Robject{CCl4} from the data package
\Rpackage{CCl4} and normalize it using the variance stabilization
method available in the package \Rpackage{vsn}.

%
<<NChannelSet1, results="hide">>=
library("vsn")
library("CCl4")
data("CCl4")
nCCl4 = justvsn(CCl4, subsample = 15000)
arrayQualityMetrics(expressionset = nCCl4,
                    outdir = "Report_for_nCCl4",
                    force = TRUE)
@
%

%--------------------------------------------------
\subsection{Loading data from ArrayExpress}
%--------------------------------------------------
You can use the \Rpackage{ArrayExpress}
package~\cite{Kauffmann:2009b:Bioinf} to download datasets from the
EBI's ArrayExpress database. The resulting \Rclass{ExpressionSet},
\Rclass{AffyBatch} or \Rclass{NChannelSet} objects can be directly fed
into \Rfunction{arrayQualityMetrics}.

%--------------------------------------------------
\section{Making the report more informative by adding a factor of interest}
%--------------------------------------------------
A useful feature of \Rfunction{arrayQualityMetrics} is the possibility
to show the results in the context of an experimental factor of
interest, i.\,e.\ a categorical variable associated with the arrays
such as \emph{hybridisation date}, \emph{treatment level} or
\emph{replicate number}. Specifying a factor does \emph{not} change
how the quality metrics are computed. By setting the argument
\Rfunction{intgroup} to contain the names of one or multiple columns
of the data object's \Rclass{phenoData} slot\footnote{This is where
  Bioconductor objects store array annotation}, a bar on the side of
the heatmap with colours representing the respective factors is added.
Similarly, the colours of the boxplots and density plots reflect the
levels of the first of the factors named by \Rfunction{intgroup}.

We use the \emph{nMLL} example again, and create artificial
array metadata factors \Robject{condition} and
\Robject{batch} (see Section~\ref{sec:rin} for a more realistic example).

%
<<intgroup1>>=
pData(nMLL)$condition = rep(letters[1:4], times = 5)
pData(nMLL)$batch = rep(paste(1:4), each = 5)
@
<<intgroup2>>=
arrayQualityMetrics(expressionset = nMLL,
                    outdir = "Report_for_nMLL_with_factors",
                    force = TRUE,
                    intgroup = c("condition", "batch"))
@

%--------------------------------------------------
\section{Extended use}
\label{sec:ext}
%--------------------------------------------------
Some of the quality metrics that the package can compute require
specific information about the features on the arrays.  To use these,
you need to make sure that this information is provided in your input
object.  We use the \emph{nCCl4} example again.

\subsection{Spatial layout of the array}
\label{sec:spatial}

To plot the spatial distributions of the intensities of the arrays,
\Rfunction{arrayQualityMetrics} needs the spatial coordinates of the
features on the chip. For \Rclass{AffyBatch} or
\Rclass{BeadLevelList}, this information is automatically available
without further user input. For the other types of objects, two
columns corresponding to $X$ and $Y$ coordinates of the features are
required in the \Robject{featureData} slot of the object.  These
columns should be named "X" and "Y". If the arrays are split into
blocks, rows and columns, then the function \Rfunction{addXYfromGAL}
(please check its manual page for details) can used to convert the
row, column and blocks indices into absolute "X" and "Y" coordinates on the
array. In the example of the dataset \Robject{CCl4}, the coordinates of
the spots are in the columns named "Row" and "Column" of the
featureData (the slot of the object containing the annotation of the
probes). We copy this information into columns named "X"
and "Y" respectively
%
<<XYcoordinates>>=
featureData(nCCl4)$X = featureData(nCCl4)$Row
featureData(nCCl4)$Y = featureData(nCCl4)$Column
@
%
The next call to \Rfunction{arrayQualityMetrics} with this refined version of
\Robject{nCCl4} (see Section~\ref{sec:rin}) will now
include this information in the report, and the spatial distribution
of the intensities will be shown.
%

\subsection{Mapping of the reporters}

The report can also include an assessment of the effect of the target
mapping of the reporters. You can define a \Rclass{featureData} column
named \Robject{hasTarget} that indicates, by logical \texttt{TRUE}, if
the reporter matches a known transcript, and by \texttt{FALSE}, if
not. In the \emph{CCl4} example, many of the reporter names are RefSeq
identifiers, while others are not. Thus, we let \Robject{hasTarget}
indicate whether the name begins with "NM".
%
<<hasTarget>>=
featureData(nCCl4)$hasTarget = (regexpr("^NM", featureData(nCCl4)$Name) > 0)
table(featureData(nCCl4)$hasTarget)
@
The next call to \Rfunction{arrayQualityMetrics} with this refined version of
\Robject{nCCl4} (see Section~\ref{sec:rin}) will now
include this information in the report, and the spatial distribution
of the intensities will be shown.


%\subsection{GC content of the reporters}
%If the GC content of the reporters is known, then it is possible to
%indicate that in the \Robject{featureData} slot of the
%\Rclass{NChannelSet} object under the column name "GC". Then a study of the
%GC content effect on intensities of the arrays can be performed. In
%the \emph{CCl4} example we do not have this information. However, the
%process would be very similar to what is done with the "hasTarget"
%column.


\subsection{RNA quality}
\label{sec:rin}

The RNA hybridized to the arrays in the \Rpackage{CCl4} dataset was
intentionally made to good, medium or poor quality, and this is
recorded by a so-called RIN value (see \emph{CCl4} vignette).
%
<<pData>>=
pd = pData(CCl4)
rownames(pd) = NULL
pd
@
%
The RIN is always 9 for the reference (DMSO), the relevant value is that
for the test sample (CCl4).
<<RIN>>=
RIN = with(pd, ifelse( Cy3=="CCl4", RIN.Cy3, RIN.Cy5))
fRIN = factor(RIN)
levels(fRIN) = c("poor", "medium", "good")
pData(nCCl4)$"RNA-integrity" = fRIN
@
%
Now we can use this to set the argument \Robject{intgroup} when
calling the function \Rfunction{arrayQualityMetrics}.
%
<<NChannelSet2>>=
arrayQualityMetrics(expressionset = nCCl4,
                    outdir = "Report_for_nCCl4_with_RIN",
                    force = TRUE,
                    intgroup = "RNA-integrity")
@
%
Boxplots, PCA plot and heatmap in the report will now indicate the values
of the factor \texttt{RNA-integrity} for each array.

\subsection*{Session Info}
%

<<pkgs, echo=FALSE, results="asis">>=
toLatex(sessionInfo())
@
%

\bibliography{arrayQualityMetrics}

\end{document}