%\VignetteIndexEntry{Parallelized affy functions for preprocessing} %\VignetteKeywords{Preprocessing, Affymetrix} %\VignetteDepends{affy, snow, affydata} %\VignettePackage{affyPara} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rfunarg}[1]{{\textit{#1}}} \documentclass[12pt]{article} \usepackage[% baseurl={http://www.bioconductor.org},% pdftitle={Parallelized affy functions for preprocessing},% pdfauthor={Markus Schmidberger},% pdfsubject={affyPara},% pdfkeywords={Bioconductor},% plainpages,pdftex]{hyperref} %nice urls \usepackage{url} \usepackage{natbib} %\SweaveOpts{keep.source=TRUE,eps=FALSE,include=FALSE,width=4,height=4.5} \author{Markus Schmidberger \footnote{Division of Biometrics and Bioinformatics, IBE, University of Munich, 81377 Munich, Germany} \footnote{Package maintainer, Email: \texttt{schmidb@ibe.med.uni-muenchen.de}} \and Ulrich Mansmann$^*$ } \title{Description of the \Rpackage{affyPara} package: Parallelized preprocessing methods for Affymetrix Oligonucleotide Arrays} \begin{document} \maketitle \tableofcontents \newpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Abstract} The \Rpackage{affyPara} package is part of the Bioconductor\footnote{\url{http://www.bioconductor.org/}} \cite{Gentleman2004} project. The package extends the \Rpackage{affy} package. The \Rpackage{affy} package is meant to be an extensible, interactive environment for data analysis and exploration of Affymetrix oligonucleotide array probe level data. For more details see the \Rpackage{affy} vignettes or \citet{Irizarry2002}. The \Rpackage{affyPara} package contains parallelized preprocessing methods for high-density oligonucleotide microarray data. Partition of data could be done on arrays and therefore parallelization of algorithms gets intuitive possible. The partition of data and distribution to several nodes solves the main memory problems -- caused by the \Robject{AffyBatch} object -- and accelerates the methods \cite{Schmidberger2008}. <>= library(affyPara) @ This document was created using R version \Sexpr{paste(R.version$major,".",R.version$minor,sep="")} and versions \Sexpr{package.version("affy")}, \Sexpr{package.version("snow")} and \Sexpr{package.version("vsn")} of the packages \Rpackage{affy}, \Rpackage{snow} and \Rpackage{vsn} respectively. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Changes to previous Versions} For major changes see the NEWS file in the source code of the package or use the function \Rfunction{readNEWS()}. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Introduction} The functions in the \Rpackage{affyPara} package have the same functionality and a very similar user-interface as the functions in the \Rpackage{affy} package. For a detailed function and method description see the \Rpackage{affy} vignettes and help files. The \Rpackage{affyPara} package contains parallelized preprocessing methods for high-density oligonucleotide microarray data. The package is designed for large numbers of microarray data and solves the main memory problems caused by the \Robject{AffyBatch} object at only one workstation or processor. Partition of data could be done on arrays and therefore parallelization of algorithms gets intuitive possible. It is very difficult to define a concrete limit for a large number of data, because this strongly depends on the computer system (architecture, main memory, operating system). In general a computer cluster and the \Rpackage{affyPara} package should be used when working with more than 150 microarrays. The partition of data and distribution to several nodes solves the main memory problems (at one workstation) and accelerates the methods. Parallelization of existing preprocessing methods produces, in view of machine accuracy, the same results as serialized methods. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Requirements} The \Rpackage{affyPara} package requires the \Rpackage{affy}, \Rpackage{snow} and \Rpackage{vsn} package. From the \Rpackage{affy} and \Rpackage{vsn} packages several subfunctions for preprocessing will be used. The \Rpackage{snow} package \cite{Rossini2003} will be used as interface to a communication mechanism for parallel computing. In the \Rpackage{snow} package four low level interfaces have been implemented, one using PVM via the \Rpackage{rpvm} package by Li and Rossini, one using MPI via the \Rpackage{Rmpi} \cite{Rmpi} package by Hao Yu, one using NetWorkSpaces via the \Rpackage{NWS} package by Revolution Computing and one using raw sockets that may be useful if PVM and MPI are not available. For a comparison and review of existing parallel computing techniques with the R language see \cite{Schmidberger2009}. For more details concerning the \Rpackage{snow} package, see the help files or the webpage \url{http://www.cs.uiowa.edu/~luke/R/cluster/cluster.html}. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Loading the package} First of all you have to load the package and the depending packages. <>= library(affyPara) @ For demonstration we use the small \Robject{AffyBatch} object '\Robject{Dilution}' from the \Rpackage{affydata} package: <>= library(affydata) data(Dilution) Dilution @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Starting and stopping cluster} After loading the libraries the computer cluster has to be initialized. Starting a workstation cluster is the only step in using a computer cluster that depends explicitly on the underlying communication mechanism. A cluster is started by calling the \Rfunction{makeCluster()} function, but the details of the call depend on the type of cluster. PVM, MPI or NWS cluster may also need some preliminary preparations to start the systems. For some examples see the webpage \url{http://www.cs.uiowa.edu/~luke/R/cluster/cluster.html}. To start a cluster you should use <>= cl <- makeCluster(4, type='SOCK') @ with the first parameter '4' for the number of spawend slaves and a parameter (type) for the used communication mechanism. In this example we use raw socket connections. As default in the \Rpackage{snow} a cluster object 'cl' will be created. To stop a cluster you should use <>= stopCluster(cl) @ <>= cl <- makeCluster(2, type='SOCK') @ The \Rpackage{affyPara} package masks (overwrites) the functions \Rfunction{makeCluster()} and \Rfunction{stopCluster()} from the \Rpackage{snow} package. The new functions save the cluster object (cl) in the namespace environment. Therefore you do not have to deal with the 'cl' object in the function calls of the \Rpackage{affyPara} functions. But it is still possible, using the parameter \Robject{cluster=cl}. Socket clusters should stop automatically, when the process -- that created them -- terminates. However, it is still a good idea to call \Rfunction{stopCluster()}. For more details see the \Rpackage{snow} package, the R package for your communication mechanism (\Rpackage{Rmpi}, \Rpackage{Rpvm}, \Rpackage{NWS}), and the implementation of your communication mechanism. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Inputdata: CEL Files or AffyBatch} Before running any kind of proprocessing, the probe level data (CEL files) have to be handled. As suggested in the \Rpackage{affy} package an object of class \Robject{AffyBatch} can be created: \begin{itemize} \item Create a directory. \item Move all the relevant CEL files to that directory. \item Make sure your working directory contains the CEL files (\Rfunction{getwd()}, \Rfunction{setwd()}). \item Then read in the data: <>= AffyBatch <- ReadAffy() @ \end{itemize} This \Robject{AffyBatch} object can be used to do preprocessing (with functions from the \Rpackage{affyPara} and \Rpackage{affy} package) on the data. Depending on the size of the dataset and on the memory available at the computer system, you might experience errors like 'Cannot allocate vector ...'. The idea of the \Rpackage{affyPara} package is, that all probe level data will never be needed at one place (computer) at the same time. Therefore it is much more efficient and memory friendly to distribute the CEL files to the local disc of the slave computers or to a shared memory system (e.g. samba device). To build only small \Rfunction{AffyBatch} objects at the slaves, do preprocessing at the slaves and rebuild the results (\Rfunction{AffyBatch} or \Rfunction{ExpressionSet} object) at the master node. This process is implemented in the functions from the \Rpackage{affyPara} package. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Function Description} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Background Correction} Background correction (BGC) methods are used to adjust intensities observed by means of image analysis to give an accurate measurement of specific hybridization. Therefore BGC is essential, since part of the measured probe intensities are due to non-specific hybridization and the noise in the optical detection system. In the \Rpackage{affyPara} package the same BGC methods as in the \Rpackage{affy} package are available. To list the background correction methods -- built into the package -- the function \Rfunction{bgcorrect.method()} can be used: <<>>= bgcorrect.methods() @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Use Background Correction Para} The function \Rfunction{bgCorrectPara()} needs an input data object (Dilution) and the background correction method (method="rma") as input parameters. <>= affyBatchBGC <- bgCorrectPara(Dilution, method="rma", verbose=TRUE) @ If you do not want to use an \Robject{AffyBatch} object as input data, you can directly give the CEL files and a vector of the CEL files location respectively to the function \Rfunction{bgCorrectPara()}: <>= files <- list.celfiles(full.names=TRUE) affyBatchGBC <- bgCorrectPara(files, method="rma", cluster=cl) @ For this method all CEL files have to be available at every node. This could be achieved for example using a a shared memory system. If you want to distribute the CEL files to the slaves, see Chapter~\ref{chap:distri}. Additionally this example demonstrates how to use an extra cluster object (\Robject{cluster=cl}). %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Normalization} Normalization methods make measurements from different arrays comparable. Multi-chip methods have proved to perform very well. We parallelized the following methods \begin{description} \item[\Rfunction{contrast}] -> \Rfunction{normalizeAffyBatchConstantPara} \item[\Rfunction{invariantset}] -> \Rfunction{normalizeAffyBatchInvariantsetPara} \item[\Rfunction{loess}] -> \Rfunction{normalizeAffyBatchLoessPara} \item[\Rfunction{quantile}] -> \Rfunction{normalizeAffyBatchQuantilesPara} \item[\Rfunction{vsn2}] -> \Rfunction{vsnPara} \end{description} available from the \Rpackage{affy} package in the function \Rfunction{normalize()} and the \Rpackage{vsn} package. The parallelized normalization functions need an input data object and the coresponding normalization parameters as input paramters. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Use Quantile Normalization Para} The function \Rfunction{normalizeAffyBatchQuantilesPara()} needs an input data object (Dilution) and quantil normalization parameters as input parameters (type = "pmonly"). <>= affyBatchNORM <- normalizeAffyBatchQuantilesPara( Dilution, type = "pmonly", verbose=TRUE) @ If you do not want to use an \Rfunction{AffyBatch} object as input data, you can directly give the CEL files and a vector of the CEL files location respectively to the function \Rfunction{normalizeAffyBatchQuantilesPara()}: <>= files <- list.celfiles(full.names=TRUE) affyBatchNORM <- normalizeAffyBatchQuantilesPara( files, type = "pmonly") @ For this method all CEL files have to be available from a shared memory system. If you want to distribute the CEL files to the slaves, see Chapter~\ref{chap:distri}. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Summarization} Summarization is the final step in preprocessing raw data. It combines the multiple probe intensities for each probeset to produce expression values. These values will be stored in the class called \Rfunction{ExpressionSet}. Compared to the \Rfunction{AffyBatch} class, the \Rfunction{ExpressionSet} requires much less main memory, because there are no more multiple data. Therefore the complete preprocessing functions in the \Rpackage{affyPara} package are very effizient, because no complete \Rfunction{AffyBatch} object has to be build, see Chapter~\ref{chap:preproPara}. The parallelized summarization functions need an input data object and the coresponding summarization parameters as input paramters. To see the summarization methods and PM correct methods that are built into the package the function \Rfunction{express.summary.stat.methods()} and \Rfunction{pmcorrect.methods()} can be used: <<>>= express.summary.stat.methods() pmcorrect.methods() @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Use Summarization Para} The function \Rfunction{computeExprSetPara()} needs an input data object (Dilution) and the summarization parameters as input parameters (pmcorrect.method = "pmonly", summary.method = "avgdiff"). <>= esset <- computeExprSetPara( Dilution, pmcorrect.method = "pmonly", summary.method = "avgdiff") @ If you do not want to use an \Rfunction{AffyBatch} object as input data, you can directly give the CEL files and a vector of the CEL files location respectively to the function \Rfunction{computeExprSetPara()}: <>= files <- list.celfiles(full.names=TRUE) esset <- normalizeAffyBatchQuantilesPara( files, pmcorrect.method = "pmonly", summary.method = "avgdiff") @ For this method all CEL files have to be available from a shared memory system. If you want to distribute the CEL files to the slaves, see Chapter~\ref{chap:distri}. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Complete Preprocessing}\label{chap:preproPara} By combining the background correction, normalization and summarization methods to one single method for preprocessing an efficient method can be obtained. For parallelization, the combination has the big advantage of reducing the exchange of data between master and slaves. Moreover, at no point a complete \Rfunction{AffyBatch} object needs to be built, and the time-consuming rebuilding of the AffyBatch objects is no longer necessary. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Use Preprocessing Para} It is important to note that not every preprocessing method can be combined together. For more details see the vignettes in the \Rpackage{affy} package. The function \Rfunction{preproPara()} needs an input data object (Dilution) and the parameters for BGC, normalization and summarization as input parameters. <>= esset <- preproPara( Dilution, bgcorrect = TRUE, bgcorrect.method = "rma", normalize = TRUE, normalize.method = "quantil", pmcorrect.method = "pmonly", summary.method = "avgdiff") @ The function works very similar to the \Rfunction{expresso()} function from the \Rpackage{affy} package. It is not very reasonable to have an \Robject{AffyBatch} object as input data object for this function. Because therefore you have to create a complete \Rfunction{AffyBatch} object. (very memory intensive). It is much better to use a vector of CEL files as input data object. And at no point a complete \Robject{AffyBatch} object needs to be built: <>= files <- list.celfiles(full.names=TRUE) esset <- preproPara( files, bgcorrect = TRUE, bgcorrect.method = "rma", normalize = TRUE, normalize.method = "quantil", pmcorrect.method = "pmonly", summary.method = "avgdiff") @ For this method all CEL files have to be available from a shared memory system. If you want to distribute the CEL files to the slaves, see Chapter~\ref{chap:distri}. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Use RMA Para} RMA is a famous \cite{Irizarry2003a} complete preprocessing method. This function converts an \Rfunction{AffyBatch} object into an \Rfunction{ExpressionSet} object using the robust multi-array average (RMA) expression measure. There exists a function \Rfunction{justRMA()} in the \Rpackage{affy} package, which reads CEL files and computes an expression measure without using an \Rfunction{AffyBatch} object. The parallelized version of \Rfunction{rma()} is called \Rfunction{rmaPara()} and is a 'simple' wrapper function for the function \Rfunction{preproPara()}. <>= esset <- rmaPara(Dilution) @ It is not very reasonable to have an \Robject{AffyBatch} object as input data object for this function. Because therefore you have to create a complete \Robject{AffyBatch} object (very memory intensive). It is much better to use a vector of CEL files as input data object. And at no point a complete \Robject{AffyBatch} object needs to be built: <>= files <- list.celfiles(full.names=TRUE) esset <- rmaPara(files) @ For this method all CEL files have to be available from a shared memory system. If you want to distribute the CEL files to the slaves, see Chapter~\ref{chap:distri}. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection{Use vsnRMA Para} An other famous preprocessing method is \Rfunction{vsnrma()}. This uses varianze stabilization normalization for normalization and rma for summarization. The parallelized version is called \Rfunction{vsnrmaPara()}. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Quality Control and Assessment} Quality control and assessment gets very difficult for a huge number of microarrays data. It takes a lot of computation time and it requires a lot of main memory. In addition most methods for quality control are graphical tools and using more than 200 arrays no more information can be readout of the figures. Therefore some optimized functions for quality control were implemented in parallel: <>= boxplotPara(Dilution) MAplotPara(Dilution) @ These functions create output tables with the quality assessment for all arrays and create optimized graphics (\Robject{plot=TRUE}) for huge numbers of arrays. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Distributing Data}\label{chap:distri} At a workstation cluster the CEL files could be available by a shared memory system. At a workstation cluster, this is often done by a samba device. But this could be the bottle neck for communication traffic. For distributed memory systems, the function \Rfunction{distributeFiles()} for (hierarchically) distributing files from the master to a special directory (e.g. '/tmp/') at all slaves was designed. R or the faster network protocols SCP or RCP can be used for the process of distributing. <>= path <- "tmp/CELfiles" # path at local computer system (master) files <- list.files(path,full.names=TRUE) distList <- distributeFiles(CELfiles, protocol="RCP") eset <- rmaPara(distList$CELfiles) @ With the paramteter \Rfunction{hierarchicallyDist} hierarchically distribution could be used. If \Rfunction{hierarchicallyDist = TRUE} data will be hierarchically distributed to all slaves. If \Rfunction{hierarchicallyDist = FALSE} at every slave only a part of data is available. This function and the corresponding input data object (\Robject{distList\$CELfiles}) could be used for every parallelized preprocessing method in the \Rpackage{affyPara} package. There is also a function to remove distributed files: <>= removeDistributedFiles("/usr1/tmp/CELfiles") @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Results and Discuccion} This article proposes the new package called \Rpackage{affyPara} for parallelized preprocessing of high-density oligonucleotide microarrays. Parallelization of existing preprocessing methods produces, in view of machine accuracy, the same results as serialized methods. The partition of data and distribution to several nodes solves the main memory problems and accelerates the methods. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Test for Accuracy}\label{chap:compare} In view of machine accuracy, the parallelized functions produce same results as serialized methods. To compare results from different functions you can use the functions \Rfunction{identical()} or \Rfunction{all.equal()} from the \Rpackage{base} package. <>= affybatch1 <- bg.correct(Dilution, method="rma") affybatch2 <- bgCorrectPara(Dilution, method="rma") identical(exprs(affybatch1),exprs(affybatch2)) all.equal(exprs(affybatch1),exprs(affybatch2)) @ Attention: If you directly compare the \Robject{AffyBatch} or \Robject{ExpressionSet} objects there are some warnings or not similar results. This is being caused by different values of the 'Title' and 'notes' slots in experimentData. Using the function \Rfunction{exprs()} to get the expression data equal results -- in view of machine accuracy -- can be proven. Attention for loess normalization: In loess normalization a random sub sample will be created. For generating the same results the random generator has to be reset for every run: <>= set.seed(1234) affybatch1 <- normalize.AffyBatch.loess(Dilution) set.seed(1234) affybatch2 <- normalizeAffyBatchLoessPara(Dilution, verbose=TRUE) identical(exprs(affybatch1),exprs(affybatch2)) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Speedup} In order to illustrate by how much the parallel algorithms are faster than the corresponding sequential algorithms, Figure \ref{fig:speedup} shows the speedup for the parallelized preprocessing methods for 50, 100 and 200 CEL files. An average speedup of up to the factor 10 for 200 arrays or more may be achieved. \begin{figure}[ht] \centering \includegraphics[width=12.5cm]{fig/affyPara_speedup.pdf} \caption{Speedup for the parallelized preprocessing methods for 200, 100 and 50 microarrays.} \label{fig:speedup} \end{figure} The computation time for parallel algorithms is compared to the original serial code. It is well known that parts of the original code are not very well implemented. Therefore an increased speedup (super-linear) could be achieved for low numbers of processors, the outliers are mostly generated by unbalanced data distribution. For example 200 microarrays can not be equally distributed to 23 nodes, there are some computers who have to calculate with one more array. Furthermore foreign network traffic in the workstation cluster at the IBE is a reason for outliers. After a special number of processors (depending on number of arrays and method) the plots for all parallelized function get a flat. This means, by using more processors no more speedup could be achieved. Therefore for example for 200 microrrays circa 10 nodes will be enough. The cluster at the Department for Medical Information, Biometrics and Epidemiology (IBE, University of Munich) consists of 32 personal computers with 8 GB main memory and two dual core Intel Xeon DP 5150 processors. Using this cluster, about 16.000 (32 nodes $\cdot$ approximately 500 CEL files) microarrays of the type HGU-133A can be preprocessed using the function \Rfunction{preproPara()}. By expanding the cluster, the number of microarrays can be increased to any given number. The \Rpackage{affyPara} package is tested in different hardware environments: \begin{description} \item[Computer Cluster:] IBE, Linux-Cluster(LRZ, Munich, Germany), HLRB2(LRZ, Munich, Germany), Hoppy (FHCRC, Seattle, WA, USA) \item[Multicore:] IBE, HLRB2(LRZ, Munich, Germany), lamprey (FHCRC, Seattle, WA, USA) \end{description} Thanks a lot to the institutes for providing access to their computer ressources. <>= stopCluster() @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \nocite{Team2007} \bibliographystyle{plain} \bibliography{references} \end{document}