%% LyX 1.6.1 created this file. For more info, see http://www.lyx.org/. %% Do not edit unless you really know what you are doing. \documentclass[english]{article} \usepackage[T1]{fontenc} \usepackage[latin9]{inputenc} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Textclass specific LaTeX commands. \usepackage{Sweave} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rcommand}[1]{{\texttt{#1}}} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\textit{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\textit{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% User specified LaTeX commands. %\VignetteIndexEntry{SSPA vignette} %\VignetteKeywords{pilotData, sampleSize} %\VignettePackage{SSPA} %\VignetteDepends{SSPA, multtest} \usepackage{babel} \begin{document} \title{Sample Size and Power Analysis for Microarray Studies} \author{Maarten van Iterson and Ren{\'e}e de Menezes\\ Center for Human and Clinical Genetics,\\ Leiden University Medical Center, The Netherlands\\ Package \texttt{SSPA}, version 0.1.3\\ } \maketitle \tableofcontents{} \section{Introduction} This document shows the functionality of the R-package\emph{ }\texttt{SSPA}.\emph{ }The package performs the power and sample size analysis method as described by \cite{Ferreira2006a,Ferreira2006b}. Our implementation allows for fast and realistic estimates of power and sample size for microarray experiments, given pilot data. By means of two simple commands (\texttt{pilotData(), sampleSize()}), a researcher can read their data in and compute the desired estimates. Other functions are provided to facilitate interpretation of results. Given a set of test statistics from the pilot data, the knowledge of their distribution under the null hypothesis and the sample size used to compute them, the method estimates the power for a given false discovery rate. The multiple testing problem is controlled through the adaptive version of the Benjamini and Hochberg method \cite{Benjamini1995}. For more details about the implementation and method we refer to \cite{Iterson2008,Ferreira2006a,Ferreira2006b}. In \cite{Iterson2009} we describe two biological case studies using the package\emph{ }\texttt{SSPA}. For comparison we implement the power and sample size estimation method proposed by Ruppert \cite{Ruppert2007} which uses a different estimation approach. Two additional packages need to be installed for using this method namely \texttt{quadprog} and \texttt{splines}. <>= options(continue=" ") @ \section{Basic Example} We demonstrate the functionality of this package by using the preprocessed gene expression data from the leukemia ALL/AML study of \cite{golub1999} from the \texttt{multtest} package. To load the leukemia dataset, use \texttt{data(golub)}, and to view a description of the experiments and data, type \texttt{?golub}. The number of samples per group are shown by \texttt{table(golub.cl)} where ALL is class 0 and AML class 1. <<>>= library(multtest) data(golub) table(golub.cl) @ The required input for the sample size and power analysis is a vector of test-statistics and the sample sizes used to compute them. The test-statistics are obtained by a differentially gene expression analysis using one of the available packages like \texttt{limma, maanova, multtest} (it is also possible to import the vector of test-statistics in \texttt{R} if they are calculated using any other software than \texttt{R}). Here we will use the function \texttt{mt.teststat} from the \texttt{multtest} package to obtain a vector of test-statistics from the leukemia data. <<>>= tst <- mt.teststat(golub, golub.cl) @ The first step in doing the sample size and power analysis is creating a object of class-\texttt{PilotData} which will contain all the necessary information needed for the following power and sample size analysis; a vector of test-statistcs and the sample sizes used to compute them. A user-friendly interface for creating an object of \texttt{PilotData} is available as \texttt{pilotData()}. <<>>= library(SSPA) pd <- pilotData(name="ALL/AML",testStatistics=tst,sampleSizeA=11,sampleSizeB=27) @ Several ways of viewing the content of the \texttt{PilotData}-object are possible either graphically or using a \texttt{show}-method by just typing the name of the created object of \texttt{PilotData}: <<>>= pd @ A histogram of p-values is obtained by just calling \texttt{hist(pd)} and an empirical cumulative distribution of p-values by \texttt{plot(pd)}. % \begin{figure} <>= layout(matrix(c(1,2),nrow=2)) hist(pd, cex.main=1) plot(pd, cex.main=1) @ \caption{Histogram and ECDF plot of p-values.} \end{figure} Now we can create an object of class \texttt{SampleSize} which will perform the estimation of the proportion of non-differentially expressed genes and the density of effect sizes. Several options are possible \texttt{?sampleSize}. The default method for estimation of the proportion of non-differentially expressed genes is the method proposed by \cite{Langaas2005} as implemented by \texttt{convest} from the package \texttt{limma}. Additionally the method by \cite{Storey2002} and \cite{Ferreira2006b} are available, also a user-defined proportion can be given. Again a generic \texttt{show}-method is implemented. <<>>= ss <- sampleSize(pd) ss @ The density of effect size can be shown by a call to \texttt{plotEffectSize()} % \begin{figure} <>= plotEffectSize(ss, type='l') @ \caption{Density of effect-sizes and estimated average power.} \end{figure} Estimating the average power for other sample sizes then that of the pilot-data can be performed with the \texttt{Power()}-function. The user can also give the desired false discovery rate level or possible multilpe false discovery rate levels. % \begin{figure} <>= layout(matrix(c(1:2), nrow=2)) pwr <- Power(ss, plot = FALSE, samplesizes = c(5, 10, 15, 20), fdr=0.01) plot(c(5, 10, 15, 20), pwr, ylim = c(0, 1), type = "b", ylab = "Power", xlab = "Sample size per group") legend("bottomright", colnames(pwr), col=c(1:ncol(pwr)), pch=1, lty=1) pwr <- Power(ss, plot = FALSE, samplesizes = c(5, 10, 15, 20), fdr=c(0.01, 0.05)) matplot(c(5, 10, 15, 20), pwr, ylim = c(0, 1), type = "b", pch=1, ylab = "Power", xlab = "Sample size per group") legend("bottomright", colnames(pwr), col=c(1:ncol(pwr)), pch=1, lty=1) @ \caption{Density of effect-sizes and estimated average power.} \end{figure} \section{Additional functionality} \subsection{Ferreira's $\pi_{0}$ estimate} Ferreira \textit{et al.} propose a semi-parametric method to estimate the proportion of non-differentially expressed genes\cite{Ferreira2006a,Ferreira2006b}. Figure 4 output of the $\pi_{0}$ estimation method of Ferreira is shown. % \begin{figure} <>= layout(matrix(c(1,2), nrow=1)) ss <- sampleSize(pd, method="Ferreira", pi0=seq(0.05, 0.5, 0.05), doplot=TRUE) plotEffectSize(ss) @ \caption{\textit{Left panel:} The global minimum shows the $\pi_{0}$ estimate. \textit{Right panel:} The estimation of the density of effect sizes using $\pi_{0}$ estimate obtained with Ferreira's method.} \end{figure} \subsection{Rupperts estimation method} For using the method proposed by Ruppert \cite{Ruppert2007} two additional packages need to be installed for using this method namely \texttt{quadprog} and \texttt{splines}. Using \texttt{sampleSize(pd, method="Ruppert", nKnots = 11, bDegree = 3)} both \inputencoding{latin1}{$\pi_{0}$ and the density of effect sizes is estimated by Ruppert's method.} \inputencoding{latin9}\bibliographystyle{plain} \bibliography{SSPA} \end{document}