% % NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % %\VignetteDepends{graph,Rgraphviz} %\VignetteKeywords{Protein Complex, Graphs} %\VignettePackage{apComplex} %\VignetteIndexEntry{apComplex} \documentclass[11pt]{article} \usepackage{amsmath} \usepackage[authoryear,round]{natbib} \usepackage{hyperref} \usepackage{times} \usepackage{comment} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \bibliographystyle{plainnat} \title{Protein complex membership estimation using \Rpackage{apComplex}} \author{Denise Scholtens} \date{} \begin{document} \maketitle \Rpackage{apComplex} contains functions for estimating protein complex membership using data from affinity-purification/mass-spectrometry (AP-MS) experiments. Users must specify the believed sensitivity and specificity of the AP-MS technology and may incorporate external similarity data for the proteins under investigation. The statistical details of the algorithm are reported in Scholtens and Gentleman (2004) and the biological implications are discussed in Scholtens, Vidal, and Gentleman (2005). <>= plotGraph <- function(x, y, ...) { if (require("Rgraphviz", quietly=TRUE)) return(plot(x, y, ...)) else { msg = "This plot requires Rgraphviz" plot(0, 0, type="n") text(0, 0.5, msg) } } @ <>= library(apComplex) @ \section*{AP-MS Data} AP-MS technology is designed to detect complex comembership among proteins. A set of proteins are used as baits, and in each purification, the bait protein finds the set of hit proteins with which it shares membership in at least one complex. Suppose proteins $P_{1}$, $P_{2}$, $P_{4}$, and $P_{6}$ compose one complex and proteins $P_{3}$, $P_{4}$ and $P_{5}$ compose a separate complex. If proteins $P_{1}$, $P_{2}$, and $P_{3}$ are used as baits, then with perfectly sensitive and specific AP-MS technology, the following data should be observed. \vspace{.2in} \begin{tabular}{cc|cccccc} & \multicolumn{1}{c}{} & \multicolumn{6}{c}{Hits} \\ & \multicolumn{1}{c}{} & $P_{1}$ & $P_{2}$ & $P_{3}$ & $P_{4}$ & $P_{5}$ & $P_{6}$ \\ \cline{3-8} & $P_{1}$ & 1 & 1 & 0 & 1 & 0 & 1 \\ Baits & $P_{2}$ & 1 & 1 & 0 & 1 & 0 & 1 \\ & $P_{3}$ & 0 & 0 & 1 & 1 & 1 & 0 \\ \end{tabular} \vspace{.2in} \noindent The rows of the matrix are baits, the columns are hits, an entry of 1 in the $i$th row and $j$th column indicates that bait protein $i$ finds protein $j$ as a hit, and an entry of 0 in the $i$th row and $j$th column indicates that bait protein $i$ does not find protein $j$ as a hit. The diagonal entries are 1 since a protein is always a complex comember with itself. Note that bait proteins can be found as hits by other bait proteins. Also note that some proteins are never used as baits. A graph of the data is useful for understanding which comembership relationships are tested in AP-MS experiments and which are not. In the graph in Figure \ref{fig:trueData}, nodes represent proteins and directed edges from baits to hits represent complex comembership. Bait proteins are yellow and hit-only proteins (i.e. proteins that are found as hits but never used as baits) are white. Directed edges always originate at yellow bait proteins and point to the set of hits detected by that bait. The red reciprocated edge connecting $P_{1}$ and $P_{2}$ represents a bait-bait relationship that is tested twice, once in each purification. The gray unreciprocated edges represent bait-hit-only edges that are only tested once. Missing edges between baits and other baits or hit-only proteins represent comemberships that are tested, but not observed. Edges between hit-only proteins are always missing, not because the comemberships are known not to exist, but because they are never tested. <>= FromT <- c(rep("P1",3),rep("P2",3),rep("P3",2)) ToT <- c("P2","P4","P6","P1","P4","P6","P4","P5") LT <- cbind(FromT,ToT) apEXGT <- ftM2graphNEL(LT) fcT <- c(rep("yellow",3),rep("white",3)) names(fcT) <- c("P1","P2","P3","P4","P5","P6") nAT <- list(fillcolor=fcT) ecT <- c("red","red",rep("gray",6)) names(ecT) <- c("P1~P2","P2~P1","P1~P6","P1~P4","P2~P4","P2~P6","P3~P4","P3~P5") eAT <- list(color=ecT) plotGraph(apEXGT,nodeAttrs=nAT,edgeAttrs=eAT) @ \begin{figure}[htbp] \begin{center} \includegraphics[width=0.5\textwidth]{apComplex-trueData} \caption{\label{fig:trueData} True complex comemberships that would be detected with perfectly sensitive and specific AP-MS technology using $P_{1}$, $P_{2}$, and $P_{3}$ as baits.} \end{center} \end{figure} In reality, AP-MS technology is neither perfectly sensitive nor specific, resulting in false positive (FP) and false negative (FN) observations of the complex comemberships under investigation. Suppose in this experiment, we make a FN observation between $P_{2}$ and $P_{4}$, i.e. $P_{4}$ is not found as a hit when we use $P_{2}$ as a bait. Also suppose we make two FP observations: 1) when we use $P_{3}$ as a bait, we find an extraneous hit-only protein $P_{7}$, and 2) when performing a purification using $P_{8}$ as a bait, we find $P_{3}$ as a hit. Data for this example are contained in the matrix \Robject{apEX}. In this matrix, rows again represent baits and columns represent hits. <>= data(apEX) apEX @ The graph of the data in Figure \ref{fig:observedData} demonstrates the observations recorded in \Robject{apEX}. Note the missing edge from $P_{2}$ to $P_{4}$ and the new edge from $P_{3}$ to $P_{7}$. Also note the blue unreciprocated edge between $P_{3}$ and $P_{8}$ -- this is a complex comembership that was tested twice when $P_{3}$ and $P_{8}$ were used as baits, but only detected once in the purification using $P_{8}$ as a bait. <>= From <- c(rep("P1",3),rep("P2",2),rep("P3",3),"P8") To <- c("P2","P4","P6","P1","P6","P4","P5","P7","P3") L <- cbind(From,To) M1 <- ftM2adjM(L) apEXG <- as(M1,"graphNEL") fc <- c(rep("yellow",4),rep("white",4)) names(fc) <- c("P1","P2","P3","P8","P4","P5","P6","P7") nA <- list(fillcolor=fc) ec <- c("red","red","blue",rep("gray",6)) names(ec) <- c("P1~P2","P2~P1","P8~P3","P1~P6","P1~P4","P2~P6","P3~P4","P3~P5","P3~P7") eA <- list(color=ec) plotGraph(apEXG,nodeAttrs=nA,edgeAttrs=eA) @ \begin{figure}[htbp] \begin{center} \includegraphics[width=0.5\textwidth]{apComplex-observedData} \caption{\label{fig:observedData} Hypothetical data from an AP-MS experiment with a FN observation between $P_{2}$ and $P_{6}$ and FP observations between $P_{3}$ and $P_{7}$ and $P_{8}$ and $P_{3}$.} \end{center} \end{figure} \Rpackage{apComplex} starts with observed complex \textit{comembership} data from AP-MS and estimates complex \textit{membership}. \section*{Algorithm} The edges in an AP-MS graph represent complex comembership. If all proteins were used as baits, then maximal complete subgraphs (or cliques) in the AP-MS graph would contain entire collections of proteins that compose a complex. The maximal complete subgraphs could then be used to form a protein complex membership graph (PCMG): a bipartite graph in which one set of nodes represents proteins, the other set represents complexes, and an edge from a protein node to a complex node represents membership of the protein in that complex. Bipartite graphs can also be represented using an affiliation matrix in which the rows represent proteins, the columns represent complexes, and an entry of 1 in the $j$th row and $i$th column represents membership of protein $i$ in complex $j$. \Rpackage{apComplex} is essentially a maximal complete subgraph finding algorithm that is adapted for the bait/hit status of proteins, as well as imperfect observation of edges. The first step in estimating complex membership is to find the \textit{maximal BH-complete subgraphs} in the observed AP-MS data. A \textit{BH-complete subgraph} is defined to be a collection of baits and hits for which all bait-bait edges and all bait-hit-only edges exist; a \textit{maximal BH-complete subgraph} is a BH-complete subgraph that is not contained in any other BH-complete subgraph. In the event of unreciprocated observations between pairs of baits, the edges are estimated to exist when the sensitivity of the AP-MS technology is less than the specificity. Under a logistic regression model where the parameters represent sensitivity and specificity, this treatment of unreciprocated bait-bait edges maximizes the likelihood $L$ for the data (Scholtens and Gentleman, 2004). In our example, the observed data contains four maximal BH-complete subgraphs, shown in Figure \ref{fig:PCMG0}. The function \Rfunction{bhmaxSubgraph} will detect these maximal BH-complete subgraphs and report them using an affiliation matrix. Figure \ref{fig:BP1} shows the corresponding bipartite PCMG for the initial complex estimates. <>= PCMG0 <- bhmaxSubgraph(apEX) PCMG0 @ <>= par(mfrow=c(2,2)) sn1 <- c("P3","P8") sg1 <- subGraph(sn1,apEXG) edgemode(sg1) <- "directed" fc1 <- fc[sn1] nA1 <- list(fillcolor=fc1) ec1 <- ec["P8~P3"] eA1 <- list(color=ec1) plotGraph(sg1,nodeAttrs=nA1,edgeAttrs=eA1) title("bhmax1") sn2 <- c("P1","P2","P6") sg2 <- subGraph(sn2,apEXG) edgemode(sg2) <- "directed" fc2 <- fc[sn2] nA2 <- list(fillcolor=fc2) ec2 <- ec[c("P1~P2","P2~P1","P1~P6","P2~P6")] eA2 <- list(color=ec2) plotGraph(sg2,nodeAttrs=nA2,edgeAttrs=eA2) title("bhmax2") sn3 <- c("P1","P4","P6") sg3 <- subGraph(sn3,apEXG) edgemode(sg3) <- "directed" fc3 <- fc[sn3] nA3 <- list(fillcolor=fc3) ec3 <- ec[c("P1~P4","P1~P6")] eA3 <- list(color=ec3) plotGraph(sg3,nodeAttrs=nA3,edgeAttrs=eA3) title("bhmax3") sn4 <- c("P3","P4","P5","P7") sg4 <- subGraph(sn4,apEXG) edgemode(sg4) <- "directed" fc4 <- fc[sn4] nA4 <- list(fillcolor=fc4) ec4 <- ec[c("P3~P4","P3~P5","P3~P7")] eA4 <- list(color=ec4) plotGraph(sg4,nodeAttrs=nA4,edgeAttrs=eA4) title("bhmax4") @ \begin{figure}[htbp] \begin{center} \includegraphics[width=0.6\textwidth]{apComplex-PCMG0graph} \caption{\label{fig:PCMG0} Four maximal BH-complete subgraphs in the observed data.} \end{center} \end{figure} <>= FromBP1 <- c("P1","P1","P2","P3","P3","P4","P4","P5","P6","P6","P7","P8") ToBP1 <- paste("BHM",c(2,3,2,1,4,3,4,4,2,3,4,1),sep="") LBP1 <- cbind(FromBP1,ToBP1) gBP1 <- ftM2graphNEL(LBP1, edgemode="undirected") fc <- c(rep("yellow",4),rep("white",4),rep("salmon",4)) names(fc) <- c("P1","P2","P3","P8","P4","P5","P6","P7",paste("BHM",1:4,sep="")) nABP1 <- list(fillcolor=fc) plotGraph(gBP1,nodeAttrs=nABP1) @ \begin{figure}[htbp] \begin{center} \includegraphics[width=0.5\textwidth]{apComplex-BP1} \caption{\label{fig:BP1} Bipartite PCMG for the initial complex estimates determined by locating maximal BH-complete subgraphs in the graph of observed AP-MS data.} \end{center} \end{figure} The initial maximal BH-complete subgraph estimate of the PCMG does not allow missing edges between bait and hit-only proteins; since AP-MS technology is not perfectly sensitive, it is reasonable to expect a number of missing edges in the subgraph for each complex estimate. \Rpackage{apComplex} accommodates this by employing an objective function to evaluate the complex estimates. For a complex $c_k$, let $C(c_k)$ represent the product of 1) the binomial probability for the number of observed edges in $c_k$ given the number of tested edges, and 2) a two-sided $p$-value from Fisher's exact test for the distribution of missing incoming edges for complex estimate $c_k$. Then let $C$ equal the product of $C(c_k)$ over all complexes $c_1,...,c_K$. The objective function $P$ is the product of $L$ and $C$, or $P=L \times C$. $L$ is maximized with the initial maximal BH-complete subgraphs -- the algorithm in \Rpackage{apComplex} looks to increase $C$ in favor of small decreases in $L$. After the initial PCMG estimate is made using \Rfunction{bhmaxSubgraph}, \Rfunction{mergeComplexes} proposes pairwise unions of individual complex estimates. If $P$ increases when the complexes are treated as one, then the combination is accepted. If more than one union increases $P$, then the union with the largest increase is accepted. Algebraic details of the acceptance criteria are available in Scholtens and Gentleman (2004). <>= PCMG1 <- mergeComplexes(PCMG0,apEX,sensitivity=.7,specificity=.75) PCMG1 @ In this case, \Robject{bhmax2} and \Robject{bhmax3} were combined into \Robject{Complex2}. (\Robject{bhmax1} and \Robject{bhmax4} remained as originally estimated and are now named \Robject{Complex1} and \Robject{Complex3}, respectively). The one missing edge out of the six tested in \Robject{Complex2} is consistent with the sensitivity of the technology and the distribution of missing edges (in this case only one) is sufficiently random in the subgraph for \Robject{Complex2}. Figure \ref{fig:mergedComplex} contains the subgraphs for the new complex estimates and Figure \ref{fig:BP2} shows the corresponding bipartite PCMG. <>= par(mfrow=c(1,3)) plotGraph(sg1,nodeAttrs=nA1,edgeAttrs=eA1) title("Complex1") sn5 <- c("P1","P2","P4","P6") sg5 <- subGraph(sn5,apEXG) edgemode(sg5) <- "directed" fc5 <- fc[sn5] nA5 <- list(fillcolor=fc5) ec5 <- ec[c("P1~P2","P2~P1","P1~P4","P1~P6","P2~P6")] eA5 <- list(color=ec5) plotGraph(sg5,nodeAttrs=nA5,edgeAttrs=eA5) title("Complex2") plotGraph(sg4,nodeAttrs=nA4,edgeAttrs=eA4) title("Complex3") @ \begin{figure}[htbp] \begin{center} \includegraphics[width=0.75\textwidth]{apComplex-mergedComplex} \caption{\label{fig:mergedComplex} Subgraphs for new complex estimates after using \Rfunction{mergeComplexes}.} \end{center} \end{figure} <>= FromBP2 <- paste("P",c(1,2,3,3,8,4,4,5,6,7),sep="") ToBP2 <- paste("C",c(2,2,1,3,1,2,3,3,2,3),sep="") LBP2 <- cbind(FromBP2,ToBP2) gBP2 <- ftM2graphNEL(LBP2, edgemode="undirected") fc <- c(rep("yellow",4),rep("white",4),rep("thistle",3)) names(fc) <- c("P1","P2","P3","P8","P4","P5","P6","P7",paste("C",1:3,sep="")) nABP2 <- list(fillcolor=fc) plotGraph(gBP2,nodeAttrs=nABP2) @ \begin{figure}[htbp] \begin{center} \includegraphics[width=0.5\textwidth]{apComplex-BP2} \caption{\label{fig:BP2}Bipartite graph for new complex estimates after using \Rfunction{mergeComplexes}.} \end{center} \end{figure} The function \Rfunction{findComplexes} can be used to run both steps together. <>= PCMG2 <- findComplexes(apEX,sensitivity=.7,specificity=.75) PCMG2 @ Our algorithm makes three types of complex estimates: multi-bait-mult-edge (MBME) complexes that contain multiple baits and multiple edges, single-bait-multi-hit (SBMH) complexes that contain a single bait and a collection of hit-only proteins, and unreciprocated bait-bait (UnRBB) complexes that only contain two bait proteins connected by one unreciprocated edge. MBME complexes are the most reliable outputs since they contain the most tested data. SBMH complexes are useful for proposing future experiments since the topology among the hit-only proteins is unknown. UnRBB complexes may result from FP observations since the edges are tested twice, observed once, and not confirmed by other subgraph edges. On the other hand, the unreciprocated edge may also result from a FN observation between the two baits. The PCMG affiliation matrix resulting from \Rfunction{mergeComplexes} or \Rfunction{findComplexes} can be sorted into the MBME, SBMH, and UnRBB components using the function \Rfunction{sortComplexes}. <>= sortComplexes(PCMG2,adjMat=apEX) @ Recall that the true complexes in this example consist of $P_{1}$, $P_{2}$, $P_{4}$, and $P_{6}$ and $P_{3}$, $P_{4}$, and $P_{5}$. \Robject{MBME1} accurately estimates $P_{1}$, $P_{2}$, $P_{4}$, and $P_{6}$ as composing one complex. \Robject{SBMH1} predicts $P_{3}$, $P_{4}$, and $P_{5}$ as members of one complex, but also includes the FP observation for $P_{7}$. The limited data for this complex makes it impossible to distinguish the FP from the true positive (TP) observations. Further purifications using $P_{4}$, $P_{5}$, and $P_{7}$ as baits would likely resolve this difficulty. \Robject{UnRBB1} is the result of the FP observation between $P_{8}$ and $P_{3}$. While it is reported as a complex estimate, further experimental testing would likely confirm that this is not a true complex. If the user desires, a matrix of similarity data can be included as the \Robject{simMat} argument in \Rfunction{mergeComplexes} and \Rfunction{findComplexes} in an extended logistic regression model. The similarity measure can be used to lend credence to the existence of an edge, even if it is not detected using AP-MS. Users must specify the parameter \Robject{Beta} which weights contribution of the similarity measure to the model. See Scholtens and Gentleman (2004) for details. \section*{Publicly available data} Two publicly available data sets are included in \Rpackage{apComplex}. \Robject{TAP} is an adjacency matrix of the AP-MS data (called `TAP') reported by Gavin, et al. (2002). There were 3420 comemberships reported using 455 baits and 909 hit-only proteins. \Robject{TAPgraph} contains a graph of class \Robject{graphNEL} of the TAP data. The TAP data were originally compiled into 232 yTAP complexes, available in Supplementary Table 1 of Gavin, et al. (2002) at http://www.nature.com and at http://yeast.cellzome.com. These yTAP complex estimates, along with the annotations given by Gavin, et al. are available in \Robject{yTAP}. <>= data(TAP) dim(TAP) data(TAPgraph) which(TAP["Abd1",]==1) adj(TAPgraph,"Abd1") data(yTAP) @ \Robject{HMSPCI} is an adjacency matrix of the AP-MS data (called `HMS-PCI') reported by Ho, et al. (2002). There were 3687 comemberships reported using 493 baits and 1085 hit-only proteins. \Robject{HMSPCIgraph} contains a graph of class \Robject{graphNEL} of the HMS-PCI data. <>= data(HMSPCI) dim(HMSPCI) data(HMSPCIgraph) which(HMSPCI["YAL015C",]==1) adj(HMSPCIgraph,"YAL015C") @ These data were analyzed using \Rpackage{apComplex}, and the results are described in Scholtens, Vidal, and Gentleman (submitted). The complex estimates are available for both data sets - \Robject{MBMEcTAP}, \Robject{SBMHcTAP}, and \Robject{UnRBBcTAP} for the TAP data, and \Robject{MBMEcHMSPCI}, \Robject{SBMHcHMSPCI}, and \Robject{UnRBBcHMSPCI} for the HMS-PCI data. One example of the detail with which the \Rpackage{apComplex} algorithm can estimate complex membership involves the PP2A proteins Tpd3, Cdc55, Rts1, Pph21, and Pph22. These five proteins compose four heterotrimers (Jiang and Broach, 1999). We accurately predict these trimers as distinct complexes. and furthermore note the exclusive association of Zds1 and Zds2 with the Cdc55/Pph22 trimer. Confirmation of this prediction in the lab may help clarify the cellular function of this particular trimer and the reason for its joint activity with Zds1 and Zds2. <>= data(MBMEcTAP) which(MBMEcTAP[,37]==1) which(MBMEcTAP[,38]==1) which(MBMEcTAP[,39]==1) which(MBMEcTAP[,195]==1) which(MBMEcTAP[,233]==1) @ In summary, \Rpackage{apComplex} can be used to predict complex membership using data from AP-MS experiments. An accurate catalog of complex membership is a fundamental requirement for understanding functional modules in the cell. Integration of \Rpackage{apComplex} analyses with other high-throughput data, including binary physical interactions assayed by yeast two-hybrid technology, gene expression data, and binding domain data are promising avenues for further systems biology research. \section*{References} \noindent Gavin, A.-C., et al. (2002) Functional organization of the yeast proteome by systematic analysis of protein complexes, \textit{Nature}, \textbf{415}, 141-147. \vspace{.2in} \noindent Ho, Y., et al. (2002) Systematic identification of protein complexes in \textit{Saccharomyces cerevisiae} by mass spectrometry, \textit{Nature}. \textbf{415}, 180-183. \vspace{.2in} \noindent Jiang, Y., Broach, J. (1999) Tor proteins and protein phosphatase 2A reciprocally regulate Tap42 in controlling cell growth in yeast, \textit{EMBO Journal}, \textbf{18}, 2782-2792. \vspace{.2in} \noindent Scholtens, D., Gentleman, R. (2004) Making sense of high-throughput protein-protein interaction data, \textit{Statistical Aplications in Genetics and Molecular Biology}, \textbf{3}, Article 39. \vspace{.2in} \noindent Scholtens, D., Vidal, M., Gentleman, R. Local modeling of global interacome networks, \textit{Submitted}. \end{document}