1 Introduction

The SMAD package provides a suite of scoring functions for evaluating protein-protein interactions (PPIs) from Affinity Purification-Mass Spectrometry (AP-MS) data. These functions assign a probability or confidence score to each candidate interaction, helping to distinguish true biological interactions from non-specific background contaminants.

This vignette showcases the various scoring methods implemented in SMAD.

2 Data Preparation

Most scoring functions in SMAD take a standardized input format. We will use the built-in TestDatInput dataset for demonstration.

library(SMAD)
data("TestDatInput")
head(TestDatInput)
#>       idRun idBait idPrey countPrey lenPrey
#> 7452  68982  TIMP2  ACTC1        15     377
#> 8016  66491  CASP1   CDK4         9     303
#> 7162  68486   BTG3  RPL24         3     157
#> 8086  66491  CASP1 IMPDH2         9     514
#> 23653 72934    LUM  THOP1         7     689
#> 9196  67747    FAS   RFC5         9     340

The columns are:

- idRun: Unique identifier for the AP-MS run.
- idBait: Unique identifier for the bait protein.
- idPrey: Unique identifier for the prey protein.
- countPrey: Spectral counts (or peptide counts) for the prey.
- lenPrey: Length of the prey protein.
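
When preparing your own data, it can help to confirm that the table follows this layout before scoring. The following is a minimal sketch in base R; the helper name check_apms_input and the example rows are made up for illustration and are not part of SMAD.

```r
# Hypothetical helper (not part of SMAD): check that a data frame has the
# five columns described above, with numeric counts and lengths.
check_apms_input <- function(dat) {
  required <- c("idRun", "idBait", "idPrey", "countPrey", "lenPrey")
  missing_cols <- setdiff(required, names(dat))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  stopifnot(is.numeric(dat$countPrey), is.numeric(dat$lenPrey))
  stopifnot(all(dat$countPrey >= 0), all(dat$lenPrey > 0))
  invisible(TRUE)
}

# Made-up example rows in the same layout as TestDatInput
my_dat <- data.frame(
  idRun     = c("run1", "run1", "run2"),
  idBait    = c("BAIT1", "BAIT1", "BAIT2"),
  idPrey    = c("PREYA", "PREYB", "PREYA"),
  countPrey = c(12, 3, 7),
  lenPrey   = c(350, 420, 350)
)
check_apms_input(my_dat)
```

The helper stops with an error if a column is missing or has an unexpected type, and otherwise returns TRUE invisibly.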

3 Scoring Methods

3.1 CompPASS

The Comparative Proteomic Analysis Software Suite (CompPASS) identifies high-confidence interactions by comparing protein occurrences across multiple AP-MS experiments. It produces four types of scores: Z-score, S-score, D-score, and WD-score (weighted D-score).

scoreCompPASS <- CompPASS(TestDatInput)
head(scoreCompPASS)
#>   idBait idPrey AvePSM    scoreZ    scoreS    scoreD Entropy   scoreWD
#> 1  AIFM3  AIFM1     20 7.9382230 36.055513 36.055513       0 1.6903085
#> 2  AIFM3  ALDOA     14 2.6586313  9.095453  9.095453       0 0.2308028
#> 3  AIFM3 ATP5A1      5 0.5826082  5.700877  5.700877       0 0.1529845
#> 4  AIFM3   CALR      4 0.8703043  4.654747  4.654747       0 0.1161689
#> 5  AIFM3   CCT2     24 3.1558989 12.489996 12.489996       0 0.3398374
#> 6  AIFM3   CCT4     20 2.8371693  9.013878  9.013878       0 0.2135599
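
To give a feel for how two of these scores behave, the sketch below computes simplified S- and D-scores on a made-up table in base R. It assumes the commonly cited CompPASS definitions, S = sqrt((k / f) * X) and D = sqrt((k / f)^p * X), where X is the mean spectral count of a prey for a bait, k is the total number of baits, f is the number of baits that detected the prey, and p is the number of replicate runs of the bait in which the prey was seen. The toy data, the variable names, and the choice to average only over runs where the prey was detected are assumptions for illustration; this is not SMAD's internal implementation.

```r
# Toy AP-MS table (made-up values); bait B1 has two replicate runs.
toy <- data.frame(
  idRun     = c("r1", "r1", "r1b", "r2", "r2", "r3", "r4"),
  idBait    = c("B1", "B1", "B1",  "B2", "B2", "B3", "B4"),
  idPrey    = c("P1", "P2", "P2",  "P1", "P3", "P1", "P1"),
  countPrey = c(10,   4,    4,     8,    6,    12,   9),
  stringsAsFactors = FALSE
)

k <- length(unique(toy$idBait))  # total number of baits

# X: mean spectral count per bait-prey pair (over runs where the prey was seen)
pairs <- aggregate(countPrey ~ idBait + idPrey, data = toy, FUN = mean)
names(pairs)[names(pairs) == "countPrey"] <- "X"

# p: number of replicate runs of the bait in which the prey was seen
reps <- aggregate(idRun ~ idBait + idPrey, data = toy,
                  FUN = function(r) length(unique(r)))
names(reps)[names(reps) == "idRun"] <- "p"
pairs <- merge(pairs, reps, by = c("idBait", "idPrey"))

# f: number of distinct baits that pulled down each prey
f_by_prey <- tapply(pairs$idBait, pairs$idPrey, function(b) length(unique(b)))
pairs$f <- as.numeric(f_by_prey[as.character(pairs$idPrey)])

pairs$scoreS <- sqrt((k / pairs$f) * pairs$X)
pairs$scoreD <- sqrt((k / pairs$f)^pairs$p * pairs$X)
pairs
```

In this toy table, P1 is pulled down by every bait and so receives no uniqueness bonus, while P2 is specific to B1 and reproducible across its two replicates, which raises its D-score above its S-score.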

3.2 HGScore

HGScore is based on a hypergeometric distribution error model. It incorporates the Normalized Spectral Abundance Factor (NSAF) to account for protein length and abundance.

scoreHG <- HG(TestDatInput)
head(scoreHG)
#>   InteractorA InteractorB ppiTN tnA  tnB       PPI NMinTn       HG
#> 1         A2M        ACLY     1 122 1197  A2M~ACLY 477317 3.264772
#> 2         A2M         AGK     1 122  940   A2M~AGK 477317 3.707123
#> 3         A2M        AGO1     1 122 1501  A2M~AGO1 477317 2.860551
#> 4         A2M        AHCY     1 122 2349  A2M~AHCY 477317 2.098700
#> 5         A2M       AHSA1     1 122  386 A2M~AHSA1 477317 5.399179
#> 6         A2M       AKAP8     1 122  317 A2M~AKAP8 477317 5.782404
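
The NSAF normalization mentioned above can be illustrated on its own. The sketch below uses the standard definition, in which each prey's spectral count is divided by its length and the result is rescaled so that the values within a run sum to 1. The toy table is made up, and how SMAD combines NSAF with the hypergeometric model internally is not shown here.

```r
# Made-up spectral counts and protein lengths for two runs
toy <- data.frame(
  idRun     = c("r1", "r1", "r1", "r2", "r2"),
  idPrey    = c("P1", "P2", "P3", "P1", "P4"),
  countPrey = c(15, 9, 6, 10, 10),
  lenPrey   = c(300, 300, 200, 300, 250)
)

# Spectral abundance factor: counts divided by protein length
toy$saf <- toy$countPrey / toy$lenPrey

# Normalize within each run so that NSAF values sum to 1 per run
toy$nsaf <- toy$saf / ave(toy$saf, toy$idRun, FUN = sum)
toy
```

Note that P2 and P3 end up with the same NSAF in run r1 even though P2 has more spectral counts, because P3 is a shorter protein.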

3.3 DICE

The Dice coefficient is used to score the interaction affinity between two proteins based on their co-occurrence across different runs. It focuses on prey-prey interactions.

scoreDICE <- DICE(TestDatInput)
head(scoreDICE)
#>   InteractorA InteractorB DICE          PPI
#> 1         A2M        AARS    0     A2M~AARS
#> 2         A2M       AARS2    0    A2M~AARS2
#> 3         A2M    AASDHPPT    0 A2M~AASDHPPT
#> 4         A2M        ABAT    0     A2M~ABAT
#> 5         A2M       ABCD3    0    A2M~ABCD3
#> 6         A2M       ABCE1    0    A2M~ABCE1
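
For intuition, the Dice coefficient of two preys can be written as 2 * |A ∩ B| / (|A| + |B|), where A and B are the sets of runs in which each prey was detected. The base R sketch below applies this to a made-up table; the helper dice_coef is hypothetical and is only meant to illustrate the formula, not to mirror SMAD's implementation.

```r
# Made-up detections: which prey was seen in which run
toy <- data.frame(
  idRun  = c("r1", "r1", "r2", "r2", "r3", "r3", "r4"),
  idPrey = c("P1", "P2", "P1", "P2", "P1", "P3", "P3")
)

# Set of runs for each prey
runs_by_prey <- split(toy$idRun, toy$idPrey)

# Hypothetical helper: Dice coefficient of two sets of run identifiers
dice_coef <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  2 * length(intersect(a, b)) / (length(a) + length(b))
}

# Score every unordered prey pair
prey_pairs <- t(combn(names(runs_by_prey), 2))
scores <- data.frame(
  InteractorA = prey_pairs[, 1],
  InteractorB = prey_pairs[, 2],
  DICE = mapply(function(x, y) dice_coef(runs_by_prey[[x]], runs_by_prey[[y]]),
                prey_pairs[, 1], prey_pairs[, 2], USE.NAMES = FALSE)
)
scores
```

Pairs that never share a run score 0, which is why many rows in a real all-against-all table are zero.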

3.4 Hart

Based on Hart et al. (2007), this algorithm uses a hypergeometric distribution to compute the probability of two proteins interacting, based on their frequency of co-purification.

scoreHart <- Hart(TestDatInput)
head(scoreHart)
#>           PPI InteractorA InteractorB Freq TnA TnB totTn     Hart
#> 1  AARS~ABCE1        AARS       ABCE1    1   3   2  5000 15.24283
#> 2 AARS~ACADSB        AARS      ACADSB    1   3   3  5000 14.14395
#> 3  AARS~ACAT1        AARS       ACAT1    1   3   9  5000 11.65744
#> 4  AARS~ACBD3        AARS       ACBD3    1   3   4  5000 13.45053
#> 5  AARS~ACTC1        AARS       ACTC1    1   3  16  5000 10.45160
#> 6  AARS~ACTN2        AARS       ACTN2    1   3   4  5000 13.45053
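
The general idea of a hypergeometric co-purification test can be sketched with base R's phyper(). The helper below, hyper_score, is hypothetical: it returns the negative log10 of the probability of observing at least k shared purifications by chance. It is a generic illustration of the technique under these assumptions, not the exact parameterization used by Hart(), so it will not reproduce the values shown above.

```r
# Hypothetical helper (not SMAD's implementation): -log10 of the upper-tail
# hypergeometric probability of seeing at least `k` shared purifications.
#   k       : purifications containing both proteins
#   n_a     : purifications containing protein A
#   n_b     : purifications containing protein B
#   n_total : total number of purifications
hyper_score <- function(k, n_a, n_b, n_total) {
  # P(X >= k) when n_b purifications are drawn from n_total,
  # n_a of which contain protein A
  p <- phyper(k - 1, n_a, n_total - n_a, n_b, lower.tail = FALSE)
  -log10(p)
}

hyper_score(k = 3, n_a = 3, n_b = 3, n_total = 10)  # always seen together
hyper_score(k = 0, n_a = 3, n_b = 3, n_total = 10)  # no shared purification
```

Higher scores correspond to co-occurrence that is less likely to arise by chance given how often each protein is observed.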

3.5 PE (Purification Enrichment)

The PE score is based on a Bayesian classifier framework (Collins et al., 2007). It combines “spoke” (bait-prey) and “matrix” (prey-prey) models to compute a comprehensive enrichment score.

# PE might require data.table and RcppAlgos
scorePE <- PE(TestDatInput)
head(scorePE)
#>         PPI        PB        BP InteractorA InteractorB spokeBP spokePB
#> 1  A2M~ACLY  ACLY:A2M  A2M:ACLY         A2M        ACLY      NA      NA
#> 2   A2M~AGK   AGK:A2M   A2M:AGK         A2M         AGK      NA      NA
#> 3  A2M~AGO1  AGO1:A2M  A2M:AGO1         A2M        AGO1      NA      NA
#> 4  A2M~AHCY  AHCY:A2M  A2M:AHCY         A2M        AHCY      NA      NA
#> 5 A2M~AHSA1 AHSA1:A2M A2M:AHSA1         A2M       AHSA1      NA      NA
#> 6 A2M~AKAP8 AKAP8:A2M A2M:AKAP8         A2M       AKAP8      NA      NA
#>    matrixPP        PE
#> 1 0.6405372 0.6405372
#> 2 0.7885690 0.7885690
#> 3 0.5711720 0.5711720
#> 4 0.4918723 0.4918723
#> 5 1.0596816 1.0596816
#> 6 1.0596816 1.0596816

3.6 SAINTexpress

Significance Analysis of INTeractome (SAINT) is a widely used tool for scoring AP-MS data. SMAD provides an integrated version of SAINTexpress with two modes: Spectral Count (spc) and Intensity (int).

3.6.1 SAINTexpress-spc (Spectral Count)

This mode is used for data where protein abundance is measured by spectral counts.

# Using example data from the package
bait_path <- system.file("exdata", "TIP49", "bait.dat", package = "SMAD")
prey_path <- system.file("exdata", "TIP49", "prey.dat", package = "SMAD")
inter_path <- system.file("exdata", "TIP49", "inter.dat", package = "SMAD")

bait <- read.table(bait_path, sep = "\t", header = FALSE, 
                   col.names = c("ip_id", "bait_id", "test_ctrl"))
prey <- read.table(prey_path, sep = "\t", header = FALSE, 
                   col.names = c("prey_id", "prey_length"))
inter <- read.table(inter_path, sep = "\t", header = FALSE, 
                    col.names = c("ip_id", "bait_id", "prey_id", "quant"))

result_spc <- SAINTexpress_spc(inter, prey, bait)
head(result_spc[, c("Bait", "Prey", "SaintScore", "BFDR")])
#>    Bait   Prey SaintScore         BFDR
#> 1 ACTR5  ACTR5  0.0000000 3.658563e-01
#> 2 ACTR5 RUVBL2  0.9999999 2.994328e-09
#> 3 ACTR5 RUVBL1  1.0000000 2.220446e-16
#> 4 ACTR5 INO80C  1.0000000 0.000000e+00
#> 5 ACTR5  ACTR8  1.0000000 0.000000e+00
#> 6 ACTR5   CCT2  0.9328368 8.106743e-03
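
The relationship between the SaintScore and BFDR columns can be illustrated with a small sketch. It assumes the usual Bayesian FDR definition from the SAINT literature, in which the BFDR at a given score is the mean of (1 - score) over all interactions scoring at least as high. The scores below are made up, and SMAD's own computation may differ in detail.

```r
# Made-up posterior probabilities for six candidate interactions
saint_score <- c(1.00, 0.98, 0.95, 0.80, 0.40, 0.05)

# Bayesian FDR at each score: mean of (1 - score) over all interactions
# scoring at least as high
bfdr <- vapply(saint_score, function(s) {
  mean(1 - saint_score[saint_score >= s])
}, numeric(1))

toy_result <- data.frame(SaintScore = saint_score, BFDR = round(bfdr, 4))

# Keep interactions at an illustrative BFDR cutoff of 0.05
toy_result[toy_result$BFDR <= 0.05, ]
```

The same kind of row filter can be applied to a real result table using its BFDR column; the 0.05 cutoff here is only an example.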

3.6.2 SAINTexpress-int (Intensity)

This mode is designed for intensity-based data, such as those from label-free quantification (LFQ).

# Re-using the spectral-count example data purely to illustrate the call;
# a real analysis would supply intensity values in the quant column
result_int <- SAINTexpress_int(inter, prey, bait)
head(result_int[, c("Bait", "Prey", "SaintScore", "BFDR")])
#>    Bait   Prey SaintScore       BFDR
#> 1 ACTR5  ACTR5 0.00000000 0.97874778
#> 2 ACTR5 RUVBL2 0.77444020 0.06311952
#> 3 ACTR5 RUVBL1 0.55922882 0.13826125
#> 4 ACTR5 INO80C 0.08051629 0.61941539
#> 5 ACTR5  ACTR8 0.11749406 0.55342319
#> 6 ACTR5   CCT2 0.01055179 0.89860014

4 Visualization of Scores

Visualizing the distribution of scores can help in selecting appropriate thresholds for high-confidence interactions.

op <- par(mfrow = c(2, 3))  # 2 x 3 panel grid; previous settings are kept in op
hist(scoreCompPASS$scoreWD, main = "CompPASS WD-score", xlab = "WD-score", col = "skyblue")
hist(scoreHG$HG, main = "HGScore", xlab = "HGScore", col = "salmon")
hist(scoreDICE$DICE, main = "DICE Score", xlab = "DICE", col = "lightgreen")
hist(scoreHart$Hart, main = "Hart Score", xlab = "Hart", col = "plum")
hist(scorePE$PE, main = "PE Score", xlab = "PE", col = "orange")
hist(result_spc$SaintScore, main = "SAINT Score (spc)", xlab = "SAINT Score", col = "gold")
par(op)  # restore the previous graphics settings
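
One simple way to turn a score distribution into a threshold is to pick an upper quantile and mark it on the histogram. The sketch below does this with simulated scores standing in for a real score column; the 95th percentile is an arbitrary illustrative choice, not a recommendation from SMAD.

```r
set.seed(1)

# Simulated scores standing in for a real score column (e.g. a WD-score)
sim_scores <- rexp(1000, rate = 2)

# Illustrative cutoff: the 95th percentile of the score distribution
cutoff <- quantile(sim_scores, probs = 0.95, names = FALSE)

hist(sim_scores, breaks = 40, main = "Simulated scores",
     xlab = "Score", col = "grey80")
abline(v = cutoff, col = "red", lty = 2)  # mark the cutoff on the histogram

# Scores at or above the cutoff are treated as high confidence
high_conf <- sim_scores[sim_scores >= cutoff]
length(high_conf)
```

In practice the cutoff is usually chosen with reference to known interactions or controls rather than a fixed percentile.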

5 Session Information

sessionInfo()
#> R version 4.6.0 alpha (2026-04-05 r89794)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] SMAD_1.27.5      RcppAlgos_2.10.0 BiocStyle_2.39.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] vctrs_0.7.3         cli_3.6.6           knitr_1.51         
#>  [4] magick_2.9.1        rlang_1.2.0         xfun_0.57          
#>  [7] otel_0.2.0          purrr_1.2.2         generics_0.1.4     
#> [10] jsonlite_2.0.0      data.table_1.18.2.1 glue_1.8.1         
#> [13] htmltools_0.5.9     tinytex_0.59        sass_0.4.10        
#> [16] gmp_0.7-5.1         rmarkdown_2.31      tibble_3.3.1       
#> [19] evaluate_1.0.5      jquerylib_0.1.4     fastmap_1.2.0      
#> [22] yaml_2.3.12         lifecycle_1.0.5     bookdown_0.46      
#> [25] BiocManager_1.30.27 compiler_4.6.0      dplyr_1.2.1        
#> [28] pkgconfig_2.0.3     Rcpp_1.1.1-1        tidyr_1.3.2        
#> [31] digest_0.6.39       R6_2.6.1            tidyselect_1.2.1   
#> [34] pillar_1.11.1       magrittr_2.0.5      bslib_0.10.0       
#> [37] withr_3.0.2         tools_4.6.0         cachem_1.1.0