Mass spectrometry (MS) is a powerful tool in biomedical research. Advances in label-free methods and MS instruments have enabled high-throughput proteomic profiling of biological samples with increasingly deep coverage of the proteome, but missing values are still ubiquitous in MS-based proteomics data. Single-cell MS-based proteomics is also becoming available, with an increased frequency of missing values. The limpa package implements a pipeline for quantification and differential expression analysis of MS proteomics data, with probabilistic information recovery from missing values. A key feature is the ability to quantify protein expression without missing values, even for proteins with a single peptide or with a high proportion of missing values. Another key feature is the ability to propagate quantification uncertainty through to the differential expression analysis, preserving power but avoiding false discoveries. limpa produces a linear model object suitable for downstream analysis with the limma package, allowing complex experimental designs and other downstream tasks such as gene ontology or pathway analysis.
limpa package version: 0.99.3.
If y.peptide is a matrix of peptide-level log2-intensities (including NAs), protein.id is a vector of protein IDs, and design is a design matrix, then the following code will quantify complete log2-expression for the proteins without missing values and will conduct a differential expression analysis defined by the design matrix.
library(limpa)
dpcfit <- dpc(y.peptide)
y.protein <- dpcQuant(y.peptide, protein.id, dpc=dpcfit)
fit <- dpcDE(y.protein, design)
fit <- eBayes(fit)
topTable(fit)
Here is a complete reproducible analysis using a small simulated dataset. First, generate the dataset:
> library(limpa)
Loading required package: limma
> set.seed(20241230)
> y.peptide <- simProteinDataSet()
The dataset is stored as a limma EList object, with components E (log2-expression), genes (feature annotation) and targets (sample annotation).
The simulation function can generate any number of peptides or proteins but, by default, the dataset has 100 peptides belonging to 25 proteins and the samples are in two groups with \(n=5\) replicates in each group.
About 40% of the expression values are missing.
> dim(y.peptide)
[1] 100 10
> head(y.peptide$genes)
Protein DEStatus
Peptide001 Protein01 NotDE
Peptide002 Protein01 NotDE
Peptide003 Protein01 NotDE
Peptide004 Protein01 NotDE
Peptide005 Protein02 NotDE
Peptide006 Protein02 NotDE
> table(y.peptide$targets$Group)
1 2
5 5
> mean(is.na(y.peptide$E))
[1] 0.425
Next we estimate the intercept and slope of the detection probability curve, which relates the probability of detection to the underlying peptide expression level on the logit scale.
> dpcfit <- dpc(y.peptide)
4 peptides are completely missing in all samples.
> dpcfit$dpc
beta0 beta1
-4.0521 0.7506
> plotDPC(dpcfit)
Then we quantify the protein log2-expression values, using the DPC to represent the missing values. There are no longer any missing values, and the samples now cluster into groups:
> y.protein <- dpcQuant(y.peptide, "Protein", dpc=dpcfit)
Estimating hyperparameters ...
Quantifying proteins ...
Proteins: 25 Peptides: 100
> plotMDS(y.protein)
Finally, we conduct a differential expression analysis using the limma package.
The dpcDE function calls limma’s voomaLmFit function, which was specially developed for use with limpa.
voomaLmFit computes precision weights in a similar way to voom for RNA-seq but, instead of using count sizes, it uses the quantification precisions from dpcQuant.
The plot shows how dpcDE predicts the protein-wise variances from the quantification uncertainties and expression levels.
> Group <- factor(y.peptide$targets$Group)
> design <- model.matrix(~Group)
> fit <- dpcDE(y.protein, design, plot=TRUE)
> fit <- eBayes(fit)
> topTable(fit, coef=2)
Protein DEStatus NPeptides PropObs logFC AveExpr t
Protein23 Protein23 Up 4 0.950 0.9364 9.243 5.3571
Protein24 Protein24 Down 4 0.975 -0.8046 9.330 -4.4836
Protein22 Protein22 Up 4 1.000 0.6810 8.911 4.2444
Protein08 Protein08 Up 4 0.475 0.7248 4.841 2.5971
Protein13 Protein13 NotDE 4 0.625 -0.5275 5.880 -2.4000
Protein21 Protein21 NotDE 4 0.800 0.3147 8.462 1.8132
Protein11 Protein11 NotDE 4 0.500 0.3638 5.014 1.4367
Protein07 Protein07 NotDE 4 0.375 0.3211 4.354 1.2351
Protein06 Protein06 NotDE 4 0.175 -0.2934 3.771 -1.2772
Protein03 Protein03 Down 4 0.175 -0.1931 2.367 -0.8466
P.Value adj.P.Val B
Protein23 2.358e-07 5.894e-06 6.565
Protein24 1.248e-05 1.560e-04 2.809
Protein22 3.381e-05 2.817e-04 1.790
Protein08 1.011e-02 6.321e-02 -3.011
Protein13 1.733e-02 8.666e-02 -3.660
Protein21 7.134e-02 2.972e-01 -5.060
Protein11 1.524e-01 5.443e-01 -5.305
Protein07 2.183e-01 6.063e-01 -5.541
Protein06 2.030e-01 6.063e-01 -5.604
Protein03 3.983e-01 9.052e-01 -6.055
This small dataset has five truly DE proteins. Four of the five are top-ranked in the DE results. The other DE protein is ranked 10th and does not achieve statistical significance because only 17.5% of its observations were detected and, hence, its quantification uncertainty is high.
> sessionInfo()
R Under development (unstable) (2024-10-21 r87258)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.1 LTS
Matrix products: default
BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: America/New_York
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] limpa_0.99.3 limma_3.63.3 BiocStyle_2.35.0
loaded via a namespace (and not attached):
[1] cli_3.6.3 knitr_1.49 rlang_1.1.4
[4] magick_2.8.5 xfun_0.50 jsonlite_1.8.9
[7] data.table_1.16.4 statmod_1.5.0 htmltools_0.5.8.1
[10] tinytex_0.54 sass_0.4.9 rmarkdown_2.29
[13] evaluate_1.0.3 jquerylib_0.1.4 fastmap_1.2.0
[16] yaml_2.3.10 lifecycle_1.0.4 bookdown_0.42
[19] BiocManager_1.30.25 compiler_4.5.0 Rcpp_1.0.14
[22] digest_0.6.37 R6_2.5.1 magrittr_2.0.3
[25] bslib_0.8.0 tools_4.5.0 cachem_1.1.0
The limpa pipeline starts with a matrix of peptide precursor intensities (rows for peptides and columns for samples) and a character vector of protein IDs.
The input data can be conveniently supplied as a limma EList object, but a plain matrix is also acceptable.
limpa includes readDIANN, which reads the Report.tsv file from DIA-NN software, and readSpectronaut, which reads the Report.tsv file from Spectronaut software.
Missing values for some peptides in some samples have complicated the analysis of MS proteomics data.
Peptides with very low expression values are frequently not detected, but peptides at high expression levels may also be undetected for a variety of reasons that are not completely understood or easily predictable, for example ambiguity of their elution profile with that of other peptides.
If y is the true expression level of a particular peptide in a particular sample (on the log2 scale), then limpa assumes that the probability of detection is given by
\[P(D | y) = F(\beta_0 + \beta_1 y)\]
where \(D\) indicates detection, \(\beta_0\) and \(\beta_1\) are the intercept and slope of the DPC and \(F\) is the logistic function, given by plogis in R.
This probability relationship is called the detection probability curve (DPC) in limpa.
The slope \(\beta_1\) measures how dependent the missing value process is on the underlying expression level.
A slope of zero would mean completely random missing values, while very large slopes correspond to left censoring.
The DPC allows limpa to recover information in a probabilistic manner from the missing values.
The larger the slope, the more information there is to recover.
We typically find \(\beta_1\) values between about 0.7 and 1 to be representative of real MS data.
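As a minimal sketch of what the curve looks like in practice, the following base R code evaluates the DPC at a few log2-expression values, using the intercept and slope estimated by dpc() in the simulated example earlier. The helper name dpc_prob, the example expression values, and the coefficients used for the two limiting cases are illustrative assumptions, not part of limpa.

```r
# Detection probability curve: P(D | y) = plogis(beta0 + beta1 * y)
# (dpc_prob is a hypothetical helper, not a limpa function)
dpc_prob <- function(y, beta0, beta1) plogis(beta0 + beta1 * y)

# Intercept and slope from the dpc() output in the simulated example
beta0 <- -4.0521
beta1 <- 0.7506

# Hypothetical log2-expression values
y <- c(2, 4, 6, 8, 10)
p <- dpc_prob(y, beta0, beta1)
print(round(p, 3))

# Expression level at which detection and non-detection are equally likely
y50 <- -beta0 / beta1
print(y50)

# Limiting cases of the slope (coefficients chosen for illustration only)
# Slope zero: detection does not depend on expression (completely random)
p_flat <- dpc_prob(y, beta0 = 0, beta1 = 0)
# Very large slope: almost a step function at y = 5 (left censoring)
p_steep <- dpc_prob(y, beta0 = -250, beta1 = 50)
print(rbind(p_flat, p_steep))
```

With a moderate slope such as the one estimated here, the detection probability rises gradually over several log2 units, so a missing value suggests low expression without pinning it below a hard threshold.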
The DPC is difficult to estimate because y is only observed for detected peptides, and the detected values are a biased representation of the complete values that in principle might have been observed had the missing value mechanism not operated.
limpa uses a mathematical exponential tilting argument to represent the DPC in terms of observed values only, which provides a means to estimate the DPC from real data.
The DPC slope \(\beta_1\) is nevertheless often under-estimated if the variability of each peptide is large.
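A brief sketch of why exponential tilting arises under the logistic model above. This is a heuristic outline, not necessarily the exact derivation used in limpa, and the symbols \(f\), \(f_{\mathrm{obs}}\), \(f_{\mathrm{mis}}\) and \(\pi\) are notation introduced here for illustration. Let \(f\) be the density of the complete log-intensities for a peptide and let \(\pi = P(D)\) be its marginal detection probability. By Bayes' rule, the densities of the detected and undetected values are
\[f_{\mathrm{obs}}(y) = \frac{f(y)\,F(\beta_0 + \beta_1 y)}{\pi}, \qquad f_{\mathrm{mis}}(y) = \frac{f(y)\,\{1 - F(\beta_0 + \beta_1 y)\}}{1 - \pi}.\]
Because \(F(x)/\{1 - F(x)\} = e^{x}\) for the logistic function, the ratio of the two densities is
\[\frac{f_{\mathrm{obs}}(y)}{f_{\mathrm{mis}}(y)} = \frac{1 - \pi}{\pi}\,\exp(\beta_0 + \beta_1 y).\]
The density of the undetected values is therefore the observed density multiplied by \(e^{-\beta_1 y}\) and renormalised, which is an exponential tilt. If, for instance, the observed values were normal with mean \(\mu_{\mathrm{obs}}\) and variance \(\sigma^2\), then the undetected values would be normal with mean \(\mu_{\mathrm{obs}} - \beta_1 \sigma^2\) and the same variance.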
limpa uses the DPC, together with a Bayesian model, to estimate the expression level of each protein in each sample. A multivariate normal prior is estimated empirically from data to describe the variability in log-intensities across the samples and across the peptides. The DPC-Quant uses maximum posterior estimation to quantify the expression of each protein in each sample, and also returns the posterior standard error with which each expression value is estimated.
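To see how a missing value can still carry information, consider a toy one-dimensional version of the idea: a single log2-intensity with a normal prior that is known to be undetected. This is only an illustrative sketch in base R, not the multivariate model that dpcQuant fits. The prior mean and standard deviation are made-up values, the DPC coefficients are those estimated in the simulated example earlier, and the curvature-based standard error is a simple Laplace-style approximation chosen here for illustration.

```r
# DPC coefficients from the simulated example
beta0 <- -4.0521
beta1 <- 0.7506

# Hypothetical normal prior for the log2-intensity
mu <- 6
sigma <- 1.5

# Log posterior density (up to a constant) of y given that it was NOT detected:
# log prior density + log P(not detected | y)
log_post_missing <- function(y) {
  dnorm(y, mean = mu, sd = sigma, log = TRUE) +
    plogis(beta0 + beta1 * y, lower.tail = FALSE, log.p = TRUE)
}

# Maximum posterior estimate by one-dimensional numerical optimisation
map_fit <- optimize(log_post_missing,
                    interval = c(mu - 10 * sigma, mu + 10 * sigma),
                    maximum = TRUE)
y_map <- map_fit$maximum

# Approximate posterior standard error from the curvature at the mode,
# using a central finite difference for the second derivative
h <- 1e-3
curv <- (log_post_missing(y_map + h) - 2 * log_post_missing(y_map) +
           log_post_missing(y_map - h)) / h^2
se_map <- sqrt(-1 / curv)

print(c(prior_mean = mu, map = y_map, prior_sd = sigma, posterior_se = se_map))
```

In this toy setting the estimate for the undetected value is pulled below the prior mean, and its standard error is somewhat smaller than the prior standard deviation, reflecting the information contributed by the non-detection itself.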
Finally, limpa passes the protein log2-expression values and associated uncertainties on to the limma package, and uses the voomaLmFit function to compute precision weights for each observation.
voomaLmFit uses both protein expression and the quantification standard errors to predict the protein-wise variances and, hence, to construct precision weights for downstream linear modelling.
This allows the uncertainty associated with missing value imputation to be propagated through to the differential expression analysis.
limpa’s dpcDE function is a convenient wrapper function, passing the appropriate standard errors from dpcQuant to voomaLmFit.
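The effect of precision weights can be illustrated with ordinary weighted least squares in base R. This is a simplified sketch and not what voomaLmFit does internally: here the weights are taken directly as inverse squared standard errors, whereas voomaLmFit predicts variances from both expression level and quantification uncertainty. All numbers below are made up.

```r
# Hypothetical log2-expression values for one protein, two groups of three samples
y <- c(8.1, 7.9, 8.0, 9.0, 9.2, 6.0)
# Hypothetical quantification standard errors; the last value is very uncertain,
# as might happen when most of its peptides were undetected
se <- c(0.2, 0.2, 0.2, 0.2, 0.2, 1.5)
group <- factor(c(1, 1, 1, 2, 2, 2))

# Ordinary least squares treats all six values as equally reliable
fit_unweighted <- lm(y ~ group)

# Weighted least squares with inverse-variance precision weights
fit_weighted <- lm(y ~ group, weights = 1 / se^2)

# Estimated log fold change (group 2 vs group 1) under each fit
lfc_unweighted <- unname(coef(fit_unweighted)["group2"])
lfc_weighted <- unname(coef(fit_weighted)["group2"])
print(c(unweighted = lfc_unweighted, weighted = lfc_weighted))
```

In the unweighted fit the single uncertain value almost cancels the group difference, while in the weighted fit it contributes very little and the difference seen in the well-quantified samples is retained.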
The limpa package is fully compatible with limma analysis pipelines, allowing any complex experimental design and other downstream tasks such as gene ontology or pathway analysis.
The dpcDE function accepts any argument that voomaLmFit does.
For example, dpcDE(y.protein, design, sample.weights=TRUE) can be used to downweight outlier samples.
Or dpcDE(y.protein, design, block=subject) could be used to model the correlation between repeated observations on the same subject.
Any questions about limpa can be sent to the Bioconductor support forum.
We are working on a publication that will fully describe the limpa theory and functionality. In the meantime, please cite Li & Smyth (2023), which introduced the idea of the detection probability curve (DPC) that is fundamental to the limpa package.
Li M, Smyth GK (2023). Neither random nor censored: estimating intensity-dependent probabilities for missing values in label-free proteomics. Bioinformatics 39(5), btad200. doi:10.1093/bioinformatics/btad200
The limpa project was supported by Chan Zuckerberg Initiative EOSS grant 2021-237445, by Melbourne Research and CSL Translational Data Science Scholarships to ML, and by an NHMRC Investigator Grant to GS.