---
title: "R / Bioconductor packages for Proteomics"
author: "[Laurent Gatto](http://proteome.sysbiol.cam.ac.uk/lgatto/)"
output:
html_document:
theme: united
toc: true
---
**Abstract** This workshop will give delegates the opportunity to
discover and try some of the recent R / Bioconductor developments for
proteomics. Topics covered will including support for open
community-driven formats for raw data and identification results,
packages for peptide-spectrum matching, quantitative proteomics, mass
spectrometry (MS) and quantitation data processing, and
visualisation. The workshop material will be a self-contained
vignette/workflow including example data.
This short tutorial is part of the `ProteomicsBioc2014Workshop`
package (version 0.2.0),
available at
https://github.com/ComputationalProteomicsUnit/ProteomicsBioc2014Workshop.
## Introduction
In Bioconductor version 2.14, there are respectively 53
proteomics and 31 mass spectrometry software packages and
7 mass spectrometry experiment packages. These respective
packages can be extracted with the `proteomicsPackages()`,
`massSpectrometryPackages()` and `massSpectrometryDataPackages()` and
explored interactively.
```r
library("RforProteomics")
pp <- proteomicsPackages(biocv)
display(pp)
```
## Mass spectrometry data
|Type |Format |Package |
|:--------------|:---------------------------|:-----------------------------------------------------------------------------------------|
|raw |mzML, mzXML, netCDF, mzData |[`mzR`](http://bioconductor.org/packages/release/bioc/html/mzR.html) (read) |
|identification |mzIdentML |[`mzID`](http://bioconductor.org/packages/release/bioc/html/mzID.html) (read) |
|quantitation |mzQuantML | |
|peak lists |mgf |[`MSnbase`](http://bioconductor.org/packages/release/bioc/html/MSnbase.html) (read/write) |
|other |mzTab |[`MSnbase`](http://bioconductor.org/packages/release/bioc/html/MSnbase.html) (read/write) |
### Getting data from proteomics repositories
Contemporary MS-based proteomics data is disseminated through the
[ProteomeXchange](http://www.proteomexchange.org/) infrastructure,
which centrally coordinates submission, storage and dissemination
through multiple data repositories, such as the
[PRIDE](https://www.ebi.ac.uk/pride/archive/) data base at the EBI for
MS/MS experiments, [PASSEL](http://www.peptideatlas.org/passel/) at
the ISB for SRM data and the
[MassIVE](http://massive.ucsd.edu/ProteoSAFe/static/massive.jsp)
resource. The
[`rpx`](http://www.bioconductor.org/packages/release/bioc/html/rpx.html)
is an interface to ProteomeXchange and provides a basic and unified
access to PX data.
```r
library("rpx")
pxannounced()
```
```
## 14 new ProteomeXchange annoucements
```
```
## Data.Set Publication.Data Message
## 1 PXD001187 2014-07-30 15:21:44 New
## 2 PXD000249 2014-07-30 14:31:46 Updated information
## 3 PXD000249 2014-07-30 12:24:35 New
## 4 PXD001049 2014-07-30 10:12:30 New
## 5 PXD000459 2014-07-29 09:59:26 New
## 6 PXD001029 2014-07-29 07:27:32 New
## 7 PXD000807 2014-07-28 09:24:47 Updated information
## 8 PXD000479 2014-07-28 09:23:05 Updated information
## 9 PXD000909 2014-07-28 08:09:14 New
## 10 PXD000813 2014-07-25 09:16:02 New
## 11 PXD000764 2014-07-25 08:53:34 Updated information
## 12 PXD000004 2014-07-24 15:07:14 New
## 13 PXD000001 2014-07-24 14:59:35 New
## 14 PXD001030 2014-07-24 12:28:19 Updated information
```
```r
px <- PXDataset("PXD000001")
px
```
```
## Object of class "PXDataset"
## Id: PXD000001 with 8 files
## [1] 'F063721.dat' ... [8] 'erwinia_carotovora.fasta'
## Use 'pxfiles(.)' to see all files.
```
```r
pxfiles(px)
```
```
## [1] "F063721.dat"
## [2] "F063721.dat-mztab.txt"
## [3] "PRIDE_Exp_Complete_Ac_22134.xml.gz"
## [4] "PRIDE_Exp_mzData_Ac_22134.xml.gz"
## [5] "PXD000001_mztab.txt"
## [6] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML"
## [7] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.raw"
## [8] "erwinia_carotovora.fasta"
```
Other metadata for the `px` dataset:
```r
pxtax(px)
pxurl(px)
pxref(px)
```
Data files can then be downloaded with the `pxget` function as
illustrated below. Alternatively, the file is available on the
workshop's Amazon virtual machine in `/data/Proteomics/data/`.
```r
mzf <- "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML"
mzf <- file.path("/data/Proteomics/data", mzf)
if (!file.exists(mzf))
mzf <- pxget(px, pxfiles(px)[6])
```
```
## Downloading 1 file
## TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML already present.
```
```r
mzf
```
```
## [1] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML"
```
### Handling raw MS data
The `mzR` package provides an interface to the
[proteowizard](http://proteowizard.sourceforge.net/) code base,
the legacy RAMP is a non-sequential parser and other C/C++ code to
access various raw data files, such as `mzML`, `mzXML`, `netCDF`, and
`mzData`. The data is accessed on-disk, i.e it does not get loaded
entirely in memory by default. The three main functions are
`openMSfile` to create a file handle to a raw data file, `header` to
extract metadata about the spectra contained in the file and `peaks`
to extract one or multiple spectra of interest. Other functions such
as `instrumentInfo`, or `runInfo` can be used to gather general
information about a run.
```r
library("mzR")
ms <- openMSfile(mzf)
ms
```
```
## Mass Spectrometry file handle.
## Filename: TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML
## Number of scans: 7534
```
```r
hd <- header(ms)
dim(hd)
```
```
## [1] 7534 21
```
```r
names(hd)
```
```
## [1] "seqNum" "acquisitionNum"
## [3] "msLevel" "polarity"
## [5] "peaksCount" "totIonCurrent"
## [7] "retentionTime" "basePeakMZ"
## [9] "basePeakIntensity" "collisionEnergy"
## [11] "ionisationEnergy" "lowMZ"
## [13] "highMZ" "precursorScanNum"
## [15] "precursorMZ" "precursorCharge"
## [17] "precursorIntensity" "mergedScan"
## [19] "mergedResultScanNum" "mergedResultStartScanNum"
## [21] "mergedResultEndScanNum"
```
#### Exercise
Extract the index of the MS2 spectrum with the highest base peak
intensity and plot its spectrum. Is the data centroided or in profile
mode?
### Handling identification data
The `RforProteomics` package distributes a small identification result
file (see
`?TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzid`) that we
load and parse using infrastructure from the
[`mzID`](http://bioconductor.org/packages/release/bioc/html/mzID.html)
package.
```r
library("mzID")
(f <- dir(system.file("extdata", package = "ProteomicsBioc2014Workshop"),
pattern = "mzid", full.names=TRUE))
```
```
## [1] "/home/lgatto/R/x86_64-unknown-linux-gnu-library/3.1/ProteomicsBioc2014Workshop/extdata/TMT_Erwinia.mzid.gz"
```
```r
id <- mzID(f)
```
```
## reading TMT_Erwinia.mzid.gz... DONE!
```
```r
id
```
```
## An mzID object
##
## Software used: MS-GF+ (version: Beta (v10072))
##
## Rawfile: /home/lgatto/dev/00_github/RforProteomics/sandbox/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML
##
## Database: /home/lgatto/dev/00_github/RforProteomics/sandbox/erwinia_carotovora.fasta
##
## Number of scans: 5287
## Number of PSM's: 5563
```
Various data can be extracted from the `mzID` object, using one the
accessor functions such as `database`, `scans`, `peptides`, ... The
object can also be converted into a `data.frame` using the `flatten`
function.
#### Exercise
Is there a relation between the length of a protein and the number of
identified peptides, conditioned by the (average) e-value of the
identifications?
### MS/MS database search
While searches are generally performed using third-party software
independently of R or can be started from R using a `system` call, the
[`rTANDEM`](http://www.bioconductor.org/packages/release/bioc/html/rTANDEM.html)
package allows one to execute such searches using the X!Tandem
engine. The
[`shinyTANDEM`](http://www.bioconductor.org/packages/release/bioc/html/shinyTANDEM.html)
provides a interactive interface to explore the search results.
```r
library("rTANDEM")
?rtandem
library("shinyTANDEM")
?shinyTANDEM
```
## High-level data interface
The above sections introduced low-level interfaces to raw and
identification results. The
[`MSnbase`](http://bioconductor.org/packages/release/bioc/html/MSnbase.html)
package provides abstractions for raw data through the `MSnExp` class
and containers for quantification data via the `MSnSet` class. Both
store the actual assay data (spectra or quantitation matrix) and
sample and feature metadata, accessed with `spectra` (or the
`[`, `[[` operators) or `exprs`, `pData` and `fData`.
The figure below give a schematics of an `MSnSet` instance and the
relation between the assay data and the respective feature and sample
metadata.
Another useful slot is `processingData`, accessed with `processingData(.)`, that records all the processing that objects have undergone since their creation (see examples below).
The `readMSData` will parse the raw data, extract the MS2 spectra and
construct an MS experiment file.
```r
library("MSnbase")
quantFile <- dir(system.file(package = "MSnbase", dir = "extdata"),
full.name = TRUE, pattern = "mzXML$")
quantFile
```
```
## [1] "/home/lgatto/R/x86_64-unknown-linux-gnu-library/3.2/MSnbase/extdata/dummyiTRAQ.mzXML"
```
```r
msexp <- readMSData(quantFile, verbose=FALSE)
msexp
```
```
## Object of class "MSnExp"
## Object size in memory: 0.2 Mb
## - - - Spectra data - - -
## MS level(s): 2
## Number of MS1 acquisitions: 1
## Number of MSn scans: 5
## Number of precursor ions: 5
## 4 unique MZs
## Precursor MZ's: 437.8 - 716.34
## MSn M/Z range: 100 2017
## MSn retention times: 25:1 - 25:2 minutes
## - - - Processing information - - -
## Data loaded: Wed Jul 30 21:43:48 2014
## MSnbase version: 1.13.12
## - - - Meta data - - -
## phenoData
## rowNames: 1
## varLabels: sampleNames
## varMetadata: labelDescription
## Loaded from:
## dummyiTRAQ.mzXML
## protocolData: none
## featureData
## featureNames: X1.1 X2.1 ... X5.1 (5 total)
## fvarLabels: spectrum
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
```
The identification results stemming from the same raw data file can
then be used to add PSM matches.
```r
## find path to a mzIdentML file
identFile <- dir(system.file(package = "MSnbase", dir = "extdata"),
full.name = TRUE, pattern = "mzid$")
identFile
```
```
## [1] "/home/lgatto/R/x86_64-unknown-linux-gnu-library/3.2/MSnbase/extdata/dummyiTRAQ.mzid"
```
```r
msexp <- addIdentificationData(msexp, identFile)
```
```
## reading dummyiTRAQ.mzid... DONE!
```
```r
fData(msexp)
```
```
## spectrum passthreshold rank calculatedmasstocharge
## X1.1 1 TRUE 1 645.0
## X2.1 2 TRUE 1 547.0
## X3.1 3 NA NA NA
## X4.1 4 NA NA NA
## X5.1 5 TRUE 1 437.3
## experimentalmasstocharge chargestate ms-gf:denovoscore ms-gf:evalue
## X1.1 645.4 3 77 79.37
## X2.1 547.0 3 39 13.47
## X3.1 NA NA NA NA
## X4.1 NA NA NA NA
## X5.1 437.8 2 5 366.38
## ms-gf:rawscore ms-gf:specevalue assumeddissociationmethod
## X1.1 -39 5.527e-05 CID
## X2.1 -30 9.399e-06 CID
## X3.1 NA NA
## X4.1 NA NA
## X5.1 -42 2.578e-04 CID
## isotopeerror isdecoy post pre end start accession length
## X1.1 1 FALSE A R 186 170 ECA0984;ECA3829 231
## X2.1 0 FALSE A K 62 50 ECA1028 275
## X3.1 NA NA NA NA
## X4.1 NA NA NA NA
## X5.1 1 FALSE L K 28 22 ECA0510 166
## description
## X1.1 DNA mismatch repair protein;acetolactate synthase isozyme III large subunit
## X2.1 2,3,4,5-tetrahydropyridine-2,6-dicarboxylate N-succinyltransferase
## X3.1
## X4.1
## X5.1 putative capsular polysacharide biosynthesis transferase
## pepseq modified modification databaseFile
## X1.1 VESITARHGEVLQLRPK FALSE NA erwinia_carotovora.fasta
## X2.1 IDGQWVTHQWLKK FALSE NA erwinia_carotovora.fasta
## X3.1 NA NA
## X4.1 NA NA
## X5.1 LVILLFR FALSE NA erwinia_carotovora.fasta
## identFile nprot npep.prot npsm.prot npsm.pep
## X1.1 2 2 1 1 1
## X2.1 2 1 1 1 1
## X3.1 NA NA NA NA NA
## X4.1 NA NA NA NA NA
## X5.1 2 1 1 1 1
```
```r
msexp[[1]]
```
```
## Object of class "Spectrum2"
## Precursor: 645.4
## Retention time: 25:1
## Charge: 3
## MSn level: 2
## Peaks count: 2921
## Total ion count: 668170086
```
```r
plot(msexp[[1]], full=TRUE)
```
```r
as(msexp[[1]], "data.frame")[100:105, ]
```
```
## mz i
## 100 141.1 588595
## 101 141.1 845401
## 102 141.1 791352
## 103 141.1 477623
## 104 141.1 155376
## 105 141.1 4753
```
### Quantitative proteomics
There are a wide range of proteomics quantitation techniques that can
broadly be classified as labelled vs. label-free, depending whether
the features are labelled prior the MS acquisition and the MS level at
which quantitation is inferred, namely MS1 or MS2.
| |Label-free |Labelled |
|:---|:----------|:----------|
|MS1 |XIC |SILAC, 15N |
|MS2 |Counting |iTRAQ, TMT |
In terms of raw data quantitation, most efforts have been devoted to
MS2-level quantitation. Label-free XIC quantitation has however been
addressed in the frame of metabolomics data processing by the
[`xcms`](http://bioconductor.org/packages/release/bioc/html/xcms.html)
infrastructure.
An `MSnExp` is converted to an `MSnSet` by the `quantitation`
method. Below, we use the iTRAQ 4-plex isobaric tagging strategy
(defined by the `iTRAQ4` parameter; other tags are available).
```r
plot(msexp[[1]], full=TRUE, reporters = iTRAQ4)
```
```r
msset <- quantify(msexp, method = "trap", reporters = iTRAQ4, verbose=FALSE)
```
```
## Error: argument "verbose" is missing, with no default
```
```r
exprs(msset)
```
```
## Error: error in evaluating the argument 'object' in selecting a method for function 'exprs': Error: object 'msset' not found
```
```r
processingData(msset)
```
```
## Error: error in evaluating the argument 'object' in selecting a method for function 'processingData': Error: object 'msset' not found
```
Other MS2 quantitation methods available in `quantify` include the
(normalised) spectral index `SI` and (normalised) spectral abundance
factor `SAF` or simply a simple count method.
```r
exprs(si <- quantify(msexp, method = "SIn"))
```
```
## 1
## ECA0510 0.003589
## ECA1028 0.001470
```
```r
exprs(saf <- quantify(msexp, method = "NSAF"))
```
```
## 1
## ECA0510 0.6236
## ECA1028 0.3764
```
Note that spectra that have not been assigned any peptide (`NA`) or
that match non-unique peptides (`npsm > 1`) are discarded in the
counting process.
**See also** The
[`isobar`](http://www.bioconductor.org/packages/release/bioc/html/isobar.html)
package supports quantitation from centroided `mgf` peak lists or its
own tab-separated files that can be generated from Mascot and Phenyx
vendor files.
### Importing third-party data
The PSI `mzTab` file format is aimed at providing a simpler (than XML
formats) and more accessible file format to the wider community. It is
composed of a key-value metadata section and peptide/protein/small
molecule tabular sections.
```r
mztf <- pxget(px, pxfiles(px)[2])
```
```
## Downloading 1 file
## F063721.dat-mztab.txt already present.
```
```r
(mzt <- readMzTabData(mztf, what = "PEP"))
```
```
## Warning: Support for mzTab version 0.9 only. Support will be added soon.
```
```
## Detected a metadata section
## Detected a peptide section
```
```
## MSnSet (storageMode: lockedEnvironment)
## assayData: 1528 features, 6 samples
## element names: exprs
## protocolData: none
## phenoData
## rowNames: sub[1] sub[2] ... sub[6] (6 total)
## varLabels: abundance
## varMetadata: labelDescription
## featureData
## featureNames: 1 2 ... 1528 (1528 total)
## fvarLabels: sequence accession ... uri (14 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
## - - - Processing information - - -
## mzTab read: Wed Jul 30 21:44:00 2014
## MSnbase version: 1.13.12
```
It is also possible to import arbitrary spreadsheets as `MSnSet`
objects into R with the `readMSnSet2` function. The main 2 arguments
of the function are (1) a text-based spreadsheet and (2) column names
of indices that identify the quantitation data.
```r
csv <- dir(system.file ("extdata" , package = "pRolocdata"),
full.names = TRUE, pattern = "pr800866n_si_004-rep1.csv")
getEcols(csv, split = ",")
```
```
## [1] "\"Protein ID\"" "\"FBgn\""
## [3] "\"Flybase Symbol\"" "\"No. peptide IDs\""
## [5] "\"Mascot score\"" "\"No. peptides quantified\""
## [7] "\"area 114\"" "\"area 115\""
## [9] "\"area 116\"" "\"area 117\""
## [11] "\"PLS-DA classification\"" "\"Peptide sequence\""
## [13] "\"Precursor ion mass\"" "\"Precursor ion charge\""
## [15] "\"pd.2013\"" "\"pd.markers\""
```
```r
ecols <- 7:10
res <- readMSnSet2(csv, ecols)
head(exprs(res))
```
```
## area.114 area.115 area.116 area.117
## 1 0.3790 0.2810 0.2250 0.1140
## 2 0.4200 0.2097 0.2061 0.1639
## 3 0.1873 0.1673 0.1697 0.4760
## 4 0.2475 0.2530 0.3200 0.1790
## 5 0.2160 0.1830 0.3420 0.2590
## 6 0.0720 0.2123 0.5730 0.1427
```
```r
head(fData(res))
```
```
## Protein.ID FBgn Flybase.Symbol No..peptide.IDs Mascot.score
## 1 CG10060 FBgn0001104 G-ialpha65A 3 179.86
## 2 CG10067 FBgn0000044 Act57B 5 222.40
## 3 CG10077 FBgn0035720 CG10077 5 219.65
## 4 CG10079 FBgn0003731 Egfr 2 86.39
## 5 CG10106 FBgn0029506 Tsp42Ee 1 52.10
## 6 CG10130 FBgn0010638 Sec61beta 2 79.90
## No..peptides.quantified PLS.DA.classification Peptide.sequence
## 1 1 PM
## 2 9 PM
## 3 3
## 4 2 PM
## 5 1 GGVFDTIQK
## 6 3 ER/Golgi
## Precursor.ion.mass Precursor.ion.charge pd.2013 pd.markers
## 1 PM unknown
## 2 PM unknown
## 3 unknown unknown
## 4 PM unknown
## 5 626.887 2 Phenotype 1 unknown
## 6 ER/Golgi ER
```
## Data processing and analysis
### Processing and normalisation
Each different types of quantitative data will require their own
pre-processing and normalisation steps. Both `isobar` and `MSnbase`
allow to correct for isobaric tag impurities normalise the
quantitative data.
```r
data(itraqdata)
qnt <- quantify(itraqdata, method = "trap",
reporters = iTRAQ4, verbose = FALSE)
```
```
## Error: argument "verbose" is missing, with no default
```
```r
impurities <- matrix(c(0.929,0.059,0.002,0.000,
0.020,0.923,0.056,0.001,
0.000,0.030,0.924,0.045,
0.000,0.001,0.040,0.923),
nrow=4, byrow = TRUE)
## or, using makeImpuritiesMatrix()
## impurities <- makeImpuritiesMatrix(4)
qnt.crct <- purityCorrect(qnt, impurities)
```
```
## Error: error in evaluating the argument 'object' in selecting a method for function 'purityCorrect': Error: object 'qnt' not found
```
```r
processingData(qnt.crct)
```
```
## Error: error in evaluating the argument 'object' in selecting a method for function 'processingData': Error: object 'qnt.crct' not found
```
```r
plot0 <- function(x, y, main = "") {
old.par <- par(no.readonly = TRUE)
on.exit(par(old.par))
par(mar = c(4, 4, 1, 1))
par(mfrow = c(2, 2))
sx <- sampleNames(x)
sy <- sampleNames(y)
for (i in seq_len(ncol(x))) {
plot(exprs(x)[, i], exprs(y)[, i], log = "xy",
xlab = sx[i], ylab = sy[i])
grid()
}
}
plot0(qnt, qnt.crct)
```
```
## Error: error in evaluating the argument 'object' in selecting a method for function 'sampleNames': Error: object 'qnt' not found
```
Various normalisation methods can be applied the `MSnSet` instances
using the `normalise` method: variance stabilisation (`vsn`), quantile
(`quantiles`), median or mean centring (`center.media` or
`center.mean`), ...
```r
qnt.crct.nrm <- normalise(qnt.crct,"quantiles")
```
```
## Error: error in evaluating the argument 'object' in selecting a method for function 'normalize': Error: object 'qnt.crct' not found
```
```r
plot0(qnt, qnt.crct.nrm)
```
```
## Error: error in evaluating the argument 'object' in selecting a method for function 'sampleNames': Error: object 'qnt' not found
```
The `combineFeatures` method combines spectra/peptides quantitation
values into protein data. The grouping is defined by the `groupBy`
parameter, which is generally taken from the feature metadata (protein
accessions, for example).
```r
## arbitraty grouping
g <- factor(c(rep(1, 25), rep(2, 15), rep(3, 15)))
prt <- combineFeatures(qnt.crct.nrm, groupBy = g, fun = "sum")
```
```
## Error: error in evaluating the argument 'object' in selecting a method for function 'normalize': Error: object 'qnt.crct.nrm' not found
```
```r
processingData(prt)
```
```
## Error: error in evaluating the argument 'object' in selecting a method for function 'processingData': Error: object 'prt' not found
```
Finally, proteomics data analysis is generally hampered by missing
values. Missing data imputation is a sensitive operation whose success
will be guided by many factors, such as degree and (non-)random nature
of the missingness. Missing value in `MSnSet` instances can be
filtered out and imputed using the `filterNA` and `impute` functions.
```r
set.seed(1)
qnt0 <- qnt
```
```
## Error: object 'qnt' not found
```
```r
exprs(qnt0)[sample(prod(dim(qnt0)), 10)] <- NA
```
```
## Error: object 'qnt0' not found
```
```r
table(is.na(qnt0))
```
```
## Error: object 'qnt0' not found
```
```r
qnt00 <- filterNA(qnt0)
```
```
## Error: error in evaluating the argument 'object' in selecting a method for function 'filterNA': Error: object 'qnt0' not found
```
```r
dim(qnt00)
```
```
## Error: object 'qnt00' not found
```
```r
qnt.imp <- impute(qnt0)
```
```
## Error: error in evaluating the argument 'object' in selecting a method for function 'impute': Error: object 'qnt0' not found
```
```r
plot0(qnt, qnt.imp)
```
```
## Error: error in evaluating the argument 'object' in selecting a method for function 'sampleNames': Error: object 'qnt' not found
```
#### Exercise
The `mzt` instance created from the `mzTab` file has the following is
a TMT 6-plex with the following design:
In this TMT 6-plex experiment, four exogenous proteins were spiked
into an equimolar *Erwinia carotovora* lysate with varying
proportions in each channel of quantitation; yeast enolase (ENO) at
10:5:2.5:1:2.5:10, bovine serum albumin (BSA) at 1:2.5:5:10:5:1,
rabbit glycogen phosphorylase (PHO) at 2:2:2:2:1:1 and bovin
cytochrome C (CYT) at 1:1:1:1:1:2. Proteins were then digested,
differentially labelled with TMT reagents, fractionated by reverse
phase nanoflow UPLC (nanoACQUITY, Waters), and analysed on an LTQ
Orbitrap Velos mass spectrometer (Thermo Scientic).
Explore the `mzt` data using some of the illustrated functions. The
heatmap and MAplot (see `MAplot` function), taken from the
[`RforProteomics`](http://www.bioconductor.org/packages/release/data/experiment/html/RforProteomics.html)
vignette, have been produced using the same data.
![heatmap](figure/heatmap.png)
![maplot](figure/maplot.png)
### Statistical analysis
R in general and Bioconductor in particular are well suited for the
statistical analysis of data. Several packages provide dedicated
resources for proteomics data:
- [`MSstats`](http://www.bioconductor.org/packages/release/bioc/html/MSstats.html):
A set of tools for statistical relative protein significance
analysis in DDA, SRM and DIA experiments.
- [`msmsTest`](http://www.bioconductor.org/packages/release/bioc/html/msmsTests.html):
Statistical tests for label-free LC-MS/MS data by spectral counts,
to discover differentially expressed proteins between two biological
conditions. Three tests are available: Poisson GLM regression,
quasi-likelihood GLM regression, and the negative binomial of the
edgeR package.
- [`isobar`](http://www.bioconductor.org/packages/release/bioc/html/isobar.html)
also provides dedicated infrastructure for the statistical analysis of isobaric data.
### Machine learning
The
[`MLInterfaces`](http://www.bioconductor.org/packages/release/bioc/html/MLInterfaces.html)
package provides a unified interface to a wide range of machine
learning algorithms. Initially developed for microarray and
`ExpressionSet` instances, the
[`pRoloc`](http://www.bioconductor.org/packages/release/bioc/html/pRoloc.html)
package enables application of these algorithms to `MSnSet` data.
```r
library("MLInterfaces")
library("pRoloc")
library("pRolocdata")
data(dunkley2006)
traininds <- which(fData(dunkley2006)$markers != "unknown")
ans <- MLearn(markers ~ ., data = t(dunkley2006), knnI(k = 5), traininds)
ans
```
```
## MLInterfaces classification output container
## The call was:
## MLearn(formula = markers ~ ., data = t(dunkley2006), .method = knnI(k = 5),
## trainInd = traininds)
## Predicted outcome distribution for test set:
##
## ER Golgi mit/plastid PM vacuole
## 203 86 129 112 17
## Summary of scores on test set (use testScores() method for details):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.400 1.000 1.000 0.965 1.000 1.000
```
```r
kcl <- MLearn( ~ ., data = dunkley2006, kmeansI, centers = 12)
kcl
```
```
## clusteringOutput: partition table
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 38 10 22 94 24 99 43 118 107 48 50 36
## The call that created this object was:
## MLearn(formula = ~., data = dunkley2006, .method = kmeansI, centers = 12)
```
```r
plot(kcl, exprs(dunkley2006))
```
```r
hcl <- MLearn( ~ ., data = t(dunkley2006), hclustI(distFun = dist, cutParm = list(k = 4)))
hcl
```
```
## clusteringOutput: partition table
##
## 1 2 3 4
## 4 4 4 4
## The call that created this object was:
## MLearn(formula = ~., data = t(dunkley2006), .method = hclustI(distFun = dist,
## cutParm = list(k = 4)))
```
```r
plot(hcl, exprs(t(dunkley2006)))
```
## Annotation
All the
[Bioconductor annotation infrastructure](http://bioconductor.org/help/workflows/annotation/annotation/),
such as
[`biomaRt`](http://bioconductor.org/packages/release/bioc/html/biomaRt.html),
[`GO.db`](http://www.bioconductor.org/packages/release/data/annotation/html/GO.db.html),
organism specific annotations, .. are directly relevant to the
analysis of proteomics data. Some proteomics-centred annotations such
as the PSI Mass Spectrometry Ontology, Molecular Interaction (PSI MI
2.5) or Protein Modifications are available through the
[`rols`](http://www.bioconductor.org/packages/release/bioc/html/rols.html). Data
from the [Human Protein Atlas](http://www.proteinatlas.org/) is
available via the
[`hpar`](http://www.bioconductor.org/packages/release/bioc/html/hpar.html)
package.
## Other relevant packages/pipelines
- Analysis of post translational modification with
[`isobar`](http://www.bioconductor.org/packages/release/bioc/html/isobar.html).
- Analysis of label-free data from a Synapt G2 (including ion
mobility) with
[`synapter`](http://www.bioconductor.org/packages/release/bioc/html/synapter.html).
- Analysis of spatial proteomics data with
[`pRoloc`](http://www.bioconductor.org/packages/release/bioc/html/pRoloc.html).
- Analysis of MALDI data with the
[`MALDIquant`](http://cran.r-project.org/web/packages/MALDIquant/index.html)
package.
- Access to the Proteomics Standard Initiative Common QUery InterfaCe
with the
[`PSICQUIC`](http://www.bioconductor.org/packages/release/bioc/html/PSICQUIC.html)
package.
Additional relevant packages are described in the
[`RforProteomics`](http://www.bioconductor.org/packages/release/data/experiment/html/RforProteomics.html)
vignette.
## Session information
```
## R Under development (unstable) (2014-04-10 r65396)
## Platform: x86_64-unknown-linux-gnu (64-bit)
##
## attached base packages:
## [1] parallel methods stats graphics grDevices utils datasets
## [8] base
##
## other attached packages:
## [1] shinyTANDEM_1.3.0 xtable_1.7-3 mixtools_1.0.2
## [4] segmented_0.4-0.0 boot_1.3-11 shiny_0.10.0
## [7] rTANDEM_1.5.0 data.table_1.9.2 BiocInstaller_1.15.5
## [10] pRolocdata_1.3.0 pRoloc_1.5.13 MLInterfaces_1.45.1
## [13] sfsmisc_1.0-26 cluster_1.15.2 annotate_1.43.5
## [16] XML_3.98-1.1 AnnotationDbi_1.27.8 GenomeInfoDb_1.1.12
## [19] IRanges_1.99.23 S4Vectors_0.1.2 rda_1.0.2-2
## [22] rpart_4.1-8 genefilter_1.47.6 MASS_7.3-33
## [25] rpx_1.1.1 MSnbase_1.13.12 BiocParallel_0.7.8
## [28] ggplot2_1.0.0 Biobase_2.25.0 BiocGenerics_0.11.3
## [31] mzID_1.3.4 mzR_1.11.10 Rcpp_0.11.2
## [34] RforProteomics_1.3.1 knitr_1.6
##
## loaded via a namespace (and not attached):
## [1] affy_1.43.3 affyio_1.33.0
## [3] BatchJobs_1.3 BBmisc_1.7
## [5] biocViews_1.33.10 bitops_1.0-6
## [7] BradleyTerry2_1.0-5 brew_1.0-6
## [9] brglm_0.5-9 car_2.0-20
## [11] caret_6.0-30 Category_2.31.1
## [13] caTools_1.17 checkmate_1.2
## [15] class_7.3-11 codetools_0.2-8
## [17] colorspace_1.2-4 DBI_0.2-7
## [19] digest_0.6.4 doParallel_1.0.8
## [21] e1071_1.6-3 evaluate_0.5.5
## [23] fail_1.2 FNN_1.1
## [25] foreach_1.4.2 formatR_0.10
## [27] gdata_2.13.3 graph_1.43.0
## [29] grid_3.2.0 gridSVG_1.4-0
## [31] GSEABase_1.27.1 gtable_0.1.2
## [33] gtools_3.4.1 htmltools_0.2.4
## [35] httpuv_1.3.0 impute_1.39.0
## [37] interactiveDisplay_1.3.9 iterators_1.0.7
## [39] kernlab_0.9-19 labeling_0.2
## [41] lattice_0.20-29 limma_3.21.10
## [43] lme4_1.1-7 lpSolve_5.6.9
## [45] MALDIquant_1.10 Matrix_1.1-4
## [47] mboost_2.3-0 mclust_4.3
## [49] minqa_1.2.3 munsell_0.4.2
## [51] mvtnorm_1.0-0 nlme_3.1-117
## [53] nloptr_1.0.0 nnet_7.3-8
## [55] nnls_1.4 pcaMethods_1.55.0
## [57] plyr_1.8.1 preprocessCore_1.27.1
## [59] proto_0.3-10 proxy_0.4-12
## [61] quadprog_1.5-5 R.methodsS3_1.6.1
## [63] R.oo_1.18.0 R.utils_1.32.4
## [65] randomForest_4.6-10 RBGL_1.41.0
## [67] RColorBrewer_1.0-5 RCurl_1.95-4.1
## [69] reshape2_1.4 RJSONIO_1.2-0.2
## [71] RSQLite_0.11.4 RUnit_0.4.26
## [73] sampling_2.6 scales_0.2.4
## [75] sendmailR_1.1-2 splines_3.2.0
## [77] stats4_3.2.0 stringr_0.6.2
## [79] survival_2.37-7 tools_3.2.0
## [81] vsn_3.33.0 zlibbioc_1.11.1
```