---
title: "SubCellBarCode: Integrated workflow for robust classification and 
visualization of spatial proteome"
author:
- name: Taner Arslan
  affiliation: "Karolinska Institute, Stockholm, Sweden"
- name: Lukas Orre
  affiliation: "Karolinska Institute, Stockholm, Sweden"
- name: Mattias Vesterlund
  affiliation: "Karolinska Institute, Stockholm, Sweden"
- name: Yanbo Pan
  affiliation: "Karolinska Institute, Stockholm, Sweden"
- name: Janne Lehtiö
  affiliation: "Karolinska Institute, Stockholm, Sweden"
date: "`r Sys.Date()`"
output: 
    BiocStyle::html_document:
        toc_fload: true
package: SubCellBarCode
abstract: In the SubCellBarCode project, mass spectrometry (MS) 
    based proteomics was used for proteome-wide mapping of protein 
    subcellular localization (Orre et al. 2019, Molecular Cell). 
    Subcellular fractionation and MS-based quantification can be 
    reproduced following the methods described in Orre et al. 
    Here, the SubCellBarCode R package offers tools to reproduce 
    the underlying bioinformatics pipeline for robust classification 
    and visualization of proteins into corresponding subcellular 
    localizations. Moreover, the R-package can be used for differential 
    localization analysis to investigate e.g. treatment induced protein 
    relocalization, condition dependent localization or cell type 
    specific localization. Last but not least, it provides peptide/
    exon/transcript centric classification as well as PTM (Post 
    translation modifications) modified peptide classification. 
vignette: >
    %\VignetteIndexEntry{SubCellBarCode R Markdown vignettes}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---
                                                                                
# Installation of the package

                                                                                
`SubCellBarCode` can be installed through `BiocManager` package as follows:

``` {r install,eval = FALSE}
if (!requireNamespace("BiocManager"))
    install.packages("BiocManager")
BiocManager::install("SubCellBarCode")
``` 

# Load the package

```{r Loadpackage}
library(SubCellBarCode)
```

# Data preparation and classification 
## Example Data
As example data we here provide the publicly available HCC827 (human lung 
adenocarcinoma cell line) TMT10plex labelled proteomics dataset 
(*Orre et al.* 2019, Molecular Cell). The `data.frame` consists of 
10480 proteins as rows (rownames are gene -centric protein ids) 
and 5 fractions with duplicates as columns 
(replicates must be named ".A." and ".B.", repectively). 
                                                                                
```{r exampleData}
head(hcc827Ctrl)
```

## Marker Proteins
The classification of protein localisation using the SubCellBarCode 
method is dependent on 3365 marker proteins as defined in Orre et al. 
The markerProteins `data.frame` contain protein names (gene symbol), 
associated subcellular localization (compartment), color code for 
the compartment and the median normalized fractionation
profile (log2) based on five different human cell lines 
(NCI-H322, HCC827, MCF7, A431and U251) here called the 
“5CL marker profile”.


```{r markerdata}
head(markerProteins)
```

## Load and normalize data
Input `data.frame` is checked with *"NA"* values and for the correct 
format. If there is any *"NA"* value, corresponding row is deleted. 
Then, data frame is `log2` transformmed. 

```{r loadData}
df <- loadData(protein.data = hcc827Ctrl)
```


```{r printDimData}
cat(dim(df))
head(df)
```

**Additional step:**
We use gene symbols for the protein identification. Therefore, we require 
gene symbols for the identifiaction. However, if the input data has 
other identifier e.g. UNIPROT, IPI, Entrez ID, you can convert it to gene
symbol by our defined function.   
*Please be aware of possible (most likely few) id loss during the 
conversion to one another.*

```{r convertID}
##Run if you have another identifier than gene symbols.
##Function will convert UNIPROT identifier to gene symbols.
##Deafult id is "UNIPROT", make sure you change it if you use another.

#df <- convert2symbol(df = df, id = "UNIPROT")
```

For the downstream analysis, we used the randomly selected subset data. 
```{r subset data}
set.seed(2)
df <- df[sample(nrow(df), 6000),]
```

## Calculate covered marker proteins
The overlap between *marker proteins (3365)* and 
*input data.frame* is calculated and visualized for each 
compartment by a bar plot.    

*Note that we recommend at least 20% coverage of marker proteins 
for each compartment. If certain compartments are underrrepresented 
we recommend you to perform the cell fractionation again. 
If all compartments are low in coverage we recommend increasing 
the analytical depth of the MS-analysis.* 

```{r coverageMarkers, fig = TRUE}
c.prots <- calculateCoveredProtein(proteinIDs = rownames(df), 
                        markerproteins = markerProteins[,1]) 
```
                                                                                
## Quality control of the marker proteins
To avoid reduced classification accuracy, marker proteins with 
noisy quantification and marker proteins that are not representative 
of their associated compartment (e.g.due to cell type specific 
localization) are filtered out by a two-step quality control. 


1. Marker proteins with pearson correlations 
less than 0.8 between *A* and *B* duplicates for each cell line were 
filtered out (Figure A).  

2. Pairwise correlations between 5CL marker profile and input data for 
each protein (A and B replicate experiments separately) were calculated 
using both Pearson and Spearman correlation. The lowest value for each 
method were then used for filtering with cut-offs set to 0.8 and 0.6 
respectively, to exclude non-representative marker proteins (Figure B).

```{r markerQC}
r.markers <- markerQualityControl(coveredProteins = c.prots,protein.data = df)
r.markers[1:5]
```

**Optional step:** After removing non-marker proteins, you can re-calculate 
and visualize the final coverage of the marker proteins.

```{r finalCoverage, fig = TRUE}
## uncomment the function when running 
# f.prots <- calculateCoveredProtein(r.markers, markerProteins[,1])
```
                                                                                
## Visualization of marker proteins in t-SNE map
The spatial distribution of the marker proteins is visualized in 
t-SNE map. This plot will be informative for the quality control 
of the generated data as it offers evaluation of the spatial 
distribution and separation of marker proteins. 


```{r tsneparameter}
#Default parameters
#Run the broad-range parameters to optimize well !!!
#Output dimensionality
#dims = 3
#Speed/accuracy trade-off (increase for less accuracy) 
#theta = c(0.1, 0.2, 0.3, 0.4, 0.5)
#Perplexity parameter
#perplexity = c(5, 10, 20, 30, 40, 50, 60) 
```

Information about the different t-SNE parameters that can be 
modified by the user is available by typing `?Rtsne` in the console.    
Although the applications of t-SNE is widespread in the field 
of machine learning, it can be misleading if it is not well 
optimized. Therefore, we optimize t-SNE map by grid search, 
a process that can take some time 

```{r tsnedim3, fig = TRUE, fig.width = 6.5, fig.height = 6.5}
set.seed(6)
tsne.map <- tsneVisualization(protein.data = df, 
                    markerProteins = r.markers, 
                    dims = 3, 
                    theta = c(0.1), 
                    perplexity = c(60)) 
```

We recommend 3D vizualisation by setting `dims = 3`, 
for optimal evaluation of marker protein cluster separation 
and data modularity. You can also visualize the marker 
proteins in 2 dimensional space by setting `dims = 2`, although 
reducing the dimensionality results in loss of information and 
underestimation of data resolution. 


```{r tsnedim2, fig = TRUE}
set.seed(9)
tsne.map2 <- tsneVisualization(protein.data = df, 
                    markerProteins = r.markers, 
                    dims = 2, 
                    theta = c(0.5), 
                    perplexity = c(60))
```

## Build model and classify proteins
For replicate *A* and *B* separately, marker proteins are 
used for training a Support Vector Machine (SVM) classifier 
with a Gaussian radial basis function kernel algorithm. 
After tuning the parameters, the SVM model predicts (classifies) 
the subcellular localization for all proteins in the input 
data with corresponding probabilities for *A* and *B* replicate 
classification.


```{r buildSVM}
set.seed(4)
cls <- svmClassification(markerProteins = r.markers, 
                                    protein.data = df, 
                                    markerprot.df = markerProteins)
```

```{r testdata}
# testing data predictions for replicate A and B
test.A <- cls[[1]]$svm.test.prob.out
test.B <- cls[[2]]$svm.test.prob.out
head(test.A)
```

```{r allPred}
# all predictions for replicate A and B
all.A <- cls[[1]]$all.prot.pred
all.B <- cls[[2]]$all.prot.pred
```
                                                                                
### Estimate classification thresholds for compartment level
Classification probabilities close to 1 indicate high confidence 
predictions, whereas probabilities close to 0 indicate low 
confidence predictions. To increase the overal prediction accuracy 
and to filter out poor predictions, one criterion and 
two cut-offs are defined.

1. The criterion is the consensus of preliminary predictions 
between biological duplicates. Proteins are kept in the analysis, 
if there is an agreement between biological duplicates. 
Subsequently, prediction probabilities from the two 
duplicates are averaged for each protein.  

2. Cut-off is (precision - based) set when precision reach 0.9 
in the test data.  

3. Cut-off is (recall - based) set as the probability of the 
lowest true positive in the test data.

```{r compartmentThreshold}
t.c.df <- computeThresholdCompartment(test.repA = test.A, test.repB = test.B)
```

```{r headcompartmentThreshold}
head(t.c.df)
```

### Apply threshold to compartment level classifications
The determined thresholds for the compartment levels are applied 
to all classifications.

```{r applycompartmentThreshold}
c.cls.df <- applyThresholdCompartment(all.repA = all.A, all.repB = all.B,
                                    threshold.df = t.c.df)
```

```{r headcompartmentCls}
head(c.cls.df)
```

### Estimate classification thresholds for neighborhood level
Compartment level classification probabilities are summed to 
neighborhood probabilities and thresholds for neighborhood 
analysis are estimated as described above for compartment 
level analysis except precision based cut-off is set to 0.95.  

```{r neighborhoodThreshold}
t.n.df <- computeThresholdNeighborhood(test.repA = test.A, test.repB = test.B)
```

```{r headneighborhoodThreshold}
head(t.n.df)
```

### Apply threshold to neighborhood level classifications
The determined thresholds for the neighborhood levels are applied to all 
classifications.

```{r applyNeighborhoodThreshold}
n.cls.df <- applyThresholdNeighborhood(all.repA = all.A, all.repB = all.B, 
                                    threshold.df = t.n.df)
```

```{r headNeighborhoodCls}
head(n.cls.df)
```

### Merge compartment and neighborhood classification
Individual classifications (compartment and neighborhood) are merged 
into one data frame. 

```{r mergecls}
cls.df <- mergeCls(compartmentCls = c.cls.df, neighborhoodCls = n.cls.df)
```

```{r headmerge}
head(cls.df)
```

# Visualization of the protein subcellular localization
## SubCellBarCode plot
You can query one protein at a time to plot barcode of the protein 
of the interest.    

`PSM` (Peptide-spectra-matching) count table is required for the plotting
SubCellBarCode. It is in `data.frame` format;

```{r hcc827psmcount}
head(hcc827CtrlPSMCount)
```

```{r plotbarcode, fig = TRUE, fig.width = 6, fig.height = 6}
plotBarcode(sampleClassification = cls.df, protein = "NLRP4",
        s1PSM = hcc827CtrlPSMCount)
```

## Co-localization plot
To evaluate localization and of multiple proteins at 
the same time, a vector of proteins (identified by gene symbols) 
can be prepared and used to create a barplot showing the 
distribution of classifications across compartments and 
neighborhoods. This analysis could be helpful when 
evaluating co-localization of proteins, protein complex 
formation and compartmentalized protein level regulation.


```{r multipleprots, fig = TRUE, fig.width= 10, fig.height = 8}
# 26S proteasome complex (26s proteasome regulatory complex)
proteasome26s <- c("PSMA7", "PSMC3","PSMA4", "PSMB4", 
                    "PSMB6", "PSMB5", "PSMC2","PSMC4",
                    "PSMB3", "PSMA6","PSMC5","PSMC6")

plotMultipleProtein(sampleClassification = cls.df, proteinList = proteasome26s)
```

# Differential localization analysis
Regulation of  protein localization is a the key 
process in cellular signalling. The `SubCellBarCode` 
method can be used for differential localization analysis 
given two conditions such as control vs treatment, cancer
cell *vs* normal cell, cell state A *vs* cell state B, *etc*. 
As example, we compared untreated and gefitinib (EGFR inhibitor) 
treated HCC827 cells (for details, see Orre et al.). 

## Plot differentially localizing proteins
Neighborhood classifications  for condition 1 (untreated) and 
condition 2 (gefitinib) is first done separately, and 
classifications for overlapping proteins are then vizualized by a 
sankey plot.     
*The HCC827 gefitinib cell lines classification was embedded 
into the package for example analysis.*


```{r headHCC827GEFCls}
head(hcc827GEFClass)
```

```{r sankey, fig.width = 6, fig.height = 3}
sankeyPlot(sampleCls1 = cls.df, sampleCls2 = hcc827GEFClass)
```

## Filter Candidates
As the differential localization analysis is an outlier analysis, 
it will include analytical noise. To filter out such noise, 
`PSM` (Peptide-spectra-matching) counts and fractionation profile 
correlation analysis (Pearson) was done to identify strong 
candidates. The PSM count format for the input have to be the 
same between the compared conditions;


```{r headPSMCount}
head(hcc827CtrlPSMCount)
```

For each protein, the minimum PSM count between the two conditions 
is plotted against the fractionation profile (median) correlation 
between the two conditions. For proteins with different localizations 
between conditions, the fractionation profile differs and therefore 
we are expecting a low fractionation profile correlation. As a standard 
setting for filtering of analytical noise in the differential 
localization analysis we suggest to demand a fractionation profile 
correlation below 0.8, and a minimum PSM count of at least 3.   
However, for the exploratory analysis, you can adjust
the `min.psm` and `pearson.cor` parameters. 

```{r relocation parameters}
##parameters
#sampleCls1 = sample 1 classification output
#s1PSM = sample 2 PSM count
#s1Quant = Sample 1 Quantification data
#sampleCls2 = sample 2 classification output
#s2PSM = sample 2 classification output
#sample2Quant = Sample 2 Quantification data
#min.psm = minumum psm count
#pearson.cor = perason correlation coefficient
```

```{r strongCandidates, fig = TRUE}
candidate.df <- candidateRelocatedProteins(sampleCls1 = cls.df, 
                                s1PSM = hcc827CtrlPSMCount, 
                                s1Quant = hcc827Ctrl,
                                sampleCls2 = hcc827GEFClass,
                                s2PSM = hcc827GefPSMCount,
                                s2Quant = hcc827GEF,
                                min.psm = 2,
                                pearson.cor = 0.8)
```

```{r printdim}
cat(dim(candidate.df))
```

```{r printhead}
head(candidate.df)
```

Candidate subset of differentially localizing proteins can 
be annotated with names by setting `annotation = TRUE`, `min.psm` 
and `pearson.cor` 

```{r strongCandidatesLabel, fig = TRUE}
candidate2.df <- candidateRelocatedProteins(sampleCls1 = cls.df,
                                s1PSM = hcc827CtrlPSMCount, 
                                s1Quant = hcc827Ctrl, 
                                sampleCls2 = hcc827GEFClass, 
                                s2PSM = hcc827GefPSMCount, 
                                s2Quant = hcc827GEF, 
                                annotation = TRUE, 
                                min.psm = 9, 
                                pearson.cor = 0.05) 
```

# Peptide/Exon/Transcript centric or PTM regulated localization
Our main analysis is based on gene-centric. However, based on the analytical
depth of the data, peptide/exon/transcript centric classifications can
be performed. Moreover, this approach is also applicable for
post translation modification (PTM) enriched data.  
Fundamentally, we will use the same classifiers, compartment 
and neighborhood levels thresholds and apply 
them to peptide/exon/transcript data. 

## Exon-centric classification
Here, we have provided subset of the HCC827 exon-centric data. 

```{r head exon data}
head(hcc827exon)
```

For the classification, we will use the gene-centric model that we built in 
section 3.7.

```{r recall model}
##recall the models
modelA <- cls[[1]]$model
modelB <- cls[[2]]$model
```

*`svmExternalData` function greps replicates A and B, repectively. So 
the input data can include other features like, here, peptide counts.* 

```{r exon centric cls}
exon.cls <- svmExternalData(df = hcc827exon, modelA = modelA, modelB= modelB)
```


```{r exon cls}
exon.A <- exon.cls[[1]]
exon.B <- exon.cls[[2]]
```

```{r exon cls rep A}
head(exon.A)
```

Next steps are exactly same as what we did in section 3.7. We will 
use same thresholdsthat we have estimated in 3.7.1 and 3.7.3. 
Finally, we will merge two classifications same function used 
in 3.7.5.

```{r apply thresholds}
exon.comp.cls <- applyThresholdCompartment(all.repA = exon.A[,2:17], 
                                    all.repB = exon.B[,2:17],
                                    threshold.df = t.c.df)

exon.neigh.cls <- applyThresholdNeighborhood(all.repA = exon.A[,2:17], 
                                    all.repB = exon.B[,2:17], 
                                    threshold.df = t.n.df)

exon.cls.df <- mergeCls(compartmentCls = exon.comp.cls, 
                        neighborhoodCls = exon.neigh.cls)

#same order
exon.cls.df <- exon.cls.df[rownames(exon.A),]

# we will add gene symbols as well as peptide count  
# (PSM count is also accepted) in case for comparing with 
# gene-centric classifications

exon.cls.df$GeneSymbol <- exon.A$Gene_Symbol
exon.cls.df$PeptideCount <- hcc827exon$PeptideCount
```

```{r head exon cls}
head(exon.cls.df)
```

## Comparison between gene and exon centric classification
The insteresting part of the exon classification is to detect deviated 
classification between gene centric classification. However, we enrich
more noise than gene centric classification due to analtyical dept of the exon
centric data. Therefore, we, additioanlly, internally calculate correlation
between gene-centric data `hcc827Ctrl` and exon-centric data `hcc827exon`, 
and report
number of peptides (PSM count works, too) that match to corresponding exon 
as an indication of the confidence of the results.

```{r compareCls}
comp.df <- compareCls(geneCls = cls.df, exonCls = exon.cls.df)
```

```{r head comp.df}
head(comp.df)
```
You can easily vizualize/filter the results using correlations and number of peptides. 

# References

* Orre., et al. "SubCellBarCode: Proteome-wide Mapping of Protein 
Localization and Relocalization." Molecular Cell (2019): 73(1):166-182.e7. 

# Session Information

```{r}
sessionInfo()
```