---
title: "Step 1: Pre-Processing"
author: "Tyler J Burns"
date: September 29, 2017
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{How to process FCS files for downstream use in R}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, 
                      results = "markup", 
                      message = FALSE, 
                      warning = FALSE)
```

### TL;DR:  
You will use GetMarkerNames() to obtain properly named markers, which you save 
as a column in a csv on your computer. In this csv, you will make two columns, 
one for markers you want as input for KNN, and one for markers you want to make 
your SCONE comparisons on. You will read this back into R to access these 
markers as vectors of strings. You'll then run ProcessMultipleFiles() to get 
your subsampled, concatenated matrix ready for KNN computation. 

## ABOUT PRE-PROCESSING 

### Introduction:  
Fluorescence and mass cytometry data are routinely processed by an increasing 
array of software platforms. Many of these contain graphical user interfaces, 
and many of these are R packages. However, no two analyses are the same, and 
many cases may involve direct processing of files in R. The Sconify package 
provides a suite of functions to make this process simpler and more 
user-friendly, prior to the knn-centric analysis occurring downstream. In 
essence, these functions convert fcs files into data matrices, process these 
matrices, and can output said matrices into fcs files readable by additional 
software. Although the primary intent of these functions is to pre-process fcs 
files for use of k-nearest neighbor statistics and visualizations in the 
remainder of this package, they are intended also to be general use functions. 

### Data:   
We will be using the Wanderlust dataset throughout this series of vignettes 
(paper: https://www.ncbi.nlm.nih.gov/pubmed/24766814, dataset: 
https://www.c2b2.columbia.edu/danapeerlab/html/wanderlust-data.html). We show a 
particular donor (labeled Sample C), with B cell precursors at the basal state 
and after stimulation with IL-7. In the paper, this comparison reveals a small 
subset of precursors that elevates its levels of pSTAT5 relative to the rest of 
the cells. 

## THE PROCEDURE

### The name of your file:   
For general comparisons, your files need to be named following the structure 
"name_condition.fcs". If you are comparing multiple donors, then your files 
need to follow the structure "name__donorID_condition.fcs".
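
As a rough illustration of why the naming convention matters, the condition 
label can be recovered from the text after the final underscore. This is a 
minimal base R sketch: GetCondition() is a hypothetical helper written for this 
example, not a Sconify function, and the package's own parsing may differ. 

```r
# Hypothetical helper (not part of Sconify): pull the condition label
# out of a file name of the form "name_condition.fcs"
GetCondition <- function(path) {
    # Drop the directory and the ".fcs" extension
    stem <- tools::file_path_sans_ext(basename(path))
    # Keep everything after the last underscore
    sub("^.*_", "", stem)
}

example.files <- c("Bendall_et_al_Cell_Sample_C_basal.fcs",
                   "Bendall_et_al_Cell_Sample_C_IL7.fcs")
example.conditions <- GetCondition(example.files)
example.conditions
```

Under this assumption, a file name without an underscore before the condition 
would yield the whole name as its label, so it is worth checking your file 
names before processing. 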

### Getting the right markers out of your file:  
I provide the function GetMarkerNames() to get the names of your parameters out 
of your data and modify them accordingly. Save the resulting list of marker 
names to a csv, as a single named column. Open this up in Excel and delete the 
parameters you don't want. Then, make two named columns. One contains static 
markers to be used for KNN (typically surface markers). The second contains the 
markers to be used in the comparisons. Name these two columns what you like. 
You'll read in this csv and get these columns out as vectors of strings. 


```{r}
library(Sconify)
# Example fcs file
basal <- system.file('extdata',
    'Bendall_et_al_Cell_Sample_C_basal.fcs',
     package = "Sconify")

# Get the marker names (the next chunk saves them to "markers.csv")
markers <- GetMarkerNames(basal)
markers
```

```{r, eval = FALSE}
# Turn this into two columns, one for surface markers, and one for phosphos
# Label the two columns accordingly. Save to csv. You'll read this modified
# file back in for use with the ProcessMultipleFiles() function.
write.csv(markers, "markers.csv", row.names = FALSE)
```



The markers.csv file, when opened in Excel, looks like this (first 20 rows): 

```{r, out.width = "200px", eval = TRUE}
knitr::include_graphics("original.markers.csv.example.png")

```

You modify it by hand. In the case of this dataset, it looks like this (first 
20 rows):

```{r, out.width = "200px"}
knitr::include_graphics("modified.markers.csv.example.png")

```
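
The round trip from a hand-edited, two-column csv back to vectors of strings 
can be sketched in base R as follows. The marker names and the column names 
("surface", "functional") are placeholders chosen for this example, not the 
contents of the package's markers.csv, and DropEmpty() is a hypothetical 
helper. 

```r
# Illustrative marker table (placeholder names). The shorter column is
# padded with an empty cell, as happens when columns differ in length.
marker.table <- data.frame(
    surface = c("CD19", "CD34", "CD38"),
    functional = c("pSTAT5", "pAkt", ""),
    stringsAsFactors = FALSE)

# Write to a temporary csv and read it back, as after editing by hand
marker.file <- tempfile(fileext = ".csv")
write.csv(marker.table, marker.file, row.names = FALSE)
marker.read <- read.csv(marker.file, stringsAsFactors = FALSE)

# Hypothetical helper: drop empty or missing cells from a column
DropEmpty <- function(x) {
    x[!is.na(x) & x != ""]
}

surface.example <- DropEmpty(marker.read$surface)
functional.example <- DropEmpty(marker.read$functional)
surface.example
functional.example
```

Dropping the padding matters because an empty string is not a valid column 
name when the vector is later used to subset the data. 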


### From fcs file to data matrix (a general function):  
Skip to the "Processing multiple files" section for the function that directly 
produces the data matrix for SCONE. Here, I provide a general function that 
takes a single fcs file as input, performs an asinh transformation with a 
scale argument of 5 (for CyTOF) if instructed to, and converts the final output 
into a tibble. For SCONE, you'll use ProcessMultipleFiles(), which has this 
function embedded inside it. I provide it separately for any instance where you 
simply want to read an fcs file into R. 


```{r}

# FCS file provided in the package
basal <- system.file('extdata',
    'Bendall_et_al_Cell_Sample_C_basal.fcs',
    package = "Sconify")

# Example of data with no transformation
basal.raw <- FcsToTibble(basal, transform = "none")
basal.raw

# Asinh transformation with a scale argument of 5
basal.asinh <- FcsToTibble(basal, transform = "asinh")
basal.asinh

```
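
For intuition about the transformation itself, here is a minimal base R sketch 
on made-up intensities. It assumes the scale argument acts as a divisor applied 
before asinh(), which is the common convention for CyTOF data. The exact 
implementation inside FcsToTibble() may differ. 

```r
# Hypothetical raw intensities for one marker
raw.values <- c(0, 5, 50, 500)

# Asinh transformation with a scale argument of 5 (assumed to be a divisor)
scale.arg <- 5
asinh.values <- asinh(raw.values / scale.arg)
asinh.values

# The transformation is invertible: sinh() followed by rescaling
recovered <- sinh(asinh.values) * scale.arg
all.equal(recovered, raw.values)
```

The transformation is roughly linear near zero and roughly logarithmic for 
large values, which compresses the wide dynamic range of the raw counts. 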

### Processing multiple files:  
The output of this function is the input for the rest of the SCONE pipeline. 
If multiple files are used, the data will be concatenated into a single 
labeled tibble with an additional column containing "condition" information 
for each cell (which file it came from). If multiple donors are used, an 
additional column can be added with this information as well (see 
MultipleDonorScone.Rmd). Per marker, the files can be quantile normalized 
(across files), or z-score transformed. The files are downsampled evenly to the 
number specified by the user. We recommend 20,000. 

```{r}
# The FCS files (THEY NEED TO BE IN THE "....._condition.fcs" FORMAT)
basal <- system.file('extdata',
    'Bendall_et_al_Cell_Sample_C_basal.fcs',
     package = "Sconify")
il7 <- system.file('extdata',
    'Bendall_et_al_Cell_Sample_C_IL7.fcs',
    package = "Sconify")

# The markers (after they were modified by hand as instructed above)
markers <- system.file('extdata',
    'markers.csv',
    package = "Sconify")
markers <- ParseMarkers(markers)
surface <- markers[[1]]
surface
```
```{r}
# Combining these. Note that the default is subsampling to 10,000 cells.
# Here, we subsample to 1000 cells to minimize processing time for the
# vignettes, without normalizing or scaling.
wand.combined <- ProcessMultipleFiles(files = c(basal, il7), 
                                      input = surface, 
                                      numcells = 1000)
wand.combined
unique(wand.combined$condition)

# Limit your matrix to surface markers, if just using those downstream
wand.combined.input <- wand.combined[,surface]
wand.combined.input
```
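
The even subsampling and concatenation described in this section can be 
sketched in base R on simulated data. This is only an illustration of the 
idea, assuming the requested number of cells is a total that is divided 
equally among the files. The object names are made up for this example and the 
internals of ProcessMultipleFiles() may differ. 

```r
set.seed(1)

# Two hypothetical "files": matrices of cells (rows) by markers (columns)
marker.names <- c("m1", "m2", "m3")
file.a <- matrix(rnorm(500 * 3), ncol = 3,
                 dimnames = list(NULL, marker.names))
file.b <- matrix(rnorm(800 * 3), ncol = 3,
                 dimnames = list(NULL, marker.names))
example.list <- list(basal = file.a, IL7 = file.b)

# Split the requested total evenly across files
total.cells <- 200
cells.per.file <- total.cells %/% length(example.list)

# Subsample each file, then add a condition column recording which
# file each cell came from
sub.list <- lapply(names(example.list), function(cond) {
    mat <- example.list[[cond]]
    keep <- sample(nrow(mat), cells.per.file)
    data.frame(mat[keep, , drop = FALSE],
               condition = cond,
               stringsAsFactors = FALSE)
})

# Concatenate into a single labeled table
example.combined <- do.call(rbind, sub.list)
dim(example.combined)
table(example.combined$condition)
```

Sampling the same number of cells from each file keeps a larger file from 
dominating the neighborhoods computed downstream. 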

```{r}
# We can do this on a single file as well. 
wand.basal <- ProcessMultipleFiles(files = basal, 
                                   numcells = 1000, 
                                   scale = TRUE, 
                                   input = surface)
wand.basal
unique(wand.basal$condition)

```
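
The z-score transformation mentioned above can be illustrated with base R's 
scale() on simulated markers. This sketch assumes that scaling means centering 
each marker to mean 0 and dividing by its standard deviation. Whether 
`scale = TRUE` in ProcessMultipleFiles() does exactly this is an assumption 
here. 

```r
set.seed(1)

# Hypothetical markers measured on very different scales
example.mat <- cbind(m1 = rnorm(100, mean = 10, sd = 2),
                     m2 = rnorm(100, mean = 0.5, sd = 0.1))

# Z-score each column: subtract the mean, divide by the standard deviation
example.z <- scale(example.mat, center = TRUE, scale = TRUE)

# After scaling, each marker has mean ~0 and standard deviation 1
round(colMeans(example.z), 10)
apply(example.z, 2, sd)
```

Putting markers on a common scale prevents a marker with a large numeric range 
from dominating the distances used for KNN. 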

### (Optional) a control condition using a split single file:  
For the type of condition versus basal analysis shown above, it may behoove the 
user to have a control in which two basal files are compared to each other 
(e.g. with phospho-protein shifts across clusters). To this end, I developed a 
function called SplitFile() that takes in a single file as input and splits it 
into two sub-matrices such that one group of cells can be the "treated" 
condition. 

```{r}
# Path to the markers csv provided in the package
markers <- system.file('extdata',
    'markers.csv',
    package = "Sconify")

# Read the markers and pull out the surface column
markers <- read.csv(markers, stringsAsFactors = FALSE)
surface <- markers$surface

# The function, run on the aforementioned basal fcs file
split.data <- SplitFile(file = basal, 
                        input.markers = surface, 
                        numcells = 1000)
split.data
unique(split.data$condition)
```
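
The idea behind this control can be sketched in base R on simulated data: 
cells from one file are randomly assigned to two pseudo-conditions, so any 
difference between the two groups reflects sampling noise alone. The labels 
"pseudo_basal" and "pseudo_treated" are placeholders for this example and are 
not necessarily the labels SplitFile() produces. 

```r
set.seed(1)

# A hypothetical single basal "file": cells (rows) by markers (columns)
basal.mat <- matrix(rnorm(1000 * 3), ncol = 3,
                    dimnames = list(NULL, c("m1", "m2", "m3")))

# Randomly assign half of the cells to each pseudo-condition
n.cells <- nrow(basal.mat)
first.half <- sample(n.cells, n.cells %/% 2)
split.labels <- rep("pseudo_treated", n.cells)
split.labels[first.half] <- "pseudo_basal"

example.split <- data.frame(basal.mat,
                            condition = split.labels,
                            stringsAsFactors = FALSE)
table(example.split$condition)
```

Because both groups come from the same population, comparisons run on this 
table give a baseline for how large a shift can appear by chance. 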