---
title: "terraTCGAdata Introduction"
author: "Marcel Ramos"
date: "`r format(Sys.Date(), '%B %d, %Y')`"
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Obtain Terra TCGA data as MultiAssayExperiment}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document:
    number_sections: yes
    toc: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, cache = TRUE)
```

# Installation (development version)

```{r,eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("LiNk-NY/terraTCGAdata")
```

# Description

Some public Terra workspaces come pre-packaged with TCGA data (i.e., cloud data
resources are linked within the data model), particularly the workspaces that
are labelled `OpenAccess_V1-0`. Datasets harmonized to the hg38 genome use a
different data model / workflow and are not compatible with the functions in
this package. For the compatible workspaces, we make use of the Terra data
model and represent the data as a `MultiAssayExperiment` object.

For more information on `MultiAssayExperiment`, please see the vignette in
that package.

# Requirements

## Loading packages

```{r,include=TRUE,results="hide",message=FALSE,warning=FALSE}
library(AnVIL)
library(terraTCGAdata)
```

## gcloud SDK installation

A valid gcloud SDK installation is required to use the package. Use the
`gcloud_exists()` function from the `r Biocpkg("AnVIL")` package to check
whether it is installed on your system.

```{r}
gcloud_exists()
```

You can also use the `gcloud_project()` function to display the current
project, or set a project name by specifying the `project` argument:

```{r, eval=AnVIL::gcloud_exists()}
gcloud_project()
```

# Default Data Workspace

To get a list of available TCGA workspaces, use the `findTCGAworkspaces()`
function: 

```{r, eval=AnVIL::gcloud_exists()}
findTCGAworkspaces()
```

You can then set a package-wide option with the `terraTCGAworkspace` function
and check the setting by calling `getOption('terraTCGAdata.workspace')`.

```{r,eval=AnVIL::gcloud_exists()}
terraTCGAworkspace("TCGA_COAD_OpenAccess_V1-0_DATA")
getOption("terraTCGAdata.workspace")
```
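
To illustrate how a package-wide option can stand in for an explicit
`workspace` argument, here is a minimal sketch in base R. The helper
`resolve_workspace()` and the name `"some_other_workspace"` are hypothetical
and are not part of `terraTCGAdata`; the sketch only assumes that the option
is stored under the name `terraTCGAdata.workspace`, as shown above.

```{r}
# Hypothetical helper (not part of terraTCGAdata): prefer an explicit
# workspace, otherwise fall back to the package-wide option.
resolve_workspace <- function(workspace = NULL) {
    if (!is.null(workspace))
        return(workspace)
    ws <- getOption("terraTCGAdata.workspace")
    if (is.null(ws))
        stop("no workspace supplied and 'terraTCGAdata.workspace' is unset")
    ws
}

# Set the option directly with base R for illustration; keep the old value
old <- options(terraTCGAdata.workspace = "TCGA_COAD_OpenAccess_V1-0_DATA")

# No argument: the option is used
resolve_workspace()

# Explicit argument: it takes precedence over the option
resolve_workspace("some_other_workspace")

# Restore the previous setting
options(old)
```

An explicit argument taking precedence over the option is a common
convention in R packages; the actual lookup inside `terraTCGAdata` may differ.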

# Clinical data resources

To determine which datasets to download, use the `getClinicalTable`
function to list all of the columns that correspond to clinical data
from the different collection centers.

```{r, eval=AnVIL::gcloud_exists()}
ct <- getClinicalTable(workspace = "TCGA_COAD_OpenAccess_V1-0_DATA")
ct
names(ct)
```

# Clinical data download

After picking a column from the `getClinicalTable` output, use the column
name as input to the `getClinical` function to obtain the data:

```{r, eval=AnVIL::gcloud_exists()}
column_name <- "clin__bio__nationwidechildrens_org__Level_1__biospecimen__clin"
clin <- getClinical(
    columnName = column_name,
    participants = TRUE,
    workspace = "TCGA_COAD_OpenAccess_V1-0_DATA"
)
clin[, 1:6]
dim(clin)
```

# Assay data resources

We use the same approach for assay data. We first produce a list of assays
with the `getAssayTable` function and then select one along with any sample
codes of interest.

```{r, eval=AnVIL::gcloud_exists()}
at <- getAssayTable(workspace = "TCGA_COAD_OpenAccess_V1-0_DATA")
at
names(at)
```
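
The assay column names are long, so it can help to break them into their
components before choosing one. The sketch below uses only base R and two
assay names that appear in this vignette; it assumes that the double
underscore (`__`) is the field separator and makes no claim about what each
field means.

```{r}
# Two assay column names used in this vignette
assay_names <- c(
    "protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data",
    "rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data"
)

# Split each name on the literal double underscore
parts <- strsplit(assay_names, "__", fixed = TRUE)

# Number of components per name
lengths(parts)

# One row per assay name, one column per component
do.call(rbind, parts)
```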

# Summary of sample types in the data

You can get a summary table of all the samples in the data by using the
`sampleTypesTable` function:

```{r, eval=AnVIL::gcloud_exists()}
sampleTypesTable(workspace = "TCGA_COAD_OpenAccess_V1-0_DATA")
```
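
Sample codes such as `"01"` (Primary Solid Tumor) and `"10"` (Blood Derived
Normal) come from the sample portion of the TCGA barcode. As a minimal
sketch, the code can be read from characters 14 and 15 of a barcode with base
R. The barcodes below are made up for illustration and are not taken from the
workspace.

```{r}
# Illustrative barcodes (made up for this sketch)
barcodes <- c(
    "TCGA-XX-0001-01A-11R-0000-07",
    "TCGA-XX-0001-10A-01D-0000-01",
    "TCGA-XX-0002-01A-11R-0000-07"
)

# Characters 14 and 15 of a TCGA barcode hold the sample type code
sample_codes <- substr(barcodes, 14, 15)

# Tabulate the sample type codes
table(sample_codes)

# Keep only the primary tumor ("01") samples
barcodes[sample_codes %in% "01"]
```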

# Intermediate function for obtaining only the data

Note that if you have the package-wide option set, the workspace argument
is not needed in the function call.

```{r, eval=AnVIL::gcloud_exists()}
prot <- getAssayData(
    assayName = "protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data",
    sampleCode = c("01", "10"),
    workspace = "TCGA_COAD_OpenAccess_V1-0_DATA",
    sampleIdx = 1:4
)
head(prot)
```

# MultiAssayExperiment

Finally, once you have collected all the relevant column names,
pass them as inputs to the main `terraTCGAdata` function:

```{r, eval=AnVIL::gcloud_exists()}
mae <- terraTCGAdata(
    clinicalName = "clin__bio__nationwidechildrens_org__Level_1__biospecimen__clin",
    assays =
        c("protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data",
        "rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data"),
    sampleCode = NULL,
    split = FALSE,
    sampleIdx = 1:4,
    workspace = "TCGA_COAD_OpenAccess_V1-0_DATA"
)
mae
```

We expect that most `OpenAccess_V1-0` cancer datasets follow this data model.
If you encounter any errors, please provide a minimal reproducible example
at https://github.com/waldronlab/terraTCGAdata.

# Session Info

```{r}
sessionInfo()
```