---
title: "terraTCGAdata Introduction"
author: "Marcel Ramos"
date: "`r format(Sys.Date(), '%B %d, %Y')`"
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Obtain Terra TCGA data as MultiAssayExperiment}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document:
    number_sections: yes
    toc: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, cache = TRUE)
```

# Installation (development version)

```{r,eval=FALSE}
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("LiNk-NY/terraTCGAdata")
```

# Description

Some public Terra workspaces come pre-packaged with TCGA data (i.e., cloud data
resources are linked within the data model), particularly the workspaces
labelled `OpenAccess_V1-0`. Datasets harmonized to the hg38 genome use a
different data model and workflow and are not compatible with the functions in
this package. For the compatible workspaces, we make use of the Terra data
model and represent the data as a `MultiAssayExperiment`. For more information
on `MultiAssayExperiment`, please see the vignette in that package.

# Requirements

## Loading packages

```{r,include=TRUE,results="hide",message=FALSE,warning=FALSE}
library(AnVIL)
library(terraTCGAdata)
```

## gcloud sdk installation

A valid GCloud SDK installation is required to use the package. Use the
`gcloud_exists()` function from the `r Biocpkg("AnVIL")` package to check
whether it is installed on your system.

```{r}
gcloud_exists()
```

You can also use the `gcloud_project()` function to set a project name by
specifying the `project` argument. Called without arguments, it reports the
currently active project:

```{r, eval=AnVIL::gcloud_exists()}
gcloud_project()
```

# Default Data Workspace

To get a list of available TCGA workspaces, use the `findTCGAworkspaces()`
function:

```{r, eval=AnVIL::gcloud_exists()}
findTCGAworkspaces()
```

You can then set a package-wide option with the `terraTCGAworkspace()` function
and check the setting with `getOption("terraTCGAdata.workspace")`:

```{r,eval=AnVIL::gcloud_exists()}
terraTCGAworkspace("TCGA_COAD_OpenAccess_V1-0_DATA")
getOption("terraTCGAdata.workspace")
```

# Clinical data resources

To determine which datasets to download, use the `getClinicalTable()` function
to list all of the columns that correspond to clinical data from the different
collection centers.

```{r, eval=AnVIL::gcloud_exists()}
ct <- getClinicalTable(workspace = "TCGA_COAD_OpenAccess_V1-0_DATA")
ct
names(ct)
```

# Clinical data download

After picking a column from the `getClinicalTable()` output, use the column
name as input to the `getClinical()` function to obtain the data:

```{r, eval=AnVIL::gcloud_exists()}
column_name <- "clin__bio__nationwidechildrens_org__Level_1__biospecimen__clin"
clin <- getClinical(
    columnName = column_name,
    participants = TRUE,
    workspace = "TCGA_COAD_OpenAccess_V1-0_DATA"
)
clin[, 1:6]
dim(clin)
```

# Assay data resources

We use the same approach for assay data. We first produce a list of assays with
the `getAssayTable()` function and then select one, along with any sample codes
of interest.

```{r, eval=AnVIL::gcloud_exists()}
at <- getAssayTable(workspace = "TCGA_COAD_OpenAccess_V1-0_DATA")
at
names(at)
```

# Summary of sample types in the data

You can get a summary table of all the samples in the data by using the
`sampleTypesTable()` function:

```{r, eval=AnVIL::gcloud_exists()}
sampleTypesTable(workspace = "TCGA_COAD_OpenAccess_V1-0_DATA")
```

# Intermediate function for obtaining only the data

Note that if you have the package-wide option set, the `workspace` argument is
not needed in the function call.

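The fallback to the package-wide option can be pictured with a small base R
sketch. `resolve_workspace()` is a hypothetical helper written for this
illustration; it is not part of `terraTCGAdata`, and the package's internals
may differ. Setting the option directly with `options()` is likewise an
assumption made so the sketch runs without a cloud connection.

```{r}
# Hypothetical helper (not part of terraTCGAdata): use the explicit
# argument when given, otherwise fall back to the package-wide option.
resolve_workspace <- function(
    workspace = getOption("terraTCGAdata.workspace")
) {
    if (is.null(workspace))
        stop("no workspace given and 'terraTCGAdata.workspace' is unset")
    workspace
}

# Set the option for illustration and keep the previous value
old <- options(terraTCGAdata.workspace = "TCGA_COAD_OpenAccess_V1-0_DATA")

resolve_workspace()                     # falls back to the option
resolve_workspace("another_workspace")  # an explicit argument takes precedence

# Restore the previous setting
options(old)
```

The `getAssayData()` call that follows passes `workspace` explicitly, which
works whether or not the option is set.
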
```{r, eval=AnVIL::gcloud_exists()}
# TCGA sample codes: "01" = Primary Solid Tumor, "10" = Blood Derived Normal
prot <- getAssayData(
    assayName = "protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data",
    sampleCode = c("01", "10"),
    workspace = "TCGA_COAD_OpenAccess_V1-0_DATA",
    sampleIdx = 1:4
)
head(prot)
```

# MultiAssayExperiment

Finally, once you have collected all the relevant column names, you can pass
them to the main `terraTCGAdata()` function:

```{r, eval=AnVIL::gcloud_exists()}
mae <- terraTCGAdata(
    clinicalName = "clin__bio__nationwidechildrens_org__Level_1__biospecimen__clin",
    assays = c(
        "protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data",
        "rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data"
    ),
    sampleCode = NULL,
    split = FALSE,
    sampleIdx = 1:4,
    workspace = "TCGA_COAD_OpenAccess_V1-0_DATA"
)
mae
```

We expect that most `OpenAccess_V1-0` cancer datasets follow this data model.
If you encounter any errors, please provide a minimal reproducible example at
https://github.com/waldronlab/terraTCGAdata.

# Session Info

```{r}
sessionInfo()
```