• Useful information
  • Searching arguments
    • project options
    • sample.type options
    • Harmonized data options
  • Harmonized database examples
    • DNA methylation data: Recurrent tumor samples
    • Samples with DNA methylation and gene expression data
    • Raw Sequencing Data: Finding the match between file names and barcode for Controlled data.
  • Get Manifest file
  • ATAC-seq data
  • Summary of available files per patient

TCGAbiolinks provides a few functions to search the GDC database.


Useful information

Understanding the barcode

A TCGA barcode is composed of a collection of identifiers. Each specifically identifies a TCGA data element. Refer to the following figure for an illustration of how metadata identifiers comprise a barcode. An aliquot barcode contains the highest number of identifiers.

Example:

  • Aliquot barcode: TCGA-G4-6317-02A-11D-2064-05
  • Participant: TCGA-G4-6317
  • Sample: TCGA-G4-6317-02

For more information, see the GDC documentation on TCGA barcodes.
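The barcode levels above can be recovered from an aliquot barcode with base R alone. The following is a minimal sketch, not part of TCGAbiolinks; the variable names are illustrative, and it assumes the standard dash-separated layout in which the fourth field holds a two-digit sample type code followed by a vial letter.

```r
# Split the example aliquot barcode into its dash-separated identifiers.
barcode <- "TCGA-G4-6317-02A-11D-2064-05"
parts <- strsplit(barcode, "-", fixed = TRUE)[[1]]

# Participant level: project, tissue source site, and participant fields.
participant <- paste(parts[1:3], collapse = "-")

# Sample level: participant plus the two-digit sample type code
# (the trailing letter in "02A" is the vial and is dropped here).
sample_code <- substr(parts[4], 1, 2)
sample <- paste(participant, sample_code, sep = "-")

# The same values can be taken as fixed-width prefixes of the barcode.
stopifnot(identical(participant, substr(barcode, 1, 12)))
stopifnot(identical(sample, substr(barcode, 1, 15)))

print(c(participant = participant, sample = sample))
```

The fixed-width form, substr(barcode, 1, 12), is the one used later in this document to reduce aliquot barcodes to patients.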

Searching arguments

You can easily search GDC data using the GDCquery function.

Using a summary of filters as used in the TCGA portal, the function works with the following arguments:

  • project: A list of valid projects (see the table under project options).
  • data.category: A valid data category for the project (see list with TCGAbiolinks:::getProjectSummary(project)).
  • data.type: A data type to filter the files to download.
  • workflow.type: GDC workflow type.
  • access: Filter by access type. Possible values: controlled, open.
  • platform: Example values:
      CGH-1x1M_G4447A                    IlluminaGA_RNASeqV2
      AgilentG4502A_07                   IlluminaGA_mRNA_DGE
      Human1MDuo                         HumanMethylation450
      HG-CGH-415K_G4124A                 IlluminaGA_miRNASeq
      HumanHap550                        IlluminaHiSeq_miRNASeq
      ABI                                H-miRNA_8x15K
      HG-CGH-244A                        SOLiD_DNASeq
      IlluminaDNAMethylation_OMA003_CPI  IlluminaGA_DNASeq_automated
      IlluminaDNAMethylation_OMA002_CPI  HG-U133_Plus_2
      HuEx-1_0-st-v2                     Mixed_DNASeq
      H-miRNA_8x15Kv2                    IlluminaGA_DNASeq_curated
      MDA_RPPA_Core                      IlluminaHiSeq_TotalRNASeqV2
      HT_HG-U133A                        IlluminaHiSeq_DNASeq_automated
      diagnostic_images                  microsat_i
      IlluminaHiSeq_RNASeq               SOLiD_DNASeq_curated
      IlluminaHiSeq_DNASeqC              Mixed_DNASeq_curated
      IlluminaGA_RNASeq                  IlluminaGA_DNASeq_Cont_automated
      IlluminaGA_DNASeq                  IlluminaHiSeq_WGBS
      pathology_reports                  IlluminaHiSeq_DNASeq_Cont_automated
      Genome_Wide_SNP_6                  bio
      tissue_images                      Mixed_DNASeq_automated
      HumanMethylation27                 Mixed_DNASeq_Cont_curated
      IlluminaHiSeq_RNASeqV2             Mixed_DNASeq_Cont
  • file.type: Used in the legacy database for some platforms to define which file types to use.
  • barcode: A list of barcodes to filter the files to download.
  • experimental.strategy: Filter by experimental strategy. Harmonized: WXS, RNA-Seq, miRNA-Seq, Genotyping Array.
  • sample.type: A sample type to filter the files to download.

project options

The options for the field project are below:

sample.type options

The options for the field sample.type are below:

The options for the other fields (data.category, data.type, workflow.type, platform, file.type) can be found below. Please note that these tables are still incomplete.

Harmonized data options

Harmonized database examples

DNA methylation data: Recurrent tumor samples

In this example we will access the harmonized database and search for all DNA methylation data for recurrent glioblastoma multiforme (GBM) and low-grade glioma (LGG) samples.

library(TCGAbiolinks)
library(DT)  # provides datatable(), used to render the result tables

query <- GDCquery(
    project = c("TCGA-GBM", "TCGA-LGG"),
    data.category = "DNA Methylation",
    platform = c("Illumina Human Methylation 450"),
    sample.type = "Recurrent Tumor"
)
datatable(
    getResults(query), 
    filter = 'top',
    options = list(scrollX = TRUE, keys = TRUE, pageLength = 5), 
    rownames = FALSE
)

Samples with DNA methylation and gene expression data

In this example we will access the harmonized database and search for all patients with DNA methylation (platform HumanMethylation450k) and gene expression data for Colon Adenocarcinoma tumor (TCGA-COAD).

query_met <- GDCquery(
    project = "TCGA-COAD",
    data.category = "DNA Methylation",
    platform = c("Illumina Human Methylation 450")
)
query_exp <- GDCquery(
    project = "TCGA-COAD",
    data.category = "Transcriptome Profiling",
    data.type = "Gene Expression Quantification", 
    workflow.type = "STAR - Counts"
)

# Get all patients that have DNA methylation and gene expression.
common.patients <- intersect(
    substr(getResults(query_met, cols = "cases"), 1, 12),
    substr(getResults(query_exp, cols = "cases"), 1, 12)
)

# Only select the first 5 patients
query_met <- GDCquery(
    project = "TCGA-COAD",
    data.category = "DNA Methylation",
    platform = c("Illumina Human Methylation 450"),
    barcode = common.patients[1:5]
)

query_exp <- GDCquery(
    project = "TCGA-COAD",
    data.category = "Transcriptome Profiling",
    data.type = "Gene Expression Quantification", 
    workflow.type = "STAR - Counts",
    barcode = common.patients[1:5]
)
datatable(
    getResults(query_met, cols = c("data_type","cases")),
    filter = 'top',
    options = list(scrollX = TRUE, keys = TRUE, pageLength = 5), 
    rownames = FALSE
)
datatable(
    getResults(query_exp, cols = c("data_type","cases")), 
    filter = 'top',
    options = list(scrollX = TRUE, keys = TRUE, pageLength = 5), 
    rownames = FALSE
)
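The patient-matching step above relies only on base R: each aliquot barcode is cut down to its first 12 characters (the participant level) and the two sets are intersected. The sketch below isolates that step; the barcodes are made up for illustration and stand in for the vectors returned by getResults(query, cols = "cases").

```r
# Hypothetical aliquot barcodes (not real TCGA cases), standing in for
# getResults(query_met, cols = "cases") and getResults(query_exp, cols = "cases").
met_cases <- c(
    "TCGA-XX-0001-01A-11D-0001-05",
    "TCGA-XX-0002-01A-11D-0001-05",
    "TCGA-XX-0003-01A-11D-0001-05"
)
exp_cases <- c(
    "TCGA-XX-0003-01A-11R-0002-07",
    "TCGA-XX-0001-01A-11R-0002-07",
    "TCGA-XX-0001-11A-11R-0002-07",  # second sample from the same patient
    "TCGA-XX-0004-01A-11R-0002-07"
)

# Reduce every aliquot barcode to the 12-character participant barcode.
met_patients <- substr(met_cases, 1, 12)
exp_patients <- substr(exp_cases, 1, 12)

# intersect() drops duplicates, so a patient with several aliquots appears once.
common.patients <- intersect(met_patients, exp_patients)
print(common.patients)
```

Because the comparison happens at the participant level, a patient counts as common even when the two assays were run on different samples or aliquots from that patient.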

Raw Sequencing Data: Finding the match between file names and barcode for Controlled data.

This example shows how the user can search for adrenocortical carcinoma (TCGA-ACC) Raw Sequencing Data (“Controlled”) and verify the names of the files and the barcodes associated with them.

query <- GDCquery(
    project = "TCGA-ACC", 
    data.category = "Sequencing Reads",
    data.type = "Aligned Reads", 
    data.format = "bam",
    workflow.type = "STAR 2-Pass Transcriptome"
)
# Only first 10 to make render faster
datatable(
    getResults(query, rows = 1:10,cols = c("file_name","cases")), 
    filter = 'top',
    options = list(scrollX = TRUE, keys = TRUE, pageLength = 5), 
    rownames = FALSE
)
query <- GDCquery(
    project = "TCGA-ACC", 
    data.category = "Sequencing Reads",
    data.type = "Aligned Reads", 
    data.format = "bam",
    workflow.type = "STAR 2-Pass Genome"
)
# Only first 10 to make render faster
datatable(
    getResults(query, rows = 1:10,cols = c("file_name","cases")), 
    filter = 'top',
    options = list(scrollX = TRUE, keys = TRUE, pageLength = 5), 
    rownames = FALSE
)
query <- GDCquery(
    project = "TCGA-ACC", 
    data.category = "Sequencing Reads",
    data.type = "Aligned Reads", 
    data.format = "bam",
    workflow.type = "STAR 2-Pass Chimeric"
)
# Only first 10 to make render faster
datatable(
    getResults(query, rows = 1:10,cols = c("file_name","cases")), 
    filter = 'top',
    options = list(scrollX = TRUE, keys = TRUE, pageLength = 5), 
    rownames = FALSE
)
query <- GDCquery(
    project = "TCGA-ACC", 
    data.category = "Sequencing Reads",
    data.type = "Aligned Reads", 
    data.format = "bam",
    workflow.type = "BWA-aln"
)
# Only first 10 to make render faster
datatable(
    getResults(query, rows = 1:10,cols = c("file_name","cases")), 
    filter = 'top',
    options = list(scrollX = TRUE, keys = TRUE, pageLength = 5), 
    rownames = FALSE
)
query <- GDCquery(
    project = "TCGA-ACC", 
    data.category = "Sequencing Reads",
    data.type = "Aligned Reads", 
    data.format = "bam",
    workflow.type = "BWA with Mark Duplicates and BQSR"
)
# Only first 10 to make render faster
datatable(
    getResults(query, rows = 1:10,cols = c("file_name","cases")), 
    filter = 'top',
    options = list(scrollX = TRUE, keys = TRUE, pageLength = 5), 
    rownames = FALSE
)

Get Manifest file

If you want to get the manifest file from the query object, you can use the function getManifest. If you set save to TRUE, a txt file is created that can be used with the GDC Data Transfer Tool (DTT, gdc-client) or with its GUI version, dtt-ui.

getManifest(query, save = FALSE) 
      id
      <chr>
 1    a55594d7-01b7-43bb-8bd2-effe736f7fdd
 2    3e49f7d3-567a-469f-b7a0-014a8f1110d7
 3    26f69ddb-d10a-4c98-b812-5e3333570249
 4    100c475e-144b-4a4e-b100-a04f588e22bc
 5    bcba3ec6-e242-4fe6-9b2b-85577aab939f
 6    0842bc26-2f35-4f95-89d3-9b980274444d
 7    b02879ce-c910-4d19-8f59-0e8695fbccd7
 8    b066a539-de83-49ed-89c3-51a905b1fbd9
 9    163ca627-8bdb-4efb-a689-564fdd09c851
10    d145a047-7078-4565-af1d-f84871cc0edd

ATAC-seq data

For the moment, ATAC-seq data is available at the GDC publication page. Also, for more details, you can check an ATAC-seq workshop at http://rpubs.com/tiagochst/atac_seq_workshop

The list of available files is below:

datatable(
    getResults(TCGAbiolinks:::GDCquery_ATAC_seq())[,c("file_name","file_size")], 
    filter = 'top',
    options = list(scrollX = TRUE, keys = TRUE, pageLength = 5), 
    rownames = FALSE
)

You can use the function GDCquery_ATAC_seq to filter the manifest table and GDCdownload to save the data locally.

query <- TCGAbiolinks:::GDCquery_ATAC_seq(file.type = "rds") 
GDCdownload(query, method = "client")

query <- TCGAbiolinks:::GDCquery_ATAC_seq(file.type = "bigWigs") 
GDCdownload(query, method = "client")

Summary of available files per patient

Retrieve the number of files under each combination of data_category, data_type, experimental_strategy, and platform. The result is similar to the view at https://portal.gdc.cancer.gov/exploration

tab <-  getSampleFilesSummary(project = "TCGA-ACC")
datatable(
    head(tab),
    filter = 'top',
    options = list(scrollX = TRUE, keys = TRUE, pageLength = 5), 
    rownames = FALSE
)
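To make the idea of "number of files per combination" concrete, here is a small base R sketch of the same kind of tabulation on a toy file listing. It is not how getSampleFilesSummary is implemented, and the patients, column names, and rows are hypothetical; only two of the grouping fields are used to keep the example short.

```r
# Toy file listing: one row per file (hypothetical patients and rows).
files <- data.frame(
    patient = c("TCGA-XX-0001", "TCGA-XX-0001", "TCGA-XX-0001", "TCGA-XX-0002"),
    data_category = c(
        "DNA Methylation", "Transcriptome Profiling",
        "Transcriptome Profiling", "DNA Methylation"
    ),
    data_type = c(
        "Methylation Beta Value", "Gene Expression Quantification",
        "Gene Expression Quantification", "Methylation Beta Value"
    ),
    stringsAsFactors = FALSE
)

# Build one label per data_category + data_type combination.
files$group <- paste(files$data_category, files$data_type, sep = " | ")

# Cross-tabulate: rows are patients, columns are combinations, cells are file counts.
counts <- table(files$patient, files$group)
print(counts)
```

A zero in a cell means the patient has no file for that combination, which is a quick way to spot patients missing a given data type.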