Questions and answers from over the years

How could I generate a manifest file with filtering of Race and Ethnicity?

From https://support.bioconductor.org/p/9138939/.

library(GenomicDataCommons,quietly = TRUE)

I made a small change to how filter expressions are written, following current best practices for lazy evaluation: the leading ~ is no longer needed in the filter expression (the older ~ form, used in a later answer below, is still accepted). So:

q = files() |>
  GenomicDataCommons::filter(
    cases.project.project_id == 'TCGA-COAD' &
      data_type == 'Aligned Reads' &
      experimental_strategy == 'RNA-Seq' &
      data_format == 'BAM')

And get a count of the results:

count(q)
## [1] 1188

And the manifest.

manifest(q)
## # A tibble: 1,188 × 26
##    id    proportion_reads_map…¹ access wgs_coverage proportion_base_mism…² acl_1
##    <chr>                  <dbl> <chr>  <chr>                         <dbl> <chr>
##  1 7b8e…                  0.988 contr… Not Applica…                0.00679 phs0…
##  2 7e34…                  0.991 contr… Not Applica…                0.00419 phs0…
##  3 3f80…                  0.986 contr… Not Applica…                0.00466 phs0…
##  4 1522…                 NA     contr… Not Applica…               NA       phs0…
##  5 6a52…                 NA     contr… Not Applica…               NA       phs0…
##  6 82dd…                  0.988 contr… Not Applica…                0.00768 phs0…
##  7 cc3a…                  0.975 contr… Not Applica…                0.00427 phs0…
##  8 f565…                 NA     contr… Not Applica…               NA       phs0…
##  9 0575…                 NA     contr… Not Applica…               NA       phs0…
## 10 db84…                  0.984 contr… Not Applica…                0.00383 phs0…
## # ℹ 1,178 more rows
## # ℹ abbreviated names: ¹​proportion_reads_mapped, ²​proportion_base_mismatch
## # ℹ 20 more variables: type <chr>, platform <chr>, created_datetime <chr>,
## #   md5sum <chr>, updated_datetime <chr>, pairs_on_diff_chr <int>, state <chr>,
## #   data_format <chr>, total_reads <int>, file_name <chr>,
## #   proportion_reads_duplicated <int>, submitter_id <chr>, data_category <chr>,
## #   file_size <dbl>, average_base_quality <int>, file_id <chr>, …

Your question about race and ethnicity is a good one. Start by listing every field that is available on the files endpoint.

all_fields = available_fields(files())

We can then grep for race or ethnic to find candidate fields.

grep('race|ethnic',all_fields,value=TRUE)
## [1] "cases.demographic.ethnicity"                                           
## [2] "cases.demographic.race"                                                
## [3] "cases.follow_ups.hormonal_contraceptive_type"                          
## [4] "cases.follow_ups.hormonal_contraceptive_use"                           
## [5] "cases.follow_ups.other_clinical_attributes.hormonal_contraceptive_type"
## [6] "cases.follow_ups.other_clinical_attributes.hormonal_contraceptive_use" 
## [7] "cases.follow_ups.scan_tracer_used"
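
Only the first two hits are demographic fields. The others appear because the pattern race also matches inside "contraceptive" and "tracer". As a minimal sketch, a tighter pattern can require the last dot-separated component of the field name to be exactly race or ethnicity. The vector below is copied from the output above, and the names matched and demographic_fields are illustrative only.

```r
# Field names copied from the grep() output above.
matched <- c(
  "cases.demographic.ethnicity",
  "cases.demographic.race",
  "cases.follow_ups.hormonal_contraceptive_type",
  "cases.follow_ups.hormonal_contraceptive_use",
  "cases.follow_ups.other_clinical_attributes.hormonal_contraceptive_type",
  "cases.follow_ups.other_clinical_attributes.hormonal_contraceptive_use",
  "cases.follow_ups.scan_tracer_used"
)

# Keep a field only when its final component is exactly "race" or "ethnicity".
demographic_fields <- grep("\\.(race|ethnicity)$", matched, value = TRUE)
print(demographic_fields)
```

The same pattern could be applied to all_fields directly. It is stricter than the original, so it would miss a field named, say, race_category if one existed.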

Now, we can check available values for each field to determine how to complete our filter expressions.

available_values('files',"cases.demographic.ethnicity")
## [1] "not hispanic or latino" "not reported"           "hispanic or latino"    
## [4] "unknown"                "_missing"
available_values('files',"cases.demographic.race")
##  [1] "white"                                    
##  [2] "not reported"                             
##  [3] "black or african american"                
##  [4] "asian"                                    
##  [5] "unknown"                                  
##  [6] "american indian or alaska native"         
##  [7] "native hawaiian or other pacific islander"
##  [8] "other"                                    
##  [9] "not allowed to collect"                   
## [10] "_missing"

We can now extend the filter to keep only files from cases whose race is recorded as 'white'. The new condition is applied on top of the filters already in q.

q_white_only = q |>
  GenomicDataCommons::filter(cases.demographic.race=='white')
count(q_white_only)
## [1] 695
manifest(q_white_only)
## # A tibble: 695 × 26
##    id       data_format access file_name wgs_coverage submitter_id data_category
##    <chr>    <chr>       <chr>  <chr>     <chr>        <chr>        <chr>        
##  1 0f41ec2… BAM         contr… cfbdbfeb… Not Applica… 13dd79d6-81… Sequencing R…
##  2 d69631b… BAM         contr… d69e622a… Not Applica… 30b51497-56… Sequencing R…
##  3 ab2377e… BAM         contr… f825534d… Not Applica… c3567b4f-ed… Sequencing R…
##  4 a2524e5… BAM         contr… f825534d… Not Applica… 6d6d6b21-d5… Sequencing R…
##  5 45e003c… BAM         contr… 83ae572a… Not Applica… 8774a16c-ad… Sequencing R…
##  6 7822ea1… BAM         contr… 83ae572a… Not Applica… c9570046-cb… Sequencing R…
##  7 afd02b7… BAM         contr… 46a6f49d… Not Applica… 24d0b0c2-31… Sequencing R…
##  8 4dbe852… BAM         contr… 63bab58e… Not Applica… 02234cd2-65… Sequencing R…
##  9 6f4370b… BAM         contr… 63bab58e… Not Applica… 09b2c041-86… Sequencing R…
## 10 befed65… BAM         contr… 10013d81… Not Applica… a57d4eac-22… Sequencing R…
## # ℹ 685 more rows
## # ℹ 19 more variables: acl_1 <chr>, type <chr>, platform <chr>,
## #   file_size <dbl>, created_datetime <chr>, md5sum <chr>,
## #   updated_datetime <chr>, file_id <chr>, data_type <chr>, state <chr>,
## #   experimental_strategy <chr>, proportion_reads_mapped <dbl>,
## #   proportion_base_mismatch <dbl>, pairs_on_diff_chr <int>, total_reads <int>,
## #   proportion_reads_duplicated <int>, average_base_quality <int>, …
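
The original question asked about both race and ethnicity, and about getting a manifest file on disk. The following is a minimal sketch rather than a definitive recipe. It assumes the field values listed by available_values() above, and that the manifest contains the id, file_name, md5sum, file_size, and state columns shown in the output above. The output path gdc_manifest.tsv is a hypothetical name. Whether a downstream download tool accepts these column names unchanged is not verified here.

```r
library(GenomicDataCommons)

# Same file filters as before, plus one race and one ethnicity condition.
q_subset <- files() |>
  GenomicDataCommons::filter(
    cases.project.project_id == 'TCGA-COAD' &
      data_type == 'Aligned Reads' &
      experimental_strategy == 'RNA-Seq' &
      data_format == 'BAM' &
      cases.demographic.race == 'white' &
      cases.demographic.ethnicity == 'not hispanic or latino')

# Retrieve the manifest as a tibble.
m <- manifest(q_subset)

# Keep a small set of columns (assumed present, as in the output above).
m_out <- m[, c("id", "file_name", "md5sum", "file_size", "state")]

# Write a tab-delimited file; the path is a hypothetical choice.
manifest_path <- "gdc_manifest.tsv"
write.table(m_out, file = manifest_path, sep = "\t",
            quote = FALSE, row.names = FALSE)
```

Both conditions are joined with &, so a file is kept only if its case matches the race value and the ethnicity value. Counts will change as the GDC publishes new data releases.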

How can I get the number of cases with RNA-Seq data added by date to the TCGA project with GenomicDataCommons?

I would like to get the number of cases added to the TCGA project by experiment type and by date (the created date, or any sensible datetime, would do). I tried the GenomicDataCommons package, but I believe it gives me the number of files for a given experiment type rather than the number of cases. How can I get the number of cases for which there is RNA-Seq data?

library(tibble)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:GenomicDataCommons':
## 
##     count, filter, select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(GenomicDataCommons)

Starting the query from cases() rather than files() should make each facet count (doc_count) a number of cases rather than a number of files:

cases() |> 
  GenomicDataCommons::filter(
    ~ project.program.name=='TCGA' & files.experimental_strategy=='RNA-Seq'
  ) |> 
  facet(c("files.created_datetime")) |> 
  aggregations() |> 
  unname() |>
  unlist(recursive = FALSE) |> 
  as_tibble() |>
  dplyr::arrange(dplyr::desc(key))
## # A tibble: 200 × 2
##    doc_count key                             
##        <int> <chr>                           
##  1       271 2024-06-14t14:27:00.916424-05:00
##  2       416 2024-06-14t13:28:10.644120-05:00
##  3       150 2024-03-11t09:00:39.229286-05:00
##  4       151 2023-03-09t00:35:51.387873-06:00
##  5        79 2023-02-19t04:41:11.008116-06:00
##  6       458 2023-02-19t04:36:10.605050-06:00
##  7        80 2023-02-19t04:28:49.400023-06:00
##  8       178 2023-02-19t04:23:49.092629-06:00
##  9       516 2023-02-19t04:18:49.453628-06:00
## 10       179 2023-02-19t04:13:47.877168-06:00
## # ℹ 190 more rows
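
Each key above is a full timestamp, so there is one bucket per creation time rather than per day. A minimal sketch of collapsing the buckets to calendar days follows. It uses only the ten rows printed above, and the names facet_counts and by_day are illustrative. If a case has files created at different times it may be counted in more than one bucket, in which case the per-day sums overstate the number of distinct cases.

```r
# First ten rows of the facet result printed above.
facet_counts <- data.frame(
  doc_count = c(271L, 416L, 150L, 151L, 79L, 458L, 80L, 178L, 516L, 179L),
  key = c(
    "2024-06-14t14:27:00.916424-05:00",
    "2024-06-14t13:28:10.644120-05:00",
    "2024-03-11t09:00:39.229286-05:00",
    "2023-03-09t00:35:51.387873-06:00",
    "2023-02-19t04:41:11.008116-06:00",
    "2023-02-19t04:36:10.605050-06:00",
    "2023-02-19t04:28:49.400023-06:00",
    "2023-02-19t04:23:49.092629-06:00",
    "2023-02-19t04:18:49.453628-06:00",
    "2023-02-19t04:13:47.877168-06:00"
  ),
  stringsAsFactors = FALSE
)

# The first 10 characters of each key are the date (YYYY-MM-DD) as recorded
# in the timestamp, i.e. in the timestamp's own UTC offset.
facet_counts$day <- as.Date(substr(facet_counts$key, 1, 10))

# Sum the bucket counts for each calendar day.
by_day <- aggregate(doc_count ~ day, data = facet_counts, FUN = sum)
print(by_day)
```

On the full result, the same two steps can be applied to the tibble returned by the pipeline instead of the hand-copied data frame.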