Original Authors: Martin Morgan, Sonali Arora, Lori Shepherd
Presenting Author: Martin Morgan
Date: 20 June, 2022
Back: Monday labs
Objective: Learn about Bioconductor resources for gene and genome annotation.
Lessons learned:
org.*
packages for mapping between gene symbols.TxDb.*
and ensembldb
(EnsDb.*
) packages for working with gene
models.AnnotationHub
to easily obtain select consortium-level resourcesbiomaRt
and other internet-based resources for highly
flexible annotation.VariantAnnotation
and VariantFiltering
for annotating SNPs.Organism-level (‘org’) packages contain mappings between a central
identifier (e.g., Entrez gene ids) and other identifiers (e.g. GenBank
or Uniprot accession number, RefSeq id, etc.). The name of an org
package is always of the form org.<Sp>.<id>.db
(e.g. org.Sc.sgd.db) where <Sp>
is a 2-letter abbreviation of
the organism (e.g. Sc
for Saccharomyces cerevisiae) and <id>
is
an abbreviation (in lower-case) describing the type of central
identifier (e.g. sgd
for gene identifiers assigned by the
Saccharomyces Genome Database, or eg
for Entrez gene ids). The
“How to use the ‘.db’ annotation packages” vignette in the
AnnotationDbi package (org packages are only one type of “.db”
annotation packages) is a key reference. The ‘.db’ and most other
Bioconductor annotation packages are updated every 6 months.
Annotation packages usually contain an object named after the package
itself. These objects are collectively called AnnotationDb
objects,
with more specific classes named OrgDb
, ChipDb
or TranscriptDb
objects. Methods that can be applied to these objects include
cols()
, keys()
, keytypes()
and select()
. Common operations
for retrieving annotations are summarized in the table.
Category | Function | Description |
---|---|---|
Discover | columns() |
List the kinds of columns that can be returned |
keytypes() |
List columns that can be used as keys | |
keys() |
List values that can be expected for a given keytype | |
select() |
Retrieve annotations matching keys , keytype and columns |
|
Manipulate | setdiff() , union() , intersect() |
Operations on sets |
duplicated() , unique() |
Mark or remove duplicates | |
%in% , match() |
Find matches | |
any() , all() |
Are any TRUE ? Are all? |
|
merge() |
Combine two different based on shared keys | |
GRanges* |
transcripts() , exons() , cds() |
Features (transcripts, exons, coding sequence) as GRanges . |
transcriptsBy() , exonsBy() |
Features group by gene, transcript, etc., as GRangesList . |
|
cdsBy() |
A short summary of select Bioconductor packages enabling web-based queries is in following Table.
Package | Description |
---|---|
AnnotationHub | Ensembl, Encode, dbSNP, UCSC data objects |
biomaRt | Ensembl and other annotations |
PSICQUIC | Protein interactions |
uniprot.ws | Protein annotations |
KEGGREST | KEGG pathways |
SRAdb | Sequencing experiments. |
rtracklayer | genome tracks. |
GEOquery | Array and other data |
ArrayExpress | Array and other data |
Exercise 1: This exercise illustrates basic use of the `select’ interface to annotation packages.
Install and attach the org.Hs.eg.db annotation package; it contains ‘symbol mapping’ information for Homo sapiens, based on NCBI ‘Entrez’ identifiers.
library(org.Hs.eg.db)
Take a quick look at a summary of data in this package
org.Hs.eg.db
## OrgDb object:
## | DBSCHEMAVERSION: 2.1
## | Db type: OrgDb
## | Supporting package: AnnotationDbi
## | DBSCHEMA: HUMAN_DB
## | ORGANISM: Homo sapiens
## | SPECIES: Human
## | EGSOURCEDATE: 2022-Mar17
## | EGSOURCENAME: Entrez Gene
## | EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
## | CENTRALID: EG
## | TAXID: 9606
## | GOSOURCENAME: Gene Ontology
## | GOSOURCEURL: http://current.geneontology.org/ontology/go-basic.obo
## | GOSOURCEDATE: 2022-03-10
## | GOEGSOURCEDATE: 2022-Mar17
## | GOEGSOURCENAME: Entrez Gene
## | GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
## | KEGGSOURCENAME: KEGG GENOME
## | KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
## | KEGGSOURCEDATE: 2011-Mar15
## | GPSOURCENAME: UCSC Genome Bioinformatics (Homo sapiens)
## | GPSOURCEURL:
## | GPSOURCEDATE: 2022-Nov23
## | ENSOURCEDATE: 2021-Dec21
## | ENSOURCENAME: Ensembl
## | ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta
## | UPSOURCENAME: Uniprot
## | UPSOURCEURL: http://www.UniProt.org/
## | UPSOURCEDATE: Fri Apr 1 14:42:16 2022
##
## Please see: help('select') for usage information
The idea is that there are keytypes()
that can be mapped to
different columns()
; keys()
can be used to see available
keys. Explore the package to see what sorts of information is
available, e.g.,
keytypes(org.Hs.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
## [11] "GENETYPE" "GO" "GOALL" "IPI" "MAP"
## [16] "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM"
## [21] "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
## [26] "UNIPROT"
columns(org.Hs.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
## [11] "GENETYPE" "GO" "GOALL" "IPI" "MAP"
## [16] "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM"
## [21] "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
## [26] "UNIPROT"
head(keys(org.Hs.eg.db, "SYMBOL"))
## [1] "A1BG" "A2M" "A2MP1" "NAT1" "NAT2" "NATP"
There are two basic ways of extracting data from an org.*
package
– mapIds()
to create a 1:1 mapping between key and a single
column, and select()
(it’s often necessary to specify this
function directly, to avoid a conflict with dplyr, as
AnnotationDbi::select()
). Explore these functions, e.g.,
set.seed(123)
egid <- sample(keys(org.Hs.eg.db), 6)
mapIds(org.Hs.eg.db, egid, "SYMBOL", "ENTREZID")
## 'select()' returned 1:1 mapping between keys and columns
## 106481083 111365226 3691 100130841 121627892
## "RN7SL586P" "LOC111365226" "ITGB4" "LOC100130841" "LOC121627892"
## 105373680
## "LOC105373680"
AnnotationDbi::select(
org.Hs.eg.db, egid, c("SYMBOL", "ENSEMBL", "GENENAME"), "ENTREZID"
)
## 'select()' returned 1:1 mapping between keys and columns
## ENTREZID SYMBOL ENSEMBL
## 1 106481083 RN7SL586P ENSG00000241487
## 2 111365226 LOC111365226 <NA>
## 3 3691 ITGB4 ENSG00000132470
## 4 100130841 LOC100130841 <NA>
## 5 121627892 LOC121627892 <NA>
## 6 105373680 LOC105373680 <NA>
## GENENAME
## 1 RNA, 7SL, cytoplasmic 586, pseudogene
## 2 HNF4 motif-containing MPRA enhancer 122
## 3 integrin subunit beta 4
## 4 MDM2 oncogene, E3 ubiquitin protein ligase pseudogene
## 5 Sharpr-MPRA regulatory region 8388
## 6 uncharacterized LOC105373680
Some key - column mappings are 1:many, e.g., Entrez ID "3812"
maps to 44 Ensembl Ids. What does mapIds()
return when mapping
Entrez ID "3812"
to Ensembl ids? Use the additional argument
multiVals = "CharacterList"
to explore further. Compare results
to those returned by select()
.
egid <- "3812"
mapIds(org.Hs.eg.db, egid, "ENSEMBL", "ENTREZID")
## 'select()' returned 1:many mapping between keys and columns
## 3812
## "ENSG00000240403"
mapIds(
org.Hs.eg.db, egid, "ENSEMBL", "ENTREZID",
multiVals = "CharacterList"
)
## 'select()' returned 1:many mapping between keys and columns
## CharacterList of length 1
## [["3812"]] ENSG00000240403 ENSG00000276424 ... ENSG00000284046 ENSG00000283951
AnnotationDbi::select(
org.Hs.eg.db, egid, c("SYMBOL", "ENSEMBL"),
multiVals = "CharacterList"
)
## 'select()' returned 1:many mapping between keys and columns
## ENTREZID SYMBOL ENSEMBL
## 1 3812 KIR3DL2 ENSG00000240403
## 2 3812 KIR3DL2 ENSG00000276424
## 3 3812 KIR3DL2 ENSG00000273735
## 4 3812 KIR3DL2 ENSG00000276357
## 5 3812 KIR3DL2 ENSG00000278850
## 6 3812 KIR3DL2 ENSG00000278710
## 7 3812 KIR3DL2 ENSG00000274722
## 8 3812 KIR3DL2 ENSG00000276004
## 9 3812 KIR3DL2 ENSG00000278758
## 10 3812 KIR3DL2 ENSG00000278726
## 11 3812 KIR3DL2 ENSG00000278442
## 12 3812 KIR3DL2 ENSG00000278474
## 13 3812 KIR3DL2 ENSG00000278656
## 14 3812 KIR3DL2 ENSG00000275629
## 15 3812 KIR3DL2 ENSG00000276882
## 16 3812 KIR3DL2 ENSG00000275416
## 17 3812 KIR3DL2 ENSG00000278809
## 18 3812 KIR3DL2 ENSG00000277982
## 19 3812 KIR3DL2 ENSG00000278361
## 20 3812 KIR3DL2 ENSG00000275262
## 21 3812 KIR3DL2 ENSG00000275083
## 22 3812 KIR3DL2 ENSG00000275626
## 23 3812 KIR3DL2 ENSG00000275511
## 24 3812 KIR3DL2 ENSG00000278403
## 25 3812 KIR3DL2 ENSG00000275838
## 26 3812 KIR3DL2 ENSG00000276739
## 27 3812 KIR3DL2 ENSG00000278707
## 28 3812 KIR3DL2 ENSG00000277709
## 29 3812 KIR3DL2 ENSG00000275566
## 30 3812 KIR3DL2 ENSG00000273911
## 31 3812 KIR3DL2 ENSG00000277181
## 32 3812 KIR3DL2 ENSG00000284063
## 33 3812 KIR3DL2 ENSG00000288389
## 34 3812 KIR3DL2 ENSG00000284384
## 35 3812 KIR3DL2 ENSG00000284295
## 36 3812 KIR3DL2 ENSG00000284381
## 37 3812 KIR3DL2 ENSG00000283975
## 38 3812 KIR3DL2 ENSG00000284213
## 39 3812 KIR3DL2 ENSG00000284528
## 40 3812 KIR3DL2 ENSG00000284192
## 41 3812 KIR3DL2 ENSG00000284466
## 42 3812 KIR3DL2 ENSG00000284046
## 43 3812 KIR3DL2 ENSG00000283951
It seems like it might often be useful to use the tidyverse on
return values from mapIds()
and select()
; explore this usage
library(tidyverse)
egid <- keys(org.Hs.eg.db) # all ENTREZIDs
mapIds(org.Hs.eg.db, egid, "SYMBOL", "ENTREZID") |>
as_tibble() |>
rownames_to_column("ENTREZID")
## # A tibble: 66,102 × 2
## ENTREZID value
## <chr> <chr>
## 1 1 A1BG
## 2 2 A2M
## 3 3 A2MP1
## 4 4 NAT1
## 5 5 NAT2
## 6 6 NATP
## 7 7 SERPINA3
## 8 8 AADAC
## 9 9 AAMP
## 10 10 AANAT
## # … with 66,092 more rows
AnnotationDbi::select(
org.Hs.eg.db, egid, c("SYMBOL", "GO", "GENENAME"), "ENTREZID"
) |>
as_tibble()
## # A tibble: 379,815 × 6
## ENTREZID SYMBOL GO EVIDENCE ONTOLOGY GENENAME
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 A1BG GO:0003674 ND MF alpha-1-B glycoprotein
## 2 1 A1BG GO:0005576 HDA CC alpha-1-B glycoprotein
## 3 1 A1BG GO:0005576 IDA CC alpha-1-B glycoprotein
## 4 1 A1BG GO:0005576 TAS CC alpha-1-B glycoprotein
## 5 1 A1BG GO:0005615 HDA CC alpha-1-B glycoprotein
## 6 1 A1BG GO:0008150 ND BP alpha-1-B glycoprotein
## 7 1 A1BG GO:0031093 TAS CC alpha-1-B glycoprotein
## 8 1 A1BG GO:0034774 TAS CC alpha-1-B glycoprotein
## 9 1 A1BG GO:0062023 HDA CC alpha-1-B glycoprotein
## 10 1 A1BG GO:0070062 HDA CC alpha-1-B glycoprotein
## # … with 379,805 more rows
Exercise 2: biomaRt.
Internet access required for this exercise
Explore the Biomart web site https://www.ensembl.org/biomart for retrieving all kinds of genomic annotations.
Start by choosing a database (e.g., ‘Ensembl Genes 92’), dataset (e.g., ‘Human genes (GRCh38.p12)’), filter (e.g., ‘GENE’ / ‘Input external reference’ / ‘Gene stable id’ and enter ‘ENSG00000000003’), attributes (default is ok), then press ‘Results’ to map from Ensembl identifier to transcript identifier.
Install (if necessary) and load the biomaRt package. Use
listMarts()
to see availble databases, useMart()
to select the
mart you’re interested in.
library(biomaRt)
head(listMarts())
## biomart version
## 1 ENSEMBL_MART_ENSEMBL Ensembl Genes 106
## 2 ENSEMBL_MART_MOUSE Mouse strains 106
## 3 ENSEMBL_MART_SNP Ensembl Variation 106
## 4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 106
mart <- useMart("ENSEMBL_MART_ENSEMBL")
Use listDatasets()
and useDataset()
to select the Homo
sapiens gene dataset.
head(listDatasets(mart))
## dataset description
## 1 abrachyrhynchus_gene_ensembl Pink-footed goose genes (ASM259213v1)
## 2 acalliptera_gene_ensembl Eastern happy genes (fAstCal1.2)
## 3 acarolinensis_gene_ensembl Green anole genes (AnoCar2.0v2)
## 4 acchrysaetos_gene_ensembl Golden eagle genes (bAquChr1.2)
## 5 acitrinellus_gene_ensembl Midas cichlid genes (Midas_v5)
## 6 amelanoleuca_gene_ensembl Giant panda genes (ASM200744v2)
## version
## 1 ASM259213v1
## 2 fAstCal1.2
## 3 AnoCar2.0v2
## 4 bAquChr1.2
## 5 Midas_v5
## 6 ASM200744v2
dataset <- useDataset("hsapiens_gene_ensembl", mart)
Use listFilters()
to see available filters. The filter is the
type of data that you are querying with. Choose one.
head(listFilters(dataset))
## name description
## 1 chromosome_name Chromosome/scaffold name
## 2 start Start
## 3 end End
## 4 band_start Band Start
## 5 band_end Band End
## 6 marker_start Marker Start
filters <- "ensembl_gene_id" # see `listFilters()`
Use listAttrbutes()
to see available attributes. Attributes
represent the information you’d like to retrieve. Choose some!
head(listAttributes(dataset))
## name description page
## 1 ensembl_gene_id Gene stable ID feature_page
## 2 ensembl_gene_id_version Gene stable ID version feature_page
## 3 ensembl_transcript_id Transcript stable ID feature_page
## 4 ensembl_transcript_id_version Transcript stable ID version feature_page
## 5 ensembl_peptide_id Protein stable ID feature_page
## 6 ensembl_peptide_id_version Protein stable ID version feature_page
attrs <- c("ensembl_gene_id", "hgnc_symbol") # see `listAttributes()`
Create a character vector of Ensembl gene ids, compose and execute the query, transforming the result to a tibble.
ids <- c(
"ENSG00000000003", "ENSG00000000005", "ENSG00000000419",
"ENSG00000000457", "ENSG00000000460", "ENSG00000000938"
)
tbl <-
getBM(attrs, filters, ids, dataset) |>
as_tibble()
tbl
## # A tibble: 6 × 2
## ensembl_gene_id hgnc_symbol
## <chr> <chr>
## 1 ENSG00000000003 TSPAN6
## 2 ENSG00000000005 TNMD
## 3 ENSG00000000419 DPM1
## 4 ENSG00000000457 SCYL3
## 5 ENSG00000000460 C1orf112
## 6 ENSG00000000938 FGR
Exercise 3: KEGGREST
Internet access required for this exercise
Explore the KEGG web site https://www.genome.jp/kegg/ KEGG is a database of information on pathways.
Load the KEGGREST package and discover available databases
library(KEGGREST)
KEGGREST::listDatabases()
## [1] "pathway" "brite" "module" "ko" "genome" "vg"
## [7] "ag" "compound" "glycan" "reaction" "rclass" "enzyme"
## [13] "disease" "drug" "dgroup" "environ" "genes" "ligand"
## [19] "kegg"
Use keggList()
to query the pathway database for human pathways;
present the result as a tibble
hsa_pathways <-
keggList("pathway", "hsa") |>
enframe(name = "pathway", value = "description")
hsa_pathways
## # A tibble: 347 × 2
## pathway description
## <chr> <chr>
## 1 path:hsa00010 Glycolysis / Gluconeogenesis - Homo sapiens (human)
## 2 path:hsa00020 Citrate cycle (TCA cycle) - Homo sapiens (human)
## 3 path:hsa00030 Pentose phosphate pathway - Homo sapiens (human)
## 4 path:hsa00040 Pentose and glucuronate interconversions - Homo sapiens (human)
## 5 path:hsa00051 Fructose and mannose metabolism - Homo sapiens (human)
## 6 path:hsa00052 Galactose metabolism - Homo sapiens (human)
## 7 path:hsa00053 Ascorbate and aldarate metabolism - Homo sapiens (human)
## 8 path:hsa00061 Fatty acid biosynthesis - Homo sapiens (human)
## 9 path:hsa00062 Fatty acid elongation - Homo sapiens (human)
## 10 path:hsa00071 Fatty acid degradation - Homo sapiens (human)
## # … with 337 more rows
Use keggLink()
to recover the genes in each pathway.
hsa_path_eg <-
keggLink("pathway", "hsa") |>
enframe(name = "egid", value = "pathway") |>
mutate(egid = sub("hsa:", "", egid))
hsa_path_eg
## # A tibble: 35,568 × 2
## egid pathway
## <chr> <chr>
## 1 10327 path:hsa00010
## 2 124 path:hsa00010
## 3 125 path:hsa00010
## 4 126 path:hsa00010
## 5 127 path:hsa00010
## 6 128 path:hsa00010
## 7 130 path:hsa00010
## 8 130589 path:hsa00010
## 9 131 path:hsa00010
## 10 160287 path:hsa00010
## # … with 35,558 more rows
hsa_path_eg |>
group_by(pathway) |>
summarize(genes = list(egid))
## # A tibble: 347 × 2
## pathway genes
## <chr> <list>
## 1 path:hsa00010 <chr [67]>
## 2 path:hsa00020 <chr [30]>
## 3 path:hsa00030 <chr [30]>
## 4 path:hsa00040 <chr [35]>
## 5 path:hsa00051 <chr [33]>
## 6 path:hsa00052 <chr [31]>
## 7 path:hsa00053 <chr [30]>
## 8 path:hsa00061 <chr [18]>
## 9 path:hsa00062 <chr [27]>
## 10 path:hsa00071 <chr [43]>
## # … with 337 more rows
Update the hsa_path_eg
table to include information on gene
symbol and Ensembl id from the org.Hs.eg.db
package. Retrieve the
relevant information using mapIds()
. How would you deal with
entrez gene ids that map to multiple Ensembl ids?
hsa_kegg_anno <-
hsa_path_eg |>
mutate(
symbol = mapIds(org.Hs.eg.db, egid, "SYMBOL", "ENTREZID"),
ensembl = mapIds(org.Hs.eg.db, egid, "ENSEMBL", "ENTREZID")
)
## 'select()' returned 1:1 mapping between keys and columns
## 'select()' returned 1:many mapping between keys and columns
Use left_join()
to append pathway descriptions to the
hsa_kegg_anno
table.
left_join(hsa_kegg_anno, hsa_pathways, by = "pathway")
## # A tibble: 35,568 × 5
## egid pathway symbol ensembl description
## <chr> <chr> <chr> <chr> <chr>
## 1 10327 path:hsa00010 AKR1A1 ENSG00000117448 Glycolysis / Gluconeogenesis - …
## 2 124 path:hsa00010 ADH1A ENSG00000187758 Glycolysis / Gluconeogenesis - …
## 3 125 path:hsa00010 ADH1B ENSG00000196616 Glycolysis / Gluconeogenesis - …
## 4 126 path:hsa00010 ADH1C ENSG00000248144 Glycolysis / Gluconeogenesis - …
## 5 127 path:hsa00010 ADH4 ENSG00000198099 Glycolysis / Gluconeogenesis - …
## 6 128 path:hsa00010 ADH5 ENSG00000197894 Glycolysis / Gluconeogenesis - …
## 7 130 path:hsa00010 ADH6 ENSG00000172955 Glycolysis / Gluconeogenesis - …
## 8 130589 path:hsa00010 GALM ENSG00000143891 Glycolysis / Gluconeogenesis - …
## 9 131 path:hsa00010 ADH7 ENSG00000196344 Glycolysis / Gluconeogenesis - …
## 10 160287 path:hsa00010 LDHAL6A ENSG00000166800 Glycolysis / Gluconeogenesis - …
## # … with 35,558 more rows
There are a diversity of packages and classes available for representing large genomes. Several include:
TxDb.*
and EnsDb.*
For transcript and other genome / coordinate
annotation.available.genomes()
for pre-packaged genomes, and the vignette
‘How to forge a BSgenome data package’ in theFaFile()
(Rsamtools) for accessing indexed FASTA files.Genome-centric packages are very useful for annotations involving genomic coordinates. It is straight-forward, for instance, to discover the coordinates of coding sequences in regions of interest, and from these retrieve corresponding DNA or protein coding sequences. Other examples of the types of operations that are easy to perform with genome-centric annotations include defining regions of interest for counting aligned reads in RNA-seq experiments and retrieving DNA sequences underlying regions of interest in ChIP-seq analysis, e.g., for motif characterization.
The rtracklayer package allows us to query the UCSC genome
browser, as well as providing import()
and export()
functions for
common annotation file formats like GFF, GTF, and BED. The exercise
below illustrates some of the functionality of rtracklayer.
Exercise 4: TxDb.*
packages
Install and attach the TxDb.Hsapiens.UCSC.hg38.knownGene package. This contains the gene models for Homo sapiens based on the ‘hg38’ build of the human genome, using gene annotations in the UCSC ‘knownGene’ annotation track; TxDb’s for more recent builds and for different annotation tracks are available. Take a look at a summary of the package, and create an alias for easy typing
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
TxDb.Hsapiens.UCSC.hg38.knownGene
## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Data source: UCSC
## # Genome: hg38
## # Organism: Homo sapiens
## # Taxonomy ID: 9606
## # UCSC Table: knownGene
## # UCSC Track: GENCODE V39
## # Resource URL: http://genome.ucsc.edu/
## # Type of Gene ID: Entrez Gene ID
## # Full dataset: yes
## # miRBase build ID: NA
## # Nb of transcripts: 266064
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2022-04-07 16:50:30 +0000 (Thu, 07 Apr 2022)
## # GenomicFeatures version at creation time: 1.47.13
## # RSQLite version at creation time: 2.2.11
## # DBSCHEMAVERSION: 1.2
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
The main purpose of this package is to provide genomic coordinates
of genomic features such as exons()
, coding sequences (cds()
),
transcripts()
and genes()
. Explore, for example,
ex <- exons(txdb)
ex
## GRanges object with 713360 ranges and 1 metadata column:
## seqnames ranges strand | exon_id
## <Rle> <IRanges> <Rle> | <integer>
## [1] chr1 11869-12227 + | 1
## [2] chr1 12010-12057 + | 2
## [3] chr1 12179-12227 + | 3
## [4] chr1 12613-12697 + | 4
## [5] chr1 12613-12721 + | 5
## ... ... ... ... . ...
## [713356] chrUn_GL000220v1 155997-156149 + | 713356
## [713357] chrUn_KI270442v1 380608-380726 + | 713357
## [713358] chrUn_KI270442v1 217250-217401 - | 713358
## [713359] chrUn_KI270744v1 51009-51114 - | 713359
## [713360] chrUn_KI270750v1 148668-148843 + | 713360
## -------
## seqinfo: 640 sequences (1 circular) from hg38 genome
library(ggplot2)
qplot(log10(width(ex)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ex[ which.max(width(ex)) ]
## GRanges object with 1 range and 1 metadata column:
## seqnames ranges strand | exon_id
## <Rle> <IRanges> <Rle> | <integer>
## [1] chrX 113616300-113963599 - | 648594
## -------
## seqinfo: 640 sequences (1 circular) from hg38 genome
Extract all genes, and then keep only the ‘standard’ chromosomes
1:22, X, Y, and M. Use table()
of seqnames()
to determine how
many genes are on each chromosome. Also do this in a dplyr way;
note that the seqnames(gn)
need to be coerced with as.factor()
.
gn <- genes(txdb)
## 1662 genes were dropped because they have exons located on both strands
## of the same reference sequence or on more than one reference sequence,
## so cannot be represented by a single genomic range.
## Use 'single.strand.genes.only=FALSE' to get all the genes in a
## GRangesList object, or use suppressMessages() to suppress this message.
length(gn)
## [1] 29721
std <- paste0("chr", c(1:22, "X", "Y", "M"))
seqlevels(gn, pruning.mode = "coarse") <- std
length(gn)
## [1] 29644
seqlevels(gn)
## [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9"
## [10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
## [19] "chr19" "chr20" "chr21" "chr22" "chrX" "chrY" "chrM"
table( seqnames(gn) )
##
## chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13
## 2981 2046 1657 1236 1389 1353 1442 1054 1216 1214 1741 1466 685
## chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY chrM
## 969 1018 1211 1506 521 1726 846 428 645 1166 128 0
tibble(chr = as.factor(seqnames(gn))) |>
group_by(chr) |>
summarize(n = n())
## # A tibble: 24 × 2
## chr n
## <fct> <int>
## 1 chr1 2981
## 2 chr2 2046
## 3 chr3 1657
## 4 chr4 1236
## 5 chr5 1389
## 6 chr6 1353
## 7 chr7 1442
## 8 chr8 1054
## 9 chr9 1216
## 10 chr10 1214
## # … with 14 more rows
exonsBy()
groups exons by gene or transcript; extract exons
grouped by gene. (Challenging!) can you identify genes with exons
on different chromosomes? Are there any of these genes on the
standard chromosomes?
exByGn <- exonsBy(txdb, "gene")
##
trans <- lengths(unique(seqnames(exByGn)))
table( trans )
## trans
## 1 2 3 4 5 6 7 8 9 10 12 13 16
## 29721 1236 137 40 28 38 77 74 2 17 2 1 2
## 31 33 38 39 41 44
## 1 3 1 1 1 1
seqnames( exByGn[ trans > 1 ] )
## RleList of length 1662
## $`10000`
## factor-Rle of length 82 with 2 runs
## Lengths: 56 26
## Values : chr1 chr1_KI270763v1_alt
## Levels(640): chr1 chr2 chr3 ... chrUn_KI270756v1 chrUn_KI270757v1
##
## $`10003`
## factor-Rle of length 76 with 2 runs
## Lengths: 38 38
## Values : chr11 chr11_KZ559110v1_alt
## Levels(640): chr1 chr2 chr3 ... chrUn_KI270756v1 chrUn_KI270757v1
##
## $`100033819`
## factor-Rle of length 2 with 2 runs
## Lengths: 1 1
## Values : chr11 chr11_ML143358v1_fix
## Levels(640): chr1 chr2 chr3 ... chrUn_KI270756v1 chrUn_KI270757v1
##
## $`100037417`
## factor-Rle of length 6 with 2 runs
## Lengths: 3 3
## Values : chr22 chr22_KI270879v1_alt
## Levels(640): chr1 chr2 chr3 ... chrUn_KI270756v1 chrUn_KI270757v1
##
## $`100049076`
## factor-Rle of length 21 with 3 runs
## Lengths: 4 14 3
## Values : chr5 chr5_KI270897v1_alt chr5_KV575243v1_alt
## Levels(640): chr1 chr2 chr3 ... chrUn_KI270756v1 chrUn_KI270757v1
##
## ...
## <1657 more elements>
##
std <- paste0("chr", c(1:22, "X", "Y", "M"))
unames <- unique(seqnames(exByGn[ trans > 1 ]))
transstd <- all(unames %in% std)
unames[transstd]
## FactorList of length 26
## [["100128260"]] chrX chrY
## [["100359394"]] chrX chrY
## [["100500894"]] chrX chrY
## [["101928092"]] chrX chrY
## [["102464837"]] chrX chrY
## [["105373105"]] chrX chrY
## [["1438"]] chrX chrY
## [["207063"]] chrX chrY
## [["286530"]] chrX chrY
## [["293"]] chrX chrY
## ...
## <16 more elements>
The previous exercise indicated that gene "22947"
has exons on
both chromosomes 4 and 10. Find out more about this gene using the
org.Hs.eg.db package and by searching for the gene symbol on
the NCBI web site.
egid <- "22947"
AnnotationDbi::select(
org.Hs.eg.db, egid, c("SYMBOL", "GENENAME"), "ENTREZID"
)
## 'select()' returned 1:1 mapping between keys and columns
## ENTREZID SYMBOL GENENAME
## 1 22947 DUX4L1 double homeobox 4 like 1 (pseudogene)
url <- paste0("https://www.ncbi.nlm.nih.gov/gene/", egid)
browseURL(url)
Note that the TxDb.*
packages also support keytypes()
,
columns()
, and select()
for mapping between exon, cds,
transcript, and gene identifiers.
Exercise 5: BSgenome.*
packages
Install (if necessary) and load the BSgenome.Hsapiens.UCSC.hg38 package, containing the entire sequence of the hg38 build of Homo sapiens. Check out it’s contents, and create a simple alias.
library(BSgenome.Hsapiens.UCSC.hg38)
BSgenome.Hsapiens.UCSC.hg38
## Human genome:
## # organism: Homo sapiens (Human)
## # genome: hg38
## # provider: UCSC
## # release date: Feb 2019
## # 640 sequences:
## # chr1 chr2 chr3
## # chr4 chr5 chr6
## # chr7 chr8 chr9
## # chr10 chr11 chr12
## # chr13 chr14 chr15
## # ... ... ...
## # chr19_KV575254v1_alt chr19_KV575255v1_alt chr19_KV575256v1_alt
## # chr19_KV575257v1_alt chr19_KV575258v1_alt chr19_KV575259v1_alt
## # chr19_KV575260v1_alt chr22_KN196485v1_alt chr22_KN196486v1_alt
## # chr22_KQ458387v1_alt chr22_KQ458388v1_alt chr22_KQ759761v1_alt
## # chrX_KV766199v1_alt
## # (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
## # to access a given sequence)
hg38 <- BSgenome.Hsapiens.UCSC.hg38
Genomic sequence can be retrieved by chromosome, e.g.,
hg38[["chr1"]]
, or by genomic range, e.g., getSeq(hg38, GRanges("chr1:1000000-2000000"))
. Retrieve your favorite chunk(s)
of DNA and calculate GC content.
dna <- getSeq(hg38, GRanges("chr1:1000000-2000000"))
letterFrequency(dna, "GC", as.prob=TRUE)
## G|C
## [1,] 0.5728534
Use the org.*
, TxDb.*
, and BSgenome.*
packages to retrieve
the BRCA1 exon DNA sequence.
brca1_egid <- mapIds(org.Hs.eg.db, "BRCA1", "ENTREZID", "SYMBOL")
## 'select()' returned 1:1 mapping between keys and columns
brca1_exons <- exonsBy(txdb, "gene")[[brca1_egid]]
getSeq(hg38, brca1_exons)
## DNAStringSet object of length 83:
## width seq
## [1] 1508 CAATTGGGCAGATGTGTGAGGCACCTGTGGTGA...CACTGCAAATAAACTTGGTAGCAAACACTTCCA
## [2] 998 CAATTGGGCAGATGTGTGAGGCACCTGTGGTGA...TGAGACTGTGGCTCAAAAAAAAAAAAAAAAAAA
## [3] 717 CAATTGGGCAGATGTGTGAGGCACCTGTGGTGA...AAAAGGAAACTTGAAACCTGGGCATGGTGGCTC
## [4] 240 CAATTGGGCAGATGTGTGAGGCACCTGTGGTGA...TCACTCTTCAGTCCTTCTACTGTCCTGGCTACT
## [5] 174 CAATTGGGCAGATGTGTGAGGCACCTGTGGTGA...GTACAGAGCCACAGGACCCCAAGAATGAGCTTA
## ... ... ...
## [79] 126 ACAGATAAATTAAAACTGCGACTGCGCGGCGTG...CTGCGCTCAGGAGGCCTTCACCCTCTGCTCTGG
## [80] 174 TTAGCGGTAGCCCCTTGGTTTCCGTGGCAACGG...CTGCGCTCAGGAGGCCTTCACCCTCTGCTCTGG
## [81] 175 CTTAGCGGTAGCCCCTTGGTTTCCGTGGCAACG...CTGCGCTCAGGAGGCCTTCACCCTCTGCTCTGG
## [82] 207 GTACCTTGATTTCGTATTCTGAGAGGCTGCTGC...CTGCGCTCAGGAGGCCTTCACCCTCTGCTCTGG
## [83] 120 AAAGCGTGGGAATTACAGATAAATTAAAACTGT...GGCCGCGTTGGGGTGAGACCCTCACTTCATCCG
Exercise 6
This exercise uses annotation resources to go from a gene symbol
‘BRCA1’ through to the genomic coordinates of each transcript
associated with the gene, and finally to the DNA sequences of the
transcripts. This can be achieved using an EnsDb
package along with
a BSgenome package, or with a combination of TxDb
, Homo.sapiens
and BSgenome packages. We will focus here on the former approach.
Use AnnotationHub to discover and retrieve a current Ensembl annotation (‘EnsDb’) for Homo sapiens.
Use the cdsBy()
function to retrieve the genomic coordinates of all coding
sequences for the gene ‘BRCA1’ from the EnsDb.Hsapiens.v86 package. To
retrieve only data for the specified gene, submit either a GenenameFilter
or a filter formula/expression to the function’s filter
parameter. This
avoids to extract the coding region for all genes, which takes a long time.
Visualize the transcripts in genomic coordinates using the Gviz
package to construct a GeneRegionTrack
, and plotting it using
plotTracks()
.
Use the Bsgenome.Hsapiens.UCSC.hg38 package and
extractTranscriptSeqs()
function to extract the DNA sequence of
each transcript.
Solution
Retrieve the coding sequences grouped by transcript for the gene of interest and verify that each coding sequence is a multiple of 3.
library(EnsDb.Hsapiens.v86)
edb <- EnsDb.Hsapiens.v86
brca1cds <- cdsBy(edb, by = "tx", filter = ~ genename == "BRCA1")
class(brca1cds)
## [1] "CompressedGRangesList"
## attr(,"package")
## [1] "GenomicRanges"
length(brca1cds)
## [1] 29
brca1cds[[1]] # exons in cds
## GRanges object with 21 ranges and 3 metadata columns:
## seqnames ranges strand | gene_name exon_id
## <Rle> <IRanges> <Rle> | <character> <character>
## [1] 17 43124017-43124096 - | BRCA1 ENSE00003559512
## [2] 17 43115726-43115779 - | BRCA1 ENSE00003510592
## [3] 17 43106456-43106533 - | BRCA1 ENSE00003541068
## [4] 17 43104868-43104956 - | BRCA1 ENSE00003531836
## [5] 17 43104122-43104261 - | BRCA1 ENSE00003513709
## ... ... ... ... . ... ...
## [17] 17 43057052-43057135 - | BRCA1 ENSE00003458468
## [18] 17 43051063-43051117 - | BRCA1 ENSE00003477922
## [19] 17 43049121-43049194 - | BRCA1 ENSE00003628864
## [20] 17 43047643-43047703 - | BRCA1 ENSE00003687053
## [21] 17 43045678-43045802 - | BRCA1 ENSE00001814242
## exon_rank
## <integer>
## [1] 2
## [2] 3
## [3] 4
## [4] 5
## [5] 6
## ... ...
## [17] 18
## [18] 19
## [19] 20
## [20] 21
## [21] 22
## -------
## seqinfo: 1 sequence from GRCh38 genome
cdswidth <- width(brca1cds) # width of each exon
all((sum(cdswidth) %% 3) == 0) # sum within cds, modulus 3
## [1] FALSE
The CDS for some transcripts is not of the expected length, how come? Get the transcript ID of the first transcript that does have a CDS of the wrong size and look this transcript up in the Ensembl genome browser (http://www.ensembl.org).
tx_cds_fail <- names(brca1cds)[(sum(cdswidth) %% 3) != 0]
length(tx_cds_fail)
## [1] 10
tx_cds_fail[1]
## [1] "ENST00000412061"
In the description of the transcript it says CDS 5’ incomplete. Thus, in addition to known protein coding transcripts, Ensembl provides annotations for transcripts known to be targeted for nonsense mediated mRNA decay or that have incomplete CDS. Such transcripts would however not be listed in e.g. the TxDb.Hsapiens.UCSC.hg38.knownGene package.
Next we visualize the BRCA1 transcripts using Gviz (this package has an
excellent vignette, vignette("Gviz")
)
library(Gviz)
## Use the function from the ensembldb package to extract the data in the
## format suitable for Gviz
grt <- getGeneRegionTrackForGviz(edb, filter = ~genename == "BRCA1")
plotTracks(list(GenomeAxisTrack(), GeneRegionTrack(grt)))
Extract the coding sequences of each transcript. EnsDb
databases provide
annotations from Ensembl and use hence Ensembl style chromosome names (such as
“Y”) while the BSgenome
package is based on UCSC annotations that use a naming
style that prepends a “chr” to each chromosome name (e.g. “chrY”). Change thus
the seqlevelsStyle
from the default UCSC chromosome naming to Ensembl naming
style.
library(BSgenome.Hsapiens.UCSC.hg38)
genome <- BSgenome.Hsapiens.UCSC.hg38
## Change the seqlevelsStyle from UCSC to Ensembl
seqlevelsStyle(genome) <- "Ensembl"
tx_seq <- extractTranscriptSeqs(genome, brca1cds)
tx_seq
## DNAStringSet object of length 29:
## width seq names
## [1] 2166 ATGGATTTATCTGCTCTTCGCGT...GATCCCCCACAGCCACTACTGA ENST00000352993
## [2] 4200 ATGGATTTATCTGCTCTTCGCGT...ACAAATTTCCAAGTATAGTTAA ENST00000354071
## [3] 5592 ATGGATTTATCTGCTCTTCGCGT...GATCCCCCACAGCCACTACTGA ENST00000357654
## [4] 1312 GTTTGGATTCTGCAAAAAAGGCT...GTGAAGAGATAAAGAAAAAAAA ENST00000412061
## [5] 192 ATGGATTTATCTGCTCTTCGCGT...GCCTTCACAGTGTCCTTTATGA ENST00000461221
## ... ... ...
## [25] 1065 ATGCACAGTTGCTCTGGGAGTCT...GATCCCCCACAGCCACTACTGA ENST00000591534
## [26] 291 ATGAGTGACAGCAAGAAAACCTG...GATCCCCCACAGCCACTACTGA ENST00000591849
## [27] 71 ATGGATTTATCTGCTCTTCGCGT...CTATGCAGAAAATCTTAGAGTG ENST00000618469
## [28] 2395 ATGGATTTATCTGCTCTTCGCGT...TTGGGACATGAAGTTAACCACA ENST00000634433
## [29] 5592 ATGGATTTATCTGCTCTTCGCGT...GATCCCCCACAGCCACTACTGA LRG_292t1
We can also inspect the CDS sequence for the transcripts with incomplete CDS. Many of them do not start with a start codon hence indicating that the CDS is incomplete on their 5’ end.
tx_seq[tx_cds_fail]
## DNAStringSet object of length 10:
## width seq names
## [1] 1312 GTTTGGATTCTGCAAAAAAGGCT...GTGAAGAGATAAAGAAAAAAAA ENST00000412061
## [2] 958 TTCAGCTTGACACAGGTTTGGAG...TTCACTCCAAATCAGTAGAGAG ENST00000473961
## [3] 667 ATGGATTTATCTGCTCTTCGCGT...AGTTTGGATTCTGCAAAAAAGG ENST00000476777
## [4] 1867 ATGGATTTATCTGCTCTTCGCGT...GATAGTTGTTCTAGCAGTGAAG ENST00000477152
## [5] 1870 ATGGATTTATCTGCTCTTCGCGT...CAGTCTATTAAAGAAAGAAAAA ENST00000478531
## [6] 1495 GAGCTATTGAAAATCATTTGTGC...CAGTCTATTAAAGAAAGAAAAA ENST00000484087
## [7] 800 GAGCTATTGAAAATCATTTGTGC...ATAAAGAACCAGGAGTGGAAAG ENST00000487825
## [8] 296 ATGGATTTATCTGCTCTTCGCGT...TAAAAGATGAAGTTTCTATCAT ENST00000489037
## [9] 71 ATGGATTTATCTGCTCTTCGCGT...CTATGCAGAAAATCTTAGAGTG ENST00000618469
## [10] 2395 ATGGATTTATCTGCTCTTCGCGT...TTGGGACATGAAGTTAACCACA ENST00000634433
Intron coordinates can be identified by first calculating the range of the genome (from the start of the first exon to the end of the last exon) covered by each transcript, and then taking the (algebraic) set difference between this and the genomic coordinates covered by each exon
introns <- psetdiff(unlist(range(brca1cds)), brca1cds)
Retrieve the intronic sequences with getSeq()
(these are not
assembled, the way that extractTranscriptSeqs()
assembles exon
sequences into mature transcripts); note that introns start and end
with the appropriate acceptor and donor site sequences.
Unfortunately, UCSC and Ensembl do also use different names for the genome
assembly. Change the genome name for the introns
object to matche the one from
the genome
object.
unique(genome(genome))
## [1] "hg38"
genome(introns)
## 17
## "GRCh38"
## Change the genome name on introns to match the one from the
## BSgenome package
genome(introns) <- c(`17` = unique(genome(genome)))
seq <- getSeq(genome, introns)
names(seq)
## [1] "ENST00000352993" "ENST00000354071" "ENST00000357654" "ENST00000412061"
## [5] "ENST00000461221" "ENST00000461574" "ENST00000461798" "ENST00000468300"
## [9] "ENST00000470026" "ENST00000471181" "ENST00000473961" "ENST00000476777"
## [13] "ENST00000477152" "ENST00000478531" "ENST00000484087" "ENST00000487825"
## [17] "ENST00000489037" "ENST00000491747" "ENST00000492859" "ENST00000493795"
## [21] "ENST00000493919" "ENST00000494123" "ENST00000497488" "ENST00000586385"
## [25] "ENST00000591534" "ENST00000591849" "ENST00000618469" "ENST00000634433"
## [29] "LRG_292t1"
seq[["ENST00000352993"]] # 20 introns
## DNAStringSet object of length 20:
## width seq
## [1] 1840 GTAAGGTGCCTGCATGTACCTGTGCTATATGGG...ACACTAATCTCTGCTTGTGTTCTCTGTCTCCAG
## [2] 1417 GTAAGTATTGGGTGCCCTGTCAGAGAGGGAGGA...ACTTTGAATGCTCTTTCCTTCCTGGGGATCCAG
## [3] 1868 GTAAGAGCCTGGGAGAACCCCAGAGTTCCAGCA...TGCAGTGATTTTACATCTAAATGTCCATTTTAG
## [4] 5934 GTAAAGCTCCCTCCCTCAAGTTGACAAAAATCT...CCCCTGTCCCTCTCTCTTCCTCTCTTCTTCCAG
## [5] 6197 GTAAGTACTTGATGTTACAAACTAACCAGAGAT...TTATCCTGATGGGTTGTGTTTGGTTTCTTTCAG
## ... ... ...
## [16] 4241 GTAAAACCATTTGTTTTCTTCTTCTTCTTCTTC...AATTGCTTGACTGTTCTTTACCATACTGTTTAG
## [17] 606 GTAAGTGTTGAATATCCCAAGAATGACACTCAA...TGCAAACATAATGTTTTCCCTTGTATTTTACAG
## [18] 1499 GTATATAATTTGGTAATGATGCTAGGTTGGAAG...GCTGAGTGTGTTTCTCAAACAATTTAATTTCAG
## [19] 9192 GTAAGTTTGAATGTGTTATGTGGCTCCATTATT...TTAAATTGTTCTTTCTTTCTTTATAATTTATAG
## [20] 8237 GTAAGTCAGCACAAGAGTGTATTAATTTGGGAT...TTATTTTCTTTTTCTCCCCCCCTACCCTGCTAG
Exercise 7
Internet access required for this exercise
Here we use rtracklayer to retrieve estrogen receptor binding sites identified across cell lines in the ENCODE project. We focus on binding sites in the vicinity of a particularly interesting region.
GRanges
instance with
appropriate genomic coordinates. Our region corresponds to 10Mb up-
and down-stream of a particular gene.Solution
Define the region of interest
library(GenomicRanges)
roi <- GRanges("chr10", IRanges(92106877, 112106876, names="ENSG00000099194"))
Create a session
library(rtracklayer)
session <- browserSession()
Query the UCSC for a particular track, table, and transcription factor, in our region of interest
trackName <- "wgEncodeRegTfbsClusteredV2"
tableName <- "wgEncodeRegTfbsClusteredV2"
trFactor <- "ERalpha_a"
ucscTable <- getTable(ucscTableQuery(session, track=trackName,
range=roi, table=tableName, name=trFactor))
Visualize the result
plot(score ~ chromStart, ucscTable, pch="+")
abline(v=start(roi) + (end(roi) - start(roi) + 1) / 2, col="blue")
AnnotationHub is a data base of large-scale whole-genome resources, e.g., regulatory elements from the Roadmap Epigenomics project, Ensembl GTF and FASTA files for model and other organisms, and the NHLBI grasp2db data base of GWAS results. There are many interesting ways in which these resources can be used. Examples include
Unfortunately, AnnotationHub makes extensive use of internet resources and so we will not pursue it in this course; see the vignettes that come with the pacakge, for instance AnnotationHub HOW-TOs.
Bioconductor provides facilities for reading VCF files. These work very well with the annotation resources described above, so for instance it is straight-forward to identify variants in coding or other regions of interest.
To develop a sense of the capabilities available, work through the VariantAnnotation vignette ‘Introduction to Variant Annotation’, and the VariantFiltering vignette.
sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur/Monterey 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] KEGGREST_1.36.2
## [2] forcats_0.5.1
## [3] stringr_1.4.0
## [4] dplyr_1.0.9
## [5] purrr_0.3.4
## [6] readr_2.1.2
## [7] tidyr_1.2.0
## [8] tibble_3.1.7
## [9] ggplot2_3.3.6
## [10] tidyverse_1.3.1
## [11] AnnotationHub_3.4.0
## [12] BiocFileCache_2.4.0
## [13] dbplyr_2.2.0
## [14] Gviz_1.40.1
## [15] biomaRt_2.52.0
## [16] BSgenome.Hsapiens.UCSC.hg38_1.4.4
## [17] BSgenome_1.64.0
## [18] rtracklayer_1.56.0
## [19] Biostrings_2.64.0
## [20] XVector_0.36.0
## [21] EnsDb.Hsapiens.v86_2.99.0
## [22] ensembldb_2.20.2
## [23] AnnotationFilter_1.20.0
## [24] TxDb.Hsapiens.UCSC.hg38.knownGene_3.15.0
## [25] GenomicFeatures_1.48.3
## [26] GenomicRanges_1.48.0
## [27] GenomeInfoDb_1.32.2
## [28] org.Hs.eg.db_3.15.0
## [29] AnnotationDbi_1.58.0
## [30] IRanges_2.30.0
## [31] S4Vectors_0.34.0
## [32] Biobase_2.56.0
## [33] BiocGenerics_0.42.0
## [34] BiocStyle_2.24.0
##
## loaded via a namespace (and not attached):
## [1] readxl_1.4.0 backports_1.4.1
## [3] Hmisc_4.7-0 lazyeval_0.2.2
## [5] splines_4.2.0 BiocParallel_1.30.3
## [7] digest_0.6.29 htmltools_0.5.2
## [9] magick_2.7.3 fansi_1.0.3
## [11] magrittr_2.0.3 checkmate_2.1.0
## [13] memoise_2.0.1 cluster_2.1.3
## [15] tzdb_0.3.0 modelr_0.1.8
## [17] matrixStats_0.62.0 prettyunits_1.1.1
## [19] jpeg_0.1-9 colorspace_2.0-3
## [21] rvest_1.0.2 blob_1.2.3
## [23] rappdirs_0.3.3 haven_2.5.0
## [25] xfun_0.31 crayon_1.5.1
## [27] RCurl_1.98-1.7 jsonlite_1.8.0
## [29] survival_3.3-1 VariantAnnotation_1.42.1
## [31] glue_1.6.2 gtable_0.3.0
## [33] zlibbioc_1.42.0 DelayedArray_0.22.0
## [35] scales_1.2.0 DBI_1.1.2
## [37] Rcpp_1.0.8.3 xtable_1.8-4
## [39] progress_1.2.2 htmlTable_2.4.0
## [41] foreign_0.8-82 bit_4.0.4
## [43] Formula_1.2-4 htmlwidgets_1.5.4
## [45] httr_1.4.3 RColorBrewer_1.1-3
## [47] ellipsis_0.3.2 farver_2.1.0
## [49] pkgconfig_2.0.3 XML_3.99-0.10
## [51] nnet_7.3-17 sass_0.4.1
## [53] utf8_1.2.2 labeling_0.4.2
## [55] tidyselect_1.1.2 rlang_1.0.2
## [57] later_1.3.0 cellranger_1.1.0
## [59] munsell_0.5.0 BiocVersion_3.15.2
## [61] tools_4.2.0 cachem_1.0.6
## [63] cli_3.3.0 generics_0.1.2
## [65] RSQLite_2.2.14 broom_0.8.0
## [67] evaluate_0.15 fastmap_1.1.0
## [69] yaml_2.3.5 fs_1.5.2
## [71] knitr_1.39 bit64_4.0.5
## [73] mime_0.12 xml2_1.3.3
## [75] compiler_4.2.0 rstudioapi_0.13
## [77] filelock_1.0.2 curl_4.3.2
## [79] png_0.1-7 interactiveDisplayBase_1.34.0
## [81] reprex_2.0.1 bslib_0.3.1
## [83] stringi_1.7.6 highr_0.9
## [85] lattice_0.20-45 ProtGenerics_1.28.0
## [87] Matrix_1.4-1 vctrs_0.4.1
## [89] pillar_1.7.0 lifecycle_1.0.1
## [91] BiocManager_1.30.18 jquerylib_0.1.4
## [93] data.table_1.14.2 bitops_1.0-7
## [95] httpuv_1.6.5 R6_2.5.1
## [97] BiocIO_1.6.0 latticeExtra_0.6-29
## [99] bookdown_0.27 promises_1.2.0.1
## [101] gridExtra_2.3 codetools_0.2-18
## [103] dichromat_2.0-0.1 assertthat_0.2.1
## [105] SummarizedExperiment_1.26.1 rjson_0.2.21
## [107] withr_2.5.0 GenomicAlignments_1.32.0
## [109] Rsamtools_2.12.0 GenomeInfoDbData_1.2.8
## [111] parallel_4.2.0 hms_1.1.1
## [113] rpart_4.1.16 rmarkdown_2.14
## [115] MatrixGenerics_1.8.0 biovizBase_1.44.0
## [117] shiny_1.7.1 lubridate_1.8.0
## [119] base64enc_0.1-3 restfulr_0.0.14
Research reported in this tutorial was supported by the National Human Genome Research Institute and the National Cancer Institute of the National Institutes of Health under award numbers U24HG004059 (Bioconductor), U24HG010263 (AnVIL) and U24CA180996 (ITCR).