The identification of the transcription factor (TF) responsible for the corregulation of an specific set of genes is a common problem in transcriptomics. In the most simple scenario, the comparison of the transcriptome of cells or organisms in two conditions leads to the identification of a set of differentially expressed (DE) genes and the underlying assumption is that one or a few TFs regulate the expression of those genes. Traditionally, the identification of the relevant TFs has relied on the use of position weight matrices (PWMs) to predict transcription factor binding sites (TFBSs) proximal to the DE genes1. The comparison of predicted TFBS in DE versus control genes reveals factors that are significantly enriched in the DE gene set. These approaches have been useful to narrow down potential binding sites, but can suffer from high rates of false positives. In addition, this strategy is limited by design to sequence-specific TF and thus unable to identify cofactors that bind indirectly to target genes.
To overcome these limitations, TFEA.ChIP leverages the vast collection of publicly available ChIP-seq datasets to determine TFBSs linked to a given set of genes and performs enrichment analysis based on this experimentally derived, high-quality information. Specifically, TFEA.ChIP utilizes data from the ReMap 2022 repository, which contains hundreds of manually curated and uniformly processed DNA-binding experiments 2.
A critical step in this analysis is linking TF binding sites from ChIP-seq experiments to their target genes. The previous version of TFEA.ChIP 3 relied on GeneHancer 4, an integrative resource that connects genes to regulatory elements by aggregating data from ENCODE, FANTOM5, eQTLs, and chromatin interaction datasets. The current version of TFEA.ChIP includes a TF-gene interaction database constructed by integrating binding sites from the ReMap 2022 repository with enhancer-gene pairs generated by ENCODE-rE2G 5. This database, which contains 8,103 datasets from ChIP-seq experiments covering 1,193 human transcriptional regulators, demonstrated the best performance in internal benchmarking analyses and is therefore the default in this version of the tool. Users can also choose alternative TF–gene interaction databases, including those used in previous versions or custom-defined datasets. Additional databases are available through the accompanying Bioconductor data package ExperimentHub (to be released), and more information can be found at: https://github.com/LauraPS1/TFEA.ChIP_downloads and https://github.com/yberda/ChIPDBData.
TFEA.ChIP implements two enrichment analysis methods:
Although primarily designed for analyzing expression data from human cells, TFEA.ChIP includes an option to incorporate datasets from mouse experiments by translating mouse gene names to their human orthologs. Internal analyses indicate that this approach outperforms TF-gene databases derived from mouse-specific enhancer-gene associations.
TFEA.ChIP is designed to take the output of a differential expression analysis and identify TFBS enriched in the list of DE genes. In the case of the Overrepresentation Analysis, the only required input is a set of DE genes and, optionally, a set of control genes whose expression is not altered by the experimental conditions under study. For the GSEA analysis a ranked list of genes is required. This is supplied as a dataframe containing a column with gene names and a numerical column with the ranking metric, which typically are log2 fold change (LFC) or p-values for the gene expression changes in the two conditions under evaluation. To enhance integration with the Bioconductor ecosystem, the current version of the package natively supports output objects from widely used differential expression analysis tools such as edgeR and DESeq2.
For illustration purposes we will derive the input required for both analysis from a table containing the following field columns:
The output of popular packages, such as DESeq2, for detection of differentially expressed genes from the analysis of count data from RNA-seq experiments produce tables with this information, and the output table can be taken by the tool. The hypoxia_DESeq and hypoxia datasets are the output of a differential expression analysis performed on an RNAseq experiment analyzing the response to hypoxia of endothelial cells8 deposited at the NCBI’s GEO repository (GSE89831).
We begin by loading the input data table into the R environment. The input is expected to be in a format compatible with results generated by the DESeq2 package:
library(TFEA.ChIP)
library(dplyr)
data( "hypoxia_DESeq", "hypoxia", package="TFEA.ChIP" ) # Load example datasets
hypoxia_df <- preprocessInputData( hypoxia_DESeq )
#> Warning: Some genes returned 1:many mapping to ENTREZ ID.
# Display the first few rows
head(hypoxia_df)
#> Symbol Genes log2FoldChange pvalue pval.adj
#> 1 DNAH8 1769 4.588064 3.942129e-62 3.285202e-59
#> 2 STC1 6781 4.339768 7.840000e-11 2.714222e-09
#> 3 SYTL2 54843 4.307058 1.435621e-75 1.522672e-72
#> 4 IGFBP3 3486 4.297902 6.595438e-18 5.535897e-16
#> 5 AK4 205 4.235590 1.318311e-125 5.126912e-122
#> 6 PTGIS 5740 3.931662 4.398104e-15 2.631420e-13Alternatively, the input can be a simple data frame containing columns for gene identifiers (such as Entrez IDs, Ensembl IDs, or gene symbols), LFC and p-values or adjusted p-values:
# Preview the raw hypoxia dataset
head(hypoxia)
#> Genes log2FoldChange pvalue pval.adj
#> 1 TNMD 0.00000000 1.000000000 1.00000000
#> 2 DPM1 -0.44497788 0.001243905 0.01809033
#> 3 SCYL3 -0.27092276 0.139977254 0.65057456
#> 4 C1orf112 -0.52382569 0.035752231 0.25407939
#> 5 FGR -0.44810165 0.217220835 0.83961811
#> 6 CFH 0.05237749 0.740720152 1.00000000
hypoxia <- preprocessInputData( hypoxia )
#> Warning: Some genes returned 1:many mapping to ENTREZ ID.
# Display the first few rows
head(hypoxia)
#> Genes log2FoldChange pvalue pval.adj Symbol
#> 10785 1365 5.319709 9.647797e-77 2.817857e-73 CLDN3
#> 10138 284525 5.179949 2.411485e-64 5.414910e-61 SLC9C2
#> 9362 392255 4.401819 3.559414e-91 1.998137e-87 GDF6
#> 3958 4883 4.208881 1.414157e-39 1.082537e-36 NPR3
#> 10320 2199 4.060889 1.305316e-46 1.465522e-43 FBLN2
#> 5256 1769 4.024703 8.418844e-65 2.025454e-61 DNAH8After processing the data with the preprocessInputData()
function, the resulting dataset will be ready for downstream analysis
with the TFEA.ChIP package. This function converts gene identifiers to
Entrez Gene IDs and sorts the table by LFC for consistency.
Note: Using the option
mode = m2hwill convert mouse gene IDs to their equivalent human gene ID, thus taking advantage of the wider availability of ChIP-seq experiments done on human cells. This strategy relies on the overlap between human and mouse transcription regulatory mechanisms. Nevertheless, we advise to be cautious using this approach, since extrapolating results from one organism to another is not always appropriate.
By default, TFEA.ChIP uses a reduced internal reference database
restricted to ChIP-seq experiments performed in ENCODE tier 1, 2, and
2.5 cell types. This streamlined database is intentionally
minimal to accommodate users with limited computing resources and may
not fully capture the diversity of available TF–gene
associations. To perform enrichment analysis using the complete
ChIP-seq database, load the curated dataset from
ExperimentHub and assign it to an object named
ChIPDB to override the default:
The database shown here is derived from ENCODE’s rE2G predictive
models (“rE2G300d”, see ChIPDBData GitHub repository for details).
However, additional ready-to-use databases based on rE2G, GeneHancer,
and CREdb are also available. To use one of these alternative databases,
select a different entry in the eh object:
As mentioned previously, this analysis requires two gene sets: one consisting of differentially expressed genes (e.g., upregulated), and another comprising control genes whose expression remains unchanged under the experimental conditions.
The Select_genes() function facilitates this selection.
For example, upregulated genes can be defined as those with a LFC
greater than 1 and an adjusted p-value (adjP) less than 0.05. Control
genes can be selected based on an LFC between -0.25 and 0.25 and an adjP
between 0.5 and 1:
In situations where preprocessing is not applicable, such as when
using an external or manually curated gene list, gene identifiers must
be standardized to Entrez Gene IDs for compatibility with the package’s
downstream functions. The GeneID2entrez() function supports
this conversion:
In this step, a contingency table is built for each TF represented in the ChIP-seq regulatory map. Each TF-specific contingency table summarizes the overlap between TF binding and differential expression:
| TFbound_yes | TFbound_no | |
|---|---|---|
| DE_yes | number y/y | number y/n |
| DE_no | number n/y | number n/n |
Fisher’s exact test is then applied to each contingency table to evaluate the null hypothesis that TF binding and differential expression are independent events. In addition to raw p-values, the function returns false discovery rate (FDR)-adjusted p-values to account for multiple testing, as well as the odds ratio (OR), which quantifies the strength of the association between the DE genes and the TF based on ChIP-seq data.
CM_list_UP <- contingency_matrix( Genes.Upreg, Genes.Control ) # Generates list of contingency tables, one per dataset
pval_mat_UP <- getCMstats( CM_list_UP ) # Generates list of p-values and OR from association test
head( pval_mat_UP )
#> Accession Cell Treatment TF p.value OR
#> 873 GSE89836.EPAS1.HUVEC-C HUVEC-C EPAS1 2.871795e-06 63.245126
#> 345 ENCSR091BOQ.SUZ12.GM12878 GM12878 SUZ12 8.532670e-32 5.686499
#> 738 GSE109625.EZH2.HUVEC-C HUVEC-C EZH2 4.598148e-24 7.756261
#> 612 ENCSR784VUY.RNF2.WA01 WA01 RNF2 4.912230e-23 3.480488
#> 9 ENCSR000ARI.EZH2.Hep-G2 Hep-G2 EZH2 5.260255e-22 5.175982
#> 715 GSE104690.RNF2.WA01 WA01 RNF2 3.281069e-22 3.860270
#> OR.SE log2.OR adj.p.value log10.adj.pVal distance
#> 873 68.4196369 5.982882 6.907226e-06 5.160696 62.45869
#> 345 0.7807292 2.507541 4.389585e-30 29.357577 29.72929
#> 738 1.4182419 2.955361 9.677012e-23 22.014259 23.02769
#> 612 0.4273568 1.799290 7.842629e-22 21.105538 21.25080
#> 9 0.8040923 2.371833 7.380296e-21 20.131926 20.56048
#> 715 0.5055438 1.948702 4.900435e-21 20.309765 20.51019In the example above, all 8,103 TF binding datasets included in the
internal database were used for the enrichment analysis. However, the
analysis can be restricted to a specific subset of TFs. To do this, the
get_chip_index() function can be used to generate an index
referencing the desired subset of datasets. This index can then be
passed as an argument to the contingency_matrix()
function.
chip_index <- get_chip_index( TFfilter = c( "HIF1A","EPAS1","ARNT" ) ) # Restrict the analysis to datasets assaying these factors
CM_list_UPe <- contingency_matrix( Genes.Upreg, chip_index = chip_index ) # Generates list of contingency tables
pval_mat_UPe <- getCMstats( CM_list_UPe, chip_index ) # Generates list of p-values and ORs
head( pval_mat_UPe )
#> Accession Cell Treatment TF p.value
#> 6 GSE89836.EPAS1.HUVEC-C HUVEC-C EPAS1 1.065567e-08
#> 5 GSE89836.ARNT.HUVEC-C HUVEC-C ARNT 1.969789e-14
#> 3 GSE39089.HIF1A.HUVEC-C_HYPOX HUVEC-C_HYPOX HYPOX HIF1A 9.068584e-24
#> 1 ENCSR029IBC.ARNT.Hep-G2 Hep-G2 ARNT 7.685171e-03
#> 2 ENCSR590KEQ.ARNT.GM12878 GM12878 ARNT 7.014308e-05
#> 4 GSE39089.HIF1A.HUVEC-C_NOMO HUVEC-C_NOMO NOMO HIF1A 5.679383e-02
#> OR OR.SE log2.OR adj.p.value log10.adj.pVal distance
#> 6 54.018066 27.1907542 5.7553701 2.131134e-08 7.671389 53.570193
#> 5 20.345669 5.8382159 4.3466498 5.909367e-14 13.228459 23.436020
#> 3 4.709911 0.6323793 2.2356996 5.441150e-23 22.264309 22.571285
#> 1 5.430327 2.8007619 2.4410392 9.222205e-03 2.035165 4.875418
#> 2 1.751547 0.2344372 0.8086295 1.052146e-04 3.977924 4.048296
#> 4 2.853479 1.4550505 1.5127221 5.679383e-02 1.245699 2.233193Note: When a complete ranked gene list is provided as input, the current version of the package can automatically identify expressed TFs and limit the analysis to that subset. Details on this process are explained later in the document. Additionally, specifying a control gene set is optional: if omitted, the function defaults to using all annotated human genes not included in the test set as controls.
The TFEA.ChIP package includes a metadata table that provides detailed information about each ChIP-seq dataset in the internal database. This includes experimental conditions, TF targets, and dataset sources. The metadata can be accessed as follows:
data("MetaData", package = "TFEA.ChIP")
head(MetaData)
#> Accession ID
#> 1 ENCSR000AHD.CTCF.MCF-7 ENCSR000AHD
#> 2 ENCSR000AHF.TAF1.MCF-7 ENCSR000AHF
#> 3 ENCSR000AKB.CTCF.GM12878 ENCSR000AKB
#> 4 ENCSR000AKO.CTCF.K-562 ENCSR000AKO
#> 5 ENCSR000ALA.CTCF.endothelial_umbilical-vein ENCSR000ALA
#> 6 ENCSR000ALJ.CTCF.keratinocyte ENCSR000ALJ
#> Cell Treatment TF
#> 1 MCF-7 CTCF
#> 2 MCF-7 TAF1
#> 3 GM12878 CTCF
#> 4 K-562 CTCF
#> 5 endothelial_umbilical-vein umbilical-vein CTCF
#> 6 keratinocyte CTCFThe TFEA.ChIP package provides a function that integrates all key
analysis steps into a single command, simplifying the workflow. This
streamlined function is particularly helpful when working with complete
gene tables, as it can automatically detect TFs that are
expressed in the input dataset and restrict the analysis
accordingly. By default, this filtering step is enabled to
avoid testing TFs that are unlikely to be relevant in the given
biological context, thereby reducing the multiple testing burden and
increasing statistical power. If desired, this behavior can be disabled
by setting the expressed argument to
FALSE:
pval_mat_UPe <- analysis_from_table(hypoxia_df,
method = 'ora', # Overrepresentation Analysis
interest_min_LFC = 1, # min LFC
expressed = TRUE, # only take expressed TFs
control_min_pval = 0.5, control_max_pval = 1, # control pval limits
control_min_LFC = -0.25, control_max_LFC = 0.25) # control LFC limitsIn the analyses described above, each ChIP-seq dataset is treated
independently. This approach allows for the identification of specific
cell types or experimental conditions where a TF may be active,
facilitating biological interpretation in relation to the study samples.
Alternatively, the package supports the integration of all ChIP-seq
datasets associated with a given TF, irrespective of experimental
context. This strategy helps to highlight TFs that show consistent
enrichment across conditions. The rankTFs() function
enables this summarization by applying either a Wilcoxon rank-sum test
or a GSEA approach to assess whether the ChIP-seq profiles corresponding
to the same TF are collectively enriched or depleted in the ranked
results.
Important consideration: For TFs whose activity is highly context-dependent, aggregating ChIP-seq data across diverse experimental conditions may obscure biologically meaningful enrichments. Additionally, many public ChIP-seq datasets are derived from baseline or control conditions in which the TF may not be active. While this metadata is informative for interpreting individual experiments, it can confound integrated analyses. Therefore, we recommend using non-integrated (individual) results for more accurate and context-specific interpretation whenever possible.
TF_ranking <- rankTFs( pval_mat_UP, rankMethod = "gsea", makePlot = TRUE )
head( TF_ranking[[ "TF_ranking" ]] )
TF_ranking[[ "TFranking_plot" ]]Meta-analyses can also be used to summarize results across multiple ChIP experiments for the same TF, accounting for the standard error of the OR. This approach gives greater weight to more precise estimates. However, it may also dilute effects that are context-dependent:
library(meta)
TF_ranking2 <- metaanalysis_fx( pval_mat_UP )
# Get HIF results
filter(TF_ranking2$summary,
TF %in% c("ARNT", "HIF1A", "EPAS1"))In this analysis, k denotes the number of ChIP
experiments included. OR represents the weighted average
Odds Ratio, OR.SE is its standard error, and
CI refers to the corresponding confidence interval. The
pooled effect size (OR values) estimated by the meta-analysis integrates
all datasets for a given TF, which may dilute effects that are
context-dependent. However, a forest plot can be generated to visualize
the results of individual experiments alongside the integrated
value:
The table of results generated by getCMstats can be
parsed to select candidate TF. The function plot_CM uses
the package plotly to generate an interactive plot
representing the p-value against the Odds Ratio that is very helpful to
explore the results:
As expected for the hypoxia dataset, visual inspection of the enrichment graph reveals a strong signal for several HIF-related ChIP-seq datasets. These can be highlighted to emphasize their relevance in the current analysis:
HIFs <- c( "EPAS1","HIF1A","ARNT" )
plot_CM( pval_mat_UP, specialTF = HIFs ) # Plot p-values against ORs highlighting indicated TFsThe Gene Set Enrichment Analysis (GSEA) implemented in TFEA.ChIP
requires a ranked list of genes as input. By default, the
preprocessInputData() function sorts genes in descending
order based on their LFC values. However, users can choose to rank genes
using alternative numerical metrics, such as the p-value, adjusted
p-value, or a composite measure like the product of LFC and
–log10(p-value), depending on the specific goals of the analysis.
If a custom gene ranking is used, it is essential to ensure that gene
identifiers are in Entrez Gene ID format. If necessary, the
GeneID2entrez() function can be used to convert gene
symbols or alternative identifiers into the required format.
By default, the analysis will include all the ChIP-Seq experiments
available in the database. However, this analysis might take several
minutes to run. To restrict the analysis to a subset of the database we
can generate an index variable and pass it to the function
GSEA_run(). This will limit the analysis to the ChIP-Seq
datasets of the user’s choosing. This index variable can be generated
using the function get_chip_index() and allows the user to
select the whole database, the set of ChIP-Seq experiments produced by
the ENCODE project (“encode”) or a specific subset of TFs (as a vector
containing the TF names).
The function GSEA_run() will perform a GSEA-based
analysis on the input gene list. This function is based on the R-GSEA R
script bundle written by the GSEA team at the Broad Institute of MIT and
Harvard. The output of the analysis depends on the variable
get.RES:
When FALSE, the function returns a data frame
storing maximum Enrichment Score and associated p-value determined for
each dataset included in the analysis.
When TRUE, the function returns a list of three
elements. The first element (Enrichment.table) is the
enrichment data frame previously mentioned. The second element
(RES) is a list of vectors containing the Running
Enrichment Scores values (see GSEA documentation, http://software.broadinstitute.org/gsea/doc/GSEAUserGuideTEXT.htm#_Interpreting_GSEA_Results)
for each of the sorted genes tested against each one of the analyzed
ChIP datasets. The third element (indicators) is a list of
vectors indicating if the sorted genes were bound by the factor analyzed
in each ChIP dataset.
# run GSEA analysis
GSEA.result <- GSEA_run( hypoxia_df$Genes, hypoxia_df$log2FoldChange, chip_index, get.RES = TRUE) Finally, the package provides a function that integrates all the
above steps into a single command. In the current version of the
package, only TFs present in the input sorted list of genes (i.e., TFs
expressed under the conditions under study) are considered for the
analysis. This default behavior can be overridden by setting the
parameter expressed to FALSE:
TFEA.ChIP provides two functions, plot_ES() and
plot_RES(), to create interactive HTML visualizations of
GSEA results.
plot_ES()The plot_ES() function generates an interactive
enrichment score (ES) plot. It supports highlighting ChIP-seq datasets
associated with specific TFs by displaying them in distinct colors,
allowing users to easily identify TFs of interest within the broader
analysis.
TF.hightlight <- c( "EPAS1","ARNT","HIF1A" )
plot_ES( GSEA.result$result, LFC = GSEA.result$processed_table$log2FoldChange, specialTF = TF.hightlight)plot_RES()This function will plot all the RES stored in the
GSEA_run() output. It is only recommended to restrict
output to specific TF and/or datasets by setting the parameters
TF and/or Accession respectively:
plot_RES(
GSEA_result = GSEA.result$result, LFC = hypoxia_df$log2FoldChange,
Accession = c("GSE89836.ARNT.HUVEC-C", "GSE89836.EPAS1.HUVEC-C" ) )If the user wants to generate their own database of ChIPseq datasets,
the functions txt2gr() and makeChIPGeneDB()
automate most of the process. The required inputs are:
A Metadata table (storing at least, Accession ID, name of the file, and TF tested in the ChIP-Seq experiment). The metadata table included with this package has the following fields: “Name”, “Accession”, “Cell”, “Cell Type”, “Treatment”, “Antibody”, and “TF”.
A folder containing ChIP-Seq peak data, either in “.narrowpeak” format or the MACS output files “_peaks.bed” -a format that stores “chr”, “start”, “end”, “name”, and “Q-value” of every peak-.
Specify the folder where the ChIP-Seq files are stored, create an
array with the names of the ChIP-Seq files, and choose a format. Set a
for loop to convert all your files to GenomicRanges objects
using txt2GR(). Please note that, by default, only peaks
with an associated p-value of 0.05 (for narrow peaks files) or 1e-5 (for
MACS files) will be kept. The user can modify the default values by
setting the alpha argument to the desired threshold p-value.
folder <- "~/peak.files.folder"
File.list<-dir( folder )
format <- "macs"
gr.list <- lapply(
seq_along( File.list ),
function( File.list, myMetaData, format ){
tmp<-read.table( File.list[i], ..., stringsAsFactors = FALSE )
file.metadata <- myMetaData[ myMetaData$Name == File.list[i], ]
ChIP.dataset.gr<-txt2GR(tmp, format, file.metadata)
return(ChIP.dataset.gr)
},
File.list = File.list,
myMetadata = myMetadata,
format = format
)# As an example of the output
data( "ARNT.peaks.bed","ARNT.metadata", package = "TFEA.ChIP" ) # Loading example datasets for this function
ARNT.gr <- txt2GR( ARNT.peaks.bed, "macs1.4", ARNT.metadata )
head( ARNT.gr, n=2 )
#> GRanges object with 2 ranges and 9 metadata columns:
#> seqnames ranges strand | score mcols.Name
#> <Rle> <IRanges> <Rle> | <numeric> <character>
#> [1] chr1 1813434-1814386 * | 4.60909e-06 GSM2390643_ARNT_ChIP..
#> [2] chr1 3456399-3457536 * | 8.82450e-27 GSM2390643_ARNT_ChIP..
#> mcols.Accession mcols.Cell mcols.Cell Type mcols.Treatment
#> <character> <character> <character> <character>
#> [1] GSM2390643 HUVEC Hypoxia
#> [2] GSM2390643 HUVEC Hypoxia
#> mcols.Antibody mcols.TF mcols.Score Type
#> <character> <character> <character>
#> [1] HIF1B ARNT corrected p-Value
#> [2] HIF1B ARNT corrected p-Value
#> -------
#> seqinfo: 32 sequences from an unspecified genome; no seqlengthsThe function makeChIPGeneDB() assigns the TFBS peaks in
the ChIP datasets stored in gr.list() to a gene. To this
end, a ChIP peak overlaping a regulatory region region receive the gene
label associated to said region. By default the function also assigns
the gene name when the ChIP peak does not overlap a regulatory region
but maps at less than 10 nucleotides from it. This behaviour can be
modified by setting the argument distanceMargin to the
desired value (by default distanceMargin = 10 bases).
The resulting ChIP-Gene data base is a list containing two elements: - Gene Keys: vector of gene IDs. - ChIP Targets: list of vectors, one per element in gr.list, containing the putative targets assigned. Each target is coded as its position in the vector ‘Gene Keys’.
data( "DnaseHS_db", "gr.list", package="TFEA.ChIP" ) # Loading example datasets for this function
TF.gene.binding.db <- makeChIPGeneDB( DnaseHS_db, gr.list )
str( TF.gene.binding.db )
#> List of 2
#> $ Gene Keys : chr [1:54] "100507501" "10209" "10277" "10458" ...
#> $ ChIP Targets:List of 1
#> ..$ wgEncodeEH002402: int [1:54] 1 2 3 4 5 6 7 8 9 10 ...The function, accepts any Genomic Range object that includes a metacolumn with a gene ID (stored in the @elementMetadata@listdata [["gene_id"]] slot of the object) for each genomic segment. For example, asignation of peaks to genes can be done by providing a list of all the genes in the genome:
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
#> Loading required package: GenomicFeatures
#> Loading required package: BiocGenerics
#> Loading required package: generics
#>
#> Attaching package: 'generics'
#> The following object is masked from 'package:dplyr':
#>
#> explain
#> The following objects are masked from 'package:base':
#>
#> as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
#> setequal, union
#>
#> Attaching package: 'BiocGenerics'
#> The following object is masked from 'package:dplyr':
#>
#> combine
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#> as.data.frame, basename, cbind, colnames, dirname, do.call,
#> duplicated, eval, evalq, get, grep, grepl, is.unsorted, lapply,
#> mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
#> rank, rbind, rownames, sapply, saveRDS, table, tapply, unique,
#> unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> Loading required package: stats4
#>
#> Attaching package: 'S4Vectors'
#> The following objects are masked from 'package:dplyr':
#>
#> first, rename
#> The following object is masked from 'package:utils':
#>
#> findMatches
#> The following objects are masked from 'package:base':
#>
#> I, expand.grid, unname
#> Loading required package: IRanges
#>
#> Attaching package: 'IRanges'
#> The following objects are masked from 'package:dplyr':
#>
#> collapse, desc, slice
#> Loading required package: Seqinfo
#> Loading required package: GenomicRanges
#> Loading required package: AnnotationDbi
#> Loading required package: Biobase
#> Welcome to Bioconductor
#>
#> Vignettes contain introductory material; view with
#> 'browseVignettes()'. To cite Bioconductor, see
#> 'citation("Biobase")', and for packages 'citation("pkgname")'.
#>
#> Attaching package: 'AnnotationDbi'
#> The following object is masked from 'package:dplyr':
#>
#> select
data( "gr.list", package="TFEA.ChIP") # Loading example datasets for this function
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
Genes <- genes( txdb )
#> 2169 genes were dropped because they have exons located on both strands of
#> the same reference sequence or on more than one reference sequence, so cannot
#> be represented by a single genomic range.
#> Use 'single.strand.genes.only=FALSE' to get all the genes in a GRangesList
#> object, or use suppressMessages() to suppress this message.
TF.gene.binding.db <- makeChIPGeneDB( Genes, gr.list, distanceMargin = 0 )
str( TF.gene.binding.db )
#> List of 2
#> $ Gene Keys : chr [1:35356] "1" "10" "100" "1000" ...
#> $ ChIP Targets:List of 1
#> ..$ wgEncodeEH002402: int [1:41] 1656 3218 3793 4259 7194 8388 8790 9647 11114 12246 ...In this case the information about Dnase hypersensitivity is
disregarded and peaks are assigned to overlapping genes (or genes closer
than distanceMargin residues).
At the beginning of a session, use the function
set_user_data() to use your TFBS binary matrix and metadata
table with the rest of the package.
sessionInfo()
#> R version 4.5.2 (2025-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] TxDb.Hsapiens.UCSC.hg38.knownGene_3.22.0
#> [2] GenomicFeatures_1.62.0
#> [3] AnnotationDbi_1.72.0
#> [4] Biobase_2.70.0
#> [5] GenomicRanges_1.62.0
#> [6] Seqinfo_1.0.0
#> [7] IRanges_2.44.0
#> [8] S4Vectors_0.48.0
#> [9] BiocGenerics_0.56.0
#> [10] generics_0.1.4
#> [11] dplyr_1.1.4
#> [12] TFEA.ChIP_1.30.0
#> [13] BiocStyle_2.38.0
#>
#> loaded via a namespace (and not attached):
#> [1] DBI_1.2.3 bitops_1.0-9
#> [3] httr2_1.2.1 biomaRt_2.66.0
#> [5] rlang_1.1.6 magrittr_2.0.4
#> [7] matrixStats_1.5.0 compiler_4.5.2
#> [9] RSQLite_2.4.4 png_0.1-8
#> [11] vctrs_0.6.5 stringr_1.6.0
#> [13] pkgconfig_2.0.3 crayon_1.5.3
#> [15] fastmap_1.2.0 dbplyr_2.5.1
#> [17] XVector_0.50.0 Rsamtools_2.26.0
#> [19] rmarkdown_2.30 bit_4.6.0
#> [21] xfun_0.54 cachem_1.1.0
#> [23] cigarillo_1.0.0 jsonlite_2.0.0
#> [25] progress_1.2.3 blob_1.2.4
#> [27] DelayedArray_0.36.0 BiocParallel_1.44.0
#> [29] parallel_4.5.2 prettyunits_1.2.0
#> [31] R6_2.6.1 bslib_0.9.0
#> [33] stringi_1.8.7 RColorBrewer_1.1-3
#> [35] rtracklayer_1.70.0 jquerylib_0.1.4
#> [37] Rcpp_1.1.0 SummarizedExperiment_1.40.0
#> [39] knitr_1.50 org.Mm.eg.db_3.22.0
#> [41] R.utils_2.13.0 Matrix_1.7-4
#> [43] tidyselect_1.2.1 abind_1.4-8
#> [45] yaml_2.3.10 codetools_0.2-20
#> [47] curl_7.0.0 lattice_0.22-7
#> [49] tibble_3.3.0 withr_3.0.2
#> [51] KEGGREST_1.50.0 S7_0.2.1
#> [53] evaluate_1.0.5 BiocFileCache_3.0.0
#> [55] ExperimentHub_3.0.0 Biostrings_2.78.0
#> [57] pillar_1.11.1 BiocManager_1.30.27
#> [59] filelock_1.0.3 MatrixGenerics_1.22.0
#> [61] RCurl_1.98-1.17 BiocVersion_3.23.1
#> [63] hms_1.1.4 ggplot2_4.0.1
#> [65] scales_1.4.0 glue_1.8.0
#> [67] maketools_1.3.2 tools_4.5.2
#> [69] AnnotationHub_4.0.0 BiocIO_1.20.0
#> [71] sys_3.4.3 locfit_1.5-9.12
#> [73] GenomicAlignments_1.46.0 buildtools_1.0.0
#> [75] XML_3.99-0.20 grid_4.5.2
#> [77] restfulr_0.0.16 cli_3.6.5
#> [79] rappdirs_0.3.3 S4Arrays_1.10.0
#> [81] gtable_0.3.6 R.methodsS3_1.8.2
#> [83] DESeq2_1.50.2 sass_0.4.10
#> [85] digest_0.6.38 SparseArray_1.10.2
#> [87] org.Hs.eg.db_3.22.0 rjson_0.2.23
#> [89] farver_2.1.2 memoise_2.0.1
#> [91] htmltools_0.5.8.1 R.oo_1.27.1
#> [93] lifecycle_1.0.4 httr_1.4.7
#> [95] bit64_4.6.0-1Wasserman, W.W., & Sandelin, A. (2004). Applied bioinformatics for the identification of regulatory elements. Nature Reviews Genetics, 5, 276–287. https://doi.org/10.1038/nrg1315↩︎
Hammal, F., de Langen, P., Bergon, A., Lopez, F., & Ballester, B. (2022). ReMap 2022: a database of Human, Mouse, Drosophila and Arabidopsis regulatory regions from an integrative analysis of DNA-binding sequencing experiments. Nucleic Acids Research, 50(D1), D316–D325. https://doi.org/10.1093/nar/gkab996↩︎
Puente-Santamaría, L., Wasserman, W.W., & del Peso, L. (2019). TFEA.ChIP: a tool kit for transcription factor binding site enrichment analysis capitalizing on ChIP-seq datasets. Bioinformatics, 35(24), 5339–5340. https://doi.org/10.1093/bioinformatics/btz573↩︎
Fishilevich, S., Nudel, R., Rappaport, N., Hadar, R., Plaschkes, I., Iny Stein, T., Rosen, N., Kohn, A., Twik, M., Safran, M., Lancet, D., & Cohen, D. (2017). GeneHancer: genome-wide integration of enhancers and target genes in GeneCards. Database (Oxford), 2017, bax028. https://doi.org/10.1093/database/bax028↩︎
Gschwind, A.R., et al. (2023). An encyclopedia of enhancer-gene regulatory interactions in the human genome. bioRxiv. https://doi.org/10.1101/2023.01.01.123456↩︎
Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L., Gillette, M.A., Paulovich, A., Pomeroy, S.L., Golub, T.R., Lander, E.S., & Mesirov, J.P. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences, 102(43), 15545–15550. https://doi.org/10.1073/pnas.0506580102↩︎
Mootha, V.K., Lindgren, C.M., Eriksson, K.F., Subramanian, A., Sihag, S., Lehar, J., Puigserver, P., Carlsson, E., Ridderstråle, M., Laurila, E., Houstis, N., Daly, M.J., Patterson, N., Mesirov, J.P., Golub, T.R., Tamayo, P., Spiegelman, B., Lander, E.S., Hirschhorn, J.N., Altshuler, D., & Groop, L.C. (2003). PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics, 34(3), 267–273. https://doi.org/10.1038/ng1180↩︎
Tiana, M., Acosta-Iborra, B., Puente-Santamaría, L., Hernansanz-Agustín, P., Worsley-Hunt, R., Masson, N., García-Rio, F., Mole, D., Ratcliffe, P., Wasserman, W.W., Jiménez, B., & del Peso, L. (2018). The SIN3A histone deacetylase complex is required for a complete transcriptional response to hypoxia. Nucleic Acids Research, 46(1), 120–133. https://doi.org/10.1093/nar/gkx951↩︎