---
title: "peakCombiner: The R package to curate and merge enriched genomic regions into consensus peak sets"
author: "Markus Muckenhuber, Charlotte Soneson, Michael Stadler, Kathleen Sprouffske"
date: "`r Sys.Date()`"
bibliography: peakCombiner-refs.bib
output:
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{peakCombiner}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5,
  out.width = "80%",
  fig.align = "center",
  echo = TRUE,
  crop = NULL # suppress "The magick package is required to crop" issue
)
library(BiocStyle)
```

# Introduction

This vignette for the R package `r Biocpkg("peakCombiner")` guides you through the preparation of all accepted input files and how to define the best parameters for your analysis. Our goal is to provide you with an easy-to-understand, transparent, and modular framework to create the consensus regions file you need to address your scientific question.

## Abstract

Genome-wide epigenomic data sets like ChIP-seq and ATAC-seq typically rely on tools (e.g., MACS [@macs]) to identify genomic regions of interest, so-called peaks, usually for multiple sample replicates and across experimental conditions. Many downstream analyses require a consensus set of genomic regions relevant to the experiment, but current tools within the R ecosystem to easily and flexibly create combined peak sets from conditions and replicates are limited. Here, we describe `r Biocpkg("peakCombiner")`, a fully R-based, user-friendly, transparent, and flexible tool that allows even novice R users to create a high-quality consensus peak list. The modularity of its functions provides an easy way to filter and recenter input and output data. A broad range of data formats is accepted as input to `r Biocpkg("peakCombiner")`, and the resulting consensus region set can be exported to a file or used immediately as the starting point for most downstream peak analyses.

## Overview of steps

The package `r Biocpkg("peakCombiner")` contains four main functions to curate and combine genomic regions into one set of consensus regions. A short overview of each function is given in the table below.

|Function name | What it does | Where to learn more |
|:------------- |:------------ |:--------------- |
| [`prepareInputRegions`][Prepare input data] | Transforms your data into the format needed by `r Biocpkg("peakCombiner")` for all following steps. | See section [Prepare input data]. |
| [`centerExpandRegions`][Center and expand regions] | Modifies your genomic regions by centering and then expanding them to a uniform size. | See section [Center and expand regions]. |
| [`filterRegions`][Filter regions] | Allows you to filter your genomic regions. | See section [Filter regions]. |
| [`combineRegions`][Combine Regions] | Combines overlapping genomic regions to create a single set of consensus genomic regions. | See section [Combine regions]. |

Each `r Biocpkg("peakCombiner")` function provides the parameter `outputFormat`, which allows you to choose the output format of the function. The accepted values are 'tibble', 'GenomicRanges', and 'data.frame'; the default is 'tibble'. The output format defines the structure of the returned data, which is described in the section [Standard genomic regions format].
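For example, the same call can return its result in any of the accepted formats (a sketch only, not run; `my_regions` stands for any accepted input object described below):

```{r, eval=FALSE}
## `my_regions` is a placeholder for any accepted input object (see below)
prepareInputRegions(data = my_regions, outputFormat = "tibble")
prepareInputRegions(data = my_regions, outputFormat = "GenomicRanges")
```
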
Additionally, each function provides the parameter `showMessages`, which by default prints feedback messages as the functions run. If you plan to use `r Biocpkg("peakCombiner")` non-interactively in a workflow, you can silence these messages by setting `showMessages = FALSE`. Note that error messages will still be printed to help you troubleshoot potential problems.

## Standard genomic regions format

The modularity of `r Biocpkg("peakCombiner")` is enabled by standardizing how the genomic regions and samples are represented in the input and output of all its functions. We chose to use a tibble because this allows you to easily plot and explore the data. The columns are described in the table below. Only 'chrom', 'start', 'end', and 'sample_name' are required to use `r Biocpkg("peakCombiner")`, but you have more possibilities for [`centerExpandRegions`][Center and expand regions] and [`filterRegions`][Filter regions] if you can provide the optional 'score' and 'center' columns. If you can't provide data for the optional columns, `prepareInputRegions` adds sensible defaults for you. We recommend that you create this object from a GenomicRanges object or a tibble using `prepareInputRegions` (see section [Prepare input data]), but you can also prepare it yourself (see section "[Load from pre-loaded tibble]").

|Column name |Requirement |Content |Details |
|:----------|:-----------|:-------------|:--------------------------|
| 'chrom' | required | Name of the chromosome. | - |
| 'start' | required | Start coordinate of the genomic region. | 1-based coordinate system, NOT like BED files, which are 0-based. |
| 'end' | required | End coordinate of the genomic region. | - |
| 'name' | optional | Unique identifier for a sample and genomic region. | Can be any unique identifier. If not provided, `prepareInputRegions` concatenates the 'sample_name' and the row number. |
| 'score' | optional | Selected enrichment value used to rank genomic regions. | Bigger values are more important. For example, the qValue from MACS2, the -log10(FDR) from another method, or the fold enrichment over background computed by your favorite method. If not provided in the input data, `prepareInputRegions` creates this column and populates it with '0'. |
| 'strand' | optional | Value defining on which strand the genomic region is located. | Values are '+', '-', or '.'. If not provided in the input data, `prepareInputRegions` creates this column and populates it with '.'. |
| 'center' | optional | Absolute genomic coordinate of the center of the region. | Note that another common representation of 'center' is to report the distance of the summit from the 'start' coordinate in base pairs. We call this value 'summit' instead of 'center'. You can use `prepareInputRegions` to create 'center' from 'summit'. If not provided in the input data, `prepareInputRegions` creates this column as the midpoint of each region. |
| 'sample_name' | required | Unique identifier for each sample (provided in the input data). | Avoid special characters like spaces or tabs. |

# Install `peakCombiner`

`r Biocpkg("peakCombiner")` can be installed from Bioconductor using the `r CRANpkg("BiocManager")` package:

```{r, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("peakCombiner")
```

```{r, eval=TRUE}
library("peakCombiner")
library("ggplot2")
```

# An overview of a complete run

Here you find an overview of a complete run of the `r Biocpkg("peakCombiner")` recommended workflow. Let's use some of the provided synthetic data sets.
```{r, eval=TRUE}
utils::data("syn_data_tibble")
```

Here are all the functions in the recommended order, with all available parameters.

```{r, eval=TRUE}
data_prepared <- prepareInputRegions(
  data = syn_data_tibble,
  outputFormat = "tibble",
  showMessages = FALSE
)

centerExpandRegions(
  data = data_prepared,
  centerBy = "center_column",
  expandBy = NULL,
  genome = NA,
  trim_start = TRUE,
  outputFormat = "GenomicRanges",
  showMessages = FALSE
)

data_center_expand <- centerExpandRegions(
  data = data_prepared,
  centerBy = "center_column",
  expandBy = NULL,
  genome = NA,
  trim_start = TRUE,
  outputFormat = "tibble",
  showMessages = FALSE
)

data_filtered <- filterRegions(
  data = data_center_expand,
  includeByChromosomeName = NULL,
  excludeByBlacklist = NULL,
  includeAboveScoreCutoff = NULL,
  includeTopNScoring = NULL,
  outputFormat = "tibble",
  showMessages = FALSE
)

consensus_peak <- combineRegions(
  data = data_filtered,
  foundInSamples = 2,
  combinedCenter = "nearest",
  annotateWithInputNames = FALSE,
  combinedSampleName = "my_new_sample_name",
  outputFormat = "tibble",
  showMessages = FALSE
)

consensus_final <- centerExpandRegions(
  data = consensus_peak,
  expandBy = 350,
  outputFormat = "tibble",
  showMessages = FALSE
)

consensus_final
```

Please note that a message during [`filterRegions`][Filter regions] is expected if the chromosome names in the input samples and in the provided blacklist are not absolutely identical.

Finally, we export the resulting consensus regions in a data format of the user's choice (GenomicRanges, tibble, data.frame), which can then be saved as a BED file.

```{r, eval=FALSE}
rtracklayer::export.bed(consensus_final,
  paste0(here::here(), "/lists/consensus_regions.bed"),
  format = "bed"
)
```

# Prepare input data

In this section, we explain how to prepare the accepted, standardized input files and how to add needed features to these files.

`prepareInputRegions` is the mandatory first step to prepare your input data in the format needed for all of the following steps in `r Biocpkg("peakCombiner")` (for details see section "[Standard genomic regions format]"). This function accepts three types of input formats:

+ a GenomicRanges object of genomic regions with specifically named columns,
+ a tibble or data.frame of genomic regions with specifically named columns, or
+ a tibble (sample sheet) that specifies the file paths of BED-like files for a set of samples (e.g., standard output files from peak callers like 'narrowPeak' files).

*Please note* that when a GenomicRanges object, a tibble, or a data.frame (not the sample sheet) is used as input, the package assumes the input is 1-based. When using a sample sheet, the input is expected to be a standard BED (or BED-like) file, which is 0-based. The package takes care of this for the user, and all outputs are 1-based to be aligned with GenomicRanges.

*Please note* that the function `prepareInputRegions` also has a filtering step, which automatically checks for genomic regions with the same values in the columns 'chrom', 'start', 'end', and 'sample_name' and keeps only the most strongly enriched summit (based on the 'score' values) per region. Additional summits are removed. Regions from peak callers can have multiple summits annotated per enriched genomic region. This step is *not optional*.

`prepareInputRegions` returns the data that you need for all your future work in a data format of your choice (GenomicRanges, tibble, data.frame). All these formats can be used with the downstream functions of this package.
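The mandatory de-duplication step described above can be illustrated with a small sketch (not run): the same region is listed twice for the same sample, with two different summits, and only the row with the higher 'score' is kept.

```{r, eval=FALSE}
## Sketch only: one region reported with two summits for the same sample;
## prepareInputRegions keeps the row with the higher 'score' (here, 25).
two_summits <- tibble::tibble(
  chrom       = c("chr1", "chr1"),
  start       = c(100, 100),
  end         = c(600, 600),
  score       = c(10, 25),
  summit      = c(150, 380), # summit position relative to 'start'
  sample_name = c("treatment-rep1", "treatment-rep1")
)
prepareInputRegions(data = two_summits, showMessages = FALSE)
```
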
For more details see the section "[Standard genomic regions format]".

## Quick start

This is for you if you want to jump right in with some code.

One of the easiest ways (and the way we recommend unless you are an advanced user) is to provide `r Biocpkg("peakCombiner")` with a list of samples and file paths for the peak files in BED-like format, plus some information about the format of the peak files.

```{r, eval=TRUE}
utils::data("syn_sample_sheet", package = "peakCombiner")
syn_sample_sheet
```

This is the expected structure of a sample sheet.

If you already have your BED-like region files loaded into your R session, you can alternatively provide them to `prepareInputRegions` as a single GenomicRanges, tibble, or data.frame object with genomic regions and named columns (see section "[Standard genomic regions format]") that, importantly, include a unique sample identifier ('sample_name'). Here, we show the tibble object as an example.

```{r, eval=TRUE}
utils::data("syn_data_control01", package = "peakCombiner")
syn_data_control01
```

```{r, eval=TRUE}
utils::data("syn_data_treatment01", package = "peakCombiner")
syn_data_treatment01
```

Let's combine the two tibbles.

```{r, eval=TRUE}
combined_input <- syn_data_control01 |>
  dplyr::mutate(sample_name = "control-rep1") |>
  rbind(syn_data_treatment01 |>
    dplyr::mutate(sample_name = "treatment-rep1"))

combined_input |>
  dplyr::group_by(sample_name) |>
  dplyr::summarize(num_regions = dplyr::n())
```

```{r, eval=TRUE}
prepareInputRegions(
  data = combined_input,
  outputFormat = "tibble",
  startsAreBased = 1,
  showMessages = FALSE
)
```

If you set `startsAreBased = 0`, the package assumes the input is 0-based and the values in the column `start` are adjusted.

```{r, eval=TRUE}
prepareInputRegions(
  data = combined_input,
  outputFormat = "tibble",
  startsAreBased = 0,
  showMessages = FALSE
)
```

## Load from sample sheet

We recommend that you use a sample sheet to prepare your data, especially if you have 'narrowPeak' or 'broadPeak' files. `prepareInputRegions` loads the genomic region files and formats them so that they are ready to be used in peakCombiner's other functions. The sample sheet has three required columns and one optional column, described as follows.

|Column name |Requirement |Column content |Details |
|:-----------|:-----------|:-------------------|:-------------------|
|'sample_name' | required | Unique name for each input sample. | The user defines this value. Please avoid special characters ('space', '-', etc.). |
|'file_path' | required | Path to the file in which the genomic regions are stored. | For example, the path to a 'BED' or 'narrowPeak' file. Can be a relative path. |
|'file_format' | required | Recognized values are: 'BED', 'narrowPeak', and 'broadPeak'. | Used to correctly name the columns of the input files. Must be the same for all samples loaded at once. |
|'score_colname' | optional | The exact name or the number of the original column to be used as score, i.e., the metric used to rank peak importance, where bigger values are more important. | If not provided, column '9' will be used for the 'narrowPeak' and 'broadPeak' file formats. Column '9' corresponds to the `qValue` as described in the UCSC documentation [here](https://genome.ucsc.edu/FAQ/FAQformat.html#format12). |

*Please note* that if you load your data with a sample sheet, the loaded BED (BED-like) files are assumed to be 0-based, and consequently the start coordinates will be adjusted so that the output is 1-based.
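To illustrate what this adjustment means, consider a single interval (a conceptual sketch, independent of `r Biocpkg("peakCombiner")`):

```{r, eval=TRUE}
## A BED line "chr1  99  200" (0-based start, half-open interval) covers the
## same bases as the 1-based, closed interval 100-200, so only the start
## coordinate is shifted by +1 when BED-like files are read in.
bed_start <- 99
bed_end <- 200
c(start = bed_start + 1, end = bed_end)
```
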
The following code illustrates how to prepare an accepted sample sheet in R.

+ **Identify the paths to your files.**

First, let's collect the paths to the peak files we want to use and save them.

```{r, eval=FALSE}
## Adjust the directory and file pattern to match your own peak files
file_names <- list.files("path/to/peak_files",
  pattern = "\\.narrowPeak$",
  full.names = TRUE
)
file_names
```

+ **Create a sample sheet.**

Next, we create a tibble (named 'sample_sheet') with the correct column names ('sample_name', 'file_path', 'file_format', 'score_colname') to load in our data.

```{r, eval=TRUE}
utils::data("syn_sample_sheet", package = "peakCombiner")

sample_sheet <- syn_sample_sheet

sample_sheet
```

With this step you create a new tibble containing all the required information to run `prepareInputRegions`.

+ **Prepare data from the sample sheet.**

Now we use the prepared tibble (sample_sheet) and pass it as the argument `data` to the function `prepareInputRegions`.

```{r, eval = FALSE}
prepareInputRegions(
  data = sample_sheet,
  startsAreBased = 0,
  showMessages = FALSE
)
```

The returned object is a tibble (or a GenomicRanges object or data.frame, depending on `outputFormat`) that contains all required information, formatted correctly for use with the downstream functions within `r Biocpkg("peakCombiner")`. For more information about its structure, go back to the "[Standard genomic regions format]" section.

## Load from pre-loaded tibble

If you have already been working with your genomic regions within an R session, you may have them pre-loaded as a tibble (this section) or as a GenomicRanges object (next section: [Load from GenomicRanges object]). For example, you may have one tibble per sample, or you may have run `r Biocpkg("peakCombiner")` before on two data sets and now want to combine them. In this section we show how to prepare your input data from pre-loaded tibbles.

We start by loading two of our synthetic data sets in 'narrowPeak' format, a common format generated by peak callers like MACS. Most BED-like files (including 'narrowPeak' files) don't include column names in the file, so you will have to name the columns yourself using the standard naming conventions described by UCSC for [BED](https://genome.ucsc.edu/FAQ/FAQformat.html#format1), [narrowPeak](https://genome.ucsc.edu/FAQ/FAQformat.html#format12), or [broadPeak](https://genome.ucsc.edu/FAQ/FAQformat.html#format13) files. As we know what to expect in each column, we can name the columns correctly: 'chrom' (X1), 'start' (X2), 'end' (X3), 'name' (X4), 'score' (X5), and 'strand' (X6).

To be clear, the names that you define for the columns are used by `prepareInputRegions` to properly format the data. Columns named 'chrom', 'start', 'end', 'name', 'score', 'strand', 'center', and 'sample_name' are maintained. *Please note* that within `r Biocpkg("peakCombiner")` we call a column with relative information about the peak summit 'summit' and one with absolute values 'center'. So, if a column named 'summit' is provided, it is used to populate a column named 'center'. All other columns are dropped at the end of `prepareInputRegions`.

+ **Prepare input files**

Now let's load the first narrowPeak file. Note that the columns are already named correctly, and we expect this from your data as well.

```{r, eval=TRUE}
utils::data(syn_control_rep1_narrowPeak)
syn_control_rep1_narrowPeak
```

And the second file.

```{r, eval=TRUE}
utils::data(syn_treatment_rep1_narrowPeak)
syn_treatment_rep1_narrowPeak
```

+ **Add the column 'sample_name'**

Now we add a column named 'sample_name' to each of our tibbles.
```{r, eval=TRUE}
control <- syn_control_rep1_narrowPeak |>
  dplyr::mutate(sample_name = "control-rep1")

control
```

```{r, eval=TRUE}
treatment <- syn_treatment_rep1_narrowPeak |>
  dplyr::mutate(sample_name = "treatment-rep1")

treatment
```

+ **Combine multiple tibbles**

Finally, combine the multiple input tibbles into one.

```{r, eval=TRUE}
combined_input <- control |>
  rbind(treatment)

combined_input
```

And check how many rows we now have for each sample.

```{r, eval=TRUE}
combined_input |>
  dplyr::group_by(sample_name) |>
  dplyr::count(name = "number_of_entries")
```

Both 'sample_name's are found, so we know that we have successfully combined the data sets.

+ **Prepare data from the pre-loaded tibble**

After preparing the pre-loaded tibble, we run the function `prepareInputRegions` and pass the tibble to the parameter `data`.

```{r, eval=TRUE}
prepareInputRegions(
  data = combined_input,
  outputFormat = "tibble",
  startsAreBased = 1,
  showMessages = FALSE
)
```

The output tibble from `prepareInputRegions` can now be used for your next steps with `r Biocpkg("peakCombiner")`. For details about the accepted file structure see section "[Standard genomic regions format]".

## Load from GenomicRanges object

You can also provide an in-memory GenomicRanges object listing the genomic regions of your samples. This object is very similar to the tibble above, except that `chrom`, `start`, and `end` are instead described using the `GenomicRanges` nomenclature (see [here](https://bioconductor.org/packages/devel/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesIntroduction.html) for details).

+ **Load in a GenomicRanges object**

As a first step, we load the provided synthetic data originating from a GenomicRanges object.

```{r, eval=TRUE}
utils::data("syn_data_granges")
syn_data_granges
```

The column names are based on the original GenomicRanges format, which allows us to easily transform the data into a GenomicRanges object. Note that normally we expect you to already have a GenomicRanges object pre-loaded that you want to use with `r Biocpkg("peakCombiner")`; we transform it here only to show how an accepted GenomicRanges object has to be structured.

```{r, eval=TRUE}
GenomicRanges_data <- GenomicRanges::makeGRangesFromDataFrame(
  syn_data_granges,
  keep.extra.columns = TRUE
)

GenomicRanges_data
```

+ **Prepare input from a GenomicRanges object**

You can simply pass your GenomicRanges object to the parameter `data` and load it in. The output from `prepareInputRegions` can then be used for your next steps with `r Biocpkg("peakCombiner")`. For details about the accepted file structure see section "[Standard genomic regions format]".

```{r, eval=TRUE}
prepareInputRegions(
  data = GenomicRanges_data,
  outputFormat = "GenomicRanges",
  showMessages = FALSE
)
```

## Explained in detail

We recommend using the function `prepareInputRegions` to prepare the input data in the format needed for all of the following steps in `r Biocpkg("peakCombiner")`. In principle, you can also prepare the input data manually by following the descriptions in the section "[Standard genomic regions format]".

### Load from BED file

Unlike 'narrowPeak' files, BED files typically do not include columns for summits ('summit') or significance ('score') of your peaks. For that reason, we recommend using 'narrowPeak' files if possible. Sometimes you may only have 'chrom', 'start', and 'end', and you can still use `r Biocpkg("peakCombiner")`. Here we show you how to load such data. Let's load a BED file.
```{r, eval=TRUE}
utils::data("syn_data_bed")
syn_data_bed |> dplyr::arrange(sample_name)
```

By default, the chromosomal coordinates of a BED file are 0-based. When we summarize the 'sample_name' column, we see the different number of entries for each sample.

```{r, eval=TRUE}
syn_data_bed |>
  dplyr::group_by(sample_name) |>
  dplyr::summarize(num_regions = dplyr::n())
```

And now we use it as input for `prepareInputRegions`.

```{r, eval=TRUE}
prepareInputRegions(
  data = syn_data_bed,
  outputFormat = "data.frame",
  startsAreBased = 0,
  showMessages = FALSE
)
```

Please note that the information messages tell you about all missing values and the default values with which these columns are populated. The 'score' is set to 0, as no information about enrichment values can be obtained from a classical BED file. The column 'strand' is populated with the value '.', indicating that no strand information is known. The 'center' is calculated as the arithmetic midpoint of each region, as no 'summit' input column was found.

The resulting tibble can be used with all functions of the package ([`centerExpandRegions`][Center and expand regions], [`filterRegions`][Filter regions], [`combineRegions`][Combine Regions]), but certain options are limited due to the missing information in the input. For instance, [`centerExpandRegions`][Center and expand regions] is limited to using the 'midpoint' as center, as no summit information is provided (see section [Center and expand regions]). For [`filterRegions`][Filter regions], the options 'includeAboveScoreCutoff' and 'includeTopNScoring' rely on the column 'score' being populated with meaningful values (see section [Filter regions]) and cannot be used.

### Working in 1-based space

The package `r Biocpkg("peakCombiner")` expects to work with 1-based coordinates instead of 0-based coordinates or the mixed setup used in [BED files](https://genome.ucsc.edu/FAQ/FAQformat.html#format1). For example, in [narrowPeak](https://genome.ucsc.edu/FAQ/FAQformat.html#format12) files the start coordinate is included in the region, while the end coordinate is not. This follows the classical definition of a [BED file](https://genome.ucsc.edu/FAQ/FAQformat.html#format1) at UCSC, where the start is included but the end is not, meaning the start is 0-based and the end is 1-based. As genome browsers usually display both coordinates in the 1-based way, we decided that `r Biocpkg("peakCombiner")` uses the 1-based approach exclusively. A good cheat sheet is linked [here](https://www.biostars.org/p/84686/). If you load your data using a sample sheet (see "[Load from sample sheet]"), `r Biocpkg("peakCombiner")` takes care of the details for you.

# Center and expand regions

After you prepare your input data, you may want to resize your peaks to a single, consistent size for your downstream analyses. In this case, we recommend that you first center and expand your genomic regions using `centerExpandRegions`. *Please note* that this function is optional: it is not required, but we recommend running `centerExpandRegions` before you combine your regions.

## Quick start

The quickest way to get started is to call `centerExpandRegions` using just the default parameters. In this case, `centerExpandRegions` updates the 'start' and 'end' coordinates of each genomic region such that it is centered around 'center' and resized to the median width of all input regions.
```{r, eval=TRUE}
centerExpandRegions(
  data = data_prepared,
  centerBy = "center_column",
  expandBy = NULL,
  outputFormat = "tibble",
  showMessages = FALSE
)
```

The table you obtain has altered coordinates for the genomic regions. The center of each region is defined as the value found in the column 'center', and the expansion is based on the median region size of all input regions.

## Run center and expand

The expected input can be a GenomicRanges, tibble, or data.frame as described previously (see section "[Standard genomic regions format]"). We suggest you prepare it with the function `prepareInputRegions`.

+ **Define the new center**

By default, `centerExpandRegions` uses the existing value in the 'center' column (`centerBy = "center_column"`), which is the recommended approach. However, you can also overwrite the existing 'center' value by computing the midpoint of the genomic region (`centerBy = "midpoint"`). This only makes sense if you have loaded your data from a peak caller that reports summits or the location of the strongest enrichment (e.g., MACS narrowPeak), but you would rather use the midpoint.

+ **Define the value to expand by**

By default (`expandBy = NULL`), `centerExpandRegions` calculates a reasonable value for you from your data by using half of the median region length of all your input data for the expansion. You can alternatively provide a numeric value. For example, setting 'expandBy' to 250 (`expandBy = 250`) results in regions of size 500 centered around 'center' (for more details see section "[Expand from 'center']").

+ **Center and expand the regions**

```{r, eval=TRUE}
centerExpandRegions(
  data = data_prepared,
  centerBy = "center_column",
  expandBy = NULL,
  outputFormat = "tibble",
  showMessages = TRUE
)
```

You can appreciate that the values for 'start' and 'end' have changed, while the number of input regions stays the same.

+ **Define the expansion value.**

Finally, here are some examples of how you can define the expansion, either symmetrically or asymmetrically (for more details see section [Expand from 'center']).

```{r, eval=TRUE}
centerExpandRegions(
  data = data_prepared,
  centerBy = "center_column",
  expandBy = c(500),
  outputFormat = "tibble",
  showMessages = FALSE
)
```

```{r, eval=TRUE}
centerExpandRegions(
  data = data_prepared,
  centerBy = "center_column",
  expandBy = c(100, 1000),
  outputFormat = "tibble",
  showMessages = FALSE
)
```

In all cases, the input regions are altered. If you are unsure how to define this parameter or have no prior expectations, don't specify a value for 'expandBy'; the default behavior calculates the expansion from the median size of the input regions.

## Explained in detail

In this section we go into more detail to help you define the ideal parameters for the function `centerExpandRegions`. The expected input data structure is described in the section "[Standard genomic regions format]".

After the preparation of the input data, as a first step, we recommend centering and expanding the genomic regions using `centerExpandRegions`. This is useful if you want all of your peaks to be the same size for your downstream analyses, or if you want to use the information about the position of maximum enrichment in each region (often called the 'summit', stored in the column 'center'), normally obtained from peak callers (e.g., MACS2).
The function allows you to automatically center your regions of interest on these summits to capture the most important part of a genomic region (e.g., a TF-binding site or the point of highest enrichment). There are two concepts that are relevant to understanding the function `centerExpandRegions`:

### Define the 'center'

In general, there are two approaches to define the 'center' of a genomic region, and the function `centerExpandRegions` allows you to decide which one to use. The first option is to use pre-defined summit information (e.g., obtained from a peak caller like MACS2) (`centerBy = "center_column"`). Such information is available when the input regions were identified by a peak calling tool that defines exactly where in the region the maximum enriched signal is found. The second option is to calculate the arithmetic midpoint of the genomic region (`centerBy = "midpoint"`). (For details see the help for `prepareInputRegions`.)

### Expand from 'center'

The second factor you have to think about is by how many nucleotides you want to expand from the center to re-define your genomic regions. The function `centerExpandRegions` allows you either to use the input data to calculate the expansion value or to provide one or two numeric values to expand by. Your decision is strongly influenced by the type of signal you are looking for. Traditionally, ATAC-seq, transcription factor ChIP-seq, and ChIP-seq for some histone marks (for details see the [ENCODE recommendations](https://www.encodeproject.org/chip-seq/histone/)) show a very narrow, sharp signal, and your region size and expansion value should reflect that. On the other hand, some histone marks are enriched over large domains and show broad patterns. Therefore, prior knowledge of the signal you are looking for is key to choosing the best value for this parameter.

Sometimes you may not have a prior expectation of region size, so the default in `r Biocpkg("peakCombiner")` (`expandBy = NULL`) is to choose a reasonable value from the data, namely half of the median input region length, for the expansion. You can also choose to expand the genomic region from the new center asymmetrically, by different lengths before and after the center position (for examples see section "[Run center and expand]"), by providing a vector with two values (the 'center' minus the first value defines the start, and the 'center' plus the second value defines the end).

### Correct if expanding over chromosome borders

It is theoretically possible that centering and expanding results in coordinates outside of the chromosomal boundaries, which would of course be invalid. Taking advantage of GenomicRanges functionality, you can provide a reference genome supported by GenomicRanges (for details see the help for `centerExpandRegions`). If no genome is provided, the package `r Biocpkg("peakCombiner")` has a parameter to trim the start coordinate (`trim_start = TRUE`). This means that if the start coordinate would become negative, it is automatically replaced with 1. If you do not want this behavior, you can set `trim_start = FALSE` and the function will return the original negative value. You get printed feedback on how often and where this happened, so you can debug if necessary. If these edge cases are critical to your downstream analyses, we suggest that you double-check them.
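To make this concrete, here is a minimal sketch of these two safeguards (not run; it assumes the genome can be passed as a name such as "hg38", see the help for `centerExpandRegions` for the accepted values):

```{r, eval=FALSE}
## Not run: an aggressive expansion that could push regions beyond chromosome ends.
## If a supported genome is provided, chromosome information can be taken into
## account; without one (genome = NA), trim_start = TRUE resets negative start
## coordinates to 1 and reports how often this happened.
centerExpandRegions(
  data = data_prepared,
  expandBy = 1e6,
  genome = "hg38", # assumption: a genome identifier supported by the function
  trim_start = TRUE,
  outputFormat = "tibble",
  showMessages = TRUE
)
```
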
### Run the function multiple times

If you have chosen to run `centerExpandRegions` on your input regions, you probably also want to center and expand the consensus regions you get from [`combineRegions`][Combine Regions], so that your final consensus regions all have the same length and are well-aligned at their centers. When you combine genomic regions with [`combineRegions`][Combine Regions], the genomic regions are altered. This means that the median region size can differ from the median size of your input regions if you let [`centerExpandRegions`][Center and expand regions] choose the default region length from the data. For that reason, we suggest that you note the calculated expansion value (which is printed out) for the input regions and, when you run [`centerExpandRegions`][Center and expand regions] again on the consensus file, pass this value to the 'expandBy' parameter. Doing so defines the consensus region length based on the original input regions and not just a subset.

# Filter regions

The next function, `filterRegions`, allows you to refine the genomic regions based on four different parameters, applied in exactly the order provided here:

+ 'includeByChromosomeName'
+ 'excludeByBlacklist'
+ 'includeAboveScoreCutoff'
+ 'includeTopNScoring'

This is an *optional* step that can help you retain the most high-quality genomic regions or reduce the overall number of genomic regions for each sample. You can apply `filterRegions` multiple times on the same data set, one after another, in order to explore the effect of each individual filtering step.

## Quick start

As a quick first example, you can easily exclude blacklisted regions as follows. We recommend using the ENCODE-annotated blacklisted regions for either human ([hg38](https://www.encodeproject.org/files/ENCFF356LFX/)) or mouse ([mm10](https://www.encodeproject.org/files/ENCFF547MET/)). Alternatively, you can provide a data frame with BED-like columns.

```{r, eval=TRUE}
blacklist <- tibble::tibble(chrom = c("chr1"), start = c(100), end = c(1000))

blacklist
```

```{r, eval=TRUE}
data_filtered <- filterRegions(
  data = data_center_expand,
  excludeByBlacklist = blacklist,
  outputFormat = "tibble",
  showMessages = FALSE
)

data_filtered
```

## Filter genomic regions

You can filter based on four parameters:

+ `includeByChromosomeName` - Retains only chromosomes that are in the provided vector. By not including mitochondrial, sex, or non-canonical chromosomes, genomic regions found on these chromosomes are removed. By default, this filter is not applied.

+ `excludeByBlacklist` - A tibble can be provided listing the genomic regions to remove (with the column names 'chrom', 'start', and 'end'). By default, this filter is not applied.

+ `includeAboveScoreCutoff` - A single numeric value that defines the threshold above which genomic regions are retained, based on the values in the column 'score'. The 'score' column in the `r Biocpkg("peakCombiner")` input data must be non-zero for this parameter to be useful. Recall that `prepareInputRegions` sets the 'score' by default if possible (e.g., it takes the value of -log10(FDR) when using a 'narrowPeak' file from MACS as input). Importantly, applying this filter retains a variable number of genomic regions per sample, all having a score greater than the `includeAboveScoreCutoff` parameter. By default, this filter is not applied.
+ `includeTopNScoring` - A single numeric value that defines how many of the top-scoring genomic regions (based on the column 'score') are retained. All other genomic regions are discarded. Importantly, applying this filter retains `includeTopNScoring` regions per sample, which means that the minimum enrichment levels may vary between samples. Note that if multiple genomic regions have the same 'score' at the cutoff, then all of those genomic regions are included; in this case, the number of retained regions may be slightly higher than the requested value. By default, this filter is not applied.

Let's see what happens when all four filtering options are used at once.

```{r, eval=TRUE}
filterRegions(
  data = data_center_expand,
  includeByChromosomeName = c("chr1", "chr2", "chr4"),
  excludeByBlacklist = blacklist,
  includeAboveScoreCutoff = 2.5,
  includeTopNScoring = 6,
  outputFormat = "tibble",
  showMessages = TRUE
)
```

The filtering occurs in the order of the parameters and can be described as follows:

1. The data set is filtered to only include chromosomes with the names 'chr1', 'chr2', and 'chr4',
2. Any remaining regions overlapping with the provided blacklist are removed,
3. Regions having a 'score' above '2.5' are retained, and
4. Only the 6 top-scoring sites are retained per sample.

## Explained in detail

The function `filterRegions` can help you curate the input genomic regions when your peak-calling pipeline uses default options and you still need to refine the peaks further, for example when using an automated pipeline for annotation and peak calling. The idea here is to provide a quick and easy way to clean up your input data if needed. We suggest that you run the function `filterRegions` after the function `centerExpandRegions`, just in case `centerExpandRegions` alters the genomic coordinates enough to overlap with a blacklisted region. `filterRegions` allows a step-wise optimization of the selection criteria for regions of interest and can be used multiple times on the same data set.

### Apply the right filters

Below we describe the filtering options in `r Biocpkg("peakCombiner")` in more detail, along with what we recommend. All of these parameters are optional, and by default no filtering is done.

#### Include chromosomes

Sometimes you may need to focus your analysis on a single chromosome (e.g., to reduce the running time when testing) or a set of chromosomes (e.g., just the canonical human chromosomes). In this case, `includeByChromosomeName` can help you define the set of chromosomes you want to focus on in your analysis. In this subsection, we provide an example of how to identify "canonical" human chromosomes directly from the input data, save them as a vector, and use this vector as input for the filtering step.

+ **Extract a vector of chromosomes to include**

First we extract all chromosome names from the input data.

```{r, eval=TRUE}
input_chrom <- data_center_expand |>
  dplyr::select(chrom) |>
  unique()

input_chrom
```

Here we see that this data set contains some unexpected values for chromosome names, like "chr4 2", "chr4|2", or "chr42". Let's modify this vector to only keep what we consider to be the "canonical" chromosomes. In real-world human data sets, you may find names like "chr11_KI270721v1_random" or "chrUn_GL000195v1" that you might want to remove for your downstream analyses. To do so, the next step is to filter with regular expressions to retain only the wanted chromosome names.
```{r, eval=TRUE}
include_chrom <- input_chrom |>
  dplyr::filter(grepl("^chr[0-9]$|^chr[1-2][0-9]$|^chr[XYM]", chrom)) |>
  dplyr::pull(chrom) |>
  unique() |>
  sort()

include_chrom
```

Finally, we can use this vector of valid names in `filterRegions` for the parameter `includeByChromosomeName`.

```{r, eval=TRUE}
data_filtered_chr <- filterRegions(
  data = data_center_expand,
  includeByChromosomeName = include_chrom,
  excludeByBlacklist = NULL,
  includeAboveScoreCutoff = NULL,
  includeTopNScoring = NULL,
  outputFormat = "tibble",
  showMessages = FALSE
)

data_filtered_chr
```

Let's confirm that we did indeed have unwanted chromosome names in the input data.

```{r, eval=TRUE}
data_center_expand |>
  dplyr::group_by(chrom) |>
  dplyr::summarize(num_regions = dplyr::n())
```

And we can also confirm that only the wanted chromosomes remain after this filtering step.

```{r, eval=TRUE}
data_filtered_chr |>
  dplyr::group_by(chrom) |>
  dplyr::summarize(num_regions = dplyr::n())
```

#### Remove blacklisted sites

It is often recommended to exclude blacklisted genomic regions from your analyses, and the `excludeByBlacklist` parameter allows you to do that easily. Genomic regions that overlap a blacklisted region by at least 1 base are removed. We suggest using the blacklisted regions from ENCODE for human ([hg38](https://www.encodeproject.org/files/ENCFF356LFX/)) and mouse ([mm10](https://www.encodeproject.org/files/ENCFF547MET/)). These blacklists were manually curated by ENCODE by combining several published sets of blacklisted regions.

While we recommend using the provided ENCODE blacklists, you can alternatively provide your own blacklist as a tibble containing genomic regions with columns named 'chrom', 'start', and 'end'. More generally, you can use any set of genomic locations here to remove overlapping regions; for instance, if you want to remove regions overlapping promoters, you can use this function to do so.

Here is some code to directly download the file from ENCODE, save it locally, and load it into your environment.

```{r download_blacklist_bed, eval=FALSE}
# Download the blacklist BED file from ENCODE and save it as "blacklist_human.bed.gz"
download.file(
  url = "https://www.encodeproject.org/files/ENCFF356LFX/@@download/ENCFF356LFX.bed.gz",
  destfile = "blacklist_human.bed.gz",
  mode = "wb"
)

# Import the BED file
blacklist_regions <- readr::read_tsv("blacklist_human.bed.gz",
  col_names = c("chrom", "start", "end")
)

blacklist_regions
```

```{r apply_downloaded_blacklist_bed, eval=FALSE}
filterRegions(
  data = data_center_expand,
  excludeByBlacklist = blacklist_regions,
  showMessages = FALSE
)
```

An alternative is to load the blacklist via the package [AnnotationHub](https://bioconductor.org/packages/release/bioc/html/AnnotationHub.html), which creates and manages a local cache for such files.
```{r load_blacklist_annotationhub}
# Load the AnnotationHub package
library(AnnotationHub)

# Get an AnnotationHub handle (the index is downloaded when called for the first time)
ah <- AnnotationHub::AnnotationHub()

# List ENCODE-related data providers
unique(grep("ENCODE", ah$dataprovider, value = TRUE))

# Query for "blacklist"
dm <- AnnotationHub::query(ah, "blacklist")
dm

# Get a specific object (it is downloaded when called for the first time)
blacklist_ah <- dm[["AH107343"]]

# Get info for a specific object
AnnotationHub::recordStatus(ah, "AH107343")
AnnotationHub::getInfoOnIds(ah, "AH107343")
S4Vectors::mcols(dm)["AH107343", ]
S4Vectors::mcols(dm)["AH107343", "description"]
```

```{r view_black_list_annotationhub, eval=FALSE}
blacklist_ah
```

```{r apply_annotationhub_blacklist, eval=FALSE}
filterRegions(
  data = data_center_expand,
  excludeByBlacklist = blacklist_ah,
  showMessages = FALSE
)
```

Additionally, other R packages such as [excluderanges](https://www.bioconductor.org/packages/release/data/annotation/html/excluderanges.html) also provide regions that can be used with this function. If you would like to remove blacklisted regions from multiple sources, you can run `filterRegions` repeatedly, for example first to remove the ENCODE blacklisted regions and then to remove your own blacklisted sites.

#### Filter for enrichment value

Setting this parameter allows you to retain any peak that has a score bigger than the `includeAboveScoreCutoff` value, based on the values in the column 'score'. Recall that [`prepareInputRegions`][Prepare input data] defines the 'score' by default if possible (e.g., it takes the value of -log10(FDR) when using a 'narrowPeak' file from MACS2 as input).

As scores might differ between experiments, we recommend that you look at the distribution of the values in 'score' to help you choose the best threshold for `includeAboveScoreCutoff` for your data set. Here is some code to do that on our small, synthetic data set. The distribution you see for your own data will look different.

```{r, eval=TRUE}
data_center_expand |>
  ggplot2::ggplot(ggplot2::aes(x = score)) +
  ggplot2::geom_histogram(binwidth = 10) +
  ggplot2::xlim(0, NA)
```

We see that we have very few sites with low values. We use a cutoff of 35 to remove lowly enriched sites and apply this value in the filtering step.

```{r, eval=TRUE}
data_filtered_cutoff <- filterRegions(
  data = data_center_expand,
  includeAboveScoreCutoff = 35,
  outputFormat = "tibble",
  showMessages = FALSE
)
```

Let's compare the number of sites in the input data set and after filtering.

```{r, eval=TRUE}
dim(data_center_expand)
dim(data_filtered_cutoff)
```

When we check the range of the values in the 'score' column, we see the effect of the filtering.

```{r, eval=TRUE}
range(data_center_expand |> dplyr::pull(score))
range(data_filtered_cutoff |> dplyr::pull(score))
```

Again, we see that sites with a 'score' of 35 and below are removed.

```{r, eval=TRUE}
data_filtered_cutoff |>
  ggplot2::ggplot(ggplot2::aes(x = score)) +
  ggplot2::geom_histogram(binwidth = 10) +
  ggplot2::xlim(0, NA)
```

#### Select a defined number of regions

You can also select a fixed number of the highest-scoring regions per sample to extract the top enriched sites. An information message is shown if any sample does not have the required number of regions left in your input data. If your 'score' values vary widely between samples, you may select widely different numbers of regions using `includeAboveScoreCutoff`.
In this case, using this approach will help you select similar numbers of regions for each sample. The exact same number of regions may not be selected for each sample, because sometimes multiple genomic regions have the same 'score' value; in that case, all of the tied genomic regions are retained.

```{r, eval=TRUE}
data_center_expand |>
  dplyr::group_by(sample_name) |>
  dplyr::summarize(num_regions = dplyr::n())

filterRegions(
  data = data_center_expand,
  includeTopNScoring = 8,
  outputFormat = "tibble",
  showMessages = FALSE
) |>
  dplyr::group_by(sample_name) |>
  dplyr::summarize(num_regions = dplyr::n())
```

We requested that only the top 8 genomic regions be retained for each sample, and we can see in the information messages from `r Biocpkg("peakCombiner")` that one sample ('control03') contains fewer than the required 8 sites. For the remaining samples, we obtain the expected 8 regions per sample.

# Combine regions

After loading and preparing the regions from your samples, you can use them to build your set of consensus regions using `combineRegions`.

The most important parameter that affects the final set of consensus regions is 'foundInSamples', which allows you to retain a genomic region based on how many samples it was found in (counting the unique entries in 'sample_name'). It is up to you to define this parameter based on your input data and analysis goals. The default is '2' samples, as we assume that each condition has three biological replicates and most users would like to focus on regions found in at least two of these replicates. See the section "[Define parameter to best combine regions]" for more considerations on setting this parameter.

The other parameters allow you to configure how the new consensus regions are annotated. For example, you can define how the 'center' column is populated ('combinedCenter') and what new 'sample_name' should be used for the new consensus regions ('combinedSampleName'). Based on the selected value for the parameter 'combinedCenter', the new column 'score' is filled (for details see section [Which center to select for the consensus regions?]). If you want to trace which samples and genomic regions contributed to a new consensus region, you can use 'annotateWithInputNames' to create a new column 'input_names' that links back to the input genomic regions.

## Quick start

Below is a quick example that uses the default parameters.

```{r, eval=TRUE}
combineRegions(
  data = data_filtered,
  foundInSamples = 2,
  combinedCenter = "nearest",
  annotateWithInputNames = FALSE,
  combinedSampleName = "my_new_sample_name",
  outputFormat = "tibble",
  showMessages = FALSE
)
```

## Run to combine regions

We run the function `combineRegions`, give the value 'consensus' to the parameter `combinedSampleName`, and keep the link to our input regions by setting `annotateWithInputNames` to 'TRUE'. In brief, here is a description of the parameters you can define; for more details please see the section "[Define parameter to best combine regions]" below.

* `foundInSamples` - Defines in how many input samples a genomic site has to be found to be included in the consensus set.

* `combinedCenter` - To stay within the expected data structure of `r Biocpkg("peakCombiner")`, we add new values for 'center' and 'score' at the end of this function to make the result an accepted input for [`centerExpandRegions`][Center and expand regions] and [`filterRegions`][Filter regions].
You can choose between the options 'nearest', 'strongest', and 'middle' for `combinedCenter`.

* `annotateWithInputNames` - You can define whether, for each consensus region, the 'name' values of all contributing input regions are combined and saved in a new column 'input_names'. This allows you to link each consensus region back to the input regions.

* `combinedSampleName` - You can define the new value for the column 'sample_name', which is then used to create the unique 'name' column.

```{r, eval=TRUE}
consensus_peak_list <- combineRegions(
  data = data_filtered,
  foundInSamples = 2,
  combinedCenter = "nearest",
  annotateWithInputNames = TRUE,
  combinedSampleName = "consensus",
  outputFormat = "tibble",
  showMessages = TRUE
)

consensus_peak_list
```

Here you can see the resulting consensus regions tibble with all the known columns and the newly added 'input_names'. In conclusion, we provide multiple options for you to customize the resulting regions file based on your personal needs.

## Explained in detail

Here we briefly describe what is happening under the hood.

+ **Binning and counting of each genomic input region:** The first step is to apply the GenomicRanges function `disjoin` to split up peaks depending on the number of contributing input samples. These numbers are counted to get the coverage of each genomic region. The counts are then used for filtering with the parameter `foundInSamples`, which is defined by the user. If `foundInSamples` is a fraction between 0 and 1, only genomic regions that are found in at least this fraction of input samples are included. The default value is 2. *Please note* that if `foundInSamples` is set to 1, all genomic regions covered in at least one sample are included, which makes this function simply merge all input regions.

+ **Recombine separated genomic regions:** As the next step we perform `reduce`, another GenomicRanges operation, to combine regions adjacent to each other. If regions are separated by at least a single nucleotide, we keep them separate.

+ **Overlap with summits:** This is a quality control step for the resulting genomic regions, specifically developed for `r Biocpkg("peakCombiner")`. To remove potential false positive sites, we designed `combineRegions` to check for an overlap between the newly combined regions and the input summits. If at least one overlap is found, the new region is kept; otherwise it is removed.

+ **Restore data structure:** In this final step, we restore the accepted data structure for working with the `r Biocpkg("peakCombiner")` functions to ensure the modularity of the package. This allows you to apply the functions `centerExpandRegions` and `filterRegions` to the combined consensus region file. Therefore, the 'center', 'score', and 'sample_name' columns are created and populated based on the user-defined parameters.

### Define parameter to best combine regions

Here we explain the parameters you can set for the function `combineRegions` and the defaults we recommend.

#### In how many samples should a region be found?

As described in "[Run to combine regions]", the default behavior is to retain genomic regions that overlap in at least two samples. When should you consider using a non-default value?

Setting `foundInSamples = 1` includes every region from every sample. This is useful if you simply want a union of all peak regions, regardless of how often they were found across your replicates.
For example, you may already be very confident in the quality of the input genomic regions, or you may have previously run `r Biocpkg("peakCombiner")` on subsets of your data to get consensus regions for replicates of the same condition.

When you have many replicates per condition, you may find it helpful to increase the stringency by increasing the `foundInSamples` parameter value. In general, as you increase `foundInSamples`, `r Biocpkg("peakCombiner")` will return fewer and smaller consensus genomic regions.

```{r, eval=TRUE}
combineRegions(
  data = data_filtered,
  foundInSamples = 2,
  outputFormat = "tibble",
  showMessages = FALSE,
  combinedSampleName = "foundInSamples_2_example"
)
```

If the parameter `foundInSamples` is set to '1', this function basically merges all input regions.

```{r, eval=TRUE}
combineRegions(
  data = data_filtered,
  foundInSamples = 1,
  outputFormat = "tibble",
  showMessages = FALSE,
  combinedSampleName = "foundInSamples_1_example"
)
```

#### Which center to select for the consensus regions?

You only need to consider how to select the center of your new genomic regions if you plan to run [`centerExpandRegions`][Center and expand regions] to resize all consensus genomic regions to the same size, or if you plan to use the 'score' value for your downstream analyses.

Each consensus region can be derived from several input regions, each of which has its own 'center' and 'score' value. There are three ways you can define the 'center' of the new genomic regions, and your choice also defines how the 'score' is calculated. The three available options are:

* `middle` - the mathematical midpoint of the new region. The resulting 'score' for each consensus region is the mean of all input score values contributing to that site.

* `strongest` - the 'center' of the input region that has the highest 'score' of all overlapping input regions. The 'score' is taken from that input region.

* `nearest` - the 'center' of the input region that is closest to the mean of the 'center' values of all overlapping input regions (default). The 'score' is taken from that input region.

Let's have a look at the 'center' and 'score' within the synthetic data when setting `combinedCenter` to 'strongest'.

```{r, eval=TRUE}
combineRegions(
  data = data_filtered,
  combinedCenter = "strongest",
  outputFormat = "tibble",
  showMessages = FALSE
) |>
  dplyr::select("center", "score")
```

Or to 'middle'.

```{r, eval=TRUE}
combineRegions(
  data = data_filtered,
  combinedCenter = "middle",
  outputFormat = "tibble",
  showMessages = FALSE
) |>
  dplyr::select("center", "score")
```

Or to 'nearest'.

```{r, eval=TRUE}
combineRegions(
  data = data_filtered,
  combinedCenter = "nearest",
  outputFormat = "tibble",
  showMessages = FALSE
) |>
  dplyr::select("center", "score")
```

We recommend using the `nearest` option, as it gives you a good common center and associated score.

# Session info

This vignette was built using:

```{r, session, eval=TRUE}
sessionInfo()
```

# References