[["index.html", "The csaw Book Chapter 1 Welcome 1.1 Introduction 1.2 How to read this book 1.3 How to get help 1.4 How to cite this book 1.5 Quick start Session information", " The csaw Book Authors: Aaron Lun [aut, cre] Version: 1.0.0 Modified: 2021-02-07 Compiled: 2021-05-21 Environment: R version 4.1.0 (2021-05-18), Bioconductor 3.13 License: GPL-3 Copyright: Bioconductor, 2020 Source: https://github.com/LTLA/csawUsersGuide Chapter 1 Welcome .rebook-collapse { background-color: #eee; color: #444; cursor: pointer; padding: 18px; width: 100%; border: none; text-align: left; outline: none; font-size: 15px; } .rebook-content { padding: 0 18px; display: none; overflow: hidden; background-color: #f1f1f1; } 1.1 Introduction Chromatin immunoprecipitation with sequencing (ChIP-seq) is a widely used technique for identifying the genomic binding sites of a target protein. Conventional analyses of ChIP-seq data aim to detect absolute binding (i.e., the presence or absence of a binding site) based on peaks in the read coverage. An alternative analysis strategy is to detect of changes in the binding profile between conditions (Ross-Innes et al. 2012; Pal et al. 2013). These differential binding (DB) analyses involve counting reads into genomic intervals and testing those counts for significant differences between conditions. This defines a set of putative DB regions for further examination. DB analyses are statistically easier to perform than their conventional counterparts, as the effect of genomic biases is largely mitigated when counts for different libraries are compared at the same genomic region. DB regions may also be more relevant as the change in binding can be associated with the biological difference between conditions. This book describes the use of the csaw Bioconductor package to detect differential binding (DB) in ChIP-seq experiments with sliding windows (Lun and Smyth 2016). 
In these analyses, we detect and summarize DB regions between conditions in a de novo manner, i.e., without making any prior assumptions about the location or width of bound regions. We demonstrate on data from a variety of real studies focusing on changes in transcription factor binding and histone mark enrichment. Our aim is to facilitate the practical implementation of window-based DB analyses by providing detailed code and expected output. The code here can be adapted to any dataset with multiple experimental conditions and with multiple biological samples within one or more of the conditions; it is similarly straightforward to accommodate batch effects, covariates and additional experimental factors. Indeed, though the book focuses on ChIP-seq, the same software can be adapted to data from any sequencing technique where reads represent coverage of enriched genomic regions.

1.2 How to read this book

The descriptions in this book explore the theoretical and practical motivations behind each step of a csaw analysis. While all users are welcome to read it from start to finish, new users may prefer to examine the case studies presented in the later sections (Lun and Smyth 2015), which provide the important information in a more concise format. Experienced users (or those looking for some nighttime reading!) are more likely to benefit from the in-depth discussions in this document.

All of the workflows described here start from sorted and indexed BAM files in the chipseqDBData package. For application to user-specified data, the raw read sequences have to be aligned to the appropriate reference genome beforehand. Most aligners can be used for this purpose, but we have used Rsubread (Liao, Smyth, and Shi 2013) due to the convenience of its R interface. It is also recommended to mark duplicate reads using tools like Picard prior to starting the workflow.
The statistical methods described here are based upon those in the edgeR package (Robinson, McCarthy, and Smyth 2010). Knowledge of edgeR is useful but not a prerequisite for reading this guide.

1.3 How to get help

Most questions about csaw should be answered by the documentation. Every function mentioned in this guide has its own help page. For example, a detailed description of the arguments and output of the windowCounts() function can be obtained by typing ?windowCounts or help(windowCounts) at the R prompt. Further detail on the methods or the underlying theory can be found in the references at the bottom of each help page.

The authors of the package always appreciate receiving reports of bugs in the package functions or in the documentation. The same goes for well-considered suggestions for improvements. Other questions about how to use csaw are best sent to the Bioconductor support site. Please send requests for general assistance and advice to the support site, rather than to the individual authors. Users posting to the support site for the first time may find it helpful to read the posting guide.

1.4 How to cite this book

Most users of csaw should cite the following in any publications:

A. T. Lun and G. K. Smyth. csaw: a Bioconductor package for differential binding analysis of ChIP-seq data using sliding windows. Nucleic Acids Res., 44(5):e45, Mar 2016

To cite the workflows specifically, we can use:

A. T. L. Lun and G. K. Smyth. From reads to regions: a Bioconductor workflow to detect differential binding in ChIP-seq data. F1000Research, 4, 2015

For people interested in combined p-values, their use in DB analyses was proposed in:

A. T. Lun and G. K. Smyth. De novo detection of differentially bound regions for ChIP-seq data using peaks and windows: controlling error rates correctly. Nucleic Acids Res., 42(11):e95, Jul 2014

The DB analyses shown here use methods from the edgeR package, which has its own citation recommendations.
See the appropriate section of the edgeR user’s guide for more details.

1.5 Quick start

A typical ChIP-seq analysis in csaw would look something like that described below. This assumes that a vector of paths to sorted and indexed BAM files is provided in bam.files and that a design matrix is supplied in design. The code is split across several steps:

library(chipseqDBData)
tf.data <- NFYAData()
tf.data <- head(tf.data, -1) # skip the input.
bam.files <- tf.data$Path
cell.type <- sub("NF-YA ([^ ]+) .*", "\\1", tf.data$Description)
design <- model.matrix(~factor(cell.type))
colnames(design) <- c("intercept", "cell.type")

Loading in data from BAM files.

library(csaw)
param <- readParam(minq=20)
data <- windowCounts(bam.files, ext=110, width=10, param=param)

Filtering out uninteresting regions.

binned <- windowCounts(bam.files, bin=TRUE, width=10000, param=param)
keep <- filterWindowsGlobal(data, binned)$filter > log2(5)
data <- data[keep,]

Calculating normalization factors.

data <- normFactors(binned, se.out=data)

Identifying DB windows.

library(edgeR)
y <- asDGEList(data)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design, robust=TRUE)
results <- glmQLFTest(fit)

Correcting for multiple testing.
merged <- mergeResults(data, results$table, tol=1000L)

Session information: R version 4.1.0 (2021-05-18) on x86_64-pc-linux-gnu, running Ubuntu 20.04.2 LTS, with Bioconductor 3.13 packages including csaw 1.26.0, edgeR 3.34.0 and chipseqDBData 1.8.0.

Chapter 2 Counting reads into windows

2.1 Background

The key step in the DB analysis is the manner in which reads are counted. The most obvious strategy is to count reads into pre-defined regions of interest, like promoters or gene bodies (???). This is simple but will not capture changes outside of those regions. In contrast, de novo analyses do not depend on pre-specified regions, instead using empirically defined peaks or sliding windows for read counting. Peak-based methods are implemented in the DiffBind and DBChIP software packages (Ross-Innes et al. 2012; ???), which count reads into peak intervals that have been identified with software like MACS (???). This requires some care to maintain statistical rigour, as peaks are called with the same data used to test for DB. Alternatively, window-based approaches count reads into sliding windows across the genome.
This is a more direct strategy that avoids problems with data re-use and can provide increased DB detection power (???).

In csaw, we define a window as a fixed-width genomic interval, and we count the number of fragments overlapping that window in each library. For single-end data, each fragment is imputed by directional extension of the read to the average fragment length (Figure 2.1), while for paired-end data, the fragment is defined from the interval spanned by the paired reads. This is repeated after sliding the window along the genome to a new position. A count is then obtained for each window in each library, thus quantifying protein binding intensity across the genome.

Figure 2.1: Schematic of the read extension process for single-end data. Reads are extended to the average fragment length (ext) and the number of overlapping extended reads is counted for each window of size width.

For single-end data, we estimate the average fragment length from a cross-correlation plot (see Section 2.4) for use as ext. Alternatively, the length can be estimated from diagnostics during ChIP or library preparation, e.g., post-fragmentation gel electrophoresis images. Typical values range from 100 to 300 bp, depending on the efficiency of sonication and the use of size selection steps in library preparation.

We interpret the window size (width) as the width of the binding site for the target protein, i.e., its physical “footprint” on the genome. This is user-specified and has important implications for the power and resolution of a DB analysis, which are discussed in Section 2.5. For TF analyses with small windows, the choice of spacing interval will also be affected by the choice of window size; see Section 3.2 for more details.

2.2 Obtaining window-level counts

To demonstrate, we will use some publicly available data from the chipseqDBData package.
The dataset below focuses on changes in the binding profile of the NF-YA transcription factor between embryonic stem cells and terminal neurons (Tiwari et al. 2012).

library(chipseqDBData)
tf.data <- NFYAData()
tf.data
## DataFrame with 5 rows and 3 columns
##          Name   Description      Path
##   <character>   <character>    <List>
## 1   SRR074398 NF-YA ESC (1) <BamFile>
## 2   SRR074399 NF-YA ESC (2) <BamFile>
## 3   SRR074417  NF-YA TN (1) <BamFile>
## 4   SRR074418  NF-YA TN (2) <BamFile>
## 5   SRR074401         Input <BamFile>

bam.files <- head(tf.data$Path, -1) # skip the input.
bam.files
## List of length 4

The windowCounts() function uses a sliding window approach to count fragments for a set of BAM files, supplied as either a character vector or as a list of BamFile objects (from the Rsamtools package). We assume that the BAM files are sorted by position and have been indexed; for character inputs, the index files are assumed to have the same prefixes as the BAM files. It is worth pointing out that a common mistake is to replace or update the BAM file without updating the index, which will cause csaw some grief.

library(csaw)
frag.len <- 110
win.width <- 10
param <- readParam(minq=20)
data <- windowCounts(bam.files, ext=frag.len, width=win.width, param=param)

The function returns a RangedSummarizedExperiment object where the matrix of counts is stored as the first assay. Each row corresponds to a genomic window while each column corresponds to a library. The coordinates of each window are stored in the rowRanges. The total number of reads in each library (also referred to as the library size) is stored as totals in the colData.
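For intuition, the extension-and-counting scheme just described can be mimicked in a few lines of base R. This is only a toy sketch with hypothetical read positions; csaw performs the real counting in compiled code over BAM files.

```r
## Toy illustration of read extension and window counting.
## All positions are hypothetical; 'ext' and 'width' mirror the
## frag.len and win.width settings used above.
ext <- 110     # average fragment length
width <- 10    # window width
spacing <- 50  # distance between window start positions

## Hypothetical 5' positions (on the read's own strand) and strands.
read.pos <- c(101, 120, 150, 400, 430)
read.str <- c("+", "+", "-", "+", "-")

## Directional extension: forward reads extend rightwards from their
## 5' end, reverse reads extend leftwards.
frag.start <- ifelse(read.str == "+", read.pos, read.pos - ext + 1)
frag.end <- frag.start + ext - 1

## Count the fragments overlapping each sliding window.
win.start <- seq(1, 500, by = spacing)
counts <- sapply(win.start, function(s) {
    e <- s + width - 1
    sum(frag.start <= e & frag.end >= s)
})
data.frame(start = win.start, end = win.start + width - 1, counts)
```

Each imputed fragment contributes to every window it overlaps, mirroring the scheme in Figure 2.1.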
# Preview the counts:
head(assay(data))
##      [,1] [,2] [,3] [,4]
## [1,]    2    3    3    4
## [2,]    3    6    3    4
## [3,]    5    5    0    0
## [4,]    4    7    0    0
## [5,]    5    9    0    0
## [6,]    3    9    0    0

# Preview the genomic coordinates:
head(rowRanges(data))
## GRanges object with 6 ranges and 0 metadata columns:
##       seqnames          ranges strand
##          <Rle>       <IRanges>  <Rle>
##   [1]     chr1 3003701-3003710      *
##   [2]     chr1 3003751-3003760      *
##   [3]     chr1 3003801-3003810      *
##   [4]     chr1 3003951-3003960      *
##   [5]     chr1 3004001-3004010      *
##   [6]     chr1 3004051-3004060      *
##   -------
##   seqinfo: 66 sequences from an unspecified genome

# Preview the totals:
data$totals
## [1] 18363361 21752369 25095004 24104691

The above windowCounts() call involved a few arguments, so we will spend the rest of this chapter explaining these in more detail.

2.3 Filtering out low-quality reads

Read extraction from the BAM files is controlled with the param argument in windowCounts(). This takes a readParam object that specifies a number of extraction parameters. The idea is to define the readParam object once for the entire analysis pipeline, which is then reused for all relevant functions. This ensures that read loading is consistent throughout the analysis. (A good measure of synchronisation between windowCounts() calls is to check that the values of totals are identical between calls, which indicates that the same reads are being extracted from the BAM files in each call.)

param
## Extracting reads in single-end mode
## Duplicate removal is turned off
## Minimum allowed mapping score is 20
## Reads are extracted from both strands
## No restrictions are placed on read extraction
## No regions are specified to discard reads

In the example above, reads are filtered out based on the minimum mapping score with the minq argument. Low mapping scores are indicative of incorrectly and/or non-uniquely aligned sequences. Removal of these reads is highly recommended, as it ensures that only reliable alignments are supplied to csaw.
The exact value of the threshold depends on the range of scores provided by the aligner. The subread aligner (Liao, Smyth, and Shi 2013) was used to align the reads in this dataset, so a value of 20 might be appropriate.

Reads mapping to the same genomic position can be marked as putative PCR duplicates using software like the MarkDuplicates program from the Picard suite. Marked reads in the BAM file can be ignored during counting by setting dedup=TRUE in the readParam object. This reduces the variability caused by inconsistent amplification between replicates, and avoids spurious duplicate-driven DB between groups. An example of counting with duplicate removal is shown below, where fewer reads are used from each library relative to data$totals.

dedup.param <- readParam(minq=20, dedup=TRUE)
demo <- windowCounts(bam.files, ext=frag.len, width=win.width, param=dedup.param)
demo$totals
## [1] 14799759 14773466 17424949 20386739

That said, duplicate removal is generally not recommended for routine DB analyses. This is because it caps the number of reads at each position, reducing DB detection power in high-abundance regions. Spurious differences may also be introduced when the same upper bound is applied to libraries of varying size. However, it may be unavoidable in some cases, e.g., for libraries generated from low quantities of DNA. Duplicate removal is also acceptable for paired-end data, as exact overlaps for both paired reads are required to define duplicates. This greatly reduces the probability of incorrectly discarding read pairs from non-duplicate DNA fragments (assuming that a pair-aware method was used during duplicate marking).

2.4 Estimating the fragment length

Cross-correlation plots are generated directly from BAM files using the correlateReads() function. This provides a measure of the immunoprecipitation (IP) efficiency of a ChIP-seq experiment (Kharchenko, Tolstorukov, and Park 2008).
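The principle behind this estimate can be sketched with simulated data in base R: forward-strand and reverse-strand read starts around a binding site are offset by roughly one fragment length, so the correlation between the two strand-specific coverage profiles peaks at that delay. (All positions below are simulated for illustration; correlateReads() computes this from real BAM files.)

```r
## Simulated strand-specific coverage around hypothetical binding
## sites: reverse-strand read starts sit about one fragment length
## downstream of forward-strand read starts.
set.seed(42)
true.len <- 110
sites <- sample(500:9500, 20)  # hypothetical binding site positions
fwd.pos <- rep(sites, each = 5) + rpois(100, 5)
rev.pos <- rep(sites, each = 5) + true.len - rpois(100, 5)

genome.len <- 10000
fwd.cov <- tabulate(fwd.pos, nbins = genome.len)
rev.cov <- tabulate(rev.pos, nbins = genome.len)

## Correlate forward and reverse coverage at each delay distance.
delays <- 0:300
ccf.vals <- sapply(delays, function(d) {
    n <- genome.len - d
    cor(fwd.cov[1:n], rev.cov[(1 + d):genome.len])
})
delays[which.max(ccf.vals)]  # peaks near the true fragment length
```

The delay at the maximum recovers the (simulated) fragment length, which is what the location of the smooth peak in a real cross-correlation plot estimates.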
Efficient IP should yield a smooth peak at a delay distance corresponding to the average fragment length. This reflects the strand-dependent bimodality of reads around narrow regions of enrichment, e.g., TF binding sites.

max.delay <- 500
dedup.on <- initialize(param, dedup=TRUE) # just flips 'dedup=TRUE' in the existing 'param'.
x <- correlateReads(bam.files, max.delay, param=dedup.on)
plot(0:max.delay, x, type="l", ylab="CCF", xlab="Delay (bp)")

Figure 2.2: Cross-correlation plot of the NF-YA dataset.

The location of the peak is used as an estimate of the fragment length for read extension in windowCounts(). An estimate of ~110 bp is obtained from the plot above. We can do this more precisely with the maximizeCcf() function, which returns a similar value.

maximizeCcf(x)
## [1] 112

A sharp spike may also be observed in the plot at a distance corresponding to the read length. This is thought to be an artifact caused by the preference of aligners towards uniquely mapped reads. Duplicate removal is typically required here (i.e., set dedup=TRUE in readParam()) to reduce the size of this spike; otherwise, the fragment length peak will not be visible as a separate entity. The size of the smooth peak can also be compared to the height of the spike to assess the signal-to-noise ratio of the data (Landt et al. 2012). Poor IP efficiency will result in a smaller or absent peak as bimodality is less pronounced.

Cross-correlation plots can also be used for fragment length estimation of narrow histone marks such as histone acetylation and H3K4 methylation (Figure 2.3). However, they are less effective for regions of diffuse enrichment where bimodality is not obvious (e.g., H3K27 trimethylation).

n <- 1000 # Using more data sets from 'chipseqDBData'.
acdata <- H3K9acData()
h3k9ac <- correlateReads(acdata$Path[1], n, param=dedup.on)
k27data <- H3K27me3Data()
h3k27me3 <- correlateReads(k27data$Path[1], n, param=dedup.on)
k4data <- H3K4me3Data()
h3k4me3 <- correlateReads(k4data$Path[1], n, param=dedup.on)

plot(0:n, h3k9ac, col="blue", ylim=c(0, 0.1), xlim=c(0, 1000),
    xlab="Delay (bp)", ylab="CCF", pch=16, type="l", lwd=2)
lines(0:n, h3k27me3, col="red", pch=16, lwd=2)
lines(0:n, h3k4me3, col="forestgreen", pch=16, lwd=2)
legend("topright", col=c("blue", "red", "forestgreen"),
    c("H3K9ac", "H3K27me3", "H3K4me3"), pch=16)

Figure 2.3: Cross-correlation plots for a variety of histone mark datasets.

In general, use of different extension lengths is unnecessary in well-controlled datasets. Differences in lengths between libraries are usually smaller than 50 bp. This is less than the inherent variability in fragment lengths within each library (see the histogram for the paired-end data in Section 3.3). The effect on the coverage profile of within-library variability in lengths will likely mask the effect of small between-library differences in the average lengths. Thus, an ext list should only be specified for datasets that exhibit large differences in the average fragment sizes between libraries.

2.5 Choosing a window size

We interpret the window size as the width of the binding “footprint” for the target protein, where the protein residues directly contact the DNA. TF analyses typically use a small window size, e.g., 10-20 bp, which maximizes spatial resolution for optimal detection of narrow regions of enrichment. For histone marks, widths of at least 150 bp are recommended (Humburg et al. 2011). This corresponds to the length of DNA wrapped up in each nucleosome, which is the smallest relevant unit for histone mark enrichment.
We consider diffuse marks as chains of adjacent histones, for which the combined footprint may be very large (e.g., 1-10 kbp).

The choice of window size controls the compromise between spatial resolution and count size. Larger windows will yield larger read counts that can provide more power for DB detection. However, spatial resolution is also lost for large windows, whereby adjacent features can no longer be distinguished. Reads from a DB site may be counted alongside reads from a non-DB site (e.g., non-specific background) or even those from an adjacent site that is DB in the opposite direction. This will result in the loss of DB detection power.

We might expect to be able to infer the optimal window size from the data, e.g., based on the width of the enriched regions. However, in practice, a clear-cut choice of window size is rarely found in real datasets. For many non-TF targets, the widths of the enriched regions can be highly variable, suggesting that no single window size is optimal. Indeed, even if all enriched regions were of constant width, the width of the DB events occurring in those regions may be variable. This is especially true of diffuse marks, where the compromise between resolution and power is more arbitrary.

We suggest performing an initial DB analysis with small windows to maintain spatial resolution. The widths of the final merged regions can provide an indication of the appropriate window size. Alternatively, the analysis can be repeated with a series of larger windows, and the results combined. This examines a spread of resolutions for more comprehensive detection of DB regions.
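The cancellation effect described above can be illustrated with some hypothetical counts for two adjacent sites that change in opposite directions between conditions A and B:

```r
## Hypothetical counts for two adjacent binding sites that are DB
## in opposite directions between conditions A and B.
site1 <- c(A = 50, B = 10)  # loses binding in B
site2 <- c(A = 10, B = 50)  # gains binding in B

## Small windows see each site separately, giving clear changes.
log2(site1["B"] / site1["A"])  # strong decrease
log2(site2["B"] / site2["A"])  # strong increase

## One large window spanning both sites sums the counts, and the
## opposing changes cancel out completely.
big.window <- site1 + site2
log2(big.window["B"] / big.window["A"])  # 0: the DB signal is lost
```

This is the extreme case, but partial cancellation from any mixture of opposing or background signal will similarly dilute the evidence for DB in large windows.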
Session information: as for Chapter 1, with csaw 1.26.0 under R 4.1.0 and Bioconductor 3.13.

Chapter 3 More counting options

3.1 Avoiding problematic genomic regions

Read extraction and counting can be restricted to particular chromosomes by specifying the names of the chromosomes of interest in restrict. This avoids the need to count reads on unassigned contigs or uninteresting chromosomes, e.g., the mitochondrial genome for ChIP-seq studies targeting nuclear factors. Alternatively, it allows windowCounts() to work on huge datasets or in limited memory by analyzing only one chromosome at a time.

library(csaw)
restrict.param <- readParam(restrict=c("chr1", "chr10", "chrX"))

Reads lying in certain regions can also be removed by specifying the coordinates of those regions in discard. This is intended to remove reads that are wholly aligned within known repeat regions but were not removed by the minq filter. Repeats are problematic as different repeat units in an actual genome are usually reported as a single unit in the genome build.
Alignment of all (non-specifically immunoprecipitated) reads from the former will result in artificially high coverage of the latter. More importantly, any changes in repeat copy number or accessibility between conditions can lead to spurious DB at this single unit. Removal of reads within repeat regions can avoid detection of these irrelevant differences.

repeats <- GRanges("chr1", IRanges(3000001, 3041000)) # telomere
discard.param <- readParam(discard=repeats)

Coordinates of annotated repeats can be obtained from several different sources. A curated blacklist of problematic regions is available from the ENCODE project (Consortium 2012) for various organisms. This list is constructed empirically from the ENCODE datasets and includes obvious offenders like telomeres, microsatellites and some rDNA genes. We generally prefer to use the ENCODE blacklist for most applications where blacklisting is necessary.

Alternatively, repeats can be predicted from the genome sequence using software like RepeatMasker. These calls are available from the UCSC website (e.g., for mouse) or they can be extracted from an appropriate masked BSgenome object. This contains a greater number of problematic regions (especially microsatellites) compared to the ENCODE blacklist, though genuine DB sites may also be removed. If negative control samples are available, they can be used to empirically identify problematic regions with the GreyListChIP package. These regions should be ignored as they have high coverage in the controls and are unlikely to be genuine binding sites.

Using discard is more appropriate than simply ignoring windows that overlap the repeat regions. For example, a large window might contain both repeat and non-repeat regions. Discarding the window because of the former will compromise detection of DB features in the latter. Of course, any DB sites within the discarded regions will be lost from downstream analyses.
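The difference between discarding reads and discarding whole windows can be seen in a toy base R example (all coordinates hypothetical):

```r
## Toy comparison of read-level versus window-level discarding.
## A single window spans both a repeat region and adjacent unique
## sequence; coordinates are hypothetical.
win <- c(start = 1, end = 100)
rep.region <- c(start = 1, end = 50)

## Hypothetical aligned read positions (read length 10).
read.start <- c(5, 20, 40, 60, 80)
read.end <- read.start + 9

## discard= removes only the reads lying wholly within the repeat...
within.repeat <- read.start >= rep.region["start"] &
    read.end <= rep.region["end"]
sum(!within.repeat)  # reads still counted for this window

## ...whereas dropping any window overlapping the repeat would also
## lose the reads covering the unique sequence at positions 51-100.
```

Here, read-level discarding retains the two reads in the unique portion of the window, so a DB site at positions 51-100 would still be detectable.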
Some caution is therefore required when specifying the regions of disinterest. For example, many more repeats are called by RepeatMasker than are present in the ENCODE blacklist, so the use of the former may result in loss of potentially interesting features.

3.2 Increasing speed and memory efficiency

The spacing parameter controls the distance between adjacent windows in the genome. By default, this is set to 50 bp, i.e., sliding windows are shifted 50 bp forward at each step. Using a higher value will reduce computational work as fewer features need to be counted, and may be useful when machine memory is limited. Of course, spatial resolution is lost with larger spacings as adjacent positions are not counted and thus cannot be distinguished.

Set-up code from the previous chapter:

#--- loading-files ---#
library(chipseqDBData)
tf.data <- NFYAData()
bam.files <- head(tf.data$Path, -1) # skip the input.

#--- counting-windows ---#
library(csaw)
frag.len <- 110
win.width <- 10
param <- readParam(minq=20)
data <- windowCounts(bam.files, ext=frag.len, width=win.width, param=param)

demo <- windowCounts(bam.files, spacing=100, ext=frag.len, width=win.width, param=param)
head(rowRanges(demo))
## GRanges object with 6 ranges and 0 metadata columns:
##       seqnames          ranges strand
##          <Rle>       <IRanges>  <Rle>
##   [1]     chr1 3003701-3003710      *
##   [2]     chr1 3003801-3003810      *
##   [3]     chr1 3004001-3004010      *
##   [4]     chr1 3008301-3008310      *
##   [5]     chr1 3008401-3008410      *
##   [6]     chr1 3010801-3010810      *
##   -------
##   seqinfo: 66 sequences from an unspecified genome

While the default is usually satisfactory, users can improve efficiency by increasing the spacing to a value up to (width + ext)/2. This reduces the computational work by decreasing the number of windows and extracted counts. Any loss in spatial resolution due to a larger spacing interval is negligible compared to that already lost by using a large window size.
The suggested upper bound ensures that a narrow binding site will not be overlooked if it falls between two windows.

Windows that are overlapped by few fragments are filtered out based on the filter argument. A window is removed if the sum of counts across all libraries is below filter. This improves memory efficiency by discarding the majority of low-abundance windows corresponding to uninteresting background regions. The default value of the filter threshold is 10, though it can be raised to reduce memory usage for large libraries. More sophisticated filtering is recommended and should be applied later in the analysis.

demo <- windowCounts(bam.files, ext=frag.len, width=win.width, filter=30, param=param)
head(assay(demo))
##      [,1] [,2] [,3] [,4]
## [1,]    4   12   10    5
## [2,]    5    6    8   11
## [3,]    7    1    9   19
## [4,]    3    1    7   22
## [5,]    4    2   10   15
## [6,]    6    5   10   11

Users can parallelize read counting and several other functions by setting the BPPARAM argument. This will load and process reads from multiple BAM files simultaneously. The number of workers and type of parallelization can be specified using BiocParallelParam objects. By default, parallelization is turned off (i.e., set to a SerialParam object) because it provides little benefit for small files or on systems with I/O bottlenecks.

3.3 Dealing with paired-end data

Paired-end datasets are accommodated by setting pe="both" in the param object supplied to windowCounts(). Read extension is not required as the genomic interval spanned by the originating fragment is explicitly defined as that between the 5' positions of the paired reads. The number of fragments overlapping each window is then counted as previously described. By default, only proper pairs are used, in which the two paired reads are on the same chromosome, face inward and are no more than max.frag apart.

# Using the BAM file in Rsamtools as an example.
pe.bam &lt;- system.file(&quot;extdata&quot;, &quot;ex1.bam&quot;, package=&quot;Rsamtools&quot;, mustWork=TRUE) pe.param &lt;- readParam(max.frag=400, pe=&quot;both&quot;) demo &lt;- windowCounts(pe.bam, ext=250, param=pe.param) demo$totals ## [1] 1572 A suitable value for max.frag is chosen by examining the distribution of fragment sizes from the getPESizes() function. In this example, we might use a value of around 400 bp as it is larger than the vast majority of fragment sizes (Figure 3.1). The plot can also be used to examine the quality of the PE sequencing procedure. The location of the mode should be consistent with the fragmentation and size selection steps in library preparation. out &lt;- getPESizes(pe.bam) frag.sizes &lt;- out$sizes[out$sizes&lt;=800] hist(frag.sizes, breaks=50, xlab=&quot;Fragment sizes (bp)&quot;, ylab=&quot;Frequency&quot;, main=&quot;&quot;, col=&quot;grey80&quot;) abline(v=400, col=&quot;red&quot;) Figure 3.1: Distribution of fragment sizes in an example paired-end dataset. The number of fragments exceeding the maximum size is recorded for quality control. The getPESizes() function also returns the number of single reads, pairs with one unmapped read, improperly orientated pairs and inter-chromosomal pairs. A non-negligible proportion of these reads may be indicative of problems with paired-end alignment or sequencing. c(out$diagnostics, too.large=sum(out$sizes &gt; 400)) ## total.reads mapped.reads single mate.unmapped unoriented ## 3307 3307 0 163 0 ## inter.chr too.large ## 0 0 Note that all of the paired-end methods in csaw depend on correct mate information for each alignment. This is usually enforced by the aligner in the output BAM file. Any file manipulations that might break the synchronisation should be corrected (e.g., with the FixMateInformation program from the Picard suite) prior to read counting. Paired-end data can also be treated as single-end by specifying pe=\"first\" or \"second\" in the readParam() constructor. 
This will only use the first or second read of each read pair, regardless of the validity of the pair or the relative quality of the alignments. This setting may be useful for contrasting paired- and single-end analyses, or in disastrous situations where paired-end sequencing has failed, e.g., due to ligation between DNA fragments. first.param &lt;- readParam(pe=&quot;first&quot;) demo &lt;- windowCounts(pe.bam, param=first.param) demo$totals ## [1] 1654 3.4 Other counting strategies 3.4.1 Assigning reads into bins Setting bin=TRUE will direct windowCounts() to count reads into contiguous bins across the genome. Here, spacing is set to width such that each window forms a bin. For single-end data, only the 5’ end of each read is used for counting into bins, without any directional extension. For paired-end data, the midpoint of the originating fragment is used. demo &lt;- windowCounts(bam.files, width=1000, bin=TRUE, param=param) head(rowRanges(demo)) ## GRanges object with 6 ranges and 0 metadata columns: ## seqnames ranges strand ## &lt;Rle&gt; &lt;IRanges&gt; &lt;Rle&gt; ## [1] chr1 3000001-3001000 * ## [2] chr1 3001001-3002000 * ## [3] chr1 3002001-3003000 * ## [4] chr1 3003001-3004000 * ## [5] chr1 3004001-3005000 * ## [6] chr1 3005001-3006000 * ## ------- ## seqinfo: 66 sequences from an unspecified genome The filter argument is automatically set to 1, which means that counts will be returned for each non-empty genomic bin. Users should set width to a reasonably large value, to avoid running out of memory with a large number of small bins. We can also force windowCounts() to return counts for all bins by setting filter=0 manually. 3.4.2 Manually specified regions While csaw focuses on counting reads into windows, it may be occasionally desirable to use the same conventions (e.g., duplicate removal, quality score filtering) when counting reads into pre-specified regions. 
This can be performed with the regionCounts() function, which is largely a wrapper for countOverlaps() from the GenomicRanges package. my.regions &lt;- GRanges(c(&quot;chr11&quot;, &quot;chr12&quot;, &quot;chr15&quot;), IRanges(c(75461351, 95943801, 21656501), c(75461610, 95944810, 21657610))) reg.counts &lt;- regionCounts(bam.files, my.regions, ext=frag.len, param=param) head(assay(reg.counts)) ## [,1] [,2] [,3] [,4] ## [1,] 43 68 116 111 ## [2,] 1 0 0 0 ## [3,] 17 10 14 12 3.4.3 Strand-specific counting Techniques like CLIP-seq, MeDIP-seq or CAGE provide strand-specific sequence information. csaw can analyze these datasets through strand-specific counting via the strandedCounts() wrapper function. The strand of each output range indicates the strand on which reads were counted for that row. Up to two rows can be generated for each window or region, depending on filtering. ss.param &lt;- initialize(param, forward=logical(0)) # flipping &#39;forward&#39; to a new value. ss.counts &lt;- strandedCounts(bam.files, ext=frag.len, width=win.width, param=ss.param) strand(rowRanges(ss.counts)) ## factor-Rle of length 924432 with 72 runs ## Lengths: 67133 64898 22552 22019 38909 ... 1819 6 3 70 82 ## Values : + - + - + ... - + - + - ## Levels(3): + - * Note that strandedCounts() operates internally by calling windowCounts() (or regionCounts()) twice with different settings for param$forward. Specifically, setting forward=TRUE or FALSE would direct windowCounts() to only count reads on the forward or reverse strand. strandedCounts() itself will only accept a logical(0) value for this slot, in order to protect the user; any attempt to re-use ss.param in functions that are not designed for strand specificity will (appropriately) raise an error. 3.5 Handling variable fragment lengths In rare cases, there will be large systematic differences in the fragment lengths between libraries. 
For example, samples with less efficient fragmentation will exhibit larger fragment lengths and wider peaks. Single-end reads in the peaks of such libraries will require more directional extension to impute a fragment interval that covers the binding site. The windowCounts() function supports the use of library-specific fragment lengths, though some work is required to avoid detecting irrelevant DB from differences in peak widths. This is achieved by resizing the inferred fragments to the same length in all libraries. Consider a bimodal peak, present in several libraries that have different fragment lengths. Resizing ensures that the subpeak on the forward strand is centered at the same location in each library; similarly for the subpeak on the reverse strand. Thus, the effect of differences in peak width between libraries can be largely mitigated. Variable read extension is performed in windowCounts() by setting ext to a list with two elements. The first element is a vector where each entry specifies the average fragment length to be used for the corresponding library. The second specifies the final length to which the inferred fragments are to be resized. If the second element is set to NA, no rescaling is performed and the library-specific fragment sizes are used directly. This also works for analyses with paired-end data, though the first element of ext will be ignored as directional extension is not performed. The example below rescales all fragments to 200 bp in all libraries. Extension information is stored in the RangedSummarizedExperiment object for later use. multi.frag.lens &lt;- list(c(100, 150, 200, 250), 200) demo &lt;- windowCounts(bam.files, ext=multi.frag.lens, filter=30, param=param) demo$ext ## [1] 100 150 200 250 metadata(demo)$final ## [1] 200 That said, use of different extension lengths is generally unnecessary in well-controlled datasets. Differences in length between libraries are usually smaller than 50 bp. 
This is less than the inherent variability in fragment lengths within each library (see the histogram for the paired-end data in Section 3.3). The effect on the coverage profile of within-library variability in lengths will likely mask the effect of small between-library differences in the average lengths. Session information View session info R version 4.1.0 (2021-05-18) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] csaw_1.26.0 SummarizedExperiment_1.22.0 [3] Biobase_2.52.0 MatrixGenerics_1.4.0 [5] matrixStats_0.58.0 GenomicRanges_1.44.0 [7] GenomeInfoDb_1.28.0 IRanges_2.26.0 [9] S4Vectors_0.30.0 BiocGenerics_0.38.0 [11] BiocStyle_2.20.0 rebook_1.2.0 loaded via a namespace (and not attached): [1] locfit_1.5-9.4 xfun_0.23 bslib_0.2.5.1 [4] lattice_0.20-44 htmltools_0.5.1.1 yaml_2.2.1 [7] XML_3.99-0.6 rlang_0.4.11 jquerylib_0.1.4 [10] BiocParallel_1.26.0 CodeDepends_0.6.5 GenomeInfoDbData_1.2.6 [13] stringr_1.4.0 zlibbioc_1.38.0 Biostrings_2.60.0 [16] codetools_0.2-18 evaluate_0.14 knitr_1.33 [19] highr_0.9 Rcpp_1.0.6 edgeR_3.34.0 [22] filelock_1.0.2 BiocManager_1.30.15 limma_3.48.0 [25] DelayedArray_0.18.0 graph_1.70.0 jsonlite_1.7.2 [28] XVector_0.32.0 Rsamtools_2.8.0 dir.expiry_1.0.0 [31] metapod_1.0.0 digest_0.6.27 stringi_1.6.2 [34] bookdown_0.22 grid_4.1.0 tools_4.1.0 [37] bitops_1.0-7 magrittr_2.0.1 sass_0.4.0 [40] RCurl_1.98-1.3 crayon_1.4.1 Matrix_1.3-3 [43] rmarkdown_2.8 R6_2.5.0 compiler_4.1.0 Bibliography 
"],["chap-filter.html", "Chapter 4 Filtering out uninteresting windows 4.1 Overview 4.2 By count size 4.3 By proportion 4.4 By global enrichment 4.5 By local enrichment 4.6 With negative controls 4.7 By prior information 4.8 Some final comments about filtering Session information", " Chapter 4 Filtering out uninteresting windows 4.1 Overview Many of the low abundance windows in the genome correspond to background regions in which DB is not expected. Indeed, windows with low counts will not provide enough evidence against the null hypothesis to obtain sufficiently low \\(p\\)-values for DB detection. Similarly, some approximations used in the statistical analysis will fail at low counts. Removing such uninteresting or ineffective tests reduces the severity of the multiple testing correction, increases detection power amongst the remaining tests and reduces computational work. Filtering is valid so long as it is independent of the test statistic under the null hypothesis (Bourgon, Gentleman, and Huber 2010). In the negative binomial (NB) framework, this (probably) corresponds to filtering on the overall NB mean. The DB \\(p\\)-values retained after filtering on the overall mean should be uniform under the null hypothesis, by analogy to the normal case. Row sums can also be used for datasets where the effective library sizes are not very different, or where the counts are assumed to be Poisson-distributed between biological replicates. In edgeR, the log-transformed overall NB mean is referred to as the average abundance. This is computed with the aveLogCPM() function, as shown below for each window. 
View set-up code #--- loading-files ---# library(chipseqDBData) tf.data &lt;- NFYAData() tf.data bam.files &lt;- head(tf.data$Path, -1) # skip the input. bam.files #--- counting-windows ---# library(csaw) frag.len &lt;- 110 win.width &lt;- 10 param &lt;- readParam(minq=20) data &lt;- windowCounts(bam.files, ext=frag.len, width=win.width, param=param) library(csaw) library(edgeR) abundances &lt;- aveLogCPM(asDGEList(data)) summary(abundances) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## -2.33 -2.30 -2.21 -2.14 -2.08 7.93 For demonstration purposes, an arbitrary threshold of -1 is used here to filter the window abundances. This restricts the analysis to windows with abundances above this threshold. keep &lt;- abundances &gt; -1 filtered.data &lt;- data[keep,] summary(keep) ## Mode FALSE TRUE ## logical 4688607 15708 The exact choice of filter threshold may not be obvious. In particular, there is often no clear distinction in abundances between genuine binding and background events, e.g., due to the presence of many weak but genuine binding sites. A threshold that is too small will be ineffective, whereas a threshold that is too large may decrease power by removing true DB sites. Arbitrariness is unavoidable when balancing these opposing considerations. Nonetheless, several strategies for defining the threshold are described below. Users should start by choosing one of these filtering approaches to implement in their analyses. Each approach yields a logical vector that can be used in the same way as keep. 4.2 By count size The simplest approach is to filter according to the count size. This removes windows for which the counts are simply too low for modelling and hypothesis testing. The code below retains windows with (library size-adjusted) average counts greater than 5. 
keep &lt;- abundances &gt; aveLogCPM(5, lib.size=mean(data$totals)) summary(keep) ## Mode FALSE TRUE ## logical 4599548 104767 However, a count-based filter becomes less effective as the library size increases. More windows will be retained with greater sequencing depth, even in uninteresting background regions. This increases both computational work and the severity of the multiplicity correction. The threshold may also be inappropriate when library sizes are very different. 4.3 By proportion One approach is to assume that only a certain proportion (say, 0.1%) of the genome is genuinely bound. This corresponds to the top proportion of high-abundance windows. The total number of windows is calculated from the genome length and the spacing interval used in windowCounts(). The filterWindowsProportion() function returns the ratio of the rank of each window to this total, where higher-abundance windows have larger ranks. Users can then retain those windows with rank ratios above the unbound proportion of the genome. keep &lt;- filterWindowsProportion(data)$filter &gt; 0.999 sum(keep) ## [1] 54616 This approach is simple and has the practical advantage of maintaining a constant number of windows for the downstream analysis. However, it may not adapt well to different datasets where the proportion of bound sites can vary. Using an inappropriate percentage of binding sites will result in the loss of potential DB regions or inclusion of background regions. 4.4 By global enrichment An alternative approach involves choosing a filter threshold based on the fold change over the level of non-specific enrichment. The degree of background enrichment is estimated by counting reads into large bins across the genome. Binning is necessary here to increase the size of the counts when examining low-density background regions. This ensures that precision is maintained when estimating the background abundance. 
bin.size &lt;- 2000L binned &lt;- windowCounts(bam.files, bin=TRUE, width=bin.size, param=param) The median of the average abundances across all bins is computed and used as a global estimate of the background coverage. This global background is then compared to the window-based abundances. This determines whether a window is driven by background enrichment and is thus unlikely to be interesting. However, some care is required as the sizes of the regions used for read counting are different between bins and windows. The average abundance of each bin must be scaled down to be comparable to those of the windows. The filterWindowsGlobal() function returns the increase in the abundance of each window over the global background. Windows are filtered by setting some minimum threshold on this increase. The aim is to eliminate the majority of uninteresting windows prior to further analysis. Here, a fold change of 3 is necessary for a window to be considered as containing a binding site. This approach has an intuitive and experimentally relevant interpretation that adapts to the level of non-specific enrichment in the dataset. filter.stat &lt;- filterWindowsGlobal(data, background=binned) keep &lt;- filter.stat$filter &gt; log2(3) sum(keep) ## [1] 23948 We can visualize the effect of filtering (Figure 4.1) to confirm that the bulk of windows (presumably in background regions) are indeed discarded upon filtering. One might hope to see a bimodal distribution due to windows containing genuine binding sites, but this is usually not visible due to the dominance of background regions in the genome. hist(filter.stat$filter, xlab=&quot;Log-fold change from global background&quot;, breaks=100, main=&quot;&quot;, col=&quot;grey80&quot;, xlim=c(0, 5)) abline(v=log2(3), col=&quot;red&quot;, lwd=2) Figure 4.1: Distribution of the log-increase in coverage over the global background for each window in the NF-YA dataset. The red line denotes the chosen threshold for filtering. 
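Before committing to a single cutoff, it can be helpful to tabulate how many windows survive at several candidate thresholds. The sketch below is an illustration rather than part of the original workflow; it reuses the filter.stat object computed above.

```r
# Hypothetical sketch: count retained windows at several fold-change
# cutoffs over the global background.
cutoffs <- log2(c(2, 3, 4, 5))
n.kept <- sapply(cutoffs, function(x) sum(filter.stat$filter > x))
names(n.kept) <- paste0(c(2, 3, 4, 5), "-fold")
n.kept
```

A sharp drop between consecutive cutoffs indicates that many windows sit just above the background, which may warrant a more cautious choice of threshold.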
Of course, the pre-specified minimum fold change may be too aggressive when binding is weak. For TF data, a large cut-off works well as narrow binding sites will have high read densities and are unlikely to be lost during filtering. Smaller minimum fold changes are recommended for diffuse marks where the difference from background is less obvious. 4.5 By local enrichment 4.5.1 Mimicking single-sample peak callers Local background estimators can also be constructed, which avoids inappropriate filtering when there are differences in background coverage across the genome. Here, the 2 kbp region surrounding each window will be used as the “neighborhood” over which a local estimate of non-specific enrichment for that window can be obtained. The counts for these regions are first obtained with the regionCounts() function. This should be synchronized with windowCounts() by using the same param, if any non-default settings were used. surrounds &lt;- 2000 neighbor &lt;- suppressWarnings(resize(rowRanges(data), surrounds, fix=&quot;center&quot;)) wider &lt;- regionCounts(bam.files, regions=neighbor, ext=frag.len, param=param) We apply filterWindowsLocal() to compute enrichment values, i.e., the increase in the abundance of each window over its neighborhood. In this function, counts for each window are subtracted from the counts for its neighborhood. This ensures that any enriched regions or binding sites inside the window will not interfere with estimation of its local background. The width of the window is also subtracted from that of its neighborhood, to reflect the effective size of the latter after subtraction of counts. Based on the fold-differences in widths, the abundance of the neighborhood is scaled down for a valid comparison to that of the corresponding window. Enrichment values are subsequently calculated from the differences in scaled abundances. filter.stat &lt;- filterWindowsLocal(data, wider) summary(filter.stat$filter) ## Min. 1st Qu. Median Mean 3rd Qu. 
Max. ## -6.287 0.438 0.575 0.588 0.727 8.149 Filtering can then be performed using a quantile- or fold change-based threshold on the enrichment values. In this scenario, a 3-fold increase in enrichment over the neighborhood abundance is required for retention of each window (Figure 4.2). This roughly mimics the behavior of single-sample peak-calling programs such as MACS (Zhang et al. 2008). keep &lt;- filter.stat$filter &gt; log2(3) sum(keep) ## [1] 10701 hist(filter.stat$filter, xlab=&quot;Log-fold change from local background&quot;, breaks=100, main=&quot;&quot;, col=&quot;grey80&quot;, xlim=c(0, 5)) abline(v=log2(3), col=&quot;red&quot;, lwd=2) Figure 4.2: Distribution of the log-increase in coverage over the local background for each window in the NF-YA dataset. The red line denotes the chosen threshold for filtering. Note that this procedure also assumes that no other enriched regions are present in each neighborhood. Otherwise, the local background will be overestimated and windows may be incorrectly filtered out. This may be problematic for diffuse histone marks or clusters of TF binding sites, where enrichment may be observed in both the window and its neighborhood. If this seems too complicated, an alternative is to identify locally enriched regions using peak-callers like MACS. Filtering can then be performed to retain only windows within called peaks. However, peak calling must be done independently of the DB status of each window. If libraries are of similar size or biological variability is low, reads can be pooled into one library for single-sample peak calling (Lun and Smyth 2014). This is equivalent to filtering on the average count and avoids loss of the type I error control from data snooping. 4.5.2 Identifying local maxima Another strategy is to use the findMaxima() function to identify local maxima in the read density across the genome. 
The code below will determine if each window is a local maximum, i.e., whether it has the highest average abundance within 1 kbp on either side. The data can then be filtered to retain only these locally maximal windows. This can also be combined with other filters to ensure that the retained windows have high absolute abundance. maxed &lt;- findMaxima(rowRanges(data), range=1000, metric=abundances) summary(maxed) ## Mode FALSE TRUE ## logical 3782693 921622 This approach is very aggressive and should only be used (sparingly) in datasets where binding is sharp, simple and isolated. Complex binding events involving diffuse enrichment or adjacent binding sites will not be handled well. For example, DB detection will fail if a low-abundance DB window is ignored in favor of a high-abundance non-DB neighbor. 4.6 With negative controls Negative controls for ChIP-seq refer to input or IgG libraries where the IP step has been skipped or compromised with an irrelevant antibody, respectively. This accounts for sequencing/mapping biases in ChIP-seq data. IgG controls also quantify the amount of non-specific enrichment throughout the genome. These controls are mostly irrelevant when testing for DB between ChIP samples. However, they can be used to filter out windows where the average abundance across the ChIP samples is below the abundance of the control. To illustrate, let us add an input library to our NF-YA data set in the code below. library(chipseqDBData) tf.data &lt;- NFYAData() with.input &lt;- tf.data$Path in.demo &lt;- windowCounts(with.input, ext=frag.len, param=param) chip &lt;- in.demo[,1:4] # All ChIP libraries control &lt;- in.demo[,5] # All control libraries Some additional work is required to account for composition biases that are likely to be present when comparing ChIP to negative control samples (see Section 5.2). 
A simple strategy for normalization involves counting reads into large bins, which are used in scaleControlFilter() to compute a normalization factor. in.binned &lt;- windowCounts(with.input, bin=TRUE, width=10000, param=param) chip.binned &lt;- in.binned[,1:4] control.binned &lt;- in.binned[,5] scale.info &lt;- scaleControlFilter(chip.binned, control.binned) We use the filterWindowsControl() function to compute the enrichment of the ChIP counts over the control counts for each window. This uses scale.info to adjust for composition biases between ChIP and control samples. A larger prior.count of 5 is also used to compute the average abundance. This protects against inflated log-fold changes when the count for the window in the control sample is near zero. filter.stat &lt;- filterWindowsControl(chip, control, prior.count=5, scale.info=scale.info) The log-fold enrichment of the ChIP sample over the control is then computed for each window, after normalizing for composition bias with the binned counts. The example below requires a 3-fold or greater increase in abundance over the control to retain each window (Figure 4.3). keep &lt;- filter.stat$filter &gt; log2(3) sum(keep) ## [1] 6657 hist(filter.stat$filter, xlab=&quot;Log-fold change from control&quot;, breaks=100, main=&quot;&quot;, col=&quot;grey80&quot;, xlim=c(0, 5)) abline(v=log2(3), col=&quot;red&quot;, lwd=2) Figure 4.3: Distribution of the log-increase in average abundance for the ChIP samples over the control for each window in the NF-YA dataset. The red line denotes the chosen threshold for filtering. As an aside, the csaw pipeline can also be applied to search for \"DB\" between ChIP libraries and control libraries. The ChIP and control libraries can be treated as separate groups, in which most \"DB\" events are expected to be enriched in the ChIP samples. 
If this is the case, the filtering procedure described above is inappropriate as it will select for windows with differences between ChIP and control samples. This compromises the assumption of the null hypothesis during testing, resulting in loss of type I error control. 4.7 By prior information When only a subset of genomic regions are of interest, DB detection power can be improved by removing windows lying outside of these regions. Such regions could include promoters, enhancers, gene bodies or exons. Alternatively, sites could be defined from a previous experiment or based on the genome sequence, e.g., TF motif matches. The example below retrieves the coordinates of the broad gene bodies from the mouse genome, including the 3 kbp region upstream of the TSS that represents the putative promoter region for each gene. library(TxDb.Mmusculus.UCSC.mm10.knownGene) broads &lt;- genes(TxDb.Mmusculus.UCSC.mm10.knownGene) broads &lt;- resize(broads, width(broads)+3000, fix=&quot;end&quot;) head(broads) ## GRanges object with 6 ranges and 1 metadata column: ## seqnames ranges strand | gene_id ## &lt;Rle&gt; &lt;IRanges&gt; &lt;Rle&gt; | &lt;character&gt; ## 100009600 chr9 21062393-21076096 - | 100009600 ## 100009609 chr7 84935565-84967115 - | 100009609 ## 100009614 chr10 77708457-77712009 + | 100009614 ## 100009664 chr11 45805087-45841171 + | 100009664 ## 100012 chr4 144157557-144165663 - | 100012 ## 100017 chr4 134741554-134771024 - | 100017 ## ------- ## seqinfo: 66 sequences (1 circular) from mm10 genome Windows can be filtered to only retain those which overlap with the regions of interest. Discerning users may wish to distinguish between full and partial overlaps, though this should not be a significant issue for small windows. This could also be combined with abundance filtering to retain windows that contain putative binding sites in the regions of interest. 
suppressWarnings(keep &lt;- overlapsAny(rowRanges(data), broads)) sum(keep) ## [1] 2700568 Any information used here should be independent of the DB status under the null in the current dataset. For example, DB calls from a separate dataset and/or independent annotation can be used without problems. However, using DB calls from the same dataset to filter regions would violate the null assumption and compromise type I error control. In addition, this filter is unlike the others in that it does not operate on the abundance of the windows. It is possible that the set of retained windows may be very small, e.g., if no non-empty windows overlap the pre-defined regions of interest. Thus, it may be better to apply this filter before the multiplicity correction but after DB testing. This ensures that there are sufficient windows for stable estimation of the downstream statistics. 4.8 Some final comments about filtering It should be stressed that these filtering strategies do not eliminate subjectivity. Some thought is still required in selecting an appropriate proportion of bound sites or minimum fold change above background for each method. Rather, these filters provide a relevant interpretation for what would otherwise be an arbitrary threshold on the abundance. As a general rule, users should filter less aggressively if there is any uncertainty about the features of interest. In particular, the thresholds shown in this chapter for each filtering statistic are fairly mild. This ensures that more potentially DB windows are retained for testing. Use of an aggressive filter risks the complete loss of detection for such windows, even if power is improved among those that are retained. Low numbers of retained windows may also lead to unstable estimates during, e.g., normalization or variance modelling. Different filters can also be combined in more advanced applications, e.g., by running keep1 &amp; keep2 for filter vectors keep1 and keep2. 
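One way to make this combination concrete is sketched below; this is an illustration rather than code from the original text, and it assumes the data, binned and abundances objects defined earlier in this chapter.

```r
# Hypothetical sketch: retained windows must pass both a global
# enrichment filter and a minimum-count filter.
keep1 <- filterWindowsGlobal(data, background=binned)$filter > log2(3)
keep2 <- abundances > aveLogCPM(5, lib.size=mean(data$totals))
combined <- keep1 & keep2
filtered.data <- data[combined,]
```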
Any benefit will depend on the type of filters involved. The greatest effect is observed for filters that operate on different principles. For example, the low-count filter can be combined with others to ensure that all retained windows surpass some minimum count. This is especially relevant for the local background filters, where a large enrichment value does not guarantee a large count. Session information View session info R version 4.1.0 (2021-05-18) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] TxDb.Mmusculus.UCSC.mm10.knownGene_3.10.0 [2] GenomicFeatures_1.44.0 [3] AnnotationDbi_1.54.0 [4] chipseqDBData_1.8.0 [5] edgeR_3.34.0 [6] limma_3.48.0 [7] csaw_1.26.0 [8] SummarizedExperiment_1.22.0 [9] Biobase_2.52.0 [10] MatrixGenerics_1.4.0 [11] matrixStats_0.58.0 [12] GenomicRanges_1.44.0 [13] GenomeInfoDb_1.28.0 [14] IRanges_2.26.0 [15] S4Vectors_0.30.0 [16] BiocGenerics_0.38.0 [17] BiocStyle_2.20.0 [18] rebook_1.2.0 loaded via a namespace (and not attached): [1] bitops_1.0-7 bit64_4.0.5 [3] progress_1.2.2 filelock_1.0.2 [5] httr_1.4.2 tools_4.1.0 [7] bslib_0.2.5.1 utf8_1.2.1 [9] R6_2.5.0 DBI_1.1.1 [11] withr_2.4.2 prettyunits_1.1.1 [13] tidyselect_1.1.1 bit_4.0.4 [15] curl_4.3.1 compiler_4.1.0 [17] graph_1.70.0 DelayedArray_0.18.0 [19] rtracklayer_1.52.0 bookdown_0.22 [21] sass_0.4.0 rappdirs_0.3.3 [23] stringr_1.4.0 digest_0.6.27 [25] Rsamtools_2.8.0 rmarkdown_2.8 [27] XVector_0.32.0 pkgconfig_2.0.3 [29] htmltools_0.5.1.1 
dbplyr_2.1.1 [31] fastmap_1.1.0 highr_0.9 [33] rlang_0.4.11 RSQLite_2.2.7 [35] shiny_1.6.0 BiocIO_1.2.0 [37] jquerylib_0.1.4 generics_0.1.0 [39] jsonlite_1.7.2 BiocParallel_1.26.0 [41] dplyr_1.0.6 RCurl_1.98-1.3 [43] magrittr_2.0.1 GenomeInfoDbData_1.2.6 [45] Matrix_1.3-3 Rcpp_1.0.6 [47] fansi_0.4.2 lifecycle_1.0.0 [49] stringi_1.6.2 yaml_2.2.1 [51] zlibbioc_1.38.0 BiocFileCache_2.0.0 [53] AnnotationHub_3.0.0 grid_4.1.0 [55] blob_1.2.1 promises_1.2.0.1 [57] ExperimentHub_2.0.0 crayon_1.4.1 [59] dir.expiry_1.0.0 lattice_0.20-44 [61] Biostrings_2.60.0 hms_1.1.0 [63] KEGGREST_1.32.0 locfit_1.5-9.4 [65] CodeDepends_0.6.5 metapod_1.0.0 [67] knitr_1.33 pillar_1.6.1 [69] rjson_0.2.20 biomaRt_2.48.0 [71] codetools_0.2-18 XML_3.99-0.6 [73] glue_1.4.2 BiocVersion_3.13.1 [75] evaluate_0.14 BiocManager_1.30.15 [77] png_0.1-7 httpuv_1.6.1 [79] vctrs_0.3.8 purrr_0.3.4 [81] assertthat_0.2.1 cachem_1.0.5 [83] xfun_0.23 mime_0.10 [85] xtable_1.8-4 restfulr_0.0.13 [87] later_1.2.0 tibble_3.1.2 [89] GenomicAlignments_1.28.0 memoise_2.0.0 [91] interactiveDisplayBase_1.30.0 ellipsis_0.3.2 Bibliography "],["chap-norm.html", "Chapter 5 Normalizing for technical biases 5.1 Overview 5.2 Eliminating composition biases 5.3 Eliminating efficiency biases 5.4 Choosing between normalization strategies 5.5 With spike-in chromatin 5.6 Dealing with trended biases 5.7 A word on other biases Session information", " Chapter 5 Normalizing for technical biases 5.1 Overview The complexity of the ChIP-seq technique gives rise to a number of different biases in the data. For a DB analysis, library-specific biases are of particular interest as they can introduce spurious differences between conditions. 
This includes composition biases, efficiency biases and trended biases. Normalization between libraries is therefore required to remove these biases prior to any statistical analysis. Several normalization strategies are presented here, though only one should be used for any given analysis. Advice on choosing the most appropriate method is provided throughout the chapter.

5.2 Eliminating composition biases

5.2.1 Using the TMM method on binned counts

As the name suggests, composition biases arise when there are differences in the composition of sequences across libraries. Highly enriched regions consume more sequencing resources and thereby suppress the representation of other regions. Differences in the magnitude of suppression between libraries can lead to spurious DB calls. Scaling by library size fails to correct for this, as composition biases can still occur between libraries of the same size.

To remove composition biases in csaw, reads are counted into large bins and the counts are used for normalization with the normFactors() wrapper function. This uses the trimmed mean of M-values (TMM) method (Robinson and Oshlack 2010) to correct for any systematic fold change in the coverage of the bins. The assumption here is that most bins represent non-DB background regions, so any consistent difference across bins must be technical bias.

Set-up code:

```r
#--- loading-files ---#
library(chipseqDBData)
tf.data <- NFYAData()
tf.data
bam.files <- head(tf.data$Path, -1) # skip the input.
bam.files

#--- counting-windows ---#
library(csaw)
frag.len <- 110
win.width <- 10
param <- readParam(minq=20)
data <- windowCounts(bam.files, ext=frag.len, width=win.width, param=param)

#--- filtering ---#
binned <- windowCounts(bam.files, bin=TRUE, width=10000, param=param)
fstats <- filterWindowsGlobal(data, binned)
filtered.data <- data[fstats$filter > log2(5),]
```

```r
library(csaw)
binned <- windowCounts(bam.files, bin=TRUE, width=10000, param=param)
filtered.data <- normFactors(binned, se.out=filtered.data)
filtered.data$norm.factors
## [1] 1.0084 0.9751 1.0136 1.0033
```

The TMM method trims away putative DB bins (i.e., those with extreme M-values) and computes normalization factors from the remainder to use in edgeR. The size of each library is scaled by the corresponding factor to obtain an effective library size for modelling. A larger normalization factor results in a larger effective library size, which is conceptually equivalent to scaling each individual count downwards, given that the ratio of that count to the (effective) library size will be smaller. Check out the edgeR user's guide for more information.

To elaborate on the above code: the normFactors() call computes normalization factors from the bin-level counts in binned (see Section 5.2.2). The se.out argument directs the function to return a modified version of filtered.data, where the normalization factors are stored alongside the window-level counts for further analysis. Composition biases affect both bin- and window-level counts, so computing normalization factors from the former and applying them to the latter is valid, provided that the library sizes are the same between the two sets of counts, as the factors are interpreted with respect to the library sizes. (In csaw, separate calls to windowCounts() with the same readParam object will always yield the same library sizes in totals.)

Note that normFactors() skips the precision weighting step in the TMM method.
Weighting aims to increase the contribution of bins with high counts, as these yield more precise M-values. However, high-abundance bins are more likely to contain binding sites and are thus more likely to be DB compared to background regions. If any DB regions should survive trimming, upweighting them would be counterproductive.

5.2.2 Motivating the use of large bins

By definition, read coverage is low for background regions of the genome. This can result in a large number of zero counts and undefined M-values when reads are counted into small windows. Adding a prior count is only a superficial solution, as the chosen prior will have undue influence on the estimate of the normalization factor when many counts are low. The variance of the fold change distribution is also higher for low counts, which reduces the effectiveness of the trimming procedure. These problems can be overcome by using large bins to increase the size of the counts, thus improving the precision of TMM normalization. The normalization factors computed from the bin-level counts are then applied to the window-level counts of interest.

Of course, this strategy requires the user to supply a bin size. If the bins are too large, background and enriched regions will be included in the same bin, which makes it difficult to trim away bins corresponding to enriched regions. On the other hand, the counts will be too low if the bins are too small. Testing multiple bin sizes is recommended to ensure that the estimates are robust to any changes. A value of 10 kbp is usually suitable for most datasets.

```r
demo <- windowCounts(bam.files, bin=TRUE, width=5000, param=param)
normFactors(demo, se.out=FALSE) # se.out=FALSE to report factors directly.
## [1] 1.0083 0.9779 1.0109 1.0032

demo <- windowCounts(bam.files, bin=TRUE, width=15000, param=param)
normFactors(demo, se.out=FALSE)
## [1] 1.0085 0.9723 1.0160 1.0037
```

Here, the factors are consistently close to unity, which suggests that composition bias is negligible in this dataset. See the case studies elsewhere in this book for examples with greater bias.

5.2.3 Visualizing normalization with MA plots

The effectiveness of normalization can be examined using an MA plot (Figure 5.1). A single main cloud of points should be present, consisting primarily of background regions. Separation into multiple discrete points indicates that the counts are too low and that larger bin sizes should be used. Composition biases manifest as a vertical shift in the position of this cloud. Ideally, the log-ratios of the corresponding normalization factors should pass through the centre of the cloud, indicating that undersampling has been identified and corrected.

```r
library(edgeR)
adj.counts <- cpm(asDGEList(binned), log=TRUE)
normfacs <- filtered.data$norm.factors

par(mfrow=c(1, 3), mar=c(5, 4, 2, 1.5))
for (i in seq_len(length(bam.files)-1)) {
    cur.x <- adj.counts[,1]
    cur.y <- adj.counts[,1+i]
    smoothScatter(x=(cur.x+cur.y)/2+6*log2(10), y=cur.x-cur.y,
        xlab="A", ylab="M", main=paste("1 vs", i+1))
    all.dist <- diff(log2(normfacs[c(i+1, 1)]))
    abline(h=all.dist, col="red")
}
```

Figure 5.1: MA plots for each sample compared to the first in the NF-YA dataset. Each point represents a 10 kbp bin and the red line denotes the scaled normalization factor for each sample.

5.3 Eliminating efficiency biases

5.3.1 Using TMM on high-abundance regions

Efficiency biases in ChIP-seq data refer to fold changes in enrichment that are introduced by variability in IP efficiencies between libraries. These technical differences are not biologically interesting and must be removed. This can be achieved by assuming that high-abundance windows contain binding sites.
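To make the non-DB-majority assumption concrete before turning to real data, here is a toy base-R sketch (simulated counts, not part of the csaw workflow; the plain trimmed mean stands in for the full TMM calculation). A shared efficiency bias is recovered from high-abundance windows even when a small minority of them are genuinely DB:

```r
set.seed(10)
mu <- rgamma(1000, shape=2, rate=0.01) + 200   # high-abundance windows only
eff.bias <- 0.8                                # library 2 at 80% IP efficiency
counts <- cbind(rpois(1000, mu), rpois(1000, mu * eff.bias))
counts[1:50, 2] <- rpois(50, mu[1:50] * 3)     # a small minority of genuine DB

M <- log2((counts[,1] + 0.5) / (counts[,2] + 0.5))
bias.est <- mean(M, trim=0.2)  # trimming discards the extreme (DB) M-values
2^bias.est                     # roughly 1/0.8 = 1.25, the efficiency bias
```

The DB windows have extreme M-values and fall into the trimmed tails, so the estimate reflects only the technical bias; this is exactly why the approach fails when DB windows are the majority rather than the minority.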
Consider the following H3K4me3 data set, where reads are counted into 150 bp windows.

```r
library(chipseqDBData)
k4data <- H3K4me3Data()
k4data
## DataFrame with 4 rows and 3 columns
##                   Name            Description      Path
##            <character>            <character>    <List>
## 1    h3k4me3-proB-8110   pro-B H3K4me3 (8110) <BamFile>
## 2    h3k4me3-proB-8115   pro-B H3K4me3 (8115) <BamFile>
## 3 h3k4me3-matureB-8070 mature B H3K4me3 (80.. <BamFile>
## 4 h3k4me3-matureB-8088 mature B H3K4me3 (80.. <BamFile>

me.files <- k4data$Path[c(1,3)] # just one sample from each condition, for brevity.
me.demo <- windowCounts(me.files, width=150, param=param)
```

High-abundance windows are chosen using the global filtering approach described in Chapter 4. Here, the binned counts in me.bin are only used for defining the background abundance, not for computing normalization factors.

```r
me.bin <- windowCounts(me.files, bin=TRUE, width=10000, param=param)
keep <- filterWindowsGlobal(me.demo, me.bin)$filter > log2(3)
filtered.me <- me.demo[keep,]
```

The TMM method is then applied to eliminate systematic differences across those windows. This assumes that most binding sites in the genome are not DB; thus, any systematic differences in coverage among the high-abundance windows must be caused by differences in IP efficiency between libraries or some other technical issue. Scaling by the normalization factors will subsequently remove these biases prior to further analyses.

```r
filtered.me <- normFactors(filtered.me)
me.eff <- filtered.me$norm.factors
me.eff
## [1] 1.2655 0.7902
```

The downside of this approach is that genuine biological differences may be removed when the assumption of a non-DB majority does not hold, e.g., when overall binding is truly lower in one condition. In such cases, it is safer to normalize for composition biases; see Section 5.4 for a discussion of the choice between normalization methods.
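As an aside, the log2(3) threshold used with filterWindowsGlobal() above is simple arithmetic on log2 abundances: a window is kept only if it is at least 3-fold above the background. A minimal sketch with made-up abundance values:

```r
background.abundance <- 2.0              # log2-CPM of the background bins
window.abundance <- c(1.5, 3.0, 4.2)     # log2-CPM of three candidate windows
filter.stat <- window.abundance - background.abundance
keep <- filter.stat > log2(3)            # require at least 3-fold enrichment
keep
## [1] FALSE FALSE  TRUE
```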
While the above process seems rather involved, this is only because we needed to work through counting and filtering for a new data set. Only normFactors() is actually required for the normalization step itself. As a demonstration, we repeat this procedure on another data set involving H3 acetylation.

```r
acdata <- H3K9acData()
ac.files <- acdata$Path[c(1,2)] # subsetting again for brevity.

# Counting:
ac.demo <- windowCounts(ac.files, width=150, param=param)

# Filtering:
ac.bin <- windowCounts(ac.files, bin=TRUE, width=10000, param=param)
keep <- filterWindowsGlobal(ac.demo, ac.bin)$filter > log2(5)
filtered.ac <- ac.demo[keep,]

# Normalization:
filtered.ac <- normFactors(filtered.ac, se.out=TRUE)
ac.eff <- filtered.ac$norm.factors
ac.eff
## [1] 1.0746 0.9306
```

5.3.2 Filtering windows prior to normalization

Normalization for efficiency biases is performed on window-level counts instead of bin counts. This is possible because filtering ensures that we only retain the high-abundance windows, i.e., those with counts that are large enough for stable calculation of normalization factors. It is not necessary to use larger windows or bins; indeed, direct use of the windows of interest ensures removal of systematic differences in those windows prior to downstream analyses.

The filtering procedure needs to be stringent enough to avoid retaining windows from background regions. These would interfere with the calculation of normalization factors from binding sites, due to the lower coverage of background regions as well as the fact that they are not affected by efficiency bias (and so cannot contribute to its estimation). Conversely, attempting to use factors computed from high-abundance windows on windows from background regions will result in incorrect normalization of the latter. Thus, it is usually better to err on the side of caution and filter aggressively to ensure that background regions are not retained in downstream analyses.
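A quick simulation (base R only, using a plain trimmed mean as a stand-in for the TMM estimate) shows why the number of retained windows still matters:

```r
set.seed(7)
trimmed.M <- function(n) {
    # Trimmed mean of M-values between two equal-coverage libraries.
    counts <- cbind(rpois(n, 500), rpois(n, 500))
    mean(log2(counts[,1] / counts[,2]), trim=0.3)
}
many <- sd(replicate(100, trimmed.M(2000)))  # many retained windows
few <- sd(replicate(100, trimmed.M(20)))     # very few retained windows
c(many=many, few=few)  # the estimate is far noisier with few windows
```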
Obviously, though, retaining too few windows will result in unstable estimates of the normalization factors.

5.3.3 Visualizing normalization with MA plots

We again visualize the effect of normalization with MA plots. We continue to use the counts for 10 kbp bins to construct the plots, rather than those from the windows. This is useful as the behavior of the entire genome can be examined, rather than just that of the high-abundance windows. It also allows calculation of, and comparison to, the factors for composition bias.

```r
# Again, just setting se.out=FALSE to report factors directly.
me.comp <- normFactors(me.bin, se.out=FALSE)
me.comp
## [1] 0.7751 1.2902

ac.comp <- normFactors(ac.bin, se.out=FALSE)
ac.comp
## [1] 0.9474 1.0556
```

In Figure 5.2, the clouds at low and high A-values represent the background and bound regions, respectively. The normalization factors from removal of composition bias (dashed) pass through the former, whereas the factors that remove efficiency bias (full) pass through the latter. A non-zero M-value location for the high A-value cloud represents a systematic difference between libraries for the bound regions, either due to genuine DB or variable IP efficiency. This also induces composition bias, leading to a non-zero M-value for the background cloud.

```r
par(mfrow=c(1,2))
for (main in c("H3K4me3", "H3ac")) {
    if (main=="H3K4me3") {
        bins <- me.bin
        comp <- me.comp
        eff <- me.eff
    } else {
        bins <- ac.bin
        comp <- ac.comp
        eff <- ac.eff
    }
    adjc <- cpm(asDGEList(bins), log=TRUE)
    smoothScatter(x=rowMeans(adjc), y=adjc[,1]-adjc[,2],
        xlab="A", ylab="M", main=main)
    abline(h=log2(eff[1]/eff[2]), col="red")
    abline(h=log2(comp[1]/comp[2]), col="red", lty=2)
}
```

Figure 5.2: MA plots between individual samples in the H3K4me3 and H3ac datasets.
Each point represents a 10 kbp bin and the red lines denote the ratio of normalization factors computed from bins (dashed) or high-abundance windows (full).

5.4 Choosing between normalization strategies

The normalization strategies for composition and efficiency biases are mutually exclusive, as only one set of normalization factors will ultimately be used in edgeR. The choice between the two methods depends on whether one assumes that the systematic differences at high abundances represent genuine DB events. If so, the binned TMM method from Section 5.2 should be used to remove composition bias. This will preserve the assumed DB, at the cost of ignoring any efficiency biases that might be present. Otherwise, if the systematic differences are not genuine DB, they must represent efficiency bias and should be removed by applying the TMM method on high-abundance windows (Section 5.3). Some understanding of the biological context is useful in making this decision; for example, comparing a wild-type against a knock-out for the target protein should result in systematic DB, while overall levels of histone marking are expected to be consistent across most conditions.

For the main NF-YA example, there is no expectation of constant binding between cell types. Thus, normalization factors will be computed to remove composition biases. This ensures that any genuine systematic changes in binding will still be picked up. In general, normalization for composition bias is a good starting point for any analysis. It can be considered the "default" strategy unless there is evidence for a confounding efficiency bias.

5.5 With spike-in chromatin

Some studies use spike-in chromatin for scaling normalization of ChIP-seq data (Bonhoure et al. 2014; Orlando et al. 2014). Briefly, a constant amount of chromatin from a different species is added to each sample at the start of the ChIP-seq protocol.
The mixture is processed and sequenced in the usual manner, using antibodies that can bind epitopes of interest from both species. The coverage of the spiked-in foreign chromatin is then quantified in each library. As the quantity of foreign chromatin should be constant in each sample, the coverage of binding sites on the foreign genome should also be the same between libraries. Any difference in coverage between libraries therefore represents a technical bias that should be removed by scaling.

This normalization strategy can be implemented in csaw with some work. Assuming that the reference genome includes appropriate sequences from the foreign genome, coverage is quantified across genomic windows with windowCounts(). Filtering is performed to select for high-abundance windows in the foreign genome, yielding counts for all enriched spike-in regions. (The filtered object is named spike.data in the code below.) Normalization factors are then computed by applying the TMM method on these counts via normFactors(). This aims to identify the fold change in coverage between samples that is attributable to technical effects.

```r
# Pretend chr1 is a spike-in, for demonstration purposes only!
# TODO: find an actual spike-in dataset and get it into chipseqDBData.
is.1 <- seqnames(rowRanges(filtered.data))=="chr1"

spike.data <- filtered.data[is.1,]
endog.data <- filtered.data[!is.1,]

endog.data <- normFactors(spike.data, se.out=endog.data)
```

In the code above, the spike-in normalization factors are returned in a modified copy of endog.data for further analysis of the endogenous windows. We assume that the library sizes in totals are the same between spike.data and endog.data, which should be the case if they were formed by subsetting the output of a single windowCounts() call. This ensures that the normalization factors computed from the spike-in windows are applicable to the endogenous windows.
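The underlying arithmetic can be sketched with toy numbers (not real spike-in data; the geometric-mean centring is an assumption chosen here to mirror the convention that normalization factors multiply to unity):

```r
# Toy average coverage of spike-in windows in two libraries; these should
# be equal in the absence of technical bias.
spike.cov <- c(A=200, B=100)

# Centre the scaling factors on the geometric mean so that they multiply
# to unity, mirroring the convention for TMM normalization factors.
scale.fac <- spike.cov / exp(mean(log(spike.cov)))

endog <- c(A=1000, B=800)  # toy endogenous coverage
endog / scale.fac          # library B is scaled up: its spike coverage was lower
```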
Compared to the previous normalization methods, the spike-in approach does not distinguish between composition and efficiency biases. Instead, it uses the fold differences in the coverage of spiked-in binding sites to empirically measure and remove the net bias between libraries. This avoids the need for assumptions regarding the origin of any systematic differences between libraries.

That said, spike-in normalization involves some strong assumptions of its own. In particular, the ratio of spike-in chromatin to endogenous chromatin is assumed to be the same in each sample. This requires accurate quantitation of the chromatin in each sample, followed by precise addition of small spike-in quantities. Furthermore, the spike-in chromatin, its protein target and the corresponding antibody are assumed to behave in the same manner as their endogenous counterparts throughout the ChIP-seq protocol. Whether these assumptions are reasonable will depend on the experimenter and the nature of the spike-in chromatin.

5.6 Dealing with trended biases

In more extreme cases, the bias may vary with the average abundance to form a trend. One possible explanation is that changes in IP efficiency will have little effect at low-abundance background regions and more effect at high-abundance binding sites. Thus, the magnitude of the bias between libraries will change with abundance. The trend cannot be corrected with scaling methods, as no single scaling factor will remove differences at all abundances. Rather, non-linear methods are required, such as cyclic loess or quantile normalization.

One such non-linear normalization method is implemented in normOffsets(), based on the fast loess algorithm (Ballman et al. 2004) with minor adaptations to handle low counts. A matrix is produced that contains an offset term for each bin/window in each library.
This offset matrix can then be directly used in edgeR, assuming that the bins or windows used in normalization are also the ones to be tested for DB. We demonstrate this procedure below, using filtered counts for 2 kbp windows in the H3 acetylation data set. (This window size is chosen purely for aesthetics in this demonstration, as the trend is less obvious at smaller widths. Users should obviously pick a more appropriate value for their own analyses.)

```r
ac.demo2 <- windowCounts(ac.files, width=2000L, param=param)

# Filtering for high-abundance intervals.
filtered <- filterWindowsGlobal(ac.demo2, ac.bin)
keep <- filtered$filter > log2(4)
ac.demo2 <- ac.demo2[keep,]

# Actually applying the normalization.
ac.demo2 <- normOffsets(ac.demo2)
ac.off <- assay(ac.demo2, "offset")
head(ac.off)
##       [,1] [,2]
## [1,] 15.95 15.8
## [2,] 15.95 15.8
## [3,] 15.95 15.8
## [4,] 15.95 15.8
## [5,] 15.95 15.8
## [6,] 15.95 15.8
```

By default, the offsets are stored in the RangedSummarizedExperiment object as an "offset" entry in the assays slot. Each offset represents the log-transformed scaling factor that needs to be applied to the corresponding entry of the count matrix for its normalization. Any operations like subsetting that are applied to modify the object will also be applied to the offsets, allowing for synchronized processing. Functions from packages like edgeR will also respect the offsets during model fitting.

We again examine the MA plots in Figure 5.3 to determine whether normalization was successful. Any abundance-dependent trend in the M-values should be eliminated after applying the offsets to the log-counts. This is done by subtraction, though note that the offsets are base e while most log-values in edgeR are reported as base 2.

```r
par(mfrow=c(1,2))

# MA plot without normalization.
ac.y <- asDGEList(ac.demo2)
lib.size <- ac.y$samples$lib.size
adjc <- cpm(ac.y, log=TRUE)

abval <- aveLogCPM(ac.y)
mval <- adjc[,1]-adjc[,2]
fit <- loessFit(x=abval, y=mval)
smoothScatter(abval, mval, ylab="M", xlab="Average logCPM",
    main="Raw", ylim=c(-2,2), xlim=c(0, 7))
o <- order(abval)
lines(abval[o], fit$fitted[o], col="red")

# Repeating after normalization.
re.adjc <- log2(assay(ac.demo2)+0.5) - ac.off/log(2)
mval <- re.adjc[,1]-re.adjc[,2]
fit <- loessFit(x=abval, y=mval)
smoothScatter(abval, mval, ylab="M", xlab="Average logCPM",
    main="Normalized", ylim=c(-2,2), xlim=c(0, 7))
lines(abval[o], fit$fitted[o], col="red")
```

Figure 5.3: MA plots between individual samples in the H3ac dataset before and after trended normalization. Each point represents a 2 kbp window, and the red line denotes a fitted loess curve.

Loess normalization of trended biases is quite similar to TMM normalization for efficiency biases described in Section 5.3. Both methods assume a non-DB majority across features, and neither will be appropriate if there is a change in overall binding. Loess normalization involves a slightly stronger assumption of a non-DB majority at every abundance, not just across all bound regions. This is necessary to remove trended biases but may also discard genuine changes, such as a subset of DB sites at very high abundances.

Compared to TMM normalization, the accuracy of loess normalization is less dependent on stringent filtering, as the use of a trend accommodates changes in the bias between high-abundance binding sites and low-abundance background regions. Nonetheless, some filtering is still necessary to avoid inaccuracies in loess fitting at low abundances. Any filter statistic for the windows should be based on the average abundance from aveLogCPM(), such as those calculated by filterWindowsGlobal() or equivalents.
An average abundance threshold will act as a clean vertical cutoff in the MA plots above. This avoids introducing spurious trends at the filter boundary that might affect normalization.

5.7 A word on other biases

No normalization is performed to adjust for differences in mappability or sequenceability between different regions of the genome. Region-specific biases are assumed to be constant between libraries. This is generally reasonable, as these biases depend on fixed properties of the genome sequence such as GC content, and so should cancel out during DB comparisons. Any variability between samples will simply be absorbed into the dispersion estimate.

That said, explicit normalization to correct these biases can improve results for some datasets. Procedures like GC correction could decrease the observed variability by removing systematic differences between replicates. Of course, this also assumes that the targeted differences have no biological relevance, and detection power may be lost if that assumption does not hold. For example, differences in the GC content distribution can be driven by biology as well as technical bias, e.g., when protein binding is associated with a specific GC composition.
Session information

```
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Other attached packages: chipseqDBData_1.8.0, edgeR_3.34.0, limma_3.48.0,
csaw_1.26.0, SummarizedExperiment_1.22.0, Biobase_2.52.0,
MatrixGenerics_1.4.0, matrixStats_0.58.0, GenomicRanges_1.44.0,
GenomeInfoDb_1.28.0, IRanges_2.26.0, S4Vectors_0.30.0,
BiocGenerics_0.38.0, BiocStyle_2.20.0, rebook_1.2.0
```

Chapter 6 Testing for per-window differences

6.1 Overview

Low counts per window are typically observed in ChIP-seq datasets, even for genuine binding sites. Any statistical analysis to identify DB sites must be able to handle discreteness in the data. Count-based models are ideal for this purpose. In this guide, the quasi-likelihood (QL) framework in the edgeR package is used (Lund et al. 2012). Counts are modelled using negative binomial (NB) distributions that account for overdispersion between biological replicates (Robinson and Smyth 2008). Each window can then be tested for significant DB between conditions. Of course, any statistical method can be used if it is able to accept a count matrix and a vector of normalization factors (or, more generally, a matrix of offsets). The choice of edgeR is primarily motivated by its performance relative to some published alternatives (Law et al. 2014).
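As a reminder of what "overdispersion" means here: under the NB model used by edgeR, the variance of a count with mean mu is mu + phi * mu^2, where phi is the NB dispersion. A quick numerical sketch (values chosen arbitrarily for illustration):

```r
mu <- c(5, 50, 500)   # hypothetical window means
phi <- 0.05           # hypothetical NB dispersion
nb.var <- mu + phi * mu^2
nb.var / mu           # variance inflation relative to a pure Poisson model
## [1]  1.25  3.50 26.00
```

The quadratic term dominates at high abundances, which is why a Poisson model (variance equal to the mean) understates the variability between biological replicates.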
6.2 Setting up for edgeR

A DGEList object is first constructed from the SummarizedExperiment containing our filtered count matrix. If normalization factors or offsets are present in the RangedSummarizedExperiment object (see Chapter 5), they will automatically be inserted into the DGEList. Otherwise, they can be manually passed to the asDGEList() function. If offsets are available, they will generally override the normalization factors in the downstream edgeR analysis.

Set-up code:

```r
#--- loading-files ---#
library(chipseqDBData)
tf.data <- NFYAData()
tf.data
bam.files <- head(tf.data$Path, -1) # skip the input.
bam.files

#--- counting-windows ---#
library(csaw)
frag.len <- 110
win.width <- 10
param <- readParam(minq=20)
data <- windowCounts(bam.files, ext=frag.len, width=win.width, param=param)

#--- filtering ---#
binned <- windowCounts(bam.files, bin=TRUE, width=10000, param=param)
fstats <- filterWindowsGlobal(data, binned)
filtered.data <- data[fstats$filter > log2(5),]

#--- normalization ---#
filtered.data <- normFactors(binned, se.out=filtered.data)
```

```r
library(csaw)
y <- asDGEList(filtered.data)
```

The experimental design is described by a design matrix. In this case, the only relevant factor is the cell type of each sample. A generalized linear model (GLM) will be fitted to the counts for each window using the specified design (McCarthy, Chen, and Smyth 2012). This provides a general framework for the analysis of complex experiments with multiple factors. In this case, our design matrix contains an intercept representing the average abundance in the ESC group, plus a cell.type coefficient representing the log-fold change of the TN group over the ESC group.
```r
cell.type <- sub("NF-YA ([^ ]+) .*", "\\1", head(tf.data$Description, -1))
cell.type
## [1] "ESC" "ESC" "TN"  "TN"

design <- model.matrix(~factor(cell.type))
colnames(design) <- c("intercept", "cell.type")
design
##   intercept cell.type
## 1         1         0
## 2         1         0
## 3         1         1
## 4         1         1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$`factor(cell.type)`
## [1] "contr.treatment"
```

Readers are referred to the edgeR user's guide for more details on parametrization of the design matrix.

6.3 Estimating the dispersions

6.3.1 Stabilising estimates with empirical Bayes

Under the QL framework, both the QL and NB dispersions are used to model biological variability in the data (Lund et al. 2012). The former ensures that the NB mean-variance relationship is properly specified, with appropriate contributions from the Poisson and Gamma components. The latter accounts for variability and uncertainty in the dispersion estimate. However, the limited replication in most ChIP-seq experiments means that each window does not contain enough information for precise estimation of either dispersion.

This problem is overcome in edgeR by sharing information across windows. For the NB dispersions, a mean-dispersion trend is fitted across all windows to model the mean-variance relationship (McCarthy, Chen, and Smyth 2012). The raw QL dispersion for each window is estimated after fitting a GLM with the trended NB dispersion. Another mean-dependent trend is fitted to the raw QL estimates. An empirical Bayes (EB) strategy is then used to stabilize the raw QL dispersion estimates by shrinking them towards the second trend (Lund et al. 2012). The appropriate amount of shrinkage is determined from the variability of the dispersions.

```r
library(edgeR)
y <- estimateDisp(y, design)
summary(y$trended.dispersion)
##   Min. 1st Qu. Median   Mean 3rd Qu.   Max.
## 0.0511  0.1020 0.1050 0.1043  0.1082 0.1092

fit <- glmQLFit(y, design, robust=TRUE)
summary(fit$var.post)
##  Min. 1st Qu. Median  Mean 3rd Qu.   Max.
## 0.073   1.046  1.068 1.070   1.102 22.308
```

The effect of EB stabilization can be visualized by examining the biological coefficient of variation (for the NB dispersion) and the quarter-root deviance (for the QL dispersion). Plots such as those in Figure 6.1 can also be used to decide whether the fitted trend is appropriate. Sudden irregularities may be indicative of an underlying structure in the data that cannot be modelled with the mean-dispersion trend. Discrete patterns in the raw dispersions are indicative of low counts and suggest that more aggressive filtering is required.

```r
par(mfrow=c(1,2))
o <- order(y$AveLogCPM)
plot(y$AveLogCPM[o], sqrt(y$trended.dispersion[o]), type="l", lwd=2,
    ylim=c(0, 1), xlab=expression("Ave."~Log[2]~"CPM"),
    ylab=("Biological coefficient of variation"))
plotQLDisp(fit)
```

Figure 6.1: Fitted trend in the NB dispersion (left) or QL dispersion (right) as a function of the average abundance for each window. For the NB dispersion, the square root is shown as the biological coefficient of variation. For the QL dispersion, the shrunken estimate is also shown for each window.

For most sequencing count data, we expect to see a decreasing trend that plateaus with increasing average abundance. This reflects the greater reliability of large counts, where the effects of stochasticity and technical artifacts (e.g., mapping errors, PCR duplicates) are averaged out. In some cases, a strong trend may also be observed where the NB dispersion drops sharply with increasing average abundance. It is difficult to accurately fit an empirical curve to these strong trends, and as a consequence, the dispersions at high abundances may be overestimated.
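For reference, the biological coefficient of variation (BCV) plotted on the left of Figure 6.1 is simply the square root of the NB dispersion, i.e., the relative variability in true abundance between replicates. For example:

```r
dispersion <- c(0.05, 0.10, 0.25)  # example NB dispersion values
bcv <- sqrt(dispersion)            # biological coefficient of variation
round(bcv, 3)
## [1] 0.224 0.316 0.500
```

A trended dispersion of 0.1, as in the output above, thus corresponds to roughly 30% variability in binding intensity between replicate samples.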
Filtering of low-abundance regions (as described in Chapter 4) provides some protection by removing the strongest part of the trend. This has an additional benefit of removing those tests that have low power due to the magnitude of the dispersions. relevant &lt;- rowSums(assay(data)) &gt;= 20 # weaker filtering than &#39;filtered.data&#39; yo &lt;- asDGEList(data[relevant,], norm.factors=filtered.data$norm.factors) yo &lt;- estimateDisp(yo, design) oo &lt;- order(yo$AveLogCPM) plot(yo$AveLogCPM[oo], sqrt(yo$trended.dispersion[oo]), type=&quot;l&quot;, lwd=2, ylim=c(0, max(sqrt(yo$trended))), xlab=expression(&quot;Ave.&quot;~Log[2]~&quot;CPM&quot;), ylab=(&quot;Biological coefficient of variation&quot;)) lines(y$AveLogCPM[o], sqrt(y$trended[o]), lwd=2, col=&quot;grey&quot;) legend(&quot;topright&quot;, c(&quot;raw&quot;, &quot;filtered&quot;), col=c(&quot;black&quot;, &quot;grey&quot;), lwd=2) Figure 6.2: Fitted trend in the NB dispersions before (black) and after (grey) removing low-abundance windows. Note that only the trended dispersion will be used in the downstream steps – the common and tagwise values are only shown in Figure 6.1 for diagnostic purposes. Specifically, the common BCV provides an overall measure of the variability in the data set, averaged across all windows. The tagwise BCVs should also be dispersed above and below the fitted trend, indicating that the fit was successful. 6.3.2 Modelling variable dispersions between windows Any variability in the dispersions across windows is modelled in edgeR by the prior degrees of freedom (d.f.). A large value for the prior d.f. indicates that the variability is low. This means that more EB shrinkage can be performed to reduce uncertainty and maximize power. However, strong shrinkage is not appropriate if the dispersions are highly variable. Fewer prior degrees of freedom (and less shrinkage) are required to maintain type I error control. summary(fit$df.prior) ## Min. 1st Qu. Median Mean 3rd Qu. Max. 
## 0.34 43.37 43.37 43.28 43.37 43.37 On occasion, the estimated prior degrees of freedom will be infinite. This is indicative of a strong batch effect where the dispersions are consistently large. A typical example involves uncorrected differences in IP efficiency across replicates. In severe cases, the trend may fail to pass through the bulk of points as the variability is too low to be properly modelled in the QL framework. This problem is usually resolved with appropriate normalization. Note that the prior degrees of freedom should be robustly estimated (Phipson et al. 2016). Obviously, this protects against large positive outliers (e.g., highly variable windows) but it also protects against near-zero dispersions at low counts. These will manifest as large negative outliers after a log transformation step during estimation (Smyth 2004). Without robustness, incorporation of these outliers will inflate the observed variability in the dispersions. This results in a lower estimated prior d.f. and reduced DB detection power. 6.4 Testing for DB windows We identify windows with significant differential binding with respect to specific factors of interest in our design matrix. In the QL framework, \\(p\\)-values are computed using the QL F-test (Lund et al. 2012). This is more appropriate than using the likelihood ratio test as the F-test accounts for uncertainty in the dispersion estimates. Associated statistics such as log-fold changes and log-counts per million are also computed for each window. results &lt;- glmQLFTest(fit, contrast=c(0, 1)) head(results$table) ## logFC logCPM F PValue ## 1 1.238 0.5410 5.023 0.029953 ## 2 1.408 1.3780 7.563 0.008529 ## 3 1.574 1.3513 8.927 0.004523 ## 4 1.091 0.8299 3.742 0.059309 ## 5 1.174 -0.3423 3.431 0.070479 ## 6 1.095 -0.2668 2.702 0.107129 The null hypothesis here is that the cell type has no effect. The contrast argument in the glmQLFTest() function specifies which factors are of interest.
In this case, a contrast of c(0, 1) defines the null hypothesis as 0*intercept + 1*cell.type = 0, i.e., that the log-fold change between cell types is zero. DB windows can then be identified by rejecting the null. Users may find it more intuitive to express the contrast through limma’s makeContrasts() function, or by directly supplying the names of the relevant coefficients to glmQLFTest(), as shown below. colnames(design) ## [1] &quot;intercept&quot; &quot;cell.type&quot; # Same as above. results2 &lt;- glmQLFTest(fit, coef=&quot;cell.type&quot;) The log-fold change for each window is similarly defined from the contrast as 0*intercept + 1*cell.type, i.e., the value of the cell.type coefficient. Recall that this coefficient represents the log-fold change of the TN group over the ESC group. Thus, in our analysis, positive log-fold changes represent an increase in binding in the TN group over the ESC group, and vice versa for negative log-fold changes. One could also define the contrast as c(0, -1), in which case the interpretation of the log-fold changes would be reversed. Once the significance statistics have been calculated, they can be stored in the row metadata of the RangedSummarizedExperiment object. This ensures that the statistics and coordinates are processed together, e.g., when subsetting to select certain windows. rowData(filtered.data) &lt;- cbind(rowData(filtered.data), results$table) 6.5 What to do without replicates Designing a ChIP-seq experiment without any replicates is strongly discouraged. Without replication, the reproducibility of any findings cannot be determined. Nonetheless, it may be helpful to salvage some information from datasets that lack replicates. This is done by supplying a “reasonable” value for the NB dispersion during GLM fitting (e.g., 0.05 - 0.1, based on past experience). DB windows are then identified using the likelihood ratio test.
fit.norep &lt;- glmFit(y, design, dispersion=0.05) results.norep &lt;- glmLRT(fit.norep, contrast=c(0, 1)) head(results.norep$table) ## logFC logCPM LR PValue ## 1 1.237 0.5410 8.407 3.738e-03 ## 2 1.407 1.3780 13.170 2.845e-04 ## 3 1.573 1.3513 16.102 6.002e-05 ## 4 1.089 0.8299 7.159 7.458e-03 ## 5 1.173 -0.3423 5.506 1.896e-02 ## 6 1.093 -0.2668 4.993 2.546e-02 Obviously, this approach has a number of pitfalls. The lack of replicates means that the biological variability in the data cannot be modelled. Thus, it becomes impossible to gauge the sensibility of the supplied NB dispersions in the analysis. Another problem is spurious DB due to inconsistent PCR duplication between libraries. Normally, inconsistent duplication results in a large QL dispersion for the affected window, such that significance is downweighted. However, estimation of the QL dispersion is not possible without replicates. This means that duplicates may need to be removed to protect against false positives. 6.6 Examining replicate similarity with MDS plots As a quality control measure, the window counts can be used to examine the similarity of replicates through multi-dimensional scaling (MDS) plots (Figure 6.3). The distance between each pair of libraries is computed as the square root of the mean squared log-fold change across the top set of bins with the highest absolute log-fold changes. A small top set visualizes the most extreme differences whereas a large set visualizes overall differences. Checking a range of top values may be useful when the scope of DB is unknown. Again, counting with large bins is recommended as fold changes will be undefined in the presence of zero counts. 
par(mfrow=c(2,2), mar=c(5,4,2,2)) adj.counts &lt;- cpm(y, log=TRUE) for (top in c(100, 500, 1000, 5000)) { plotMDS(adj.counts, main=top, col=c(&quot;blue&quot;, &quot;blue&quot;, &quot;red&quot;, &quot;red&quot;), labels=c(&quot;es.1&quot;, &quot;es.2&quot;, &quot;tn.1&quot;, &quot;tn.2&quot;), top=top) } Figure 6.3: MDS plots computed with varying numbers of top windows with the strongest log-fold changes between libraries. In each plot, each library is marked with its name and colored according to its cell type. Replicates from different groups should form separate clusters in the MDS plot, as observed above. This indicates that the results are reproducible and that the effect sizes are large. Mixing between replicates of different conditions indicates that the biological difference has no effect on protein binding, or that the data is too variable for any effect to manifest. Any outliers should also be noted as their presence may confound the downstream analysis. In the worst case, outlier samples may need to be removed to obtain sensible results.
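The pairwise distance described above can be mimicked in a few lines of base R. This is only a sketch for intuition (the function leading_lfc_dist and the toy matrix are hypothetical); plotMDS() from limma should be used in practice:

```r
# Base-R sketch of the pairwise distance underlying these MDS plots:
# the root-mean-square of the 'top' largest absolute log-fold changes
# between each pair of libraries.
leading_lfc_dist <- function(logcpm, top=500) {
    n <- ncol(logcpm)
    d <- matrix(0, n, n, dimnames=list(colnames(logcpm), colnames(logcpm)))
    for (i in seq_len(n - 1)) {
        for (j in (i + 1):n) {
            lfc <- logcpm[, i] - logcpm[, j]
            keep <- head(order(abs(lfc), decreasing=TRUE), top)
            d[i, j] <- d[j, i] <- sqrt(mean(lfc[keep]^2))
        }
    }
    d
}

# Sanity check: identical libraries sit at distance zero.
same <- cbind(es.1=log2(1:100), es.2=log2(1:100))
leading_lfc_dist(same, top=50)  # all entries are zero
```

Because only the top windows enter the average, the distance emphasizes the strongest differences between each pair of libraries, which is why varying top changes the scope of the comparison.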
Session information R version 4.1.0 (2021-05-18) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] edgeR_3.34.0 limma_3.48.0 [3] csaw_1.26.0 SummarizedExperiment_1.22.0 [5] Biobase_2.52.0 MatrixGenerics_1.4.0 [7] matrixStats_0.58.0 GenomicRanges_1.44.0 [9] GenomeInfoDb_1.28.0 IRanges_2.26.0 [11] S4Vectors_0.30.0 BiocGenerics_0.38.0 [13] BiocStyle_2.20.0 rebook_1.2.0 loaded via a namespace (and not attached): [1] statmod_1.4.36 locfit_1.5-9.4 xfun_0.23 [4] bslib_0.2.5.1 splines_4.1.0 lattice_0.20-44 [7] htmltools_0.5.1.1 yaml_2.2.1 XML_3.99-0.6 [10] rlang_0.4.11 jquerylib_0.1.4 BiocParallel_1.26.0 [13] CodeDepends_0.6.5 GenomeInfoDbData_1.2.6 stringr_1.4.0 [16] zlibbioc_1.38.0 Biostrings_2.60.0 codetools_0.2-18 [19] evaluate_0.14 knitr_1.33 highr_0.9 [22] Rcpp_1.0.6 filelock_1.0.2 BiocManager_1.30.15 [25] DelayedArray_0.18.0 graph_1.70.0 jsonlite_1.7.2 [28] XVector_0.32.0 Rsamtools_2.8.0 dir.expiry_1.0.0 [31] metapod_1.0.0 digest_0.6.27 stringi_1.6.2 [34] bookdown_0.22 grid_4.1.0 tools_4.1.0 [37] bitops_1.0-7 magrittr_2.0.1 sass_0.4.0 [40] RCurl_1.98-1.3 crayon_1.4.1 Matrix_1.3-3 [43] rmarkdown_2.8 R6_2.5.0 compiler_4.1.0 Chapter 7
Correction for multiple testing 7.1 Overview The false discovery rate (FDR) is usually the most appropriate measure of error for high-throughput experiments. Control of the FDR can be provided by applying the Benjamini-Hochberg (BH) method (Benjamini and Hochberg 1995) to a set of \\(p\\)-values. This is less conservative than the alternatives (e.g., Bonferroni) yet still provides some measure of error control. The most obvious approach is to apply the BH method to the set of \\(p\\)-values across all windows. This will control the FDR across the set of putative DB windows. However, the FDR across all detected windows is not necessarily the most relevant error rate. Interpretation of ChIP-seq experiments is more concerned with regions of the genome in which (differential) protein binding is found, rather than the individual windows. In other words, the FDR across all detected DB regions is usually desired. This is not equivalent to that across all DB windows as each region will often consist of multiple overlapping windows. Control of one will not guarantee control of the other (Lun and Smyth 2014). To illustrate this difference, consider an analysis where the FDR across all window positions is controlled at 10%. In the results, there are 18 adjacent window positions in one region and 2 windows in a separate region. The first set of windows is a truly DB region whereas the second set is a false positive. A window-based interpretation of the FDR is correct as only 2 of the 20 window positions are false positives. However, a region-based interpretation results in an actual FDR of 50%. To avoid misinterpretation of the FDR, csaw provides a number of strategies to obtain region-level results.
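The arithmetic of the hypothetical 18-window and 2-window example is easily checked (all counts here are taken from the illustration above, not from real data):

```r
# One true DB region of 18 windows, plus a false positive region of 2 windows.
false.windows <- 2; total.windows <- 18 + 2
false.regions <- 1; total.regions <- 2
false.windows / total.windows  # FDR across window positions
## [1] 0.1
false.regions / total.regions  # FDR across regions
## [1] 0.5
```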
This involves defining the regions of interest - possibly from the windows themselves - and converting per-window statistics into a \\(p\\)-value for each region. Application of the BH method to the per-region \\(p\\)-values will then control the relevant FDR across regions. These strategies are demonstrated below using the NF-YA data. 7.2 Grouping windows into regions 7.2.1 Quick and dirty clustering The mergeWindows() function provides a simple single-linkage algorithm to cluster windows into regions. Windows that are less than tol apart are considered to be adjacent and are grouped into the same cluster. The chosen tol represents the minimum distance at which two binding events are treated as separate sites. Large values (500 - 1000 bp) reduce redundancy and favor a region-based interpretation of the results, while smaller values (&lt; 200 bp) allow resolution of individual binding sites. #--- loading-files ---# library(chipseqDBData) tf.data &lt;- NFYAData() tf.data bam.files &lt;- head(tf.data$Path, -1) # skip the input.
bam.files #--- counting-windows ---# library(csaw) frag.len &lt;- 110 win.width &lt;- 10 param &lt;- readParam(minq=20) data &lt;- windowCounts(bam.files, ext=frag.len, width=win.width, param=param) #--- filtering ---# binned &lt;- windowCounts(bam.files, bin=10000, param=param) fstats &lt;- filterWindowsGlobal(data, binned) filtered.data &lt;- data[fstats$filter &gt; log2(5),] #--- normalization ---# filtered.data &lt;- normFactors(binned, se.out=filtered.data) #--- modelling ---# cell.type &lt;- sub(&quot;NF-YA ([^ ]+) .*&quot;, &quot;\\\\1&quot;, head(tf.data$Description, -1)) design &lt;- model.matrix(~cell.type) colnames(design) &lt;- c(&quot;intercept&quot;, &quot;cell.type&quot;) library(edgeR) y &lt;- asDGEList(filtered.data) y &lt;- estimateDisp(y, design) fit &lt;- glmQLFit(y, design, robust=TRUE) res &lt;- glmQLFTest(fit, coef=&quot;cell.type&quot;) rowData(filtered.data) &lt;- cbind(rowData(filtered.data), res$table) library(csaw) merged &lt;- mergeWindows(filtered.data, tol=1000L) merged$regions ## GRanges object with 3577 ranges and 0 metadata columns: ## seqnames ranges strand ## &lt;Rle&gt; &lt;IRanges&gt; &lt;Rle&gt; ## [1] chr1 7397901-7398110 * ## [2] chr1 9541401-9541510 * ## [3] chr1 9545301-9545360 * ## [4] chr1 10007401-10007460 * ## [5] chr1 13134451-13134510 * ## ... ... ... ... ## [3573] chrX_GL456233_random 336801-336910 * ## [3574] chrY 143051-143060 * ## [3575] chrY 259151-259210 * ## [3576] chrY 90808851-90808860 * ## [3577] chrY 90812851-90812910 * ## ------- ## seqinfo: 66 sequences from an unspecified genome If many adjacent windows are present, very large clusters may be formed that are difficult to interpret. We perform a simple check below to determine whether most clusters are of an acceptable size. Huge clusters indicate that more aggressive filtering from Chapter 4 is required. This mitigates chaining effects by reducing the density of windows in the genome. summary(width(merged$regions)) ## Min. 1st Qu. Median Mean 3rd Qu. 
Max. ## 10 60 110 165 160 15660 Alternatively, chaining can be limited by setting max.width to restrict the size of the merged intervals. Clusters substantially larger than max.width are split into several smaller subclusters of roughly equal size. The chosen value should be small enough so as to separate DB regions from unchanged neighbors, yet large enough to avoid misinterpretation of the FDR. Any value from 2000 to 10000 bp is recommended. This parameter can also be interpreted as the maximum distance at which two binding sites are considered part of the same event. merged.max &lt;- mergeWindows(filtered.data, tol=1000L, max.width=5000L) summary(width(merged.max$regions)) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 10 60 110 164 160 4860 7.2.2 Using external information Another approach is to group together windows that overlap with a pre-specified region of interest. The most obvious source of pre-specified regions is that of annotated features such as promoters or gene bodies. Alternatively, called peaks can be used provided that sufficient care has been taken to avoid loss of error control from data snooping (Lun and Smyth 2014). Regardless of how they are specified, each region of interest corresponds to a group that contains all overlapping windows, as identified by the findOverlaps function from the GenomicRanges package. library(TxDb.Mmusculus.UCSC.mm10.knownGene) broads &lt;- genes(TxDb.Mmusculus.UCSC.mm10.knownGene) broads &lt;- resize(broads, width(broads)+3000, fix=&quot;end&quot;) olap &lt;- findOverlaps(broads, rowRanges(filtered.data)) olap ## Hits object with 12867 hits and 0 metadata columns: ## queryHits subjectHits ## &lt;integer&gt; &lt;integer&gt; ## [1] 7 6995 ## [2] 18 8323 ## [3] 18 8324 ## [4] 18 8325 ## [5] 18 8326 ## ... ... ...
## [12863] 24521 6840 ## [12864] 24521 6841 ## [12865] 24524 6601 ## [12866] 24524 6602 ## [12867] 24524 6603 ## ------- ## queryLength: 24528 / subjectLength: 12352 At this point, one might imagine that it would be simpler to just collect and analyze counts over the pre-specified regions. This is a valid strategy but will yield different results. Consider a promoter containing two separate sites that are identically DB in opposite directions. Counting reads across the promoter will give equal counts for each condition so changes within the promoter will not be detected. Similarly, imprecise peak boundaries can lead to loss of detection power due to “contamination” by reads in background regions. Window-based methods may be more robust as each interval of the promoter/peak region is examined separately (Lun and Smyth 2014), avoiding potential problems with peak-calling errors and incorrect/incomplete annotation. 7.3 Obtaining per-region \\(p\\)-value 7.3.1 Combining window-level \\(p\\)-values We compute a combined \\(p\\)-value for each region based on the \\(p\\)-values of the constituent windows (Simes 1986). This tests the joint null hypothesis for each region, i.e., that no enrichment is observed across any of its windows. Any DB within the region will reject the joint null and yield a low \\(p\\)-value for the entire region. The combined \\(p\\)-values are then adjusted using the BH method to control the region-level FDR. tabcom &lt;- combineTests(merged$ids, rowData(filtered.data)) is.sig.region &lt;- tabcom$FDR &lt;= 0.05 summary(is.sig.region) ## Mode FALSE TRUE ## logical 1666 1911 Summarizing the direction of DB for each cluster requires some care as the direction of DB can differ between constituent windows. The num.up.logFC and num.down.logFC fields contain the number of windows that change in each direction, and can be used to gauge whether binding increases or decreases across the cluster.
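For intuition, the Simes combination used by combineTests() can be sketched for a single cluster in base R (a standalone illustration only; combineTests() additionally handles weights and direction summaries):

```r
# Simes' method: sort the cluster's window p-values and take the
# minimum of n * p_(i) / i, testing the joint null for the cluster.
simes <- function(p) {
    n <- length(p)
    min(n * sort(p) / seq_len(n))
}

simes(c(0.01, 0.2, 0.9))  # one strongly DB window drives the region
## [1] 0.03
simes(rep(0.5, 3))        # no evidence in any window
## [1] 0.5
```

Note how a single small window \(p\)-value is enough to reject the joint null for the whole region, at the cost of a modest multiplicity penalty.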
A complex DB event may be present if both num.up.logFC and num.down.logFC are non-zero (i.e., opposing changes within the region) or if the total number of windows is much larger than either number (e.g., interval of constant binding adjacent to the DB interval). Alternatively, the direction field specifies which DB direction contributes to the combined \\(p\\)-value. If &quot;up&quot;, the combined \\(p\\)-value for this cluster is driven by \\(p\\)-values of windows with positive log-fold changes. If &quot;down&quot;, the combined \\(p\\)-value is driven by windows with negative log-fold changes. If &quot;mixed&quot;, windows with both positive and negative log-fold changes are involved. This allows the dominant DB in significant clusters to be quickly summarized, as shown below. table(tabcom$direction[is.sig.region]) ## ## down up ## 174 1737 For pre-specified regions, the combineOverlaps() function will combine the \\(p\\)-values for all windows in each region. This is a wrapper around combineTests() for Hits objects. It returns a single combined \\(p\\)-value (and its BH-adjusted value) for each region. Regions that do not overlap any windows have values of NA in all fields for the corresponding rows.
tabbroad &lt;- combineOverlaps(olap, rowData(filtered.data)) head(tabbroad[!is.na(tabbroad$PValue),]) ## DataFrame with 6 rows and 8 columns ## num.tests num.up.logFC num.down.logFC PValue FDR direction ## &lt;integer&gt; &lt;integer&gt; &lt;integer&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;character&gt; ## 7 1 1 0 3.88905e-05 0.000767486 up ## 18 4 4 0 2.91361e-04 0.002481887 up ## 23 3 3 0 2.90086e-02 0.051465052 up ## 25 2 0 0 7.37883e-01 0.760929460 up ## 28 3 3 0 2.06513e-04 0.002055317 up ## 36 4 4 0 6.33654e-03 0.017992101 up ## rep.test rep.logFC ## &lt;integer&gt; &lt;numeric&gt; ## 7 6995 3.396579 ## 18 8326 3.446009 ## 23 315 1.331126 ## 25 9977 0.201974 ## 28 8774 3.503353 ## 36 2716 2.105493 is.sig.gene &lt;- tabbroad$FDR &lt;= 0.05 table(tabbroad$direction[is.sig.gene]) ## ## down mixed up ## 86 29 1925 7.3.2 Based on the most significant window Another approach is to use the single window with the strongest DB as a representative of the entire region. This is useful when a log-fold change is required for each cluster, e.g., for plotting. (In contrast, taking the average log-fold change across all windows in a region will understate the magnitude of DB, especially if the region includes some non-DB background intervals of the genome.) Identification of the most significant (i.e., “best”) window is performed using the getBestTest() function. This reports the index of the window with the lowest \\(p\\)-value in each cluster as well as the associated statistics.
tab.best &lt;- getBestTest(merged$ids, rowData(filtered.data)) head(tab.best) ## DataFrame with 6 rows and 8 columns ## num.tests num.up.logFC num.down.logFC PValue FDR direction ## &lt;integer&gt; &lt;integer&gt; &lt;integer&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;character&gt; ## 1 5 2 0 0.0226156 0.0461735 up ## 2 3 2 0 0.0144413 0.0332624 up ## 3 2 2 0 0.0334643 0.0623772 up ## 4 2 0 0 0.2113083 0.2674838 up ## 5 2 0 0 0.1716740 0.2251684 up ## 6 5 1 0 0.0470532 0.0807238 up ## rep.test rep.logFC ## &lt;integer&gt; &lt;numeric&gt; ## 1 3 1.57363 ## 2 8 1.98359 ## 3 10 1.69023 ## 4 11 1.12035 ## 5 14 1.08336 ## 6 17 1.32717 A Bonferroni correction is applied to the \\(p\\)-value of the best window in each region, based on the number of constituent windows in that region. This is necessary to account for the implicit multiple testing across all windows in each region. The corrected \\(p\\)-value is reported as PValue in tab.best, and can be used for correction across regions using the BH method to control the region-level FDR. In addition, it is often useful to report the start location of the best window within each cluster. This allows users to easily identify a relevant DB subinterval in large regions. For example, the sequence of the DB subinterval can be extracted for motif discovery. tabcom$rep.start &lt;- start(rowRanges(filtered.data))[tab.best$rep.test] head(tabcom[,c(&quot;rep.logFC&quot;, &quot;rep.start&quot;)]) ## DataFrame with 6 rows and 2 columns ## rep.logFC rep.start ## &lt;numeric&gt; &lt;integer&gt; ## 1 1.40808 7398001 ## 2 1.81146 9541501 ## 3 1.57495 9545351 ## 4 1.03347 10007401 ## 5 1.08336 13134501 ## 6 1.26970 13372551 The same approach can be applied to the overlaps between windows and pre-specified regions, using the getBestOverlaps() wrapper function. This is demonstrated below for the broad gene body example. 
As with combineOverlaps(), regions with no windows are assigned NA in the output table, but these are removed here to show some actual results. tab.best.broad &lt;- getBestOverlaps(olap, rowData(filtered.data)) tabbroad$rep.start &lt;- start(rowRanges(filtered.data))[tab.best.broad$rep.test] head(tabbroad[!is.na(tabbroad$PValue),c(&quot;rep.logFC&quot;, &quot;rep.start&quot;)]) ## DataFrame with 6 rows and 2 columns ## rep.logFC rep.start ## &lt;numeric&gt; &lt;integer&gt; ## 7 3.396579 32657101 ## 18 3.446009 8259301 ## 23 1.331126 92934601 ## 25 0.201974 71596101 ## 28 3.503353 4137001 ## 36 2.105493 100187601 7.3.3 Wrapper functions For convenience, the steps of merging windows and computing statistics are implemented in a single wrapper function. This simply calls mergeWindows() followed by combineTests() and getBestTest(). merge.res &lt;- mergeResults(filtered.data, rowData(filtered.data), tol=100, merge.args=list(max.width=5000)) names(merge.res) ## [1] &quot;regions&quot; &quot;combined&quot; &quot;best&quot; An equivalent wrapper function is also available for handling overlaps to pre-specified regions. This simply calls findOverlaps() followed by combineOverlaps() and getBestOverlaps(). broad.res &lt;- overlapResults(filtered.data, regions=broads, tab=rowData(filtered.data)) names(broad.res) ## [1] &quot;regions&quot; &quot;combined&quot; &quot;best&quot; 7.4 Squeezing out more detection power 7.4.1 Integrating across multiple window sizes Repeating the analysis with different window sizes may uncover new DB events at different resolutions. Multiple sets of DB results are integrated by clustering adjacent windows together (even if they differ in size) and combining \\(p\\)-values within each of the resulting clusters. The example below uses the H3 acetylation data from the chapter on normalization. Some filtering is performed to avoid excessive chaining in this demonstration.
Corresponding tables of DB results should also be obtained – for brevity, mock results are used here. library(chipseqDBData) ac.files &lt;- H3K9acData()$Path ac.small &lt;- windowCounts(ac.files, width=150L, spacing=100L, filter=25, param=param) ac.large &lt;- windowCounts(ac.files, width=1000L, spacing=500L, filter=35, param=param) # TODO: actually do the analysis here. # In the meantime, mocking up results for demonstration purposes. ns &lt;- nrow(ac.small) mock.small &lt;- data.frame(logFC=rnorm(ns), logCPM=0, PValue=runif(ns)) nl &lt;- nrow(ac.large) mock.large &lt;- data.frame(logFC=rnorm(nl), logCPM=0, PValue=runif(nl)) The mergeResultsList() function merges windows of all sizes into a single set of regions, and computes a combined \\(p\\)-value from the associated \\(p\\)-values for each region. Equal contributions from each window size are enforced by setting equiweight=TRUE, which uses a weighted version of Simes’ method (Benjamini and Hochberg 1997). The weight assigned to each window is inversely proportional to the number of windows of that size in the same cluster. This avoids the situation where, if a cluster contains many small windows, the DB results for the analysis with the small window size contribute most to the combined \\(p\\)-value. This is not ideal when results from all window sizes are of equal interest. cons.res &lt;- mergeResultsList(list(ac.small, ac.large), tab.list=list(mock.small, mock.large), equiweight=TRUE, tol=1000) cons.res$regions ## GRanges object with 30486 ranges and 0 metadata columns: ## seqnames ranges strand ## &lt;Rle&gt; &lt;IRanges&gt; &lt;Rle&gt; ## [1] chr1 4774001-4776500 * ## [2] chr1 4784501-4787000 * ## [3] chr1 4806501-4809000 * ## [4] chr1 4856501-4860000 * ## [5] chr1 5082501-5084500 * ## ... ... ... ... 
## [30482] chrY 38230601-38230750 * ## [30483] chrY 73037501-73039000 * ## [30484] chrY 75445901-75446150 * ## [30485] chrY 88935501-88937000 * ## [30486] chrY 90812501-90814000 * ## ------- ## seqinfo: 66 sequences from an unspecified genome cons.res$combined ## DataFrame with 30486 rows and 8 columns ## num.tests num.up.logFC num.down.logFC PValue FDR direction ## &lt;integer&gt; &lt;integer&gt; &lt;integer&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;character&gt; ## 1 4 0 0 0.896683 0.999042 down ## 2 13 0 0 0.642911 0.990966 down ## 3 11 0 0 0.819675 0.998448 mixed ## 4 22 0 0 0.519854 0.990954 down ## 5 9 0 0 0.657743 0.991505 mixed ## ... ... ... ... ... ... ... ## 30482 1 0 0 0.184985 0.987279 up ## 30483 5 0 0 0.899906 0.999042 mixed ## 30484 2 0 0 0.567312 0.990954 mixed ## 30485 5 0 0 0.564992 0.990954 mixed ## 30486 2 0 0 0.765134 0.997090 down ## rep.test rep.logFC ## &lt;integer&gt; &lt;numeric&gt; ## 1 238289 -0.198639 ## 2 238295 -0.468623 ## 3 238298 0.773446 ## 4 31 -0.998320 ## 5 33 -1.513778 ## ... ... ... ## 30482 238280 0.5115937 ## 30483 238282 1.7600344 ## 30484 238285 0.0487379 ## 30485 391251 -0.3376354 ## 30486 391253 -0.6200357 Similarly, the overlapResultsList() function is used to merge windows of varying size that overlap pre-specified regions. cons.broad &lt;- overlapResultsList(list(ac.small, ac.large), tab.list=list(mock.small, mock.large), equiweight=TRUE, region=broads) cons.broad$regions ## GRanges object with 24528 ranges and 1 metadata column: ## seqnames ranges strand | gene_id ## &lt;Rle&gt; &lt;IRanges&gt; &lt;Rle&gt; | &lt;character&gt; ## 100009600 chr9 21062393-21076096 - | 100009600 ## 100009609 chr7 84935565-84967115 - | 100009609 ## 100009614 chr10 77708457-77712009 + | 100009614 ## 100009664 chr11 45805087-45841171 + | 100009664 ## 100012 chr4 144157557-144165663 - | 100012 ## ... ... ... ... . ... 
## 99889 chr3 84496093-85890516 - | 99889 ## 99890 chr3 110246109-110253998 - | 99890 ## 99899 chr3 151730922-151752960 - | 99899 ## 99929 chr3 65525410-65555518 + | 99929 ## 99982 chr4 136550540-136605723 - | 99982 ## ------- ## seqinfo: 66 sequences (1 circular) from mm10 genome In this manner, DB results from multiple window widths can be gathered together and reported as a single set of regions. Consolidation is most useful for histone marks and other analyses involving diffuse regions of enrichment. For such studies, the ideal window size is not known or may not even exist, e.g., if the widths of the enriched regions or DB subintervals are variable. 7.4.2 Weighting windows on abundance Windows that are more likely to be DB can be upweighted to improve detection power. For example, in TF ChIP-seq data, the window of highest abundance within each enriched region probably contains the binding site. It is reasonable to assume that this window will also have the strongest DB.
To improve power, the weight assigned to the most abundant window is increased relative to that of other windows in the same cluster. This means that the \\(p\\)-value of this window will have a greater influence on the final combined \\(p\\)-value. Weights are computed so as to minimize conservativeness relative to the optimal unweighted approaches in each possible scenario. If the strongest DB event is at the most abundant window, the weighted approach will yield a combined \\(p\\)-value that is no larger than twice the \\(p\\)-value of the most abundant window. (Here, the optimal approach would be to use the \\(p\\)-value of the most abundant window directly as a proxy for the \\(p\\)-value of the cluster.) If the strongest DB event is not at the most abundant window, the weighted approach will yield a combined \\(p\\)-value that is no larger than twice the combined \\(p\\)-value without weighting (which is optimal as all windows have equal probabilities of containing the strongest DB). All windows have non-zero weights, which ensures that any DB events in the other windows will still be considered when the \\(p\\)-values are combined. The application of this weighting scheme is demonstrated in the example below. First, the getBestTest() function with by.pval=FALSE is used to identify the most abundant window in each cluster. Window-specific weights are then computed using the upweightSummit() function, and supplied to combineTests() to use in computing combined \\(p\\)-values.
tab.ave <- getBestTest(merged$ids, rowData(filtered.data), by.pval=FALSE)
weights <- upweightSummit(merged$ids, tab.ave$rep.test)
head(weights)
## [1] 1 5 1 1 1 1
tabcom.w <- combineTests(merged$ids, rowData(filtered.data), weights=weights)
head(tabcom.w)
## DataFrame with 6 rows and 8 columns
##   num.tests num.up.logFC num.down.logFC     PValue       FDR   direction
##   <integer>    <integer>      <integer>  <numeric> <numeric> <character>
## 1         5            3              0 0.01279380 0.0297165          up
## 2         3            2              0 0.00962617 0.0245598          up
## 3         2            2              0 0.03160768 0.0568416          up
## 4         2            0              0 0.12888341 0.1696159          up
## 5         2            0              0 0.12875551 0.1695099          up
## 6         5            2              0 0.01693914 0.0359806          up
##    rep.test rep.logFC
##   <integer> <numeric>
## 1         2   1.40808
## 2         7   1.81146
## 3         9   1.57495
## 4        12   1.03347
## 5        14   1.08336
## 6        17   1.32717
The weighting approach can also be applied to the clusters from the broad gene body example. This is done by replacing the call to getBestTest() with one to getBestOverlaps(), as before. Similarly, upweightSummit() can be replaced with summitOverlaps(). These wrappers are designed to minimize book-keeping problems when one window overlaps multiple regions.
broad.best <- getBestOverlaps(olap, rowData(filtered.data), by.pval=FALSE)
head(broad.best[!is.na(broad.best$PValue),])
## DataFrame with 6 rows and 8 columns
##    num.tests num.up.logFC num.down.logFC      PValue         FDR   direction
##    <integer>    <integer>      <integer>   <numeric>   <numeric> <character>
## 7          1            1              0 3.88905e-05 0.000755731          up
## 18         4            1              0 7.50992e-03 0.020295810          up
## 23         3            1              0 2.90086e-02 0.052950589          up
## 25         2            0              0 7.37883e-01 0.758205118          up
## 28         3            1              0 6.88376e-05 0.000995477          up
## 36         4            1              0 2.83839e-03 0.010705025          up
##     rep.test rep.logFC
##    <integer> <numeric>
## 7       6995  3.396579
## 18      8324  1.690290
## 23       315  1.331126
## 25      9977  0.201974
## 28      8774  3.503353
## 36      2717  1.957656
broad.weights <- summitOverlaps(olap, region.best=broad.best$rep.test)
tabbroad.w <- combineOverlaps(olap, rowData(filtered.data), o.weights=broad.weights)
7.4.3 Filtering after testing but before correction
Most of the filters described in earlier chapters are applied before the statistical analysis. However, some of those approaches may be too aggressive, e.g., filtering to retain only local maxima or filtering based on pre-defined regions. In such cases, it may be preferable to initially apply one of the other, milder filters. This ensures that sufficient windows are retained for stable normalization and/or EB shrinkage. The aggressive filters can then be applied after the window-level statistics have been calculated, but before clustering into regions and calculation of cluster-level statistics. This is still beneficial as it removes irrelevant windows that would otherwise increase the severity of the BH correction. It may also reduce chaining effects during clustering.
7.5 FDR control in difficult situations
7.5.1 Clustering only on DB windows for diffuse marks
The clustering procedures described above rely on independent filtering to remove irrelevant windows.
This ensures that the regions of interest are reasonably narrow and can be easily interpreted, which is typically the case for most protein targets, e.g., TFs, narrow histone marks. However, enriched regions may be very large for more diffuse marks. Such regions may be difficult to interpret when only the DB subinterval is of interest. To overcome this, a post-hoc analysis can be performed whereby only significant windows are used for clustering. postclust &lt;- clusterWindows(rowRanges(filtered.data), rowData(filtered.data), target=0.05, tol=100, max.width=1000) postclust$FDR ## [1] 0.04957 postclust$region ## GRanges object with 1977 ranges and 0 metadata columns: ## seqnames ranges strand ## &lt;Rle&gt; &lt;IRanges&gt; &lt;Rle&gt; ## [1] chr1 7398001-7398010 * ## [2] chr1 9541451-9541510 * ## [3] chr1 15805551-15805660 * ## [4] chr1 23256201-23256210 * ## [5] chr1 32172651-32172660 * ## ... ... ... ... ## [1973] chrX 102157051-102157110 * ## [1974] chrX 104482701-104482760 * ## [1975] chrX 106187051-106187060 * ## [1976] chrX 140456551-140456560 * ## [1977] chrX_GL456233_random 336801-336910 * ## ------- ## seqinfo: 66 sequences from an unspecified genome This will define and cluster significant windows in a manner that controls the cluster-level FDR at 5%. The clustering step itself is performed using mergeWindows() with the specified parameters. Each cluster consists entirely of DB windows and can be directly interpreted as a DB region or a DB subinterval of a larger enriched region. This reduces the pressure on abundance filtering to obtain well-separated regions prior to clustering, e.g., for diffuse marks or in data sets with weak IP signal. That said, users should be aware that calculation of the cluster-level FDR is not entirely rigorous. As such, independent clustering and FDR control via Simes’ method should be considered as the default for routine analyses. 
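For readers unfamiliar with Simes' method, its calculation is easy to reproduce by hand. The base-R sketch below (using made-up window-level p-values rather than actual csaw output) computes the combined p-value for a single cluster:

```r
# Simes' method: the cluster-level p-value is the minimum of n * p_(k) / k
# over the sorted window-level p-values p_(1) <= ... <= p_(n).
simes <- function(p) {
    n <- length(p)
    min(n * sort(p) / seq_len(n))
}

# Hypothetical p-values for the windows in one cluster.
p.windows <- c(0.01, 0.20, 0.74, 0.05)
simes(p.windows)
## [1] 0.04
```

With real data, combineTests() performs this calculation for every cluster, along with the weighting and directionality book-keeping described above.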
7.5.2 Using the empirical FDR for noisy data Some analyses involve comparisons of ChIP samples to negative controls. In such cases, any region exhibiting enrichment in the negative control over the ChIP samples must be a false positive. The number of significant regions that change in the “wrong” direction can be used as an estimate of the number of false positives at any given \\(p\\)-value threshold. Division by the number of discoveries changing in the “right” direction yields an estimate of the FDR, i.e., the empirical FDR (Zhang et al. 2008). This strategy is implemented in the empiricalFDR() function, which controls the empirical FDR across clusters based on their combined \\(p\\)-values. Its use is demonstrated below, though the output is not meaningful in this situation as genuine changes in binding can be present in both directions. empres &lt;- empiricalFDR(merged$id, rowData(filtered.data)) The empirical FDR is useful for analyses of noisy data with high levels of non-specific binding. This is because the estimate of the number of false positives adapts to the observed number of regions exhibiting enrichment in the negative controls. In contrast, the standard BH method in combineTests() relies on proper type I error control during hypothesis testing. As non-specific binding events tend to be condition-specific, they are indistinguishable from DB events and assigned low \\(p\\)-values, resulting in loss of FDR control. Thus, for noisy data, use of the empirical FDR may be more appropriate to control the proportion of “experimental” false positives. However, calculation of the empirical FDR is not as statistically rigorous as that of the BH method, so users are advised to only apply it when necessary. 7.5.3 Detecting complex DB Complex DB events involve changes to the shape of the binding profile, not just a scaling increase/decrease to binding intensity. 
Such regions may contain multiple sites that change in binding strength in opposite directions, or peaks that change in width or position between conditions. This often manifests as DB in opposite directions in different subintervals of a region. Some of these events can be identified using the mixedTests() function.
tab.mixed <- mixedTests(merged$ids, rowData(filtered.data))
tab.mixed
## DataFrame with 3577 rows and 10 columns
##      num.tests num.up.logFC num.down.logFC    PValue       FDR   direction
##      <integer>    <integer>      <integer> <numeric> <numeric> <character>
## 1            5            5              0  0.997738         1       mixed
## 2            3            2              0  0.997593         1       mixed
## 3            2            2              0  0.991634         1       mixed
## 4            2            0              0  0.947173         1       mixed
## 5            2            0              0  0.957081         1       mixed
## ...        ...          ...            ...       ...       ...         ...
## 3573         3            0              3  1.000000         1       mixed
## 3574         1            0              0  0.931630         1       mixed
## 3575         2            0              0  0.853330         1       mixed
## 3576         1            1              0  0.967612         1       mixed
## 3577         2            0              0  0.749056         1       mixed
##      rep.up.test rep.up.logFC rep.down.test rep.down.logFC
##        <integer>    <numeric>     <integer>      <numeric>
## 1              2      1.40808             3        1.57363
## 2              7      1.81146             8        1.98359
## 3              9      1.57495            10        1.69023
## 4             12      1.03347            11        1.12035
## 5             14      1.08336            14        1.08336
## ...          ...          ...           ...            ...
## 3573       12345    -2.607002         12345      -2.607002
## 3574       12347    -0.987658         12347      -0.987658
## 3575       12349    -0.636963         12348      -0.440200
## 3576       12350     1.245995         12350       1.245995
## 3577       12351    -0.430332         12352      -0.163333
mixedTests() converts the p-value for each window into two one-sided p-values. The one-sided p-values in each direction are combined using Simes' method, and the two one-sided combined p-values are themselves combined using an intersection-union test (Berger and Hsu 1996). The resulting p-value is only low if a region contains strong DB in both directions. combineTests() also computes some statistics for informal detection of complex DB. For example, the num.up.logFC and num.down.logFC fields can be used to identify regions with significant changes in both directions.
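To illustrate this kind of informal check, the sketch below flags candidate complex DB events in a mock data.frame with the same columns as the combineTests() output (the values are invented for demonstration, not actual csaw results):

```r
# Mock combineTests()-style output: per-cluster counts of windows that are
# significantly up or significantly down at the chosen log-fold-change threshold.
tabcom <- data.frame(
    num.tests = c(5, 3, 4),
    num.up.logFC = c(3, 0, 2),
    num.down.logFC = c(1, 2, 0)
)

# Clusters with significant windows in both directions are candidates for complex DB.
complex.db <- tabcom$num.up.logFC > 0 & tabcom$num.down.logFC > 0
which(complex.db)
## [1] 1
```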
The direction field will also label some regions as "mixed", though this is not comprehensive. Indeed, regions labelled as "up" or "down" in the direction field may also correspond to complex DB events, but will not be labelled as "mixed" if the significance calculations are dominated by windows changing in only one direction.
7.5.4 Enforcing a minimal number of DB windows
On occasion, we may be interested in genomic regions that contain at least a minimal number or proportion of DB windows. This is motivated by the desire to avoid detecting DB regions where only a small subinterval exhibits a change, instead favoring more systematic changes throughout the region that are easier to interpret. We can identify these regions using the minimalTests() function.
tab.min <- minimalTests(merged$ids, rowData(filtered.data),
    min.sig.n=3, min.sig.prop=0.5)
tab.min
## DataFrame with 3577 rows and 8 columns
##      num.tests num.up.logFC num.down.logFC      PValue        FDR   direction
##      <integer>    <integer>      <integer>   <numeric>  <numeric> <character>
## 1            5            2              0   0.0898582  0.1634907          up
## 2            3            2              0   0.1071287  0.1854788          up
## 3            2            2              0   0.0334643  0.0858077          up
## 4            2            0              0   0.2113083  0.3035635          up
## 5            2            0              0   0.2388546  0.3316704          up
## ...        ...          ...            ...         ...        ...         ...
## 3573         3            0              3 1.69623e-05 0.00123825        down
## 3574         1            0              0 1.36739e-01 0.22131946        down
## 3575         2            0              0 5.86682e-01 0.68760453        down
## 3576         1            0              0 6.47760e-02 0.13112825          up
## 3577         2            0              0 1.00000e+00 1.00000000        down
##       rep.test rep.logFC
##      <integer> <numeric>
## 1            1  1.238160
## 2            6  1.094909
## 3            9  1.574948
## 4           12  1.033474
## 5           13  0.738997
## ...        ...       ...
## 3573     12346 -2.535065
## 3574     12347 -0.987658
## 3575     12348 -0.440200
## 3576     12350  1.245995
## 3577     12352 -0.163333
minimalTests() applies a Holm-Bonferroni correction to all windows in the same cluster and picks the x-th smallest adjusted p-value (where x is defined from min.sig.n and min.sig.prop).
This tests the joint null hypothesis that the per-window null hypothesis is false for fewer than \\(x\\) windows in the cluster. If the \\(x\\)th-smallest \\(p\\)-value is low, this provides strong evidence against the joint null for that cluster. As an aside, this function also has some utility outside of ChIP-seq contexts. For example, we might want to obtain a single \\(p\\)-value for a gene set based on the presence of a minimal percentage of differentially expressed genes. Alternatively, we may be interested in ranking genes in loss-of-function screens based on a minimal number of shRNA/CRISPR guides that exhibit a significant effect. These problems are equivalent to that of identifying a genomic region with a minimal number of DB windows. Session information View session info R version 4.1.0 (2021-05-18) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] chipseqDBData_1.8.0 [2] TxDb.Mmusculus.UCSC.mm10.knownGene_3.10.0 [3] GenomicFeatures_1.44.0 [4] AnnotationDbi_1.54.0 [5] csaw_1.26.0 [6] SummarizedExperiment_1.22.0 [7] Biobase_2.52.0 [8] MatrixGenerics_1.4.0 [9] matrixStats_0.58.0 [10] GenomicRanges_1.44.0 [11] GenomeInfoDb_1.28.0 [12] IRanges_2.26.0 [13] S4Vectors_0.30.0 [14] BiocGenerics_0.38.0 [15] BiocStyle_2.20.0 [16] rebook_1.2.0 loaded via a namespace (and not attached): [1] bitops_1.0-7 bit64_4.0.5 [3] filelock_1.0.2 progress_1.2.2 [5] httr_1.4.2 tools_4.1.0 [7] bslib_0.2.5.1 utf8_1.2.1 [9] R6_2.5.0 DBI_1.1.1 [11] 
withr_2.4.2 tidyselect_1.1.1 [13] prettyunits_1.1.1 curl_4.3.1 [15] bit_4.0.4 compiler_4.1.0 [17] graph_1.70.0 DelayedArray_0.18.0 [19] rtracklayer_1.52.0 bookdown_0.22 [21] sass_0.4.0 rappdirs_0.3.3 [23] stringr_1.4.0 digest_0.6.27 [25] Rsamtools_2.8.0 rmarkdown_2.8 [27] XVector_0.32.0 pkgconfig_2.0.3 [29] htmltools_0.5.1.1 dbplyr_2.1.1 [31] fastmap_1.1.0 limma_3.48.0 [33] rlang_0.4.11 RSQLite_2.2.7 [35] shiny_1.6.0 BiocIO_1.2.0 [37] jquerylib_0.1.4 generics_0.1.0 [39] jsonlite_1.7.2 BiocParallel_1.26.0 [41] dplyr_1.0.6 RCurl_1.98-1.3 [43] magrittr_2.0.1 GenomeInfoDbData_1.2.6 [45] Matrix_1.3-3 Rcpp_1.0.6 [47] fansi_0.4.2 lifecycle_1.0.0 [49] stringi_1.6.2 yaml_2.2.1 [51] edgeR_3.34.0 zlibbioc_1.38.0 [53] AnnotationHub_3.0.0 BiocFileCache_2.0.0 [55] grid_4.1.0 blob_1.2.1 [57] promises_1.2.0.1 ExperimentHub_2.0.0 [59] crayon_1.4.1 dir.expiry_1.0.0 [61] lattice_0.20-44 Biostrings_2.60.0 [63] hms_1.1.0 KEGGREST_1.32.0 [65] locfit_1.5-9.4 CodeDepends_0.6.5 [67] metapod_1.0.0 knitr_1.33 [69] pillar_1.6.1 rjson_0.2.20 [71] codetools_0.2-18 biomaRt_2.48.0 [73] BiocVersion_3.13.1 XML_3.99-0.6 [75] glue_1.4.2 evaluate_0.14 [77] BiocManager_1.30.15 httpuv_1.6.1 [79] png_0.1-7 vctrs_0.3.8 [81] purrr_0.3.4 assertthat_0.2.1 [83] cachem_1.0.5 xfun_0.23 [85] mime_0.10 xtable_1.8-4 [87] restfulr_0.0.13 later_1.2.0 [89] tibble_3.1.2 GenomicAlignments_1.28.0 [91] memoise_2.0.0 interactiveDisplayBase_1.30.0 [93] ellipsis_0.3.2 Bibliography
Chapter 8 Annotation and visualization
8.1 Adding gene-based annotation
Annotation can be added to a given set of regions using the detailRanges() function. This will identify overlaps between the regions and annotated genomic features such as exons, introns and promoters. Here, the promoter region of each gene is defined as the interval spanning 3 kbp upstream and 1 kbp downstream of the TSS for that gene. Any exonic features within dist (5 kbp in the call below) on the left or right side of each supplied region will also be reported.
#--- loading-files ---#
library(chipseqDBData)
tf.data <- NFYAData()
tf.data
bam.files <- head(tf.data$Path, -1) # skip the input.
bam.files
#--- counting-windows ---#
library(csaw)
frag.len <- 110
win.width <- 10
param <- readParam(minq=20)
data <- windowCounts(bam.files, ext=frag.len, width=win.width, param=param)
#--- filtering ---#
binned <- windowCounts(bam.files, bin=10000, param=param)
fstats <- filterWindowsGlobal(data, binned)
filtered.data <- data[fstats$filter > log2(5),]
#--- normalization ---#
filtered.data <- normFactors(binned, se.out=filtered.data)
#--- modelling ---#
cell.type <- sub("NF-YA ([^ ]+) .*", "\\1", head(tf.data$Description, -1))
design <- model.matrix(~cell.type)
colnames(design) <- c("intercept", "cell.type")
library(edgeR)
y <- asDGEList(filtered.data)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design, robust=TRUE)
res <- glmQLFTest(fit, coef="cell.type")
rowData(filtered.data) <- cbind(rowData(filtered.data), res$table)
#--- merging ---#
merged <- mergeResults(filtered.data, tol=1000, merge.args=list(max.width=5000))
library(csaw)
library(org.Mm.eg.db)
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
anno <- detailRanges(merged$regions, txdb=TxDb.Mmusculus.UCSC.mm10.knownGene,
    orgdb=org.Mm.eg.db, promoter=c(3000, 1000), dist=5000)
head(anno$overlap)
## [1] ""                    ""                    "Rrs1:+:P,Adhfe1:+:P"
## [4] "Ppp1r42:-:I"
"" "Ncoa2:-:PI"
head(anno$left)
## [1] ""               ""               ""               "Ppp1r42:-:3948"
## [5] ""               "Ncoa2:-:12"
head(anno$right)
## [1] ""                        "Rrs1:+:3898"
## [3] "Rrs1:+:48,Adhfe1:+:2588" "Ppp1r42:-:1612"
## [5] "Ncoa2:-:4595"            "Ncoa2:-:1278"
Character vectors of compact string representations are provided to summarize the features overlapped by each supplied region. Each pattern has the form GENE:STRAND:TYPE, describing the strand of the gene and the type of the overlapped feature. Exons are labelled as E, promoters as P and introns as I. For left and right, TYPE is replaced by DISTANCE, which indicates the gap (in base pairs) between the supplied region and the closest non-overlapping exon of the annotated feature. All of this annotation can be stored in the metadata of the GRanges object for later use.
merged$regions$overlap <- anno$overlap
merged$regions$left <- anno$left
merged$regions$right <- anno$right
While the string representation saves space in the output, it is not easy to work with. If the annotation needs to be manipulated directly, users can obtain it from the detailRanges() command by not specifying the regions of interest. This can then be used for interactive manipulation, e.g., to identify all genes where the promoter contains DB sites.
anno.ranges <- detailRanges(txdb=TxDb.Mmusculus.UCSC.mm10.knownGene,
    orgdb=org.Mm.eg.db)
anno.ranges
## GRanges object with 505738 ranges and 2 metadata columns:
##              seqnames            ranges strand |      symbol        type
##                 <Rle>         <IRanges>  <Rle> | <character> <character>
##   100009600      chr9 21062393-21062717      - |       Zglp1           E
##   100009600      chr9 21062400-21062717      - |       Zglp1           E
##   100009600      chr9 21062894-21062987      - |       Zglp1           E
##   100009600      chr9 21063314-21063396      - |       Zglp1           E
##   100009600      chr9 21066024-21066377      - |       Zglp1           E
##         ...       ...               ...    ... .         ...         ...
## 99982 chr4 136554325-136558324 - | Kdm1a P ## 99982 chr4 136560502-136564501 - | Kdm1a P ## 99982 chr4 136567608-136571607 - | Kdm1a P ## 99982 chr4 136576159-136580158 - | Kdm1a P ## 99982 chr4 136550540-136602723 - | Kdm1a G ## ------- ## seqinfo: 66 sequences (1 circular) from mm10 genome 8.2 Checking bimodality for TF studies For TF experiments, a simple measure of strand bimodality can be reported as a diagnostic. Given a set of regions, the checkBimodality() function will return the maximum bimodality score across all base positions in each region. The bimodality score at each base position is defined as the minimum of the ratio of the number of forward- to reverse-stranded reads to the left of that position, and the ratio of the reverse- to forward-stranded reads to the right. A high score is only possible if both ratios are large, i.e., strand bimodality is present. # TODO: make this less weird. spacing &lt;- metadata(data)$spacing expanded &lt;- resize(merged$regions, fix=&quot;center&quot;, width=width(merged$regions)+spacing) sbm.score &lt;- checkBimodality(bam.files, expanded, width=frag.len) head(sbm.score) ## [1] 1.456 2.263 1.364 5.333 1.933 3.405 In the above code, all regions are expanded by spacing, i.e., 50 bp. This ensures that the optimal bimodality score can be computed for the centre of the binding site, even if that position is not captured by a window. The width argument specifies the span with which to count reads for the score calculation. This should be set to the average fragment length. If multiple bam.files are provided, they will be pooled during counting. For typical TF binding sites, bimodality scores can be considered to be “high” if they are larger than 4. This allows users to distinguish between genuine binding sites and high-abundance artifacts such as repeats or read stacks. However, caution is still required as some high scores may be driven by the stochastic distribution of reads. 
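To make the score definition concrete, the following base-R toy calculation evaluates it at a single position using hypothetical strand-specific read counts; checkBimodality() performs the equivalent computation across all base positions in each supplied region:

```r
# Strand bimodality score at one position, following the definition above:
# the minimum of the forward/reverse ratio on the left and the
# reverse/forward ratio on the right. Counts here are made up.
bimodality <- function(fwd.left, rev.left, fwd.right, rev.right) {
    min(fwd.left / rev.left, rev.right / fwd.right)
}

# A genuine binding site: mostly forward reads upstream, reverse reads downstream.
bimodality(fwd.left=20, rev.left=4, fwd.right=3, rev.right=18)
## [1] 5
```

Both ratios must be large for the score to be large, which is why a high score is specific to the bimodal pattern expected at a true binding site.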
Obviously, the concept of strand bimodality is less relevant for diffuse targets like histone marks. 8.3 Saving the results to file It is a simple matter to save the results for later perusal, e.g., to a tab-separated file. ofile &lt;- gzfile(&quot;clusters.tsv.gz&quot;, open=&quot;w&quot;) write.table(as.data.frame(merged), file=ofile, row.names=FALSE, quote=FALSE, sep=&quot;\\t&quot;) close(ofile) Of course, other formats can be used depending on the purpose of the file. For example, significantly DB regions can be exported to BED files through the rtracklayer package for visual inspection with genomic browsers. A transformed FDR is used here for the score field. is.sig &lt;- merged$combined$FDR &lt;= 0.05 test &lt;- merged$regions[is.sig] test$score &lt;- -10*log10(merged$combined$FDR[is.sig]) names(test) &lt;- paste0(&quot;region&quot;, 1:sum(is.sig)) library(rtracklayer) export(test, &quot;clusters.bed&quot;) head(read.table(&quot;clusters.bed&quot;)) ## V1 V2 V3 V4 V5 V6 ## 1 chr1 7397900 7398110 region1 13.74 . ## 2 chr1 9541400 9541510 region2 15.60 . ## 3 chr1 13589950 13590010 region3 14.56 . ## 4 chr1 15805500 15805660 region4 22.60 . ## 5 chr1 23256050 23256210 region5 14.47 . ## 6 chr1 32172600 32172710 region6 13.82 . Alternatively, the GRanges object can be directly saved to file and reloaded later for direct manipulation in the R environment, e.g., to find overlaps with other regions of interest. saveRDS(merged$regions, &quot;ranges.rds&quot;) 8.4 Simple visualization of genomic coverage Visualization of the read depth around interesting features is often desired. This is facilitated by the extractReads() function, which pulls out the reads from the BAM file. The returned GRanges object can then be used to plot the sequencing coverage or any other statistic of interest. Note that the extractReads() function also accepts a readParam object. This ensures that the same reads used in the analysis will be pulled out during visualization. 
cur.region &lt;- GRanges(&quot;chr18&quot;, IRanges(77806807, 77807165)) extractReads(bam.files[[1]], cur.region, param=param) ## GRanges object with 55 ranges and 0 metadata columns: ## seqnames ranges strand ## &lt;Rle&gt; &lt;IRanges&gt; &lt;Rle&gt; ## [1] chr18 77806886-77806922 + ## [2] chr18 77806887-77806923 + ## [3] chr18 77806887-77806923 + ## [4] chr18 77806887-77806923 + ## [5] chr18 77806890-77806926 + ## ... ... ... ... ## [51] chr18 77807063-77807095 - ## [52] chr18 77807068-77807104 - ## [53] chr18 77807082-77807119 - ## [54] chr18 77807084-77807120 - ## [55] chr18 77807087-77807123 - ## ------- ## seqinfo: 1 sequence from an unspecified genome Here, coverage is visualized as the number of reads covering each base pair in the interval of interest. Specifically, the reads-per-million is shown to allow comparisons between libraries of different size. The plots themselves are constructed using methods from the Gviz package. The blue and red tracks represent the coverage on the forward and reverse strands, respectively. Strong strand bimodality is consistent with a genuine TF binding site. For paired-end data, coverage can be similarly plotted for fragments, i.e., proper read pairs. 
library(Gviz) collected &lt;- vector(&quot;list&quot;, length(bam.files)) for (i in seq_along(bam.files)) { reads &lt;- extractReads(bam.files[[i]], cur.region, param=param) adj.total &lt;- data$totals[i]/1e6 pcov &lt;- as(coverage(reads[strand(reads)==&quot;+&quot;])/adj.total, &quot;GRanges&quot;) ncov &lt;- as(coverage(reads[strand(reads)==&quot;-&quot;])/adj.total, &quot;GRanges&quot;) ptrack &lt;- DataTrack(pcov, type=&quot;histogram&quot;, lwd=0, fill=rgb(0,0,1,.4), ylim=c(0,1.1), name=tf.data$Name[i], col.axis=&quot;black&quot;, col.title=&quot;black&quot;) ntrack &lt;- DataTrack(ncov, type=&quot;histogram&quot;, lwd=0, fill=rgb(1,0,0,.4), ylim=c(0,1.1)) collected[[i]] &lt;- OverlayTrack(trackList=list(ptrack,ntrack)) } gax &lt;- GenomeAxisTrack(col=&quot;black&quot;) plotTracks(c(gax, collected), from=start(cur.region), to=end(cur.region)) Session information View session info R version 4.1.0 (2021-05-18) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] grid parallel stats4 stats graphics grDevices utils [8] datasets methods base other attached packages: [1] Gviz_1.36.0 [2] rtracklayer_1.52.0 [3] TxDb.Mmusculus.UCSC.mm10.knownGene_3.10.0 [4] GenomicFeatures_1.44.0 [5] org.Mm.eg.db_3.13.0 [6] AnnotationDbi_1.54.0 [7] csaw_1.26.0 [8] SummarizedExperiment_1.22.0 [9] Biobase_2.52.0 [10] MatrixGenerics_1.4.0 [11] matrixStats_0.58.0 [12] GenomicRanges_1.44.0 [13] GenomeInfoDb_1.28.0 [14] IRanges_2.26.0 [15] S4Vectors_0.30.0 [16] BiocGenerics_0.38.0 [17] BiocStyle_2.20.0 [18] rebook_1.2.0 loaded via a namespace (and not 
attached): [1] colorspace_2.0-1 rjson_0.2.20 ellipsis_0.3.2 [4] htmlTable_2.2.1 biovizBase_1.40.0 XVector_0.32.0 [7] base64enc_0.1-3 dichromat_2.0-0 rstudioapi_0.13 [10] bit64_4.0.5 fansi_0.4.2 codetools_0.2-18 [13] splines_4.1.0 cachem_1.0.5 knitr_1.33 [16] Formula_1.2-4 jsonlite_1.7.2 Rsamtools_2.8.0 [19] cluster_2.1.2 dbplyr_2.1.1 png_0.1-7 [22] graph_1.70.0 BiocManager_1.30.15 compiler_4.1.0 [25] httr_1.4.2 backports_1.2.1 lazyeval_0.2.2 [28] assertthat_0.2.1 Matrix_1.3-3 fastmap_1.1.0 [31] limma_3.48.0 htmltools_0.5.1.1 prettyunits_1.1.1 [34] tools_4.1.0 gtable_0.3.0 glue_1.4.2 [37] GenomeInfoDbData_1.2.6 dplyr_1.0.6 rappdirs_0.3.3 [40] Rcpp_1.0.6 jquerylib_0.1.4 vctrs_0.3.8 [43] Biostrings_2.60.0 xfun_0.23 stringr_1.4.0 [46] lifecycle_1.0.0 ensembldb_2.16.0 restfulr_0.0.13 [49] XML_3.99-0.6 edgeR_3.34.0 zlibbioc_1.38.0 [52] scales_1.1.1 BSgenome_1.60.0 VariantAnnotation_1.38.0 [55] ProtGenerics_1.24.0 hms_1.1.0 AnnotationFilter_1.16.0 [58] RColorBrewer_1.1-2 yaml_2.2.1 curl_4.3.1 [61] gridExtra_2.3 memoise_2.0.0 ggplot2_3.3.3 [64] sass_0.4.0 rpart_4.1-15 biomaRt_2.48.0 [67] latticeExtra_0.6-29 stringi_1.6.2 RSQLite_2.2.7 [70] highr_0.9 BiocIO_1.2.0 checkmate_2.0.0 [73] filelock_1.0.2 BiocParallel_1.26.0 rlang_0.4.11 [76] pkgconfig_2.0.3 bitops_1.0-7 evaluate_0.14 [79] lattice_0.20-44 purrr_0.3.4 htmlwidgets_1.5.3 [82] GenomicAlignments_1.28.0 CodeDepends_0.6.5 bit_4.0.4 [85] tidyselect_1.1.1 magrittr_2.0.1 bookdown_0.22 [88] R6_2.5.0 generics_0.1.0 Hmisc_4.5-0 [91] metapod_1.0.0 DelayedArray_0.18.0 DBI_1.1.1 [94] foreign_0.8-81 pillar_1.6.1 nnet_7.3-16 [97] survival_3.2-11 KEGGREST_1.32.0 RCurl_1.98-1.3 [100] tibble_3.1.2 dir.expiry_1.0.0 crayon_1.4.1 [103] utf8_1.2.1 BiocFileCache_2.0.0 rmarkdown_2.8 [106] jpeg_0.1-8.1 progress_1.2.2 locfit_1.5-9.4 [109] data.table_1.14.0 blob_1.2.1 digest_0.6.27 [112] munsell_0.5.0 bslib_0.2.5.1 "],["h3k9ac-pro-b-versus-mature-b.html", "Chapter 9 H3K9ac, pro-B versus mature B 9.1 Overview 9.2 Pre-processing checks 9.3 
Chapter 9 H3K9ac, pro-B versus mature B
9.1 Overview
Here, we perform a window-based differential binding (DB) analysis to identify regions of differential H3K9ac enrichment between pro-B and mature B cells (Revilla-I-Domingo et al. 2012). H3K9ac is associated with active promoters and tends to exhibit relatively narrow regions of enrichment compared to other marks such as H3K27me3. For this study, the experimental design contains two biological replicates for each of the two cell types. We download the BAM files using the relevant function from the chipseqDBData package.
library(chipseqDBData)
acdata <- H3K9acData()
acdata
## DataFrame with 4 rows and 3 columns
##                  Name            Description      Path
##           <character>            <character>    <List>
## 1    h3k9ac-proB-8113    pro-B H3K9ac (8113) <BamFile>
## 2    h3k9ac-proB-8108    pro-B H3K9ac (8108) <BamFile>
## 3 h3k9ac-matureB-8059 mature B H3K9ac (8059) <BamFile>
## 4 h3k9ac-matureB-8086 mature B H3K9ac (8086) <BamFile>
9.2 Pre-processing checks
9.2.1 Examining mapping statistics
We use methods from the Rsamtools package to compute some mapping statistics for each BAM file. Ideally, the proportion of mapped reads should be high (70-80% or higher), while the proportion of marked reads should be low (generally below 20%).
library(Rsamtools) diagnostics &lt;- list() for (b in seq_along(acdata$Path)) { bam &lt;- acdata$Path[[b]] total &lt;- countBam(bam)$records mapped &lt;- countBam(bam, param=ScanBamParam( flag=scanBamFlag(isUnmapped=FALSE)))$records marked &lt;- countBam(bam, param=ScanBamParam( flag=scanBamFlag(isUnmapped=FALSE, isDuplicate=TRUE)))$records diagnostics[[b]] &lt;- c(Total=total, Mapped=mapped, Marked=marked) } diag.stats &lt;- data.frame(do.call(rbind, diagnostics)) rownames(diag.stats) &lt;- acdata$Name diag.stats$Prop.mapped &lt;- diag.stats$Mapped/diag.stats$Total*100 diag.stats$Prop.marked &lt;- diag.stats$Marked/diag.stats$Mapped*100 diag.stats ## Total Mapped Marked Prop.mapped Prop.marked ## h3k9ac-proB-8113 10724526 8832006 434884 82.35 4.924 ## h3k9ac-proB-8108 10413135 7793913 252271 74.85 3.237 ## h3k9ac-matureB-8059 16675372 4670364 396785 28.01 8.496 ## h3k9ac-matureB-8086 6347683 4551692 141583 71.71 3.111 Note that all csaw functions that read from a BAM file require BAM indices with .bai suffixes. In this case, index files have already been downloaded by H3K9acData(), but users supplying their own files should take care to ensure that BAM indices are available with appropriate names. 9.2.2 Obtaining the ENCODE blacklist We identify and remove problematic regions (Section 3.1) using an annotated blacklist for the mm10 build of the mouse genome, constructed by identifying consistently problematic regions from ENCODE datasets (Consortium 2012). We download this BED file and save it into a local cache with the BiocFileCache package. This allows it to be used again in later workflows without being re-downloaded. library(BiocFileCache) bfc &lt;- BiocFileCache(&quot;local&quot;, ask=FALSE) black.path &lt;- bfcrpath(bfc, file.path(&quot;https://www.encodeproject.org&quot;, &quot;files/ENCFF547MET/@@download/ENCFF547MET.bed.gz&quot;)) Genomic intervals in the blacklist are loaded using the import() method from the rtracklayer package. 
All reads mapped within the blacklisted intervals will be ignored during processing in csaw by specifying the discard= parameter (see below).
library(rtracklayer)
blacklist <- import(black.path)
blacklist
## GRanges object with 164 ranges and 0 metadata columns:
##         seqnames              ranges strand
##            <Rle>           <IRanges>  <Rle>
##     [1]    chr10     3110061-3110270      *
##     [2]    chr10   22142531-22142880      *
##     [3]    chr10   22142831-22143070      *
##     [4]    chr10   58223871-58224100      *
##     [5]    chr10   58225261-58225500      *
##     ...      ...                 ...    ...
##   [160]     chr9     3038051-3038300      *
##   [161]     chr9   24541941-24542200      *
##   [162]     chr9   35305121-35305620      *
##   [163]     chr9 110281191-110281400      *
##   [164]     chr9 123872951-123873160      *
##   -------
##   seqinfo: 19 sequences from an unspecified genome; no seqlengths
9.2.3 Setting up extraction parameters
We ignore reads that map to blacklist regions or that do not map to the standard set of mouse nuclear chromosomes.
library(csaw)
standard.chr <- paste0("chr", c(1:19, "X", "Y"))
param <- readParam(minq=20, discard=blacklist, restrict=standard.chr)
Reads are also ignored if they have a mapping quality score below 20. This avoids spurious results due to weak or non-unique alignments that should be assigned low MAPQ scores by the aligner. Note that the range of MAPQ scores will vary between aligners, so some inspection of the BAM files is necessary to choose an appropriate value.
9.3 Quantifying coverage
9.3.1 Computing the average fragment length
We estimate the average fragment length from cross-correlation plots (Section 2.4). Specifically, the delay at the peak in the cross-correlation is used as the average length in our analysis (Figure 9.1).
x &lt;- correlateReads(acdata$Path, param=reform(param, dedup=TRUE)) frag.len &lt;- maximizeCcf(x) frag.len ## [1] 154 plot(1:length(x)-1, x, xlab=&quot;Delay (bp)&quot;, ylab=&quot;CCF&quot;, type=&quot;l&quot;) abline(v=frag.len, col=&quot;red&quot;) text(x=frag.len, y=min(x), paste(frag.len, &quot;bp&quot;), pos=4, col=&quot;red&quot;) Figure 9.1: Cross-correlation function (CCF) against delay distance for the H3K9ac data set. The delay with the maximum correlation is shown as the red line. Only unmarked reads (i.e., not potential PCR duplicates) are used to calculate the cross-correlations. However, general removal of marked reads is risky as it caps the signal in high-coverage regions of the genome. Thus, the marking status of each read will be ignored in the rest of the analysis, i.e., no duplicates will be removed in downstream steps. 9.3.2 Counting reads into windows The windowCounts() function produces a RangedSummarizedExperiment object containing a matrix of window counts. Each row corresponds to a window; each column represents a BAM file corresponding to a single sample; and each entry of the matrix represents the number of fragments overlapping a particular window in a particular sample. win.data &lt;- windowCounts(acdata$Path, param=param, width=150, ext=frag.len) win.data ## class: RangedSummarizedExperiment ## dim: 1671254 4 ## metadata(6): spacing width ... param final.ext ## assays(1): counts ## rownames: NULL ## rowData names(0): ## colnames: NULL ## colData names(4): bam.files totals ext rlen To analyze H3K9ac data, we use a window size of 150 bp. This corresponds roughly to the length of the DNA in a nucleosome (Humburg et al. 2011), which is the smallest relevant unit for studying histone mark enrichment. The spacing between windows is left as the default of 50 bp, i.e., the start positions for adjacent windows are 50 bp apart. 
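To make the window grid concrete, the coordinates of the first few windows implied by width=150 and the default spacing=50 can be computed directly (a toy sketch, not part of the workflow itself):

```r
# First few windows on a chromosome with width=150 and spacing=50:
# starts are 50 bp apart, so adjacent windows overlap by 100 bp and
# any given base falls into up to three windows.
starts <- seq(from=1, by=50, length.out=5)
data.frame(start=starts, end=starts + 150 - 1)
```

This overlap means that a strong local change in coverage is typically captured by several adjacent windows, which is handled later by aggregating windows into clusters.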
9.4 Filtering windows by abundance We remove low-abundance windows using a global filter on the background enrichment (Section 4.4). A window is only retained if its coverage is 3-fold higher than that of the background regions, i.e., the abundance of the window is greater than the background abundance estimate by log2(3) or more. This removes a large number of windows that are weakly or not marked and are likely to be irrelevant. bins &lt;- windowCounts(acdata$Path, bin=TRUE, width=2000, param=param) filter.stat &lt;- filterWindowsGlobal(win.data, bins) min.fc &lt;- 3 keep &lt;- filter.stat$filter &gt; log2(min.fc) summary(keep) ## Mode FALSE TRUE ## logical 982167 689087 We examine the effect of the fold-change threshold in Figure 9.2. The chosen threshold is greater than the abundances of most bins in the genome – presumably, those that contain background regions. This suggests that the filter will remove most windows lying within background regions. hist(filter.stat$filter, main=&quot;&quot;, breaks=50, xlab=&quot;Background abundance (log2-CPM)&quot;) abline(v=log2(min.fc), col=&quot;red&quot;) Figure 9.2: Histogram of average abundances across all 2 kbp genomic bins. The filter threshold is shown as the red line. The filtering itself is done by simply subsetting the RangedSummarizedExperiment object. filtered.data &lt;- win.data[keep,] 9.5 Normalizing for trended biases In this dataset, we observe a trended bias between samples in Figure 9.3. This refers to a systematic fold-difference in per-window coverage between samples that changes according to the average abundance of the window. 
win.ab &lt;- scaledAverage(filtered.data) adjc &lt;- calculateCPM(filtered.data, use.offsets=FALSE) logfc &lt;- adjc[,4] - adjc[,1] smoothScatter(win.ab, logfc, ylim=c(-6, 6), xlim=c(0, 5), xlab=&quot;Average abundance&quot;, ylab=&quot;Log-fold change&quot;) lfit &lt;- smooth.spline(logfc~win.ab, df=5) o &lt;- order(win.ab) lines(win.ab[o], fitted(lfit)[o], col=&quot;red&quot;, lty=2) Figure 9.3: Abundance-dependent trend in the log-fold change between two H3K9ac samples (mature B over pro-B), across all windows retained after filtering. A smoothed spline fitted to the log-fold change against the average abundance is also shown in red. To remove these biases, we use csaw to compute a matrix of offsets for model fitting. filtered.data &lt;- normOffsets(filtered.data) head(assay(filtered.data, &quot;offset&quot;)) ## [,1] [,2] [,3] [,4] ## [1,] 16.07 15.88 15.05 15.14 ## [2,] 16.05 15.86 15.08 15.17 ## [3,] 16.04 15.86 15.08 15.17 ## [4,] 16.17 15.95 14.98 15.06 ## [5,] 16.24 16.00 14.93 14.97 ## [6,] 16.26 16.02 14.92 14.95 The effect of non-linear normalization is visualized with another mean-difference plot. Once the offsets are applied to adjust the log-fold changes, the trend is eliminated from the plot (Figure 9.4). The cloud of points is also centred at a log-fold change of zero, indicating that normalization successfully removed the differences between samples. norm.adjc &lt;- calculateCPM(filtered.data, use.offsets=TRUE) norm.fc &lt;- norm.adjc[,4]-norm.adjc[,1] smoothScatter(win.ab, norm.fc, ylim=c(-6, 6), xlim=c(0, 5), xlab=&quot;Average abundance&quot;, ylab=&quot;Log-fold change&quot;) lfit &lt;- smooth.spline(norm.fc~win.ab, df=5) lines(win.ab[o], fitted(lfit)[o], col=&quot;red&quot;, lty=2) Figure 9.4: Effect of non-linear normalization on the trended bias between two H3K9ac samples. Normalized log-fold changes are shown for all windows retained after filtering. 
A smoothed spline fitted to the log-fold change against the average abundance is also shown in red. The implicit assumption of non-linear methods is that most windows at each abundance are not DB. Any systematic difference between samples is attributed to bias and is removed. The assumption of a non-DB majority is reasonable for this data set, given that the cell types being compared are quite closely related. 9.6 Statistical modelling 9.6.1 Estimating the NB dispersion First, we set up our design matrix. This involves a fairly straightforward one-way layout with the groups representing our two cell types. celltype &lt;- acdata$Description celltype[grep(&quot;pro&quot;, celltype)] &lt;- &quot;proB&quot; celltype[grep(&quot;mature&quot;, celltype)] &lt;- &quot;matureB&quot; celltype &lt;- factor(celltype) design &lt;- model.matrix(~0+celltype) colnames(design) &lt;- levels(celltype) design ## matureB proB ## 1 0 1 ## 2 0 1 ## 3 1 0 ## 4 1 0 ## attr(,&quot;assign&quot;) ## [1] 1 1 ## attr(,&quot;contrasts&quot;) ## attr(,&quot;contrasts&quot;)$celltype ## [1] &quot;contr.treatment&quot; We coerce the RangedSummarizedExperiment object into a DGEList object (plus offsets) for use in edgeR. We then estimate the NB dispersion to capture the mean-variance relationship. The NB dispersion estimates are shown in Figure 9.5 as their square roots, i.e., the biological coefficients of variation. Data sets with common BCVs ranging from 10 to 20% are considered to have low variability for ChIP-seq experiments. library(edgeR) y &lt;- asDGEList(filtered.data) str(y) ## Formal class &#39;DGEList&#39; [package &quot;edgeR&quot;] with 1 slot ## ..@ .Data:List of 3 ## .. ..$ : int [1:689087, 1:4] 6 6 7 12 15 17 24 22 25 24 ... ## .. .. ..- attr(*, &quot;dimnames&quot;)=List of 2 ## .. .. .. ..$ : chr [1:689087] &quot;1&quot; &quot;2&quot; &quot;3&quot; &quot;4&quot; ... ## .. .. .. ..$ : chr [1:4] &quot;Sample1&quot; &quot;Sample2&quot; &quot;Sample3&quot; &quot;Sample4&quot; ## .. 
..$ :&#39;data.frame&#39;: 4 obs. of 3 variables: ## .. .. ..$ group : Factor w/ 1 level &quot;1&quot;: 1 1 1 1 ## .. .. ..$ lib.size : int [1:4] 8392971 7269175 3792141 4241789 ## .. .. ..$ norm.factors: num [1:4] 1 1 1 1 ## .. ..$ : num [1:689087, 1:4] 16.1 16 16 16.2 16.2 ... ## ..$ names: chr [1:3] &quot;counts&quot; &quot;samples&quot; &quot;offset&quot; y &lt;- estimateDisp(y, design) summary(y$trended.dispersion) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.0410 0.0525 0.0617 0.0607 0.0721 0.0740 plotBCV(y) Figure 9.5: Abundance-dependent trend in the BCV for each window, represented by the blue line. Common (red) and tagwise estimates (black) are also shown. 9.6.2 Estimating the QL dispersion We use quasi-likelihood methods to model window-specific variability, i.e., variance in the variance across windows. However, with limited replicates, there is not enough information for each window to stably estimate the QL dispersion. This is overcome by sharing information between windows with empirical Bayes (EB) shrinkage. The instability of the QL dispersion estimates is reduced by squeezing the estimates towards an abundance-dependent trend (Figure 9.6). fit &lt;- glmQLFit(y, design, robust=TRUE) plotQLDisp(fit) Figure 9.6: Effect of EB shrinkage on the raw QL dispersion estimate for each window (black) towards the abundance-dependent trend (blue) to obtain squeezed estimates (red). The extent of shrinkage is determined by the prior degrees of freedom (d.f.). Large prior d.f. indicates that the dispersions were similar across windows, such that stronger shrinkage to the trend could be performed to increase stability and power. summary(fit$df.prior) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.224 15.495 15.495 15.254 15.495 15.495 Also note the use of robust=TRUE in the glmQLFit() call, which reduces the sensitivity of the EB procedures to outlier variances. 
This is particularly noticeable in Figure 9.6 with highly variable windows that (correctly) do not get squeezed towards the trend. 9.6.3 Examining the data with MDS plots We use MDS plots to examine the similarities between samples. Ideally, replicates should cluster together while samples from different conditions should be separate. While the mature B replicates are less tightly grouped, samples still separate by cell type in Figure 9.7. This suggests that our downstream analysis will be able to detect significant differences in enrichment between cell types. plotMDS(norm.adjc, labels=celltype, col=c(&quot;red&quot;, &quot;blue&quot;)[as.integer(celltype)]) Figure 9.7: MDS plot with two dimensions for all samples in the H3K9ac data set. Samples are labelled and coloured according to the cell type. 9.7 Testing for DB Each window is tested for significant differences between cell types using the QL F-test. For this analysis, the comparison is parametrized such that the reported log-fold change for each window represents that of the coverage in pro-B cells over their mature B counterparts. contrast &lt;- makeContrasts(proB-matureB, levels=design) res &lt;- glmQLFTest(fit, contrast=contrast) head(res$table) ## logFC logCPM F PValue ## 1 1.3658 0.3097 2.306 0.14678 ## 2 1.3564 0.2624 2.305 0.14685 ## 3 2.0015 0.2503 3.940 0.06305 ## 4 2.0780 0.5033 5.576 0.03003 ## 5 0.8842 0.8051 1.652 0.21545 ## 6 0.9678 0.8949 2.055 0.16936 We then control the region-level FDR by aggregating windows into regions and combining the p-values. Here, adjacent windows less than 100 bp apart are aggregated into clusters. 
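For reference, the clustering step on its own is available as mergeWindows(); the mergeResults() call used below is a convenience wrapper that combines this clustering with the per-cluster statistics from combineTests(). A minimal sketch, assuming the filtered.data and res objects from this workflow:

```r
# Cluster windows less than 100 bp apart, capping cluster width at 5 kbp
# to avoid chaining together distinct events across very broad regions.
merged.win <- mergeWindows(rowRanges(filtered.data), tol=100, max.width=5000)
head(merged.win$ids)   # cluster ID assigned to each window
merged.win$regions     # genomic interval spanned by each cluster

# Per-cluster combined statistics, as in merged$combined below.
tab <- combineTests(merged.win$ids, res$table)
```

Working with mergeWindows() directly is useful when the same clustering needs to be reused, e.g., with different window-level statistics.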
merged &lt;- mergeResults(filtered.data, res$table, tol=100, merge.args=list(max.width=5000)) merged$regions ## GRanges object with 41616 ranges and 0 metadata columns: ## seqnames ranges strand ## &lt;Rle&gt; &lt;IRanges&gt; &lt;Rle&gt; ## [1] chr1 4775451-4775750 * ## [2] chr1 4785001-4786300 * ## [3] chr1 4807251-4807750 * ## [4] chr1 4808001-4808600 * ## [5] chr1 4857051-4858950 * ## ... ... ... ... ## [41612] chrY 73038001-73038400 * ## [41613] chrY 75445801-75446200 * ## [41614] chrY 88935951-88936350 * ## [41615] chrY 90554201-90554400 * ## [41616] chrY 90812801-90813100 * ## ------- ## seqinfo: 21 sequences from an unspecified genome A combined p-value is computed for each cluster using the method of Simes (1986), based on the p-values of the constituent windows. This represents the evidence against the global null hypothesis for each cluster, i.e., that no DB exists in any of its windows. Rejection of this global null indicates that the cluster (and the region that it represents) contains DB. Applying the BH method to the combined p-values allows the region-level FDR to be controlled. tabcom &lt;- merged$combined tabcom ## DataFrame with 41616 rows and 8 columns ## num.tests num.up.logFC num.down.logFC PValue FDR direction ## &lt;integer&gt; &lt;integer&gt; &lt;integer&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;character&gt; ## 1 3 0 0 0.1468526 0.2464281 up ## 2 24 0 0 0.0882967 0.1687355 up ## 3 8 0 0 0.5264245 0.6480413 mixed ## 4 10 0 0 0.7296509 0.8298757 mixed ## 5 36 5 0 0.0208882 0.0605414 up ## ... ... ... ... ... ... ... ## 41612 6 0 6 0.00587505 0.0265930 down ## 41613 6 0 6 0.03868017 0.0930955 down ## 41614 6 0 6 0.02082155 0.0604134 down ## 41615 2 0 2 0.03344646 0.0836785 down ## 41616 4 0 4 0.00147494 0.0114325 down ## rep.test rep.logFC ## &lt;integer&gt; &lt;numeric&gt; ## 1 2 1.356426 ## 2 15 6.454882 ## 3 34 0.433617 ## 4 40 -0.272420 ## 5 63 6.353706 ## ... ... ... 
## 41612 689066 -6.96911 ## 41613 689075 -5.73312 ## 41614 689081 -5.84635 ## 41615 689083 -5.00404 ## 41616 689087 -3.96416 We determine the total number of DB regions at an FDR of 5% by applying the Benjamini-Hochberg method on the combined p-values. is.sig &lt;- tabcom$FDR &lt;= 0.05 summary(is.sig) ## Mode FALSE TRUE ## logical 28515 13101 Determining the direction of DB is more complicated, as clusters may contain windows that are changing in opposite directions. One approach is to use the direction of DB from the windows that contribute most to the combined p-value, as reported in the direction field for each cluster. table(tabcom$direction[is.sig]) ## ## down mixed up ## 8580 154 4367 Another approach is to use the log-fold change of the most significant window as a proxy for the log-fold change of the cluster. tabbest &lt;- merged$best tabbest ## DataFrame with 41616 rows and 8 columns ## num.tests num.up.logFC num.down.logFC PValue FDR direction ## &lt;integer&gt; &lt;integer&gt; &lt;integer&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;character&gt; ## 1 3 0 0 0.1891477 0.335560 up ## 2 24 0 0 0.0882967 0.190583 up ## 3 8 0 0 1.0000000 1.000000 up ## 4 10 0 0 1.0000000 1.000000 down ## 5 36 2 0 0.0464346 0.121536 up ## ... ... ... ... ... ... ... ## 41612 6 0 3 0.01762514 0.0634640 down ## 41613 6 0 0 0.19568010 0.3445628 down ## 41614 6 0 0 0.06141175 0.1463585 down ## 41615 2 0 0 0.06689293 0.1549923 down ## 41616 4 0 4 0.00421597 0.0261244 down ## rep.test rep.logFC ## &lt;integer&gt; &lt;numeric&gt; ## 1 3 2.001538 ## 2 15 6.454882 ## 3 35 1.178400 ## 4 43 -0.908825 ## 5 60 6.572738 ## ... ... ... ## 41612 689064 -6.96911 ## 41613 689070 -5.44350 ## 41614 689076 -6.68322 ## 41615 689082 -5.00404 ## 41616 689086 -4.08131 In the table above, the rep.test column is the index of the window that is the most significant in each cluster, while the rep.logFC field is the log-fold change of that window. 
We could also use this to obtain a summary of the direction of DB across all clusters. is.sig.pos &lt;- (tabbest$rep.logFC &gt; 0)[is.sig] summary(is.sig.pos) ## Mode FALSE TRUE ## logical 8664 4437 The final approach is generally satisfactory, though it will not capture multiple changes in opposite directions. It also tends to overstate the magnitude of the log-fold change in each cluster. 9.8 Interpreting the DB results 9.8.1 Adding gene-centric annotation For convenience, we store all statistics in the metadata of a GRanges object. We also store the midpoint and log-fold change of the most significant window in each cluster. out.ranges &lt;- merged$regions mcols(out.ranges) &lt;- DataFrame(tabcom, best.pos=mid(ranges(rowRanges(filtered.data[tabbest$rep.test]))), best.logFC=tabbest$rep.logFC) We can then use the built-in annotation function in csaw to report genic features overlapping each region (Section 8.1). Annotated features that flank the region of interest are also reported. library(org.Mm.eg.db) library(TxDb.Mmusculus.UCSC.mm10.knownGene) anno &lt;- detailRanges(out.ranges, orgdb=org.Mm.eg.db, txdb=TxDb.Mmusculus.UCSC.mm10.knownGene) head(anno$overlap) ## [1] &quot;Mrpl15:-:E&quot; &quot;Mrpl15:-:PE&quot; &quot;Lypla1:+:P&quot; ## [4] &quot;Lypla1:+:PE&quot; &quot;Lypla1:+:I,Tcea1:+:PE&quot; &quot;Rgs20:-:I&quot; head(anno$left) ## [1] &quot;Mrpl15:-:935&quot; &quot;Mrpl15:-:896&quot; &quot;&quot; &quot;Lypla1:+:19&quot; &quot;&quot; ## [6] &quot;&quot; head(anno$right) ## [1] &quot;Mrpl15:-:627&quot; &quot;&quot; &quot;Lypla1:+:38&quot; &quot;&quot; &quot;&quot; ## [6] &quot;&quot; The annotation for each region is stored in the metadata of the GRanges object. The compact string form is useful for human interpretation, as it allows rapid examination of all genic features neighbouring each region. 
meta &lt;- mcols(out.ranges) mcols(out.ranges) &lt;- data.frame(meta, anno) 9.8.2 Using the ChIPpeakAnno package As its name suggests, the ChIPpeakAnno package is designed to annotate peaks from ChIP-seq experiments (Zhu et al. 2010). A GRanges object containing all regions of interest is supplied to the relevant function after removing all previous metadata fields to reduce clutter. The gene closest to each region is then reported. Gene coordinates are taken from the NCBI mouse 38 annotation, which is roughly equivalent to the annotation in the mm10 genome build. library(ChIPpeakAnno) data(TSS.mouse.GRCm38) minimal &lt;- out.ranges elementMetadata(minimal) &lt;- NULL anno.regions &lt;- annotatePeakInBatch(minimal, AnnotationData=TSS.mouse.GRCm38) colnames(elementMetadata(anno.regions)) ## [1] &quot;peak&quot; &quot;feature&quot; ## [3] &quot;start_position&quot; &quot;end_position&quot; ## [5] &quot;feature_strand&quot; &quot;insideFeature&quot; ## [7] &quot;distancetoFeature&quot; &quot;shortestDistance&quot; ## [9] &quot;fromOverlappingOrNearest&quot; Alternatively, identification of all overlapping features within, say, 5 kbp can be achieved by setting maxgap=5000 and output=&quot;overlapping&quot; in annotatePeakInBatch(). This will report each overlapping feature in a separate entry of the returned GRanges object, i.e., each input region may have multiple output values. In contrast, detailRanges() will report all overlapping features for a region as a single string, i.e., each input region has one output value. Which is preferable depends on the purpose of the annotation – the detailRanges() output is more convenient for direct annotation of a DB list, while the annotatePeakInBatch() output contains more information and is more convenient for further manipulation. 9.8.3 Reporting gene-based results Another approach to annotation is to flip the problem around such that DB statistics are reported directly for features of interest like genes. 
This is more convenient when the DB analysis needs to be integrated with, e.g., differential expression analyses of matched RNA-seq data. In the code below, promoter coordinates and gene symbols are obtained from various annotation objects. prom &lt;- suppressWarnings(promoters(TxDb.Mmusculus.UCSC.mm10.knownGene, upstream=3000, downstream=1000, columns=c(&quot;tx_name&quot;, &quot;gene_id&quot;))) entrez.ids &lt;- sapply(prom$gene_id, FUN=function(x) x[1]) # Using the first Entrez ID. gene.name &lt;- select(org.Mm.eg.db, keys=entrez.ids, keytype=&quot;ENTREZID&quot;, column=&quot;SYMBOL&quot;) prom$gene_name &lt;- gene.name$SYMBOL[match(entrez.ids, gene.name$ENTREZID)] head(prom) ## GRanges object with 6 ranges and 3 metadata columns: ## seqnames ranges strand | tx_name ## &lt;Rle&gt; &lt;IRanges&gt; &lt;Rle&gt; | &lt;character&gt; ## ENSMUST00000193812.1 chr1 3070253-3074252 + | ENSMUST00000193812.1 ## ENSMUST00000082908.1 chr1 3099016-3103015 + | ENSMUST00000082908.1 ## ENSMUST00000192857.1 chr1 3249757-3253756 + | ENSMUST00000192857.1 ## ENSMUST00000161581.1 chr1 3463587-3467586 + | ENSMUST00000161581.1 ## ENSMUST00000192183.1 chr1 3528795-3532794 + | ENSMUST00000192183.1 ## ENSMUST00000193244.1 chr1 3677155-3681154 + | ENSMUST00000193244.1 ## gene_id gene_name ## &lt;CharacterList&gt; &lt;character&gt; ## ENSMUST00000193812.1 &lt;NA&gt; ## ENSMUST00000082908.1 &lt;NA&gt; ## ENSMUST00000192857.1 &lt;NA&gt; ## ENSMUST00000161581.1 &lt;NA&gt; ## ENSMUST00000192183.1 &lt;NA&gt; ## ENSMUST00000193244.1 &lt;NA&gt; ## ------- ## seqinfo: 66 sequences (1 circular) from mm10 genome All windows overlapping each promoter are defined as a cluster. DB statistics are then computed for each cluster/promoter using Simes’ method, which directly yields DB results for the annotated features. Promoters with no overlapping windows are assigned NA values for the various fields and are filtered out below for demonstration purposes. 
olap.out &lt;- overlapResults(filtered.data, regions=prom, res$table) olap.out ## DataFrame with 142446 rows and 3 columns ## regions combined best ## &lt;GRanges&gt; &lt;DataFrame&gt; &lt;DataFrame&gt; ## 1 chr1:3070253-3074252:+ NA:NA:NA:... NA:NA:NA:... ## 2 chr1:3099016-3103015:+ NA:NA:NA:... NA:NA:NA:... ## 3 chr1:3249757-3253756:+ NA:NA:NA:... NA:NA:NA:... ## 4 chr1:3463587-3467586:+ NA:NA:NA:... NA:NA:NA:... ## 5 chr1:3528795-3532794:+ NA:NA:NA:... NA:NA:NA:... ## ... ... ... ... ## 142442 chrUn_GL456381:15722-19721:- NA:NA:NA:... NA:NA:NA:... ## 142443 chrUn_GL456385:28243-32242:+ NA:NA:NA:... NA:NA:NA:... ## 142444 chrUn_GL456385:29719-33718:+ NA:NA:NA:... NA:NA:NA:... ## 142445 chrUn_JH584304:58668-62667:- NA:NA:NA:... NA:NA:NA:... ## 142446 chrUn_JH584304:58691-62690:- NA:NA:NA:... NA:NA:NA:... simple &lt;- DataFrame(ID=prom$tx_name, Gene=prom$gene_name, olap.out$combined) simple[!is.na(simple$PValue),] ## DataFrame with 57380 rows and 10 columns ## ID Gene num.tests num.up.logFC num.down.logFC ## &lt;character&gt; &lt;character&gt; &lt;integer&gt; &lt;integer&gt; &lt;integer&gt; ## 1 ENSMUST00000134384.7 Lypla1 18 0 0 ## 2 ENSMUST00000027036.10 Lypla1 18 0 0 ## 3 ENSMUST00000150971.7 Lypla1 18 0 0 ## 4 ENSMUST00000155020.1 Lypla1 18 0 0 ## 5 ENSMUST00000119612.8 Lypla1 18 0 0 ## ... ... ... ... ... ... ## 57376 ENSMUST00000150715.1 Uty 18 0 11 ## 57377 ENSMUST00000154527.1 Uty 18 0 11 ## 57378 ENSMUST00000091190.11 Ddx3y 17 0 17 ## 57379 ENSMUST00000188484.1 Ddx3y 17 0 17 ## 57380 ENSMUST00000187962.1 NA 3 0 3 ## PValue FDR direction rep.test rep.logFC ## &lt;numeric&gt; &lt;numeric&gt; &lt;character&gt; &lt;integer&gt; &lt;numeric&gt; ## 1 0.700465 0.739135 mixed 40 -0.27242 ## 2 0.700465 0.739135 mixed 40 -0.27242 ## 3 0.700465 0.739135 mixed 40 -0.27242 ## 4 0.700465 0.739135 mixed 40 -0.27242 ## 5 0.700465 0.739135 mixed 40 -0.27242 ## ... ... ... ... ... ... 
## 57376 6.45130e-06 0.000324147 down 689012 -3.36536 ## 57377 6.45130e-06 0.000324147 down 689012 -3.36536 ## 57378 6.82321e-05 0.001421855 down 689019 -2.78424 ## 57379 6.82321e-05 0.001421855 down 689019 -2.78424 ## 57380 2.93752e-03 0.013802417 down 689066 -6.96911 Note that this strategy is distinct from counting reads across promoters. Using promoter-level counts would not provide enough spatial resolution to detect sharp binding events that only occur in a subinterval of the promoter. In particular, detection may be compromised by non-specific background or the presence of multiple opposing DB events in the same promoter. Combining window-level statistics is preferable as resolution is maintained for optimal performance. 9.9 Visualizing DB results 9.9.1 Overview We again use the Gviz package to visualize read coverage across the data set at regions of interest (Hahne and Ivanek 2016). Coverage in each BAM file will be represented by a single track. Several additional tracks will also be included in each plot. One is the genome axis track, to display the genomic coordinates across the plotted region. The other is the annotation track containing gene models, with gene IDs replaced by symbols (where possible) for easier reading. library(Gviz) gax &lt;- GenomeAxisTrack(col=&quot;black&quot;, fontsize=15, size=2) greg &lt;- GeneRegionTrack(TxDb.Mmusculus.UCSC.mm10.knownGene, showId=TRUE, geneSymbol=TRUE, name=&quot;&quot;, background.title=&quot;transparent&quot;) symbols &lt;- unlist(mapIds(org.Mm.eg.db, gene(greg), &quot;SYMBOL&quot;, &quot;ENTREZID&quot;, multiVals = &quot;first&quot;)) symbol(greg) &lt;- symbols[gene(greg)] We will also sort the DB regions by p-value for easier identification of regions of interest. 
o &lt;- order(out.ranges$PValue) sorted.ranges &lt;- out.ranges[o] sorted.ranges ## GRanges object with 41616 ranges and 13 metadata columns: ## seqnames ranges strand | num.tests num.up.logFC ## &lt;Rle&gt; &lt;IRanges&gt; &lt;Rle&gt; | &lt;integer&gt; &lt;integer&gt; ## [1] chr17 34285101-34290050 * | 97 0 ## [2] chr9 109050201-109053150 * | 57 0 ## [3] chr17 34261151-34265850 * | 92 5 ## [4] chr17 34306001-34308650 * | 51 0 ## [5] chr18 60802751-60805750 * | 55 0 ## ... ... ... ... . ... ... ## [41612] chr18 23751901-23753200 * | 22 0 ## [41613] chr12 83922051-83922650 * | 10 0 ## [41614] chr15 99395101-99395650 * | 8 0 ## [41615] chr3 67504201-67504500 * | 4 0 ## [41616] chr4 43043401-43043700 * | 4 0 ## num.down.logFC PValue FDR direction rep.test ## &lt;integer&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;character&gt; &lt;integer&gt; ## [1] 97 4.04798e-11 1.22570e-06 down 291020 ## [2] 57 7.13784e-11 1.22570e-06 down 671262 ## [3] 74 8.83575e-11 1.22570e-06 down 290891 ## [4] 51 1.23282e-10 1.28263e-06 down 291358 ## [5] 55 2.06286e-10 1.54430e-06 down 321713 ## ... ... ... ... ... ... ## [41612] 0 0.999833 0.999908 mixed 313278 ## [41613] 0 0.999885 0.999908 mixed 153754 ## [41614] 0 0.999908 0.999908 mixed 247763 ## [41615] 0 0.999908 0.999908 up 416884 ## [41616] 0 0.999908 0.999908 mixed 443668 ## rep.logFC best.pos best.logFC overlap ## &lt;numeric&gt; &lt;integer&gt; &lt;numeric&gt; &lt;character&gt; ## [1] -6.97365 34287575 -7.18686 H2-Aa:-:PE ## [2] -5.84054 109051575 -6.19603 Shisa5:+:PE ## [3] -7.87978 34262025 -7.70115 H2-Ab1:+:PE ## [4] -6.86030 34306075 -5.80798 H2-Eb1:+:PE ## [5] -5.13082 60804525 -5.98346 Cd74:+:PE ## ... ... ... ... ... ## [41612] -0.000502650 23752525 -0.777050 Gm15972:-:PE,Mapre2:.. 
## [41613] 0.000405487 83922125 0.880875 Numb:-:P ## [41614] 0.000119628 99395425 -0.411300 Tmbim6:+:I ## [41615] 0.000119628 67504275 0.491618 Rarres1:-:I ## [41616] 0.000119628 43043575 0.174254 Fam214b:-:I ## left right ## &lt;character&gt; &lt;character&gt; ## [1] H2-Aa:-:565 ## [2] Atrip-trex1:-:4783,T.. ## [3] H2-Ab1:+:3314 H2-Ab1:+:1252 ## [4] H2-Eb1:+:925 ## [5] Cd74:+:2158 ## ... ... ... ## [41612] Gm15972:-:78 Mapre2:+:525 ## [41613] Numb:-:117 ## [41614] Tmbim6:+:1371 Tmbim6:+:4007 ## [41615] ## [41616] Fam214b:-:3106 Fam214b:-:1948 ## ------- ## seqinfo: 21 sequences from an unspecified genome 9.9.2 Simple DB across a broad region We start by visualizing one of the top-ranking DB regions. This represents a simple DB event where the entire region changes in one direction (Figure 9.8). Specifically, it represents an increase in H3K9ac marking at the H2-Aa locus in mature B cells. This is consistent with the expected biology – H3K9ac is a mark of active gene expression and MHCII components are upregulated in mature B cells (Hoffmann et al. 2002). cur.region &lt;- sorted.ranges[1] cur.region ## GRanges object with 1 range and 13 metadata columns: ## seqnames ranges strand | num.tests num.up.logFC num.down.logFC ## &lt;Rle&gt; &lt;IRanges&gt; &lt;Rle&gt; | &lt;integer&gt; &lt;integer&gt; &lt;integer&gt; ## [1] chr17 34285101-34290050 * | 97 0 97 ## PValue FDR direction rep.test rep.logFC best.pos ## &lt;numeric&gt; &lt;numeric&gt; &lt;character&gt; &lt;integer&gt; &lt;numeric&gt; &lt;integer&gt; ## [1] 4.04798e-11 1.2257e-06 down 291020 -6.97365 34287575 ## best.logFC overlap left right ## &lt;numeric&gt; &lt;character&gt; &lt;character&gt; &lt;character&gt; ## [1] -7.18686 H2-Aa:-:PE H2-Aa:-:565 ## ------- ## seqinfo: 21 sequences from an unspecified genome One track is plotted for each sample, in addition to the coordinate and annotation tracks. Coverage is plotted in terms of sequencing depth-per-million at each base. 
This corrects for differences in library sizes between tracks. collected &lt;- list() lib.sizes &lt;- filtered.data$totals/1e6 for (i in seq_along(acdata$Path)) { reads &lt;- extractReads(bam.file=acdata$Path[[i]], cur.region, param=param) cov &lt;- as(coverage(reads)/lib.sizes[i], &quot;GRanges&quot;) collected[[i]] &lt;- DataTrack(cov, type=&quot;histogram&quot;, lwd=0, ylim=c(0,10), name=acdata$Description[i], col.axis=&quot;black&quot;, col.title=&quot;black&quot;, fill=&quot;darkgray&quot;, col.histogram=NA) } plotTracks(c(gax, collected, greg), chromosome=as.character(seqnames(cur.region)), from=start(cur.region), to=end(cur.region)) Figure 9.8: Coverage tracks for a simple DB event between pro-B and mature B cells, across a broad region in the H3K9ac data set. Read coverage for each sample is shown as a per-million value at each base. 9.9.3 Complex DB across a broad region Complex DB refers to situations where multiple DB events are occurring within the same enriched region. These are identified as those clusters that contain windows changing in both directions. Here, one of the top-ranking complex clusters is selected for visualization. complex &lt;- sorted.ranges$num.up.logFC &gt; 0 &amp; sorted.ranges$num.down.logFC &gt; 0 cur.region &lt;- sorted.ranges[complex][2] cur.region ## GRanges object with 1 range and 13 metadata columns: ## seqnames ranges strand | num.tests num.up.logFC ## &lt;Rle&gt; &lt;IRanges&gt; &lt;Rle&gt; | &lt;integer&gt; &lt;integer&gt; ## [1] chr5 122987201-122991450 * | 83 5 ## num.down.logFC PValue FDR direction rep.test rep.logFC ## &lt;integer&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;character&gt; &lt;integer&gt; &lt;numeric&gt; ## [1] 37 1.30976e-08 1.33826e-05 down 508657 -5.8272 ## best.pos best.logFC overlap left ## &lt;integer&gt; &lt;numeric&gt; &lt;character&gt; &lt;character&gt; ## [1] 122990925 -5.48535 A930024E05Rik:+:PE,K.. 
Kdm2b:-:2230 ## right ## &lt;character&gt; ## [1] A930024E05Rik:+:2913 ## ------- ## seqinfo: 21 sequences from an unspecified genome This region contains a bidirectional promoter where different genes are marked in the different cell types (Figure 9.9). Upon differentiation to mature B cells, loss of marking in one part of the region is balanced by a gain in marking in another part of the region. This represents a complex DB event that would not be detected if reads were counted across the entire region. collected &lt;- list() for (i in seq_along(acdata$Path)) { reads &lt;- extractReads(bam.file=acdata$Path[[i]], cur.region, param=param) cov &lt;- as(coverage(reads)/lib.sizes[i], &quot;GRanges&quot;) collected[[i]] &lt;- DataTrack(cov, type=&quot;histogram&quot;, lwd=0, ylim=c(0,3), name=acdata$Description[i], col.axis=&quot;black&quot;, col.title=&quot;black&quot;, fill=&quot;darkgray&quot;, col.histogram=NA) } plotTracks(c(gax, collected, greg), chromosome=as.character(seqnames(cur.region)), from=start(cur.region), to=end(cur.region)) Figure 9.9: Coverage tracks for a complex DB event in the H3K9ac data set, shown as per-million values. 9.9.4 Simple DB across a small region Both of the examples above involve differential marking within broad regions spanning several kilobases. This is consistent with changes in the marking profile across a large number of nucleosomes. However, H3K9ac marking can also be concentrated into small regions, involving only a few nucleosomes. csaw is equally capable of detecting sharp DB within these small regions. This is demonstrated by examining those clusters that contain a smaller number of windows. 
sharp &lt;- sorted.ranges$num.tests &lt; 20 cur.region &lt;- sorted.ranges[sharp][1] cur.region ## GRanges object with 1 range and 13 metadata columns: ## seqnames ranges strand | num.tests num.up.logFC num.down.logFC ## &lt;Rle&gt; &lt;IRanges&gt; &lt;Rle&gt; | &lt;integer&gt; &lt;integer&gt; &lt;integer&gt; ## [1] chr16 36665551-36666200 * | 11 0 11 ## PValue FDR direction rep.test rep.logFC best.pos ## &lt;numeric&gt; &lt;numeric&gt; &lt;character&gt; &lt;integer&gt; &lt;numeric&gt; &lt;integer&gt; ## [1] 1.2984e-08 1.33826e-05 down 264956 -4.65739 36665925 ## best.logFC overlap left right ## &lt;numeric&gt; &lt;character&gt; &lt;character&gt; &lt;character&gt; ## [1] -4.93342 Cd86:-:PE Cd86:-:3937 ## ------- ## seqinfo: 21 sequences from an unspecified genome Marking is increased for mature B cells within a 500 bp region (Figure 9.10), which is sharper than the changes in the previous two examples. This also coincides with the promoter of the Cd86 gene. Again, this makes biological sense as CD86 is involved in regulating immunoglobulin production in activated B-cells (Podojil and Sanders 2003). collected &lt;- list() for (i in seq_along(acdata$Path)) { reads &lt;- extractReads(bam.file=acdata$Path[[i]], cur.region, param=param) cov &lt;- as(coverage(reads)/lib.sizes[i], &quot;GRanges&quot;) collected[[i]] &lt;- DataTrack(cov, type=&quot;histogram&quot;, lwd=0, ylim=c(0,3), name=acdata$Description[i], col.axis=&quot;black&quot;, col.title=&quot;black&quot;, fill=&quot;darkgray&quot;, col.histogram=NA) } plotTracks(c(gax, collected, greg), chromosome=as.character(seqnames(cur.region)), from=start(cur.region), to=end(cur.region)) Figure 9.10: Coverage tracks for a sharp and simple DB event in the H3K9ac data set, shown as per-million values. Note that the window size will determine whether sharp or broad events are preferentially detected. 
Larger windows provide more power to detect broad events (as the counts are higher), while smaller windows provide more resolution to detect sharp events. Optimal detection of all features can be obtained by performing analyses with multiple window sizes and consolidating the results, though – for brevity – this will not be described here. In general, smaller window sizes are preferred as strong DB events with sufficient coverage will always be detected. For larger windows, detection may be confounded by other events within the window that distort the log-fold change in the counts between conditions.

Session information

R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS
Chapter 10 CBP, wild-type versus knock-out

10.1 Background

Here, we perform a window-based analysis to identify differentially bound (DB) regions for CREB-binding protein (CBP). This particular dataset comes from a study comparing wild-type (WT) and CBP knock-out (KO) animals (Kasper et al. 2014), with two biological replicates for each genotype. As before, we obtain the BAM files and indices from chipseqDBData.

library(chipseqDBData)
cbpdata <- CBPData()
cbpdata

## DataFrame with 4 rows and 3 columns
##          Name       Description      Path
##   <character>       <character>    <List>
## 1  SRR1145787 CBP wild-type (1) <BamFile>
## 2  SRR1145788 CBP wild-type (2) <BamFile>
## 3  SRR1145789 CBP knock-out (1) <BamFile>
## 4  SRR1145790 CBP knock-out (2) <BamFile>

10.2 Pre-processing

We check some mapping statistics for the CBP dataset with Rsamtools, as previously described.
library(Rsamtools)
diagnostics <- list()
for (b in seq_along(cbpdata$Path)) {
    bam <- cbpdata$Path[[b]]
    total <- countBam(bam)$records
    mapped <- countBam(bam, param=ScanBamParam(
        flag=scanBamFlag(isUnmapped=FALSE)))$records
    marked <- countBam(bam, param=ScanBamParam(
        flag=scanBamFlag(isUnmapped=FALSE, isDuplicate=TRUE)))$records
    diagnostics[[b]] <- c(Total=total, Mapped=mapped, Marked=marked)
}
diag.stats <- data.frame(do.call(rbind, diagnostics))
rownames(diag.stats) <- cbpdata$Name
diag.stats$Prop.mapped <- diag.stats$Mapped/diag.stats$Total*100
diag.stats$Prop.marked <- diag.stats$Marked/diag.stats$Mapped*100
diag.stats

##               Total   Mapped  Marked Prop.mapped Prop.marked
## SRR1145787 28525952 24289396 2022868       85.15       8.328
## SRR1145788 25514465 21604007 1939224       84.67       8.976
## SRR1145789 34476967 29195883 2412650       84.68       8.264
## SRR1145790 32624587 27348488 2617879       83.83       9.572

We construct a readParam object to standardize the parameter settings in this analysis. The ENCODE blacklist is again used to remove reads in problematic regions (ENCODE Project Consortium 2012).

library(BiocFileCache)
bfc <- BiocFileCache("local", ask=FALSE)
black.path <- bfcrpath(bfc, file.path("https://www.encodeproject.org",
    "files/ENCFF547MET/@@download/ENCFF547MET.bed.gz"))

library(rtracklayer)
blacklist <- import(black.path)

We set the minimum mapping quality score to 10 to remove poorly or non-uniquely aligned reads.

library(csaw)
param <- readParam(minq=10, discard=blacklist)
param

## Extracting reads in single-end mode
## Duplicate removal is turned off
## Minimum allowed mapping score is 10
## Reads are extracted from both strands
## No restrictions are placed on read extraction
## Reads in 164 regions will be discarded

10.3 Quantifying coverage

10.3.1 Computing the average fragment length

The average fragment length is estimated by maximizing the cross-correlation function (Figure 10.1), as previously described.
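Conceptually, this estimate is just the delay at which the cross-correlation function (CCF) peaks. A minimal base R sketch with a hypothetical CCF vector (the real calculation in correlateReads() and maximizeCcf() operates on genome-wide read positions):

```r
# Hypothetical CCF evaluated at delays of 0-500 bp, peaking at 161 bp.
delays <- 0:500
ccf.values <- dnorm(delays, mean=161, sd=50)

# The fragment length estimate is simply the delay with the maximum
# correlation, which is what maximizeCcf() does in spirit.
est.frag.len <- delays[which.max(ccf.values)]
est.frag.len
## [1] 161
```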
Generally, cross-correlations for TF datasets are sharper than for histone marks as the TFs typically contact a smaller genomic interval. This results in more pronounced strand bimodality in the binding profile.

x <- correlateReads(cbpdata$Path, param=reform(param, dedup=TRUE))
frag.len <- maximizeCcf(x)
frag.len

## [1] 161

plot(1:length(x)-1, x, xlab="Delay (bp)", ylab="CCF", type="l")
abline(v=frag.len, col="red")
text(x=frag.len, y=min(x), paste(frag.len, "bp"), pos=4, col="red")

Figure 10.1: Cross-correlation function (CCF) against delay distance for the CBP dataset. The delay with the maximum correlation is shown as the red line.

10.3.2 Counting reads into windows

Reads are then counted into sliding windows using csaw (Lun and Smyth 2016). For TF data analyses, smaller windows are necessary to capture sharp binding sites. A large window size will be suboptimal as the count for a particular site will be "contaminated" by non-specific background in the neighbouring regions. In this case, a window size of 10 bp is used.

win.data <- windowCounts(cbpdata$Path, param=param, width=10, ext=frag.len)
win.data

## class: RangedSummarizedExperiment
## dim: 9952827 4
## metadata(6): spacing width ... param final.ext
## assays(1): counts
## rownames: NULL
## rowData names(0):
## colnames: NULL
## colData names(4): bam.files totals ext rlen

The default spacing of 50 bp is also used here. This may seem inappropriate given that the windows are only 10 bp wide. However, reads lying in the interval between adjacent windows will still be counted into several windows. This is because reads are extended to the value of frag.len, which is substantially larger than the 50 bp spacing.

10.4 Filtering of low-abundance windows

We remove low-abundance windows by computing the coverage in each window relative to a global estimate of background enrichment (Section 4.4).
The majority of windows in background regions are filtered out upon applying a modest fold-change threshold. This leaves a small set of relevant windows for further analysis.

bins <- windowCounts(cbpdata$Path, bin=TRUE, width=10000, param=param)
filter.stat <- filterWindowsGlobal(win.data, bins)
min.fc <- 3
keep <- filter.stat$filter > log2(min.fc)
summary(keep)

##    Mode   FALSE    TRUE
## logical 9652836  299991

filtered.data <- win.data[keep,]

Note that 10 kbp bins are used here for filtering, while smaller 2 kbp bins were used in the corresponding step of the H3K9ac analysis. This is purely for convenience – the same 10 kbp counts are also needed for normalization in the next section, so they can be re-used to save time. Changes in bin size will have little impact on the results, so long as the bins (and their counts) are large enough for precise estimation of the background abundance. While smaller bins provide greater spatial resolution, this is irrelevant for quantifying coverage in large background regions that span most of the genome.

10.5 Normalization for composition biases

We expect unbalanced DB in this dataset, as CBP function should be compromised in the KO cells such that most - if not all - of the DB sites should exhibit increased CBP binding in the WT condition. To remove this bias, we assign reads to large genomic bins and assume that most bins represent non-DB background regions (Lun and Smyth 2014). Any systematic differences in the coverage of those bins are attributed to composition bias and normalized out. Specifically, the trimmed mean of M-values (TMM) method (Robinson and Oshlack 2010) is applied to compute normalization factors from the bin counts. These factors are stored in win.data so that they will be applied during the DB analysis with the window counts.
win.data <- normFactors(bins, se.out=win.data)
(normfacs <- win.data$norm.factors)

## [1] 1.0126 0.9083 1.0444 1.0411

We visualize the effect of normalization with mean-difference plots between pairs of samples (Figure 10.2). The dense cloud in each plot represents the majority of bins in the genome. These are assumed to mostly contain background regions. A non-zero log-fold change for these bins indicates that composition bias is present between samples. The red line represents the log-ratio of normalization factors and passes through the centre of the cloud in each plot, indicating that the bias has been successfully identified and removed.

bin.ab <- scaledAverage(bins)
adjc <- calculateCPM(bins, use.norm.factors=FALSE)

par(cex.lab=1.5, mfrow=c(1,3))
smoothScatter(bin.ab, adjc[,1]-adjc[,4], ylim=c(-6, 6),
    xlab="Average abundance", ylab="Log-ratio (1 vs 4)")
abline(h=log2(normfacs[1]/normfacs[4]), col="red")
smoothScatter(bin.ab, adjc[,2]-adjc[,4], ylim=c(-6, 6),
    xlab="Average abundance", ylab="Log-ratio (2 vs 4)")
abline(h=log2(normfacs[2]/normfacs[4]), col="red")
smoothScatter(bin.ab, adjc[,3]-adjc[,4], ylim=c(-6, 6),
    xlab="Average abundance", ylab="Log-ratio (3 vs 4)")
abline(h=log2(normfacs[3]/normfacs[4]), col="red")

Figure 10.2: Mean-difference plots for the bin counts, comparing sample 4 to all other samples. The red line represents the log-ratio of the normalization factors between samples.

Note that this normalization strategy is quite different from that in the H3K9ac analysis. Here, systematic DB in one direction is expected between conditions, given that CBP function is lost in the KO genotype. This means that the assumption of a non-DB majority (required for non-linear normalization of the H3K9ac data) is not valid. No such assumption is made by the binned-TMM approach described above, which makes it more appropriate for use in the CBP analysis.
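The intuition behind the binned-TMM strategy can be illustrated with simulated bin counts in base R. This is a deliberately simplified sketch with hypothetical counts - edgeR's actual TMM implementation also trims on average abundance and applies precision weights - but it shows how trimming excludes the DB bins so that the remaining log-ratios capture the composition bias:

```r
set.seed(100)
n <- 10000
s1 <- rpois(n, lambda=100)  # hypothetical bin counts, sample 1
s2 <- rpois(n, lambda=100)  # hypothetical bin counts, sample 2
s1[1:200] <- s1[1:200] + 2000  # bound regions with extra coverage in sample 1

# Library-size-adjusted log-ratios (M-values) for each bin.
M <- log2((s1/sum(s1)) / (s2/sum(s2)))

# The DB bins inflate sample 1's library size, shifting the background
# M-values away from zero. A trimmed mean discards the extreme (DB) bins
# and recovers this systematic offset from the non-DB majority.
offset <- mean(M, trim=0.3)
offset  # close to log2(sum(s2)/sum(s1)), i.e., around -0.49 here
```

Converting such offsets into scaling factors (as normFactors() does via edgeR) brings the background bins back to a log-ratio of zero in Figure 10.2.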
10.6 Statistical modelling

We model counts for each window using edgeR (McCarthy, Chen, and Smyth 2012; Robinson, McCarthy, and Smyth 2010). First, we convert our RangedSummarizedExperiment object into a DGEList.

library(edgeR)
y <- asDGEList(filtered.data)
summary(y)

##         Length  Class      Mode
## counts  1199964 -none-     numeric
## samples 3       data.frame list

We then construct a design matrix for our experimental design. Again, we have a simple one-way layout with two groups of two replicates.

genotype <- cbpdata$Description
genotype[grep("wild-type", genotype)] <- "wt"
genotype[grep("knock-out", genotype)] <- "ko"
genotype <- factor(genotype)
design <- model.matrix(~0+genotype)
colnames(design) <- levels(genotype)
design

##   ko wt
## 1  0  1
## 2  0  1
## 3  1  0
## 4  1  0
## attr(,"assign")
## [1] 1 1
## attr(,"contrasts")
## attr(,"contrasts")$genotype
## [1] "contr.treatment"

We estimate the negative binomial (NB) and quasi-likelihood (QL) dispersions for each window (Lund et al. 2012). The estimated NB dispersions (Figure 10.3) are substantially larger than those observed in the H3K9ac dataset. They also exhibit an unusual increasing trend with respect to abundance.

y <- estimateDisp(y, design)
summary(y$trended.dispersion)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.102   0.129   0.146   0.152   0.176   0.214

plotBCV(y)

Figure 10.3: Abundance-dependent trend in the biological coefficient of variation (i.e., the root-NB dispersion) for each window, represented by the blue line. Common (red) and tagwise estimates (black) are also shown.

The estimated prior d.f. is effectively infinite, meaning that virtually all of the QL dispersions are equal to the trend (Figure 10.4).

fit <- glmQLFit(y, design, robust=TRUE)
summary(fit$df.prior)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   48426     Inf     Inf     Inf     Inf     Inf

plotQLDisp(fit)

Figure 10.4: Effect of EB shrinkage on the raw QL dispersion estimate for each window (black) towards the abundance-dependent trend (blue) to obtain squeezed estimates (red). Quarter-root estimates are shown for greater dynamic range.

These results are consistent with the presence of a systematic difference in CBP enrichment between the WT replicates. An increasing trend in Figure 10.3 is typical after normalization for composition biases, where replicates exhibit some differences in efficiency that manifest as increased dispersions at high abundance. The dispersions for all windows are inflated to a similarly large value by this difference, manifesting as low variability in the dispersions across windows. This effect is illustrated in Figure 10.5, where the WT samples are clearly separated in both dimensions.

plotMDS(cpm(y, log=TRUE), top=10000, labels=genotype,
    col=c("red", "blue")[as.integer(genotype)])

Figure 10.5: MDS plot with two dimensions for all samples in the CBP dataset. Samples are labelled and coloured according to the genotype. A larger top set of windows was used to improve the visualization of the genome-wide differences between the WT samples.

The presence of a large batch effect between replicates is not ideal. Nonetheless, we can still proceed with the DB analysis - albeit with some loss of power due to the inflated NB dispersions - given that there are strong differences between genotypes in Figure 10.5.

10.7 Testing for DB

We test for a significant difference in binding between genotypes in each window using the QL F-test.

contrast <- makeContrasts(wt-ko, levels=design)
res <- glmQLFTest(fit, contrast=contrast)

Windows less than 100 bp apart are clustered into regions (Lun and Smyth 2014) with a maximum cluster width of 5 kbp. We then control the region-level FDR by combining per-window p-values using Simes' method (Simes 1986).
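Simes' method has a simple closed form: for a cluster of n windows, sort the p-values and take the minimum of p(i) * n / i over the sorted values. A minimal base R sketch with hypothetical per-window p-values (mergeResults() performs this combination, along with the clustering itself, internally):

```r
# Simes (1986) combined p-value for one cluster of windows.
simes <- function(p) {
    p.sorted <- sort(p)
    n <- length(p)
    min(p.sorted * n / seq_len(n))
}

# Hypothetical per-window p-values for a single cluster:
simes(c(0.01, 0.20, 0.74))
## [1] 0.03
```

The combined p-value equals the smallest Benjamini-Hochberg-adjusted p-value within the cluster, so a cluster can be significant even when only a subset of its windows exhibit DB.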
merged <- mergeResults(filtered.data, res$table, tol=100,
    merge.args=list(max.width=5000))
merged$regions

## GRanges object with 61773 ranges and 0 metadata columns:
##           seqnames            ranges strand
##              <Rle>         <IRanges>  <Rle>
##       [1]     chr1   3613551-3613610      *
##       [2]     chr1   4785501-4785860      *
##       [3]     chr1   4807601-4808010      *
##       [4]     chr1   4857451-4857910      *
##       [5]     chr1   4858301-4858460      *
##       ...      ...               ...    ...
##   [61769]     chrY 90808801-90808910      *
##   [61770]     chrY 90810901-90810910      *
##   [61771]     chrY 90811301-90811360      *
##   [61772]     chrY 90811601-90811660      *
##   [61773]     chrY 90811851-90813910      *
## -------
## seqinfo: 66 sequences from an unspecified genome

tabcom <- merged$combined
is.sig <- tabcom$FDR <= 0.05
summary(is.sig)

##    Mode   FALSE    TRUE
## logical   58053    3720

Almost all of the significant regions have increased CBP binding in the WT genotype. This is expected given that protein function should be lost in the KO genotype.

table(tabcom$direction[is.sig])

##
## down   up
##    2 3718

# Direction according to the best window in each cluster.
tabbest <- merged$best
is.sig.pos <- (tabbest$rep.logFC > 0)[is.sig]
summary(is.sig.pos)

##    Mode   FALSE    TRUE
## logical       2    3718

We save the results to file in the form of a serialized R object for later inspection.

out.ranges <- merged$regions
mcols(out.ranges) <- DataFrame(tabcom,
    best.pos=mid(ranges(rowRanges(filtered.data[tabbest$rep.test]))),
    best.logFC=tabbest$rep.logFC)
saveRDS(file="cbp_results.rds", out.ranges)

10.8 Annotation and visualization

We annotate each region using the detailRanges() function.

library(TxDb.Mmusculus.UCSC.mm10.knownGene)
library(org.Mm.eg.db)
anno <- detailRanges(out.ranges, orgdb=org.Mm.eg.db,
    txdb=TxDb.Mmusculus.UCSC.mm10.knownGene)
mcols(out.ranges) <- cbind(mcols(out.ranges), anno)

We visualize one of the top-ranked DB regions here. This corresponds to a simple DB event, as all windows are changing in the same direction, i.e., up in the WT.
The binding region is also quite small relative to some of the H3K9ac examples, consistent with sharp TF binding to a specific recognition site.

o <- order(out.ranges$PValue)
cur.region <- out.ranges[o[2]]
cur.region

## GRanges object with 1 range and 13 metadata columns:
##       seqnames              ranges strand | num.tests num.up.logFC
##          <Rle>           <IRanges>  <Rle> | <integer>    <integer>
##   [1]     chr3 145758501-145759510      * |        21           17
##       num.down.logFC      PValue         FDR   direction  rep.test rep.logFC
##            <integer>   <numeric>   <numeric> <character> <integer> <numeric>
##   [1]              0 4.45747e-13 1.03695e-08          up    189312   4.29456
##        best.pos best.logFC     overlap        left        right
##       <integer>  <numeric> <character> <character>  <character>
##   [1] 145759055    4.29456  Ddah1:+:PE              Ddah1:+:1238
## -------
## seqinfo: 66 sequences from an unspecified genome

We use Gviz (Hahne and Ivanek 2016) to plot the results. As in the H3K9ac analysis, we set up some tracks to display genome coordinates and gene annotation.

library(Gviz)
gax <- GenomeAxisTrack(col="black", fontsize=15, size=2)
greg <- GeneRegionTrack(TxDb.Mmusculus.UCSC.mm10.knownGene, showId=TRUE,
    geneSymbol=TRUE, name="", background.title="transparent")
symbols <- unlist(mapIds(org.Mm.eg.db, gene(greg), "SYMBOL",
    "ENTREZID", multiVals="first"))
symbol(greg) <- symbols[gene(greg)]

We visualize two tracks for each sample – one for the forward-strand coverage, another for the reverse-strand coverage. This allows visualization of the strand bimodality that is characteristic of genuine TF binding sites. In Figure 10.6, two adjacent sites are present at the Ddah1 promoter, both of which exhibit increased binding in the WT genotype. Coverage is also substantially different between the WT replicates, consistent with the presence of a batch effect.
collected <- list()
lib.sizes <- filtered.data$totals/1e6
for (i in seq_along(cbpdata$Path)) {
    reads <- extractReads(bam.file=cbpdata$Path[[i]], cur.region, param=param)
    pcov <- as(coverage(reads[strand(reads)=="+"])/lib.sizes[i], "GRanges")
    ncov <- as(coverage(reads[strand(reads)=="-"])/-lib.sizes[i], "GRanges")
    ptrack <- DataTrack(pcov, type="histogram", lwd=0, ylim=c(-5, 5),
        name=cbpdata$Description[i], col.axis="black", col.title="black",
        fill="blue", col.histogram=NA)
    ntrack <- DataTrack(ncov, type="histogram", lwd=0, ylim=c(-5, 5),
        fill="red", col.histogram=NA)
    collected[[i]] <- OverlayTrack(trackList=list(ptrack, ntrack))
}
plotTracks(c(gax, collected, greg), chromosome=as.character(seqnames(cur.region)),
    from=start(cur.region), to=end(cur.region))

Figure 10.6: Coverage tracks for TF binding sites that are differentially bound in the WT (top two tracks) against the KO (last two tracks). Blue and red tracks represent forward- and reverse-strand coverage, respectively, on a per-million scale (capped at 5 in SRR1145788, for visibility).
Session information

R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Chapter 11 H3K27me3, wild-type versus knock-out

11.1 Overview

Here, we perform a window-based DB analysis to identify regions of differential H3K27me3 enrichment in mouse lung epithelium. H3K27me3 is associated with transcriptional repression and is usually observed within broad regions of enrichment. The aim of this workflow is to demonstrate how to analyze these broad marks with csaw, especially at variable resolutions with multiple window sizes. We use H3K27me3 ChIP-seq data from a study comparing wild-type (WT) and Ezh2 knock-out (KO) animals (Galvis et al. 2015), with two biological replicates for each genotype. We download BAM files and indices using chipseqDBData.

library(chipseqDBData)
h3k27me3data <- H3K27me3Data()
h3k27me3data

## DataFrame with 4 rows and 3 columns
##          Name            Description      Path
##   <character>            <character>    <List>
## 1  SRR1274188   control H3K27me3 (1) <BamFile>
## 2  SRR1274189   control H3K27me3 (2) <BamFile>
## 3  SRR1274190 Ezh2 knock-out H3K27.. <BamFile>
## 4  SRR1274191 Ezh2 knock-out H3K27.. <BamFile>

11.2 Pre-processing checks

We check some mapping statistics with Rsamtools.
library(Rsamtools)
diagnostics <- list()
for (b in seq_along(h3k27me3data$Path)) {
    bam <- h3k27me3data$Path[[b]]
    total <- countBam(bam)$records
    mapped <- countBam(bam, param=ScanBamParam(
        flag=scanBamFlag(isUnmapped=FALSE)))$records
    marked <- countBam(bam, param=ScanBamParam(
        flag=scanBamFlag(isUnmapped=FALSE, isDuplicate=TRUE)))$records
    diagnostics[[b]] <- c(Total=total, Mapped=mapped, Marked=marked)
}
diag.stats <- data.frame(do.call(rbind, diagnostics))
rownames(diag.stats) <- h3k27me3data$Name
diag.stats$Prop.mapped <- diag.stats$Mapped/diag.stats$Total*100
diag.stats$Prop.marked <- diag.stats$Marked/diag.stats$Mapped*100
diag.stats

##               Total   Mapped  Marked Prop.mapped Prop.marked
## SRR1274188 24445704 18605240 2769679       76.11       14.89
## SRR1274189 21978677 17014171 2069203       77.41       12.16
## SRR1274190 26910067 18606352 4361393       69.14       23.44
## SRR1274191 21354963 14092438 4392541       65.99       31.17

We construct a readParam object to standardize the parameter settings in this analysis. For consistency with the original analysis by Galvis et al. (2015), we will define the blacklist using the predicted repeats from the RepeatMasker software.

library(BiocFileCache)
bfc <- BiocFileCache("local", ask=FALSE)
black.path <- bfcrpath(bfc, file.path("http://hgdownload.cse.ucsc.edu",
    "goldenPath/mm10/bigZips/chromOut.tar.gz"))

tmpdir <- tempfile()
dir.create(tmpdir)
untar(black.path, exdir=tmpdir)

# Iterate through all chromosomes.
collected <- list()
for (x in list.files(tmpdir, full=TRUE)) {
    f <- list.files(x, full=TRUE, pattern=".fa.out")
    to.get <- vector("list", 15)
    to.get[[5]] <- "character"
    to.get[6:7] <- "integer"
    collected[[length(collected)+1]] <- read.table(f, skip=3,
        stringsAsFactors=FALSE, colClasses=to.get)
}
collected <- do.call(rbind, collected)
blacklist <- GRanges(collected[,1], IRanges(collected[,2], collected[,3]))
blacklist

## GRanges object with 5147737 ranges and 0 metadata columns:
##                         seqnames          ranges strand
##                            <Rle>       <IRanges>  <Rle>
##         [1]                 chr1 3000001-3002128      *
##         [2]                 chr1 3003153-3003994      *
##         [3]                 chr1 3003994-3004054      *
##         [4]                 chr1 3004041-3004206      *
##         [5]                 chr1 3004207-3004270      *
##         ...                  ...             ...    ...
##   [5147733] chrY_JH584303_random   152557-155890      *
##   [5147734] chrY_JH584303_random   155891-156883      *
##   [5147735] chrY_JH584303_random   157070-157145      *
##   [5147736] chrY_JH584303_random   157909-157960      *
##   [5147737] chrY_JH584303_random   157953-158099      *
## -------
## seqinfo: 66 sequences from an unspecified genome; no seqlengths

We set the minimum mapping quality score to 10 to remove poorly or non-uniquely aligned reads. We also restrict ourselves to the standard chromosomes.

library(csaw)
param <- readParam(minq=10, discard=blacklist,
    restrict=paste0("chr", c(1:19, "X", "Y")))
param

## Extracting reads in single-end mode
## Duplicate removal is turned off
## Minimum allowed mapping score is 10
## Reads are extracted from both strands
## Read extraction is limited to 21 sequences
## Reads in 5147737 regions will be discarded

11.3 Counting reads into windows

Reads are then counted into sliding windows using csaw (Lun and Smyth 2016). At this stage, we use a large 2 kbp window to reflect the fact that H3K27me3 exhibits broad enrichment.
This allows us to increase the size of the counts and thus detection power, without having to be concerned about loss of genomic resolution to detect sharp binding events.

win.data <- windowCounts(h3k27me3data$Path, param=param,
    width=2000, spacing=500, ext=200)
win.data

## class: RangedSummarizedExperiment
## dim: 4611513 4
## metadata(6): spacing width ... param final.ext
## assays(1): counts
## rownames: NULL
## rowData names(0):
## colnames: NULL
## colData names(4): bam.files totals ext rlen

We use spacing=500 to avoid redundant work when sliding a large window across the genome. The default spacing of 50 bp would result in many windows with over 90% overlap in their positions, increasing the amount of computational work without a meaningful improvement in resolution. We also set the fragment length to 200 bp based on experimental knowledge of the size selection procedure. Unlike the previous analyses, the fragment length cannot be easily estimated here due to the weak strand bimodality of diffuse marks.

11.4 Filtering of low-abundance windows

We estimate the global background and remove low-abundance windows that are not enriched above this background level. To retain a window, we require it to have at least 2-fold more coverage than the average background. This is less stringent than the thresholds used in previous analyses, owing to the weaker enrichment observed for diffuse marks.

bins <- windowCounts(h3k27me3data$Path, bin=TRUE, width=10000, param=param)
filter.stat <- filterWindowsGlobal(win.data, bins)
min.fc <- 2

Figure 11.1 shows that the chosen threshold is greater than the abundances of most bins in the genome, presumably those corresponding to background regions. This suggests that the filter will remove most windows lying within background regions.
hist(filter.stat$filter, main=&quot;&quot;, breaks=50, xlab=&quot;Background abundance (log2-CPM)&quot;) abline(v=log2(min.fc), col=&quot;red&quot;) Figure 11.1: Histogram of average abundances across all 10 kbp genomic bins. The filter threshold is shown as the red line. We filter out the majority of windows in background regions upon applying a modest fold-change threshold. This leaves a small set of relevant windows for further analysis. keep &lt;- filter.stat$filter &gt; log2(min.fc) summary(keep) ## Mode FALSE TRUE ## logical 4553052 58461 filtered.data &lt;- win.data[keep,] 11.5 Normalization for composition biases As in the CBP example, we normalize for composition biases resulting from imbalanced DB between conditions. This is because we expect systematic DB in one direction as Ezh2 function (and thus some H3K27me3 deposition activity) is lost in the KO genotype. win.data &lt;- normFactors(bins, se.out=win.data) (normfacs &lt;- win.data$norm.factors) ## [1] 0.9966 0.9969 1.0079 0.9986 Figure 11.2 shows the effect of normalization on the relative enrichment between pairs of samples. We see that the log-ratio of normalization factors passes through the centre of the cloud of background regions in each plot, indicating that the bias has been successfully identified and removed.
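To make the role of these factors concrete, here is a small base-R computation using the library sizes and normalization factors reported in this chapter. normFactors() performs the estimation itself; this only shows how its output enters the downstream model.

```r
# Library sizes and composition-bias factors for the four samples
# (values taken from the output shown in this chapter).
lib.sizes <- c(10577670, 9948013, 8579597, 5642350)
norm.factors <- c(0.9966, 0.9969, 1.0079, 0.9986)

# edgeR combines them into effective library sizes for modelling.
eff.lib <- lib.sizes * norm.factors

# On a mean-difference plot of samples 1 and 2, the horizontal red line
# sits at the log-ratio of their normalization factors.
shift <- log2(norm.factors[1]/norm.factors[2])
shift  # a very small shift, as all factors are close to 1
```

Factors near unity, as here, indicate that composition biases are mild; the correction still removes the small systematic offset between samples.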
bin.ab &lt;- scaledAverage(bins) adjc &lt;- calculateCPM(bins, use.norm.factors=FALSE) par(cex.lab=1.5, mfrow=c(1,3)) smoothScatter(bin.ab, adjc[,1]-adjc[,2], ylim=c(-6, 6), xlab=&quot;Average abundance&quot;, ylab=&quot;Log-ratio (1 vs 2)&quot;) abline(h=log2(normfacs[1]/normfacs[2]), col=&quot;red&quot;) smoothScatter(bin.ab, adjc[,1]-adjc[,3], ylim=c(-6, 6), xlab=&quot;Average abundance&quot;, ylab=&quot;Log-ratio (1 vs 3)&quot;) abline(h=log2(normfacs[1]/normfacs[3]), col=&quot;red&quot;) smoothScatter(bin.ab, adjc[,1]-adjc[,4], ylim=c(-6, 6), xlab=&quot;Average abundance&quot;, ylab=&quot;Log-ratio (1 vs 4)&quot;) abline(h=log2(normfacs[1]/normfacs[4]), col=&quot;red&quot;) Figure 11.2: Mean-difference plots for the bin counts, comparing sample 1 to all other samples. The red line represents the log-ratio of the normalization factors between samples. 11.6 Statistical modelling We first convert our RangedSummarizedExperiment object into a DGEList for modelling with edgeR. library(edgeR) y &lt;- asDGEList(filtered.data) str(y) ## Formal class &#39;DGEList&#39; [package &quot;edgeR&quot;] with 1 slot ## ..@ .Data:List of 2 ## .. ..$ : int [1:58461, 1:4] 17 19 32 33 30 31 31 36 33 29 ... ## .. .. ..- attr(*, &quot;dimnames&quot;)=List of 2 ## .. .. .. ..$ : chr [1:58461] &quot;1&quot; &quot;2&quot; &quot;3&quot; &quot;4&quot; ... ## .. .. .. ..$ : chr [1:4] &quot;Sample1&quot; &quot;Sample2&quot; &quot;Sample3&quot; &quot;Sample4&quot; ## .. ..$ :&#39;data.frame&#39;: 4 obs. of 3 variables: ## .. .. ..$ group : Factor w/ 1 level &quot;1&quot;: 1 1 1 1 ## .. .. ..$ lib.size : int [1:4] 10577670 9948013 8579597 5642350 ## .. .. ..$ norm.factors: num [1:4] 1 1 1 1 ## ..$ names: chr [1:2] &quot;counts&quot; &quot;samples&quot; We then construct a design matrix for our experimental design. Here, we use a simple one-way layout with two groups of two replicates.
genotype &lt;- h3k27me3data$Description genotype[grep(&quot;control&quot;, genotype)] &lt;- &quot;wt&quot; genotype[grep(&quot;knock-out&quot;, genotype)] &lt;- &quot;ko&quot; genotype &lt;- factor(genotype) design &lt;- model.matrix(~0+genotype) colnames(design) &lt;- levels(genotype) design ## ko wt ## 1 0 1 ## 2 0 1 ## 3 1 0 ## 4 1 0 ## attr(,&quot;assign&quot;) ## [1] 1 1 ## attr(,&quot;contrasts&quot;) ## attr(,&quot;contrasts&quot;)$genotype ## [1] &quot;contr.treatment&quot; We estimate the negative binomial (NB) and quasi-likelihood (QL) dispersions for each window. We again observe an increasing trend in the NB dispersions with respect to abundance (Figure 11.3). y &lt;- estimateDisp(y, design) summary(y$trended.dispersion) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.00837 0.00884 0.00942 0.01026 0.00988 0.02363 plotBCV(y) Figure 11.3: Abundance-dependent trend in the BCV for each window, represented by the blue line. Common (red) and tagwise estimates (black) are also shown. The QL dispersions are strongly shrunk towards the trend (Figure 11.4), indicating that there is little variability in the dispersions across windows. fit &lt;- glmQLFit(y, design, robust=TRUE) summary(fit$df.prior) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.52 90.42 90.42 90.34 90.42 90.42 plotQLDisp(fit) Figure 11.4: Effect of EB shrinkage on the raw QL dispersion estimate for each window (black) towards the abundance-dependent trend (blue) to obtain squeezed estimates (red). These results are consistent with the presence of systematic differences in enrichment between replicates, as discussed in Section 10.6. Nonetheless, the samples separate by genotype in the MDS plot (Figure 11.5), which suggests that the downstream analysis will be able to detect DB regions. plotMDS(cpm(y, log=TRUE), top=10000, labels=genotype, col=c(&quot;red&quot;, &quot;blue&quot;)[as.integer(genotype)]) Figure 11.5: MDS plot with two dimensions for all samples in the H3K27me3 data set. 
Samples are labelled and coloured according to the genotype. A larger top set of windows was used to improve the visualization of the genome-wide differences between the WT samples. We then test for DB between conditions in each window using the QL F-test. contrast &lt;- makeContrasts(wt-ko, levels=design) res &lt;- glmQLFTest(fit, contrast=contrast) 11.7 Consolidating results from multiple window sizes Consolidation allows the analyst to incorporate information from a range of different window sizes, each of which has a different trade-off between resolution and count size. This is particularly useful for broad marks where the width of an enriched region can be variable, as can the width of the differentially bound interval within an enriched region. To demonstrate, we repeat the entire analysis using 500 bp windows. Compared to our previous 2 kbp analysis, this provides greater spatial resolution at the cost of lowering the counts. # Counting into 500 bp windows. win.data2 &lt;- windowCounts(h3k27me3data$Path, param=param, width=500, spacing=100, ext=200) # Re-using the same normalization factors. win.data2$norm.factors &lt;- win.data$norm.factors # Filtering on abundance. filter.stat2 &lt;- filterWindowsGlobal(win.data2, bins) keep2 &lt;- filter.stat2$filter &gt; log2(min.fc) filtered.data2 &lt;- win.data2[keep2,] # Performing the statistical analysis. y2 &lt;- asDGEList(filtered.data2) y2 &lt;- estimateDisp(y2, design) fit2 &lt;- glmQLFit(y2, design, robust=TRUE) res2 &lt;- glmQLFTest(fit2, contrast=contrast) We consolidate the 500 bp analysis with our previous 2 kbp analysis using the mergeResultsList() function. This clusters both sets of windows together into a single set of regions. To limit chaining effects, each region cannot be more than 30 kbp in size.
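Within each merged region, the window-level p-values are combined using Simes' method (with weights to balance the two window sizes). The unweighted combination is simple enough to sketch in base R; the simes() function name here is our own, for illustration only.

```r
# Unweighted Simes combined p-value: sort the window-level p-values and
# take the minimum of n * p_(i) / i over the sorted values. The combined
# p-values can then be fed to the BH procedure for region-level FDR control.
simes <- function(p) {
    p <- sort(p)
    min(length(p) * p / seq_along(p))
}
simes(c(0.01, 0.2, 0.5))
## [1] 0.03
```

The real consolidation additionally weights each window's contribution so that regions covered by many small windows are not dominated by the higher-resolution analysis.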
merged &lt;- mergeResultsList(list(filtered.data, filtered.data2), tab.list=list(res$table, res2$table), equiweight=TRUE, tol=100, merge.args=list(max.width=30000)) merged$regions ## GRanges object with 80139 ranges and 0 metadata columns: ## seqnames ranges strand ## &lt;Rle&gt; &lt;IRanges&gt; &lt;Rle&gt; ## [1] chr1 3050001-3052500 * ## [2] chr1 3086401-3087200 * ## [3] chr1 3098001-3098800 * ## [4] chr1 3139501-3140100 * ## [5] chr1 3211501-3212600 * ## ... ... ... ... ## [80135] chrY 90766001-90769500 * ## [80136] chrY 90774801-90775900 * ## [80137] chrY 90788501-90789000 * ## [80138] chrY 90793101-90793700 * ## [80139] chrY 90797001-90814000 * ## ------- ## seqinfo: 21 sequences from an unspecified genome We compute combined \\(p\\)-values using Simes’ method for region-level FDR control (Simes 1986; Lun and Smyth 2014). This is done after weighting the contributions from the two sets of windows to ensure that the combined \\(p\\)-value for each region is not dominated by the analysis with more (smaller) windows. tabcom &lt;- merged$combined is.sig &lt;- tabcom$FDR &lt;= 0.05 summary(is.sig) ## Mode FALSE TRUE ## logical 73136 7003 Ezh2 is one of the proteins responsible for depositing H3K27me3, so we might expect that most DB regions have increased enrichment in the WT condition. However, the opposite seems to be true here, which is an interesting result that may warrant further investigation. table(tabcom$direction[is.sig]) ## ## down up ## 4803 2200 We also obtain statistics for the window with the lowest \\(p\\)-value in each region. The signs of the log-fold changes are largely consistent with the direction numbers above. tabbest &lt;- merged$best is.sig.pos &lt;- (tabbest$rep.logFC &gt; 0)[is.sig] summary(is.sig.pos) ## Mode FALSE TRUE ## logical 4803 2200 Finally, we save these results to file for future reference. 
out.ranges &lt;- merged$regions mcols(out.ranges) &lt;- data.frame(tabcom, best.logFC=tabbest$rep.logFC) saveRDS(file=&quot;h3k27me3_results.rds&quot;, out.ranges) 11.8 Annotation and visualization We add annotation for each region using the detailRanges() function, as previously described. library(TxDb.Mmusculus.UCSC.mm10.knownGene) library(org.Mm.eg.db) anno &lt;- detailRanges(out.ranges, orgdb=org.Mm.eg.db, txdb=TxDb.Mmusculus.UCSC.mm10.knownGene) mcols(out.ranges) &lt;- cbind(mcols(out.ranges), anno) We visualize one of the DB regions overlapping the Cdx2 gene to reproduce the results in Holik et al. (2015). cdx2 &lt;- genes(TxDb.Mmusculus.UCSC.mm10.knownGene)[&quot;12591&quot;] # Cdx2 Entrez ID cur.region &lt;- subsetByOverlaps(out.ranges, cdx2)[1] cur.region ## GRanges object with 1 range and 12 metadata columns: ## seqnames ranges strand | num.tests num.up.logFC ## &lt;Rle&gt; &lt;IRanges&gt; &lt;Rle&gt; | &lt;integer&gt; &lt;integer&gt; ## [1] chr5 147299501-147322500 * | 138 92 ## num.down.logFC PValue FDR direction rep.test rep.logFC ## &lt;integer&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;character&gt; &lt;integer&gt; &lt;numeric&gt; ## [1] 0 8.20439e-07 0.000172118 up 44791 2.95226 ## best.logFC overlap left right ## &lt;numeric&gt; &lt;character&gt; &lt;character&gt; &lt;character&gt; ## [1] 2.60436 Cdx2:-:PE,Urad:-:PE ## ------- ## seqinfo: 21 sequences from an unspecified genome We use Gviz (Hahne and Ivanek 2016) to plot the results. As in the H3K9ac analysis, we set up some tracks to display genome coordinates and gene annotation.
library(Gviz) gax &lt;- GenomeAxisTrack(col=&quot;black&quot;, fontsize=15, size=2) greg &lt;- GeneRegionTrack(TxDb.Mmusculus.UCSC.mm10.knownGene, showId=TRUE, geneSymbol=TRUE, name=&quot;&quot;, background.title=&quot;transparent&quot;) symbols &lt;- unlist(mapIds(org.Mm.eg.db, gene(greg), &quot;SYMBOL&quot;, &quot;ENTREZID&quot;, multiVals = &quot;first&quot;)) symbol(greg) &lt;- symbols[gene(greg)] In Figure 11.6, we see enrichment of H3K27me3 in the WT condition at the Cdx2 locus. This is consistent with the known regulatory relationship between Ezh2 and Cdx2. collected &lt;- list() lib.sizes &lt;- filtered.data$totals/1e6 for (i in seq_along(h3k27me3data$Path)) { reads &lt;- extractReads(bam.file=h3k27me3data$Path[[i]], cur.region, param=param) cov &lt;- as(coverage(reads)/lib.sizes[i], &quot;GRanges&quot;) collected[[i]] &lt;- DataTrack(cov, type=&quot;histogram&quot;, lwd=0, ylim=c(0,1), name=h3k27me3data$Description[i], col.axis=&quot;black&quot;, col.title=&quot;black&quot;, fill=&quot;darkgray&quot;, col.histogram=NA) } plotTracks(c(gax, collected, greg), chromosome=as.character(seqnames(cur.region)), from=start(cur.region), to=end(cur.region)) Figure 11.6: Coverage tracks for a region with H3K27me3 enrichment in the WT condition (top two tracks) against the KO (last two tracks). In contrast, if we look at a constitutively expressed gene, such as Col1a2, we find little evidence of H3K27me3 deposition. Only a single window is retained after filtering on abundance, and this window shows no evidence of changes in H3K27me3 enrichment between WT and KO samples.
col1a2 &lt;- genes(TxDb.Mmusculus.UCSC.mm10.knownGene)[&quot;12843&quot;] # Col1a2 Entrez ID cur.region &lt;- subsetByOverlaps(out.ranges, col1a2) cur.region ## GRanges object with 1 range and 12 metadata columns: ## seqnames ranges strand | num.tests num.up.logFC num.down.logFC ## &lt;Rle&gt; &lt;IRanges&gt; &lt;Rle&gt; | &lt;integer&gt; &lt;integer&gt; &lt;integer&gt; ## [1] chr6 4511701-4512200 * | 1 0 0 ## PValue FDR direction rep.test rep.logFC best.logFC ## &lt;numeric&gt; &lt;numeric&gt; &lt;character&gt; &lt;integer&gt; &lt;numeric&gt; &lt;numeric&gt; ## [1] 0.220691 0.432977 down 312358 -0.714484 -0.714484 ## overlap left right ## &lt;character&gt; &lt;character&gt; &lt;character&gt; ## [1] Col1a2:+:I Col1a2:+:907 Col1a2:+:161 ## ------- ## seqinfo: 21 sequences from an unspecified genome Session information R version 4.1.0 (2021-05-18) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] grid stats4 parallel stats graphics grDevices utils [8] datasets methods base other attached packages: [1] Gviz_1.36.0 [2] org.Mm.eg.db_3.13.0 [3] TxDb.Mmusculus.UCSC.mm10.knownGene_3.10.0 [4] GenomicFeatures_1.44.0 [5] AnnotationDbi_1.54.0 [6] edgeR_3.34.0 [7] limma_3.48.0 [8] csaw_1.26.0 [9] SummarizedExperiment_1.22.0 [10] Biobase_2.52.0 [11] MatrixGenerics_1.4.0 [12] matrixStats_0.58.0 [13] BiocFileCache_2.0.0 [14] dbplyr_2.1.1 [15] Rsamtools_2.8.0 [16] Biostrings_2.60.0 [17] XVector_0.32.0 [18] GenomicRanges_1.44.0 [19] GenomeInfoDb_1.28.0 [20] IRanges_2.26.0 [21] S4Vectors_0.30.0 [22] BiocGenerics_0.38.0 [23]
chipseqDBData_1.8.0 [24] BiocStyle_2.20.0 [25] rebook_1.2.0 loaded via a namespace (and not attached): [1] backports_1.2.1 Hmisc_4.5-0 [3] AnnotationHub_3.0.0 lazyeval_0.2.2 [5] splines_4.1.0 BiocParallel_1.26.0 [7] ggplot2_3.3.3 digest_0.6.27 [9] ensembldb_2.16.0 htmltools_0.5.1.1 [11] fansi_0.4.2 checkmate_2.0.0 [13] magrittr_2.0.1 memoise_2.0.0 [15] BSgenome_1.60.0 cluster_2.1.2 [17] prettyunits_1.1.1 jpeg_0.1-8.1 [19] colorspace_2.0-1 blob_1.2.1 [21] rappdirs_0.3.3 xfun_0.23 [23] dplyr_1.0.6 crayon_1.4.1 [25] RCurl_1.98-1.3 jsonlite_1.7.2 [27] graph_1.70.0 VariantAnnotation_1.38.0 [29] survival_3.2-11 glue_1.4.2 [31] gtable_0.3.0 zlibbioc_1.38.0 [33] DelayedArray_0.18.0 scales_1.1.1 [35] DBI_1.1.1 Rcpp_1.0.6 [37] xtable_1.8-4 progress_1.2.2 [39] htmlTable_2.2.1 foreign_0.8-81 [41] bit_4.0.4 Formula_1.2-4 [43] htmlwidgets_1.5.3 metapod_1.0.0 [45] httr_1.4.2 dir.expiry_1.0.0 [47] RColorBrewer_1.1-2 ellipsis_0.3.2 [49] pkgconfig_2.0.3 XML_3.99-0.6 [51] nnet_7.3-16 CodeDepends_0.6.5 [53] sass_0.4.0 locfit_1.5-9.4 [55] utf8_1.2.1 tidyselect_1.1.1 [57] rlang_0.4.11 later_1.2.0 [59] munsell_0.5.0 BiocVersion_3.13.1 [61] tools_4.1.0 cachem_1.0.5 [63] generics_0.1.0 RSQLite_2.2.7 [65] ExperimentHub_2.0.0 evaluate_0.14 [67] stringr_1.4.0 fastmap_1.1.0 [69] yaml_2.2.1 knitr_1.33 [71] bit64_4.0.5 purrr_0.3.4 [73] AnnotationFilter_1.16.0 KEGGREST_1.32.0 [75] mime_0.10 biomaRt_2.48.0 [77] rstudioapi_0.13 compiler_4.1.0 [79] filelock_1.0.2 curl_4.3.1 [81] png_0.1-7 interactiveDisplayBase_1.30.0 [83] tibble_3.1.2 statmod_1.4.36 [85] bslib_0.2.5.1 stringi_1.6.2 [87] highr_0.9 lattice_0.20-44 [89] ProtGenerics_1.24.0 Matrix_1.3-3 [91] vctrs_0.3.8 pillar_1.6.1 [93] lifecycle_1.0.0 BiocManager_1.30.15 [95] jquerylib_0.1.4 data.table_1.14.0 [97] bitops_1.0-7 httpuv_1.6.1 [99] rtracklayer_1.52.0 R6_2.5.0 [101] BiocIO_1.2.0 latticeExtra_0.6-29 [103] bookdown_0.22 promises_1.2.0.1 [105] KernSmooth_2.23-20 gridExtra_2.3 [107] codetools_0.2-18 dichromat_2.0-0 [109] assertthat_0.2.1 
rjson_0.2.20 [111] withr_2.4.2 GenomicAlignments_1.28.0 [113] GenomeInfoDbData_1.2.6 hms_1.1.0 [115] rpart_4.1-15 rmarkdown_2.8 [117] biovizBase_1.40.0 shiny_1.6.0 [119] base64enc_0.1-3 restfulr_0.0.13 Bibliography "],["contributors.html", "Chapter 12 Contributors Grant information Acknowledgements", " Chapter 12 Contributors Aaron Lun Me. Gordon Smyth My PhD supervisor. Other entities Aliaksei Holik (Walter and Eliza Hall Institute for Medical Research), who provided the raw materials for the H3K27me3 workflow. Grant information National Health and Medical Research Council (Program Grant 1054618 to G.K.S., Fellowship to G.K.S.); Victorian State Government Operational Infrastructure Support; Australian Government NHMRC IRIIS. Acknowledgements The authors would like to thank Prof. Stephen Nutt for his valuable insights on B-cell biology. "],["bibliography.html", "Chapter 13 Bibliography", " Chapter 13 Bibliography "]]
