---
title: "The csaw Book"
documentclass: book
bibliography: ref.bib
biblio-style: apalike
link-citations: yes
description: "ChIP-seq Analysis with Windows. But it works on Mac and Linux too!"
github-repo: LTLA/csawUsersGuide
---

---
date: "**Authors:** Aaron Lun [aut, cre]
**Version:** 1.14.0
**Modified:** 2024-04-17
**Compiled:** 2024-10-30
**Environment:** R version 4.4.1 (2024-06-14), Bioconductor 3.20
**License:** GPL-3
**Copyright:** Bioconductor, 2020
**Source:** https://github.com/LTLA/csawUsersGuide"
url: "https://github.com/LTLA/csawUsersGuide"
---

# Welcome

## Introduction

Chromatin immunoprecipitation with sequencing (ChIP-seq) is a widely used technique for identifying the genomic binding sites of a target protein. Conventional analyses of ChIP-seq data aim to detect absolute binding (i.e., the presence or absence of a binding site) based on peaks in the read coverage. An alternative analysis strategy is to detect changes in the binding profile between conditions [@rossinnes2012differential; @pal2013]. These differential binding (DB) analyses involve counting reads into genomic intervals and testing those counts for significant differences between conditions. This defines a set of putative DB regions for further examination. DB analyses are statistically easier to perform than their conventional counterparts, as the effect of genomic biases is largely mitigated when counts for different libraries are compared at the same genomic region. DB regions may also be more relevant as the change in binding can be associated with the biological difference between conditions.

This book describes the use of the *[csaw](https://bioconductor.org/packages/3.20/csaw)* Bioconductor package to detect DB in ChIP-seq experiments with sliding windows [@lun2016csaw]. In these analyses, we detect and summarize DB regions between conditions in a *de novo* manner, i.e., without making any prior assumptions about the location or width of bound regions. We demonstrate on data from a variety of real studies focusing on changes in transcription factor binding and histone mark enrichment. Our aim is to facilitate the practical implementation of window-based DB analyses by providing detailed code and expected output.
The code here can be adapted to any dataset with multiple experimental conditions and with multiple biological samples within one or more of the conditions; it is similarly straightforward to accommodate batch effects, covariates and additional experimental factors. Indeed, though the book focuses on ChIP-seq, the same software can be adapted to data from any sequencing technique where reads represent coverage of enriched genomic regions.

## How to read this book

The descriptions in this book explore the theoretical and practical motivations behind each step of a *[csaw](https://bioconductor.org/packages/3.20/csaw)* analysis. While all users are welcome to read it from start to finish, new users may prefer to examine the case studies presented in the later sections [@lun2015from], which provide the important information in a more concise format. Experienced users (or those looking for some nighttime reading!) are more likely to benefit from the in-depth discussions in this document.

All of the workflows described here start from sorted and indexed BAM files in the *[chipseqDBData](https://bioconductor.org/packages/3.20/chipseqDBData)* package. For application to user-specified data, the raw read sequences have to be aligned to the appropriate reference genome beforehand. Most aligners can be used for this purpose, but we have used *[Rsubread](https://bioconductor.org/packages/3.20/Rsubread)* [@liao2013] due to the convenience of its R interface. It is also recommended to mark duplicate reads using tools like `Picard` prior to starting the workflow.

The statistical methods described here are based upon those in the *[edgeR](https://bioconductor.org/packages/3.20/edgeR)* package [@robinson2010]. Knowledge of *[edgeR](https://bioconductor.org/packages/3.20/edgeR)* is useful but not a prerequisite for reading this guide.

## How to get help

Most questions about *[csaw](https://bioconductor.org/packages/3.20/csaw)* should be answered by the documentation.
Every function mentioned in this guide has its own help page. For example, a detailed description of the arguments and output of the `windowCounts()` function can be obtained by typing `?windowCounts` or `help(windowCounts)` at the R prompt. Further detail on the methods or the underlying theory can be found in the references at the bottom of each help page.

The authors of the package always appreciate receiving reports of bugs in the package functions or in the documentation. The same goes for well-considered suggestions for improvements. Other questions about how to use *[csaw](https://bioconductor.org/packages/3.20/csaw)* are best sent to the [Bioconductor support site](https://support.bioconductor.org). Please send requests for general assistance and advice to the support site, rather than to the individual authors. Users posting to the support site for the first time may find it helpful to read the [posting guide](http://www.bioconductor.org/help/support/posting-guide).

## How to cite this book

Most users of *[csaw](https://bioconductor.org/packages/3.20/csaw)* should cite the following in any publications:

> A. T. Lun and G. K. Smyth. csaw: a Bioconductor package for differential binding analysis of ChIP-seq data using sliding windows. _Nucleic Acids Res._, 44(5):e45, Mar 2016

To cite the workflows specifically, we can use:

> A. T. L. Lun and G. K. Smyth. From reads to regions: a Bioconductor workflow to detect differential binding in ChIP-seq data. _F1000Research_, 4, 2015

The use of combined $p$-values in DB analyses was proposed in:

> A. T. Lun and G. K. Smyth. De novo detection of differentially bound regions for ChIP-seq data using peaks and windows: controlling error rates correctly. _Nucleic Acids Res._, 42(11):e95, Jul 2014

The DB analyses shown here use methods from the *[edgeR](https://bioconductor.org/packages/3.20/edgeR)* package, which has its own citation recommendations.
See the appropriate section of the *[edgeR](https://bioconductor.org/packages/3.20/edgeR)* user's guide for more details.

## Quick start

A typical ChIP-seq analysis in *[csaw](https://bioconductor.org/packages/3.20/csaw)* looks something like the code below. This assumes that a vector of file paths to sorted and indexed BAM files is provided in `bam.files` and that a design matrix is supplied in `design`. Here, both are set up from the NF-YA dataset in *[chipseqDBData](https://bioconductor.org/packages/3.20/chipseqDBData)*:

``` r
library(chipseqDBData)
tf.data <- NFYAData()
tf.data <- head(tf.data, -1) # skip the input.
bam.files <- tf.data$Path

# Extract the cell type from each sample description and
# build a two-column design matrix (intercept + cell type effect).
cell.type <- sub("NF-YA ([^ ]+) .*", "\\1", tf.data$Description)
design <- model.matrix(~factor(cell.type))
colnames(design) <- c("intercept", "cell.type")
```

The analysis itself is split across several steps:

1. Loading in data from BAM files.

    ``` r
    library(csaw)
    # Ignore reads with mapping quality below 20.
    param <- readParam(minq=20)
    # Count reads into 10 bp windows after extending them to 110 bp.
    data <- windowCounts(bam.files, ext=110, width=10, param=param)
    ```

2. Filtering out uninteresting regions.

    ``` r
    # Count reads into 10 kbp bins to estimate the global background.
    binned <- windowCounts(bam.files, bin=TRUE, width=10000, param=param)
    # Keep windows with more than a 5-fold increase over the background.
    keep <- filterWindowsGlobal(data, binned)$filter > log2(5)
    data <- data[keep,]
    ```

3. Calculating normalization factors.

    ``` r
    # Compute factors from the bin counts and store them in 'data'.
    data <- normFactors(binned, se.out=data)
    ```

4. Identifying DB windows.

    ``` r
    library(edgeR)
    y <- asDGEList(data)
    y <- estimateDisp(y, design)
    fit <- glmQLFit(y, design, robust=TRUE)
    # Tests the last coefficient of 'design' by default.
    results <- glmQLFTest(fit)
    ```

5. Correcting for multiple testing.

    ``` r
    # Merge windows within 1 kbp of each other into regions
    # and combine the window-level statistics for each region.
    merged <- mergeResults(data, results$table, tol=1000L)
    ```

## Session information {-}
```
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.1 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB              LC_COLLATE=C
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] edgeR_4.4.0           limma_3.62.0
 [3] csaw_1.40.0           SummarizedExperiment_1.36.0
 [5] Biobase_2.66.0        MatrixGenerics_1.18.0
 [7] matrixStats_1.4.1     GenomicRanges_1.58.0
 [9] GenomeInfoDb_1.42.0   IRanges_2.40.0
[11] S4Vectors_0.44.0      BiocGenerics_0.52.0
[13] chipseqDBData_1.21.0  BiocStyle_2.34.0

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1        dplyr_1.1.4          blob_1.2.4
 [4] filelock_1.0.3          Biostrings_2.74.0    rebook_1.16.0
 [7] bitops_1.0-9            fastmap_1.2.0        BiocFileCache_2.14.0
[10] XML_3.99-0.17           digest_0.6.37        mime_0.12
[13] lifecycle_1.0.4         statmod_1.5.0        KEGGREST_1.46.0
[16] RSQLite_2.3.7           magrittr_2.0.3       compiler_4.4.1
[19] rlang_1.1.4             sass_0.4.9           tools_4.4.1
[22] utf8_1.2.4              yaml_2.3.10          knitr_1.48
[25] S4Arrays_1.6.0          bit_4.5.0            curl_5.2.3
[28] DelayedArray_0.32.0     abind_1.4-8          BiocParallel_1.40.0
[31] withr_3.0.2             purrr_1.0.2          CodeDepends_0.6.6
[34] grid_4.4.1              fansi_1.0.6          ExperimentHub_2.14.0
[37] cli_3.6.3               rmarkdown_2.28       crayon_1.5.3
[40] generics_0.1.3          metapod_1.14.0       httr_1.4.7
[43] DBI_1.2.3               cachem_1.1.0         zlibbioc_1.52.0
[46] parallel_4.4.1          AnnotationDbi_1.68.0 BiocManager_1.30.25
[49] XVector_0.46.0          vctrs_0.6.5          Matrix_1.7-1
[52] jsonlite_1.8.9          dir.expiry_1.14.0    bookdown_0.41
[55] bit64_4.5.2             locfit_1.5-9.10      jquerylib_0.1.4
[58] glue_1.8.0              codetools_0.2-20     BiocVersion_3.20.0
[61] UCSC.utils_1.2.0        tibble_3.2.1         pillar_1.9.0
[64] rappdirs_0.3.3          htmltools_0.5.8.1    graph_1.84.0
[67] GenomeInfoDbData_1.2.13 R6_2.5.1             dbplyr_2.5.0
[70] evaluate_1.0.1          lattice_0.22-6       AnnotationHub_3.14.0
[73] png_0.1-8               Rsamtools_2.22.0     memoise_2.0.1
[76] bslib_0.8.0             Rcpp_1.0.13          SparseArray_1.6.0
[79] xfun_0.48               pkgconfig_2.0.3
```