fastRanges provides high-throughput interval operations for Bioconductor
workflows built on IRanges and GRanges.
For genomic data, the central data type is GRanges. Each row is one genomic
interval with:
"chr1"start, end, or width)"+", "-", or "*")Typical analyses ask questions such as:
fastRanges is designed to answer those questions while keeping its grammar
close to the Bioconductor overlap APIs in IRanges and GenomicRanges.
The package is best thought of as a high-throughput interval engine for
supported Bioconductor range classes rather than as a promise of full
drop-in replacement for every findOverlaps() input type.
library(fastRanges)
library(GenomicRanges)
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> Loading required package: generics
#>
#> Attaching package: 'generics'
#> The following objects are masked from 'package:base':
#>
#> as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
#> setequal, union
#>
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#> as.data.frame, basename, cbind, colnames, dirname, do.call,
#> duplicated, eval, evalq, get, grep, grepl, is.unsorted, lapply,
#> mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
#> rank, rbind, rownames, sapply, saveRDS, table, tapply, unique,
#> unsplit, which.max, which.min
#> Loading required package: S4Vectors
#>
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:utils':
#>
#> findMatches
#> The following objects are masked from 'package:base':
#>
#> I, expand.grid, unname
#> Loading required package: IRanges
#> Loading required package: Seqinfo
library(S4Vectors)
data("fast_ranges_example", package = "fastRanges")
query <- fast_ranges_example$query
subject <- fast_ranges_example$subject
The package uses the same two-object language as Bioconductor:
query: the intervals you are asking aboutsubject: the intervals you compare the query againstIf a query peak overlaps three subject genes, then:
fast_find_overlaps() returns three hit pairsfast_count_overlaps() returns 3 for that query rowfast_overlaps_any() returns TRUE for that query rowThat distinction is the main decision point when choosing a function.
| Goal | Function | Main output |
|---|---|---|
| Get every matching pair | fast_find_overlaps() |
Hits |
| Count matches per query | fast_count_overlaps() |
integer vector |
| Ask only yes/no | fast_overlaps_any() |
logical vector |
| Build a tabular overlap join | fast_overlap_join() |
data.frame |
| Keep only matching query rows | fast_semi_overlap_join() |
data.frame |
| Keep only non-matching query rows | fast_anti_overlap_join() |
data.frame |
| Reuse the same subject many times | fast_build_index() |
fast_ranges_index |
| Find nearest subject | fast_nearest() |
DataFrame |
| Summarize overlap counts by subject metadata | fast_count_overlaps_by_group() |
matrix |
| Summarize subject scores over overlaps | fast_overlap_aggregate() |
numeric vector |
| Merge or transform intervals themselves | fast_reduce(), fast_disjoin(), fast_gaps() |
ranges |
| Summarize coverage in bins | fast_tile_coverage() |
data.frame |
The package ships with a small reproducible example:
names(fast_ranges_example)
#> [1] "query" "subject"
query
#> GRanges object with 6 ranges and 2 metadata columns:
#> seqnames ranges strand | query_id score
#> <Rle> <IRanges> <Rle> | <character> <integer>
#> [1] chr1 100-159 + | q1 8
#> [2] chr1 180-269 + | q2 10
#> [3] chr1 350-419 - | q3 6
#> [4] chr2 40-79 * | q4 3
#> [5] chr2 300-379 + | q5 9
#> [6] chr3 10-34 - | q6 5
#> -------
#> seqinfo: 3 sequences from an unspecified genome; no seqlengths
subject
#> GRanges object with 7 ranges and 2 metadata columns:
#> seqnames ranges strand | gene_id biotype
#> <Rle> <IRanges> <Rle> | <character> <character>
#> [1] chr1 90-169 + | gA promoter
#> [2] chr1 210-309 - | gB enhancer
#> [3] chr1 320-369 - | gC enhancer
#> [4] chr2 20-89 + | gD promoter
#> [5] chr2 330-419 * | gE gene_body
#> [6] chr3 1-40 - | gF promoter
#> [7] chr3 80-114 + | gG enhancer
#> -------
#> seqinfo: 3 sequences from an unspecified genome; no seqlengths
In a real analysis, query might be ChIP-seq peaks and subject might be
genes, exons, enhancers, promoters, or other genomic annotations.
This example is intentionally small so every function in the vignette can run quickly. The structure is:
query: 6 genomic intervals with metadata columns query_id and scoresubject: 7 genomic intervals with metadata columns gene_id and biotypeIf you want file-based examples instead of in-memory objects, the same records are also installed as BED files:
system.file("extdata", "query_peaks.bed", package = "fastRanges")
#> [1] "/tmp/RtmpOKui21/Rinst25ddf22361d5e/fastRanges/extdata/query_peaks.bed"
system.file("extdata", "subject_genes.bed", package = "fastRanges")
#> [1] "/tmp/RtmpOKui21/Rinst25ddf22361d5e/fastRanges/extdata/subject_genes.bed"
That split is deliberate:
fast_ranges_example for package examples, tutorials, and testshits <- fast_find_overlaps(query, subject, threads = 2)
hits
#> Hits object with 5 hits and 0 metadata columns:
#> from to
#> <integer> <integer>
#> [1] 1 1
#> [2] 3 3
#> [3] 4 4
#> [4] 5 5
#> [5] 6 6
#> -------
#> nLnode: 6 / nRnode: 7
This result is a Hits object. The important parts are:
queryHits(hits): row indices from querysubjectHits(hits): matching row indices from subjectcbind(
query_id = S4Vectors::queryHits(hits),
subject_id = S4Vectors::subjectHits(hits)
)
#> query_id subject_id
#> [1,] 1 1
#> [2,] 3 3
#> [3,] 4 4
#> [4,] 5 5
#> [5,] 6 6
counts <- fast_count_overlaps(query, subject, threads = 2)
counts
#> [1] 1 0 1 1 1 1
There is one value per query row:
0 means no subject matched2 means that query matched two subject rangesany_hits <- fast_overlaps_any(query, subject, threads = 2)
any_hits
#> [1] TRUE FALSE TRUE TRUE TRUE TRUE
This is the lightest-weight summary when you need only a logical answer.
Most overlap-style functions use the same argument grammar:
querysubjectmax_gapmin_overlaptypeignore_strandthreadsdeterministictypetype controls what “match” means.
type = "any"
query : |------|
subject: |------|
=> any qualifying overlap is a hit
type = "within"
subject: |--------------|
query : |------|
=> query must lie inside subject
type = "start"
query : |------|
subject: |------------|
=> start coordinates must match or be within tolerance
type = "end"
query : |------|
subject: |------------|
=> end coordinates must match or be within tolerance
type = "equal"
query : |------|
subject: |------|
=> start and end coordinates must match, or be within tolerance
max_gapmax_gap controls how far apart two ranges may be and still count as a hit.
-1 means true overlap is required0 allows Bioconductor’s zero-gap tolerance behaviormin_overlapmin_overlap is the minimum overlap width, in bases.
0 is the least strict settingignore_strandFor GRanges:
FALSE uses strand-aware Bioconductor semanticsTRUE ignores strand and compares only genomic coordinatesFor IRanges, strand does not exist, so this argument has no effect.
threads and deterministicthreads controls parallel executiondeterministic = TRUE keeps stable ordering of hit pairsdeterministic = FALSE can be slightly faster for large jobs when order is
unimportantfastRanges is intended to stay close to Bioconductor overlap semantics.
The easiest way to see that is to compare outputs directly:
ref <- GenomicRanges::findOverlaps(query, subject, ignore.strand = FALSE)
identical(
cbind(S4Vectors::queryHits(hits), S4Vectors::subjectHits(hits)),
cbind(S4Vectors::queryHits(ref), S4Vectors::subjectHits(ref))
)
#> [1] TRUE
IRangesGRanges"any", "start", "end", "within", and "equal"GRangesselect = "all", "first", "last", and "arbitrary"GRangesListFor these unsupported cases, fastRanges now raises an explicit error rather
than returning silently different answers.
Width-0 / empty-range cases are handled using Bioconductor-compatible reference behavior. This is a correctness choice: empty ranges are uncommon in production overlap workloads, so exact semantics are more important than forcing these cases through the multithreaded C++ path.
There are two intended usage modes:
fast_find_overlaps(query, subject, ...)idx <- fast_build_index(subject)fast_find_overlaps(query, idx, ...)This distinction matters for performance much more than for semantics.
deterministicdeterministic = TRUE keeps output ordering stable across thread counts and is
useful for tests, reports, and exact reproducibility.
deterministic = FALSE avoids that extra ordering step and is usually the
right choice when you care about maximum throughput on large multithreaded
jobs.
That compatibility is important because fastRanges is designed to fit into
existing IRanges / GRanges workflows rather than replace those classes.
ret_classes <- c(
hits = class(fast_find_overlaps(query, subject, threads = 1))[1],
count = class(fast_count_overlaps(query, subject, threads = 1))[1],
any = class(fast_overlaps_any(query, subject, threads = 1))[1],
nearest = class(fast_nearest(query, subject))[1],
join = class(fast_overlap_join(query, subject, threads = 1))[1],
coverage = class(fast_coverage(query))[1]
)
ret_classes
#> hits count any nearest join
#> "Hits" "integer" "logical" "DFrame" "data.frame"
#> coverage
#> "SimpleRleList"
Common return shapes:
Hits: pairwise overlap relationshipsDataFrame: nearest-neighbor tablesdata.frame: join or coverage summary tablesGRanges / IRanges: transformed intervalsRleList or Rle: base-resolution coverageIf you will query the same subject repeatedly, build an index once and reuse
it.
subject_index <- fast_build_index(subject)
subject_index
#> <fast_ranges_index>
#> subject class: GRanges
#> subject ranges: 7
#> partitions: 3
Now use that index in later overlap calls:
hits_from_index <- fast_find_overlaps(query, subject_index, threads = 2)
identical(
cbind(S4Vectors::queryHits(hits_from_index), S4Vectors::subjectHits(hits_from_index)),
cbind(S4Vectors::queryHits(hits), S4Vectors::subjectHits(hits))
)
#> [1] TRUE
When should you build an index?
The join family turns overlap hits into beginner-friendly tables.
joined_inner <- fast_overlap_join(query, subject, join = "inner", threads = 2)
head(joined_inner)
#> query_id subject_id query_seqnames query_start query_end query_width
#> 1 1 1 chr1 100 159 60
#> 3 3 3 chr1 350 419 70
#> 4 4 4 chr2 40 79 40
#> 5 5 5 chr2 300 379 80
#> 6 6 6 chr3 10 34 25
#> query_strand query_query_id query_score subject_seqnames subject_start
#> 1 + q1 8 chr1 90
#> 3 - q3 6 chr1 320
#> 4 * q4 3 chr2 20
#> 5 + q5 9 chr2 330
#> 6 - q6 5 chr3 1
#> subject_end subject_width subject_strand subject_gene_id subject_biotype
#> 1 169 80 + gA promoter
#> 3 369 50 - gC enhancer
#> 4 89 70 + gD promoter
#> 5 419 90 * gE gene_body
#> 6 40 40 - gF promoter
Important columns:
query_id: row index from querysubject_id: matching row index from subjectquery_*: columns copied from querysubject_*: columns copied from subjectLeft join keeps all query rows:
joined_left <- fast_left_overlap_join(query, subject, threads = 2)
head(joined_left)
#> query_id subject_id query_seqnames query_start query_end query_width
#> 1 1 1 chr1 100 159 60
#> 3 3 3 chr1 350 419 70
#> 4 4 4 chr2 40 79 40
#> 5 5 5 chr2 300 379 80
#> 6 6 6 chr3 10 34 25
#> 2 2 NA chr1 180 269 90
#> query_strand query_query_id query_score subject_seqnames subject_start
#> 1 + q1 8 chr1 90
#> 3 - q3 6 chr1 320
#> 4 * q4 3 chr2 20
#> 5 + q5 9 chr2 330
#> 6 - q6 5 chr3 1
#> 2 + q2 10 <NA> NA
#> subject_end subject_width subject_strand subject_gene_id subject_biotype
#> 1 169 80 + gA promoter
#> 3 369 50 - gC enhancer
#> 4 89 70 + gD promoter
#> 5 419 90 * gE gene_body
#> 6 40 40 - gF promoter
#> 2 NA NA <NA> <NA> <NA>
Semi and anti joins keep only query rows:
semi_tbl <- fast_semi_overlap_join(query, subject, threads = 2)
anti_tbl <- fast_anti_overlap_join(query, subject, threads = 2)
head(semi_tbl)
#> query_id overlap_count query_seqnames query_start query_end query_width
#> 1 1 1 chr1 100 159 60
#> 3 3 1 chr1 350 419 70
#> 4 4 1 chr2 40 79 40
#> 5 5 1 chr2 300 379 80
#> 6 6 1 chr3 10 34 25
#> query_strand query_query_id query_score
#> 1 + q1 8
#> 3 - q3 6
#> 4 * q4 3
#> 5 + q5 9
#> 6 - q6 5
head(anti_tbl)
#> query_id overlap_count query_seqnames query_start query_end query_width
#> 2 2 0 chr1 180 269 90
#> query_strand query_query_id query_score
#> 2 + q2 10
These functions are useful when direct overlap is not the only biological question.
nearest_tbl <- fast_nearest(query, subject)
head(as.data.frame(nearest_tbl))
#> query_id subject_id distance
#> 1 1 1 0
#> 2 2 1 10
#> 3 3 3 0
#> 4 4 4 0
#> 5 5 5 0
#> 6 6 6 0
Interpretation:
subject_id is the nearest subject for each matched querydistance = 0 means overlapDirectional helpers return subject row indices:
fast_precede(query, subject)
#> [1] NA NA 2 5 NA NA
fast_follow(query, subject)
#> [1] NA 1 NA NA 4 NA
Many analyses need more than raw hits. They need counts or scores summarized by subject annotation.
set.seed(1)
S4Vectors::mcols(subject)$group <- sample(
c("promoter", "enhancer"),
length(subject),
replace = TRUE
)
S4Vectors::mcols(subject)$score <- seq_len(length(subject))
by_group <- fast_count_overlaps_by_group(
query,
subject,
group_col = "group",
threads = 2
)
by_group
#> enhancer promoter
#> [1,] 0 1
#> [2,] 0 0
#> [3,] 0 1
#> [4,] 0 1
#> [5,] 1 0
#> [6,] 0 1
Rows are query ranges. Columns are subject groups.
sum_score <- fast_overlap_aggregate(
query,
subject,
value_col = "score",
fun = "sum",
threads = 2
)
sum_score
#> [1] 1 NA 3 4 5 6
This pattern is useful when subject intervals carry signal, annotation scores, or other numeric metadata.
Self-overlaps compare a range set to itself:
self_hits <- fast_self_overlaps(query, threads = 2)
self_hits
#> Hits object with 0 hits and 0 metadata columns:
#> from to
#> <integer> <integer>
#> -------
#> nLnode: 6 / nRnode: 6
Clusters identify connected components in the self-overlap graph:
clusters <- fast_cluster_overlaps(query, return = "data.frame")
head(clusters)
#> range_id cluster_id cluster_size
#> 1 1 1 1
#> 2 2 2 1
#> 3 3 3 1
#> 4 4 4 1
#> 5 5 5 1
#> 6 6 6 1
Sliding-window summaries convert interval data into a regular table:
window_counts <- fast_window_count_overlaps(
query,
subject,
window_width = 1000L,
step_width = 500L,
threads = 2
)
#> Warning in .merge_two_Seqinfo_objects(x, y): The 2 combined objects have no sequence levels in common. (Use
#> suppressWarnings() to suppress this warning.)
#> Warning in .merge_two_Seqinfo_objects(x, y): The 2 combined objects have no sequence levels in common. (Use
#> suppressWarnings() to suppress this warning.)
head(window_counts)
#> window_id seqnames start end overlap_count
#> 1 1 chr1 100 1099 3
#> 2 2 chr2 40 1039 2
#> 3 3 chr3 10 1009 2
These functions work on the ranges themselves rather than overlap hit pairs.
query_reduced <- fast_reduce(query)
query_disjoint <- fast_disjoin(query)
query_gaps <- fast_gaps(query)
c(
original = length(query),
reduced = length(query_reduced),
disjoint = length(query_disjoint),
gaps = length(query_gaps)
)
#> original reduced disjoint gaps
#> 6 6 6 6
You can also combine two range sets directly:
union_ranges <- fast_range_union(query, subject)
intersect_ranges <- fast_range_intersect(query, subject)
setdiff_ranges <- fast_range_setdiff(query, subject)
c(
union = length(union_ranges),
intersect = length(intersect_ranges),
setdiff = length(setdiff_ranges)
)
#> union intersect setdiff
#> 10 3 4
Coverage answers: “how many ranges cover each base position?”
query_cov <- fast_coverage(query)
query_cov
#> RleList of length 3
#> $chr1
#> integer-Rle of length 419 with 6 runs
#> Lengths: 99 60 20 90 80 70
#> Values : 0 1 0 1 0 1
#>
#> $chr2
#> integer-Rle of length 379 with 4 runs
#> Lengths: 39 40 220 80
#> Values : 0 1 0 1
#>
#> $chr3
#> integer-Rle of length 34 with 2 runs
#> Lengths: 9 25
#> Values : 0 1
For plotting and downstream summaries, tile-based coverage is often easier to work with:
query_tiles <- fast_tile_coverage(
query,
tile_width = 1000L,
step_width = 1000L
)
head(query_tiles)
#> seqnames start end coverage_sum
#> 1 chr1 1 419 220
#> 2 chr2 1 379 120
#> 3 chr3 1 34 25
Use the iterator API when you do not want to materialize all hits at once.
iter <- fast_find_overlaps_iter(query, subject_index, chunk_size = 2L, threads = 2)
fast_iter_has_next(iter)
#> [1] TRUE
chunk1 <- fast_iter_next(iter)
chunk1
#> Hits object with 1 hit and 0 metadata columns:
#> from to
#> <integer> <integer>
#> [1] 1 1
#> -------
#> nLnode: 6 / nRnode: 7
fast_iter_has_next(iter)
#> [1] TRUE
To start over:
fast_iter_reset(iter)
all_iter_hits <- fast_iter_collect(iter)
length(all_iter_hits)
#> [1] 5
fast_iter_collect() gathers the remaining chunks from the current iterator
position. If you already consumed some chunks and want everything again, reset
the iterator first.
tmp_index <- tempfile(fileext = ".rds")
fast_save_index(subject_index, tmp_index)
loaded_index <- fast_load_index(tmp_index)
fast_index_stats(loaded_index)
#> DataFrame with 1 row and 6 columns
#> subject_n partition_n block_n seqlevel_n mean_partition_size index_size_mb
#> <integer> <integer> <integer> <integer> <numeric> <numeric>
#> 1 7 3 3 3 2.33333 0.0043869
unlink(tmp_index)
This is useful when index construction is expensive and the same subject set will be queried again in a later session.
The chunk below is only a small demonstration. Publication-scale benchmarking
should use the Quarto workflow in inst/benchmarks.
base_time <- system.time({
h_base <- GenomicRanges::findOverlaps(query, subject, ignore.strand = FALSE)
})
fast_time <- system.time({
h_fast <- fast_find_overlaps(query, subject_index, ignore_strand = FALSE, threads = 2)
})
rbind(
genomic_ranges = base_time[1:3],
fast_ranges = fast_time[1:3]
)
#> user.self sys.self elapsed
#> genomic_ranges 0.015 0 0.015
#> fast_ranges 0.004 0 0.003
c(base_hits = length(h_base), fast_hits = length(h_fast))
#> base_hits fast_hits
#> 5 5
The vignette example above is intentionally small. For large benchmark studies, repeated-query experiments, and figure-rich interpretation, use the benchmark resources maintained in the repository:
This separation is deliberate:
The package also ships example BED files and benchmarking helpers:
c(
list.files(system.file("extdata", package = "fastRanges")),
list.files(system.file("benchmarks", package = "fastRanges"))
)
#> [1] "query_peaks.bed" "subject_genes.bed" "01"
#> [4] "README.md" "validate_outputs.R"
sessionInfo()
#> R version 4.6.0 alpha (2026-04-05 r89794)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] GenomicRanges_1.63.2 Seqinfo_1.1.0 IRanges_2.45.0
#> [4] S4Vectors_0.49.2 BiocGenerics_0.57.1 generics_0.1.4
#> [7] fastRanges_0.99.2 BiocStyle_2.39.0
#>
#> loaded via a namespace (and not attached):
#> [1] httr_1.4.8 cli_3.6.6 knitr_1.51
#> [4] rlang_1.2.0 xfun_0.57 otel_0.2.0
#> [7] UCSC.utils_1.7.1 jsonlite_2.0.0 htmltools_0.5.9
#> [10] sass_0.4.10 rmarkdown_2.31 evaluate_1.0.5
#> [13] jquerylib_0.1.4 fastmap_1.2.0 GenomeInfoDb_1.47.2
#> [16] yaml_2.3.12 lifecycle_1.0.5 bookdown_0.46
#> [19] BiocManager_1.30.27 compiler_4.6.0 Rcpp_1.1.1-1
#> [22] XVector_0.51.0 digest_0.6.39 R6_2.6.1
#> [25] bslib_0.10.0 tools_4.6.0 cachem_1.1.0