fraq: A high-throughput extensible toolkit for processing fastq data

0.1 Why use fraq?
0.2 Quick start
0.3 Pre-built functions
0.4 Walkthroughs with synthetic data
0.5 Supported formats
0.6 Extension system: flow graph overview
0.7 Extension system: writing a custom kernel in R
0.8 Extension system: writing a custom kernel with Rcpp
- 0.8.1 Some important tips when building custom kernels
- 0.8.2 Streaming with named pipes
0.9 Tuning and threading
0.10 FRAQ file format
0.11 Session information

fraq is a high-throughput extensible toolkit for processing fastq data. The goal of this package is to empower users to quickly build out programmatic ‘kernels’ to define any FASTQ processing task they may need. fraq then takes those kernels and handles I/O, compression and multithreading. It builds on Intel TBB’s flow graph to orchestrate concurrency and data processing; throughput can be as fast as compression and disk speed allow.

The package ships with a suite of predefined ‘kernels’ for common FASTQ tasks, detailed in this vignette.

0.1 Why use fraq?

Concurrent I/O and data processing. As fast as compression and disk speed allow
Extensibility. You can drop in your own kernels for almost any fastq processing task (see Extension system sections below)
Support for gzip, zstd and pipes. See Supported formats section

0.2 Quick start

Install

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("fraq")

0.3 Pre-built functions

fraq ships with a collection of ready-to-go kernels that cover common preprocessing steps:

fraq_downsample() - deterministically retain a target fraction of reads (paired-end aware)
fraq_convert() - re-encode reads between any supported formats
fraq_concat() - glue multiple inputs together
fraq_chunk() - split streams into chunk-numbered files with a suffix-driven format
fraq_slice() - keep the first limit reads or select specific read indices
fraq_count_barcodes() - tally barcode hits with optional reverse-complement handling
fraq_demux() - shard reads into files derived from a {barcode} placeholder
fraq_quality_filter() - drop read groups that fail read quality thresholds
fraq_merge_pairs() - overlap paired reads to create consensus single-end reads while recording merge stats
fraq_trim_adapters() - trim adapters from the first mate (and optionally drop untrimmed records)
fraq_summary() - compute per-base stats, histograms, and insert-size estimates

0.4 Walkthroughs with synthetic data

For illustration purposes, we create a small synthetic dataset.

set.seed(314156)
example_dir <- file.path(tempdir(), "fraq_function_examples")
dir.create(example_dir, showWarnings = FALSE, recursive = TRUE)
R1 <- file.path(example_dir, "example_R1.fastq")
R2 <- file.path(example_dir, "example_R2.fastq")
generate_random_fastq(c(R1, R2), n_reads = 2000, read_length = 150)
example_reads <- c(R1, R2)

0.4.1 Summarization

fraq_summary() rolls up QC tables for the R1/R2 fastq pairs.

library(fraq)
# summarize quality metrics
qc <- fraq_summary(c(R1, R2))

Figure 1: Quick QC overview from fraq_summary()

0.4.2 Filtering

Various filtering operations are illustrated below. Here we downsample to reduce dataset size and then discard mates whose mean PHRED drops below 22 or accumulate more than 6 bases with PHRED < 18 (roughly 10% of reads will be filtered).

filter_dir <- file.path(example_dir, "filtering")
dir.create(filter_dir, showWarnings = FALSE)

downsampled_reads <- file.path(
    filter_dir,
    c("example_R1_ds.fastq.gz", "example_R2_ds.fastq.gz")
)
fraq_downsample(example_reads, downsampled_reads, amount = 0.50, nthreads = 2L)

quality_reads <- file.path(
    filter_dir,
    c("example_R1_q20.fastq.gz", "example_R2_q20.fastq.gz")
)
fraq_quality_filter(
    input = downsampled_reads,
    output = quality_reads,
    min_mean_quality = 22,
    max_low_q_bases = 6L,
    low_q_threshold = 18L,
    nthreads = 2L
)

filtered_stats <- fraq_summary(quality_reads, nthreads = 2L)$basic_stats_R1
filtered_stats

##   total_sequences total_bases seq_len_min seq_len_mean seq_len_max gc_percent
## 1             970      145500         150          150         150   50.00619

0.4.3 Splitting

Splitting workflows either direct reads into barcode-specific files, chunk long runs into bite-sized batches, or simply inspect barcode usage.

split_dir <- file.path(example_dir, "splitting")
if (dir.exists(split_dir)) unlink(split_dir, recursive = TRUE)
dir.create(split_dir, showWarnings = FALSE)
barcode_set <- c("ACGTAA", "TTGGCC")
demux_patterns <- file.path(
    split_dir,
    c("R1_{barcode}.fastq.gz", "R2_{barcode}.fastq.gz")
)

Barcode-guided demultiplexing

Demux looks for barcode/adapter/primer sequence at the start of the first given fastq file.

fraq_demux(
    input = example_reads,
    output_format = demux_patterns,
    barcodes = barcode_set,
    max_distance = 1L,
    nthreads = 2L
)
basename(sort(
    list.files(split_dir, pattern = "R1_.*\\.fastq.gz$", full.names = TRUE)
))

## [1] "R1_ACGTAA.fastq.gz"   "R1_NO_MATCH.fastq.gz" "R1_TTGGCC.fastq.gz"

Fixed-size chunking

fraq_chunk splits reads into fixed-size batches with incremental file names.

chunk_prefix <- file.path(split_dir, "chunk_demo")
fraq_chunk(
    input = example_reads[1],
    output_prefix = chunk_prefix,
    output_suffix = "gz",
    chunk_size = 200,
    nthreads = 2L
)
basename(sort(
    list.files(
        split_dir,
        pattern = "chunk_demo_chunk.+\\.fastq.gz$",
        full.names = TRUE
    )
))

##  [1] "chunk_demo_chunk0.fastq.gz" "chunk_demo_chunk1.fastq.gz"
##  [3] "chunk_demo_chunk2.fastq.gz" "chunk_demo_chunk3.fastq.gz"
##  [5] "chunk_demo_chunk4.fastq.gz" "chunk_demo_chunk5.fastq.gz"
##  [7] "chunk_demo_chunk6.fastq.gz" "chunk_demo_chunk7.fastq.gz"
##  [9] "chunk_demo_chunk8.fastq.gz" "chunk_demo_chunk9.fastq.gz"

Barcode counting

Barcode counting looks for sequence substrings anywhere in the fastq reads and outputs a data frame of counts.

fraq_count_barcodes(
    input = example_reads,
    barcodes = barcode_set,
    max_distance = 0L,
    allow_revcomp = FALSE,
    nthreads = 2L
)

##       barcode count
## 1    NO_MATCH  1734
## 2      TTGGCC   132
## 3      ACGTAA   126
## 4 MULTI_MATCH     8

0.4.4 Modification

Format conversion, adapter trimming, consensus merging, etc.

mod_dir <- file.path(example_dir, "modification")
if (dir.exists(mod_dir)) unlink(mod_dir, recursive = TRUE)
dir.create(mod_dir, showWarnings = FALSE)

Convert between formats

Re-wrap the paired-end files in Zstandard-compressed FASTQ.

converted_fastq <- file.path(
    mod_dir,
    c("example_R1.fastq.zst", "example_R2.fastq.zst")
)
fraq_convert(example_reads, converted_fastq, nthreads = 2L)

Concatenate files

Combine the converted shards into a single gzip-compressed stream.

concatenated <- file.path(mod_dir, "example_all.fastq.gz")
fraq_concat(converted_fastq, concatenated, nthreads = 2L)

Merge overlapping pairs

Generate consensus single-end reads while keeping optional unmerged outputs.

merged_reads <- file.path(mod_dir, "example_merged.fastq")
unmerged_reads <- file.path(
    mod_dir,
    c("example_unmerged_R1.fastq", "example_unmerged_R2.fastq")
)
fraq_merge_pairs(
    input = example_reads,
    output_merged = merged_reads,
    output_unmerged = unmerged_reads,
    min_overlap = 20L,
    max_mismatch_rate = 0.05,
    nthreads = 2L
)

## $merged_reads
## [1] 648
## 
## $unmerged_reads
## [1] 1352
## 
## $mean_insert_size
## [1] 267.7454
## 
## $sd_insert_size
## [1] 10.25771
## 
## $mean_overlap
## [1] 32.25463
## 
## $mean_mismatch_rate
## [1] 0

Trim adapters

Remove adapter prefixes from R1 (and optionally drop untrimmed reads).

trimmed_reads <- file.path(
    mod_dir,
    c("example_trimmed_R1.fastq", "example_trimmed_R2.fastq")
)
fraq_trim_adapters(
    input = example_reads,
    output = trimmed_reads,
    adapters = "ACTAC",
    max_distance = 1L,
    filter_untrimmed = FALSE,
    nthreads = 2L
)

##      adapter count
## 1 NO_ADAPTER  1973
## 2      ACTAC    27

0.5 Supported formats

fraq chooses formats from file names, so the extension you supply controls how data is decoded and encoded.

.fastq, .fq - Plain FASTQ (read/write). Fastest path but no compression.
.fastq.gz - Gzip-compressed FASTQ (read/write). Uses zlib; tune via fraq_options("gzip_compress_level").
.fastq.zst - Zstd-compressed FASTQ (read/write). Uses bundled zstd; tune via fraq_options("zstd_compress_level").
.fraq - Chunked binary container (read/write). Designed for multithreaded IO and compression; see FRAQ file format for layout details. FRAQ is a binary, block-compressed format layered on top of bundled zstd so it is both more storage-efficient and faster to stream than textual FASTQ.
.mem - In-memory .fraq (read/write). Lifetime is limited to the current R session. .mem values are exact in-memory keys, not filesystem paths, so they are not normalized or expanded.
.fifo - POSIX named pipes on Linux/macOS (read/write). Useful for streaming data between CLI programs.

0.6 Extension system: flow graph overview

The fraq_run() pipeline wires a TBB flow graph so that IO and data flow happen concurrently with data processing. A high-level view looks like this:

Each block of fastq is read from Primary Reader (for the first mate R1) and Secondary Readers (for R2 or additional fastq files). The key node in this graph is the Process Kernel. It takes fastq records (i.e. R1 and R2 reads), processes or alters them, decides whether to keep them or not (filtering) and outputs any number of fastq records (demux and splitting). This simple pattern naturally supports lots of different fastq processing operations and can be customized.

For R kernels, the Process Kernel step uses a different execution path, described below.

0.7 Extension system: writing a custom kernel in R

You can build a custom kernel with fraq_run_r(). The R function runs on the main R thread rather than inside a TBB graph node, so it is safe for it to work with R objects and call back into R. When nthreads > 1, IO happens on background threads while the R kernel stays on the main R thread. When nthreads = 1, or when fraq detects that it is running inside a forked R process, fraq uses a serial path that does not construct a TBB graph.

Blocks are delivered to the R kernel in increasing block-index order, and the index vector within each call is increasing.

An R kernel is called as kernel(reads, index):

The input and output data are normal R objects (data.frames and lists)
reads is a named list of data frames. Single-end input has reads$read1; paired-end input has reads$read1 and reads$read2. Each data frame has character columns name, seq, and qual.
index is a numeric vector of zero-based read indices for the rows in each data frame. It is useful for deterministic filtering, splitting, or joining back to external metadata.
Return NULL to drop the whole block.
Return a named list of data frames to write reads. Each list name is an output path or .mem key, and each data frame must contain name, seq, and qual columns. Output paths returned by an R kernel are normalized by the R wrapper; .mem keys are used exactly as returned.

Vectorized R kernels can still perform well on large FASTQ datasets because they operate on full blocks rather than one read at a time.

input_paths <- c("input_R1.fastq.gz", "input_R2.fastq.gz")
output_paths <- c("even_R1.fastq.gz", "even_R2.fastq.gz")

even_read_kernel <- function(reads, index) {
    keep <- index %% 2 == 0
    filtered_read1 <- reads$read1[keep, , drop = FALSE]
    filtered_read2 <- reads$read2[keep, , drop = FALSE]

    output <- list()
    output[[output_paths[1]]] <- filtered_read1
    output[[output_paths[2]]] <- filtered_read2
    output
}

fraq_run_r(
    input_paths,
    even_read_kernel,
    nthreads = 2L
)

Do not use parallel::mclapply() inside a fraq_run_r() kernel. On Unix-like systems it forks the R process, and forking while fraq has active background IO and compression threads can deadlock or crash.

0.8 Extension system: writing a custom kernel with Rcpp

If the prebuilt kernels are insufficient, you can write your own via an Rcpp script. You supply a lambda to fraq::run, and the runtime handles all batching, IO, and parallelism. More information can be found in ?fraq_rcpp_template.

Below is an example that keeps only reads whose GC fraction falls in a window (default 35-65%).

// [[Rcpp::depends(fraq)]]
#include <Rcpp.h>
#include <fraq.h>

double calc_gc_content(const std::string &s) {
    double gc = 0.0;
    for(char c : s) { if(c == 'G' || c == 'C') gc += 1.0 ; }
    return gc / (double) s.size();
}

// [[Rcpp::export(rng=false)]]
void fraq_gc_filter(std::vector<std::string> input,
                    std::vector<std::string> output,
                    double gc_min = 0.35, double gc_max = 0.65) {
  
    auto gc_filter_kernel = [&](fraq::input_t reads, size_t read_index)
        -> fraq::output_t {
    for(auto & read : reads) {
        double gc = calc_gc_content(read.seq);
        if(gc < gc_min || gc > gc_max) return {};
    }
    return fraq::zip(output, std::move(reads));
    };
  
    fraq::FraqRunConfig cfg;
    cfg.zstd_compress_level = 5;
    int nthreads = 4;
    fraq::run(input, gc_filter_kernel, nthreads, cfg);
}

All fraq classes are transparent structures with no private members, built on standard library types.

A Read is a struct with three strings: name, seq, and qual containing read info for a single mate
input_t is a vector of Read’s representing a single fastq record (i.e. R1 and R2 reads)
output_t is a std::vector of output paths / Reads

Compile via Rcpp::sourceCpp() then in R you can call your custom kernel as a normal function. The extensions on the output paths decide output format automatically.

input  <- c("sample_R1.fastq.gz", "sample_R2.fastq.gz")
output <- c("filtered_R1.fastq.zst", "filtered_R2.fastq.zst")
fraq_gc_filter(input, output, gc_min = 0.30, gc_max = 0.70)

0.8.1 Some important tips when building custom kernels

You are still responsible for writing safe C++
Avoid interacting with the R API or relying on Rcpp classes from a fraq::run() C++ kernel. If the kernel needs to call R code or work with R objects, use fraq_run_r() so R work stays on the main thread.

0.8.2 Streaming with named pipes

You can use named pipes (Linux/Mac only - Windows is not supported) to stream input and output directly into other command line programs.

Below is an example using fraq_downsample on input fastqs (random fastqs in this example) and streaming the output directly to bwa-mem2.

downsample_fifo.R

library(fraq)
generate_random_fastq("R1.fastq")
generate_random_fastq("R2.fastq")
fraq_downsample(input=c("R1.fastq", "R2.fastq"),
                output=c("ds_R1.fastq.fifo", "ds_R2.fastq.fifo"),
                amount = 0.25, nthreads = 5L)

In bash (Linux/macOS), create the named pipes first before any operations:

HG38_REF=/path/to/hg38.fa.gz
mkfifo ds_R1.fastq.fifo ds_R2.fastq.fifo
Rscript downsample_fifo.R &
bwa-mem2 mem -t 8 $HG38_REF ds_R1.fastq.fifo ds_R2.fastq.fifo > output.sam

Windows users should stick with regular files or platform-specific piping; .fifo paths are not available there.

0.9 Tuning and threading

Global knobs live behind fraq_options():

# Inspect the current block size.
fraq_options("blocksize")

# Shrink batches when running small tests.
fraq_options("blocksize", 16384L)

# Tune compression levels for new outputs.
fraq_options("zstd_compress_level", 6L)
fraq_options("gzip_compress_level", 4L)

Most kernels accept nthreads. With nthreads = 1, fraq uses a serial path. With nthreads > 1, fraq caps the TBB scheduler to the requested parallelism. If fraq detects that it is running inside a forked R process, it forces nthreads = 1 to avoid using TBB after fork().

0.10 FRAQ file format

The .fraq container stores FASTQ reads in independent blocks so that IO and block-level compression can run concurrently. Each block holds up to 65,535 reads (fraq_options("blocksize")) and stores block-level info such as zstd compression, name-prefix factoring, and optional nucleotide bit-packing.

FRAQ takes inspiration from the Nucleotide Archival Format (NAF) by concatenating nucleotide and quality payloads before compressing them with zstd (improving compression efficiency), following the approach described by Kryukov et al. (2019). FRAQ differs by block compressing the stream, which enables multithreaded compression, streaming and tailoring the layout specifically to the FASTQ format instead of being more general.

Specification

File header (16 bytes). Bytes 1-4 are the ASCII magic FRAQ. Bytes 5-14 are reserved (currently zero). Byte 15 records the writer’s endianness (mixed endianness is rejected in v1), and byte 16 is the format version (currently 0x01).
Block header. Every block begins with a 32-bit metadata word that packs 11 two-bit codes plus a single bit for use_bit_pack. The codes describe the byte-width (1/2/4/8) used for each scalar that follows so the block stays compact regardless of read count or payload size.
Scalar section. After the metadata word come nine unsigned integers whose widths are dictated by the codes: num_reads, uncompressed_names_size, uncompressed_seqs_size, name_prefix_size, and the byte sizes of the five compressed buffers (compressed_name_lengths, compressed_names, compressed_seq_lengths, compressed_seqs, compressed_quals).
Payload order. The payload immediately follows the scalars:
1. name_prefix - a raw string that all reads share (for example the sample identifier preceding /1 or /2).
2. compressed_name_lengths - zstd-compressed array of per-read tail lengths.
3. compressed_names - concatenated name tails compressed with zstd.
4. compressed_seq_lengths - zstd-compressed per-read sequence lengths.
5. compressed_seqs - either raw bases or 4-bit packed codes (A/C/G/T/R/Y/S/W/K/M/B/D/H/V/N/U) compressed with zstd.
6. compressed_quals - zstd-compressed ASCII Phred strings; qualities are never bit-packed.
Decoding. Readers expand the zstd streams, rebuild each read by combining name_prefix plus the stored tail, and emit the number of sequences indicated by num_reads.

Reference: Kryukov, Kirill, et al. “Nucleotide Archival Format (NAF) enables efficient lossless reference-free compression of DNA sequences.” Bioinformatics 35.19 (2019): 3826-3828.

0.11 Session information

sessionInfo()

## R version 4.6.0 RC (2026-04-17 r89917)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.24-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] fraq_1.1.1       BiocStyle_2.41.0
## 
## loaded via a namespace (and not attached):
##  [1] crayon_1.5.3          cli_3.6.6             knitr_1.51           
##  [4] magick_2.9.1          rlang_1.2.0           xfun_0.57            
##  [7] otel_0.2.0            generics_0.1.4        jsonlite_2.0.0       
## [10] RcppParallel_5.1.11-2 S4Vectors_0.51.1      Biostrings_2.81.1    
## [13] htmltools_0.5.9       tinytex_0.59          stringfish_0.19.0    
## [16] stats4_4.6.0          sass_0.4.10           rmarkdown_2.31       
## [19] Seqinfo_1.3.0         evaluate_1.0.5        jquerylib_0.1.4      
## [22] fastmap_1.2.0         IRanges_2.47.0        yaml_2.3.12          
## [25] lifecycle_1.0.5       bookdown_0.46         BiocManager_1.30.27  
## [28] compiler_4.6.0        Rcpp_1.1.1-1.1        XVector_0.53.0       
## [31] edlibR_1.0.3          digest_0.6.39         R6_2.6.1             
## [34] magrittr_2.0.5        bslib_0.10.0          tools_4.6.0          
## [37] BiocGenerics_0.59.0   cachem_1.1.0