This vignette demonstrates the data import, preprocessing, and modelling workflow for NMR-based metabolomics with metabom8.
library(metabom8)
In routine high-throughput NMR workflows, raw time-domain data (the free-induction decay) are typically processed directly on the spectrometer using the vendor software TopSpin. Downstream metabolomics analysis therefore commonly starts from frequency-domain spectra. Consequently, this vignette focuses on importing spectra generated by TopSpin.
For completeness and teaching purposes, the section (Starting point: raw FID) briefly illustrates importing and transforming time-domain data to obtain spectra.
The function read1d_proc() imports TopSpin-processed 1D NMR
spectra from the pdata subdirectory of Bruker experiment folders
(typically pdata/1). It reads the processed absorption-mode spectrum
(e.g. 1r) together with associated acquisition (acqus) and processing
(procs) parameters.
At minimum, two arguments are required:

- path: path to the directory that encloses the NMR experiments
- exp_type: a list of spectrometer parameters used to filter the desired experiment types based on acquisition and processing metadata

Further details on supported filtering options are provided in ?read1d_proc.
## Example: import 1D spectra
exp_dir <- system.file("extdata", package = "metabom8")
exp_type <- list(pulprog="noesygppr1d")
nmr_data <- read1d_proc(exp_dir, exp_type, n_max = 100)
#> Imported 2 spectra.
names(nmr_data)
#> [1] "X" "ppm" "meta"
Both import functions (read1d_proc() and read1d_raw()) return a named
list containing three core components used throughout this vignette:
the spectral matrix (X), the chemical-shift axis (ppm), and associated
metadata (meta).
For interactive exploration, setting to_global = TRUE assigns the
components X, ppm, and meta directly to the global environment
(.GlobalEnv). Existing objects with these names will be overwritten.
The row names of X and meta correspond to experiment directories and can be
used to join external sample annotations. The meta data frame
contains full TopSpin acquisition and processing parameters for each
spectrum.
Example acquisition and processing parameters as they appear in meta:
| Prefix | Meaning | Examples |
|---|---|---|
| a_ | Acquisition parameters | a_PULPROG, a_SFO1, a_RG, a_SW_h, a_TE |
| p_ | Processing parameters | p_WDW, p_LB, p_SI, p_PHC, p_BC, p_NC_proc |
See ?read1d_proc() for further details.
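Because the row names of X and meta identify the experiment directories, external annotations can be joined with base R. A minimal sketch; the annotation table sample_info and the directory names are made up for illustration:

```r
# toy metadata mimicking the structure returned by read1d_proc()
meta <- data.frame(a_PULPROG = c("noesygppr1d", "noesygppr1d"),
                   row.names = c("exp_10", "exp_20"))

# hypothetical external sample annotations keyed by experiment directory
sample_info <- data.frame(exp_id = c("exp_10", "exp_20"),
                          group  = c("control", "case"))

# join on the row names of meta; merge() exposes them as column "Row.names"
meta_annot <- merge(meta, sample_info, by.x = "row.names", by.y = "exp_id")
meta_annot
```

The merged data frame can then be used to colour or group spectra in downstream plots and models.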
For completeness, metabom8 also provides functionality to import and
process raw FIDs. This allows users to reproduce basic spectral processing
steps within R or to explore how different processing parameters influence
the resulting spectra.
Processing an FID converts the time-domain signal into a frequency-domain spectrum. Typical processing steps include digital-filter correction (group delay), apodisation (windowing), zero-filling, Fourier transformation, and phase correction.
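These steps can be illustrated on a synthetic FID with base R alone (a conceptual sketch, not the package implementation; group-delay and phase correction are omitted because the toy signal needs neither):

```r
# synthetic FID: a single exponentially decaying oscillation at ~50 Hz
n   <- 512
t_s <- seq(0, 1, length.out = n)
fid <- exp(-4 * t_s) * cos(2 * pi * 50 * t_s)

# 1) apodisation: exponential window (line broadening)
lb       <- 1
fid_apod <- fid * exp(-lb * t_s)

# 2) zero-filling: append zeros to double the number of points
fid_zf <- c(fid_apod, rep(0, n))

# 3) Fourier transform; the real part approximates the absorption-mode spectrum
spec <- Re(fft(fid_zf))[1:n]

# the absorption peak appears near the oscillation frequency
which.max(spec)
```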
The function read1d_raw() performs these operations:
# import 1D NMR data
nmr_raw <- read1d_raw(
exp_dir,
exp_type,
apodisation = list(fun = "exponential", lb = 0.2),
zerofil = 1L,
mode = "absorption"
)
#> Processing 2 experiments.
names(nmr_raw)
#> [1] "X" "ppm" "meta"
The following examples illustrate several commonly used apodisation
profiles implemented internally in metabom8.
# selected FID apodisation functions
f_apod <- c("sine","cosine","exponential","sem")
# create toy fid
n <- 200; t <- seq(0,1,len=n)
fid <- exp(-5*t)*cos(20*pi*t)
# apply apodisation
A <- sapply(f_apod, \(f) metabom8:::.fidApodisationFct(n, list(fun=f, lb=-0.2)))
Fp <- sweep(A, 1, fid, `*`)
S <- apply(Fp, 2, \(x) Mod(fft(x))[1:(n/2)])
# compare graphically
cols <- 1:ncol(A)
par(mfrow=c(2,2), mar=c(3,3,2,1))
plot(t, fid, type="l", lwd=2, main="FID")
matplot(t, A, type="l", lwd=2, col=cols, main="Apodisation windows")
matplot(t, Fp, type="l", lwd=2, col=cols, main="Windowed FID")
matplot(S, type="l", lwd=2, col=cols, main="Spectra")
legend("topright", f_apod, col=cols, lty=1, bty="n", cex=.8)
Spectra can be visualised with the function plot_spec(). The minimum set of arguments includes the spectral data X (a matrix or array) and the chemical-shift vector ppm.
Three plotting backends are available: "plotly" (default), "base", and "ggplot2". The "plotly" backend uses WebGL, enabling smooth interactive graphics even when the number of spectra is large. The "ggplot2" backend is not interactive and renders more slowly. However, the returned object integrates with the extensive "ggplot2" ecosystem and can be further customised to generate publication-quality figures.
See ?plot_spec for additional details.
data("covid", package = "metabom8")
## plot directly from the dataset object
plot_spec(covid, shift = c(0.5, 2))
## alternatively extract data components: X, ppm
X <- covid$X
ppm <- covid$ppm
dim(X)
#> [1] 10 27819
head(ppm)
#> [1] 4.549818 4.549513 4.549207 4.548902 4.548596 4.548291
plot_spec(X[1:3, ], ppm, shift = c(0.5, 2)) # backend='plotly' by default
plot_spec(X[1:3, ], ppm, shift = c(0.5, 2), backend='base')
plot_spec(X[1:3, ], ppm, shift = c(0.5, 2), backend='ggplot2')
metabom8 provides a modular preprocessing API for 1D NMR spectra. Each
preprocessing function is composable and designed to support both
interactive exploration and reproducible pipelines.
For a list of preprocessing functions call list_preprocessing() from
the R console.
When operating on a metabom8-style dataset object, preprocessing functions return
the same structured object: a named list containing
the elements X, ppm, and meta. This allows steps to be chained using the
base R pipe operator (|>):
data("hiit_raw", package = "metabom8") # urine NMR (HIIT exercise experiment)
names(hiit_raw) # X, ppm, meta
#> [1] "X" "ppm" "meta"
# piped preprocessing
hiit_proc <- hiit_raw |>
calibrate(type = "tsp") |>
excise() |>
correct_baseline(method='asls') |>
align_spectra() |>
pqn()
dim(hiit_proc$X)
#> [1] 3 54009
This API design enables clear, declarative workflows suitable for production analyses.
plot_spec(hiit_raw, shift = c(4,4.1))
plot_spec(hiit_proc, shift = c(4,4.1))
Each operation can also be applied independently to a spectral matrix and ppm vector. This explicit alternative is useful for pipeline development, parameter tuning and visual inspection.
X <- hiit_raw$X
ppm <- hiit_raw$ppm
# perform TSP calibration
X_cal <- calibrate(X, ppm, type='tsp')
# excise chemical shift regions
regions <- list(
upfield_noise = c(min(ppm), 0.25),
residual_water = c(4.5, 5.2),
downfield_noise = c(9.7, max(ppm))
)
X_exi <- excise(X_cal, ppm, regions)
ppm_exi <- attr(X_exi, 'm8_axis')$ppm
# baseline correction
X_bli <- correct_baseline(X_exi)
# plot spectrum
plot_spec(X_bli[1,], ppm_exi, shift = c(0,10))
Data provenance refers to the documented record of how data were generated, processed, and transformed. Recording this information ensures that analyses remain reproducible and that individual preprocessing steps can be inspected, verified, or repeated at a later stage.
In metabom8, particular emphasis is placed on transparent preprocessing, as
these decisions can substantially influence downstream statistical analyses and
biological interpretation. Each preprocessing operation automatically records
its parameters and execution context as structured metadata attached to the
spectral matrix X.
The resulting dataset therefore remains self-describing: the complete processing history is linked to the data matrix and can be inspected programmatically at any time.
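Conceptually, this can be mimicked in base R by attaching a growing log as an attribute on the data object. The attribute name "prov_log" below is illustrative, not metabom8's internal naming:

```r
# helper that appends a processing-step record to an attribute on the data
log_step <- function(x, step, params) {
  entry <- list(step = step, params = params, time = Sys.time())
  attr(x, "prov_log") <- c(attr(x, "prov_log"), list(entry))
  x
}

x <- matrix(rnorm(20), nrow = 2)
x <- log_step(x, "calibrate", list(type = "tsp"))
x <- log_step(x, "baseline",  list(method = "asls"))

# the matrix now carries its own processing history
sapply(attr(x, "prov_log"), `[[`, "step")
#> [1] "calibrate" "baseline"
```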
## Inspect preprocessing steps
print_provenance(hiit_proc)
#> metabom8 processing pipeline:
#> ==============================
#> [1] calibrate
#> params:
#> target :
#> mode : singlet
#> window : [-0.2, 0.2]
#> centre : 0
#> notes: Chemical shift calibration applied.
#>
#> [2] excise1d
#> params:
#> regions :
#> upfield_noise : [-5.21794, 0.25]
#> water : [4.5, 5.2]
#> urea : [5.5, 6]
#> downfield_noise : [9.7, 14.8037]
#> notes: Specified chemical shift regions removed using get_idx().
#>
#> [3] baseline
#> params:
#> method : asls
#> args :
#> notes: Baseline correction applied.
#>
#> [4] align spectra
#> params:
#> half_win_ppm : 0.007
#> n_intervals : 86
#> lag.max : 200
#> intervals_ppm : <86 items> [[9.68427, 9.66121], [9.46736, 9.44216], [9.29139, 9.27764], ...]
#> notes: Median-derived chemical shift intervals aligned independently using cross-correlation.
#>
#> [5] pqn
#> params:
#> ref_index :
#> total_area : FALSE
#> bin :
#> notes: Probabilistic quotient normalisation applied.
# Retrieve specific step
get_provenance(hiit_proc, step = 2)
#> $step
#> [1] "excise1d"
#>
#> $params
#> $params$regions
#> $params$regions$upfield_noise
#> [1] -5.217943 0.250000
#>
#> $params$regions$water
#> [1] 4.5 5.2
#>
#> $params$regions$urea
#> [1] 5.5 6.0
#>
#> $params$regions$downfield_noise
#> [1] 9.70000 14.80367
#>
#>
#>
#> $notes
#> [1] "Specified chemical shift regions removed using get_idx()."
#>
#> $time
#> [1] "2026-04-24 18:44:26.005354"
#>
#> $pkg
#> [1] "metabom8"
#>
#> $pkg_version
#> [1] "0.99.10"
## Retrieve the full provenance log
log <- get_provenance(hiit_proc)
# print(log) # very detailed
Each entry stores the step name, its parameter values, explanatory notes, a timestamp, and the package name and version.
Individual parameter values can be retrieved as well, allowing the full transformation history of a dataset to be inspected programmatically.
# Retrieve parameter from a named step
get_provenance(hiit_proc, step = "calibrate", param = "target")
#> $mode
#> [1] "singlet"
#>
#> $window
#> [1] -0.2 0.2
#>
#> $centre
#> [1] 0
Custom notes can also be included in the provenance log. This is particularly useful when data are processed in an automated fashion, for example when using workflow managers:
# collect environment variables
params <- list(
agent = paste0("snakemake/", Sys.getenv("SNAKEMAKE_VERSION", "v?")),
runtime = "docker",
workflow = Sys.getenv("M8_WORKFLOW", "std_prof-urine"),
run_id = Sys.getenv("M8_RUN_ID", "m8-2605-001")
)
# create note with env parameters
hiit_proc <- add_note(hiit_proc,
'Excluding outlier sample XX - reason is ...',
params)
# inspect the recorded provenance entry
get_provenance(hiit_proc, step = "user note")
#> $step
#> [1] "user note"
#>
#> $params
#> $params$agent
#> [1] "snakemake/v?"
#>
#> $params$runtime
#> [1] "docker"
#>
#> $params$workflow
#> [1] "std_prof-urine"
#>
#> $params$run_id
#> [1] "m8-2605-001"
#>
#>
#> $notes
#> [1] "Excluding outlier sample XX - reason is ..."
#>
#> $time
#> [1] "2026-04-24 18:44:37.532632"
#>
#> $pkg
#> [1] "metabom8"
#>
#> $pkg_version
#> [1] "0.99.10"
Note on provenance
metabom8 records preprocessing history as structured attributes on standard R matrices or named lists. This design is lightweight and prioritises interoperability and flexibility over strict container classes (e.g., SummarizedExperiment). Provenance is preserved when objects are saved and loaded via standard R serialisation (save()/saveRDS(), load()), but may be lost if the object is modified by functions outside the metabom8 API that drop or overwrite attributes (for example when objects are coerced, reconstructed, or exported to external formats).
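This general attribute behaviour can be demonstrated with base R alone (independent of metabom8):

```r
m <- matrix(1:4, nrow = 2)
attr(m, "history") <- list(step = "calibrate")

# attributes survive standard R serialisation
f <- tempfile(fileext = ".rds")
saveRDS(m, f)
m2 <- readRDS(f)
identical(attr(m2, "history"), attr(m, "history"))
#> [1] TRUE

# but coercion to another container silently drops them
m3 <- as.data.frame(m)
attr(m3, "history")
#> NULL
```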
This section demonstrates data modelling using PCA (unsupervised) and PLS/OPLS (supervised).
All modelling calls return an instance of class m8_model, which can be used
for further processing and diagnostics.
Y <- covid$an$type
X <- covid$X
To prevent the natural dynamic range of metabolite concentrations from dominating statistical modelling, data are scaled prior to analysis. In metabom8, scaling and centering are passed explicitly to the model function:
# Define the scaling and centering strategy
uv <- uv_scaling(center = TRUE)
See ?uv_scaling for further details and additional scaling strategies.
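Unit-variance (UV) scaling mean-centres each variable and divides it by its standard deviation, so every variable contributes comparably to the model regardless of its original intensity. A base-R equivalent of the operation (not the package internals):

```r
set.seed(11)
# toy data: five variables with very different dynamic ranges
X <- matrix(rnorm(50, sd = rep(c(1, 10, 100, 5, 2), each = 10)), nrow = 10)

# centre each column, then divide by its standard deviation
X_uv <- scale(X, center = TRUE, scale = TRUE)

# after UV scaling every variable has mean 0 and unit variance
round(colMeans(X_uv), 12)
apply(X_uv, 2, sd)
```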
Unsupervised modelling with PCA uses an iterative procedure (NIPALS) that extracts components one at a time. Therefore, the number of desired components is provided in the function call.
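The core NIPALS iteration for a single component can be sketched in a few lines of base R (an illustrative sketch, not the metabom8 implementation; subsequent components would require deflating X first):

```r
# one-component NIPALS: alternate between loadings and scores until stable
nipals_pc1 <- function(X, tol = 1e-12, max_iter = 1000) {
  Xc <- scale(X, center = TRUE, scale = FALSE)
  t_scores <- Xc[, which.max(apply(Xc, 2, var))]   # initialise with highest-variance column
  for (i in seq_len(max_iter)) {
    p <- crossprod(Xc, t_scores) / sum(t_scores^2) # loadings: p = X't / (t't)
    p <- p / sqrt(sum(p^2))                        # normalise loadings to unit length
    t_new <- Xc %*% p                              # updated scores: t = Xp
    if (sum((t_new - t_scores)^2) < tol) break
    t_scores <- t_new
  }
  list(scores = drop(t_new), loadings = drop(p))
}

set.seed(1)
X  <- matrix(rnorm(60), nrow = 12)
pc <- nipals_pc1(X)

# agrees with prcomp() up to an arbitrary sign flip
max(abs(abs(pc$scores) - abs(prcomp(X)$x[, 1])))
```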
Here we compute two principal components using unit-variance scaling:
pca_model <- pca(X, scaling = uv, ncomp=2)
# show / summary methods
show(pca_model)
#>
#> m8_model <pca>
#> ----------------------------------------
#> Dimensions : 10 samples x 27819 variables
#> Mode : unsupervised
#> Preprocess : center | UV
#> Components : 2
#> ----------------------------------------
#> Use summary() for performance metrics.
summary(pca_model)
#> $perf
#> comp R2X R2X_cum selected fit
#> 1 1 0.3023058 0.3023058 TRUE fitted
#> 2 2 0.1898503 0.4921561 TRUE fitted
#>
#> $engine
#> [1] "pca (nipals)"
#>
#> $y_type
#> [1] "unsupervised"
Partial Least Squares (PLS), as a supervised modelling technique, utilises statistical re-sampling to determine the optimal number of components and to safeguard against overfitting.
In this vignette, we employ balanced Monte Carlo cross-validation (balanced_mc).
This procedure repeatedly splits the data into training and test sets to ensure that the
model's predictive power generalises well to unseen data.
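The idea behind balanced Monte Carlo splitting can be sketched in base R: each of k rounds draws the same fraction of samples from every class into the training set (an illustrative sketch with toy labels, not the package code):

```r
set.seed(7)
y <- rep(c("case", "control"), times = c(6, 4))  # toy class labels

# draw k balanced train/test splits, sampling frac from each class
balanced_splits <- function(y, k = 15, frac = 2/3) {
  lapply(seq_len(k), function(i) {
    train <- unlist(lapply(split(seq_along(y), y),
                           function(idx) sample(idx, round(length(idx) * frac))))
    list(train = sort(train), test = setdiff(seq_along(y), train))
  })
}

cv <- balanced_splits(y, k = 15)

# each round preserves the class proportions in the training set
table(y[cv[[1]]$train])
```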
mc_cv <- balanced_mc(k=15, split=2/3, type="DA")
See ?balanced_mc for further details and additional statistical validation strategies.
pls_model <- pls(X, Y, validation_strategy=mc_cv, scaling = uv)
#> Performing discriminant analysis.
#> A PLS-DA model with 1 component was fitted.
# model information
show(pls_model)
#>
#> m8_model <pls>
#> ----------------------------------------
#> Dimensions : 10 samples x 27819 variables
#> Mode : classification
#> Preprocess : center | UV
#> Components : 1 (2 tested)
#> Validation : BalancedMonteCarlo (k = 15)
#> Stop rule : cv_improvement_negligible
#> ----------------------------------------
#> Use summary() for performance metrics.
# model performance summary
summary(pls_model)
#> $perf
#> comp AUCt AUCv R2X selected fit stop_reason
#> 1 1 1 1 0.2502977 TRUE fitted <NA>
#> 2 2 1 1 0.2179150 FALSE not fitted cv_improvement_negligible
#> stop_metric stop_delta
#> 1 NA NA
#> 2 1 0
#>
#> $engine
#> [1] "pls"
#>
#> $y_type
#> [1] "DA"
# model scores & loadings
Tp <- scores(pls_model)
Pp <- loadings(pls_model)
OPLS can be thought of as PLS with an integrated orthogonality filter, which has been reported to improve model interpretability. While standard PLS mixes predictive and non-predictive variation, OPLS partitions the variance into two parts: variation correlated with the response (\(Y\)) and variation “orthogonal” (unrelated) to it.
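The core of the orthogonal filter can be written in a few lines of base R (a single-filter sketch following the general OPLS idea, not the package implementation):

```r
set.seed(2)
X <- scale(matrix(rnorm(100), nrow = 10), scale = FALSE)  # centred predictors
y <- scale(rnorm(10), scale = FALSE)                      # centred response

# predictive weights: direction in X that covaries most with y
w <- crossprod(X, y)
w <- w / sqrt(sum(w^2))
t_p <- X %*% w                           # predictive scores
p   <- crossprod(X, t_p) / sum(t_p^2)    # loadings

# orthogonal weights: the part of the loading vector not explained by w
w_o <- p - as.numeric(crossprod(w, p)) * w
w_o <- w_o / sqrt(sum(w_o^2))
t_o <- X %*% w_o                         # orthogonal scores
p_o <- crossprod(X, t_o) / sum(t_o^2)

# remove the Y-orthogonal variation from X before the final PLS step
X_filt <- X - t_o %*% t(p_o)

# the orthogonal scores carry no information about the response
sum(t_o * y)
```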
Similar to PLS, the OPLS algorithm evaluates model performance using statistical re-sampling techniques to ensure the results are robust and not driven by noise.
opls_model <- opls(X, Y, validation_strategy=mc_cv, scaling = uv)
#> Performing discriminant analysis.
#> An O-PLS-DA model with 1 predictive and 1 orthogonal components was fitted.
# model information & performance
show(opls_model)
#>
#> m8_model <opls>
#> ----------------------------------------
#> Dimensions : 10 samples x 27819 variables
#> Mode : classification
#> Preprocess : center | UV
#> Components : 2 (3 tested)
#> Validation : BalancedMonteCarlo (k = 15)
#> Stop rule : cv_improvement_negligible
#> ----------------------------------------
#> Use summary() for performance metrics.
The OPLS algorithm fitted one predictive and one orthogonal component.
A second orthogonal component was evaluated, but the improvement in
cross-validated class prediction (AUCv), based on the left-out sets, was
negligible (stop reason). Therefore, a third component was not fitted to the full
spectral matrix.
summary(opls_model)
#> $perf
#> comp_total comp_pred comp_orth AUCt AUCv R2X selected fit
#> 1 2 1 1 1 1 0.2860832 TRUE fitted
#> 2 3 1 2 1 1 NA FALSE not fitted
#> stop_reason stop_metric stop_delta
#> 1 <NA> NA NA
#> 2 cv_improvement_negligible 1 0
#>
#> $engine
#> [1] "opls"
#>
#> $y_type
#> [1] "DA"
For detailed information on the component stopping criteria used in opls() and pls(),
see ?.evalFit.
Methods implemented for both pls() and opls() models include scores(),
loadings(), and fitted() (Y-fitted values from statistical resampling).
For OPLS models, scores() and loadings() accept the argument orth = TRUE
to extract orthogonal-component information.
Further OPLS model interpretation can be performed using Variable Importance in
Projection (vip()), which identifies the most influential variables in the model.
# model scores, loadings & vip
Tp <- scores(opls_model)
To <- scores(opls_model, orth=TRUE)
Pp <- loadings(opls_model)
Po <- loadings(opls_model, orth=TRUE)
vip <- vip(opls_model)
To ensure the model is statistically robust, metabom8 provides several
diagnostic tools:

- dmodx(): calculates the “Distance to the Model in X” to identify spectral outliers
- opls_perm(): executes Y-permutations to validate that the observed \(Q^2\) is significantly higher than what would be expected by chance
- cv_anova(): performs a cross-validated ANOVA to test the significance of the model’s predictive ability (regression models only)

# distance to the model in X space
dex <- dmodx(opls_model)
head(dex)
#> ID DmodX passed outlier label Group
#> 1 1 1.0456047 TRUE FALSE 1 2
#> 2 2 0.9667600 TRUE FALSE 2 2
#> 3 3 0.3198258 TRUE FALSE 3 2
#> 4 4 0.8996049 TRUE FALSE 4 2
#> 5 5 0.9106332 TRUE FALSE 5 2
#> 6 6 0.9668600 TRUE FALSE 6 1
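The quantity reported by dmodx() is conceptually a per-sample residual distance: after projecting X onto the model components, each sample's remaining unexplained variation is summarised relative to the pooled residual. A base-R sketch using a PCA projection (illustrative only; the exact normalisation used in metabom8 may differ):

```r
set.seed(3)
X <- scale(matrix(rnorm(200), nrow = 10), scale = FALSE)

# project onto A principal components and reconstruct
A    <- 2
pc   <- prcomp(X, center = FALSE)
Xhat <- pc$x[, 1:A] %*% t(pc$rotation[, 1:A])
E    <- X - Xhat                               # residual (unexplained) matrix

# per-sample residual SD, expressed relative to the pooled residual SD
s_i  <- sqrt(rowSums(E^2) / (ncol(X) - A))
s_0  <- sqrt(sum(E^2) / ((nrow(X) - A - 1) * (ncol(X) - A)))
dmodx_approx <- s_i / s_0

# values well above 1 flag samples far from the model plane
round(dmodx_approx, 2)
```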
This vignette demonstrated the core data import, preprocessing, and modelling workflow
implemented in metabom8. For further details on individual functions, see the
corresponding help pages.
To cite metabom8, please use:
Kimhofer T (2025). metabom8: A High-Performance R Package for Metabolomics Modeling and Analysis (R package version 1.2.0). doi:10.5281/zenodo.16794445.
A BibTeX entry for LaTeX users is
@Manual{,
  title = {metabom8: A High-Performance R Package for Metabolomics Modeling and Analysis},
  author = {Torben Kimhofer},
  year = {2025},
  note = {R package version 1.2.0},
  doi = {10.5281/zenodo.16794445},
  url = {https://doi.org/10.5281/zenodo.16794445},
}
#> R version 4.6.0 RC (2026-04-17 r89917)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] metabom8_0.99.10 BiocStyle_2.39.0
#>
#> loaded via a namespace (and not attached):
#> [1] gtable_0.3.6 ellipse_0.5.0 xfun_0.57
#> [4] bslib_0.10.0 ggplot2_4.0.3 htmlwidgets_1.6.4
#> [7] vctrs_0.7.3 tools_4.6.0 crosstalk_1.2.2
#> [10] generics_0.1.4 parallel_4.6.0 tibble_3.3.1
#> [13] pkgconfig_2.0.3 data.table_1.18.2.1 RColorBrewer_1.1-3
#> [16] S7_0.2.2 lifecycle_1.0.5 compiler_4.6.0
#> [19] farver_2.1.2 stringr_1.6.0 ptw_1.9-17
#> [22] progress_1.2.3 tinytex_0.59 RcppDE_0.1.9
#> [25] htmltools_0.5.9 sass_0.4.10 yaml_2.3.12
#> [28] lazyeval_0.2.3 plotly_4.12.0 pillar_1.11.1
#> [31] crayon_1.5.3 jquerylib_0.1.4 tidyr_1.3.2
#> [34] MASS_7.3-65 cachem_1.1.0 magick_2.9.1
#> [37] abind_1.4-8 tidyselect_1.2.1 digest_0.6.39
#> [40] stringi_1.8.7 dplyr_1.2.1 reshape2_1.4.5
#> [43] purrr_1.2.2 bookdown_0.46 labeling_0.4.3
#> [46] fastmap_1.2.0 grid_4.6.0 cli_3.6.6
#> [49] magrittr_2.0.5 dichromat_2.0-0.1 withr_3.0.2
#> [52] prettyunits_1.2.0 scales_1.4.0 rmarkdown_2.31
#> [55] httr_1.4.8 signal_1.8-1 otel_0.2.0
#> [58] hms_1.1.4 evaluate_1.0.5 knitr_1.51
#> [61] viridisLite_0.4.3 rlang_1.2.0 Rcpp_1.1.1-1.1
#> [64] glue_1.8.1 BiocManager_1.30.27 pROC_1.19.0.1
#> [67] jsonlite_2.0.0 R6_2.6.1 plyr_1.8.9
#> [70] fs_2.1.0 colorRamps_2.3.4