1 Introduction
- 1.1 Installation
2 Getting started
- 2.1 Exploring the data structure
3 Runnable toy workflow setup
4 Full-scale modeling workflow using real cohort data for reference
5 Best practices and recommendations
6 Troubleshooting
- 6.1 Common issues and solutions
  - 6.1.1 Feature mismatch between datasets
  - 6.1.2 Memory issues with large datasets
7 Session information
8 References
Appendix

1 Introduction

Microbiome research has become an increasingly important field for understanding the complex interactions between host health, disease, and environmental factors. Numerous studies have demonstrated that microbial community composition and function can provide predictive insights into a wide range of conditions, including cancer, metabolic, and inflammatory diseases. Translating these findings into robust and reproducible computational workflows remains a key challenge for the microbiome research community.

CrcBiomeScreen is a Bioconductor package that provides a reproducible and modular workflow for microbiome-based disease screening and classification. Although originally designed for colorectal cancer (CRC) applications, the framework is disease-agnostic and can be applied to any case–control microbiome dataset. The package integrates data preprocessing, normalization, quality control, and machine learning-based classification into a streamlined and extensible pipeline.

The Bioconductor ecosystem provides a standardized environment for reproducible omics analysis and emphasizes interoperability between data structures and statistical workflows. CrcBiomeScreen extends this principle to microbiome classification by supporting the use of Bioconductor core data containers such as TreeSummarizedExperiment and by providing accessor methods that facilitate integration with other packages (e.g., phyloseq and curatedMetagenomicData). This design ensures compatibility with existing Bioconductor workflows and promotes transparency, reproducibility, and reusability of microbiome-based models.

Existing Bioconductor packages such as microbiome, microbiomeMarker, and curatedMetagenomicData provide powerful tools for microbiome data access, visualization, and differential abundance testing. However, there remains a gap in standardized machine learning workflows tailored for disease screening and cross-cohort validation. CrcBiomeScreen addresses this need by providing:

Integrated preprocessing: automated taxonomic splitting, filtering, and normalization (TSS and GMPR);
Modeling interface: built-in machine learning modules for Random Forest and XGBoost classification with class weighting for imbalanced data;
Evaluation framework: functions for cross-validation, ROC/F1-based performance metrics, and external validation;
Quality control utilities: MDS-based sample outlier detection and contamination-aware filtering.

By bridging microbiome features with reproducible machine learning pipelines, CrcBiomeScreen complements existing Bioconductor tools and provides a flexible platform for developing, benchmarking, and validating microbiome-based predictive models.

Note: This vignette is organized in two parts:

a fully executable toy workflow for reproducibility during package checking, and

a full-scale workflow using real cohort data for reference.

The toy workflow mirrors the complete analysis pipeline, while the full workflow illustrates how to apply the same steps to a real data set.

1.1 Installation

Install from the Bioconductor repository:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("CrcBiomeScreen")

Install the development version of the package from GitHub(development version):

install.packages(c("remotes"))
remotes::install_github("iChronostasis/CrcBiomeScreen", force = TRUE)

Load CrcBiomeScreen into the R session:

library(CrcBiomeScreen)

2 Getting started

Full model training is disabled during vignette build because it exceeds SPB time limits. All other steps are fully executable to illustrate real package behavior.

The core workflow of CrcBiomeScreen involves several key steps: data loading and pre-processing, normalization, model training, evaluation, and validation. All functions are designed to work with the CrcBiomeScreenObject, which is the main data structure used throughout the package.

2.1 Exploring the data structure

2.1.1 Data preparation

First, we need to load the example data and create a CrcBiomeScreenObject This object is the primary input for all downstream functions in the package.

We offer two methods for preparing your data. You only need to use one of the following examples, depending on the format of your input data.

The first example below, which uses the Thomas_2018_RelativeAbundance dataset, is the one we will be using for the subsequent steps in this vignette. It represents a typical use case with a more complex dataset. The second example with NHSBCSP_screeningData is included to show an alternative input format for simpler datasets, demonstrating that the function can handle various data structures.

# Set working directory
# setwd("/home/CRCscreening/CRCscreening-Workflow/")
# List available datasets in the package
# data(package = "CrcBiomeScreen")

# Load the toy dataset from curatedMetagenomicData
data(Thomas_2018_RelativeAbundance, package = "CrcBiomeScreen")

# Create the CrcBiomeScreenObject
## Direct creation from TreeSummarizedExperiment

# library(curatedMetagenomicData)
# tse <- curatedMetagenomicData("ThomasAM_2018a.relative_abundance",
#                               dryrun = FALSE, rownames = "short")[[1]]

tse <- Thomas_2018_RelativeAbundance$`2021-03-31.ThomasAM_2018a.relative_abundance`
CrcBiomeScreenObject <- CreateCrcBiomeScreenObjectFromTSE(tse)

# Load the simple toy dataset from curatedMetagenomicData
data(NHSBCSP_screeningData, package = "CrcBiomeScreen")

# Create the CrcBiomeScreenObject for simple toy data
CrcBiomeScreenObject_NHSBCSP_screeningData <- CreateCrcBiomeScreenObject(
  AbsoluteAbundance = NHSBCSP_screeningData[,-c(1:2,646:647)],
  TaxaData = colnames(NHSBCSP_screeningData[,-c(1:2,646:647)]),
  SampleData = NHSBCSP_screeningData[,c(1:2,646:647)]
)

## Data exploration
head(getTaxaData(CrcBiomeScreenObject))
#>                                 superkingdom          phylum
#> Citrobacter braakii                 Bacteria  Proteobacteria
#> Akkermansia muciniphila             Bacteria Verrucomicrobia
#> Flavonifractor plautii              Bacteria      Firmicutes
#> Ruthenibacterium lactatiformans     Bacteria      Firmicutes
#> Escherichia coli                    Bacteria  Proteobacteria
#> Citrobacter youngae                 Bacteria  Proteobacteria
#>                                               class              order
#> Citrobacter braakii             Gammaproteobacteria   Enterobacterales
#> Akkermansia muciniphila            Verrucomicrobiae Verrucomicrobiales
#> Flavonifractor plautii                   Clostridia      Eubacteriales
#> Ruthenibacterium lactatiformans          Clostridia      Eubacteriales
#> Escherichia coli                Gammaproteobacteria   Enterobacterales
#> Citrobacter youngae             Gammaproteobacteria   Enterobacterales
#>                                             family            genus
#> Citrobacter braakii             Enterobacteriaceae      Citrobacter
#> Akkermansia muciniphila            Akkermansiaceae      Akkermansia
#> Flavonifractor plautii            Oscillospiraceae   Flavonifractor
#> Ruthenibacterium lactatiformans   Oscillospiraceae Ruthenibacterium
#> Escherichia coli                Enterobacteriaceae      Escherichia
#> Citrobacter youngae             Enterobacteriaceae      Citrobacter
#>                                                         species
#> Citrobacter braakii                         Citrobacter braakii
#> Akkermansia muciniphila                 Akkermansia muciniphila
#> Flavonifractor plautii                   Flavonifractor plautii
#> Ruthenibacterium lactatiformans Ruthenibacterium lactatiformans
#> Escherichia coli                               Escherichia coli
#> Citrobacter youngae                         Citrobacter youngae
dim(getRelativeAbundance(CrcBiomeScreenObject))
#> [1] 471  80
summary(getSampleData(CrcBiomeScreenObject)$age)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   49.00   60.75   67.00   67.50   75.25   84.00

2.1.2 Taxonomic data processing

The package offers robust functions to process raw taxonomic strings, splitting them into multiple hierarchical levels and allowing you to filter to a specific taxonomic rank. This ensures your data is clean and ready for downstream analysis.

2.1.2.1 Supported formats

Currently, the function supports multiple formats of taxonomic strings, including MetaPhlAn, QIIME, SILVA, etc.
Users do not need to manually specify the delimiter (;, ., |, _), since the function automatically detects the correct separator.

2.1.2.2 Original taxa column

For reproducibility and traceability, the output contains an additional column OriginalTaxa that preserves the original input string.

2.1.2.3 Special handling of uncultured and unclassified

If a taxonomic entry at a given rank is "uncultured" or "unclassified", the function appends this label to the higher-level taxa.

For example:

Input	Processed output (Genus)
`D_0__Bacteria.D_1__Firmicutes.D_2__Clostridia.D_3__Ruminococcaceae.D_4__uncultured`	`Ruminococcaceae_uncultured`
`k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacteriales; f__Enterobacteriaceae; g__unclassified`	`Enterobacteriaceae_unclassified`

2.1.2.4 Choosing a Taxonomic Level

You don’t have to be limited to just the genus level. Use the KeepTaxonomicLevel() function to select the rank you need, whether it’s family, genus, or species. The function will automatically group taxa at the specified level. For the situation where there are two levels of unclassified, it will retain one level with unclassified.

2.1.2.5 Handling Data Without Taxonomic Strings

For data like ASVs or OTUs that do not contain taxonomic information in their names, you can use the LoadTaxaTable() function to import a separate table that maps each sequence ID to its full taxonomic lineage. This provides maximum flexibility for any type of input data. The use of this function can be seen in the documentation of the function.

2.1.2.6 Example

# Split the taxa into multiple levels[If not done already]
# CrcBiomeScreenObject <- SplitTaxas(CrcBiomeScreenObject)

head(getTaxaData(CrcBiomeScreenObject))
#>                                 superkingdom          phylum
#> Citrobacter braakii                 Bacteria  Proteobacteria
#> Akkermansia muciniphila             Bacteria Verrucomicrobia
#> Flavonifractor plautii              Bacteria      Firmicutes
#> Ruthenibacterium lactatiformans     Bacteria      Firmicutes
#> Escherichia coli                    Bacteria  Proteobacteria
#> Citrobacter youngae                 Bacteria  Proteobacteria
#>                                               class              order
#> Citrobacter braakii             Gammaproteobacteria   Enterobacterales
#> Akkermansia muciniphila            Verrucomicrobiae Verrucomicrobiales
#> Flavonifractor plautii                   Clostridia      Eubacteriales
#> Ruthenibacterium lactatiformans          Clostridia      Eubacteriales
#> Escherichia coli                Gammaproteobacteria   Enterobacterales
#> Citrobacter youngae             Gammaproteobacteria   Enterobacterales
#>                                             family            genus
#> Citrobacter braakii             Enterobacteriaceae      Citrobacter
#> Akkermansia muciniphila            Akkermansiaceae      Akkermansia
#> Flavonifractor plautii            Oscillospiraceae   Flavonifractor
#> Ruthenibacterium lactatiformans   Oscillospiraceae Ruthenibacterium
#> Escherichia coli                Enterobacteriaceae      Escherichia
#> Citrobacter youngae             Enterobacteriaceae      Citrobacter
#>                                                         species
#> Citrobacter braakii                         Citrobacter braakii
#> Akkermansia muciniphila                 Akkermansia muciniphila
#> Flavonifractor plautii                   Flavonifractor plautii
#> Ruthenibacterium lactatiformans Ruthenibacterium lactatiformans
#> Escherichia coli                               Escherichia coli
#> Citrobacter youngae                         Citrobacter youngae

taxa <- getTaxaData(CrcBiomeScreenObject)
colnames(taxa) <- c("Kingdom","Phylum","Class","Order","Family","Genus","Species")

setTaxaData(CrcBiomeScreenObject) <- taxa

colnames(getTaxaData(CrcBiomeScreenObject))
#> [1] "Kingdom" "Phylum"  "Class"   "Order"   "Family"  "Genus"   "Species"

# Keep only the genus level data
CrcBiomeScreenObject <- KeepTaxonomicLevel(CrcBiomeScreenObject,level = "Genus")

# Inspect processed data
head(getTaxaData(CrcBiomeScreenObject))
#>                                  Kingdom          Phylum               Class
#> Citrobacter braakii             Bacteria  Proteobacteria Gammaproteobacteria
#> Akkermansia muciniphila         Bacteria Verrucomicrobia    Verrucomicrobiae
#> Flavonifractor plautii          Bacteria      Firmicutes          Clostridia
#> Ruthenibacterium lactatiformans Bacteria      Firmicutes          Clostridia
#> Escherichia coli                Bacteria  Proteobacteria Gammaproteobacteria
#> Citrobacter youngae             Bacteria  Proteobacteria Gammaproteobacteria
#>                                              Order             Family
#> Citrobacter braakii               Enterobacterales Enterobacteriaceae
#> Akkermansia muciniphila         Verrucomicrobiales    Akkermansiaceae
#> Flavonifractor plautii               Eubacteriales   Oscillospiraceae
#> Ruthenibacterium lactatiformans      Eubacteriales   Oscillospiraceae
#> Escherichia coli                  Enterobacterales Enterobacteriaceae
#> Citrobacter youngae               Enterobacterales Enterobacteriaceae
#>                                            Genus
#> Citrobacter braakii                  Citrobacter
#> Akkermansia muciniphila              Akkermansia
#> Flavonifractor plautii            Flavonifractor
#> Ruthenibacterium lactatiformans Ruthenibacterium
#> Escherichia coli                     Escherichia
#> Citrobacter youngae                  Citrobacter
#>                                                         Species
#> Citrobacter braakii                         Citrobacter braakii
#> Akkermansia muciniphila                 Akkermansia muciniphila
#> Flavonifractor plautii                   Flavonifractor plautii
#> Ruthenibacterium lactatiformans Ruthenibacterium lactatiformans
#> Escherichia coli                               Escherichia coli
#> Citrobacter youngae                         Citrobacter youngae
dim(getTaxaData(CrcBiomeScreenObject))
#> [1] 471   7

2.1.3 Data normalization

CrcBiomeScreen supports multiple normalization methods. Here we demonstrate both GMPR and TSS normalization:

# Normalize using GMPR (Geometric Mean of Pairwise Ratios)
CrcBiomeScreenObject <- NormalizeData(CrcBiomeScreenObject, method = "GMPR", level = "Genus")

# Normalize using TSS (Total Sum Scaling)
CrcBiomeScreenObject <- NormalizeData(CrcBiomeScreenObject, method = "TSS", level = "Genus")

Note: GMPR normalization is generally recommended for microbiome data as it is more robust to compositional effects and outliers compared to TSS normalization.

3 Runnable toy workflow setup

More explaination and details could check the Full-scale modeling workflow using real cohort data for reference section below.

3.1 Setup the toy dataset

set.seed(123)
# -------------------------
# Toy taxonomy
# -------------------------
toy_taxa_strings <- c(
  "D_0__Bacteria|D_1__Firmicutes|D_2__Clostridia|D_3__OrderA|D_4__FamilyA|D_5__GenusA",
  "D_0__Bacteria|D_1__Firmicutes|D_2__Clostridia|D_3__OrderB|D_4__FamilyB|D_5__GenusB",
  "D_0__Bacteria|D_1__Firmicutes|D_2__Clostridia|D_3__OrderC|D_4__FamilyC|D_5__GenusC",
  "D_0__Bacteria|D_1__Bacteroidetes|D_2__Bacteroidia|D_3__OrderD|D_4__FamilyD|D_5__GenusD",
  "D_0__Bacteria|D_1__Bacteroidetes|D_2__Bacteroidia|D_3__OrderE|D_4__FamilyE|D_5__GenusE",
  "D_0__Bacteria|D_1__Proteobacteria|D_2__Gammaproteobacteria|D_3__OrderF|D_4__FamilyF|D_5__GenusF"
)

toy_taxa <- data.frame(
  Taxa = toy_taxa_strings,
  stringsAsFactors = FALSE
)

# -------------------------
# Toy training abundance
# 6 taxa x 12 samples
# -------------------------
train_samples <- paste0("S", 1:12)

toy_abs <- matrix(
  c(
    rpois(6 * 6, lambda = 54.8887777),   # controls
    rpois(6 * 6, lambda = 55)    # CRC
  ),
  nrow = 6,
  ncol = 12
)

rownames(toy_abs) <- toy_taxa_strings
colnames(toy_abs) <- train_samples
toy_abs <- as.data.frame(toy_abs)

toy_sample <- data.frame(
  number_reads = rep(10000, 12),
  study_condition = c(rep("control", 6), rep("CRC", 6)),
  row.names = train_samples,
  stringsAsFactors = FALSE
)

# -------------------------
# 1. Create toy training data
# -------------------------

toy_obj <- CreateCrcBiomeScreenObject(
  AbsoluteAbundance = toy_abs,
  TaxaData = toy_taxa,
  SampleData = toy_sample
)

# -------------------------
# 2. Keep one taxonomic level and normalize
# -------------------------

toy_obj <- SplitTaxas(toy_obj)
toy_obj <- KeepTaxonomicLevel(toy_obj, level = "Genus")
toy_obj <- NormalizeData(toy_obj, method = "TSS", level = "Genus")

# -------------------------
# 3. Split dataset
# -------------------------
toy_obj <- SplitDataSet(
  toy_obj,
  label = c("control", "CRC"),
  partition = 0.7
)

# Perform quality control using cmdscale
toy_obj <- qcByCmdscale(
  toy_obj,
  TaskName = "toy_obj",
  normalize_method = "TSS",
  plot = FALSE
)

# Check class balance in the training data
checkClassBalance(getModelData(toy_obj)$TrainLabel)
#> $class_counts
#> labels
#>     CRC control 
#>       5       5 
#> 
#> $class_proportions
#> labels
#>     CRC control 
#>     0.5     0.5 
#> 
#> $is_imbalanced
#> [1] FALSE
#> 
#> $suggestion
#> [1] "Balanced: Class weights are likely unnecessary."

3.2 Model training and evaluation

# -------------------------------
# 4. Train and Evaluate RF model
# -------------------------------
toy_obj <- TrainModels(
  toy_obj,
  model_type = "RF",
  TaskName = "toy_rf",
  ClassWeights = FALSE,
  TrueLabel = "CRC",
  num_cores = 1,
  n_cv = 2
)

toy_obj <- EvaluateModel(
  toy_obj,
  model_type = "RF",
  TaskName = "ToyData_RF_Test",
  TrueLabel = "CRC",
  PlotAUC = FALSE
)

3.3 External validation

# -------------------------
# 5. Create toy validation data
# -------------------------
val_taxa_strings <- c(
  "D_0__Bacteria|D_1__Firmicutes|D_2__Clostridia|D_3__OrderA|D_4__FamilyA|D_5__GenusA",
  "D_0__Bacteria|D_1__Firmicutes|D_2__Clostridia|D_3__OrderB|D_4__FamilyB|D_5__GenusB",
  "D_0__Bacteria|D_1__Firmicutes|D_2__Clostridia|D_3__OrderC|D_4__FamilyC|D_5__GenusC",
  "D_0__Bacteria|D_1__Bacteroidetes|D_2__Bacteroidia|D_3__OrderD|D_4__FamilyD|D_5__GenusD",
  "D_0__Bacteria|D_1__Bacteroidetes|D_2__Bacteroidia|D_3__OrderE|D_4__FamilyE|D_5__GenusE",
  "D_0__Bacteria|D_1__Proteobacteria|D_2__Gammaproteobacteria|D_3__OrderF|D_4__FamilyF|D_5__GenusF"
)

val_taxa <- data.frame(
  Taxa = val_taxa_strings,
  stringsAsFactors = FALSE
)

val_samples <- paste0("V", 1:8)

val_abund <- matrix(
  c(
    rpois(6 * 4, lambda = 38),
    rpois(6 * 4, lambda = 48)
  ),
  nrow = 6,
  ncol = 8
)

rownames(val_abund) <- val_taxa_strings
colnames(val_abund) <- val_samples

val_abund <- as.data.frame(val_abund)

val_sample <- data.frame(
  number_reads = rep(10000, 8),
  study_condition = c(rep("control", 4), rep("CRC", 4)),
  condition = c(rep("control", 4), rep("CRC", 4)),
  row.names = val_samples,
  stringsAsFactors = FALSE
)

val_obj <- CreateCrcBiomeScreenObject(
  AbsoluteAbundance = val_abund,
  TaxaData = val_taxa,
  SampleData = val_sample
)

val_obj <- SplitTaxas(val_obj)
val_obj <- KeepTaxonomicLevel(val_obj, level = "Genus")
val_obj <- NormalizeData(val_obj, method = "TSS", level = "Genus")

# -------------------------
# 6. Align features
# -------------------------
train_norm <- getNormalizedData(toy_obj)
val_norm <- getNormalizedData(val_obj)

common_features <- intersect(colnames(train_norm), colnames(val_norm))

setNormalizedData(toy_obj) <- train_norm[, common_features, drop = FALSE]
setNormalizedData(val_obj) <- val_norm[, common_features, drop = FALSE]

# -------------------------
# 7. Validate model
# -------------------------
validated_obj <- ValidateModelOnData(
  toy_obj,
  ValidationData = val_obj,
  model_type = "RF",
  TaskName = "toy_validation",
  TrueLabel = "CRC",
  PlotAUC = FALSE
)

3.4 Full screening workflow in one step

# ----------------------------------
# 8. Minimal end-to-end RunScreening
# ----------------------------------
toy_obj2 <- CreateCrcBiomeScreenObject(
  AbsoluteAbundance = toy_abs,
  TaxaData = toy_taxa,
  SampleData = toy_sample
)

toy_obj2 <- SplitTaxas(toy_obj2)
toy_obj2 <- KeepTaxonomicLevel(toy_obj2, level = "Genus")
toy_obj2 <- NormalizeData(toy_obj2, method = "TSS", level = "Genus")

toy_obj2 <- RunScreening(
  obj = toy_obj2,
  model_type = "RF",
  partition = 0.7,
  split.requirement = list(
    label = c("control", "CRC"),
    condition_col = "study_condition"
  ),
  ClassWeights = FALSE,
  n_cv = 2,
  num_cores = 1,
  TaskName = "RF_TSS_toydata",
  ValidationData = val_obj,
  TrueLabel = "CRC"
)

3.5 Get the the raw prediction scores for each sample

# -------------------------
# 9. Prediction on new data
# -------------------------
toy_obj <- PredictCrcBiomeScreen(
  toy_obj,
  newdata = getNormalizedData(val_obj),
  model_type = "RF"
)

# Inspect the predictions
head(getPredictResult(toy_obj)$RF)
#>          CRC   control
#> V1 0.4977778 0.5022222
#> V2 0.4961111 0.5038889
#> V3 0.5572222 0.4427778
#> V4 0.4561111 0.5438889
#> V5 0.5238889 0.4761111
#> V6 0.5672222 0.4327778
#>      control       CRC
#> V1 0.5122222 0.4877778
#> V2 0.6066667 0.3933333
#> V3 0.4205556 0.5794444
#> V4 0.5255556 0.4744444
#> V5 0.4772222 0.5227778
#> V6 0.4677778 0.5322222
predictions <- getPredictResult(toy_obj)$RF

# -------------------------
# 10. Evaluate predictions
# -------------------------

evaluation_results <- EvaluateCrcBiomeScreen(
  predictions = predictions,
  true_labels = getSampleData(val_obj)$study_condition,
  TrueLabel = "CRC",
  TaskName = "Toy_Prediction_Eval"
)

4 Full-scale modeling workflow using real cohort data for reference

4.0.1 Data filtering and splitting

Before model training, we need to filter the data to include only CRC and control samples, then split into training and test sets:

# Check the available labels in the data
table(getSampleData(CrcBiomeScreenObject)$study_condition)
#> 
#>     CRC adenoma control 
#>      29      27      24

# Filter data to keep only CRC and control samples
CrcBiomeScreenObject <- FilterDataSet(
  CrcBiomeScreenObject,
  label = c("CRC", "control"),
  condition_col = "study_condition"
)

# Verify the filtering results
table(getSampleData(CrcBiomeScreenObject)$study_condition)
#> 
#>     CRC control 
#>      29      24

# Split data into training (70%) and test (30%) sets
CrcBiomeScreenObject <- SplitDataSet(
  CrcBiomeScreenObject, 
  label = c("control", "CRC"), 
  partition = 0.7
)

4.0.2 Quality control

Quality control using classical multidimensional scaling can help identify potential batch effects or outliers:

# Perform quality control using cmdscale
CrcBiomeScreenObject <- qcByCmdscale(
  CrcBiomeScreenObject,
  TaskName = "Normalize_ToyData_filtered_qc",
  normalize_method = "GMPR",
  plot = FALSE
)

# Check class balance in the training data
checkClassBalance(getModelData(CrcBiomeScreenObject)$TrainLabel)
#> $class_counts
#> labels
#>     CRC control 
#>      21      17 
#> 
#> $class_proportions
#> labels
#>       CRC   control 
#> 0.5526316 0.4473684 
#> 
#> $is_imbalanced
#> [1] TRUE
#> 
#> $suggestion
#> [1] "Unbalanced: Using class weights is recommended."

4.0.3 Model training and Evaluation

Training machine-learning models for microbiome classification often requires repeated cross-validation, hyperparameter search, and ensemble methods. These steps are computationally demanding and cannot be executed within the 15-minute limit of the Bioconductor build system.

For this reason, the full training code is provided for users, but is not executed when building the vignette. Users may run the pipeline locally to reproduce the complete analysis.

CrcBiomeScreen supports two machine learning algorithms: Random Forest and XGBoost. Both models can be trained with class balancing to handle imbalanced datasets:

4.1 External validation

To ensure the robustness and generalizability of the trained models, external validation on independent datasets is crucial

4.2 Prediction & Evaluation

Once you’ve trained a model, you’ll want to apply it to new data and assess its performance. Our package offers a streamlined, two-step process for this: first, generate predictions on your new dataset, and then, if true labels are available, evaluate the model’s accuracy. This modular approach allows for flexibility, whether you’re predicting outcomes for unknown samples or measuring model quality on a validation set.

4.2.1 Step 1: Predict on New Data

Use the PredictCrcBiomeScreen() function to generate prediction probabilities for your new dataset. This function takes your trained CrcBiomeScreenObject and the new data, returning the raw prediction scores for each sample.

# Assuming 'CrcBiomeScreenObject' contains a trained XGBoost model
# and 'MyNewData' is your new dataset (e.g., another CrcBiomeScreenObject or a data frame).

# Generate predictions for the new dataset
ValidationData_filtered_qc <- PredictCrcBiomeScreen(
    CrcBiomeScreenObject, 
    newdata = getNormalizedData(ValidationData_filtered_qc), # Use the appropriate data slot from your new data
    model_type = "RF"
)
# Inspect the predictions
head(getPredictResult(ValidationData_filtered_qc)$RF)
predictions <- getPredictResult(ValidationData_filtered_qc)$RF

4.2.2 Step 2: Evaluate Model Performance

If your new dataset includes known true labels, you can evaluate the model’s performance using the EvaluateCrcBiomeScreen() function. This function takes the predictions generated in the previous step and the true_labels to calculate key metrics like the AUC and plot the ROC curve.

# Assuming 'MyNewData' has a 'study_condition' column in its SampleData
# and the true positive label is "CRC".

# Evaluate the predictions
evaluation_results <- EvaluateCrcBiomeScreen(
    predictions = predictions,
    true_labels = getSampleData(ValidationData_filtered_qc)$study_condition, # Provide the actual true labels
    TrueLabel = "CRC", # Specify the positive class label
    TaskName = "MyNewDatasetEvaluation" # A name for the output plot/files
)

# View the evaluation results
print(evaluation_results$AUC)

4.3 Streamlined Screening Workflow

For users who want to run the entire screening workflow in one step, CrcBiomeScreen provides the RunScreening() function. This function encapsulates the complete process from data preparation to model validation, offering a comprehensive and easy-to-use solution.

# Load the toy dataset from curatedMetagenomicData
data(Thomas_2018_RelativeAbundance, package = "CrcBiomeScreen")

# Create the CrcBiomeScreenObject
tse <- Thomas_2018_RelativeAbundance$`2021-03-31.ThomasAM_2018a.relative_abundance`
CrcBiomeScreenObject <- CreateCrcBiomeScreenObjectFromTSE(tse)

taxa <- getTaxaData(CrcBiomeScreenObject)
colnames(taxa) <- c("Kingdom","Phylum","Class","Order","Family","Genus","Species")

setTaxaData(CrcBiomeScreenObject) <- taxa

# Keep only the genus level data
CrcBiomeScreenObject <- KeepTaxonomicLevel(CrcBiomeScreenObject,level = "Genus")

# Inspect processed data
head(getTaxaData(CrcBiomeScreenObject))

# Normalize using GMPR (Geometric Mean of Pairwise Ratios)
CrcBiomeScreenObject <- NormalizeData(CrcBiomeScreenObject, method = "GMPR", level = "Genus")

# Filter data to keep only CRC and control samples
CrcBiomeScreenObject <- FilterDataSet(
  CrcBiomeScreenObject,
  label = c("CRC", "control"),
  condition_col = "study_condition"
)

# Perform quality control using cmdscale
CrcBiomeScreenObject <- qcByCmdscale(
  CrcBiomeScreenObject,
  TaskName = "Normalize_ToyData_filtered_qc",
  normalize_method = "GMPR",
  plot = FALSE
)

# Verify the filtering results
table(getSampleData(CrcBiomeScreenObject)$study_condition)

# Split data into training (70%) and test (30%) sets
CrcBiomeScreenObject <- SplitDataSet(
  CrcBiomeScreenObject, 
  label = c("control", "CRC"), 
  partition = 0.7
)

# Check class balance in the training data
checkClassBalance(getModelData(CrcBiomeScreenObject)$TrainLabel)

# Load validation data
data("ZellerG_2014_RelativeAbundance")

ValidationData_tse <- ZellerG_2014_RelativeAbundance$`2021-03-31.ZellerG_2014.relative_abundance`
# Create CrcBiomeScreenObject for validation data
ValidationData <- CreateCrcBiomeScreenObjectFromTSE(ValidationData_tse)

# Process validation data similarly to training data
# ValidationData <- SplitTaxas(ValidationData)
taxa <- getTaxaData(ValidationData)
colnames(taxa) <- c("Kingdom","Phylum","Class","Order","Family","Genus","Species")

setTaxaData(ValidationData) <- taxa

ValidationData <- KeepTaxonomicLevel(ValidationData,level = "Genus")
ValidationData <- NormalizeData(ValidationData, method = "GMPR", level = "Genus")

# Align features between training and validation data
val_norm <- getNormalizedData(ValidationData)
crc_norm <- getNormalizedData(CrcBiomeScreenObject)
# Ensure consistent features between training and validation data
val_norm <- val_norm[, colnames(val_norm) %in% colnames(crc_norm)]
crc_norm <- crc_norm[, colnames(crc_norm) %in% colnames(val_norm)]
# Set back the normalized data
setNormalizedData(ValidationData) <- val_norm
setNormalizedData(CrcBiomeScreenObject) <- crc_norm

# Filter validation data
ValidationData_filtered <- FilterDataSet(
  ValidationData,
  label = c("CRC", "control"),
  condition_col = "study_condition"
)

# Quality control for validation data
ValidationData_filtered_qc <- qcByCmdscale(
  ValidationData_filtered,
  TaskName = "Normalize_ValidationData_filtered_qc",
  normalize_method = "GMPR",
  plot = FALSE
)

# Run the complete screening workflow
ValidationData_filtered_qc <- RunScreening(
  # Input data
  obj = CrcBiomeScreenObject,
  # Model and data splitting
  model_type = "RF",
  partition = 0.7,
  split.requirement = list(
    label = c("control", "CRC"),
    condition_col = "study_condition"
  ),
  ClassWeights = FALSE,
  
  # Cross-validation and parallelization
  n_cv = 10,
  num_cores = 10,
  
  # Task and output naming
  TaskName = "RF_GMPR_toydata",
  
  # External validation
  ValidationData = ValidationData_filtered_qc,
  TrueLabel = "CRC"
)

# Example with XGBoost model
ValidationData_filtered_qc <- RunScreening(
  CrcBiomeScreenObject,
  model = "RF",
  partition = 0.7,
  split.requirement = list(
    label = c("control", "CRC"),
    condition_col = "study_condition"
  ),
  ClassWeights = TRUE,
  n_cv = 10,
  num_cores = 10,
  TaskName = "RF_GMPR_toydata",
  ValidationData = ValidationData_filtered_qc,
  TrueLabel = "CRC"
)

This function performs the complete workflow, including:

Data normalization: Standardizes data to a common scale.
Data filtering and splitting: Prepares the data for modeling by selecting and partitioning it into training and testing sets.
Model training with cross-validation: Trains the specified model (RF or XGBoost) using k-fold cross-validation to ensure robustness.
Model evaluation on the test set: Assesses the model’s performance on a held-out test set to get an unbiased estimate of its predictive power.
External validation (if ValidationData is provided): Applies the trained model to an entirely independent dataset to confirm its generalizability and real-world applicability. This step also stores the individual sample predictions and evaluation metrics in the CrcBiomeScreenObject.

5 Best practices and recommendations

5.1 Normalization method selection

GMPR: Recommended for most microbiome datasets, especially when dealing with compositional data and potential outliers
TSS: Simpler approach, suitable for well-balanced datasets without extreme outliers

5.2 Model selection

Random Forest: Generally robust and interpretable, good baseline choice
XGBoost: Often provides better performance but may require more careful hyperparameter tuning

5.3 Validation strategies

Always use external validation datasets when available
Consider cross-validation for model selection and hyperparameter tuning
Ensure consistent preprocessing between training and validation data

5.4 Class imbalance

Enable ClassWeights = TRUE when dealing with imbalanced datasets
Monitor both sensitivity and specificity, not just overall accuracy
Consider stratified sampling when splitting datasets

6 Troubleshooting

6.1 Common issues and solutions

6.1.1 Feature mismatch between datasets

Ensure that both training and validation datasets have been processed identically and contain overlapping features.

6.1.2 Memory issues with large datasets

Reduce the number of features by applying more stringent filtering
Use fewer CPU cores (num_cores parameter) to reduce memory usage

7 Session information

This vignette was created under the following conditions:

sessionInfo()
#> R version 4.6.0 alpha (2026-04-05 r89794)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] foreach_1.5.2          pROC_1.19.0.1          ranger_0.18.0         
#> [4] CrcBiomeScreen_0.99.15 BiocStyle_2.39.0      
#> 
#> loaded via a namespace (and not attached):
#>   [1] inline_0.3.21                   permute_0.9-10                 
#>   [3] rlang_1.2.0                     magrittr_2.0.5                 
#>   [5] clue_0.3-68                     otel_0.2.0                     
#>   [7] e1071_1.7-17                    matrixStats_1.5.0              
#>   [9] compiler_4.6.0                  mgcv_1.9-4                     
#>  [11] vctrs_0.7.3                     reshape2_1.4.5                 
#>  [13] rmutil_1.1.10                   stringr_1.6.0                  
#>  [15] pkgconfig_2.0.3                 crayon_1.5.3                   
#>  [17] fastmap_1.2.0                   XVector_0.51.0                 
#>  [19] modeest_2.4.0                   rmarkdown_2.31                 
#>  [21] prodlim_2026.03.11              purrr_1.2.2                    
#>  [23] xfun_0.57                       cachem_1.1.0                   
#>  [25] jsonlite_2.0.0                  progress_1.2.3                 
#>  [27] recipes_1.3.2                   DelayedArray_0.37.1            
#>  [29] BiocParallel_1.45.0             cluster_2.1.8.2                
#>  [31] parallel_4.6.0                  prettyunits_1.2.0              
#>  [33] R6_2.6.1                        bslib_0.10.0                   
#>  [35] stringi_1.8.7                   RColorBrewer_1.1-3             
#>  [37] parallelly_1.47.0               rpart_4.1.27                   
#>  [39] GenomicRanges_1.63.2            lubridate_1.9.5                
#>  [41] jquerylib_0.1.4                 Rcpp_1.1.1-1                   
#>  [43] Seqinfo_1.1.0                   bookdown_0.46                  
#>  [45] SummarizedExperiment_1.41.1     iterators_1.0.14               
#>  [47] knitr_1.51                      future.apply_1.20.2            
#>  [49] IRanges_2.45.0                  Matrix_1.7-5                   
#>  [51] splines_4.6.0                   nnet_7.3-20                    
#>  [53] timechange_0.4.0                tidyselect_1.2.1               
#>  [55] dichromat_2.0-0.1               abind_1.4-8                    
#>  [57] yaml_2.3.12                     vegan_2.7-3                    
#>  [59] TreeSummarizedExperiment_2.19.0 timeDate_4052.112              
#>  [61] doParallel_1.0.17               codetools_0.2-20               
#>  [63] listenv_0.10.1                  lattice_0.22-9                 
#>  [65] tibble_3.3.1                    plyr_1.8.9                     
#>  [67] withr_3.0.2                     Biobase_2.71.0                 
#>  [69] treeio_1.35.0                   S7_0.2.1-1                     
#>  [71] stable_1.1.7                    evaluate_1.0.5                 
#>  [73] future_1.70.0                   survival_3.8-6                 
#>  [75] proxy_0.4-29                    Biostrings_2.79.5              
#>  [77] pillar_1.11.1                   BiocManager_1.30.27            
#>  [79] MatrixGenerics_1.23.0           stats4_4.6.0                   
#>  [81] generics_0.1.4                  S4Vectors_0.49.2               
#>  [83] hms_1.1.4                       ggplot2_4.0.2                  
#>  [85] scales_1.4.0                    tidytree_0.4.7                 
#>  [87] timeSeries_4052.112             globals_0.19.1                 
#>  [89] class_7.3-23                    glue_1.8.1                     
#>  [91] statip_0.2.3                    lazyeval_0.2.3                 
#>  [93] tools_4.6.0                     data.table_1.18.2.1            
#>  [95] spatial_7.3-18                  fBasics_4052.98                
#>  [97] ModelMetrics_1.2.2.2            gower_1.0.2                    
#>  [99] fs_2.0.1                        grid_4.6.0                     
#> [101] tidyr_1.3.2                     ape_5.8-1                      
#> [103] ipred_0.9-15                    SingleCellExperiment_1.33.2    
#> [105] nlme_3.1-169                    cli_3.6.6                      
#> [107] rappdirs_0.3.4                  S4Arrays_1.11.1                
#> [109] lava_1.9.0                      doFuture_1.2.1                 
#> [111] dplyr_1.2.1                     gtable_0.3.6                   
#> [113] stabledist_0.7-2                yulab.utils_0.2.4              
#> [115] sass_0.4.10                     digest_0.6.39                  
#> [117] progressr_0.19.0                BiocGenerics_0.57.1            
#> [119] caret_7.0-1                     ggrepel_0.9.8                  
#> [121] SparseArray_1.11.13             farver_2.1.2                   
#> [123] htmltools_0.5.9                 lifecycle_1.0.5                
#> [125] hardhat_1.4.3                   statmod_1.5.1                  
#> [127] GUniFrac_1.9                    MASS_7.3-65

8 References

Appendix

If you use CrcBiomeScreen in your research, please cite:

[Package citation information]

For questions, bug reports, or feature requests, please visit the GitHub repository: https://github.com/iChronostasis/CrcBiomeScreen

Note: This vignette demonstrates the basic usage of CrcBiomeScreen. For more advanced usage and customization options, please refer to the function documentation and additional tutorials.

CrcBiomeScreen: Colorectal Cancer Microbiome Screening and Analysis

2026-04-17

Contents