Microbiome research has become an increasingly important field for understanding the complex interactions between host health, disease, and environmental factors. Numerous studies have demonstrated that microbial community composition and function can provide predictive insights into a wide range of conditions, including cancer, metabolic, and inflammatory diseases. Translating these findings into robust and reproducible computational workflows remains a key challenge for the microbiome research community.
CrcBiomeScreen is a Bioconductor package that provides a reproducible and modular workflow for microbiome-based disease screening and classification. Although originally designed for colorectal cancer (CRC) applications, the framework is disease-agnostic and can be applied to any case–control microbiome dataset. The package integrates data preprocessing, normalization, quality control, and machine learning-based classification into a streamlined and extensible pipeline.
The Bioconductor ecosystem provides a standardized environment for reproducible omics analysis and emphasizes interoperability between data structures and statistical workflows. CrcBiomeScreen extends this principle to microbiome classification by supporting the use of Bioconductor core data containers such as TreeSummarizedExperiment and by providing accessor methods that facilitate integration with other packages (e.g., phyloseq and curatedMetagenomicData). This design ensures compatibility with existing Bioconductor workflows and promotes transparency, reproducibility, and reusability of microbiome-based models.
Existing Bioconductor packages such as microbiome, microbiomeMarker, and curatedMetagenomicData provide powerful tools for microbiome data access, visualization, and differential abundance testing. However, there remains a gap in standardized machine learning workflows tailored for disease screening and cross-cohort validation. CrcBiomeScreen addresses this need by providing:
By bridging microbiome features with reproducible machine learning pipelines, CrcBiomeScreen complements existing Bioconductor tools and provides a flexible platform for developing, benchmarking, and validating microbiome-based predictive models.
Note: This vignette is organized in two parts:
- a fully executable toy workflow for reproducibility during package checking, and
- a full-scale workflow using real cohort data for reference.
The toy workflow mirrors the complete analysis pipeline, while the full workflow illustrates how to apply the same steps to a real data set.
Install from the Bioconductor repository:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("CrcBiomeScreen")
Install the development version of the package from GitHub(development version):
install.packages(c("remotes"))
remotes::install_github("iChronostasis/CrcBiomeScreen", force = TRUE)
Load CrcBiomeScreen into the R session:
library(CrcBiomeScreen)
Full model training is disabled during vignette build because it exceeds SPB time limits. All other steps are fully executable to illustrate real package behavior.
The core workflow of CrcBiomeScreen involves several key steps: data loading and pre-processing, normalization, model training, evaluation, and validation. All functions are designed to work with the CrcBiomeScreenObject, which is the main data structure used throughout the package.
First, we need to load the example data and create a CrcBiomeScreenObject This object is the primary input for all downstream functions in the package.
We offer two methods for preparing your data. You only need to use one of the following examples, depending on the format of your input data.
The first example below, which uses the Thomas_2018_RelativeAbundance dataset, is the one we will be using for the subsequent steps in this vignette. It represents a typical use case with a more complex dataset. The second example with NHSBCSP_screeningData is included to show an alternative input format for simpler datasets, demonstrating that the function can handle various data structures.
# Set working directory
# setwd("/home/CRCscreening/CRCscreening-Workflow/")
# List available datasets in the package
# data(package = "CrcBiomeScreen")
# Load the toy dataset from curatedMetagenomicData
data(Thomas_2018_RelativeAbundance, package = "CrcBiomeScreen")
# Create the CrcBiomeScreenObject
## Direct creation from TreeSummarizedExperiment
# library(curatedMetagenomicData)
# tse <- curatedMetagenomicData("ThomasAM_2018a.relative_abundance",
# dryrun = FALSE, rownames = "short")[[1]]
tse <- Thomas_2018_RelativeAbundance$`2021-03-31.ThomasAM_2018a.relative_abundance`
CrcBiomeScreenObject <- CreateCrcBiomeScreenObjectFromTSE(tse)
# Load the simple toy dataset from curatedMetagenomicData
data(NHSBCSP_screeningData, package = "CrcBiomeScreen")
# Create the CrcBiomeScreenObject for simple toy data
CrcBiomeScreenObject_NHSBCSP_screeningData <- CreateCrcBiomeScreenObject(
AbsoluteAbundance = NHSBCSP_screeningData[,-c(1:2,646:647)],
TaxaData = colnames(NHSBCSP_screeningData[,-c(1:2,646:647)]),
SampleData = NHSBCSP_screeningData[,c(1:2,646:647)]
)
## Data exploration
head(getTaxaData(CrcBiomeScreenObject))
#> superkingdom phylum
#> Citrobacter braakii Bacteria Proteobacteria
#> Akkermansia muciniphila Bacteria Verrucomicrobia
#> Flavonifractor plautii Bacteria Firmicutes
#> Ruthenibacterium lactatiformans Bacteria Firmicutes
#> Escherichia coli Bacteria Proteobacteria
#> Citrobacter youngae Bacteria Proteobacteria
#> class order
#> Citrobacter braakii Gammaproteobacteria Enterobacterales
#> Akkermansia muciniphila Verrucomicrobiae Verrucomicrobiales
#> Flavonifractor plautii Clostridia Eubacteriales
#> Ruthenibacterium lactatiformans Clostridia Eubacteriales
#> Escherichia coli Gammaproteobacteria Enterobacterales
#> Citrobacter youngae Gammaproteobacteria Enterobacterales
#> family genus
#> Citrobacter braakii Enterobacteriaceae Citrobacter
#> Akkermansia muciniphila Akkermansiaceae Akkermansia
#> Flavonifractor plautii Oscillospiraceae Flavonifractor
#> Ruthenibacterium lactatiformans Oscillospiraceae Ruthenibacterium
#> Escherichia coli Enterobacteriaceae Escherichia
#> Citrobacter youngae Enterobacteriaceae Citrobacter
#> species
#> Citrobacter braakii Citrobacter braakii
#> Akkermansia muciniphila Akkermansia muciniphila
#> Flavonifractor plautii Flavonifractor plautii
#> Ruthenibacterium lactatiformans Ruthenibacterium lactatiformans
#> Escherichia coli Escherichia coli
#> Citrobacter youngae Citrobacter youngae
dim(getRelativeAbundance(CrcBiomeScreenObject))
#> [1] 471 80
summary(getSampleData(CrcBiomeScreenObject)$age)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 49.00 60.75 67.00 67.50 75.25 84.00
The package offers robust functions to process raw taxonomic strings, splitting them into multiple hierarchical levels and allowing you to filter to a specific taxonomic rank. This ensures your data is clean and ready for downstream analysis.
Currently, the function supports multiple formats of taxonomic strings, including MetaPhlAn, QIIME, SILVA, etc.
Users do not need to manually specify the delimiter (;, ., |, _), since the function automatically detects the correct separator.
For reproducibility and traceability, the output contains an additional column OriginalTaxa that preserves the original input string.
If a taxonomic entry at a given rank is "uncultured" or "unclassified", the function appends this label to the higher-level taxa.
For example:
| Input | Processed output (Genus) |
|---|---|
D_0__Bacteria.D_1__Firmicutes.D_2__Clostridia.D_3__Ruminococcaceae.D_4__uncultured |
Ruminococcaceae_uncultured |
k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacteriales; f__Enterobacteriaceae; g__unclassified |
Enterobacteriaceae_unclassified |
You don’t have to be limited to just the genus level. Use the KeepTaxonomicLevel() function to select the rank you need, whether it’s family, genus, or species. The function will automatically group taxa at the specified level. For the situation where there are two levels of unclassified, it will retain one level with unclassified.
For data like ASVs or OTUs that do not contain taxonomic information in their names, you can use the LoadTaxaTable() function to import a separate table that maps each sequence ID to its full taxonomic lineage. This provides maximum flexibility for any type of input data. The use of this function can be seen in the documentation of the function.
# Split the taxa into multiple levels[If not done already]
# CrcBiomeScreenObject <- SplitTaxas(CrcBiomeScreenObject)
head(getTaxaData(CrcBiomeScreenObject))
#> superkingdom phylum
#> Citrobacter braakii Bacteria Proteobacteria
#> Akkermansia muciniphila Bacteria Verrucomicrobia
#> Flavonifractor plautii Bacteria Firmicutes
#> Ruthenibacterium lactatiformans Bacteria Firmicutes
#> Escherichia coli Bacteria Proteobacteria
#> Citrobacter youngae Bacteria Proteobacteria
#> class order
#> Citrobacter braakii Gammaproteobacteria Enterobacterales
#> Akkermansia muciniphila Verrucomicrobiae Verrucomicrobiales
#> Flavonifractor plautii Clostridia Eubacteriales
#> Ruthenibacterium lactatiformans Clostridia Eubacteriales
#> Escherichia coli Gammaproteobacteria Enterobacterales
#> Citrobacter youngae Gammaproteobacteria Enterobacterales
#> family genus
#> Citrobacter braakii Enterobacteriaceae Citrobacter
#> Akkermansia muciniphila Akkermansiaceae Akkermansia
#> Flavonifractor plautii Oscillospiraceae Flavonifractor
#> Ruthenibacterium lactatiformans Oscillospiraceae Ruthenibacterium
#> Escherichia coli Enterobacteriaceae Escherichia
#> Citrobacter youngae Enterobacteriaceae Citrobacter
#> species
#> Citrobacter braakii Citrobacter braakii
#> Akkermansia muciniphila Akkermansia muciniphila
#> Flavonifractor plautii Flavonifractor plautii
#> Ruthenibacterium lactatiformans Ruthenibacterium lactatiformans
#> Escherichia coli Escherichia coli
#> Citrobacter youngae Citrobacter youngae
taxa <- getTaxaData(CrcBiomeScreenObject)
colnames(taxa) <- c("Kingdom","Phylum","Class","Order","Family","Genus","Species")
setTaxaData(CrcBiomeScreenObject) <- taxa
colnames(getTaxaData(CrcBiomeScreenObject))
#> [1] "Kingdom" "Phylum" "Class" "Order" "Family" "Genus" "Species"
# Keep only the genus level data
CrcBiomeScreenObject <- KeepTaxonomicLevel(CrcBiomeScreenObject,level = "Genus")
# Inspect processed data
head(getTaxaData(CrcBiomeScreenObject))
#> Kingdom Phylum Class
#> Citrobacter braakii Bacteria Proteobacteria Gammaproteobacteria
#> Akkermansia muciniphila Bacteria Verrucomicrobia Verrucomicrobiae
#> Flavonifractor plautii Bacteria Firmicutes Clostridia
#> Ruthenibacterium lactatiformans Bacteria Firmicutes Clostridia
#> Escherichia coli Bacteria Proteobacteria Gammaproteobacteria
#> Citrobacter youngae Bacteria Proteobacteria Gammaproteobacteria
#> Order Family
#> Citrobacter braakii Enterobacterales Enterobacteriaceae
#> Akkermansia muciniphila Verrucomicrobiales Akkermansiaceae
#> Flavonifractor plautii Eubacteriales Oscillospiraceae
#> Ruthenibacterium lactatiformans Eubacteriales Oscillospiraceae
#> Escherichia coli Enterobacterales Enterobacteriaceae
#> Citrobacter youngae Enterobacterales Enterobacteriaceae
#> Genus
#> Citrobacter braakii Citrobacter
#> Akkermansia muciniphila Akkermansia
#> Flavonifractor plautii Flavonifractor
#> Ruthenibacterium lactatiformans Ruthenibacterium
#> Escherichia coli Escherichia
#> Citrobacter youngae Citrobacter
#> Species
#> Citrobacter braakii Citrobacter braakii
#> Akkermansia muciniphila Akkermansia muciniphila
#> Flavonifractor plautii Flavonifractor plautii
#> Ruthenibacterium lactatiformans Ruthenibacterium lactatiformans
#> Escherichia coli Escherichia coli
#> Citrobacter youngae Citrobacter youngae
dim(getTaxaData(CrcBiomeScreenObject))
#> [1] 471 7
CrcBiomeScreen supports multiple normalization methods. Here we demonstrate both GMPR and TSS normalization:
# Normalize using GMPR (Geometric Mean of Pairwise Ratios)
CrcBiomeScreenObject <- NormalizeData(CrcBiomeScreenObject, method = "GMPR", level = "Genus")
# Normalize using TSS (Total Sum Scaling)
CrcBiomeScreenObject <- NormalizeData(CrcBiomeScreenObject, method = "TSS", level = "Genus")
Note: GMPR normalization is generally recommended for microbiome data as it is more robust to compositional effects and outliers compared to TSS normalization.
set.seed(123)
# -------------------------
# Toy taxonomy
# -------------------------
toy_taxa_strings <- c(
"D_0__Bacteria|D_1__Firmicutes|D_2__Clostridia|D_3__OrderA|D_4__FamilyA|D_5__GenusA",
"D_0__Bacteria|D_1__Firmicutes|D_2__Clostridia|D_3__OrderB|D_4__FamilyB|D_5__GenusB",
"D_0__Bacteria|D_1__Firmicutes|D_2__Clostridia|D_3__OrderC|D_4__FamilyC|D_5__GenusC",
"D_0__Bacteria|D_1__Bacteroidetes|D_2__Bacteroidia|D_3__OrderD|D_4__FamilyD|D_5__GenusD",
"D_0__Bacteria|D_1__Bacteroidetes|D_2__Bacteroidia|D_3__OrderE|D_4__FamilyE|D_5__GenusE",
"D_0__Bacteria|D_1__Proteobacteria|D_2__Gammaproteobacteria|D_3__OrderF|D_4__FamilyF|D_5__GenusF"
)
toy_taxa <- data.frame(
Taxa = toy_taxa_strings,
stringsAsFactors = FALSE
)
# -------------------------
# Toy training abundance
# 6 taxa x 12 samples
# -------------------------
train_samples <- paste0("S", 1:12)
toy_abs <- matrix(
c(
rpois(6 * 6, lambda = 54.8887777), # controls
rpois(6 * 6, lambda = 55) # CRC
),
nrow = 6,
ncol = 12
)
rownames(toy_abs) <- toy_taxa_strings
colnames(toy_abs) <- train_samples
toy_abs <- as.data.frame(toy_abs)
toy_sample <- data.frame(
number_reads = rep(10000, 12),
study_condition = c(rep("control", 6), rep("CRC", 6)),
row.names = train_samples,
stringsAsFactors = FALSE
)
# -------------------------
# 1. Create toy training data
# -------------------------
toy_obj <- CreateCrcBiomeScreenObject(
AbsoluteAbundance = toy_abs,
TaxaData = toy_taxa,
SampleData = toy_sample
)
# -------------------------
# 2. Keep one taxonomic level and normalize
# -------------------------
toy_obj <- SplitTaxas(toy_obj)
toy_obj <- KeepTaxonomicLevel(toy_obj, level = "Genus")
toy_obj <- NormalizeData(toy_obj, method = "TSS", level = "Genus")
# -------------------------
# 3. Split dataset
# -------------------------
toy_obj <- SplitDataSet(
toy_obj,
label = c("control", "CRC"),
partition = 0.7
)
# Perform quality control using cmdscale
toy_obj <- qcByCmdscale(
toy_obj,
TaskName = "toy_obj",
normalize_method = "TSS",
plot = FALSE
)
# Check class balance in the training data
checkClassBalance(getModelData(toy_obj)$TrainLabel)
#> $class_counts
#> labels
#> CRC control
#> 5 5
#>
#> $class_proportions
#> labels
#> CRC control
#> 0.5 0.5
#>
#> $is_imbalanced
#> [1] FALSE
#>
#> $suggestion
#> [1] "Balanced: Class weights are likely unnecessary."
# -------------------------------
# 4. Train and Evaluate RF model
# -------------------------------
toy_obj <- TrainModels(
toy_obj,
model_type = "RF",
TaskName = "toy_rf",
ClassWeights = FALSE,
TrueLabel = "CRC",
num_cores = 1,
n_cv = 2
)
toy_obj <- EvaluateModel(
toy_obj,
model_type = "RF",
TaskName = "ToyData_RF_Test",
TrueLabel = "CRC",
PlotAUC = FALSE
)
# -------------------------
# 5. Create toy validation data
# -------------------------
val_taxa_strings <- c(
"D_0__Bacteria|D_1__Firmicutes|D_2__Clostridia|D_3__OrderA|D_4__FamilyA|D_5__GenusA",
"D_0__Bacteria|D_1__Firmicutes|D_2__Clostridia|D_3__OrderB|D_4__FamilyB|D_5__GenusB",
"D_0__Bacteria|D_1__Firmicutes|D_2__Clostridia|D_3__OrderC|D_4__FamilyC|D_5__GenusC",
"D_0__Bacteria|D_1__Bacteroidetes|D_2__Bacteroidia|D_3__OrderD|D_4__FamilyD|D_5__GenusD",
"D_0__Bacteria|D_1__Bacteroidetes|D_2__Bacteroidia|D_3__OrderE|D_4__FamilyE|D_5__GenusE",
"D_0__Bacteria|D_1__Proteobacteria|D_2__Gammaproteobacteria|D_3__OrderF|D_4__FamilyF|D_5__GenusF"
)
val_taxa <- data.frame(
Taxa = val_taxa_strings,
stringsAsFactors = FALSE
)
val_samples <- paste0("V", 1:8)
val_abund <- matrix(
c(
rpois(6 * 4, lambda = 38),
rpois(6 * 4, lambda = 48)
),
nrow = 6,
ncol = 8
)
rownames(val_abund) <- val_taxa_strings
colnames(val_abund) <- val_samples
val_abund <- as.data.frame(val_abund)
val_sample <- data.frame(
number_reads = rep(10000, 8),
study_condition = c(rep("control", 4), rep("CRC", 4)),
condition = c(rep("control", 4), rep("CRC", 4)),
row.names = val_samples,
stringsAsFactors = FALSE
)
val_obj <- CreateCrcBiomeScreenObject(
AbsoluteAbundance = val_abund,
TaxaData = val_taxa,
SampleData = val_sample
)
val_obj <- SplitTaxas(val_obj)
val_obj <- KeepTaxonomicLevel(val_obj, level = "Genus")
val_obj <- NormalizeData(val_obj, method = "TSS", level = "Genus")
# -------------------------
# 6. Align features
# -------------------------
train_norm <- getNormalizedData(toy_obj)
val_norm <- getNormalizedData(val_obj)
common_features <- intersect(colnames(train_norm), colnames(val_norm))
setNormalizedData(toy_obj) <- train_norm[, common_features, drop = FALSE]
setNormalizedData(val_obj) <- val_norm[, common_features, drop = FALSE]
# -------------------------
# 7. Validate model
# -------------------------
validated_obj <- ValidateModelOnData(
toy_obj,
ValidationData = val_obj,
model_type = "RF",
TaskName = "toy_validation",
TrueLabel = "CRC",
PlotAUC = FALSE
)
# ----------------------------------
# 8. Minimal end-to-end RunScreening
# ----------------------------------
toy_obj2 <- CreateCrcBiomeScreenObject(
AbsoluteAbundance = toy_abs,
TaxaData = toy_taxa,
SampleData = toy_sample
)
toy_obj2 <- SplitTaxas(toy_obj2)
toy_obj2 <- KeepTaxonomicLevel(toy_obj2, level = "Genus")
toy_obj2 <- NormalizeData(toy_obj2, method = "TSS", level = "Genus")
toy_obj2 <- RunScreening(
obj = toy_obj2,
model_type = "RF",
partition = 0.7,
split.requirement = list(
label = c("control", "CRC"),
condition_col = "study_condition"
),
ClassWeights = FALSE,
n_cv = 2,
num_cores = 1,
TaskName = "RF_TSS_toydata",
ValidationData = val_obj,
TrueLabel = "CRC"
)
# -------------------------
# 9. Prediction on new data
# -------------------------
toy_obj <- PredictCrcBiomeScreen(
toy_obj,
newdata = getNormalizedData(val_obj),
model_type = "RF"
)
# Inspect the predictions
head(getPredictResult(toy_obj)$RF)
#> CRC control
#> V1 0.4977778 0.5022222
#> V2 0.4961111 0.5038889
#> V3 0.5572222 0.4427778
#> V4 0.4561111 0.5438889
#> V5 0.5238889 0.4761111
#> V6 0.5672222 0.4327778
#> control CRC
#> V1 0.5122222 0.4877778
#> V2 0.6066667 0.3933333
#> V3 0.4205556 0.5794444
#> V4 0.5255556 0.4744444
#> V5 0.4772222 0.5227778
#> V6 0.4677778 0.5322222
predictions <- getPredictResult(toy_obj)$RF
# -------------------------
# 10. Evaluate predictions
# -------------------------
evaluation_results <- EvaluateCrcBiomeScreen(
predictions = predictions,
true_labels = getSampleData(val_obj)$study_condition,
TrueLabel = "CRC",
TaskName = "Toy_Prediction_Eval"
)
Before model training, we need to filter the data to include only CRC and control samples, then split into training and test sets:
# Check the available labels in the data
table(getSampleData(CrcBiomeScreenObject)$study_condition)
#>
#> CRC adenoma control
#> 29 27 24
# Filter data to keep only CRC and control samples
CrcBiomeScreenObject <- FilterDataSet(
CrcBiomeScreenObject,
label = c("CRC", "control"),
condition_col = "study_condition"
)
# Verify the filtering results
table(getSampleData(CrcBiomeScreenObject)$study_condition)
#>
#> CRC control
#> 29 24
# Split data into training (70%) and test (30%) sets
CrcBiomeScreenObject <- SplitDataSet(
CrcBiomeScreenObject,
label = c("control", "CRC"),
partition = 0.7
)
Quality control using classical multidimensional scaling can help identify potential batch effects or outliers:
# Perform quality control using cmdscale
CrcBiomeScreenObject <- qcByCmdscale(
CrcBiomeScreenObject,
TaskName = "Normalize_ToyData_filtered_qc",
normalize_method = "GMPR",
plot = FALSE
)
# Check class balance in the training data
checkClassBalance(getModelData(CrcBiomeScreenObject)$TrainLabel)
#> $class_counts
#> labels
#> CRC control
#> 21 17
#>
#> $class_proportions
#> labels
#> CRC control
#> 0.5526316 0.4473684
#>
#> $is_imbalanced
#> [1] TRUE
#>
#> $suggestion
#> [1] "Unbalanced: Using class weights is recommended."
Training machine-learning models for microbiome classification often requires repeated cross-validation, hyperparameter search, and ensemble methods. These steps are computationally demanding and cannot be executed within the 15-minute limit of the Bioconductor build system.
For this reason, the full training code is provided for users, but is not executed when building the vignette. Users may run the pipeline locally to reproduce the complete analysis.
CrcBiomeScreen supports two machine learning algorithms: Random Forest and XGBoost. Both models can be trained with class balancing to handle imbalanced datasets:
To ensure the robustness and generalizability of the trained models, external validation on independent datasets is crucial
Once you’ve trained a model, you’ll want to apply it to new data and assess its performance. Our package offers a streamlined, two-step process for this: first, generate predictions on your new dataset, and then, if true labels are available, evaluate the model’s accuracy. This modular approach allows for flexibility, whether you’re predicting outcomes for unknown samples or measuring model quality on a validation set.
Use the PredictCrcBiomeScreen() function to generate prediction probabilities for your new dataset. This function takes your trained CrcBiomeScreenObject and the new data, returning the raw prediction scores for each sample.
# Assuming 'CrcBiomeScreenObject' contains a trained XGBoost model
# and 'MyNewData' is your new dataset (e.g., another CrcBiomeScreenObject or a data frame).
# Generate predictions for the new dataset
ValidationData_filtered_qc <- PredictCrcBiomeScreen(
CrcBiomeScreenObject,
newdata = getNormalizedData(ValidationData_filtered_qc), # Use the appropriate data slot from your new data
model_type = "RF"
)
# Inspect the predictions
head(getPredictResult(ValidationData_filtered_qc)$RF)
predictions <- getPredictResult(ValidationData_filtered_qc)$RF
If your new dataset includes known true labels, you can evaluate the model’s performance using the EvaluateCrcBiomeScreen() function. This function takes the predictions generated in the previous step and the true_labels to calculate key metrics like the AUC and plot the ROC curve.
# Assuming 'MyNewData' has a 'study_condition' column in its SampleData
# and the true positive label is "CRC".
# Evaluate the predictions
evaluation_results <- EvaluateCrcBiomeScreen(
predictions = predictions,
true_labels = getSampleData(ValidationData_filtered_qc)$study_condition, # Provide the actual true labels
TrueLabel = "CRC", # Specify the positive class label
TaskName = "MyNewDatasetEvaluation" # A name for the output plot/files
)
# View the evaluation results
print(evaluation_results$AUC)
For users who want to run the entire screening workflow in one step, CrcBiomeScreen provides the RunScreening() function. This function encapsulates the complete process from data preparation to model validation, offering a comprehensive and easy-to-use solution.
# Load the toy dataset from curatedMetagenomicData
data(Thomas_2018_RelativeAbundance, package = "CrcBiomeScreen")
# Create the CrcBiomeScreenObject
tse <- Thomas_2018_RelativeAbundance$`2021-03-31.ThomasAM_2018a.relative_abundance`
CrcBiomeScreenObject <- CreateCrcBiomeScreenObjectFromTSE(tse)
taxa <- getTaxaData(CrcBiomeScreenObject)
colnames(taxa) <- c("Kingdom","Phylum","Class","Order","Family","Genus","Species")
setTaxaData(CrcBiomeScreenObject) <- taxa
# Keep only the genus level data
CrcBiomeScreenObject <- KeepTaxonomicLevel(CrcBiomeScreenObject,level = "Genus")
# Inspect processed data
head(getTaxaData(CrcBiomeScreenObject))
# Normalize using GMPR (Geometric Mean of Pairwise Ratios)
CrcBiomeScreenObject <- NormalizeData(CrcBiomeScreenObject, method = "GMPR", level = "Genus")
# Filter data to keep only CRC and control samples
CrcBiomeScreenObject <- FilterDataSet(
CrcBiomeScreenObject,
label = c("CRC", "control"),
condition_col = "study_condition"
)
# Perform quality control using cmdscale
CrcBiomeScreenObject <- qcByCmdscale(
CrcBiomeScreenObject,
TaskName = "Normalize_ToyData_filtered_qc",
normalize_method = "GMPR",
plot = FALSE
)
# Verify the filtering results
table(getSampleData(CrcBiomeScreenObject)$study_condition)
# Split data into training (70%) and test (30%) sets
CrcBiomeScreenObject <- SplitDataSet(
CrcBiomeScreenObject,
label = c("control", "CRC"),
partition = 0.7
)
# Check class balance in the training data
checkClassBalance(getModelData(CrcBiomeScreenObject)$TrainLabel)
# Load validation data
data("ZellerG_2014_RelativeAbundance")
ValidationData_tse <- ZellerG_2014_RelativeAbundance$`2021-03-31.ZellerG_2014.relative_abundance`
# Create CrcBiomeScreenObject for validation data
ValidationData <- CreateCrcBiomeScreenObjectFromTSE(ValidationData_tse)
# Process validation data similarly to training data
# ValidationData <- SplitTaxas(ValidationData)
taxa <- getTaxaData(ValidationData)
colnames(taxa) <- c("Kingdom","Phylum","Class","Order","Family","Genus","Species")
setTaxaData(ValidationData) <- taxa
ValidationData <- KeepTaxonomicLevel(ValidationData,level = "Genus")
ValidationData <- NormalizeData(ValidationData, method = "GMPR", level = "Genus")
# Align features between training and validation data
val_norm <- getNormalizedData(ValidationData)
crc_norm <- getNormalizedData(CrcBiomeScreenObject)
# Ensure consistent features between training and validation data
val_norm <- val_norm[, colnames(val_norm) %in% colnames(crc_norm)]
crc_norm <- crc_norm[, colnames(crc_norm) %in% colnames(val_norm)]
# Set back the normalized data
setNormalizedData(ValidationData) <- val_norm
setNormalizedData(CrcBiomeScreenObject) <- crc_norm
# Filter validation data
ValidationData_filtered <- FilterDataSet(
ValidationData,
label = c("CRC", "control"),
condition_col = "study_condition"
)
# Quality control for validation data
ValidationData_filtered_qc <- qcByCmdscale(
ValidationData_filtered,
TaskName = "Normalize_ValidationData_filtered_qc",
normalize_method = "GMPR",
plot = FALSE
)
# Run the complete screening workflow
ValidationData_filtered_qc <- RunScreening(
# Input data
obj = CrcBiomeScreenObject,
# Model and data splitting
model_type = "RF",
partition = 0.7,
split.requirement = list(
label = c("control", "CRC"),
condition_col = "study_condition"
),
ClassWeights = FALSE,
# Cross-validation and parallelization
n_cv = 10,
num_cores = 10,
# Task and output naming
TaskName = "RF_GMPR_toydata",
# External validation
ValidationData = ValidationData_filtered_qc,
TrueLabel = "CRC"
)
# Example with XGBoost model
ValidationData_filtered_qc <- RunScreening(
CrcBiomeScreenObject,
model = "RF",
partition = 0.7,
split.requirement = list(
label = c("control", "CRC"),
condition_col = "study_condition"
),
ClassWeights = TRUE,
n_cv = 10,
num_cores = 10,
TaskName = "RF_GMPR_toydata",
ValidationData = ValidationData_filtered_qc,
TrueLabel = "CRC"
)
This function performs the complete workflow, including:
RF or XGBoost) using k-fold cross-validation to ensure robustness.ValidationData is provided): Applies the trained model to an entirely independent dataset to confirm its generalizability and real-world applicability. This step also stores the individual sample predictions and evaluation metrics in the CrcBiomeScreenObject.ClassWeights = TRUE when dealing with imbalanced datasetsEnsure that both training and validation datasets have been processed identically and contain overlapping features.
num_cores parameter) to reduce memory usageThis vignette was created under the following conditions:
sessionInfo()
#> R version 4.6.0 alpha (2026-04-05 r89794)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] foreach_1.5.2 pROC_1.19.0.1 ranger_0.18.0
#> [4] CrcBiomeScreen_0.99.15 BiocStyle_2.39.0
#>
#> loaded via a namespace (and not attached):
#> [1] inline_0.3.21 permute_0.9-10
#> [3] rlang_1.2.0 magrittr_2.0.5
#> [5] clue_0.3-68 otel_0.2.0
#> [7] e1071_1.7-17 matrixStats_1.5.0
#> [9] compiler_4.6.0 mgcv_1.9-4
#> [11] vctrs_0.7.3 reshape2_1.4.5
#> [13] rmutil_1.1.10 stringr_1.6.0
#> [15] pkgconfig_2.0.3 crayon_1.5.3
#> [17] fastmap_1.2.0 XVector_0.51.0
#> [19] modeest_2.4.0 rmarkdown_2.31
#> [21] prodlim_2026.03.11 purrr_1.2.2
#> [23] xfun_0.57 cachem_1.1.0
#> [25] jsonlite_2.0.0 progress_1.2.3
#> [27] recipes_1.3.2 DelayedArray_0.37.1
#> [29] BiocParallel_1.45.0 cluster_2.1.8.2
#> [31] parallel_4.6.0 prettyunits_1.2.0
#> [33] R6_2.6.1 bslib_0.10.0
#> [35] stringi_1.8.7 RColorBrewer_1.1-3
#> [37] parallelly_1.47.0 rpart_4.1.27
#> [39] GenomicRanges_1.63.2 lubridate_1.9.5
#> [41] jquerylib_0.1.4 Rcpp_1.1.1-1
#> [43] Seqinfo_1.1.0 bookdown_0.46
#> [45] SummarizedExperiment_1.41.1 iterators_1.0.14
#> [47] knitr_1.51 future.apply_1.20.2
#> [49] IRanges_2.45.0 Matrix_1.7-5
#> [51] splines_4.6.0 nnet_7.3-20
#> [53] timechange_0.4.0 tidyselect_1.2.1
#> [55] dichromat_2.0-0.1 abind_1.4-8
#> [57] yaml_2.3.12 vegan_2.7-3
#> [59] TreeSummarizedExperiment_2.19.0 timeDate_4052.112
#> [61] doParallel_1.0.17 codetools_0.2-20
#> [63] listenv_0.10.1 lattice_0.22-9
#> [65] tibble_3.3.1 plyr_1.8.9
#> [67] withr_3.0.2 Biobase_2.71.0
#> [69] treeio_1.35.0 S7_0.2.1-1
#> [71] stable_1.1.7 evaluate_1.0.5
#> [73] future_1.70.0 survival_3.8-6
#> [75] proxy_0.4-29 Biostrings_2.79.5
#> [77] pillar_1.11.1 BiocManager_1.30.27
#> [79] MatrixGenerics_1.23.0 stats4_4.6.0
#> [81] generics_0.1.4 S4Vectors_0.49.2
#> [83] hms_1.1.4 ggplot2_4.0.2
#> [85] scales_1.4.0 tidytree_0.4.7
#> [87] timeSeries_4052.112 globals_0.19.1
#> [89] class_7.3-23 glue_1.8.1
#> [91] statip_0.2.3 lazyeval_0.2.3
#> [93] tools_4.6.0 data.table_1.18.2.1
#> [95] spatial_7.3-18 fBasics_4052.98
#> [97] ModelMetrics_1.2.2.2 gower_1.0.2
#> [99] fs_2.0.1 grid_4.6.0
#> [101] tidyr_1.3.2 ape_5.8-1
#> [103] ipred_0.9-15 SingleCellExperiment_1.33.2
#> [105] nlme_3.1-169 cli_3.6.6
#> [107] rappdirs_0.3.4 S4Arrays_1.11.1
#> [109] lava_1.9.0 doFuture_1.2.1
#> [111] dplyr_1.2.1 gtable_0.3.6
#> [113] stabledist_0.7-2 yulab.utils_0.2.4
#> [115] sass_0.4.10 digest_0.6.39
#> [117] progressr_0.19.0 BiocGenerics_0.57.1
#> [119] caret_7.0-1 ggrepel_0.9.8
#> [121] SparseArray_1.11.13 farver_2.1.2
#> [123] htmltools_0.5.9 lifecycle_1.0.5
#> [125] hardhat_1.4.3 statmod_1.5.1
#> [127] GUniFrac_1.9 MASS_7.3-65
If you use CrcBiomeScreen in your research, please cite:
[Package citation information]
For questions, bug reports, or feature requests, please visit the GitHub repository: https://github.com/iChronostasis/CrcBiomeScreen
Note: This vignette demonstrates the basic usage of CrcBiomeScreen. For more advanced usage and customization options, please refer to the function documentation and additional tutorials.