This vignette builds on the foundations established in
vignette("immLynx_vignette", package = "immLynx") to explore advanced
analytical workflows for T-cell receptor (TCR) repertoire data. While
the introductory vignette demonstrates each analysis function
individually, real-world studies often require combining multiple methods
to gain complementary views of repertoire biology. For instance,
sequence clustering identifies groups of structurally related TCRs,
while generation probability highlights sequences that are unlikely to
arise from random recombination and may therefore reflect
antigen-driven selection. Protein language model embeddings offer yet
another perspective by capturing learned biochemical features that do
not depend on predefined distance metrics.
This vignette covers the following advanced topics:
We begin by loading the required packages and the example dataset. The
immLynx_example object is a SingleCellExperiment containing
scRNA-seq data with TCR annotations from scRepertoire.
library(immLynx)
library(scran)
library(scater)
library(ggplot2)
data("immLynx_example")
The MCL algorithm used by clusTCR has a key parameter, inflation, that
controls the granularity of cluster assignments. Lower inflation values
(e.g., 2.0) favor larger, more inclusive clusters, while higher values
(e.g., 3.0) produce tighter, more specific groupings. Choosing the
appropriate inflation parameter depends on the biological question: broad
clusters may capture groups of TCRs with shared structural motifs, while
tight clusters more closely approximate antigen-specificity groups.
Here we run clusTCR with two different inflation parameters on the same
dataset and compare the resulting cluster counts. By using distinct
column_prefix values, both sets of assignments are stored in the same
SingleCellExperiment object for side-by-side comparison.
sce_mcl_2 <- runClustTCR(immLynx_example,
inflation = 2.0,
column_prefix = "mcl_2")
sce_mcl_3 <- runClustTCR(sce_mcl_2,
inflation = 3.0,
column_prefix = "mcl_3")
cat("MCL (inflation=2):",
length(unique(na.omit(sce_mcl_3$mcl_2_TRB))),
"clusters\n")
cat("MCL (inflation=3):",
length(unique(na.omit(sce_mcl_3$mcl_3_TRB))),
"clusters\n")
The ESM-2 family of protein language models ranges from 8 million to 15 billion parameters. Larger models generally produce embeddings that better capture structural and functional properties of protein sequences, but they require substantially more memory and computation time. For TCR CDR3 sequences, which are short (typically 10-20 amino acids), smaller models often perform well and may be preferable for large-scale analyses.
The following example compares embeddings from the 35M and 650M
parameter variants. Both sets of embeddings are stored as separate
dimensionality reductions in the SingleCellExperiment and can be
independently visualized with UMAP.
sce_small <- runEmbeddings(
immLynx_example,
model_name = "facebook/esm2_t12_35M_UR50D",
reduction_name = "esm_small"
)
sce_med <- runEmbeddings(
sce_small,
model_name = "facebook/esm2_t33_650M_UR50D",
reduction_name = "esm_medium"
)
sce_med <- scater::runUMAP(sce_med,
dimred = "esm_small",
name = "umap_small")
sce_med <- scater::runUMAP(sce_med,
dimred = "esm_medium",
name = "umap_medium")
p1 <- scater::plotReducedDim(sce_med,
dimred = "umap_small") +
ggtitle("ESM-2 35M")
p2 <- scater::plotReducedDim(sce_med,
dimred = "umap_medium") +
ggtitle("ESM-2 650M")
p1 + p2
For paired alpha-beta TCR data, embedding both chains together captures
the joint receptor structure. The chains = "both" option concatenates
the alpha and beta CDR3 sequences before computing embeddings, producing
a single representation per cell that reflects the full receptor
identity. This is particularly useful when alpha chain pairing
contributes to antigen specificity.
sce_paired <- runEmbeddings(
immLynx_example,
chains = "both",
reduction_name = "tcr_paired"
)
sce_paired <- scater::runUMAP(sce_paired,
dimred = "tcr_paired")
scater::plotReducedDim(sce_paired, dimred = "UMAP")
immLynx is designed to complement scRepertoire, which defines clonotypes based on exact CDR3 sequence matching. By comparing clusTCR cluster assignments with scRepertoire’s clonotype definitions, we can assess how sequence-similarity-based clustering relates to strict clonotype identity. Ideally, each cluster should contain one or a few closely related clonotypes, and cells sharing the same clonotype should be assigned to the same cluster.
sce_clust <- runClustTCR(immLynx_example,
chains = "TRB")
comparison <- table(
clustcr = sce_clust$clustcr_TRB,
clonotype = sce_clust$CTstrict
)
cat("Number of clustcr clusters:", nrow(comparison), "\n")
cat("Number of unique clonotypes:", ncol(comparison), "\n")
Generation probability (Pgen) from OLGA provides a window into selection pressures acting on the TCR repertoire. Sequences with low Pgen are rarely produced by V(D)J recombination; their presence in the observed repertoire suggests positive selection, potentially driven by antigen encounter. By comparing Pgen distributions across experimental conditions or sample types, we can identify conditions associated with enrichment of rare, potentially antigen-selected sequences.
sce_pgen <- runOLGA(immLynx_example,
chains = "TRB")
ggplot(as.data.frame(colData(sce_pgen)), aes(x = Type,
y = olga_pgen_log10_TRB)) +
geom_boxplot() +
labs(title = "TCR Generation Probability by Sample Type",
x = "Sample Type",
y = "log10(Pgen)")
low_pgen_cells <- which(colData(sce_pgen)$olga_pgen_log10_TRB < -15)
cat("Cells with Pgen < 10^-15:", length(low_pgen_cells), "\n")
tcrdist3 distance matrices provide a continuous measure of TCR
similarity that can serve as input to a variety of unsupervised learning
methods. Here we demonstrate hierarchical clustering of the distance
matrix, which complements the MCL-based approach used by clusTCR. The
complete linkage method ensures that all members of a cluster are
within a maximum distance of each other, producing compact groups. Note
that because tcrdist3 deduplicates input sequences, the distance matrix
dimensions correspond to unique TCR sequences rather than individual
cells.
dist_results <- runTCRdist(immLynx_example, chains = "beta")
d <- as.dist(dist_results$distances$pw_beta)
hc <- hclust(d, method = "complete")
n_unique <- nrow(dist_results$distances$pw_beta)
clusters <- cutree(hc, k = min(50, n_unique))
cat("Distance matrix:", n_unique, "x", n_unique, "\n")
cat("Hierarchical clusters:", length(unique(clusters)), "\n")
For datasets with thousands or tens of thousands of cells, computational resources may become a limiting factor. The following strategies can help manage memory and runtime:
chunk_size parameter in runEmbeddings()
controls how many sequences are processed in each batch. Smaller
chunks reduce peak memory usage at the cost of slightly longer runtime.sce_large <- runEmbeddings(
large_sce,
chunk_size = 16,
pool = "mean"
)
sce_large <- runClustTCR(
large_sce,
inflation = 2.0
)
sample_cells <- sample(colnames(large_sce), 5000)
subset_obj <- large_sce[, sample_cells]
dist_results <- runTCRdist(subset_obj)
When HLA genotyping data is available, immLynx can test for associations
between metaclone assignments and specific HLA alleles using
runHLAassociation(). This analysis identifies metaclones whose
frequency differs significantly between HLA-positive and HLA-negative
individuals, suggesting potential HLA-restricted antigen recognition.
The function performs Fisher’s exact test for each metaclone-HLA
combination and returns adjusted p-values.
metaclones <- runMetaclonotypist(immLynx_example,
return_input = FALSE)
hla_data <- data.frame(
barcode = metaclones$barcode,
HLA_A_01_01 = sample(c(TRUE, FALSE),
nrow(metaclones),
replace = TRUE),
HLA_A_02_01 = sample(c(TRUE, FALSE),
nrow(metaclones),
replace = TRUE),
HLA_B_07_02 = sample(c(TRUE, FALSE),
nrow(metaclones),
replace = TRUE)
)
hla_results <- runHLAassociation(metaclones,
hla_data)
head(hla_results)
All immLynx results can be exported to standard file formats for use
with external tools or for archiving. The convertToTcrdist() function
reformats TCR data into the tabular format expected by standalone
tcrdist3, facilitating interoperability with Python-based analysis
pipelines.
tcr_data <- extractTCRdata(immLynx_example,
chains = "both",
format = "wide")
tcrdist_format <- convertToTcrdist(tcr_data,
chains = "both")
write.csv(tcrdist_format, "tcr_for_tcrdist.csv", row.names = FALSE)
clusters <- data.frame(
barcode = colnames(sce_clust),
clustcr = sce_clust$clustcr_TRB
)
write.csv(clusters, "cluster_assignments.csv", row.names = FALSE)
embeddings <- reducedDim(sce_small, "esm_small")
write.csv(embeddings, "tcr_embeddings.csv")
Data Quality: Always validate TCR data before analysis using
validateTCRdata(). This catches common issues such as missing CDR3
sequences, non-standard amino acids, and inconsistent chain
annotations that can cause downstream errors or misleading results.
validation <- validateTCRdata(tcr_data, check_sequences = TRUE)
if (!validation$valid) {
warning(validation$errors)
}Memory Management: For large datasets, use smaller chunk sizes in
runEmbeddings() and consider subsampling strategies for
distance-based methods.
Reproducibility: Set random seeds before running stochastic algorithms such as MCL clustering to ensure reproducible results.
set.seed(42)
sce <- runClustTCR(sce)Multiple Methods: Use multiple clustering or grouping methods and look for consensus. Sequences that are consistently co-clustered across methods are more likely to represent biologically meaningful groups.
Biological Validation: Computational clusters should be validated against biological evidence such as antigen specificity, phenotypic markers, or clinical outcomes whenever possible.
sessionInfo()
#> R version 4.6.0 RC (2026-04-17 r89917)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] scater_1.39.4 ggplot2_4.0.3
#> [3] scran_1.39.2 scuttle_1.21.6
#> [5] SingleCellExperiment_1.33.2 SummarizedExperiment_1.41.1
#> [7] Biobase_2.71.0 GenomicRanges_1.63.2
#> [9] Seqinfo_1.1.0 IRanges_2.45.0
#> [11] S4Vectors_0.49.2 BiocGenerics_0.57.1
#> [13] generics_0.1.4 MatrixGenerics_1.23.0
#> [15] matrixStats_1.5.0 immLynx_0.99.4
#> [17] BiocStyle_2.39.0
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.1 viridisLite_0.4.3 vipor_0.4.7
#> [4] dplyr_1.2.1 farver_2.1.2 viridis_0.6.5
#> [7] filelock_1.0.3 S7_0.2.2 fastmap_1.2.0
#> [10] bluster_1.21.1 digest_0.6.39 rsvd_1.0.5
#> [13] lifecycle_1.0.5 cluster_2.1.8.2 statmod_1.5.1
#> [16] magrittr_2.0.5 compiler_4.6.0 rlang_1.2.0
#> [19] sass_0.4.10 tools_4.6.0 igraph_2.3.0
#> [22] yaml_2.3.12 knitr_1.51 S4Arrays_1.11.1
#> [25] dqrng_0.4.1 reticulate_1.46.0 DelayedArray_0.37.1
#> [28] RColorBrewer_1.1-3 abind_1.4-8 BiocParallel_1.45.0
#> [31] withr_3.0.2 grid_4.6.0 beachmat_2.27.5
#> [34] edgeR_4.9.9 scales_1.4.0 dichromat_2.0-0.1
#> [37] cli_3.6.6 rmarkdown_2.31 otel_0.2.0
#> [40] metapod_1.19.2 immApex_1.5.4 ggbeeswarm_0.7.3
#> [43] cachem_1.1.0 stringr_1.6.0 parallel_4.6.0
#> [46] BiocManager_1.30.27 XVector_0.51.0 basilisk_1.23.0
#> [49] vctrs_0.7.3 Matrix_1.7-5 jsonlite_2.0.0
#> [52] dir.expiry_1.19.0 bookdown_0.46 BiocSingular_1.27.1
#> [55] BiocNeighbors_2.5.4 ggrepel_0.9.8 beeswarm_0.4.0
#> [58] irlba_2.3.7 locfit_1.5-9.12 limma_3.67.3
#> [61] jquerylib_0.1.4 glue_1.8.1 codetools_0.2-20
#> [64] stringi_1.8.7 gtable_0.3.6 ScaledMatrix_1.19.0
#> [67] tibble_3.3.1 pillar_1.11.1 htmltools_0.5.9
#> [70] R6_2.6.1 evaluate_1.0.5 lattice_0.22-9
#> [73] png_0.1-9 bslib_0.10.0 Rcpp_1.1.1-1.1
#> [76] gridExtra_2.3 SparseArray_1.11.13 xfun_0.57
#> [79] pkgconfig_2.0.3