Contents

1 Introduction

This vignette builds on the foundations established in vignette("immLynx_vignette", package = "immLynx") to explore advanced analytical workflows for T-cell receptor (TCR) repertoire data. While the introductory vignette demonstrates each analysis function individually, real-world studies often require combining multiple methods to gain complementary views of repertoire biology. For instance, sequence clustering identifies groups of structurally related TCRs, while generation probability highlights sequences that are unlikely to arise from random recombination and may therefore reflect antigen-driven selection. Protein language model embeddings offer yet another perspective by capturing learned biochemical features that do not depend on predefined distance metrics.

This vignette covers the following advanced topics:

2 Setup

We begin by loading the required packages and the example dataset. The immLynx_example object is a SingleCellExperiment containing scRNA-seq data with TCR annotations from scRepertoire.

library(immLynx)
library(scran)
library(scater)
library(ggplot2)

data("immLynx_example")

3 Comparing Clustering Methods

The MCL algorithm used by clusTCR has a key parameter, inflation, that controls the granularity of cluster assignments. Lower inflation values (e.g., 2.0) favor larger, more inclusive clusters, while higher values (e.g., 3.0) produce tighter, more specific groupings. Choosing the appropriate inflation parameter depends on the biological question: broad clusters may capture groups of TCRs with shared structural motifs, while tight clusters more closely approximate antigen-specificity groups.

Here we run clusTCR with two different inflation parameters on the same dataset and compare the resulting cluster counts. By using distinct column_prefix values, both sets of assignments are stored in the same SingleCellExperiment object for side-by-side comparison.

sce_mcl_2 <- runClustTCR(immLynx_example,
                            inflation = 2.0,
                            column_prefix = "mcl_2")
sce_mcl_3 <- runClustTCR(sce_mcl_2,
                            inflation = 3.0,
                            column_prefix = "mcl_3")

cat("MCL (inflation=2):",
    length(unique(na.omit(sce_mcl_3$mcl_2_TRB))),
    "clusters\n")
cat("MCL (inflation=3):",
    length(unique(na.omit(sce_mcl_3$mcl_3_TRB))),
    "clusters\n")

4 Custom Embedding Workflows

4.1 Using Different ESM-2 Model Sizes

The ESM-2 family of protein language models ranges from 8 million to 15 billion parameters. Larger models generally produce embeddings that better capture structural and functional properties of protein sequences, but they require substantially more memory and computation time. For TCR CDR3 sequences, which are short (typically 10-20 amino acids), smaller models often perform well and may be preferable for large-scale analyses.

The following example compares embeddings from the 35M and 650M parameter variants. Both sets of embeddings are stored as separate dimensionality reductions in the SingleCellExperiment and can be independently visualized with UMAP.

sce_small <- runEmbeddings(
  immLynx_example,
  model_name = "facebook/esm2_t12_35M_UR50D",
  reduction_name = "esm_small"
)

sce_med <- runEmbeddings(
  sce_small,
  model_name = "facebook/esm2_t33_650M_UR50D",
  reduction_name = "esm_medium"
)

sce_med <- scater::runUMAP(sce_med,
                      dimred = "esm_small",
                      name = "umap_small")
sce_med <- scater::runUMAP(sce_med,
                      dimred = "esm_medium",
                      name = "umap_medium")

p1 <- scater::plotReducedDim(sce_med,
              dimred = "umap_small") +
  ggtitle("ESM-2 35M")
p2 <- scater::plotReducedDim(sce_med,
              dimred = "umap_medium") +
  ggtitle("ESM-2 650M")
p1 + p2

4.2 Embedding Both Chains

For paired alpha-beta TCR data, embedding both chains together captures the joint receptor structure. The chains = "both" option concatenates the alpha and beta CDR3 sequences before computing embeddings, producing a single representation per cell that reflects the full receptor identity. This is particularly useful when alpha chain pairing contributes to antigen specificity.

sce_paired <- runEmbeddings(
  immLynx_example,
  chains = "both",
  reduction_name = "tcr_paired"
)

sce_paired <- scater::runUMAP(sce_paired,
                         dimred = "tcr_paired")
scater::plotReducedDim(sce_paired, dimred = "UMAP")

5 Integration with scRepertoire Clonotypes

immLynx is designed to complement scRepertoire, which defines clonotypes based on exact CDR3 sequence matching. By comparing clusTCR cluster assignments with scRepertoire’s clonotype definitions, we can assess how sequence-similarity-based clustering relates to strict clonotype identity. Ideally, each cluster should contain one or a few closely related clonotypes, and cells sharing the same clonotype should be assigned to the same cluster.

sce_clust <- runClustTCR(immLynx_example,
                            chains = "TRB")

comparison <- table(
  clustcr = sce_clust$clustcr_TRB,
  clonotype = sce_clust$CTstrict
)

cat("Number of clustcr clusters:", nrow(comparison), "\n")
cat("Number of unique clonotypes:", ncol(comparison), "\n")

6 Analyzing Selection Pressure

Generation probability (Pgen) from OLGA provides a window into selection pressures acting on the TCR repertoire. Sequences with low Pgen are rarely produced by V(D)J recombination; their presence in the observed repertoire suggests positive selection, potentially driven by antigen encounter. By comparing Pgen distributions across experimental conditions or sample types, we can identify conditions associated with enrichment of rare, potentially antigen-selected sequences.

sce_pgen <- runOLGA(immLynx_example,
                       chains = "TRB")

ggplot(as.data.frame(colData(sce_pgen)), aes(x = Type,
                            y = olga_pgen_log10_TRB)) +
  geom_boxplot() +
  labs(title = "TCR Generation Probability by Sample Type",
       x = "Sample Type",
       y = "log10(Pgen)")

low_pgen_cells <- which(colData(sce_pgen)$olga_pgen_log10_TRB < -15)
cat("Cells with Pgen < 10^-15:", length(low_pgen_cells), "\n")

7 Combining Distance-Based Methods

tcrdist3 distance matrices provide a continuous measure of TCR similarity that can serve as input to a variety of unsupervised learning methods. Here we demonstrate hierarchical clustering of the distance matrix, which complements the MCL-based approach used by clusTCR. The complete linkage method ensures that all members of a cluster are within a maximum distance of each other, producing compact groups. Note that because tcrdist3 deduplicates input sequences, the distance matrix dimensions correspond to unique TCR sequences rather than individual cells.

dist_results <- runTCRdist(immLynx_example, chains = "beta")

d <- as.dist(dist_results$distances$pw_beta)
hc <- hclust(d, method = "complete")
n_unique <- nrow(dist_results$distances$pw_beta)
clusters <- cutree(hc, k = min(50, n_unique))

cat("Distance matrix:", n_unique, "x", n_unique, "\n")
cat("Hierarchical clusters:", length(unique(clusters)), "\n")

8 Working with Large Datasets

For datasets with thousands or tens of thousands of cells, computational resources may become a limiting factor. The following strategies can help manage memory and runtime:

sce_large <- runEmbeddings(
  large_sce,
  chunk_size = 16,
  pool = "mean"
)

sce_large <- runClustTCR(
  large_sce,
  inflation = 2.0
)

sample_cells <- sample(colnames(large_sce), 5000)
subset_obj <- large_sce[, sample_cells]
dist_results <- runTCRdist(subset_obj)

9 HLA Association Analysis

When HLA genotyping data is available, immLynx can test for associations between metaclone assignments and specific HLA alleles using runHLAassociation(). This analysis identifies metaclones whose frequency differs significantly between HLA-positive and HLA-negative individuals, suggesting potential HLA-restricted antigen recognition. The function performs Fisher’s exact test for each metaclone-HLA combination and returns adjusted p-values.

metaclones <- runMetaclonotypist(immLynx_example,
                                 return_input = FALSE)

hla_data <- data.frame(
  barcode = metaclones$barcode,
  HLA_A_01_01 = sample(c(TRUE, FALSE),
                       nrow(metaclones),
                       replace = TRUE),
  HLA_A_02_01 = sample(c(TRUE, FALSE),
                       nrow(metaclones),
                       replace = TRUE),
  HLA_B_07_02 = sample(c(TRUE, FALSE),
                       nrow(metaclones),
                       replace = TRUE)
)

hla_results <- runHLAassociation(metaclones,
                                 hla_data)
head(hla_results)

10 Exporting Results

All immLynx results can be exported to standard file formats for use with external tools or for archiving. The convertToTcrdist() function reformats TCR data into the tabular format expected by standalone tcrdist3, facilitating interoperability with Python-based analysis pipelines.

tcr_data <- extractTCRdata(immLynx_example,
                           chains = "both",
                           format = "wide")
tcrdist_format <- convertToTcrdist(tcr_data,
                                   chains = "both")
write.csv(tcrdist_format, "tcr_for_tcrdist.csv", row.names = FALSE)

clusters <- data.frame(
  barcode = colnames(sce_clust),
  clustcr = sce_clust$clustcr_TRB
)
write.csv(clusters, "cluster_assignments.csv", row.names = FALSE)

embeddings <- reducedDim(sce_small, "esm_small")
write.csv(embeddings, "tcr_embeddings.csv")

11 Best Practices

  1. Data Quality: Always validate TCR data before analysis using validateTCRdata(). This catches common issues such as missing CDR3 sequences, non-standard amino acids, and inconsistent chain annotations that can cause downstream errors or misleading results.

    validation <- validateTCRdata(tcr_data, check_sequences = TRUE)
    if (!validation$valid) {
      warning(validation$errors)
    }
  2. Memory Management: For large datasets, use smaller chunk sizes in runEmbeddings() and consider subsampling strategies for distance-based methods.

  3. Reproducibility: Set random seeds before running stochastic algorithms such as MCL clustering to ensure reproducible results.

    set.seed(42)
    sce <- runClustTCR(sce)
  4. Multiple Methods: Use multiple clustering or grouping methods and look for consensus. Sequences that are consistently co-clustered across methods are more likely to represent biologically meaningful groups.

  5. Biological Validation: Computational clusters should be validated against biological evidence such as antigen specificity, phenotypic markers, or clinical outcomes whenever possible.

12 Session Info

sessionInfo()
#> R version 4.6.0 RC (2026-04-17 r89917)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] scater_1.39.4               ggplot2_4.0.3              
#>  [3] scran_1.39.2                scuttle_1.21.6             
#>  [5] SingleCellExperiment_1.33.2 SummarizedExperiment_1.41.1
#>  [7] Biobase_2.71.0              GenomicRanges_1.63.2       
#>  [9] Seqinfo_1.1.0               IRanges_2.45.0             
#> [11] S4Vectors_0.49.2            BiocGenerics_0.57.1        
#> [13] generics_0.1.4              MatrixGenerics_1.23.0      
#> [15] matrixStats_1.5.0           immLynx_0.99.4             
#> [17] BiocStyle_2.39.0           
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.2.1    viridisLite_0.4.3   vipor_0.4.7        
#>  [4] dplyr_1.2.1         farver_2.1.2        viridis_0.6.5      
#>  [7] filelock_1.0.3      S7_0.2.2            fastmap_1.2.0      
#> [10] bluster_1.21.1      digest_0.6.39       rsvd_1.0.5         
#> [13] lifecycle_1.0.5     cluster_2.1.8.2     statmod_1.5.1      
#> [16] magrittr_2.0.5      compiler_4.6.0      rlang_1.2.0        
#> [19] sass_0.4.10         tools_4.6.0         igraph_2.3.0       
#> [22] yaml_2.3.12         knitr_1.51          S4Arrays_1.11.1    
#> [25] dqrng_0.4.1         reticulate_1.46.0   DelayedArray_0.37.1
#> [28] RColorBrewer_1.1-3  abind_1.4-8         BiocParallel_1.45.0
#> [31] withr_3.0.2         grid_4.6.0          beachmat_2.27.5    
#> [34] edgeR_4.9.9         scales_1.4.0        dichromat_2.0-0.1  
#> [37] cli_3.6.6           rmarkdown_2.31      otel_0.2.0         
#> [40] metapod_1.19.2      immApex_1.5.4       ggbeeswarm_0.7.3   
#> [43] cachem_1.1.0        stringr_1.6.0       parallel_4.6.0     
#> [46] BiocManager_1.30.27 XVector_0.51.0      basilisk_1.23.0    
#> [49] vctrs_0.7.3         Matrix_1.7-5        jsonlite_2.0.0     
#> [52] dir.expiry_1.19.0   bookdown_0.46       BiocSingular_1.27.1
#> [55] BiocNeighbors_2.5.4 ggrepel_0.9.8       beeswarm_0.4.0     
#> [58] irlba_2.3.7         locfit_1.5-9.12     limma_3.67.3       
#> [61] jquerylib_0.1.4     glue_1.8.1          codetools_0.2-20   
#> [64] stringi_1.8.7       gtable_0.3.6        ScaledMatrix_1.19.0
#> [67] tibble_3.3.1        pillar_1.11.1       htmltools_0.5.9    
#> [70] R6_2.6.1            evaluate_1.0.5      lattice_0.22-9     
#> [73] png_0.1-9           bslib_0.10.0        Rcpp_1.1.1-1.1     
#> [76] gridExtra_2.3       SparseArray_1.11.13 xfun_0.57          
#> [79] pkgconfig_2.0.3