Contents

1 Introduction

T-cell receptors (TCRs) are heterodimeric surface proteins that mediate adaptive immune recognition. Each TCR consists of an alpha and beta chain (or gamma and delta in a smaller subset of T cells), with the complementarity-determining region 3 (CDR3) serving as the primary determinant of antigen specificity. Advances in single-cell RNA sequencing (scRNA-seq) now enable simultaneous profiling of gene expression and TCR sequences at cellular resolution, producing rich datasets that benefit from computational analysis of repertoire diversity, clonal structure, and sequence similarity.

A growing ecosystem of Python-based tools has emerged for TCR repertoire analysis, including tcrdist3 for pairwise distance calculation, OLGA for generation probability estimation, soNNia for selection inference, clusTCR for sequence clustering, metaclonotypist for metaclone discovery, and the ESM-2 family of protein language models for learned sequence representations. While these tools are powerful individually, using them together requires managing multiple Python environments and translating data between frameworks—a barrier for researchers working primarily in R.

immLynx addresses this challenge by providing a unified R interface to these Python tools through the basilisk package, which manages isolated conda environments transparently. All functions accept and return SingleCellExperiment objects, ensuring seamless integration with the Bioconductor single-cell ecosystem, including scRepertoire for clonotype assignment and immApex for TCR data extraction. The package supports the following analytical capabilities:

This vignette demonstrates the core functionality of immLynx, walking through data preparation, individual analysis functions, and a combined workflow. For advanced use cases including method comparison and custom embedding strategies, see the companion vignette vignette("advanced_analysis", package = "immLynx").

2 Installation

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("immLynx")

2.1 Python Dependencies

immLynx uses basilisk to manage Python dependencies automatically. The first time you call a function that invokes Python, basilisk will create an isolated conda environment containing all required packages (tcrdist3, olga, soNNia, clusTCR, metaclonotypist, and PyTorch with the ESM-2 model). This initial setup may take several minutes but only needs to occur once per installation. Subsequent calls reuse the existing environment with minimal overhead.

3 Quick Start

3.1 Loading the Package and Example Data

We begin by loading immLynx along with scran and scater, which provide utilities for single-cell analysis and visualization. The immLynx_example dataset is a SingleCellExperiment object containing scRNA-seq data from multiple patients, with TCR repertoire information added via scRepertoire.

library(immLynx)
library(scran)
library(scater)

data("immLynx_example")

immLynx_example
#> class: SingleCellExperiment 
#> dim: 2000 500 
#> metadata(0):
#> assays(2): counts logcounts
#> rownames(2000): TRBV7-2 HSPA1B ... AC119396.1 BHLHE40
#> rowData names(0):
#> colnames(500): P17B_AAACCTGCAACACGCC-1 P17B_AAACGGGTCTCCAGGG-1 ...
#>   P20L_TCAACGATCCCTCAGT-1 P20L_TCTGGAAGTGGAAAGA-1
#> colData names(16): orig.ident nCount_RNA ... ident clusters
#> reducedDimNames(1): UMAP
#> mainExpName: RNA
#> altExpNames(0):

3.2 Extracting TCR Data

Before running downstream analyses, it is often useful to extract TCR data into a data.frame format for inspection and quality control. The extractTCRdata() function wraps immApex::getIR() and supports both long format (one row per chain) and wide format (one row per cell with columns for each chain). The validateTCRdata() function checks for common issues such as missing CDR3 sequences, non-standard amino acids, and inconsistent chain annotations.

tcr_data <- extractTCRdata(immLynx_example,
                           chains = "TRB")
head(tcr_data)
#>             cdr3_aa        v    d       j     c                 barcode chain
#> 1     CAISEQGKGELFF TRBV10-3 <NA> TRBJ2-2 TRBC2 P17B_AAACCTGCAACACGCC-1   TRB
#> 2 CASSVRRERANTGELFF    TRBV9 <NA> TRBJ2-2 TRBC2 P17B_AAACGGGTCTCCAGGG-1   TRB
#> 4 CASSVRRERANTGELFF    TRBV9 <NA> TRBJ2-2 TRBC2 P17B_AACTTTCTCAGCGACC-1   TRB
#> 6   CASSSRQDSTDTQYF   TRBV27 <NA> TRBJ2-3 TRBC2 P17B_AATCGGTCAAGCCCAC-1   TRB
#> 8 CASSVRRERANTGELFF    TRBV9 <NA> TRBJ2-2 TRBC2 P17B_ACGAGCCTCATTTGGG-1   TRB
#> 9  CASSPRAGTPNTEAFF  TRBV4-1 <NA> TRBJ1-1 TRBC1 P17B_ACGCCAGGTTCATGGT-1   TRB

tcr_wide <- extractTCRdata(immLynx_example,
                           chains = "both",
                           format = "wide")

validation <- validateTCRdata(tcr_data)
print(validation)
#> $valid
#> [1] TRUE
#> 
#> $errors
#> character(0)
#> 
#> $warnings
#> character(0)
#> 
#> $summary
#> $summary$n_rows
#> [1] 306
#> 
#> $summary$n_unique_barcodes
#> [1] 306
#> 
#> $summary$n_unique_cdr3
#> [1] 275
#> 
#> $summary$n_na_cdr3
#> [1] 0
#> 
#> $summary$chains
#> [1] "TRB"

3.3 Summarizing TCR Repertoire

The summarizeTCRrepertoire() function computes standard repertoire statistics including clonotype frequencies, diversity indices (Shannon entropy, Simpson index, clonality), CDR3 length distributions, and V/J gene usage. These metrics provide a high-level overview of repertoire structure before detailed analysis.

summary <- summarizeTCRrepertoire(immLynx_example,
                                  chains = "TRB")
print(summary)
#> === TCR Repertoire Summary ===
#> Chain(s): TRB 
#> 
#> --- Basic Statistics ---
#> Total cells with TCR: 306 
#> Unique clonotypes: 275 
#> Clonotype ratio: 0.8987 
#> 
#> --- Diversity Metrics ---
#> Shannon entropy: 5.511 
#> Clonality: 0.0188 
#> Simpson index: 0.0057 
#> Inverse Simpson: 174.69 
#> 
#> --- Top Clonotypes ---
#>            cdr3_aa count  proportion
#>  CASSVRRERANTGELFF    14 0.045751634
#>      CASSPTVAGEQFF     3 0.009803922
#>   CASSQAPFSTSGELFF     3 0.009803922
#>   CASSQDRTGLDYEQYF     3 0.009803922
#>   CASSRLRTGALYEQYF     3 0.009803922
#> 
#> --- CDR3 Length Distribution ---
#> Mean: 14.8 
#> Median: 15 
#> Range: 10 - 21

4 Analysis Functions

4.1 TCR Clustering with clusTCR

clusTCR groups TCR sequences by CDR3 similarity using the Markov Cluster (MCL) algorithm. The inflation parameter controls cluster granularity: higher values produce more, smaller clusters, while lower values yield fewer, larger clusters. Cluster assignments are stored directly in the colData of the SingleCellExperiment object, making them immediately available for downstream visualization and statistical testing.

sce <- runClustTCR(
  immLynx_example,
  chains = "TRB",
  method = "mcl",
  inflation = 2.0
)

table(sce$clustcr_TRB)

4.2 TCR Distance Calculations with tcrdist3

tcrdist3 computes pairwise distances between TCR sequences using a substitution-matrix-based metric that accounts for CDR3 amino acid sequence, V gene, and J gene identity. The resulting distance matrix can be used for hierarchical clustering, nearest-neighbor analysis, or as input to dimensionality reduction methods. Note that tcrdist3 deduplicates sequences internally, so the distance matrix dimensions correspond to the number of unique TCR sequences rather than the total number of cells.

dist_results <- runTCRdist(
  immLynx_example,
  chains = "beta",
  organism = "human"
)

dim(dist_results$distances$pw_beta)

4.3 Generation Probability with OLGA

OLGA (Optimized Likelihood estimate of immunoGlobulin Amino-acid sequences) computes the generation probability (Pgen) of a TCR sequence—the probability that V(D)J recombination produces that specific CDR3 amino acid sequence. Low Pgen values indicate sequences that are unlikely to arise by chance, which may suggest antigen-driven selection. The runOLGA() function adds log10-transformed Pgen values to the colData of the input object.

sce <- runOLGA(
  immLynx_example,
  chains = "TRB",
  model = "humanTRB"
)

hist(log10(sce$olga_pgen_TRB), breaks = 50)

The generateOLGA() function can also produce synthetic TCR sequences sampled from the recombination model, which is useful for constructing background distributions or benchmarking analyses.

random_seqs <- generateOLGA(n = 100,
                            model = "humanTRB")
head(random_seqs)

4.4 Protein Language Model Embeddings

Protein language models such as ESM-2 learn dense vector representations of amino acid sequences from large-scale protein databases. When applied to CDR3 sequences, these embeddings capture biochemical and structural properties that complement sequence-based distance metrics. The runEmbeddings() function generates per-sequence embeddings and stores them as a dimensionality reduction in the SingleCellExperiment object, enabling visualization with standard tools such as UMAP or PCA. Multiple pooling strategies (mean, cls, max) are available to aggregate per-residue representations into a fixed-length vector.

sce <- runEmbeddings(
  immLynx_example,
  chains = "TRB",
  model_name = "facebook/esm2_t12_35M_UR50D",
  pool = "mean"
)

sce <- scater::runUMAP(sce,
                       dimred = "tcr_esm")
scater::plotReducedDim(sce, dimred = "UMAP")

4.5 Metaclone Discovery with Metaclonotypist

Metaclonotypist identifies metaclones—groups of TCR clonotypes that share enough sequence similarity to potentially recognize the same antigen. This approach extends traditional exact-match clonotyping by allowing controlled levels of CDR3 edit distance and tcrdist-based similarity. The max_edits and max_dist parameters control the stringency of metaclone grouping.

sce <- runMetaclonotypist(
  immLynx_example,
  chains = "beta",
  method = "tcrdist",
  max_edits = 2,
  max_dist = 20
)

table(sce$metaclone)

4.6 Selection Inference with soNNia

soNNia uses a neural network to model post-thymic selection by comparing observed TCR sequences against a background distribution generated by V(D)J recombination. The model estimates a selection factor for each sequence, quantifying how much more (or less) likely a sequence is to appear in the observed repertoire relative to the recombination baseline. This requires a background sequence file, which can be generated using generateOLGA().

background <- generateOLGA(n = 10000, model = "humanTRB")
write.csv(background, "background.csv", row.names = FALSE)

sce <- runSoNNia(
  immLynx_example,
  chains = "TRB",
  background_file = "background.csv"
)

5 Workflow Example

The following example demonstrates a combined workflow that integrates multiple immLynx analyses on a single dataset. By storing all results in the same SingleCellExperiment object, different analytical layers (clustering, generation probability, embeddings) can be compared and visualized together.

library(immLynx)
library(scran)
library(scater)
library(ggplot2)

data("immLynx_example")

# Summarize the repertoire
summary <- summarizeTCRrepertoire(immLynx_example)
print(summary)

# Cluster TCRs by CDR3 similarity
immLynx_example <- runClustTCR(
  immLynx_example,
  chains = "TRB",
  method = "mcl"
)

# Calculate generation probability
immLynx_example <- runOLGA(
  immLynx_example,
  chains = "TRB"
)

# Generate protein language model embeddings
immLynx_example <- runEmbeddings(
  immLynx_example,
  chains = "TRB"
)

# Visualize embeddings colored by cluster assignment
immLynx_example <- scater::runUMAP(
  immLynx_example,
  dimred = "tcr_esm",
  name = "tcr_umap"
)

scater::plotReducedDim(
  immLynx_example,
  dimred = "tcr_umap",
  colour_by = "clustcr_TRB"
)

# Visualize embeddings colored by generation probability
scater::plotReducedDim(
  immLynx_example,
  dimred = "tcr_umap",
  colour_by = "olga_pgen_log10_TRB"
)

6 Session Info

sessionInfo()
#> R version 4.6.0 RC (2026-04-17 r89917)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] scater_1.39.4               ggplot2_4.0.3              
#>  [3] scran_1.39.2                scuttle_1.21.6             
#>  [5] SingleCellExperiment_1.33.2 SummarizedExperiment_1.41.1
#>  [7] Biobase_2.71.0              GenomicRanges_1.63.2       
#>  [9] Seqinfo_1.1.0               IRanges_2.45.0             
#> [11] S4Vectors_0.49.2            BiocGenerics_0.57.1        
#> [13] generics_0.1.4              MatrixGenerics_1.23.0      
#> [15] matrixStats_1.5.0           immLynx_0.99.4             
#> [17] BiocStyle_2.39.0           
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.2.1    viridisLite_0.4.3   vipor_0.4.7        
#>  [4] dplyr_1.2.1         farver_2.1.2        viridis_0.6.5      
#>  [7] filelock_1.0.3      S7_0.2.2            fastmap_1.2.0      
#> [10] bluster_1.21.1      digest_0.6.39       rsvd_1.0.5         
#> [13] lifecycle_1.0.5     cluster_2.1.8.2     statmod_1.5.1      
#> [16] magrittr_2.0.5      compiler_4.6.0      rlang_1.2.0        
#> [19] sass_0.4.10         tools_4.6.0         igraph_2.3.0       
#> [22] yaml_2.3.12         knitr_1.51          S4Arrays_1.11.1    
#> [25] dqrng_0.4.1         reticulate_1.46.0   DelayedArray_0.37.1
#> [28] RColorBrewer_1.1-3  abind_1.4-8         BiocParallel_1.45.0
#> [31] withr_3.0.2         grid_4.6.0          beachmat_2.27.5    
#> [34] edgeR_4.9.9         scales_1.4.0        dichromat_2.0-0.1  
#> [37] cli_3.6.6           rmarkdown_2.31      otel_0.2.0         
#> [40] metapod_1.19.2      immApex_1.5.4       ggbeeswarm_0.7.3   
#> [43] cachem_1.1.0        stringr_1.6.0       parallel_4.6.0     
#> [46] BiocManager_1.30.27 XVector_0.51.0      basilisk_1.23.0    
#> [49] vctrs_0.7.3         Matrix_1.7-5        jsonlite_2.0.0     
#> [52] dir.expiry_1.19.0   bookdown_0.46       BiocSingular_1.27.1
#> [55] BiocNeighbors_2.5.4 ggrepel_0.9.8       beeswarm_0.4.0     
#> [58] irlba_2.3.7         locfit_1.5-9.12     limma_3.67.3       
#> [61] jquerylib_0.1.4     glue_1.8.1          codetools_0.2-20   
#> [64] stringi_1.8.7       gtable_0.3.6        ScaledMatrix_1.19.0
#> [67] tibble_3.3.1        pillar_1.11.1       htmltools_0.5.9    
#> [70] R6_2.6.1            evaluate_1.0.5      lattice_0.22-9     
#> [73] png_0.1-9           bslib_0.10.0        Rcpp_1.1.1-1.1     
#> [76] gridExtra_2.3       SparseArray_1.11.13 xfun_0.57          
#> [79] pkgconfig_2.0.3