T-cell receptors (TCRs) are heterodimeric surface proteins that mediate adaptive immune recognition. Each TCR consists of an alpha and beta chain (or gamma and delta in a smaller subset of T cells), with the complementarity-determining region 3 (CDR3) serving as the primary determinant of antigen specificity. Advances in single-cell RNA sequencing (scRNA-seq) now enable simultaneous profiling of gene expression and TCR sequences at cellular resolution, producing rich datasets that benefit from computational analysis of repertoire diversity, clonal structure, and sequence similarity.
A growing ecosystem of Python-based tools has emerged for TCR repertoire analysis, including tcrdist3 for pairwise distance calculation, OLGA for generation probability estimation, soNNia for selection inference, clusTCR for sequence clustering, metaclonotypist for metaclone discovery, and the ESM-2 family of protein language models for learned sequence representations. While these tools are powerful individually, using them together requires managing multiple Python environments and translating data between frameworks—a barrier for researchers working primarily in R.
immLynx addresses this challenge by providing a unified R interface
to these Python tools through the basilisk package, which
manages isolated conda environments transparently. All functions accept
and return SingleCellExperiment objects, ensuring seamless integration
with the Bioconductor single-cell ecosystem, including
scRepertoire for clonotype assignment and
immApex for TCR data extraction. The package supports the
following analytical capabilities:
This vignette demonstrates the core functionality of immLynx, walking
through data preparation, individual analysis functions, and a combined
workflow. For advanced use cases including method comparison and custom
embedding strategies, see the companion vignette
vignette("advanced_analysis", package = "immLynx").
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("immLynx")
immLynx uses basilisk to manage Python dependencies automatically. The first time you call a function that invokes Python, basilisk will create an isolated conda environment containing all required packages (tcrdist3, olga, soNNia, clusTCR, metaclonotypist, and PyTorch with the ESM-2 model). This initial setup may take several minutes but only needs to occur once per installation. Subsequent calls reuse the existing environment with minimal overhead.
We begin by loading immLynx along with scran and
scater, which provide utilities for single-cell analysis
and visualization. The immLynx_example dataset is a
SingleCellExperiment object containing scRNA-seq data from multiple
patients, with TCR repertoire information added via
scRepertoire.
library(immLynx)
library(scran)
library(scater)
data("immLynx_example")
immLynx_example
#> class: SingleCellExperiment
#> dim: 2000 500
#> metadata(0):
#> assays(2): counts logcounts
#> rownames(2000): TRBV7-2 HSPA1B ... AC119396.1 BHLHE40
#> rowData names(0):
#> colnames(500): P17B_AAACCTGCAACACGCC-1 P17B_AAACGGGTCTCCAGGG-1 ...
#> P20L_TCAACGATCCCTCAGT-1 P20L_TCTGGAAGTGGAAAGA-1
#> colData names(16): orig.ident nCount_RNA ... ident clusters
#> reducedDimNames(1): UMAP
#> mainExpName: RNA
#> altExpNames(0):
Before running downstream analyses, it is often useful to extract TCR
data into a data.frame format for inspection and quality control. The
extractTCRdata() function wraps immApex::getIR() and supports both
long format (one row per chain) and wide format (one row per cell with
columns for each chain). The validateTCRdata() function checks for
common issues such as missing CDR3 sequences, non-standard amino acids,
and inconsistent chain annotations.
tcr_data <- extractTCRdata(immLynx_example,
chains = "TRB")
head(tcr_data)
#> cdr3_aa v d j c barcode chain
#> 1 CAISEQGKGELFF TRBV10-3 <NA> TRBJ2-2 TRBC2 P17B_AAACCTGCAACACGCC-1 TRB
#> 2 CASSVRRERANTGELFF TRBV9 <NA> TRBJ2-2 TRBC2 P17B_AAACGGGTCTCCAGGG-1 TRB
#> 4 CASSVRRERANTGELFF TRBV9 <NA> TRBJ2-2 TRBC2 P17B_AACTTTCTCAGCGACC-1 TRB
#> 6 CASSSRQDSTDTQYF TRBV27 <NA> TRBJ2-3 TRBC2 P17B_AATCGGTCAAGCCCAC-1 TRB
#> 8 CASSVRRERANTGELFF TRBV9 <NA> TRBJ2-2 TRBC2 P17B_ACGAGCCTCATTTGGG-1 TRB
#> 9 CASSPRAGTPNTEAFF TRBV4-1 <NA> TRBJ1-1 TRBC1 P17B_ACGCCAGGTTCATGGT-1 TRB
tcr_wide <- extractTCRdata(immLynx_example,
chains = "both",
format = "wide")
validation <- validateTCRdata(tcr_data)
print(validation)
#> $valid
#> [1] TRUE
#>
#> $errors
#> character(0)
#>
#> $warnings
#> character(0)
#>
#> $summary
#> $summary$n_rows
#> [1] 306
#>
#> $summary$n_unique_barcodes
#> [1] 306
#>
#> $summary$n_unique_cdr3
#> [1] 275
#>
#> $summary$n_na_cdr3
#> [1] 0
#>
#> $summary$chains
#> [1] "TRB"
The summarizeTCRrepertoire() function computes standard repertoire
statistics including clonotype frequencies, diversity indices (Shannon
entropy, Simpson index, clonality), CDR3 length distributions, and V/J
gene usage. These metrics provide a high-level overview of repertoire
structure before detailed analysis.
summary <- summarizeTCRrepertoire(immLynx_example,
chains = "TRB")
print(summary)
#> === TCR Repertoire Summary ===
#> Chain(s): TRB
#>
#> --- Basic Statistics ---
#> Total cells with TCR: 306
#> Unique clonotypes: 275
#> Clonotype ratio: 0.8987
#>
#> --- Diversity Metrics ---
#> Shannon entropy: 5.511
#> Clonality: 0.0188
#> Simpson index: 0.0057
#> Inverse Simpson: 174.69
#>
#> --- Top Clonotypes ---
#> cdr3_aa count proportion
#> CASSVRRERANTGELFF 14 0.045751634
#> CASSPTVAGEQFF 3 0.009803922
#> CASSQAPFSTSGELFF 3 0.009803922
#> CASSQDRTGLDYEQYF 3 0.009803922
#> CASSRLRTGALYEQYF 3 0.009803922
#>
#> --- CDR3 Length Distribution ---
#> Mean: 14.8
#> Median: 15
#> Range: 10 - 21
clusTCR groups TCR sequences by CDR3 similarity using the Markov Cluster
(MCL) algorithm. The inflation parameter controls cluster granularity:
higher values produce more, smaller clusters, while lower values yield
fewer, larger clusters. Cluster assignments are stored directly in the
colData of the SingleCellExperiment object, making them immediately
available for downstream visualization and statistical testing.
sce <- runClustTCR(
immLynx_example,
chains = "TRB",
method = "mcl",
inflation = 2.0
)
table(sce$clustcr_TRB)
tcrdist3 computes pairwise distances between TCR sequences using a substitution-matrix-based metric that accounts for CDR3 amino acid sequence, V gene, and J gene identity. The resulting distance matrix can be used for hierarchical clustering, nearest-neighbor analysis, or as input to dimensionality reduction methods. Note that tcrdist3 deduplicates sequences internally, so the distance matrix dimensions correspond to the number of unique TCR sequences rather than the total number of cells.
dist_results <- runTCRdist(
immLynx_example,
chains = "beta",
organism = "human"
)
dim(dist_results$distances$pw_beta)
OLGA (Optimized Likelihood estimate of immunoGlobulin Amino-acid
sequences) computes the generation probability (Pgen) of a TCR
sequence—the probability that V(D)J recombination produces that
specific CDR3 amino acid sequence. Low Pgen values indicate sequences
that are unlikely to arise by chance, which may suggest antigen-driven
selection. The runOLGA() function adds log10-transformed Pgen values
to the colData of the input object.
sce <- runOLGA(
immLynx_example,
chains = "TRB",
model = "humanTRB"
)
hist(log10(sce$olga_pgen_TRB), breaks = 50)
The generateOLGA() function can also produce synthetic TCR sequences
sampled from the recombination model, which is useful for constructing
background distributions or benchmarking analyses.
random_seqs <- generateOLGA(n = 100,
model = "humanTRB")
head(random_seqs)
Protein language models such as ESM-2 learn dense vector representations
of amino acid sequences from large-scale protein databases. When applied
to CDR3 sequences, these embeddings capture biochemical and structural
properties that complement sequence-based distance metrics. The
runEmbeddings() function generates per-sequence embeddings and stores
them as a dimensionality reduction in the SingleCellExperiment object,
enabling visualization with standard tools such as UMAP or PCA. Multiple
pooling strategies (mean, cls, max) are available to aggregate
per-residue representations into a fixed-length vector.
sce <- runEmbeddings(
immLynx_example,
chains = "TRB",
model_name = "facebook/esm2_t12_35M_UR50D",
pool = "mean"
)
sce <- scater::runUMAP(sce,
dimred = "tcr_esm")
scater::plotReducedDim(sce, dimred = "UMAP")
Metaclonotypist identifies metaclones—groups of TCR clonotypes that
share enough sequence similarity to potentially recognize the same
antigen. This approach extends traditional exact-match clonotyping by
allowing controlled levels of CDR3 edit distance and tcrdist-based
similarity. The max_edits and max_dist parameters control the
stringency of metaclone grouping.
sce <- runMetaclonotypist(
immLynx_example,
chains = "beta",
method = "tcrdist",
max_edits = 2,
max_dist = 20
)
table(sce$metaclone)
soNNia uses a neural network to model post-thymic selection by comparing
observed TCR sequences against a background distribution generated by
V(D)J recombination. The model estimates a selection factor for each
sequence, quantifying how much more (or less) likely a sequence is to
appear in the observed repertoire relative to the recombination
baseline. This requires a background sequence file, which can be
generated using generateOLGA().
background <- generateOLGA(n = 10000, model = "humanTRB")
write.csv(background, "background.csv", row.names = FALSE)
sce <- runSoNNia(
immLynx_example,
chains = "TRB",
background_file = "background.csv"
)
The following example demonstrates a combined workflow that integrates
multiple immLynx analyses on a single dataset. By storing all results in
the same SingleCellExperiment object, different analytical layers
(clustering, generation probability, embeddings) can be compared and
visualized together.
library(immLynx)
library(scran)
library(scater)
library(ggplot2)
data("immLynx_example")
# Summarize the repertoire
summary <- summarizeTCRrepertoire(immLynx_example)
print(summary)
# Cluster TCRs by CDR3 similarity
immLynx_example <- runClustTCR(
immLynx_example,
chains = "TRB",
method = "mcl"
)
# Calculate generation probability
immLynx_example <- runOLGA(
immLynx_example,
chains = "TRB"
)
# Generate protein language model embeddings
immLynx_example <- runEmbeddings(
immLynx_example,
chains = "TRB"
)
# Visualize embeddings colored by cluster assignment
immLynx_example <- scater::runUMAP(
immLynx_example,
dimred = "tcr_esm",
name = "tcr_umap"
)
scater::plotReducedDim(
immLynx_example,
dimred = "tcr_umap",
colour_by = "clustcr_TRB"
)
# Visualize embeddings colored by generation probability
scater::plotReducedDim(
immLynx_example,
dimred = "tcr_umap",
colour_by = "olga_pgen_log10_TRB"
)
sessionInfo()
#> R version 4.6.0 RC (2026-04-17 r89917)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] scater_1.39.4 ggplot2_4.0.3
#> [3] scran_1.39.2 scuttle_1.21.6
#> [5] SingleCellExperiment_1.33.2 SummarizedExperiment_1.41.1
#> [7] Biobase_2.71.0 GenomicRanges_1.63.2
#> [9] Seqinfo_1.1.0 IRanges_2.45.0
#> [11] S4Vectors_0.49.2 BiocGenerics_0.57.1
#> [13] generics_0.1.4 MatrixGenerics_1.23.0
#> [15] matrixStats_1.5.0 immLynx_0.99.4
#> [17] BiocStyle_2.39.0
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.1 viridisLite_0.4.3 vipor_0.4.7
#> [4] dplyr_1.2.1 farver_2.1.2 viridis_0.6.5
#> [7] filelock_1.0.3 S7_0.2.2 fastmap_1.2.0
#> [10] bluster_1.21.1 digest_0.6.39 rsvd_1.0.5
#> [13] lifecycle_1.0.5 cluster_2.1.8.2 statmod_1.5.1
#> [16] magrittr_2.0.5 compiler_4.6.0 rlang_1.2.0
#> [19] sass_0.4.10 tools_4.6.0 igraph_2.3.0
#> [22] yaml_2.3.12 knitr_1.51 S4Arrays_1.11.1
#> [25] dqrng_0.4.1 reticulate_1.46.0 DelayedArray_0.37.1
#> [28] RColorBrewer_1.1-3 abind_1.4-8 BiocParallel_1.45.0
#> [31] withr_3.0.2 grid_4.6.0 beachmat_2.27.5
#> [34] edgeR_4.9.9 scales_1.4.0 dichromat_2.0-0.1
#> [37] cli_3.6.6 rmarkdown_2.31 otel_0.2.0
#> [40] metapod_1.19.2 immApex_1.5.4 ggbeeswarm_0.7.3
#> [43] cachem_1.1.0 stringr_1.6.0 parallel_4.6.0
#> [46] BiocManager_1.30.27 XVector_0.51.0 basilisk_1.23.0
#> [49] vctrs_0.7.3 Matrix_1.7-5 jsonlite_2.0.0
#> [52] dir.expiry_1.19.0 bookdown_0.46 BiocSingular_1.27.1
#> [55] BiocNeighbors_2.5.4 ggrepel_0.9.8 beeswarm_0.4.0
#> [58] irlba_2.3.7 locfit_1.5-9.12 limma_3.67.3
#> [61] jquerylib_0.1.4 glue_1.8.1 codetools_0.2-20
#> [64] stringi_1.8.7 gtable_0.3.6 ScaledMatrix_1.19.0
#> [67] tibble_3.3.1 pillar_1.11.1 htmltools_0.5.9
#> [70] R6_2.6.1 evaluate_1.0.5 lattice_0.22-9
#> [73] png_0.1-9 bslib_0.10.0 Rcpp_1.1.1-1.1
#> [76] gridExtra_2.3 SparseArray_1.11.13 xfun_0.57
#> [79] pkgconfig_2.0.3