library(knitr)
library(ClustIRR)
library(igraph)
library(ggplot2)
theme_set(new = theme_bw(base_size = 10))
library(ggrepel)
library(patchwork)
options(digits=2)Adaptive immunity employs diverse immune receptor repertoires (IRRs; B- and T-cell receptors) to combat evolving pathogens including viruses, bacteria, and cancers. Receptor diversity arises through V(D)J recombination - combinatorial assembly of germline genes generating unique sequences. Each naive lymphocyte consequently expresses a distinct receptor, enabling broad antigen recognition.
B cells engage antigens directly via BCR complementarity-determining regions (CDRs), while T cells recognize peptide-MHC complexes through TCR CDRs. Antigen recognition triggers clonal expansion, producing effector populations that mount targeted immune responses.
High-throughput sequencing (HT-seq) enables tracking of TCR/BCR community dynamics across biological conditions (e.g., pre-/post-treatment), offering insights into responses to immunotherapy and vaccination. However, two key challenges complicate this approach if the unit of analysis are clonotype:
Extreme diversity and privacy: TCRs/BCRs are highly personalized, with incomplete sampling particularly problematic in clinical settings where sample volumes are limited. Even comprehensive sampling reveals minimal repertoire overlap in terms of clonotypes between individuals.
Similar TCRs/BCRs recognize similar antigens
This vignette introduces ClustIRR, a computational method that addresses these challenges by: (1) Identifying specificity-associated receptor communities through sequence clustering, and (2) Applying Bayesian models to quantify differential community abundance across conditions.
ClustIRR is freely available as part of Bioconductor, filling the gap that currently exists in terms of software for quantitative analysis of IRRs.
To install ClustIRR please start R and enter:
The main input of ClustIRR
is a repertoire (s), which should be provided as
data.frame. The rows in the data.frame correspond to
clonotype (clonotype = group of cells derived from a
common parent cell by clonal expansion). We use the following data from
each clonotype:
In a typical scenario, the user will have more than one repertoire (see workflow). For instance, the user will analyze longitudinal repertoire data, i.e., two or three repertoires taken at different time points; or across different biological conditions.
Let’s look at dataset D1 that is provided within ClustIRR.
D1 contains three TCR\(\alpha\beta\) repertoires: \(a\), \(b\), and \(c\) and their metadata: ma,
mb and mc.
List of 6
$ a :'data.frame': 500 obs. of 4 variables:
..$ CDR3a : chr [1:500] "CASSLRGAHNEQFF" "CASGLRQGYGYTF" "CSAGGFRETQYF" "CGSSLSQGSEQYF" ...
..$ CDR3b : chr [1:500] "CASTVTSGSNEKLFF" "CASSLTGTGYTF" "CSALTPGLIYNEQFF" "CSARASWGTNEKLFF" ...
..$ sample : chr [1:500] "a" "a" "a" "a" ...
..$ clone_size: num [1:500] 3 2 2 5 3 5 5 5 3 4 ...
$ b :'data.frame': 500 obs. of 4 variables:
..$ CDR3a : chr [1:500] "CASSLRGAHNEQFF" "CASGLRQGYGYTF" "CSAGGFRETQYF" "CGSSLSQGSEQYF" ...
..$ CDR3b : chr [1:500] "CASTVTSGSNEKLFF" "CASSLTGTGYTF" "CSALTPGLIYNEQFF" "CSARASWGTNEKLFF" ...
..$ sample : chr [1:500] "b" "b" "b" "b" ...
..$ clone_size: num [1:500] 3 2 2 7 4 5 5 7 3 4 ...
$ c :'data.frame': 500 obs. of 4 variables:
..$ CDR3a : chr [1:500] "CASSLRGAHNEQFF" "CASGLRQGYGYTF" "CSAGGFRETQYF" "CGSSLSQGSEQYF" ...
..$ CDR3b : chr [1:500] "CASTVTSGSNEKLFF" "CASSLTGTGYTF" "CSALTPGLIYNEQFF" "CSARASWGTNEKLFF" ...
..$ sample : chr [1:500] "c" "c" "c" "c" ...
..$ clone_size: num [1:500] 3 2 2 9 5 5 5 9 3 4 ...
$ ma:'data.frame': 500 obs. of 8 variables:
..$ clone_id: int [1:500] 1 2 3 4 5 6 7 8 9 10 ...
..$ cell : chr [1:500] "CD8" "CD4" "CD8" "CD8" ...
..$ HLA_A : chr [1:500] "HLA-A∗24" "HLA-A∗24" "HLA-A∗24" "HLA-A∗24" ...
..$ age : num [1:500] 24 24 24 24 24 24 24 24 24 24 ...
..$ TRAV : chr [1:500] "TRAV27" "TRAV27" "TRAV20-1" "TRAV7-7" ...
..$ TRAJ : chr [1:500] "TRAJ2-1" "TRAJ1-2" "TRAJ2-5" "TRAJ2-7" ...
..$ TRBV : chr [1:500] "TRBV6-8" "TRBV7-3" "TRBV20-1" "TRBV20-1" ...
..$ TRBJ : chr [1:500] "TRBJ1-4" "TRBJ1-2" "TRBJ2-1" "TRBJ1-4" ...
$ mb:'data.frame': 500 obs. of 8 variables:
..$ clone_id: int [1:500] 1 2 3 4 5 6 7 8 9 10 ...
..$ cell : chr [1:500] "CD8" "CD4" "CD4" "CD4" ...
..$ HLA_A : chr [1:500] "HLA-A∗02" "HLA-A∗02" "HLA-A∗02" "HLA-A∗02" ...
..$ age : num [1:500] 30 30 30 30 30 30 30 30 30 30 ...
..$ TRAV : chr [1:500] "TRAV27" "TRAV27" "TRAV20-1" "TRAV7-7" ...
..$ TRAJ : chr [1:500] "TRAJ2-1" "TRAJ1-2" "TRAJ2-5" "TRAJ2-7" ...
..$ TRBV : chr [1:500] "TRBV6-8" "TRBV7-3" "TRBV20-1" "TRBV20-1" ...
..$ TRBJ : chr [1:500] "TRBJ1-4" "TRBJ1-2" "TRBJ2-1" "TRBJ1-4" ...
$ mc:'data.frame': 500 obs. of 8 variables:
..$ clone_id: int [1:500] 1 2 3 4 5 6 7 8 9 10 ...
..$ cell : chr [1:500] "CD8" "CD8" "CD8" "CD8" ...
..$ HLA_A : chr [1:500] "HLA-A∗11" "HLA-A∗11" "HLA-A∗11" "HLA-A∗11" ...
..$ age : num [1:500] 40 40 40 40 40 40 40 40 40 40 ...
..$ TRAV : chr [1:500] "TRAV27" "TRAV27" "TRAV20-1" "TRAV7-7" ...
..$ TRAJ : chr [1:500] "TRAJ2-1" "TRAJ1-2" "TRAJ2-5" "TRAJ2-7" ...
..$ TRBV : chr [1:500] "TRBV6-8" "TRBV7-3" "TRBV20-1" "TRBV20-1" ...
..$ TRBJ : chr [1:500] "TRBJ1-4" "TRBJ1-2" "TRBJ2-1" "TRBJ1-4" ...
Extract the data.frames for each TCR repertoire and their metadata:
We will merge three TCR repertoires into the data.frame tcr_reps.
We will do the same for the metadata
ClustIRR performs the following steps.
Compute similarities between T-cell clonotypes within each TCR repertoire
Construct a graph from each TCR repertoire
Construct a joint similarity graph (\(J\))
Detect communities in \(J\)
Analyze Differential Community Occupancy (DCO)
a. Between individual TCR repertoires with model \(M\)
b. Between groups of TCR repertoires from biological conditions with model \(M_h\)
Inspect results
cluster_irrClustIRR aims to quantify the similarity between pairs of TCR clonotypes based on the similarities of their CDR3s sequences. For this it employs Basic Local-Alignment Search Tool (BLAST) via the R-package blaster. Briefly, a protein database is constructed from all CDR3 sequences, and each CDR3 sequence is used as a query. This enables fast sequence similarity searches. Furthermore, only CDR3 sequences matches with \(\geq\) 80% sequence identity to the query are retained. This step reduces the computational and memory requirements, without impacting downstream community analyses, as CDR3 sequences with lower typically yield low similarity scores.
For matched CDR3 pair, an alignment score (\(\omega\)) is computed using BLOSUM62 substitution scores with gap opening penalty of -10 and gap extension penalty of -4. \(\omega\) is the sum of substitution scores and gap penalties in the alignment. Identical or highly similar CDR3 sequence pairs receive large positive \(\omega\) scores, while dissimilar pairs receive low or negative \(\omega\). To normalize \(\omega\) for alignment length, ClustIRR computes \(\bar{\omega} = \omega/l\), where \(l\) is the alignment length yielding normalized alignment score \(\bar{\omega}\). This normalization, also used in iSMART (Zhang, 2020), ensures comparability across CDR3 pairs of varying lengths.
ClustIRR also computes alignment scores for the CDR3 core regions (\(\omega^c\) and \(\bar{\omega}^c\)). The CDR3 core, representing the central loop region with high antigen-contact probability (Glanville, 2017), is generated by trimming three residues from each end of the CDR3 sequence. Comparing \(\bar{\omega}^c\) and \(\bar{\omega}\) allows assessment of whether sequence similarity is concentrated in the core or flanking regions.
Next, ClustIRR builds a graph for each TCR repertoire. The graphs have nodes and weighted edges:
Then the graphs are joined together: edges between TCR clonotypes
from different TCR repertoires are computed using the same procedure
outlined in step 1. The joint graph \(J\) is stored as an igraph
object.
clustirrStep 1. involves calling the function clust_irr which
returns an S4 object of class clust_irr.
The output complex list with three elements:
graph = the joint graph \(J\)
clust_irrs = list with S4 clust_irr
objects one for each TCR repertoire
multigraph = logical value which tell us whether
\(J\) contains multiple or a single TCR
repertoire
clust_irrsWe can look at the similarity scores between CDR3 sequences of TCR
repertoire a. Each row is a pair of CDR3 sequences from the
repertoire annotated with the following metrics:
max_len: length of the longer CDR3 sequence in the
pairmax_clen: length of the longer CDR3 core sequence in
the pairweight: \(\omega\) =
BLOSUM62 score of the complete CDR3 alignmentcweight= \(\omega_c\):
BLOSUM62 score of CDR3 coresnweight = \(\bar{\omega}\): normalized
weight by max_lenncweight = \(\bar{\omega}_c\): normalized
cweight by max_clenThe results for CDR3\(\alpha\)
sequence pairs from repertoire a:
| from_cdr3 | to_cdr3 | weight | cweight | nweight | ncweight | max_len | max_clen |
|---|---|---|---|---|---|---|---|
| CASSPGHSNTEAFF | CASSPGRSTEAFF | 102 | 45 | 7.3 | 5.6 | 14 | 8 |
| CASSHPGTSNQPQHF | CASSQTPTSNQPQHF | 124 | 64 | 8.3 | 7.1 | 15 | 9 |
| CASSLSGGRTEAFF | CASSLAGARYTEAFF | 105 | 48 | 7.0 | 5.3 | 15 | 9 |
| CASSRTGYTDTQYF | CASSRTLAGDTDTQYF | 106 | 47 | 6.6 | 4.7 | 16 | 10 |
| CASSRTGYTDTQYF | CACSRTGRTDTQYF | 120 | 66 | 8.6 | 8.2 | 14 | 8 |
| CASSRYRGGQPQHF | CASSSYRGSNQPQHF | 111 | 51 | 7.4 | 5.7 | 15 | 9 |
… the same table for CDR3\(\beta\)
sequence pairs from repertoire a:
| from_cdr3 | to_cdr3 | weight | cweight | nweight | ncweight | max_len | max_clen |
|---|---|---|---|---|---|---|---|
| CASGLNVNTEAFF | CASSLNRVNTEAFF | 101 | 44 | 7.2 | 5.5 | 14 | 8 |
| CASGLNVNTEAFF | CASGLVGNTEAFF | 105 | 48 | 8.1 | 6.9 | 13 | 7 |
| CASSLEGPVTDTQYF | CASSLGGSTDTQYF | 104 | 45 | 6.9 | 5.0 | 15 | 9 |
| CATSRPGGEWETQYF | CATSRPLGEETQYF | 111 | 51 | 7.4 | 5.7 | 15 | 9 |
| CASSLGPGVYEQYF | CASSLGTSGVRYEQYF | 98 | 39 | 6.1 | 3.9 | 16 | 10 |
| CASSLGTAPNQPQHF | CASSLGQGGNQPQHF | 125 | 65 | 8.3 | 7.2 | 15 | 9 |
Notice that very similar CDR3\(\alpha\) or CDR3\(\beta\) pairs have high normalized
alignment scores (\(\bar{\omega}\)). We
have similar tables for repertoire b and c.
These are used internally to construct graphs in the next step.
plot_graphWe can use the function plot_graph for interactive
visualization of relatively small graphs.
The function clust_irr performs automatic annotation of
TCR clonotypes based on multiple databases (DBs), including: VDJdb,
TCR3d, McPAS-TCR. Each TCR clonotype recieves two types of annotations
(per chain and per database):
Ag_species: the name of the antigen species recognized by the clonotype’s CDR3
Ag_gene: the name of the antigen gene recognized by clonotype’s CDR3
Hence, in the plot we can highlight TCR clonotypes that recognize certain antigenic species or genes (see dropdown menu), or use the hoovering function to look at the CDR3 sequences of nodes.
Do this now!
Notice that even in this small case study–where each TCR repertoire contains only 500 TCR clonotypes–we observe that the joint graph is complex! This complexity makes qualitative analyses impractical. To address this, ClustIRR focuses on quantitative analyses (see the next steps).
We can use igraph functions to inspect various properties of the
graph in cl$graph. For instance, below, we extract the edge
attributes and visualize the distributions of the edge attributes
ncweight and nweight for all CDR3\(\alpha\) and CDR3\(\beta\) sequence pairs.
ggplot(data = e)+
geom_density(aes(ncweight, col = chain))+
geom_density(aes(nweight, col = chain), linetype = "dashed")+
xlab(label = "edge weight (solid = ncweight, dashed = nweight)")+
theme(legend.position = "top")Can you guess why we observe trimodal distributions?
Top mode: weights of identical CDR3 sequence pairs from the different TCR repertoires
Middle mode: weights of CDR3 sequences with same lengths
Bottom mode: weights of CDR3 sequences with different lengths -> scores penalized by gap cost
detect_communitiesClustIRR employs graph-based community detection (GCD) algorithms, such as Louvain, Leiden or InfoMap, to identify communities of nodes that have high density of edges among each other, and low density of edges with nodes outside the community.
First, the similarity score between T-cell clonotypes \(i\) and \(j\) is defined as the average CDR3\(\alpha\) and CDR3\(\beta\) alignment scores:
\[\begin{align} \omega(i,j) = \dfrac{{\bar{\omega}}_\alpha + {\bar{\omega}}_\beta}{2} \end{align}\]where \(\bar{\omega}_\alpha\) and \(\bar{\omega}_\beta\) are the alignment scores for the CDR3\(\alpha\) and CDR3\(\beta\), respectively. If a chain is missing, its alignment score is set to 0.
The user has the following options:
algorithm: “leiden” (default) “louvain”, or
“infomap”resolution: GCD resolution = 1 (default)iterations: number of clustering iterations (default =
100) to ensure robust resultsweight: “ncweight” (default) or “nweight”chains: “CDR3a” or “CDR3b” or c(“CDR3a”, “CDR3b”)gcd <- detect_communities(graph = cl$graph,
algorithm = "leiden",
metric = "average",
resolution = 1,
iterations = 100,
weight = "ncweight",
chains = c("CDR3a", "CDR3b"))detect_communitiesThe function detect_communities generates a complex
output. Lets investigate its elements:
[1] "community_occupancy_matrix" "community_summary"
[3] "node_summary" "graph"
[5] "graph_structure_quality" "input_config"
The main element is community_occupancy_matrix, which
contains the number of T-cells in each community (row) and repertoire
(column). Here we have three repertoires (three columns) and over 400
communities. This matrix is the main input of the function
dco (step 5.), to detect differences in the community
occupancy between repertoires.
[1] 422 3
a b c
1 3 3 3
2 2 2 2
3 2 2 2
4 5 7 9
5 3 4 5
6 5 5 5
get_honeycombsIn the honeycomb plots shown below, several communities (black dots) appear far from the diagonal. This indicates that these communities contain more cells in repertoires b and c (y-axes in panels A and B) compared to repertoire a (x-axes in panels A and B). Meanwhile, the same points are generally close to the diagonal in panel C but remain slightly more abundant in repertoire c (y-axis) compared to b (x-axis).
In step 5, ClustIRR will provide a quantitative answer to the question: Which communities are differentially abundant between pairs of repertoires?
Importantly, the color of the hexagons encodes the density of communities in the 2D scatterplots: dark hexagons indicate high a frequency of communities, while light hexagons represent sparsely populated regions.
Also see community_summary. In the data.frame
wide we provide community summaries in each community
across all TCR repertoires, including:
clones_a, clone_b, clone_c,
clones_n: the frequency of clonotypes in the community
coming from repertoire a, b, c
and in total (n)cells_a, cells_b, cells_c,
cells_n: the frequency of cell in the community coming from
repertoires a, b, c and in total
(n)w: average inter-clonotype similarityncweight_CDR3a/b, nweight_CDR3a/b: raw and
normalized similarity for CDR3\(\alpha\) and CDR3\(\beta\) sequencesn_CDR3a, n_CDR3b: number of edges between
CDR3\(\alpha\) and CDR3\(\beta\) sequences| community | clones_a | clones_b | clones_c | clones_n | cells_a | cells_b | cells_c | cells_n | w | ncweight_CDR3a | ncweight_CDR3b | nweight_CDR3a | nweight_CDR3b | n_CDR3a | n_CDR3b |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 3 | 3 | 3 | 3 | 9 | 9.1 | NaN | 9.1 | NaN | 9.3 | 3 | 3 |
| 2 | 1 | 1 | 1 | 3 | 2 | 2 | 2 | 6 | 9.3 | NaN | 9.3 | NaN | 9.6 | 3 | 3 |
| 3 | 1 | 1 | 1 | 3 | 2 | 2 | 2 | 6 | 9.4 | NaN | 9.4 | NaN | 9.6 | 3 | 3 |
| 4 | 1 | 1 | 1 | 3 | 5 | 7 | 9 | 21 | 9.1 | NaN | 9.1 | NaN | 9.5 | 3 | 3 |
| 5 | 1 | 1 | 1 | 3 | 3 | 4 | 5 | 12 | 9.6 | NaN | 9.6 | NaN | 9.8 | 3 | 3 |
| 6 | 1 | 1 | 1 | 3 | 5 | 5 | 5 | 15 | 9.6 | NaN | 9.6 | NaN | 9.8 | 3 | 3 |
In the data.frame tall we provide community and
repertoire summaries in each row.
| community | sample | clones | cells | w | ncweight_CDR3a | ncweight_CDR3b | nweight_CDR3a | nweight_CDR3b | n_CDR3a | n_CDR3b |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | a | 1 | 3 | 9.1 | NaN | 9.1 | NaN | 9.3 | 3 | 3 |
| 1 | b | 1 | 3 | 9.1 | NaN | 9.1 | NaN | 9.3 | 3 | 3 |
| 1 | c | 1 | 3 | 9.1 | NaN | 9.1 | NaN | 9.3 | 3 | 3 |
| 112 | a | 3 | 11 | 5.2 | 1.8 | 4.0 | 2 | 4.4 | 18 | 18 |
| 112 | b | 3 | 12 | 5.2 | 1.8 | 4.0 | 2 | 4.4 | 18 | 18 |
| 112 | c | 3 | 13 | 5.2 | 1.8 | 4.0 | 2 | 4.4 | 18 | 18 |
Node-specific (TCR clonotype-specific) summaries are provided in
node_summary
| name | clone_id | Ag_gene | Ag_species | info_tcr3d_CDR3a | db_tcr3d_CDR3a | info_tcr3d_CDR3b | db_tcr3d_CDR3b | info_mcpas_CDR3a | db_mcpas_CDR3a | info_mcpas_CDR3b | db_mcpas_CDR3b | info_vdjdb_CDR3a | db_vdjdb_CDR3a | info_vdjdb_CDR3b | db_vdjdb_CDR3b | clone_size | sample | CDR3b | CDR3a | cell | HLA_A | age | TRAV | TRAJ | TRBV | TRBJ | id | community |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| a|1 | 1 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | 3 | a | CASTVTSGSNEKLFF | CASSLRGAHNEQFF | CD8 | HLA-A∗24 | 24 | TRAV27 | TRAJ2-1 | TRBV6-8 | TRBJ1-4 | 1 | 1 | ||
| a|2 | 2 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | 2 | a | CASSLTGTGYTF | CASGLRQGYGYTF | CD4 | HLA-A∗24 | 24 | TRAV27 | TRAJ1-2 | TRBV7-3 | TRBJ1-2 | 2 | 2 | ||
| a|3 | 3 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | 2 | a | CSALTPGLIYNEQFF | CSAGGFRETQYF | CD8 | HLA-A∗24 | 24 | TRAV20-1 | TRAJ2-5 | TRBV20-1 | TRBJ2-1 | 3 | 3 | ||
| a|4 | 4 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | 5 | a | CSARASWGTNEKLFF | CGSSLSQGSEQYF | CD8 | HLA-A∗24 | 24 | TRAV7-7 | TRAJ2-7 | TRBV20-1 | TRBJ1-4 | 4 | 4 | ||
| a|5 | 5 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | 3 | a | CASSVREAENQPQHF | CSPRGPYGYTF | CD8 | HLA-A∗24 | 24 | TRAV20-1 | TRAJ1-2 | TRBV4-1 | TRBJ1-5 | 5 | 5 | ||
| a|6 | 6 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | NA | 0 | 5 | a | CASCTVGNQPQHF | CSLGPSKSQYF | CD8 | HLA-A∗24 | 24 | TRAV29-1 | TRAJ2-3 | TRBV19 | TRBJ1-5 | 6 | 6 |
decode_communityCommunity with ID=8 contains 9 TCR clonotypes. These are connected based on their CDR3\(\beta\) sequences.
a b c
3 3 3
We can extract it and visualize it using igraph:
Furthermore, we can “decompose” this graph using
decode_community based on the attributes of the edges
(edge_filter) and nodes (node_filter).
[1] "nweight" "ncweight" "chain" "w"
The following edge-filter will instruct ClustIRR to keep edges with with edge attributes: nweight \(>=\) 8 AND ncweight \(>=\) 8
edge_filter <- rbind(data.frame(name = "nweight", value = 8, operation = ">="),
data.frame(name = "ncweight", value = 8, operation = ">=")) [1] "name" "clone_id" "Ag_gene" "Ag_species"
[5] "info_tcr3d_CDR3a" "db_tcr3d_CDR3a" "info_tcr3d_CDR3b" "db_tcr3d_CDR3b"
[9] "info_mcpas_CDR3a" "db_mcpas_CDR3a" "info_mcpas_CDR3b" "db_mcpas_CDR3b"
[13] "info_vdjdb_CDR3a" "db_vdjdb_CDR3a" "info_vdjdb_CDR3b" "db_vdjdb_CDR3b"
[17] "clone_size" "sample" "CDR3b" "CDR3a"
[21] "cell" "HLA_A" "age" "TRAV"
[25] "TRAJ" "TRBV" "TRBJ" "id"
[29] "community"
The following node-filter will instruct ClustIRR to retain edges between nodes that have shared node attributed with respect to ALL of the following node attributes:
c8 <- decode_community(community_id = 8,
graph = gcd$graph,
edge_filter = edge_filter,
node_filter = node_filter)kable(as_data_frame(x = c8$community, what = "vertices")
[, c("name", "component_id", "CDR3b", "TRBV",
"TRBJ", "CDR3a", "TRAV", "TRAJ")],
row.names = FALSE)| name | component_id | CDR3b | TRBV | TRBJ | CDR3a | TRAV | TRAJ |
|---|---|---|---|---|---|---|---|
| a|8 | 2 | CASSGRGGDNEQFF | TRBV7-9 | TRBJ2-1 | CASGTLSYNEQFF | TRAV19 | TRAJ2-1 |
| b|8 | 2 | CASSGRGGDNEQFF | TRBV7-9 | TRBJ2-1 | CASGTLSYNEQFF | TRAV19 | TRAJ2-1 |
| c|8 | 2 | CASSGRGGDNEQFF | TRBV7-9 | TRBJ2-1 | CASGTLSYNEQFF | TRAV19 | TRAJ2-1 |
| a|59 | 3 | CASSGQGGYYNEQFF | TRBV9 | TRBJ2-1 | CSARDGDSSSYEQYF | TRAV20-1 | TRAJ2-7 |
| b|59 | 3 | CASSGQGGYYNEQFF | TRBV9 | TRBJ2-1 | CSARDGDSSSYEQYF | TRAV20-1 | TRAJ2-7 |
| c|59 | 3 | CASSGQGGYYNEQFF | TRBV9 | TRBJ2-1 | CSARDGDSSSYEQYF | TRAV20-1 | TRAJ2-7 |
| a|344 | 1 | CASSERGGGSNEQFF | TRBV7-8 | TRBJ2-1 | CSAREITEAFF | TRAV20-1 | TRAJ1-1 |
| b|344 | 1 | CASSERGGGSNEQFF | TRBV7-8 | TRBJ2-1 | CSAREITEAFF | TRAV20-1 | TRAJ1-1 |
| c|344 | 1 | CASSERGGGSNEQFF | TRBV7-8 | TRBJ2-1 | CSAREITEAFF | TRAV20-1 | TRAJ1-1 |
… or we can summarize the properties of each component in the next table as rows with:
| component_id | community | mean_ncweight | mean_nweight | n_nodes | n_edges | n_clique_edges | diameter |
|---|---|---|---|---|---|---|---|
| 1 | 8 | 9.0 | 9.3 | 3 | 3 | 3 | 1 |
| 2 | 8 | 9.4 | 9.5 | 3 | 3 | 3 | 1 |
| 3 | 8 | 9.5 | 9.6 | 3 | 3 | 3 | 1 |
dcoDo we see expanding or contracting communities in a given repertoire?
We employ a Bayesian model to quantify the relative abundance (occupancy) of individual communities in each repertoire.
The model output for repertoire \(a\) is the parameter vector \(\beta^a=\beta^a_1,\beta^a_2,\ldots,\beta^a_k\), which describes the effect of repertoire \(a\) on the relative occupancy in each community.
Given two repertoires, \(a\) and \(b\), we can quantify the differential community occupancy (DCO):
\[\begin{align} \delta^{a-b}_i = \beta^a_i - \beta^b_i \end{align}\]d <- dco(community_occupancy_matrix = gcd$community_occupancy_matrix,
mcmc_control = list(mcmc_warmup = 300,
mcmc_iter = 600,
mcmc_chains = 2,
mcmc_cores = 1,
mcmc_algorithm = "NUTS",
adapt_delta = 0.9,
max_treedepth = 10))
SAMPLING FOR MODEL 'dm' NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 0.000255 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 2.55 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: Iteration: 1 / 600 [ 0%] (Warmup)
Chain 1: Iteration: 60 / 600 [ 10%] (Warmup)
Chain 1: Iteration: 120 / 600 [ 20%] (Warmup)
Chain 1: Iteration: 180 / 600 [ 30%] (Warmup)
Chain 1: Iteration: 240 / 600 [ 40%] (Warmup)
Chain 1: Iteration: 300 / 600 [ 50%] (Warmup)
Chain 1: Iteration: 301 / 600 [ 50%] (Sampling)
Chain 1: Iteration: 360 / 600 [ 60%] (Sampling)
Chain 1: Iteration: 420 / 600 [ 70%] (Sampling)
Chain 1: Iteration: 480 / 600 [ 80%] (Sampling)
Chain 1: Iteration: 540 / 600 [ 90%] (Sampling)
Chain 1: Iteration: 600 / 600 [100%] (Sampling)
Chain 1:
Chain 1: Elapsed Time: 13.486 seconds (Warm-up)
Chain 1: 9.286 seconds (Sampling)
Chain 1: 22.772 seconds (Total)
Chain 1:
SAMPLING FOR MODEL 'dm' NOW (CHAIN 2).
Chain 2:
Chain 2: Gradient evaluation took 0.000263 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 2.63 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2:
Chain 2:
Chain 2: Iteration: 1 / 600 [ 0%] (Warmup)
Chain 2: Iteration: 60 / 600 [ 10%] (Warmup)
Chain 2: Iteration: 120 / 600 [ 20%] (Warmup)
Chain 2: Iteration: 180 / 600 [ 30%] (Warmup)
Chain 2: Iteration: 240 / 600 [ 40%] (Warmup)
Chain 2: Iteration: 300 / 600 [ 50%] (Warmup)
Chain 2: Iteration: 301 / 600 [ 50%] (Sampling)
Chain 2: Iteration: 360 / 600 [ 60%] (Sampling)
Chain 2: Iteration: 420 / 600 [ 70%] (Sampling)
Chain 2: Iteration: 480 / 600 [ 80%] (Sampling)
Chain 2: Iteration: 540 / 600 [ 90%] (Sampling)
Chain 2: Iteration: 600 / 600 [100%] (Sampling)
Chain 2:
Chain 2: Elapsed Time: 13.574 seconds (Warm-up)
Chain 2: 9.272 seconds (Sampling)
Chain 2: 22.846 seconds (Total)
Chain 2:
get_beta_violinsbeta_violins <- get_beta_violin(node_summary = gcd$node_summary,
beta = d$posterior_summary$beta,
ag = "",
ag_species = TRUE,
db = "vdjdb")beta_violins <- get_beta_violin(node_summary = gcd$node_summary,
beta = d$posterior_summary$beta,
ag = "CMV",
ag_species = TRUE,
db = "vdjdb")Before we can start interpreting the model results, we have to make sure that the model is valid. One standard approach is to check whether our model can retrodict the observed data (community occupancy matrix) which was used to fit model parameters.
General idea of posterior predictive checks:
ClustIRR provides \(y\) and \(\hat{y}\) of each repertoire, which we can visualize with ggplot:
ggplot(data = d$posterior_summary$y_hat)+
facet_wrap(facets = ~sample, nrow = 1, scales = "free")+
geom_abline(intercept = 0, slope = 1, linetype = "dashed", col = "gray")+
geom_errorbar(aes(x = y_obs, y = mean, ymin = L95, ymax = H95),
col = "darkgray", width=0)+
geom_point(aes(x = y_obs, y = mean), size = 0.8)+
xlab(label = "observed y")+
ylab(label = "predicted y (and 95% HDI)")Given two repertoires, \(a\) and \(b\), we can quantify the differential community occupancy (DCO):
\[\begin{align} \delta^{a-b}_i = \beta^a_i - \beta^b_i \end{align}\]Importantly, ClustIRR always computes both contrasts (\(a\) vs. \(b\) and \(b\) vs. \(a\)): \(\delta^{a-b}_i\) and \(\delta^{b-a}_i\).
ggplot(data = d$posterior_summary$delta)+
facet_wrap(facets = ~contrast, ncol = 2)+
geom_errorbar(aes(x = community, y = mean, ymin = L95, ymax = H95),
col = "lightgray", width = 0)+
geom_point(aes(x = community, y = mean), shape = 21, fill = "black",
stroke = 0.4, col = "white", size = 1.25)+
theme(legend.position = "top")+
ylab(label = expression(delta))+
scale_x_continuous(expand = c(0,0))Given two repertoires, \(a\) and \(b\), ClustIRR also quantifies absolute differences in community probabilities:
\[\begin{align} \epsilon^{a-b} = \mathrm{softmax}(\alpha + \beta^a) - \mathrm{softmax}(\alpha + \beta^b) \end{align}\]Again, both contrasts are computed: \(\epsilon^{a-b}_i\) and \(\epsilon^{b-a}_i\).
ggplot(data = d$posterior_summary$epsilon)+
facet_wrap(facets = ~contrast, ncol = 2)+
geom_errorbar(aes(x = community, y = mean, ymin = L95, ymax = H95),
col = "lightgray", width = 0)+
geom_point(aes(x = community, y = mean), shape = 21, fill = "black",
stroke = 0.4, col = "white", size = 1.25)+
theme(legend.position = "top")+
ylab(label = expression(epsilon))+
scale_x_continuous(expand = c(0,0))The function dco takes as its main input a community
occupancy matrix. This enables users who are accustomed to using
complementary algorithm for detecting specificity groups, such as,
GLIPH, TCRdist3, GIANA, and iSMART, to skip steps 1-4 of the ClustIRR
workflow, and to proceed with analysis for DCO.
cluster_irrclustirrdetect_communitiesdco