---
title: Machine and Deep Learning Models with immApex
author: 
- name: Nick Borcherding
  email: ncborch@gmail.com
  affiliation: Washington University in St. Louis, School of Medicine, St. Louis, MO, USA
date: 'Compiled: `r format(Sys.Date(), "%B %d, %Y")`'
output:
  BiocStyle::html_document:
    toc_float: true
package: immApex
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Machine and Deep Learning Models with immApex}
  %\VignetteEncoding{UTF-8} 
---


```{r, echo=FALSE, results="hide", message=FALSE}
knitr::opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE, tidy = FALSE)
library(BiocStyle)
set.seed(42)
```

# Introduction

Single-cell sequencing is mainstay technology in the field of immunology and oncology that allows researchers to couple RNA quantification and other modalities, like immune cell receptor profiling at the level of an individual cell. A number of workflows and software packages have been created to process and analyze single-cell transcriptomic data. These packages allow users to take the vast dimensionality of the data generated in single-cell-based experiments and distill the data into novel insights. Some of the packages within the R environment that offer single-cell immune profiling support include [scRepertoire](https://github.com/BorchLab/scRepertoire), [immunarch](https://github.com/immunomind/immunarch), and [immcantation](https://immcantation.readthedocs.io/en/stable/). None of these packages offer support for machine or deep learning applications for immune repertoire profiling. 

**immApex** is meant to serve as an API for supervised and unsupervised learning tasks based on immune receptor sequencing. These functions extract or generate amino acid or nucleotide sequences and prepare them for downstream machine learning. **immApex** is the underlying structure for the BCR models in [Ibex](https://github.com/BorchLab/Ibex) and TCR models in [Trex](https://github.com/BorchLab/Trex). It should be noted that the tools here are created for immune receptor sequences; they will work more generally for nucleotide or amino acid sequences. The package itself supports AIRR, Adaptive, and 10x formats and interacts with the **scRepertoire** R package.

More information is available at the [immApex GitHub Repo](https://github.com/BorchLab/immApex).

## Loading Libraries

```{r}
suppressMessages(library(immApex))
suppressMessages(library(ggplot2))
suppressMessages(library(viridis))
suppressMessages(library(dplyr))
suppressMessages(library(igraph))
suppressMessages(library(tidygraph))
suppressMessages(library(ggraph))
```

# Acquiring and Preparing Repertoire Data

## getIMGT

Depending on the sequencing technology and the version, we might want to expand the length of our sequence embedding approach. The first step in the process is pulling the reference sequences from the ImMunoGeneTics (IMGT) system using ```getIMGT()```. More information for IMGT can be found at [imgt.org](https://www.imgt.org/). Data from IMGT is under a CC BY-NC-ND 4.0 license. Please be aware that attribution is required for usage and should not be used to create commercial or derivative work. 

Parameters for ```getIMGT()```

* **species** One or two word designation of species. Currently supporting: "human", "mouse", "rat", "rabbit", "rhesus monkey", "sheep", "pig", "platypus", "alpaca", "dog", "chicken", and "ferret"
* **chain** Sequence chain to access
* **frame** Designation for "all", "inframe" or "inframe+gap"
* **region** Sequence gene loci to access
* **sequence.type** Type of sequence - "aa" for amino acid or "nt" for nucleotide

Here, we will use the ```getIMGT()``` function to get the amino acid sequences for the TRBV region to get all the sequences by V gene allele.

```{r, eval=knitr::is_html_output()}
# Function to check IMGT website availability
is_imgt_available <- function() {
  tryCatch({
    r <- httr::HEAD("https://www.imgt.org", timeout(5))
    httr::status_code(r) == 200
  }, error = function(e) {
    FALSE
  })
}

# Run getIMGT only if the website is available
if (is_imgt_available()) {
  TRBV_aa <- getIMGT(species = "human",
                     chain = "TRB",
                     frame = "inframe",
                     region = "v",
                     sequence.type = "aa")

  # Display first sequence as an example
  TRBV_aa[[1]][1]
} else {
  # Display a message if IMGT is not available
  "IMGT website is not accessible at the moment."
}
```

## formatGenes

Immune receptor nomenclature can be highly variable across sequencing platforms. When preparing data for models, we can use ```formatGenes()``` to universalize the gene formats into IMGT nomenclature.

Parameters for ```formatGenes()```

* **input.data** Data frame of sequencing data or scRepertoire outputs
* **region** Sequence gene loci to access - 'v', 'd', 'j', 'c' or a combination using c('v', 'd', 'j')
* **technology** The sequencing technology employed - 'TenX', "Adaptive', or 'AIRR'
* **species** One or two word designation of species. Currently supporting: "human", "mouse", "rat", "rabbit", "rhesus monkey", "sheep", "pig", "platypus", "alpaca", "dog", "chicken", and "ferret"
* **simplify.format** If applicable, remove the allelic designation (TRUE) or retain all information (FALSE)

Here, we will use the built-in example from Adaptive Biotechnologies and reformat and simplify the **v** region. ```formatGenes()``` will add 2 columns to the end of the data frame per region selected - 1) **v_IMGT** will be the formatted gene calls and 2) **v_IMGT.check** is a binary for if the formatted region appears in the IMGT database. In the example below, "TRBV2-1" is not recognized as a designation within IMGT.

```{r }
data("immapex_example.data")
Adaptive_example <- formatGenes(immapex_example.data[["Adaptive"]],
                                region = "v",
                                technology = "Adaptive", 
                                simplify.format = TRUE) 

head(Adaptive_example[,c("aminoAcid","vGeneName", "v_IMGT", "v_IMGT.check")])
```

## inferCDR

We can now use ```inferCDR()``` to add additional sequence elements to our example data using the outputs of ```formatGenes()``` and ```getIMGT()```. Here, we will use the function to isolate the complementarity-determining regions (CDR) 1 and 2. If the gene nomenclature does not match the IMGT the result will be NA for the given sequences. Likewise, if the IMGT nomenclature has been simplified, the first allelic match will be used for sequence extraction. 

Parameters for ```inferCDR```

* **input.data** Data frame of sequencing data or output from formatGenes().
* **reference** IMGT sequences from ```getIMGT()```
* **technology** The sequencing technology employed - 'TenX', "Adaptive', or 'AIRR',
* **sequence.type** Type of sequence - "aa" for amino acid or "nt" for nucleotide
* **sequences** The specific regions of the CDR loop to get from the data.

```{r }
if (is_imgt_available()) {
  Adaptive_example <- inferCDR(Adaptive_example,
                               chain = "TRB", 
                               reference = TRBV_aa,
                               technology = "Adaptive", 
                               sequence.type = "aa",
                               sequences = c("CDR1", "CDR2"))
  
  Adaptive_example[200:210,c("CDR1_IMGT", "CDR2_IMGT")]
} else {
  # Display a message if IMGT is not available
  "IMGT website is not accessible at the moment."
}
```

# Generating and Augmenting Sequence Sets

## generateSequences

Generating synthetic sequences is a quick way to start testing the model code. ```generateSequences()``` can also generate realistic noise for generative adversarial networks.

Parameters for ```generateSequences()```

* **prefix.motif** Add a defined sequence to the start of the generated sequences.
* **suffix.motif** Add a defined sequence to the end of the generated sequences
* **number.of.sequences** Number of sequences to generate
* **min.length** Minimum length of the final sequence (will be adjusted if incongruent with prefix.motif/suffix.motif)
* **max.length** Maximum length of the final sequence
* **sequence.dictionary** The letters to use in sequence generation (default are all amino acids)

```{r }
sequences <- generateSequences(prefix.motif = "CAS",
                               suffix.motif = "YF",
                               number.of.sequences = 1000,
                               min.length = 8,
                               max.length = 16)
sequences <- unique(sequences)
head(sequences)
```

If we want to generate nucleotide sequences instead of amino acids, we must to change the **sequence.dictionary**.

```{r }
nucleotide.sequences <- generateSequences(number.of.sequences = 1000,
                                          min.length = 8,
                                          max.length = 16, 
                                          sequence.dictionary = c("A", "C", "T", "G"))
head(nucleotide.sequences)
```

## mutateSequences

A common approach is to mutate sequences randomly or at specific intervals. This can be particularly helpful if we have fewer sequences or want to test a model for accuracy given new, altered sequences. ```mutateSequences()``` allows us to tune the type of mutation, where along the sequences to introduce the mutation and the overall number of mutations.

Parameters for ```mutateSequences()```

* **input.sequences** The amino acid or nucleotide sequences to use
* **number.of.sequences** The number of mutated sequences to return per input.sequence
* **mutation.rate** The rate of mutations introduced into sequences
* **position.start** The starting position to mutate along the sequence. Default NULL will start the random mutations at position 1.
* **position.end** The ending position to mutate along the sequence. Default NULL will end the random mutations at the last position.
* **sequence.dictionary** The letters to use in sequence mutation (default are all amino acids)

```{r }
mutated.sequences <- mutateSequences(sequences, 
                                     number.of.sequences = 1,
                                     position.start = 3,                                  
                                     position.end = 8)
head(sequences)
head(mutated.sequences)
```

# Feature Engineering from Repertoires

Beyond encoding single sequences, immApex provides functions to calculate summary statistics and features across a collection of sequences, such as a clonal repertoire.

## calculateFrequency

This function calculates the relative frequency of each residue at each position across a set of sequences.

Parameters for ```calculateFrequency()```

* **input.sequences**: A character vector of sequences.
* **max.length**: The length to align sequences to by padding/trimming.
* **tidy**: If TRUE, returns a long-format data frame suitable for ggplot2.

```{r }
freq.matrix <- calculateFrequency(sequences, 
                                  max.length = 20)
head(freq.matrix[, 1:10])
```

# calculateEntropy

This function measures the diversity (or entropy) at each position in a set of aligned sequences. It can use several common diversity metrics.

Parameters for `calculateEntropy()`

* **method**: The diversity metric to use: "shannon", "inv.simpson", "gini.simpson", "norm.entropy", "pielou", or Hill numbers ("hill0", "hill1", "hill2").

```{r }
shannon.entropy <- calculateEntropy(sequences, 
                                    method = "shannon")
head(shannon.entropy)
```

## calculateProperty

This function computes position-wise summary statistics for amino acid properties. For each position, it calculates a metric (like the mean) of a specific physicochemical property for all residues found at that position.

Parameters for `calculateProperty()`

* **property.set**: The amino acid property scale to use (e.g., "Atchley").
* **summary.fun**: The summary statistic to apply ("mean", "median", "sum" etc.).

```{r }
# Calculate the mean of Atchley factors at each position
atchley.profile <- calculateProperty(sequences, 
                                     property.set = "atchleyFactors", 
                                     summary.fun = "mean")
head(atchley.profile[, 1:6])
```

## calculateGeneUsage

This function quantifies the usage of gene loci (e.g., V and J genes) within a repertoire.

Parameters for `calculateGeneUsage()`
* **loci**: A character vector of one or two column names corresponding to the gene loci.
* **summary**: The output format: "proportion" (default), "count", or "percent".

```{r }
# Using the built-in AIRR data
data_airr <- immapex_example.data[["AIRR"]]

# Calculate paired V-J gene usage as percentages
vj_usage <- calculateGeneUsage(data_airr, 
                               loci = c("v_call", "j_call"),
                               summary = "percent")
vj_usage[1:5, 1:5]
```

## calculateMotif

This function rapidly finds and counts contiguous (or gapped) amino acid or nucleotide motifs of specified lengths across a set of sequences.

Parameters for `calculateMotif()`

* **motif.lengths**: An integer vector of motif sizes to search for.
* **min.depth**: The minimum number of times a motif must appear to be included in the output.
* **discontinuous**: If TRUE, also finds motifs with a single gap (e.g., "C.S").

```{r }
motif.counts <- calculateMotif(sequences, 
                               motif.lengths = 3, 
                               min.depth = 5)
head(motif.counts)
```

## probabilityMatrix

This function calculates a positional probability matrix (PPM) for a group of sequences. This can be used to represent the sequence profile of an antigen-specific repertoire or to generate a sequence logo.

```{r }
ppm.matrix <- probabilityMatrix(sequences)
head(ppm.matrix)
```

The PPM can also be converted to a positional weight matrix (PWM) using a log-likelihood ratio.

```{r }
set.seed(42)
# Generate a sample background frequency
back.freq <- sample(1:1000, 20)
names(back.freq) <- amino.acids
back.freq <- back.freq/sum(back.freq)

pwm.matrix <- probabilityMatrix(sequences,
                                max.length = 20,
                                convert.PWM = TRUE,
                                background.frequencies = back.freq)
head(pwm.matrix)
```

## adjacencyMatrix and buildNetwork

These functions help analyze relationships between residues or whole sequences.

`adjacencyMatrix()` summarizes transitions between adjacent residues, creating a matrix of co-occurrence counts.

```{r }
adj.matrix <- adjacencyMatrix(sequences, 
                              normalize = FALSE)
adj.matrix[1:10, 1:10]
```

`buildNetwork()` constructs a similarity network where nodes are sequences and edges connect sequences with an edit distance below a given threshold.

```{r }
# Building an edge list
g1 <- buildNetwork(data.frame(sequences = sequences),
                   seq_col = "sequences",
                   threshold = 2)

# Forming network
graph <- graph_from_edgelist(as.matrix(g1[,1:2]))
E(graph)$weight <- g1[,3]

# Remove isolated nodes for clearer visualization
graph <- delete_vertices(graph, which(degree(graph) == 0))

# Convert to tidygraph for use with ggraph
g_tidy <- as_tbl_graph(graph)

# Plot the network
ggraph(g_tidy, layout = "fr") + 
  geom_edge_link(aes(width = weight), color = "black", alpha = 0.5) + 
  geom_node_point(aes(size = degree(g_tidy, mode = "all")),
                  fill = "steelblue", color= "black", shape = 21) +
  theme_void() + 
  scale_edge_width(range = c(0.1, 0.5)) + 
  labs(size = "Degree") +
  theme(legend.position = "right")
```

# Encoding Sequences for Model Input

## sequenceEncoder

The primary tool for converting biological sequences into a numerical format suitable for machine learning is the ```sequenceEncoder()``` function. It is a versatile wrapper that can generate three different types of numerical representations, controlled by the mode argument. Understanding this function is key to using the package effectively.

**Key Arguments**:

* **input.sequences**: The character vector of your amino acid or nucleotide sequences.
* **mode**: The most important argument. A string specifying the encoding type you need: `"onehot"`, `"property"`, or `"geometric"`.
* **verbose**: A logical flag (TRUE or FALSE) to control whether progress messages are printed.

## onehotEncoder

One-hot encoding is a common method that transforms each amino acid at each position into a binary vector. This is useful when the specific identity and position of each residue are important. This functionality is accessed by setting **mode** = `"onehot"` in ```sequenceEncoder()``` or by using the ```onehotEncoder()``` alias.

**Mode-Specific Arguments**:

* **max.length**: Additional length to pad, NULL will pad sequences to the max length of input.sequences
* **padding.symbol**: The single character used to pad sequences shorter than max.length.
* **sequence.dictionary**: The letters to use in encoding (default are all amino acids)

```{r }
# Generate one-hot encoding using the main function
enc_onehot <- sequenceEncoder(sequences, mode = "onehot", verbose = FALSE)

# You can achieve the same result with the alias
enc_onehot_alias <- onehotEncoder(sequences, verbose = FALSE)

# View the first few columns of the flattened matrix output
head(enc_onehot$flattened[, 1:10])
```

## propertyEncoder

Instead of a binary vector, this method represents each amino acid using a set of continuous numerical values based on physicochemical properties (e.g., kideraFactors, FASGAI, etc). This can capture the biochemical similarity between amino acids. Access this method by setting **mode** = `"property"` or using the ```propertyEncoder()``` alias.

**Mode-Specific Arguments**:

* **property.set**: A character string specifying which set of pre-defined amino acid properties to use.
* **property.matrix**: A custom numeric matrix of property values you provide yourself.
    
Available **property.set** options include:

* atchleyFactors - [citation](https://pubmed.ncbi.nlm.nih.gov/15851683/)
* crucianiProperties - [citation](https://analyticalsciencejournals.onlinelibrary.wiley.com/doi/abs/10.1002/cem.856)
* FASGAI - [citation](https://pubmed.ncbi.nlm.nih.gov/18318694/)
* kideraFactors - [citation](https://link.springer.com/article/10.1007/BF01025492)
* MSWHIM - [citation](https://pubs.acs.org/doi/10.1021/ci980211b)
* ProtFP - [citation](https://pubmed.ncbi.nlm.nih.gov/24059694/)
* stScales - [citation](https://pubmed.ncbi.nlm.nih.gov/19373543/)
* tScales - [citation](https://www.sciencedirect.com/science/article/abs/pii/S0022286006006314)
* VHSE - [citation](https://pubmed.ncbi.nlm.nih.gov/15895431/)
* zScales - [citation](https://pubmed.ncbi.nlm.nih.gov/9651153/)


```{r }
# Generate property-based encoding using FASGAI factors
enc_prop <- sequenceEncoder(sequences,
                            mode = "property",
                            property.set = "FASGAI",
                            verbose = FALSE)

# The propertyEncoder() alias is a convenient shortcut
enc_prop_alias <- propertyEncoder(sequences, property.set = "FASGAI", verbose = FALSE)

# View the first few columns of the flattened property matrix
head(enc_prop$flattened[, 1:6])
```

## geometricEncoder

This approach creates a single, fixed-length vector (an "embedding") for an entire sequence. It works by averaging the vectors for each amino acid from a substitution matrix (like BLOSUM62) and then applying a geometric rotation. This is useful for tasks where a summary of the entire sequence is needed, similar to approaches like [GIANA](https://pubmed.ncbi.nlm.nih.gov/34349111/). Use this with **mode** = `"geometric"` or the ```geometricEncoder()``` alias.

Parameters for ```geometricEncoder()```

* **method**: Select the following substitution matrices: "BLOSUM45", "BLOSUM50", "BLOSUM62", "BLOSUM80", "BLOSUM100", "PAM30", "PAM40", "PAM70", "PAM120", or "PAM250"
* **theta**: The angle in which to create the rotation matrix

```{r }
# Generate a geometric embedding for each sequence
enc_geo <- sequenceEncoder(sequences,
                           mode = "geometric",
                           method = "BLOSUM62",
                           theta = pi / 3,
                           verbose = FALSE)

# The alias provides a direct path to this functionality
enc_geo_alias <- geometricEncoder(sequences, 
                                  method = "BLOSUM62", 
                                  theta = pi / 3, 
                                  verbose = FALSE)

# The output is a single summary matrix where each row is a sequence
head(enc_geo$summary)
```

## tokenizeSequences

Another approach to transforming a sequence into numerical values is tokenizing it into numbers. This is a common approach for recurrent neural networks where one letter corresponds to a single integer. In addition, we can add start and stop tokens to our original sequences to differentiate between the beginning and end of the sequences. 

Parameters for ```tokenizeSequences()```

* **add.startstop** Add start and stop tokens to the sequence
* **start.token** The character to use for the start token
* **stop.token** The character to use for the stop token
* **max.length** Additional length to pad, NULL will pad sequences to the max length of input.sequences
* **convert.to.matrix** Return a matrix (**TRUE**) or a vector (**FALSE**)

```{r }
token.matrix <- tokenizeSequences(input.sequences =  c(sequences, mutated.sequences), 
                                  add.startstop = TRUE,
                                  start.token = "!",
                                  stop.token = "^", 
                                  convert.to.matrix = TRUE)
head(token.matrix[,1:18])
```

## sequenceDecoder

We have a function called ```sequenceDecoder()``` that extracts sequences from one-hot or property-encoded matrices or arrays. This function can be applied to any generative approach to sequence generation.

Parameters for ```sequenceDecoder()```

* **sequence.matrix** The encoded sequences to decode in an array or matrix
* **mode** The encoding mode used for decoding: `"onehot"` or `"property"`
* **property.set** For `mode = "property"`, a character vector of property names (e.g., `"Atchley"`) that were used for the original encoding.
* **call.threshold** The relative strictness of sequence calling with higher values being more stringent

```{r }
property.matrix <- propertyEncoder(input.sequences =  c(sequences, mutated.sequences), 
                                   property.set = "FASGAI")

property.sequences <- sequenceDecoder(property.matrix[[1]],
                                      mode = "property",
                                      property.set  = "FASGAI",
                                      call.threshold = 1)
head(sequences)
head(property.sequences)
```

A similar approach can be applied when using matrices or arrays derived from one-hot encoding: 

```{r tidy=FALSE}
sequence.matrix <- onehotEncoder(input.sequences =  c(sequences, mutated.sequences))

OHE.sequences <- sequenceDecoder(sequence.matrix,
                                 mode= "onehot")

head(OHE.sequences)
```

# Training a Model

## Example 1: Classifying Sequences with Random Forest

A common task is to classify sequences into groups, such as predicting whether an immune receptor binds to a specific antigen. Here, we'll train a Random Forest model to distinguish between two classes of sequences based on their physicochemical properties

**Step 1: Simulate Data and Engineer Features**

First, we'll simulate two distinct classes of sequences using ```generateSequences()```. We'll then use ```propertyEncoder()``` with the "Atchley" factors to convert each sequence into a flattened numerical vector. Each element in the vector represents a specific physicochemical property at a specific position, transforming our variable-length sequences into a fixed-size feature matrix suitable for machine learning.

```{r}
# Step 1a: Generate two distinct classes of sequences
class1.sequences <- generateSequences(prefix.motif = "CAS",
                                      min.length = 3,
                                      number.of.sequences = 500)

class2.sequences <- generateSequences(prefix.motif = "CSG",
                                      min.length = 3,
                                      number.of.sequences = 500)

# Combine sequences and create labels
all.sequences <- c(class1.sequences, class2.sequences)
labels <- as.factor(c(rep("Class1", 500), rep("Class2", 500)))

# Step 1b: Use propertyEncoder to create a feature matrix from Atchley factors
feature.matrix <- propertyEncoder(all.sequences, 
                                  property.set = "atchleyFactors",
                                  verbose = FALSE)

# Combine the flattened feature matrix and labels into the final data frame
training.data <- data.frame(feature.matrix$flattened, labels)
```

**Step 2: Train and Evaluate the Model**

Now we can train the Random Forest classifier. This model is robust, requires minimal tuning, and can provide insights into which features (in this case, which property at which position) are most important for distinguishing between the classes.

```{r}
suppressMessages(library(randomForest))

# Train the Random Forest model
set.seed(42) # for reproducibility
rf.model <- randomForest(labels ~ ., 
                         data = training.data, 
                         ntree = 100, 
                         importance = TRUE) # Set importance=TRUE to calculate scores

# Print the confusion matrix to see model performance
print(rf.model)
```

A key advantage of Random Forest is the ability to easily extract feature importance. We can create a plot to see which positional Atchley factors were most influential in the model's predictions.

```{r}
# Extract feature importance data from the model
importance.data <- as.data.frame(importance(rf.model))
importance.data$Feature <- rownames(importance.data)

# Get the top 15 most important features
top_features <- importance.data[order(importance.data$MeanDecreaseGini, decreasing = TRUE), ][1:15,]

# Plot using ggplot2
ggplot(top_features, aes(x = MeanDecreaseGini, y = reorder(Feature, MeanDecreaseGini))) +
  geom_col(aes(fill = MeanDecreaseGini)) +
  scale_fill_viridis_c() +
  labs(title = "Top 15 Most Important Features",
       x = "Mean Decrease in Gini Impurity",
       y = "Feature (Property and Position)") +
  theme_classic() +
  theme(legend.position = "none")
```

The plot shows the features that the model found most useful for classification. The feature names correspond to specific Atchley factors at specific positions within the sequence - here residues 2 and 3 where the difference was encoded - demonstrating how ```propertyEncoder()``` allows the model to learn from the underlying biochemistry.

## Example 2: Unsupervised Clustering with PCA and Geometric Encoding

Sometimes we don't have labels, but we want to see if our sequences form natural clusters. We can use dimensionality reduction techniques like Principal Component Analysis (PCA) to visualize the relationships between sequences. The ```geometricEncoder()``` function is perfect for this, as it creates a rich, fixed-length numerical embedding for each sequence.

**Step 1: Simulate and Encode a Mixed Population of Sequences**

We'll generate three distinct families of sequences. Then, we will use ```geometricEncoder()``` to transform them into a 20-dimensional numerical matrix.

```{r}
# Generate three distinct groups of sequences
groupA <- generateSequences(prefix.motif = "CA", 
                            number.of.sequences = 100, 
                            min.length = 2)
groupB <- generateSequences(prefix.motif = "QR", 
                            number.of.sequences = 100, 
                            min.length = 2)
groupC <- generateSequences(prefix.motif = "YY", 
                            number.of.sequences = 100, 
                            min.length = 2)

# Combine them and create a vector of original group IDs for later visualization
mixed.sequences <- c(groupA, groupB, groupC)
original.groups <- as.factor(rep(c("Group A", "Group B", "Group C"), each = 100))

# Use geometricEncoder to create embeddings
geometric.embeddings <- geometricEncoder(mixed.sequences, 
                                         method = "BLOSUM62", 
                                         verbose = FALSE)
```

**Step 2: Perform PCA and Visualize Clusters** 

With our geometric.embeddings matrix, we can now perform PCA and plot the results. If the geometric encoding captures the underlying differences in our sequence families, we should see distinct clusters in the plot of the first two principal components.

```{r}
# Perform PCA on the embedding matrix
pca.result <- prcomp(geometric.embeddings$summary, center = TRUE, scale. = TRUE)

# Prepare data for plotting with ggplot2
pca.data <- data.frame(PC1 = pca.result$x[,1],
                       PC2 = pca.result$x[,2],
                       Group = original.groups)

# Plot PC1 vs PC2 using ggplot2 and viridis
ggplot(pca.data, aes(x = PC1, y = PC2, color = Group)) +
  geom_point(size = 3, alpha = 0.8) +
  scale_color_viridis(discrete = TRUE) +
  labs(title = "PCA of Geometric Sequence Embeddings",
       x = "Principal Component 1",
       y = "Principal Component 2") +
  theme_classic()
```

The resulting plot clearly shows three distinct clusters, demonstrating that the ```geometricEncoder()``` successfully captured the structural differences between our sequence families, allowing for their separation in an unsupervised manner.

## Example 3: Identifying Sequence Communities with Network Analysis

Beyond analyzing individual sequences, we can explore the relationships between sequences by building a similarity network. A powerful machine learning application for these networks is community detection, an unsupervised clustering method that finds groups of densely connected nodes. In immunology, this can be used to identify clonal families or groups of sequences with shared features.

**Step 1: Simulate Data with Inherent Structure**

First, we will simulate a dataset containing three distinct "families" of sequences. Our goal is to see if the network analysis can blindly recover these groups without being given the labels.

```{r}
set.seed(42)
# Generate three distinct groups of sequences to simulate clonal families
family1 <- unique(generateSequences(prefix.motif = "CASS", 
                             suffix.motif = "YF", 
                             min.length = 6,
                             number.of.sequences = 60))
family2 <- unique(generateSequences(prefix.motif = "CARS", 
                             suffix.motif = "GF",
                             min.length = 6,
                             number.of.sequences = 60))
family3 <- unique(generateSequences(prefix.motif = "CSVA", 
                             suffix.motif = "HF", 
                             min.length = 6,
                             number.of.sequences = 60))

# Combine into a single data frame
all_sequences_df <- data.frame(
  sequence = c(family1, family2, family3),
  original_family = c(rep("Family 1", 42), rep("Family 2", 40), rep("Family 3", 46))
)
```

**Step 2: Build the Sequence Network**

Next, we use ```buildNetwork()``` to create an edge list based on sequence similarity (edit distance). We then convert this list into a formal graph object using the igraph package, which is the standard for network analysis in R.

```{r}
# Build the edge list: connect sequences with an edit distance of 3 or less
edge_list <- buildNetwork(all_sequences_df,
                          seq_col = "sequence",
                          threshold = 3)

# Replace numerical edge list with the sequences
edge_list_sequences <- as.matrix(data.frame(
  from = all_sequences_df$sequence[as.numeric(edge_list$from)],
  to = all_sequences_df$sequence[as.numeric(edge_list$to)],
  dist = edge_list$dist
))

# Create a graph object from the edge list and node data
# This graph now contains all our sequences as nodes
sequence_graph <- graph_from_data_frame(d = edge_list_sequences, 
                                        vertices = all_sequences_df, 
                                        directed = FALSE)
E(sequence_graph)$weight <- edge_list_sequences[,3]
```

**Step 3: Perform Community Detection**

Now we apply a machine learning algorithm to find communities. The `igraph` package offers many algorithms; we will use "Walktrap," a method that finds communities through a series of short random walks. The idea is that walks are more likely to get "trapped" within densely connected parts of the network (i.e., communities).

```{r}
# Perform community detection using the Walktrap algorithm
communities <- cluster_walktrap(sequence_graph)

# Add the community membership as an attribute to the graph nodes
V(sequence_graph)$community <- communities$membership
```

**Step 4: Visualize and Interpret the Communities**

Finally, we use `ggraph` to visualize the network, coloring each node by the community it was assigned to by the algorithm. If the analysis was successful, the colors should align with the original sequence families we simulated.

```{r}
# Convert to tidygraph for use with ggraph
g_tidy <- as_tbl_graph(sequence_graph)

# Plot the network, coloring nodes by their detected community
ggraph(g_tidy, layout = "fr") + 
  geom_edge_link(aes(alpha = weight), show.legend = FALSE) +
  geom_node_point(aes(color = as.factor(community)), size = 4) +
  scale_color_viridis(discrete = TRUE, option = "D") +
  labs(title = "Sequence Network with Detected Communities",
       color = "Detected\nCommunity") +
  theme_void()
```

The plot clearly visualizes the network structure, and the distinct color groups demonstrate that the community detection algorithm successfully identified and separated the three original sequence families without any prior information. This unsupervised approach is a powerful tool for exploring the hidden structure within a complex repertoire.

***

# Conclusion

This has been a general overview of the capabilities of **immApex** for processing immune receptor sequences, extracting features, and developing algorithms. If you have any questions, comments, or suggestions, feel free to visit the [GitHub repository](https://github.com/BorchLab/immApex).

## Session Info

```{r}
sessionInfo()
```