Post-transcriptional mechanisms play a central role in the regulation of gene expression, with protein levels partly determined by various features within target mRNAs. Emerging evidence indicates that most individual mRNAs contain multiple regulatory elements. This underscores the need for efficient bioinformatic tools that can capture and integrate multiple mRNA features to assess both their independent and combined impact on the proteome. Here, we present postNet, a tool that enables in silico identification, integration, and modelling of mRNA features that influence post-transcriptional regulation of gene expression at a transcriptome-wide scale. Although geared towards studies of post-transcriptional regulation, postNet is highly customizable and can, in principle, be applied in a variety of other contexts to explain changes in a continuous numeric variable between two conditions/groups. This vignette provides details regarding the use of postNet, and demonstrates typical workflows and results interpretation.
postNet 0.99.8
Post-transcriptional regulation is a major determinant of gene expression, occurring through processes such as altered mRNA translation efficiency and/or stability (Sonenberg and Hinnebusch 2009; Medina-Muñoz et al. 2021; Tushev et al. 2018). mRNAs encode intrinsic regulatory features (cis-acting elements) that influence their post-transcriptional fate, often in a context-dependent manner through interactions with trans-acting factors that integrate cellular and environmental signalling cues. For example, features in 5’ untranslated regions (UTRs) such as 5′ terminal oligopyrimidine (TOP) motifs, and upstream open reading frames (uORFs) can determine the translation efficiency of transcripts under conditions of cellular stress where mTOR signalling is suppressed and the integrated stress response (ISR) is activated (Philippe et al. 2020; Young et al. 2015). Furthermore, features of coding sequences (CDSs) and 3’UTRs also influence the post-transcriptional fate of mRNA molecules, for example via codon composition determining demand for tRNAs (Lorent et al. 2019), and 3’UTR-encoded sequences dictating mRNA localization and stability (Tushev et al. 2018). One of the main challenges in understanding sequence determinants of selective post-transcriptional regulation is that individual mRNAs often contain multiple regulatory features whose effects may depend on cellular context, making it challenging to pinpoint which features drive observed regulatory outcomes, and if features act independently or in a combinatorial manner. Moreover, many RNA features covary across transcripts, complicating statistical analyses and interpretation.
To address this challenge, we developed postNet, an R/Bioconductor package for transcriptome-wide identification, quantification, and modelling of RNA features that associate with post-transcriptional regulation. PostNet provides a suite of flexible tools to perform sequence feature analysis and statistical comparisons across gene sets of interest, including nucleotide content, sequence length, folding energy, motifs, uORFs, and codon composition. PostNet also performs hierarchical statistical modelling to decipher regulatory networks, revealing independent and combinatorial effects of mRNA features on post-transcriptional regulation of gene expression. The package also supports machine learning approaches, including Random Forest models, to classify modes of post-transcriptional regulation based on feature composition. These models can be applied across datasets to predict regulatory mechanisms and to test generalizability across experimental contexts. Finally, postNet also allows users to explore relationships between regulatory effects and features using network analyses and UMAP visualizations. In summary, postNet provides an integrated computational framework for dissecting how mRNA sequence features and/or other regulatory elements interact to shape gene expression.
PostNet was designed to accommodate a variety of inputs and biological contexts. A postNet analysis can consist of the following steps:
postNetData object with the postNetStart() function, containing curated reference mRNA sequences, gene sets of interest to be compared, and a regulatory effect measurement that will be used in downstream analyses. See Setting up a postNet analysis for details.featureIntegration() function, which implements either forward stepwise regression or Random Forest classification approaches. See Modelling post-transcriptional regulation for details.rfPred() function. See Network analysis and prediction for details.plotFeaturesMap() function. See Visualize and explore relationships between mRNA features and regulation for details.Here, we demonstrate a minimal example of the steps of a standard postNet analysis workflow using an example dataset illustrating changes in mRNA translation, with default parameters. This includes:
# Load the package
library(postNet)
# Load the vignette example data
data("postNetVignette", package = "postNet")Step 1: A postNetData object containing regulated gene sets of interest along with the background set, a regulatory effect measurement (in this case, log2 fold changes in translation efficiency derived from ribosome profiling and RNA-seq comparing cells responding to osmotic stress against controls (Krokowski et al. 2022)), and reference mRNA sequence annotations is initialized using the postNetStart() function.
# Prepare custom gene lists and regulatory effect measurement:
myGenes <- postNetVignette$geneList
str(myGenes)## List of 2
## $ translationUp : chr [1:1445] "DDX6" "CLK1" "DUSP1" "TXNIP" ...
## $ translationDown: chr [1:1378] "PPDPF" "NCKAP1" "AP2M1" "ABHD12" ...
## chr [1:9555] "A4GALT" "AAAS" "AACS" "AADAT" "AAGAB" "AAK1" "AAMDC" "AAMP" ...
## Named num [1:9555] -2.095 0.275 -0.57 -0.608 0.576 ...
## - attr(*, "names")= chr [1:9555] "A4GALT" "AAAS" "AACS" "AADAT" ...
# Initialize a postNetData object:
ptn <- postNetStart(
geneList = myGenes,
geneListcolours = c("#FC9272", "#99000D"),
customBg = myBg,
effectMeasure = myEffect,
selection = "random",
setSeed = 123,
source = "load",
species = "human"
)Step 2: mRNA features that may influence post-transcriptional regulation of gene expression are enumerated. In this minimal example, the length and nucleotide content of 5’UTR regions are calculated and compared between gene sets. However, numerous other features across all sequence regions can be examined (see Analysis of mRNA sequence features). These analyses allow both statistical comparisons of mRNA features between gene sets of interest, and generate outputs that can be used to assess the contribution of a given mRNA feature to the observed regulatory effect in the modelling step.
# Quantify the length and nucleotide content of the 5'UTR and compare between regulated gene sets
len <- lengthAnalysis(
ptn = ptn,
region = c("UTR5"),
comparisons = list(c(0, 1), c(0, 2), c(1, 2)),
plotOut = TRUE,
plotType = "boxplot",
pdfName = "Example"
)
str(len)## List of 1
## $ UTR5_length: Named num [1:9555] 8.34 7.18 7.23 7.71 4.91 ...
## ..- attr(*, "names")= chr [1:9555] "A4GALT" "AAAS" "AACS" "AADAT" ...
content_UTR5 <- contentAnalysis(
ptn = ptn,
region = c("UTR5"),
comparisons = list(c(0, 1), c(0, 2), c(1, 2)),
contentIn = c("GC"),
plotOut = TRUE,
plotType = "ecdf",
pdfName = "Example"
)
str(content_UTR5)## List of 1
## $ UTR5_GC: Named num [1:9555] 62 63.4 80 64.1 73.3 ...
## ..- attr(*, "names")= chr [1:9555] "A4GALT" "AAAS" "AACS" "AADAT" ...
Figure 1: Comparisons of 5’UTR length and GC content between mRNAs belonging to regulated gene sets
Step 3: Changes in translation efficiency are modelled using the featureIntegration() function which performs stepwise regression to identify the collection of features that best explain the observed regulation. In this minimal example, a set of pre-calculated features is used including both cis-acting (enumerated mRNA features) and trans-acting (signatures of upstream regulatory pathways) factors.
# Prepare the list of pre-calculated features to be used in modelling (for example,
# the 'len' and 'content_UTR' variables defined in the previous step can be included)
features <- postNetVignette$features
str(features)
# Signatures of trans-acting factors can also be included, and some are provided with the package
humanSignatures <- get_signatures(species = "human")
str(humanSignatures)
# Signatures can be converted into feature inputs using the signCalc function
trans_signatures <- signCalc(ptn, humanSignatures)
# Compile the final feature input:
features <- c(features, trans_signatures)
# Group the features according to category (optional)
group <- c(
"UTR5", "UTR3", "UTR5", "CDS", "UTR5", "UTR5", "UTR5", "CDS", "CDS",
rep("Pathway", 2), "UTR5", rep("Pathway", 2)
)
groupColour <- c("#834b62", "#6699cc", "#e9724c", "#fff299")
names(groupColour) <- c("UTR5", "CDS", "UTR3", "Pathway")
# Run feature integration modelling using forward stepwise regression
ptn <- featureIntegration(
ptn = ptn,
features = features,
pdfName = "omnibus",
regOnly = TRUE,
allFeat = FALSE,
analysis_type = "lm",
covarFilt = 20,
comparisons = list(c(1, 2)),
lmfeatGroup = group,
lmfeatGroupColour = groupColour,
NetModelSel = "omnibus"
)
Figure 2: Network plot of the results of the stepwise regression omnibus model
Step 4: The plotFeaturesMap() function is used to visualize the relationship between translation efficiency and different features that were identified in the modelling step as significantly contributing to the observed post-transcriptional regulation. These visualizations are also useful for exploring the overlaps between different features within the same mRNAs.
ptn <- plotFeaturesMap(ptn,
regOnly = TRUE,
comparisons = list(c(1, 2)),
featSel = names(ptn_selectedFeatures(ptn,
analysis_type = "lm",
comparison = 1
)),
remBinary = TRUE,
featCol = "UTR5_SCSCGS",
scaled = FALSE,
remExtreme = 0.1,
pdfName = "Example"
)
Figure 3: UMAP visualizing changes in translation efficiency (Effect) generated based on features identified in the omnibus model (left), with mRNAs containing SCSCGS motifs in the 5´UTR coloured (right)
This example provided an overview of a minimal postNet workflow. Full details of alternative workflows and additional inputs and options are discussed in the following sections.
The basic workflow of a postNet analysis aims to explain the post-transcriptional regulatory fate of mRNAs based on their repertoire of cis-acting sequence features and/or interactions with trans-acting factors. Three basic inputs are required to run a postNet analysis:
These inputs are compiled at the start of the analysis by initializing a postNetData object using the postNetStart() function. This class of object serves as the container for storing inputs and some outputs generated during the analysis, and is designated by ptn in the example code provided here. The following sections describe the different options and customizations available, along with examples illustrating how to set up a postNet analysis.
Throughout this vignette, we will use example data provided with the package to illustrate the implementation of different analyses. These data are published in the study by Krokowski et al. (Krokowski et al. 2022), where changes in translation efficiency were measured using ribosome profiling and RNA-seq in immortalized human corneal epithelial cells responding to osmotic stress (500 mOsm, NaCl) for 1 h, along with controls. Genes that were translationally activated or suppressed under osmotic stress were identified using the anota2seq algorithm (Oertlin et al. 2019).
The most flexible implementation of postNet allows gene lists of interest and regulatory effect measurements to be supplied from custom sources. Here, the postNetStart() function will be used with the geneList, geneListcolours, customBg, and effectMeasure parameters to initialize the postNetData object.
Depending on the aim of the analysis, different inputs can be provided. In the case where the goal is simply to enumerate sequence features in genes of interest without statistical comparisons, geneList can be a single list of gene IDs with no background gene set. To perform statistical comparisons, one or more gene sets of interest must be provided, either with or without a custom background gene set (customBg parameter). Although optional, it is strongly recommended to carefully consider that the appropriate background is used. Typically, this would be all genes in the given dataset passing expression and/or reproducibility thresholds. When providing custom gene lists, it is also required to specify colours (geneListcolours parameter) that will be used in visualizations generated at various steps of the analysis.
For each gene provided as part of a gene set of interest or background set, you must provide a regulatory effect measurement. As postNet is designed for examining post-transcriptional regulation, this would typically be the log2 fold changes in mRNA translation efficiency or expression between two conditions. However, other continuous numeric variables representing a change between two conditions could also be used, allowing a high degree of flexibility in potential applications.
The example below illustrates the set-up of a postNet analysis allowing identification of mRNA features and statistical comparisons between sets of translationally activated and suppressed genes, as well as modelling and network analysis to identify features associated with changes in translational regulation and their interdependence.
# Genes of interest should be provided in a named list.
myGenes <- postNetVignette$geneList
str(myGenes)## List of 2
## $ translationUp : chr [1:1445] "DDX6" "CLK1" "DUSP1" "TXNIP" ...
## $ translationDown: chr [1:1378] "PPDPF" "NCKAP1" "AP2M1" "ABHD12" ...
# All gene IDs in the list should be present in the background.
myBg <- postNetVignette$background
str(myBg)## chr [1:9555] "A4GALT" "AAAS" "AACS" "AADAT" "AAGAB" "AAK1" "AAMDC" "AAMP" ...
# The regulatory effect measurement must be named with the same gene IDs
# present in the background (or the gene list if no custom background is provided).
myEffect <- postNetVignette$effect
str(myEffect)## Named num [1:9555] -2.095 0.275 -0.57 -0.608 0.576 ...
## - attr(*, "names")= chr [1:9555] "A4GALT" "AAAS" "AACS" "AADAT" ...
A postNetData object can also be constructed directly downstream of running an analysis using the Anota2seq package to identify differentially translated genes. Here, the postNetStart() function will be used with the ads, regulation, contrast, regulationGen, and contrastSel parameters to initialize the postNetData object.
Anota2seq can be applied using data from both polysome and ribosome profiling techniques, and categorizes genes into three different regulatory modes: 1) Abundance, where changes in mRNA levels can be explained by transcription or mRNA stability. 2) Translation, where polysome-association is altered without a change in total mRNA level (this mode is expected to alter protein levels). 3) Buffering/translational offsetting, where polysome-association is maintained while total mRNA level changes (this mode is not expected to alter protein levels).
An Anota2seqDataSet can be provided using the ads parameter of postNetStart(). The regulation parameter allows the user to select which sets of regulated genes identified using anota2seq will be used in statistical comparisons and downstream feature integration modelling and network analysis. In addition to the three regulatory modes listed above, where changes in both total and translated mRNA are taken into account, it is also possible to examine differences in total and translated mRNA independently. Gene sets can be selected from multiple contrasts in anota2seq specified with the contrast parameter (see the anota2seq vignette for more details).
The regulatory effect measurement is taken directly from the Anota2seqDataSet object, and is specified by the regulationGen parameter. Anota2seq calculates sets of fold changes for each gene, in each regulatory mode. Although gene sets from all regulatory modes and contrasts can be selected for statistical comparisons, only one “general” regulatory effect measurement can be supplied (i.e., fold changes in translation efficiency, or buffering, etc.), so consideration should be given to downstream modelling where the mRNA features (and/or other input signatures) quantified in the gene sets specified by regulation will be used to explain changes in the regulatory effect measurement specified by regulationGen. Similarly to the gene sets, the regulatory effect measurement can be selected from any contrast from the anota2seq analysis using the contrastSel parameter.
When an Anota2seqDataSet is supplied, the appropriate background gene set, and colours for each gene list will be automatically retrieved.
The example below illustrates the set-up of a postNet analysis using the output of anota2seq, ads.
This postNet analysis will allow identification of mRNA features and comparison between translationally activated and suppressed genes, as well as buffered (total mRNA up and down) genes from contrast 1. Feature integration and network analysis will model the fold changes in translation efficiency from contrast 1. Note that in this example, if the buffering gene sets are included in modelling comparisons, the regulatory effect measurement values for these genes will be the fold changes in translation efficiency.
# Initialize the postNetData object:
ptn_withAds <- postNetStart(
ads = ads,
regulation = c("translationUp", "translationDown", "bufferingmRNAUp", "bufferingmRNADown"),
contrast = c(1, 1, 1, 1),
regulationGen = "translation",
contrastSel = 1,
selection = "random",
setSeed = 123, # ensures reproducibility of random isoform selection
source = "load",
species = "human"
)Reference sequence annotations in postNet are divided according to different regions of mRNA molecules (5’UTR, CDS, and 3’UTR) to allow regions to be compared separately. The source parameter of the postNetStart() function allows the user to select from several in-built sequence annotations provided with the package, retrieve sequence annotations directly from the NCBI RefSeq database (O’Leary et al. 2016), or provide custom sequence annotations. Optionally, UTR sequences can also be adjusted if more precise sequences are available, for example those experimentally determined using approaches like CAGE, QuantSeq, or long-read sequencing, etc.
It is highly recommended to use reference sequence annotations that correspond to those that were used in generating the input gene lists. For example, if RNA sequencing reads were counted using Ensembl gene/transcript annotations, these may not always correspond well to the sequences defined by RefSeq annotations, even if identifiers have been converted to be compatible. When running postNetStart(), a warning message describing differences in gene identifiers between the input gene lists and the reference sequence annotations will be printed to the console. Minor differences may be acceptable depending on the application. However, larger differences may warrant either reprocessing of input data, or selecting a more compatible reference sequence annotation before proceeding with the analysis.
By default, source = "load" meaning postNetStart() will load one of the in-built reference sequence annotations provided with the package when it is first needed.The data are downloaded from the postNet GitHub releases and stored in a user-specific cache directory managed by BiocFileCache. Once cached, the reference data are reused in subsequent sessions and do not require re-downloading. This option is available when species = "human" or "mouse". In-built annotations are based on NCBI RefSeq GRCh38 (human) and GRCm39 (mouse) genome assemblies and corresponding transcript annotations.The following reference annotation versions are currently available:
Human (RefSeq GRCh38): ver_40.202408 Mouse (RefSeq GRCm39): ver_27.202402
A postNetData object can then be initialized using the in-built reference sequence annotations, in this case using the RefSeq ver_40.202408 release version for human.
# Prepare custom gene lists and regulatory effect measurement using example data:
myGenes <- postNetVignette$geneList
myBg <- postNetVignette$background
myEffect <- postNetVignette$effect
# Initialize a postNetData object using in-built annotations for human:
ptn <- postNetStart(
geneList = myGenes,
geneListcolours = c("#FC9272", "#99000D"),
customBg = myBg,
effectMeasure = myEffect,
source = "load",
species = "human",
version = "ver_40.202408"
)In addition to the in-built annotations, it is also possible to either retrieve or construct reference sequences for any RefSeq release version. Currently, only human and mouse are supported with these options.
By specifying source = "create" in postNetStart(), annotation files from the most recent RefSeq release version for the indicated species will be automatically downloaded and used to construct a new reference sequence annotation locally. Note that downloads may take several minutes, and files will be stored in the working directory. Using this option requires an internet connection.
# Initialize a postNetData object creating new RefSeq reference sequence annotations for human:
ptn <- postNetStart(
geneList = myGenes,
geneListcolours = c("#FC9272", "#99000D"),
customBg = myBg,
effectMeasure = myEffect,
source = "create",
species = "human",
)Alternatively, two additional offline methods are available to construct reference sequence annotations if files have already been downloaded from RefSeq and are available locally. This can be done by specifying source = "createFromSourceFiles" in postNetStart(), and providing the RNA GBFF (“rna.gbff”), RNA FASTA (“rna.fna”), and genomic GFF (“genomic.gff”) files. These files must be provided using the rna_gbff_file, rna_fa_file, and genomic_gff_file parameters of postNetStart().
# The required RefSeq annotation files downloaded from the NCBI database will
# have the following naming:
# - "example_rna.gbff.gz"
# - "example_rna.fa.gz"
# - "example_genomic.gff.gz"
# Initialize a postNetData object creating a new RefSeq reference sequence annotation
# from source files:
ptn <- postNetStart(
geneList = myGenes,
geneListcolours = c("#FC9272", "#99000D"),
customBg = myBg,
effectMeasure = myEffect,
source = "createFromSourceFiles",
species = "human",
rna_gbff_file = "example_rna.gbff.gz",
rna_fa_file = "example_rna.fa.gz",
genomic_gff_file = "example_genomic.gff.gz"
)For added flexibility in defining sequence regions, it is also possible to assemble reference sequence annotations from a FASTA file using source = "createFromFasta" in postNetStart(). This option requires that a FASTA file be provided using the fastaFile parameter, along with an additional file specifying the coordinates of the mRNA sequence regions (provided with the posFile parameter). This position file must be tab delimited and have columns indicating the transcript id, 5’UTR length, end of the coding sequence, and the total transcript length (see example below). Optionally, a genomic GFF file can also be provided using the genomic_gff_file parameter. However, if no GFF is provided the latest version for the indicated species will be automatically downloaded from the RefSeq database.
# An example of the required format for the posFile parameter ("positions.txt"):
# id UTR5_len cds_stop total_length
# NM_000014 70 4495 4610
# NM_000015 70 943 1285
# NM_000016 79 1345 2261
# Initialize a postNetData object creating a new RefSeq reference sequence annotation
# from source files:
ptn <- postNetStart(
geneList = myGenes,
geneListcolours = c("#FC9272", "#99000D"),
customBg = myBg,
effectMeasure = myEffect,
source = "createFromFasta",
species = "human",
fastaFile = "_rna.fa.gz",
posFile = "positions.txt",
genomic_gff_file = "_genomic.gff.gz"
)The transcriptome is highly diverse, with mRNA isoform expression patterns varying across cell types (FANTOM Consortium and the RIKEN PMI and CLST (DGT) et al. 2014), and between normal and disease states (Qamra et al. 2017). The sequences of mRNA molecules can also be dynamically regulated through processes such as alternative splicing (Wright, Smith, and Jiggins 2022), altered transcription start site usage (Watt et al. 2025), and alternative polyadenylation (Navickas et al. 2023). Analyses relating mRNA sequence features to post-transcriptional regulation will benefit from more precise sequence annotations if these are available.
Custom reference sequence annotations can be used with the postNetStart() function and may be desirable in cases when working with data from species not currently supported by the options described above, with annotations other than those in the NCBI RefSeq database, or if sequences have been experimentally determined. A pre-prepared reference sequence annotation file can be provided using the customFile parameter. This file must be tab delimited and contain the columns: transcriptID, geneID, UTR5_seq, CDS_seq, and UTR3_seq.
# An example of the required format for the customFile parameter ("customSequences.txt"):
# id geneID UTR5_seq CDS_seq UTR3_seq
# NM_000014 A2M GGGACCAG... ATGGGGAA... AGACCACA...
# Initialize a postNetData object with a custom reference sequence annotation file:
ptn <- postNetStart(
geneList = myGenes,
geneListcolours = c("#FC9272", "#99000D"),
customBg = myBg,
effectMeasure = myEffect,
source = "custom",
species = NULL,
customFile = "customSequences.txt"
)Using sequence annotations provided with the package or built from the RefSeq database, postNet will perform gene-level analyses. For genes with multiple mRNA isoforms, the selection parameter of the postNetStart() function can be used to specify which isoforms will be considered in analyses. Depending on the application, it may sometimes be of interest to consider the longest or shortest mRNA isoforms for each gene. However, selecting the extremes for all genes may skew the results for some types of analysis. By default, isoforms for each gene will be selected at random to prevent a systematic length bias. However, it is important to note that each time the postNetStart() function is run, different isoforms may be selected leading to slight variations in results. If it is necessary to ensure reproducibility of isoform selection between different runs of the postNetStart() function, the setSeed parameter can be used. Alternatively, high-confidence representative protein-coding transcript isoforms can be selected based on the Match Annotation from NCBI and EMBL-EBI (MANE) initiative (Morales et al. 2022).
Although it is also possible to perform isoform-level analyses of mRNA features with postNet by supplying custom sequence annotations as described above, careful consideration should be given to the approach used when analyzing isoform versus gene-level data. If differential regulation of unique isoforms can be confidently quantified, it may be possible to treat these similarly to gene-level data when performing featureIntegration() modelling. However, it should be considered that including multiple isoforms for the same genes that contain shared sequences may have confounding effects, giving results that may be misleading or difficult to interpret. An alternative option is to enumerate mRNA features at the isoform level, and then summarize these at the gene-level prior to modelling (Watt et al. 2025).
Numerous sequencing approaches are available to specifically map UTRs, such as nanoCAGE (Poulain et al. 2017) and QuantSeq (Moll et al. 2014). For this reason, it is also possible to adjust only UTR sequences if more specific data is available for your experimental condition or model of interest.
Custom UTR sequences will replace UTRs in an existing annotation file when using the adjObj and region_adj parameters of the postNetStart() function. A list of custom sequences must be provided, specifying which UTR region(s) should be replaced. If custom UTR sequences are available for some, but not all genes in the existing sequence annotation, whether genes and isoforms without custom sequences are kept or discarded from the analysis can be controlled using the excl parameter. The keepAll parameter can also be used to control how custom UTR isoform sequences are stored for genes with multiple isoforms.
# Create the adjObj list with custom 5'UTR sequences to replace those in the loaded annotation file
myUTR5seqs <- ptn_sequences(ptn, "UTR5")
myIDd <- ptn_id(ptn, "UTR5")
names(myUTR5seqs) <- myIDd
customUTR5s <- list(UTR5 = myUTR5seqs)
str(customUTR5s)## List of 1
## $ UTR5: Named chr [1:9555] "AGCAAATAACTGGAATTCCGCAAACACTGGGGTGATGCAGGCAATGCCCTTAGCACGGTCCCGGCAGGAGGCGGTGGGATCAGGCCGACCCGGGTCCCAGGGGTGACAGCG"| __truncated__ "ACTTCCGTTGAGTTCCGCCTCGCCGTTTGTCCCTTGCGGTACCCGTCCGCATACGAATCTAGCCCGGGAACCGAGTTGCGGGAGTGCGGTCTGTGCCGTTCCGGCCAGGAG"| __truncated__ "AGTCCCTTGTTCCCCGCCGCCGCCGTCGCTGACCCAGCCCGCCAGGCGCTCCTGACCGTCGCTTCCTCCGGTCCCAGGTCCCCGGCCCTCGCCTCAGCCCCGGCCCCTGGT"| __truncated__ "AAAATGCACAGTCCCAGTGACTGAGAGGAGGCCAGCGGAGCGCGCGGGAACGAGGGTGGGGCCGCTCCCGGCTGGGCGCAGCTCGGGAGCCCCCTGCAGTGGACCCAAGTC"| __truncated__ ...
## ..- attr(*, "names")= chr [1:9555] "NM_001318038" "NM_001173466" "NM_001414675" "NM_001286683" ...
# Initialize a postNetData object with custom 5'UTR sequences, replacing the 5'UTR sequences
# in the loaded annotation file with custom ones where available, and discarding all isoforms
# without custom sequences. Genes with no custom sequence will retain all isoforms from the original annotation:
ptn <- postNetStart(
geneList = myGenes,
geneListcolours = c("#FC9272", "#99000D"),
customBg = myBg,
effectMeasure = myEffect,
source = "load",
species = "human",
selection = "random",
setSeed = 123,
adjObj = customUTR5s,
region_adj = "UTR5",
excl = FALSE,
keepAll = FALSE
)PostNet includes a variety of tools for identifying
and enumerating sequence features of mRNA. These tools can be used for stand-alone analyses to visualize and perform statistical comparisons of mRNA features between sets of regulated genes (described below). However, the outputs of these tools can also be compiled into per-gene catalogs of cis-acting regulatory features and used with the featureIntegration() function to model changes in post-transcriptional regulation, and dissect networks of interactions between features.
For the tools described in the following sections, where possible, statistical analyses are performed by providing the comparisons parameter, where the input is a list of numeric vectors specifying the pairwise comparisons to be made between gene sets defined by geneList (if using custom inputs), or regulation (if using the results of an anota2seq analysis) parameters in the postNetStart() function. Comparisons can be made between gene sets of interest, and between gene sets and the background set, which is always denoted by 0. For example, list(c(1,2), c(0,1), c(0,2)) would produce three sets of statistics for the comparisons between gene sets 1 and 2, and for each against the background set. Unless otherwise described, significant differences in enumerated mRNA features between pairs of gene sets are evaluated using two-sided Wilcoxon rank sum tests.
Some regulatory sequence features of mRNA are highly positional, such as 5’ terminal oligopyrimidine (TOP) motifs (Philippe et al. 2020; Cockman, Anderson, and Ivanov 2020), or codon “ramp” sequences (Verma et al. 2019). Therefore, for some tools described below, users may wish to search for and/or enumerate mRNA sequence features within more specific sequence sub-regions. This can be accomplished using the subregion, and subregionSel parameters. The subregion parameter takes an integer indicating the number of nucleotides to include or exclude from either the start of the sequence region if positive, or the end if negative. For example, subregion = 50 denotes the first 50 nucleotides of the sequence region(s) specified by the region parameter (5’UTR, CDS, or 3’UTR), while subregion = -50, denotes the last 50 nucleotides of the sequence region(s). The subregionSel parameter can then be applied to allow the user to either examine only this selected sub-region, or to exclude it from the analysis.
In most cases, three options for visualizations are available with the plotType parameter, including box plots, violin plots, and empirical cumulative distribution functions (eCDFs). If the plotType selected is ecdf, then differences between distributions at the 25th, 50th, and 75th percentiles will also be reported.
The length of different mRNA sequence regions can have important implications for post-transcriptional regulation (Mayr 2016; Gandin 2016). For each gene, the lengthAnalysis() function computes and returns a list of the log2 length of each mRNA sequence region specified by the region parameter. Statistical comparisons and visualizations can be generated using the parameters described above.
The example below will compute a list (len) of the log2 length for each gene and sequence region, and produce three PDF files with eCDFs and statistics comparing the length of each sequence region between gene sets and background.
# Calculate the length of the 5'UTR, 3'UTR and coding sequence for each gene,
# and compare lengths between translationUp genes vs. background,
# translationDown genes vs. background, and translationUp vs. translationDown genes
len <- lengthAnalysis(
ptn = ptn,
region = c("UTR5", "CDS", "UTR3"),
comparisons = list(c(0, 1), c(0, 2), c(1, 2)),
plotOut = TRUE,
plotType = "ecdf",
pdfName = "Example"
)
str(len)## List of 3
## $ UTR5_length: Named num [1:9555] 8.34 7.18 7.23 7.71 4.91 ...
## ..- attr(*, "names")= chr [1:9555] "A4GALT" "AAAS" "AACS" "AADAT" ...
## $ CDS_length : Named num [1:9555] 10.05 10.59 10.94 10.32 9.89 ...
## ..- attr(*, "names")= chr [1:9555] "A4GALT" "AAAS" "AACS" "AADAT" ...
## $ UTR3_length: Named num [1:9555] 9.59 4.86 11.96 9.46 11.07 ...
## ..- attr(*, "names")= chr [1:9555] "A4GALT" "AAAS" "AACS" "AADAT" ...
Figure 4: eCDFs visualizing differences in Log2(Length) of mRNA sequence regions between different gene sets
5’UTR (left), 3’UTR (middle), CDS (right). Statistical comparisons are made between translationally activated (blue), suppressed (red), and background (grey) gene sets. Visualizations can also be generated as either box plots, or violin plots
Nucleotide content is associated with the post-transcriptional fate of mRNAs. For each gene, the contentAnalysis() function computes and returns a list of the percentage of nucleotide content of each mRNA sequence region specified by the region parameter. It is also possible to select more specific sequence sub-regions to either examine or exclude from the analysis. Statistical comparisons and visualizations can be generated using the parameters described above.
The contentIn parameter is used to specify one or more nucleotide(s) or nucleotide combinations to quantify. These can be any selection or combination of A, T, G, or C. For example c("GC") to obtain the combined percentage of G and C, or c("G", "C") to obtain the individual percentages. Within the coding sequence, it may also be of interest to calculate nucleotide content at specific codon positions, for example the GC content at the 3rd codon position (“GC3”) is often used as an indicator of codon bias (Aota and Ikemura 1986). For this reason, when the input for the region parameter includes "CDS", nucleotide content at specific codon positions can also be quantified by supplying the nucleotide or combination of nucleotides, along with the codon position to be examined (1, 2, or 3). For example, c("GC3") calculates percentage of GC content in the third codon position, while c("A12") would provide the percentage of A content in the first and second codon positions.
The example below will compute a list (GC_content) of the percentage of GC content for each gene and sequence region, and produce three PDF files with violin plots and statistics comparing the GC content of each sequence region between gene sets and background.
# Calculate the GC content of each sequence region for each gene, and compare between
# translationUp genes vs. background, translationDown genes vs. background, and
# translationUp vs. translationDown genes
GC_content <- contentAnalysis(
ptn = ptn,
region = c("UTR5", "CDS", "UTR3"),
comparisons = list(c(0, 1), c(0, 2), c(1, 2)),
contentIn = c("GC"),
plotOut = TRUE,
plotType = "violin",
pdfName = "Example"
)
Figure 5: Violin plots visualizing differences in the percentage of GC content of mRNA sequence regions between different gene sets
5’UTR (left), 3’UTR (middle), CDS (right). Statistical comparisons are made between translationally activated (blue), suppressed (red), and background (grey) gene sets. Visualizations can also be generated as either box plots, or eCDFs
In addition to calculating the percentage of GC content for each gene and sequence region (as above), the code below will return the GC content at the third codon position (GC3) for the coding sequence.
# Calculate the GC of each sequence region for each gene, as well as the GC3 content of
# the coding region and compare between translationUp genes vs. background, translationDown
# genes vs. background, and translationUp vs. translationDown genes
GC3_content <- contentAnalysis(
ptn = ptn,
region = c("CDS"),
comparisons = list(c(0, 1), c(0, 2), c(1, 2)),
contentIn = c("GC3"),
plotOut = TRUE,
plotType = "violin",
pdfName = "Example"
)
str(GC_content)## List of 3
## $ UTR5_GC: Named num [1:9555] 62 63.4 80 64.1 73.3 ...
## ..- attr(*, "names")= chr [1:9555] "A4GALT" "AAAS" "AACS" "AADAT" ...
## $ CDS_GC : Named num [1:9555] 63.7 57.5 56 41.4 45.3 ...
## ..- attr(*, "names")= chr [1:9555] "A4GALT" "AAAS" "AACS" "AADAT" ...
## $ UTR3_GC: Named num [1:9555] 63.1 24.1 52.5 33.6 41.4 ...
## ..- attr(*, "names")= chr [1:9555] "A4GALT" "AAAS" "AACS" "AADAT" ...
Finally, the code below provides an example of how the subregion and subregionSel parameters can be used to quantify the percentage of G content in the first and last 15 nucleotides of the 5’UTR.
# Calculate the G content of the first and last 15 nucleotides of the 5'UTR for each gene, and
# compare between translationUp genes vs. background, translationDown genes vs. background, and
# translationUp vs. translationDown genes
content_UTR5_first15 <- contentAnalysis(
ptn = ptn,
region = c("UTR5"),
subregion = 15,
subregionSel = "select",
comparisons = list(c(0, 1), c(0, 2), c(1, 2)),
contentIn = c("G"),
plotOut = TRUE,
plotType = "ecdf",
pdfName = "Example_First15utr5"
)
content_UTR5_last15 <- contentAnalysis(
ptn = ptn,
region = c("UTR5"),
subregion = -15,
subregionSel = "select",
comparisons = list(c(0, 1), c(0, 2), c(1, 2)),
contentIn = c("G"),
plotOut = TRUE,
plotType = "ecdf",
pdfName = "Example_Last15utr5"
)The folding energy of mRNA molecules is an indication of the thermodynamic stability of higher-order structures, and has implications for post-transcriptional regulation (Faure et al. 2016; Zur and Tuller 2012). For each gene, the foldingEnergyAnalysis() function returns a list of free energy measurements for each mRNA sequence region specified by the region parameter. Statistical comparisons and visualizations can be generated using the parameters described above.
The foldingEnergyAnalysis() function does not perform folding energy calculations for sequences, but allows comparisons between gene sets of interest using pre-calculated values. The source of these values is specified using the sourceFE parameter, where pre-calculated values provided with the package can either be loaded, or custom values can be supplied using the customFileFE parameter. Pre-calculated folding energies supplied with the package are currently available for human and mouse RefSeq releases “rel_109.20201120”, and “rel_109.20200923”. These values were calculated for all sequence regions using the mfold algorithm (Zuker 2003), available here. Mfold reports the Gibbs free energy (ΔG) for the most probable RNA structures, where lower values indicate more stable structures and larger ones indicate less stable, more flexible mRNA molecules.
To use custom free energy values with the customFileFE parameter, the user must supply a TAB delimited file with three columns corresponding to: transcript ID, folding energy, and length of the sequence region. Note that folding energies for different sequence regions (e.g. 5’UTR and 3’UTR) must be provided separately.
The residFE parameter also offers the option of correcting the folding energies for the length of the sequences, which may sometimes be desirable to facilitate comparisons between gene sets. Here, instead of ΔG, the values returned are the residuals from the linear model: FE ~ log2(sequence region length). Note that if folding energies will be used in downstream modelling analysis with featureIntegration(), consideration should be given to the residFE parameter. If the length of the sequence regions will also be included in modelling, it is not recommended to correct the folding energies for the sequence length.
The example below will compute a list (FE) of the folding energies for 5’UTR regions, and produce a PDF file with box plots and statistics comparing the folding energy between regulated gene sets and background.
# Compare the folding energy of the 5'UTRs between translationUp
# genes vs. background, translationDown genes vs. background,
# and translationUp vs. translationDown genes.
FE <- foldingEnergyAnalysis(
ptn = ptn,
region = c("UTR5"),
comparisons = list(c(0, 1), c(0, 2), c(1, 2)),
residFE = FALSE,
plotType = "boxplot",
sourceFE = "load",
plotOut = TRUE,
pdfName = "Example"
)
str(FE)## List of 1
## $ UTR5_foldingEnergy: Named num [1:9130] -13.3 -51 -80.7 -58.8 -23.6 ...
## ..- attr(*, "names")= chr [1:9130] "ACADS" "PSEN1" "ADRB2" "ALAD" ...
Figure 6: Box plots visualizing differences in the folding energy of mRNA 5’UTRs between different gene sets
Statistical comparisons are made between translationally activated (light red), suppressed (dark red), and background (grey) gene sets. Visualizations can also be generated as either violin plots, or eCDFs
In the example code below, a custom file with pre-calculated folding energies is provided for 5’UTR sequences.
# An example of the required input format for the customFile parameter ("custom_UTR5FE.txt"):
# id fold_energy length
# NM_018117 -16.3 34
# NM_001025603 -36.84 162
# NM_001003891 -28.4 81
# NM_021107 -120.14 350
# Compare the folding energy of the 5'UTR for each gene between translationUp genes vs. background,
# translationDown genes vs. background, and translationUp vs. translationDown genes.
customFE <- foldingEnergyAnalysis(
ptn = ptn,
region = c("UTR5"),
comparisons = list(c(0, 1), c(0, 2), c(1, 2)),
residFE = FALSE,
plotType = "violin",
sourceFE = "custom",
customFileFE = "~/Path/To/CustomFile/custom_UTR5FE.txt",
plotOut = TRUE,
pdfName = "Example_customFE"
)Upstream open reading frames (uORFs) are common regulatory elements found in the 5’UTRs of many transcripts, imparting context-dependent post-transcriptional regulatory effects (Barbosa, Peixeiro, and Romão 2013). For each gene, the uorfAnalysis() function returns either the number, or the coordinates of the positions of uORFs in the 5’UTR. Statistical comparisons can be generated using the parameters described above.
By default, the uorfAnalysis() function will search for uORFs with canonical start codons (“AUG”) in a strong Kozak context. However, it is also possible to identify uORFs with non-canonical start codons using the startCodon parameter, and to specify the Kozak context of the start codon using the KozakContext parameter, where:
"strong" = [AG][ATGC][ATGC] [Start codon] G"adequate1" = [AG][ATGC][ATGC] [Start codon] [ATC]"adequate2" = [TC][ATGC][ATGC] [Start codon] G"weak" = [TC][ATGC][ATGC] [Start codon] [ATC]"any" = [ATGC][ATGC][ATGC] [Start codon] [ATGC]Note that if KozakContext = "any", the nucleotide context of the start codon is not considered.
Some uORFs are completely contained within the 5’UTR, as is the case for the uORF found in the transcript CHOP (DDIT3) (Jousse et al. 2001). However, in other cases the start codon for the uORF may be in the 5’UTR, but the stop codon may be found downstream of the start codon of the main ORF, as for ATF4 (Vattem and Wek 2004). Using the onlyUTR5 parameter, it is possible to specifically identify uORFs completely contained in 5’UTRs, instead of all uORFs.
Finally, in addition to quantifying the number of uORFs for a given transcript, it is also often of interest to know the positions in the sequence where the uORFs occur. The unitOut parameter can be used to specify whether the number, or coordinates (nucleotide position of the start and stop codons) of the detected uORFs will be returned.
The example below will compute a list (uORFs) of the number of uORFs with canonical start codons in a strong Kozak context for each gene, and produce a PDF file with a bar plot and statistics comparing between gene sets.
# Identify all uORFs with canonical start codons in a strong Kozak context (including
# those not fully contained within 5'UTRs, and compare
# between translationUp vs. translationDown genes
uORFs <- uorfAnalysis(
ptn = ptn,
comparisons = list(c(1, 2)),
startCodon = "ATG",
KozakContext = c("strong"),
onlyUTR5 = FALSE,
unitOut = "number",
plotOut = TRUE,
pdfName = "Example"
)
str(uORFs)## List of 1
## $ uORFs_ATG_strong: Named num [1:9555] 2 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "names")= chr [1:9555] "A4GALT" "AAAS" "AACS" "AADAT" ...
Figure 7: Bar plot visualizing differences in the proportion of transcripts containing uORFs with canonical start codons in a strong Kozak context between different gene sets
Statistical comparisons are made between translationally activated (blue), and suppressed (red) gene sets.
De novo motif discovery in postNet is carried out using the motifAnalysis() function, which implements STREME (Bailey 2021) from the MEME-Suite (Bailey et al. 2015) to identify ungapped motifs (recurring, fixed-length patterns) that are enriched in the selected sequence region(s) of gene sets of interest relative to control sequences (background). It is also possible to select more specific sequence sub-regions to either examine or exclude from the analysis.
The motifAnalysis() function applies runStreme() from the memes package (Nystrom 2021). This requires that the MEME Suite software tools be installed locally, and the path to the executables must be provided (memePath parameter). The MEME Suite can be downloaded here following the installation guide. It is possible to specify the minimum width (in nucleotides) and the enrichment p-value threshold for selecting motifs using the minwidth and stremeThreshold parameters. If custom reference sequence annotations have been used, the sequence type (RNA, DNA, protein) can also be adjusted with the seqType parameter. It is important to note that when using postNet with a custom gene list, if no background is provided, postNet defines the background as all genes in the gene list. In the scenario where only one gene set is provided in the list, it is not advisable to use motifAnalysis() to identify enriched motifs, as the input and background will be the same.
The motifAnalysis() function returns an updated postNetData object where results are stored in the “motifs” slot. Identified motifs can be enumerated in transcripts and included in downstream modelling with featureIntegration() by using the contentMotifs() function. All results, including the list of significantly enriched motifs can be extracted using the ptn_motifSelection() and ptn_motifGeneList() functions.
Note: If you use motifAnalysis() in your work, please appropriately cite the memes R package, The MEME Suite, and the STREME tool. Licensing for The MEME Suite is free for non-profit use. However, for-profit users should contact OIC-MEMESuite@ucsd.edu and purchase a license. See http://meme-suite.org/doc/copyright.html for details
In the example below, de novo motifs enriched in translationally activated and suppressed gene sets compared to background will be identified for all sequence regions. Motifs passing an enrichment p-value threshold of 0.05 will be extracted, along with the full motif analysis results for the translationally activated gene set.
# Note that as users must provide the path to the MEME Suite executables. An example of
# how to run this function is provided below, and can be updated with the correct memePath argument.
ptn <- motifAnalysis(
ptn = ptn,
stremeThreshold = 0.05,
minwidth = 6,
memePath = "/meme/bin",
region = c("UTR5", "CDS", "UTR3")
)
# Extract the significantly enriched motifs:
denovo_UTR5motifs <- ptn_motifSelection(
ptn = ptn,
region = "UTR5"
)
# Extract full motif results from one gene set of interest:
denovo_UTR5motifs_TransUp <- ptn_motifGeneList(
ptn = ptn,
region = "UTR5",
geneList = "translationUp"
)Sequence motifs can contribute to the structure and stability of RNA molecules, as well as mediate RNA-protein interactions. Numerous motifs play an important role in post-transcriptional regulation, including 5’TOP motifs, which render translation sensitive to regulation via mTOR (Philippe et al. 2020; Cockman, Anderson, and Ivanov 2020), and G-quadruplexes that can impact the translation and stability of the mRNAs where they occur (Bugaut and Balasubramanian 2012). For each gene, the contentMotifs() function identifies user-supplied sequence motifs and returns a list of enumerations and/or positions for each mRNA sequence region specified by the region parameter. It is also possible to select more specific sequence sub-regions to either examine or exclude from the analysis. Statistical comparisons between gene sets can be specified using the parameters described above.
The motifsIn parameter is used to provide the motif sequences to be detected and enumerated. Ambiguities can be specified using IUPAC codes or [ ] (bracket) annotations. The input motifs provided should match the type of reference sequences to be searched in your analysis (i.e., RNA, DNA, or protein), and can be specified using the seqType parameter. The contentMotifs() function can also implement pqsfinder to identify G-quadruplexes when motifsIn = "G4", using the min_score to adjust the prediction threshold. If you use the “G4” option in your work, please cite Hon et al. (Hon et al. 2017). The ATtRACT database, is a database of RNA binding proteins and associated motifs (Giudice et al. 2016), and may also be useful for identifying motifs of interest and potential mechanisms of post-transcriptional regulation in your data. Note that often you will first run the motifAnalysis() function to identify the enriched motifs to enumerate. In the example code above, denovo_UTR5motifs, extracted using the ptn_motifSelection() function can be supplied directly as input to contentMotifs() using the motifsIn parameter.
It is also possible to identify overlapping, or non-overlapping motifs, as well as dictate spatial constraints between motifs using the dist parameter to dictate the minimum nucleotide distance between motifs. Similarly to uorfAnalysis(), it is also often of interest to know the positions in the sequence where the motifs occur, and the unitOut parameter can be used to specify whether the number, or coordinates (nucleotide position of the beginning and end of the motif) will be returned.
As for foldingEnergyAnalysis(), the resid parameter of contentMotifs() offers the option of correcting the number of motifs per transcript for the length of the sequences (as longer sequences have more opportunity to contain motifs). Here, the values returned are the residuals from the linear model: number of motifs ~ log2(sequence region length). Note that if motifs will be used in downstream modelling analysis with featureIntegration(), and the length of the sequence regions will also be included in the same models, it is not recommended to correct the number of motifs for the sequence length.
It is important to note that the contentMotifs() function uses a more strict method of sequence matching to count motifs than the MEME-Suite (Bailey et al. 2015), and for this reason some motifs identified with motifAnalysis(), which implements STREME (Bailey 2021), may have divergent quantification between methods. Using a more stringent threshold with the stremeThreshold parameter of motifAnalysis() may remedy this. Alternatively, motifs identified with STREME can be counted using the FIMO (Grant, Bailey, and Noble 2011) tool provided with the MEME-Suite.
The example below will compute a list (UTR5_SCSCGS_num) of the number of SCSCGS in the 5’UTR for each gene, and produce a PDF file with an eCDF plot and statistics comparing enrichment of the motif between gene sets. A second list (UTR5_SCSCGS_pos) will contain the coordinates of the positions of the motifs within the 5’UTR for each gene.
# Detect and quantify a given motif within 5'UTRs, and compare between translationUp vs.
# translationDown genes
UTR5_SCSCGS_num <- contentMotifs(
ptn = ptn,
motifsIn = "SCSCGS",
region = c("UTR5"),
comparisons = list(c(1, 2)),
dist = 1,
unitOut = "number",
pdfName = "Example",
plotOut = TRUE
)
str(UTR5_SCSCGS_num)## List of 1
## $ UTR5_SCSCGS: Named num [1:9555] 0 1 6 1 0 3 0 0 0 1 ...
## ..- attr(*, "names")= chr [1:9555] "A4GALT" "AAAS" "AACS" "AADAT" ...
# Now find the coordinates of positions where the motif occurs in 5'UTRs
UTR5_SCSCGS_pos <- contentMotifs(
ptn = ptn,
motifsIn = "SCSCGS",
region = c("UTR5"),
comparisons = list(c(1, 2)),
dist = 1,
unitOut = "position"
)
str(UTR5_SCSCGS_pos$UTR5_SCSCGS[1:3])## List of 3
## $ A4GALT: logi NA
## $ AAAS :List of 2
## ..$ start: num 62
## ..$ end : num 67
## $ AACS :List of 2
## ..$ start: num [1:6] 12 37 80 98 130 138
## ..$ end : num [1:6] 17 42 85 103 135 143
Figure 8: eCDFs visualizing differences in the presence of the SCSCGS motif in the 5’UTR of different gene sets
Statistical comparisons are made between translationally activated (blue) and suppressed (red) gene sets. Background genes are shown in grey.
PostNet includes functionality for enumerating and comparing codon usage across gene sets, implemented with the codonUsage(), and codonCalc() functions. In this workflow, the codonUsage() function is first used to enumerate usage of single codons (or multiple, e.g. dicodons) or amino acids (AAs), and identify those that are enriched or depleted between different gene sets of interest. The codonCalc() function can then be applied to obtain the number or frequency of selected codons or AAs for each gene, and perform statistical comparisons between gene sets. The outputs of codonCalc() can be used as input features for modelling analyses using featureIntegration().
The codonUsage() function can be used to enumerate either codon or AA usage, specified with the analysis parameter. Analyses can be performed using the coding region of reference sequence annotations stored in the postNetData object. Alternatively, coding sequences can be either automatically created from the NCBI Consensus CDS database or loaded using the annotType and sourceSeq parameters. If loaded, the data are downloaded from the postNet GitHub releases and stored in a user-specific cache directory managed by BiocFileCache. Once cached, the reference data are reused in subsequent sessions and do not require re-downloading. This option is available when species = "human" or "mouse".The following reference annotation versions are currently available:
CCDS Release 25 (2022)
It is also possible to enumerate multiple codon combinations, for example dicodons, using the codonN parameter. Similarly to contentAnalysis() and contentMotifs(), it is also possible to select more specific sequence sub-regions for analysis, for example to specifically examine the codon “ramp” downstream of the start codon (Verma et al. 2019).
The example below demonstrates an implementation of codonUsage() generating a summary table of codon counts and frequencies for all genes. Note that “frequency” corresponds to the number of codons relative to all codons in the gene, while “relative frequency” corresponds to the number of codons relative to all synonymous codons in the gene. This table can then be retrieved using the ptn_codonAnalysis() function. In this case, statistical comparisons will be performed between translationally activated and suppressed gene sets.
# Examine the composition of all genes:
ptn <- codonUsage(
ptn = ptn,
annotType = "ptnCDS",
sourceSeq = "load",
analysis = "codon",
codonN = 1,
comparisons = list(c(1, 2))
)
# Retrieve the codon usage summary table
codonUsageTable <- ptn_codonAnalysis(ptn)In addition to the summary table, the example code above also generates plots comparing the average codon usage and frequency between the gene sets of interest specified by the comparisons parameter.
Figure 9: Scatter plots comparing average codon usage and frequency between gene sets of interest
Codons encoding the same amino acid are coloured and connected by lines.
The codonUsage() function also computes several codon indexes and performs statistical comparisons between gene sets of interest. These include the CAI (codon adaptation index) (Sharp and Li 1987), CBI (Codon Bias Index) (Bennetzen and Hall 1982), FOP (frequency of optimal codons) (Ikemura 1981), L_aa (Number of AAs in protein), and tAI (tRNA adaptation index) (Reis, Savva, and Wernisch 2004). The type of Visualizations generated can be controlled as described above, using the plotType_index parameter.
Below are several examples produced by the example code above.
Figure 10: Violin plots of several codon usage indexes calculated by the codonUsage() function
Shown are several examples including CAI (left), CBI (middle), and Fop (right). Statistical comparisons are made between translationally activated (blue), suppressed (red), and background (grey) gene sets. Visualizations can also be generated as either box plots, or eCDFs
The codonUsage() function can also be used to identify enriched or depleted codons or AAs between gene sets of interest. In the first step of the analysis, for each gene set comparison, a Chi-square test is applied to assess significant differences in codon or AA usage. To visualize the results of this test, a clustered heatmap of the standardized Chi-square residuals for each codon or AA between the gene sets of interest can be produced using the plotHeatmap parameter. Then, for codons or AAs passing significance thresholds (specified using the pAdj parameter), the odds ratio for each desired comparison pair is calculated and plotted against the frequency. Codons or AAs of interest with differential usage between gene sets are usually identified as those with a high or low odds ratio, and high frequency. These thresholds can be defined using the thresOddsUp, thresFreqUp, thresOddsDown and thresFreqDown parameters.
In the example below, enriched and depleted codons will be identified between translationally activated and suppressed gene sets. These are defined as those with an adjusted p-value less than 0.01 in a Chi-square test, and are within the top 30% of enrichment or depletion based on odds ratios and average frequencies.
# Identify enriched and depleted codons between
# translationally activated and suppressed genes
codonUsage(
ptn = ptn,
annotType = "ptnCDS",
sourceSeq = "load",
analysis = "codon",
codonN = 1,
pAdj = 0.01,
plotHeatmap = TRUE,
thresOddsUp = 0.3,
thresFreqUp = 0.3,
thresOddsDown = 0.3,
thresFreqDown = 0.3,
comparisons = list(c(1, 2)),
pdfName = "Example"
)
Figure 11: Visualizations of differential codon usage between different gene sets of interest
A clustered heatmap of standardized residuals for each codon from a Chi-square test (left), and a plot of the log2 odds ratio for each codon between translationally activated and suppressed gene sets, vs. the average codon frequency. Those passing thresholds for enrichment or depletion between gene sets are highlighted in red and blue, respectively.
After identifying codons or AAs with differential usage between gene sets using the codonUsage() function, these can be enumerated across genes using the codonSelection() function for use in modelling using featureIntegration(). Statistical comparisons and visualizations can also be generated using the parameters described above.
Codons or AAs to examine are specified using the featsel parameter, and the output can be calculated as either the number of occurrences per gene, or the frequency (the count per gene relative to all other codons or AAs in the gene) specified by the unit parameter.
In the example below, codons identified as enriched or depleted between gene sets based on Chi-square and odds ratio analyses are enumerated for each gene, and these are compared between translationally activated and suppressed gene sets.
# Select codons of interest with high frequency, and the highest and lowest odds ratios
# between translationally activated and suppressed genes
codons <- ptn_codonSelection(ptn,
comparison = 1
)
# Calculate and plot eCDFs of codon frequencies, and compare between gene sets
codonCounts <- codonCalc(
ptn = ptn,
analysis = "codon",
featsel = codons,
unit = "freq",
comparisons = list(c(1, 2)),
plotType = "ecdf"
)
Figure 12: eCDFs visualizing differences in the frequency of enriched (left) and depleted (right) codons between gene sets
Statistical comparisons are made between translationally activated (blue), suppressed (red) gene sets, with background shown in grey. Visualizations can also be generated as either box plots, or violin plots.
The featureIntegration() function is used to model changes in post-transcriptional regulation and identify cis- and/or trans-acting factors (features of mRNA molecules, signatures of upstream regulators, etc.) that can explain the observed regulatory effects. The method integrates mRNA features and signatures obtained in previous steps of the postNet workflow (or from custom sources), and has several options for implementation, including forward stepwise linear regression modelling and network analysis, and Random Forest feature selection and classification, which allows prediction of post-transcriptional regulation in new datasets. These topics and considerations for use will be discussed in the following sections.
Both stepwise regression and Random Forest modelling implementations produce statistics and plots visualizing the relationship between regulatory effects and features. These relationships can be further explored using UMAP visualizations, described in later sections.
Many different types of features can be included in modelling with the featureIntegration() function. The input features must be provided as a list summarizing the properties of each gene, that will be used to explain the observed changes in regulation (using forward stepwise linear regression modelling), or to predict the regulatory outcome for a given gene (using a Random Forest classifier). Prior to modelling, careful consideration should be given to the features that are included in the models. These considerations will largely depend on the individual datasets being examined and on the biological questions being addressed. However, some general advice to keep in mind is outlined here.
Most analyses of post-transcriptional regulation using postNet will likely be interested in identifying sequence-encoded features of mRNA molecules that may be responsible for mediating selective regulation (cis-acting elements). From the postNet workflow, the outputs of the lengthAnalysis(), contentAnalysis(), uorfAnalysis(), foldingEnergyAnalysis(), contentMotifs(), and codonCalc() functions generate per-gene summary values that can be directly supplied as features in modelling. In addition to the features that can be enumerated using postNet, additional cis-acting features could be included from custom sources, or based on experimental measurements. For example, the length of the poly-A tail, the presence of RNA modifications such as m6A, or measurements of RNA stability or structures.
In addition to cis-acting features, it is also possible to include features describing upstream regulators (trans-acting features). These can be gene signatures of known regulation downstream of particular pathways or factors, for example genes translationally activated or suppressed by an inhibitor treatment or knockdown of an RNA binding protein. Including such gene signatures in modelling can serve several hypothesis-generating purposes. For example, when examining post-transcriptional regulation in a complex physiological condition, including signatures of upstream regulators may help to identify factors or pathways that may be modulated in that condition. For example, including known signatures of translational regulation when modelling the translatome under hypoxia confirmed that suppression of mTOR signalling and activation of the ISR play key roles in the observed translational control, and suggested the potential involvement of other factors such as DAP5 that were not previously appreciated (Watt et al. 2025). In addition, including trans-acting feature signatures can also allow assessment of the relationship between upstream regulators and sequence-encoded features. For example, stepwise regression modelling can identify covariance between features. If a cis-acting sequence feature co-varies substantially with an upstream signalling pathway (meaning the element is present in the same genes regulated by this pathway), one may hypothesize that this sequence element could be involved in mediating the observed post-transcriptional regulation exerted by that pathway. Several example gene signatures, including mTOR and ISR-sensitive translation, are included with the package.
In general, features that can be included in modelling can be either numeric, or categorical. Numeric features can be both continuous (for example, sequence region nucleotide content or length), or discrete (for example, the number of uORFs in 5’UTRs). Categorical features can also be included by converting to binary variables. This is useful for evaluating gene signatures in relation to regulatory outcomes, but can also be applied in many scenarios where mRNA can be divided into two classes (for example those with a certain property would be coded as 1, while those without would be coded as 0, etc.). It is also possible to supply categorical features with a directionality. For example, a feature that can be increased, decreased, or unchanged could be coded as 1, -1, or 0 for each gene to be included in modelling.
Any gene signature constituting a list of genes, including those provided with the package (stored in the humanSignatures and mouseSignatures objects) can be converted to feature inputs compatible with modelling using the signCalc() function. In the example code below, signatures of mTOR and ISR-sensitive translation, as well as mRNAs known to contain classical TOP motifs will be converted into binary variables that can be used as feature inputs in featureIntegration() modelling.
## List of 5
## $ Gandin_etal_2016_mTOR_transUp : chr [1:1569] "ZNF467" "PHLDA2" "PYCARD" "FHOD1" ...
## $ Gandin_etal_2016_mTOR_transDown: chr [1:1721] "DST" "OTULINL" "MME" "BRD8" ...
## $ Cockman_etal_2020_classicTOP : chr [1:92] "RPSA" "RPS2" "RPS3" "RPS3A" ...
## $ Guan_etal_2017_Tg1_transUp : chr [1:1192] "AR" "ATP7A" "BRCA2" "LYST" ...
## $ Guan_etal_2017_Tg1_transDown : chr [1:731] "AGA" "BCKDHB" "CST3" "CYBA" ...
# Convert the gene signatures to a binary format for featureIntegration:
signatureFeatures <- signCalc(ptn = ptn, signatures = humanSignatures)
str(signatureFeatures)## List of 5
## $ Gandin_etal_2016_mTOR_transUp : Named num [1:9555] 1 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "names")= chr [1:9555] "A4GALT" "AAAS" "AACS" "AADAT" ...
## $ Gandin_etal_2016_mTOR_transDown: Named num [1:9555] 0 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "names")= chr [1:9555] "A4GALT" "AAAS" "AACS" "AADAT" ...
## $ Cockman_etal_2020_classicTOP : Named num [1:9555] 0 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "names")= chr [1:9555] "A4GALT" "AAAS" "AACS" "AADAT" ...
## $ Guan_etal_2017_Tg1_transUp : Named num [1:9555] 0 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "names")= chr [1:9555] "A4GALT" "AAAS" "AACS" "AADAT" ...
## $ Guan_etal_2017_Tg1_transDown : Named num [1:9555] 0 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "names")= chr [1:9555] "A4GALT" "AAAS" "AACS" "AADAT" ...
Features that will be included in modelling are passed to the featureIntegration() function as a named list of numeric vectors. Each vector element in the list must have names corresponding to the gene IDs in your postNetData object. Ideally, all genes in the dataset should be represented in each feature vector as any genes missing data for one or more features will be excluded from the analysis.
In the example code below, a list of input features for modelling is compiled using the outputs of previous steps of the analysis.
After compiling an appropriate list of features, it is now possible to implement the featureIntegration() function to model the observed post-transcriptional regulation. The following sections will describe modelling using stepwise linear regression and network analysis. This analysis uses hierarchical multiple regression to identify features explaining changes in the regulatory effect, rank them according to their importance, and reveals independent and combinatorial effects (for example, co-occurrence of regulatory features in the same mRNA molecules). The results of this analysis can be visualized using network plots showing the association of features to regulatory effects, and the relationships between features. Stepwise linear regression modelling is carried out in three phases to generate univariate, omnibus, and adjusted models (described below).
In phase 1 of the analysis, each candidate feature is evaluated separately in univariate linear models to identify significant associations with the regulatory effect measurement.
The results of these univariate models are summarized in an output table where for each feature, the P-value, and FDR are provided, along with the proportion of variance in the regulatory effect measurement that can be explained by that feature.
Figure 13: Output of univariate linear models from featureIntegration
The results of the univariate models are used to prioritize the features included in the second phase of stepwise regression modelling.
In phase 2, starting with the feature that best explained changes in the regulatory effect from univariate models (in the example above this would be 5’UTR GC content), forward stepwise regression is performed by adding features to the model in an iterative fashion, keeping covariance assigned to the most influential feature (a rank-based greedy model). In each step, the best performing model (the feature with the strongest association with the regulatory effect) is retained, and features that fail to explain additional variance are discarded. Whether a feature is retained in the model is based on the size of the F-statistic, and a P-value below a selected threshold (by default < 0.05). The resulting omnibus model represents the smallest set of features that can explain the greatest proportion of variance in the regulatory effect measurement. This step both ranks features according to their relative importance, and reveals relationships between them.
Figure 14: Output of stepwise regression summarizing F-values at each step of modelling
In the output table above, the F-statistics are summarized at each step, where features are iteratively added to the model. The table is coloured according to a “traffic-light” system, where cells marked in green indicate features that were significant and included in the final omnibus model. Cells marked in red/purple indicate features that did not improve the ability to explain additional variance in the regulatory effect measurement, and were therefore discarded from the model. For example, after the addition of CDS GC3 in step 2, the set of codons identified as enriched (“Codons_up”) is dropped from the model (the F-statistic decreases from more than 204.6 to 1.3) as it is not able to significantly explain additional variance in the regulatory effect. Finally, cells marked in orange indicate a substantial change in the F-statistic for a given feature when another feature is added to the model (an increase or decrease of at least 50%), indicating a relationship between features. For example, we can see a relationship between the 5’UTR SCSCGS motif and 5’UTR length, as the F-statistic for the motif decreased by ~88.5% (from 132.5 to 15.2) between steps 3 and 4 when 5’UTR length was added to the model. These relationships can be explored in detail in the resulting network analyses, with correlations, and using UMAP visualizations (all described below).
The features included in the final omnibus model are summarized in an output table:
Figure 15: Summary table of features included in the final omnibus and adjusted models after forward stepwise regression
Here, the total proportion of variance in the regulatory effect that can be explained by features included in the omnibus model is ~45%. The P-value for each feature from the stepwise regression model is provided, along with the variance explained. Note that since covariance is assigned to the most influential feature, the proportion of variance explained assigned to 5’UTR GC content is 30.4%, the same as in the univariate analysis. However, CDS GC3 content now explains 6.6% of the variance as opposed to 21.4% in the univariate analysis. This is because of the rank-based greedy nature of the model, as since 5’UTR GC and CDS GC3 content co-vary, this covariance is assigned to the more influential of the two features. This relationship is also apparent when examining the F-value table above, as there was a large decrease in the F-statistic for CDS GC3 between steps 1 and 2 indicating a relationship with 5’UTR GC.
In the 3rd and final phase, the independent contribution of each feature identified in phase 2 is determined. This is done by removing covariance from the omnibus model, which as described above, was previously assigned to the most influential features. This provides the adjusted contribution of each feature. The output of phase 3 is provided in the summary table above (Figure 15), where the last column is the percentage of variance in the regulatory effect explained by a given feature, independent of all other features in the omnibus model.
These adjusted values are useful when evaluating whether particular features tend to co-occur in the same mRNAs along with other features included in the model. This may suggest combinatorial effects between features, or reveal whether more specific features have impacts that are independent of more general features. For example, in the table above we see that after adjusting for all other features, the 5’UTR GC content still explains 10% of the regulatory effect independently, whereas the CDS GC3 content can independently explain only 0.3%, meaning that this feature co-varies very substantially with other features in the model. Note that in some instances the proportion of variance explained by a feature may increase after the adjustment, such as for TOP mRNAs (Cockman_etal_2020_classicTOP). This may suggest that this feature is anti-correlated with other features in the model. Indeed, in Figure 14 we can see that the F-statistic for the TOP mRNA signature increases between steps 1 and 2, and steps 4 and 5, suggesting that there is some relationship between TOP mRNAs, and 5’UTR GC content and length.
In addition to the “traffic-light” colouring of the F-value table to highlight substantial relationships between features based on changes in F-value, when the useCorel parameter is TRUE, an additional table of Pearson’s correlation coefficients between features will also be displayed. A threshold for the strength of association between features in order for correlation coefficients to be calculated can be provided using the covarFilt parameter. This threshold is based on an integer representing the absolute change in F-value. The default value is 20, but this may not be appropriate for all datasets and research questions and users can adjust this to see more or fewer correlations (this is discussed further below for network visualizations). Note that the orange “traffic-light” highlighting indicating substantial relationships in the F-value table (Figure 14) is independent of the threshold set with covarFilt, and will always represent an increase or decrease in F-value of at least 50%.
Figure 16: Summary table of Pearson’s correlation coefficients between features
Stepwise regression modelling and network analysis are specified in the featureIntegration() function by setting the analysis_type parameter to lm. A number of parameters can be used to control the behavior of the modelling. Similarly to the functions previously described, statistical comparisons can be specified using the comparisons parameter. It is possible to perform multiple pair-wise comparisons with stepwise regression modelling. By default, regression modelling is performed using only the set of regulated genes (i.e., those included in geneList, or regulation if starting from an Anota2seqDataSet object). However, it is also possible to perform the modelling with all genes included in the dataset by setting the regOnly parameter to FALSE.
By default, only the set of input features that are significantly associated with the regulatory effect in univariate models (phase 1) will be included in stepwise regression (phase 2). However, in some instances you may be interested in how various features interact with each other, so it is possible to bypass this filtering by setting the allFeat parameter to TRUE, which will allow all features provided as input with the features parameter to be used in stepwise regression modelling. In addition, the phase 1 filtering is based on an FDR threshold of 0.05 for univariate models, however, this can be adjusted using the fdrUni parameter. Furthermore, features are retained or discarded in stepwise regression analysis based on a P-value threshold of 0.05. This can also be adjusted using the stepP parameter.
In the example code below the three phases of modelling will be performed using the list of input features defined above. In this analysis, only the sets of regulated genes will be included, comparing translationally activated and suppressed genes. Those features passing the FDR threshold of 0.05 in univariate models will be included in stepwise regression.
# Run feature integration modelling using forward stepwise regression
ptn <- featureIntegration(
ptn = ptn,
features = myFeatureList,
pdfName = "Example",
regOnly = TRUE,
allFeat = FALSE,
analysis_type = "lm",
comparisons = list(c(1, 2))
)The results of the modelling will be stored in the postNetData object, and a series of PDFs will be produced. The PDF with the suffix “FinalModel” will contain the summary tables shown above in Figures 13-16. It is possible to retrieve the full results from each phase of modelling using the ptn_model() function, as well as the set of features that were selected in the omnibus model that significantly explain changes in the regulatory effect with ptn_selectedFeatures(). Furthermore, if modelling using multiple pairwise comparisons was performed, you can check which modelling results are available using the ptn_check_models() function. See the example code below:
# Check which models are available in the postNetData object
ptn_check_models(ptn = ptn, analysis_type = "lm")
# Extract the set of significant features selected in the omnibus model
selectedFeaturesLM <- ptn_selectedFeatures(
ptn = ptn,
analysis_type = "lm",
comparison = 1
)
# Retrieve the stepwise regression results (other options include "stepwiseModel" and "finalModel"):
univariateModel <- ptn_model(ptn, analysis_type = "lm", model = "univariateModel", comparison = 1)The other PDF files produced will contain the network visualization of the features included in the modelling, as well is plots of individual features, described below.
The results of feature integration using forward stepwise regression modelling can be visualized using network analyses. Each feature selected in the omnibus model that can significantly explain a proportion of variance in the regulatory effect appear in the networks as nodes. The size of the nodes corresponds to the proportion of variance explained by the given feature, and can represent either the values from the omnibus model, or the adjusted model (selected using the NetModelSel parameter). If NetModelSel is omnibus, the size of each node representing features will be proportional to the variance explained, where covariance is assigned to the most influential feature. If NetModelSel is adjusted, the size of the nodes will reflect the adjusted total variance explained, and therefore the independent contribution of each feature to the overall observed regulation.
The edges connecting nodes can either represent the Pearson correlation coefficients between features if the parameter useCorel = TRUE. However, if FALSE, the edges will instead represent the magnitude of the change in F-value for a given feature between sequential steps of stepwise regression, indicative of the strength of the associations between features. Note that while changes in F-value are indicative of the strength of relationships between features, this metric is not always easy to interpret and may not readily reveal whether features are positively or negatively correlated. Therefore, by default, edges will represent correlation coefficients. As described above, covarFilt parameter can be used to control the threshold for displaying edges between feature nodes. This threshold is based on the minimum change in F-value between sequential steps of stepwise regression, and when larger values are provided, only the strongest correlations will be displayed as edges in the network plot.
Finally, some features may not be significant in the omnibus model but have substantial relationships with one or more features that are. These features will not be assigned a node, but will appear in pink with edges connecting to the feature nodes they are substantially correlated with. For example, in the network displayed below, 5’UTR folding energy did not independently explain changes in the regulatory effect, but as expected, was substantially negatively correlated with 5’UTR length and GC content. This example was included to highlight this functionality. However, it is likely not usually informative to include folding energy in modelling alongside length and GC content of 5’UTRs, as it is known to be highly dependent on these two features. Furthermore, the signature of genes translationally suppressed upon mTOR activation was not independently significant, but was substantially negatively correlated with the 5’UTR GC content and the GC3 codon index.
The network plot below displays the results of the example in the section above, where by default, NetModelSel = "omnibus", useCorel = TRUE, and covarFilt = 20.
Figure 17: Network visualization of the results of forward stepwise regression modelling
Nodes represent features that were selected in the omnibus model, with the size proportional to the total variance in the regulatory effect explained by that feature. Edges represent the Pearson correlation coefficient between features. Wider lines indicate stronger positive or negative correlations. Features without a node indicated in pink represent those that were not selected in the omnibus model, but correlate substantially with one or more features that were significant in the model.
When many features are selected in the stepwise regression modelling or when there are many substantial correlations between features, it may be desirable to organize the network plots to be more easily interpreted. The lmfeatGroup and lmfeatGroupColour parameters can be used to group features in space, and assign coloured circles to the groupings. For example, it is possible to organize the layout of features in the network plot according to regions of the mRNA sequence they are found in. However, grouping is highly customizable and can be tailored to the dataset and biological context.
In the example below, the same feature integration modelling as above will be performed adding grouping variables and colours to the network plot. Note that the total proportion of variance explained by the group of features is also displayed on the network plot for each category.
# Examine the input features
names(myFeatureList)
# Make a vector to assign groups to each feature according to categories:
group <- c(
"UTR5", "UTR3", "UTR5", "CDS", "UTR5", "UTR5", "UTR5", "CDS", "CDS",
rep("Pathway", 2), "UTR5", rep("Pathway", 2)
)
# Assign colours to each grouping category:
groupColour <- c("#834b62", "#6699cc", "#e9724c", "#fff299")
names(groupColour) <- c("UTR5", "CDS", "UTR3", "Pathway")
# Run feature integration modelling using forward stepwise regression
ptn <- featureIntegration(
ptn = ptn,
features = myFeatureList,
pdfName = "Example_grouped",
regOnly = TRUE,
allFeat = FALSE,
analysis_type = "lm",
comparisons = list(c(1, 2)),
lmfeatGroup = group,
lmfeatGroupColour = groupColour
)
Figure 18: Network visualization of the results of forward stepwise regression modelling
Selected features are grouped according to which region of the mRNA they correspond to, as well as upstream regulatory pathways known to alter translation efficiency.
Finally, in addition to producing a PDF file, the results of the network analysis are stored in the postNetData object as an igraph object (Csárdi and Nepusz 2006; Antonov et al. 2023; Csárdi et al. 2025), and can be retrieved using the ptn_networkGraph function. See the example code below:
In addition to forward stepwise regression, the featureIntegration() function can also apply Breiman’s Random Forest algorithm (as implemented in the randomForest package (Liaw 2002)) to classify genes according to their regulation, and identify features associated with regulatory classes. Unlike with forward stepwise regression modelling, the relationships between features will not necessarily be revealed using feature selection and Random Forest classification, and network analysis cannot be implemented. However, Random Forest classifiers can be constructed and trained in one dataset, and then applied to predict post-transcriptional regulation in new contexts using the selected features. Random Forest classification with featureIntegration() is implemented in two steps described below.
In the first step, genes are divided into training (70%) and validation (30%) sets. A pre-modelling step is carried out where the Breiman’s Random Forest algorithm (Liaw 2002; Breiman 2019) is applied to perform supervised classification of genes into regulatory classes. Next, feature selection is performed using Boruta (Miron B. Kursa 2010). Boruta uses feature importance measurements from Random Forest classification to iteratively select features based on the comparison of importance attributes to randomly shuffled “shadow” attributes. This feature selection process finds all relevant features associated with the prediction, meaning features may be redundant with others already selected (see the Boruta Vignette for details). Although not directly comparable, this differs somewhat in principle from forward stepwise regression modelling, where features will only be selected if they independently explain a proportion of variance in the regulatory effect beyond what is explained by other features in the model. In this example, in agreement with forward stepwise regression modelling, 5’UTR GC content is found to be the most important feature predicting post-transcriptional regulation in this context.
Figure 19: Boxplots of Z-Scores of Importance from feature selection using Boruta
Features coloured in green indicate retained features. Shadow feature metrics (randomly shuffled attributes) are indicated in blue.
After performing feature selection, the Random Forest classification of genes into different regulatory classes is repeated using the refined set of retained features. The performance of the model is assessed using Receiver Operating Characteristic (ROC) curves (implemented with ROCR) (Sing T. 2005). The ROC curve plots the rate of false positives against the rate of true positives and provides several metrics to evaluate the predictive power of the model. The area under the curve (AUC) describes the probability of ranking a randomly selected positive higher than a randomly selected negative. Also provided are the sensitivity (true positive rate), and the specificity (true negative rate). The feature importance from the final model is also reported as Mean Decreased Accuracy (see randomForest for details), where higher values indicate more important features.
Figure 20: Final modelling results using Random Forest Classification
Left panel: Feature importance (mean decrease accuracy) for selected features included in the final model. Right panel: Receiver Operating Characteristic (ROC) curve for Random Forest classification of translationally activated vs. suppressed genes.
The featureIntegration() function can also be used to implement Random Forest classification. To do so, the function is run as described above changing the input for the analysis_type parameter to rf. In the example code below, a Random Forest classification model will predict membership of regulated genes to translationally activated or suppressed categories using the list of input features defined above.
# Run feature integration modelling using Random Forest classification:
ptn <- featureIntegration(
ptn = ptn,
features = myFeatureList,
pdfName = "Example",
regOnly = TRUE,
allFeat = FALSE,
analysis_type = "rf",
comparisons = list(c(1, 2))
)The results of the modelling will be stored in the postNetData object, and a series of PDFs will be produced. The PDF with the suffix “featureImportance.pdf” will contain a summary plot of feature importance from the feature selection step (Figure 19). The PDF with the suffix “FinalModel” will contain a summary of the feature importance for the selected features retained in the final model, as well as the ROC curve evaluating the performance of the model predictions (Figure 20). As for the forward stepwise regression implementation, it is possible to retrieve the full results from each step of modelling using the ptn_model() function, as well as the set of features that were retained during feature selection using ptn_selectedFeatures(). See the example code below:
# Check which models are available in the postNetData object:
ptn_check_models(ptn = ptn, analysis_type = "rf")
# Extract the set of selected features and their importance:
selectedFeaturesRF <- ptn_selectedFeatures(
ptn = ptn,
analysis_type = "rf",
comparison = 1
)
# Retrieve the final model results (other options include "preModel" or "borutaModel"):
finalModel <- ptn_model(ptn, analysis_type = "rf", model = "finalModel", comparison = 1)Both implementations of featureIntegration() output PDF files with scatterplots for each individual feature identified in either the final stepwise regression or Random Forest models. The Pearson’s correlation coefficient between the feature and regulatory effect (two-sided test) is displayed, along with the linear trend line and cubic smoothing spline (allowing assessment of linearity). For binary features, these metrics are not displayed.
Figure 21: Individual feature visualizations
Scatterplots of 5’UTR GC content (left), 5’UTR SCSCGS motif content (middle), and signature of genes translationally activated by mTOR (right) versus translation efficiency (effM). Pearson’s correlation coefficients, linear trend lines, and cubic smoothing splines indicated.
Feature-based models trained using featureIntegration() can be used not only to interpret factors contributing to post-transcriptional regulation within a single dataset, but also test whether the same features generalize across biological contexts. When featureIntegration() is implemented using Random Forest classification (when analysis_type = "rf"), the final trained model can be exported and applied to new gene sets using the rfPred() function. This enables evaluation of whether features that explain post-transcriptional regulation in one system can accurately predict regulation in an independent dataset, condition, or experimental model.
Running rfPred() requires a postNetData object that already contains a trained Random Forest model. By providing a single integer for the comparison argument, the user can select between models trained for different comparisons during featureIntegration() modelling. The trained model will then be applied to a new list of genes provided with the predGeneList parameter. The list must have the same regulatory classes as the original dataset used to train the model (e.g., translationUp, translationDown). For the genes included in predGeneList, features must be enumerated and compiled into a list, as described above for featureIntegration(). This list is provided with the predFeatures parameter, and must include exactly the same features as those in the final Random Forest model. These can be easily retrieved using the ptn_selectedFeatures() function as described above. The output of the rfPred() function is a PDF file containing the ROC curve evaluating how well the original features can explain the regulatory effect in the new dataset.
In the example below, the Random Forest model trained in the example above will be applied to a new simulated dataset created by randomly sampling the original data. Note that in this example, due to the randomization, the model is expected to have poor predictive power.
# Simulate a new gene list by randomly sampling genes from the existing background:
newGenes <- sample(postNetVignette$background, size = 100)
newGeneList <- list(translationUp = newGenes[1:50], translationDown = newGenes[51:100])
# Select the features that were used to train the final Random Forest model in the original dataset:
predFeatureNames <- names(ptn_selectedFeatures(ptn,
analysis_type = "rf",
comparison = 1
))
# Prepare the predictive features. In this case the same list of input features will be used
# to predict as the new gene lists are taken from the same dataset the model was trained on.
# However, usually the input for the rfPred() function would be taken from a postNetData
# object from a separate analysis on a distinct dataset.
newFeatures <- myFeatureList[predFeatureNames]
# Evaluate the model in the new simulated dataset:
ptn <- rfPred(
ptn = ptn,
comparison = 1,
predGeneList = newGeneList,
predFeatures = newFeatures,
pdfName = "Example"
)
Figure 22: Receiver Operating Characteristic (ROC) curve for the Random Forest classification model trained in the example above, applied to a new simulated dataset of randomized genes to predict translationally activated vs. suppressed genes
After identifying putative regulatory features using featureIntegration(), it is often helpful to explore how these features co-occur across genes and how they relate to regulatory effects. The plotFeaturesMap() function uses umap to generate Uniform Manifold Approximation and Projection (UMAP) visualizations based on user-selected features, allowing you to overlay regulatory effects and visually inspect how individual features vary across the same embedding. This provides an intuitive tool to identify transcripts with shared or differing regulatory behaviour and feature compositions.
When plotting UMAPs, each point will represent a gene. By default, only regulated genes will be included (those defined by the geneList or regulation parameters of the postNetStart() function). However, similarly to featureIntegration(), all genes can be included by setting the regOnly parameter to FALSE. The UMAP embedding is constructed from a user-defined set of features provided using the featSel parameter, often starting with those that were identified as informative during featureIntegration() using either stepwise regression or Random Forest modelling. However, additional features of interest can also be included. The regulatory effect and feature values can then be overlaid on the embedding. Which features are coloured can be controlled using the featCol parameter. One PDF file with two panels will be generated for each feature provided with featCol. The left panel will be overlaid with the regulatory effect from the selected comparisons, and the right panel will be overlaid with the given feature.
UMAP is sensitive to input parameters and data pre-processing (Huang Haiyang 2022). Therefore, several options are provided that can be applied to feature values prior to dimensionality reduction. The remBinary parameter will remove binary features (0 or 1) which will often have a strong influence on gene clustering (default is TRUE). There are also options to scale and/or center the feature data using the scaled and centered parameters. Finally, the remExtreme parameter can be used to filter outlier values prior to generating the colour scales for features and effect measurements overlaid on UMAP embeddings. The default settings are a suggested starting point but may not be appropriate for all inputs, so it is recommended to test the impact of different input parameters. Users should exercise caution when interpreting spatial clustering of genes using this approach. These visualizations are intended to be used as a tool for data exploration and understanding relationships between different features and regulatory effects, particularly in cases where multiple features are co-occurring in the same mRNA.
In the example below, the set of features identified using forward stepwise regression modelling will be used to generate a UMAP embedding. PDF files will be generated with the regulatory effect and the values for 5’UTR GC content overlaid. The resulting UMAP coordinates are stored in the featuresMap slot of the postNetData object.
# Plot UMAP visualizations using the features identified in stepwise regression modelling
ptn <- plotFeaturesMap(
ptn = ptn,
regOnly = TRUE,
comparisons = list(c(1, 2)),
featSel = c(names(ptn_selectedFeatures(ptn,
analysis_type = "lm",
comparison = 1
))),
featCol = c("UTR5_GC"),
remBinary = TRUE,
scaled = FALSE,
centered = TRUE,
remExtreme = 0.1,
pdfName = "Example"
)
Figure 23: UMAP visualizing changes in translation efficiency (Effect) generated based on features identified in the omnibus model using forward stepwise regression (left), with the 5’UTR GC content of mRNAs coloured (right)
In addition to tools for identifying and enumerating sequence features of mRNA and modelling networks of post-transcriptional regulation, postNet can also be used to perform enrichment analyses including implementations of GSEA, GAGE, and GO term analysis to provide functional insights into gene sets of interest. It is also possible to examine the regulation of gene signatures in your dataset in a threshold-independent manner, providing insights into possible mechanisms explaining observed regulation and allowing comparisons between datasets.
When performing GSEA, GAGE, or GO term analysis using the output of an anota2seq analysis, it is often necessary to filter the input genes and log2 fold changes for the “translation” and “buffering” regulatory modes prior to performing enrichment analyses. This is because the slopes fitted by the anota2seq APV models can sometimes have unrealistic values, or suggest unlikely translational regulation, impacting the analysis of changes in translation or translational buffering (or offsetting). Filtering out genes with these unrealistic slopes is especially important for GSEA and GAGE analyses, which rely on rankings. For analyses relying on hypergeometric tests, such as GO term enrichment, the impact of filtering on the analysis is likely to be more negligible. However, slope filtering is still recommended. The slopeFilt function identifies these genes with unrealistic slopes, allowing them to be removed from downstream analyses. Note that in high-quality datasets few genes will require slope filtering, but filtering thresholds can be adjusted with the minSlope and maxSlope parameters.
# Get the genes to be filtered out of downstream enrichment analyses
# using the buffering regulatory mode:
filtOutGenes <- slopeFilt(ads,
regulationGen = "buffering",
contrastSel = 1
)The output of the slopeFilt() function can be directly supplied to downstream enrichment analysis functions using the genesSlopeFiltOut parameter.
PostNet performs Gene Set Enrichment Analysis (GSEA) (Mootha et al. 2003; Subramanian et al. 2005), using the gseaAnalysis() function and the regulatory effect measurement contained in a postNetData object. The analysis implements the fgsea package (Korotkevich et al. 2016), and can be applied using gene sets from The Molecular Signatures Database MSigDB. Gene expression signatures to be considered in the analysis can be selected from MSigDB using the collection and subcollection parameters, and specific gene sets can be specified using subsetNames. Gene sets can be further refined using the maxSize and minSize parameters to set upper and lower thresholds for the number of genes included in each gene set.
The code below provides an example of how to run GSEA analysis on genes ranked according to log2 fold changes in translation efficiency, using specific gene set terms from the MSigDB. If you are using the results of an anota2seq analysis, it would also be necessary to perform slope filtering, and supply the output using the genesSlopeFiltOut parameter in both the gseaAnalysis() and gseaPlot() functions.
# If using a GSEA collection from MSigDB, check the available versions:
version <- msigdb::getMsigdbVersions()
# Retrieve the MSigDB data for "human":
msigdbOut <- msigdb::getMsigdb(
org = "hs",
id = "SYM",
version = version[1]
)
# Check the available collections or subcollections:
msigdb::listCollections(msigdbOut)
msigdb::listSubCollections(msigdbOut)
# Run GSEA on the C5 collection with GO:BP and specific terms:
ptn <- gseaAnalysis(
ptn = ptn,
collection = "c5",
subcollection = "GO:BP",
subsetNames = c(
"GOBP_CELL-CELL_SIGNALING_BY_WNT",
"GOBP_ENDOCYTOSIS"
),
name = "c5_Example"
)In addition to the collections in MSigDB, GSEA can also be run using custom gene sets using the geneSet parameter.
# Create example custom gene sets for GSEA:
set1 <- sample(myGenes[[1]], 100)
set2 <- sample(myGenes[[2]], 100)
inSet <- list(Set1 = set1, Set2 = set2)
# Run GSEA on custom gene sets:
ptn <- gseaAnalysis(
ptn = ptn,
geneSet = inSet,
name = "Example"
)GSEA results can be extracted from the postNetData object and plotted using the ptn_GSEA() and gseaPlot functions.
# Extract the significant enrichment results from the postNetData object:
gseaOut <- ptn_GSEA(ptn,
threshold = 0.05
)
# Plot GSEA results:
gseaPlot(
ptn = ptn,
termNames = gseaOut$Term[1]
)
Figure 24: A plot visualizing GSEA results for the enrichment of the gene Set1 in the example dataset
Generally Applicable Gene-set Enrichment (GAGE) analysis can also be performed using the regulatory effect measurement contained in a postNetData object. The gageAnalysis() function implements the gage package (Luo et al. 2009), and can be used with the Biological Process “BP”, Molecular Function “MF”, or Cellular Component “CC” Gene Ontology categories, as well as with ‘KEGG’ pathways. These are specified using the category parameter. Similarly to GSEA, terms/pathways can be filtered using the maxSize and minSize parameters to set upper and lower thresholds for the number of genes included in each term or pathway.
The results of the analysis can be retrieved using the ptn_GAGE function. As GAGE uses a two-directional test, the directionality of the results (“greater”, “less”) can be specified with the direction parameter.
The example code below performs GAGE analysis using the “MF” GO term category, and extracts terms that are significantly associated with increased log2 fold changes in translation efficiency. Note that if you are using the results of an anota2seq analysis, it would also be necessary to perform slope filtering, and supply the output using the genesSlopeFiltOut parameter in the gageAnalysis() function.
Within postNet, GAGE can also be applied to identify enrichments in microRNA (miRNA) targets in regulated gene sets of interest. This is done using the miRNAanalysis() function which implements the gage package (Luo et al. 2009) and uses miRNA target information from the TargetScan database. Note that this approach does not identify or predict miRNA binding sites in mRNA sequences, but rather assesses whether gene sets of interest may be regulated by particular miRNAs.
To perform the analysis, it is necessary to download a miRNA target prediction file from the TargetScan database. This file must contain the headings ‘Cumulative weighted context++ score’, ‘Aggregate PCT’, ‘Gene Symbol’, and ‘Representative miRNA’. Importantly, predictions for multiple species are contained within the file, so it is essential to subset to retain information for only the desired species prior to running the analysis. Once downloaded and subset, this file is supplied using the miRNATargetScanFile parameter.
Prior to running GAGE analysis, miRNA-mRNA targeting predictions should be filtered to retain high-confidence predictions, or adjusted depending on the biological question. This filtering is performed by specifying thresholds in the cumulative weighted context++ score (CWCS) using the contextScore parameter, and/or the Aggregate PCT using the Pct parameter. The CWCS provides a measure of the efficacy of predicted miRNA targeting, where larger negative values indicate more repression in response to a miRNA (Agarwal et al. 2015; Grimson et al. 2007). This metric is useful as it captures all types of interactions and can be used in cases where miRNAs are not highly conserved across species. The Aggregate PCT is a measurement of how well conserved the predicted miRNA-mRNA targeting is across species [Agarwal et al., 2015, Friedman et al., 2009]. Values closer to 1 indicate a greater probability of evolutionary conservation due to maintenance of miRNA targeting (i.e., biological function). Depending on the biological question, filtering can be performed on one or both of these metrics (see the TargetScan FAQ for further details regarding target prediction).
After the appropriate filtering has been applied, GAGE is implemented to identify enrichments in miRNAs predicted to target genes in gene sets of interest based on rankings (genes are ranked by the regulatory effect measurement). Similarly to GSEA and GAGE analyses, sets of miRNA target predictions can be filtered using the maxSize and minSize parameters to set upper and lower thresholds for the number of genes included.
The example code below performs GAGE miRNA target enrichment analysis using human mRNA-miRNA target predictions. Target predictions are filtered according to both the CWCS and the Aggregate PCT in this example, but these may vary depending on what the user is interested in. Note that if you are using the results of an anota2seq analysis, it would also be necessary to perform slope filtering, and supply the output using the genesSlopeFiltOut parameter in the miRNAanalysis() function.
# An example of the required format for the input for miRNATargetScanFile
# parameter ("miRNA_predictions_hsa.txt"):
# Transcript.ID Gene.Symbol miRNA.family Species.ID Total.num.conserved.sites
# Number.of.conserved.8mer.sites Number.of.conserved.7mer.m8.sites
# Number.of.conserved.7mer.1a.sites Total.num.nonconserved.sites
# Number.of.nonconserved.8mer.sites Number.of.nonconserved.7mer.m8.sites
# Number.of.nonconserved.7mer.1a.sites #Number.of.6mer.sites Representative.miRNA
# Total.context...score Cumulative.weighted.context...score # Aggregate.PCT
# Predicted.occupancy...low.miRNA Predicted.occupancy...high.miRNA
# Predicted.occupancy...transfected.miRNA
# ENST00000055682.6 KIAA2022 UGGCACU 9606 3 0 0 3 0 0 0 0 0
# hsa-miR-183-5p.2 -0.604 -0.604 1 0.0807 0.5089 1.8669
# ENST00000207157.3 TBX15 UGGCACU 9606 1 1 0 0 0 0 0 0 1
# hsa-miR-183-5p.2 -0.589 -0.552 1 0.1287 0.5218 0.8900
# ENST00000215832.6 MAPK1 ACAGUAC 9606 2 0 1 1 3 0 2 1 3
# hsa-miR-101-3p.1 -0.509 -0.279 1 0.0244 0.1671 0.8723
# Note that this target prediction file has been filtered to include only human miRNA ('hsa')
# Run GAGE miRNA target enrichment analysis:
ptn <- miRNAanalysis(ptn,
genesSlopeFiltOut = filtOutGenes,
miRNATargetScanFile = "miRNA_predictions_hsa.txt",
contextScore = -0.2,
Pct = 0.7
)After running the analysis, the results can be retrieved using either the ptn_miRNA_analysis, or ptn_miRNA_to_gene functions.
The ptn_miRNA_analysis function will return the results of the GAGE analysis. When interpreting the results, note that enrichments in miRNA predicted to target genes that are upregulated (e.g., if the regulatory effect measurement is log2 fold change) that appear in the results table labelled “greater” can be interpreted as those miRNA that may be downregulated or otherwise not active in the experimental condition. Likewise, enrichments in miRNA predicted to target genes that are downregulated that appear in the results table labelled “less” can be interpreted as those miRNA that may be upregulated or active in the experimental condition.
The ptn_miRNA_to_gene function will return the lists of genes that are predicted to be targeted by the miRNA passing the filtering threshold set by contextScore and Pct parameters. These may be desirable if you plan to include specific miRNA in modelling analyses using the featureIntegration() function. These gene lists can be converted to signatures using the signCalc() function and be included as features in the models.
# Extract the significant miRNA target enrichment results from the postNetData object:
miRNAout <- ptn_miRNA_analysis(
ptn = ptn,
direction = "less",
threshold = 0.05
)
# Extract the lists of genes for miRNA that were found to have targets significantly
# enriched in the downregulated gene set of interest:
miRNAtargets <- ptn_miRNA_to_gene(
ptn = ptn,
miRNAs = c("hsa-miR-138-5p", "hsa-miR-182-5p")
)Gene Ontology (GO) term enrichment analysis can also be performed using the gene lists of interest in a postNetData object. The goAnalysis() function implements clusterProfiler (Xu et al. 2024), and similarly to GAGE analysis can be used with the Biological Process “BP”, Molecular Function “MF”, or Cellular Component “CC” Gene Ontology categories, and with ‘KEGG’ pathways specified using the category parameter, with optional filtering using the maxSize and minSize parameters to set upper and lower thresholds for the number of genes included. In addition, enriched terms can be filtered according to FDR threshold, and for the number of genes in the gene sets of interest that are included in the term using the FDR and counts parameters, respectively.
The example code below performs GO term enrichment analysis using the “BP” GO term category, and “KEGG” pathways. Note that if you are using the results of an anota2seq analysis, it would also be necessary to perform slope filtering, and supply the output using the genesSlopeFiltOut parameter in the goAnalysis() function.
# Run GO term analysis using a postNetData object:
ptn <- goAnalysis(
ptn = ptn,
category = c("BP"), # Note that multiple terms can be run simultaneously
name = "Example"
)The results of the analysis can be retrieved using the ptn_GO function and visualized using the goDotplot() function. The pool parameter of the goDotplot function controls if separate plots will be generated for each gene list of interest, or whether results for all gene lists will be pooled into a single plot. It is also possible to specify which terms will be plotted using the termSel parameter. Finally, the metric used to determine the size of the dot for each enriched term can be controlled using the size parameter to reflect either the number of genes that were included in the enriched term, or the ratio of genes included to total number of genes in the term.
# Extract the significant enrichment results for Biological Process:
goOut <- ptn_GO(ptn,
category = "BP",
geneList = "translationUp",
threshold = 0.05
)
# Plot GO term analysis results:
goDotplot(
ptn = ptn,
category = "BP",
pool = TRUE,
size = "geneRatio",
pdfName = "Example"
)
Figure 25: A dot plot visualizing GO term enrichment analysis results
Dot size corresponds to the ratio of genes included, to the total number of genes in the term.
In addition to the various forms of gene set enrichment analysis described above, postNet also provides functionality for assessing the regulation of gene signatures using empirical cumulative distribution functions (eCDFs). This approach allows visualizations of changes in the regulatory effect measurement (often log2 fold changes). ECDFs for genes belonging to the gene signatures of interest are compared to those for all other genes (background). Differences in the regulatory effect measurement distributions are calculated at the quantiles, and significant directional shifts in the distributions for gene signature versus background are identified using a Wilcoxon rank-sum test.
Gene signature analysis in postNet can be implemented in two ways depending on whether you are using custom gene lists and regulatory effect measurements, or if your analysis is based on the output of anota2seq. These are described below.
Using both approaches, gene signatures are supplied as a list using the signatureList parameter. It is possible to assess multiple signatures at the same time, and it is important to note that the background gene set is automatically determined based on the gene signatures provided. When there is no overlap between genes in the signatures, the background set is determined to be all genes not included in any signature (i.e., mutually exclusive signatures will be compared to the same background gene set). However, when there is overlap between the genes in the signatures provided, the background set is determined separately for each signature (i.e., overlapping signatures will each be compared to separate backgrounds). For example, you may be interested in examining how the sets of upregulated and downregulated genes taken from a particular study or condition behave in your dataset. In this case, the gene signatures will be mutually exclusive as a gene cannot be both up and downregulated, so the background is created such that the comparison will be between “regulated” vs. “unregulated” genes. In another example, you may wish to compare two signatures of mTOR-activated genes taken from different studies. As the signatures come from the same condition they will likely overlap considerably, and therefore a separate set of background genes will be created for each signature. Consideration should be given to what is the appropriate background gene set when deciding whether gene signatures should be compared together, or individually.
Several gene signatures relevant to post-transcriptional regulation are included with the package for both human and mouse.
When starting from custom gene lists and regulatory effect measurement the regulation of gene signatures can be assessed using the plotSignatures function. All signatures supplied to the signatureList parameter will be included on the same plot, where the background gene set is indicated in grey.
The example code below will load the set of signatures included with the package, and assess the regulation of the signatures for genes whose translation is either activated or suppressed by mTOR signalling.
# load human signature data:
data("humanSignatures")
# Examine the regulation of mTOR-sensitive transcripts
plotSignatures(
ptn = ptn,
signatureList = humanSignatures[c(
"Gandin_etal_2016_mTOR_transUp",
"Gandin_etal_2016_mTOR_transDown"
)],
signature_colours = c("red", "blue"),
dataName = "Example",
generalName = "mTOR_sensitive_translation",
xlim = c(-3, 3),
tableCex = 0.7,
pdfName = "mTORsignatures"
)
Figure 26: eCDF of log2 fold changes in translation efficiency for signatures of mTOR-sensitive translation
Grey curve indicates the background gene set. Wilcoxon p-values and differences between the distributions at each quantile are indicated. A shift to the right indicates an increase compared to background, while shift to the left indicates a decrease.
When starting from the results of an anota2seq analysis, gene signatures can be assessed using the plotSignatures_ads function. The key difference from the custom gene list workflow is that instead of providing the postNetData object, the anota2seqDataSet object (the ads parameter) must be provided instead. Four eCDFs will be generated examining the distribution of log2 fold changes in total mRNA, polysome-associated mRNA (or ribosome-protected fragments; RPFs), translation, and buffering (also known as offsetting) for each gene signature provided. In addition to eCDFs, a scatter plot of total mRNA versus polysome-associated mRNA will also be generated, with the genes corresponding to the signatures provided coloured.
The example code below will load the set of signatures included with the package, and assess the regulation of the signatures for genes whose translation is either activated or suppressed by activation of the integrated stress response with thapsigargin. Note that a very minimal example dataset was used to illustrate the anota2seq workflow, so very few genes are present in the outputs.
# Examine the regulation of ISR-sensitive transcripts in the results of an anota2seq analysis:
plotSignatures_ads(
ads = ads,
contrast = 1,
dataName = "Osmosis_1h",
signatureList = humanSignatures[c(
"Guan_etal_2017_Tg1_transUp",
"Guan_etal_2017_Tg1_transDown"
)],
generalName = c("ISR_activated", "ISR_suppressed"),
signature_colours = c("red", "blue"),
xlim = c(-3, 3),
scatterXY = 4,
tableCex = 1,
pdfName = "Example_ISRsignatures"
)
Figure 27: Gene signature analysis starting from an anota2seqDataSet
Scatterplot of total vs. translated mRNA from an example anota2seq analysis with the location of genes belonging to signatures of interest coloured (left panel). Subsequent panels show the eCDFs of log2 fold changes in translation and buffering regulatory modes, as well as for translated and total mRNA for the signatures of interest. Grey curves indicate the background gene set. Wilcoxon p-values and differences between the distributions at each quantile are indicated. A shift to the right indicates an increase compared to background, while shift to the left indicates a decrease.
Finally, it is also possible to generate a heatmap comparing the regulation of multiple gene signatures in your dataset of interest using the signaturesHeatmap function. The values displayed in the heatmap are specified using the unit parameter, and can either be based on significance or the magnitude of the regulation for each given gene signature. For significance, the values displayed in the heatmap are obtained from a two-sided Wilcoxon Rank Sum test comparing the regulatory effect measurement values for the gene signature of interest against the background. P-values are corrected for multiple testing using the Benjamini & Hochberg method, and the -log10(FDR) value is multiplied by either 1 or -1 corresponding to up- or down-regulation of the gene signature compared to background, respectively. For magnitude, the values displayed in the heatmap are obtained by taking the difference between the eCDFs of the regulatory effect measurements for the gene signature and the background, at the percentile specified.
The example code below will examine the regulation of all gene signatures included with the package, based on the significance and directionality of the regulation.
# Examine the regulation of all gene signatures:
signaturesHeatmap(ptn,
signatureList = humanSignatures,
unit = "FDR",
pdfName = "ExampleSignatures"
)
Figure 28: A heatmap of -log10(FDRs) for regulation of various gene signatures in the example dataset
Colours indicate the directionality of regulation.
If you use postNet in your work, please cite:
Watt, K., Dauber, B., Szkop, K.J. et al. Epigenetic alterations facilitate transcriptional and translational programs in hypoxia. Nat Cell Biol 27, 1965–1981 (2025). DOI: 10.1038/s41556-025-01786-8.
Additional citations:
If you use the motifAnalysis() function in your work, please also cite:
Bailey, T. L., Johnson, J., Grant, C. E., & Noble, W. S. (2015). The MEME Suite. Nucleic acids research, 43(W1), W39–W49. DOI: 10.1093/nar/gkv416.
Bailey T. L. (2021). STREME: accurate and versatile sequence motif discovery. Bioinformatics (Oxford, England), 37(18), 2834–2840. DOI: 10.1093/bioinformatics/btab203.
Nystrom SL, McKay DJ (2021) Memes: A motif analysis environment in R using tools from the MEME Suite. PLoS Comput Biol 17(9): e1008991. DOI: 10.1371/journal.pcbi.1008991.
If you use the G4 option with the contentMotifs() function in your work, please also cite:
Hon, J., Martínek, T., Zendulka, J., & Lexa, M. (2017). pqsfinder: an exhaustive and imperfection-tolerant search tool for potential quadruplex-forming sequences in R. Bioinformatics (Oxford, England), 33(21), 3373–3379. DOI: 10.1093/bioinformatics/btx413.
If you use the Random Forest implementation of the featureIntegration() function in your work, please cite:
Kursa, Miron B., and Witold R. Rudnicki. 2010. “Feature Selection with the Boruta Package.” Journal of Statistical Software 36 (11): 1–13. https://doi.org/10.18637/jss.v036.i11.
Sing, T., Sander, O., Beerenwinkel, N., & Lengauer, T. (2005). ROCR: visualizing classifier performance in R. Bioinformatics (Oxford, England), 21(20), 3940–3941. https://doi.org/10.1093/bioinformatics/bti623.
If you use the gseaAnalysis() function in your work, please also cite:
Korotkevich G, Sukhov V, Sergushichev A (2019). “Fast gene set enrichment analysis.” bioRxiv. doi:10.1101/060012, http://biorxiv.org/content/early/2016/06/20/060012.
If you use the gageAnalysis() function in your work, please also cite:
Luo, Weijun, Friedman, Michael, Shedden, Kerby, Hankenson, Kurt, Woolf, Peter (2009). “GAGE: generally applicable gene set enrichment for pathway analysis.” BMC Bioinformatics, 10, 161.
If you use the goAnalysis() function in your work, please also cite:
Xu S, Hu E, Cai Y, Xie Z, Luo X, Zhan L, Tang W, Wang Q, Liu B, Wang R, Xie W, Wu T, Xie L, Yu G (2024). “Using clusterProfiler to characterize multiomics data.” Nature Protocols, 19(11), 3292-3320. doi:10.1038/s41596-024-01020-z. # Session info
## R version 4.6.0 alpha (2026-04-05 r89794)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] postNet_0.99.8 data.table_1.18.2.1 dplyr_1.2.1
## [4] BiocStyle_2.39.0
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-9 DBI_1.3.0
## [3] gridExtra_2.3 httr2_1.2.2
## [5] rlang_1.2.0 magrittr_2.0.5
## [7] ade4_1.7-24 vioplot_0.5.1
## [9] otel_0.2.0 matrixStats_1.5.0
## [11] compiler_4.6.0 RSQLite_2.4.6
## [13] png_0.1-9 vctrs_0.7.3
## [15] reshape2_1.4.5 stringr_1.6.0
## [17] pkgconfig_2.0.3 crayon_1.5.3
## [19] fastmap_1.2.0 dbplyr_2.5.2
## [21] XVector_0.51.0 labeling_0.4.3
## [23] caTools_1.18.3 rmarkdown_2.31
## [25] purrr_1.2.2 bit_4.6.0
## [27] xfun_0.57 cachem_1.1.0
## [29] seqinr_4.2-36 jsonlite_2.0.0
## [31] blob_1.3.0 DelayedArray_0.37.1
## [33] BiocParallel_1.45.0 parallel_4.6.0
## [35] R6_2.6.1 bslib_0.10.0
## [37] stringi_1.8.7 RColorBrewer_1.1-3
## [39] limma_3.67.1 reticulate_1.46.0
## [41] GenomicRanges_1.63.2 jquerylib_0.1.4
## [43] Rcpp_1.1.1-1 Seqinfo_1.1.0
## [45] bookdown_0.46 SummarizedExperiment_1.41.1
## [47] knitr_1.51 zoo_1.8-15
## [49] IRanges_2.45.0 Matrix_1.7-5
## [51] splines_4.6.0 igraph_2.2.3
## [53] tidyselect_1.2.1 qvalue_2.43.0
## [55] dichromat_2.0-0.1 abind_1.4-8
## [57] yaml_2.3.12 gplots_3.3.0
## [59] codetools_0.2-20 curl_7.0.0
## [61] lattice_0.22-9 tibble_3.3.1
## [63] plyr_1.8.9 Biobase_2.71.0
## [65] withr_3.0.2 S7_0.2.1-1
## [67] askpass_1.2.1 evaluate_1.0.5
## [69] survival_3.8-6 BiocFileCache_3.1.0
## [71] Biostrings_2.79.5 pillar_1.11.1
## [73] BiocManager_1.30.27 filelock_1.0.3
## [75] MatrixGenerics_1.23.0 KernSmooth_2.23-26
## [77] sm_2.2-6.0 stats4_4.6.0
## [79] generics_0.1.4 S4Vectors_0.49.2
## [81] ggplot2_4.0.2 scales_1.4.0
## [83] gtools_3.9.5 anota2seq_1.33.0
## [85] glue_1.8.1 tools_4.6.0
## [87] RSpectra_0.16-2 fgsea_1.37.4
## [89] locfit_1.5-9.12 fastmatch_1.1-8
## [91] cowplot_1.2.0 grid_4.6.0
## [93] plotrix_3.8-14 umap_0.2.10.0
## [95] edgeR_4.9.7 cli_3.6.6
## [97] rappdirs_0.3.4 S4Arrays_1.11.1
## [99] gtable_0.3.6 DESeq2_1.51.7
## [101] sass_0.4.10 digest_0.6.39
## [103] BiocGenerics_0.57.1 SparseArray_1.11.13
## [105] farver_2.1.2 memoise_2.0.1
## [107] htmltools_0.5.9 multtest_2.67.0
## [109] lifecycle_1.0.5 statmod_1.5.1
## [111] openssl_2.4.0 bit64_4.6.0-1
## [113] MASS_7.3-65
Agarwal, Vikram, George W Bell, Jin-Wu Nam, and David P Bartel. 2015. “Predicting Effective microRNA Target Sites in Mammalian mRNAs.” Elife 4.
Antonov, Michael, Gábor Csárdi, Szabolcs Horvát, Kirill Müller, Tamás Nepusz, Daniel Noom, Maëlle Salmon, Vincent Traag, Brooke Foucault Welles, and Fabio Zanini. 2023. “Igraph Enables Fast and Robust Network Analysis Across Programming Languages.” arXiv Preprint arXiv:2311.10260. https://doi.org/10.48550/arXiv.2311.10260.
Aota, S, and T Ikemura. 1986. “Diversity in G + c Content at the Third Position of Codons in Vertebrate Genes and Its Cause.” Nucleic Acids Research 14: 6345–55.
Bailey, Timothy L. 2021. “STREME: Accurate and Versatile Sequence Motif Discovery.” Bioinformatics 37 (18): 2834–40.
Bailey, Timothy L, James Johnson, Charles E Grant, and William S Noble. 2015. “The MEME Suite.” Nucleic Acids Res. 43 (W1): W39–49.
Barbosa, Cristina, Isabel Peixeiro, and Luı́sa Romão. 2013. “Gene Expression Regulation by Upstream Open Reading Frames and Human Disease.” PLoS Genet. 9 (8): e1003529.
Bennetzen, J L, and B D Hall. 1982. “Codon Selection in Yeast.” J. Biol. Chem. 257 (6): 3026–31.
Breiman, Leo. 2019. “Random Forests - Machine Learning.” SpringerLink, June. https://link.springer.com/article/10.1023/A:1010933404324#citeas.
Bugaut, Anthony, and Shankar Balasubramanian. 2012. “5’-UTR RNA G-Quadruplexes: Translation Regulation and Targeting.” Nucleic Acids Res. 40 (11): 4727–41.
Cockman, Eric, Paul Anderson, and Pavel Ivanov. 2020. “TOP mRNPs: Molecular Mechanisms and Principles of Regulation.” Biomolecules 10 (7): 969.
Csárdi, Gábor, and Tamás Nepusz. 2006. “The Igraph Software Package for Complex Network Research.” InterJournal Complex Systems: 1695. https://igraph.org.
Csárdi, Gábor, Tamás Nepusz, Vincent Traag, Szabolcs Horvát, Fabio Zanini, Daniel Noom, Kirill Müller, David Schoch, and Maëlle Salmon. 2025. igraph: Network Analysis and Visualization in R. https://doi.org/10.5281/zenodo.7682609.
FANTOM Consortium and the RIKEN PMI and CLST (DGT), Alistair R R Forrest, Hideya Kawaji, Michael Rehli, J Kenneth Baillie, Michiel J L de Hoon, Vanja Haberle, et al. 2014. “A Promoter-Level Mammalian Expression Atlas.” Nature 507 (7493): 462–70.
Faure, Guilhem, Aleksey Y Ogurtsov, Svetlana A Shabalina, and Eugene V Koonin. 2016. “Role of mRNA Structure in the Control of Protein Folding.” Nucleic Acids Res. 44 (22): 10898–10911.
Gandin, Masvidal, V. 2016. “NanoCAGE Reveals 5’UTR Features That Define Specific Modes of Translation of Functionally Related mTOR-Sensitive mRNAs.” Genome Res 26 (5): 636–48. https://doi.org/10.1101/gr.197566.115.
Giudice, Girolamo, Fátima Sánchez-Cabo, Carlos Torroja, and Enrique Lara-Pezzi. 2016. “ATtRACT-a Database of Rna-Binding Proteins and Associated Motifs.” Database (Oxford) 2016: baw035.
Grant, Charles E, Timothy L Bailey, and William Stafford Noble. 2011. “FIMO: Scanning for Occurrences of a Given Motif.” Bioinformatics 27 (7): 1017–8.
Grimson, Andrew, Kyle Kai-How Farh, Wendy K Johnston, Philip Garrett-Engele, Lee P Lim, and David P Bartel. 2007. “MicroRNA Targeting Specificity in Mammals: Determinants Beyond Seed Pairing.” Mol. Cell 27 (1): 91–105.
Hon, Jirı́, Tomáš Martı́nek, Jaroslav Zendulka, and Matej Lexa. 2017. “Pqsfinder: An Exhaustive and Imperfection-Tolerant Search Tool for Potential Quadruplex-Forming Sequences in R.” Bioinformatics 33 (21): 3373–9.
Huang Haiyang, Rudin Cynthia, Wang Yingfan. 2022. “Towards a Comprehensive Evaluation of Dimension Reduction Methods for Transcriptomic Data Visualization.” Commun. Biol. 5 (1): 719.
Ikemura, T. 1981. “Correlation Between the Abundance of Escherichia Coli Transfer RNAs and the Occurrence of the Respective Codons in Its Protein Genes: A Proposal for a Synonymous Codon Choice That Is Optimal for the E. Coli Translational System.” J. Mol. Biol. 151 (3): 389–409.
Jousse, C, A Bruhat, V Carraro, F Urano, M Ferrara, D Ron, and P Fafournoux. 2001. “Inhibition of CHOP Translation by a Peptide Encoded by an Open Reading Frame Localized in the Chop 5’UTR.” Nucleic Acids Res. 29 (21): 4341–51.
Korotkevich, Gennady, Vladimir Sukhov, Nikolay Budin, Boris Shpak, Maxim N Artyomov, and Alexey Sergushichev. 2016. “Fast Gene Set Enrichment Analysis.” bioRxiv.
Krokowski, Dawid, Raul Jobava, Krzysztof J Szkop, Chien-Wen Chen, Xu Fu, Sarah Venus, Bo-Jhih Guan, et al. 2022. “Stress-Induced Perturbations in Intracellular Amino Acids Reprogram mRNA Translation in Osmoadaptation Independently of the ISR.” Cell Rep. 40 (3): 111092.
Liaw, Wiener, A. 2002. “Classification and Regression by randomForest.” R News 2 (3): 18–22. https://CRAN.R-project.org/doc/Rnews/.
Lorent, Julie, Eric P Kusnadi, Vincent van Hoef, Richard J Rebello, Matthew Leibovitch, Johannes Ristau, Shan Chen, et al. 2019. “Translational Offsetting as a Mode of Estrogen Receptor α-Dependent Regulation of Gene Expression.” The EMBO Journal 38: e101323.
Luo, Weijun, Michael S Friedman, Kerby Shedden, Kurt D Hankenson, and Peter J Woolf. 2009. “GAGE: Generally Applicable Gene Set Enrichment for Pathway Analysis.” BMC Bioinformatics 10 (1): 161.
Mayr, Christine. 2016. “Evolution and Biological Roles of Alternative 3’UTRs.” Trends Cell Biol. 26 (3): 227–37.
Medina-Muñoz, Santiago Gerardo, Gopal Kushawah, Luciana Andrea Castellano, Michay Diez, Michelle Lynn DeVore, María José Blanco Salazar, and Ariel Alejandro Bazzini. 2021. “Crosstalk Between Codon Optimality and Cis-Regulatory Elements Dictates mRNA Stability.” Genome Biology 22: 14.
Miron B. Kursa, Witold R. Rudnicki. 2010. “Feature Selection with the Boruta Package.” Journal of Statistical Software 36 (11): 1–13. https://doi.org/10.18637/jss.v036.i11.
Moll, Pamela, Michael Ante, Alexander Seitz, and Torsten Reda. 2014. “QuantSeq 3′ mRNA Sequencing for RNA Quantification.” Nat. Methods 11 (12): i–iii.
Mootha, Vamsi K, Cecilia M Lindgren, Karl-Fredrik Eriksson, Aravind Subramanian, Smita Sihag, Joseph Lehar, Pere Puigserver, et al. 2003. “PGC-1\(\alpha\)-responsive Genes Involved in Oxidative Phosphorylation Are Coordinately Downregulated in Human Diabetes.” Nat. Genet. 34 (3): 267–73.
Morales, Joannella, Shashikant Pujar, Jane E Loveland, Alex Astashyn, Ruth Bennett, Andrew Berry, Eric Cox, et al. 2022. “A Joint NCBI and EMBL-EBI Transcript Set for Clinical Genomics and Research.” Nature 604 (7905): 310–15.
Nystrom, McKay, S. L. 2021. “Memes: A Motif Analysis Environment in R Using Tools from the Meme Suite.” PLoS Comput. Biol. 17 (9): e1008991.
Oertlin, Christian, Julie Lorent, Carl Murie, Luc Furic, Ivan Topisirovic, and Ola Larsson. 2019. “Generally Applicable Transcriptome-Wide Analysis of Translation Using Anota2seq.” Nucleic Acids Res. 47 (12): e70. https://doi.org/10.1093/nar/gkz223.
O’Leary, Nuala A, Mathew W Wright, J Rodney Brister, Stacy Ciufo, Diana Haddad, Rich McVeigh, Bhanu Rajput, et al. 2016. “Reference Sequence (Refseq) Database at Ncbi: Current Status, Taxonomic Expansion, and Functional Annotation.” Nucleic Acids Research 44: D733–45.
Philippe, Lucas, Antonia M G van den Elzen, Maegan J Watson, and Carson C Thoreen. 2020. “Global Analysis of Larp1 Translation Targets Reveals Tunable and Dynamic Features of 5’ Top Motifs.” Proceedings of the National Academy of Sciences of the United States of America 117: 5319–28.
Poulain, Stéphane, Sachi Kato, Ophélie Arnaud, Jean-Étienne Morlighem, Makoto Suzuki, Charles Plessy, and Matthias Harbers. 2017. “NanoCAGE: A Method for the Analysis of Coding and Noncoding 5’-Capped Transcriptomes.” Methods Mol. Biol. 1543: 57–109.
Qamra, Aditi, Manjie Xing, Nisha Padmanabhan, Jeffrey Jun Ting Kwok, Shenli Zhang, Chang Xu, Yan Shan Leong, et al. 2017. “Epigenomic Promoter Alterations Amplify Gene Isoform and Immunogenic Diversity in Gastric Adenocarcinoma.” Cancer Discov. 7 (6): 630–51.
Reis, Mario dos, Renos Savva, and Lorenz Wernisch. 2004. “Solving the Riddle of Codon Usage Preferences: A Test for Translational Selection.” Nucleic Acids Res. 32 (17): 5036–44.
Sharp, P M, and W H Li. 1987. “The Codon Adaptation Index–a Measure of Directional Synonymous Codon Usage Bias, and Its Potential Applications.” Nucleic Acids Res. 15 (3): 1281–95.
Sing T., O., Sander. 2005. “ROCR: Visualizing Classifier Performance in R.” Bioinformatics 21 (20): 7881. http://rocr.bioinf.mpi-sb.mpg.de.
Sonenberg, Nahum, and Alan G Hinnebusch. 2009. “Regulation of Translation Initiation in Eukaryotes: Mechanisms and Biological Targets.” Cell 136: 731–45.
Subramanian, Aravind, Pablo Tamayo, Vamsi K Mootha, Sayan Mukherjee, Benjamin L Ebert, Michael A Gillette, Amanda Paulovich, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proc. Natl. Acad. Sci. U. S. A. 102 (43): 15545–50.
Tushev, Georgi, Caspar Glock, Maximilian Heumüller, Anne Biever, Marko Jovanovic, and Erin M Schuman. 2018. “Alternative 3’ Utrs Modify the Localization, Regulatory Potential, Stability, and Plasticity of mRNAs in Neuronal Compartments.” Neuron 98: 495–511.e6.
Vattem, Krishna M, and Ronald C Wek. 2004. “Reinitiation Involving Upstream ORFs Regulates ATF4 mRNA Translation in Mammalian Cells.” Proc. Natl. Acad. Sci. U. S. A. 101 (31): 11269–74.
Verma, Manasvi, Junhong Choi, Kyle A Cottrell, Zeno Lavagnino, Erica N Thomas, Slavica Pavlovic-Djuranovic, Pawel Szczesny, et al. 2019. “A Short Translational Ramp Determines the Efficiency of Protein Synthesis.” Nat. Commun. 10 (1): 5774.
Watt, Kathleen, Bianca Dauber, Krzysztof J Szkop, Laura Lee, Predrag Jovanovic, Shan Chen, Ranveer Palia, et al. 2025. “Epigenetic Alterations Facilitate Transcriptional and Translational Programs in Hypoxia.” Nat. Cell Biol. 27 (11): 1965–81.
Wright, Charlotte J, Christopher W J Smith, and Chris D Jiggins. 2022. “Alternative Splicing as a Source of Phenotypic Diversity.” Nat. Rev. Genet. 23 (11): 697–710.
Xu, Shuangbin, Erqiang Hu, Yantong Cai, Zijing Xie, Xiao Luo, Li Zhan, Wenli Tang, et al. 2024. “Using clusterProfiler to Characterize Multiomics Data.” Nat. Protoc. 19 (11): 3292–3320.
Young, Sara K, Jeffrey A Willy, Cheng Wu, Matthew S Sachs, and Ronald C Wek. 2015. “Ribosome Reinitiation Directs Gene-Specific Translation and Regulates the Integrated Stress Response.” The Journal of Biological Chemistry 290: 28257–71.
Zuker, M. 2003. “Mfold Web Server for Nucleic Acid Folding and Hybridization Prediction.” Nucleic Acids Res. 31 (13): 3406–15.
Zur, Hadas, and Tamir Tuller. 2012. “Strong Association Between mRNA Folding Strength and Protein Abundance in S. Cerevisiae.” EMBO Rep. 13 (3): 272–77.