# Standard setup chunk
knitr::opts_chunk$set(echo = TRUE, collapse = TRUE)
# Load libraries required for the vignette to build
library(PMScanR)
library(ggseqlogo)
library(seqinr)
library(plotly)## Loading required package: ggplot2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
The PMScanR package provides large-scale identification,
analysis and visualization of protein motifs. The package integrates
various methods to facilitate motif identification, characterization,
and visualization. It includes functions for running PS-Scan, a PROSITE
database tool. Additionally, PMScanR supports format
conversion to GFF, enhancing downstream analyses such as graphical
representation and database integration. The library offers multiple
visualization tools, including heatmaps, sequence logos, and pie charts,
enabling a deeper understanding of motif distribution and conservation.
Through its integration with PROSITE, PMScanR provides
access to up-to-date motif data.
Proteins play a crucial role in biological processes, with their
functions closely related to structure. Protein functions are often
associated with the presence of specific motifs, which are short,
sometimes repetitive amino acid sequences essential for distinctive
molecular interactions or modifications. Most of the existing
bioinformatics tools focus mainly on the identification of known motifs
and often do not provide during motif extraction, interactive analysis
and visualization tools. Moreover, they do not take into account the
effect of single variations on an entire domain or protein motif. These
limitations highlight the need for a tool that can automate and scale
the analysis. To address this, we have developed PMScanR,
an R-based package. Designed to facilitate and automate the prediction
and the evaluation of the effect of single amino acid substitutions on
the occurrence of protein motifs on a large scale of both motifs and
sequences. However, existing tools lacked the capability to perform
comparative analysis of multiple motifs across multiple sequences, a gap
that PMScanR was particularly developed to fill.
If the user prefers to perform the analysis using a graphical user
interface (GUI), they can simply run the function
runPMScanRShiny(). This will launch a Shiny app that opens
an interactive window. The window can be used both within R and in a web
browser, providing a clickable, user-friendly interface that allows the
entire analysis, including visualizations, to be carried out without
needing to write code. If you want you can use shiny to use all the
features of the library with user friendly GUI helping to follow all the
steps which are contained in this library.
Alternatively, if the user wishes to work directly with the code, the library provides a set of functions to perform the full analysis, including protein motif identification and visualization. This can be done through an R script, where users can execute and customize the analysis programmatically. Each function included in the package is described below, along with an explanation of its purpose and functionality.
The following list of commands provides a step-by-step description of
the functions that ensure the complete analysis provided by the
PMScanR package.
The first step of the analyses is to establish the working environment through the use of functions:
The next step is to load the data for the analysis (more example input files are given in the repository on Github, and below is one example file just from this repository):
# Load example FASTA file included with the package
fasta_file <- system.file("extdata", "hemoglobins.fasta", package = "PMScanR")Once the previous steps have been completed, you can move on to the relevant part of the analysis. Below is a description of the function, together with an example of its use.
runPsScan()This function runs the ps_scan tool from the user’s operating system to find protein motifs. It requires in_file, out_format, and out_file arguments. The function automatically detects the operating system and uses a cached version of the required PROSITE files.
# This command is not evaluated in the vignette as it requires an external
# dependency (Perl) and can be time-consuming.
runPsScan(in_file = fasta_file, out_format = 'gff', out_file = "results_pfscan.gff")This function is an automatic form of analysis loading, which searches the sequence for protein motifs present on it, but there is also an alternative option where the user can manually point the path to each of the files and the version of the operation system.
After running the scan, the output files can be converted into a GFF-like format for downstream analysis.
readPsa() and readProsite()These functions are used to parse the output files generated by
runPsScan. They convert the raw text output into a
standardized GFF-like data frame, which is a structured format required
by the other analysis and visualization functions in the package.
The following example shows how to parse a sample PSA output file
that is included with the PMScanR package.
# Load the path to an example PSA output file
motifs_psa_file <- system.file("extdata", "out_Hb_psa.txt", package = "PMScanR")
# Use readPsa to convert the file to a GFF-like data frame
gff_data <- readPsa(motifs_psa_file)
# Display the first few rows of the resulting data frame
head(gff_data)
## seqnames source type start end score strand phase Name
## 1 NP_000508.1 PSA PS00005 39 41 NA <NA> NA PKC_PHOSPHO_SITE
## 2 NP_000508.1 PSA PS00005 138 140 NA <NA> NA PKC_PHOSPHO_SITE
## 3 NP_000508.1 PSA PS00006 4 7 NA <NA> NA CK2_PHOSPHO_SITE
## 4 NP_000508.1 PSA PS00008 19 24 NA <NA> NA MYRISTYL
## 5 NP_000508.1 PSA PS00009 59 62 NA <NA> NA AMIDATION
## 6 NP_000508.1 PSA PS01033 2 142 NA <NA> NA GLOBIN
## Sequence
## 1 TtK
## 2 TsK
## 3 SpaD
## 4 GAhaGE
## 5 hGKK
## 6 VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF---DLSHGSAQVKGHGKKVADALTNAVAHV---DDMPNALSALSDLHAhKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
## Level SequenceDescription width SkipFlag KnownFalsePos RawScore FeatureFrom
## 1 <NA> NA 3 NA NA NA NA
## 2 <NA> NA 3 NA NA NA NA
## 3 <NA> NA 4 NA NA NA NA
## 4 <NA> NA 6 NA NA NA NA
## 5 <NA> NA 4 NA NA NA NA
## 6 0 NA 141 NA NA NA NA
## FeatureTo
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NAOnce the analysis output files have been converted to GFF format, the next step is to create the matrix from this file needed for further visualisation in the form of heatmap generation. That can me make by using:
gff2matrix()This function converts the GFF-like data frame into a binary
motif-occurrence matrix. This matrix format is compact and is the
required input for the heatmap visualization functions. Each row
represents a unique motif, each column represents a sequence, and a
1 indicates the presence of that motif in that
sequence.
# The 'gff_data' is used from the previous chunk
motif_matrix <- gff2matrix(gff_data)
# Display the first few rows of the resulting matrix
head(motif_matrix)
## NP_000175.1 NP_000508.1 NP_000549.1 NP_000550.2 NP_001138679.1
## PS00001:125-128 0 0 0 0 0
## PS00001:148-151 0 0 0 0 0
## PS00001:152-155 0 0 0 0 0
## PS00001:177-180 0 0 0 0 0
## PS00001:182-185 0 0 0 0 0
## PS00001:189-192 0 0 0 0 1
## NP_001305067.1 NP_001305150.1 NP_001350615.1 NP_071441.1
## PS00001:125-128 1 0 0 0
## PS00001:148-151 1 0 0 0
## PS00001:152-155 1 0 0 0
## PS00001:177-180 0 0 0 0
## PS00001:182-185 1 0 0 0
## PS00001:189-192 0 0 0 0
## NP_599030.1
## PS00001:125-128 0
## PS00001:148-151 0
## PS00001:152-155 0
## PS00001:177-180 1
## PS00001:182-185 0
## PS00001:189-192 0After using this function a heatmap can be generate by using:
matrix2hm() and
matrixToSquareHeatmap()These functions generate interactive heatmaps from the
motif-occurrence matrix, which is an effective way to visualize the
presence or absence of motifs across many sequences simultaneously.
matrix2hm creates a standard rectangular heatmap, while
matrixToSquareHeatmap creates one with a square aspect
ratio, which can be useful for comparing motifs.
# Generate a standard heatmap from the motif_matrix
heatmap1 <- matrix2hm(input = motif_matrix)
heatmap1# Generate a square heatmap from the motif_matrix
heatmap2 <- matrixToSquareHeatmap(input = motif_matrix)
heatmap2Heatmap is the first option to visualise the data (shown above), the next option is to generate a seqlogo, after preparing the protein motifs for their generation using the functions described below:
extractSegments() and
extractProteinMotifs()These functions are used to prepare data for sequence logo visualizations. A sequence logo provides a graphical representation of the conservation of nucleotides or amino acids in a set of aligned sequences.
extractSegments is used to pull out a specific region
(e.g., from position 10 to 20) from a list of raw sequences.
extractProteinMotifs is used to extract all occurrences of
a specific, defined motif from a PSA scan result file.
The example below demonstrates how to extract a segment from our sample hemoglobin sequences and generate a logo.
# Read the FASTA file into a list of sequences
sequences <- seqinr::read.fasta(file = fasta_file, seqtype = "AA")
# Extract segments from position 10 to 20 from all sequences
segments <- extractSegments(sequences, from = 10, to = 20)
# Generate the sequence logo from the extracted segments
ggseqlogo::ggseqlogo(unlist(segments), seq_type = "aa")
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the ggseqlogo package.
## Please report the issue at <https://github.com/omarwagih/ggseqlogo/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.extractProteinMotifs()This function is used to extract all protein motifs belonging to specific PROSITE IDs from a PSA-formatted file. It returns a list where each element is named by a PROSITE ID (e.g., “PS00223”) and contains all the sequence instances found for that motif. This is used for sequence logo visualizations of specific motifs.
# Extract all protein motifs into a list
protein_motifs <- extractProteinMotifs(motifs_psa_file)
# Generate the sequence logo from the extracted protein motifs
ggseqlogo::ggseqlogo(protein_motifs$PS60007, seq_type='aa') # Sequence logo for motif 'PS60007' (example motif ID)
ggseqlogo::ggseqlogo(protein_motifs[1], seq_type='aa') # Sequence logo for the first motif in the list
ggseqlogo::ggseqlogo(protein_motifs[5], seq_type='aa') # Sequence logo for the fifth motif in the listfreqPie()This function generates a pie chart to visualize the frequency distribution of each motif type found in the analysis. It takes the GFF-like data frame as input and provides a quick, high-level overview of which motifs are most common in the dataset.
Sigrist C.J.A., de Castro E., Cerutti L., Cuche B.A., Hulo N., Bridge A., Bougueleret L., Xenarios I. (2012). New and continuing developments at PROSITE. Nucleic Acids Res.
sessionInfo()
## R version 4.5.2 (2025-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] plotly_4.11.0 ggplot2_4.0.0 seqinr_4.2-36 ggseqlogo_0.2
## [5] PMScanR_1.0.0 BiocStyle_2.38.0
##
## loaded via a namespace (and not attached):
## [1] DBI_1.2.3 bitops_1.0-9
## [3] httr2_1.2.1 rlang_1.1.6
## [5] magrittr_2.0.4 ade4_1.7-23
## [7] otel_0.2.0 matrixStats_1.5.0
## [9] compiler_4.5.2 RSQLite_2.4.4
## [11] vctrs_0.6.5 reshape2_1.4.4
## [13] stringr_1.6.0 pkgconfig_2.0.3
## [15] crayon_1.5.3 fastmap_1.2.0
## [17] dbplyr_2.5.1 XVector_0.50.0
## [19] labeling_0.4.3 Rsamtools_2.26.0
## [21] promises_1.5.0 rmarkdown_2.30
## [23] purrr_1.2.0 bit_4.6.0
## [25] xfun_0.54 cachem_1.1.0
## [27] cigarillo_1.0.0 jsonlite_2.0.0
## [29] shinyFiles_0.9.3 blob_1.2.4
## [31] later_1.4.4 DelayedArray_0.36.0
## [33] BiocParallel_1.44.0 parallel_4.5.2
## [35] R6_2.6.1 bslib_0.9.0
## [37] stringi_1.8.7 RColorBrewer_1.1-3
## [39] rtracklayer_1.70.0 GenomicRanges_1.62.0
## [41] jquerylib_0.1.4 Rcpp_1.1.0
## [43] Seqinfo_1.0.0 SummarizedExperiment_1.40.0
## [45] knitr_1.50 IRanges_2.44.0
## [47] httpuv_1.6.16 Matrix_1.7-4
## [49] tidyselect_1.2.1 abind_1.4-8
## [51] yaml_2.3.10 codetools_0.2-20
## [53] curl_7.0.0 lattice_0.22-7
## [55] tibble_3.3.0 plyr_1.8.9
## [57] withr_3.0.2 Biobase_2.70.0
## [59] shiny_1.11.1 S7_0.2.0
## [61] evaluate_1.0.5 BiocFileCache_3.0.0
## [63] Biostrings_2.78.0 pillar_1.11.1
## [65] BiocManager_1.30.26 filelock_1.0.3
## [67] MatrixGenerics_1.22.0 stats4_4.5.2
## [69] generics_0.1.4 RCurl_1.98-1.17
## [71] S4Vectors_0.48.0 scales_1.4.0
## [73] xtable_1.8-4 glue_1.8.0
## [75] lazyeval_0.2.2 maketools_1.3.2
## [77] tools_4.5.2 BiocIO_1.20.0
## [79] sys_3.4.3 data.table_1.17.8
## [81] GenomicAlignments_1.46.0 buildtools_1.0.0
## [83] fs_1.6.6 XML_3.99-0.20
## [85] grid_4.5.2 tidyr_1.3.1
## [87] crosstalk_1.2.2 restfulr_0.0.16
## [89] cli_3.6.5 rappdirs_0.3.3
## [91] S4Arrays_1.10.0 viridisLite_0.4.2
## [93] dplyr_1.1.4 gtable_0.3.6
## [95] sass_0.4.10 digest_0.6.37
## [97] BiocGenerics_0.56.0 SparseArray_1.10.1
## [99] rjson_0.2.23 htmlwidgets_1.6.4
## [101] farver_2.1.2 memoise_2.0.1
## [103] htmltools_0.5.8.1 lifecycle_1.0.4
## [105] httr_1.4.7 mime_0.13
## [107] bit64_4.6.0-1 MASS_7.3-65