scToppR 0.99.10
scToppR is a package that allows seamless, workflow-based interaction with ToppGene, a portal for gene enrichment analysis. Researchers can use scToppR to directly query ToppGene’s databases and conduct analysis with a few lines of code. scToppR’s availability on Bioconductor ensures easy installation and integration with other Bioconductor workflows, allowing researchers to incorporate functional enrichment analysis from ToppGene into their existing pipelines.
The use of data from ToppGene is governed by their Terms of Use: https://toppgene.cchmc.org/navigation/termsofuse.jsp
This vignette demonstrates the use of scToppR within a differential expression workflow. We show the complete workflow from differential expression results to pathway analysis and visualization. While the examples show how to make live API calls to ToppGene, this vignette uses pre-computed results to ensure reproducibility and avoid dependency on internet connectivity.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("scToppR")
As an introduction, this vignette will work with the FindAllMarkers output from Seurat’s PBMC 3k clustering tutorial: https://satijalab.org/seurat/articles/pbmc3k_tutorial.html
You can follow that tutorial and obtain the markers table with this line:
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE)
Alternatively, this markers table is included in the scToppR package:
library(scToppR)
data("pbmc.markers")
head(pbmc.markers)
#> p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
#> RPS12 1.273332e-143 0.7387061 1.000 0.991 1.746248e-139 0 RPS12
#> RPS6 6.817653e-143 0.6934523 1.000 0.995 9.349729e-139 0 RPS6
#> RPS27 4.661810e-141 0.7372604 0.999 0.992 6.393206e-137 0 RPS27
#> RPL32 8.158412e-138 0.6266075 0.999 0.995 1.118845e-133 0 RPL32
#> RPS14 5.177478e-130 0.6336957 1.000 0.994 7.100394e-126 0 RPS14
#> RPS25 3.244898e-123 0.7689940 0.997 0.975 4.450053e-119 0 RPS25
With this data we can run the function toppFun to get results from ToppGene. The toppFun function can accept three different data formats:
A vector of gene symbols: type = "marker_list"
This is simply a set of gene symbols, without any additional information.
A data frame of cluster marker genes: type = "marker_df"
This is a data frame where each column is a different cluster or cell type, and the rows of that column hold the marker genes for that cluster.
A data frame of differentially expressed genes: type = "degs"
This is the typical output of a differential expression analysis such as DESeq2, containing gene symbols and statistics including p values and log fold changes. If the data frame has a cluster or cell type column, the function can run the ToppGene analysis for each cluster separately.
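As a rough illustration of how the three input shapes relate, the sketch below derives a vector and a one-column-per-cluster data frame from a small "degs"-style table using base R only. The toy table and its values are invented for illustration. Padding shorter columns with NA is an assumption made here to get a rectangular data frame, so check the toppFun documentation for how it expects unequal marker lists to be represented.

```r
# Toy stand-in for a FindAllMarkers-style table (values invented for illustration)
degs <- data.frame(
    gene = c("RPS12", "RPS6", "CD79A", "MS4A1", "NKG7"),
    cluster = c("0", "0", "3", "3", "6"),
    avg_log2FC = c(0.74, 0.69, 2.10, 1.80, 3.00),
    p_val_adj = c(1e-139, 1e-138, 1e-50, 1e-40, 1e-60),
    stringsAsFactors = FALSE
)

# "marker_list"-style input: a plain character vector of gene symbols
marker_list <- degs$gene[degs$cluster == "0"]

# "marker_df"-style input: one column per cluster.
# Columns are padded with NA so they all have the same length (an assumption,
# not a documented requirement of toppFun).
by_cluster <- split(degs$gene, degs$cluster)
n_max <- max(lengths(by_cluster))
padded <- lapply(by_cluster, function(g) {
    c(g, rep(NA_character_, n_max - length(g)))
})
marker_df <- data.frame(padded, check.names = FALSE, stringsAsFactors = FALSE)

marker_list
marker_df
```

The "degs" table itself needs no reshaping, which is why the PBMC example passes pbmc.markers directly.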
The pbmc.markers data is in the “degs” format, so we will set type = "degs" in the toppFun function. We will also need to specify the relevant columns for clusters, genes, p values, and log fold changes:
# Query ToppGene live when an internet connection is available;
# otherwise fall back to the pre-computed results
if (curl::has_internet()) {
    toppdata.pbmc <- toppFun(
        input_data = pbmc.markers,
        type = "degs",
        topp_categories = NULL,
        cluster_col = "cluster",
        gene_col = "gene",
        p_val_col = "p_val_adj",
        logFC_col = "avg_log2FC"
    )
} else {
    data("toppdata.pbmc")
}
#> This function returns data generated from ToppGene (https://toppgene.cchmc.org/)
#>
#> Any use of this data must be done so under the Terms of Use and citation guide established by ToppGene.
#>
#> Terms of Use: https://toppgene.cchmc.org/navigation/termsofuse.jsp
#> Citations: https://toppgene.cchmc.org/help/publications.jsp
#> Working on cluster:0
#> Working on cluster:1
#> Working on cluster:2
#> Working on cluster:3
#> Working on cluster:4
#> Working on cluster:5
#> Working on cluster:6
#> Working on cluster:7
#> Working on cluster:8
head(toppdata.pbmc)
#> Category ID Name
#> 1 GeneOntologyMolecularFunction GO:0003735 structural constituent of ribosome
#> 2 GeneOntologyMolecularFunction GO:0005198 structural molecule activity
#> 3 GeneOntologyMolecularFunction GO:0019843 rRNA binding
#> 4 GeneOntologyMolecularFunction GO:1990948 ubiquitin ligase inhibitor activity
#> 5 GeneOntologyMolecularFunction GO:1990932 5.8S rRNA binding
#> 6 GeneOntologyMolecularFunction GO:0048027 mRNA 5'-UTR binding
#> PValue QValueFDRBH QValueFDRBY QValueBonferroni TotalGenes
#> 1 1.418323e-99 8.566669e-97 5.980920e-96 8.566669e-97 19978
#> 2 9.507317e-47 2.871210e-44 2.004569e-43 5.742419e-44 19978
#> 3 1.453993e-27 2.927374e-25 2.043780e-24 8.782121e-25 19978
#> 4 6.350812e-11 9.589727e-09 6.695180e-08 3.835891e-08 19978
#> 5 2.718755e-10 3.284256e-08 2.292942e-07 1.642128e-07 19978
#> 6 4.258090e-10 4.286477e-08 2.992654e-07 2.571886e-07 19978
#> GenesInTerm GenesInQuery GenesInTermInQuery Source URL
#> 1 181 246 76
#> 2 902 246 80
#> 3 77 246 24
#> 4 13 246 7
#> 5 5 246 5
#> 6 25 246 8
#> Genes
#> 1 RPL21, RPL22, RPL23A, RPL24, RPL26, RPL27, RPL30, RPL27A, RPL28, RPL29, RPL31, RPL32, RPL34, RPL35A, RPL37, RPL37A, RPL38, RPL41, RPL36A, RPLP0, RPLP1, RPLP2, RPS2, RPS3, RPS3A, RPS4X, RPS4Y1, RPS5, RPS6, RPS7, RPS8, RPS9, RPS10, RPS12, RPS13, RPS14, RPS15, RPS15A, RPS16, RPS18, RPS19, RPS20, RPS21, RPS23, RPS25, RPS26, RPS27, RPS27A, RPS28, RPS29, RPL10A, RPL23, FAU, RPL36, RPSA, RPL14, RPSA2, RPL35, RPL13A, RPL3, RPL4, RPL5, RPL6, RPL7, RPL7A, RPL8, RPL9, RPL10, RPL11, RPL12, RPL13, RPL15, RPL17, RPL18, RPL18A, RPL19
#> 2 RPL21, RPL22, RPL23A, RPL24, RPL26, RPL27, RPL30, RPL27A, RPL28, RPL29, RPL31, RPL32, RPL34, RPL35A, MAL, RPL37, RPL37A, RPL38, RPL41, RPL36A, RPLP0, RPLP1, RPLP2, RPS2, RPS3, RPS3A, RPS4X, RPS4Y1, RPS5, RPS6, RPS7, RPS8, RPS9, RPS10, RPS12, RPS13, RPS14, RPS15, RPS15A, RPS16, SPOCK2, RPS18, RPS19, RPS20, RPS21, RPS23, RPS25, RPS26, ACTN1, RPS27, RPS27A, RPS28, RPS29, RPL10A, RPL23, FAU, RPL36, FBLN5, RPSA, RPL14, RPSA2, RPL35, RPL13A, RPL3, RPL4, RPL5, RPL6, RPL7, RPL7A, RPL8, RPL9, RPL10, RPL11, RPL12, RPL13, RPL15, RPL17, RPL18, RPL18A, RPL19
#> 3 RPL23A, RPL37, RPLP0, RPS3, RPS4X, RPS4Y1, RPS5, RPS9, RPS13, RPS18, RPL23, NPM1, NOP53, RPL3, RPL4, RPL5, RPL6, RPL7, RPL8, RPL9, RPL11, RPL12, RPL17, RPL19
#> 4 RPL37, RPS7, RPS15, RPS20, RPL23, RPL5, RPL11
#> 5 RPS9, RPS13, RPL6, RPL8, RPL19
#> 6 RPL26, RPL41, RSL1D1, RPS3A, RPS7, RPS13, RPS14, RPL5
#> Cluster
#> 1 0
#> 2 0
#> 3 0
#> 4 0
#> 5 0
#> 6 0
Setting topp_categories to NULL, as in the call above, runs toppFun on all ToppGene categories. You may also provide one or more specific categories as a list. To see all available ToppGene categories, use the function get_ToppCats():
get_ToppCats()
#> [1] "GeneOntologyMolecularFunction" "GeneOntologyBiologicalProcess"
#> [3] "GeneOntologyCellularComponent" "HumanPheno"
#> [5] "MousePheno" "Domain"
#> [7] "Pathway" "Pubmed"
#> [9] "Interaction" "Cytoband"
#> [11] "TFBS" "GeneFamily"
#> [13] "Coexpression" "CoexpressionAtlas"
#> [15] "ToppCell" "Computational"
#> [17] "MicroRNA" "Drug"
#> [19] "Disease"
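Because category names must match ToppGene's spelling exactly, it can help to check a selection against the known names before querying. The following is a minimal sketch using base R. The vector all_cats is copied by hand from the get_ToppCats() output above so that the block runs without the package, and the commented toppFun call assumes that a character vector is an acceptable way to pass several categories.

```r
# Category names copied from the get_ToppCats() output shown above
all_cats <- c(
    "GeneOntologyMolecularFunction", "GeneOntologyBiologicalProcess",
    "GeneOntologyCellularComponent", "HumanPheno", "MousePheno", "Domain",
    "Pathway", "Pubmed", "Interaction", "Cytoband", "TFBS", "GeneFamily",
    "Coexpression", "CoexpressionAtlas", "ToppCell", "Computational",
    "MicroRNA", "Drug", "Disease"
)

# Categories we would like to restrict the query to
wanted <- c("GeneOntologyBiologicalProcess", "Pathway")

# Stop early on misspelled names instead of sending them to ToppGene
unknown <- setdiff(wanted, all_cats)
if (length(unknown) > 0) {
    stop("Unknown ToppGene categories: ", paste(unknown, collapse = ", "))
}

# Hypothetical usage (requires scToppR and internet access; passing a
# character vector to topp_categories is an assumption):
# toppdata.subset <- toppFun(
#     input_data = pbmc.markers, type = "degs", topp_categories = wanted,
#     cluster_col = "cluster", gene_col = "gene",
#     p_val_col = "p_val_adj", logFC_col = "avg_log2FC"
# )
```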
You can also set additional parameters in the toppFun function; please check its documentation for more information.
The results of toppFun (whether from a live API call or loaded from cached data) are organized into a data frame with the following structure:
# Examine the structure of the results
str(toppdata.pbmc)
#> 'data.frame': 8550 obs. of 15 variables:
#> $ Category : chr "GeneOntologyMolecularFunction" "GeneOntologyMolecularFunction" "GeneOntologyMolecularFunction" "GeneOntologyMolecularFunction" ...
#> $ ID : chr "GO:0003735" "GO:0005198" "GO:0019843" "GO:1990948" ...
#> $ Name : chr "structural constituent of ribosome" "structural molecule activity" "rRNA binding" "ubiquitin ligase inhibitor activity" ...
#> $ PValue : num 1.42e-99 9.51e-47 1.45e-27 6.35e-11 2.72e-10 ...
#> $ QValueFDRBH : num 8.57e-97 2.87e-44 2.93e-25 9.59e-09 3.28e-08 ...
#> $ QValueFDRBY : num 5.98e-96 2.00e-43 2.04e-24 6.70e-08 2.29e-07 ...
#> $ QValueBonferroni : num 8.57e-97 5.74e-44 8.78e-25 3.84e-08 1.64e-07 ...
#> $ TotalGenes : int 19978 19978 19978 19978 19978 19978 19978 19978 19978 19978 ...
#> $ GenesInTerm : int 181 902 77 13 5 25 10 18 14 605 ...
#> $ GenesInQuery : int 246 246 246 246 246 246 246 246 246 246 ...
#> $ GenesInTermInQuery: int 76 80 24 7 5 8 6 7 6 24 ...
#> $ Source : chr " " " " " " " " ...
#> $ URL : chr " " " " " " " " ...
#> $ Genes : chr "RPL21, RPL22, RPL23A, RPL24, RPL26, RPL27, RPL30, RPL27A, RPL28, RPL29, RPL31, RPL32, RPL34, RPL35A, RPL37, RPL"| __truncated__ "RPL21, RPL22, RPL23A, RPL24, RPL26, RPL27, RPL30, RPL27A, RPL28, RPL29, RPL31, RPL32, RPL34, RPL35A, MAL, RPL37"| __truncated__ "RPL23A, RPL37, RPLP0, RPS3, RPS4X, RPS4Y1, RPS5, RPS9, RPS13, RPS18, RPL23, NPM1, NOP53, RPL3, RPL4, RPL5, RPL6"| __truncated__ "RPL37, RPS7, RPS15, RPS20, RPL23, RPL5, RPL11" ...
#> $ Cluster : chr "0" "0" "0" "0" ...
cat("Number of enriched terms:", nrow(toppdata.pbmc), "\n")
#> Number of enriched terms: 8550
cat("Categories analyzed:", length(unique(toppdata.pbmc$Category)), "\n")
#> Categories analyzed: 19
cat("Clusters analyzed:", length(unique(toppdata.pbmc$Cluster)), "\n")
#> Clusters analyzed: 9
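Since the results are an ordinary data frame, they can be filtered and ranked with base R. The sketch below uses a small toy data frame that borrows the column names shown above (the values and term names are invented) to keep terms under a 5% Benjamini-Hochberg threshold, compute the fraction of query genes in each term, and list the top terms per cluster. The threshold and the number of terms kept are arbitrary choices for illustration.

```r
# Toy stand-in for toppFun output; column names follow the structure above,
# values are invented for illustration
res <- data.frame(
    Category = "GeneOntologyMolecularFunction",
    Name = c("term A", "term B", "term C", "term D", "term E"),
    QValueFDRBH = c(1e-20, 0.2, 1e-5, 1e-8, 0.03),
    GenesInQuery = c(246, 246, 246, 120, 120),
    GenesInTermInQuery = c(76, 3, 24, 10, 6),
    Cluster = c("0", "0", "0", "1", "1"),
    stringsAsFactors = FALSE
)

# Keep terms passing a 5% Benjamini-Hochberg FDR threshold
sig <- res[res$QValueFDRBH < 0.05, ]

# Fraction of the query genes that fall in each term
sig$GeneRatio <- sig$GenesInTermInQuery / sig$GenesInQuery

# Order by cluster, then by significance, and keep the top 2 terms per cluster
sig <- sig[order(sig$Cluster, sig$QValueFDRBH), ]
top_terms <- do.call(rbind, lapply(split(sig, sig$Cluster), head, n = 2))
rownames(top_terms) <- NULL

top_terms[, c("Cluster", "Name", "QValueFDRBH", "GeneRatio")]
```

The same steps should apply to toppdata.pbmc by replacing res with the real results.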
scToppR can automatically create dot plots for each ToppGene category. Simply run:
plots <- toppPlot(toppdata.pbmc,
    category = "GeneOntologyMolecularFunction",
    clusters = NULL
)
#> Warning in toppPlot.data.frame(toppdata.pbmc, category =
#> "GeneOntologyMolecularFunction", : P value adjustment not found - using 'BH' by
#> default. For no adjustment, use p_val_adj = 'none'.
#> Multiple clusters entered: function returns a list of ggplots
plots[1]
#> $`0`
This creates a list of plots, one per cluster, for a single category. Here, the category “GeneOntologyMolecularFunction” was requested and the clusters parameter was left at its default, NULL, so all available clusters are used. If multiple clusters are selected, set combine = TRUE to return a patchwork object of plots; leaving combine = FALSE returns a list of ggplot objects. With save = TRUE, the function automatically saves each individual plot in the format: {category}_{cluster}_dotplot.pdf
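When working with the returned list, note the difference between single and double brackets in R. The sketch below uses a plain named list of strings as a stand-in for the list of ggplot objects, so it runs without scToppR, and builds file names following the pattern described above. The file names are only constructed here, not written to disk.

```r
# Stand-in for the list returned by toppPlot(); in practice each element
# would be a ggplot object named after its cluster
plots <- list(`0` = "plot for cluster 0", `1` = "plot for cluster 1")

# Single brackets return a list of length one (hence the `$`0`` header
# when printed)
one_element_list <- plots[1]

# Double brackets with the cluster name return the element itself
single_plot <- plots[["0"]]

# File names following the {category}_{cluster}_dotplot.pdf pattern
category <- "GeneOntologyMolecularFunction"
file_names <- sprintf("%s_%s_dotplot.pdf", category, names(plots))
file_names
```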
scToppR can also create balloon plots showing overlapping terms between all clusters.
toppBalloon(toppdata.pbmc, categories = "GeneOntologyMolecularFunction")
#> Creating Balloon Plot:GeneOntologyMolecularFunction
This function also has a save parameter that automatically saves the plots, which is helpful when multiple categories are visualized.
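To see in tabular form which terms are shared between clusters, one option is to count the clusters in which each term appears. This is a minimal base R sketch on invented toy data. It is not how toppBalloon is implemented, only a way to inspect the same kind of overlap.

```r
# Toy enrichment results: one row per (term, cluster) pair, values invented
res <- data.frame(
    Name = c(
        "rRNA binding", "rRNA binding", "MHC class II receptor activity",
        "rRNA binding", "actin binding"
    ),
    Cluster = c("0", "1", "3", "3", "5"),
    stringsAsFactors = FALSE
)

# Term-by-cluster contingency table
counts <- table(res$Name, res$Cluster)

# Number of clusters in which each term appears at least once
n_clusters <- rowSums(counts > 0)

# Terms found in two or more clusters
shared <- names(n_clusters)[n_clusters >= 2]
shared
```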
scToppR can also save the results of the ToppGene query with toppSave. By default it saves a separate file for each cluster; to save one large file, set the parameter split = FALSE. Files are saved as Excel spreadsheets by default, but this can be changed using the format parameter, which must be one of c("xlsx", "csv", "tsv").
tmpdir <- tempdir()
toppSave(toppdata.pbmc, filename = "PBMC", save_dir = tmpdir, split = TRUE, format = "xlsx")
#> Saving file:/tmp/RtmpCjRf7W/PBMC_0.xlsx
#> Saving file:/tmp/RtmpCjRf7W/PBMC_1.xlsx
#> Saving file:/tmp/RtmpCjRf7W/PBMC_2.xlsx
#> Saving file:/tmp/RtmpCjRf7W/PBMC_3.xlsx
#> Saving file:/tmp/RtmpCjRf7W/PBMC_4.xlsx
#> Saving file:/tmp/RtmpCjRf7W/PBMC_5.xlsx
#> Saving file:/tmp/RtmpCjRf7W/PBMC_6.xlsx
#> Saving file:/tmp/RtmpCjRf7W/PBMC_7.xlsx
#> Saving file:/tmp/RtmpCjRf7W/PBMC_8.xlsx
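Per-cluster files can later be read back and recombined into a single table. The sketch below shows one way to do this with base R for CSV output. To stay self-contained it writes two small stand-in files itself rather than calling toppSave, and it assumes that CSV files follow the same PBMC_{cluster} naming pattern as the Excel files listed above.

```r
# Scratch directory for the demonstration
out_dir <- file.path(tempdir(), "toppsave_demo")
dir.create(out_dir, showWarnings = FALSE)

# Stand-in files imitating per-cluster output (naming pattern is an assumption)
for (cl in c("0", "1")) {
    write.csv(
        data.frame(Name = paste("term", cl), Cluster = cl),
        file.path(out_dir, paste0("PBMC_", cl, ".csv")),
        row.names = FALSE
    )
}

# Find the per-cluster files and stack them into one data frame
files <- list.files(out_dir, pattern = "^PBMC_.*\\.csv$", full.names = TRUE)
combined <- do.call(rbind, lapply(files, read.csv, colClasses = "character"))
combined
```

Reading all columns as character keeps cluster labels such as "0" from being converted to numbers; convert the numeric columns afterwards as needed.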
sessionInfo()
#> R version 4.6.0 alpha (2026-04-05 r89794)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] dplyr_1.2.1 DESeq2_1.51.7
#> [3] airway_1.31.0 SummarizedExperiment_1.41.1
#> [5] Biobase_2.71.0 GenomicRanges_1.63.2
#> [7] Seqinfo_1.1.0 IRanges_2.45.0
#> [9] S4Vectors_0.49.2 BiocGenerics_0.57.1
#> [11] generics_0.1.4 MatrixGenerics_1.23.0
#> [13] matrixStats_1.5.0 scToppR_0.99.10
#> [15] knitr_1.51 BiocStyle_2.39.0
#>
#> loaded via a namespace (and not attached):
#> [1] gtable_0.3.6 xfun_0.57 bslib_0.10.0
#> [4] ggplot2_4.0.2 httr2_1.2.2 lattice_0.22-9
#> [7] vctrs_0.7.3 tools_4.6.0 curl_7.0.0
#> [10] parallel_4.6.0 tibble_3.3.1 pkgconfig_2.0.3
#> [13] Matrix_1.7-5 RColorBrewer_1.1-3 S7_0.2.1-1
#> [16] lifecycle_1.0.5 compiler_4.6.0 farver_2.1.2
#> [19] stringr_1.6.0 textshaping_1.0.5 tinytex_0.59
#> [22] codetools_0.2-20 htmltools_0.5.9 sass_0.4.10
#> [25] yaml_2.3.12 pillar_1.11.1 jquerylib_0.1.4
#> [28] BiocParallel_1.45.0 cachem_1.1.0 DelayedArray_0.37.1
#> [31] magick_2.9.1 viridis_0.6.5 abind_1.4-8
#> [34] tidyselect_1.2.1 locfit_1.5-9.12 zip_2.3.3
#> [37] digest_0.6.39 stringi_1.8.7 bookdown_0.46
#> [40] labeling_0.4.3 forcats_1.0.1 fastmap_1.2.0
#> [43] grid_4.6.0 cli_3.6.6 SparseArray_1.11.13
#> [46] magrittr_2.0.5 patchwork_1.3.2 S4Arrays_1.11.1
#> [49] dichromat_2.0-0.1 withr_3.0.2 scales_1.4.0
#> [52] rappdirs_0.3.4 rmarkdown_2.31 XVector_0.51.0
#> [55] otel_0.2.0 gridExtra_2.3 ragg_1.5.2
#> [58] openxlsx_4.2.8.1 evaluate_1.0.5 viridisLite_0.4.3
#> [61] rlang_1.2.0 Rcpp_1.1.1-1 glue_1.8.1
#> [64] BiocManager_1.30.27 jsonlite_2.0.0 R6_2.6.1
#> [67] systemfonts_1.3.2