scToppR with differential expression, Seurat object data

scToppR is a package that allows seamless, workflow-based interaction with ToppGene, a portal for gene enrichment analysis. Researchers can use scToppR to directly query ToppGene’s databases and conduct analysis with a few lines of code. The use of data from ToppGene is governed by their Terms of Use: https://toppgene.cchmc.org/navigation/termsofuse.jsp

This vignette shows the use of scToppR within a differential expression workflow using data from a Seurat object. Using the IFNB (Kang 2018) dataset included in the SeuratData package, one can find differentially expressed genes between the “CTRL” and “STIM” groups using Seurat’s FindMarkers function.
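The FindMarkers call itself is not shown in this vignette. As a rough sketch only, and not necessarily the code used to build the packaged dataset, a per-cell-type STIM vs. CTRL comparison could look like the following. The helper name find_stim_markers is hypothetical, and the metadata column names seurat_annotations and stim are assumptions based on the ifnb object distributed by SeuratData; the object is assumed to be already normalized.

```r
# Hypothetical helper (not part of scToppR or Seurat): runs STIM vs CTRL
# differential expression within each cell type and stacks the results.
# Assumes `obj` is a normalized Seurat object whose metadata contains a
# cell type column and a condition column with values "CTRL" and "STIM".
find_stim_markers <- function(obj,
                              celltype_col = "seurat_annotations",
                              condition_col = "stim") {
  # Use the cell type annotation as the active identity
  obj <- Seurat::SetIdent(obj, value = celltype_col)
  celltypes <- as.character(unique(stats::na.omit(obj[[celltype_col]][, 1])))

  results <- lapply(celltypes, function(ct) {
    # Compare STIM against CTRL using only the cells of this cell type
    de <- Seurat::FindMarkers(obj,
                              ident.1 = "STIM",
                              ident.2 = "CTRL",
                              group.by = condition_col,
                              subset.ident = ct)
    de$celltype <- ct
    de$gene <- rownames(de)
    rownames(de) <- NULL
    de
  })

  # One long table with a celltype and a gene column
  do.call(rbind, results)
}
```

Calling find_stim_markers() on the ifnb object would be expected to give a table with p_val, avg_log2FC, pct.1, pct.2, p_val_adj, celltype, and gene columns. Cell types with very few cells in either condition may cause FindMarkers to stop with an error, so a real pipeline would likely need to handle that case.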

The raw results from this analysis are included as a dataset in scToppR and can be loaded as follows:

library(scToppR)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:Biobase':
#> 
#>     combine
#> The following objects are masked from 'package:GenomicRanges':
#> 
#>     intersect, setdiff, union
#> The following object is masked from 'package:GenomeInfoDb':
#> 
#>     intersect
#> The following objects are masked from 'package:IRanges':
#> 
#>     collapse, desc, intersect, setdiff, slice, union
#> The following objects are masked from 'package:S4Vectors':
#> 
#>     first, intersect, rename, setdiff, setequal, union
#> The following objects are masked from 'package:BiocGenerics':
#> 
#>     combine, intersect, setdiff, setequal, union
#> The following object is masked from 'package:generics':
#> 
#>     explain
#> The following object is masked from 'package:matrixStats':
#> 
#>     count
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
data("ifnb.de")
head(ifnb.de)
#>   p_val avg_log2FC pct.1 pct.2 p_val_adj  celltype    gene
#> 1     0   7.319139 0.985 0.033         0 CD14 Mono   IFIT1
#> 2     0   8.036564 0.984 0.035         0 CD14 Mono  CXCL10
#> 3     0   6.741673 0.988 0.045         0 CD14 Mono   RSAD2
#> 4     0   6.991279 0.989 0.047         0 CD14 Mono TNFSF10
#> 5     0   6.883785 0.992 0.056         0 CD14 Mono   IFIT3
#> 6     0   7.179929 0.961 0.039         0 CD14 Mono   IFIT2

As this is the raw data, we will begin by filtering for significant results, keeping genes with an adjusted p-value below 0.05 and an absolute average log2 fold change above 0.3.

ifnb.de.filtered <- ifnb.de |>
  dplyr::filter(p_val_adj < 0.05, abs(avg_log2FC) > 0.3)
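To make the effect of the two thresholds concrete, here is the same filter applied to a small toy table. The gene names and values are invented for illustration and are not taken from ifnb.de.

```r
library(dplyr)

# Toy stand-in for a FindMarkers-style table (values are made up)
toy_de <- data.frame(
  gene       = c("A", "B", "C", "D"),
  avg_log2FC = c(2.1, -1.4, 0.1, -0.2),
  p_val_adj  = c(0.001, 0.01, 0.001, 0.2)
)

# Same thresholds as above: adjusted p < 0.05 and |log2FC| > 0.3
toy_filtered <- toy_de |>
  dplyr::filter(p_val_adj < 0.05, abs(avg_log2FC) > 0.3)

# Genes that pass both thresholds
toy_filtered$gene
```

Because the fold change is compared on its absolute value, both up-regulated (A) and down-regulated (B) genes are kept, while C fails the fold-change threshold and D fails both.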

With these results, we will use scToppR to query the ToppGene database for all categories for each cluster using the toppFun() function. This function requires users to specify the relevant columns in their dataset.

toppData <- toppFun(ifnb.de.filtered,
                    gene_col = "gene",
                    cluster_col = "celltype",
                    p_val_col = "p_val_adj",
                    logFC_col = "avg_log2FC")
#> This function returns data generated from ToppGene (https://toppgene.cchmc.org/)
#> 
#> Any use of this data must be done so under the Terms of Use and citation guide established by ToppGene.
#> 
#> Terms of Use: https://toppgene.cchmc.org/navigation/termsofuse.jsp
#> Citations: https://toppgene.cchmc.org/help/publications.jsp
#> Working on cluster: CD14 Mono 
#> Working on cluster: pDC 
#> Working on cluster: CD4 Memory T 
#> Working on cluster: T activated 
#> Working on cluster: CD4 Naive T 
#> Working on cluster: CD8 T 
#> Working on cluster: Mk 
#> Working on cluster: B Activated 
#> Working on cluster: B 
#> Working on cluster: DC 
#> Working on cluster: CD16 Mono 
#> Working on cluster: NK 
#> Working on cluster: Eryth
head(toppData)
#>                        Category         ID                          Name
#> 1 GeneOntologyMolecularFunction GO:0005126     cytokine receptor binding
#> 2 GeneOntologyMolecularFunction GO:0019001     guanyl nucleotide binding
#> 3 GeneOntologyMolecularFunction GO:0032561 guanyl ribonucleotide binding
#> 4 GeneOntologyMolecularFunction GO:0005525                   GTP binding
#> 5 GeneOntologyMolecularFunction GO:0042379    chemokine receptor binding
#> 6 GeneOntologyMolecularFunction GO:0008009            chemokine activity
#>         PValue  QValueFDRBH  QValueFDRBY QValueBonferroni TotalGenes
#> 1 1.765127e-08 2.877157e-05 0.0002294204     2.877157e-05      19912
#> 2 2.389083e-07 8.805693e-05 0.0007021534     3.894204e-04      19912
#> 3 2.389083e-07 8.805693e-05 0.0007021534     3.894204e-04      19912
#> 4 2.423479e-07 8.805693e-05 0.0007021534     3.950271e-04      19912
#> 5 2.701133e-07 8.805693e-05 0.0007021534     4.402846e-04      19912
#> 6 7.108735e-07 1.774066e-04 0.0014146151     1.158724e-03      19912
#>   GenesInTerm GenesInQuery GenesInTermInQuery Source URL   Cluster
#> 1         327          953                 41            CD14 Mono
#> 2         454          953                 48            CD14 Mono
#> 3         454          953                 48            CD14 Mono
#> 4         413          953                 45            CD14 Mono
#> 5          82          953                 17            CD14 Mono
#> 6          52          953                 13            CD14 Mono

As the function output reminds you, this data must be used in accordance with ToppGene’s Terms of Use. For more information, please visit: https://toppgene.cchmc.org/navigation/termsofuse.jsp
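Since head(toppData) prints like an ordinary data frame, the results can presumably be summarized with standard dplyr verbs. The sketch below picks the most significant term per cluster within one category. It runs on a toy table that borrows only the column names Category, Name, QValueFDRBH, and Cluster from the output above; the rows themselves are invented.

```r
library(dplyr)

# Toy stand-in with a few of the toppData columns (values are made up)
toy_topp <- data.frame(
  Category    = c("Pathway", "Pathway", "Pathway", "Pubmed"),
  Name        = c("term1", "term2", "term3", "paper1"),
  QValueFDRBH = c(0.01, 0.001, 0.03, 0.02),
  Cluster     = c("B", "B", "NK", "NK")
)

# Keep one category, then take the lowest FDR-adjusted value per cluster
top_terms <- toy_topp |>
  dplyr::filter(Category == "Pathway") |>
  dplyr::group_by(Cluster) |>
  dplyr::slice_min(QValueFDRBH, n = 1, with_ties = FALSE) |>
  dplyr::ungroup()

top_terms
```

Replacing toy_topp with the real toppData object and raising n would give a compact per-cluster summary before any plotting.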

The toppData data frame includes all results from ToppGene. We can use this data frame to quickly generate pathway analysis plots using the toppPlot() function. The function can be used to generate a single plot, for example:

toppPlot(toppData, 
         category = "GeneOntologyMolecularFunction", 
         clusters = "CD8 T")

The toppPlot() function can also create a plot for each cluster for a specified category; simply assign the parameter clusters to NULL. In this case, the function will return a list of plots.

plot_list <- toppPlot(toppData, 
         category = "GeneOntologyMolecularFunction", 
         clusters = NULL)
#> Multiple clusters entered: function returns a list of ggplots
plot_list[1]
#> $`CD14 Mono`

All of these plots can also be saved automatically by the toppPlot() function. The files and their save location are set with the following parameters:

- save = TRUE
- save_dir = "/path/to/save_directory"
- file_prefix = "GO_Molecular_Function"

The cluster/celltype name will be automatically added to the filename prior to saving.

plot_list <- toppPlot(toppData, 
         category = "GeneOntologyMolecularFunction", 
         clusters = NULL,
         save = TRUE,
         save_dir = "./GO_results",
         file_prefix = "GO_molecular_function")

scToppR also provides the toppBalloon() function, which creates a balloon plot that lets researchers quickly compare the top terms from the ToppGene results across clusters.

toppBalloon(toppData,
            categories = "GeneOntologyBiologicalProcess")
#> Balloon Plot: GeneOntologyBiologicalProcess
#> Warning in mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]):
#> argument is not numeric or logical: returning NA
#> Warning in mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]):
#> argument is not numeric or logical: returning NA
#> Warning in mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]):
#> argument is not numeric or logical: returning NA

One advantage of using scToppR in a pipeline is access to the other categories in ToppGene. Users can quickly view results from all ToppGene categories using these plotting functions, or by examining the toppData results directly. For example, a user could explore results shared among cell types in categories such as Pathway, ToppCell, and TFBS.
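One way to examine the results directly is to count how many clusters each term is enriched in. The following is a minimal sketch on an invented toy table that reuses the toppData column names; the 0.05 cutoff on QValueFDRBH is an assumption chosen for illustration.

```r
library(dplyr)

# Toy stand-in for toppData (term names and values are made up)
toy_topp <- data.frame(
  Category    = c("Pathway", "Pathway", "Pathway", "Pathway", "TFBS"),
  Name        = c("term_a", "term_a", "term_b", "term_a", "term_c"),
  QValueFDRBH = c(0.001, 0.002, 0.01, 0.04, 0.03),
  Cluster     = c("CD14 Mono", "NK", "NK", "B", "B")
)

# Terms that are significant in more than one cluster
shared_terms <- toy_topp |>
  dplyr::filter(QValueFDRBH < 0.05) |>
  dplyr::group_by(Category, Name) |>
  dplyr::summarise(n_clusters = dplyr::n_distinct(Cluster), .groups = "drop") |>
  dplyr::filter(n_clusters > 1) |>
  dplyr::arrange(dplyr::desc(n_clusters))

shared_terms
```

In this toy table only term_a in the Pathway category appears in several clusters, so it is the only row returned.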

For example, a quick look at the toppBalloon plot for Pathway shows a distinction with the Dendritic Cells compared to others:

toppBalloon(toppData,
            categories = "Pathway")
#> Balloon Plot: Pathway
#> Warning in mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]):
#> argument is not numeric or logical: returning NA
#> Warning in mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]):
#> argument is not numeric or logical: returning NA
#> Warning in mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]):
#> argument is not numeric or logical: returning NA

The Pubmed category also points researchers to published papers that explore similar data:

toppBalloon(toppData,
            categories = "Pubmed")
#> Balloon Plot: Pubmed
#> Warning in mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]):
#> argument is not numeric or logical: returning NA
#> Warning in mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]):
#> argument is not numeric or logical: returning NA
#> Warning in mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]):
#> argument is not numeric or logical: returning NA
#> Warning in mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]):
#> argument is not numeric or logical: returning NA
#> Warning in mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]):
#> argument is not numeric or logical: returning NA

To save toppData results, scToppR also includes a toppSave() function. This function can save the toppData results as a single file, or it can split the data by cluster/celltype and save each one individually. To split the results, set split = TRUE in the function call. The function saves the files as Excel spreadsheets by default, but this can be changed to .csv or .tsv files using the format parameter.


toppSave(toppData,
         filename = "IFNB_toppData",
         save_dir = "./toppData_results",
         split = TRUE,
         format = "xlsx")
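For readers who want to see what splitting by cluster amounts to, the idea can be sketched in base R. This is not how toppSave() is implemented; it is only an illustration on an invented toy table, and the output directory and file naming scheme are assumptions.

```r
# Toy stand-in for toppData (values are made up)
toy_topp <- data.frame(
  Name    = c("term_a", "term_b", "term_c"),
  Cluster = c("CD8 T", "CD8 T", "NK")
)

# Write into a temporary directory so nothing is left in the working directory
out_dir <- file.path(tempdir(), "toppData_results")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# One data frame per cluster, each written to its own CSV file
by_cluster <- split(toy_topp, toy_topp$Cluster)
written <- vapply(names(by_cluster), function(cl) {
  # Replace spaces and other special characters so the name is file-system safe
  safe_name <- gsub("[^A-Za-z0-9]+", "_", cl)
  path <- file.path(out_dir, paste0("toy_toppData_", safe_name, ".csv"))
  utils::write.csv(by_cluster[[cl]], path, row.names = FALSE)
  path
}, character(1))

written
```

The returned character vector is named by cluster, which makes it easy to read an individual file back in with read.csv().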
sessionInfo()
#> R Under development (unstable) (2024-10-21 r87258)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] dplyr_1.1.4                 DESeq2_1.47.0              
#>  [3] airway_1.27.0               SummarizedExperiment_1.37.0
#>  [5] Biobase_2.67.0              GenomicRanges_1.59.0       
#>  [7] GenomeInfoDb_1.43.0         IRanges_2.41.0             
#>  [9] S4Vectors_0.45.1            BiocGenerics_0.53.2        
#> [11] generics_0.1.3              MatrixGenerics_1.19.0      
#> [13] matrixStats_1.4.1           scToppR_0.99.0             
#> 
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.6            rjson_0.2.23            xfun_0.49              
#>  [4] bslib_0.8.0             ggplot2_3.5.1           lattice_0.22-6         
#>  [7] vctrs_0.6.5             tools_4.5.0             curl_6.0.1             
#> [10] parallel_4.5.0          tibble_3.2.1            fansi_1.0.6            
#> [13] pkgconfig_2.0.3         Matrix_1.7-1            lifecycle_1.0.4        
#> [16] GenomeInfoDbData_1.2.13 compiler_4.5.0          farver_2.1.2           
#> [19] stringr_1.5.1           munsell_0.5.1           codetools_0.2-20       
#> [22] htmltools_0.5.8.1       sass_0.4.9              yaml_2.3.10            
#> [25] pillar_1.9.0            crayon_1.5.3            jquerylib_0.1.4        
#> [28] BiocParallel_1.41.0     cachem_1.1.0            DelayedArray_0.33.1    
#> [31] viridis_0.6.5           abind_1.4-8             locfit_1.5-9.10        
#> [34] tidyselect_1.2.1        zip_2.3.1               digest_0.6.37          
#> [37] stringi_1.8.4           labeling_0.4.3          forcats_1.0.0          
#> [40] fastmap_1.2.0           grid_4.5.0              colorspace_2.1-1       
#> [43] cli_3.6.3               SparseArray_1.7.1       magrittr_2.0.3         
#> [46] patchwork_1.3.0         S4Arrays_1.7.1          utf8_1.2.4             
#> [49] withr_3.0.2             scales_1.3.0            UCSC.utils_1.3.0       
#> [52] rmarkdown_2.29          XVector_0.47.0          httr_1.4.7             
#> [55] gridExtra_2.3           openxlsx_4.2.7.1        evaluate_1.0.1         
#> [58] knitr_1.49              viridisLite_0.4.2       rlang_1.1.4            
#> [61] Rcpp_1.0.13-1           glue_1.8.0              jsonlite_1.8.9         
#> [64] R6_2.5.1                zlibbioc_1.53.0