1 imageTCGAutils

2 Introduction

imageTCGAutils provides utility functions for integrating and analyzing multi-modal whole-slide image (WSI) data from The Cancer Genome Atlas (TCGA). It is designed to work alongside imageFeatureTCGA, which handles data import of precomputed features derived from histopathology foundation models, including HoVerNet (Graham et al. 2019) and Prov-GigaPath (Xu et al. 2024).

In particular, Prov-GigaPath is a vision encoder pretrained on over 1.3 billion pathology image tiles from the Providence Health System, producing high-dimensional tile-level embeddings that capture rich visual and morphological characteristics of tissue architecture.

imageTCGAutils facilitates the integration of these tile-level embeddings with nuclei-level segmentation and classification results generated by HoVerNet. Because these data sources operate at different spatial resolutions and use distinct coordinate systems, the package provides functions such as matchHoverNetToTiles() to compute scaling factors and assign nuclei-level features to their corresponding tiles. This alignment enables downstream analyses that combine cellular morphological context from nuclei classification with the global representations encoded in tile-level embeddings.

Additionally, the package includes functionality to import into R user-generated results obtained from CONCH (CONtrastive learning from Captions for Histopathology) (Lu et al. 2024), a vision–language foundation model pretrained on 1.17 million histopathology image–caption pairs. CONCH achieves state-of-the-art performance across multiple tasks, including image classification, segmentation, and image–text retrieval, thereby enabling multi-modal analyses that incorporate both visual and textual information.

In this vignette, we demonstrate how to work with tile-level embeddings derived from whole-slide images (WSIs). Each tile corresponds to a tissue patch, and its embedding is a high-dimensional vector summarizing visual and morphological features extracted by Prov-GigaPath.

We begin by importing the tile-level data using imageFeatureTCGA. Next, we perform principal component analysis (PCA) to reduce the high-dimensional embeddings to two principal components, which facilitates visualization and preliminary exploration of the data. We then visualize the spatial layout of the tiles on the tissue slide, coloring by the principal components to examine patterns in the embedding space.

3 Installation

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("waldronlab/imageTCGAutils")

4 Loading packages

library(BiocStyle)
library(imageFeatureTCGA)
library(imageTCGAutils)
library(ggplot2)
library(dplyr)
library(sfdep)
library(spdep)
library(SpatialExperiment)
library(data.table)

5 Import Prov-GigaPath tile level embeddings

## filter with catalog
getCatalog("provgigapath") |> 
    dplyr::filter(Project.ID == "TCGA-OV") |> 
    dplyr::pull(filename)
## Registered S3 method overwritten by 'bit64':
##   method          from 
##   print.bitstring tools
# select Ovarian Cancer Slide as an example
tile_prov_url <- paste0(
    "https://store.cancerdatasci.org/provgigapath/tile_level/",
    "TCGA-23-1021-01Z-00-DX1.F07C221B-D401-47A5-9519-10DE59CA1E9D.csv.gz"
)

example_slide <- ProvGiga(tile_prov_url) |>
    import()
## Warning in file.rename(file, dest): cannot rename file
## '/tmp/Rtmp1NOBLB/file5fd8b365101ca/https:/store.cancerdatasci.org/provgigapath/tile_level/TCGA-23-1021-01Z-00-DX1.F07C221B-D401-47A5-9519-10DE59CA1E9D.csv.gz'
## to
## '/home/biocbuild/.cache/R/BiocFileCache/https://store.cancerdatasci.org/provgigapath/tile_level/TCGA-23-1021-01Z-00-DX1.F07C221B-D401-47A5-9519-10DE59CA1E9D.csv.gz',
## reason 'Invalid cross-device link'

6 Embedding PCA

# Extract embedding numbers for pca
embedding_cols <- grep("^[0-9]+$", names(example_slide), value = TRUE)

# Run PCA
pca_res <- prcomp(example_slide[, embedding_cols], scale. = TRUE)

pca_example_slide <- bind_cols(
    example_slide,
    as_tibble(pca_res$x)[, 1:2] |> rename(PC1 = "PC1", PC2 = "PC2")
)
ggplot(pca_example_slide, aes(PC1, PC2)) +
    geom_point(alpha = 0.6, size = 1) +
    theme_minimal() +
    labs(title = "Tile-level PCA Ovarian Cancer Embedding: Single Slide")

ggplot(pca_example_slide, aes(tile_x, tile_y, color = PC1)) +
    geom_point(size = 1) +
    scale_color_viridis_c() +
    coord_equal() +
    theme_minimal() +
    labs(title = "Tissue layout colored by PC1")

7 Spatial Patterns

To investigate spatial patterns in the tissue, we use the PCA-reduced embeddings for each tile. Each tile has a physical location (tile_x, tile_y) on the slide, which allows us to explore how similar embedding values cluster across space. We construct a k-nearest neighbor graph to define which tiles are spatially “connected,” and then compute global and local spatial autocorrelation metrics.

coords <- pca_example_slide[, c("tile_x", "tile_y")]
nb <- knn2nb(knearneigh(coords, k = 6))
lw <- nb2listw(nb, style = "W")

Next, we calculate global spatial autocorrelation using Moran’s I and Geary’s C, which quantify the overall tendency of similar PC1 values to cluster or disperse on the tissue slide. We also compute Local Moran’s I (LISA) to detect local clusters of similar embedding values.

mi <- moran.test(pca_example_slide$PC1, lw)
gc <- geary.test(pca_example_slide$PC1, lw)
lisa <- localmoran(pca_example_slide$PC1, lw)
pca_example_slide$localI <- lisa[, "Ii"]
pca_example_slide$localI_pval <- lisa[, "Pr(z != E(Ii))"]

mi
## 
##  Moran I test under randomisation
## 
## data:  pca_example_slide$PC1  
## weights: lw    
## 
## Moran I statistic standard deviate = 71.599, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      6.134724e-01     -2.347418e-04      7.347018e-05
gc
## 
##  Geary C test under randomisation
## 
## data:  pca_example_slide$PC1 
## weights: lw   
## 
## Geary C statistic standard deviate = 71.784, p-value < 2.2e-16
## alternative hypothesis: Expectation greater than statistic
## sample estimates:
## Geary C statistic       Expectation          Variance 
##      3.657367e-01      1.000000e+00      7.806998e-05

We visualize the spatial patterns. The Moran scatterplot shows the relationship between each tile’s PC1 value and the mean of its neighbors, while the LISA plot highlights local clusters (“hotspots”) of high or low PC1 values across the tissue slide.

moran.plot(pca_example_slide$PC1, lw, labels = FALSE,
                main = "Moran scatterplot of PC1")

# LISA visualization
df_lisa <- data.frame(coords, Ii = lisa[, "Ii"])
ggplot(df_lisa, aes(x = tile_x, y = tile_y, color = Ii)) +
    geom_point(size = 0.5) +
    scale_color_viridis_c() +
    coord_equal() +
    theme_minimal() +
    ggtitle("Local Moran's I (LISA) for PC1")

8 Adding HoverNet Nuclei Features

You can import HoVerNet segmentation results as a SpatialExperiment or SpatialFeatureExperiment.

In this section we want show you how to integrate HoVerNet classification and segmentation output with Prov-GigaPath embeddings.

# import HoVerNet
hov_file <- paste0(
    "https://store.cancerdatasci.org/hovernet/h5ad/",
    "TCGA-23-1021-01Z-00-DX1.F07C221B-D401-47A5-9519-10DE59CA1E9D.h5ad.gz"
)

hn_spe <- HoverNet(hov_file, outClass = "SpatialExperiment") |>
    import()

# import Prov-GigaPath
tile_prov_url <- paste0(
    "https://store.cancerdatasci.org/provgigapath/tile_level/",
    "TCGA-23-1021-01Z-00-DX1.F07C221B-D401-47A5-9519-10DE59CA1E9D.csv.gz"
)

pg_spe<- ProvGiga(tile_prov_url) |>
    import()
# Extract cell coordinates from HoVerNet
cell_coords <- spatialCoords(hn_spe)

# Extract nuclei metadata 
cell_meta <- colData(hn_spe)
cell_meta$x <-cell_coords[,1]
cell_meta$y <-cell_coords[,2]

9 Visualizing Hovernet nuclei vs tile coordinates to see that they do not match

perfectly. You can use matchHoverNetToTiles to compute the scaling factor.

plot(cell_meta$x, cell_meta$y, pch=16, col="#0000FF20")
points(pca_example_slide$tile_x, 
        pca_example_slide$tile_y, 
        pch=16, 
        col="#FF000020")

10 Scale factor between nuclei coordinates and tile coordinates

match_hv_pg <- matchHoverNetToTiles(hn_spe, pg_spe)
ggplot(match_hv_pg$tiles_with_nuclei, aes(tile_x, tile_y, 
                                    color = cell_type_label, 
                                    size = N)) +
    geom_point(alpha = 0.7) +
    coord_equal() +
    theme_minimal() +
    labs(title = "All HoverNet cell types per tile")
## Warning: Removed 389 rows containing missing values or values outside the scale range
## (`geom_point()`).

ggplot(match_hv_pg$tiles_with_nuclei, aes(tile_x, tile_y, 
                                    color = dominant_cell_type)) +
    geom_point(size = 2) +
    coord_equal() +
    theme_minimal() +
    labs(title = "Per-tile dominant HoverNet cell type")

11 Session Info

sessionInfo()
## R version 4.6.0 alpha (2026-04-05 r89794)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] data.table_1.18.2.1         SpatialExperiment_1.21.0   
##  [3] SingleCellExperiment_1.33.2 SummarizedExperiment_1.41.1
##  [5] Biobase_2.71.0              GenomicRanges_1.63.2       
##  [7] Seqinfo_1.1.0               IRanges_2.45.0             
##  [9] S4Vectors_0.49.2            BiocGenerics_0.57.1        
## [11] generics_0.1.4              MatrixGenerics_1.23.0      
## [13] matrixStats_1.5.0           spdep_1.4-2                
## [15] sf_1.1-0                    spData_2.3.4               
## [17] sfdep_0.2.5                 dplyr_1.2.1                
## [19] ggplot2_4.0.2               imageTCGAutils_0.99.24     
## [21] imageFeatureTCGA_0.99.69    BiocStyle_2.39.0           
## 
## loaded via a namespace (and not attached):
##   [1] DBI_1.3.0              deldir_2.0-4           httr2_1.2.2           
##   [4] s2_1.1.9               anndataR_1.1.2         sandwich_3.1-1        
##   [7] rlang_1.2.0            magrittr_2.0.5         multcomp_1.4-30       
##  [10] otel_0.2.0             e1071_1.7-17           compiler_4.6.0        
##  [13] RSQLite_2.4.6          png_0.1-9              vctrs_0.7.3           
##  [16] pkgconfig_2.0.3        wk_0.9.5               crayon_1.5.3          
##  [19] fastmap_1.2.0          backports_1.5.1        dbplyr_2.5.2          
##  [22] magick_2.9.1           XVector_0.51.0         labeling_0.4.3        
##  [25] rmarkdown_2.31         tzdb_0.5.0             tinytex_0.59          
##  [28] purrr_1.2.2            bit_4.6.0              xfun_0.57             
##  [31] cachem_1.1.0           jsonlite_2.0.0         blob_1.3.0            
##  [34] rhdf5filters_1.23.3    DelayedArray_0.37.1    Rhdf5lib_1.33.6       
##  [37] LearnBayes_2.15.2      parallel_4.6.0         R6_2.6.1              
##  [40] bslib_0.10.0           RColorBrewer_1.1-3     reticulate_1.46.0     
##  [43] boot_1.3-32            jquerylib_0.1.4        Rcpp_1.1.1-1          
##  [46] bookdown_0.46          knitr_1.51             zoo_1.8-15            
##  [49] readr_2.2.0            BiocBaseUtils_1.13.0   splines_4.6.0         
##  [52] igraph_2.2.3           Matrix_1.7-5           tidyselect_1.2.1      
##  [55] dichromat_2.0-0.1      abind_1.4-8            yaml_2.3.12           
##  [58] codetools_0.2-20       curl_7.0.0             rjsoncons_1.3.2       
##  [61] lattice_0.22-9         tibble_3.3.1           withr_3.0.2           
##  [64] S7_0.2.1-1             marginaleffects_0.32.0 coda_0.19-4.1         
##  [67] evaluate_1.0.5         survival_3.8-6         units_1.0-1           
##  [70] proxy_0.4-29           BiocFileCache_3.1.0    pillar_1.11.1         
##  [73] BiocManager_1.30.27    filelock_1.0.3         KernSmooth_2.23-26    
##  [76] dbscan_1.2.4           vroom_1.7.1            sp_2.2-1              
##  [79] hms_1.1.4              scales_1.4.0           class_7.3-23          
##  [82] glue_1.8.1             spatialreg_1.4-3       tools_4.6.0           
##  [85] BiocIO_1.21.0          mvtnorm_1.3-7          rhdf5_2.55.16         
##  [88] grid_4.6.0             nlme_3.1-169           TENxIO_1.13.4         
##  [91] cli_3.6.6              rappdirs_0.3.4         viridisLite_0.4.3     
##  [94] S4Arrays_1.11.1        gtable_0.3.6           sass_0.4.10           
##  [97] digest_0.6.39          classInt_0.4-11        TH.data_1.1-5         
## [100] SparseArray_1.11.13    rjson_0.2.23           farver_2.1.2          
## [103] memoise_2.0.1          htmltools_0.5.9        lifecycle_1.0.5       
## [106] MASS_7.3-65            bit64_4.6.0-1

References

Graham, Simon, Quoc Dang Vu, Shan E Ahmed Raza, Ayesha Azam, Yee Wah Tsang, Jin Tae Kwak, and Nasir Rajpoot. 2019. “Hover-Net: Simultaneous Segmentation and Classification of Nuclei in Multi-Tissue Histology Images.” Medical Image Analysis 58: 101563. https://doi.org/https://doi.org/10.1016/j.media.2019.101563.

Lu, Ming Y, Bowen Chen, Drew FK Williamson, Richard J Chen, Ivy Liang, Tong Ding, Guillaume Jaume, et al. 2024. “A Visual-Language Foundation Model for Computational Pathology.” Nature Medicine 30 (3): 863–74. https://doi.org/10.1038/s41591-024-02856-4.

Xu, Hanwen, Naoto Usuyama, Jaspreet Bagga, Sheng Zhang, Rajesh Rao, Tristan Naumann, Cliff Wong, et al. 2024. “A Whole-Slide Foundation Model for Digital Pathology from Real-World Data.” Nature 630 (8015): 181–88. https://doi.org/10.1038/s41586-024-07441-w.