library(cellmig)
library(ggplot2)
library(ggforce)
library(patchwork)
library(rstan)
ggplot2::theme_set(new = theme_bw(base_size = 10))
This vignette demonstrates how to use cellmig to simulate cell migration speed data under partially and fully generative models.
The partially generative model simulates cell migration speed from a hierarchical Bayesian model implemented in Stan, where some model parameters (e.g., derived from experimental data) are fixed by the user and the remaining parameters are drawn from their prior distributions.
The fully generative model simulates cell migration speed from a hierarchical Bayesian model implemented in Stan, where model parameters are drawn from their prior distributions.
These simulation functionalities are useful for validating the model, exploring parameter variability, and guiding experimental design and power analysis.
We simulate data for an experiment with the following design:
Six biological replicates (plates), indexed by \(p = 1, 2, \dots, 6\).
Three technical replicates (wells) per treatment per plate, indexed by \(t = 1, 2, 3\).
30 cells per well.
Seven treatment groups (\(t = \{1, \dots, 7\}\)) with fixed effects \(\delta_t = [-0.4, -0.2, -0.1, 0, 0.1, 0.2, 0.4]\).
Biological replicate variability: \(\sigma_{\text{bio}} = 0.1\).
Technical replicate variability: \(\sigma_{\text{tech}} = 0.05\).
Plate-specific batch effects are modeled by drawing six (one for each plate \(p\)) values (\(\alpha_p\)) from a normal distribution with mean \(\mu=\exp(-0.5)\approx 0.6\) (micrometers/min) and the standard deviation \(\sigma=0.1\). Hence, \(\alpha_p\) represents the plate-specific mean speed (on log-scale) of the control treatment (offset). To simulate minimal plate-specific batch effects, one can select a lower \(\sigma\) (e.g. 0.01).
\[ \alpha_p \sim \text{Normal}(-0.5, 0.1) \]
To correct for batch effects across plates, we designate one treatment as the reference (offset). Here, we set the offset to treatment 4 (i.e., the fourth treatment group, which has \(\delta_t = 0\)).
The speed in each well are gamma distributed, and the gamma distribution is defined by a rate and shape parameter. The shape parameter for well \(w\), \(\kappa_w\), controls the form or skewness of the distribution and, to some extent, the variability in the data. Larger \(\kappa\) is associated with lower degree of skewness (less variable data). Empirically we observe \(\kappa_w\approx 5\) \(\rightarrow\) \(\log(\kappa_w)\approx 1.5\)
We simulate the values of \(\log(\kappa_w)\) by drawing randomly from a
normal distribution with mean (\(\mu_{\kappa}\)) and standard deviation
(\({\sigma}_{\kappa}\)). The values of \(\mu_{\kappa}\) and
\({\sigma}_{\kappa}\) can either be fixed to point values, or be drawn
from two normal distributions whose means and standard deviations are
described by the inputs prior_kappa_mu_M, prior_kappa_sigma_M,
prior_kappa_mu_SD, and prior_kappa_sigma_SD. To fix the values of
\(\mu_{\kappa}\) and \({\sigma}_{\kappa}\) to point estimates set the scales
to small values (e.g., set in control prior_kappa_mu_SD=0.01 and
prior_kappa_sigma_SD).
We use the gen_partial function from cellmig to simulate the
data. The function requires a control list specifying the experimental design
and model parameters.
# set seed for reproducible results
set.seed(seed = 1253)
# simulate data
y_p <- gen_partial(control = list(N_biorep = 8,
N_techrep = 5,
N_cell = 50,
delta = c(-0.4, -0.2, -0.1, 0, 0.1, 0.2, 0.4),
sigma_bio = 0.1,
sigma_tech = 0.05,
offset = 4,
prior_alpha_p_M = -0.5,
prior_alpha_p_SD = 0.1,
prior_kappa_mu_M = 1.5,
prior_kappa_mu_SD = 0.1,
prior_kappa_sigma_M = 0,
prior_kappa_sigma_SD = 0.1))
# see data structure
str(y_p$y)
FALSE 'data.frame': 14000 obs. of 6 variables:
FALSE $ v : num 0.343 0.685 0.484 1.303 0.479 ...
FALSE $ well : int 1 1 1 1 1 1 1 1 1 1 ...
FALSE $ plate : int 1 1 1 1 1 1 1 1 1 1 ...
FALSE $ group : int 1 1 1 1 1 1 1 1 1 1 ...
FALSE $ compound: int 1 1 1 1 1 1 1 1 1 1 ...
FALSE $ dose : chr "X" "X" "X" "X" ...
The output is a dataframe with columns:
v: Cell speed
well: Well identifier
plate: Plate identifier
group: Treatment group
compound: Compound identifier (same as group in this simulation)
dose: Dose level (not used in this simulation, set to “X”)
Visualizing the simulated data
We can visualize the simulated cell speeds using scatter plots, with points jittered by well and colored by plate.
ggplot(data = y_p$y)+
geom_sina(aes(x = paste0("t=", group), col = paste0("p=", plate),
y = v, group = well), size = 0.5)+
theme_bw()+
theme(legend.position = "top",
strip.text.x = element_text(margin = margin(0.03,0,0.03,0, "cm")))+
xlab(label = "dose")+
ylab(label = "migration speed")+
guides(color = guide_legend(override.aes = list(size = 3)))+
scale_color_grey(name = "plate")+
scale_y_log10()
Alternatively, we can visualize the well-specific mean speeds to highlight plate-specific batch effects:
ggplot(data = aggregate(v~well+group+plate, data = y_p$y, FUN = mean))+
geom_sina(aes(x = paste0("t=", group), col = paste0("p=", plate),
y = v, group = well), size = 1)+
theme_bw()+
theme(legend.position = "top",
strip.text.x = element_text(margin = margin(0.03,0,0.03,0, "cm")))+
xlab(label = "dose")+
ylab(label = "migration speed")+
guides(color = guide_legend(override.aes = list(size = 3)))+
scale_y_log10()+
scale_color_grey(name = "plate")
The simulated data must be formatted to match the input requirements of
cellmig. Specifically, we need to convert some columns to
character and add an offset column indicating the reference treatment.
sim_data <- y_p$y
# format simulated data to be used as input in cellmig
sim_data$well <- as.character(sim_data$well)
sim_data$compound <- as.character(sim_data$compound)
sim_data$plate <- as.character(sim_data$plate)
sim_data$offset <- 0
sim_data$offset[sim_data$group=="4"] <- 1
We now analyze the simulated data using cellmig with the following control parameters:
osd <- cellmig(x = sim_data,
control = list(mcmc_warmup = 250,
mcmc_steps = 800,
mcmc_chains = 2,
mcmc_cores = 2,
mcmc_algorithm = "NUTS",
adapt_delta = 0.8,
max_treedepth = 10))
We compare the inferred parameters (posterior means and 95% HDIs) with the true values used in the simulation.
ggplot(data = osd$posteriors$alpha_p)+
geom_point(aes(y = plate_id, x = exp(mean)))+
geom_errorbarh(aes(y = plate_id, x = exp(mean), xmin = exp(X2.5.),
xmax = exp(X97.5.)), height = 0.1)+
geom_point(data = data.frame(v = exp(y_p$par$alpha_p),
plate = osd$posteriors$alpha_p$plate_id),
aes(y = plate, x = v), col = "red")+
scale_y_continuous(breaks = osd$posteriors$alpha_p$plate_id)+
xlab(label = expression(alpha[p]*"'"))
ggplot(data = osd$posteriors$delta_t)+
geom_point(aes(y = group_id, x = mean))+
geom_errorbarh(aes(y = group_id, x = mean, xmin = X2.5., xmax = X97.5.),
height = 0.1)+
geom_point(data = data.frame(delta = c(-0.4, -0.2, -0.1, 0.1, 0.2, 0.4),
group_id = 1:6),
aes(y = group_id, x = delta), col = "red")+
xlab(label = expression(delta[t]))+
ylab(label = "treatment")+
scale_y_continuous(breaks = 1:8)
plot(osd$fit, par = c("sigma_bio", "sigma_tech"))
We can also use cellmig to generate cell speeds from the prior distributions of the parameters \(\rightarrow\) data from prior predictive distribution.
In this scenario, the parameters, such as \(\delta_t\), will not be fixed to point values but rather generated from their priors.
control <- list(N_biorep = 4,
N_techrep = 3,
N_cell = 30,
N_group = 8,
prior_alpha_p_M = -0.5,
prior_alpha_p_SD = 1.0,
prior_kappa_mu_M = 1.5,
prior_kappa_mu_SD = 1.0,
prior_kappa_sigma_M = 0,
prior_kappa_sigma_SD = 1.0,
prior_sigma_bio_M = 0.0,
prior_sigma_bio_SD = 1.0,
prior_sigma_tech_M = 0.0,
prior_sigma_tech_SD = 1.0,
prior_delta_t_M = 0.0,
prior_delta_t_SD = 1.0)
# generate data
y_f <- gen_full(control = control)
FALSE Chain 1:
FALSE Chain 1: Elapsed Time: 0 seconds (Warm-up)
FALSE Chain 1: 0.002 seconds (Sampling)
FALSE Chain 1: 0.002 seconds (Total)
FALSE Chain 1:
The generate data has a similar structure as before:
str(y_f)
FALSE List of 2
FALSE $ y :'data.frame': 2880 obs. of 8 variables:
FALSE ..$ i : int [1:2880] 1 2 3 4 5 6 7 8 9 10 ...
FALSE ..$ v : num [1:2880] 6.25e-01 1.87e-07 1.91e+01 4.80e-02 1.45 ...
FALSE ..$ well : chr [1:2880] "1" "1" "1" "1" ...
FALSE ..$ plate : chr [1:2880] "1" "1" "1" "1" ...
FALSE ..$ group : chr [1:2880] "1" "1" "1" "1" ...
FALSE ..$ compound: chr [1:2880] "1" "1" "1" "1" ...
FALSE ..$ dose : chr [1:2880] "X" "X" "X" "X" ...
FALSE ..$ trep_id : int [1:2880] 1 1 1 1 1 1 1 1 1 1 ...
FALSE $ control:List of 19
FALSE ..$ N_biorep : num 4
FALSE ..$ N_techrep : num 3
FALSE ..$ N_cell : num 30
FALSE ..$ N_group : num 8
FALSE ..$ prior_alpha_p_M : num -0.5
FALSE ..$ prior_alpha_p_SD : num 1
FALSE ..$ prior_kappa_mu_M : num 1.5
FALSE ..$ prior_kappa_mu_SD : num 1
FALSE ..$ prior_kappa_sigma_M : num 0
FALSE ..$ prior_kappa_sigma_SD: num 1
FALSE ..$ prior_sigma_bio_M : num 0
FALSE ..$ prior_sigma_bio_SD : num 1
FALSE ..$ prior_sigma_tech_M : num 0
FALSE ..$ prior_sigma_tech_SD : num 1
FALSE ..$ prior_delta_t_M : num 0
FALSE ..$ prior_delta_t_SD : num 1
FALSE ..$ offset : num 1
FALSE ..$ N_plate : num 4
FALSE ..$ N_well_reps : num 3
w <- data.frame(v_f = y_f$y$v[1:2000], v_p = y_p$y$v[1:2000])
ggplot(data = w)+
geom_point(aes(x = v_f, y = v_p), size = 0.5)+
geom_density_2d(aes(x = v_f, y = v_p), col = "orange")+
scale_x_log10(name = "Speed from fully generative model",
limits = c(0.01, 10^4))+
scale_y_log10(name = "Speed from partially generative model",
limits = c(0.01, 10^4))+
annotation_logticks(base = 10, sides = "lb")+
theme_bw(base_size = 10)
FALSE Warning: Removed 104 rows containing non-finite outside the scale range
FALSE (`stat_density2d()`).
FALSE Warning: Removed 104 rows containing missing values or values outside the scale range
FALSE (`geom_point()`).
How many biological replicates do we need to capture certain effect size (\(\delta\))? Similar question can be asked in order to optimize the number of technical replicates or the number of cells per well.
The code below takes considerable time to be executed. This is because we generate N_biorep times N_sim datasets and fit a model. To avoid this during R-package checks, the following chunks will not be evaluated. Do this yourselves.
Important note: using multicore execution you can considerably speed-up this analysis!
# Constants
N_bioreps <- c(3, 6, 9)
N_sim <- 10
true_deltas <- c(-0.3, -0.15, 0, 0.2, 0.4)
offset <- 3
# power analysis
deltas <- vector(mode = "list", length = length(N_bioreps)*N_sim)
i <- 1
for(N_biorep in N_bioreps) {
for(b in 1:N_sim) {
# simulate data
y_p <- gen_partial(control = list(N_biorep = N_biorep,
N_techrep = 3,
N_cell = 40,
delta = true_deltas,
sigma_bio = 0.1,
sigma_tech = 0.05,
offset = offset,
prior_alpha_p_M = -0.5,
prior_alpha_p_SD = 0.1,
prior_kappa_mu_M = 1.5,
prior_kappa_mu_SD = 0.1,
prior_kappa_sigma_M = 0,
prior_kappa_sigma_SD = 0.1))
# format simulated data to be used as input in cellmig
sim_data <- y_p$y
sim_data$well <- as.character(sim_data$well)
sim_data$compound <- as.character(sim_data$compound)
sim_data$plate <- as.character(sim_data$plate)
sim_data$offset <- 0
sim_data$offset[sim_data$group==offset] <- 1
# run cellmig
o <- cellmig(x = sim_data,
control = list(mcmc_warmup = 300,
mcmc_steps = 1000,
mcmc_chains = 1,
mcmc_cores = 1,
mcmc_algorithm = "NUTS",
adapt_delta = 0.8,
max_treedepth = 10))
delta <- o$posteriors$delta_t
delta$b <- b
delta$N_biorep <- N_biorep
delta$true_deltas <- true_deltas[-offset]
delta$TP <- (delta$X2.5. <= delta$true_deltas &
delta$X97.5. >= delta$true_deltas) &
!(delta$X2.5. <= 0 & delta$X97.5. >= 0)
deltas[[i]] <- delta
i <- i + 1
}
}
rm(i, delta, o, sim_data, y_p, b, N_biorep, offset, true_deltas)
deltas <- do.call(rbind, deltas)
What do you see? For Large effect (\(|\delta_t|>>0\)) we need few biological replicates to capture the true effect and not the null effect (\(\delta_t=0\)), i.e., in nearly all simulations we have a true positive. For smaller effects, this is not the case, and we probably need many more biological replicates to each a high true positive rate.
ggplot(data = aggregate(TP~N_biorep+true_deltas, data = deltas, FUN = sum))+
geom_point(aes(x = N_biorep, y = TP, col = abs(true_deltas),
group = as.factor(true_deltas)), size = 2, alpha = 0.5)+
geom_line(aes(x = N_biorep, y = TP, col = abs(true_deltas),
group = as.factor(true_deltas)), alpha = 0.5)+
ylab(label = "Number of true positives (TPs)")+
scale_color_distiller(name = expression("|"*delta[t]*"|"),palette="Spectral")+
theme_bw(base_size = 10)
sessionInfo()
FALSE R Under development (unstable) (2026-01-15 r89304)
FALSE Platform: x86_64-pc-linux-gnu
FALSE Running under: Ubuntu 24.04.3 LTS
FALSE
FALSE Matrix products: default
FALSE BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so
FALSE LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
FALSE
FALSE locale:
FALSE [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
FALSE [3] LC_TIME=en_GB LC_COLLATE=C
FALSE [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
FALSE [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
FALSE [9] LC_ADDRESS=C LC_TELEPHONE=C
FALSE [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
FALSE
FALSE time zone: America/New_York
FALSE tzcode source: system (glibc)
FALSE
FALSE attached base packages:
FALSE [1] stats graphics grDevices utils datasets methods base
FALSE
FALSE other attached packages:
FALSE [1] rstan_2.32.7 StanHeaders_2.32.10 patchwork_1.3.2
FALSE [4] ggforce_0.5.0 ggplot2_4.0.1 cellmig_1.1.6
FALSE [7] BiocStyle_2.39.0
FALSE
FALSE loaded via a namespace (and not attached):
FALSE [1] ggiraph_0.9.3 tidyselect_1.2.1 dplyr_1.1.4
FALSE [4] farver_2.1.2 loo_2.9.0 S7_0.2.1
FALSE [7] fastmap_1.2.0 lazyeval_0.2.2 tweenr_2.0.3
FALSE [10] fontquiver_0.2.1 digest_0.6.39 lifecycle_1.0.5
FALSE [13] tidytree_0.4.7 magrittr_2.0.4 compiler_4.6.0
FALSE [16] rlang_1.1.7 sass_0.4.10 tools_4.6.0
FALSE [19] yaml_2.3.12 knitr_1.51 labeling_0.4.3
FALSE [22] htmlwidgets_1.6.4 pkgbuild_1.4.8 curl_7.0.0
FALSE [25] plyr_1.8.9 RColorBrewer_1.1-3 aplot_0.2.9
FALSE [28] withr_3.0.2 purrr_1.2.1 grid_4.6.0
FALSE [31] polyclip_1.10-7 stats4_4.6.0 gdtools_0.4.4
FALSE [34] inline_0.3.21 scales_1.4.0 MASS_7.3-65
FALSE [37] isoband_0.3.0 tinytex_0.58 dichromat_2.0-0.1
FALSE [40] cli_3.6.5 rmarkdown_2.30 treeio_1.35.0
FALSE [43] generics_0.1.4 otel_0.2.0 RcppParallel_5.1.11-1
FALSE [46] ggtree_4.1.1 reshape2_1.4.5 ape_5.8-1
FALSE [49] cachem_1.1.0 stringr_1.6.0 parallel_4.6.0
FALSE [52] ggplotify_0.1.3 BiocManager_1.30.27 matrixStats_1.5.0
FALSE [55] vctrs_0.7.1 yulab.utils_0.2.3 V8_8.0.1
FALSE [58] jsonlite_2.0.0 fontBitstreamVera_0.1.1 bookdown_0.46
FALSE [61] gridGraphics_0.5-1 magick_2.9.0 systemfonts_1.3.1
FALSE [64] tidyr_1.3.2 jquerylib_0.1.4 glue_1.8.0
FALSE [67] codetools_0.2-20 stringi_1.8.7 gtable_0.3.6
FALSE [70] QuickJSR_1.9.0 tibble_3.3.1 pillar_1.11.1
FALSE [73] rappdirs_0.3.4 htmltools_0.5.9 R6_2.6.1
FALSE [76] evaluate_1.0.5 lattice_0.22-7 ggfun_0.2.0
FALSE [79] fontLiberation_0.1.0 bslib_0.10.0 rstantools_2.6.0
FALSE [82] Rcpp_1.1.1 gridExtra_2.3 nlme_3.1-168
FALSE [85] xfun_0.56 fs_1.6.6 pkgconfig_2.0.3