
Student transform
Developed by Gabriel Hoffman
Run on 2022-11-29 14:22:57
Source:vignettes/studentize.Rmd
The goal of a variance stabilizing transform (VST) is to down-weight
imprecise measurements so that downstream analyses focus on signal
rather than being dominated by imprecise observations. VSTs have been
widely used in transcriptomics: the vsn package provides a
VST for microarray data, and the DESeq2 package
provides a VST for RNA-seq using a negative binomial model. Here we
introduce a VST based on crumblr
using a precision-weighted
approximation to the centered log-ratio (CLR) transform of multinomial
Dirichlet counts. This VST is very fast even for large datasets and
stabilizes the variance of count ratio data more effectively than
using fractions or the CLR transform alone. This improves the performance of
downstream analyses such as PCA and clustering.
Examine VST
First, simulate count data:
library(crumblr)
library(cowplot)
# ggplot2 supplies geom_abline(), ggtitle(), xlab() and ylab() used below
library(ggplot2)
set.seed(1)
# set probability of each category
x = rgamma(300, 1, 10)
prob = x / sum(x)
# number of samples
n_samples = 500
# number of counts
nCounts = 3000
# simulate counts from multinomial
counts = t(rmultinom(n_samples, size = nCounts, prob = prob))
colnames(counts) = paste0("cat_", 1:length(prob))
rownames(counts) = paste0("sample_", 1:n_samples)
# keep categories with more than 5 counts in more than 20 samples
keep = colSums(counts > 5) > 20
# compute fractions from counts
# using pseudocount of 0.5
fractions = apply(counts[,keep], 1, function(x){
x = x + 0.5
x / sum(x)
})
# run crumblr on counts
cobj = crumblr(counts[,keep], tau=1)
Apply vst
crumblr() performs the centered log-ratio (CLR) transform and computes the observation-level precision weights. The VST scales the transformed values using the precision weights. Here we see that the VST, computed below with studentize(), is almost linear for sufficiently large CLR values.
df_vst = studentize(cobj)
plotScatterDensity(cobj$E[,1], df_vst[,1]) +
geom_abline(color="red", size=.3) +
ggtitle("crumblr + vst transform") +
xlab("CLR") +
ylab("crumblr + studentize")
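To make the CLR and its precision concrete, here is a minimal base R sketch for a single sample. It assumes a plain multinomial model and uses the standard delta-method approximation for the variance of each CLR value; it is not the crumblr implementation, it does not model the tau argument used above, and the helper name clr_with_variance and its pseudocount argument are hypothetical.

```r
# Sketch only: CLR transform of one sample's counts plus a delta-method
# variance under a plain multinomial model.
# clr_with_variance and pseudocount are hypothetical names, not crumblr API.
clr_with_variance = function(counts, pseudocount = 0.5){
  x = counts + pseudocount   # avoid log(0)
  D = length(x)              # number of categories
  logx = log(x)
  clr = logx - mean(logx)    # centered log-ratio
  # delta-method variance of each CLR value
  v = (1 - 2/D) / x + sum(1/x) / D^2
  list(clr = clr, variance = v, precision = 1/v)
}

res = clr_with_variance(c(2, 10, 50, 250, 1000))
# CLR values sum to zero by construction
sum(res$clr)
# the approximate precision increases with the observed count
res$precision
```

Under this approximation, categories with low counts receive low precision, which is why a transform that uses the precision weights can down-weight them.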
Concordance between samples for each transform
Concordance between two identically distributed samples is shown using (A) fractions, (B) CLR and (C) the VST proposed here. For low counts, the CLR in (B) is highly discordant between the two samples due to imprecise measurement. In (C) the VST down-weights these measurements to improve concordance.
fig1 = plotScatterDensity(fractions[,1], fractions[,2]) +
geom_abline(color="red", size=.3) +
ggtitle("Fractions") +
xlab("Sample 1") +
ylab("Sample 2")
fig2 = plotScatterDensity(cobj$E[,1], cobj$E[,2]) +
geom_abline(color="red", size=.3) +
ggtitle("CLR") +
xlab("Sample 1") +
ylab("Sample 2")
fig3 = plotScatterDensity(df_vst[,1], df_vst[,2]) +
geom_abline(color="red", size=.3) +
ggtitle("crumblr + studentize") +
xlab("Sample 1") +
ylab("Sample 2")
plot_grid(fig1, fig2, fig3, labels=LETTERS[1:3], nrow=1)
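The visual comparison can be complemented by a number. The sketch below computes Pearson and Spearman correlations between two columns of a features-by-samples matrix, and runs on a small simulated dataset so that it is self-contained. The helper name sample_concordance is hypothetical, and correlation is only one possible summary of concordance.

```r
# Sketch only: correlation between two samples (columns) of a
# features-by-samples matrix. sample_concordance is a hypothetical helper.
sample_concordance = function(mat, i = 1, j = 2){
  c(pearson  = cor(mat[,i], mat[,j], method = "pearson"),
    spearman = cor(mat[,i], mat[,j], method = "spearman"))
}

# small simulated example: two samples drawn from the same multinomial
set.seed(1)
p = rgamma(50, 1, 10)
p = p / sum(p)
cnt = rmultinom(2, size = 3000, prob = p) + 0.5   # categories x samples, pseudocount 0.5
frac = sweep(cnt, 2, colSums(cnt), "/")           # fractions per sample
clr = sweep(log(cnt), 2, colMeans(log(cnt)), "-") # CLR per sample

conc_frac = sample_concordance(frac)
conc_clr = sample_concordance(clr)
conc_frac
conc_clr
```

With the objects created in this vignette, the same helper could be applied to fractions, cobj$E and df_vst, since each stores samples in columns.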
Measuring variance stabilization
The variance stabilizing property can be observed empirically. For each feature (e.g. gene or cell type), the standard deviation of the transformed values is plotted against the rank of the mean. The variance is better stabilized when the coefficient of variation (i.e. sd/mean) is smaller. While the CLR transform does provide some variance stabilization compared to using fractions, the VST produces much stronger stabilization.
# Mean vs SD plot
fig1 = meanSdPlot(fractions) + ggtitle("Fractions")
fig2 = meanSdPlot(cobj$E) + ggtitle("CLR")
fig3 = meanSdPlot(df_vst) + ggtitle("crumblr vst")
plot_grid(fig1, fig2, fig3, labels=LETTERS[1:3], nrow=1)
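A mean-SD plot can also be summarized numerically. The following sketch computes the per-feature standard deviation and reports its Spearman correlation with the rank of the mean, together with the ratio of the largest to the smallest SD. The helper name mean_sd_trend is hypothetical, the probabilities are fixed rather than random to keep the example deterministic in structure, and these two summaries are just one way to describe the trend.

```r
# Sketch only: numeric companion to a mean-SD plot.
# Rows are features, columns are samples; mean_sd_trend is a hypothetical helper.
mean_sd_trend = function(mat){
  mu = rowMeans(mat)
  s = apply(mat, 1, sd)
  c(spearman = cor(rank(mu), s, method = "spearman"), # trend of SD with mean rank
    sd_ratio = max(s) / min(s))                       # spread of per-feature SDs
}

set.seed(1)
p = (1:50) / sum(1:50)                               # fixed category probabilities
cnt = rmultinom(200, size = 3000, prob = p) + 0.5    # categories x samples
frac = sweep(cnt, 2, colSums(cnt), "/")
clr = sweep(log(cnt), 2, colMeans(log(cnt)), "-")

trend_frac = mean_sd_trend(frac)
trend_clr = mean_sd_trend(clr)
trend_frac
trend_clr
```

A Spearman value near zero and an SD ratio near one would indicate that the SD no longer depends on the mean. The same helper could be applied to fractions, cobj$E and df_vst from this vignette.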