Logistic Factor Analysis (LFA) (Hao, Song, and Storey 2016). Briefly, LFA fits a latent variable model on categorical (i.e. SNP genotypes coded as 0, 1, and 2) data by modeling the logit transformed binomial parameters in terms of latent variables. The resulting “logistic factors” are analagous to principal components, but fit into a convenient likelihood based model. As a result, the logistic factors can power a number of other analyses.
We include a sample real dataset with the package as the variable hgdp_subset—a small subset of the HGDP genotypes. The row names are the rsids for the SNPs and the column names are coarse geographical labels for the individuals.
lfaThe lfa function has two required arguments. The first is the genotype matrix, and the second is the number of logistic factors including the intercept.
LF <- lfa(hgdp_subset, 4)
dim(LF)
#> [1] 159 4
head(LF)
#> [,1] [,2] [,3] [,4]
#> [1,] -0.03635636 0.10431465 -0.026157330 1
#> [2,] -0.04618695 0.10283828 -0.014107285 1
#> [3,] -0.02800495 0.09071295 -0.030261474 1
#> [4,] -0.03513497 0.08332765 -0.002122198 1
#> [5,] -0.03064114 0.07938743 -0.021696325 1
#> [6,] -0.02442314 0.06953981 -0.005312549 1We can plot the first two logistic factors and color by geographical information:
dat <- data.frame(LF[,1], LF[,2], colnames(hgdp_subset))
colnames(dat) = c("LF1", "LF2", "geo")
library(ggplot2)
ggplot(dat, aes(LF1, LF2, color=geo)) + geom_point() + theme_bw() +
coord_fixed(ratio=(max(dat[,1])-min(dat[,1]))/(max(dat[,2])-min(dat[,2])))One aspect of lfa is that the return value is a matrix of logistic factors, thus, an important part of subsequent analysis is to keep your matrix of logistic factors to pass as an argument.
afGiven a genotype matrix and logistic factors, the af function computes the individual-specific allele frequencies
allele_freqs <- af(hgdp_subset, LF)
allele_freqs[1:5, 1:5]
#> [,1] [,2] [,3] [,4] [,5]
#> rs4050954 0.6799538 0.7061143 0.6424166 0.6716247 0.6400715
#> rs302665 0.2560284 0.2569022 0.2406448 0.2457005 0.2309623
#> rs2899446 0.3431056 0.3539195 0.3759817 0.4283358 0.4124516
#> rs10069940 0.5453350 0.5380593 0.5287222 0.5023167 0.5093611
#> rs6114921 0.3944875 0.4466875 0.3337968 0.3834435 0.3344592Since the calculation is independent at each locus, you can pass a subset of the genotype matrix as an argument if you aren’t interested in all the SNPs.
subset <- af(hgdp_subset[15:25,], LF)
subset[1:5,1:5]
#> [,1] [,2] [,3] [,4] [,5]
#> rs409169 0.17045449 0.14658345 0.19092932 0.1303421 0.17978531
#> rs1530933 0.03441206 0.03465226 0.03969294 0.0412471 0.04427358
#> rs1007833 0.53811318 0.55250219 0.53989070 0.5674498 0.55458834
#> rs2328383 0.46839267 0.48418147 0.42004022 0.4419977 0.40006848
#> rs12637969 0.74369515 0.75969674 0.71407465 0.7317521 0.70772691Given the allele frequencies, you can do some other interesting calculations—for example, compute the log-likelihood for each SNP.
The best way to load genotypes is by using the function read_plink, from the genio package, which assumes that you have binary PLINK formatted genotypes. The binary PLINK format uses files: a .bed for the genotypes, a .bim for the genotype information, and a .fam for the individuals information. read_plink takes as an argument the prefix for your three files. A BEDMatrix object (from the eponymous function and package) is also supported, and can result in reduced memory usage (at a small runtime penalty).
Hao, Wei, Minsun Song, and John D. Storey. 2016. “Probabilistic Models of Genetic Variation in Structured Populations Applied to Global Human Studies.” Bioinformatics 32 (5): 713–21. https://doi.org/10.1093/bioinformatics/btv641.