eQTL tutorial, Bioc 2014
VJ Carey stvjc at channing dot harvard dot edu
Basic data from RECOUNT
We will use transcript profiles obtained through RNA-seq applied
to HapMap cell lines (Montgomery, Sammeth, Gutierrez-Arcelus, Lach, Ingle, Nisbett, Guigo, and Dermitzakis (2010)). The reported
gene-level count data have been rendered as an ExpressionSet (montpick\_eset.RData) in the RECOUNT project
(Frazee, Langmead, and Leek (2011)) using the ENSEMBL nomenclature
for genes. These counts were filtered to
exclude genes with uniform zero value over all samples, subjected to
rlog transformation (DESeq2), and then filtered to include only
those with MAD at the 40th percentile of the overall distribution of genewise
MADs.
Characteristic code for the transformation to SummarizedExperiment is given here.
You do not need to execute this as the transformed data are available for loading
as noted below. (exs2se() is a bit of custom code available in the source
of this vignette.)
```r
library(Biobase)
load("/data/eQTL2014/montpick_eset.RData")
mp = montpick.eset
annotation(mp) = "org.Hs.eg.db"
tmp = select(org.Hs.eg.db, keytype="ENSEMBL", columns="ENTREZID",
keys=featureNames(mp))
ezs = split(tmp[,2], tmp[,1])
ezs = sapply(ezs, "[", 1)
drop = which(duplicated(ezs) | is.na(ezs))
if (length(drop)>0) ezs = ezs[-drop]
mp = mp[names(ezs)]
featureNames(mp) = as.character(ezs[featureNames(mp)])
mpSE = exs2se(mp, annoDb = org.Hs.eg.db, probekeytype="ENTREZID")
```
Transformed expression data in SummarizedExperiment
We obtain the basic image of transformed count data as follows.
```r
load("/data/eQTL2014/rmpDEPF.rda") # rlogged Montgomery Pickrell DESeqDataSet, some positive, filtered
rmpDEPF
```
```
## class: SummarizedExperiment
## dim: 7054 129
## exptData(1): MIAME
## assays(1): ''
## rownames(7054): 8813 57147 ... 56099 56104
## rowData metadata column names(6): baseMean baseVar ... dispFit
## rlogIntercept
## colnames(129): NA06985 NA06986 ... NA19239 NA19257
## colData names(5): sample.id num.tech.reps population study
## sizeFactor
```
Genotype data in an Amazon S3 bucket
All the data generated in the 1000 genomes project is available
in a S3 (simple storage service) container. We have a
URL that can be used to access the genotypes from
chr22 in VCF format.
```r
load("/data/eQTL2014/s3url.rda")
s3url
```
```
## [1] "http://1000genomes.s3.amazonaws.com/release/20110521/ALL.chr22.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz"
```
```r
library(VariantAnnotation)
```
```r
head = scanVcfHeader(s3url)
str(head, max.level=4)
```
```
## Formal class 'VCFHeader' [package "VariantAnnotation"] with 3 slots
## ..@ reference: chr(0)
## ..@ samples : chr [1:1092] "HG00096" "HG00097" "HG00099" "HG00100" ...
## ..@ header :Formal class 'SimpleDataFrameList' [package "IRanges"] with 4 slots
## .. .. ..@ elementType : chr "DataFrame"
## .. .. ..@ elementMetadata: NULL
## .. .. ..@ metadata : list()
## .. .. ..@ listData :List of 4
```
A naive analysis
The GGtools package includes a function, cisAssoc(), that
can test additive genetic associations in cis between SNP genotypes
housed in tabix-indexed VCF and expression data housed in a
SummarizedExperiment. The test is the score test for a generalized
linear model fit to independent
samples with a continuous response, with response variance approximately
independent of response mean. These assumptions are generally not met
for count data from RNA-seq experiments, but they may hold reasonably
approximately after
the rlog transformation. We will check on the severity
of the violations
after performing some tests.
Here is code that computes, for 200 genes on
chr17, association test statistics
and versions thereof under permutation of expression against
genotype. Do not perform these calculations, they will take too
long for the tutorial. The answer is provided in the
tutorial data, see below.
```r
library(GGtools)
```
```
## Loading required package: GGBase
## Loading required package: snpStats
## Loading required package: survival
## Loading required package: splines
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following object is masked from 'package:VariantAnnotation':
##
## expand
##
## The following object is masked from 'package:IRanges':
##
## expand
##
## Loading required package: data.table
## data.table 1.9.2 For help type: help("data.table")
## KernSmooth 2.23 loaded
## Copyright M. P. Wand 1997-2009
##
## Attaching package: 'GGtools'
##
## The following object is masked from 'package:stats':
##
## getCall
```
```r
library(foreach)
library(doParallel)
registerDoSEQ()
r17 = rmpDEPF[ which(seqnames(rmpDEPF)=="chr17"),]
set.seed(1234)
s3urlb = sub("22", "17", s3url)
c17 = cisAssoc(r17, TabixFile(s3urlb), cisradius=100000, lbmaf=.025)
c17d = addmindist(c17)
```
Here is an extract from the output. We have manually
added a column 'mindist' with the number of nucleotides
from SNP to nearest end of gene body. 'mindist' is zero
for SNP lying within the gene body.
```r
c17d[1:3,1:5]
```
```
## GRanges with 3 ranges and 5 metadata columns:
## seqnames ranges strand | paramRangeID REF
## |
## [1] 17 [36926181, 36926181] * | 3927 T
## [2] 17 [36926303, 36926303] * | 3927 C
## [3] 17 [36926384, 36926384] * | 3927 C
## ALT chisq permScore_1
##
## [1] A 2.9869 0.51171
## [2] T 4.3823 1.01236
## [3] T 0.4235 0.05441
## ---
## seqlengths:
## 17
## NA
```
```r
names(mcols(c17d))
```
```
## [1] "paramRangeID" "REF" "ALT" "chisq"
## [5] "permScore_1" "permScore_2" "permScore_3" "snp"
## [9] "MAF" "probeid" "score" "fdr"
## [13] "mindist"
```
Computing the plug-in FDR
The following algorithm is from The
Elements of Statistical Learning by
Hastie, Tibshirani and Friedman (Springer 2001).
We can obtain estimated FDR for the gene-SNP pairs
tested in the example above, as follows, using the
pifdr() function in GGtools.
```r
fdr = pifdr( c17d$chisq, c(c17d$permScore_1,
c17d$permScore_2, c17d$permScore_3) )
sum(fdr == 0)
```
```
## [1] 430
```
```r
sum(fdr <= .05)
```
```
## [1] 5292
```
Sensitivity analysis, and "what to count?"
The following computations generate a sensitivity
analysis display, confronting the enumeration of
significantly associated SNP-gene pairs.
```r
library(data.table)
library(foreach)
```
```
## foreach: simple, scalable parallel programming from Revolution Analytics
## Use Revolution R for scalability, fault tolerance and more.
## http://www.revolutionanalytics.com
```
```r
registerDoSEQ()
dt1 = data.table(as.data.frame(mcols(c17d)))
```
```r
s1 = eqsens_dt( dt1 )
```
```r
plotsens(s1, ylab="Count of significant SNP-gene pairs at given FDR")
```
```
## Loading required package: reshape2
## Loading required package: ggplot2
```
![plot of chunk dopl](figure/dopl.png)
If instead of counting SNP-gene pairs, we wish to
count genes that show evidence of association with
some genotype, the plug-in FDR procedure can be
used with maximal association scores per gene, for
observed and permuted tests.
```r
s2 = eqsens_dt( dt1, by="probes" )
```
```r
plotsens(s2, ylab="Count of genes with evidence of eQTL at given FDR")
```
![plot of chunk pl2](figure/pl2.png)
There is an indication here that we get a larger yield if
we filter more sharply on MAF and distance than we did in the
initial run.
Exercise 1.
How can we compute the FDR for the gene-wise hypotheses
H0g: Mean expression of gene g is independent of SNP
content for all SNP cis to g, under the restrictions
cis radius less than 50kb and MAF greater than 5 percent?
Visualization
The following code creates a visualization of genotype-expression
association for a selected
SNP-gene pair.
```r
plotOne = function (summex, vcf.tf, whicha, whichv, genome="hg19", ... ) {
sampidsInSumm = colnames(summex)
sampidsInVCF = sampsInVCF(vcf.tf)
oksamp = intersect(sampidsInSumm, sampidsInVCF)
stopifnot(length(oksamp) > 0)
summex = summex[, oksamp]
vp = ScanVcfParam(fixed = "ALT", info = NA, geno = "GT",
samples = oksamp, which = whichv)
vdata = readVcf(vcf.tf, genome = genome, param = vp)
rdd = rowData(vdata)
vdata = GGtools:::snvsOnly(vdata)
gtdata = factor(geno(vdata)[[1]])
plot( as.numeric(assays(summex)[[1]][whicha,]) ~ gtdata , ...)
}
r17 = rmpDEPF[ which(as.character(seqnames(rmpDEPF))=="chr17"),]
s3urlb = sub("22", "17", s3url)
plotOne(r17, TabixFile(s3urlb), 1, c17d[1] )
```
![plot of chunk doonePlot](figure/doonePlot.png)
Exercise 2.
What are the names of the SNP
and gene depicted here? Replot, annotating with this
information (improve the x and y labels for maximal
clarity of interpretation).
Exercise 3.
What is the FDR for the association shown in the display?
We can show four relationships at once fairly simply.
```r
opar = par(no.readonly=TRUE)
par(mfrow=c(2,2))
for (i in 2:5)
plotOne(r17, TabixFile(s3urlb), i, c17d[i] )
```
![plot of chunk dofff](figure/dofff.png)
```r
par(opar)
```
Exercise 4.
Depict the associations for four gene-SNP pairs
with very small FDR. The four genes should be different.
Hint:
```r
order(fdr)[1:10]
c17d[7659]
args(plotOne)
plotOne(r17, TabixFile(s3urlb), "114881", c17d[7659] )
```
Exercise 5: Master regulators.
Find SNP whose FDR for association with multiple distinct genes is
found to be less than one percent. Visualize the relationships.
Connection with "GWAS hits"
We can acquire an image of the current NHGRI GWAS catalog
as follows.
```r
library(gwascat)
curgw = makeCurrentGwascat()
curgw
```
HOWEVER -- as of 28 July 2014, the coordinates used for the
current catalog are GRCh38. We used rtracklayer's liftOver
with the appropriate chain file to obtain an hg19 addressing scheme,
losing a small number of nonmappable loci.
```r
load("/data/eQTL2014/curgwHG19.rda")
curgwHG19[1:3,1:5]
```
```
## GRanges with 3 ranges and 5 metadata columns:
## seqnames ranges strand | Date.Added.to.Catalog
## |
## 1 chr19 [ 7739177, 7739177] * | 04/16/2014
## 2 chr2 [191113009, 191113009] * | 07/26/2014
## 3 chr6 [ 31197514, 31197514] * | 07/26/2014
## PUBMEDID First.Author Date Journal
##
## 1 24123702 Chung CM 03/03/2014 Diabetes Metab Res Rev
## 2 24325914 Chu M 12/09/2013 Carcinogenesis
## 3 24324648 Yatagai Y 12/04/2013 PLoS One
## ---
## seqlengths:
## chr19 chr16 chr1 chr17 chr21 chr13 ... chr15 chr9 chr12 chr14 chrX
## NA NA NA NA NA NA ... NA NA NA NA NA
```
Exercise 6
How frequent are exact coincidences between SNP exhibiting
eQTL association at FDR five percent or
less, and GWAS hits? Compare this frequency as computed
for SNP exhibiting association at FDR 95 percent or greater.
Further exercises
Exercise 6: Population of origin.
We have disregarded the population of origin in
this analysis. Use visualization to assess whether the
association between gene OSBPL7 and SNP rs17774008
has similar forms in subjects from CEU and YRI populations.
You can use s3url to get access to genotype data for chr22.
It takes 5 minutes or so to run cisAssoc on chr22 genes and SNP.
Obtain analyses for CEU and YRI groups separately. What
is the concordance on significance of SNP-gene pairs at FDR five percent or so?
Exercise 7: Efficient analysis with counted expression values.
/data/eQTL2014/montpick_eset.RData has the raw counts.
(Note, probe identifiers are in ENSEMBL vocabulary.) Use DESeq2 or edgeR to
obtain an FDR for a selected SNP-gene pair. Write a function
that does this for any given selection. In principle the mean-variance
relationship needs to be modeled separately for each gene-SNP pair,
so obtaining a statistically optimal solution for a
general search may be laborious. It would be nice to understand
the costs of using rlog with blind and fast set to TRUE
before doing a Gaussian-like analysis of eQTL, as opposed to
pair-specific optimally modeled negative binomial testing for eQTL.
Bibliography
[1] A. C. Frazee, B. Langmead and J. T. Leek. "ReCount: a
multi-experiment resource of analysis-ready RNA-seq gene count
datasets.". In: _BMC bioinformatics_ 12.1 (Jan. 2011), p. 449.
ISSN: 1471-2105. DOI: 10.1186/1471-2105-12-449. .
[2] S. B. Montgomery, M. Sammeth, M. Gutierrez-Arcelus, et al.
"Transcriptome genetics using second generation sequencing in a
Caucasian population.". In: _Nature_ 464.7289 (Apr. 2010), pp.
773-7. ISSN: 1476-4687. DOI: 10.1038/nature08903. .