%\documentclass[a4paper,12pt]{article} \documentclass[12pt]{article} \usepackage{fullpage} % \usepackage{times} %\usepackage{mathptmx} %\renewcommand{\ttdefault}{cmtt} \usepackage{graphicx} \usepackage[pdftex, bookmarks, bookmarksopen, pdfauthor={David Clayton and Chris Wallace}, pdftitle={snpMatrix Vignette}] {hyperref} \title{snpMatrix vignette\\Example of genome-wide association testing} \author{David Clayton and Chris Wallace} \date{\today} \usepackage{Sweave} \SweaveOpts{echo=TRUE, pdf=TRUE, eps=FALSE} \begin{document} \setkeys{Gin}{width=1.0\textwidth} %\VignetteIndexEntry{snpMatrix introduction} %\VignettePackage{snpMatrix} \maketitle \section*{The {\tt snpMatrix} package} The package ``{\tt snpMatrix}'' was written to provide data classes and methods to facilitate the analysis of whole genome association studies in R. In the data classes it implements, each genotype call is stored as a single byte and, at this density, data for single chromosomes derived from large studies and new high-throughput gene chip platforms can be handled in memory by modern PCs and workstations. The object--oriented programming model introduced with version 4 of the S-plus package, usually termed ``S4 methods'' was used to implement these classes. For population-based studies, both quantitative and qualitative phenotypes may be analysed but, at present, rather more limited facilities are available for family--based studies. Flexible functions are provided which can carry out single SNP tests which control for potential confounding by quantitative and qualitative covariates. Tests involving several SNPs taken together as ``tags'' are also supported. Efficient calculation of pair-wise linkage disequilibrium measures is implemented and data input functions include a function which can download data directly from the international HapMap project website. The package was described by Clayton and Leung (2007) {\it Human Heredity}, {\bf 64}: 45--51. Since this publication many new facilities have been introduced. These are explored in further vignettes. \section*{Getting started} We shall start by loading the the packages and the data to be used in the first part of this exercise, which concerns a population--based case--control study: <>= require(snpMatrix) require(hexbin) data(for.exercise) @ In addition to the {\tt snpMatrix} package, we have also loaded the {\tt hexbin} package which reduces file sizes and legibility of plots with very many data points. The data have been created artificially from publicly available datasets. The SNPs have been selected from those genotyped by the International HapMap Project\footnote{\tt http://www.hapmap.org} to represent the typical density found on a whole genome association chip, (the Affymetrix 500K platform\footnote{\tt http://www.affymetrix.com/support/technical/sample\_data/500k\_hapmap\_genotype\_data.affx}) for a moderately sized chromosome (chromosome 10). A (rather too) small study of 500 cases and 500 controls has been simulated allowing for recombination using beta software from Su and Marchini. Re-sampling of cases was weighted in such a way as to simulate three ``causal'' locus on this chromosome, with multiplicative effects of 1.3, 1.4 and 1.5 for each copy of the risk allele at each locus. It should be noted that this is a somewhat optimistic scenario! You have loaded three objects: \begin{enumerate} \item {\tt snps.10}, an object of class ``{\tt snp.matrix}'' containing a matrix of SNP genotype calls. Rows of the matrix correspond to subjects and columns correspond to SNPs: <<>>= show(snps.10) @ \item {\tt snp.support}, a conventional R data frame containing information about the SNPs typed. To see its contents: <<>>= summary(snp.support) @ Row names of this data frame correspond with column names of {\tt snps.10} and comprise the (unique) SNP identifiers. \item {\tt subject.support}, another conventional R data frame containing further information about the subjects. The row names coincide with the row names of {\tt snps.10} and comprise the (unique) subject identifiers. In this simulated study there are only two variables: <<>>= summary(subject.support) @ The variable {\tt cc} identifies cases ({\tt cc=1}) and controls ({\tt cc=0}) while {\tt stratum}, coded 1 or 2, identifies a stratification of the study population --- more on this later. \end{enumerate} In general, analysis of a whole--genome association study will require a subject support data frame, a SNP support data frame for each chromosome, and a SNP data file for each chromosome\footnote{ Support files are usually read in with general tools such as {\tt read.table}. The {\tt snpMatrix} package contains a number of tools for reading SNP genotype data into an object of class ``{\tt snp.matrix}''.}. A short summary of the contents of {\tt snps.10} is provided by the {\tt summary} function. This operation actually produces two ``summaries of summaries''. First, summary statistics are calculated for each row (sample), and their results summarised. Then summary statistics are calculated for each column (SNP) and their results summarised. <<>>= summary(snps.10) @ The row-wise and column-wise summaries are calculated with the functions {\tt row.summary} and {\tt col.summary}. For example, to calculate summary statistics for each SNP (column): <<>>= snpsum <- col.summary(snps.10) summary(snpsum) @ The second command duplicates the latter part of the result of {\tt summary(snps.10)}, and the contents of {\tt snpsum} are fairly self-explanatory. We could look at a couple of summary statistics in more detail: <>= par(mfrow = c(1, 2)) hist(snpsum$MAF) hist(snpsum$z.HWE) @ The latter should represent a $z$-statistic. {\it i.e.} a statistic normally distributed with mean zero and unit standard deviation under the hypothesis of Hardy--Weinberg equilibrium (HWE). Quite clearly there is extreme deviation from HWE, but this can be accounted for by the manner in which this synthetic dataset was created. The function {\tt row.summary} is useful for detecting samples that have genotyped poorly. This calculates call rate and mean heterozygosity across all SNPs for each subject in turn: <>= sample.qc <- row.summary(snps.10) summary(sample.qc) @ (note that the last command yields the same as the first part of {\tt summary(snps.10)}). The plot of heterozygosity against call rate is useful in detecting poor quality samples: <>= par(mfrow = c(1, 1)) plot(sample.qc) @ There is one clear outlier. \section*{The analysis} We'll start by removing the `outlying' sample above (the sample with Heterozygosity near zero): <>= use <- sample.qc$Heterozygosity>0 snps.10 <- snps.10[use, ] subject.support <- subject.support[use, ] @ Then we'll see if there is any difference between call rates for cases and controls. First generate logical arrays for selecting out cases or controls:\footnote{ These commands assume that the subject support frame has the same number of rows as the SNP matrix and that they are in the same order. Otherwise a slightly more complicated derivation is necessary.} <>= if.case <- subject.support$cc == 1 if.control <- subject.support$cc == 0 @ Now we recompute the genotype column summaries separately for cases and controls: <>= sum.cases <- col.summary(snps.10[if.case, ]) sum.controls <- col.summary(snps.10[if.control, ]) @ and plot the call rates, using hexagonal binning and superimposing a line of slope 1 through the origin: <>= hb <- hexbin(sum.controls$Call.rate, sum.cases$Call.rate, xbin=50) sp <- plot(hb) hexVP.abline(sp$plot.vp, 0, 1, col="black") @ There is no obvious difference in call rates. This is not a surprise, since no such difference was built into the simulation. In the same way we could look for differences between allele frequencies, superimposing a line of slope 1 through the origin: <>= sp <- plot(hexbin(sum.controls$MAF, sum.cases$MAF, xbin=50)) hexVP.abline(sp$plot.vp, 0, 1, col="white") @ This is not a very effective way to look for associations, but if the SNP calling algorithm has been run separately for cases and controls this plot can be a useful diagnostic for things going wrong ({\it e.g.} different labelling of clusters). It should be stressed that, for real data, the plots described above would usually have many more outliers. Our simulation did not model the various biases and genotype failures that affect real studies. The fastest tool for carrying out simple tests for association taking the SNP one at a time is {\tt single.snp.tests}. The output from this function is a data frame with one line of data for each SNP. Running this in our data and summarising the results: <>= tests <- single.snp.tests(cc, data = subject.support, snp.data = snps.10) @ Some words of explanation are required. In the call, the {\tt snp.data=} argument is mandatory and provides the name of the matrix providing the genotype data. The {\tt data=} argument gives the name of the data frame that contains the remaining arguments --- usually the subject support data frame\footnote{This is not mandatory --- we could have made {\tt cc} available in the global environment. However we would then have to be careful that the values are in the right order; by specifying the data frame, order is forced to be correct by checking the order of the row names for the {\tt data} and {\tt snp.data} arguments.}. Let us now see what has been calculated: <>= summary(tests) @ We have, for each SNP, chi-squared tests on 1 and 2 degrees of freedom (df), together with $N$, the number of subjects for whom data were available. The 1 df test is the familiar Cochran-Armitage test for codominant effect while the 2 df test is the conventional Pearsonian test for the $3\times 2$ contingency table. The large number of {\tt NA} values for the latter test reflects the fact that, for these SNPs, the minor allele frequency was such that one homozygous genotype did not occur in the data. We will probably wish to restrict our attention to SNPs that pass certain criteria. For example <>= use <- snpsum$MAF > 0.01 & snpsum$z.HWE^2 < 200 @ (The Hardy-Weinberg filter is ridiculous and reflects the strange characteristics of these simulated data. In real life you might want to use something like 16, equivalent to a 4SE cut-off). To see how many SNPs pass this filter <>= sum(use) @ We will now throw way the discarded test results and save the positions of the remaining SNPs <>= tests <- tests[use] position <- snp.support[use, "position"] @ We now calculate $p$-values for the Cochran-Armitage tests and plot minus logs (base 10) of the $p$-values against position <>= p1 <- p.value(tests, df=1) plot(hexbin(position, -log10(p1), xbin=50)) @ Clearly there are far too many ``significant'' results, an impression which is made even clearer by the quantile-quantile (QQ) plot: <>= chi2 <- chi.squared(tests, df=1) qq.chisq(chi2, df = 1) @ The three numbers returned by this command are the number of tests considered, the number of outliers falling beyond the plot boundary, and the slope of a line fitted to the smallest 90\% of values ({\it i.e.} the multiple by which the chi-squared test statistics are over-dispersed). The ``concentration band'' for the plot is shown in grey. This region is defined by upper and lower probability bounds for each order statistic. The default is to use the 2.5\% and 95.7\% bounds\footnote{Note that this is not a simultaneous confidence region; the probability that the plot will stray outside the band at some point exceeds 95\%.}. This over-dispersion of chi-squared values was built into our simulation. The data were constructed by re-sampling individuals from {\em two} groups of HapMap subjects, the CEU sample (of European origin) and the JPT$+$CHB sample (of Asian origin). The 55\% of the cases were of European ancestry as compared with only 45\% of the controls. We can deal with this by stratification of the tests, achieved by adding the {\tt stratum} argument to the call to {\tt single.snp.tests} (the remaining commands are as before) <>= tests <- single.snp.tests(cc, stratum, data = subject.support, snp.data = snps.10) tests <- tests[use] p1 <- p.value(tests, df = 1) plot(hexbin(position, -log10(p1), xbin=50)) @ <>= chi2 <- chi.squared(tests, df=1) qq.chisq(chi2, df = 1) @ Most of the over-dispersion of test statistics has been removed (the residual is probably due to ``cryptic relatedness'' owing to the way in which the data were simulated). Now let us find the names and positions of the most significant 10 SNPs. The first step is to compute an array which gives the positions in which the first, second, third etc. can be found <>= ord <- order(p1) top10 <- ord[1:10] top10 @ We now list the 1~df $p$-values, the corresponding SNP names and their positions on the chromosome: <>= names <- tests@snp.names p1[top10] names[top10] position[top10] @ The most associated SNPs lie within two small regions of the genome. To concentrate on the rightmost region (the most associated region on the left contains just one SNP), we'll first sort the names of the SNPs into position order along the chromosome and select those lying in the region approximately one mega-base either side of the second most associated SNP: <>= posord <- order(position) position <- position[posord] names <- names[posord] local <- names[position > 9.6e+07 & position < 9.8e+07] @ The variable {\tt posord} contains the permutation necessary to sort SNPs into position order and {\tt names} and {\tt position} have now been reordered in this manner. The variable {\tt local} contains the names of the SNPs in the selected 2 mega-base region. Now create a matrix containing just these SNPs, in position order, and compute the linkage disequilibrium (LD) between them: <>= snps.2mb <- snps.10[, local] ld.2mb <- ld.snp(snps.2mb) @ A plot of the $D'$ values across the region may be written to a file (in encapsulated postscript format) as follows: \begin{verbatim} plot(ld.2mb, file="ld2.eps") \end{verbatim} This can be viewed (outside R) using a postscript viewer such as ``gv'' or ``ggv''. Alternatively it can be converted to a .pdf file and viewed in a pdf viewer such as ``acroread''. The associated SNPs fall in a region of tight LD towards the middle of the plot. Next we shall estimate the size of the effect at the most associated SNPs for each region (rs870041, rs10882596). In the following commands, we extract each SNP from the matrix as a numerical variable (coded 0, 1, or 2) and then, using the {\tt glm} function, carry out a logistic regression of case--control status on this numerical coding of the SNP and upon stratum. The variable {\tt stratum} must be included in the regression in order to allow for the different population structure of cases and controls. We first make copies of the {\tt cc} and {\tt stratum} variables in {\tt subject.support} in the current working environment (where the other variables reside): <>= cc <- subject.support$cc stratum <- subject.support$stratum top <- as(snps.10[, "rs870041"], "numeric") glm(cc ~ top + stratum, family = "binomial") @ The coefficient of {\tt top} in this regression is estimated as 0.5100, equivalent to a relative risk of $\exp(.5100) =1.665$. For the other top SNP we have: <>= top2 <- as(snps.10[, "rs10882596"], "numeric") glm(cc ~ top2 + stratum, family = "binomial") @ This relative risk is $\exp(0.4575)=1.580$. Both estimates are close to the values used to simulate the data. Finally you might like to repeat the analysis above using the 2~df tests. The conclusion would have been much the same. A word of caution however; with real data the 2~df test is less robust against artifacts due to genotyping error. On the other hand, it is much more powerful against recessive or near-recessive variants. \section*{Multi-locus tests} There are two other functions for carrying out association tests ({\tt snp.lhs.tests} and {\tt snp.rhs.tests}) in the package. These are somewhat slower, but much more flexible. For example, the former function allows one to test for differences in allele frequencies between more than two groups. An important use of the latter function is to carry out tests using {\em groups} of SNPs rather than single SNPs. We shall explore this use in the final part of the exercise. A prerequisite to multi-locus analyses is to decide on how SNPs should be grouped in order to ``tag'' the genome rather more completely than by use of single markers. Hopefully, the {\tt snpMatrix} package will eventually contain tools to compute such groups, for example, by using HapMap data. The function {\tt ld.snp}, which we encountered earlier, will be an essential tool in this process. However this work is not complete and, for now, we demonstrate the testing tool by grouping the 27,828 SNPs we have decided to use into 20kB blocks. The following commands compute such a grouping, tabulate the block size, and remove empty blocks: <>= blocks <- split(posord, cut(position, seq(100000, 135300000, 20000))) bsize <- sapply(blocks, length) table(bsize) blocks <- blocks[bsize>0] @ You can check that this has worked by listing the column positions of the first 20 SNPs together with the those contained in the first five blocks <>= posord[1:20] blocks[1:5] @ Note that these positions refer to the reduced set of SNPs after application of the filter on MAF and HWE. Therefore, before proceeding further we create a new matrix of SNP genotypes containing only these 27,828: <>= snps.use <- snps.10[, use] remove(snps.10) @ The command to carry out the tests on these groups, controlling for the known population structure differences is <>= mtests <- snp.rhs.tests(cc ~ stratum, family = "binomial", data = subject.support, snp.data = snps.use, tests = blocks) summary(mtests) @ The first argument, together with the second, specifies the model which corresponds to the null hypothesis. In this case we have allowed for the variation in ethnic origin ({\tt stratum}) between cases and controls. We complete the analysis by extracting the $p$--values and plotting minus their logs (base 10): <>= pm <- p.value(mtests) plot(hexbin(-log10(pm), xbin=50)) @ The same associated region is picked out, albeit with a rather larger $p$-value; in this case the multiple df test cannot be powerful as the 1 df test since the simulation ensured that the ``causal'' locus was actually one of the SNPs typed on the Affymetrix platform. QQ plots are somewhat more difficult since the tests are on differing degrees of freedom. This difficulty is neatly circumvented by noting that, under the null hypothesis, $-2\log p$ is distributed as chi-squared on 2~df: <>= qq.chisq(-2 * log(pm), df = 2) @ \end{document}