Genome Wide Association (GWAS) QuickStart guide using R and GenABEL

Note: This article is a quickstart guide, there is ALOT more to know about GWAS, but this provides a good starting point.

Introduction to GWAS
Genome Wide Association Studies (GWAS) are essentially a means by which to figure out which genetic traits (usually single nucleotide polymorphisms (SNPs)) are correlated with a variable of interest. In essence it allows us to assign function to regions of the genome, for example to find which genetic variants (differences in DNA sequence between individuals) are associated with changes in measurable variables like height, obesity, or any other trait or disease of interest, with the end result that we can make statements like “This variant of this gene is implicated in an increased risk of brain cancer”.

To perform such an analysis we need two kinds of data. SNP data for each individual in the study (which will be used to catalog the genetic variation between each sample) and data measuring the trait we are investigating for each individual in the study.

Our software will then simply find which regions of the genome vary in conjunction with the trait of interest, for example if all tall people in a study have a G at a certain genomic locus and all short individuals have a T at the same locus, we can hypothesize that variation in this SNP is responsible for the variation in height, although this is obviously a grossly oversimplified example, but then this guide isn’t aimed at complicating things!

Performing GWAS in R using GenABEL

Just to cover all bases R is a programming language used for statistics, it’s particularly useful in bioinformatics. GenABEL is an R package and part of the bioconductor project, that we can use to perform GWAS.

If you are not familiar with R, go to their website and read the documentation.

Intuitively, to perform a GWAS analysis we need two kinds of data; SNP data to show how DNA varies across samples and phenotypic data which shows how our variable of interest (height, weight, disease, whatever) varies across samples. In this example analysis we will be using genotype (SNP) information from the Hapmap project. If you are not familiar with HapMap you can read my short article explaining it here.

So to begin, open up R.

The first step is to install GenABEL.

source("http://bioconductor.org/biocLite.R")
biocLite('GenABEL')

We now need to download the HapMap data. We will be using the CEU dataset (if you don’t know what that means read the HapMap article), which we can download here in Plink format. Plink is a different GWAS program released by the folks in Harvard. We need to get the file hapmap3_r2_b36_fwd.qc.poly.tar.bz2 which contains both the .map and .ped file. The .ped file contains the genotype information (which SNP variants are where) and the .map file is essentially a list of the SNP names.

Once we have these files we can use the GenABEL convert.snp.ped() function to change the files to GenABEL format. First unizp the archive we have downloaded. The code to convert the files is as follows.


library('GenABEL')
setwd('/whereever/file/have/been/unzipped/')
convert.snp.ped(ped='hapmap3_r2_b36_fwd.CEU.qc.poly.ped', map='hapmap3_r2_b36_fwd.CEU.qc.poly.map', out='ceu.out')

The next step is to load the data into GenABEL. The important point at this stage is to have a correctly formatted phenotype file, which indicates the values associated with the phenotypic trait of interest in each sample. The GenABEL tutorial does a good job of explaining what this file should look like. As an example see below. It’s essentially just a tab-delimited text file which contains the sample names and the values for the trait of interest, which can be any quantifiable trait, in this example it’s height in centimeters.


"id" "sex" "traitOfInterest"
"NA06985" 0 120.1
"NA06991" 0 175.3
"NA06993" 1 157.2
"NA06994" 1 135.9

Now that we have created our genotype file (‘ceu.out’) and our phenotype file (which we will save as ‘hapmap_CEU.ph’) we can read in the data and create our GenABEL object, perform filtering according to the default parameters and the GWAS analysis and list our results in 5 simple lines of code.


hap3data <- load.gwaa.data(phenofile = "hapmap_CEU.ph", genofile = "ceu.out")
qc1 <- check.marker(hap3data)
qcedDataCeu <- hap3data[qc1$idok, qc1$snpok]
results <- mlreg("mean~factor(sex)", data=qcedDataCeu)
descriptives.scan(results, top=20)

Low and behold our top 20 SNPs are printed, p-values and all. Now wasn’t that quick and painless.

For a more detailed (and much more painful) explaination of all of this you should review the GenABEL documentation which is available at their homepage.

Until next time.