Genome Wide Association (GWAS) QuickStart guide using R and GenABEL

Note: This article is a quickstart guide, there is ALOT more to know about GWAS, but this provides a good starting point.

Introduction to GWAS
Genome Wide Association Studies (GWAS) are essentially a means by which to figure out which genetic traits (usually single nucleotide polymorphisms (SNPs)) are correlated with a variable of interest. In essence it allows us to assign function to regions of the genome, for example to find which genetic variants (differences in DNA sequence between individuals) are associated with changes in measurable variables like height, obesity, or any other trait or disease of interest, with the end result that we can make statements like “This variant of this gene is implicated in an increased risk of brain cancer”.

To perform such an analysis we need two kinds of data. SNP data for each individual in the study (which will be used to catalog the genetic variation between each sample) and data measuring the trait we are investigating for each individual in the study.

Our software will then simply find which regions of the genome vary in conjunction with the trait of interest, for example if all tall people in a study have a G at a certain genomic locus and all short individuals have a T at the same locus, we can hypothesize that variation in this SNP is responsible for the variation in height, although this is obviously a grossly oversimplified example, but then this guide isn’t aimed at complicating things!

Performing GWAS in R using GenABEL

Just to cover all bases R is a programming language used for statistics, it’s particularly useful in bioinformatics. GenABEL is an R package and part of the bioconductor project, that we can use to perform GWAS.

If you are not familiar with R, go to their website and read the documentation.

Intuitively, to perform a GWAS analysis we need two kinds of data; SNP data to show how DNA varies across samples and phenotypic data which shows how our variable of interest (height, weight, disease, whatever) varies across samples. In this example analysis we will be using genotype (SNP) information from the Hapmap project. If you are not familiar with HapMap you can read my short article explaining it here.

So to begin, open up R.

The first step is to install GenABEL.

source("http://bioconductor.org/biocLite.R")
biocLite('GenABEL')

We now need to download the HapMap data. We will be using the CEU dataset (if you don’t know what that means read the HapMap article), which we can download here in Plink format. Plink is a different GWAS program released by the folks in Harvard. We need to get the file hapmap3_r2_b36_fwd.qc.poly.tar.bz2 which contains both the .map and .ped file. The .ped file contains the genotype information (which SNP variants are where) and the .map file is essentially a list of the SNP names.

Once we have these files we can use the GenABEL convert.snp.ped() function to change the files to GenABEL format. First unizp the archive we have downloaded. The code to convert the files is as follows.


library('GenABEL')
setwd('/whereever/file/have/been/unzipped/')
convert.snp.ped(ped='hapmap3_r2_b36_fwd.CEU.qc.poly.ped', map='hapmap3_r2_b36_fwd.CEU.qc.poly.map', out='ceu.out')

The next step is to load the data into GenABEL. The important point at this stage is to have a correctly formatted phenotype file, which indicates the values associated with the phenotypic trait of interest in each sample. The GenABEL tutorial does a good job of explaining what this file should look like. As an example see below. It’s essentially just a tab-delimited text file which contains the sample names and the values for the trait of interest, which can be any quantifiable trait, in this example it’s height in centimeters.


"id" "sex" "traitOfInterest"
"NA06985" 0 120.1
"NA06991" 0 175.3
"NA06993" 1 157.2
"NA06994" 1 135.9

Now that we have created our genotype file (‘ceu.out’) and our phenotype file (which we will save as ‘hapmap_CEU.ph’) we can read in the data and create our GenABEL object, perform filtering according to the default parameters and the GWAS analysis and list our results in 5 simple lines of code.


hap3data <- load.gwaa.data(phenofile = "hapmap_CEU.ph", genofile = "ceu.out")
qc1 <- check.marker(hap3data)
qcedDataCeu <- hap3data[qc1$idok, qc1$snpok]
results <- mlreg("mean~factor(sex)", data=qcedDataCeu)
descriptives.scan(results, top=20)

Low and behold our top 20 SNPs are printed, p-values and all. Now wasn’t that quick and painless.

For a more detailed (and much more painful) explaination of all of this you should review the GenABEL documentation which is available at their homepage.

Until next time.

8 Responses to “Genome Wide Association (GWAS) QuickStart guide using R and GenABEL”

  1. Hi! Thanks for posting this! Not sure if you would be able to help me. I have been following along with your example as I am trying to achieve a similar target using GenABEL but I seem to be having problems attaching the phenotype file. The map & ped files are read in just fine. My pheno file is in .dat format, could that be an issue? How did you get your phenotype file into the .ph format? Also, is there literature or other sources aside from the GenABEL manual you can point me to? (as that is not providing me with the solutions I need)

  2. Thanks for publishing about this. There’s a lot of solid tech information on the internet. You’ve got a lot of that info here on your site. I’m impressed – I try to keep a couple blogs pretty ongoing, but it’s a struggle sometimes. You’ve done a big job with this one. How do you do it?

  3. Hi, Very interesting article. I am quite impressed and just wanted to let you know that you did a fine job on this article. However, I do have some unanswered questions that I would like to ask you. I will contact you via email so that you can clear some of these things up for me. Again, very well written article. Keep up the good work.

  4. Exceptional post – and great domain by the way!

  5. Thank you. Not bad site you got here. Have some extra sites to point to which have more stuff like this?

  6. -`, I am really thankful to this topic because it really gives up to date information ‘;,

  7. Hi,Thanks for posting such a usefull documant that demonstrates how to use GenABEL in R package when one performs GWAS. Also you used a plain engilsh to make life easier! I would suggest you also show and explain the GenABEL output in similar maner.

    Thanks,
    Ali

  8. Nice intro, thank you! For these who are interested in more details further info about GenABEL suite can be found on the project’s web-site (www.genabel.org) and also at the community forum (forum.genabel.org).

Leave a Reply