Installing Birdsuite – Not Fun.

Birdsuite is an application released by the Broad (Harvard/MIT) for analysing Affy SNP 6.0 array data. It was published in Nature Genetics. The installation process is really shockingly bad. Its a little surprising that journal like Nature Genetics can possibly justify publishing a piece of software in this condition. You would have to question whether anyone from a mid-level university would get away with releasing something like this in a top journal.

An update. So next i used the bioconductor package “clrmm”. I would recommend avoiding this too. The documentation is way sparse and from what I can tell there are no functions for integrating your results with downstream analysis (correct me if I’m wrong, but I couldn’t find any way of producing annotated output).

Another update. I ended up getting it to work using “Affymetrix Genotying Console”. Its a good piece of software with good documentation. It works seamlessly, does quality control, outputs data in usable and ANNOTATED format and can even output in Plink format. Thumbs up, good job Affymetrix!!

Split a chromosome name and location type string into its constituent parts in R

Easy to do with a regular expression and the strsplit function in R. The | operator means “or”. For example for “chr15:88120587-88121480″:

> unlist(strsplit("chr15:88120587-88121480", "chr|:|-"))[2:4]
[1] "15" "88120587" "88121480"

Principal Components Analysis Explained using R

Here, we will explain principle component analysis (PCA) by stepping though the algorithm manually and reproducing the output of the prcomp() function in R, which is normally used to do PCA.

First make up some data and plot it; in terms of gene expression analysis, we can think of the rows of the matrix below as different samples (or microarray experiments etc.), with 2 genes each. Its much easier to explain PCA for two dimensions and then generalise from there.
x <- c(1,3,6,8,9,10,15,12,17)
y <- c(10,14,12,17,19,20,20,24,19)
m <- cbind(x, y)
plot(m, pch=seq(9))

Here’s how you’d make a PCA plot for this data using R’s inbuilt functions (We’re about to learn how to do this for ourselves.)
pca_proc <- prcomp(m)
plot(pca_proc$x, pch=seq(9)) #the pca plot
plot(pca_proc) #the proportion of variance capture by each PC

Now to do PCA manually, first step, center data around zero, then plot the results.
x1 <- x - mean(x)
y1 <- y - mean(y)
plot(x1, y1, xlim=c(-40, 40), ylim=c(-40, 20))

Next, calculate the covariance matrix for vectors x1 and y1 and plot the column vectors, these column vectors describe the direction of variability in the data, i.e how similar is cov(x, x) with cov(x, y) and how similar is cov(y, y) with cov(y, x). Note that the covariance of x with itself is the same as the variance of x (i.e. var(x) == cov(x, x)). More info on covariance matrix.
m1 <- cbind(x1, y1)
cM <- cov(m1)
lines(x=c(0, cM[1,1]), y=c(0, cM[2,1]), col="blue")
lines(x=c(0, cM[1,2]), y=c(0, cM[2,2]), col="blue")

As the cov(x, y) can never be greater than cov(x,x), one of the lines plotted above will always be above the diagonal and one will be below, plot the diagonal (i.e. a line through the origin with a slope of 1).
abline(0, 1, col="grey")

While the vectors we have plotted above describe the direction of the variability in the data, they do not do so in a way that is particularly useful. By finding the eigenvectors of the covariance matrix, we can describe the variability in terms of two orthogonal vectors which capture the direction of variation, instead of the two vectors that we are currently plotting. For more info on eigenvectors you’ll need to study some basic linear
algebra, this can be done on Khan Academy, of particular interest will be the videos on matrix multiplication and obviously eigenvectors. This is by far the most difficult part of PCA to understand.
eigenVecs <- eigen(cM)$vectors
lines(x=c(0, eigenVecs[1,1]), y=c(0, eigenVecs[2,1]), col="red")
lines(x=c(0, eigenVecs[1,2]), y=c(0, eigenVecs[2,2]), col="red")

As the eigenvectors are unit vectors (i.e. of length 1) it may be easier to visualize how much each of them influences the data if we multiply them by their corresponding eigenvalues, which represent the proportion of variability explained by each eigenvector.
eVal1 <- eigen(cM)$values[1]
eVal2 <- eigen(cM)$values[2]
lines(x=c(0, eVal1*eigenVecs[1,1]), y=c(0, eVal1*eigenVecs[2,1]), col="red")
lines(x=c(0, eVal2*eigenVecs[1,2]), y=c(0, eVal2*eigenVecs[2,2]), col="red")

This matrix contains the eigenvectors and we want to display the data in terms of these eigenvectors. In this case we will select both eigenvectors, but on high dimensional datasets, it is normal to chose a subset of eigenvectors.
rowFeatVec <- t(eigenVecs)
rowDataAdj <- t(m1)

Finally use matrix multiplcation to get the dot product between each eigenvector and each point in the original centered data matrix (m1). This operation describes to what degree each point is influenced by each eigenvector Plot this and that’s the final plot.
Note, %*% is the matrix multiplication operator in R.

transFData <- rowFeatVec %*% rowDataAdj
finalData <- rowFeatVec %*% rowDataAdj
plot(t(finalData), pch=seq(9))

Finally, to plot the equivalent of the scree plot we made above, simply plot the eigen values.
barplot(eigen(cM)$values)

Writing a group of R data.frames to named Excel Worksheets

First install “WriteXLS” library in R using:

source("http://bioconductor.org/biocLite.R")
biocLite("WriteXLS")

Now if I have 3 data.frames called “genes”, “proteins” and “elephants”, all I need to do to write them to the same Excel file, on different *named* worksheets is:

library("WriteXLS")
WriteXLS(c("genes", "proteins", "elephants"), "WriteXLS.xls")

Note, for this to work, Perl needs to be installed. You also need the “Text::CSV_XS.pm” library. On Ubuntu this can be installed by issuing the command:

sudo apt-get install libtext-csv-xs-perl

Evolution, exercise, diet and why so many people have crooked teeth.

Next time you walk down a crowded street, keep track of how many people look “healthy”. Why are so many people obese, have acne, are underweight or suffer from any number of diseases? We can answer this question by introducing you to the Kitava. They a hunter gatherer tribe who live in Papua New Guinea. They suffer virtually none of the ailments that afflict modern humans. You heard me right – NONE. The answer as to why is very simple, your genes are at odds with your environment. Evolution has, over the course of millions of years, programmed your genes to expect certain “inputs” so to speak. Your body needs certain nutrients from different kinds of foods, like fresh seasonal fruit and vegetables; foods like grains, dairy, legumes, refined sugar and alcohol are (almost) totally alien to your genes and that’s why they make you sick. Not necessarily violently sick straight away, but over the course of years these foods are damaging people at a subclinical level. Your body expects to be exercised, to lift heavy weights and to run on a daily basis. Your genes want you to interact socially, to experience a certain amount of sunlight exposure (to synthesise Vit D – which you are almost certainly deficient in). Take these basic things away and you end up overweight, depressed, tired, weak and sick.

To me the smoking gun is crooked teeth, something that evolution should have obviously weeded out. But the thing is, evolution DID weed this out, but only in the ancestral environment, when your genes received the stimulus they expected (in terms of nutrition and exercise). When these are lacking what happens? Your jaw bone does not develop properly and you end up needing braces. Later in life the same issue will lead to all of the other problems we just accept as symptoms of growing old.

So the message is, look around at the sorry state of modern man, then at your ancestral roots and adjust your lifestyle accordingly; i.e. go to the gym, eat your veggies, run around in the sun, enjoy the regular social outings and stay out of Pizza Hut!

Further Reading:
The Paleo Solution
The Primal Blueprint

A selfish gene theory based exlanation of rapport seeking behavior and its implications for social heirarchy in humans.

Coming soon…

A selfish gene theory based explaination as to why we find baby animals “cute”.

Forewarning: To understand this article, you’re probably going to have to have read Richard Dawkins book “The Selfish Gene“.

Kittens/puppys/ducklings/foals are undeniably cute and lovable. But why do we all think this? Ostensibly, it seems that thinking of a duckling as “cute” makes no more sense than finding a pile of bricks cute. What is it about these wee critters that causes a seemingly universal emotional response in humans?

One might argue that they remind us of our own offspring, or that their helplessness is appealing, but again, there should be an identifiable reason that these responses are universal.

I argue, that in the context of selfish gene theory, this emotional response could be explained as a mechanism that prevents you from eating the young of closely related species (or more specifically, other survival machines who share a large proportion of your genes). From your genes point of view, ~90% of which are shared with a kitten, it makes no sense for you to eat a younger animal for dinner, given an available alternative. If your genes were to allow this, it would mean removing this particular “survival machine” from the population before it has reached an age where it can reproduce and has a chance to propagate its own genetic material, which would not give this particular combination of genetic material a fair shot at propagation (as opposed to a mature animal who has had their chance). For the 90% of homologous genes which are shared between you and a cat/kitten, it is more advantageous for THOSE GENES that you to eat the older animal (who has reached a reproductive age) for brunch/supper. The older “survival machine” has had a fairer chance to produce an offspring as well as being in a fairer position to avoid being eaten (if the particular combination of genes which it possesses have made it smart enough).

Hence why our genes program this universal emotional response. And as it happens it seems that this response also exists in other species!

Use “org.Hs.eg.db” to map between Entrez Gene Ids and HUGO gene symbols in R

To translate between these identifiers in R, this code creates a table with the mapping:

library(org.Hs.eg.db)
e2s = toTable(org.Hs.egSYMBOL)

To have a look at the mapping (which is stored as a data.frame)

head(e2s)

This will output something like this
gene_id symbol
1 1 A1BG
2 2 A2M
3 3 A2MP1
4 9 NAT1
5 10 NAT2
6 11 AACP

Now to map a set of Ids you have stored in a vector called “entrezGenes” to HUGO:

entrezGenes <- c("57573", "8563" , "8434")
geneSymbols <- e2s$symbol[e2s$gene_id %in% entrezGenes]

NOTE THIS DOES NOT PRESERVER ORDER.

In R, list all files in current directory ending with “.CEL”

Use regular expression matching, where the “$” means that this string ends with the preceding string:

dir(pattern=".CEL$")

Use biomaRt to tranlate HUGO to Entrez gene Ids.

We can use the R package biomaRt to conveniently convert between different types of gene ids. In this example we will convert official HUGO gene names to entrez gene ids.

First we load biomaRt in R using the current ensembl database for human:
library(biomaRt)
ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl")

Now get the mapping, myHugoGeneNames is a vector of HUGO gene names. As this mapping will not tend to be unique, make sure "uniqueRows=FALSE" is included in the query.
mapTab <- getBM(attributes = c("hgnc_symbol", "entrezgene"), filters = "hgnc_symbol", values = myHugoGeneNames, mart = ensembl, uniqueRows=FALSE)

This query will return a table with HUGO gene names in the first column and corresponding Entrez gene id in the 2nd column. As the mapping isn't unique, you may wish to remove duplicates, which can be done as follows:

dupRows <- union(which(duplicated(mapTab[,1])), which(duplicated(mapTab[,2])))
entrezIds <- mapTab[-dupRows, 2]
names(entrezIds) <- mapTab[-dupRows, 1]

The object "entrezIds" now contains a unique mapping of the gene ids.

If you wish to map different kinds of Ids, these function will be of use in identifying what identifiers are available in biomaRt:
listFilters(ensembl)
listAttributes(ensembl)

Boom!