Evolution, exercise, diet and why so many people have crooked teeth.

Coming soon…

A selfish gene theory based exlanation of rapport seeking behavior and its implications for social heirarchy in humans.

Coming soon…

A selfish gene theory based explaination as to why we find baby animals “cute”.

Forewarning: To understand this article, you’re probably going to have to have read Richard Dawkins book “The Selfish Gene“.

Kittens/puppys/ducklings/foals are undeniably cute and lovable. But why do we all think this? Ostensibly, it seems that thinking of a duckling as “cute” makes no more sense than finding a pile of bricks cute. What is it about these wee critters that causes a seemingly universal emotional response in humans?

One might argue that they remind us of our own offspring, or that their helplessness is appealing, but again, there should be an identifiable reason that these responses are universal.

I argue, that in the context of selfish gene theory, this emotional response could be explained as a mechanism that prevents you from eating the young of closely related species (or more specifically, other survival machines who share a large proportion of your genes). From your genes point of view, ~90% of which are shared with a kitten, it makes no sense for you to eat a younger animal for dinner, given an available alternative. If your genes were to allow this, it would mean removing this particular “survival machine” from the population before it has reached an age where it can reproduce and has a chance to propagate its own genetic material, which would not give this particular combination of genetic material a fair shot at propagation (as opposed to a mature animal who has had their chance). For the 90% of homologous genes which are shared between you and a cat/kitten, it is more advantageous for THOSE GENES that you to eat the older animal (who has reached a reproductive age) for brunch/supper. The older “survival machine” has had a fairer chance to produce an offspring as well as being in a fairer position to avoid being eaten (if the particular combination of genes which it possesses have made it smart enough).

Hence why our genes program this universal emotional response. And as it happens it seems that this response also exists in other species!

Use “org.Hs.eg.db” to map between Entrez Gene Ids and HUGO gene symbols in R

To translate between these identifiers in R, this code creates a table with the mapping:

library(org.Hs.eg.db)
e2s = toTable(org.Hs.egSYMBOL)

To have a look at the mapping (which is stored as a data.frame)

head(e2s)

This will output something like this
gene_id symbol
1 1 A1BG
2 2 A2M
3 3 A2MP1
4 9 NAT1
5 10 NAT2
6 11 AACP

Now to map a set of Ids you have stored in a vector called “entrezGenes” to HUGO:

entrezGenes <- c("57573", "8563" , "8434")
geneSymbols <- e2s$symbol[e2s$gene_id %in% entrezGenes]

NOTE THIS DOES NOT PRESERVER ORDER.

In R, list all files in current directory ending with “.CEL”

Use regular expression matching, where the “$” means that this string ends with the preceding string:

dir(pattern=".CEL$")

Use biomaRt to tranlate HUGO to Entrez gene Ids.

We can use the R package biomaRt to conveniently convert between different types of gene ids. In this example we will convert official HUGO gene names to entrez gene ids.

First we load biomaRt in R using the current ensembl database for human:
library(biomaRt)
ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl")

Now get the mapping, myHugoGeneNames is a vector of HUGO gene names. As this mapping will not tend to be unique, make sure "uniqueRows=FALSE" is included in the query.
mapTab <- getBM(attributes = c("hgnc_symbol", "entrezgene"), filters = "hgnc_symbol", values = myHugoGeneNames, mart = ensembl, uniqueRows=FALSE)

This query will return a table with HUGO gene names in the first column and corresponding Entrez gene id in the 2nd column. As the mapping isn't unique, you may wish to remove duplicates, which can be done as follows:

dupRows <- union(which(duplicated(mapTab[,1])), which(duplicated(mapTab[,2])))
entrezIds <- mapTab[-dupRows, 2]
names(entrezIds) <- mapTab[-dupRows, 1]

The object "entrezIds" now contains a unique mapping of the gene ids.

If you wish to map different kinds of Ids, these function will be of use in identifying what identifiers are available in biomaRt:
listFilters(ensembl)
listAttributes(ensembl)

Boom!

See all users jobs in Sun Gridengine

This info is hard to find, but to see what jobs are being run by other users on Sun Gridengine:

qstat -u "*"

Find files in linux command line using wildcards

To find a file whose name contains the text “lostfile” in the current directory and all subdirectories:

find . -name \*lostfile\*

To search the filesystem for such a file simply do:

find / -name \*lostfile\*

Installing R, Rmpi & Bioconductor on Beowulf Cluster Running ClusterVision OS

We recently purchased a new computer cluster here and I was tasked with installing R & Rmpi on the thing. Not an especially easy task, so here’s how you do it.

Firstly we need to install R on the slave image. The slave image is basically a directory tree contained on the master node that the slaves/nodes download and use as their filesystem when they boot up.

So to install files in the slave image, firstly log in as root. If you have a system setup with a separate login node, first exit from the login node by typing “exit”, to get to the master node.

On our system /cm/images/default-image/ is the location of the slave image, which is different from what it says in our manual, apparently they moved it and haven’t bothered to update the manual so I had to email support to find this. The following code tells the yum package manager to install R in the slave image’s directory tree.

yum --installroot=/cm/images/default-image/ install R.x86_64

Now to install the Bioconductor packages in the installation of R we have just created we need to change the root directory to the slave image, initiate R and download and install the packages.

[root@cluster]# R
> source("http://bioconductor.org/biocLite.R")
> biocLite()

On our cluster, some packages that certain Bioconductor libraries required were not installed, so we needed to install them as follows before installing Bioconductor as above. You may have to do something similar when installing on the master node (see below).

yum --installroot=/cm/images/default-image/ install gcc-gfortran.x86_64 libxml2-devel.x86_64 curl-devel.x86_64

The slave nodes need to now be restarted for them to download and install the updated slave image. This took about 5 minutes on our system. It can be achieved by entering cmsh and using the following command to restart the slaves.

[root@cluster]# cmsh
% device power reset -c slave

Next we must install R and Rmpi on the master node. First we type “exit” to undo what was done with the “chroot” command above, we now have a normal shell again. So install R and Rmpi:

[root@cluster]# yum install R.x86_64
[root@cluster]# R
> source("http://bioconductor.org/biocLite.R")
> biocLite("Rmpi")

On our system which uses the Sun GridEngine queuing system, to submit a job we use the following shell script to submit a script called “task_pull.R “. This shell script may be quite different based on your systems setup or if you are using the PBS, but the ClusterVision users guide is a good place to start. I’ve included comments in the script below.

#!/bin/sh
#
# Your job name
#$ -N My_Job
#
# Use current working directory
#$ -cwd
#
#
# pe (Parallel environment) request. Set your number of processors here.
#$ -pe mpich 15
#
# Run job through bash shell
#$ -S /bin/bash
# If modules are needed, source modules environment:
. /etc/profile.d/modules.sh
# Add any modules you might require:
module add shared mvapich2 openmpi
# The following output will show in the output file. Used for debugging.
echo ``Got $NSLOTS processors.''
echo ``Machines:''
cat $TMPDIR/machines
#Merge the standard out and standard error to one file
#$ -j y
mpirun -np 1 R --slave CMD BATCH task_pull.R

Now to submit the job to the cluster type the following line of code at the command prompt, where “all.q” is the name of the queue you wish to submit to and test_Rmpi.sh is the name of the shell script above.

qsub -q all.q test_Rmpi.sh

The sample code that I’m running (“task_pull.R”) is found here: http://math.acadiau.ca/ACMMaC/Rmpi/examples.html. There are some useful examples on this site and it is a good guide to MPI in R, although doesn’t provide information on the setup of the system.

Keep it real.

Performing Genome Wide Association (GWAS) using R and GenABEL

Introduction to GWAS
Genome Wide Association Studies (GWAS) are essentially a means by which to figure out which genetic traits (usually single nucleotide polymorphisms (SNPs)) are correlated with a variable of interest. In essence it allows us to assign function to regions of the genome, for example to find which genetic variants (differences in DNA sequence between individuals) are associated with changes in measurable variables like height, obesity, or any other trait or disease of interest, with the end result that we can make statements like “This variant of this gene is implicated in an increased risk of brain cancer”.

To perform such an analysis we need two kinds of data. SNP data for each individual in the study (which will be used to catalog the genetic variation between each sample) and data measuring the trait we are investigating for each individual in the study.

Our software will then simply find which regions of the genome vary in conjunction with the trait of interest, for example if all tall people in a study have a G at a certain genomic locus and all short individuals have a T at the same locus, we can hypothesize that variation in this SNP is responsible for the variation in height, although this is obviously a grossly oversimplified example, but then this guide isn’t aimed at complicating things!

Performing GWAS in R using GenABEL

Just to cover all bases R is a programming language used for statistics, it’s particularly useful in bioinformatics. GenABEL is an R package and part of the bioconductor project, that we can use to perform GWAS.

If you are not familiar with R, go to their website and read the documentation.

Intuitively, to perform a GWAS analysis we need two kinds of data; SNP data to show how DNA varies across samples and phenotypic data which shows how our variable of interest (height, weight, disease, whatever) varies across samples. In this example analysis we will be using genotype (SNP) information from the Hapmap project. If you are not familiar with HapMap you can read my short article explaining it here.

So to begin, open up R.

The first step is to install GenABEL.

source("http://bioconductor.org/biocLite.R")
biocLite('GenABEL')

We now need to download the HapMap data. We will be using the CEU dataset (if you don’t know what that means read the HapMap article), which we can download here in Plink format. Plink is a different GWAS program released by the folks in Harvard. We need to get the file hapmap3_r2_b36_fwd.qc.poly.tar.bz2 which contains both the .map and .ped file. The .ped file contains the genotype information (which SNP variants are where) and the .map file is essentially a list of the SNP names.

Once we have these files we can use the GenABEL convert.snp.ped() function to change the files to GenABEL format. First unizp the archive we have downloaded. The code to convert the files is as follows.


library('GenABEL')
setwd('/whereever/file/have/been/unzipped/')
convert.snp.ped(ped='hapmap3_r2_b36_fwd.CEU.qc.poly.ped', map='hapmap3_r2_b36_fwd.CEU.qc.poly.map', out='ceu.out')

The next step is to load the data into GenABEL. The important point at this stage is to have a correctly formatted phenotype file, which indicates the values associated with the phenotypic trait of interest in each sample. The GenABEL tutorial does a good job of explaining what this file should look like. As an example see below. It’s essentially just a tab-delimited text file which contains the sample names and the values for the trait of interest, which can be any quantifiable trait, in this example it’s height in centimeters.


"id" "sex" "traitOfInterest"
"NA06985" 0 120.1
"NA06991" 0 175.3
"NA06993" 1 157.2
"NA06994" 1 135.9

Now that we have created our genotype file (‘ceu.out’) and our phenotype file (which we will save as ‘hapmap_CEU.ph’) we can read in the data and create our GenABEL object, perform filtering according to the default parameters and the GWAS analysis and list our results in 5 simple lines of code.


hap3data <- load.gwaa.data(phenofile = "hapmap_CEU.ph", genofile = "ceu.out")
qc1 <- check.marker(hap3data)
qcedDataCeu <- hap3data[qc1$idok, qc1$snpok]
results <- mlreg("mean~factor(sex)", data=qcedDataCeu)
descriptives.scan(results, top=20)

Low and behold our top 20 SNPs are printed, p-values and all. Now wasn’t that quick and painless.

For a more detailed (and much more painful) explaination of all of this you should review the GenABEL documentation which is available at their homepage.

Until next time.