In R, list all files in current directory ending with “.CEL”

Use regular expression matching, where the “$” means that this string ends with the preceding string:

dir(pattern=".CEL$")

Use biomaRt to tranlate HUGO to Entrez gene Ids.

We can use the R package biomaRt to conveniently convert between different types of gene ids. In this example we will convert official HUGO gene names to entrez gene ids.

First we load biomaRt in R using the current ensembl database for human:
library(biomaRt)
ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl")

Now get the mapping, myHugoGeneNames is a vector of HUGO gene names. As this mapping will not tend to be unique, make sure "uniqueRows=FALSE" is included in the query.
mapTab <- getBM(attributes = c("hgnc_symbol", "entrezgene"), filters = "hgnc_symbol", values = myHugoGeneNames, mart = ensembl, uniqueRows=FALSE)

This query will return a table with HUGO gene names in the first column and corresponding Entrez gene id in the 2nd column. As the mapping isn't unique, you may wish to remove duplicates, which can be done as follows:

dupRows <- union(which(duplicated(mapTab[,1])), which(duplicated(mapTab[,2])))
entrezIds <- mapTab[-dupRows, 2]
names(entrezIds) <- mapTab[-dupRows, 1]

The object "entrezIds" now contains a unique mapping of the gene ids.

If you wish to map different kinds of Ids, these function will be of use in identifying what identifiers are available in biomaRt:
listFilters(ensembl)
listAttributes(ensembl)

Boom!

See all users jobs in Sun Gridengine

This info is hard to find, but to see what jobs are being run by other users on Sun Gridengine:

qstat -u "*"

Find files in linux command line using wildcards

To find a file whose name contains the text “lostfile” in the current directory and all subdirectories:

find . -name \*lostfile\*

To search the filesystem for such a file simply do:

find / -name \*lostfile\*

Installing R, Rmpi & Bioconductor on Beowulf Cluster Running ClusterVision OS

We recently purchased a new computer cluster here and I was tasked with installing R & Rmpi on the thing. Not an especially easy task, so here’s how you do it.

Firstly we need to install R on the slave image. The slave image is basically a directory tree contained on the master node that the slaves/nodes download and use as their filesystem when they boot up.

So to install files in the slave image, firstly log in as root. If you have a system setup with a separate login node, first exit from the login node by typing “exit”, to get to the master node.

On our system /cm/images/default-image/ is the location of the slave image, which is different from what it says in our manual, apparently they moved it and haven’t bothered to update the manual so I had to email support to find this. The following code tells the yum package manager to install R in the slave image’s directory tree.

yum --installroot=/cm/images/default-image/ install R.x86_64

Now to install the Bioconductor packages in the installation of R we have just created we need to change the root directory to the slave image, initiate R and download and install the packages.

[root@cluster]# R
> source("http://bioconductor.org/biocLite.R")
> biocLite()

On our cluster, some packages that certain Bioconductor libraries required were not installed, so we needed to install them as follows before installing Bioconductor as above. You may have to do something similar when installing on the master node (see below).

yum --installroot=/cm/images/default-image/ install gcc-gfortran.x86_64 libxml2-devel.x86_64 curl-devel.x86_64

The slave nodes need to now be restarted for them to download and install the updated slave image. This took about 5 minutes on our system. It can be achieved by entering cmsh and using the following command to restart the slaves.

[root@cluster]# cmsh
% device power reset -c slave

Next we must install R and Rmpi on the master node. First we type “exit” to undo what was done with the “chroot” command above, we now have a normal shell again. So install R and Rmpi:

[root@cluster]# yum install R.x86_64
[root@cluster]# R
> source("http://bioconductor.org/biocLite.R")
> biocLite("Rmpi")

On our system which uses the Sun GridEngine queuing system, to submit a job we use the following shell script to submit a script called “task_pull.R “. This shell script may be quite different based on your systems setup or if you are using the PBS, but the ClusterVision users guide is a good place to start. I’ve included comments in the script below.

#!/bin/sh
#
# Your job name
#$ -N My_Job
#
# Use current working directory
#$ -cwd
#
#
# pe (Parallel environment) request. Set your number of processors here.
#$ -pe mpich 15
#
# Run job through bash shell
#$ -S /bin/bash
# If modules are needed, source modules environment:
. /etc/profile.d/modules.sh
# Add any modules you might require:
module add shared mvapich2 openmpi
# The following output will show in the output file. Used for debugging.
echo ``Got $NSLOTS processors.''
echo ``Machines:''
cat $TMPDIR/machines
#Merge the standard out and standard error to one file
#$ -j y
mpirun -np 1 R --slave CMD BATCH task_pull.R

Now to submit the job to the cluster type the following line of code at the command prompt, where “all.q” is the name of the queue you wish to submit to and test_Rmpi.sh is the name of the shell script above.

qsub -q all.q test_Rmpi.sh

The sample code that I’m running (“task_pull.R”) is found here: http://math.acadiau.ca/ACMMaC/Rmpi/examples.html. There are some useful examples on this site and it is a good guide to MPI in R, although doesn’t provide information on the setup of the system.

Keep it real.

Performing Genome Wide Association (GWAS) using R and GenABEL

Introduction to GWAS
Genome Wide Association Studies (GWAS) are essentially a means by which to figure out which genetic traits (usually single nucleotide polymorphisms (SNPs)) are correlated with a variable of interest. In essence it allows us to assign function to regions of the genome, for example to find which genetic variants (differences in DNA sequence between individuals) are associated with changes in measurable variables like height, obesity, or any other trait or disease of interest, with the end result that we can make statements like “This variant of this gene is implicated in an increased risk of brain cancer”.

To perform such an analysis we need two kinds of data. SNP data for each individual in the study (which will be used to catalog the genetic variation between each sample) and data measuring the trait we are investigating for each individual in the study.

Our software will then simply find which regions of the genome vary in conjunction with the trait of interest, for example if all tall people in a study have a G at a certain genomic locus and all short individuals have a T at the same locus, we can hypothesize that variation in this SNP is responsible for the variation in height, although this is obviously a grossly oversimplified example, but then this guide isn’t aimed at complicating things!

Performing GWAS in R using GenABEL

Just to cover all bases R is a programming language used for statistics, it’s particularly useful in bioinformatics. GenABEL is an R package and part of the bioconductor project, that we can use to perform GWAS.

If you are not familiar with R, go to their website and read the documentation.

Intuitively, to perform a GWAS analysis we need two kinds of data; SNP data to show how DNA varies across samples and phenotypic data which shows how our variable of interest (height, weight, disease, whatever) varies across samples. In this example analysis we will be using genotype (SNP) information from the Hapmap project. If you are not familiar with HapMap you can read my short article explaining it here.

So to begin, open up R.

The first step is to install GenABEL.

source("http://bioconductor.org/biocLite.R")
biocLite('GenABEL')

We now need to download the HapMap data. We will be using the CEU dataset (if you don’t know what that means read the HapMap article), which we can download here in Plink format. Plink is a different GWAS program released by the folks in Harvard. We need to get the file hapmap3_r2_b36_fwd.qc.poly.tar.bz2 which contains both the .map and .ped file. The .ped file contains the genotype information (which SNP variants are where) and the .map file is essentially a list of the SNP names.

Once we have these files we can use the GenABEL convert.snp.ped() function to change the files to GenABEL format. First unizp the archive we have downloaded. The code to convert the files is as follows.


library('GenABEL')
setwd('/whereever/file/have/been/unzipped/')
convert.snp.ped(ped='hapmap3_r2_b36_fwd.CEU.qc.poly.ped', map='hapmap3_r2_b36_fwd.CEU.qc.poly.map', out='ceu.out')

The next step is to load the data into GenABEL. The important point at this stage is to have a correctly formatted phenotype file, which indicates the values associated with the phenotypic trait of interest in each sample. The GenABEL tutorial does a good job of explaining what this file should look like. As an example see below. It’s essentially just a tab-delimited text file which contains the sample names and the values for the trait of interest, which can be any quantifiable trait, in this example it’s height in centimeters.


"id" "sex" "traitOfInterest"
"NA06985" 0 120.1
"NA06991" 0 175.3
"NA06993" 1 157.2
"NA06994" 1 135.9

Now that we have created our genotype file (‘ceu.out’) and our phenotype file (which we will save as ‘hapmap_CEU.ph’) we can read in the data and create our GenABEL object, perform filtering according to the default parameters and the GWAS analysis and list our results in 5 simple lines of code.


hap3data <- load.gwaa.data(phenofile = "hapmap_CEU.ph", genofile = "ceu.out")
qc1 <- check.marker(hap3data)
qcedDataCeu <- hap3data[qc1$idok, qc1$snpok]
results <- mlreg("mean~factor(sex)", data=qcedDataCeu)
descriptives.scan(results, top=20)

Low and behold our top 20 SNPs are printed, p-values and all. Now wasn’t that quick and painless.

For a more detailed (and much more painful) explaination of all of this you should review the GenABEL documentation which is available at their homepage.

Until next time.

What is the Hapmap Project?

The International HapMap Project is a project whose aim is to develop a haplotype map of the human genome, which will describe the common patterns of human genetic variation. In plain English, what they have done is taken blood samples from groups of related individuals from various regions of the planet and used these samples to catalog some of the common variations in their DNA which was extracted from a type of white blood cell called a lymphoblast. Because it would be far too expensive to sequence the full genomes of so many people, the HapMap project looks at sites of common variation in DNA sequence called Single Nucleotide Polymorphisms (SNPs).

All of the data has been sampled from related individuals, two parents and a child, these are referred to as “trios”. The reason this is done is that it allows researchers to investigate patters of heritability.

So essentially the raw data which the project has produced is genotype (SNP) data for trios of related individuals from from various regions around the world; for example what is referred to as the CEU subset of data are genotypes for 30 trios (2 parents, 1 child) who are residents of Utah of northern and western European decent. Similarly the YRI data are 30 adult-and-both-parents trios from Ibadan, Nigeria.

There have thus far been three “phases” to the project. They are essentially a combination of different samples (e.g. new geographic locations) and different genotyping technologies. For example Phase II has 270 samples between 4 geographic areas and is genotyped with 3.8 million SNPs per sample. Phase III has a greater number of samples (1,115) but is genotyped using a less sensitive technology (1.6 million SNPs per sample).

The data can be applied to studies such as genome wide association, where phenotypic traits (e.g. height, a disease etc.) are linked to SNPs to give an idea of which areas of the genome are responsible for what visible traits. Because the individuals who were samples are related the data can also be applied to heritability studies.

The HapMap website provides some analysis tools such as a genome browser that can be used to browse and visualise the data. The data is also available for bulk download.

Good Man Boys,

Till Next time.

Steps in Preprocessing Affymetrix GeneChip Data

The analysis of microarray data to produce lists of differentially expressed genes has several steps which can differ based on the type of data being assayed. However, all data follows the same general pipeline which involves reading raw data, quality assessing the data, removing bad spots/arrays from further analysis, preprocessing the data and calculating differential expression by statistical analysis. This list of differentially expressed genes can subsequently be annotated with useful information that explains the various genes’ function, for example, gene ontology. I will now explain in more detail how this data analysis pipeline is followed for the types of data supported by this system.

Before any kind of microarray data can be analysed for differential expression several steps must be taken. Raw data must be quality assessed to ensure its integrity. Unprocessed raw data will always be subject to some form of technical variation and thus must be preprocessed to remove as many unwanted sources of variation as is possible, to ensure that results are of the highest attainable level of accuracy. Ideally, the data being assayed should be preprocessed using several different methods, the results of which should be compared to identify which method is of the highest level of suitability. The most appropriate method should then be used to preprocess the raw data before differential expression analysis.

Because of the design of Affymetrix GeneChip Arrays, the steps that need to be taken before differential expression analysis are slightly more elaborate than for cDNA arrays.

The first step is generally to background correct the intensity reading for each spot. Background fluorescence can arise from many sources, such as non-specific binding of labelled sample to the array surface, processing effects such as deposits left after the wash stage or optical noise from the scanner. There is always some level of background noise, even if nothing but sterile water is labeled and hybridised to the array, some fluorescence will still be picked up by the scanner. Different algorithms will use different methods of background correction. The popular Robust Multi-Array Analysis (RMA) algorithm, for example, uses the convolution of signal and noise distributions.

The next stage is normalization. The purpose of this step is to adjust data for technical variation, as opposed to biological differences between the samples. There will always be slight discrepancy between the hybridisation processes for each array and these variations tend to lead to scaling differences between the overall fluorescence intensity levels of various arrays. For example the quantity of RNA in a sample, the amount of time for which a sample spends hybridising or the volume of a sample can all introduce significant variance. Even subtle physical differences between arrays or between the scanners used to read arrays can have an effect.

Put simply, normalization ensures that when comparing expression levels of different arrays, that we are, as much as is possible, comparing like with like. Studies have shown that the normalization method used has a significant difference on final differential expression levels, so it is vital to choose an appropriate method.

The next step is usually PM Correction As stated previously, PM probes on the GeneChip measure both the relative abundance of the corresponding gene and the amount of non-specific binding, which arises when mRNA binds to a probe which is not targeting it. MM probes are designed to give a measure for non-specific binding of their corresponding PM probe. It then seems obvious that the MM values should be subtracted from their corresponding PM values as a first step in the analysis process.

In reality however, this does not work, because generally about 30% of MMs are actually larger than their corresponding PMs \cite{Schwender2006}. This is because, as well as measuring background signal, high volumes of mRNA targeted intentionally by the PM probes tend to also bind to MMs. Many of the most popular preprocessing methods solve this problem by simply ignoring the MM probes altogether and PM values are corrected for non-specific binding using other methods.

We have already seen how GeneChip arrays work by using 11 different PM spots to target 11 separate 25 base long sections of a target genes mRNA. The final step in preprocessing GeneChip Data is summarization to summarise the data from these 11 separate probes into an expression value for the gene in question. There are a number of different ways that this can be achieved, but the end result is always a single expression value for each gene on each chip.

What Are DNA Microarrays

As a prerequisite to this article you must understand what genes, DNA, proteins and mRNA are. You can find our short introduction to these concepts here.

DNA microarrays are a high throughput technology used to measure the expression levels of thousands of genes, in some cases all of the genes in a genome, simultaneously. The fundamental idea behind most microarrays is to exploit complementary base pairing (see previous section) to measure the amount of the different types of mRNA molecules in a cell, thus indirectly measuring the expression levels of the genes that are responsible for the synthesis of those particular mRNA molecules.

The spots on a microarray contain single stranded DNA oligonucleotides called probes. Each of these spots will contain DNA which is of a complementary sequence to the specific mRNA molecule that corresponds to the gene that it is targeting. An mRNA molecule which is complementary to the probe in question, should hybridise to that probe, forming a strong mRNA-DNA bond. These mRNA molecules will have previously been labeled with fluorescent dye, which means that the amount of hybridisation that has taken place can be measured by the level of fluorescence of the dye, which is examined with a scanner. This scanner then outputs a text file for each array, which contains the relevant data pertaining to that array, such as the level of fluorescence of each spot and the level of background noise. It is these text files which are subsequently computationally analysed. In theory, a spot with brighter fluorescence means that more mRNA has hybridised, which in turn infers that more mRNA was present in sample extracted from the original cell and that the gene represented by this spot is experiencing a higher level of expression.

The types of DNA microarrays most widely used today can be broadly divided into two categories, cDNA arrays and oligonucleotide arrays.


The Affymetrix GeneChip

The GeneChip, which is manufactured by Affymetrix, is an oligonucleotide array and is the most commonly used type of DNA microarray. They differ slightly in operation from other kinds of arrays. Each array will contain hundreds of thousands of probe spots and each of these spots will in turn contain millions of copies of an individual 25 base long DNA oligonucleotide.

Each gene that is being targeted is represented by typically (but not necessarily always) by 11 pairs of these probes. This set of probes contains 11 perfect match (PM) probes, which are exactly complementary to the DNA sequence of a subset of 25 bases of the target gene. Each PM probe has a corresponding mismatch probe (MM), which contains the same 25 base long sequence as the PM probe, except for the fact that the middle base, or the 13th base in the chain, is substituted for the complement of the 13th base of its corresponding PM probe; so for example, a G in the 13th base of a PM probe will be replaced with a C in the MM probe. This is meant to give an estimate of non-specific binding, which occurs when mRNA that is not targeted binds to a PM probe.

cDNA Arrays

cDNA microarrays differ from Affymetrix arrays in that each spot corresponds entirely to a specific gene. Sometimes duplicate spots will target the same gene, but these spots are exact copies of each other. The probes are of varying length but are generally hundreds of bases long. Instead of mRNA levels being directly measured, these arrays measure complementary DNA (cDNA), because this is more stable molecule than mRNA at these large sizes. mRNA from the original sample is reverse transcribed in a laboratory to create an equivalent number of the more stable cDNA molecules which are then hybridised to the microarray. These cDNA molecules are usually more than 500 bases long. Each of the probes contained on the spots on the microarary will be complementary to a cDNA molecule that represents a given gene. Thus, the measure of how much cDNA binds to its corresponding spot gives an accurate measure of the expression level of the gene in question, assuming that nothing has gone wrong.

Also instead of expression levels of an individual sample being measured directly, two separate samples are hybridised to the same array at the one time. One of these samples is generally a control sample, while the other one is a sample of interest such as tumour tissue. Each of these samples is labeled with a particular dye; either a red-fluorescent dye, Cyanine 5 or Cy5, or a green-fluorescent dye, Cyanine 3 or Cy3. When the array is read by the scanner, the differential expression level of a given gene is measured by the difference in intensity level between the red and green channel, at the spot that corresponds to the gene in question.

cDNA microarrays are initially read by a scanner, which produces a TIFF image of the array. These images are then interpreted by one of a number of image analysis software packages, all of which output data in slightly different formats. This system supports analysis of data from the major platforms, including Spotfire, GenePix, BlueFuse and Agilent.

miRNA Arrays

Microarrays can also be used for detection of miRNA expression levels. miRNA are short RNA molecules, generally about 22 nucleotides in length. They are encoded in genes but are not translated into proteins; instead, these molecules down regulate the expression of certain genes. They achieve this by being complementary to specific mRNA molecules created in a cell. The miRNA molecules bind to the complementary sections of these mRNAs and stop them from being converted into proteins.

Exiqon manufacture microarrays for detection of miRNA expression. The spots on these microarrays consist of Locked Nucleic Acid (LNA) probes. LNA is a modified RNA nucleotide that, because of the short length of miRNA molecules, forms a more stable bond with miRNA than standard DNA probes meaning that accuracy of measurements is increased. The miRNA molecules that are being targeted will bind to its complementary LNA probe.

Other than this the processes of labeling the sample with a fluorescent dye, hybridising the sample to the array and reading the hybridisation levels with a scanner are similar to those of other arrays.

Introduction to Genes, DNA and Proteins

Eukaryotes are organisms whose cells are organized into complex structures enclosed within membranes. Most eukaryotic organisms, for example, human beings, contains billions of individual cells. Almost all of these cells contain, within each nucleus, the entire genome for that organism. This genome contains the organism’s complete hereditary information in the form of deoxyribonucleic acid (DNA), that encodes a complete blueprint for all activities and structures within the organism.

In the human body, the genome consists of 23 pairs of chromosomes. One of each of this pair is inherited from the mother and the other from the father. Each chromosome is made of chains of DNA. DNA consists of two polymers (large molecules of repeating subunits) made up of units called nucleotides, these molecules are wrapped around each other in a structure known as a double helix. Each nucleotide consists of a deoxyribose sugar, a phosphate group and one of the four nitrogen bases, guanine, adenine, thymine and cytosine. These bases, which are usually represented by their first letters, G, A, T and C, are where hereditary genetic information is actually encoded. It is worth noting that one of the two strands of the DNA double helix will suffice to describe this information; this is because of complementary base pairing, whereby an A on one strand always binds to a T on the other and a C always binds to a G. The molecular structure of DNA is shown in the animation below.

The Molecular Structure of DNA

Genes are essentially segments of the DNA structure described above. Loosely speaking, a gene is a section of DNA that defines a single trait by encoding a particular pattern, about 27,000 of which exist in humans; more technically, a gene is a locatable region of genomic sequence, corresponding to a unit of inheritance.

The main purpose of genes is to act as a blueprint in the creation of proteins. Proteins are made of amino acids and are responsible for the structure and activity of an organism at a cellular level. They are created as follows; starting at the 5′ end (the leading end) of a gene and proceeding to the 3′ end (the tail end), the information contained in the gene is transcribed into a messenger ribonucleic acid (mRNA) strand. This process is performed by an enzyme called RNA polymerase. After transcription this mRNA molecule leaves the nucleus of the cell where it is transcribed into a protein in a process called translation. This is performed by ribosomes, which read the code carried by mRNA molecules from the cell nucleus and create proteins combining any of the 20 amino acids in the body into complex polypeptide chains. These proteins are the building blocks of the organism. This process of translating a gene into a functional product is known as gene expression.