Comments on “Clinical drug response can be predicted using baseline gene expression levels and in vitro drug sensitivity in cell lines”

I recently co-authored a paper in Genome Biology entitled “Clinical drug response can be predicted using baseline gene expression levels and in vitro drug sensitivity in cell lines”. The paper proposes a solution to a problem that has a long and sordid history in pharmacogenomics (see “Deception at Duke” on YouTube).

Because of continued work on these and similar data, some things have come to my attention since the publication that I would like to discuss here. I would also like to give some additional background on how the project came about.

The project did not begin with in vivo prediction in mind. In fact, I had been working on a very large dataset that we had received from a collaborator at a major pharmaceutical company, a component of which was a large panel of cell lines that had been treated with one of the drugs in CGP. I was interested in testing how well I could predict drug sensitivity in this dataset, using models derived on CGP; this was an interesting question given the highly divergent microarray platforms used by these studies (a two-channel array platform Vs a single-channel Affymetrix array). I informally tested a few methods, achieving decent results, until I stumbled across two papers in the literature which had compared the performance of a host of methods for their ability to predict survival phenotypes from gene expression microarray data. One method, ridge regression, was identified by both studies as the top performing method, so I proceeded to apply it to my cell line data, where I got results that were substantially better than previous methods that I had applied. At the same time, I had also been investigating methods for integrating data from different microarray platforms, but luckily there had also been a large comparison published for these, which recommended the ComBat function from the R library SVA. In conjunction with these methods I had been long familiar with the concept of remapping microarray probes to the latest build of the genome using the BrainArray data. Thus, these methods formed the bones of my pipeline and they performed very well in my panel of cell lines. While working on the panel of cell lines it also came to my attention that removing genes with very low variance seemed to cause a very slight performance increase, which would be in line with biological expectations. As there are no established methods of removing low variance genes (that I am aware of) I simply removed the lowest 20% of genes, based on a visual inspection of a histogram of the variance of all genes in CGP. I should note that these results on the panel of cell lines did not make it into the paper because these were unpublished data obtained for an entirely different project, but I will happily share the code if you are interested and email me!

So, with this pipeline in place (containing components that the literature suggested were as strong as they could be) and having recently become aware of the “Potti Scandal”, I decided to give the in vivo prediction a shot. I obtained the Docetaxel dataset from Potti’s flagship Nature Medicine paper. To my considerable surprise, the pipeline yielded a significant result with the expected directionality. I was quite amazed by this and decided that I should obviously try to reproduce this result in other clinical trials. The next suitable trial that I identified was for Bortezomib in Myeloma and in this case I was again surprised to find results that were even more significant than previously (although the sample size was considerably larger). However, my optimism was subsequently tapered somewhat by null results in a Cisplatin clinical trial in breast cancer and for Erlotinib and Sorafinib in lung cancer. This led to investigations. A possible reason for the Cisplatin result is that there appeared to be issues with the clinical data, possibly because the drug isn’t typically used to treat breast cancer and variability in drug response is low; this and related issues are discussed at length in our paper.

However, I noted that the distribution of drug sensitivity data in Erlotinib and Sorafinib (in the CGP cell line training-data) were drastically different from those of Docetaxel, Bortezomib and Cisplatin, with a very obvious trend towards only a few cell lines responding to these drugs (interestingly, also consistent with biological expectation), rather than the far more uniform distribution of the previous data. This led me to hypothesis that there may be issues with fitting a linear model to these data (discussed at length in the paper) and as an alternative approach, I dichotomized the drug response data and fit a logistic regression model, while keeping as much of the rest of the pipeline in tact as was possible. The data were dichotomized using the number of cell lines that achieved a measurable IC50 on one end and a large number of resistant cell lines on the other – but I would suggest that more transparent and robust methods should be developed in future. However, again to my surprise, this immediately yielded a significant result for Erlotiinib. A null result was achieved for Sorafenib, but this also leads an interesting discussion point. One very plausible explanation is that it is thought that one of the main mechanisms for the anti-tumor activity of Sorafenib in vivo is as an angiogenesis inhibitor. This is achieved by inhibiting VEGF. Thus, a cell viability assay may not be a good model for the in vivo activity of this drug (and potentially other drugs). Considering whether the in vitro model is reflective of the expected in vivo biology is a very important concern and one that should always be strongly considered in pursuing such lines of research.

There are some potential criticisms of the Erlotinib result, particularly that it was only demonstrated on a single dataset and this is where I would ask readers of the paper to exercise caution when interpreting this result. The number of samples is small (only 25) and we need to wait and see if this result will generalize to larger datasets (which are apparently coming soon from the BATTLE trial), however, there are a number of reasons that this result was still definitely worth including in this paper. Firstly, using this example, we have highlighted an important difference in the distribution of drug response phenotype across a panel of cell lines for highly-targeted versus more-broadly-cytotoxic drugs. This is a very important consideration for the choice of model / machine learning algorithm and should absolutely not be ignored by future investigators who attempt similar analysis. Perhaps most importantly, the model worked right out of the bag and is thus impossible to ignore. There are some issues, but a larger dataset will eventually put the issue to rest one way or another. One interesting side note is that it has recently come to my attention that the same model actually also performs well in predicting drug sensitivity in the Erlotinib CCLE data, which I have been working with recently as part of a new project.

There have been some interested parties in terms of the choice of ridge regression. It is important that I re-iterate that, strong literature support should arguably be considered the most unbiased means by which to chose an algorithm for such a study. Given the existing support, it is baffling to me that this algorithm hasn’t been used for prediction of pharmacogenomic phenotypes. As a side point, I would like to emphasize that the comparisons to Lasso and ElasticNet regression were included because a “comparison of methods” was requested by all three reviewers, a request that I have some problems with, given the literature support for ridge regression. I should also emphasize, that no algorithm other than ridge regression was ever applied to the clinical datasets prior to these reviewers requests, as the inclusion of these results could easily lead to a dangerous accusation of selecting methods based on performance in the datasets, which is likely to lead to a useless model.


I hope that the future will lead us to identify further datasets on which this method can be tested. My dream is that substantial progress can be made in terms of personalized cancer therapy by using statistical models constructed on models system (be they cell lines or anything else). This will require focused attention to best practices in machine learning and data mining. We hope that we have taken an important first step and are excited to see a productive future for this field.

The Paper

All code and data to reproduce results

Split a chromosome name and location type string into its constituent parts in R

Easy to do with a regular expression and the strsplit function in R. The | operator means “or”. For example for “chr15:88120587-88121480″:

> unlist(strsplit("chr15:88120587-88121480", "chr|:|-"))[2:4]
[1] "15" "88120587" "88121480"

Principal Components Analysis Explained using R

Here, we will explain principle component analysis (PCA) by stepping though the algorithm manually and reproducing the output of the prcomp() function in R, which is normally used to do PCA.

First make up some data and plot it; in terms of gene expression analysis, we can think of the rows of the matrix below as different samples (or microarray experiments etc.), with 2 genes each. Its much easier to explain PCA for two dimensions and then generalise from there.
x <- c(1,3,6,8,9,10,15,12,17)
y <- c(10,14,12,17,19,20,20,24,19)
m <- cbind(x, y)
plot(m, pch=seq(9))

Here’s how you’d make a PCA plot for this data using R’s inbuilt functions (We’re about to learn how to do this for ourselves.)
pca_proc <- prcomp(m)
plot(pca_proc$x, pch=seq(9)) #the pca plot
plot(pca_proc) #the proportion of variance capture by each PC

Now to do PCA manually, first step, center data around zero, then plot the results.
x1 <- x - mean(x)
y1 <- y - mean(y)
plot(x1, y1, xlim=c(-40, 40), ylim=c(-40, 20))

Next, calculate the covariance matrix for vectors x1 and y1 and plot the column vectors, these column vectors describe the direction of variability in the data, i.e how similar is cov(x, x) with cov(x, y) and how similar is cov(y, y) with cov(y, x). Note that the covariance of x with itself is the same as the variance of x (i.e. var(x) == cov(x, x)). More info on covariance matrix.
m1 <- cbind(x1, y1)
cM <- cov(m1)
lines(x=c(0, cM[1,1]), y=c(0, cM[2,1]), col="blue")
lines(x=c(0, cM[1,2]), y=c(0, cM[2,2]), col="blue")

As the cov(x, y) can never be greater than cov(x,x), one of the lines plotted above will always be above the diagonal and one will be below, plot the diagonal (i.e. a line through the origin with a slope of 1).
abline(0, 1, col="grey")

While the vectors we have plotted above describe the direction of the variability in the data, they do not do so in a way that is particularly useful. By finding the eigenvectors of the covariance matrix, we can describe the variability in terms of two orthogonal vectors which capture the direction of variation, instead of the two vectors that we are currently plotting. For more info on eigenvectors you’ll need to study some basic linear
algebra, this can be done on Khan Academy, of particular interest will be the videos on matrix multiplication and obviously eigenvectors. This is by far the most difficult part of PCA to understand.
eigenVecs <- eigen(cM)$vectors
lines(x=c(0, eigenVecs[1,1]), y=c(0, eigenVecs[2,1]), col="red")
lines(x=c(0, eigenVecs[1,2]), y=c(0, eigenVecs[2,2]), col="red")

As the eigenvectors are unit vectors (i.e. of length 1) it may be easier to visualize how much each of them influences the data if we multiply them by their corresponding eigenvalues, which represent the proportion of variability explained by each eigenvector.
eVal1 <- eigen(cM)$values[1]
eVal2 <- eigen(cM)$values[2]
lines(x=c(0, eVal1*eigenVecs[1,1]), y=c(0, eVal1*eigenVecs[2,1]), col="red")
lines(x=c(0, eVal2*eigenVecs[1,2]), y=c(0, eVal2*eigenVecs[2,2]), col="red")

This matrix contains the eigenvectors and we want to display the data in terms of these eigenvectors. In this case we will select both eigenvectors, but on high dimensional datasets, it is normal to chose a subset of eigenvectors.
rowFeatVec <- t(eigenVecs)
rowDataAdj <- t(m1)

Finally use matrix multiplcation to get the dot product between each eigenvector and each point in the original centered data matrix (m1). This operation describes to what degree each point is influenced by each eigenvector Plot this and that’s the final plot.
Note, %*% is the matrix multiplication operator in R.

transFData <- rowFeatVec %*% rowDataAdj
finalData <- rowFeatVec %*% rowDataAdj
plot(t(finalData), pch=seq(9))

Finally, to plot the equivalent of the scree plot we made above, simply plot the eigen values.

Writing a group of R data.frames to named Excel Worksheets

First install “WriteXLS” library in R using:


Now if I have 3 data.frames called “genes”, “proteins” and “elephants”, all I need to do to write them to the same Excel file, on different *named* worksheets is:

WriteXLS(c("genes", "proteins", "elephants"), "WriteXLS.xls")

Note, for this to work, Perl needs to be installed. You also need the “” library. On Ubuntu this can be installed by issuing the command:

sudo apt-get install libtext-csv-xs-perl

Use “” to map between Entrez Gene Ids and HUGO gene symbols in R

To translate between these identifiers in R, this code creates a table with the mapping:

e2s = toTable(org.Hs.egSYMBOL)

To have a look at the mapping (which is stored as a data.frame)


This will output something like this
gene_id symbol
1 1 A1BG
2 2 A2M
3 3 A2MP1
4 9 NAT1
5 10 NAT2
6 11 AACP

Now to map a set of Ids you have stored in a vector called “entrezGenes” to HUGO:

entrezGenes <- c("57573", "8563" , "8434")
geneSymbols <- e2s$symbol[e2s$gene_id %in% entrezGenes]


In R, list all files in current directory ending with “.CEL”

Use regular expression matching, where the “$” means that this string ends with the preceding string:


Use biomaRt to tranlate HUGO to Entrez gene Ids.

We can use the R package biomaRt to conveniently convert between different types of gene ids. In this example we will convert official HUGO gene names to entrez gene ids.

First we load biomaRt in R using the current ensembl database for human:
ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl")

Now get the mapping, myHugoGeneNames is a vector of HUGO gene names. As this mapping will not tend to be unique, make sure "uniqueRows=FALSE" is included in the query.
mapTab <- getBM(attributes = c("hgnc_symbol", "entrezgene"), filters = "hgnc_symbol", values = myHugoGeneNames, mart = ensembl, uniqueRows=FALSE)

This query will return a table with HUGO gene names in the first column and corresponding Entrez gene id in the 2nd column. As the mapping isn't unique, you may wish to remove duplicates, which can be done as follows:

dupRows <- union(which(duplicated(mapTab[,1])), which(duplicated(mapTab[,2])))
entrezIds <- mapTab[-dupRows, 2]
names(entrezIds) <- mapTab[-dupRows, 1]

The object "entrezIds" now contains a unique mapping of the gene ids.

If you wish to map different kinds of Ids, these function will be of use in identifying what identifiers are available in biomaRt:


See all users jobs in Sun Gridengine

This info is hard to find, but to see what jobs are being run by other users on Sun Gridengine:

qstat -u "*"

Find files in linux command line using wildcards

To find a file whose name contains the text “lostfile” in the current directory and all subdirectories:

find . -name \*lostfile\*

To search the filesystem for such a file simply do:

find / -name \*lostfile\*

Installing R, Rmpi & Bioconductor on Beowulf Cluster Running ClusterVision OS

We recently purchased a new computer cluster here and I was tasked with installing R & Rmpi on the thing. Not an especially easy task, so here’s how you do it.

Firstly we need to install R on the slave image. The slave image is basically a directory tree contained on the master node that the slaves/nodes download and use as their filesystem when they boot up.

So to install files in the slave image, firstly log in as root. If you have a system setup with a separate login node, first exit from the login node by typing “exit”, to get to the master node.

On our system /cm/images/default-image/ is the location of the slave image, which is different from what it says in our manual, apparently they moved it and haven’t bothered to update the manual so I had to email support to find this. The following code tells the yum package manager to install R in the slave image’s directory tree.

yum --installroot=/cm/images/default-image/ install R.x86_64

Now to install the Bioconductor packages in the installation of R we have just created we need to change the root directory to the slave image, initiate R and download and install the packages.

[root@cluster]# R
> source("")
> biocLite()

On our cluster, some packages that certain Bioconductor libraries required were not installed, so we needed to install them as follows before installing Bioconductor as above. You may have to do something similar when installing on the master node (see below).

yum --installroot=/cm/images/default-image/ install gcc-gfortran.x86_64 libxml2-devel.x86_64 curl-devel.x86_64

The slave nodes need to now be restarted for them to download and install the updated slave image. This took about 5 minutes on our system. It can be achieved by entering cmsh and using the following command to restart the slaves.

[root@cluster]# cmsh
% device power reset -c slave

Next we must install R and Rmpi on the master node. First we type “exit” to undo what was done with the “chroot” command above, we now have a normal shell again. So install R and Rmpi:

[root@cluster]# yum install R.x86_64
[root@cluster]# R
> source("")
> biocLite("Rmpi")

On our system which uses the Sun GridEngine queuing system, to submit a job we use the following shell script to submit a script called “task_pull.R “. This shell script may be quite different based on your systems setup or if you are using the PBS, but the ClusterVision users guide is a good place to start. I’ve included comments in the script below.

# Your job name
#$ -N My_Job
# Use current working directory
#$ -cwd
# pe (Parallel environment) request. Set your number of processors here.
#$ -pe mpich 15
# Run job through bash shell
#$ -S /bin/bash
# If modules are needed, source modules environment:
. /etc/profile.d/
# Add any modules you might require:
module add shared mvapich2 openmpi
# The following output will show in the output file. Used for debugging.
echo ``Got $NSLOTS processors.''
echo ``Machines:''
cat $TMPDIR/machines
#Merge the standard out and standard error to one file
#$ -j y
mpirun -np 1 R --slave CMD BATCH task_pull.R

Now to submit the job to the cluster type the following line of code at the command prompt, where “all.q” is the name of the queue you wish to submit to and is the name of the shell script above.

qsub -q all.q

The sample code that I’m running (“task_pull.R”) is found here: There are some useful examples on this site and it is a good guide to MPI in R, although doesn’t provide information on the setup of the system.

Keep it real.