Comments on “Clinical drug response can be predicted using baseline gene expression levels and in vitro drug sensitivity in cell lines”

The opinions expressed here are those of only Paul Geeleher and not my colleagues or institution.

I recently co-authored a paper in Genome Biology entitled “Clinical drug response can be predicted using baseline gene expression levels and in vitro drug sensitivity in cell lines”. The paper proposes a solution to a problem that has a long and sordid history in pharmacogenomics (see “Deception at Duke” on YouTube).

Because of continued work on these and similar data, some things have come to my attention since the publication that I would like to discuss here. I would also like to give some additional background on how the project came about.

BACKGROUND
The project did not begin with in vivo prediction in mind. In fact, I had been working on a very large dataset that we had received from a collaborator at a major pharmaceutical company, a component of which was a large panel of cell lines that had been treated with one of the drugs in CGP. I was interested in testing how well I could predict drug sensitivity in this dataset, using models derived on CGP; this was an interesting question given the highly divergent microarray platforms used by these studies (a two-channel array platform vs. a single-channel Affymetrix array). I informally tested a few methods, achieving decent results, until I stumbled across two papers in the literature that had compared a host of methods on their ability to predict survival phenotypes from gene expression microarray data. One method, ridge regression, was identified by both studies as the top-performing method, so I proceeded to apply it to my cell line data, where I got results that were substantially better than with the previous methods I had applied. At the same time, I had also been investigating methods for integrating data from different microarray platforms; luckily, a large comparison had also been published for these, which recommended the ComBat function from the R package sva. In conjunction with these methods, I had long been familiar with the concept of remapping microarray probes to the latest build of the genome using the BrainArray data. Thus, these methods formed the bones of my pipeline, and they performed very well on my panel of cell lines. While working on the panel of cell lines, it also came to my attention that removing genes with very low variance seemed to cause a very slight performance increase, which would be in line with biological expectations.
As there are no established methods for removing low-variance genes (that I am aware of), I simply removed the lowest 20% of genes, based on a visual inspection of a histogram of the variance of all genes in CGP. I should note that these results on the panel of cell lines did not make it into the paper because they were unpublished data obtained for an entirely different project, but I will happily share the code (including some other possibly embarrassing early efforts) if you are interested; just email me!
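The low-variance filter described above can be sketched in a few lines of R. This is an illustrative reconstruction, not the paper's actual code: exprMat is a hypothetical genes-by-samples matrix, and the 20% cutoff is the value mentioned in the text.

```r
# Illustrative sketch of the low-variance gene filter (hypothetical data).
set.seed(1)
exprMat <- matrix(rnorm(1000 * 10), nrow=1000,
                  dimnames=list(paste0("gene", 1:1000), paste0("sample", 1:10)))

geneVars <- apply(exprMat, 1, var)        # per-gene variance across samples
cutoff   <- quantile(geneVars, 0.2)       # the 20th percentile of variances
exprFilt <- exprMat[geneVars > cutoff, ]  # keep genes above the cutoff
```

In practice, one would inspect hist(geneVars) before committing to a cutoff, as described in the text.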

So, with this pipeline in place (containing components that the literature suggested were as strong as they could be) and having recently become aware of the “Potti Scandal”, I decided to give in vivo prediction a shot. I obtained the Docetaxel dataset from Potti’s flagship Nature Medicine paper. To my considerable surprise, the pipeline yielded a significant result with the expected directionality. I was quite amazed by this and decided that I should obviously try to reproduce the result in other clinical trials. The next suitable trial that I identified was for Bortezomib in Myeloma, and in this case I was again surprised to find results that were even more significant than previously (although the sample size was considerably larger). However, my optimism was subsequently tempered somewhat by null results in a Cisplatin clinical trial in breast cancer and for Erlotinib and Sorafenib in lung cancer. This led to prolonged investigations. A possible reason for the Cisplatin result is that there appeared to be issues with the clinical data, possibly because the drug isn’t typically used to treat breast cancer and variability in drug response is very low; this and related issues are discussed at length in our paper.

However, I noted that the distributions of drug sensitivity data for Erlotinib and Sorafenib (in the CGP cell line training data) were drastically different from those of Docetaxel, Bortezomib and Cisplatin, with a very obvious trend towards only a few cell lines responding to these drugs (interestingly, also consistent with biological expectation), rather than the far more uniform distribution of the previous data. This led me to hypothesize that there may be issues with fitting a linear model to these data (discussed at length in the paper) and, as an alternative approach, I dichotomized the drug response data and fit a logistic regression model, while keeping as much of the rest of the pipeline intact as was possible. The data were dichotomized using the number of cell lines that achieved a measurable IC50 on one end and a large number of resistant cell lines on the other, but I would suggest that more transparent and robust methods should be developed in future. However, again to my surprise, this immediately yielded a significant result for Erlotinib. A null result was achieved for Sorafenib, but this also leads to an interesting discussion point. One very plausible explanation is that one of the main mechanisms for the anti-tumor activity of Sorafenib in vivo is thought to be angiogenesis inhibition, achieved by inhibiting VEGF signaling. Thus, a cell viability assay may not be a good model for the in vivo activity of this drug (and potentially other drugs). Considering whether the in vitro model is reflective of the expected in vivo biology is a very important concern and one that should always be strongly considered in pursuing such lines of research.
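The dichotomize-then-logistic-regression idea can be sketched as follows. This is a minimal illustration on simulated data, not the paper's thresholds or code: the variable names, the single toy predictor gene, and the 30% sensitivity cutoff are all invented for the example.

```r
# Illustrative sketch: dichotomize simulated IC50 values and fit a
# logistic regression model (assumed toy data, not CGP values).
set.seed(2)
n <- 100
geneExpr <- rnorm(n)                    # one hypothetical predictor gene
ic50     <- 5 - 2 * geneExpr + rnorm(n) # lower IC50 = more sensitive

# Label the most sensitive 30% of lines as "sensitive" (arbitrary cutoff)
sensitive <- as.numeric(ic50 < quantile(ic50, 0.3))

fit <- glm(sensitive ~ geneExpr, family=binomial)
summary(fit)$coefficients
```

In the real analysis the cutoff was instead informed by the number of cell lines achieving a measurable IC50, as described above.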

There are some potential criticisms of the Erlotinib result, particularly that it was only demonstrated on a single dataset, and this is where I would ask readers of the paper to exercise caution when interpreting this result. The number of samples is small (only 25) and we need to wait and see if this result will generalize to larger datasets (which are apparently coming soon from the BATTLE trial); however, there are a number of reasons that this result was still definitely worth including in this paper. Firstly, using this example, we have highlighted an important difference in the distribution of the drug response phenotype across a panel of cell lines for highly targeted versus more broadly cytotoxic drugs. This is a very important consideration for the choice of model / machine learning algorithm and should absolutely not be ignored by future investigators who attempt similar analyses. Perhaps most importantly, the model worked right out of the box and is thus impossible to ignore. There are some issues, but a larger dataset will eventually put the issue to rest one way or another. One interesting side note is that it has recently come to my attention that the same model actually also performs well in predicting drug sensitivity in the Erlotinib CCLE data, which I have been working with recently as part of a new project.

CHOICE OF MACHINE LEARNING ALGORITHM
There has been some interest in the choice of ridge regression. It is important that I reiterate that strong literature support should arguably be considered the most unbiased means by which to choose an algorithm for such a study. Given the existing support, it is baffling to me that this algorithm has been ignored for prediction of pharmacogenomic phenotypes. Why? I don’t know. As a side point, I would like to emphasize that the comparisons to Lasso and ElasticNet regression were included because a “comparison of methods” was requested by all three reviewers, a request that I have some problems with, given the literature support for ridge regression. I should also emphasize that no algorithm other than ridge regression was ever applied to the clinical datasets prior to these reviewers’ requests, as the inclusion of such results could easily invite the dangerous accusation of selecting methods based on their performance in the test datasets, a practice likely to lead to a useless model.
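For readers unfamiliar with fitting ridge regression in R, here is a minimal sketch on simulated data. This is an illustration only, not the paper's implementation: it uses MASS::lm.ridge (MASS ships with R), whereas glmnet with alpha=0 is a commonly used alternative.

```r
# Minimal ridge regression sketch on simulated data (MASS ships with R).
library(MASS)
set.seed(3)
n <- 50; p <- 20
X <- matrix(rnorm(n * p), n, p)        # toy predictor matrix
beta <- rnorm(p)
y <- as.vector(X %*% beta + rnorm(n))  # toy continuous response

ridgeFit <- lm.ridge(y ~ X, lambda=1)  # lambda is the shrinkage penalty
coef(ridgeFit)[1:5]                    # intercept plus first 4 coefficients
```

In practice, lambda would be chosen by cross-validation within the training data rather than set by hand.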

THE FUTURE

I hope that the future will lead us to identify further datasets on which this method can be tested. My dream is that substantial progress can be made in terms of personalized cancer therapy by using statistical models constructed on model systems (be they cell lines or anything else). This will require focused attention to best practices in machine learning and data mining. We hope that we have taken an important first step and are excited to see a productive future for this field.

The Paper

All code and data to reproduce results

The economic cost of releasing unusable academic software

Birdsuite is an application released by the Broad (Harvard/MIT) for analysing Affy SNP 6.0 array data. It was published in Nature Genetics. The installation process is shockingly bad and has recently caused me an indescribable amount of utterly needless frustration. It’s a little surprising that a journal like Nature Genetics can justify publishing a piece of software in this condition, and you have to wonder whether other institutions would be able to release something like this in a top journal. Personally, I would recommend avoiding this software if at all possible. You will very likely end up tearing your hair out.

An interesting question arises from this. What is the financial cost of this type of sloppy work? Birdsuite has about 500 citations on Google Scholar. So say three times that many people attempted to install it, giving 1500 people (a conservative estimate). Now let’s say that each of these people spent 12 hours trying to get it to work (which is surprisingly/worryingly realistic) and you value their time at 30 dollars an hour (probably an underestimate for a PhD-level scientist). This gives a total of 12 × 30 × 1500 = $540,000. Over half a million dollars, sucked out of the economy, just like that :(

For a far more straightforward solution, I ended up getting it to work using “Affymetrix Genotyping Console”. It’s a good piece of software with good documentation. It works seamlessly, does quality control, outputs data in a usable and annotated format, and can even output in Plink format. Thumbs up, good job Affymetrix!!

Split a chromosome name and location type string into its constituent parts in R

Easy to do with a regular expression and the strsplit function in R. The | operator means “or”. For example, for "chr15:88120587-88121480":

> unlist(strsplit("chr15:88120587-88121480", "chr|:|-"))[2:4]
[1] "15" "88120587" "88121480"

Principal Components Analysis Explained using R

Here, we will explain principal component analysis (PCA) by stepping through the algorithm manually and reproducing the output of the prcomp() function in R, which is normally used to do PCA.

First, make up some data and plot it; in terms of gene expression analysis, we can think of the rows of the matrix below as different samples (or microarray experiments etc.), with 2 genes each. It’s much easier to explain PCA for two dimensions and then generalise from there.
x <- c(1,3,6,8,9,10,15,12,17)
y <- c(10,14,12,17,19,20,20,24,19)
m <- cbind(x, y)
plot(m, pch=seq(9))

Here’s how you’d make a PCA plot for this data using R’s inbuilt functions (we’re about to learn how to do this for ourselves):
pca_proc <- prcomp(m)
plot(pca_proc$x, pch=seq(9)) #the pca plot
plot(pca_proc) #the proportion of variance captured by each PC

Now to do PCA manually. The first step is to center the data around zero, then plot the results.
x1 <- x - mean(x)
y1 <- y - mean(y)
plot(x1, y1, xlim=c(-40, 40), ylim=c(-40, 20))

Next, calculate the covariance matrix for vectors x1 and y1 and plot its column vectors. These column vectors describe the direction of variability in the data, i.e. how similar cov(x, x) is to cov(x, y) and how similar cov(y, y) is to cov(y, x). Note that the covariance of x with itself is the same as the variance of x (i.e. var(x) == cov(x, x)).
m1 <- cbind(x1, y1)
cM <- cov(m1)
lines(x=c(0, cM[1,1]), y=c(0, cM[2,1]), col="blue")
lines(x=c(0, cM[1,2]), y=c(0, cM[2,2]), col="blue")

Note that cov(x, y) is not in general bounded by cov(x, x), but in this example it is smaller than both variances, so one of the lines plotted above lies above the diagonal and one below. Plot the diagonal (i.e. a line through the origin with a slope of 1).
abline(0, 1, col="grey")

While the vectors we have plotted above describe the direction of the variability in the data, they do not do so in a way that is particularly useful. By finding the eigenvectors of the covariance matrix, we can describe the variability in terms of two orthogonal vectors which capture the direction of variation, instead of the two vectors that we are currently plotting. For more info on eigenvectors you’ll need to study some basic linear algebra; this can be done on Khan Academy, where the videos on matrix multiplication and (obviously) eigenvectors will be of particular interest. This is by far the most difficult part of PCA to understand.
eigenVecs <- eigen(cM)$vectors
lines(x=c(0, eigenVecs[1,1]), y=c(0, eigenVecs[2,1]), col="red")
lines(x=c(0, eigenVecs[1,2]), y=c(0, eigenVecs[2,2]), col="red")

As the eigenvectors are unit vectors (i.e. of length 1), it may be easier to visualize how much each of them influences the data if we multiply them by their corresponding eigenvalues, which are proportional to the amount of variability explained by each eigenvector.
eVal1 <- eigen(cM)$values[1]
eVal2 <- eigen(cM)$values[2]
lines(x=c(0, eVal1*eigenVecs[1,1]), y=c(0, eVal1*eigenVecs[2,1]), col="red")
lines(x=c(0, eVal2*eigenVecs[1,2]), y=c(0, eVal2*eigenVecs[2,2]), col="red")

This matrix contains the eigenvectors, and we want to display the data in terms of these eigenvectors. In this case we will keep both eigenvectors, but on high-dimensional datasets it is normal to choose a subset of eigenvectors.
rowFeatVec <- t(eigenVecs)
rowDataAdj <- t(m1)

Finally, use matrix multiplication to get the dot product between each eigenvector and each point in the original centered data matrix (m1). This operation describes to what degree each point is influenced by each eigenvector. Plot this, and that’s the final plot.
Note, %*% is the matrix multiplication operator in R.

finalData <- rowFeatVec %*% rowDataAdj
plot(t(finalData), pch=seq(9))

Finally, to plot the equivalent of the scree plot we made above, simply plot the eigenvalues.
barplot(eigen(cM)$values)

Writing a group of R data.frames to named Excel Worksheets

First, install the “WriteXLS” package, which is available from CRAN:

install.packages("WriteXLS")

Now if I have 3 data.frames called “genes”, “proteins” and “elephants”, all I need to do to write them to the same Excel file, on different *named* worksheets is:

library("WriteXLS")
WriteXLS(c("genes", "proteins", "elephants"), "WriteXLS.xls")

Note, for this to work, Perl needs to be installed. You also need the Text::CSV_XS Perl module. On Ubuntu this can be installed by issuing the command:

sudo apt-get install libtext-csv-xs-perl

Evolution, exercise, diet and why so many people have crooked teeth.

Next time you walk down a crowded street, keep track of how many people look “healthy”. Why are so many people obese or underweight, suffering from acne or any number of other diseases? We can answer this question by introducing you to the Kitava, a hunter-gatherer tribe who live in Papua New Guinea. They suffer virtually none of the ailments that afflict modern humans. You heard me right – NONE. The answer as to why is very simple: your genes are at odds with your environment. Evolution has, over the course of millions of years, programmed your genes to expect certain “inputs”, so to speak. Your body needs certain nutrients from different kinds of foods, like fresh seasonal fruit and vegetables; foods like grains, dairy, legumes, refined sugar and alcohol are (almost) totally alien to your genes, and that’s why they make you sick. Not necessarily violently sick straight away, but over the course of years these foods are damaging people at a subclinical level. Your body expects to be exercised, to lift heavy weights and to run on a daily basis. Your genes want you to interact socially, to experience a certain amount of sunlight exposure (to synthesise Vit D – which you are almost certainly deficient in). Take these basic things away and you end up overweight, depressed, tired, weak and sick.

To me the smoking gun is crooked teeth, something that evolution should obviously have weeded out. But the thing is, evolution DID weed this out, but only in the ancestral environment, when your genes received the stimulus they expected (in terms of nutrition and exercise). When these are lacking, what happens? Your jawbone does not develop properly and you end up needing braces. Later in life, the same issues will lead to all of the other problems we just accept as symptoms of growing old.

So the message is, look around at the sorry state of modern man, then at your ancestral roots and adjust your lifestyle accordingly; i.e. go to the gym, eat your veggies, run around in the sun, enjoy the regular social outings and stay out of Pizza Hut!

Further Reading:
The Paleo Solution
The Primal Blueprint

A selfish gene theory based explanation of rapport-seeking behavior and its implications for social hierarchy in humans.

Coming soon…

A selfish gene theory based explanation as to why we find baby animals “cute”.

Forewarning: to understand this article, you’re probably going to have to have read Richard Dawkins’ book “The Selfish Gene”.

Kittens/puppies/ducklings/foals are undeniably cute and lovable. But why do we all think this? Ostensibly, it seems that thinking of a duckling as “cute” makes no more sense than finding a pile of bricks cute. What is it about these wee critters that causes a seemingly universal emotional response in humans?

One might argue that they remind us of our own offspring, or that their helplessness is appealing, but again, there should be an identifiable reason that these responses are universal.

I argue that, in the context of selfish gene theory, this emotional response could be explained as a mechanism that prevents you from eating the young of closely related species (or, more specifically, other survival machines who share a large proportion of your genes). From your genes’ point of view, ~90% of which are shared with a kitten, it makes no sense for you to eat a younger animal for dinner, given an available alternative. If your genes were to allow this, it would mean removing this particular “survival machine” from the population before it has reached an age where it can reproduce and has had a chance to propagate its own genetic material, which would not give this particular combination of genetic material a fair shot at propagation (as opposed to a mature animal, who has had its chance). For the ~90% of homologous genes shared between you and a cat/kitten, it is more advantageous for THOSE GENES that you eat the older animal (who has reached a reproductive age) for brunch/supper. The older “survival machine” has had a fairer chance to produce offspring, as well as being in a fairer position to avoid being eaten (if the particular combination of genes which it possesses has made it smart enough).

Hence our genes program this universal emotional response. And, as it happens, it seems that this response also exists in other species!

Use “org.Hs.eg.db” to map between Entrez Gene Ids and HUGO gene symbols in R

To translate between these identifiers in R, this code creates a table with the mapping:

library(org.Hs.eg.db)
e2s = toTable(org.Hs.egSYMBOL)

To have a look at the mapping (which is stored as a data.frame)

head(e2s)

This will output something like this:
gene_id symbol
1 1 A1BG
2 2 A2M
3 3 A2MP1
4 9 NAT1
5 10 NAT2
6 11 AACP

Now to map a set of Ids you have stored in a vector called “entrezGenes” to HUGO:

entrezGenes <- c("57573", "8563" , "8434")
geneSymbols <- e2s$symbol[e2s$gene_id %in% entrezGenes]

NOTE: THIS DOES NOT PRESERVE ORDER.
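If you need the symbols returned in the same order as your input IDs, use match() instead of %in%. Here is a self-contained sketch with a toy mapping table standing in for e2s; the symbols are invented placeholders, not real HUGO symbols.

```r
# Sketch: match() returns, for each query ID, the row index of its first
# occurrence in the mapping table, so the output preserves query order.
# (Toy table with placeholder symbols, standing in for the real e2s.)
e2s <- data.frame(gene_id = c("8434", "8563", "57573"),
                  symbol  = c("GENEA", "GENEB", "GENEC"),
                  stringsAsFactors = FALSE)
entrezGenes <- c("57573", "8563", "8434")

geneSymbols <- e2s$symbol[match(entrezGenes, e2s$gene_id)]
geneSymbols  # symbols in the same order as entrezGenes
```

IDs with no entry in the table come back as NA, which also makes missing mappings easy to spot.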

In R, list all files in current directory ending with “.CEL”

Use regular expression matching, where "\\." matches a literal dot and "$" anchors the pattern to the end of the filename:

dir(pattern="\\.CEL$")