Comments on “Clinical drug response can be predicted using baseline gene expression levels and in vitro drug sensitivity in cell lines”

The opinions expressed here are those of only Paul Geeleher and not my colleagues or institution.

I recently co-authored a paper in Genome Biology entitled “Clinical drug response can be predicted using baseline gene expression levels and in vitro drug sensitivity in cell lines”. The paper proposes a solution to a problem that has a long and sordid history in pharmacogenomics (see “Deception at Duke” on YouTube).

Because of continued work on these and similar data, some things have come to my attention since the publication that I would like to discuss here. I would also like to give some additional background on how the project came about.

BACKGROUND
The project did not begin with in vivo prediction in mind. In fact, I had been working on a very large dataset that we had received from a collaborator at a major pharmaceutical company, a component of which was a large panel of cell lines that had been treated with one of the drugs in CGP. I was interested in testing how well I could predict drug sensitivity in this dataset using models derived from CGP; this was an interesting question given the highly divergent microarray platforms used by these studies (a two-channel array platform vs. a single-channel Affymetrix array). I informally tested a few methods, achieving decent results, until I stumbled across two papers in the literature that had compared a host of methods for their ability to predict survival phenotypes from gene expression microarray data. One method, ridge regression, was identified by both studies as the top-performing method, so I applied it to my cell line data, where I got results that were substantially better than those of the previous methods I had tried.

At the same time, I had also been investigating methods for integrating data from different microarray platforms; luckily, a large comparison had also been published for these, which recommended the ComBat function from the R package sva. In conjunction with these methods, I had long been familiar with the concept of remapping microarray probes to the latest build of the genome using the BrainArray data. These methods formed the bones of my pipeline, and they performed very well on my panel of cell lines.

While working on the panel of cell lines, it also came to my attention that removing genes with very low variance seemed to cause a very slight performance increase, which would be in line with biological expectations.
As there are no established methods for removing low-variance genes (that I am aware of), I simply removed the lowest 20% of genes, based on a visual inspection of a histogram of the variance of all genes in CGP. I should note that these results on the panel of cell lines did not make it into the paper because they were unpublished data obtained for an entirely different project, but I will happily share the code if you are interested; just email me!
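For illustration, here is a minimal sketch of the kind of low-variance filter described above. This is not the paper's actual code; the matrix exprMat (genes in rows, samples in columns) is simulated, and the 20% cutoff is applied with quantile() rather than by eye.

```r
# Toy example: drop the 20% of genes with the lowest variance.
# exprMat stands in for a genes-by-samples expression matrix.
set.seed(1)
exprMat <- matrix(rnorm(100 * 20), nrow=100)  # 100 "genes", 20 "samples"

geneVars <- apply(exprMat, 1, var)       # per-gene variance
hist(geneVars)                           # inspect the distribution, as described above
cutoff <- quantile(geneVars, 0.2)        # 20th percentile of the variances
exprFiltered <- exprMat[geneVars > cutoff, ]
dim(exprFiltered)                        # 80 genes remain
```

In practice you would inspect the histogram first and pick a cutoff that matches the shape of your own data rather than hard-coding 20%.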

So, with this pipeline in place (containing components that the literature suggested were as strong as they could be), and having recently become aware of the “Potti Scandal”, I decided to give the in vivo prediction a shot. I obtained the Docetaxel dataset from Potti’s flagship Nature Medicine paper. To my considerable surprise, the pipeline yielded a significant result with the expected directionality. I was quite amazed by this and decided that I should obviously try to reproduce the result in other clinical trials. The next suitable trial that I identified was for Bortezomib in myeloma, and in this case I was again surprised to find results that were even more significant than before (although the sample size was considerably larger). However, my optimism was subsequently tempered somewhat by null results in a Cisplatin clinical trial in breast cancer and for Erlotinib and Sorafenib in lung cancer. This led to further investigation. A possible reason for the Cisplatin result is that there appeared to be issues with the clinical data, possibly because the drug isn’t typically used to treat breast cancer and variability in drug response is low; this and related issues are discussed at length in our paper.

However, I noted that the distributions of drug sensitivity data for Erlotinib and Sorafenib (in the CGP cell line training data) were drastically different from those of Docetaxel, Bortezomib and Cisplatin, with a very obvious trend towards only a few cell lines responding to these drugs (interestingly, also consistent with biological expectation), rather than the far more uniform distribution of the previous data. This led me to hypothesize that there may be issues with fitting a linear model to these data (discussed at length in the paper). As an alternative approach, I dichotomized the drug response data and fit a logistic regression model, while keeping as much of the rest of the pipeline intact as possible. The data were dichotomized using the number of cell lines that achieved a measurable IC50 on one end and a large number of resistant cell lines on the other, but I would suggest that more transparent and robust methods should be developed in future. However, again to my surprise, this immediately yielded a significant result for Erlotinib. A null result was obtained for Sorafenib, but this also leads to an interesting discussion point. One very plausible explanation is that one of the main mechanisms of the anti-tumor activity of Sorafenib in vivo is thought to be angiogenesis inhibition, achieved by blocking VEGF receptor signaling. Thus, a cell viability assay may not be a good model for the in vivo activity of this drug (and potentially of other drugs). Considering whether the in vitro model is reflective of the expected in vivo biology is a very important concern and one that should always be strongly considered when pursuing such lines of research.
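To make the dichotomization idea concrete, here is a toy sketch; this is not the paper's pipeline (which used a regularized model on real CGP data), and everything here is simulated: log IC50s with a small sensitive subpopulation are split into sensitive/resistant groups and a plain glm() logistic regression is fit against a hypothetical marker gene.

```r
set.seed(42)
n <- 100
# Simulated log IC50s: a small sensitive subpopulation and many resistant lines,
# mimicking the skewed distribution seen for targeted agents like Erlotinib.
ic50 <- c(rnorm(15, mean=-2), rnorm(85, mean=2))
sens <- as.numeric(ic50 < 0)          # dichotomize: 1 = sensitive, 0 = resistant

expr <- rnorm(n) + 2 * sens           # hypothetical marker gene, higher in sensitive lines
fit  <- glm(sens ~ expr, family=binomial)
pred <- predict(fit, type="response") # predicted probability of sensitivity
```

With high-dimensional expression data you would of course need a penalized logistic model rather than plain glm(), but the dichotomize-then-classify structure is the same.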

There are some potential criticisms of the Erlotinib result, particularly that it was only demonstrated on a single dataset, and this is where I would ask readers of the paper to exercise caution when interpreting it. The number of samples is small (only 25), and we will need to wait and see whether this result generalizes to larger datasets (which are apparently coming soon from the BATTLE trial). However, there are a number of reasons that this result was still definitely worth including in the paper. Firstly, using this example, we have highlighted an important difference in the distribution of the drug response phenotype across a panel of cell lines for highly targeted versus more broadly cytotoxic drugs. This is a very important consideration for the choice of model / machine learning algorithm and should absolutely not be ignored by future investigators who attempt similar analyses. Perhaps most importantly, the model worked right out of the box and is thus impossible to ignore. There are some issues, but a larger dataset will eventually put them to rest one way or another. One interesting side note is that it has recently come to my attention that the same model also performs well in predicting drug sensitivity in the Erlotinib CCLE data, which I have been working with recently as part of a new project.

CHOICE OF MACHINE LEARNING ALGORITHM
There has been some interest in the choice of ridge regression. It is important that I reiterate that strong literature support should arguably be considered the most unbiased means by which to choose an algorithm for such a study. Given the existing support, it is baffling to me that this algorithm hadn’t previously been used for prediction of pharmacogenomic phenotypes. As a side point, I would like to emphasize that the comparisons to Lasso and ElasticNet regression were included because a “comparison of methods” was requested by all three reviewers, a request that I have some problems with, given the literature support for ridge regression. I should also emphasize that no algorithm other than ridge regression was ever applied to the clinical datasets prior to these reviewers’ requests, as including such results could easily invite the dangerous accusation of selecting methods based on their performance on the datasets, a practice that is likely to lead to a useless model.

THE FUTURE

I hope that the future will lead us to identify further datasets on which this method can be tested. My dream is that substantial progress can be made in personalized cancer therapy by using statistical models constructed on model systems (be they cell lines or anything else). This will require focused attention to best practices in machine learning and data mining. We hope that we have taken an important first step and are excited to see a productive future for this field.

The Paper

All code and data to reproduce results

The economic cost of releasing unusable academic software

Birdsuite is an application released by the Broad Institute (Harvard/MIT) for analysing Affy SNP 6.0 array data. It was published in Nature Genetics. The installation process is shockingly bad and has recently caused me an indescribable amount of utterly needless frustration. It’s a little surprising that a journal like Nature Genetics can justify publishing a piece of software in this condition. Personally, I would recommend avoiding this software if at all possible. You will very likely end up tearing your hair out.

An interesting question arises from this. What is the financial cost of this type of sloppy work? Birdsuite has about 500 citations on Google Scholar. So say three times that many people attempted to install it, giving 1,500 people (a conservative estimate). Now let’s say that each of these people spent 12 hours trying to get it to work (which is surprisingly/worryingly realistic), and value their time at 30 dollars an hour (probably an underestimate for a PhD-level scientist). This gives a total of 1500 * 12 * 30 = $540,000. Over half a million dollars :(

For a far more straightforward solution, I ended up getting it to work using “Affymetrix Genotyping Console”. It’s a good piece of software with good documentation. It works seamlessly, does quality control, outputs data in a usable and annotated format, and can even output in Plink format. Thumbs up, good job Affymetrix!!

Split a chromosome name and location type string into its constituent parts in R

This is easy to do with a regular expression and the strsplit function in R. The | operator means “or” in the regular expression. For example, for “chr15:88120587-88121480”:

> unlist(strsplit("chr15:88120587-88121480", "chr|:|-"))[2:4]
[1] "15" "88120587" "88121480"

Principal Components Analysis Explained using R

Here, we will explain principal component analysis (PCA) by stepping through the algorithm manually and reproducing the output of the prcomp() function in R, which is normally used to do PCA.

First, make up some data and plot it. In terms of gene expression analysis, we can think of the rows of the matrix below as different samples (or microarray experiments etc.), with 2 genes each. It’s much easier to explain PCA in two dimensions and then generalise from there.
x <- c(1,3,6,8,9,10,15,12,17)
y <- c(10,14,12,17,19,20,20,24,19)
m <- cbind(x, y)
plot(m, pch=seq(9))

Here’s how you’d make a PCA plot for these data using R’s inbuilt functions (we’re about to learn how to do this for ourselves):
pca_proc <- prcomp(m)
plot(pca_proc$x, pch=seq(9)) #the pca plot
plot(pca_proc) #the variance captured by each PC

Now, to do PCA manually, the first step is to center the data around zero, then plot the results.
x1 <- x - mean(x)
y1 <- y - mean(y)
plot(x1, y1, xlim=c(-40, 40), ylim=c(-40, 20))

Next, calculate the covariance matrix for vectors x1 and y1 and plot its column vectors. These column vectors describe the direction of variability in the data: the first column is (cov(x, x), cov(y, x)) and the second is (cov(x, y), cov(y, y)). Note that the covariance of x with itself is the same as the variance of x (i.e. var(x) == cov(x, x)).
m1 <- cbind(x1, y1)
cM <- cov(m1)
lines(x=c(0, cM[1,1]), y=c(0, cM[2,1]), col="blue")
lines(x=c(0, cM[1,2]), y=c(0, cM[2,2]), col="blue")

For these data, cov(x, y) is smaller than both var(x) and var(y), so one of the column vectors plotted above lies above the diagonal and the other below it. Plot the diagonal (i.e. a line through the origin with a slope of 1).
abline(0, 1, col="grey")

While the vectors we have plotted above describe the direction of the variability in the data, they do not do so in a way that is particularly useful. By finding the eigenvectors of the covariance matrix, we can describe the variability in terms of two orthogonal vectors that capture the directions of variation, instead of the two column vectors that we are currently plotting. For more info on eigenvectors you’ll need to study some basic linear algebra; this can be done on Khan Academy, and of particular interest will be the videos on matrix multiplication and, obviously, eigenvectors. This is by far the most difficult part of PCA to understand.
eigenVecs <- eigen(cM)$vectors
lines(x=c(0, eigenVecs[1,1]), y=c(0, eigenVecs[2,1]), col="red")
lines(x=c(0, eigenVecs[1,2]), y=c(0, eigenVecs[2,2]), col="red")

As the eigenvectors are unit vectors (i.e. of length 1), it may be easier to visualize how much each of them influences the data if we multiply them by their corresponding eigenvalues, which represent the amount of variance explained by each eigenvector.
eVal1 <- eigen(cM)$values[1]
eVal2 <- eigen(cM)$values[2]
lines(x=c(0, eVal1*eigenVecs[1,1]), y=c(0, eVal1*eigenVecs[2,1]), col="red")
lines(x=c(0, eVal2*eigenVecs[1,2]), y=c(0, eVal2*eigenVecs[2,2]), col="red")

The matrix eigenVecs contains the eigenvectors, and we want to display the data in terms of these eigenvectors. In this case we will keep both eigenvectors, but on high-dimensional datasets it is normal to choose a subset of them.
rowFeatVec <- t(eigenVecs)
rowDataAdj <- t(m1)

Finally, use matrix multiplication to get the dot product between each eigenvector and each point in the original centered data matrix (m1). This operation describes to what degree each point is influenced by each eigenvector. Plot this and that’s the final PCA plot.
Note: %*% is the matrix multiplication operator in R.

finalData <- rowFeatVec %*% rowDataAdj
plot(t(finalData), pch=seq(9))

Finally, to plot the equivalent of the scree plot we made above, simply plot the eigenvalues.
barplot(eigen(cM)$values)
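As a sanity check, the manual steps above can be verified against prcomp() directly. Each principal component is only defined up to sign, so the scores should agree once we compare absolute values:

```r
# Rebuild the example data and compare the manual PCA scores to prcomp()'s.
x <- c(1,3,6,8,9,10,15,12,17)
y <- c(10,14,12,17,19,20,20,24,19)
m <- cbind(x, y)

m1  <- scale(m, center=TRUE, scale=FALSE)  # center each column
eig <- eigen(cov(m1))
manualScores <- m1 %*% eig$vectors         # project onto the eigenvectors

pca_proc <- prcomp(m)
# TRUE: scores agree up to the (arbitrary) sign of each component
all.equal(abs(unname(manualScores)), abs(unname(pca_proc$x)))
```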

Writing a group of R data.frames to named Excel Worksheets

First, install the “WriteXLS” package in R from CRAN:

install.packages("WriteXLS")

Now if I have 3 data.frames called “genes”, “proteins” and “elephants”, all I need to do to write them to the same Excel file, on different *named* worksheets is:

library("WriteXLS")
WriteXLS(c("genes", "proteins", "elephants"), "WriteXLS.xls")

Note: for this to work, Perl needs to be installed. You also need the Text::CSV_XS Perl module. On Ubuntu this can be installed by issuing the command:

sudo apt-get install libtext-csv-xs-perl

Use “org.Hs.eg.db” to map between Entrez Gene Ids and HUGO gene symbols in R

To translate between these identifiers in R, this code creates a table with the mapping:

library(org.Hs.eg.db)
e2s = toTable(org.Hs.egSYMBOL)

To have a look at the mapping (which is stored as a data.frame)

head(e2s)

This will output something like this
gene_id symbol
1 1 A1BG
2 2 A2M
3 3 A2MP1
4 9 NAT1
5 10 NAT2
6 11 AACP

Now to map a set of Ids you have stored in a vector called “entrezGenes” to HUGO:

entrezGenes <- c("57573", "8563" , "8434")
geneSymbols <- e2s$symbol[e2s$gene_id %in% entrezGenes]

NOTE: THIS DOES NOT PRESERVE ORDER.
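If the order of the input vector matters, match() gives an order-preserving lookup instead. Here a tiny stand-in data.frame replaces the real e2s table (which needs org.Hs.eg.db) so the snippet is self-contained:

```r
# Toy stand-in for the e2s mapping table built above.
e2s <- data.frame(gene_id = c("1", "2", "9"),
                  symbol  = c("A1BG", "A2M", "NAT1"),
                  stringsAsFactors = FALSE)

entrezGenes <- c("9", "1")
geneSymbols <- e2s$symbol[match(entrezGenes, e2s$gene_id)]
geneSymbols  # "NAT1" "A1BG" -- same order as entrezGenes
```

Ids with no entry in the table come back as NA, which also makes missing mappings easy to spot.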

In R, list all files in current directory ending with “.CEL”

Use regular expression matching, where the “$” anchors the match to the end of the string. Note that “.” matches any character in a regular expression, so strictly it should be escaped:

dir(pattern="\\.CEL$")

Use biomaRt to translate HUGO to Entrez gene Ids

We can use the R package biomaRt to conveniently convert between different types of gene ids. In this example we will convert official HUGO gene names to entrez gene ids.

First we load biomaRt in R using the current ensembl database for human:
library(biomaRt)
ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl")

Now get the mapping; myHugoGeneNames is a vector of HUGO gene names. As this mapping will not generally be one-to-one, make sure "uniqueRows=FALSE" is included in the query.
mapTab <- getBM(attributes = c("hgnc_symbol", "entrezgene"), filters = "hgnc_symbol", values = myHugoGeneNames, mart = ensembl, uniqueRows=FALSE)

This query will return a table with HUGO gene names in the first column and corresponding Entrez gene id in the 2nd column. As the mapping isn't unique, you may wish to remove duplicates, which can be done as follows:

dupRows <- union(which(duplicated(mapTab[,1])), which(duplicated(mapTab[,2])))
entrezIds <- mapTab[-dupRows, 2]
names(entrezIds) <- mapTab[-dupRows, 1]

The object "entrezIds" now contains a unique mapping of the gene ids.

If you wish to map different kinds of Ids, these functions will be of use in identifying what identifiers are available in biomaRt:
listFilters(ensembl)
listAttributes(ensembl)

Boom!

See all users jobs in Sun Gridengine

This info is hard to find, but to see what jobs are being run by other users on Sun Gridengine:

qstat -u "*"

Find files in linux command line using wildcards

To find a file whose name contains the text “lostfile” in the current directory and all subdirectories:

find . -name \*lostfile\*

To search the filesystem for such a file simply do:

find / -name \*lostfile\*