The phrase “It’s not rocket science” tends to annoy me. Sure, rocket science is harder than, say, mowing the lawn or fixing a flat tire. But the bottom line is, in space Mr. Newton’s F = ma and his law of gravitation seem to cover most of what you need to know. There is no atmosphere, so there is no turbulence (Richard Feynman’s “most important unsolved problem of classical physics”). For a moon shot, all you really have to do is line up a particularly long pool shot on a perfectly balanced table. There may be a few details I’ve missed, as I’ve never actually tried it, but, whatever.
Now, Cancer Biology. It’s not rocket science. It’s way, way harder.
Scientists and engineers usually tackle a problem by breaking it down into its myriad component parts, understanding how those parts interact, and then building back up to a working system. We now have endless detail on hundreds of genes and proteins and drug candidates, and we know the laws governing how atoms interact with each other as well as Newton’s laws of motion. When it comes to molecular dynamics, we also know where all the atoms actually are in a particular complex molecule, like a protein. So it is actually feasible to run a simulation and predict with reasonable accuracy, for example, that small molecule X will bind to and inhibit the enzymatic activity of protein kinase Y.
Now, will blocking protein Y actually cure a patient’s cancer? Who knows? We don’t understand the multiple levels of organization above molecular structures in biology well enough to make a solid prediction of whether knocking down protein kinase Y will actually have an effect. We don’t understand the network of interactions that kinase Y has in a normal cell, let alone how those interactions are different in cancer. And by the way, every patient’s cancer is essentially a unique set of mutations away from a genome that is unique to the patient to begin with. We certainly don’t have a lot of confidence about what else molecule X could do elsewhere in an actual patient. Bottom-up simulation or prediction from first principles in biology is like trying to debug an issue logging into your Facebook account by watching and simulating the movements of electrons in all the transistors making up all the computers interacting in this system. Sure, at some level, they’re moving the wrong way. Good luck with that.
So, what we have is a constantly evolving set of data mining / machine learning approaches to try to understand patterns in molecular and clinical data. Hopefully that will let us pick out some small pieces of biology that we understand well enough to segment a patient population, or green-light a billion-dollar gamble that we can create the right drug. Understanding the best algorithms to do this work is an open research problem. The technologies generating the molecular data are improving faster than Moore’s Law. What kind of informatics support do you need when your application domain is changing that quickly?
Software geeks reserve the term “Beautiful Code” to describe software that is easy to maintain and adapt to new uses. Code that evolves with its users effortlessly to do new things not envisioned when the code was originally written. My biggest certainty is the uncertainty of scientific requirements only a few years in the future. Therefore, flexibility is my #1 concern in an informatics platform.
From this point of view, Amazon’s recently released Simple Workflow Service is the epitome of beauty. SWF coordinates distributed work spread out across just about any combination of computing platforms and locations. It handles the rather tricky job of ensuring reliable execution of a complex series of tasks spread out over unreliable components and connected by an imperfect network. With SWF, my company just needs to worry about how to implement the right “deciders” and “workers,” which is plenty for us to handle, thanks. The case study we recently developed at Sage Bionetworks with Amazon on SWF is one example of an application. Yes, here’s our testimonial right next to NASA’s (OK, I couldn’t come up with an image as cool as the Mars Rover for our project). The quick summary is that using SWF let us pull down large quantities of gene expression data representing 10,000+ studies from public repositories, run standardized statistical normalization and error detection procedures on all the data, and rehost the data for use in further studies. But the real benefit isn’t what we did with this technology yesterday. It’s that even though I don’t know where genomic data will be coming from in 3 years, or the specific measurement technologies being used, or what the performance characteristics of the best algorithms will be, I am pretty sure SWF has the flexibility to string something together. Hopefully we’ll figure out what that something should be.
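SWF’s coordination model can be sketched in miniature: a decider inspects the workflow history so far and schedules the next task, while workers simply execute whatever task they are handed, on whatever machine happens to be polling. Below is a toy, in-memory version of that split. The activity names are hypothetical stand-ins loosely mirroring our gene-expression pipeline; a real implementation would poll SWF’s decision and activity task lists through the AWS SDK rather than looping locally.

```python
# Toy illustration of SWF's decider/worker split. In real SWF, deciders
# and workers poll the service (PollForDecisionTask / PollForActivityTask)
# and may run on entirely different machines; here everything is in-memory.

# Hypothetical pipeline, loosely mirroring the case study:
# fetch public data, normalize it, run error detection, rehost it.
PIPELINE = ["fetch", "normalize", "detect_errors", "rehost"]

# Each "activity" just appends a marker to the accumulating result.
ACTIVITIES = {
    "fetch": lambda data: data + ["raw expression matrix"],
    "normalize": lambda data: data + ["normalized matrix"],
    "detect_errors": lambda data: data + ["QC report"],
    "rehost": lambda data: data + ["public URL"],
}

def decider(history):
    """Inspect the workflow history and decide the next activity.

    Returns an activity name, or None when the workflow is complete.
    All the sequencing logic lives here, not in the workers.
    """
    completed = [event for event in history if event.endswith(":completed")]
    if len(completed) == len(PIPELINE):
        return None
    return PIPELINE[len(completed)]

def worker(activity, data):
    """Execute a single activity. Workers know nothing about ordering."""
    return ACTIVITIES[activity](data)

def run_workflow():
    """Drive decider and workers to completion; SWF itself plays this
    coordinating role (durably, across failures) in the real service."""
    history, data = [], []
    while True:
        task = decider(history)
        if task is None:
            break
        data = worker(task, data)
        history.append(task + ":completed")
    return history, data

if __name__ == "__main__":
    history, data = run_workflow()
    print(history)
```

The point of the split is exactly the flexibility argued for above: swapping in a new measurement technology or algorithm means changing a worker or a decider rule, not re-plumbing the coordination machinery.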