In preparation for next week’s session at the Sage Congress, I started playing with the famous quote from Sir Isaac Newton, “If I have seen further, it is by standing on the shoulders of giants”, as a way to introduce why my team is building Synapse. In a nutshell: scientific ideas do not spring fully formed out of the mind of an isolated and brilliant scientist. Rather, great scientists, while they may be smarter than the average layperson, most often make great advances because they happen to be in the right place at the right time. The closer you look at most scientific breakthroughs, the more you realize they are the combination of several pre-existing ideas and minor advances into something only slightly new. It is often only in retrospect that we elevate certain breakthroughs to special significance, like Roger Bannister’s 3:59.4 mile, which improved on Gunder Hägg’s 4:01.4 by less than 1%, and was surpassed by John Landy’s 3:58.0 less than 2 months later.
In Newton’s time, scientists first adopted the printing press from its origins in producing religious texts and started circulating the first scientific journals, greatly accelerating scientific progress through the faster mixing of ideas. Today, it is the introduction of technologies like the internet and cloud computing that provides new opportunities for accelerating the mixing of ideas and the advancement of science.
Today I realized the quote also describes our team’s development philosophy. As a small startup, we must be relentlessly focused on our unique value: pulling together existing technology in new ways to make our scientist users more productive. Sage Bionetworks is not the place to be if you want to do fundamental research in computer science; we need practical engineers who know how to effectively learn and reuse new and diverse technologies in rapid succession.
At Sage, the technology team is standing on the shoulders of two types of giants. The first is open source software. From infrastructure like Linux, Tomcat, and MySQL, to a variety of software libraries like Spring and GWT, we leverage a lot of open source software in our work. The second giant is cloud computing, which helps prevent my team from becoming bogged down in low-level IT work. Up until now that has meant Amazon (if I seem like an Amazon shill, stay tuned at the Congress for a big announcement with another major player). Given Amazon’s philosophy to “Let a Thousand Platforms Bloom”, these two giants have played together pretty nicely so far. RDS gives us a hosted and scalable deployment of MySQL. When we recently maxed out storage on our instance, we were able to resize the disk at the touch of a button and have our system up again in no time flat. Similarly, Elastic Beanstalk’s instantly available and rescalable set of Tomcat servers allows us to quickly adapt when a scientist decides he wants to index all of GEO or TCGA.
Earlier this year we had an interesting decision to make involving these two giants. The introduction of ten thousand gene expression data sets into our application completely broke the browsing-based navigational model, and made introducing search into our application a top priority. There was a pretty obvious open-source way to proceed: standing up Solr / Lucene on an EC2 instance. The downside was that this would involve adding a considerable amount of new infrastructure to manage. However, through our work with Amazon we were introduced to another opportunity: integrating with Amazon’s still-private Cloud Search service. This approach involved no infrastructure to manage at all. Instead, we wrote our application to create documents based on objects in our system, and called an AWS service over HTTP to upload them into Cloud Search. We then wrote our own search UI, which simply calls a second AWS Cloud Search service to execute the search. As a startup, the ability to execute on this major new feature so quickly, and to have such a low cost of maintenance afterwards, is a huge win.
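To make the "create documents based on objects" step concrete, here is a minimal sketch of what that mapping might look like. It is not the Synapse code: the `Entity` class, its fields, and the function names are hypothetical, and the JSON layout (`type`, `id`, `fields`) is an assumption modeled on the document batch format Amazon later documented publicly for CloudSearch, which may differ from the private service described here. The sketch only builds the request body; the HTTP upload itself is left out.

```python
import json
from dataclasses import dataclass


@dataclass
class Entity:
    """Hypothetical stand-in for an object stored in the application."""
    entity_id: str
    name: str
    description: str


def to_search_document(entity: Entity) -> dict:
    """Map one application object to an 'add' operation for the search index.

    The key names here are an assumption, not taken from the original post.
    """
    return {
        "type": "add",
        "id": entity.entity_id,
        "fields": {"name": entity.name, "description": entity.description},
    }


def build_batch(entities) -> bytes:
    """Serialize many documents into one JSON body for a single HTTP POST."""
    return json.dumps([to_search_document(e) for e in entities]).encode("utf-8")


# Illustrative record only; real objects would come from the application.
batch = build_batch([Entity("entity-1", "Example expression study", "Placeholder text")])
print(batch.decode("utf-8"))
```

The appeal of this shape is that the application only has to know how to describe its own objects; indexing, scaling, and query execution all stay on the other side of the HTTP call.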
There is one thing nagging me a little about this decision, and that stems from Sage Bionetworks’ status as a non-profit and its mission of fostering open data and systems to accelerate human health research. Up to now I’ve been able to tell people you could stand up a Synapse instance anywhere. We happen to use Amazon, but there’s no reason you need MySQL and Tomcat to be hosted to do so. In fact, our developers run local instances of these components on their workstations all the time. Cloud Search takes the entire search functionality and sticks it in the black box behind those two Cloud Search APIs. It’s an enormous productivity benefit when it works, and our team never has touched, and probably never would touch, a line of source in MySQL or Tomcat. But as the technology industry seems to increasingly shift to hosted services, I wonder what the fate of open source software is. Hopefully it still has a place, because those who see farthest are often not standing on the shoulders of giants, but on layers upon layers of ordinary people.