Recently, I sat down with Jeff Barr on the AWS report to discuss how we’ve used various Amazon services throughout our architecture while developing Synapse. In the interview, I discussed how Synapse uses RDS (MySQL) as our back-end database, Elastic Beanstalk to host our service and web hosting tiers, CloudSearch to provide search across all Synapse content, and Simple Workflow to manage distributed scientific workflows (see also our AWS case study). The decision to rely heavily on Amazon as an infrastructure provider for our project was based on the belief that hosted infrastructure was the way of the future, and that it was best to build technology with that future in mind, assuming that services still in their early stages would mature along with our own work. Despite a few of the hiccups associated with adopting early-stage technology, I’m still pretty pleased with the decision to go full steam ahead on cloud computing in general, and with Amazon in particular.
In the interview, we focused more on what Sage Bionetworks has already done rather than what we might do with AWS in the future; however, the breadth of offerings from Amazon keeps expanding, and with it our potential applications. One very interesting service to us is Amazon’s recently launched Glacier product, which is designed for data archival. With S3, you are essentially paying Amazon to keep your data continually available and protected from loss; many machines must have the data live on disk to provide S3’s level of service. Therefore, even though Amazon may be operating the service on pretty thin margins, it’s still a relatively expensive way to store data for some use cases. Glacier complements S3 because it lets you trade off cost for availability: by agreeing to wait 3-5 hours for a request for data to be serviced, you can significantly cut the cost of storing that data. Biomedical research is full of examples where large amounts of data are actively analyzed for a period of time, but quickly give way to processed versions of the same data that will be used for downstream analyses, e.g. processing a raw genetic sequence into a series of variants relative to a reference genome. The reference genome or the processing algorithm does occasionally change, but once the variants are called, reprocessing the raw data is relatively infrequent. This sort of use case makes Glacier a very interesting proposition.
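To make the trade-off concrete, here is a back-of-the-envelope sketch of the storage cost difference. The per-gigabyte prices below are illustrative placeholders, not actual AWS rates, and the archive size is a made-up example:

```java
// Back-of-the-envelope comparison of keeping archival data on S3 versus
// moving it to Glacier. All prices are hypothetical, for illustration only.
public class StorageCostSketch {

    // Monthly cost for a given volume at a given per-GB-month price.
    static double monthlyCost(double gb, double pricePerGbMonth) {
        return gb * pricePerGbMonth;
    }

    public static void main(String[] args) {
        double archiveGb = 10_000;  // e.g., raw sequencing reads kept after variant calling
        double s3Price = 0.125;     // assumed $/GB-month (placeholder)
        double glacierPrice = 0.01; // assumed $/GB-month (placeholder)

        System.out.printf("S3:      $%.2f/month%n", monthlyCost(archiveGb, s3Price));
        System.out.printf("Glacier: $%.2f/month%n", monthlyCost(archiveGb, glacierPrice));
    }
}
```

Even with made-up numbers, the shape of the decision is clear: an order-of-magnitude price difference in exchange for hours of retrieval latency, which is a fine bargain for raw data you rarely expect to touch again.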
Along with genomics, imaging is another common source of large data volumes in the medical research space. Sage is planning to enter this space shortly by launching the “Melanoma Hunt” project, intended to aid in the early detection of melanoma through images of skin lesions captured from mobile phones. We’d like to create a publicly accessible database of suspicious and benign images to catalyze the development of better image processing and machine learning algorithms. Feature extraction from raw images might be performed only occasionally, feeding into downstream work to build a classification model. The project will also aim to engage the average citizen in the research process, crowdsourcing much of the effort that goes into developing these classifiers. Citizen scientists may learn to classify the images manually, or remove potentially identifying information from the images. The organization of such efforts could benefit immensely from technologies such as Mechanical Turk, with data ultimately released to researchers through Synapse.
As the Synapse system itself scales to handle these sorts of projects, we are going to need additional technologies on the back end to scale appropriately as well. Given that there will always be a category of application data where we need real-time concurrent updates, we will probably always have some of our services running off of RDS. We’re already using CloudSearch to support a standard search feature, but there are other sorts of queries where we might need higher query performance and scalability. We’ve started looking at DynamoDB (a NoSQL database) and ElastiCache for some of our future needs, and for refactoring current services to support higher scales. In this type of architecture, we will likely end up using Amazon SQS as a message queue between the synchronous and asynchronous logic in Synapse.
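The pattern we have in mind for SQS can be sketched without any AWS dependencies. In the sketch below, an in-memory `BlockingQueue` stands in for the SQS queue, and the class and job names are illustrative, not actual Synapse code:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// A minimal sketch of decoupling synchronous request handling from
// asynchronous back-end work via a message queue. An in-memory
// BlockingQueue stands in here for Amazon SQS; in the real architecture,
// the producer and consumer would run in separate processes.
public class QueueSketch {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    // Synchronous side: enqueue a job and return to the caller immediately,
    // instead of blocking the web request on slow back-end work.
    public void submitJob(String jobId) {
        queue.offer(jobId);
    }

    // Asynchronous side: a worker polls for jobs to process.
    // Returns null if no job is waiting.
    public String pollJob() {
        return queue.poll();
    }

    public static void main(String[] args) {
        QueueSketch sketch = new QueueSketch();
        sketch.submitJob("reindex-entity-123"); // hypothetical job name
        System.out.println("worker picked up: " + sketch.pollJob());
    }
}
```

The point of the indirection is that the service tier can acknowledge a request quickly while expensive work (reindexing, cache warming, bulk updates) happens at its own pace on worker instances that can be scaled independently.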
We’ve also been investing recently in automating the deployment of new instances of Synapse. As the application has grown in complexity, the configuration of the AWS components it’s built upon has become increasingly difficult to manage manually. Fortunately, several approaches are possible to automate provisioning these components and installing software into them. The one we took was to simply write a program (see Sage Bionetworks / Synapse Stack Builder on GitHub) using the AWS SDK for Java that automates these tasks; the approach works for us because we are a very Java-centric shop and don’t have clear boundaries between developers and operations personnel. You could easily do the same in other languages, or through the use of Amazon’s CloudFormation templates. The end result is a blue-green deployment system, where we will be continually building new instances, operating them in staging mode for a test period, and then using Route 53 to manage the cut-over as a staging system is promoted to production.
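The cut-over step itself is conceptually simple, and can be sketched without touching the real Route 53 API. Below, a plain map stands in for the DNS alias records, and all hostnames and stack names are hypothetical, not our actual endpoints:

```java
import java.util.HashMap;
import java.util.Map;

// A minimal sketch of blue-green promotion: production traffic follows a
// DNS alias, and promoting a staging stack is a single alias update.
// The map stands in for Route 53 record sets; all names are illustrative.
public class BlueGreenSketch {
    private final Map<String, String> dnsAliases = new HashMap<>();

    public BlueGreenSketch() {
        // Initially, the production alias points at the "blue" stack.
        dnsAliases.put("synapse.example.org", "blue-stack.example");
    }

    // Promote a staging stack to production by repointing the alias.
    // The old stack keeps running, so rollback is the same operation in reverse.
    public void promote(String alias, String stagingEndpoint) {
        dnsAliases.put(alias, stagingEndpoint);
    }

    public String resolve(String alias) {
        return dnsAliases.get(alias);
    }

    public static void main(String[] args) {
        BlueGreenSketch dns = new BlueGreenSketch();
        System.out.println("before: " + dns.resolve("synapse.example.org"));
        dns.promote("synapse.example.org", "green-stack.example");
        System.out.println("after:  " + dns.resolve("synapse.example.org"));
    }
}
```

The appeal of this arrangement is that promotion and rollback are both cheap, atomic alias changes rather than in-place upgrades of a running stack.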
Finally, I’ll end by noting that we have been looking at the cloud computing space more generally. In particular, we’ve had a great relationship with Google, who has very generously provided 2,000 virtual cores of support for our public challenge in the predictive modeling of breast cancer, as well as hosting the clinical and genomic data the challenge is built upon. We’ve recently taken the “Tour de France” strategy of awarding an intermediate stage win to the current best-performing classification model of aggressive vs. non-aggressive disease. It’s important to remain flexible in our support for scientific computing; the cloud computing market is still young, and ultimately scientists will move to the computing platforms that provide the mix of compute services appropriate for their applications. Another force that will catalyze the formation of communities of scientists working with particular technologies is the presence of large, interesting scientific data sets to work with. If those data sets can be open to the full scientific community, so much the better.