Science in the Clouds

Recently, I sat down with Jeff Barr on The AWS Report to discuss how we’ve used various Amazon services throughout our architecture while developing Synapse.  In the interview, I discussed how Synapse uses RDS (MySQL) as our back-end database, Elastic Beanstalk to host our service and web tiers, CloudSearch to provide search across all Synapse content, and Simple Workflow to manage distributed scientific workflows (see also our AWS case study). The decision to rely heavily on Amazon as an infrastructure provider for our project was based on the belief that hosted infrastructure was the way of the future, and that it was best to build technology with that future in mind, assuming that services still in their early stages would mature along with our own work.  Despite a few of the hiccups associated with adopting early-stage technology, I’m still quite pleased with the decision to go full steam ahead on cloud computing in general, and with Amazon in particular.

In the interview, we focused more on what Sage Bionetworks has already done than on what we might do with AWS in the future; however, the breadth of offerings from Amazon keeps expanding along with our potential applications.  One very interesting service to us is Amazon’s recently launched Glacier product, which is designed for data archival.  With S3, you are essentially paying Amazon to keep your data continually available and protected from loss; many machines must have the data live on disk to provide S3’s level of service.  Therefore, even though Amazon may be operating the service on fairly thin margins, it’s still a relatively expensive way to store data for some use cases.  Glacier complements S3 because it lets you trade cost for availability: by agreeing to wait 3-5 hours for a data-retrieval request to be serviced, you can significantly cut the cost of storing that data.  Biomedical research is full of examples where large amounts of data are actively analyzed for a period of time, but quickly give way to processed versions of the same data that are used for downstream analyses, e.g. processing a raw genetic sequence into a series of variants relative to a reference genome. While the reference genome or the processing algorithm does occasionally change, forcing the variants to be re-called, this happens relatively infrequently.  This sort of use case makes Glacier a very interesting proposition.
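To make the trade-off concrete, here’s a back-of-the-envelope comparison.  The per-GB prices below are illustrative assumptions from roughly the time of Glacier’s launch, not current AWS pricing, and they ignore retrieval and request fees:

```python
# Back-of-the-envelope storage cost comparison.
# Prices are illustrative assumptions (circa Glacier's 2012 launch),
# not current AWS pricing; retrieval and request fees are ignored.
S3_PER_GB_MONTH = 0.125      # S3 standard storage, USD per GB-month
GLACIER_PER_GB_MONTH = 0.01  # Glacier archival storage, USD per GB-month

def monthly_cost(gb, price_per_gb_month):
    """Monthly storage cost in USD for a given data volume."""
    return gb * price_per_gb_month

# A 10 TB archive of raw sequence data that has already been
# processed down to variant calls and is rarely touched:
raw_data_gb = 10 * 1024

s3_cost = monthly_cost(raw_data_gb, S3_PER_GB_MONTH)
glacier_cost = monthly_cost(raw_data_gb, GLACIER_PER_GB_MONTH)

print(f"S3:      ${s3_cost:,.2f}/month")       # prints: S3:      $1,280.00/month
print(f"Glacier: ${glacier_cost:,.2f}/month")  # prints: Glacier: $102.40/month
print(f"Savings: {100 * (1 - glacier_cost / s3_cost):.0f}%")  # prints: Savings: 92%
```

At these assumed prices, archiving the raw data and keeping only the variant calls on S3 cuts the storage bill by an order of magnitude, at the cost of a multi-hour wait on the rare occasions the raw data is needed again.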

Along with genomics, imaging is another common source of large data volumes in medical research. Sage is planning to enter this space shortly by launching the “Melanoma Hunt” project, intended to aid in the early detection of melanoma through images of skin lesions captured with mobile phones.  We’d like to create a publicly accessible database of suspicious and benign images to catalyze the development of better image-processing and machine-learning algorithms.  Feature extraction from the raw images might be performed only occasionally, feeding into downstream work to build a classification model.  The project will also aim to engage the average citizen in the research process, crowdsourcing much of the effort that goes into developing these classifiers. Citizen scientists may learn to classify the images manually, or remove potentially identifying information from them. The organization of such efforts could benefit immensely from technologies such as Mechanical Turk, with data ultimately released to researchers through Synapse.

As the Synapse system itself scales to handle these sorts of projects, we are going to need additional technologies on the back end to scale appropriately as well. Given that there will always be a category of application data that needs real-time concurrent updates, we will probably always have some of our services running off of RDS.  We’re already using CloudSearch to support a standard search feature, but there are other sorts of queries where we might need higher query performance and scalability.  We’ve started looking at DynamoDB (a NoSQL database) and ElastiCache for some of our future needs, and for refactoring current services to support higher scale.  In this type of architecture, we will likely end up using Amazon SQS as a message queue between the synchronous and asynchronous logic in Synapse.
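The queue-between-sync-and-async pattern is simple enough to sketch.  This is a minimal illustration using Python’s in-memory `queue.Queue` as a stand-in for SQS (in the real architecture, the web tier would enqueue via the SQS API and a separate worker fleet would poll); the `reindex` action and entity IDs are hypothetical:

```python
import json
import queue

# In-memory stand-in for an SQS queue; the real system would call
# the SQS API here instead.
work_queue = queue.Queue()

def handle_request(entity_id):
    """Synchronous path: record the request and return immediately,
    deferring slow work (e.g. search re-indexing) to a worker."""
    message = json.dumps({"action": "reindex", "entity": entity_id})
    work_queue.put(message)
    return {"status": "accepted", "entity": entity_id}

def worker_poll():
    """Asynchronous path: a worker drains the queue and does the
    expensive processing out of band."""
    processed = []
    while not work_queue.empty():
        msg = json.loads(work_queue.get())
        processed.append(msg["entity"])  # placeholder for real work
        work_queue.task_done()
    return processed

handle_request("syn123")
handle_request("syn456")
print(worker_poll())  # prints: ['syn123', 'syn456']
```

The payoff is that the user-facing request returns quickly regardless of how long the background processing takes, and the worker fleet can be scaled independently of the web tier.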

We’ve also been investing recently in automating the deployment of new instances of Synapse.  As the application has grown in complexity, the configuration of the AWS components it’s built upon has become increasingly difficult to manage manually.  Fortunately, several approaches are possible for automatically provisioning these components and installing software onto them.  The one we took was simply to write a program (see the Sage Bionetworks Synapse Stack Builder on GitHub) using the AWS SDK for Java that automates these tasks; the approach works for us because we are a very Java-centric shop and don’t have clear boundaries between developers and operations personnel.  You could easily do the same in other languages, or through the use of Amazon’s CloudFormation templates.  The end result is a blue-green deployment system, where we continually build new instances, operate them in staging mode for a test period, and then use Route 53 to manage the cut-over as a staging system is promoted to production.
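The promotion step itself is just a state swap.  Here’s a toy sketch of that bookkeeping; the stack names and the `swap_dns` function are hypothetical stand-ins for what is, in practice, a Route 53 record update pointing the production hostname at the newly promoted stack:

```python
# Toy sketch of blue-green promotion. Stack names, hostname, and
# swap_dns are hypothetical; the real cut-over is a Route 53
# record change, not shown here.
stacks = {"production": "synapse-stack-a", "staging": "synapse-stack-b"}

def swap_dns(hostname, target):
    """Stand-in for a Route 53 record change."""
    print(f"pointing {hostname} at {target}")

def promote(stacks):
    """Promote the staging stack to production after its test period;
    the old production stack is freed up for the next build."""
    old_prod, new_prod = stacks["production"], stacks["staging"]
    swap_dns("synapse.example.org", new_prod)
    return {"production": new_prod, "staging": old_prod}

stacks = promote(stacks)
print(stacks["production"])  # prints: synapse-stack-b
```

Because the old stack stays up behind its own DNS name, rolling back is just another record change in the other direction.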

Finally, I’ll end by noting that we have been looking at the cloud computing space more generally.  In particular, we’ve had a great relationship with Google, which has very generously provided 2,000 virtual cores of support for our public challenge in the predictive modeling of breast cancer, as well as hosting the clinical and genomic data the challenge is built upon.  We’ve recently taken the “Tour de France” strategy of awarding an intermediate stage win to the currently best-performing classification model of aggressive vs. non-aggressive disease.  It’s important to remain flexible in our support for scientific computing; the cloud computing market is still young, and ultimately scientists will move to the computing platforms that provide the mix of compute services appropriate for their applications.  Another force that will catalyze the formation of communities of scientists working with particular technologies is the presence of large, interesting scientific data sets to work with.  If those data sets can be open to the full scientific community, so much the better.


If I had a billion dollars…

In an apparently recurring theme, my thoughts have again turned to the incentives that drive human behavior, this time inspired by the recent news that the Russian billionaire Yuri Milner has established a new $3 million Fundamental Physics Prize.  He has actually awarded nine of these prizes, a cool $27M promoting the efforts of theoretical physics.  Certainly that kind of money and publicity could drive a lot of attention to the field, and I love the fact that we now almost have a basketball team’s worth of physicists who almost make a basketball player’s salary.

However, is this the best way to spend $27M to shake up and rally support for science?  Of course Mr. Milner is free to spend his money any way he wishes, but I see some potential problems with his approach.  Quoting from the New York Times article referenced above: “Mr. Milner personally selected the inaugural group, but future recipients of the Fundamental Physics Prize, to be awarded annually, will be decided by previous winners.”  I don’t know how well a Russian billionaire can select the best work in theoretical physics, but let’s assume he did his due diligence as well as the experts in Stockholm.  Past the first year, the process turns into a bunch of self-anointed experts picking their own colleagues.  There’s nothing particularly wrong with this, but it’s not that much different from the Nobel Prize.

The other fact in the article that really caught my attention was the condition that theoretical predictions don’t need experimental evidence to be considered breakthroughs.  No sitting around for decades waiting for messy, difficult-to-acquire data to roll in here; this prize gets straight to rewarding breakthrough ideas.  According to Milner, “This intellectual quest to understand the universe really defines us as human beings.”  What could be wrong here?

Well, I’m reminded of the quote by Thomas Huxley that was posted on my thesis adviser’s door: “The great tragedy of Science — the slaying of a beautiful hypothesis by an ugly fact.”  I think the lack of a requirement for experimental validation for the Fundamental Physics Prize shows a fundamental misunderstanding of what science is.  Science is not philosophy.  It is based on the belief that there is in fact a real world that behaves a certain way, and that the way to uncover how this universe works is through empirical evidence, not the scientist’s opinion of the beauty of the theory explaining it.

But maybe I’m just a cranky science dropout.  Let’s check the news post again: “Dr. Arkani-Hamed, for example, has worked on theories about the origin of the Higgs boson… None of his theories have been proved yet. He said several were ‘under strain’ because of the new data.”  Wow.  Tough break.  Even in the short interval between when these winners were decided and when they were announced, one of the winners’ ideas is “under strain”, or in layman’s terms, “wrong”.

I don’t have the billion dollars to fund a competitor to the Nobel, but I do have the $18 it took to acquire the sciencereengineered domain, and in this world I am the boss.  So, here’s my proposal for a Nobel alternative:

First of all, I’m not going after the Physics prize.  I’m targeting the one for Physiology or Medicine.  But I’m not giving the power to award it to a group of experts; I’m going to let patients vote on it.  My guess is that unproven theoretical ideas decades away from experimental validation won’t make the top spots.  Instead, the award will go to the projects that have the biggest impact on people’s lives.  Doing this in practice might be difficult, so maybe I’d pick a different disease every year and go to that patient community to get a more involved and knowledgeable subset of voters.  To make sure the voters are knowledgeable, part of the process of awarding the prize will be having the candidate projects present their work to the lay audience.  I’d build some sort of online environment for the projects to present their work, and for patients and scientists to discuss it over the course of several months before the vote.  Of course, the data, code, and other materials that comprise each project would have to be open and available to those who want to validate the work.

Secondly, I’m not giving the award to individuals.  That propagates the false belief that science advances due to the unique and rare breakthrough insights of a small group of geniuses (see On the Shoulders of Giants).  Instead, the award goes to projects, and is shared by every member of the project team equally.  Say, a flat prize of $100,000 per team member; I want to give the winners enough to be a noticeable thank-you, but not so much that they retire!  However, unlike the Nobel, I place no limit on the number of people on the team.  It doesn’t matter if you’re the PI, the lab rat doing the pipetting, the data analyst, or the marketing guy putting together the project description for the lay audience.  The award goes to the whole team, identified in alphabetical order.  For $27M, that means the winning team size could hit 270 people.  That’s big, but if you’ve seen some of the massive author lists on four-page journal articles, it’s not that far off the mark for modern science.

So, that’s a first pass at the Kellen prize in Physiology or Medicine.  Of course, if there’s one thing that’s clear from studies of human behavior, it’s that incentive structures often motivate people in ways unintended by those who create the incentives.  So, I reserve the right to modify the rules of the prize based on empirical evidence acquired as the prizes unfold.  After all, that’s what being a scientist means.


Motivating a challenge

In my previous post I introduced the Breast Cancer Challenge Sage is hosting to build predictive models of the disease.  The initial conception of this project was as a winner-take-all competition, with a clear scoring method and a single top model as the winner of the big prize: publication in Science Translational Medicine.  Compared to some of our other attempts to catalyze scientific collaborations, based more on preaching to scientists to share data and methods for the greater good of society, this approach seems to have triggered substantially more interest from the community, and action by some of the initial participants.

Our task now is how best to harness this energy to motivate researchers, and to form a community where people not only compete, but also collaborate and build off each other’s work effectively.  Many of my recent discussions with the challenge organizers have drawn analogies to the Tour de France, and how multiple dimensions of awards and glory motivate different riders to focus on achieving different objectives in the race.  Every year there are only a handful of riders who can realistically hope to win the yellow jersey, but many others compete to take home the other awards.

Multiple Jerseys – In the Tour, there’s not only the overall winner of the yellow jersey, but several other jerseys that reward different riding styles.  The green jersey is awarded on points earned in sprints at stage finishes and intermediate points along the route, and the polka-dot king-of-the-mountains jersey goes to the rider who does best on the major climbs of the Tour.  Breast cancer is not really a single disease; in reality, many different molecular defects give rise to a heterogeneous collection of diseases.  A prognostic test, particularly one focused on detailed genetic data, is unlikely to perform well across all types of breast cancer.  It would be interesting to award sub-category prizes for particular types of cancer, based on features like ER, PR, and HER2 status.

Stage Winners – An awful lot of the tactics of a bike race involve the dynamic between riders who are trying to win an individual stage and riders or teams that are more concerned with the overall standings.  The glory of a stage win is enough to cause riders to attempt to break away from the peloton even though they have no hope of competing for the overall lead.  With our challenge, we’d like to reward people for submitting ideas early and allowing others to incorporate and modify their models, instead of waiting until the last minute to submit an entry.  One way to do this would be to define intermediate points where we give some sort of recognition to the leader.  Maybe we could introduce new validation data at set periods to score the models, and then allow contestants to incorporate previous validation data as new training data in subsequent rounds.  Another approach we’ve discussed is to reward people for time spent on top of the leaderboard and for the amount of improvement they make over the previous best score.  The person who makes many submissions that improve the top score over the course of the competition may have contributed more than someone who sneaks in a marginally better model right at the end of the challenge.
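One way to operationalize the “amount of improvement” idea is to credit each contestant with the margin by which their submissions beat the previous best score.  This is purely a hypothetical sketch of such a scoring rule; the names, scores, and log format are invented for illustration:

```python
# Hypothetical scoring rule: credit each contestant with the
# improvement their submissions make over the previous best score.
# The log is (contestant, score) in chronological order; higher
# scores are better.
submissions = [
    ("alice", 0.70),
    ("bob",   0.68),   # does not beat the leader: no credit
    ("bob",   0.75),
    ("alice", 0.76),
]

def improvement_credit(log):
    """Return total improvement contributed by each contestant."""
    best = float("-inf")
    credit = {}
    for who, score in log:
        if score > best:
            # The first submission sets the baseline without credit.
            gain = score - best if best != float("-inf") else 0.0
            credit[who] = credit.get(who, 0.0) + gain
            best = score
    return credit

print({k: round(v, 3) for k, v in improvement_credit(submissions).items()})
# prints: {'alice': 0.01, 'bob': 0.05}
```

Under this rule, bob’s big mid-competition jump earns him more credit than alice’s marginal last-minute win, which is exactly the incentive shift described above.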

Visible Leaderboard – In the Tour, everyone knows what everyone else is doing.  Riders have their coaches in their ears, giving them information on what other riders are doing, and teams constantly replan strategy as the race unfolds over several weeks.  We’ve got a basic leaderboard up giving real-time feedback on the challenge, and it seems to be a good way to engage people.  I’m sure that with more time and resources we could make this resource more exciting for the challengers.  Maybe we should pull in contestants’ Synapse profiles and try to connect the contestants to an audience. Math and statistics might not be as sexy as bike racing, but there have got to be some cancer survivors out there who would be interested in watching people compete to better understand the disease.

So what do you think?  Do you have any better ideas on how to organize these scientific challenges?  We are open to experimentation.
