Breast Cancer Predictive Modeling Challenge

Way back at the Sage Congress in April, Sage Bionetworks and DREAM announced a jointly sponsored modeling challenge.  The basic idea behind this effort is to try to catalyze a better understanding of the disease by framing a challenge for the entire scientific community to solve.  The starting point for the challenge is a clinical study of breast cancer patients.  For all of these women we now have a range of full-genome molecular data, as well as detailed clinical data on the cancer and course of treatment.  The immediate scientific goal of this challenge is to see who can build the best model of survival time, segmenting patients into aggressive vs. non-aggressive disease.  If you’re interested in the scientific details of the challenge, you can read more on the contest wiki site, or watch last week’s intro video.
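To make the modeling task a bit more concrete, here is a minimal sketch of the kind of baseline survival model a participant might start from.  This is purely illustrative, not the challenge’s official baseline or scoring code: it uses the Python lifelines library, and the file name and the “survival_days”/“deceased” columns are hypothetical stand-ins for the real clinical fields.

```python
# Purely illustrative baseline: a Cox proportional hazards model of survival
# time.  File name and the "survival_days"/"deceased" columns are hypothetical.
import pandas as pd
from lifelines import CoxPHFitter
from lifelines.utils import concordance_index

data = pd.read_csv("training_clinical.csv")  # hypothetical feature/outcome table

# Hold out 20% of patients to check how well the model generalizes.
train = data.sample(frac=0.8, random_state=0)
test = data.drop(train.index)

# Fit the model; all remaining columns are treated as covariates.
cph = CoxPHFitter()
cph.fit(train, duration_col="survival_days", event_col="deceased")

# Higher partial hazard = predicted to have more aggressive disease.
risk = cph.predict_partial_hazard(test)

# Concordance index: how well the predicted risk ordering matches observed
# survival (0.5 is random guessing, 1.0 is a perfect ordering).
cindex = concordance_index(test["survival_days"], -risk, test["deceased"])
print(f"Held-out concordance index: {cindex:.3f}")
```

The concordance index used here is just one standard way to measure how well a risk score orders patients by survival; the challenge’s own scoring rules live on the contest wiki mentioned above.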

The higher-level experiment that Sage is running, though, is really about the social structures governing how science is done, and whether non-traditional incentives and structures can accelerate the discovery process itself.  The traditional way this research would be performed is for a high-powered company or academic group to organize and execute a clinical study: an enormously expensive undertaking, since it involves managing care and collecting data on a statistically significant number of patients.  The data used in our challenge was gathered over 10 years and comes from about 2,000 patients.  Once collected, this data becomes a valuable commodity.  The group running the trial will typically hold it while analyzing it, and only release their high-level conclusions in the form of publications, or in support of FDA approval for the sale of a new drug.  The data may be shared long after its generators believe they have extracted all value from it, or used as a trading chip as a few high-powered labs form closed collaborations.

Sage’s hypothesis is that getting this data into the public domain as quickly as possible is in the best interests of patients and society as a whole.  Last week we finally had a big public launch of the challenge and got a phenomenal response from the community.  So far, over 160 people have registered to attempt the modeling challenges, and over 100 attended last week’s online launch event.  We’ve already had a few people submit models, and even have one that beats the simple baseline approach we used as an example.  Compared to other research efforts I’ve seen, the wide-open approach of the challenge seems to have generated far more interest from the community.  We’ll see by this fall whether this turns into better science.

Of course, the idea of framing scientific challenges and posting rewards for their completion is not a new one.  Our partner DREAM has been doing this in the academic setting for 7 years, and there are some interesting commercial sites out there like Kaggle and Innocentive.  However, I think some of the things we’re doing in the course of this challenge are pushing the envelope on this format:

  • Generating experimental validation data – In parallel with running the challenge, we’ve identified another 350 or so frozen samples obtained in one of the studies providing the underlying data for the challenge.  Sage has raised funding to generate new molecular data from these samples while the modeling challenge runs.  One of the big problems in this space is that statistical models produced in one study don’t hold up when applied to data not used to generate them.  It will be interesting to see whether the best models from the competition phase hold up and generalize to this new data, which will be used to determine the final winner.
  • Publishing the winner in a premier journal – Science Translational Medicine has agreed to publish an article written by the winner.  Instead of the usual system of blind peer review by two experts, the fact that the challenge is run on a completely open platform will provide a broader and hopefully more rigorous way of ensuring the winning approach is reviewed and understood by the community.
  • Requiring participants to submit reusable code – Computational experts often complain about a lack of access to high-powered data sets, but can be less willing to invest the time to make their own analytical code open to use by others.  Unlike many other challenges, in which analysts simply submit a vector of predictions for the validation data, in our challenge analysts will have to submit code that we can run to score the model (see the sketch after this list for the general shape of such a submission).
  • Providing participants dedicated compute space – Through a generous donation by Google, we are providing all participants use of a Google virtual machine running on Google’s new Compute Engine service.  Besides raw compute power (currently 2,000 cores dedicated to community use), this approach helps ensure reproducibility by giving everyone commonly configured infrastructure.
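To illustrate the reusable-code requirement above, here is a hypothetical sketch of the general shape such a submission could take: instead of handing over a vector of predictions, a participant provides functions the organizers can run themselves on data they hold back.  The function names, signatures, and choice of scikit-learn are all illustrative assumptions, not the challenge’s actual submission format.

```python
# Hypothetical shape of a "reusable code" submission: functions the organizers
# can run on data they hold back.  Names, signatures, and the scikit-learn
# model are illustrative only, not the challenge's actual interface.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def train_model(features: pd.DataFrame, survival_days: pd.Series):
    """Fit a model mapping clinical/molecular features to survival time."""
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(features, survival_days)
    return model


def predict_risk(model, features: pd.DataFrame) -> pd.Series:
    """Return one risk score per patient (higher = more aggressive disease)."""
    # Shorter predicted survival translates to a higher risk score.
    return pd.Series(-model.predict(features), index=features.index)
```

Because the organizers re-run the code themselves, the scoring is reproducible, and the model remains usable by others once the challenge is over.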

I’m really excited to get this project off the ground, and am sure I’ll have more detailed posts on specific aspects of it over the coming months.


Meltdown in the Cloud

One of the perks of working for a small non-profit is that we can simply decide to do things that large companies never would.  This week, Sage Bionetworks is completely closed, with a week-long summer vacation as our bonus for hitting last year’s company-wide objectives.  Of course, leave it to Mr. Murphy to intrude on what seemed like a great morale-booster.

As our week off approached and people started heading out to start their vacations early, we got a first warning that shutting down completely is probably not an option if you’re operating an online service, even an early-stage beta that’s currently supporting a handful of users.  The week before our break we had a brief service outage caused by a security certificate expiring on the Crowd server backing our authentication services.  Although we corrected the problem quickly, that incident got our attention enough to decide that pushing out a new release with only minor bug fixes right before the break was probably not worth the risk.  We decided to just leave everything alone and resume work once everyone was back in town.

Unfortunately, that first incident was only the precursor to the main event.  Friday night, Amazon suffered a power outage at their Virginia data center, which also affected a number of other much higher profile sites like Netflix, Heroku, and Instagram.  Saturday morning I woke to the realization that Synapse was completely non-functional.  It took the better part of the day for us to trace the problem to our AWS RDS service being down, and recover the live system using a backup taken shortly before the event. 

So, does this incident bring into question the decision to use cloud services, and particularly Amazon, as the foundation of our own service?  What’s the alternative?  In my previous job my team built a service on a stack of servers we selected ourselves, installed in a local colo facility.  Several months in, after a couple of power outages had taken us down, we determined that this facility did not in fact have redundant power wired to our cage as they had promised us when we moved in.  Instead, they had just plugged us straight into Seattle City Light.  Stung by this, we found a new, more reputable hosting partner, located in a quality facility also serving the local ABC news affiliate.  They had a big bank of UPS devices, multiple redundant connections to the power grid, and the ability to run the site indefinitely off backup generators in case of emergency.  Still, even at that facility an electrical fire cut power to our servers for a couple of days.  At Sage, I could buy physical servers and put them in the data center right next to my office at the Hutch.  It might give me a warm, fuzzy, safe feeling, but I think that’s mostly due to a psychological tendency to imagine that bad things only happen to other people.  I doubt they’d really be any safer than rented servers sitting in an AWS facility in Virginia.

So, I think the lesson here is not that cloud services are unreliable and you should build everything yourself.  Rather, it’s a reminder not to believe the cloud marketing hype too strongly, and to remember that cloud services are a leaky abstraction over the fact that you are programming someone else’s data center.  Most of the time that abstraction is a useful simplification which allows you to work more efficiently, but occasionally, like last Saturday, the simplification breaks down.  However, what I took away from responding to the crisis wasn’t a desire to move elsewhere; it was a set of changes to the way we configure and maintain our service that would leave us much better prepared to respond to events like this in the future.  There’s no magic here, just a set of engineering tasks that we need to triage against other work, like adding new features or improving documentation, to provide the best experience for our users.  For example:

  • When building a service off a set of AWS services, it’s best to wire components together through CNAMEs instead of directly to the public name of each service.  We had done this with our app and web tiers running on Elastic Beanstalk, but not with RDS and our CloudSearch instances.  Consequently, we had to change configuration on the production server instead of just swapping in new services, slowing our response and increasing risk (see the first sketch after this list).
  • We need better monitoring of our services.  Both more extensive use of CloudWatch metrics on key components and a periodic external smoke test of the application (see the second sketch after this list) would have gotten us information about the problem more quickly.
  • We need better automation for provisioning new environments.  Some Synapse components require a manual set-up process, documented via a wiki.  Like most wikis we try to keep ours up to date, but things occasionally get missed.  An automated, version-controlled script for building a new environment is self-documenting, and stays accurate as long as you keep using it to create new environments.
  • We should think hard about the number of different components comprising our service.  Some, like the Crowd server, offer only a bit of value for us, at the cost of making our overall environment more complex to administer and troubleshoot.  We need to continuously watch our environment and make sure we are engineering for reliability and ease of maintenance as we expand functionality.
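As a rough illustration of the CNAME indirection in the first point, here is a sketch using the boto3 SDK and Route 53.  The hosted zone, record name, and endpoint are placeholders rather than our actual configuration, and boto3 is just one convenient way to script this, not necessarily what we have in place.

```python
# Sketch: keep an internal CNAME (the name the app tier is configured with)
# pointed at whichever RDS endpoint is currently live, so recovering from a
# failure means repointing DNS rather than editing production configuration.
# Zone ID, record name, and endpoint below are placeholders.
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z_PLACEHOLDER_ZONE"
INTERNAL_NAME = "db.prod.example.org."                                # name the app uses
NEW_RDS_ENDPOINT = "restored-db.abc123.us-east-1.rds.amazonaws.com"   # placeholder

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Comment": "Repoint the app tier at the restored RDS instance",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": INTERNAL_NAME,
                "Type": "CNAME",
                "TTL": 60,  # keep the TTL short so a swap takes effect quickly
                "ResourceRecords": [{"Value": NEW_RDS_ENDPOINT}],
            },
        }],
    },
)
```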
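And for the monitoring point, here is a minimal sketch of the kind of periodic external smoke test that would have caught the outage hours earlier.  The endpoint URL and the alerting hook are placeholders; in practice the alert would page whoever is on call.

```python
# Minimal external smoke test: hit a public endpoint from outside AWS and
# alert if it stops answering.  URL and alert hook are placeholders.
import time
import urllib.request

SERVICE_URL = "https://synapse.example.org/version"  # placeholder endpoint


def service_is_up(url: str, timeout: float = 10.0) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except Exception:
        return False


def alert(message: str) -> None:
    # Placeholder: in practice, page the on-call engineer (email, SMS, etc.).
    print(f"ALERT: {message}")


if __name__ == "__main__":
    while True:
        if not service_is_up(SERVICE_URL):
            alert(f"Smoke test failed for {SERVICE_URL}")
        time.sleep(300)  # check every five minutes
```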

However, in the end, I still think the flexibility and development speed we get from building on Amazon’s building blocks is worth the cost.  More days like last Saturday could change my mind about Amazon, but I’d likely be shopping for a more reliable vendor, not trying to bring more low-level IT maintenance in house.


Synapse Now Open Source

I am happy to announce that all Synapse source code is now posted on the Sage Bionetworks GitHub site.  Of course, since Sage is a non-profit institute focused on promoting open science you might fairly ask why this is news now, over a year and a half after we started initial coding.  Why wasn’t the code up on GitHub from the very beginning?

Well, in retrospect maybe we should have started that way from the very beginning.  I doubt it would have hurt anything, and it might actually have facilitated a collaboration or two that we missed out on during the initial stages of the project.  In our first months at Sage, our team didn’t know what we had, or what we were doing, or whether people would care at all about our vision.  We picked a set of development tools that were familiar and powerful (Atlassian’s hosted suite) and just focused on prototyping ideas and evaluating base technologies.  At the beginning our co-workers didn’t even pay much attention to what we were doing, even though they were the customers we were supposed to be supporting.  It took some time for the vision of Synapse to form, and for us to start getting people interested in what we were building.

Once the project was underway, adding more functionality always seemed to take precedence over taking the time to think about how to effectively get others to collaborate on development.  It was really only at the Sage Congress last April, when we started demoing the product, that we felt enough traction forming within the community to start believing we were seriously on to something.  There’s still an awful lot of new functionality we desperately want to start working on, and it’s quite tempting to dive straight into coding again.

However, we’ve now bitten the bullet and taken the last month to focus not just on moving code to Git, but also on structuring our codebase and development practices to make it easier for other developers to come on board.  We’ve refactored the codebase into smaller pieces, are putting more effort into developer docs, and are thinking about how to review and incorporate check-ins from external developers.  The first of three new engineers started recently: a great young student on an internship, which is a good chance to do a dry run with a fresh set of eyes.  We’ve got more open positions that I’m working hard to fill, but even more ideas than people who can implement them.  Hopefully taking the time to do this right now will let us more fully engage the community on the development front in the future.
