Way back at the Sage Congress in April, Sage Bionetworks and DREAM announced a jointly sponsored modeling challenge. The basic idea behind this effort is to catalyze a better understanding of breast cancer by framing a scientific challenge for the entire scientific community to solve. The starting point for the challenge is a clinical study of breast cancer patients. For all of these women, we now have a range of full-genome molecular data, as well as detailed clinical data on the cancer and course of treatment. The immediate scientific goal is to see who can build the best model of survival time, segmenting patients into aggressive vs. non-aggressive disease. If you’re interested in the scientific details of the challenge, you can read more on the contest wiki site, or watch last week’s intro video.
The higher-level experiment Sage is running, though, is really about the social structures governing how science is done, and whether non-traditional incentives and structures can accelerate the discovery process itself. The traditional way this research would be performed is for a high-powered company or academic group to organize and execute a clinical study: an enormously expensive undertaking, as it involves managing care and collecting data for a statistically significant number of patients. The data used in our challenge was gathered over 10 years and comes from about 2,000 patients. Once collected, this data becomes a valuable commodity. The group running the trial will typically hold it while analyzing it, releasing only high-level conclusions in the form of publications, or in support of FDA approval for the sale of a new drug. The data may be shared long after its generators believe they have extracted all value from it, or used as a trading chip as a few high-powered labs form closed collaborations.
Sage’s hypothesis is that getting this data into the public domain as quickly as possible is in the best interests of patients and society as a whole. Last week we finally had the big public launch of the challenge and got a phenomenal response from the community. So far, over 160 people have registered to attempt the modeling challenges, and over 100 attended last week’s on-line launch event. We’ve already had a few people submit models, and one even beats the simple baseline approach we used as an example. Compared to other research efforts I’ve seen, the wide-open approach of the challenge seems to have generated far more interest from the community. We’ll see by this fall whether this turns into better science.
Of course, the idea of framing scientific challenges and posting rewards for their completion is not a new one. Our partner DREAM has been doing this in the academic setting for 7 years, and there are some interesting commercial sites out there like Kaggle and Innocentive. However, I think some of the things we’re doing in the course of this challenge are pushing the envelope on this format:
- Generating experimental validation data – In parallel with running the challenge, we’ve identified roughly 350 additional frozen samples from one of the studies providing the challenge’s underlying data, and Sage has raised funding to generate new molecular data from them. One of the big problems in this space is that statistical models produced in one study often don’t hold up when applied to data not used to generate them. It will be interesting to see whether the best models from the competition phase generalize to this new data, which will be used to determine the final winner.
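To make the generalization concern concrete, here is a minimal sketch of the kind of held-out evaluation described above, using a simplified concordance index (the fraction of patient pairs a model ranks correctly by risk). All data, feature values, and the no-censoring simplification are illustrative assumptions, not the challenge’s actual scoring method:

```python
# Minimal sketch of held-out validation for a survival risk model.
# The data and the censoring-free scoring rule are illustrative only.

def concordance_index(surv_times, risk_scores):
    """Fraction of comparable patient pairs ranked correctly:
    a higher risk score should mean a shorter survival time.
    (Censoring is ignored here for simplicity.)"""
    correct, total = 0, 0
    n = len(surv_times)
    for i in range(n):
        for j in range(i + 1, n):
            if surv_times[i] == surv_times[j]:
                continue  # tied times are not comparable in this simple version
            total += 1
            # the patient with the shorter survival time should score higher risk
            shorter, longer = (i, j) if surv_times[i] < surv_times[j] else (j, i)
            if risk_scores[shorter] > risk_scores[longer]:
                correct += 1
    return correct / total

# Hypothetical survival times (months) and model risk scores
train_times = [12, 30, 45, 60, 80]
train_risks = [0.9, 0.7, 0.5, 0.4, 0.1]   # perfectly ordered on training data
valid_times = [15, 25, 50, 70]
valid_risks = [0.8, 0.9, 0.3, 0.4]        # ordering degrades on held-out data

print(concordance_index(train_times, train_risks))  # 1.0 on training data
print(concordance_index(valid_times, valid_risks))  # lower on held-out data
```

A model can score perfectly on the data used to build it while ranking many pairs wrong on new samples, which is exactly why the freshly generated validation data matters.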
- Publishing the winner in a premier journal – Science Translational Medicine has agreed to publish an article written by the winner. Instead of the usual blind peer review by two experts, the fact that the challenge runs on a completely open platform should provide a broader, and hopefully more rigorous, way of ensuring the winning approach is reviewed and understood by the community.
- Requiring participants to submit reusable code – Computational experts often complain about a lack of access to high-powered data sets, but can be less willing to invest the time to make their own analytical code open for others to use. Unlike many other challenges, in which analysts simply submit a vector of predictions for the validation data, in our challenge analysts must submit code that we can run to score the model.
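As a rough illustration of the difference, a runnable submission might expose train and predict entry points that the organizers, not the entrant, invoke on held-out data. The class name, method signatures, and the single-feature placeholder model below are all hypothetical, not the challenge’s actual API:

```python
# Hypothetical shape for a reusable challenge submission: instead of a
# vector of predictions, the entrant provides code the organizers can run.
# All names and data here are illustrative assumptions.

class SubmissionModel:
    def train(self, clinical, molecular, survival_times):
        """Fit the model on the training cohort.
        As a placeholder, rank patients by one clinical feature."""
        self.feature = "tumor_size"  # hypothetical feature name

    def predict(self, clinical, molecular):
        """Return one risk score per patient; higher = more aggressive."""
        return [row[self.feature] for row in clinical]

# The organizers run training and scoring, including on data the
# entrant never sees:
model = SubmissionModel()
model.train(
    clinical=[{"tumor_size": 3.1}, {"tumor_size": 1.2}],
    molecular=None,
    survival_times=[24, 90],
)
scores = model.predict([{"tumor_size": 2.5}, {"tumor_size": 0.8}], molecular=None)
print(scores)  # [2.5, 0.8]
```

Because the code itself is the submission, anyone can rerun it on new data, which is what makes the models reusable rather than one-off prediction vectors.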
- Providing participants dedicated compute space – Through a generous donation from Google, we are providing all participants with a virtual machine running on Google’s new Compute Engine service. Besides raw compute power (currently 2,000 cores dedicated to community use), this approach helps ensure reproducibility by providing commonly configured infrastructure.
I’m really excited to get this project off the ground, and am sure I’ll have more detailed posts on specific aspects of it over the coming months.