The past two decades have seen remarkable growth in our ability to generate genomic data, fueled by the rapidly decreasing cost of sequencing technologies. However, with a few exceptions, acquisition of this type of data has so far failed to generate significant improvements in the treatment of human disease. Improving the methodology around this genotype-to-phenotype prediction problem is an active area of research. The challenge covers both the statistical and machine learning approaches to analyzing clinical genomics data and the broader incentives driving how scientists collaborate through shared data and approaches.
Sage Bionetworks believes that a key impediment to advances in this area is the relatively closed nature of scientific research, built around the publication – grant – work cycle of academia or the walled-off approaches of industry. For the past six years, the DREAM project has organized a series of challenges in systems biology meant to drive the field forward by getting multiple groups to attack the same problems from different angles. Both organizations believe that catalyzing more open and collaborative approaches to scientific research is essential to moving the field forward, and they have partnered to organize the Sage/DREAM Breast Cancer Prognosis Challenge.
The goal of the Breast Cancer Prognosis Challenge is to assess the accuracy of computational models designed to predict breast cancer survival (median 10-year follow-up) from clinical information about the patient’s tumor as well as genome-wide molecular profiling data, including gene expression and copy number profiles. The challenge is fueled by the generous donation of clinical study data on 1,200 breast cancer patients obtained by Carlos Caldas of Cancer Research UK and Anne-Lise Borresen-Dale of Oslo University Hospital.
The challenge consists of building predictive models of breast cancer survival, specified by survival data containing:
- Time from diagnosis until death, or time of last follow-up if the patient is not known to have died.
- Whether the patient was known to be alive at the time of last follow-up.
Predictive models will be built using the following feature data:
- Genome-wide gene expression profiles
- Genome-wide gene copy number profiles
- Detailed clinical information about each tumor
- Additional information from related data sets, such as other breast cancer studies (some suggested publicly available datasets will be formatted and provided to users as part of the challenge)
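Models built from these data must handle right-censoring: for patients still alive at last follow-up, only a lower bound on survival time is known. A common way to evaluate such models is the concordance index (C-index), which measures how often a model's predicted risk ordering agrees with the observed ordering of deaths. The challenge's actual scoring metric is not specified in this document, so the following plain-Python sketch is purely illustrative:

```python
def concordance_index(times, events, risks):
    """Fraction of comparable patient pairs whose predicted risk ordering
    agrees with their observed survival ordering.

    times:  observed follow-up time for each patient
    events: 1 if the death was observed, 0 if the patient was censored
            (alive at last follow-up)
    risks:  predicted risk score (higher = earlier predicted death)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable only if patient i's death was observed
            # before patient j's follow-up ended.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0      # model ranked the pair correctly
                elif risks[i] == risks[j]:
                    concordant += 0.5      # ties count as half-concordant
    return concordant / comparable

# Toy example with four patients (times in years); the last two are censored.
times  = [2.0, 5.0, 7.0, 10.0]
events = [1, 1, 0, 0]
risks  = [0.9, 0.6, 0.4, 0.1]  # higher risk predicted for earlier deaths
print(concordance_index(times, events, risks))  # perfectly concordant -> 1.0
```

A C-index of 1.0 indicates perfect ranking and 0.5 is no better than chance; because censored patients contribute only to pairs where the other patient's death was observed earlier, the metric uses all the information in the survival data without discarding censored records.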
The final scoring of the algorithms will be performed against data from an additional 500 patients whose tissue samples are already banked. The challenge timeline is:
- Challenge initiation: April 20th Sage Commons Congress – Description of challenge and opening of registration.
- Challenge evolution: April 20-October 1 – The challenge will evolve dynamically during the participation phase: participant input, discussions, and suggestions for improvement will be incorporated by the organizers to continuously improve the challenge. Moreover, the compute platform provided by Sage Bionetworks and Google will gain features throughout the duration of the challenge.
- Challenge submission deadline: October 1, 2012 – Final submission of all models to be scored against the validation data.
- Challenge conclusion: November 12-16, 2012 – DREAM 7 conference: review of the results, discussion of what was learned, and a presentation by the leading group.
A challenge in standardizing computational models developed for this competition is that most sophisticated models require large compute clusters for model optimization. Individual labs therefore often program customized workflows to run on their own cluster architecture, making it difficult to standardize or re-run analyses. For this reason, prior competition efforts either 1) abandoned the requirement for re-runnable code submissions (the DREAM6 competition) or 2) limited entries to those that complete in a small amount of time on a single processor (the Innocentive competition).
We believe that supporting reusable, extensible code is critical to facilitating the transparency, rigor, and community development that we hope to promote with this competition. We have therefore partnered with Google to donate compute cycles to the community, allowing participants to develop and test complex models on a common compute architecture in the cloud. In addition to promoting scientific rigor and transparency, this donation of compute time enables a “democratization of medicine” in which participants from around the world can develop sophisticated methods on a level playing field, without needing access to a well-resourced institution’s high-performance compute clusters. The challenge is unique in its attention to the reproducibility and comparison of different analytical approaches; notably, it requires contestants to run code in a common compute environment to ensure that results are reproducible.