When I describe Synapse, the software platform we are building at Sage Bionetworks, I often use the analogy “It’s like GitHub for biologists” or “GitHub for data science”. This seems to be useful, at least for the subset of people I talk to who know what GitHub is. For the rest of you: in a nutshell, GitHub is a set of online tools that make it super easy to share code and manage software development projects. It’s evolved into an active community where developers can quickly launch projects, contribute to or reuse existing projects, or recruit new developers to their projects. It has a particularly strong following in the open source community, but is also used by a number of major corporations to run at least some of their projects.
Now, the idea that basic science would benefit from importing some of the culture and tools of open source software is certainly not an original thought on my part. This particular meme has been floating around the net for some time; for example, see Marcio von Muhlen’s post “We need a GitHub of Science” for a more extended discussion.
I’m not actually going to ask the question “Do we need something like GitHub for biologists?” in this post. I’m going to assume the answer is yes, and move on to “If we need a GitHub for biologists, why not just adapt the current GitHub so that it is geared for biological data as well as code?” I think this is a more interesting question because adapting an existing technology to new uses often wins out over building something new. In fact, Wired recently published a story about GitHub which mentioned experiments with other forms of text documents on the site: books, legal contracts, even one person who uploaded his genotype information to spur research (which another user promptly forked and issued a pull request on). In addition to code sharing, GitHub also offers project management tools like wikis and issue trackers. These are pretty generic tools, and in fact we’ve seen similar tools migrate from the software team to the research teams at Sage (we’re actually using a GitHub competitor, Atlassian’s Jira suite, internally).
So, why not just teach scientists to use GitHub, instead of building a new, dedicated system? I came up with at least three reasons:
1. Git’s hyperdistributed peer-to-peer data sharing model is good for code, but bad for big scientific data. This is because Git works by placing a complete version of the entire code repository, including all versioning and branching, onto the laptop of every developer who clones (copies) the code base for their own personal use, and then allows them to pull (merge in) other people’s development as it proceeds. This has turned out to be a very powerful way to develop because it gives developers enormous flexibility to experiment, and then select the experiments that work. This works because code is small and usually evolves through small and mostly orthogonal diffs (changes) in text files that are efficient to merge.
In contrast to code developers, biologists, including those here at Sage Bionetworks, are dealing with fairly large data sets. Our mission at Sage Bionetworks is to facilitate integration across many clinical studies, each of which may contain 10-100 GB of data including full genotype, copy number, and expression profiles. The rapidly increasing availability of sequencing technologies like RNAseq, full exome, and even full genome promises to increase the rate of data generation faster than improvements in storage and, especially, network bandwidth. Furthermore, if you re-run an analysis, you’re probably going to change *all* the output results, and the analysis itself might require a warehouse full of computers to process in a reasonable time. Giving everyone a copy of the data on their local machine just won’t work. Giving analyst teams distributed access to shared, centralized data and compute resources is necessary, and is becoming more technically straightforward given the rise of commercial cloud computing platforms.
2. GitHub’s tools are optimized for the production of code. In contrast, bioinformatics data analysis is not the single task of writing software. Sure, some aspects of bioinformatics analysis require the writing of code: we do our share of this at Sage Bionetworks, and even have our own GitHub outpost where we post some newly developed algorithms and software. But most of bioinformatics analysis is the iterative application of existing methods to data. Analysts tune parameters, massage data into the right formats, and make adjustments until they are satisfied that their data can serve as a robust foundation for their scientific interpretation. All the top analysts I know work at an interactive command line in a scripting language (mostly R). They need dedicated tools that capture and share the status of this iterative workflow.
3. Bioinformatic data analysts are a distinct community, with a culture that is very different from that of open source developers. The real-time transparency of open source that developers take for granted is much less prevalent in science, where information is typically not shared until it can be published in a formal journal article. Ultimately, I hope Synapse can enrich and support a community of scientists working together more effectively. However, to create the right incentives we’re going to have to build a bridge to the more traditional metrics used to assess scientific productivity: publications and citations. This will require experimentation to develop the right interfaces to encourage scientists to work in a more open manner (for more detail, see my previous post).
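To put the first reason in rough numbers, here is a back-of-the-envelope sketch in Python. The 100 GB per study figure is the upper end of the range quoted above; everything else (the function names, the 100 Mbit/s link, 20 studies, 10 analysts) is a made-up assumption for illustration, not a measurement of anything we run at Sage.

```python
def transfer_hours(size_gb: float, bandwidth_mbps: float) -> float:
    """Hours needed to move size_gb gigabytes over a bandwidth_mbps link.

    Uses decimal units (1 GB = 8000 megabits) and ignores protocol overhead,
    so this is an optimistic lower bound.
    """
    size_megabits = size_gb * 8 * 1000
    return size_megabits / bandwidth_mbps / 3600


def total_copies_gb(studies: int, gb_per_study: float, analysts: int) -> float:
    """Total storage if every analyst keeps a full local copy of every study."""
    return studies * gb_per_study * analysts


if __name__ == "__main__":
    # Hypothetical scenario: 20 studies of 100 GB each, 10 analysts,
    # each on an assumed 100 Mbit/s connection.
    per_analyst_gb = 20 * 100
    print(f"One study:   {transfer_hours(100, 100):.1f} hours to copy")
    print(f"All studies: {transfer_hours(per_analyst_gb, 100):.1f} hours per analyst")
    print(f"Storage:     {total_copies_gb(20, 100, 10):,.0f} GB across the team")
```

Under these assumptions a single study takes a bit over two hours to copy and the full collection takes nearly two days per analyst, before anyone re-runs an analysis and regenerates the outputs. That is the scale at which "everyone gets a full copy" stops being practical.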
Having thought through what GitHub can offer vs. what bioinformatics scientists need, at Sage Bionetworks we’re taking the approach of “inspired by GitHub”, as opposed to “adapting GitHub”. If this post convinced you that we made the right choice, that’s great, and we’d appreciate any support you can give us on what I think is a pretty ambitious project. But if you still think that we can simply adapt GitHub for scientists’ use, then I’d really like to hear from you. After all, questioning basic assumptions until you get them right is what science and engineering are all about.