If Not On-Base Percentage, Then What Should We Measure?

A while back our company had a movie night featuring Moneyball.  I missed that party due to a business trip, but the movie finally popped to the top of my Netflix queue and turned out to be a great pick.  While I’ve never quite caught the fantasy sports bug myself, predicting which players will form the best sports team is a difficult and interesting problem.  There are plenty of variables that can be measured to gauge an athlete’s potential, but the things that are easy to measure are often only loosely correlated with fielding a team that produces the most wins.

In the movie, the key insight that gave the Oakland A’s an edge was the realization that certain statistics, like on-base percentage, were more predictive of a player’s impact on wins than subjective characteristics (ugly girlfriend = low confidence = bad hitter), physical characteristics (raw speed), or even the traditionally hallowed metrics of on-field performance (batting average).  The A’s strategy became to find players who scored relatively high on these unconventional metrics but relatively low on the conventional ones.  This gave them a set of players whose talent was undervalued by their competitors, and thus accessible to the A’s organization despite its limited payroll.
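To make the difference between the two statistics concrete, here is a small sketch using the standard definitions of batting average and on-base percentage.  The two players and their season totals are made up for illustration; they are not taken from the movie or from real data.

```python
def batting_average(hits, at_bats):
    # Conventional metric: hits per at-bat. Walks are ignored entirely.
    return hits / at_bats


def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sac_flies):
    # Standard OBP definition: times on base over the plate appearances that count.
    return (hits + walks + hit_by_pitch) / (at_bats + walks + hit_by_pitch + sac_flies)


# Hypothetical season totals, invented for this example.
players = {
    "Player A": dict(hits=150, walks=30, hit_by_pitch=2, at_bats=500, sac_flies=4),
    "Player B": dict(hits=135, walks=90, hit_by_pitch=5, at_bats=500, sac_flies=4),
}

stats = {}
for name, s in players.items():
    avg = batting_average(s["hits"], s["at_bats"])
    obp = on_base_percentage(**s)
    stats[name] = (avg, obp)
    print(f"{name}: AVG {avg:.3f}  OBP {obp:.3f}")
```

Player B looks worse by batting average but gets on base more often because of the walks, which is roughly the kind of player the A’s were shopping for.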

Last week at work we had a great visit from Adrien Treuille, a Carnegie Mellon CS professor and developer of a completely different type of game: Fold It.  Fold It is a game designed to allow humans to solve the protein folding problem: given the huge number of degrees of freedom available to an amino acid sequence, find the lowest-energy shape in which the protein will eventually end up.  This problem has traditionally been attacked by massive computer simulation.  Given knowledge of the thousands of atoms in a protein, and the laws governing the forces between those atoms, it should be possible to simulate the molecular dynamics of the system until it reaches its minimum energy.  The catch is that the energy landscape is extremely complex, with a huge number of local minima that thwart even today’s computers.  By turning protein folding literally into a computer game, Adrien and his colleagues were able to recruit a bunch of non-scientists who knew nothing of the laws of chemistry but learned to find true protein structures better than the computers.
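The local minimum problem is easy to see even in one dimension.  The sketch below is a toy of my own, not anything from Fold It or from real molecular dynamics: the energy function is an invented bowl with ripples, and the search is a naive downhill walk that stops as soon as neither neighbor is lower.

```python
import math


def energy(x):
    # Toy rugged landscape (an assumption for illustration):
    # a broad bowl plus ripples that create many local minima.
    return 0.1 * x * x + math.sin(5.0 * x)


def greedy_descent(x, step=0.01, max_iters=10_000):
    # Walk downhill in small steps until neither neighbor has lower energy.
    for _ in range(max_iters):
        here = energy(x)
        left, right = energy(x - step), energy(x + step)
        if left < here and left <= right:
            x -= step
        elif right < here:
            x += step
        else:
            break
    return x


# Start the same search from 33 evenly spaced points between -8 and 8.
starts = [-8.0 + 0.5 * i for i in range(33)]
finals = [greedy_descent(s) for s in starts]

best = min(finals, key=energy)
# Count runs that stopped noticeably above the best energy found.
stuck = sum(1 for f in finals if energy(f) > energy(best) + 0.01)
print(f"{stuck} of {len(starts)} starts ended in a worse local minimum")
```

Most starting points settle into whichever dip happens to be nearest.  A real protein has thousands of coordinates rather than one, so the number of such dips is vastly larger.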

In both Moneyball and protein folding, there seem to be areas where humans can add value: high-level strategies and heuristics, picking the right features to pay attention to, and defining the right success criteria.  There are other areas where computers excel: crunching huge amounts of data and applying algorithms at massive scale to fairly tightly defined problems.

There’s been a lot of effort recently devoted to algorithms that predict phenotypes based on the large number of molecular features coming out of modern genomics platforms.  This is a hot area because it has direct clinical applications, like predicting a patient’s likely response to different treatment options based on her genetic makeup.  Unfortunately it’s been relatively difficult to squeeze information out of the genomics data increasingly being generated in clinical trials.  A recent Nature Biotechnology paper summarized the difficulty in a survey of different machine learning approaches: there is no magic algorithm for squeezing extra information out of data.  Many statistical methods yield similar results on the same data; the “best” method for a particular data set is only marginally better than the rest of the field, and often changes with the data or scoring criteria; a few workhorse machine learning algorithms seem to work well on a variety of problems in a variety of fields.  Future progress will likely depend on human insight into ways to formulate the problem upstream of the machine learning algorithms.

In discussions with Adrien we were sadly unable to come up with a next game to follow his successes in the biochemistry domain.  Biochemistry, after all, is based on physical structures that translate nicely to an interactive gaming experience.  The genotype-phenotype prediction problem is a bit more abstract.  However, I hope we can be inspired by his work and at least find ways to better engage a community of scientists and patients to work on these sorts of problems.  After all, this is another area where just throwing more data generation and computing power at the problem isn’t likely to make much progress.


About Michael Kellen

I've spent my career working at the intersection of science and technology. Currently I lead the technology team at Sage Bionetworks, but this blog contains my own thoughts.