Ah, fall. A new school year, and a new class on my way to finishing this darn degree of mine. This semester I’m taking “Statistical Analysis of Genomics Data”, and the whole first week was dedicated to discussing reproducible research. As you can imagine, I was psyched. I’ve talked quite a bit about reproducible research before, but genomics data has some twists I hadn’t previously considered. In addition to all the usual replication issues, here are a few that come up when you try to replicate genomics studies:
- Different definitions of “raw data”: In the paper “Repeatability of published microarray gene expression analyses”, John Ioannidis et al. attempted to reproduce one figure from each of 18 papers that used microarray data. They fully succeeded with 2 of them. The number one reason for failure to replicate? Not being able to access the raw data that was actually used. In most cases the data had been deposited (as required by the journal), but it had not really been reviewed to check whether it was just summary data, or whether the files were even clearly identifiable. Six of the 18 research groups had deposited data you couldn’t even attempt to use, and other groups had data so raw it was basically useless. Makes me shudder just to think about it.
- Large and unwieldy data files: Even in papers where the data was available, it was not always usable. Ioannidis et al. had trouble reproducing about 8 of the papers due to unclear data decisions. Essentially, the files and the data were there, but they couldn’t figure out how anyone had actually waded through them to produce the published results. To give you a picture of how big these data files are, my first homework for this class used a “practice” file that was 20,689 × 37…or almost 800,000 data points. Unless that data is very well labeled, you will have trouble recreating what someone else did.
- Non-reproducible workflow: Anyone who’s ever attempted to tame an unwieldy data set knows it’s a trek and a half. I swear to god I have actually emerged from my office sweating after one of those bouts. That’s not so terrible, but what can kick it to the seventh circle of hell is finding out there was an error in the data set and now you have to redo the whole thing. In 8 of the papers Ioannidis et al. looked at, the team couldn’t figure out what the authors had actually done to generate their figures. Turns out, sometimes the authors can’t figure out what they did to generate their figures either…which is why we end up with videos like this:
All that copy/pasting and messing around is just ASKING for an error.
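The usual fix for the copy/paste mess is to put every step, from raw file to final number, in one script that can simply be rerun when an error turns up. Here’s a minimal Python sketch of that idea; the file layout (a header row, values in the second column) and the blank-cell cleaning rule are made-up illustrations, not any paper’s actual method:

```python
import csv
import statistics

def run_pipeline(raw_path):
    """Raw file in, summary out -- rerunnable after any data fix."""
    with open(raw_path, newline="") as f:
        rows = list(csv.reader(f))
    header, data = rows[0], rows[1:]
    # The cleaning decision is recorded in code, not done by hand:
    # drop any row containing a blank cell.
    clean = [r for r in data if all(cell.strip() for cell in r)]
    values = [float(r[1]) for r in clean]
    return {"n": len(values), "mean": statistics.mean(values)}
```

Because the cleaning rule lives in the script instead of in someone’s memory, fixing the raw file and rerunning one command regenerates everything downstream.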
- Software version changes: Another non-glamorous way things can get screwed up: you update your software partway through and the original stuff you wrote gets glitchy. This is an enormous headache if you notice it, and a huge issue if you don’t. 2 of the papers Ioannidis et al. looked at didn’t report the software version used and couldn’t be reproduced. R is the most commonly used software for analyses like this, and because it’s open source and updated frequently, new versions aren’t always compatible with code written under old ones.
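One cheap guard is to stamp every analysis with the exact versions it ran under, so a later rerun can spot mismatches. A minimal Python sketch of the idea (the `version_stamp` helper is my own invention, not a standard tool; R users would reach for something like `sessionInfo()` instead):

```python
import platform
import sys

def version_stamp(packages):
    """Record the interpreter version plus each named package's version."""
    stamp = {"python": platform.python_version()}
    for name in packages:
        module = sys.modules.get(name) or __import__(name)
        # Fall back to "unknown" for modules that don't expose a version.
        stamp[name] = getattr(module, "__version__", "unknown")
    return stamp
```

Save the stamp alongside your results, and “which version did I run this under?” stops being a guessing game.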
- Excel issues: Okay, so you deposited your data, made a reproducible workflow, recorded your version of R, and now you are awesome, right? Not necessarily. It turns out that Excel, one of the most standard computer programs on the planet, can seriously screw you up. A recent paper discovered that about 20% of genomics papers with Excel gene lists had inadvertently converted some gene names to either dates or floating point numbers. This almost certainly means those renamed genes didn’t end up included in the final analysis, but what effect that had is unknown. Sadly, the rate of this error is actually increasing by about 15% a year. Oof.
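You can screen a gene list for Excel’s landmines before it ever touches a spreadsheet. A rough Python sketch, assuming the known trouble spots (SEPT, MARCH, and DEC symbols become dates like “2-Sep”; digit-E-digit clone IDs become floats like 2.31E+13); the regexes here are an illustration, not the actual screening method from the paper:

```python
import re

# Gene symbols that Excel auto-converts to dates (e.g. SEPT2 -> "2-Sep").
DATE_LIKE = re.compile(r"^(SEPT|MARCH|DEC)\d+$")
# Clone IDs that Excel reads as scientific notation (e.g. 2310009E13).
SCI_NOTATION = re.compile(r"^\d+E\d+$")

def flag_risky_symbols(symbols):
    """Return the identifiers Excel would silently mangle."""
    return [s for s in symbols if DATE_LIKE.match(s) or SCI_NOTATION.match(s)]

genes = ["TP53", "SEPT2", "MARCH1", "BRCA1", "2310009E13"]
print(flag_risky_symbols(genes))  # ['SEPT2', 'MARCH1', '2310009E13']
```

Even a crude check like this, run before export, turns a silent corruption into a visible warning.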
I am tempted to summarize all this by saying “Mo’ Data Mo’ Problems”, but…..no, actually, that sounds about right. Any time you can’t actually personally review all the data, you are putting your faith in computer systems and the organization of the files. Good organization is key, and it’s hard to focus on that when you’re wading through data files. Semper vigilans.