Does Popularity Influence Reliability? Methods and Results

Welcome to “Papers in Meta Science,” where we walk through published papers that use science to scrutinize science. At the moment we’re taking a look at the paper “Large-Scale Assessment of the Effect of Popularity on the Reliability of Research” by Pfeiffer and Hoffmann. Read the introduction here.

Okay, so when we left off last time, we were discussing the idea that findings in (scientifically) popular fields are less likely to be reliable than those in less popular fields. The theory goes that popular fields accumulate more false positives (because more experiments are being run overall), that increased competition encourages things like p-hacking and data dredging on the part of research teams, or both.
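The "more experiments, more false positives" half of that theory is easy to see with a toy simulation (mine, not the paper's, with made-up parameters): if each experiment on a false hypothesis has a 5% chance of a false positive, a popular field running 100x the experiments piles up roughly 100x the false positives. And since it's mostly the positives that get published, that's a lot more unreliable claims sitting in the literature.

```python
import random

random.seed(1)

def simulate_field(n_experiments, true_fraction=0.1, power=0.8, alpha=0.05):
    """Simulate one field. Each experiment tests one hypothesis;
    true_fraction of hypotheses are actually true, power is the chance
    of detecting a true one, alpha is the false-positive rate.
    Returns (true_positives, false_positives). All parameters are
    illustrative guesses, not numbers from the paper."""
    tp = fp = 0
    for _ in range(n_experiments):
        if random.random() < true_fraction:   # hypothesis is actually true
            if random.random() < power:
                tp += 1
        else:                                 # hypothesis is false
            if random.random() < alpha:
                fp += 1
    return tp, fp

for n in (100, 10_000):  # a quiet field vs. a popular field
    tp, fp = simulate_field(n)
    print(f"{n:>6} experiments: {tp} true positives, {fp} false positives")
```

The *rate* of false positives per experiment is the same in both fields; it's the sheer volume of experiments that loads the popular field's literature with findings that will never replicate.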

Methods: To test this hypothesis empirically, the researchers turned to the exciting world of protein interactions in yeast. While this is not what most people think of when they think of “popular” research, it’s actually a great choice: since the general public is mostly indifferent to protein interactions, all the popularity studied here is purely scientific. Any bias the researchers picked up will have come from their scientific training, not from preconceived beliefs of their own.

To get data on protein interactions, the researchers pulled both large data sets that cast a wide net and smaller data sets that looked for specific proteins, then compared the results between the two. The thought was that the large data sets, which tested huge numbers of interactions using a single algorithm, would be less subject to bias from human judgment, and could therefore be used to confirm or cast doubt on the smaller experiments that required more human intervention.
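At its core, that comparison is a set-overlap question: treat the big screen as a reference set and ask what fraction of the small-scale claims show up in it. Here's a minimal sketch with entirely hypothetical data (the protein names are real yeast genes, but the interactions listed are made up for illustration; this is not the paper's actual pipeline):

```python
# Reference set from a hypothetical high-throughput screen.
# Interactions are stored as frozensets so A-B and B-A count as the same pair.
high_throughput = {frozenset(pair) for pair in [
    ("ACT1", "MYO2"), ("CDC28", "CLN2"), ("SEC4", "SEC2"),
]}

# Claims text-mined from hypothetical small-scale papers.
small_scale_claims = [
    frozenset(("ACT1", "MYO2")),   # present in the big screen
    frozenset(("CDC28", "CLN2")),  # present
    frozenset(("ACT1", "HOG1")),   # not seen by the big screen
    frozenset(("SEC4", "YPT1")),   # not seen
]

confirmed = sum(1 for claim in small_scale_claims if claim in high_throughput)
rate = confirmed / len(small_scale_claims)
print(f"confirmation rate: {rate:.0%}")  # → 50%
```

The paper's analysis is obviously far more involved, but "what share of small-scale claims does the unbiased screen back up" is the quantity everything below turns on.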

Thanks to the wonders of text mining, the sample size here was HUGE – about 60,000 statements/conclusions made about 30,000 hypothesized interactions. The smaller data sets had about 6,000 statements/conclusions about 4,000 interactions.

Results: The overall results showed some interesting differences in confirmation rates.

Basically, the more popular an interaction, the more often that interaction was confirmed. However, the more popular an interaction *partner* was, the less often the interaction was confirmed. Confused? Try this analogy: think of proteins as kids in school and interactions as friendships. The popular kids were fairly easy to identify, and researchers got the popular kids’ established friendships right a lot of the time. However, once researchers tried to turn that around and figure out who else interacted with the popular kids, they started getting a lot of false positives. Just like the less-cool kids in high school might overplay their relationship to the cooler kids, many researchers tried to tie their new findings to previously recognized popular findings.

This held true for both the “inflated error effect” and the “multiple testing effect.” In other words, having a popular protein involved made individual statements or conclusions less likely to be validated, and also produced more interactions that were reported once but never replicated. This held true across all types of experimental techniques, and across both expert-curated databases and broader searches.
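To keep the two effects straight, here's a hypothetical tally (my toy data, not the paper's): the inflated error effect is about the per-statement confirmation rate, while the multiple testing effect is about interactions that surface once and are never heard from again.

```python
from collections import Counter

# Hypothetical log: one entry per published statement about an interaction,
# tagged with whether a high-throughput screen later confirmed it.
statements = [
    ("A-B", True), ("A-B", True),  # reported twice, confirmed
    ("A-C", False),                # reported once, never confirmed
    ("A-D", False),                # reported once, never confirmed
    ("B-E", True),                 # reported once, but confirmed
]

# Inflated error effect: how often is an individual statement confirmed?
per_statement = sum(ok for _, ok in statements) / len(statements)

# Multiple testing effect: how many interactions were reported exactly
# once and never confirmed?
reports = Counter(name for name, _ in statements)
confirmed_names = {name for name, ok in statements if ok}
one_shot = sum(1 for name, n in reports.items()
               if n == 1 and name not in confirmed_names)

print(f"per-statement confirmation rate: {per_statement:.0%}")     # 3/5 → 60%
print(f"reported once, never confirmed: {one_shot} interactions")  # → 2
```

The paper's finding, in these terms: when a popular protein is involved, the first number goes down and the second goes up.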

We’ll dive into the conclusions we can draw from this next week.