5 Things About the Many Analysts, One Data Set Paper

I’ve been a little slow on this, but I’ve been meaning to get around to the paper “Many Analysts, One Data Set: Making Transparent How Variations in Analytic Choices Affect Results”. This paper was published back in August, but I think it’s an important one for anyone looking to understand why science can often be so difficult.

The premise of this paper was simple, but elegant: give 29 teams the same data set and the same question to answer, then see how everyone does their analysis and if all of those analyses yield the same results. In this case, the question was “do soccer referees give red cards to dark skinned players more than light skinned players”. The purpose of the paper was to highlight how seemingly minor choices in data analysis can yield different results, and all participants had volunteered for this study with full knowledge of what the purpose was. So what did they find? Let’s take a look!

    1. Very few teams picked the same analysis methods. Every team in this study was able to pick whatever method they thought best fit the question they were trying to answer, and boy did the choices vary. First, the choice of analysis method varied. Next, the choice of covariates varied wildly. The data set contained 14 covariates, and the 29 teams ended up coming up with 21 different combinations to look at.
    2. Choices had consequences. As you can imagine, this variability produced some interesting consequences. Overall 20 of the 29 teams found a significant effect, but 9 didn’t. The effect sizes they found also varied wildly, with odds ratios running from 0.89 to 2.93. While that shows a definite trend in favor of the hypothesis, it’s way less reliable than the p<.05 model would suggest.
    3. Analytic choices didn’t necessarily predict who got a significant result. Now because all of these teams signed up knowing what the point of the study was, the next step in this study was pretty interesting. All the teams’ methods (but not their results) were presented to all the other teams, who then rated them. The highest rated analyses gave a median odds ratio of 1.31, and the lower rated analyses gave a median odds ratio of… 1.28. The presence of experts on the team didn’t change much either. Teams with previous experience teaching or publishing on statistical methods generated odds ratios with a median of 1.39, and the ones without such members had a median OR of 1.30. They noted that those with statistical expertise seemed to pick more similar methods, but that didn’t necessarily translate into significant results.
    4. Researchers’ beliefs didn’t influence outcomes. Now of course the researchers involved in this had self-selected into a study where they knew other teams were doing the same analysis they were, but it’s interesting to note that those who said up front they believed the hypothesis was true were not more likely to get significant results than those who were more neutral. Researchers did change their beliefs over the course of the study, however. While many of the teams updated their beliefs, it’s good to note that the most likely update was “this is true, but we don’t know why”, followed by “this is true, but may be caused by something we didn’t capture in this data set (like player behavior)”.
    5. The key differences in analysis weren’t things most people would pick up on. At one point in the study, the teams were allowed to debate back and forth and look at each other’s analyses. One researcher noted that those teams that had included league and club as covariates were the ones who got non-significant results. As the paper states, “A debate emerged regarding whether the inclusion of these covariates was quantitatively defensible given that the data on league and club were available for the time of data collection only and these variables likely changed over the course of many players’ careers”. This is a fascinating debate, and one that would likely not have happened had this data set just been analyzed by one team. This choice was buried deep in the methods section, and I doubt under normal circumstances anyone would have thought twice about it.
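To make that league-and-club point concrete, here is a minimal sketch of how including one covariate can move an odds ratio. The counts below are invented for illustration and are not the paper’s data, and the Mantel-Haenszel estimator is just one simple way to adjust for a covariate, not the method any of the 29 teams necessarily used.

```python
# Hypothetical counts (NOT from the paper): (red-carded, not carded) players
# by skin tone, split by a made-up "league" covariate.
strata = {
    "strict_league": {"dark": (40, 360), "light": (10, 90)},
    "lenient_league": {"dark": (5, 95), "light": (20, 380)},
}


def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table: group 1 has a events and b non-events,
    group 2 has c events and d non-events."""
    return (a * d) / (b * c)


def crude_odds_ratio(strata):
    """Ignore the covariate entirely by pooling every stratum into one table."""
    a = sum(s["dark"][0] for s in strata.values())
    b = sum(s["dark"][1] for s in strata.values())
    c = sum(s["light"][0] for s in strata.values())
    d = sum(s["light"][1] for s in strata.values())
    return odds_ratio(a, b, c, d)


def mantel_haenszel_odds_ratio(strata):
    """Covariate-adjusted odds ratio using the Mantel-Haenszel estimator."""
    numerator = 0.0
    denominator = 0.0
    for s in strata.values():
        a, b = s["dark"]
        c, d = s["light"]
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator


print("Ignoring league:     ", round(crude_odds_ratio(strata), 2))            # about 1.55
print("Adjusting for league:", round(mantel_haenszel_odds_ratio(strata), 2))  # 1.0
```

With these made-up numbers the odds ratio is exactly 1 inside each league, but pooling the leagues produces an apparent effect of about 1.55. Same data, one covariate decision, two different stories.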

That last point gets to why I’m so fascinated by this paper: it shows that lots of well intentioned teams can get different results even if no one is trying to be deceptive. These teams had no motivation to fudge their results or skew anything, and in fact were incentivized in the opposite direction. They still got different results however, for reasons that were so minute and debatable that it took multiple teams to discuss them. This nicely illustrates Andrew Gelman’s Garden of Forking Paths: small choices can lead to big changes in outcomes. With no standard way of analyzing data, tiny boring looking choices in analysis can actually be a big deal.

The authors of the paper propose that more group approaches may help mitigate some of these problems and give us all a better sense of how reliable results really are. After reading this, I’m inclined to agree. Collaborating up front also takes the adversarial part out, as you don’t just have people challenging each other’s research after the fact. Things to ponder.

Does Popularity Influence Reliability? A Discussion

Welcome to “Papers in Meta Science”, where we walk through published papers that use science to scrutinize science. At the moment we’re taking a look at the paper “Large-Scale Assessment of the Effect of Popularity on the Reliability of Research” by Pfeiffer and Hoffmann. Read the introduction here, and the methods and results section here.

Well hi! Welcome back to our review of how scientific popularity influences the reliability of results. When last we left off we had established that the popularity of protein interactions did not affect the reliability of results for pairings initially, but did affect the reliability of results involving those popular proteins. In other words, you can identify the popular kids pretty well, but figuring out who they are actually connected to gets a little tricky. People like being friends with the popular kids.

Interestingly, the overall results showed a much stronger effect for the “multiple testing hypothesis” than the “inflated error effect” hypothesis, meaning that many of the false positive results seem to be coming from the extra teams running many different experiments and getting a predictable number of false positives. More overall tests = more overall false positives. This effect was 10 times stronger than the inflated error effect, though that was still present.
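The “more tests = more false positives” arithmetic is easy to sketch. This is a minimal illustration that assumes every test is independent, every tested interaction is actually null, and a conventional 0.05 significance threshold; none of those numbers come from the Pfeiffer and Hoffmann paper.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of false positives when all n_tests hypotheses are null."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """Chance that at least one of n_tests independent null tests comes up significant."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 5, 20, 100):
    print(
        n_tests,
        "tests:",
        expected_false_positives(n_tests),
        "expected false positives,",
        round(prob_at_least_one_false_positive(n_tests), 3),
        "chance of at least one",
    )
```

No one has to cut a single corner for this to happen. If twenty teams each honestly test the same null interaction once, the chance that somebody gets a publishable hit is already well over half.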

So what should we do here? Well, a few things:

  1. Awareness. Researchers should be extra aware that running lots of tests on a new and interesting protein could result in less accurate results.
  2. Encourage novel testing. Continue to encourage people to branch out in their research as opposed to giving more funding to those researching more popular topics.
  3. Informal research wikis. This was an interesting idea I hadn’t seen before: use the Wikipedia model to let researchers note things they had tested that didn’t pan out. As I mentioned when I reviewed the Ioannidis paper, there’s not an easy way of knowing how many teams are working on a particular question at any given time. Setting up a less formal place for people to check what other teams were doing may give researchers better insight into how many false positives they can expect to see.

Overall, it’s also important to remember that this is just one study and that findings in other fields may be different. It would be interesting to see a similar thing repeated in a social science type field or something similar to see if public interest makes results better or worse.

Got another paper you’re interested in? Let me know!

Does Popularity Influence Reliability? Methods and Results

Welcome to “Papers in Meta Science”, where we walk through published papers that use science to scrutinize science. At the moment we’re taking a look at the paper “Large-Scale Assessment of the Effect of Popularity on the Reliability of Research” by Pfeiffer and Hoffmann. Read the introduction here.

Okay, so when we left off last time, we were discussing the idea that findings in (scientifically) popular fields were less likely to be reliable than those in less popular fields.  The theory goes that popular fields would have more false positives (due to an overall higher number of experiments being run) or that increased competition would increase things like p-hacking and data dredging on the part of research teams, or both.

Methods: To test this hypothesis empirically, the researchers decided to look at the exciting world of protein interactions in yeast. While this is not what most people think about when they think of “popular” research, it’s actually a great choice. Since the general public probably is mostly indifferent to protein interactions, all the popularity studied here will be purely scientific. Any bias the researchers picked up will be from their scientific training, not their own pre-conceived beliefs.

To get data on protein interactions, the researchers pulled large data sets that cast a wide net and smaller data sets that looked for specific proteins, then compared the results between the two. The thought was that the large data sets tested large numbers of interactions using the same algorithm, so they would be less likely to be biased by human judgement. They could therefore be used to confirm or cast doubt on the smaller experiments that required more human intervention.

Thanks to the wonders of text mining, the sample size here was HUGE – about 60,000 statements/conclusions made about 30,000 hypothesized interactions. The smaller data sets had about 6,000 statements/conclusions about 4,000 interactions.

Results: The overall results showed some interesting differences in confirmation rates:

Basically, the more popular an interaction, the more often the interaction was confirmed. However, the more popular an interaction partner was, the less often it was confirmed. Confused? Try this analogy: think of protein interactions as the popular kids in school. The popular kids were fairly easy to identify, and researchers got the popular kids right a lot of the time. However, once they tried to turn that around and figure out who interacted with the popular kids later, they started getting a lot of false positives. Just like the less-cool kids in high school might overplay their relationship to the cooler kids, many researchers tried to tie their new findings to previously recognized popular findings.

This held true for both the “inflated error effect” and the “multiple testing effect”. In other words, having a popular protein involved made individual statements or conclusions less likely to be validated, and also produced more interactions that were found once but then never replicated. This held true across all types of experimental techniques, and across both expert-curated databases and broader searches.

We’ll dive into the conclusions we can draw from this next week.

Does Popularity Influence Reliability? An Introduction

Well hi there! Welcome to the next edition of “Papers in Meta Science” where I walk through interesting papers that use science to scrutinize science. During the first go around we looked at the John Ioannidis paper “Why Most Published Research Findings Are False”, and this time we’re going to look at a paper that attempted to prove one of that paper’s key assertions: that “hot” scientific fields produce less trustworthy results than less popular fields. This paper is called “Large-Scale Assessment of the Effect of Popularity on the Reliability of Research”, and was published in PLoS ONE by Pfeiffer and Hoffmann in 2009. They sought to test empirically whether or not this particular claim was true using the field of protein interactions.

Before we get to the good stuff though, I’d expect this series to have about 3 parts:

  1. The Introduction/Background. You’re reading this one right now.
  2. Methods and Results
  3. Further Discussion

Got it? Let’s go!

Introduction:  As I mentioned up front, one of the major goals of this paper was to confirm or refute the mathematical theory put forth by John Ioannidis that “hot” fields were more likely to produce erroneous results than those that were less popular. There are two basic theories as to why this could be the case:

  1. Popular fields create competition, and competitive teams are more likely to be incentivized to cut corners or do what it takes to get positive results (Ioannidis Corollary 5)
  2. Lots of teams working on a problem means lots of hypothesis testing, and lots of tested hypotheses means more false positives due to random chance (Ioannidis Corollary 6).

While Pfeiffer and Hoffmann don’t claim to be able to differentiate between those two motives, they were hopeful that by looking at the evidence they could figure out if this effect was real and, if it was, perhaps estimate its magnitude. For their scrutiny, they chose the field of protein interactions in yeast.

This may seem a little counter-intuitive, as almost no definition of “popular science” conjures pictures of protein interactions. However, it is important to remember that the point of this paper was to examine scientific popularity, not mentions in the popular press. Since most of us probably already assume that getting headline grabbing research can cause its own set of bias problems, it’s interesting to consider a field that doesn’t grab headlines. Anyway, despite its failure to lead the 6 o’clock news, it turns out that the world of protein interactions actually does have a popularity issue. Some proteins and their corresponding genes are studied far more frequently than others, and this makes it a good field for examination. If a field like this can fall prey to the effect of multiple teams, then we can assume that more public oriented fields could as well.

Tune in next week to see what we find out!

So Why AREN’T Most Published Research Findings False? The Rebuttal

Welcome to “So Why ARE Most Published Research Findings False?”, a step by step walk through of the John Ioannidis paper “Why Most Published Research Findings Are False”. It probably makes more sense if you read this in order, so check out the intro here, Part 1 here, Part 2 here, Part 3 here, Part 4 here, and Part 5 here.

Okay people, we made it! All the way through one of the most cited research papers of all time, and we’ve all lost our faith in everything in the process. So what do we do now? Well, let’s turn the lens around on Ioannidis. What, if anything, did he miss and how do we digest this paper? I poked around for a few critiques of him, just to give a flavor. This is obviously not a comprehensive list, but it hits the major criticisms I could find.

The Title: While quite a few people had no problem with the contents of Ioannidis’s paper, some took real umbrage with the title, essentially accusing it of being clickbait before clickbait had really gotten going. Additionally, since many people never read anything more than the title of a paper, a title that blunt is easily used as a mallet by anyone trying to disprove any study they choose. Interestingly, there’s apparently some question regarding whether or not Ioannidis actually wrote the title or if it was the editors at PLoS Medicine, but the point stands. Given that misleading headlines and reporting are hugely blamed by many (including yours truly) for popular misunderstanding of science, that would be a fascinating irony.

Failing to reject the null hypothesis does not mean accepting the null hypothesis: This is not so much a criticism of Ioannidis as it is of those who use his work to promote their own causes. There is a rather strange line of thought out there that seems to believe that life, or science, is a courtroom. Under this way of thinking, when you undermine a scientist and their hypothesis, your client is de facto not guilty. This is not true. If you somehow prove that chemotherapy is less effective than previously stated, that doesn’t actually mean that crystals cure cancer. You never prove the null hypothesis, you only fail to reject it.

The definition of bias contained more nuance: In a paper written in response to the Ioannidis paper, some researchers from Johns Hopkins took umbrage with the presentation of “bias” in the paper. Their primary grouse seemed to be intent vs consequence. Ioannidis presents bias as a factor based on consequence, i.e. the way it skews the final results. They disliked this and believed bias should be based on intent, pointing out numerous ways in which things Ioannidis calls “bias” could creep in innocently. For example, if you are looking for a drug that reduces cardiac symptoms but you also find that mortality goes down for patients who take the medication, are you really not going to report that because it’s not what you were originally looking for? By the strictest definition this is “data dredging”, but is it really? Humans aren’t robots. They’re going to report interesting findings where they see them.

The effect of multiple teams: This is one of the more interesting quibbles with the initial paper. Mathematically, Ioannidis proved that having multiple teams working on the same research question would increase the chances of a false result. In the same Hopkins paper, the researchers question the math behind the “multiple teams lead to more false positives” assertion. They mention that for any one study, the odds stay the same as they always have been. Ioannidis counters with an argument that boils down to “yes, if you assume those in competition don’t get more biased”. Interestingly, later research has shown that this effect does exist and is much worse in fields where the R factor (pre-study odds) is low.
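For anyone who wants to poke at the multiple teams math themselves, here is a rough sketch of the positive predictive value (PPV) formula for n independent teams as I understand it from the Ioannidis paper. The function name and the example values for R, alpha and beta are my own assumptions, and the formula assumes equally powered teams with no bias.

```python
def ppv_multiple_teams(pre_study_odds, alpha=0.05, beta=0.2, n_teams=1):
    """PPV when n_teams independent, equally powered teams test the same
    question and one positive result is enough to claim a finding.

    Implements PPV = R(1 - beta^n) / (R + 1 - (1 - alpha)^n - R * beta^n).
    """
    r = pre_study_odds
    # Share of true relationships that at least one team detects.
    true_positive = r * (1 - beta ** n_teams)
    # Share of null relationships that at least one team "detects" by chance.
    false_positive = 1 - (1 - alpha) ** n_teams
    return true_positive / (true_positive + false_positive)


# Hypothetical pre-study odds: 1.0 means 1:1, 0.1 means 1:10.
for r in (1.0, 0.1):
    for n_teams in (1, 5, 10):
        print("R =", r, "teams =", n_teams, "PPV =", round(ppv_multiple_teams(r, n_teams=n_teams), 3))
```

The Hopkins point and the Ioannidis point both show up here. Any single team’s error rates never change, but the chance that a claimed finding is true still drops as teams pile on, and it drops hardest when the pre-study odds are low.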

So overall, what would I say are the major criticisms or cautions around this paper that I personally will employ?

  1. If you’re citing science, use scientific terms precisely. Don’t get sloppy with verbiage just to make your life easier.
  2. Remember, scientific best practices all feed off each other. Getting a good sample size and promoting caution can reduce both overall bias and the effect of bias that does exist. The effect of multiple team testing can be partially negated by high pre-study odds. If a team or researcher employs most best practices but misses one, that may not be a death blow to their research. Look at the whole picture before dismissing the research.
  3. New is exciting, but not always reliable. We all like new and quirky findings, but we need to let that go. New findings are the least likely to play out later, and that’s okay. We want to cast a broad net, but for real progress we need a longer attention span.
  4. Bias takes many forms. When we mention “bias” we often jump right to financial motivations. But intellectual and social pressure can be bias, competing for tenure can cause bias, and confirming one’s own findings can cause bias.
  5. There are more ways of being wrong than there are ways of being right. Every researcher wants a true finding. They really do. No one wants their life’s work undone. While some researchers may be motivated by results they like, I do truly believe that the majority of problems are caused by the whole “needle in a haystack” thing more than the “convenient truth” thing.

Alright, that wraps us up! I enjoyed this series, and may do more going forward. If you see a paper that piques your interest, let me know and I’ll look into it. Happy holidays everyone!



So Why ARE Most Published Research Findings False? A Way Forward

Welcome to “So Why ARE Most Published Research Findings False?”, a step by step walk through of the John Ioannidis paper “Why Most Published Research Findings Are False”. It probably makes more sense if you read this in order, so check out the intro here, Part 1 here, Part 2 here, Part 3 here, and Part 4 here.

Alright guys, we made it! After all sorts of math and bad news, we’re finally at the end. While the situation Ioannidis has laid out up until now sounds pretty bleak, he doesn’t let us end there. No, in this section “How Can We Improve the Situation” he ends with both hope and suggestions.  Thank goodness.

Ioannidis starts off with the acknowledgement that we will never really know which research findings are true and which are false. If we had a perfect test, we wouldn’t be in this mess to begin with. Therefore, anything we do to improve the research situation will be guessing at best. However, there are things that it seems would likely do some good. Essentially they are to improve the values of each of the “forgotten” variables in the equation that determines the positive predictive value of findings. These are:

  1. Beta/study power: Use larger studies or meta-analyses aimed at testing broad hypotheses
  2. n/multiple teams: Consider a totality of evidence or work done before concluding any one finding is true
  3. u/Bias: Register your study ahead of time, or work with other teams to register your data to reduce bias
  4. R/Pre-study Odds: Determine the pre-study odds prior to your experiment, and publish your assessment with your results
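To see how those knobs move the positive predictive value, here is a small sketch of the single-study PPV formula with bias, as I understand it from the Ioannidis paper. The function name and every example value are my own assumptions, and the multiple-teams variable n uses a separate variant of the formula that is left out here.

```python
def ppv(pre_study_odds, alpha=0.05, beta=0.2, bias=0.0):
    """Positive predictive value of a claimed finding for a single study.

    Implements PPV = ([1 - beta]R + u*beta*R) /
                     (R + alpha - beta*R + u - u*alpha + u*beta*R),
    where R is the pre-study odds, 1 - beta is study power, and u is the
    share of would-be negative results that get reported as positive anyway.
    """
    r, u = pre_study_odds, bias
    numerator = (1 - beta) * r + u * beta * r
    denominator = r + alpha - beta * r + u - u * alpha + u * beta * r
    return numerator / denominator


# Hypothetical scenarios, one knob changed at a time.
print("Baseline (R=1, power=0.8, no bias):", round(ppv(1.0), 3))
print("Lower power (power=0.2):           ", round(ppv(1.0, beta=0.8), 3))
print("Added bias (u=0.3):                ", round(ppv(1.0, bias=0.3), 3))
print("Long-shot hypothesis (R=0.1):      ", round(ppv(0.1), 3))
```

Each suggestion in the list is an attempt to push one of these inputs in the helpful direction: more power, less bias, and an honest estimate of R before the study starts.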

If you’ve been following along so far, none of those suggestions should be surprising to you. Let’s dive in to each though:

First, we should be using larger studies or meta-analyses that aggregate smaller studies. As we saw earlier, large sample size = higher study power -> blunts the impact of bias. That’s a good thing. This isn’t foolproof though, as bias can still slip through and a large sample size means very tiny effect sizes can be ruled “statistically significant”. These studies are also hard to do because they are so resource intensive. Ioannidis suggests that large studies be reserved for large questions, though without a lot of guidance on how to do that.
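The “tiny effects become significant” caveat is easy to demonstrate. This is a minimal sketch using a one-sample z-test with a known standard deviation of 1, which is a simplification I picked for illustration; the effect size and sample sizes are made up.

```python
import math


def two_sided_p_value(effect_in_sd, n):
    """Two-sided p-value of a one-sample z-test when the observed mean sits
    effect_in_sd standard deviations away from the null value and the
    population standard deviation is known."""
    z = abs(effect_in_sd) * math.sqrt(n)
    # For a standard normal Z, P(|Z| > z) equals erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))


TINY_EFFECT = 0.02  # hypothetical: two hundredths of a standard deviation

for n in (100, 10_000, 1_000_000):
    print("n =", n, "p =", two_sided_p_value(TINY_EFFECT, n))
```

The effect never changes, only the sample size does, yet the same practically negligible difference goes from nowhere near significant to overwhelmingly significant.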

Second, the totality of the evidence. We’ve covered a lot about false positives here, and Ioannidis of course reiterates that we should always keep them in mind. One striking finding should almost never be considered definitive, but rather compared to other similar research.

Third, steps must be taken to reduce bias. We talked about this a lot with the corollaries, but Ioannidis advocates hard that groups should tell someone else up front what they’re trying to do. This would (hopefully) reduce the tendency to say “hey, we didn’t find an effect for just the color red, but if you include pink and orange as a type of red, there’s an effect!”. Trial pre-registration gets a lot of attention in the medical world, but may not be feasible in other fields. At the very least, Ioannidis suggests that research teams share their strategy with each other up front, as a sort of “insta peer review” type thing. This would allow researchers some leeway to report interesting findings they weren’t expecting (i.e. “red wasn’t a factor, but good golly look at green!”) while reducing the aforementioned “well if you tweak the definition of red a bit, you totally get a significant result”.

Finally, the pre-study odds. This would be a moment up front for researchers to really assess how likely they are to find anything, and a number for others to use later to judge the research team by. Almost every field has a professional conference, and one would imagine determining pre-study odds for different lines of inquiry would be an interesting topic for many of them. Encouraging researchers to think up front about their odds of finding something interesting would be an interesting framing for everything yet to come.

None of this would fix everything, but it would certainly inject some humility and context into the process from the get go. Science in general is supposed to be a way of objectively viewing the world and describing what you find. Turning that lens inward should be something researchers welcome, though obviously that is not always the case.

In that vein, next week I’ll be rounding up some criticisms of this paper along with my wrap up to make sure you hear the other side. Stay tuned!


So Why ARE Most Published Research Findings False? Bias bias bias

Welcome to “So Why ARE Most Published Research Findings False?”, a step by step walk through of the John Ioannidis paper “Why Most Published Research Findings Are False”. It probably makes more sense if you read this in order, so check out the intro here, Part 1 here, Part 2 here, and Part 3 here.

Alright folks, we’re almost there. We covered a lot of mathematical ground here and last week ended with a few corollaries. We’ve seen the effects of sample size, study power, effect size, pre-study odds, bias and the work of multiple teams. We’ve gotten thoroughly depressed, and we’re not done yet. There’s one more conclusion we can draw, and it’s a scary one. Ioannidis holds nothing back, and he just flat out calls this section “Claimed Research Findings May Often Be Simply Accurate Measures of the Prevailing Bias”. Okay then.

To get to this point, Ioannidis lays out a couple of things:

  1. Throughout history, there have been scientific fields of inquiry that later proved to have no basis… like phrenology, for example. He calls these “null fields”.
  2. Many of these null fields had positive findings at some point, and in a number great enough to sustain the field.
  3. Given the math around positive findings, the effect sizes in false positives due to random chance should be fairly small.
  4. Therefore, large effect sizes discovered in null fields pretty much just measure the bias present in those fields… aka that “u” value we talked about earlier.

You can think about this like a coin flip. If you flip a fair coin 100 times, you know you should get about 50 heads and 50 tails. Given random fluctuations, you probably wouldn’t be too surprised if you ended up with a 47-53 split or even a 40-60 split. If you ended up with an 80-20 split however, you’d get uneasy. Was the coin really fair?

The same goes for scientific studies. Typically we look at large effect sizes as a good thing. After all, where there’s smoke there’s fire, right? However, Ioannidis points out that large effect sizes are actually an early warning sign for bias. For example, let’s say you think that your coin is weighted a bit, and that you will actually get heads 55% of the time you flip it. You flip it 100 times and get 90 heads. You can react in one of 3 ways:

  1. Awesome, 90 is way more than 55 so I was right that heads comes up more often!
  2. Gee, there’s a 1 in 73 quadrillion chance that exactly 90 heads would come up if this coin were fairly weighted. With the slight bias I thought was there, the chances of getting the results I did are still only about 1 in 40 trillion. I must have underestimated how biased that coin was.
  3. Crap. I did something wrong.

You can guess which ones most people go with. Spoiler alert: it’s not #3.
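If you want to check the coin flip numbers yourself, here is a short sketch using the binomial distribution. It assumes independent flips; the function names are mine, and the 1 in 73 quadrillion figure for the fair coin corresponds to getting exactly 90 heads.

```python
from math import comb


def prob_exact_heads(heads, flips, p_heads):
    """Binomial probability of exactly `heads` heads in `flips` independent flips."""
    return comb(flips, heads) * p_heads ** heads * (1 - p_heads) ** (flips - heads)


def prob_at_least(heads, flips, p_heads):
    """Probability of `heads` or more heads in `flips` independent flips."""
    return sum(prob_exact_heads(k, flips, p_heads) for k in range(heads, flips + 1))


# How surprising are the milder splits with a fair coin?
print("P(at least 53 heads | fair):", prob_at_least(53, 100, 0.5))
print("P(at least 60 heads | fair):", prob_at_least(60, 100, 0.5))
print("P(at least 80 heads | fair):", prob_at_least(80, 100, 0.5))

# The 90-heads result under the fair coin and under the suspected 55% coin.
for p_heads in (0.5, 0.55):
    p = prob_exact_heads(90, 100, p_heads)
    print("P(exactly 90 heads | p =", p_heads, ") = 1 in", round(1 / p))
```

Either way, 90 heads is so far outside what a fair or slightly weighted coin produces that the sensible reaction is to go looking for what went wrong.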

The whole “an unexpectedly large effect size should make you nervous” phenomenon is counterintuitive, but I’ve actually blogged about it before. It’s what got Andrew Gelman upset about that study that found that 20% of women were changing their vote around their menstrual cycle, and it’s something I’ve pointed out about the whole 25% of men vote for Trump if they’re primed to think about how much money their wives make. Effect sizes of that magnitude shouldn’t be cause for excitement, they should be cause for concern. Unless you are truly discovering a previously unknown and overwhelmingly large phenomenon, there’s a good chance some of that number was driven by bias.

Now of course, if your findings replicate, this is all fine, you’re off the hook. However if they don’t, the largeness of your effect size is really just a measure of your own bias. Put another way, you can accidentally find a 5% vote swing that doesn’t exist just because random chance is annoying like that, but to get numbers in the 20-25% range you had to put some effort in.

As Ioannidis points out, this isn’t even a problem with individual researchers, but in how we all view science. Big splashy new results are given a lot of attention, and there is very little criticism if the findings fail to replicate at the same magnitude. This means that researchers have nothing but incentives to make sure the effect sizes they’re seeing are as big as possible. In fact Ioannidis has found (in a different paper) that about half the time the first paper published on a topic shows the most extreme value ever found. That is way more than what we would expect to see if it were up to random chance. Ioannidis argues that by figuring out exactly how far these effect sizes deviate from chance, we can actually measure the level of bias.
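One way to see why the splashy results run large is to simulate it. This sketch is my own illustration rather than anything from the Ioannidis paper: it assumes a small true effect, normally distributed estimates, and a world where only results passing the usual 1.96 standard error cutoff get attention. All of the numbers are hypothetical.

```python
import random
import statistics


def simulate_significant_estimates(true_effect, std_error, n_studies, seed=0):
    """Run n_studies noisy studies of the same true effect and keep only the
    estimates that pass a two-sided 5% significance test."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    significant = []
    for _ in range(n_studies):
        estimate = rng.gauss(true_effect, std_error)
        if abs(estimate) / std_error > 1.96:
            significant.append(estimate)
    return significant


TRUE_EFFECT = 0.05  # hypothetical small real effect
STD_ERROR = 0.10    # hypothetical noise in each study's estimate

significant = simulate_significant_estimates(TRUE_EFFECT, STD_ERROR, 10_000)
mean_abs = statistics.mean(abs(e) for e in significant)
exaggeration = mean_abs / TRUE_EFFECT

print("Studies reaching significance:", len(significant), "of 10000")
print("Average size of a significant estimate:", round(mean_abs, 3))
print("Exaggeration versus the true effect:", round(exaggeration, 1), "x")
```

Nobody in this simulation is cheating. Filtering on significance alone guarantees that the results which get noticed overstate the real effect several times over, and any bias on top of that only makes the gap bigger.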

Again, not a problem for those researchers who replicate, but something to consider for those who don’t. We’ll get into that next week, in our final segment: “So What Can Be Done?”.