5 Things About the Many Analysts, One Data Set Paper

I’ve been a little slow on this, but I’ve been meaning to get around to the paper “Many Analysts, One Data Set: Making Transparent How Variations in Analytic Choices Affect Results”. This paper was published back in August, but I think it’s an important one for anyone looking to understand why science can often be so difficult.

The premise of this paper was simple, but elegant: give 29 teams the same data set and the same question to answer, then see how everyone does their analysis and if all of those analyses yield the same results. In this case, the question was “do soccer referees give red cards to dark skinned players more than light skinned players”. The purpose of the paper was to highlight how seemingly minor choices in data analysis can yield different results, and all participants had volunteered for this study with full knowledge of what the purpose was. So what did they find? Let’s take a look!

    1. Very few teams picked the same analysis methods. Every team in this study was able to pick whatever method they thought best fit the question they were trying to answer, and boy did the choices vary. First, the choice of analysis method varied. Next, the choice of covariates varied wildly. The data set contained 14 covariates, and the 29 teams ended up coming up with 21 different combinations to look at.
    2. Choices had consequences. As you can imagine, this variability produced some interesting consequences. Overall, 20 of the 29 teams found a significant effect, but 9 didn’t. The effect sizes they found also varied wildly, with odds ratios running from .89 to 2.93. While that shows a definite trend in favor of the hypothesis, it’s way less reliable than the p<.05 model would suggest.
    3. Analytic choices didn’t necessarily predict who got a significant result. Now because all of these teams signed up knowing what the point of the study was, the next step in this study was pretty interesting. All the teams’ methods (but not their results) were presented to all the other teams, who then rated them. The highest rated analyses gave a median odds ratio of 1.31, and the lower rated analyses gave a median odds ratio of…1.28. The presence of experts on the team didn’t change much either. Teams with previous experience teaching or publishing on statistical methods generated odds ratios with a median of 1.39, and the ones without such members had a median OR of 1.30. They noted that those with statistical expertise seemed to pick more similar methods, but that didn’t necessarily translate into significant results.
    4. Researchers’ beliefs didn’t influence outcomes. Now of course the researchers involved in this had self-selected into a study where they knew other teams were doing the same analysis they were, but it’s interesting to note that those who said up front they believed the hypothesis was true were not more likely to get significant results than those who were more neutral. Researchers did change their beliefs over the course of the study however, as this chart showed. While many of the teams updated their beliefs, it’s good to note that the most likely update was “this is true, but we don’t know why”, followed by “this is true, but may be caused by something we didn’t capture in this data set (like player behavior)”.
    5. The key differences in analysis weren’t things most people would pick up on. At one point in the study, the teams were allowed to debate back and forth and look at each other’s analyses. One researcher noted that those teams that had included league and club as covariates were the ones who got non-significant results. As the paper states, “A debate emerged regarding whether the inclusion of these covariates was quantitatively defensible given that the data on league and club were available for the time of data collection only and these variables likely changed over the course of many players’ careers”. This is a fascinating debate, and one that would likely not have happened had this data just been analyzed by one team. This choice was buried deep in the methods section, and I doubt under normal circumstances anyone would have thought twice about it.

That last point gets to why I’m so fascinated by this paper: it shows that lots of well-intentioned teams can get different results even if no one is trying to be deceptive. These teams had no motivation to fudge their results or skew anything, and in fact were incentivized in the opposite direction. They still got different results however, for reasons that were so minute and debatable that it took multiple teams to discuss them. This nicely illustrates Andrew Gelman’s Garden of Forking Paths: how small choices can lead to big changes in outcomes. With no standard way of analyzing data, tiny, boring-looking choices in analysis can actually be a big deal.

The authors of the paper propose that more group approaches may help mitigate some of these problems and give us all a better sense of how reliable results really are. After reading this, I’m inclined to agree. Collaborating up front also takes the adversarial part out, as you don’t just have people challenging each other’s research after the fact. Things to ponder.

Does Popularity Influence Reliability? A Discussion

Welcome to “Papers in Meta Science”, where we walk through published papers that use science to scrutinize science. At the moment we’re taking a look at the paper “Large-Scale Assessment of the Effect of Popularity on the Reliability of Research” by Pfeiffer and Hoffmann. Read the introduction here, and the methods and results section here.

Well hi! Welcome back to our review of how scientific popularity influences the reliability of results. When last we left off we had established that the popularity of protein interactions did not affect the reliability of results for pairings initially, but did affect the reliability of results involving those popular proteins. In other words, you can identify the popular kids pretty well, but figuring out who they are actually connected to gets a little tricky. People like being friends with the popular kids.

Interestingly, the overall results showed a much stronger effect for the “multiple testing hypothesis” than the “inflated error effect” hypothesis, meaning that many of the false positive results seem to be coming from the extra teams running many different experiments and getting a predictable number of false positives. More overall tests = more overall false positives. This effect was 10 times stronger than the inflated error effect, though that was still present.

So what should we do here? Well, a few things:

  1. Awareness. Researchers should be extra aware that running lots of tests on a new and interesting protein could result in less accurate results.
  2. Encourage novel testing. Continue to encourage people to branch out in their research as opposed to giving more funding to those researching more popular topics.
  3. Informal research wikis. This was an interesting idea I hadn’t seen before: use the Wikipedia model to let researchers note things they had tested that didn’t pan out. As I mentioned when I reviewed the Ioannidis paper, there’s not an easy way of knowing how many teams are working on a particular question at any given time. Setting up a less formal place for people to check what other teams were doing may give researchers better insight into how many false positives they can expect to see.

Overall, it’s also important to remember that this is just one study and that findings in other fields may be different. It would be interesting to see a similar thing repeated in a social science type field or something similar to see if public interest makes results better or worse.

Got another paper you’re interested in? Let me know!

Does Popularity Influence Reliability? Methods and Results

Welcome to “Papers in Meta Science”, where we walk through published papers that use science to scrutinize science. At the moment we’re taking a look at the paper “Large-Scale Assessment of the Effect of Popularity on the Reliability of Research” by Pfeiffer and Hoffmann. Read the introduction here.

Okay, so when we left off last time, we were discussing the idea that findings in (scientifically) popular fields were less likely to be reliable than those in less popular fields.  The theory goes that popular fields would have more false positives (due to an overall higher number of experiments being run) or that increased competition would increase things like p-hacking and data dredging on the part of research teams, or both.

Methods: To test this hypothesis empirically, the researchers decided to look at the exciting world of protein interactions in yeast. While this is not what most people think about when they think of “popular” research, it’s actually a great choice. Since the general public probably is mostly indifferent to protein interactions, all the popularity studied here will be purely scientific. Any bias the researchers picked up will be from their scientific training, not their own pre-conceived beliefs.

To get data on protein interactions, the researchers pulled large data sets that were casting a wide net and smaller data sets that were looking for specific proteins and compared the results between the two. The thought was that the large data sets were testing large numbers of interactions all using the same algorithm and would be less likely to be biased by human judgement and could therefore be used to confirm or cast doubt on the smaller experiments that required more human intervention.

Thanks to the wonders of text mining, the sample size here was HUGE – about 60,000 statements/conclusions made about 30,000 hypothesized interactions. The smaller data sets had about 6,000 statements/conclusions about 4,000 interactions.

Results: The overall results showed some interesting differences in confirmation rates:

Basically, the more popular an interaction, the more often the interaction was confirmed. However, the more popular an interaction partner was, the less often it was confirmed. Confused? Try this analogy: think of protein interactions as the popular kids in school. The popular kids were fairly easy to identify, and researchers got the popular kids right a lot of the time. However, once they tried to turn that around and figure out who interacted with the popular kids later, they started getting a lot of false positives. Just like the less-cool kids in high school might overplay their relationship to the cooler kids, many researchers tried to tie their new findings to previously recognized popular findings.

This held true for both the “inflated error effect” and the “multiple testing effect”. In other words, having a popular protein involved both made the individual statements or conclusions less likely to be validated and produced more interactions that were found once but then never replicated. This held true across all types of experimental techniques, and it held true across databases that were curated by experts vs broader searches.

We’ll dive into the conclusions we can draw from this next week.

Does Popularity Influence Reliability? An Introduction

Well hi there! Welcome to the next edition of “Papers in Meta Science” where I walk through interesting papers that use science to scrutinize science. During the first go around we looked at the John Ioannidis paper “Why Most Published Research Findings Are False”, and this time we’re going to look at a paper that attempted to prove one of that paper’s key assertions: that “hot” scientific fields produce less trustworthy results than less popular fields. This paper is called “Large-Scale Assessment of the Effect of Popularity on the Reliability of Research”, and was published in PLOS ONE by Pfeiffer and Hoffmann in 2009. They sought to test empirically whether or not this particular claim was true using the field of protein interactions.

Before we get to the good stuff though, I’d expect this series to have about 3 parts:

  1. The Introduction/Background. You’re reading this one right now.
  2. Methods and Results
  3. Further Discussion

Got it? Let’s go!

Introduction:  As I mentioned up front, one of the major goals of this paper was to confirm or refute the mathematical theory put forth by John Ioannidis that “hot” fields were more likely to produce erroneous results than those that were less popular. There are two basic theories as to why this could be the case:

  1. Popular fields create competition, and competitive teams are more likely to be incentivized to cut corners or do what it takes to get positive results (Ioannidis Corollary 5)
  2. Lots of teams working on a problem means lots of hypothesis testing, and lots of tested hypotheses means more false positives due to random chance (Ioannidis Corollary 6).

While Pfeiffer and Hoffmann don’t claim to be able to differentiate between those two motives, they were hopeful that by looking at the evidence they could figure out if this effect was real and, if it was, perhaps estimate its magnitude. For their scrutiny, they chose the field of protein interactions in yeast.

This may seem a little counter-intuitive, as almost no definition of “popular science” conjures pictures of protein interactions. However, it is important to remember that the point of this paper was to examine scientific popularity, not mentions in the popular press. Since most of us probably already assume that getting headline grabbing research can cause its own set of bias problems, it’s interesting to consider a field that doesn’t grab headlines. Anyway, despite its failure to lead the 6 o’clock news, it turns out that the world of protein interactions actually does have a popularity issue. Some proteins and their corresponding genes are studied far more frequently than others, and this makes it a good field for examination. If a field like this can fall prey to the effect of multiple teams, then we can assume that more public-oriented fields could as well.

Tune in next week to see what we find out!

So Why AREN’T Most Published Research Findings False? The Rebuttal

Welcome to “So Why ARE Most Published Research Findings False?”, a step by step walk through of the John Ioannidis paper “Why Most Published Research Findings Are False”. It probably makes more sense if you read this in order, so check out the intro here, Part 1 here, Part 2 here, Part 3 here, Part 4 here, and Part 5 here.

Okay people, we made it! All the way through one of the most cited research papers of all time, and we’ve all lost our faith in everything in the process. So what do we do now? Well, let’s turn the lens around on Ioannidis. What, if anything, did he miss and how do we digest this paper? I poked around for a few critiques of him, just to give a flavor. This is obviously not a comprehensive list, but it hits the major criticisms I could find.

The Title. While quite a few people had no problem with the contents of Ioannidis’s paper, some took real umbrage with the title, essentially accusing it of being clickbait before clickbait had really gotten going. Additionally, since many people never read anything more than the title of a paper, a title that blunt is easily used as a mallet by anyone trying to disprove any study they choose. Interestingly, there’s apparently some question regarding whether or not Ioannidis actually wrote the title or if it was the editors at PLOS Medicine, but the point stands. Given that misleading headlines and reporting are hugely blamed by many (including yours truly) for popular misunderstanding of science, that would be a fascinating irony.

Failing to reject the null hypothesis does not mean accepting the null hypothesis. This is not so much a criticism of Ioannidis as it is of those who use his work to promote their own causes. There is a rather strange line of thought out there that seems to believe that life, or science, is a courtroom. Under this way of thinking, when you undermine a scientist and their hypothesis, your client is de facto not guilty. This is not true. If you somehow prove that chemotherapy is less effective than previously stated, that doesn’t actually mean that crystals cure cancer. You never prove the null hypothesis, you only fail to reject it.

The definition of bias contained more nuance. In a paper written in response to the Ioannidis paper, some researchers from Johns Hopkins took umbrage with the presentation of “bias” in the paper. Their primary grouse seemed to be intent vs consequence. Ioannidis presents bias as a factor based on consequence, i.e. the way it skews the final results. They disliked this and believed bias should be based on intent, pointing out numerous ways in which things Ioannidis calls “bias” could creep in innocently. For example, if you are looking for a drug that reduces cardiac symptoms but you also find that mortality goes down for patients who take the medication, are you really not going to report that because it’s not what you were originally looking for? By the strictest definition this is “data dredging”, but is it really? Humans aren’t robots. They’re going to report interesting findings where they see them.

The effect of multiple teams. This is one of the more interesting quibbles with the initial paper. Mathematically, Ioannidis proved that having multiple teams working on the same research question would increase the chances of a false result. In the same Hopkins paper, the researchers question the math behind the “multiple teams lead to more false positives” assertion. They mention that for any one study, the odds stay the same as they always have been. Ioannidis counters with an argument that boils down to “yes, if you assume those in competition don’t get more biased”. Interestingly, later research has shown that this effect does exist and is much worse in fields where the R factor (pre-study odds) is low.

So overall, what would I say are the major criticisms or cautions around this paper that I personally will employ?

  1. If you’re citing science, use scientific terms precisely. Don’t get sloppy with verbiage just to make your life easier.
  2. Remember, scientific best practices all feed off each other. Getting a good sample size and promoting caution can reduce both overall bias and the effect of bias that does exist. The effect of multiple team testing can be partially negated by high pre-study odds. If a team or researcher employs most best practices but misses one, that may not be a death blow to their research. Look at the whole picture before dismissing the research.
  3. New is exciting, but not always reliable. We all like new and quirky findings, but we need to let that go. New findings are the least likely to play out later, and that’s okay. We want to cast a broad net, but for real progress we need a longer attention span.
  4. Bias takes many forms. When we mention “bias” we often jump right to financial motivations. But intellectual and social pressure can be bias, competing for tenure can cause bias, and confirming one’s own findings can cause bias.
  5. There are more ways of being wrong than there are ways of being right. Every researcher wants a true finding. They really do. No one wants their life’s work undone. While some researchers may be motivated by results they like, I do truly believe that the majority of problems are caused by the whole “needle in a haystack” thing more than the “convenient truth” thing.

Alright, that wraps us up! I enjoyed this series, and may do more going forward. If you see a paper that piques your interest, let me know and I’ll look into it. Happy holidays everyone!



So Why ARE Most Published Research Findings False? A Way Forward

Welcome to “So Why ARE Most Published Research Findings False?”, a step by step walk through of the John Ioannidis paper “Why Most Published Research Findings Are False”. It probably makes more sense if you read this in order, so check out the intro here, Part 1 here, Part 2 here, Part 3 here, and Part 4 here.

Alright guys, we made it! After all sorts of math and bad news, we’re finally at the end. While the situation Ioannidis has laid out up until now sounds pretty bleak, he doesn’t let us end there. No, in this section, “How Can We Improve the Situation”, he ends with both hope and suggestions. Thank goodness.

Ioannidis starts off with the acknowledgement that we will never really know which research findings are true and which are false. If we had a perfect test, we wouldn’t be in this mess to begin with. Therefore, anything we do to improve the research situation will be guessing at best. However, there are things that it seems would likely do some good. Essentially they are to improve the values of each of the “forgotten” variables in the equation that determines the positive predictive value of findings. These are:

  1. Beta/study power: Use larger studies or meta-analyses aimed at testing broad hypotheses
  2. n/multiple teams: Consider a totality of evidence or work done before concluding any one finding is true
  3. u/Bias: Register your study ahead of time, or work with other teams to register your data to reduce bias
  4. R/Pre-study Odds: Determine the pre-study odds prior to your experiment, and publish your assessment with your results

If you’ve been following along so far, none of those suggestions should be surprising to you. Let’s dive into each though:

First, we should be using larger studies or meta-analyses that aggregate smaller studies. As we saw earlier, a large sample size means higher study power, which blunts the impact of bias. That’s a good thing. This isn’t foolproof though, as bias can still slip through and a large sample size means very tiny effect sizes can be ruled “statistically significant”. These studies are also hard to do because they are so resource intensive. Ioannidis suggests that large studies be reserved for large questions, though without a lot of guidance on how to do that.
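To see why a big sample can turn a trivial effect into a “statistically significant” one, here’s a rough Python sketch. It’s mine, not something from the paper. It assumes a simple one-sample z-test with a known standard deviation of 1, and a made-up observed effect of 0.02 standard deviations; the function name and the numbers are purely illustrative.

```python
from math import erfc, sqrt

def two_sided_p_value(effect_in_sd_units: float, sample_size: int) -> float:
    """Two-sided p-value for a one-sample z-test, assuming a known SD of 1."""
    z = effect_in_sd_units * sqrt(sample_size)
    # erfc(|z| / sqrt(2)) is the two-sided tail area of a standard normal
    return erfc(abs(z) / sqrt(2))

TINY_EFFECT = 0.02  # hypothetical observed effect, in standard deviations

for n in (100, 10_000, 1_000_000):
    p = two_sided_p_value(TINY_EFFECT, n)
    print(f"n = {n:>9,}: p = {p:.3g}")
```

The effect never changes here, only the sample size does, yet the p-value slides under .05 once n gets big enough.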

Second, the totality of the evidence. We’ve covered a lot about false positives here, and Ioannidis of course reiterates that we should always keep them in mind. One striking finding should almost never be considered definitive, but rather compared to other similar research.

Third, steps must be taken to reduce bias. We talked about this a lot with the corollaries, but Ioannidis advocates hard that groups should tell someone else up front what they’re trying to do. This would (hopefully) reduce the tendency to say “hey, we didn’t find an effect for just the color red, but if you include pink and orange as a type of red, there’s an effect!”. Trial pre-registration gets a lot of attention in the medical world, but may not be feasible in other fields. At the very least, Ioannidis suggests that research teams share their strategy with each other up front, as a sort of “insta peer review” type thing. This would allow researchers some leeway to report interesting findings they weren’t expecting (i.e. “red wasn’t a factor, but good golly look at green!”) while reducing the aforementioned “well if you tweak the definition of red a bit, you totally get a significant result”.

Finally, the pre-study odds. This would be a moment up front for researchers to really assess how likely they are to find anything, and a number for others to use later to judge the research team by. Almost every field has a professional conference, and one would imagine determining pre-study odds for different lines of inquiry would be an interesting topic for many of them. Encouraging researchers to think up front about their odds of finding something interesting would be an interesting framing for everything yet to come.

None of this would fix everything, but it would certainly inject some humility and context into the process from the get go. Science in general is supposed to be a way of objectively viewing the world and describing what you find. Turning that lens inward should be something researchers welcome, though obviously that is not always the case.

In that vein, next week I’ll be rounding up some criticisms of this paper along with my wrap up to make sure you hear the other side. Stay tuned!


So Why ARE Most Published Research Findings False? Bias bias bias

Welcome to “So Why ARE Most Published Research Findings False?”, a step by step walk through of the John Ioannidis paper “Why Most Published Research Findings Are False”. It probably makes more sense if you read this in order, so check out the intro here, Part 1 here, Part 2 here, and Part 3 here.

Alright folks, we’re almost there. We covered a lot of mathematical ground here and last week ended with a few corollaries. We’ve seen the effects of sample size, study power, effect size, pre-study odds, bias and the work of multiple teams. We’ve gotten thoroughly depressed, and we’re not done yet. There’s one more conclusion we can draw, and it’s a scary one. Ioannidis holds nothing back, and he just flat out calls this section “Claimed Research Findings May Often Be Simply Accurate Measures of the Prevailing Bias”. Okay then.

To get to this point, Ioannidis lays out a couple of things:

  1. Throughout history, there have been scientific fields of inquiry that later proved to have no basis…like phrenology for example. He calls these “null fields”.
  2. Many of these null fields had positive findings at some point, and in a number great enough to sustain the field.
  3. Given the math around positive findings, the effect sizes in false positives due to random chance should be fairly small.
  4. Therefore, large effect sizes discovered in null fields pretty much just measure the bias present in those fields…aka that “u” value we talked about earlier.

You can think about this like a coin flip. If you flip a fair coin 100 times, you know you should get about 50 heads and 50 tails. Given random fluctuations, you probably wouldn’t be too surprised if you ended up with a 47-53 split or even a 40-60 split. If you ended up with an 80-20 split however, you’d get uneasy. Was the coin really fair?

The same goes for scientific studies. Typically we look at large effect sizes as a good thing. After all, where there’s smoke there’s fire, right? However, Ioannidis points out that large effect sizes are actually an early warning sign for bias. For example, let’s say you think that your coin is weighted a bit, and that you will actually get heads 55% of the time you flip it. You flip it 100 times and get 90 heads. You can react in one of 3 ways:

  1. Awesome, 90 is way more than 55 so I was right that heads comes up more often!
  2. Gee, there’s a 1 in 73 quadrillion chance that 90 heads would come up if this coin were fairly weighted. With the slight bias I thought was there, the chance of getting the result I did is still only about 1 in 40 trillion. I must have underestimated how biased that coin was.
  3. Crap. I did something wrong.

You can guess which ones most people go with. Spoiler alert: it’s not #3.
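If you want to check the coin numbers yourself, here’s a small Python sketch using the exact binomial formula. This is my own illustration (the function names are made up), and it assumes every flip is independent with the same chance of heads. It prints the odds of getting exactly 90 heads in 100 flips for both the fair coin and the “slightly weighted” 55% coin.

```python
from math import comb

def binom_pmf(heads: int, flips: int, p_heads: float) -> float:
    """Probability of exactly `heads` heads in `flips` independent flips."""
    return comb(flips, heads) * p_heads**heads * (1 - p_heads) ** (flips - heads)

def binom_tail(heads: int, flips: int, p_heads: float) -> float:
    """Probability of at least `heads` heads in `flips` independent flips."""
    return sum(binom_pmf(k, flips, p_heads) for k in range(heads, flips + 1))

for p in (0.50, 0.55):
    exact = binom_pmf(90, 100, p)
    print(f"P(heads) = {p:.2f}: exactly 90 of 100 is about 1 in {1 / exact:,.0f}")

# For comparison, the kind of split that shouldn't worry you much
print(f"Fair coin, 60 or more heads: {binom_tail(60, 100, 0.5):.3f}")
```

Either way you slice it, 90 heads is wildly out of line with what you expected going in, which is the whole point.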

The whole “an unexpectedly large effect size should make you nervous” phenomenon is counterintuitive, but I’ve actually blogged about it before. It’s what got Andrew Gelman upset about that study that found that 20% of women were changing their vote around their menstrual cycle, and it’s something I’ve pointed out about the whole “25% of men vote for Trump if they’re primed to think about how much money their wives make” thing. Effect sizes of that magnitude shouldn’t be cause for excitement, they should be cause for concern. Unless you are truly discovering a previously unknown and overwhelmingly large phenomenon, there’s a good chance some of that number was driven by bias.

Now of course, if your findings replicate, this is all fine, you’re off the hook. However if they don’t, the largeness of your effect size is really just a measure of your own bias. Put another way, you can accidentally find a 5% vote swing that doesn’t exist just because random chance is annoying like that, but to get numbers in the 20-25% range you had to put some effort in.

As Ioannidis points out, this isn’t even a problem with individual researchers, but in how we all view science. Big splashy new results are given a lot of attention, and there is very little criticism if the findings fail to replicate at the same magnitude. This means that researchers have nothing but incentives to make sure the effect sizes they’re seeing are as big as possible. In fact Ioannidis has found (in a different paper) that about half the time the first paper published on a topic shows the most extreme value ever found. That is way more than what we would expect to see if it were up to random chance. Ioannidis argues that by figuring out exactly how far these effect sizes deviate from chance, we can actually measure the level of bias.

Again, not a problem for those researchers who replicate, but something to consider for those who don’t. We’ll get into that next week, in our final segment: “So What Can Be Done?”.

So Why ARE Most Published Research Findings False? The Corollaries

Welcome to “So Why ARE Most Published Research Findings False?”, a step by step walk through of the John Ioannidis paper “Why Most Published Research Findings Are False”. It probably makes more sense if you read this in order, so check out the intro here, Part 1 here and Part 2 here.

Okay, first a quick recap: Up until now, Ioannidis has spent most of the paper providing a statistical justification for considering not just study power and p values, but also pre-study odds, bias measures, and the number of teams working on a problem when trying to figure out if a published finding is true or not. Because he was writing a scientific paper and not a blog post, he did a lot less editorializing than I did when I was breaking down what he did. In this section he changes all that, and he goes through a point by point breakdown of what this all means with a set of 6 corollaries. The words here in bold are his, but I’ve simplified the explanations. Some of this is a repeat from the previous posts, but hey, it’s worth repeating.

Corollary 1: The smaller the studies conducted in a scientific field, the less likely the research findings are to be true. In part 1 and part 2, we saw a lot of graphs that showed good study power had a huge effect on result reliability. Larger sample sizes = better study power.

Corollary 2: The smaller the effect sizes in a scientific field, the less likely the research findings are to be true. This is partially just intuitive, but also part of the calculation for study power. Larger effect sizes = better study power. Interestingly, Ioannidis points out here that given all the math involved, any field looking for effect sizes smaller than 5% is pretty much never going to be able to confirm their results.

Corollary 3: The greater the number and the lesser the selection of tested relationships in a scientific field, the less likely the research findings are to be true. That R value we talked about in part 1 is behind this one. Pre-study odds matter, and fields that are generating new hypotheses or exploring new relationships are always going to have more false positives than studies that replicate others or meta-analyses.

Corollary 4: The greater the flexibility in designs, definitions, outcomes, and analytical modes in a scientific field, the less likely the research findings are to be true. This should be intuitive, but it’s often forgotten. I work in oncology, and we tend to use a pretty clear cut end point for many of our studies: death. Our standards around this are so strict that if you die in a car crash less than 100 days after your transplant, you get counted in our mortality statistics. Other fields have more wiggle room. If you are looking for mortality OR quality of life OR reduced cost OR patient satisfaction, you’ve quadrupled your chance of a false positive.

Corollary 5: The greater the financial and other interests and prejudices in a scientific field, the less likely the research findings are to be true. This one’s pretty obvious. Worth noting: he points out that “trying to get tenure” and “trying to preserve one’s previous findings” are both sources of potential bias.

Corollary 6: The hotter a scientific field (with more scientific teams involved), the less likely the research findings are to be true. This was part of our discussion last week. Essentially it’s saying that if you have 10 people with tickets to a raffle, the chances that one of you wins is higher than the chances that you personally win. If we assume 5% of positive findings happen due to chance, having multiple teams work on a question will inevitably lead to more false positives.
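The arithmetic behind Corollaries 4 and 6 can be sketched in a few lines. This is a minimal illustration, not something from the paper: it assumes every extra outcome (or every extra team) is an independent test at the same alpha level with no real relationship to find, and the function name is made up for the example.

```python
def chance_of_any_false_positive(alpha: float, k: int) -> float:
    """Chance that at least one of k independent tests of a
    non-existent relationship comes up 'significant' at level alpha.
    Assumes independence, which real outcomes and teams rarely have."""
    return 1 - (1 - alpha) ** k


# k can be the number of outcomes you check (Corollary 4)
# or the number of teams chasing the same question (Corollary 6).
for k in (1, 4, 10):
    print(k, round(chance_of_any_false_positive(0.05, k), 3))
```

Under those assumptions, four outcomes at alpha = .05 give roughly an 18.5% chance of at least one false positive, and ten raffle tickets push it to about 40%.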

Both before and after listing these 6 things out, Ioannidis reminds us that none of these factors are independent or isolated. He gives some specific examples from genomics research, but then also gives this helpful table. To refresh your memory, the 1-beta column is study power (influenced by sample size and effect size), R is the pre-study odds (varies by field), u is bias, and the “PPV” column over on the side there is the chance that a paper with a positive finding is actually true. Oh, and “RCT” is “Randomized Controlled Trial”:

I feel a table of this sort should hang over the desk of every researcher and/or science enthusiast.

Now all this is a little bleak, but we’re still not entirely at the bottom. We’ll get to that next week.

Part 4 is up! Click here to read it.

So Why ARE Most Published Research Findings False? Bias and Other Ways of Making Things Worse

Welcome to “So Why ARE Most Published Research Findings False?”, a step by step walk through of the John Ioannidis paper “Why Most Published Research Findings Are False”. It probably makes more sense if you read this in order, so if you missed the intro, check it out here and check out Part 1 here.

First, a quick recap: Last week we took a look at the statistical framework that helps us analyze the chances that any given paper we are reading found a relationship that actually exists. This first involves turning the study design (assumed Type 1 and Type 2 error rates) into a positive predictive value….aka given the assumed error rates, what is the chance that a positive result is actually true. We then added in a variable R or “pre-study odds” which sought to account for the fact that some fields are simply more likely to find true results than others due to the nature of their work. The harder it is to find a true relationship, the less likely it is that any apparently true relationship you do find is actually true. This is all just basic math (well, maybe not basic math), and provides us the coat hook on which to hang some other issues which muck things up even further.

Like bias.

Oh, bias: Yes, Ioannidis talks about bias right up front. He gives it the letter “u” and defines it as “the proportion of probed analyses that would not have been ‘research findings,’ but nevertheless end up presented and reported as such, because of bias”. Note that he is specifically focusing on research that is published claiming to have found a relationship between two things. He does mention that bias could be used to bury true findings, but that is beyond the current scope. It’s also probably less common simply because positive findings are less common. Anyway, he doesn’t address reasons for bias at this point, but he does add it into his table to show how much it mucks about with the equations:
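As a rough sketch of that adjustment, here is one way to code up the bias-adjusted positive predictive value. It assumes the formula PPV = ([1 – β]R + uβR)/(R + α – βR + u – uα + uβR), which follows from treating u as the share of would-be negative analyses that get reported as positive anyway. The function name and default values are illustrative choices, not anything from the paper.

```python
def ppv_with_bias(R: float, u: float, alpha: float = 0.05, beta: float = 0.2) -> float:
    """Chance that a reported positive finding is true, allowing for bias.

    R     : pre-study odds (true relationships : no relationships)
    u     : bias, the share of would-be negatives reported as positive
    alpha : Type 1 error rate
    beta  : Type 2 error rate (power is 1 - beta)
    """
    # True relationships reported as positive: honest hits plus biased misses.
    true_positives = (1 - beta) * R + u * beta * R
    # Non-relationships reported as positive: chance hits plus biased negatives.
    false_positives = alpha + u * (1 - alpha)
    return true_positives / (true_positives + false_positives)


# Sweep bias from none to total for a fixed field.
for u in (0.0, 0.05, 0.2, 0.5, 0.8):
    print(f"u = {u:<4} PPV = {ppv_with_bias(R=0.3, u=u):.2f}")
```

Setting u to 0 collapses this back to the no-bias version, and pushing u toward 1 drags the answer down toward R/(R + 1), which is what you would get by calling every single analysis a finding.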

This pretty much confirms our pre-existing beliefs that bias makes everything messy. Nearly everyone knows that bias screws things up and makes things less reliable, but Ioannidis goes a step further and seeks to answer the question “how much less reliable?” He helpfully provides these graphs (blue line is low bias of .05, yellow is high bias of .8):

Eesh. What’s interesting to note here is that good study power (the top graph) has a pretty huge moderating effect on all levels of bias over studies with low power (bottom graph). This makes sense since study power is influenced by sample size and the size of the effect you are looking for. While even small levels of bias (the blue line) influence the chance of a paper being correct, it turns out good study design can do wonders for your work. To put some numbers on this, a well powered study (80% power) with 30% pre-study odds with a positive finding has an 83% chance of being correct with no bias. If that bias is 5%, the chances drop to about 71%. If the study power is dropped to 50%, you have about a 62% chance of a positive finding being real. Drop the study power to 20% and you’re down to about 42%. Keep your statisticians handy, folks.
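To make the link between sample size, effect size and power concrete, here is a minimal sketch using a textbook normal approximation for comparing two equal-sized groups. None of this comes from the paper: the function name, the two-sided z-test setup and the use of a standardized effect size are all assumptions picked for illustration.

```python
from statistics import NormalDist


def approx_power(effect_size: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided z-test comparing two group means.

    effect_size : difference in means divided by the common standard deviation
    n_per_group : number of subjects in each of the two groups
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)                # cutoff for "significant"
    shift = effect_size * (n_per_group / 2) ** 0.5   # expected size of the test statistic
    # Chance the statistic lands beyond either cutoff when the effect is real.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Bigger samples or bigger effects both push power up.
for d, n in ((0.2, 64), (0.5, 20), (0.5, 64), (0.8, 64)):
    print(f"effect = {d}, n per group = {n}: power = {approx_power(d, n):.2f}")
```

Under this approximation a medium effect of 0.5 needs around 64 people per group to get to roughly 80% power, while a small effect of 0.2 with the same sample leaves you well short of that.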

Independent teams, or yet another way to muck things up: Now when you think about bias, the idea of having independent teams work on the same problems sounds great. After all, they’re probably not all equally biased, and they can confirm each other’s findings right?

Well, sometimes.

It’s not particularly intuitive to think that having lots of people working on a research question would make results less reliable, but it makes sense. For every independent team working on the same research question, the chances that one of them gets a false positive finding goes up. This is a more complicated version of the replication crisis, because none of these teams necessarily have to be trying the same method to address the question. Separating out what’s a study design issue and what’s a false positive is more complicated than it seems. Mathematically, the implications of this are kind of staggering. The number of teams working on a problem (n) actually increases some of the factors exponentially. Even if you leave bias out of the equation, this can have an enormous impact on the believability of positive results:
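Here is a small sketch of how the team count enters the math, leaving bias out. It assumes n fully independent teams with identical alpha and beta, and it asks about the chance that a positive claim from at least one of them is true, which works out to PPV = R(1 – β^n)/(R + 1 – [1 – α]^n – Rβ^n). The function name and defaults are hypothetical.

```python
def ppv_with_teams(R: float, n: int, alpha: float = 0.05, beta: float = 0.2) -> float:
    """Chance that a positive finding is true when n independent teams
    are all testing the same question (no bias included).

    R : pre-study odds (true relationships : no relationships)
    n : number of independent teams
    """
    # A real relationship is missed only if every team misses it.
    true_positives = R * (1 - beta ** n)
    # A non-relationship stays clean only if every team avoids a chance hit.
    false_positives = 1 - (1 - alpha) ** n
    return true_positives / (true_positives + false_positives)


for n in (1, 5, 10, 50):
    print(f"{n:>2} teams: PPV = {ppv_with_teams(R=0.3, n=n):.2f}")
```

The exponents are where the trouble comes from: the chance that somebody gets a false positive climbs quickly with n, while the chance that somebody detects a real effect was already close to its ceiling for well powered studies.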

If you compare this graph to the bias graph, you’ll note that having 5 teams working on the same question decreases the chances of having a true positive finding nearly as much as having a bias rate of 20% does….and that’s for well designed studies. This is terrible news because while many people have an insight into how biased a field might be and how to correct for it, you rarely hear people discuss how many teams are working on the same problem. Indeed, researchers themselves may not know how many people are researching their question. I mean, think about how this is reported in the press: “previous studies have not found similar things”. Some people take that as a sign of caution, but many more take that as “this is groundbreaking”. Only time can tell which one is which, and we are not patient people.

Now we have quite a few factors to take into account. Along with the regular alpha and beta, we’ve added R (pre-study odds), u (bias) and n (number of teams). So far we’ve looked at them all in isolation, but next week we’re going to review what the practical outcomes are of each and how they start to work together to really screw us up. Stay tuned.

Part 3 is up! Click here to read “The Corollaries”

So Why ARE Most Published Research Findings False? A Statistical Framework

Welcome to “So Why ARE Most Published Research Findings False?”, a step by step walk through of the John Ioannidis paper bearing that name. If you missed the intro, check it out here.

Okay, so last week I gave you the intro to the John Ioannidis paper Why Most Published Research Findings are False. This week we’re going to dive right in with the first section, which is excitingly titled “Modeling the Framework for False Positive Findings“.

Ioannidis opens the paper with a review of the replication crisis (as it stood in 2005 that is) and announces his intention to particularly focus on studies that yield false positive results….aka those papers that find relationships between things where no relationship exists.

To give a framework for understanding why so many of these false positive findings exist, he creates a table showing the 4 possibilities for research findings, and how to calculate how large each one is. We’ve discussed these four possibilities before, and they look like this:

Now that may not look too shocking off the bat, and if you’re not into this sort of thing you’re probably yawning a bit. However, for those of us in the stats world, this is a paradigm shift. See, historically stats students and researchers have been taught that the table looks like this:


This table represents a lot of the decisions you make right up front in your research, often without putting much thought into it. Those values are used to drive error rates, study power and confidence intervals:


The alpha value drives the notorious “.05” level used in p-value testing: it is the cutoff for how likely it can be that random chance alone would produce a relationship at least as extreme as the one you’re seeing.

What Ioannidis is adding in here is c, or the overall number of relationships you are looking at, and R, which is the overall ratio of true relationships to non-relationships among those being tested in the field. Put another way, this is the “Pre-Study Odds”. It asks researchers to think about it up front: if you took your whole field and every study ever done in it, what would you say the chances of a positive finding are right off the bat?

Obviously R would be hard to calculate, but it’s a good add in for all researchers. If you have some sense that your field is error prone or that it’s easy to make false discoveries, you should be adjusting your calculations accordingly. Essentially he is asking people to consider the base rate here, and to keep it front and center. For example, a drug company that has carefully vetted its drug development process may know that 30% of the drugs that make it to phase 2 trials will ultimately prove to work. On the other hand, a psychologist attempting to create a priming study could expect a much lower rate of success. The harder it is for everyone to find a real relationship, the greater the chances that a relationship you do find will also be a false positive. I think requiring every field to come up with an R would be an enormously helpful step in and of itself, but Ioannidis doesn’t stop there.

Ultimately, he ends up with an equation for the Positive Predictive Value (aka the chance that a positive result is true aka PPV aka the chance that a paper you read is actually reporting a real finding) which is PPV = (1 – β)R/(R – βR + α). For a study with a typical alpha and a good beta (.05 and .2, respectively), here’s what that looks like for various values of R:


So the lower the pre-study odds of success, the more likely it is that a finding is a false positive rather than a true positive. Makes sense right?
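The PPV formula above is simple enough to play with directly. What follows is a minimal sketch, with function names made up for the example, that evaluates PPV = (1 – β)R/(R – βR + α) for a few values of R and also solves for the pre-study odds at which a positive finding becomes a coin flip.

```python
def ppv(R: float, alpha: float = 0.05, beta: float = 0.2) -> float:
    """Chance that a positive finding is true, with no bias.

    R     : pre-study odds (true relationships : no relationships)
    alpha : Type 1 error rate
    beta  : Type 2 error rate (power is 1 - beta)
    """
    return (1 - beta) * R / (R - beta * R + alpha)


def break_even_odds(alpha: float = 0.05, beta: float = 0.2) -> float:
    """Pre-study odds at which PPV is exactly 50%, from (1 - beta) * R = alpha."""
    return alpha / (1 - beta)


for R in (0.01, 0.0625, 0.3, 1.0):
    print(f"R = {R:<6} PPV = {ppv(R):.2f}")

print("break-even odds:", break_even_odds())
```

With alpha at .05 and beta at .2, pre-study odds of 0.3 give a PPV of about 83%, and the break-even point lands at odds of 1/16.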

Now most readers will very quickly note that this graph shows that you have a 50% chance of being able to trust the result at a fairly low level of pre-study odds, and that is true. Under this model, the study is more likely to be true than false if (1 – β)R > α. In the case of my graph above, this translates into pre-study odds that are greater than 1/16. So where do we get the “most findings are false” claim?

Enter bias.

You see, Ioannidis was setting this framework up to remind everyone what the best case scenario was. He starts here to remind everyone that even within a perfect system, some fields are going to be more accurate than others simply due to the nature of the investigations they do, and that no field should ever expect that 100% accuracy is their ceiling. This is an assumption of the statistical methods used, but this assumption is frequently forgotten when people actually sit down to review the literature. Most researchers would not even think of claiming that their pre-study odds were more than 30%, but very few would say off the top “17% of studies finding significant results in my field are wrong”, even though that’s what the math tells us. And again, that’s in a perfect system. Going forward we’re going to add more terms to the statistical models, and those odds will never get better.

In other words, see you next week folks, it’s all downhill from here.

Click here to go straight to part 2.