So Why AREN’T Most Published Research Findings False? The Rebuttal

Welcome to “So Why ARE Most Published Research Findings False?”, a step by step walk through of the John Ioannidis paper “Why Most Published Research Findings Are False”. It probably makes more sense if you read this in order, so check out the intro here, Part 1 here, Part 2 here, Part 3 here, Part 4 here, and Part 5 here.

Okay people, we made it! All the way through one of the most cited research papers of all time, and we’ve all lost our faith in everything in the process. So what do we do now? Well, let’s turn the lens around on Ioannidis. What, if anything, did he miss and how do we digest this paper? I poked around for a few critiques of him, just to give a flavor. This is obviously not a comprehensive list, but it hits the major criticisms I could find.

The Title: While quite a few people had no problem with the contents of Ioannidis's paper, some took real umbrage with the title, essentially accusing it of being clickbait before clickbait had really gotten going. Additionally, since many people never read anything more than the title of a paper, a title that blunt is easily used as a mallet by anyone trying to disprove any study they choose. Interestingly, there's apparently some question regarding whether Ioannidis actually wrote the title or whether it came from the editors at PLOS Medicine, but the point stands. Given that misleading headlines and reporting are hugely blamed by many (including yours truly) for popular misunderstanding of science, that would be a fascinating irony.

Failing to reject the null hypothesis does not mean accepting the null hypothesis: This is not so much a criticism of Ioannidis as it is of those who use his work to promote their own causes. There is a rather strange line of thought out there that seems to believe that life, or science, is a courtroom. Under this way of thinking, when you undermine a scientist and their hypothesis, your client is de facto not guilty. This is not true. If you somehow prove that chemotherapy is less effective than previously stated, that doesn't actually mean that crystals cure cancer. You never prove the null hypothesis, you only fail to reject it.

The definition of bias contained more nuance: In a paper written in response to the Ioannidis paper, some researchers from Johns Hopkins took issue with the presentation of "bias" in the paper. Their primary grouse seemed to be intent vs. consequence. Ioannidis presents bias as a factor based on consequence, i.e. the way it skews the final results. They disliked this and believed bias should be based on intent, pointing out numerous ways in which things Ioannidis calls "bias" could creep in innocently. For example, if you are looking for a drug that reduces cardiac symptoms but you also find that mortality goes down for patients who take the medication, are you really not going to report that because it's not what you were originally looking for? By the strictest definition this is "data dredging", but is it really? Humans aren't robots. They're going to report interesting findings where they see them.

The effect of multiple teams: This is one of the more interesting quibbles with the initial paper. Mathematically, Ioannidis proved that having multiple teams working on the same research question would increase the chances of a false result. In the same Hopkins paper, the researchers question the math behind the "multiple teams lead to more false positives" assertion. They mention that for any one study, the odds stay the same as they always have been. Ioannidis counters with an argument that boils down to "yes, if you assume those in competition don't get more biased". Interestingly, later research has shown that this effect does exist and is much worse in fields where the R factor (pre-study odds) is low.

So overall, what would I say are the major criticisms or cautions around this paper that I personally will employ?

  1. If you're citing science, use scientific terms precisely. Don't get sloppy with verbiage just to make your life easier.
  2. Remember, scientific best practices all feed off each other. Getting a good sample size and promoting caution can reduce both overall bias and the effect of bias that does exist. The effect of multiple team testing can be partially negated by high pre-study odds. If a team or researcher employs most best practices but misses one, that may not be a death blow to their research. Look at the whole picture before dismissing the research.
  3. New is exciting, but not always reliable. We all like new and quirky findings, but we need to let that go. New findings are the least likely to play out later, and that's okay. We want to cast a broad net, but for real progress we need a longer attention span.
  4. Bias takes many forms. When we mention "bias" we often jump right to financial motivations. But intellectual and social pressure can be bias, competing for tenure can cause bias, and confirming one's own findings can cause bias.
  5. There are more ways of being wrong than there are ways of being right. Every researcher wants a true finding. They really do. No one wants their life's work undone. While some researchers may be motivated by results they like, I do truly believe that the majority of problems are caused by the whole "needle in a haystack" thing more than the "convenient truth" thing.

Alright, that wraps us up! I enjoyed this series, and may do more going forward. If you see a paper that piques your interest, let me know and I'll look into it. Happy holidays everyone!


So Why ARE Most Published Research Findings False? A Way Forward

Welcome to “So Why ARE Most Published Research Findings False?”, a step by step walk through of the John Ioannidis paper “Why Most Published Research Findings Are False”. It probably makes more sense if you read this in order, so check out the intro here, Part 1 here, Part 2 here, Part 3 here, and Part 4 here.

Alright guys, we made it! After all sorts of math and bad news, we're finally at the end. While the situation Ioannidis has laid out up until now sounds pretty bleak, he doesn't let us end there. No, in this section, "How Can We Improve the Situation?", he ends with both hope and suggestions. Thank goodness.

Ioannidis starts off with the acknowledgement that we will never really know which research findings are true and which are false. If we had a perfect test, we wouldn’t be in this mess to begin with. Therefore, anything we do to improve the research situation will be guessing at best. However, there are things that it seems would likely do some good. Essentially they are to improve the values of each of the “forgotten” variables in the equation that determines the positive predictive value of findings. These are:

  1. Beta/study power: Use larger studies or meta-analyses aimed at testing broad hypotheses
  2. n/multiple teams: Consider a totality of evidence or work done before concluding any one finding is true
  3. u/Bias: Register your study ahead of time, or work with other teams to register your data to reduce bias
  4. R/Pre-study Odds: Determine the pre-study odds prior to your experiment, and publish your assessment with your results

If you've been following along so far, none of those suggestions should be surprising to you. Let's dive into each, though:

First, we should be using larger studies or meta-analyses that aggregate smaller studies. As we saw earlier, large sample size = higher study power -> blunted impact of bias. That's a good thing. This isn't foolproof though, as bias can still slip through, and a large sample size means very tiny effect sizes can be ruled "statistically significant". These studies are also hard to do because they are so resource intensive. Ioannidis suggests that large studies be reserved for large questions, though without a lot of guidance on how to do that.
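
If you want to see how fast sample size buys you power, here's a little back-of-the-envelope sketch. This is my own illustration rather than anything from the paper, using the standard normal approximation for a two-group comparison; the effect size and sample sizes are made-up numbers just to show the shape of the curve.

```python
from scipy.stats import norm

def approx_power(effect_size, n_per_group, alpha=0.05):
    """Rough power of a two-sided, two-sample comparison via the normal
    approximation. effect_size is a standardized difference (Cohen's d)."""
    z_crit = norm.ppf(1 - alpha / 2)
    noncentrality = effect_size * (n_per_group / 2) ** 0.5
    return norm.cdf(noncentrality - z_crit)

# a modest effect (d = 0.3) at a range of sample sizes
for n in (25, 50, 100, 200, 400):
    print(f"n = {n:>3} per group: power ~ {approx_power(0.3, n):.2f}")
```

Same effect, same test, and the power climbs from "coin flip at best" to "nearly certain" as the groups get bigger. That's the whole argument for large studies in one loop.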

Second, the totality of the evidence. We’ve covered a lot about false positives here, and Ioannidis of course reiterates that we should always keep them in mind. One striking finding should almost never be considered definitive, but rather compared to other similar research.

Third, steps must be taken to reduce bias. We talked about this a lot with the corollaries, but Ioannidis advocates hard that groups should tell someone else up front what they're trying to do. This would (hopefully) reduce the tendency to say "hey, we didn't find an effect for just the color red, but if you include pink and orange as a type of red, there's an effect!". Trial pre-registration gets a lot of attention in the medical world, but may not be feasible in other fields. At the very least, Ioannidis suggests that research teams share their strategy with each other up front, as a sort of "insta peer review" type thing. This would allow researchers some leeway to report interesting findings they weren't expecting (i.e. "red wasn't a factor, but good golly look at green!") while reducing the aforementioned "well if you tweak the definition of red a bit, you totally get a significant result".

Finally, the pre-study odds. This would be a moment up front for researchers to really assess how likely they are to find anything, and a number others could use later to judge the research by. Almost every field has a professional conference, and one would imagine determining pre-study odds for different lines of inquiry would be a natural topic for many of them. Encouraging researchers to think up front about their odds of finding something interesting would be a useful framing for everything yet to come.

None of this would fix everything, but it would certainly inject some humility and context into the process from the get go. Science in general is supposed to be a way of objectively viewing the world and describing what you find. Turning that lens inward should be something researchers welcome, though obviously that is not always the case.

In that vein, next week I’ll be rounding up some criticisms of this paper along with my wrap up to make sure you hear the other side. Stay tuned!


So Why ARE Most Published Research Findings False? Bias bias bias

Welcome to “So Why ARE Most Published Research Findings False?”, a step by step walk through of the John Ioannidis paper “Why Most Published Research Findings Are False”. It probably makes more sense if you read this in order, so check out the intro here, Part 1 here, Part 2 here, and Part 3 here.

Alright folks, we're almost there. We covered a lot of mathematical ground here and last week ended with a few corollaries. We've seen the effects of sample size, study power, effect size, pre-study odds, bias and the work of multiple teams. We've gotten thoroughly depressed, and we're not done yet. There's one more conclusion we can draw, and it's a scary one. Ioannidis holds nothing back, and he just flat out calls this section "Claimed Research Findings May Often Be Simply Accurate Measures of the Prevailing Bias". Okay then.

To get to this point, Ioannidis lays out a couple of things:

  1. Throughout history, there have been scientific fields of inquiry that later proved to have no basis….like phrenology for example. He calls these “null fields”.
  2. Many of these null fields had positive findings at some point, and in a number great enough to sustain the field.
  3. Given the math around positive findings, the effect sizes in false positives due to random chance should be fairly small.
  4. Therefore, large effect sizes discovered in null fields pretty much just measure the bias present in those fields….aka that “u” value we talked about earlier.

You can think about this like a coin flip. If you flip a fair coin 100 times, you know you should get about 50 heads and 50 tails. Given random fluctuations, you probably wouldn’t be too surprised if you ended up with a 47-53 split or even a 40-60 split. If you ended up with an 80-20 split however, you’d get uneasy. Was the coin really fair?

The same goes for scientific studies. Typically we look at large effect sizes as a good thing. After all, where there's smoke there's fire, right? However, Ioannidis points out that large effect sizes are actually an early warning sign for bias. For example, let's say you think that your coin is weighted a bit, and that you will actually get heads 55% of the time you flip it. You flip it 100 times and get 90 heads. You can react in one of 3 ways:

  1. Awesome, 90 is way more than 55 so I was right that heads comes up more often!
  2. Gee, there's a 1 in 73 quadrillion chance that 90 heads would come up if this coin were fairly weighted. With the slight bias I thought was there, the chance of getting the results I did is still about 1 in 5 trillion. I must have underestimated how biased that coin was.
  3. Crap. I did something wrong.

You can guess which ones most people go with. Spoiler alert: it’s not #3.
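
If you want to check that sort of arithmetic yourself, here's a quick sketch using scipy. Note that the "1 in X" figure you get depends on whether you ask about exactly 90 heads or 90-or-more, so don't be surprised if your numbers come out a bit different than mine.

```python
from scipy.stats import binom

flips, heads = 100, 90

for p, label in [(0.50, "fair coin"), (0.55, "slightly weighted coin")]:
    exactly = binom.pmf(heads, flips, p)       # P(exactly 90 heads)
    at_least = binom.sf(heads - 1, flips, p)   # P(90 or more heads)
    print(f"{label}: P(=90) = {exactly:.2e}, P(>=90) = {at_least:.2e}")
```

Either way you slice it, "the coin was a tiny bit weighted" doesn't come close to explaining 90 heads. Something else went on.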

The whole "an unexpectedly large effect size should make you nervous" phenomenon is counterintuitive, but I've actually blogged about it before. It's what got Andrew Gelman upset about the study that found that 20% of women were changing their vote around their menstrual cycle, and it's something I've pointed out about the claim that 25% of men would vote for Trump if they're primed to think about how much money their wives make. Effect sizes of that magnitude shouldn't be cause for excitement, they should be cause for concern. Unless you are truly discovering a previously unknown and overwhelmingly large phenomenon, there's a good chance some of that number was driven by bias.

Now of course, if your findings replicate, this is all fine, you’re off the hook. However if they don’t, the largeness of your effect size is really just a measure of your own bias. Put another way, you can accidentally find a 5% vote swing that doesn’t exist just because random chance is annoying like that, but to get numbers in the 20-25% range you had to put some effort in.
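
To put some rough numbers on that, here's a back-of-the-envelope sketch. The subgroup size is a made-up but plausible poll subsample, not something from either study, so treat the output as illustrative only.

```python
import math

def swing_standard_error(p, n1, n2):
    """Approximate standard error (in percentage points) of the difference
    between two estimated shares, each near p, from groups of size n1 and n2."""
    return 100 * math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))

n = 300  # hypothetical subgroup size per condition, purely illustrative
se = swing_standard_error(0.5, n, n)
print(f"standard error of the swing: ~{se:.1f} points")
print(f"a 5-point swing is ~{5 / se:.1f} standard errors from zero")
print(f"a 20-point swing is ~{20 / se:.1f} standard errors from zero")
```

A swing of a couple standard errors is the kind of thing random noise hands you now and then; a swing of five or six standard errors almost never happens by accident.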

As Ioannidis points out, this isn't even a problem with individual researchers, but in how we all view science. Big splashy new results are given a lot of attention, and there is very little criticism if the findings fail to replicate at the same magnitude. This means that researchers have nothing but incentives to make the effect sizes they're seeing look as big as possible. In fact, Ioannidis has found (in a different paper) that about half the time the first paper published on a topic shows the most extreme value ever found. That is way more than what we would expect to see if it were up to random chance. Ioannidis argues that by figuring out exactly how far these effect sizes deviate from chance, we can actually measure the level of bias.

Again, not a problem for those researchers whose findings replicate, but something to consider for those who don't. We'll get into that next week, in our final segment: "So What Can Be Done?".

So Why ARE Most Published Research Findings False? The Corollaries

Welcome to “So Why ARE Most Published Research Findings False?”, a step by step walk through of the John Ioannidis paper “Why Most Published Research Findings Are False”. It probably makes more sense if you read this in order, so check out the intro here, Part 1 here, and Part 2 here.

Okay, first a quick recap: Up until now, Ioannidis has spent most of the paper building a statistical case for considering not just study power and p-values, but also pre-study odds, bias, and the number of teams working on a problem when trying to figure out whether a published finding is true. Because he was writing a scientific paper and not a blog post, he did a lot less editorializing than I did when I was breaking down what he did. In this section he changes all that, and goes through a point-by-point breakdown of what this all means with a set of 6 corollaries. The words here in bold are his, but I've simplified the explanations. Some of this is a repeat from the previous posts, but hey, it's worth repeating.

Corollary 1: The smaller the studies conducted in a scientific field, the less likely the research findings are to be true. In part 1 and part 2, we saw a lot of graphs that showed good study power had a huge effect on result reliability. Larger sample sizes = better study power.

Corollary 2: The smaller the effect sizes in a scientific field, the less likely the research findings are to be true. This is partially just intuitive, but also part of the calculation for study power. Larger effect sizes = better study power. Interestingly, Ioannidis points out here that given all the math involved, any field looking for effect sizes smaller than 5% is pretty much never going to be able to confirm their results.

Corollary 3: The greater the number and the lesser the selection of tested relationships in a scientific field, the less likely the research findings are to be true. That R value we talked about in part 1 is behind this one. Pre-study odds matter, and fields that are generating new hypotheses or exploring new relationships are always going to have more false positives than studies that replicate others or meta-analyses.

Corollary 4: The greater the flexibility in designs, definitions, outcomes, and analytical modes in a scientific field, the less likely the research findings are to be true. This should be intuitive, but it’s often forgotten. I work in oncology, and we tend to use a pretty clear cut end point for many of our studies: death. Our standards around this are so strict that if you die in a car crash less than 100 days after your transplant, you get counted in our mortality statistics. Other fields have more wiggle room. If you are looking for mortality OR quality of life OR reduced cost OR patient satisfaction, you’ve quadrupled your chance of a false positive.

Corollary 5: The greater the financial and other interests and prejudices in a scientific field, the less likely the research findings are to be true. This one's pretty obvious. Worth noting: he points out "trying to get tenure" and "trying to preserve one's previous findings" are both sources of potential bias.

Corollary 6: The hotter a scientific field (with more scientific teams involved), the less likely the research findings are to be true. This was part of our discussion last week. Essentially it's saying that if you have 10 people with tickets to a raffle, the chances that one of you wins are higher than the chances that you personally win. If each study has a 5% chance of turning up a positive finding purely by chance, having multiple teams work on a question will inevitably lead to more false positives.
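
The raffle math is easy to check for yourself. Here's a tiny sketch, assuming each team independently has a 5% chance of stumbling on a false positive:

```python
alpha = 0.05  # chance any single study turns up a false positive

for n_teams in (1, 2, 5, 10, 20):
    # chance that at least one of n independent teams gets a false positive
    p_any = 1 - (1 - alpha) ** n_teams
    print(f"{n_teams:>2} team(s): P(at least one false positive) = {p_any:.2f}")
```

With 10 teams on the same question, there's roughly a 40% chance that somebody, somewhere, gets a "significant" result out of pure noise.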

Both before and after listing these 6 things out, Ioannidis reminds us that none of these factors are independent or isolated. He gives some specific examples from genomics research, but then also gives this helpful table. To refresh your memory, the 1-beta column is study power (influenced by sample size and effect size), R is the pre-study odds (varies by field), u is bias, and the "PPV" column over on the side there is the chance that a paper with a positive finding is actually true. Oh, and "RCT" is "Randomized Controlled Trial":

I feel a table of this sort should hang over the desk of every researcher and/or science enthusiast.

Now all this is a little bleak, but we’re still not entirely at the bottom. We’ll get to that next week.

Part 4 is up! Click here to read it.

So Why ARE Most Published Research Findings False? Bias and Other Ways of Making Things Worse

Welcome to “So Why ARE Most Published Research Findings False?”, a step by step walk through of the John Ioannidis paper “Why Most Published Research Findings Are False”. It probably makes more sense if you read this in order, so if you missed the intro, check it out here and check out Part 1 here.

First, a quick recap: Last week we took a look at the statistical framework that helps us analyze the chances that any given paper we are reading found a relationship that actually exists. This first involves turning the study design (assumed Type 1 and Type 2 error rates) into a positive predictive value….aka given the assumed error rates, what is the chance that a positive result is actually true. We then added in a variable R, or "pre-study odds", which sought to account for the fact that some fields are simply more likely to find true results than others due to the nature of their work. The harder it is to find a true relationship, the less likely it is that any apparently true relationship you do find is actually true. This is all just basic math (well, maybe not basic math), and provides us the coat hook on which to hang some other issues which muck things up even further.

Like bias.

Oh, bias: Yes, Ioannidis talks about bias right up front. He gives it the letter "u" and defines it as "the proportion of probed analyses that would not have been 'research findings,' but nevertheless end up presented and reported as such, because of bias". Note that he is specifically focusing on research that is published claiming to have found a relationship between two things. He does mention that bias could be used to bury true findings, but that is beyond the current scope. It's also probably less common simply because positive findings are less common. Anyway, he doesn't address reasons for bias at this point, but he does add it into his table to show how much it mucks about with the equations:

This pretty much confirms our pre-existing beliefs that bias makes everything messy. Nearly everyone knows that bias screws things up and makes things less reliable, but Ioannidis goes a step further and seeks to answer the question “how much less reliable?”  He helpfully provides these graphs (blue line is low bias of .05, yellow is high bias of .8):

Eesh. What's interesting to note here is that good study power (the top graph) has a pretty huge moderating effect on all levels of bias over studies with low power (bottom graph). This makes sense, since study power is influenced by sample size and the size of the effect you are looking for. While even small levels of bias (the blue line) influence the chance of a paper being correct, it turns out good study design can do wonders for your work. To put some numbers on this, a well-powered study with 30% pre-study odds and a positive finding has an 83% chance of being correct with no bias. If that bias is 5%, the chances drop to about 80%. If the study power is dropped, you have about a 70% chance of a true finding being real. Drop the study power further and you're under 60%. Keep your statisticians handy folks.
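
If you want to play with numbers like these yourself, here's a sketch of the bias-adjusted PPV calculation. The formula is my transcription of the one in the paper's Table 2, and the percentages above were eyeballed from the graphs, so don't expect the output to match them to the digit:

```python
def ppv_with_bias(power, R, alpha=0.05, u=0.0):
    """Chance that a claimed positive finding is true, allowing for bias u.

    power = 1 - beta, R = pre-study odds, u = proportion of analyses that
    get reported as "findings" only because of bias. With u = 0 this
    reduces to the plain PPV formula, (1 - beta) * R / (R - beta * R + alpha).
    """
    beta = 1 - power
    numerator = (1 - beta) * R + u * beta * R
    denominator = R + alpha - beta * R + u - u * alpha + u * beta * R
    return numerator / denominator

R = 0.30  # the 30% pre-study odds from the example above
for power in (0.8, 0.5, 0.2):
    for u in (0.0, 0.05, 0.2):
        print(f"power={power:.1f}, bias={u:.2f}: PPV ~ {ppv_with_bias(power, R, u=u):.2f}")
```

The pattern is the point: at every bias level, the well-powered rows hold up noticeably better than the underpowered ones.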

Independent teams, or yet another way to muck things up: Now when you think about bias, the idea of having independent teams work on the same problems sounds great. After all, they’re probably not all equally biased, and they can confirm each other’s findings right?

Well, sometimes.

It's not particularly intuitive to think that having lots of people working on a research question would make results less reliable, but it makes sense. For every independent team working on the same research question, the chances that one of them gets a false positive finding go up. This is a more complicated version of the replication crisis, because none of these teams necessarily have to be trying the same method to address the question. Separating out what's a study design issue and what's a false positive is more complicated than it seems. Mathematically, the implications of this are kind of staggering. The number of teams working on a problem (n) actually increases some of the factors exponentially. Even if you leave bias out of the equation, this can have an enormous impact on the believability of positive results:

If you compare this to the bias graph, you'll note that having 5 teams working on the same question actually decreases the chances of a true positive finding more than having a bias rate of 20% does….and that's for well designed studies. This is terrible news because while many people have an insight into how biased a field might be and how to correct for it, you rarely hear people discuss how many teams are working on the same problem. Indeed, researchers themselves may not know how many people are researching their question. I mean, think about how this is reported in the press: "previous studies have not found similar things". Some people take that as a sign of caution, but many more take that as "this is groundbreaking". Only time can tell which one is which, and we are not patient people.
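
For the curious, here's a rough sketch of the multiple-teams math. The formula is my transcription of the n-team PPV expression in the paper (with no bias term included), and the pre-study odds are just an illustrative value:

```python
def ppv_multiple_teams(power, R, n_teams, alpha=0.05):
    """Chance a claimed positive finding is true when n independent teams
    probe the same question and any one positive result gets reported.
    No bias term is included here."""
    beta = 1 - power
    true_positives = R * (1 - beta ** n_teams)
    false_positives = 1 - (1 - alpha) ** n_teams
    return true_positives / (true_positives + false_positives)

R = 0.30  # illustrative pre-study odds
for n in (1, 2, 5, 10):
    print(f"{n:>2} team(s): PPV ~ {ppv_multiple_teams(0.8, R, n):.2f}")
```

Even with well-powered studies and decent pre-study odds, the trustworthiness of "somebody found something" drops steadily as more teams pile onto the same question.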

Now we have quite a few factors to take into account. Along with the regular alpha and beta, we've added R (pre-study odds), u (bias) and n (number of teams). So far we've looked at them all in isolation, but next week we're going to review what the practical outcomes are of each and how they start to work together to really screw us up. Stay tuned.

Part 3 is up! Click here to read “The Corollaries”

So Why ARE Most Published Research Findings False? A Statistical Framework

Welcome to “So Why ARE Most Published Research Findings False?”, a step by step walk through of the John Ioannidis paper bearing that name. If you missed the intro, check it out here.

Okay, so last week I gave you the intro to the John Ioannidis paper Why Most Published Research Findings are False. This week we’re going to dive right in with the first section, which is excitingly titled “Modeling the Framework for False Positive Findings“.

Ioannidis opens the paper with a review of the replication crisis (as it stood in 2005 that is) and announces his intention to particularly focus on studies that yield false positive results….aka those papers that find relationships between things where no relationship exists.

To give a framework for understanding why so many of these false positive findings exist, he creates a table showing the 4 possibilities for research findings, and how to calculate how large each one is. We've discussed these four possibilities before, and they look like this:

Now that may not look too shocking off the bat, and if you're not into this sort of thing you're probably yawning a bit. However, for those of us in the stats world, this is a paradigm shift. See, historically stats students and researchers have been taught that the table looks like this:

(image: the basic 2×2 table)

This table represents a lot of the decisions you make right up front in your research, often without putting much thought in to it. Those values are used to drive error rates, study power and confidence intervals:

(image: the 2×2 table with Type 1 and Type 2 error rates)

The alpha value is used to drive the notorious ".05" level used in p-value testing, and is the chance that you would see a relationship at least as extreme as the one you're seeing due to random chance alone.

What Ioannidis is adding in here is c, the overall number of relationships you are looking at, and R, the ratio of true relationships to no relationships among everything tested in the field. Put another way, this is the "Pre-Study Odds". It asks researchers to think about it up front: if you took your whole field and every study ever done in it, what would you say the odds are that any given relationship being probed is actually real?

Obviously R would be hard to calculate, but it's a good add-in for all researchers. If you have some sense that your field is error prone or that it's easy to make false discoveries, you should be adjusting your calculations accordingly. Essentially he is asking people to consider the base rate here, and to keep it front and center. For example, a drug company that has carefully vetted its drug development process may know that 30% of the drugs that make it to phase 2 trials will ultimately prove to work. On the other hand, a psychologist attempting to create a priming study could expect a much lower rate of success. The harder it is for everyone to find a real relationship, the greater the chances that a relationship you do find will also be a false positive. I think requiring every field to come up with an R would be an enormously helpful step in and of itself, but Ioannidis doesn't stop there.

Ultimately, he ends up with an equation for the Positive Predictive Value (aka the chance that a positive result is true aka PPV aka the chance that a paper you read is actually reporting a real finding) which is PPV = (1 – β)R/(R – βR + α). For a study with a typical alpha and a good beta (.05 and .2, respectively), here’s what that looks like for various values of R:

(image: PPV vs. pre-study odds)

So the lower the pre-study odds of success, the more likely it is that a finding is a false positive rather than a true positive. Makes sense right?
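
If you want to recreate the graph yourself, the calculation is tiny. Here's a sketch using the alpha and beta from above; the R values are just a spread of illustrative pre-study odds:

```python
alpha, beta = 0.05, 0.20  # typical alpha, good study power (1 - beta = 0.8)

def ppv(R):
    """Chance that a positive finding is true, given pre-study odds R."""
    return (1 - beta) * R / (R - beta * R + alpha)

for R in (0.01, 1 / 16, 0.1, 0.3, 0.5, 1.0, 2.0):
    print(f"R = {R:6.3f}: PPV = {ppv(R):.2f}")
```

Run it and you'll see the same curve: fields where true relationships are rare get unreliable positive findings, even when everything else is done by the book.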

Now most readers will very quickly note that this graph shows that you have a 50% chance of being able to trust the result at a fairly low level of pre-study odds, and that is true. Under this model, the study is more likely to be true than false if (1 – β)R > α. In the case of my graph above, this translates into pre-study odds that are greater than 1/16. So where do we get the "most findings are false" claim?

Enter bias.

You see, Ioannidis was setting this framework up to remind everyone what the best case scenario was. He starts here to remind everyone that even within a perfect system, some fields are going to be more accurate than others simply due to the nature of the investigations they do, and that no field gets to treat 100% accuracy as its ceiling — even the best case tops out lower than that. This is an assumption of the statistical methods used, but this assumption is frequently forgotten when people actually sit down to review the literature. Most researchers would not even think of claiming that their pre-study odds were more than 30%, yet very few would say off the top "17% of studies finding significant results in my field are wrong", even though that's what the math tells us. And again, that's in a perfect system. Going forward we're going to add more terms to the statistical models, and those odds will never get better.

In other words, see you next week folks, it's all downhill from here.

Click here to go straight to part 2.

So Why ARE Most Published Research Findings False? (An Introduction)

Well hello hello! I’m just getting back from a conference in Minneapolis and I’m completely exhausted, but I wanted to take a moment to introduce a new Sunday series I’ll be rolling out starting next week. I’m calling it my “Important Papers” series, and it’s going to be my attempt to cover/summarize/explain the important points and findings in some, well, important papers.

I’m going to start with the 2005 John Ioannidis paper “Why Most Published Research Findings are False“.  Most people who have ever questioned academic findings have heard of this one, but fewer seem familiar with what it actually says or recommends. Given the impact this paper has had, I think it’s a vital one for people to understand.  I got this idea when my professor for this semester made us all read it to kick off our class, and I was thinking how helpful it was to use that as a framework for further learning. It will probably take me 6 weeks or so to get through the whole thing, and I figured this week would be a good time to do a bit of background. Ready? Okay!

John Ioannidis is a Greek physician who works at Stanford University. In 2005 he published the paper "Why Most Published Research Findings Are False". This quickly became the most cited paper from PLOS Medicine, and is apparently one of the most accessed papers of all time, with 1.5 million downloads. The paper is really the godfather of the meta-research movement…i.e. the push to research how research goes wrong. The Atlantic did a pretty cool breakdown of Ioannidis's career and work here.

The paper has a few different sections, and I'll be going through each of them. I'll probably group a few together based on length, but I'm not sure quite yet how that will look. However, up front I'm thinking the series will go like this:

  1. The statistical framework for false positive findings
  2. Bias and failed attempts at corrections
  3. Corollaries (aka uncomfortable truths)
  4. Research and Bias
  5. A Way Forward
  6. Some other voices/complaints

I’ll be updating that list with links as I write them.

We’ll kick off next week with that first one. There will be pictures.

Week one is up! Go straight to it here.