*Welcome to “So Why ARE Most Published Research Findings False?”, a step-by-step walk through of the John Ioannidis paper “Why Most Published Research Findings Are False”. It probably makes more sense if you read this in order, so check out the intro here, Part 1 here, Part 2 here, and Part 3 here.*

Alright folks, we’re almost there. We’ve covered a lot of mathematical ground, and last week we ended with a few corollaries. We’ve seen the effects of sample size, study power, effect size, pre-study odds, bias and the work of multiple teams. We’ve gotten thoroughly depressed, and we’re not done yet. There’s one more conclusion we can draw, and it’s a scary one. Ioannidis holds nothing back, and he just flat out calls this section “Claimed Research Findings May Often Be Simply Accurate Measures of the Prevailing Bias”. Okay then.

To get to this point, Ioannidis lays out a couple of things:

- Throughout history, there have been scientific fields of inquiry that later proved to have no basis, like phrenology. He calls these “null fields”.
- Many of these null fields had positive findings at some point, and in numbers great enough to sustain the field.
- Given the math around positive findings, the effect sizes in false positives due to random chance should be fairly small.
- Therefore, large effect sizes discovered in null fields pretty much just measure the bias present in those fields, aka that “u” value we talked about earlier.

You can think about this like a coin flip. If you flip a fair coin 100 times, you know you should get about 50 heads and 50 tails. Given random fluctuations, you probably wouldn’t be too surprised if you ended up with a 47-53 split or even a 40-60 split. If you ended up with an 80-20 split however, you’d get uneasy. Was the coin really fair?
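If you want to put rough numbers on that unease, here is a minimal Python sketch that computes how often a fair coin produces a split at least that lopsided in either direction. The function name `prob_split_at_least` is just something I made up for illustration, and it assumes 100 independent flips of a perfectly fair coin.

```python
from math import comb

def prob_split_at_least(heads, n=100):
    """Chance that a fair coin gives a split at least as lopsided as
    `heads` vs `n - heads`, counting both directions (too many heads or tails)."""
    distance = abs(heads - n / 2)
    # Count every outcome that lands at least this far from an even split.
    favorable = sum(comb(n, k) for k in range(n + 1) if abs(k - n / 2) >= distance)
    return favorable / 2**n

for heads in (53, 60, 80):
    print(f"{heads}-{100 - heads} or more lopsided: {prob_split_at_least(heads):.2g}")
```

The mild splits come out as fairly ordinary events, while the 80-20 split is rare enough that "is the coin fair?" is the right question to ask.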

The same goes for scientific studies. Typically we look at large effect sizes as a *good* thing. After all, where there’s smoke there’s fire, right? However, Ioannidis points out that large effect sizes are actually an early warning sign for bias. For example, let’s say you think that your coin is weighted a bit, and that you will actually get heads 55% of the time you flip it. You flip it 100 times and get 90 heads. You can react in one of 3 ways:

1. Awesome, 90 is way more than 55 so I was right that heads comes up more often!
2. Gee, there’s a 1 in 73 quadrillion chance that exactly 90 heads would come up if this coin were fairly weighted. With the slight bias I thought was there, the chance of getting the result I did is still only about 1 in 40 trillion. I must have underestimated how biased that coin was.
3. Crap. I did something wrong.

You can guess which ones most people go with. Spoiler alert: it’s not #3.
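For anyone who wants to check odds like the ones in that second reaction, here is a small sketch using the standard binomial formula. It assumes independent flips and looks at the chance of *exactly* 90 heads in 100; `binom_pmf` is a hypothetical helper name, not anything from the Ioannidis paper.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k heads in n independent flips with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Compare the fair coin with the slightly weighted coin from the example.
for p in (0.5, 0.55):
    prob = binom_pmf(90, 100, p)
    print(f"P(heads) = {p}: exactly 90 heads is about a 1 in {1 / prob:,.0f} event")
```

Either way the result is wildly out of line with what you expected going in, which is the whole point.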

The whole “an unexpectedly large effect size should make you nervous” phenomenon is counterintuitive, but I’ve actually blogged about it before. It’s what got Andrew Gelman upset about the study that found that 20% of women were changing their vote around their menstrual cycle, and it’s something I’ve pointed out about the whole “25% of men vote for Trump if they’re primed to think about how much money their wives make” thing. Effect sizes of that magnitude shouldn’t be cause for *excitement*, they should be cause for *concern*. Unless you are truly discovering a previously unknown and overwhelmingly large phenomenon, there’s a good chance some of that number was driven by bias.

Now of course, if your findings replicate, this is all fine and you’re off the hook. However, if they don’t, the largeness of your effect size is really just a measure of your own bias. Put another way, you can accidentally find a 5% vote swing that doesn’t exist just because random chance is annoying like that, but to get numbers in the 20-25% range you had to put some effort in.
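To see the "chance alone only gets you small effects" idea in coin-flip terms, here is a rough simulation sketch of a null field: thousands of studies of a perfectly fair coin, where a study counts as "positive" if it sees 59 or more heads in 100 flips. That cutoff is my own assumption (a fair coin clears it a bit under 5% of the time), and `simulate_null_field` is a made-up name for illustration.

```python
import random

def simulate_null_field(n_studies=10_000, n_flips=100, cutoff=59, seed=0):
    """Run many studies of a fair coin and return the head counts of the
    studies that came up 'positive' (heads >= cutoff).

    cutoff=59 is an assumed threshold: with 100 fair flips, 59 or more
    heads happens a bit under 5% of the time.
    """
    rng = random.Random(seed)  # fixed seed so reruns give the same result
    positives = []
    for _ in range(n_studies):
        heads = sum(rng.random() < 0.5 for _ in range(n_flips))
        if heads >= cutoff:
            positives.append(heads)
    return positives

positives = simulate_null_field()
print(f"{len(positives)} of 10000 null studies came up 'positive'")
print(f"average heads among positives: {sum(positives) / len(positives):.1f}")
print(f"largest heads count seen: {max(positives)}")
```

In a sketch like this there is no real effect and no bias, so the "positive" studies should cluster just past the cutoff. Anything reporting results way out past that cluster needed help from something other than luck.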

As Ioannidis points out, this isn’t even a problem with individual researchers, but with how we all view science. Big splashy new results are given a lot of attention, and there is very little criticism if the findings fail to replicate at the same magnitude. This means that researchers have every incentive to make sure the effect sizes they’re seeing are as big as possible. In fact, Ioannidis has found (in a different paper) that about half the time the first paper published on a topic shows the most extreme value ever found. That is way more than what we would expect to see if it were up to random chance. Ioannidis argues that by figuring out exactly how far these effect sizes deviate from chance, we can actually measure the level of bias.

Again, not a problem for those researchers who replicate, but something to consider for those who don’t. We’ll get into that next week, in our final segment: “So What Can Be Done?”.

I learned a lot of my statistics reading baseball history as a boy. So while I like big new effects as much as anyone else, I completely get that those are the least likely things to be correct. In baseball, if you come up with a new measurement that shows that Babe Ruth was just above-average, while Frank Malzone was one of the all-time greats, then everyone knows your new toy is crap. But if you come up with something that shows that 5 or 6 of the top 100 really weren’t that valuable, while some overlooked players never did get proper credit, then people would perk up their ears and want to hear more. That could be possible.

There were studies by the fictitious Lovenstein Institute in the early 2000s that purported to show that Bill Clinton had an IQ of 182, while George Bush weighed in at 91. Also, they claimed a state-average IQ of 115 for Massachusetts and 85 for Utah. If such things were true – if they were even remotely close to true – they would tell us a great deal about humanity, to our profit. But anyone who actually knows anything about IQ scores would immediately smell a rat. An effect that big is not a cause for excitement, but for concern, as you say.

