5 Replication Possibilities to Keep in Mind

One of the more basic principles of science is the idea of replication or reproducibility.  In its simplest form, this concept is pretty obvious: if you really found the phenomenon you say you found, then when I look where you looked I should be able to find it too. Most people who have ever taken a science class are at least theoretically familiar with this concept, and it makes a lot of sense…if someone tells you the moon is green, and no one else backs them up, then we know the observation of the moon being green is actually a commentary on the one doing the looking rather than a quality of the moon itself.

That being said, this is yet another concept that everyone seems to forget the minute they see a headline saying “NEW STUDY BACKS UP IDEA YOU ALREADY BELIEVED TO BE TRUE”. While every field faces different challenges and has different standards, scientific knowledge is not really a binary “we know it or we don’t” thing. Some studies are stronger than others, but in general, the more studies that find the same thing, the stronger we consider the evidence. With that in mind, I wanted to go over some of the possibilities that arise when someone tries to go back and replicate someone else’s work.

Quick note: in general, replication really applies to currently observable/ongoing phenomena. The work of some fields relies heavily on modeling future phenomena (see: climate science), and obviously future predictions cannot be replicated in the traditional sense. Attempts to explain past events (see: evolution) often can’t be replicated in that sense either, since the events in question have already occurred.

Got it? Great….let’s talk in generalities! So what happens when someone tries to replicate a study?

  1. The replication works. This is generally considered a very good thing. Either someone attempted to redo the study under similar circumstances and confirmed the original findings, or someone undertook an even stronger study design and still found the same thing. This is what you want to happen. Your case is now strong. It’s not always 100% definitive, as different studies can replicate the same error over and over again (see the ego depletion studies that were replicated 83 times before they were called into question), but in general this is a good sign.
  2. You get a partial replication. For most science, the general trajectory of discovery is one of refining ideas.  In epidemiology, for example, you start with population-level correlations, and then try to work your way back to recommendations that are useful on an individual level. This is normal. It also means that when you try to replicate certain findings, it’s totally normal to find that the original paper had some signal and some noise. For example, a few months ago I wrote about a headline-grabbing study that claimed that women’s political, social and religious preferences varied with their monthly cycle. The original study grabbed headlines in 2012, and by the time I went back and looked at it in 2016, further studies had narrowed the conclusions substantially. By then the claims were down to “facial preferences for particular politicians may vary somewhat based on monthly cycles, but fundamental beliefs do not”. I would like to see the studies replicated using the 2016 candidates to see if any of the findings bear out, but even without this you can see that subsequent studies narrowed the findings quite a bit. This isn’t necessarily a bad thing; it’s actually a pretty normal process. Almost every initial splashy finding will undergo some refinement as it continues to be studied.
  3. The study is disputed. Okay, this one can meander off the replication path a bit. When I say “disputed” here, I’m referring to what happens when one study’s findings are called into question by another study that found something different, but the two used totally different methods to get there and now no one knows what’s right. Slate Star Codex has a great overview of this in Beware the Man of One Study, and a great example in Trouble Walking Down the Hallway. In the second post he covers two studies, one that shows a pro-male bias in hiring a lab manager and one that shows a pro-female bias in hiring a new college faculty member. Everyone used the study whose conclusions they liked better to bolster their case while calling the other study “discredited” or “flawed”. The SSC piece breaks it down nicely, but it’s actually really hard to tell what happened here and why these studies would be so different. To note: neither came to the conclusion that no one was biased. Maybe someone switched the data columns on one of them.
  4. The study fails to replicate. As my kiddo’s preschool teacher would say, “that’s sad news”. This is what happens when a study is repeated under the same conditions and the effect goes away. For a good example, check out the Power Pose/Wonder Woman study, where a larger sample size undid the original findings…though not before the TED talk went viral and the book the researcher wrote about it got published. This isn’t necessarily bad either; given our reliance on p-values we expect some of this, but in some fields it has become a bit of a crisis.
  5. Fraud is discovered. Every possibility I mentioned above assumes good faith. However, some of the most bizarre scientific fraud situations get discovered because someone attempts to replicate a previously published study and can’t do it. Not the findings, mind you, but the experimental setup itself. Most methods sections are dense enough that any setup can sound plausible on paper, but it’s in practice that anomalies appear. For example, in the case of a study about how to change people’s views on gay marriage, a researcher realized the study setup was prohibitively expensive when he tried to replicate the original. While straight-up scientific fraud is fairly rare, it does happen. In these cases, attempts at replication are some of our best allies in keeping everyone honest.

It’s important to note here that not all issues fall neatly into one of these categories. For example, in the Women, Ovulation and Voting study I mentioned in #2, two of the research teams had quite the back and forth over whether or not certain findings had been replicated.  In an even more bizarre twist, when the fraudulent study from #5 was actually carried out for real, the findings stood (still waiting on subsequent studies!).  For psychology, the single biggest criticism of the replication project (which claims #4) is that its replications aren’t fair, and thus it’s actually #3 or #2.

My point here is not that any one replication effort falls neatly into one bucket or another, but that there is a range of possibilities. As I said in the beginning, very few findings will end up in a “totally right” or “total trash” bucket, at least at first. However, it’s important to realize, any time you see a big exciting headline, that subsequent research will almost certainly add or subtract something from the original story. Wheels of progress and all that.

What’s a p-value and Why Is Everyone So Mad At It?

A reader named Doug has sent me a couple of awesome articles about p-values (thanks Doug!) and why we should regard them with suspicion. As often happens with these things, I subsequently tried to explain to someone unfamiliar with stats/math why this is such an interesting topic that everyone should be aware of and realized I needed a whole blog post.

While most people outside of research/stats circles won’t ever understand the math part of a p-value calculation, it’s actually a pretty important concept for anyone who wants to know what researchers are up to.  Thus, allow me to go all statsplainer on you to get you up to speed.

Okay, so why are you anthropomorphizing p-values and accusing people of being mad at them?

Well, you probably didn’t click on the link up there that Doug sent me, but it was a post from the journal Nature on the American Statistical Association’s recent warning about the use of p-values in published literature.

P-values and the calculation thereof are a pretty fundamental part of most basic statistics courses, so to see a large group of statisticians push back against their use is a bit of a surprise.

Gotcha. So the people who taught us to use them in the first place are now telling us to watch out for them. Fantastic.

Yeah, they kind of acknowledge that. Their paper on the issue actually starts with this joke:

Q: Why do so many colleges and grad schools teach p=0.05?
A: Because that’s still what the scientific community and journal editors use.

Q: Why do so many people still use p=0.05?
A: Because that’s what they were taught in grad school.

That’s a terrible joke. I’m not even sure I get it.

Yeah, statistical humor tends to appeal to a limited audience. What it’s trying to point out, though, is that we’ve gotten ourselves into a difficult spot by teaching “what everyone does” and then producing a group of people who only know how to do what they were taught.

Okay, that makes sense I guess…but what does the whole p=0.05 thing even mean?

Well, when you’re doing research, at some point or another you’ll want to do something called “hypothesis testing”. This is the basis of most published studies you hear about. You set up two opposing sides, formally called the null and alternative hypotheses, and then you figure out whether you have the evidence to support one or the other.

The null hypothesis, H0, is typically the theory that nothing interesting is happening. Two groups are equal, there’s no change in behavior, etc etc.

The alternative hypothesis, Ha, is typically the theory you REALLY want to be true…at least in terms of your academic career. This would mean that something interesting is occurring: the two groups are different, there’s a change in behavior, etc etc.

Okay, I’m with you so far…keep going.

This next step can work differently depending on the experiment/sample size/lots of other details, but let’s say we’re comparing Star-Bellied Sneetches to those without stars on their bellies and seeing if the two groups eat different amounts of dessert. After we calculate the average dessert eaten by each group, we calculate something called a t statistic using this equation:

t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂)

where x̄₁ and x̄₂ are the two group averages, s₁² and s₂² are the sample variances, and n₁ and n₂ are the group sizes.

Once we have that value, we take the amusingly old-school step of pulling out a t-distribution table (the kind printed in the back of every stats textbook) and finding the critical value we want to compare our value to.

Okay, so how do we figure out where on this table we’re looking? 

Well, the degrees of freedom part is another whole calculation I threw in just to be annoying (for a simple two-group comparison it’s roughly the total number of subjects minus two), but the other part is your α, or alpha. Alpha is what we’re really referencing when we say p=0.05…we set our significance level (or alpha) at .05, so that’s what we’re aiming for. If the value we calculated using that equation up there is larger than the value the table gives, the finding is considered significant at that alpha level.
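If it helps to see the arithmetic spelled out, here’s a minimal sketch in Python of the comparison described above. The Sneetch dessert numbers are entirely made up for illustration, and scipy’s t.ppf plays the role of the printed table:

```python
# A rough sketch of the t-statistic comparison described above.
# The dessert numbers are invented purely for illustration.
import numpy as np
from scipy import stats

star_bellied  = np.array([3.1, 2.8, 3.5, 2.9, 3.3, 3.0, 2.7, 3.4])  # desserts eaten per Sneetch
plain_bellied = np.array([2.5, 2.9, 2.4, 2.6, 2.8, 2.3, 2.7, 2.5])

n1, n2 = len(star_bellied), len(plain_bellied)

# Pooled-variance version of the two-sample t statistic
pooled_var = ((n1 - 1) * star_bellied.var(ddof=1) +
              (n2 - 1) * plain_bellied.var(ddof=1)) / (n1 + n2 - 2)
t_stat = (star_bellied.mean() - plain_bellied.mean()) / np.sqrt(pooled_var * (1/n1 + 1/n2))

# The "table lookup": critical value at alpha = .05 (two-sided), n1 + n2 - 2 degrees of freedom
alpha = 0.05
critical_value = stats.t.ppf(1 - alpha / 2, df=n1 + n2 - 2)

print(f"t = {t_stat:.2f}, critical value = {critical_value:.2f}")
print("Significant at alpha = .05" if abs(t_stat) > critical_value else "Not significant")
```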

I think I lost you.

That’s fine. Most stats software will actually do this for you, and spit out a nice little p-value to boot. Your only decision is whether or not the value is acceptable. The most commonly used “significant” value is p < .05.
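For what it’s worth, software will happily do that whole dance for you. Using the same made-up Sneetch numbers as the sketch above, scipy’s ttest_ind returns the t statistic and the p-value directly:

```python
from scipy import stats

star_bellied  = [3.1, 2.8, 3.5, 2.9, 3.3, 3.0, 2.7, 3.4]  # same made-up dessert numbers as above
plain_bellied = [2.5, 2.9, 2.4, 2.6, 2.8, 2.3, 2.7, 2.5]

t_stat, p_value = stats.ttest_ind(star_bellied, plain_bellied)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```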

Okay, how’d we pick that number?

Arbitrarily.

No really.

No, that’s really it. This is why that joke up there was funny. After all the fancy technical math, we compare it to a value that’s basically used because everyone uses it. Sometimes people will use .1 or .01 if they’re feeling frisky, but .05 is king. There’s even an XKCD comic about it.

This is where we get into the meat of the issue. There’s no particularly good reason why a p-value of .049 can make a career while .051 doesn’t. As you can see from the equation above, where the sample sizes sit in the denominator, the difference between those two values can be more about sample size than about the size of the effect.
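To make that concrete, here’s a small sketch where every number is invented for illustration: the same half-unit difference between group averages, with the same spread, lands on opposite sides of .05 depending only on how many subjects were measured.

```python
from scipy import stats

# Identical observed effect (a 0.5-unit difference in means, standard deviation of 1)
# run through a two-sample t-test at two different sample sizes. All numbers invented.
for n in (30, 80):
    result = stats.ttest_ind_from_stats(mean1=3.0, std1=1.0, nobs1=n,
                                        mean2=2.5, std2=1.0, nobs2=n)
    print(f"n = {n} per group: p = {result.pvalue:.3f}")
```

With 30 subjects per group this comes out just above .05; with 80 per group, the identical effect drops well below it.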

In theory, the p-value should mean “the chance we’d see an effect at least as large as the one we’re seeing if the null hypothesis were true”, but the more people chase the .05 threshold, the less accurate that interpretation becomes.

Why’s that?

Well, a couple of reasons. First, an alpha of .05 means that, even when nothing is going on, about 1 out of every 20 tests will come up significant just by chance. For studies that gather a lot of data points and run a lot of comparisons, this means they will almost always be able to get a significant finding somewhere.

This tactic was used by the journalist who published an intentionally fake “chocolate helps you lose weight” study last year. He did a real study, but collected 18 different measures on people who were eating chocolate, knowing that chances were good he would get a significant result on at least one of them. Weight loss ended up being the significant result, so he led with that and downplayed the other ones. Other researchers just throw the non-significant effects in the drawer.
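The gamble behind that approach is easy to check with a little arithmetic. If we assume the 18 measures are unrelated to each other (a simplification, since real measures are usually correlated), the chance that at least one comes up “significant” by luck alone is already around 60%:

```python
# Chance of at least one false positive across k independent tests, each at alpha = .05
alpha = 0.05
for k in (1, 5, 18):
    p_at_least_one = 1 - (1 - alpha) ** k
    print(f"{k:2d} measures: {p_at_least_one:.0%} chance of at least one fluke 'significant' result")
```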

There’s also the issue of definitions. It can be really hard to grasp what a p-value is, and even some stats textbooks end up giving wrong or misleading definitions.  This paper gives a really good overview of some of the myths, but suffice it to say the p-value is not “the chance the null hypothesis is true”.
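One way to see why that particular definition fails is a quick simulation. Everything in the sketch below is an assumption invented for illustration (half the hypotheses tested are truly null, real effects are modest, samples are small), but under those assumptions the share of “significant” results where the null was actually true lands closer to 10% than 5%, and it shifts as you change the assumptions. It depends on the research landscape, not on anything you can read off the p-value itself.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_experiments = 20_000
n_per_group = 30      # assumed sample size per group
prior_null = 0.5      # assume half of all tested hypotheses are truly null
true_effect = 0.5     # assumed shift in means when a real effect exists

null_is_true = rng.random(n_experiments) < prior_null
p_values = np.empty(n_experiments)

for i in range(n_experiments):
    group_a = rng.normal(0.0, 1.0, n_per_group)
    shift = 0.0 if null_is_true[i] else true_effect
    group_b = rng.normal(shift, 1.0, n_per_group)
    p_values[i] = stats.ttest_ind(group_a, group_b).pvalue

significant = p_values < 0.05
print(f"Share of 'significant' results where the null was actually true: "
      f"{null_is_true[significant].mean():.0%}")
```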

Okay, so I think I get it. But why did the American Statistical Association speak out now? What got them upset?

Yeah, let’s get back to them. Well, as they said in their paper, the problem is not that no one has warned about this previously, it’s that they keep seeing the issue anyway. Their exact words:

Let’s be clear. Nothing in the ASA statement is new. Statisticians and others have been sounding the alarm about these matters for decades, to little avail. We hoped that a statement from the world’s largest professional association of statisticians would open a fresh discussion and draw renewed and vigorous attention to changing the practice of science with regards to the use of statistical inference.

As replication crises have rocked various fields, statisticians have decided to speak out. Fundamentally, p-values are really supposed to be like the SAT: a standardized way of comparing findings across fields. In practice they can have a lot of flaws, and that’s what the ASA guidance wanted to point out. Their paper essentially spelled out their view of the problem and laid out 6 principles for p-value use going forward.

And what were those?

  1. P-values can indicate how incompatible the data are with a specified statistical model.
  2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
  3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
  4. Proper inference requires full reporting and transparency.
  5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
  6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

They provide more explanation in the paper, but basically what they’re saying is what I was trying to get across above: p-values are useful if you’re being honest about what you’re using them for. They don’t tell you whether your experimental setup was good or whether your explanation for your data is reasonable, and they don’t guard against selection bias very well at all. The number “.05” is convenient but arbitrary, and the whole thing is a probability game, not a clear “true/false” line.


Okay, I think I get it….but should statisticians really be picking on other fields like this?

That’s a good point, and I’d like to address it. Psychologists don’t typically walk into stats conferences and criticize them, so why do statisticians get to criticize everyone else? Andrew Gelman probably explains this best. He was one of the authors on the ASA paper, and he’s a baseball fan. In a post a few months ago, he said this:

Believing a theory is correct because someone reported p less than .05 in a Psychological Science paper is like believing that a player belongs in the Hall of Fame because he hit .300 once in Fenway Park.

This is not a perfect analogy. Hitting .300 anywhere is a great accomplishment, whereas “p less than .05” can easily represent nothing more than an impressive talent for self-delusion. But I’m just trying to get at the point that ultimately it is statistical summaries and statistical models that are being used to make strong (and statistically ridiculous) claims about reality, hence statistical criticisms, and external data such as come from replications, are relevant.

As Bill James is quoted as saying, “the alternative to good statistics isn’t no statistics…it’s bad statistics.”

Got a question you’d like an unnecessarily long answer to? Ask it here!