Several months ago now, I was having dinner with a friend who told me he was working on some science fiction based on some interesting precognition studies he had heard about. As he started explaining them to me and how they was real scientific proof of ESP, he realized who he was talking to and quickly got sheepish and told me to “be gentle” when I ended up doing a post about it. Not wanting to kill his creative momentum, I figured I’d delay this post for a bit. I stumbled on the draft this morning and realized it’s probably been long enough now, so let’s talk about the paranormal!
First, I should set the stage and say that my friend was not actually wrong to claim that precognition has some real studies behind it. Some decent research time and effort has been put in to experiments where researchers attempt to show that people react to things that haven’t happened yet. In fact the history of this work is a really interesting study in scientific controversy and it tracks quite nicely with much of the replication crisis I’ve talked about. This makes it a really interesting topic for anyone wanting to know a bit more about the pluses/minuses of current research methods.
As we dig in to this, it helps to know a bit of background: Almost all of the discussions about this are referencing a paper by Daryl Bem from 2011, where 9 different studies were run on the phenomena. Bem is a respected psychological researcher, so the paper made quite a splash at the time. So what did these studies say and what should we get out of them, and why did they have such a huge impact on psychological research? Let’s find out!
- The effect sizes were pretty small, but they were statistically significant Okay, so first things first….let’s establish what kind of effect size we’re talking about here. For all 9 experiments the Cohen’s d was about .22. In general, a d of .2 is considered a “small” effect size, .5 would be moderate, .8 would be large. In the real world, this translated in to participants picking the “right” option 53% of the time instead of the 50% you’d expect by chance.
- The research was set up to be replicated One of the more interesting parts of Bem’s research was that he made his protocols publicly available for people trying to replicate his work, and he did this before he actually published the initial 2011 paper. Bem particularly pointed people to experiments #8 and #9, which showed the largest effect sizes and he thought would be the easiest to replicate. In these studies, he had people try to recall words off of a word list, writing down those they could remember. He then gave them a subset of those words to study more in depth, again writing down what they could remember. When they looked back, they found that subjects had recalled more of their subset words than control words on the first test. Since the subjects hadn’t seen their subset words at the time they took the first test, this was taken as evidence of precognition.
- Replication efforts have been….interesting. Of course with interesting findings like these, plenty of people rushed to try to replicate Bem’s work. Many of these attempts failed, but Bem published a meta-analysis stating that on the whole they worked. Interestingly however, the meta-analysis actually analyzed replications that pre-dated the publication of Bem’s work. Since Bem had released his software early, he was able to find papers all the way back to 2001. It has been noted that if you remove all the citations that pre-dated the publication of his paper, you don’t see an effect. So basically the pre-cognition paper was pre-replicated. Very meta.
- They are an excellent illustration of the garden of forking paths. Most of the criticism of the paper comes down to something Andrew Gelman calls “The Garden of Forking Paths“. This is a phenomena in which researchers make a series of tiny decisions as their experiments and analyses progress, which may add up to serious deviation from the original results. In the Bem study for example, it has been noted that some of his experiments actually used two different protocols, then combined the results. It was also noted that the effect sizes got smaller as more subjects were added, suggesting that the number of subjects tested may have fluctuated based on results. There are also decisions so small you mostly wouldn’t notice. For example, in the word recall study mentioned above, word recall was measured by comparing word lists for exact matches. This meant that if you spelled “retrieve” as “retreive”, it didn’t automatically give you credit. They had someone go through and correct for this manually, but that person actually knew which words were part of the second experiment and which were the control words. Did the reviewer inadvertently focus on or give more credit to words that were part of the “key word” list? Who knows, but small decisions like this can add up. There were also different statsticall analyses performed on different experiments, and Bem himself admits that if he started a study and got no results, he’d tweak it a little and try again. When you’re talking about an effect size of .22, even tiny changes can add up.
- The ramifications for all of psychological science were big It’s tempting to write this whole study off, or to accept it wholesale, but the truth is a little more complicated. In a thorough write-up over at Slate, Daniel Engber points out that this research used typical methods and invited replication attempts and still got a result many people don’t believe is possible. If you don’t believe the results are possible, then you really should question how often these methods are used in other research. As one of the reviewers put it “Clearly by the normal rules that we [used] in evaluating research, we would accept this paper. The level of proof here was ordinary. I mean that positively as well as negatively. I mean it was exactly the kind of conventional psychology analysis that [one often sees], with the same failings and concerns that most research has”. Even within the initial paper, the word “replication” was used 23 times. Gelman rebuts that all the problems with the paper are known statistical issues and that good science can still be done, but it’s clear this paper pushed many people to take good research methods a bit more seriously.
So there you have it. Interestingly, Bem actually works out of Cornell and has been cited in the whole Brian Wansink kerfluffle, a comparison he rejects. I think that’s fair. Bem has been more transparent about what he’s doing, and did invite replication attempts. In fact his calls for people to look at his work were so aggressive, there’s a running theory that he published the whole thing to make a point about the shoddiness of most research methods. He’s denied this, but that certainly was the effect. An interesting study on multiple levels.