3 Control Groups I’m Pondering

One of the more important (but often overlooked) parts of research is the choice of control group, i.e. what we are comparing the group of interest to. While this seems like a small thing, it can actually have some big implications for interpreting research.  I’ve seen a few interesting examples recently, so I figured I’d do a quick list here:

First up, a new-to-me article about personality assessments in a traditional hunter-gatherer tribe. I’ve mentioned the problem of psychological research focusing too much on WEIRD (Westernized, Educated, Industrialized, Rich and Democratic) countries before, and this study sought to correct that error. Basically, they used the “Big 5” personality testing model and then tried to assess members of a traditional South American tribe according to this “universal” personality measurement. It failed. While it seemed like extraversion and conscientiousness could actually translate somewhat, agreeableness and openness were mixed, and neuroticism didn’t translate all that well. They ended up with a “Big Two”, which were basically an agreeableness/extraversion mix (pro-sociality) and something like conscientiousness (industriousness). They talk a lot about the challenges (translation issues, non-literate populations, etc), but the point is that what we call “universal” relies on a very narrow set of circumstances. Western college kids don’t make a good baseline.

Second, a new dietary study shows that nutritional education can be an effective treatment for depression.  It’s a good study, and I was interested to see the control group was given increased social support/time with a trained listener/companion type person. At 12 weeks, almost a third of the diet group were no longer depressed, whereas only 8% of the control group were feeling better. Interesting to note though: this was advertised as a dietary study, so those who didn’t get the diet intervention knew they were the control group. There was a higher dropout rate in the control group (25% vs 6%), and interestingly it was the most educated people who dropped out. Gotta admit, part of me wonders if it was the introverts driving this result. Just wondering how many people really enjoyed the whole “hang out with a stranger who’s not a therapist” thing. I would be interested to see how this works when paired with some sort of “hour of general relaxation” type thing.

Finally, after putting up my pre-cognition post on Sunday, I realized there was a Slate Star Codex post a few years back about the Bem paper that I wanted to reread. It was called “The Control Group is out of Control” and took the stance that parapsychology was actually a great control group for all of science. Given that you have a whole group of people attempting to follow the scientific method to prove something that most people believe doesn’t exist, they end up serving as a sort of “placebo science”, or an indicator of what science looks like when it’s chasing after nothing.

He has some really interesting anecdotes here about the amount of evidence we have that researchers are influencing their own results in ways that seem nearly impossible to control for. For example, he talks about a case in which rival researchers who supported different hypotheses and had gotten different results teamed up to use the same protocol and watched each other execute the experiments to see if they could figure out where the other one was going wrong. They still both ended up proving their preferred hypothesis, and in the discussion section brought up the (mutual) possibility that one or the other of them had hacked the computer records. That’s an odd thing to ponder, but it’s even odder when you wonder what this means for every other study ever done.


5 Things About Precognition Studies

Several months ago now, I was having dinner with a friend who told me he was working on some science fiction based on some interesting precognition studies he had heard about. As he started explaining them to me and how they were real scientific proof of ESP, he realized who he was talking to and quickly got sheepish and told me to “be gentle” when I ended up doing a post about it. Not wanting to kill his creative momentum, I figured I’d delay this post for a bit. I stumbled on the draft this morning and realized it’s probably been long enough now, so let’s talk about the paranormal!

First, I should set the stage and say that my friend was not actually wrong to claim that precognition has some real studies behind it. Some decent research time and effort has been put into experiments where researchers attempt to show that people react to things that haven’t happened yet. In fact the history of this work is a really interesting study in scientific controversy and it tracks quite nicely with much of the replication crisis I’ve talked about. This makes it a really interesting topic for anyone wanting to know a bit more about the pluses/minuses of current research methods.

As we dig into this, it helps to know a bit of background: Almost all of the discussions about this are referencing a paper by Daryl Bem from 2011, where 9 different studies were run on the phenomenon. Bem is a respected psychological researcher, so the paper made quite a splash at the time. So what did these studies say and what should we get out of them, and why did they have such a huge impact on psychological research? Let’s find out!

  1. The effect sizes were pretty small, but they were statistically significant. Okay, so first things first….let’s establish what kind of effect size we’re talking about here. Across all 9 experiments the average Cohen’s d was about .22. In general, a d of .2 is considered a “small” effect size, .5 would be moderate, .8 would be large. In the real world, this translated into participants picking the “right” option 53% of the time instead of the 50% you’d expect by chance.
  2. The research was set up to be replicated. One of the more interesting parts of Bem’s research was that he made his protocols publicly available for people trying to replicate his work, and he did this before he actually published the initial 2011 paper. Bem particularly pointed people to experiments #8 and #9, which showed the largest effect sizes and he thought would be the easiest to replicate. In these studies, he had people try to recall words off of a word list, writing down those they could remember. He then gave them a subset of those words to study more in depth, again writing down what they could remember. When they looked back, they found that subjects had recalled more of their subset words than control words on the first test. Since the subjects hadn’t seen their subset words at the time they took the first test, this was taken as evidence of precognition.
  3. Replication efforts have been….interesting. Of course with interesting findings like these, plenty of people rushed to try to replicate Bem’s work. Many of these attempts failed, but Bem published a meta-analysis stating that on the whole they worked. Interestingly however, the meta-analysis actually analyzed replications that pre-dated the publication of Bem’s work. Since Bem had released his software early, he was able to find papers all the way back to 2001. It has been noted that if you remove all the citations that pre-dated the publication of his paper, you don’t see an effect. So basically the pre-cognition paper was pre-replicated. Very meta.
  4. They are an excellent illustration of the garden of forking paths. Most of the criticism of the paper comes down to something Andrew Gelman calls “The Garden of Forking Paths”. This is a phenomenon in which researchers make a series of tiny decisions as their experiments and analyses progress, which may add up to serious deviation from the original results. In the Bem study for example, it has been noted that some of his experiments actually used two different protocols, then combined the results. It was also noted that the effect sizes got smaller as more subjects were added, suggesting that the number of subjects tested may have fluctuated based on results. There are also decisions so small you mostly wouldn’t notice. For example, in the word recall study mentioned above, word recall was measured by comparing word lists for exact matches. This meant that if you spelled “retrieve” as “retreive”, it didn’t automatically give you credit. They had someone go through and correct for this manually, but that person actually knew which words were part of the second experiment and which were the control words. Did the reviewer inadvertently focus on or give more credit to words that were part of the “key word” list? Who knows, but small decisions like this can add up. There were also different statistical analyses performed on different experiments, and Bem himself admits that if he started a study and got no results, he’d tweak it a little and try again. When you’re talking about an effect size of .22, even tiny changes can add up.
  5. The ramifications for all of psychological science were big. It’s tempting to write this whole study off, or to accept it wholesale, but the truth is a little more complicated. In a thorough write-up over at Slate, Daniel Engber points out that this research used typical methods and invited replication attempts and still got a result many people don’t believe is possible. If you don’t believe the results are possible, then you really should question how often these methods are used in other research. As one of the reviewers put it, “Clearly by the normal rules that we [used] in evaluating research, we would accept this paper. The level of proof here was ordinary. I mean that positively as well as negatively. I mean it was exactly the kind of conventional psychology analysis that [one often sees], with the same failings and concerns that most research has”. Even within the initial paper, the word “replication” was used 23 times. Gelman rebuts that all the problems with the paper are known statistical issues and that good science can still be done, but it’s clear this paper pushed many people to take good research methods a bit more seriously.
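
To make the "number of subjects may have fluctuated based on results" worry from item 4 a bit more concrete, here is a minimal Python sketch. It is not Bem's procedure or analysis. It assumes a made-up setup in which subjects guess at a pure 50/50 chance rate (so there is no real effect), and a hypothetical researcher checks for significance after every batch of 20 subjects, for up to 10 looks. The function names, batch size, number of looks and seed are all illustrative assumptions.

```python
import math
import random


def two_sided_p(hits, n):
    """Normal-approximation p-value for `hits` out of `n` against a 50% chance rate."""
    z = (hits - 0.5 * n) / math.sqrt(0.25 * n)
    return math.erfc(abs(z) / math.sqrt(2))


def simulate(n_studies=2000, batch=20, looks=10, alpha=0.05, seed=1):
    """Return (fixed_rate, peeking_rate) of false positives when there is no effect.

    fixed_rate:   share of studies significant at the planned final sample size.
    peeking_rate: share of studies significant at ANY interim look, i.e. the
                  studies a researcher who stops at the first "hit" would report.
    """
    rng = random.Random(seed)
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(n_studies):
        hits = 0
        n = 0
        ever_significant = False
        for _ in range(looks):
            # Add one more batch of subjects, each guessing right with probability 0.5.
            hits += sum(rng.random() < 0.5 for _ in range(batch))
            n += batch
            if two_sided_p(hits, n) < alpha:
                ever_significant = True
        if two_sided_p(hits, n) < alpha:
            fixed_hits += 1
        if ever_significant:
            peeking_hits += 1
    return fixed_hits / n_studies, peeking_hits / n_studies


if __name__ == "__main__":
    fixed_rate, peeking_rate = simulate()
    print(f"false positives, fixed sample size: {fixed_rate:.3f}")
    print(f"false positives, stop when significant: {peeking_rate:.3f}")
```

Since the final look is also one of the interim looks, the peeking rate can never come out lower than the fixed-sample rate, and in simulations like this it usually lands well above the nominal 5%. That is the sense in which a small decision like "test a few more subjects and check again" can add up.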

So there you have it. Interestingly, Bem actually works out of Cornell and has been cited in the whole Brian Wansink kerfuffle, a comparison he rejects. I think that’s fair. Bem has been more transparent about what he’s doing, and did invite replication attempts. In fact his calls for people to look at his work were so aggressive, there’s a running theory that he published the whole thing to make a point about the shoddiness of most research methods. He’s denied this, but that certainly was the effect. An interesting study on multiple levels.

6 Year Blogiversary: Things I’ve Learned

Six years ago today I began blogging (well, at the old site) with a rather ambitious mission statement. While I don’t have quite as much hubris now as I did then, I was happy to see that I actually stand by most of what I said when I kicked this whole thing off. Six years, 647 posts,  a few hiatuses and one applied stats degree later, I think 2012 BS King would be pretty happy with how things turned out.

I actually went looking for my blogiversary date because of a recent discussion I had about the 10,000 hour rule myth. The person I was talking to had mentioned that after all these years of blogging my writing must have improved dramatically, and I mentioned that the difference was probably not as big as you might think. While I do occasionally get feedback on grammar or confusing sentences, no one sits down with bloggers and tells them “hey you really should have combined those two sentences” or “paragraph three was totally unnecessary”. In the context of the 10,000 hour rule, this means I’m lacking the “focused practice” that would truly make me a better writer. To truly improve you need both quality AND quantity in your practice.

The discussion got me wondering a bit…what skills does blogging help you hone? If the ROI for writing is minimal, what does it help me with?  I mean, there’s a lot of stuff I love about it: the exchange of ideas, meeting interesting people, getting to talk about the geeky topics I want to talk about, thinking more about how I explain statistics and having people send me interesting stuff. But does any of that result in the kind of focused practice and feedback that improves a skill?

As I mulled it over, I realized there are two main areas I’ve improved in, one smaller, one bigger. The first is simply finding more colorful examples for statistical concepts. Talking to high school students helps with this, as those kids are unapologetic about falling asleep on you if you bore them. Blogging and thinking about this stuff all the time means I end up permanently on the lookout for new examples, and since I tend to blog about the best ones, I can always find them again.

The second thing I’ve improved on is a little more subtle. Right after I put this blog up, I established some ground rules for myself. While I’ve failed miserably at some of these (apostrophes are still my nemesis), I have really tried to stick to discussing data over politics. This is tricky because most of the data people are interested in is political in nature, so I can’t avoid blogging about it. Attempting to figure out how to explain a data issue rooted in a political controversy with a reader base that contains highly opinionated conservatives, liberals and a smattering of libertarians has taught me a LOT about what words are charged and which aren’t. This has actually transferred over to my day job, where I occasionally get looped into situations just so I can “do that thing where you recap what everyone’s saying without getting anyone mad”.

I even notice this when I’m reading other things now, how often people attempt to subtly bias their words in one direction or another while claiming to be “neutral”. While I would never say I am perfect at this, I believe the feedback I’ve gotten over the years has definitely improved my ability to present an issue neutrally, which I hope leads to a better discussion about where data goes wrong. Nothing has made me happier over the years than hearing people who I know feel strongly about an issue agree to stop using certain numbers and to use better ones instead.

So six years in, I suppose I just want to say thank you to everyone who’s read here over the years, given me feedback, kept me honest, and put up with my terrible use of punctuation and run on sentences. You’ve all made me laugh, and made me think, and I appreciate you taking the time to stop on by. Here’s to another year!

Praiseworthy Wrongness: Genes in Space

Given my ongoing dedication to critiquing bad headlines/stories, I’ve decided to start making a regular-ish feature of people who get things wrong then work to make them right. Since none of us can ever be 100% perfect, I think a big part of cutting down on errors and fake news is going to be lauding those who are willing to walk back on what they say if they discover they made an error. I started this last month with an example of someone who realized she had asserted she was seeing gender bias in her emails when she wasn’t. Even though no one had access to the data but her, she came clean that her kneejerk reaction had been wrong, and posted a full analysis of what happened. I think that’s awesome.

Two days ago, I saw a similar issue arise with Live Science, who had published a story stating that after one year in space astronaut Scott Kelly had experienced significant changes (around 7%) to his genetic code. The finding was notable since Kelly is one of a pair of identical twins, so it seemed there was a solid control group.

The problem? The story got two really key words wrong, and it changed the meaning of the findings. The original article reported that 7% of Kelly’s genetic code had changed, but the 7% number actually referred to gene expression. The 7% was also a subset of changes….basically out of all the genes that changed their expression in response to space flight, 7% of those changes persisted after he came back to earth. This is still an extremely interesting finding, but nowhere near as dramatic as finding out that twins were no longer twins after space flight, or that Kelly wasn’t really human any more.

While the error was regrettable, I really appreciated what Live Science did next. Not only did they update the original story (with notice that they had done so), they also published a follow up under the headline “We Were Totally Wrong About that Scott Kelly Space Genes Story” explaining further how they erred. They also Tweeted out the retraction with this request:

This was a nice way of addressing a chronic problem in internet writing: controversial headlines tend to travel faster than their retractions. By specifically noting this problem, Live Science reminds us all that they can only do so much in the correction process. Fundamentally, people have to share the correction at the same rate they shared the original story for it to make a difference. While ultimately the original error was their fault, it will take more than just Live Science to spread the correct information.

In the new age of social media, I think it’s good for us all to take a look at how we can fix things. Praising and sharing retractions is a tiny step, but I think it’s an important one. Good on Live Science for doing what they could, then encouraging social media users to take the next step.

YouTube Radicals and Recommendation Bias

The Assistant Village Idiot passed along an interesting article about concerns being raised over YouTube’s tendency to “radicalize” suggestions in order to keep people on the site. I’ve talked before about the hidden dangers and biases algorithms can have over our lives, and this was an interesting example.

Essentially, it appears that YouTube has a tendency to suggest more inflammatory or radical content in response to both regular searches and in response to watching more “mainstream” viewing. So for example, if you search for the phrase “the Pope” as I just did in incognito mode on Chrome, it gives me these as the top 2 hits:

Neither of those videos is even the most watched Pope video….scrolling down a bit shows some funny moments with the Pope (little boy steals the show) with 2.1 million hits and a Jimmy Kimmel bit on him with 4 million views.

According to the article, watching more mainstream news stories will quickly get you to more biased or inflammatory content. It appears that in its quest to make an algorithm that will keep users on the site, YouTube has created the digital equivalent of junk food…..content that is tempting but without a lot of substance.

It makes a certain amount of sense if you think about it. Users may not have time to really play around much on YouTube, unless the next thing they see is slightly more tempting than what they were originally looking for. Very few people would watch three videos in a row of Obama State of the Union Address coverage, but you might watch Obama’s State of the Union address followed by Obama’s last White House Correspondents Dinner talk followed by “Obama’s best comebacks” (the videos I got suggested to me when I looked for “Obama state of the Union”).

Even with benign things I’ve noticed this tendency. For example, my favorite go to YouTube channel after a long day is the Epic Rap Battles of History channel. After I’d watched two or three videos, I started noticing it would point me towards videos from the creators’ lesser-watched personal channels. I actually had thought this was some sort of setting the creators set, but now I’m wondering if it’s the same algorithm. Maybe people doing random clicking gravitate towards lesser watched content as they keep watching. Who knows.

What makes this trend a little concerning is that so many young people use YouTube to learn about different things. My science teacher brother had mentioned seeing an uptick in kids spouting conspiracy theories in his classes, and I’m wondering if this is part of the reason. Back in my day, kids had to actually go looking for their offbeat conspiracy theories; now YouTube brings them right to them. In fact a science teacher who asks their kids to look for information on a benign topic may find that they’ve now inadvertently put them in the path of conspiracy theories that came up as video recommendations after the real science. It seems like this algorithm may have inadvertently stumbled on how to prime people for conversion to radical thought, just through collecting data.

According to the Wall Street Journal, YouTube is looking to tackle this problem, but it’s not clear how they’re going to do that without running into the same problems Facebook did when it started to crack down on fake news. It will be interesting to watch this develop, and it’s a good bias to keep in mind.

In the meantime, here’s my current favorite Epic Rap Battle:


What I’m Reading: March 2018

I’ve talked about salami slicing before, but Neuroskeptic has found perhaps the most egregious example of “split your data up and publish each piece individually” ever. An Iranian mental health study surveyed the whole population, then split up their results into 31 papers….one for each Iranian province. They also wrote two summary papers, one of which got cited in each of the other 32. Now there’s a way to boost your publication count.

Also from Neuroskeptic: the fickleness of the media, and why we can’t have nice replications. Back in 2008, a study found that antidepressants worked mildly better than a placebo with a Standardized Mean Difference of .32 (.2 is small, .5 is moderate). In 2018, another meta analysis found that they worked with a Standardized Mean Difference of .3. Replication! Consistency! We have a real finding here! Right? Well, here are the Guardian headlines:

Never trust the headlines.
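
For anyone wondering what a standardized mean difference of around .3 is actually measuring, here is a minimal Python sketch. It is not how either antidepressant analysis was run (those pool many trials); it only shows the basic unit, which is the gap between two group means divided by their pooled standard deviation. The function name and the toy scores are assumptions made up for illustration.

```python
import math
import statistics


def standardized_mean_difference(treated, control):
    """Difference in group means, in units of the pooled standard deviation."""
    n1, n2 = len(treated), len(control)
    v1 = statistics.variance(treated)   # sample variance (n - 1 denominator)
    v2 = statistics.variance(control)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (statistics.mean(treated) - statistics.mean(control)) / pooled_sd


if __name__ == "__main__":
    # Hypothetical improvement scores, not data from any real trial.
    drug = [2, 4, 6]
    placebo = [1, 3, 5]
    print(standardized_mean_difference(drug, placebo))  # 0.5, a "moderate" effect
```

Dividing by the spread is what lets a .32 from one analysis be compared with a .3 from another, even if the underlying trials used different depression scales.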

In another interesting side by side, Scott Alexander Tweeted out the links to two different blog post write ups about the new “growth mindset” study. One calls it the “nail in the coffin” for the theory, the other calls it a successful replication. Interesting to see the two different takes. The pre-print looks like it was taken down, but apparently they found that watching two 25-minute videos about the growth mindset resulted in an average GPA boost of .03. However, it looks like that effect was higher for the most at risk students. The question appears to be if that effect is particular to the “growth mindset” instruction, or whether it’s really just a new way of emphasizing the value of hard work.

Also, close to my heart, are critical thinking and media literacy efforts backfiring? This one covers a lot of things I covered in my Two Ways to Be Wrong post. Sometimes teaching people to be critical results in people who don’t believe anything. No clear solution to this one.

I also just finished The Vanishing American Adult by Ben Sasse. Lots of interesting stuff in this book, particularly if you’re a parent in the child rearing years. One of the more interesting chapters covered building a reading list/bookshelf of the all time great books throughout history and encouraging your kids to tackle them. His list was good, but it always irks me a little that lists like these are so heavy on philosophy and literature and rarely include foundational mathematical or scientific books. I may have to work on this.



Delusions of Mediocrity

I mentioned recently that I planned on adding monthly(ish) to my GPD Lexicon page, and my IQ post from Sunday reminded me of a term I wanted to add. While many of us are keenly aware of the problem of “delusions of grandeur” (a false sense of one’s own importance), I think fewer people realize that thinking oneself too normal might also be a problem.

In some circles this happens a lot when topics like IQ or salary come up, and a bunch of college educated people sit around and talk about how it’s not that much of an advantage to have a higher IQ or an above average salary. While some people saying this are making good points, some are suffering a delusion of mediocrity. They are imagining in these discussions that their salary or IQ is “average” and that everyone is working in the same range as them and their social circle. That is, they are debating IQ while only thinking about those with IQs above 110 or so, or salaries above the US median of $59,000. In other words:

Delusions of Mediocrity: A false sense of one’s own averageness. Typically seen in those with above average abilities or resources who believe that most people live like they do.

Now I think most of us have seen this on a personal level, but I think it’s also important to remember it on a research level. When research finds things like “IQ is correlated with better life outcomes”, they’re not just comparing IQs of 120 to IQs of 130 and finding a difference….they’re comparing IQs of 80 to IQs of 120 and finding a difference.
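
A quick simulation shows why the range you look at matters so much. This is a minimal Python sketch under made-up assumptions: IQ is normally distributed with mean 100 and SD 15, and some "life outcome" score correlates with it at 0.4 across the whole population. That 0.4, the cutoff of 120, the function names and the seed are all illustrative choices, not figures from any study.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_range_restriction(n=20000, true_r=0.4, cutoff=120, seed=1):
    """Return (r in the full sample, r among only those with IQ >= cutoff)."""
    rng = random.Random(seed)
    iqs, outcomes = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)
        iqs.append(100 + 15 * z)
        # Outcome shares `true_r` of its standardized variation with IQ; the rest is noise.
        outcomes.append(true_r * z + math.sqrt(1 - true_r ** 2) * rng.gauss(0, 1))
    full_r = pearson_r(iqs, outcomes)
    high = [(iq, out) for iq, out in zip(iqs, outcomes) if iq >= cutoff]
    restricted_r = pearson_r([iq for iq, _ in high], [out for _, out in high])
    return full_r, restricted_r


if __name__ == "__main__":
    full_r, restricted_r = simulate_range_restriction()
    print(f"correlation, everyone: {full_r:.2f}")
    print(f"correlation, IQ 120 and up only: {restricted_r:.2f}")
```

The relationship is identical for every simulated person, but it looks much weaker if you only ever compare people who are all already near the top of the range.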

On an even broader note, psychological research has been known to have a WEIRD problem. Most of the studies we see describing “human” behavior are actually done on those in Western, educated, industrialized, rich and democratic countries (aka WEIRD countries) that do NOT represent the majority of the world population. Even things like optical illusions have been found to vary by culture, so how can we draw conclusions about humanity while drawing from a group that represents only 12% of the world’s population? The fact that we don’t often question this is a mass delusion of mediocrity.

I think this all gets tempting because our own social circles tend to move in a narrow range. By virtue of living in a country, most of us end up seeing other people from that country the vast majority of the time. We also self segregate by neighborhood and occupation. Just another thing to keep in mind when you’re reading about differences.