Selection Bias: The Bad, The Ugly and the Surprisingly Useful

Selection bias and sampling theory are two of the most unappreciated issues in the popular consumption of statistics. While they present challenges for nearly every study ever done, they are often seen as boring….until something goes wrong. I was thinking about this recently because I was in a meeting on Friday and heard an absolutely stellar example of someone using selection bias quite cleverly to combat a tricky problem. I get to that story towards the bottom of the post, but first I wanted to go over some basics.

First, a quick reminder of why we sample: we are almost always unable to ask the entire population of people how they feel about something. We therefore have to find a way of getting a subset to tell us what we want to know, but for that to be valid that subset has to look like the main population we’re interested in. Selection bias happens when that process goes wrong. How can this go wrong? Glad you asked! Here’s 5 ways:

  1. You asked a non-representative group Finding a truly “random sample” of people is hard. Like really hard. It takes time and money, and almost every researcher is short on both. The most common example of this is probably our personal lives. We talk to everyone around us about a particular issue, and discover that everyone we know feels the same way we do. Depending on the scope of the issue, this can give us a very flawed view of what the “general” opinion is. It sounds silly and obvious, but if you remember that many psychological studies rely exclusively on W.E.I.R.D. college students for their results, it becomes a little more alarming. Even if you figure out how to get in touch with a pretty representative sample, it’s worth noting that what works today may not work tomorrow. For example, political polling took a huge hit after the introduction of cell phones. As young people moved away from landlines, polls that relied on them got less and less accurate. The selection method stayed the same, it was the people that changed.
  2. A non-representative group answered Okay, so you figured out how to get in touch with a random sample. Yay! This means good results, right? No, sadly. The next issue we encounter is when your respondents mess with your results by opting in or opting out of answering in ways that are not random. This is non-response bias, and basically it means “the group that answered is different from the group that didn’t answer”. This can happen in public opinion polls (people with strong feelings tend to answer more often than those who feel more neutrally) or even by people dropping out of research studies(our diet worked great for the 5 out of 20 people who actually stuck with it!). For health and nutrition surveys, people also may answer based on how good they feel about their response, or how interested they are in the topic.  This study from the Netherlands,for example, found that people who drink excessively or abstain entirely are much less likely to answer surveys about alcohol use than those who drink moderately.   There’s some really interesting ways to correct for this, but it’s a chronic problem for people who try to figure out public opinion.
  3. You unintentionally double counted This example comes from the book Teaching Statistics by Gelman and Nolan. Imagine that you wanted to find out the average family size in your school district. You randomly select a whole bunch of kids and ask them how many siblings they have, then average the results. Sounds good, right? Well, maybe not. That strategy will almost certainly end up overestimating the average number of siblings, because large families are by definition going to have a better chance of being picked in any sample.  Now this can seem obvious when you’re talking explicitly about family size, but what if it’s just one factor out of many? If you heard “a recent study showed kids with more siblings get better grades than those without” you’d have to go pretty far in to the methodology section before you might realize that some families may have been double (or triple, or quadruple) counted.
  4. The group you are looking at self selected before you got there Okay, so now that you understand sampling bias, try mixing it with correlation and causation confusion. Even if you ask a random group and get responses from everyone, you can still end up with discrepancies between groups because of sorting that happened before you got there. For example, a few years ago there was a Pew Research survey that showed that 4 out of 10 households had female breadwinners, but that those female breadwinners earned less than male breadwinner households. However, it turned out that there were really 2 types of female breadwinner households: single moms and wives who outearned their husbands. Wives who outearned their husbands made about as much as male breadwinners, while single mothers earned substantially less. None of these groups are random, so any differences between them may have already existed.
  5. You can actually use all of the above to your advantage. As promised, here’s the story that spawned this whole post: Bone marrow transplant programs are fairly reliant on altruistic donors. Registries that recruit possible donors often face a “retention” problem….i.e. where people initially sign up, then never respond when they are actually needed. This is a particularly big problem with donors under the age of 25, who for medical reasons are the most desirable donors. Recently a registry we work with at my place of business told us their new recruiting tactic used to mitigate this problem. Instead of signing people up in person for the registry, they get minimal information from them up front, then send them an email with further instructions about how to finish registering. They then only sign up those people who respond to the email. This decreases the number of people who end up registering to be donors, but greatly increases the number of registered donors who later respond when they’re needed. They use selection bias to weed out those who were least likely to be responsive….aka those who didn’t respond to even one initial email. It’s a more positive version of the Nigerian scammer tactic.

Selection bias can seem obvious or simple, but since nearly every study or poll has to grapple with it, it’s always worth reviewing. I’d also be remiss if I didn’t  include a link here for those ages 18 to 44 who might be interested in registering to be a potential bone marrow donor.