Happy Sunday! Let’s talk about outliers!
Outliers have been coming up a lot for me recently, so I wanted to put together a few of my thoughts on how we treat them in research. In the most technical sense, outliers are normally defined as any data point that is far outside the expected range for a value. Many computer programs (including Minitab and R) automatically define an outlier as a point that lies more than 1.5 times the interquartile range outside the interquartile range as an outlier. Basically any time you look at a data set and say “one of these things is not like the others” you’re probably talking about an outlier.
So how do we handle these? And how should we handle these? Here’s a couple things to consider:
- Extreme values are the first thing to go When you’re reviewing a data set and can’t review every value, almost everyone I know starts by looking at the most extreme values. For example, I have a data set I pull occasionally that tells me how long people stayed in the hospital after their transplants. I don’t scrutinize every number, but I do scrutinize every number higher than 60. While occasionally patients stay in the hospital that long, it’s actually equally likely that some sort of data error is occurring. Same thing for any value under 10 days….that’s not really even enough time to get a transplant done. So basically if a typo or import error led to a reasonable value, I probably wouldn’t catch it. Overly high or low values pretty much always lead to more scrutiny.
- Is the data plausible? So how do we determine whether an outlier can be discarded? Well the first is to assess if the data point could potentially happen. Sometimes there are typos, data errors, someone flat out misread the question, or someone’s just being obnoxious. An interesting example of implausible data points possibly influencing study results was in Mark Regenerus’ controversial gay parenting study. A few years after the study was released, his initial data set was re-analyzed and it was discovered that he had included at least 9 clear outliers….including one guy who reported he was 8 feet tall, weighed 88 lbs, had been married 8 times and had 8 children. When one of your outcome measures is “number of divorces” and your sample size is 236, including a few points like that could actually change the results. Now, 8 marriages is possible, but given the other data points that accompanied it, they are probably not plausible.
- Is the number a black swan? Okay, so lets move out of run of the mill data and in to rare events. How do you decide whether or not to include a rare event in your data set? Well….that’s hard to. There’s quite a bit of controversy recently over black swan type events….rare extremes like war, massive terrorist attacks or other existential threats to humanity. Basically, when looking at your outliers, you have to consider if this is an area where something sudden, unexpected and massive could happen to change the numbers. It is very unlikely that someone in a family stability study could suddenly get married and divorced 1,000 times, but in public health a relatively rare disease can suddenly start spreading more than usual. Nicholas Nassim Taleb is a huge proponent of keeping an eye on data sets that could end up with a black swan type event, and thinking through the ramifications of this.
- Purposefully excluding or purposefully including can both be deceptive In the recent Slate Star Codex post “Terrorist vs Chairs“, Scott Alexander has two interesting outlier cases that show exactly how easy it is to go wrong with outliers. The first is to purposefully exclude them. For example, since September 12th, 2001, more people in the US have been killed by falling furniture than by terrorist attacks. However, if you move the start line two days earlier to September 10th, 2001, that ratio completely flips by an order of magnitude. Similarly, if you ask how many people die of the flu each year, the average for the last 100 years is 1,000,000. The average for the last 97 years? 20,000. Clearly this is where the black swan thing can come back to haunt you.
- It depends on how you want to use your information Not all outlier exclusions are deceptive. For example, if you work for the New York City Police Department and want to review your murder rate for the last few decades, it would make sense to exclude the September 11th attacks. Most charts you will see do note that they are making this exclusion. In those cases police forces are trying to look at a trends and outcomes they can affect….and the 9/11 attacks really weren’t either. However, if the NYPD were trying to run numbers that showed future risk to the city, it would be foolish to leave those numbers out of their calculations. While tailoring your approach based on your purpose can open you up to bias, it also can reduce confusion.
Take it away Grover!