# 5 Things You Should Know About Statistical Process Control Charts

Once again I outdo myself with the clickbait-ish titles, huh? Sorry about that, I promise this is actually a REALLY interesting topic.

I was preparing a talk for a conference this week (today actually, provided I get this post up when I plan to), and I realized that statistical process control charts (or SPC charts for short) are one of the tools I use quite often at work but don’t really talk about here on the blog. Between those and my gif usage, I think you can safely guess why my reputation at work is a bit, uh, idiosyncratic. For those of you who have never heard of an SPC chart, here’s a quick orientation. First, they look like this:

(Image from qimacros.com, an excellent piece of software for generating these)

The chart is used for plotting something over time…hours, days, weeks, quarters, years, or “order in line”…take your pick. Then you map some ongoing process or variable you are interested in…say, employee sick calls. You measure employee sick calls in some way (# of calls or % of employees calling in) for each time period. This sets up a baseline average, along with “control limits”, which are basically 1, 2 and 3 standard deviation ranges. If at some point your rate/number/etc starts to go up or down, the SPC chart can tell you whether the change is significant based on where it falls on the plot. For example, one point falling outside the 3 standard deviation line is significant. Two in a row falling outside the 2 standard deviation line is significant as well. The rules for this vary by industry, and Wiki gives a pretty good overview here. At the end of this exercise you have a really nice graph of how you’re doing, with a good visual of any unusual happenings and some statistical rigor behind it. What’s not to love?

Anyway, I think because they take a little bit of getting used to, SPC charts do not always get the love they deserve. I would like to rectify this travesty, so here are five things you should know about them to tempt you to go learn more:

1. SPC charts are probably more useful for most businesses than hypothesis testing While most high school level statistics classes at least take a stab at explaining p-values and hypothesis testing to kids, almost none of them even show an example of a control chart. And why not? I think it’s a good case of academia favoring itself. If you want to test a new idea against an old idea, or to compare two things at a fixed point in time, p-values and hypothesis testing are pretty good. That’s why they’re used in most academic research. However, if you want to see how things are going over time, you need statistical process control. Since this is more relevant for most businesses, people who are trying to keep track of any key metric should DEFINITELY know about these. Six Sigma and many process improvement classes teach statistical process control, but SPC charts still don’t seem widely used outside of those settings. Too bad. These graphs are practical, they can be updated easily, and they give you a way of monitoring what’s going on, plus a lot of good information about how your processes are doing. Like what? Well, like #2 on this list:
2. SPC charts track two types of variation Let’s get back to my sick call example. Let’s say that in any given month, 10% of your employees call in sick. Now most people realize that not every month will be exactly 10%. Some months it’s 8%, some months it’s 12%. What statistical process control charts help calculate is when those fluctuations are most likely just random (known as common cause variation) and the point at which they are probably not so random (special cause variation). The chart sets parameters that tell you when you should pay attention. It’s better than p-values for this because you’re not really running an experiment every month…you just want to make sure everything’s progressing as it usually does. The other nice part is that this translates easily into a nice visual, so you can say with confidence “this is how it’s always been” or “something unusual is happening here” and have more than your gut to rely on.
3. SPC charts help you test new things, or spot concerning trends quickly SPC charts were really invented for manufacturing plants, and were perfected and popularized in post-WWII Japan. One of the reasons for this is that manufacturers really loved having an early warning that a machine might be breaking down or an employee might not be following the process. If the process goes above or below a certain red line (aka the “upper/lower control limit”) you have a lot of confidence something has gone wrong and can start investigating right away. In addition, you can see if a change you made helps anything. For example, if you do a handwashing education initiative, you can see what percentage of your employees call in sick the next month. If it’s below the lower control limit, you can say it was a success, just like with traditional p-values/hypothesis testing. HOWEVER, unlike p-values/hypothesis testing, SPC charts make allowances for time. Let’s say the handwashing push only drops sick calls to 9% per month, which is still within the normal range, but then they stay down for 7 months straight. Your SPC chart rules now tell you you’ve made a difference. SPC charts don’t just take into account the magnitude of the change, but also the duration. Very useful for any metric you need to track on an ongoing basis.
4. They encourage you not to fix what isn’t broken One of the interesting reasons SPC charts caught on so well in the manufacturing world is that the idea of “opportunity cost” was well established. If your assembly line puts out a faulty widget or two, it’s going to cost you a lot of money to shut the whole thing down. You don’t want to do that unless it’s REALLY broken. For our sick call example, it’s possible that what looks like an increase (say to 15% of your workforce) isn’t a big deal and that trying to interfere will cause more harm than good. Always good to remember that there are really two ways of being wrong: missing a problem that does exist, and trying to fix one that doesn’t.
5. There are quite a few different types One of the extra nice things about SPC charts is that there are actually six types to choose from, depending on what kind of data you are working with. There’s a helpful flowchart to pick your type here, but a good computer program (I use QI Macros) can actually pick for you. One of the best parts is that some of them can deal with small and varying sample sizes, so you can finally show that going from 20% to 25% isn’t really impressive if you just lowered your volume from 5 to 4.
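To make the mechanics above concrete, here’s a minimal sketch in Python of the simplest version: compute a baseline average and 3-sigma limits, then flag any new month that lands outside them. The sick-call percentages are made up, and real SPC software typically estimates sigma from moving ranges rather than the plain standard deviation used here.

```python
import statistics

def control_limits(baseline):
    """Center line and 3-sigma control limits from a baseline period.
    (Real SPC software usually estimates sigma from moving ranges;
    the plain standard deviation keeps this sketch simple.)"""
    center = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    return center, center - 3 * sigma, center + 3 * sigma

def flag(points, lcl, ucl):
    """The simplest rule: flag any single point beyond 3 sigma."""
    return [i for i, v in enumerate(points) if v < lcl or v > ucl]

# Made-up monthly sick-call percentages for the baseline period
baseline = [10, 9, 11, 10, 8, 12, 10, 9, 11, 10]
center, lcl, ucl = control_limits(baseline)

# Three new months to monitor
new_months = [11, 10, 18]
print(flag(new_months, lcl, ucl))  # [2] -- the 18% month is out of control
```

Run rules like “two in a row beyond 2 sigma” would bolt on the same way, as extra checks against the same center line.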

So those are some of my reasons you should know about these magical little charts. I do wish they’d get used more often because they are a great way of visualizing how you’re doing on an ongoing basis.

If you want to know more about the math behind them and more uses (especially in healthcare), try this presentation. And wish me luck on my talk! Pitching this stuff right before lunch is going to be a challenge.

# Funnel Plots 201: Romance, Risk, and Replication

After writing my Fun With Funnel Plots post last week, someone pointed me to this Neuroskeptic article from a little over a year ago.  It covers a paper called “Romance, Risk and Replication” that sought to replicate “romantic priming studies”, with interesting results….results best shown in funnel plot form! Let’s take a look, shall we?

A little background: I’ve talked about priming studies on this blog before, but for those unfamiliar, here’s how it works: a study participant is shown something that should subconsciously/subtly stimulate certain thoughts. They are then tested on a behavior that appears unrelated, but could potentially be influenced by the thoughts brought on in the first part of the study. In this study, researchers took a look at what’s called “romantic priming” which basically involves getting someone to think about meeting someone attractive, then seeing if they do things like (say they would) spend more money or take more risks.

Some ominous foreshadowing: Now for those of you who have been paying attention to the replication crisis, you may remember that priming studies were one of the first things to be called into question. There were a lot of concerns about p-value hacking, and that these studies were falling prey to basically all the hallmarks of bad research. You see where this is going.

What the researchers found: Shanks et al attempted to replicate 43 different studies on romantic priming, all of which had found significant effects. When they attempted to replicate these studies, they found nothing. Well, not entirely nothing. They found no significant effects of romantic priming, but they did find something else:

The black dots are the results from original studies, and the white triangles are the results from the replication attempts. To highlight the differences, they drew two funnel plots. One encompasses the original studies, and shows the concerning “missing piece” pattern in the lower left hand corner.  Since they had replication studies, they funnel plotted those as well. Since the sample sizes were larger, they all cluster at the top, but as you can see they spread above and below the zero line. In other words, the replications showed no effect in exactly the way you would expect if there were no effect, and the originals showed an effect in exactly the way you would expect if there were bias.

To thicken the plot further, the researchers also point out that the original studies’ effect sizes actually all fall just about on the edge of the funnel plot for the replication results. The red line in the graph shows a trend very close to the side of the funnel, which was drawn at the p=.05 line. Basically, this is pretty good evidence of p-hacking…aka researchers (or journals) selecting results that fell right under the p=.05 cutoff. Ouch.

I liked this example because it shows quite clearly how bias can creep in and affect scientific work, and how statistical tools can be used to detect and display what happened. While large numbers of studies should protect against bias, sadly it doesn’t always work that way. 43 studies is a lot, and in this case, it wasn’t enough.
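The “hugging the p=.05 line” pattern can be checked numerically: a result is significant at the two-sided p=.05 level when its z-score (effect divided by standard error) clears 1.96, and p-hacked literatures tend to pile up just above that bar. Here’s a rough sketch with invented effect sizes and standard errors (the 2.6 upper cutoff for “only a whisker” is an arbitrary choice for this illustration, not anything from the paper):

```python
def just_significant(effect, se, lo=1.96, hi=2.6):
    """True if a study clears the two-sided p=.05 bar (z >= 1.96)
    by only a whisker. The 'hi' cutoff is an arbitrary illustration."""
    z = effect / se
    return lo <= z < hi

# Invented (effect size, standard error) pairs for 'original' studies
originals = [(0.55, 0.27), (0.48, 0.24), (0.62, 0.30)]
print([just_significant(e, s) for e, s in originals])  # [True, True, True]
```

A whole literature coming back `True` like this is exactly the kind of clustering the red line in the funnel plot makes visible.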

# 5 Things You Should Know About Study Power

During my recent series on “Why Most Published Research Findings Are False“, I mentioned a concept called “study power” quite a few times. I haven’t talked about study power much on this blog, so I thought I’d give a quick primer for those who weren’t familiar with the term. If you’re looking for a more in depth primer try this one here, but if you’re just looking for a few quick hits, I gotcha covered:

1. It’s sort of the flip side of the p-value We’ve discussed the p-value and how it’s based on the alpha value before, and study power is actually based on a value called beta. If alpha can be thought of as the chance of committing a Type I error (false positive), then beta is the chance of committing a Type II error (false negative). Study power is actually 1 – beta, so if someone says study power is .8, that means the beta was .2. Setting the alpha and beta values is up to the discretion of the researcher…their values are more about risk tolerance than mathematical absolutes.
2. The calculation is not simple, but what it’s based on is important Calculating study power is not easy math, but if you’re desperately curious try this explanation. For most people though, the important part to remember is that it’s based on three things: the alpha you use, the effect size you’re looking for, and your sample size. These three all shift based on the values of the others. As an example, imagine you were trying to figure out if a coin was weighted or not. The more confident you want to be in your answer (alpha), the more times you have to flip it (sample size). However, if the coin is REALLY unfairly weighted (effect size), you’ll need fewer flips to figure that out. Basically, the unfairness of a coin weighted to 80-20 will be easier to spot than that of a coin weighted to 55-45.
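The coin example can actually be worked out with the standard normal-approximation sample size formula for a one-proportion test. This is my own illustration, not the linked explanation’s method, but it shows how alpha, power, and effect size trade off:

```python
import math
from statistics import NormalDist

def coin_sample_size(p1, alpha=0.05, power=0.8, p0=0.5):
    """Approximate flips needed to detect a coin weighted to p1,
    using the normal approximation for a one-proportion test."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # confidence (alpha)
    z_b = NormalDist().inv_cdf(power)           # power (1 - beta)
    num = z_a * math.sqrt(p0 * (1 - p0)) + z_b * math.sqrt(p1 * (1 - p1))
    return math.ceil((num / (p1 - p0)) ** 2)

print(coin_sample_size(0.80))  # 20 -- big bias, few flips needed
print(coin_sample_size(0.55))  # 783 -- small bias, many more flips
```

Twenty flips versus nearly eight hundred, for the same alpha and power: that’s the effect size doing the work.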
3. It is weirdly underused As we saw in the “Why Most Published Research Findings Are False” series, adequate study power does more than prevent false negatives. It can help blunt the impact of bias and the effect of multiple teams, and it helps everyone else trust your research. So why don’t most researchers put much thought into it? Why don’t science articles mention it, or people in general comment on it? I’m not sure, but I think it’s simply because the specter of false negatives is not as scary or concerning as that of false positives. Regardless, you just won’t see it mentioned as often as other statistical issues. Poor study power.
4. It can make negative (aka “no difference”) studies less trustworthy With all the current attention on false positives and failures to replicate, it’s not terribly surprising that false negatives have received less attention…but they are still an issue. Despite the fact that study power calculations can tell you how big an effect size you can detect, a surprising number of researchers don’t include those calculations. This means a lot of “negative finding” trials could also be suspect. In this breakdown of study power, Statistics Done Wrong author Alex Reinhart cites studies finding that up to 84% of studies don’t have sufficient power to detect even a 25% difference in primary outcomes. An ASCO review found that 47% of oncology trials didn’t have sufficient power to detect all but the largest effect sizes. That’s not nothing.
5. It’s possible to overdo it While underpowered studies are clearly an issue, it’s good to remember that overpowered studies can be a problem too. They waste resources, but can also detect effect sizes so small as to be clinically meaningless.

Okay, so there you have it! Study power may not get all the attention the p-value does, but it’s still a worthwhile trick to know about.

# 5 Interesting Examples of Self Reporting Bias

News flash! People lie. Some more than others. Now there are all sorts of reasons why we get upset when people don’t tell the truth, but I’m not here to talk about those today. No, today I’m here to give a few interesting examples of where self-reporting bias can really kinda screw up research and how we perceive the world.

Now, self-reporting bias can happen for all sorts of reasons, and not all of them are terrible. Some bias happens because people want to make themselves look better, some happens because people really think they do things differently than they do, and some happens because people just don’t remember things well and try to fill in the gaps. Regardless of the reason, here are five places bias may pop up:

1. Nutrition/Food Intake Self-reported nutrition data may be the worst example of research skewed by self-reporting. For most nutrition/intake surveys, about 67% of respondents give implausibly low answers…an effect that actually shows up cross-culturally. Interestingly, there are some methods known to improve this (doubly labeled water, for example), but they tend to be more expensive and thus are used less often. Unfortunately this effect isn’t random, so it’s hard to know exactly how bad the skew is across the board.
2. Height While it’s well known that people lie about their weight, lying about height is a less recognized but still interesting problem. It’s pervasive in online dating for both men AND women, both of whom exaggerate by about 2 inches. On medical/research surveys we all get slightly more honest, with men overestimating their height by about .5 inches, and women by .33 inches.
4. Childhood memories It is not uncommon in psychological/developmental research for adults to be asked various questions about the current state of their lives while also being queried about their upbringing. This typically leads to conclusions about parenting type X leading to outcome Y in children. I was recently reading a paper about various discipline methods and long-term outcomes in kids when I ran across a possible confounder I hadn’t considered: sex differences in the recollection of childhood memories. Apparently men are not as good overall at identifying the family dynamics of their childhoods, and the authors wondered if that had led to some false findings. They didn’t have direct evidence, but it’s an interesting thing to keep in mind.
5. Base 10 madness You wouldn’t think our fingers would cause a reporting bias, but they probably do. Our obsession with doing things in multiples of 5 or 10 probably comes from using our hands for counting. When it comes to surveys and self-reports, this leads to a phenomenon called “heaping”, where people tend to round their reports to multiples of 5 and 10. There’s some interesting math you can use to try to correct for this, but given that rounding tends to be non-constant (i.e. we round smaller numbers to 5 and larger numbers to 10), it can actually affect some research results.
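Heaping is easy to spot in practice: if last digits were roughly uniform, about 20% of reports would land on a multiple of 5 by chance, so a much higher share is a red flag. A tiny sketch, with invented “cigarettes per day” self-reports:

```python
def heaping_index(reports):
    """Share of self-reports landing on a multiple of 5.
    Roughly uniform last digits would give about 0.2;
    much more than that suggests people are rounding."""
    heaped = sum(1 for r in reports if r % 5 == 0)
    return heaped / len(reports)

# Invented 'cigarettes per day' self-reports
reports = [10, 20, 15, 7, 10, 5, 20, 10, 12, 15]
print(heaping_index(reports))  # 0.8 -- far above the ~0.2 chance baseline
```

The correction methods mentioned above get fancier than this, but the detection side really is just counting.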

Base 10 aside: one of the more interesting math/pop-culture videos I’ve seen is this one, where they explore why the Simpsons (who have 4 fingers on each hand) still use base 10 counting (7:45 mark):

# How the Sausage Gets Made and the Salami Gets Sliced

Ever since James the Lesser pointed me to this article about some problems with physics, I’ve been thinking a lot about salami slicing. For those of you who don’t know, salami slicing (aka using the “least publishable unit”) is the practice of taking one data set and publishing as many papers as possible from it. Some of this is done through data dredging, and some of it is done by breaking one set of conclusions into a series of much smaller conclusions in order to publish more papers. This is really not a great practice, as it can give potentially shaky conclusions more weight (500 papers can’t be wrong!) and multiply the effects of any errors in data gathering. This can then have knock-on effects like inflating citation counts or padding resumes.

A few examples:

1. Physics: I’ve talked about this before, but the (erroneous) data set mentioned here resulted in 500 papers on the same topic. Is it even possible to retract that many?
2. Nutrition and obesity research: John Ioannidis took a shot at this practice in his paper on obesity research, where he points out that the PREDIMED study (a large randomized trial looking at the Mediterranean diet)  has resulted in 95 published papers.
3. Medical findings: In this paper, it was determined that nearly 17% of papers on specific topics had at least some overlapping data.

To be fair, not all of this is done for bad reasons. Sometimes grants or other time pressures encourage researchers to release their data in slices rather than in one larger paper. The line between “salami slicing” and “breaking up data into more manageable parts” can be a grey one…this article gives a good overview of some case studies and shows it’s not always straightforward. Regardless, it’s worth keeping in mind: if you see multiple studies supporting the same conclusion, you should at least check for independence among the data sources. This paper breaks down some of the less obvious problems with salami slicing:

1. Dilution of content/reader fatigue More papers mean a smaller chance anyone will actually read all of them
2. Over-representation of some findings Fewer people will read these papers, but all the titles will make it look like there are lots of new findings
3. Clogging journals/peer review Peer reviewers and journal space are still limited resources. Too many papers on one topic takes resources from other topics
4. Increasing author fatigue/sanctions An interesting case that this is actually bad for the authors in the long run. Publishing takes a lot of work, and publishing two smaller papers is twice the work of one. Also, duplicate publishing increases the chance you’ll be accused of self-plagiarism and be sanctioned.

Overall, I think this is one of those possibilities many lay readers don’t even consider when they look at scientific papers. We assume that each paper equals one independent event, and that lots of papers means lots of independent verification. With salami slicing, we do away with this element and increase the noise. Want more? This quick video gives a good overview as well:

# On Outliers, Black Swans, and Statistical Anomalies

Happy Sunday! Let’s talk about outliers!

Outliers have been coming up a lot for me recently, so I wanted to put together a few of my thoughts on how we treat them in research. In the most technical sense, an outlier is normally defined as any data point that is far outside the expected range for a value. Many computer programs (including Minitab and R) automatically flag as an outlier any point that lies more than 1.5 times the interquartile range outside the interquartile range. Basically, any time you look at a data set and say “one of these things is not like the others”, you’re probably talking about an outlier.
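That 1.5 × IQR rule is simple enough to sketch in a few lines of Python. The hospital-stay numbers here are invented, and note that different programs estimate the quartiles themselves slightly differently, so flags near the fence can vary:

```python
import statistics

def iqr_outliers(data):
    """Tukey's rule: return points more than 1.5 * IQR outside
    the interquartile range."""
    q1, _, q3 = statistics.quantiles(data, n=4)  # default 'exclusive' method
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < lo or x > hi]

# Invented post-transplant lengths of stay, in days
lengths_of_stay = [18, 21, 19, 24, 22, 20, 23, 75]
print(iqr_outliers(lengths_of_stay))  # [75]
```

The 75-day stay gets flagged for scrutiny; whether it gets *discarded* is the judgment call the rest of this post is about.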

So how do we handle these? And how should we handle these? Here’s a couple things to consider:

1. Extreme values are the first thing to go When you’re reviewing a data set and can’t review every value, almost everyone I know starts by looking at the most extreme values. For example, I have a data set I pull occasionally that tells me how long people stayed in the hospital after their transplants. I don’t scrutinize every number, but I do scrutinize every number higher than 60. While occasionally patients stay in the hospital that long, it’s actually equally likely that some sort of data error is occurring. Same thing for any value under 10 days….that’s not really even enough time to get a transplant done. So basically if a typo or import error led to a reasonable value, I probably wouldn’t catch it. Overly high or low values pretty much always lead to more scrutiny.
2. Is the data plausible? So how do we determine whether an outlier can be discarded? The first step is to assess whether the data point could potentially happen. Sometimes there are typos or data errors, someone flat out misread the question, or someone’s just being obnoxious. An interesting example of implausible data points possibly influencing study results was Mark Regnerus’ controversial gay parenting study. A few years after the study was released, his initial data set was re-analyzed and it was discovered that he had included at least 9 clear outliers…including one guy who reported he was 8 feet tall, weighed 88 lbs, had been married 8 times and had 8 children. When one of your outcome measures is “number of divorces” and your sample size is 236, including a few points like that could actually change the results. Now, 8 marriages is possible, but given the other data points that accompanied it, it is probably not plausible.
3. Is the number a black swan? Okay, so let’s move out of run-of-the-mill data and into rare events. How do you decide whether or not to include a rare event in your data set? Well…that’s hard to do. There’s been quite a bit of controversy recently over black swan type events…rare extremes like war, massive terrorist attacks or other existential threats to humanity. Basically, when looking at your outliers, you have to consider whether this is an area where something sudden, unexpected and massive could happen to change the numbers. It is very unlikely that someone in a family stability study could suddenly get married and divorced 1,000 times, but in public health a relatively rare disease can suddenly start spreading more than usual. Nassim Nicholas Taleb is a huge proponent of keeping an eye on data sets that could end up with a black swan type event, and of thinking through the ramifications.
4. Purposefully excluding or purposefully including can both be deceptive In the recent Slate Star Codex post “Terrorist vs Chairs“, Scott Alexander has two interesting outlier cases that show exactly how easy it is to go wrong with outliers. The first is to purposefully exclude them. For example, since September 12th, 2001, more people in the US have been killed by falling furniture than by terrorist attacks. However, if you move the start line two days earlier to September 10th, 2001, that ratio completely flips by an order of magnitude. Similarly, if you ask how many people die of the flu each year, the average for the last 100 years is 1,000,000. The average for the last 97 years? 20,000.  Clearly this is where the black swan thing can come back to haunt you.
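To see just how hard a single black-swan year can yank an average around, here’s a toy calculation. The numbers are entirely made up, just scaled to echo the flu example above: 99 ordinary years plus one catastrophic one.

```python
import statistics

# Hypothetical yearly death counts: 99 ordinary years
# plus one catastrophic 1918-style year
ordinary = [20_000] * 99
with_black_swan = ordinary + [98_000_000]

print(statistics.mean(ordinary))         # 20000
print(statistics.mean(with_black_swan))  # 999800 -- one year drags it to ~1 million
```

One data point out of a hundred, and the “average year” is off by a factor of fifty. That’s why where you draw the start line matters so much.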
5. It depends on how you want to use your information Not all outlier exclusions are deceptive. For example, if you work for the New York City Police Department and want to review your murder rate for the last few decades, it would make sense to exclude the September 11th attacks. Most charts you will see do note that they are making this exclusion. In those cases police forces are trying to look at trends and outcomes they can affect…and the 9/11 attacks really weren’t either one. However, if the NYPD were trying to run numbers that showed future risk to the city, it would be foolish to leave those numbers out of the calculations. While tailoring your approach based on your purpose can open you up to bias, it can also reduce confusion.

Take it away Grover!

# Selection Bias: The Bad, The Ugly and the Surprisingly Useful

Selection bias and sampling theory are two of the most unappreciated issues in the popular consumption of statistics. While they present challenges for nearly every study ever done, they are often seen as boring….until something goes wrong. I was thinking about this recently because I was in a meeting on Friday and heard an absolutely stellar example of someone using selection bias quite cleverly to combat a tricky problem. I get to that story towards the bottom of the post, but first I wanted to go over some basics.

First, a quick reminder of why we sample: we are almost always unable to ask the entire population of people how they feel about something. We therefore have to find a way of getting a subset to tell us what we want to know, but for that to be valid that subset has to look like the main population we’re interested in. Selection bias happens when that process goes wrong. How can this go wrong? Glad you asked! Here’s 5 ways:

1. You asked a non-representative group Finding a truly “random sample” of people is hard. Like really hard. It takes time and money, and almost every researcher is short on both. The most common example of this is probably our personal lives. We talk to everyone around us about a particular issue, and discover that everyone we know feels the same way we do. Depending on the scope of the issue, this can give us a very flawed view of what the “general” opinion is. It sounds silly and obvious, but if you remember that many psychological studies rely exclusively on W.E.I.R.D. college students for their results, it becomes a little more alarming. Even if you figure out how to get in touch with a pretty representative sample, it’s worth noting that what works today may not work tomorrow. For example, political polling took a huge hit after the introduction of cell phones. As young people moved away from landlines, polls that relied on them got less and less accurate. The selection method stayed the same, it was the people that changed.
3. You unintentionally double counted This example comes from the book Teaching Statistics by Gelman and Nolan. Imagine that you wanted to find out the average family size in your school district. You randomly select a whole bunch of kids and ask them how many siblings they have, then average the results. Sounds good, right? Well, maybe not. That strategy will almost certainly end up overestimating the average number of siblings, because large families are by definition going to have a better chance of being picked in any sample. Now this can seem obvious when you’re talking explicitly about family size, but what if it’s just one factor out of many? If you heard “a recent study showed kids with more siblings get better grades than those without” you’d have to go pretty far into the methodology section before you might realize that some families may have been double (or triple, or quadruple) counted.
4. The group you are looking at self-selected before you got there Okay, so now that you understand sampling bias, try mixing it with correlation and causation confusion. Even if you ask a random group and get responses from everyone, you can still end up with discrepancies between groups because of sorting that happened before you got there. For example, a few years ago there was a Pew Research survey showing that 4 out of 10 households had female breadwinners, but that those female breadwinner households earned less than male breadwinner households. However, it turned out that there were really two types of female breadwinner households: single moms and wives who outearned their husbands. Wives who outearned their husbands made about as much as male breadwinners, while single mothers earned substantially less. None of these groups is random, so any differences between them may have already existed.
5. You can actually use all of the above to your advantage As promised, here’s the story that spawned this whole post. Bone marrow transplant programs are fairly reliant on altruistic donors. Registries that recruit possible donors often face a “retention” problem…i.e. people initially sign up, then never respond when they are actually needed. This is a particularly big problem with donors under the age of 25, who for medical reasons are the most desirable donors. Recently a registry we work with at my place of business told us about the new recruiting tactic they use to mitigate this problem. Instead of signing people up in person for the registry, they get minimal information from them up front, then send them an email with further instructions on how to finish registering. They then only sign up those people who respond to the email. This decreases the number of people who end up registering to be donors, but greatly increases the number of registered donors who later respond when they’re needed. They use selection bias to weed out those who were least likely to be responsive…aka those who wouldn’t respond to even one initial email. It’s a more positive version of the Nigerian scammer tactic.
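The double-counting trap from the Gelman and Nolan family-size example earlier is easy to demonstrate with a simulation: sampling children instead of families inflates the average, because a four-child family shows up four times as often. The district makeup here is invented for illustration.

```python
import random

random.seed(0)

# Hypothetical district: equal numbers of 1-, 2-, and 4-child families
families = [1] * 100 + [2] * 100 + [4] * 100

# Correct approach: sample FAMILIES
true_avg = sum(families) / len(families)

# Biased approach: sample CHILDREN, so big families are over-represented
children = [size for size in families for _ in range(size)]
sampled = [random.choice(children) for _ in range(10_000)]
biased_avg = sum(sampled) / len(sampled)

print(true_avg)    # 2.33...
print(biased_avg)  # about 3 -- children over-report big families
```

Same district, roughly a 30% overestimate, and nothing about the survey itself looked wrong.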

Selection bias can seem obvious or simple, but since nearly every study or poll has to grapple with it, it’s always worth reviewing. I’d also be remiss if I didn’t  include a link here for those ages 18 to 44 who might be interested in registering to be a potential bone marrow donor.

# Type IV Errors: When Being Right is Not Enough

Okay, after discussing Type I and Type II errors a few weeks ago and Type III errors last week, it’s only natural that this week we’d move on to Type IV errors. This is another error type that doesn’t have a formal definition, but is important to remember because it’s actually been kind of a problem in some studies. Basically, a Type IV error is an incorrect interpretation of a correct result.

For example, let’s say you go to the doctor because you think you tore your ACL

A Type I error would occur if the doctor told you that your ACL was torn when it wasn’t. (False Positive)

A Type II error would occur if the doctor told you that you just bruised it, but you had really torn your ACL. (False Negative)

A Type III error would be if the doctor said you didn’t tear your ACL, and you hadn’t, but she sent you home and missed that you had a tumor on your hip causing the knee pain. (Wrong problem)

A Type IV error would be if you were correctly diagnosed with an ACL tear, then told to put crystals on it every day until it healed. Alternatively, the doctor refers you for surgery and the surgery makes the problem worse. (Wrong follow-up)

When you put it like that, it’s decently easy to spot, but a tremendous number of studies end up with some form of this problem. Several papers have found that when using ANOVA tables, as many as 70% of authors end up doing incorrect or irrelevant follow-up statistical testing. Sometimes these errors affect the primary conclusion and sometimes not, but it should concern anyone that this could happen.

Other types of Type IV errors:

1. Drawing a conclusion for an overly broad group because you got results for a small group. This is the often heard “WEIRD” complaint, where psychological studies use populations from Western, Educated, Industrialized, Rich and Democratic countries (especially college students!) and then claim that the results are true of humans in general. The results may be perfectly accurate for the group being studied, but not generalizable.
2. Running the wrong test or running the test on the wrong data. A recent example was the retraction that had to be made when it turned out the authors of a paper linking conservatism and psychotic traits had switched the coding for conservatives and liberals. This meant all of their conclusions were exactly reversed: the data actually linked liberalism and psychotic traits. They correctly rejected the null hypothesis, but were still wrong about the conclusion.
3. Pre-existing beliefs and confirmation bias. There’s interesting data out there suggesting that people who write down their justifications for decisions are more hesitant to walk those decisions back when it looks like they were wrong. This was the issue with a recent “Pants on Fire” ranking PolitiFact gave a Donald Trump claim. Trump had claimed that “crime is rising”. PolitiFact said he was lying. When it was pointed out to them that preliminary 2015 and 2016 data suggest violent crime is rising, they said preliminary data doesn’t count and stood by the ranking. The Volokh Conspiracy has the whole breakdown here, but it struck them (and me) that it’s hard to call someone a full-blown liar when they have preliminary data on their side. It’s not that his claim is clearly true, but there’s a credible suggestion it may not be false either. Someone remind me to check when those numbers finalize.

In conclusion: even when you’re right, you can still be wrong.

# Type III Errors: Another Way to Be Wrong

I talk a lot about ways to be wrong on this blog, and most of them are pretty recognizable logical fallacies or statistical issues. For example, I’ve previously talked about the two ways of being wrong when hypothesis testing that are generally accepted by statisticians.  If you don’t feel like clicking, here’s the gist: Type I errors are also known as false positives, or the error of believing something to be true when it is not. Type II errors are the opposite, false negatives, or the error of believing an idea to be false when it is not.
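A quick way to make the Type I error rate concrete is to simulate experiments where the null hypothesis is actually true and count how often a test still comes back “significant.” This is my own illustration, using a simple two-sided z-test with known variance; the false positive rate should land near whatever alpha you choose.

```python
import math
import random

def false_positive_rate(trials=2000, n=30, alpha=0.05, seed=42):
    """Simulate experiments where the null is TRUE (samples from a
    standard normal, true mean 0) and count how often a two-sided
    z-test nonetheless rejects — i.e., the Type I error rate."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(0, 1) for _ in range(n)]
        mean = sum(sample) / n
        z = mean * math.sqrt(n)  # known sigma = 1
        # Two-sided p-value from the normal CDF via erf.
        p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
        if p < alpha:
            rejections += 1      # a false positive: the null was true
    return rejections / trials
```

With `alpha=0.05`, roughly 5% of these true-null experiments come back “significant” — that’s the Type I error rate working exactly as advertised.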

Both of those definitions are really useful when testing a scientific hypothesis, which is why they have formal definitions. Today though, I want to bring up the proposal for a formally recognized Type III error: correctly answering the wrong question.

Here are a couple of examples:

1. Drunk Under a Streetlight: Most famously, this could be considered a variant of the streetlight effect. It’s named after this anecdote: “A policeman sees a drunk man searching for something under a streetlight and asks what the drunk has lost. He says he lost his keys and they both look under the streetlight together. After a few minutes the policeman asks if he is sure he lost them here, and the drunk replies, no, and that he lost them in the park. The policeman asks why he is searching here, and the drunk replies, ‘this is where the light is.’”
2. Blame it on the GPS: In my “All About that Base Rate” post, I talked about a scenario where the police were testing trash cans for the presence of drugs. A type I error is getting a positive test on a trash can with no drugs in it. A type II error is getting a negative test on a trash can with drugs in it. A type III error would be correctly finding drugs in a trash can at the wrong house.
3. Stressing about string theory: James recently had a post about the failure to prove some key aspects of string theory which was great timing since I just finished reading “The Trouble With Physics” and was feeling a bit stressed out by the whole thing. In the book, the author Lee Smolin makes a rather concerning case that we are putting almost all of our theoretical physics eggs in the string theory basket, and we don’t have much to fall back on if we’re wrong. He repeatedly asserts that good science is being done, but that there is very little thought given to the whole “is this the right direction” question.
4. Blood Transfusions and Mental Health: The book “Blood Work: A Tale of Medicine and Murder in the Scientific Revolution” provides another example, as it recounts the history of the blood transfusion. Originally, the idea was that transfusions could be used as psychiatric treatments. For many, many reasons, this use failed spectacularly enough that transfusions weren’t tried again for almost 150 years. At that point someone realized they should try using them to treat blood loss, and the science improved from there.

No matter how good the research was in all of these cases, the answer still wouldn’t have helped answer the larger questions at hand. Like a swimmer in open water, the best techniques in the world don’t help if you’re not headed in the right direction. It sounds obvious, but formalizing a definition like this and teaching it while you teach other techniques might help remind scientists/statisticians to look up every once in a while. You know, just to see where you’re going.

# Proving Causality: Who Was Bradford Hill and What Were His Criteria?

Last week I had a lot of fun talking about correlation/causation confusion, and this week I wanted to talk about the flip side: correctly proving causality. While there’s definitely a cost to incorrectly believing that Thing A causes Thing B when it does not, it can also be quite dangerous to NOT believe Thing A causes Thing B when it actually does.

This was the challenge facing many public health researchers when attempting to establish a link between smoking and lung cancer. With all the doubt around correlation and causation, how do you actually prove your hypothesis? British statistician Austin Bradford Hill was quite concerned with this problem, and he established a set of nine criteria to help prove causal association. While these criteria are primarily used for establishing causes of medical conditions, they are a pretty useful framework for assessing any correlation/causation claim.

Typically these criteria are explained using smoking (here for example), as that’s what they were developed to assess. I’m actually going to use examples from the book The Ghost Map, which documents the cholera outbreak in London in 1854 and the birth of modern epidemiology. A quick recap: a physician named John Snow witnessed the start of the cholera outbreak in the Soho neighborhood of London, and was desperate to figure out how the disease was spreading. The prevailing wisdom at the time was that cholera and other diseases were transmitted by foul smelling air (miasma theory), but based on his investigation Snow began to believe the problem was actually a contaminated water source. In the era prior to germ theory, the idea of a water-borne illness was a radical one, and Snow had to vigorously document his evidence and defend his case….all while hundreds of people were dying. His investigation and documentation are typically acknowledged as the beginning of formal epidemiology, and it is likely he saved hundreds if not thousands of lives by convincing authorities to remove the handle of the Broad Street pump (the contaminated water source).

With that background, here are the criteria:

1. Strength of Association: The first criterion is basic: people who do Thing A must have a higher rate of Thing B than those who don’t. This is basically a request for an initial correlation. In the case of cholera, this is where John Snow’s “Ghost Map” came in. He created a visual diagram showing that the outbreak of cholera was driven not purely by location, but by proximity to one particular water pump. Houses right next to each other had dramatically different death rates IF the inhabitants typically used different water pumps. Of those living near the Broad Street pump, 127 died. Of those living nearer to other pumps, 10 died. That’s one hell of an association.
2. Temporality: The suspected cause must come before the effect. This one seems obvious, but must be remembered. It’s clear that both water and air are consumed frequently, so either method of transmission passed this criterion. However, if you looked closely, it was clear that bad smells often came after disease and death, not before. OTOH, there were a lot of open sewer systems in London at the time, so everything probably smelled kinda bad. We’ll call this one a draw.
3. Consistency: Different locations must show the same effects. This criterion is a big reason why miasma theory (the theory that bad smells caused disease) had taken hold: when disease outbreaks happened, the smells were often unbearable, and this appeared very consistent across locations and different outbreaks. Given John Snow’s predictions however, it would have been beneficial to see whether cholera outbreaks had unusual patterns around water sources, or whether changing water sources changed an outbreak’s trajectory.
4. Theoretical Plausibility: This one can be tricky to establish, but basically it requires that you can propose a mechanism for the cause. It’s designed to help keep out really out-there ideas about crystals and star alignment and such. Ingesting a substance such as water quite plausibly could cause illness, so this passed. Inhaling air also passed this test, since we now know that many diseases really are transmitted through airborne germs. Cholera didn’t happen to have this method of transmission, but it wasn’t implausible that it could have. Without germ theory, plausibility was much harder to establish; plausibility is only as good as current scientific understanding.
5. Coherence: The coherence requirement looks at whether the proposed cause agrees with other knowledge, especially laboratory findings. John Snow didn’t have those, but he did gain coherence when the pump handle was removed and the outbreak stopped. That showed the theory was coherent: things proceeded the way you would predict if he was correct. Conversely, the end of the outbreak undermined the coherence of miasma theory….if bad air was the cause, you would not expect changing a water source to have an effect.
6. Specificity in the Causes: The more specific or direct the relationship between Thing A and Thing B, the clearer the causal relationship and the easier it is to prove. Here again, by showing that those drinking the water were getting cholera at very high rates and those not drinking the water were not getting cholera as often, Snow offered a very straightforward cause and effect. If there had been other factors involved….say water drawn at a certain time of day….this link would have been more difficult to establish.
7. Dose-Response Relationship: The more exposure you have to the cause, the more likely you are to have the effect. This one can be tricky; in the event of an infectious disease, for example, one exposure may be all it takes to get sick. In the case of John Snow, he actually doubted miasma theory because of this criterion. He had studied men who worked in the sewers, and noted that they must have more exposure to foul air than anyone else. However, they did not seem to get cholera more often than other people. The idea that bad air made you sick, but that lots of bad air didn’t make you more likely to be ill, troubled him. With the water on the other hand, he noted that those using the pump daily became sick immediately.
8. Experimental Evidence: While direct human experiments are almost never possible or ethical to run, some experimental evidence may be used to support the theory. Snow didn’t have much to experiment on, and it would have been unethical if he had. However, he did track people who had avoided the pump and note whether or not they got sick. If he had known of animals susceptible to cholera, he could have tested the water by giving one animal “good” water and another animal “bad” water.
9. Analogy: If you know that something occurs in one place, you can reasonably assume it occurs in other places. If Snow had known of other water-borne diseases, one suspects it would have been easier for him to make his case to city officials. This one can obviously bias people at times, but is actually pretty useful. We would never dream of requiring a modern epidemiologist to prove that a new disease could be water-borne….we would all assume it was at least a possibility.
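Strength of association (criterion 1) is usually quantified as a risk ratio: the rate of disease among the exposed divided by the rate among the unexposed. The death counts below come from the post; the population denominators are round, made-up numbers purely to show the arithmetic, since Snow’s actual denominators aren’t given here.

```python
def risk_ratio(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Strength of association as a simple risk ratio:
    risk among the exposed divided by risk among the unexposed."""
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    return risk_exposed / risk_unexposed

# 127 and 10 deaths are from the post; the 500-person denominators
# are hypothetical, chosen only to make the division easy to follow.
rr = risk_ratio(127, 500, 10, 500)
# With equal denominators the ratio reduces to 127/10 = 12.7 —
# a very large association by epidemiological standards.
```

With equal group sizes the denominators cancel, which is why Snow’s raw death counts were already so persuasive.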

Even though Snow didn’t have this checklist available to him, he ended up checking most of the boxes anyway. In particular, he proved his theory using strength of association, coherence, consistency and specificity. He also raised questions about the rival theory by pointing to the lack of dose-response relationship. Ultimately, the experiment of removing the pump handle succeeded in halting the outbreak.

Not bad for a little data visualization.

While some of these criteria have been modified or improved, this is a great fundamental framework for thinking about causal associations. Also, if you’re looking for a good summer read, I would recommend the book I referenced here: The Ghost Map. At the very least it will help you stop making “You Know Nothing John Snow” jokes.