5 Things You Should Know About Statistical Process Control Charts

Once again I outdo myself with the clickbait-ish titles, huh? Sorry about that, I promise this is actually a REALLY interesting topic.

I was preparing a talk for a conference this week (today actually, provided I get this post up when I plan to), and I realized that statistical process control charts (or SPC charts for short) are one of the tools I use quite often at work but don’t really talk about here on the blog. Between those and my gif usage, I think you can safely guess why my reputation at work is a bit, uh, idiosyncratic. For those of you who have never heard of an SPC chart, here’s a quick orientation. First, they look like this:

(Image from qimacros.com, which makes excellent software for generating these)

The chart is used for plotting something over time….hours, days, weeks, quarters, years, or “order in line”…take your pick.  Then you map some ongoing process or variable you are interested in…..say employee sick calls. You measure employee sick calls in some way (# of calls or % of employees calling in) in each time period. This sets up a baseline average, along with “control limits”, which are basically 1, 2 and 3 standard deviation ranges. If at some point your rate/number/etc starts to go up or down, the SPC chart can tell you if the change is significant or not based on where it falls on the plot.  For example, if you have one point that falls outside the 3 standard deviation line, that’s significant. If two in a row fall outside the 2 standard deviation line, that’s significant as well. The rules for this vary by industry, and Wiki gives a pretty good overview here. At the end of this exercise you have a really nice graph of how you’re doing with a good visual of any unusual happenings, all with some statistical rigor behind it. What’s not to love?
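
To make the baseline-plus-limits idea concrete, here is a minimal sketch in Python. The sick-call numbers are hypothetical, and the sketch takes a shortcut: it estimates sigma with the plain sample standard deviation of a baseline period, whereas real SPC software typically picks an estimate suited to the chart type. Treat it as an illustration of the logic, not as what QI Macros does.

```python
from statistics import mean, stdev

def control_limits(baseline):
    """Estimate the center line and sigma from a baseline period."""
    return mean(baseline), stdev(baseline)

def flag_points(values, center, sigma):
    """Label each value by the widest sigma band it falls outside of."""
    flags = []
    for v in values:
        distance = abs(v - center) / sigma
        if distance > 3:
            flags.append("beyond 3 sd")
        elif distance > 2:
            flags.append("beyond 2 sd")
        else:
            flags.append("in control")
    return flags

# Hypothetical monthly sick-call rates (% of employees), not real data.
baseline = [10, 9, 11, 10, 12, 8, 10, 11, 9, 10]
center, sigma = control_limits(baseline)

new_months = [11, 9, 15]
flags = flag_points(new_months, center, sigma)
for month, flag in zip(new_months, flags):
    print(month, flag)
```

A single "beyond 3 sd" point is the classic signal. Points that are only "beyond 2 sd" matter when several of them show up close together, which is where the industry-specific rules come in.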

Anyway, I think because they take a little bit of getting used to, SPC charts do not always get the love they deserve. I would like to rectify this travesty, so here are 5 things you should know to tempt you to go learn more about them:

  1. SPC charts are probably more useful for most businesses than hypothesis testing. While most high school level statistics classes at least take a stab at explaining p-values and hypothesis testing to kids, almost none of them even show an example of a control chart. And why not? I think it’s a good case of academia favoring itself. If you want to test a new idea against an old idea or to compare two things at a fixed point in time, p-values and hypothesis testing are pretty good. That’s why they’re used in most academic research. However, if you want to see how things are going over time, you need statistical process control. Since this is more relevant for most businesses, people who are trying to keep track of any key metric should DEFINITELY know about these. Six Sigma and many process improvement classes teach statistical process control, but SPC charts still don’t seem widely used outside of those settings. Too bad. These graphs are practical, they can be updated easily, and they give you a way of monitoring what’s going on and a lot of good information about how your processes are going. Like what? Well, like #2 on this list:
  2. SPC charts track two types of variation. Let’s get back to my sick call example. Let’s say that in any given month, 10% of your employees call in sick. Now most people realize that not every month will be exactly 10%. Some months it’s 8%, some months it’s 12%. What statistical process control charts help you work out is when those fluctuations are most likely just random (known as common cause variation) and the point at which they are probably not so random (special cause variation). They set limits that tell you when you should pay attention. They are better than p-values for this because you’re not really running an experiment every month….you just want to make sure everything’s progressing as it usually does. The other nice part is this translates easily into a nice visual for people, so you can say with confidence “this is how it’s always been” or “something unusual is happening here” and have more than your gut to rely on.
  3. SPC charts help you test new things, or spot concerning trends quickly. SPC charts were really invented for manufacturing plants, and were perfected and popularized in post-WWII Japan. One of the reasons for this is that manufacturers really loved having an early warning about when a machine might be breaking down or an employee might not be following the process. If the process goes above or below a certain red line (aka the “upper/lower control limit”) you have a lot of confidence something has gone wrong and can start investigating right away. In addition to this, you can see if a change you made helps anything. For example, if you do a handwashing education initiative, you can see what percentage of your employees call in sick the next month. If it’s below the lower control limit, you can say it was a success, just like with traditional p-values/hypothesis testing. HOWEVER, unlike p-values/hypothesis testing, SPC charts make allowances for time. Let’s say sick calls only drop to 9% per month, a change too small to cross the lower control limit on its own, but then they stay down for 7 months. Your SPC chart rules now tell you you’ve made a difference, because a long run of points on the same side of the average is very unlikely to be chance. SPC charts don’t just take into account the magnitude of the change, but also the duration. Very useful for any metric you need to track on an ongoing basis.
  4. They encourage you not to fix what isn’t broken. One of the interesting reasons SPC charts caught on so well in the manufacturing world is that the idea of “opportunity cost” was well established. If your assembly line puts out a faulty widget or two, it’s going to cost you a lot of money to shut the whole thing down. You don’t want to do that unless it’s REALLY broken. For our sick call example, it’s possible that what looks like an increase (say to 15% of your workforce) isn’t a big deal and that trying to interfere will cause more harm than good. Always good to remember that there are really two ways of being wrong: missing a problem that does exist, and trying to fix one that doesn’t.
  5. There are quite a few different types. One of the extra nice things about SPC charts is that there are actually 6 types to choose from, depending on what kind of data you are working with. There’s a helpful flowchart to pick your type here, but a good computer program (I use QI Macros) can actually pick for you. One of the best parts of this is that some of them can deal with small and varying sample sizes, so you can finally show that going from 20% to 25% isn’t really impressive if you just lowered your volume from 5 to 4.
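
The "duration matters too" idea from #3 is easy to sketch in code. The function names and the data below are hypothetical, and the run length of 7 is just borrowed from the sick call example above. Published rule sets use different run lengths, so check whatever rule set your industry follows.

```python
def longest_run_one_side(values, center):
    """Length of the longest streak of consecutive points strictly on one side of center."""
    longest = current = 0
    previous_side = 0
    for v in values:
        side = (v > center) - (v < center)  # +1 above, -1 below, 0 exactly on the line
        if side != 0 and side == previous_side:
            current += 1      # streak continues on the same side
        elif side != 0:
            current = 1       # streak starts (or switches sides)
        else:
            current = 0       # a point on the center line breaks the streak
        previous_side = side
        longest = max(longest, current)
    return longest

def sustained_shift(values, center, run_length=7):
    """True if some run on one side of center is at least run_length long (assumed threshold)."""
    return longest_run_one_side(values, center) >= run_length

# Hypothetical monthly sick-call rates (%) after a handwashing initiative; baseline average is 10.
after_initiative = [9, 9.5, 9, 9.2, 9, 9.4, 9]
print(sustained_shift(after_initiative, center=10))

# One month back above average breaks the run.
mixed = [9, 11, 9, 9, 9, 9, 9]
print(sustained_shift(mixed, center=10))
```

None of the points in the first series would trip a 3 standard deviation limit by itself, which is exactly why the run rule exists.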

So those are some of my reasons you should know about these magical little charts. I do wish they’d get used more often because they are a great way of visualizing how you’re doing on an ongoing basis.

If you want to know more about the math behind them and more uses (especially in healthcare), try this presentation. And wish me luck on my talk! Pitching this stuff right before lunch is going to be a challenge.

Funnel Plots 201: Romance, Risk, and Replication

After writing my Fun With Funnel Plots post last week, someone pointed me to this Neuroskeptic article from a little over a year ago.  It covers a paper called “Romance, Risk and Replication” that sought to replicate “romantic priming studies”, with interesting results….results best shown in funnel plot form! Let’s take a look, shall we?

A little background: I’ve talked about priming studies on this blog before, but for those unfamiliar, here’s how it works: a study participant is shown something that should subconsciously/subtly stimulate certain thoughts. They are then tested on a behavior that appears unrelated, but could potentially be influenced by the thoughts brought on in the first part of the study. In this study, researchers took a look at what’s called “romantic priming” which basically involves getting someone to think about meeting someone attractive, then seeing if they do things like (say they would) spend more money or take more risks.

Some ominous foreshadowing: Now for those of you who have been paying attention to the replication crisis, you may remember that priming studies were one of the first things to be called into question. There were a lot of concerns about p-value hacking, and concerns that they were falling prey to basically all the hallmarks of bad research. You see where this is going.

What the researchers found: Shanks et al attempted to replicate 43 different studies on romantic priming, all of which had found significant effects. When they attempted to replicate these studies, they found nothing. Well, not entirely nothing. They found no significant effects of romantic priming, but they did find something else:

The black dots are the results from original studies, and the white triangles are the results from the replication attempts. To highlight the differences, they drew two funnel plots. One encompasses the original studies, and shows the concerning “missing piece” pattern in the lower left hand corner.  Since they had replication studies, they funnel plotted those as well. Since the sample sizes were larger, they all cluster at the top, but as you can see they spread above and below the zero line. In other words, the replications showed no effect in exactly the way you would expect if there were no effect, and the originals showed an effect in exactly the way you would expect if there were bias.

To thicken the plot further, the researchers also point out that the original studies’ effect sizes actually all fall just about on the line of the funnel plot for the replication results. The red line in the graph shows a trend very close to the side of the funnel, which was drawn at the p=.05 line. Basically, this is pretty good evidence of p-hacking…aka researchers (or journals) selecting results that fell right under the p=.05 cutoff. Ouch.
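
For anyone wondering what "falling on the p=.05 line" means numerically, here is a rough sketch. It assumes a funnel centered on zero and a normal approximation, so the edge sits at 1.96 standard errors. The study numbers are made up, and the 2.5 cutoff for "just over the line" is an arbitrary choice for illustration. This is not the analysis Shanks et al. ran.

```python
Z_CRIT = 1.96  # two-sided p = .05 under a normal approximation

def significance_edge(standard_error):
    """Smallest effect size that reaches p = .05 at this standard error."""
    return Z_CRIT * standard_error

def z_score(effect, standard_error):
    """How many standard errors the effect sits away from zero."""
    return effect / standard_error

# Hypothetical (effect size, standard error) pairs, not data from the paper.
studies = [(0.62, 0.30), (0.45, 0.22), (0.81, 0.40), (0.05, 0.08)]

# Studies whose z-score lands barely past the cutoff "hug" the funnel edge.
just_over = [
    (effect, se) for effect, se in studies
    if Z_CRIT <= z_score(effect, se) < 2.5  # 2.5 is an arbitrary illustration threshold
]

for effect, se in studies:
    print(effect, se, round(significance_edge(se), 3), round(z_score(effect, se), 2))
```

If small, noisy studies keep landing barely past their own edge while big studies find nothing, the effect size is tracking the standard error rather than anything real.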

I liked this example because it shows quite clearly how bias can get in and affect scientific work, and how statistical tools can be used to detect and display what happened. While large numbers of studies should protect against bias, sadly it doesn’t always work that way. 43 studies is a lot, and in this case, it wasn’t enough.

5 Things You Should Know About Study Power

During my recent series on “Why Most Published Research Findings Are False”, I mentioned a concept called “study power” quite a few times. I haven’t talked about study power much on this blog, so I thought I’d give a quick primer for those who weren’t familiar with the term. If you’re looking for a more in-depth primer try this one here, but if you’re just looking for a few quick hits, I gotcha covered:

  1. It’s sort of the flip side of the p-value. We’ve discussed the p-value and how it’s based on the alpha value before, and study power is actually based on a value called beta. If alpha can be thought of as the chance of committing a Type 1 error (false positive), then beta is the chance of committing a Type 2 error (false negative). Study power is actually 1 – beta, so if someone says study power is .8, that means the beta was .2. Setting the alpha and beta values is up to the discretion of the researcher….their values are more about risk tolerance than mathematical absolutes.
  2. The calculation is not simple, but what it’s based on is important. Calculating study power is not easy math, but if you’re desperately curious try this explanation. For most people though, the important part to remember is that it’s based on 3 things: the alpha you use, the effect size you’re looking for, and your sample size. These three can all shift based on the values of the others. As an example, imagine you were trying to figure out if a coin was weighted or not. The more confident you want to be in your answer (alpha), the more times you have to flip it (sample size). However, if the coin is REALLY unfairly weighted (effect size), you’ll need fewer flips to figure that out. Basically the unfairness of a coin weighted to 80-20 will be easier to spot than a coin weighted to 55-45.
  3. It is weirdly underused. As we saw in the “Why Most Published Findings Are False” series, adequate study power does more than prevent false negatives. It can help blunt the impact of bias and the effect of multiple teams, and it helps everyone else trust your research. So why don’t researchers put more thought into it, science articles mention it, or people in general comment on it? I’m not sure, but I think it’s simply because the specter of false negatives is not as scary or concerning as that of false positives. Regardless, you just won’t see it mentioned as often as other statistical issues. Poor study power.
  4. It can make negative (aka “no difference”) studies less trustworthy. With all the current attention on false positive/failure to replicate studies, it’s not terribly surprising that false negatives have received less attention…..but it is still an issue. Despite the fact that study power calculations can tell you how big an effect size you can detect, an odd number of researchers don’t include their calculations. This means a lot of “negative finding” trials could also be suspect. In this breakdown of study power, Stats Done Wrong author Alex Reinhart cites studies that found up to 84% of studies don’t have sufficient power to detect even a 25% difference in primary outcomes. An ASCO review found that 47% of oncology trials didn’t have sufficient power to detect anything but the largest effect sizes. That’s not nothing.
  5. It’s possible to overdo it. While underpowered studies are clearly an issue, it’s good to remember that overpowered studies can be a problem too. They waste resources, and they can also detect effect sizes so small as to be clinically meaningless.
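
The coin example from #2 is small enough to compute exactly, so here is a sketch of it. The setup is my own assumption for illustration: a two-sided test of "the coin is fair" with a symmetric rejection region, using only the Python standard library. Real power calculations for real studies usually go through dedicated software.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k heads in n flips when P(heads) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def upper_critical_value(n, alpha):
    """Smallest head count k with P(X >= k | fair coin) <= alpha / 2."""
    tail = 0.0
    for k in range(n, -1, -1):
        tail += binom_pmf(k, n, 0.5)
        if tail > alpha / 2:
            return k + 1
    return 0

def power(n, true_p, alpha=0.05):
    """Chance of rejecting 'the coin is fair' when the true P(heads) is true_p."""
    k_hi = upper_critical_value(n, alpha)
    k_lo = n - k_hi  # mirror image for the lower tail
    return sum(
        binom_pmf(k, n, true_p)
        for k in range(n + 1)
        if k >= k_hi or k <= k_lo
    )

for true_p in (0.8, 0.55):
    for flips in (50, 500):
        print(true_p, flips, round(power(flips, true_p), 3))
```

Running it shows all three levers at once: the 80-20 coin is caught almost every time with 50 flips, while the 55-45 coin needs far more flips before the test has a decent chance of noticing.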

Okay, so there you have it! Study power may not get all the attention the p-value does, but it’s still a worthwhile trick to know about.

5 Interesting Examples of Self Reporting Bias

News flash! People lie. Some more than others. Now there are all sorts of reasons why we get upset when people don’t tell the truth, but I’m not here to talk about those today. No, today I’m here to give a few interesting examples of where self-reporting bias can really kinda screw up research and how we perceive the world.

Now, self reporting bias can happen for all sorts of reasons, and not all of them are terrible. Some bias happens because people want to make themselves look better, some happens because people really think they do things differently than they do, some happens because people just don’t remember things well and try to fill in gaps. Regardless of the reason, here’s 5 places bias may pop up:

  1. Nutrition/Food Intake. Self reported nutrition data may be the worst example of research skewed by self reporting. For most nutrition/intake surveys, about 67% of respondents give implausibly low answers….an effect that actually shows up cross culturally. Interestingly there are some methods known to improve this (doubly labeled water for example), but they tend to be more expensive and thus are used less often. Unfortunately this effect isn’t random, so it’s hard to know exactly how bad the effect is across the board.
  2. Height. While it’s pretty ubiquitous that people lie about their weight, lying about height is a less recognized but still interesting problem. It’s pervasive in online dating for both men AND women, both of whom exaggerate by about 2 inches. On medical/research surveys we all get slightly more honest, with men overestimating their height by about .5 inches, and women by .33 inches.
  3. Work hours. Know anyone who says they work a 70 hour week? Do they do this regularly? Yeah, they’re probably not remembering that correctly. Edit: My snark got ahead of me here, and I got called out in the comments, so I’m taking it back. I also added some text in bold to clarify what the problem is. When people are asked how much they work per week, they tend to give much higher answers than when they are asked to list out the hours they worked during the week. The more they say they work, the more likely they are to have inflated the number. People who say they work 75+ hours work an average of 50 hours/week, and those who say they work 40 hours/week tend to work about 37. Added: While some professions do actually require crazy hours (especially early in your career….looking at you medical residencies, and first year teachers are notorious for never going home), very few keep this up forever. Additionally, what people work most weeks almost never equals what they work when averaged over the course of a year. That 40 hour a week office worker almost certainly gets some vacation time, and even 2 weeks of vacation and a few paid holidays take that yearly average down to 37 hours per week…and that’s before you add in sick time. Some of this probably gets confusing because of business travel or other “grey areas” like professional development time, but it also speaks to our tendency to remember our worst weeks better than our good ones.
  4. Childhood memories. It is not uncommon in psychological/developmental research that adults will be asked various questions about the state of their life currently while also being queried about their upbringing. This typically leads to conclusions about parenting type x leading to outcome y in children. I was recently reading a paper about various discipline methods and long term outcomes in kids, when I ran across a possible confounder I hadn’t considered: sex differences in the recollection of childhood memories. Apparently overall men are not as good at identifying family dynamics from their childhoods, and the authors wondered if that led to some false findings. They didn’t have direct evidence, but it’s an interesting thing to keep in mind.
  5. Base 10 madness. You wouldn’t think our fingers would cause a reporting bias, but they probably do. Our obsession with doing things in multiples of 5 or 10 probably comes from our use of our hands for counting. When it comes to surveys and self reports, this leads to a phenomenon called “heaping”, where people tend to round their reports to multiples of 5 and 10. There’s some interesting math you can use to try to correct for this, but given that rounding tends to be non-constant (i.e. we round smaller numbers to 5 and larger numbers to 10) this can actually affect some research results.
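
A quick way to see heaping in a data set is to count how many answers end in 0 or 5. If final digits were spread evenly you would expect about 20%, so anything far above that suggests rounding. The sketch below uses made-up survey answers and a deliberately crude check. It is not one of the formal correction methods mentioned above.

```python
EXPECTED_SHARE = 0.2  # 2 of the 10 possible final digits are 0 or 5

def share_multiples_of_five(responses):
    """Fraction of whole-number responses that are multiples of 5."""
    return sum(1 for r in responses if r % 5 == 0) / len(responses)

def heaping_ratio(responses):
    """Observed share of multiples of 5 relative to the roughly 20% expected without rounding."""
    return share_multiples_of_five(responses) / EXPECTED_SHARE

# Hypothetical self-reported weekly work hours, not real survey data.
reported = [40, 45, 50, 38, 40, 60, 55, 42, 50, 40]
share = share_multiples_of_five(reported)
print(share, round(heaping_ratio(reported), 1))
```

A ratio near 1 means no obvious heaping. A ratio of 4, like this toy example, means people are almost certainly giving you round numbers instead of measurements.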

Base 10 aside: one of the more interesting math/pop-culture videos I’ve seen is this one, where they explore why the Simpsons (who have 4 fingers on each hand) still use base 10 counting (7:45 mark):

 

How the Sausage Gets Made and the Salami Gets Sliced

Ever since James the Lesser pointed me to this article about some problems with physics, I’ve been thinking a lot about salami slicing. For those of you who don’t know, salami slicing (aka using the least publishable unit) is the practice of taking one data set and publishing as many papers as possible from it. Some of this is done through data dredging, and some of it is just done by breaking up one set of conclusions into a series of much smaller conclusions in order to publish more papers. This is really not a great practice, as it can give potentially shaky conclusions more weight (500 papers can’t be wrong!) and multiply the effects of any errors in data gathering. This can then have other effects like increasing citation counts for papers or padding resumes.

A few examples:

  1. Physics: I’ve talked about this before, but the (erroneous) data set mentioned here resulted in 500 papers on the same topic. Is it even possible to retract that many?
  2. Nutrition and obesity research: John Ioannidis took a shot at this practice in his paper on obesity research, where he points out that the PREDIMED study (a large randomized trial looking at the Mediterranean diet)  has resulted in 95 published papers.
  3. Medical findings: In this paper, it was determined that nearly 17% of papers on specific topics had at least some overlapping data.

To be fair, not all of this is done for bad reasons. Sometimes grants or other time pressures encourage researchers to release their data in slices rather than in one larger paper. The line between “salami slicing” and “breaking up data into more manageable parts” can be a grey one….this article gives a good overview of some case studies and shows it’s not always straightforward. Regardless, it’s worth keeping in mind if you see multiple studies supporting the same conclusion that you should at least check for independence among the data sources. This paper breaks down some of the less obvious problems with salami slicing:

  1. Dilution of content/reader fatigue: More papers mean a smaller chance anyone will actually read all of them.
  2. Over-representation of some findings: Fewer people will read these papers, but all the titles will make it look like there are lots of new findings.
  3. Clogging journals/peer review: Peer reviewers and journal space are still limited resources. Too many papers on one topic does take resources from other topics.
  4. Increasing author fatigue/sanctions: The paper makes an interesting case that this is actually bad for the authors in the long run. Publishing takes a lot of work, and publishing two smaller papers is twice the work of one. Also, duplicate publishing increases the chance you’ll be accused of self-plagiarism and be sanctioned.

Overall, I think this is one of those possibilities many lay readers don’t even consider when they look at scientific papers. We assume that each paper equals one independent event, and that lots of papers means lots of independent verification. With salami slicing, we do away with this element and increase the noise. Want more? This quick video gives a good overview as well:

On Outliers, Black Swans, and Statistical Anomalies

Happy Sunday! Let’s talk about outliers!

Outliers have been coming up a lot for me recently, so I wanted to put together a few of my thoughts on how we treat them in research. In the most technical sense, outliers are normally defined as any data point that is far outside the expected range for a value. Many computer programs (including Minitab and R) automatically define an outlier as a point that lies more than 1.5 times the interquartile range beyond either end of it (below the first quartile or above the third). Basically any time you look at a data set and say “one of these things is not like the others” you’re probably talking about an outlier.
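
Here is a minimal sketch of that 1.5 times the interquartile range rule using Python’s standard library. The length-of-stay numbers are invented, and I chose the "inclusive" quartile method arbitrarily. Different packages compute quartiles slightly differently, so borderline points can flip depending on the tool.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return values more than k * IQR below the first quartile or above the third."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    return [v for v in values if v < low_fence or v > high_fence]

# Hypothetical hospital stays in days, not real patient data.
stays = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
print(iqr_outliers(stays))
```

The rule only tells you which points deserve a second look. Whether 102 is a typo or a genuinely long stay is a question the math can’t answer for you.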

So how do we handle these? And how should we handle these? Here’s a couple things to consider:

  1. Extreme values are the first thing to go. When you’re reviewing a data set and can’t review every value, almost everyone I know starts by looking at the most extreme values. For example, I have a data set I pull occasionally that tells me how long people stayed in the hospital after their transplants. I don’t scrutinize every number, but I do scrutinize every number higher than 60. While occasionally patients stay in the hospital that long, it’s actually equally likely that some sort of data error is occurring. Same thing for any value under 10 days….that’s not really even enough time to get a transplant done. So basically if a typo or import error led to a reasonable value, I probably wouldn’t catch it. Overly high or low values pretty much always lead to more scrutiny.
  2. Is the data plausible? So how do we determine whether an outlier can be discarded? Well, the first step is to assess whether the data point could plausibly happen. Sometimes there are typos, data errors, someone flat out misread the question, or someone’s just being obnoxious. An interesting example of implausible data points possibly influencing study results was in Mark Regnerus’ controversial gay parenting study. A few years after the study was released, his initial data set was re-analyzed and it was discovered that he had included at least 9 clear outliers….including one guy who reported he was 8 feet tall, weighed 88 lbs, had been married 8 times and had 8 children. When one of your outcome measures is “number of divorces” and your sample size is 236, including a few points like that could actually change the results. Now, 8 marriages is possible, but given the other data points that accompanied it, it is probably not plausible.
  3. Is the number a black swan? Okay, so let’s move out of run of the mill data and into rare events. How do you decide whether or not to include a rare event in your data set? Well….that’s hard to say. There’s been quite a bit of controversy recently over black swan type events….rare extremes like war, massive terrorist attacks or other existential threats to humanity. Basically, when looking at your outliers, you have to consider if this is an area where something sudden, unexpected and massive could happen to change the numbers. It is very unlikely that someone in a family stability study could suddenly get married and divorced 1,000 times, but in public health a relatively rare disease can suddenly start spreading more than usual. Nassim Nicholas Taleb is a huge proponent of keeping an eye on data sets that could end up with a black swan type event, and thinking through the ramifications of this.
  4. Purposefully excluding or purposefully including can both be deceptive. In the recent Slate Star Codex post “Terrorist vs Chairs”, Scott Alexander has two interesting outlier cases that show exactly how easy it is to go wrong with outliers. The first is to purposefully exclude them. For example, since September 12th, 2001, more people in the US have been killed by falling furniture than by terrorist attacks. However, if you move the start line two days earlier to September 10th, 2001, that ratio completely flips by an order of magnitude. The second is to purposefully include them. If you ask how many people die of the flu each year, the average for the last 100 years is 1,000,000. The average for the last 97 years? 20,000. Clearly this is where the black swan thing can come back to haunt you.
  5. It depends on how you want to use your information. Not all outlier exclusions are deceptive. For example, if you work for the New York City Police Department and want to review your murder rate for the last few decades, it would make sense to exclude the September 11th attacks. Most charts you will see do note that they are making this exclusion. In those cases police forces are trying to look at trends and outcomes they can affect….and the 9/11 attacks really weren’t either. However, if the NYPD were trying to run numbers that showed future risk to the city, it would be foolish to leave those numbers out of their calculations. While tailoring your approach based on your purpose can open you up to bias, it also can reduce confusion.
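
To see how much one extreme period can swing an average (the issue behind #4 and #5), here is a tiny sketch. The yearly counts are invented for illustration and are not the flu or terrorism figures quoted above. The point is only that both numbers are "the average" and they tell very different stories.

```python
from statistics import mean

def mean_with_and_without(values, is_extreme):
    """Return (mean of everything, mean after dropping values flagged as extreme)."""
    kept = [v for v in values if not is_extreme(v)]
    return mean(values), mean(kept)

# Hypothetical yearly event counts with one black swan year.
yearly = [20, 22, 19, 21, 18, 3000, 20, 21, 19, 20]
all_years, typical_years = mean_with_and_without(yearly, lambda v: v > 1000)
print(all_years, typical_years)
```

Which number is "right" depends on the question. For planning a typical year, the trimmed average is more useful. For estimating long run risk, dropping the extreme year hides the thing you most need to know about.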

Take it away Grover!

Selection Bias: The Bad, The Ugly and the Surprisingly Useful

Selection bias and sampling theory are two of the most unappreciated issues in the popular consumption of statistics. While they present challenges for nearly every study ever done, they are often seen as boring….until something goes wrong. I was thinking about this recently because I was in a meeting on Friday and heard an absolutely stellar example of someone using selection bias quite cleverly to combat a tricky problem. I get to that story towards the bottom of the post, but first I wanted to go over some basics.

First, a quick reminder of why we sample: we are almost always unable to ask the entire population of people how they feel about something. We therefore have to find a way of getting a subset to tell us what we want to know, but for that to be valid that subset has to look like the main population we’re interested in. Selection bias happens when that process goes wrong. How can this go wrong? Glad you asked! Here’s 5 ways:

  1. You asked a non-representative group. Finding a truly “random sample” of people is hard. Like really hard. It takes time and money, and almost every researcher is short on both. The most common example of this is probably our personal lives. We talk to everyone around us about a particular issue, and discover that everyone we know feels the same way we do. Depending on the scope of the issue, this can give us a very flawed view of what the “general” opinion is. It sounds silly and obvious, but if you remember that many psychological studies rely exclusively on W.E.I.R.D. college students for their results, it becomes a little more alarming. Even if you figure out how to get in touch with a pretty representative sample, it’s worth noting that what works today may not work tomorrow. For example, political polling took a huge hit after the introduction of cell phones. As young people moved away from landlines, polls that relied on them got less and less accurate. The selection method stayed the same, it was the people that changed.
  2. A non-representative group answered. Okay, so you figured out how to get in touch with a random sample. Yay! This means good results, right? No, sadly. The next issue we encounter is when your respondents mess with your results by opting in or opting out of answering in ways that are not random. This is non-response bias, and basically it means “the group that answered is different from the group that didn’t answer”. This can happen in public opinion polls (people with strong feelings tend to answer more often than those who feel more neutral) or even by people dropping out of research studies (our diet worked great for the 5 out of 20 people who actually stuck with it!). For health and nutrition surveys, people also may answer based on how good they feel about their response, or how interested they are in the topic. This study from the Netherlands, for example, found that people who drink excessively or abstain entirely are much less likely to answer surveys about alcohol use than those who drink moderately. There are some really interesting ways to correct for this, but it’s a chronic problem for people who try to figure out public opinion.
  3. You unintentionally double counted. This example comes from the book Teaching Statistics by Gelman and Nolan. Imagine that you wanted to find out the average family size in your school district. You randomly select a whole bunch of kids and ask them how many siblings they have, then average the results. Sounds good, right? Well, maybe not. That strategy will almost certainly end up overestimating the average number of siblings, because large families are by definition going to have a better chance of being picked in any sample. Now this can seem obvious when you’re talking explicitly about family size, but what if it’s just one factor out of many? If you heard “a recent study showed kids with more siblings get better grades than those without” you’d have to go pretty far into the methodology section before you might realize that some families may have been double (or triple, or quadruple) counted.
  4. The group you are looking at self selected before you got there. Okay, so now that you understand sampling bias, try mixing it with correlation and causation confusion. Even if you ask a random group and get responses from everyone, you can still end up with discrepancies between groups because of sorting that happened before you got there. For example, a few years ago there was a Pew Research survey that showed that 4 out of 10 households had female breadwinners, but that those female breadwinners earned less than male breadwinner households. However, it turned out that there were really 2 types of female breadwinner households: single moms and wives who outearned their husbands. Wives who outearned their husbands made about as much as male breadwinners, while single mothers earned substantially less. None of these groups are random, so any differences between them may have already existed.
  5. You can actually use all of the above to your advantage. As promised, here’s the story that spawned this whole post: Bone marrow transplant programs are fairly reliant on altruistic donors. Registries that recruit possible donors often face a “retention” problem….i.e. where people initially sign up, then never respond when they are actually needed. This is a particularly big problem with donors under the age of 25, who for medical reasons are the most desirable donors. Recently a registry we work with at my place of business told us their new recruiting tactic used to mitigate this problem. Instead of signing people up in person for the registry, they get minimal information from them up front, then send them an email with further instructions about how to finish registering. They then only sign up those people who respond to the email. This decreases the number of people who end up registering to be donors, but greatly increases the number of registered donors who later respond when they’re needed. They use selection bias to weed out those who were least likely to be responsive….aka those who didn’t respond to even one initial email. It’s a more positive version of the Nigerian scammer tactic.
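
The double counting problem from #3 is worth seeing with numbers. The sketch below uses a made-up district of five families and compares the true average children per family with what you would get, on average, by asking randomly chosen kids about their own family. A family with 5 kids is five times as likely to be reached as a family with 1, and the second function builds that weighting in.

```python
def average_per_family(children_per_family):
    """True average: every family counts once."""
    return sum(children_per_family) / len(children_per_family)

def average_per_sampled_child(children_per_family):
    """Expected answer when sampling kids: a family of size s is reported s times."""
    total_children = sum(children_per_family)
    return sum(size * size for size in children_per_family) / total_children

# Hypothetical district: number of children in each of five families.
families = [1, 1, 2, 3, 5]
print(average_per_family(families))
print(round(average_per_sampled_child(families), 2))
```

Same district, same kids, and the kid-based estimate comes out nearly a full child higher. No amount of extra sample size fixes that, because the bias is baked into who gets asked.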

Selection bias can seem obvious or simple, but since nearly every study or poll has to grapple with it, it’s always worth reviewing. I’d also be remiss if I didn’t include a link here for those ages 18 to 44 who might be interested in registering to be a potential bone marrow donor.