Predictions and Accuracy: Some Technical Terms

Okay, so we started off our discussion of Statistical Tricks and Treats with a general post about contingency matrices and the two ways to be wrong, and then followed up with a further look at how the base rate involved in the matrix can skew your perception of what a positive or negative test means.

In this post I’m going to take another look at the contingency matrix and define a few words you may hear associated with them. But first, let’s take another look at that matrix from last week:

[Contingency table: rows are drugs present / drugs not present, columns are positive / negative test result, with the correct outcomes highlighted in green]

Accuracy: Accuracy is the overall chance that your test is correct in either direction. So we’d have this:

Number of correct search warrants + number of innocent people left alone
Total number of tests run

Note: this SHOULD be the definition used. People will often try to wiggle out of this by saying “it’s accurate 99% of the time when drugs are present!” They are hoping the word “accurate” distracts you from their failure to mention what happens when drugs aren’t present. This is the type of sales pitch that leads to innocent people getting arrested and to cops who had no idea they were using a test that was likely to be wrong.

Sensitivity:  When that hypothetical sales person up there said “it’s accurate 99% of the time when drugs are present!”, what they were actually giving you was the sensitivity of the test. It (along with specificity) answers the question “How often does the test do what we want it to do?” The sensitivity is also called the true positive rate, and it’s basically this:

Correct warrants/arrests
Correct warrants/arrests + bad guys who got away with it

In other words, it’s the number of “correct” positives divided by the total number of cases where drugs really were present. Another way of looking at it: it’s the number in the green box in the top row over the total number in the top row. A high percentage here means you’ve minimized false negatives.

Specificity: This is the opposite of sensitivity, and in this example it’s the one the sales person is trying not to mention. This is how accurate the test is when drugs are NOT present, aka the true negative rate. It looks like this:

Number of times you left an innocent person alone
Number of times you left an innocent person alone + harassed innocent people whose trash was tested

Basically it’s the number of correctly negative tests divided by the number of total negative tests. It’s also the number in the green box in the bottom row over the total number in the bottom row. A high percentage here means you’ve minimized false positives.

Positive Predictive Value: Okay, so both sensitivity and specificity dealt with rows, and this one deals with columns. Positive predictive value is a lot of what I talked about in my base rate post: if you get a positive test, what are the chances that it’s correct?

As we covered last week, it’s this:

Correct search warrants/arrests
Correct search warrants/arrests + incorrect warrants/arrests

In other words, given that we think we’ve found drugs, what are the chances that we actually have? It’s the green box in the first column over the total number in the first column. This is the one the base rate can mess with BIG time. You see, when companies that develop tests put them on the market, they can’t know what type of population you’re going to use them on. This value is unknown until you start to use it. A high value here means you’ve minimized false positives.

Negative Predictive Value: The flip side of the positive predictive value, this is about the second column. Given that we got a negative test, what are the chances there are no drugs? This is:

Innocent people who go unbothered
Innocent people who go unbothered + bad guys who get away with it

So for the second column, it’s the number in the green box over the total for the second column. A high value here means you’ve minimized false negatives.

So to recap:

Sensitivity and Specificity:

  1. Answer the question “how does the test perform when drugs are/are not present”
  2. Refer to the rows (at least in this table set up)
  3. High sensitivity = low false negatives, low sensitivity = lots of false negatives
  4. High specificity = low false positives, low specificity = lots of false positives
  5. Information about how “accurate” one of the values is does not give the whole picture

Positive and Negative Predictive value (PPV and NPV):

  1. Answer the question “Given a positive/negative test result, what are the chances drugs are/are not actually present?”
  2. Refer to columns (at least in this table set up)
  3. High PPV = low false positives, low PPV = high false positives
  4. High NPV = low false negatives, low NPV = high false negatives
  5. Can be heavily influenced by the rate of the underlying condition (in this case drug use) in the population being tested (base rate)
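
If it’s easier to see all of that as code, here’s a minimal sketch in Python that computes all five numbers from the four cells of the matrix (the function and variable names are just shorthand I made up for this post, not anything standard):

```python
def matrix_metrics(tp, fp, fn, tn):
    """Compute the metrics above from the four cells of a 2x2 contingency matrix.

    tp: correct warrants/arrests        (drugs present, positive test)
    fn: bad guys who got away with it   (drugs present, negative test)
    fp: harassed innocent people        (no drugs, positive test)
    tn: innocent people left alone      (no drugs, negative test)
    """
    total = tp + fp + fn + tn
    return {
        "accuracy":    (tp + tn) / total,  # correct in either direction / all tests
        "sensitivity": tp / (tp + fn),     # true positive rate: the top row
        "specificity": tn / (tn + fp),     # true negative rate: the bottom row
        "ppv":         tp / (tp + fp),     # first column: given a positive test, is it right?
        "npv":         tn / (tn + fn),     # second column: given a negative test, is it right?
    }
```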

 

All About that Base Rate

Of all the statistical tricks or treats I like to think about, the base rate (and its associated fallacy) is probably the most interesting to me. It’s a common fallacy, in large part because it requires two steps of math to work out what’s going on. I’ve referenced it before, but I wanted a definitive post where I walked through what a base rate is and why you should remember it exists. Ready? Let’s go.

First, let’s find an example.
Like most math problems, this one will be a little easier to follow if we use an example.

In my Intro to Internet Science series, I mentioned the troubling case of a couple of former CIA analysts whose house was raided by a SWAT team after they were spotted shopping at the wrong garden store. After spotting the couple purchasing what they thought was marijuana growing equipment, the police had tested their trash cans for the presence of drugs. Twice the police got a positive test result, and thus felt perfectly comfortable raiding the house and holding the parents and kids at gunpoint for two hours while they searched for the major marijuana growing operation they believed the couple was running. In the end it was determined the couple was actually totally innocent. There’s a lot going on with this story legally, but what was up with those positive drug tests?

Let’s make a contingency table!
In last week’s post, I discussed the fact that there is almost always more than one way to be wrong. A contingency table helps us visualize the various possibilities that can arise from the two different types of test results and the two different realities:

[Contingency table: rows are drugs present / drugs not present, columns are positive / negative test result]

So here we have four options, two good and two bad:

  1. True positive (yes/yes): we have evidence of actual wrongdoing[1]
  2. False negative (no/yes): someone with drugs appears innocent
  3. False positive (yes/no): someone without drugs appears guilty
  4. True negative (no/no): an innocent person’s innocence is confirmed

In this case, we ended up with a false positive, but how often does that really happen? Is this just an aberration or something we should be concerned about?

Picking between the lesser of two evils.
Before we go on, let’s take a step back for a minute and consider what the police department may have had to weigh when they selected a drug screening test to use. It’s important to recognize that in this situation (as in most of life), you actually do have some discretion over which way you choose to be wrong. In a perfect world we’d have unlimited resources to buy a test that gets the right answer every time, but in the real world we often have to go the cheap route, consider the consequences of each type of error, and make trade-offs.

For example, in medicine false positives are almost always preferable to false negatives. Most doctors (and patients!) would prefer that a screening test told them they might have a disease that they did not have (false positive) than to have a screening test miss a disease they did have (false negative).

In criminal justice, there is a similar preference. Police would rather have evidence of activity that didn’t happen (false positive) than not get evidence when a crime was committed (false negative).

So what kind of trade-offs are we talking about?
Well, in the article I linked to above, it mentioned that one of the downfalls of the drug tests many police departments use is a very high false positive rate…..as high as 70%. This means that if you tested 100 trashcans that were completely free of drugs, you’d get a positive test for 70 of them.

Well that sounds pretty bad….so is that the base rate you were talking about?
No, but it is an important rate to keep in mind because it influences the math in ways that aren’t particularly intuitive for most people. For example, if we test 1000 trash cans, half with drugs and half without, here’s what we get:
|               | Positive test | Negative test | Total |
|---------------|---------------|---------------|-------|
| Drugs present | 495           | 5             | 500   |
| No drugs      | 350           | 150           | 500   |

When the police are out in the field, they get exactly one piece of information: whether or not the trash can tested positive for drugs. In order to use this information, we actually have to calculate what that means. In the above example, we have 495 true positive trash cans with drugs in them. We also have 350 false positive trash cans with no drugs in them, but with a positive test. So overall, we have 845 trash cans with a positive test. 495/845 is about 59%…..so under these circumstances, a positive test only means drugs are present about 59% of the time.
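
If you want to double-check that arithmetic, it’s a short calculation (a quick sketch using the counts from the table above):

```python
# 1000 trash cans: 500 with drugs, 500 without
true_positives = 495    # 99% of the 500 cans that actually contain drugs
false_positives = 350   # 70% of the 500 clean cans
ppv = true_positives / (true_positives + false_positives)
print(round(ppv, 3))    # 0.586 -> a positive test means drugs roughly 59% of the time
```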

Now about that base rate……
Okay, so none of that is great, but it actually can get worse. You see, the rates of those who do and don’t use drugs aren’t actually equal. The number of people who don’t use drugs is much, much higher, and this is the base rate I was talking about before.

According to many reports, about 10% of the US adult population used illegal drugs in the past month (mostly marijuana, FYI….not controlled for states that have legalized it). Presumably this means that about 10% of trash cans might contain drugs at any given time. That makes our numbers look like this:

|               | Positive test | Negative test | Total |
|---------------|---------------|---------------|-------|
| Drugs present | 99            | 1             | 100   |
| No drugs      | 630           | 270           | 900   |

Using the same math as above, we get 99/(630+99) = 14%. Now we realize that for every positive test, there’s actually only about a 14% chance there are drugs in that trash can. I’m somewhat curious how much worse that is than just having a trained police officer take a look.  In fact, because the base rates are so different, you actually would need a test with an 11% false positive rate (as compared to the 70% we currently have) to make the chances 50/50 that your test is telling you what you think it’s telling you. Yikes.
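
If you’d like to play with these numbers yourself, here’s a small sketch that redoes the math for any base rate and backs out that break-even false positive rate (the function name and default values are just my own, picked to match this example):

```python
def ppv(base_rate, sensitivity=0.99, false_positive_rate=0.70, n=1000):
    """Chance that a positive test really means drugs, for a given base rate."""
    cans_with_drugs = n * base_rate
    cans_without_drugs = n * (1 - base_rate)
    true_positives = cans_with_drugs * sensitivity
    false_positives = cans_without_drugs * false_positive_rate
    return true_positives / (true_positives + false_positives)

print(round(ppv(0.5), 2))  # ~0.59: the 50/50 table above
print(round(ppv(0.1), 2))  # ~0.14: the 10% base rate table above

# Break-even: PPV hits 50% when false positives equal true positives,
# i.e. when false_positive_rate = base_rate * sensitivity / (1 - base_rate)
print(round(0.10 * 0.99 / 0.90, 2))  # ~0.11, the 11% mentioned above
```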

Now of course these numbers only hold if you’re testing trash cans randomly….but if you’re testing the garbage of everyone who goes to a garden store on a Saturday morning, that may be a little closer to the truth than you want to admit.

So what’s the takeaway?
The crux of the base rate fallacy is that a small percentage of a large number can easily be larger than a large percentage of a small number. This is basic math, but it becomes hard to remember when you’re in the moment and the information is not being presented in a straightforward way. If a math test asked you “Which value is larger….11% of 900 or 99% of 100?” you’d probably get it right pretty quickly. However, when it’s up to you to remember what the base rate is, this problem gets much, much harder. In fact, the vast majority of medical doctors don’t get this type of problem correct when it’s presented to them and they’re specifically given the base rate….so my guess is the general population success rate is quite low.

No matter how accurate a test is, if the total number of entries in one of the rows (or columns) is much larger than the total of the other, you should watch out for this.

Base rate matters.
[1] Note for the libertarians: It is beyond the scope of this post to discuss current drug policy and whether or not this should actually constitute wrongdoing. Just roll with it.

Two Ways To Be Wrong

One of the most interesting things I’ve gotten to do since I started blogging about data/stats/science is to go to high school classrooms and share some of what I’ve learned. I started with my brother’s Environmental Science class a few years ago, and that has expanded to include other classes at his school and some other classes elsewhere. I often get more out of these talks than the kids do…something about the questions and immediate feedback really pushes me to think about how I present things.

Given that, I was intrigued by a call I got from my brother yesterday. We were talking a bit about science and skepticism, and he mentioned that as the year wound down he was having to walk back some of what I had presented to his class at the beginning of the year. The problem, he said, was not that the kids had failed to grasp the message of skepticism…but rather that they had grasped it too well. He had spent the year attempting to get kids to think critically, and was now hearing his kids essentially claim it was impossible to know anything because everything could be manipulated.

Oops.

I was thinking about this after we hung up, and how important it is not to leave the impression that there’s only one way to be wrong. In most situations that need a judgment call, there are actually two ways to be wrong. Stats and medicine have a really interesting tool for showing this phenomenon: a 2×2 contingency matrix. Basically, you take two different conditions and sort how often they agree or disagree and under what circumstances those happen.

For example, for my brother’s class, this is the contingency matrix:

[Contingency table: whether an idea is actually true vs. whether we believe it]

In terms of outcomes, we have 4 options:

  1. True Positive: Believing a true idea (brilliant early adopter).
  2. False Negative (Type II error): Not believing a true idea (in denial/impeding progress).
  3. False Positive (Type I error): Believing a false idea (gullible rube).
  4. True Negative: Not believing a false idea (appropriately skeptical).

Of those four options, #2 and #3 are the two we want to avoid. In those cases the reality (true or not) clashes with the test (in this case our assessment of the truth). In my talk and my brother’s later lessons, we focused on eliminating #3. One way of doing this is to be more discerning about what we do and don’t believe, but many people can leave with the impression that disbelieving everything is the way to go. While that will absolutely reduce the number of false positive beliefs, it will also increase the number of false negatives. Now, depending on the field this may not be a bad thing, but overall it’s just substituting one lack of thought for another. What’s trickier is to stay open to evidence while also being skeptical.

It’s probably worth mentioning that not everyone gets into these categories honestly…some people believe a true thing pretty much by accident, or fail to believe a false thing for bad reasons. Every field has an example of someone who accidentally ended up on the right side of history. There also aren’t always just two possibilities; many scientific theories have shades of gray.

Caveats aside, it’s important to at least raise the possibility that not all errors are the same. Most of us have a bias towards one error or another, and will exhort others to avoid one at the expense of the other. However, for both our own sense of humility and the full education of others, it’s probably worth keeping an eye on the other way of being wrong.

Proof: Using Facts to Deceive (Part 7)

Note: This is part 7 in a series for high school students about reading and interpreting science on the internet. Read the intro and get the index here, or go back to Part 6 here.

Okay, now we come to the part of the talk that is unbelievably hard to get through quickly. This is really a whole class, and I will probably end up putting some appendices on this series just to make myself feel better. If the only thing I ever do in life is to teach as many people as possible the base rate fallacy, I’ll be content. Anyway, this part is tough because I at least attempt to go through a few statistical tricks that actually require some explaining. This could be my whole talk, but I’ve decided against it in favor of some of the softer stuff. Anyway, this part is called:

Crazy Stats Tricks: False Positives, Failure to Replicate, Correlations, Etc

Okay, so what’s the problem here?

Shenanigans, chicanery, and folks otherwise not understanding statistics and numbers. I’ve made reference to some of these so far, but here’s a (not-comprehensive) list:

  1. Changing the metric (ie using growth rates vs absolute rates, saying “doubled” and hiding the initial value, etc)
  2. Correlation and causation confusion
  3. Failure to Replicate
  4. False Positives/False Negatives

They each have their own issues. Number 1 deceives by confusing people, Number 2 makes people jump to conclusions, Number 3 presents splashy new conclusions that no one can make happen again, and Number 4 involves too much math for most people but yields some surprising results.

Okay, so what kind of things should we be looking out for?

Well each one is a little different. I touched on 1 and 2 a bit previously with graphs and anecdotes. For failure to replicate, it’s important to remember that you really need multiple papers to confirm findings, and having one study say something doesn’t necessarily mean subsequent studies will say the same thing. The quick overview though is that many published studies don’t bear out. It’s important to realize that any new shiny study (especially psychology or social science) could turn out to not be reproducible, and the initial conclusions invalid. This warning is given as a boilerplate “more research is needed” at the end of articles, but it’s meant literally.

False positives/negatives are a different beast that I wish more people understood.  While this applies to a lot of medical research, it’s perhaps clearest to explain in law enforcement.  An example:

In 2012, a (formerly CIA) couple was in their home getting their kids ready for school when they were raided by a SWAT team. They were accused of being large scale marijuana growers, and their home was searched. Nothing was found. So why did they come under investigation? Well, it turns out they had been seen buying gardening equipment frequently used by marijuana growers, and the police had then tested their trash for drug residue. The trash tested positive twice, and the police raided the house.

Now if I had heard this reported in a news story, I would have thought that was all very reasonable. However, the couple eventually discovered that the drug test used on their trash has a 70% false positive rate. Even if their trash had been perfectly fine, there was still roughly a 50% chance they’d get two positive tests in a row (and that assumes nothing specific in their trash was triggering the test). So given a street with ZERO drug users, you could have found evidence to raid half the houses. The worst part of this is that the courts ruled that the police themselves were not liable for not knowing that the test was that inaccurate, so their assumptions and treatment of the couple were deemed acceptable. Whether that’s okay is a matter for legal experts, but we should all feel a little uneasy that we’re more focused on how often our tests get things right than how often they’re wrong.
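
To put a rough number on that (a back-of-the-envelope sketch that treats the two tests as independent, which is probably generous to the test):

```python
false_positive_rate = 0.70

# Chance that a completely drug-free trash can comes back positive twice in a row,
# assuming the two tests are independent of each other
two_positives_in_a_row = false_positive_rate ** 2
print(round(two_positives_in_a_row, 2))  # 0.49 -> on a street with zero drug users,
                                         # roughly half the houses would clear the
                                         # "two positive tests" bar for a raid
```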

Why do we fall for this stuff?

Well, some of this is just a misunderstanding or lack of familiarity with how things work, but the false positive/false negative issue is a very specific type of confirmation bias. Essentially we often don’t realize that there is more than one way to be wrong, and in avoiding one inaccuracy, we increase our chances of a different type of inaccuracy. In the case of the police departments using the inaccurate tests, they likely wanted something that would detect drugs when they were present. They focused on making sure they’d never get a false negative (ie a test that said no drugs when there were). This is great, until you realize that they traded that for lots of innocent people potentially being searched. In fact, since there are more people who don’t use drugs than people who do, the chance that someone with a positive test doesn’t have drugs is actually higher than the chance that they do….that’s the base rate fallacy I was talking about earlier.

To further prove this point, there’s an interesting experiment called the Wason Selection task that shows that when it comes to numbers in particular, we’re especially vulnerable to only confirming an error in one direction. In fact 90% of people fail this task because they only look at one way of being wrong.

Are you confused by this? That’s pretty normal. So normal in fact that the thing we use to keep it all straight is literally called a confusion matrix and it looks like this:
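
|                   | Test positive  | Test negative  |
|-------------------|----------------|----------------|
| Condition present | True positive  | False negative |
| Condition absent  | False positive | True negative  |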

If you want to do any learning about stats, learn about this guy, because it comes up all the time. Very few people can do this math well, and that includes the majority of doctors. Yup, the same people most likely to tell you “your test came back positive” frequently can’t accurately calculate how worried you should really be.

So what can we do about it?

Well, learn a little math! Like I said, I’m thinking I need a follow up post just on this topic so I have a reference for this. However, if you’re really not mathy, just remember this: there’s more than one way to be wrong. Any time you reduce your chances of being wrong in one direction, you probably increase them in another. In criminal justice, if we make sure we never miss a guilty person, we might also increase the number of innocent people we falsely accuse. The reverse is also true. Tests, screenings, and judgment calls aren’t perfect, and we shouldn’t fool ourselves into thinking they are.

Alright, on that happy note, I’ll bid you adieu for now. See ya next week!

Read Part 8 here.

Lance Armstrong and False Positives

Well the talk went well.

I’m waiting for the official rating (people fill out anonymous evals), but there seemed to be a lot of interest….and more importantly I got quite a few compliments on the unique approach.  Giving people something new in the “how to get along” genre was my goal, so I was pleased.

Between that and having 48 hours to pull together another abstract for submission to a transplant conference, posting got slow.

It was interesting though….the project I was writing the abstract for was about a new test we introduced that saved patients over an hour of waiting time IF it came out above a certain level. We had hours of discussion about where that level should be, ultimately deciding that we had to minimize false positives (times when the test said they passed but a better test said they failed) at the cost of driving up false negatives (times when the test said they failed, but they really hadn’t). We have to perform the more accurate test regardless, so it was a choice between having a patient wait unnecessarily, or having them start an expensive, uncomfortable procedure unnecessarily. Ethically and reasonably, we decided most patients would rather find out they’d waited when they didn’t have to than that they’d gotten an entirely unnecessary procedure.
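
To make that trade-off concrete, here’s a toy sketch of the cutoff decision; the scores, labels, and cutoff values below are invented for illustration and are not the actual test or data:

```python
# Hypothetical (quick-test score, result of the slower gold-standard test) pairs -- made up
quick_test_results = [
    (0.2, "fail"), (0.4, "fail"), (0.5, "pass"), (0.6, "fail"),
    (0.7, "pass"), (0.8, "pass"), (0.9, "pass"),
]

def error_counts(cutoff):
    """Count both error types if we call anything at or above the cutoff a 'pass'."""
    false_pos = sum(1 for score, truth in quick_test_results
                    if score >= cutoff and truth == "fail")
    false_neg = sum(1 for score, truth in quick_test_results
                    if score < cutoff and truth == "pass")
    return false_pos, false_neg

for cutoff in (0.5, 0.7, 0.9):
    fp, fn = error_counts(cutoff)
    print(f"cutoff {cutoff}: {fp} false positives, {fn} false negatives")
# Raising the cutoff trades false positives for false negatives, which is
# exactly the decision we spent hours arguing about.
```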

I bring all this up both to excuse my absence and to say I was fascinated by Kaiser Fung’s take on Lance Armstrong. He goes in depth about anti-doping tests, hammering on the point that testing agencies will accept high false negatives to minimize false positives. It would ruin their credibility to falsely accuse someone, so we have to presume many many dopers test clean at various points in time. It follows, then, that clean tests mean fairly little, while other evidence means quite a lot.

I thought that was an interesting point, one I had certainly not heard covered.

Also, as any Orioles fan (or someone who lives with one) would know, I have good reason to want Raul Ibanez tested right now.

More posts this week than last, I promise.