Stats in the News: February 2017

I’ve had a couple of interesting stats-related news articles forwarded to me recently, both of which are worth a look for anyone interested in the way data and stats shape our lives.

First they came for the guys with the data

This one comes from the confusing world of European economics, and is accompanied by the rather alarming headline “Greece’s Response to Its Resurgent Debt Crisis: Prosecute the Statistician” (note: WSJ articles are behind a paywall; Google the first sentence of the article to access it for free). The article covers the rather concerning story of how Greece attempted to clean up its (notoriously wrong) debt estimates, only to turn around and prosecute the statistician it hired to do so. Unsurprisingly, things soured when his calculations showed the debt was even worse than previously reported and were used to justify austerity measures. He’s been tried 4 times with no mathematical errors found, and it appears that he adhered to standard EU accounting conventions in all cases. Unfortunately he still has multiple cases pending, and in at least one he’s facing life in prison.

Now I am not particularly a fan of economic data. That’s partly because I’m not trained in that area, and partly because it appears to be some of the most easily manipulated data there is. The idea that someone could come up with a calculation standard that was unfair or favored one country over others is not crazy. There are a million ways of saying “this assumption here is minor and reasonable but that assumption there is crazy and you’re deceptive for making it”. There’s nothing that guarantees the EU-recommended way of doing things was fair or reasonable, other than the EU’s own claim that it is. Greece could have been screwed by German recommendations for debt calculations, I don’t know. However, prosecuting the person who did the calculations, as opposed to vigorously protesting the accounting conventions, is NOT the way to make your point….especially when he was literally hired to clean up known accounting tricks you never prosecuted anyone for.

Again, I have no idea who’s right here, but I do tend to believe (with all due respect to Popehat) that vagueness in data complaints is the hallmark of meritless thuggery. If your biggest complaint about a statistic is its outcome, then I begin to suspect your complaint is not actually a statistical one.

Safety and efficacy in Phase 1 clinical trials

The second article I got forwarded was an editorial from Nature, calling for an increased focus on efficacy in Phase 1 clinical trials. For those of you not familiar with the drug development world, Phase 1 trials currently only look at drug safety, without having to consider whether or not the drug works. About half of all drugs that proceed to Phase 2 or Phase 3 end up failing to demonstrate ANY efficacy.

The Nature editorial was spurred by a safety trial that went terribly wrong and ended up damaging almost all of the previously healthy volunteers. Given that there are a limited number of people willing to sign up to be safety test subjects, this is a big issue. Previously the general consensus had been to let companies decide what was and was not worth proceeding with, believing that market forces would get companies to screen the drugs they were testing. However, some recent safety failures, along with recent publications showing how often statistical manipulations are used to push drugs along, have called this into question. As we saw in our “Does Popularity Influence Reliability” series, this effect will likely be worse the more widely studied the topic is.

It should be noted that major safety failures and/or damage from experimental drugs are fairly rare, so much of this is really a resource or ethics debate. Statistically though, it also speaks to increasing the pre-study odds we talked about in the “Why Most Published Research Findings are False” series. If we know that low pre-study odds are likely to lead to many false positives, then raising the bar for pre-study odds seems pretty reasonable. At the very least companies should have to submit a calculation, along with the rationale. I still maintain this should be a public function of professional associations.
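To see why pre-study odds matter so much, here’s a minimal sketch (in Python) of the positive predictive value formula from the Ioannidis framework. This is my illustration, not anything from the Nature editorial, and the alpha and power values are assumptions chosen for the example:

```python
# Back-of-envelope: how believable is a positive finding, given the
# pre-study odds R that a tested relationship is real?
# PPV = (1 - beta) * R / ((1 - beta) * R + alpha), per Ioannidis (2005).

def positive_predictive_value(pre_study_odds, alpha=0.05, power=0.8):
    """Probability that a claimed positive finding is actually true."""
    return (power * pre_study_odds) / (power * pre_study_odds + alpha)

# Exploratory drug candidates plausibly sit at the low end of this range:
for odds in [0.01, 0.1, 0.5, 1.0]:
    ppv = positive_predictive_value(odds)
    print(f"pre-study odds {odds:>4}: PPV = {ppv:.0%}")
# pre-study odds 0.01: PPV = 14%
# pre-study odds  0.1: PPV = 62%
# pre-study odds  0.5: PPV = 89%
# pre-study odds  1.0: PPV = 94%
```

With long-shot candidates (odds of 1 in 100), even a clean positive result is probably a false positive, which is exactly why requiring a stated pre-study rationale isn’t just paperwork.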

Does Popularity Influence Reliability? Methods and Results

Welcome to “Papers in Meta Science”, where we walk through published papers that use science to scrutinize science. At the moment we’re taking a look at the paper “Large-Scale Assessment of the Effect of Popularity on the Reliability of Research” by Pfeiffer and Hoffmann. Read the introduction here.

Okay, so when we left off last time, we were discussing the idea that findings in (scientifically) popular fields were less likely to be reliable than those in less popular fields. The theory goes that popular fields have more false positives (due to an overall higher number of experiments being run), that increased competition increases things like p-hacking and data dredging on the part of research teams, or both.

Methods: To test this hypothesis empirically, the researchers decided to look at the exciting world of protein interactions in yeast. While this is not what most people think of when they think of “popular” research, it’s actually a great choice. Since the general public is probably indifferent to protein interactions, all the popularity studied here will be purely scientific. Any bias the researchers picked up will be from their scientific training, not their own pre-conceived beliefs.

To get data on protein interactions, the researchers pulled large data sets that cast a wide net and smaller data sets that looked for specific proteins, then compared the results between the two. The thought was that the large data sets, which tested huge numbers of interactions all using the same algorithm, would be less likely to be biased by human judgement, and could therefore be used to confirm or cast doubt on the smaller experiments that required more human intervention.

Thanks to the wonders of text mining, the sample size here was HUGE – about 60,000 statements/conclusions made about 30,000 hypothesized interactions. The smaller data sets had about 6,000 statements/conclusions about 4,000 interactions.

Results: The overall results showed some interesting differences in confirmation rates:

Basically, the more popular an interaction, the more often the interaction was confirmed. However, the more popular an interaction partner was, the less often interactions involving it were confirmed. Confused? Try this analogy: think of the popular proteins as the popular kids in school. The popular kids themselves were fairly easy to identify, and researchers got them right a lot of the time. However, once researchers tried to turn that around and figure out who interacted with the popular kids, they started getting a lot of false positives. Just like the less-cool kids in high school might overplay their relationship to the cooler kids, many researchers tried to tie their new findings to previously recognized popular findings.

This held true for both the “inflated error effect” and the “multiple testing effect”. In other words, having a popular protein involved made the individual statements or conclusions less likely to be validated, and also produced more interactions that were reported once but never replicated. This held true across all types of experimental techniques, and it held true in databases curated by experts as well as in broader searches.

We’ll dive in to the conclusions we can draw from this next week.

Who Votes When? Untangling Non-Citizen Voting

Right after the election, most people in America saw or heard about this Tweet from then-President-elect Trump, claiming that millions of people had voted illegally:

I had thought this was just random bluster (on Twitter????? Never!), but then someone sent me this article. Apparently that comment was based on an actual study, and the study author is now giving interviews. It turns out he’s pretty unhappy with everyone….not just with Trump, but also with Trump’s opponents who claim that no non-citizens voted. So what did his study actually say? Let’s take a look!

Some background: The paper this is based on is called “Do Non-Citizens Vote in US Elections?” by Richman et al, and was published back in 2014. It took data from a YouGov survey and found that 6.4% of non-citizens voted in 2008 and 2.2% voted in 2010. Non-citizenship status was based on self-report, as was voting status, though the demographic data of participants was checked against that of their stated voting district to make sure the numbers at least made sense.

So what stood out here? A few things:

  1. The sample size. While the initial survey of voters was pretty large (over 80,000 respondents between the two years), the number of those identifying themselves as non-citizens was rather low: 339 and 489 for the two years. A total of 48 people stated both that they were not citizens and that they voted. As a reference, it seems there are about 20 million non-citizens currently residing in the US (I run a rough margin-of-error calculation on these numbers right after the list).
  2. People didn’t necessarily know they were voting illegally. One of the interesting points made in the study was that some of this voting may be unintentional. If you are not a citizen, you are never allowed to vote in national elections, even if you are a permanent resident/have a green card. The study authors wondered if some people didn’t know this, so they analyzed the education levels of those non-citizens who voted. It turns out non-citizens with less than a high school degree are more likely to vote than those with more education. This is the opposite of the trend seen among citizens AND naturalized citizens, suggesting that some of those voters had no idea that what they were doing was illegal.
  3. Voter ID checks are less effective than you might think. If your first question upon reading #2 was “how could you just illegally vote and not know it?”, you may be presuming your local polling place puts more effort into screening people than it does. According to the participants in this study, not only were non-citizens allowed to register and cast a vote, but a decent number of them actually passed an ID check first. About a quarter of non-citizen voters said they were asked for ID prior to voting, and 2/3rds of those said they were then allowed to vote. I suspect the issue is that most polling places don’t actually have much to check their information against. Researching citizenship status would take time and money that many places just don’t have. Another interesting twist is that social desirability bias may kick in for those who don’t know voting is illegal. Voting is one of those things more people say they do than actually do, so someone who didn’t know they couldn’t legally vote would be more likely to say they voted even if they didn’t. Trying to make ourselves look good is a universal quality.
  4. Most of the illegal voters were white. Non-citizen voters actually tracked pretty closely with their share of the overall population, and about 44% of them were white. The next most common demographic was Hispanic at 30%, then black, then Asian. In terms of proportion, the same percentage of white non-citizens voted as Hispanic non-citizens.
  5. Non-citizens are unlikely to sway a national election, but could sway state-level elections. When Trump originally referenced this study, he was specifically using it to discuss national popular vote results. In the Wired article, they do the math and find that even if all of the numbers in the study bear out, non-citizen votes would not sway the national popular vote. However, the original study drilled down to the state level and found that individual states could have their results changed by non-citizen voters. North Carolina and Florida would both have been within the realm of mathematical possibility for the 2008 election, and for some state-level races the math is there as well.
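As promised in #1, here’s a rough margin-of-error calculation on the headline numbers. This is my back-of-envelope sketch using the figures quoted above, ignoring the survey’s weighting scheme and any misclassification of citizenship status, so treat it as a floor on the uncertainty rather than a real estimate:

```python
# How shaky is extrapolating 48 self-reported voters out of 828
# self-identified non-citizens to ~20 million people?
import math

voted, sampled = 48, 339 + 489   # pooled across the 2008 and 2010 samples
p = voted / sampled              # ~5.8% of sampled non-citizens said they voted

# Normal-approximation 95% confidence interval for a proportion
se = math.sqrt(p * (1 - p) / sampled)
low, high = p - 1.96 * se, p + 1.96 * se

population = 20_000_000          # rough count of non-citizens in the US
print(f"share voting: {p:.1%} (95% CI {low:.1%} to {high:.1%})")
print(f"scaled up: {low * population:,.0f} to {high * population:,.0f} people")
# share voting: 5.8% (95% CI 4.2% to 7.4%)
# scaled up: roughly 840,000 to 1,480,000 people
```

Even before you get to the biases below, the honest answer spans more than half a million people either way.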

Now, how much confidence you place in this study is up to you. Given the small sample size, things like selection bias and non-response bias definitely come into play. That’s true any time you’re trying to extrapolate the behavior of 20 million people from the behavior of a few hundred. It is important to note that the study authors did a LOT of due diligence attempting to verify and reality-check the numbers they got, but it’s never possible to control for everything.

If you do take this study seriously, it’s interesting to note what the authors actually thought the most effective counter-measure against non-citizen voting would be: education. Since they found that low education levels were correlated with increased voting and that poll workers rarely turned people away, they came away from this study suggesting that simply doing a better job of notifying people of voting rules might be just as effective as attempting to verify citizenship (and cheaper!). Ultimately it appears that letting individual states decide on their own strategies would also be more effective than anything at the federal level, as different states face different challenges. Things to ponder.


Does Popularity Influence Reliability? An Introduction

Well hi there! Welcome to the next edition of “Papers in Meta Science”, where I walk through interesting papers that use science to scrutinize science. During the first go-around we looked at the John Ioannidis paper “Why Most Published Research Findings Are False”, and this time we’re going to look at a paper that attempted to test one of that paper’s key assertions: that “hot” scientific fields produce less trustworthy results than less popular fields. The paper is called “Large-Scale Assessment of the Effect of Popularity on the Reliability of Research“, and was published in PLoS ONE by Pfeiffer and Hoffmann in 2009. They sought to test empirically whether or not this particular claim was true, using the field of protein interactions.

Before we get to the good stuff though, a quick roadmap: I expect this series to have 3 parts:

  1. The Introduction/Background. You’re reading this one right now.
  2. Methods and Results
  3. Further Discussion

Got it? Let’s go!

Introduction:  As I mentioned up front, one of the major goals of this paper was to confirm or refute the mathematical theory put forth by John Ioannidis that “hot” fields were more likely to produce erroneous results than those that were less popular. There are two basic theories as to why this could be the case:

  1. Popular fields create competition, and competitive teams are more likely to be incentivized to cut corners or do what it takes to get positive results (Ioannidis Corollary 5)
  2. Lots of teams working on a problem means lots of hypothesis testing, and lots of tested hypotheses means more false positives due to random chance (Ioannidis Corollary 6). There’s a quick simulation of this effect right after the list.
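To make Corollary 6 concrete, here’s a toy simulation of the multiple testing effect. This is my sketch, not anything from the paper; it just assumes a false hypothesis being tested independently by several teams, each using the conventional alpha of 0.05:

```python
# The "multiple testing effect" in miniature: if a hypothesis is actually
# false and n teams each test it independently at alpha = 0.05, the chance
# that at least ONE team finds a publishable false positive grows quickly.

alpha = 0.05
for teams in [1, 5, 10, 20, 50]:
    p_at_least_one = 1 - (1 - alpha) ** teams
    print(f"{teams:>2} teams -> {p_at_least_one:.0%} chance of a false positive somewhere")
#  1 teams ->  5% chance of a false positive somewhere
#  5 teams -> 23% chance of a false positive somewhere
# 10 teams -> 40% chance of a false positive somewhere
# 20 teams -> 64% chance of a false positive somewhere
# 50 teams -> 92% chance of a false positive somewhere
```

If only the positive results get written up, a popular (i.e. heavily tested) false hypothesis can end up looking very well supported in the literature.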

While Pfeiffer and Hoffmann don’t claim to be able to differentiate between those two mechanisms, they were hopeful that by looking at the evidence they could figure out whether this effect was real and, if it was, perhaps estimate its magnitude. For their scrutiny, they chose the field of protein interactions in yeast.

This may seem a little counter-intuitive, as almost no definition of “popular science” conjures pictures of protein interactions. However, it is important to remember that the point of this paper was to examine scientific popularity, not mentions in the popular press. Since most of us probably already assume that headline-grabbing research can cause its own set of bias problems, it’s interesting to consider a field that doesn’t grab headlines. Anyway, despite its failure to lead the 6 o’clock news, it turns out that the world of protein interactions actually does have a popularity issue. Some proteins and their corresponding genes are studied far more frequently than others, and this makes it a good field for examination. If a field like this can fall prey to the effect of multiple teams, then we can assume that more publicly oriented fields could as well.

Tune in next week to see what we find out!

Funnel Plots 201: Romance, Risk, and Replication

After writing my Fun With Funnel Plots post last week, someone pointed me to this Neuroskeptic article from a little over a year ago. It covers a paper called “Romance, Risk and Replication” that sought to replicate “romantic priming” studies, with interesting results….results best shown in funnel plot form! Let’s take a look, shall we?

A little background: I’ve talked about priming studies on this blog before, but for those unfamiliar, here’s how it works: a study participant is shown something that should subconsciously/subtly stimulate certain thoughts. They are then tested on a behavior that appears unrelated, but could potentially be influenced by the thoughts brought on in the first part of the study. In this study, researchers took a look at what’s called “romantic priming” which basically involves getting someone to think about meeting someone attractive, then seeing if they do things like (say they would) spend more money or take more risks.

Some ominous foreshadowing: Now for those of you who have been paying attention to the replication crisis, you may remember that priming studies were among the first things to be called into question. There were a lot of concerns about p-hacking, and concerns that these studies were falling prey to basically all the hallmarks of bad research. You see where this is going.

What the researchers found: Shanks et al attempted to replicate 43 different studies on romantic priming, all of which had found significant effects. When they attempted to replicate these studies, they found nothing. Well, not entirely nothing. They found no significant effects of romantic priming, but they did find something else:

The black dots are the results from the original studies, and the white triangles are the results from the replication attempts. To highlight the differences, they drew two funnel plots. One encompasses the original studies, and shows the concerning “missing piece” pattern in the lower left-hand corner. The other covers the replication studies. Because the replications’ sample sizes were larger, they all cluster at the top, but as you can see they spread above and below the zero line. In other words, the replications showed no effect in exactly the way you would expect if there were no effect, and the originals showed an effect in exactly the way you would expect if there were bias.

To thicken the plot further, the researchers also point out that the original studies’ effect sizes all fall just about on the edge of the funnel plot for the replication results. The red line in the graph shows a trend very close to the side of the funnel, which was drawn at the p=.05 line. Basically, this is pretty good evidence of p-hacking…aka researchers (or journals) selecting results that fell right under the p=.05 cutoff. Ouch.
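If you’re wondering why results hugging that line look so suspicious, the boundary itself is easy to compute. Here’s a sketch under a simplifying assumption of mine (a correlation-style effect size and a plain z-approximation, not the exact statistics Shanks et al used): the smallest effect that clears p=.05 shrinks like 1 over the square root of the sample size, so a literature whose effects all sit just above that curve looks selected rather than estimated.

```python
# The p = .05 "funnel boundary", roughly: for a correlation-type effect
# size r estimated from N participants, significance at the 5% level
# requires approximately |r| > 1.96 / sqrt(N).
import math

for n in [20, 50, 100, 400]:
    r_crit = 1.96 / math.sqrt(n)
    print(f"N = {n:>3}: effects just above r = {r_crit:.2f} squeak under p = .05")
# N =  20: effects just above r = 0.44 squeak under p = .05
# N =  50: effects just above r = 0.28 squeak under p = .05
# N = 100: effects just above r = 0.20 squeak under p = .05
# N = 400: effects just above r = 0.10 squeak under p = .05
```

Unbiased estimates should scatter around the true effect regardless of N; estimates that track 1.96/sqrt(N) are tracking the significance threshold instead.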

I liked this example because it shows quite clearly how bias can creep in and affect scientific work, and how statistical tools can be used to detect and display what happened. While large numbers of studies should protect against bias, sadly it doesn’t always work that way. 43 studies is a lot, and in this case, it wasn’t enough.

5 Things You Should Know About Study Power

During my recent series on “Why Most Published Research Findings Are False“, I mentioned a concept called “study power” quite a few times. I haven’t talked about study power much on this blog, so I thought I’d give a quick primer for those who weren’t familiar with the term. If you’re looking for a more in-depth primer try this one here, but if you’re just looking for a few quick hits, I gotcha covered:

  1. It’s sort of the flip side of the p-value. We’ve discussed the p-value and how it’s based on the alpha value before; study power is based on a value called beta. If alpha can be thought of as the chance of committing a Type 1 error (false positive), then beta is the chance of committing a Type 2 error (false negative). Study power is 1 – beta, so if someone says study power is .8, that means beta was .2. Setting the alpha and beta values is up to the discretion of the researcher….their values are more about risk tolerance than mathematical absolutes.
  2. The calculation is not simple, but what it’s based on is important. Calculating study power is not easy math, but if you’re desperately curious try this explanation. For most people though, the important part to remember is that it’s based on 3 things: the alpha you use, the effect size you’re looking for, and your sample size. These three all shift based on the values of the others. As an example, imagine you were trying to figure out if a coin was weighted or not. The more confident you want to be in your answer (alpha), the more times you have to flip it (sample size). However, if the coin is REALLY unfairly weighted (effect size), you’ll need fewer flips to figure that out. Basically the unfairness of a coin weighted 80-20 will be easier to spot than a coin weighted 55-45 (there’s a simulation of exactly this right after the list).
  3. It is weirdly underused. As we saw in the “Why Most Published Research Findings Are False” series, adequate study power does more than prevent false negatives. It can help blunt the impact of bias and the effect of multiple teams, and it helps everyone else trust your research. So why don’t most researchers put much thought into it, science articles mention it, or people in general comment on it? I’m not sure, but I think it’s simply because the specter of false negatives is not as scary or concerning as that of false positives. Regardless, you just won’t see it mentioned as often as other statistical issues. Poor study power.
  4. It can make negative (aka “no difference”) studies less trustworthy. With all the current attention on false positives and failures to replicate, it’s not terribly surprising that false negatives have received less attention…..but they are still an issue. Even though a study power calculation can tell you how big an effect size a study was capable of detecting, a surprising number of researchers don’t include one. This means a lot of “negative finding” trials could also be suspect. In this breakdown of study power, Statistics Done Wrong author Alex Reinhart cites work finding that up to 84% of studies lack sufficient power to detect even a 25% difference in primary outcomes. An ASCO review found that 47% of oncology trials didn’t have sufficient power to detect all but the largest effect sizes. That’s not nothing.
  5. It’s possible to overdo it While underpowered studies are clearly an issue, it’s good to remember that overpowered studies can be a problem too. They waste resources, but can also detect effect sizes so small as to be clinically meaningless.

Okay, so there you have it! Study power may not get all the attention the p-value does, but it’s still a worthwhile trick to know about.

What I’m Reading: Farming, the Future, Risk, AI, and Other Things Keeping Me Up At Night

Note: I started trying to do a regular reading list here, but my reading list has sent me into an existential tailspin this month, so I’m going to just reflect a little on all of that, talk about some farming, and then remind you that artificial intelligence is probably the biggest threat to humanity you haven’t bothered worrying about today. Figured you may want that heads up.

I don’t normally read novels, but my farmer brother loves Wendell Berry and has been encouraging me to read Jayber Crow for quite some time, and a few weeks ago I actually got around to it. It’s one of his “Port William” novels, which all take place from the perspectives of various members of a fictitious town in Kentucky, starting in the 1920s and ending in the 1970s. There are a lot of religious and theological themes in the novel that quite a few of my readers will probably have opinions about, but that’s not what caught my eye. What intrigued me was how, in 300 pages or so, the novel takes the main character from a young man to an old man, and reflects on how his slice of America and its approach to land changed in that time, and the impact that had on the community. If you don’t have a farming philosophy (or didn’t spend most of your childhood around people who had one, whether they called it that or not), this may not strike you as much as it did me. I grew up hearing about the approach of my grandfather (who would have been about Jayber’s age), how one uncle tried to change it, how another uncle took it over when my grandfather died, and how my brother continues the tradition. Land use as a reflection of greater social change is kind of a thing in my family, and it was interesting to see that captured in novel form. The subtle influence of technology on perceptions of land and farming was also rather fascinating. Also, at this point it’s kind of nice to read a recounting of the 20th century that’s not entirely Boomer-centric.

Concurrent with that book, I also read “But What If We’re Wrong? Thinking About the Present As If It Were the Past” by Chuck Klosterman. In it, Klosterman reflects on all the ways we talk about the past, and continuously reminds us that people in 100 years will remember us quite differently than we like to think. He points out that we all know this in general, but if you point to anything specific, our first reaction is to get defensive and explain that whatever particular idea we’re discussing is one of the ones that will endure. We’re willing to acknowledge that the future will be different, but only if that difference is familiar.

To further mess with my self-perception, I still haven’t entirely recovered from reading Antifragile a few months ago. There’s a lot of good stuff in that book (including a discussion of the Lindy Effect, a helpful rule of thumb for what ideas will actually persist in 100 years), but what Taleb is really famous for is his concern about Black Swans. Black Swans are events that are unexpected. They are hard to predict. They are not really what we were focusing on. They shape history dramatically, but we all forget about it because we focus our statistical predictions on things that have a prior probability (if you want to blame the Bayesians) or things that have larger risks (if you want to blame the frequentists).

So to this mix of risk and uncertainty and reflections on the past and future, I decided to add some reading about artificial intelligence (AI) risk. For the sake of my insomnia, this was an error. However, now that I’m here I’d like to share the pain. If you want an exceptionally good comprehensive overview of where we’re at, try the Wait But Why post on the topic, or for something shorter try this. If you’re feeling really lazy, let me summarize:

  1. We are racing like hell to create something smarter than ourselves
  2. We will probably succeed, and sooner than you might think
  3. Once we do that, pretty much by definition whatever we create will start improving itself faster than we can, towards goals that are not the same as ours
  4. The ways this could go horribly wrong are innumerable
  5. Almost no one appears worried about this

By #5 of course I mean no one on my Facebook feed. Bill Gates, Stephen Hawking and Elon Musk are actually all pretty damn concerned about this. The problem is that something like this is so far outside our current experience that it’s hard for most people to even conceive of it being a risk….but that lack of familiarity doesn’t actually translate into lack of risk. If you want the full blow-by-blow I suggest you go back and read the articles I suggested, but here’s a quick story to illustrate why AI is so risky:

You know that old joke where someone starts pouring you water or serving you food and says “say when”, then fails to stop when you say things like “stop” or “enough” or “WHAT ARE YOU DOING YOU MANIAC” because “you didn’t say when!”? Well, I know people who will take that joke pretty far. They will spill water on the table or overflow your plate or whatever they need to do to sell the joke. However, I have never met a person who will go back to the faucet and get more water just to come back to the table and keep pouring. That’s a line that all humans, no matter how dedicated to the joke, understand. Crossing it wouldn’t even occur to most people. When we’re talking about computers though, that line doesn’t exist. Computers keep going. Anyone who’s ever accidentally crashed a program by writing an infinite while loop knows this. The oldest joke in the coding world is “the problem with computers is that they do exactly what you tell them to”. Even the most malicious humans have a fundamental bias towards keeping humanity in existence. Maybe not individuals, but the species as a whole is normally not a target. AI won’t have this bias.

Now, I’m not saying we’re all doomed, but I’m definitely on Anxious Avenue here, to borrow the Wait But Why chart’s terminology.

The fact that almost everyone I know has spent more time thinking about their opinion on Donald Trump’s hair than AI risk doesn’t help. At a bare minimum, this should at least register on the national list of things people talk about, and I don’t even think it’s in the top 1000.

On the minus side, this reading list has made me a little pensive. On the plus side, I’m kind of like that anyway. On the DOUBLE plus side, bringing up AI risk in the middle of any political conversation is an incredibly useful tool for getting people to stop talking to you about opinions you’re sick of hearing.

Anyway, if you’d like to send me some lighter and happier reading for February, I might appreciate it.

The Perfect Metric Fallacy

“The first step is to measure whatever can be easily measured. This is OK as far as it goes. The second step is to disregard that which can’t be easily measured or to give it an arbitrary quantitative value. This is artificial and misleading. The third step is to presume that what can’t be measured easily really isn’t important. This is blindness. The fourth step is to say that what can’t be easily measured really doesn’t exist. This is suicide.” – Daniel Yankelovich

“Andy Grove had the answer: For every metric, there should be another ‘paired’ metric that addresses the adverse consequences of the first metric” -Marc Andreessen

“I didn’t feel the ranking system you created adequately captured my feelings about the vendors we’re looking at, so instead I assigned each of them a member of the Breakfast Club. Here, I made a poster.” -me

I have a confession to make: I don’t always like metrics. There. I said it. Now most people wouldn’t hesitate to make a declaration like that, but for someone who spends a good chunk of her professional and leisure time playing around with numbers, it’s kind of a sad thing to have to say. Some metrics are totally fine of course, and super useful. On the other hand, there are times when the numbers subsume the actual goal and become the focus themselves. This is bad. In statistics, numbers are a means to an end, not the end itself. I need a name for this flip-flop, so from here on out I’m calling it “The Perfect Metric Fallacy”.

The Perfect Metric Fallacy: The belief that if one simply finds the most relevant or accurate set of numbers possible, all bias will be removed, all stress will be negated, and the answer to complicated problems will become simple, clear and completely uncontroversial.

As someone who tends to blog about numbers and such, I see this one a lot. On the one hand, data and numbers are wonderful because they help us identify reality, improve our ability to compare things, spot trends, and overcome our own biases. On the other hand, picking the wrong metric out of convenience or bias and relying too heavily on it can make everything I just named worse, plus piss off everyone around you.

Damn.

While I have a decent number of my own stories about this, what frustrates me is how many I hear from others. When I tell people these days that I’m in to stats and data, almost a third of people respond with some sort of horror story about how data or metrics are making their professional lives miserable. When I talk to teachers, this number goes up to 100%.

This really bums me out.

It seems that after years of disconnected individuals going with their guts and kind of screwing everything up, people decided we should put numbers on those grand ideas to prove they were going to work. When the ideas fail anyway, people either blame the numbers (if you’re the person who made the decision) or the people who like the numbers (if you’re everybody else). So why do we let this happen? Almost everyone knows up front that numbers are really just there to guide decision making, so why do we get so obsessed with them?

  1. Math class teaches us that if you play with numbers long enough, there will be a right answer. There are a lot of times in life when your numbers have to be perfect. Math class. Your tax return. You know the drill. Endless calculations, significant figures, etc, etc. In statistics, that kind of precision often isn’t possible, and presenting data in a way that makes it look more accurate than it really can be is a phenomenon known as “false precision“. My favorite example of this is a clinic I worked with at one point. They reported weight to two decimal places (as in 130.45 lbs), but didn’t have a standard around whether or not people had to take their coats off before being weighed. At the beginning of this post, I put a blurb about me converting a ranking system into a Breakfast Club poster. That came up after I was presented with a 100-point scale to rank 7 vendors against each other in something like 16 categories. When you have 3 days to read through over 1000 pages of documentation and assign scores, your eyes start to blur a little and you start getting a little existential about the whole thing. Are these 16 categories really the right categories? Do they cover everything I’m getting out of this? Do I really feel 5 points better about this vendor than that other one, and are both of them really 10 points better than that 3rd one? Did I increase the strictness of my rankings as I went along, or get nicer as I had to go faster? It wasn’t a bad ranking system, but the problem was me. If I can’t promise I kept my rankings consistent over 3 days, how can I attest to my numbers at the end?
  2. We want numbers to take the hit for unpleasant truths. A few years ago someone sent me a comic strip that I have since sent along to nearly everyone who complains to me about bad metrics in the workplace. It almost always gets a laugh, and most people then admit that it’s not the numbers they have a problem with, it’s the way they’re being used. There’s a lot of unpleasant news to deliver in this world, and people love throwing up numbers to absorb the pain. See, I would totally give you a raise or more time to get things done, but the numbers say I can’t. When people know you’re doing exactly what you were going to do to begin with, they don’t trust any number you put up. This gets even worse in political situations. So please, for the love of God, if the numbers you ran sincerely match your pre-existing expectations, let people look over your methodology, or show where you really tried to prove yourself wrong. Failing to do this gives all numbers a bad rap.
  3. Good data is hard to find. One of the reasons “statistician” continues to be a profession is that good data is really, really, really hard to find, and good methods of analysis require a lot of legwork. In the course of trying to find a “perfect metric”, many people end up believing that part of being “perfect” is being easily obtainable. As my first quote mentions, this is ridiculous. It’s also called the McNamara Fallacy, which warns us that the easiest things to quantify are not always the most important.
  4. Our social problems are complicated. The power of numbers is strong. Unfortunately, the power of some social problems is even stronger. Most of our worst problems are multifaceted, which of course is why they haven’t been solved yet. When I decided to use metrics to address my personal weight problem, I came up with 10 distinct categories to track for one primary outcome measure. That’s over 3,600 data points a year, and that’s just for me. Scaling that up is immensely complicated, and introduces all sorts of issues of variability among individuals that don’t exist when you’re looking at just one person. Even if you do luck out and find a perfect metric, in a constantly shifting system there is a good chance that improving that metric will cause a problem somewhere else. Social structures are like Jenga towers, and knocking one piece out of place can have unforeseen consequences. Proceed with caution, and don’t underestimate the value of small successes.

Now again, I do believe metrics are incredibly valuable and, used properly, can generate good insights. However, in order to prevent your perfect metric from turning into a numerical bludgeon, you have to keep an eye on what your goal really is. Are you trying to set kids up for success in life, or get them to score well on a test? Are you trying to maximize employee productivity, or keep employees over the long term? Are you looking for a number or a fall guy? Can you even measure what you’re looking for with any sort of accuracy? Things to ponder.


Fun With Funnel Plots

During my recent series on “Why Most Published Research Findings Are False“, we talked a lot about bias and how it affects research. One of the classic ways of overcoming bias in research is to either 1) do a very large, well-publicized study that definitively addresses the question or 2) pull together all of the smaller studies that have been done and analyze their collective results. Option #2 is what is referred to as a meta-analysis, because we are basically analyzing a whole bunch of analyses.

Now those of you who are paying attention may wonder how effective that whole meta-analysis thing is. If there’s some sort of bias in either what gets published or all of the studies being done, wouldn’t a study of the studies show the same bias?

Well, yeah, it most certainly would. That’s why there’s a kind of cool visual tool available to people conducting these studies to take a quick look at the potential for bias. It’s called a funnel plot, and it looks exactly as you would expect it to:

Basically you take every study you can find about a topic, and you map the effect size each study found on the x-axis and the size of the study on the y-axis. With random variation, the studies should look like a funnel: studies with small numbers of people/data points will vary a lot more than larger studies, and both will converge on the true effect size. This technique has been used since the 80s, but was popularized by the excitingly titled paper “Bias in meta-analysis detected by a simple, graphical test”. This paper pointed out that if you gather all the studies together and don’t get a funnel shape, you may be looking at some bias. This bias doesn’t have to be on the part of the researchers, by the way….publication bias would cause part of the funnel to go missing as well.

The principle behind all this is pretty simple: if what we’re looking at is a true effect size, our experiments will swing a bit around the middle. To use the coin toss analogy, a fairly weighted coin tossed 10 times will sometimes come up 3 heads, 7 tails or vice versa, but if you toss it 100 times it will probably be much closer to 50-50. The increased sample size increases the accuracy, but everything should be centered around the same number….the “true” effect size.
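If you want to see what this looks like without digging up a real meta-analysis, it’s easy to fake one. Here’s a toy simulation (entirely my own made-up numbers: the true effect, the study sizes, and the “file drawer” selection rule are all assumptions) showing how publication bias carves a corner out of the funnel:

```python
# A toy funnel plot: simulate many studies of one true effect, then
# "publish" only the statistically significant ones and watch the
# lower corner of the funnel disappear.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
true_effect = 0.3
sizes = rng.integers(10, 400, size=200)      # each study's sample size
se = 1 / np.sqrt(sizes)                      # standard error shrinks as N grows
effects = rng.normal(true_effect, se)        # each study's estimated effect

published = np.abs(effects / se) > 1.96      # crude z-test at p = .05
plt.scatter(effects[published], sizes[published], label="published")
plt.scatter(effects[~published], sizes[~published],
            marker="x", alpha=0.4, label="stuck in the file drawer")
plt.axvline(true_effect, linestyle="--")
plt.xlabel("estimated effect size")
plt.ylabel("study size (N)")
plt.legend()
plt.show()
```

The big studies get published either way (their standard errors are small enough that the true effect clears significance), but among the small studies only the lucky overestimates survive, leaving the lopsided, corner-missing shape that flags bias.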

To give an interesting real-life example, take the gender wage gap. Now most people know (or should know) that the commonly quoted “women earn 77 cents on the dollar” stat is misleading. The best discussion of this I’ve seen is Megan McArdle’s article here, and in it an interesting fact emerges: even controlling for everything possible, no study has found that women out-earn men. Even the American Enterprise Institute and the Manhattan Institute both put the gap at 94 to 97 cents on the dollar for women. At one point in the AEI article, they opine that such a small gap “may not be significant at all”, but that’s not entirely true. The fact that no one seems to find a small gap going the other direction actually suggests the gap may be real. In other words, if the true gap were zero, roughly half of the studies should show women out-earning men. If the mid-line is zero, we only have half the funnel. Now this doesn’t tell us what the right number is or why it’s there, but it is a pretty good indication that the gap is something other than zero. Please note: The McArdle article is from 2014, so if there’s new data that shows women out-earning men in a study that controls for hours worked and education level, send it my way.

Anyway, the funnel plot is not without its problems. Unfortunately there aren’t a lot of standards around how to use it, and changing the scale of the axes can make it look more or less convincing than it really should be. Additionally, if the number of studies is small, it is not as accurate. Finally, it should be noted that a missing part of the funnel is not definitive proof that publication or other bias exists. It could be that those compiling the meta-analysis had a hard time finding all the studies done, or even that the effect size varies based on methodology.

Even with those problems, it’s an interesting tool to at least be aware of, as it is fairly frequently used and is not terribly hard to understand once you know what it is. You’re welcome.

How To Read a Headline: Are Female Physicians Better?

Over the years I’ve spilled a lot of (metaphorical) ink on how to read science on the internet. At this point almost everyone who encounters me frequently IRL has heard my spiel, and few things give me greater pleasure than hearing someone say “you changed the way I read about science”. While I’ve written quite a few longer pieces on the topic, recently I’ve been thinking a lot about what my “quick hits” list would be. If people could only change a few things in the way they read science stories, what would I put on the list?

Recently, a story hit the news about how you might live longer if your doctor is a woman and it got me thinking. As someone who has worked in hospitals for over a decade now, I had a strong reaction to this headline. I have to admit, my mind started whirring ahead of my reading, but I took the chance to observe what questions I ask myself when I need to pump the brakes. Here they are:

  1. What would you think if the study had said the opposite of what it says? As I admitted up front, when I first heard about this study, I reacted. Before I’d even made it to the text of the article I had theories forming. The first thing I did to slow myself down was to think “wait, how would you react if the headline said the opposite? What if the study found that patients of men did better?” When I ran through those thoughts, I realized they were basically the same theories. Well, not the same…more like mirror images, but they led to the same conclusion. That’s when I realized I wasn’t thinking through the study and its implications, I was trying to make the study fit what I already believed. I admit this because I used this knowledge to mentally hang a big “PROCEED WITH CAUTION” sign on the whole topic. To note, it doesn’t matter what my opinion was here; what matters is that it was strong enough to muddy my thoughts.
  2. Is the study linked to? My first reaction (see #1) kicked in before I had even finished the headline, so unfortunately “is this real” comes second. In my defense, I was already seeing the headlines on NPR and such, but of course that doesn’t always mean there’s a real study. Anyway, in the case of this study, there is a real, identified study (with a link!) in JAMA. As a note, even if the study is real, I distrust any news coverage that doesn’t provide a link to the source. In 2017, that’s completely inexcusable.
  3. Do all the words in the headline mean what you think they mean? Okay, I’ve covered headlines here, but it bears repeating: headlines are a marketing tool. This study appeared under several headlines such as “You Might Live Longer if Your Doctor is a Woman“. What’s important to note here is that by “live longer” they meant “slightly lower 30 day mortality after discharge from the hospital”, by doctor they meant “hospitalist”, and by “you” they meant “people over 65 who have been hospitalized”. Primary care doctors and specialists were not covered by this study.
  4. What’s the sample size and effect size? Okay, once we have the definitions out of the way, now we can start on the numbers. For this study, the sample size was fantastic….about 1.5 million hospital admissions. The effect size however….not so much. For patients treated by female physicians vs male, 30-day mortality dropped from 11.49% to 11.07%. That’s not nothing (about a 4% relative drop), but mathematically speaking it’s really hard to reliably measure effect sizes of under 5% (Corollary #2) even when you have a huge sample size. To their credit, the study authors do include the “number needed to treat”, and note that 233 patients would have to be treated by female physicians rather than male physicians in order to save one life (I double-check this arithmetic right after the list). That’s a better stat than the one this article tried to use: “Put another way – if you were to replace all the male doctors in the study with women, 32,000 fewer people would die a year.” I am going to bet that wouldn’t actually work out that way. Throw “of equal quality” in there next time, okay?
  5. Is this finding the first of its kind? As I covered recently in my series on “Why Most Published Research Findings Are False“, first-of-their-kind exploratory studies are some of the least reliable types of research we have. Even when they have good sample sizes, they should be taken with a massive grain of salt. As a reference, Ioannidis puts the chances that a positive finding is true for a study like this at around 20%. Even if subsequent research proves the hypothesis, it’s likely that the effect size will diminish considerably in subsequent work. For a study that starts off with a 4% effect size, that could be a pretty big hit. It’s not bad to continue researching the question, but drawing conclusions or changing practice over one paper is a dangerous game, especially when the study was observational.
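Since #4 leans on a couple of specific numbers, here’s a quick back-of-envelope check. This is my arithmetic on the raw rates quoted above, not the paper’s adjusted model, which is presumably why it lands near (but not exactly on) the paper’s NNT of 233:

```python
# Sanity-checking the effect size quoted in #4: 30-day mortality of
# 11.49% (male physicians) vs 11.07% (female physicians).

male_mortality, female_mortality = 0.1149, 0.1107
arr = male_mortality - female_mortality          # absolute risk reduction
relative_drop = arr / male_mortality             # the "about 4%" above

print(f"absolute risk reduction: {arr:.2%}")     # 0.42%
print(f"relative drop: {relative_drop:.1%}")     # ~3.7%
print(f"number needed to treat: {1 / arr:.0f}")  # ~238 (paper's adjusted figure: 233)
```

An absolute difference of 0.42 percentage points is exactly the kind of small effect that needs a huge sample to detect at all, which is why the next point matters so much.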

So after all this, do I believe this study? Well, maybe. It’s not implausible that personal characteristics of doctors can affect patient care. It’s also very likely that the more data we have, the more associations like this we’ll find. However, it’s important to remember that proving causality is a long and arduous process, and that reacting to new findings with “well, it’s probably more complicated than that” is an answer that’s not often wrong.