What I’m Reading: July 2016

This month my book was The Signal and the Noise, which I enjoyed enough that I’m doing a chapter-by-chapter contingency matrix series on it over at the other blog.

Sampling strategy and research design can sound really boring, until you blow through $1.3 billion and have nothing to show for it. This article on the long, slow death of the National Children’s Study should be assigned reading for anyone who ever wanted to know why it’s so damn hard to get good research done.

Did you hear the one about all the Brexit voters furiously Googling “What is the EU?” after they voted to leave it? Yeah? That was pretty bogus. It was about 1000 people total, no one knows if their Googling was “furious”, how they voted, or if those people were even eligible to vote.

This article is from a few months ago, but it’s an interesting look at motivations and political bias. It turns out people do better on “political fact” tests when you offer them money for right answers than when they take them with no incentives.  The Volokh Conspiracy discusses implications for our understanding of political ignorance.

Also from a few months ago: the Quartz guide to bad data. More properly it might be called “guide to cleaning up your spreadsheet”. If you ever actually get a large data file and don’t know how to find potential problems before you analyze it, this is a good start.

Another good guide is this list of data science books from Stitch Fix. Stitch Fix is an online personal stylist service that I just so happen to use to get most of my work clothes. They also have a REALLY active data science division that helps come up with clothing recommendations. Good stuff.

This is an interesting data visualization of the changing American obesity rates.

I actually listened to this one, but there was an interesting piece on Science Friday about “differential privacy” and response randomization. The transcript is available here,  and there’s some interesting discussion about honesty, privacy, and research in the big data era.

 

Proving Causality: Who Was Bradford Hill and What Were His Criteria?

Last week I had a lot of fun talking about correlation/causation confusion, and this week I wanted to talk about the flip side: correctly proving causality. While there’s definitely a cost to incorrectly believing that Thing A causes Thing B when it does not, it can also be quite dangerous to NOT believe Thing A causes Thing B when it actually does.

This was the challenge that faced many public health researchers when attempting to establish a link between smoking and lung cancer. With all the doubt around correlation and causation, how do you actually prove your hypothesis? British statistician Austin Bradford Hill was quite concerned with this problem, and he established a set of nine criteria to help prove causal association. While these criteria are primarily used for establishing the causes of medical conditions, they are a pretty useful framework for assessing correlation/causation claims.

Typically these criteria are explained using smoking (here for example), as that’s what they were developed to assess. I’m actually going to use examples from the book The Ghost Map, which documents the cholera outbreak in London in 1854 and the birth of modern epidemiology. A quick recap: a physician named John Snow witnessed the start of the cholera outbreak in the Soho neighborhood of London, and was desperate to figure out how the disease was spreading. The prevailing wisdom at the time was that cholera and other diseases were transmitted by foul-smelling air (miasma theory), but based on his investigation Snow began to believe the problem was actually a contaminated water source. In the era prior to germ theory, the idea of a water-borne illness was a radical one, and Snow had to vigorously document his evidence and defend his case….all while hundreds of people were dying. His investigation and documentation are typically acknowledged as the beginning of formal epidemiology, and it is likely he saved hundreds if not thousands of lives by convincing authorities to remove the handle of the Broad Street pump (the contaminated water source).

With that background, here are the criteria:

  1. Strength of Association: The first criterion is basic: people who do Thing A must have a higher rate of Thing B than those who don’t. This is basically a request for an initial correlation. In the case of cholera, this was where John Snow’s “Ghost Map” came in. He created a visual diagram showing that the outbreak of cholera was not purely based on location, but on proximity to one particular water pump. Houses that were right next to each other had dramatically different death rates IF the inhabitants typically used different water pumps. Of those living near the Broad Street pump, 127 died. Of those living nearer to other pumps, 10 died. That’s one hell of an association.
  2. Temporality: The suspected cause must come before the effect. This one seems obvious, but must be remembered. Both water and air are consumed frequently, so either method of transmission passed this criterion. However, if you looked closely, it was clear that bad smells often came after disease and death, not before. OTOH, there were a lot of open sewer systems in London at the time, so everything probably smelled kinda bad. We’ll call this one a draw.
  3. Consistency: Different locations must show the same effects. This criterion is a big reason why miasma theory (the theory that bad smells caused disease) had taken hold. When disease outbreaks happened, the smells were often unbearable. This appeared to be very consistent across locations and different outbreaks. Given John Snow’s predictions however, it would have been beneficial to see if cholera outbreaks had unusual patterns around water sources, or if changing water sources changed the outbreak trajectory.
  4. Theoretical Plausibility: This one can be tricky to establish, but basically it requires that you can propose a mechanism for the cause. It’s designed to help keep out really out-there ideas about crystals and star alignment and such. Ingesting a substance such as water quite plausibly could cause illness, so this passed. Inhaling air also passed this test, since we now know that many diseases are actually transmitted through airborne germs. Cholera didn’t happen to have this method of transmission, but it wasn’t implausible that it could have. Without germ theory, plausibility was much harder to establish. Plausibility is only as good as current scientific understanding.
  5. Coherence: The coherence requirement looks at whether the proposed cause agrees with other knowledge, especially laboratory findings. John Snow didn’t have those, but he did gain coherence when the pump handle was removed and the outbreak stopped. That showed that the theory was coherent, or that things proceeded the way you would predict if he was correct. Conversely, the end of the outbreak caused a lack of coherence for miasma theory…if bad air was the cause, you would not expect changing a water source to have an effect.
  6. Specificity in the Causes: The more specific or direct the relationship between Thing A and Thing B, the clearer the causal relationship and the easier it is to prove. Here again, by showing that those drinking the water were getting cholera at very high rates and those not drinking the water were not getting cholera as often, Snow offered a very straightforward cause and effect. If there had been other factors involved….say water drawn at a certain time of day….this link would have been more difficult to establish.
  7. Dose-Response Relationship: The more exposure you have to the cause, the more likely you are to have the effect. This one can be tricky. In the event of an infectious disease for example, one exposure may be all it takes to get sick. In the case of John Snow, he actually doubted miasma theory because of this criterion. He had studied men who worked in the sewers, and noted that they must have more exposure to foul air than anyone else. However, they did not seem to get cholera more often than other people. The idea that bad air made you sick, but that lots of bad air didn’t make you more likely to be ill, troubled him. With the water on the other hand, he noted that those using the pump daily became sick immediately.
  8. Experimental Evidence: While direct human experiments are almost never possible or ethical to run, some experimental evidence may be used as support for the theory. Snow didn’t have much to experiment on, and it would have been unethical if he had. However, he did track people who had avoided the pump and noted whether or not they got sick. If he had known of animals that were susceptible to cholera, he could have tested the water by giving one animal “good” water and another animal “bad” water.
  9. Analogy: If you know that something occurs in one place, you can reasonably assume it occurs in other places. If Snow had known of other water-borne diseases, one suspects it would have been easier for him to make his case to city officials. This one can obviously bias people at times, but is actually pretty useful. We would never dream of requiring a modern epidemiologist to prove that a new disease could be water-borne….we would all assume it was at least a possibility.
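The strength-of-association criterion in #1 is really a relative-risk calculation. Here’s a minimal Python sketch using the death counts above; note that the population denominators are invented purely for illustration, since only the death counts are given:

```python
# Relative risk: how much more likely was death near the suspect pump?
# Death counts (127 and 10) come from the post; the population sizes
# are hypothetical, included only to show the shape of the calculation.
deaths_broad, pop_broad = 127, 1000   # near the Broad Street pump (pop assumed)
deaths_other, pop_other = 10, 1000    # nearer to other pumps (pop assumed)

risk_broad = deaths_broad / pop_broad     # 0.127
risk_other = deaths_other / pop_other     # 0.010
relative_risk = risk_broad / risk_other   # 12.7 under these assumptions

print(f"Relative risk near the Broad Street pump: {relative_risk:.1f}x")
```

With equal denominators the relative risk is just the ratio of death counts; Hill’s point is that a ratio this large is very hard to wave away as a coincidence.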

Even though Snow didn’t have this checklist available to him, he ended up checking most of the boxes anyway. In particular, he proved his theory using strength of association, coherence, consistency and specificity. He also raised questions about the rival theory by pointing to the lack of dose-response relationship. Ultimately, the experiment of removing the pump handle succeeded in halting the outbreak.

Not bad for a little data visualization:

While some of these criteria have been modified or improved, this is a great fundamental framework for thinking about causal associations. Also, if you’re looking for a good summer read, I would recommend the book I referenced here: The Ghost Map. At the very least it will help you stop making “You Know Nothing John Snow” jokes.

The Fallibility of Journalistic Memory, a Play in Three Acts

If you’re looking for a little fun reading on this long holiday weekend, I would like to point you to a series of posts Ann Althouse has put up over the past couple of days. It’s not stats related, but touches on some of my other favorite topics: bias, certainty, and memory.

Act 1: Poetic Justices and Questionable Citations

Linda Greenhouse writes an Op-Ed for the New York Times in which she complains about the “lack of poetry” in the recent Supreme Court Whole Woman’s Health vs Hellerstedt decision. Greenhouse compares it to the decision in Planned Parenthood vs Casey, written 24 years earlier.

The next day, Ann Althouse blogs about the article, noting that Greenhouse attributed the line “Liberty finds no refuge in a jurisprudence of doubt” from Planned Parenthood vs Casey  to Anthony Kennedy.  Althouse points out that the line was taken from a jointly written decision, and that to attribute it to only one justice (Kennedy) is not correct.

Act 2: Challenge Accepted

Ann Althouse posts a follow-up post after Linda Greenhouse emails her to dispute the quote mis-attribution charge. In her email, Greenhouse cites her source for attributing the line to Kennedy: the Jeffrey Toobin book “The Nine” and her own presence in the courtroom the day the justices read the Planned Parenthood vs Casey decision 24 years earlier. She recounts Kennedy leading off with the line in question, and the stir it created in the courtroom. She asserts that the act of reading the line verifies that he was the author. She ends the email with the line “Of course you are completely free to trash my opinions and my writing style.  I would caution you against challenging my facts.”

Althouse, choosing to ignore that last part, locates the original recording of the reading of the decision. She discovers that not only did Kennedy not lead off, but neither he nor anyone else read the line that Greenhouse so clearly remembers hearing.

The book in question does attribute the line to him, but has no named source for that information.

Part 3: We’re All a Little Wrong Sometimes, Aren’t We Though?

Confronted with the recording that shows her memory was incorrect, Greenhouse emails Althouse again, conceding that “I guess it’s fair to say that each of us was right and each of us was wrong.”

Althouse posts that email, along with her complete rejection of Greenhouse’s conclusion here. She (Althouse) ends her post with “I didn’t say anything that was wrong. I have a way of blogging that keeps me out of trouble like that. I don’t make assertions about things I don’t know.”

Epilogue: One of my favorite books is Mistakes Were Made (But Not By Me) by Carol Tavris1. There’s a tremendous amount of research into how and why we rewrite memories, and the book covers a lot of those reasons. The main takeaway here though is that we all need to guard against created memories and overconfidence in our facts….ESPECIALLY if you’re going to be writing for the New York Times and PARTICULARLY if you’re challenged.

The point of who exactly wrote that original line is a minor one to many people. Kennedy was certainly involved with the decision, so naming him as a solo author isn’t that out there. If Greenhouse had merely cited the book that mentioned the line as his, I would never have thought twice about it. However, when she cited her own vivid recollection that turned out to be completely wrong, I have to imagine nearly everyone reading the saga started questioning her more seriously.

To her credit, Greenhouse did fully admit her shock at discovering her memory was incorrect. Hopefully the lesson for all of us here is to be very cautious when we rely on an emotionally charged memory, and  EXTRA cautious when we tell someone not to challenge our facts.

1. Conservative readers be warned: in the very first chapter of the book Tavris lets some pretty liberal biased statements through as fact. She cuts this out (I think) after the first chapter, but it’s really bugged at least one person I’ve recommended the book to. I think it’s worthwhile despite that, but YMMV.

Projections, Predictions and Guns vs Cars

Welcome to “From the Archives”, where I dig up old posts and see what’s changed in the years since I originally wrote them.

Last week, while researching my post about definitions you should remember while discussing mass shootings, I came across a post from January of 2013 that warranted further investigation. It was my take on a Bloomberg News article that projected that by 2015 automotive deaths would surpass gun deaths. They showed this chart:

My primary grouse was that they seemed to be extrapolating the 2015 data from the 2008 and 2009 data. I decided to take a look and see how the Bloomberg prediction had turned out.

Interestingly enough, at this point it appears to be a statistical tie. The Violence Policy Center has a chart up through 2014 showing a slight lead for motor vehicle deaths:

The Washington Post OTOH, gave them a tie (rates reported per 100,000 people):

According to this post, the numbers for gun deaths ended up being 33,599, and car deaths were 33,736. It is interesting to note that Bloomberg underestimated the car deaths by a little less than 2,000/year, and the gun deaths by about 600/year. So they were wrong in their assumption that motor-vehicle deaths would continue to drop at the same pace they had been, but right in their assumption that gun deaths would continue to rise. I’ll give myself half credit on this one. Of course, we do have one more year to go before we get the 2015 data, so I could still entirely eat crow.

It’s worth noting that the rise in firearm death through 2014 was entirely due to an increase in suicide rates. Homicide rates actually decreased during that time:

Someone remind me to check back in next year to see where we went with this!

6 Examples of Correlation/Causation Confusion

When I first started blogging about correlation and causation (literally my third and fourth posts ever), I asserted that there were three possibilities whenever two variables were correlated. Now that I’m older and wiser, I’ve expanded my list to six:

  1. Thing A causes Thing B (causality)
  2. Thing B causes Thing A (reversed causality)
  3. Thing A causes Thing B, which then makes Thing A worse (bidirectional causality)
  4. Thing A causes Thing X, which causes Thing Y, which ends up causing Thing B (indirect causality)
  5. Some other Thing C causes both A and B (common cause)
  6. It’s due to chance (spurious or coincidental)
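Possibility #5 is the easiest to demonstrate with a quick simulation. In this sketch (plain Python, no stats libraries), neither A nor B influences the other at all; both are just noisy readings of a common Thing C, and they come out correlated anyway:

```python
import random

random.seed(0)

# Thing C drives both A and B; A and B never touch each other.
n = 10_000
c = [random.gauss(0, 1) for _ in range(n)]
a = [ci + random.gauss(0, 1) for ci in c]  # Thing A = C + independent noise
b = [ci + random.gauss(0, 1) for ci in c]  # Thing B = C + independent noise

def corr(x, y):
    """Pearson correlation, computed by hand."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    vx = sum((xi - mx) ** 2 for xi in x)
    vy = sum((yi - my) ** 2 for yi in y)
    return cov / (vx * vy) ** 0.5

print(f"corr(A, B) = {corr(a, b):.2f}")  # about 0.5, with zero direct causation
```

A naive reading of that 0.5 correlation would conclude A causes B (or vice versa), when the entire relationship lives in C.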

The obvious conclusion is that years spent blogging about statistics directly correlates to the number of possible ways of confusing correlation and causation you recognize.

Anyway, I’ve talked about this a lot over the years, and this lesson is pretty fundamental in any statistics class…though options #3 and #4 up there often aren’t covered at all. It’s easily forgotten, so I wanted to use this post to pull together an interesting example of each type.

  1. Smoking cigarettes causes lung cancer (Thing A causes Thing B): This is an example I use in my Intro to Internet Science talk I give to high school students. Despite my continued pleading to be skeptical of various claims, I like to point out that sometimes disbelieving a true claim also has consequences. For years tobacco companies tried to cast doubt on the link between smoking and lung cancer, often using “correlation is not causation!” type propaganda.
  2. Weight gain in pregnancy and pre-eclampsia (Thing B causes Thing A): This is an interesting case of reversed causation that I blogged about a few years ago. Back in the 1930s or so, doctors had noticed that women who got pre-eclampsia (a potentially life threatening condition) also had rapid weight gain. They assumed the weight gain was causing the pre-eclampsia, and thus told women to severely restrict their weight gain. Unfortunately it was actually the pre-eclampsia causing the weight gain, and it is pretty likely the weight restrictions did more harm than good.
  3. Dating and desperation (Thing A causes Thing B which makes Thing A worse): We’ve all had that friend. The one who strikes out with everyone they try to date, and then promptly doubles down on their WORST behaviors. This is the guy who stops showering before he takes girls out because “what’s the point”. Or the girl who gets dumped after bringing up marriage on the third date, so she brings it up on the first date instead. This  is known as “bidirectional causality” and is less formally known as “a vicious cycle”. In nature this can cause some really amusing graph behavior, as in the case of predators and prey.  An increase in prey can cause an increase in predators, but an increase in predators will cause a decrease in prey. Thus, predator and prey populations can be both positively AND negatively correlated, depending on where you are in the cycle.
  4. Vending machines in schools and obesity (Thing A causes Thing X causes Thing Y which then causes Thing B): One obvious cause of obesity is eating extra junk food. One obvious source of extra junk food is vending machines. One obvious place to find vending machines is in many schools. So remove vending machines from schools and reduce obesity, right? No, sadly, not that easy. In a longitudinal study that surprised even the authors, it was found that kids who moved from schools without vending machines to those with vending machines didn’t gain weight. What’s interesting is that you can find a correlation between kids who were overweight and eating food from vending machines, but it turns out the causal relationship is convoluted enough that removing the vending machines doesn’t actually fix the original end point.
  5. Vitamins and better health (Some other Thing C is causing Thing A and Thing B): This one is similar to #4, but I consider it more applicable when it turns out Thing A and Thing B weren’t really connected at all. Eating a bag of chips out of a vending machine every day CAN cause you to gain weight, even if removing the vending machine doesn’t help you lose it again. With many vitamin supplements, on the other hand, initial correlations are often completely misleading. Many people who get high levels of certain vitamins (Thing A) are actually just those who pay attention to their health (Thing C), and those people tend to have better health outcomes (Thing B). Not all vitamins should be tarred with the same brush though; this awesome visualization shows where the evidence stands for 100 different supplements.
  6. Spurious Correlations (spurious or due to chance): There’s a whole website of these, but my favorite is the one correlating Nicolas Cage films with pool drownings:

Causal inference, not for the faint of heart.

 

The Signal and the Noise: Chapter 1

I’ve been reading Nate Silver’s “The Signal and the Noise” recently, and pretty much every chapter seems to lend itself to a contingency matrix. Each chapter is focused on a different prediction issue, and Chapter 1 covers the housing bubble and the incorrect valuation of the CDO market.

I wasn’t going to get into CDO ratings, but here’s the housing bubble:


It should be noted that swapping the word “home” in the title for any other product describes pretty much every market bubble ever. Color scheme taken from the cover art of the hardcover version, or maybe the Simpsons.

See all The Signal and the Noise posts here, or go to Chapter 2 here.

5 Definitions You Need to Remember When Discussing Mass Shootings This Week

In the wake of the Orlando tragedy last week, the national conversation rapidly turned to what we could do to prevent situations like this in the future. I’ve heard/seen a lot of commentary on this, and I get concerned by how often statistics get thrown out without a clear explanation of what the numbers actually do or don’t say. I wanted to review a few of the common issues I’m seeing, and to clarify some of the definitions. While I obviously have my own biases, my goal is NOT to endorse one viewpoint or another here. My goal is to make sure everyone knows what everyone else is talking about when they throw numbers out there.

Got it? Let’s go!

  1. Base rate: Okay, this is obviously one of my pet issues right now, but this is a great example of a time you have to keep the concept of a base rate in mind. In the wake of mass shootings, many people propose various ideas that would help us predict who future mass shooters might be. Vox has a great article here about why most of the attempts to do this would be totally futile. Basically, for every mass shooter in this country, there are millions and millions of non-mass-shooters. Even a detection algorithm that makes the right call 99.999% of the time would yield a couple hundred false positives (innocent people incorrectly identified) for every true positive. Read my post on base rates here for the math, but trust me, this is an issue.
  2. Mass Shooting: I’ve seen the claim a couple of places that we have about one mass shooting per day in this country, and I’ve also seen the claim that we had 4-6 last year. This Mother Jones article does an excellent deep dive on the statistic, but basically it comes down to circumstances. Most people agree that “mass” refers to 3 or 4 people killed at one time, but the precipitating events can be quite different. There are basically three types of mass shootings: 1. domestic/family violence, 2. shootings that occur during/around other criminal activity, and 3. indiscriminate public shootings. If you count all three together, you get the “one per day” number. If you only count #3, you get 4-6 per year. While obviously all of these events are horrible, the methods of addressing each are going to be different. At the very least, it’s good to know when we’re talking about one and when we’re talking about ALL of them.
  3. Gun Deaths: Even more common than the confusion about the term “mass shooting” is confusion about the term “gun deaths”. This pops up so frequently that I’ve been posting about it almost as long as I’ve been blogging, and I have made a couple of graphs (here and here) that have come in handy in some Twitter debates. The short version is that anything marked “gun deaths” almost always includes suicides and accidents. Suicide is the biggest contributor to this category, and any numbers or graphs generated from “gun death” data tend to look really different when these are taken out.
  4. Locations: This is a somewhat minor issue compared to the others, but take care when someone mentions “school shootings” or “attacks on American soil”. As I covered here, sometimes people use very literal definitions of locations to include situations you wouldn’t normally think of.
  5. Gun violence: Okay, this one should be obvious, but gun violence only refers to, um, gun violence. In the wake of a tragedy like Orlando, I’ve seen the words “gun violence” and “terrorism” tossed about as though they are interchangeable. When you state it clearly, it’s obvious that’s not true, but in the heat of the moment it’s an easy point to conflate. In one of my guns and graphs posts, I discovered that states with higher rates of gun murders also tend to have higher rates of non-gun murders, with r=.6 or so. In most states gun murders outnumber non-gun murders, but it’s important to remember other types of violence exist as well….especially if we’re talking about terrorism.
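The base-rate arithmetic in #1 is worth spelling out. Here’s a sketch with round numbers of my own choosing (not the Vox article’s figures):

```python
# Base-rate sketch: a screen that's wrong only 1 time in 100,000 still
# buries the handful of true positives in false alarms when the thing
# being screened for is vanishingly rare. All numbers are illustrative.
population = 320_000_000        # roughly the US population
true_shooters = 10              # assumed true positives in a given year
false_positive_rate = 0.00001   # i.e., the screen is 99.999% accurate

false_positives = (population - true_shooters) * false_positive_rate
print(f"False positives: {false_positives:,.0f}")  # about 3,200 people
print(f"False positives per true positive: {false_positives / true_shooters:.0f}")
```

Under these assumptions, every real shooter flagged comes bundled with a few hundred innocent people flagged alongside them, which is the whole base-rate problem.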

One definition I didn’t cover here is the word “terrorism”. I’ve been looking for a while, and I’m not sure I’ve found a great consensus on what constitutes terrorism and what doesn’t. Up until a few years ago, for example, the FBI ranked “eco-terrorism” as a major threat (and occasionally the number one domestic threat) to the US, despite the fact that most of these incidents caused property damage rather than killing people.

Regardless of political stance, I always think it’s important to understand the context of quoted numbers and what they do or don’t say. Stay safe out there.

5 Replication Possibilities to Keep in Mind

One of the more basic principles of science is the idea of replication or reproducibility.  In its simplest form, this concept is pretty obvious: if you really found the phenomena you say you found, then when I look where you looked I should be able to find it too. Most people who have ever taken a science class are at least theoretically familiar with this concept, and it makes a lot of sense…..if someone tells you the moon is green, and no one else backs them up, then we know the observation of the moon being green is actually a commentary on the one doing the looking as opposed to being a quality of the moon itself.

That being said, this is yet another concept that everyone seems to forget the minute they see a headline saying “NEW STUDY BACKS UP IDEA YOU ALREADY BELIEVED TO BE TRUE”. Every field faces different challenges and has different standards, but scientific knowledge is not really a binary “we know it or we don’t” thing. Some studies are stronger than others, and in general, the more studies that find the same thing, the stronger we consider the evidence. With that in mind, I wanted to go over some of the possibilities we have when someone tries to go back and replicate someone else’s work.

Quick note: in general, replication really applies to currently observable/ongoing phenomena. Work in some fields relies heavily on modeling future phenomena (see: climate science), and obviously future predictions cannot be replicated in the traditional sense. Additionally, attempts to explain past events (see: evolution) often can’t be replicated, since they have already occurred.

Got it? Great….let’s talk in generalities! So what happens when someone tries to replicate a study?

  1. The replication works. This is generally considered a very good thing. Either someone attempted to redo the study under similar circumstances and confirmed the original findings, or someone undertook an even stronger study design and still found the same thing. This is what you want to happen. Your case is now strong. It’s not always 100% definitive, as different studies could replicate the same error over and over again (see the ego depletion studies that were replicated 83 times before being called into question), but in general, this is a good sign.
  2. You get a partial replication. For most science, the general trajectory of discovery is one of refining ideas. In epidemiology, for example, you start with population-level correlations, and then try to work your way back to recommendations that can be useful at an individual level. This is normal. It also means that when you try to replicate certain findings, it’s totally normal to find that the original paper had some signal and some noise. For example, a few months ago I wrote about a headline-grabbing study that claimed that women’s political, social and religious preferences varied with their monthly cycle. The original study grabbed headlines in 2012, and by the time I went back and looked at it in 2016, further studies had narrowed the conclusions substantially. The claims were down to “facial preferences for particular politicians may vary somewhat based on monthly cycles, but fundamental beliefs do not”. I would like to see the studies replicated using the 2016 candidates to see if any of the findings bear out, but even without this you can see that subsequent studies narrowed the findings quite a bit. This is not necessarily a bad thing, but actually a pretty normal process. Almost every initial splashy finding will undergo some refinement as it continues to be studied.
  3. The study is disputed. Okay, this one can meander off the replication path a bit. When I say “disputed” here, I’m referencing the phenomenon that occurs when one study’s findings are called into question by another study that found something different, but the two used totally different methods to get there and now no one knows what’s right. Slate Star Codex has a great overview of this in Beware the Man of One Study, and a great example in Trouble Walking Down the Hallway. In the second post he covers two studies, one that shows a pro-male bias in hiring a lab manager and one that shows a pro-female bias in hiring a new college faculty member. Everyone used the study whose conclusions they liked better to bolster their case while calling the other study “discredited” or “flawed”. The SSC piece breaks it down nicely, but it’s actually really hard to tell what happened here and why these studies would be so different. To note: neither came to the conclusion that no one was biased. Maybe someone switched the data columns on one of them.
  4. The study fails to replicate. As my kiddo’s preschool teacher would say, “that’s sad news”. This is what happens when a study is performed under the same conditions and the effect goes away. For a good example, check out the Power Pose/Wonder Woman study, where a larger sample size undid the original findings….though not before the TED talk went viral and the book the researcher wrote about it got published. This isn’t necessarily bad either; thanks to our reliance on p-values we expect some of this, but in some fields it has become a bit of a crisis.
  5. Fraud is discovered. Every possibility I mentioned above assumes good faith. However, some of the most bizarre scientific fraud situations get discovered because someone attempts to replicate a previously published study and can’t do it. Not replicate the findings, mind you, but the experimental setup itself. Most methods sections are dense enough that any setup can sound plausible on paper, but it’s in practice that anomalies appear. For example, in the case of a study about how to change people’s views on gay marriage, a researcher realized the study setup was prohibitively expensive when he tried to replicate the original. While straight-up scientific fraud is fairly rare, it does happen. In these cases, attempts at replication are some of our best allies in keeping everyone honest.
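The “we expect some of this” point in #4 can be made concrete with a simulation: if an original “finding” was pure noise, a faithful replication at the usual p < .05 threshold should only “succeed” about 1 time in 20. A rough sketch in plain Python (using a normal approximation rather than a proper t-test):

```python
import random

random.seed(1)

def fake_study(n=50):
    """Two groups drawn from the SAME distribution: any 'effect' is noise.
    Returns True if the group means differ by more than ~2 standard errors
    (roughly p < .05 under a normal approximation)."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    diff = sum(a) / n - sum(b) / n
    se = (2 / n) ** 0.5  # standard error of the difference in means
    return abs(diff) > 1.96 * se

trials = 2000
hits = sum(fake_study() for _ in range(trials))
print(f"'Significant' results with no real effect: {hits / trials:.1%}")  # ~5%
```

Flip it around and the same logic explains failed replications of real effects: an underpowered replication of a genuine but small effect will miss it far more often than 1 time in 20.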

It’s important to note here that not all issues fall neatly into one of these categories. For example, in the Women, Ovulation and Voting study I mentioned in #2, two of the research teams had quite the back and forth over whether or not certain findings had been replicated. In an even more bizarre twist, when the fraudulent study from #5 was actually carried out, the findings actually stood (still waiting on subsequent studies!). For psychology, the single biggest criticism of the replication project (which claims #4) is that its replications aren’t fair and thus it’s actually #3 or #2.

My point here is not to sort any one replication effort into an obvious bucket, but to point out that there is a range of possibilities available. As I said in the beginning, very few findings will end up in a “totally right” or “total trash” bucket, at least at first. However, it’s important to realize any time you see a big exciting headline that subsequent research will almost certainly add or subtract something from the original story. Wheels of progress and all that.

What I’m Reading: June 2016

This month, my book was Beautiful Data: The Stories Behind Elegant Data Solutions. It’s more on the technical side (some stories include the code used to analyze the data), but it’s pretty good if that’s what you’re into.

Speaking of book lists, I’ve updated my list of recommendations.

Also, I’ve been having fun with those 2×2 matrices, so I started another site to play around with more of those.  It’s called Two Ways to be Wrong, and I’m pretty much just experimenting with different topics/colors/etc, but I’m having fun. Feel free to swing by!

Okay, this was an interesting article about EurekAlert, the site that publicizes recently published studies.

I’ve been playing around with Image Quilt ever since I took Edward Tufte’s seminar a few years ago, and it’s pretty great. Here’s a good article about it from its release.

Speaking of data visualization, here’s a promising looking guide to learning R, one of my next goals for the year.

How math helps fight epidemics. 


5 Ways that Average Might Be Lying to You

One of the very first lessons every statistics student learns in class is how to use measures of central tendency to assess data. While in theory this means most people should have at least a passing familiarity with the terms “average” or “mean, median and mode”, the reality is often quite different. For whatever reason, when presented with a statement about an average, we seem to forget the profound vulnerabilities of the “average”. Here are some of the most common:

  1. Leaving a relevant confounder out of your calculations. Okay, so maybe we can never get rid of all the confounders we should, but that doesn’t mean we can’t try at least a little. The most commonly quoted statistic I hear that leaves out relevant confounders is the “Women make 77 cents for every dollar a man earns” claim. Now this is a true statement IF you are comparing all men in the US to all women in the US, but it gets more complicated if you want to compare male/female pay by hours worked or within occupations. Of course “occupation and hours worked” are two things most people actually tend to assume are included in the original statistic, but they are not. The whole calculation can get really tricky (Politifact has a good breakdown here), but I have heard MANY people tag “for the exact same work” on to that sentence without missing a beat. Again, it’s not possible to control for every confounder, but your first thought when you hear a comparison of averages should be to check whether your assumptions about the conditions are accurate.
  2. A subset of the population could be influencing the value of the whole population. Most people are at least somewhat familiar with outlier-type values and “if Bill Gates walks into a bar, the average income goes way up” type issues. What we less often consider is how different groups being included or excluded from a calculation can influence things. For example, in the US we are legally required to educate all children through high school. The US often does not do well when it comes to international testing results. However, in this review by the Economic Policy Institute, they note that in some countries (Germany and Poland, for example) certain students are assigned to a “vocational track” quite early and may not end up getting tested at all. Since those children likely got put on that track because they weren’t good test takers, the average scores go up simply by removing the lowest performers. We saw a similar phenomenon within the US when more kids started taking the SATs. While previous generations bemoaned the lower SAT scores of “kids these days”, the truth was those scores were being influenced by expanding the pool of test takers to include a broader range of students. Is that the whole explanation? Maybe not, but it’s worth keeping in mind.
  3. The values could be bimodal (or another non-standard distribution). One of my first survey consulting gigs consisted of taking a look at some conference attendee survey data to try and figure out what the most popular sessions/speakers were. One of the conference organizers asked me if he could just get a list of the sessions with the highest average ranking. That sounded reasonable, but I wasn’t sure it was what they really wanted. You see, this organization actually kind of prided itself on challenging people and could be a little controversial. I was fairly sure they’d feel very differently about a session that had been ranked mostly 1’s and 10’s, as opposed to a session that had gotten all 5’s and 6’s. To distill the data to a simple average would be to lose a tremendous amount of information about the actual distribution of the ratings. It’s like asking how tall the average human is…you get some information, but lose a lot in the process. Neither the mean nor the median accounts for this.
  4. The standard deviations could be different. Look, I get why people don’t always report standard deviations…the phrase itself probably causes you to lose at least 10% of readers automatically. However, just because two data sets have the same average doesn’t mean the members of those groups look the same. In #3 I was referring to groups that have two distinct peaks on either side of the average, but even less dramatic spreads can cause the reality to look very different than the average suggests.
  5. It could be statistically significant but not practically significant. This one comes up all the time when people report research findings. You find that one group does “more” of something than another. Group A is happier than Group B. When you read these, it’s important to remember that, given a large enough sample size, ANY difference can become statistically significant. A good hint this may be an issue is when people don’t tell you the effect size up front. For example, this widely reported study showed that men with attractive wives are more satisfied with their marriages in the first 4 years. The study absolutely found a correlation between the attractiveness of the wife and the husband’s marital satisfaction…a gain of .36 in satisfaction (out of a possible 45 points) for every 1 point increase in attractiveness (on a scale of 1 to 10). That’s an interesting academic finding, but probably not something you want to knock yourself out worrying about.
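To make #1 concrete, here’s a minimal toy calculation (every salary and headcount below is invented for illustration, not real wage data): within each occupation the two groups earn exactly the same, yet the aggregate averages still differ, simply because the groups are spread differently across occupations.

```python
# Toy illustration of point #1 (all numbers invented): pay is identical
# within each occupation, but the overall averages differ because the
# two groups are distributed differently across occupations.

jobs = [
    # (salary, men in job, women in job)
    (100_000, 80, 20),  # occupation A
    (50_000, 20, 80),   # occupation B
]

def average_pay(col):
    """Overall average pay for one group: col=1 for men, col=2 for women."""
    total_pay = sum(row[0] * row[col] for row in jobs)
    headcount = sum(row[col] for row in jobs)
    return total_pay / headcount

men, women = average_pay(1), average_pay(2)
print(men, women, women / men)  # 90000.0 60000.0 0.666...
```

Same pay for the same job, but a raw comparison of averages shows one group making 67 cents on the other’s dollar. That’s the confounder at work.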
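Points #3 and #4 are easy to demonstrate with Python’s standard library (the session ratings below are made up): two sessions can share an identical average while describing completely different rooms.

```python
import statistics

# Invented ratings for two conference sessions with the same mean.
polarizing = [1, 1, 1, 1, 1, 10, 10, 10, 10, 10]  # loved it or hated it
lukewarm = [5, 6, 5, 6, 5, 6, 5, 6, 5, 6]         # nobody felt strongly

print(statistics.mean(polarizing), statistics.mean(lukewarm))      # 5.5 5.5
print(statistics.pstdev(polarizing), statistics.pstdev(lukewarm))  # 4.5 0.5
```

The means are identical; only the standard deviation (and a look at the raw distribution) reveals that one session split the room in half.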
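And for #5, here’s a rough sketch of why sample size drives significance. Assuming a simple two-sample z test with a known standard deviation of 1 in both groups (a simplification chosen for illustration, not the test used in the marriage study), the exact same tiny difference in means sails past the conventional 1.96 cutoff once n gets big enough.

```python
import math

# Sketch of point #5: the z statistic for a difference in means d between
# two groups of size n, each with standard deviation sigma, grows with
# the square root of n.
def z_stat(d, sigma, n):
    return d / math.sqrt(2 * sigma**2 / n)

# The same negligible 0.01-point difference, at two sample sizes:
print(z_stat(0.01, 1.0, 1_000))      # ~0.22: nowhere near significant
print(z_stat(0.01, 1.0, 1_000_000))  # ~7.07: "highly significant"
```

Nothing about the difference changed; only the sample size did. That’s why the effect size, not the p-value, tells you whether a finding is worth worrying about.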

Beware the average.