4 Examples of Confusing Cross-Cultural Statistics

In light of my last post about variability in eating patterns across religious traditions, I thought I’d put together a few other examples of times when attempts to compare data across international borders got a little more complicated than you would think.

Note: not all of this confusion changed the conclusions people were ultimately drawing, but it did muddy the waters a bit.

  1. Who welcomes the refugee  About a year ago, when Syrian refugees were making headlines, there was a story going around that China was the most welcoming country for people fleeing their homeland. The basis of the story was an Amnesty International survey that showed a whopping 46% of Chinese citizens saying they would be willing to take a refugee into their home…..far more than any other country. The confusion arose when a Quartz article pointed out that there is no direct Chinese translation for the word “refugee” and the word used in the survey meant “person who has suffered a calamity” without clarifying whether that person is international or lives down the street. It’s not clear how this translation may have influenced the responses, but a different question on the same survey that made the “international” part clearer received much lower support.
  2. The French Paradox (reduced by 20%) In the process of researching my last post, I came across a rather odd tidbit I’d never heard of before regarding the “French Paradox”. A term that originated in the 80s, the French Paradox is the apparent contradiction that French people eat lots of cholesterol/saturated fat and yet don’t get heart disease at the rates you would expect based on data from other countries. Now I had heard of the paradox itself before, but the part I hadn’t heard was the assertion that French doctors under-counted deaths from coronary heart disease. When researchers compared death certificates to data collected by more standardized methods, they found this was indeed the case:

    They suspect the discrepancy arose because doctors in many countries automatically attribute sudden deaths in older people to coronary heart disease, whereas the French doctors were only doing so if they had clinical evidence of heart disease. This didn’t actually change France’s rank very much; it still has a lower than expected rate of heart disease. However, it did nearly double the reported incidence of CHD and cut the paradox down by about 20%.

  3. Crime statistics of all sorts This BBC article is a few years old, but it has some interesting tidbits about cross-country crime rate comparisons. For example, Canada and Australia have the highest kidnapping rates in the world. The reason? They count all parental custody disputes as kidnappings, even if everyone knows where the child is. Other countries keep this data separate and only use “kidnapping” to describe a missing child. Countries that widen their definitions of certain crimes tend to see an uptick in those crimes, like Sweden saw with rape when it widened its definition in 2005.
  4. Infant mortality  This World Health Organization report has some interesting notes about how different countries count infant mortality, and it notes that some countries (such as Belgium, France and Spain) only count infant mortality in infants who survive beyond a certain time period after birth, such as 24 hours. Those countries tend to have lower infant mortality rates but higher stillbirth rates than countries that don’t set such a cutoff. Additionally, as of 2008 approximately 3/4 of countries lacked the infrastructure to count infant mortality through hospital reporting and relied on household surveys instead.

Like I said, not all of these change the conclusions people come to, but they are good things to keep in mind.

5 Things You Should Know About the “Backfire Effect”

I’ve been ruminating a lot on truth and errors this week, so it was perhaps well timed that someone sent me this article on the “backfire effect” a few days ago. The backfire effect is a name given to a psychological phenomenon in which attempting to correct someone’s facts actually increases their belief in their original error. Rather than admit they are wrong when presented with evidence, the narrative goes, people double down. Given the current state of politics in the US, this has become a popular thing to talk about. It’s popped up in my Facebook feed and is commonly cited as the cause of the “post-fact” era.

So what’s up with this? Is it true that no one cares about facts any more? Should I give up on this whole facts thing and find something better to do with my time?

Well, as with most things, it turns out it’s a bit more complicated than that. Here’s a few things you should know about the state of this research:

  1. The most highly cited paper focused heavily on the Iraq War The first paper that made headlines was from Nyhan and Reifler back in 2010, and was performed on college students at a Midwestern Catholic university. They presented some students with stories including political misperceptions, and some with stories that also had corrections. They found that the students who got corrections were more likely to believe the original misperception than those who didn’t. The biggest issue this showed up with was whether or not WMDs were found in Iraq. They also tested facts/corrections around the tax code and stem cell research bans, but it was the WMD findings that grabbed all the headlines. What’s notable is that the research was performed in 2005 and 2006, when the Iraq War was heavily in the news.
  2. The sample size was fairly small and composed entirely of college students One of the primary weaknesses of the first papers (as stated by the authors themselves) is that 130 college students are not really a representative sample. The sample was half liberal and 25% conservative. It’s worth noting that the authors believed this was representative of their campus, meaning all of the conservatives were in an environment where they were the minority. Given that one of the conclusions of the paper was that conservatives seemed to be more prone to this effect than liberals, that’s an important point.
  3. A new paper with a broader sample suggests the “backfire effect” is actually fairly rare. Last year, two researchers (Porter and Wood) polled 8,100 people from all walks of life on 36 political topics and found…..WMDs in Iraq were actually the only issue that provoked a backfire effect. A great Q&A with them can be found here. This is fascinating if it holds up, because it means the original research was mostly confirmed but any attempt at generalization was pretty wrong.
  4. When correcting facts, phrasing mattered One of the more interesting parts of the Porter/Wood study was how the researchers described their approach to corrections. In their own words: “Accordingly, we do not ask respondents to change their policy preferences in response to facts–they are instead asked to adopt an authoritative source’s description of the facts, in the face of contradictory political rhetoric”. They heartily reject “corrections” that are aimed at making people change their mind on a moral stance (like, say, abortion) and focus only on facts. Even with the WMD question, they found that the more straightforward and simple the correction statement, the more people of all political persuasions accepted it.
  5. The 4 study authors are now working together In an exceptionally cool twist, the authors who came to slightly different conclusions are now working together. The Science of Us gives the whole story here, but essentially Nyhan and Reifler praised Porter and Wood’s work, then said they should all work together to figure out what’s going on. They apparently gathered a lot of data during the height of election season and hopefully we will see those results in the near future.

I think this is an important set of points, first because it’s heartwarming (and intellectually awesome!) to see senior researchers accepting that some of their conclusions may be wrong and actually working with others to improve their own work. Second, I think it’s important because I’ve heard a lot of people in my personal life commenting that “facts don’t work”, so they basically avoid arguing with those who don’t agree with them. If it’s true that facts DO work as long as you’re not focused on getting someone to change their mind on the root issue, then it’s REALLY important that we know that. It’s purely anecdotal, but I can note that this has been my experience with political debates. Even the most hardcore conservatives and liberals I know will make concessions if you clarify you know they won’t change their mind on their moral stance.

5 Things You Should Know About Statistical Process Control Charts

Once again I outdo myself with the clickbait-ish titles, huh? Sorry about that, I promise this is actually a REALLY interesting topic.

I was preparing a talk for a conference this week (today actually, provided I get this post up when I plan to), and I realized that statistical process control charts (or SPC charts for short) are one of the tools I use quite often at work but don’t really talk about here on the blog. Between those and my gif usage, I think you can safely guess why my reputation at work is a bit, uh, idiosyncratic. For those of you who have never heard of an SPC chart, here’s a quick orientation. First, they look like this:

(Image from qimacros.com, an excellent piece of software for generating these)

The chart is used for plotting something over time….hours, days, weeks, quarters, years, or “order in line”…take your pick.  Then you map some ongoing process or variable you are interested in…..say employee sick calls. You measure employee sick calls in some way (# of calls or % of employees calling in) in each time period. This sets up a baseline average, along with “control limits”, which are basically 1, 2 and 3 standard deviation ranges. If at some point your rate/number/etc starts to go up or down, the SPC chart can tell you if the change is significant or not based on where it falls on the plot.  For example, if you have one point that falls outside the 3 standard deviation line, that’s significant. If two in a row fall outside the 2 standard deviation line, that’s significant as well. The rules for this vary by industry, and Wiki gives a pretty good overview here. At the end of this exercise you have a really nice graph of how you’re doing with a good visual of any unusual happenings, all with some statistical rigor behind it. What’s not to love?
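
If you want to see the mechanics, here’s a minimal sketch of how the baseline and control limits get computed for a sick-call-style metric. All the monthly numbers are made up, and I’m estimating sigma with the plain standard deviation of the baseline just to keep the idea visible; real SPC software (like QI Macros) uses chart-specific formulas such as moving ranges.

```python
import numpy as np

# Monthly % of employees calling in sick (made-up baseline data)
baseline = np.array([10.2, 9.8, 11.1, 10.5, 9.4, 10.9,
                     10.0, 9.7, 11.3, 10.4, 9.9, 10.6])

center = baseline.mean()        # the baseline average
sigma = baseline.std(ddof=1)    # plain standard deviation, just to show the idea

# The 1, 2 and 3 standard deviation "control limit" bands
limits = {k: (center - k * sigma, center + k * sigma) for k in (1, 2, 3)}

def check_new_points(new_values):
    """Flag the two signals described above: one point outside the 3-sigma
    limits, or two in a row outside the same 2-sigma limit."""
    flags = []
    for i, x in enumerate(new_values):
        if x > limits[3][1] or x < limits[3][0]:
            flags.append((i, x, "outside 3-sigma limit"))
        if i > 0:
            prev = new_values[i - 1]
            if (x > limits[2][1] and prev > limits[2][1]) or \
               (x < limits[2][0] and prev < limits[2][0]):
                flags.append((i, x, "two in a row outside 2-sigma limit"))
    return flags

# Four new months of data: did something change?
print(check_new_points([10.8, 12.9, 11.6, 11.8]))
```

Running the check on those hypothetical new months flags the 12.9% month (outside the 3-sigma line) and the run of elevated months after it (two in a row outside 2-sigma).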

Anyway, I think because they take a little bit of getting used to, SPC charts do not always get the love they deserve. I would like to rectify this travesty, so here are 5 things you should know about them to tempt you to go learn more:

  1. SPC charts are probably more useful for most businesses than hypothesis testing While most high school level statistics classes at least take a stab at explaining p-values and hypothesis testing to kids, almost none of them even show an example of a control chart. And why not? I think it’s a good case of academia favoring itself. If you want to test a new idea against an old idea or to compare two things at a fixed point in time, p-values and hypothesis testing are pretty good. That’s why they’re used in most academic research. However, if you want to see how things are going over time, you need statistical process control. Since this is more relevant for most businesses, people who are trying to keep track of any key metric should DEFINITELY know about these. Six Sigma and many process improvement classes teach statistical process control, but these charts still don’t seem widely used outside of those settings. Too bad. They are practical, they can be updated easily, and they give you a way of monitoring what’s going on and a lot of good information about how your processes are doing. Like what? Well, like #2 on this list:
  2. SPC charts track two types of variation Let’s get back to my sick call example. Let’s say that in any given month, 10% of your employees call in sick. Now most people realize that not every month will be exactly 10%. Some months it’s 8%, some months it’s 12%. What statistical process control charts help calculate is when those fluctuations are most likely just random (known as common cause variation) and the point at which they are probably not so random (special cause variation). They set parameters that tell you when you should pay attention. They are better than p-values for this because you’re not really running an experiment every month….you just want to make sure everything’s progressing as it usually does. The other nice part is that this translates easily into a nice visual, so you can say with confidence “this is how it’s always been” or “something unusual is happening here” and have more than your gut to rely on.
  3. SPC charts help you test new things, or spot concerning trends quickly SPC charts were really invented for manufacturing plants, and were perfected and popularized in post-WWII Japan. One of the reasons for this is that manufacturers really loved having an early warning about when a machine might be breaking down or an employee might not be following the process. If the process goes above or below a certain red line (aka the “upper/lower control limit”) you have a lot of confidence something has gone wrong and can start investigating right away. In addition to this, you can see if a change you made helps anything. For example, if you do a handwashing education initiative, you can see what percentage of your employees call in sick the next month. If it’s below the lower control limit, you can say it was a success, just like with traditional p-values/hypothesis testing. HOWEVER, unlike p-values/hypothesis testing, SPC charts make allowances for time. Let’s say you drop the sick calls to 9% per month, but then they stay down for 7 months. Your SPC chart rules now tell you you’ve made a difference. SPC charts don’t just take into account the magnitude of the change, but also the duration. Very useful for any metric you need to track on an ongoing basis.
  4. They encourage you not to fix what isn’t broken One of the interesting reasons SPC charts caught on so well in the manufacturing world is that the idea of “opportunity cost” was well established. If your assembly line puts out a faulty widget or two, it’s going to cost you a lot of money to shut the whole thing down. You don’t want to do that unless it’s REALLY broken. For our sick call example, it’s possible that what looks like an increase (say to 15% of your workforce) isn’t a big deal and that trying to interfere will cause more harm than good. Always good to remember that there are really two ways of being wrong: missing a problem that does exist, and trying to fix one that doesn’t.
  5. There are quite a few different types One of the extra nice things about SPC charts is that there are actually 6 types to choose from, depending on what kind of data you are working with. There’s a helpful flowchart to pick your type here, but a good computer program (I use QI Macros) can actually pick for you. One of the best parts of this is that some of them can deal with small and varying sample sizes, so you can finally show that going from 20% to 25% isn’t really impressive if you just lowered your volume from 5 to 4 (there’s a quick sketch of the math right after this list).
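
To put some numbers behind #5: for proportion data the relevant chart is a p-chart, whose limits are the center proportion plus or minus three standard errors, and the standard error depends on the subgroup size. Here’s a quick sketch (the 20% baseline is just borrowed from the 20%-to-25% example above):

```python
import math

def p_chart_limits(p_bar, n):
    """Control limits for a p-chart: p_bar +/- 3*sqrt(p_bar*(1-p_bar)/n),
    clipped to the [0, 1] range. Note how they depend on the subgroup size n."""
    half_width = 3 * math.sqrt(p_bar * (1 - p_bar) / n)
    return max(0.0, p_bar - half_width), min(1.0, p_bar + half_width)

p_bar = 0.20   # long-run average proportion (the 20% from the example)

# With subgroups of only 4 or 5, the limits are enormous, so a jump from
# 20% (1 of 5) to 25% (1 of 4) is nowhere near a signal...
print(p_chart_limits(p_bar, 5))     # roughly (0.0, 0.74)
print(p_chart_limits(p_bar, 4))     # roughly (0.0, 0.80)

# ...but with a subgroup of 2,000 the same five-point jump stands out.
print(p_chart_limits(p_bar, 2000))  # roughly (0.17, 0.23)
```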

So those are some of my reasons you should know about these magical little charts. I do wish they’d get used more often because they are a great way of visualizing how you’re doing on an ongoing basis.

If you want to know more about the math behind them and more uses (especially in healthcare), try this presentation. And wish me luck on my talk! Pitching this stuff right before lunch is going to be a challenge.

Moral Outrage, Cleansing Fires and Reasonable Expectations

Last week, the Assistant Village Idiot forwarded me a new paper called “A cleansing fire: Moral outrage alleviates guilt and buffers threats to one’s moral identity“. It’s behind a ($40) paywall, but Reason magazine has an interesting breakdown of the study here, and the AVI does his take here. I had a few thoughts about how to think about a study like this, especially if you don’t have access to the paper.

So first, what did the researchers look at and what did they find? Using Mechanical Turk, the researchers had subjects read articles that talked about either labor exploitation in other countries or the effects of climate change. They found that personal feelings of guilt about those topics predicted greater outrage at a third-party target, a greater desire to punish that target, and that getting a chance to express that outrage decreased guilt and increased feelings of personal morality. The conclusion being reported is (as the Reason.com headline says) “Moral outrage is self-serving” and “Perpetually raging about the world’s injustices? You’re probably overcompensating.”

So that’s what’s being reported. How do we think through this when we can’t see the paper? Here are 5 things I’d recommend:

  1. Know what you don’t know about sample sizes and effect sizes Neither the abstract nor the write-ups I’ve seen mention how large the reported effects were or how many people participated. Since it was a Mechanical Turk study I am assuming the sample size was reasonable, but the effect size is still unknown. This means we don’t know if it’s one of those unreasonably large effect sizes that should alarm you a bit or one of those small effect sizes that is statistically but not practically significant. Given that reported effect size heavily influences the false report probability, this is relevant (there’s a rough sketch of that calculation after this list).
  2. Remember the replication possibilities Even if you think a study found something quite plausible, it’s important to remember that fewer than half of psychological studies end up replicating exactly as the first paper reported. Replications can shake out in lots of different ways, and even if the paper does replicate it may end up with lots of caveats that didn’t show up in the first paper.
  3. Tweak a few words and see if your feelings change Particularly when it comes to political beliefs, it’s important to remember that context matters. This particular study calls to mind liberal issues, but do we think it applies to conservative issues too? Everyone has something that gets them upset, and it’s interesting to think through how that would apply to what matters to us. When the Reason.com commenters read the study article, some of them quickly pointed out that of course their own personal moral outrage was self-serving. Free speech advocates have always been forthright that they don’t defend pornographers and offensive people because they like those people, but because they want to preserve free speech rights for themselves and others. Self-serving moral outrage isn’t so bad when you put it that way.
  4. Assume the findings will get more generic In addition to the word tweaks in point #3, it’s likely that subsequent replications will tone down the findings. As I covered in my Women Ovulation and Voting post, 3 studies took findings from “women change their vote and values based on their menstrual cycle” to “women may exhibit some variation in face preference based on menstrual cycle”. This happened because some parts of the initial study failed to replicate, and some caveats got added. Every study that’s done will draw another line around the conclusions and narrow their scope.
  5. Remember the limitations you’re not seeing One of the most important parts of any paper is where the authors discuss the limitations of their own work. When you can’t read the paper, you can’t see what they thought their own limitations were. Additionally, it’s hard to tell if there were any interesting non-findings that didn’t get reported. The limitations that exist from the get-go give a useful indication of what might come up in the future.
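
Since #1 leans on the “false report probability” idea from my “Why Most Published Research Findings Are False” series, here’s a rough sketch of that calculation. The power and pre-study odds below are made up purely for illustration, since we can’t get the real ones without the paper:

```python
def prob_finding_is_true(power, alpha, prior_odds):
    """Ioannidis-style positive predictive value of a claimed finding,
    ignoring bias: PPV = power*R / (power*R + alpha), where R is the
    pre-study odds that the tested relationship is real."""
    return (power * prior_odds) / (power * prior_odds + alpha)

alpha = 0.05
prior_odds = 0.25   # made-up pre-study odds for an exploratory psych finding

# A small effect measured in a modest sample means low power...
print(prob_finding_is_true(power=0.20, alpha=alpha, prior_odds=prior_odds))  # 0.50
# ...while a well-powered study of the same question is much more trustworthy.
print(prob_finding_is_true(power=0.80, alpha=alpha, prior_odds=prior_odds))  # 0.80
```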

So in other words….practice reasonable skepticism. Saves time, and the fee to read the paper.

Who Votes When? Untangling Non-Citizen Voting

Right after the election, most people in America saw or heard about this Tweet from then-President-elect Trump:

I had thought this was just random bluster (on Twitter????? Never!), but then someone sent me this article. Apparently that comment was based on an actual study, and the study author is now giving interviews. It turns out he’s pretty unhappy with everyone….not just with Trump, but also with Trump’s opponents who claim that no non-citizens voted. So what did his study actually say? Let’s take a look!

Some background: The paper this is based on is called “Do Non-Citizens Vote in US Elections?” by Richman et al. and was published back in 2014. It took data from a YouGov survey and found that 6.4% of non-citizens voted in 2008 and 2.2% voted in 2010. Non-citizenship status was based on self-report, as was voting status, though the demographic data of participants was checked against that of their stated voting district to make sure the numbers at least made sense.

So what stood out here? A few things:

  1. The sample size While the initial survey of voters was pretty large (over 80,000 between the two years) the number of those identifying themselves as non-citizens was rather low: 339 and 489 for the two years. There were a total of 48 people who stated that they were not citizens and that they voted. As a reference, it seems there are about 20 million non-citizens currently residing in the US.
  2. People didn’t necessarily know they were voting illegally One of the interesting points made in the study was that some of this voting may be unintentional. If you are not a citizen, you are never allowed to vote in national elections, even if you are a permanent resident/have a green card. The study authors wondered if some people didn’t know this, so they analyzed the education levels of those non-citizens who voted. It turns out non-citizens with less than a high school degree are more likely to vote than those with more education. This is the opposite of the trend seen among citizens AND naturalized citizens, suggesting that some of those voters have no idea that what they’re doing is illegal.
  3. Voter ID checks are less effective than you might think If your first question on reading #2 was “how could you just illegally vote and not know it?” you may be presuming your local polling place puts a little more into screening people than it does. According to the participants in this study, not only were non-citizens allowed to register and cast a vote, but a decent number of them actually passed an ID check first. About a quarter of non-citizen voters said they were asked for ID prior to voting, and two-thirds of those said they were then allowed to vote. I suspect the issue is that most polling places don’t actually have much to check their information against. Researching citizenship status would take time and money that many places just don’t have. Another interesting twist is that social desirability bias may kick in for those who don’t know voting is illegal. Voting is one of those things more people say they do than actually do, so if someone didn’t know they couldn’t legally vote, they’d be more likely to say they did even if they didn’t. Trying to make ourselves look good is a universal quality.
  4. Most of the illegal voters were white Non-citizen voters actually tracked pretty closely with their proportion of the population, and about 44% of them were white. The next most common demographic was Hispanic at 30%, then black, then Asian. In terms of proportion, the same percent of white non-citizens voted as Hispanic non-citizens.
  5. Non-citizens are unlikely to sway a national election, but could sway state level elections When Trump originally referenced this study, he specifically was using it to discuss national popular vote results. In the Wired article, they do the math and find that even if all of the numbers in the study bear out, it would not sway the national popular vote. However, the original study actually drilled down to a state level and found that individual states could have their results changed by non-citizen voters. North Carolina and Florida would both have been within the realm of mathematical possibility for the 2008 election, and for state-level races the math is also there.

Now, how much confidence you place in this study is up to you. Given the small sample size, things like selection bias and non-response bias definitely come into play. That’s true any time you’re trying to extrapolate the behavior of 20 million people from the behavior of a few hundred. It is important to note that the study authors did a LOT of due diligence attempting to verify and reality-check the numbers they got, but it’s never possible to control for everything.

If you do take this study seriously, it’s interesting to note what the authors actually thought the most effective counter-measure against non-citizen voting would be: education. Since they found that low education levels were correlated with increased voting and that poll workers rarely turned people away, they came away from this study with the suggestion that simply doing a better job of notifying people of voting rules might be just as effective as (and cheaper than!) attempting to verify citizenship. Ultimately it appears that letting individual states decide on their own strategies would also be more effective than anything on the federal level, as different states face different challenges. Things to ponder.


5 Things You Should Know About Study Power

During my recent series on “Why Most Published Research Findings Are False“, I mentioned a concept called “study power” quite a few times. I haven’t talked about study power much on this blog, so I thought I’d give a quick primer for those who weren’t familiar with the term. If you’re looking for a more in depth primer try this one here, but if you’re just looking for a few quick hits, I gotcha covered:

  1. It’s sort of the flip side of the p-value We’ve discussed the p-value and how it’s based on the alpha value before, and study power is actually based on a value called beta. If alpha can be thought of as the chance of committing a Type 1 error (false positive), then beta is the chance of committing a Type 2 error (false negative). Study power is actually 1 – beta, so if someone says study power is .8, that means the beta was .2. Setting the alpha and beta values is up to the discretion of the researcher….their values are more about risk tolerance than mathematical absolutes.
  2. The calculation is not simple, but what it’s based on is important Calculating study power is not easy math, but if you’re desperately curious try this explanation. For most people though, the important part to remember is that it’s based on 3 things: the alpha you use, the effect size you’re looking for, and your sample size. Each of these can shift based on the values of the other two. As an example, imagine you were trying to figure out if a coin was weighted or not. The more confident you want to be in your answer (alpha), the more times you have to flip it (sample size). However, if the coin is REALLY unfairly weighted (effect size), you’ll need fewer flips to figure that out. Basically the unfairness of a coin weighted to 80-20 will be easier to spot than that of a coin weighted to 55-45 (there’s a quick simulation sketch after this list).
  3. It is weirdly underused As we saw in the “Why Most Published Findings Are False” series, adequate study power does more than prevent false negatives. It can help blunt the impact of bias and the effect of multiple teams, and it helps everyone else trust your research. So why is it that most researchers put little thought into it, science articles rarely mention it, and people in general don’t comment on it? I’m not sure, but I think it’s simply because the specter of false negatives is not as scary or concerning as that of false positives. Regardless, you just won’t see it mentioned as often as other statistical issues. Poor study power.
  4. It can make negative (aka “no difference”) studies less trustworthy With all the current attention on false positive/failure to replicate studies, it’s not terribly surprising that false negatives have received less attention…..but they are still an issue. Study power calculations can tell you how big an effect size you were even capable of detecting, yet an odd number of researchers don’t include these calculations. This means a lot of “negative finding” trials could also be suspect. In this breakdown of study power, Statistics Done Wrong author Alex Reinhart cites studies that found up to 84% of studies don’t have sufficient power to detect even a 25% difference in primary outcomes. An ASCO review found that 47% of oncology trials didn’t have sufficient power to detect all but the largest effect sizes. That’s not nothing.
  5. It’s possible to overdo it While underpowered studies are clearly an issue, it’s good to remember that overpowered studies can be a problem too. They waste resources, but can also detect effect sizes so small as to be clinically meaningless.
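
To make the weighted-coin example in #2 concrete, here’s a minimal simulation sketch. The flip counts and the simple z-test are my own illustrative choices, not anything from a particular study:

```python
import numpy as np

def simulated_power(true_p, n_flips, n_sims=20_000, seed=0):
    """Estimate the power of a two-sided 'is this coin fair?' test by
    simulating many experiments with a coin whose true heads probability
    is true_p, and counting how often a z-test rejects p = 0.5 at alpha = 0.05."""
    rng = np.random.default_rng(seed)
    heads = rng.binomial(n_flips, true_p, size=n_sims)
    p_hat = heads / n_flips
    z = (p_hat - 0.5) / np.sqrt(0.25 / n_flips)  # standard error under the null
    return np.mean(np.abs(z) > 1.96)             # 1.96 = two-sided 5% cutoff

# A badly weighted coin (big effect size) is easy to catch with few flips...
print(simulated_power(0.80, n_flips=30))   # power is already very high
# ...but a slightly weighted coin (small effect size) needs far more data.
print(simulated_power(0.55, n_flips=30))   # power is poor
print(simulated_power(0.55, n_flips=800))  # now roughly the conventional 0.8
```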

Okay, so there you have it! Study power may not get all the attention the p-value does, but it’s still a worthwhile trick to know about.

How To Read a Headline: Are Female Physicians Better?

Over the years I’ve spilled a lot of (metaphorical) ink on how to read science on the internet. At this point almost everyone who encounters me frequently IRL has heard my spiel, and few things give me greater pleasure than hearing someone say “you changed the way I read about science”. While I’ve written quite a few longer pieces on the topic, recently I’ve been thinking a lot about what my “quick hits” list would be. If people could only change a few things in the way they read science stories, what would I put on the list?

Recently, a story hit the news about how you might live longer if your doctor is a woman and it got me thinking. As someone who has worked in hospitals for over a decade now, I had a strong reaction to this headline. I have to admit, my mind started whirring ahead of my reading, but I took the chance to observe what questions I ask myself when I need to pump the brakes. Here they are:

  1. What would you think if the study had said the opposite of what it says? As I admitted up front, when I first heard about this study, I reacted. Before I’d even made it to the text of the article I had theories forming. The first thing I did to slow myself down was to think “wait, how would you react if the headline said the opposite? What if the study found that patients of men did better?” When I ran through those thoughts, I realized they were basically the same theories. Well, not the same…more like mirror images, but they led to the same conclusion. That’s when I realized I wasn’t thinking through the study and its implications, I was trying to make the study fit what I already believed. I admit this because I used this knowledge to mentally hang a big “PROCEED WITH CAUTION” sign on the whole topic. To note, it doesn’t matter what my opinion was here, what matters is that it was strong enough to muddy my thoughts.
  2. Is the study linked to? My first reaction (see #1) kicked in before I had even finished the headline, so unfortunately “is this real” comes second. In my defense, I was already seeing the headlines on NPR and such, but of course that doesn’t always mean there’s a real study. Anyway, in the case of this study, there is a real identified study (with a link!) in JAMA. As a note, even if the study is real, I distrust any news coverage that doesn’t provide a link to the source. In 2017, that’s completely inexcusable.
  3. Do all the words in the headline mean what you think they mean? Okay, I’ve covered headlines here, but it bears repeating: headlines are a marketing tool. This study appeared under several headlines such as “You Might Live Longer if Your Doctor is a Woman“. What’s important to note here is that by “live longer” they meant “slightly lower 30 day mortality after discharge from the hospital”, by doctor they meant “hospitalist”, and by “you” they meant “people over 65 who have been hospitalized”. Primary care doctors and specialists were not covered by this study.
  4. What’s the sample size and effect size? Okay, once we have the definitions out of the way, now we can start with the numbers. For this study, the sample size was fantastic….about 1.5 million hospital admissions. The effect size however….not so much. For patients treated by female physicians vs male, the 30 day mortality dropped from 11.49% to 11.07%. That’s not nothing (a relative drop of roughly 4%; I run the quick arithmetic after this list), but mathematically speaking it’s really hard to reliably measure effect sizes of under 5% (Corollary #2) even when you have a huge sample size. To their credit, the study authors do include the “number needed to treat”, and note that you’d have to have 233 patients treated by female physicians rather than male physicians in order to save one life. That’s a better stat than the one this article tried to use: “Put another way – if you were to replace all the male doctors in the study with women, 32,000 fewer people would die a year.” I am going to bet that wouldn’t actually work out that way. Throw “of equal quality” in there next time, okay?
  5. Is this finding the first of its kind? As I covered recently in my series on “Why Most Published Research Findings Are False“, first-of-their-kind exploratory studies are some of the least reliable types of research we have. Even when they have good sample sizes, they should be taken with a massive grain of salt. As a reference, Ioannidis puts the chances that a positive finding is true for a study like this at around 20%. Even if subsequent research proves the hypothesis, it’s likely that the effect size will diminish considerably in later work. For a study that starts off with a roughly 4% effect size, that could be a pretty big hit. It’s not bad to continue researching the question, but drawing conclusions or changing practice over one paper is a dangerous game, especially when the study was observational.
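
As promised in #4, here’s the back-of-the-envelope arithmetic on the effect size, using the raw mortality rates quoted above. The paper’s own number needed to treat of 233 comes from adjusted estimates, so the rough version won’t match it exactly:

```python
p_male = 0.1149    # 30 day mortality, patients of male hospitalists
p_female = 0.1107  # 30 day mortality, patients of female hospitalists

arr = p_male - p_female  # absolute risk reduction: ~0.0042 (0.42 points)
rrr = arr / p_male       # relative risk reduction: ~3.7%, i.e. roughly 4%
nnt = 1 / arr            # ~238 patients per life saved (paper reports 233)

print(f"ARR = {arr:.4f}, RRR = {rrr:.1%}, NNT = {nnt:.0f}")
```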

So after all this, do I believe this study? Well, maybe. It’s not implausible that personal characteristics of doctors can affect patient care. It’s also very likely that the more data we have, the more we’ll find associations like this. However, it’s important to remember that proving causality is a long and arduous process, and that reacting to new findings with “well it’s probably more complicated than that” is an answer that’s not often wrong.