A is for Alternative Hypothesis

A few weeks ago, I hung out with two lovely people, self-professed language lovers who didn’t really like math. Over a wonderful spread of french fries and beer, I tried to get them a little better versed in why math and statistics are so appealing to me. After a few more beers and a wonderful reading of Introductory Calculus For Infants, I had an idea: wouldn’t it be fun to put together a list of statistics words for logophiles? Since my urge to systematize pervades all aspects of my life, I figured I’d start with the alphabet. More specifically, the letter A. Obviously I’m cheating a bit here as this is technically a phrase, but bear with me.


Hypothesis testing in general and the Alternative Hypothesis in particular are beautiful things. Learn more about them here.

How Do They Call Elections so Early?

I live in Massachusetts now, but for the first 18 or so years of my life I lived in New Hampshire. I still have most of my family and many friends there, so every 4 years around primary time my Facebook feed turns into a front-row seat for the “first in the nation primary” show1. This year the primary was on Tuesday, February 9th, and it promised to be an interesting time, as both parties had unexpected races going on. I was interested in the results of the primary, but since I tend to go to bed early, I was unsure I’d stay up late enough to see it through. Thus, like many others, I was completely surprised to see CNN had called the race around 8:30 for Trump and Sanders with only 8% of the votes counted. By 8:45 I had a message in my inbox from a NH family member/Sanders supporter saying “okay, how’d they do that????”.

It’s a great question and one I was interested to learn more about. It turns out most networks keep their exact strategies secret, but I figured I’d take a look at the most likely general approach. I start with some background math stuff, but I include pictures!

Okay, first things first, what information do we need?

Whenever you’re doing any sort of polling (including voting), there are a couple things you need to think through.  These are:

  1. What your population size is
  2. How confident you want to be in your guess (confidence level)
  3. How close you want your guess to be to reality  (margin of error)
  4. If you have any idea what the real value is
  5. Sampling bias risk

#1 is pretty easy here. About 250,000 voters voted in the Democratic primary, and 280,000 voted in the Republican primary. Population size doesn’t actually matter much for the math once it’s this large.
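For the curious, the reason size stops mattering is the standard finite population correction, which only shrinks your required sample when the population is small relative to the sample itself. A quick Python sketch (the function name and numbers are just my illustration):

```python
def fpc_adjusted(n, population):
    """Finite population correction: adjust a required sample size n for a finite population."""
    return n / (1 + (n - 1) / population)

print(round(fpc_adjusted(1000, 250_000)))  # ~996: a quarter-million voters barely changes n
print(round(fpc_adjusted(1000, 2_500)))    # ~714: a small town is a different story
```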

#2 Confidence is up to the individual network, but they’re almost universally pretty conservative. They’re skittish here because every journalist to ever pick up a pen has seen the famous photo of Truman holding up the “DEWEY DEFEATS TRUMAN” front page, and lives in fear of it.

If you’re missing the reference, Wikipedia’s got your back, but suffice it to say networks live in fear of a missed call.

#3 is how close you want to be to reality. We’ll come back to this, but basically it’s how much you need your answer to look like the real answer. When polls say “the margin of error is +/- 3 percentage points”, this is what they’re saying. Picture a target with a cluster of x’s on it:

Margin of error is basically how close those x’s need to be to the target; confidence level (#2) is how close you need them to be to each other.

#4 is whether or not you’re working from scratch or you have a guess. Basically, do you know ahead of time what percent of people might be voting for a candidate or are you going in blind?

#5 is all the other messy stuff that has nothing to do with math.

Okay, so what do we do with this?

Well factors 1-4 all end up in this equation:

n = z² × p(1 − p) / ME²

(where n is the number of votes you need to sample, z is the z-score for your confidence level, p is the expected proportion voting for the candidate, and ME is the margin of error)

So basically what that’s saying is that the more confident and precise you need to be, the more people you need to poll. Additionally, the larger the gap between your “percent saying yes” and “percent saying something else”, the fewer people you need before you can make a call. A landslide result may be bad for your candidate, but great for predictions.

Okay, thanks for the math lesson. Now what?

Now things get dirty. What I showed you above is basically how we’d do an estimate for each of the candidates, putting in their prior polling numbers for p one at a time. What about the other numbers though? We know we have to set our confidence high so we’re not embarrassed, but what about our margin of error?  Well here’s where all those phone calls you get prior to the election help.

Going into voting day, the pollsters had Trump in the lead at 31%, with his next closest rival at 14%. This 17 point lead means we can set our margin of error pretty wide. After all, CNN doesn’t have to know what percent of the vote Trump got as much as it needs to know that someone is really unlikely to beat him. If you split it down the middle, you get a margin of error of about 8. Their count could be off by that much and still only lower Trump to 23% of the vote and raise his opponent to 22%. However, that assumes all of his error would go to his closest opponent. With so many others in the race that’s unlikely to happen, so they could probably go with +/- 10.

For the Democrats, I found the prior polls showed Sanders leading 54% to Hillary’s 41%. Splitting that difference you could go about +/- 6.

In a perfect world this means we’d need about 160 random votes to predict Trump’s win and about 460 to predict Sanders’s win at the 99% confidence level.
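If you want to sanity check those numbers, here’s a minimal Python sketch of the formula above. I’m assuming the conservative worst case p = 0.5 for the Republican race and the polled 54% for the Democratic one; the networks’ actual inputs are, of course, their secret:

```python
from statistics import NormalDist

def sample_size(p, margin_of_error, confidence=0.99):
    """Required sample size: n = z^2 * p * (1 - p) / ME^2"""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-tailed z-score, ~2.576 at 99%
    return z**2 * p * (1 - p) / margin_of_error**2

print(round(sample_size(0.50, 0.10)))  # ~166 votes: the Republican race at +/- 10
print(round(sample_size(0.54, 0.06)))  # ~458 votes: the Democratic race at +/- 6
```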

Whoa that’s it? Why’d they wait so long then?

Well, remember #5 up there? That’s the killer. All those pretty equations I just showed you only work if you get a random sample, and that’s really hard to come by in a situation like this. Even in a small state like New Hampshire you will have geographic differences in the types of candidates people like. This post from smartblogs had a map showing some of the differences.

So as precincts report, we know there’s likely some bias in those numbers. If the 8% of the votes you’ve counted are from throughout the state, you have a lot more information than if those 8% are just from Manchester or Nashua. Because of this, most networks have eschewed strict stats limits like the one I did above in favor of slightly messier rules.

So why’d you tell us all that other stuff?

Because frequentist probability theory is great and you should know more about it. Also, those are still the steps that underlie everything else the networks do. As we discussed above, the size of the leads made the initial/perfect world required number quite small.  To highlight this, watch what would happen to that base number of votes needed as we close the margin of error:

[Chart: required sample size as the margin of error narrows]

Any lead closer than about +/- 4 (or about an 8-point difference) gets increasingly difficult to call. If you’re over that, though, you can act a little faster. In this case, both leads were bigger than that from the get-go.
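The same formula shows that blow-up numerically (again assuming the worst case p = 0.5 at 99% confidence):

```python
from statistics import NormalDist

z = NormalDist().inv_cdf(0.995)  # two-tailed z-score for 99% confidence, ~2.576
for me_pct in [10, 8, 6, 4, 2, 1]:
    n = z**2 * 0.5 * 0.5 / (me_pct / 100) ** 2  # worst case p = 0.5
    print(f"+/- {me_pct}: {round(n):>6} votes needed")
# +/- 10:    166
# +/- 8:     259
# +/- 6:     461
# +/- 4:   1037
# +/- 2:   4147
# +/- 1:  16587
```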

To hedge their bets against bias, the networks likely produce some models of the state based on past elections, polling, exit polls and demographic shifts, call the election the day before, then spend election night validating their models/predictions. Bayesian inference would come in handy here, as the networks could rapidly update their guesses with new information. So they’re not really calculating “what is the probability that Trump is winning”; they’re calculating “given that the polls said Trump was winning, what are the chances he is also winning now”. That sounds like semantics, but it can actually make a huge difference. If they saw anything unusual happening or any conflicting information, they could delay (justifying a few veteran election watchers hanging out to pick up on this stuff), but in this case all their information sources were agreeing.
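To make that concrete, here’s a toy sketch of what that updating could look like (my own illustration, not anything a network has published; the prior strength and precinct counts are invented): treat the pre-election polls as a Beta prior on a candidate’s vote share, then fold in each precinct’s count as it reports.

```python
# Hypothetical prior: polls at 31%, treated as worth ~1,000 voters of evidence
alpha, beta = 0.31 * 1000, 0.69 * 1000

# Each reporting precinct adds its (votes_for, votes_against) to the posterior
for votes_for, votes_against in [(620, 1380), (700, 1300), (655, 1345)]:
    alpha += votes_for
    beta += votes_against

mean = alpha / (alpha + beta)
sd = (mean * (1 - mean) / (alpha + beta)) ** 0.5  # normal approximation to the Beta
print(f"updated vote share estimate: {mean:.1%} +/- {2 * sd:.1%}")
```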

As the night went on, it became apparent that Trump and Sanders were actually outperforming the pre-election polls, so this probably increased the networks’ confidence rapidly. In pre-election polls, the most worrying thing is non-response bias: you get concerned that those answering the polls are not the same as those who are going to vote. Voting results eliminate this bias….in a democracy we only count the opinions of those who show up at the polls. So if you get two different types of samples with different error sources saying the same things, you increase your confidence.

Overall, I don’t totally know all the particulars of how the networks do it, but they almost certainly use some of the methods above in addition to some gut reactions. With today’s computing power, they could be individually computing probabilities for every precinct or have very advanced models to predict which areas were most likely to go rogue. It’s worth noting that the second-place finishers, Clinton and Kasich, won very few individual districts, so this strategy would have produced results quickly as well.

So there you have it. The more accurate the prior polling, the greater the gap between candidates, the more regions reporting at least some of their votes, and the less inter-region variability, the faster the call. An hour and a half after the polls close seems speedy until you consider that statistically they probably could have called it accurately after the first 1% came in. No matter how mathematically backed, however, that definitely would have gotten them the same level of love that my over-zealous-in-class-question-answering habits got me in middle school. They had to be quick, but not too quick. My guess is that last half hour was more a debate over the respectability of calling so soon than over the math. Life’s annoying like that sometimes.

Got a stats question? Send it in here!

Updated to add: Based on a Facebook conversation about this post, I thought I should add that if the race is REALLY close, the margin of error in the vote counting itself starts to come into play. Typically things like absentee ballots aren’t even counted if they won’t make a difference, but in very close races when every ballot matters, which ballots are valid becomes a big deal. The weirdest example of this I know of is the Al Franken/Minnesota Senate seat election from 2008. It took 8 months to resolve which votes were valid and get someone sworn in.

1. This is the quadrennial tradition where New Hampshire acts like a hot girl in a bar who totally hates the fact that she’s getting so much attention yet never seems to want to leave.

SCOTUS Nomination Timing

After yesterday’s news of Antonin Scalia’s death, the conversation almost immediately turned to whether President Obama should or would nominate a new candidate. There’s obviously a lot being said about this right now by better legal and political minds than mine, but I did start wondering what kind of timing there normally was between Supreme Court nominations and presidential elections. Thanks to Wikipedia, I was able to find a list of all 160 Supreme Court nominations that have occurred since 1789. I combined this with a list of election dates, and calculated the difference between the day the person was submitted to the Senate and the next presidential election. I graphed days vs election year, and color-coded the dots with the outcome of the nomination.
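The day calculation itself is the easy part. Here’s a minimal Python sketch with a hypothetical nomination date, in case you want to play along at home (the real dates came from the Wikipedia lists):

```python
from datetime import date

nomination_submitted = date(1880, 1, 1)  # hypothetical: January 1st of an election year
next_election = date(1880, 11, 2)        # the next presidential election day

print((next_election - nomination_submitted).days)  # 306
# Post-election nominations by a departing president come out negative (see note 3 below)
```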

A few notes:

  1. I didn’t fully vet the Wikipedia data. If there’s an error in that data, it’s in this chart.
  2. All day calculations for years prior to the 1848 election are approximate. Prior to that, states had a 34 day window prior to the first Wednesday in December to hold their election. I gave them a default date of November 3rd for their year, which could be off in some cases.
  3. There were a few cases in which presidents attempted to nominate someone after the election but before the next inauguration. If they got re-elected, I counted that nomination from the election that would take place 4 years later. If they were leaving office, I gave them a negative number.
  4. 310 days is approximately the number of days between January 1st of a year and the general election, so I put a reference line there.
  5. These nominations include Chief Justice nominations….and those nominees may have been active justices when they were nominated.

With that out of the way, here you go:

[Chart: days between nomination and the next presidential election, by election year, color-coded by outcome]

Rutherford B. Hayes set the record for getting things in under the wire, nominating William Burnham Woods in late December of 1880. He also nominated Stanley Matthews the following January, but that nomination didn’t go to a vote. Matthews was renominated and confirmed a few months later under Garfield.

Overall, only about 15% of nominations have ever come in this close to an election, and a little less than half of those succeeded. For comparison, nominees submitted before January 1st of an election year have about an 80% all-time success rate. Obviously we haven’t even dealt with this in a while, but it’s interesting to see that historically this was more common than in recent years.

This could get interesting kids!

People: Our Own Worst Enemies (Part 9)

Note: This is part 9 in a series for high school students about reading and interpreting science on the internet. Read the intro and get the index here, or go back to Part 8 here.

Okay, we’re in the home stretch here! In part 8 I talked about how we as individuals work to confuse ourselves when we read and interpret data. Today I’m going to talk about how we as a society collectively work to undermine our own understanding of science, one little step at a time.  Oh that’s right, we’re talking about:

Surveys and Self Reporting

Okay, so what’s the problem here?

The problem is that people are weird. Not any individual really (ed note: this is false, some people really are weird), but collectively we have some issues that add up. Nowhere is this more evident than on surveys. There is something about those things that brings out the worst in us. For example, in this paper from 2013, researchers found that 59% of men and 67% of women in the National Health and Nutrition Examination Survey (NHANES) database had reported calorie intakes that were “physiologically implausible” and “incompatible with life”. The NHANES database has been widely used in nutrition research for about 40 years, and these findings have caused some to call for an end to self-reporting in nutrition research. Now I doubt any individual was intending to mislead, but as a group those effects add up.

Nutrition isn’t the only field with a problem though. Any field that studies something where people think they can make themselves look better has an issue. For example, the Bureau of Labor Statistics found that most people exaggerate how many hours they work per week. People who say they work 40 hours normally work only about 37. People who say they work 75 hours a week typically work about 50. One or two people exaggerating doesn’t make a difference, but when it’s a whole lot of people, it adds up.

So what kinds of things should we be looking out for?

Well, any time things say they’re based on a survey, you may want to get the particulars. Before we even get to some of the reporting bias I mentioned above, we also have to contend with questions that are asked one way and reported another. For example, back in 2012 I wrote about an article that said “1/3rd of women resent that their husbands don’t make more money”. When you read the original question, it asked if they “sometimes” resent that their husband doesn’t make more money. It’s a one word difference, but it changes the whole tone of the question. Every time you see a headline about what “people think”, be a little skeptical. Especially if it looks like this:

[Image: headline claiming 12 million Americans believe lizard people run the country]

That one’s from a survey about conspiracy theories, and they got that 12 million number from extrapolating out the 4% of respondents to the survey who said they believed in lizard people to the entire US population.  In the actual survey, this represented 50 people.  Do you think it’s more plausible that the pollsters found 50 people who believed in lizard people or 50 people who thought this was an amusing thing to say yes to?
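The extrapolation that produced that headline is nothing fancier than multiplication, which is exactly why a handful of possibly-joking respondents can become millions of believers. A quick sketch (the respondent total is my back-calculation from the 50-person/4% figures above; the population is a rough 2013 number):

```python
believers = 50
respondents = 1250           # implied: 50 people is 4% of the sample
us_population = 315_000_000  # rough 2013 US population

share = believers / respondents  # 0.04
print(f"{share:.0%} of the sample -> {share * us_population / 1e6:.1f} million Americans")
# 4% of the sample -> 12.6 million Americans, courtesy of 50 respondents
```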

But people who troll polls aren’t the only problem; polling companies play this game too, asking questions designed to grab a headline. For example, recently a poll found that 10% of college graduates believe a woman named Judith Sheindlin sits on the Supreme Court. College graduates were given a list of names and told to pick the one who was a current Supreme Court justice. So what’s the big deal, other than a wrong answer? Well, apparently Judith Sheindlin is the real life name of “Judge Judy”, a TV show judge. News outlets had a field day with the “college grads think Judge Judy is on the Supreme Court” headlines. However, the original question never used the phrase “Judge Judy”, only the nearly unrecognizable name “Judith Sheindlin”. The Washington Post thankfully called this out, but all the headlines had already been run. Putting a little-known celebrity name in your question and then writing a headline with the well-known name is beyond obnoxious. It’s a question designed to make people look dumb and make everyone reading feel superior. I mean, quick: who is Caryn Elaine Johnson? Thomas Mapother IV? People taking a quiz will often guess things that sound vaguely right or familiar, and I wouldn’t read too much into it.

Why do we fall for this stuff?

This one I fully blame on the people reporting things for not giving proper context. This is one area where journalists really don’t seem to be able to help themselves. They want the splashy headline, methodology or accuracy be damned. They’re playing to our worst tendencies and desires….the desire to feel better about ourselves. I mean, it’s really just a basic ego boost. If you know that Judge Judy isn’t on the Supreme Court, then you must clearly be smarter than all those people who didn’t, right?

So what can we do about it?

The easiest thing to do is not to trust the journalists. Don’t let someone else tell you what people said; try to find the question itself. Good surveys will always provide the actual questions that they asked people. Remember that tiny word shifts can change answers enormously. Words like “sometimes”, “maybe”, and “occasionally” can be used up front, then dropped later when reported. Even more innocuous word choices can make a difference. For example, in 2010 CBS found that asking if “gays and lesbians” should be able to serve in the military instead of “homosexuals” caused quite the change in people’s opinions:

[Chart: CBS poll results on military service, split by question wording]

So watch the questions, watch the wording, watch out for people lying, and watch out for the reporting.  Basically, paranoia is just good sense when lizard people really are out to get you.

See you in Week 10! Read Part 10 here.

Are Conservatives Simple Minded?

Not too long ago, there was a bit of buzz about a study suggesting that liberals and conservatives can both be simple minded. In the past, most reporting has suggested that when it comes to politics, conservatives as a group are less complex thinkers than liberals, so naturally this created a stir. The buzz and the study intrigued me, so I decided to do a bit of a deep dive and sketchnote out what the researchers did.

I got the original study “Are Conservatives Really More Simple-Minded than Liberals? The Domain Specificity of Complex Thinking” by Conway et al., and started to read. One really important note up front: neither I nor the authors suggest that complex thinking is always a sign of correct thinking or even desirable thinking. If you were out with a new acquaintance who told you their views on, say, cannibalism were complex, you would probably be squicked out. However, since previous research had suggested that conservatives were almost always less complex than liberals, the authors wanted to check that specifically. Their basic hypothesis was simple: when it comes to complex thinking, topic matters. They conducted 4 studies to test this hypothesis, so the whole thing got a little crowded….but here’s the overview:

[Sketchnote: overview of the four studies]

A couple thoughts/notes:

  1. One of the most interesting findings was that complexity dropped as intensity of feeling increased. This causation could go either way….people could feel strongly about things they believe are straightforward, or we could simplify when our feelings are strong. Or it could be both.
  2. It’s interesting to me that they rated both regular college kids and the debates. That seemed like a nice balance.
  3. When they used regular college kids, they only used people who scored at the higher end for conservatism or liberalism. They did not include people who were in the middle.

Overall, I don’t think this result is particularly surprising. It makes sense that people are not entirely complex or entirely simple. Interesting study, and I look forward to more!

What I’m Reading: February 2016

An awesome article from two of my favorite statisticians on scientific overreach, power poses, and why we really need to stop quoting studies that don’t subsequently replicate. The day this went up I Tweeted it out, and within an hour someone I follow posted an article they had written that day quoting the original study. Didn’t correct it when I sent them the link either. Harumph.

I have ongoing debates with quite a few friends over appropriate emoji use. This paper should help us out.

A Crusade Against Multiple Regression Analysis. I’m in. Any crusade that has an upfront statistical warning label as end goal is one I can get behind.

What social science reporting gets wrong, according to social scientists. I saw this one then 3 people sent it to me, which makes me feel like I’m headed in the right direction in life.

How you are going to die, and what your chances are of being dead at given ages. Given that I’ve already lived as long as I have, my chances don’t hit 50/50 until I’m 81.

I’ve been playing Guess the Correlation. Current high score is 88.

Totally not in my normal wheelhouse for research I look at, but the Assistant Village Idiot’s take on a study about the origins of fairy tales was fascinating.

This take on the recent debates around nutrition science is a long-ish but fascinating read. Good to see all sides represented at once.

People: Our Own Worst Enemy (Part 8)

Note: This is part 8 in a series for high school students about reading and interpreting science on the internet. Read the intro and get the index here or go back to Part 7 here.

I love this part of the talk because I get to present my absolute favorite study of all time. Up until now I’ve mostly been covering how other people try to fool you to get you to their side, but now I’m going to wade into how we seek to fool ourselves. That’s why I’m calling this part:

Biased Interpretations and Motivated Reasoning

Okay, so what’s the problem here?

The problem here, my friend, is you. And me. Well, all of us really….especially if you’re smart. The unfortunate truth is that for all the brain power we put towards things, our application of that brain power can vary tremendously when we’re evaluating information that we like, that we’re neutral towards, or that we don’t like. How tremendously? Well, in the 2013 working paper “Motivated Numeracy and Enlightened Self-Government”, some researchers asked people to work out whether patients with a rash got better if they used a new skin cream. They provided this data:

[Table: 2×2 results matrix — used/didn’t use the skin cream vs. rash got better/worse]

The trick here is that you are comparing absolute values to proportions. More people got better in the “use the skin cream” group, but more people also got worse. The proportion is better for those who did not use the cream (about 5:1) than for those who did use it (about 3:1). This is a classic math skills problem, because before you calculate you have to really think through what question you’re trying to answer and what you are actually comparing. At baseline, about 40% of people in the study got this right.
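Here’s the comparison spelled out in code. The counts are illustrative, picked to match the roughly 3:1 and 5:1 ratios above; the paper’s exact matrix may differ:

```python
groups = {
    "used skin cream":   (223, 75),   # (got better, got worse) -- illustrative counts
    "did not use cream": (107, 21),
}

for label, (better, worse) in groups.items():
    print(f"{label}: {better} got better (absolute), "
          f"{better / worse:.1f}:1 ratio, {better / (better + worse):.0%} improved")
# More people improved WITH the cream in absolute terms,
# but the PROPORTION who improved is higher without it (~84% vs ~75%).
```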

What the researchers did next was really cool. For some participants, they took the original problem, kept the numbers the same, but changed “patients” to “cities”, “skin cream” to “strict gun control laws” and “rash” to “crime”.  They also flipped the outcome around for both problems, so participants had one of four possible questions.  In one the skin cream worked, in one it didn’t, in one strict gun control worked, in one it didn’t. The numbers in the matrix remained the same, but the words around them flipped.  They also asked people their political party and a bunch of other math questions to get a sense of their overall mathematical ability. Here’s how people did when they were assessing rashes and skin cream:

[Chart: accuracy on the skin cream question, by numeracy and political party]

Pretty much what we’d expect. Regardless of political party, and regardless of the outcome of the question, people with better math skills did better1.

Now check out what happens when people were shown the same numbers but believed they were working out a problem about the effectiveness of gun control legislation:

[Chart: accuracy on the gun control question, by numeracy and political party]

Look at the end of that graph there, where we see the people with high mathematical ability. If using their brain power got them an answer that they liked politically, they did it. However, when the answer didn’t fit what they liked politically, they were no better than those with very little skill at getting the right answer. Your intellectual capacity does NOT make you less likely to make an error….it simply makes you more likely to be a hypocrite about your errors. Yikes.

Okay, so what kind of things should we be looking out for?

Well, this sort of thing is most common on debates where strong ethical or moral stances intersect with science or statistics. You’ll frequently see people discussing various issues, then letting out a sigh and saying “I don’t know why other people won’t just do their research!”. The problem is that if you believe something strongly already, you’re quite likely to think any research that agrees with you is more compelling than it actually is. On the other hand, research that disagrees with you will look less compelling than it may be.

This isn’t just a problem for the hoi polloi either. I wrote earlier this week about two research groups accusing each other of choosing statistical methods that would support their own pet conclusions. We all do it; we just see it more clearly in those we disagree with.

Why do we fall for this stuff?

Oh so many reasons. In fact Carol Tavris has written an excellent book about this (Mistakes Were Made (but Not by Me): Why We Justify Foolish Beliefs, Bad Decisions, and Hurtful Acts) that should be required reading for everyone. In most cases though it’s pretty simple: we like to believe that all of our beliefs are perfectly well reasoned and that all the facts back us up. When something challenges that assumption, we get defensive and stop thinking clearly. There’s also some evidence that the internet may be making this worse by giving us access to other people who will support our beliefs and stop us from reconsidering our stances when challenged.

In fact, researchers have found that the stronger your stance towards something, the more likely you are to hold simplistic beliefs about it (ie “there are only two types of people, those who agree with me and those who don’t”).

An amusing irony: the paper I cited in that last paragraph was widely reported on because it showed evidence that liberals are as bad about this as conservatives. That may not surprise most of you, but in the overwhelmingly liberal field of social psychology this finding was pretty unusual. Apparently when your field is >95% liberal, you mostly find that bias, dogmatism and simplistic thinking are conservative problems. Probably just a coincidence.

So what can we do about it?

Richard Feynman said it best: “The first principle is that you must not fool yourself, and you are the easiest person to fool.”

If you want to see this in action, watch your enemy. If you want to really make a difference, watch yourself.

Well that got heavy.  See you next week for Part 9!

Read Part 9 here.

1. You’ll note this is not entirely true at the lowest end. My guess is if you drop below a certain level of mathematical ability, guessing is your best bet.


Grading an Education Infographic

Welcome to Grade the Infographic, which is pretty much exactly what it sounds like. I have three criteria I’m looking for in my grading: source of data, accuracy of data, and accuracy of visuals. While some design choices annoy me, I’m not a designer, couldn’t do any better, and won’t be commenting unless I think it’s skewing the perception of the information. I’m really only focused on what’s on the graphic, so I also don’t assess stats that maybe should have been included but weren’t. If you’d like to submit an infographic for grading, go here. If you’d like to protest a grade for yourself or someone else, feel free to do so in the comments or on the feedback page.

When I first started doing any stats/data blogging, an unexpected thing happened: people started sending me their infographics. Despite my repeated assertions that I actually hated infographics, companies seemed to trawl the web attempting to find people to post their infographics on various topics. It gets a little weird because they’re frequently not related to my topics, but apparently I’m not the only blogger who has had this problem. Long story short, I actually got sent this infographic back in 2013.


Since one of my favorite readers is a teacher (hi Erin!) who also shares my displeasure with infographics, I thought I’d start off by grading this one. It’s pretty long, so I chopped it up into pieces. Because I’m a petty despot and all, I actually start with the end. Grading the reference section first is a bit backwards, but it gives me an idea of how much work I’m going to have to do to figure out the accuracy of the rest of it.

[Infographic piece 1: the reference section]

Oooh, not off to a great start.  The maximum grade you can get from me if you don’t give me a source I can track is in the B range. Giving a website is good, but when it’s as big as the National Center for Education Statistics, it’s also nearly useless.

[Infographic piece 2: the countries compared]

Okay, this is good! That’s a decent selection of countries. Not sure if there was a particular reason, but there doesn’t appear to be any cherry picking going on.

[Infographic piece 3: trend data by year]

Hmmm….this got a little funky. I couldn’t actually locate this source data, though I did locate some from the World Bank that backed up the elementary school numbers. I’m guessing this is real data, but I’m saddened they didn’t let me know where they got it! If you do the work, get the credit! Also, the 4-year gap confused me. Where are 2001 – 2004? It doesn’t look like it particularly matters for this trend, so I only subtracted 2 points for not indicating the gap or spacing the years better.

[Infographic piece 4: 2011 test results]

This part broke even.  I was hoping for a year (source again!) but did get some good context about what kind of test this was. That was really helpful, so it got an extra point. The data’s all accurate and it’s from 2011.

[Infographic piece 5: hours spent studying]

Oooh, now here’s a bit of a snafu. The graphic said “hours spent studying”, which surprised me because that’s 3 hours/day for the US kids. When I found the source data (page 114), it turns out those are actually classroom hours. That made more sense. I docked three points because I don’t think that’s what most people mean by “time spent studying”. It’s not totally wrong, but not totally accurate either. Class hours are normally referred to as such. I felt there was a bit of wiggle room on the definition of “study” though, so I didn’t knock it down the 5 points I was going to.

[Infographic piece 6: graduation rates]

Oof. That’s not good. Where did these numbers come from? I went to the OECD report to check out the 2010 numbers, and they were WAY off.

Country          Infographic 2010 number   OECD 2010 number
United States    88.4%                     77%
United Kingdom   82.9%                     91%
Spain            64.7%                     80%
Germany          86.5%                     87%
Sweden           91.1%                     75%
South Korea      91.1%                     92%
Australia        84.8%                     No numbers

Now graduation rates have lots of different ways of being calculated (mostly due to differences in what counts as “secondary education”), so it’s plausible those numbers came from somewhere. This is the risk you run when you don’t include sources.

[Image: final grade]

And there you have it.  Cite your sources!


Women, Ovulation and Voting in 2016

Welcome to “From the Archives” where I revisit old posts  to see where the science (or my thinking) has gone since I put them up originally.

Back in good old October of 2012, it was an election year and I was getting irritated1. First, I was being bombarded with Elizabeth Warren vs Scott Brown Senate ads, and then I was confronted with this study: The fluctuating female vote: politics, religion, and the ovulatory cycle (Durante et al.), which purported to show that women’s political and religious beliefs varied wildly over their monthly cycle, and in different ways depending on whether they were married or single. For single women, they claimed that being fertile caused them to get more liberal and less religious, because they had more liberal attitudes toward sex. For married women, being fertile made them more conservative and religious, to compensate for their urge to cheat. The swing was wide too: about 20%. Of note, the study never actually observed any women changing their vote, but compared two groups of women to find the differences. The study got a lot of attention because CNN initially put it up, then took it back down when people complained. I wrote two posts about this: one irritated and ranty, and one pointing to some more technical issues I had.

With a new election coming around, I was thinking about this paper and wanted to take a look at where it had gone since then. I knew that Andrew Gelman had ultimately taken shots at the study for reporting an implausibly large effect2 and potentially collecting lots of data/comparisons and only publishing some of them, so I was curious how this study had subsequently fared.

Well, there are updates!  First, in 2014, a different group tried to replicate their results in a paper called  Women Can Keep the Vote: No Evidence That Hormonal Changes During the Menstrual Cycle Impact Political and Religious Beliefs by Harris and Mickes.  This paper recruited a different group, but essentially recreated much of the analysis of the original paper with one major addition. They conducted their survey prior to the 2012 election AND after, to see predicted voting behavior vs actual voting behavior.  A few findings:

  1. The first paper (Durante et al) had found that fiscal policy beliefs didn’t change for women, but social policy beliefs did change around ovulation. The second paper (Harris and Mickes) failed to replicate this finding, and also failed to detect any change in religious beliefs.
  2. In the second paper, married women had a different stated preference for Obama (higher when fertility was low, lower when it was high), but that difference went away when you looked at how they actually voted. For single women, it was actually the opposite: they reported the same preference level for Obama regardless of fertility, but voted differently based on the time of the month.
  3. The original Durante study had taken some heat for how they assessed fertility level in their work. There were concerns that self reported fertility level was so likely to be inaccurate that it would render any conclusions void. I was interested to see that Harris and Mickes clarified that the Durante paper actually didn’t accurately describe how they did fertility assessments in the original paper, and that they had both ultimately used the same method. This was supposed to be in the supplementary material, but I couldn’t find a copy of that free online. It’s an interesting footnote.
  4. A reviewer asked them to combine the pre and post election data to see if they could find a fertility/relationship interaction effect. When pre and post election data were kept separate, there was no effect. When they were combined, there was.

Point #4 is where things got a little interesting. The authors of the Harris and Mickes study said combining their data was not valid, but Durante et al hit back and said “why not?”. There’s an interesting piece of stat/research geekery about the dispute here, but the TL;DR version is that this could be considered a partial replication or a failure to replicate, depending on your statistical strategy. Unfortunately this is one of those areas where you can get some legitimate concern that a person’s judgment calls are being shaded by their view of the outcome. Since we don’t know what either researcher’s original plan was, we don’t know if either one modified their strategy based on results. Additionally, the “is it valid to combine these data sets” question is a good one, and would be open for discussion even if we were discussing something totally innocuous. The political nature of the discussion intensifies the debate, but it didn’t create it.

Fast forward now to 2015, when yet another study was published: Menstrual Cycle Phase Does Not Predict Political Conservatism. This study was done using data ALSO from the 2012 election cycle3, but with a few further changes.  The highlights:

  1. This study, by Scott and Pound, addressed some of the “how do you measure fertility when you can’t test” concerns by asking about medical conditions that might influence fertility to screen out women whose self reporting might be less accurate. They also ranked fertility on a continuum as opposed to the dichotomous “high” and “low”. This should have made their assessment more accurate.
  2. The other two studies both asked for voting in terms of Romney vs Obama. Scott and Pound were concerned that this might capture a personal preference change that was more about Obama and Romney as people rather than a political change. They measured both self-reported political leanings and a “moral foundations” test and came up with an overall “conservatism” rank, then tracked that with chances of conception.
  3. They controlled for age, number of children, and other sociological factors.

So overall, what did this show? Well, basically, political philosophy doesn’t vary much no matter where a woman is in her cycle.

The authors have a pretty interesting discussion at the end about the problems with Mechanical Turk (where all three studies recruited their participants in the same few months), the differences of measuring person preference (Obama vs Romney) vs political preference (Republican vs Democrat), and some statistical analysis problems.

So what do I think now?

First off, I’ve realized that getting all ranty when someone brings up women’s hormones affecting things may be counterproductive. Lesson learned.

More seriously though, I find the hypothesis that our preferences for individuals may change with hormonal changes more compelling than the hypothesis that our overall philosophy of religion or government changes with our hormones. The first simply seems more plausible to me. In a tight presidential election though, this may be hopelessly confounded by the candidates’ actual behavior. It’s pretty well known that single women voted overwhelmingly for Obama, and that Romney had a better chance of capturing the votes of married women. Candidates know this and can play to it, so if a candidate makes a statement playing to their base, you may see shifts that have nothing to do with the hormones of the voters but are an actual reaction to real time statements. This may be a case where research into the hypothetical (i.e. made up candidate A vs B) may be helpful.

The discussions on fertility measures and statistical analysis were interesting and a good insight into how much study conclusions can change based on how we define particular metrics. I was happy to see that both follow up papers hammered on clear and standard definitions for “fertility”. If that is one of the primary metrics you are assessing, then the utmost care must be taken to assess it accurately, or else the noise can swamp the signal.

Do I still think CNN should have taken the story down? Yes….but just as much as I believe that they should take most sensational new social/psych research stories down. If you follow the research for just two more papers, you see the conclusion go from broad (women change their social, political and religious views and votes based on fertility!) to much narrower (women may in some cases change their preference or voting patterns for particular candidates based on fertility, but their religious and political beliefs do not appear to change). I’ll be interested to see if anyone tries to replicate this with the 2016 election, and if so what the conclusions are.

This concludes your trip down memory lane!
1. Gee, this is sounding familiar
2. This point was really interesting. He pointed out that around elections, pollsters are pretty obsessive about tracking things, and short of a major scandal breaking literally NOTHING causes a rapid 20 point swing. The idea that swings that large were happening regularly and everyone had missed it seemed implausible to him. Statistically of course, the authors were only testing that there was a difference at all, not what it was….but the large effect should possibly have given them pause. It would be like finding that ovulation made women spend twice as much on buying a house. People don’t change THAT dramatically, and if you find that they do you may want to rerun the numbers.
3. Okay, so I can’t be the only one noticing at this point that this means 3 different studies all recruited around 1000 American women not on birth control, not pregnant, not recently pregnant or breastfeeding but of child bearing age, interested in participating in a study on politics, all at the same time and all through Amazon’s Mechanical Turk. Has anyone asked the authors to compare how much of their sample was actually the same women? Does Mechanical Turk have any barriers for this? Do we care? Oh! Yes, turns out this is actually a bit of a problem.

Proof: Using Facts to Deceive (Part 7)

Note: This is part 7 in a series for high school students about reading and interpreting science on the internet. Read the intro and get the index here, or go back to Part 6 here.

Okay, now we come to the part of the talk that is unbelievably hard to get through quickly. This is really a whole class, and I will probably end up putting some appendices on this series just to make myself feel better. If the only thing I ever do in life is teach as many people as possible the base rate fallacy, I’ll be content. Anyway, this part is tough because I at least attempt to go through a few statistical tricks that actually require some explaining. This could be my whole talk, but I’ve decided against that in favor of some of the softer stuff. Anyway, this part is called:

Crazy Stats Tricks: False Positives, Failure to Replicate, Correlations, Etc

Okay, so what’s the problem here?

Shenanigans, chicanery, and folks otherwise not understanding statistics and numbers. I’ve made reference to some of these so far, but here’s a (not-comprehensive) list:

  1. Changing the metric (ie using growth rates vs absolute rates, saying “doubled” and hiding the initial value, etc)
  2. Correlation and causation confusion
  3. Failure to Replicate
  4. False Positives/False Negatives

They each have their own issues. Number 1 deceives by confusing people, Number 2 makes people jump to conclusions, Number 3 presents splashy new conclusions that no one can make happen again, and Number 4 involves too much math for most people but yields some surprising results.

Okay, so what kind of things should we be looking out for?

Well, each one is a little different. I touched on 1 and 2 a bit previously with graphs and anecdotes. For failure to replicate, it’s important to remember that you really need multiple papers to confirm findings; one study saying something doesn’t necessarily mean subsequent studies will say the same thing. The quick overview, though, is that many published studies don’t hold up. It’s important to realize that any new shiny study (especially in psychology or social science) could turn out not to be reproducible, and the initial conclusions invalid. This warning is given as a boilerplate “more research is needed” at the end of articles, but it’s meant literally.

False positives/negatives are a different beast that I wish more people understood.  While this applies to a lot of medical research, it’s perhaps clearest to explain in law enforcement.  An example:

In 2012, a (formerly CIA) couple was in their home getting their kids ready for school when they were raided by a SWAT team. They were accused of being large scale marijuana growers, and their home was searched. Nothing was found. So why did they come under investigation? Well, it turns out they had been seen buying gardening equipment frequently used by marijuana growers, and the police had then tested their trash for drug residue. The trash got two positive tests, and the police raided the house.

Now if I had heard this reported in a news story, I would have thought that was all very reasonable. However, the couple eventually discovered that the drug test used on their trash has a 70% false positive rate. Even if their trash had been perfectly fine, there was still about a 50% chance they’d get two positive tests in a row (and that assumes nothing in their trash was triggering this). So given a street with ZERO drug users, you could have found evidence to raid half the houses. The worst part of this is that the courts ruled that the police themselves were not liable for not knowing that the test was that inaccurate, so their assumptions and treatment of the couple were okay. Whether that’s okay is a matter for legal experts, but we should all feel a little uneasy that we’re more focused on how often our tests get things right than how often they’re wrong.

Why do we fall for this stuff?

Well, some of this is just a misunderstanding or lack of familiarity with how things work, but the false positive/false negative issue is a very specific type of confirmation bias. Essentially we often don’t realize that there is more than one way to be wrong, and in avoiding one inaccuracy, we increase our chances of different types of inaccuracy.  In the case of the police departments using the inaccurate tests, they likely wanted something that would detect drugs when they were present. They focused on making sure they’d never get a false negative (ie a test that said no drugs when there were). This is great, until you realize that they traded that for lots of innocent people potentially being searched. In fact, since there are more people who don’t use drugs than those who do, the chances that someone with a positive test doesn’t have drugs is actually higher than the chance that they do….that’s the base rate fallacy I was talking about earlier.
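Here’s that math made concrete. The 70% false positive rate comes from the court case above; the sensitivity and the share of households with drugs are my assumptions for illustration:

```python
false_positive_rate = 0.70   # P(positive | no drugs), from the court case above
sensitivity = 0.99           # assumed: the test almost always catches real drugs
drug_rate = 0.05             # assumed: 1 household in 20 actually has drug residue

# A clean house failing two trash pulls in a row:
print(f"two false positives in a row: {false_positive_rate**2:.0%}")  # 49%

# Bayes' theorem: of all positive tests, what share come from clean houses?
p_positive = sensitivity * drug_rate + false_positive_rate * (1 - drug_rate)
print(f"P(no drugs | positive) = {false_positive_rate * (1 - drug_rate) / p_positive:.0%}")  # ~93%
```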

To further prove this point, there’s an interesting experiment called the Wason Selection task that shows that when it comes to numbers in particular, we’re especially vulnerable to only confirming an error in one direction. In fact 90% of people fail this task because they only look at one way of being wrong.

Are you confused by this? That’s pretty normal. So normal in fact that the thing we use to keep it all straight is literally called a confusion matrix, and it looks like this:

                     Actually positive    Actually negative
Test says positive   True positive        False positive
Test says negative   False negative       True negative

If you want to do any learning about stats, learn about this guy, because it comes up all the time. Very few people can do this math well, and that includes the majority of doctors. Yup, the same people most likely to tell you “your test came back positive” frequently can’t accurately calculate how worried you should really be.

So what can we do about it?

Well, learn a little math! Like I said, I’m thinking I need a follow up post just on this topic so I have a reference for this. However, if you’re really not mathy, just remember this: there’s more than one way to be wrong. Any time you reduce your chances of being wrong in one direction, you probably increase them in another. In criminal justice, if we make sure we never miss a guilty person, we might also increase the number of innocent people we falsely accuse. The reverse is also true. Tests, screenings, and judgment calls aren’t perfect, and we shouldn’t fool ourselves into thinking they are.

Alright, on that happy note, I’ll bid you adieu for now. See ya next week!

Read Part 8 here.