Government benefits OR definitions and the census strike again

Last week I got a little fascinated by the census bureau data…..and this weekend I was sent an article from the Wall Street Journal regarding yet another set of Census Bureau Data that was getting passed around.

This one addressed the number of households in the US receiving “government benefits”….apparently it’s up to 49.1%.
Now that’s a scary number, but I am always wary of the phrase “government benefits” when it’s used in a statistical context.  The problem is that it’s an incredibly vague term, and can be used to cover a myriad of programs….not all of which are what initially spring to mind.  
I first learned to be wary of this term when my dear liberal brother mentioned that some group he had been following had claimed that there was some ludicrous number of government handout programs in place today.  The number struck him as high, so he got on their website and found out that they were actually counting both federal assistance programs AND tax breaks (such as home interest deductions, student loan interest deductions, dependent credits, etc) as “entitlements”.  Thus in this case I am extra vigilant about my “find the definition” rule.
I took a look around the census website (we’ve become good friends lately) and found the list they were using as of 2008*:
  • Dept of Veteran’s Affairs – Compensation, Pension, Education Assistance
  • Medicare
  • Social Security
  • Unemployment
  • Workman’s Comp
  • Food Stamps
  • Free/Reduced-Price School Lunch and Breakfast Program
  • Housing Assistance
  • Federal and State Supplemental Security Income (SSI)
  • Medicaid
  • Temporary Assistance for Needy Families (TANF)
  • Supplemental Nutrition Program for Women, Infants, and Children (WIC)
Not a terribly surprising list, though I wouldn’t have realized that Veteran’s benefits were on there.  Even without the economy going downhill or any other expansion of programs, the Veteran’s benefits most certainly would have expanded in the past few years as people continue

Additionally, it’s important to note that only one member of a household needs to receive one of these benefits for the whole household to be counted.  That struck me because my parents and my grandmother all live in the same house, which means both of my dear hard-working parents are lumped into that 49.1% number.
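To see how much the “one member is enough” rule can matter, here is a minimal Python sketch.  The households below are invented for illustration and are not Census data; the function names are my own.  The only point is that a household-level rate can sit well above a person-level rate.

```python
# Hypothetical households: each list holds one True/False flag per person,
# True meaning that person receives some benefit on the list.
# These numbers are invented for illustration, not Census Bureau data.
households = [
    [False, False, True],          # two working adults plus a grandparent
    [False, False],
    [True],
    [False, False, False, False],
]

def household_rate(households):
    """Share of households where at least one member receives a benefit."""
    flagged = sum(1 for members in households if any(members))
    return flagged / len(households)

def person_rate(households):
    """Share of individual people who receive a benefit."""
    people = [flag for members in households for flag in members]
    return sum(people) / len(people)

print(f"households counted: {household_rate(households):.0%}")
print(f"people receiving:   {person_rate(households):.0%}")
```

In this made-up example half the households are counted even though only two of the ten people in them receive anything, which is the same effect that sweeps my parents into the household number.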

Whatever your feeling about government benefits, it’s important to know exactly which ones are being counted in any list.  I’d imagine that many people who might dislike Medicaid might not care to eliminate Veteran’s Benefits, and those who don’t like TANF may very well support workman’s comp.  Just something to be aware of, especially in an election year.

*To note: the latest data I could find was from 2008.  I really hate that the WSJ doesn’t link to where the heck it got its numbers.  I couldn’t find the stuff they put up anywhere on the census bureau website.  I’m not doubting them, I just wonder if it would have killed them to include a link????

Watch the definitions

A quick one for a Friday:

I’ve blogged before about paying careful attention to the definition of words used in study results.  It is often the case that the definition used in the study/statistic may not actually match what you presume the definition is.

Eugene Volokh posted a good example of this today, when he linked to this op-ed in the Detroit Free Press.  It cites a spokesperson from the Violence Policy Center who states that “Michigan is one of 10 states in which gun deaths now outpace motor vehicle deaths”.

My knee-jerk reaction was that this seemed high, but my tired Friday brain probably would have kept skimming.  Then I read why Volokh was posting it:

The number of accidental gun deaths in Michigan in 2009 (the most recent year reported in WISQARS) was … 12, compared to 962 accidental motor-vehicle-related deaths. 99% of the gun deaths in Michigan that year consisted of suicides (575) and homicides (495).

To be honest, I had presumed homicides were included, but suicides didn’t even occur to me.   I’d be interested to see how many of the vehicular deaths were suicides; my guess is the percentage would not be as high as in the gun case.  Either way, I’m sure I’m not the only one who didn’t realize what was being counted.
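Here is a small Python sketch of how much the answer depends on what gets counted.  The figures are the Michigan 2009 numbers quoted above; the helper name and the category labels are my own, and the motor vehicle figure is the accidental-only number from the quote, so treat this as an illustration of the definition problem rather than a full accounting.

```python
# Michigan 2009 figures as quoted from Volokh's post above.
gun_deaths = {"accidental": 12, "suicide": 575, "homicide": 495}
motor_vehicle_accidental = 962

def total(deaths, include):
    """Sum only the categories named in `include`."""
    return sum(deaths[kind] for kind in include)

# Broad definition: every death involving a gun.
all_gun = total(gun_deaths, ["accidental", "suicide", "homicide"])
# Narrow definition: accidents only, the like-for-like match to car accidents.
accidental_only = total(gun_deaths, ["accidental"])

print("gun deaths, all causes:", all_gun)
print("outpaces vehicle deaths (broad)?", all_gun > motor_vehicle_accidental)
print("outpaces vehicle deaths (narrow)?", accidental_only > motor_vehicle_accidental)
```

Same state, same year, same data source, and the headline claim flips depending on which definition you pick.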

Watch the definitions, and have a fabulous Memorial Day weekend!

More census data….the minority-majority issue

I was happy to see that my post from yesterday  got an excellent comment from Glenn, a former Census Bureau employee.  He let me know that it was likely the sample they used was actually a stratified cluster sample, which is not exactly what I had surmised, but close.

As I was looking up more info on some of the Census Bureau data, I ran into a fascinating column from Matthew Yglesias over at Slate.com.  In it, he describes his experience filling out the census form, and how his own experience made him question some of the data being released.

In specific, he questioned the recent headline that we are quickly heading towards a minority-majority society.  He mentions that as a 25% Cuban man, he looks very white, but was not sure how to answer the question regarding whether he was “Hispanic in origin”.  If he wasn’t sure how to answer a race question, how many others were in his boat?  He further comments that as people continue to become increasingly of mixed racial background (keeping in mind that 1 out of 12 marriages is now mixed race) it is much more likely that we will have to shift our concept of what “white” is to keep up with the times.

As Elizabeth Warren can tell you, percentage of heritage matters….but where do we draw the line?  If 3% Native American isn’t enough, how much is?  I mean that quite literally.  I don’t know.

In my cultural competency class in school, we had a fascinating example of racial confusion.  One of the girls I sat next to mentioned that her grandparents were from Lebanon, had immigrated to South America, her parents were both born there, married, moved to the US, and that’s where she was born.  Her skin was fair, she was fluent in Spanish, and she felt she spent her life explaining that she was genetically Arabic, ethnically South American and culturally American.  I don’t know what she checked off on the census, but I’m sure nothing captured that particular combination accurately.

As times change, so do our ideas of race. When reading the history of census racial classification, it’s hard to disagree with Yglesias’ assertion that today’s racial breakdown will not be comparable to whatever breakdown we have in ten years.  That’s a good thing to keep in mind when analyzing racial data.

 Racial numbers are as good as the categories we have to put them in.   

The (ACS) Devil and Daniel Webster

As a New Hampshire native, I am prone to liking people named Daniel Webster.

It is thus with some interest that I realized that the Florida Congressman who is sponsoring the bill to eliminate the American Community Survey happens to share a name with the famous NH statesman.  I have been following this situation since I read about it on the pretty cool Civil Statistician blog, run by a guy who runs stats for the census bureau.

Clearly there’s some interesting debate going on here about data, analysis, role of the government, and the classic “good of the community vs personal liberty” debate.

I’m going to skip over most of that.

So why then, do I bring up Daniel Webster?

Well, I was intrigued by this comment from him, as reported in the NYT article on the ACS:

“We’re spending $70 per person to fill this out. That’s just not cost effective,” he continued, “especially since in the end this is not a scientific survey. It’s a random survey.”

It was that last part of the sentence that caught my eye.

I was curious, first of all, what the background was of someone making that claim.  I took a look at his website, and was pleased to discover that Rep. Webster is an engineer.   It’s always interesting to see one of my own take something like this on (especially since Congress only has 6 of his kind!).

That being said, is a random survey unscientific?

Well, maybe.

In grad school, we actually had to take a whole class on surveys/testing/evaluations, and the number one principle for polling methods is that there is no one size fits all.  The most scientifically accurate way to survey a group is based on the group you’re trying to capture.  All survey methods have pitfalls.   One very interesting example our professor gave us was the students who tried to capture a sample of their college by surveying the first 100 students to walk by them in the campus center.  What they hadn’t realized was that a freshman seminar was just letting out, so their “random” survey turned out to be 85% freshmen.  So overall, it’s probably worse when your polling methodology isn’t random than when it is.

There are all kinds of polling methods that have been created to account for these issues:

  • simple random sampling – attempts to be totally random
  • systematic sampling – picking say, every 5th item on a list
  • stratified sampling – dividing the population into groups and then picking a certain percentage from each one (in the example above, this would have meant picking 25 random people from each class year)
  • convenience sampling – grabbing whoever is closest
  • snowball sampling – allowing sampled parties to refer/lead to other samples
  • cluster sampling – taking one cluster of participants (one city, one classroom, etc) and presuming that’s representative of the whole
There are others, though most are subtypes of these types (see more here).
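For the curious, here is a rough Python sketch of four of the methods above, applied to a made-up campus.  Everything in it is an assumption for illustration: the campus size, the function names, and the fact that the student list happens to start with freshmen (my stand-in for the seminar letting out).  It uses only the standard library.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is repeatable

YEARS = ["freshman", "sophomore", "junior", "senior"]
# Hypothetical campus: 1,000 students per class year, listed year by year.
population = [(year, i) for year in YEARS for i in range(1000)]

def simple_random(pop, n):
    """Every student has the same chance of being picked."""
    return random.sample(pop, n)

def systematic(pop, step):
    """Take every `step`-th student from the list, after a random start."""
    start = random.randrange(step)
    return pop[start::step]

def stratified(pop, per_group):
    """Pick the same number of students at random from each class year."""
    sample = []
    for year in YEARS:
        group = [s for s in pop if s[0] == year]
        sample.extend(random.sample(group, per_group))
    return sample

def convenience(pop, n):
    """Grab whoever is first in line; here the list starts with freshmen."""
    return pop[:n]

def year_counts(sample):
    """Count how many sampled students fall in each class year."""
    return Counter(year for year, _ in sample)

for name, sample in [
    ("simple random", simple_random(population, 100)),
    ("systematic", systematic(population, 40)),
    ("stratified", stratified(population, 25)),
    ("convenience", convenience(population, 100)),
]:
    print(name, dict(year_counts(sample)))
```

The stratified sample comes out at exactly 25 per class year by construction, while the convenience sample is nothing but freshmen, an exaggerated version of the 85% story above.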
So what does the ACS use?  
As best I can tell, they use stratified sampling.  They compile as comprehensive a list as they can, then they assign geocodes, and select from there.  So technically, their sampling is both random and non-random.   

Now, NYT analysis aside, I wonder if this is really what Webster was questioning.  The other meaning one could take from his statement is that he was challenging the lack of scientific method.  As an engineer, he would be more familiar with this than with sampling statistics (presuming his coursework looked like mine).  What would a scientific survey look like there?  Well, here’s the scientific method in a flowchart (via Sciencebuddies.org):

So it seems plausible he was actually criticizing the polling being done, not the specific polling methodology.  It’s an important distinction, as all data must be analyzed on two levels: integrity of data, and integrity of concept.   When discussing “randomness” in surveys, we must remember to acknowledge that there are two different levels going on, and criticisms can potentially have dual meanings.

Sometimes people just make things up

ALWAYS LOOK FOR A PRIMARY SOURCE.

THEN MAKE SURE THAT PRIMARY SOURCE SAYS WHAT THEY SAY IT SAYS.

Sorry for the caps lock, but some people seem to doubt that others would actually fabricate stats to prove a point.  They do, and in the New York Times no less.

H/T Instapundit.

Correlation and Causation: Real World Problems

In yesterday’s post, I got a bit worked up over sloppy reporting on a study on dietary interventions in pregnancy. 

This led to an interesting comment from the Assistant Village Idiot regarding weight gain recommendations for pregnant women.  The current weight gain recommendation is 25 – 35 pounds for a normal BMI woman, but AVI commented that it used to be much lower, and that women were hospitalized to stop them from eating too much.
I didn’t actually know that, so I immediately decided to look it up.  
I stumbled on to a fascinating presentation put together by an OB at UCSF on the history of maternal weight gain recommendations (link goes to the PowerPoint slides).  It not only confirmed what AVI had mentioned, but also gave some of the reasoning….which turned out to be a very interesting example of people erroneously conflating correlation with causation.  
Apparently part of the reason why they (they being doctors circa 1930) were so nervous about weight gain in pregnancy was that they were trying to prevent preeclampsia.  Now preeclampsia is a life-threatening condition if left untreated, and one of the warning signs is rapid weight gain.  Apparently some doctors actually thought that the symptom was the cause, and believed that all excessive weight gain was a sign the patient was about to become preeclamptic.  Thus, the theory went, limiting weight gain would prevent preeclampsia and aid in “figure preservation” to boot*.  
Sadly, this also led to higher infant mortality, disability, and mental retardation….which seems a pretty steep price to pay for what was really a data analysis error.  As I’ve said before, this is why statistics are so relevant in medicine….the cost for getting things wrong is too steep to not be careful.
*To note, it is actually true that preeclampsia is linked to higher weight/glucose/insulin production….but the way they went about addressing it did as much harm to the fetus as good.  Current weight gain recommendations are set to optimize outcomes for the babies, not the mothers.  

When in doubt, blame the journalist: prenatal dieting edition

Sometimes bad science reporting makes me laugh, and sometimes it actually kind of stresses me out.  This is one of the “this stresses me out” times.

The headline reads: Diet during pregnancy is safe and reduces risk for complications, study finds

Now aside from being a bit on the garbled side, it’s a pretty provocative headline.  As someone who has been in and out of obstetricians’ offices for the past 7 months or so, it also runs counter to everything I’ve been told.  According to this write-up however, here are a few things this study found:

 Is it safe for a pregnant woman to go on a diet? According to a new study, not only is it safe, but it can even be beneficial and reduce the risk of dangerous complications.

That would seem to contradict what my doctor has told me….but let’s read on (to what they found about dieting methods):

The researchers found that all three methods reduced a mother’s weight, but diet showed the greatest effect with an average reduction of almost 9 pounds. Pregnant moms who only exercised lost about 1.5 pounds, and moms who did a combination of diet and exercise lost an average of 2.2 pounds.

So they had mothers to be lose weight during pregnancy?  That seems….extra wrong….but go on:

Women who went on a calorie-restricted diet were 33 percent less likely to develop pre-eclampsia, a spike in blood pressure caused by significant amounts of protein in the urine.

Wait, now I know he’s just phoning it in.  Pre-eclampsia is not high blood pressure caused by protein in the urine, it’s high blood pressure AND high protein in the urine….in fact the Mayo Clinic article he links to says so.  

At this point, I took a look at the original study, and found other “oops” moments in the reporting.  First, the study never looked at “diets”.  What they actually looked at was “dietary interventions”…which they describe as follows:

Typical dietary interventions included a balanced diet consisting of carbohydrates, proteins, and fat and maintenance of a food diary. 

Since this was a meta-analysis, I took a look at the references, and in fact only one study cited directly looked at caloric restriction….the sort of thing most of us think of when we hear the word “diet”.

Furthermore, that part about the women’s weight being reduced?  It wasn’t.  Their weight gain was reduced.   …something the study authors are clear about, but the subsequent write up completely leaves out.

I actually got a little angry about this.  You can feel free to blame pregnancy hormones, but I find this sort of thing is just irresponsible.  CBS is a major news network, and people are going to take what they say seriously.  As the Assistant Village Idiot likes to point out, people believing faulty science on small things can be funny and doesn’t matter much….but when you realize bad studies could actually affect the way people live, it gets scary.  Someone following this story could do some real damage.  In fact, the article does get clearer towards the end (when it quotes the original study author), but that’s 6 paragraphs in.  It drives me nuts that a good, carefully thought-through study can get reported so sloppily and potentially dangerously.  There is a world of difference between what most of us think of when we say “diet” and what the researchers here described, which was essentially just formalized pre-natal nutritional counseling.

Overall, real dieting during pregnancy is still dangerous….and can backfire in a big way.  Mothers who are forced to restrict calories during pregnancy (famine victims, etc) actually wind up having children who are more likely to be obese and develop diabetes.  As a side note, one of the most fascinating studies on this is the Dutch Famine Study, where mothers who had temporary famine conditions during pregnancy could be studied for the long term effects on the children.

This is why it matters that the media report things correctly.  People should not walk away from reading about good science with bad ideas.  Words like “diet” or “weight reduction” do not mean the same thing as “dietary interventions” or “weight gain reduction”. No one should have to read to paragraph 7 to get accurate information.  That’s just bad form.

The only thing that could have made this story worse would have been an infographic.  I’m going to have nightmares about that tonight.

Correlation and Causation: the Teen Pregnancy Edition

One of the first posts I ever did was on correlation and causation.  In it, I spelled out the three rules to consider whenever two variables (x and y) are linked:

  1. X is causing Y
  2. Y is causing X
  3. Something else is causing both X and Y
While most people jump to the conclusion that it’s number 1, Matthew Yglesias wrote a piece for Slate.com this week where he rather awkwardly jumps to conclusion number 2.  
He starts off well with the second paragraph, but then goes to a very strange place in the third: 

Delivering the commencement address last weekend at the evangelical Liberty University, Mitt Romney naturally stuck primarily to “family values” and religious themes. He did, however, make one economic observation that intersects with some fascinating new research. “For those who graduate from high school, get a full-time job, and marry before they have their first child,” he said, “the probability that they will be poor is 2 percent. But if [all] those things are absent, 76 percent will be poor.”
These are striking numbers, but they raise the age-old question of correlation and causation. Does this mean that the representative high-school dropout would be doing much better had he stuck it out in school for a few more years? Or is it instead the case that the population of high-school dropouts is disproportionately composed of people who have attributes that lead to low earnings?
When it comes to early pregnancy, surprising new evidence indicates that Romney and most everyone else have it backward: Having a baby early does not hamper a young woman’s economic prospects, as Romney implies. Rather, young women choose to become mothers because their economic outlook is so objectively bleak.

Say what?

As a former teenage girl myself, this is a strange conclusion….I certainly never met a teen mom who would have put it that way.  But surely there was some wonderful evidence to support this scathing conclusion?

Well, not really.  Here’s the original paper….and  here’s how the authors conveyed their thoughts:

We describe some recent analysis indicating that the combination of being poor and living in a more unequal (and less mobile) location, like the United States, leads young women to choose early, non-marital childbearing at elevated rates, potentially because of their lower expectations of future economic success. …These findings lead us to conclude that the high rate of teen childbearing in the United States matters mostly because it is a marker of larger, underlying social problems.

The emphasis was mine….but notice how much more careful they are in their language.  If you take my list above, you see that they are challenging possibility number 1, seeing if #2 is a feasible conclusion, but ultimately pointing the finger at #3….i.e. “larger, underlying social problems”.

For example, they cite low maternal education as a risk factor for teen pregnancy…which one could presume could be either the result of or the cause of low income.

Teen pregnancy is complicated, and honestly I would be very surprised if you could ever figure out a way to pin it on just one factor.  Additionally, so much information is unavailable that it can be hard to parse through what you have left.  A key factor in all of this would be to determine if higher income girls weren’t having babies because they weren’t getting pregnant or because they were having abortions….data which could lead to very different conclusions.

I fully support this study, by the way; questioning the prevailing wisdom is always a good thing. What I resent is when people think that just by flipping the order of a normal conclusion they’re being clever.

X could cause Y, Y could cause X, something else could be causing both.

Then again, it could also just be a coincidence.  
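If possibility number 3 feels abstract, here is a toy Python simulation of it.  All of the variables are invented: Z is a hidden “something else”, X and Y each depend on Z plus their own noise, and neither one touches the other.  A and B are an unrelated pair for comparison.  It’s a sketch under those assumptions, not a model of teen pregnancy or anything else real.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

n = 10_000
z = [random.gauss(0, 1) for _ in range(n)]   # hidden "something else"
x = [zi + random.gauss(0, 1) for zi in z]    # X depends only on Z
y = [zi + random.gauss(0, 1) for zi in z]    # Y depends only on Z

# Unrelated pair for comparison: nothing shared between them.
a = [random.gauss(0, 1) for _ in range(n)]
b = [random.gauss(0, 1) for _ in range(n)]

print(f"X vs Y (shared cause): {pearson(x, y):.2f}")
print(f"A vs B (no link):      {pearson(a, b):.2f}")
```

X and Y come out clearly correlated even though, by construction, neither causes the other.  Looking only at X and Y, you could tell a convincing story in either direction and be wrong both times.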

The price of bad data

Yesterday Instapundit linked to a story on “the perfect data storm”.

Thinking that sounded up my alley, I went and read the article.  It’s from a professor named David Clemens at Monterey Peninsula College, complaining about the use of data in higher education:

While knowing full well data’s vulnerability, education managers cannot resist the temptation to be data driven because data absolves them of responsibility; to be data driven lets them say “the data made me do it” (hat tip to Flip Wilson).

That made me sad.  
He cited a few numbers floating around his campus that he knew were bad…transfer rates that only counted transfers to state schools for example….and yet they were still being included in policy decisions as though they were comprehensive.
That made me really sad.
While I enjoy mocking bad data, it’s important to remember that there is a real price to it.  That’s why I think it’s important to empower people to question the data they’re hearing and to know what weaknesses to look for when you hear numbers that sound implausible.  
Clemens continues:

….we discover that information does not touch any of the important problems of life. If there are children starving in Somalia, or any other place, it has nothing to do with inadequate information. If our oceans are polluted and the rain forests depleted, it has nothing to do with inadequate information. 

I am going to make a radical suggestion about data and higher education:  colleges and universities will be better served if they avoid kneeling at the altar of data and instead fill key positions with people driven by intuition, experience, values, conviction, and principle.  A good place to start would be looking for leadership guided by a transcendent educational narrative.

I both agree and take issue with this statement.  Data doesn’t solve problems, but in a world of limited resources, data can guide us on where to put our efforts.  It’s not that most of us don’t agree children shouldn’t starve in Somalia, it’s that the “act first figure out if it works later” approach has the potential to cause as much harm as good.  That’s why health care is data driven by necessity…..courts are notoriously unsympathetic to the excuse “I treated the patient this way because my transcendent narrative said it was a good idea”.  Data is a good idea when you have an outcome you can’t afford to take a chance with.

In the end, I don’t think data is to blame for this backlash.  I am relatively sure that the same people who “kneel at the altar of data” to justify their own behavior are the same people who would, absent data, pursue their own gut feelings to the exclusion of rationality.  Intuition is very easily confused with emotion, experience can lead to falsely limiting possibilities, values can be misguided, conviction is dangerous in the wrong hands, and principle is easily warped.  No amount of data can change the way people are, but the more people who can spot the flaws in data and call BS, the better.

*Steps off soap box*

Trudge on friends, and don’t let the weasels get you down.

Why most marriage statistics are completely skewed

Apparently Slate.com is now doing a “map of the week”.  This week, it was a map of states by marriage rate.  Can’t get it to format well….click on the map and drag to see other states.

http://a.tiles.mapbox.com/v3/slate.marriage.html#4.00/40.65/-95.45

It shows Nevada as the overwhelming winner, with Hawaii second.  This reminded me about my annoyance at most marriage data.

Marriage data is often quoted, but fairly poorly understood.  The top two states in the map above should tip you off as to the major problem with marriage data derived from the CDC in particular….it’s based on the state that issued the marriage license, not the state where the couple resides.  Since all (heterosexual) marriages affirmed by one state are currently recognized by every other state, state of residence information is not reported to the CDC.  This means that states with destination wedding type locations (Las Vegas anyone?) skew high, and all others are presumably a bit lower than they should be.  Anecdotally, it’s also conceivable that states with large meccas for young people (New York City, Boston, DC) may be artificially low because many young people return to their childhood home states to marry.  This

The other problem with marriage data is that the resulting divorce data is even more skewed.  Quite a few states don’t report divorce statistics at all (California, Georgia, Hawaii, Indiana, Louisiana, Minnesota) and the statistics from the remaining states are often misinterpreted.  One of the most commonly quoted statistics is that “50% of marriages end in divorce”.  This isn’t true.

In any given year, there are about twice as many marriages as there are divorces….but thanks to changing population, changing marriage rates, people with multiple divorces, and the pool of the already married, this does not mean that half of all marriages end in divorce.  In fact, if you change the stat to “percent of people who have been married and divorced”, you wind up at only about 33%.  More explanation here.
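A toy calculation makes the gap between those two numbers easier to see.  The Python sketch below uses invented wedding counts and two loud assumptions of my own: exactly 1 in 3 marriages ends in divorce, and every divorce happens exactly 10 years after the wedding.  Real life is much messier, but the arithmetic shows how a falling marriage rate alone can produce the “half as many divorces as marriages” pattern.

```python
from fractions import Fraction

# Toy model with invented numbers, not real CDC data.
# Assumption: exactly 1 in 3 marriages ends in divorce, and every divorce
# happens a fixed 10 years after the wedding.
DIVORCE_SHARE = Fraction(1, 3)
LAG_YEARS = 10

# Hypothetical weddings per year, with fewer weddings in the later year.
marriages = {2000: 1500, 2010: 1000}

def divorces_in(year):
    """Divorces this year come from weddings LAG_YEARS earlier."""
    return marriages[year - LAG_YEARS] * DIVORCE_SHARE

# The commonly quoted comparison: this year's divorces vs this year's weddings.
ratio = divorces_in(2010) / marriages[2010]

print(f"divorces per wedding in 2010: {float(ratio):.0%}")
print(f"marriages that actually end in divorce: {float(DIVORCE_SHARE):.0%}")
```

In this made-up world the yearly ratio says 50% while the true share of marriages ending in divorce is fixed at one third, because this year’s divorces come from an earlier, larger pool of weddings.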

Ultimately, when considering any marriage data, it is important to remember that there are no national databases for this stuff.  All data has to come from somewhere, and if the source is spotty, the conclusions drawn from the data will likely be wrong.  This all applies to quite a few types of data….but marriage data is used with such confidence that it’s tough to remember how terrible the sources are.  A few people have let me know that I’ve ruined infographics for them forever, and I’m hoping to do the same with all marriage data.

You’re welcome.