Fun Quotes for Friday

Intuition becomes increasingly valuable in the new information society precisely because there is so much data.
John Naisbitt

It is a capital mistake to theorize before one has data.
Arthur Conan Doyle

I do not think it means what you think it means….

Oh teamwork.

I sat in on a fascinating talk yesterday about some pretty interesting team failures.  One in particular stood out to me: two teams, working on the east and west coasts, funded by a huge grant from the NSF.  One team was tasked with building a database; the other was going to populate it with all of the data.  A year’s worth of work later, it was discovered that the two teams had never clarified what they meant by several words (including the word “data”), and that the whole thing was completely useless.
Oops.  
Now, there are several lessons in that story, but one of them is the importance of knowing what certain words mean to the people who are saying them.  This can be a big issue in reading research and interpreting data, especially around popular public-health-type issues.  There are many terms…“rape,” “excessive drinking,” “binge eating,” and “substance abuse,” to name a few…that people tend to believe have one hard and fast definition.  When reading studies on these things, always verify that the authors’ definition matches your own.  While looking for good examples of this, I found this report on some drinking statistics that were being floated around a few years ago.

A new study from Columbia University’s National Center on Addiction and Substance Abuse (CASA) claims that adults who drink excessively and youths who drink illegally account for over half of the alcohol consumed in the United States, and that the alcoholic beverage industry makes too much money from these groups to ever voluntarily address the problem.

The article goes on to point out that if you look at the data, “excessive drinking” was defined as more than two servings of alcohol in one day, with no regard for height, weight, or frequency.  I somehow doubt this is the picture most people got when they read “adults who drink excessively.”

This comes up a lot in studies that have psychiatric diagnoses attached as well.  I have a friend who works with eating disorders who gets annoyed to no end that you can’t technically call someone anorexic until they’re 15% under a healthy body weight or their period has stopped, even if they’ve stopped eating for weeks.  Not many people know that up until this year, the FBI defined rape as something that could only happen to women.

Things to watch out for.

Boston vs Chicago

This week, Bad Data Bad is coming to you live from downtown Chicago, just a few feet away from the Magnificent Mile!

I’m at the Science of Team Science conference, and so far it’s going pretty well.  I got a chance to present and discuss some of my research with people last night, and it’s fun having people recognize more of the psych aspect of what I’ve been doing.  Your normal bone marrow transplant crowd really doesn’t care about that part of anything, so it was nice to have people recognize the theories behind what I was doing.  They’re posting the abstracts online at some point; I’ll link to them when I figure that out.

Anyway, on my flight out here the data geek in me realized that a Boston/Chicago comparison would be a great input for the Google Ngram Viewer.  If you haven’t played with this yet, it’s fun.  Basically, it tracks how often the words you enter appeared in books over the last 200 years.  They uploaded a massive number of books to get the data, so the results are kind of fun.  Here’s Boston vs. Chicago:

For reference, Chicago wasn’t founded until 1837.  I tried running it starting at Boston’s founding in 1630, but that produced a weird spike that made the rest of the graph look silly.  My guess is that’s a function of fewer books from that era being loaded into the database, since the y-axis is a percentage.

For more about the project behind google ngrams, here’s my good friend TED to explain:
http://video.ted.com/assets/player/swf/EmbedPlayer.swf

Why career advice on the Internet can be total crap

I like nurses, though I’ve never wanted to be one.  My mother’s a nurse, my sister will be in a year or so.  Most of my best projects have been done in conjunction with nursing departments.  Due to my proximity to lots and lots of nurses, I tend to hear a lot about the ups and downs of the profession.

Given that, this article annoyed the heck out of me.

The headline reads “How To Land A $60K Health Care Job With A Two-Year Degree”, and being curious about the salaries of those around me, I took a peek.  I was stunned to see that the supposed “$60K job with 2 years of education” was nursing.  As proof, they offered the average annual salary for RNs as $67,000 (backed up by the BLS here; the BLS actually uses the median, which is slightly lower at $64,000).  They went on to mention that nurses in Massachusetts make an average of $84,000 a year.
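That mean-vs-median gap is worth pausing on: a handful of high earners drags the average up while leaving the median (the “typical” salary) alone.  A quick sketch with invented salaries, nothing to do with the real BLS figures:

```python
import statistics

# Hypothetical RN salaries in $1000s (made up for illustration):
# mostly mid-range, with a couple of high earners at the top.
salaries = [52, 55, 58, 60, 62, 64, 66, 70, 95, 110]

print(statistics.mean(salaries))    # 69.2 -- pulled up by the top earners
print(statistics.median(salaries))  # 63.0 -- what the middle nurse makes
```

Whenever a headline quotes an “average” salary, it’s worth checking which of these two numbers it actually is.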

Now that all sounds awesome, but here’s what’s deceptive:  RN is not a degree.  RN is a license.  Neither the Bureau of Labor Statistics nor this article differentiates between the salaries of those who get an RN after an associate’s degree and those who get it after a bachelor’s degree.  It turns out there’s a lot of debate over how much of a difference this makes, but I can definitely speak to that Massachusetts salary number.  I work for one of the institutions that’s notorious for paying nurses extremely well.  They do not hire nurses who don’t have a BSN.  For most of the major Boston teaching hospitals, this is an increasing trend.  The Institute of Medicine is calling for 80% of nurses to be BSN-educated by 2020, and many hospitals are responding accordingly.  Most management jobs are off limits to associate’s-level nurses.

I’ll leave it to the nursing associations to debate whether all this is necessary or not, but I will point out that averaging across two different degrees with two different sets of job prospects, without mentioning it, is comparing apples and oranges.  Additionally, even when staff nurses and nurse managers make the same amount, it’s often because one is overtime-eligible (and works nights and evenings) and one isn’t.  So overall: a deceptive headline, designed to make people click on it.

Of course since I did click on it, I guess that worked.

Friday links for fun – 4.13.12

This will be completely lost on you if you’re not a Hunger Games fan, but the stats work/extrapolation is pretty damn impressive.

Professionally, I found this interesting….I can only get you the numbers, ma’am, I can’t make you use them wisely.

I haven’t talked much about small sample sizes, but this blog does.

These guys are my new heroes.  They noticed a statistical error that kept popping up in neuro research, and then went back and figured out how often people were getting it wrong…half of the studies that could have gotten it wrong did.  It’s a stat-geeky read, but here’s the story.

Age Bias and Polling Methods

A few years ago, in one of my research methods classes in grad school, a professor asked us to raise our hands if we had a cell phone.

Everyone raised their hands.  
Then he asked people to keep their hands up if they had a land line as well.  
Many hands went down.  
For those left, he asked how many answered it regularly or had caller ID and screened calls.  
Pretty much everyone.
This of course then led into a discussion of political polling and how many of us had ever considered who was actually answering these questions.  It was an interesting discussion, as pretty much the entire class admitted they would have self-excluded.  The Pew Research Center suggests this was not an anomaly, and that this is actually a problem that’s becoming more acute in political polling.
While many large national polling organizations have started calling cell phones as well, on the state level this is often not corrected for.  This can result, and has resulted, in some inaccurate polls, as the sample of people who are home, have a landline, and are willing to answer a pollster’s call does not always reflect the general population.  Actually, I think there’s good reason to question the representativeness of any sample willing to answer the phone for an unknown number, but that could be disputed (those interested enough to pick up the phone also might be more likely to actually go vote).
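The landline problem is easy to simulate.  Here’s a toy sketch (every number in it is invented) where landline owners favor one candidate more than the population as a whole does; polling only them overstates that candidate’s support.

```python
import random

random.seed(0)

# Toy population of 10,000 voters, roughly 60% supporting candidate A.
# Assume 30% still have a landline, and landline owners skew toward A
# (70% support among them); non-landline support is set so the
# overall rate stays near 60%. All rates are made up.
population = []
for _ in range(10_000):
    has_landline = random.random() < 0.30
    if has_landline:
        supports_a = random.random() < 0.70
    else:
        supports_a = random.random() < 0.556
    population.append((has_landline, supports_a))

overall = sum(s for _, s in population) / len(population)
landline_sample = [s for ll, s in population if ll]
polled = sum(landline_sample) / len(landline_sample)

print(f"true support:      {overall:.1%}")
print(f"landline-only poll: {polled:.1%}")  # overstates support for A
```

The bias here has nothing to do with sample size; a landline-only poll of millions would be off by the same amount, because the people being reached are systematically different from the people voting.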
Anyway, none of this is new.  What is new this (presidential) election cycle is that news organizations are now starting to put up stats on Twitter and Facebook status updates.  I decided to take a look and see exactly how skewed these stats are, and found that Twitter is most popular in the 18-29 demographic.  Of course, this is the least likely demographic to actually vote.  Interestingly, the poll on Twitter usage did not include people under 18, but tweets from users under 18 are not excluded when trends are compiled.
So two different ways of tracking elections, two different sets of flaws.  Pick your poison.

There’s bad data, and then there’s data that’s just plain mean….

I’ve worked at teaching hospitals for pretty much my whole post-college career, so I generally heave a bit of a sigh when I hear the initials “IRB”.  IRBs (Institutional Review Boards) are set up to protect patients and approve research, but they also have the power to reject proposed studies and cause lots of paperwork.  Sometimes, though, you need a good reminder of why they were invented.

Apparently, some scientists in the 1940s tried to develop a pain scale based on burning people and rating the pain.  Then, to make sure they had a good control, they burned pregnant women between contractions.

While it actually wasn’t a half-bad way of figuring out what their numerical scale should look like, that is just WRONG.  As a pregnant woman, I can pretty confidently say that anyone coming at me with a flat iron during labor will be kicked.  Hard.

Unethically gathered data is not only not worth it, it’s also frequently wasted.  In the study mentioned above, the data proved useless, as pain is too subjective to be really quantified.  After this fiasco, it wouldn’t be until 2010 that someone came up with a really workable pain scale.

You can’t misquote a misquote

Yesterday I talked about sensational statistics and the need to always verify that there are no missing adjectives that would change the statistic.  It was thus a bit serendipitous that today I happened to hear a debate about a misquoted statistic, and whether the quote or the misquote was more accurate.  It was on a podcast I listen to, and the episode was about a month old (sometimes I don’t keep up well).

It was happening around the time the contraception debate was at its most furious (see what I did there?  It was a federally mandated coverage of contraception debate, to give you all the adjectives).  Anyway, at the time the statistic about the prevalence of birth control usage among Catholic women was getting tossed around quite a bit.  The statistic, in its most detailed form, is this:  98% of self-identified Catholic women of childbearing age who are sexually active have used a contraceptive method other than natural family planning at some point in their lives.

Now, this stat rarely got quoted in its entirety.  First, I always think designating that the religion is self-identified is important.  The women answering this survey didn’t have to clarify whether they thought they were good Catholics, just that they were Catholic.  Second, the “sexually active” part got glossed over as well, despite the fact that it probably cuts down the numbers at least a bit (for young adult Catholics, to approximately 89% of respondents).  Third, “at some point”.  The study’s authors have justified this qualifier by arguing that if a woman was on birth control for years, then decided to start trying to have children and went off of it, she would otherwise have been excluded.  Critics have argued that this strategy was designed to include women who may have tried it, decided it was wrong, and stopped.  Both have a point.
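To see concretely how much a dropped qualifier matters, here’s a tiny invented dataset (ten made-up respondents, nothing to do with the real study’s numbers) showing how the headline rate changes once “sexually active” is removed from the denominator:

```python
# Each tuple is one hypothetical respondent:
# (sexually_active, ever_used_contraception)
R = [
    (True, True), (True, True), (True, True), (True, True),
    (True, True), (True, True), (True, False),
    (False, False), (False, False), (False, False),
]

active = [r for r in R if r[0]]

rate_all = sum(r[1] for r in R) / len(R)               # denominator: all women
rate_active = sum(r[1] for r in active) / len(active)  # denominator: sexually active only

print(f"among all women:            {rate_all:.0%}")    # 60%
print(f"among sexually active only: {rate_active:.0%}") # 86%
```

Same people, same answers; only the denominator changed.  That’s why a statistic stripped of its qualifiers can be a genuinely different number.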

That being said, I most often heard this being quoted as “98% of Catholic women use birth control” or sometimes even “98% of Catholics use birth control”.  

It was that last phrase that got the debate going on the show I was listening to.  Person 1 argued that it annoyed him that people kept dropping the “women” part of the quote.  Person 2 shot back that it actually drove him nuts that people felt the need to add it.  He argued that for every straight woman using contraception, there was by definition a straight man using it.  Unless one presumed a statistically significant number of women were misleading their partners, 98% of Catholic men were also using birth control (of course, even if they were being misled, they were actually still using it…just not knowingly).  Since Catholic doctrine’s prohibition on contraception applies to both genders, both parties are therefore guilty.

I liked the debate, and would be totally fascinated to hear the numbers on men who have used (or had a partner who used) contraception.  I am curious if a significant number don’t know, or would claim not to know.  I still think that clarifying “women” in the quote is fine, as it’s who the study was actually done on.  In my mind extrapolation should always be classified as extrapolation, not an actual finding.

Also of note: this was an in-person survey.  It’s always useful to remember that every answer in a survey like this had to be verbalized to another person…important when the topic is anything highly subject to social pressures.  For a further breakdown of issues with that study, see here.