Observer Effects: 3 Studies With Interesting Findings

I’ve gotten a few links lately on the topic of “observer effects”/confirmation biases, basically the idea that the process of observing a phenomenon can actually influence the phenomenon you’re observing. This is an interesting issue to grapple with, and there are a lot of misconceptions out there, so it seemed about right for a blog post.

First up, we have a paper on the Hawthorne effect. The Hawthorne effect comes from a study done on factory workers (in the Hawthorne factory) to see how varying their working conditions affected their productivity. What the researchers found was that changing basically anything in the factory work environment ended up changing worker productivity. This was so surprising it ended up being dubbed “the Hawthorne effect”. But was it real?

Well, likely yes, but the initial data was not nearly as interesting as reported. For several decades the data appeared to have been lost entirely, but it was found again back in 2011. The results were published here, and it turns out most of the initial effect was due to the fact that the lighting conditions were always changed over the weekend, and productivity was measured on Monday. No effort was made to separate the “had a day off” effect from the effect of varying the conditions, so the 2011 paper attempted to do that. They found subtle differences, but nothing as large as originally reported. The authors state they believe the effect is probably real, but not as dramatic as often described.

Next up, we have this blog post that summarizes the controversy over the “Pygmalion effect” (h/t Assistant Village Idiot). This is another pretty famous study, one that showed that when teachers believed they were teaching high-IQ children, the children’s actual IQs ended up going up. Or did they? It turns out there’s a lot of controversy over this one, and as with the Hawthorne effect, the legend around the study may have outpaced its actual findings. The criticisms were summed up in this meta-analysis from 2005:

  1. Self-fulfilling prophecies in the classroom do occur, but these effects are typically small, they do not accumulate greatly across perceivers or over time, and they may be more likely to dissipate than accumulate.
  2. Powerful self-fulfilling prophecies may selectively occur among students from stigmatized social groups.
  3. Whether self-fulfilling prophecies affect intelligence, and whether they in general do more harm than good, remains unclear.
  4. Teacher expectations may predict student outcomes more because these expectations are accurate than because they are self-fulfilling.

I find the criticisms of both studies interesting not because I think either effect is completely wrong, but because these two studies are so widely taught as definitively right. I double-checked the two psych textbooks I have lying around and both mention these studies positively, with no mention of controversy. Interestingly, the Wikipedia pages for both go into the concerns… score one for editing in real time.

Finally, here’s an observation effect study I haven’t seen any criticism of that has me intrigued: “Mind Over Milkshakes: Mindsets, Not Just Nutrients, Determine Ghrelin Response” (h/t Carbsane in this post). For this study, the researchers gave people the same 380 calorie milkshake two weeks in a row and measured their ghrelin response to it. The catch? One week it was labeled as a 620 calorie “indulgent” milkshake, and the other week it was labeled as a 120 calorie “sensible” shake. The ghrelin responses are seen below:

This is a pretty brilliant setup, as everyone served as their own control. Each person got each label once, and the shake itself was identical both times. I’m not sure how large the resulting impact on appetite would be, but it’s an interesting finding regardless.

Overall I think the entire subject of how observing things can change reality is rather fascinating. For the ghrelin example in particular, it’s interesting to see how a hormone none of us could consciously manipulate can still be manipulated by our expectations. It’s also interesting to see the limitations of what can be manipulated. For the Pygmalion effect, it was found that if the teachers knew the kids for at least 2 weeks prior to getting the IQ information, there was actually no effect whatsoever. Familiarity appears to breed accurate assessments, I suppose. All of this seems to point to the idea that observation does something, but the magnitude of the change may not be easy to predict. Things to ponder.

What I’m Reading: May 2018

I saw a few headlines about a new law in Michigan that would exempt most white Medicaid recipients from work requirements while keeping them in place for most black recipients. This sounded like a terrible plan, so I went looking for some background and found this article that explains the whole thing. Basically some lawmakers thought that the work requirements didn’t make sense for people who lived in areas of high unemployment, but they decided to calculate unemployment at the county level. This meant that 8 rural-ish counties had their residents exempted, but Detroit and Flint did not. Those cities have really high unemployment, but they sit in the middle of counties that do not. The complaints here seem valid to me… city dwellers tend not to have things like cars, so the idea that they can reverse-commute out to the suburbs may be a stretch. 10 miles in a rural area is really different from 10 miles in the middle of a city (see also: food deserts/access issues/etc). Seems like a bit of a denominator dispute, as the sketch below shows.
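To make the denominator dispute concrete, here’s a quick sketch with entirely made-up numbers showing how a high-unemployment city can vanish into a lower-unemployment county average:

```python
# Hypothetical numbers: a high-unemployment city inside a county whose
# suburbs pull the combined rate back down.
city_labor_force, city_unemployed = 100_000, 12_000      # 12% city rate
suburb_labor_force, suburb_unemployed = 300_000, 12_000  #  4% suburban rate

county_rate = (city_unemployed + suburb_unemployed) / (
    city_labor_force + suburb_labor_force
)

print(f"city rate:   {city_unemployed / city_labor_force:.0%}")  # 12%
print(f"county rate: {county_rate:.0%}")                         # 6%
```

Measured at the county level, the city’s 12% never surfaces, so its residents miss the exemption.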

I’ve talked before about radicalization of people via YouTube, and this Slate article touched on a related phenomenon: Netflix and Amazon documentaries. With the relative ease of putting content up on these platforms, things like 9/11 truther or anti-vaccine documentaries have found a home. It’s not clear what can be done about it unfortunately, but it’s a good thing to pay attention to.

I liked this piece from Data Colada on “the (surprising?) shape of the file drawer“. It starts out with a pretty basic question: if we’re using p<.05 as a test for significance, how many studies does a researcher run before he/she gets a significant effect where none should exist? While most people (who are interested in this sort of thing) get the average right (20), what he points out is that most of us do not intuit the median (14) or mode (1) for the same question. His hypothesis is that we’re all thinking about this as a normal distribution, when really it’s geometric. In other words the “number of studies” graph would look like this (figure from the Data Colada post):

And that’s what it would look like if everyone were being honest and only tested one hypothesis at a time.
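If you want to check those three numbers yourself, here’s a minimal simulation sketch (mine, not from the Data Colada post), treating each null study as an independent test with a 5% false-positive rate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Number of honest null studies run until the first false positive lands,
# with a 5% false-positive rate per study: a geometric distribution.
studies_needed = rng.geometric(p=0.05, size=100_000)

print(f"mean:   {studies_needed.mean():.1f}")             # ~20.0 (1/p)
print(f"median: {np.median(studies_needed):.0f}")         # 14
print(f"mode:   {np.bincount(studies_needed).argmax()}")  # 1
```

The long right tail is what drags the mean (20) well above the median (14), while the single most likely outcome is still getting “lucky” on study number one.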

Andrew Gelman has an interesting quick-take post up on why he thinks the replication crisis is centered around social psychology. In short: lower-budget, easier-to-replicate studies (in comparison to biomedicine), less proprietary data, vaguer hypotheses, and the fact that the biggest financial rewards come through TED talks/book tours.

Given my own recent bout with Vitamin D deficiency, I was rather alarmed to read that 80% of African Americans were deficient in Vitamin D. I did some digging and found that apparently the test used to diagnose Vitamin D deficiency is actually not equally valid across all races, and the suspicion is that African Americans in particular are not served well by the current test. Yet another reason not to assume research generalizes outside its initial target population.

This Twitter thread covered a “healthy diets create more food waste” study that was getting some headlines. Spoiler alert: it’s because fruits and veggies go bad and people throw them out, whereas they tend to eat all the junk food or meat they buy. In other words, if you’re looking at the environmental impact of your food, you should look at food eaten + food wasted, not just food wasted. The fact that you finish the bag of Doritos but don’t eat all your corn on the cob doesn’t mean the Doritos are the winner here.
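A toy version of that accounting, with made-up footprint numbers, shows why a waste-only comparison misleads:

```python
# Made-up "environmental footprint" units per item purchased.
foods = {
    "doritos": {"footprint": 10, "fraction_wasted": 0.0},  # all eaten
    "produce": {"footprint": 6,  "fraction_wasted": 0.3},  # some tossed
}

for name, food in foods.items():
    wasted = food["footprint"] * food["fraction_wasted"]
    print(f"{name}: total footprint {food['footprint']}, wasted {wasted:.1f}")

# A waste-only comparison favors the Doritos (0.0 vs 1.8) even though
# their total footprint (10 vs 6) is larger.
```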

On Average: 3 Examples

A few years ago now, I put up a post called “5 Ways that Average Might Be Lying to You”. I was thinking about that post this week, as I happened to come across 3 different examples of confusing averages.

First up was this paper called “Political Advertising and Election Results”, which found that (on average) political advertising didn’t impact voter turnout. However, this is only part of the story. While the overall number of voters didn’t change, it appeared the advertising increased the number of Democratic voters turning out while decreasing Republican turnout. The study was done in Illinois, so it’s not clear if this would generalize to other states.
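A two-line sketch with hypothetical precinct numbers shows how that nets out to “no effect on average”:

```python
# Hypothetical precinct: advertising shifts who turns out, not how many.
turnout_before = {"Democratic": 4_000, "Republican": 4_000}
turnout_after = {"Democratic": 4_400, "Republican": 3_600}

# Total turnout is unchanged, so "on average" the ads did nothing,
# even though they moved 800 votes' worth of turnout between parties.
print(sum(turnout_before.values()), sum(turnout_after.values()))  # 8000 8000
```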

The second paper was “How does food addiction influence dietary intake profile?“, which took a look at self-reported eating habits and self-reported scores on the Yale Food Addiction Scale. Despite the fact that people with higher food addiction scores tend to be heavier, they don’t necessarily report much higher food intake than those without. The literature here is actually kind of conflicted, which suggests that people with food addiction may have more erratic eating patterns than those without and thus may be harder to study with 24 hour dietary recall surveys. Something to keep in mind for nutrition researchers.

Finally, an article sent to me by my brother called “There is no Campus Free Speech Crisis” takes aim at the idea that we have a free speech problem on college campuses. It was written in response to an article on Heterodox Academy that claimed there was a problem. One of the graphs involved really caught my eye. When discussing what percentage of the youngest generation supported laws that banned certain types of speech, Sachs presented this graph:

From what I can tell, that’s an average score based on all the different groups they inquired about. Now here’s the same data presented by the Heterodox Academy group:

Same data, two different pictures. I would have been most interested to hear what percentage of each age range supported laws against NO groups. People who support laws against saying bad things about the military may not be the same people who support laws against saying bad things about immigrants, so I’d be interested to see how these groups overlapped (or not). The sketch below shows why the distinction matters.
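Here’s a minimal sketch (entirely made-up numbers, and assuming, just for illustration, that answers are independent) of why the averaged graph can’t answer the overlap question:

```python
import numpy as np

rng = np.random.default_rng(1)

# 1,000 hypothetical respondents asked about bans on speech targeting 6
# different groups; assume each answer is an independent 30% "yes".
supports_ban = rng.random((1_000, 6)) < 0.30

print(f"average per-question support: {supports_ban.mean():.0%}")                 # ~30%
print(f"supports NO bans:             {(~supports_ban).all(axis=1).mean():.0%}")  # ~12%
```

If instead the same 30% of people said yes to everything, the “no bans” share would be 70% rather than 12%, yet the per-question average would look identical. That overlap is exactly what an averaged graph hides.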

Additionally, the entire free-speech-on-campus debate has been driven by outliers that are (depending on who you side with) either indicative of a growing trend or isolated events that don’t indicate anything. Unfortunately averages give very little insight into that sort of question.

Maternal Mortality and Miscounts

I’m a bit late to the party on this one, but a few weeks ago there was a bit of a kerfuffle around a Minnesota Congressman’s comments about maternal mortality in states like Texas and Missouri:

Now I had heard about the high maternal mortality rate in Texas, but it wasn’t until I read this National Review article about the controversial tweet that I discovered that the numbers I’d heard reported may not be entirely accurate.

While it’s true that Texas had a very high rate of maternal mortality reported a few years ago, the article points to an analysis done after the initial spike was seen. A group of Texas public health researchers went back and recounted the maternal deaths within the state, this time trying a different counting method. Instead of relying on deaths that were coded as occurring during pregnancy or shortly afterward, they decided to actually look at the records and verify that the women had been pregnant. In well over half the cases, they found no medical records to corroborate that the woman was pregnant at the time of death. This knocked the maternal mortality rate down from 38.4 per 100,000 to 14.6 per 100,000. Yikes.
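The two quoted rates imply just how big the miscoding problem was (the live-birth count below is hypothetical; only the ratio matters):

```python
# The two rates quoted above, per 100,000 live births; the live-birth
# count is made up for scale, since only the ratio matters.
live_births = 400_000
coded_deaths = 38.4 / 100_000 * live_births      # ~154 deaths as coded
confirmed_deaths = 14.6 / 100_000 * live_births  # ~58 confirmed in records

print(f"share confirmed: {confirmed_deaths / coded_deaths:.0%}")  # 38%
```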

The problem appeared to be the way the death certificate itself was set up. The “pregnant vs not-pregnant” status was selected via a dropdown menu, and the researchers suspected that the 70 or so miscoded deaths were due to people accidentally clicking on the wrong option. They suggested replacing the dropdown with a radio button. To check whether the error ran in both directions, they also went back through fetal death certificates and other death certificates for women of childbearing age, looking for deaths incorrectly classified as non-pregnant. Unsurprisingly, it appears that when people meant to classify a death as “occurring during pregnancy”, they rarely made a mistake.

The researchers pointed out that such a dramatic change in rate suggested that every state should probably go back and recheck their numbers, or at least assess how easy it would be to miscode something. Sounds reasonable to me.

This whole situation reminded me of a class I attended a few years back that was sponsored by the hospital network I work for. Twice a year they invite teams to apply with an idea for an improvement project, and they give resources and sponsorship to about 13 groups during each session. During the first meeting, they told us our assignment was to go gather data about our problem, but they gave us an interesting warning: apparently every session at least one group gathers data and discovers the burning problem that drove them to apply isn’t really a problem. This seems crazy, but it’s normally for reasons like what happened in Texas. In my class, it happened to a pediatrics group that was trying to investigate why they had such low vaccination rates at one of their practices. While the other 11 clinics were at >95%, they struggled to stay above 85%. Awareness campaigns among their patients hadn’t helped.

When they went back and pulled the data, they discovered the problem. Two or three of their employees didn’t know that when a patient left the practice, you were supposed to click a button that would take them off the official “patients in this practice” list. Instead, they were just writing a comment that said “patient left the practice”. When they went back and corrected this, they found out their vaccination rates were just fine.
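The arithmetic of a stale roster is easy to sketch with hypothetical numbers in the same ballpark as theirs:

```python
# Hypothetical counts for one clinic with a stale roster.
vaccinated = 900
active_patients = 940    # patients actually still in the practice
departed_on_list = 120   # left the practice, never removed from the list

print(f"true rate:     {vaccinated / active_patients:.1%}")                       # 95.7%
print(f"reported rate: {vaccinated / (active_patients + departed_on_list):.1%}")  # 84.9%
```

Nobody at the clinic was doing anything wrong with the vaccines; the denominator was just quietly growing.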

I don’t know how widespread this is, but based on that classroom anecdote and general experience, I wouldn’t be surprised to find out 5-10% of public health data we see has some serious flaws. Unfortunately we probably only figure this out when it gets bad enough to pay attention to, like in the Texas case. Things to keep in mind.

Bilingualism in Units of Measure

Well, I’m back from Germany and everything went quite well, except for one little incident with a spontaneous bloody nose brought on by the descent into the Atlanta airport. Thankfully there’s a bathroom before you actually have to go through customs in Atlanta (there was not in Stuttgart), because I’m pretty sure the border patrol folks would have been less than impressed at my attempts to clean myself up with my leftover bottled water and that weird mesh they cover the complimentary pillow with. Good times.

It was a fun trip overall, and my lack of German didn’t end up making a difference. The town we were in was a college town, so nearly everyone spoke English as a second language. It was a little interesting though, as it was clear very few people we talked to were used to conversing with native English speakers (we saw quite a few pairs conversing in English where it was clear both were ESL with different primary languages), which led to some fascinatingly idiosyncratic translation issues. For example, one of the people we spent the most time with clearly only knew the pronoun “he” and applied it to everything. The sign above the coat rack in our hotel informed us “We are not responsible for your wardrobe”, which didn’t quite come off as I believe they intended. Not judging of course, since it’s all better than my forays into other languages, but I actually love seeing where the unusual phrasing comes up.

Anyway, while thinking about various translation issues, I started thinking a little bit about units of measure. There were a few times over the course of the week where distances or volumes came up, and I was interested to see that I have minimal problems translating kilometers to miles, pounds to kilograms, or liters to quarts, or vice versa. Part of this is just general quick mental math, but I did realize that I’m actually pretty comfortable thinking in either the metric system or the US/imperial system. My engineering degree and lab work both used a lot of metric units, and being a runner keeps you familiar with 5k and 10k distances, which makes all the distance translations pretty straightforward.

The only unit I have real trouble with is temperature. I simply cannot think in Celsius. Every time I see a temperature in Celsius I have to spend quite a bit of time calculating before I get to the right ballpark. I’m not sure why this is, though I suspect it’s something about the simultaneous change in the magnitude of a degree and the reference numbers. Somehow trying to do both at once throws me off.
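For reference, the conversion really is that two-step move, a rescale and a shift at once, which is probably why it resists mental math:

```python
def c_to_f(celsius: float) -> float:
    """Rescale the degree size (x 9/5), then shift the reference point (+32)."""
    return celsius * 9 / 5 + 32

# A few anchor points worth memorizing.
for c in (-10, 0, 10, 20, 30, 37):
    print(f"{c:>4} °C = {c_to_f(c):6.1f} °F")
```

The rough “double it and add 30” shortcut stays within a few degrees over the everyday weather range, which is usually close enough for ballpark purposes.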

I’m curious how many people are actually comfortable in both sets of units. I’m guessing there’s a strong influence of profession here.

On a related note, here’s the history of the US relationship to the metric system as told by NIST.

On an unrelated note, here’s a map of Europe and what each region calls Germany:

Apparently this is directly correlated with which inhabitants of Germany invaded which country first, though I can’t confirm that.

Public Interest vs Public Research

Slate Star Codex has an interesting post up about a survey he conducted on sexual harassment by industry. While he admits there are many limitations to his survey (it was given to his readers), the data is still interesting and worth looking at. He has a decent overview of why some surveys yield low numbers (normally by asking “have you been harassed at work?”) and some high (by asking specific questions like “have you been groped at work?”), which actually serves as an interesting case study in how to word survey questions. Words like “harassed” tend to carry emotional weight for people, so including them in surveys can be a mixed bag.

Anyway, data questions aside, I was a little fascinated by something he said at the end of his post: “This may be the single topic where the extent of public interest is most disproportionate to the minimal amount of good research done.”

His complaint is that for all we hear about certain industries being rife with sexism and harassment (and those two terms frequently being conflated), he couldn’t find much real research on which industries were truly the worst.

I think that’s a really interesting point, but it got me wondering what other public interest questions don’t have much research behind them. My first thought was gun research. While not technically banned, back in 1996 an amendment went through that cut the CDC budget by the amount it had previously been spending on firearm research and included a rule that federal dollars couldn’t be spent “to advocate or promote gun control”. This comes up every time there’s a shooting like Parkland, and people are looking to overturn it. While I’ve mostly heard gun control advocates talk about this, it’s interesting to note that not all the pre-Dickey-amendment research cast guns in a bad light. Reason magazine recently put up an article highlighting how little we know about how often guns are used for self-defense, and how the CDC’s last numbers put it much higher than I would have thought (1.5% of Americans per year).

I’m curious if people can think of other topics like this.

Germans and Bone Marrow Donation

I’m headed to Germany for the week to visit a bone marrow collection center there. Most people don’t know this, but 25-33% of all the world’s unrelated volunteer bone marrow donors are managed by the German national registry.

Even though bone marrow transplants were pioneered in the US, it was Germany and the Netherlands that took the lead on building registries where people could look for donors if they didn’t have a suitable sibling donor. As they started to build their own registries, they also got backing from their governments to recruit people to them. German participation in the bone marrow registry is about double that of the US (almost 10% vs 5%).

Anyway, I’ll be giving two talks (thankfully with a colleague) that are supposed to go 3 hours. Since I only know about 10 German words, this could get interesting. Fingers crossed for me!