Increasing Health Care Costs Are Not Like Other Cost Increases

When it comes to present-day financial woes, it is common to hear people focus on three things specifically: housing, higher education and health care costs. This will often be accompanied by something like Mark Perry’s “chart of the century,” which shows price increases versus wage growth since the year 2000:

Of the 5 categories of spending that outpaced average wage growth, 2 are in healthcare. But those healthcare categories are much trickier than the remaining 3. If I bring up childcare spending, college textbooks or even college tuition and fees, you pretty much know what that covers. Even if you haven’t personally used it in a while, you probably know what a daycare or bachelor’s degree entails, and I think we all have the same memories of college textbooks. But how do you compare the cost of healthcare in the year 2000 to today? What are we even comparing when we say “hospital services”? How do we add in the fact that there is simply more healthcare available now than there was 25 years ago?

As it turns out, this is an incredibly tricky problem no one has quite solved. The data above comes from the BLS medical CPI data, which tracks out of pocket spending for medical services. The BLS states that in general “The CPI measures inflation by tracking retail prices of a good or service of a constant quality and quantity over time.” But as someone who has worked in various health care facilities since just about the year 2000, I can tell you no one actually wants to go back to the care they got then. Additionally, CPI tracks the price of something, but not how often you need it or why you needed it.

Here’s an example: when I started in oncology, all bone marrow transplants were done inpatient. Then, people started experimenting with lower risk patients getting their transplants outpatient. Patients really love this! They sleep in hotel rooms with more comfortable beds, and walk in to clinic every day to get checked up on. However, this means that your average inpatient transplant is now more complex: the “easy” patients who were likely to have a straightforward course of care were removed from the sample. I don’t often look at what we charge, but I wouldn’t be surprised to see that the cost for an admission for bone marrow transplant has continued to trend upward. But this doesn’t mean the cost has actually gone up for most patients. In this case, comparing the exact same hospital stay for the exact same diagnosis as 25 years ago is not comparing the same thing. Innovation didn’t change the fact that some patients need a hospital stay; it meant that fewer patients needed one.

While this is just one example, I strongly suspect this dynamic is a big reason why the cost of hospital services has gone up so much. The big push in the last 2 decades has been all about keeping people out of the hospital unless they really need to be there, which has the effect of making hospital stays look more expensive while keeping more people out of the hospital.

This spillover can also increase the cost of outpatient medical services, the other healthcare category we see above. This past year, for example, I got my gallbladder removed. In the year 2024, 85% of people who got a gallbladder removed went home the same day, as did I. In the year 2000, that figure was hovering around 30%. So again, hospitals are now caring for just the sickest patients, but one also assumes that outpatient follow up visits might be more complex than they were 25 years ago. Having more than half of patients change treatment strategies is a huge shift in the way care is delivered, even if it shows up as “the exact same visit type for the exact same diagnosis”. From the standpoint of CPI, a ‘gallbladder removal’ looks like the same service. From the standpoint of reality, it has become a fundamentally different care pathway.

Now this is just one graph, and it’s true there are other graphs that get passed around showing an explosion in overall healthcare spending. Those graphs are accurate, but they fail to reflect that the amount of healthcare available since <pick your year> has also exploded. Here’s a list of commonly used medical interventions that didn’t exist in the year 2000:

  1. Most popular and expensive migraine drugs (CGRP inhibitors)
  2. GLP 1s for diabetes/weight loss (huge uptick in the past 5 years)
  3. Cancer care (CAR-T cell therapy and immune checkpoint inhibitors)
  4. Surgical improvements (cardiac, joint replacement, etc)
  5. Cystic fibrosis treatment (life expectancy has gone from 26 to 66 since 2008)
  6. HIV treatment (life expectancy was 50-60 in 2000, now is the same as the rest of the population)
  7. ADHD medication (this one is more an expansion in diagnosis: spending was $758 million in 2000 and is now estimated at $10 to $12 billion. I bring this up as a tangential rant because I’ve recently seen 2 people mention that paying for insurance annoyed them because they were healthy and never used it, while they or their children were on ADHD medication. If you are going to complain about healthcare costs, it’s good to make sure you are accurately assessing your own first.)

Neither childcare nor higher education has changed in any comparable way over the same time period.

My point here is not that healthcare has no inflation, it almost certainly does. Rising wages, increased IT needs, increased regulatory burden and increased cost of supplies would all hit healthcare as well. But when you compare healthcare in the year 2000 and the year 2025, you are comparing two different products. Go further back with your comparison and the differences will be even more stark. We are never going to control healthcare costs as long as we are constantly adding new, cool and really desirable things to the basket. There is not a world in which we can both functionally cure cystic fibrosis AND do it for the same price as not curing cystic fibrosis. Not all cost increases are the same.

Knowing the Media Lies Isn’t the Same as Knowing When

A few months ago, I wrote a post called “Gell-Mann Amnesia Applies to TikTok Too“. As often happens, once I wrote it, I started seeing the phenomenon everywhere. A few days ago, however, I saw a fantastic example I wanted to get out there.

As you may or may not know, I am a long-time fan of the Kardashians and the various iterations of their reality show. I’d say it’s my guilty pleasure, but I don’t feel particularly guilty about it. They’re entertaining, and I’ve learned a surprising amount about what it’s actually like to be famous.

(Side note: while the money looks fun, fame looks terrible. Credit to anyone who can tolerate it.)

When the original show “Keeping Up with the Kardashians” started in 2007, the family was only mildly famous. Almost 2 decades later, they are now some of the most famous and recognizable celebrities on the planet, and the show has grown to reflect that. At least a few times a season they do a segment on various media reports/social media rumors they have seen about themselves, and how people are often entirely making things up about them just to cash in on their name. They vent their frustration that people keep posting nonsense, and complain about how hard it is to set the record straight once rumors start. This makes sense to me, and in their shoes I’d do the same thing.

Given this, I was interested to see on a recent episode that Kim Kardashian admitted she didn’t believe we really landed on the moon. When the producers pressed her to explain why, she had a few snippets of concern, but she ended it with “Go to TikTok—they explain the whole thing.”

Bam. Gell-Mann Amnesia.

As a reminder, Gell-Mann amnesia is “the tendency of individuals to critically assess media reports in a domain they are knowledgeable about, yet continue to trust reporting in other areas despite recognizing similar potential inaccuracies.”

Kim Kardashian knows – possibly better than almost anyone alive – how inaccurate media coverage can be, and how often influencers make wild or misleading claims to build attention and monetize outrage. And yet, the moment the topic shifts from something she understands intimately (herself and her family) to something she doesn’t (astrophysics), those same incentives and distortions vanish from her mental model.

Knowing the media lies in one domain does not automatically make us skeptical in others. In fact, it often makes us overconfident. We are certain we can spot nonsense, as long as it isn’t about something we already know about.

Gell-Mann amnesia applies to TikTok too. Possibly more than anywhere else.

True Crime Times

I have a Substack piece up today on the True Crime Times: What True Crime Can Learn from the Science of Getting Things Wrong.

This is an abbreviated version of my True Crime Replication Crisis series, albeit aimed towards a true crime audience rather than my usual folks here. I was pleased to see the first comment was from someone who also likes to rant about statistics. If you’re looking for that, I’m always here for you. Please see my Intro to Internet Science series for details.

College Costs Stopped Spiraling About a Decade Ago

When you talk about the economy and the state of young people today, you will almost always hear about how young people are drowning in student debt. I took this statement at face value until a few months ago, when I saw someone comment that college cost creep had largely slowed down. I didn’t follow up much more on it until I saw the recent Astral Codex Ten post about the Vibecession data, and he confirmed that the high water mark for student loan debt was actually 2010 (note that this is debt per enrolled student, so those numbers aren’t impacted by changes in enrollment):

His theory is that 2010 is around when large numbers of people first started maxing out the $31k cap on many government loans, and that cap hasn’t moved. I think there are two other things going on:

  1. The youth (16-24) unemployment rate was above 15% for about 4 solid years after the 2008 crash: 2009-2013. This is the worst youth unemployment since the early 80s, and it hasn’t been matched since. The COVID unemployment spike lasted about 5 months (April-August 2020): it dropped back below 15% by September 2020 and was below 10% by June 2021. 4 years of bad job prospects leads to a “well, may as well finish my degree/go back to grad school” type of thinking in a way 5 months of bad job prospects doesn’t.
  2. 2012 is generally considered the inflection point for smartphone/internet adoption, and this opened up a lot of low cost options for online college. I don’t have good comparison numbers, but today in 2025 you can get a bachelor’s degree for $10k/year at Southern New Hampshire University, and they helpfully point out some of their “pricey” competitors are $15k/year. Adjusted for inflation, that would be the equivalent of a 2010 student paying about $6,800/year in tuition. I was not shopping for a degree at that time, but Google tells me you would have paid triple that for my state school that year.

Now you can find websites that say student loan debt is going up, but from what I can find these graphs don’t inflation-adjust their data. The complaints about how a $100,000 salary isn’t what it used to be are accurate, but by the same token a $100,000 debt isn’t what it used to be. Looking at the top graph on this website for example, you wouldn’t know that a $17.5k loan in the year 2000 is actually about $34k today, remarkably close to the $35k they say 2025 graduates are taking out.
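For anyone who wants to replicate this kind of math, here’s a minimal sketch of the inflation adjustment being done above. The multipliers are approximate CPI-based factors I’m assuming for illustration; the exact values depend on which index and which months you compare:

```python
# Rough inflation adjustment. Multipliers are approximate CPI-based
# factors (actual values depend on the index and months compared).
CPI_MULTIPLIER = {
    (2000, 2025): 1.9,   # $1.00 in 2000 ~ $1.90 in 2025
    (2010, 2025): 1.47,  # $1.00 in 2010 ~ $1.47 in 2025
}

def to_2025_dollars(amount, year):
    """Convert a dollar amount from `year` into approximate 2025 dollars."""
    return amount * CPI_MULTIPLIER[(year, 2025)]

# The $17.5k loan from 2000 comes out around $33k, in line with the
# ~$34k figure above.
print(to_2025_dollars(17_500, 2000))

# And the SNHU comparison: $10k of 2025 tuition back-converted to 2010
# dollars is about $6,800.
print(10_000 / CPI_MULTIPLIER[(2010, 2025)])
```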

Ok, so that’s debt, but what about the sticker price? Well, the College Board puts together a pretty useful report on the topic that shows a few things:

For public universities, we see a slowdown in tuition increases starting about a decade ago, and for private schools the change happens about 5 years ago. For all schools, the inflation adjusted cost is currently lower than it was a decade ago. But wait, there’s more!

The above graphs are just the sticker price. The College Board also tracks the “real cost” after factoring in grants. Here’s the data for private 4 year colleges:

Note the “grant aid” line, which slowed during the 2008 crash but has been ticking upward since 2013 and hasn’t stopped. To emphasize, those are grants, not loans. That’s just money off the sticker price. According to the College Board, the net cost of attendance of a college in 2024 was less than in 2006. I won’t keep pummeling you with graphs, but for private 4 year, public 4 year and public 2 year colleges, the “real” cost peaked in 2016.

I am not an economist, but the numbers suggest a pretty clear story to me. When unemployment for those under 25 was high in 2009-2013, going to college, any college, seemed like a good financial move. For many, it probably was. Then, as employment picked up, students were able to get choosier and consider the cost of student loan debt in their choices. Very quickly colleges started upping the amount of grants offered, and then stopped increasing the sticker price. With recent inflation, the price increases actually dropped below the inflation line and now the real cost of college is dropping.

Additionally, technology improvements allowed online schools to start offering cheaper tuition at a large scale. This might have only made a small dent, except then the pandemic happened and traditional campus life was upended. This made the difference between going to a traditional college and an online college much smaller, and based on those I know with kids that age a lot of kids opted for at least a year or two at a cheaper online school rather than pay through the nose to sit in their dorm room all day. This put additional cost pressure on schools, and we see the prices tick down further.

All that being said, there is a not-small group of people who were pretty slammed by college costs: those who came of age during the 2008 financial crash and its aftermath. However, most of those people are actually in their late 30s now, and it’s important to note that state of affairs did not persist for those who came after them. Times change, sometimes for the better.

My Favorite Book of the Year: The Age of Diagnosis

As 2025 comes to a close and we careen towards Christmas and giving season, I wanted to put in a plug for my favorite book of the year. The book is The Age of Diagnosis: How Our Obsession with Medical Labels Is Making Us Sicker, and I enjoyed it immensely. I found it originally when Jesse Singal did a Substack post called “Long Covid can be both Psychosomatic and Real”, and I forwarded it to my sister (an NP), who promptly got the book and then immediately called me to talk about it. She was annoyed I hadn’t actually read it yet, so I got the book and could see why she was calling. This is a book you want to talk about with people.

The author is a UK neurologist and a skilled writer, and she dares to ask the question “what is the point of diagnosing people with things?” She points out that diagnosis is supposed to be used strictly to inform treatment options, but we’ve completely overlooked the psychological impact a diagnosis can have. She starts with the example of Huntington’s disease, a fatal genetic disease that you can test for and diagnose, but for which there is currently no cure. Prior to the advent of testing, 90% of patients and their families said they would love to have a test. Once one was developed, however, the decision to test or not proved a lot harder for people than they had expected.

She goes on to cover many other areas of medicine: COVID, chronic Lyme, autism, ADHD, cancer screenings, and points out repeatedly that there are two ways to be wrong. Missing a diagnosis you could have treated is obviously bad, but giving someone a diagnosis for something they may not have also carries a risk. It’s that second risk she explores for both physical and psychological illnesses. What happens if you think you have a disorder that you don’t? Does disorder creep carry a cost? If your diagnosis makes you feel better about yourself but doesn’t actually improve your objective functioning, or even worsens it, should it really have been given? Shouldn’t we be, you know, studying some of these questions?

I liked this book because I’ve spent a lot of time in the last 7 years or so thinking about the purpose of diagnoses and what they’re good for. Back in 2019 I wrote about my lengthy journey to getting diagnosed with chronic migraines (they had an atypical presentation at first), and it was a great relief to finally get a name for my issue. However, it still took years to get a treatment regimen that worked, and I still have problems. I also have a new appreciation for psychosomatic illness, because the migraines have messed up my sense of pain quite a bit. I now have to let every health care provider I have know that my sense of pain is not a great guiding light, in either direction. I have felt pain in places that appeared to have nothing actually wrong with them, and failed to recognize pain in other places because I thought it was part of the regular pain I have. Not having your senses work predictably is a huge disadvantage in diagnosis, but this happens to more people than you think. One highlight of the book was when she notes many people experience psychological pain as physical pain, and get slapped with ever escalating numbers of diagnoses while trying to treat it. This isn’t good for anyone.

A related read this week was Accommodation Nation in the Atlantic, which points out that over 20% of students at elite universities now have a disability on file. This is a rate far higher than at less elite universities, and the disabilities are primarily autism, ADHD and anxiety, which again makes us wonder what a diagnosis is really for. If the best and brightest are claiming to be disproportionately impaired, what are we really looking at here?

What The Age of Diagnosis highlights, sometimes uncomfortably, is that our institutions haven’t caught up to the psychological and social power of a label. In an era where traditional communities seem to be shrinking, we run the risk of allowing diagnoses to take a disproportionate role in the way we define ourselves. Books like this don’t offer easy answers, but they do give us the vocabulary to ask better questions about how we allocate care, how we define impairment, and what we actually want our diagnostic categories to accomplish in a world where they shape so much of public and private life.

There Weren’t Just 2 Scientific Advances that Made the Sexual Revolution Possible, There Were 4

There’s a Bret Weinstein speech going around on Twitter where he makes a comment about how birth control and abortion changed the game around sex, in what is commonly known as the sexual revolution of the 1950s-1970s. I have not listened to his speech, so I have no comment on what he was saying specifically, but in reading some of the comments I was interested to see that when people discuss “what changed” during the 1950s through the 1970s, they focus on abortion and birth control on repeat. Even the Wikipedia page for the sexual revolution only mentions these two. Those things absolutely changed behavior, but I think there are two more things that need to be a bigger part of the discussion:

  1. Paternity testing
  2. Antibiotics

Paternity testing started out with blood testing in the 1920s, but hit its stride in the 1960s with HLA testing. Prior to that, you had to use social rules and general vibes to determine paternity; it largely relied on people’s own truthfulness. Before paternity testing, marriage was the most surefire way to ensure no one questioned whose kids were whose, but after we got a better method the share of kids born to single moms went from 5% to 40%. You can see that as good/bad/neutral, but that almost certainly doesn’t happen without the ability to identify a father accurately.

As for antibiotics: penicillin was discovered in 1928, but WWII sped up the perfection of antibiotics for the treatment of bacterial infections, and widespread public use came in during the 1950s. From 1935 to 1968, 12 new classes of antibiotics were launched. Prior to this, basic STDs like syphilis were killing people at a rate similar to suicide today:

And that’s just deaths from syphilis, not cases. That figure comes from this analysis, which notes that prior treatment methods may have been as effective, but they were expensive and time consuming, and penicillin just made everything easier. Of course, syphilis is just one of the diseases people were dodging; chlamydia and gonorrhea also would have been issues. Antibiotics changed the game here.

I bring these up not to take any particular stance on any issue, but to point out that the past was very different in ways we don’t often think about. Even if somehow birth control and abortion were wiped off the face of the planet today, antibiotics and paternity testing would still ensure our population level practices around sex were different than they were 100 years ago. Sexual mores were never just about pregnancy; they were also about ensuring you could establish paternity and avoid STDs.

I think this is important for both cultural conservatives and cultural liberals to remember, as at times we can look at the past as either a golden era of morality or a deep pit of oppression. But in prior “moral” eras, a lot of sexual behavior was kept in check by people lying or threatening to lie about true things, and paternity testing stopped that. Conversely, things like religion may never have had quite the level of influence we attribute to them; they were often coping with very real issues around STD control in an era when the medical community couldn’t help much. When those things changed, behavior changed. It’s a good reminder that most social changes have several causes, and are not just related to one thing.

To note: the things I mention above are those I believe had a direct impact on sexual issues in the 1950s-1970s specifically. There’s a few other advances that probably changed sexual behavior in a slightly less direct fashion: cars (teenagers could go see each other more easily), at home pregnancy tests (earlier identification of pregnancy, no doctor needed), mass distribution of porn (TBD), dating apps (thank God I missed that era).

Anything else I missed?

Snip Happens: A Study in Hypothetical Hair Sabotage

Earlier this week, the Assistant Village Idiot tagged me in one of his link roundups:

Off With Her Hair: Women tell attractive women to cut their hair. The study’s authors are all female. I wonder what it is like for women studying female intrasexual competition. Is it harder to get along, or easier? Bethany, you need to get in on researching the women who research women.

I’ll admit I got a kick out of this, in part because I love a good gender study, and in part because I have REALLY long hair. I mostly wear it up, but it’s the kind of hair that people actually say “whoa, I had no idea it was that long” if I take it down. I call it homeschool hair. The last time I wore it down for an extended period of time, someone (who I knew) stopped me and asked if she could take a picture of it. I have no particular attachment to this style, but I actually don’t like haircuts, so here we are.

I hadn’t yet had a chance to dive into the study when a Tweet popped up on a similar topic:

It actually came to my attention because a few people immediately pointed out that these women were in a no-win situation: if they’d told their coworker “she looked like shit” they would be considered catty, but if they told her it looked good they were intrasexually competitive. Additionally, they were coworkers of hers, not friends, and it’s pretty weird to expect that all women at all moments must be aiding every other woman they know with her appearance. I suppose there’s an option where they could have tried to be pleasant but not endorse the haircut, but that’s a very hard tone to hit correctly, and honestly? I’ve seen plenty of male coworkers say things “looked great” when other males came in proud of some new thing they did/purchased/whatever. Why start conflict with a coworker for no reason?

All of this prompted me to do a deep dive into this study, to see what they found. Ready? Let’s go!

Study Set Up

So the basic set up of the study is that 200ish (mostly college aged) women were recruited for a series of two studies. In both, they were shown a series of female faces cropped to the shoulders like this:

The participants were asked to suggest how many centimeters (they were Australian) the pictured woman should cut off. They were given the picture of the woman, an assessment of the hair’s condition, and how much hair the woman was comfortable cutting off. Those last two were binary: hair condition was either good/damaged, and the requested length of cut was either as much as needed/as little as possible. After that, the researchers asked the women to rank themselves on a few different scales, including one that measured intrasexual competitiveness.

What’s intrasexual competitiveness, you might ask? Well, it’s apparently a 12 question measure that asks how you feel about those of your gender who might be better than you on some level. The questions they mention ask you to agree/disagree with statements like “I just don’t like very ambitious women” or “I tend to look for negative characteristics in attractive women”. Highly intrasexually competitive women are those who strongly agree with statements like that.

They hypothesized that women who scored high on this scale might be more aggressive with their recommendations to other women about how much hair they should cut off, under the idea that men like long hair and this would sabotage other women who might be competitors. And to be honest, this sounds like a pretty plausible hypothesis to me! These are women who just answered a bunch of questions reiterating that they really didn’t particularly like other women; I would imagine they’d actually end up being meaner to other women than people who disagreed with those statements. It reminded me of a point someone made recently about introvert/extravert tests: they ask a bunch of people if they like big groups of people, call those who said “no” introverts, and then declare that introverts don’t really like parties. I mean, that makes sense! But it does at times seem like most of the sorting already took place before we even got to the study itself. But I digress, let’s keep going.

The Findings

Ok, so the first thing that caught my eye is that the primary finding of the study is that all women, regardless of scale ranking, first and foremost based their haircut recommendations on two things:

  1. The condition of the woman’s hair (those with damaged hair were told to get more cut off)
  2. The hypothetical client’s stated preference (it was followed).

So to be clear, it was found that even women who stated they didn’t much like other women primarily based their recommendations on what was best for the other woman and what the other woman wanted. And it wasn’t even close. Every other effect we are going to talk about was much smaller, both in absolute value and in statistical significance. Here’s the graph:

To orient you, the top panel is the recommendations for healthy hair, the bottom is the recommendations for unhealthy hair. As you can see, the difference in recommendations based on that condition alone is quite large, around 2cm (a bit under an inch for us USA folks) across all conditions. The second biggest impact was what the women wanted, which made a difference of about 1-1.5cm in the recommendations. Then we get to everything else.

It’s important to note that despite how this topic often gets introduced, there was no significant effect found based on attractiveness in general. This is notable because, as the Tweet above shows, this stuff is often portrayed in popular culture as something “women” do, and we don’t have much proof that it is! They did find an attractiveness effect for the women with healthy hair being judged by average and highly competitive women, but it went the opposite way: it was actually unattractive women who got the recommendation to cut off more hair. And again, the difference was a fraction of the impact of the other two factors: somewhere between .1-.2cm. For those of us in the US, that’s less than 1/10 of an inch. A quick Google suggests that’s less than a week’s worth of hair growth, and certainly not enough for anyone to notice.

I think it’s good to hammer on this, because if I told you someone was out to sabotage you, you might be worried. But if I told you someone was out to sabotage you, but they’d first do what was best for you, then follow what you wanted, then would sabotage you so subtly it would be imperceptible to the naked eye…..well, you’d probably calm down substantially. Much like when we see studies like “eating eggs will double your risk of heart disease in your 50s (from .001 to .002 per thousand)”, we need to be careful when quoting results that find a near imperceptible difference that can be fixed with 5 days of regular hair growth.
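If you want to sanity check the “days of hair growth” math, here’s the back-of-the-envelope version. The ~1.25 cm/month growth rate is the commonly cited average, not a number from the study:

```python
# Back-of-the-envelope: how long does hair take to regrow the "extra"
# amounts recommended? Assumes the commonly cited average growth rate
# of ~1.25 cm/month; individual rates vary.
CM_PER_MONTH = 1.25
CM_PER_DAY = CM_PER_MONTH / 30

for extra_cm in (0.1, 0.2, 0.8):
    days = extra_cm / CM_PER_DAY
    inches = extra_cm / 2.54
    print(f"{extra_cm} cm = {inches:.2f} in, regrows in ~{days:.0f} days")

# 0.1 cm regrows in ~2 days, 0.2 cm in ~5 days, and even the 0.8 cm gap
# from the second study (discussed below) in under 3 weeks.
```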

But back to the finding that attractive women didn’t actually get penalized, and that the slight increase in haircut recommendations was aimed in the other direction. The study authors conclude the following:

This suggests that appearance advice may act as a vector for intrasexual competition, and that such competition (in this scenario at least) tends to be projected downward to less attractive competitors.

I will admit that annoyed me a bit, because this means that ANY variation is now considered to prove the thesis. They stated this was ok because there was no active “mate threat”, so they would expect it to go this way, but I will note that if attractive women had been penalized, it would also have been considered proof. Having just finished our series on the replication crisis, I will point out that explaining every possible finding as proof of your original thesis is a big driver of non-replicated findings.

Moving on to the second study, the study authors made a few really smart changes to their set up. First, they provided participants with a picture of a ruler and a credit card up front so they’d have a reminder of what different lengths meant. They also changed from using a text box for “how much hair would you recommend they cut off” to a Likert-style scale where you had to recommend a whole number of 1-10 cm. I liked these changes because they showed a good faith effort to improve the results. In this round, they added faces that were considered “average” to the mix and repeated most of the same experiment.

The findings were similar. The biggest variations were based on hair damage and client wishes, with relatively small differences of .1-.2cm appearing across different individual groups. The graph that got the headline though is this one:

This is the graph they used for the title of the study, and it comes from dropping the whole client’s wishes/hair damage framing and just looking at the overall amount of hair these women suggested be removed for anyone. You will note again the variation across attractiveness levels is .1-.2cm, but indeed the “high” intrasexual competitiveness women recommend more than the other two groups. The highest recommendation is about .8cm higher than the lowest value. That’s about 1/3 of an inch. Not enough for you to visually notice, but still something.

What caught my eye though was that we only really saw variation in the high and low groups, which got me wondering how many women were in each category. And that’s where I found something interesting. In the first study, they defined “high” and “low” intrasexual competitiveness as being 1 SD from the mean. Assuming a normal distribution, that would mean about 16% of the sample fell in each of the high/low groups, and the remaining 68% were in the average group. For this study though, they changed it to 1.5 SD, which puts a little less than 7% of the sample in each of the high/low groups. Given the sample size of around 250, we’re looking at about 17 people in each of the high and low groups (34 people total) and 216 or so in the average group. By itself, that will lead to higher variation in the groups with smaller sample sizes. You will note there is very little variation in what the group with most of the participants answered.
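If you want to see where those group sizes come from, here’s the arithmetic, assuming the competitiveness scores are roughly normally distributed:

```python
# How the 1 SD vs 1.5 SD cutoffs change group sizes, assuming roughly
# normally distributed intrasexual competitiveness scores.
from scipy.stats import norm

n = 250  # approximate sample size

for cutoff in (1.0, 1.5):
    tail = norm.sf(cutoff)        # fraction beyond +cutoff SD (one tail)
    high = low = round(n * tail)  # sizes of the "high" and "low" groups
    average = n - high - low
    print(f"{cutoff} SD cutoff: ~{high} high, ~{low} low, ~{average} average")

# 1.0 SD cutoff: ~40 high, ~40 low, ~170 average
# 1.5 SD cutoff: ~17 high, ~17 low, ~216 average
```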

My Thoughts

So like I said at the beginning, I find this study’s conclusion fairly plausible. The idea that women who specifically state they don’t like other women will give other women worse advice just kind of makes sense. But a few thoughts:

  1. The main findings weren’t mentioned. The title and gist of this study were presented as “intrasexually competitive women advise other women to cut more hair off”, but it could just as easily have been “intrasexually competitive women primarily take other women’s best interests and preferences into account” and it would be just as (if not more) accurate. The extra hair cut is presented as a primary driver of haircut recommendations, but really it’s a distant third to the other two. This is fine for academic research, but if you’re trying to talk about how this applies to real life, it’s probably good to note that women actually gave quite reasonable advice, with slight variation around the edges.
  2. The absolute value was never discussed. I was curious if the authors would bring up the small absolute findings as part of their discussion, and alas, they did not. The AVI let me know he found the link in Rob Henderson’s post here, and I was amused to find this line one paragraph before his discussion of this study: “This is why reproductive suppression is primarily a female phenomenon. Of course, there have been cases of male suppression (e.g., eunuchs). Or men raiding a village and simply slaughtering all of the males and abducting the women as wives and concubines. But suppression among women is subtler.” If by subtler you mean 2mm of extra hair, then yes. If I had to pick between that and murder or castration, I admit I’m feeling women got the better end of the deal here. If you would keep eating eggs (or whatever other food) that were associated with a tiny increase in cancer, then you probably can’t take this hair cutting study as a good sign of intrasexual competition. How are women sabotaging other women if they are doing so at a level most men wouldn’t notice? I suspect there’s an assumption this effect is magnified in real life, but again, this study doesn’t prove that.
  3. Motives are assumed. Much like in the critiques of the Tweet above, I noticed that throughout the paper the authors explained why targeting attractive women, average women and unattractive women would all be intrasexual competition. What I did not see was any attempt to consider non-competitive explanations. Maybe people suggest unattractive women cut more hair off because they think they should try a different look? Maybe scoring high on an intrasexual competition survey is an indication of aggressiveness, and aggressiveness correlates with more aggressive hair cutting recommendations? Unclear, but I will note the idea that all variance could only be explained by intrasexual competition surprised me, particularly when we’re discussing effects that are likely too subtle to be spotted by the opposite sex.
  4. We don’t know this is a female-only phenomenon. Despite Rob Henderson’s claim above, you will be unsurprised to hear no one (that I could find) has ever done this study on men. I actually would have been interested to see that study, even if it was men making suggestions for female hair. One reason I’d like to see it is that I heavily suspect men would be somewhat more erratic in their rankings, which would actually increase the risk of spurious findings. Frankly, it would amuse me to watch people have to explain why their statistically significant findings were still meaningful, or have to admit that sometimes that just happens and it doesn’t mean anything at all. But still, we’re told constantly that “subtle” sabotage is a woman thing, and I couldn’t find any studies suggesting anyone was actually looking at this. Might be interesting.

Ok, well that’s all I have! Thanks for reading, and I’m going to go consider cutting my hair an amount no one will notice, just for fun.

The True Crime Replication Crisis Part 8: Consequences

Well, we’ve reached the end of the road here folks, and it’s time to wrap things up with some conclusions and consequences. As I mentioned in the first post, I’ve been loosely following the Wikipedia entry on the replication crisis, and I’d like to point out the first paragraph of its consequences section (bolding mine):

When effects are wrongly stated as relevant in the literature, failure to detect this by replication will lead to the canonization of such false facts.[195]

A 2021 study found that papers in leading general interest, psychology and economics journals with findings that could not be replicated tend to be cited more over time than reproducible research papers, likely because these results are surprising or interesting. The trend is not affected by publication of failed reproductions, after which only 12% of papers that cite the original research will mention the failed replication.[196][197] Further, experts are able to predict which studies will be replicable, leading the authors of the 2021 study, Marta Serra-Garcia and Uri Gneezy, to conclude that experts apply lower standards to interesting results when deciding whether to publish them.[197]

So overall we find that in science, among highly educated PhDs whose professional reputations and institutional affiliations are built on truth:

  • False facts end up being canonized
  • Less reliable studies get more attention
  • Even when findings are formally challenged, they will continue to be repeated as true, with almost no one mentioning they were called into question
  • Standards are lower for anything surprising or interesting

Do we really believe that Youtubers and TikTokers, competing for nothing but attention, are actually more reliable than this? I hate to beat a dead horse, but papers can get retracted, colleges can investigate you, and you can sink a career in academia. Maybe not often, but the odds are certainly better than those of even a mainstream journalist losing a defamation case. Science is set up to self-police, maybe not as well as it should be, but there are mechanisms. True crime documentaries and podcasts are set up to entertain, and there are no mechanisms to self correct outside of a person getting aggravated enough to file a lawsuit against you. So it is very likely that:

  • Some portion of what you believe you know about popular cases is flat out false
  • The most popular cases will have more incorrect facts floating around than the “boring” cases
  • Even when things are proven to be incorrect, they will not stop circulating as fact
  • Standards are lower for anything surprising or interesting

So what do we do?

Well, it’s actually not straightforward. Because of the apparatus around science, it’s been straightforward to propose changes. Change hasn’t always come fast, but it has been progressing. True crime has no such oversight, so any change will be a challenge. However, I think the things I used to bring up in my Intro to Internet Science course still all apply here. I broke down the things to watch for into 4 categories: Presentation: How They Reel You In, Pictures: Trying to Distract You, Proof: Using Numbers to Deceive, and People: Our Own Worst Enemy. I think those all still apply here, with just a few tweaks.

  1. Presentation: How They Reel You In A high production value documentary is not the same as an honest documentary, and a lengthy series on a topic does not mean people didn’t leave anything out. Be skeptical of things, no matter how glossy or voluminous.
  2. Pictures: Trying to Distract You In the stats and data world, graphs are often used to catch people’s eye and give them the immediate visual impression something is happening before they’ve had a chance to read anything. In true crime, this is often what the victims or the perpetrator look like, immediately playing on tropes of who we think commits crimes or which victims get our sympathy. Be skeptical of anything that focuses on the good looking, wealthy or college educated to the exclusion of others. Additionally, watch for any attempt to immediately invoke another case or movie in the current case, which will prime you to skip actual facts in favor of an “I know this type of person, they do X”. When our local case hit national media, one of the first things one of the main people did was to start citing a popular movie filmed in the area almost 20 years ago, based somewhat on events that had occurred 20-30 years prior to that. The attempt to evoke specific imagery was clear.
  3. Proof: Using Numbers to Deceive While numbers aren’t always at play in the true crime world, evidence certainly gets kicked around pretty often. But just like numbers, out of context evidence is often worse than useless and extremely misleading.
  4. People: Our Own Worst Enemy We bring our biases to every case, and some narratives will be more palatable to us than others. Be careful with people who bring cases in to make a “bigger point” or anything that seems a little too outrageous or focuses on extremely unusual types of crime. It’s also good to look back on early reporting to see if what got you in to the case held up, and to actually take it into account if it didn’t.

To all of this, I’d add two more points. The first is that a surprising number of people tell me that true crime is fine playing fast and loose with the facts as long as it challenges the police, because there the state has more power. This is of course how our whole justice system is set up, but I think it falls rather flat. In science we are taught that there are both type 1 errors (false positives) and type 2 errors (false negatives), and that both carry consequences. This is also true in the criminal justice system. Blackstone’s principle says that it’s better that ten guilty men go free than one innocent man hang, and that is what we build our system around. But this doesn’t mean there are no consequences to a guilty person going free. The obvious first issue is that they offend again, and that we will then also be upset that nobody stopped them. But this is a natural consequence of “it’s never bad to let the accused go”, and we can’t have it both ways. A recent Twitter thread highlighted this from a victim’s perspective, as she recounted both the emotional toll of testifying against a stranger who assaulted her and then watching him get let go repeatedly, just to watch him continue to assault other women. The other issue of course is that if you have a justice system that never finds anyone guilty, people take things into their own hands. It’s commonly noted that the mafia initially gained power within immigrant Italian communities because the police wouldn’t investigate crimes against them, and the same is true of newer gangs. Likewise, the Old Testament is riddled with references to the sin of denying justice. Even if you’re not religious, it’s good to flag that unpunished crimes have been considered a socially destabilizing force for thousands of years. Playing fast and loose with the truth about government actions is not a victimless crime just because the government has power, as people typically find when their particular group falls out of favor in the court of public opinion.

And finally, I want to give a mini rant about why this topic bothers me so much. Watching a case up close and personal like this, I was stunned and appalled at how many people seemed to completely miss that this case was, for many people, one of the darkest moments of their life even before the internet was involved. Watching people turn that into their own personal whodunit/reality TV show was horrifying. People talked about the various people involved like they were merely characters in a movie, like you could say horrifying things about them with no consequences. I didn’t know these people, but I do see many of them frequently, and the pain on their faces was visible. None of this was fun. None of this was asked for. We’re in a time when we have blockbuster documentaries about how exploitative reality TV was, so it’s bizarre to me that so many people are excited to tune in to stories about people who never volunteered for this. While errors in scientific publishing can erroneously impact how we view the world, errors in true crime reporting can irreparably ruin lives. The first one may sound worse, unless you’re the target of the second. Power posing failing to replicate hurt a few self-help gurus’ talks; thousands of people falsely accusing someone of murder is something you probably never recover from. Consume media that reminds you that everyone involved, whether accused or victim, is a human.

Thanks for reading folks.

The True Crime Replication Crisis Part 7: Random Other Issues

Ok folks, so we’re nearing the end of our Wikipedia list of issues, so I’m at the point where I don’t know what to call this one. We have a bunch of random issues I’ll run through in order. Ready? Let’s go!

Context Sensitivity

In scientific study, context sensitivity refers to the idea that the same study performed under two different sets of circumstances might yield different results in ways people didn’t expect. This seems somewhat obvious when you say it directly, but often isn’t actually on people’s minds when they are reading a study. I have actually covered this a LOT on my blog over the years, as often people will make huge claims about how men or women (in general) view marriage, and you’ll find out the whole study was done on a group of 18 year old psychology students who are almost certainly not married or getting married any time soon. Zooming out, there’s a big criticism that most psychological research is done on “WEIRD” people, meaning Western, Educated, Industrialized, Rich and Democratic. What we consider settled science around human behavior may not be so settled if you include people from wildly different countries and contexts.

So how does this apply to true crime? Well, just like the first thing I do when I look up a paper is go to the methods section to understand the context in which the data was collected, I think the most important thing in a true crime story is to understand the big picture of where and how things happened. As I mentioned previously, true crime cases are often really unusual cases, so it’s important to flag that any abnormalities will be heightened substantially. A few questions: how much crime is in the area in general? Were there any unusual events affecting people’s behavior? True crime often goes over this stuff, but I’ve noticed some cases breeze through contextualizing things, or don’t acknowledge that unusual circumstances might change people’s behavior.

The other odd context thing is that a lot of people seem to think that because a case became well known later, the initial investigators should have been thinking from the get go about how things would look on Dateline. Unfortunately, most investigators/witnesses/defendants don’t have the luxury of knowing in the first 24 hours that people will be reviewing their actions for decades to come. If the case is OJ Simpson? Well yes, you should be prepared for that. If the case is JonBenét Ramsey? You should give them some grace for not predicting the firestorm. Context matters.

Bayesian Explanation

This is similar to some of the statistical concerns I mentioned last week, but basically, if you have a “surprising” result and a low powered study, Bayes’ theorem suggests you will have a high failure-to-replicate rate. Bayesian statistics can be powerful for thinking this through, because they force you to consider how likely you thought something was before you ran your study, which can help you put your subsequent results in context.

So what’s the true crime equivalent? Well, I think it’s actually a good reminder to put all the evidence in context. Here’s an example: imagine a police department (or podcaster) believes a suspect is guilty mainly because they failed a polygraph. The polygraph has a low ability to detect real guilt (low power) and many innocent people fail it (high false-positive rate), and the prior likelihood that this particular person committed the crime is low. Even though the polygraph result says “guilty,” it does not mean there is a 95% chance they did it. Just like a weak psychological study, a “positive” polygraph doesn’t reliably tell you whether the hypothesis is true or whether the result will replicate.
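To make that concrete, here’s the Bayes’ theorem arithmetic. Every number here is an illustrative assumption, not a real polygraph statistic:

```python
# Hypothetical polygraph numbers: a modest prior, decent sensitivity,
# and a high false-positive rate. All illustrative assumptions.
prior = 0.05        # P(guilty) before the polygraph
sensitivity = 0.80  # P(fails polygraph | guilty)
false_pos = 0.30    # P(fails polygraph | innocent)

# Bayes' theorem: P(guilty | failed) = P(failed | guilty) * P(guilty) / P(failed)
p_failed = sensitivity * prior + false_pos * (1 - prior)
posterior = sensitivity * prior / p_failed

print(f"P(guilty | failed polygraph) = {posterior:.2f}")  # ~0.12
# A "failed" polygraph moved us from 5% to ~12%: far from proof of guilt.
```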

This can be reapplied to all sorts of evidence, and should be, particularly when you have one piece of evidence that flies in the face of the rest of them. We even have a legal standard for this: circumstantial evidence, which can only be let in under certain circumstances. However in true crime reporting, a lot of circumstantial evidence is treated as extremely weighty, regardless of how discordant it is with everything else. You have to be honest about the prior probability or all your subsequent calculations are going to be skewed.

The Problem With Null Hypothesis Testing

This is a somewhat interesting theory, based on the idea that null hypothesis testing may not be appropriate for every field. For example, if you are testing whether or not a new drug helps cure cancer, you want to know if it has an effect or not. Pretty simple. But with a field like social psychology, human behavior may be too nuanced to have a true yes or no question. Running statistical tests that suggest there is a clear yes/no might end up with unreliable results because the whole set up was inappropriate for the question asked.

In true crime, this reminds me of people using legal standards as though they are moral standards or everyday standards. For example, a person accused of rape may not be convicted under a reasonable doubt standard, but that doesn’t mean you’d be ok with them dating your daughter/sister/friend. In murder cases, even when the police get things wrong, they often had a good reason to start believing people were guilty. Drug or alcohol use can make people look suspicious, lying up front to the police can make you look suspicious, prior similar convictions can make you look suspicious, etc etc. I’ve seen a strong tendency for people to decide that whoever they favor is blameless (null hypothesis = absolutely nothing wrong), but as we covered last week, a lot of people mixed up in legal trouble have something working against them.

Base Rate Fallacy

I’ve written about the base rate fallacy before, and it can be a tricky thing to overcome. In short, the base rate fallacy happens when something is extremely uncommon and you use an imperfect method to try to find it. For example, if you test a thousand random people in the US for HIV, we know that 3-4 might have it. If you are using a test that is 99% accurate but has a 1% false positive rate, that actually means more people (10) will get a false positive result than a true positive result. When the frequency of something is low, false positives become a much bigger problem. In publishing, the theory is that previously unnoticed-but-real phenomena are getting rare, so surprising findings are increasingly likely to be false positives.
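Here’s that arithmetic spelled out, using 4 per 1,000 as the prevalence:

```python
# The HIV testing example from the paragraph above.
population = 1_000
prevalence = 0.004       # ~4 per 1,000
sensitivity = 0.99       # 99% of true cases detected
false_pos_rate = 0.01    # 1% of healthy people test positive anyway

infected = population * prevalence
true_positives = infected * sensitivity
false_positives = (population - infected) * false_pos_rate

print(f"true positives:  {true_positives:.1f}")   # ~4.0
print(f"false positives: {false_positives:.1f}")  # ~10.0
# A positive result here is more likely to be wrong than right.
```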

So how does this apply to true crime? Well, it’s a little hard to make a clear comparison, because so many crimes have unusual things happening by default. To take OJ Simpson as an example, it’s unusual for a celebrity of his stature to be accused of a crime. However, it’s also pretty unusual for a celebrity’s ex-wife to end up dead the way his did. Our base rate doesn’t totally work because we actually know something weird has happened. This is where we have to get back to judging people by evidence, not statistics.

However, in the broader scheme of true crime content, I think it’s good to note that the demand for new cases currently exceeds the supply. As we’ve continued to cover, people want attractive, articulate defendants with “interesting” cases, and we just don’t have that many of them. This creates an environment where people are heavily incentivized to make their cases “interesting” enough for true crime podcasters to pick up on. This is challenging because overall the murder rate in the US is down substantially from the 80s and 90s, so we have fewer current cases to draw from.

Alright, that’s all I have for this week. I’ll be looking to wrap up next week with a few lessons learned and thoughts. Thanks all!

To go to part 8, click here.

The True Crime Replication Crisis Part 6: Statistical Errors

Welcome back folks! This week we’re still talking about true crime, and I’m going to cover some statistical errors and how they relate to the cognitive errors we see being made when we discuss true crime stories. Before I get to that though, I want to touch on a point made in the comments last week. David brought up that a good example of a fraudulent case that gained traction was the Duke Lacrosse rape accusation, which was ultimately proven false. Many people continued to cling to it long after the evidence turned because they believed it was “an important conversation”. This sounds silly, but in the phenomenal “Toxoplasma of Rage” essay by Scott Alexander over at Slate Star Codex, he points out the following:

The University of Virginia rape case profiled in Rolling Stone has fallen apart. In doing so, it joins a long and distinguished line of highly-publicized rape cases that have fallen apart. Studies sometimes claim that only 2 to 8 percent of rape allegations are false. Yet the rate for allegations that go ultra-viral in the media must be an order of magnitude higher than this. As the old saying goes, once is happenstance, twice is coincidence, three times is enemy action.

The enigma is complicated by the observation that it’s usually feminist activists who are most instrumental in taking these stories viral. It’s not some conspiracy of pro-rape journalists choosing the most dubious accusations in order to discredit public trust. It’s people specifically selecting these incidents as flagship cases for their campaign that rape victims need to be believed and trusted. So why are the most publicized cases so much more likely to be false than the almost-always-true average case?

Scott goes on to hypothesize why this is: basically, we are attracted to controversial stories because they allow us to signal our beliefs about different topics. I tend to believe he’s on to something, but for purposes of this series I want to emphasize his point that cases that get talked about are often more likely to contain extreme deception than regular everyday cases. We have no reason to believe this is limited to rape cases, and every reason to believe that stories that grab headlines are uniquely unreliable.

Alright, with that out of the way, let’s move on to some stats issues!

Low Statistical Power

One issue that has likely contributed to the replication crisis is that many studies lack statistical power, which means a study doesn’t have enough data to reliably detect real effects. This makes the findings unstable, so when you repeat the study, the result might not appear again. Adequate statistical power depends on a few things, including sample size and the size of the effect you’re looking to detect. For example, if you want to understand height differences between adult men and women, you might need a decent sized group before you can accurately say if the difference is 3 inches or 5 inches. If you’re looking at the height differences between adults and 5 year olds, however, you’re going to need a much smaller group to establish there’s a huge difference. The smaller the effect size, the more people you need to reliably see what’s happening.
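To put rough numbers on that, here’s a standard two-sample power calculation. The effect sizes are illustrative guesses, not from any particular study (adult male/female height is a famously large effect, somewhere around d = 2):

```python
# Required sample size per group for 80% power at alpha = 0.05,
# for small, medium, and very large (height-difference-sized) effects.
from statsmodels.stats.power import TTestIndPower

power_calc = TTestIndPower()
for d in (0.2, 0.5, 2.0):
    n = power_calc.solve_power(effect_size=d, alpha=0.05, power=0.8)
    print(f"effect size d={d}: ~{n:.0f} people per group")

# Roughly 394 per group for d=0.2, 64 for d=0.5, and only 5-6 for d=2.0.
# The smaller the effect, the more data you need to see it reliably.
```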

So how does this apply to true crime? Well, as I pointed out in part 2, most popular crime stories are highly unusual. While they are often things we deeply fear, they are almost always things we have no experience with. Given this lack of data, we have almost no basis for deciding what’s normal/abnormal, and yet we do it anyway! It’s a running joke on social media that every time a new subject comes up, people immediately switch from being infectious disease experts to nuclear war experts to trade agreement experts, etc. True crime is an extension of that, with people who have never experienced any part of the justice system loudly opining about what should or shouldn’t have been done. In the rush to get press coverage, I also noticed a lot of experts who did have experience in related fields would often comment on cases without actually having read all the details. I also consider this a lack of statistical power: all the general knowledge in the world doesn’t help if you don’t actually know the specifics of the case you’re talking about.

Positive Effect Size Bias

Otherwise known as the decline effect, this is the phenomenon where studies initially find a large effect size that keeps getting smaller with each subsequent study. A classic example is medications, which often appear to work extremely well when they’re first rolled out, only to be much less impressive when studied after a few years.

I have seen this in a lot of true crime cases, where initially you are told “oh hey, you have to look at this absolutely CRAZY case they cover in this documentary”. If you look at the other side though, you gradually discover most of the things that hooked your attention are a lot more nuanced than they appeared. In our local case, there was one article that sparked all the interest, and several years later someone went back and fact checked it. They estimated about 75% of it was incorrect and often laughably inaccurate. Bizarrely, people who got interested in the case didn’t seem to care that the thing that hooked them was so unreliable; they had simply moved on to new claims. Regardless of what you think happened in a given case, it’s good to note when claims don’t hold up, and not simply move on to new claims.

Problems of Meta-Analysis

One safeguard against the replication crisis was supposed to be meta-analyses, which take a lot of studies on the same topic and analyze them together. One issue with this is that a single bad study can “infect” the whole meta-analysis, so even lumping a whole bunch of studies together doesn’t help. If you get one 6’2″ basketball player in your female height sample, it’s going to take a while for that average to come back to normal. Another issue is that if the hypothesis is wrong, you are not going to get studies with a strong effect in the opposite direction to balance things out; you are going to get studies that cluster around zero. Again, this means it will take a LOT of studies to show the real effect size.
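Here’s a toy version of the basketball player problem, with made-up round numbers for the heights:

```python
# One bad data point "infecting" an average: a single 188 cm (6'2")
# basketball player in a sample of women whose true mean is ~163 cm.
# Heights are illustrative round numbers.
true_mean_cm = 163
outlier_cm = 188

for n in (10, 50, 500):
    sample_sum = true_mean_cm * (n - 1) + outlier_cm
    print(f"n={n}: mean = {sample_sum / n:.2f} cm")

# n=10: 165.50 cm, n=50: 163.50 cm, n=500: 163.05 cm.
# The distortion shrinks as the sample grows, but it takes a LOT of
# clean data to wash out one bad point.
```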

So how does this work in true crime? Well, I actually think meta-analysis thinking is the worst thing that can happen to a true crime case. Our justice system is supposed to be based on individual facts, not on group dynamics. This gets argued a lot with racial profiling, but perhaps my favorite example is family criminality. Crime is highly heritable, and yet our justice system doesn’t let your family history into court, and for good reason. The foundation of our justice system is that you are supposed to be judged as an individual based on evidence, not on “well, this would make sense”. True crime, on the other hand, is rife with this type of commentary. The police are always like this, people in small towns are like this, white rich kids are like this, etc etc etc. I actually am not very against stereotypes as a first step, but stereotypes are not evidence. If the evidence starts to contradict your stereotype, you may want to consider that someone might have been attempting to evoke exactly that stereotype to get you to override your reason.

P-hacking

I covered p-hacking back in part 4, where we talked about looking through tons of data for “surprising” connections. In both research and true crime, the more data you take in, the more likely you are to find connections that may or may not be meaningful. I did want to emphasize one more part of this though, something I’ll call “narrative hacking”. If p-hacking is overinterpreting random connections, then narrative hacking is selectively including or emphasizing details, interpretations, or coincidences until a desired emotional or moral conclusion “feels significant”. As I said to someone when talking about my local case, “some of what they complain about is real, some of it is just normal stuff said in a scary voice”. Selective interpretation of events is a normal human trait, and trying to make mundane things sound significant is a key trait of anyone trying to hook you on a story. Suddenly “weirdly, he never left the house all day” is said in just the same tone as “oddly, he only left the house once that day” and “bizarrely, he left the house multiple times that day”. It’s good to be alert for when a narrator is emphasizing details that really aren’t that interesting.
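As a quick illustration of the p-hacking half of this, here’s a toy simulation: test a pile of pure noise variables against a pure noise outcome, and something will usually come up “significant”:

```python
# P-hacking in miniature: 20 noise variables, none actually related to
# the outcome, tested for correlation at the usual p < 0.05 threshold.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
outcome = rng.normal(size=100)

significant = 0
for _ in range(20):
    noise_var = rng.normal(size=100)  # unrelated to outcome by construction
    _, p = pearsonr(noise_var, outcome)
    if p < 0.05:
        significant += 1

print(f"{significant} of 20 noise variables hit p < 0.05")
# You expect about 1 in 20 on average; run enough comparisons and
# something will always look interesting.
```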

Statistical Heterogeneity

Statistical heterogeneity means that different studies of “the same” effect actually vary in methods, samples, measures, or contexts. This means that when you try to replicate a study, you can run into the issue of changing something that actually was important to the original. For example, you might find an effect in a study done on all men that disappears if you add women to the sample, or a study on college students that doesn’t replicate with senior citizens. Sometimes slight wording changes in questions can radically alter answers, etc etc. This can actually be an important issue to note, because sometimes it shows a previously hidden factor was influencing the original results.
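Here’s a toy sketch of that hidden-factor problem, with made-up effect sizes: an effect that exists only in men gets diluted when the replication samples both sexes:

```python
# A hidden moderator in miniature: the "treatment" shifts men's scores
# but not women's. Effect sizes are made-up illustrative numbers.
import numpy as np

rng = np.random.default_rng(1)
effect_in_men, effect_in_women = 0.8, 0.0

# Original study: all-male sample.
men_treated = rng.normal(effect_in_men, 1, size=200)
men_control = rng.normal(0, 1, size=200)
print(f"all-male study: diff ~ {men_treated.mean() - men_control.mean():.2f}")

# Replication: half men, half women. Same procedure, different context.
mixed_treated = np.concatenate([rng.normal(effect_in_men, 1, size=100),
                                rng.normal(effect_in_women, 1, size=100)])
mixed_control = rng.normal(0, 1, size=200)
print(f"mixed-sex study: diff ~ {mixed_treated.mean() - mixed_control.mean():.2f}")

# The mixed sample shows roughly half the original effect: not because
# the original was wrong, but because the context changed.
```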

In true crime, similar inputs do not always yield similar outputs. Two missing child cases can have very different reactions from parents, not because one is lying and the other isn’t, but because there’s a huge range of possible reactions to a horrible situation. This is somewhat akin to what I said above about overgeneralizations. There’s a huge range of crimes, contexts, and individuals involved, and even in a perfect system that would produce a huge range of human behavior. Trying to “follow” unusual tragic cases may lead to false confidence in your conclusions.

Alright, I think that’s all I have for today. Tune in next week for what I’m hoping might be my last post before the wrap up, depending on how long-winded I get. It’ll be fun!

To go straight to part 7, click here.