Carlisle Method Take 3: Carlisle Harder

While looking for something else this week I found my old posts on the Carlisle method (2017) and the one year later follow up (2018). Seemed time for yet another update, so here we are.

For those of you without a photographic memory for random data controversies from 9 years ago, the Carlisle method was a statistical method developed by researcher John Carlisle, who was attempting to find a way to identify potentially fraudulent papers more quickly than undertaking laborious investigations. His idea was to look at the baseline data for control groups and intervention groups and to try to detect data anomalies there, on the assumption that authors would have focused much more on their results than on their baseline data, so anomalies would be easier to spot. He named a bunch of studies that appeared to have skewed baseline data, and others took it from there.

Interestingly, while some studies did end up having to make corrections, it also became clear the method was not always detecting fraud. In a few cases some of the statistics were actually just mislabeled. In the most notable case, it turned out the study authors had not been clear on how their samples were selected, and they had to update their results without some of their original data.

So what’s happened since then? Well, in 2021 Carlisle decided to use his prior method and his standing as a journal editor to take things up a notch. While his initial method was a quick statistical screen, he developed a broader screening tool to flag papers that might have a problem. This included “previous false data from one or more authors or the research institute; inconsistencies in registered protocols; content copied from published papers, including tables, figures and text; unusually dissimilar or unusually similar mean (SD) values for baseline variables; or incredible results”. If a paper was flagged as having these risk factors, he would ask for a spreadsheet with the patient level data in it so he could look at it more closely to ensure it was ok.
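
To make the “unusually similar or unusually dissimilar baseline values” idea a bit more concrete, here is a minimal sketch of the general logic, not Carlisle's actual procedure, with invented variable names and numbers: take the published mean, SD and group size for each baseline variable, compute a p-value for the between-group difference, and then ask whether the collection of p-values looks like what honest random allocation would produce.

```python
# Rough sketch of a Carlisle-style baseline check from published summary stats.
# The baseline variables and all numbers below are invented for illustration.
from scipy import stats

baseline_vars = {
    # name: (mean_ctrl, sd_ctrl, n_ctrl, mean_trt, sd_trt, n_trt)
    "age":    (54.2, 8.1, 50, 54.3, 8.0, 50),
    "weight": (71.5, 9.8, 50, 71.4, 9.9, 50),
    "height": (168.2, 6.4, 50, 168.1, 6.5, 50),
}

p_values = []
for name, (m1, s1, n1, m2, s2, n2) in baseline_vars.items():
    # p-value for the between-group difference, computed from summary stats alone
    _, p = stats.ttest_ind_from_stats(m1, s1, n1, m2, s2, n2)
    p_values.append(p)
    print(f"{name}: p = {p:.3f}")

# Under honest random allocation these p-values should look roughly uniform on
# (0, 1). A pile-up near 1 (baselines "too similar") or near 0 ("too different")
# is the kind of anomaly that earns a closer look at the patient level data.
ks_stat, ks_p = stats.kstest(p_values, "uniform")
print(f"Uniformity check: KS p = {ks_p:.3f}")
```

With only three variables a uniformity check like this is obviously underpowered; the real work aggregates over many variables and many trials, but the basic logic is the same.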

Unsurprisingly, he found problems. But what happened next was even worse.

When Carlisle followed up with the universities these papers were coming from, he discovered they were not overly anxious to investigate the particular concerns he was raising, which concerned him even more. So starting in 2019, Carlisle decided the journal would ask for patient level data from everyone from the countries that submitted the most papers: Egypt, China, India, Iran, Japan, South Korea, and Turkey. The results were not encouraging:

Basically, when Carlisle screened for high risk papers, he found about 10 “false” papers in 2 years. When he screened everyone, he found 60+ papers in the next 2 years. Yikes. Just to clarify what he means by “false” or “zombie”, here it is in his words:

Data I categorised as false included: the duplication of figures, tables and other data from published work; the duplication of data in the rows and columns of spreadsheets; impossible values; and incorrect calculations. I have chosen the word ‘zombie’ to indicate trials where false data were sufficient that I think the trial would have been retracted had the flaws been discovered after publication. The varied reasons for declaring data as false precluded a single threshold for declaring the falsification sufficient to deserve the name ‘zombie’, although I have explicitly stated my reasoning for each trial in the online Supporting Information (Appendix S1).

So overall 14% of papers submitted had substantial flaws and 8% were retraction worthy, but that rate went way up after they started requesting data from everyone. Unfortunately Carlisle ended by mentioning a few fairly discouraging things:

  1. He has no reason to believe his journal was attracting particularly bad papers. One might actually assume the opposite, given that he had very publicly been out fighting fraud for several years before this.
  2. It took him a really long time to look through the spreadsheets, and sometimes he only caught the fake data on the 2nd, 3rd or 4th look.
  3. Fraud can actually happen at any level of research, which makes it scary. In one case he mentions, the researchers discovered it was a med student they were working with who had made up the data. We think of scientific fraud as coming from the big name getting the credit, but you can see how it could easily be an overwhelmed lower level person trying to deliver results to that big name who ends up providing fake data.
  4. Nothing stops people from submitting these papers to other journals that don’t have this level of scrutiny.

In the end Carlisle concludes that these types of data errors or fraud are so common that developing screening tools for them should be a primary goal of journals, lest they risk up to a quarter of their studies being retraction worthy. Not great, but thank God for people like Carlisle.

New Substack Who Dis

So after my therapy session a few weeks ago about why I liked writing here so much and how hard it was to translate the norms I was used to into other places, I kept thinking about how to do what I was trying more effectively. After mulling it over for a few days more, I decided it might be worth trying a different writing project that I controlled, written more in my own voice, but aimed at a different audience.

So I now have a bright and shiny new Substack to fool around with: Exhibit Asterisk (or Exhibit A*). The focus is going to be my ideas about how statistical thinking can apply to the world of true crime, so basically a more topic specific version of what I do here, with the assumption that the audience will be a bit less familiar with stats concepts. For now I plan to keep posting in both places, though there may be some cross posting. I am not doing paid subscriptions, because I do not need that kind of pressure in my life, and this is entirely experimental. For all that I rant about this topic, it may turn out I have very little left to say. We’ll see! Anyway, if you’d like to join me over there, the link to the first post is here.

Hope to see you there!

Tracing Claims in the Age of the Timeline or Whose Line is it Anyway

There was an irritating (to me) discussion on Twitter this week (shocking) that got me thinking about an interesting problem I’m seeing more of in the age of ubiquitous social media: the problem of who said what. The issue started when Bridget Phetasy, a podcaster in her 40s, Tweeted out the following “Thanks to big pharma trying to sell us GLP-1s we are now allowed to admit that when you lose weight it takes stress off your joints and improves your health—a thing we were told was not true for a decade.”

Now as someone who is a bit younger than Ms Phetasy and has lived my life at a wider variety of weights than her photos suggest she has, I was rather surprised to hear this. I have never gone to the doctor’s office and not had a doctor mention where my weight was in relation to where the healthy range was. I have never had joint issues, but my friends who have all inform me that you are very much told about the impact weight has on your joints. Given that I work at a hospital, I decided to mention this to a few doctors/NPs I see daily, and was looked at like I had 3 heads. Who, they inquired, was ever telling anyone anywhere that weight loss wouldn’t take stress off your joints or improve your health? Obviously if someone was underweight you wouldn’t mention it, but that is not the case for most people. Weight loss is bog standard advice from every major medical organization for nearly every condition. Everyone knows this.

And yet a sizeable number of people on Twitter seemed extremely convinced that this had not actually been the case. Interestingly, the case they made leaned heavily on media articles. The first thing Bridget posted to defend herself was a screenshot from Cosmo Magazine (UK version) that used overweight fitness influencers to show that fat could be healthy. Ben Ryan pointed out that there’s a very popular podcast that criticizes the idea of weight loss. Others posted op-eds written about the hopelessness of weight loss. People confirmed there was a very active “Health at Every Size” movement, as captured by the book of the same name. Everyone saw this.

Finally, I saw a tweet that seemed to shed some light on what was really going on. The problem here was one of perspective: there was a period when the media ecosystem changed rapidly to include a push for health-at-every-size type advocacy and a rapid expansion of plus size clothing in retailers and advertising. This meant that those (like Bridget Phetasy) who worked in media and were themselves thin saw those things as the primary conversations around obesity. For other people (like, say, myself) who mostly had these conversations with either my personal physician or at work/in a research context, the entire idea that one book from 2010 was the “real” conversation seems insane. Why would someone be listening to a non-MD political podcaster about how to resolve their joint pain? If the people whose job it is to deal with such things stayed on the right path, do the other conversations really matter? And yet, maybe they do. People’s expectations are set by culture all the time. Maybe there’s something to this.

I don’t know that I’ll resolve any of that in this post, but I do want to highlight the general problem. I am increasingly running into discussions with people where we spend a substantial part of the conversation trying to sort out whether the thing we are referring to is actually happening. It was over 10 years ago that Parker Molloy first noted that she made one slightly dismissive Tweet about a lipstick color she found weird that somehow got parlayed into several articles in major media outlets about her “major freakout”, and the problem has only gotten worse. We now have random tweets from nobodies being treated as though they are serious platforms of major political parties. Conversely, with all the various online noise going on, I have also found at times that people can now miss when major political office holders say actual terrible things because they assume it was internet snark.

So I think when you hear a “they said” type claim, it’s good to sort out the following things:

  1. Who actually said it: someone with power? Someone with a large audience? Or a random person on Twitter? Include the claims of those on the other side only if you would find it fair were the situation reversed.
  2. Who matters for this claim? As I outlined above, there’s no one right answer for this, but it can help nuance the discussion.
  3. What was the actual wording of the original claim? A lot of claims mutate somewhat between initial takes and responses.
  4. How many people were making this claim and how many versions were there? Really broad arguments often have stronger and weaker versions, and it helps to zoom in on which version you’re addressing.
  5. Who were people talking to/aiming at when they did something? In the obesity discussion above, a few people pointed out that clothing companies were getting lumped in as “advocates”. This superficially seems fair, but realistically as obesity rates grew clothing companies were going to have to expand their size offerings if they wanted to stay in business. And why would you be lecturing your customers on weight loss while trying to sell them something? Don’t confuse business practices with medical advice.

I should also add that none of these new problems did away with the age old ones: knowing one person who behaves ridiculously and over generalizing from them to the rest of the population, or believing that your own social group represents the general population better than it does. So we’ve just added new issues on top of the ones we already knew about.

Ultimately I think the best thing any of us can do is to remember that everyone is awash in commentary all the time, and we can all probably prove any point we want about what “they are saying” just by poking around online for a few minutes. It’s just the world we live in now: all noise, weak signal. It may be battling against the current, but by double checking where we’re getting our impressions of what others are saying, we can ground our sense of the conversation in something a bit more solid than vibes.

Thinking in Graph Paper, Writing in Prose

A little over a month ago now, I got into a discussion about doing another post for the True Crime Times, this time about adapting some old school scientific reasoning tools, like the Bradford Hill criteria, to true crime stories and evidence assessment for better thinking. Amusingly, they appear to lock posts after a certain period of time so I now can’t go back and see what exactly sparked the discussion, but I liked the idea and wrote up a draft. While I enjoyed the heck out of actually writing the whole thing and it clarified a lot of stuff I had been thinking about, I ultimately wasn’t entirely convinced it worked all that well. First, it got incredibly long. The Bradford Hill criteria are pretty lengthy, and explaining the background took a while, then it took even longer to explain each criterion, then even longer to explain why I thought they applied. All told I think it ended up at around 3,000 words, which on this blog I probably would have split over at least two posts and leavened with some snarky commentary to soften the blow of that many words. Writing more formally, even I felt like it was a slog by the end.

It occurred to me that this is why I’ve always liked having a blog like this, even as blogs have fallen out of fashion: they really are a place to work out some long form ideas without having to feel like you’re chasing subscribers or condensing your thinking into little snippets. It’s how I process stuff. I’ve actually taken a lot of what I’ve written here over the years and polished it up to use elsewhere, and it’s somewhat rare that I’ve been able to publish something in a different outlet without working it out here first. So I realized I needed to come back here and work out a few things before I tried to write anything up.

One of the reasons I like writing here so much is that in a very real way, anyone who sticks around here for any length of time tends to be, on some level, one of my type of people. When I named this site Graph Paper Diaries, I was serious. I tend to think in numbers, and I like drawing lines around things. I count things when I get bored. My first question when I hear a statistic is “hold up, where did that come from”. And most importantly “is that true?”. In other words, I like quantification over feelings, I like definitions, I like numbers, I like sources, and I like to know if I have my facts straight. It was always my goal with this site to de-emphasize debates on particular hot button topics, and instead focus on the underlying data to see if we could at least get agreement there to help inform bigger discussions. It was (and still is) my belief that agreeing on baseline facts and standards of truthfulness and certainty was a way of fostering respectful debate around important topics. I’m never going to get everyone to agree with me on everything, but I can certainly try to help create a world where I enjoy the process of disagreeing with people more.

While I get some drive by comments from people who don’t understand any of this, I think anyone who sticks around here for more than a post or two generally gets the value of at least some of this stuff. You may at times question how well I actually execute any of my goals, but I don’t think most of you question the aim. That’s a fun group to hang out with.

What gets a little tougher is trying to jump into a different subculture and translate all of that. I have fun here because I started with a group of people who were interested in the rather numbers based place called “Graph Paper Diaries”, but how do I translate that to a group of people drawn to the incredibly narrative driven world of true crime?

That’s what got me thinking about Sir Austin Bradford Hill. He was a British epidemiologist who helped prove smoking caused lung cancer and subsequently came up with nine “viewpoints” from which he thought all evidence should be assessed before assuming it proved that one thing caused another. Epidemiology seems like a uniquely good analogy for true crime, since epidemiology is by definition the study of disease under messy, population based conditions. Unlike lab based science where you get to control your experiments, epidemiologists are often just expected to work with what they have, and there are no do-overs if they get things wrong. I think you can see why the analogies to crime investigation jumped out at me. While it would be great to have unlimited time and resources, and for the problem to only hit perfect victims in a more ideal location at a better time of year, in both cases you have to go where the problem is and work with what you have.

Because in both cases, the stakes are actually pretty high. Never figuring out how to stop a disease outbreak has consequences, as does never solving a crime. It’s extremely easy to get annoyed people don’t have better evidence, but we have to accept that in life some problems are just going to have messy evidence. If we don’t accept messy evidence, we’re going to settle for no evidence. And I don’t think any of us want that.

So how do we muddle through this? Well first we obviously gather as much evidence as possible. But after that what do we do with it? As I mentioned last week all the data in the world can’t save us if we don’t have a good question, so what questions should we be asking as we look at the information we have? This is where Bradford Hill comes in. He asked people to take a look at the data they had from 9 different viewpoints to evaluate evidence. I’ve gone over these before in a strict public health context, but I’ve adapted them for true crime stories.

  1. Strength: If this person were innocent, how weird would this evidence be? When Bradford Hill looked at heavy smokers, the likelihood of lung cancer wasn’t just a little bit higher, it was 20-30 times higher. That’s a compelling piece of evidence. Similarly in true crime, some pieces of evidence are more compelling than others. One piece of strong evidence trumps 10 small coincidences. (There’s a small worked example of this logic just after the list.)
  2. Consistency: Does the same story show up when the evidence comes from different places? The smoking/lung cancer connection shows up in lots of different populations in different locations. Similarly, in crime investigations, digital data agreeing with witness testimony agreeing with physical evidence is a pretty strong story.
  3. Specificity: Does this evidence actually point to one person and one version of events? Yeah, I know “they” did it. “They” are responsible for everything. But let’s narrow that down just a bit.
  4. Temporality: Did things happen in this order, based on what people knew at the time (not what we know now)? When you learn all the evidence during a one hour podcast, it can be incredibly hard to remember the events actually unfolded over the course of several months and that people could only react to what they knew at the time. Keeping the actual timeline in mind is important.
  5. Evidence Gradient: As more evidence is added, does the story get clearer or more complicated? When hearing new evidence that contradicts something they already believe, a lot of people start to overcomplicate their theories without even realizing it. “Sure, that evidence looks bad, but maybe it was planted.” Okay, but you just traded one problem for another. You explained away the contrary evidence at the price of now needing to explain how someone planted it. That’s not a clearer theory, that’s just shuffling your problems around.
  6. Plausibility: How much would have to go exactly right for this story to be true? Ocean’s 11 is a fun movie, but rarely in life are things that perfectly timed.
  7. Coherence: Does this explanation fit with the physical evidence, the timeline, and how people usually behave? Much as with plausibility, if you take a step back, does a full picture start to emerge or does it get murkier?
  8. Experiment: Is there any part of this that could be checked or tested instead of argued about? This isn’t the most common situation but can certainly clear some points up pretty quickly if it’s possible.
  9. Analogy: Am I convinced by the facts of this case, or because it reminds me of another one? I used to read advice columns a lot, and I was always interested to see how much people would read into situations based on what were clearly issues from their own personal lives. “I know women like this.” “Men like that will always act like this.” While analogies can be useful in suggesting questions to ask, they can also lead you to make assumptions about people that aren’t true.
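
To put a rough number on that first “how weird would this evidence be?” question, here is a toy likelihood ratio calculation. To be clear, this is my own illustration of the underlying logic, not something from Bradford Hill, and every number in it is invented.

```python
# Toy illustration of "strength": how much should one piece of evidence move us?
# All numbers are invented for the sake of the example.
prior_odds = 1 / 999            # suspect starts as 1 of ~1,000 people who could have done it

# Likelihood ratio: how likely is this evidence if guilty vs. if innocent?
p_evidence_if_guilty = 0.95     # e.g. their car would almost surely show up on that camera
p_evidence_if_innocent = 0.04   # vs. a small chance it shows up there innocently
likelihood_ratio = p_evidence_if_guilty / p_evidence_if_innocent

posterior_odds = prior_odds * likelihood_ratio
posterior_prob = posterior_odds / (1 + posterior_odds)

print(f"Likelihood ratio: {likelihood_ratio:.1f}")            # ~24
print(f"Updated probability of guilt: {posterior_prob:.1%}")  # ~2.3%

# A strong piece of evidence (big likelihood ratio) moves the needle a lot;
# ten coincidences with ratios barely above 1 barely move it at all.
```

Note that even a genuinely strong piece of evidence only moves this hypothetical suspect from roughly 0.1% to roughly 2%, which is part of why the consistency point above, multiple independent lines of evidence agreeing, matters so much.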

So there they are, nine questions to help people think through messy evidence when that’s the only option. This was never supposed to be an explicit checklist that would prevent every error; it was supposed to help you look at things from enough angles that you reduce your chances of missing something or getting hung up on a pet theory as evidence mounts pointing in other directions. Because that’s the key thing with messy evidence: it’s not easy to wade through, and it’s easy to get stuck on one or two pieces and start missing the big picture.

But I suspect you already know that’s a good idea. I think this way of thinking is solid; it’s worked on some of our most important public health problems, after all. I’m still workshopping the delivery.

If you have thoughts on how to introduce a framework like this to a true-crime audience, I’d love to hear them. What would you lead with, what would you lose, or what would make you actually want to keep reading? I’ll keep working on the piece in the next few days, so open to any ideas! I’ll probably publish whatever I come up with here at the very least even if I don’t find another spot for it. This is just my favorite problem to noodle on at the moment.

Data Can’t Save You From a Poorly Formed Question

One of the more interesting things I’ve done at points in my career is to help field data requests from a large database. If you’ve ever had to be the gatekeeper of any type of data like that, you learn rather quickly that you are going to have to ask a lot of questions that people will initially see as nitpicky and obstructionist, and they will be terribly annoyed with you. With any luck, after some gentle prodding, they will eventually realize that their initial question was poorly formed, and that they are actually going to have to get a lot more specific before they can get the data that will help them answer the question they are really after.

For example (conversation entirely fictitious to protect the guilty, who have given me an equally hard time over similar issues):

Researcher: Can you give me all the data you have about women of childbearing age who were transplanted in the last 5-10 years? We’re doing a study.

Beleaguered database owner: Sure. A few questions though….

Researcher, sighing: It’s not hard, just everything you have.

Beleaguered database owner, persisting: Can you clarify the timeframe? Do you want to include all the time during the COVID slowdown?

Researcher: Oh, I guess not actually. We only admitted really sick patients then, let’s just do 5 years back.

Beleaguered database owner: Ok. Did you want women of childbearing age or of childbearing potential? We actually screen women to see if they’ve had a hysterectomy or entered menopause, so we could exclude those women, otherwise we’ll give you everyone under 54. Were you looking for pediatric patients? We can start at age 12 or at those who had their first period.

Researcher: Oh, I guess I didn’t specify. I was looking at the impact of having a menstrual cycle, so we can exclude the women who didn’t.

Beleaguered database owner: Ok, one more thing. Did you want all transplants, including those who got a second transplant? Because those patients will be listed in the database twice.

Researcher: Oh, I forgot about those people. I just want individual patients. Exclude anyone who came back twice.

And so on. This can go on for a really long time, and this is with experienced researchers accessing a huge treasure trove of information.
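
Just to show why the nitpicking matters, here is roughly what that “simple” request looks like once all the follow-up answers get turned into actual filters. This is a hypothetical sketch in pandas with made-up file and column names, not any real database schema:

```python
# Hypothetical sketch: the refined request, expressed as explicit filters.
# The file and column names below are invented for illustration.
import pandas as pd

df = pd.read_csv("transplants.csv", parse_dates=["transplant_date"])

cutoff = pd.Timestamp.today() - pd.DateOffset(years=5)      # "last 5 years", not 5-10
cohort = df[
    (df["transplant_date"] >= cutoff)
    & (df["sex"] == "F")
    & (df["age_at_transplant"].between(12, 54))              # childbearing *potential*, incl. pediatric
    & (~df["post_menopausal"])                               # has a menstrual cycle...
    & (~df["hysterectomy"])                                  # ...per the actual study question
    & (df["transplant_number"] == 1)                         # first transplants only, no repeat rows
]
```

Every line in that filter corresponds to a question the researcher hadn’t initially thought to answer, and a different answer to any of them produces a different dataset.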

I bring this up because I think when we’re trying to figure out “the truth” we often jump to the fact finding portion of our mission before we’ve even properly formulated our question. I was thinking about this earlier this week when the Assistant Village Idiot posted about how we still didn’t know much about the Alex Pretti shooting, and I replied that I felt there were 3 different conversations happening simultaneously:

  1. Were various elected officials justified/truthful/helpful in their statements about the shooting?
  2. Was the shooting legally justified?
  3. Was the shooting morally justified and/or otherwise preventable in the future?

You can quibble with my list or add your own questions, but my point here is much the same one I make to the researchers mentioned above: if you’re not clear on what your question is, you’re going to struggle to figure out which pieces of data are actually relevant to answering it. There actually is a bit of danger in just requesting “everything” and then trying to sort through it later. If you are trying to prove that Tim Walz/Kristi Noem gave a misleading press conference, that is a different set of data than reviewing the legal justifications for use of force by a border patrol agent, which is different still from a big picture review of everything that led up to the incident. All of the data is coming from one big pool and there’s certainly overlap, but in our discussions we tend to hop around a lot. Heck, even in our own minds we tend to jump around, but it can pay off substantially to take a moment to figure out what your actual question is.

We worry a lot these days about “misinformation”, and I certainly stand by that concern. However, I’m also starting to get worried that even when we’re all sharing the right information we’re going to keep arguing more than necessary because we’re not stopping to agree on what we’re even arguing about first. In nearly any public event there’s always going to be multiple relevant questions that need answering, and slight changes in focus can change the relevant data set substantially. My two cents.