Data Can’t Save You From a Poorly Formed Question

One of the more interesting things I’ve done at points in my career is help field data requests from a large database. If you’ve ever been the gatekeeper of data like that, you learn rather quickly that you are going to have to ask a lot of questions that people will initially see as nitpicky and obstructionist, and they will be terribly annoyed with you. With any luck, and after some gentle prodding, they will eventually realize that their initial question was poorly formed, and that they will have to get a lot more specific before they can get the data that will help them answer the question they are really after.

For example (conversation entirely fictitious to protect the guilty, who have given me an equally hard time over similar issues):

Researcher: Can you give me all the data you have about women of childbearing age who were transplanted in the last 5-10 years? We’re doing a study.

Beleaguered database owner: Sure. A few questions though….

Researcher, sighing: It’s not hard, just everything you have.

Beleaguered database owner, persisting: Can you clarify the timeframe? Do you want to include all the time during the COVID slowdown?

Researcher: Oh, I guess not actually. We only admitted really sick patients then, let’s just do 5 years back.

Beleaguered database owner: Ok. Did you want women of childbearing age or of childbearing potential? We actually screen women to see if they’ve had a hysterectomy or entered menopause, so we could exclude those women; otherwise we’ll give you everyone under 54. Were you looking for pediatric patients? We can start at age 12 or with those who have had their first period.

Researcher: Oh, I guess I didn’t specify. I was looking at the impact of having a menstrual cycle, so we can exclude the women who didn’t.

Beleaguered database owner: Ok, one more thing. Did you want all transplants, including those who got a second transplant? Because those patients will be listed in the database twice.

Researcher: Oh, I forgot about those people. I just want individual patients. Exclude anyone who came back twice.

And so on. This can go on for a really long time, and this is with experienced researchers accessing a huge treasure trove of information.
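To make that concrete, here is a minimal sketch of what the request looks like once every one of those answers is written down as an explicit filter. Everything in it is an assumption for illustration: the `Transplant` record, its field names, the `build_cohort` helper, the sample rows, and the specific dates are all made up, and a real registry would look different.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Transplant:
    # Hypothetical fields; not the schema of any real database.
    patient_id: str
    sex: str
    age_at_transplant: int
    transplant_date: date
    has_menstrual_cycle: bool  # False after hysterectomy or menopause, or before a first period


def build_cohort(records, start, end, min_age, max_age):
    """Return the records that satisfy every explicitly stated criterion."""
    # Count transplants per patient across the whole database first, so
    # "came back for a second transplant" is judged before any other filter.
    counts = {}
    for r in records:
        counts[r.patient_id] = counts.get(r.patient_id, 0) + 1

    return [
        r for r in records
        if r.sex == "F"
        and start <= r.transplant_date <= end              # "just do 5 years back"
        and min_age <= r.age_at_transplant <= max_age      # age 12 up to "under 54"
        and r.has_menstrual_cycle                          # potential, not just age
        and counts[r.patient_id] == 1                      # exclude repeat transplants
    ]


# Made-up sample rows, one per way a patient can fall out of the cohort.
records = [
    Transplant("A", "F", 34, date(2022, 5, 1), True),
    Transplant("B", "F", 51, date(2023, 2, 10), False),  # post-menopausal
    Transplant("C", "F", 29, date(2016, 8, 3), True),    # outside the time window
    Transplant("D", "F", 40, date(2021, 3, 9), True),    # first of two transplants
    Transplant("D", "F", 42, date(2023, 6, 20), True),   # second transplant
    Transplant("E", "M", 45, date(2022, 1, 15), False),  # not a woman
]

cohort = build_cohort(
    records,
    start=date(2021, 1, 1),   # assumed window for illustration
    end=date(2025, 12, 31),
    min_age=12,
    max_age=53,
)
print([r.patient_id for r in cohort])
```

Six rows go in and one patient comes out. Each argument and each line of the filter corresponds to a question the researcher initially found annoying, and changing any one of them changes who ends up in the study.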

I bring this up because I think when we’re trying to figure out “the truth” we often jump to the fact-finding portion of our mission before we’ve even properly formulated our question. I was thinking about this earlier this week when the Assistant Village Idiot posted about how we still didn’t know much about the Alex Pretti shooting, and I replied that I felt there were 3 different conversations happening simultaneously:

  1. Were various elected officials justified/truthful/helpful in their statements about the shooting?
  2. Was the shooting legally justified?
  3. Was the shooting morally justified and/or otherwise preventable in the future?

You can quibble with my list or add your own questions, but my point here is much the same as it is for the researchers I mentioned above: if you’re not clear on what your question is, you’re going to struggle to figure out which pieces of data are actually relevant to answering it. There actually is a bit of danger in just requesting “everything” and then trying to sort through it later. If you are trying to prove that Tim Walz/Kristi Noem gave a misleading press conference, that is a different set of data than reviewing the legal justifications for use of force by a Border Patrol agent, which is different still from a big picture review of everything that led up to the incident. All of the data is coming from one big pool and there’s certainly overlap, but in our discussions we tend to hop around a lot. Heck, even in our own minds we tend to jump around a lot, but it can pay off substantially to take a moment to figure out what your actual question is.

We worry a lot these days about “misinformation”, and I certainly stand by that concern. However, I’m also starting to get worried that even when we’re all sharing the right information we’re going to keep arguing more than necessary because we’re not stopping to agree on what we’re even arguing about first. In nearly any public event there are always going to be multiple relevant questions that need answering, and slight changes in focus can change the relevant data set substantially. My two cents.
