What I’m Reading: March 2019

The AVI sent this along, and man do I feel seen.  Source.

This is an utterly bonkers NSFW article about how alarmingly easy it is to fake your credentials if you talk about sex, try this one.  Seriously, is there any other field where you could get this much media attention with this little media scrutiny? Anything more political would have been looked into by opponents, and anything more boring probably wouldn’t have gotten media attention.

I took this creativity test and got 76.32. My breakdown was pretty accurate. My best moments of creativity come when I’m trying to improve the status quo (curiosity – my highest), but I don’t really go much further once the problem is fixed (persistence – my lowest). Applied laziness is kinda my superpower.

 

COMPare and Contrast: Journals’ Response to Errors

Big news in the meta science world last week, when Ben Goldacre (of Bad Science and Bad Pharma fame) released some new studies about ethical standards in scientific publishing. The studies, called “COMPare: a prospective cohort study correcting and monitoring 58 misreported trials in real time” and “COMPare: Qualitative analysis of researchers’ responses to critical correspondence on a cohort of 58 misreported trials”, looked at what happened when study authors didn’t follow the ethical or quality standards set by the journals they published in. The first paper looked at the journals’ response to issues that were pointed out; the second looked at the response of the paper authors themselves. Goldacre and his team found these papers, found the errors, then pointed them out to the journals to watch what happened. Unfortunately, it went less well than you would hope.

Before we get into the results though, I want to give a bit of context to this. Over the past few years, there’s been a lot of debate over how to ethically “call out” bad publication patterns. The Calling BS guys have a whole section about this in their class “The Ethics of Calling Bullshit”, which I wrote about here. To highlight concerns about scientific publishing, people have published fake papers, led replication efforts, and developed statistical tools to try to ferret out bad actors. In all of these cases, concerns have been raised about the ethics of each approach. People have complained about mob mentalities or picking on individual researchers, taking advantage of people’s trust, or using these things to advance their own careers. “Science is already self-correcting,” the complaint goes, “no need to make a bigger deal out of it.”

I have to think Goldacre had this in mind when he designed this study. His approach is fascinating in that it actually shares the blame between journals and authors, and also focuses heavily on the ability of people to respond to criticism. Journals tend to point to their ethical/quality standards when proving that they are concerned about the quality of the studies they publish, but it is often unclear how those standards are actually enforced. Additionally, issues with a journal’s standards or enforcement are a big deal with a widespread impact. Finding a study author who made a mistake or committed fraud is great, but still only impacts the person in question. Finding out a journal has a systemic issue can have ripple effects to hundreds of studies, and a whole field of research. To highlight this fact, Goldacre and his team specifically looked at some of the biggest journals out there: the New England Journal of Medicine (NEJM), Journal of the American Medical Association (JAMA), Annals of Internal Medicine, British Medical Journal (BMJ) and Lancet. No small fish in this pond.

In the first study, the journals and their responses were the focus. Goldacre and his team looked at 67 trials and found 58 had issues. The metrics they were looking for were simple: did the papers report the outcomes listed in their publicly available pre-trial registrations, or explain any changes from their original plan? These are the basic requirements laid out by the CONSORT guidelines (found here), which all the journals said they endorse. Basic findings:

  • Only 40% of their letters were published
  • JAMA and NEJM published NONE of the letters they received
  • Most letters were published online only
  • Letters that were published in the hard copy journals were often delayed by months

Now the more concerning findings were grouped by the researchers into themes:

  • Conflicts with CONSORT standards: Despite saying they endorsed the CONSORT standards, when instances of non-compliance were pointed out to them, the journals said they didn’t really agree with the standard or think it was necessary.
  • Timing of pre-specification/registries in general: Several journals objected that actually the trial pre-registrations were done too early or were too unreliable to go by.
  • Rhetoric: This was my favorite category. This is where Goldacre et al put complaints like “space constraints prevented us from adding a reason why we changed our outcome metric”, with a note that mentioned that they had plenty of space to add new and interesting outcomes. They also got some “we applaud your goal but think you’re going about this poorly”.
  • Journal processes: This one was weird too. Journals clarified that they asked authors to do things by the book, despite Goldacre et al showing that the authors weren’t actually doing those things. Odd defense.
  • Placing responsibility on others: Sometimes the journals claimed it was actually up to the reader to go check the preregistration. Sometimes they said it was the preregistration databases that were wrong, not them. The Lancet didn’t reply at all and just let the authors of the paper in question respond.

The paper goes on to also summarize the criticism they got from journal editors once reporters started asking. Just scroll down to the tables in the paper here to read all the gory details. The summary of the responses for individual journals was pretty interesting too:

  • NEJM: Published no letters, said they never required authors to adhere to CONSORT. Provided journalists with a rebuttal they had not sent to Goldacre et al.
  • JAMA: Published no letters, said they didn’t have enough detail to find the errors. Goldacre points out that JAMA sets a word limit on letters, and that the COMPare letters linked to the full complaints on the COMPare website.
  • Lancet: Published almost every letter, but its editors didn’t reply to anything.
  • BMJ: Published all letters and issued a correction for one study out of 3
  • Annals: Got in a weird fight with the COMPare folks that has its own timeline in the paper

Overall, the results seem to suggest that there is still a lot of work to be done in getting journals to adhere to clear and transparent standards. The authors suggested that something like CONSORT should perhaps have a list of those who “endorse” the standards and those who agree to “enforce” the standards.

They also noted that they were actually quite surprised by the number of responses that they got saying that trial pre-registrations were inaccurate or not useful, because they noted that journals were actually one of the driving forces behind getting those set up. The idea that they were useless/not their problem was a very troubling rewrite of history.

Interestingly, the COMPare folks noted that for all the back and forth, they had a feeling their findings might actually be making a difference. They plan on doing a follow-up study to see if anything’s changed. Something about knowing people are watching does tend to change behavior.

Alright, I’ve gone on a bit on this one, so I’ll wait until next week to review the second paper.

When Lies Tell More Than Truth

A few weeks ago, the Assistant Village Idiot was at my house and noted, with pleasure, that one of his favorite books (Albion’s Seed) was on my bookshelf. Sheepishly, I had to confess that the copy on my shelf was actually one he had lent me “about a year ago”.

He apparently had not remembered this, but remarked that he had bought a new copy as he was unclear what had happened to the previous one. This is a very AVI problem to have, but he also remarked that he was pretty sure the book had been lent closer to two years ago.

Being properly chagrined at this conversation, I promptly did two things:

  1. Began reading the book immediately, which I hadn’t quite gotten to yet
  2. Looked up the email exchange we had about the book, to figure out when exactly he had lent it to me. October 2017 – 16 months, for those keeping track at home.

In other words, my one year was generous, but I’m not quite so bad as two years. Also, if you lend me a book, you may want to put a time limit on it.

Anyway, I was thinking of this today because I got to an interesting part in the book that had to do with the cultural insights that come from noting how people tweak the truth.

For those of you who have never read Albion’s Seed, the basic premise is that different parts of the US were settled by people from different parts of England, and that led to cultural differences that persist until the current day. I’m still in the part about New England, but they note that the Puritans who settled here highly valued those who lived to older ages. This claim is supported in part by census data showing that after a certain age, more people claimed to be older than they actually were. Instead of having a relatively even spread between ages like 69, 70 and 71, they found that more people said they were 70, and fewer said they were 69:

This is the reverse of what we see today, where people tend to say they are a little younger than they are. This trend actually did not remain constant with the Puritans; rather, the age bias only came into play in the late 50s:

This struck me as fascinating, because we so often think of lying on a survey as a bad thing. However, when you have access to comparison data, people lying on surveys can actually be helpful. Since everyone tends to exaggerate in a direction that makes them look better, figuring out what direction people lie in can actually give you good insight into what a culture values.

Things to consider.

Visualizing Effect Sizes

Under the weather this week, but I wanted to post this excellent graph from Twitter, showing what different Cohen’s d effect sizes mean in practice for populations:

I love this because despite the efforts of many many people, every time you see some sort of “group x is different from group y” type assertion, you still see people claiming that this either:

  1. Can’t be true because they know someone in group y who is more like group x or
  2. Is completely true, and every member of group x is superior to every member of group y

Both are mistakes. For a more detailed look, there’s a visualization tool here that shows what these translate into for random superiority. In other words, if you pick one person from each group randomly, what is the chance that the one from the higher group will actually have an outcome above the other person?

  • For d=.2, it’s 55.6%
  • For d=.5, it’s 63.8%
  • For d=.8, it’s 71.4%
  • For d=2, it’s 92.1%

This is good to keep in mind, as Cohen’s d is not an overly intuitive statistic for most people. Visualizations are good to help see quickly what these differences might mean on the population level.
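For anyone who wants to check those percentages, here is a minimal sketch of the calculation. It assumes both groups are normally distributed with the same spread, which I believe is the usual assumption behind this "probability of superiority" number. The function name is my own invention and isn't taken from the visualization tool.

```python
import math


def probability_of_superiority(d: float) -> float:
    """Chance that a random person from the higher group outscores
    a random person from the lower group.

    Assumes two normal distributions with equal variance whose means
    differ by d standard deviations. The difference between two random
    draws is then normal with mean d and standard deviation sqrt(2),
    so the answer is Phi(d / sqrt(2)).
    """
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2))), so Phi(d / sqrt(2)) needs erf(d / 2).
    return 0.5 * (1.0 + math.erf(d / 2.0))


for d in (0.2, 0.5, 0.8, 2.0):
    print(f"d = {d}: {probability_of_superiority(d):.1%}")
```

Under those assumptions this reproduces the four figures in the list above. If the real data is skewed or the groups have different spreads, the numbers will shift.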

Not the most groundbreaking point, but one that seems to bear repeating!

A Twitter Parallax

As long as humans have been around and arguing with each other, there have always been disputes about the interpretation of events. During these disputes, I’d imagine that people wished we had ways of capturing events in real time, believing that this would eliminate disagreements. If everyone could work from a shared set of facts, then of course we would end disputes, right?

While I’m sure that’s what I would have thought if I’d lived 100 years ago, our recent age of ubiquitous cell phone cameras and real time Twitter updates has taught all of us that things are not so simple. Every time we see an example of this, I think of one of my favorite words: parallax. Defined as “the apparent displacement of an observed object due to a change in the position of the observer”, it reminds me that our perception of things sometimes depends not only on the object itself, but also where you’re standing.

If this is true of physical objects, then of course emotional situations up the ante. In the best of circumstances human communication can be prone to difficulties, and differences in perception can complicate things enormously. While the promise of technology is often that it will improve communication, the reality is that it often just creates new opportunities for differences in perception.

I bring this up because I had a really interesting example of this in my personal life recently, when the same Twitter thread led to two really different conclusions.

In the middle of one of the many many Twitter controversies of the past few weeks, I noticed some rather high profile people reacting to a Twitter thread from someone I knew was an acquaintance of my brother. Their reactions were not kind, and she was generally getting kinda dragged. He and I had talked about the issue before I knew she had jumped in, so I texted him the thread as an example of what I considered a Bad Opinion.

Basically while opining on the issue of the day, this woman had tried to make point X, but in the process had (IMHO) completely minimized counterpoint Y to a comical extent. My opinion was shared by others, who were mocking her for it.

When my brother and I talked a few days later, I mentioned the incident, and was surprised to find he disagreed with me. He said he didn’t at all see that she had minimized point Y, and in fact had emphasized point Y with points Y1 and Y2. At this point I got confused….I hadn’t seen either of those points made. My mind spun a bit. My brother and I have been arguing for years, and I knew he wouldn’t make something like that up. I was in the car so I couldn’t check the Twitter thread, but I started to wonder how I had missed the points she had made. Had I been scrolling too fast? Had I been projecting? Had I jumped to conclusions? I admitted to my brother that it was possible I’d missed something, and agreed that if I had I’d misjudged his friend.

Later that night it was still bothering me, so I went back to my text and reread the Twitter thread. I read all 31 Tweets she had sent, and discovered that neither point Y1 nor Y2 was in there. Now I was really confused. Like I said, my brother is one of my oldest sparring partners. We’ve been arguing for decades, and at this point I know he would never fabricate a point like that. I also knew that he would have mentioned it if he’d seen it elsewhere. So what the heck had happened?

As I pondered this, I reflexively hit the “show more replies” button to see if I could see some of the reactions he had mentioned and found….there were 4 more Tweets in the thread. Apparently at some point after she had sent the one she labelled 31/31 and said “thanks for listening”, she had added on a few other points. I’m not sure, but I think that because of the time delay between the initial Tweets and the add-ons, Twitter hid those Tweets from anyone who accessed the thread directly. Since she had indicated it was the end of the thread, no one reading it that way would have known to go looking for other Tweets. When I clarified this with my brother, he mentioned that when I’d sent him the thread he hadn’t actually clicked on it, he’d just gone directly to her Twitter feed. Since Twitter shows things in reverse chronological order, he had read her added-on points first, and then read everything else through that lens. No wonder we’d ended up with different opinions.

I was very struck by this whole thing, as it got me wondering how often this happens in our everyday lives. We believe we’re seeing the same thing due to the promises of technology, but the way it’s presented to us skews our reading. If my brother and I didn’t have years of good faith arguing behind us, I would likely not have been so curious about our different perceptions. If we weren’t both so interested in how information gets presented, we may not have cared or considered how our way of accessing the information had colored our subsequent reading of it. Our belief that we were both seeing the same thing might have actually impeded our communication rather than helped it.

I don’t have a good answer for how to get around this, but it’s good to keep in mind as more disputes are started and perpetuated online. While eliminating some pitfalls, technology does create new ones on a much grander scale. Just because you’re looking at the same thing doesn’t always mean you’re seeing the same thing.

Counting Terrorism and the Pitfalls of Open Source Databases

Terrorism is surging in the US, fueled by right-wing ideologies

Someone posted this rather eye-catching story on Twitter recently, which came from an article back in August from QZ.com. I’ve blogged about how we classify terrorism or other Mass Casualty Incidents over the years, so I decided to click through to the story.

It came with two interesting graphs that I thought warranted a closer look. The first was a chart of all terror incidents (the bars) vs the fatalities in the US:

Now first things first: I always note immediately where the year starts. There’s a good reason this person chose to do 15 years and not 20, because including 9/11 in any breakdown throws the numbers all off. This chart peaks at less than 100 fatalities, and we know 2001 would have had 30 times that number.

Still, I was curious what definition of terrorism was being used, so I went to look at the source data they cited from the Global Terrorism Database. The first thing I noted when I got to the website is that data collection for incidents is open source. Interesting. Cases are added by individual data collectors, then reviewed by those who maintain the site. I immediately wondered exactly how long this had been going on, as it would make sense that more people added more incidents as the internet became more ubiquitous and in years where terrorism hit the news a lot.

Sure enough, on their FAQ page, they actually specifically address this (bolding mine):

Is there a methodological reason for the decline in the data between 1997 and 1998, and the increases since 2008 and 2012?

While efforts have been made to assure the continuity of the data from 1970 to the present, users should keep in mind that the data collection was done as events occurred up to 1997, retrospectively between 1998 and 2007, and again concurrently with the events after 2008. This distinction is important because some media sources have since become unavailable, hampering efforts to collect a complete census of terrorist attacks between 1998 and 2007. Moreover, since moving the ongoing collection of the GTD to the University of Maryland in the Spring of 2012, START staff have made significant improvements to the methodology that is used to compile the database. These changes, which are described both in the GTD codebook and in this START Discussion Point on The Benefits and Drawbacks of Methodological Advancements in Data Collection and Coding: Insights from the Global Terrorism Database (GTD), have improved the comprehensiveness of the database. Thus, users should note that differences in levels of attacks before and after January 1, 1998, before and after April 1, 2008, and before and after January 1, 2012 may be at least partially explained by differences in data collection; and researchers should adjust for these differences when modeling the data.

So the surge in incidents might be real, or it might be that they started collecting things more comprehensively, or a combination of both. This is no small matter, as out of the 366 incidents covered by the table above, 266 (72%) had no fatalities. 231 incidents (63%) had no fatalities AND no injuries. Incidents like that are going to be much harder to find records for unless they’re being captured in real time.

The next graph they featured was this one, where they categorized incidents by perpetrator:

The original database contains a line for “perpetrator group”, which seems to speak loosely to motivation. Overall they had 20 different categories for 2017, and Quartz condensed them into the 4 above. I started to try to replicate what they did, but immediately got confused because the GTD lists 19 of the groups as “Unknown”, so Quartz had to reassign 9 of them to some other group. Here’s what you get just from the original database:

Keep in mind that these categories are open source, so differences in labeling may be due to different reviewers.

Now it’s possible that information got updated in the press but not in the database. It seems plausible that incidents might be added shortly after they occur, then not reviewed later when more facts were settled. For example, the Las Vegas shooter was counted under “anti-government extremists”, but we know that the FBI closed the case 6 months ago stating they never found a motive. In fact, the report concluded that he had a marked disinterest in political and religious beliefs, which explains his lack of manifesto or other explanation for his behavior. While anti-government views had been floated as a motive originally, that never panned out. Also worth noting, the FBI specifically concluded this incident did not meet their definition for terrorism.

Out of curiosity, I decided to take a look at just the groups that had an injury or fatality associated with their actions (29 out of the 65 listed for 2017):

If you want to look at what incident each thing is referring to, the GTD list is here. Glancing quickly, the one incident listed as explicitly right wing was Mitchell Adkins, who walked into a building and stabbed 3 people after asking them their political affiliation. The one anti-Republican one was the attack on the Republican Congressional softball team.

I think overall I like the original database categories better than broad left or right wing categories, which do trend towards oversimplification. Additionally, when using crowd sourced information, you have to be careful to account for any biases in reporting. If the people reporting incidents are more likely to come from certain regions or to pay more attention to certain types of crimes, the database will reflect that.

To illustrate that point, I should note that 1970 is by FAR the worst year for terrorist incidents they have listed. Here’s their graph:

Now I have no idea if 1970 was really the worst year on record or if it got a lot of attention for being the first year they started this or if there’s some quirk in the database here, but that spike seems unlikely. From scanning through quickly, it looks like there are a lot of incidents that happened on the same day. That trend was also present in the current data, and there were a few entries I noted that looked like duplicates but also could have been two things done similarly on the same day.

Overall though, I think comparing 1970 to 2017 shows an odd change in what we call terrorism. Many of the incidents listed in 1970 were done by people who specifically seemed to want to make a point about their group. In 2017, many of the incidents seemed to involve someone who wanted to be famous, and picked their targets based on whoever drew their ire. You can see this by the group names. In 2017 only one named group was responsible for a terrorist attack (the White Rabbit Militia one) whereas in 1970 there were at least a dozen groups with names like “New World Liberation Front” or “Armed Revolutionary Independence Movement”.

Overall, this change does make it much harder to figure out what ideological group terrorists belong to, as a large number of them seem to be specifically eschewing group identification. Combine that with the pitfalls of crowd sourcing, and changing definitions, and I’d say this report is somewhat inconclusive.

Reporting the High Water Mark

Another day, another weird practice to add to my GPD Lexicon.

About two weeks ago, a friend sent me that “People over 65 share more fake news on Facebook” study to ask me what I thought. As I was reviewing some of the articles about it, I noticed that they kept saying the sample size was 3,500 participants. As the reporting went on however, the articles clarified that not all of those 3,500 people were Facebook users, and that about half the sample opted out. Given that the whole premise of the study was that the researchers had looked at Facebook sharing behavior by asking people for access to their accounts, it seemed like that initial sample size wasn’t reflective of those used to obtain the main finding. I got curious how much this impacted the overall number, so I decided to go looking.

After doing some follow up with the actual paper, it appears that 2,771 of those people had Facebook to begin with, 1,331 people actually enrolled in the study, and 1,191 were able to link their Facebook account to the software the researchers needed. So basically the sample size the study was actually done on is about a third of the initially reported value.
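To see how quickly the headline number shrinks, here is a rough sketch that walks through that funnel. The counts are the ones quoted above from the paper. The stage labels and variable names are mine, and I'm treating 3,500 as an exact starting count, which is an assumption.

```python
# Sample sizes at each stage, as quoted above from the paper.
# Stage labels are my own shorthand, not the authors' wording.
funnel = [
    ("initial survey sample", 3500),
    ("had a Facebook account", 2771),
    ("enrolled in the study", 1331),
    ("linked their Facebook account", 1191),
]

initial = funnel[0][1]
for stage, n in funnel:
    # Show each stage as a count and as a share of the starting sample.
    print(f"{stage:<32}{n:>6,}  {n / initial:6.1%} of initial")

# Share of the headline sample that the sharing analysis actually rests on.
final_share = funnel[-1][1] / initial
```

The last line is where the "about a third" comes from. The exact share left out depends on whether you stop at the people who enrolled or at the people whose accounts were successfully linked.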

While this wasn’t necessarily deceptive, it did strike me as a bit odd. The 3,500 number is one of the least relevant numbers in that whole list. It’s useful to know that there might have been some selection bias going on with the folks who opted out, but that’s hard to see if you don’t report the final number.  Other than serving as a selection bias check though (which the authors did do), 63% of the participants had no link sharing data collected on them, and thus are irrelevant to the conclusions reported.  I assumed at first that reporters were getting this number from the authors, but it doesn’t seem like that’s the case.  The number 3,500 isn’t in the abstract. The press release uses the 1,300 number. From what I can tell, the 3,500 number is only mentioned by itself in the first data and methods section, before the results and “Facebook profile data” section clarify how the interesting part of the study was done. That’s where they clarify that 65% of the potential sample wasn’t eligible or opted out.

This was not a limited way of reporting things though, as even the New York Times went with the 3,500 number. Weirdly enough, the Guardian used the number 1,775, which I can’t find anywhere. Anyway, here’s my new definition:

Reporting the high water mark: A newspaper report about a study that uses the sample size of potential subjects the researchers started with, as opposed to the sample size for the study they subsequently report on.

I originally went looking for this sample size because I always get curious how many 65+ people were included in this study. Interestingly, I couldn’t actually find the raw number in the paper. This strikes me as important because if older people are online in smaller numbers than younger ones, the overall number of fake stories might be larger among younger people.

I should note that I don’t actually think the study is wrong. When I went looking in the supplementary table, I noted that the authors mentioned that the most commonly shared type of fake news article was actually fake crime articles. At least in my social circle, I have almost always seen those shared by older people rather than younger ones.

Still, I would feel better if the relevant sample size were reported first, rather than the biggest number the researchers looked at throughout the study.