Counting Terrorism and the Pitfalls of Open Source Databases

Terrorism is surging in the US, fueled by right-wing ideologies

Someone posted this rather eye-catching story on Twitter recently, which came from an article back in August from QZ.com. I’ve blogged over the years about how we classify terrorism and other Mass Casualty Incidents, so I decided to click through to the story.

It came with two interesting graphs that I thought warranted a closer look. The first was a chart of all terror incidents (the bars) vs the fatalities in the US:

Now first things first: I always note immediately where the time range starts. There’s a good reason this person chose to do 15 years and not 20: including 9/11 in any breakdown throws the numbers all off. This chart peaks at fewer than 100 fatalities, and we know 2001 would have had 30 times that number.

Still, I was curious what definition of terrorism was being used, so I went to look at the source data they cited from the Global Terrorism Database. The first thing I noted when I got to the website is that data collection for incidents is open source. Interesting. Cases are added by individual data collectors, then reviewed by those who maintain the site. I immediately wondered exactly how long this had been going on, since it would make sense that more people would add more incidents as the internet became more ubiquitous and in years when terrorism was heavily in the news.

Sure enough, on their FAQ page, they actually specifically address this (bolding mine):

Is there a methodological reason for the decline in the data between 1997 and 1998, and the increases since 2008 and 2012?

While efforts have been made to assure the continuity of the data from 1970 to the present, users should keep in mind that the data collection was done as events occurred up to 1997, retrospectively between 1998 and 2007, and again concurrently with the events after 2008. This distinction is important because some media sources have since become unavailable, hampering efforts to collect a complete census of terrorist attacks between 1998 and 2007. Moreover, since moving the ongoing collection of the GTD to the University of Maryland in the Spring of 2012, START staff have made significant improvements to the methodology that is used to compile the database. These changes, which are described both in the GTD codebook and in this START Discussion Point on The Benefits and Drawbacks of Methodological Advancements in Data Collection and Coding: Insights from the Global Terrorism Database (GTD), have improved the comprehensiveness of the database. Thus, users should note that differences in levels of attacks before and after January 1, 1998, before and after April 1, 2008, and before and after January 1, 2012 may be at least partially explained by differences in data collection; and researchers should adjust for these differences when modeling the data.

So the surge in incidents might be real, or it might be that they started collecting things more comprehensively, or a combination of both. This is no small matter: of the 366 incidents covered by the table above, 266 (72%) had no fatalities, and 231 incidents (63%) had no fatalities AND no injuries. Incidents like that are going to be much harder to find records for unless they’re being captured in real time.
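(For anyone who wants to check numbers like these themselves, here’s roughly how you’d pull them from the GTD’s downloadable spreadsheet. This is just a sketch in Python/pandas: the file name is a placeholder, the column names are the ones the GTD codebook uses as far as I can tell, and the 15-year window is my assumption about what the chart covers.)

```python
import pandas as pd

# Sketch only: file name is a placeholder, column names follow the GTD codebook
gtd = pd.read_csv("globalterrorismdb.csv", low_memory=False)

# US incidents over an assumed 15-year window matching the chart
us = gtd[(gtd["country_txt"] == "United States") & (gtd["iyear"].between(2003, 2017))]

no_fatalities = us[us["nkill"].fillna(0) == 0]
no_casualties = us[(us["nkill"].fillna(0) == 0) & (us["nwound"].fillna(0) == 0)]

print(f"Incidents: {len(us)}")
print(f"No fatalities: {len(no_fatalities)} ({len(no_fatalities) / len(us):.0%})")
print(f"No fatalities or injuries: {len(no_casualties)} ({len(no_casualties) / len(us):.0%})")
```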

The next graph they featured was this one, where they categorized incidents by perpetrator:

The original database contains a field for “perpetrator group”, which seems to speak loosely to motivation. Overall they had 20 different categories for 2017, and Quartz condensed them into the 4 above. I started to try to replicate what they did, but immediately got confused because the GTD lists 19 of the groups as “Unknown”, so Quartz had to reassign 9 of them to some other group. Here’s what you get just from the original database:

Keep in mind that these categories are open source, so differences in labeling may be due to different reviewers.

Now it’s possible that information got updated in the press but not in the database. It seems plausible that incidents might be added shortly after they occur, then not reviewed later once more facts are settled. For example, the Las Vegas shooter was counted under “anti-government extremists”, but we know that the FBI closed the case 6 months ago stating they never found a motive. In fact, the report concluded that he had a marked disinterest in political and religious beliefs, which explains his lack of a manifesto or other explanation for his behavior. While anti-government views had been floated as a motive originally, that never panned out. Also worth noting: the FBI specifically concluded this incident did not meet their definition of terrorism.

Out of curiosity, I decided to take a look at just the groups that had an injury or fatality associated with their actions (29 out of the 65 listed for 2017):

If you want to look at what incident each entry is referring to, the GTD list is here. Glancing quickly, the one incident listed as explicitly right wing was Mitchell Adkins, who walked into a building and stabbed 3 people after asking them their political affiliation. The one anti-Republican incident was the attack on the Republican Congressional softball team.
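(If you want to replicate that injury-or-fatality filter yourself, here’s a rough sketch along the same lines as above: group the 2017 US incidents by the GTD’s perpetrator group field and keep the groups with at least one death or injury. Again, the file name is a placeholder and the column names are my reading of the GTD codebook, so treat it as a sketch rather than exactly what Quartz or I did.)

```python
import pandas as pd

# Rough sketch (placeholder file name, GTD codebook column names):
# 2017 US incidents grouped by perpetrator group, keeping groups with casualties
gtd = pd.read_csv("globalterrorismdb.csv", low_memory=False)
us_2017 = gtd[(gtd["country_txt"] == "United States") & (gtd["iyear"] == 2017)]

by_group = (
    us_2017.groupby("gname")[["nkill", "nwound"]]
    .sum()
    .assign(casualties=lambda d: d["nkill"] + d["nwound"])
    .sort_values("casualties", ascending=False)
)

print(by_group[by_group["casualties"] > 0])
```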

I think overall I like the original database categories better than broad left-wing or right-wing categories, which do trend towards oversimplification. Additionally, when using crowd-sourced information, you have to be careful to account for any biases in reporting. If the people reporting incidents are more likely to come from certain regions or to pay more attention to certain types of crimes, the database will reflect that.

To illustrate that point, I should note that 1970 is by FAR the worst year for terrorist incidents they have listed. Here’s their graph:

Now I have no idea if 1970 was really the worst year on record, or if it got a lot of attention for being the first year of collection, or if there’s some quirk in the database here, but that spike seems unlikely. From scanning through quickly, it looks like there are a lot of incidents that happened on the same day. That trend was also present in the current data, and there were a few entries I noted that looked like duplicates but could also have been two similar attacks carried out on the same day.

Overall though, I think comparing 1970 to 2017 shows an odd change in what we call terrorism. Many of the incidents listed in 1970 were done by people who specifically seemed to want to make a point about their group. In 2017, many of the incidents seemed to involve someone who wanted to be famous, and picked their targets based on whoever drew their ire. You can see this in the group names. In 2017 only one named group was responsible for a terrorist attack (the White Rabbit Militia one), whereas in 1970 there were at least a dozen groups with names like “New World Liberation Front” or “Armed Revolutionary Independence Movement”.

This change also makes it much harder to figure out what ideological group terrorists belong to, as a large number of them seem to be specifically eschewing group identification. Combine that with the pitfalls of crowd-sourcing and changing definitions, and I’d say this report is somewhat inconclusive.

Reporting the High Water Mark

Another day, another weird practice to add to my GPD Lexicon.

About two weeks ago, a friend sent me that “People over 65 share more fake news on Facebook” study to ask me what I thought. As I was reviewing some of the articles about it, I noticed that they kept saying the sample size was 3,500 participants. As the reporting went on, however, the articles clarified that not all of those 3,500 people were Facebook users, and that about half the sample opted out. Given that the whole premise of the study was that the researchers had looked at Facebook sharing behavior by asking people for access to their accounts, it seemed like that initial sample size wasn’t reflective of the sample actually used to obtain the main finding. I got curious how much this impacted the overall number, so I decided to go looking.

After doing some follow-up with the actual paper, it appears that 2,771 of those people had Facebook to begin with, 1,331 people actually enrolled in the study, and 1,191 were able to link their Facebook account to the software the researchers needed. So basically the sample size the study was actually done on is about a third of the initially reported value.
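(Laid out as a funnel with the numbers reported above, the attrition looks like this; a quick sketch just to show where the “about a third” comes from.)

```python
# The sample funnel from the paper, using the numbers reported above
funnel = [
    ("initial survey sample", 3500),
    ("had a Facebook account", 2771),
    ("enrolled in the study", 1331),
    ("successfully linked their account", 1191),
]

for label, n in funnel:
    print(f"{label}: {n:,} ({n / funnel[0][1]:.0%} of initial)")
```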

While this wasn’t necessarily deceptive, it did strike me as a bit odd. The 3,500 number is one of the least relevant numbers in that whole list. It’s useful to know that there might have been some selection bias going on with the folks who opted out, but that’s hard to see if you don’t also report the final number. Other than serving as a check for selection bias (which the authors did do), though, 63% of the participants had no link sharing data collected on them, and thus are irrelevant to the conclusions reported. I assumed at first that reporters were getting this number from the authors, but it doesn’t seem like that’s the case. The number 3,500 isn’t in the abstract. The press release uses the 1,300 number. From what I can tell, the 3,500 number is only mentioned by itself in the first data and methods section, before the results and “Facebook profile data” sections clarify how the interesting part of the study was done. That’s where they clarify that 65% of the potential sample wasn’t eligible or opted out.

This way of reporting wasn’t limited to a few outlets, though; even the New York Times went with the 3,500 number. Weirdly enough, the Guardian used the number 1,775, which I can’t find anywhere. Anyway, here’s my new definition:

Reporting the high water mark: A newspaper report about a study that uses the sample size of potential subjects the researchers started with, as opposed to the sample size for the study they subsequently report on.

I originally went looking for this sample size because I was curious how many people 65 and older were included in the study. Interestingly, I couldn’t actually find the raw number in the paper. This strikes me as important because if older people are online in smaller numbers than younger ones, the overall number of fake stories shared might still be larger among younger people.

I should note that I don’t actually think the study is wrong. When I went looking in the supplementary table, I noted that the authors mentioned that the most commonly shared type of fake news article was actually fake crime articles. At least in my social circle, I have almost always seen those shared by older people rather than younger ones.

Still, I would feel better if the relevant sample size were reported first, rather than the biggest number the researchers looked at throughout the study.

What I’m Reading: January 2019

Well, my best read of the month was the draft manuscript of my brother’s upcoming book Addiction Nation: What the Opioid Crisis Reveals About Us. This book grew out of an article he wrote about his own opioid addiction, which I blogged about here. I’m super proud of him for this book, so expect more mentions of it as the publication date draws closer. He’s asked me if I’d do some blogging with him about some of the research around this topic, so if there’s anything in particular anyone would be interested in on that front, please let me know.

A recent bout of Wikipedia reading led me to this really interesting visual about the Supreme Court.  My dad had mentioned recently the idea that there used to be a “Catholic seat” on the Supreme Court, before the more recent trend of Catholics dominating SCOTUS. Turns out there’s a visual that shows how right he was:

So basically for almost 60 years there was only one Catholic justice at a time, with each seemingly nominated to replace the last. Then in the late 80s that all changed, and by the late 2000s Catholics came to dominate the court. As it stands today the breakdown is 5 Catholics, 3 Jews, and 1 Episcopalian.

I stumbled across an interesting paper a few weeks ago called “Metacognitive Failure as a Feature of Those Holding Radical Beliefs” that found that people who held “radical” beliefs were more likely to be overconfident/less aware of their errors in neutral areas as well. I haven’t read through the full study, but the idea that radical beliefs are due to generalized overconfidence as opposed to attachment to a specific idea is intriguing.

As someone who was raised with a good dose of 90s era environmentalism, I thought this Slate Star Codex post about “What Happened to 90s Environmentalism?” was fascinating. Turns out some of the stuff we were warned about was solved, some was overhyped, and some… just stopped being talked about.

On a totally different note, I’ve decided to do a cookbook challenge this year, and am cooking my way through the book 12 Months of Monastery Soups. I sort of started blogging about it, but I’m not sure if I like that format or not. If I end up ditching that, then I’m still going to post pictures on my heretofore neglected Instagram account.

Updates on Mortality Rates and the Impact of Drug Deaths

A couple of years ago now, there was a lot of hubbub around a paper about mortality rates among white Americans. This paper purported to show that mortality for middle-aged white people in the US was not decreasing (as it was for other countries and other racial/ethnic groups), but was actually increasing.

Andrew Gelman and others challenged this idea, and noted that some of the increase in mortality was actually a cohort effect. In other words, mortality was up, but so was the average age of a “45-54 year old”. After adjusting for this, their work suggested that it was actually white middle-aged women in the South who were seeing an increase in mortality:

In this article for Slate, they published the state by state data to make this even clearer:

In other words, there are trends happening, but they’re complicated and not easy to generalize.
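(To make the cohort-effect point concrete, here’s a toy sketch with completely made-up numbers. The death rate at every single year of age stays the same, yet the crude rate for the whole “45-54” bucket still rises once the people in that bucket skew older, which is the artifact the age-adjusted analyses try to correct for.)

```python
# Toy illustration of the cohort effect (all numbers are hypothetical)
# Death rate per 100,000 at each single year of age, held constant across years
rate_by_age = {age: 300 + 15 * (age - 45) for age in range(45, 55)}

# Share of the 45-54 group at each age: even in "1999", skewed older in "2013"
mix_1999 = {age: 0.10 for age in range(45, 55)}
mix_2013 = {age: 0.07 if age < 50 else 0.13 for age in range(45, 55)}

def band_rate(mix):
    """Crude death rate for the whole 45-54 band under a given age mix."""
    return sum(rate_by_age[age] * share for age, share in mix.items())

print(f"Crude 45-54 rate with the 1999 age mix: {band_rate(mix_1999):.0f} per 100k")
print(f"Crude 45-54 rate with the 2013 age mix: {band_rate(mix_2013):.0f} per 100k")
# The second number is higher even though no age-specific rate changed at all
```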

One of the big questions that came up when this work was originally discussed was how much “despair deaths” like opioid overdoses and suicides were driving this change.

In 2017, a paper was published showing that this was likely only partially true. Suicide and alcohol-related deaths had remained relatively stable for white people, but drug deaths had risen:

Now, there appears to be a new paper coming out that shows there may be elevated mortality in even earlier age groups. It appears only the abstract is up at the moment, but the initial reporting suggests there may be some increase among Gen Xers (currently 38-45 years old) and some Gen Yers (27-37 years old). They have reportedly found elevated mortality patterns among white men and women in that age range, partially driven by drug overdoses and alcohol poisonings.

From the abstract, the generations with elevated mortality were:

    • Non-Hispanic Blacks and Hispanics: Baby Boomers
    • Non-Hispanic White females: late-Gen Xers and early-Gen Yers
    • Non-Hispanic White males: Baby Boomers, late-Gen Xers, and early-Gen Yers.

Partial drivers for each group:

    • Baby Boomers: drug poisoning, suicide, external causes, chronic obstructive pulmonary disease, and HIV/AIDS for all race and gender groups affected.
    • Late-Gen Xers and early-Gen Yers: mortality at least partially driven by drug poisonings and alcohol-related diseases for non-Hispanic Whites.

And finally, one nerve-wracking sentence:

Differential patterns of drug poisoning-related mortality play an important role in the racial/ethnic disparities in these mortality patterns.

It remains to be seen if this paper will have some of the cohort effect problems that have plagued other analyses, but the drug poisoning death issue seems to be a common feature. We’ll see what the long-term outcomes of this will be, but here’s an interesting visualization from Senator Mike Lee’s website:

Not a pretty picture.