Calling BS Read-Along Week 7: Big Data

Welcome to the Calling Bullshit Read-Along based on the course of the same name from Carl Bergstrom and Jevin West at the University of Washington. Each week we’ll be talking about the readings and topics they laid out in their syllabus. If you missed my intro and want the full series index, click here, or if you want to go back to Week 6 click here.

Well hello week 7! This week we’re taking a look at big data, and I have to say this is the week I’ve been waiting for. Back when I first took a look at the syllabus, this was the topic I realized I knew the least about, despite the fact that it is rapidly becoming one of the biggest issues in bullshit today. I was pretty excited to get into this week’s readings, and I was not disappointed. I ended up walking away with a lot to think about, another book to read, and a decent amount to keep me up at night.

Ready? Let’s jump right into it!

First, I suppose I should start with at least an attempt at defining “big data”. I like the phrasing from the Wikipedia page here: “Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time.” Forbes goes further and compiles 12 definitions here. If you come back from that rabbit hole, we can move into the readings.

The first reading for the week is “Six Provocations for Big Data” by danah boyd and Kate Crawford. The paper starts off with a couple of good quotes (my favorite: “Raw data is both an oxymoron and a bad idea; to the contrary, data should be cooked with care”) and a good vocab word/warning for the whole topic: apophenia, the tendency to see patterns where none exist. There’s a lot in this paper (including a discussion of what Big Data actually is), but the six provocations the title refers to are:

  1. Automating Research Changes the Definition of Knowledge. Starting with the example of Henry Ford and the assembly line, boyd and Crawford question how radically Big Data’s availability will change what we consider knowledge. If you can track everyone’s actual behavior moment by moment, will we end up de-emphasizing the why of what we do, or broader theories of development and behavior? If all we have is a (big data) hammer, will all human experience end up looking like a (big data) nail?
  2. Claims to Objectivity and Accuracy are Misleading. I feel like this one barely needs to be elaborated on (and is true of most fields), but it also can’t be said often enough. Big Data can give the impression of accuracy due to sheer volume, but every researcher has to make decisions about data sets that can introduce bias. Data cleaning, decisions to rely on certain sources, and decisions to generalize are all prone to bias and can skew results. An interesting example given is the original Friendster (Facebook before there was Facebook for the kids, the Betamax to Facebook’s VHS for the non-kids). The developers had read the research showing that people have trouble maintaining real-life social networks of over 150 people, so they capped the friend list at 150. Unfortunately for them, they didn’t realize that people wouldn’t use online networks the same way they used networks in real life. Perhaps unfortunately for the rest of us, Facebook did figure this out, and the rest is (short term) history.
  3. Bigger Data are Not Always Better Data. Guys, there’s more to life than having a large data set. Using Twitter data as an example, they point out that large quantities of data can be just as biased (one person having multiple accounts, non-representative user groups) as small data sets, while giving some people false confidence in their results. (See the sketch after this list for a quick demonstration.)
  4. Not all Data are Equivalent. With echoes of the Friendster example from the second point, this point flips the script and points out that research done using online data doesn’t necessarily tell us how people interact in real life. Removing data from its context loses much of its meaning.
  5. Just Because it’s Accessible Doesn’t Make it Ethical. The ethics of how we use social media aren’t limited to big data, but big data has definitely raised a plethora of questions about consent and what it means for something to be “public”. Many people who gladly post on Twitter might resent having those same Tweets used in research, and many have never considered the implications of their Tweets being used in this context. Sarcasm, drunk tweets, and tweets from minors could all be used to draw conclusions in a way that wouldn’t be okay otherwise.
  6. Limited Access to Big Data Creates New Digital Divides. On top of all the other potential problems with big data, there’s the question of who owns and controls it. Data is only as good as your access to it, and of course nothing obligates the companies that own it to share it, or share it fairly, or share it with people who might use it to question their practices. In assessing conclusions drawn from big data, it’s important to keep all of those issues in mind.
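
To make point #3 concrete, here’s a minimal sketch (all numbers made up for illustration) of a 100,000-person biased sample getting beaten by a 500-person random one:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population: 1 million people, 50% hold some opinion.
population = rng.random(1_000_000) < 0.50

# A small but genuinely random sample: noisy, not biased.
small_random = rng.choice(population, size=500, replace=False)

# A huge but biased sample: imagine a platform whose users skew
# 4-to-1 toward one side of the question.
holders = np.flatnonzero(population)
non_holders = np.flatnonzero(~population)
big_biased_idx = np.concatenate([
    rng.choice(holders, size=80_000, replace=False),
    rng.choice(non_holders, size=20_000, replace=False),
])

print(f"True rate:           {population.mean():.3f}")
print(f"500 random people:   {small_random.mean():.3f}")               # ~0.50
print(f"100,000 biased ones: {population[big_biased_idx].mean():.3f}")  # ~0.80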

The general principles laid out here are a good framing for the next reading, “The Parable of Google Flu”, an examination of why Google’s Flu Trends algorithm consistently overestimated influenza rates in comparison to CDC reporting. The algorithm was set up to predict influenza rates based on the frequency of various search terms in different regions, but it overestimated rates in 100 of the 108 weeks examined, sometimes by quite a bit. The paper contains a lot of interesting discussion of why this sort of analysis can err, but one of the most interesting factors was Google’s failure to account for Google itself. The algorithm was announced in 2009, and updates were announced in 2013. Lazer et al. point out that over that time period Google was constantly refining its search algorithm, yet the model appears to assume that all Google searches are done only in response to external events like getting the flu. Basically, Google was attempting to change the way you search while assuming that no one could ever change the way you search. They call this internal software tinkering “blue team” dynamics, and point out that it’s going to be hell on replication attempts. How do you study behavior across a system that is constantly trying to change behavior? Also considered are “red team” dynamics, where external parties try to “hack” the algorithm to produce the results they want.
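
Google Flu Trends’ real model combined the frequencies of dozens of search terms, but the failure mode is easy to see in a deliberately oversimplified one-variable sketch (toy numbers and assumptions, all mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training data: the weekly flu rate drives flu-related searches.
weeks = 52
flu_rate = 2 + np.sin(np.linspace(0, 2 * np.pi, weeks)) + rng.normal(0, 0.1, weeks)
searches = 10 * flu_rate + rng.normal(0, 1, weeks)

# Fit searches -> flu rate, implicitly assuming that link never changes.
slope, intercept = np.polyfit(searches, flu_rate, 1)

# A year later, the platform adds flu-related search suggestions that
# inflate search volume by 30% regardless of actual illness (a "blue
# team" change to the platform itself).
new_searches = searches * 1.3
predicted = slope * new_searches + intercept

print(f"Mean actual flu rate:    {flu_rate.mean():.2f}")
print(f"Mean predicted flu rate: {predicted.mean():.2f}")  # systematically too high
```

The model was never wrong about the old world; the platform just quietly stopped being the old world.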

Finally we have an opinion piece from a name that seems oddly familiar, Jevin West, called “How to improve the use of metrics: learn from game theory“. It’s short, but it got a literal LOL from me with the line “When scientists order elements by molecular weight, the elements do not respond by trying to sneak higher up the order. But when administrators order scientists by prestige, the scientists tend to be less passive.” West points out that when you attempt to assess a system that can respond immediately to your assessment, you have to think carefully about what behavior your chosen metrics reward. For example, researchers are currently rewarded for publishing a large volume of papers. As a result, there is concern over the low quality of many papers, since researchers will split their findings into the “least publishable unit” to maximize their output. If the incentives were changed so that researchers were judged on only their 5 best papers, one might expect the behavior to change as well. By starting with the behaviors you want to motivate, you can (hopefully) create a system that encourages those behaviors.
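
West’s point is mechanical enough to simulate. Here’s a toy sketch, under my own illustrative assumptions rather than anything from his paper, of how the two metrics reward opposite strategies:

```python
# Toy model: a researcher has 10 "units" of findings and picks how many
# papers to slice them across. Assume (illustratively; this is not from
# West's paper) that quality per paper falls as work is sliced thinner.
def paper_qualities(total_units: float, n_papers: int) -> list[float]:
    return [total_units / n_papers] * n_papers

def score_by_count(qualities: list[float]) -> float:
    return float(len(qualities))  # rewards volume, ignores quality

def score_by_top5(qualities: list[float]) -> float:
    return sum(sorted(qualities, reverse=True)[:5])  # rewards best work only

for n in (2, 5, 20):
    qs = paper_qualities(10, n)
    print(f"{n:>2} papers -> count metric: {score_by_count(qs):>4.1f}, "
          f"top-5 metric: {score_by_top5(qs):>4.1f}")

# The count metric is maximized by slicing into 20 thin papers; the
# top-5 metric removes any incentive to slice thinner than 5.
```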

In addition to those readings, there are two recommended readings that are worth noting. The first is Cathy O’Neil’s Weapons of Math Destruction (a book I’ve started but not finished), which goes into quite a few examples of problematic algorithms and how they affect our lives. Many of O’Neil’s examples get back to point #6 from the first paper in ways most of us don’t consider. Companies maintaining control over their intellectual property seems reasonable, but what if you lose your job because your school system bought a teacher ranking algorithm that said you were bad? What’s your recourse? You may not even know why you got fired or what you can do to improve. What if the algorithm is using a characteristic that it’s illegal or unethical to consider? Here O’Neil points to sentencing algorithms that give harsher jail sentences to those with family members who have also committed a crime. Because the algorithm is supposedly “objective”, it gets away with introducing facts (your family members’ involvement in crimes you didn’t take part in) that a prosecutor would have trouble getting past a judge under ordinary circumstances. In addition, some algorithms can help shape the very future they say they are trying to predict. Why are Harvard/Yale/Stanford the best colleges in the US News rankings? Because everyone thinks they’re the best. Why do they think that? Look at the rankings!
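
That rankings feedback loop is easy to sketch, too. Here’s a toy simulation (the dynamics are entirely made up, chosen only to show the snowball effect, and reflect nothing about US News’s actual methodology):

```python
# Toy feedback loop: two schools start with a nearly even share of
# applicants, but applicants chase rankings, and the ranking rewards
# application share.
share = {"Alpha U": 0.52, "Beta U": 0.48}

for year in range(6):
    # Weight each school by the square of its share: applicants
    # disproportionately flock to whoever is ranked higher.
    weights = {school: s ** 2 for school, s in share.items()}
    total = sum(weights.values())
    share = {school: w / total for school, w in weights.items()}
    print(year + 1, {school: round(s, 3) for school, s in share.items()})

# A 52/48 split snowballs into a runaway lead within a few "years":
# the ranking helps create the reality it claims to measure.
```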

Finally, the last paper is from Peter Lawrence: “The Mismeasurement of Science“. In it, Lawrence lays out an impassioned case that the current structure around publishing causes scientists to spend too much time on the politics of publication and not enough on actual science. He also questions heavily who is rewarded by such a system, and whether those are the right people. It reminded me of another book I’ve started but not finished yet, “Originals: How Non-Conformists Move the World”. In that book, Adam Grant argues that if we use success metrics based on past successes, we will inherently miss those who might have a chance at succeeding in new ways. Nassim Nicholas Taleb makes a similar case in Antifragile, where he argues that some small percentage of scientific funding should go to “Black Swan” projects: the novel, crazy, controversial, destined-to-fail type of research that occasionally produces something world-changing.

Whew! A lot to think about this week, and these readings did NOT disappoint. So what am I taking away? A few things:

  1. Big data is here to stay, and with it come ethical and research questions that may require new ways of thinking about things.
  2. Even with brand new ways of thinking about things, it’s important to remember the old rules, many of which still apply.
  3. A million-plus data points ≠ scientific validity.
  4. Measuring systems that can respond to being measured should be approached with some idea of what response you’d like to see, along with a plan to adjust if you get unintended consequences.
  5. It is increasingly important to scrutinize sources of data, and to remember what might be hiding in “black box” algorithms
  6. Relying too heavily on the past to measure the present can increase the chances you’ll miss the future.

That’s all for this week, see you next week for some publication bias!

Week 8 is up! Read it here.