Surveys, Privacy and the Usefulness of Lies

I’ve been thinking a lot about surveys this week (okay, I’m boring, I think a lot about them every week), but this week I have a particularly good reason. A few years ago, I wrote about a congressman named Daniel Webster and his proposal to eliminate the American Community Survey. I’ve been a little fascinated with the American Community Survey ever since, and last week I opened my mailbox to discover that we’d been selected to take it this year.

For those unfamiliar with the American Community Survey, it’s an ongoing survey by the Census Bureau that asks people lots of information about their houses, income, disability and employment status. Almost every time you see a chart that shows you “income by state” or “which county is the richest” or “places in the US with the least internet access”, the raw data came from the American Community Survey. This obviously provides lots of good and useful information to many people and businesses, but it’s not without its critics. People like Congressman Webster object to the survey for reasons like government overreach, the cost and possible privacy issues with the mandatory* survey.

While I’ve written about this for years, I actually had never taken it, so I was fairly excited to see what all the fuss was about. Given the scrutiny that’s been placed on the cost, I was interested to see that the initial mailing strongly encouraged me to take the survey online (using a code on the mailing) and cited all the cost savings associated with my doing so. Filling out surveys online almost certainly reduces cost, but in this day and age it also tends to increase the possible privacy issues. While the survey doesn’t ask for sensitive information like social security numbers, it does ask lots of detailed information about salary, work status, the status of your house, mortgage payments and electricity usage. I wouldn’t particularly want a hacker getting a hold of this, nor would most others, I suspect.

I don’t particularly know how the Census Bureau should proceed with this survey or what Congress will decide to do, but it did get me thinking about privacy issues with online surveys and how to balance the need for data with these concerns. I work in an industry (healthcare) that is actually required by regulations to get feedback on how we’re doing and make changes accordingly, yet we also must balance privacy concerns and people who don’t want to give us information. Many people who have no problem calling you up and lecturing you about everything that went wrong while they were in the hospital absolutely freeze when you ask them to fill out a survey: they find it invasive. It’s a struggle. One of my favorite post-election moments actually reflected this phenomenon, in the form of a Chicago Tribune letter to the editor from a guy who said he’d never talked to a pollster in the run-up to the election. His issue? He hates pollsters because they want to capture your every thought AND they never listen to people like him. While many people like and appreciate services that reflect their perspective, are friendlier, more usable, and more tailored to their needs, many of us don’t want to be the person whose data gets taken to get there. For good reason too: our privacy is disappearing at an alarming rate, and data hacks are pretty much weekly news.

So how do survey purveyors get the trust back? One of the newest frontiers in this whole balancing act is actually coming from Silicon Valley, where tech companies are as desperate for user data as users are concerned about keeping it private. They have been advancing something called “differential privacy”, or the quest to use statistical techniques to render a data set collectively useful while rendering individual data points useless and unidentifiable. So how would this work?

My favorite of the techniques is something called “noise injection” (the classic version is known as “randomized response”), where fake results are inserted into the sample at a known rate. For example: a survey asks you if you’ve ever committed a crime. Before you answer, you are told to flip a coin. If the coin says heads, you answer truthfully. If the coin says tails, you flip the coin again. If the coin says heads this time, you say “yes, I’ve committed a crime”. Tails, you say you haven’t. When the researchers go back in, they can take out the predicted fake answers and find the real number. For example, let’s say you started with 100 people. At the end of the test, you find that 35 say they committed a crime, and 65 say they haven’t. You know that about 50 people answered by coin flip, and about 25 of them should have landed on heads and answered “yes” regardless of the truth, so of the 35 “yes” answers, 10 are real. You can also subtract the 25 coin-driven “no” answers from the 65 to get 40 real “no” answers.
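The arithmetic above can be sketched in a few lines. This is just a check of the worked example (100 people, 35 observed “yes” answers, a fair coin), with the quantities named for clarity; none of these names come from any real survey tool.

```python
# Recover the real "yes" count from randomized-response answers.
n = 100                 # total respondents
observed_yes = 35       # respondents who answered "yes"

truthful = n * 0.5            # ~50 people answer truthfully (first flip = heads)
forced_yes = n * 0.5 * 0.5    # ~25 people forced to say "yes" (tails, then heads)

real_yes = observed_yes - forced_yes   # 10 genuine "yes" answers
real_rate = real_yes / truthful        # 10 / 50 = 20%
print(real_yes, real_rate)
```

The same subtraction works for the “no” column: 65 observed minus 25 coin-driven gives the 40 real “no” answers.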

You now know the approximate real percentage of those who have committed a crime (20% in this example), but you can’t know whether any individual response is true or not. This technique has possible holes in it (what if people don’t follow instructions?) and it effectively cuts your sample size in half, since only about half of your respondents are giving real answers. But just asking people to admit to a crime directly with a “we promise not to share your data” actually doesn’t work so well either. Additionally, the beauty of this technique is that it works better the larger your sample is.

Going forward we may see more efforts like this, even within the same survey or data set. While 20 years ago people may have been annoyed to fill out a section of a survey with fake data, today’s more privacy-conscious consumers may be okay with it if it means their responses can’t be tied to them directly. I don’t know that the Census Bureau would ever use anything like this, but as we head towards the 2020 census, there will definitely be more talk about surveys, privacy and methodology.

*The survey is mandatory, but it appears the Census Bureau is prohibited by Congress from actually enforcing this.