There Is No Such Thing as “Public” Data

By Woodrow Hartzog

Posted on August 23, 2016


And it’s not OK for researchers to scrape information from websites like OkCupid.



Are you an OkCupid user? Would you consider the data on your profile public—fair game for anyone to download and share with the rest of the world?


That’s the argument made by a group of Danish researchers who released a data set on nearly 70,000 users of the popular dating website. The researchers used an automated tool called a “scraper” that captures parts of a webpage—a possible violation of the website’s terms of use. These users had answered questions on intimate topics like drug use and sexual preferences. The researchers took no steps to deidentify the data set when they released it, despite it being possible to reidentify many of the profiles. When the researchers were called out about this lapse on Twitter, one of them shrugged it off with the flip statement “Data is already public.”


I hear arguments like this all the time. Websites that post mug shot photos to shame people say they’re just using public records. Harassers who take “upskirt” photos of women say they are blameless because their activities occurred “in public.” Police say they are free to use powerful technologies to surveil anyone for as long as they like, provided the people being watched are “in public.”


This justification is fundamentally wrong. Not just because we should be able to expect a certain amount of privacy in public, but because, although the term is used constantly and seems self-evident, we don’t actually know what public means. It has no set definition in privacy law or policy. I often ask people to define the term for me. Common responses include “where anyone can see you” or “government records.” But by far the most common response I get is “not private.” Fair enough. But thinking of publicness this way only leads us to the equally difficult question of defining privacy.


Frankly, this argument is dangerous. People are wielding the notion of publicness as a sort of trump-all-rebuttals talisman to justify privacy invasions. By itself, this concept of publicness has no exculpatory power. How could it? We can’t even define it. We should be more critical of appeals to the publicness of data to justify its collection, use, and disclosure.


The “public data” concept is gaining steam in both policy and our everyday lives. The U.S. Department of Health and Human Services has proposed excluding public data sets from research oversight because, in its view, doing so presents low risk of harm. People who seek some sense of privacy on social media are ridiculed because “Twitter is public.” It’s time to abandon the misguided notion that public information is fair game.


Even if we were to adopt a plausible definition of public information, the researchers’ argument about their OkCupid data set fails on its own terms. There are basically three different ways you can define the notion of public data. The OkCupid data dump cannot be considered public under any of them.


1) Anything That Is “Not Private”


When people collecting and sharing information invoke its publicness, what they usually mean is that their actions are justified because what they collected or shared could not be considered “private.” Thinking of public as “not private” can be useful because it cuts down on the number of squishy terms that need precise boundaries. To figure out what “public” means, we ask what is “private” in any given context.


The problem is that you can’t rely on the “not private” notion of publicness to justify data collection, use, or disclosure. It’s circular. You can’t say “This data is not private because it is public” when what you mean is “This data is not private because it is not private.” Thinking of public data in this way means we must ask tough questions about context, confidants, data sensitivity, shared expectations, and the structural and legal safeguards that form our perceptions of trust and risk. All of that gets washed away with justifications like “Data is already public.”


2) Anything That Is “Freely Accessible”


Another flawed definition of “publicness” equates things that are easily accessible or observable with a lack of privacy interests. This conceptualization can be found in debates over social media privacy settings as well as facial recognition technologies and license plate readers. But this definition of publicness breaks down quickly under scrutiny. Our notions of privacy are much more contextually dependent than a bright line of theoretical accessibility. Almost everything we do online and outside of the house is observable or accessible by someone.


In reality, people gauge privacy risks along a continuum. We seek out low-risk environments with structural and practical safeguards where information can be seen by some but not all, like a cozy café where we can gossip about our co-workers. Even in our safe zones, we are usually still theoretically observable or accessible. Visits to doctors are often considered private, but you can see everyone in your physician’s waiting room.


We rely upon the fact that we’re obscure to the world when gauging the risk of exposure or disclosure. Think of your daily walk from your office to your car or the family blog read by a total of 10 people. Of course, many things that are freely accessible should not be considered private. But it is disingenuous to suggest that the only things that are not public are our activities in houses and data kept under lock and key.


A more plausible descriptive account of publicness is what we think of as “common knowledge”: things most of us know. For example, we all know Jay Z and Beyoncé are married. But our shared cultural repository is microscopic compared with the entire universe of knowable things. Only a handful of people likely ever saw any individual OkCupid profile before it was scraped. That’s a far cry from being common knowledge.


3) Anything That’s Designated as Public


The most legitimate claim that information is “public” is based upon a policy determination that the information should be available with minimal restrictions. The collection and release of public government records, court documents, and “open data” sets created and released by the government are often justified because of their publicness. Sometimes judges and administrative agencies declare that certain information is for the public at large. What all of these examples have in common is that policy concerns dictate the publicness of the data and not the other way around.


Here, usually some decision was made or system was implemented regarding the relevant privacy concerns that justified a data release. For example, public records are often redacted, deidentified, or simply checked to make sure they are safe to release. However, these sorts of public data can still challenge our notions of privacy through obscurity, which is the idea that when information is hard or unlikely to be found or understood, it is to some degree safe.


This notion of publicness at least dispenses with the pretense of publicness as a descriptive concept. Instead, it embraces publicness as a question of policy. The OkCupid data set doesn’t consist of public records, nor did the researchers initially take any of the steps that often justify the release of government data (though they subsequently protected the data set with a password).


The collection, use, and disclosure of our personal information have too often been justified under the auspices of its publicness with little or no scrutiny. The “public information” justification is a simple way to avoid answering hard questions about the privacy interests in data. You can call data revolutionary. You can call it “big,” and you can call it “open.” But if you call it “public,” you’d better be able to back it up.


This article is part of Future Tense, a collaboration among Arizona State University, New America, and Slate. Future Tense explores the ways emerging technologies affect society, policy, and culture.


Woodrow Hartzog is an associate professor at Samford University’s Cumberland School of Law and affiliate scholar at Stanford Law School’s Center for Internet and Society.



The preceding is republished on TAP with permission by its author, Woodrow Hartzog, Assistant Professor at the Cumberland School of Law at Samford University. “There Is No Such Thing as ‘Public’ Data” was originally published May 19, 2016 in Slate.