Can We Have Too Much Data?

By Daron Acemoglu

Posted on October 21, 2020


Billions of people around the world are currently using social media platforms and sharing information about their preferences, views and intimate details (Facebook alone has close to 3 billion monthly active users). The data they share are processed with increasingly sophisticated machine learning and AI methods, are then sold to third parties for advertising or product development, and form the basis of various customized online services. We are repeatedly told that this is the beginning of a revolution that is going to transform society and bring unparalleled prosperity and welfare to all of us.


There is a dark side to all of these data, however. Data sharing on social media and other platforms compromises not only the privacy of the users doing the sharing, but also that of others who are not actively engaged in such sharing and who have very little say or control over how their data are being used indirectly.


The Cambridge Analytica scandal gives us a glimpse of these practices and their implications. The story is, by now, well known. Facebook allowed Cambridge Analytica to acquire the private information of millions of individuals from data shared by about 270,000 Facebook users, who voluntarily downloaded an app, “This is your digital life”, designed to map their personality traits. The app accessed users' news feeds, timelines, posts and messages. The main problem, however, wasn’t the intrusive nature of this app (itself potentially concerning), but how it also collected information about other Facebook users to whom these 270,000 individuals were connected. Cambridge Analytica was able to infer detailed information about more than 50 million Facebook users. The company then deployed these data to design personalized political messages and advertising in the Brexit referendum and the 2016 US presidential election.


The problem goes well beyond Cambridge Analytica. This isn’t just because Facebook itself and other third parties engage in the same strategies. It is because the very nature of predictive big data approaches is to forecast the behavior or characteristics of groups of individuals from data shared by a sample of them.


Consider the case of data from Facebook and other social media platforms being used to predict who will take part in protests against the government. Less extreme but no less relevant is the ability of companies to predict the behavior, identity, location, nationality, age or sexual orientation of individuals who may wish to keep this information private. Suppose, for example, that people aged between 20 and 25, born in a specific country and with a particular sexual orientation, are much more likely than others to go to a particular restaurant in New York City, and some of them share this information on social media. The social media platform would then have a fairly accurate way of finding out the sexual orientation of others with the same demographic background frequenting the same restaurant, even if they are very averse to sharing that information themselves.
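To see how this kind of inference works mechanically, here is a small, self-contained Python simulation. All of the specifics (group size, visit probabilities, the 30% share rate) are made-up numbers for illustration; the point is only that an estimate learned from the minority who share their data transfers to those who do not.

```python
import random

random.seed(0)

# Hypothetical setup: members of one demographic group visit a given
# restaurant with high probability if they have a private attribute,
# and rarely otherwise. Only some members share their data.
GROUP_SIZE = 1000
P_VISIT_IF_ATTRIBUTE = 0.9
P_VISIT_OTHERWISE = 0.05
P_SHARE = 0.3  # fraction of the group who share data with the platform

population = []
for _ in range(GROUP_SIZE):
    has_attribute = random.random() < 0.5  # the private attribute
    p_visit = P_VISIT_IF_ATTRIBUTE if has_attribute else P_VISIT_OTHERWISE
    visits = random.random() < p_visit     # observable behavior
    shares = random.random() < P_SHARE     # whether this person shares data
    population.append((has_attribute, visits, shares))

# The platform observes only sharers, and learns from them how
# predictive a restaurant visit is of the private attribute.
sharers = [p for p in population if p[2]]
visiting_sharers = [p for p in sharers if p[1]]
est_rate = sum(p[0] for p in visiting_sharers) / len(visiting_sharers)

# It then applies that estimate to NON-sharers seen at the restaurant.
non_sharing_visitors = [p for p in population if p[1] and not p[2]]
true_rate = sum(p[0] for p in non_sharing_visitors) / len(non_sharing_visitors)

print(f"estimated P(attribute | visit), from sharers: {est_rate:.2f}")
print(f"actual rate among non-sharing visitors:       {true_rate:.2f}")
```

In this toy setup both rates come out high and close to each other: the platform's estimate, built entirely from people who consented to share, is just as accurate for people who never did.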


Such information leakage is not a byproduct of the business model of Facebook and other platforms. Rather, it is central to their overall strategy. Yet it causes a huge loss of privacy, even if it can also generate some benefits, both to Facebook and perhaps even to some of the users.


Advocates of data markets emphasize these benefits, pointing out how information shared by an individual about their preferences or health problems can be useful for understanding the behavior and diseases affecting others with similar characteristics. But the same logic extends to privacy concerns as well. When Facebook or other companies can predict the behavior of individuals who haven’t shared their data, this amounts to a violation of privacy to which these individuals have not consented.


Individuals may value their privacy for a variety of reasons. They may not want others to have some information about them, such as their political views, sexual orientation or some of their leisure activities. They may also enjoy lower prices when companies do not know some of the relevant information about them: imagine, for instance, a platform that knows you would be willing to pay $10 to download a song for which others pay just $1, and that is able to price individually, thus capturing the $9 surplus you would otherwise have obtained. Alternatively, their online experience may be improved if they are not targeted by ads they do not want to see, and privacy would limit the reach of such advertising.


Even if these concerns are present, many in the tech industry, as well as some experts, may still argue that they are not significant enough to counterbalance the benefits from data. This presumption is based on existing studies (e.g., Laudon, 1996; Varian, 2002; Athey et al., 2017) that find a relatively low willingness to pay by most users to protect their privacy. Yet this inference implicitly depends on the presumption that these revealed willingness-to-pay measures reflect the true value of privacy. When one’s information is revealed by others, this need not be the case.


In “Too Much Data: Prices and Inefficiencies in Data Markets”, jointly written with Ali Makhdoumi, Azarakhsh Malekian and Asu Ozdaglar, I investigate the costs and benefits of data markets when this type of data leakage is pervasive. The critical ingredient, as described above, is that the data of an individual are informative not only about their own characteristics but also about the characteristics of other users (and potentially non-users) of the platform that has access to, or is able to purchase, these data.


Suppose that, as is likely to be the case in practice, each individual also differs according to the value they attach to privacy, so that some people will be much more willing to share their data on social media and other platforms, but in the process also reveal information about others. Information enables the platform or third parties to estimate the underlying characteristics of an individual, and more accurate estimates create greater value for the platform either for advertising or targeted product development. However, as my discussion so far underscores, more accurate estimates lead to more compromised privacy from the viewpoint of the individual.


The more an individual’s data are correlated with the characteristics, preferences, or actions of others, the more valuable her data are to platforms, but also the more privacy-violating her data sharing becomes, especially for people who do not want their information revealed to these platforms.


To understand the nature of the problem in a little more detail, let us consider a simple example, with a platform and two users. The platform can acquire or buy the data of a user in order to better estimate her characteristics, preferences, or actions. The relevant data of the two users are correlated, which means that the data of one user enable the platform to more accurately estimate the characteristics of the other user. The objective of the platform is to minimize the estimation error of user characteristics, or equivalently to maximize the amount of leaked information about them. Suppose that the valuation (in monetary terms) of the platform for the users’ leaked information is one, while the value that the first user attaches to her privacy, again in terms of leaked information about her, is 1/2, and for the second user it is v (which can be less than or greater than one, as we discuss below). The platform offers prices to the users in exchange for their data. This can be an explicit payment to the user in exchange for data sharing or an implicit payment in the form of services offered for free by the platform.


Each user can choose whether to accept the price offered by the platform or not. In this simple example without any transaction costs in exchanging data, the first user will always sell her data because her valuation of privacy, 1/2, is less than the value of information to the platform.

Figure 1: The correlation between the two users can be very high, so that the data of one reveals a lot about the other.

But given the correlation between the characteristics of the two users, this implies that the platform will already have a fairly good estimate of the second user’s characteristics. Suppose, for illustration, that the correlation between the users’ data is very high, so that once it has access to the first user’s data, the platform knows quite a bit about the second user. Here we see the implications of information leakage in data markets: the first user is revealing a considerable amount of information about the second user, regardless of whether the second user decides to sell her own data. In particular, the platform will know almost everything relevant about the second user from the first user’s data, and this undermines the second user’s willingness to protect her data. Indeed, since the first user is already revealing almost everything about the second user, the second user would be willing to sell her own data for a very low price. But then we see another implication of information leakage: once the second user decides to sell her data, this also reveals the first user’s data. By identical reasoning, the first user can then only charge a very low price for her data, because now the tables have turned and the second user is revealing a lot of information about the first.
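The unraveling of prices in this example can be sketched numerically. The following is a minimal, hypothetical Python model; the functional forms (a single leakage parameter `rho` and linear privacy losses) are illustrative assumptions of mine, not the exact setup of the paper.

```python
# Stylized sketch: the platform values each user's full information at 1,
# and a fraction rho of one user's information leaks from the other's data.

def reservation_price(privacy_value, rho, other_sold):
    """Minimum payment a user demands for selling her data.

    If the other user has already sold, a fraction rho of this user's
    information has leaked anyway, so selling only costs her the
    remaining (1 - rho) of her privacy."""
    incremental_loss = (1 - rho) if other_sold else 1.0
    return privacy_value * incremental_loss

rho = 0.9            # high correlation between the two users' data
v1, v2 = 0.5, 2.0    # user 2 values privacy highly (v > 1)

# User 1 sells regardless, since 0.5 < 1. Once she has, user 2's
# reservation price collapses:
p2 = reservation_price(v2, rho, other_sold=True)   # 2.0 * (1 - 0.9), about 0.2
# And by the same logic user 2's sale leaks user 1's data, so user 1
# can also only charge:
p1 = reservation_price(v1, rho, other_sold=True)   # 0.5 * (1 - 0.9), about 0.05

print(p1, p2)  # both far below the users' standalone privacy valuations
```

Despite user 2 valuing her privacy at twice what her data are worth to the platform, the platform obtains both users' data for a fraction of those valuations, which is the price-depression effect described above.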


Therefore, in this simple example, the platform will be able to acquire users’ data at a very low price, even though both users have privacy concerns. The depression of data prices below the value of privacy has obvious distributional implications: the platform benefits from cheap data and users receive little compensation for their data. When the second user’s valuation v is less than one, the market can be thought of as functioning decently, because data are socially beneficial: the benefits to the platform exceed the disutility to users from the loss of privacy. There is a catch here: even though data transactions might be beneficial, who gets the benefits from data is up for grabs. This example also shows that it will often be the platform that reaps the benefits, because it can obtain information about the second user from the first user without directly paying for it. In fact, it is worse than that, as we will see next.


In contrast to the previous case, when v is above one and large, the market starts malfunctioning badly. Here the second user would really like to protect her data, but there is no way she can do so given the ability of the platform to buy the first user’s data. The first user, by selling her data, is creating what we call a “negative externality” on the second user — she is directly hurting her.


This example is stylized and is not meant to be realistic. Nevertheless, it captures two critical features of data markets. First, data sharing by individuals always creates externalities for others whose information is revealed. This externality can be positive, as when others’ information helps companies develop higher-quality products or services for me. But it can also be negative, as we have just seen.


These negative effects may not be large enough to overturn the benefits from the platform’s use of these data. But even in this case, as I have already emphasized, they have distributional consequences: they benefit the platform at the expense of users. Worse, when some of the other users value their privacy highly, the negative effects may outweigh the benefits, leading to too much data sharing. In the extreme, when data sharing by some individuals does great harm to the privacy of others, even shutting down data markets may be a better option than laissez-faire in data exchange (though in general there will exist arrangements, allowing some amount of data to be transacted, that are superior to shutting down data markets completely).


Second, and perhaps more subtly, data sharing by an individual changes both the value of data to the platform and the value of privacy to other users. This is because these data enable the platform to better estimate the characteristics of other users, hence the platform itself will have less use for the data of other users. Analogously, once their information is leaked, these users have less reason to protect their own data, and may share them even though they value their privacy greatly. This reiterates that, in the presence of data sharing externalities, the value users place on privacy cannot be inferred from their revealed data sharing decisions.


The main principles communicated by this simple example are fairly general. Externalities resulting from information leakages are ubiquitous in data markets. Some of these may be positive (and that is what much of the commentary on this topic emphasizes). But when individuals value their privacy, either for its own sake or because privacy enables them to get better deals or better products in the future, information sharing also creates negative externalities.


Moreover, these negative externalities are often associated with depressed prices for data. The principle here is general as well: when your data reveal information about me, my own data become less valuable both to me and to the platform, depressing their price. This implies that when people report valuing their privacy greatly but are unwilling to pay much to protect their data, it does not mean they are irrational. Rather, they may have understood that protecting their own data, narrowly, is not going to do much to protect their privacy.


Another lesson is also apparent. There can be too much data (in the example when the value of privacy of the second user, v, is very high). In practice, whether this is the case or not will depend on the details of how much different individuals value privacy and the exact way in which data are transacted and purchased by different platforms. Put differently, we need a case-by-case analysis of the efficiency and distributional implications of data markets.


This conclusion is strengthened when we recall that data externalities are also depressing prices, so we cannot just rely on individuals’ willingness to protect their data as a gauge for how much they truly value privacy. We need detailed analyses of individual attitudes as well as platform strategies.


Can anything be done about data externalities when they are paramount? The most common strategy in practice is data anonymization. However, the logic of data externalities outlined here reveals that anonymizing data does not resolve the problem. When a user’s data are anonymized, this prevents the platform from learning some of the relevant information about her. But when the data externality originates from the fact that she is sharing information relevant to a specific group, that informational leakage remains.


In our paper, we develop several ideas about the regulation of data markets, but much more thinking is necessary on this topic. Central to these new regulatory ideas is the use of more sophisticated algorithms and protocols so that some of the unwanted information in data transactions is taken out. For example, if the data of the first and second users are correlated, then some of the information the first user shares is relevant only to herself, while some is directly or indirectly about the second user; there may be ways of filtering out the latter (statistically, removing some of the correlation). Critically, what is required is very different from anonymizing data: a user may be allowed to share information about herself, but it is the information about others that is taken out.
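As a rough illustration of what such filtering could look like statistically, consider the following hypothetical Python sketch. Here each user's data is modeled as an idiosyncratic signal plus a shared component, and the existence of a party able to observe and subtract that shared component is assumed purely for illustration; it is not a protocol from the paper.

```python
import random

random.seed(1)

# Toy model: each user's data = her own idiosyncratic signal plus a
# shared component. The shared component is what leaks information
# about the other user.
N = 10_000
shared = [random.gauss(0, 1) for _ in range(N)]
own1 = [random.gauss(0, 1) for _ in range(N)]
own2 = [random.gauss(0, 1) for _ in range(N)]
data1 = [s + o for s, o in zip(shared, own1)]  # what user 1 would share
data2 = [s + o for s, o in zip(shared, own2)]  # user 2's private data

def corr(xs, ys):
    """Sample Pearson correlation coefficient."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Raw sharing leaks: user 1's data is correlated with user 2's.
raw_leak = corr(data1, data2)

# A filtering protocol would strip the shared component before
# transmission, so what reaches the platform still describes user 1
# but no longer reveals user 2.
filtered1 = [d - s for d, s in zip(data1, shared)]
filtered_leak = corr(filtered1, data2)
still_informative = corr(filtered1, own1)

print(f"leak before filtering:   {raw_leak:.2f}")          # around 0.5
print(f"leak after filtering:    {filtered_leak:.2f}")     # around 0.0
print(f"info about user 1 kept:  {still_informative:.2f}") # about 1.0
```

The contrast with anonymization is visible in the last line: the filtered data remain fully informative about the user who chose to share, while the correlation with the other user is removed.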


Of course, a key question is whether such regulatory approaches are practical and whether they can be implemented. Another complicating factor is that often they will rely on the existence of a trusted party, which may be hard to achieve if consumers become more suspicious about platforms and their objectives.


The bottom line is that data markets, which are surely here to stay, pose enormous policy challenges as well as huge opportunities. Understanding the potential negative effects of data transactions as well as their benefits is an important step in formulating new public policies and a comprehensive approach to the regulation of data.


D. Acemoglu, A. Makhdoumi, A. Malekian, and A. Ozdaglar. Too much data: prices and inefficiencies in data markets. National Bureau of Economic Research, Working Paper No. 26296, 2019.


S. Athey, C. Catalini, and C. Tucker. The digital privacy paradox: Small money, small costs, small talk. National Bureau of Economic Research, Working Paper No. 23488, 2017.


K. C. Laudon. Markets and privacy. Communications of the ACM, 39(9):92–104, 1996.


H. Varian. Economic aspects of personal privacy. In Cyber Policy and Economics in an Internet Age, pages 127–137. Springer, 2002.


S. Zuboff. The age of surveillance capitalism: The fight for a human future at the new frontier of power. PublicAffairs, 2019.




Daron Acemoglu is Institute Professor at the Massachusetts Institute of Technology, Department of Economics. He is a leading thinker on the labor market implications of artificial intelligence, robotics, automation, and new technologies. His innovative work challenges the way people think about how these technologies intersect with the world of work.