Economics in the Age of Big Data

By Jonathan Levin

Posted on April 21, 2014


Large-scale administrative datasets and proprietary private sector data can greatly improve the way we measure, track and describe economic activity. Liran Einav and TNIT member Jonathan Levin outline some of the opportunities and challenges for economic researchers in accessing and using these ‘big data’.


Twenty years ago, data on economic activity were relatively scarce. Economists of our generation were trained to work with small datasets and econometric methods that may turn out to be quite different from those that current graduate students will use. In this essay, we suggest some of the opportunities that big data offers and how economic research might adapt to take full advantage of them.

In his 2010 Ely Lecture to the American Economic Association, Hal Varian linked the expansion of economic data to the rise of ‘computer mediated’ transactions. As an example, consider retail transactions. A few decades ago, a store might have tracked daily sales, perhaps split by products or product categories. Scanners made it possible for retailers to record individual purchases easily and automatically, and if customers were using loyalty cards, to link purchases over time and create customer histories for use in marketing and promotions.

Nowadays, an Internet retailer records far more than a customer’s purchases: it tracks her search queries, the items she viewed and discarded, the recommendations or promotions she saw and the reviews she might leave subsequently. In principle, these data can be linked to other online activity, such as browsing activity, advertising exposure or social media consumption.

A similar evolution has occurred in industries such as financial services, healthcare and real estate, and also in business activity. As firms have moved their operations online, it has become possible to compile rich datasets of sales contacts, hiring practices and physical shipments of goods. Increasingly, there are also electronic records of collaborative work efforts, personnel evaluations and productivity measures. The same story applies to the public sector in terms of the ability to access and analyze tax filings, government expenditures and regulatory activities.

What opportunities do these new data offer for economic research? One possibility is the creation of new economic statistics that allow for closer and more disaggregated tracking of economic activity. For example, during the financial crisis, there were relatively limited data available on how sharply consumer spending was dropping and how difficult firms were finding it to obtain credit and maintain their working capital.

Improved data can also facilitate and enhance the type of empirical research that economists have been doing for decades: documenting and explaining historical patterns of economic activity and finding research designs that make it possible to trace out the causal effects of different policies. In fact, more granular and comprehensive data are likely to allow a range of clever and novel research designs: for example, by matching individuals more finely to create plausible control groups; or by taking advantage of discrete policy changes that create discontinuities in a cross-sectional or time-series dataset with many closely spaced observations.
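As a stylized illustration of the discontinuity idea mentioned above, the sketch below simulates many closely spaced observations around a policy cutoff and estimates the jump by comparing average outcomes in narrow windows on either side. All numbers, the cutoff, and the outcome model are invented for illustration; this is not a full research design.

```python
import random

random.seed(0)

# Simulated data (all numbers invented): an outcome that jumps by 2.0
# when a running variable crosses a policy cutoff at 50.
CUTOFF, JUMP, BANDWIDTH = 50.0, 2.0, 5.0
data = []
for _ in range(10_000):
    x = random.uniform(0.0, 100.0)   # running variable (e.g. an eligibility score)
    y = (JUMP if x >= CUTOFF else 0.0) + random.gauss(0.0, 1.0)
    data.append((x, y))

# Naive discontinuity estimate: compare mean outcomes in narrow windows
# just below and just above the cutoff. (Real designs also fit trends in
# x on each side; this flat-baseline simulation skips that step.)
below = [y for x, y in data if CUTOFF - BANDWIDTH <= x < CUTOFF]
above = [y for x, y in data if CUTOFF <= x < CUTOFF + BANDWIDTH]
estimate = sum(above) / len(above) - sum(below) / len(below)
print(round(estimate, 2))
```

With many observations near the cutoff, the window comparison recovers the simulated jump closely; the density of data is what makes such designs feasible.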

New data may also lead economists to adopt and develop new econometric tools. A natural candidate is the set of predictive modeling techniques that are already widely used in statistics and computer science. These tools have not yet caught on much with economists, perhaps because the emphasis they place on prediction seems so different from the causal identification framework that has dominated empirical microeconomics.

But in our view the distinction is not so sharp, and these techniques are likely to become quite popular, whether to construct single-dimensional measures of heterogeneity for use in economic models (for example, risk scores, credit scores and quality scores) or to construct matched samples or instruments for causal inference.
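One hedged sketch of the "score" idea: a predictive model collapses several observed features into a single number that can then enter an economic model in place of the raw data. Below, a hand-rolled logistic regression is fit to invented borrower data to produce a one-dimensional default risk score; the feature names, coefficients and data are all illustrative assumptions, not any particular lender's model.

```python
import math
import random

random.seed(1)

# Invented borrower data: default probability depends on a debt ratio
# and a count of late payments via a "true" logistic model.
def make_borrower():
    debt_ratio = random.uniform(0.0, 1.0)
    late_payments = random.randint(0, 5)
    logit = -2.0 + 3.0 * debt_ratio + 0.8 * late_payments
    defaulted = random.random() < 1.0 / (1.0 + math.exp(-logit))
    return (debt_ratio, late_payments), defaulted

train = [make_borrower() for _ in range(3_000)]

# Fit by plain gradient ascent on the average logistic log-likelihood.
w = [0.0, 0.0, 0.0]  # intercept, debt_ratio, late_payments
for _ in range(200):
    grad = [0.0, 0.0, 0.0]
    for (x1, x2), y in train:
        p = 1.0 / (1.0 + math.exp(-(w[0] + w[1] * x1 + w[2] * x2)))
        for j, xj in enumerate((1.0, x1, x2)):
            grad[j] += (y - p) * xj
    for j in range(3):
        w[j] += 0.5 * grad[j] / len(train)

def risk_score(debt_ratio, late_payments):
    """One-dimensional summary of borrower heterogeneity."""
    return 1.0 / (1.0 + math.exp(-(w[0] + w[1] * debt_ratio
                                   + w[2] * late_payments)))

# A riskier borrower should receive a higher score.
print(risk_score(0.9, 4) > risk_score(0.1, 0))
```

The point is not the particular estimator: any predictive method that outputs a well-calibrated scalar can serve the same summarizing role in a downstream economic model.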

What challenges do economists need to overcome to take advantage of new data? A number of open issues revolve around data access. For a long time, empirical research in economics relied heavily on government survey data, which had the virtue that there were well-established (though sometimes cumbersome) protocols for accessing and using these data, and results could be verified or refined over time.

These systems are still being worked out for the US administrative data that have recently been used for research: from the Internal Revenue Service, Medicare or the Social Security Administration. Some European countries, such as Norway, Sweden and Denmark, have gone much further to facilitate research. Their experience suggests that broader access is possible, and that reducing barriers to data access can have a profound effect on the amount of research and the quality of what is learned.

Accessing private data creates other issues. Not every company wants to work with researchers. Many see it as potentially beneficial and a useful way to learn from outsiders; but others may view it as a distraction or worry about the publicity risks.

To mitigate these risks, researchers who collaborate with companies generally need to enter into contracts to prevent disclosure of confidential information, and may face some limits on questions they can investigate. Our experience has been that the benefits of working with company data generally far outweigh the costs, but that a fair amount of effort on both sides is required to develop successful collaborations.

Private sector data can also be limited. They often contain information only on a firm’s customers, who may not be representative even within a particular industry. In addition, many private datasets are collected for transactional purposes, and may contain a specific set of information that is ideal for some purposes but not for others.

For example, there is a computerized record of practically every physician visit in the United States, but it is generally an insurance claim that records the information necessary for payment rather than much actual health information, such as patients’ biometrics or how they feel. Nor is it easily linked to employment records, household financial information or social network indicators. It is possible that what we can learn from individual private datasets will prove to be far less than what might be learned from linking information that is currently separated.

A second challenge for economists is learning the skills required to manage and work with large datasets. Virtually all successful internet firms, and many firms in other sectors, are investing not just in data storage and distributed data processing, but in skilled computer scientists and engineers. At Stanford, we have heard of computer science majors earning over $200,000 in their first year out of college - talk about a skill premium!

Interestingly, however, even when these companies hire ‘data scientists’ to look for empirical patterns in their data, they generally focus on engineers rather than economists. Our expectation is that future economists who want to work with large datasets will have to acquire some new skills so that they can combine the conceptual framework of economics with the ability to implement ideas on large-scale data.

Finally, a big challenge in our view is that just having a lot of data does not automatically make for great research. In fact, with very large, rich datasets it can be non-trivial just to figure out what questions can be answered. While in the past, researchers could simply browse through their dataset and get a sense of its key features, large datasets require time and effort for conceptually trivial tasks, such as extracting different variables and exploring relationships between them.
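To make the point concrete, here is a minimal sketch of the kind of first-pass exploration that large datasets demand: a single streaming pass that summarizes one variable against another without ever loading the full file into memory. The tiny in-memory CSV is a stand-in for a large transaction log; the column names and values are invented.

```python
import csv
import io
from collections import defaultdict

# Illustrative stand-in for a large transaction log; in practice this
# would be a file handle streamed from disk, not an in-memory string.
raw = io.StringIO(
    "user,category,amount\n"
    "a,books,12.5\n"
    "b,games,30.0\n"
    "a,books,7.5\n"
    "c,games,10.0\n"
)

# One streaming pass: running count and sum per category, so memory
# use stays constant no matter how many rows the file has.
count = defaultdict(int)
total = defaultdict(float)
for row in csv.DictReader(raw):
    count[row["category"]] += 1
    total[row["category"]] += float(row["amount"])

means = {cat: total[cat] / count[cat] for cat in count}
print(means)  # {'books': 10.0, 'games': 20.0}
```

Even this trivial summary requires deliberate code once a dataset no longer fits comfortably in memory, which is exactly why "browsing the data" stops being free.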

Looking at our own experience over the last few years, and that of our students who have done successful projects with internet data from retail platforms (eBay), job matching platforms (oDesk, Freelancer), lending platforms (Prosper), sharing platforms (Airbnb, TaskRabbit) and financial management sites, one pattern stands out: most projects started with a relatively long and slow process of figuring out exactly what was in the data and how to work with it.

The situation may turn out to be different with administrative datasets to the extent that they end up being used by many researchers: over time, there will be common learning about the advantages and drawbacks of the data, as well as about methods for organizing them and exploring different questions. This may be one further difference between research with government datasets, which may come to occupy many economists if access is broadened, and research with proprietary datasets, which are likely to allow much more limited access.

Overall, it seems pretty clear to us that over the next few decades, big data will change the landscape of economic research. We don’t think it will substitute for common sense, economic theory or the need for careful research designs. Rather, new data will complement them. How exactly remains to be seen.

This article draws on ‘The Data Revolution and Economic Analysis’ by Jonathan Levin and Liran Einav.
Liran Einav and TNIT member Jonathan Levin are at Stanford University.

The preceding post is republished on TAP with permission by the Toulouse Network for Information Technology (TNIT). “Economics in the Age of Big Data” was originally published in TNIT’s March newsletter.