danah boyd Advocates for Responsible Data Science

By TAP Staff Blogger

Posted on November 29, 2022



Data science is increasingly being used to ground decision-making in both industry and public life. As data become significant and powerful, people who rely on those data come to expect certain things from the data. All too often, data are expected to be precise, neutral, and objective. Those data are expected to speak with confidence—and not reveal their limitations.

 

Microsoft researcher danah boyd is currently conducting a multi-year ethnographic study of the U.S. census to understand how data are made legitimate. In a talk she gave during Microsoft Research Summit in 2021, Dr. boyd discussed how illusions surrounding data can be weaponized. This talk, titled “Statistical Imaginaries: An Ode to Responsible Data Science,” highlighted how the U.S. Census Bureau’s decision to embrace differential privacy as part of its system to protect statistical confidentiality upended what people imagined the work of data to be. Dr. boyd also discussed the importance of grappling with uncertainty and limitations as a key part of responsible data science.

 

Below is an overview of Dr. boyd’s talk, “Statistical Imaginaries: An Ode to Responsible Data Science,” at the Microsoft Research Summit 2021. Recorded October 19, 2021.

 

Summary

 

Policymakers and the public often view census data as a set of neutral facts. Researchers and data scientists should design tools that help users resist the illusion that data are neutral.

 

Main Points

 
  • The census is mandated by the United States Constitution and serves as democracy’s data infrastructure, used to make key decisions about voting, funding of services, and public health.
     
  • Attempts to professionalize the collection of national statistics have met with significant political resistance.
     
  • People pretend that the use of data allows for neutral decision-making, but data are never neutral.
     
    • Some statistical techniques, like regression, have dubious historical roots in the eugenics movement.
       
    • The choice of which data to collect and how to collect them reveals ideological differences.
       
  • Data analysts find that clients in government and industry prefer to ignore problems of error and uncertainty.
     
  • People will refuse to participate in the census if their names can be linked to their data; a method called “differential privacy” allows statisticians to introduce noise into published statistics so that individual respondents cannot be re-identified.
     
  • The Census Bureau faced a public backlash when it adopted differential privacy for the 2020 Census, even though the Bureau had first used the technique years earlier to make previously inaccessible data accessible.
     
    • Users saw census data as a set of facts that should not be altered, rather than a mathematical product.
       
    • Statisticians were morally committed to transparency about their methods, but transparency can be a political nightmare.
       
  • A “statistical imaginary” forms when people construct a shared vision of what data are and could be; the key to responsible data science is keeping statistical imaginaries in line with reality, and technology researchers and scientists can design tools that help users resist the fantasy that data are neutral.
     
  • Official statistics typically include information about the population, some measure of national income, and some measure of people’s health and wellbeing.
     
  • In embracing innovation to improve data quality, industry can rely on opaque techniques, but public authorities must be more transparent to earn users’ trust.
     
  • When first adopting differential privacy techniques to protect the confidentiality of census data, national authorities could set out exactly how the data will be used and design the system to meet those uses.
     

Conclusion

 

Policymakers and the public often see census data as a set of neutral facts. However, decisions as to what data to collect, how to collect the data, and how to present the data are often deeply political. To avoid undermining policies based on data, government and business clients often prefer to ignore the limits of data, including uncertainty and the risk of error. When the Census Bureau adopted a method called differential privacy for the 2020 Census to better protect individual privacy, the agency faced a public backlash. Users viewed the data as a set of facts that should not be altered, rather than a product of mathematics. Technology researchers and data scientists may support responsible data science by developing tools that remind users of the limits of the data they are using.
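
As a rough illustration of how the noise injection described above works, here is a minimal sketch of the Laplace mechanism, the basic building block of differential privacy. The numbers (a count of 1,204, a sensitivity of 1, an epsilon of 0.5) are made up for illustration; this is not the Census Bureau’s actual disclosure avoidance system.

import numpy as np

def laplace_mechanism(true_count, sensitivity, epsilon):
    """Return a noisy, differentially private version of a count.

    Noise is drawn from a Laplace distribution with scale
    sensitivity / epsilon, so the published value reveals little
    about whether any one person's record was in the data.
    """
    scale = sensitivity / epsilon
    return true_count + np.random.laplace(loc=0.0, scale=scale)

# Hypothetical example: a small-area population count of 1,204.
# Adding or removing one person changes a count by at most 1, so the
# sensitivity is 1. A smaller epsilon means more noise: stronger privacy,
# but a less accurate published statistic.
noisy_count = laplace_mechanism(true_count=1204, sensitivity=1.0, epsilon=0.5)
print(round(noisy_count))

The epsilon parameter makes the trade-off visible: stronger privacy means noisier published statistics, which is precisely the kind of limitation and uncertainty Dr. boyd argues responsible data science should surface rather than hide.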

 

View Dr. boyd’s talk, “Statistical Imaginaries: An Ode to Responsible Data Science,” at the Microsoft Research Summit 2021. Recorded October 19, 2021.

 

Related Reading

 

“Differential Perspectives: Epistemic Disconnects Surrounding the US Census Bureau’s Use of Differential Privacy” by danah boyd and Jayshree Sarathy (Harvard Data Science Review, March 15, 2022, forthcoming)

 

Abstract:
When the U.S. Census Bureau announced its intention to modernize its disclosure avoidance procedures for the 2020 Census, it sparked a controversy that is still underway. The move to differential privacy introduced technical and procedural uncertainties, leaving stakeholders unable to evaluate the quality of the data. More importantly, this transformation exposed the statistical illusions and limitations of census data, weakening stakeholders’ trust in the data and in the Census Bureau itself. This essay examines the epistemic currents of this controversy. Drawing on theories from Science and Technology Studies (STS) and ethnographic fieldwork, we analyze the current controversy over differential privacy as a battle over uncertainty, trust, and legitimacy of the Census. We argue that rebuilding trust will require more than technical repairs or improved communication; it will require reconstructing what we identify as a ‘statistical imaginary.’

 

danah boyd is a Partner Researcher at Microsoft Research and the founder of Data & Society. Dr. boyd's research focuses on the intersection of technology and society, with an eye to how structural inequities shape and are shaped by technologies. She is currently conducting a multi-year ethnographic study of the U.S. census to understand how data are made legitimate. Her previous studies have focused on media manipulation, algorithmic bias, privacy practices, social media, and teen culture.

