A Decade in Internet Time: Symposium on the Dynamics of the Internet and Society, September 2011
This article challenges the assumption that large data sets necessarily provide the best information to social scientists.
While there are benefits to analyzing large data sets from online sites where users leave digital traces, the limitations of the data collected must also be taken into account. It is important to understand that Big Data (large data sets gathered about users of a particular online service or networking site) alone cannot answer all necessary questions in social science research.
- Large collections of data from social networking and other sites do not answer questions about why users behave in particular ways.
- The ways in which sites store information (more recent information often being more accessible) and the information that is gathered from a site (which questions it does and does not ask about a user) influence the conclusions researchers can draw by examining the data.
- Big Data is still subjective because the interpretation of the results is subjective.
- It is difficult to make claims about the general population or users of a particular service based on large data sets such as those from Twitter. Not all people have a Twitter account, and some accounts have multiple users. Not all content from Twitter is accessible because some posts are censored and private posts are not made publicly available.
- Big Data and whole data (meaning data that provides a complete picture) are not the same.
- Smaller data sets – even information about one individual – can still provide useful insight into how technology is used. Information that can be found in a small study may be overlooked in a Big Data study.
- Analysis done with small data cannot always be done better with Big Data. Taken out of context, data loses its value.
- Just because data is accessible does not make its use ethical: it is not clear whether ‘public’ data on a site should be used without requesting the user’s permission. This remains a highly contentious privacy-related topic in Big Data analysis.