At the end of March OII held a workshop on the potential of big data for social scientific research. The workshop brought together researchers from various continents and a wide variety of disciplines, with research interests including immigration and xenophobia, the genesis of innovation, labour markets and financial risk. The aim of the event was to connect researchers with each other and generate new work: participants were invited to bring a junior or senior partner with whom they wanted to explore a particular big-data-related question, and to connect across the group to investigate new issues.
The workshop started with some breakout groups to explore broad issues of interest. These were organised around four main topics: first, aspects of big data to do with risk and privacy; second, big data’s use as a tool for policymakers to understand the social world; third, our academic understanding of ‘big data research’ and what gets left out in the study of big data; and fourth, how data is organised and shared, and how this impacts research.
This workshop was particularly useful because it helped clarify an emerging characteristic of current social scientific discussions of big data: the tendency to conflate methodology with sociological approach. This was something also addressed in a workshop earlier this year at OII. The conflation matters because it goes to the recurring question of the role theory has to play in big data research, and whether big data may be leading to the ‘end of theory’. This seems to relate to a sense that sociological theory has less of a role to play as we become better at metaphorically ‘eyeballing’ huge datasets, so that the part no longer needs to stand for the whole. So how theory-driven does research using big data need to be? Or is it a descriptive exercise that will itself eventually generate theory? If the former, then do we need new theory, or does existing theory suffice? Furthermore, is research using big data just the next step in the scientization of social research, or does it constitute a new and different way of doing social science? The interplay of theory and methodology is central to understanding these questions.
Some of this debate about where theory belongs in big data research may also be about the extent to which theory is voiced. One can argue that there is always theory in play when social scientists do research, but that with big data they may be less likely to be explicit about their theoretical assumptions, focusing instead on clarifying their methodological choices. For instance, using a statistical approach versus a visualisation approach is based on theories about what the data is, and what it can tell us – but this is not usually framed as a theoretical discussion in big data work. Choosing to access one dataset rather than another is similarly a theoretical position; so is choosing a question to ask of it, and the way in which findings are turned into conclusions. So we are a long way from being free of theory – but the big data context does seem to change the way in which we talk about it.
This question of voicing theory may also relate to the interdisciplinarity that is emerging as a common feature of research using big data. Theory can be hard to translate across disciplines, and some disciplines involved in analysing large datasets, such as computer science or physics, use radically different types of theory. One of our discussions probed whether we can explore interdisciplinarity itself using big data, treating it as a flow rather than a series of distinct interactions, and how we can use big data methods such as visualisation, language and network analysis to look at processes of colonisation and pollination in knowledge transfer – the ways scientists connect and work together, and negotiate theory across disciplinary boundaries.
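To make the network-analysis idea slightly more concrete, here is a deliberately toy sketch. Everything in it – the names, the discipline labels, the co-authorship ties and the crude ‘brokerage’ score counting an author’s cross-disciplinary ties – is invented for illustration, not drawn from any real dataset discussed at the workshop:

```python
from collections import defaultdict

# Hypothetical co-authorship ties between researchers (all names invented).
edges = [
    ("ana", "ben"), ("ana", "carla"), ("ben", "dev"),
    ("carla", "dev"), ("dev", "emma"), ("emma", "farid"),
]
# Hypothetical home discipline of each researcher.
discipline = {
    "ana": "sociology", "ben": "sociology",
    "carla": "computer science", "dev": "computer science",
    "emma": "physics", "farid": "physics",
}

# Build an undirected adjacency structure from the edge list.
neighbours = defaultdict(set)
for a, b in edges:
    neighbours[a].add(b)
    neighbours[b].add(a)

# Crude 'brokerage' score: how many of an author's ties cross a
# disciplinary boundary. Higher scores suggest knowledge brokers.
brokerage = {
    author: sum(discipline[n] != discipline[author] for n in nbrs)
    for author, nbrs in neighbours.items()
}

for author, score in sorted(brokerage.items(), key=lambda kv: -kv[1]):
    print(author, score)
```

In a real study one would work with a large bibliometric network and a proper brokerage measure (e.g. betweenness centrality), but even this sketch shows how ‘flows’ of interdisciplinarity can be operationalised as questions about edges that cross category boundaries.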
Interdisciplinarity is also an important consideration in the discussion about the value of theory in big data research because it illuminates the differences in the way big data is collected: social scientists who collect their own data tend to have to be explicit about theory when they do so (what do you propose to study, why this and not that, and what do you expect will be interesting about it?). However, the kinds of datasets referred to as big data tend to be gathered for some other purpose, and are only accessed by a social scientist after the data has done the work for which it was gathered – to spread information, or inform business strategy, for instance. There is thus already an approach, a method and a taxonomy in place before a social scientist’s research question can be conceptualised.
Besides pulling apart the theory/practice dichotomy, several other ideas that surfaced from the breakout groups’ discussion were pragmatically oriented. One related to the connection between big data and the surveillance society, most recently in the news just before the workshop, and to the challenge of both managing the privacy risks related to the digital data people generate, and regulating how long data should be kept where it is valuable. On the one hand, some of our data exists beyond our reach and may end up being reused, recycled or otherwise cracked open (sometimes literally). On the other hand, we like to keep our most important data backed up, but if the firms that provide the repositories for it cease to exist, there is no regulation protecting our data from disappearing. What if your Dropbox, Google Drive or other cloud-based backup system went bankrupt tomorrow? It was pointed out that there is currently an institutional vacuum both regarding our right to be forgotten and our right to be remembered.
Other conversations focused on the nature of big data, and its demands on researchers and vice versa: first, what does small data have to tell us about big data – can we get to what’s different about big data by tracking its use at the micro-level? Second, what does open access data have to offer researchers? We are increasingly directed to make our data open for sharing and reusing, but what is needed for researchers using big data? Third, what can we learn from interrogating the ‘engineering solutions’ necessary for asking questions of big data, and the statistical methods used on it, which are increasingly based on nonparametric rather than parametric approaches – that is, working without strong prior assumptions about the distributional structure of the data.
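The parametric/nonparametric contrast can be illustrated with a small sketch. The code below is a toy example, not anything presented at the workshop: it simulates a heavily skewed variable (of the kind social-media traces often produce) and estimates its 95th percentile twice – once by assuming a normal distribution and fitting its parameters, and once nonparametrically, straight from the empirical distribution:

```python
import random
import statistics

random.seed(42)
# Simulated heavily right-skewed variable, e.g. session durations;
# many digital-trace variables look roughly like this.
data = [random.expovariate(1.0) for _ in range(10_000)]

# Parametric route: assume the data is normal, estimate mean and
# standard deviation, then read the 95th percentile off the fitted normal.
mu = statistics.mean(data)
sigma = statistics.stdev(data)
parametric_p95 = mu + 1.645 * sigma  # 1.645 is the normal z-score for 0.95

# Nonparametric route: no distributional assumption at all; take the
# empirical 95th percentile directly from the sorted sample.
empirical_p95 = sorted(data)[int(0.95 * len(data))]

print(f"parametric estimate:    {parametric_p95:.2f}")
print(f"nonparametric estimate: {empirical_p95:.2f}")
```

For skewed data the two estimates disagree, because the normal assumption misplaces the tail; the nonparametric estimate costs more data and computation but imports fewer assumptions – which is precisely the trade-off the workshop discussion was pointing at.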
Finally, there was also a strong methodological thread running through both days’ discussions: how our methodological choices influence the way we understand data, and how in turn the data shapes our methodological choices (in this case by being very large, noisy, resistant to parametric assumptions, or all three).
Overall, the workshop was useful in bringing out the questions that tend not to get asked as researchers try to figure out the mechanics of big data. What is particular about this type of research; why is it worth doing; what can it add to the body of social theory, and what theories can help to frame it and amplify the conclusions that we draw from it? As researchers apply themselves to the search for what one participant termed the ‘engineering solutions’ of big data work, they are also working in relation to different bodies of theory, which must often be stretched across disciplines and placed into conversation with other bodies of theory as interdisciplinary collaborations occur. Theory underlies the curiosity that is inherent in the way social scientists are investigating the potential of big data, and it informs the new methodologies being developed. By surfacing and articulating that theory, social science brings particular value to big data research.