This post comes out of the big data project currently underway at the Oxford Internet Institute. One of the questions we are asking in this research is what constitutes ‘big data’ for social scientists, and how it is changing the way they do their work. One prevalent assumption is that the most elementary distinction between ‘big’ and ‘not big’ data is that the former must be analysed with quantitative methods, while the latter need not be. However, as we have researched social scientists’ use of big data, one finding has become very clear: the bigger the data, the greater the qualitative challenge.
One extreme example of how big data gets boiled down to little data is Life in the Alpha Territory, a research project on London’s super-rich currently being undertaken by Roger Burrows and Caroline Knowles at Goldsmiths in London. Their starting point is the data owned by Experian, the credit rating firm, which records around 400 data points on every address in the UK, drawn from private and public sources. (Fairly big data by anyone’s standards.) They then use Richard Webber’s MOSAIC classification system to drill down and find the super-rich by postcode. But the aim of the project is not simply to further classify, visualise or quantify anything about the research subjects. In fact, it’s impossible to get to the research subjects themselves. So after identifying them, the researchers will instead go in and do what Burrows terms ‘deep ethnographic qualitative description of those neighbourhoods’, talking to the locals who actually live there and who are part of the local economy that revolves around the super-rich residents.
In this project the actual research occurs at several removes from the data. First the data is collected by a private entity, Experian. Then it is classified and crunched by MOSAIC. Then those classifications are analysed to pick out the neighbourhoods the project is interested in studying (a more qualitative process, seeking a range of types of territory). Finally, researchers go into the neighbourhoods in person and do ethnographic work to map the relationships and activities that go on around the super-rich. So effectively, Burrows and his colleagues are using big data to generate the territory for micro-level, highly qualitative work that does not focus on the ostensible subjects of the big data at all: the super-rich identified by MOSAIC are not targets for the project, which instead aims to study the ordinary people who surround them.
Another interviewee for our project, Rich Ling of the University of Copenhagen, offers a case in point for how even the biggest data is only given meaning by small-scale understanding. Working with data from Telenor, Norway’s main mobile phone network, Ling and his collaborators used mobile calling records to study how Norwegians’ calling patterns reacted to the Utøya massacre in 2011. But they also used focus group interviews to understand the meaning of those patterns: without the stories people told about what happened to them that day, the patterns could not be accurately interpreted.
These are just two examples of something we are finding everywhere in big data work. Economists using huge datasets such as LinkedIn and the IRS’s records on the financial life history of the entire US population tend to use big data approaches (ranging from distributed computing to more traditional methods) to sample the data, taking out just the part that can answer a particular question. It is both surprising, and yet intuitively obvious, that most of the questions we want to ask of big data are not about the ‘universe’ but about a particular corner of it, and that the main challenge of working with big data is usually reducing it to small data so that it can answer the question at hand. Even projects such as FuturICT, which purportedly aim to model everything and answer the really big questions (how does the economy work? how can cities become more sustainable?), are in fact aiming to collect and analyse data in extremely sophisticated ways in order to answer what are fundamentally local questions. There is no such thing as ‘the sustainable city’ or ‘the resilient economy’, only particular cities and economies at various levels, with locally determined parameters and problems to be addressed.
The idea that the quantitative is dependent on the qualitative is not new. Just ask anyone who has tried to clean a large dataset. Nothing that can be quantified – populations, economic development, literacy rates, migration – can become useful information without some understanding of the metadata: the descriptive element that tells you how the data was collected and how the object was defined. If you disagree, just try to come up with a single definition of what constitutes a rich neighbourhood, a migrant, or a poor child.
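To make the definitional point concrete, here is a toy sketch in Python. The neighbourhoods and figures are entirely invented (this is not Experian or MOSAIC data); the point is simply that two plausible definitions of ‘rich’ applied to the same records pick out different populations:

```python
# Invented figures: the same neighbourhoods, classified as 'rich'
# under two plausible but different definitions.
neighbourhoods = {
    # name: (median household income in GBP, share of homes worth over £1m)
    "A": (150_000, 0.70),
    "B": (90_000, 0.40),
    "C": (60_000, 0.55),  # modest incomes, but expensive housing stock
}

# Definition 1: 'rich' means median household income above £100k.
rich_by_income = {n for n, (inc, _) in neighbourhoods.items() if inc > 100_000}

# Definition 2: 'rich' means most homes are worth over £1m.
rich_by_housing = {n for n, (_, share) in neighbourhoods.items() if share > 0.5}

print(rich_by_income)   # {'A'}
print(sorted(rich_by_housing))  # ['A', 'C']
```

Neighbourhood C is ‘rich’ under one definition and not the other, so any quantitative claim about ‘rich neighbourhoods’ inherits whichever qualitative judgement went into the classification.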
Much of the discussion around big data as a source of information about the social world tends to focus on the universal nature of these datasets, and on the notion that by using sophisticated data manipulation tools the annoyingly messy qualitative element can be skipped. The notion of finally having ‘enough’ data is powerful: at last the research space becomes three-dimensional and the researcher can view her subject from all angles at once. Imagine a digital city viewed through public records, social media, constant crowdsourcing and flash polls informing urban decision-making, inhabitants’ mobile phone call records, and the ‘data exhaust’ from all the electronic devices they use every day. Or the idea of the ‘smart home’, a kind of automated panopticon packed with embedded sensors that monitor the resident’s every move and heartbeat, warning if they are in danger and thus enabling the old and infirm to live independently.
These ideas are, of course, dependent on a particular vision of modernity which dates back to the industrial revolution. They imply that big data is just another step in humanity’s irresistible progress toward a more homogenous, harmonious, manageable social world. Counter to this discourse is a more critical one epitomised by danah boyd and Kate Crawford’s excellent article, ‘Six Provocations for Big Data’. They point out that objectivity and perfect accuracy in datasets are a fiction; that increasing the size of a dataset magnifies any bias or error it contains; and that networks mediated by technology may not be analysable using assumptions drawn from small-scale qualitative studies of communicative behaviour.
So our ability to understand patterns may not be keeping pace with our ability to identify them, and harmony and homogeneity may have to wait. The world is, in fact, lumpy and full of inequalities, and the more you try to analyse it as a whole, treating any dataset as a ‘universe’, the more you will run into problems of classification, non-representativeness and the pure intractability of data about human behaviour. In terms of the social sciences, the rule so far has turned out to be that if we can simplify it, it’s because we don’t understand it yet.
Some of this is disciplinary, of course. If you are working from a Communication Studies perspective, then a Twitter dataset can arguably be treated as the ‘universe’ of objects you are interested in. Or if you are a theoretical physicist interested in studying how complex classification systems evolve, a Wikipedia dump plus the records of the Universal Decimal Classification system can become two universes to be compared, as this team from the Netherlands has done. But once you start trying to derive conclusions about behaviour, rather than merely about what patterns are emerging from the data, big data presents a whole set of new problems in terms of understanding what’s present and what’s missing.
So could big data, seen in this way, ironically contribute to the breaking down of the embattled qualitative-quantitative distinction in the social sciences? It could, but only if social scientists themselves acknowledge the presence of the qualitative within the quantitative – where it lives already, but where it generally goes unseen and unidentified.