I recently started work on a new project at the Oxford Internet Institute, ‘Accessing and Using Big Data to Advance Social Science Knowledge’. Along with Eric Meyer and Ralph Schroeder, I will be figuring out what ‘big data’ means for the social sciences. I use quotation marks because I’m still not sure what big data is. Fortunately (or unfortunately), by studying it we will contribute to defining it. Ralph has been asking people what they think ‘data’ is – a question that can cause some discombobulation if you think hard enough about it. Hence this post.
My uncertainty is not due to a lack of definitions. In practical terms, big data seems to be defined as an incrementally larger set of information in a particular domain, demanding new methods of interpretation. Those who need it to get ahead are busy using it rather than problematising it. For business, the huge landscape of social media offers new ways to chart consumers’ movements, preferences and sentiments. Consulting firms are selling new ways of manipulating personal data to governments and firms; banks are using individual-level data to predict risk, and economists and mathematicians are thinking about how to use micro-data from social media or online auctions to predict economic behaviour and trends.
The social sciences have been a little late to the party, but funding bodies have been incentivising researchers to investigate the potential of big data. Various groups of innovative researchers have started independent projects: the just in time sociology group looks at rioting and eating disorders, big data for social good is quantifying crime waves and giving Kigali pointers on urban planning, and there is a lot of slightly psychedelic market-oriented oddness going on in the Near Future Laboratory. Closer to home, neogeographers Floating Sheep are mapping the psychogeography of the web. If you need to know where the zombies are, they can tell you. And you never know when you may need that information. Big data in the social sciences is so hip it hurts. But are we converging on a single definition, or just labelling anything involving data that feels big?
Back in the 17th century when science was young and definitions were obedient, ‘data’ was used by philosophers to denote information which could be assumed to be factual. Data represented certainty, the starting point for critical thought. Today, though, certainty isn’t what it used to be and over the last century facts have received a good kicking from poststructuralism. This leaves us with functional definitions of data such as ‘information categorised according to its mode of collection’, or ‘information framed or organised for scientific analysis’. These work, but they’re boring. If we’re feeling less positivist we might also define data as a text which can be used heuristically, i.e. to seek meaning. This would involve looking at how data are used rather than where they begin, and following how they are organised – using a particular methodology and set of assumptions – into a text which can then be interpreted in order to explain social phenomena. Bruno Latour might like this definition. Depending on the methodology and the assumptions of the researcher, data may be a way of expressing an idea or getting to a new question, but they may also easily congeal into an instrument of surveillance and control. Large datasets involving private data may challenge the separation between the academic, governmental and commercial (who owns it? who gets to use it? who feels uncomfortable when they do?). While US researchers are more comfortable with exploring new methodologies to see where big data can lead and have fewer restrictions on the use of individual-level information, the EU has some experience of what can happen to sensitive personal information in the hands of governments, and is focusing on privacy concerns.
So. Two diverging perspectives on big data: a tool or a text. And a third suggestion: in the projects mentioned here, researchers are bringing together data sources and collections to explore a given idea much as people use language, with disciplinary norms as the grammar and methodology as the syntax. This highlights the question of whether we are talking about an incremental change or a paradigm shift. Is ‘big data’ a new form of expression, or merely a new dialect of an existing language?
Or perhaps both? It may depend on whether we are talking about ‘designed’ or ‘organic’ data. ‘Designed’ data are collected thematically and framed by a purpose, as with a population census or an opinion poll. ‘Organic’ data are a direct projection of elements of the technological environment, which may be generated by human interactions but which are emitted rather than gathered. Examples of this ‘data in the wild’ are RFID sensor emissions, automatically generated CCTV footage or the number plate images captured by highway cameras. Similarly, the web content created when we microblog (Twitter, Facebook) is organic, as are the ‘transactional’ data generated whenever we make a mobile phone call, do a Google search or buy something online. Organic data are messy, unorganised and demand that we talk directly to machines in order to make sense of them. However large the dataset, designed data can usually be analysed with statistical tools that seek out patterns and similarities, but organic data are different. They’re granular, and demand new tools. Instead of building a model to reduce the data to manageability, you build a way to interrogate the data in its entirety. You wouldn’t, for example, get much out of using a statistical model to analyse all the video uploaded to the web, but you might from developing a new tool that can recognise types of movement or emotion.
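The contrast can be sketched in a few lines of code. This is only an illustrative toy, not anyone’s actual method: the survey figures and the event stream below are invented, and a real ‘organic’ pipeline would of course involve far messier inputs than a tidy list of dictionaries.

```python
import statistics

# Hypothetical 'designed' data: survey responses gathered under a sampling frame.
survey_ages = [23, 35, 41, 29, 52, 38, 44, 31]

# Designed data lend themselves to reduction: a model or summary statistic
# stands in for the whole sample.
mean_age = statistics.mean(survey_ages)

# Hypothetical 'organic' data: an event stream emitted by devices rather than
# collected for a purpose (cf. RFID readings, CCTV, call records).
event_stream = [
    {"device": "cam-01", "kind": "motion"},
    {"device": "rfid-07", "kind": "read"},
    {"device": "cam-01", "kind": "motion"},
    {"device": "phone-3", "kind": "call"},
]

# With no sampling frame to reduce against, we interrogate every record,
# e.g. counting occurrences of a pattern of interest across the whole stream.
motion_events = sum(1 for e in event_stream if e["kind"] == "motion")

print(mean_age)       # summary of the designed data
print(motion_events)  # pattern count over the organic data
```

The design choice is the point: the first half throws information away on purpose (a mean stands in for the sample), while the second half keeps every record and asks a question of all of them.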
So are organic data speaking a new language? The challenge remains similar: to categorise and to order, maybe to test hypotheses – but the processes involved and the types of questions that can be asked seem to have a different flavour. Or is this just a new twist on an existing theme? It’s true that researchers developing projects around big data need to be able to code, or to have friends or research minions who do – but a lot of research already involves coding that is done upstream by the developers of software packages such as NVivo or Stata. Economic geographers and demographers already have to code in order to work with spatial and population datasets. Plus old and new data types can be combined – for instance, one could relate static health survey or census data to constantly updating locational data from mobile phone users in order to predict the development of an epidemic or a migration wave.
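A minimal sketch of that combination, with everything invented for illustration: two made-up census regions (the ‘designed’ side) joined to a made-up feed of mobile-phone location pings (the ‘organic’ side), normalised to flag regions with unusually high movement for their size.

```python
# Hypothetical 'designed' data: census population by region (static).
census = {"north": 10000, "south": 5000}

# Hypothetical 'organic' data: a rolling feed of mobile-phone location pings.
pings = ["north", "south", "south", "south", "north", "south"]

# Count pings per region.
counts = {}
for region in pings:
    counts[region] = counts.get(region, 0) + 1

# Normalise by census population so small regions aren't swamped by large ones.
pings_per_capita = {r: counts.get(r, 0) / pop for r, pop in census.items()}

# A crude flag: regions whose per-capita ping rate exceeds the overall rate
# might signal an epidemic front or a migration wave worth investigating.
overall_rate = len(pings) / sum(census.values())
hotspots = [r for r, rate in pings_per_capita.items() if rate > overall_rate]
```

The normalisation step is where the old and new data types actually meet: the organic stream supplies the numerator, the designed census supplies the denominator.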
So is the technical challenge of big data, generally mentioned to prove that this is a new turn in social science, just a red herring? A sceptic might label as ‘big data’ anything which has a weird name and gives rise to a claim of novelty (e.g. ‘culturomics’, studying cultural trends through quantification of digitised texts) – although Thomas Kuhn would say that this is exactly how paradigm shifts tend to get ignored by scientists. Is big data just a white rabbit leading the unwary down the hole into a world where the data drive the question rather than vice versa? (Though this is not new at all for social scientists – it’s called inductive reasoning and anthropologists have always done it.)
The world down the rabbit hole is recognisable, but things are a little different there. If data are a text, ‘big data’ is currently behaving more like literature than nonfiction. As a caution not to take ourselves too seriously, I enjoyed the definition produced by a recent Hadoop conference, where they decided big data was anything which would not fit into an Excel spreadsheet. This groups the works of Jane Austen with the human genome and IBM’s payroll records, and is actually quite helpful because it suggests ‘big data’ is not something researchers can make sense of and dominate, but rather something that forms a new environment in which research can take place.
So, down the rabbit hole to where the sheep are floaty and the ravens just like writing desks…