Can we trust machines to manage our data?

In a recent post on the World Economic Forum’s blog, Kate Crawford argues that two of the biggest challenges posed by big data are preserving due process and reducing power imbalances at a time when data scientists have an unprecedented ability to ‘classify and quantify human life’.

In contrast to her focus on the human scale, and to real-time consideration of the risks big data can present, there is a spate of innovation by engineers looking to solve data governance problems by taking humans out of data processing decisions. Two examples are the Ethereum and Enigma projects, both of which provide architectural responses to problems of data governance using the blockchain model. These models envisage deregulation and automation as a way of addressing what are currently ethical problems: who should have access to data about us, and what they should be able to do with it.

These projects, outlined by Alex (Sandy) Pentland in this recent blog post, are brilliant in terms of engineering, but they don’t address the anarchic landscape of intermediaries where data are shared across borders and subjected to wide-ranging types of data mining and distributed processing. Instead, they are highly sophisticated systems for simplifying and automating the way data are channelled and permissions are attached to them. Pentland argues that bureaucracies cannot be trusted with our data, and that rule systems run by machines are more appropriate custodians.

This poses a problem, however. In the real world of data flows, no application can audit all the ways we are tracked and control the resulting information. Even if it could, much of the information that reflects our activities and identities never comes into contact with us. If we pass by a sensor and it captures some of our internet traffic, that traffic is not identified as part of a profile of us, and there is no way to attach metadata to it saying that it should be treated as data about us. Yet if we look at the data market through the work of Frank Pasquale, for example, it’s clear that a lot of data that reflect our most intimate and private characteristics – yet aren’t classed as belonging to us – are regularly bought and sold on the open market in ways that Pasquale terms ‘black boxed’: impossible for us to audit or control. For example, Pasquale notes that in the US it’s legal for marketers to create profiles such as ‘has been raped’; ‘diabetic-concerned’ (someone who searched for diabetes information online); and ‘daughter killed in car crash’.

In the EU this degree of profiling is not possible, because broad (and human-interpreted) data protection regulations forbid it. The Transparent Holland experiment recently publicised examples of the kind of profiling possible in the Netherlands, finding that with the data available it could construct financial risk scores for almost anyone. In comparison to the US case, however, the resulting profiles are actually fairly comforting: they show details about people’s neighbourhood house prices, the energy classification of their building, their level of education, and other fairly mundane facts. The only really sensitive detail is whether someone has been bankrupt. The wealth of precise detail available to US data intermediaries is, so far, absent.

The automated solutions suggested by Pentland and other leading computer scientists are intended for types of data that are cleanly demarcated in terms of their purpose, such as financial, health or genetic data. They don’t, however, offer a solution to the kinds of data processing that take place under exceptions to the rules of purpose limitation, such as humanitarian-sector analytics in crisis situations, or data mining for public policy. Nor can they address the way data may be transformed from non-sensitive to sensitive through merging or linking with other datasets. And, in a more wide-ranging problem, they don’t address the kind of observed data that feed data markets worldwide. Machines will recognise data with an explicit link to an individual – not the kind of ‘data doubles’ that are built from myriad sources to replicate us for analytical purposes.

This is not to suggest that automated data categorisation and management is a bad idea. But the world of big data is full of situations where it seems unwise to remove contextual human decisionmaking from the data value chain. In development or humanitarian data analytics, such decisions include choosing levels of aggregation that obscure data subjects’ identities; deciding whether to restrict access to datasets, models or end products so they can’t be used for nefarious purposes; and assessing, in context, whether a particular analytics project presents risks to a particular group. These are all questions that can only be answered on a project-by-project basis. Governing data in these cases requires contextual knowledge and considerable expertise, both technical and sociological, in order to understand and weigh the possible risks and benefits of a given project. This is not a coding process, or a process of classifying data as broadly sensitive or not, but a risk-based decisionmaking process that can only work by incorporating local knowledge of what the risks are.

If this vision of contextual, human-led data governance were applied within our current institutions, it would create massive bottlenecks in situations where data may be needed in a hurry, for example in real-time epidemiological analysis. One might point out that there is an international association of data protection and privacy commissioners – surely they could make this kind of decision at the international level. Yet this would be hugely problematic in the case of development data analytics because of the underrepresentation of developing countries: the membership of that international association numbers more federal states from within Germany (fourteen) than African, Asian and Latin American countries in total (just twelve). Established legal compliance measures from the private sector are also difficult to apply, because development data are used in locations, and in ways, that go far beyond the ability of Privacy Impact Assessments (the process firms currently use) to ascertain whether they will lead to harm.

Privacy self-management is another approach that doesn’t appear well suited to the current situation in developing countries. In places where not everyone who is made visible by their data also has an internet connection, or meaningful connectivity, it becomes impossible for people to see where their data are travelling and to make ongoing decisions about them.

To conclude: despite what Lawrence Lessig says, code isn’t law. Or at least it shouldn’t be with regard to sharing big data, and particularly not for development research or humanitarian purposes. Law and regulation are slow and fallible, but they are also accountable and contextual. In order to understand the risks of data use, we have to know the place and the people in question. Where data analysts don’t have that knowledge, we need decisionmaking infrastructures that include people who do know.

Sometimes researchers build risky models out of safe data, as happened with the Harvard Signal Program, and sometimes open and public data can become sensitive when used in particular ways or merged with other datasets, as shown by Frederik Borgesius, Mireille van Eechoud and Jonathan Gray’s new study. To deal with problems like these – the kind of problems we can’t imagine until we create them – we need a whole new human and institutional infrastructure that can evolve and learn as those problems emerge. We’ve done this elsewhere: we’ve regulated power through democratic checks and balances, and we’ve regulated the academic and medical spheres through qualifications, ethics boards and review processes. We’re in an ongoing struggle to regulate the financial sphere, which is taking a lot of innovation – and the reinsertion of humans into a process that has long been automated. It’s time we addressed digital data the same way, rather than expecting this massive and evolving industry to somehow self-regulate or align itself with existing decisionmaking structures. Automation is not the answer – though it may be a very useful tool for human regulators. If we want decisionmaking that can work for the whole world, we must be brave enough to take the problem out of the engineering sphere and into the social and political sphere where it belongs.
