I just wrote a paper about claims I’ve heard in the development/humanitarian research world that big data should be treated as a public good. It’s called The ethics of big data as a public good: which public? Whose good? And you can find a pre-print here.
I argue that there’s a reason mobile operators are not responding positively to these calls. They can’t make mobile data freely available, even ‘fully anonymised’: there’s no such thing as full anonymisation, and mobile data is simply too risky to share freely. The paper looks at the available models through which this kind of data is being made available, and explores a couple of projects at Telenor and Orange where it’s happening.
I also look at the problems with arguing that big data should, on principle, be open data (or at least open to researchers). One is that the current decision-making logic used to decide whether data can do good is problematic. In a recent meeting, one data-analytics-for-social-good initiative proposed that it go like this:
- Determine the possible harm a particular use of data may cause, and how likely that harm is to occur;
- Decide who the beneficiary of the project should be;
- Determine the possible good the project can do.
Can anyone see the flaw here? The projections of harm are always generic in this model, but the possible benefits are contextualised. Therefore this model will always confirm that you should conduct your project. Here’s an example.
- My de-identified mobile calling dataset still has the potential to identify people by ethnicity, age, gender, religion and political affiliation. These may cause harm through … umm… well, I’d know if I knew which country we were talking about. HARM = UNKNOWN.
- We could use the dataset to determine poverty levels in country X.
- Country X seems to have a lot of poverty and people are really suffering there. This data can show where this is occurring. BENEFITS = ALLEVIATING POVERTY.
Result: use the data on country X, because the potential to alleviate poverty trumps unknown harm. However, if we do some research on country X, we find that poverty is accompanied by political instability, a lack of democratic representation and free speech, and oppression of minorities. This is obviously hypothetical, but poverty often correlates with several of these conditions. So if we consider the problems in context, identifying people by ethnicity, age, gender, religion and political affiliation could be pretty useful to someone looking to confirm the movement and behavioural patterns of political opponents, of smugglers, of migrants or of other groups they might want to influence.
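The structural flaw in the logic above can be made explicit with a toy sketch (my own illustration, not from the paper, with made-up function names): when harm is left generic or unknown, it effectively scores as zero, so any concrete benefit estimate wins.

```python
# Hypothetical sketch of the two decision logics described above.
# None means "harm unknown / only generically projected".

def flawed_decision(projected_harm, projected_benefit):
    """The problematic logic: unknown harm defaults to zero,
    so any contextualised benefit estimate clears the bar."""
    harm = projected_harm if projected_harm is not None else 0
    return projected_benefit > harm  # True means "proceed"

def contextual_decision(projected_harm, projected_benefit):
    """The alternative argued for here: unknown harm blocks the
    project until harms are researched with the same specificity
    as benefits."""
    if projected_harm is None:
        return False  # go and research the harms in context first
    return projected_benefit > projected_harm

# Unknown harm, concrete benefit estimate of (say) 10:
print(flawed_decision(None, 10))      # proceeds regardless
print(contextual_decision(None, 10))  # does not proceed
```

The point of the sketch is only that the first function can never return False for a project with unresearched harms, whatever the benefit.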
I argue that we should move from treating unknown harm as a signal that it’s a good idea to proceed, to researching harms with the same specificity that we research benefits. This means, though, having a more nuanced understanding of poverty, oppression and conflict. It also means that mobile operators are well placed to provide this nuanced understanding, because their data comes from country subsidiaries and is therefore collected and handled by people with a good idea of its potential for good and bad.
My paper suggests that we let mobile operators be the judges of what is ok to do with mobile data, because they are predisposed to be conservative about protecting their customers. Not all problems require data science, just as not everything we can’t understand needs to be poked with a stick.