This is a group post from a session held at the Big Data: Rewards and Risks for the Social Sciences conference in March (http://www.oii.ox.ac.uk/events/?id=557). Participants in the group were Chris Birchall, Michael Khoo, Cornelius Puschmann, Kalpana Shankar, Jillian Wallis, Janet Smart, Melissa Terras and Linnet Taylor.
This is an account of the session we held on the tools available for working with big data, and on the issues involved in getting access to relevant datasets for social scientific work. The session offered different perspectives on the use of Big Data. It included a survey of methods for harvesting, analysing and storing social network data, as well as other contexts of data creation, such as library activities, corpus analysis, and the linkage of existing data sets to create large increments of scale. The discussion on tools was fairly straightforward – there is an increasing number of tools for harvesting, analysing and presenting textual, numerical and related sorts of data. These tools (and, of course, the data themselves) have varying price points, and what is needed or desirable will necessarily depend on research needs and local resources.
Here is a list of the basic packages we, or our colleagues, are using currently for data access and visualisation:
- Datasift – commercial data provision service, whitelisted with social networks
  - offers Facebook data
  - $10 free credit – 20c/hour
  - pay per search: $5 for 22,000 records
  - exports as JSON/CSV
- GNIP – $25,000 for a full search for ‘digital humanities’
- Netvizz (network visualisation and analysis)
- IBM Many Eyes (data visualisation, various)
- APIs (for various social media sources)
- Gephi (for network visualisation and analysis)
- NodeXL (for social media, including Twitter feeds)
- Tableau (visual analytics, various)
- DiscoverText (text mining and analysis)
- Infobright (business data analytics, for your own data)
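Most of these services export harvested records as JSON, which researchers then often need in tabular form. A minimal sketch of flattening such records to CSV with only the Python standard library – the field names and sample records below are invented for illustration, not the actual schema of any of the services listed:

```python
import csv
import io
import json

# Hypothetical sample of harvested records, roughly as a service such as
# Datasift or a social media API might return them (field names invented).
raw = """[
  {"id": "1", "text": "big data workshop", "user": {"name": "a"}, "created_at": "2014-03-25"},
  {"id": "2", "text": "open access panel", "user": {"name": "b"}, "created_at": "2014-03-25"}
]"""

records = json.loads(raw)

# Flatten each nested JSON record into one CSV row, written to an
# in-memory buffer (a file would work the same way).
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "text", "user_name", "created_at"])
for r in records:
    writer.writerow([r["id"], r["text"], r["user"]["name"], r["created_at"]])

csv_text = buf.getvalue()
print(csv_text)
```

The same pattern scales to the exports of any of the tools above once the real field names are known.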
Another theme that arose was the diversity of data structures and the challenges of sharing data – in particular, the costs of documenting and formalising data structures to make them useful – which highlighted the difficulty of sharing, re-using and re-purposing data. How should we respond when data description and accessibility become a hurdle to data sharing and openness? Common metadata schemes and data construction descriptors could help, but could also prove a greater barrier than the one they aim to remove. Given that much of the value of data comes from combining diverse data sets, a formalisation of data construction methods – including the decisions, constraints and inputs introduced at various stages of the data life cycle – could be crucial in evaluating the insights gained from them.
When we started discussing big data vis-à-vis open access, the conversation became less straightforward. As we heard throughout the workshop, it is difficult to make many of these data sets “open” because of agreements with commercial-sector providers and similar entities. Is it even necessary to make these data open, given that their essentially ephemeral nature means they cannot readily be re-analysed? We discussed the role of research repositories, but most of them are not a good fit for the kinds of data social scientists are collecting, so there was some discussion of simply making the data available on a blog or similar. Of course, not all researchers are meticulous, or even fully cognizant, about what it takes to truly make data “safe” for public consumption. The question of business models also came up – for archiving these data, making them accessible and, of course, commercializing them.
We took a brief survey of the participants of OII’s workshop, Big data: rewards and risks for the social sciences, asking people what kinds of data they were using and how they had accessed it. There was a huge range of data, from geotagged social media and mobile call data to academic journal articles and abstracts. What was perhaps most interesting was that most people were using domain-specific aggregators such as JSTOR or Reuters rather than APIs or webcrawlers. Given the often bounded subjects of study this was not surprising, but it also formed a counterpoint to the common assumption that to use big data you have to be able to program complex collection mechanisms. Of those who had to pay for their data, about half the researchers had funded their access through grants and were unsure exactly how much it had cost. Only one had used a commercial service – Datasift – but reported paying the least for their data of the whole group who had paid anything.
The next section presents our discussion on sharing data and making it accessible across different platforms and disciplines:
Towards an open source and open access platform for social scientific big data research
Borgman, C. L. (1996). Social aspects of digital libraries. In E. A. Fox & G. Marchionini (Eds.), Proceedings of the 1st ACM International Conference on Digital Libraries (pp. 170–171). Bethesda, MD.
Open access and open source are two separate principles, each applicable to many aspects of social science big data. The open source and open access needs of social scientific big data research can be usefully framed within a data lifecycle model, in this case Borgman’s (1996). Ideally, all stages of this research – from initial data collection to final data access and distribution, along with the associated tools and services – should take place in open access environments and configurations (however defined). This involves at least two distinct components:
- data sets and analytical tools available to researchers for carrying out big data research
- tools/platforms for distributing big data results and data sets

At the moment, any open access/open source environment for big data research exists only in partial form.
Data gathering and analysis
In terms of data gathering and analysis, a range of tools is available to support researchers. They vary in cost, from APIs and harvesting protocols, which are free but require researchers to configure harvesting tools and queries themselves, to commercial services that provide front ends for APIs and allow researchers to configure data collection easily, but which charge for their services. There are a number of data visualization tools; R can be used for analysis, and there are also commercial visualization services, some free, some fee-based, and some of which may also retain copies of any analyses. The group identified some initial questions for a survey of such tools, assessing their different capabilities and costs:
- What are the company profiles?
- What are the business models of these companies?
- How much does it cost to do research?
- How much are you willing to pay for your data and analysis?
- What APIs do they use?
- What metadata do they use?
- What is the data provenance?
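The last two questions – what metadata the services use, and what the data provenance is – imply that each harvested data set should carry a record of how it was made. A minimal sketch of such a provenance record as a plain Python structure; every field name and value here is illustrative, not a standard:

```python
# Illustrative provenance record for a harvested data set; the field
# names are assumptions for the sketch, not an established schema.
provenance = {
    "source_api": "twitter-search",            # which API supplied the data
    "query": '"digital humanities"',           # the search terms used
    "collected_via": "commercial front end",   # e.g. a service like Datasift
    "collected_on": "2014-03-25",
    "record_count": 22000,
    "cost_usd": 5.00,
    "export_format": "csv",
}

# A simple completeness check before the data set is archived or shared:
# refuse to proceed if any core provenance field is missing.
required = {"source_api", "query", "collected_on", "record_count"}
missing = required - provenance.keys()
assert not missing, f"provenance incomplete: {missing}"
```

Formalising even this small amount of context at collection time is what would later let a re-user evaluate the decisions, constraints and inputs behind the data.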
Data archiving and distribution
In terms of archiving and distribution, there are fewer tools/services available. Further, there is little information on existing storage practices. Questions for understanding the requirements for open source and open access platforms for social scientific big data research here include:
- Whether researchers intend their data to be ‘inward’- or ‘outward’-looking, and if outward-looking, how the data are exposed for access and harvesting
- Whether there are institutional repositories available for archiving
- If so, do any mandates exist for deposition? – these appear to be rare
- Are there any benefits for researchers here (e.g. does this count towards tenure and promotion?)
- How interoperable are Institutional Repositories?
- Do they use standard metadata?
- Is the data managed in any way (e.g. through data management plans)?
- The different character of human subjects data versus ‘data’ data
- What implications are there for privacy? What are Institutional Review Boards’ policies/requirements?
Finally, in terms of distribution, ideally there would be an open infrastructure supporting easy cross-disciplinary search and retrieval of social science big data research – results, visualizations, data sets, etc. This infrastructure does not yet exist, apart from some examples in specific disciplines. A workable infrastructure would be welcome – but what would it be built on?
- What levels of data re-use currently exist?
- What models and platforms are currently in play or being discussed (such as Linked Data)?
- What standards for data sharing are being discussed?
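One concrete form the standards question can take is describing a shared data set with a common vocabulary such as Dublin Core, serialized as JSON-LD so it is usable in a Linked Data setting. A minimal sketch – the data set described here is invented, and only stock Dublin Core terms are used:

```python
import json

# A minimal JSON-LD description of a shared data set, using Dublin Core
# terms. The dataset identifier and details are invented for the sketch.
record = {
    "@context": {"dc": "http://purl.org/dc/terms/"},
    "@id": "http://example.org/datasets/workshop-survey",
    "dc:title": "Workshop survey of big data sources",
    "dc:creator": "OII workshop participants",
    "dc:date": "2014-03",
    "dc:format": "text/csv",
}

# Serialize for deposit in a repository or exposure for harvesting.
doc = json.dumps(record, indent=2)
print(doc)
```

Because the vocabulary is shared, an institutional repository in one discipline could harvest and index such records from another without bespoke crosswalks – which is exactly the interoperability the questions above are probing.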
In summary, there are a number of tools/services/platforms available for Big Data collection, analysis, archiving and sharing. Overall impressions are:
- that there is a need for open source data analysis tools and services;
- that the availability of tools and services, open access or otherwise, decreases as social science big data passes through the data lifecycle;
- that incentives to promote archiving in repositories are needed; and
- that any open access platforms for the data lifecycle are nascent.