The internet archiving puzzle: who can capture the web, and what happens if they do?

On Tuesday I went to a workshop on Big-Data Analytics for the Temporal Web run by LAWA (Longitudinal Analytics of Web Archive data). LAWA is an EU-funded project to develop new archiving techniques for the web, and is the European counterpart to the US-based Wayback Machine. In Paris, in a startlingly low-tech classroom at the Conservatoire National des Arts et Métiers, with slides projected on a wall and only one power socket, some of the smartest technical minds in web analytics came together to discuss some fiercely intractable problems arising where time intersects with web datasets of billions of items in a constant process of evolution.

LAWA’s eventual product will be a Virtual Web Observatory which will bring together the techniques developed under the project and make them accessible to users who want to query the archive. These users are currently conceptualised as people who can code, but there are efforts by some of the participants such as the Internet Memory Foundation to make the web observatory searchable by ordinary mortals. The workshop brought those working on LAWA projects together to discuss the ‘temporal web’ – the fact that archiving the web involves tracking change over time rather than merely taking an extremely large snapshot.
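To make the 'temporal web' idea concrete: an archive has to store not one copy of a page but every version it has ever crawled, and answer questions like 'what did this URL look like on a given date?'. Here is a minimal sketch of that pattern (the class name and integer timestamps are my own illustration, not LAWA's actual architecture):

```python
import bisect


class TemporalArchive:
    """Toy temporal store: keep every crawled version of a page and
    answer 'what did this URL look like at time t?' queries."""

    def __init__(self):
        # url -> sorted list of (timestamp, content) pairs
        self.versions = {}

    def put(self, url, timestamp, content):
        # insort keeps each URL's snapshots ordered by timestamp
        bisect.insort(self.versions.setdefault(url, []), (timestamp, content))

    def get(self, url, timestamp):
        """Return the latest version at or before the requested time."""
        snaps = self.versions.get(url, [])
        i = bisect.bisect_right([t for t, _ in snaps], timestamp) - 1
        return snaps[i][1] if i >= 0 else None


arc = TemporalArchive()
arc.put("example.org/page", 100, "version 1")
arc.put("example.org/page", 200, "version 2")
```

A query between the two crawl times returns the earlier snapshot; a query before the first crawl returns nothing, which is exactly the difference between an archive and a snapshot.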

Several of these projects relate to web crawlers. If you are an archivist, your first task is to develop a crawler that finds the right content – a whole question in itself, since what counts as 'right' depends on your perspective: a century from now, will people be more interested in the recent US election or in all the millions of pages of numerical content that record a day's stock trading on the NYSE? Your second task is to capture every detail and aggregate it in order to turn it into reference material. As Julien Masanès of the Internet Memory Foundation points out, there is no way to know what people searching the archive in the future will want to know about us.
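The 'find the right content' step is what crawler researchers call focused crawling: only archive pages a relevance test accepts, and only follow links found on those pages. A minimal sketch, using a toy in-memory 'web' and a crude keyword filter in place of the trained classifiers real systems use (all names here are my own illustration):

```python
from collections import deque

# A toy in-memory "web": url -> (text, outgoing links). A real crawler
# would fetch over HTTP; here we just read this dict.
PAGES = {
    "a": ("election results analysis", ["b", "c"]),
    "b": ("more election coverage", ["d"]),
    "c": ("cat pictures", ["d"]),
    "d": ("election turnout figures", []),
}


def is_relevant(text):
    # Crude topical filter standing in for a trained relevance model.
    return "election" in text


def crawl(seeds, max_pages=100):
    """Breadth-first focused crawl: archive pages that pass the
    relevance test, and only expand links found on relevant pages."""
    frontier = deque(seeds)
    seen = set(seeds)
    archive = {}
    while frontier and len(archive) < max_pages:
        url = frontier.popleft()
        page = PAGES.get(url)
        if page is None:
            continue
        text, links = page
        if not is_relevant(text):
            continue  # off-topic: don't archive it, don't follow its links
        archive[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return archive


archived = crawl(["a"])
```

Starting from seed "a", the crawl archives the three election pages but never stores the cat-pictures page, even though it is linked from a relevant one – which is precisely the perspective problem above: the filter decides in advance what the future gets to see.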

Once they figure out the target, archivists then have to help the crawler recognise it. This involves natural language processing and is one of the most complicated questions computer scientists are currently dealing with. How do you help a crawler, searching for mentions of the flu in order to try to predict a coming epidemic, distinguish between a tweet that says 'I've got the flu, so no partying for me' and one that says 'It's as if I've got the flu! No more tequila for me'? Dates are a challenge too. If someone posts a statement about the Haitian earthquake of 2010, and mentions that an earlier catastrophic earthquake occurred in 1564 and that in 2012 Haiti is recovering, how do you get your crawler to recognise that this is a post about the 2010 earthquake?
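The date problem, at least, can be illustrated with a toy heuristic: of all the years mentioned in a post, pick the one closest in the text to the event word. This is my own simplification – real temporal taggers use far richer rules and context – but it shows the kind of reasoning a crawler has to do:

```python
import re


def anchor_year(text, event_word="earthquake"):
    """Toy heuristic: of all four-digit years mentioned, return the one
    whose mention sits closest (in characters) to the event word."""
    years = [(m.start(), m.group())
             for m in re.finditer(r"\b(1[0-9]{3}|20[0-9]{2})\b", text)]
    events = [m.start() for m in re.finditer(event_word, text, re.IGNORECASE)]
    if not years or not events:
        return None
    # Score each year by its distance to the nearest event mention.
    return min(years, key=lambda y: min(abs(y[0] - e) for e in events))[1]


post = ("The 2010 earthquake devastated Haiti; a catastrophic earthquake "
        "also struck in 1564, and in 2012 the country is still recovering.")
```

On the example post, the heuristic correctly anchors the text to 2010 rather than 1564 or 2012 – but it is easy to construct sentences that fool it, which is why this remains an open research problem rather than a solved one.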

And once you figure out how to find it and store it, how do you then make it usable? Robert Fischer from Germany’s SWR TV network is at the user end of the spectrum of LAWA’s work, and is involved in ARCOMEM, a project to create a searchable archive of all SWR’s web content. ARCOMEM will become one way to look at all the social media activity around a particular event: blogs, social media, video and photo networks. The ARCOMEM project suggests that one way internet archives may become more user-friendly is by archiving everything referring to a particular event such as a hurricane or a rock concert, and allowing people to search the way they would use a reference library.

Despite all the technogeeky brilliance occurring here, Fischer notes a problem: we don’t know if this is legal. All these technologies work fine when the content being archived belongs to the archiving organisation. But when it doesn’t, a whole new set of questions opens up. What happens if all the social media content which has already been archived becomes protected for ten years? Or twenty? He guesses that 95% of the really interesting work being done on archiving and semantic querying is going to turn out to be illegal with regard to country-level privacy laws.

So what are the implications of web archiving for people's right not to be associated with certain content? Here the interests of archivists may diverge from those of web users. For example, a significant portion of social media content is later taken down by those who posted it. Archivists refer to this content as 'lost' (as in this article on the Egyptian revolution) as if it had burned up in a tragic library fire. In fact, those who originally posted it owned the content and may have taken it down for good reasons as the country's political landscape evolved or as their own position changed. Fischer understands why certain developments may make people want to take back their content: in 1998 he helped develop a facial recognition tool to identify Bill Clinton in archived content on US elections, and the technology was then adopted by Israel to identify potential terrorism suspects. To quote him, 'all these technologies are guns' and we can't predict who may point them at us in the future.

This goes back to the larger question of who owns the content we post. The best discussion so far of the right to be forgotten can be found in Viktor Mayer-Schönberger's Delete. Companies are overstepping the boundaries all the time: AOL got in a spot of trouble back in 2006 for offering up its subscribers' search records to the public in the belief that they were anonymous, and Facebook has addressed its myriad grey areas by incorporating in Ireland, where the Data Protection Commissioner's office is still small and its ability – and will – to regulate Facebook's technology has taken a while to rev up.

All this comes down to the question of how we should distinguish between different types of ownership on the web. Ownership is multifaceted: those who create content own it, but so do those who pay them to create it, those who own the online spaces where it's published, and the independent archivists who want to preserve it as part of our cultural heritage. And as with physical archives, some of the online archiving technologies currently being developed will lead to public access repositories, but most – and possibly those able to capture the greatest detail, to store it for the longest, and to retrieve it most accurately – will not. It's in all our interests that projects such as LAWA succeed, because publicly funded, publicly owned archives are the most likely to be accountable to their contributors, and to changes in the law designed to protect our information. LAWA is the canary in the coalmine: what's being developed through public projects is an indication of what may be occurring in the commercial sphere. It's worth tracking because, whether we like it or not, whether we're posting on social media or just interacting with technology as we live our lives, we're all named somewhere.


