PRESERVATION

Mind the gap: catching digital content before it slips away

As preservation communities race to bridge the divide between the research world's burgeoning digital content and its long-term storage options, Rebecca Pool examines the issues and the challenges

Research Information: February/March 2013

To describe digital preservation as a moving target is now, more than ever before, something of an understatement. In the beginning, organisations grappled with obsolete formats and defunct file extensions but the rise of the internet and proliferation of digital devices has brought a relentless churn of operating systems, HTML code and apps. Preserving today’s digital content is a minefield.

As Taylor Surface, global product manager at the not-for-profit organisation OCLC, put it: ‘There’s a big gap between what the majority of cultural-memory institutions can achieve in digital preservation and what people are now doing with the things they create digitally.’

Advances in digital preservation over the past decade now mean that storing a traditional atomic unit of information, such as an image or PDF file, can be relatively straightforward. However, doing the same with web content or a blog is a different matter. As Surface highlighted, no common practice exists for capturing web blogs and blogging, let alone a common practice for how you archive and preserve this content in the long term.

‘The technology on websites changes so rapidly. There are different tools to create websites and different technologies used to present information and these are changing year after year,’ explained Surface. ‘If you are trying to capture this information coherently you really have to focus on who you are preserving it for, and why.’

Bram van der Werf, executive director of the not-for-profit Open Planets Foundation (OPF), a UK-based forum for digital preservation, agrees, and believes rapidly changing and complex content is a major preservation challenge.

‘Until a few years ago, digital objects, such as Word and Excel files, were relatively static, and, over time, libraries and archives have learned to deal with these,’ he said. ‘But people are now increasingly publishing and disseminating information on websites and blogs. We are no longer dealing with static objects but, instead, highly-interactive digital expressions.’

To make matters worse, today’s digital content is also highly dependent on the latest ‘device’, be it a tablet, iPhone or iPad, as well as its flavour of operating system and even the latest app. ‘These technologies have a very short life cycle and the speed at which they regenerate is so fast, often less than a year,’ observed van der Werf. ‘So the real issue is that the objects we want to keep are becoming increasingly complex and dependent on lots of technologies that we only have very limited control over.’

Taking on the challenge

That’s not to say organisations aren’t tackling the problem. Stanford University-based LOCKSS (Lots of Copies Keep Stuff Safe) is, fundamentally, a web-harvesting organisation, set up to preserve libraries’ electronic materials across its distributed network.

As such, the LOCKSS system has always been file-format agnostic, preserving web-published content as displayed on the web. ‘From the start we have been preserving large amounts of audio and video databases, HTML, XML, PDFs,’ said LOCKSS executive director, Victoria Reich. ‘We were set up to preserve academic e-journals exactly as they were presented on publisher websites, so the mime extension or file format really hasn’t mattered to the LOCKSS program.’

But harvesting and archiving websites is growing in complexity, with preservation becoming increasingly difficult to manage as technology on websites rapidly changes. Like Surface and van der Werf, Reich is seeing new challenges.

‘To preserve the author’s words and presentation of his or her material is an increasing technical challenge as the web is evolving from a document model to a programming environment,’ she noted. ‘This is a general web-preservation problem; with new technologies such as Web 2.0 and HTML5, published content is increasingly dynamic.’

To tackle this, LOCKSS won a grant from the Andrew W Mellon Foundation in April 2012 to develop new ways to gather and preserve some dynamic digital content. According to Reich, a key challenge has been to capture the richness of content presented on the web. However, the organisation has made progress and now plans to release its first open-source software addressing this problem by April this year (see section: ‘Preserving blogs’).

For its part, the OPF has also established several new projects to deal with today’s issues. For example, SCAPE, co-funded by the European Commission and led by the Austrian Institute of Technology, is developing open-source, scalable tools for the automated preservation of complex, multi-terabyte-sized data sets.

Van der Werf believes the project is crucial, given the sheer volumes of digital content to preserve. ‘Whereas the traditional stream of information was manageable via human curation, the amount of information now being published is so huge we have to develop scalable solutions to deal with the massive flows,’ he argued. ‘Automation is the only way to deal with the vast amount of information.’

The project is collaborating with global, open-source initiatives to help deal with scalability issues, including access to super-computers as well as existing technologies to identify and characterise files.

But as the OPF director also emphasised: ‘If you want to deal with this you have to understand it. We’re dealing with fundamental issues such as what’s inside a file system. Without explicitly understanding the problem, you simply can’t resolve the issue.’
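To make ‘identifying and characterising files’ concrete, here is a minimal sketch of one common approach: recognising a format from the leading ‘magic’ bytes of a file rather than trusting its extension. It is not SCAPE’s or the OPF’s tooling; the signature table and the `identify` function are illustrative assumptions, and real identification tools use far larger signature registries.

```python
# Hypothetical, deliberately tiny signature table: leading bytes -> format label.
# Production tools rely on much larger, curated registries.
SIGNATURES = [
    (b"%PDF-", "PDF document"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"PK\x03\x04", "ZIP container (e.g. EPUB, DOCX)"),
]


def identify(data: bytes) -> str:
    """Return a format label based on the first bytes of a file's content."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"


def identify_file(path: str) -> str:
    """Read only the first few bytes of a file on disk and identify it."""
    with open(path, "rb") as handle:
        return identify(handle.read(16))
```

A file named `report.doc` whose content begins with `%PDF-` would be reported as a PDF document, which is the kind of mismatch an archive needs to catch before ingest.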

Outside scholarship

Fundamental understanding or not, one non-scholarly organisation that has forged ahead with preservation is the USA-based Internet Archive. Launched in 1996 by Brewster Kahle, the dot-com multi-millionaire who wanted to back up the internet, the project very publicly demonstrates what can be done. His internet library quickly became the largest publicly-accessible, privately-funded digital archive in the world, and currently includes books, audio and film collections, cartoons and software.

However, the one service of Internet Archive that continues to draw widespread attention is the Wayback Machine, which enables users to see archived versions of websites, dating back to 1996. Admittedly, the service has only captured a fraction of the internet’s history – but it is still more than 150 billion pages from millions of websites and you can, for example, read original coverage from the BBC and CNN of ‘America Under Attack’ on 11 September 2001.

‘The Internet Archive made it its mission to broadly harvest the web and do its best to present it back to users,’ says Surface. ‘Some content doesn’t display itself correctly after a while and you can’t just search generally and find full content, but it’s far better than nothing and the fact they have a time-line is just great.’
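The time-line Surface mentions is visible in how archived pages are addressed: a Wayback Machine address combines a 14-digit timestamp with the original URL, and the service serves the capture nearest to that moment. The sketch below only builds such an address; the function name `wayback_url` is a hypothetical helper, and whether a capture exists for a given page and date is not checked.

```python
from datetime import datetime

# Public prefix used by the Wayback Machine for archived pages.
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(url: str, when: datetime) -> str:
    """Build a Wayback Machine address for `url` at (or near) the time `when`."""
    stamp = when.strftime("%Y%m%d%H%M%S")  # YYYYMMDDhhmmss
    return f"{WAYBACK_PREFIX}/{stamp}/{url}"


if __name__ == "__main__":
    # Address for CNN's front page around midday on 11 September 2001.
    print(wayback_url("http://www.cnn.com/", datetime(2001, 9, 11, 12, 0, 0)))
```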

Dispersed and dynamic

And it’s not just web-harvesting organisations that are scrambling to stay abreast of today’s changing digital content. The not-for-profit organisation, Portico, preserves tens of thousands of scholarly e-journals, digitised historical collections and, more recently, e-books.

But according to Kate Wittenberg, Portico’s managing director, the organisation is now looking to deal with more and more complex content. ‘Dynamic forms of content, such as data and multimedia, are increasingly becoming integral parts of publication, instead of being an outside link as was the case in the past,’ she said.

According to Wittenberg, dynamic content is becoming widespread in scholarly science publications, and also quickly moving to the humanities and social sciences. ‘This dynamic content is critical to the authors’ arguments so how do you connect and preserve that data or those sources to their final work?’ she asked. ‘An article could be published by, say, Springer, and the data that supports it could be housed at a university or an aggregator or a data-service organisation, so how do we preserve the links between the published article and the data, housed in different places? This is going to be important.’

Portico is investigating research projects to address these very questions, so it’s too early for answers, but as Wittenberg added: ‘We have the capacity to preserve scholarly material coming out in a traditional form but we will also be prepared for what’s coming next.’

While exploring the options for dealing with more dynamic content, Wittenberg said the organisation is also seeing growth in demand for its e-book preservation service; the number of e-books preserved has risen from just under 2,000 in 2009 to more than 170,000 today.

As she highlighted, more and more aggregators for scholarly e-books are emerging. These include the Johns Hopkins University Press, with its e-book service of content from many US university presses, and USA-based Ithaka’s book-hosting service on its JSTOR platform. ‘We’ll be preserving the e-books from both of these initiatives,’ she said. ‘So we are going through a similar development process that we saw with journals; we have the same business model, it’s just a different form of scholarly publishing.’

Communication is key

Much preservation work is underway to better deal with digital content and its escalating complexity, but is the community seeing results? As OCLC’s Surface highlights: ‘The gap [between digital-preservation practice and what’s required to store digital content] isn’t getting wider, but I don’t see any consistent standards and common practices coming out to help you to deal with this.’

Community-based organisations, such as the Digital Preservation Coalition in the UK as well as the National Digital Stewardship Alliance, USA, help the development and dissemination of good practice. But as Surface adds: ‘These organisations provide a focus for discussion [about digital-preservation practices], but as you are having the conversations the technology is advancing.’

Surface’s words may seem light-hearted, but take a look at the myriad community blogs on digital preservation and the theme of communication emerges over and over again. Tensions over duplication in research – a strong signal that communication could be stronger – appear to be running high.

At last year’s ninth annual iPRES conference, as blogged by Inge Angevaare from the Netherlands Coalition for Digital Preservation, plenary keynote speaker Steven Knight from the National Library of New Zealand pointed out to the audience that ‘10 years on we are still pretty much talking about the same things’.

Knight’s sentiments have been echoed by Paul Wheatley, project manager of OPF project Spruce at Leeds University, UK, also a plenary speaker at the same conference. He categorically stated that the community is duplicating its efforts, and urged practitioners to ‘re-use, don’t re-invent the wheel’, adding that most problems have already been solved, but not necessarily by this community.

With communication in mind, the OPF is working hard to bring communities of digital-preservation practitioners together in a bid to ensure that precious information is shared. Van der Werf’s team organises hackathons in which participants analyse different formats of e-books or images, for example, while members also regularly blog about technical challenges. In addition, the organisation hosts conferences, publishes webinars and has established a Wiki knowledge base covering myriad digital-preservation issues.

As Rebecca McGuiness, OPF membership manager, argued: ‘Take tool registries. Even though eight, 10, 12 registries exist at the moment, people still tend to start a new one. We are trying to stop this and get people working together as a community.’

Preserving blogs

In March 2011, the project ‘BlogForever’ was launched at the UK’s Warwick University, with the aim of developing robust digital-preservation, management and dissemination facilities for weblogs.

Bringing together a worldwide network of universities, institutions and companies, the EU-funded project is in the process of developing a prototype software platform that anyone will be able to install to a server and preserve a selection of blogs.

The project has delivered the prototype of a blogosphere spider that enables crawling and monitoring lists of identified blogs, as well as new, unknown blogs.

As Morten Rynning, from BlogForever, wrote on the project website: ‘Any new blog posts or comments from each blog will be added to the feed through the spider, which can be downloaded and run from a single server; and managed through a web portal interface.’

Organisers hope that libraries and information centres, museums, universities, research institutes, businesses, and, of course, bloggers, will use the final repository, scheduled to be completed by September 2013.
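The core of what such a spider does, spotting posts it has not seen before in a blog’s feed, can be sketched in a few lines. This is an illustration under stated assumptions, not BlogForever’s code: it assumes a plain RSS 2.0 feed already downloaded as text, and the function name `new_items` and the use of an in-memory set to remember seen posts are hypothetical choices.

```python
import xml.etree.ElementTree as ET


def new_items(rss_xml: str, seen: set) -> list:
    """Return (identifier, title) pairs for feed items not seen before.

    `seen` is updated in place, so calling this again on the same feed
    yields only posts published since the previous call.
    """
    root = ET.fromstring(rss_xml)
    fresh = []
    for item in root.iter("item"):
        # Prefer the item's <guid>; fall back to its <link> if no guid is given.
        ident = item.findtext("guid") or item.findtext("link")
        if ident and ident not in seen:
            seen.add(ident)
            fresh.append((ident, item.findtext("title", default="")))
    return fresh
```

A real crawler would also need to fetch the feed on a schedule, store the set of seen identifiers durably and capture the post pages themselves, along with comments, styling and embedded media.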

Small journals at risk?

In recent months Portico has signed up an increasing number of smaller, often society-run journals. While the organisation offers the same service to the libraries and publishers of these titles, managing director Kate Wittenberg says this content brings additional challenges.

‘Many of these journals are not on the kinds of standard platforms that other publishers use so files are not in a standard format,’ she explained. ‘Also, you are not getting one big flow of files from that standard platform, instead you are dealing with a variety of titles coming in with different formats.’


Kate Wittenberg (left) and Victoria Reich (right) 

But while Portico takes on more of the smaller journals, others in the digital-preservation community are voicing concerns over the future of this content. Victoria Reich, executive director at LOCKSS, highlighted how preserving content from a smaller publisher is much more expensive per title. For example, an organisation might write a preservation system for, say, Elsevier’s publications, and preserve thousands of titles. Do the same for a smaller publisher and it only preserves a few titles.

‘Many resources are going into preserving content that is not at risk and people do not want to pay to preserve the content that is truly at risk,’ she said. ‘So you have an agricultural title from a developing country that could hold insight for climate change. This content is often free to libraries but is far more expensive to preserve than Elsevier’s, leaving little incentive to preserve it. This content is at risk and to be perfectly frank, the key issue is money.’