FEATURE

The rise and rise of digital preservation

As the demand sky-rockets, industry players are juggling dynamic content, D-Collections and more, writes Rebecca Pool

While digital preservation was once considered to be a ticking time-bomb for libraries worldwide, today myriad organisations and initiatives exist, ready and willing to archive content for both the short- and long-term.

US-based not-for-profit preservation archive, Portico, is a front-runner in academic preservation and, like many in the industry, continues to see rising demand for services. Recent contracts to preserve e-journals and e-books come from India-based Environ Researchers and Emerald Publishing, UK. As managing director Kate Wittenberg highlights: ‘The amount that we ingest into the archive just continues to grow every year.

‘We’ve just celebrated one billion files in our archives and I find it astounding how much growth we see; we have to increase our storage capacity because demand is growing so quickly.’

Likewise, US-based not-for-profit venture CLOCKSS has seen a steady stream of both e-journals and e-books deposited into its internationally distributed network of archives. CLOCKSS uses an open source digital preservation system – LOCKSS – developed at Stanford University, US, and this year has signed key contracts with Cambridge University Press, Emerald Group Publishing and IOP Publishing, UK. ‘We’ve seen a realisation among libraries and publishers in 2015 that it is now important to preserve books as well as journals,’ comments executive director Craig Van Dyck.

Wittenberg agrees, saying: ‘E-books are not as fast-moving as e-journals at this point, as [this sector] is in an earlier stage of development compared to e-journal publishing programmes.

‘But we continue to sign new e-books... and in many cases this is from journal publishers that realise they need preservation and are now also turning to their e-book programmes,’ she adds.

However, for Wittenberg and Portico, another important source of demand for preservation is now gathering momentum. While e-journal and e-book titles still provide the mainstay of growth for the company, digitised historical collections are fuelling business more and more.

In early January 2014, Portico had extended an already established partnership with US-based aggregator, Gale, to preserve all digital collections in its D-Collection Preservation Service. Content is estimated to total at least 1.5 million documents across more than 150 million pages.

But late last year, a further contract came through for the same function, this time from EBSCO Information Services. The US-based discovery service provider and aggregator launched its Digital Archives in 2009, and has now enlisted Portico to ensure long-term availability of content such as Civil War Primary Source Documents and African American Historical Serials.

As Wittenberg points out: ‘We have been working on this complex arrangement for a very long time and, combined with Gale, it represents a very big jump in the amount of content in our D-Collection Service.’

‘These are huge [collections] and take up a large percentage of our physical space in the archives,’ she adds. ‘Signing EBSCO was a massive deal that sparked considerable interest in the library community... and we will now be going out to talk to other potential D-Collection partners.’

In a similar vein, CLOCKSS anticipates demand for its preservation services from aggregators. Van Dyck asserts: ‘I’m expecting that in 2016 we will see some very large collections coming to us from aggregators.’

‘Right now, the CLOCKSS software and hardware systems are being prepared for very significant growth from D-Collections and back files,’ he adds. ‘We also expect more supplementary material from journals, including videos, which will all add to capacity requirements, so we’re making long-term investments this year.’

Indeed, part of CLOCKSS’ longer-term masterplan includes participating in the broader world of digital preservation. As Van Dyck puts it: ‘Digital preservation is of interest in libraries worldwide, and while scholarly literature is our core competency, it is not the entire realm of digital preservation. We will be collaborating with industry leaders worldwide to improve digital preservation infrastructure.’

‘[With LOCKSS], we are in the process of recrafting our technology to be more modular, and are taking advantage of the many open source components that are available,’ he adds. ‘Developers of LOCKSS are collaborating with other developers across the entire industry and we also want to be in a position to take advantage of that.’

Complex and dynamic

Unsurprisingly, the preservation of dynamic content is still throwing up issues as technologies such as Web 2.0 and HTML5 give rise to more complex, changing content. Typical examples include databases and encyclopedia structures that are continually updated, and organisations across the board are developing strategies to deal with this.

Indeed, for several years now, LOCKSS, as used by CLOCKSS, has been working on how best to deal with this constantly moving target. For example, open-source software has been released to capture content that had been locked behind inaccessible forms, while code has also been developed to collect materials delivered via JavaScript.

But, as Van Dyck candidly admits: ‘Thanks to a Mellon grant, LOCKSS has definitely addressed methods to capture content in more dynamic formats including HTML5 and AJAX, but the process is still not seamless.’

‘From CLOCKSS’ point of view, we can take snapshots [of captured content], so if a customer was to say we want to preserve a database that is ever-changing, we would say fine, but we need to agree on periodic snapshots,’ he adds. ‘If that was acceptable, great.’

Right now, Portico is also working on dynamic content. Last year, the organisation signed an agreement to preserve Harvard University Press’s Dictionary of American Regional English (DARE), which includes dynamic content such as audio recordings and mapping tools.

‘This has been fascinating; the interface is a map that you click on to get an audio recording of the regional language accent for that location,’ says Wittenberg.

Indeed, the managing director expects to see more of the same over the next five years. As she points out, publishers are handling more distributed, dynamic objects as part of their publications with content increasingly based on data rather than text, so establishing the best preservation approach will be crucial.

‘Publications could be driven by a database with a dynamic interface; this may have links to datasets that reside outside the article, or even multimedia collections that are central to the author’s argument,’ she says. ‘We’re starting to see a trickle of these but my feeling is we’ll see a lot more in the next several years, and preservation services need to be responsive to that.’

Dynamic content aside, drawing in content from small publishers has also vexed preservation organisations for several years. Referred to as the ‘long tail’ of small publishers, this content is less straightforward and more expensive to preserve, and therefore more at risk.

However, for its part, CLOCKSS, together with LOCKSS, has been very active in addressing this long tail. As Van Dyck points out: ‘These [smaller organisations] come to us directly now, and we saw a good slice of our growth driven by smaller publishers in 2015, many of whom were open-access publishers.’

According to Wittenberg, Portico has also seen success in attracting smaller publishers, with around half of its journals coming from this sector. Portico classes these organisations as publishers that produce at most 10 titles; recent signings include Pacific University Libraries, US, Methaodos Revista de Ciencias Sociales, Spain, and Hygeia Press, Italy.

‘It’s very difficult to contact these small publishers,’ highlights Wittenberg. ‘The publication could be run by a faculty member, a graduate student or an academic department that doesn’t, for example, have a marketing group or legal department.’

To this end, Portico developed guidelines to ease participation and tools for straightforward content deposit; one example includes an export plugin for the Open Journal Systems (OJS) platform.

‘It’s rarely the price of preservation for these publishers, it really is just more difficult to “reach out”,’ adds Wittenberg. ‘But these organisations evolve and now smaller publishers are increasingly using existing platforms that aggregate many journals, so we’re also receiving their content via [this].’

A different approach

But while the likes of Portico and CLOCKSS are dark archives providing long-term preservation services, myriad other organisations are emerging, offering short-term archiving options. A key UK-based player, Arkivum, promises to secure copies of data safely for up to 25 years, which can be quickly accessed, while guaranteeing the stored content is usable in the future.

Arkivum spun out of the University of Southampton in 2011, and has since seen ever-increasing demand for its services. Today the company stores a dizzying array of content for organisations in higher education as well as healthcare, life sciences and heritage. Key contracts come from JISC and many UK-based universities, as well as the New York Museum of Modern Art, The Tate Gallery, North Bristol NHS Trust and Royal Botanic Gardens.

‘The data types within our service are hugely varied, including genomic data, medical imaging, research data from universities and even call recordings from call centres,’ says Matthew Addis, chief technology officer at Arkivum. ‘We also quite often integrate our archiving solution with, say, a university institutional repository – and it could be a data publication platform, such as Figshare.’

Addis spent more than a decade carrying out research on digital preservation and archiving at the University of Southampton, before joining Arkivum.

As he highlights, the company receives a lot of interest in its archiving services following what he calls a ‘compelling event’. For example, the company saw a lot of activity in the run-up to last year’s deadline for the EPSRC policy framework on research data. ‘Many universities realised they had a deadline and opted for an “out-of-the-box”, ready-to-go solution, so this kind of thing really catalyses demand,’ he says.

‘As part of the 100,000 Genomes Project we also see UK government pushing for 100,000 people to have their genomes sequenced over the next couple of years,’ he adds. ‘It’s this rapid increase in data that causes all hell to break loose with people thinking “right, we’ve got to take action”.’

Like many in the industry, Addis sees many issues surrounding storing complex content – for example, handling different file formats and dealing with metadata. But what is most crucial, he asserts, is to actually get started.

‘I’ve heard Tim Gollins, head of digital archiving at National Records of Scotland, say: “You can think about all the risks and complex challenges, but unless you’ve captured the data, you’ve got no hope of preserving it”,’ he laughs.

‘For me, the key challenge is knowing where to start and to keep things simple,’ he says. ‘There are possibly too many people spending too much time thinking about the problem without making any short-term decision to get on and make progress.’

Ex Libris update

Late last year, US-based aggregator ProQuest acquired Ex Libris, a provider of cloud-based solutions for higher education. The digital preservation system Rosetta forms a key part of Ex Libris’s suite of library resource tools, but librarians and publishers can expect business as usual from the company.

‘There is no impact on Rosetta following the acquisition by ProQuest,’ asserts Adi Alter, Rosetta product manager. ‘We have a wider horizon of opportunities resulting from the merger, and will be investigating ways to enhance the system, based on expertise drawn from ProQuest.’

According to Alter, in the run-up to the acquisition Rosetta experienced significant growth, particularly from the academic community. As he points out, the range of collection types being preserved by institutions is vast, covering both digitally-born and digitised content.

‘Recently we’ve seen a dramatic increase in the demand to preserve more complex content types, such as web content and scholarly research data,’ he adds.

Indeed, the company recently clinched deals with China-based Shanghai University and US-based academic libraries consortium OhioLINK to preserve a complex range of content.

For example, OhioLINK’s content includes its electronic journal collection, with tens of millions of digital files, as well as its electronic dissertations and theses covering tens of thousands of records. Electronic books are included, as are hundreds of thousands of video, image and audio files from the digital resources collection.

‘Rosetta will provide a digital repository backbone that will serve the digital asset management and preservation needs of the different libraries in the consortium,’ says Alter. ‘One challenge is to build different workflows for the evolving collections. OhioLINK will take advantage of Rosetta being a flexible and fully-configurable system.’

Alter also asserts that Shanghai Library selected Rosetta for digital asset management and preservation, and multi-language support.

‘[The system] copes with diverse content types via plug-ins, which enable the management, preservation and delivery of almost any content,’ he says. ‘The language needs of Shanghai are easily met by Unicode and its interface can be translated into any language.’
