Preserving progress for future generations
As organisations race to meet today’s digital preservation challenges, are uncertainties over cost and necessity still stifling progress? Rebecca Pool finds out
Digital preservation remains one of the most critical challenges facing scholarly communities today. From e-journals and e-books to emails, blogs and more, electronic content is proliferating fast and organisations worldwide are racing to preserve information for future generations before technological obsolescence, or even data loss, creeps in.
But for two key US digital preservation organisations, Portico and CLOCKSS, today’s scenario spells business as usual. Having preserved content since 2005, Portico now manages tens of thousands of e-journals, e-books and more.
Meanwhile, CLOCKSS – controlled LOCKSS – continues to add publisher after publisher, and more and more libraries, to its roll call. ‘We’re still experiencing very rapid growth,’ says Portico managing director Kate Wittenberg. ‘We’re getting more and more open access journals and have had particularly rapid growth in our D-collection.’
In recent years, libraries and publishers have been investing heavily in the creation of digitised historical collections, be they newspapers, images or other sources of information vital to research (see ‘Meet Carcanet’, below).
As Wittenberg highlights, Portico preserves such content on behalf of participating publishers.
In the last year, very large collections from the likes of Sage-owned primary source publisher, Adam Matthew, US, and Cengage Learning subsidiary, Gale, US, have buoyed Portico’s archive numbers as content from 19th-century newspapers, First World War experiences and American Indian histories continues to flow in.
‘We’ve added all of Gale,’ highlighted Wittenberg. ‘They signed an agreement this year for, as we call it, an “all in”, so everything the company publishes in its primary source collections now comes to us. These are huge collections so this has jumped our content level significantly.’
And at the same time, the company’s open access journals count is rising, especially as it now provides so-called open access triggers. At Portico, and other organisations, if content becomes unavailable, the preserving body will launch a ‘trigger event’ making the content available to participating libraries. For open access content, this has posed a problem as triggered content should, in theory, be available to all. But not any more.
‘Some publishers have said, “well if you trigger that only to your participants, that’s locking up content that was previously open”,’ explained Wittenberg. ‘So we changed our policy and now perform open access triggers if a publisher requests them for an open access journal.’
CLOCKSS is no stranger to open access preservation, having been one of the first archiving organisations to provide the service. But like many of his peers, Randy S Kiefer, CLOCKSS executive director, is still grappling with the cost models of providing preservation services to these publishers.
He, and colleagues at CLOCKSS, have spent the last year exploring ways in which to find grants to underwrite the costs of initial preservation fees. ‘We’ve made proposals to a few people and have had interesting comments, but have not yet had any takers,’ he said. ‘So we’re still looking at pricing and understanding exactly how different an open access publisher is from a commercial publisher.’
As Kiefer highlighted, an open access publisher could produce the same amount of content as, say, a commercial publisher with an annual turnover of US$5 million, but still fit into the organisation’s ‘lower cost’ bracket: ‘We have a disproportionate amount of expense being carried by the commercial publishers, so we’re working on ways to balance that correctly.’
Likewise, Portico’s Wittenberg points out how her open access publishers are subject to the same charges as commercial publishers: ‘We can’t do this for free as we still have to cover the same preservation costs for open access content as we do commercial [content],’ she said. ‘We charge on a scale based on revenue and many open access journals – but not all – are still quite small, so they come in at our lowest cost levels.’
Tried and tested
Open access aside, both Wittenberg and Kiefer are seeing a rising demand for the preservation of dynamic content. Kiefer highlighted databases and encyclopedia structures as key examples of content that is continuously updated, but added that organisations are still developing strategies to deal with this.
‘There isn’t anybody that I’m aware of, that can capture dynamic content and [preserve] a day-to-day, or even, minute-to-minute feed of this content,’ he said. ‘You can get snapshots, but these are just snapshots.’
Right now, CLOCKSS is developing the ‘how to’ process to preserve these ‘snapshots’ across multiple locations, validating each against the other, and is also exploring the best pricing structures to preserve such content. Meanwhile, Portico has signed an agreement to preserve Harvard University Press’s Dictionary of American Regional English (DARE), a publication that includes dynamic content such as audio recordings and mapping tools. As Wittenberg put it: ‘To me, this is indicative of where we’re going with scholarly publishing and is a wonderful project, and our challenge at Portico is to figure out the most effective way to preserve it.
‘We’ve been expecting this kind of dynamic content to come down the pipeline, and now we’ve seen it, we’re going to figure out exactly how to handle it.’
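To give a rough sense of what validating snapshots ‘each against the other’ across several locations could involve, the Python sketch below compares checksums of the same snapshot held at three sites and flags any copy that disagrees with the majority. It is an illustration only, not a description of the CLOCKSS process: the function name `validate_copies`, the site labels and the sample bytes are all hypothetical.

```python
import hashlib
from collections import Counter


def digest(data: bytes) -> str:
    """Return the SHA-256 checksum of one stored copy."""
    return hashlib.sha256(data).hexdigest()


def validate_copies(copies: dict[str, bytes]) -> tuple[str, list[str]]:
    """Compare copies held at different locations (hypothetical helper).

    Returns the checksum shared by most copies, plus the locations whose
    copy disagrees with it. If two checksums tie, the one seen first wins,
    so a real system would need a stricter rule than this sketch.
    """
    digests = {location: digest(data) for location, data in copies.items()}
    majority, _count = Counter(digests.values()).most_common(1)[0]
    damaged = sorted(loc for loc, d in digests.items() if d != majority)
    return majority, damaged


# Hypothetical sample data: three sites hold the same snapshot,
# but one copy has been corrupted.
copies = {
    "site_a": b"database snapshot, version 1",
    "site_b": b"database snapshot, version 1",
    "site_c": b"database snapshot, versiom 1",
}

majority, damaged = validate_copies(copies)
print("Copies needing repair:", damaged)
```

A snapshot validated in this way is still only a snapshot, which is Kiefer’s point: the check confirms that the copies agree with one another, not that they capture every change to the live content.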
For Kiefer, a key achievement this year is the CLOCKSS archive being audited by the Center for Research Libraries and, like other industry organisations including Portico, now being certified as a trustworthy digital repository.
‘We are the only archive to score a perfect five in technology and security, which is what preservation is all about,’ pointed out Kiefer. ‘Having been established as a trusted digital repository makes for an easier conversation with publishers and libraries... and this is no longer a differentiator between us and other organisations.’
Growing options
As preservation activities at Portico, CLOCKSS and other US ventures, such as LOCKSS, the Digital Preservation Network and HathiTrust, continue to grow, the number of digital preservation options across the Atlantic is also rising.
Early this year, UK-based analytics consultancy firm, Tessella, spun off its digital preservation arm, now called Preservica. Offering three preservation options – a cloud-based service, out-of-the-box software and an enterprise edition – the organisation supports the Met Office, the UK National Archives, the Swiss Federal Archives and more.
Meanwhile Israel-based software business, Ex Libris Group, released the fourth version of its ‘Rosetta’ digital preservation package this year, with recent contracts coming from the State Library of New South Wales and the State Library of Queensland.
With more options, such as the free and open-source system, Archivematica, gathering momentum, digital preservation development is clearly growing in both size and complexity.
‘Clearly progress is being made and you can measure that by the maturity of solutions on offer,’ said Neil Grindley, head of resource discovery at UK-based digital technology charity, Jisc.
But, for Grindley, the urgency of digital preservation has yet to hit home in most organisations: ‘It’s been an aim to get this notion of preservation embedded into strategic thinking within organisations. Yet if you ask senior management within most organisations what their stance is on preservation, they will still look at you blankly.’
And without a doubt, cost remains a key stumbling block for doing just this. As Grindley said: ‘Trying to sell the idea of digital preservation on the basis of return on investment has been very hard. By its nature, it’s a long-term activity and you’re really hedging your bets against future risks. I think we are still in the very early days of genuinely understanding the value of digital assets... and transferring this understanding over to financial assets doesn’t yet work very well.’
As well as heading up resources at Jisc, Grindley also coordinates a pan-European consortium called 4C – Collaboration to Clarify the Costs of Curation – that has been tackling this very problem.
Co-funded by the 7th Framework Programme of the European Commission, the thirteen project participants include Jisc, The Royal Library – National Library of Denmark, and the Digital Preservation Coalition. Crucially, the project has aimed to foster a more effective marketplace for digital preservation services by helping organisations across Europe to invest in digital curation and preservation.
Indeed, its recently released roadmap provides six key messages to help organisations appraise digital assets, adopt a strategy to grow preservation assets and develop costing processes. The messages include ‘identify the value of digital assets and make choices’ and ‘demand and choose more efficient systems’ as well as ‘develop scalable services and infrastructure’ and ‘be collaborative and transparent to drive down costs’.
These may sound broad, and even daunting, but Grindley believes the detail in the roadmap will make the difference: ‘We’ve developed a curation costs concept model, which is basically instructions for building your own cost model and has been designed as a community platform. If people coalesce around this and support it, then that will be a brilliant outcome to the project.’
Yet future developments beyond 4C, right now, look uncertain with European Commission funding for digital preservation research projects having dried up. According to Grindley: ‘The basic message has been that the Commission has funded a lot of research over the years but hasn’t seen enough progress in terms of a healthy, commercial marketplace for preservation solutions.’
As Grindley highlighted, the ultimate aim of the 4C consortium has been to address this, and he is hopeful its results will resonate across digital preservation communities worldwide: ‘The only way to crack the whole notion of understanding the costs of preservation is through sharing. We’ve seen many attempts to develop a generic cost model, but this just hasn’t been possible. The only way we can crack this now is through openness and collaboration.’
Meet Carcanet
In 2012, the UK-based John Rylands University Library at the University of Manchester set out to preserve one of its most important modern archives: that of Carcanet Press.
Comprising the email correspondences of the founder of Carcanet Press, Michael Schmidt, with world-famous poets, critics, editors, translators and artists, Carcanet’s born-digital archive was in danger of being lost forever.
Some 215,000 emails and 65,000 attachments later, the archive is safe. ‘Our archive currently covers 2001 to 2013, with around 14,000 different correspondences represented,’ said Fran Baker, archivist at the John Rylands Library. ‘E-mail is such a tricky thing to deal with, but this is now in a preservable format and has been ingested into our institutional repository.’
Baker and colleagues used forensic software to assess the emails. ‘We ran various metadata extraction tools to record essential details such as the folder titles, how large each folder was, the correspondents, date coverage and so on,’ she said. ‘Archivists are increasingly making use of software designed for the police and law enforcement, as it’s designed not to make any unauthorised changes [to documents].’
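The article does not name the tools the library used, but the kind of details Baker lists can be pictured with a short Python sketch. The example below uses only the standard library to summarise one folder of messages: how many there are, their combined size, who the correspondents are and the date range covered. The function `summarise_folder` and the two sample messages are hypothetical, and the sketch makes no attempt at the write-protection that forensic software provides.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime


def summarise_folder(raw_messages: list[str]) -> dict:
    """Record basic metadata for one folder of raw email messages.

    Hypothetical helper: assumes every Date header carries a time zone,
    so that the parsed dates can be compared with one another.
    """
    correspondents = set()
    dates = []
    total_bytes = 0
    for raw in raw_messages:
        msg = message_from_string(raw)
        total_bytes += len(raw.encode("utf-8"))
        # Collect every address in the From and To headers.
        headers = msg.get_all("From", []) + msg.get_all("To", [])
        for _name, address in getaddresses(headers):
            if address:
                correspondents.add(address.lower())
        if msg["Date"]:
            dates.append(parsedate_to_datetime(msg["Date"]))
    return {
        "messages": len(raw_messages),
        "bytes": total_bytes,
        "correspondents": sorted(correspondents),
        "earliest": min(dates).date().isoformat() if dates else None,
        "latest": max(dates).date().isoformat() if dates else None,
    }


# Hypothetical sample folder containing two short messages.
folder = [
    "From: Editor <editor@example.org>\nTo: poet@example.org\n"
    "Date: Mon, 05 Mar 2001 10:00:00 +0000\nSubject: Proofs\n\nPlease see attached.\n",
    "From: poet@example.org\nTo: editor@example.org\n"
    "Date: Tue, 12 Nov 2013 09:30:00 +0000\nSubject: Re: Proofs\n\nThanks.\n",
]

summary = summarise_folder(folder)
print(summary)
```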
With emails appraised, a key task was also to ensure the Microsoft Outlook emails were in a format that would be accessible in the future. Emails were migrated to several different formats, including EML.
‘We also migrated [data] to XML for preservation purposes as this retains absolutely every detail of the formatting,’ highlighted Baker. ‘Technical colleagues also developed a fantastic index, so we have searchable metadata too.’
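As a minimal sketch of what migrating an email to XML can look like, the Python example below parses a message in EML form and writes its headers and body into a simple XML document. The XML layout (`message`, `headers`, `header`, `body`) and the function `eml_to_xml` are invented for illustration, since the article does not describe the library’s schema, and the sketch assumes a single-part, plain-text message with no attachments.

```python
import xml.etree.ElementTree as ET
from email import message_from_string


def eml_to_xml(raw: str) -> str:
    """Convert one plain-text EML message to XML (hypothetical layout)."""
    msg = message_from_string(raw)
    if msg.is_multipart():
        # Attachments and multipart bodies are outside this sketch.
        raise ValueError("only single-part messages are handled here")
    root = ET.Element("message")
    headers = ET.SubElement(root, "headers")
    # Keep every header, in its original order, as name/value pairs.
    for name, value in msg.items():
        header = ET.SubElement(headers, "header", name=name)
        header.text = value
    body = ET.SubElement(root, "body")
    body.text = msg.get_payload()
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample message.
raw_eml = (
    "From: poet@example.org\n"
    "To: editor@example.org\n"
    "Subject: Re: Proofs\n"
    "\n"
    "Thanks.\n"
)

xml_text = eml_to_xml(raw_eml)
print(xml_text)
```

Because each header becomes its own element, the same output could feed a searchable index of the kind Baker mentions, although how the library built its index is not described.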
Earlier this year, the library received the ‘Safeguarding the Digital Legacy’ award from the Digital Preservation Coalition for its Carcanet Press project.
Material will continue to be added to the archive on an annual basis, while the systems developed during the project will ensure the library is well-placed to deal with future born-digital archives.
‘We’re hoping to launch a project focusing specifically on the long-term preservation of emails into our archive,’ concluded Baker.
‘For example, there are all sorts of issues around research access, and privacy, sensitivity and data protection with these contemporary emails, and these are the kinds of things that will be in a work package.’