Partners go Dutch to preserve the minutes of science
Two years ago Elsevier became the first publisher to agree to deposit all its journal articles in the Dutch National Library. Elsevier's director of IT Solutions, Geoffrey Adams, explains why, and how the project is progressing so far.
Recent reports that our digital family photographs, carefully burned on to CD-ROMs, can in fact become unreadable and corrupt after a few years have caused puzzled alarm among amateur photographers. Provided you don't use the CD as a beer mat, they ask, surely such digital data is indestructible?
Libraries and research institutions, however, have known for years – as have governments – that digital information is in fact fragile and at risk. Not so much because of the instability of the storage medium itself, but because of the constant re-invention of hardware, applications, platforms and formats. This permanent revolution makes the task of capturing, archiving and, above all, safely retrieving vital content a vast challenge: one requiring new skills, different infrastructures and, of course, more money.
Since much of scientific research these days is 'born digital', and is increasingly accessed digitally too, the stability of these 'minutes of science' is not of purely academic concern. Preservation of these 'digital objects', and ensuring permanent access to them, has become one of the key tasks of national libraries.
Research on tools and procedures for permanent access is still in its infancy, but some institutions and publishers are taking on pioneering roles. The Koninklijke Bibliotheek (KB), which is the National Library of the Netherlands, for example, has made significant progress in archiving comprehensive digital content from several publishers. The first of its collaborations was with its neighbour, Elsevier.
Towards a 'safe place'
But even before this project was started, the KB was working on electronic deposit. It first decided to include electronic publications in its deposit collection a decade ago, and in 1995 the KB launched plans to design and build an electronic deposit or e-Depot system – the Depot for Netherlands Electronic Publications (DNEP).
IBM Netherlands was awarded the contract; the system was developed on site at the KB and handed over in October 2002. Built as far as possible from off-the-shelf components, it has now become an IBM product in its own right, under the brand name DIAS – Digital Information Archiving System.
The project leader for the KB's long-term digital preservation programme is Dr Johan Steenbakkers, director of e-strategy, who sits on the three-man board of directors. He acknowledges the commitment and expertise that IBM has displayed in delivering and now maintaining an exceedingly challenging e-Depot system. But he says that, however good the technical partner, the archiving institution must develop its own expertise in road-testing every new component and upgrade. 'Libraries need to create their own "gold standard" test sets,' he said, 'because even minor glitches can degrade data when it is being loaded. So, clearly, they need to train or recruit staff who are competent in this relatively new discipline.'
Future developments at KB will focus on the 'permanent access' challenge. Experiments with IBM on data preservation with the Universal Virtual Computer may offer a promising outcome to the long search for a solution based on emulating old formats on current platforms, but it is still early days.
In addition, the KB is working within the PATCH consortium (creating the Permanent Access Toolbox for digital Cultural Heritage) and trying to persuade the ICT sector to adopt design criteria that guarantee the persistence of digital information, rather than trying to fix the redundancy problem years later. PATCH is also hoping to squeeze funds from the European Union to develop and test a range of tools for permanent access.
Meanwhile, over the road...
Funding from the Mellon Foundation brought together Elsevier and the Yale University Library for a year's worth of e-archive planning that began in 2001. This provided a significant bridge across the traditional divide between the scholarly library world and publishing and delivered many valuable insights for both sides.
Even earlier, and closer to home, in Elsevier's native town of Amsterdam, work had been progressing on its electronic warehouse (EW). This had been in planning since 1993; it started in 1996 and has been operational since 1997. A fundamental corporate commitment states that all the Elsevier-owned journals on ScienceDirect will continue to be held and made available on a permanent basis, even if the journal ceases publication. The EW archive, which has so far cost Elsevier more than 10 million euros, is designed to make good on this promise.
This vast electronic archive holds about six million articles and more than six million high-resolution graphics files, and eats up more than six terabytes of storage. It guards not only the more recent articles, but also the older journals as they are progressively digitised in Elsevier's 'backfiles' project. Among the oldest to be digitised is The Lancet, which was first published in 1823 and is still the best-selling of all Elsevier's journals. The backfiles project alone required the construction of a special building and the deployment of 2,000 members of staff.
And as each year passes the archive expands. In journals alone, more than 1,800 titles pass through the EW each year – some 1,000 articles each working day. Physical, human and electronic security precautions and disaster-recovery procedures are comprehensive, including multi-site storage and off-site back-up.
But the onward pace of technology has already brought radical changes in Elsevier's digital preservation programme. Starting this summer, a new Electronic Warehouse (EW2) comes into operation. New data coming into EW2 is stored in XML format. When the older content and metadata from EW1 are ported across and converted to the new XML format, the original electronic documents will still be retained and linked to their new XML versions.
Naturally, all the accompanying metadata are preserved with the article, but Elsevier's electronic archiving experts are conscious, along with many librarians, that more extensive metadata are needed for publications that may be accessed decades or even centuries later. One conclusion from the Yale-Elsevier project was that preservation metadata can add real value to content. Elsevier is monitoring the debate within the library and scholarly community, including the Open Archival Information System (OAIS) reference model; the work of the Joint Information Systems Committee (JISC) in the UK; and developments within the Online Computer Library Center/Research Libraries Group, whose working group on Preservation Metadata published a report in June 2002 on what these metadata should be.
The KB and Elsevier partnership
On top of that, Elsevier saw that the KB was emerging as one of the leaders among national deposit libraries in digital archiving, and therefore as one that had the skills and attitudes necessary to collaborate successfully with a commercial publisher.
Karen Hunter, senior vice-president for strategy at Elsevier and responsible for this digital archiving initiative, explained the relevance of this agreement: 'It is essential that we will be able to guarantee both authors and researchers using the journals that the electronic files will be permanently available. Journals have been called "the minutes of science". As we move toward journals being available only in electronic form and being held centrally on publishers' computers, the public has the right to be assured that, should a publisher go out of business, these files will not be lost. The deal with the KB provides that assurance for Elsevier titles, which constitute an essential part of the core scientific literature currently published.'
Under the agreement, the KB receives, without charge, digital copies of all journals available on Elsevier's web platform, ScienceDirect – some 1,800 STM journals. If new journals are published, they are added automatically. But on top of that, the KB will get all the older Elsevier journals as they are progressively digitised, right back to Volume 1 Issue 1. The starting collection, including the backfiles, is expected to represent some seven terabytes of data. (By some estimates, that is the equivalent of 350,000 trees converted into paper and printed as journal pages.)
For the KB, the benefits of the project include not just the obvious one of access to content, but also a partner to work with on developing the infrastructure and skills needed for handling and maintaining electronic publications.
'We have been quite successful in finding excellent partners and in building good working relationships with them,' noted the KB's Steenbakkers. These partners include IBM and RAND, as well as other advanced national libraries such as the NEDLIB partners, the Library of Congress, the National Library of Australia and the British Library. 'A second crucial factor for our success is the close co-operation, right from the start, with some major publishers. The first publisher the KB teamed up with was Elsevier,' he added.
Steenbakkers went on to say that, since the agreement with Elsevier, the KB has negotiated similar arrangements with Kluwer Academic Publishers, Blackwell Publishing and BioMed Central. Other agreements are in the pipeline, with the potential to capture 80 per cent of STM journal content by the end of 2004.
The archive is primarily a preservation archive of content. It doesn't offer a document delivery service, but is available for on-site use. So far it is the only official archive to hold Elsevier's STM content but, like all national libraries, it exists for the benefit of the whole international scholarly community.
The technical details
The bulk of content capture comes from online media, in the form of high-volume electronic articles sent by publishers such as Elsevier. These publications are either sent to the KB on tapes (for processing the backfiles) or by means of File Transfer Protocol (FTP). In both cases, publications ready for ingestion end up in an electronic post office in which they are checked and validated. At this stage the contents of the submission are checked for technical conformity, based on agreed specifications. If the material does not meet the required criteria, or if other errors occur, the content is passed to a database for error recovery. In fact, inspection of this database is currently the only manual effort involved. Once the content is validated, content and metadata are put together as Publisher Submission Packages (PSPs) which are then processed by a part of DIAS called the Batch Builder.
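The checking stage described above can be pictured with a short Python sketch. It is purely illustrative and makes no claim about how DIAS is actually built: the class names, the list of required metadata fields and the in-memory lists standing in for the error-recovery database and the queue of PSPs are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Submission:
    """One publication as it arrives from a publisher (hypothetical shape)."""
    identifier: str
    content: bytes
    metadata: dict


@dataclass
class PostOffice:
    """Checks incoming submissions and routes them (illustrative only)."""
    # Assumed stand-in for the 'agreed specifications' mentioned in the article.
    required_fields: tuple = ("title", "journal", "issn")
    # Stand-in for the error-recovery database inspected by staff.
    error_db: list = field(default_factory=list)
    # Validated Publisher Submission Packages awaiting the Batch Builder.
    packages: list = field(default_factory=list)

    def check(self, sub: Submission) -> list:
        """Return the conformity problems found; an empty list means a pass."""
        problems = []
        if not sub.content:
            problems.append("empty content")
        for name in self.required_fields:
            if not sub.metadata.get(name):
                problems.append(f"missing metadata field: {name}")
        return problems

    def receive(self, sub: Submission) -> bool:
        """Validate one submission and file it as a PSP or as an error record."""
        problems = self.check(sub)
        if problems:
            self.error_db.append((sub.identifier, problems))
            return False
        self.packages.append(
            {"id": sub.identifier, "content": sub.content, "metadata": dict(sub.metadata)}
        )
        return True
```

In this sketch only the entries in `error_db` would need human attention, which mirrors the article's point that inspecting the error database is the one manual step.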
The Batch Builder ingests the material (both the content and the metadata) and converts the bibliographical descriptions from the publisher into the KB's internal format. After conversion, the content itself is stored in the e-Depot, while the metadata is stored in the KB catalogue. Customers may query the online catalogue only after a process of identification, authentication and authorisation (IAA). The e-Depot itself cannot be accessed directly; it passes relevant documents to the user only after this clearance.
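As a rough illustration of this split between stored content and catalogue metadata, with access gated by IAA, consider the Python sketch below. Everything in it is hypothetical: the `EDepot` class, the `kb_title` and `kb_source` field names standing in for the KB's internal format, and the simple set of cleared users are assumptions for the example, not a description of the real system.

```python
from dataclasses import dataclass, field


@dataclass
class EDepot:
    """Toy model of the store-plus-catalogue split (not the real DIAS design)."""
    documents: dict = field(default_factory=dict)     # id -> content, never exposed directly
    catalogue: dict = field(default_factory=dict)     # id -> description in an internal format
    cleared_users: set = field(default_factory=set)   # users assumed to have passed IAA

    def ingest(self, package: dict) -> None:
        """Convert the publisher's description, then store content and metadata separately."""
        meta = package["metadata"]
        record = {
            "kb_title": meta["title"].strip(),    # hypothetical internal field names
            "kb_source": meta["journal"].strip(),
        }
        self.documents[package["id"]] = package["content"]
        self.catalogue[package["id"]] = record

    def _require_clearance(self, user: str) -> None:
        """Refuse any request from a user who has not completed IAA."""
        if user not in self.cleared_users:
            raise PermissionError(f"{user} has not completed IAA")

    def search(self, user: str, term: str) -> list:
        """Query the catalogue by title; only cleared users may search."""
        self._require_clearance(user)
        term = term.lower()
        return [doc_id for doc_id, rec in self.catalogue.items()
                if term in rec["kb_title"].lower()]

    def fetch(self, user: str, doc_id: str) -> bytes:
        """Hand a stored document to a cleared user."""
        self._require_clearance(user)
        return self.documents[doc_id]
```

The design point the sketch tries to capture is that users only ever reach documents through the catalogue and the clearance check, never by addressing the store itself.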
DNEP resides on two IBM RS/6000 SP servers. An RS/6000 F50 server functions as a control workstation for the other servers. For long-term storage, IBM Global Services and the KB created a storage area network (SAN) based on a 3.4TB IBM TotalStorage Enterprise Storage Server, three IBM TotalStorage 3494 Enterprise Tape Library systems and an IBM 3995 Optical Library. These all function as a central storage pool for the electronic documents.
The point has just been reached – in August 2004 – when all current publications of KB's commercial publishing partners (Elsevier, Kluwer Academic, Blackwell Publishing, SDU Publishers, and BioMed Central) are becoming publicly available to researchers at workstations in the KB. In future, content will be made available through the KB website, but this will be subject to agreement with copyright holders.
And Elsevier – will it conclude similar deals with other deposit libraries? Karen Hunter said that it has no immediate plans to do so. 'The KB is our national deposit library and is internationally recognised for its work in preserving digital content. We are completely happy with the way things are working out. Over time we expect to establish other official archives, but until potential partners are as well-prepared as the KB, we will not try to replicate what they are doing.'
As Steenbakkers, of the KB, pointed out: 'The challenge of preserving digital information and guaranteeing permanent access to it can only be addressed successfully by realising a long-standing and close co-operation of three key players: leading memory institutions (national libraries and archives), main producers of information (publishers and public agencies), and, last but not least, leading IT companies. The development of the e-Depot at the KB, together with the science publisher Elsevier and IT company IBM, is a good example of such a co-operation. These partners have been breaking new ground in the functional, technical, and policy areas, in order to develop permanent availability of digital information. It is hoped more major players in the areas mentioned will take up their responsibility for digital preservation and start pushing back frontiers.'
About the partners
The Koninklijke Bibliotheek (KB), founded in 1798, is the national library of the Netherlands. In addition to being the official deposit library for all Dutch printed and electronic publications, it maintains rare and special collections including medieval manuscripts and incunabula and is tasked with the preservation, management, documentation and accessibility of the national cultural heritage in written, printed and electronic form. With a holding of 3.3 million items, its deposit collection is growing by an average of 445,000 books and electronic publications annually. The KB is an autonomous body financed by the Dutch Ministry of Education, Culture and Science.
Elsevier is a world-leading publisher of scientific, technical and medical (STM) information products and services. It publishes more than 1,800 journals and 2,200 new books per year, in addition to offering a suite of electronic products such as ScienceDirect, MD Consult, Scopus, bibliographic databases, and online reference works.