Solving archive challenges
How to preserve data as hardware and data formats evolve is a key concern for many organisations. In the first of two articles, Robert Sharpe of Tessella looks at ways of ingesting, managing and storing digital information.
In today’s society an increasing amount of information is being created and stored digitally, which offers many great benefits. But there are also risks: the number of ways in which this information can be stored is also increasing, with new software products or new versions of existing products constantly coming onto the market. These formats rapidly become obsolete, and as hardware and operating systems move on, digital files can become unreadable. Even in cases where the format can be read by newer software, some of the information in that file may be altered or lost in the transformation.
There are further threats too. Digital records can also be endangered by the way in which they are stored: the media can deteriorate or become difficult to read owing to the obsolescence of the associated hardware. This is in sharp contrast to paper records which, provided they are stored in the appropriate conditions, are likely to remain readable for a very long time.
A case study illustrates the problem well. ‘The Domesday Book’, William the Conqueror’s survey of England in 1086, is still readable today in the UK National Archives at Kew, London. A modern, digital version of the ‘Domesday Book’ was created by the BBC children’s programme Blue Peter in 1986 to celebrate the 900th anniversary of the original. Children were invited to submit digital material about their community, which was stored on the latest technology – 12-inch laser disks – to guard against obsolescence.
Just 15 years later, serious action had to be taken to save these records from being lost. Fortunately, the time-gap from creation was sufficiently small that a few laser disk readers still existed, so this problem could be solved. The files were extracted in binary format onto more modern media. The next challenge was to interpret this data so that it was not just a meaningless string of 1s and 0s, as the digital files could not be read by any modern software. Solving this problem was not a trivial exercise, but it was completed successfully after considerable effort, including experience and input from the original record creators.
Although the records are now safe, perhaps for another 15 years, the amount of effort required to preserve a relatively small amount of information shows that it is not practical to rely on such methods for all digital records and that active preservation measures are preferable.
For some organisations such as national libraries and major research institutions, digital preservation is a particularly pressing concern. Many such organisations have already taken significant active preservation measures. These efforts make it possible to plot a roadmap for other organisations to follow in tackling the digital preservation problem.
Figure 1: Schematic view of the OAIS model
Digital archiving is a relatively young discipline and, as such, standards are in their infancy. Nonetheless, ISO has endorsed NASA’s Reference model for an Open Archival Information System (OAIS) (http://ssdoo.gsfc.nasa.gov/nost/isoas/overview.html). This approach divides the digital archiving challenge into six functional entities: ingest; data management; storage; access; preservation planning; and administration (see figure 1). The first three of these will be tackled in this article, while the others will be examined in the next issue of Research Information.
Ingesting the data
For a digital archive to be effective, material must be added to the archive. This requires material to be selected, stored in an appropriate structure, and described by appropriate metadata. Before archiving a set of potential records, it is necessary to decide whether they are worth keeping. One option, of course, is to archive everything. However, the cost of archiving is large: archiving a file is a commitment to keep it for an exceedingly long time (potentially in perpetuity) in a managed, well-maintained environment. And, in some cases, it may not be possible to archive everything for security, intellectual property, legal or other reasons.
The ideal situation would be to review all material, decide what needs to be archived and reject the rest. But manual selection is also expensive. In the longer term, it is likely that tools to help with this selection will evolve, using technologies that can assess the usefulness of potential records based on their characteristics or derived content. These tools may need to decide not just which files to keep, but also which parts of an individual file to keep.
Some current digital archives will only select records for preservation if they are created in (or first converted to) a set number of formats. This can work well if it is possible to tie the archival system to the system used to create and manage these records. A demonstration of this principle was created by the Dutch Government’s Testbed project, where Tessella produced an email add-in to Microsoft Outlook that allowed users to make use of their email almost as normal, but saved sent and received emails as XML documents, thereby ensuring that the format of the files was fit for future preservation. Another example is the archiving of thesis abstracts: a Scandinavian project called DiVA uses custom-created input forms to capture the records to be archived.
Despite such initiatives, however, a generic archival solution, although it can influence formats, cannot impose absolute restrictions; people will continue to use the formats that suit their purposes rather than purely to fit preservation goals.
The structure of records
There are further challenges in ingesting data. Records tend to be hierarchical in nature, and the ingest process has to allow archivists to choose the most appropriate logical structure for them. The records to be archived will consist of physical files, which may also have a hierarchy that will need to be preserved. This physical hierarchy is more a function of current technology than of conceptual content structure. For example, the minutes of a set of meetings could be stored as a series of word processor documents, where there is likely to be one physical file per meeting; in a dedicated database holding all the minutes, in which case there might be a single physical file covering the entire set of meetings; or as part of a large enterprise-wide database, where the record might be part of a series of files associated with a record at a higher level. It is important to allow archivists to assign a conceptual hierarchy to the records independently of their physical hierarchy. This implies that an archive should allow some files to be shared between records even if they are not associated with a higher-level record.
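As a rough illustration of separating the conceptual hierarchy from the physical files, the sketch below models records as a tree whose nodes merely point at files, so that one file can belong to several records. The class names, titles and paths are all hypothetical and are not taken from any particular archive system.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class PhysicalFile:
    """A file as held in the file store (the path is illustrative)."""
    path: str


@dataclass
class Record:
    """A node in the conceptual hierarchy chosen by the archivist."""
    title: str
    files: List[PhysicalFile] = field(default_factory=list)
    children: List["Record"] = field(default_factory=list)

    def all_files(self) -> Set[str]:
        """Collect the paths of every file used by this record or its descendants."""
        paths = {f.path for f in self.files}
        for child in self.children:
            paths |= child.all_files()
        return paths


# One database file holds the minutes of every meeting, so each
# meeting record points at the same physical file.
minutes_db = PhysicalFile("store/minutes.db")
january = Record("Minutes, January meeting", files=[minutes_db])
february = Record("Minutes, February meeting", files=[minutes_db])
series = Record("Board meeting minutes", children=[january, february])

# A record outside the series can share the same file as well.
appendix = Record("Annual report appendix", files=[minutes_db])
```

Because records hold references rather than copies, the conceptual tree can be rearranged without touching the file store.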
Bringing in the metadata
It is also necessary to ensure that appropriate metadata is captured in the ingest process. This metadata falls into two broad categories: technical and descriptive.
Technical metadata allows future consumers to learn to use the records and enables archivists to take active measures to preserve them. Most of this technical metadata can be derived automatically, using appropriate third-party software. This could, for example, automatically determine the file format of most of the ingested files, which can then be matched against known combinations of application software, operating system and hardware that the consumer will need in order to be able to interpret these formats. This mapping can be provided by PRONOM, an online repository of formats created for the UK National Archives by Tessella.
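To give a flavour of automatic format identification, the following minimal sketch compares the first bytes of a file against a small table of well-known signatures ("magic numbers"). The table and function names are illustrative assumptions; this is not how PRONOM or any specific identification tool is implemented, and a real archive would consult a maintained format registry rather than a hard-coded list.

```python
# Illustrative signature table: leading "magic" bytes -> format name.
# Hypothetical and deliberately tiny; a registry would hold far more detail.
SIGNATURES = [
    (b"%PDF-", "PDF document"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"PK\x03\x04", "ZIP container"),
]


def identify_format(data: bytes) -> str:
    """Return a format name for the leading bytes of a file, or 'unknown'."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return "unknown"


def identify_file(path: str) -> str:
    """Read just enough of a file to compare against the signature table."""
    with open(path, "rb") as handle:
        return identify_format(handle.read(16))
```

The identified format name could then be used as the key for looking up the software, operating system and hardware needed to interpret the file.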
Descriptive metadata allows future consumers to understand the records. For paper-based records, archivists have traditionally provided manually created, detailed descriptive metadata to accompany the records and to allow consumers to find the records they require. It is obviously possible to do the same for digital records, but the sheer quantity of these records (and the fact that it is necessary to use appropriate application software, operating systems and hardware to view them) means that this is potentially a bottleneck in the process.
Ideally, this metadata would be held with the records at the point that they are created, and then updated when they are edited, such as through the use of a records management system. However, this is unlikely to be the case for existing digital records. An alternative is to try to create this metadata at the point of archiving, ideally automatically. However, this is much more difficult than extracting technical metadata owing to the lack of appropriate software tools and the fact that the relationship between a record and its physical file structure is potentially complex.
There are two potential solutions to this problem. The first is to develop better automatic extraction tools. While some progress is being made in this area, the field of data mining is still young and it is likely to be some time before effective tools of this nature become mature.
The alternative is to rely less on such traditional finding aids for digital records and use other methods such as an internet-style, content-based search engine. This option exploits one of the advantages of digital records: it is possible to search within such records, and it is a method that is becoming increasingly accepted by end-users. However, even if advanced cataloguing and indexing techniques are used, it will still be preferable to offer the consumer a brief summary of the record, so that they can assess its potential usefulness before they are required to view, download or request the files making up that record. Also, while this style of searching works well for text-based documents, it is less easily applicable to files such as images and databases.
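The idea of content-based searching can be sketched with a very small inverted index, which maps each word to the records whose text contains it. This is only an illustration under simplifying assumptions: the record identifiers and function names are hypothetical, the text is assumed to have been extracted from the files already, and a production search engine would add ranking, stemming and much more.

```python
import re
from collections import defaultdict
from typing import Dict, Set


def tokenize(text: str) -> Set[str]:
    """Split text into a set of lower-case words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(records: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each word to the ids of the records whose text contains it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for record_id, text in records.items():
        for word in tokenize(text):
            index[word].add(record_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the ids of records that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    matches = [index.get(word, set()) for word in words]
    return set.intersection(*matches)


# Hypothetical extracted text for two records.
records = {
    "rec-1": "Minutes of the board meeting, January",
    "rec-2": "Budget report for January",
}
index = build_index(records)
hits = search(index, "january minutes")
```

An index of this kind only helps with text that can be extracted, which is why images and databases remain harder to search.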
Managing the data and metadata
In a digital archive it is important not to allow users unauthorised access to the records. No one should be able to edit the archived files. However, approved users should be able to edit the metadata about the records, for example to add extra information or correct spelling mistakes. All metadata entry and editing should be audited so that it is possible to work out who changed what, and when each change occurred.
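One possible shape for audited metadata editing is sketched below: every change by an approved user is appended to a log recording who changed which field, the old and new values, and when. The class and field names are hypothetical, and a real system would persist the log and tie users to a proper authentication scheme.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class AuditEntry:
    """One immutable line of the audit trail."""
    user: str
    field_name: str
    old_value: Optional[str]
    new_value: str
    timestamp: datetime


class AuditedMetadata:
    """Metadata for one record; every edit is logged and access is checked."""

    def __init__(self, approved_users: Iterable[str]):
        self._approved = set(approved_users)
        self._values: Dict[str, str] = {}
        self.audit_log: List[AuditEntry] = []

    def set(self, user: str, field_name: str, new_value: str) -> None:
        if user not in self._approved:
            raise PermissionError(f"{user} may not edit metadata")
        entry = AuditEntry(
            user=user,
            field_name=field_name,
            old_value=self._values.get(field_name),
            new_value=new_value,
            timestamp=datetime.now(timezone.utc),
        )
        self._values[field_name] = new_value
        self.audit_log.append(entry)

    def get(self, field_name: str) -> Optional[str]:
        return self._values.get(field_name)


metadata = AuditedMetadata(approved_users=["archivist"])
metadata.set("archivist", "title", "Minuts of the board")
metadata.set("archivist", "title", "Minutes of the board")  # spelling fix
```

Note that only the metadata is editable here; the archived files themselves are never exposed to this interface.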
In addition, many records will need to be accrued over time, so it must be possible to add extra files to an existing series of records. Migration of files to new formats should also be allowed. In this case, it is advisable to always retain the original, but, if a record has undergone a series of transformations, it is not necessary to retain every intermediate format.
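The retention rule for migrated files can be illustrated with a small sketch in which the original is fixed and only the most recent migrated copy is tracked. The names are hypothetical, and discarding every intermediate is just one possible policy; an archive may well choose to keep some of them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Manifestation:
    """One stored form of a record's content (path and format are illustrative)."""
    path: str
    file_format: str


class RecordFiles:
    """Tracks the original plus the latest migrated copy of a record."""

    def __init__(self, original: Manifestation):
        self.original = original                       # always retained
        self.current: Optional[Manifestation] = None   # latest migration, if any

    def migrate(self, new_copy: Manifestation) -> Optional[Manifestation]:
        """Record a new migrated copy and return the superseded intermediate."""
        superseded = self.current
        self.current = new_copy
        return superseded


holdings = RecordFiles(Manifestation("store/report.doc", "word processor v1"))
first = holdings.migrate(Manifestation("store/report.v2", "word processor v2"))
second = holdings.migrate(Manifestation("store/report.xml", "XML"))
# `second` is the v2 intermediate, which this policy allows to be removed.
```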
Another complication arises because some records could contain information that not all users will be authorised to see. In such cases, it may be possible to create a redacted version of the record consisting of files with the secure information removed. This, in essence, leads to the creation of another sibling record of the original. In general, there is no generic way of performing this redaction and thus it may be necessary to return to the original application in order to edit the file.
Storage is an essential component
One of the key aspects of an archive is to ensure that the records are stored securely and safely. Fortunately, the core of this problem is not especially challenging, as millions of organisations that retain their digital information from day to day can testify. There are, however, some unusual features needed in an archive. Firstly, it is important to recognise the difference between the metadata (which can change) and the actual digital files (which remain the same). One way to resolve this issue is to store the metadata and files separately (eg in an XML-enabled database and a separate file store respectively). This means that the archive must take responsibility for maintaining the links between the two. The alternative is to store the metadata and the files together as one object, in order to ensure that they cannot become separated and to simplify backup issues. However, this can lead to the creation of unwieldy, large objects, and it makes the editing of these records a potentially complicated process.
The second issue to consider with storage is that everything that goes into the archive must also be backed up and stored off-site. There are a number of options for doing this but all face the same slightly unusual issue: the material to be retained is invariant. As the archive grows in size, running a full overnight backup of the system, say once a week, may not be a realistic option, and thus appropriate backup policies have to be developed that take into account the ingest rate and the relative difficulty of re-ingesting information from a given day, for example. It is also important that the backup policy keeps the metadata and the data synchronised (if these are stored separately).
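Because archived files never change once ingested, one simple policy is to back up only those files that do not yet have a backup copy, rather than repeating a full backup. The sketch below shows that selection step alone; the function name and the idea of keeping an ingest date per file identifier are assumptions made for illustration, not a description of any particular product.

```python
from datetime import date
from typing import Dict, List, Set


def select_for_backup(ingest_dates: Dict[str, date], backed_up: Set[str]) -> List[str]:
    """Return ids of files with no backup copy yet, oldest ingest first."""
    pending = [file_id for file_id in ingest_dates if file_id not in backed_up]
    return sorted(pending, key=lambda file_id: (ingest_dates[file_id], file_id))


# Hypothetical holdings: file id -> day of ingest.
ingest_dates = {
    "file-a": date(2007, 6, 1),
    "file-b": date(2007, 6, 2),
    "file-c": date(2007, 6, 2),
}
pending = select_for_backup(ingest_dates, backed_up={"file-a"})
```

The same selection would need to be applied to the matching metadata so that the two backups stay synchronised.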
In addition to creating backups, the storage system needs to actively manage its holdings to ensure that every file held is being appropriately cared for with an automated programme of checks. This includes, for example, exercising tapes to prevent them from sticking and ensuring regular maintenance occurs before there is a problem. There are a variety of commercial systems on the market that help perform such functions.
One way to confirm that files have not been changed during storage is to create (and subsequently verify) a checksum for each file. This is really an application-level issue, since it is best to create a checksum early in the ingest process and to verify that checksum just before dissemination. Some systems might decide to build in digital signatures in addition to a checksum. The advantage of relying on just a checksum is that the technology required is simpler and openly available (checksum algorithms are freely published). The advantage of a digital signature is that it provides additional information about who verified the contents of the file, although this can also be provided by the archiving system itself and recorded in an audit trail.
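A minimal sketch of the create-then-verify pattern is shown below, using Python's standard `hashlib` module. The choice of SHA-256 and the function names are assumptions for illustration; the article does not prescribe a particular algorithm.

```python
import hashlib


def compute_checksum(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path: str, expected: str) -> bool:
    """Recompute the checksum and compare it with the value stored at ingest."""
    return compute_checksum(path) == expected
```

At ingest the archive would store the result of `compute_checksum` alongside the record's metadata, and call `verify_checksum` with that stored value just before the file is handed to a consumer.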
Digital preservation tends to occur on a massive scale, so the amount of information to be stored is often vast. One option to reduce the volume is to perform guaranteed lossless compression. The argument against this approach is that it results in a loss of redundancy, which would make the impact of losing a single data bit much higher. However, if the storage system is working as it is intended to then this should not be a problem: a key requirement of an archive should be to ensure that a single bit is never lost – and, in any case, there should always be another backup copy of every file if the system needs to be restored.
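One way to make the "guaranteed" part of lossless compression concrete is to decompress each compressed copy immediately and compare it with the original before the compressed form is stored. The sketch below does this with Python's standard `zlib` module; the function name is hypothetical and the choice of zlib is an assumption for illustration only.

```python
import zlib


def compress_verified(data: bytes) -> bytes:
    """Compress data and confirm that decompression restores it exactly."""
    compressed = zlib.compress(data, 9)  # 9 = strongest standard compression level
    if zlib.decompress(compressed) != data:
        raise ValueError("round trip failed; refusing to store compressed copy")
    return compressed


# Highly repetitive sample content compresses well.
sample = b"minutes of the meeting\n" * 1000
packed = compress_verified(sample)
```

The round-trip check costs extra processing at ingest but means a compressed file is only ever stored once it has been shown to reproduce the original bit for bit.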
Tessella’s Robert Sharpe has worked mainly on digital archiving projects for the last three years. These include the UK National Archives’ Digital Archive and PRONOM systems, consultancy on the US National Archives and Records Administration’s Electronic Records Archives system, and consultancy for the Dutch Nationaal Archief. Look out in the August/September 2007 issue of Research Information for his insight into the access, preservation planning and administration challenges of digital preservation.