Preservation requires planning and maintenance

17 August 2007

Issue:

August/September 2007

In the last issue of Research Information we looked at the processes of ingesting, managing and storing digital information to help organisations preserve data. The next challenges to consider are access, preservation planning and administration, according to Tessella's Robert Sharpe.

Ensuring long-term access to digital information is an ongoing challenge. Firstly, there is the need to ensure that data is entered and stored in a way that allows it to be transferred easily as formats change. Secondly, there is the need to make sure that this data can be accessed correctly by users in years to come. Access to an archive involves searching for records and delivering them to the end-user. Use of any or all of these services could incur a fee, and it would thus be the responsibility of the archive software to ensure that appropriate payment was received before allowing access to a fee-bearing service.

There are two ways in which consumers can find records: open searching and browsing through catalogue hierarchies. The ability to perform these tasks will depend on the indexing and cataloguing that has occurred as part of the ingest process (see Research Information, June/July 2007). Typically, when searching, users will enter a search criterion and then be presented with a prioritised list of possible ‘hits’ showing a brief summary of the records that match that criterion. Users will then be able to refine the search to home in on the record they want. Alternatively, users may be able to browse through a catalogue in order to locate records.

Having chosen the record they want to see, the consumer should have the option of viewing more detailed metadata about this record. As well as allowing the consumer to ensure that they have found the correct record, this information may contain important contextual information that is needed to correctly interpret the contents of the record to be retrieved. They may also be shown some technical metadata so that they can ascertain whether they are capable of physically using the record, such as whether they need specialised application software or an obsolete operating system.

When searching for records, it is quite likely that consumers would want to perform a single search to find all the available material on a given subject, regardless of whether it is stored digitally, on paper, or even in which archive it is stored. There are two ways of enabling such a search.

The first of these is retaining a central index containing all the relevant metadata. This option will probably give consumers the fastest response, but it means that the organisation responsible for maintaining that index has to store a considerable amount of information, ensure that it is all in compatible formats and keep it up to date.

The alternative is to distribute the search using Web services. In this case, the archive that receives the consumer’s request would send out a series of sub-requests to each registered archive asking it to perform an automated search of its holdings, based on the consumer’s criteria. This would return a list of hits, each with a numerical relevance score. It would be necessary for all the archives to agree the format of the search criteria (e.g. using an agreed XML schema) and agree a scoring scheme for the hits (although these could then be potentially weighted according to which archive they are stored in). Once all archives have replied (or a pre-specified timeout has expired) the consumer will be presented with the amalgamated and sorted list of hits. If more detailed metadata is requested on a remote holding, this could be obtained by another Web service request or by re-directing the consumer to that archive.
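The distributed search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each archive's Web service is represented by an ordinary function, and the names `Hit`, `federated_search`, `weights` and `timeout_s` are hypothetical rather than part of any real archive system. It shows the sub-requests being sent in parallel, late or failed archives being left out, and the weighted hits being merged into one sorted list.

```python
from concurrent.futures import ThreadPoolExecutor, wait
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Hit:
    archive: str    # name of the archive holding the record
    record_id: str  # identifier within that archive
    score: float    # relevance on the scale all archives have agreed


# Stand-in for one archive's Web service call (an assumption for this sketch).
SearchFn = Callable[[str], List[Hit]]


def federated_search(query: str,
                     archives: Dict[str, SearchFn],
                     weights: Dict[str, float],
                     timeout_s: float = 2.0) -> List[Hit]:
    """Send the query to every registered archive and merge the replies."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(archives)))
    futures = [pool.submit(search, query) for search in archives.values()]
    # Wait until every archive has replied or the timeout has expired.
    done, _late = wait(futures, timeout=timeout_s)
    pool.shutdown(wait=False)  # do not block on archives that missed the deadline

    merged: List[Hit] = []
    for future in done:
        if future.exception() is not None:
            continue  # an archive that failed contributes no hits
        for hit in future.result():
            # Archives without an explicit weight keep their raw score.
            weight = weights.get(hit.archive, 1.0)
            merged.append(Hit(hit.archive, hit.record_id, hit.score * weight))
    return sorted(merged, key=lambda h: h.score, reverse=True)


# Two illustrative archives returning fixed results.
def national(query: str) -> List[Hit]:
    return [Hit("national", "N-1", 0.8)]


def regional(query: str) -> List[Hit]:
    return [Hit("regional", "R-7", 0.9)]


hits = federated_search("census",
                        {"national": national, "regional": regional},
                        weights={"regional": 0.5})
```

In this example the regional archive's hit is down-weighted, so the national record is listed first even though its raw score was lower.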

In both cases, the user would probably need to be re-directed to the archive hosting the material to be disseminated.

The simplest way of disseminating records is simply to allow the users to download the records (either in their original or a migrated format). However, some downloads will be large so it may be more practical to allow users to request a posted copy, for example, on a CD.

Both of these methods have the disadvantage of requiring consumers to have appropriate client-side application software to interpret the file formats. Thus, a third option is to create presentation-ready versions of records (e.g. converting word processing documents into HTML) and display these directly to the users.

Preservation planning

It might be tempting to conclude that the simplest approach to archiving would just be to print everything that is important and then store the paper records or, if space is a concern, to store the information on microfiche. However, this would lose many of the potential advantages offered by digital records, such as the ability to maintain security and verify authenticity. It would also prevent the ability to make verifiable copies, easily edit a document (if required) and search within a document. Further, for some records, such as databases or virtual reality models, it is not possible to create a printed version that captures all the relevant information. Better solutions must be found.

One possibility would be the museum approach – maintaining the old hardware and software used to create the data in the first place. However, this is not very practical. Such a solution would require the maintenance of every combination of hardware and software required, and the hardware would become increasingly expensive to upkeep and would eventually become irreparable. This is really only an interim measure.

Another option is the migration approach (see figure 1). In this technique, a copy of the original is transformed into another, more modern format that can be read by newer application software. For instance, scientific data in a bespoke binary format may be transformed into a document conforming to an XML schema, which is self-describing and based on very simple low-level technology so is less vulnerable to obsolescence. In more complex cases it may be necessary to perform a series of transformations over the lifetime of the data, either because of a change in the available application software or because a better transformation engine becomes available. In such cases, it is normally preferable to return to the original file and transform this into the new format, rather than transform it from the previous migration (as this will potentially already have lost some of the information in the original).

Figure 1: Migration involves accepting that natural changes occur to hardware, operating systems and application software (light grey changes) and therefore the original file is deliberately transformed (dark grey changes) in order to allow a record to remain readable.

To make this approach easier, it would be best to restrict the number of formats by moving towards standardisation. One example where standardisation has worked is in image formats. Specifications such as TIFF have been almost universally adopted by software manufacturers because they have realised that there is a larger overall market if images can be readily exchanged. In some cases, such as in colour specification, manufacturers have actively collaborated (forming the International Color Consortium) to make standardisation happen. Migration could also make presentation easier. For instance, if a digital record consists of Microsoft Word files, a consumer could choose to download the records to their local PC and read them using a locally-installed copy of the software. An alternative would be to create an HTML rendition of this file and display this to the consumer instead. Third-party products exist that will perform such migrations for a number of formats with a reasonable degree of integrity.

However, while standardisation has many attractions, the commercial companies that create the majority of application software in use today are unlikely to follow this route unless there is a competitive advantage to be gained. Also, it is worth remembering that it is not always trivial to translate records from their current formats into a standard format. Such a transformation may require archivists to make assumptions about the intentions of the original author(s).

The fact that migration involves a transformation, which may result in loss, means that it is necessary to understand and categorise this loss so that different transformation software can be assessed and compared. The attributes that need to be considered can be split into five categories:

Context. This is set by metadata and thus is unaffected by migration (although the migration process should itself be documented)
Content. A good transformation should preserve all the content of the original. However, sometimes the new format will not allow information to be kept in exactly the same form
Structure. It is important to remember that, if an accession undergoes migration, for either preservation or presentation purposes, the logical (technology-independent) structure will be preserved, but the physical (technology-dependent) structure may be altered as not all migrations will lead to an exact one-to-one file correspondence. This means that migration is potentially a complex process and as such could be prone to human error (e.g. marking a file incorrectly as having been superseded by a newer version)
Appearance. It is quite hard to preserve the look and feel of the original when performing a migration. For most purposes, this may not matter too much but there is not always a clear-cut distinction between appearance and content. For instance, if an author uses bold or italics at some point in a document, it is probably an emphasis and thus can be interpreted as being part of the content of that document
Behaviour. One of the advantages of digital records is that it is possible to manipulate the information within them. For example, database records can be queried to provide new views of the information contained within them or a model can be re-run using different initial parameters. This aspect of a digital record relies on programming logic embodied in the application software and is thus difficult to preserve by migration
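One possible way of comparing transformation software across these five categories is a simple weighted score. The sketch below is an assumption made for illustration, not an established benchmark: the 0-to-1 scale, the weights and the function name `weighted_score` are all hypothetical, and an archive would need to choose its own values for each type of record.

```python
from typing import Dict

CATEGORIES = ("context", "content", "structure", "appearance", "behaviour")


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine per-category scores (0 = lost, 1 = fully preserved) into one figure."""
    missing = [c for c in CATEGORIES if c not in scores or c not in weights]
    if missing:
        raise ValueError(f"no score or weight supplied for: {', '.join(missing)}")
    total_weight = sum(weights[c] for c in CATEGORIES)
    return sum(scores[c] * weights[c] for c in CATEGORIES) / total_weight


# Hypothetical weights for a text-heavy collection, where content matters most.
weights = {"context": 2, "content": 4, "structure": 2, "appearance": 1, "behaviour": 1}

# Hypothetical assessments of two candidate migration tools.
tool_a = {"context": 1.0, "content": 0.9, "structure": 0.8, "appearance": 0.5, "behaviour": 0.0}
tool_b = {"context": 1.0, "content": 0.7, "structure": 0.8, "appearance": 0.9, "behaviour": 0.0}

score_a = weighted_score(tool_a, weights)
score_b = weighted_score(tool_b, weights)
```

With these illustrative weights the first tool scores higher because it keeps more of the content, even though the second does better on appearance. A collection of virtual reality models would presumably weight behaviour far more heavily.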

One of the key aspects of preservation planning is ensuring that the strategy for data types is reviewed regularly. This means that there is a requirement to maintain a repository of information about each file format stored in the archive, to assist archivists in determining its best preservation strategy. The ideal scenario for a large archive would be to automate the migration process. The process would then work something like this:

An archivist updates the file format repository to state that migration of format XYZ is now required and that the approved policy is to migrate to format ABC using a specified piece of software. The archive automatically detects the update, calculates how long the migration will take and then schedules this processing to occur at relatively quiet periods. The migration then takes place automatically, with humans only needing to be involved to provide a quality check.
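The planning step in this process might look something like the following sketch. Everything here is assumed for the purposes of illustration: the class and function names are hypothetical, the throughput figure would have to be measured for the approved software, and the quiet period is taken to begin at 01:00.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass(frozen=True)
class MigrationPolicy:
    source_format: str           # e.g. "XYZ"
    target_format: str           # e.g. "ABC"
    tool: str                    # the approved piece of software
    seconds_per_megabyte: float  # assumed throughput, measured beforehand


@dataclass(frozen=True)
class StoredFile:
    file_id: str
    fmt: str
    size_mb: float


def estimate_duration(files: List[StoredFile], policy: MigrationPolicy) -> timedelta:
    """Estimate how long it will take to migrate every affected file."""
    affected_mb = sum(f.size_mb for f in files if f.fmt == policy.source_format)
    return timedelta(seconds=affected_mb * policy.seconds_per_megabyte)


def next_quiet_start(now: datetime, quiet_from_hour: int = 1) -> datetime:
    """Return the next time the assumed quiet period begins."""
    start = now.replace(hour=quiet_from_hour, minute=0, second=0, microsecond=0)
    return start if start > now else start + timedelta(days=1)


def plan_migration(files: List[StoredFile], policy: MigrationPolicy,
                   now: datetime) -> Dict[str, object]:
    """Build a schedule entry once the format repository has been updated."""
    return {
        "tool": policy.tool,
        "file_ids": [f.file_id for f in files if f.fmt == policy.source_format],
        "start": next_quiet_start(now),
        "estimated_duration": estimate_duration(files, policy),
        "needs_quality_check": True,  # a human still reviews the result
    }


holdings = [StoredFile("a", "XYZ", 100.0),
            StoredFile("b", "XYZ", 50.0),
            StoredFile("c", "ABC", 10.0)]
policy = MigrationPolicy("XYZ", "ABC", tool="approved-converter",
                         seconds_per_megabyte=2.0)
plan = plan_migration(holdings, policy, now=datetime(2007, 8, 17, 14, 30))
```

A real archive would also need to split long jobs across several quiet periods and record the outcome of each migration in the metadata; the sketch leaves both out.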

The emulation approach

An alternative to migration is to use emulation. There are variations of this technique but the most promising would seem to be hardware emulation. This is where the original file, application software and operating system are retained but, since it is accepted that hardware will become obsolete over time, the original hardware is emulated in software on new hardware (see figure 2).

Figure 2: Hardware emulation involves accepting that natural changes occur to hardware but making no change to the original file, application software or operating system. Enabling the software to continue to run requires the creation of an emulator (dark grey change) to emulate the original hardware on the new hardware.

This technique potentially has an advantage over migration in that it should allow the look, feel and behaviour of the original application to remain intact. This will be especially helpful for records with a high degree of behavioural content, such as virtual reality models. Also, for a given piece of hardware, such an emulator can be written once and re-used by many organisations, although an emulator may need to be re-written when hardware changes again. However, such generic emulators do not yet exist, so the concept cannot yet be regarded as a proven universal approach. The approach also means that licensed copies of the original application software (and a record may rely on many applications to operate as originally intended) and of the original operating system must be retained, including the relevant bug fixes, service releases and so on. It also means that the effort required to access an old record could be considerable, since the original application software and operating system must be installed together with the emulator before the record can be meaningfully interpreted.

One emulation method is operating system emulation, where the original file and application software are retained and the ‘natural’ evolution of both the operating system and the hardware is accepted. Alternatively, application software emulation is where the original file is retained and the ‘natural’ evolution of the application software, operating system and the hardware is accepted.

Neither of these seems as feasible as hardware emulation (see the summary of the result of the Dutch Government Digital Preservation Testbed project for more details: www.digitaleduurzaamheid.nl/bibliotheek/docs/white_paper_emulatie_EN.pdf).

Administering the system

Day-to-day running of a digital archive involves many tasks that are very similar to those required to keep any other large software system running. In particular, it is important to remember that all such systems require standard operating procedures and other processes in addition to the actual hardware and archiving software.

Building and maintaining a digital archive also poses some unique maintenance issues that need to be addressed during design and development. The first of these is future-proofing. The point of a digital archive is to keep digital material for a long time but the lifetime of most software and hardware is very short: typically just a few years before it becomes obsolete. To help with this, careful attention must be paid when designing an archive to provide a system that is as future-proof as possible.

The first step is to use a well-established framework as a benchmark for the design of the archive. Users of archives will normally expect an interface based around web technology. Assuming that this is the case, an archive should be built with a standard n-tier architecture using one of the two well-established frameworks currently in operation: J2EE (Java 2 Enterprise Edition), an open standard owned by Sun, or Microsoft’s .NET framework, which is not an open standard, but is now well established and is likely to continue to be supported in the foreseeable future.

Figure 3 illustrates the second step needed to help future-proof the software in an archive: the design must be exceedingly modular with clearly defined interfaces between each component. This allows one component to be easily swapped with another without affecting the rest of the software.

A third and related feature of digital archive software is that any third-party components used must have a well-established ‘sunset’ policy (i.e. before any component is used it must be clear how that component could be retired and swapped with an alternative component). This can be achieved by wrapping the interface of the component in such a way that it creates a buffer between the main components of the archive and the third-party software, which insulates the former from a change in the latter. This enables a new version of the same component, or a completely different component that performs the same job, to be switched for the third-party component without affecting the rest of the application.
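The wrapping technique can be illustrated with a small sketch. The component names and their methods are invented for this example and do not refer to any real product. The point is that the rest of the archive only ever sees the `FormatIdentifier` interface, so the component behind it can be retired without changes elsewhere.

```python
from abc import ABC, abstractmethod
from typing import Dict


class FormatIdentifier(ABC):
    """The interface that the rest of the archive depends on."""

    @abstractmethod
    def identify(self, filename: str, data: bytes) -> str:
        """Return a lower-case format name for the given file."""


class VendorAIdentifier:
    """Stand-in for a third-party component with its own API (hypothetical)."""

    def detect(self, blob: bytes) -> Dict[str, str]:
        return {"type": "PDF" if blob.startswith(b"%PDF") else "UNKNOWN"}


class VendorAAdapter(FormatIdentifier):
    """Wrapper that insulates the archive from the vendor's API."""

    def __init__(self, component: VendorAIdentifier) -> None:
        self._component = component

    def identify(self, filename: str, data: bytes) -> str:
        return self._component.detect(data)["type"].lower()


class ExtensionIdentifier(FormatIdentifier):
    """A completely different component that performs the same job."""

    def identify(self, filename: str, data: bytes) -> str:
        _, dot, extension = filename.rpartition(".")
        return extension.lower() if dot else "unknown"


def ingest(filename: str, data: bytes, identifier: FormatIdentifier) -> Dict[str, str]:
    """Archive code written only against the interface."""
    return {"name": filename, "format": identifier.identify(filename, data)}


before = ingest("report.pdf", b"%PDF-1.4", VendorAAdapter(VendorAIdentifier()))
after = ingest("report.pdf", b"%PDF-1.4", ExtensionIdentifier())
```

Swapping the component changes only which object is passed in; the `ingest` function itself is untouched.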

This modular policy must also apply to the interface with the software used to control the storage of the digital files.

Figure 3: Tessella has designed and built a future-proof system consisting of interacting components with well-defined interfaces. This diagram shows a simplified version of this n-tier architecture.

Operating system, hardware and file storage issues

Clearly an archive will need to run on real computers and so will need to interact with real operating systems and hardware. However, there is no reason for any of the software to be tied to a particular operating system, and this should be avoided. This is one of the main reasons for choosing J2EE over .NET.

A larger issue is the hardware on which the actual files are stored because the sheer quantity of material in a digital archive poses some potential problems. A programme of migration will need to be started well before the planned retirement date of any equipment in order to ensure that all the migrations required can occur in time.

The metadata about all the records in an archive are best held in XML. This provides a format that is likely to last well into the future and will be readily understood, and is also independent of the exact nature of how it is stored, which will make migration easier. Such metadata should be captured by the user interface or imported from another source (such as a record management system) and converted into an XML file that is compliant with the archive’s XML schema. This is then likely to be appended to automatically-extracted information as the ingest process occurs. This metadata is probably best stored in its native XML format, although some denormalisation may be required to allow quick access. This metadata can then be extracted for editing purposes or converted to HTML for viewing by the user.
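As a rough illustration of this flow, the sketch below builds an XML metadata document from captured fields and renders it as HTML for viewing, using only Python's standard library. The element names (`record`, `title`, `creator`) are hypothetical; a real archive would validate the document against its own XML schema, which this sketch does not attempt.

```python
import html
import xml.etree.ElementTree as ET
from typing import Dict


def build_metadata(fields: Dict[str, str]) -> ET.Element:
    """Convert captured metadata fields into an XML element tree."""
    root = ET.Element("record")
    for name, value in fields.items():
        ET.SubElement(root, name).text = value
    return root


def to_html(root: ET.Element) -> str:
    """Render the metadata as a simple HTML table for the consumer."""
    rows = "".join(
        f"<tr><th>{html.escape(child.tag)}</th>"
        f"<td>{html.escape(child.text or '')}</td></tr>"
        for child in root
    )
    return f"<table>{rows}</table>"


captured = {"title": "Census <1901>", "creator": "Registrar"}
metadata = build_metadata(captured)

# Serialise for storage in the native XML form, then read it back for viewing.
xml_text = ET.tostring(metadata, encoding="unicode")
page = to_html(ET.fromstring(xml_text))
```

Because the XML is stored as plain text, it can be re-read by any XML parser without reference to how or where it was stored.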

It is also important to have the ability to extract a full accession comprising all its XML metadata and every archived file. This functionality can be used to replicate full records between non-networked digital archive instances or to migrate the data forwards into a new system.

Tool procurement

There is a need for many specialist software tools to enable the effective use of digital archives. For instance, the automatic selection of records for ingest, the automatic extraction of descriptive metadata from records at ingest, and the migration from one format to another all require high-quality software to be developed.

There are a number of ways in which this software could be procured by organisations that need to perform archival activities. They could develop bespoke software themselves, but this is very expensive and would result in lots of organisations repeatedly procuring similar software. Another option would be to rely on open-source software. Open-source software can solve a number of problems but it does have limitations. The kind of software tools needed here must be produced in a timely manner, fit into a controlled framework, be of a verifiably high quality and be supported. It is possible that open-source software can solve a few of the problems listed above but it is unrealistic to think that all the software required can be procured in this way. A third option is to buy commercial software. The software market for this type of tool is in its infancy but, as the importance of digital preservation begins to be realised more widely, the market is likely to grow.

The digital archives in existence at the time of writing have almost exclusively been built for national archives or national libraries. These institutions do not have the resources to develop their own software, nor can they afford to wait in the hope that open-source software will come along. However, one pro-active role that such institutions could play is to establish benchmarks for best practice. For instance, if a piece of software is produced that performs a migration from one format to another, these benchmarks could be used to assess the ability of this software to produce a new file that faithfully reproduces various features of the original format. If such benchmark scores became recognised by the rest of the software world, as a measure of the worth of this software, it would drive manufacturers to improve the software and thus provide an ever-increasing variety of high-quality archival tools.

Tessella’s Robert Sharpe has worked mainly on digital archiving projects for the last three years. These include the UK National Archives’ Digital Archive and PRONOM systems, consultancy on the US National Archives and Records Administration’s Electronic Records Archives system, and consultancy for the Dutch Nationaal Archief.
