Data preservation: to infinity and beyond

Fifteen years ago a research student saved all their work on to a floppy disk – lecture notes, reading lists, even their thesis – trusting that it was a safe storage format. But now that same student has a laptop without a floppy disk drive, and the website that used to host the thesis has long since disappeared. A print copy of the text is stored in a long-term repository, but this does not include the raw data that supports the thesis.

Not only does the student not have a complete record of their academic achievement, but researchers in the same field can’t access the information to check its validity or re-use the research using new techniques and updated processes. Both the individual and the university have lost the results of years of investment.

Research information is fragile – stored on a memory stick, in a spreadsheet, or even on the web, it is easily corrupted, lost or destroyed.  But properly curated and preserved, this information is ripe for sharing and repurposing by the original researcher, by their department and by other academics who may access the research information long after it was first created.

University managers and researchers are increasingly looking to justify the cost of preservation alongside other technical and legal issues, but we know far more about how to preserve data than we do about demand for preserved digital items.  How do universities and funding bodies decide what to preserve?  An international task force funded by JISC and several organisations from the USA is looking at business cases for long-term preservation and access, highlighting examples of current practice in America and Europe.  The Blue Ribbon task force report argues that in deciding what to preserve, and how, universities can usefully focus on a ‘supply and demand’ business model.

Anyone trying to convince their peers that information will be valuable enough to be preserved is already engaging with potential use, or demand.   Currently, peer-review mechanisms are helping funders make these decisions. Choosing to deposit material in a trusted preservation environment also indicates that the data is perceived as valuable. For example, the Economic and Social Research Council in the UK offers data to the UK Data Archive at the end of its sponsored projects. This data is assessed by the archive alongside experts before being included.

The usefulness of certain information might be affected by its context.  Keeping a digital snapshot of the circumstances in which data is created is crucial if the data is observational. If a researcher takes a weather reading at the top of a mountain, for example, that data can be usefully re-used only if all the details of that summit and the prevailing conditions can be kept to inform the context.  To help address concerns about managing such information, JISC has invested almost £2 million in eight projects to provide UK universities with examples of good research data management.

A stunning 15th-century manuscript of the Canterbury Tales is currently being painstakingly photographed by specialists at the University of Manchester, UK, as part of a larger programme to preserve and make accessible the university’s special collections for the benefit of researchers as far into the future as Chaucer is behind us (see BBC website). But there are few links between preserving manuscripts such as these and the results from a maths equation or graphs from a social-science experiment. Some e-journal archives are now transferring preservation responsibilities to external organisations such as the USA-based JSTOR and Portico, or the Koninklijke Bibliotheek in the Netherlands.

By putting the university at the heart of the preservation process, we can ensure that researchers are able to make links between these resources.  Simply knowing exactly what they already have in their archives might lead to cost savings because what organisations think they need, they may already have.  Control over the preservation lifecycle will also give universities the responsibility to ensure that there is an efficient flow of resources – perhaps drawing on the support of public-private partnerships as at the Worldwide Protein Data Bank, or from experts like legal and business specialists.

Once preserved, digital assets can be accessed by different researchers and students without rivalling each other’s use – the same e-book, for example, could be read by many different people without them detracting from each other’s experiences. But this weakens the incentive for any one party to take on the cost of preservation, since many other parties can free-ride on the benefits. So we also need to examine stronger incentives for universities and organisations to preserve in the public interest. After all, as digital technologies become more integrated into research, we need to build in enough flexibility to accommodate uses of today’s research that we can’t yet envisage.

  • The Blue Ribbon task force will be presenting its final report at a free one-day symposium in London on 6 May 2010, alongside responses from the BBC, the Natural History Museum, the British Library, European Bioinformatics Institute and the European Commission.

Neil Grindley is programme manager at JISC, UK