Better management reduces data loss risk

The impact of data loss can be staggering for research. Nathan Westgarth argues the case for better data management

The loss of scientific data can have a devastating impact on careers. After moving all of his data home to write up, biologist Billy Hinchen returned one afternoon to find that his laptop and all his backup hard drives had been stolen. All that remained was a disparate collection of data, spread around numerous small flash drives, email attachments and scribbled drawings that were difficult to piece together once the main bulk of information had been lost.

The knock-on effect was disastrous. As Billy put it: ‘I was focussed on creating high resolution, 3D time lapse videos of developing crustacean embryos, so all of my work was digital-based. When I lost my laptop and backups, I lost 400GB of data and close to four years of work. As a direct result of this I ended up getting an MPhil rather than the PhD I’d been working towards. I was hoping to have an illustrious career in science and for a time it seemed like everything would be stopped in its tracks’.

The importance of data management

While this is an extreme case of data loss, it does highlight how important it is to consider how scientific data is managed. From the surveys and interviews we’ve held with the academic community, we’ve heard a common theme: researchers seem to have difficulty managing and accessing their data. Furthermore, it appears to be an ongoing problem for research scientists at every stage of their careers.

Former PhD student and subsequent founder of the figshare platform, Mark Hahnel, typified a common challenge: ‘During my PhD I was never good at managing my research data. I had so many different file names for my data that I always struggled to find the correct file quickly and easily when it was requested. My former PI was so horrified upon seeing the state of my data organisation that she held an emergency lab book meeting with the rest of my group when I was leaving’.

Research data management is becoming one of the most pressing issues facing the scientific community, not just for university management teams, but also for every individual researcher. Our investigations have revealed a concerning picture of the effect that poor data management is having on the quality and reliability of scientific outputs.

More data, more complexity

The amount of research data being generated is currently increasing by 30 per cent annually (Why manage research data? In G. Pryor (Ed.), Managing research data (pp. 1-16), Facet Publishing). Much of this data is not being effectively managed, stored, or made easily accessible. One study found that the odds of sourcing datasets decline by 17 per cent each year and that as much as 80 per cent of scientific data is lost within two decades (The availability of research data declines rapidly with article age, Current Biology 24(1): 94-97).

The information that remains is often poorly reported. In a second review, researchers found that 54 per cent of the resources used to perform experiments across 238 published studies could not be identified, making verification impossible (On the reproducibility of science: unique identification of research resources in the biomedical literature, PeerJ 1:e148). This suggests that a significant share of the estimated $1.5 trillion spent globally on research and development each year is wasted (2013 Global R&D Funding Forecast, Advantage Business Media).

As well as the financial consequences, poor data management can have a significant impact on time and other resources. For example, since the year 2000, more than 80,000 patients have taken part in clinical trials based on research that was later retracted because of error or fraud (Problems with scientific research: How science goes wrong, The Economist). Meanwhile, the number of peer-reviewed paper retractions due to error has grown more than fivefold since 1990 (Why has the number of scientific retractions increased?, PLOS ONE 8(7): e68397). At best, that’s a lot of wasted time and effort; at worst, drug discovery is halted and careers are severely affected.

Given the above, it is perhaps unsurprising that as many as 34 developed countries have signed up to the ‘Declaration on Access to Research Data from Public Funding’. In addition, key funding bodies such as the NIH, MRC and Wellcome Trust now request that data-management plans be part of applications.

Looking after your data

The time has come to protect our scientific data properly. A good starting point is to make the capture of research data more efficient through better use of technology.

There is a host of generic tools available that can fit into existing research workflows. Among those proving popular are Evernote, cloud-storage services such as Google Drive and Dropbox, and code-hosting sites such as GitHub. While these tools offer a range of benefits, they haven’t been designed with the scientific community in mind.

For this reason, tools designed specifically around academics’ needs are starting to emerge. For example, Digital Science’s figshare tool is a cloud-based repository where researchers can store their data outputs privately, share them with colleagues, or make them publicly available and citable with a permanent DOI. figshare is increasingly being used by institutions to host and manage research data of all file types, securely in the cloud. Institutions can also use it to promote collaboration internally and to facilitate backup and organisation, without having to share data with the wider world until researchers are ready to publish.

Digital Science has also recently developed Projects, an application that lets researchers safely manage and organise their research data on the desktop. It provides a visual timeline that makes files easy to find, backup functionality for seamlessly recovering previous versions of files, annotation features, and a structured hierarchy that encourages users to keep their files organised.

In the future, we hope to see data management taken more seriously by everyone involved in making science happen, from individual researchers through to institutions and governments. While all are clearly dedicated to improving human existence through exploration and discovery, more energy must be put into safeguarding this data for the future benefit of science.

Nathan Westgarth is a product manager for research tools at Digital Science. He manages Projects, which aims to help scientific researchers organise their data in a safe, simple and structured way