Better management reduces data loss risk

The impact of data loss can be staggering for research. Nathan Westgarth argues the case for better data management

The loss of scientific data can have a devastating impact on careers. After moving all his data home to write up, biologist Billy Hinchen returned one afternoon to find that his laptop and all his back-up hard drives had been stolen. All that remained was a disparate collection of data, spread around numerous small flash drives, email attachments and scribbled drawings that were difficult to piece together once the main bulk of information had been lost.

The knock-on effect was disastrous. As Billy put it: ‘I was focussed on creating high-resolution, 3D time-lapse videos of developing crustacean embryos, so all of my work was digital-based. When I lost my laptop and backups, I lost 400GB of data and close to four years of work. As a direct result of this I ended up getting an MPhil rather than the PhD I’d been working towards. I was hoping to have an illustrious career in science and for a time it seemed like everything would be stopped in its tracks.’

The importance of data management

While this is an extreme case of data loss, it does highlight how important it is to consider how scientific data is managed. From the surveys and interviews we’ve held with the academic community, we’ve heard a common theme: researchers seem to have difficulty managing and accessing their data. Furthermore, it appears to be an ongoing problem for research scientists, at any stage of their careers.

Former PhD student and subsequent founder of the Figshare platform, Mark Hahnel, typified a common challenge: ‘During my PhD I was never good at managing my research data. I had so many different file names for my data that I always struggled to find the correct file quickly and easily when it was requested. My former PI was so horrified upon seeing the state of my data organisation that she held an emergency lab book meeting with the rest of my group when I was leaving’.

Research data management is becoming one of the most pressing issues facing the scientific community, not just for university management teams, but also for every individual researcher. Our investigations have revealed a concerning picture of the effect that poor data management is having on the quality and reliability of scientific outputs.

More data, more complexity

The amount of research data being generated is currently increasing by 30 per cent annually. This data is not being effectively managed, stored, and made easily accessible. One study found that the odds of sourcing datasets decline by 17 per cent each year and that a huge 80 per cent of scientific data is lost within two decades.

The information that remains is often poorly reported. In a second review, researchers found that 54 per cent of the resources used to perform experiments across 238 published studies could not be identified, making verification impossible. This means that much of the estimated $1.5 trillion spent globally each year on research and development is wasted.

As well as the financial consequences, poor data management can have a significant impact on time and other resources. For example, since the year 2000, more than 80,000 patients have taken part in clinical trials based on research that was later retracted because of error or fraud. Meanwhile, the number of peer-reviewed paper retractions due to error has grown over fivefold since 1990. At best, that’s a lot of wasted time and effort; at worst, drug discovery is halted and careers are severely affected.

Given the above, it is perhaps unsurprising that as many as 34 developed countries have signed up to the Declaration on Access to Research Data from Public Funding. In addition, key funding bodies such as the NIH, MRC and Wellcome Trust now request that data-management plans be part of applications.

Looking after your data

The time has come to start protecting our scientific data better. The starting point is to make the capture of research data more efficient through better use of technology.

A host of generic tools can be fitted into existing research workflows; popular examples include Evernote, cloud-storage services such as Google Drive and Dropbox, and code-hosting sites such as GitHub. While these tools offer a range of benefits, they have not been designed with the scientific community in mind.
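
As a minimal illustration of fitting such generic tools into a workflow, the hypothetical Python sketch below mirrors a project folder into a second, cloud-synced location (for example a local Dropbox or Google Drive folder) and verifies every copy with a checksum. The paths and folder names are placeholders, and this is a sketch of the idea rather than a recommended backup policy.

```python
import hashlib
import shutil
from pathlib import Path

# Placeholder paths: the working data folder, and a folder that a cloud
# client (e.g. Dropbox or Google Drive) keeps synced off the machine.
SOURCE = Path.home() / "research" / "embryo-timelapse"
BACKUP = Path.home() / "Dropbox" / "backups" / "embryo-timelapse"


def sha256(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in 1 MB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mirror(source: Path, backup: Path) -> None:
    """Copy new or changed files from source to backup, verifying each copy."""
    for src in source.rglob("*"):
        if not src.is_file():
            continue
        dst = backup / src.relative_to(source)
        if dst.exists() and sha256(dst) == sha256(src):
            continue  # already backed up and identical
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        assert sha256(dst) == sha256(src), f"copy of {src} is corrupt"
        print(f"backed up {src} -> {dst}")


if __name__ == "__main__":
    mirror(SOURCE, BACKUP)
```

Run on a schedule, even something this simple keeps an off-machine copy of the data that a stolen laptop cannot take with it.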

For this reason, tools designed specifically to suit academics’ needs are starting to be developed. For example, Digital Science’s Figshare tool is a cloud-based repository where researchers can store their data outputs privately, share them with colleagues, or make them publicly available and citable with a permanent DOI. Figshare is increasingly being used by institutions to host and manage research data of all file types securely in the cloud. Institutions can also use it to promote collaboration internally and to facilitate backup and organisation, without having to share data with the wider world until researchers are ready to publish.
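
For researchers who prefer to script their deposits, a sketch of what this could look like against Figshare’s public v2 REST API is shown below: it creates a private item and reserves a DOI before anything is made public. The endpoint paths and field names follow Figshare’s public API documentation but should be treated as assumptions to verify rather than a definitive recipe, and the token, title, description and keywords are placeholders.

```python
import requests

BASE = "https://api.figshare.com/v2"
TOKEN = "YOUR_PERSONAL_TOKEN"  # placeholder: a personal token generated in Figshare
HEADERS = {"Authorization": f"token {TOKEN}"}

# Create a private item (metadata only; file upload is a separate, multi-step process).
item = {
    "title": "Developing crustacean embryos: 3D time-lapse dataset",  # example metadata
    "description": "Raw time-lapse videos and analysis scripts.",
    "keywords": ["crustacean", "embryology", "time-lapse"],
}
resp = requests.post(f"{BASE}/account/articles", headers=HEADERS, json=item)
resp.raise_for_status()
item_url = resp.json()["location"]  # URL of the newly created private item

# Reserve a DOI so the dataset can be cited even before it is made public.
resp = requests.post(f"{item_url}/reserve_doi", headers=HEADERS)
resp.raise_for_status()
print("Reserved DOI:", resp.json().get("doi"))
```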

Digital Science has also recently developed Projects, an application that lets researchers safely manage and organise their research data on the desktop. It provides a visual timeline to make finding files easy, backup functionality to help seamlessly recover previous versions of files, annotation features and a structured hierarchy to encourage users to organise their files.

In the future, we hope to see data management taken more seriously by everyone involved in making science happen, from individual researchers through to institutions and governments. While all are clearly dedicated to improving human existence through exploration and discovery, more energy must be put into safeguarding this data for the future benefit of science.

Nathan Westgarth is a product manager for research tools at Digital Science. He manages Projects, which aims to help scientific researchers organise their data in a safe, simple and structured way.

Meeting the research data management challenge

Jisc’s Rachel Bruce describes some of the changes required to ensure that managing research data is a high priority for research institutions

With the drive for open data and the expansion in the size, variety and complexity of the data that researchers and institutions are handling, the need to manage these datasets effectively has never been more pertinent.

Managing research data is not simply a concern for higher-education research managers or information professionals; it is a cross-institutional issue. Institutions are increasingly taking the lead in establishing research data policies, although there is, of course, still room for improvement. Many factors are driving that improvement and bringing about a cultural change in how vital research data is curated, retained and stored.

EPSRC (the UK’s Engineering and Physical Sciences Research Council) sent a clear message on compliance in 2011, stating that institutions receiving its research funding must develop a roadmap outlining how they will support researchers in implementing responsible and sustainable reuse of their data, and that they must be compliant with these roadmaps by 2015. Arguably, this has been a key driver for institutions, but many other factors are also influencing the development of research data management.

Why is data management crucial for the research community?

In addition to compliance obligations, institutions want and need to demonstrate research excellence. The hope is that making their studies and data discoverable will drive new and exciting research. If an academic at one university has created useful datasets in a particular area, it is far more efficient for other researchers working on that or similar areas to access those findings and build on the study than to start from scratch. Having robust policies and infrastructure in place to organise data will improve efficiency by reducing duplication and pushing research to the next level.

Research funders, including the research councils, now expect institutions to have procedures and resources in place to ensure the accurate and efficient collection of data. This is largely to improve transparency, but it also gives funders greater assurance that the research they are financing will have a wider impact and be built upon.

This trend has been spurred on by a growing understanding that the research paper or journal is not the only output of research; the findings, survey and experimental data are arguably more important in developing a body of knowledge around subjects and supporting verification of findings. Data therefore needs to be managed effectively and made available to relevant parties. Accessible data can drive more creative collaborations with other researchers nationally and internationally.

What are the key challenges?

While effective data management will ultimately reduce the administrative burden and improve research, institutions inevitably experience some difficulties in putting a policy in place.

Understandably, some resistance from researchers still exists. While some may feel proprietorial about their work, many find sharing the various outcomes of an experimental study intimidating, perhaps because it leaves them open to criticism. Others believe that only a small group of experts will ever need access to their data, or will understand it.

For other researchers, there is perhaps a lack of understanding about which data to share and at what point. Researchers and information professionals do not always have the skills needed to curate data effectively. It can be hard enough to define what counts as data: it could be a collection of images, digitised notebooks, environmental temperature readings or observations. Curating this plethora of diverse material and making it discoverable is another mountain to climb.

Another challenge that has to be overcome if research data management is to run smoothly is making it a concern shared across institutions and departments. Libraries, researchers, senior leadership and IT teams will need to work together to achieve a coordinated approach to gathering data and maintaining its integrity. The University of Bristol is a great example of a university that overcame this barrier to improve research data management across the institution. It was one of the first universities in the UK to put together a university-wide policy on managing research data and, as a result, it has been able to make the business case for a long-term commitment from the university to support research data management.

What support is available?

The number of resources and tools available to support data management has steadily increased, largely in response to the compliance agenda.

The Digital Curation Centre created the first online data-management planning tool, which researchers use at the point of application and throughout a research project. The checklist asks questions such as ‘what type of data will you be producing?’ and ‘when do you expect to publish findings?’. In addition, there are tools such as Figshare, which enables researchers to share their data in the cloud while retaining control over who has access to it. Software such as CKAN, an open-source data-portal platform, can be used to make data accessible and to streamline the processes of publishing, sharing and finding it. To help researchers cite data and be credited for it, DataCite is a useful service: it ensures that datasets have a digital object identifier (DOI), which makes data easier to track and reuse.
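
To give a flavour of how such platforms can be scripted, the hypothetical Python sketch below registers a dataset record on a CKAN portal through CKAN’s action API and stores a DataCite-style DOI in the record’s metadata. The portal URL, API key, organisation name, dataset details and DOI are all placeholders; the field names follow CKAN’s documented package_create and resource_create actions.

```python
import requests

CKAN_URL = "https://data.example-university.ac.uk"  # placeholder CKAN instance
API_KEY = "YOUR_CKAN_API_KEY"                       # placeholder API key
HEADERS = {"Authorization": API_KEY}

# Register the dataset record itself via CKAN's package_create action.
dataset = {
    "name": "coastal-temperature-survey-2014",   # URL-safe identifier
    "title": "Coastal temperature survey 2014",
    "notes": "Hourly sea-surface temperature readings from five coastal sites.",
    "owner_org": "marine-biology",               # an organisation the key can write to
    "extras": [{"key": "doi", "value": "10.1234/example.5678"}],  # hypothetical DOI
}
resp = requests.post(f"{CKAN_URL}/api/3/action/package_create",
                     headers=HEADERS, json=dataset)
resp.raise_for_status()
package_id = resp.json()["result"]["id"]

# Attach a downloadable resource (here, a link to the data file) to the record.
resource = {
    "package_id": package_id,
    "url": f"{CKAN_URL}/downloads/coastal-temperature-survey-2014.csv",
    "name": "Temperature readings (CSV)",
    "format": "CSV",
}
requests.post(f"{CKAN_URL}/api/3/action/resource_create",
              headers=HEADERS, json=resource).raise_for_status()
```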

Another exciting development that will help realise the benefits of research data management is that Jisc has received funding to develop a UK-wide service that will pull together local data catalogues and data registries. This will be the first shared resource of its kind for all the universities in the country, and it aims to make data more discoverable and therefore more easily reused.

This is a global challenge and there is still room for further advances in university policy and support services, and of course in shared services. Indeed, managing research data is a crucial contributor to fulfilling research funder requirements, which will ultimately help achieve research excellence.

  • Rachel Bruce is director of technology innovation at Jisc