2020: the year of open data?

30 March 2020

If you had just one word to sum up what’s happening in the world of open data right now, it should be progress.

On 15 January the International Association of Scientific, Technical and Medical Publishers launched ‘STM 2020 Research Data Year’, an industry-wide initiative to expand the number of journals depositing data links and grow the volume of citations to datasets.

Then, two weeks later, eight university networks – representing more than 160 research-intensive universities worldwide – signed the Sorbonne Declaration on research data rights, which sets out the needs and benefits of having research data open, by default, wherever possible.

As Joris van Rossum, STM research data director, highlights, key publishers have been striving to drive open data forward for some time now. As early as 2014, PLOS journals laid out a data policy requiring that research articles include a Data Availability Statement, providing details on how to access the relevant data for each paper so findings could be replicated. Since then, PLOS has published more than 124,000 articles with such statements, and many other publishers and journals have followed in its footsteps.

Meanwhile, both Springer Nature and Elsevier have been very clear on the importance of linking articles with datasets to bring research and data together. In a recent development, Springer Nature has granted OpenAIRE access to all of its articles, so the EU scholarly communications organisation can extract links from articles, research data and other outputs using text and data mining algorithms.

‘Open science is becoming open data and now that open access is landing we have a clear direction of where we are going,’ van Rossum says. ‘The focus is now on open data and this is very exciting as it will help science even more than open access alone.’

And this, of course, is why launching STM 2020 Research Data Year has such appeal right now. Key participants include Cambridge University Press, Elsevier, De Gruyter, IOP Publishing, Karger Publishers, Oxford University Press, Sage Publishing, Springer Nature, Taylor & Francis and Wiley.

As part of STM 2020 Research Data Year, STM is working with partners not only to increase the number of journals with data policies and articles with data availability statements, but also to raise the number of journals that deposit data links to the Scholix framework, which encourages links between research and data. The programme also intends to increase the number of citations to datasets, using the Joint Declaration of Data Citation Principles for scholarly data.

Sharing best practices and experience is fundamental to the programme and will take place through workshops and webinars, while on-site visits to individual publishers will also support data-sharing. What’s more, van Rossum is eager to bring more society and smaller medical publishers on board in the coming year.

‘Over half of the journals published worldwide are now being represented by participants on our programme,’ he says. ‘We have seen the larger publishers implement open data and now it’s time for the smaller [organisations] to also gain traction… we want to see all publishers do everything they can to offer solutions for researchers to make data more open.’

Iain Hrynaszkiewicz, publisher, open research at PLOS, is also excited about STM 2020 Research Data Year and, like van Rossum, hopes to see new publishers taking part. He says: ‘This is so important as small- and medium-sized publishers are now engaging on the topic of data and will be introducing data policies… we have moved from a smaller number of big publishers adopting open data to having an industry-wide focus and measuring this progress is really important.’

To this end, the STM 2020 Research Data Year website is set to feature a dashboard showing progress on the number of links to datasets from articles, journals with good data policies, and articles with data availability statements. Importantly, the programme also offers frameworks for journal policies and data availability that publishers can select according to their journals and communities.

‘We even have a Codex framework that publishers can use to link and cite to datasets,’ says van Rossum. ‘We really want to point publishers to the right resources and allow them to learn from other publishers on how to implement open data as effectively as possible.’

Mandates matter

Alan Hyndman is marketing director at the online open access repository Figshare and, like van Rossum and Hrynaszkiewicz, he has noted a rapid take-up of data-sharing across the board. ‘When Figshare started in 2011, [data-sharing] was quite obscure and only really carried out by people into the open research movement,’ he says. ‘But since then we’ve seen major funders, publishers and institutions, worldwide, introduce data policies.’

‘We’ve seen grass-roots, bottom-up adoption and peer-to-peer sharing but also a lot of top-down pressure, such as funding mandates,’ he adds.

Indeed, Hyndman is certain that the growing use of mandates is having a big impact on data-sharing. While the likes of the Wellcome Trust and EPSRC have led the mandate pack, the scholarly community is now seeing developments from the US National Institutes of Health and the Chinese Academy of Sciences.

‘The European Commission has also been putting out a lot of guidance here and South Africa [the National Research Foundation] has mandated all of its data to be openly available – this really is a growing trend,’ says Hyndman.

Similarly, Hrynaszkiewicz is a firm believer that data policies and mandates are crucial to the adoption of open data. From the word go, PLOS has implemented a solid data policy, which the PLOS publisher believes has raised awareness of data-sharing as well as signalled its importance.

Hrynaszkiewicz also highlights PLOS research revealing that only around five per cent of researchers provided a data availability statement under an ‘encouragement’ policy. In contrast, this figure rose to some 90 per cent when a mandate was brought in. ‘A data sharing policy is more likely to lead to the sharing of data if it is mandatory rather than just a statement of encouragement to share data,’ he says.

Kirsty Merrett, research support librarian for research data management, at the University of Bristol Library Services, UK, is also convinced that mandates work when it comes to data-sharing. Bristol is home to the data.bris repository, and as such, Merrett advises researchers on data management planning, data storage and data sharing.

‘We can talk about the societal benefits of sharing data, the fact that it’s publicly funded, and it’s good for your research and you might get more collaborators,’ she says. ‘But when funders say share your data as we’ve paid for it, and publishers say we want the data that supports this – these are the real drivers.’

‘[Sharing data] is no longer philanthropy, it’s now a case of you might not get any money and you might not get published if you don’t,’ she adds. ‘So it’s this publish or perish notion… and if a researcher hasn’t got his or her data ready for this, then they’ve lost those REF points.’

Mandates aside, citations – considered by many researchers as the Holy Grail in terms of reward – are emerging as a powerful incentive to share data. In a recent study, ‘The citation advantage of linking publications to research data’, published in the open-access repository arXiv, Hrynaszkiewicz and colleagues classified the data availability statements of more than half a million articles from PLOS and BioMed Central journals. Analyses revealed that storing data in a repository was associated with, on average, a 25 per cent increase in citations to the corresponding research papers.

‘Several studies over the last decade have also found an association between sharing data publicly and more citations to the papers that report that data,’ he says. ‘We weren’t able to say that sharing was the cause of the citation rise, but there was certainly that association.’

Van Rossum is heartened by the latest studies linking data-sharing with a rise in citations and reckons the results outline the clear advantages of sharing data. ‘A key anxiety for any research is “does this help my career?”,’ he says. ‘We see that the more data is shared, the better cited your article becomes… [and also] citations to a dataset count just as much as citations to literature.’

In a similar vein, Figshare tracks citations to all of its content, from articles and data to images, video files and code, and according to Hyndman, the repository has seen exponential growth here. But interestingly, the company has discovered that its most cited format is code.

‘Researchers write little bits of software then they make this available – we’re seeing other researchers using this code and then citing that in their papers,’ says Hyndman. ‘We really didn’t expect this but it’s actually a very nice way for a researcher to receive credit. Academia is a cut-throat industry so when it comes to your post-doctoral research and beyond you really need to prove your research has as much impact as possible.’

Braving the barriers

So as the data-sharing mandates, rewards and incentives increase, progress looks set to continue. But despite the results, issues persist. In the recent Digital Science report, The State of Open Data 2019, researchers yet again raised confusion over the licences used to make data openly available. Uncertainty over the FAIR principles, which are intended to ensure that data can be effectively re-used, also remained.

As pro-vice-provost, University College London Library Services, Paul Ayris wrote in the report: ‘There is a need for co-ordinated skills development to train researchers in what is needed to deliver FAIR data and, indeed, in adopting open data as the norm. Is there a need for a new profession of data curators who can take on this role for research groups?’

Van Rossum concurs: ‘Implementing FAIR principles is a collective effort and all the actors that participate in this ecosystem should do their part.’

For his part, Hrynaszkiewicz points out how some publishers are already providing data-sharing and curating services as well as research training via libraries. And he also believes that integrating publishing systems with data repositories helps researchers to more easily deposit data into repositories.

Merrett, however, takes a different tack, and is keen to highlight the pitfalls of trying to adopt open data in less developed nations. ‘With the rise of heavily research-intensive universities, so many researchers have this Western ideal and assume that everyone in their field can access the same resources that they can. But I’ve been on conference calls to, say, researchers in Africa, where they talk about infrastructure needs such as electricity,’ she says. ‘So we can’t just assume that someone will have the latest MATLAB software.’

Still, be it licensing, FAIR principles, streamlining platforms or resources, one additional concern consistently rears its ugly head more than most: trust.

Hrynaszkiewicz outlined this critical issue in The State of Open Data 2019 after some 2,000 researchers had expressed concerns over data misuse. As he also pointed out, the latest results echoed past surveys which, taken together with the latest one, represent the opinions of many thousands of researchers.

‘Trust is a collection of different issues,’ he says. ‘These include fear that work may be misinterpreted, fear that data can be used for a purpose that wasn’t intended and also the fear of scooping, where other researchers find new opportunities to use that data.’

According to the PLOS publisher, technological features such as sharing data privately in repositories before publication can help to allay such fears, but he believes that trust is more of a cultural than a technological issue. So what’s the answer?

In line with Ayris’ and van Rossum’s calls to train researchers on FAIR principles, Hrynaszkiewicz reckons training could relieve trust concerns. What’s more, he is also sure that the ongoing implementation of solid journal data policies is helping the issue.

‘We really do need to think creatively about the role of stakeholders and publishers to create a culture where sharing data is more common and rewarding,’ he concludes. ‘I don’t have all the answers to this but we must all continue to try to address this issue.’