FEATURE
Open data: growing pains

While researchers, publishers and funders warm to data sharing, issues over misuse, citation and credit remain, reports Rebecca Pool

In its latest State of Open Data survey, Figshare revealed that a hefty 64 per cent of respondents made their data openly available in 2018.

The percentage, up four per cent from last year and seven per cent from 2016, indicates a healthy awareness of open data, and for Daniel Hook, chief executive of Figshare’s parent company, Digital Science, it spells good news.

‘This figure is a high percentage and has always been higher than I would have expected since our first survey in 2016,’ he says. ‘In a single decade, we’re probably going to hit 70 per cent, and given the open data movement hasn’t really been in existence since much before 2010, this is a really big achievement.’

Partnering with Springer Nature, Figshare – an online repository for academic research – set up its State of Open Data survey to examine the attitudes and experiences of more than 1,000 researchers working with open data around the world. Now in its third year, the survey suggests open data is becoming increasingly embedded in research communities.

For example, the majority of respondents – 63 per cent – support national mandates for open data, an eight per cent rise from 2017. And, at the same time, nearly half of the respondents – 46 per cent – reckon data citations motivate them to make data openly available. This figure is up seven per cent from last year.

The respondents’ leaning to national mandates comes at a time when funding bodies worldwide have laid out policies on data sharing, while government organisations are warming to open data. Indeed, in October this year, China’s Ministry of Science and Technology released its open science and open data mandate – The Measures for Managing Scientific Data – to improve open sharing. Meanwhile, the European Commission has just mandated open access to research data as well as publications, as part of its proposed €100 billion Horizon Europe research budget, running from 2021 to 2027.

And, according to Hook, the survey respondents’ growing thirst for data citations also coincides with general industry development. ‘I think people are getting more comfortable with the idea that a dataset is “a thing”,’ he says. ‘Open data is much more on [researchers’] radars and importantly, more and more institutions and funders are recognising that the production of a dataset is valuable.’

Given these developments, Digital Science recently released a new data platform, ‘Dimensions’. Designed to improve scholarly search, the platform links different data sources from publications with citations, grants, altmetric data, clinical trials and patents.

As Hook emphasises: ‘Dimensions is tracking citations and mining citations from papers, and can provide citation information, which means researchers can much more easily see the citations they are getting from their data.’

Professor Brian Nosek, social psychologist from the University of Virginia, US, is equally enthusiastic about data citations, and highlights the significance of these to research communities. ‘Data should be cited as a scholarly contribution in the same way that papers should be cited, and this is beginning to be addressed in the community,’ he says. ‘You’re identifying the source, you’re providing credit and you’re showing that researchers are building on other researchers’ work, which also induces accountability for misuse.’

Nosek, a key figure in the open science movement, co-founded the Center for Open Science in 2013, with the aim of increasing the openness and reproducibility of scientific research. Five years on, the not-for-profit centre is widely known for a range of mechanisms and tools to promote openness, including the Transparency and Openness Promotion (TOP) guidelines for data citation, data materials, code transparency and more.

Crucially, these standards have been embraced in more than 1,000 journals, with publishing heavyweights, such as Springer Nature, Wiley, Elsevier and more adopting TOP policies. And now the funders are following.

‘The TOP guideline signatories really show that the journals are aligning on promoting data sharing, either incentivising or requiring it,’ says Nosek. ‘The next frontier is funders, and some have already started to make more assertive requirements. I think that over the next year or two, funders will really start to increase their policy alignment with the TOP guidelines, and promote more transparency for the research that they fund,’ he adds.

Funders aside, Nosek is also seeing more and more research fields sharing data. While the likes of macroeconomics, which hinges on economists sharing government data, and astronomy, with its shared instrumentation, are attuned to the concept, the researcher reckons other disciplines are jumping on board. Psychology is just one example. In 2015, Nosek and colleagues published The Reproducibility Project, which set out to replicate the findings from 100 past psychology studies, and ultimately highlighted the thorny issue of failed reproducibility in social science.

But as the researcher points out: ‘At the same time, we also made all of our data available and it has since generated more than 20 publications, with other researchers using those data for new purposes.

‘Some of these purposes have been quite different to what our original research was about,’ he adds. ‘Just making that data available has created possibilities that couldn’t have been anticipated in advance.’ Nosek also points to OpenNeuro, formerly known as OpenfMRI, an open platform for sharing neuroimaging data. ‘We’ve seen a number of different publications that have come out of [OpenNeuro] that wouldn’t have happened without this,’ he says. ‘So this long tail of research is getting more exposed to data sharing and we’re seeing investigations that no one could have imagined, as a result of smaller datasets being aggregated.’

Causes for concern

Yet, amid the data-sharing success stories, myriad worries remain. Top of the pile is the potential for data misuse, but as Nosek highlights: ‘Yes, someone could misuse your data but what’s the relative risk of someone using it well?’

‘Are you certain you have taken the best approach?’ he adds. ‘Originators of the dataset don’t always think of all the different ways that that data could be used, and I’ve seen so many new research applications emerge from the same data.’

Inappropriate sharing of data is another key concern, which Nosek agrees is ‘very real’. He believes researchers should take appropriate steps to comply with ethical guidelines, adding: ‘This is just an essential part of moving towards an open data environment.’

But beyond data-misuse and unethical sharing, the thorny issue of credit remains. While data citation is necessary if researchers are to share datasets, tensions around credit for data sharing clearly emerged in the recent State of Open Data survey.

Results indicated that a mighty 58 per cent of respondents felt they do not receive sufficient credit for sharing data, while only nine per cent felt they do.

Grace Baynes, VP for research data and new product development, open research, at Springer Nature, is keen to encourage data sharing, but believes credit is critical. As she writes in this year’s State of Open Data report: ‘Researchers would share data more routinely, and more openly, if they genuinely believed they would get proper credit for their work, that counted in advancing their academic standing and success in career development and grant applications, and for subsequent work that builds on their data.’

For her part, she believes published, citable datasets should be valued in the same way as research articles but concedes: ‘Routine inclusion of datasets, their citations and impact in grant assessments and CV evaluation is probably still years away.’

In the meantime, she reckons citable data articles should be encouraged, and also highlights the importance of initiatives that promote dataset use and citation (see ‘FAIR principles’). Key examples include the pan-European research data management initiative GO FAIR, the FAIR data project from The Netherlands Institute’s Data Archiving and Networked Services, and the Alfred P Sloan Foundation-funded project, MakeDataCount. Meanwhile, pertinent community initiatives also include DataCite, which provides DOIs for research data, and FORCE, which is already implementing a Data Citation Roadmap with publishers and other organisations.

‘Perhaps there is more we can do to make it easier for researchers to write and publish data articles, and see the benefits to their research in doing so,’ she adds.

Digital Science’s Hook agrees and is also adamant that professorships should be awarded to researchers who produce ‘brilliant work with data’.

‘Until you make someone a professor for creating a dataset, you are not going to see the change that we need, and you are not going to see researchers being promoted for the right reasons,’ he says.

In addition to concerns over misuse, ethics and citations, a pressing need clearly exists to provide researchers with more credit for generating and sharing data. And as part of this, Hook believes that researchers should now be recognised for much more than published results. ‘It’s a great conceit of modern research to feel that you are the best person to come up with the idea, get the grant funding, work on experimental design, perform the experiment, collect, analyse and interpret the data, write up the paper and get it published,’ he says.

‘It’s like playing midfield, defence, goalkeeper and striker all at the same time, and in today’s increasingly complex research problems, you just can’t do this,’ he adds. ‘We need to recognise people for the very different roles that they now play in research.’  

FAIR principles

It’s no secret that existing digital infrastructure surrounding scholarly data publication prevents users from extracting maximum benefit from research investments. What’s more, as researchers increasingly look to reuse data, ways to ease this process are becoming more and more important.

One route to best reuse is to make data ‘FAIR’, that is, Findable, Accessible, Interoperable and Reusable. Conceived in 2014 and first published in 2016, the FAIR guiding principles aim to improve the infrastructure that supports the discovery and reuse of scholarly data, and comprise 14 metrics that prescribe a continuum of increasing reusability.

Two years on, the FAIR principles are being rapidly adopted by publishers, funders and institutions, yet in the world of scholarly research, the guidelines remain relatively unknown. When asked about the FAIR principles, only 15 per cent of the State of Open Data respondents reported being familiar with the guidelines, while 60 per cent said they had never heard of them.

Digital Science’s Daniel Hook is not surprised by the results. As he puts it: ‘Academics are busy doing research and given the first time that I heard the term was probably no more than 15 months ago, this isn’t a surprise to me.’

Still, the lack of awareness signals a clear need for education, and as is highlighted in the survey report, confirms the need for open data initiatives such as GO FAIR, which provides researchers with clear instructions on how to comply with FAIR.

‘Experimental science is quite often very messy and trying to enshrine guidelines on how to work with the open data that comes out of that is very challenging,’ adds Hook. ‘Open data is a relatively new field, and I do think what we’re seeing here is a representation of this fuzziness.’

Easing data-sharing

While more and more publishers, funders and institutions are providing data sharing policies that recommend or mandate that data from an article be made available upon publication, compliance with these policies is often low. Many authors remain unsure as to which datasets they should be sharing, while stakeholders cannot always tell when the authors have shared the right data. However, proposed software from the Collaborative Knowledge Foundation – Coko – could change this.

Coko recently won funding from the Sloan Foundation to build DataSeer, an online service that will use Natural Language Processing to identify datasets that are associated with a particular article. Pioneered by Dr Tim Vines, from Origin Editorial, the software aims to plug what Vines sees as a huge implementation gap between general policy and actually sharing data.

‘Researchers are struggling to get funding, are overloaded with reviews and data sharing is just another thing they need to do,’ he says. ‘But I believe that DataSeer will change the game and transform data-sharing, as this process will now become easy and almost automatic. The software will use artificial intelligence to read articles and work out which dataset fields should be provided.’

According to Vines, the ultimate goal is to provide a service that guides authors through the data sharing process for their article, with reports for publishers, funders, and institutions, so they can easily assess policy compliance by comparing what should be shared, with what was shared. Initial partners are the University of California Curation Center (UC3), PLOS, and the University of California Press.

‘We will initially work with publishers to get DataSeer into a journal’s editorial workflow,’ says Vines. ‘We will release this as free open source software and it will be freely available to all potential users as a standalone online service or a component of Coko’s journal management software, PubSweet.’
