Closed countries, open data
Rebecca Pool asks: has Covid-19 pushed the move towards open data to the point of no return?
When the coronavirus pandemic struck and UK education shut down, University of Bristol librarian Dr Kirsty Merrett was overwhelmed with requests for assistance with data management plans. Perhaps not a huge surprise - many researchers were working from home instead of the lab - but for the research data management specialist, the sharp rise in interest bodes well for data-sharing.
'I don't know if researchers were thinking "Oh, let's sort out that BBSRC grant that I've been putting off for a couple of years", but we were swamped,' says Merrett. 'In the short term I don't expect this will have a monumental impact on data-sharing but because this period has raised awareness of the need for data management we may see more data publications in two to three years' time.'
At the same time, Merrett also noted a rising number of data accesses at data.bris, with more and more researchers tapping into the repository's existing data-sets. University database records reveal that from June to September this year, data accesses rose some 68 per cent from 2589 to 4350, an unequivocal leap. Merrett is excited.
'Don't you think that this is pushing the envelope and forcing those accessing the data to look beyond their own research?' she asks. 'I'm hoping this will raise awareness of the possibilities of interdisciplinary research and show researchers that there is a different way of approaching research and learning from other studies.'
'I don't know if the pandemic will have a massive impact on data-sharing and open data – it may raise awareness,' she adds. 'But from a user-perspective, it could bring a culture of wanting to pay it forward and putting data in a repository, so that should we be held to ransom by a pandemic again, research can still be undertaken remotely.'
Across the Atlantic, an analogous story is unfolding. Daniella Lowenberg is based at the University of California and is product manager for Dryad, the open and curated data publishing platform for scientific and medical disciplines. A passionate advocate of open research data, Lowenberg is also director of Make Data Count, an initiative focused on building the infrastructure for research data metrics.
From its inception, Dryad has set out to connect publishers, institutions, data repositories and researchers, in a bid to drive the adoption of proper data management practices that institutions and funders are keen to see. And over the pandemic, Lowenberg has seen a clear rise in dataset deposits.
'We've certainly seen increased amounts of data coming in from all disciplines: ecology, evolution, biomedical sciences and specifically from Covid-19 research,' she says. 'We saw a big bump from March onwards, which we think was because researchers had more time to spend with their data as opposed to producing new data in the lab.'
Lowenberg is also certain that her Dryad observations mirror those at other repositories across the globe. As she highlights, myriad researchers were sequencing SARS-CoV-2 genomes from the outset, and publishing results in Nextstrain, an open source project to publicise and analyse pathogen genome data, and the National Institutes of Health-supported data repositories, which fuelled further analyses on how the virus had evolved.
'We definitely saw an influx of Covid-19 data-sets,' she adds. 'But quantity has had to be balanced with quality, and we received new datasets that just couldn't be published as they posed ethical risks.'
Like Merrett, Lowenberg is not yet sure how much of an impact the pandemic will have on data sharing and open data but she is certain it will bring some change. Pointing to past pandemics and crises, such as the 2014 Ebola outbreak and climate change, she highlights how these events also triggered intense rises in data-sharing across relevant disciplines, and importantly, raised community standards.
'Much like geneticists, ecology and environmental scientists early on understood the importance of sharing data and it became a standard in those research domains to share data,' she says. 'So now, in the last decade, we continue to see the importance of sharing biomedical data, and hopefully that trend will stick.'
For the Dryad product manager, this doesn't come a moment too soon. In June this year, The Lancet article, “Hydroxychloroquine or chloroquine with or without a macrolide for treatment of COVID-19: a multinational registry analysis”, was retracted after question marks were raised over the paper's data.
While the published research initially brought global trials of anti-malarial drug hydroxychloroquine for Covid-19 to a grinding halt amid fears of increased deaths, anomalies in its data soon emerged. Concerns were not helped by participating company, Surgisphere, refusing to transfer a full dataset to peer reviewers, citing confidentiality violations. The lead author quickly requested the research be retracted.
With the controversy over, trials on the anti-malarial drug have since re-started and The Lancet is now demanding more detailed data-sharing statements while altering peer review processes for future papers based on large datasets.
However, as Lowenberg also points out, the entire drug-testing debacle could have been avoided if the necessary datasets were made available from day one. 'If the data had been available, the research could have been called out immediately,' she says. 'Data are the building blocks of science, without which [research] cannot be trusted.'
Mainstay repositories are not alone in witnessing the impact of the pandemic on data sharing-related activities. According to Yann Mahé, Managing Director of MyScienceWork, his business saw a sharp increase in demand for demonstrations of Polaris OS, an open source platform for archiving and analysing data.
'It's been quite intense for us as we are quite a small organisation, but from March to June we had a surge in demand for system demonstrations,' he says. 'I believe this is because a lot of researchers are becoming more open to the idea of having an institutional repository for publications and data-sets, and are aware that open data can really boost innovation.'
Mahé reckons demand has come from all disciplines, including sociology and economics, as well as Covid-19-related fields. He believes this also reflects Polaris OS's current users, which include the French Alternative Energies and Atomic Energy Commission (CEA), the French National Institute of Health and Medical Research (INSERM), and Deutsche Bundesbank.
And throughout the pandemic Mahé also had many organisations from South America contacting him, which he partly attributes to these nations' high levels of Covid-19. He also believes that a lot of this interest arises from the fact that the repository platform is open source and promotes open access.
'Researchers can be using a truly open system with data that can be shared freely and openly with other researchers around the world – the impact of this is very important to them,' he says.
But demonstrations aside, traffic across Polaris OS platforms has also been higher since the beginning of the pandemic, in all disciplines, and particularly in the US and China. 'Users mainly represent universities and research institutions and we are trying to understand why this is so - I think it's very interesting, as these two countries have very big research communities,' says Mahé.
Given the intense interest in his open repository platform since the virus outbreak, Mahé is hopeful that the pandemic will have had a positive impact on open data. 'A lot of researchers have now had access to free and open data that would not have been accessible without the Covid-19 crisis,' he says. 'I hope that collectively we become more conscious on how open data and open access publishing can help researchers and their colleagues to work more productively.'
Like Mahé, Grace Baynes, vice president of research data and new product development, Research Solutions at Springer Nature, has been keenly watching the impact of Covid-19 on her organisation. As she points out, submissions across all Springer Nature journals were up 26 per cent year-on-year from January to June, with the highest growth of 51 per cent coming from medical journals. And in the same time period, the organisation published a hefty 10,000 articles relating to Covid-19.
However, for Baynes, it is the collaborations that have emerged since the pandemic that have been 'really interesting'. Pointing to Scientific Data, an open-access, online-only, peer-reviewed journal for descriptions of scientific datasets, she highlights the incredible story behind the first Covid-19 article that the journal published.
'This was describing an epidemiological dataset about the spread of the virus, that was openly developed by researchers working around the world in real-time as the pandemic was evolving,' she says. 'Researchers came from China, South America, the US, UK and the rest of Europe to work on this, and it's still being updated continuously today.'
'The editor tells me it was a unique challenge to peer review this article as the data-set was changing minute by minute... but it really shows how the pandemic has brought international research groups together to collaborate and very quickly share data,' she adds.
The dataset is not a first for international collaboration – look at the Human Genome Project, the International Space Station, the Millennium Seed Bank Partnership and CERN, to name but a few. But as Baynes highlights, the sheer speed at which this and other Covid-19-related research data collaborations - including the Coronavirus Infectious Disease Ontology (CIDO) and the Covid-19 disease map - were established is what really stands out.
Quality matters
Still, speed aside, what about the quality of the data that is being shared? Without a doubt, researcher uncertainties over the FAIR principles - which ensure data can be effectively re-used - remain, as evidenced by the numerous initiatives underway to promote a FAIR data culture. For example, earlier this year, the EU-funded FAIRsFAIR project joined forces with FAIRsharing.org to support data repositories in developing FAIR research data management.
Meanwhile, more and more tools that help researchers adhere to the FAIR principles are also becoming available, with, for example, the Germany-based European Molecular Biology Laboratory being active here. According to Mahé, Polaris OS also helps researchers to share data and metadata, and will ensure that a dataset's metadata is FAIR - structured, clean and enriched - ready for other researchers to find.
'We really need to help researchers deposit their data-sets and publications, curate and fill out their metadata without making them take too much time,' he says.
For her part, Merrett is seeing clear progress, and points out how at data.bris, researchers are showing a greater awareness of the FAIR principles now compared to even just six months ago.
'I'm not sure at what point it changed but we do get more people talking to us about this now,' she says. 'Also, if a researcher has a draft data management plan, he or she will mention FAIR data – before we might have only seen this once across 10 to 15 plans, we see it more now.'
To help drive this adoption of FAIR principles further, the University of Bristol has just launched an Open Research Prize. Akin to similar schemes underway at the Universities of Groningen and Reading, competition entrants are to submit case-studies that highlight how they have used open practices in their research. And implicit in the scheme, researchers will need to provide evidence of FAIR data principles where appropriate, as well as including data availability statements in publications.
'I'm really trying to impart at the moment that we need to be as open as possible and as closed as necessary,' says Merrett. 'I think these kinds of activities are really important for changing the culture, and I hope we can get researchers into the habit [of using the FAIR principles].'
Baynes also believes that incentives and credit mechanisms remain critical to unlocking good research data practice. Drawing on results from this year's State of Open Data Report from Digital Science, in partnership with Figshare and Springer Nature, she highlights how impact and visibility, and public benefits, were cited as key motivators over publisher and funder requirements. Meanwhile, citations, credit in funding applications and co-authorship were the favoured credit mechanisms.
Baynes is also certain that researchers need to clearly understand why good practice is worth their time, and believes Covid-19 may have helped. 'One of the hardest areas to share data is medical and clinical research, often because it has sensitive data,' she says. 'However, the ways in which this data can be managed and shared will now be more obvious to researchers that hadn't had to think about data-sharing before Covid-19.'
Like Merrett, Baynes is also seeing progress on FAIR principles. As she points out, more and more authors are including data availability statements in journal articles while results from the State of Open Data surveys reveal understanding of the principles is rising. 'We're definitely making progress on making data easier to find and access, but interoperability and reuse are more challenging,' she says.
Here, Baynes asserts that investment in human resources and technology is needed. She reckons reusable data demands solid curation with good metadata and descriptive information while interoperability between datasets needs more development of community standards.
'And if we're talking about making data AI-ready, then we need better tools to collect research data, data standards and that rich metadata and description,' she adds. 'This all needs investment and effort from across the research community – we all have a part to play.'
To this end, Baynes is certain that the 2020 STM Research Data Year – intended to develop a clear open data action plan for publishers – has had a huge benefit. She's seen many publishers either introducing or strengthening data policies, bolstering information on what a data availability statement should include and setting common standards, so readers know where data is available.
At the same time, she points to encouraging signs from funders. For example, the NIH recently updated its public access policy for research data, while carrying out a pilot with Figshare to see how researchers could more easily deposit and share data openly.
'We also know that UKRI has said it will be looking at how it can facilitate open data with its taskforce reports,' she says. 'There's still far more to do but we are making progress.'
Dryad's Lowenberg is also seeing progress, but agrees scholarly communities have some way to go with improving data quality. She sees a lot of data files with generic metadata submitted across repositories with most researchers still uploading data as supporting information files that don't contain data-specific metadata.
To counter this, Lowenberg asserts that publishers should no longer accept data as supplementary files to the article, as these simply aren't re-usable, citable or accessible. 'I believe that all publishers must require researchers to put data in either a discipline-specific or general repository – this would change a lot of what we're seeing,' she says. 'And then we need to have quality data curation across the board, involving institutional libraries and data curators.'
'It's not about quantity, and uploading incomplete data quickly with your article, it should be about quality,' she adds.
Clearly, progress continues to be made on the road to open data, but has the Covid-19 pandemic actually helped accelerate the journey? Mahé, for one, thinks so.
'Researchers have had access to information that would not have been accessible without the Covid-19 crisis, so perhaps they will be more conscious about the fact that open data can help them and colleagues work better,' he says. 'Perhaps the pandemic has also encouraged publishers to think about how they can do things differently – so I see Covid-19 as providing this real opportunity to bring about change.'
Lowenberg points out how Covid-19 has provided yet 'another shining example' of why data sharing is important. 'I think it's really frustrating that it takes a pandemic for organisations to prioritise public good over their finances,' she says. 'I hope we don't follow a trend where this only happens in an emergency.'
Still, as Baynes highlights, the move from publishers, such as Springer Nature, to openly share virus-related data and research is what typically takes place during any pandemic or public health emergency.
'Open or certainly free access during such times predates this pandemic,' she says. 'But I do think it's going to continue, and over time we'll see a bigger and bigger percentage of research being published as open access every year.'
'The pandemic brings home the real need for putting in place a structured and shared approach to research data... to ensure that data is as open and discoverable as possible,' she adds. 'And I'm really looking forward to the times when research data really starts to catch up with where we are on open access publications.'
State of Open Data snapshot
Latest information from this year's State of Open Data report, from Digital Science, reveals that of the academic researchers surveyed, some 32 per cent felt their research had been either very, or extremely, impacted by the pandemic.
What's more, when it came to data sharing, around half felt it was at least somewhat likely they would re-use open data provided by other laboratories, while 65 per cent expected to re-use their own data, following the pandemic. Meanwhile, around a third of those surveyed expected to see more collaborations from here on in.
Grace Baynes of Springer Nature believes that these latest results make the case for ensuring data is as openly available as possible, so other researchers can make the most of it. 'It was those working in medicine and clinical settings that said they were much more likely to collaborate – I think the pandemic has reminded us how research really is global,' she says.
Links:
Scientific Data: Open data in the COVID-19 pandemic: https://www.nature.com/collections/ebaiehhfhg
Scientific Data: Epidemiological data from the COVID-19 outbreak, real-time case information: https://www.nature.com/articles/s41597-020-0448-0
Scientific Data: COVID-19 Disease Map, building a computational repository of SARS-CoV-2 virus-host interaction mechanisms: https://www.nature.com/articles/s41597-020-0477-8
Scientific Data: CIDO, a community-based ontology for coronavirus disease knowledge and data integration, sharing, and analysis: https://www.nature.com/articles/s41597-020-0523-6
Sam Hindle, content lead at bioRxiv and co-founder of PREreview, provides insight into what happened at preprint servers after the SARS-CoV-2 outbreak:
Since the beginning of the Covid-19 pandemic, bioRxiv has posted between 150 and 350 Covid-19 preprints per month, during a period when total monthly submissions rose from around 3,000 to some 3,800 new papers. Meanwhile, medRxiv saw an increase from around 200 posts per month in January 2020 to around 2,000 posts in May. Around 70 per cent of all posted manuscripts from March to September related to the pandemic.
We quickly became aware that the SARS-CoV-2 outbreak would lead to challenges, so at bioRxiv we sought insight from outbreak scientists and created a group of bioRxiv 'Outbreak Affiliates' to provide guidance on Covid-19 submissions. On medRxiv, we asked authors to reduce any causal language used in observational studies. And on each server, we became more alert to papers that could fuel conspiracy theories.
Looking beyond submissions, bioRxiv and medRxiv also experienced massive increases in attention. In January, bioRxiv received around six million abstract views whereas by June, views reached eight million per month. On medRxiv, abstract views increased from 0.7 million in January to 11 million views in April alone.
Heightened media and public attention has been challenging as the nuanced differences between a preprint and a published article haven't been immediately obvious to non-scientist readers. However, responsible journalism, such as notifying readers that preprints have not been peer-reviewed and seeking insight from experts not associated with the study, can address these issues.
Still, my hope is this momentum towards open and accessible research during the pandemic is maintained and extended to non-COVID-19 research. Early on in the pandemic, several biomedical journals flipped their COVID-19 articles to open access while several medical journals modified preprint policies to support preprint posting. The further we move the needle in the right direction, the sooner we will live in a world where open, constructive scientific critique is valued, and researchers can focus on creative thinking and scientific discovery rather than wasting energy navigating a complex publishing ecosystem.