Breaking down barriers to data sharing

Interviews for this article have been adapted from recent PhaidraCon roundtable events and from upcoming 2023 editions of EpistemiCast.

Access to an open pool of existing third-party datasets offers many benefits beyond the obvious opportunity to reduce the cost of research projects: additional shared data can increase the depth and scope of what is possible within any individual study; lower financial barriers open up accessibility to less advantaged scholars and institutions; and, at a global and societal level, new opportunities are created to increase scrutiny, collaboration, and the pace of learning.

However, while the will to share academic data is clearly growing, significant practical barriers to wider implementation remain in many areas of study.

Growing awareness with a strong need for training 

A 2022 study published in the scientific journal Nature by Dr Devan Donaldson and Wolfgang Koepke, ‘A focus groups study on data sharing and research data management’, looked at the issues around data sharing and reuse in detail, with a survey of the wants and needs of researchers across five fields (atmospheric and earth science, computer science, chemistry, ecology and neuroscience). 

“There are increasing mandates all over the world about ensuring that data from federally funded research is made publicly available. It is about making sure that this data can be shared and reused both within and across disciplines,” explains Donaldson. 

“We were really thinking about how data sharing can accelerate scientific discovery. How sharing and reuse can increase return on investment for data beyond the group that produced the original data.”

In comparison to earlier results examined as part of his literature review, Donaldson feels that there has been real progress in the awareness of the importance of data sharing and concepts such as FAIR data principles. However, as with all areas of data management he found a strong desire for greater training.

“Respondents felt that other researchers in their field were not necessarily aware of the data repositories that they could use to make data more widely available. They wanted training sessions to just let people know the options,” Donaldson observes. “It was clear that most respondents wanted to do a good job at data management, but didn’t necessarily feel they had the tools and knowledge to do that.”

Training need echoed in the humanities

Pre-pandemic research by the ARIADNE-Plus project shows a similar pattern. Its D2.1 Initial Report on Community Needs found significant growth between 2013 and 2019 in the number of archaeological research projects sharing their underlying data in some form of public repository.

The number of survey respondents sharing data for ‘Most Projects’ roughly doubled across a range of categories, while the number of projects not sharing data at all dropped by between 10 and 20 percent. 

However, one of the report’s authors, Professor Franco Niccolucci, echoes Donaldson’s sentiment.

“Within the humanities there is a real need for upskilling, a change of mentality and even change to the underlying humanities methodology.” 

Elizabetta Lazzaro, Professor of Creative and Cultural Industries Management at the Business School for the Creative Industries, University for the Creative Arts (UK), agrees:

“There is a need to include more technical literacy in the Humanities curriculum, to familiarise researchers with the available tools and show them how to operate them in a more active way.”

First-hand experience 

Originally a master’s graduate in Anglophone Literature and Cultures at the University of Vienna, Marta Palandri is one academic who has already begun exploring the possibilities of greater crossover between the humanities and technology. Now working as a software developer at Vienna’s University of Applied Arts, Palandri agrees that greater technical literacy is needed to bridge the current gaps in humanities research and learning:

“It’s not just about tools. For many humanities scholars, basic technical literacy seems quite mystic, when it really isn’t.” 

Palandri believes that many trained software developers are simply unaware of the opportunities open to them within research and academia:

“Most developers don’t see the research sector, and specifically humanities, as glamorous and exciting. And yet, there are so many intriguing possibilities for tools like artificial intelligence and machine learning to advance study within these fields. There is a real need to showcase these opportunities and champion the humanities themselves, to attract more technical talent into the space.”  

Opportunities for broader collaboration 

One organisation seeking to address some of Palandri’s concerns is King’s Digital Lab (KDL) within King’s College London. The lab’s director, Dr Arianna Ciula, also recognises the vital importance of the human aspect of infrastructure in bridging the gap between technical skills and researchers in the humanities:

“We don’t just connect researchers directly with design and development. For us, there is a vital role for research software analysts with experience in the Humanities, who can bridge the gap between the research domain and technical implementation. At KDL, we also separate out a clear role for designers to focus on usability, user interaction and user experience for the researchers.”

Ciula sees the lab’s role as a broker or bridge stretching beyond greater academic collaboration: 

“By facilitating robust methods around its software development lifecycle including effective sharing of data, the types of expertise embedded in research software engineering units such as KDL are well positioned to promote more mobility across sectors, between research institutions, cultural heritage professionals and the creative industries.” 

Looking to the future, Ciula adds: “There is real potential to not just share expertise, but to create new cross-industry career opportunities for researchers, developers, technical analysts, designers, and creative industry roles. Allowing more mobility between roles and sectors would allow everyone to harness some new opportunities more effectively.”

Technology and the role of the repository 

In a bid to help researchers, Donaldson and Koepke’s study looked to identify the most common ‘desired repository features’.

“We were specifically interested in data management practices and how data repositories could help serve scientists in this regard,” explains Donaldson.

The five most common desired features were laid out as a data repository appropriateness rubric, to help researchers to evaluate their repository requirements for future projects:

  • Data traceability 

  • Metadata 

  • Data use restrictions 

  • Stable infrastructure 

  • Security

All of these features are examined in detail within the study. However, in our interview with Donaldson, we focused on how the first two points can act as key drivers to increase data sharing.

Data traceability

Many of the researchers surveyed were interested in metrics and even notifications around how many people view, cite, and publish based on the data they deposit. Additionally, participants wanted versioning for repositories to track any changes to their data post-deposit. While some of this was linked to researchers’ desire to maintain oversight of how their data is used, Donaldson himself believes reuse metrics may also be an important motivator to drive greater data sharing:

“Scientists want to know what is happening with their data. If people have used their data to create new data, they want to know how people have built on the work that was done. I think that if we have innovation here, around these statistics and making them accessible to the people we want to deposit data, we could create more of an excitement around data sharing,” he explains. 

“This is why people write papers: they are adding to the discussion and seeing where it goes. If we can generate the same kind of energy around data, I think this might spur people to deposit and share their data in ways that they haven’t before.”

The key role of metadata

The discoverability of existing datasets is an obvious prerequisite for greater data reuse in any field of study. Within this, the metadata used to describe and contextualise data plays a key role.  

“Our study found that scientists were really concerned with data searchability and data discoverability,” comments Donaldson. “They might not always use the terminology, but it is very apparent to me that when scientists are describing a need for data searchability and discoverability, what they want are the metadata and data standards that could enable that.”

Neil Jefferies, head of innovation and open scholarship at the University of Oxford’s Bodleian Libraries, believes that there is currently an issue with too many metadata standards and too much specialisation.

“This hinders discoverability because it narrows the reach of discoverability tools quite significantly. People tend to search only within their domains and the domains are very narrowly specified. This really limits the capacity for any level of interdisciplinarity or any interchange of information at a broader level. So there is actually a lot of utility in some of the more generic standards,” he explains. “There is scope for a simplification of the standards to improve discoverability in terms of breadth.”

Jefferies highlights the importance of Google and the regular search engines, which are still used by many academics.

“Whatever metadata you have, if you can map it to more generic, web-friendly schemas, then you improve the overall discoverability of those items.”
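Such a mapping can be sketched in a few lines. The example below is a hypothetical illustration only: it assumes a domain-specific archaeology record on one side and schema.org’s generic Dataset type as the web-friendly target; the field names and the record itself are invented, not a real crosswalk.

```python
# Hypothetical sketch: mapping a domain-specific metadata record onto a
# generic, schema.org-style "Dataset" description (JSON-LD). All field
# names on the domain side are invented for illustration.
import json

def to_generic_schema(domain_record: dict) -> dict:
    """Map domain-specific fields onto generic, web-friendly ones."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": domain_record["site_survey_title"],
        "description": domain_record["excavation_notes"],
        "dateCreated": domain_record["fieldwork_date"],
        "creator": {"@type": "Person", "name": domain_record["lead_archaeologist"]},
        "keywords": domain_record["period_terms"],
    }

record = {
    "site_survey_title": "Bronze Age settlement survey",
    "excavation_notes": "Geophysical survey data from the 2019 season.",
    "fieldwork_date": "2019-07-01",
    "lead_archaeologist": "A. Researcher",
    "period_terms": ["Bronze Age", "settlement"],
}
print(json.dumps(to_generic_schema(record), indent=2))
```

Once the record is expressed in a generic schema, ordinary web search engines and harvesters can index it alongside datasets from entirely different domains.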

Looking at recent developments, Jefferies also highlights the increasing importance of machine readability and APIs:

“In the last four to five years, I have seen a massive increase in the number of machine tools and mechanisms for trawling databases and extracting useful data. And here, machine readability is essential.”

Fani Gargova, a postdoctoral researcher at the Goethe University of Frankfurt am Main, believes that within the humanities, especially where research is so dependent on multilingualism, controlled vocabularies and persistent identifiers play a key role in making metadata more effective and shared data more discoverable.

“This is essential for making data accessible, findable and actually also machine readable. It is important to make it possible for the researchers to actually contribute to this controlled vocabulary, and also to make this more centralised, so as to reduce the number of disparate standards, definitions and ways of accessing the data.”
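Gargova’s point can be illustrated with a toy controlled vocabulary: many language-specific labels resolve to a single persistent identifier, so records described in different languages stay findable under the same concept. The labels and the identifier scheme below are invented for illustration, not drawn from any real vocabulary.

```python
# Minimal sketch of a controlled vocabulary: several language-specific
# labels resolve to one persistent identifier (PID), so a record tagged
# in German and a record tagged in Italian land on the same concept.
# Terms and the "concept:" PID scheme are invented for illustration.
CONTROLLED_VOCAB = {
    "church": "concept:0001",
    "kirche": "concept:0001",   # German label, same concept
    "chiesa": "concept:0001",   # Italian label, same concept
    "mosaic": "concept:0002",
    "mosaik": "concept:0002",
}

def resolve_term(label: str):
    """Return the persistent identifier for a label, or None if unknown."""
    return CONTROLLED_VOCAB.get(label.strip().lower())

# Records described in different languages resolve to the same identifier.
assert resolve_term("Kirche") == resolve_term("chiesa") == "concept:0001"
```

Letting researchers propose new labels for existing identifiers, as Gargova suggests, would grow the left-hand side of such a table without multiplying the concepts themselves.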

Future automation

For Donaldson, one of the more interesting and unexpected ‘future wants’ to come up in his study was that some respondents wanted automated metadata creation when uploading their data into a repository.

“There is a tension because, at one level, the data creator is very much aware of the context and circumstances of the creation of the data, and so they are best positioned to provide metadata for it. But there is the time that it takes, or the perception of the time that it takes, to provide good metadata. There is this desire, or wish list, where they want metadata creation to be automated.”
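As an illustration of what automated metadata creation on deposit might look like, the sketch below derives basic technical metadata from the file itself and leaves the contextual fields for the depositor to review and enrich. This is a hypothetical example with invented field names, not a description of any real repository’s ingest pipeline.

```python
# Hypothetical sketch of automated metadata creation on deposit: the
# repository derives what it can from the file itself; the depositor
# only confirms and fills in the contextual fields left as None.
import hashlib
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def auto_metadata(path: Path) -> dict:
    data = path.read_bytes()
    return {
        "filename": path.name,
        "format": path.suffix.lstrip(".") or "unknown",
        "size_bytes": len(data),
        # Checksum supports fixity checking and traceability post-deposit.
        "checksum_sha256": hashlib.sha256(data).hexdigest(),
        "deposited": datetime.now(timezone.utc).isoformat(),
        # Fields only the data creator can supply:
        "title": None,
        "creator": None,
        "description": None,
    }

# Demo deposit of a small CSV file.
tmp = Path(tempfile.mkdtemp()) / "survey.csv"
tmp.write_text("site,period\nA,Bronze Age\n")
meta = auto_metadata(tmp)
print(meta["filename"], meta["format"], meta["size_bytes"])
```

The split mirrors the tension Donaldson describes: the mechanical parts are automated, while the contextual knowledge the creator holds is still requested explicitly.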

Jefferies believes that this is very much a work in progress, and that the ongoing development of persistent identifiers and broader knowledge graphs plays a key role.

“Within science, you can start to look at creating persistent identifiers for facilities, for instruments and for research projects. And once you have these, they can be applied to the dataset because this is a simple transitive operation. Within the Humanities we are really looking at things like people and place as the core anchors for contextualising the data,” he explains.

“We are now at the stage where we are starting to build these frameworks. A lot of that is in terms of persistent identifiers, but a lot of this is stuff that isn’t stored directly next to the data in the repository itself. You are relying on other related repositories to provide that. This shifts your dataset into being a node in a broader knowledge graph. And what you are actually interested in, if you are interested in meaning, is capturing all of that knowledge graph.”
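The idea of a dataset becoming a node in a broader knowledge graph, linked transitively through persistent identifiers, can be sketched with a toy triple store. The identifiers and predicate names below are invented for illustration.

```python
# Toy sketch of a dataset as a node in a wider knowledge graph, built
# from (subject, predicate, object) triples keyed by invented persistent
# identifiers. Related repositories would each contribute triples.
TRIPLES = [
    ("dataset:42", "generatedBy", "instrument:7"),
    ("instrument:7", "hostedAt", "facility:3"),
    ("dataset:42", "about", "place:vienna"),
]

def context_of(node, triples=TRIPLES):
    """Collect everything reachable from a node -- the transitive
    context that the surrounding knowledge graph provides."""
    seen, frontier = set(), {node}
    while frontier:
        current = frontier.pop()
        for subject, _, obj in triples:
            if subject == current and obj not in seen:
                seen.add(obj)
                frontier.add(obj)
    return seen

# The dataset inherits its instrument's facility transitively.
assert context_of("dataset:42") == {"instrument:7", "facility:3", "place:vienna"}
```

No single repository holds the whole graph; the traversal only works because each node’s identifier resolves somewhere, which is the community process Ciula and Jefferies describe.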

“This isn’t something that you do alone as a repository, it is part of a community process capturing all of that information. This makes it a more complex process but also a more collaborative process.”

Throughout 2023, EpistemiCast will continue to explore issues around data preservation and data sharing, with interviews and first-hand experiences from leaders in the field, as well as insights into the latest standards, initiatives and open technologies as they emerge.

Raman Ganguly is Head of Support for Research IT at the University of Vienna, project leader for the Phaidra Project, and the driving force behind the newly launched EpistemiCast podcast series.