DATA

'Good enough' is not good enough

Pooling knowledge: bringing together data sets about, for example, different parts of a lake could give more insight into the whole. Credit: Worradirek/Shutterstock

Managing research data well can open up the possibilities of valuable new insight with the help of linked data, writes Sian Harris

Research Information: December 2012/January 2013

Simon Hodson’s job remit throws up an interesting issue. Key aspects of his role as JISC’s programme manager for managing research data are to help researchers and institutions in the UK manage their data better, and to demonstrate the need for doing so.

The interesting issue is why these tasks are so necessary. Surely managing their data is of prime importance to researchers? Surely they don’t need somebody to demonstrate why this is required?

At the simplest level, of course researchers do know that their data is important. However, managing data goes beyond making sure you keep the data behind your published – or soon to be published – research. And it also goes beyond ensuring that you know where you have stored your data.

The projects that receive funding through Hodson’s programme show that well-managed research data can have value far beyond what was envisaged in the original research project.

Take FISHNet, for example, a collaboration between King’s College London and the Freshwater Biological Association (FBA). The first step of this project, which JISC funded, was to set up an archive of freshwater biological data and enhance it with metadata.

‘In freshwater biology, datasets tend to be created by individuals or small groups, often in Excel spreadsheets, and are usually created to answer a specific research question,’ explained Mark Hedges, director of the Centre for e-Research at King’s College, London.

Bringing these datasets together in the same place starts to open up the possibility of getting integrated views. For example, someone might have taken water samples from one part of a lake while someone else took similar samples from another part of the lake. Bringing the data together makes it possible to start to get a picture of the whole lake. ‘We are looking at how to break information out of silos,’ said Hedges. This work was started by the follow-on project FISH.Link, also funded by JISC, with the additional collaboration of the University of Manchester.

King’s College London and the FBA are now working on a new project developing these ideas, the DTC Archive project, funded by Defra (the UK’s Department for Environment, Food and Rural Affairs). ‘We want to have cross-dataset querying,’ Hedges noted, adding that the partners are looking at using RDF (Resource Description Framework, a framework for describing resources on the web) as an intermediate form.

Hedges and his colleagues have also been working on another JISC-funded project in the humanities. Named the SPQR (Supporting Productive Queries for Research) project, this aimed to investigate the potential of linked data for integrating datasets related to classical antiquity. ‘Lots of the digital datasets that people have been creating for years lack interoperability and the opportunity to get an integrated view across them,’ said Hedges. ‘They were generally either relational databases or XML documents. Our approach is to transform them to RDF triples.’
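To make the idea of ‘transforming to RDF triples’ concrete, here is a minimal sketch of how one record from a relational table might be broken into subject-predicate-object statements. It is not the SPQR project’s actual pipeline: the `example.org` namespaces, the field names and the sample record are all invented for illustration, and a real conversion would use an RDF library and an agreed vocabulary rather than plain tuples.

```python
# Hypothetical namespaces, invented for this illustration only.
BASE = "http://example.org/inscription/"
VOCAB = "http://example.org/vocab/"


def row_to_triples(row, key="id"):
    """Turn one record (a dict) into (subject, predicate, object) triples.

    The key column identifies the subject; every other non-empty column
    becomes one statement about that subject.
    """
    subject = BASE + str(row[key])
    return [
        (subject, VOCAB + column, value)
        for column, value in row.items()
        if column != key and value not in (None, "")
    ]


# A made-up record standing in for one row of a legacy database.
record = {"id": 42, "language": "Greek", "findspot": "Athens", "date": ""}
triples = row_to_triples(record)

for triple in triples:
    print(triple)
```

Each column of the row becomes its own small statement, which is what allows records from different databases to be linked once they share identifiers and vocabulary terms. The empty `date` field is simply skipped here; deciding how to represent missing or uncertain values is one of the formalisation problems the article goes on to describe.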

Such projects show the potential of linked data and semantic enrichment in research. ‘The key is making it machine readable – at high levels of granularity and at scale,’ explained Hodson of JISC. ‘If we can link data we can derive insight of real social and economic benefit.’

The dangers of assumptions

But such projects also reveal some inevitable challenges. In particular, they highlight differences in the way data is collected and stored. In the SPQR project, for example, some data used Latin letters while other data used Greek, and there were inconsistencies in the ways things like dates were expressed. ‘It’s OK informally, but formally it is problematic,’ noted Hedges. ‘Implicit assumptions was another thing we identified in the project, although we didn’t solve the problem.’

He has observed similar issues in other areas of humanities research. ‘Historical events are often broken down very differently in different countries, for example the way that the First World War is described in historical documents, including when it started, depends on where the documents are from. You can have some ambiguity in conversation but once you are trying to formalise and map data into RDF it’s more of a challenge.’

This issue is more of a challenge in areas like the humanities than in the hard sciences, he added. ‘In the humanities, data is much more difficult to capture and more subtle. Also, researchers don’t tend to think of their stuff as data.’

Nonetheless there were also similar challenges in the FISH.Link and DTC Archive projects. ‘So far we’ve been dealing with legacy data created before this project began so the challenge is finding or creating appropriate vocabularies to map it,’ said Hedges. ‘When people create these things, they do it for themselves, so many things like column headings don’t follow ontologies.’ He explained that column headings might lack units, use different words for the same thing or use the same word to mean something slightly different. ‘One of the challenges was mapping column headings to common vocabularies.’
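The heading-mapping problem Hedges describes can be sketched in a few lines. The example below assumes a hand-built lookup table from free-form spreadsheet headings to shared terms; the headings, term names and units are hypothetical and are not taken from the FISH.Link or DTC Archive vocabularies. Where a heading gives no unit, the sketch records `None` rather than guessing, so that a person can check it.

```python
# Hypothetical lookup: normalised heading -> (shared term, unit or None).
# None means the heading did not state a unit and needs human review.
HEADING_MAP = {
    "depth": ("depth", None),
    "depth (m)": ("depth", "m"),
    "water depth (m)": ("depth", "m"),
    "temp": ("water_temperature", None),
    "temperature (c)": ("water_temperature", "degC"),
}


def map_headings(headings):
    """Split headings into those with a known shared term and those without."""
    mapped, unmapped = {}, []
    for heading in headings:
        # Normalise case and whitespace before looking the heading up.
        key = " ".join(heading.strip().lower().split())
        if key in HEADING_MAP:
            mapped[heading] = HEADING_MAP[key]
        else:
            unmapped.append(heading)
    return mapped, unmapped


mapped, unmapped = map_headings(["Depth (m)", " Temp ", "Site code"])
print(mapped)
print(unmapped)
```

The list of unmapped headings is arguably the most useful output: it shows where no shared term exists yet, or where the same word may be hiding a different meaning.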

Such issues can be tackled by developing standard ontologies, for example for expressing depth in metres. This means that future contributors of data can choose from pre-determined lists of terms that correspond with the terms others are using. This, in itself, can be tough though. As Hodson observed: ‘Standardising ontologies is a necessary building block, but it requires agreement and this can be challenging.’

And it goes beyond simply choosing some shared terms. Hedges said that in some cases for FISH.Link there had been existing vocabularies but they were not in appropriate forms for RDF.

For the new project with the FBA, this issue is being tackled with two-day vocabulary workshops. The aim of these is to bring all the interested parties together – freshwater biologists, the people developing the data systems and other potential users of the data such as policy makers – to hammer out vocabulary terms. ‘It is a cyclical, iterative process, with proposed vocabularies fed back to researchers. That’s key for a successful vocabulary,’ explained Hedges. He hopes that the ontologies developed through this project will be useful to freshwater biologists in other parts of the world.

Another challenge comes from the use of RDF triples themselves. ‘We’ve learned that if you extract a whole bunch of RDF triples it may be hard for a typical scientist to know what to do with them,’ observed Hedges. The reason for this is that potentially hundreds of thousands or even millions of triples (tiny pieces of information linked together) could be available to researchers and the users may not have the technical expertise to know how to make use of them.

‘It is very easy to get overwhelmed by triples. One of the things we are looking at is ways to make it easier to query them,’ he said. ‘It obviously makes it less flexible, but it also makes it easier to interact with the data without learning a complicated querying language.’
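The trade-off Hedges mentions, less flexibility in exchange for easier interaction, can be illustrated with a small sketch. It assumes triples held as plain Python tuples, with invented sample identifiers and property names. A generic pattern matcher is wrapped in a single fixed-purpose helper, so a user can ask one common question without writing a query. A production system would more likely sit on a triple store queried with SPARQL.

```python
def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


def samples_at(triples, location):
    """Canned query: which samples were taken at a given location?"""
    return [s for (s, _, _) in match(triples, predicate="takenAt", obj=location)]


# Made-up triples standing in for two water samples from one lake.
triples = [
    ("sample/1", "takenAt", "north_basin"),
    ("sample/1", "depth_m", 5.0),
    ("sample/2", "takenAt", "south_basin"),
    ("sample/2", "depth_m", 12.5),
]

print(samples_at(triples, "north_basin"))
print(match(triples, predicate="depth_m"))
```

The helper `samples_at` answers only one kind of question, which is the loss of flexibility, but it spares the user from needing to know how the underlying statements are structured.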

In helping researchers to interpret these datasets Hedges has noticed some different patterns of behaviour. Scientists tend to come with specific questions and query the data for them but humanities researchers tend to interact with data in a very different way, he said. ‘Humanities researchers don’t necessarily come to data with such a clear plan of what to query. They tend to browse data. One of the key things is to provide them with enough information to know what to browse and to stop them getting lost in the data.’

Good management

Meanwhile, there are challenges with the underlying data itself. ‘To convert legacy data is very time consuming. However, it’s not the researchers’ fault; when they created the data it was for their own purposes,’ said Hedges. ‘The linked data approach forces people to think collaboratively about data gathering and this aids discovery across datasets.’

And this is where Hodson’s role comes in. ‘One of the challenges yet to be solved is how to make such projects scalable because the data needs to be better quality.’

Data-quality issues arise from the ways that data has been created and maintained. ‘Many researchers have excellent data skills but many do not. Many researchers will add just enough information for the data to be useful to them rather than making the extra effort of creating a perfectly-annotated table of data,’ observed Hodson. ‘This means that anyone using it is going to be confronted with a significant job of data cleaning.’

And the reasons for looking after data aren’t purely altruistic. After all, as Hodson pointed out, ‘The first person you share your data with is your future self.’

His programme at JISC focuses on building capacity in universities, developing their data-management policies and strategies, and providing them with training and advice. It is also about improving data management during projects. ‘By and large we hope researchers will have a good awareness of what they need. Also, increasingly publishers are becoming more aware of the importance of looking after data. It is of more use if it is linked to research articles,’ he said.

Hodson is also involved in the Dryad data archive, an international initiative that he is very enthusiastic about. ‘Again it’s a question of making data available so that researchers have the raw material that allows the prospect of semantic enrichment and the benefits that brings. Without initiatives like Dryad this would not be possible.’