The vision of an open-data future may still be some way off but David Stuart looks at some of the possibilities and the challenges along the way
The problem with idealistic visions of the future is that they always seem to take much longer to come to fruition than initially predicted. When Tim Berners-Lee and James Hendler published their vision of the semantic web in Nature and Scientific American in 2001, it didn’t seem too unlikely a possibility.
When governments such as those of the USA and UK launched open-data sites (Data.gov and Data.gov.uk respectively) in the last few years, it seemed as though the bright open-data future was just around the corner. After all, if slow-moving governments had accepted the arguments in favour of open data and started to make it available, surely it couldn't be long before data was also pouring forth from industry and academia.
The open-data future even promised to be the long-awaited semantic one, as governments didn't just publish data but re-published a proportion of it according to linked open-data principles, enabling data to be queried and analysed automatically across multiple sites.
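To give a flavour of the principle at work: linked open data represents facts as subject-predicate-object triples, with URIs as shared identifiers, so that datasets published independently can be merged and queried together. The sketch below is a minimal illustration in plain Python; all of the URIs and the school/region data are invented for the example, not drawn from any real government dataset.

```python
# Minimal illustration of linked-data triples: each fact is a
# (subject, predicate, object) triple, and URIs act as shared
# identifiers across datasets. All URIs here are made up.

triples = [
    ("http://example.gov.uk/school/123",
     "http://example.org/vocab/name", "Hill Top Primary"),
    ("http://example.gov.uk/school/123",
     "http://example.org/vocab/inRegion", "http://example.gov.uk/region/NW"),
    # A second, independently published dataset about the same region:
    ("http://example.gov.uk/region/NW",
     "http://example.org/vocab/label", "North West"),
]

def query(data, subject=None, predicate=None, obj=None):
    """Return all triples matching the given pattern (None = wildcard)."""
    return [t for t in data
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]

# Because both datasets use the same URI for the region, a program can
# follow the link from the school to the region's human-readable label.
region = query(triples, subject="http://example.gov.uk/school/123",
               predicate="http://example.org/vocab/inRegion")[0][2]
label = query(triples, subject=region)[0][2]
print(label)  # prints "North West"
```

In practice this querying would be done with a standard such as SPARQL over RDF rather than hand-rolled pattern matching, but the underlying idea of following shared identifiers across independently published datasets is the same.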
This widespread sharing of data was supposed to release data's true potential. Industry would be able to repurpose existing data in new and innovative ways, while science would see less duplication of work and scientific findings could be held up to greater scrutiny than ever before. What's more, the public would have any misgivings about the robustness of scientific or medical findings allayed, and would even be able to contribute expertise to the innovation process.
Three years on from the opening of the UK government's data hub, however, the world of open data and linked data seems as far away as ever. According to Google Trends, searches for 'linked data' rose rapidly in 2009 but have held steady since then, and of the 2,212 open-access repositories listed in the OpenDOAR repository directory, only 84 are listed as containing datasets. In fact, the only content types listed in fewer repositories are patents (65) and software (33).
Undoubtedly, things are changing. Funders are increasingly requiring the raw data to be made publicly available. Libraries are increasingly spoken about as having a role to play in providing support for the data-management process. Librarians themselves are also seen as potentially having an increased role in the creation of knowledge, moving away from the role of information scientist towards one of data scientist.
However, with changes seemingly occurring so slowly, questions need to be asked. What are the barriers that continue to restrict access to this web of data? What steps can library and information professionals take to overcome them?
Some of the biggest barriers to the publishing and use of open data are undoubtedly technical ones. Data is a far more difficult commodity to handle than a finished journal article. Whereas a journal article is self-contained and has generally been designed from the outset to be shared with others, data is often fragmented and primarily organised for the convenience of the researchers themselves. This creates difficulties for both the creator of research data and the potential re-user: which data should be made available? What is the best way to make it available? And how can someone else reuse it?
It is far easier to make decisions about the most appropriate way to create data for publication during the research process rather than at the end. Structuring data, making use of widely adopted vocabularies and ontologies, and the creation of sufficient documentation to enable its reuse, are not things that can be easily achieved at the end of the research process.
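As a concrete sketch of what 'making use of widely adopted vocabularies' can mean in practice, a researcher might describe a dataset as it is created using standard Dublin Core element names rather than ad-hoc field names, so that repositories and downstream tools can interpret the record without guesswork. The dataset details below are invented for illustration.

```python
import json

# A hypothetical dataset description using Dublin Core element names
# (dc:title, dc:creator, and so on) instead of ad-hoc field names,
# written alongside the data rather than reconstructed at the end
# of a project. All values are invented for this example.
record = {
    "dc:title": "River temperature readings, 2011-2012",
    "dc:creator": "Example Research Group",
    "dc:date": "2012-06-30",
    "dc:format": "text/csv",
    "dc:license": "http://creativecommons.org/licenses/by/4.0/",
    "dc:description": ("Hourly temperature readings from three sensors; "
                       "column definitions are documented in README.txt."),
}

print(json.dumps(record, indent=2))
```

The point is not the particular serialisation, which here is plain JSON for simplicity, but that standard field names and an explicit licence make the record intelligible to people and software that had no part in creating it.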
Because of this, information professionals need to be involved at the beginning of the research process, drawing on their experience of the many ways data can be made available, their knowledge of widely adopted standards, and their understanding of the importance of documentation. As the open-source community has found, documentation is usually the last priority, yet it can make a huge difference to whether a resource is ever adopted. Unless information professionals are involved throughout the research process, we risk a situation in which most data is archived and freely available, but largely unusable because too much work is needed to interpret it.
Intellectual property rights
As open access to both research publications and data struggles into existence, intellectual property issues inevitably rear their head, as those with vested interests in the existing system look to defend their ground. The most recent example sees publishers restricting researchers' text mining of journal articles, even where the researchers already have access to the articles. And such quibbles are nothing in comparison to the complexities of intellectual property rights for data.
In many instances a dataset will not have been created entirely by the authors themselves, but will have drawn on pre-existing datasets. Unfortunately this often means that the resulting dataset carries a host of associated restrictions, and publishing it may require the approval of many of the original data publishers.
Library and information professionals need to be involved in the research process from the start, so that they can ensure researchers' obligations to their funding agencies are not undermined by the restrictions attached to any pre-existing data that is used. In some instances there will be alternative, more open datasets, though these may not be as immediately obvious. Library and information professionals will need to help their stakeholders identify and access these resources, just as they would with any other information resource. They also need to make sure that any data they help to publish carries an appropriate licence, so that anyone wanting to reuse the data further down the line is not hampered by overly draconian restrictions.
Although the potential benefits of freeing up data are widely recognised, it is by no means an unmitigated positive step for individual researchers, who face increased scrutiny of their work and concerns about it being misrepresented. The data stories that grab the headlines are rarely about data being repurposed to deliver additional insights and valuable contributions to science. Instead they are about students picking holes in the findings of Harvard professors, and climate sceptics badgering scientists for access to their data before investigations are even complete.
What is required is the promotion of more positive stories of data sharing and reuse. Such stories would not only encourage the publishing of large datasets, but also encourage researchers to reuse data. The reuse of existing datasets should be as widely respected an avenue of research as more traditional approaches that include the creation of original datasets – if not more so, as they provide ways of making novel discoveries at a fraction of the cost.
Librarians have a role in both finding and sharing these stories, and encouraging a culture of data sharing generally. This requires recommendations – and also incentives. While citations may be thought of as the currency of science, it is important that we recognise the contributions not only of journal articles, but of all of a project’s outcomes. Where a dataset, or part of it, is reused by others this should be reflected in appropriate metrics.
There is still a long way to go before open-data publishing and use are a normalised part of the process of science. In many ways, open data is a bit like the push towards open access of publications, which has also taken a long time to become widespread. A crucial difference, however, is that whereas the open-access movement required large quantities of articles to be available before a real impact was made on the research process, every dataset can make a difference. Whereas a journal article may reference dozens of other journal articles in making its case, research may be built on the reuse of a single dataset.
The opening up of data has not ground to a halt; there are still many innovations occurring. At the end of April, Wikimedia (the organisation behind Wikipedia) launched its Wikidata project, a free, editable knowledge base of structured data. There is evidently a public appetite for data, but it needs to be available in an easily usable format and, for that, library and information professionals need to play their part. This involves both the general and the specific: from promoting the concept of open data within an organisation to developing a library of ontologies for a specific subject area. What's important is that everyone takes some responsibility.
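To give a flavour of what 'structured data' means in a project like Wikidata, the sketch below parses a heavily simplified, hand-written record in the style of a Wikidata item: multilingual labels, plus statements linking properties to values (P31, 'instance of', and the items Q42 and Q5 are real Wikidata identifiers, but the record itself is abridged and not fetched from Wikidata, whose actual JSON format is considerably richer).

```python
import json

# A hand-written, heavily simplified record in the style of a Wikidata
# item: multilingual labels plus property -> value statements. The real
# Wikidata JSON model is more elaborate than this sketch.
item_json = """
{
  "id": "Q42",
  "labels": {"en": "Douglas Adams", "fr": "Douglas Adams"},
  "claims": {"P31": ["Q5"]}
}
"""

item = json.loads(item_json)

# Because the data is structured, a program can answer questions that
# free-text encyclopaedia articles cannot easily support:
print(item["labels"]["en"])                    # prints "Douglas Adams"
print("Q5" in item["claims"].get("P31", []))   # prints "True"
```

It is this machine-readability, rather than any one record, that makes such a knowledge base reusable by anyone.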
David Stuart is a research fellow at the Centre for e-Research, King’s College London, as well as an honorary research fellow in the Statistical Cybermetrics Research Group, University of Wolverhampton