Libraries could play key role in managing research data

12 December 2014

Sharing and long-term preservation of research data are increasingly important to the research process, strengthening the process of science and maximising a funder’s return on research investment. While some fields have embraced the sharing of data more fully than others, the sharing of research data is of growing interest across all scientific disciplines.

However, we are still transitioning from a document-centric view of science to a data-centric view, and the infrastructure is not yet in place for the seamless sharing and reuse of scientific data. A growing number of increasingly sophisticated instruments and sensors mean that the scientific data available for sharing is growing rapidly, but there is a lot of work involved in extracting that data from researchers’ hard drives and ensuring that it is accessible in the long term.

The rise of electronic publishing has disrupted traditional information roles, and the position that the library will hold in an increasingly data-centric science is not yet clear. There are undoubtedly opportunities, but if libraries move too slowly they may find other organisations fulfilling these roles.

A complex ecosystem

The modern scientific publishing system has become increasingly complicated. The traditional publishing process could be modelled as a simple loop, with articles flowing from researchers to publishers to libraries and back to researchers. In contrast, the modern system is modelled more as a network – with new relationships and increasingly overlapping roles. Today, research papers are not only submitted to journals, but to institutional and subject repositories as well. Institutional repositories are often hosted by library services, and hold grey literature as well as formal publications. In addition, the bundling of electronic journals by publishers diminishes the acquisitions role of the library, while open-access journals can potentially disrupt the traditional distribution role of publishing companies.

All models are necessarily over-simplifications, and the traditional journal model was more complex than the simple loop suggested here. Nonetheless, it is clearly the case that organisational roles are increasingly less rigid in academic publishing. It is into this more fluid ecosystem that the need for more data-centric services has emerged.

Increased interest in scientific data and a need for data-centric services provide a host of opportunities for the library to re-establish itself in a central role within research institutions. However, the overlapping roles of competing organisations mean that others can quickly stake claims in areas that the library profession may have considered theirs by right.

The data lifecycle

Data lifecycle models can provide a framework for considering the opportunities available to library services in the sharing of research data. The UK Data Archive research data lifecycle distinguishes six stages in the data lifecycle: creating data; processing data; analysing data; preserving data; giving access to data; and re-using data. Many of these stages can benefit from the skills of the library community.

Effective data management starts at the beginning of the research process, not as an afterthought. Research libraries should be in a position to offer advice on the appropriate structure, storage, and metadata for research data.

The long-term preservation of data is likely to have different formatting, storage, and metadata needs from those of the data during the creation process. Once again, research libraries should be in a position to offer advice. Most importantly, the storage of a research project’s data is likely to extend beyond the scope of an individual project, and the data needs to be stored in an appropriate repository.

Giving access to data is another area in which libraries can be involved. Providing access to the data requires not only that the data is available, but also that it can be found and that appropriate rights are provided for its reuse. The library community has a long history in the establishment of classification systems, and extensive experience of copyright.
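As a rough illustration of what 'findable and reusable' can mean in practice, the minimal sketch below checks a dataset description for the fields a repository might need before the data can be discovered and reused. The field names, the example record, and the `missing_fields` helper are all hypothetical and are not drawn from any particular metadata standard or repository.

```python
# Hypothetical field names, chosen for illustration only.
# Discovery needs a title, creator, description, and keywords;
# reuse additionally needs a stated licence and a stable identifier.
REQUIRED_FIELDS = (
    "title",
    "creator",
    "description",
    "keywords",
    "licence",
    "identifier",
)


def missing_fields(record):
    """Return the required fields that are absent or empty in a record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


# An invented example record with no reuse rights stated.
example_record = {
    "title": "Example river temperature readings",
    "creator": "Example Research Group",
    "description": "Hourly sensor readings from an illustrative project.",
    "keywords": ["hydrology", "temperature"],
    "licence": "",
    "identifier": "example-0001",
}

print(missing_fields(example_record))
```

In this invented case the data is described well enough to be found, but the empty licence field means a potential reuser would not know what they are permitted to do with it.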

At the final stage in the lifecycle, reuse of data requires the finding of data, and ensuring that sufficient information is available for the data to be reused. Of course there is a difference between opportunities being available and opportunities being taken, and for the most part libraries continue to be primarily document-centric.

Although many libraries have established institutional repositories, repositories continue to be focused on documents rather than data. Of the 2,727 repositories listed in OpenDOAR, the Directory of Open Access Repositories, only 131 are currently listed as containing datasets (4.8 per cent). This is not too dissimilar to the results from the same query at the beginning of 2011, which found the proportion to be 4.1 per cent.

Distinguishing between institutional repositories and disciplinary repositories draws a bleaker picture for institutional repositories: only four per cent of institutional repositories are listed as containing datasets, in comparison to 11.1 per cent of disciplinary repositories.

Specialised repositories can potentially provide more innovative interactive interfaces for specific types of data than a more general institutional repository might hope to achieve, although when data lasts longer than the project, questions remain over who will take responsibility for these orphan works in the long term if a repository closes or data/metadata needs to be updated.

Where libraries fail to provide sufficient new and innovative data services, others will. One example of this is the new data journals.

Repackaging data

Two new open-access journals that are based on research data rather than research findings started publication in 2014: Scientific Data, from the Nature Publishing Group; and Wiley’s Geoscience Data Journal. These publications provide a place for detailed descriptions about how and why a dataset was collected, and are linked to the dataset itself in one of a number of approved repositories.

Such products potentially have advantages for both individual researchers and science as a whole. They provide a greater incentive for the sharing of data by providing a peer-reviewed publication that can be cited and for which a researcher can receive credit, without necessitating new insights or novel findings. They also help researchers to find and reuse existing datasets.

Many of the advantages of these new data journals could have been achieved by the library community without the need for commercial publishers. There are also concerns about publishers expanding their scope when some people perceive them to have abused their market position with high journal price increases. But if libraries are to continue to be relevant, they must be willing to adapt, and to learn from publishers’ willingness to innovate with data.

What service will libraries provide?

Iain Hrynaszkiewicz, head of data and HSS publishing in open research at Nature Publishing Group/Palgrave Macmillan, sees the role of the librarian as one that will continue to evolve, with greater emphasis on data literacy: ‘Librarians have historically been involved in information literacy training in all its forms, and research data are increasingly equal to other research outputs, such as papers, in research assessment and funding. Providing training on accessing, archiving, publishing and managing data is therefore a natural progression of this role.

‘Data Descriptors, such as those published by Scientific Data, are important for data discoverability and reusability and meeting funder and institution requirements for data sharing. They could be considered part of best practice for research data management and publication planning for any piece of research. For this reason, we are keen to work together with librarians and information professionals, in establishing researchers’ skills and understanding of the importance of data management,’ he said.

The need for on-site data management expertise ensures there will be a role for the library and information professional in the future of data management, although the extent to which most libraries and librarians can fulfil these needs is not yet clear. For every example of strong proactive data services, there seem to be many libraries whose data services are limited or non-existent.

Technological change requires skill sets that may be limited in the information sector due to the focus of library schools in the past. Information professionals may continue to be part of this work, but it is always possible that these workers will be distributed across specific projects.

Conclusion

The transformative effect of innovation on library services has been a regular staple of the profession’s literature for many years. Often such works involve predictions that, if not doom-laden, may change the library role beyond all recognition. In comparison, the need for more data-centric services provides great opportunities that clearly fall within a library’s remit, but there are also many potential competitors.

If libraries are to continue to fulfil their core traditional role, then there need to be far more innovative approaches to data. Failure to innovate successfully may leave the library and librarians with a far more diminished role.

David Stuart is a research fellow at the Centre for e-Research, King’s College, London