Librarians should embrace linked data
Data sharing in libraries doesn't have to be daunting for librarians, writes David Stuart
If libraries are to realise the value of the data they have been building and refining over many years, it is not enough for them to embrace the web of documents; they must also embrace the web of data. The associated technologies may seem complex and impenetrable, but embracing the web of data does not have to mean that every librarian embraces every piece of technology.
The web of data refers to the publication of data online in a machine-readable format, so that individual pieces of information can be both linked to and read automatically. There are vast quantities of information being shared in documents on the web, but at the moment the content of most of these documents is not machine-readable; browsers know how to display the text and images on a web page, but have little understanding of the data those pages contain.
By contrast, the web of data makes use of many of the same protocols and technologies as the web of documents, but adds increasingly widely agreed standards that give meaning to the contents of pages. ‘John Smith’ becomes not just 10 characters to be displayed in a large Helvetica font, but something a computer can understand to be a specific person, distinct from all other John Smiths, and one that other entities can link to.
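As a rough sketch of what this looks like in practice, the following Python fragment builds a minimal JSON-LD description of that John Smith (the identifier and values are invented for illustration; JSON-LD is one of several formats used on the web of data):

```python
import json

# A minimal JSON-LD description (identifiers invented for illustration).
# "John Smith" is no longer just a string to display: he is a uniquely
# identified entity of a declared type, which other data can link to
# via the "@id" URI.
john_smith = {
    "@context": "https://schema.org",
    "@id": "https://example.org/people/john-smith-1",
    "@type": "Person",
    "name": "John Smith",
}

print(json.dumps(john_smith, indent=2))
```

A search engine or aggregator reading this markup knows it is describing a person named John Smith, rather than simply rendering ten characters of text.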
Data sharing in libraries
The potential of the web of data is increasingly recognised by the community of library and information professionals – not only for the publishing of catalogue data, but also for transaction data, and the datasets that are increasingly stored in institutional repositories.
Publishing catalogue data according to widely adopted standards offers the potential for the aggregation of data from multiple institutions and the creation of new websites and services: a library’s holdings will appear in search engine results, rather than requiring users to visit a library’s own website; union catalogues can become far more comprehensive; and the catalogue data can be combined with data from other sources to provide new insights and mashups.
Transaction data, suitably anonymised, offers the potential for an additional level of insight into the value of library holdings that may be incorporated into a host of new recommendation systems. The publishing of research datasets in structured formats will potentially ease the process of reusing this data, enabling it to be combined more efficiently with other datasets.
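To make the catalogue case concrete, here is a sketch of what a single catalogue record might look like when published with a widely adopted vocabulary such as Schema.org (all values below are invented; real records would carry far more detail):

```python
import json

# A hypothetical catalogue record expressed with Schema.org terms (all
# values invented). Published like this, a search engine or union
# catalogue can read the title, author and ISBN without any
# library-specific parsing.
record = {
    "@context": "https://schema.org",
    "@type": "Book",
    "@id": "https://example.org/catalogue/b0000001",
    "name": "An Example Title",
    "author": {"@type": "Person", "name": "Jane Author"},
    "isbn": "978-0-00-000000-2",
}

print(json.dumps(record, indent=2))
```

Because the same vocabulary is used across institutions, records like this can be aggregated, deduplicated and combined with data from entirely different sources.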
It is essential for libraries to embrace the potential of these technologies because the web has fundamentally changed the way people access information. Google can provide researchers anywhere, at any time, with millions of web pages on seemingly any topic; more results than any researcher could hope to read in a lifetime. The quality of a library’s holdings may be higher than that of Google’s results, but those holdings are destined to remain under-used unless they are as accessible.
Whilst the challenge from the web was once met by putting library catalogues on the web and providing electronic access to a library’s holdings, the increasingly recognised need is to integrate a library’s data with the web – to make it available where the user is, rather than expecting the user to come to a library’s homepage.
The problem, however, is that the idea of a web of data is accompanied by a raft of technical specifications and standards that can be daunting for those from a less technical discipline. The acronyms and neologisms – RDF triples, RDF/XML, RDFa, RDFS, OWL, microdata – are seemingly without end, while the multitude of competing ontologies and vocabularies available for providing structure to the content can add another layer of confusion and indecision.
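The list of acronyms is less daunting than it first appears, because most of them describe the same simple data model: subject-predicate-object triples, written down in different ways. The sketch below (URIs invented for illustration) shows two triples and renders them as N-Triples, the simplest of the textual serialisations:

```python
# The underlying RDF data model is just subject-predicate-object triples;
# RDF/XML, Turtle, RDFa and JSON-LD are different ways of writing the
# same triples down. (URIs below are invented for illustration.)
triples = [
    ("https://example.org/people/john-smith-1",
     "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
     "http://xmlns.com/foaf/0.1/Person"),
    ("https://example.org/people/john-smith-1",
     "http://xmlns.com/foaf/0.1/name",
     "John Smith"),
]

# Rendered as N-Triples: URIs go in angle brackets, plain literals in
# double quotes, one statement per line, terminated with a full stop.
for s, p, o in triples:
    obj = f"<{o}>" if o.startswith("http") else f'"{o}"'
    print(f"<{s}> <{p}> {obj} .")
```

The competing serialisations are, in other words, largely a matter of packaging; the model underneath stays the same.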
How many of the underlying technical aspects the typical librarian needs to engage with is a matter of debate. Some (including the author of this article) have argued that the changing environment requires librarians to become more technically proficient: that they now need an extensive understanding of RDF and other semantic technologies, and may even consider acquiring programming skills as they start to facilitate access to this web of data. Others, however, even among the leading advocates of linked library data, see far less need for librarians to change drastically from what they have already been doing so well.
A less daunting perspective
OCLC is one of the organisations at the forefront of the sharing of library data, and does not necessarily see a future in which linked library data requires librarians to have more in-depth technical knowledge. Richard Wallis and Ted Fons are involved in the data-sharing strategy at OCLC and are contributing to its forthcoming white paper on data sharing in libraries. Rather than predicting a sudden change in working practices, they expect a more gradual process, with technology and aggregation dealing with many of the complexities.
For Wallis, part of the reason for the gradual process is the difference in the way the library world and the web world view information. ‘The ways that we model and the vocabularies we use are very library specific, and are created around a record that will contain everything to do with a book in one chunk of data. This is compared with the web world where they tend to have one source for information about different entities that is linked to,’ he explained.
‘The problem the library world is grappling with at the moment is how do we do this with library data, and we are starting to develop standards, such as the Library of Congress’s BIBFRAME for library-focused vocabularies. But that’s still very library specific. OCLC is also working with open vocabularies that are used widely across the web, such as Schema.org, which is backed by the major search engines.
‘The general library community has significant investment in business as usual, and the MARC record is not going away very quickly, it will evolve away over time. Linked data is probably an add-on process over time and the library systems will start to evolve without the librarians having to understand all the technicalities. OCLC is looking at how the systems we provide to our members can take advantage of the technologies without having to worry about the technologies.’
Fons reiterated the idea that librarians don’t need to get bogged down in the technologies. ‘The librarian doesn’t have to worry about RDF and the technologies, as these are handled at the network level. We need librarians to contribute their work, make sure they are sharing, and the systems can do the rest.’
‘The library community is good at making adjustments to Schema.org, but there’s still a final gap in true global commitment to aggregation of data and making that data available globally with a high level of accuracy. What we have today is really good collections of library metadata, like WorldCat, where we record what has been published and is being published (although not all special collections are there), but we don’t have global commitments for libraries to say “yes, I want my holdings record there at a degree of accuracy where we have a fully recognised place to go”. We want to really impress the world with size and scale and linking activity into and out of the existing metadata store. That’s the next step, to bring together the progress in existing metadata sharing; that’s where we’ll see benefits due to the power of aggregation and collaborative management of data.’
So if there is no need for librarians to get bogged down in the details of the technologies, what should they be doing?
According to Wallis, we are at an ‘evolutionary stage’ and the contributions that librarians can make will change over time. ‘Make sure your systems are capable of sharing their information in the global manner. If it’s of interest, they can join SchemaBibEx, a group set up to discuss and prepare bibliographic extensions to Schema.org, but we’re not expecting everyone to join. What librarians need to do is keep up to speed, be aware of the systems, and be aware that an increasing number of users don’t come directly.’
Wallis gave the example of the National Library of France, which put up a catalogue that was accessible to search engines. The library found that over 80 per cent of the hits then came directly from the search engines; these were users who probably didn’t even know the web address of the library catalogue.
For Wallis and Fons, data sharing in libraries does not require every librarian to embrace every angled bracket of an RDF/XML file. The traditional record will gradually evolve towards a more web-friendly format, and if the library community is willing to commit to sharing high-quality data, the value of that data can undoubtedly have a significant impact on the web and on the information people access.
The seeming inevitability of linked library data, irrespective of the decision of any individual librarian, should not be an excuse to opt out, however, but an argument for librarians becoming as involved as possible. If librarians don’t try to direct the technology, it will inevitably come to direct them.