Dynamic data presents a world of possibilities

Sarah Bartlett reports from OCLC's recent conference about insights from data

OCLC’s two-day conference in Strasbourg at the end of February explored the opportunities that today’s huge data aggregations open up for libraries.

One presenter who certainly made an impact was Jean-Baptiste Michel, a French-Mauritian researcher working at Harvard University. Around 2007, Michel and his collaborator, Erez Lieberman Aiden, persuaded Google to give them access to all the books so far digitised by the corporation, and found themselves looking at an immense databank of 15 million books, or ‘12 per cent of all books ever written at any point in time – a huge chunk of human culture.’

With this immense resource, Michel and Aiden have developed quantitative approaches to the study of cultural history, demonstrating the value of ‘big data’ aggregations beyond their stronghold, the sciences. Michel explained that the volumes of data analysed were sufficient to track widespread changes in culture over time and that through large-scale textual analysis, we can look at anything and see its cultural impact. With a new invention, for example, we can find out how rapidly it penetrated the culture – and, indeed, whether it changed the culture itself over a given period. Or we can look at a historical event by examining the frequency of references to the event in books published in the ensuing years. One particularly entertaining observation was that the word ‘sustainable’ did not appear until the late 20th century, and Michel joked that, if usage continued to increase at its current rate, by 2061 it would be the only word in use in the English language.
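The frequency-tracking idea Michel described can be illustrated with a minimal Python sketch. It assumes a corpus supplied as (year, text) pairs and a very simple word tokeniser; the function name, the tokenising rule and the tiny sample corpus are invented for illustration and are not the method or data used by Michel and Aiden.

```python
import re
from collections import Counter, defaultdict


def relative_frequency_by_year(corpus, term):
    """Return {year: occurrences of term / total words published that year}.

    `corpus` is an iterable of (year, text) pairs. This is an assumed,
    simplified input format, not the format of any real digitised collection.
    """
    term = term.lower()
    hits = defaultdict(int)    # occurrences of the term per year
    totals = defaultdict(int)  # total word count per year
    for year, text in corpus:
        words = re.findall(r"[a-z']+", text.lower())
        totals[year] += len(words)
        hits[year] += Counter(words)[term]
    # Normalise by yearly volume so that years with more books do not dominate
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}


# Tiny made-up corpus, for illustration only
sample_corpus = [
    (1950, "The harvest was good and the farm grew."),
    (1990, "A sustainable farm needs sustainable water use."),
    (1990, "The harvest was good."),
]
trend = relative_frequency_by_year(sample_corpus, "sustainable")
```

Dividing by the total number of words per year, rather than reporting raw counts, is what makes years with very different publishing volumes comparable.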

‘What’s really amazing to me,’ said Michel, ‘is that our past is becoming digitised at a very rapid pace, and that’s very, very powerful. We can access all this material in a digital format to transform our understanding of culture and language. Libraries and text repositories are at the front lines of a real revolution in the social sciences and humanities, which changes the way we approach questions about the human experience.’

Michel’s ability to present complex research in an accessible way, combined with the compelling nature of his work, meant that delegates were able to engage meaningfully, even making suggestions for further research. The break after this keynote presentation buzzed with excited discussion.

Also at the meeting, OCLC’s Roy Tennant introduced a number of data-mining projects carried out within OCLC’s WorldCat repository of 290 million library records. Rather than analysing full text, like Michel and Aiden, Tennant works with library metadata. The WorldCat Identities project, for example, can generate a ‘feature timeline’ for any author or creator, showing publications by and about the author in all languages and editions. Key analytics, such as the author’s works most widely held by libraries around the world, generate a picture of the breadth of global awareness that the author enjoys, and demonstrate the valuable insights that we can derive from metadata.

Linked data is here

Tennant went on to talk about the Bibliographic Framework Transition initiative of the Library of Congress – which, like a number of national libraries, is already working with linked data. ‘It’s about moving from cataloguing to catalinking,’ explained Tennant. ‘Instead of having all the data in the record, we link out to reputable sources, pull data in, process it, index it, but we don’t manufacture it. For information about Hamlet, then, we can simply point to an authoritative source of information on William Shakespeare, rather than create the data ourselves.’ Tennant also played a video, Cataloging Unchained, to show how linked data can make library metadata work much harder.

The BBC, where linked data is already embedded, offers a high-profile case study. The organisation’s information architect Silver Oliver took the audience through the BBC’s story. The project began by assigning publicly-accessible identifiers – URLs – to each of the programmes broadcast by the corporation. ‘People started to realise,’ recalled Oliver, ‘that they could now point to them, talk about them, share them, and link to them from both inside and outside the organisation.’

Philippe Stirnweiss 

Two years later, with a number of pilot projects successfully completed and a whole range of lessons learned, the BBC had the confidence to represent the biggest event it had ever covered, the 2012 Olympic Games, on a linked data platform. The corporation built a single web page for each of the athletes, organisations, events and venues of the Games. It then pulled together data from both external sources (such as venue information from the Geonames dataset) and internal ones, as well as live coverage.

OCLC’s technology evangelist, Richard Wallis, a well-known voice in the global linked data community, explained to the audience that all resources held on WorldCat now have linked data embedded in them. ‘With any linked data representation, you start with a URL, which uniquely identifies that resource across all datasets,’ Wallis said. ‘The Schema standard vocabulary, which OCLC uses, can describe a resource as a book, and provide a web page that defines what a book is.’

A combination of Dewey classification numbers, available as linked data since 2009, and Library of Congress subject heading identifiers, tells the reader what the book is about. ‘We’re using the whole of the web and its resources to describe library materials.’
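As a rough illustration of the pattern Wallis outlined, the Python sketch below builds a JSON-LD description of a book using the Schema.org vocabulary. The `Book` type and the `name`, `author` and `about` properties are genuine Schema.org terms, but the `describe_book` helper and every `example.org` URL are hypothetical placeholders; they do not reproduce the identifiers or markup that WorldCat actually publishes.

```python
import json


def describe_book(resource_url, title, author_url, subject_urls):
    """Build a JSON-LD description of a book using Schema.org terms.

    All URLs passed in are treated as opaque identifiers; the ones used
    below are hypothetical placeholders, not real WorldCat, Dewey or
    Library of Congress identifiers.
    """
    return {
        "@context": "https://schema.org",
        "@id": resource_url,              # the URL that identifies the resource
        "@type": "Book",                  # Schema.org defines what a Book is
        "name": title,
        "author": {"@id": author_url},    # link out rather than copy author data
        "about": [{"@id": url} for url in subject_urls],  # classification and subject links
    }


record = describe_book(
    "https://example.org/work/hamlet",
    "Hamlet",
    "https://example.org/person/william-shakespeare",
    [
        "https://example.org/classification/placeholder-dewey-number",
        "https://example.org/subject/placeholder-subject-heading",
    ],
)
print(json.dumps(record, indent=2))
```

The author and subjects appear only as links, which is the point of the approach: the description points to sources maintained elsewhere instead of duplicating their data.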

Markus Geipel, from the German National Library, went on to tell the audience about the Culturegraph project, which creates connections within library data. ‘We are witnessing a paradigm shift,’ he said. ‘To apply metadata to knowledge today, we connect entities together to form a web of knowledge.’ In traditional library data, Geipel pointed out, there are missing pathways. By way of example, it is difficult to get from the library authority record, which plays a thesaurus-type role in defining name conventions for entities such as authors, back to all the books connected to it. Culturegraph acts as a set of signposts, enriching the data, and improving the information-seeking experience.
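The missing pathway Geipel described, from an authority record back to the books connected to it, amounts to a reverse index. The Python sketch below shows the general idea under simple assumptions: each bibliographic record is a dictionary carrying a list of authority identifiers. The record layout, the `auth:` identifiers and the function name are invented for illustration and say nothing about how Culturegraph itself is built.

```python
from collections import defaultdict


def build_authority_index(bibliographic_records):
    """Map each authority identifier to the titles that reference it.

    Each record is assumed to be a dict with "title" and "authority_ids"
    keys; this layout is hypothetical, chosen only to keep the sketch short.
    """
    index = defaultdict(list)
    for record in bibliographic_records:
        for authority_id in record["authority_ids"]:
            index[authority_id].append(record["title"])
    return dict(index)


# Made-up records with made-up identifiers
records = [
    {"title": "Hamlet", "authority_ids": ["auth:shakespeare"]},
    {"title": "Macbeth", "authority_ids": ["auth:shakespeare"]},
    {"title": "Faust", "authority_ids": ["auth:goethe"]},
]
books_by_authority = build_authority_index(records)
```

With the index in place, a lookup such as `books_by_authority["auth:shakespeare"]` answers the question that is hard to ask of the authority record alone.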

Web is the only game in town

A recurring theme of the conference was the need for libraries to be on the open web, where their users are. ‘The search interface of the French National Library gets 80 per cent of its hits direct from Google and other search engines,’ said Richard Wallis. ‘That’s where our users are, so our information must be in that environment.’

Titia van der Werf, senior program officer at OCLC Research, emphasised this point. ‘Our users are no longer local users, but web users,’ she said. ‘The web infrastructure is like a blood circulation system, where data flows through and penetrates every part of the body. The paths that users choose as they navigate the web yield powerful usage data that we can mine to understand behaviour and information needs better.’

Van der Werf argued that libraries need to position themselves more clearly in the back end, the supply function, of the web, and specifically urged libraries to take responsibility for data quality. ‘Google and Wikipedia are like learning systems, not changing their data architecture dramatically, but continuously improving data and relying on crowdsourcing,’ she said. ‘The system is only as good as the quality of the data within it; the goal is not to attract our users back to local systems, but to be a bigger part of the web system itself.’

Libraries in the cloud

Organisations that become part of cloud-based aggregated services recognise the truth of this. On the first day of the conference, OCLC announced that Tilburg University in the Netherlands was the first European library to go live with WorldShare Management Services (WMS), a collaborative library platform based in the cloud. At the meeting, Jola Prinsen, a project manager from Tilburg University, shared her experiences of WMS, along with Henar Silvestre from Madrid Business School, which is at an advanced stage of implementation.

Eric van Lubeek, managing director of OCLC EMEA, said: ‘With cloud hosting, OCLC and libraries can work together with collaborators to provide one big platform that has visibility on the web. This enhances our ability to showcase libraries and the materials they hold. At the same time, it’s cost-effective – across Europe we see cuts in libraries and university departments. Libraries are desperately looking for operational savings to safeguard their collections and staff.’

To take full advantage of the opportunities that the conference speakers articulated so enthusiastically, the information world needs to open itself up to more collaborative ways of working, building on its own venerable traditions of sharing and reuse. To exemplify this spirit, Marie-Christine Doffey spoke about the recent adoption of an open data strategy at the Swiss National Library, where she is director. As Roy Tennant concluded: ‘We’re moving into a whole new world now. We have the tools to do widespread collaboration well, but it’s an imperative rather than a choice.’

