Powering future discovery
The semantic web, linked data and RDF triples are all hot topics at the moment. Sian Harris finds out what impact they are having, and will have, on resources for research
‘We saw that, by applying existing standards, semantic enrichment would allow meaningful identification of the actual content.’
So wrote Richard Kidd, then manager for editorial production systems at the Royal Society of Chemistry Publishing (RSC), in an article in the April/May 2007 issue of Research Information. His article described Project Prospect, an initiative that the society publisher launched in February of that year to semantically enrich chemistry articles. The project made the content of articles machine-readable so that readers could click on named compounds and scientific concepts in an article to download structures, understand topics, or link through to electronic databases.
At the time, the RSC was blazing something of a trail among primary research publishers. Indeed Kidd, now business development manager at the RSC, reflects that this potentially limited the project’s early usefulness.
‘As a reader-centric tool it was limited going from individual papers,’ he says. ‘At the time we launched the project we were the only chemistry publisher doing this and we only represented around five per cent of the chemistry papers published.’
That was six years ago – and, of course, much can change in six years in the scholarly publishing industry.
At the time of Project Prospect’s launch, Kidd recalls that much of the work came out of research projects and the RSC’s semantic offering required significant manual intervention.
Today, it’s a different story: the project has evolved to focus more on discovery than just linking out from individual papers, a shift assisted by the RSC’s purchase of the free chemical search tool ChemSpider in 2008. Work on semantic enrichment at the publisher is also moving beyond journal content to encompass a wide range of educational and news resources published by RSC.
And, of course, semantic enrichment is no longer confined to a few research projects. Every major publisher is experimenting with its possibilities and researchers are starting to see its benefits.
Semantics in action
As Daniel Mayer, VP of marketing at TEMIS, noted, applications of semantic enrichment include enhancing information access, semantic products, and analysis. ‘It’s about creating a network of links between content,’ he explained. This approach enables publishers to make content more compelling and build new products, based on, for example, topics.
‘One of the main use cases is discovery,’ added Michael Clarke, executive vice president of Silverchair Information Systems. ‘This could be, for example, recommending similar articles based on semantic fingerprints (although there are other ways to do this today based on analytics).’
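As a rough illustration of the recommendation idea Clarke describes, the sketch below treats a ‘semantic fingerprint’ as nothing more than the set of concepts tagged on a paper, and ranks other papers by how much their sets overlap. That definition, the paper identifiers, the concept labels and the choice of Jaccard overlap as the score are all assumptions made for the example; the article does not say how any vendor computes fingerprints.

```python
# Minimal sketch: recommend similar papers from overlapping concept tags.
# All identifiers and concept labels below are hypothetical.

FINGERPRINTS = {
    "paper-a": {"catalysis", "palladium", "cross-coupling"},
    "paper-b": {"catalysis", "palladium", "ligand design"},
    "paper-c": {"astronomy", "spectroscopy"},
    "paper-d": {"catalysis", "spectroscopy"},
}


def jaccard(a, b):
    """Share of concepts two fingerprints have in common (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def recommend(target, fingerprints, top_n=2):
    """Return up to top_n other papers, most similar first.

    Papers with no concepts in common with the target are left out.
    Ties are broken alphabetically so the ordering is deterministic.
    """
    scores = [
        (jaccard(fingerprints[target], fp), other)
        for other, fp in fingerprints.items()
        if other != target
    ]
    ranked = sorted(scores, key=lambda pair: (-pair[0], pair[1]))
    return [other for score, other in ranked[:top_n] if score > 0]


print(recommend("paper-a", FINGERPRINTS))
```

Here paper-b shares two of four concepts with paper-a and paper-d shares one of four, so they are suggested in that order, while the astronomy paper is never recommended.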
Semantic enrichment can also help search – to normalise content across many different sources – and in topic modelling, which is similar to search but presented in browse mode. ‘With enrichment you can create topic browsing – useful for getting very granular with large volumes of information. You could do it manually but that is time-consuming and hard to make dynamic,’ said Clarke, who explained that, when a new term comes about, a whole corpus of information can be updated without doing anything by hand.
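Clarke’s point about updating a whole corpus when a new term appears can be sketched in a few lines. The example assumes that enrichment is driven by a controlled vocabulary mapping topics to terms, and that tagging is a simple whole-word match; real enrichment pipelines are far more sophisticated, and the corpus, topic names and tag_corpus function here are invented for illustration.

```python
import re

# Hypothetical three-document corpus and a one-topic starting vocabulary.
CORPUS = {
    "doc1": "A study of graphene transistors.",
    "doc2": "Laser switching in optical fibres.",
    "doc3": "Graphene oxide membranes and laser ablation.",
}
VOCABULARY = {"Lasers": ["laser", "lasers"]}


def tag_corpus(corpus, vocabulary):
    """Build a topic-browsing index: topic -> sorted ids of matching documents."""
    index = {}
    for topic, terms in vocabulary.items():
        if not terms:
            continue  # a topic with no terms cannot match anything
        # Whole-word, case-insensitive match on any of the topic's terms.
        pattern = re.compile(
            r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
            re.IGNORECASE,
        )
        hits = sorted(
            doc_id for doc_id, text in corpus.items() if pattern.search(text)
        )
        if hits:
            index[topic] = hits
    return index


before = tag_corpus(CORPUS, VOCABULARY)

# A new term comes about: add it to the vocabulary once and re-run.
VOCABULARY["Graphene"] = ["graphene"]
after = tag_corpus(CORPUS, VOCABULARY)

print(before)
print(after)
```

The only manual step is the one-line vocabulary change; every document that mentions the new term is picked up on the next run, which is the dynamic behaviour that hand-built topic pages struggle to match.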
Another application that he identified is user intelligence. ‘There’s a vast investment in user analytics, and semantics tells you more about people. If I’m searching heavily on some sort of drug or laser switch, this tells a lot about me.’
Bradley Allen, VP at Elsevier Labs, noted: ‘It’s not just about content but about relationships between content.’
Elsevier has used semantic techniques to develop products such as Geofacets and Brain Navigator. Allen also mentioned an application behind a special issue of the Lancet. The issue focused on problems in childbirth in developing countries and the application enables readers to map these problems by geography and other factors.
‘Science and technology publishers are beginning to explore ways to mine across the literature to support linking into external databases,’ he said. ‘Now we have this mechanism to begin to create links, we are going back and seeing how it enables us to create tools that help researchers.’
‘We, and the rest of the publishing community, are trying to feel our way through it. We have put a lot on the publicly available cloud, such as funding information (used in FundRef),’ he continued. ‘The article and the traditional bibliographic metadata associated with it is core.’
Another recent development in this area came last April when Nature Publishing Group (NPG) released its bibliographic metadata as linked data in the form of a publicly available RDF triplestore. ‘Linked data is a tool to power the semantic web. It enables us to describe patterns and connections and automate discovery of knowledge,’ explained Dan Pollock, associate director, nature.com. ‘RDF triples basically describe the reason for links between two pieces of information on the web.’
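For readers unfamiliar with the format, a triple is a three-part statement: a subject, a predicate naming the relationship, and an object. The sketch below shows the idea with a handful of bibliographic statements and a simple pattern query. It is not NPG’s data model: the example.org identifiers are invented, the predicates are borrowed from the Dublin Core terms vocabulary purely as an assumption, and a real triplestore would be queried with SPARQL rather than a Python function.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str    # the thing being described
    predicate: str  # the reason for the link
    obj: str        # the thing, or literal value, it is linked to


# Hypothetical identifiers; predicates taken from Dublin Core terms.
EX = "http://example.org/"
TITLE = "http://purl.org/dc/terms/title"
CREATOR = "http://purl.org/dc/terms/creator"
IS_PART_OF = "http://purl.org/dc/terms/isPartOf"

TRIPLES = [
    Triple(EX + "article/1", TITLE, "An example article"),
    Triple(EX + "article/1", CREATOR, EX + "person/a-author"),
    Triple(EX + "article/1", IS_PART_OF, EX + "journal/example-journal"),
    Triple(EX + "article/2", IS_PART_OF, EX + "journal/example-journal"),
]


def match(
    triples: List[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> List[Triple]:
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        t
        for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# "Which articles are part of the example journal?"
articles_in_journal = [
    t.subject
    for t in match(TRIPLES, predicate=IS_PART_OF, obj=EX + "journal/example-journal")
]
print(articles_in_journal)
```

Because every statement has the same shape, sets of triples from different sources can be combined and queried together, which is what makes the format attractive for bibliographic metadata at the scale described here.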
This work, carried out by TSO (The Stationery Office), began with 50 million triples but TSO has now ingested over 300 million triples for NPG into the TSO OpenUp Linked Data Platform. ‘It’s basically bibliographic metadata for all our journals and society journals. It’s part of good web citizenship,’ said Pollock.
‘We’re really interested to see what people are doing with it,’ he continued. ‘It’s very early days yet but we’ve seen some engagement from the community. Linked data for researchers is a good thing in the long term. People will build tools and it will enable them to discover data. It’s about helping them to make sense of a world where there is too much information. However, linked data is still very new and reasonably niche in terms of practical applications.’
There are also some big projects going on, such as the three-year European OpenPHACTS project, which is working on a freely-available platform to integrate pharmacological data from a variety of information resources, as well as providing tools and services to question this data. The RSC is involved in this project, mapping chemical information in public data stores. ‘We are talking to other publishers and public data sources. It is a way of making semantics work as a discovery tool,’ explained Kidd. ‘OpenPHACTS is good because it has an end product.’ He added that the lessons learned from this could be relevant to other projects because the infrastructure underneath is very generic.
But, despite such development and services, semantic enrichment still brings plenty of challenges.
‘A lot of semantic work is still very research-orientated. This means that standards have big gaps,’ said Kidd. In his area, chemistry, for example, he said that organic chemistry and the overlap with biology are well covered by standard vocabularies – or ontologies. In addition, the InChI standard is in the process of being expanded to organometallics and inorganic chemistry. However, ‘there are big areas like materials that are not well understood in terms of standard terms,’ he added.
‘Cross-domain ontologies are tricky and expensive to do,’ he continued. But there have been developments, he said. ‘Nobody is trying to build their own standards in chemistry anymore. People are open to collaboration to develop new ones.’
Elsevier’s Allen has seen a shift in how much weight is placed on ontologies. He noted that, in the early days of semantics, ontologies received much of the attention. However, Tim Berners-Lee proposed a range of approaches: using URLs as names for things; making those names referenceable over HTTP on the web; having what comes back from HTTP arrive in a standard form; and having those things point to other things. ‘It allows us to stitch stuff together in the way that the web is stitched together. It doesn’t have to be centrally managed,’ Allen explained.
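The practices Allen lists can be sketched as a short traversal. To keep the example self-contained, the web is simulated with a dictionary rather than real HTTP requests, and every URL, label and function name is hypothetical. What matters is the shape: each name is a URL, looking it up returns a description in one shared form, and that description points to other URLs that may belong to entirely different publishers or databases.

```python
from collections import deque

# Simulated web: each URL "dereferences" to a description in one shared form.
# The three domains stand in for independently managed sources.
FAKE_WEB = {
    "http://example.org/article/1": {
        "label": "Example article",
        "links": ["http://example.org/compound/7", "http://example.net/funder/3"],
    },
    "http://example.org/compound/7": {
        "label": "Example compound",
        "links": ["http://example.com/database/entry/42"],
    },
    "http://example.net/funder/3": {"label": "Example funder", "links": []},
    "http://example.com/database/entry/42": {
        "label": "Example database entry",
        "links": ["http://example.org/article/1"],
    },
}


def dereference(url):
    """Stand-in for an HTTP GET that returns a standard-form description."""
    return FAKE_WEB.get(url)


def follow_links(start, max_hops):
    """Breadth-first walk from start, following links up to max_hops away."""
    seen = {start}
    labels = []
    queue = deque([(start, 0)])
    while queue:
        url, hops = queue.popleft()
        description = dereference(url)
        if description is None:
            continue  # nothing came back for this name
        labels.append(description["label"])
        if hops == max_hops:
            continue  # do not look further than max_hops
        for link in description["links"]:
            if link not in seen:
                seen.add(link)
                queue.append((link, hops + 1))
    return labels


print(follow_links("http://example.org/article/1", max_hops=2))
```

No central registry is consulted at any point: the walk discovers the funder and the database entry only because the descriptions it retrieves point to them, which is the sense in which nothing has to be centrally managed.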
‘The semantic web focuses practically on what can be done now. Ontologies are important but complex. They are worthwhile but, in some ways, we had the cart before the horse. Now, when people talk about them, they are very specific in very specific domains,’ he said.
Serving different needs
There are challenges with serving a range of domains too, as SPIE has discovered. The society has recently worked with Access Innovations and Silverchair to ‘create a granular, multi-headed hierarchy sophisticated enough to automatically index our content at the paper level with a high level of accuracy,’ according to Timothy Lamkins, SPIE press manager.
‘Semantic enrichment promises to open up new ways of extracting complex knowledge from the otherwise unfathomable ocean of information to which we all contribute,’ he said.
However, one challenge that Lamkins identified is the technical domain that the society covers. ‘We cover nanotechnology to astronomy and industries as diverse as energy, defense, and biomedical. It’s difficult enough to isolate the uses of a single term in a narrow domain, but to account for all of the uses in the various fields and industries we cover is a complex task,’ he said. In addition, the society has a wide range of publications and breadth of people that it serves.
Another challenge for publishers is ensuring internal data quality. As Pollock of NPG explained: ‘In order to build out we have to model metadata more efficiently. We have to be disciplined in how we model data and content.’
Such processes are particularly important – and potentially challenging – for large publishers that have grown through mergers and acquisitions.
Elsevier’s Allen explained: ‘The publishing industry has grown through incorporation of many imprints into big companies. That has meant that organisations have lots of different silos. In the print world it was easy to just put imprints on the shelf next to each other but with online this issue becomes more important. Linked data allows us to resolve this at arm’s length.’ There is also a challenge of bringing together content from publishers with that of suppliers.
A further challenge is the author’s role in the process. Kidd, of the RSC, sees the ideal future as having metadata captured at the creation of a project. However, ‘when people are writing a paper they are telling a story. This is different from marking up content for semantic searching, and the tools aren’t there to help.’
Authors may embrace this more as they see more tools emerge that use semantic technology. ‘The potential is there but there is not a clear use yet,’ said Kidd. ‘We are all still trying to find the end uses of semantic enrichment. It’s still something that the techies are doing. It’s always going to be one of those "under the hood" things.’
‘It reminds me a bit of when businesses moved to XML. At a certain point when it becomes normalised, you just need it to be in business,’ observed Clarke of Silverchair. ‘We’re moving very rapidly to a point where semantic enrichment is just part of doing business.’
‘There are technical challenges, but the bigger challenges are to get content enriched and have it live in an environment that knows what to do with content and to have a business model that’s going to support it,’ he continued. ‘To just get content enriched is not particularly expensive. The bigger costs are around applications to support content and making it open to search engines and other tools.’
Elsevier’s Allen agreed: ‘At the end of the day, what you are going to get is extremely complex relationships that people still need help making sense of, finding vocabularies, mining it and building integration with other tools used in research. Semantic enrichment is going to make information richer but it’s probably not going to make it easier,’ he predicted. ‘There’s not a magic solution where a machine is going to address information needs. And, although many people can come and query linked data, not many people see that as their day job. They see that as the job of publishers or librarians,’ he concluded.
If publishers and others do this right, there are potential benefits for researchers and beyond. As Lamkins of SPIE observed: ‘Content that is unfindable or without context might as well not exist, so clearly semantic enrichment is critical to the future of scholarly publishing. Moving deeper into the content – semantically tagging the components that make up our publications – will open up the possibility of mash-ups of experimental data, medical images, or mathematical models, for example. Soon, the semantic layer atop content may become just as important as the content itself.’