Semantic enrichment boosts information retrieval

13 April 2007

At the beginning of February 2007, RSC Publishing launched the first phase of its Project Prospect and, in doing so, became the first primary research publisher to publish semantically-enriched articles. Such articles should help researchers gain more information from research papers.

The aim of the project when we started was to make the science within journal articles machine-readable, and to add new ways of retrieving and presenting the information within. Currently, readers using search engines have to rely on text searches to identify either chemicals or subjects of interest. We saw that, by applying existing standards, semantic enrichment would allow meaningful identification of the actual content – either by an exact match to a compound, or by a hierarchical classification of subject terms. To create these semantically-enriched papers, RSC editors use text-mining software to annotate compounds, concepts and data within the articles. They then link these to additional electronic resources such as biological databases. Although some of the standards we are using, such as the IUPAC International Chemical Identifier (InChI) for compounds, Chemical Markup Language (CML) for compounds and reactions, and Open Biomedical Ontologies (OBO) for classifications, aren’t yet widely used within publishing, we hope and expect that their use will soon take off.
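The two kinds of identification described above, an exact match on a compound identifier and a match through a hierarchy of subject terms, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the article ids, the toy term hierarchy and the function names are hypothetical and do not describe the RSC's actual system. The two InChI strings are the standard identifiers for methane and water.

```python
# Hypothetical toy hierarchy: each term points to its parent term.
# Real ontologies such as those in OBO are far richer than this.
PARENTS = {
    "term:mitochondrion": "term:organelle",
    "term:chloroplast": "term:organelle",
    "term:organelle": "term:cell_part",
}

# Hypothetical annotations attached to two articles.
ARTICLES = {
    "article-1": {
        "compounds": {"InChI=1S/CH4/h1H4"},  # methane
        "terms": {"term:mitochondrion"},
    },
    "article-2": {
        "compounds": {"InChI=1S/H2O/h1H2"},  # water
        "terms": {"term:chloroplast"},
    },
}


def ancestors(term):
    """Return the term itself followed by all of its ancestors."""
    chain = []
    while term is not None:
        chain.append(term)
        term = PARENTS.get(term)
    return chain


def articles_with_compound(inchi):
    """Exact match: articles annotated with precisely this identifier."""
    return sorted(a for a, ann in ARTICLES.items() if inchi in ann["compounds"])


def articles_under_term(term):
    """Hierarchical match: articles annotated with the term or any narrower term."""
    return sorted(
        a
        for a, ann in ARTICLES.items()
        if any(term in ancestors(t) for t in ann["terms"])
    )


if __name__ == "__main__":
    print(articles_with_compound("InChI=1S/CH4/h1H4"))
    print(articles_under_term("term:organelle"))
```

A plain text search for "organelle" would find neither article unless the word happened to appear in it, whereas the hierarchical lookup finds both because each is annotated with a narrower term.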

So what have we actually done in this first phase, and what can readers and authors see? This is best illustrated by an example: go to the paper at http://dx.doi.org/10.1039/b613656g and click on the Prospect view. We have enhanced the HTML view of an article so that readers can open a new navigation toolbox. If users choose to highlight compounds in the text (in pink), the links bring up a compound page containing the compound’s InChI, simplified molecular input line entry specification (SMILES) string, CML link, related RSC articles, and a 2-D graphic link.

Similarly, choosing the option to highlight ontology terms in text (in blue) or pick them off a menu gives links to definitions from the Gene Ontology (GO), Sequence Ontology and Cell Ontology (part of the Open Biomedical Ontologies), along with links to related RSC articles.

The final option is to highlight IUPAC Gold Book terms (in yellow). This brings up pages from the online IUPAC Gold Book in a new window.

What’s more, in the near future, our existing RSS feeds will be enhanced with ontology terms in XML as well as InChIs and graphics for primary compounds. With this we will be enabling very advanced classification of our published research in a freely-available and standards-compliant manner. A reader will see the subject terms and structures discussed within the paper. These will also be computer readable, allowing automated selection and discovery.
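As a rough idea of what a feed entry carrying this extra information could look like to a program, the Python sketch below builds an RSS item with additional namespaced elements and reads them back. The namespace URI, the element names `compound` and `term`, and the example link are all hypothetical; the article does not specify the format the enhanced feeds will use. The term id GO:0005739 is the Gene Ontology entry for mitochondrion.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace for the enrichment elements (an assumption,
# not the namespace of any real feed).
ENRICH_NS = "http://example.org/enrichment"
ET.register_namespace("enrich", ENRICH_NS)


def build_item(title, link, inchis, terms):
    """Build an RSS <item> carrying compound identifiers and ontology terms."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "link").text = link
    for inchi in inchis:
        ET.SubElement(item, f"{{{ENRICH_NS}}}compound").text = inchi
    for term_id, label in terms:
        term = ET.SubElement(item, f"{{{ENRICH_NS}}}term", {"id": term_id})
        term.text = label
    return item


def read_item(item):
    """Extract the machine-readable annotations from an <item> element."""
    inchis = [e.text for e in item.findall(f"{{{ENRICH_NS}}}compound")]
    terms = [(e.get("id"), e.text) for e in item.findall(f"{{{ENRICH_NS}}}term")]
    return inchis, terms


if __name__ == "__main__":
    item = build_item(
        "An example article",
        "http://example.org/article-1",
        ["InChI=1S/CH4/h1H4"],
        [("GO:0005739", "mitochondrion")],
    )
    xml_text = ET.tostring(item, encoding="unicode")
    print(xml_text)
    print(read_item(ET.fromstring(xml_text)))
```

Because the identifiers travel as structured elements rather than as free text, a feed reader could select items by compound or by subject term without parsing the article itself.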

RSC Publishing's new technology allows readers to click on named compounds and scientific concepts in an electronic journal article to download structures, understand topics, or link through to electronic databases.

What will the benefits be?

Even quite close to launch, we thought the main benefits of the enhancements would be to increase the visibility of the articles. By including both subject and compound material, and making these available to search engines and via RSS feeds, we saw that this would be a first stage to proper semantic searching, where information can be read and analysed by computers.

However, it became apparent as soon as we started testing the enhanced articles that the articles became much more accessible in several ways. Attaching structures to compounds makes it easier for readers to visualise what is going on. Creating links to other articles that contain exactly that compound or ontology term makes it much easier to find other articles of interest, even before we really apply proper semantic searching. The ontology and Gold Book definitions provided as links have proved to be a real learning resource – readers who perhaps are not intimately familiar with some of the jargon within papers can easily get a clear explanation of context and meaning.

We can see now that this will bring down some of the subject and journal title barriers within our publications. The RSC publishes a wide range of subject matter, ranging from physics through chemistry and materials to biology. The identification of terms in context makes it easier for readers to identify papers of interest across titles, and also to understand more easily papers that may be of real interest but written from the viewpoint of another specialisation.

Collaborative partners

The technology underpinning Project Prospect has benefited from RSC’s contacts with the Unilever Centre for Molecular Informatics at Cambridge University. Over the past few years, the RSC has sponsored summer students to look at text mining of chemistry papers. In 2003 this resulted in the Experimental Data Checker. This tool enables the organic synthesis sections of a paper to be identified and the long paragraphs of NMR and spectral data to be put in a standard form and a series of tests and visualisations run on that data.

The basic toolkit used for this, named OSCAR (Open Source Chemical Analysis Routines), has been further developed over the years at the Unilever Centre, and now forms part of the SciBorg Project to analyse the science within scientific publications. This project is run between the Unilever Centre and the Computer Laboratory at Cambridge, and the RSC has been supporting this with staff time and example data. The latest version of OSCAR, maintained by Peter Corbett at the Unilever Centre for Molecular Informatics, is used to identify compounds, while extensions have been written to identify ontology terms. The GO Consortium has been incredibly helpful with training and adding new terms where appropriate. In addition, the cheminformatics community obviously has a great interest in this area. We’ve worked closely with Peter Murray-Rust (who, along with Henry Rzepa, developed CML) as well as others in the field.

As with everything concerned with data, it’s only when you start to apply it that the problems and opportunities really show themselves – so there’s been an extended training process to teach OSCAR to pick out the right things, and to apply the ontology and Gold Book terms in a sensible manner. Integrating OSCAR into our editing software, PTC’s Arbortext (formerly called Epic), posed unique challenges.

We had a few surprises as we began Project Prospect. For example, we discovered how many compounds are never actually named within a paper but simply referred to as something like ‘13b’. Although attaching accurate information to these is a largely manual process, it also adds tremendous value to an article.
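To see why such labels resist automation, consider the simple Python sketch below, which flags label-like tokens for an editor to resolve. The regular expression and the function name are illustrative assumptions and are not how OSCAR works. A pattern this simple also picks up tokens such as ‘2h’ in ‘stirred for 2h’, which is one reason a human check is still needed.

```python
import re

# Assumed label shape: one to three digits followed by a single lowercase
# letter, e.g. "13b". Real papers use many other labelling schemes.
LABEL_PATTERN = re.compile(r"\b\d{1,3}[a-z]\b")


def find_compound_labels(text):
    """Return distinct label-like tokens in order of first appearance."""
    first_seen = {}
    for match in LABEL_PATTERN.finditer(text):
        # Keep only the first position of each label; dicts preserve order.
        first_seen.setdefault(match.group(), match.start())
    return list(first_seen)


if __name__ == "__main__":
    sentence = "Aldehyde 13b was reduced to alcohol 14a, and 13b was recovered."
    # Each label found still has to be linked to a structure by an editor.
    print(find_compound_labels(sentence))
```

The output is only a list of candidates. Deciding which structure ‘13b’ refers to, and discarding false positives, remains a job for someone who understands the chemistry.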

The main overall challenge has been to implement this into an existing production workflow that already has world-class publication times, and to do this across all our publications rather than just on a small test bed. Our intention was to move to enhancing all papers during the course of 2007, but we have increased the proportion of enhanced papers much more quickly than we originally expected. To a great extent this has been down to our technical editors. They’ve developed new skills to judge the meaning and context of terms, and as they’re experienced and highly qualified chemical scientists we feel that this is a great application of their skills and knowledge.

As for the RSC’s archive of about 225,000 articles stretching back to 1841, there are a couple of options for enhancement. We’re still in a training period but after a few months we should get a pretty good idea of what can be achieved from a purely automated result. Then we could well opt for a software-only approach on the bulk of our archive, and perhaps a review could be applied to the most recent material or for certain subject areas. We do not have a timescale for this yet. However, just as making our archive available electronically transformed the access to this material, semantic enrichment of it should make it even more of a resource for scientists today.

Standards compliance

We are the only primary publisher to have done semantic enrichment so far, but the underlying standards mean that similar developments from other publishers should be easily compatible. Other publishers will obviously introduce their own applications but, as long as we’re all basing the science on the same basic standards, it should still allow the growth of semantic searching and analysis of the content within primary publications. To an extent, secondary publishers provide some of these enhancements already. We have adopted these standards, partly because we felt someone should take a lead with these in the chemical science area, but also because we felt we could offer real benefits for authors and readers.

We want the Project Prospect developments to demonstrate the best of online applications and to improve the visibility of the scientific information within chemical science research papers. The first phase of Project Prospect was launched as an example of what is possible, but we are looking at how the functionality can be improved, and, more importantly, how chemical or subject data will be integrated within our papers. The current classifications used are particularly relevant to our publications that have biological content, but we will be looking to add appropriate enhancements to other subject areas.

The RSC views Project Prospect as an important investment in its publications and as a way to show the benefits of learned-society publishing. The early responses that we have had suggest that it is on the right track. ‘This functionality rounds off the scientific quality of the work and makes it more accessible to scientists from other disciplines. I thank you very much for improving my article in this way!’ commented Alejandro Marangoni, University of Guelph, Canada. Glen Newton, Canada Institute for Scientific and Technical Information (CISTI), was similarly enthusiastic: ‘…an exciting effort by the RSC … as someone working in the area of text retrieval and semantic web, this is a very exciting prospect.’

The comments of many on the launch of this project were summed up by Ed Pentz, the executive director of CrossRef: ‘[Project Prospect] is fantastic. I’ve just seen the future of the journal,’ he said. Such reactions have been very gratifying, and we hope we can continue to generate this interest in the years to come.

Richard Kidd is manager for editorial production systems at the Royal Society of Chemistry. More information can be found at www.projectprospect.org