From documents to data

The rapid growth of digital information has led to rising interest in new approaches to information retrieval and discovery, writes David Stuart

In a world where thousands, millions, or even billions of documents may be returned in response to a simple keyword search, new tools are required to help people find the most relevant resources and help them to dig deeper into the results. One approach that has gained a lot of interest is semantic enrichment, the enrichment of content with additional semantic metadata to enable greater understanding of content by machines.

Speaking with Bob Kasenchak, head of product development at Access Innovations, it’s clear that the work of semantic enrichment is really only just beginning, and there is a lot of potential in the adoption of increasingly complex knowledge organisation systems and enrichment at a finer level of granularity.

The challenge

The rapid growth in digital resources over the last couple of decades comes from both the digitisation of previously analogue content and a huge growth in the quantity of new digital content being made available. Digital versions of traditional publications have been joined by a wealth of new born-digital genres (e.g., web pages, blogs, and social media), whilst the potential of open data and open code to stimulate innovation has led to the opening of such resources by innovative organisations in many different sectors.

The value of the growing abundance of resources can only be realised, however, if people can find the information they need, and don’t have to waste their time trawling through thousands of resources irrelevant to their particular needs. Although full text search is increasingly offered for textual resources, the inherent ambiguity of natural language leads to numerous false drops.

Ensuring people have access to the resources they need, as and when they need them, isn’t a problem restricted to any one particular type of organisation, but rather is a problem for all kinds of organisations and individuals in the retrieval of a wide range of resources. It’s a problem for governments that need to ensure civil servants can quickly find the data they need whichever government department has collected the data; it’s a problem for commercial organisations that want to streamline and standardise processes throughout the world; and it’s a problem for researchers that need to be able to find the resources they need, irrespective of whether those resources originate within their own fields of study or in a tangential field.

It is a problem that is particularly relevant to organisations providing content to the public, especially those that want to be paid for it. According to Kasenchak, this has led scholarly publishers to be at the leading edge of semantic enrichment: ‘Hundreds of thousands of documents were digitised for researchers, but as digital models became prevalent people also came to expect content to be free, and they get very frustrated when you pay for membership to a science society and the search is terrible.

‘In order to demonstrate the value to their membership, scholarly publishers began finding ways to give the benefit of membership in actual enhancements, and one of the ways they were doing that was in increasing the findability of their content using semantic enrichment.’

From subject headings to ontologies

Whereas computers struggle to understand the meaning of natural language, enriching content with additional semantic metadata from controlled vocabularies can reduce ambiguity. A controlled vocabulary is a knowledge organisation system that reduces natural language ambiguity by restricting the terms that may be used, and how the terms should be used. In natural language, ‘Apple’ may refer to a computer, the associated company, the Beatles’ record company, or even occasionally a piece of fruit. Within a controlled vocabulary ‘Apple’ can only refer to one of these things.
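As a rough illustration of the idea, a controlled vocabulary can be sketched as a lookup in which every permitted term has exactly one meaning, and variant wordings are redirected to the permitted term. The terms, scope notes and the `normalise` function below are invented for this example and are not drawn from any real vocabulary.

```python
# Hypothetical vocabulary: each preferred term has exactly one meaning.
PREFERRED = {
    "Apple": "the fruit",
    "Apple Inc.": "the technology company",
    "Apple Records": "the Beatles' record label",
    "Macintosh": "the computer",
}

# Hypothetical entry terms that redirect indexers to a preferred term.
USE = {
    "Apple Computer": "Apple Inc.",
    "Mac": "Macintosh",
}


def normalise(term):
    """Return the preferred term, or None if the term is not permitted."""
    if term in PREFERRED:
        return term
    return USE.get(term)


print(normalise("Mac"))     # Macintosh
print(normalise("Apple"))   # Apple
print(normalise("iFruit"))  # None
```

In this sketch ‘Apple’ is still a valid tag, but it can only ever mean the fruit; the other senses each have their own distinct term.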

Controlled vocabularies take many forms, including subject headings, taxonomies, thesauri, and ontologies. Somewhat ironically, the terms themselves are often used inconsistently, with taxonomy and ontology in particular having a wide variety of definitions. The fundamental difference between the different types of controlled vocabularies, however, is in the variety of the relationships that exist between the terms. Whereas subject headings may have no relationships between the individual terms, thesauri express hierarchical and associative relationships between terms, whilst ontologies often incorporate a wider variety of both entities and relationships between them. 

Once terminology is used consistently it becomes meaningful to computers, and as we move from subject headings to ontologies with an increasingly rich set of relationships, there is far more potential for discovering related content and inferring new relationships.
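A minimal sketch of how hierarchical relationships can help discovery: if each term records its broader term, a search for a general concept can be expanded to include everything beneath it. The thesaurus fragment, document tags and function names here are all hypothetical.

```python
# Hypothetical thesaurus fragment: each term maps to its broader term.
BROADER = {
    "Fruit": "Food",
    "Apple": "Fruit",
    "Pear": "Fruit",
    "Granny Smith": "Apple",
}

# Hypothetical documents tagged with terms from the thesaurus.
DOCUMENT_TAGS = {
    "doc-1": {"Granny Smith"},
    "doc-2": {"Pear"},
    "doc-3": {"Food"},
}


def narrower_terms(term):
    """Return every term below `term` in the hierarchy, at any depth."""
    found = set()
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for child, parent in BROADER.items():
            if parent == current and child not in found:
                found.add(child)
                frontier.append(child)
    return found


def search(term):
    """Find documents tagged with `term` or with any narrower term."""
    wanted = {term} | narrower_terms(term)
    return sorted(doc for doc, tags in DOCUMENT_TAGS.items() if tags & wanted)


print(search("Fruit"))  # ['doc-1', 'doc-2']
```

Neither matching document is tagged ‘Fruit’ itself; they are found only because the relationships between terms have been recorded.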

For Kasenchak there are signs that this move is beginning to happen: ‘Fifteen years ago it was common to have a subject headings list, today forward-thinking organisations have a thesaurus, and the move is towards more ontological structures. The standards are evolving that way. The cost in assembling an ontology from scratch may be prohibitive, but people are taking their existing taxonomy or thesaurus and slowly developing an instance ontology over time. Either that or a company with a lot of energy behind their library initiative might hire someone in our sector to expedite that process of building those ontologies.’

From documents to data

Applied at the document level, ontologies may allow people to find related content that they could not have found previously, but semantic enrichment doesn’t have to stop at the document level. Semantic enrichment may be applied at far finer levels of granularity, enriching ever smaller parts of documents: sections, paragraphs, tables, diagrams, even the appearance of individual concepts or entities. This facilitates the retrieval not only of relevant resources, but also of the relevant parts of documents. Today such enrichment is often in the form of XML files, but increasingly RDF triples are becoming important: a web of data to integrate with an ontological web of terms.
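To give a flavour of what this looks like, an RDF triple is a statement of the form subject, predicate, object. The sketch below uses plain Python tuples rather than a real RDF library, and the identifiers and predicate names (`hasPart`, `mentions`, `broader`) are made up for the example.

```python
# Hypothetical triples describing parts of one article and the concepts
# they mention. Real RDF would use full URIs rather than short strings.
TRIPLES = [
    ("article-1", "hasPart", "article-1#para-2"),
    ("article-1", "hasPart", "article-1#table-1"),
    ("article-1#para-2", "mentions", "concept:Apple"),
    ("article-1#table-1", "mentions", "concept:Pear"),
    ("concept:Apple", "broader", "concept:Fruit"),
]


def match(subject=None, predicate=None, obj=None):
    """Return triples fitting the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in TRIPLES
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# Which parts of any document mention the concept 'Apple'?
parts = [s for (s, _, _) in match(predicate="mentions", obj="concept:Apple")]
print(parts)  # ['article-1#para-2']
```

The query returns a paragraph rather than a whole article, which is the point of enriching content below the document level.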

Kasenchak sees such a move as opening a host of new opportunities and insights from content: ‘RDF triples are on the rise in a big way. We’re going to see people converting data, and rethinking how they store that data as RDF triples instead of adding just XML to a document.

‘Once you have content marked up with semantic terms, then you have things you can count. As soon as you have countable quantified things from a controlled list you’ve turned your content into big data.

‘You can make inferences, ask questions, cluster things together, and make wonderful visual displays and graphs for exploring these data sets. It’s just beginning, and the curve is accelerating.’
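The ‘things you can count’ can be shown with a small sketch. Assuming a handful of documents tagged from a controlled list (the documents and tags below are invented), a few lines are enough to count how often each term is used and which terms appear together.

```python
from collections import Counter
from itertools import combinations

# Hypothetical documents, each tagged with terms from a controlled list.
TAGGED = {
    "doc-1": ["Semantic enrichment", "Ontologies"],
    "doc-2": ["Ontologies", "RDF"],
    "doc-3": ["Ontologies", "RDF", "Semantic enrichment"],
}

# How often is each term used across the collection?
term_counts = Counter(term for tags in TAGGED.values() for term in tags)

# How often do two terms appear on the same document?
pair_counts = Counter(
    pair
    for tags in TAGGED.values()
    for pair in combinations(sorted(set(tags)), 2)
)

print(term_counts.most_common(1))          # [('Ontologies', 3)]
print(pair_counts[("Ontologies", "RDF")])  # 2
```

Counts like these are only meaningful because the tags come from a controlled list; free-text keywords would scatter the same concept across many spellings.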


Information overload is not a new problem; the history of scientific publishing is filled with innovations to cope with the expanding quantity of literature, enabling information retrieval without wasting prohibitively excessive amounts of time.

The difference now is that the scale of the problem means that automated processes need to provide more semantically rich solutions.

Scholarly publishers may be at the leading edge of semantic enrichment, but anyone who regularly makes use of their online offerings will recognise that there is a wide variation in the quality of the current services, and most still have a long way to go.

As William Gibson might say: ‘The future has arrived – it’s just not evenly distributed yet.’

Semantic enrichment is not necessarily a cheap option (especially with high-quality bespoke ontologies) and there are competing technologies that offer the promise of helping people find the content that they need. This is especially the case with open content, where altmetrics and social media both harness the size of the audience to offer alternative ways of filtering data and finding and identifying related resources.

Unfortunately, recognition of information problems, even if accompanied by information solutions, does not always equate to the necessary investment in information services.

Semantic enrichment undoubtedly has a lot to offer in the long term, but only time will tell how many organisations are willing to make the investment they need to in the short term. 

Reducing ambiguity
  • Subject headings are a controlled set of terms designed to describe the subject of a resource, one of the best known of which is the Library of Congress Subject Headings (LoCSH). Subject headings do not necessarily include the relationships between terms, although the LoCSH has been published as RDF with thesaurus-type relationships.
  • A thesaurus provides hierarchical relationships between concepts (i.e., broader and narrower terms), as well as equivalence and associative relationships. Popular examples include Getty’s Art and Architecture Thesaurus and the Getty Thesaurus of Geographic Names. Once again these resources are now publicly available as RDF.
  • An ontology provides a formal representation of knowledge with rich semantic relationships between terms. As well as concepts it may include a wide range of other entities. The BBC has been at the forefront of developing a wide range of ontologies for modelling its content. By breaking content into chunks of data, and curating links to external resources, rich resources may be created dynamically.