From collection enhancement to undiscovered public knowledge

David Stuart looks at the potential of semantic enrichment and considers what information professionals can do to help

In the two decades since the development of the web, the library and information profession has changed significantly, most noticeably through the replacement of many traditional services by digital ones. These digital services are not static either. They continue to evolve, first with the development of web 2.0 technologies and, increasingly, with the adoption of semantic technologies.

Semantic technologies provide the opportunity to enrich both digital documents and the information contained inside them by adding meaning in a machine-readable format. This not only helps users find the information that they need, but also offers the possibility of identifying ‘undiscovered public knowledge’.

There are several approaches to semantic enrichment. There are different ways of encoding the additional level of meaning (for example, RDF triples or Microdata) and competing models for representing different classes of object (such as the Europeana Data Model or CIDOC Conceptual Reference Model). There are also different methodologies for semantically enriching documents, for example, crowdsourcing or automatic entity extraction.

What these approaches have in common is that they enable meaning to be added to content, allowing computers to understand the information that is included within documents. They mean that a computer can identify that ‘apple’ the fruit differs from ‘Apple’ the technology company. Semantic enrichment enables computers to recognise that concepts are related to other concepts, rather than existing in isolation, and that places are within wider geographic areas. What’s more, it shows that people are related to one another and may be involved in events and organisations that are associated with other events and other organisations. Ontologies have been built with hundreds, thousands, and even millions of instances (for example, DBpedia).
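To make the ‘apple’ example concrete, the following is a minimal sketch of how subject-predicate-object triples can keep two senses of the same word apart. It uses plain Python tuples rather than a real RDF library, and every identifier with the ex: prefix, along with the function names, is a hypothetical chosen for illustration.

```python
# A toy triple store: each entry is (subject, predicate, object).
# All "ex:" identifiers are hypothetical and exist only for this sketch.
TRIPLES = [
    ("ex:Apple_Inc", "rdf:type", "ex:Company"),
    ("ex:Apple_Inc", "rdfs:label", "Apple"),
    ("ex:Apple_fruit", "rdf:type", "ex:Fruit"),
    ("ex:Apple_fruit", "rdfs:label", "apple"),
]


def objects(subject, predicate):
    """Return every object linked to a subject by the given predicate."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def senses(label):
    """Map each entity whose label matches (ignoring case) to its types."""
    result = {}
    for s, p, o in TRIPLES:
        if p == "rdfs:label" and o.lower() == label.lower():
            result[s] = objects(s, "rdf:type")
    return result


if __name__ == "__main__":
    for entity, types in senses("apple").items():
        print(entity, sorted(types))
```

The string ‘apple’ is ambiguous, but each sense has its own identifier and its own type, so a program can tell the company from the fruit without reading the surrounding text.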

As many of these ontologies are slowly joined under the banner of ‘linked data’, a network is being built that offers a new way of understanding the connections between documents and the data that they contain.

Information retrieval

The establishment of metadata standards at the document level has a long history within the library and information science profession, and interest in the use of unambiguous subject terms and authorised names is not new either.

Increasing interest in linked data and the semantic web, however, emphasises the importance of these unambiguous terms and authorised names – not only in facilitating access to associated documents within the same institution, but also across multiple institutions. The use of rich ontologies that include the relationships between different subject terms and authorised names offers the potential for discovering a wider range of associated documents than would have been possible with isolated terms.
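As a rough illustration of why relationships between subject terms widen retrieval, the sketch below expands a query term to include its narrower terms before matching documents. The thesaurus, the document identifiers and the function names are all hypothetical; a real system would draw on a published vocabulary rather than a hand-written dictionary.

```python
from collections import deque

# Hypothetical mini-thesaurus: each term maps to its narrower terms.
NARROWER = {
    "circulatory diseases": ["Raynaud's disease", "thrombosis"],
    "Raynaud's disease": [],
    "thrombosis": [],
}

# Hypothetical catalogue: document id -> assigned subject terms.
DOCUMENTS = {
    "doc1": {"Raynaud's disease"},
    "doc2": {"thrombosis"},
    "doc3": {"nutrition"},
}


def expand(term):
    """Return the term plus all of its narrower terms, at any depth."""
    seen = {term}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for child in NARROWER.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def search(term, use_ontology=True):
    """Return ids of documents indexed under the term (or its expansion)."""
    terms = expand(term) if use_ontology else {term}
    return sorted(doc for doc, subjects in DOCUMENTS.items() if subjects & terms)


if __name__ == "__main__":
    print(search("circulatory diseases", use_ontology=False))
    print(search("circulatory diseases"))
```

With the isolated term, nothing is found because no document carries that exact heading; with the expansion, the two documents indexed under narrower terms are retrieved.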

Semantic enrichment provides added value not just at the document level, but also within documents. As the web has become the most important information resource for a wide variety of queries, it is not surprising that the major search engines are interested in developing schemas for structuring data on the web in a way that they can understand. Schema.org was launched by Bing, Yahoo, and Google, and provides ways to structure information about everything from musical recordings and recipes to internet cafes and childcare facilities; structured recipes may be searched by preparation time and musical recordings by intended audience. As information within documents is semantically enriched, it becomes free from the document, and may be accessed and used without requiring a reader to wade through pages of text. This applies not only to information that already exists in a structured form within documents (such as tables) but also to relationships between entities that are discussed in narrative form.
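The recipe example can be sketched in a few lines. The snippet below assumes two recipes marked up as JSON-LD using the Schema.org Recipe type, whose prepTime property holds an ISO 8601 duration; the recipes themselves, the helper names and the simplified duration parser (hours and minutes only) are assumptions made for illustration.

```python
import json
import re

# Two hypothetical recipes described with the Schema.org "Recipe" type.
RECIPES_JSONLD = """
[
  {"@context": "https://schema.org", "@type": "Recipe",
   "name": "Tomato salad", "prepTime": "PT10M"},
  {"@context": "https://schema.org", "@type": "Recipe",
   "name": "Slow-roast lamb", "prepTime": "PT1H30M"}
]
"""

# Simplified ISO 8601 duration pattern: hours and minutes only.
DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?$")


def minutes(duration):
    """Convert a duration such as 'PT1H30M' into a number of minutes."""
    match = DURATION.match(duration)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {duration!r}")
    hours, mins = match.groups()
    return int(hours or 0) * 60 + int(mins or 0)


def quick_recipes(jsonld, limit_minutes):
    """Return names of recipes whose preparation time is within the limit."""
    items = json.loads(jsonld)
    return [
        item["name"]
        for item in items
        if item.get("@type") == "Recipe"
        and minutes(item["prepTime"]) <= limit_minutes
    ]


if __name__ == "__main__":
    print(quick_recipes(RECIPES_JSONLD, 15))
```

Because the preparation time is a labelled value rather than a phrase buried in the text, the filter never has to interpret the prose of the recipe.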

Undiscovered public knowledge

Improvements in the retrieval of individual documents or pieces of information are only the beginning of the potential of semantically enhanced documents. The real potential comes from identifying undiscovered public knowledge. The term ‘undiscovered public knowledge’ was coined by the information scientist Don Swanson to refer to knowledge that, although in the public domain, was in fact ‘undiscovered’ as it was fragmented across different specialisms and had never been brought together in one place.

Increased specialisation means that there is likely to be an increasing number of examples of undiscovered public knowledge, and as the literature continues to grow rapidly there is little hope of such knowledge being discovered manually. However, the semantic enrichment of documents can enable the automatic analysis of millions of documents across widely different domains, which could then identify the most likely areas for further investigation.

In Swanson’s original paper, the undiscovered public knowledge was that fish oil could help alleviate symptoms of Raynaud’s disease (a circulatory disease) – a theory later supported through clinical trials. How many other unknown treatments are examples of undiscovered public knowledge, waiting to be discovered at the computer terminal?
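One simple way to picture this kind of automatic analysis is to look for pairs of concepts that never appear in the same document but share intermediate concepts. The sketch below does this over a handful of toy documents. The data are loosely modelled on the fish oil example and are illustrative only, not a reproduction of Swanson's analysis; the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Each set holds the concepts tagged in one hypothetical document.
DOCUMENTS = [
    {"fish oil", "blood viscosity"},
    {"fish oil", "platelet aggregation"},
    {"blood viscosity", "Raynaud's disease"},
    {"platelet aggregation", "Raynaud's disease"},
    {"fish oil", "omega-3"},
]


def cooccurrence(documents):
    """Link every pair of concepts that appear together in a document."""
    links = defaultdict(set)
    for concepts in documents:
        for a, b in combinations(concepts, 2):
            links[a].add(b)
            links[b].add(a)
    return links


def hidden_links(documents, start):
    """Find concepts never seen with `start` but reachable via a bridge.

    Returns a mapping from each candidate concept to the bridging
    concepts that connect it to `start`.
    """
    links = cooccurrence(documents)
    direct = links[start]
    candidates = defaultdict(set)
    for bridge in direct:
        for target in links[bridge]:
            if target != start and target not in direct:
                candidates[target].add(bridge)
    return dict(candidates)


if __name__ == "__main__":
    for target, bridges in hidden_links(DOCUMENTS, "fish oil").items():
        print(target, "via", sorted(bridges))
```

No toy document mentions both fish oil and Raynaud's disease, yet the two are connected through two bridging concepts, which is the sort of candidate a researcher might then investigate. A realistic system would also need to rank candidates, since most indirect connections are uninteresting.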

Gradual progress

Despite the great potential of semantic technologies, there is still a long way to go before they are widely adopted. It’s already more than 12 years since the semantic web came to widespread public attention following the publication of the landmark article on the subject by Tim Berners-Lee, James Hendler and Ora Lassila in Scientific American.

Since that time, semantic technologies have not only evolved, but also spread far beyond computer science, with notable examples from across the different sectors of society. The winner of the 2013 Semantic Web Challenge was the BBC World Service Archive Prototype, which continued the BBC’s notable public engagement with semantic technologies through the automatic addition of tags from Wikipedia to recordings from the extensive World Service archive. Wikipedia tags were chosen due to their unambiguous nature, and an interface is provided to allow users to vote on the terms selected, or select additional Wikipedia terms. A different example of the semantic web in action is the second-placed project, Constitute, which created an ontology of 600 concepts that were associated with more than 700 constitutions from around the world so that they could be explored and compared.

The role of the information professional

There is still a long way to go, however, before undiscovered public knowledge can easily be found, and it is important that library and information professionals take an active role in the adoption of semantic technologies. The most obvious role is in the use of semantic technologies in the publishing of a library’s own data. This may involve following the lead of institutions such as the British Library, which made the British National Bibliography available as linked data, and the British Museum, which made its catalogue available as linked data, or merely the encoding of content on a website according to the schemas from Schema.org so that information about the library itself is more easily discoverable.

There is also much to be done by the library and information professional beyond enhancing a library’s own information. There is a role for the professional in helping users to identify ontologies for enriching their own content. Appropriate ontologies are often hard to identify and there may be competing ontologies with different advantages and disadvantages.

There is also a role for the library and information professional in the development of new ontologies. While some areas may have multiple competing ontologies (for example, for representing people or events), more specialised ontologies are often overlooked.

Finally, there is an important advocacy role for the library and information professional – making sure that researchers have the right to access the information they need, in the way that they need it. While most documents won’t have been semantically enriched, natural language processing and automatic entity extraction methodologies nonetheless enable a lower standard of semantic content to be created automatically – but only if researchers have the right to download and read such documents automatically, which is something that publishers’ terms and conditions often prohibit.
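To give a sense of what the simplest form of automatic entity extraction involves, the sketch below looks up known names from a small list (a gazetteer) in a piece of text. The gazetteer, its ex: identifiers and the function name are hypothetical, and real natural language processing tools go well beyond exact matching of this kind.

```python
import re

# Hypothetical gazetteer: lower-cased surface form -> entity identifier.
GAZETTEER = {
    "british library": "ex:British_Library",
    "british museum": "ex:British_Museum",
    "tim berners-lee": "ex:Tim_Berners-Lee",
}


def extract_entities(text):
    """Return (offset, matched text, identifier) for each known name found."""
    found = []
    lowered = text.lower()
    for surface, identifier in GAZETTEER.items():
        pattern = r"\b" + re.escape(surface) + r"\b"
        for match in re.finditer(pattern, lowered):
            original = text[match.start():match.end()]
            found.append((match.start(), original, identifier))
    return sorted(found)


if __name__ == "__main__":
    sample = "The British Library and the British Museum publish linked data."
    for offset, name, identifier in extract_entities(sample):
        print(offset, name, identifier)
```

Even an approach this crude only works if a program is permitted to fetch and read the text in the first place, which is why access rights matter.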

Conclusions

Semantic technologies should increasingly be at the heart of the library and information professional’s work, and of how they think about the modern information landscape. Like many new areas of technology, they are surrounded by a seemingly impenetrable vocabulary of acronyms and highly specialised terms that may be daunting to the library and information professional. However, at a time when the profession is subject to increasing financial pressures, semantic enrichment offers an area of growth and recognisable value beyond the profession.

The library and information profession has a long history of adopting new technologies in its quest to provide people with access to the information they need. Semantic technologies are merely the latest step.