Four industry figures discuss the latest developments around semantic enrichment with Tim Gillett
Please explain what you understand by ‘semantic enrichment’
Babis Marmanis, executive vice president and CTO at Copyright Clearance Center (CCC): Word representation is central to natural language processing. The default approach of representing words as discrete and distinct symbols is insufficient for many tasks and suffers from poor generalisation. Semantic enrichment is the enhancement of content with information about its meaning. This process augments the amount of information carried by specific words or compositions of words, thus enhancing the content’s value by making it easily discoverable and relatable to other data sets or information assets.
Giuliano Maciocci, head of product, eLife: The addition of meaning and context to data.
Donald Samulack, president, U.S. operations, Editage: In research communication, semantic enrichment can, on the one hand, mean the design or packaging of content to increase human or machine comprehension, but it can also mean the augmentation of content with, its association with, or the embedding of additional content in a format other than text, such as an infographic, video explanation, or other form of data visualisation. Semantic enrichment strategically brings focus to the main message of the content, makes specific content stand out above the rest of the narrative, and further enhances the discoverability of the content – through either ‘human’ or ‘machine’ processing of information.
Jordan White, director of content operations, ProQuest: For Alexander Street-branded databases within ProQuest’s portfolio, we have a concept called ‘semantic indexing’, in use since 2000, wherein we take disparate pieces of content and add metadata and other functionality, revealing the utility of the content in context. We begin with a scholar and a discipline in mind – a musicologist studying Bach, or a human rights scholar studying Rwanda. We ask: ‘What would they want to know and how would they want to approach the question?’ This leads us to certain themes and metadata concepts – specific instruments or musical keys; historical origins, legal concepts and social trends; ‘metaworks’ such as a play or musical composition that provide a semantic link between various instantiations of that work, sharing certain metadata. All content included in the database is then organised and indexed under these frameworks. Our systems, our controlled vocabularies, and our presentation of our products are all designed to make content discoverable in this light, attuned and adjusted to the expectations and practice of a specific scholarly discipline.
It’s more than 10 years since we reported that RSC Publishing became the first primary research publisher to publish semantically-enriched articles. What do you see as the key developments, industry-wide, since then?
Marmanis, CCC: Advances in natural language processing (NLP) through machine learning (ML) are the key development that has had, and will continue to have, a significant impact on both the production and consumption of semantically-enriched articles.
For example, in the life sciences, text mining has become an important tool for researchers, and its most fundamental task is the recognition of biomedical named entities, such as proteins, species, chemicals, genes, diseases, and so on. The ability to automatically develop effective word embeddings for biomedical literature has substantially enhanced text mining in that area.
Word embeddings capture semantic similarities between words that are not visible from their lexicographic form; for instance, the words ‘enables’ and ‘allows’ are spelled very differently, yet their meanings are related.
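To make the idea concrete, here is a minimal sketch of how similarity between word embeddings is commonly measured, using cosine similarity. The four-dimensional vectors below are invented purely for illustration and are not taken from any trained model or from CCC’s systems; real embeddings are learned from a corpus and usually have hundreds of dimensions.

```python
import math

# Toy vectors, made up for this example only (an assumption, not real data).
EMBEDDINGS = {
    "enables": [0.81, 0.10, 0.52, 0.05],
    "allows":  [0.78, 0.15, 0.55, 0.02],
    "protein": [0.05, 0.92, 0.08, 0.37],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Words with related meanings should point in similar directions.
related = cosine_similarity(EMBEDDINGS["enables"], EMBEDDINGS["allows"])
unrelated = cosine_similarity(EMBEDDINGS["enables"], EMBEDDINGS["protein"])

print(f"enables vs allows:  {related:.3f}")
print(f"enables vs protein: {unrelated:.3f}")
```

With these toy values, the pair ‘enables’ and ‘allows’ scores much closer to 1 than the pair ‘enables’ and ‘protein’, even though no comparison of the spelling is involved.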
Maciocci, eLife: The adoption of structured, semantically rich formats, such as JATS XML, for the digital publishing of academic articles has since become much more prevalent, making more of the available literature – particularly in the open access space – easier to search, mine, cite and cross-link. Additionally, an increasing number of services are now using artificial intelligence (AI) and data-mining techniques to add further layers of context to the published literature, going well beyond the semantic information contained at the individual paper level and uncovering trends and connections across a much broader corpus. Integration with persistent identifiers (PIDs), including ORCIDs for researchers and Research Resource Identifiers (RRIDs) for resources, along with the use of DOIs and initiatives such as FAIRsharing and the Joint Declaration of Data Citation Principles, is also playing an important role in helping to trace the relationships between papers across publishers.
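As a rough illustration of why structured markup with persistent identifiers is easier to mine, the sketch below pulls a DOI, a title and author ORCIDs out of a small JATS-style fragment using only Python’s standard library. The fragment, the DOI, the ORCID and the function name `extract_identifiers` are all hypothetical placeholders written for this example; it is not eLife’s XML or tooling, and real JATS files are considerably richer.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified JATS-style fragment (placeholder values).
JATS_FRAGMENT = """
<article>
  <front>
    <article-meta>
      <article-id pub-id-type="doi">10.0000/example.12345</article-id>
      <title-group>
        <article-title>An example article</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">https://orcid.org/0000-0000-0000-0000</contrib-id>
          <name><surname>Doe</surname><given-names>Jane</given-names></name>
        </contrib>
      </contrib-group>
    </article-meta>
  </front>
</article>
"""


def extract_identifiers(xml_text):
    """Collect the DOI, title and author ORCIDs from a JATS-style document."""
    root = ET.fromstring(xml_text)
    # The DOI is an <article-id> element whose pub-id-type attribute is "doi".
    doi = next(
        (el.text for el in root.iter("article-id")
         if el.get("pub-id-type") == "doi"),
        None,
    )
    title = root.findtext(".//article-title")
    # ORCIDs are <contrib-id> elements tagged with contrib-id-type="orcid".
    orcids = [
        el.text for el in root.iter("contrib-id")
        if el.get("contrib-id-type") == "orcid"
    ]
    return {"doi": doi, "title": title, "orcids": orcids}


info = extract_identifiers(JATS_FRAGMENT)
print(info)
```

Because the identifiers sit in predictable, labelled elements, the same few lines would work across any publisher’s files that follow the same tagging conventions, which is the kind of cross-publisher comparability the identifiers are meant to support.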
Samulack, Editage: There was first a movement toward adding more supplemental information in association with a research article. Then there was the recognition – through technology and the general lay-up and design of the average journal page – that it is the attention which is drawn to an article that makes research findings come to life. On a more sociological level, constraints of time and the vastness of the literature have forced readers to be more discerning in what they pay attention to. Brevity and discoverability, as well as the imagery surrounding content, are now becoming the trend. As such, the movement toward graphic and video representation of data (enhancing the visualisation and the understanding of complex concepts) is becoming more pronounced. Article and page design is being adjusted to accommodate the inclusion of infographics and video in association with, as well as embedded within, the article. While machine learning is likely the next phase – requiring a level of sophistication of meta-tagging and word semantic association – for now, the ‘stopping power’ of an article, or of a dataset within an article, is very much the success factor for human comprehension.
White, ProQuest: Challenges of scale and responses to them are clearly the most important theme of the last decade. On any given topic, there is simply too much information available in myriad formats and modalities for a person to digest with ease. Machine learning, algorithmic personalisation, the rise and normalisation of Wikipedia, Google’s ‘Featured Snippets,’ and Netflix burying search in favour of browsable ‘taste clusters’ are all attempts to mediate discovery at scale and account for this. Semantic enrichment leans on these types of algorithmic solutions to attend to the scale problem. In exchange, these solutions lean on semantic enrichment to appropriately isolate content into trusted categories, giving machine learning a stake in the ground from which it can extrapolate and identify similar material.
In what ways do these developments benefit researchers, librarians, and the wider publishing community?
Marmanis, CCC: These developments are critical to properly harness the collective research community’s scholarly output. There is no manual process that can effectively synthesise, draw inferences from, and take action on the millions of research articles published annually, nor their accompanying data sets. Essentially, these trends are driving us toward paradigm shifts across research domains – from biomedical applications to chemical engineering – making current processes more efficient, but also introducing new workflows and opportunities.
Maciocci, eLife: They allow for faster discovery, easier and more precise searches for relevant literature, and the ability to reuse published science in novel and unexpected ways. However, in order for semantically enriched papers to really be useful, publishers might want to consider providing the right application programming interfaces (APIs) to developers to ensure the papers are accessible beyond the publisher’s site – and not just by the dominant search engines and service providers, but by researchers and developers alike. At eLife, we’ve made sure our APIs can be used easily to access any part of the article, allowing the community to come up with novel uses for what we publish, from desktop library applications such as ScienceFair to the ability to read eLife digests using Amazon’s voice services. As more publishers begin to do this, we can expect the digital publishing of academic literature to really start moving beyond its print origins and toward a more useful, accessible and reusable form.
Samulack, Editage: In the case of researchers and librarians, personal time is often the limiting factor. For the publishing community at large, wide distribution and readership is the goal. For an overlooked segment in the initial question, the public-at-large (who are on the outside, looking in), ‘open’ accessibility and discoverability are the needs. For all population segments, semantic enrichment (the visualisation of data through enriched content like infographics and video) addresses the ‘stopping power’ of an article or element of data. While this is a pure marketing concept, it is the easiest way to articulate the value related to the human factor – making research findings stand out among the noise and vastness of the literature.
Beyond the stopping power of an article being linked to readability, the discoverability of an article through machine-recognition is on equal footing as a benefit. In essence, beyond the complexity of article dissemination challenges, both for the human factor (attention) and for the machine factor (discoverability), the concept of ‘stopping power’ is equally applicable.
White, ProQuest: Much of scholarly inquiry is centred around the people or organisations producing primary/secondary material. Curation of content around these entities – themselves semantic concepts – makes this kind of inquiry possible. Implemented well, semantic enrichment provides meaningful answers faster, allowing scholars to build on or discard hypotheses more quickly. Researchers and librarians don’t need to study a system-specific language or workflow to conduct their inquiries; rather, they can navigate using natural language or academic vernacular that makes sense within their discipline. The best solutions balance an emphasis on the quickest, most likely answer, while still creating opportunity for the unexpected ‘eureka’ moments that move scholarship forward. For the publishing community, these developments give us more information that allows us to find unique ways to be more useful and more relevant in supporting scholarship.
How is your organisation currently enhancing the semantic enrichment process?
Marmanis, CCC: We embed semantic enrichment of content in various product offerings. We partner with state-of-the-art tools and semantic lexicon providers and we also create domain-specific word embeddings that help us to further enhance our content.
Melissa Harrison, head of production operations at eLife: eLife’s content represents a small fraction of the literature and, although we put a lot of effort into the quality of our JATS XML, working alone will not help with the aspirations we have for science. We are therefore active in the JATS community and have fostered JATS4R as a central organisation to bring together both open access and subscription-based publishers to agree on basic principles to make cross-publisher XML more comparable and mineable. We also subscribe to the Joint Declaration of Data Citation Principles and have been active in promoting the implementation of these principles within cross-publisher groups. We support the use of PIDs and also new initiatives such as the CRediT taxonomy. Being involved in cross-publisher working groups and pulling together the editorial and technical representatives and infrastructure to realise these initiatives to their full potential is a challenge that all publishers are facing.
Samulack, Editage: At Editage, while we’ve been heavily focused on the readability of an article and how it is written and structured, we’ve also been focused on strategies of enhancement and enrichment of content within the research article. We’re bringing awareness to the publishing industry that the use of video and infographics highlights key findings and adds stopping power to the attractiveness, curiosity, and discoverability (through DOI tags) of the article. Successful, high-profile collaborations with Cell Press, Brill, the Journal of Bone and Joint Surgery, the journal Neurology, and others, have been a testament to the success of semantic enrichment in both illustrative and written forms. In addition, we have been developing an automated document assessment tool, called Ada, to help cut down on the time and cost associated with desk rejections. Ada is a machine-driven document screening tool that leverages semantic rule-based decision-making algorithms to assess the readability of a scientific manuscript. Whether at point of manuscript receipt, or at any point in the editorial process, Ada can be taught to offer standardised and objective semantic decision-making surrounding the worthiness of moving forward with a manuscript on the grounds of a readability grade, compliance with certain pre-set ethical criteria such as plagiarism, standard guideline disclosures, and/or an internal checklist of language inclusions.
White, ProQuest: Not all content is text-based. Not all content can be ‘owned’ by one publisher. Relevance of content may differ among disparate user groups. Content interactions should factor in all of what was produced and who produced it, but also its importance to specific users consuming it. We work to find innovative ways to break down those silos – across modalities, publishers, publishing models and user groups – to ensure the answers we provide are complete, correct, and relevant. This takes the challenge of scale and adds new wrinkles, because less is under our immediate control. Matching a journal article with a video, an archival finding aid, an audio recording and a set of statistics, some of which we possess in our databases and some not – and ensuring all of it is immediately relevant to you is the goal. Allowing information to exist as it is without conforming to our needs as a publisher, while layering on top our semantic enrichment in support of scholarship is a major focus for us.
Looking to the future of scholarly communications, can you predict any major developments in the field of semantic enrichment?
Marmanis, CCC: The primary source of semantic enrichment is well-curated semantic lexicons.
A major development in the field of semantic enrichment would be the ability to develop semantic lexicons specific to a domain of discourse on the basis of a representative corpus, with minimal human intervention and in a way that is highly accurate, cost-efficient, and complete.
Maciocci, eLife: Both semantic enrichment at the publisher side and the emergence of increasingly ‘smart’ AI, in combination with a broader uptake of open access, may lead to an increase in meta-discovery, where large-scale data mining of a semantically enriched corpus can generate new insights by uncovering previously unknown trends and correlations across seemingly unrelated papers. AI is also likely to drive an increase in the automatic inference of semantic context, reducing the amount of labour needed during the production and publication of online manuscripts by automating parts of the semantic tagging process.
Samulack, Editage: Because of the movement toward the future of machine reading, machine learning, and various other associated aspects of artificial intelligence, the use of meta-tagging and word semantic association nomenclature will become very important. But also, because of the focus today surrounding the human factor of article enrichment (infographics, video, and other forms of imagery), the ability to read this content and to fingerprint large data sets will be a challenge that needs to be addressed. Imagery will need to be accompanied by embedded plain language descriptions (like alt-text in support of machine readers for the visually impaired), and large datasets will need to have semantic descriptors to clarify parameters of context and scope that a machine can understand.
White, ProQuest: The knowledge graphs produced by human- and machine-based entity extraction have already started to supplant the content from which these entities were extracted as a source of answers to questions. Further evolutions toward open access publishing and the accompanying scale will accelerate this change. Traditional ‘relevance ranked search’ as a method of uncovering appropriate research materials will become an anachronism and semantic enrichment solutions will replace it. Users will expect their information systems to present trusted seminal works at the top of search results, vetted by the evidence of usage, in a context that explains how any answer was derived from the knowledge graph and cited content. The best of these solutions will be neither strictly AI nor strictly human-facilitated, but rather will strike a unique and interesting balance between the two. This, in turn, will expand the types of scholarly communications and inquiries available to researchers – and who knows what we will learn then?