Back to the future

30 June 2021

Phil Gooch looks back at the history of semantic enrichment, and how it will be used going forward

Semantic enrichment is the process of adding machine-readable information to unstructured content, in order to add meaning and utility^[1]. This information is often used to provide metadata for documents, or can be used to structure documents automatically – for example, to add structural tags that can be used for formatting, linking and information retrieval. It is closely related to the concept of hypertext, which has a surprisingly long history dating back to the 1960s^[2].

The scholarly publishing industry has often been at the forefront of innovation in the use of semantic technology. Back in 1984, Oxford University Press, working with Tim Bray and a team from the University of Waterloo, pioneered semantically ‘tagging’ content to produce the second edition of the Oxford English Dictionary in both print and interactive digital form^[3]. The result eventually led to the creation of XML by Bray, and the formation of OpenText, now the largest software company in Canada^[4].

In the mid-90s, at Routledge, our internal production editorial team semantically tagged content during copyediting. Combined with templates and macros, this allowed us to typeset monographs with one click, directly in Microsoft Word, to a standard indistinguishable from the externally typeset version. For more complex works, the tagged content was imported into professional typesetting software. Over time, we developed ways to automatically tag content using a complex rules-based system. Building these technical skills in house was very rewarding for the team, and led to further projects that combined in-house expertise with vendor support, such as the Routledge Encyclopedia of Philosophy. The enriched content powered the print, digital and online versions, while allowing spin-off products to be easily created from that content.

From the 2000s, there was a trend toward outsourcing most editorial and production work, and so in-house technical expertise diminished somewhat. But with the rise of artificial intelligence (AI) and machine learning, it seems we might be returning to the earlier, more collaborative model. As scholarly publishing moves ‘upstream’, with preliminary research being published – and therefore processed – at a much earlier stage, publishers are rebuilding internal AI skills to enhance their preprint offerings^[5].

As publishers develop their data science and AI capability, it reminds me of those early, exciting days when it seemed anything was possible. But now we have language models, NLP libraries and APIs, rather than hand-crafted Word macros. With AI, semantic enrichment can provide much more than metadata extraction. It can be used to identify concepts and their relations to help recommend similar articles; provide a first pass for peer review; and even transform manuscripts into graphical abstracts or plain-language synopses.

I recently spoke to Helen King, head of transformation at SAGE Publishing, about how they are using semantic enrichment to help with manuscript screening and review.

‘Screening manuscripts to ensure that they meet a journal’s guidelines is a time-consuming manual process for many editorial teams,’ said Helen. ‘Some of these screening checks relate to the scope of the journal and are relatively straightforward, for example an author might have submitted a case report to a journal that does not publish case reports or a pre-clinical study to a journal that only publishes clinical studies. Other checks are more technical, for example, does the manuscript include a funding statement and competing interest statement? Have all the figures and tables listed been included? Have the statistics been reported correctly? Other checks relate to the content itself, for example is the manuscript about a physical impossibility such as a perpetual motion machine or an unproven treatment such as homeopathy that the journal wouldn’t publish?’

The first step is to turn author manuscripts, which may arrive in many formats (Word, PDF, LaTeX), into a unified, machine-readable representation. SAGE use the Scholarcy API for this first step, to which they apply their internally developed AI checks.
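To make the idea concrete, here is a minimal Python sketch of how some of the technical checks Helen describes might run once a manuscript is in a unified form. The dictionary layout, field names and function names are all assumptions made for illustration; they are not the Scholarcy API's output format, nor SAGE's actual checks.

```python
import re

# Hypothetical unified representation of a manuscript. The field names are
# assumptions for illustration, not the output format of any real API.
manuscript = {
    "sections": {
        "Introduction": "Prior work is summarised in Figure 1 and Table 1.",
        "Results": "Outcomes are shown in Figure 2.",
        "Funding": "This work was supported by an example grant.",
    },
    "figures": ["Figure 1"],
    "tables": ["Table 1"],
}


def has_section(ms, keywords):
    """Return True if any section heading contains one of the keywords."""
    headings = [heading.lower() for heading in ms["sections"]]
    return any(keyword in heading for heading in headings for keyword in keywords)


def missing_items(ms, kind):
    """Return labels cited in the text (e.g. 'Figure 2') that are absent from ms[kind]."""
    label = "Figure" if kind == "figures" else "Table"
    text = " ".join(ms["sections"].values())
    cited = {f"{label} {number}" for number in re.findall(rf"\b{label}\s+(\d+)", text)}
    return sorted(cited - set(ms[kind]))


def screen(ms):
    """Run a handful of simple screening checks and collect the results."""
    return {
        "funding_statement": has_section(ms, ["funding"]),
        "competing_interests": has_section(ms, ["competing interest", "conflict of interest"]),
        "missing_figures": missing_items(ms, "figures"),
        "missing_tables": missing_items(ms, "tables"),
    }


report = screen(manuscript)
```

In this toy example the report would note that a funding statement is present, that no competing interest statement was found, and that Figure 2 is cited but not supplied. Real checks would need to cope with far more variation in headings and citation styles than these keyword and pattern matches allow.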

Helen added: ‘Recently, our screening has expanded to include checks which aim to identify manuscripts that may have originated at a paper mill or have been written using an automated paper generator such as SCIgen^[6, 7]. These checks can be as simple as spotting authors using an email address from a free email service rather than one from an academic institution to much more complex checks analysing the structure and content of potentially problematic papers.’
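The simplest of these checks, spotting a free email address, might look something like the following Python sketch. The domain list, function names and author records are illustrative assumptions rather than SAGE's implementation, and a match is only a prompt for closer human review, not evidence of misconduct.

```python
# Illustrative, deliberately short list of free email providers (an assumption;
# a production check would need a maintained and much longer list).
FREE_EMAIL_DOMAINS = {"gmail.com", "yahoo.com", "hotmail.com", "outlook.com"}


def uses_free_email(address):
    """Return True if the address belongs to one of the listed free providers."""
    domain = address.rsplit("@", 1)[-1].strip().lower()
    return domain in FREE_EMAIL_DOMAINS


def flag_authors(authors):
    """Return the names of authors whose email address is from a free provider."""
    return [author["name"] for author in authors if uses_free_email(author["email"])]


# Hypothetical author records for illustration only.
authors = [
    {"name": "A. Author", "email": "a.author@gmail.com"},
    {"name": "B. Author", "email": "b.author@example.ac.uk"},
]

flagged = flag_authors(authors)
```

Here only the first author would be flagged. Many legitimate researchers use free email services, so a signal like this is best combined with the more complex structural and content checks Helen mentions.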

As checklists to help editors grow and become more complex^[8], publishers are seeking ways to semantically identify elements within a manuscript so that automated checks can be built around those elements, reducing the amount of manual checking editorial teams must perform.

Phil Gooch is co-founder of Scholarcy, an EdTech startup that puts AI into the hands of researchers and students worldwide

References

[1] Clarke M and Harley P (2014). How Smart Is Your Content? Using Semantic Enrichment to Improve Your User Experience and Your Bottom Line. Science Editor, Spring 2014, Vol 37, No 2. Available from: http://www.councilscienceeditors.org/wp-content/uploads/v37n2p40-44.pdf
[2] Nielsen J (1995). The History of Hypertext. Available from: https://www.nngroup.com/articles/hypertext-history/
[3] Weiner E (2019). Digitizing the OED: the making of the Second Edition. Available from: https://public.oed.com/blog/digitizing-the-oed-the-making-of-the-second-edition/
[4] OpenText. Wikipedia. Available from: https://en.wikipedia.org/wiki/OpenText
[5] Alves T (2021). JATS-Con and STM Spring Conference. Available from: https://tonyhopedale.com/blog/f/jats-con-and-stm-spring-conference
[6] Else H and Van Noorden R (2021). The fight against fake-paper factories that churn out sham science. Nature 591, 516–519. Available from: https://www.nature.com/articles/d41586-021-00733-5
[7] Labbé C, Labbé D and Portet F (2016). Detection of Computer-Generated Papers in Scientific Literature. In: Degli Esposti M, Altmann E, Pachet F (eds) Creativity and Universality in Language. Lecture Notes in Morphogenesis. Springer, Cham. https://doi.org/10.1007/978-3-319-24403-7_8
[8] Seifert R (2021). How Naunyn-Schmiedeberg’s Archives of Pharmacology deals with fraudulent papers from paper mills. Naunyn-Schmiedeberg’s Arch Pharmacol 394, 431–436. https://doi.org/10.1007/s00210-021-02056-8
