Michael Upshall discusses the impact of machine learning on scholarly communications
The last 30 or so years in scholarly communications have been marked by a steady trend towards digital content creation and delivery; particularly for journals, the industry has now moved from print to online. However, online delivery may mask underlying print-based attitudes, as is evidenced by the continuing preference academics have for reading PDFs rather than any other format.
In this regard, the advent of machine learning, because it involves a semantic engagement with the content, may in the end be more far-reaching, in fact genuinely disruptive, as it looks to challenge existing business processes. Here, for the first time, is a tool that can identify the meaning contained within articles.
Machine learning will have an impact throughout the publishing lifecycle, from discovery (the most obvious quick win) to authoring, classification and presentation. The publishers most likely to benefit from this new technology will be smaller, more innovative organisations that are less wedded to current orthodoxies of how academic publishing should take place, although larger publishers will be able to demonstrate the biggest cost savings and productivity benefits.
As with any industry, academic publishing has experienced waves of technical innovation during its existence: the switch from paper to online, the move to markup systems like SGML and then XML, and the rise of digital workflow systems. Most of these initiatives took place first with journals, with books following behind slowly, and usually in a more partial way.
Most commonly, the driver of change has been the need to reduce costs. The explosion of journal publishing in the second half of the 20th century compelled publishers to look to more efficient solutions to manage the workload. Initially, outsourcing was widely adopted – using offshore companies in India or the Philippines. This switch delivered an immediate bottom-line improvement in publishers’ margins. However, labour costs inevitably rise, and the need for further improvements in efficiency led to many experiments in implementing automated workflow.
One major enhancement delivered via XML, for example, was the switch to management by exception. Instead of manually checking every article, publishers found they could set up systems that alerted them only when something went wrong in the XML conversion. Workflow improvements such as these dramatically increased the efficiency of the publishing process, but did not affect the quality of research.
For many years, machine learning was simply another technical innovation with potential for change but little real impact. Like linked data, the technology was known but had not been widely adopted; as with linked data, it was thought to be complex and to require considerable IT involvement for large-scale implementation.
How today’s machine learning is different
One crucial development in machine learning has made the technology much more applicable.
Early, symbolist approaches to machine learning worked “top-down”: first model your universe, then apply that model to the content. This two-step process required extensive domain modelling before anything useful could be done. By contrast, the current generation of machine-learning tools (including UNSILO’s) is probabilistic: these tools use statistical methods to identify core concepts from the supplied content alone, without any prior taxonomy. In other words, they work from the bottom up. They train themselves, without human intervention, using a subject-based corpus of content.
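To make the bottom-up idea concrete, here is a minimal sketch in Python. It is not UNSILO's method, which the article does not describe; it uses TF-IDF, a standard statistical weighting, purely as an illustration. The function names and the three-sentence corpus are invented for the example. Note that no taxonomy or domain model is supplied: the scores come only from how words are distributed across the corpus.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def top_terms(corpus, doc_index, k=3):
    """Rank the words of one document by TF-IDF, using only the corpus itself.

    Illustrative stand-in for probabilistic concept extraction; names and
    scoring are assumptions, not a description of any real product.
    """
    docs = [tokenize(d) for d in corpus]
    n = len(docs)
    # Document frequency: in how many documents does each word appear?
    df = Counter()
    for words in docs:
        df.update(set(words))
    # Term frequency in the target document, weighted by rarity in the corpus.
    tf = Counter(docs[doc_index])
    scores = {t: count * math.log(n / df[t]) for t, count in tf.items()}
    # Highest score first; ties broken alphabetically for a stable result.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


corpus = [
    "the protein binds the receptor and the protein folds",
    "the galaxy rotates and the telescope observes the galaxy",
    "the enzyme and the receptor regulate the cell",
]
print(top_terms(corpus, 0, k=2))
```

Words that appear in every document, such as "the", score zero without anyone having to list them as stop words, which is the sense in which the approach needs no prior modelling.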
Using such tools need not result in an increase in the IT department headcount. They provide not only cost savings but also more fundamental improvements to the entire scholarly process. For example, UNSILO provides a simple tool that checks an article abstract and enables the author (or publisher) to identify key concepts from the text that were not included in the abstract. No training is required to use the software. It has always been prohibitively expensive for publishers to provide manual curation of abstracts, and this new tool enables publishers to genuinely add value to author content.
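The logic of such an abstract check can be sketched in a few lines of Python. This is a deliberately crude illustration, not the UNSILO tool: it treats any repeated, non-trivial word in the body as a "concept", and the function name, stop-word list and sample texts are all assumptions made for the example.

```python
import re
from collections import Counter

# A tiny, illustrative stop-word list; a real tool would use something richer.
STOPWORDS = {"the", "and", "of", "in", "on", "under", "with", "for", "that", "this"}


def missing_from_abstract(abstract, body, min_count=2):
    """Return words used repeatedly in the body that never occur in the abstract."""
    abstract_words = set(re.findall(r"[a-z]+", abstract.lower()))
    counts = Counter(
        w
        for w in re.findall(r"[a-z]+", body.lower())
        if w not in STOPWORDS and len(w) > 3
    )
    return sorted(
        w for w, c in counts.items() if c >= min_count and w not in abstract_words
    )


abstract = "We study protein folding in yeast."
body = (
    "Protein folding depends on chaperones. "
    "Chaperones assist folding under heat stress. "
    "Heat stress alters yeast growth."
)
print(missing_from_abstract(abstract, body))
```

Here the check would prompt the author to consider mentioning chaperones and heat stress in the abstract.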
Other successful machine-learning initiatives include the identification of suitable peer reviewers for academic article submissions. Finding an expert reviewer by hand is a lengthy and fraught process for non-specialists, yet a machine, by searching the research in the relevant field, can identify the most appropriate reviewer with far greater precision and speed than a human.
UNSILO’s flavour of machine learning differs fundamentally from social media initiatives. Software tools such as FigShare and Altmetric (and, fundamentally, Google) are popularity-based. While social media has a role to play in disseminating knowledge about research, it does not play the central part in research that content plays. Whatever the merits or demerits of a probabilistic approach, it is entirely derived from the content itself.
When UNSILO finds a related article, it is independent of the number of hits that article may have had – indeed, the article may never have been read or accessed before. For researchers, the content, not the popularity, is key.
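The difference between content-based and popularity-based ranking can be shown with a short Python sketch. It uses cosine similarity over word counts, a common textbook measure, as a stand-in; the article does not say what measure UNSILO uses. The article identifiers, texts and view counts are invented for the example.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Represent a text as a bag of lower-cased words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def most_related(query, articles):
    """Pick the article whose text is closest to the query; 'views' is never read."""
    q = vectorize(query)
    return max(articles, key=lambda key: cosine(q, vectorize(articles[key]["text"])))


# Hypothetical records: article A has never been accessed, B is very popular.
articles = {
    "A": {"text": "protein folding chaperones heat stress", "views": 0},
    "B": {"text": "galaxy rotation dark matter telescope", "views": 90000},
}
print(most_related("chaperones assist protein folding", articles))
```

The never-read article A is returned because the ranking looks only at the text, not at the view counts.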
Machine learning and taxonomies
Most publishers have, or work with, a taxonomy or taxonomies: either a public taxonomy, such as MeSH for medical content, or a proprietary, in-house classification. The probabilistic model used by some of the current machine-learning software tools works without any requirement for a prior taxonomy or ontology. How, then, are these two worlds – taxonomic and machine-based concept extraction – to be reconciled?
Probabilistic machine-learning tools such as UNSILO do not require a taxonomy, but they can match their automatically extracted concepts to a given taxonomy. They can even identify new terms that do not yet exist in a standard taxonomy. Using such tools, the cost of building and maintaining an in-house taxonomy is considerably reduced.
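A simplified Python sketch of this reconciliation follows. It assumes exact, case-insensitive label matching, which is far cruder than a production tool would use, and the taxonomy identifiers (T001, T002) and labels are made up for the example rather than taken from MeSH or any real scheme.

```python
def match_to_taxonomy(concepts, taxonomy):
    """Split extracted concepts into those found in the taxonomy and new candidates.

    taxonomy maps a term identifier to its list of accepted labels.
    Returns (matched, new_terms): matched maps concept -> identifier,
    new_terms lists concepts with no label in the taxonomy.
    """
    # Build a lookup from every lower-cased label to its term identifier.
    index = {
        label.lower(): term_id
        for term_id, labels in taxonomy.items()
        for label in labels
    }
    matched, new_terms = {}, []
    for concept in concepts:
        key = concept.lower().strip()
        if key in index:
            matched[concept] = index[key]
        else:
            new_terms.append(concept)
    return matched, new_terms


# Hypothetical in-house taxonomy and a list of machine-extracted concepts.
taxonomy = {
    "T001": ["Protein Folding"],
    "T002": ["Heat-Shock Response", "Heat Stress"],
}
concepts = ["protein folding", "heat stress", "phase separation"]
matched, new_terms = match_to_taxonomy(concepts, taxonomy)
print(matched)
print(new_terms)
```

The unmatched concepts are the interesting output for a taxonomy editor: they are candidates for new terms, surfaced without anyone reading the articles.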
Objections to the machine-learning approach
As happens with any technical innovation, publishers and researchers initially respond to machine learning with concern. The questions they ask include:
'This is a black box – I don’t understand how it works. How can we control it?' Publishers have an understandable fear of losing control, but it is perfectly possible to measure machine-learning output and to compare alternative tools. UNSILO has carried out studies that indicate high levels of engagement by researchers with the tools provided.
'What’s wrong with my current workflow?' The relentless drive towards lowering costs means that any publisher who does not innovate will become sidelined by more efficient rivals who can introduce more powerful features.
'I can’t justify the investment.' Publishers often instinctively believe that new technology may not have an immediate payback; they assume that machine learning tools only deliver relatedness features. On the contrary, at UNSILO we have identified many publishing and research workflows where an immediate and dramatic cost benefit can be shown compared to current manual processes.
'Everyone uses Google Scholar.' Google Scholar has indeed captured a high proportion of academic search activity, but without providing particularly useful tools to improve the process. UNSILO’s UX studies, which echo several other public surveys, suggest that Google Scholar is used for initial research, to find a named author or paper, simply because it is the largest collection of academic content. Yet, as we all know, the academic user journey is considerably more elaborate than that. From their starting article, researchers like to browse and to explore “what if?” avenues. A good machine-learning tool provides several new avenues of discovery; UNSILO provides links to tens if not hundreds of related articles and concepts for every article found.
Implementation and adoption
The most successful machine-learning innovations will be the ones that do not require extensive in-house technical teams to manage the technology. Instead, the machine-learning tools will be delivered in a user-friendly way that does not require technical expertise. The current generation of machine-learning tools offers intuitive interfaces that facilitate rapid adoption.
Of course, machine learning is not a universal panacea that will solve all the problems that researchers and publishers face. The recent book by Cathy O’Neil, Weapons of Math Destruction (2016), reminds us all that algorithms are designed by people, whose bias may be questionable. But this does not detract from the potential of machine learning. The best machine-learning tools will not claim to replace all human input; instead, they will make more effective use of human skills.
Typically, a probabilistic tool delivers results that are correct only to a certain percentage of reliability. For those problems where human input is essential, machine learning can therefore handle the routine cases automatically and leave humans free to focus on the challenging parts. With that capability of uniting the best features of humans and machines, it looks likely that machine learning will transform the publishing sector in the coming years.
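One common way to divide the work between machine and human is a confidence threshold, sketched below in Python. The 0.9 threshold, the function name and the sample predictions are assumptions chosen for illustration; the article does not specify how any particular tool makes this split.

```python
def triage(predictions, threshold=0.9):
    """Accept confident predictions automatically; queue the rest for a person.

    predictions is a list of (item, label, confidence) tuples, with
    confidence between 0 and 1. Returns (accepted, needs_review).
    """
    accepted, needs_review = [], []
    for item, label, confidence in predictions:
        if confidence >= threshold:
            accepted.append((item, label))
        else:
            needs_review.append((item, label))
    return accepted, needs_review


# Hypothetical classifier output for three submissions.
predictions = [
    ("article-1", "oncology", 0.97),
    ("article-2", "cardiology", 0.62),
    ("article-3", "oncology", 0.91),
]
accepted, needs_review = triage(predictions)
print(accepted)
print(needs_review)
```

Raising the threshold sends more items to human review and lowers the risk of automatic errors; where to set it is an editorial decision rather than a technical one.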
Michael Upshall is head of business development at UNSILO, a Danish machine-learning startup. He co-founded reference publisher Helicon Publishing and has worked with several other academic publishers (Pearson, CABI, IET, Cambridge University Press). He has been with UNSILO since March 2016.