Reading research papers in 'unprecedented depth'

Eduard Hovy, from the Language Technologies Institute of Carnegie Mellon University in Pennsylvania, is one of a team working on DARPA’s Big Mechanism project. Prof Hovy is working with Anton Yuryev, a consultant with Elsevier R&D solutions’ professional services team, using Elsevier’s natural language processing tools to develop automated ‘deep-reading’ technologies that support research into cancer drugs.

Can you explain what the main goals of the Big Mechanism project are and how it will impact future research?

The goal of the Big Mechanism initiative, in the words of DARPA, is to leapfrog state-of-the-art analytics by developing automated technologies to help explain the causes and effects that drive complicated systems such as diseases like cancer. The initiative has the potential to transform the models we use to research and develop cancer drugs, as well as boosting the development, optimisation and selection of more targeted treatments – all within the next few years.
 
Research teams are creating computerised systems that will read scientific and medical research papers in unprecedented depth and detail. Through this deep reading the systems can uncover and extract relevant information, then integrate those initial fragments of knowledge into existing computational models of cancer pathways, updating them accordingly. With these updated models the systems can produce hypotheses and predictions about mechanisms that researchers will in turn test.

How will creating automated ‘deep reading’ methods assist researchers?
 
Essentially, automated deep reading means the machines ploughing through articles can 'read' them more like scientists do, rather than skimming the surface. Machines can make judgments about statements and findings, then extract only the information that either supports or adds to existing knowledge. This filtering process makes it much easier for the team to go to the next step – namely, providing accurate input for those doing the modelling. In addition, the system becomes semantically 'smarter' with each iteration, ultimately benefitting everyone who uses it, whether they are current or future industry, academic or government partners.

How does natural language processing assist in the process of ‘deep reading’?
 
Project teams are made up of individuals with widely different areas of expertise, from biology and chemistry to informatics and visualisation. Communication is therefore critical. Different disciplines often have their own language to describe the same phenomena, meaning an inability to find a common tongue can block a project’s progress. Natural language processing software is a vital tool in this regard. By standardising names and learning which variations have the same meaning, the software supports our understanding of both terminology and experimental methods. Our colleagues from Elsevier provide invaluable biomedical domain expertise in this regard. We can then identify relevant information that could easily stay hidden in the literature, before expressing it in terms that every member of the team understands. This in turn allows deep reading to be applied to the literature and other relevant input.

In addition to applying computational power to analysing and understanding literature, how does the human element of a project team support the aims (e.g. domain knowledge, understanding of language)?
 
Domain knowledge is a crucial component of any project team. Domain experts can wade through the tangled thicket of information and explain that when one person says X, and another person says Y, they both actually mean Z. Similarly, when one person uses an unknown term, experts can explain what they are really referring to. With the help of Elsevier experts we can easily disambiguate these situations.

What are the main challenges that the project is trying to overcome (e.g. how research abstracts are written and indexed, how published articles may be incomplete)?
 
The first challenge for our team – one of 12 funded by DARPA – is making sense of a vast amount of data from diverse sources. This means doing much more than just scanning the published literature. Published articles are, for the most part, incomplete, so limiting ourselves to that level of input would leave large information gaps.
 
In particular, there are two problems that can interfere with any conclusions we might draw, and that are common to many other projects. First, the 'methods' sections of many articles don’t show all the steps the authors took to reach their conclusions. With only part of the story, we can’t tell if a claim or result is valid or not. Quite simply, we can’t take results sections or author interpretations of their findings at face value. Second, statements featured in papers may be contradictory. Oversimplification or incorrect assertions might not be picked up during the editing and publication process. We saw one particularly egregious example of this: a recently published article’s introduction section claimed a particular protein could transcriptionally activate certain genes, yet cited a study stating that the same protein represses gene transcription.
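The kind of contradiction described above can be pictured as two extracted claims that share an agent and a target but assert opposite effects. The following is a minimal sketch of that check, not the project's actual method: the `Claim` structure, the effect vocabulary and the placeholder names `ProteinX`, `GeneA` and `GeneB` are all assumptions made for illustration, since the interview does not name the protein or describe the system's internals.

```python
from collections import defaultdict
from typing import NamedTuple


class Claim(NamedTuple):
    """One extracted statement (hypothetical structure for illustration)."""
    agent: str    # e.g. a protein
    effect: str   # "activates" or "represses"
    target: str   # e.g. a gene
    source: str   # where the statement was found


# Pairs of effects treated as mutually contradictory (assumed vocabulary).
OPPOSITES = {"activates": "represses", "represses": "activates"}


def find_contradictions(claims):
    """Return pairs of claims about the same agent and target with opposite effects."""
    by_pair = defaultdict(list)
    for claim in claims:
        by_pair[(claim.agent, claim.target)].append(claim)

    conflicts = []
    for group in by_pair.values():
        for i, first in enumerate(group):
            for second in group[i + 1:]:
                if OPPOSITES.get(first.effect) == second.effect:
                    conflicts.append((first, second))
    return conflicts


# Placeholder data mirroring the example in the text: an introduction
# claims activation while the study it cites reports repression.
claims = [
    Claim("ProteinX", "activates", "GeneA", "article introduction"),
    Claim("ProteinX", "represses", "GeneA", "cited study"),
    Claim("ProteinX", "activates", "GeneB", "article introduction"),
]
conflicts = find_contradictions(claims)
for first, second in conflicts:
    print(f"{first.agent} / {first.target}: {first.source} says {first.effect}, "
          f"{second.source} says {second.effect}")
```

A check like this only flags candidate conflicts; deciding which statement is correct would still fall to a domain expert or a follow-up experiment.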

Why did CMU choose to work with Elsevier on the Big Mechanism project?
 
CMU recognised that Elsevier could help inform our own decision making, thanks to its technologies that mine the full text of both literature-based and experimental evidence, as well as relevant clinical data – all of which could be a potential source of useful information. Elsevier also provided specialists with a thorough knowledge of biological terminology and language, as well as access to both a massive library of papers and the capacity to handle new information.

So far, what progress has been made towards the project’s ultimate goals?
 
We began the project approximately eight months ago. DARPA initially requested the development of a 'use case': it starts with text input and analysis that suggests new ways to examine as-yet unresolved issues, leads in turn to a bench scientist doing real experiments with the output, and ultimately ends with that scientist returning the results of these experiments to the system to inform future investigations. We have resolved a lot of organisational issues and begun to assemble the software pipeline, integrating Elsevier's NLP engine with other programs that we are developing specifically for this project.

Our priority for the next 18 months is mining all documents that mention anything related to proteins of the KRAS gene, mutations in which underlie a significant proportion of colon, lung, head and neck cancers. We will extract all relevant information into a central database; identify any inconsistencies and gaps; then ask the team’s biologist to perform specific experiments that could help resolve the inconsistencies and close the gaps. In the long term, the hope is that once the process for developing better models of KRAS-driven cancers has been sharpened, it can be applied not just to different types of cancers, but also to other complex diseases. This in turn should yield a better understanding of specific disease processes and accelerate the journey of effective treatments from bench to bedside.

What are the problems that researchers currently face specifically related to KRAS cancers (e.g. terminology used, various proteins referenced in the literature, synonyms)?
 
One significant problem is that different laboratories often use different naming standards when referring to the same proteins or processes. For example, in our current project, the gene of interest could be called KRAS, KRAS2 or RASK2. Its protein in turn might be referred to as GTPase Kras; K-Ras 2; Ki-Ras; c-K-ras; or c-Ki-ras. Furthermore, KRAS interacts with about 150 other proteins in the human genome, each of which has multiple synonyms of its own. This could easily result in hundreds, if not thousands, of different ways of describing the same basic information.
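As a rough illustration of what standardising these names involves, the sketch below maps each variant quoted above to a single canonical label. It is a simplified assumption rather than a description of Elsevier's tools: the function names are hypothetical, the gene and its protein are deliberately collapsed into one label, and a real system would need a curated synonym resource and context to separate genes from proteins.

```python
import re

# Variants quoted in the interview; the canonical label "KRAS" is an
# illustrative choice that treats the gene and its protein as one entity.
SYNONYMS = {
    "KRAS": [
        "KRAS", "KRAS2", "RASK2",               # gene names
        "GTPase Kras", "K-Ras 2", "Ki-Ras",     # protein names
        "c-K-ras", "c-Ki-ras",
    ],
}


def _key(name):
    """Lowercase and drop spaces and hyphens so spelling variants collapse."""
    return re.sub(r"[\s\-]+", "", name).lower()


def build_lookup(synonyms):
    """Build a dictionary from normalised variant to canonical label."""
    lookup = {}
    for canonical, variants in synonyms.items():
        for variant in variants:
            lookup[_key(variant)] = canonical
    return lookup


LOOKUP = build_lookup(SYNONYMS)


def normalise(name):
    """Return the canonical label for a name, or None if it is unknown."""
    return LOOKUP.get(_key(name))


for mention in ["K-Ras 2", "c-ki-ras", "RASK2", "TP53"]:
    print(mention, "->", normalise(mention))
```

Stripping hyphens and spaces makes "K-Ras 2" and "KRAS2" collide, which is harmless here because both map to the same label, but it shows why naive string matching becomes risky once 150 interacting proteins and their synonyms are added.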

What implications will the eventual findings of the research have for personalised therapies?
 
The text mining and natural language processing tools we are using in this project can ultimately help build mechanistic models for treating cancer in practice, allowing physicians to more easily perform precision medicine through assessing a patient’s individual cancer profile and suggesting the most effective treatment. We expect that models such as these will eventually assist decision making in molecular tumour boards – meetings of teams that collaborate on treatments for patients whose tumours have been analysed using genomic diagnostic tests.