ANALYSIS & OPINION

Extracting more information from scientific literature

21 July 2014



Jacqui Hodgkinson considers the merits and limitations of different approaches to extracting life-science information from published research

How can life science researchers stay on top of the constantly growing body of medical literature that is potentially relevant to their work? Reading through the more than one million such articles published annually is clearly not an option. That leaves two primary strategies for sifting through the burgeoning literature and extracting meaningful information: manual curation or automated curation.

For years manual curation of scientific publications has been the gold standard, with technology-based solutions ranking far behind in terms of accuracy and completeness. Today, that is no longer the case. Versatile, well-designed, and well-tested applications combined with significantly enhanced computational power are elevating automated curation to a more equivalent position. Proprietary text-mining technologies now rival manual curation for some types of search needs as a means of ensuring that researchers are not missing out on valuable information.  

Automated and manual curation of scientific papers each has its own strengths and weaknesses. Which solution works best for a given researcher, laboratory, or organisation varies depending on the application. For example, if the goal is to retrieve facts, terms, and relationships from the articles, automated systems outperform manual curation due to the speed and precision with which computers can match predefined lists of items (terminologies) across thousands or millions of documents. Accuracy in matching these terms can approach 98 per cent – far higher than human curation on average. Furthermore, when new terms are identified or added, the speed of computer processing allows the user to re-index all the existing content rapidly, extracting additional new information from older papers.
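The terminology matching described above can be sketched in a few lines of Python. This is a minimal illustration only: the gene names, synonyms, and the `match_terms` helper are hypothetical, and production systems use far larger vocabularies plus disambiguation logic that a simple dictionary lookup cannot capture.

```python
import re

# Hypothetical terminology: canonical name -> known synonyms
TERMINOLOGY = {
    "TP53": ["TP53", "p53", "tumor protein p53"],
    "EGFR": ["EGFR", "epidermal growth factor receptor"],
}

def build_pattern(terminology):
    """Compile one case-insensitive regex covering every synonym."""
    synonyms = [re.escape(s) for names in terminology.values() for s in names]
    # Longest synonyms first, so multi-word names win over their substrings
    synonyms.sort(key=len, reverse=True)
    return re.compile(r"\b(" + "|".join(synonyms) + r")\b", re.IGNORECASE)

def match_terms(text, terminology):
    """Return the canonical names of all terminology entries found in text."""
    pattern = build_pattern(terminology)
    found = set()
    for hit in pattern.findall(text):
        for canonical, names in terminology.items():
            if hit.lower() in (n.lower() for n in names):
                found.add(canonical)
    return found

doc = "Mutations in p53 alter epidermal growth factor receptor signalling."
print(sorted(match_terms(doc, TERMINOLOGY)))  # ['EGFR', 'TP53']
```

Because the terminology is just data, re-indexing old papers after a new synonym is added is simply a matter of re-running the same matcher over the existing document collection.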

Manual (human) curation, on the other hand, excels at inferring conclusions from disconnected facts, or from different sources; at identifying and translating complex concepts into clearly understood, human-readable forms; and at summarising large amounts of information into a distilled version. In addition, most automated text-mining systems struggle with information in tables or figures – something humans can handle easily.

Here are some key factors to consider when you are deciding on automated versus manual text mining:

1. Scope and Volume

The information needed to interpret experimental results, describe a cellular pathway or identify complex interactions in regulatory networks is often scattered throughout hundreds of articles and publications – more than a single researcher can review. That can leave you with a gnawing feeling of ‘What have I missed?’

The solution is to cast a wide net, reviewing as many articles from as many journals as possible across a specific field. But that’s simply not practical in most cases. One option is to read only the abstracts of papers. While these are considerably shorter, they do not contain all the important information found in the full text. Multiple studies comparing the full text and abstract from the same paper concluded that fewer than half of the key facts from the body of a paper are present in the abstract (see further information, below).

Manual curation is another option, typically relying on PhD-level researchers trained as curators. Although they can be very accurate, these experts can likely only read and annotate about 20-25 papers a day – fine, perhaps, for a highly targeted query across a small number of journals in a select area, but not adequate for comprehensive coverage of a topic – or of multiple disconnected topics.

Bias can also be a factor when people have to winnow down an extraordinary amount of data. Manual curation can unintentionally introduce bias by limiting the journals and articles reviewed, owing to resource restrictions and assumptions about journal or paper value. Curators may select only articles from high-profile journals in a given field, yet these days critical information about particular pathways or relationships can turn up just about anywhere. Automated systems, with their much higher throughput, can scan far larger quantities of documents – for example, all of the abstracts in Medline and millions of full-text articles (the limit here is mainly due to legal issues and licensing fees).

Both automated and manual systems can be constrained by documents not written in English; automated translation systems often introduce errors, while it can be difficult to find large numbers of highly trained human curators who also speak the language of the paper. Currently the vast majority of scientific papers are published in English, but there is a rapidly growing corpus of non-English journals that will need to be addressed in the future.

2. Accuracy

Some would argue that quality is more important than quantity – and that manual curation ensures accuracy. Overall, expert curators are about 90 per cent accurate (as measured by inter-curator agreement on annotation) for specific tasks. In the past five to seven years, the accuracy of specialised automated text-mining systems has improved dramatically. For instance, in-house research at Elsevier indicates text-mining solution accuracy of about 85-90 per cent overall. In addition, automated systems are exceptionally consistent in their annotation (~98 per cent) from paper to paper and journal to journal, unlike human curators, who show some natural variation both over time and between curators.

3. Speed

In many situations, speed is of the essence – for example, if a researcher needs specific information to meet a grant proposal deadline, or to reduce the time required to get a new drug to market. In this realm, there’s no contest: a trained manual curator can read and annotate at most 20-25 papers a day, so overall daily throughput is limited by the number of available trained curators. In contrast, text-mining technologies can process more than 20 million abstracts overnight, and 80,000 to 100,000 full-text articles per hour. Moreover, automated systems can be easily updated to include new terms and concepts in biology, simply by adding new ontologies. The speed of automated systems also allows them to reprocess entire document collections whenever new names, synonyms, or concepts of interest are added to the terminology – effectively gathering additional “new” information from previously “read” papers. This rarely, if ever, happens with manually curated systems, due to the resource limitations mentioned above.

The other aspect of speed relates to timeliness – the sooner a researcher has access to information, the sooner they can act on it. Curated data therefore need to be updated rapidly and frequently to reflect the latest literature. Although some manually curated systems add new data daily or weekly, here again the amount of new content added in such a short time is limited by the availability of expert curators. As a result, some manually curated systems only update their data monthly or quarterly, because it takes time to read a significant number of new papers. In contrast, automated systems can be updated as often as needed, and frequently add large quantities of new content on a weekly basis. In certain cases, access to pre-press information allows extracted data to be delivered to customers weeks or months ahead of actual publication.

4. Molecular Interactions

When trying to understand the underlying biology of a disease, process, or drug response, identifying relationships between entities – for example, protein-protein or drug-protein interactions – is at the heart of pathway analysis. To do this effectively, automated curation would have to mimic the human ability to identify these connections from text – and indeed it can. Natural language text-mining systems can identify meaningful relationships through a combination of specialised ontologies and linguistic rules, in much the same way that humans identify relationships between terms and concepts as they learn to read.
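A toy version of such pattern-based relationship extraction might look like the sketch below. The entity list, the handful of interaction verbs, and the `extract_relations` helper are all illustrative assumptions; real systems combine large ontologies with full linguistic parsing rather than a single regular expression.

```python
import re

# Hypothetical entity vocabulary and interaction verbs (illustrative only)
ENTITIES = {"MDM2", "TP53", "AKT1", "gefitinib", "EGFR"}
VERBS = ["inhibits", "activates", "binds", "phosphorylates"]

# Crude subject-verb-object pattern standing in for real linguistic rules
RELATION = re.compile(r"\b(\w+)\s+(" + "|".join(VERBS) + r")\s+(\w+)\b")

def extract_relations(sentence):
    """Return (subject, verb, object) triples where both ends are known entities."""
    triples = []
    for subj, verb, obj in RELATION.findall(sentence):
        if subj in ENTITIES and obj in ENTITIES:
            triples.append((subj, verb, obj))
    return triples

print(extract_relations("MDM2 inhibits TP53 and gefitinib binds EGFR."))
# [('MDM2', 'inhibits', 'TP53'), ('gefitinib', 'binds', 'EGFR')]
```

Filtering both ends of each triple against the entity vocabulary is what keeps incidental verb matches (for example, a verb whose subject is an ordinary English word) out of the results.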

As noted previously, although abstracts are short and many can be read quickly, they don’t contain all the key facts from the full-text publication. Typically, fewer than half of the terms cited in a paper, including molecular relationships, are mentioned in the abstract. Because automated systems scan full-text articles as well as abstracts, they can identify many more relevant relationships than could be found by scanning abstracts alone.

5. Personal Preference

Some researchers feel comfortable relying on a curator’s expertise to identify important information in the literature. Others want at least to be able to review the results and use their own expertise to decide whether a particular finding or relationship is relevant or credible. Automated systems do something most manually curated systems don’t: they show the sentence in the abstract or paper used to identify each relationship, so the researcher can review it personally and decide whether to include or exclude any reference. So rather than relying solely on someone else’s interpretation, the final decision about what information to take into account rests with each user.

6. Text Mining at Home

There are a number of commercial and open-source text-mining applications currently available. Before deciding to develop a “DIY” text-mining solution, researchers should be aware of several key factors that can significantly affect the success of their efforts:

Terminology quality: Automated systems are primarily reliant on terminologies – lists of specific terms and their related concepts – to extract matching information from source documents. With a few exceptions, anything not included in the terminology can’t be reliably extracted from the literature. But the creation of comprehensive biomedical terminologies requires both deep domain expertise in the area of study, and significant expertise in linguistics, pattern matching, and ontology development – not something for the faint of heart.  Fortunately there are a number of high-quality domain-specific terminologies that are publicly available. The challenge then is to combine and de-dupe them into a comprehensive set of terms – something commercial vendors will have already done for their commercial systems.
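Combining and de-duplicating several public terminologies can be sketched as follows. The `merge_terminologies` helper and the two sample vocabularies are hypothetical; real merging also has to reconcile conflicting identifiers and curate ambiguous synonyms by hand.

```python
def merge_terminologies(*sources):
    """Merge synonym lists from multiple sources, de-duplicating
    case-insensitively while keeping the first spelling seen."""
    merged = {}
    for source in sources:
        for canonical, synonyms in source.items():
            bucket = merged.setdefault(canonical, [])
            seen = {s.lower() for s in bucket}
            for syn in synonyms:
                if syn.lower() not in seen:
                    bucket.append(syn)
                    seen.add(syn.lower())
    return merged

# Hypothetical overlapping vocabularies from two public sources
a = {"TP53": ["TP53", "p53"]}
b = {"TP53": ["P53", "tumor protein p53"], "EGFR": ["EGFR", "ErbB-1"]}
print(merge_terminologies(a, b))
# {'TP53': ['TP53', 'p53', 'tumor protein p53'], 'EGFR': ['EGFR', 'ErbB-1']}
```

Even this simple merge shows why the task is harder than it looks: deciding that “P53” and “p53” are the same synonym is easy, but deciding whether two sources’ canonical entries refer to the same gene generally requires domain expertise.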

Content Licensing: Depending on the source of the material, there may be restrictions (such as licensing, volume or age of articles) on how end users can extract information from that content if they do not have a subscription.  

The bottom line is this. If you’re trying to decide between software and solutions based on manual and automated curation, ask yourself: Would your project benefit from information obtained from a wide swath of journals or just a chosen few? Does identifying a greater number of relevant relationships between entities give you more confidence in your data? Are you comfortable with only an external perspective on what research is most relevant to your work, or do you want to review the information and make that decision yourself? The final choice is yours.

Jacqui Hodgkinson is VP, product development at Elsevier Life Science Solutions

Further information

[1] Corney, D. P. A., Buxton, B. F., Langdon, W. B. & Jones, D. T. BioRAT: extracting biological information from full-length papers. Bioinformatics 20, 3206–3213 (2004)

[2] McIntosh, T. & Curran, J. R. Challenges for automatically extracting molecular interactions from full-text articles. BMC Bioinformatics 10, 311 (2009)

[3] Schuemie, M. J. et al. Distribution of information in biomedical abstracts and full-text publications. Bioinformatics 20, 2597–2604 (2004)

[4] Shah, P. K., Perez-Iratxeta, C., Bork, P. & Andrade, M. A. Information extraction from full text scientific articles: where are the keywords? BMC Bioinformatics 4, 20 (2003)