Semantic tools help extract meaning
Siân Harris speaks to some companies that provide organisations with tools for text and data mining
When companies want to do more with their content, they often turn to text and data mining to help them add value. Such companies might be big pharmaceutical or healthcare companies, eager to gain more insight from their high-throughput screening, clinical trials and literature.
They might also be scholarly publishers wanting to apply semantic enrichment to the content they publish. This could be to help readers search more effectively or find related content, or to help the publisher derive other packages of content that it can sell in different ways. New resources such as thematically-focused topic pages, linked-data knowledge bases and content APIs rely on semantic metadata.
‘Text mining is about adding meaning,’ noted Rob Virkar-Yates, marketing and communications director at Semantico, which provides some base-level text and data mining through its Scolaris platform and also uses TEMIS’s Luxid platform to extract meaning from large passages of unstructured text. ‘Our most frequently requested use case is "related documents", where we can use the information gathered from Luxid to suggest other documents that may be of interest to the reader of a given article. In the past, the only way of achieving this would have been to tag "related documents" manually, and keep them maintained, or to create a taxonomy and link documents together using the subjects and facets they are tagged with. Both of these could easily be a full-time job for one or more people,’ said Virkar-Yates.
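To make the "related documents" idea concrete, here is a minimal sketch in Python. It assumes, purely for illustration and not as Luxid’s actual output, that an extraction step has already attached subject tags to each article; it then ranks other documents by how many tags they share.

```python
# Minimal sketch (illustrative only, not a vendor's output format):
# suggest "related documents" by counting shared subject tags.
from collections import Counter

# Hypothetical tags produced by an entity/topic extraction step
doc_tags = {
    "article-1": {"oncology", "clinical-trial", "EGFR"},
    "article-2": {"oncology", "EGFR", "biomarkers"},
    "article-3": {"materials-science", "graphene"},
}

def related_documents(doc_id, tags_by_doc, top_n=5):
    """Rank other documents by the number of tags they share with doc_id."""
    target = tags_by_doc[doc_id]
    scores = Counter()
    for other, tags in tags_by_doc.items():
        if other != doc_id:
            scores[other] = len(target & tags)
    return [d for d, score in scores.most_common(top_n) if score > 0]

print(related_documents("article-1", doc_tags))  # ['article-2']
```

In practice the ranking would weight tags by specificity and confidence, but the principle is the same: the suggestions fall out of the semantic metadata rather than being maintained by hand.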
According to Daniel Mayer, VP of marketing at TEMIS, there are several key requests that TEMIS receives from customers. ‘In industry, we see two types of use cases,’ he said. ‘The first relates to what we call "discovery" activities. They involve the processing and analysis of massive amounts of available scientific literature, patents, research publications and/or news to extract and aggregate available knowledge about a given subject of interest.’
He continued: ‘The second type are "enrichment" use cases that are mostly focused on internal/corporate documents that are stored in-house, with a goal of enhancing the way they are archived (to prevent loss of knowledge, for example, or to ensure enhanced compliance) and enhancing the capture, preservation and exploitation of the organisational knowledge they contain.’
He also noted the benefits of text and data mining in the area of analytics and visualisation. ‘The Luxid platform also includes Luxid Information Analytics – a tool that enables researchers to investigate their field of interest visually and quantitatively, based on information extracted from unstructured documents.’
Linked data
Linked data is a related buzz phrase in the industry today, and is an approach being taken by several large publishers and information organisations. Peter Camilleri, business development director of TSO, which works with Nature Publishing Group, the Royal Society of Chemistry, the British Library and others, explained: ‘Many documents are made up of vast amounts of unstructured or semi-structured text, with valuable information buried within it. Linked data is a way of making data available as RDF (Resource Description Framework) or "triples". RDF presents data as statements, with each statement made up of a subject, a predicate and an object.
This allows datasets from different domains to be linked together in flexible ways without predefined schemas. RDF also means relationships between different datasets can be expressed, leading to greater discovery of information held within the data.’
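Camilleri’s description maps directly onto code. Below is a small sketch using Python’s rdflib library; the article, properties and namespace are invented for illustration. Each add() call records one subject, predicate and object statement, and graphs built from different sources can simply be merged.

```python
# A small sketch of RDF "triples" using rdflib (example data invented):
# each statement is a subject, a predicate and an object.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS

EX = Namespace("http://example.org/")  # placeholder namespace

g = Graph()
article = URIRef("http://example.org/articles/123")
g.add((article, DCTERMS.title, Literal("A study of graphene catalysts")))
g.add((article, DCTERMS.subject, EX.graphene))
g.add((article, EX.mentions, EX.palladium))

# Serialising shows the same statements in Turtle form
print(g.serialize(format="turtle"))
```

Because every dataset is expressed as the same kind of statement, a graph describing chemistry articles can be joined to one describing, say, funders or institutions without either side agreeing a schema in advance.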
As with any technology, there is a trade-off between making it simple to use and making it flexible. Phil Hastings, SVP for sales and marketing at Linguamatics, described the company’s natural language processing ability: ‘Its strength is that it is very flexible, agile and scalable. How it is used is up to the customer. We try not to apply any preconceived ideas to what might be important to users. It is about applying the right filters.’
This is important, he said, because different types of content might need to be considered differently. For example, social media provides plenty of useful information but also lots of noise. It also uses different grammar from scholarly literature and other sources.
Where this trade-off between simplicity and flexibility lies depends on the size of the organisation using the tools and the depth of its pockets, according to Virkar-Yates. He said that what Semantico provides, through TEMIS’s Luxid tool, is a ‘volume-based load, entry-level solution for small publishers.’
However, he said some publishers are not even able to do basic text and data mining. ‘I would suggest that many publishers are still playing catch-up and are being let down by their vendors. For example, you should be able to type two different spellings of a word and get the same results in a search,’ he explained.
‘There is a disconnect between what vendors talk about and what they do. The reality on the ground may be different because it’s expensive, or because they may not have the expertise. It comes down to money and is still in the realms of the people with deeper pockets.’
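Virkar-Yates’s spelling example is simple to sketch. The snippet below is a toy illustration rather than how any particular vendor’s search works: known variants are normalised to a canonical form at both index and query time, so either spelling returns the same documents.

```python
# Toy sketch of spelling-variant handling in search (illustrative only):
# map known variants to a canonical form when indexing and when querying.
VARIANTS = {"colour": "color", "oesophagus": "esophagus", "tumour": "tumor"}

def normalise(term):
    return VARIANTS.get(term.lower(), term.lower())

index = {}  # canonical term -> set of document ids

def add_to_index(doc_id, terms):
    for term in terms:
        index.setdefault(normalise(term), set()).add(doc_id)

def search(term):
    return index.get(normalise(term), set())

add_to_index("doc-1", ["Tumour", "oesophagus"])
print(search("tumor") == search("Tumour"))  # True
```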
Accuracy and formats
There are other challenges with text and data mining today, too.
‘Automated tagging of data within unstructured text is never 100 per cent accurate, and should always be checked by a subject-area specialist. This can be costly, but the time taken to identify false-positives is far less than tagging this data manually. This process of identifying unwanted matches can also be fed back into the data-mining service to improve the accuracy of subsequent imports,’ said Virkar-Yates.
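The feedback loop Virkar-Yates describes can be pictured as a very small piece of code. The sketch below is hypothetical, not a specific vendor’s API: matches that a subject specialist rejects are added to a stop list, so the same spurious tags are suppressed on subsequent imports.

```python
# Hypothetical sketch of the review loop: specialist-rejected matches are
# fed back as a stop list that later imports respect.
stop_list = set()

def auto_tag(text, dictionary):
    """Naive dictionary tagger that skips terms on the stop list."""
    words = set(text.lower().split())
    return [term for term in dictionary
            if term.lower() in words and term not in stop_list]

def record_false_positives(rejected):
    """Feed specialist-rejected matches back into the tagging service."""
    stop_list.update(rejected)

dictionary = ["arsenic", "lead"]
print(auto_tag("these results lead to arsenic assays", dictionary))
# ['arsenic', 'lead'] -- here 'lead' is a verb, a false positive
record_false_positives({"lead"})
print(auto_tag("these results lead to arsenic assays", dictionary))
# ['arsenic']
```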
Camilleri commented: ‘I’d say the biggest challenge is finding domain-specific thesauri or controlled vocabularies that we can use to develop annotators for our clients. Varying source data formats are also challenging because organisations hold data in many different formats, so an initial data conversion exercise may be required to normalise the data before we can run it through our enrichment engine.’
Hastings added: ‘There’s still an educational challenge too. People’s attitudes to text mining have changed a great deal, but we still need to educate them on how mining is different from search.’ He also noted that, for mining as for any search, the need for access to content is obvious.
Despite these issues, however, text and data mining is becoming more firmly embedded in the processes and plans of big – and smaller – companies.
‘Text mining has a very bright future,’ said Hastings. ‘Clearly the amount of unstructured information is not going to go away, and we see customers expanding mining into more areas. When we first started Linguamatics, people questioned whether text mining could have an impact, but over the past five years in particular people have been asking us more and more about how and where they can use it, because they already understand that it can have a significant impact.’
Camilleri agreed on the potential of these techniques: ‘With increasing numbers of organisations "opening up" their research data for the greater scientific good, combining data from disparate data sources will become the norm.
‘Not only will there be an increase in the number of organisations using these tools, but it will be increasingly difficult to carry out effective research without them.’