Language skills help text mining

French text-mining company Temis believes that working in multiple languages is essential for getting the most out of electronic information

For those born into the English-speaking world it is easy to imagine that everything is written in English. While English is obviously an important world language, anyone who assumes there are no others could be missing a trick or two.

Text mining technology is all about not missing those tricks by sifting through vast arrays of information looking for the few nuggets of gold amid the piles of detritus. And French text-mining company Temis has recognised that this can be achieved more thoroughly if multi-lingual capabilities are included at the very heart of its text mining system.

This means more than simply having various language dictionaries on a search engine. A text miner needs to recognise the context in which the information you are looking for might be found rather than just returning instances of a particular word. The software has to be sensitive to the culture associated with the language - not just the words. It then extracts what you are looking for and presents it in a useful form, which may be a table of figures rather than just a series of links to documents.

Temis was founded in 2000 by four French people, two Germans and one Italian. Six of them had worked in various parts of IBM. The initial team had a variety of backgrounds in data mining, mathematics, IT and linguistics. Some of its founders had been involved in the development and marketing of the Intelligent Miner for Text, IBM's flagship text mining tool.

The six colleagues were not particularly happy at IBM, so they decided to form their own company to develop text mining software. They believed that while IBM was investing in the fundamental technology of text mining, it was not putting enough into bringing that technology forward in a form which customers could easily use. They saw an opportunity to make something new which would have an impact on the market. In particular they felt that IBM and most other manufacturers were focusing only on extraction in English. The ex-IBM founders were joined by Gilles Pouzenc, an experienced financier and business manager who is now chairman of the board. They all invested their life savings and scraped together about €300,000 to get started. The company later received a strategic investment from Crédit Lyonnais.

They did not choose the best time to strike out on their own, just as the dot com boom was crashing, but they were fortunate in that most of the customers they were aiming at were large corporations in pharmaceuticals, automotive and publishing.

Temis started with nothing and needed to come up with a product. So it approached the Xerox Research Centre Europe (XRCE) in Grenoble, which had developed a basic engine for extraction that worked in many languages, called Xelda. Its main advantage was that it had a single API for 16 European languages. Temis made an OEM agreement with XRCE and incorporated the engine into its first product, the Insight Discovery Extractor.

Charles Huot, one of the founders and CEO of Temis, said that most text mining engines are designed to work in English first because English is a much easier language to work in. Its words tend not to change much according to tense, case or gender; this variation in the form of words is known as 'morphology'. The Xelda engine was designed from the beginning to work in many languages with many structures, syntaxes and morphologies.

He said: 'It is commonly admitted in the linguistic community that some languages are more difficult to handle than others. German is amongst the most difficult, French is not so difficult, as is English, and Spanish, Italian and Portuguese are complex. Nordic languages are a little bit complex but not too difficult. Greek and Turkish have complex syntax. But of course the difficulty of a language is not reflected in its business value. It might be difficult to work in some languages but there is a demand for it from our customers; German is an example of a difficult language but there is a big market for it. Some other languages might be easier but the market for them is smaller.'

Huot said that the partnership with Xerox was extremely successful. It would have been inconceivable for Temis to have worked on a parsing engine from scratch and XRCE had invested 10 years of work in producing Xelda, but he said that Temis had a better understanding of what customers would want from a text mining system and was able to turn the basic engine into a commercial product.

From 2000 to 2003 Temis had a non-exclusive deal with Xerox for Xelda but then Xerox offered to sell the technology and the team which developed it. Temis created a new division in Grenoble with about 12 people. They became the Temis research team.

Huot said: 'It was very important to us and our customers to be independent in terms of technology, so we did everything we had to do to acquire it. We have seen some competitors, who have licensed their text mining technology, get into trouble when their technology partners cause them difficulties.'

Further capital investment has led to the opening of offices in the UK and US and the company now stands at 43 people with a turnover of about €3.5 million. It aims to reach break-even point this year.

In the early days Temis had a distinct advantage in the European mainland because its products were optimised for languages other than English. However, in the last year it has opened offices in the English-speaking world. The business opportunities there are larger but the competitive landscape is much more crowded. Because of this competition the product range has expanded to include 'skill cartridges', which are software extensions containing knowledge about an industry sector, such as pharmaceuticals, or applications like business intelligence, or specific research fields. This means it has more to offer than just languages. The skill cartridges describe what particular information would look like in that business context, making it easier to recognise which documents contain it and to extract the details of interest.

Huot said: 'For companies, text information is like fuel. We provide an engine to turn that text into useful information to drive them forwards, through things like dashboards and graphic representations of complex information. We have been successful in life sciences in Europe and so we are in a good position to access the life sciences market in the US. Also, English speaking companies are not just interested in English text, because they are doing business worldwide.'

'In the US we have looked at entering into the OEM business, rather than the direct sales model that we adopted in Europe. We now have many partnerships with software companies that want to have text mining features incorporated into their products. OEM customers include EMC Documentum and LION Bioscience, and we are talking to a lot of companies in the search engine, document management and knowledge management markets. People who used to use data mining in business intelligence products are looking to incorporate text mining. OEM sales will always be a significant part of our business but it will not become everything. OEM sales are easier than they were five years ago because access to data is becoming an issue for everyone. But all the founders have a strong desire to keep the direct contact between the company and its end customers.'

Many of Temis's customers are using the Temis system directly to extract information from their own databases and published sources, including news feeds. Other customers are publishers themselves who use the technology to create their own databases to which they sell access. The advantage of using text mining over a search engine is that it does not rely on the quality of the original indexing, as it can recognise not just key words but variations on key words, and in many languages. It can analyse documents in a variety of formats and then aggregate the information into a single view.

The skill cartridges act as a filter on the raw data so that someone looking for a licensing deal or partnership agreements within the pharmaceutical sector could pick out the most relevant information from a vast array of data sources rather than having to use a series of precise searches. The system recognises the key words in sentences and then looks for the relevant features of the document to extract the relationships between the words that most closely match what the user is looking for. Different skill cartridges are used, for example, for competitive intelligence and for safety information.
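To make the idea of recognising 'variations on key words' in several languages more concrete, here is a minimal sketch in Python. It is not how the Temis software works: the cartridge contents, the concept names and the function find_concepts are all invented for illustration, and a real system would use proper morphological analysis rather than a hand-written word list.

```python
import re

# Hypothetical miniature "cartridge": each concept maps to surface forms
# (inflections and translations) that should all count as the same key word.
LICENSING_CARTRIDGE = {
    "licensing_deal": {
        "license", "licenses", "licensed", "licensing", "licence",
        "lizenz", "lizenzen", "lizenziert",
    },
    "partnership": {
        "partnership", "partnerships", "partenariat", "partenariats",
        "partnerschaft",
    },
}


def find_concepts(text, cartridge):
    """Return {concept: [matching words]} for every concept seen in text."""
    tokens = re.findall(r"\w+", text.lower())
    found = {}
    for concept, forms in cartridge.items():
        hits = [token for token in tokens if token in forms]
        if hits:
            found[concept] = hits
    return found
```

A plain keyword search for 'licensing' would miss a German sentence containing 'lizenziert'; grouping the variants under one concept is what lets the same query work across languages.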

One example is the energy company Total, which uses text mining to derive from news reports from around the world exactly what refining capacity is available throughout the world. They would have to read several thousand documents per day in a variety of languages. They do not want the full text; they just want the figures in an easily accessible form and in real time. This information is used directly to establish prices on a daily basis.
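As a rough illustration of turning running text into figures, the sketch below pulls capacity numbers out of English sentences and returns them as rows that could feed a table. It is an assumption-laden toy, not Total's or Temis's actual process: the pattern, the units it recognises and the function extract_capacities are hypothetical, and it handles only English number formats.

```python
import re

# Hypothetical pattern: a number (with optional thousands separators or a
# decimal part), an optional magnitude word, and a capacity unit.
CAPACITY = re.compile(
    r"(?P<value>\d{1,3}(?:,\d{3})+|\d+(?:\.\d+)?)\s*"
    r"(?P<scale>million|thousand)?\s*"
    r"(?P<unit>barrels per day|bpd)",
    re.IGNORECASE,
)

SCALES = {"million": 1_000_000, "thousand": 1_000}


def extract_capacities(text):
    """Return one row per capacity figure found in text."""
    rows = []
    for match in CAPACITY.finditer(text):
        # Drop thousands separators before converting to a number.
        value = float(match.group("value").replace(",", ""))
        scale = (match.group("scale") or "").lower()
        value *= SCALES.get(scale, 1)
        rows.append({"barrels_per_day": value, "source": match.group(0)})
    return rows
```

Each row keeps the matched source text alongside the normalised number, so a reader of the resulting table can check a figure against the report it came from without opening the full document.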

Another example is publishers. Huot explains: 'They have to do a lot of manual curation on their databases. They might need to extract authors and then put them on a separate database. They can also create links between different sources of information that is very time consuming if done manually. Elsevier MDL's Chemistry Patent Database is a good example of this. Publishers typically get a return on their investment within a single quarter.'

'We are also working with LexisNexis on its legal databases, where it is trying to manage the flow of information. The laws of countries are obviously written in their own languages so you cannot do anything unless you can search in those languages.'

Temis originally focused on the life sciences and automotive markets, but this year it has expanded its vertical market coverage into energy and publishing. It is also looking to consolidate in the UK and US, so it is not planning to stretch itself too much this year. According to Huot, growth last year was 80 per cent and its target next year is 120 per cent. In the future it will look at growing markets such as 'homeland security'.

Huot said: 'We are already pretty busy, and while speed is important it is also important to consolidate with partners and integrators. Text is used everywhere and if we can satisfy our customers and prove the return on investment there are many opportunities out there.'

John Murphy