Search engine helps chemists
David Robson finds out how a new search tool from the USA could help chemists with information retrieval
Hunting down relevant information from the different acronyms, formulae and multiple names that represent chemical compounds is hard enough for an untrained human, let alone a computer, and most generic internet search engines simply cannot sort the wheat from the chaff when solving chemists’ queries.
Now researchers from the College of Information Sciences and Technology at Pennsylvania State University, USA, are promising to change this. They have developed a new, open-source chemical search engine that understands chemical information well enough to return sensible documents, ranked by relevance, with high accuracy. Their tool can even find relevant information in data held in results tables within academic papers – a skill that previous chemical search engines have struggled with.
‘To the best of our knowledge, it’s the first of its kind,’ said Lee Giles from Penn State University, who helped to develop the search engine, called ChemXSeer. Unlike generic search engines such as Google, ChemXSeer can understand when two seemingly different terms mean the same thing. At its simplest, this means equating methane with its formula representation CH4. However, CH4 can also be written as H4C, and things get much more complicated with larger molecules, whose formulae can be written in many different permutations.
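One way to picture how CH4 and H4C can be treated as the same term is to reduce every formula to a single canonical spelling before comparing. The sketch below is only an illustration and is not taken from ChemXSeer: the function names are hypothetical, it handles flat formulae without brackets or charges, and it borrows the Hill ordering convention (carbon first, then hydrogen, then the remaining elements alphabetically) as the canonical form.

```python
import re
from collections import Counter

# An element symbol (capital letter, optional lowercase letter)
# followed by an optional count, e.g. "C", "H4", "Cl2".
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula: str) -> Counter:
    """Count the atoms in a flat formula such as 'CH4' (no brackets or charges)."""
    counts = Counter()
    pos = 0
    for match in _TOKEN.finditer(formula):
        if match.start() != pos:
            raise ValueError(f"cannot parse {formula!r} at position {pos}")
        symbol, number = match.groups()
        counts[symbol] += int(number) if number else 1
        pos = match.end()
    if not formula or pos != len(formula):
        raise ValueError(f"cannot parse {formula!r}")
    return counts


def canonical_formula(formula: str) -> str:
    """Rewrite a formula in Hill order so that equivalent spellings compare equal."""
    counts = parse_formula(formula)
    if "C" in counts:
        rest = sorted(s for s in counts if s not in ("C", "H"))
        order = ["C"] + (["H"] if "H" in counts else []) + rest
    else:
        order = sorted(counts)
    return "".join(s + (str(counts[s]) if counts[s] > 1 else "") for s in order)


print(canonical_formula("H4C") == canonical_formula("CH4"))
```

A scheme like this compares only atom counts, so isomers that share a formula collapse to the same string. That is one reason why real chemical search needs more than formula matching.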
In common with many search engines, the ChemXSeer search engine relies on a crawler – a piece of software that prowls cyberspace, following every link it finds to collect relevant information about the documents it encounters.
Most chemical search engines still rely on humans to tag documents with the chemical terms they contain, but ChemXSeer is completely automatic, which should allow it to process a greater volume of documents.
ChemXSeer’s crawler is particularly smart because it can work out whether a page is a chemical document or not, and it can differentiate between chemical and nonchemical terms.
Because the crawler can tell chemical terms from non-chemical ones, a chemist searching for OH will find only documents where this represents the hydroxyl group, and not the American state of Ohio. The crawler learnt this skill through machine learning, a process similar to trial and error.
Once it has determined that a document is relevant to chemists, the crawler identifies which terms to add to its index of web pages.
It typically splits up chemical names and formulae into small, meaningful chunks, so that users can find results for parts of chemical structures as well as for complete chemicals.
For example, a document containing the chemical name methyl methanesulphonate would be indexed under ‘methyl’, ‘methane’ and ‘sulphonate’, and it would appear in searches for all three.
Researchers could also be more specific and search for the complete term ‘methyl methanesulphonate’. This allows users to find exact terms or similar results to their queries – a flexibility that hadn’t been available in previous offerings.
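The indexing idea described above can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not ChemXSeer's actual method: the fragment vocabulary `FRAGMENTS` and the functions `split_name` and `build_index` are hypothetical, and the splitting is a simple greedy longest-match, whereas the real system learns how to segment terms.

```python
from collections import defaultdict

# Hypothetical fragment vocabulary, for illustration only.
FRAGMENTS = ("sulphonate", "methane", "methyl", "ethane", "ethyl")


def split_name(name: str) -> list[str]:
    """Greedily split each word of a chemical name into known fragments."""
    by_length = sorted(FRAGMENTS, key=len, reverse=True)
    found = []
    for word in name.lower().split():
        i = 0
        while i < len(word):
            for frag in by_length:
                if word.startswith(frag, i):
                    found.append(frag)
                    i += len(frag)
                    break
            else:
                i += 1  # no fragment starts here, so skip one character
    return found


def build_index(docs: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map every full name and every fragment to the documents containing it."""
    index = defaultdict(set)
    for doc_id, names in docs.items():
        for name in names:
            index[name.lower()].add(doc_id)  # exact-term lookup
            for frag in split_name(name):
                index[frag].add(doc_id)  # sub-term lookup
    return index


index = build_index({
    "doc1": ["methyl methanesulphonate"],
    "doc2": ["ethyl acetate"],
})
print(sorted(index["methyl"]), sorted(index["ethyl"]))
```

Because the matching works from the start of each word, a search for ‘ethyl’ in this sketch does not pick up documents that only mention ‘methyl’, even though one string contains the other.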
The software is also smart enough to identify new terms that it has never come across before. ‘It extracts features, instead of just comparing the pages to a chemical dictionary, so it can recognise new terms,’ explained Bingjun Sun from Pennsylvania State University, who also worked on ChemXSeer.
Like other search engines, the software also ranks its pages, so the most relevant documents are returned at the top of the list of search results. Pages where the term appears most frequently are generally ranked more highly, but the software also accounts for the fact that some terms in a search query will be very common but not very useful, and adjusts accordingly.
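The article does not say which weighting scheme ChemXSeer uses, but the behaviour described matches a standard technique called TF-IDF, in which a term's frequency in a page is scaled down according to how many pages contain it. The sketch below assumes that technique purely for illustration; the function name `rank` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def rank(query_terms: list[str], docs: dict[str, list[str]]) -> list[str]:
    """Return document ids ordered by TF-IDF score for the query, best first."""
    n_docs = len(docs)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))

    scores = {}
    for doc_id, tokens in docs.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term_freq[term]:
                # A term found in every document gets a weight of log(1) = 0.
                idf = math.log(n_docs / doc_freq[term])
                score += term_freq[term] * idf
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


docs = {
    "a": ["the", "ethyl", "ethyl"],
    "b": ["the", "the", "the", "ethyl"],
    "c": ["the", "methane"],
}
print(rank(["the", "ethyl"], docs))
```

Here document b contains the query word ‘the’ three times, yet document a ranks first, because ‘the’ appears in every document and so carries no weight.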
The team hope that their ranking is more competitive than other offerings too. ‘We have more dedicated ranking tools, and we think our algorithms are better than the other chemical search engines available,’ said Prasenjit Mitra, another member of the team. ‘The results show an accuracy of over 90 per cent.’
Finding data in tables
Possibly the most exciting development in the project is a new feature that allows chemists to find data held within results tables – information that would have previously been hidden from other search engines.
‘For a while, chemists have wanted to be able to find out if anyone has reported results from the same experiment,’ said Mitra. ‘In the past, some institutions would employ graduate students to find the data, extract it from a table and enter it into a spreadsheet by hand. We’ve tried to automate this. Our tool finds the tables within the documents, indexes the captions and references contained within the table and extracts and partitions the data from the cells.’
This should provide chemists with quicker and easier access to the results that really matter to their work. Richard Kidd, an informatics expert from the Royal Society of Chemistry who was not involved in the work, pointed out that many publishers, including the RSC, already mark-up and tag their tables manually before publication to aid search engines. ‘But this automatic text mining should be useful when searching through older materials that hadn’t been tagged in this way,’ he added.
The Penn State researchers presented their latest work at the WWW2008 conference in Beijing, China, in April 2008, but their work is far from over. They are currently working on a couple of new features that will make the tool even more useful to chemists.
The first of these is a structure search tool, which they are currently testing as a prototype. In addition to chemical formulae and chemical names, scientists will soon be able to enter 3D chemical structures into the search engine for more specific enquiries.
Figure: A screenshot of search results for the ethyl group, returned by ChemXSeer.
They also plan to extend the table search tool so that it can extract information from graphs held within documents, by automatically reading the data points against the axes. ‘It will give real values in real units,’ said Mitra. ‘It’s almost like reverse-engineering the data.’
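Reading a data point against an axis amounts to converting a position in the image into a value in the graph's units. The sketch below shows one simple way this could work, assuming a linear axis on which two tick marks have already been located and read; the function name `make_axis_mapper` and all the numbers are hypothetical, and the article does not describe how the planned feature will actually be built.

```python
from typing import Callable


def make_axis_mapper(
    pixel_a: float, value_a: float, pixel_b: float, value_b: float
) -> Callable[[float], float]:
    """Build a pixel-to-value converter from two known ticks on a linear axis."""
    if pixel_a == pixel_b:
        raise ValueError("the two reference ticks must be at different pixels")
    scale = (value_b - value_a) / (pixel_b - pixel_a)

    def to_value(pixel: float) -> float:
        # Linear interpolation (or extrapolation) from the first tick.
        return value_a + (pixel - pixel_a) * scale

    return to_value


# Hypothetical calibration: x ticks at pixels 100 and 500 read 0 and 10;
# y ticks at pixels 400 and 0 read 0 and 100 (image rows grow downwards).
x_of = make_axis_mapper(100, 0.0, 500, 10.0)
y_of = make_axis_mapper(400, 0.0, 0, 100.0)

# A plotted point detected at pixel (300, 100) becomes a value in real units.
print(x_of(300), y_of(100))
```

A logarithmic axis would need the same interpolation applied to the logarithm of the tick values, and finding the ticks and points in the image is a separate and harder problem.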
The team also need to expand the range of documents included in the tool’s index: currently, it covers only RSC documents, although they hope to increase this in the near future.
Ultimately, the search engine is still a work in progress, and only time will tell whether it can rise above previous attempts. ‘My feeling is that it may be very good at what it does,’ observed the RSC’s Richard Kidd, ‘but we still need a better idea of what its accuracy is like and how it interprets the documents. It’s impossible to tell at the moment whether this is a significant development.’