Legal documents enable scientific discovery

Sian Harris reports from IPI-ConfEx in Seville, Spain, on how patent information could help pharmaceutical research

‘A patent is a legal document but in a scientist’s hands it has to become a scientific document,’ observed Jorge Manrique, VP of sales and marketing for Prous Science, part of Thomson Scientific, at the recent IPI-ConfEx meeting in Seville, Spain. He explained how scientists can use patent information, along with data mining techniques, in drug discovery.

‘You can predict pharmacology that might not have been envisaged by the inventor,’ he commented. ‘If you have five avenues to pursue and two look more promising because of this data, then you know which ones to pursue.’

One tool to do this is the BioEpisteme data mining technology developed by the Prous Institute for Biomedical Research and licensed by Prous Science. This tool uses factual data from a variety of sources to evaluate the probability that a given molecule will display a certain combination of molecular events. For a given compound this system could predict the probability that the compound will treat a specific condition, have specific mechanisms of action or target specific receptors.

Prous is not alone. At the meeting, David Walsh, an information scientist for Pfizer Global Research and Development, spoke of the challenges of finding the valuable scientific information among the legal jargon of patents.

One of these, he pointed out, is the considerable diversity in the ways a particular piece of information can be described. Indeed, the first hurdle is a lack of standardisation even in the numbering of patents. In general, patent numbers take the form of a country or office code (such as US, or WO for international PCT applications), then the year, a number and finally a kind code. Within this scheme, however, there can be wide variety. The year could be two characters or four, for example, and even the links between the different parts of the patent number could be spaces, hyphens or something else. With so many possible variations even in the patent number, it is not surprising that there are no standard ways of describing patent content. ‘There is no such thing as standardisation of diseases or target names in patents or indexing of patents and there is great difficulty in extracting this information,’ commented Walsh.
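
To make the variation concrete, here is a minimal Python sketch of how such numbers might be normalised before they are matched across sources. It is an illustration under stated assumptions, not a method described at the meeting: the function name `normalise_patent_number`, the regular expression, the choice of a compact output form and the rule for expanding two-digit years (00 to 49 read as 20xx, 50 to 99 as 19xx) are all hypothetical, and real patent offices use formats this pattern does not cover.

```python
import re

# Hypothetical pattern: office code, year, serial number, optional kind code.
# Spaces, hyphens and slashes are all accepted as separators.
_PATENT_RE = re.compile(
    r"""
    \s*(?P<office>[A-Z]{2})      # two-letter office code, e.g. WO or US
    [\s\-/]*
    (?:(?P<y4>\d{4})[\s\-/]*     # four-digit year, separator optional
      |(?P<y2>\d{2})[\s\-/]+)    # two-digit year, separator required
    (?P<serial>\d+)              # serial number, kept exactly as written
    [\s\-/]*
    (?P<kind>[A-Z]\d?)?          # optional kind code, e.g. A1
    \s*
    """,
    re.VERBOSE,
)


def normalise_patent_number(raw):
    """Return a compact form such as 'WO2006123456A1', or None if no match."""
    match = _PATENT_RE.fullmatch(raw.upper())
    if match is None:
        return None
    if match.group("y4"):
        year = match.group("y4")
    else:
        # Assumed pivot for two-digit years: 00-49 -> 20xx, 50-99 -> 19xx.
        two_digits = int(match.group("y2"))
        year = str(2000 + two_digits if two_digits < 50 else 1900 + two_digits)
    kind = match.group("kind") or ""
    return f"{match.group('office')}{year}{match.group('serial')}{kind}"


if __name__ == "__main__":
    for text in ["WO 06/123456 A1", "wo2006-123456a1", "US-2005-0012345-A1"]:
        print(text, "->", normalise_patent_number(text))
```

The first two inputs in the usage example are different spellings of the same made-up number and reduce to the same string, which is the point of the exercise: only after this kind of clean-up can records from different sources be compared reliably.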

Current patent search systems return structures and patents, according to Walsh, but ‘there is no easy way to get that information to our chemists. We’d like to be able to integrate the structures and patents within our own processes, our internal data, software and decisions.’ Mining can help identify the ‘druggability’ of compounds, areas of potential patent infringement, and patent opportunities.

However, there are problems with large-scale text mining of patents. These include poor use of IUPAC nomenclature and synonyms, the difficulty of identifying the beginning of a chemical structure within a body of text and the problem of mining chemical structures that are presented in image form rather than in text.

For this reason Walsh called on database vendors to work with the pharmaceutical industry to enhance the processes of extracting content from patents. He suggested that vendors ‘adopt standards for data that avoid time-consuming processing when integrating the results of many databases’ and ‘join up the process for identifying inventions to enable interaction with the data.’

Scientists and patent professionals must also ensure that the sources of patent information are as comprehensive as possible. This is quite a challenge, according to Christine Emmerich of FIZ-Karlsruhe, who is involved in marketing and sales for STN Europe. She illustrated to delegates why searching a combination of value-added patent information – contained in databases such as CAPLUS, Derwent WPI and MARPAT – is important. The example she chose was the anti-ulcer drug pantoprazole, which has recently come out of patent protection.

A search for this drug in value-added patent databases towards the end of last year revealed 587 inventions, 30 per cent of which were unique to just one of the value-added databases. This is because different databases use different indexing guidelines and because the enhanced titles and abstracts differ significantly between producers. More worrying for those relying on first-level databases (such as INPADOC and those available on the patent authorities’ websites), 117 inventions relating to this drug could only be retrieved using the CAS and Derwent value-added patent data. The patents relating to key features of pantoprazole, such as product protection and basic manufacturing processes, could not be found in first-level patent databases. This might be because the drug is represented as a chemical structure or as part of a generic structure, because chemical names are not standardised in the patent full text or because generic and brand names are misspelled in the patent full text.
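
For scale, 30 per cent of 587 is roughly 176 inventions that a searcher using only one of the value-added databases could miss. The short Python sketch below shows how such an overlap figure can be computed once the hits from each database have been reduced to comparable invention identifiers. The function name `unique_share` and the identifiers `inv1` to `inv4` are placeholders invented for illustration; they are not the pantoprazole results.

```python
from collections import Counter


def unique_share(hits_by_database):
    """Count inventions found in exactly one database.

    hits_by_database maps a database name to a set of invention identifiers.
    Returns (unique_count, total_count, unique_fraction).
    """
    # For each invention, count how many databases returned it.
    counts = Counter(
        invention
        for hits in hits_by_database.values()
        for invention in set(hits)
    )
    total = len(counts)
    unique = sum(1 for n in counts.values() if n == 1)
    return unique, total, (unique / total if total else 0.0)


if __name__ == "__main__":
    # Toy data only: placeholder identifiers, not real search results.
    toy_hits = {
        "CAPLUS": {"inv1", "inv2", "inv3"},
        "Derwent WPI": {"inv2", "inv3", "inv4"},
        "MARPAT": {"inv3"},
    }
    unique, total, share = unique_share(toy_hits)
    print(f"{unique} of {total} inventions appear in only one database ({share:.0%})")
```

In the toy data, two of the four inventions appear in a single database, so half of the combined result set would be lost to anyone searching just one source.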

With so many challenges on the way to using patent information more fully in drug discovery, it looks as though pharmaceutical companies and patent information providers will have plenty to talk about for many meetings to come.