New centre boosts biotech text mining


Professor John Keane, co-director of the National Centre for Text Mining in Manchester, UK, describes the aims of the centre

Text mining is a way of discovering new information by applying techniques from natural language processing, data mining, and information retrieval. Its uses include drug discovery, predictive toxicology, the analysis of protein interactions, and the identification of new product possibilities. The results of text mining can be used either directly by individual scientists, or indirectly to validate and complement scientific databases.
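
As a toy illustration of the idea, the sketch below flags proteins that are repeatedly mentioned in the same sentence as candidate interactions. The abstracts, protein list, and pattern are all invented for illustration, not the centre's methods.

```python
import re
from itertools import combinations
from collections import Counter

# Toy abstracts; the sentences and protein names are illustrative only.
abstracts = [
    "BRCA1 interacts with RAD51 during DNA repair.",
    "RAD51 and BRCA2 form a complex in damaged cells.",
    "BRCA1 binds BRCA2 in response to DNA damage.",
]

PROTEIN = re.compile(r"\b(BRCA1|BRCA2|RAD51)\b")

# Count proteins mentioned in the same sentence; frequent co-occurrence
# is a crude signal of a possible interaction worth a curator's attention.
pairs = Counter()
for text in abstracts:
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        found = sorted(set(PROTEIN.findall(sentence)))
        pairs.update(combinations(found, 2))

for (a, b), n in pairs.most_common():
    print(f"{a} - {b}: co-mentioned {n} time(s)")
```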

Three research councils in the UK are investing around £1m to establish a National Centre for Text Mining in Manchester, which will have an initial focus on biology and biomedicine. There is much interest in text mining in these areas at the moment, because of the large and increasing number of biomedical articles.

At the core of the centre's infrastructure will be a framework of high-performance database systems, text and data mining tools, and parallel computing. The intention is to develop a component-based architecture to help users define text-mining scenarios and integrate third-party and user-specific components. The infrastructure will be based on a distributed system warehouse, which will integrate document and metadata repositories, different resources (ontologies, grammars, non-textual databases), component pipeline definitions and execution schedules, and user workspaces.
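
A minimal sketch of what such a component-based scenario might look like, assuming components are simply composable transformations over a document stream; the component names below are hypothetical, not the centre's actual architecture.

```python
from typing import Callable, Iterable

# A "component" is any callable that transforms a stream of documents.
Component = Callable[[Iterable[dict]], Iterable[dict]]

def pipeline(*components: Component) -> Component:
    """Compose third-party and user-specific components into one scenario."""
    def run(docs: Iterable[dict]) -> Iterable[dict]:
        for component in components:
            docs = component(docs)
        return docs
    return run

# Hypothetical stages a user might chain together.
def tokenise(docs):
    for d in docs:
        yield {**d, "tokens": d["text"].split()}

def tag_terms(docs):
    for d in docs:
        yield {**d, "terms": [t for t in d["tokens"] if t.isupper()]}

scenario = pipeline(tokenise, tag_terms)
for doc in scenario([{"text": "BRCA1 represses transcription"}]):
    print(doc["terms"])
```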

The centre will also include an advanced clustering technique that should enable items to be interlinked automatically and retrieved quickly. This system will enhance index-term weighting through an automatic text-retrieval context that combines Latent Semantic Analysis with probabilistic retrieval methods. The resulting salient text fragments will form the input for subsequent information extraction components. The system should also support hybrid text mining, such as from a journal and from textual representations of DNA.
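
The sketch below shows one plausible reading of such a combination, using scikit-learn's truncated SVD for the Latent Semantic Analysis step. A plain tf-idf cosine score stands in for the probabilistic retrieval component, and the mixing weight is illustrative; the centre's actual weighting scheme is not described here.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "p53 regulates the cell cycle and apoptosis",
    "apoptosis is triggered by DNA damage",
    "the cell cycle is controlled by cyclins",
]
query = ["p53 and apoptosis"]

# Term-document matrix, then LSA via truncated SVD.
vec = TfidfVectorizer()
X = vec.fit_transform(docs)
lsa = TruncatedSVD(n_components=2, random_state=0)
X_lsa = lsa.fit_transform(X)
q_lsa = lsa.transform(vec.transform(query))

# Semantic similarity in the latent space ...
sem = cosine_similarity(q_lsa, X_lsa).ravel()
# ... blended with a lexical tf-idf score, standing in here for the
# probabilistic retrieval component.
lex = cosine_similarity(vec.transform(query), X).ravel()

alpha = 0.5  # illustrative mixing weight
for doc, score in zip(docs, alpha * sem + (1 - alpha) * lex):
    print(f"{score:.3f}  {doc}")
```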

Investigation of terminology processing from biomedical literature is another priority. The centre will support dynamic and automatic terminology recognition and structuring, and an intelligent terminology manager for storing terminological data in a database and facilitating linking of textual to factual databases. Results of terminological acquisition and management will also aid the curation of scientific databases.
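
One widely used termhood measure in the biomedical terminology literature is the C-value of Frantzi and Ananiadou, sketched below on invented frequency counts; whether the centre will adopt this particular measure is an assumption on our part.

```python
import math
from collections import Counter

# Candidate multi-word terms with corpus frequencies (illustrative counts).
freq = Counter({
    ("basal", "cell", "carcinoma"): 12,
    ("cell", "carcinoma"): 30,
    ("adenoid", "cystic", "basal", "cell", "carcinoma"): 4,
})

def nested_in(term, candidates):
    """Longer candidates containing `term` as a contiguous subsequence."""
    n = len(term)
    return [c for c in candidates
            if len(c) > n and any(c[i:i + n] == term
                                  for i in range(len(c) - n + 1))]

def c_value(term):
    """C-value termhood: length-weighted frequency, discounted when the
    candidate mostly occurs nested inside longer terms."""
    longer = nested_in(term, freq)
    f = freq[term]
    if not longer:
        return math.log2(len(term)) * f
    return math.log2(len(term)) * (f - sum(freq[l] for l in longer) / len(longer))

for t in freq:
    print(" ".join(t), round(c_value(t), 2))
```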

Information extraction (IE) will be based on additional ontological processing, as there is intense interest in linking ontologies to IE systems. The problems in scaling up components to handle large volumes of text and intermediate representations will be mitigated by caching intermediate results, incremental processing, and efficient storage of ontology breadcrumbs to enable rapid access. The centre will also look into developing a common annotation scheme to ensure communication among different components.
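
A minimal sketch of two of these ideas, assuming stand-off annotations as the common scheme and a simple content-keyed memo table for caching intermediate results; the tagger and labels are toy stand-ins.

```python
import hashlib
from dataclasses import dataclass

# A minimal stand-off annotation: components communicate by attaching
# span-based annotations rather than rewriting the text itself.
@dataclass(frozen=True)
class Annotation:
    start: int
    end: int
    label: str        # e.g. an ontology class identifier
    source: str       # which component produced it

_cache: dict = {}

def cached(component_name, func, text):
    """Memoise a component's output keyed on the text and component,
    so re-running a long pipeline skips already-processed stages."""
    key = (component_name, hashlib.sha1(text.encode()).hexdigest())
    if key not in _cache:
        _cache[key] = func(text)
    return _cache[key]

def gene_tagger(text):
    # Toy recogniser: upper-case tokens as hypothetical gene mentions.
    anns, pos = [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        pos = start + len(tok)
        if tok.isupper():
            anns.append(Annotation(start, pos, "GENE", "gene_tagger"))
    return anns

text = "TP53 mutations disrupt apoptosis"
print(cached("gene_tagger", gene_tagger, text))
print(cached("gene_tagger", gene_tagger, text))  # second call hits the cache
```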

The provision of a Grid-enabled framework is another area for investigation. This will have a uniform interface for connecting to heterogeneous data resources and replicated datasets, based on their attributes rather than their names or physical locations. Its inclusion as part of the system will add functionality for the Semantic Web/Grid, including semantic linking through metadata, collection and data abstraction, attribute-based data discovery, and virtual data (or data creation on demand). In particular, it will provide a scalable information discovery and access system for computing with scientific data and metadata. Scientists will be able to use the capabilities of myGrid in conjunction with the Grid-enabled framework.
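
A rough sketch of attribute-based discovery under these assumptions: a hypothetical catalogue maps attribute sets to replicated physical locations, and a lookup never names a file or host directly. The URLs and attributes below are invented.

```python
# Hypothetical catalogue: each logical dataset is described by attributes
# and may be replicated at several physical locations.
catalogue = [
    {"attrs": {"organism": "human", "type": "abstracts", "year": 2004},
     "replicas": ["gridftp://siteA/medline04", "gridftp://siteB/medline04"]},
    {"attrs": {"organism": "mouse", "type": "sequences"},
     "replicas": ["gridftp://siteC/mouse_seq"]},
]

def discover(**required):
    """Resolve datasets by attributes, not by name or physical location."""
    for entry in catalogue:
        if all(entry["attrs"].get(k) == v for k, v in required.items()):
            # A real broker would choose the best replica, e.g. by locality.
            yield entry["replicas"][0]

print(list(discover(organism="human", type="abstracts")))
```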

Another component of the new centre will be the development and implementation of user interfaces, scientific data integration, and mediation. With the results of term and information extraction, users should be able to run data mining tools over the structured metadata tables resulting from these processes and over those of other factual databases. Typically, a user will need to define a data-mining scenario: a set of structured metadata tables providing input; a set of data transformations to be performed on the input; a set of mining algorithms and their parameters; and a set of execution paths connecting data transformations and mining algorithms. Wizards could guide users through this process. There should also be browsable search interfaces that support hierarchically displayed text and non-text data.
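
Expressed declaratively, such a scenario might look like the sketch below; the field names, table names, and algorithm parameters are illustrative, not the centre's actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical declarative form of the data-mining scenario described above.
@dataclass
class Scenario:
    inputs: list            # structured metadata tables providing input
    transforms: list        # data transformations applied to the input
    algorithms: dict        # mining algorithms and their parameters
    paths: list = field(default_factory=list)  # execution edges

scenario = Scenario(
    inputs=["term_table", "interaction_table"],
    transforms=["normalise_ids", "join_on_gene"],
    algorithms={"association_rules": {"min_support": 0.05}},
    paths=[("normalise_ids", "join_on_gene"),
           ("join_on_gene", "association_rules")],
)
```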

A final priority is to support advanced visualisation capabilities, knowledge representation techniques, and integration into Grid-enabled applications. The centre hopes to create interfaces that allow non-experts to search across databases of unknown scope, visualise the results of term management, IE and data mining, and work with binary data. In time, these should support post-processing of results into relevant applications.

The text-mining centre has only just been announced, but there is already a large pool of potential users. Over 80,000 biology and medical researchers work in UK universities, and the centre will subsequently open to the public sector and to companies throughout the biotechnology and pharmaceutical supply chains.

And it is not simply a UK project. One of the goals is to drive the international, as well as national, research agenda in text mining. To help achieve this wider goal, the centre has already formed international partnerships. In addition to UMIST, the Victoria University of Manchester, the University of Liverpool, and the University of Salford, international self-funded partners include the University of California, Berkeley, the University of Geneva, the University of Tokyo, and the San Diego Supercomputer Center.