Community curation helps chemical information


Antony Williams explains why the Royal Society of Chemistry has acquired chemical information search tool ChemSpider

The internet has opened up access to unprecedented levels of information. For chemists, the increasing number of online chemistry-related resources provides a valuable path to the discovery of information. This path was previously limited to commercial ventures and was therefore constrained by their resources. A recent shift to publicly available resources offers great promise, to the benefit of science and society. The success of the PubChem project demonstrated both the value and the attractiveness of an online structure database for facilitating connections between structures and associated data.

However, there are hundreds of databases of chemical information, covering literature data, molecular properties, environmental data, toxicity data and analytical data, as well as chemical vendor catalogues. There has been no easy way to search across all these sources or even to determine the availability of information within them – even if the sources are open-access databases. What’s more, despite the large number of databases containing chemical compounds and data available online, their quality, accuracy and completeness have been lacking in many respects.

Along came a spider

With this challenge in mind, ChemSpider was developed initially as a hobby project by a small group of dedicated cheminformatics specialists. The intention was to aggregate and index available sources of chemical structures and their associated information into a single searchable repository and make it available to everybody, at no charge. One of the initial concepts for ChemSpider was to aggregate into a single database all chemical structures available within open-access and commercial databases and to provide the necessary pointers from the ChemSpider search engine to the information of interest.
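The aggregation idea described above can be illustrated with a minimal sketch. It is not ChemSpider's implementation: the record layout, the source names, the example.org URLs and the `structure_key` values are all hypothetical, and the key simply stands in for whatever canonical structure identifier an aggregator might use. The sketch groups records from several sources under one structure and keeps pointers back to each source.

```python
from collections import defaultdict


def build_structure_index(records):
    """Group source records by a shared structure key.

    Each record is a (structure_key, source_name, url) tuple. All three
    fields are hypothetical placeholders for this illustration.
    """
    index = defaultdict(list)
    for structure_key, source_name, url in records:
        pointer = (source_name, url)
        # Skip exact duplicate pointers so each source link appears once.
        if pointer not in index[structure_key]:
            index[structure_key].append(pointer)
    return dict(index)


# Hypothetical records from three imaginary sources.
records = [
    ("structure-A", "vendor_catalogue", "https://example.org/vendor/123"),
    ("structure-A", "toxicity_db", "https://example.org/tox/9"),
    ("structure-B", "literature_db", "https://example.org/lit/42"),
    ("structure-A", "vendor_catalogue", "https://example.org/vendor/123"),
]

index = build_structure_index(records)
for key, pointers in sorted(index.items()):
    print(key, [name for name, _ in pointers])
```

A single lookup on a structure key then returns every source that holds information about that structure, which is the "pointer" role the search engine is described as playing.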

The intention with ChemSpider was also to provide a platform whereby the chemistry community could contribute to cleaning up and improving the quality of the data online. They could also expand the information available to include data such as reaction syntheses, analytical data and experimental properties such as melting points and solubilities, and link to other valuable resources available on the internet.

A new breed of structure database

ChemSpider has unique capabilities. These include real-time curation of the data, association of analytical data with chemical structures, real-time deposition of single or batch chemical structures (including activity data) and transaction-based predictions of physicochemical data. The social community aspects of the system demonstrate the potential of this approach. Almost 2,000 spectra have been added to the site by members of the community. Curation of the data continues daily and tens of thousands of edits by members of the community have dramatically improved the quality of the data, compared with other public resources for chemistry.

ChemSpider has now grown into a resource containing more than 21.5 million unique chemical structures from over 200 data sources such as chemical vendors, commercial database vendors, publishers, government databases and members of the community.

A series of web services is available, allowing integration with ChemSpider for the purpose of searching the system. Analytical hardware vendors and cheminformatics vendors have also integrated with ChemSpider to improve their own products and capabilities. The system also integrates text-based searching of open-access articles, presently covering more than 500,000 articles, and includes both structure and substructure searching of articles on PubMed.

Strength through partnership

The Royal Society of Chemistry (RSC) has a very strong web presence in delivering published content. We have also established a new standard in semantic publishing for chemistry with our award-winning HTML mark-up technology, “RSC Prospect”. We felt that the addition of ChemSpider to our offerings would dramatically enhance our existing web presence and provide the critical mass of content necessary to make chemical structure searching meaningful. The ChemSpider platform is evolving into an environment for networking chemical scientists and will soon provide the facility for online collaboration and interaction, thereby offering significant opportunities for our learned and professional activities.

Our prediction of the future sees scientists requesting chemical information on demand. Increasingly a scientist must retrieve data from many domains and integrate them without the help of experts. We envisage the future of information provision in chemistry to be one in which chemists have access to a significant number of disparate, publicly-available online repositories of data. ChemSpider goes a long way to fulfilling the RSC Publishing vision. With ChemSpider the future can be a common interface to many diverse public repositories, including structures associated with chemistry articles, provided by a central website. With quality assurance and effective data curation in place, this platform is going to become an authoritative body co-managed by the community.

RSC ChemSpider team (L-R): Richard Kidd (informatics manager), Antony Williams (VP strategic development), Graham McCann (business manager), David James (ICT and informatics director), Valery Tkachenko (chief technology officer) and Sergey Shevelev (software engineer).

Integration with journals and databases

At the moment the majority of molecular data is published in journal articles; most is never captured in reusable electronic form. We are setting out to change this through collaboration with the publishing community. Through ChemSpider, the chemistry in journals, magazines and online content such as blogs will become more discoverable. As a first step, we are now working to enhance RSC journals and databases significantly, by providing the ability to query content by structure or substructure searching. Examples are databases such as Analytical Abstracts and Natural Product Updates. Imagine the power of being able to query all mass spectrometry-based analytical methods by entering some text-based conditions and including a particular substructure to pick out the appropriate articles. ChemSpider will enable such improved search capabilities. The RSC is already in discussions with a number of other publishers to integrate their data and we are keen to identify further collaborations. Our semantic markup technology (known as RSC Prospect) should be significantly enhanced by access to ChemSpider. In the future, the markup of a single article should be able to direct the user to chemical vendors, analytical data, associated patents, information in other resources such as Wikipedia and to other related articles, with ChemSpider as the hub.

ChemSpider’s future and semantic connections

ChemSpider hopes to extend its reach into the chemistry community. It aims to improve the quality of available information and give increased access to chemistry-related information. There are many types of data and information that can be associated with chemical compounds and made available to the benefit of the chemistry community. As examples, the association of analytical data has already been demonstrated, as has integration with patent searches; access to reaction synthesis protocols is presently in progress.

ChemSpider also hopes to provide access to online tools and services. It already offers tools for the prediction of certain chemical properties for chemists to take advantage of. An increasing number of software algorithms, provided by members of the community, will be added to the system and provided via the ChemSpider hub.

Finally, the project aims to enhance semantic publishing in chemistry. As semantic technologies such as RDF are layered onto the system, the cheminformatics community will be able to link and reference ChemSpider as the central structure resource on the internet. If this happens, the vision of a truly connected web of chemistry can be realised.
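To make the idea of linking and referencing a central structure resource a little more concrete, here is a small sketch of what RDF statements about a compound record could look like, written out in the N-Triples line format. The compound and external-record URIs are hypothetical example.org addresses, not real ChemSpider identifiers, and the label is a placeholder; only the `rdfs:label` and `owl:sameAs` predicates are standard vocabulary terms.

```python
# Standard vocabulary predicates (RDF Schema and OWL).
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def uri_triple(subject, predicate, obj):
    """Format a triple whose object is another resource (a URI)."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def literal_triple(subject, predicate, text):
    """Format a triple whose object is a plain string literal."""
    # Escape backslashes and double quotes for the N-Triples literal.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{subject}> <{predicate}> "{escaped}" .'


# Hypothetical URIs, used purely for illustration.
compound = "https://example.org/compound/1"
other_record = "https://example.org/other-db/record/77"

triples = [
    literal_triple(compound, RDFS_LABEL, "example compound"),
    uri_triple(compound, OWL_SAME_AS, other_record),
]
print("\n".join(triples))
```

Statements of this kind are what would let other databases and articles point at one shared record for a structure rather than each keeping an unconnected copy.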

Antony Williams is VP of strategic development, ChemSpider, for the Royal Society of Chemistry
