
the chemical brotherhood of cml



It has always been more difficult for chemistry to keep up in the Internet age; but, as David Bradley discovers, a new language could herald a new era for the discipline.


Transparency in sharing and transferring information across the Internet is a must. Today, a spectroscopist's data-stream flows as readily as the outpourings of the Human Genome Project. But there is one group of scientists that could have become marginalised were it not for the pioneering work of a handful of their number – the chemists.

The problem with chemistry is how to represent what chemists are talking about. In a lecture, the chemistry professor can easily scratch and scribble a molecular structure, adding notes on the chalkboard, and (hopefully) the students will understand. The sharing of chemical information around the globe is not so trivial. Graphics programs lack chemical savvy, so once a molecule is drawn the resulting image retains none of its atom connectivity.

Chemists do have their own drawing packages, such as ChemDraw and ChemSketch, that retain inherent knowledge about a molecule drawn using them, and files can usually be interconverted via standards such as the mol file format. But chemistry is richer than its structural representations. There are countless nuances that might be associated with a structure: its molecular weight, the distribution of isotopic ratios of its carbon atoms, its infrared, ultraviolet and nuclear magnetic resonance spectra, its X-ray crystallographic data... the list goes on. How can this additional information be brought to hand transparently?

Thankfully, an answer is forthcoming: CML, or chemical mark-up language.

Chemical mark-up expands the electronic boundaries for chemistry

The eXtensible Mark-up Language provides a universal format for structured documents and data on the Web and so offers a way for scientists and others to carry a wide range of information types across the net in a transparent way. All that is needed is an XML browser. 'XML is a framework from which problem-specific formats derive,' explains CML programmer and organic chemist Beda Kosata of the Prague Institute of Chemical Technology.

But there remains a barrier. How can the seeming simplicity of XML carry all those atomic coordinates, spectra and connectivity information in a transparent manner? The clue lies in the semantics: it is not called eXtensible for nothing. In recent years a pioneering group of chemists has developed a chemical language within the XML framework that allows chemical information to be transported easily, then retrieved and displayed on a user's screen with the same few clicks that other scientists need to reach their own knowledge bases.

CML represents a sea change in the management of molecular information. It has been described as 'HTML for molecules' but it is much more, with the scope to span disciplines from the smallest inorganic molecules (carbon monoxide, water and ozone) to vast macromolecular structures such as polymers, proteins and DNA – and it can handle quantum chemistry. The nickname nonetheless hints at its potential to bring together disparate chemical information sources – connectivity, spectra and X-ray data – into a coherent and structured document format without loss of chemical nous. Chemists are granted a far greater freedom of expression.
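To make the idea concrete, here is a minimal sketch in Python of how a CML-style document can carry atoms and their connectivity in a form a program can query. The water fragment is written for illustration only: the element and attribute names (molecule, atomArray, atom, elementType, bondArray, bond, atomRefs2) follow CML conventions as commonly published, but the namespace declaration is omitted for brevity and the helper functions are hypothetical, not part of any CML toolkit.

```python
import xml.etree.ElementTree as ET

# Illustrative CML-style description of water (assumed minimal document,
# namespace omitted for brevity).
WATER_CML = """
<molecule id="water">
  <atomArray>
    <atom id="a1" elementType="O"/>
    <atom id="a2" elementType="H"/>
    <atom id="a3" elementType="H"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a1 a3" order="1"/>
  </bondArray>
</molecule>
"""

# Rounded standard atomic weights for the few elements used here.
ATOMIC_WEIGHT = {"H": 1.008, "C": 12.011, "O": 15.999}


def read_molecule(cml_text):
    """Return (atoms, neighbours): atom id -> element, atom id -> bonded ids."""
    root = ET.fromstring(cml_text)
    atoms = {a.get("id"): a.get("elementType") for a in root.iter("atom")}
    neighbours = {atom_id: set() for atom_id in atoms}
    for bond in root.iter("bond"):
        first, second = bond.get("atomRefs2").split()
        neighbours[first].add(second)
        neighbours[second].add(first)
    return atoms, neighbours


def molecular_weight(atoms):
    """Sum the atomic weights of every atom in the molecule."""
    return sum(ATOMIC_WEIGHT[element] for element in atoms.values())


atoms, neighbours = read_molecule(WATER_CML)
print(atoms)                       # which element each atom is
print(sorted(neighbours["a1"]))    # atoms bonded to the oxygen
print(molecular_weight(atoms))     # derived property, computed from the mark-up
```

Unlike a picture of the molecule, the mark-up lets software answer questions such as "what is bonded to the oxygen?" or "what is the molecular weight?" without a human reading the structure.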

Not just chemical structures but their associated properties can now be easily communicated across the Internet

Chemical Internet pioneers Peter Murray-Rust, Henry Rzepa and Christopher Leach introduced the concept of a Chemical Mark-up Language at the American Chemical Society meeting in August 1995. They realised that molecular modelling programs, like MOPAC, generate interesting chemical information only in human-readable form. The output is not 'self defining' and, worse, explains Rzepa, it was prone to change between versions. 'By making a special version of MOPAC "CML aware" (by my student Chris Leach) and capable of both reading and writing the early CML, we showed how information could be made to "round trip" not only in MOPAC but through any chemical information system that could "add value" to chemistry.'
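The "round trip" Rzepa describes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the MOPAC work itself: the carbon monoxide fragment, the structure and add_formula helpers, and the use of a formula element with a concise attribute are all chosen for the example. A program reads the mark-up, "adds value" by attaching a derived formula, writes the document back out, and the original atoms and bonds come through unchanged.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Illustrative CML-style carbon monoxide (assumed minimal document).
CO_CML = (
    '<molecule id="co">'
    '<atomArray>'
    '<atom id="a1" elementType="C"/>'
    '<atom id="a2" elementType="O"/>'
    '</atomArray>'
    '<bondArray><bond atomRefs2="a1 a2" order="3"/></bondArray>'
    '</molecule>'
)


def structure(cml_text):
    """Extract the atoms and bonds a downstream program would rely on."""
    root = ET.fromstring(cml_text)
    atoms = [(a.get("id"), a.get("elementType")) for a in root.iter("atom")]
    bonds = [(b.get("atomRefs2"), b.get("order")) for b in root.iter("bond")]
    return atoms, bonds


def add_formula(cml_text):
    """Hypothetical 'add value' step: attach an element-count formula."""
    root = ET.fromstring(cml_text)
    counts = Counter(a.get("elementType") for a in root.iter("atom"))
    concise = " ".join(f"{element} {n}" for element, n in sorted(counts.items()))
    ET.SubElement(root, "formula", {"concise": concise})
    return ET.tostring(root, encoding="unicode")


enriched = add_formula(CO_CML)
# The round trip preserved the structure while new information was added.
assert structure(enriched) == structure(CO_CML)
print(enriched)
```

Because every piece of information is labelled, a program that does not understand the added formula can simply ignore it, which is what makes the output 'self defining' in a way free-text log files are not.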

Mark-up languages are nothing without their translator, and Nottingham University virtual chemist Peter Murray-Rust, now at the Unilever-Cambridge University molecular informatics centre, soon revealed a CML browser written in Tcl/Wish. Gradually, the language's formalisms evolved and, with the emergence of the Java platform-independent programming system in 1996, Murray-Rust brought a new species to light – JUMBO – a CML browser. JUMBO is, naturally, an example of CML-aware software.

Rzepa, Murray-Rust and Michael Wright seminally illustrated the publication possibilities last year in a paper in the New Journal of Chemistry (2001, 618-634). 'Here, the molecules and information about them are precisely specified within the body of the article text,' explains Rzepa, 'if you feed this article to any CML-aware program (such as JUMBO) it should recognise the molecules, and display them appropriately.' Rzepa compares this with the Adobe PDF system commonly used by online publishers, which requires a human reader to extract the chemical information from a picture of a molecule. This is, unlike machine-readable chemical information, slow, error prone, non-reusable and susceptible to inducing boredom!

'We have also shown how an XML/CML article could be automatically transformed to e.g. Acrobat form for printing, or to another XML language SVG for high quality viewing. It is enabling these "added value" transformations that XML offers for the first time, and which we believe is why the use of XML and CML so profoundly anticipates a radical revolution in the primary, secondary and tertiary publishing processes.'

The CML website provides examples of its handling capabilities including datafiles, such as the International Union of Crystallography's CIF and the Protein Data Bank PDB formats. It can handle compound data cards, such as those being produced by the SELFML project to associate molecules and mixtures with their physicochemical properties and Materials Safety Data Sheets (MSDS). The system also allows for the lossless interconversion of various older formats, such as the Mol file, Sybyl MOL2, JME, XYZ, SMILES, PDB and CIF. And it can access the log files of quantum mechanical programs and display sensible graphics and information from them.
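As a rough illustration of such interconversion, the Python sketch below turns a small XYZ file (atom count, comment line, then one "element x y z" line per atom) into CML-style mark-up and back again. The coordinates are illustrative values, the function names are hypothetical, and the use of x3, y3 and z3 attributes for Cartesian coordinates is an assumption based on published CML conventions; real converters handle far more than this.

```python
import xml.etree.ElementTree as ET

# A small XYZ file for water; the coordinates are illustrative values only.
WATER_XYZ = """3
water
O  0.000  0.000  0.117
H  0.000  0.757 -0.469
H  0.000 -0.757 -0.469
"""


def xyz_to_cml(xyz_text, molecule_id="m1"):
    """Build a CML-style molecule element from XYZ text."""
    lines = xyz_text.strip().splitlines()
    count = int(lines[0])
    molecule = ET.Element("molecule", {"id": molecule_id})
    ET.SubElement(molecule, "name").text = lines[1].strip()
    atom_array = ET.SubElement(molecule, "atomArray")
    for index, line in enumerate(lines[2:2 + count], start=1):
        element, x, y, z = line.split()
        # Coordinates are kept as the original strings so nothing is rounded.
        ET.SubElement(atom_array, "atom", {
            "id": f"a{index}", "elementType": element,
            "x3": x, "y3": y, "z3": z,
        })
    return molecule


def cml_to_xyz(molecule):
    """Write the molecule element back out as XYZ text."""
    atoms = list(molecule.iter("atom"))
    rows = [str(len(atoms)), molecule.findtext("name", default="")]
    for atom in atoms:
        rows.append(" ".join(
            [atom.get("elementType"), atom.get("x3"), atom.get("y3"), atom.get("z3")]
        ))
    return "\n".join(rows) + "\n"


molecule = xyz_to_cml(WATER_XYZ)
print(ET.tostring(molecule, encoding="unicode"))
print(cml_to_xyz(molecule))
```

Keeping the coordinate strings exactly as written is a deliberate choice here: it is the simplest way to guarantee that the values survive the trip through CML and back without rounding.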


An exciting recent development in the CML world is the creation of a sub-site on the Open Source development site SourceForge.net. SourceForge.net provides a market for free software as well as services for developers including project hosting, version control, bug and issue tracking, project management, backups and archives, and communication and collaboration resources. Such freedom is a key issue, according to Jiri Jirat, who is based in Prague and is working on the ZVON.org project for Systinet Corporation: 'Using CML you become application and vendor independent, you are not bound to a proprietary format, anybody can write their own application based on CML and you can use this format without any legal or copyright problems for data interchange and storage.'

At present, the CML project is one of more than 37,000 projects hosted by SourceForge.net, although unfortunately it has to be categorised under bioinformatics as there is no chemistry or molecular science grouping. The CML project has three administrators: Murray-Rust, and Imperial College's Henry Rzepa and Michael Wright. All three are also developers; Rzepa and Murray-Rust devised the pioneering chemical MIME type for the Internet.

Among the other developers are Daniel Zaharevitz, Jirat and Miloslav Nic formerly of the Prague Institute of Chemical Technology. Among the earliest CML-aware applications was Peter Ertl's JME chemical editor, while the SourceForge projects JChemPaint and Jmol also have CML support, as does, more recently, Kosata's BKChem, which brings us full circle. The packages allow one to create bond-by-bond drawings of molecules, generate structures from templates for common carbon rings, add arrows, apply rich text, and align, scale and rotate molecules. Platform independence is vital. 'XML [and by definition] CML itself is designed to be platform independent and multilingual, thus enabling the development of fully internationalised data exchange format for scientists from all over the world,' explains Jirat.

CML software is not all being developed in academia, however. Chemaxon's Marvin is a Java-based chemistry package that comes with an editor and conversion utilities for CML and, although professionally developed, is still free. MarvinSketch/Swing is an applet for editing and visualising molecules on a Web page while its companion, MarvinView/Swing, can be used for viewing said molecules.

Murray-Rust and his colleagues are collaborating with Dan Zaharevitz (at the US National Cancer Institute) to create 250,000 molecules in CML. This will form the core of an open molecular resource, which will collect molecular data both experimental and theoretical. 'Increasingly chemists (and non-chemists) will be able to ask "What does this molecule do?" and "Why is it important?" and since all the information will be in XML it will be relatively straightforward to answer,' Murray-Rust says. 'It is then a short step to "where can I find a molecule with certain properties?" – a chemical-specific Web-search engine,' he adds.

XML can assist in regulatory and legal processes, where authenticity and validity are essential. 'We recognised this early on,' says Rzepa, 'and showed how having a well-defined structure to a document could allow components to have digital signatures attached, signed by whoever was responsible for creating or authorising the information.'
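The principle of signing one component of a structured document can be sketched in Python as follows. Two loud caveats: this is not the scheme Rzepa and colleagues used, and it substitutes a keyed hash (HMAC-SHA256 from the standard library) for a true public-key digital signature, purely so the example runs without extra packages. The key, the function names and the molecule fragments are all hypothetical. The sketch needs Python 3.8 or later for ET.canonicalize.

```python
import hashlib
import hmac
import xml.etree.ElementTree as ET


def canonical_bytes(element):
    """Serialise one component in canonical XML so formatting does not matter."""
    raw = ET.tostring(element, encoding="unicode").strip()
    return ET.canonicalize(raw, strip_text=True).encode("utf-8")


def sign_component(element, key):
    """Stand-in for a digital signature: an HMAC over the canonical form."""
    return hmac.new(key, canonical_bytes(element), hashlib.sha256).hexdigest()


def verify_component(element, key, signature):
    """Check that the component still matches the signature it was given."""
    return hmac.compare_digest(sign_component(element, key), signature)


KEY = b"demo-key-not-for-real-use"  # hypothetical shared secret

original = ET.fromstring(
    '<molecule id="co"><atom id="a1" elementType="C"/>'
    '<atom id="a2" elementType="O"/></molecule>'
)
signature = sign_component(original, KEY)

# Same chemistry, attributes written in a different order.
reordered = ET.fromstring(
    '<molecule id="co"><atom elementType="C" id="a1"/>'
    '<atom elementType="O" id="a2"/></molecule>'
)
# The oxygen has been changed to nitrogen after signing.
tampered = ET.fromstring(
    '<molecule id="co"><atom id="a1" elementType="C"/>'
    '<atom id="a2" elementType="N"/></molecule>'
)

print(verify_component(reordered, KEY, signature))
print(verify_component(tampered, KEY, signature))
```

Canonicalising before hashing is the design choice that matters: a harmless rewrite of the same mark-up should still verify, while any change to the chemistry should not. A production system would use a public-key scheme so that anyone can verify without holding the signer's secret.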

'There is steadily growing interest in CML and there is no alternative,' says Murray-Rust. The bottom line is that having started with a few scratchings on the cave (sorry, lecture room chalkboard), chemists are now evolving the Internet to allow them to share and discuss the very fundamentals of their science.


LINKS

CML website
http://www.xml-cml.org/
The CML Project on SourceForge.net
http://cml.sourceforge.net/
Chemaxon's Marvin
http://www.chemaxon.com/marvin/
CML at ZVON
http://zvon.org/xxl/CML1.0/Output/index.html
BKChem
https://savannah.gnu.org/projects/bkchem or http://www.freesoftware.fsf.org/bkchem/
JChemPaint, an alternative Java 2 CML chemistry-drawing package
http://jchempaint.sourceforge.net
Jmol
http://jmol.sourceforge.net/
New J Chem article is available online at...
http://www.rsc.org/suppdata/NJ/B0/B008780G/index.sht
The latest version of the powerful MOPAC program is available at...
http://www.schrodinger.com/Products/mopac.html

