
DATA STORAGE



Will banking data improve research output?


A UK-funded project is blurring the distinction between lab notebooks, journals, and libraries, by storing data and results electronically. Peter Rees explores the implications


Experimental results have always gone missing or, perhaps, have simply not been properly recorded. If the margin of Diophantus' Arithmetica had been wider, Fermat might have been able to write out his proof of his last theorem. Being able to check Fermat's reasoning might then have spared generations of mathematicians from tearing their hair out in search of the missing solution.

But what if all the important details of a scientific experiment could be captured electronically from the beginning, and published in a digital library for all to read, thus saving duplicated effort and erroneous conclusions? This could be one of the outcomes of eBank, a 12-month project that will look at how research data could be used over and over again for further research and teaching.

eBank promises to dissolve the boundary between laboratory notebooks, journals, the library, and teaching resources. Lessons learned during the project will help develop an information architecture - a set of standards and protocols - that will allow staff and students in UK higher education to publish, and have access to, electronic information as part of their learning and research activities.

The project mixes technical development work on databases and software, desk-based research, and consultations with researchers and lecturers. It will also bring together digital librarians and computer scientists working on grid computing. The project is headed by Dr Liz Lyon, of UKOLN, at the University of Bath, in partnership with the University of Southampton, and PSIgate at the University of Manchester. It is being funded by the Joint Information Systems Committee - advisors to the UK's Higher Education Funding Council - and is a follow-up to JISC's ePrints UK project, which involved development of a repository for research publication metadata (each entry will be a sort of electronic reference card describing the ePrint papers). It also builds on work carried out by the eScience Combechem Project, which was funded by the Engineering and Physical Sciences Research Council. The £2.2m Combechem project is spread between Southampton and Bristol University, and embraces chemistry, statistics, and computer science.

Combechem's lead investigator at Southampton University, Dr Jeremy Frey, says: 'The eBank project deals with the end-to-end process from the laboratory all the way through to the publication. We are very concerned to make sure that all the information and background data, the data from which it is derived, is made available.'

The reasons for publishing detailed experimental data electronically are diverse. 'It's all the usual things: the provenance - it's very important to show what you did; social responsibility - saying that, for publicly funded research, you should make all the data available,' says Frey. There is scientific fraud, though 'that's a minor issue'. But Frey claims that it is mostly self-interest on the part of scientific researchers. 'You can often think of better things to do with the data later on,' he says. But by then it could be lost, or no longer available in an accessible form, or no-one might be quite sure of the context it was gathered under - perhaps the person who carried out the experiment has left the university. 'Data is so easily mislaid. We know it's there; it's been backed up; but how do you find the data you want?' asks Frey. And a request for information may come years later, which sometimes forces scientists to do ludicrous things: 'Like take a publication, scan in the picture, and re-measure up the graph to get our points off it, simply because another researcher may want the data for something the author hadn't thought of.'

This means thinking hard about how the information is collected in the first place. So one part of the project is working on a smart laboratory. This is more than just an electronic notebook, says Frey. 'Because those have sort-of come and gone, and have not been ideal'. To create an environment that will capture all the necessary information calls for new ways of getting computer scientists talking to bench chemists. 'That means you need to collect the data in a way that you know you are going to be able to make it available, and you need to give the people tools to do it, so they don't need to make extra effort'. Enter Smart Tea, a joint project between chemists and computer scientists to make tea as if in a chemistry experiment. By describing the making of tea in the form of an experiment, the aim is to generate a better understanding of the laboratory environment, and what chemists actually do, so computer scientists can design a smart laboratory that will be simple and easy to use.

'We want to make the collection of data even easier and the payback really quick,' says Frey. If it's not easy, people won't be bothered to use it. 'If you make it easier to do, and the reward is quick, they will do it.'

Frey gives developments in x-ray crystallography as another example of how useful the eBank project will be. The process is now becoming very fast. 'Structures that once took weeks, or even months, can be completed in five minutes or an hour,' he says. So chemists refer to structures in scientific papers, but no longer publish the x-ray data.

Southampton University is home to the Engineering and Physical Sciences Research Council's National Crystallographic Service. Chemists with molecular structures to solve can send their crystals to experts at Southampton, who try to get an x-ray diffraction pattern and solve the structure. Because the process is now so fast, Southampton is looking to automate the sending out of structures. This will mean they may only have been checked by computer. 'It is currently going to refer to an e-print page, which will say we have determined the structure of this compound and this is what it is,' says Frey.

But what is the validity of the data? 'It could be viewed as "user beware", with the whole community acting as a referee, but that's only really feasible if the experts who are alerted, or want to check, or do a better job of refining it, can get the original data,' says Frey. eBank will help solve the problem by publishing the data and making it searchable.

Crystallography is a good example because of the nature of the data, and the checks you can do on it. 'We believe this is important throughout the whole of chemistry, and more widely,' says Frey.

There's another good reason for starting with crystallography. 'It's fairly clear what data is useful,' says Frey. But what to publish in other areas? That's one of the aims of the project, to highlight this issue, says Frey. 'It's an evolving culture of where you draw the line between the journal, the institution, and the research group. And this is one of the things we are looking at under eBank, together with open archiving and self-archiving.'

eBank researchers will be talking to the users and producers of data, and will produce a report on their views.

'On the whole we are following the procedure that, once it's published, the trail that leads to the data should be made available. But who's responsible for keeping it, and making it available, and for how long, are issues that we are having to try and explore.' And then there are the usual issues about copyright and ownership of data.

As well as capturing information for future re-use, this approach will aid collaborative research and promote new ways of working, says Frey. eBank will make collaboration easier 'because it will be easier to see what everyone is doing,' he says. 'Making sure you can show what you have done is really important in interdisciplinary work, because different disciplines make different assumptions. It makes available the tools for easy collaboration. When you start to dream of doing the experiment, you should dream of doing it in a collaborative context, where the digital world enables the collaboration.' For example, the person who made the crystal, and the person at Southampton determining its structure, can communicate more readily about the circumstances in which the crystal was made - such as the solvent used to prepare it. This should save time and effort.

The technical side of the project will be to provide a repository of metadata, describing the research data, and to build links to associated e-prints and peer-reviewed articles. Combechem will supply the data for eBank initially, but data from other sources will be added in due course. The metadata will be stored in the central ePrints database, alongside other e-print records, so the research data must be accurately described in an agreed form. The work will draw on the experience gained in the ePrints UK Project. The ePrints archive uses the Open Archives Initiative protocol for metadata harvesting, which helps identify and replicate material from different e-print archives, so that searches can be carried out across various institutional and subject e-print archives.
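
For readers unfamiliar with metadata harvesting, the sketch below shows roughly what a harvesting request and response look like under the Open Archives Initiative protocol. It is an illustration only, not eBank's or ePrints UK's software: the repository address, the sample record and the function names are all invented for the example, and it assumes records are exposed in the protocol's basic Dublin Core format (oai_dc).

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace for Dublin Core elements inside an oai_dc record.
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for a repository.

    base_url is a hypothetical repository endpoint; 'verb',
    'metadataPrefix' and 'set' are standard protocol arguments.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def extract_titles(xml_text):
    """Return every Dublin Core title found in a harvested response."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter("{%s}title" % DC_NS)]


# An invented, heavily shortened response used only for illustration.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Hypothetical crystal structure dataset</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(build_list_records_url("http://example.org/oai"))
    print(extract_titles(SAMPLE_RESPONSE))
```

A harvester would repeat requests like this against each archive and merge the returned records into one searchable index, which is what makes cross-archive searching possible.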

ePrints is developing a series of services, through which universities and colleges can access e-print papers available from Open Archive repositories, particularly in the UK. Metadata is automatically classified by subject, and a check is made on the author's name. A third web service, a citation analysis service, will be offered through the Open Citation project team based at the University of Southampton. This service will scan citation information in the document text to form computer-readable citations in the form of OpenURLs.
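
To give a flavour of a computer-readable citation, the sketch below turns an already-parsed reference into an OpenURL-style link. It is an assumption-laden illustration rather than the Open Citation team's method: the resolver address and the sample citation are invented, the key names follow the early OpenURL convention (genre, aulast, title, volume, spage, date), and the harder step of extracting those fields from document text is not shown.

```python
from urllib.parse import urlencode

# Key order used when serialising a citation (early OpenURL-style keys).
OPENURL_KEYS = ["genre", "aulast", "atitle", "title", "volume", "spage", "date"]


def citation_to_openurl(resolver_base, citation):
    """Serialise a parsed citation (a dict) as a query string on a resolver.

    resolver_base is a hypothetical link-resolver address; keys that are
    missing or empty in the citation are simply left out.
    """
    params = [(key, citation[key]) for key in OPENURL_KEYS if citation.get(key)]
    return resolver_base + "?" + urlencode(params)


if __name__ == "__main__":
    # An invented reference, for illustration only.
    example = {
        "genre": "article",
        "aulast": "Smith",
        "title": "Example Journal",
        "volume": "12",
        "spage": "34",
        "date": "2003",
    }
    print(citation_to_openurl("http://resolver.example.org/openurl", example))
```

Because the citation travels as structured fields rather than free text, a library's resolver can decide where to send the reader, for instance to a subscribed journal copy or to an open e-print.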

eBank will be delivered through the Manchester-based Physical Sciences Information Gateway (PSIgate) hub. PSIgate provides free access to good quality internet resources for researchers and students. The interface will allow users to navigate back and forth between e-print records and research data records.

The ePrint and eBank services should give greater impact to researchers' papers by publishing them more widely and more quickly. But it is the educational application of these services that excites Frey most. Through eBank, original research data could be referred to in electronic teaching materials, or a student could follow an electronic trail back from an online course or published article. 'eBank's impact on teaching will, I think, be significant,' says Frey.

Frey says that one of the best ways to motivate students who have done two years in undergraduate laboratories is to get them doing research projects. 'That's the way to inspire them,' says Frey. eBank will help dissolve the boundary between teaching and research, he believes. And it could form a model for even wider dissemination of science - making the virtual laboratory bigger. Frey cites as an example another JISC-funded project, for schoolchildren, that will use distributed computing to search for molecules that could be targets for anti-malarial drugs. E-malaria won't be some passive screensaver, but will involve the children in chemical research. If a target is found and synthesised, the children (and their parents and teachers) will be able to keep up with its progress through links to the researchers, and perhaps be drawn into science.

