Mining for data, the new raw material

Alastair Horne reports on discussions about open data and data mining at the recent UKSG One-Day conference

Although open access was the main focus of November’s UKSG One-Day conference, open data remained an important secondary topic. Adam Tickell, provost and vice principal at the UK’s University of Birmingham, suggested that, in his role as a university manager, open data was actually a greater concern for him than open access.

In the UK government’s 2012 White Paper on open data, cabinet minister Francis Maude suggested that ‘Data is the 21st Century’s new raw material’. Tickell noted that the explicit inclusion of data in the government’s drive towards greater openness means that universities and research institutions must now address the challenges and opportunities it presents.

Making the data behind research outputs openly available raises logistical challenges – how might one easily and effectively make available the terabytes of data being generated at CERN, for instance? – but it also offers considerable possibilities, not least enhanced scientific capacity and greater opportunities for exposing research misconduct.

Peter Murray-Rust, reader in molecular informatics at the University of Cambridge and the final speaker in the afternoon session, took a different approach to these challenges and opportunities in his talk. An ardent and vocal campaigner for the benefits of data mining, Murray-Rust began by pointing out that the vast majority of the scientific data that we spend billions of dollars creating is thrown away. Deemed superfluous by publishers, this data is actually of enormous value: not only is it vital to any attempt to validate or reproduce results, but it is also capable of re-use elsewhere.

Key to extracting full value from that data, according to Murray-Rust, is mining: automated processing that could, he claimed, extract a hundred million scientific facts, build reusable objects from them, and even create new businesses that might earn the UK alone £500 million annually. Text and data mining, he suggested, quoting John McNaught, could even save lives.

The only obstacles to all this added value, Murray-Rust told his audience, are publishers. Though some seemed recently to have experienced a Damascene conversion to the open data cause, many publisher contracts explicitly prohibit the mining of their content, and institutions that ignore such prohibitions can find themselves cut off from access to papers.

Murray-Rust rejected publishers’ concerns that allowing such mining might overload their servers with automated requests, or lead to the resulting data being distributed freely without their consent, and dismissed their attempts to regulate the process through licensing as akin to taxing spectacles. ‘The right to read’, he insisted, ‘is the right to mine’. His Ami project, for example, aims to liberate the data held within PDF articles and convert it into usable HTML and CSV files. Discussions with interested parties outside publishing – including the British Library and Mozilla – are ongoing.
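By way of illustration only – this is a minimal sketch of the kind of transformation Murray-Rust describes, not Ami’s actual implementation – the short Python fragment below uses the pdfminer.six library to pull the raw text out of a PDF article and writes any melting-point statements it finds to a CSV file. The file paths and the pattern matched are hypothetical.

```python
import csv
import re

from pdfminer.high_level import extract_text  # pip install pdfminer.six

# Hypothetical paths, for illustration only.
PDF_PATH = "article.pdf"
CSV_PATH = "extracted_facts.csv"

# Recover the plain text stream from the PDF. Real mining tools work far
# harder than this, reconstructing structure that the PDF format discards.
text = extract_text(PDF_PATH)

# A toy pattern for one kind of 'scientific fact': a melting point
# reported as, say, 'm.p. 123-125 °C'. Ami's actual rules differ.
MP_PATTERN = re.compile(r"m\.p\.\s*(\d+)(?:\s*[-–]\s*(\d+))?\s*°C")

with open(CSV_PATH, "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["mp_low_celsius", "mp_high_celsius"])
    for match in MP_PATTERN.finditer(text):
        low = match.group(1)
        high = match.group(2) or low  # single-value melting points
        writer.writerow([low, high])
```

Even this toy version makes the point: facts locked in a typeset PDF become rows in a file that any spreadsheet or program can re-use.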

Murray-Rust finished his talk by urging libraries not to sign contracts that prohibit data mining. It was left to Gemma Hersh, head of public affairs for the Publishers Association, to offer an alternative perspective in the short question-and-answer session that followed, assuring the audience of publishers’ efforts to find a workable solution to what they felt were genuine problems raised by data mining, and reminding them that Murray-Rust had recently walked out of talks attempting to resolve those issues. Sadly, that discussion was cut short, but data mining is likely to remain a controversial issue for some time to come.