Published data needs standards

Access to original datasets is becoming increasingly important in research, but finding the data can be a challenge. Toby Green, head of publishing for OECD, argues the case for standards to help researchers find and cite published datasets and tables.

In early June, one of President Obama’s pre-election promises was delivered with the launch of Data.gov, a portal for US Federal Government data. It’s an elegant website full of modern tools like user ratings and a range of widgets. So, providing you know about Data.gov, you can now find and download great chunks of US data in ‘machine-readable’ format and you’re invited to participate by developing ‘apps’ and using the data for your research.

Each dataset has detailed metadata describing where the data came from, how it was gathered and calculated, and when. Each dataset also has a unique ID. ACRES, a dataset on brownfield properties, for example, has ID number 7. You are invited to cite it as ‘EPA Geospatial Download’ with a URL that may or may not prove to be persistent. Both these pieces of publishing metadata – the ‘7’ and the URL – are producer-centric.

Let’s contrast this with all the academic papers, journal articles and monograph chapters that will be written using ACRES data from ‘EPA Geospatial Download’. Regardless of publisher, each of them will be cited and catalogued according to long-established, machine-readable, international standards. Publishers will be able to interlink the articles and monograph chapters using the DOI-based CrossRef system, which lets readers hop among related articles and chapters whoever published them. Librarians will use MARC-driven systems to catalogue the journals and books containing these articles and chapters. Abstracting and indexing systems will track them too. In short, the papers and chapters will be interwoven into a scholarly information network on a permanent, persistent and professional basis.

So, why not the data too? Has Data.gov slipped up?

No, Data.gov is doing the same as all other data producers – building an independent website and reckoning on the likes of Google and word-of-mouth to bring it traffic. In spite of the obvious importance of original data in research and, consequently, in research publications, there is no standard system for publishing original data. Data.gov, like all other data producers, has no citation or cataloguing standard to help it.

However, this may soon change. OECD (Organisation for Economic Co-operation and Development) is one of the world’s largest producers of internationally compatible social, economic and environmental data. It is also a publisher of books and journals. This is rare – most data producers do not also publish analysis – and it means that OECD’s publishing division has the challenge of handling both primary research publications and the original data on which the research was based.

Researchers want to access data

From its experience of publishing both types of resources, OECD has learned that readers do want to access underlying data. Since 2004, OECD has put links to data files under the charts, tables and graphs in its publications. In 2008, more than 900,000 Excel files were downloaded – proof, if it were needed, that readers will grab the data if it is easily accessible. In parallel, usage of OECD’s datasets is huge. However, users have been unable to link easily from the data to OECD’s own analysis of its data.

OECD’s vision is that datasets should be as discoverable and citable as articles and book chapters and be integrated into the scholarly information network. Readers should be able to jump from a chapter to the data and from a dataset to a chapter. To this end, OECD has worked with bibliographic experts to develop a publishing metadata standard for datasets and data tables that is compatible with those for scholarly books and journals. The proposed standard was released via a white paper in April.

The standard in action

One reaction to the release of this white paper was ‘it’s all very well proposing a new standard – but who will put it into practice?’ Well, since June it has been put into practice in OECD’s new publishing platform, OECD iLibrary. Built using the new metadata system, the platform weaves OECD’s datasets together with its working papers, books and journals in a way that is compatible with CrossRef and library systems. OECD’s datasets are set to join the scholarly information system.

Not all datasets are the same and there are still unresolved problems on how to handle dynamic datasets – but it’s clear that datasets need to be integrated into the scholarly record on a permanent, persistent and professional basis. OECD is hoping that the ideas presented in the white paper, together with the practical expression of those ideas in OECD iLibrary, will inspire the industry to embrace datasets so that initiatives like Data.gov are no longer left out in the cold.

Further information

Green, T: We Need Publishing Standards for Datasets and Data Tables http://dx.doi.org/10.1787/603233448430