What about data?

Data is a buzzword today but it can mean many different things, writes Michael Clarke

There is a lot of excitement about data at the moment in STM publishing but when people talk about it they can mean many different things.

First and foremost there is research data itself. A lot of discussion is currently underway about how to make research data more accessible and how to make sure it is properly archived. There is a need for more data repositories that can handle the diverse array of research data and maintain it over time.

There are some interesting data archives springing up that are very specialised. The geneticists are way out in front on this with FlyBase, WormBase and the like. But the notion is spreading. Archaeologists recently launched one called tDAR. DataONE is under development to archive ecological and environmental data. Dryad is providing archiving of data underlying peer-reviewed articles in the basic and applied biosciences. Given the myriad data repositories, a lot of work is being done on making these data sets linked and interoperable so they can be interrogated and mined. This is one of the goals for the semantic web championed by Tim Berners-Lee and others.

Second, there is the publishing of research data – or of linking to it from journal articles. There are questions here about what is appropriate to publish and what sort of demand you can place on peer reviewers. Publishing supplementary data is becoming more and more common, however, as more and more data is generated. I was glad to see NISO get involved recently and begin to recommend some standards around this.

Third, you have publication metrics. There is a lot of experimentation today around article-level metrics and alternatives to the impact factor, or altmetrics. These include looking at citations to articles independent of the journal they appear in. This makes a lot of sense as, even in the best journals, there are some duds. Similarly, in the second- and third-tier journals there are some gems. Public Library of Science (PLoS), with its open-access mega journal PLoS ONE, is a particular champion of article-level metrics as one way to help users navigate through the wealth of content published in the title.

PLoS is also experimenting with a number of altmetrics that, at the moment, are of questionable value. For example, usage and coverage in social media probably tell us more about the size of the author’s research field and his or her ability to network than they do about the underlying science. The number of people who have bookmarked a paper in Mendeley is interesting but is again biased towards large fields (and, of course, towards the subset of scientists who use Mendeley). But still, the experimentation is interesting and welcome despite its limitations.

A fourth kind of data is usage data and some really interesting things are happening around the intersection of semantic metadata (really metadata of all kinds) and usage data. Publishers are beginning to cross-tabulate usage data with data about content to ask interesting questions. What kinds of content are different user groups interacting with? When members of a user group begin to look at a certain paper or set of papers outside their field, is that a signal of an interdisciplinary breakthrough? Are there ways to leverage these dynamic communities of interest to help readers find information more efficiently and to find information of relevance that they might have missed? And, of course, publishers are exploring how they can build on this information to generate revenue via product upsells and targeted advertising.
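
To make the idea of cross-tabulation concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the user groups, the article IDs, the subject labels and the helper names are invented for illustration and do not describe any particular publisher's system. It counts views per pair of user group and article subject, then reports what share of a group's reading falls outside its own field, which is one crude way of spotting the interdisciplinary signal described above.

```python
from collections import Counter

# Hypothetical usage events: (user_group, article_id). Invented for illustration.
usage_events = [
    ("cardiology", "a1"),
    ("cardiology", "a2"),
    ("cardiology", "a3"),
    ("genetics", "a3"),
    ("genetics", "a3"),
    ("oncology", "a2"),
]

# Hypothetical content metadata: article_id -> subject label.
article_subject = {"a1": "cardiology", "a2": "oncology", "a3": "genetics"}


def cross_tabulate(events, subjects):
    """Count views for each (user group, article subject) pair."""
    table = Counter()
    for group, article_id in events:
        table[(group, subjects[article_id])] += 1
    return table


def out_of_field_share(table, group):
    """Fraction of a group's views that land on subjects other than its own.

    Assumes, for simplicity, that group names and subject labels share
    one vocabulary, so 'outside the field' means subject != group.
    """
    total = sum(n for (g, _), n in table.items() if g == group)
    outside = sum(n for (g, s), n in table.items() if g == group and s != group)
    return outside / total if total else 0.0


table = cross_tabulate(usage_events, article_subject)
for group in sorted({g for g, _ in table}):
    print(group, round(out_of_field_share(table, group), 2))
```

In practice the subject labels would come from semantic metadata and the groups from registration or institutional data, and a rising out-of-field share would only be a prompt for a closer look, not evidence of a breakthrough in itself.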

This is the kind of user interaction that Amazon and others have been using for a long time but that is just starting to make inroads in STM. In some ways it is more interesting in STM than in consumer sectors because of the vast quantity of information; the goal is not simply to sell more saucepans to people who bought the ‘Joy of Cooking’ but rather to better understand how very smart people are using very complex information.

Michael Clarke is executive vice president for product and market development at Silverchair Information Systems. Look out for more of his thoughts in the interview in the June/July 2012 issue of Research Information magazine.