Driven by the data?

The mechanisms of scholarly communication are showing their age, writes Daniel Hook

Data has been driving a revolution in research for more than 20 years, but the recent trend of activities being described as ‘data-driven’ feels wrong.

‘Data-driven decision making’ suggests that the data is forcing or making the decision, rather than the person responsible for it. ‘Data-supported decision making’, on the other hand, is a much more appropriate description of what should, and in fact does, happen in most research. Research is changing as a result of the rise of data, and this is forcing other areas of the research ecosystem to change with it.

Increases in our data production and data processing capabilities now allow us to probe new areas of mathematics, physics, computer science, biology and beyond.
In these subjects, the link to data is clear and it is not surprising that increased data volumes have led to new tools and new policies to help researchers take advantage of data.

Perhaps less expected has been the availability of data from the now ubiquitous digital paraphernalia that we carry with us and that monitor our lives. Our online interactions, and the data streams that originate from our daily lives and our physical and mental wellbeing, are available in ways that could not have been anticipated even 20 years ago.

This, in part, has helped to fuel new studies in digital humanities. All these developments are still in their infancy and, as a result, the tools, policies and skillsets that researchers need, as well as the mechanisms to disseminate research output from these types of studies, are still in development.

Data is profoundly shaping our research experience, not only in terms of the problems that we can access, but also in the skills and tools that are needed. Few outside computer science will have had any formal training in data science and yet, for many fields, this is becoming not only a key skill but the key skill. The reproducibility problem that we see today was generated in part by an earlier revolution in research. As data first became available in greater volume, researchers seeking to use it were ill-equipped to manage the flow of data from collection activities, and had not been trained in the analytic tools, which in turn were not necessarily written to handle the kinds of data being passed to them. The relationship between researchers and statistical science 20 years ago was very much at the level of their relationship with data science today.

We still have not come to terms with the reproducibility issues generated by the deficiencies in tools and knowledge of the last few years.

While we have failed to deal with these issues, the problem is only getting worse, as tools become ever more powerful and data ever more plentiful. The prevalent approach to handling reproducibility is to try to spot a problem after it has happened. To succeed in making research reproducible we need to take a radically different approach: we need to design our research process with reproducibility built in.

We need to start by training all researchers in the methods of statistics and data science. This should become a prerequisite in any PhD, almost regardless of field. Data literacy is a key skill in today’s world and, given that most PhD candidates won’t stay in research, it is one of the most transferable skills that we can give them. For those who do stay in research, a solid training in data science will sensitise them to the new issues around data policies, ethics, storage, handling, analytics and communication.

In a few years we will have created tools with which researchers do not need to understand the inner workings of the ‘data black box’. In the same way that some children cannot do long division by hand, we may lose the ability to do basic data analysis. Yet understanding the data analysis that we do is critical to truly tackling the reproducibility crisis.

Next, we need to ensure that data is captured at source in an auditable manner. This places a significant burden on, and presents an opportunity for, equipment manufacturers to improve devices so that they are more aware of their surroundings. If a humble mobile phone knows where it is, its orientation, the light level and many other details of its surroundings, shouldn’t experimental equipment know exactly who is operating it and the details of its local environment: the air pressure, the magnetic field strength and beyond? Experimental equipment that automatically captured this context would start to create a systemic framework in which reproducibility is built in. Once data is produced, it should be stored in an auditable manner. That doesn’t mean it must be shared.
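To make the idea concrete, here is a minimal sketch in Python of what capturing a measurement together with its context at source might look like. The field names and the hypothetical ‘spectrometer-01’ instrument are illustrative assumptions, not a standard or any particular vendor’s interface.

import datetime
import getpass
import hashlib
import json
import platform

def capture_measurement(value, instrument_id, settings):
    # Record a measurement together with the context needed to audit it later.
    # All field names are illustrative assumptions, not a standard.
    record = {
        "value": value,
        "instrument_id": instrument_id,
        "settings": settings,              # e.g. gain, exposure, calibration file
        "operator": getpass.getuser(),     # who was operating the equipment
        "host": platform.node(),           # which machine acquired the data
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    # A content hash makes later tampering or silent edits detectable.
    record["sha256"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    return record

# Example: log a single reading from a hypothetical instrument.
print(json.dumps(capture_measurement(0.42, "spectrometer-01", {"exposure_ms": 100}), indent=2))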

It is understood that some data are sensitive, but the choice to delete should be a positive one, driven by the knowledge of the investigator, rather than a default arising from the data never having been captured in the first place. Data flow and transformation also need to be orchestrated far better than they are today.

Many research data flows (such as that from a piece of equipment in a lab, through some basic processing, to an output image) rely on custom-written code that in turn relies on the mind of the postdoc who worked in the lab three years ago. This is an uncomfortably recognisable reality to many. Tools such as Ovation (https://www.ovation.io/) and Riffyn (https://riffyn.com/) that help to streamline and systematise these processes, as well as protocol codifying languages such as Autoprotocol (http://autoprotocol.org/), now allow this to be done more professionally, while data analysis is starting to be systematised via tools such as Pachyderm (https://www.pachyderm.io/).
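As an illustration of what codifying a flow means in practice, and not the syntax of any of the tools above, the Python sketch below writes down each step’s inputs, parameters and outputs explicitly, so the pipeline no longer lives only in the memory of whoever built it; the step and file names are invented for the example.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    """A single, explicitly described stage in a data flow (illustrative only)."""
    name: str
    inputs: list
    outputs: list
    params: dict
    run: Callable

def describe(pipeline):
    """Print a human-readable record of the whole flow, step by step."""
    for step in pipeline:
        print(f"{step.name}: {step.inputs} -> {step.outputs} with {step.params}")

# Example: the lab's 'equipment -> basic processing -> image' flow, written down.
pipeline = [
    Step("acquire", ["instrument"], ["raw.csv"],    {"exposure_ms": 100},    run=lambda: None),
    Step("clean",   ["raw.csv"],    ["clean.csv"],  {"drop_outliers": True}, run=lambda: None),
    Step("render",  ["clean.csv"],  ["figure.png"], {"dpi": 300},            run=lambda: None),
]
describe(pipeline)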

The area that lags behind these other innovations is scholarly communication. Still tied, for the most part, to the journal article, monograph or conference proceeding, we lack the media to share the core of the research that is starting to be codified. Over the next few years there will almost certainly be a major shift in how research findings are shared, but this will take time to develop, test and gain acceptance.

Initial moves to extend the nature of the academic article came in the form of technologies such as Figshare (http://figshare.com), part of Digital Science’s portfolio, which makes data more shareable and easier to access through persistent identifiers. But, in order to link scholarly communication more closely to the prior steps in research, and hence decrease the chance of introducing irreproducible steps, we need to make more of the methodology shareable in a consistent manner. We believe that the route to doing this is through technologies such as Gigantum (http://gigantum.com), in which Digital Science made an investment earlier this year.

Similar to CodeOcean (http://codeocean.com), Gigantum helps a researcher to keep their data files and analysis files synchronised, so that for any output shared, be it a video, audio file or image, there is an audit trail showing which version of the code was used to process the data and which version of the data was processed. Data, code and audit trail can then be packaged and shared with collaborators, reviewers or readers.
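As a minimal sketch of the underlying idea, and not of Gigantum’s or CodeOcean’s actual interface, the Python below records which code revision and which data file produced a given output, so that all three can be packaged, shared and checked together; the file names and the ‘.audit.json’ convention are assumptions made for illustration.

import datetime
import hashlib
import json
import subprocess
from pathlib import Path

def sha256_of(path):
    # Checksum a file so a specific version of the data or output can be identified.
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def write_audit_record(data_path, output_path):
    # Link an output to the exact code commit and data version that produced it.
    # Assumes the analysis code lives in a git repository; paths are illustrative.
    commit = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    ).stdout.strip()
    record = {
        "code_commit": commit,
        "data_file": str(data_path),
        "data_sha256": sha256_of(data_path),
        "output_file": str(output_path),
        "output_sha256": sha256_of(output_path),
        "generated_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    Path(str(output_path) + ".audit.json").write_text(json.dumps(record, indent=2))
    return record

# Example: call write_audit_record("clean.csv", "figure.png") after the figure is generated.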

The final part of the reproducibility landscape that I’m going to touch on here is peer review.

We have approaches and mechanisms that can help us through the transition to a new kind of article. However, I believe that peer review is likely to remain a constant of research communication. I don’t think that AI will replace peer reviewers, but I do think that intelligence augmentation is a natural fit in this space. Enhancing a peer reviewer’s insight into how the data for a study or experiment was created is where significant value can be added to research.

Digital Science recently announced its support for Ripeta (http://ripeta.com), a technology that generates a report for peer reviewers and others on the reproducibility of a piece of research from a paper. At the moment, this type of technology is in its early stages but it is clear that, with the increasing complexity of research work, and the data handling and analysis that accompanies it, we need technologies that help us to interpret and critically review scholarly communication. It is no longer possible for external reviewers to understand the nuances of every technique, and to know what could and should be shared in all cases.

Failing to tackle reproducibility will be an increasing problem. Public funding supports a large proportion of our research output, and this places pressure on academics and institutions to ensure that the research they do can be trusted by the research community and by the public that has supported it. While it is clear that not all research will be reproducible, much of it can be, and we need to give researchers the tools and skills to drive the data.

Daniel Hook is CEO at Digital Science