The art of the possible
Nathan Cunningham
A national infrastructure for research data is necessary to effectively apply AI and machine learning, writes Nathan Cunningham
It was in the Arctic that I came to understand the importance of open science and the infrastructure needed to support it. I was working at the Polar Data Centre to ensure that scientific polar and cryospheric data was available to research teams both in the UK and in the field. It struck me that access to this rich data was limited to the research teams directly involved and did not extend beyond the biological and environmental research communities.
Breaking down Chinese walls
Research funding is still presented in siloed ways, and we need to look at how we can break down those barriers. An example of where that is working well is the Big Data Network. Since 2013, the Economic and Social Research Council (ESRC) has invested over £64 million in this project, bringing together data that helps to inform government policy makers. The Big Data Network connects comprehensive national-level data on people: their behaviours, their attitudes and their motivations. All scientists need this kind of national research data infrastructure to capture, store, process and access data, and to enable collaboration across disciplines.
However, we see a lot of money invested in some superb research assets, but they are not all joined up. Research communities such as the humanities, in particular, are often not yet integrated into a national data infrastructure, even though rich petascale data sets have become the norm for most research projects.
When I was at the British Antarctic Survey 10 years ago, I already had about six petabytes of data to manage, most of it remote sensing and modelling data. None of that research was fed into a national infrastructure, and there is still a lot of scientific data stored outside a national research infrastructure. This is not a criticism; we just need to review where we are and see how we can link up these rich sources of data.
The problem we’ve got is that research communities are granted big pots of money to support their own area of research. The UK has some fantastic research facilities, such as the Met Office and the Square Kilometre Array, but investment in these fabulous projects does not necessarily create a national infrastructure.
What I take away from the £64 million Big Data Network is that a lot of the assets are standing on their own. This is where we’re missing a piece. We’re trying to stand up a lot of information, but it needs a lot of computational support, and we’re getting to the point where we can no longer move the data around so easily.
If you look at the government’s industrial strategy, or at grand research challenges such as ‘healthy ageing’ and ‘sustainable food’, we need to bring in a lot of requirements to overlay or wrap data so that we can connect the various data sources.
Now is the time to create a greater understanding of this national federated asset. My vision is that, for every pot of money from UKRI or any of the research councils, a small proportion is allocated to the creation of a national infrastructure for research data.
Currently, research projects need to present a research data management plan and a place of deposit. But this practice is fairly lax, and a lot of research communities are still without robust data management and storage plans.
As someone who has worked within Russell Group institutions, I’ve supported a lot of the effort that goes into just capturing research data. I estimate that around 70 per cent of research data still isn’t part of a larger data infrastructure.
I propose that we look at the UKRI investment funds and recognise that all research communities now have the same requirements that the large ones, such as CERN and the Met Office, had 10 years ago. Large computational needs are now the norm, because projects tracking the industrial strategy or the grand challenges agenda all work in an interdisciplinary and multi-institutional way.
We want to capture that baseline to support research. We can’t ignore the missing interlocking piece of engineering: it is not funded, and it is getting harder and harder to do as data sets grow in complexity.
If every research grant given out by the research councils came with a ready-paid place to deposit research data, where people can do research computing at a base level, Jisc could offer that as a service, like the ones you would normally enter into when working on a specific project. This base level of connectedness would promote inter-project and interdisciplinary working. It would create a knowledge base to which we can apply all the new technologies, from AI to machine learning.
We need to leverage this baseline of investment and bring in linkages between different research communities, so that we can couple ocean and atmosphere research, and also bring medical research to other communities by extending the trusted environments they currently work in.
It is the art of the possible. I believe this piece of infrastructure will allow us to lay down a more coherent capability across the academic landscape. I genuinely think it will allow us to continue to compete on the global academic stage.
Nathan Cunningham is head of research computing at the Norwich Bioscience Institutes Partnership