DataSeer develops AI system to track dataset reuse

DataSeer, in collaboration with The Michael J. Fox Foundation (MJFF), has developed a new large language model (LLM)-based system designed to detect and quantify dataset reuse across the scholarly literature at scale.

The system aims to address a longstanding challenge for research funders and institutions: measuring the downstream impact of shared research data.

By automating the detection of dataset reuse in published research, the platform is intended to remove a key bottleneck that has limited large-scale analysis of how data is reused.

Data sharing has increasingly become a priority for research funders seeking to accelerate discovery and improve transparency in the scientific process. MJFF has been actively involved in this movement, serving as the implementation partner for the Aligning Science Across Parkinson’s (ASAP) initiative since 2020. The foundation expanded its open science practices across its wider research network in 2022, building on policies introduced through the ASAP programme.

The new system builds on a dataset reuse study commissioned in 2025 alongside Strategies for Open Science (Stratos), which helped provide the evidence base for developing the approach.

Developed by DataSeer in collaboration with its Open Science Indicator partner PLOS and with input from the broader open science community, the LLM was piloted on a corpus of 6,000 MJFF-funded articles. Unlike traditional approaches that rely on formal data citations or digital object identifiers (DOIs), the model analyses the full text of research articles to identify reused datasets.

This allows the system to detect reuse even when datasets are referenced indirectly, such as through accession numbers, repository names, URLs, or narrative descriptions.

“Detecting dataset reuse is genuinely hard,” said Tim Vines, founder and CEO of DataSeer. “Traditional approaches that depend on structured identifiers typically find evidence of reuse in only about two percent of articles. When we applied our LLM to the MJFF corpus, we found clear evidence of data reuse in forty-three percent of articles. That gap confirms the broad perception that data reuse has always been happening but was effectively invisible.”

“For funders, there is growing interest in understanding not just what gets published, but how research outputs are used and reused over time,” said Josh Gottesman, Community Director for Research Data at MJFF. “The ability to systematically track data reuse gives us a new lens on openness, research integrity, and the downstream impact of our funding dollars – while also underscoring the critical contributions of researchers who generate data that enables future discoveries.”
