Why scholarly publishing needs a neutral governance body for the AI age

In response to David Worlock’s blog post, Darrell Gunter says mechanisms that governed trust, correction, and authority in the print and early digital eras are no longer sufficient
What this discussion ultimately reveals is that the scholarly communication system has crossed a structural threshold.
The mechanisms that governed trust, correction, and authority in the print and early digital eras are no longer sufficient for a world in which artificial intelligence systems ingest, remix, and redistribute vast portions of the scholarly record at scale.
Retractions, corrections, provenance, and version control were once slow and imperfect but manageable problems. In the print era, a retracted article was addressed by a notice in a subsequent issue, a correction slip sent to libraries, and, in the best cases, careful librarians physically annotating volumes. The system was inconsistent, but the damage was bounded. A mistaken or fraudulent article might mislead some readers, but it could not be instantaneously copied, embedded, and operationalised by AI systems globally.
AI has changed that irreversibly.
Once a scholarly article, valid or invalid, is ingested into a training corpus, it becomes part of a computational substrate that can no longer be surgically edited. Retractions cannot be “pulled back” from large language models the way PDFs can be removed from a publisher’s website. Errors are no longer localised. They propagate.
This is not an AI problem alone. It is a governance failure.
For decades, the scholarly publishing ecosystem has relied on a loose federation of publishers, libraries, and indexing services to manage trust. Retraction Watch, Crossref, PubMed, OpenAlex, and others have all done heroic work, but their authority is partial, unevenly implemented, and not designed for machine governance. Their signals are advisory rather than binding. Their metadata is often optional. And their integration into AI pipelines is inconsistent or nonexistent.
The result is a system in which every actor defines “truth,” “trust,” and “reliability” differently. That cannot work in an AI-driven research environment.
We have solved this problem before
What makes this moment especially frustrating is that the scholarly community already knows how to build global, neutral infrastructure when it is forced to do so.
We built Crossref when citations became too important to leave to proprietary silos. We built ORCID when identity became too fragmented. We built COUNTER when usage metrics became too economically and strategically important to remain opaque. None of these systems belong to any single publisher. All of them are governed by the community. All of them create shared, machine-readable truth.
AI now demands the same level of institutional response, but for research integrity itself.
What is missing is a neutral, nonprofit governance layer for scholarly data in the AI era: an organisation with the authority to define, certify, and enforce standards for how scholarly content is labelled, transmitted, and used by machines.
This is not about content ownership. It is about content status.
We need a globally recognised system that can answer, unambiguously and in machine-readable form:
- Has this paper been retracted?
- Has it been corrected?
- Which version is authoritative?
- Is this dataset approved for model training?
- Was this content withdrawn for ethical or scientific reasons?
- What is the provenance chain behind this knowledge claim?
Today, no such system exists.
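To make the gap concrete, here is a minimal sketch, in Python, of the kind of machine-readable status record such a system would need to return for any given work. Every field name, value, and identifier below is hypothetical; it illustrates the idea of content status as structured data, not an existing schema or registry.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical, illustrative record: the field names and values are assumptions,
# not any organisation's actual standard.
@dataclass
class ContentStatusRecord:
    doi: str                          # persistent identifier of the work
    authoritative_version: str        # which version is the version of record
    retracted: bool                   # has this paper been retracted?
    corrected: bool                   # has a correction been issued?
    withdrawal_reason: Optional[str]  # e.g. "ethical", "scientific", or None
    training_eligible: bool           # approved for model training?
    provenance: list[str] = field(default_factory=list)  # chain of validating entities

# A made-up example answering the questions above in machine-readable form
record = ContentStatusRecord(
    doi="10.1234/example.5678",
    authoritative_version="version-of-record-v2",
    retracted=True,
    corrected=False,
    withdrawal_reason="scientific",
    training_eligible=False,
    provenance=["publisher: Example Press", "registry: neutral integrity body (hypothetical)"],
)

print(record.retracted, record.training_eligible)  # True False
```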
Why AI forces the issue
Kent Anderson is correct about one thing: once bad data enters an LLM, it cannot simply be deleted. But the conclusion that some draw from this, that AI is therefore inherently incompatible with science, is wrong.
What AI has exposed is not a technological impossibility but an institutional gap.
The real problem is not that retracted material entered early training sets. It is that the scholarly ecosystem never created a binding, machine-actionable way to prevent that from happening in the first place.
Even before AI, retracted articles were notoriously difficult to detect. PDFs lived on pirate sites. Metadata was incomplete. Some databases were updated; others were not. Publishers varied widely in how aggressively they labelled or removed flawed work.
AI simply amplified these weaknesses.
The solution is not to retreat to nostalgia about print. It is to finally do what we failed to do in the digital era: treat scholarly integrity as structured data, not editorial policy.
What this new organisation must do
A modern governance body for research integrity must operate at the same level as Crossref or ORCID, not as a watchdog, but as infrastructure.
Its responsibilities would include:
- Retraction and Correction Signalling – A single, authoritative registry of retractions, corrections, and expressions of concern, available via APIs and embedded directly into metadata pipelines used by publishers, indexers, and AI systems (see the sketch after this list).
- Provenance and Lineage Tracking – Machine-readable records of where data, articles, and claims originated, how they were modified, and which entities validated them.
- Training-Data Certification – A standard for what scholarly content is eligible for AI training, including exclusion lists for retracted or ethically compromised material.
- Auditability and Transparency – A framework that allows AI providers to certify which datasets they used and how those datasets were filtered.
- Compliance and Enforcement – Not legal enforcement, but technical and reputational enforcement, the same way COUNTER compliance or DOI registration works today.
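To show how the signalling and certification responsibilities could plug into an AI provider’s pipeline, here is a minimal Python sketch of a training-corpus filter driven by a registry’s exclusion list. The registry, the DOIs, and the training-eligibility flag are all assumptions for illustration; the point is that the signal must be machine-actionable rather than advisory.

```python
# Illustrative sketch only: the registry, its exclusion list, and all names here
# are assumptions about how a neutral integrity body *could* expose signals,
# not a description of any existing service.

def load_exclusion_list() -> set[str]:
    """Stand-in for a call to a hypothetical retraction/certification registry;
    in practice this list would be fetched and refreshed over an API."""
    return {
        "10.1234/retracted.0001",   # retracted for scientific reasons (made-up DOI)
        "10.1234/withdrawn.0002",   # withdrawn for ethical reasons (made-up DOI)
    }

def filter_training_corpus(corpus: list[dict]) -> list[dict]:
    """Keep only documents that are not on the exclusion list and that carry
    a training-eligibility flag in their metadata."""
    excluded = load_exclusion_list()
    return [
        doc for doc in corpus
        if doc["doi"] not in excluded and doc.get("training_eligible", False)
    ]

corpus = [
    {"doi": "10.1234/good.0003", "training_eligible": True},
    {"doi": "10.1234/retracted.0001", "training_eligible": True},  # excluded by registry
    {"doi": "10.1234/unflagged.0004"},                             # no eligibility signal
]
print([d["doi"] for d in filter_training_corpus(corpus)])
# ['10.1234/good.0003']
```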
This organisation would not replace publishers. It would stabilise them.
Why trust will define the winners
One of the most insightful points in the original post is that trust will not flow automatically from journal brands to AI systems. Researchers will not blindly accept AI outputs just because they are licensed from prestigious publishers.
Trust will attach to systems that can prove their integrity.
That is why the future is likely to be dominated by research-institution-branded AI platforms, such as Mayo Clinic AI, Max Planck AI, and Francis Crick AI, backed by clean, certified data pipelines. These systems will not be trusted because they are powerful, but because they are governed.
And that governance layer must be neutral. Publishers, AI companies, and universities will all participate, but none can own it.
The cost of not acting
Without this shared infrastructure, we will drift into a fractured future:
- Each AI provider will have its own definition of “clean data.”
- Each publisher will issue its own retraction signals.
- Each platform will implement its own filters.
Researchers will be unable to compare, verify, or trust results across systems. Regulators will intervene. Litigation will rise. And the credibility of AI-assisted science will be permanently damaged. This is exactly what Crossref and ORCID prevented in their domains. We now need the same intervention at the level of truth itself.
AI has not destroyed scholarly publishing. It has forced it to mature. The question is whether the industry will respond by building real governance or by continuing to pretend that fragmented metadata and voluntary compliance are good enough in a machine-driven world.
They are not.
Darrell Gunter is CEO of Gunter Media Group
