Scholarly publishing’s great leap

Darrell Gunter

Darrell Gunter explains why the industry must embrace compute-ready documents to survive the AI era

The scholarly publishing industry stands at a precarious crossroads. For decades, the sector has relied on the PDF as its atomic unit of value, a format designed for the human eye, stable and printable.

However, as the world transitions from an economy of access to an “Answers Economy,” the infrastructure that once served us is now structurally failing. The industry is currently fighting a war on multiple fronts: the proliferation of paper mill content, the existential threat of Generative AI (GenAI) hallucinations, and the unauthorised exploitation of content by scraping bots. To survive and thrive, publishers must move beyond the “lossy” limitations of the PDF and embrace the Compute-Ready Document (CRD).

The crisis: hallucinations, paper mills, and structural failure

The primary threat facing publishers today is not just technological; it is architectural. We are attempting to feed 21st-century AI models with 20th-century file formats. The PDF is a container designed for visual layout, not machine understanding. When AI models, such as Large Language Models (LLMs), attempt to ingest these documents via RAG (Retrieval-Augmented Generation) pipelines, they rely on “blind chunking”, splitting data by arbitrary word counts.

This process destroys context. A methodology section is separated from its results; a figure on page 3 is severed from its caption on page 4. As Steve Smith notes, this results in “lossy compression” where meaning is dispersed, leading AI models to generate synthetic, ungrounded answers. This structural failure exacerbates the issue of hallucinations. GenAI is not a crystal ball; it is a “word prediction engine”, a mirror that reflects patterns but does not verify truth. Without structured data, AI models guess at relationships, leading them to fabricate citations or present scientifically unsound content as fact.
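To make this failure mode concrete, here is a minimal sketch of blind chunking: text is split into fixed-size word windows with no regard for structure. The sample text, window size, and function name are invented for illustration; this is not any particular vendor’s pipeline.

```python
# Illustrative sketch of "blind chunking": splitting a document into
# fixed-size word windows that ignore sections, figures, and captions.
# The sample text and window size are invented for this example.

article = (
    "Methods: Samples were annealed at 450 C for two hours before imaging. "
    "Results: Figure 3 shows the resulting grain boundaries. "
    "Figure 3 caption: Grain boundary map of the annealed sample."
)

def blind_chunks(text: str, window: int = 12) -> list[str]:
    """Split text into fixed-size word windows, ignoring document structure."""
    words = text.split()
    return [" ".join(words[i:i + window]) for i in range(0, len(words), window)]

for i, chunk in enumerate(blind_chunks(article)):
    print(f"chunk {i}: {chunk}")

# The methods land in a different chunk from the results, and the figure
# caption is split mid-sentence across two chunks, so a retrieval step can
# surface a result without the method or caption that grounds it.
```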

Simultaneously, the industry is besieged by “paper mills”, entities churning out fraudulent research. In a PDF-based ecosystem, distinguishing between a legitimate, peer-reviewed study and a sophisticated fake is difficult for a machine. This is compounded by the “bot attacks” described by Rosalyn Metz, where libraries and repositories suffer “DDoS-like” traffic from AI training bots scraping content. This “openness without governance” allows AI companies to exploit the unpaid labor of curators and the intellectual property of publishers without consent or compensation, driving up cloud infrastructure costs for the victims.

The solution: the compute-ready document (CRD)

The remedy is not a better PDF, but a fundamental shift in knowledge architecture to the Compute-Ready Document (CRD). The CRD is not merely a file; it is a “semantic twin” of the original content, designed explicitly for agentic ingestion rather than blind chunking.

According to Signal65, a CRD is composed of three essential layers that directly address the industry’s crisis:

  1. The Asset: The high-signal item, such as a circuit diagram, heatmap, or chemical structure, treated as a first-class digital entity.
  2. The Context: The structured metadata that explicitly binds the asset to its captions, methods, and constraints. This ensures that an AI understands how a result was achieved, eliminating the context collapse that leads to hallucinations.
  3. The Provenance: This is the antidote to paper mills. The CRD encodes the chain of custody, DOIs, author ORCIDs, peer-review status, and licensing directly into the object.

By embedding provenance, the CRD allows AI systems to justify their answers. It transforms the AI from a guessing engine into a reasoning engine that can distinguish between verified peer-reviewed scholarship and unverified noise. This architecture moves intelligence from the slow, expensive “query time” to the fast, deterministic “ingestion time,” ensuring that “garbage in, garbage out” is replaced by a pipeline of quality, structured data.
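A minimal sketch of how these three layers might be represented in code follows. The class and field names are assumptions made for illustration; this is not a published Signal65 or Gadget Software CRD schema.

```python
from dataclasses import dataclass, field

# Minimal sketch of the three CRD layers described above. The class and
# field names are illustrative assumptions, not a published CRD schema.

@dataclass
class ComputeReadyDocument:
    # The Asset: the high-signal item treated as a first-class entity.
    asset_id: str
    asset_type: str                      # e.g. "circuit_diagram", "heatmap"
    asset_uri: str                       # pointer to the underlying object

    # The Context: structured metadata binding the asset to how it was produced.
    caption: str
    methods: str
    constraints: list[str] = field(default_factory=list)

    # The Provenance: the chain of custody that lets an AI justify its answer.
    doi: str = ""
    author_orcids: list[str] = field(default_factory=list)
    peer_review_status: str = "unreviewed"
    licence: str = ""

crd = ComputeReadyDocument(
    asset_id="fig-3",
    asset_type="heatmap",
    asset_uri="s3://example-bucket/fig-3.svg",   # hypothetical location
    caption="Grain boundary map of the annealed sample.",
    methods="Samples annealed at 450 C for two hours before imaging.",
    doi="10.0000/example.doi",                   # placeholder DOI
    peer_review_status="peer_reviewed",
)

# An agent can check provenance (DOI, review status, licence) before using
# the asset, instead of guessing from a flat block of PDF text.
print(crd.doi, crd.peer_review_status)
```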

Proven performance: validated by Signal65, Dell, and Broadcom

The efficacy of the CRD is not theoretical; it has been rigorously tested and endorsed by industry leaders in hardware and testing. Signal65, an independent laboratory, has conducted real-world testing of the TopicLake Insights Engine on Dell Technologies and Broadcom infrastructure. Their findings confirm that shifting to an on-premises, CRD-based approach eliminates the “Cloud Tax” of network latency, offering deterministic speed regardless of load. This collaboration validates that the CRD is not just a software concept but a scalable enterprise solution capable of handling the heaviest data workloads with absolute data sovereignty.

Real-world application: the federal register

To understand the power of the CRD, we can look to the Gadget Software TopicLake Insights engine, which has successfully ingested one of the world’s most complex publishing environments: The Federal Register.

The Federal Register is a massive unstructured data source containing daily notices, proposed rules, and executive orders, all bound by dense legal hierarchies. Every day, the TopicLake Insights engine ingests and processes the XML from the Federal Register. Instead of treating these rules as flat text, the engine decomposes the daily rules into queryable artifacts. It captures cross-references, effective dates, and statutory authority, transforming a “raw rule” into a computed artifact.

Where traditional RAG pipelines fail to capture the statutory authority or effective dates buried in dense legal text, the TopicLake engine preserves these relationships. This allows users to query complex regulatory changes with precision, proving that CRD architecture can untangle even the most convoluted government data streams.
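As a rough illustration of what decomposing a rule into a queryable artifact can look like, the sketch below pulls the effective date, statutory authority, and cross-references out of a toy rule document. The XML element names are invented for the example; they are not the real Federal Register schema or the TopicLake implementation.

```python
import xml.etree.ElementTree as ET

# Rough illustration of decomposing a rule into queryable fields.
# The element names below are invented for this sketch; they are not the
# real Federal Register schema or the TopicLake implementation.

sample_rule_xml = """
<rule id="2024-01234">
  <title>Hypothetical Emissions Reporting Rule</title>
  <effective_date>2024-07-01</effective_date>
  <statutory_authority>42 U.S.C. 7401 et seq.</statutory_authority>
  <cross_reference target="2023-05678"/>
  <cross_reference target="2022-11111"/>
  <body>Flat regulatory text would appear here...</body>
</rule>
"""

def decompose_rule(xml_text: str) -> dict:
    """Turn a flat rule document into a small queryable artifact."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.get("id"),
        "title": root.findtext("title"),
        "effective_date": root.findtext("effective_date"),
        "statutory_authority": root.findtext("statutory_authority"),
        "cross_references": [cr.get("target") for cr in root.findall("cross_reference")],
    }

artifact = decompose_rule(sample_rule_xml)
print(artifact["effective_date"], artifact["cross_references"])
```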

Revenue and new capabilities: knowledge-as-a-service

Adopting the CRD is not just a defensive manoeuvre; it unlocks the “Knowledge-as-a-Service” (KaaS) revenue model. As we move from selling access to selling answers, value shifts from content ownership to “context stewardship”.

By converting static content into CRDs, publishers can offer new product capabilities such as the AI-generated Workspace. Powered by TopicLake Insights, this workspace offers interactive data visualisation, knowledge graphing that links people to topics, and AI research assistants capable of prompt-based learning. These are not static files; they are dynamic knowledge environments.
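As a hypothetical illustration of knowledge graphing that links people to topics, the sketch below builds a tiny author-topic graph from CRD-style metadata, assuming the networkx library is available. The records and the graph design are assumptions for the example, not a description of the TopicLake workspace internals.

```python
import networkx as nx

# Tiny illustrative author-topic graph built from CRD-style metadata.
# The records and the choice of networkx are assumptions for this sketch,
# not a description of the TopicLake workspace internals.

records = [
    {"orcid": "0000-0000-0000-0001", "topics": ["perovskite solar cells", "thin films"]},
    {"orcid": "0000-0000-0000-0002", "topics": ["thin films", "grain boundaries"]},
]

graph = nx.Graph()
for record in records:
    graph.add_node(record["orcid"], kind="person")
    for topic in record["topics"]:
        graph.add_node(topic, kind="topic")
        graph.add_edge(record["orcid"], topic)

# Which people are linked to the topic "thin films"?
people = [n for n in graph.neighbors("thin films")
          if graph.nodes[n]["kind"] == "person"]
print(people)
```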

The economic opportunity is clear: AI companies may pay for training data once, but corporations and researchers will pay ongoing subscriptions for a trusted, structured knowledge layer. Publishers can monetize the “connective tissue” of research by selling the circuit diagram linked to its equations to an engineer, or the crystal structure linked to its synthesis conditions to a materials scientist. This moves the value proposition from “Can I read it?” to “Can I use it?”.

Conclusion: the alarm has been ringing

The alarm bell for this transition has been ringing for decades. The divergence between visual (PDF) and structural (HTML) formats began in the early 1990s, and the exploitation of open infrastructure has been a known vulnerability for years.

The scholarly publishing industry must come off the sidelines today. The “wait and see” approach is no longer viable in a world where AI bots are actively scraping value and paper mills are eroding trust. By embracing the Compute-Ready Document, publishers can secure their content against unauthorised scraping, eliminate hallucinations through structured context, and unlock the lucrative future of the Answers Economy. The technology is tested, the hardware is ready, and the market is waiting. It is time to leap.

Darrell Gunter is Chief Commercial Officer at Gadget Software

