After the PDF: A new unit of knowledge for the AI era

Steve Smith

What is the atomic unit of the scholarly record?

For decades, scholarly publishing has operated with an implicit answer: the PDF. We built our entire infrastructure around that assumption. We mint DOIs for documents. We sell collections of documents. We measure impact in citations to documents.

The PDF made sense for its era: portable, platform-independent, and visually consistent. Knowledge, we assumed, lived inside the container, and the container was stable and sufficient.

But as I argued recently in Research Information on Knowledge-as-a-Service, the industry is shifting from an access economy to an answers economy. Which raises an urgent question: if “answers” are the new product, what’s the essential unit we’re actually packaging? Not documents. Something smaller, denser, and computable.

In an AI-mediated world, the PDF is no longer a vessel for knowledge at all. It is, in fact, often an obstacle. If publishers want to provide “answers” rather than “files,” we need a different atomic unit of knowledge. We must move from the Article to the Knowledge Object.

Why the PDF is a “lossy” format for context

Anyone who has tried to extract structured insight from a PDF knows the experience. The figure you need is on page 3. The caption sits just below it. The method that defines the experimental constraints is buried on page 5. The dataset lives in a repository somewhere else, linked only if you’re lucky. And the provenance, including peer review status, authorship roles, and licensing, is either implicit or scattered across metadata fields.

Humans can piece this together because narrative provides coherence. We read the introduction, follow the argument, and reconstruct the relationships between claim and evidence as we go. Machines can approximate this. Vision-Language Models extract spatial cues from page layouts, and LLMs infer relationships statistically from surrounding text, but this reconstruction is probabilistic, not deterministic. The model guesses at connections from statistical association; it doesn’t read explicitly encoded structure. When those inferences are wrong, they can fail silently. The result is a kind of lossy compression: meaning dispersed across space, critical relationships unmarked, and no guarantee that what the machine “understands” matches what the authors intended.

The result is predictable: hallucinated values, incorrect assumptions, misinterpreted diagrams, synthetic “answers” built without grounding. These aren’t failures of AI capability; they’re failures of knowledge architecture. AI performance degrades sharply when context collapses. And the PDF, brilliant for human reading, is structurally hostile to context preservation at machine scale.

The remedy isn’t better PDFs or fancier metadata overlays. It’s a fundamentally different unit of knowledge, one designed for relational inference, not for linear narrative.

What a Knowledge Object is

If the PDF is a container, the Knowledge Object is a molecule of usable knowledge: a self-contained packet expressing a claim together with its supporting structure. It has three layers, each necessary, none sufficient alone.

Layer 1: The Asset (the “What”)
The high-signal item at the core: often visual, sometimes textual, always interpretively dense. Circuit diagrams. Crystal structures. Gene expression heatmaps. Flow diagrams. In the PDF era, these are merely elements on a page. In a Knowledge-Object world, they’re first-class digital entities: queryable, addressable, independently meaningful when paired with their context.

Layer 2: The Context (the “How” and “Why”)
This is where objects become meaningful, not merely informative. A Knowledge Object binds the asset to its interpretive scaffolding in ways that are explicit, semantic, and machine-readable.

The caption matters, not as prose, but as structured metadata capturing constraints, assumptions, and variable definitions. Method snippets matter, but only the parts relevant to this asset: synthesis conditions, reagents, hyperparameters, boundary conditions, and dataset links.

In a Knowledge Object, the caption is not “near” the figure in typographic space; it’s attached, explicitly and permanently. The relationships aren’t implied by proximity; they’re encoded as part of the object’s structure.

Layer 3: The Provenance (the “Who” and “When”)
In an AI-driven world, trust is the new scarcity. Knowledge Objects carry their lineage internally: the DOI of the parent article, authors’ ORCIDs and CRediT roles, institutional identifiers (RORs), grant IDs, version history, licensing terms, and peer-review status as a verifiable record.

This lineage lets AI systems justify answers, not just output them. When a model cites a Knowledge Object, it propagates trust signals: this structure was peer-reviewed, these authors have validated credentials, this dataset is openly available. Provenance isn’t decorative metadata; it’s the mechanism by which machines inherit human judgment.
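The three layers described above can be sketched as a minimal data model. This is purely an illustrative sketch, not a proposed standard: every class and field name here is an assumption of mine, chosen to mirror the asset, context, and provenance layers as the article defines them.

```python
from dataclasses import dataclass, field

@dataclass
class Asset:
    """Layer 1: the high-signal item at the core (diagram, structure, heatmap)."""
    kind: str   # e.g. "circuit-diagram", "crystal-structure" (illustrative values)
    uri: str    # addressable location of the asset as a first-class entity

@dataclass
class Context:
    """Layer 2: interpretive scaffolding, explicitly bound to the asset."""
    caption: str                                       # structured, not just prose
    constraints: dict = field(default_factory=dict)    # variable definitions, boundary conditions
    method_snippet: str = ""                           # only the parts relevant to this asset
    dataset_links: list = field(default_factory=list)  # links out to supporting data

@dataclass
class Provenance:
    """Layer 3: the lineage that lets machines inherit human judgment."""
    parent_doi: str            # DOI of the parent article
    author_orcids: list        # authors' ORCIDs (CRediT roles could sit alongside)
    ror_ids: list = field(default_factory=list)   # institutional identifiers
    license: str = ""
    peer_review_status: str = ""
    version: str = "1.0"

@dataclass
class KnowledgeObject:
    """A claim plus its supporting structure: all three layers, none optional."""
    asset: Asset
    context: Context
    provenance: Provenance
```

The point of the sketch is that the relationships are fields, not typographic proximity: a Knowledge Object without a caption or a parent DOI simply fails to construct.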

Why AI needs Knowledge Objects

Machines can infer context statistically. An LLM will guess at what a figure represents based on surrounding text and patterns in its training data, but these inferences are probabilistic, not grounded. To an AI, a figure without a caption isn’t meaningless; it is underspecified.

An LLM will still generate an interpretation, but without explicit constraints, that interpretation may silently diverge from the author’s intent. A method without parameters becomes a template the model fills in from training data. A dataset without provenance is usable but unverifiable. The risk isn’t that machines fail to process these objects; it’s that they process them confidently and incorrectly.

Knowledge Objects turn implicit relationships into machine-readable structure. They’re self-contained, context-rich, provenance-aware, rights-clear, interoperable, and ready for inference, not just retrieval. This is the bridge from search to understanding: from “find me a paper about X” to “help me reason whether X applies in context Y.”

Consider three examples:

  • Engineering: A model training task doesn’t need a 14-page article about circuit design. It does need a single object containing the topology diagram with its governing equations, operational constraints, environmental parameters, validating dataset, and a provenance chain confirming peer review and data availability.
  • Materials Science: A crystal structure Knowledge Object would include the structure image, synthesis conditions, experimental parameters, dataset DOIs for raw diffraction data, links to replication code, and peer-review assertions. A materials scientist training a predictive model needs exactly this packet: not the full paper, but the structured relationships between structure, synthesis, and validation.
  • Biomedicine: A researcher querying “BRCA1 expression in triple-negative breast cancer” needs the heatmap paired explicitly with reagent identifiers, method constraints, and versioned metadata; not a 12-page article where relevant information and critical variables are scattered across three different sections.

These aren’t “snippets” extracted from documents. They’re units of usable scientific knowledge: the smallest indivisible packets capable of supporting valid inference.
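To make the materials-science example concrete, here is what such a packet might look like as a single self-contained record. Every identifier and field name below is a placeholder I have invented for illustration; the structure, not the values, is the point.

```python
import json

# A hypothetical crystal-structure Knowledge Object. Angle-bracketed values
# stand in for real content; no real DOIs or data are asserted here.
crystal_ko = {
    "asset": {
        "kind": "crystal-structure",
        "uri": "https://example.org/assets/structure.cif",  # placeholder
    },
    "context": {
        "caption": "<structured caption: phase, space group, refinement notes>",
        "synthesis_conditions": "<temperature, atmosphere, precursors>",
        "raw_data_doi": "<DOI of raw diffraction dataset>",
        "replication_code": "<link to replication code>",
    },
    "provenance": {
        "parent_doi": "<DOI of parent article>",
        "author_orcids": ["<ORCID>"],
        "peer_review_status": "peer-reviewed",
        "license": "CC-BY-4.0",
    },
}

# Because every relationship is an explicit key, a consumer can verify
# completeness deterministically instead of inferring it from page layout.
REQUIRED = {"asset", "context", "provenance"}
assert REQUIRED <= crystal_ko.keys()

record = json.dumps(crystal_ko, indent=2)  # machine-readable and self-contained
```

Handing a model this record, rather than a 14-page PDF, is the difference between reading structure and guessing at it.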

The coming struggle for the context layer

If publishers don’t define and supply Knowledge Objects, others will reconstruct them, imperfectly, expensively, and without attribution. This is already happening. Model builders employ armies of annotators to extract figures, parse captions via OCR, and manually link methods to results. Corporate R&D teams build proprietary knowledge graphs from licensed content because the structured connections they need don’t exist in purchasable form. Search engines generate answers without provenance, obscuring who did what work under what conditions.

Value is shifting from content ownership to context stewardship. Whoever controls the relationships (who validated this, what methods produced it, confidence levels) controls the answer. Knowledge Objects are how publishers reclaim that layer, not by locking content behind paywalls, but by becoming the authoritative source for verified, structured scientific relationships.

The economic logic is straightforward. AI companies pay for high-quality training data, but that’s a static, one-time transaction. Researchers, institutions, and corporations will pay ongoing subscriptions for trusted knowledge layers that solve the verification problem they cannot address alone. The Knowledge Object model positions publishers not as suppliers of raw material but as curators of the connective tissue that makes AI trustworthy.

What it would take in practice

This doesn’t require rebuilding everything overnight. The first move is conceptual: treat your most valuable content (figures, datasets, methods) as first-class knowledge products, not article byproducts.

The operational challenges are real, but solvable. They require distinct shifts:

  • Workflows that capture structured metadata at submission, not as an afterthought.
  • Technology teams exposing that structure via usable, reliable endpoints.
  • Standards enabling interoperability, not bespoke islands.
  • Business models shifting from selling documents to stewarding context.

The playbook for this transition is already emerging. The building blocks exist today in fragments: persistent identifiers, metadata schemas, APIs, rights frameworks, and emerging standards like the Model Context Protocol that help AI systems discover and reason over structured content.

The details vary by discipline and portfolio, but the goal is clear: supplying structured, contextualised knowledge as a product, and treating Knowledge Object design as an ongoing strategic program, not a one-off project.

The infrastructure of discovery

The PDF became the unit of scholarship in an era when humans read linearly and machines read not at all. It was optimised for a world where meaning lived in narrative, where coherence emerged from the act of reading, and where the primary distribution challenge was getting documents to people’s desks. That world is gone.

The successor to the PDF won’t simply be a richer media format, like video or executable code. It must be something more fundamental. Knowledge Objects are the unit of usable knowledge in a new reality: one where humans and machines reason together; where AI surfaces patterns humans would miss; where researchers query knowledge graphs instead of keyword-searching PDFs; and where trust depends not on journal prestige but on traceable provenance. Whether the payload is text, code, or video, the Knowledge Object provides the connective tissue that makes it computable.

If the last era was about access, opening doors so everyone could read, the next will be about context: preserving and distributing the relationships that make reading meaningful. The infrastructure for this already exists in fragments (identifiers, metadata, graphs, and workflows). What’s missing is the strategic commitment to treat structured, relational knowledge as the product, not a side effect of producing articles.

We started by asking about the fundamental unit of research knowledge. In the PDF era, the document served as a workable proxy because humans could reconstruct context while reading. In the answers economy, that reconstruction must be explicit. Claims need their supporting context and traceable provenance attached in a form machines can use deterministically, not guess probabilistically.

Knowledge Objects are that unit. The Access era asked: Can everyone read it? The Answers era asks: Can machines reason with it? The publishers who answer that question will shape not just how scholarship is communicated, but the very infrastructure of discovery.

Steven D Smith, DPhil, is the founder of STEM Knowledge Partners and an independent consultant. The author thanks Philip Carpenter, Ben Kaube, Bill Trippe, and Jonathan Woahn for their comments on earlier drafts of this piece.

