How Hard Can It Be?

"How hard can it be?" is a question either explicitly or implicitly asked by many of our publisher clients, about many things: transforming LaTeX into XML, or Word into InDesign, or PDF into EPUB3... or turning any author-provided manuscript into perfectly paginated and typeset pages, and PDF, and EPUB, and XML. After all, Microsoft Word can produce an EPUB file. How hard can it be?

Well, the difficulty depends a great deal on the level of perfection desired. It can be pretty easy (if your standard of perfection is low), or very hard indeed (if your standard of perfection is high).

Let me step back a bit, and talk about the nature(s) of text, and perfection.

There's a distinction between “presentational markup” (e.g., Word and HTML, where size, style, and indentation carry visual meaning), “procedural markup” (e.g., LaTeX and troff, where the marked-up file produces its results on the fly, and is fundamentally a program rather than a representation), and “descriptive markup” (nowadays XML, in which content elements are identified by their purpose, rather than by their look or their programmatic function).

If that sounds a little too much like hair-splitting, let me give an example:


This is a Chapter Heading


The above is "descriptive markup": the codes say nothing about how the content ("This is a Chapter Heading") is displayed, nor what should be done with it. The code does, however, identify the content by its structural role, which allows another program to recognize it as significant, to give it visual characteristics, and to capture it for use in a table of contents.
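As a rough illustration of that last point, here is a minimal Python sketch of how a program might harvest descriptively marked-up headings for a table of contents. The element names (book, chapter, chapter-title, para) are hypothetical, invented for this example; they are not taken from any particular publisher's schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical descriptive markup: the element names are illustrative only.
SAMPLE = """<book>
  <chapter>
    <chapter-title>This is a Chapter Heading</chapter-title>
    <para>Body text goes here.</para>
  </chapter>
  <chapter>
    <chapter-title>This is Another Chapter Heading</chapter-title>
    <para>More body text.</para>
  </chapter>
</book>"""

def build_toc(xml_text):
    """Return a numbered list of the chapter titles found in the XML."""
    root = ET.fromstring(xml_text)
    # The markup says what each element *is*, so finding headings is a lookup
    # by name, not a guess based on font size or indentation.
    titles = [el.text.strip() for el in root.iter("chapter-title") if el.text]
    return [f"{n}. {title}" for n, title in enumerate(titles, start=1)]

toc = build_toc(SAMPLE)
print("\n".join(toc))
```

Nothing in the sample says how a heading should look; a separate stylesheet or typesetting step would decide that.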

Publishing services companies want to automate as much as possible, and yearn to transform anything to anything, seamlessly, at seven-sigma rates of accuracy. They want to capture every nuance of an author's "presentational markup" (that can be harvested from a Word or LaTeX file), and produce a file with rich "descriptive XML markup" that matches the author's and publisher's intent for every element. Or they want to go the other way: to take a soulless "descriptive XML file" and imbue it with visual life, from a presentational point of view.

Most companies apply "transformational markup," a hybrid, designed for capturing both structure and presentation, prior to import into InDesign, or Arbortext/3B2.

Most composition houses have something like what we have: a "special sauce" code set, a structure more detailed than any known need requires, into which other structures are transformed, and which, once refined, can be digitally transformed into any output format (InDesign, Arbortext, PDF, XML, EPUB, HTML, etc.).
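To make the hub-and-spoke idea concrete, here is a deliberately tiny Python sketch. It is not any vendor's actual code set: the role names and the two renderers are assumptions made up for illustration, and a real intermediate format would carry far more detail.

```python
from html import escape

# Hypothetical intermediate structure: each block records its role, not its look.
document = [
    {"role": "chapter-title", "text": "How Hard Can It Be?"},
    {"role": "para", "text": "Text with <angle brackets> & ampersands."},
]

# Assumed mapping from structural role to HTML tag.
HTML_TAGS = {"chapter-title": "h1", "para": "p"}

def to_html(blocks):
    """Render the intermediate structure as HTML, escaping special characters."""
    lines = []
    for block in blocks:
        tag = HTML_TAGS[block["role"]]
        lines.append(f"<{tag}>{escape(block['text'])}</{tag}>")
    return "\n".join(lines)

def to_plain_text(blocks):
    """Render the same structure as plain text, with titles in capitals."""
    lines = []
    for block in blocks:
        if block["role"] == "chapter-title":
            lines.append(block["text"].upper())
        else:
            lines.append(block["text"])
    return "\n\n".join(lines)

print(to_html(document))
print(to_plain_text(document))
```

Each new output format is one more renderer over the same structure, rather than one more converter for every input format.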

How hard can this be? Well, I’m just getting started.

So far, we're just dealing with file structures and their varied purposes. We haven't even touched on the nuances of professional typesetting, a craft with 500+ years of heritage behind it.

Nor have we mentioned what is perhaps the biggest challenge of all: coping with language.

Operationally, there are two primary modes of "copyediting" the content of a publication: mechanical (or "style") editing, and substantive editing. Depending on the publisher's preferences, publishing services companies like ours may do both, or either (or neither). I’ll just focus on the former.

Mechanical editing is performed to ensure that the manuscript is consistent: consistent within itself (if "U.S.A." is used in one spot, "USA" in another, and "US" in a third, it's unprofessional), and consistent with a named "style," such as APA, MLA, or other official styles.
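The internal-consistency check can be sketched in a few lines of Python. This is a minimal example under stated assumptions: the variant list covers only the one term mentioned above, and the regular expression is a guess at a reasonable pattern, not a rule taken from any editorial tool.

```python
import re
from collections import Counter

# Hypothetical variant group for one term. The longest form is listed first,
# and the lookarounds keep "US" from matching inside words such as "USAGE".
COUNTRY_VARIANTS = re.compile(r"(?<![\w.])(?:U\.S\.A\.|USA|US)(?!\w)")

def find_variants(text):
    """Count how often each spelling variant appears in the text."""
    return Counter(COUNTRY_VARIANTS.findall(text))

def is_inconsistent(text):
    """True when more than one variant of the term is in use."""
    return len(find_variants(text)) > 1

sample = (
    "Born in the U.S.A., she later toured the USA. "
    "The US market grew. USAGE and us are ignored."
)
counts = find_variants(sample)
print(counts, is_inconsistent(sample))
```

A check like this only flags the inconsistency; which variant wins is a decision for the named style or the copyeditor.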

There are grammatical rules, punctuation rules, and stylistic rules which need to be applied according to the official style; a great many of these style transformations can be automated. Like many vendors in our space, we use a combination of our own processes (built over many years), continued close analysis by skilled practitioners, and an editorial automation tool called Merops.

Merops, along with eXtyles and other similar software, closely analyzes text to deal with many of the recognizable patterns that make up 80 to 90 per cent of the nuisance work of mechanical copyediting. It also raises questions about 10 to 15 per cent of the rest.

Rules regarding small caps, spaces before ellipses, commas and quotation marks, en- and em-dashes, periods and parentheses, and much more, can be turned into search/replace code. More delicate efforts, like the structure and abbreviation styles within references, require complex semantic processing.
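Here is a minimal Python sketch of what such search/replace code might look like. The three rules are hypothetical house-style choices picked for illustration (en dashes in number ranges, commas inside closing quotation marks, no space before an ellipsis); a real style would have many more rules, and each would need exceptions.

```python
import re

# Each rule: (compiled pattern, replacement, description). All are assumptions.
RULES = [
    # Hyphen between two numbers becomes an en dash (U+2013). The lookarounds
    # skip longer hyphenated digit runs such as ISO dates (2020-01-15).
    (re.compile(r"(?<![\d-])(\d+)-(\d+)(?![\d-])"), "\\1\u2013\\2",
     "en dash in number ranges"),
    # Comma moves inside a closing straight quotation mark.
    (re.compile(r'(\w)",'), r'\1,"', "comma inside closing quote"),
    # Whitespace before a three-dot ellipsis is removed.
    (re.compile(r"\s+\.\.\."), "...", "no space before ellipsis"),
]

def apply_style_rules(text):
    """Apply every rule in order and return the cleaned text."""
    for pattern, replacement, _description in RULES:
        text = pattern.sub(replacement, text)
    return text

before = 'See pages 12-15 of "Style", which trails off ... on 2020-01-15.'
after = apply_style_rules(before)
print(after)
```

Even these simple patterns need guards against false positives, which hints at why reference styling calls for more than search and replace.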

Automated inclusion of DOIs, PubMed Central references, ORCIDs, and more requires a level of intersystem communication whose cost outweighs the benefit for most publishers. A centralizing system (like Westchester leveraging Merops' arrangements with multiple metadata aggregators) is more cost effective than assigning staff to produce the same results.

And there's always some percentage of any manuscript's oddities that are not caught by algorithmic cleverness, which is where human quality control comes in.

A committed author with committed volunteers (or graduate students) can perform many of the above tasks, over many hours. Or, a committed publisher can have in-house staff perform them, over many hours. But that scenario is increasingly rare, as publishers recognize the value of deep expertise.

"How hard can it be?" depends on what you are trying to accomplish.

Amateurish typography and editorial services are easily available to anyone who has access to Microsoft Word, or Acrobat, or HTML, and a friend with an English degree.

Satisfactory typography and editorial services are easily available to anyone willing to pay professionals to assist, or to invest time and effort in achieving an identified goal.

Quality typography and editorial services are easily available to publishers who (like venture capitalists) are investing in a project in hopes that it will become a unicorn: a unique and wonderful project that lots of people will pay money to read, and that needs to be perfect. That generally requires specialized software and professional specialists.

"How hard can it be?" It depends on what one is trying to achieve.

Michael Jensen is Director of Technology for Westchester Publishing Services. Previously, Michael ran the online publishing program of the National Academies of Sciences, Medicine, and Engineering; ran a sustainable farm; directed the startup phase of Project Muse; and produced the first searchable online publisher’s catalog, available via Telnet, in February of 1990, for the University of Nebraska Press.
