Enabling ‘artisan work’


Mark Gross, president at Data Conversion Laboratory, tells the story of the firm’s birth and its work in scholarly communications

Tell us a little about your background and qualifications … 

Other than the fact that I was always an avid reader, not much in my early background would suggest that I would be building a company to process millions of pages of scholarly documents. I graduated with a BS in Civil Engineering from Columbia University; my first job was building nuclear power plants. However, I soon realised that, with the changing political winds, building nuclear plants was not going to be a growth industry for long. In fact, the plant I was working on was one of the last constructed in the United States. NYU Graduate School of Business was down the street; it offered me a job as a computer consultant in its computer centre and accepted me into the MBA programme, where I earned an MBA in Computer Applications and Marketing. I later taught at the New York University Graduate School of Business, the New School, and Pace University before leaving to join the consulting practice of Arthur Young & Co, where I managed large-scale data implementations in the pre-desktop world.

Can you give us some background to Data Conversion Laboratory? How did it come about?

In the early 1980s the first ‘personal’ computers were coming onto the market but weren’t yet considered business machines. One day my dentist showed me a billing system he had built for himself on an Apple, and I was hooked. I was convinced that these machines would soon replace ‘big iron’ for many applications, and that some of the large systems I was building could run on this much smaller footprint.

When I found little enthusiasm for using these new-fangled devices at the consulting firm, I left to start, with my wife, what became Data Conversion Laboratory. It was slow going at first, but six months later IBM announced its first PC, validating the concept, and businesses started coming on board.

An early customer was a large accounting firm with 160 offices, each running a group of Vydecs – an early stand-alone word processor that was the standard of the time – and wanting to convert to the PC using MEC, an early, way-ahead-of-its-time, code-based word processing package. Neither word processing system exists today. Since the Vydec didn’t use coding, we needed to somehow infer a coding structure from the ‘look of the page’. We developed software that we called MindReader, because it needed to infer what the coding should have been had the editor known about codes.

Upon reflection, this was a very early implementation of artificial intelligence for structuring content. MindReader worked well and converted more than two million pages in 1983!  
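
By way of illustration only – the heuristics below are invented for this example and are not MindReader’s actual rules – a first pass at inferring structure from the ‘look of the page’ might lean on layout cues such as capitalisation, bullet markers, and indentation:

```python
import re

def infer_structure(lines):
    """Guess a structural tag for each line of an untagged page, using only
    layout cues (the 'look of the page'). Purely illustrative heuristics."""
    tagged = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # blank lines carry no content
        if stripped.isupper() and len(stripped) < 60:
            tag = "heading"        # short, all-caps line
        elif re.match(r"\s*(\d+\.|\*|-)\s+", line):
            tag = "list-item"      # starts with a number or bullet marker
        elif line.startswith("        "):
            tag = "block-quote"    # deeply indented text
        else:
            tag = "paragraph"      # default: body text
        tagged.append((tag, stripped))
    return tagged

page = [
    "ANNUAL BILLING SUMMARY",
    "",
    "1. Review outstanding invoices.",
    "2. Post payments received.",
    "The totals below reflect activity through year end.",
]
for tag, text in infer_structure(page):
    print(f"<{tag}>{text}</{tag}>")
```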

While early coding systems were somewhat ad hoc, SGML – and later XML – provided more structure and standardisation, allowing larger and larger volumes of content to be handled and distributed. DCL’s role grew as these new capabilities allowed the data industries to expand and handle the ever-increasing data streams we process on a daily basis today.

How does the organisation fit in with the world of scholarly communications?

Starting in the mid-1990s, DCL got more heavily involved in supporting scholarly communications. The industry was looking for new and innovative approaches to deal with ever-growing mountains of content, and for ways to reduce costs and become more efficient. Since then, DCL has developed services that support scholarly publishing from the beginning to the end of the publishing workflow. We specialise in complex content transformations.

Some examples:

  • Ingesting author manuscripts, composing them, and standardising them into JATS and other standard formats;
  • Normalising legacy collections into standard formats and loading the content onto the various platforms the industry supports;
  • Identifying and extracting metadata to support taxonomies and ontologies;
  • Coding and verifying bibliographies and references against third-party data sources (see the sketch after this list);
  • Ongoing distribution of content and metadata to discovery vendors with our Discovery Bridge service;
  • Analysis of large document collections to identify content reuse across multiple document sets and source formats with our Harmonizer software; and
  • QA validation and independent review of previously converted content.
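
As a flavour of the transformations involved in reference coding, here is a minimal sketch of tagging a single plain-text reference. The element names follow the JATS citation model, but the one-pattern parser and the sample reference are simplifying assumptions – production reference processing involves many more rules plus verification against third-party sources:

```python
import re
import xml.etree.ElementTree as ET

def tag_reference(ref_id, text):
    """Wrap one plain-text journal reference in JATS-style citation markup.
    Assumes a simple 'Authors. Title. Journal. Year;Vol(Issue):Pages.' shape;
    real-world references need far more parsing rules than this."""
    m = re.match(
        r"(?P<authors>[^.]+)\.\s+(?P<title>[^.]+)\.\s+(?P<source>[^.]+)\.\s+"
        r"(?P<year>\d{4});(?P<volume>\d+)\(\d+\):(?P<fpage>\d+)-(?P<lpage>\d+)",
        text,
    )
    ref = ET.Element("ref", id=ref_id)
    cit = ET.SubElement(ref, "element-citation", {"publication-type": "journal"})
    for author in m.group("authors").split(","):
        surname, given = author.strip().rsplit(" ", 1)
        name = ET.SubElement(cit, "name")
        ET.SubElement(name, "surname").text = surname
        ET.SubElement(name, "given-names").text = given
    ET.SubElement(cit, "article-title").text = m.group("title")
    ET.SubElement(cit, "source").text = m.group("source")
    for field in ("year", "volume", "fpage", "lpage"):
        ET.SubElement(cit, field).text = m.group(field)
    return ref

ref = tag_reference("ref1", "Smith J, Jones A. A study of data conversion. "
                            "J Data Sci. 2019;12(3):45-67.")
print(ET.tostring(ref, encoding="unicode"))
```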

What is the biggest challenge for data companies at present? 

Keeping up with the rapidly growing volume of the world’s research output, while assuring quality and truthfulness, is a big challenge – one that goes beyond the research paper itself – and data companies should take the lead in meeting it.

How content consumers interact with information is still basically the same as when we were in a print-driven world – someone wants information, looks for it, discovers a topic of interest, and then consumes it. The tools we use to search and read are certainly different and constantly updated. But the volumes of material are so much larger today – and finding the right information, without missing that critical piece, is much harder.

I think the recent pandemic illustrates how important it is to keep metadata content relevant and current. Many publishers and other content-centric organisations deeply understand the importance of taxonomies and metadata. But how do you ensure the language you implemented 10 (or 20) years ago still keeps pace today? And when a search identifies hundreds, or thousands, of articles – how do you scan them while assuring that you are not missing critical information? Much of what’s done in scholarly communications is artisan work, and without more automation it’s difficult to keep up.

It’s time to revisit complex data and content structure challenges. Advances in automation and artificial intelligence, in all its facets (machine learning, natural language processing, computer vision, and so on) hold answers that were not feasible even five years ago. Projects that were previously impractical due to budget constraints are now within reach. 

I always like to listen to our customers detail the big-picture projects they want to explore, and to find ways to make them affordable to undertake. For example, the New York Public Library knew it wanted to create ‘a resource for the world’: the idea was to provide access to all books that are out of copyright. The first step was to ensure the copyright records were structurally and intelligently tagged. At DCL we took the Internet Archive’s digitised (but unstructured) bibliographic references and put them into XML, and that structured data serves as the basis for the NYPL’s resource.
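
The payoff of that structuring step is that rights questions can then be asked as simple queries over the tagged records. The sketch below assumes hypothetical element names and a simplified pre-1929 US public-domain cutoff, and skips the renewal logic real determinations require – it is not the schema or the rules used on the NYPL project:

```python
import xml.etree.ElementTree as ET

# Hypothetical, already-structured records; the element names are illustrative.
RECORDS = """
<records>
  <book><title>The History of Widgets</title><pub-year>1923</pub-year></book>
  <book><title>Modern Widget Theory</title><pub-year>1951</pub-year></book>
</records>
"""

def public_domain_candidates(xml_text, cutoff_year=1929):
    """Flag books published before an assumed US public-domain cutoff year.
    Real copyright determination involves many more rules (renewals, place
    of publication, and so on); this only shows why structure helps."""
    root = ET.fromstring(xml_text)
    return [
        book.findtext("title")
        for book in root.iter("book")
        if int(book.findtext("pub-year")) < cutoff_year
    ]

print(public_domain_candidates(RECORDS))   # ['The History of Widgets']
```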

I always ask people: are there unstructured data streams and data collections that could be structured – and improve your processes?

Jump forward 20 years … what will be the role of data in research and academia?

I think many of the issues around increasing volume and standardising information will likely be solved over the next few decades, and things may not look that different – though there will be much more content, and a need for more efficient ways to find what you need.

The looming problem today is trusting the research results. How do we verify research data and make sure that what gets out is accurate and honest? Attempts to reproduce and verify research results are often not successful. Should base data be required? Should independent verification be required? How do we avoid plagiarism and faulty research? And with the need to get research out faster in the form of preprints, how does one preserve the scientific process and validation?

The concept of big data is not new, and bigger data is already here. Intelligence and structure will allow us to better sort through what is meaningful and what is not. Content structure and metadata help separate the wheat from the chaff, and might help us identify faulty research to some extent – but the biggest challenge may not be a data problem. It may be a trust problem.

Any interesting facts, pastimes or hobbies that you want to tell us about?

I’m an avid skier and have a strong interest in history, as well as artificial intelligence. A few years ago, I learned to play the saxophone!

• Interview by Tim Gillett