Innovation boosts article formats
What is the best format for presenting scholarly information? Is the PDF dead? These are some of the questions that a lively and entertaining discussion at the recent UKSG meeting attempted to answer.
First up in the debate was Steve Pettifer, of the University of Manchester, UK, presenting the case for the PDF: ‘PDF comes in for a lot of stick, yet it’s just a file format,’ he pointed out, before adding that more than 80 per cent of all articles are downloaded as PDFs.
He cited the notion of ‘keeping the minutes of science’. PDFs, Pettifer argued, can be referred to reliably and stably. He said that the PDF is great for this as ‘it’s a single blob of stuff’ and serves reasonably well at giving a snapshot of data.
PDFs, he said, are downloadable and archivable: ‘mine to keep (as long as you keep a sensible backup)’. Nor can they be sneakily modified: a downloaded copy stays exactly as it was published. As for correcting mistakes, he added, you can publish corrections.
Martin Fenner, of the Hannover Medical School, Germany, saw the situation differently, however, challenging how such corrections might be distributed to the thousands of people who might have downloaded a paper. And he had other issues with PDFs too: ‘It’s an open question how much is read as HTML versus downloaded as PDF,’ he said.
What’s more, he argued, the PDF doesn’t work for mobile devices because the format won’t reflow for a smaller screen as HTML does. And a PDF is not good for linking content, while linking to other things is what HTML is built for.
This latter issue comes down to how you want to use content. Pettifer replied that the point of scientific articles is not primarily for linking but for telling ‘stories that persuade with data,’ as Anita De Waard of Elsevier Labs describes them.
PDFs meet this purpose well, Pettifer believes, because they have ‘edges’ – a specific number of pages and boundaries, like stories that have a beginning, middle and end. ‘An article is meant to be one idea/argument. When I read an article I either want to go all through without distraction or find a particular result,’ he said. In contrast, he added: ‘If you are reading online you never really know if you’ve finished the article because you can go off following links.’
He demonstrated what he saw as other limitations of HTML by comparing the same paper in the two formats side by side. He blocked out all the things displayed that were not ‘content’ of the paper (the publisher logo, journal name, links, tag cloud etc) and showed that almost 50 per cent of the HTML version was not actually the paper’s content. The amount of ‘non-content’ space on the PDF version was much lower.
Pettifer also noted some common ‘PDF fallacies’. He argued that the PDF has not been a ‘closed format’ for about 10 years and that it can encode international characters. He also argued against the belief that PDFs are not ‘semantic’ and have no ‘structure’, saying that this is only because publishers don’t use features that have been part of the PDF specifications for at least 10 years.
And then there is the issue of people making bad format choices. ‘People will say "I asked for data but was sent a table in PDF format. I can’t do anything with that. Therefore PDFs are stupid" – but if people sent you data as a JPG or as interpretive dance, that would also be stupid,’ he joked, adding: ‘There is no such thing as a stupid format, just stupid people.’
Fenner agreed: ‘It is not so much about one or the other – I think we’ll still have PDF in 50 years – but it’s about how people use them.’ But he pointed out that HTML has also moved on from some of its old limitations, with ways to tailor, for example, how content is displayed.
Fenner highlighted developments such as JavaScript, jQuery and CSS, which enable people to write web documents with innovative navigation and presentation. He also mentioned the emerging trend of publishing with WordPress.
Responsive design and CSS media queries can be used to detect the resolution and size of the reader’s screen and adapt the layout accordingly. In addition, he said, there are solutions for offline reading such as Instapaper and EPUB.
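To illustrate the kind of adaptation Fenner was describing, a minimal sketch in a browser environment might use the standard matchMedia API; the 600px breakpoint and the ‘single-column’ class name below are illustrative assumptions, not details from the talk:

```typescript
// Minimal sketch (illustrative, not from the talk): adapt an article's layout
// to the reader's screen width using the standard matchMedia API.
// The 600px breakpoint and the "single-column" class are assumptions.
const narrowScreen = window.matchMedia("(max-width: 600px)");

function applyLayout(mq: MediaQueryList | MediaQueryListEvent): void {
  // On narrow screens, reflow the article into a single column;
  // on wider screens, restore the default multi-column layout.
  document.body.classList.toggle("single-column", mq.matches);
}

applyLayout(narrowScreen);                             // apply once on load
narrowScreen.addEventListener("change", applyLayout);  // re-apply on resize or rotation
```

The same breakpoint could equally be written as a pure CSS media query; the point is simply that an HTML article can respond to the device it is read on, which a fixed PDF page cannot.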
The discussion also turned to other ways and formats for delivering content, with Geoff Bilder of CrossRef bemoaning the popularity of apps. ‘The creation of apps means that suddenly e-books are tied up with a format and you can’t link into them – and publishers seem suicidally drawn to doing this,’ he remarked.
Pettifer said that such an approach ‘seems like the opposite of open access,’ while Fenner added that ‘formats for certain publishers make no sense. One problem we have is that EPUB, which is the most logical format for mobile devices, is not really there for journals yet.’ However, one delegate commented that perhaps the choice to use apps is driven by consumer preference.
HTML5 is also expected to make a difference to format use, as is the ongoing move towards XML-based workflows. As Fenner explained: ‘Most publishers are now primarily producing XML and then going to PDF and HTML. We are starting to see crossover with typography.’ However, although ‘XML is nice for computers’, the reality, according to Fenner, is that ‘text is sent to India and then taken apart and done by hand.’
The PDF also has potential to do more, according to Pettifer. He pointed to Utopia Documents (utopiadocs.com), a new PDF reader. ‘It gives a stable PDF with dynamic content on the side,’ he explained. The content that can be displayed alongside the article includes citations, licensing conditions, links and related papers in Mendeley.
Utopia Documents also includes mechanisms to comment and share without the comments changing the paper itself. It displays the data behind the article and enables readers to see who else is talking about it online.
When it comes to the PDF versus HTML debate, perhaps it is possible to have the ‘best of both worlds’.