Robotics speed up book digitisation

11 August 2008

Issue: August / September 2008

By the end of this year, 20 million pages of the British Library's 19th century books will be available electronically. Siân Harris visited the library to see how it is being done.

As I write this – and probably as you read it – six sophisticated machines and their operators are hard at work in a corner of the British Library (BL). These machines are busy turning hundreds of pages of old books into digital files every hour.

The BL’s digitisation of 19th century books is one of many digitisation projects around the world that have been funded by Microsoft. The software giant was originally loading the digitised books onto its Live Search platform and about 40,000 British Library items were available on this site before Microsoft pulled its book project at the end of May.

The Live Search Books programme, which also included libraries such as those at the University of California, the University of Toronto, the New York Public Library, the American Museum of Veterinary Medicine and Cornell University, digitised 750,000 out-of-copyright books to put on the platform. However, Microsoft said that it now believes that the best way for a search engine to make book content available is by crawling content repositories created by publishers and libraries.

The end of Live Search Books does not mean an immediate end to the projects it was funding, however. For the British Library, the Microsoft funding covers 20 million pages, which is approximately 80,000 to 100,000 books, a target that the library anticipates reaching by the end of the year, subject to production variables. And Microsoft is encouraging its partners to keep their own digitised copies and carry on their projects. ‘We are removing our contractual restrictions placed on the digitised library content and making the scanning equipment available to our digitisation partners and libraries to continue digitisation programmes,’ said Satya Nadella, Microsoft’s senior vice-president for search, portal and advertising, when the Microsoft decision was announced.

Making whole collections electronic

The 19th century book project is not the British Library’s first digitisation initiative or its only one – there are 15 such projects going on currently. However, the speed and sheer number of titles being digitised are far greater than in past initiatives and this is changing the process of picking which books to digitise.

‘One of the big challenges with digitisation is title selection,’ said Neil Fitzgerald, book digitisation project delivery manager for the British Library. ‘Mass digitisation allows us to deal with historical biases by digitising a whole collection.’

The six machines in the BL, which were provided by Kirtas Technologies, USA, can each digitise up to 2,400 pages per hour, although Fitzgerald said that 1,200 pages per hour is more realistic for old and fragile books such as many in the BL’s collections. These machines are being put to work 16 hours per day by digitisation partner Content Conversion Specialists (CCS) of Germany. ‘The original target was to digitise about one million pages per month but it will soon be two million pages per month,’ commented Fitzgerald.

This project was piloted last year and full production began in late October/early November 2007. The pilot was essential in deciding the workflow. According to Fitzgerald, the book scanning itself posed fewer challenges than other parts of the process. ‘The actual digitisation is relatively easy. The robotics are new, but we have been digitising materials for 20 years,’ he explained. ‘New approaches are needed, however, to cope with volumes.’

One of the biggest challenges with the volumes involved was that of getting enough books from their shelves to provide the operators with 75,000 pages to scan each day, and then reshelving them all afterwards. ‘At the start of the project we did a book movement pilot to check that we could support this and to find out how many staff it required so that we could work that in from the beginning,’ pointed out Fitzgerald. ‘We also had to develop a bulk ordering system to get several hundred copies at once and that had to work with the integrated library system.’
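The throughput figures quoted here can be cross-checked with some simple arithmetic. The sketch below is illustrative only: the constant and function names are invented for this example, and it assumes all six machines run at the 'realistic' 1,200 pages per hour for the full 16 hours. On those assumptions the machines could in principle handle 115,200 pages per day, so the 75,000 pages supplied daily would use roughly two thirds of that capacity. The number of working days per month is derived from the quoted figures and was not stated by the library.

```python
# Figures quoted in the article.
MACHINES = 6
REALISTIC_PAGES_PER_HOUR = 1_200
HOURS_PER_DAY = 16
DAILY_SUPPLY_PAGES = 75_000        # pages delivered to the operators each day
MONTHLY_TARGET_PAGES = 2_000_000   # the target Fitzgerald says is coming soon


def daily_capacity(machines: int, pages_per_hour: int, hours_per_day: int) -> int:
    """Upper bound on pages scanned per day if every machine runs flat out."""
    return machines * pages_per_hour * hours_per_day


capacity = daily_capacity(MACHINES, REALISTIC_PAGES_PER_HOUR, HOURS_PER_DAY)

# Share of theoretical capacity used by the quoted daily supply of pages.
utilisation = DAILY_SUPPLY_PAGES / capacity

# Scanning days per month needed to hit the monthly target at that daily rate
# (a derived figure, not one given by the library).
days_needed = MONTHLY_TARGET_PAGES / DAILY_SUPPLY_PAGES

print(f"Theoretical capacity: {capacity:,} pages per day")
print(f"Utilisation at 75,000 pages per day: {utilisation:.0%}")
print(f"Scanning days needed per month: {days_needed:.1f}")
```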

Users of British Library reading rooms can now read these books online. Even fold-outs within the books have been scanned.

Scanning the books

Each of the Kirtas machines has a cradle that holds a book at a gentle 110° angle. The two sides of the cradle move independently of each other and, as the pages move from one side to the other, the heights of the two sides adjust. The machines capture both pages at the same time in full colour.

Pages are turned using a vacuum arm. The strength of the vacuum can be adjusted depending on the paper weight. There is also something called a fluffer, which blows air gently at the book and separates the next 10 pages. This, according to Fitzgerald, also has the side benefit of cleaning the book.

The machines also have sensors that check if two pages have been turned at once. As part of the process, the front and back covers of the books are also scanned. In addition, all fold-outs in books are digitised. This is more costly than the normal pages as it cannot be done on the fast Kirtas machines. Instead, when the operators encounter a fold-out, they indicate this in the workflow software. Once the rest of the book has been scanned, the fold-out is then scanned on a flatbed scanner and the software drops it into the right place in the digitised book.
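The article does not describe how the workflow software does this internally, but the idea of flagging a fold-out and filling it in later can be sketched as follows. Everything here is hypothetical: the function name, the `("page", ...)` and `("foldout", ...)` entries and the file names are invented for illustration and are not taken from the CCS or Kirtas software.

```python
def merge_foldouts(sequence, flatbed_scans):
    """Return the final page order for one book.

    sequence      -- entries in scanning order: ("page", filename) for pages
                     captured on the robotic scanner, or ("foldout", marker)
                     where the operator flagged a fold-out.
    flatbed_scans -- maps each fold-out marker to the flatbed scan made
                     after the rest of the book was finished.
    """
    merged = []
    for kind, ref in sequence:
        if kind == "foldout":
            if ref not in flatbed_scans:
                # The book is not complete until every flagged fold-out exists.
                raise KeyError(f"fold-out {ref} has not been scanned yet")
            merged.append(flatbed_scans[ref])
        else:
            merged.append(ref)
    return merged


# Hypothetical book: a fold-out map sits between pages 2 and 3.
book = [
    ("page", "0001.tif"),
    ("page", "0002.tif"),
    ("foldout", "F1"),
    ("page", "0003.tif"),
]
final_order = merge_foldouts(book, {"F1": "flatbed_map.tif"})
print(final_order)
```

Keeping a marker in the page sequence means the robotic scan never has to stop for a fold-out, while the final file still follows the order of the printed book.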

The scanned pages are stored as PDF files with embedded optical character recognition (OCR). The PDF version allows readers to see the pages as they were originally created. This is important in some disciplines, particularly in the humanities, where the context of text can be as important as the content itself. In addition, the OCR process helps to make the content easier to search.

The digitised books with all their metadata and OCR are about 1MB per page and the BL expects to have 25-30TB of data by the end of the project. This data is replicated in three locations for preservation and disaster recovery.
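These storage figures are easy to sanity-check. The sketch below reads the units as megabytes and terabytes, uses decimal units (one terabyte as a million megabytes) and assumes that the 25-30 terabyte estimate refers to a single copy of the data, which the article does not state. The names are invented for this example.

```python
PAGES = 20_000_000       # pages covered by the Microsoft funding
MB_PER_PAGE = 1.0        # "about" one megabyte per page, including metadata and OCR
COPIES = 3               # the data is replicated in three locations
MB_PER_TB = 1_000_000    # decimal units (an assumption)


def total_terabytes(pages: int, mb_per_page: float, copies: int = 1) -> float:
    """Storage needed, in terabytes, for the given number of copies."""
    return pages * mb_per_page * copies / MB_PER_TB


single_copy_tb = total_terabytes(PAGES, MB_PER_PAGE)
replicated_tb = total_terabytes(PAGES, MB_PER_PAGE, COPIES)

# Average page size implied by the library's own 25-30 terabyte estimate.
implied_low_mb = 25 * MB_PER_TB / PAGES
implied_high_mb = 30 * MB_PER_TB / PAGES

print(f"One copy at 1 MB per page: {single_copy_tb:.0f} TB")
print(f"Three copies at 1 MB per page: {replicated_tb:.0f} TB")
print(f"Implied average page size: {implied_low_mb:.2f} to {implied_high_mb:.2f} MB")
```

At exactly one megabyte per page, 20 million pages come to 20 terabytes, so the library's estimate implies an average nearer 1.25 to 1.5 megabytes per page. That is consistent with 'about' one megabyte being a rounded figure.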

Although these books are no longer available on the Live Search platform, they can be accessed through the BL’s catalogue. According to Fitzgerald, it takes between five and 30 seconds for a digitised book to be delivered to the desktop. Once the PDF is open you can just look at thumbnails of the pages, look at the PDF and OCR or do full-text searching.

Inevitably with a project of this scale, maintaining the quality and consistency is a big challenge. However, there are several quality control steps within the process beyond the checks on the machines themselves. Firstly, the computer shows the operator the pages and page numbers so that they can watch for any differences and note them in the workflow. The OCR process also checks page numbers. After this, partners in Romania do remote quality assurance using a mixture of automated and manual tools. In addition, CCS samples 15 per cent of the digitisations to check their quality. Finally, the British Library does quality control by batch sampling and rejects or accepts a batch of digitised books based on its sample.
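The final two steps, sampling a share of the output and accepting or rejecting a whole batch on the strength of that sample, can be sketched in a few lines. This is a minimal illustration, not the procedure used by CCS or the British Library: the 15 per cent fraction comes from the article, but the function names, the zero-failure threshold and the stand-in quality check are assumptions.

```python
import random


def sample_batch(book_ids, fraction, rng):
    """Pick a random sample of books from a batch (at least one book)."""
    sample_size = max(1, round(len(book_ids) * fraction))
    return rng.sample(book_ids, sample_size)


def accept_batch(sampled, passes_check, max_failures=0):
    """Accept the whole batch only if the sample has few enough failures.

    passes_check is a caller-supplied function returning True for a book
    that meets the quality standard; max_failures=0 is an assumed threshold.
    """
    failures = sum(1 for book in sampled if not passes_check(book))
    return failures <= max_failures


# Hypothetical batch of 200 digitised books, sampled at 15 per cent.
batch = [f"book-{i:03d}" for i in range(200)]
sampled = sample_batch(batch, 0.15, random.Random(2008))

# Stand-in check: pretend one particular book failed inspection.
accepted = accept_batch(sampled, lambda book: book != "book-042")
print(f"Sampled {len(sampled)} of {len(batch)} books; batch accepted: {accepted}")
```

The appeal of batch sampling at this scale is that the checking effort grows with the size of the sample rather than with the number of pages scanned, at the cost of occasionally rejecting or passing a batch on the evidence of a few books.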

Cataloguing differences

Another challenge comes from the ways that books were originally catalogued. British Library staff in the 19th century were unlikely to have foreseen the mass digitisation effort currently taking place so the information in their catalogues is often not the same as that required today.

One example is that language encoding was often not included in historic catalogues. ‘The idea in the past was that if you could read the print record then the book would be relevant to you but if you couldn’t then it wasn’t,’ said Fitzgerald. ‘We are therefore also using this process to add language encoding.’

This is an important issue because only about half of the British Library’s collections are in English, which became the dominant publishing language only in the late 19th and early 20th centuries.

Another issue for historical material is its condition. ‘The irony is that later books are printed on cheaper materials such as paper compared with cloth or vellum so the newer books are in a worse condition,’ said Fitzgerald. ‘The British Library is trying to get intellectual property rights provision to digitise newer titles for preservation. We receive all printed books by legal deposit but the law does not keep pace with technology.’

Looking ahead

As the Microsoft funding draws to an end, Fitzgerald says that the BL is considering several options. ‘The library is pragmatic; partnerships will only continue, or be extended beyond mutually agreed thresholds, if they continue to fulfil the needs of all those taking part,’ he said.

And the BL has plenty of other commercial partners – including Microsoft – on a number of other projects such as digitising newspapers, theses and sound recordings.

What will be digitised next is a tricky decision. One thing is certain though: with 150 million items, a number that is increasing dramatically every year, the BL will not be short of options.
