Automation reduces the cost of archiving

Automating the way that data is indexed, sorted and structured can save publishers and societies considerable time and expense, writes Philip Paterson of UK-based SomCom

Archiving has typically been perceived as an extremely expensive undertaking that offers very little gain. For this reason it usually features well down on an organisation's agenda. In addition to cost, the poor quality of early optical character recognition software and uncertainty over whether an archive would be used once it had been created have been major barriers to adoption in the past. While these latter two concerns have now been more or less addressed, cost can still be a major barrier. One society I visited recently was quoted more than £100,000 just for hosting its archive! This sort of number is well beyond the means of most societies, libraries or institutions to entertain without the benefit of donor funding.

Much of this cost arises from the labour-intensive nature of the tasks involved in creating an archive, such as recoding into XML or SGML and re-typesetting entire documents. Even using highly skilled but low-paid workers in the Far East, the costs of these processes can be a huge deterrent.

Such a dilemma was encountered recently by the British Academy, the UK's national academy for the humanities and social sciences, which celebrated its centenary in 2002. The British Academy wanted to produce an electronic archive of the annual volumes of its flagship publication, the Proceedings of the British Academy. This publication includes conference proceedings, texts of scholarly lectures and extended obituaries of the academy's fellows. It covers a wide range of subjects so the academy recognised the importance of providing a way for users to home in easily on their area of interest.

What was needed, it concluded, was a simple and affordable way of capturing the articles in electronic form, making them word- and phrase-searchable, and then delivering them electronically to users in a widely accepted format. For this, the British Academy turned to SomCom, a UK-based provider of electronic publishing solutions for societies and professional publishers.

SomCom believes that automation is the key to tackling both the cost and delivery time of archive creation, as well as many other publishing tasks. It has used its proprietary software in projects such as document delivery and e-commerce; fully searchable reference works; sorting, indexing and structuring of data; and online peer review processes.

The company's approach to software development is to use robust and proven open-source standards, languages and protocols, such as Binary, Perl, C++, CGI, SQL, XML and Linux. This keeps the cost of development to a minimum and does not require the licensing of costly third-party proprietary software. This approach helps increase the shelf life of the solution, and the code that is created is strong and fast. With these tools SomCom can deliver a full archiving, full text search and document delivery platform for around 50p to £1 a page.

The process involved

SomCom's electronic archiving solution allows a whole lecture or article to be searched - rather than just the headers and abstracts - without the need for recoding into XML or SGML or re-typesetting the entire document. This gives a cost-effective way to publish articles online that are not already structured in an electronic format. Data can be taken in a variety of forms - hard copies and electronic files - and then converted so that all may be stored together in one database.

The ability to handle a range of forms was crucial for the British Academy project. Bound volumes of the Proceedings of the British Academy covering the years 1964-93 were scanned and saved as PDF files. They were also put through the latest optical character recognition software. More recent volumes were provided as QuarkXPress and 3B2 (a professional typesetting program) files and these were also saved as PDF files for final output. The use of the PDF format ensures that the user obtains standardised and easily accessible files. It is similar to a photocopy in that the user will obtain a copy of the article as it is seen in its original published form.

The next step was to use the company's SomCNV software to merge data from both the scanned and electronic sources and produce an internet-based archive database. Another of SomCom's programs, SomFTS, was then used to enable full text searching of the database.
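The merging step can be pictured with a minimal sketch. This is not SomCNV, whose internals are not described here; it only assumes that each page, whether it came from a scan or from a typesetting file, ends up as a row in one SQL table. The table layout, the function name build_archive and the sample records are all hypothetical.

```python
import sqlite3

def build_archive(scanned_pages, electronic_pages):
    """Load page records from both sources into one in-memory SQLite table.

    Each record is a tuple (article_id, page_no, text). The schema is an
    assumption made for illustration only.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pages ("
        " article_id TEXT, page_no INTEGER, source TEXT, text TEXT,"
        " PRIMARY KEY (article_id, page_no))"
    )
    # Tag every row with where it came from so both kinds sit side by side.
    rows = [(a, p, "scan", t) for a, p, t in scanned_pages]
    rows += [(a, p, "electronic", t) for a, p, t in electronic_pages]
    conn.executemany("INSERT INTO pages VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn

# Hypothetical sample records, not taken from the real archive.
scanned = [("vol50-lecture1", 1, "OCR text of a scanned page")]
electronic = [("vol90-memoir3", 1, "Text exported from a typesetting file")]

conn = build_archive(scanned, electronic)
count = conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
sources = [r[0] for r in conn.execute(
    "SELECT DISTINCT source FROM pages ORDER BY source")]
```

Once both kinds of page share one table, a search or index built on top of it does not need to know how a page was originally captured.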

SomCNV and SomFTS together allow complete automation of the structuring and the indexing of an archive database, keeping the manual work required to a minimum and the accuracy of the finished product to a maximum. Both programs have been designed for large, multi-volume publications and can handle data from a few thousand to millions of pages.

The archive database for the academy and the in-house programs are currently hosted on a dedicated server based in London, although the programs can be licensed if clients wish to hold their archives on third-party servers.

Searching the archive

From the British Academy website the user can access the Full Search Screen and is then able to key in up to 50 words. The program will search the entire archive for matches in seconds. The way in which the data is structured automatically by SomCNV enables the user to select whether to search the entire archive or just the memoirs or lectures. The data is also indexed so that it is possible to view all articles written by a certain author or as part of a certain lecture series.
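A search of this kind can be sketched with a small inverted index. The code below is an illustration, not SomFTS: it assumes that a query matches an article only when every query word appears in it, and the class name ArchiveIndex, the section labels and the sample articles are hypothetical. Only the 50-word query limit and the ability to restrict a search to memoirs or lectures, or to list articles by author, come from the description above.

```python
from collections import defaultdict
import re

MAX_QUERY_WORDS = 50  # the search screen accepts up to 50 words

def tokenize(text):
    """Split text into lower-case alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

class ArchiveIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # word -> ids of articles containing it
        self.meta = {}                    # article id -> section and author

    def add(self, article_id, section, author, text):
        self.meta[article_id] = {"section": section, "author": author}
        for word in tokenize(text):
            self.postings[word].add(article_id)

    def search(self, query, section=None):
        """Return ids of articles containing every query word (assumed AND match)."""
        words = tokenize(query)
        if not words:
            return []
        if len(words) > MAX_QUERY_WORDS:
            raise ValueError("query exceeds the 50-word limit")
        hits = set.intersection(*(self.postings.get(w, set()) for w in words))
        if section is not None:
            hits = {a for a in hits if self.meta[a]["section"] == section}
        return sorted(hits)

    def by_author(self, author):
        """List every article indexed under the given author."""
        return sorted(a for a, m in self.meta.items() if m["author"] == author)

# Hypothetical sample articles.
index = ArchiveIndex()
index.add("a1", "lecture", "A. Author", "Medieval trade routes in Europe")
index.add("a2", "memoir", "B. Writer", "A memoir recalling trade and scholarship")

all_hits = index.search("trade")
memoir_hits = index.search("trade", section="memoir")
phrase_hits = index.search("trade routes")
author_hits = index.by_author("A. Author")
```

Because each word maps straight to the articles that contain it, a lookup touches only the postings for the query words and not the full text, which is one reason searches over a large archive can return in seconds.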

Once the program has completed the search, the user can view the PDF of the first page of each article listed as well as the page that has the optimum match. This allows the user to assess enough of the article to decide whether it is the correct one to download. Once a required article is found, the PDF of the whole article can be downloaded via an e-commerce or password-restricted area of the website.

Each time a file is to be downloaded, it is sent to a unique temporary directory, which the user has access to for a limited time as dictated by the client. This ensures that users are unable to obtain PDF files that have not been authorised for downloading. If further security is required, the PDF files can also be encrypted.
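The time-limited download area can be sketched as follows. This is a minimal illustration under stated assumptions, not SomCom's implementation: the class name DownloadArea, the one-hour lifetime and the placeholder file are hypothetical, and a real service would also need to tie each directory to an authorised user.

```python
import shutil
import tempfile
import time
from pathlib import Path

class DownloadArea:
    """Copies each requested PDF into its own temporary directory with an expiry."""

    def __init__(self, root, ttl_seconds):
        self.root = Path(root)
        self.ttl = ttl_seconds  # access window chosen by the client
        self.expiry = {}        # directory -> time after which it is removed

    def stage(self, pdf_path, now=None):
        """Copy the PDF into a fresh, uniquely named directory and return its path."""
        now = time.time() if now is None else now
        target_dir = Path(tempfile.mkdtemp(dir=str(self.root)))
        shutil.copy(pdf_path, target_dir)
        self.expiry[target_dir] = now + self.ttl
        return target_dir / Path(pdf_path).name

    def purge(self, now=None):
        """Delete every staged directory whose access window has closed."""
        now = time.time() if now is None else now
        for directory, expires in list(self.expiry.items()):
            if now >= expires:
                shutil.rmtree(directory, ignore_errors=True)
                del self.expiry[directory]

# Demonstration with a placeholder file and an explicit clock.
with tempfile.TemporaryDirectory() as root:
    source = Path(root) / "article.pdf"
    source.write_bytes(b"%PDF-1.4 placeholder")
    area = DownloadArea(root, ttl_seconds=3600)
    staged = area.stage(source, now=0)
    exists_before = staged.exists()
    area.purge(now=3600)
    exists_after = staged.exists()
```

Passing the clock in as a parameter keeps the expiry logic easy to check without waiting for real time to pass; in production the purge would typically run on a schedule.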

Beyond the internet

The electronic archive created is not limited to internet access alone. We are able to produce a CD-ROM version of any of our internet archives, which will still have the full-text search facility and will look and respond just as the website does. The CD-ROMs include software to link back to the internet-based archive for updating. This provides organisations with the option of generating extra revenue by offering the CD-ROM to libraries and other customers.

The British Academy is not the only organisation to work with SomCom on this type of project. Other clients include The Energy Institute; The World Petroleum Congress; Blackwell Publishing; The Institute of Marine Engineering, Science & Technology; The Institute of Chemical Engineers; National Inspection Council for Electrical Installation Contracting; and T&F Informa.

The projects that SomCom engages in are varied. We analyse the proposed project and use our expertise in publishing, typesetting and technology to suggest how technology could help the client to achieve its objective. In most cases clients want to offer additional value to members and subscribers or to generate additional revenue streams, which will help make a sound business case for the future and offset the cost of development. SomCom has found that it is best to keep the approach simple. Quite often organisations can become caught up in the latest buzzword or trend in technology, but have limited knowledge of the technology and how it can be applied to their business.

One of the overriding principles that we hold is that computer processors are cheaper than human resources. They are inexpensive and reliable. Computers can handle large volumes of data, carry out multiple tasks simultaneously, and can work all day and night without a break. They do not complain and do not require benefits. And in the unlikely event that they do break, they are likely to have been superseded by something more powerful and are equally cheap to replace.

Many organisations still push work offshore to the Far East, and try to solve their problems with large numbers of talented but low-paid people - but this is a bit short-sighted. Much of the work can now be automated and run on computers, giving greater reliability, easier access to the content, scalability, economies of scale and better cost control. Meanwhile, the cost of labour is likely to rise as developing countries' economies grow, while the cost of computers is falling and processing power is increasing. In the future many more organisations could turn to an automated approach.

Philip Paterson is business development director for SomCom