Charlie Rapple reports on proposals to improve the accuracy and completeness of the data supplied to knowledge bases
Knowledge bases are the resources within the OpenURL process that know where content is, how to link to it and, crucially, which version of the content a particular institution’s users are entitled to access. This might be on a publisher’s own website (such as InformaWorld), a hosting platform (such as IngentaConnect), in a database (such as EbscoHOST), via a gateway (such as SwetsWise), in a library repository – or even on a library shelf.
The quality of a knowledge base depends heavily on the data supplied to it by publishers, aggregators and other content providers. There is currently no standard format for such data, and knowledge base owners must convert title lists from different providers to a single format. This can introduce errors, for example, from misunderstandings about how a field is being used. Other errors are not specifically related to the transfer of data but to inconsistencies or inaccuracies in the data itself, for example, incorrect ISSN usage (particularly when titles change), title variations and misspellings, and inaccurate date information.
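Some of these errors can be caught mechanically before a title list is sent. As a minimal sketch, the function below checks the format and check digit of an ISSN using the standard modulus-11 rule. The function name is illustrative only. A check of this kind catches typing errors; it cannot tell whether a correctly formed ISSN has been attached to the wrong title, for example after a title change.

```python
import re

# Four digits, a hyphen, three digits, then a check character (digit or X).
ISSN_PATTERN = re.compile(r"^(\d{4})-(\d{3})([\dX])$")


def is_valid_issn(issn: str) -> bool:
    """Return True if `issn` is well formed and its check digit is correct."""
    match = ISSN_PATTERN.match(issn.strip().upper())
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # The first seven digits are weighted 8, 7, 6, 5, 4, 3, 2.
    total = sum(int(d) * w for d, w in zip(digits, range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return match.group(3) == expected
```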
If data provided to knowledge bases is incomplete, inaccurate, out of date or in some other way ‘bad’, the efficacy of the OpenURL standard is undermined and the resulting links can often become useless. This point was at the heart of recommendations made by the 2007 report Link Resolvers and the Serials Supply Chain. This report was researched and written by James Culling (then of Scholarly Information Strategies) on behalf of UKSG, a non-profit organisation that connects the information community.
Tackling the problems
Following this report, UKSG set up the KBART (Knowledge Bases And Related Tools) working group, which is also endorsed by NISO (the US National Information Standards Organization). KBART has brought together members of all parts of the e-resource supply chain to explore the issue of data transfer to knowledge bases. It aims to create guidelines that will help to resolve the most common and high-impact problems. Its 12 core members represent the different stakeholders in the scholarly information supply chain equally. It also includes a ‘monitors group’ to allow interested parties to receive regular reports on the group’s progress and to help review the group’s reports prior to public release. The group’s first report is being finalised for public release as this article goes to press, and this article explores some of its proposed solutions.
In response to the lack of standard formats for the data provided to knowledge bases, the KBART Phase I report recommends some best practices for formatting, populating and distributing content holdings lists. By making some small adjustments to the format of their title lists, content providers can greatly increase the accessibility of their products and reduce frustration for library users.
The recommendations are designed to be intuitive and easy for all parties to implement. They are based on those methods and data fields that have proven to be effective or valuable in the combined experience of the working group members. In many cases there are acceptable alternatives but for clarity and simplicity the group has distilled its experience into a single recommendation, where possible.
The recommendations can be summarised as follows:
- Content providers should post holdings data to a website or FTP site for download by link resolver suppliers;
- Content providers should provide a metadata update every month, or less often if the coverage data changes less frequently;
- Both the content provider and the knowledge base provider should designate specific staff members to be responsible for data files and exchange;
- Content providers should provide metadata in tab-separated values format;
- The filename should follow a specified format that includes the provider name as given in their web domain (eg JSTOR or INGENTACONNECT) and the date of the transfer;
- Separate files should be produced for each package of content that the provider offers;
- All metadata should be provided as plain text;
- Text should be encoded as UTF-8;
- One publication should be given in each line of the file; and
- A title should be listed twice if there is a coverage gap of 12 months or more.
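Two of these recommendations lend themselves to a short illustration: the coverage-gap rule and the filename convention. The sketch below is one possible reading, not the report's own procedure. It assumes that a gap is measured in whole calendar months between the end of one coverage range and the start of the next, and that the filename joins provider, package and transfer date with underscores. The function names and the exact filename pattern are assumptions made for the example.

```python
from datetime import date


def months_between(end: date, start: date) -> int:
    """Whole calendar months from `end` to `start` (day of month ignored)."""
    return (start.year - end.year) * 12 + (start.month - end.month)


def split_on_gaps(ranges, min_gap_months=12):
    """Merge coverage ranges separated by short gaps.

    Ranges separated by a gap of `min_gap_months` or more are kept as
    separate entries, so the title is listed once per entry.
    """
    rows = []
    for start, end in sorted(ranges):
        if rows and months_between(rows[-1][1], start) < min_gap_months:
            # Short gap: extend the previous entry instead of adding a line.
            rows[-1] = (rows[-1][0], max(rows[-1][1], end))
        else:
            rows.append((start, end))
    return rows


def holdings_filename(provider: str, package: str, transfer_date: date) -> str:
    """Build a filename from provider, package and transfer date (assumed pattern)."""
    return f"{provider.upper()}_{package}_{transfer_date.isoformat()}.txt"
```

For example, a title covered from 1990 to 1995, from early 1996 to 1999 and again from 2003 would be listed twice under this reading: once for 1990 to 1999, because the three-month gap is ignored, and once for 2003 onwards.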
The group has identified the following as core data fields, all of which should be supplied, if they exist:
- Publication title;
- Print-format identifier;
- Online-format identifier;
- Date of first issue available online;
- Date of last issue available online;
- Number of first volume available online;
- Number of last volume available online;
- Number of first issue available online;
- Number of last issue available online;
- Title-level URL;
- First author (for monographs);
- Title ID;
- Content Type (abstracts/fulltext); and
- Publisher name (if not given in the file’s title).
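To make the file format concrete, the sketch below writes a holdings list as tab-separated, UTF-8 encoded plain text with one publication per line. The column names are illustrative labels derived from the list above, not the names prescribed in the report, and the sample title and identifier used with it should be treated as dummy data.

```python
import csv
import io

# Illustrative column labels for the core fields (assumed, not official names).
CORE_FIELDS = [
    "publication_title",
    "print_identifier",
    "online_identifier",
    "date_first_issue_online",
    "date_last_issue_online",
    "num_first_vol_online",
    "num_last_vol_online",
    "num_first_issue_online",
    "num_last_issue_online",
    "title_url",
    "first_author",
    "title_id",
    "content_type",
    "publisher_name",
]


def write_holdings(rows, stream) -> None:
    """Write a header line, then one tab-separated line per publication."""
    writer = csv.DictWriter(
        stream,
        fieldnames=CORE_FIELDS,
        delimiter="\t",
        lineterminator="\n",
        restval="",  # fields that do not exist for a title are left empty
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


def holdings_to_bytes(rows) -> bytes:
    """Return the holdings list as UTF-8 encoded plain text."""
    buffer = io.StringIO()
    write_holdings(rows, buffer)
    return buffer.getvalue().encode("utf-8")
```

Leaving missing fields empty, rather than omitting the column, keeps every line the same width, which makes it harder for a field to be misread when the file is loaded into a knowledge base.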
The full Phase I report includes plenty of supporting information to help explain the recommendations, along with comprehensive instructions for creating and transferring a data file. The group acknowledges that many content providers and knowledge base owners are already successfully exchanging metadata. The recommendations are not intended to detract from or interfere with such existing processes; they are intended to provide guidance to those who are unsure about how best to exchange metadata.
KBART’s working group members have been active in raising awareness of the group’s activities to date and will continue to speak and write about the recommendations now that they have been released. Future platforms include ALPSP’s ‘Does my content look big in this?’ seminar (February 2010) and the UKSG annual conference (April 2010). The group welcomes suggestions for other events at which it could work to raise awareness of its activities and recommendations.
In terms of future reports, the group anticipates many additional steps that can be taken by all stakeholders to further improve the library user’s experience of link resolvers and their related knowledge bases. Some that will likely be addressed in future phases include:
- Customisation of data transfer to reflect individual library holdings;
- Consortia-specific data transfer;
- Documentation of guidelines for data relating to non-text content; and
- Review of data transfer for e-books.
The group encourages feedback on the focus of its future efforts.
Charlie Rapple is head of marketing development at TBI Communications and retiring chair of the KBART working group. Going forward, the group is chaired by Peter McCracken (Serials Solutions) and Sarah Pearson (University of Birmingham).