Looking for pearls

It can be hard to find all the relevant material online when there is so much available. The OAIster project of the University of Michigan in the USA provides a solution by harvesting the information that is hidden in over 400 institutions around the world. Katerina Hagerdorn, metadata harvesting librarian for the project, describes what this means.

People looking for scholarly information can find it difficult to know where to start. There is so much to choose from, even when we narrow the field to those materials available in digital format. The amount of scanned and born-digital material, such as images, texts, datasets and videos, that is fully available via the web is growing at an exponential rate.

However, items of interest may not be part of library catalogues, federated databases or online journal subscriptions. Many reside in local databases, available via the web but difficult to locate. Web search engines, such as Yahoo! or Google, often have difficulty adding these materials to their indexes because they are 'hidden' behind search forms or CGI scripts. In these cases, materials are essentially invisible to the scholar.

OAIster was developed to alleviate this problem. It is named after the Open Archives Initiative (OAI) Protocol for Metadata Harvesting (PMH), developed to make it much easier for descriptive information about materials to be shared among institutions. At the time (pre-1999), strong communication was necessary among institutions to discover remote resources, notify appropriate personnel and develop plans to integrate these resources. While protocols like Z39.50 had the potential for automating this process, the OAI-PMH was designed from a different point of view. Using OAI-PMH, institutions need not set up time-consuming committees or conferences to develop methods for exchanging material. Instead, they need only exchange metadata describing what is available.

Metadata is data about data - in this case, author, title, publisher, date, etc., similar in form to electronic library catalogue records. This type of metadata contains pointers, i.e. web URLs, to the actual digital materials themselves. An OAI-PMH metadata record with the title 'The single hound: poems of a lifetime' will have a link to the full text of the poems at the institution from which the metadata record was retrieved.

Host institutions (i.e. data providers) enable their systems to understand the protocol - a limited set of HTTP request verbs that will return their XML Unicode metadata - and requesting institutions (i.e. service providers) use a software tool to 'harvest' this metadata from host institutions. The method allows requesting institutions to receive metadata from all over the globe without contacting the host institutions directly.
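
As a rough illustration of the exchange described above, the following Python sketch builds a harvesting request and reads the records out of a response. It is a minimal sketch, not OAIster's harvester: the base URL, the sample response and the function names are invented for the example, while the `ListRecords` verb, the `oai_dc` metadata prefix and the two XML namespaces come from the OAI-PMH and Dublin Core specifications.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces used by OAI-PMH 2.0 responses and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def build_request(base_url, verb, **args):
    """Build a request URL: the repository's base URL plus a verb and its arguments."""
    return base_url + "?" + urlencode({"verb": verb, **args})

def parse_records(xml_text):
    """Return (title, list of dc:identifier values) for each record in a response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        title = rec.findtext(".//dc:title", default="", namespaces=NS)
        links = [e.text for e in rec.iterfind(".//dc:identifier", NS) if e.text]
        records.append((title, links))
    return records

# Hypothetical response from a hypothetical data provider (example.org).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>The single hound: poems of a lifetime</dc:title>
          <dc:identifier>http://example.org/text/single-hound</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(build_request("http://example.org/oai", "ListRecords", metadataPrefix="oai_dc"))
    for title, links in parse_records(SAMPLE_RESPONSE):
        print(title, links)
```

A real harvester would fetch the URL over HTTP and follow the protocol's resumption tokens for long result lists; both are left out here to keep the sketch self-contained.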

This is what OAIster has done - harvested metadata pointing to digital materials from as many institutions as make their metadata available. Our initial foray in June 2002 gathered 200,000 materials. Two and a half years later we provide search access to over 4.8 million metadata records, and therefore digital materials, from around the world. OAIster has become the de facto digital materials union catalogue.

Where is the data?
The majority of the North American academic research institutions participate in OAI but more than half of the over 400 repositories we harvest are outside North America. These include institutions across Europe, such as the University of Southampton and University of Glasgow in the UK, the National University of Ireland, Universität Bielefeld in Germany, Bibliothèque Nationale de France, Italy's Università degli Studi di Firenze and Norway's Universitetet i Oslo. The Australian National University and National Library of Australia are also involved, as are institutions from South America (e.g. Universidade Federal do Paraná in Brazil and Universidad Nacional del Centro de la Provincia de Buenos Aires, Argentina) and Asia (e.g. Hong Kong University, Japan's Hokkaido University and National Sun Yat-sen University in Taiwan).

Developing countries are not well represented, but aggregators have begun collecting and making output from these nations available (e.g. African Journals Online). In addition, publishers such as PubMed, BioMed Central and Public Library of Science have made their metadata available for harvesting.

Many of the metadata repositories we harvest are self-archiving tools - avenues for creators to archive their not-yet-published material, and make this material openly and freely accessible. Because the publishing process can be slow, and then hampered by restrictions in access to the published materials, a work-around was needed for quicker access to research. Self-archiving is one of the most popular methods. Although OAIster never set out to be a focal point of open access, it contains virtually all the self-archived materials available.

Of the types of digital materials we collect metadata on, over two million are textual (e.g. books, articles, theses), more than one million are images (e.g. photographs, paintings), and several thousand are audio or video files. These reflect all subject matters, nearly evenly split between the sciences and the humanities.

Quirks of using metadata
Some of the metadata we harvest is purely bibliographic, i.e. metadata without the pointer to the digital material. It may point nowhere (i.e. contains no URL), it may point to further descriptive information, or it may just point to an abstract of the digital material. Since we aim to be the union catalogue of digital materials, we strive to remove this kind of metadata. Due to the volume of records harvested, we cannot check each URL to see if it is pointing to digital material. Therefore we spot-check each repository the first time we harvest it - if a majority of the records are 'dead ends', we will not use this repository. But we will contact the repository owner to see if they are able to create a special 'set' for us of just the records that point to digital materials. Often, this is successful as the repository owners are eager for their materials to be more widely available.
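
The first-harvest spot-check could look something like the Python sketch below. It is only an illustration under stated assumptions: the record layout (a dict with an `identifiers` list), the sample size and the function names are hypothetical, and the check only detects records with no URL at all. Deciding whether a URL leads to the digital material rather than to an abstract still needs a person to look.

```python
import random

def has_pointer(record):
    """True if any identifier in the record looks like a web URL."""
    return any(
        value.lower().startswith(("http://", "https://"))
        for value in record.get("identifiers", [])
    )

def mostly_dead_ends(records, sample_size=50, seed=0):
    """Spot-check a random sample; True if a majority of it has no URL."""
    if not records:
        return True  # nothing to point at, so nothing worth harvesting
    rng = random.Random(seed)  # fixed seed so the spot-check is repeatable
    sample = rng.sample(records, min(sample_size, len(records)))
    dead = sum(1 for record in sample if not has_pointer(record))
    return dead > len(sample) / 2

# Hypothetical harvested records: three purely bibliographic, one with a pointer.
BIBLIOGRAPHIC_REPOSITORY = [
    {"identifiers": ["ISBN 0-000-00000-0"]},
    {"identifiers": []},
    {"identifiers": ["call number QA 76"]},
    {"identifiers": ["http://example.org/item/4"]},
]

if __name__ == "__main__":
    print(mostly_dead_ends(BIBLIOGRAPHIC_REPOSITORY))
```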

In other instances, metadata will point to a finding aid, which is a tool used by archives to list and describe archival containers. It is possible to view finding aids in two ways. The first is as the digital material itself, especially if the finding aid has been painstakingly crafted, and the second is as the pointer to the archival container, which often cannot be digitised. We decided to include finding aids for scholars who can utilise them as digital items.

Harvesting metadata bypasses the storage issues of harvesting digital material itself. OAIster's metadata records are around 2.5 GB and the indexes of those records make up less than that. But not owning the digital material necessitates checking for dead links. Due to the volume of records we harvest, it is difficult to check every link on a periodic basis. Instead we re-harvest each repository once a month and if we receive error messages we check that repository for records with dead URLs. This is uncommon - the large majority of repositories we harvest are trustworthy, since they originate as part of a digital library project or information technology venture. Permanent URLs are fast becoming an integral part of any digital project, and factor into the repository's trustworthiness.

For the patrons of OAIster's search system, we knew we would face some challenges. It is nigh-on impossible to create the 'perfect' search system because it is nigh-on impossible to know all the users of your system. We performed usability tests with people who were potentially heavy users of the system. These tests ranged from label-sorting exercises to watching people use the first version of the system, and the results informed a great deal of our interface work.

However, there were features we were not able to provide to users and much of the reason was technical. In one example, users were interested in being able to sort their search results by date, a decidedly useful option to have. Because we harvest metadata from all over the world, and from institutions that have differing policies on the creation of this metadata, what we receive varies widely from repository to repository. Some examples of date formats we receive include:

  • 2-12-01;
  • 2002-01-01;
  • 0000-00-00;
  • 1822;
  • between 1827 and 1833;
  • 18--?;
  • November 13, 1947;
  • SEP 1958;
  • 235 bce; and
  • Summer, 1948

Obviously, it would be difficult to allow date sorting with such variability among date formats. Even the option of normalising such dates - flattening the realm of dates into all numeric, or month-day-year format, or another choice - seemed too onerous to pursue.
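
To make the difficulty concrete, here is a deliberately cautious Python sketch that tries to pull a sortable year out of the sample values listed above. The rules and names are assumptions chosen for illustration, not anything OAIster runs: the function accepts only forms that are unambiguous and gives up on everything else.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
ISO_DATE = re.compile(r"(\d{4})-\d{2}-\d{2}")
BARE_YEAR = re.compile(r"\d{4}")
MONTH_DAY_YEAR = re.compile(r"(?:" + MONTHS + r") \d{1,2}, (\d{4})")

def sortable_year(value):
    """Return a year as an int, or None when the value is ambiguous or unrecognised."""
    value = value.strip()
    match = ISO_DATE.fullmatch(value)
    if match:
        # Treat placeholder dates such as 0000-00-00 as unknown.
        return int(match.group(1)) if match.group(1) != "0000" else None
    if BARE_YEAR.fullmatch(value):
        return int(value)
    match = MONTH_DAY_YEAR.fullmatch(value)
    if match:
        return int(match.group(1))
    return None

# The date values quoted in the article.
SAMPLES = [
    "2-12-01", "2002-01-01", "0000-00-00", "1822", "between 1827 and 1833",
    "18--?", "November 13, 1947", "SEP 1958", "235 bce", "Summer, 1948",
]

if __name__ == "__main__":
    for sample in SAMPLES:
        print(repr(sample), "->", sortable_year(sample))
```

Only three of the ten samples ('2002-01-01', '1822' and 'November 13, 1947') yield a year under these rules. Each of the others would need its own rule or a judgment call: '2-12-01', for instance, could be read as several different dates.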

Normalisation by type
We decided instead to normalise the 'type' metadata element. This element contains information on whether the digital item is a book, article, thesis, photograph, moving image, or any other type of digitally created material. We could, for instance, take the value 'preprint' and map this to the generic value 'text.' Performing that action for all metadata records results in a list of types mapped to four generic values: 'text', 'image', 'audio' and 'video'. Patrons can then limit their searches to a particular type, e.g. 'New England' limited to 'image'.

Of course, this is rudimentary normalisation at best. Any type value that is not recognised as belonging to one of the generic values, or not included in the original value / generic value list, will not be mapped. This manifests itself in cases of misspellings (e.g. Litograph) or character-encoded languages (e.g. Actes de Conférence). More importantly, it is also seen with values that cannot be precisely mapped (e.g. 'print' could be either 'text' or 'image').
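
The mapping described above amounts to a lookup table. The Python sketch below shows the idea with a small, invented table; the entries and the function name are illustrative assumptions, and the real value list is far longer.

```python
# Illustrative original-value -> generic-value table (hypothetical entries).
TYPE_MAP = {
    "preprint": "text",
    "book": "text",
    "article": "text",
    "thesis": "text",
    "photograph": "image",
    "painting": "image",
    "moving image": "video",
}

def normalise_type(raw):
    """Map a harvested type value to a generic value, or None if it is not in the table."""
    return TYPE_MAP.get(raw.strip().lower())

if __name__ == "__main__":
    for value in ["Preprint", "Litograph", "print"]:
        print(repr(value), "->", normalise_type(value))
```

The misspelling 'Litograph' and the ambiguous 'print' both come back unmapped, which is the limitation noted above: a table can only map the values someone has anticipated and can assign to exactly one generic type.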

Being the union catalogue of digital materials involves harvesting all repositories' metadata records and then weeding out those records that do not fit (such as with the 'no-pointer' problem mentioned above). In a minority of cases, we will harvest aggregator metadata, which collects metadata from original repositories. If we have already harvested the original repository, we end up with duplicate metadata. Currently we have no method to pinpoint duplicates, partially due to the difficulty of determining the unique target for identifying duplicates, although title seems to be the best choice. This means that we have to resort to removing these duplicates by hand. We do not do this on a record-by-record basis, but on a repository-by-repository basis. Our goal is to keep the original repository's records, which are often the most up-to-date. However, we still need to harvest aggregators for those repositories that we are unable to harvest directly (such as those that are not yet OAI-enabled or that we receive errors from when harvesting).
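
If title were used as the matching target, a first pass at flagging candidate duplicates might resemble the following Python sketch. This is a hypothetical approach rather than a method OAIster uses: the title key, the record layout and the repository names are assumptions, and titles that merely look alike would still need a human decision.

```python
import re
from collections import defaultdict

def title_key(title):
    """Reduce a title to lowercase words only, so case and punctuation are ignored."""
    return " ".join(re.findall(r"[^\W_]+", title.casefold()))

def candidate_duplicates(records):
    """Group (repository, title) pairs by title key; keep keys seen in two or more repositories."""
    groups = defaultdict(set)
    for repository, title in records:
        groups[title_key(title)].add(repository)
    return {key: sorted(repos) for key, repos in groups.items() if len(repos) > 1}

# Hypothetical harvest: one title arrives from both an original repository and an aggregator.
HARVESTED = [
    ("original", "The single hound: poems of a lifetime"),
    ("aggregator", "The Single Hound : Poems of a Lifetime."),
    ("original", "Another title"),
]

if __name__ == "__main__":
    print(candidate_duplicates(HARVESTED))
```

Reporting the repositories that share a key, rather than deleting records, fits the repository-by-repository approach described above: the output suggests which aggregator to examine, and the original repository's records are kept.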

It is here that OAI 'sets' are used to best advantage by service providers such as OAIster. Sets are created by data providers to partition and categorise their metadata (e.g. thematically, by format or by ISSN). In the case of aggregator data providers, it is often possible to ignore sets that reflect original repository metadata. For example, we harvest metadata directly from the BioMed Central (BMC) repository and ignore the BMC metadata available in multiple sets through PubMed's aggregator repository. This is a huge timesaving device as the alternative would be to harvest everything and then pick out that which is not needed.
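
Selective harvesting of this kind can be sketched in Python as one request per wanted set. The base URL and set names below are invented for the example; the `set` argument to `ListRecords` is part of the OAI-PMH specification, but how a given aggregator names its sets is up to that aggregator.

```python
from urllib.parse import urlencode

def requests_for_sets(base_url, set_specs, skip):
    """Build one ListRecords URL per set, omitting sets already harvested at source."""
    urls = []
    for spec in set_specs:
        if spec in skip:
            continue  # the original repository is harvested directly
        query = urlencode(
            {"verb": "ListRecords", "metadataPrefix": "oai_dc", "set": spec}
        )
        urls.append(base_url + "?" + query)
    return urls

if __name__ == "__main__":
    # Hypothetical aggregator with two sets, one of which mirrors a repository
    # that is already harvested directly.
    for url in requests_for_sets(
        "http://aggregator.example.org/oai", ["mirrored", "other"], {"mirrored"}
    ):
        print(url)
```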

Another factor in harvesting everything available is that we receive metadata pointing to restricted digital materials. At first, we were unsure whether to include these in OAIster. Our rationale for eventually including them was that the union catalogue would not be complete if we didn't. Excluding them would also limit access for patrons who are entitled to use materials restricted to particular communities. It is also often problematic to determine whether materials are restricted. Following a link to a restricted digital item may give no indication that a restricted site has been accessed (for instance, because your institution subscribes to these materials by IP address). In some cases, restrictions are indicated in the element; however, this is up to the creators of the metadata and can be as variable as date format. An OAI protocol subcommittee has worked on incorporating rights and restrictions elements into the protocol. This is at the repository level, which is useful in choosing whether to harvest a particular repository. However, many repositories have a mix of restricted and freely available metadata records. Incorporating these elements at the record level is proposed work for the subcommittee.

Furthering access
Creating a union catalogue of digital materials is only worthwhile if people know about it. We realised that publicising OAIster would reach mostly digital librarians, and not public service librarians and scholars who would find it most useful. So, when we were approached by the project manager for the Content Acquisition Program (CAP) at Yahoo!, we jumped at the chance for OAIster metadata to be directly included in the Yahoo! Search index. After the launch in March 2004, our search hit statistics exploded to more than 100 times that of our OAIster site statistics. Being in the Yahoo! index means that people all over the world who use Yahoo! Search, and associated partners such as MSN Search, can link to the digital materials our metadata describes. (Try searching 'Immunomodulatory properties of the Chinese medicinal extract polysaccharide peptide' at search.yahoo.com and viewing the first result.)

Naturally, Google came calling after they learned of our partnership with Yahoo!. We currently provide Google with OAIster metadata but have yet to see how it is being used, although Google claims to only be using the URLs.

Google's new library digitisation project, of which the University of Michigan is a part, does not impact the majority of digital materials OAIster points to. While it is true that Google will digitise the copies of 'The single hound: poems of a lifetime' on our library shelves, it will not digitise 'Immunomodulatory properties of the Chinese medicinal extract polysaccharide peptide' or any other thesis from Hong Kong University, at least not in the near future. Digitising such materials will be the responsibility of individual institutions. While Google (and other search engines) might be investigating harvesting OAI metadata to retrieve these materials, we view this as the ultimate goal of OAI - to make digital materials more accessible to the public.

For this to be most successful, we need to further publicise the benefits of making metadata available via OAI. Surprisingly, many of the most prestigious universities in the USA have not made anything available. And for those that have, these are often small selections from departmental materials. For example, Harvard's only contribution to date has been metadata of maths and science videos from the Science Education Department at the Harvard-Smithsonian Center for Astrophysics. From the informal network of service providers and data providers, we have learned that one of the main reasons for this lack of availability is that OAI sits at the bottom of the list of priorities after digitisation of all materials has been completed.

We hope to encourage OAI efforts through our formation of a 'Best Practices' group, which has solicited advice from most of the major players in the OAI field and representatives of OAI-enabling software products. This group aims to provide guidelines for those interested in becoming data providers by answering technical, metadata and protocol questions.

Planned improvements
Improving OAIster is always on our agenda. The larger we grow, the more imperative it is to make OAIster manageable. Slowness is one of our biggest hurdles. We will be researching, as part of a grant-funded collaborative effort with three other universities, the means to partition OAIster into smaller search areas to improve speed of searching. Most likely, OAIster will be sectioned thematically, e.g. social sciences, physics, art history, but populating these partitions will be a challenge. Metadata contains much less textual information than the full text of a document, and categorising software works most effectively the more words it is given to work on. While we will investigate the use of such software, we may need to resort to thematically partitioning on a repository-by-repository basis. This can be problematic, as a significant number of repositories contain metadata describing more than one subject area.

'De-duplication' and normalisation of additional metadata elements will also be a part of our grant effort, in addition to more advanced interface features, such as simple/advanced search forms, email/download records capabilities and OpenURL/Z39.50 compliance. However, our main effort will focus on developing OAIster as a service provider model.

Currently, it is possible to use the OAIster template out of the box within the Digital Library eXtension Service (DLXS), which is software developed by the University of Michigan Libraries for building and hosting digital materials. To create an OAIster-like service entails changing the interface colours and graphics only. For instance, the Committee on Institutional Cooperation (CIC) has created an OAIster-like portal to CIC member university metadata (nergal.grainger.uiuc.edu/cgi/b/bib/oaister). However, there is no mechanism for using more robust metadata (i.e. more narrowly defined metadata, involving a more complex element structure) within such a service. Expanding the current capabilities of the template to utilise these different metadata formats will, in effect, create a more generic model.

It will be interesting to see where OAI leads - whether to a more fully defined protocol, further protocols that will eventually take the place of the current one, or an entirely new method. In any scenario, the impetus behind developing the OAI protocol will not change. Academic researchers and scholars will always need access to primary source material, and more and more frequently will require it in digital form. What may change is our method of providing it to them.

Further information
www.oaister.org
To learn more about becoming a data provider, see www.oaister.org/o/oaister/dataproviders.html