Looking for pearls
It can be hard to find all the relevant material online when there is so much available. The OAIster project of the University of Michigan in the USA provides a solution by harvesting the information that is hidden in over 400 institutions around the world. Katerina Hagedorn, metadata harvesting librarian for the project, describes what this means
People looking for scholarly information can find it difficult to know where to start. There is so much to choose from, even when we narrow the field to materials available in digital format. The amount of scanned and born-digital material, such as images, texts, datasets and videos, that is fully available via the web is growing at an exponential rate.
However, items of interest may not be part of library catalogues, federated databases or online journal subscriptions. Many reside in local databases, available via the web but difficult to locate. Web search engines, such as Yahoo! or Google, often have difficulty adding these materials to their indexes because they are 'hidden' behind search forms or CGI scripts. In these cases, materials are essentially invisible to the scholar.
OAIster was developed to alleviate this problem. It is named after the Open Archives Initiative (OAI) Protocol for Metadata Harvesting (PMH), developed to make it much easier for descriptive information about materials to be shared among institutions. At the time (pre-1999), strong communication was necessary among institutions to discover remote resources, notify appropriate personnel and develop plans to integrate these resources. While protocols like Z39.50 had the potential for automating this process, the OAI-PMH was designed from a different point of view. Using OAI-PMH, institutions need not set up time-consuming committees or conferences to develop methods for exchanging material. Instead, they need only exchange metadata describing what is available.
Metadata is data about data - in this case, author, title, publisher, date, etc., similar in form to electronic library catalogue records. This type of metadata contains pointers, i.e. web URLs, to the actual digital materials themselves. An OAI-PMH metadata record with the title 'The single hound: poems of a lifetime' will have a link to the full text of the poems at the institution from which the metadata record was retrieved.
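As a rough sketch of this idea, such a record can be thought of as a handful of descriptive fields plus an identifier that points back to the digital item. The field names below follow simplified Dublin Core and the URL is hypothetical:

```python
# A minimal sketch of an OAI-PMH metadata record as a Python dict.
# Field names follow simplified Dublin Core; the URL is hypothetical.
record = {
    "title": "The single hound: poems of a lifetime",
    "creator": "Dickinson, Emily",
    "date": "1914",
    # The 'identifier' field carries the pointer back to the full text
    # at the institution the record was harvested from.
    "identifier": "http://www.example.edu/texts/single-hound",
}

def pointer(rec):
    """Return the URL that leads from the metadata to the digital item."""
    return rec["identifier"]

print(pointer(record))
```

The record itself is small and portable; only the identifier ties it back to the hosting institution.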
Host institutions (i.e. data providers) enable their systems to understand the protocol - a limited set of HTTP request verbs that will return their XML Unicode metadata - and requesting institutions (i.e. service providers) use a software tool to 'harvest' this metadata from host institutions. The method allows requesting institutions to receive metadata from all over the globe without contacting the host institutions directly.
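Because the protocol is just a small set of HTTP request verbs, a harvester's requests are ordinary GET URLs. A minimal sketch, with a hypothetical repository base URL:

```python
from urllib.parse import urlencode

# Sketch: an OAI-PMH request is a plain HTTP GET whose parameters include
# a 'verb' (e.g. Identify, ListRecords) and, for record requests, a
# metadata format such as oai_dc. The base URL below is hypothetical.
def oai_request(base_url, verb, **params):
    query = urlencode({"verb": verb, **params})
    return f"{base_url}?{query}"

url = oai_request("http://repository.example.edu/oai",
                  "ListRecords", metadataPrefix="oai_dc")
print(url)
```

Fetching that URL from a real data provider would return the repository's metadata as XML Unicode, ready for the service provider to index.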
This is what OAIster has done - harvested metadata pointing to digital materials from as many institutions as make their metadata available. Our initial foray in June 2002 gathered 200,000 materials. Two and a half years later we provide search access to over 4.8 million metadata records, and therefore digital materials, from around the world. OAIster has become the de facto digital materials union catalogue.
Where is the data?
Developing countries are not well represented, but aggregators have begun collecting and making output from these nations available (e.g. African Journals Online). In addition, publishers such as PubMed, BioMed Central and Public Library of Science have made their metadata available for harvesting.
Many of the metadata repositories we harvest are self-archiving tools - avenues for creators to archive their not-yet-published material and make it openly and freely accessible. Because the publishing process can be slow, and access to the published materials is often hampered by restrictions, a work-around was needed for quicker access to research. Self-archiving is one of the most popular methods. Although OAIster never set out to be a focal point of open access, it contains virtually all the self-archived materials available.
Of the types of digital materials we collect metadata on, over two million are textual (e.g. books, articles, theses), more than one million are images (e.g. photographs, paintings), and several thousand are audio or video files. These reflect all subject matters, nearly evenly split between the sciences and the humanities.
Quirks of using metadata
Sometimes metadata contains no pointer at all, leaving a patron with nothing to follow. In other instances, metadata will point to a finding aid, a tool used by archives to list and describe archival containers. A finding aid can be viewed in two ways: as the digital material itself, especially if it has been painstakingly crafted, or as the pointer to the archival container, which often cannot be digitised. We decided to include finding aids for scholars who can utilise them as digital items.
Harvesting metadata bypasses the storage issues of harvesting digital material itself. OAIster's metadata records are around 2.5 GB and the indexes of those records make up less than that. But not owning the digital material necessitates checking for dead links. Due to the volume of records we harvest, it is difficult to check every link on a periodic basis. Instead we re-harvest each repository once a month and if we receive error messages we check that repository for records with dead URLs. This is uncommon - the large majority of repositories we harvest are trustworthy, since they originate as part of a digital library project or information technology venture. Permanent URLs are fast becoming an integral part of any digital project, and factor into the repository's trustworthiness.
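The policy described above - re-harvest monthly, and only link-check repositories whose harvest returned errors - can be sketched as a simple selection step. Repository names here are hypothetical:

```python
# Sketch of the re-harvest policy described above: rather than checking
# every URL, only repositories whose monthly harvest returned errors are
# queued for a dead-link check. Repository names are hypothetical.
def repos_to_check(harvest_status):
    """harvest_status maps repository name -> 'ok' or 'error'."""
    return [repo for repo, status in harvest_status.items()
            if status == "error"]

status = {"repo-a": "ok", "repo-b": "error", "repo-c": "ok"}
print(repos_to_check(status))  # only repo-b needs its URLs re-verified
```

The actual link verification would then fetch each of that repository's URLs, which is feasible precisely because the check is confined to the few repositories that misbehaved.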
For the patrons of OAIster's search system, we knew we would face some challenges. It is nigh-on impossible to create the 'perfect' search system, because it is impossible to know all the users of your system. We performed usability tests with people who were potentially heavy users of the system. These tests ranged from label-sorting exercises to watching people use the first version of the system, and they informed a great deal of our interface work.
However, there were features we were not able to provide to users, and much of the reason was technical. For example, users were interested in being able to sort their search results by date, a decidedly useful option to have. But because we harvest metadata from all over the world, from institutions with differing policies on the creation of this metadata, what we receive varies widely from repository to repository; the date formats we receive are a case in point.
Obviously, it would be difficult to allow date sorting with such variability among date formats. Even the option of normalising such dates - flattening the realm of dates into all numeric, or month-day-year format, or another choice - seemed too onerous to pursue.
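A small sketch shows why. Even a lenient normaliser that tries several structured formats still fails on free-text dates; the sample values below are hypothetical stand-ins for the variability described above:

```python
from datetime import datetime

# Sketch: attempt to normalise harvested dates to YYYY-MM-DD.
# The sample values are hypothetical, not actual OAIster data.
FORMATS = ["%Y-%m-%d", "%Y", "%d %B %Y", "%B %Y"]

def normalise_date(raw):
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # free-text dates defeat normalisation

print(normalise_date("1914"))           # "1914-01-01"
print(normalise_date("ca. 1900-1910"))  # None
```

Every unparseable value either has to be dropped from sorting or handled with yet another special case, which is what made the option seem so onerous.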
Normalisation by type
To offer patrons a resource-type limit, we map the type values supplied by each repository to a handful of generic values (e.g. text, image, audio, video). Of course, this is rudimentary normalisation at best. Any type value that is not recognised as belonging to one of the generic values, or not included in the original-value/generic-value list, will not be mapped. This manifests itself in cases of misspellings (e.g. 'Litograph') or values in languages other than English (e.g. 'Actes de Conférence'). More importantly, it is also seen with values that cannot be precisely mapped (e.g. 'print' could be either 'text' or 'image').
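The mapping amounts to a simple lookup table with a fall-through for unrecognised values. The entries below are illustrative, not OAIster's actual table:

```python
# Sketch of the type normalisation described above: repository-supplied
# type values are mapped to a short list of generic values. The mapping
# entries here are illustrative, not OAIster's actual table.
TYPE_MAP = {
    "article": "text",
    "book": "text",
    "photograph": "image",
    "painting": "image",
    "sound recording": "audio",
}

def generic_type(raw):
    # Unrecognised values (misspellings, non-English terms, ambiguous
    # values like 'print') fall through unmapped, as noted above.
    return TYPE_MAP.get(raw.strip().lower())

print(generic_type("Photograph"))  # image
print(generic_type("Litograph"))   # None: misspelling is not mapped
```

The weakness is exactly the one described: the table only handles values someone has anticipated, so every misspelling or new language needs a new entry.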
Being the union catalogue of digital materials involves harvesting all repositories' metadata records and then weeding out those records that do not fit (such as with the 'no-pointer' problem mentioned above). In a minority of cases, we will harvest aggregator metadata, which collects metadata from original repositories. If we have already harvested the original repository, we end up with duplicate metadata. Currently we have no method to pinpoint duplicates, partially due to the difficulty of determining the unique target for identifying duplicates, although title seems to be the best choice. This means that we have to resort to removing these duplicates by hand. We do not do this on a record-by-record basis, but on a repository-by-repository basis. Our goal is to keep the original repository's records, which are often the most up-to-date. However, we still need to harvest aggregators for those repositories that we are unable to harvest directly (such as those that are not yet OAI-enabled or that we receive errors from when harvesting).
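Since title seems the best candidate key, a duplicate check between an original repository and an aggregator could be sketched as a normalised-title comparison. The titles below are hypothetical:

```python
# Sketch of a title-based duplicate check between an original repository
# and an aggregator, as suggested above. Titles are hypothetical samples.
def normalise_title(title):
    # Collapse case and runs of whitespace so trivial variations match.
    return " ".join(title.lower().split())

def duplicate_titles(original, aggregated):
    known = {normalise_title(t) for t in original}
    return [t for t in aggregated if normalise_title(t) in known]

original = ["The Single Hound: Poems of a Lifetime", "Harvest Notes"]
aggregated = ["the single  hound: poems of a lifetime", "Other Work"]
print(duplicate_titles(original, aggregated))
```

Even this simple heuristic illustrates the difficulty: titles differ in punctuation, translation and completeness across repositories, which is why removal is done repository-by-repository rather than record-by-record.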
It is here that OAI 'sets' are used to best advantage by service providers such as OAIster. Sets are created by data providers to partition and categorise their metadata (e.g. thematically, by format or by ISSN). In the case of aggregator data providers, it is often possible to ignore sets that reflect original repository metadata. For example, we harvest metadata directly from the BioMed Central (BMC) repository and ignore the BMC metadata available in multiple sets through PubMed's aggregator repository. This is a huge timesaving device as the alternative would be to harvest everything and then pick out that which is not needed.
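Selective harvesting then reduces to filtering an aggregator's set list against the repositories already harvested directly. The set names below are hypothetical stand-ins:

```python
# Sketch of harvesting by set: skip aggregator sets whose original
# repositories are already harvested directly, as described above.
# Set names are hypothetical stand-ins.
def sets_to_harvest(aggregator_sets, harvested_directly):
    return [s for s in aggregator_sets if s not in harvested_directly]

aggregator_sets = ["biomedcentral", "journal-x", "journal-y"]
harvested_directly = {"biomedcentral"}
print(sets_to_harvest(aggregator_sets, harvested_directly))
```

Each remaining set name would then be passed as the `set` parameter of a ListRecords request, so the unwanted metadata is never transferred at all.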
Another factor in harvesting everything available is that we receive metadata pointing to restricted digital materials. At first, we were unsure whether to include these in OAIster. Our rationale for eventually including them was that the union catalogue would not be complete if we didn't, and that excluding them would limit access for patrons who belong to the communities the materials are restricted to. It is also often difficult to determine whether materials are restricted at all. The restriction can be invisible when following the link to a digital item, with no indication that a restricted site has been accessed (for instance, because your institution subscribes to these materials by IP address). In some cases, restrictions are indicated in the metadata record itself.
Naturally, Google came calling after they learned of our partnership with Yahoo!. We currently provide Google with OAIster metadata but have yet to see how it is being used, although Google claims to be using only the URLs.
Google's new library digitisation project, of which the University of Michigan is a part, does not impact the majority of digital materials OAIster points to. While it is true that Google will digitise the copies of 'The single hound: poems of a lifetime' on our library shelves, it will not digitise 'Immunomodulatory properties of the Chinese medicinal extract polysaccharide peptide' or any other thesis from Hong Kong University, at least not in the near future. Digitising such materials will be the responsibility of individual institutions. While Google (and other search engines) might be investigating harvesting OAI metadata to retrieve these materials, we view this as the ultimate goal of OAI - to make digital materials more accessible to the public.
For this to be most successful, we need to further publicise the benefits of making metadata available via OAI. Surprisingly, many of the most prestigious universities in the USA have not made anything available. And for those that have, these are often small selections from departmental materials. For example, Harvard's only contribution to date has been metadata of maths and science videos from the Science Education Department at the Harvard-Smithsonian Center for Astrophysics. From the informal network of service providers and data providers, we have learned that one of the main reasons for this lack of availability is that OAI sits at the bottom of the list of priorities after digitisation of all materials has been completed.
We hope to encourage OAI efforts through our formation of a 'Best Practices' group, which has solicited advice from most of the major players in the OAI field and representatives of OAI-enabling software products. This group aims to provide guidelines for those interested in becoming data providers by answering technical, metadata and protocol questions.
'De-duplication' and normalisation of additional metadata elements will also be a part of our grant effort, in addition to more advanced interface features, such as simple/advanced search forms, email/download records capabilities and OpenURL/Z39.50 compliancy. However, our main effort will focus on developing OAIster as a service provider model.
Currently, it is possible to use the OAIster template out of the box within the Digital Library eXtension Service (DLXS), software developed by the University of Michigan Libraries for building and hosting digital materials. Creating an OAIster-like service entails only changing the interface colours and graphics. For instance, the Committee on Institutional Cooperation (CIC) has created an OAIster-like portal to CIC member university metadata (nergal.grainger.uiuc.edu/cgi/b/bib/oaister). However, there is no mechanism for using richer metadata (i.e. more narrowly defined metadata with a more complex element structure) within such a service. Expanding the template's current capabilities to utilise these different metadata formats will, in effect, create a more generic model.
It will be interesting to see where OAI leads - whether to a more fully defined protocol, further protocols that will eventually take the place of the current one, or an entirely new method. In any scenario, the impetus behind developing the OAI protocol will not change. Academic researchers and scholars will always need access to primary source material, and more and more frequently will require it in digital form. What may change is our method of providing it to them.