OAI-PMH Harvesting
Several widely implemented standards ensure interoperability in the repository domain. The best known is the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), which is used as a common interface for harvesting metadata from repositories.
What is metadata harvesting?
For a more detailed overview of harvesting protocols and related issues, see the Linking UK Repositories report (Swan & Awre, 2006).
When external search services want to index a website, they often crawl it by following links to find each web page, and then extract the text from the pages they find. This process works well, but it cannot make use of structured metadata, advanced search facilities based on metadata fields, or more advanced interrogation through methods such as data mining. It is also hard to identify new content without either performing another complete crawl of the website or making use of sitemaps.
The alternative to web crawling is harvesting. Harvesting involves making queries to the repository about its content and receiving replies that contain lists of items and item metadata. The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a machine-to-machine (m2m) interface specifically designed to facilitate the harvesting of metadata from Open Access repositories, and the vast majority of repositories provide this option.
Repositories that provide an OAI-PMH interface are designated 'OAI-compliant'. When a repository adheres to this protocol, some or all of the metadata that it holds for the items in its collection is exposed for harvesting by service providers. The returned metadata usually includes a URL for the item and for each full-text file, which can then also be processed if required.
Principles of OAI-PMH
Systems that provide information via OAI-PMH are known as data providers, and systems that harvest information using OAI-PMH are known as service providers (as they provide new services with the data).
OAI-compliant repositories have a base URL in addition to the URL for human users. For instance, Aberystwyth University's CADAIR repository has the OAI Base URL. On its own, an OAI Base URL simply returns XML containing an error message. This is because the protocol expects instructions in the form of a 'verb' and other arguments to be appended to the URL.
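As a rough illustration of the error reply described above, the following sketch parses a hand-written sample response and pulls out the error code. The sample XML, the host repository.example.org and the function name extract_errors are assumptions made for illustration, not output from any real repository; the 'badVerb' code is the one the OAI-PMH specification defines for a missing or illegal verb.

```python
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hand-written sample (an assumption) of the reply to a base URL with no verb.
SAMPLE_ERROR = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2010-01-01T00:00:00Z</responseDate>
  <request>http://repository.example.org/oai</request>
  <error code="badVerb">Illegal or missing verb</error>
</OAI-PMH>"""


def extract_errors(xml_text):
    """Return a list of (code, message) pairs for each <error> element."""
    root = ET.fromstring(xml_text)
    return [
        (err.get("code"), (err.text or "").strip())
        for err in root.findall("oai:error", OAI_NS)
    ]


print(extract_errors(SAMPLE_ERROR))
```

A harvester would typically run a check like this on every response before trying to read any records from it.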
OAI-PMH Verbs
The simplest case is the verb 'Identify', which returns identity information about the repository.
Altogether, there are six OAI-PMH verbs, some of which require additional arguments:
Verb | Description
Identify | Returns information about the repository. Example: RRP (Roehampton).
ListMetadataFormats | Lists the metadata formats supported by the repository. The minimum requirement is oai_dc (Dublin Core). Example: CADAIR (Aberystwyth University).
ListSets | Lists the sets provided by the repository (e.g. departments, subjects). Example: e-Prints (University of Southampton).
ListIdentifiers | Lists record identifiers, dates and any other headers for each deposited item. Requires the argument 'metadataPrefix'. Results can be limited to specified subsets using the 'set' parameter, and to specified time periods using the 'from' and 'until' parameters. Example: ePublication Library (Chilbolton, Daresbury, and Rutherford Appleton Laboratories).
ListRecords | Harvests metadata records from the repository. Requires the argument 'metadataPrefix'; metadataPrefix=oai_dc should suffice. Results can be limited to specified subsets using the 'set' parameter, and to specified time periods using the 'from' and 'until' parameters. Example: ePrints (Nottingham).
GetRecord | Retrieves an individual metadata record from the repository. Requires the arguments 'identifier' and 'metadataPrefix'.
Note: The verbs and their associated arguments are case-sensitive.
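To make the verb-and-argument pattern concrete, here is a minimal sketch of building a request URL. The function name build_request_url, the REQUIRED_ARGS table and the host repository.example.org are illustrative assumptions; the required arguments follow the OAI-PMH 2.0 specification. The sketch deliberately ignores the special case of resumption tokens, which replace the other arguments in follow-up requests.

```python
from urllib.parse import urlencode

# Required arguments per verb. Verb names are case-sensitive.
REQUIRED_ARGS = {
    "Identify": set(),
    "ListMetadataFormats": set(),
    "ListSets": set(),
    "ListIdentifiers": {"metadataPrefix"},
    "ListRecords": {"metadataPrefix"},
    "GetRecord": {"identifier", "metadataPrefix"},
}


def build_request_url(base_url, verb, **args):
    """Append a verb and its arguments to an OAI base URL."""
    if verb not in REQUIRED_ARGS:
        raise ValueError(f"Unknown OAI-PMH verb: {verb!r}")
    missing = REQUIRED_ARGS[verb] - set(args)
    if missing:
        raise ValueError(f"{verb} requires: {', '.join(sorted(missing))}")
    # 'verb' goes first; the remaining arguments keep their given order.
    return base_url + "?" + urlencode({"verb": verb, **args})


print(build_request_url("http://repository.example.org/oai", "Identify"))
print(build_request_url("http://repository.example.org/oai",
                        "ListRecords", metadataPrefix="oai_dc"))
```

Because 'from' is a reserved word in Python, a date-limited request has to pass it through a dictionary, for example build_request_url(base, "ListRecords", metadataPrefix="oai_dc", **{"from": "2010-01-01"}).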
When the results of an OAI-PMH query are large, they are often split into chunks of records. Each chunk ends with a 'resumption token' that can be used to retrieve the next chunk.
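The chunking behaviour can be sketched as a loop that keeps requesting the next chunk until no resumption token comes back. In this sketch, harvest_identifiers is a hypothetical name, and fetch is an assumed callable supplied by the caller: it takes a dictionary of request arguments and returns the response XML as text. Keeping the network access outside the function is a design choice for the example, so that the loop can be tried against canned responses.

```python
import xml.etree.ElementTree as ET

# Clark-notation prefix for elements in the OAI-PMH 2.0 namespace.
OAI = "{http://www.openarchives.org/OAI/2.0/}"


def harvest_identifiers(fetch, metadata_prefix="oai_dc"):
    """Yield record identifiers, following resumption tokens chunk by chunk.

    'fetch' is an assumed callable: dict of request arguments -> XML text.
    """
    args = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(args))
        for header in root.iter(OAI + "header"):
            yield header.findtext(OAI + "identifier")
        token = root.find(".//" + OAI + "resumptionToken")
        # A missing or empty token marks the final chunk.
        if token is None or not (token.text or "").strip():
            break
        # Follow-up requests carry only the verb and the token.
        args = {"verb": "ListIdentifiers",
                "resumptionToken": token.text.strip()}
```

In practice, fetch might wrap urllib.request.urlopen around the repository's base URL; that part is left out here because it depends on the repository being harvested.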
OAI-PMH installations can be set up to return results using a variety of metadata schemas. As a minimum, all OAI-PMH servers must be able to return results using the unqualified simple Dublin Core (oai_dc) schema, and this is all that many repositories or packages offer. However, they can provide as many or as few additional schemas as they wish. For example, EPrints and DSpace both support the United Kingdom Electronic Thesis and Dissertation Dublin Core (uketd_dc) metadata format required by EThOS.
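As an illustration of what an oai_dc record looks like to a harvester, the sketch below reads the Dublin Core fields out of a hand-written metadata fragment. The sample record, its values and the function name dublin_core_fields are assumptions for illustration; the two namespace URIs are the standard ones for oai_dc and Dublin Core elements.

```python
import xml.etree.ElementTree as ET

# Element name of the oai_dc container, in Clark notation.
OAI_DC = "{http://www.openarchives.org/OAI/2.0/oai_dc/}dc"

# Hand-written sample (an assumption) of the metadata part of one record.
SAMPLE_RECORD = """<metadata xmlns="http://www.openarchives.org/OAI/2.0/">
  <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
             xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>An example title</dc:title>
    <dc:creator>Author, A.</dc:creator>
    <dc:creator>Writer, B.</dc:creator>
    <dc:identifier>http://repository.example.org/123/</dc:identifier>
  </oai_dc:dc>
</metadata>"""


def dublin_core_fields(xml_text):
    """Collect Dublin Core values into a dict of field name -> list of values."""
    root = ET.fromstring(xml_text)
    fields = {}
    for container in root.iter(OAI_DC):
        for child in container:
            # Strip the '{namespace}' part, leaving e.g. 'title'.
            name = child.tag.split("}", 1)[1]
            fields.setdefault(name, []).append((child.text or "").strip())
    return fields


print(dublin_core_fields(SAMPLE_RECORD))
```

Values are kept in lists because unqualified Dublin Core allows any field to repeat, as the two creator entries in the sample show.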
OAI Registration
While a repository can be harvested simply by providing an OAI-PMH interface, registration provides a useful means of promoting the visibility of your repository to service providers for harvesting. The Open Archives Initiative provides a service allowing your repository to be registered as a data provider in the OAI registry. The registry is a publicly accessible list of all OAI-conformant repositories, which allows easy discovery of data providers by service providers.
When you register your repository, the OAI service performs conformance testing to ensure that it complies with OAI-PMH. If validation is successful, your repository is added to the registry. The OAI also periodically retests your repository for conformance. If the analysis fails, your repository is removed and a notification email is sent to the administrator detailing the reason for removal. This ensures the integrity of the OAI registry and your repository interface. Information on registering your repository with the OAI can be found at the OAI website. The OAI-PMH interfaces provided by the major open source repository platforms are created in such a way as to ensure that they are always compliant.
Although registering with the Open Archives Initiative will help to increase the visibility of your repository to other service providers, you can also register directly with these services to guarantee that they harvest your repository. Service providers which require additional registration are Intute Search, OAIster and OpenDOAR.
Useful links
Open Archives Initiative - The Open Archives Initiative develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content.
Open Archives Forum - The Open Archives Forum provided a Europe-based focus for dissemination of information about European activity related to open archives and, in particular, to the Open Archives Initiative. They have a useful online tutorial.
OAI Registration - Information on registering with the OAI as an OAI-PMH conformant data provider.
Open Archives Initiative - Repository Explorer - This site presents an interface to interactively test archives for compliance with the OAI-PMH.





