Adding code to web pages that shows how resources link together can help people find and use information and data, writes David Stuart
At the beginning of 2010, when the UK government opened up its one-stop portal for government data (data.gov.uk), it didn’t just link to data sets in popular document formats such as Excel, but also republished much of the data as Linked Data. Linked Data enables related data to be connected across the web by combining three concepts. Firstly, there are URIs (Uniform Resource Identifiers), which are strings of characters used to identify a name or resource on the internet. Then there is HTTP, the request-response protocol of the web. The third concept is RDF (Resource Description Framework).
RDF is a set of specifications used to describe objects and the relationships between them in the form of expressions that have three parts: subject, predicate and object. These so-called RDF triples allow for the encoding of a semantic web, one that can be read by computers as well as humans.
In its simplest form, the concept could be applied to a sentence like ‘John drinks tea’ or ‘David likes apples’. Here, ‘John’ and ‘David’ are the subjects. The predicates are ‘drinks’ and ‘likes’ and the objects are ‘tea’ and ‘apples’. The idea of RDF triples builds on this: essentially RDF triples tie related resources and data together by indicating what something is, what attribute it has and how the attribute relates to it. A scientific paper can use RDF triples to express associated bibliographic information, whilst the relationships between individual tests, experiments, and their results can also be linked. Linking things in this way enables computers to pull up relevant data and results from all over the internet.
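The subject-predicate-object pattern can be sketched in a few lines of Python. This is only an illustration of the idea, not RDF itself: the `Triple` class and the `objects_of` helper are names invented for this example, and the values are the plain words from the sentences above rather than URIs.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One statement: a subject, a predicate and an object."""
    subject: str
    predicate: str
    obj: str


# The two sentences from the text, expressed as triples.
statements = [
    Triple("John", "drinks", "tea"),
    Triple("David", "likes", "apples"),
]


def objects_of(triples, subject, predicate):
    """Return every object linked to `subject` by `predicate`."""
    return [
        t.obj
        for t in triples
        if t.subject == subject and t.predicate == predicate
    ]


print(objects_of(statements, "David", "likes"))  # prints ['apples']
```

Because every statement has the same three-part shape, a single generic lookup is enough to answer questions about any of them.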
However, the power of RDF triples goes beyond linking specific words or phrases. Any of the parts of an RDF triple can be replaced with URIs, which are unique to a particular thing or concept. In the simple triple example ‘David likes apples’, it would not be clear to a machine whether ‘apples’ refers to the fruit or the computer, leading to ambiguity and irrelevant terms appearing in a semantic search. The distinction can be achieved by replacing the literal ‘apples’ with a URI to either the fruit (dbpedia.org/page/Apple) or the computer company (dbpedia.org/page/Apple_Inc.).
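A rough sketch of why this matters to a machine is given below. The DBpedia addresses are the ones quoted above, with an `http://` prefix assumed. The identifiers for the people and for the ‘likes’ predicate are hypothetical `example.org` addresses made up for this illustration, as is the second person, Anna.

```python
# Hypothetical identifiers, invented for this example.
DAVID = "http://example.org/people/david"
ANNA = "http://example.org/people/anna"
LIKES = "http://example.org/vocab/likes"

# The two DBpedia addresses quoted in the text (http:// prefix assumed).
APPLE_FRUIT = "http://dbpedia.org/page/Apple"
APPLE_COMPANY = "http://dbpedia.org/page/Apple_Inc."

# With plain words, both statements end in the same ambiguous literal.
literal_triples = [
    ("David", "likes", "apples"),
    ("Anna", "likes", "apples"),
]

# With URIs, each object points at exactly one thing.
uri_triples = [
    (DAVID, LIKES, APPLE_FRUIT),
    (ANNA, LIKES, APPLE_COMPANY),
]


def who_likes(triples, predicate, thing):
    """Return the subjects linked to `thing` by `predicate`."""
    return [s for s, p, o in triples if p == predicate and o == thing]


# The literal search cannot tell the fruit from the company.
print(who_likes(literal_triples, "likes", "apples"))
# The URI search returns only the person who likes the fruit.
print(who_likes(uri_triples, LIKES, APPLE_FRUIT))
```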
The use of URIs also allows established ontologies and vocabularies to be reused. For instance, when describing a website, it is possible to make use of both Dublin Core metadata elements for describing resources and Library of Congress Subject Headings. Using established ontologies (or devising and making public your own ontologies if none exist already in your field) helps computers to find all the related information. See the box (below) for examples of how RDF triples are written.
RDF is important to librarians as both publishers and consumers of information. Libraries are responsible for publishing a wide variety of information on web pages, in catalogues, and increasingly in institutional repositories of both journal articles and data sets. The value of this information can be increased by making it machine readable, and connecting it to otherwise unconnected data sources. At the simplest level this may mean including RDFa markup in staff contact pages, or at a more advanced level making the whole catalogue available in RDF and connecting it to other sources such as the Library of Congress. As consumers of information, librarians need to be aware of the range of data that is being published online and the tools available to make use of it, from browser toolbars for extracting event information and sending it straight to their calendar, to interfaces for querying the data stored in triplestores.
As has already been seen in the area of Web 2.0 mashups, when data is made publicly available, it is used in a multitude of ways unthought-of by the original organisation. The same will be true as an increasing amount of information is made available using RDF.
David Stuart is an independent web analyst and consultant and honorary member of the Statistical Cybermetrics Research Group at the University of Wolverhampton, UK
Several ways of representing RDF triples have become established, either as separate RDF files or embedded within web pages. The example below contains three triples representing the creator, date and subject relating to my website – www.davidstuart.co.uk:
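<rdf:Description rdf:about="http://www.davidstuart.co.uk/">
  <dc:creator>David Stuart</dc:creator>
  <dc:date>2010-01-20</dc:date>
  <dc:subject rdf:resource="http://id.loc.gov/authorities/sh2008005218#concept"/>
</rdf:Description>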
The above example is written in an RDF/XML format. An alternative, where the underlying triples can be more clearly seen, may be to store the triples in an N-Triples file. This contains the same underlying RDF triples, and could be handled by a computer in just the same way, but is formatted slightly differently:
<http://www.davidstuart.co.uk/> <http://purl.org/dc/elements/1.1/creator> "David Stuart" .
<http://www.davidstuart.co.uk/> <http://purl.org/dc/elements/1.1/date> "2010-01-20" .
<http://www.davidstuart.co.uk/> <http://purl.org/dc/elements/1.1/subject> <http://id.loc.gov/authorities/sh2008005218#concept> .
In the example, rather than having to explain what is meant by a ‘creator’, ‘date’, or ‘subject’, the predicates used are defined within Dublin Core. Similarly, it makes use of the Library of Congress Subject Headings for ‘technology writers’, which have been published online as RDF.
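To give a feel for how a computer might read such a file, the sketch below parses three N-Triples statements using only the Python standard library. It is a simplified illustration rather than a complete parser: the regular expression and the `parse_ntriples` function are invented for this example and ignore blank nodes, language tags, datatypes and escaped characters. The creator and date predicate addresses are assumed to follow the same Dublin Core namespace as the subject predicate.

```python
import re

# Matches one simple statement: <subject> <predicate> <object-or-"literal"> .
TRIPLE_PATTERN = re.compile(
    r'^<([^>]*)>\s+<([^>]*)>\s+(?:<([^>]*)>|"([^"]*)")\s*\.$'
)

DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core elements namespace
SITE = "http://www.davidstuart.co.uk/"

NTRIPLES = f"""\
<{SITE}> <{DC}creator> "David Stuart" .
<{SITE}> <{DC}date> "2010-01-20" .
<{SITE}> <{DC}subject> <http://id.loc.gov/authorities/sh2008005218#concept> .
"""


def parse_ntriples(text):
    """Return a list of (subject, predicate, object, object_is_uri) tuples."""
    triples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = TRIPLE_PATTERN.match(line)
        if match is None:
            raise ValueError(f"Unsupported N-Triples line: {line!r}")
        subject, predicate, uri_object, literal_object = match.groups()
        if uri_object is not None:
            triples.append((subject, predicate, uri_object, True))
        else:
            triples.append((subject, predicate, literal_object, False))
    return triples


for subject, predicate, obj, is_uri in parse_ntriples(NTRIPLES):
    kind = "URI" if is_uri else "literal"
    print(predicate.rsplit("/", 1)[-1], "->", obj, f"({kind})")
```

The one-statement-per-line layout is what makes N-Triples easy to process in this way, with each line read independently of the others.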
Large sets of RDF triples are often stored in triplestores, databases purpose-built for the storage and retrieval of RDF, although individual RDF pages can be coded by hand, in much the same way as HTML. For instance, an individual may wish to create a Friend of a Friend file (www.foaf-project.org) to describe themselves and their relationships with other people (for example, Tim Berners-Lee’s FOAF file – www.w3.org/People/Berners-Lee/card.rdf), a task that is likely to be more difficult through a triplestore.
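The basic job of a triplestore, storing triples and retrieving those that match a pattern, can be sketched as a toy in-memory class. `TinyTripleStore` is a name invented for this illustration and does not represent any real product, which would add persistent storage and a query language. The `name` and `knows` properties come from the FOAF vocabulary, while the two people are hypothetical `example.org` identifiers.

```python
from collections import defaultdict


class TinyTripleStore:
    """A toy in-memory store, for illustration only."""

    def __init__(self):
        self._triples = set()
        self._by_subject = defaultdict(set)  # index for subject lookups

    def add(self, subject, predicate, obj):
        triple = (subject, predicate, obj)
        self._triples.add(triple)
        self._by_subject[subject].add(triple)

    def match(self, subject=None, predicate=None, obj=None):
        """Return sorted triples matching the pattern; None is a wildcard."""
        if subject is not None:
            candidates = self._by_subject.get(subject, set())
        else:
            candidates = self._triples
        return sorted(
            t for t in candidates
            if (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )


FOAF = "http://xmlns.com/foaf/0.1/"
ALICE = "http://example.org/people/alice"  # hypothetical person
BOB = "http://example.org/people/bob"      # hypothetical person

store = TinyTripleStore()
store.add(ALICE, FOAF + "name", "Alice")
store.add(ALICE, FOAF + "knows", BOB)
store.add(BOB, FOAF + "name", "Bob")

# Whom does Alice know?
print(store.match(subject=ALICE, predicate=FOAF + "knows"))
```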
Alternatively, RDF may be embedded into web pages using RDFa notation. A website with RDFa is unlikely to look any different from any other website. For example, the Library Services web pages of De Montfort University (www.library.dmu.ac.uk) use RDFa to provide machine-readable information about library staff and their contact information. Having data in a machine-readable format means it can be extracted and manipulated automatically. For example, browser toolbars can enable contact information to be extracted from a web page and added to an address book at the click of a button. Interest in RDFa increased in 2009 when Google announced that it was going to start indexing some of this semantic information.
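How a tool might pull such information out of a page can be sketched with Python’s built-in HTML parser. The page fragment, the staff member and the `SimpleRDFaExtractor` class are all invented for this illustration and are not taken from any real website. The sketch handles only a small subset of RDFa (the `about`, `property`, `content`, `rel` and `href` attributes), assumes that elements carrying `property` contain only text, and leaves prefixed names such as `foaf:name` unexpanded.

```python
from html.parser import HTMLParser

# A hypothetical staff entry marked up with RDFa attributes.
PAGE = """
<div about="http://example.org/staff/jane">
  <span property="foaf:name">Jane Example</span>
  <a rel="foaf:mbox" href="mailto:jane@example.org">Email</a>
</div>
"""


class SimpleRDFaExtractor(HTMLParser):
    """Collects (subject, property, value) triples from a subset of RDFa."""

    def __init__(self):
        super().__init__()
        self.triples = []
        self._subject = None  # most recent `about` value seen
        self._pending = None  # property whose text is being collected
        self._text = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "about" in attributes:
            self._subject = attributes["about"]
        if "rel" in attributes and "href" in attributes:
            self.triples.append(
                (self._subject, attributes["rel"], attributes["href"])
            )
        if "property" in attributes:
            if "content" in attributes:
                self.triples.append(
                    (self._subject, attributes["property"], attributes["content"])
                )
            else:
                self._pending = attributes["property"]
                self._text = []

    def handle_data(self, data):
        if self._pending is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._pending is not None:
            value = "".join(self._text).strip()
            self.triples.append((self._subject, self._pending, value))
            self._pending = None


def extract(html):
    """Return the triples found in an HTML string."""
    parser = SimpleRDFaExtractor()
    parser.feed(html)
    parser.close()
    return parser.triples


for triple in extract(PAGE):
    print(triple)
```

A browser toolbar of the kind described above would work along similar lines, reading the marked-up name and e-mail address and passing them to an address book.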