Linked data expertise for libraries
Ashleigh Faith discusses data quality for discoverability, analytics and AI using linked data ontologies
Libraries have been using linked data for generations to share information across datasets, databases and the open web in a consistent, hyperlinked way. Linked data has also been used heavily to map equivalent subjects in metathesauri like the Unified Medical Language System (UMLS), where a subject has a unique identifier and/or URL that is mapped or linked to similar subjects in different controlled vocabularies, primarily to improve subject precision in search.
Linked data also connects nodes of information (an author, say) to similar information (such as another author from the same discipline) and to additional data about that node: where the author is from, when their first publication was released, where they went to university, and where they get inspiration for their research. This network of information goes beyond the usual bounds of a single creative work's title, author, publisher, and publication date and place, to include connected data found in a wide variety of linked data sources.
When a node (an author, title, publisher or another entity) has this additional data linked to it, it starts to form a network of information, or statements, about that node called a knowledge graph. The schema behind a knowledge graph is called an ontology, and ontologies have a special characteristic that helps enforce data standards: constraints.
Ontology constraints increase the accuracy, consistency and dependability of data once that data has been verified or corrected. The data can then be used to analyse a library’s holdings, discover borrowing trends and increase the accuracy of AI when it is used in combination with Large Language Models (LLMs) during a process called Retrieval Augmented Generation (RAG). Studies have shown that grounding AI on a dependable, high-quality source of truth like a knowledge graph increases AI accuracy by 46 percent.
Recognising the impact of your library on the overall success of your organisation is essential for maximising the benefits and effectiveness of library resources. This understanding has been shown to boost funding for critical resources, attract more patrons, increase student enrolment, retain and attract faculty, and open new funding opportunities such as grants and investments. However, none of this is achievable without reliable and consistent library data. Linked data ontologies can play a crucial role in ensuring data accuracy and consistency.
Let’s dive into how ontology constraints work, and how you can start to integrate them into your library information workflows.
Ontology constraints
An ontology is the schema that structures your linked data: for example, the property “dc:creator” would hold the node “Charles Dickens” as the author information for the work “A Tale of Two Cities.” Every node is represented by a hyperlinked unique identifier such as a Wikidata ID or DOI. Nodes are connected to one another by statements called triples. Each triple has a declared relationship connecting its nodes, conventionally written in camel case. For instance, Charles Dickens’ medicalCondition (Wikidata ID P1050) Epilepsy (Wikidata ID Q41571) is a triple documenting a statement about Charles Dickens. The ontology schema behind a statement linking an author to a work could be dc:creator notableWork dc:title. Constraints can be added to the schema to enforce rules: for example, every title node must have at least one creator node (so your library only holds works with author information), a title may have only one publisher of a given name, or every author must come from your author authority list.
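To make that concrete, here is a minimal sketch in Python using the open-source rdflib library. The example.org namespace and the node names are hypothetical stand-ins for your own identifiers; the P1050 and Q41571 identifiers are the Wikidata ones mentioned above.

    # A minimal sketch, assuming rdflib is installed (pip install rdflib).
    from rdflib import Graph, Namespace
    from rdflib.namespace import DCTERMS

    EX = Namespace("http://example.org/library/")            # hypothetical local namespace
    WD = Namespace("http://www.wikidata.org/entity/")        # Wikidata entities
    WDT = Namespace("http://www.wikidata.org/prop/direct/")  # Wikidata direct properties

    g = Graph()
    work = EX.ATaleOfTwoCities
    author = EX.CharlesDickens

    # Triple 1: the work node is linked to its creator node.
    g.add((work, DCTERMS.creator, author))

    # Triple 2: the author node gains further context, e.g. the
    # medicalCondition (P1050) -> Epilepsy (Q41571) statement above.
    g.add((author, WDT.P1050, WD.Q41571))

    print(g.serialize(format="turtle"))

Each g.add call records one triple; the serialised output is the beginning of a knowledge graph that constraints can later be validated against.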
The constraints added to an ontology can be defined according to your use case, your library’s data governance, or the collection you oversee, or they can be used to surface data that is an exception to the rules for analytics or data clean-up projects.
Data improvements
Data inaccuracies such as a missing publication date or incorrect values in fields will be caught by validating the data against ontology constraints, and these can often be resolved by checking the original work or corroborating datasets. Human verification, however, is essential for statements that cannot be verified by data alone. For discoverability, correct data increases the chances that information will be discovered and used. For analytics, knowledge graphs allow many kinds of network assessment: gap and saturation detection for collection development, trend assessments to see borrowing patterns over time and what influences them, or even detecting the most influential people, courses or materials at your organisation. And unlike other data clean-up and governance techniques, ontologies have data governance and validation natively associated with them.
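As an illustration of that kind of targeted clean-up, the sketch below (Python with rdflib again, and a hypothetical holdings.ttl export of catalogue data) uses a SPARQL query to flag works that have a creator but no publication date, so they can be routed for human verification.

    # A minimal sketch, assuming your catalogue data has been exported as Turtle.
    from rdflib import Graph

    g = Graph()
    g.parse("holdings.ttl", format="turtle")   # hypothetical data export

    # Find works that have a creator but no dcterms:issued (publication date).
    missing_date = g.query("""
        PREFIX dcterms: <http://purl.org/dc/terms/>
        SELECT ?work WHERE {
            ?work dcterms:creator ?creator .
            FILTER NOT EXISTS { ?work dcterms:issued ?date }
        }
    """)

    for row in missing_date:
        print(f"Needs review: {row.work}")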
Implementation techniques
There is a spectrum to implementing ontologies. The first stage is usually working with your organisation and staff to understand what data governance rules you are interested in capturing. Once you have identified governance and problem areas, a good next step is starting your model in a data modelling tool, many of which are open source. Adding constraints then allows you to run validation with the Shapes Constraint Language (SHACL), a standard way to check ontology data constraints and identify data that does not conform to the data governance rules you have outlined. When outliers are identified, the data can be verified and corrected by a human, or additional governance can be created to capture the information you have discovered.
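One possible form of that validation step, sketched with the open-source pySHACL library, is shown below. The ex:Work class, the holdings.ttl file and the shape itself are hypothetical stand-ins for whatever your own governance rules define; here the constraint is the earlier example that every work must have at least one creator.

    # A minimal sketch, assuming rdflib and pyshacl are installed.
    from rdflib import Graph
    from pyshacl import validate

    # A SHACL shape enforcing: every ex:Work must have at least one dcterms:creator.
    shapes = Graph().parse(data="""
        @prefix sh:      <http://www.w3.org/ns/shacl#> .
        @prefix dcterms: <http://purl.org/dc/terms/> .
        @prefix ex:      <http://example.org/library/> .

        ex:WorkShape a sh:NodeShape ;
            sh:targetClass ex:Work ;
            sh:property [
                sh:path dcterms:creator ;
                sh:minCount 1 ;
                sh:message "Every work needs at least one creator." ;
            ] .
    """, format="turtle")

    data = Graph().parse("holdings.ttl", format="turtle")   # hypothetical data export

    conforms, report_graph, report_text = validate(data, shacl_graph=shapes)
    print(report_text)   # lists every node that breaks a constraint, ready for review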
If you want to use this high-quality data to increase the accuracy of your AI projects, general RAG techniques can be used. AI models are usually general purpose, so they do not know much about your organisation, use cases or needs; adding a source of truth like an ontology-based knowledge graph makes your AI much more effective and trustworthy.
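The retrieval half of that idea might look something like the following sketch: pull the verified statements about a node out of the knowledge graph and hand them to the model as grounding context. The holdings.ttl file and the ask_llm call are hypothetical placeholders for your own validated graph and whichever model API your organisation uses.

    # A minimal sketch of knowledge-graph-grounded RAG, assuming rdflib is installed.
    from rdflib import Graph, URIRef

    g = Graph().parse("holdings.ttl", format="turtle")   # hypothetical validated graph

    def facts_about(entity_uri: str) -> str:
        """Return every statement about a node as plain text for the prompt."""
        rows = g.query(
            "SELECT ?p ?o WHERE { ?s ?p ?o }",
            initBindings={"s": URIRef(entity_uri)},
        )
        return "\n".join(f"{p} {o}" for p, o in rows)

    context = facts_about("http://example.org/library/CharlesDickens")
    prompt = f"Answer using only these library facts:\n{context}\n\nQuestion: ..."
    # answer = ask_llm(prompt)   # hypothetical call to your organisation's LLM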
Linking it together
Adding linked data ontology constraints does not need to be a monumental task; it can be taken in incremental steps, each producing governance and data quality improvements as your ontology work matures. Adding constraints at the schema level allows your data governance to scale with limited effort. Constraints also identify the specific errors in your data, based on which constraints were not met when validation is run, which makes correcting or verifying the data with human oversight that much easier because the work is targeted and specific.
As insights about library resources increase in importance, and research becomes more integrated into AI systems, ontology constraints offer a way to improve data and build trust in its applications. Ultimately, this can help libraries highlight the unique benefits they offer to their organisations.
Ashleigh Faith has her MLIS in Information Retrieval and her PhD in Advanced Semantics, and is EBSCO’s Director of AI and Semantic Innovation. Connect with Ashleigh on LinkedIn.