The importance of interrogation


Manisha Bolina

The need for fast, accurate and enriched data has never been more pressing, writes Manisha Bolina

With the explosion in the volume of open access data, turning information into trustworthy, meaningful and insightful knowledge is critical.

Researchers simply do not have the time to read the full text of preprints, grants, patents, data sets, publications, policy documents and more. In a world where interdisciplinary research is intrinsic to so many aspects of our lives, we need technology to help make the process more efficient and free from human bias.

Enter artificial intelligence, or AI. Let's look at a couple of examples of how AI is being used, and the power it brings to research and development. 

Dimensions, the world’s largest linked research information dataset, takes the approach of linking data – so it's not only about looking at publications, but also how they connect to other data points such as clinical trials, patents, data sets, grants and policy documents. Research does not begin and end with publications alone; to REALLY understand it you need to see the whole picture, and be able to place a piece of research information within the wider context of the research landscape. 

How does Dimensions help researchers, funders, institutions and more do this? Easy. Digital Science has hired a million librarians whose sole purpose is to read every article going back to 1665... OK, I jest! It actually uses AI and machine learning algorithms to read the full text of publications and other data sources, then links them together using metadata, thereby creating around five billion connections between them. Dimensions doesn't have a 'front and back list' or an editorial board dictating what should be included in the platform. Its inclusive approach puts the power back into the user's hands.
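To make the metadata-linking idea concrete, here is a minimal, purely illustrative sketch – not Dimensions' actual pipeline, and every record and identifier below is made up – showing how records from different sources could be joined whenever they share an identifier such as a DOI or grant number:

```python
# Hypothetical illustration only: a toy linker that connects records from
# different sources (publications, grants, clinical trials) whenever they
# share an identifier in their metadata, such as a DOI or grant number.
from collections import defaultdict
from itertools import combinations

records = [
    {"id": "pub:1", "type": "publication", "doi": "10.1000/xyz123", "grant": "G-42"},
    {"id": "grant:G-42", "type": "grant", "grant": "G-42"},
    {"id": "trial:NCT01", "type": "clinical_trial", "doi": "10.1000/xyz123"},
]

# Index records by every identifier they carry.
by_identifier = defaultdict(list)
for rec in records:
    for key in ("doi", "grant"):
        if key in rec:
            by_identifier[(key, rec[key])].append(rec["id"])

# Any two records sharing an identifier become a link in the graph.
links = set()
for (key, value), ids in by_identifier.items():
    for a, b in combinations(sorted(ids), 2):
        links.add((a, b, key))

for a, b, key in sorted(links):
    print(f"{a} <-[shared {key}]-> {b}")
```

Scaled up across hundreds of millions of records and many identifier types, this kind of join is what produces the billions of connections mentioned above.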

AI is really useful for things like author disambiguation; many have tried repeatedly to crack this nut, and it's definitely a hard one. Dimensions uses machine learning (ML) methods to disambiguate authors to a higher level of accuracy – just WHO is this John Smith? Aside from the guy whose name is on the beer, of course! Dimensions collects author IDs from all the usual places, including Scopus, PubMed, ORCID, Mendeley and the rest of the 'PID crew' of persistent identifiers and associated metadata, but it doesn't stop there.

Our AI goes deeper: it 'reads' all the different data created by the author and continually checks complementary information, asking questions like 'who are the usual co-authors?', 'which institutions are they at, or have they been affiliated with?', 'what fields of research are they writing in?', and delving even deeper into the data to ask, 'what concepts are constantly being extracted from this author's work?'. This interrogation process helps the AI understand and correctly ascertain that the John Smith at X University in the English department writing about Shakespeare is not the same person as the John Smith at the same university writing about nanotechnology.

Dimensions builds what can be considered a 'semantic fingerprint' to correctly disambiguate millions of authors, enabling the detangling of authors who may have multiple ORCID IDs (many of whom have never had training, or have created a new ID every time they moved institution or wrote a publication), making a monumental task seem like child's play. Whilst AI can really take the pain out of the process, how accurate is it? Well, think of it this way: does Google always get it right when you look for the 'best Chinese takeaway near me'? No, but it does get pretty close! In the case of Dimensions, it will look not only at publication data but also at patents, grants, data sets, clinical trials, altmetric attention and citation information related to that author, and connect them. AI and ML algorithms allow for semantic enrichment to create linked data; without them, this task would be impossible.
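As a back-of-the-envelope illustration of that 'semantic fingerprint' idea – a toy sketch under assumed weights and features, not Dimensions' production algorithm – you could score how likely two author records are to belong to the same person by comparing their co-authors, affiliations and extracted concepts:

```python
# Hypothetical sketch: compare two author records on overlapping co-authors,
# affiliations and extracted concepts, using Jaccard similarity as a crude
# "semantic fingerprint" match score. Real systems use far richer ML models.

def jaccard(a: set, b: set) -> float:
    """Overlap between two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def same_person_score(rec1: dict, rec2: dict) -> float:
    # Weighted combination of the signals discussed above (weights are illustrative).
    return (0.4 * jaccard(set(rec1["coauthors"]), set(rec2["coauthors"]))
            + 0.3 * jaccard(set(rec1["affiliations"]), set(rec2["affiliations"]))
            + 0.3 * jaccard(set(rec1["concepts"]), set(rec2["concepts"])))

shakespeare_smith = {"coauthors": {"A. Jones"},
                     "affiliations": {"X University, English"},
                     "concepts": {"Shakespeare", "early modern drama"}}
nano_smith = {"coauthors": {"B. Lee"},
              "affiliations": {"X University, Physics"},
              "concepts": {"nanotechnology", "graphene"}}

# A low score suggests these two John Smiths are different people.
print(same_person_score(shakespeare_smith, nano_smith))
```

In practice a trained model weighs many more signals than these three sets, but even this crude score keeps the two John Smiths apart.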

Our friends at Ripeta are using semantic analysis to help overcome another challenge facing research: trust and robustness. Transparency and trust in authorship are vital, so imagine if you were able to create a sort of 'credit check' that reports on the robustness of authors and scientific methods.

The Covid-19 pandemic amplified the need to find, check, share and reuse data at a faster pace than ever before. Using natural language processing (NLP), Ripeta tackles this problem by providing a quick and accurate way to assess the trustworthiness of research. Why is this important? Well, we live in a world where people actually believe 'fake news' and where too many articles have to be retracted; we have seen this unfold before our eyes in Dimensions during this pandemic!

So, how does Ripeta do this? It's the 'magic' of AI – and we need AI just to manage the sheer volume of data. Ripeta splits the process into three categories: Reproducibility, Professionalism and Research:

Reproducibility – Can this paper be replicated for future research?

Look for… Code Availability Statement, Data Availability Statement (DAS), Data Locations;

Includes data availability statements and links to the data used;

Detailed methods section laid out in the abstract.

Professionalism – Are the actors behind the study reliable?

Look for… Ethical Approval Statement, Funding Statement, Section Headings Information;

More than one author, and all are verified through their institution and previous works;

Includes a funding statement and all pertaining information;

Contains an ethics statement.

Research – Is this actual research?

Look for… Study Objective;

The study objective is clearly stated;

Detailed Methods and Results sections.
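To give a flavour of what automated checks like these might look like, here is a minimal sketch using simple keyword patterns. Ripeta's real checks are trained NLP models, and the pattern strings and check names below are assumptions for illustration only:

```python
# Hypothetical sketch: naive keyword/regex checks for a few of the trust
# signals listed above. Ripeta's actual checks use trained NLP models,
# not hand-written patterns like these.
import re

CHECKS = {
    "data_availability_statement": r"data (are|is) available|data availability",
    "code_availability_statement": r"code (is|are) available|available on github",
    "ethics_statement": r"ethic(s|al) (approval|committee|statement)",
    "funding_statement": r"funded by|funding was provided|this work was supported",
    "study_objective": r"the (aim|objective) of this study",
}

def screen(text: str) -> dict:
    """Return which trust signals appear in a manuscript's text."""
    lowered = text.lower()
    return {name: bool(re.search(pattern, lowered))
            for name, pattern in CHECKS.items()}

sample = ("The aim of this study was to measure X. This work was supported by "
          "grant G-42. Data are available at the repository. Ethical approval "
          "was granted by the university committee.")
print(screen(sample))
```

Even this crude screener flags which statements are present and which are missing; the value of the real system lies in doing this accurately, at scale, across the full text of millions of manuscripts.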

For publishers, this technology creates efficiencies and adds value for new and existing authors, as well as ensuring preprint content is of high quality (especially as we are seeing an increase in preprints, and this type of content is now often cited and shared through social networks). Institutions can improve the quality of their manuscripts, which is especially important for early career researchers; and researchers themselves will know what to report, in a way that will grab publishers' attention and increase their likelihood of being cited.

I now ask you, reader: have I convinced you yet of how important AI and machine learning are for semantic enrichment and scholarly communications? Without machine learning, this would be a long, tedious and exhausting process, riddled with human bias and prone to error. Our industry must embed AI in its workflows if we want to sustain high-quality scientific dissemination and trust in academia. As we have seen, the way we turn information into knowledge affects our lives. Let's embrace it, trust it and strive to continually improve it – only WE can do it well!

Manisha Bolina is product solutions sales manager, EMEA, at Dimensions