From search to discovery

Mark Johns, president of USA-based Littlearth, investigates different ways of searching for information and argues that a 'discovery engine' approach is sometimes best

Searching for information in an online collection of unstructured documents is extremely valuable. Examples of such document collections are patent documents, news articles, legal cases and articles in medical journals.

For many document collections, searching for relevant documents via keywords is the most common and accepted method. Sifting through the results, reworking the query and collecting and organising the results is a process that most researchers have become familiar with. It is still quite analogous to manually investigating a collection of printed documents. Software just helps to perform that job more efficiently.

The advent of the ‘search engine’ was a cornerstone in the evolution of information research. In its simplest form, a search engine is used to find documents that contain some specific words. Advanced search engines such as Google can yield results that don’t literally match on the keywords. With such search engines usually comes the baggage of ‘page rank’. This can skew the results, which may or may not be desirable. Most database search engines, such as those in Wikipedia and the United States Patent and Trademark Office, incorporate the familiar ‘Boolean keyword search’. This approach is very literal, which, of course, has its own distinct value and applicability. However, if a researcher types in too many keywords, they end up with no matches at all. If they type in too few, they get far too many results of highly varying relevance. This means that they need to rework the query, for example by adding some complex combination of ‘AND’, ‘OR’ and ‘NOT’ operators, parentheses and phrases.
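The literal matching behind such a Boolean keyword search can be sketched in a few lines of Python. The documents and query terms below are invented for illustration; real patent databases implement this far more efficiently with inverted indexes:

```python
# Toy document collection (identifiers and text are illustrative only)
docs = {
    1: "solar panel mounting bracket for curved roofs",
    2: "wind turbine blade coating",
    3: "solar cell coating for flexible panels",
}

def matches(text, all_of=(), any_of=(), none_of=()):
    """Literal Boolean match: every all_of term, at least one any_of
    term (if given), and no none_of term may appear in the text."""
    words = set(text.lower().split())
    return (all(w in words for w in all_of)
            and (not any_of or any(w in words for w in any_of))
            and not any(w in words for w in none_of))

# Equivalent of the query: solar AND coating NOT turbine
hits = [i for i, t in docs.items()
        if matches(t, all_of=("solar", "coating"), none_of=("turbine",))]
print(hits)  # → [3]
```

Adding one more required term (say, "flexible") narrows the result set further; removing "coating" widens it again, which is exactly the rework cycle described above.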

So, what is the best way to build the appropriate search criteria? Consider the following scenario: a researcher enters some keywords that yield a set of documents that are not satisfactory. After struggling for a while, the researcher comes upon some document that at least comes close to what they are looking for. They then discover some words in the document itself that would help them develop their search criteria.

If the researcher could somehow use the entirety of that particular document as the criteria for the search, it is extremely likely that many more relevant documents can be found. A pure Boolean keyword search on the body of text would not be likely to yield any other matches. A completely different type of ‘search’ is warranted.

There are several methodologies for tackling such a problem. One is to extract a limited set of keywords, or metadata, from each document and match only on that data. Another is clustering, where each document is assigned to a single class of documents or to a limited number of classes. A third is to extract the hard-coded forward and/or backward references within a given document to form document trees. Latent semantic analysis is also a possibility, although it varies in quality and is often accompanied by an algorithm that only approximates the data representing any given document. Finally, full-text comparison is sometimes used, although it tends to work only on relatively small datasets.

Using the full text to search

In many document collections, the highest-quality search criterion is actually the entire text of one of the documents in the database. A real document in the collection (or a new one that a researcher could type in full) contains much more information than what a researcher would typically type as keywords. The natural language of the document and all its inherent properties tend to shine through, if analysed with appropriate algorithms. The effect is that the result of the search criteria is the set of documents that are most similar or related to the original document.
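One common way to realise ‘the entire text as the search criterion’ is TF-IDF weighting with cosine similarity. The article does not say which algorithms Littlearth actually uses, so the following sketch is purely illustrative, run against a toy collection:

```python
import math
from collections import Counter

# Toy collection (texts are invented for illustration)
docs = [
    "solar panel mounting bracket for curved roofs",
    "bracket assembly for mounting solar panels on a roof",
    "wind turbine blade coating process",
]

def tfidf_vectors(texts):
    """Weight each term by frequency in the document, discounted by how
    many documents it appears in (common words count for less)."""
    tokenized = [t.lower().split() for t in texts]
    df = Counter(w for toks in tokenized for w in set(toks))
    n = len(texts)
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vecs.append({w: tf[w] * math.log((1 + n) / (1 + df[w])) for w in tf})
    return vecs

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# A small keyword query is simply treated as a mini document
query = "solar panel mounting bracket"
vecs = tfidf_vectors(docs + [query])
scores = [cosine(vecs[-1], v) for v in vecs[:-1]]
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranked)  # → [0, 1, 2]: most similar document first
```

Because the whole text is scored rather than literally matched, the unrelated third document scores zero and drops to the bottom instead of breaking the query, and a full document can be substituted for the query string unchanged.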

In ‘complexity theory’, such a phenomenon is known as ‘emergence’. This emergence is the key to a natural stepping-stone in the evolution of information research – a ‘discovery engine’. Such discovery engines can currently be found in one form or another but our existing culture’s awareness and use of the concept is still in its infancy. To be complete, it should be noted that the search for relevant documents may still begin with a small set of keywords but they can really just be treated as a mini document.

At Littlearth, a technology called DocumentDiscovery has recently been developed. It was designed from the ground up to tackle the problem of discovering related documents in a large collection of documents (the technology does not even utilise a commercially available database management system). The technology is fundamentally designed to work on any language, but English is currently the only one that is fully accommodated.

DocumentDiscovery can be integrated into other systems as a standard web service or it can behave as a document reader and provide its own user interface. Currently, it is being applied to three different document collections, in the form of three different websites that are owned by Littlearth: www.Wiki-Surf.com; www.PatentSurf.net; and www.USCodeSurf.com. The company is continuing to develop these websites as well as to take on ventures with organisations that distribute valuable document collections.

DocumentDiscovery distinguishes itself, in part or in whole, from other methodologies in several ways. Firstly, it confronts the real problem, which requires an extreme amount of computing resources: for a collection of 10 million documents, the number of pairs of relationships is approximately 50 trillion. Secondly, the tool is extensible. For example, different combinations of text can be used as the search criteria: multiple documents taken as a whole, an existing document augmented with some text supplied by the researcher, or subsections of documents. The quality of the algorithms is also important. First-rate algorithms result in a high-quality set of related documents.
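The 50 trillion figure follows directly from counting unordered pairs, n(n-1)/2:

```python
# Number of unordered document pairs in a collection of n documents
n = 10_000_000
pairs = n * (n - 1) // 2
print(pairs)  # → 49999995000000, i.e. just under 5 x 10^13 (~50 trillion)
```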

PatentSurf incorporates a Boolean keyword search.

Helping match patents

One good example of how a ‘discovery engine’ can offer benefits over a ‘search engine’ is in performing a patent search. For example, a researcher might already have a full description of their own patent.

The description is submitted as the ‘search criteria’ and the top related documents are returned. Some of the results look very relevant so the researcher holds/tags them in order to be able to return to them later. The researcher also tags others to ignore so they don’t show up in any subsequent result sets. One of the top results looks relevant so the researcher clicks ‘Related’ on it in order to see the top related documents for that patent record. From there, they click ‘Related’ on another document, all the while accumulating relevant documents.

The ‘search criteria’ is effectively changing on-the-fly each time. This is very different from having to rework a query manually. In fact, the process is much like the job of an old-time patent analyser, who would have sorted through paper documents, reviewed each of them and acquired others that were referenced, placing good candidates in one pile and irrelevant documents in another. The major difference with a discovery engine is that a given electronic document effectively points to all of its related documents and is never out of date – unlike a paper document which, at best, carries some relevant backward references.
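The tag-and-click-‘Related’ workflow described above can be sketched as a simple traversal. The related-document table and the relevance test here are toy stand-ins, not Littlearth's actual interface:

```python
# Toy precomputed "top related" lists, most similar first (illustrative)
related = {
    "A": ["B", "C"],
    "B": ["D", "A"],
    "C": ["A"],
    "D": ["B"],
}

def discover(seed, is_relevant, ignored=()):
    """Follow 'Related' links outward from a seed document, keeping
    relevant hits and skipping anything tagged to ignore."""
    kept = []
    seen = set(ignored)          # ignored docs never show up again
    frontier = [seed]
    while frontier:
        doc = frontier.pop(0)
        if doc in seen:
            continue
        seen.add(doc)
        if is_relevant(doc):
            kept.append(doc)
            # "click Related" on a keeper to expand the search
            frontier.extend(related.get(doc, []))
    return kept

# Start from A, reject C, and pre-ignore D
print(discover("A", is_relevant=lambda d: d != "C", ignored=["D"]))  # → ['A', 'B']
```

Each keeper expands the frontier, so the effective search criteria shift with every step, mirroring the on-the-fly reworking the article describes.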

Using a discovery engine requires a different mindset for researching information but it is actually a very intuitive and familiar process. Keyword searching via the traditional search engine will always have its place in research but this relatively unknown type of search, the ‘discovery engine’, will hopefully be seen as having its own merits as well.

For an example of how Littlearth’s technology works on the content of Research Information and our sister publication, Scientific Computing World, visit http://74.208.45.110/!RSI