Unlocking unstructured data

17 August 2011

What does MarkLogic do?

MarkLogic is a database designed for unstructured information. If you look at what people do with information, it bears little relation to where the information has come from or how it is stored. We believe that, to really get the most out of information, you need to be able to store and manage it at a very granular level.

Indexing unstructured information requires a different approach from structured data, where you would typically spend a lot of time defining a schema and then more time fitting the data into it. Our starting point is simply to put everything in the system. The database uses automated indexes and XML to build its own links. When you manage information at a more granular level, all sorts of things are possible.
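To illustrate the idea of putting everything in and indexing it automatically, here is a minimal Python sketch. It is not MarkLogic's actual implementation. It assumes documents arrive as XML strings and builds a simple inverted index keyed by element name and word, with no schema defined up front. The function names and sample documents are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_index(docs):
    """Map (element tag, lowercased word) -> set of document ids.

    No schema is declared: whatever elements appear get indexed.
    """
    index = defaultdict(set)
    for doc_id, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        for elem in root.iter():
            # Index only the text directly inside this element.
            for word in re.findall(r"\w+", (elem.text or "").lower()):
                index[(elem.tag, word)].add(doc_id)
    return index


def search(index, tag, word):
    """Return sorted ids of documents containing `word` inside element `tag`."""
    return sorted(index.get((tag, word.lower()), set()))


# Hypothetical documents with different, undeclared structures.
docs = {
    "a1": "<article><title>Cell imaging</title><caption>Stained cell nucleus</caption></article>",
    "b2": "<speech><speaker>Smith</speaker><text>Funding for cell research</text></speech>",
}
index = build_index(docs)
print(search(index, "caption", "cell"))  # ['a1']
print(search(index, "text", "cell"))     # ['b2']
```

Because the index is keyed at the level of individual elements rather than whole documents, a query can target a caption or a speaker without either document type having been modelled in advance.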

What can be done with unstructured data?

In the past, publishers were trying to do in electronic form what they did in print. Now the more innovative companies are structuring their data at a more granular level. This can change a business and can take publishers into a new market. For example, Springer Images, which automatically identifies and extracts images from a huge collection of existing resources, is based on our technology. It is a new offering, but based on information that was buried elsewhere.

Some customers use us specifically to track citations. We sit on top of Twitter for some clients. Another client is CQ Roll Call, which holds videos from the floor of the US Congress. The videos are time-stamped and the transcribed text is also available. Users can find the topic of interest in the text and jump to the video to see the speech at that point.
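The text-to-video jump described above can be sketched roughly as follows. This is an illustration only, assuming each transcript passage carries the offset at which it begins in the video. The `Segment` type, the function names and the sample transcript are hypothetical and do not describe the CQ Roll Call system.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start_seconds: int  # offset of this passage from the start of the video
    text: str           # transcribed speech for the passage


def find_offsets(segments, phrase):
    """Return start offsets of every segment whose text mentions `phrase`."""
    needle = phrase.lower()
    return [s.start_seconds for s in segments if needle in s.text.lower()]


def format_offset(seconds):
    """Render an offset as H:MM:SS for a player's seek control."""
    minutes, secs = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


# Hypothetical transcript of one floor session.
transcript = [
    Segment(0, "The House will come to order."),
    Segment(95, "I rise to speak on the farm bill."),
    Segment(3725, "Returning to the farm bill amendment."),
]
offsets = find_offsets(transcript, "farm bill")
print([format_offset(o) for o in offsets])  # ['0:01:35', '1:02:05']
```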

The richness of the indexing is important because it enables people to do analysis and assess relevance. What’s more, any analysis you do can be turned into an alert, so users can request to be notified of anything new that arrives that matches the search criteria. Interacting with customers and allowing them to add comments also increases the power of the product. The ability to search across resources is also important in unlocking the power of unstructured information. We are often used as a kind of ‘universal index’, carrying extracts from multiple disparate systems to provide a single point of access to, and identification of, items, assets and resources.
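Turning a search into an alert amounts to saving the query and testing each newly arrived item against it. The Python sketch below shows that pattern in its simplest form, assuming a saved search is just a set of words that must all appear in the new text. The `AlertRegistry` class and the user names are hypothetical, and a real system would use far richer queries.

```python
class AlertRegistry:
    """Holds saved searches and matches each newly arrived document against them."""

    def __init__(self):
        self._saved = {}  # user -> set of required lowercase terms

    def subscribe(self, user, terms):
        """Save a search: the user is alerted when a document contains all terms."""
        self._saved[user] = {t.lower() for t in terms}

    def ingest(self, text):
        """Return the sorted list of users whose saved search matches `text`."""
        words = set(text.lower().split())
        return sorted(u for u, terms in self._saved.items() if terms <= words)


registry = AlertRegistry()
registry.subscribe("alice", ["genome", "sequencing"])
registry.subscribe("bob", ["citation"])

print(registry.ingest("New genome sequencing results published"))  # ['alice']
print(registry.ingest("Annual report on library budgets"))          # []
```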

Publishers and libraries can start thinking of new value propositions. A lot of publishers are offering ‘build your own textbook from components taken from others’. This sort of thing adds dimensions to what can be done.

What are the challenges with data?

Sheer volume, and growth. We are dealing with huge volumes of existing information and receiving new information. Managing both of these seamlessly can be relatively simple when dealing with data in structured form, because the volumes are much smaller. With unstructured data, scalability is a challenge. Given the volume of the addressable archive, the volume of queries and the amount to ingest, you need a specialised database designed for the task. Unstructured data represents 80 per cent of the wealth of information most organisations are trying to manage, so it’s important to use the right tool.

A retention schedule is critical to managing an archive’s volume. The key is being able to identify stuff you can now get rid of. That’s a great use for analytics and alerting.
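As a rough illustration of how a retention schedule identifies content that can be removed, the Python sketch below flags items older than the retention period for their category. The categories, periods and archive entries are all hypothetical, and the review date is fixed so that the result is reproducible.

```python
from datetime import date, timedelta

# Hypothetical retention periods, in days, per category of content.
RETENTION_DAYS = {"press-release": 365, "draft": 90}


def expired_items(items, today):
    """Return ids of items older than their category's retention period.

    Items in categories without a schedule are kept indefinitely.
    """
    expired = []
    for item_id, category, created in items:
        limit = RETENTION_DAYS.get(category)
        if limit is not None and today - created > timedelta(days=limit):
            expired.append(item_id)
    return expired


archive = [
    ("d1", "draft", date(2011, 1, 10)),
    ("d2", "draft", date(2011, 7, 1)),
    ("p1", "press-release", date(2010, 3, 5)),
    ("r1", "research-article", date(2001, 6, 30)),
]
print(expired_items(archive, today=date(2011, 8, 17)))  # ['d1', 'p1']
```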

The level of detail and completeness of the semantic web varies enormously. Our engine sits perfectly with semantic-based solutions because we’re managing the information at a fine level of granularity. People can also use semantic technologies to go back over data and enrich its indexing using, for example, medical taxonomies. We are watching the semantic web very closely and building product capability and alliances in that area.

What’s the future?

Historically, a lot of business models were about what was downloaded. I see more emphasis in the future on subscription models, in particular for research tools, accessed from a variety of devices – right information, right place, right time.

Geospatial tagging is another feature of interest to lots of customers, and I think social interaction with data will also increase. The more users are allowed to comment and interact with data, the more you can use that to enhance search, ranking and analytics.

The key is providing a rich enough experience from day one and making it easy to add stuff iteratively over time as your requirements change and evolve. More and more people are starting to understand that unstructured data is different and needs new technology to address it. It is a changing landscape.

Interview by Siân Harris