Emil Eifrem looks at how researchers can generate insights from vast datasets with the fast-emerging SQL alternative
Big data means we are no longer talking in megabytes or gigabytes, but in petabytes and exabytes. Its arrival also means that, as researchers, we need to stop assuming that all we have to work with is structured information.
As a result, researchers everywhere – from science to business to government – are collecting vast streams of data about anything and everything, often without much consideration of how it will be managed, analysed or stored. Traditional methods can struggle here, especially with the unstructured portion of this tsunami of bytes. A possible aid is emerging, in the shape of a technology first used at scale by social web giants Google, LinkedIn and Facebook.
It’s an approach based on the insight that, in data, it’s the relationships that matter most to the researcher – an approach called graph databases. The technology’s power was recently exemplified by what has been hailed as the best example of data-driven investigative journalism to date: the Panama Papers.
Limitations of SQL-based tools
The Panama Papers is the biggest data-based investigation ever conducted – far larger than anything from WikiLeaks or Snowden, at 2.6 terabytes and 11.5 million documents. The International Consortium of Investigative Journalists (ICIJ), the group that coordinated the world’s media teams investigating the data, says it didn’t even know what it had on its hands until it started to look at graph databases as a way to work with it (“It wasn’t until we picked up graph that we started to really grasp the potential of the data”).
Mar Cabra, the head of its data unit, has described how The Times of London found a story connecting the actress Emma Watson to the Panama Papers by unpicking hidden connections that would have taken an inordinate amount of time to trace manually.
Graph databases not only handle vast datasets like this, but are uniquely able to uncover patterns that are difficult to detect using traditional representations such as SQL-based RDBMS. Indeed, when the ICIJ investigated its first big offshore financial story in 2012, Cabra’s team had to draw out links by hand, drawing lines in simple Word documents to spot relationships.
The reason SQL would struggle here is that a high-volume, highly connected dataset like the Panama Papers is hard to parse relationally. Relational databases model the world as a set of tables and columns, carrying out complex joins and self-joins as the dataset becomes more interrelated. Such queries are technically challenging to construct and expensive to run, and making them perform in real time is not easy, with performance faltering as the dataset grows. The relational data model also doesn’t match our mental picture of the application – what is technically called the 'object-relational impedance mismatch'. If you think about it, you’ll see why: as people, we draw connections between data elements, sketching an intuitive model on whiteboards. Taking a data model based on relationships and forcing it into a tabular framework, the way a platform like Oracle asks us to, creates a mental disconnect.
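The difference can be sketched in a few lines of Python. This toy example – all entity names and links are invented for illustration, and real systems would of course use a database rather than in-memory lists – contrasts the relational style, where every extra hop through the data means another self-join over a link table, with the graph style, where connections are traversed directly via an adjacency list.

```python
from collections import deque

# Toy dataset: (entity, connected_entity) rows, as a relational link
# table would store them. All names here are invented for illustration.
links = [
    ("officer_a", "shell_co_1"),
    ("shell_co_1", "bank_x"),
    ("bank_x", "officer_b"),
    ("officer_b", "shell_co_2"),
]

# Relational style: entities exactly `hops` steps away. Each extra hop
# is, in effect, another self-join over the `links` table.
def joined_hops(start, hops):
    frontier = {start}
    for _ in range(hops):  # one "self-join" per hop
        frontier = {b for (a, b) in links if a in frontier}
    return frontier

# Graph style: build an adjacency list once, then traverse it.
graph = {}
for a, b in links:
    graph.setdefault(a, set()).add(b)

def reachable(start, max_hops):
    """All entities within `max_hops` of `start`, by breadth-first walk."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen - {start}
```

In a graph database the traversal is the native operation – the engine follows stored relationship pointers rather than recomputing joins – which is why query cost stays proportional to the neighbourhood explored, not to the size of the whole dataset.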
By contrast, the power of graph database technology lies in discovering relationships between data points and understanding them at huge scale. That’s why graph databases are so effective at letting researchers uncover hidden patterns in large datasets.
Life science use cases of graph are emerging every day
It’s not just journalists finding this out. Tim Williamson, a data scientist at Monsanto, whose role focuses on coming up with ways to help the company get better research inferences out of genomic datasets, says that his team had previously managed its data problems in a very classical, relational way. However, he said: 'Every question we would want to ask needed lots of real-time analysis to be run and it would take us seconds to minutes to hours to perform one round of analysis, which doesn’t scale.'
Monsanto is continually researching which plant varieties are best and which genetic traits cause them to thrive in different climatic and environmental conditions. Some genetic problems rely on being able to treat a dataset as an ancestral family tree. Williamson discovered that these family tree datasets naturally mapped onto a graph database, and that it was far easier to write graph queries instead. Now, he says, analysis that used to take minutes or hours takes seconds: 'This was really cool, because then we could do it across everything; I could ask the same question of several million objects instead of one at a time.'
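The shape of such a pedigree query can be illustrated in plain Python. The variety names and parentage below are invented, and Monsanto’s actual system runs on a graph database rather than code like this – but the idea is the same: parentage is a set of edges, so finding all ancestors is a walk up the tree, and shared ancestry between two varieties is just a set intersection.

```python
# Invented toy pedigree: child -> list of parents. Real breeding data
# would live in a graph database; this is only an illustration.
parents = {
    "variety_d": ["variety_b", "variety_c"],
    "variety_b": ["variety_a"],
    "variety_c": ["variety_a"],
    "variety_e": ["variety_c"],
}

def ancestors(variety):
    """All ancestors of a variety, found by walking parent edges."""
    found = set()
    stack = list(parents.get(variety, []))
    while stack:
        p = stack.pop()
        if p not in found:
            found.add(p)
            stack.extend(parents.get(p, []))
    return found

def shared_ancestors(v1, v2):
    """Ancestors common to both varieties: a single set intersection."""
    return ancestors(v1) & ancestors(v2)
```

Asking this question of millions of varieties at once – as Williamson describes – is exactly the kind of bulk traversal a graph database is built to run.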
That frees Monsanto up to make some abstractions around important genetic features, he says, such as which plant ancestors consistently produce the most productive cross-breeds. It means the firm is getting much quicker at spotting which plant varieties are the most productive, so it can spend its resources researching those rather than less promising options.
Another medical research group adopting graph database technology is the EU FP7 HOMAGE consortium, which focuses on early detection and prevention of heart failure. It has a dataset covering more than 45,000 patients, originating from 22 cohort studies and spanning patient characteristics and clinical parameters such as medical history, electrocardiograms and biochemical measurements. All this patient data is now being connected with existing biomedical knowledge from public databases to create an analysis platform in which the team of bio-researchers can explore every relationship, implicit and explicit, between these elements. The scale shows why a graph is needed: the graph database for just one heart failure network analysis platform contains over 130,000 nodes and seven million relationships.
‘Adept at discovering unknowns at big data scale’
Integrating data and knowledge in the life sciences means building an incomplete and ever-changing model of how our bodies work and what we know about them. One of the practical hurdles for computational biology is that biologists and clinicians describe things differently depending on context, and these labels often end up being ambiguous.
For instance, biologists in different subdomains use different terms, with the same protein being described by many different names. Meanwhile, there are at least 35 different ontologies describing heart failure and associated phenotypes, all partly overlapping and all partly tuned to a specific, unique application. And as our knowledge of biology progresses, these models continuously need to change; for example, large parts of human DNA originally deemed junk turn out to be important players in the system.
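One common way to cope with this naming ambiguity can be sketched in Python – the protein and synonym names below are invented, not drawn from any real ontology. Each protein is modelled as a single canonical entity with every synonym attached to it, so a query can resolve any subdomain’s vocabulary to the same underlying node; in a graph database the synonyms would be nodes linked to the protein by relationships, and a dictionary index captures the same idea.

```python
# Invented example: one canonical protein entity with several synonyms.
# In a graph database each name would be a node linked to the protein
# by a synonym relationship; a dict index captures the same idea.
synonyms = {
    "protein_x": ["PX", "protein x homolog", "px-1"],
}

# Build a case-insensitive lookup from any name to its canonical entity.
name_index = {}
for canonical, names in synonyms.items():
    name_index[canonical.lower()] = canonical
    for n in names:
        name_index[n.lower()] = canonical

def resolve(name):
    """Resolve any known synonym to its canonical protein entity."""
    return name_index.get(name.lower())
```

Because the synonym mapping is data rather than schema, new names and new ontologies can be merged in as knowledge changes, without restructuring what is already stored.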
In biology, everything really is connected – and those connections change depending on context, time and environmental triggers. To deal with these challenges, researchers need tools like graph databases that are adept at discovering unknowns at big data scale, but also able to handle dynamic, constantly evolving data.
It turns out that researchers are using graphs in other research areas as well. NASA scientists use graph techniques to map and track trend information, for instance.
Last but not least, were you aware that academic research stalwart Google Scholar is also built on graph technology? Clearly, graphs are helping researchers from many differing fields cope with the emergence of big data. Could it help your team, too?
Emil Eifrem is CEO of Neo Technology