Mining for insight

Siân Harris investigates the role of text and data mining in research – and what the publishing industry is doing, and could do, to help

Text and data mining is a hot topic. It has been debated extensively in copyright and open-access circles and features in many recent policies in these areas. But is there a fundamental disconnect between what researchers want to do and what information providers think researchers need?

Part of the challenge comes down to defining text and data mining (TDM). At one extreme it’s a large-scale, deep search to generate specialised datasets. For example, Shreejoy Tripathy, a PhD candidate in the Neural Computation Center for the Neural Basis of Cognition at Carnegie Mellon University, USA, said of his research, ‘I use full-text literature text mining to extract information about the electrical properties of different neuron types in the brain. I then analyse the resulting dataset to better understand the electrical diversity of neurons throughout the brain. Because this data is useful to other researchers who can use it for purposes different from my intended use, I also provide the extracted information (but not the publications themselves) back to the field at www.neuroelectro.org.’

TDM can also be much simpler – an extension of search – as Cameron Neylon, director of advocacy at Public Library of Science (PLOS), explained: ‘Researchers want better awareness of the latest research. This is not really delivered by current search tools. For example, I look a lot at methods, and it’s very common for these to be left out of abstracts. If we could do TDM on papers it would be fairly trivial to build tools to search methods sections.’
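As a rough illustration of the kind of tool Neylon describes, the sketch below pulls out any section whose title mentions ‘method’ and searches only that text. It assumes the full text is available as XML with sec, title and p elements (a JATS-like layout); the sample document and the function names are invented for illustration and do not describe any publisher’s actual feed.

```python
import xml.etree.ElementTree as ET

# Invented sample article; real full-text XML would be far larger.
SAMPLE = """<article><body>
<sec><title>Introduction</title><p>Neurons vary widely.</p></sec>
<sec sec-type="methods"><title>Materials and Methods</title>
<p>Recordings used whole-cell patch clamp at 32 degrees.</p></sec>
</body></article>"""


def methods_text(xml_string):
    """Return the text of every section whose title mentions 'method'."""
    root = ET.fromstring(xml_string)
    found = []
    for sec in root.iter("sec"):
        title = sec.findtext("title", default="")
        if "method" in title.lower():
            # Join each paragraph's text and collapse runs of whitespace.
            paragraphs = [
                " ".join("".join(p.itertext()).split()) for p in sec.iter("p")
            ]
            found.append(" ".join(paragraphs))
    return found


def search_methods(xml_string, term):
    """Return the methods sections that contain the search term."""
    return [t for t in methods_text(xml_string) if term.lower() in t.lower()]


if __name__ == "__main__":
    print(search_methods(SAMPLE, "patch clamp"))
```

A search for ‘neurons’ finds nothing here, because that word appears only in the introduction; this is the distinction an abstract-only search cannot make.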

According to many publishers, the amount of TDM going on is still small. Wim van der Stelt, executive vice president of corporate strategy at Springer, said that the company has not had very many requests so far, and that most of these have been from pharmaceutical research. ‘There is a company policy in the works but, so far, the amount has been so low we’ve handled them case by case,’ he said.

Nature Publishing Group (NPG) has a similar story: ‘In general, the number of requests has been pretty low. They are generally one-offs, so are dealt with on a case-by-case basis, although there is more interest now,’ said Jessica Rutt, rights and licensing manager at NPG.

Alicia Wise, director of universal access for Elsevier, says that her company has a dedicated help-desk for people who want to do TDM. However, she said that the company has had fewer than 100 requests and ‘a lot don’t seem to actually do it once they are set up’. She remarked that it is ‘still early days, but some do TDM a lot. We want to support the full spectrum, from power-users to those who just do a bit.’

Canada-based researcher Heather Piwowar, however, believes that the lack of requests does not fully represent the demand for TDM in research. ‘How many help-desk requests is often used as a guide to how little interest there is, but it’s not a full picture,’ she explained. ‘As a grad student I knew I wanted to do TDM but I never asked. And a lot of people wish Google Scholar had an API. I think these people are actually wishing to do TDM, but don’t know it yet.’

In addition, Neylon suggested that some low-level TDM goes on below the radar. ‘Text and data miners at universities often have to hide their location to avoid auto cut-offs of traditional publishers. This makes them harder to track. It’s difficult to draw the line between what’s text mining and what’s for researchers’ own use, for example, putting large volumes of papers into Mendeley or Zotero,’ he explained.

Formats and standards

How people want content to be presented for mining varies with the topic of the research and the methodology.

Rutt of NPG noted: ‘In more recent requests people have been leaning towards XML, but that definitely varies. Not all publishers do XML so researchers may want content that fits with what other publishers offer.’

‘Some researchers prefer XML and some publishers only publish PDF,’ agreed UK-based palaeontologist Ross Mounce. However, he said that he requires PDFs because he needs to mine images for the data behind them. Nonetheless, he said that the ‘PDF is a horrible container. You have to develop tools to mine PDF text.’

But there are things that can make mining easier, he argued. He urged publishers to specify that images derived from data are submitted as vector, rather than raster, files, because then the underlying data ‘would be relatively easy to decode’.

For Piwowar, ‘PDFs are not good.’ She noted: ‘JSON is fantastically easy to parse and so is XML. Ideally, papers would all be in the same format but we don’t want to be stuck in a format of 15 years ago. And if papers are available as one format, someone could develop a way to convert to another. That’s the great thing with open access.’

Tripathy said: ‘I prefer working with the HTML full texts of the article but also work with the XML provided through Elsevier’s text-mining API. A nice thing about HTML is that it is a single standard that most publishers conform to – so I can use code that I’ve developed for one publisher on content provided by another.’

However, he also noted some challenges. The first is the lack of a standard format across publishers. ‘For each publisher that I extract information from, I need custom bits of code to find relevant content from each of these publishers,’ he explained.

Another challenge he has is extracting structured information from academic publications, which are typically unstructured. ‘Challenges include identifying relevant bits of information in the publication and tagging that bit of the publication as being relevant to me with high accuracy. Because this problem is quite difficult, algorithmically, I often take a combined approach, with automated text-mining plus manual curation. I use text-mining to scan through thousands of articles to find the relevant ones and then go through and manually check everything that was extracted automatically and fix things as necessary. This problem would be substantially improved if there were better standards for how the information I’m extracting should be communicated within a publication (like there are for genomic information, which specify how different genes should be communicated).’
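The combined approach Tripathy describes, automated extraction followed by manual checking, might be sketched as follows. This is not his code: the regular expression, the two property names, the sample sentence and the record fields are all assumptions made for illustration.

```python
import re

# Hypothetical pattern: a property name, a short gap, then a value in millivolts.
PATTERN = re.compile(
    r"(resting membrane potential|spike threshold)\D{0,40}?(-?\d+(?:\.\d+)?)\s*mV",
    re.IGNORECASE,
)

# Invented sentence standing in for the full text of an article.
SAMPLE_TEXT = (
    "The resting membrane potential was -65.2 mV and "
    "the spike threshold averaged -41 mV."
)


def extract_candidates(article_id, text):
    """Automated pass: collect every match as an unchecked candidate record."""
    rows = []
    for match in PATTERN.finditer(text):
        rows.append({
            "article": article_id,
            "property": match.group(1).lower(),
            "value_mV": float(match.group(2)),
            "snippet": match.group(0),  # kept so a curator can see the context
            "checked": False,
        })
    return rows


def apply_curation(rows, decisions):
    """Manual pass: decisions maps a row index to a corrected value, or None to reject.

    Rows without a decision are treated as accepted unchanged.
    """
    curated = []
    for index, row in enumerate(rows):
        if index in decisions and decisions[index] is None:
            continue  # the curator rejected this extraction
        checked = dict(row, checked=True)  # copy, so the raw rows stay untouched
        if index in decisions:
            checked["value_mV"] = decisions[index]
        curated.append(checked)
    return curated


if __name__ == "__main__":
    candidates = extract_candidates("example-article", SAMPLE_TEXT)
    print(apply_curation(candidates, {1: None}))
```

Keeping the matched snippet alongside each value is what makes the manual pass practical: the curator can confirm or fix a record without reopening the paper.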

Kamila Markram, CEO of Swiss OA publisher Frontiers, agreed on the need for standards. ‘Standardisation is never easy but needed,’ she said, adding that Frontiers works with various groups of researchers and research organisations to get consensus on how data should be presented. ‘Publishers need to make content available and annotate it. The challenge is in finding what you need to annotate. Researchers need to come to consensus about what they need.’

However, Neylon warned that, sometimes, standardising on database formats and structures can be a challenge and may not always be the right approach. ‘There’s been a tendency to think that, because the early success was with big databases, we should standardise on those formats,’ he observed. ‘It was the right thing for big things like protein crystal structures or gene sequences because they are such a clear kind of data. As you move into how you do things in the lab, for example, it’s less easy to define and those kinds of models don’t fit so well.’

Downloads or crawling

Another consideration for people wishing to do TDM is how to obtain publisher content. The two main approaches are bulk download and crawling the content on publishers’ sites.

However, there are limitations with both approaches – mainly due to data rates. Mounce finds this frustrating: ‘The literature isn’t actually that large. I’ve got all of PLOS on my computer. I’m looking to extract all 20th century phylogenetic data. I estimate that, in the last decade, there have been more than 100,000 papers on this but that would be less than 100Gb. However, even where I have legitimate access I can’t download too many – and if I download them, it’s at a really slow data rate.’

He noted that each publisher has different limits for how many papers can be downloaded in a given time-frame. ‘They all claim that it’s a technical limit, protecting users from denial of service but we’re only pinging the content quickly,’ he said. He argued that it is not in researchers’ interests to bring down publishers’ servers either, so TDM is done considerately. He is also frustrated that, even when content is published under a Creative Commons licence, there are still often restrictions on download speeds and crawling.

Alicia Wise said that Elsevier enables TDM either through its API or by bulk download, and that content is available for mining by robots in a way that does not upset normal use. She noted that the API route is particularly suited to relatively low-scale text mining, while bulk download might be more suitable for larger-scale TDM. ‘We see both approaches as rough and ready. We support them but we see the opportunity for better tools,’ she observed, adding that the company is doing a number of pilot projects with institutions around the world in this area.

Markram of Frontiers said: ‘We allow bots but we do evaluate them and stop them if their behaviour is funny. We prefer people to come and ask first because not all bots on the internet are benign.’

Rutt explained of NPG’s position: ‘We allow TDM on subscribed content but there are a few practical constraints: we want the IP address of users and ask that they don’t violate copyright with the output of TDM. We also ask librarians of site licences to sign an addendum to their site licence as a one-off. We allow users to come into our system and crawl it, but there is a pretty slow crawl rate so it does not disrupt the system. We can also deliver data on CD or via FTP but we tend to charge for this.’

However, Neylon is dismissive of the idea of TDM causing problems to servers. ‘There’s a lot of nonsense about crawling causing problems. If you’re a publisher of any size you should be able to deal with this traffic. We get over four million unique visitors per month. We’re talking about a few tens of thousands of hits from TDM. It’s just day-to-day operation of a high-traffic website. And the people who are doing TDM tend to be the most polite and considerate users. Security testing can cause many more problems to publisher servers,’ he said.
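Considerate crawling of the kind both sides describe usually comes down to spacing requests out. The sketch below is one minimal way to do that, assuming a fixed minimum interval between requests; the class name, the interval and the injected fetch function are hypothetical, and no publisher’s actual limits are implied.

```python
import time


class PoliteFetcher:
    """Wrap a fetch function so that calls are at least min_interval_s apart."""

    def __init__(self, fetch, min_interval_s=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._fetch = fetch              # any callable that takes a URL
        self._min_interval = min_interval_s
        self._clock = clock              # injectable so the delay can be tested
        self._sleep = sleep
        self._last = None                # time of the previous request, if any

    def get(self, url):
        if self._last is not None:
            wait = self._last + self._min_interval - self._clock()
            if wait > 0:
                self._sleep(wait)        # pause rather than hit the server early
        self._last = self._clock()
        return self._fetch(url)


if __name__ == "__main__":
    # Stand-in fetch function; a real crawler would make an HTTP request here.
    fetcher = PoliteFetcher(lambda url: "body of " + url, min_interval_s=0.5)
    for address in ["article-1", "article-2", "article-3"]:
        print(fetcher.get(address))
```

The interval is the one number publisher and researcher need to agree on; everything else in the crawler can stay the same from one site to the next.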

Access and imagination

Aside from technical details, text and data miners share some broad requirements. As Neylon summarised: ‘For anyone to do anything with a corpus of literature, they need to be able to discover and identify sets of literature. They also have to be able to access it, ideally through OA to the material of papers. Thirdly, they need to have the legal rights to do whatever they need to do with it.’

He continued: ‘The technology to support it is all there. Technical tools for building indexes also exist. The only real problem is access to content and the reason it is blocked is sheer lack of imagination and thinking of business models.’

Piwowar agreed on the challenges. ‘I want to use literature to do research on researchers and gather evidence that’s only in full text. The largest challenge is that there is no place to search it all. The closest thing is Google Scholar but that doesn’t have an API. Other places could do it but do not offer a full set of literature,’ she said. ‘There is very limited support because many articles are not OA.’

She continued: ‘Another obstacle is, once you’ve done text mining, how you distribute the results. The NC clause [Creative Commons’ non-commercial designation in some licences] is very ambiguous. ImpactStory [the altmetrics organisation that she founded with Jason Priem] is not for profit but it is incorporated as a company, and at some point we might charge for some premium services. Copyright laws are different in different countries. It’s hard to figure out what you are allowed to do and so it’s easier not to do TDM.’

‘If there’s any risk we might get sued, we won’t do it. It has a huge chilling effect,’ agreed Mounce.

For content licensed under the Creative Commons CC BY licence the issue is simpler, according to Markram. ‘If you don’t have restrictions it’s just a matter of instructions to typesetters. We do ask researchers to acknowledge us.’ She explained that the people behind Frontiers are ‘researchers but also publishers so we have built tools that we want to use’. Indeed, Henry Markram (Markram’s husband and co-founder of the company) is director of a major brain mapping project so he has a huge need to mine data in his own research.


Another stumbling block that researchers find is the requirement to ask permission to do TDM. Although academic researchers often have access to a large body of literature through institutional subscriptions, this does not give them automatic rights to do TDM with the content. Many publishers require researchers to ask permission individually, which can present a significant time barrier. ‘I have physical access to content already. Negotiating again genuinely blocks research,’ noted Mounce.

‘This has been a non-trivial challenge,’ observed Tripathy. ‘While my institutional librarians have been very helpful in helping me obtain licences, in cases where I have had to wait for licences, it has slowed my research. However, I’ve found that, when communicating with publishers about text mining (for example, Elsevier or Wiley), they have been very excited to hear about my use case and have been willing to work with me, both in giving me access to content and doing things on their end to help with extracting content.’

He continued: ‘One thing I’ve observed is that editors of the journals I’m extracting information from usually have a hard time understanding why it is even an issue, as long as my institution has a journal subscription. They also appreciate that I’m providing the extracted information back to the community in a useful form.’

Meanwhile, Piwowar recounted how she gained permission to use a large body of content from one subscription publisher but by the time this permission was approved it was too late for her to use the content in her project.

‘Researchers need it when they need it, not months later. Even if you shorten the approval process to two weeks, it’s too long,’ she said, noting that this is a particular problem for early-career researchers, who are often on short-term contracts. A related issue is that permission to use subscription content for TDM is generally tied to the subscribing institution, but early-career researchers frequently change institutions and therefore have to renegotiate permissions to do TDM.

Neylon is not a fan of the approach of requiring researchers to seek permission: ‘Getting researchers to request to do TDM is barking mad,’ he said. ‘Traditional publishers have built up an entire business model based on control. Structurally and functionally they have to understand how things are used so they can see if they can make money from it – but the reality is that this is actively blocking people from experimenting.’

He added: ‘For most people there is very little point to have output from just one publisher. They need content from all publishers. If you need to spend six weeks negotiating with Elsevier and then another six weeks negotiating with Wiley to get different use conditions, this blocks the project.’ PLOS, he said, does not require users to request permission to do TDM.

Green access

Much of the discussion around TDM focuses on gold OA content but there is, of course, another body of content available – green OA.

Discussions at a Westminster Higher Education Forum held in London in February highlighted how some participants favour the green route, despite concerns raised by others over the ability to do TDM on green OA content.

And long-term green OA ‘archivangelist’ Stevan Harnad tweeted similarly from an event in May, ‘#wilbis keeps dwelling on special cases where data-mining important – ignores vast majority where it is not.’ He went on to tell Research Information: ‘@researchinfo Green *will* lead to as much CC-BY and re-use as authors want and need: but all need to mandate immediate-Green first! #wilbis’.

However, this does not seem to reflect researchers’ experiences on the ground today.

‘I haven’t tried using green OA content. The reason is that I don’t know how to go about finding green OA content that is relevant to my use case,’ said Tripathy.

Mounce said: ‘Most green things basically ignore licensing and there is very little CC BY content. Searching across repositories is very difficult, although I’m sure it will get better. I think that saying free access is good enough is selling out future generations. If we get licensing wrong it will be with us for another 70 years.’

And there are practical challenges too, as NPG has found. ‘In 2007 we had a change of policy for our green OA content in PubMed Central (PMC) that put it in a subset that can be mined,’ said Rutt. However, her colleague Grace Baynes, NPG’s head of corporate communications, added that it took a long time for PMC to put this content into the TDM subset.

According to Neylon, ‘With green OA there remains the challenge of discovery. We don’t have federated search tools that are good enough yet. There is also a significant challenge over legal rights and, in most cases, institutional repositories don’t provide clear enough licence information. It certainly could answer the TDM question but, on these things, repositories are a bit behind even traditional publishers,’ he said.

Industry efforts

Wherever the content is, there seems to be agreement that the process could be streamlined. Van der Stelt of Springer noted: ‘TDM is growing and there is some demand for standard licensing. We are working on that as an industry, proposing standard licensing clauses and maybe an infrastructure to enable licensing over multiple publishers. We are actively engaged in the discussions and multiple initiatives.’

‘Good TDM takes content from lots of publishers so we need to have industry-wide initiatives,’ agreed Rutt from NPG. ‘It’s not practical to go to every publisher. Researchers would miss the long tail of content.’

She is involved in a working group that came out of a meeting in 2011 organised by the Publishing Research Consortium (PRC) and STM. She explained that the group came up with a model licence, which was agreed last summer and became the basis for NPG’s addendum. ‘Policy and technology have to work together,’ she said. ‘We have to think how is that going to happen.’

In addition, there are efforts involving industry third parties such as Copyright Clearance Center and CrossRef to help streamline TDM.

However, setting up industry initiatives is no guarantee of success. Earlier this year 10 organisations abandoned talks on TDM in Europe as part of the Licences for Europe discussions. Mounce, who was involved in these activities, explained: ‘The European proposal excludes PhD students, and researchers would also need to define their project at the start, which is inflexible and discourages innovation. It seemed nobody was really listening to each other or finding a halfway house.’ Instead, he said, the disillusioned parties are planning their own discussions on TDM in the autumn: ‘There is so much data – and huge value – locked inside papers, but it’s quite an uphill battle.’

Piwowar recommended that publishers start allowing text mining as a standard clause in their agreements with universities. ‘Publishers should compete with each other on how easy it is to do,’ she said. She also suggested that libraries become centres of text mining knowledge. ‘My dream scenario would be for all scholarly literature to be licensed in an open way that allows open use, including commercial, and to be hosted somewhere so that we can do a single query or downloaded so we could do it locally,’ she continued.

‘Everything is moving into big data whether we want it or not,’ concluded Markram. ‘The publisher of today and particularly of tomorrow will have to comply with these kinds of needs. Whether publishers get together to do this or whether repositories aggregate it, sooner or later publishers will have to address this.’