Navigating through complexity

Publisher opportunities for text and data mining in an uncertain copyright environment, by Roy Kaufman of Copyright Clearance Center

Publishers are increasingly challenged to create new ways of disseminating content while simultaneously facing a shifting technology and legal landscape. One of the most telling examples of this involves text and data mining (TDM).  Information consumers see enormous potential in TDM, creating an incentive for publishers to develop innovative new TDM products; however, some legislators are now considering laws that would actually penalise publishers for such innovation. How does a publisher navigate through this complexity to tap new market potential?

Decreasing costs of processing power have presented opportunities for information consumers to use published content to achieve innovation through data and combinatorial technologies. Text mining is the data analysis of natural language works (in other words, using text as a form of data), while data mining is the numeric analysis of data works.

For example, in text mining, a researcher might want to mine a large body of scientific journal articles to track whether a particular genome has previously been correlated, or not correlated, with a particular disease. Similarly, in data mining, a researcher could mine multiple data sets to automatically identify potential relationships between a specific medicine and a disease with which it has not previously been correlated. Together, these two forms of analysis, referred to collectively as TDM, hold promise for improving healthcare outcomes and addressing many vital medical, scientific and social challenges.
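As a rough illustration of the text mining case above, the sketch below counts how often a gene term and a disease term appear in the same sentence across a set of documents. It is a minimal sketch under stated assumptions, not a description of any particular TDM product: the function name `cooccurrence_counts`, the placeholder term `GENE1` and the two sample sentences are invented for illustration, and real projects would use proper sentence segmentation and entity recognition rather than single-word matching.

```python
import re
from collections import Counter
from itertools import product


def cooccurrence_counts(documents, genes, diseases):
    """Count sentences in which a gene term and a disease term both appear.

    Hypothetical helper: terms are matched as single lowercase words only.
    """
    counts = Counter()
    for text in documents:
        # Naive sentence split: break after ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
            for gene, disease in product(genes, diseases):
                if gene.lower() in words and disease.lower() in words:
                    counts[(gene, disease)] += 1
    return counts


# Invented toy corpus; not real findings.
articles = [
    "Variants in GENE1 were studied in cancer patients. No link to asthma was found.",
    "GENE1 expression was measured in asthma patients. GENE1 and asthma were compared again.",
]
result = cooccurrence_counts(articles, ["GENE1"], ["cancer", "asthma"])
for pair, n in sorted(result.items()):
    print(pair, n)
```

Co-occurrence alone says nothing about whether a relationship is positive, negative or absent; it only flags passages worth a closer look, which is why access to large, consistently formatted bodies of content matters so much to these users.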

Although the number of academic researchers and enterprises using TDM is still relatively small, the topic of TDM has loomed large in copyright debates. The good news is that existing intellectual property licensing regimes are already well-suited to help support such innovation. TDM enters the copyright world in part because it involves first making copies of, or downloading, the content in electronic format in order for it to be mined. 

In a May 2011 UK government report, Ian Hargreaves called for a TDM exception in UK copyright law; such an exception came into force on 1 June 2014. Likewise, the European Union has been looking into the issue, and in the United States there is significant debate about whether and to what extent TDM activities come within the doctrine of fair use.

The UK exception, having been enacted, is worth examining. It applies to already purchased and licensed content, so there is no requirement that works or data be made available to non-customers. Likewise, data unavailable for sale or license is not subject to the exception.  The exception is limited to non-commercial text and data mining initiatives, so corporations and corporate-university partnerships are not covered under the exception. 

Publishers are still permitted to exercise reasonable restrictions on bulk copying and downloading and, thereby, to prevent users from hindering website performance through crawling and other activities which could reduce publishers’ ability to provide access to others.  Finally, publishers and users cannot 'contract out' of the right to mine. Thus, in economic terms, if publishers wish to be compensated for increased costs of enabling TDM, they need to spread those costs evenly across all non-commercial users, not just those users who wish to perform TDM. 

The EU is considering an entire range of options, from encouraging TDM licensing to creating an exception, likely modeled on the UK law.  In the US, with our fact-specific application of the doctrine of fair use, some TDM would likely be considered fair use – for example, a social scientist might wish to use TDM to study linguistic shifts in the early 20th century. Other uses would probably not be considered fair use – for example, hedge fund managers might want to mine high-cost news feeds in order to short bond markets. With uncertain copyright rules, what is a publisher to do?

First, publishers should recognize that there is already a market for TDM-enabling content licenses. In fact, companies of which many publishers are only dimly aware are already paying for published content, typically through intermediaries and aggregators.

Second, despite legal uncertainty, publishers should note that the corporate market for TDM is small but growing. In the last two years, my employer, Copyright Clearance Center, has spoken with dozens of users – corporate and academic – along with big data companies and TDM solution providers to identify licensing challenges associated with TDM. 

Here are the key findings. First, while the highest ranked and most sought-after publications may publish content of arguably higher value (at least to some users) than what may be found in other journals, nothing is more critical to the success of TDM innovation than making it easy for all users to access normalised, current and updated content feeds.

Second, these users, and the legal and information teams who support them, lack the time, resources or inclination to negotiate one-by-one for every piece of content, and to then set up and maintain a bespoke arrangement with each publisher to take in, reformat, and mine each stream. ‘Easy’ matters here, because it results in faster time-to-market for the breakthrough which may result from this TDM research.

Third, publishers should develop strategies that maximise the market opportunities coming out of TDM, including (a) finding ways to offer readily-mineable XML content, (b) making the mineable content available by application programming interface (API) to popular TDM software used by customers, intermediaries, or both, (c) looking at all content as worthwhile, and not just those publications perceived to be of the highest value to a select group of users, and (d) maximising content reach by partnering with a variety of channel players and service providers. 
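To show why readily-mineable XML matters in point (a), the sketch below pulls an identifier, a title and clean paragraph text out of a small XML article using Python's standard library. The element names (`article`, `title`, `body`, `p`), the sample document and the function `extract_record` are assumptions made for illustration; an actual publisher feed would follow its own schema and would typically be delivered through an API rather than an inline string.

```python
import xml.etree.ElementTree as ET

# Hypothetical article markup; real feeds will use the publisher's own schema.
SAMPLE_XML = """<article id="a1">
  <title>Example title</title>
  <body>
    <p>First paragraph with <i>inline</i> markup.</p>
    <p>Second paragraph.</p>
  </body>
</article>"""


def extract_record(xml_string):
    """Return the article id, title and plain-text paragraphs."""
    root = ET.fromstring(xml_string)
    paragraphs = []
    for p in root.iter("p"):
        # itertext() walks nested inline elements, so markup such as <i> is
        # dropped while its text is kept; split/join collapses whitespace.
        paragraphs.append(" ".join("".join(p.itertext()).split()))
    return {
        "id": root.get("id"),
        "title": root.findtext("title", default="").strip(),
        "paragraphs": paragraphs,
    }


record = extract_record(SAMPLE_XML)
print(record["title"])
for paragraph in record["paragraphs"]:
    print(paragraph)
```

When every publisher's content arrives in a comparable structure, a single routine like this can serve many sources; when each feed differs, the user has to write and maintain a separate one per publisher, which is the bespoke burden described above.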

Users may well prefer your content to that of other publishers, but when abundant material is available, easier routes to ‘yes’ are better. Most companies would favor the cost and usability advantages of solutions that provide uniform access to multiple sources over those limited to a single source or publisher, no matter how good that single source is.

The evolution of TDM will ultimately enable publishers and users to earn new revenues, help mitigate health care challenges, and perhaps even move financial markets. To navigate this complexity successfully, publishers need to look at all their content as a valuable form of big data; make it easily available, normalised, and extractable; and, as is so often the case, keep the process simple.