Databases help sift out high-quality information
Serin Dabb, editor, data, Royal Society of Chemistry
Broadly speaking, there are two main types of databases that researchers use. Literature databases help researchers find relevant articles and keep up to date with what other people are publishing. These need to be current and easily searchable using keywords (and, in the case of chemistry, structures). Most researchers rely on databases such as these to find important papers, rather than reading the contents pages of journals.
The other general type of database is for experimental or research results. These could have a number of purposes: to store data from a specific research group or collaboration; to manage data as part of a researcher’s workflow; to collate data from different researchers to build one central repository of data; or be collected and curated from the general literature as a paid-for service. Again, these databases must be searchable across a range of different input fields (such as temperature, size and date), as different researchers will query the data in different ways.
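The kind of multi-field querying described above can be sketched with a toy experimental-results table. This is a minimal illustration only; the schema, field names and values are invented and do not reflect any particular product.

```python
import sqlite3

# Hypothetical schema for an experimental-results database: each record is
# queryable across several input fields (temperature, size, date), mirroring
# how different researchers query the same data in different ways.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE results (
           experiment TEXT,
           temperature_k REAL,   -- kelvin
           sample_size_nm REAL,  -- nanometres
           recorded_on TEXT      -- ISO 8601 date
       )"""
)
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?, ?)",
    [
        ("synthesis-a", 298.0, 12.5, "2013-03-01"),
        ("synthesis-b", 350.0, 40.0, "2013-04-15"),
        ("synthesis-c", 310.0, 18.0, "2013-05-20"),
    ],
)

# One researcher filters by temperature range, another by date:
by_temperature = conn.execute(
    "SELECT experiment FROM results WHERE temperature_k BETWEEN 290 AND 320"
).fetchall()
by_date = conn.execute(
    "SELECT experiment FROM results WHERE recorded_on >= '2013-04-01'"
).fetchall()
print(by_temperature)
print(by_date)
```

The point is simply that the same records must answer very different questions, which is why such databases index every input field rather than just a title or abstract.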
The Royal Society of Chemistry provides a variety of databases for literature searching, updating services and chemical data.
Our subject specific abstracting and indexing services include Analytical Abstracts, which is a premier analytical science literature database that allows users to perform bespoke searches relating to analytical techniques and applications. Methods of Organic Synthesis, Catalysis and Catalysed Reactions and Natural Product Updates are all current awareness services that have monthly updates for readers on the most relevant and high-quality papers from the recent literature.
We are most excited by our recent acquisition of The Merck Index, the 15th Edition of which we published in April 2013. We now offer The Merck Index Online, which provides the same highly authoritative information as the print edition, in a chemical-structure and fully text-searchable database. This database provides important bibliographic references in addition to experimental data and properties.
The searching functionality of databases has improved greatly due to the advances in web-based search engines such as Google. The internet has set a high standard for the ease and speed of obtaining information. Essentially, a database has to be more fit-for-purpose than Google otherwise people won’t use it. The different types of data they contain have also grown. For example, many abstracting and indexing databases now also contain research data from the articles.
Describing data is also difficult. There are no standards for describing an experiment, labelling the data points, or choosing file types. This is where human intervention is often required.
The main aim of most databases is to be comprehensive. This can either be achieved using clever web-based processes (for example, ChemSpider pulls information from hundreds of different data sources), or the data needs to be hand-picked and curated, which takes a significant amount of time.
Freely available information is often automatically generated, and might not always be 100 per cent accurate. For example, Wikipedia is a great source of information but not always trustworthy. Subscription-based products win where investment has gone into producing only the highest-quality data, ensuring its authority and trustworthiness. I believe there will always be a need for these highly curated services, which will always need some sort of payment model.
We provide both free and subscription-based products to ensure good audience coverage, and to satisfy different needs. As a not-for-profit publisher one of our primary goals is the dissemination of scientific information. ChemSpider, our free chemical database, provides information on 28 million chemical structures from hundreds of sources.
We currently host the National Chemical Database Service in the UK, on behalf of the EPSRC. We intend to build on this service in the future, by developing a chemistry data repository for UK academia and building tools, models and services on this data store to increase the value and impact of researchers’ work. We look forward to working with the scientific community to make this a world-leading example of the value of research data availability. The repository will enable chemists to host their data, under embargo if necessary.
Eventually it will allow sharing of data between individuals, groups and institutions.
Karen Hawkins, director of product management, IEEE
Researchers today can be overwhelmed by the sheer volume of available content. It has become more difficult to stay current on the literature, so tools like saved search are appreciated. Researchers also want to minimise the amount of time they spend looking for content. The big change I see is the ability to search across multiple databases simultaneously and to discover new content sets worldwide.
The downside of some of the services that go across databases, especially if they are highly customised, is that they may not produce consistent results. Search on the open web can yield results that may not point to the version of record, or that are posted on unauthorised sites. What starts out looking like a timesaver may be the opposite. It is also challenging to maintain a clean, intuitive user interface given the user demand for more and more information about articles and additional functionality.
We’re asking scholarly databases, especially those that host full text, to host more content types. For example, many of the early full-text databases included exclusively journal content. Now it is common to offer multiple content types, e-commerce, and other information and functionality, such as search facets and article-level bibliometrics.
IEEE’s organisation-wide database for libraries and researchers, IEEE Xplore, includes over 3.5 million documents, mostly from the IEEE content set. From IEEE we have all journals, transactions, letters and magazines, IEEE Conference Proceedings, and over 2,400 standards documents, e-books, and educational courses.
We also host top-quality technical content from partners. We include journals and magazines, conference proceedings and seminar digests from IET, relevant e-books from MIT Press, and journals from AIP, IBM and others.
In terms of the free availability of content on the internet, some of that free content is ours. IEEE offers open-access options in all our journals and we have launched five fully open-access online-only journals, hosted on and accessed from IEEE Xplore. The availability of free content raises the bar for all of us, and that’s good. To serve our users, IEEE needs to ensure that researchers easily and consistently find authoritative information on Xplore. We constantly evaluate other platforms, conduct usability tests, and do multiple releases each year to meet evolving user needs. We benchmark other publishers’ databases, but we also look at the state-of-the-art e-commerce and consumer platforms that generate great user loyalty.
To ensure that researchers can find our content regardless of where their search begins, we partner with abstracting and indexing services, discovery services, and open web search engines to maximise discoverability of IEEE content.
We are in the process of converting 10 years of full-text backfiles to HTML and we will finish this project in 2014. HTML makes it easier for users to work with parts of articles, like figures, and it will also enhance discovery through functionality such as image search. We are working on making the Xplore user experience work equally well on the full range of devices.
Kelly Rogers, senior publishing marketing manager, CABI
We all use the internet to search for information on a daily basis. Searches typically return thousands, if not millions, of results and we are tasked with trying to filter through to find the information we need. It can be a time-consuming task and we don’t always find exactly what we are looking for unless we use complex search strings or very specific terms. But what if we don’t know the exact terms? This is where databases are invaluable: they provide controlled search terms, options to refine and filter results, and links to retrieve the articles needed.
CABI now has over 75 bibliographic databases and internet resources across agriculture, human health, and animal and plant sciences. These include CAB Abstracts and Global Health, containing over 10 million bibliographic records, full-text articles, news items and reports across the applied life sciences. Our databases offer indexed coverage of scientific research, detailed datasheets, images and much more to support and enhance research across the applied life sciences.
The key needs for researchers today remain largely unchanged. Researchers’ main requirement is access to relevant scientific data in order to maintain current awareness and to develop knowledge within their area of investigation. It is important that databases offer a reliable, credible, quality filter to identify the most relevant and appropriate materials they need.
Today, researchers increasingly tell us that they need access to full-text records, as well as requiring usability and discoverability. Bibliographic database users require the ability to access content wherever they are on a variety of devices; regular updates to ensure up-to-the-minute content; and the ‘Google-isation’ of interfaces to make searching more accessible for non-expert users.
We continually work to improve our databases, ensuring they are intuitive to users and offer relevant quality content that is hand-picked from thousands of sources and is easily discoverable through our search features. Many of our databases are updated frequently to keep researchers abreast of the latest information within their field of study.
There is an ever-growing body of open-access content that is high quality but poorly organised. It can be time-consuming and difficult to find the most relevant content for research.
By indexing only quality materials – including OA – CABI plays an increasingly important role in distilling the free web to identify useful content. Our databases offer a default free-text simple search box for broad searches, as well as allowing users to refine and filter results to find what they need quickly and easily.
We will soon be migrating our products to a new platform, with an interface designed following extensive user experience testing. The recently re-launched VetMed Resource was the first of our products to migrate early in 2013, and our e-books will be the next product range to migrate. We will also launch a number of new online resources. The first will be InfoTree later this year, which showcases the key economically- and environmentally-important tree species, pests and diseases within applied forest science.
Martha Fogg, development director, Adam Matthew
Adam Matthew provides access to interdisciplinary digital collections of rare, and often unique, primary source materials. Each theme-based collection covers the humanities or social sciences and all materials are sourced from leading libraries and archives across the world.
Collections range in coverage: from early 15th century family documents to 20th century British government files, and from some of the first moving images held by the BFI National Archive, to oral history accounts of soldiers who fought in the First World War, recorded especially for the Imperial War Museum.
Each collection contains additional features alongside the main contents. These include essays, visual sources and maps. Where possible we include interactive features to enhance exploration of the materials, such as the ‘prices visualisation’ tool in ‘Global Commodities’, which enables comparison of markets and commodities across the globe, or the ‘exhibitions’ of optical toys, panoramas, dioramas and other visual ‘delights’ of the Victorian era. These consist of videos, 360-degree rotations of objects and gallery images.
Demand for digital primary-source collections has increased significantly and there have been massive changes in digital publishing. These include the advances in technology that have brought higher-resolution colour images; more accurate OCR systems; mobile technology and the use of tablet computers; advanced layout, display and design sophistication; and the increased number of online resources available within the marketplace. This has meant that researchers are now able to gain access to a much broader range of materials on a scale that was not possible 10 years ago.
Research is now conducted through multi-faceted platforms both inside and outside of institution campuses, often on personal devices and outside of the traditional library. Researchers want to be able to gain access to content in a quick and reliable way and in a format that is best suited to their needs. Students and other researchers require materials to be searchable from library catalogues, within federated search systems and from internet search engines such as Google.
A 2010 OCLC report on ‘Perceptions of Libraries’ showed that 83 per cent of college students start their research at a search engine and that the majority of searches start with Google. The expectation is that the documents should be available on demand, and the idea that the library requires a licence or purchase to make these documents available to the researcher is, possibly, not considered.
With this prevalence of internet search engine use, publishers (and libraries) are faced with the challenge of how to provide search engines such as Google with access to content and metadata to increase discoverability, while at the same time protecting contributing libraries’ and archives’ copyright. It’s an ongoing challenge and one that we consider to be a key question for the immediate future.
Although simplified access is a key requirement, researchers still need to be able to trust the content that they are presented with. Researchers rely on the integrity of the publisher and the accuracy of their published content.
To this end, we work closely with our global library partners to provide the best available and accurate metadata. All printed materials and metadata are full-text searchable, and exportable to RefWorks and EndNote.
Our library partners share the common goal of increasing access to unique materials and to ensuring that we are able to deliver access to the highest quality materials, while being mindful of the unique and fragile nature of some of the materials that we work with.
The digital knowledge base has become increasingly sophisticated over time, and there are greater demands for long-term preservation of digital materials. We have focused on ensuring long-term protection of, and access to, our collections through a partnership with Portico. Technology continues to change quickly, and with it users’ expectations of what research materials should deliver. We continually review our products in line with customer feedback and requirements.
Our partnerships with libraries and archives ensure that their materials are protected for future use (by the digitisation process), and that the library/archive benefits financially from every sale of our collections (through royalty payments) and through publicity in our promotion of the collections.
The current driver towards open-access publishing has brought with it the assumption that everything should be freely available. Libraries and archives may have to look further at how the costs of digitisation, development, hosting and data footprints can be met before offering up this option.
‘Free content’ is misleading when considering the needs of researchers. It is true that many printed books, for example, have been made available on the internet: fantastic if you’re limiting your research to just this one source, but not so useful if you are looking to support your research using multiple document types.
Some libraries and archives have programmes of digitising materials and making them available over the internet. While these digitisation projects enable a wide dissemination of their materials, much of the accompanying contextual understanding can be lost.
We provide detailed metadata for every item. These additional ‘benefits’ and the confidence that our collections are both copyright-cleared and future-proofed for availability, provide opportunities for the researcher that are often unavailable when accessing ‘free content’.
Chris Burghardt, vice president for product and market strategy, Thomson Reuters
Thomson Reuters offers the Web of Knowledge platform for research discovery across disciplines and geographies. It guides researchers and librarians through the search and discovery process and helps them follow the progression of research across data, proceedings, books, journals, patents, and more. Web of Knowledge includes the Web of Science journal citation indexes, Conference Proceedings Citation Index, Book Citation Index, Data Citation Index, BIOSIS Citation Index (Life Sciences), Zoological Record, Derwent Innovations Index (Global Patents), INSPEC (Engineering and Technology), CABI (Agriculture and Environment), FSTA (Food Science), and Medline (Biomedical). In October, we are launching the Brazilian-based SciELO Citation Index on the platform and are expanding trial access for the Chinese Science Citation Database.
In addition we offer InCites, our web-based research evaluation tool that contains a growing collection of global metrics, in-depth institutional profiles and custom data sets that are built to give a comprehensive view of an institution’s performance. We also offer EndNote, a research workflow tool that allows researchers to build a personal database of material, including scholarly articles, podcasts, press releases and websites. They can then use their EndNote libraries to create works cited lists in more than 5,000 formats, streamlining the journal submission process.
Researchers need databases that streamline their workflow. This means focusing their search onto a curated, selected content set, integrating tools with data to help make sense of results, and seamlessly incorporating the data into their writing and publishing. A database should be so valuable that it becomes an invisible part of how research gets done.
In order to provide a truly valuable service, a database has to be more than bibliographic metadata. That content must be selected, controlled, verified and continuously evaluated, and the specific metadata elements that are included have to expand to allow new and different uses. The act of collecting, indexing and aggregating key metadata elements such as addresses and funding sources provides a way to navigate and understand the evolution of the scholarly effort. For example, we can now identify and map how an emerging post-industrial nation expands its contribution to the world’s research – by subject and by individual. We can benchmark to similar efforts worldwide.
Thomson Reuters is particularly well-positioned to work on this due to our unique processing of cited reference data. Every one of the 65 million cited references we captured in 2012 was a highly specific, scholarly association between published works, each created by an expert in the subject. We capture references as a separate and indexed type of data – not just as though they were only a feature of a source article. This allows unique types of aggregation that go way beyond a ‘times cited’ count on another article in the database.
Citation data – from sources that we have established as authoritative – is like ‘crowd sourcing’ the question of ‘what are the best resources available?’
But…the ‘crowd’ we’re listening to are the world’s best authors and scholars.
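The idea of capturing each cited reference as its own indexed datum, rather than as a mere feature of the source article, can be illustrated with a toy sketch. The records and field names below are invented for illustration; they are not Thomson Reuters data or schema.

```python
from collections import Counter

# Toy cited-reference records (invented): each reference is captured as a
# separate indexed datum, so it can be aggregated along any field it carries.
references = [
    {"cited_work": "W1", "citing_country": "BR", "subject": "chemistry"},
    {"cited_work": "W1", "citing_country": "CN", "subject": "chemistry"},
    {"cited_work": "W2", "citing_country": "BR", "subject": "physics"},
    {"cited_work": "W1", "citing_country": "BR", "subject": "chemistry"},
]

# A plain 'times cited' count per work...
times_cited = Counter(r["cited_work"] for r in references)

# ...but because each reference is indexed in its own right, we can also map
# a nation's contribution by subject, as the text describes.
by_country_subject = Counter(
    (r["citing_country"], r["subject"]) for r in references
)
print(times_cited["W1"])
print(by_country_subject[("BR", "chemistry")])
```

The design choice being illustrated is simply that richer aggregations fall out for free once references are first-class records instead of per-article counters.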
Right now, scholarly publishing is in the midst of a series of fundamental changes. It was less than 20 years ago that the discussion began about electronic publishing and e-journals. Internet use affected availability. It moved us from physical shipment of content (print and physical electronic storage like diskettes) to instant worldwide access. This has provided more and better access to high-quality information resources to researchers around the world.
Now, it’s not just about electronically available content but about continuous content generation, and continuously updating content. It’s not just about the management of archival access but about open access. Peer review is in a state of evolution, as is the basic question of what constitutes scholarly communication. All of these have implications for how we select, collect and deliver the formal, stable, high-quality sources that are necessary to researchers.
The widespread availability of free content on the internet makes curation more critical than ever – both of what sources we include and how we manage the data and rankings that we produce.
Rafael Sidi, senior vice president and general manager, ProQuest Information Solutions
Our vision is to enable the creation of intelligent information within the research workflow, linking content to extract actionable information. ProQuest aims to provide a simple search experience, allowing the researcher to find the results they need quickly – and anytime, anywhere access – regardless of which device a researcher is using.
We support a wide range of academic research – the arts and literature, business, government information, history, medicine, natural science, and technology. We also offer centuries-deep content in news and dissertations. Through the ProQuest Dialog service, we provide knowledge discovery with tools that serve the specific needs of our corporate users.
ProQuest aggregates journals and other content types and distributes solutions such as MEDLINE and the Modern Language Association International Bibliography. We offer a wide range of services from focussed abstracting and indexing within a discipline to full-text coverage. One key area of specialisation for ProQuest is the digitisation of and access to primary-source content. Our Early English Books programme has expanded to include the archives of historical libraries throughout Europe – and the Historical Newspapers programme continues to grow, providing a rich survey of news over centuries. There are also specialty digitisations we undertake with partners through our Historical Archives such as the Queen Victoria’s Journals, which provide a unique, personal view into history.
Quality and breadth of content are important. Researchers are looking for the most reputable content, ranging from ‘traditional’ sources such as scholarly and professional journals, to dissertations, working papers, reports, conference papers, and datasets. They are sometimes looking to locate highly specialist or obscure materials. There is also the need to add value to any content set through, for example, citation chaining or access to the original datasets behind the research.
Researchers want seamless access to content. The increasingly interdisciplinary nature of research means that researchers also need records from adjacent disciplines. For example, researchers in linguistics or education also need content from psychology databases. Having a search platform that cross-searches broad sets of content, products and formats, as well as products such as ProQuest Central that combine quality content across a number of disciplines, helps. However, the key way we will address this issue is through improved interoperability across all the content our researchers want to access.
A key change is the diversity of content needed to support research. According to our usage statistics and research, non-traditional sources for scholarly research – working papers, reports and more – are highly valuable to researchers. ProQuest is increasingly focussed on indexing and delivery of new content types used in research. A good example is the upcoming ProQuest Video Curation Service, which will involve indexing and digitising video holdings of partner universities.
The open-access movement is driving change in the quality of ‘free’ content on the web. We’re seeing much more peer-reviewed content that researchers want to have access to as part of their broad discovery process. ProQuest has explicit plans to integrate open-access content in its solutions, providing a central point that brings together open-access and proprietary content.
It goes without saying that we are seeing an increased emphasis on mobile. With the release of ProQuest Mobile, researchers can search, discover, manage, and share authoritative content anytime, anywhere, regardless of their choice of device. For smartphone users, ProQuest Mobile is browser based, so there are no apps to download or install, and the full ProQuest platform is optimised for tablet use. Mobile access to ProQuest provides the same robust content to support research throughout the knowledge workflow.
However, we are not just a content provider. Our goal is to empower researchers and librarians in their daily workflow. That means supporting every phase of research, from vetting the first hypothesis to obtaining funding and getting published. For example, our Pivot service is a web-based resource that identifies active sources of funding and matches them with researchers in one step.