New techniques for large-scale analysis of many different types of data set can give researchers new insights, writes Alastair Dunning
What do you do with a million books? Or a million newspaper pages? Or a million art images? Scholars now have access to huge repositories of digitised data – far more than they could read in a lifetime. And yet the way researchers access these documents is still based on the relatively linear process of searching for specific data like keywords, names, events, places or dates. Once users have entered such a search term, they are then presented with a list of hits to scan and choose from.
While such searching accelerates research, it does not permit the researcher to exploit the full breadth and richness of a digitised resource. Rather than analysing the body of material as a whole, it restricts us to looking at little snippets – like tearing out pages from a book rather than reading whole chapters.
New techniques of large-scale data analysis can, however, allow researchers to discover relationships, detect discrepancies and perform computations on data sets that are so large they can only be processed using powerful software. The Digging into Data challenge asked teams of international researchers to exploit new web tools and contemporary computing power to explore larger bodies of data. Bringing together four funding bodies from both sides of the Atlantic – JISC in the UK, the National Endowment for the Humanities and the National Science Foundation in the USA, and the Social Sciences and Humanities Research Council of Canada – the programme has created teams combining the expertise of social scientists, humanists and computer scientists.
One of these teams is immersed in approximately 23,000 hours of recorded music – around 350,000 individual songs and compositions – which are now being tagged by researchers at the ‘structural analysis of large amounts of music information’ (Salami) project. The sheer quantity of music being analysed, from a cappella to Zydeco, Appalachian to Zambian, and medieval to post-modern, allows the team to rescale the traditional research questions music scholars are asking. By using a range of software tools to tag each piece according to elements such as rhythm or harmony, the music can then be analysed to compare genres and find changes in musical patterns over time.
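To give a flavour of the kind of comparison this tagging makes possible, here is a minimal sketch, not the Salami project's actual pipeline: it assumes each piece has already been annotated with a chord label per beat (the decades and chord sequences below are invented), and measures how often the harmony changes, averaged by decade.

```python
from collections import defaultdict

def harmonic_change_rate(chords):
    """Fraction of beat-to-beat transitions where the chord label changes."""
    if len(chords) < 2:
        return 0.0
    changes = sum(a != b for a, b in zip(chords, chords[1:]))
    return changes / (len(chords) - 1)

# Hypothetical per-piece annotations: (decade, chord label on each beat).
pieces = [
    (1960, ["C", "C", "F", "G", "C", "C"]),
    (1960, ["Am", "Am", "Am", "F", "F", "G"]),
    (2000, ["C", "F", "G", "Am", "F", "C"]),
    (2000, ["Dm", "G", "C", "Am", "Dm", "G"]),
]

# Group the per-piece rates by decade and report the average for each.
by_decade = defaultdict(list)
for decade, chords in pieces:
    by_decade[decade].append(harmonic_change_rate(chords))

for decade in sorted(by_decade):
    rates = by_decade[decade]
    print(decade, round(sum(rates) / len(rates), 2))
```

Run over hundreds of thousands of annotated pieces rather than four, the same simple aggregation is what lets patterns over time emerge that no individual listener could assemble.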
Meanwhile, supercomputing is also helping to provide fresh insights into an old scholarly question: how to determine authorship. Looking at medieval manuscripts, early modern maps and more recent knitted quilts, the ‘digging into authorship’ team is making use of high-quality digital images to check for repeating motifs, patterns and other traces of artistic identity.
Such analyses do not necessarily provide academics with final answers. However, the processing power available to analyse very large digital images rapidly can do what a human can’t: read everything quickly and synthesise it in minutes. This offers the scholar new clues to determine the creator of a given piece. Although data mining is a technique already used elsewhere in the humanities – to establish the identity of a playwright, for example – this is probably the first time that researchers have applied this kind of analysis to very large collections of images.
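The underlying idea can be sketched very simply. Assume (hypothetically – this is not the team's published method) that motif detection has already reduced each image to a vector of motif counts; an unattributed work can then be compared against known artists by vector similarity. The workshop names and counts below are invented.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical motif-frequency vectors: how often each of four recurring
# visual motifs appears in images from each known workshop.
known = {
    "Workshop A": [12, 3, 0, 7],
    "Workshop B": [1, 9, 8, 0],
}
unattributed = [10, 2, 1, 6]

# Attribute the unknown work to the most similar known profile.
best = max(known, key=lambda artist: cosine(known[artist], unattributed))
print(best)  # → Workshop A
```

Note that this only ever yields a best match, not a proof – exactly why, as the article says, such analyses offer clues rather than final answers.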
Data analysis needn’t be confined to researchers in the sciences. Projects like these are designed to foster interdisciplinary collaboration. They can help pool resources and promote international collaboration, getting more out of the research funds that we strive so hard for.
It’s clear that valuable data mining can be done on data sets that were created for an entirely different purpose. Researchers are not just generators, but gatekeepers of their data. It’s now the task of researchers and curators to make data available using open standards, and for repositories that hold large digital collections to ensure efficient access to these materials for research, for example by converting text and data into machine-readable formats.
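What "machine-readable, open formats" means in practice can be illustrated with a small sketch (the catalogue record is invented; the field names loosely echo Dublin Core-style metadata, though no particular standard is assumed): the same record is exposed both as JSON, for programs, and as CSV, for bulk download.

```python
import csv
import io
import json

# Hypothetical catalogue record for one digitised item.
record = {
    "title": "Parish register, St Mary's, 1653-1701",
    "creator": "Unknown",
    "date": "1653/1701",
    "format": "image/tiff",
}

# JSON serialisation, convenient for programmatic access.
as_json = json.dumps(record, indent=2)

# CSV serialisation, convenient for bulk download of many records.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=record.keys())
writer.writeheader()
writer.writerow(record)
as_csv = buf.getvalue()

print(as_json)
print(as_csv)
```

Either form can be parsed by standard tools in any language, which is the point: data locked in page images or proprietary formats cannot be mined at scale.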
We also need to create tools that extend the research capacity of ordinary databases. This might be software that can ‘read’ data from different disciplines and could, for example, search biology and chemistry data for features that are useful to academics working in either discipline. The scale of resources can have a positive impact on research – but only if those millions of pages, images and datasets are made accessible in the first place.
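A toy sketch of that cross-disciplinary idea, under the assumption (mine, not the article's) that records from two fields share a common identifier such as a compound name – the records below are invented:

```python
# Hypothetical records from two disciplines, keyed by a shared identifier.
biology = {
    "caffeine": {"effect": "stimulant"},
    "taxol": {"effect": "antitumour"},
}
chemistry = {
    "caffeine": {"formula": "C8H10N4O2"},
    "aspirin": {"formula": "C9H8O4"},
}

# Merge the records that both disciplines hold, so a researcher in either
# field can see the combined picture.
shared = {
    name: {**biology[name], **chemistry[name]}
    for name in biology.keys() & chemistry.keys()
}
print(shared)  # → {'caffeine': {'effect': 'stimulant', 'formula': 'C8H10N4O2'}}
```

Trivial at this scale, but the same join across millions of records from incompatible databases is exactly the kind of tool the paragraph above calls for.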
Alastair Dunning is programme manager for digitisation at JISC in the UK