Taming Twitter


As the number of social media users sky-rockets, can researchers catch the data fallout? Rebecca Pool finds out

In September 2013, Germany held its federal election to determine the members of the 18th Bundestag, the national parliament. All seats were at stake and Chancellor Angela Merkel’s party, the Christian Democratic Union (CDU), won a major victory.

But while the world watched election events play out, a group of social scientists were monitoring developments from quite a different perspective.

By collecting, archiving and interrogating data from Facebook and Twitter around this time, researchers from the GESIS Leibniz Institute for the Social Sciences were gleaning information on how political candidates communicated through social media. At the same time, GESIS Information Scientist, Katrin Weller, was also intent on developing the best strategies to harvest, store and process data for future use.

As Weller highlights: ‘Many academics from different disciplines are interested in social media data now, and many researchers turn to Twitter data as this is one of the easiest platforms to get data from.’

Today, Twitter boasts more than 300 million monthly active users with 500 million tweets per day. And for social scientists, as well as researchers from other disciplines, this goldmine of data could provide rich insight into politics, society and more.

According to Weller, Twitter provides an API – application programming interface – from which academics can collect data for free. ‘You can collect certain parts of data from this API and then buy the remaining data if you need to access the entire dataset,’ she says. ‘But what many people have been doing is collecting data from the free public API, which offers a sample of all tweets currently being published or tweets from specific topics.’

Indeed, as well as studying the German Bundestag elections of 2013, Weller and colleagues have been collecting data through these APIs to explore how soccer fans communicate with their favourite clubs via Twitter. Meanwhile, researchers elsewhere have been interrogating the same APIs to better understand reactions to landmark events such as the Arab Spring, Hurricane Sandy and more.

Alongside research, and driven by Twitter’s more open and accessible API, software developers have devised a host of tools and methods to capture and analyse data from the social media platform. Crucially, many have been developed for users that do not have any programming knowledge.

Indeed, in a recent blog, Wasim Ahmed, PhD student from the University of Sheffield and social sciences blogger for the London School of Economics, listed ‘Webometric Analyst’, ‘NodeXL’, ‘Visibrain Focus’ and more as key tools for social scientists. Currently looking at the use of Twitter data to give insights into health conditions and health-related events, Ahmed has also used sentiment, time series and network analysis, as well as machine learning methods, to analyse Twitter data.

But researchers’ love of the Twitter platform is not all about easy access to APIs and readily available analysis tools. For example, the 140 character tweets have lent themselves to relatively simple data searches and retrieval. Likewise, the hashtag norms ease data gathering and sorting.
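To illustrate why hashtag conventions make sorting so straightforward, here is a minimal Python sketch. The sample tweets and function names are invented for illustration, and the regular expression is a simplification rather than Twitter’s own hashtag rules.

```python
import re
from collections import defaultdict

# Simplified pattern: '#' followed by letters, digits or underscores.
# This is an assumption for illustration, not Twitter's official rule.
HASHTAG_RE = re.compile(r"#(\w+)")


def extract_hashtags(text):
    """Return the lower-cased hashtags in a tweet, in order of appearance."""
    return [tag.lower() for tag in HASHTAG_RE.findall(text)]


def group_by_hashtag(tweets):
    """Map each hashtag to the list of tweets that mention it."""
    groups = defaultdict(list)
    for tweet in tweets:
        # set() so a tweet that repeats a tag is filed under it only once
        for tag in set(extract_hashtags(tweet)):
            groups[tag].append(tweet)
    return dict(groups)


# Hypothetical sample tweets, invented for this example.
SAMPLE_TWEETS = [
    "Polls are open! #btw13",
    "Watching the results come in #BTW13 #Bundestag",
    "Great match tonight #football",
]

if __name__ == "__main__":
    for tag, items in sorted(group_by_hashtag(SAMPLE_TWEETS).items()):
        print(tag, len(items))
```

Because the tag is lower-cased before grouping, ‘#btw13’ and ‘#BTW13’ end up in the same pile, which is the kind of light normalisation a researcher would typically want before counting.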

For Weller, the academic world’s predisposition towards Twitter is clear, and she has spent a lot of time exploring how researchers from different disciplines use this and other social media platforms.

According to the researcher, spontaneous, ad hoc communication is rife around conferences: ‘Researchers [from all disciplines] at a conference will use Twitter to connect. Some use it as an alternative to business cards, adding colleagues to their Twitter list while others use it as a news source or to follow recommendations on what to read from trusted fellow scholars.’

However, more detailed research has uncovered a clear discipline bias. ‘Bibliometric analysis has shown me that researchers from many different disciplines use social media data, but computer scientists take by far the biggest share of this,’ she says.

‘Social scientists come next and then you see a long tail of linguists, studying, for example, how language changes in social media,’ she adds. ‘Then we see doctors and health specialists looking into topics such as well-being, and also economists, looking into, for example, the use of social media to predict stock market exchange rates.’

Despite the proliferation of online methods and tools to help the less computer-savvy researcher analyse data, computer scientists are king in this academic sphere, and researchers from other disciplines need their expertise.

‘Myself and colleagues have observed that researchers from non-technical backgrounds still struggle to gather specific types of data and rely on collaborations with computer scientists and even physicists, at least for big data research topics,’ Weller says. ‘Indeed, in these social media research environments we see that researchers depend heavily on collaborative efforts to study the data. So we see a trend towards interdisciplinary research teams, in particular in the field of social media data.’

Research challenges

But in the race to use social media platforms as a data source, academics are hitting hurdle after hurdle. For Twitter in particular, ethical and legal issues need to be addressed.

When retrieving large swathes of Twitter data through APIs for analysis, it is not always possible to gain direct consent from participants, a point that Sara Day-Thomson, project officer from the Scotland-based Digital Preservation Coalition, is very concerned about.

‘A handful of highly competent social science researchers are trained to handle big research data, having worked with, for example, census data, and use strict and mature ethics processes to account for bias,’ she says. ‘But some computer scientists and physicists run analyses and produce results without relating these to a context.’

‘So what concerns me is the vast majority of social media users are private citizens that don’t necessarily understand the implications of their data being made accessible for either researcher or consumer analysis research,’ she adds.

Ethics was one of several issues considered in Day-Thomson’s recent DPC report, ‘Preserving Social Media’, written to provide guidance to academics accessing social media for research purposes, and to organisations looking to preserve social media data. As part of her research, Day-Thomson also considered how social media data, when combined with administrative data, could unintentionally reveal personal information on Twitter users.

‘Though some methods [exist] to mitigate this risk, simple anonymisation may not fully prevent accidental disclosure,’ she says.
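The linkage risk can be sketched in a few lines of Python. Every record, field name and function below is hypothetical: the example assumes a tweet dataset whose usernames have been removed but which still carries a town and an occupation, and an administrative register that holds the same two attributes alongside real names.

```python
from collections import defaultdict

# Hypothetical 'anonymised' tweets: usernames removed, but two
# descriptive attributes (quasi-identifiers) are still present.
ANONYMISED_TWEETS = [
    {"town": "Ayr", "occupation": "midwife", "text": "Long shift again"},
    {"town": "Ayr", "occupation": "teacher", "text": "Marking all weekend"},
    {"town": "Leeds", "occupation": "teacher", "text": "School trip today"},
]

# Hypothetical administrative records containing real names.
ADMIN_RECORDS = [
    {"name": "Person A", "town": "Ayr", "occupation": "midwife"},
    {"name": "Person B", "town": "Ayr", "occupation": "teacher"},
    {"name": "Person C", "town": "Ayr", "occupation": "teacher"},
    {"name": "Person D", "town": "Leeds", "occupation": "teacher"},
]


def link_records(tweets, admin, keys=("town", "occupation")):
    """Pair each tweet with the names of admin records sharing its key values."""
    index = defaultdict(list)
    for record in admin:
        index[tuple(record[k] for k in keys)].append(record["name"])
    return [(t["text"], index.get(tuple(t[k] for k in keys), [])) for t in tweets]


def reidentified(tweets, admin):
    """Return (text, name) for tweets that match exactly one admin record."""
    return [
        (text, names[0])
        for text, names in link_records(tweets, admin)
        if len(names) == 1
    ]


if __name__ == "__main__":
    for text, name in reidentified(ANONYMISED_TWEETS, ADMIN_RECORDS):
        print(f"{name} probably wrote: {text!r}")
```

In this toy data, two of the three tweets can be attributed to a single named person even though no username was kept. Only the tweet whose attributes are shared by two people in the register stays ambiguous.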

Indeed, the recent ‘Wisdom of the Crowd’ project from Ipsos MORI, UK, which explored social media use and spawned numerous publications, validates Day-Thomson’s concerns. Titles such as ‘A Guide to Embedding Ethics in Social Media Research’ and ‘The Road to Representivity’ call for better ethical standards and rigour to be built into the research process.

Predictably, representivity is a thorny issue, as social media users are not necessarily representative of the entire national population. Recent analysis of the demographics of social media users from the US-based Pew Research Center for Internet, Science and Technology revealed that only 20 per cent of the entire adult population use Twitter, while 62 per cent use Facebook and 22 per cent use LinkedIn.

But Weller takes a different tack on representivity. ‘Most of my colleagues do not study social media data to get a representative sample of the general population, but rather to learn something about people,’ she says. ‘This research is not a survey and [we don’t have] sampling mechanisms,’ she adds. ‘So we focus on a research question, which is representative for a specific platform and doesn’t create this problem.’

As Weller highlights, when probing Twitter, correctly phrasing a search query is crucial. But even if the researcher hits the target here, other issues come into play.

Perhaps most noteworthy is that, for Twitter researchers, a public API only provides a one per cent sample of the Twitter data, with no means for the researcher to focus on a particular user or pattern. What’s more, a researcher has no idea how this data has been sampled.

As Day-Thomson points out: ‘Twitter’s one per cent streaming API is popular with researchers and includes a lot of data – and research institutions may not have the capacity to cope with more data than this. But a problem is that this one per cent is pulled out by Twitter and the algorithm hasn’t been released, so for social scientists there is no way to account for any bias here, as they don’t know what’s been included and why.’

Weller agrees and adds: ‘You just don’t have a clue how this Twitter data is sampled and then, by the way you decide to collect your data, you may create other biases.’
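Why an undisclosed sampling rule matters can be shown with a small, fully synthetic Python sketch. Neither rule below is claimed to be Twitter’s method, since that has not been published. Both rules are hypothetical, as is the stream itself, in which a topic bursts at the start of each block of tweets.

```python
def make_stream(n_blocks=10, block_size=1000, burst=100):
    """Build a synthetic stream in which a topic bursts at the start of each block.

    Each tweet is a dict with an 'id' and an 'on_topic' flag. The first
    `burst` tweets of every block are on topic, so the true share is
    burst / block_size.
    """
    return [
        {"id": i, "on_topic": (i % block_size) < burst}
        for i in range(n_blocks * block_size)
    ]


def systematic_sample(stream, step=100):
    """Hypothetical rule A: keep every `step`-th tweet by ID."""
    return [t for t in stream if t["id"] % step == 0]


def window_sample(stream, block_size=1000, keep=10):
    """Hypothetical rule B: keep the first `keep` tweets of each block."""
    return [t for t in stream if t["id"] % block_size < keep]


def topic_share(tweets):
    """Fraction of tweets flagged as on topic."""
    return sum(t["on_topic"] for t in tweets) / len(tweets)


if __name__ == "__main__":
    stream = make_stream()
    rule_a = systematic_sample(stream)
    rule_b = window_sample(stream)
    print("true share:", topic_share(stream))
    print("rule A:", len(rule_a), "tweets, share", topic_share(rule_a))
    print("rule B:", len(rule_b), "tweets, share", topic_share(rule_b))
```

Both rules return exactly one per cent of the 10,000 synthetic tweets, yet rule A recovers the true ten per cent share of on-topic tweets while rule B reports that every tweet is on topic. A researcher handed only the sample could not tell which situation applies.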

At the same time, interrogating past tweets through the public APIs isn’t easy. According to Weller, Twitter provides some access to historical tweets, so you can access the last 3,200 tweets of a single user, such as Barack Obama. But, as she highlights: ‘For topical keyword or hashtag searches, you can’t go back into the past, you can only say “I’ll search from now”.’

Clearly, for spontaneous events such as the Arab Spring protests, a researcher will have to set up data collection methods very quickly if he or she wants to analyse data from the public APIs, rather than buy data from Twitter data reseller Gnip. But as Weller emphasises, academics can still work with these issues. ‘Researchers can understand the issues and critically refer to these in a publication to make it explicitly clear to another researcher what he or she actually did,’ she says.

Dare to share?

Beyond ethics and representivity, legal issues abound. Under Twitter’s API Terms of Service, researchers can share tweet identification numbers but sharing larger datasets is prohibited.

‘This means that the next researcher who wants to look at [your research] will have to retrieve every single tweet based on its ID, which takes time,’ she says. ‘And if the tweet has been deleted, it just doesn’t show up anyway, which for research purposes is a disaster in some ways. The Terms of Service also means you cannot preserve your research data, which is a big problem.’
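The ID-based workflow described here, sometimes called rehydration, can be sketched as follows. The lookup function is a stand-in that reads from an in-memory dictionary rather than calling Twitter, the default batch size of 100 is an assumption for illustration, and the IDs and texts are invented.

```python
def batched(ids, size):
    """Yield successive lists of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def rehydrate(tweet_ids, lookup, batch_size=100):
    """Fetch tweets for shared IDs; return (found tweets, IDs no longer available).

    `lookup` is any callable that takes a list of IDs and returns a dict
    {id: text} for the IDs that still exist. It stands in for a real API call.
    """
    found = {}
    for batch in batched(tweet_ids, batch_size):
        found.update(lookup(batch))
    missing = [i for i in tweet_ids if i not in found]
    return found, missing


# Hypothetical 'live' tweets; ID 1003 has been deleted since the original study.
LIVE_TWEETS = {
    1001: "Polls are open",
    1002: "Results tonight",
    1004: "Coalition talks",
}


def fake_lookup(batch):
    """Stand-in for a remote lookup: silently omits deleted tweets."""
    return {i: LIVE_TWEETS[i] for i in batch if i in LIVE_TWEETS}


if __name__ == "__main__":
    shared_ids = [1001, 1002, 1003, 1004]
    found, missing = rehydrate(shared_ids, fake_lookup, batch_size=2)
    print(len(found), "recovered; missing:", missing)
```

Comparing the shared ID list with what comes back at least lets the second researcher report how much of the original dataset has been lost, even though the deleted tweets themselves cannot be recovered.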

As many in the field acknowledge, including Day-Thomson and Weller, the terms and conditions from Twitter and other social media platforms largely exist to protect the profits gained by selling user data to commercial companies. However, one noteworthy exception has raised a glimmer of hope for researchers worldwide.

In 2010, Twitter gave the US Library of Congress permission to archive every public tweet since its inception in 2006, and to continue archiving future tweets. At the time, Twitter was processing more than 50 million tweets a day and the historical archive consisted of some 170 billion tweets.

Given the historical and cultural significance of this data, the Library of Congress’s archive held enormous promise for researchers worldwide. But following practical challenges, the project stalled.

Key problems around how to ingest, organise and store the vast quantities of data stymied progress. Meanwhile issues over finding useful retrieval methods, creating the appropriate access controls to the archive and, as always, privacy policies, are still being addressed (see ‘Where next for the Library of Congress?’).

But failings aside, Twitter has been the only social media platform that has publicly acknowledged the value of its data to researchers. And while the agreement between Twitter and the Library of Congress has not yet fulfilled expectations, it is a key milestone for social media research, and other projects are now following.

In October 2015, MIT Media Lab launched the ‘Laboratory for Social Machines’ funded by a five-year, $10 million commitment from Twitter. The social platform is providing access to all public streaming tweets as well as historical tweets, so researchers can analyse how information spreads on Twitter and other social media platforms.

Meanwhile, outside of the US other key projects include the Social Data Science Lab and COSMOS platform at Wales-based Cardiff University, the Social Repository of Ireland and the National Archives’ UK Government Social Media Archive. Sara Day-Thomson believes these projects are very useful. However, she is adamant greater collaboration among institutions, such as universities and national heritage libraries, will be crucial to solving current social media research issues, as well as ensuring access into the future.

To this end, she believes the coming petabytes of social media data should be managed and preserved by several large centralised providers to ease data analytics, reduce data costs and create benchmarks for data quality.

‘We know that many institutions in the UK have relationships with Twitter, but don’t discuss this openly,’ she says. ‘Yet taking a lesson from the Library of Congress, it doesn’t necessarily work to have a large platform in which to deposit all the data.’

‘A more successful model would be to develop a larger, and potentially national, interface that could liaise with different social media platforms to smooth over the rights issues, and terms and conditions,’ she adds. ‘Things like this don’t come cheap, but the benefits would far outweigh the cost.’

Where next for the Library of Congress and Twitter?

From January to June 2015, Katrin Weller held a research fellowship at the Library of Congress. The GESIS Information Scientist had hoped to work with Twitter datasets in the archive, but the lack of availability meant her research never actually took place. What’s more, she doesn’t hold much hope for any solutions soon.

‘From my experience I can say the archive isn’t going to happen soon or even in the next 10 to 20 years, so as a researcher I wouldn’t rely on this being available,’ she says. ‘The library is also archiving text-based tweet formats, which you can get through an API or Gnip. This means, you won’t get the look and feel of what the platform looked like in, say, 2006, and images and URLs inside a tweet may lose their value if they can’t be resolved.’