
Taking the web seriously


Isidro Aguillo explains how university websites can be compared and why this gives meaningful insight into institutions' research output

Traditionally, the quality of universities and their research are compared using measures such as numbers of published papers, especially those in peer-reviewed journals with high impact factors, and numbers of times that those papers are cited. But such measures do not give the whole picture.

In the days of print-only journals there was no good way to evaluate how many times such papers were actually looked at. The web, however, has changed all that: websites and search engines gather extensive data on page downloads and the relative importance of sites.  In other words, they can reveal which papers are read the most and where they are from, as well as the popularity of other, less formal resources on universities’ websites.  Of course, the web is much too big and diverse for such results to neatly pop up in a straightforward search. Instead, finding this information has given birth to a whole new field of research, cybermetrics, which is sometimes given the narrower but popular name of webometrics.

Cybermetrics is devoted to the quantitative description of the internet. That means not only the content available on the web (including that in databases, the invisible or deep web), but also computer-mediated communication (messages exchanged in forums, data for e-science), the structure of the physical and virtual networks and even the behaviour of users, amongst other similar topics.

The need for a quantitative approach

A quantitative approach to describing the internet is needed because the web has grown exponentially, and sometimes the only way to make the analysis feasible is to use metric tools. And because global comparisons are possible, the use of web indicators provides a way to easily analyse and understand the internet.

There are two main aspects of analysing the web: webometrics and web log file analysis. The first, webometrics, is dedicated to the analysis of content. It covers:

  • Cybermetrics of search engines, including studies about the size and composition of the web, crawling behaviour, ranking procedures and overlap among databases;
  • Applied webometrics, which relates to the positioning in search engines;
  • Descriptive webometrics of websites, including the structure, composition and persistence of the content, object informetrics, web impact factor and other indicators;
  • Link analysis, which includes network analysis, origin and target patterns, motivation of linking, quality (dead links), webliographic linking and scholarly communication; and
  • Bibliometrics of the (deep) web, which looks at resources such as databases, repositories and e-journals.

The second area, web log file analysis (also called webmetrics), is devoted to usage analysis. It includes:

  • Popularity (domains, sites, pages, downloads);
  • Visits (sessions, temporal distribution);
  • Visitors (geo-demographics, behaviour); and
  • Referrals and referrers (terms used in search engines).

Building indicators

In order to build meaningful indicators about the web, two main factors should be taken into account. The first of these is activity, and includes all the results derived from the different parts of an organisation. The second factor is impact, and this is usually measured by the citations that the content receives.  To encompass these different aspects, a multi-dimensional approach is therefore needed.

Using the options available from search engines (see box: Collecting the data), several indicators can be easily calculated. These include: the size of the domain, which usually means the number of web pages; the number of files, especially rich formats such as Acrobat (pdf), PostScript (ps) or any of the Office formats (doc, ppt, etc); visibility (or link popularity), defined as the number of external inlinks; and popularity, expressed as the absolute or relative number of visits or visitors. Visibility is a far more powerful indicator than popularity, and a more time-stable measurement. Sometimes the number of times the name of an institution appears in search engines is used as an indicator (invocation), but the lack of standardisation and spurious motivations make this value unusable.
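The indicators above can be sketched in a few lines of code. This is a minimal illustration only: the institution counts and world totals below are invented figures, not real data from any search engine or from the ranking itself.

```python
# Hypothetical raw counts for one institution, as a search engine
# might report them (all names and figures here are illustrative).
counts = {
    "pages": 250_000,       # size: pages indexed under the domain
    "rich_files": 18_000,   # pdf, ps, doc and ppt files combined
    "inlinks": 42_000,      # external inlinks (visibility)
    "visits": 1_200_000,    # absolute visits (popularity)
}

# Assumed world totals for each indicator, used to turn absolute
# counts into relative values comparable across institutions.
world_totals = {
    "pages": 80_000_000,
    "rich_files": 5_000_000,
    "inlinks": 9_000_000,
    "visits": 400_000_000,
}

def relative_indicators(raw: dict, totals: dict) -> dict:
    """Express each absolute count as a share of the world total."""
    return {k: raw[k] / totals[k] for k in raw}

shares = relative_indicators(counts, world_totals)
```

Expressing the counts as shares of a world total is one simple way to make institutions of very different sizes comparable on the same scale.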

Ranking academic research output

One of the most popular results from cybermetric research has been the Webometrics Ranking of World Universities. Using a catalogue of more than 13,000 universities and more than 4,500 research institutions worldwide, the site offers a ranking of the top 4,000 universities and the top 1,000 research institutions in the world. This ranking has been updated twice a year since 2004.


Web presence is an important part of communicating research results but US universities dominate the comparisons. Institutions where English is not a national language lag far behind.

The classification is built using a combined indicator called WR (World Ranking) that takes into account the number of published web pages (25 per cent of the WR), the number of rich files, ie those in pdf, ps, doc and ppt format (12.5 per cent of the WR), the number of articles gathered from the Google Scholar database (12.5 per cent of the WR) and the total number of external inlinks (50 per cent of the WR). In this way, a ratio of 1:1 between visibility (‘sitations’) and the volume of published information (pages, files and articles) is preserved.
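The weighted combination described above can be sketched as follows. This is a simplified reading of the formula, assuming each institution holds a rank position on each of the four indicators and that a lower combined score means a better overall position; the two example institutions and their ranks are invented.

```python
# Weights from the WR formula described in the article: visibility
# (inlinks, 50%) balances the three volume indicators (25% + 12.5%
# + 12.5% = 50%), preserving the 1:1 ratio the article mentions.
WEIGHTS = {"pages": 0.25, "rich_files": 0.125, "scholar": 0.125, "inlinks": 0.50}

def combined_rank_score(ranks: dict) -> float:
    """Weighted sum of an institution's rank positions on each
    indicator (lower score = better combined position)."""
    return sum(WEIGHTS[k] * ranks[k] for k in WEIGHTS)

# Two hypothetical institutions and their per-indicator ranks.
uni_a = {"pages": 12, "rich_files": 30, "scholar": 25, "inlinks": 8}
uni_b = {"pages": 5, "rich_files": 10, "scholar": 15, "inlinks": 40}

score_a = combined_rank_score(uni_a)  # 3 + 3.75 + 3.125 + 4 = 13.875
score_b = combined_rank_score(uni_b)  # 1.25 + 1.25 + 1.875 + 20 = 24.375
```

Note how the 50 per cent weight on inlinks dominates: uni_b leads on all three volume indicators yet ranks behind uni_a, whose visibility is far stronger.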

This classification does not currently cover web traffic on publishers’ sites. However, the classification does cover any subscription journals indexed by Google Scholar and any papers posted in institutional repositories or on authors’ own websites. Inevitably, any papers or institutional repositories with an open-access policy are likely to be accessed more and therefore favoured by this formula.

Digital divide

Strong web presence comes from a wide variety of factors that correlate with the global quality of the institution. First of all, it is a good measure of research and academic excellence, as published papers in electronic formats attract larger audiences, from both developed and developing countries. Incoming links or downloads of these files mean that they are not only visible, but also have an impact.

But there are many other activities that can be reflected on a website, from supporting teaching material to specialised research resources. Formal research results co-exist with informal communications, news and social information. Links to and from institutions’ websites uncover not only close relationships among colleagues but also other relevant ties with industry, community or government.

The results of these rankings show an academic digital divide that is wider than expected. It affects not only developing countries but also EU and Japanese universities, which often appear in positions far below their US counterparts. Especially worrisome is the situation of many French, Italian and Japanese institutions. This is probably due to limited use in their web pages of English, which is the scientific (and global) lingua franca.

Another surprise is the very low positions of Indian universities. Given the country’s burgeoning IT industry, India’s universities might be expected to have high-quality websites with many links from other sites.

The results also reveal that there is still a deficit of quality information on the web. This gap could easily be filled if most universities agreed to adhere to open-access initiatives. More and more universities are organising their own institutional repositories, but the commitment of individual researchers to add their papers to these systems is still limited. The target audience for the dissemination of scientific results should be expanded, both geographically and culturally.

In that sense, electronic publishing should not only focus on formal papers but encompass the full range of activities of the faculty members of higher-education institutions and of research groups.

Looking ahead

There are a number of other sorts of institutions suitable for this kind of analysis. The next ranking to be based on web indicators is an ongoing project that will take into account hospitals. As many of them are linked to universities, a similar approach to the one used for academia seems appropriate.

More than 12,000 hospital web domains have already been identified and data collection will start in October in order to publish the first edition in January 2008.

Isidro Aguillo works in the Cybermetrics Laboratory of Spain’s Centre for Scientific Information and Documentation (Centro de Información y Documentación Científica or CINDOC), which is part of the Spanish National Research Council.

Collecting the data

It is no longer possible to check the content of a website by hand because of the huge number of sites and their sizes. One approach to collecting the data is to use commercial or free crawlers to download the pages automatically. However, customising one of these robots can be very difficult, and huge human and computer resources are needed. What’s more, none of them is capable of collecting all the objects, as there are barriers, orphaned pages and other problems in the crawling process. They can, however, be used for sampling purposes with the ‘random walk’ method, which consists of designating some ‘seed pages’ at random and allowing link navigation until a certain level is reached.

Another option for data collection is to use search engines. They already have well-designed and tested robots, their databases are updated frequently and they offer powerful operators for data extraction. Moreover, as search engines are the main intermediaries in web navigation, presence in their databases is a good measure of visibility. Commercial search engines have some drawbacks, though: the results are sometimes inconsistent, coverage bias has been described for some of them, and the numbers provided are rounded without explanation.
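The ‘random walk’ sampling described above can be sketched as a toy program. The link graph here is a hard-coded stand-in for what a real crawler would discover by fetching and parsing pages; the page names and structure are invented for illustration.

```python
import random

# Toy link graph: each page maps to the pages it links to. In a
# real crawl this would come from parsing fetched HTML.
LINKS = {
    "seed1": ["a", "b"],
    "seed2": ["c"],
    "a": ["d", "e"],
    "b": [],
    "c": ["a"],
    "d": [],
    "e": [],
}

def random_walk_sample(seeds, max_depth, rng=None):
    """Collect pages by starting from each seed page and randomly
    following one unvisited outlink per step, up to max_depth
    levels -- the 'random walk' sampling method."""
    rng = rng or random.Random(0)
    visited = set()
    for seed in seeds:
        page, depth = seed, 0
        while page is not None and depth <= max_depth:
            visited.add(page)
            outlinks = [p for p in LINKS.get(page, []) if p not in visited]
            page = rng.choice(outlinks) if outlinks else None
            depth += 1
    return visited

sample = random_walk_sample(["seed1", "seed2"], max_depth=3)
```

Because only one outlink is followed per step, the sample is a cheap, partial view of the site rather than an exhaustive crawl, which is exactly the trade-off that makes the method attractive.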

Currently, in order to avoid some of these problems, several search engines must be used together. This can be a challenge, as the number of independent search engines with large databases is small, and not all of them are usable for cybermetric purposes. The few that can be used are Google (and Google Scholar), Yahoo! Search, Live (but not Academic Live), Exalead and Alexa.

Values can be extracted from search engines with the help of operators, such as site, link or filetype. However, not all the engines support the same options or syntax. Unfortunately, neither Google nor Live is currently usable for hypertext analysis. On the other hand, Google’s PageRank and Alexa’s Traffic Rank can be recovered as relative (position) values. An interesting option that is only provided by Yahoo! Search is the possibility to identify sub-domains inside a certain domain, although the results are usually very noisy.

Other considerations must be noted regarding web domain names. An important advantage of the web naming system is that web domains refer mainly to institutions, so in many cases there is a unique domain for each university, or at least one that holds most of the content. However, this is not completely true, as there are bad practices in the use of names by some institutions. Several universities have changed their domains but still maintain many pages under the old one (for example, Imperial College, Paris Diderot University and the University of Barcelona). Other universities share their domains with external organisations (for example, Helsinki city and university share a domain). In addition, the main domain sometimes represents only a fraction of the total number of pages (for example, the three universities of Strasbourg share the same domain).
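Composing the queries behind the indicators with the operators mentioned above can be sketched as follows. The operator syntax shown is the Yahoo!-style form (site:, filetype:, linkdomain:) current at the time of writing; the domain name is a placeholder, and engines differ in which operators they accept.

```python
# Rich file formats counted by the ranking, as listed in the article.
RICH_FORMATS = ["pdf", "ps", "doc", "ppt"]

def size_query(domain):
    """Size indicator: pages indexed under the institutional domain."""
    return f"site:{domain}"

def rich_file_queries(domain):
    """Rich-file indicator: one query per file format."""
    return [f"site:{domain} filetype:{fmt}" for fmt in RICH_FORMATS]

def visibility_query(domain):
    """Visibility indicator: pages linking to the domain, with
    the domain's own pages excluded to keep only external inlinks."""
    return f"linkdomain:{domain} -site:{domain}"

# Example: the full query set for one (placeholder) institution.
queries = (
    [size_query("univ.example")]
    + rich_file_queries("univ.example")
    + [visibility_query("univ.example")]
)
```

The exclusion in the visibility query matters: without `-site:`, internal navigation links would inflate the inlink count and the indicator would no longer measure external impact.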