Publication and data surveillance in academia
Joseph Koivisto and Jordan Sly from the University of Maryland discuss the implications of the publications-as-data model.
Data, technology and academia in 2022
It is becoming increasingly clear that the core functions of higher education are destined to be quantified and that this data will be harvested, curated, and repackaged through a variety of enterprise management platforms. All aspects of the academic lifecycle, such as research production, publication, distribution, impact determination, citation analysis, grant award trends, graduate student research topics, and more, can be sold, analysed, and gamed to an unhealthy degree. By unhealthy, we mean constricted and self-consuming, as the output we develop is directly contingent on the input we receive. Well-meaning tools, such as algorithmically derived research suggestions and citation analysis, create a shrinking and inequitable academic landscape that favours invisibly defined metrics of impact, which are reinforced through further citation, thereby limiting the scope and scale of research available.
But we know this, right? The quantification of research is nothing new, nor is the search for additional areas of impact analysis. We do not have to think hard to remember the bad old days of alt-metrics and the measuring of our social media presence as a factor in our scholarly profile. What is more disruptive and more harmful are the new ways in which the data we produce – both intentionally and unintentionally – are packaged and resold to us in ways that will further constrict this research pipeline, and the seemingly practised naïveté with which enterprise vendors approach these thorny questions.
The overreliance on metrics reflects the mistaken beliefs that numbers must be objective and that these measurements must have an enhancing effect on the research, teaching, and other aspects of the academic enterprise. This is not unique to academia, of course. As Theodore Porter has discussed, metrics and assessment allow administrators in all areas of education, government, and industry to make decisions without seeming to decide, trusting in an imagined purity and objectivity of numbers. In academia we see this primacy of calculable metrics in the form of citation counts, impact factors, and h-indexes, among other domain-specific metrics. This reliance has fostered an environment in which academic success is confined to a narrow range of quantitative measures rather than true scholarly merit. Several recent studies, including strikingly important work by the University of Maryland's Michael Dougherty and colleagues, highlight the gaps and biases endemic to metric-based assessment approaches and the ways in which tenure and promotion committees could shift their reliance to actual scholarly quality instead of these faulty metrics. The problem is, this is hard. It is a time-consuming and intellectually taxing activity and, from the university's administrative perspective, difficult to achieve at scale. This is how we have ended up in our current situation, and it is the space in which innovative enterprise companies correctly see an opportunity for vertical integration and end-to-end or closed-loop systems to address these needs.
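To make concrete how reductive these measures can be, consider the h-index: the largest number h such that a scholar has h papers cited at least h times each. The short sketch below (illustrative only; the citation counts are invented) shows how two very different publication records collapse to the same score.

```python
# A minimal sketch of the h-index calculation (illustrative citation counts only).
def h_index(citations: list[int]) -> int:
    """Return the largest h such that h papers have at least h citations."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

# A record with three highly cited papers and one with three modestly cited
# papers receive the same score, erasing the difference between them.
print(h_index([50, 40, 30, 3, 2, 1]))  # 3
print(h_index([3, 3, 3, 0, 0, 0]))     # 3
```

A single number of this kind is easy to rank and report, which is precisely why it is attractive to administrators and vendors, and precisely why it conceals so much.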
To many, quantification signals prestige and determines selectivity, both in the hiring of professors and the selection of students, creating an elite university setting. As Pierre Bourdieu showed in his 1984 Homo Academicus, however, this sentiment is not evenly shared, as the hierarchies of academic power are not distributed equally across all areas of the university. In addition, Stefan Collini discusses the economic incentive for universities to quantify and focus on the impact of particular disciplines over others, highlighting the business entanglements of the modern university and the corporate sponsors feeding both ends of the cycle. Importantly, the competitiveness of a university depends on precise metrics produced internally yet leveraged externally. What is of central concern to us are the ways in which this constricting influence impacts the research output of specific fields and the ways in which this reliance on metrics fosters an unequal environment of output favouritism.
Recently, Sun-Ha Hong has written about “prediction as extraction of discretion”, examining how this reliance on extracted data becomes self-consuming and how it reflects the input model on which the developing data cycle relies. In other words, it creates and reinforces the world it reflects. In Hong’s work, this model describes behavioural and punitive technology, but the same can be said of academia: the reliance on, replication of, and canonising of work creates an academic hegemony that could hinder innovative research, given the dependence on this cyclical model for grant awards, citations, publications, graduate student thesis development, and much more. We have seen these shifts in the broader tech world, as work such as Shoshana Zuboff’s The Age of Surveillance Capitalism has shown.
Within higher-ed tech literature, too, thinkers such as Björn Brembs have written about algorithmic employment decisions in academia. What concerns us is the active push to commoditise user data without reference to, or acknowledgement of, the issues that have already caused alarm in the wider consumer-technology sphere. As these pushes into the academic enterprise world continue, questions around data privacy, usage, and algorithmic bias are being avoided. This situation is made potentially more dire by innovations in generative artificial intelligence (AI) and the discussions around AI-generated academic content drawing on the very set of enshrined data and content we are describing.
This problem is not only internal to universities. The prestige journal publishing apparatus, too, relies on metrics to calculate pole position in this competitive space. As Michael Callaham and his co-authors have shown, the reliance on these metrics is such that the impact factor and prestige of the journal outweigh the value of the actual scholar and scholarship. Additionally, as Daniel Klein and Eric Chiang found, this metrics-based emphasis shows some evidence of citational bias: by following high-impact citations, scholars were unintentionally promulgating a strain of academic ideology that favoured specific disciplinary interpretations over others. Critically, this all happens within the opaque black box of proprietary information.
What about the counter-movements and their impact?
As the shift to open access gains momentum, there is a danger of unintended consequences as enterprise platforms seek to maximise profit while the models shift under their feet. As Alexander Grossmann and Björn Brembs discuss, the cost creep incurred by libraries reflects this pivot to a model of author-side charges, which libraries often underwrite, shifting costs from the back-end subscription model to the front-end pay-to-publish model. It is not surprising or controversial that for-profit enterprise, database, and academic platform vendors seek to turn a profit. We should remain vigilant, however, about academia's willingness to accept the easy and convenient solution without considering the longer-term effects of what it is being sold. In a recent industry platform webinar, academic enterprise representatives discussed the “alchemy” of user-derived data and their ability to repackage and sell this data, with consent, to development companies, with the key takeaway being a drive towards increased revenue. More to the point, they have learned the lessons of the tech industry, and of the social media companies in particular: the data we generate can be used to target us, to sell to us, and to fuel further development. They discussed the ways in which the use of this data would, like social media, become intelligent and drive user behaviour – further cinching the knot on the closed loop as algorithmically based suggestions constrain research and reinforce a status quo enabled by the profit motive in the guise of engagement, use, and reuse.
Updates on the latest industry initiatives around data and technology
Recent industry happenings illustrate the pressing nature of this issue in the contexts of both higher education and libraries. In 2021, Clarivate – the data insights and analytics company that owns Web of Science, EndNote, InCites, Converis, and more – moved to acquire ProQuest, one of the largest publishers of academic resources. Bound up in this deal was a wide suite of library systems owned by ProQuest, including Ex Libris and Innovative Interfaces, giving Clarivate access to almost all stages of scholarly work and communication – from research development to publication, dissemination, and discovery. This not only raised concerns in the US government – the Federal Trade Commission conducted an antitrust probe of the merger – but also led several voices in the academy to express trepidation that so many elements of the scholarly communication lifecycle could be held in the hands of a single corporate entity, presenting the very real possibility that the self-amplification of select voices could further bias research and scholarship.
Clarivate’s move also appears to have accelerated the academic vendor arms race, with competitors fearing decreased market share in the face of a behemoth end-to-end enterprise. In 2022, OCLC, a bibliographic data and library system vendor, sued Clarivate, claiming that its development of MetaDoor – an “open platform for sharing cataloguing records” – represented predatory market behaviour and tortious interference with OCLC’s contracts. Elsewhere, Elsevier – a major competitor of ProQuest – moved to acquire Interfolio, a company offering research career management and impact assessment tools, demonstrating a tit-for-tat escalation of direct corporate competition to secure a full spectrum of research management and analytics tools.
Elsewhere in the academic market, the move to more fully embrace AI as a facet of the scholarly communications lifecycle continues. Notably, Gendron, Andrew, and Cooper observe that the increasing use of artificial intelligence in peer review by companies such as Elsevier, Wiley, and Springer excises critical human judgement from the evaluation of scholarship and serves the capitalist interests of these corporate entities rather than academic interests and values. In addition, they note an over-emphasis on quantitative metrics such as the h-index and impact factor, illuminating the principles – and computational biases – that guide these AI algorithms. Furthermore, the total surveillance approach needed to drive AI-based peer review necessitates the university’s complicity in underwriting corporate data capture and in establishing what Tressie McMillan Cottom refers to as “private data worlds”: opaque corporate data sets that defy democratic inquiry and evaluation.
What can be seen from these recent events is the increased primacy of quantitative scholarly metrics as the defining measure of scholarship, coupled with corporate efforts to provide software suites that enable end-to-end service of the scholarly communications lifecycle – from research production and publication to impact assessment and evaluation. As more elements of academic enterprise management are provided by individual corporate entities, the impact of assessment and citational bias grows in magnitude as scholarly outputs are assessed by the same vendor that facilitated them, and so on, until it becomes a truly closed circle. Furthermore, data collection and analysis activities become increasingly streamlined as individual vendors are able to pressure numerous institutional divisions – schools and departments, administrative offices, libraries, and more – to conform to data standards and practices that align with corporate needs. Such monolithic vendor suites also pose a threat of “university captivity”, leaving institutions with little recourse to oppose or escape the methodologies of a single-source vendor.
The technology landscape for 2023
Following the settlement of the MetaDoor lawsuit in OCLC’s favour, Clarivate’s Gordon Samson, in a statement issued by the company, responded by noting that “Clarivate will continue to support the goals of open research and data exchange – because we believe it is the best way to make the process of research and learning faster, more robust and more transparent.” What is notable in this response is the redoubling of notions of data distribution and use, as Samson says “when scholarly information is easily accessible and shareable, the dots are easier to join, the connections are explicit, and collaborations are more natural and meaningful.” Taken at face value, the goals are laudable, but given the context we have discussed, it is not clear that companies like Clarivate see the problem the same way that many of us in higher education do. The ease with which they describe this data pipeline is the selfsame problem inherent in the reliance on and overuse of metrics, but from a different angle and with a marketing sheen.
As the market for scholarly communication and academic enterprise management platforms continues to consolidate in the hands of a limited number of vendors, the potentially negative impacts of hyperfocusing on metrics and of analytical monoculture distil to the point that the bias inherent in one facet of the scholarly lifecycle infects all downstream products. It is likely that market consolidation will continue apace in the near future. Universities will continue to seek out new solutions for their management needs, finding themselves in the unenviable position of picking between a handful of vendor solutions with little recourse to effectively advocate against coercive data practices.
As scholars see the increased need to work within this highly metrics-based environment, innovations in AI and algorithmically defined research parameters are increasingly needed, as Chubb et al. have recently written, simply to keep up with the volume of literature. As these researchers note, however, we are sacrificing research creativity and research synthesis in order to process ever more material. Through this process we reduce quality work to its metrical place within a sea of scholarship, rather than attending to the novel ideas and methods it contains, thereby stultifying innovation in research. Additionally, as we have seen in recent years with the exponential rise in conspiracy thinking and internet radicalisation, algorithmically derived suggestion engines create and foster rabbit holes of increasingly self-referential and cyclical content myopia. If exported to the academic research enterprise in even more direct ways, the same could be true of research, as it would continue to valorise the metrics-based echo chamber, but in a novel fashion. Automated peer review and scholarly assessment, algorithmically informed hiring and promotion decisions, and library collection practices driven less by professional insight and more by ecommerce-like suggestions and bundling become not a far-flung possibility, but a probable future.
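The self-reinforcing dynamic described above can be illustrated with a toy simulation (a hypothetical sketch, not any vendor's actual algorithm): if a suggestion engine recommends papers in proportion to the citations they already have, and recommendations in turn generate citations, the early leader absorbs most future attention.

```python
# A toy "rich-get-richer" model of citation-weighted recommendation.
# The starting counts and loop length are invented for illustration.
import random

random.seed(42)
citations = [10, 5, 1, 1, 1]  # five papers; the first has an early lead

for _ in range(1000):
    # Recommend a paper with probability proportional to its current
    # citation count, and assume the recommendation yields a new citation.
    pick = random.choices(range(len(citations)), weights=citations)[0]
    citations[pick] += 1

print(citations)  # the early leader captures the overwhelming share of new citations
```

Even this crude model shows how a feedback loop between visibility and citation can entrench a status quo regardless of the underlying quality of the work.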
While the market momentum seems to be on the vendors’ side, voices critical of this shift towards algorithm- and AI-informed scholarly practices and market consolidation are finding a greater foothold in the theoretical and practical discussions in the academy. As we have presented here, scholars, administrators, and librarians strive to highlight the deleterious aspects of these changes within our professional lives. These perspectives may be, at times, narrowly focused on the localised impacts within particular scholarly domains, institutions, or divisions. As these critical perspectives continue to evolve, it is likely that a holistic critique will emerge and lend greater credence to what may otherwise be viewed as merely anticapitalist conspiratorial thinking. Furthermore, as these critiques gain greater traction, it is our hope that they will inspire academic leadership to reflect on how our practices – of assessment, evaluation, procurement, and funding – subsidise corporate consolidation and shifts towards algorithmic and AI-based assessment models.
References:
- Dougherty, Michael R., Rosalind Nguyen, and David A. Illingworth. 2019. “A Memory-theoretic Account of Citation Counts.” PsyArXiv. September 16. doi:10.31234/osf.io/zst69.
- Bourdieu, P. Homo Academicus, 1984.
- Collini, S. Speaking of Universities, 2017.
- Hong, S.-H., 2022, “Prediction as Extraction of Discretion”.
- Zuboff, S., 2019, “The Age of Surveillance Capitalism: The Fight for a Human Future at the New Frontier of Power”
- Brembs, B., 2021, “Algorithmic Employment Decisions in Academia,” http://bjoern.brembs.net/2021/09/algorithmicemployment-decisions-in-acad....
- Callaham, M., 2002, “Journal Prestige, Publication Bias, and Other Characteristics Associated With Citation of Published Studies in Peer-Reviewed Journals.” JAMA. June 5. doi: 10.1001/jama.287.21.2847.
- Klein, DB, and Chiang, E, 2004, “The Social Science Citation Index: A Black Box – with an Ideological Bias?” Econ Journal Watch. 1 (1).
- Grossmann, A., and Brembs, B., 2021, “Current Market Rates for Scholarly Publishing Services.” F1000Research. https://doi.org/10.12688/f1000research.27468.2.
- Carpenter, T., 2022, “Let the metadata wars begin.” The Scholarly Kitchen. https://scholarlykitchen.sspnet.org/2022/06/22/oclc-sues-clarivate-over-the-new-metadoorplatform/.
- Schonfeld, R., 2022, “Elsevier to acquire Interfolio.” The Scholarly Kitchen. https://scholarlykitchen.sspnet.org/2022/04/25/elsevier-acquire-interfolio/.
- Gendron, Y., Andrew, J., and Cooper, C., 2022, “The perils of artificial intelligence in academic publishing.” Critical Perspectives on Accounting, 87, https://www.sciencedirect.com/science/article/abs/pii/S1045235421001301.
- Cottom, T.M., 2020, “Where platform capitalism and racial capitalism meet: The sociology of race and racism in the digital society.” Sociology of Race and Ethnicity, 6(4).
- Hamilton, T., Daniels, H., Smith, C.M., and Eaton, C., 2022, “The private side of public universities: Third-party providers and platform capitalism.” CSHE Research & Occasional Paper Series.
- Clarivate, “Clarivate and OCLC settle lawsuit,” https://clarivate.com/news/clarivate-and-oclc-settle-lawsuit/.
- Chubb, J., Cowling, P., Reed, D., 2022, “Speeding up to keep up: exploring the use of AI in the research process,” AI & Soc, 37, https://doi.org/10.1007/s00146-021-01259-0.
Jordan Sly is Head of Humanities and Social Science Librarians and Joseph Koivisto is Systems Librarian at the University of Maryland Libraries
Neal Dunkinson, Head of Professional Services at SciBite, tells us why we should consider taking a data-centric approach to computational processing of research data in 2023 and beyond
“This year, there has been a marked evolution in how companies think about their research and data assets, and the technologies they apply to “unlock” scientific information, as the volume of data they handle has continued to increase. Interest in the role of machine intelligence in retrieving and disseminating data remains high, but 2022 has also seen companies mature their thinking around the application of computational approaches. Researchers are increasingly realising that although artificial intelligence and machine learning can have huge value, they serve as tools in a process rather than complete solutions that can single-handedly solve data management challenges. At SciBite, our conversations this year with customers have evolved as a result. Whatever they want to build “downstream” – whether a knowledge graph, a new ontology or a new search console – they realise that the quality of the data going “in” upstream is critical. That quality deficit is what many are looking to address in 2023.
Consequently, what is emerging now is the need for a data-centric approach to computational processing, rather than the typical application-centric approach. This is going to pick up steam in the next 12 months as companies look holistically at their current research stack with a focus on data operations and interoperability. Today, research environments are extremely complex and fragmented, involving many different applications from multiple vendors. Although making individual datasets machine readable is valuable, these measures are often considered at an application level only, resulting in data becoming siloed and preventing it from moving freely through the research environment. Many are already convinced of the need to “FAIR-ify” data; the challenge they face is how to do it so that individual datasets can inform one another. Therefore, 2023 is the year to look toward proven and dependable competencies that can deliver the value everybody in the industry is seeking. As part of this trend, I expect to see practical industry-wide efforts like the FAIR Implementation Project from The Pistoia Alliance gather even more interest, as well as the adoption of software and technology that has a high degree of interoperability “baked in”.”
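As an editorial aside, a minimal sketch of the upstream, data-centric work described here (the synonym table, records, and ontology identifiers below are ours, chosen for illustration) is the mapping of free-text entity mentions from separate applications onto shared ontology IDs so that the resulting datasets can inform one another:

```python
# Hypothetical example: normalise entity mentions from two applications to
# shared ontology identifiers so the datasets can be joined downstream.
synonym_to_id = {
    "aspirin": "CHEBI:15365",
    "acetylsalicylic acid": "CHEBI:15365",
    "tp53": "HGNC:11998",
    "p53": "HGNC:11998",
}

def normalise(records: list[dict]) -> list[dict]:
    """Attach a canonical ontology ID to each record's 'entity' field."""
    return [
        {**r, "entity_id": synonym_to_id.get(r["entity"].lower(), "UNMAPPED")}
        for r in records
    ]

assay_results = [{"entity": "Aspirin", "value": 0.8}]
literature_hits = [{"entity": "acetylsalicylic acid", "doc": "PMID:123"}]

# After normalisation both records carry CHEBI:15365 and can be linked,
# regardless of which application produced them.
print(normalise(assay_results))
print(normalise(literature_hits))
```

The point is not the specific code but that the mapping happens once, upstream, rather than separately inside each application.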
Neal Dunkinson is the Head of Professional Services at SciBite (part of Elsevier)