Measuring Scholarly Impact: Methods and Practice

Editors: Ying Ding, Ronald Rousseau, Dietmar Wolfram

To date, there have been only a small number of monographs that have addressed informetrics-related topics. None provide a comprehensive treatment of recent developments or hands-on perspectives on how to apply these new techniques. This book fills that gap. The objective of this edited work is to provide an authoritative handbook of current topics, technologies, and methodological approaches that may be used for the study of scholarly impact. The chapters have been contributed by leading international researchers. Readers of this work should bring a basic familiarity with the field of scholarly communication and informetrics, as well as some understanding of statistical methods. However, the tools and techniques presented should also be accessible and usable by readers who are relatively new to the study of informetrics.

Chapter 1. Community Detection and Visualization of Networks with the Map Equation Framework

Author(s): Ludvig Bohlin (Sweden), Daniel Edler (Sweden), Andrea Lancichinetti (Sweden) and Martin Rosvall (Sweden)

Topic(s): networks
Aspect(s): community detection, visualization
Method(s): map equation
Software tool(s) used: Infomap, MapEquation software package
Data source: none

Abstract Large networks contain plentiful information about the organization of a system. The challenge is to extract useful information buried in the structure of myriad nodes and links. Therefore, powerful tools for simplifying and highlighting important structures in networks are essential for comprehending their organization. Such tools are called community-detection methods and they are designed to identify strongly intraconnected modules that often correspond to important functional units. Here we describe one such method, known as the map equation, and its accompanying algorithms for finding, evaluating, and visualizing the modular organization of networks. The map equation framework is very flexible and can identify two-level, multi-level, and overlapping organization in weighted, directed, and multiplex networks with its search algorithm Infomap. Because the map equation framework operates on the flow induced by the links of a network, it naturally captures flow of ideas and citation flow, and is therefore well-suited for analysis of bibliometric networks.
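
To make the workflow concrete, here is a minimal sketch of two-level community detection with the map equation. It assumes the `infomap` Python package (pip install infomap), a companion to the chapter's Infomap/MapEquation software; the toy network and flags are illustrative, not from the chapter.

```python
# Minimal sketch: two-level community detection with the map equation,
# assuming the `infomap` Python package is available (pip install infomap).
from infomap import Infomap

# Toy undirected network: two triangles joined by a single bridge edge.
edges = [(0, 1), (0, 2), (1, 2),   # module A
         (3, 4), (3, 5), (4, 5),   # module B
         (2, 3)]                   # bridge between the modules

im = Infomap("--two-level --silent")  # flags follow the Infomap CLI
for u, v in edges:
    im.add_link(u, v)

im.run()  # minimize the map equation over candidate partitions

# get_modules() maps node id -> module id in the optimal partition.
for node_id, module_id in sorted(im.get_modules().items()):
    print(f"node {node_id} -> module {module_id}")
```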

Chapter 2. Link Prediction

Author(s): Raf Guns (Belgium)

Topic(s): networks
Aspect(s): link prediction
Method(s): data gathering - preprocessing - prediction - evaluation; recall-precision charts; using predictors such as common neighbors, cosine, degree product, SimRank, and the Katz predictor
Software tool(s) used: linkpred; Pajek; VOSviewer; Anaconda Python
Data source: Web of Science (Thomson Reuters) - co-authorship data of informetrics researchers
Slides: ISSI2015 Tutorial Slides

Abstract Social and information networks evolve according to certain regularities. Hence, given a network structure, some potential links are more likely to occur than others. This leads to the question of link prediction: how can one predict which links will occur in a future snapshot of the network and/or which links are missing from an incomplete network? This chapter provides a practical overview of link prediction. We present a general overview of the link prediction process and discuss its importance to applications like recommendation and anomaly detection, as well as its significance to theoretical issues. We then discuss the different steps to be taken when performing a link prediction process, including preprocessing, predictor choice, and evaluation. This is illustrated on a small-scale case study of researcher collaboration, using the freely available linkpred tool.
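
As an illustration of the prediction step, here is a sketch using two of the predictors the chapter names, common neighbors and degree product (known in networkx as preferential attachment). It uses networkx as a stand-in for the chapter's linkpred tool; the toy network is illustrative.

```python
# Illustrative sketch of link prediction with networkx (a stand-in for
# the chapter's linkpred tool) on a toy co-authorship network.
import networkx as nx

# Nodes are authors, edges are observed collaborations.
G = nx.Graph([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")])

candidates = list(nx.non_edges(G))  # potential links: unconnected pairs
common = {(u, v): len(list(nx.common_neighbors(G, u, v)))
          for u, v in candidates}
degprod = {(u, v): s for u, v, s in nx.preferential_attachment(G, candidates)}

# Rank candidates: a higher score suggests a more likely future link.
for u, v in sorted(candidates, key=lambda e: -common[e]):
    print(f"{u}-{v}: common neighbors={common[u, v]}, "
          f"degree product={degprod[u, v]}")
```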

Chapter 3. Network Analysis and Indicators

Author(s): Stasa Milojevic (USA)

Topic(s): network analysis - network indicators
Aspect(s): bibliometric applications
Method(s): study of collaboration and citation links
Software tool(s) used: Pajek; Sci2
Data source(s): Web of Science (Thomson Reuters) - articles published in the journal Scientometrics over the period 2003-2012

Abstract Networks have for a long time been used both as a metaphor and as a method for studying science. With the advent of very large data sets and the increase in computational power, network analysis became more prevalent in the studies of science in general and the studies of science indicators in particular. For the purposes of this chapter, science indicators are broadly defined as "measures of changes in aspects of science" (Elkana et al., Toward a metric of science: The advent of science indicators, John Wiley & Sons, New York, 1978). The chapter covers network science-based indicators related to both the social and the cognitive aspects of science. Particular emphasis is placed on different centrality measures. Articles published in the journal Scientometrics over a 10-year period (2003-2012) were used to show how the indicators can be computed in coauthorship and citation networks.
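
The centrality measures the chapter emphasizes are straightforward to compute outside Pajek and Sci2 as well; here is a hedged networkx sketch on a toy weighted coauthorship network (all values illustrative).

```python
# Sketch: the centrality indicators discussed in the chapter, computed
# with networkx on a toy coauthorship network (the chapter uses Pajek/Sci2).
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([("A", "B", 3), ("A", "C", 1),
                           ("B", "C", 2), ("C", "D", 1)])

indicators = {
    "degree": nx.degree_centrality(G),
    "betweenness": nx.betweenness_centrality(G),
    "closeness": nx.closeness_centrality(G),
    "eigenvector": nx.eigenvector_centrality(G),
}
for name, values in indicators.items():
    print(name, {node: round(v, 3) for node, v in values.items()})
```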

Chapter 4. PageRank-Related Methods for Analyzing Citation Networks

Author(s): Ludo Waltman (The Netherlands) and Erjia Yan (USA)

Topic(s): citation networks
Aspect(s): roles played by nodes in a citation network and their importance
Method(s): Page-rank related methods
Software tool(s) used: Sci2; MATLAB; Pajek
Data source: Web of Science (Thomson Reuters) - all publications in the journal subject category Information Science & Library Science that are of document type article, proceedings paper, or review and that appeared between 2004 and 2013.
Slides: ISSI2015 Tutorial Slides


Chapter 5. Systems Life Cycle and Its Relation with the Triple Helix

Author(s): Robert K. Abercrombie (USA) and Andrew S. Loebl (USA)

Topic(s): life cycle
Aspect(s): seen from a triple helix aspect
Method(s): Technology Readiness Levels (TRLs)
Software tool(s) used: none
Data source: Lee et al., "Continuing Innovation in Information Technology," Washington, DC: The National Academies Press; plus diverse other sources

Abstract This chapter examines the life cycle of complex systems in light of the dynamic interconnections among the university, industry, and government sectors. Each sector is motivated in its resource allocation by principles discussed elsewhere in this book, and yet the sectors remain complementary, establishing enduring and fundamental relationships. Industry and government depend upon an educated workforce; universities depend upon industry to spark the R&D that is needed and to sponsor some basic research and much applied research. Government depends upon industry to address operational needs and provide finished products, while universities offer government (along with industry) problem solving and problem-solving environments. The life cycle of complex systems is examined in this context, with historical examples. Current examples are then examined within this multidimensional context with respect to the phases of program and project life cycle management, from requirements definition through retirement and closeout of systems. In the course of these examples, advances in the research techniques used to collect, analyze, and process the data are examined.

Chapter 6. Spatial Scientometrics and Scholarly Impact: A Review of Recent Studies, Tools, and Methods

Author(s): Koen Frenken (The Netherlands) and Jarno Hoekman (The Netherlands)

Topic(s): spatial scientometrics
Aspect(s): scholarly impact, particularly, the spatial distribution of publication and citation output, and geographical effects of mobility and collaboration on citation impact
Method(s): review
Software tool(s) used: none
Data source: Web of Science (Thomson Reuters): post 2008

Abstract Previously, we proposed a research program to analyze spatial aspects of the science system which we called "spatial scientometrics" (Frenken, Hardeman, & Hoekman, 2009). The aim of this chapter is to systematically review recent (post-2008) contributions to spatial scientometrics on the basis of a standardized literature search. We focus our review on contributions addressing spatial aspects of scholarly impact, particularly, the spatial distribution of publication and citation impact, and the effect of spatial biases in collaboration and mobility on citation impact. We also discuss recent dedicated tools and methods for analysis and visualization of spatial scientometric data. We end with reflections about future research avenues.

Chapter 7. Researchers' Publication Patterns and Their Use for Author Disambiguation

Author(s): Vincent Lariviere and Benoit Macaluso (Canada)

Topic(s): Authors
Aspect(s): name disambiguation
Method(s): publication patterns
Software tool(s) used: none
Data source: list of distinct university based researchers in Quebec; classification scheme used by U.S. National Science Foundation (NSF); Web of Science (Thomson Reuters); Google

Abstract In recent years we have been witnessing an increase in the need for advanced bibliometric indicators for individual researchers and research groups, for which author disambiguation is needed. Using the complete population of university professors and researchers in the Canadian province of Quebec (N=13,479), their papers, as well as the papers authored by their homonyms, this chapter provides evidence of regularities in researchers' publication patterns. It shows how these patterns can be used to automatically assign papers to individuals and to remove papers authored by their homonyms. Two types of patterns were found: (1) at the level of individual researchers and (2) at the level of disciplines. On the whole, these patterns allow the construction of an algorithm that provides assignment information for at least one paper for 11,105 (82.4%) of the 13,479 researchers, with a very low percentage of false positives (3.2%).
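
To convey the flavor of pattern-based assignment, here is a deliberately simplified, hypothetical sketch: a candidate paper is assigned to a researcher when its discipline and institutional address match that researcher's known profile. The chapter's actual algorithm relies on much richer individual- and discipline-level publication patterns.

```python
# Hypothetical, deliberately simplified sketch of pattern-based assignment.
# (The chapter's actual algorithm uses richer individual- and
# discipline-level publication patterns than these two rules.)

def assign(paper, researcher):
    """Return True if the paper plausibly belongs to this researcher."""
    same_field = paper["discipline"] in researcher["disciplines"]
    same_place = any(researcher["institution"] in addr
                     for addr in paper["addresses"])
    return same_field and same_place

researcher = {"name": "Tremblay, M",
              "disciplines": {"chemistry"},
              "institution": "Univ Laval"}
papers = [
    {"discipline": "chemistry", "addresses": ["Univ Laval, Quebec"]},
    {"discipline": "sociology", "addresses": ["McGill Univ, Montreal"]},  # homonym
]
print([assign(p, researcher) for p in papers])  # [True, False]
```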

Chapter 8. Knowledge Integration and Diffusion: Measures and Mapping of Diversity and Culture

Author(s): Ismael Rafols (Spain and UK)

Topic(s): knowledge integration and diffusion
Aspect(s): diversity and coherence
Method(s): presents a conceptual framework including cognitive distance (or proximity) between the categories that characterize the body of knowledge under study
Software tool(s) used: Leydesdorff's overlay toolkit; Excel; Pajek; additional software available at http://www.sussex.ac.uk/Users/ir28/book/excelmaps
Data source: Web of Science (Thomson Reuters) - citations of the research centre ISSTI (University of Edinburgh) across different Web of Science categories

Abstract In this chapter, I present a framework based on the concepts of diversity and coherence for the analysis of knowledge integration and diffusion. Visualisations that help to understand insights gained are also introduced. The key novelty offered by this framework compared to previous approaches is the inclusion of cognitive distance (or proximity) between the categories that characterise the body of knowledge under study. I briefly discuss different methods to map the cognitive dimension.
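
A widely used measure in this family is the Rao-Stirling diversity index, which weights each pair of categories by both their proportions and the cognitive distance between them. A minimal sketch, with illustrative proportions and distances:

```python
# Sketch: Rao-Stirling diversity, combining the shares p_i of a publication
# set across categories with the cognitive distance d_ij between categories.
# All numbers below are illustrative.
import itertools

p = {"biology": 0.5, "chemistry": 0.3, "sociology": 0.2}   # category shares
d = {("biology", "chemistry"): 0.2,                         # cognitive distances
     ("biology", "sociology"): 0.9,
     ("chemistry", "sociology"): 0.8}

# Delta = sum over unordered pairs i != j of p_i * p_j * d_ij
# (some formulations sum ordered pairs, i.e., twice this value).
diversity = sum(p[i] * p[j] * d[i, j]
                for i, j in itertools.combinations(p, 2))
print(round(diversity, 3))
```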

Chapter 9. Limited Dependent Variable Models and Probabilistic Prediction in Informetrics

Author(s): Nick Deschacht (Belgium) and Tim C.E. Engels (Belgium)

Topic(s): regression models
Aspect(s): studying the probability of being cited
Method(s): logit model for binary choice; ordinal regression; models for multiple responses and for count data
Software tool(s) used: Stata
Data source: Web of Science - Social Sciences Citation Index (Thomson Reuters) - 2,271 journal articles published between 2008 and 2011 in five library and information science journals

Abstract This chapter explores the potential for informetric applications of limited dependent variable models, i.e., binary, ordinal, and count data regression models. In bibliometrics and scientometrics such models can be used in the analysis of all kinds of categorical and count data, such as assessment scores, career transitions, citation counts, editorial decisions, or funding decisions. The chapter reviews the use of these models in the informetrics literature and introduces the models, their underlying assumptions, and their potential for predictive purposes. The main advantage of limited dependent variable models is that they allow us to identify the main explanatory variables in a multivariate framework and to estimate the size of their (marginal) effects. The models are illustrated using an example data set to analyze the determinants of citations. The chapter also shows how these models can be estimated using the statistical software Stata.
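
The chapter estimates its models in Stata; for readers working in Python, a minimal logit sketch with simulated data (statsmodels), including the average marginal effects the chapter emphasizes, might look like this. The variables are illustrative, not the chapter's data.

```python
# Sketch: a binary logit model of "cited vs. not cited" with statsmodels
# on simulated data (the chapter uses Stata; variables are illustrative).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
pages = rng.integers(4, 40, n)      # article length
authors = rng.integers(1, 8, n)     # number of authors
# Simulated outcome: longer, multi-authored papers are cited more often.
logit_p = -2.0 + 0.05 * pages + 0.3 * authors
cited = (rng.random(n) < 1 / (1 + np.exp(-logit_p))).astype(int)

X = sm.add_constant(np.column_stack([pages, authors]))
model = sm.Logit(cited, X).fit(disp=False)
print(model.params)                   # coefficients on the log-odds scale
print(model.get_margeff().summary())  # average marginal effects
```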

Chapter 10. Text Mining with the Stanford CoreNLP

Author(s): Min Song (South Korea) and Tamy Chambers (USA)

Topic(s): text mining
Aspect(s): for bibliometric analysis
Method(s): provides an overview of the architecture of text mining systems and their capabilities
Software tool(s) used: Stanford CoreNLP
Data source(s): Titles and abstracts of all articles published in the Journal of the American Society for Information Science and Technology (JASIST) in 2012

Abstract Text mining techniques have been widely employed to analyze texts ranging from massive social media to scientific publications and patents. As a bibliographic analysis tool, text mining presents the opportunity for large-scale topical analysis of papers covering an entire domain, country, institution, or specific journal. For this project, we have chosen to use the Stanford CoreNLP parser due to its extensibility and rich functionality, which can be applied to bibliometric research. The current version includes a suite of processing tools designed to take raw English-language text as input and output a complete textual analysis and linguistic annotation appropriate for higher-level textual analysis. The data for this project include the titles and abstracts of all articles published in the Journal of the American Society for Information Science and Technology (JASIST) in 2012 (n=177). Our process provides an overview of the concepts depicted in the journal that year and highlights the most frequent concepts to establish an overall trend for the year.
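
A minimal sketch of the annotation step, assuming a Stanford CoreNLP server has been started locally on port 9000 (the server accepts raw text in the request body and a JSON `properties` parameter); the sample sentence is illustrative.

```python
# Sketch: annotating text with a locally running Stanford CoreNLP server
# (assumes `java ... StanfordCoreNLPServer -port 9000` has been started).
import json
import requests

text = "Text mining techniques have been widely employed to analyze texts."
props = {"annotators": "tokenize,ssplit,pos,lemma", "outputFormat": "json"}

resp = requests.post("http://localhost:9000",
                     params={"properties": json.dumps(props)},
                     data=text.encode("utf-8"))
doc = resp.json()

# Print each token with its part-of-speech tag and lemma.
for sentence in doc["sentences"]:
    for tok in sentence["tokens"]:
        print(tok["word"], tok["pos"], tok["lemma"])
```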

Chapter 11. Topic Modeling: Measuring Scholarly Impact Using a Topical Lens

Author(s): Min Song (South Korea) and Ying Ding (USA)

Topic(s): topic modeling
Aspect(s): bibliometric applications
Method(s): Latent Dirichlet Allocation (LDA)
Software tool(s) used: Stanford Topic Modeling Toolbox (TMT)
Data source(s): Web of Science (Thomson Reuters) - papers published in the Journal of the American Society for Information Science (and Technology) (JASIS(T)) between 1990 and 2013

Abstract Topic modeling is a well-received, unsupervised method that learns thematic structures from large document collections. Numerous algorithms for topic modeling have been proposed, and the results of those algorithms have been used to summarize, visualize, and explore the target document collections. In general, a topic modeling algorithm takes a document collection as input. It then discovers a set of salient themes that are discussed in the collection and the degree to which each document exhibits those topics. Scholarly communication has been an attractive application domain for topic modeling to complement existing methods for comparing entities of interest. In this chapter, we explain how to apply an open source topic modeling tool to conduct topic analysis on a set of scholarly publications. We also demonstrate how to use the results of topic modeling for bibliometric analysis.
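
The chapter works with the Stanford Topic Modeling Toolbox; the same LDA workflow can be sketched in Python with gensim. The toy documents and parameters below are illustrative.

```python
# Sketch: LDA topic modeling on toy documents with gensim
# (the chapter uses the Stanford Topic Modeling Toolbox instead).
from gensim import corpora, models

docs = [
    "citation analysis of journal impact".split(),
    "citation networks and journal rankings".split(),
    "topic models for text mining".split(),
    "text mining of scholarly abstracts".split(),
]

dictionary = corpora.Dictionary(docs)                # word <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in docs]   # bag-of-words vectors

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      random_state=1, passes=20)
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)

# Topic mixture of the first document: (topic_id, probability) pairs.
print(lda.get_document_topics(corpus[0]))
```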

Chapter 12. The Substantive and Practical Significance of Citation Impact Differences Between Institutions: Guidelines for the Analysis of Percentiles Using Effect Sizes and Confidence Intervals

Author(s): Richard Williams (USA) and Lutz Bornmann (Germany)

Topic(s): analysis of percentiles
Aspect(s): difference in citation impact
Method(s): statistical analysis using effect sizes and confidence intervals
Software tool(s) used: Stata
Data source: InCites (Thomson Reuters) - citation data for publications produced by three research institutions in German-speaking countries in 2001 and 2002

Abstract In this chapter we address the statistical analysis of percentiles: How should the citation impact of institutions be compared? In educational and psychological testing, percentiles are already widely used as a standard to evaluate an individual's test scores (intelligence tests, for example) by comparing them with the scores of a calibrated sample. Percentiles, or percentile rank classes, are also a very suitable method for bibliometrics to normalize citations of publications in terms of the subject category and the publication year, and, unlike mean-based indicators (the relative citation rates), percentiles are scarcely affected by skewed distributions of citations. The percentile of a certain publication provides information about the citation impact this publication has achieved in comparison to other similar publications in the same subject category and publication year. Analyses of percentiles, however, have not always been presented in the most effective and meaningful way. New APA guidelines (American Psychological Association, Publication Manual of the American Psychological Association (6th ed.), Washington, DC: APA, 2010) suggest a lesser emphasis on significance tests and a greater emphasis on the substantive and practical significance of findings. Drawing on work by Cumming (Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis, London: Routledge, 2012) we show how examinations of effect sizes (e.g., Cohen's d statistic) and confidence intervals can lead to a clear understanding of citation impact differences.
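
A minimal Python sketch of the two quantities the chapter focuses on, Cohen's d and a confidence interval for the difference between two institutions' percentile scores. The scores are simulated and the degrees-of-freedom choice is a conservative simplification; the chapter itself works in Stata.

```python
# Sketch: Cohen's d and a 95% confidence interval for the difference
# between two institutions' percentile scores (simulated data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
inst_a = rng.uniform(0, 100, 80)    # percentile scores, institution A
inst_b = rng.uniform(10, 100, 90)   # percentile scores, institution B

# Cohen's d: mean difference divided by the pooled standard deviation.
n1, n2 = len(inst_a), len(inst_b)
pooled_sd = np.sqrt(((n1 - 1) * inst_a.var(ddof=1) +
                     (n2 - 1) * inst_b.var(ddof=1)) / (n1 + n2 - 2))
d = (inst_a.mean() - inst_b.mean()) / pooled_sd
print("Cohen's d:", round(d, 3))

# 95% CI for the mean difference, Welch-style standard error with a
# conservative min(n1, n2) - 1 degrees of freedom.
se = np.sqrt(inst_a.var(ddof=1) / n1 + inst_b.var(ddof=1) / n2)
t_crit = stats.t.ppf(0.975, min(n1, n2) - 1)
diff = inst_a.mean() - inst_b.mean()
print("95% CI:", (round(diff - t_crit * se, 2),
                  round(diff + t_crit * se, 2)))
```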

Chapter 13. Visualizing Bibliometric Networks

Author(s): Nees Jan van Eck (The Netherlands) and Ludo Waltman (The Netherlands)

Topic(s): Bibliometric networks
Aspect(s): visualization
Method(s): as included in the software tools; tutorials
Software tool(s) used: VOSviewer; CitNetExplorer
Data source: Web of Science (Thomson Reuters) - journals Scientometrics and Journal of Informetrics and journals in their citation neighborhood

Abstract This chapter provides an introduction to the topic of visualizing bibliometric networks. First, the most commonly studied types of bibliometric networks (i.e., citation, co-citation, bibliographic coupling, keyword co-occurrence, and coauthorship networks) are discussed, and three popular visualization approaches (i.e., distance-based, graph-based, and timeline-based approaches) are distinguished. Next, an overview is given of a number of software tools that can be used for visualizing bibliometric networks. In the second part of the chapter, the focus is specifically on two software tools: VOSviewer and CitNetExplorer. The techniques used by these tools to construct, analyze, and visualize bibliometric networks are discussed. In addition, tutorials are offered that demonstrate in a step-by-step manner how both tools can be used. Finally, the chapter concludes with a discussion of the limitations and the proper use of bibliometric network visualizations and with a summary of some ongoing and future developments.
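
Two of the network types listed above can be derived directly from a citation matrix: with A[i, j] = 1 when document i cites document j, bibliographic coupling is A Aᵀ (shared references) and co-citation is Aᵀ A (shared citing documents). A small numpy sketch with a toy matrix:

```python
# Sketch: deriving bibliographic coupling and co-citation networks from a
# toy citation matrix (the chapter visualizes such networks in VOSviewer).
import numpy as np

# A[i, j] = 1 if document i cites document j.
A = np.array([[0, 0, 1, 1],
              [0, 0, 1, 1],
              [0, 0, 0, 1],
              [0, 0, 0, 0]])

coupling = A @ A.T       # shared references between citing documents
np.fill_diagonal(coupling, 0)
cocitation = A.T @ A     # shared citing documents between cited documents
np.fill_diagonal(cocitation, 0)

print(coupling)    # docs 0 and 1 are strongly coupled (2 shared references)
print(cocitation)  # docs 2 and 3 are co-cited twice
```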

Chapter 14. Replicable Science of Science Studies

Author(s): Katy Borner (USA) and David E. Polley (USA)

Topic(s): Science of Science
Aspect(s): data preprocessing, burst detection, visualization, geospatial, topical and network analysis; career trajectories
Method(s): use of freely available tools for the actions described under 'aspects'
Software tool(s) used: Sci2 toolset
Data source: data downloaded from the Scholarly Database

Abstract Much research in bibliometrics and scientometrics is conducted using proprietary datasets and tools, making it hard if not impossible to replicate results. This chapter reviews free tools, software libraries, and online services that support science of science studies using common data formats. We then introduce plug-and-play macroscopes (Borner, Commun ACM 54(3):60-69, 2011) that use the OSGi industry standard to support modular software design, i.e., the plug-and-play of different data readers, preprocessing and analysis algorithms, as well as visualization algorithms and tools. As an example, we demonstrate how the open source Science of Science (Sci2) Tool can be used to answer temporal (when), geospatial (where), topical (what), and network questions (with whom) at different levels of analysis, from micro to macro. Using the Sci2 Tool, we provide hands-on instructions on how to run burst analysis (see Chapter 10 in this book), overlay data on geospatial maps (see Chapter 6 in this book), generate science map overlays, and calculate diverse network properties, e.g., weighted PageRank (see Chapter 4 in this book) or community detection (see Chapter 3 in this book), using data from Scopus, Web of Science, or personal bibliography files, e.g., EndNote or BibTeX. We exemplify tool usage by studying the evolving research trajectories of a group of physicists over temporal, geospatial, and topic space, as well as their evolving co-author networks. Last but not least, we show how plug-and-play macroscopes can be used to create bridges between existing tools, e.g., Sci2 and the VOSviewer clustering algorithm (see Chapter 13 in this book), so that they can be combined to execute more advanced analysis and visualization workflows.
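
To give a flavor of such a workflow outside the Sci2 GUI, here is a hedged Python sketch that builds a weighted co-author network from hypothetical bibliographic records (e.g., as parsed from EndNote or BibTeX exports) and computes weighted PageRank with networkx. The records and names are invented for illustration.

```python
# Sketch of one workflow step from the chapter: build a co-author network
# from bibliographic records and compute weighted PageRank with networkx
# (the chapter runs this kind of analysis inside the Sci2 Tool).
from itertools import combinations
import networkx as nx

# Hypothetical records, e.g., parsed from EndNote/BibTeX exports.
records = [
    {"authors": ["Author A", "Author B"]},
    {"authors": ["Author A", "Author B", "Author C"]},
    {"authors": ["Author C", "Author D"]},
]

G = nx.Graph()
for rec in records:
    for u, v in combinations(rec["authors"], 2):
        w = G[u][v]["weight"] + 1 if G.has_edge(u, v) else 1
        G.add_edge(u, v, weight=w)  # edge weight = number of joint papers

ranks = nx.pagerank(G, alpha=0.85, weight="weight")
for author, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(author, round(score, 3))
```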