Delayed Recognition: A Co-Citation Perspective

A Sleeping Beauty is a publication that is apparently unrecognized by citation for some period of time before experiencing a burst of recognition. Various reasons, including resistance to new ideas, have been attributed to such delayed recognition. We study this phenomenon in the special case of co-citations, which represent new ideas generated through the combination of existing ones. Using relatively stringent selection criteria derived from the work of others, we analyze a very large dataset of over 940 million unique co-cited article pairs, and identify 1,196 cases of delayed co-citations. We further classify these 1,196 cases with respect to amplitude, rate of citation, and disciplinary origin.


Introduction
The term 'Sleeping Beauty' has been used to describe an article that is not well cited in the early years after its publication but experiences a sharp increase in the rate at which it is subsequently cited (van Raan, 2004). An implication is that the new concept presented in such an article is 'ahead of its time' and resistance to it delays recognition. Other postulated causes of resistance and delayed recognition include (i) information overload from the large amount of information available, (ii) modest communication skills of authors, (iii) insufficient promotion of ideas, (iv) conflict with existing theory and experimental data, (v) the author's position in the social hierarchy of science, (vi) multiple discovery, (vii) the management structures of scientific institutions, and (viii) the conservative nature of establishments (Barber, 1961; Merton, 1963; Cole, 1970; Garfield, 1970, 1980). The Sleeping Beauty phenomenon, and variants of it, have been studied and debated, with some degree of agreement that a fraction of the scientific literature exhibits citation kinetics that suggest delayed but eventual recognition of new ideas (Glänzel et al., 2003; Glänzel & Garfield, 2004; van Raan, 2004; Redner, 2005; Braun et al., 2010; Li, 2014; Ke et al., 2015; Li & Ye, 2016; Song et al., 2018; Sugimoto & Mostafa, 2018; Ye & Bornmann, 2018; van Raan & Winnink, 2019).
Various approaches have been used to identify Sleeping Beauties and their variants. These include depth of sleep, length of sleep, and awake intensity as variables (van Raan, 2004); the Gini coefficient to examine later years of citation history (Li et al., 2014); a parameter-free beauty coefficient (Ke et al., 2015); positional measures (Costas et al., 2010); and the citation angle (Ye & Bornmann, 2018). While earlier studies examined small datasets, subsequent ones considered large samples of the literature, for example, 22 million publications in Ke et al. (2015).
The research cited above has focused on single publications. However, new ideas also result from combining two previously independent ones. The recognition of such novelty through combination can be examined by co-citation analysis (Marshakova-Shaikevich, 1973; Small, 1973; Uzzi et al., 2013; Boyack & Klavans, 2014; Wang et al., 2017; Bradley et al., 2020). Tracing co-citations, therefore, provides another lens with which to study delayed recognition. A precedent is the somewhat related use of co-citation analysis by Zong et al. (2018) and Teixeira et al. (2017), who sought to identify the so-called 'princes' that awaken Sleeping Beauties.
The measurement of delayed recognition by co-citation has been briefly explored by Devarakonda et al. (2020) in a study of 33.6 million reference pairs. The authors used simplified criteria derived from prior Sleeping Beauty studies on single publications (Ke et al., 2015; van Raan, 2004; van Raan & Winnink, 2019), reported 24 co-cited pairs all in the 99th percentile of co-citation frequencies, and proposed the term delayed co-citations for such cases. This initial exploration, albeit at scale, only considered reference pairs where each member of a pair was in the 99th percentile of highly cited articles in Scopus. In this article, we extend the work on delayed co-citation to a much larger dataset of approximately 940 million pairs of articles. We refine the criteria in Devarakonda et al. (2020) and identify co-cited article pairs that exhibit delayed recognition using modifications of the techniques of van Raan (2004), van Raan & Winnink (2019), and Ke et al. (2015). We also ask whether delayed co-citations are derived from Sleeping Beauty publications.

MATERIALS AND METHODS
We have previously described a dataset of 33.6 million co-cited pairs, each belonging to the top 1% of cited articles in the Scopus bibliography (Devarakonda et al., 2020, Figure 2). In the present study, we include all co-cited pairs from references cited by articles published in Scopus in the 11-year period 1985-1995, not only those drawn from the top 1% of cited articles. We developed methods to manage the expected volume of data using a combination of SQL, Cypher, and Python. Our code for parsing and updating Scopus XML data, a PostgreSQL schema for Scopus data, and the SQL, Cypher, and Python scripts used in this study are freely available from a GitHub repository (Korobskiy et al., 2019).
To assemble and analyze a working dataset, we first exported 95,524,693 publication records from Scopus (all citation types) as a citation graph consisting of an edgelist and a nodelist, and imported these data into a graph database (Neo4j), treating publications as nodes and citations as edges. After creating indexes to improve performance, we selected all publications of citation type 'article' published in the years 1985-1995 (inclusive of both) that had at least five cited references each. In counting references, we only considered references with complete Scopus records. Incomplete references and those with cryptic placeholder identifiers were removed from the dataset. We also filtered out rare cases in the data where a publication cites itself, or where the publication date of a cited reference was missing or later than the publication date of its citing article. Selection of publications with at least five references was performed after curating references.
After an initial comparison of SQL and Cypher, we chose, on the basis of simplicity and performance, to use Cypher queries in Neo4j to generate all pairwise $\binom{n}{2}$ combinations of an article's cited references. We de-duplicated these pairs across all articles to assemble a dataset of ∼940 million pairs (940,357,633 pairs), roughly 28 times larger than the dataset in Devarakonda et al. (2020). We then calculated the frequency of co-cited pairs by dividing the data and processing batches in parallel using Neo4j and the GNU Parallel utility. After tuning experiments on a test set of 1 million pairs using Neo4j 4.0 on a CentOS 7.5 virtual machine with 128 GB of RAM and 16 vCPUs in the Microsoft Azure environment, we set the batch size to 1,000 pairs and the degree of parallelization to 15 cores. Under these conditions, it took roughly 11 min to compute co-citation frequencies for a batch of 1,000 pairs. We divided these 940 million pairs into 9 subsets of around 100 million pairs each and processed them at the rate of approximately 19 hours per subset.
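The pair-generation and counting steps described above can be illustrated with a minimal in-memory Python sketch. It is not the Cypher and GNU Parallel pipeline used in the study; the function name `cocitation_frequencies` and the toy identifiers (`r1`, `r2`, and so on) are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def cocitation_frequencies(reference_lists):
    """Count, for each unordered pair of references, how many citing
    articles cite both members of the pair."""
    counts = Counter()
    for refs in reference_lists:
        # Drop repeated references; sorting gives each pair one canonical order,
        # so (r1, r2) and (r2, r1) are counted as the same pair.
        unique_refs = sorted(set(refs))
        # An article with n references contributes n-choose-2 pairs.
        counts.update(combinations(unique_refs, 2))
    return counts


# Toy citing articles (hypothetical identifiers), each with its cited references.
citing = {
    "A": ["r1", "r2", "r3"],
    "B": ["r2", "r1"],
    "C": ["r3", "r4"],
}
freq = cocitation_frequencies(citing.values())
print(freq[("r1", "r2")], len(freq))
```

The keys of the counter correspond to the de-duplicated set of pairs, and the values to their co-citation frequencies. At the scale of 940 million pairs, an in-memory counter of this kind would not be practical, which is consistent with the batched, parallel approach described above.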
By way of illustration, the simple Cypher query for calculating co-citation frequencies of pairs in Neo4j is shown in the listing that follows the Conflict of Interest Statement. The input to the query is a csv file containing two columns of article identifiers, with each row representing a co-cited pair. Frequencies thus calculated were loaded back into PostgreSQL. For kinetic analysis, we selected all pairs with a co-citation frequency >= 100 and calculated the kinetics of citation accumulation from the first possible year of co-citation for each pair through the year 2018, again in Neo4j. Finally, for continuity, we set the frequency to zero for every year between the first possible year of co-citation and the last co-cited year (2018) for which a frequency count was missing. Minor differences between our data and those in Devarakonda et al. (2020) are due to more current data in Scopus in our study, and to computing kinetic data through 2018 in this study. We compared small samples between the two datasets and confirmed that these minor differences in co-citation frequencies could be bridged by including citations from publications in 2019 and later.
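The zero-filling step for years without recorded co-citations can be sketched as follows. This is an illustrative Python fragment under the assumption that annual counts are held in a dictionary keyed by year; the function name `fill_missing_years` is hypothetical.

```python
def fill_missing_years(yearly_counts, first_year, last_year=2018):
    """Return a complete year -> frequency mapping from first_year through
    last_year, using zero for any year without a recorded co-citation."""
    return {
        year: yearly_counts.get(year, 0)
        for year in range(first_year, last_year + 1)
    }


# Toy example: a pair first co-citable in 2010, with counts for only two years.
sparse = {2012: 3, 2015: 7}
filled = fill_missing_years(sparse, first_year=2010, last_year=2016)
print(filled)
```

A continuous series of this form is what the kinetic criteria (sleeping period, awakening year, peak year) operate on.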
After generating a dataset of 940 million pairs, we applied three relatively conservative conditions to identify co-cited pairs of interest: (i) a peak (annual) co-citation frequency for a pair of at least 20, (ii) a total co-citation frequency of at least 100, and (iii) a requirement that both members of a co-cited pair be published no earlier than 1970. We then identified delayed co-citation cases by setting two more conditions: (i) a minimum sleeping duration of 10 years as measured from the first possible year of co-citation (the more recent publication year of the two articles), and (ii) during this sleeping period of 10 years or more, an average co-citation frequency of at most 1, with no more than 2 co-citations in any one year.
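The selection conditions above can be expressed as a short predicate over a zero-filled annual series. The following Python sketch is one possible reading of the criteria, not the code used in the study. It assumes that the series starts at the first possible year of co-citation and that the sleeping period ends at the first year with more than 2 co-citations. The function names and the example series are hypothetical.

```python
def sleep_length(series, max_per_year=2):
    """Number of leading years in which the pair receives at most
    max_per_year co-citations (the assumed sleeping period)."""
    for i, count in enumerate(series):
        if count > max_per_year:
            return i
    return len(series)


def is_delayed_cocitation(series, pub_years, min_peak=20, min_total=100,
                          min_sleep=10, max_sleep_mean=1.0, earliest_pub=1970):
    """Apply the five conditions to annual co-citation counts that start at
    the first possible year of co-citation."""
    if min(pub_years) < earliest_pub:
        return False                      # both members published 1970 or later
    if max(series, default=0) < min_peak or sum(series) < min_total:
        return False                      # peak >= 20 and total >= 100
    n_sleep = sleep_length(series)
    if n_sleep < min_sleep:
        return False                      # sleeping period of at least 10 years
    # Average co-citation frequency during sleep of at most 1.
    return sum(series[:n_sleep]) / n_sleep <= max_sleep_mean


# Hypothetical pair: ten quiet years followed by a burst of co-citations.
example = [0, 1, 0, 2, 1, 0, 0, 1, 2, 0, 5, 22, 30, 25, 20]
print(is_delayed_cocitation(example, pub_years=(1985, 1987)))
```

Exposing the thresholds as keyword arguments makes it straightforward to examine how relaxing any one condition changes the number of cases identified.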
We also calculated the slope between the co-citation frequency of the awakening year and the peak frequency and modified the Beauty Coefficient (Ke et al., 2015; Devarakonda et al., 2020), which was designed to measure kinetics in single publications, to be relevant to co-citations by treating the first possible year of co-citation equivalently to the year of publication for a single article (Devarakonda et al., 2020).
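For reference, the Beauty Coefficient of Ke et al. (2015) sums, over the years from the start of the series to the year of peak frequency, the gap between a straight line joining the first and peak values and the observed count, each term divided by max(1, observed count). A minimal Python sketch follows, assuming the series starts at the first possible year of co-citation as described above and that the first occurrence of the maximum is taken as the peak year. The function name is hypothetical.

```python
def beauty_coefficient(series):
    """Beauty Coefficient (Ke et al., 2015) for annual counts c_0 .. c_T.

    B = sum over t = 0..t_m of ((c_tm - c_0) / t_m * t + c_0 - c_t) / max(1, c_t),
    where t_m is the index of the year with the maximum count.
    """
    if not series:
        return 0.0
    # Index of the first year in which the maximum count is reached.
    t_m = max(range(len(series)), key=series.__getitem__)
    if t_m == 0:
        return 0.0  # peak in the first year: B is defined as zero
    c_0, c_m = series[0], series[t_m]
    line_slope = (c_m - c_0) / t_m
    return sum(
        (line_slope * t + c_0 - series[t]) / max(1, series[t])
        for t in range(t_m + 1)
    )


print(beauty_coefficient([0, 0, 0, 0, 10]))   # long sleep, then a peak
print(beauty_coefficient([0, 1, 2, 3, 4]))    # linear growth
```

A series that grows linearly to its peak yields a value of zero, while a long quiet period followed by a late peak yields a large positive value.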
To identify single Sleeping Beauty publications, we narrowed the criteria of van Raan & Winnink (2019) to consider only a single sleeping period of 10 years or greater; a depth of sleep (average citation rate during sleep) of at most 1; an awakening period of 5 years; and an average citation frequency during the awakening period (which is defined as awakening citation intensity by van Raan) of at least 5. We also calculated the Beauty Coefficient (Ke et al., 2015) for all single publications for comparison.

Figure 1: … (1985-1995) were generated as described in Materials and Methods. Total co-citation frequencies for these pairs ranged from 1 to 52,471 with a median frequency of 1. The empirical cumulative distribution function (ECDF) was calculated from 940,357,633 co-citation frequencies and plotted against co-citation frequencies on a log2 scale.

… the data using a conservative threshold of 100 for total co-citation frequency along with a peak annual co-citation frequency of at least 20. These criteria are analogous to those proposed by van Raan (2004) and Redner (2005). After applying these two further restrictions, the number of co-cited pairs is reduced to 51,613 (approximately 0.0055% of the total number of pairs).
To find cases of delayed co-citation, we applied the following conditions to these 51,613 pairs: (i) a co-cited pair should have slept for at least 10 years and received no more than 2 co-citations in each year during this sleeping period, which is defined as the number of years from the first possible co-cited year to the first year that the pair receives more than 2 co-citations. To be considered a Sleeping Beauty, the awakening period that follows the sleeping period is characterized by (ii) a peak annual co-citation frequency of at least 20. These criteria, when collectively applied, identified 1,196 cases of delayed co-citation, whose characteristics are summarized in Table 2.
Interestingly, these 1,196 pairs are derived from only 1,267 of a possible 2,392 individual publications, indicating that some members of frequently co-cited pairs are found in multiple pairs. Indeed, we have previously noted that a pair of articles concerning methods in biochemistry contributes to over 40,000 different co-cited pairs of frequency >= 10 (Devarakonda et al., 2020).
A logical question is whether any of these 1,267 individual publications would be classified as Sleeping Beauties. Applying van Raan's criteria (Materials and Methods), we identify 128 of these 1,267 publications as Sleeping Beauties. Interestingly, 27 of the 1,196 delayed co-citation pairs were cases where both members were Sleeping Beauties. Of these, the 1978 article by Rassias titled 'On the stability of the linear mapping in Banach spaces' was a member of four different pairs. Thus, delayed recognition can occur without a requirement that at least one member of a co-cited pair with delayed recognition should have Sleeping Beauty characteristics. These observations also suggest that while high-referencing fields such as biology (Small & Greenlee, 1980) might be advantaged by our selection criteria, the thresholds we set do not entirely exclude other fields. Accordingly, continuing this work with field normalization of co-citation frequencies, to the extent possible, is warranted. In contrast to co-citation frequencies for delayed co-citations (Fig. 2), which range from 20 to 260, citation counts for the 1,267 publications that contribute to these 1,196 delayed co-citations range from 121 to 190,832, with 72 of these publications having citation counts of greater than 10,000.
However, other co-citation frequencies do exceed the seemingly modest frequencies noted for delayed co-citations.For example, Becke (1993) and Lee et al. (1988), a pair of articles from the field of physical chemistry, have been co-cited over 51,000 times but do not exhibit delayed citation kinetics.It should also be noted that these articles have individually been cited over 70,000 times each.Similarly, 1,357 pairs from the data shown in Fig 1 have co-citation frequencies greater than 1,000.
We observe (Fig 1) that the 90th, 95th, and 99th percentiles of co-citation frequencies in our dataset are 4, 6, and 16 respectively. In comparison, the 90th, 95th, and 99th percentiles of citation frequencies of ∼10.7 million publications of type 'article' in Scopus, published in the years 1970-1995, are 58, 96, and 254 respectively (roughly tenfold greater). What emerges is that delayed co-citations tend to have frequency profiles that are lower than those of other co-cited pairs and of single publications. This is not unexpected, since co-citation frequencies cannot exceed the citation frequencies of the publications in these pairs, but it does suggest that seemingly low co-citation frequencies should not be overlooked.
To examine rates of awakening, we also calculated the slope between the co-citation frequency in the first awakening year and the frequency of the peak year and noted a fairly broad range of slopes with a mean of 2.4 (Table 2). The kinetics of co-citation are visualized in Fig 2 for three examples with the maximum slope, the mean slope, and the minimum slope observed.
Of 1,196 delayed co-citations, the slope could not be computed for 10 pairs because the peak year was the year of awakening. This small number of cases suggests sudden recognition of the concepts represented by these pairs (Table 3). These 10 pairs span the areas of LED technology, cosmology, immunology, psychology, and computational science. One publication from 1985 titled 'An exotic class of Kaluza-Klein models' appears in 3 of the 10 pairs, and the author himself refers, in 1999, to 'renewed interest due to the explosion of activity in the non compact extra dimensions variant of the Kaluza Klein model' (Visser, 1999).
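The slope calculation, including the case where it is undefined, can be sketched in Python as follows. This is an illustration under the assumption that the awakening year is the first year with more than 2 co-citations and that the first occurrence of the maximum is the peak year. The function name `awakening_slope` and the example series are hypothetical.

```python
def awakening_slope(series, sleep_threshold=2):
    """Slope of the line from the awakening year to the peak year, in
    co-citations per year, or None when it cannot be computed."""
    # Awakening year: first year whose count exceeds the sleep threshold.
    awake = next((i for i, c in enumerate(series) if c > sleep_threshold), None)
    if awake is None:
        return None  # the pair never awakens
    # Peak year: first year in which the maximum count is reached.
    peak = max(range(len(series)), key=series.__getitem__)
    if peak == awake:
        return None  # peak coincides with awakening; slope is undefined (NA)
    return (series[peak] - series[awake]) / (peak - awake)


print(awakening_slope([0, 0, 1, 5, 10, 22]))  # gradual rise to the peak
print(awakening_slope([0, 0, 22, 3]))         # peak in the awakening year
```

Returning `None` rather than a number for the second case mirrors the NA value reported for the 10 pairs whose peak year is the year of awakening.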
We also examined lesser co-citation frequencies, between 20 and 100, and observed 5,928,815 pairs. After removing pairs (i) with fewer than 10 years of kinetic data (the difference between publication year and peak year is less than 10 years), (ii) with a negative Beauty Coefficient, which describes articles whose citations grow linearly with time or whose citation trajectory is a concave function of time, or (iii) without at least one peak of frequency 20, the number was reduced to 13,057 pairs. Of these, 12,920 had only a single peak of 20 or greater, which may be similar to 'flash in the pan' citations (Li, 2013; Ye & Bornmann, 2018). Given our focus on frequently co-cited pairs, we did not study these further.
An appealing alternative approach for delayed co-citations and Sleeping Beauties is the Beauty Coefficient. We have previously modified (Devarakonda et al., 2020) the Beauty Coefficient (Ke et al., 2015), designed to measure kinetics in single publications, to be useful in the case of co-cited pairs. We computed the Beauty Coefficient for these 1,196 pairs, observing a range of 34.21 to 1678.62. These data are summarized in Table 2. Given that co-citation frequencies are generally lower than citation frequencies, the top 15 Beauty Coefficient values of the 1,196 delayed co-citations, which range from 712.47 to 1678.62, appear comparable to the top 15 described by Ke, all above 2,000.
Ke and colleagues comment that parameterized approaches in preceding studies have suffered from being somewhat arbitrary.The comment is fair, but arbitrariness may not have impeded discovery, for example Redner's work on the physics literature (Redner, 2005) with its selection threshold of 250 citations.Further, while the Beauty Coefficient is parameter free, the choice of selection threshold is left to the user leaving the door open for arbitrary selection thresholds.We consider this a strength of the measure since it can be used in contextual studies.The approach of van Raan is also intuitive and flexible but does not consider the maximum number of citations received as an important parameter to be tuned.The cases with a sleeping period of ten years, and a citation rate of 5 for the next 5 years, would satisfy requirements for a Sleeping Beauty but are perhaps less noteworthy.
Finally, to ask which fields these 1,196 delayed co-citations are found in, we mapped them to the All Science Journal Classification (ASJC) maintained by Scopus, which consists of 27 major subject area categories. The data are represented in Figure 3 but should be interpreted in the light of these subject area labels being derived from journals and of the fact that an article may have more than one label. Even so, the data suggest that delayed co-citations, as we define them in our dataset, are largely drawn from the domain of biochemistry, genetics, and molecular biology, followed by physics, computer science, chemistry, and engineering.

CONCLUSION
In a large-scale exploration of the kinetics of co-citation (more than 940 million unique article pairs), we have identified 1,196 cases of delayed co-citation using criteria largely derived from the work of van Raan and Ke. We acknowledge that our selection criteria, while guided by positional statistics and intuitive preference, suffer from some degree of arbitrariness. As with all bibliometric data, coverage and data quality also influence discovery. Thus, we have tried to identify co-cited pairs of higher frequency since the trends in such cases are more likely to be reproducible across other data sources. Relaxing these conditions will identify additional cases. Our goal was to identify a set of delayed co-cited pairs that can be studied, in the longer term, to understand the reasons for the patterns of citation. This future task will require a greater understanding of the fields in which such delayed co-citations occurred and ideally should be coupled to qualitative techniques.

Conflict of Interest Statement
Data used in this study derive from the ERNIE project, which involves a collaboration with Elsevier. The content of this publication is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or Elsevier. Elsevier staff did not have a role in design, manuscript-writing, or review and interpretation of results.
UNWIND $input_data AS row
MATCH (a:Publication {node_id: row.cited_1})<--(p)-->(b:Publication {node_id: row.cited_2})
RETURN row.cited_1 AS cited_1, row.cited_2 AS cited_2, count(p) AS scopus_frequency;

Figure 2: Kinetics of Co-citation Frequencies for Delayed Co-citations. Three sample plots are shown from 1,196 delayed co-citations, selected for maximum slope (left panel), mean slope (middle panel), and minimum slope (right panel) of a line connecting the co-citation frequency of the awakening year to the co-citation frequency of the peak year. Total co-citation frequencies for these three plots were 131, 174, and 254, with peaks of 22, 22, and 23, and slopes of NA, 2.38, and 0.21 respectively. The red triangle marks the awakening year and the dotted blue line, the slope. The slope in the left panel is NA since the peak year is the awakening year. The article pairs shown above are (i) Spacetime as a membrane in higher dimensions (Gibbons 1987) & An exotic class of Kaluza-Klein models (Visser 1985), (ii) Formulation of the reaction coordinate (Fukui 1970) & Ab initio effective core potentials for molecular calculations. Potentials for main group elements Na to Bi (Wadt & Hay 1985), (iii) A proposed grading system for arteriovenous malformations (Spetzler 1986) & Arteriovenous malformations of the brain: Natural history in unoperated patients (Crawford et al. 1986).

Table 3: Co-cited pairs with peak frequency in the first year of … (caption and table truncated)
… mental analog of an external rotation
1974  Biologic and clinical significance of cryoglobulins. A report of 86 cases
1980  Mixed cryoglobulinemia: Clinical aspects and long-term follow-up of 40 patients
1977  Imitation of Facial and Manual Gestures by Human Neonates
1979  Matching behavior in the young infant.
1978  Cognitive determinants of fixation location during picture viewing
1979  Framing pictures: The role of knowledge in automatized encoding and memory for gist
1983  Parst: A system of fortran routines for calculating molecular structure parameters (truncated)

Figure 3: Disciplinary composition of 1,196 Delayed Co-citations. Each node represents a major subject area in the Scopus ASJC classification. Node size is scaled to the number of articles in a given subject area. Edge thickness indicates the number of pairs that have one member in each of the two nodes connected by the edge. Major subject areas are abbreviated in the graphic: MTH (Mathematics); IMM (Immunology and Microbiology); HP (Health Professions); GEN (General); ENS (Environmental Science); ENG (Engineering); EPS (Earth & Planetary Sciences); DCS (Decision Sciences); MAT (Material Sciences); CEN (Chemical Engineering); PSY (Psychology); PHY (Physics and Astronomy); NEU (Neuroscience); CS (Computer Science); A&H (Arts and Humanities); SS (Social Sciences); MED (Medicine); EGY (Energy); CHE (Chemistry); ABS (Agricultural & Biological Sciences); BGMB (Biochemistry, Genetics & Molecular Biology); BMA (Business, Management, and Accounting); EEF (Economics, Econometrics and Finance); PTP (Pharmacology, Toxicology & Pharmaceutics).

Table 1: Distribution of 940 million Co-citation Frequencies. The count of co-cited pairs in each frequency class as well as the percentage relative to the total number of 940,357,633 is shown. Counts include the lower bound in each class and exclude the upper bound.

… the sleeping period is characterized by (ii) a peak annual co-citation frequency of at least 20. These criteria, when collectively applied, identified 1,196 cases of delayed co-citation, whose characteristics are summarized in Table 2.

Table 2: Summary Statistics of 1,196 Delayed Co-citation Pairs. Criteria for selection were a minimum sleeping period of 10 years and a minimum peak of 20 citations in any year.