NLP4NLP+5: The Deep (R)evolution in Speech and Language Processing

This paper aims at analyzing the changes in the fields of speech and natural language processing over the recent past 5 years (2016–2020). It continues a series of two papers that we published in 2019 analyzing the NLP4NLP corpus, which contained articles published in 34 major conferences and journals in the field of speech and natural language processing over a period of 50 years (1965–2015), analyzed with the methods developed in the field of NLP, hence its name. The extended NLP4NLP+5 corpus now covers 55 years, comprising close to 90,000 documents [+30% compared with NLP4NLP: as many articles were published in the single year 2020 as over the first 25 years (1965–1989)], 67,000 authors (+40%), 590,000 references (+80%), and approximately 380 million words (+40%). These analyses are conducted globally or comparatively among sources, and also against the general scientific literature, with a focus on the past 5 years. The paper concludes by identifying profound changes in research topics, as well as the emergence of a new generation of authors and the appearance of new publications, around artificial intelligence, neural networks, machine learning, and word embedding.


INTRODUCTION Preliminary Remarks
The global aim of this series of studies was to investigate the speech and natural language processing (SNLP) research area through the related scientific publications, using a set of NLP tools, in harmony with the growing interest in scientometrics in SNLP (refer to Banchs, 2012; Jurafsky, 2016; Atanassova et al., 2019; Goh and Lepage, 2019; Mohammad, 2020a,b,c; Wang et al., 2020; Sharma et al., 2021, and many more) or in various domains such as economics (Muñoz-Céspedes et al., 2021), finance (Daudert and Ahmadi, 2019), or disinformation (Monogarova et al., 2021).
The first results of these studies were presented in two companion papers, published in the first special issue "Mining Scientific Papers Volume I: NLP-enhanced Bibliometrics" of the Frontiers in Research Metrics and Analytics journal; one on production, collaboration, and citation: "The NLP4NLP Corpus (I): 50 Years of Publication, Collaboration and Citation in Speech and Language Processing" (Mariani et al., 2019a) and a second one on the evolution of research topics over time, innovation, use of language resources and reuse of papers and plagiarism within and across publications: "The NLP4NLP Corpus (II): 50 Years of Research in Speech and Language Processing" (Mariani et al., 2019b).
We now extend this corpus by considering the articles published in the same 34 sources over the past 5 years (2016–2020). During this period, we observed an increasing interest in machine-learning approaches for processing speech and natural language, and we wanted to examine how this was reflected in the scientific literature. We therefore analyze these augmented data to identify the changes that happened during this period, both in terms of scientific topics and in terms of research community, reporting the results of this new study in a single article covering papers and authors' production and citation within these sources, submitted to the second special issue "Mining Scientific Papers Volume II: Knowledge Discovery and Data Exploitation" of the Frontiers in Research Metrics and Analytics journal. We invite the reader to refer to the previous extensive articles for more insight into the data and methods used. In addition, we study here the more than 1 million references in total, to measure the possible influence of neighboring disciplines outside the NLP4NLP sources.

The NLP4NLP Speech and Natural Language Processing Corpus
The NLP4NLP corpus 1 (Mariani et al., 2019a) contained papers from 34 conferences and journals on natural language processing (NLP) and spoken language processing (SLP) (Table 1) published over 50 years (1965–2015), gathering about 68,000 articles and 270 million words from about 50,000 different authors, and about 325,000 references. Although it represents a good picture of the international research investigations of the SNLP community, many papers related to this field, including important seminal papers, may have been published elsewhere. Given the uncertainty of the existence of a proper review process, we included neither the content of workshops nor publications such as arXiv 2, a popular non-peer-reviewed, free distribution service and open-access archive created in 1991 and now maintained at Cornell University. It should be noted that conferences may be held annually, may appear every 2 years (on even or odd years), and may also be organized jointly in the same year.

The NLP4NLP+5 Speech and Natural Language Processing Corpus
The NLP4NLP+5 corpus covers the same 34 publications up to 2020, hence 5 more years (2016–2020), which represents an addition in time of 10%. We preferred not to add new sources, to facilitate the comparison between the situations in 2015 and 2020. However, we added in the present paper a Section "Analysis of the Citation in NLP4NLP Papers of Sources From the Scientific Literature Outside NLP4NLP" on the study of references to papers published in sources other than those of NLP4NLP. This new corpus also includes some articles published in 2015, which were not yet available at the time we produced the first NLP4NLP corpus. Some publications may no longer exist in this extended period (Table 1).
To study the possible differences across communities, we considered two different research areas, speech processing and natural language processing, and we attached the sources to each of those areas (Table 2), given that some sources (e.g., LREC, LRE, L&TC, MTS) may be attached to both research domains. We see that the number of documents related to speech is larger than that related to NLP. We only considered the papers related to speech processing (named ICASSPS) in the IEEE ICASSP conference, which also includes a large number of papers on acoustics and signal processing in general.
The number of conference or journal events 3 may vary largely, from 3 for Tipster to 46 for the Institute of Electrical and Electronics Engineers (IEEE)/Association for Computing Machinery (ACM) TASLP, and the time span is also different, from 5 years for Tipster to 55 years for COLING. The number of papers in each source varies largely, from 82 papers for the ACM TSLP to 22,778 papers for the ISCA conference series.

Production of Papers Over the Years
A total number of 88,752 documents have been published in the 34 sources over the years. If we do not separate the papers that were published at joint conferences, this reduces to 85,138 papers (Table 1), with a steady increase over time from 24 papers in 1965 to 5,430 in 2020 (Figure 1). This number fluctuates over the years, mainly due to the biennial frequency of some important conferences (especially LREC and COLING on even-numbered years). The largest number of papers ever was published in 2020 (5,430 papers), comparable in a single year to the total number of papers (5,725 papers) published over the 25 initial years (1965–1989)! The total number of papers itself still increases steadily, at a rate which now stabilizes at about 6% per year (Figure 2), reaching 85,138 different documents as of 2020 (Figure 3).

Data and Tools
Most of the data are freely available online on the Association for Computational Linguistics (ACL) anthology website, and others are freely available in the International Speech Communication Association (ISCA) and Text Retrieval Conference (TREC) archives. IEEE International Conference on Acoustics, Speech and Signal Processing - Speech Track (ICASSP) and Transactions on Audio, Speech, and Language Processing (TASLP) articles have been obtained through the IEEE, and Language Resources and Evaluation (LRE) articles through Springer. For this study, we only considered the papers written in English and French. Most of the documents were available as textual data in PDF, whereas the oldest ones were only available as scanned images and had to be OCRized, which resulted in a lower quality. The study of the authors and of the papers, as well as of the papers cited in the references, is problematic due to variations in the same name (family name and given name, initials, middle initials, ordering, married name, etc.) and required a very tedious semiautomatic cleaning process (Mariani et al., 2014); the same holds for the sources cited in the references. After a preprocessing phase, the metadata and contents are processed by higher-level NLP tools, including a series of Java programs that we developed (Francopoulo et al., 2015a,b, 2016a).

NLP
The number of different authors in a given year also globally increased over time (Figure 5), with an exceptional increase in the past 5 years (from 6,562 in 2015 to 13,299 in 2020). The number of new authors from one conference to the next similarly increased over time, as well as the number of completely new authors, who had never published at any previous conference or journal issue. The largest number of completely new authors was in 2020 (5,778 authors), comparable in a single year to the total number of different authors (5,688) who published over the 25 initial years (1965–1989)!
The percentage of new authors (Figure 6), which decreased from 100% in 1966 to 55% in 2011, has since increased to reach 65% in 2020, while the percentage of completely new authors, which decreased from 100% in 1966 to about 32% in 2011, has since increased to reach 43% in 2020. This may reflect the arrival of "new blood" in the field, as will be reflected in the next sections on the analysis of authors' production, collaboration, and citation, and the fact that researchers who started their careers in their 20s in 1965, the first year considered in our study, are now gradually retiring in their 70s.
If we compare sources, the percentage of completely new authors at the most recent event of conferences or journals within the past 5 years (Figure 7) ranges from 39% for TALN or 43% for the JEP to 87% for RANLP or 81% for EACL, while the largest conferences show relatively low percentages (48% for ISCA, 51% for IEEE ICASSPS, 55% for ACL, and 56% for EMNLP). Compared with 2015, we notice a global increase in the percentage of completely new authors, especially in conferences and journals related to NLP.
We defined the author variety as the ratio of the number of different authors to the number of authorships 4 at each conference. This ratio would be 100% if each author's name appeared in only one paper. Author redundancy corresponds to 100% minus the author variety. Author redundancy increased over time and has now stabilized at about 40% since 2008 (Figure 8).
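As a small illustration, the two measures can be sketched as follows (a minimal sketch on toy data; the author names are hypothetical):

```python
def author_variety(papers):
    """papers: one list of author names per paper at a conference event.
    Variety = distinct authors / authorships; redundancy = 1 - variety."""
    authorships = [name for authors in papers for name in authors]
    variety = len(set(authorships)) / len(authorships)
    return variety, 1.0 - variety

# Toy event: three papers, four authorships, three distinct authors,
# so variety is 75% and redundancy 25%.
event = [["A. Smith", "B. Jones"], ["A. Smith"], ["C. Lee"]]
variety, redundancy = author_variety(event)
```

A redundancy of 40% thus means that, on average, each authorship slot is 40% likely to repeat a name already seen at the same event.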
Author redundancy is large in conferences such as ISCA or ICASSP, whereas it is lower in journals and slightly increased globally since 2015 (Figure 9).

Papers and Authorship
The number of authorships increased from 32 in 1965 to 22,610 in 2020, at an even higher pace than the number of papers (Figure 10).

Authors' Gender
The author gender study is performed with the help of a lexicon of given names with gender information (male, female, epicene 5). As already noted, variations due to different cultural habits for naming people (single vs. multiple given names, family vs. clan names, inclusion of honorific particles, ordering of the components, etc.) (Fu et al., 2010), changes in editorial practices, and sharing of the same name by large groups of individuals make identification by name a real issue (Vogel and Jurafsky, 2012). In some cases, we only have an initial for the first name, which makes gender guessing impossible unless the same person appears with his/her first name in full in another publication. Although the result of the automatic processing was hand-checked by an expert of the domain for the most frequent names, the results presented here should therefore be considered with caution, allowing for an error margin. A total of 46% of the authors are male, whereas 14% are female, 4% are epicene, and 36% are of unknown gender. Considering the paper authorships, which take into account the authors' productivity, and assuming that the authors of unknown gender have the same gender distribution as those who are categorized, male authors account in 2020 for 80% (compared to 83% in 2015) and female authors for 20% (compared to 17% in 2015) of the authorships (Figure 11), hence a slight improvement. IEEE TASLP and ICASSPS have, in 2020 just as in 2015, the most unbalanced situation (respectively, 10 and 13% female authors), whereas the French conferences (JEP, TALN) and journals (TAL), together with LRE and LREC, have a better balanced one (from 30 to 41% female authors). The largest increase over the past 5 years (+4%) appears for JEP and TACL (Figure 12).
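The authorship shares above rest on the stated assumption that unknown-gender authors follow the distribution of the categorized ones; the renormalization amounts to the following (a sketch with hypothetical counts, not the paper's actual data):

```python
def estimate_gender_shares(male, female, unknown):
    """Redistribute unknown-gender authorships proportionally to the
    categorized ones, then return the estimated (male, female) shares."""
    known = male + female
    est_male = male + unknown * male / known
    est_female = female + unknown * female / known
    total = est_male + est_female
    return est_male / total, est_female / total

# Hypothetical counts: redistributing the unknowns proportionally
# leaves the known 80/20 ratio unchanged.
m_share, f_share = estimate_gender_shares(800, 200, 500)
```

Note that the proportional redistribution leaves the ratio among categorized authorships unchanged; the assumption matters only if unknown-gender authors are in fact distributed differently.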

Authors' Production and Co-production
The most productive author published 453 papers, whereas 36,791 authors (55% of the 66,995 authors) published only one paper. Table 3 gives the list of the 12 most productive authors. We see that the eight most productive authors are the same as in 2015, with a slightly different ranking. Two newcomers, specialized in machine learning (ML), are ranked 9th and 10th: James R. Glass (unsupervised ML) and Yang Liu (federated ML). Some authors (James Glass, Yang Liu, Haizhou Li, Satoshi Nakamura, and Shri Narayanan) increased their number of papers by 30% or more within the past 5 years! But if we focus on the past 5 years (2016–2020) (Table 4), we notice that only two authors (Shrikanth S. Narayanan and Haizhou Li) 6 still appear in that ranking. Others energetically contributed to the research effort on speech and language processing with a new angle benefiting from supervised or unsupervised machine-learning approaches, some already active in the field but also many new names, showing the great vitality of a new generation of researchers, who published more than 15 papers per year in this recent 5-year period.

Authors' Collaborations
The number of authors per paper still increases, with more than 4 authors per paper on average, compared with 3.5 in 2015 (Figure 13). Table 5 gives the number of authors who published papers as single authors, and the names of those who published 10 papers or more. A total of 60,193 authors (90% of the authors) never published a paper as a single author 7. The ranking is very similar to 2015, including six newcomers (Mark Huckvale, Mark Jan Nederhof, Hagai Aronowitz, Philip Rose, Shunichi Ishihara, and Oi Yee Kwong).
The number of papers with a single author decreased from 75% in 1965 to 3% in 2020, illustrating the changes in the way research is being conducted.
Up to 2015, the paper with the largest number of co-authors was a META-Net paper published at LREC 2014 (44 co-authors). It is now surpassed by three other papers:
• A paper with 45 co-authors from Microsoft published in TACL 2020
• A paper with 47 co-authors on the European Language Technology landscape published at LREC 2020
The most collaborating author collaborated with 403 different co-authors, whereas 2,430 authors only published alone. An author collaborates on average with 7.9 other authors (compared to 6.6 in 2015), whereas 157 authors published with 100 or more different co-authors. Table 6 provides the list of the 12 authors with the highest number of co-authors. Table 7 provides the list of the 12 authors who had the largest number of collaborations, possibly with the same co-authors.
As we can see, some authors greatly increased, and even doubled, their number of co-authors and of collaborations in recent years, whereas there are seven newcomers in the ranking (Björn …). If we focus on the past 5 years (2016–2020), we see that only three authors (Haizhou Li, Shri Narayanan, and Yang Liu) are still among the 12 authors with the largest number of co-authors (Table 8), whereas we notice many new names, often of Asian origin (Yue Zhang, Dong Yu, Yu Zhang, Kong Aik Lee, Ming Zhou, and Shinji Watanabe), who constitute a new community around the use of supervised or unsupervised machine-learning approaches.
When comparing the various sources, we do not notice any meaningful changes between 2015 and 2020 regarding the diameter, density, average clustering coefficient, or connected components that were presented in our previous paper, whereas the mean degree (defined as the average number of co-authors for each author), which illustrates the degree of collaboration within a source, shows a large increase for TACL (from 4.5 to 6.9), EMNLP (from 4.2 to 5.9), and ACL (from 4.2 to 5.6) over this period (Figure 14).
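The mean degree as defined here can be computed directly from a collaboration graph stored as adjacency sets (a minimal sketch on a toy graph with hypothetical author labels):

```python
def mean_degree(adj):
    """Average number of co-authors per author in an undirected
    collaboration graph given as {author: set of co-authors}."""
    return sum(len(coauthors) for coauthors in adj.values()) / len(adj)

# Toy graph: B co-authored with A, C, and D; the others only with B.
adj = {"A": {"B"}, "B": {"A", "C", "D"}, "C": {"B"}, "D": {"B"}}
md = mean_degree(adj)  # (1 + 3 + 1 + 1) / 4 = 1.5
```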

Measures of Centrality in the Collaboration Graph
As we see in Table 9, some authors in the top 10 in terms of closeness centrality also appear in the two other types of centrality (degree centrality and betweenness centrality), possibly with a different ranking, whereas others do not. Compared with 2015, we notice "newcomers" among the 10 most "central" authors:
• Yang Liu, Alan Black, Dong Yu (closeness centrality: those who are central in a community)
• Björn Schuller, Yang Liu, Khalid Choukri, Ming Zhou, Dong Yu (degree centrality: those who collaborate most)
• Laurent Besacier, Alan Black, Sanjeev Khudanpur (betweenness centrality: those who make bridges between communities).
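For readers less familiar with these measures, degree and closeness centrality can be sketched in a few lines on a toy collaboration graph (betweenness, which counts shortest paths passing through each node, is omitted for brevity; the author labels are hypothetical):

```python
from collections import deque

def degree_centrality(adj):
    """Fraction of the other authors each author collaborates with."""
    n = len(adj)
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}

def closeness_centrality(adj):
    """(reachable nodes) / (sum of BFS distances): high for authors
    only a short hop away from everyone in their community."""
    scores = {}
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:  # breadth-first search from src
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total = sum(dist.values())
        scores[src] = (len(dist) - 1) / total if total else 0.0
    return scores

# Star-shaped toy graph: B collaborates with everyone, so B maximizes
# both degree and closeness centrality.
adj = {"A": {"B"}, "B": {"A", "C", "D"}, "C": {"B"}, "D": {"B"}}
dc = degree_centrality(adj)
cc = closeness_centrality(adj)
```

On real data, a graph library would typically be used instead; the point here is only to make the three notions of "centrality" concrete.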
If we consider the period 2016-2020, we see that only Sanjeev Khudanpur is still among the 10 most central authors, in terms of closeness centrality (Table 10).
In addition, only three authors among the 10 most "betweenness central" authors up to 2015 are still in the ranking for the 2016–2020 period (Shri Narayanan, Yang Liu, and Haizhou Li). New authors may build bridges with new scientific communities. Some authors may be absent from this 2016–2020 ranking, while increasing their global "up to 2020" ranking in this period due to the enlargement of previous communities (Table 11).

Global Analysis
We studied citations of NLP4NLP+5 papers in the 78,927 NLP4NLP+5 papers that contain a list of references. If we consider the papers that were published in joint conferences as different papers, the number of references is equal to 585,531. If we consider them as the same paper, the number of references in NLP4NLP+5 papers goes down to 535,989 and is equal to the number of citations of NLP4NLP+5 papers.
The average number of NLP4NLP+5 references in NLP4NLP+5 papers increased over time from close to 0 in 1965 to 12.7 in 2020 (was 9.7 in 2015) (Figure 15), as a result of the citing habits and of the increase in the number of published papers.
The trend concerning the average number of citations per paper over the years (Figure 16) is less clear. Obviously, the most recent papers are less cited than the older ones, with more than nine citations on average per paper for the papers of the most cited year (2003) and less than one citation on average for the papers published in 2020, given that they can only be cited by papers published in the same year. It seems that papers need on average 3 years after publication to be properly cited, and that the average number of citations per paper stabilizes at about 6–8 citations if we consider the period 1993–2018.
Among the 66,995 authors, 23,850 (36%) are never cited [or even 25,281 (38%) if we exclude self-citations]. These percentages slightly improved compared with 2015 (respectively, 42 and 44%). However, those never-cited authors may come from neighboring research domains (artificial intelligence, machine learning, medical engineering, acoustics, phonetics, general linguistics, etc.), where they may be largely cited. Among the 85,138 papers, 31,603 (37%) are never cited [or even 40,111 (47%) if we exclude self-citations by the authors of these papers], also showing a slight improvement compared with 2015 (respectively, 44 and 54%) (Table 12). Table 13 gives the list of the 20 most cited authors up to 2020, with the number of citations for each author, the number of papers written by the author, and the percentage of self-citation, with a comparison to 2015. We may notice that the seven most cited authors up to 2015 are still present in 2020, but that 50% of the authors of 2020 (mostly attached to the machine learning and word embedding research-based communities) are newcomers in this ranking. Table 14 provides the number of citations, either by themselves (self-citations) or by others (external citations), for the most productive authors that appear in Table 3. We see that only two of the 20 most productive authors (Herman Ney, Li Deng) also appear among the 20 most cited authors.

Analysis of Authors' Citations Most Cited Authors
The publishing profile is very different among authors. The authors who publish a lot are not necessarily the ones who are the most cited (from 1.75 to 15 citations per paper on average), and the role of authors varies, from main contributor to team manager, depending on their place in the authorship list and on cultural habits. Some authors are used to citing their own papers, while others are not (from 0.6 to 2.9 citations of their own papers on average).
If we now only consider the 2016–2020 period (papers published over 55 years that are cited in this 5-year period) (Table 15), we see that only one author of the 2015 ranking (Chris Manning) is still among the 20 most cited authors for this period! Some authors who published a small number of seminal papers got a huge number of citations (such as Jeffrey Pennington, for the "GloVe" paper, with two papers totaling 2,586 citations, with no self-citation!). However, as will appear in the next section, getting a high h-index necessitates both publishing a lot and having a lot of citations of these published papers.

Authors' H-Index
Despite the criticisms attached to this measure, and as it was included in our previous paper, we computed the h-index of the authors based only on the papers published in the NLP4NLP+5 corpus. Table 16 provides the list of the 20 authors with the largest h-index up to 2020, with a comparison to 2015 (based on the papers published and cited in the respective 55- and 50-year time periods). We see that Christopher Manning still has the largest h-index: he published 49 papers, which were cited at least 49 times. About 55% of the authors with the highest h-index up to 2015 are still in the top 20 authors with the highest h-index up to 2020, while 45% are newcomers (also mostly coming from the machine learning and word embedding research-based communities).
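The h-index computation itself is straightforward; a minimal sketch (the citation counts below are hypothetical, not taken from the corpus):

```python
def h_index(citation_counts):
    """Largest h such that the author has h papers cited >= h times each."""
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cited in enumerate(counts, start=1):
        if cited >= rank:
            h = rank
        else:
            break
    return h

# An author with papers cited 10, 8, 5, 4, and 3 times has h-index 4:
# four papers are cited at least 4 times each, but not five at least 5.
h = h_index([10, 8, 5, 4, 3])
```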
If we consider the h-index in the past 5 years (based on the papers published over 55 years and cited in the 2016–2020 period) (Table 17), we see that only five authors (Chris Manning, Noah Smith, Dan Klein, Daniel Jurafsky, and Mirella Lapata) with the highest h-index up to 2015 are still in the top 20 authors with the highest h-index for the 2016–2020 period! Table 18 provides the list of the 20 most cited papers up to 2020 and a comparison with 2015. Eleven (55%) of the 20 most cited papers up to 2015 are still among the 20 most cited papers up to 2020, whereas the list includes five newcomers and four papers published in or after 2015, with a special emphasis on word embedding and deep learning (GloVe and BERT). The most cited paper up to 2015 is still the most cited up to 2020 (BLEU MT evaluation measure). The most cited papers are still mostly those related to language data (Penn Treebank, Wordnet, and Europarl), evaluation metrics (BLEU), language processing tools (GloVe, BERT, Moses, SRILM), or methodology surveys (word representations, statistical alignment, statistical and neural machine translation). The largest number of

Analysis of Citations Among NLP4NLP Sources Comparison of NLP vs. Speech Processing Sources
When comparing the number of articles being cited in NLP vs. speech-oriented publications (Figure 17), we see that this number has been increasing much more markedly in the NLP ones since 2001, noting that 2020 cannot be considered for the reason given previously. This is also reflected in the ratio of NLP vs. speech articles' citations (Figure 18), given that we only had NLP sources until 1975. We then had a ratio of about 60% of NLP papers being cited from 1975 to 1989, then a balanced ratio until 2001, and since then an increasing percentage of NLP papers, which reached 75% in 2019.

Comparison of Citations for Six Major Conferences and Journals
The comparative study of the number of cumulative citations of previously published papers in six important conferences (ACL, COLING, EMNLP, ICASSPS, ISCA, and LREC) shows (Figure 19) that the number of ISCA papers being cited grows at a high rate over time, in agreement with the ISCA Board policy, which decided in 2005 to enlarge the number of pages from 6 to 7, provided that the allowed extra page consisted only of references. The same appears more recently for ACL. ICASSPS comes in third position, whereas EMNLP recently showed an important increase. We then find a group of two with COLING and LREC.
Doing the same for six major journals (Computational Linguistics, Computer Speech and Language, Language Resources and Evaluation, Speech Communication, IEEE Transactions on Audio, Speech, and Language Processing, and Transactions of the ACL) shows (Figure 20) the importance of references to the IEEE Transactions, followed by Computational Linguistics. The Transactions of the ACL recently showed a large increase.

Citation Graph
We considered (refer to Appendix 4) the 85,138 papers and the 66,995 authors in NLP4NLP+5 in the citation graph, which includes 587,000 references. When comparing the sources, it should be remembered that the time periods are different, as well as the frequency and number of events for conferences or journals.
Authors' Citation Graph. When comparing the various sources, there are also no meaningful changes between 2015 and 2020 regarding the diameter, density, average clustering coefficient, and connected components of the Internal Authors Citations and Ingoing Global Authors Citations that were presented in our previous paper.
The mean degree of the Outgoing Global Authors Citations graph of the citing authors (i.e., average number of authors being cited by each author), measuring the average number of authors citations within a source, shows a large increase for most sources (Figure 21), following the general trend (refer to Figure 15), especially recently in the NLP sources (TACL, EMNLP, ACL, CL, NAACL, IJCNLP, and CONLL) with more than 40 authors being cited by each author on average.
The mean degree of the Ingoing Global Authors Citations graph of the authors being cited in each of the 34 sources (Figure 22) shows that authors who publish in Computational Linguistics are still the most cited, but are now closely followed by authors in TACL, with a tremendous increase, then ACL, NAACL, HLT, and EMNLP, with more than 60 citations of each author on average.
Papers' Citation Graph. There are no meaningful changes between 2015 and 2020 regarding the diameter, density, average clustering coefficient, and connected components of the Internal Papers Citations and Ingoing Global Papers Citations, when comparing the various sources. The mean degree of the Outgoing Global Papers Citations graph of the citing papers (i.e., the average number of references in each paper), measuring the average number of paper citations within a source, shows an increase for most sources (Figure 23), following the general trend (refer to Figure 15), especially the NLP sources (TACL, CL, EMNLP, ACL, NAACL, IJCNLP, CONLL, CSAL, and LRE), with more than 10 references in each paper on average. Figure 24 provides the average number of papers being cited from each of the 34 sources. Papers published in Computational Linguistics are still the most cited (more than 25 times on average), but are now more closely followed by various NLP sources (TACL (with a tremendous increase), NAACL, HLT, ACL, EMNLP, and CONLL), with more than 10 citations of each paper on average. Figure 25 provides the internal h-index (NLP4NLP papers being cited by papers of any NLP4NLP source) for the 34 sources. The largest h-index is obtained by ACL, where 109 papers are cited 109 times or more in the NLP4NLP+5 papers, followed by EMNLP, which considerably increased its h-index over the past 5 years from 55 in 2015 to 90 in 2020, TASLP (84), ICASSPS (77), and ISCA Interspeech (76).

Analysis of the Citation in NLP4NLP Papers of Sources From the Scientific Literature Outside NLP4NLP Extraction of References
In the internal NLP4NLP citation analysis, references were extracted through a highly reliable matching of titles, as we possess the knowledge of the NLP4NLP paper titles. We cannot use the same approach to explore the citation of articles that were published in sources other than the NLP4NLP ones, as we do not have a list of the titles of all those articles. We therefore used a different strategy, based on the ParsCit software (Councill et al., 2008), to identify the sources within the reference sections of a limited set of NLP4NLP articles. This new process resulted in a list of raw variants of source naming, which necessitated a manual cleaning, as it contained a lot of noise, followed by normalization and categorization into four categories (Conferences, Workshops, Journals, and Books).
All the cleaned variants for a given source are kept, for instance, a short name alongside an extended name. We then implemented an algorithm to detect the source names within the reference sections of all NLP4NLP papers. The detection is technically conducted by means of an intermediate computation of a robust key made of uppercase letters and a normalization of separators, as the aim is to compare names on the basis of significant characters and to ignore noise and insignificant details. We then use these data to compare the citations in NLP4NLP articles of the articles published within and outside the NLP4NLP sources 8.
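A minimal sketch of such a robust key (an illustrative simplification, not the exact implementation used in the study): keep only uppercase letters and digits, so that case and separator variants of a source name collapse to the same key.

```python
import re

def robust_key(source_name):
    """Uppercase the name, then drop everything except letters and
    digits, so punctuation and spacing differences are ignored."""
    return re.sub(r"[^A-Z0-9]", "", source_name.upper())

# Two raw variants of the same source name collapse to one key.
k1 = robust_key("EMNLP-2019")
k2 = robust_key("emnlp 2019")
```

Abbreviated vs. spelled-out names (e.g., a short name compared to an extended name) would still need the manually curated variant list described above; the key only absorbs surface noise.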

Global Analysis
Starting from 32,918 entries, we conducted the manual cleaning and categorization process, which resulted in 13,017 different variants of the sources and, after normalization, in the identification of 3,311 different sources outside the 34 NLP4NLP ones, corresponding to conferences (1,304), workshops (669), journals (1,109), and books (229). Figure 26 provides the evolution of the total number of references, which reaches 121,619 references in 2020 for a cumulated total of 1,038,468 references over the years, and of NLP4NLP references, which reach 72,289 in 2020 for a cumulated total of 654,340 references (63% of the total number of references) with this new calculation based on source detection. These numbers clearly illustrate the representativity of the 34 NLP4NLP sources, totaling close to 20,000 references on average per source, compared with about 110 references on average per source for the 3,311 non-NLP4NLP sources. Figure 27 provides the percentage of NLP4NLP papers in the references. After a hectic period until 1976, due both to the small quantity and to the low quality of the data, mostly OCRized, the ratio of NLP4NLP references stabilized at about 60% until 1994. It then rose to 67% in 2009 and has slowly decreased since then, reaching 60% in 2020 with the appearance of new publications. Figure 28 provides the average number of references per paper, globally and specifically to NLP4NLP papers. We see that this number increases similarly, reaching an average of 25 references per paper, as a result of the citing habits, the increase in the number of publications and of published papers in the literature, and the generalization of electronic publishing, as already expressed in Section Global Analysis (Figure 15), where only NLP4NLP papers were considered based on title identification.

Specific Analysis of Non-NLP4NLP Sources
Some new sources attract many papers and consequently many citations, revealing a drastic change in publication habits. Figure 29 provides the number of references in NLP4NLP+5 papers to arXiv preprints, with a huge increase in recent years (from two references in 2010 to 498 in 2015 and 12,751 in 2020).
The number of references to publications in artificial intelligence, neural networks, and machine learning, such as the conference on Artificial Intelligence of the Association for the Advancement of Artificial Intelligence (aaai), the International Joint Conference on Artificial Intelligence (ijcai), the International Conference on Machine Learning (icml), the International Conference on Learning Representations (iclr), the Neural Information Processing Systems conference (NeurIPS, formerly nips), or the Machine Learning journal and the Journal of Machine Learning Research, has also greatly increased in recent years (Figure 30).

Google Scholar H-5 Index
As of July 2021, Google Scholar (Table 20) places ACL first, as we do, in the ranking of the "Computational Linguistics" conferences and journals category 9, with an h5-index of 157 and an h5-median of 275 over the past 5 years, followed by EMNLP (132). This ranking covers the past 5 years and therefore reflects recent trends, in contrast with our own results, which concern a smaller number of sources but a longer time period. Most conferences in the field considerably increased their h-index over the past 5 years, and some (such as ACL, EMNLP, NAACL, ISCA Interspeech, Semeval, and EACL) even more than doubled it, whereas journals stayed at about the same level, apart from the Transactions of the ACL (TACL), which was launched in 2013 and did not appear in the previous ranking. arXiv is also not considered here.

10 https://scholar.google.fr/citations?view_op=top_venues&hl=fr&vq=eng_signalprocessing
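For reference, the h5-index reported by Google Scholar is simply the h-index restricted to articles published in the past 5 complete years: the largest number h such that h of those articles have at least h citations each. A minimal sketch:

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    counts = sorted(citations, reverse=True)
    h = 0
    for rank, c in enumerate(counts, start=1):
        if c >= rank:
            h = rank  # the rank-th paper still has >= rank citations
        else:
            break
    return h

# Hypothetical citation counts for a venue's articles from the past 5 years.
print(h_index([10, 8, 5, 4, 3]))  # 4
```

The h5-median is then the median citation count of the h articles counted by the h5-index.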

Archive Analysis
Here, our objectives are twofold: (i) to compute the most frequent terms used in the domain, and (ii) to study their variation over time. As for the study of citations, our initial input is the textual content of the papers available in a digital format or that had been scanned. It contains a grand total of 380,828,636 words, mostly in English, over 55 years.

FIGURE 32 | Evolution of the top 25 terms over the past 10 years (2010 to 2020) according to their presence (raw ranking without smoothing).

Terms Frequency and Presence
As depicted in Mariani et al. (2019b), we distinguished SNLP-specific technical terms from common general English ones after syntactic parsing, with the hypothesis that when a sequence of words appears inside the NLP4NLP+5 corpus but not inside the general language profile, the term is specific to the field of SNLP. The 88,752 documents reduce to 81,634 documents when considering only the papers written in English. They include 4,488,521 different terms (unigrams, bigrams, and trigrams) and 34,828,279 term occurrences. The 500 most frequent terms (including their synonyms and variations in upper/lower case, singular/plural number, US/UK spelling, abbreviation/expanded form, and the absence/presence of a semantically neutral adjective) in the field of SNLP were computed over the period of 55 years. We called "existence" 11 the fact that a term exists in a document and "presence" the percentage of documents where the term exists. We computed in that way the occurrences, frequencies, existences, and presences of the terms globally and over time, as well as the average number of occurrences of the terms in the documents where they appear (Table 21).

FIGURE 33 | Evolution of the terms "HMM" (in green) and "Neural Network" (in blue) over the past 30 years (1990 to 2020) according to their presence in the papers.

FIGURE 36 | Evolution of the terms "softmax" (blue), "hyperparameter" (red) and "epoch" (green) over the past 5 years (from the 100th rank in 2015 to the 20th in 2020), according to their presence.
The ranking of the terms may differ slightly depending on whether frequency or presence is used. The most frequent term overall is "dataset," which accounts for 7.6% of the term occurrences and is present in 29% of the papers, whereas the most present term is "Speech Recognition," which is present in 31% of the papers while accounting for 5.7% of the term occurrences. The average number of occurrences of the terms in the documents where they appear varies widely (from 4.2 for "semantic" to more than 13 for "Deep Neural Network" and 12 for "Embedding").
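The four measures defined above (occurrences, frequency, existence, presence) can be sketched on a toy corpus as follows; the real computation also folds in synonyms and spelling variants, which this sketch omits:

```python
def term_statistics(documents, term):
    """Compute the four measures used in the study for one term.

    occurrences: total count of the term over all documents
    frequency:   occurrences / total number of term tokens
    existence:   number of documents where the term occurs at least once
    presence:    existence / number of documents
    """
    occurrences = sum(doc.count(term) for doc in documents)
    total_tokens = sum(len(doc) for doc in documents)
    existence = sum(1 for doc in documents if term in doc)
    return {
        "occurrences": occurrences,
        "frequency": occurrences / total_tokens,
        "existence": existence,
        "presence": existence / len(documents),
    }

# Toy corpus: each document is a list of (already extracted) terms.
docs = [["dataset", "model", "dataset"], ["model"], ["dataset"]]
stats = term_statistics(docs, "dataset")
print(stats["occurrences"], stats["existence"])  # 3 2
```

The average number of occurrences per containing document, used in Table 21, is then `occurrences / existence`.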
We also compared the ranking with the 2015 one. A total of 17 of the 20 most frequent terms up to 2015 are still present in this list, with few changes. We see a large progress in the terms associated with the neural network and machine-learning approaches ("dataset," "embedding," "neural network," "DNN (Deep Neural Networks)," "decoder," and "LSTM (Long Short-Term Memory)") and a small decrease for the terms related to previous approaches ("HMM (Hidden Markov Models)," "GMM (Gaussian Mixture Models)," "SVM (Support Vector Machine)").

11 Also called "Boolean frequency" or "binary frequency".

Change in Topics
The GapChart visualization tool 12 that we developed (Perin et al., 2016) allows us to study the evolution of the terms over the years, based on their frequency or their presence. Figure 31 provides a glimpse of the evolution of topics over time, and we invite the reader to freely access the tool to get a better insight into the evolution of specific terms, such as those illustrated in the following figures. Figure 32 provides the evolution of the 25 most present terms over the past 10-year period (2010-2020). We see that some terms stay in this list over the 10 years, such as "dataset," "metric," or "annotation," while terms related to neural network and machine-learning approaches (such as "embedding," "encoder," "BERT (Bidirectional Encoder Representations from Transformers)," "transformer," "softmax," "hyperparameter," "epoch," "CNN (Convolutional Neural Networks)," "RNN (Recurrent Neural Networks)," "LSTM," and "DNN") made large progress.
We may also select specific terms. Figure 33 focuses on the terms "HMM" and "Neural Network" over the 30-year period (1990 to 2020). It shows the slight preference for "Neural Network" up to 1992, then the supremacy of "HMM" up to 2015, and the recent burst of "Neural Network" starting in 2013.
The progress on terms related to neural networks and machine learning was especially spectacular over the past 5 years (2015-2020) (Figures 34-36).

Tag Clouds for Frequent Terms
Tag Clouds provide an estimate of the main terms used in a given year. For this purpose, we use TagCrowd 13 to generate Tag Clouds. We conducted experiments on full texts and on papers' abstracts and found that papers' abstracts provide a more meaningful analysis, as they are more synthetic and contain a larger ratio of technical terms compared with general language. Figures 37A,B provide the Tag Clouds for 2015 and 2020. We clearly see the burst of terms related to machine learning ("BERT," "CNN," "decoder," "embedding," "encoder," "pretraining," "transformer") that were absent in 2015, and the continued presence of "neural network," "annotation," "metric," and "LM" ("Language Model").

Machine Learning for Time Series Prediction
We explored the feasibility of predicting the research topics for the coming years based on the past (Francopoulo et al., 2016a).
TABLE 23 | Comparison of the terms predicted in 2015 for the next 5 years (2016-2020) with the actual observations on these years (predictions are in italics; terms correctly predicted to appear among the 10 top terms are marked in yellow, and the term correctly predicted at its exact rank is marked in green).

We selected, in the time series plug-in of the Weka 14 machine-learning software package (Witten et al., 2011), the Gaussian Processes algorithm with an 18-year window, which provided the best results in predicting the term frequency. We then applied this software to the full set of the NLP4NLP+5 corpus, year by year. Table 22 provides the ranking of the most frequent terms in 2018 and 2019 with their observed frequency, the topic predicted for 2020 by the selected Weka algorithm (based on the smallest gap between predictions and observations on the past rankings), and the ranking actually observed in 2020. We see that the prediction that a term would appear within the top 10 most frequent terms is correct for 8 of them.

14 www.cs.waikato.ac.nz/ml/weka
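The sliding-window prediction scheme can be sketched as follows; since reproducing Weka's Gaussian Processes regressor here would be heavy, this sketch substitutes a simple least-squares linear trend as the per-term predictor, and the frequency series are hypothetical:

```python
def predict_next_year(series, window=18):
    """Predict the next value of a yearly frequency series from its last
    `window` points, using a least-squares linear trend as a simple
    stand-in for Weka's Gaussian Processes regressor."""
    data = series[-window:]
    n = len(data)
    mean_x = sum(range(n)) / n
    mean_y = sum(data) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(data))
    var = sum((x - mean_x) ** 2 for x in range(n))
    slope = cov / var if var else 0.0
    return mean_y + slope * (n - mean_x)  # extrapolate one step ahead

# Hypothetical per-year frequencies for two terms; rank by prediction.
freqs = {"neural network": [1, 2, 3, 5, 8, 12], "hmm": [9, 8, 7, 6, 5, 4]}
ranking = sorted(freqs, key=lambda t: predict_next_year(freqs[t]), reverse=True)
print(ranking)  # ['neural network', 'hmm']
```

Ranking all terms by their predicted frequency for the next year yields a predicted top-10 list that can then be compared with the observed one.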

Prediction Reliability
As we published such predictions for the years 2016-2020 in our previous paper, we were eager to verify whether these predictions were correct. We thus compared these predictions with the actual figures for these 5 years (Table 23). It appears that the predictions were quite reliable: the number of terms correctly predicted to appear within the top 10 terms varies from 7 to 8 in the first 3 years (2016, 2017, and 2018) following the year when the prediction was made (2015), then decreases to 4 in the 4th year (given that BERT did not exist at the time of the prediction and therefore could not be predicted) and 3 in the 5th year, whereas one term ("dataset") was correctly predicted with the right ranking for all 5 years. This confirms the assumption made in our previous papers that predictions seem to be reasonable within a 3-year horizon (unless a major discovery happens in the meanwhile) in this research domain.

FIGURE 38 | Evolution of the distance between prediction and observation over the years.

Scientific Paradigms Ruptures
As expressed in our previous paper (Mariani et al., 2019b), "the difference between the prediction and the observation on each year provides a measure of the 'surprise' between what was expected and what actually occurred. The years where this 'surprise' is the largest may correspond to epistemological ruptures." Figure 38 provides the evolution of this distance between 2011 and 2020, computed as the average absolute value of the difference between prediction and observation for the 200 most frequent terms. It suggests that 2012 was a year of big changes; the surprise then decreased for 2 consecutive years and has only slightly varied since 2014.
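The "surprise" of a given year can thus be sketched as a mean absolute difference between predicted and observed term frequencies (the values below are hypothetical; the study uses the 200 most frequent terms):

```python
def surprise(predicted, observed):
    """Mean absolute difference between predicted and observed
    frequencies over the shared vocabulary."""
    terms = predicted.keys() & observed.keys()
    return sum(abs(predicted[t] - observed[t]) for t in terms) / len(terms)

# Hypothetical frequencies for one year.
pred = {"dataset": 0.30, "hmm": 0.10, "embedding": 0.05}
obs = {"dataset": 0.29, "hmm": 0.04, "embedding": 0.15}
print(round(surprise(pred, obs), 3))  # 0.057
```

Plotting this value per year gives a curve like Figure 38; a peak signals a year whose observed rankings deviated most from what the past implied.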
The same distance between prediction and observation for a specific topic illustrates the way this term evolved compared with what was expected. Figure 39 shows the evolution of the "Deep Neural Network" (DNN) term. It suggests that the popularity (as measured by the frequency of the term) of this approach in the next year was underestimated up to 2015, then overestimated until 2018.

Predictions for the Next 5 Years
The predictions for the next 5 years (2021-2025) are provided in Table 24: it is expected that methods based on machine learning, word embedding, and neural networks will keep on attracting the researchers' attention, with a sustained interest for "Language Models (LM)" and a growing interest for "BERT" and "transformer." Predictions are marked in green.
TABLE 25 | The 10 most present terms in 2020, with variants, date, authors, and publications where they were first introduced, number of occurrences and existences in 2020, number of occurrences, frequency, number of existences and presence in the 55-year archive, with ranking and average number of occurrences of the terms in the documents where they appear, and comparison with the ranking in 2015 (the terms which joined the top 10 are marked in green, while the 5 which went out are marked in orange with their new and former ranking).

New Terms Introduced by the Authors
We studied who introduced new terms that became popular, where and when, to assess the innovative contributions of authors and sources to the advances of the scientific domain (Mariani et al., 2018b). We considered the 81,634 documents written in English and the 61,431 authors who used the 4,488,498 terms contained in these documents. A total of 2,968 of these terms are present in the 22 documents of the first year (1965), which we considered as the starting date for the introduction of new terms, while we found 594,807 of these terms in the 5,313 documents published in 2020. We should stress that the birth of a term is only searched for within the 34 NLP4NLP+5 sources and that a term may have had a different meaning when it was first introduced. Table 25 provides the ranked list of the 10 most popular terms according to their presence in 2020 and a comparison with the ranking in 2015. We notice that 50% of the top 10 terms have changed, with a spectacular increase in the terms related to machine learning and neural networks ("embedding," "encoder," "hyperparameter," and "LSTM").

Measuring the Importance of Topics
In our previous paper, we proposed to measure the "innovation score" of a term as the sum of the yearly presence of the term since its introduction. We initially considered the 1,000 most frequent terms over the 55-year period but, given the poor quality and low number of different sources and papers in the first years, we decided to only consider the 45-year period from 1975 to 2020. Table 26 provides the ranking of the terms overall and specifically for the NLP and speech processing categories. This list is very similar to the one we computed in 2015, with a slightly different ranking.
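The innovation score of a term can be sketched as the sum of its yearly presence values from its introduction onward (the presence figures below are hypothetical):

```python
def innovation_score(yearly_presence, first_year, last_year):
    """Sum of the yearly presence of a term (fraction of that year's
    papers containing it) from its introduction to the final year.

    Years with no recorded presence contribute 0.
    """
    return sum(yearly_presence.get(year, 0.0)
               for year in range(first_year, last_year + 1))

# Hypothetical presence of a term over a few years.
presence = {1990: 0.02, 1991: 0.05, 1992: 0.10}
print(round(innovation_score(presence, 1990, 1992), 2))  # 0.17
```

A term that appears early and stays present in a large share of papers each year thus accumulates a high score, which is the intended reward for durable innovations.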
We studied the evolution of the cumulative presence of the terms over the years (percentage of papers containing a given term up to a given year) to check the changes in paradigm while avoiding the noise due to conference frequency. Figure 40 provides the evolution of the 10 most popular terms according to this measure. Their global ranking over the years corresponds to the order of the terms in the legend of the figure, as will be the case for all figures in this section (Innovation). Percentages are provided either in relation to the total number of papers (qualified as "all papers") or in relation to the papers related to a specific topic (qualified as "topical papers").
We see for example that Speech Recognition ("SR") has been a very popular topic over the years, reaching a presence in close to 35% of the papers published up to 2008, then slightly decreasing.

Measuring Authors' Innovation
We computed in a similar way an innovation score for each author, illustrating his or her contribution to the introduction and early use of new terms that subsequently became popular, as the sum over the years of the annual presence of the terms in papers published by the author (percentage of papers containing the term and signed by the author in a given year), overall and specifically for the NLP and speech processing categories (Table 27). The names in this table are also very similar to those of 2015, with a slightly different ranking.
This measure does not uniquely place the "inventors" of a new topic at the forefront, as it is difficult to identify them given that we only consider a subset of the scientific literature (the NLP4NLP+5 corpus); it also includes the early adopters who published a lot when, or just after, the topic was initially introduced. Therefore, authors of highly cited papers introducing innovative approaches (such as GloVe or BERT recently) do not fully benefit from their innovation, as many authors immediately adopted, used, and published with these approaches in the same year.

Measuring the Innovation in Sources
We also computed an innovation score for each source as the sum over the years of the annual presence of the terms in papers published in the source, conference or journal (percentage of papers containing the term which were published in that source in a given year), overall and specifically for NLP and for Speech Processing (Table 28). The names in this table are also very similar to those of 2015, with a slightly different ranking (progress for EMNLP in NLP and for ISCA overall and in Speech Processing).

Measuring the Contribution of Authors and Sources to a Specific Topic
We may also study the contributions of authors or sources to a specific topic, using the cumulative innovation score of authors and sources attached to this topic.

Contributions to the Study of "HMM"
Figure 41 provides the cumulative percentage of papers containing the term "HMM" published up to a given year by the 10 most contributing authors, ranked according to the innovation measure up to 2020 15. Compared with 2015, we only observe the appearance of Junichi Yamagishi on innovative HMM-based speech synthesis. We also do not observe much difference regarding the contributions of the various sources to HMMs (Figure 42): the IEEE TASLP has been a pioneer in this area since 1982 16, the ISCA Conference series represents 45% and IEEE-ICASSP 25% of the papers published on HMM up to 2020, and publications that span both speech and NLP (CSL, HLT, LREC) helped to spread the approach from speech processing to NLP as well (ACL, EMNLP).

Contributions to the Study of "DNN"
We studied the authors' contributions over the years 17 to deep neural networks ("DNNs"), which recently gained a large audience, in terms of the percentage of authors and papers ("presence") mentioning the term (Figure 43).
We notice the important contribution of Asian authors to this topic (Figure 44), with the pioneering contributions of Dong Yu, who authored about 30% of the papers published on this topic until 2012. Compared with 2015, we notice larger changes than in the case of HMMs, as it is a faster-moving field, with the appearance of new names (Shinji Watanabe, Tomohiro Nakatani, Helen Meng, and Shri Narayanan).
We may then also study how an author's contribution to a specific topic compares with his or her main contributions to other topics, and how it evolved over the years. Figure 45 illustrates the fact that the contributions of Dong Yu are essentially focused on deep neural networks and, secondarily, on the Softmax function in the years 2011-2013. Looking at the source contributions to "DNN," we see that it started in the speech community (ISCA Interspeech and IEEE TASLP) and then diffused into the natural language processing community (starting with ACL) (Figure 46).

Contributions to the Study of "Embedding"
Similarly, we studied the authors' contributions over the years 18 to "Embedding," a term which had been used for many years since the 1970s but gained a large audience since 2015, in terms of the percentage of authors and papers ("presence") mentioning the term (Figure 47). Figure 48 shows the contribution of authors to the topic of Embedding, which shows both the individual contributions and the large increase in the presence of the topic since 2015, as illustrated in Figure 47.

15 Some authors have published earlier on that topic, but do not belong to the 10 most contributive ones in the long run.

16 We only had access to IEEE ICASSP proceedings since 1990, while Computer Speech and Language started in 1986 and the ISCA Conference Series in 1987.

17 The term "DNN" first appeared in a paper published in 1990 (Yamaguchi et al., 1989), but with the meaning of "Dynamic Programming Neural Network," with a reference to a paper published by Sakoe at ICASSP 1989 (which is not considered in NLP4NLP+5). However, the low presence of the term with this meaning rightly prevents these authors from appearing among the main contributors to "DNN" in its present meaning.

18 Here also, the term appeared as early as 1967, but with a different meaning and a low presence until recent years.
In this case, the topic was initiated in the natural language processing community (COLING, ACL, IJCNLP) and then diffused in the speech community as well (ISCA Interspeech, ICASSP, TASLP) (Figure 49).
We may then also study how a source's contribution to a specific topic compares with its main contributions to other topics, and how it evolved over the years. Figure 50 clearly illustrates the major contribution in recent years of the EMNLP conference to the research on "embedding": almost 20% of the papers produced so far on this topic have been published there.

Use of Language Resources
The LRE Map
We have conducted an analysis of language resources (LRs) as bricks that are used by researchers to conduct their research investigations and develop their systems (Francopoulo et al., 2016b). We consider here language resources in the broad sense, embracing data (e.g., corpora, lexicons, dictionaries, terminological databases, language models, etc.), tools (e.g., morpho-syntactic taggers, prosodic analyzers, annotation tools, algorithms, software packages, etc.), system evaluation resources (e.g., metrics, training, dry run or test corpora, evaluation packages, etc.), and meta-resources (e.g., best practices, guidelines, norms, standards, etc.) that are mentioned in the LRE Map (Calzolari et al., 2012). This database is produced by the authors of papers at various conferences and workshops of the domain, who are invited, when submitting their paper, to fill in a questionnaire describing the main characteristics of the language resources produced or used in the research investigations that they report.
The version of the LRE Map that we used in our previous paper contained information harvested from the authors at 10 conferences from 2010 to 2012, for a total of 4,396 different resources. In the present paper, we use an updated version of the LRE Map containing data harvested at 53 conferences and workshops from 2010 to 2018 (6 more years), for a total of 9,725 resources, which we cleaned up (correcting the names of the resources, eliminating duplicates, regrouping the various versions of resources from the same family, etc.). We finally ended up with 5,609 different resources that we searched for in the articles of the NLP4NLP+5 corpus. Figure 51 provides the evolution of the number of different resources mentioned in the papers (which we call "existence") compared with the evolution of the number of papers over the years, whereas Figure 52 provides the average number of language resources mentioned in a paper (which we call "presence"). The corresponding curves cross in 2002, when more than one language resource was mentioned on average in a paper, reflecting the shift from knowledge-based approaches to data-driven approaches in SNLP research. Since 2015, the number of mentions of language resources has largely increased, reaching 16,000 mentions in 2020, whereas the number of papers also greatly increased, so that the ratio stays at about 3 language resources mentioned per paper on average. Table 29 provides the ranking of language resources according to their "existence" (number of papers where they are mentioned), their type (corpus, lexicon, tool, etc.), their number of "occurrences" (number of mentions in the papers), the first authors and publications that mentioned them, the first and final years when they were mentioned, and a comparison with the 2015 ranking.
We see that half of the language resources that were in the previous ranking are still present, "Wikipedia" now being at the first rank, while the other half were previously ranked beyond the 10th position (BLEU, MATLAB, AnCora) or were not considered in the LRE Map, being posterior to 2012 (Word2Vec, Glove). We also see that a language resource such as Word2Vec was immediately adopted by many authors in the very year it first appeared in a paper of the NLP4NLP+5 corpus. Table 30 provides, over the past 20 years (2000 to 2020), the number of mentions of the different language resources from the LRE Map, together with the number of documents that were published and the list of the 10 most cited language resources for each year. We see in recent years the increase of language resources related to machine learning and neural networks (Word2Vec, Weka, Glove, Keras, Seq2Seq, RoBERTa), as well as to the use of metrics (BLEU and now ROUGE), and the appearance of new corpora (LibriSpeech, SQuAD).

Language Resource Impact Factor
We proposed to define the "Impact Factor" of a language resource as its existence (the number of papers where it is mentioned), in recognition of the importance of the corresponding language resources for conducting research in NLP and of the researchers who provided these useful language resources, similar to the role of a citation index. Table 31 provides the impact factors for the language resources of the "Data," "Evaluation," and "Tools" types. We can notice the importance of quality measures (BLEU, introduced for machine translation, and ROUGE for text summarization) and the recent burst of machine-learning toolkits (Word2Vec, GloVe, Weka, Seq2seq).
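Under this definition, computing impact factors amounts to counting, for each resource, the papers that mention it at least once; a minimal sketch over hypothetical paper texts (plain substring matching, used here for brevity, would in practice need the same name normalization as in the actual study):

```python
def impact_factors(papers, resources):
    """Impact factor of each language resource = its 'existence',
    i.e., the number of papers that mention it at least once."""
    return {res: sum(1 for text in papers if res in text)
            for res in resources}

# Hypothetical paper texts (lowercased) and resource names.
papers = [
    "we train word2vec embeddings and score with bleu",
    "bleu is used to evaluate our translation system",
    "a glove model is fine-tuned on the corpus",
]
print(impact_factors(papers, ["bleu", "word2vec", "glove"]))
# {'bleu': 2, 'word2vec': 1, 'glove': 1}
```

Multiple mentions within one paper count once, which is exactly what distinguishes "existence" from "occurrences".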

Text Reuse and Plagiarism
Here we studied the reuse of the textual content of NLP4NLP+5 papers in other NLP4NLP+5 papers (Mariani et al., 2016, 2018a).

Data
We considered here 88,752 documents published by 66,995 authors from 1965 to 2020, which constitute a large part of the published articles in the field of SNLP, apart from the workshop proceedings and the published books.
The preparation of the textual data is described in the study of Francopoulo et al. (2015b). The overall number of words is roughly 380 MWords. Only the texts in English and French have been retained.

Algorithm for Computing Papers Similarity
The detection of "copy & paste" as defined in Appendix 5 is conducted through an algorithm described in the study of Mariani et al. (2019b). The comparison is conducted on a window of seven tokens, using the Jaccard distance and a threshold of 0.04. We therefore consider as potentially reused or plagiarized all couples of articles with a similarity score of 4% or more according to our measure of similarity.
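One plausible reading of this setup represents each paper by its set of 7-token windows (shingles) and compares two papers with the Jaccard coefficient over these sets; the tokenization and matching details of the actual algorithm may differ, so this is only a sketch:

```python
def shingles(tokens, n=7):
    """Set of n-token windows (shingles) of a tokenized paper."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B| over two shingle sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def potentially_reused(doc1, doc2, n=7, threshold=0.04):
    """Flag a pair of papers whose shingle overlap reaches 4%."""
    return jaccard_similarity(shingles(doc1, n), shingles(doc2, n)) >= threshold

# Two toy "papers" sharing an 8-token opening passage.
paper_a = ("we present a neural model for speech recognition "
           "trained on a large corpus of read speech").split()
paper_b = ("we present a neural model for speech recognition "
           "evaluated on broadcast news data").split()
print(potentially_reused(paper_a, paper_b))  # True
```

Any 7-token run copied verbatim between two papers yields shared shingles, so the measure is sensitive to literal "copy & paste" while remaining robust to reordering of unrelated passages.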

Categorization of the Results
Our previous experiments showed that it is necessary to carefully check the results, as they may contain false alarms due to the presence of short texts, such as acknowledgments, or of truncated or merged documents due to OCRization of the oldest data. In many cases, the author appears with a different spelling, or references are properly quoted but with a different wording, a different spelling (e.g., American vs. British English), or an improper reference to the source. We had to manually correct these cases and move the corresponding couples of papers to the right category (from "reuse" or "plagiarism" to "self-reuse" or "self-plagiarism" in the case of authors' names; from "plagiarism" to "reuse" in the case of references). This manual correction can be done for the articles placed in the "reuse" and "plagiarism" categories, as they are not very numerous, whereas the detection of authors' names ensures a good reliability for the "self-reuse" and "self-plagiarism" categories. For each of the four copy-and-paste categories, we produced the list of couples of "similar" papers according to our criteria, with their similarity score, identification of the common parts, and an indication of a similar list of authors or of the same title.

Results on NLP4NLP+5
We do not include in this paper the matrices for each of the four categories (self-reuse, self-plagiarism, reuse, and plagiarism) displaying the number of papers that are similar for each couple of the 34 sources (considered as "using sources" and "used sources"), which were presented in our previous paper, as they do not show a large difference with the previous findings. Of the 13,068 cases detected using the NLP4NLP+5 corpus, 5,799 (44%) are identified as self-reuse, 6,942 (53%) as self-plagiarism, 152 (1.5%) as reuse, and 175 (1.5%) as plagiarism. Figure 53 provides the percentage of papers that are detected as using parts of other papers over the years, whereas Figure 54 provides the percentage of papers that are detected as having been used by other papers over the years, given that these almost entirely correspond to self-reuse and self-plagiarism.
As these figures clearly show, self-reuse and self-plagiarism remain very common: about 25% of papers use parts of previous papers, whereas parts of 25% of papers are used in a new paper. This may also be related to the submission of similar papers at two different conferences in the same year, or to the publication in a journal of a paper previously published at a conference. Just as noticed in our previous NLP4NLP study, the reuse of papers happens within a short time period (in the same year in 40% of the cases, and within 2 years in 85% of the cases). The reuse of conference papers in journal articles happens with a slightly longer delay (in the same year in 11% of the cases, and within 3 years in 84% of the cases).

Plagiarism
We characterized plagiarism as authors using in a paper a large part (more than 4%) of the textual content of a paper by other authors without citing the source paper.
In our previous study related to publications up to 2015, 116 cases of possible plagiarisms were detected over 50 years on a total of 63,357 papers (less than 0.4%), which reduced to only one case with a 10% similarity score after a careful manual checking made possible by the small number of detected cases, as described in the study of Mariani et al. (2019b).
From 2015 to 2020, 47 cases of possible plagiarism were detected on a total of 20,649 papers, which also reduced to a single case after a careful manual checking! In addition to the various reasons for the false detection of plagiarism identified in our previous study, we also found that a paper may be identified as plagiarizing another paper whereas the authors of that other paper had themselves plagiarized an earlier paper by the first paper's authors.

CONCLUSIONS
When comparing the study contained in this paper with the findings of our two previous papers, we may first consider the results that, over 55 years, reinforce the conclusions we drew over 50 years.
As already encountered in the studies reported in our previous papers, we appreciated the benefit of having access to a large quantity of publications that are now freely accessible online, while we faced the difficulty of dealing with proprietary data, which requires extensive discussions with the publishers to explain the nature of our investigations, which necessitate a large collection of papers. It also raises the problems of distributing the data, replicating the results, and updating the corpus.
We still struggled with the lack of a consistent and uniform identification of entities (titles, author names, gender, affiliations, conference and journal names, funding agencies, names of language resources, etc.), which required a tedious manual correction process. This problem would need an international effort to provide unique identifiers for these entities, or else sophisticated disambiguation processes.
We also see that there is only little and slow progress in the feminization of the research community.
This study confirmed the possibility of predicting some future scientific developments over the next 3 years.
The study of reuse and plagiarism still points to the scarcity of real plagiarism cases (a conclusion which, however, requires careful manual checking in addition to any automatic process) and to the commonness of the reuse of previously published textual content by the same authors, which is very widespread and easily understandable, especially when turning a conference paper into a journal article.
While confirming these previous findings, this study also illustrates the tremendous changes in the speech and natural language processing community that happened during the past 5 years.
We first notice a very intense research activity reflected by a huge increase in the number of papers and authors, similar in the single year 2020 to what occurred in the first 25 years of our corpus.
More and more collaborations took place among authors, who formed new clusters. Important changes appear in the ranked lists of the most productive, most collaborative, most cited, and best h-indexed authors, with the appearance of many new names, especially of Asian origin, who publish a lot, while many researchers of the pioneering times are now gradually retiring. It also seems that slightly more focus is nowadays devoted to NLP than to speech processing, where many breakthroughs have already been achieved in the recent past and which now shares many scientific challenges and similar approaches with NLP. This is due to the appearance of new paradigms, such as deep learning, word embedding, or unsupervised machine learning, which immediately attracted a large community of researchers. Owing to the acceleration in publishing, these researchers increasingly publish in conferences and journals either within NLP4NLP, where we notice a specific increase in activity for the Transactions of the ACL, or outside the NLP4NLP core research area and publications, especially in arXiv, which now appears as a popular, free, open-access, non-peer-reviewed publication facility.
The use of language resources is also increasing a lot, reflecting the crucial need for data in machine-learning approaches for developing and improving the quality of systems related to a language, a population, or a task, and for proper metrics to measure quality and progress. New language resources of various kinds (datasets, tools, metrics) specifically related to these paradigms quickly became very popular.
Some research domains were initiated or reactivated, such as semantic analysis, sentiment analysis, speech translation, or processing of low-resourced languages.

PERSPECTIVES
We would like to improve the quality of the automatic extraction of information (such as authors' names, references, sources, terms, language resources) to reduce the burden of manual corrections by taking into account the context through novel approaches of disambiguation based on word embedding.
We believe that the raw data that we gathered and the information that we extracted after substantial manual cleaning would provide interesting training and test data for evaluation campaigns (such as automatic name extraction, named entity disambiguation or gender detection).

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: http://www.nlp4nlp.org.