Edited by: Neil R. Smalheiser, University of Illinois at Chicago, United States
Reviewed by: Shenmeng Xu, University of North Carolina at Chapel Hill, United States; Philippe Mongeon, Université de Montréal, Canada
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Coverage is an important criterion when evaluating information systems. This exploratory study investigates the issue by submitting the same query to different databases relevant to the query topic. Data were retrieved from three databases: the ACM Digital Library, Web of Science (including the Proceedings Citation Indexes), and Scopus. The search phrase was "information retrieval," and the publication years were limited to 2013–2016. Altogether 8,699 unique items were retrieved, of which 5,306 (61%) were retrieved by a single database only, and only 977 (11%) were located in all three databases. These 977 items were analyzed further: the citation counts reported by the three databases were compared, and the citations were also compared with altmetric data for these publications, collected from Mendeley.
Cleverdon listed six criteria for evaluating information retrieval systems: coverage, recall, precision, response time, presentation, and effort.
In the evaluation of IR systems, the main measures are precision and recall, and measures derived from them (e.g., Perry and Salisbury; Baeza-Yates and Ribeiro-Neto).
From 1963 until 2004, there was a single comprehensive citation database, initiated by Eugene Garfield, first under the name of the ISI Citation Databases and from 1996 onward known as the Web of Science (WoS) (Clarivate).
In 2005, Jacsó searched for items authored by Eugene Garfield and found 1,522 items indexed by WoS versus 90 items in Scopus. In 2018, there are still more items indexed by WoS (1,543) than by Scopus (254), mainly because only WoS indexes the Current Contents articles authored by Garfield (1,063). In a later article, Jacsó revisited these comparisons.
Bar-Ilan also compared citation counts across WoS, Scopus, and Google Scholar.
Meho and Yang compared citation counts from WoS, Scopus, and Google Scholar for library and information science faculty.
Ball and Tunger, as well as Gavel and Iselid, compared the journal coverage of WoS and Scopus.
In 2014, WoS covered 13,605 journals and Scopus 20,346 journals (Mongeon and Paul-Hus).
A number of studies compared coverage for specific subjects; Earth and Atmospheric Sciences, for example, were studied by Barnett and Lascar.
In the previous section, we described studies that compared WoS and Scopus based on journal lists or on publication and citation counts. Instead of citations, one can use additional measures, like usage data (if available) or the number of users of a reference manager who saved a particular document in their libraries. Mendeley, a free online reference manager, reports for each document in its database the number of users, called "readers," who saved the document. There are two major advantages in supplementing conventional indicators with Mendeley readership counts: (1) readership counts accumulate much faster than citations and can be early signals of future citations; (2) not all readers are citers; many Mendeley members are students who may or may not publish journal articles. Mendeley reader counts and other altmetric indicators have shortcomings as well: the major concerns are that these indicators can be quite easily manipulated and are not transparent (see, for example, Wilsdon et al.).
We were not able to locate specific studies on database coverage in computer science. Hull et al. discussed related issues in the context of digital libraries.
The article most relevant to the topic of the present study was published by De Sutter and Van Den Oord, who examined citation coverage in computer science.
As can be seen from the literature review above, there are differences between the coverage of databases. The aim of this study is to demonstrate the influence of the databases' varied coverage on various measures, such as publication and citation counts, the h-index, most-cited sources, and most-cited publications. Citation counts are also compared with Mendeley readership counts for the subset of documents retrieved by all three studied databases.
In the following, we demonstrate the differences stemming from coverage for the term "information retrieval" by comparing three databases that provide citation counts: two comprehensive ones, WoS and Scopus, and one subject-specific, the ACM Digital Library (ACM). Information retrieval is a topic relevant both to computer science and to information science. The query is not intended to cover "information retrieval" as a topic; it is only used to demonstrate the differences between the databases and to alert users to search multiple databases whenever comprehensive data are needed.
For this study, data were collected in May 2017 from three databases: ACM, Scopus, and WoS. The search query was identical in all three cases, "information retrieval" as a phrase, and so were the publication years, 2013–2016. Our aim was not a comprehensive view of the topic, but a fair comparison between the databases for a sample query. Fair means an identical query, identical publication years, and identical limits on where to search (e.g., title, abstract, keywords). However, because of differences in the databases' search capabilities, there were slight differences in the search strategies, as described below.
The ACM Digital Library allows searching in two sources: the ACM Full Text Collection and the more comprehensive (in terms of metadata) ACM Guide to Computing Literature. The second option was chosen, and we searched for the term "information retrieval" in the abstract or in the title. After data cleansing (removal of duplicates and of items with missing titles or authors), 3,937 items remained of the initially retrieved 4,161. The ACM Digital Library allows metadata downloads, but the metadata do not include citation counts, which had to be added manually.
In Scopus, the searches were also in the title and abstract; however, in addition to limiting the publication years to 2013–2016, we had to limit the retrieved items to two subject areas, computer science and social sciences (the latter to include information science), to filter out noise. Of the 5,635 items retrieved, 5,460 remained after data cleansing.
Web of Science does not allow limiting the search to the abstract only, so we chose a topic search, which covers title, abstract, and keywords. We had excluded keywords in Scopus because their inclusion added mainly noise (12,931 documents for a keyword search limited to the publication years and subject areas above); an examination of a sample of the documents showed that adding keywords introduced a lot of noise, while in ACM the keyword search had a huge overlap with the title and abstract search. The search in WoS included the Science Citation Index, the Social Sciences Citation Index, the Arts and Humanities Citation Index, the Proceedings Citation Indexes, and the Emerging Sources Citation Index. The subject areas were limited to computer science and information science; 4,265 documents were retrieved. After retrieval, we were able to remove all items in which the term "information retrieval" appeared in the keywords only, leaving 3,673 WoS items in the dataset. Thus, we created three comparable datasets.
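For orientation, the three search strategies can be summarized roughly as follows. The field codes below are assumptions based on each database's documented query syntax at the time, not verbatim copies of the queries submitted in May 2017.

```python
# Hedged reconstruction of the three search strategies; field codes are
# assumptions, not the exact queries submitted in May 2017.
SEARCH_STRATEGIES = {
    # ACM Guide to Computing Literature: phrase search in title or abstract
    # (generic field labels; ACM's advanced-search syntax differed in detail).
    "ACM": 'Title:"information retrieval" OR Abstract:"information retrieval"',
    # Scopus: title/abstract only (keywords deliberately excluded), limited to
    # 2013-2016 and to the computer science and social sciences subject areas.
    "Scopus": (
        'TITLE-ABS("information retrieval") '
        "AND PUBYEAR > 2012 AND PUBYEAR < 2017 "
        'AND (LIMIT-TO(SUBJAREA, "COMP") OR LIMIT-TO(SUBJAREA, "SOCI"))'
    ),
    # Web of Science: topic search (title, abstract, keywords), with
    # keyword-only matches removed after retrieval.
    "WoS": 'TS=("information retrieval")',
}
```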
Next, a list of unique documents was created from the items retrieved from the different data sources. This part was rather time consuming, because not all items had DOIs, and occasionally the DOIs were incorrect. Pairwise comparisons were conducted to discover the overlap and to collect each item's citation counts from the three databases. Items not matched by DOI were compared by title and publication year, and these matches were checked manually, since in several cases items with identical titles and publication years were published in two different venues. It was impossible to match items automatically by publication source as well, because there are no uniform naming conventions for proceedings titles. For example, the publication source for papers in SIGIR 2015 appears as "Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval" in ACM, and as "SIGIR 2015—Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval" in Scopus and WoS, while CIKM 2016 appears as "Proceedings of the 25th ACM International on Conference on Information and Knowledge Management" in ACM, and as "CIKM'16: PROCEEDINGS OF THE 2016 ACM CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT" in WoS.
Web of Science retrieved items from the CIKM conference series only for 2016, Scopus indexed only the 2014 proceedings, and ACM retrieved items from all four years covered in this study; however, the ACM source title for 2013 was slightly different, using "&" instead of "and."
Interestingly, for the manual check of items paired only by title and publication year, the items' start and end pages were the most useful. Altogether, 8,699 unique items were identified.
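A minimal sketch of this matching logic (the record fields `doi`, `title`, and `year` are a hypothetical layout; title/year pairs still go to a manual check, as described above):

```python
import re

def norm_doi(doi):
    """Normalize a DOI for matching; returns None when missing/malformed."""
    if not doi:
        return None
    doi = doi.strip().lower()
    doi = re.sub(r"^https?://(dx\.)?doi\.org/", "", doi)
    return doi if doi.startswith("10.") else None

def norm_title(title):
    """Lowercase and strip punctuation/whitespace for title comparison."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def match_records(db_a, db_b):
    """Pair records from two databases: first by DOI, then by (title, year).
    Title/year pairs still need a manual check, because identical titles
    and years may belong to papers published in different venues."""
    by_doi = {norm_doi(r["doi"]): r for r in db_b if norm_doi(r["doi"])}
    by_title_year = {(norm_title(r["title"]), r["year"]): r for r in db_b}
    sure, to_check = [], []
    for rec in db_a:
        doi = norm_doi(rec["doi"])
        if doi and doi in by_doi:
            sure.append((rec, by_doi[doi]))
        else:
            key = (norm_title(rec["title"]), rec["year"])
            if key in by_title_year:
                to_check.append((rec, by_title_year[key]))  # verify manually
    return sure, to_check
```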
It should be noted that it was not feasible to use Google Scholar or Microsoft Academic Search. In Google Scholar, one can search in the title but not in the abstract, and the appearance of the term "information retrieval" in the full text cannot serve as evidence that a paper is about information retrieval. In any case, even for a title search, Google Scholar reported (as of May 2017) about 4,240 results published between 2013 and 2017, and about 45,400 results for a general search. Since Google Scholar does not allow retrieving more than 1,000 results, it was not feasible to include it. Microsoft Academic Search reported more than 50,000 results for the time period, and 28,700 results for items published in 2013 alone.
The subset of documents appearing in all three databases was further analyzed. Altmetric data were collected for this set to enhance the comparison between the databases. For the altmetric comparison, Mendeley was chosen, and data were collected in September 2017.
Mendeley data were collected using Webometric Analyst, a free tool developed by Mike Thelwall.
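Conceptually, the per-document lookup that such a tool performs resembles the sketch below; the endpoint, parameters, and response fields are assumptions based on the public Mendeley catalog API documentation of that period, and a valid OAuth2 access token is required.

```python
import requests

def mendeley_reader_count(doi, token):
    """Look up the Mendeley reader count for one DOI.
    Endpoint and response fields are assumptions based on the public
    Mendeley catalog API; `token` is an OAuth2 access token."""
    resp = requests.get(
        "https://api.mendeley.com/catalog",
        params={"doi": doi, "view": "stats"},
        headers={"Authorization": f"Bearer {token}",
                 "Accept": "application/vnd.mendeley-document.1+json"},
        timeout=30,
    )
    resp.raise_for_status()
    docs = resp.json()  # list of matching catalog documents
    return docs[0].get("reader_count", 0) if docs else 0
```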
Longitudinal publication trends for the whole set of publications and also for the individual databases were charted both in terms of number of publications and in terms of number of citations. The h-index of the topic in each database was computed. Most cited publications were identified.
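The h-index of a topic in a database is the largest h such that at least h of the retrieved publications received at least h citations each; a minimal sketch of the computation:

```python
def h_index(citation_counts):
    """Largest h such that at least h publications have >= h citations."""
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for i, c in enumerate(counts, start=1):
        if c >= i:
            h = i
        else:
            break
    return h

assert h_index([10, 8, 5, 4, 3]) == 4
assert h_index([0, 0, 1]) == 1
```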
The subset of items covered by all three databases underwent additional analysis of the citation patterns. The citation counts were compared with Mendeley reader counts and Spearman correlations were computed.
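A sketch of the pairwise correlation computation, with a hypothetical input layout (per-source count lists aligned by document); SciPy's `spearmanr` handles the tied ranks that are frequent in citation data:

```python
from itertools import combinations
from scipy.stats import spearmanr

def pairwise_spearman(counts_by_source):
    """Spearman correlation for every pair of sources over the same set
    of documents. `counts_by_source` maps a source name to a list of
    counts aligned by document (hypothetical input layout)."""
    results = {}
    for a, b in combinations(counts_by_source, 2):
        rho, p = spearmanr(counts_by_source[a], counts_by_source[b])
        results[(a, b)] = (rho, p)
    return results

# Toy illustration only, not the study's data:
demo = {
    "Scopus": [77, 71, 59, 56, 46],
    "WoS":    [53, 32, 34, 14, 9],
    "ACM":    [29, 16, 19, 34, 22],
}
for pair, (rho, p) in pairwise_spearman(demo).items():
    print(pair, round(rho, 3), round(p, 3))
```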
Figure: Number of publications per year and per database.
The citation counts are shown in the table below.
Citations received by the publications from the time of publication until May 2017, per database: total number of citations and average number of citations per paper.
Year | ACM total | ACM average per paper | Scopus total | Scopus average per paper | WoS total | WoS average per paper
---|---|---|---|---|---|---
2013 | 3,560 | 3.07 | 5,574 | 4.01 | 2,079 | 2.67
2014 | 2,168 | 2.24 | 3,746 | 2.64 | 1,412 | 1.66
2015 | 1,056 | 1.09 | 2,144 | 1.65 | 940 | 0.87
2016 | 319 | 0.38 | 623 | 0.46 | 236 | 0.25
Total | 7,103 | 1.80 | 12,087 | 2.21 | 4,667 | 1.27
We observe even larger differences when considering the h-index of the publication set retrieved by each database. Although Hirsch originally proposed the h-index for the publication lists of individual researchers, it can be computed for any set of publications.
In all three databases, more than 60% of the retrieved items were proceedings papers, as can be seen in the table below.
Document types per database, in absolute numbers and percentages.
Document types | ACM | % | Scopus | % | WoS | % |
---|---|---|---|---|---|---|
Proceedings papers | 2,727 | 69.3 | 3,383 | 62.0 | 2,228 | 60.7 |
Journal articles and reviews | 1,017 | 25.8 | 1,914 | 35.1 | 1,408 | 38.3 |
Books and proceedings | 168 | 4.3 | 26 | 0.5 | 0 | 0.0 |
Other | 25 | 0.6 | 137 | 2.5 | 37 | 1.0 |
Next, we examined which document types are cited more in each database and over time. The results are displayed in the table below.
Citations per document type, database and publication year.
Year | ACM # pubs | ACM total cits | ACM avg cits/paper | Scopus # pubs | Scopus total cits | Scopus avg cits/paper | WoS # pubs | WoS total cits | WoS avg cits/paper
---|---|---|---|---|---|---|---|---|---
**Proceedings papers** | | | | | | | | |
2013 | 843 | 2,572 | 3.05 | 893 | 2,674 | 2.99 | 452 | 435 | 0.96
2014 | 674 | 1,542 | 2.29 | 880 | 1,819 | 2.07 | 545 | 320 | 0.59
2015 | 675 | 792 | 1.17 | 792 | 1,046 | 1.32 | 701 | 247 | 0.35
2016 | 535 | 190 | 0.36 | 818 | 283 | 0.35 | 530 | 27 | 0.05
All years | 2,727 | 5,096 | 1.87 | 3,383 | 5,822 | 1.72 | 2,228 | 1,029 | 0.46
**Journal articles and reviews** | | | | | | | | |
2013 | 261 | 973 | 3.73 | 443 | 2,852 | 6.44 | 316 | 1,640 | 5.19
2014 | 229 | 575 | 2.51 | 509 | 1,892 | 3.72 | 298 | 1,083 | 3.63
2015 | 248 | 238 | 0.96 | 468 | 1,042 | 2.23 | 376 | 681 | 1.81
2016 | 279 | 126 | 0.45 | 494 | 328 | 0.66 | 418 | 205 | 0.49
All years | 1,017 | 1,912 | 1.88 | 1,914 | 6,114 | 3.19 | 1,408 | 3,609 | 2.56
Top-cited documents by database.
Rank | Author | Title | Source | Year | ACM cits | Scopus cits | WoS cits
---|---|---|---|---|---|---|---
**Top 3 in ACM** | | | | | | |
1 | Yuan et al. | Time-aware point-of-interest recommendation | SIGIR | 2013 | 68 | |
2 | Xiao et al. | Expanding the input expressivity of Smartwatches … | SIGCHI | 2014 | 52 | 55 |
3 | Panichella et al. | How to effectively use topic models for software engineering tasks? | ICSE | 2013 | 36 | 73 | 42
**Top 3 in Scopus** | | | | | | |
1 | Deng and Yu | Deep learning: Methods and applications | Found. and Trends in Signal Proc. | 2013 | 22 | 145 |
2 | Hussein et al. | Human action recognition using a temporal hierarchy… | IJCAI | 2013 | 26 | 89 |
3 | Brehmer and Munzner | A multi-level typology of abstract visualization tasks | IEEE Tr. Visualization | 2013 | 29 | 77 | 53
**Top 3 in WoS** | | | | | | |
1 | Leaman et al. | DNorm: disease name normalization | Bioinformatics | 2013 | | 60 | 56
2 | Brehmer and Munzner | A multi-level typology of abstract visualization tasks | IEEE Tr. Visualization | 2013 | 29 | 77 | 53
3a | Benetos et al. | Automatic music transcription … | IEEE Tr. Fuzzy Systems | 2013 | | | 34
3b | Saha et al. | Improving bug localization … | ASE 2013 | 2013 | | | 34
3c | Srivastava and Salakhutdinov | Multimodal learning with … | J. Machine Learning Res. | 2014 | 19 | 59 | 34
The most interesting finding of this exploratory study is the small overlap between the results retrieved by the databases, as can be seen in the figure below.
Figure: Overlap between the databases.
To highlight the differences in citation counts between the databases, the top three documents ranked by citation counts in each database are displayed in the table above.
To further our understanding of the differences between the databases, we took a closer look at the 977 publications that appeared in all three databases. In this subset, we can compare the citations received from each of the databases for each item.
The top-cited documents in this subset are listed in the table below, with their ranks and citation counts in each database.
The top-ten most cited documents in each of the databases.
First author | Abbrev. Title | Source | Year | Scopus rank | ACM rank | WoS rank | Scopus cits | ACM cits | WoS cits |
---|---|---|---|---|---|---|---|---|---|
Brehmer | A multi-level typology of abstract visualization tasks | IEEE Tr. Vis. and Comp. Graphics | 2013 | 1 | 3 | 1 | 77 | 29 | 53 |
Carreno | Analysis of user comments | ICSE | 2013 | 2 | 12 | 3 | 71 | 16 | 32 |
Srivastava | Multimodal learning with Deep Boltzmann Machines | J. Machine Learning Res. | 2014 | 3 | 10 | 2 | 59 | 19 | 34 |
Bordes | A semantic matching energy function | Machine Learning | 2014 | 4 | 1 | 30–32 | 56 | 34 | 14 |
Severyn | Learning to rank short text pairs | SIGIR 2015 | 2015 | 5 | 7 | 61–68 | 46 | 22 | 9 |
Suominen | Overview of the ShARe/CLEF eHealth evaluation lab 2013 | LNCS | 2013 | 6 | 128–162 | 17–18 | 46 | 3 | 18 |
Faro | The exact online string matching problem | ACM Comp. Surv. | 2013 | 7 | 16–18 | 15 | 44 | 14 | 20 |
Suarez-Tangil | Dendroid: A text mining approach | Expert Sys w. Apps | 2014 | 8 | 13–15 | 5 | 43 | 15 | 30 |
Ding | Collective matrix factorization | IEEE TVCG | 2014 | 9 | 2 | 10–11 | 42 | 30 | 23 |
Amadeo | Enhancing content-centric networking | Comp. Networks | 2013 | 10 | 21–26 | 4 | 42 | 12 | 31 |
Dit | Integrating inf. retrieval, execution and link analysis algorithms | Emp Soft. Eng. | 2013 | 11 | 6 | 12–13 | 41 | 23 | 22 |
Sleiman | A survey on region extractors from web documents | IEEE T. Knowledge and Data Eng. | 2013 | 16 | 8 | 19–23 | 31 | 20 | 17 |
Jones | Content-based retrieval of human actions | Inf. Sci. | 2013 | 22 | 16–18 | 6–9 | 28 | 14 | 26 |
Deng | A study of supervised term weighting scheme | Expert Sys w. Apps | 2014 | 32 | 19–20 | 10–11 | 35 | 13 | 23 |
Eickhoff | Increasing cheat robustness | Inf. Retr. | 2013 | 12–13 | 4 | 6–9 | 36 | 24 | 26 |
Campos | Survey of temporal information retrieval | ACM Comp. Surv. | 2014 | 12–13 | 9 | 149–183 | 36 | 20 | 4 |
Maleszka | A method for collaborative recommendation | Knowledge-Based Systems | 2013 | 19–21 | 277–432 | 6–9 | 29 | 1 | 26 |
Hofmann | Balancing exploration and exploitation | Information Retrieval | 2013 | 23–25 | 5 | 46–60 | 27 | 23 | 10 |
Li | A method for topological entity matching | Integrated Comp.-Aided Engineering | 2013 | 26–28 | 53–72 | 6–9 | 26 | 6 | 26 |
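The rank ranges in the table (e.g., "30–32") arise when several documents share the same citation count; a small sketch of how such tied-rank labels can be derived (an illustration, not the exact procedure used in the study):

```python
from collections import defaultdict

def rank_labels(citation_counts):
    """Map each citation count to a rank label; ties spanning several
    positions get a range label such as '30–32' (competition ranking)."""
    positions = defaultdict(list)
    for pos, c in enumerate(sorted(citation_counts, reverse=True), start=1):
        positions[c].append(pos)
    return {c: (str(p[0]) if len(p) == 1 else f"{p[0]}–{p[-1]}")
            for c, p in positions.items()}

labels = rank_labels([77, 59, 59, 56, 46, 46, 46])
# -> {77: '1', 59: '2–3', 56: '4', 46: '5–7'}
```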
The results are summarized in the table below.
General characteristics of the documents retrieved by all three databases.
Measure | Scopus | WoS | ACM | Mendeley
---|---|---|---|---
Sum of citations/reads | 3,951 | 2,254 | 1,558 | 15,838
No. cited/read docs | 644 | 507 | 434 | 910
% cited/read | 66 | 52 | 44 | 93
Average citations/reads | 6.13 | 4.45 | 3.59 | 17.40
Std citations/reads | 8.74 | 5.58 | 4.42 | 22.67
Median no. citations/reads | 3 | 2 | 2 | 10
Maximum no. citations/reads | 77 | 53 | 34 | 216
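Note that the averages, SDs, and medians in the table are computed over the cited/read documents only (e.g., for Scopus, 3,951 citations over 644 cited documents ≈ 6.13). A minimal sketch reproducing the summary rows from a per-document count list (hypothetical input; `counts` stands for the 977 citation or reader counts of one source):

```python
import statistics

def summarize(counts):
    """Summary rows as in the table above. Averages, SDs, and medians are
    taken over documents with at least one citation/reader, which matches
    the published figures (e.g., Scopus: 3,951 / 644 cited docs ~ 6.13)."""
    nonzero = [c for c in counts if c > 0]
    return {
        "sum": sum(counts),
        "cited_or_read_docs": len(nonzero),
        "pct_cited_or_read": round(100 * len(nonzero) / len(counts)),
        "average": round(statistics.mean(nonzero), 2),
        "std": round(statistics.stdev(nonzero), 2),
        "median": statistics.median(nonzero),
        "max": max(counts),
    }
```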
The distributions are heavily skewed, as can be seen from the huge standard deviations. Note that if an item is indexed in Mendeley it has at least one reader, and, quite amazingly, 93% of the documents in the dataset were saved to at least one Mendeley library. It is well-known that altmetric signals appear earlier than actual citations, but even if we limit the dataset to publications from 2013 and consider the citations accumulated over nearly 4 years (which should be sufficient to gather at least one citation), we see that the Mendeley counts are higher (see the table below).
General characteristics of the documents published in 2013 and retrieved by all three databases with Mendeley reader counts added.
Measure | Scopus | WoS | ACM | Mendeley
---|---|---|---|---
Sum of citations/reads | 1,795 | 1,051 | 701 | 4,831
No. cited/read docs | 195 | 170 | 150 | 230
% cited/read | 81 | 70 | 62 | 95
Average citations/reads | 9.21 | 6.18 | 4.67 | 21.00
Std citations/reads | 11.20 | 7.19 | 5.14 | 24.87
Median no. citations/reads | 5 | 4 | 3 | 13
Maximum no. citations/reads | 77 | 53 | 20 | 200
For these older documents, the share of cited/read items is higher in all four data sources, but Mendeley still has considerably higher counts than the three citation databases. The gap narrows because older articles have had more time to accrue citations.
Finally, we computed the Spearman correlations between pairs of data sources, both for the whole period (2013–2016, 977 docs) and for 2013 only (242 docs); see the table below.
Spearman correlations between the data sources for documents retrieved by all three citation databases and included in Mendeley—all years and in 2013 only.
All years | ACM | Scopus | WoS
---|---|---|---
Mendeley | 0.493 (n = 418) | 0.532 (n = 616) | 0.505 (n = 489)
ACM | | 0.735 (n = 410) | 0.581 (n = 349)
Scopus | | | 0.857 (n = 489)

2013 only | ACM | Scopus | WoS
---|---|---|---
Mendeley | 0.550 (n = 147) | 0.670 (n = 187) | 0.574 (n = 164)
ACM | | 0.776 (n = 145) | 0.652 (n = 135)
Scopus | | | 0.907 (n = 165)

Here, n is the number of documents included in each pairwise comparison.
We see that all the correlations are significant: medium-high to high between Scopus, WoS, and ACM, and of medium strength between Mendeley and each of the databases. This finding is supported by previous studies (e.g., Haustein et al.).
As stated in the introduction, we expected the ACM Digital Library to have the best coverage; however, this assumption was shown to be wrong, as Scopus had the highest number of publications, the highest number of citations, and the highest average number of citations per paper. This differs from the finding for business administration, where the subject-specific database had the best coverage (Clermont and Dyckhoff).
The major goal of this paper was to highlight the importance of coverage for comprehensive data retrieval. Coverage is one of the parameters in information retrieval evaluation (Cleverdon).
Coverage also has a direct impact on citations. The fairest comparison is the average number of citations per paper, and here the picture is less clear: for proceedings papers (a major document type in computer science), the highest average citations per paper is in ACM (1.87), closely followed by Scopus (1.72), while WoS lags far behind (0.46). For journal articles, on the other hand, Scopus is highest (3.19), followed by WoS (2.56), with ACM third at 1.88 average citations per article (see the citations per document type table above).
We also studied Mendeley reader counts for the set of 977 items covered by all three citation databases. The number of readers is considerably higher than the number of citations received, both for the whole dataset (about three times the average Scopus citation count) and for papers published in 2013 (about twice as high), where citations have had more time to catch up with reader counts. It should also be noted that even items not cited in ACM, Scopus, or WoS have readers on Mendeley. When comparing the citation counts from the three databases per paper, the highest correlation is between Scopus and WoS, around 0.9 both for all years and for 2013 only. The reader–citation correlations are around 0.5, in line with previous studies (e.g., Haustein et al.).
The results emphasize the need to search in multiple databases in order to increase recall, as recommended by previous studies (e.g., Ramos-Remus et al.).
The study is exploratory in nature and has its limitations. It should be extended to try to understand the meaning of these differences, i.e., why each database tells a different story. A single query is not enough for far-reaching conclusions, but it is enough to raise interest in exploring the issue further. In addition, the relevance of the retrieved documents should be assessed; in the current study, we relied on the databases and did not check relevance manually. The query was not intended to cover IR as a topic; it serves as a demonstration of the differences between the databases and also shows that altmetrics (in this case, Mendeley reader counts) provide additional insights, such as what the users of Mendeley, who are not all citers, are interested in.
This is a single-authored paper. JI is responsible for all parts of the work.
The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Mendeley data were gathered with Webometric Analyst, provided by Mike Thelwall. This paper extends a preliminary version presented as a short paper at BIRNDL'17 (Bar-Ilan).