AUTHOR=Kanakia Anshul, Wang Kuansan, Dong Yuxiao, Xie Boya, Lo Kyle, Shen Zhihong, Wang Lucy Lu, Huang Chiyuan, Eide Darrin, Kohlmeier Sebastian, Wu Chieh-Han
TITLE=Mitigating Biases in CORD-19 for Analyzing COVID-19 Literature
JOURNAL=Frontiers in Research Metrics and Analytics
VOLUME=Volume 5 - 2020
YEAR=2020
URL=https://www.frontiersin.org/journals/research-metrics-and-analytics/articles/10.3389/frma.2020.596624
DOI=10.3389/frma.2020.596624
ISSN=2504-0537
ABSTRACT=At the behest of the Office of Science and Technology Policy in the White House, six institutions, including ours, have created an open research dataset called CORD-19 to facilitate the development of question-answering systems that can assist researchers in finding relevant research on COVID-19. As of May 27th, 2020, CORD-19 includes more than 100,000 open access publications from major publishers and PubMed, as well as preprint articles deposited into medRxiv and bioRxiv. As CORD-19 is a small sample of the vast relevant literature, it inevitably contains sampling biases. To overcome these biases, the statistical measures used in this study are smoothed by augmenting CORD-19 with its citation network. In total, three expanded sets are created for the analyses: (1) the enclosure set CORD-19E, composed of CORD-19 articles and their references and citations, mirroring the methodology used in the renowned “A Century of Physics” analysis, (2) the full closure graph CORD-19C, which recursively includes references starting with CORD-19, and (3) the inflection closure CORD-19I, a much smaller subset of CORD-19C that is nonetheless already suitable for statistical analysis, based on the theory of the scale-free nature of the citation network. Taken together, all these expanded datasets show much smoother trends when used to analyze global COVID-19 research.
The results suggest that, while CORD-19 exhibits a strong tilt towards recent and highly focused articles, the knowledge being explored to attack the pandemic encompasses a much longer time span and is very interdisciplinary. A question-answering system with such extended knowledge may perform better in understanding the literature and answering related questions. Still, the collaboration patterns, especially in terms of team sizes and geographical distributions, are more resilient to sampling biases and are already captured well in CORD-19, as its raw statistics and trends agree with those from the larger datasets.
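
The enclosure and closure constructions described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the dictionary-based graph representation, and the toy paper identifiers are all assumptions made for illustration. The inflection closure is left out, since its stopping rule is not given here.

```python
from collections import deque


def enclosure(seed, references):
    """Seed papers plus their direct references and direct citers.

    `references` maps a paper id to the ids it cites (assumed layout).
    """
    seed = set(seed)
    result = set(seed)
    for paper, refs in references.items():
        if paper in seed:
            result.update(refs)      # papers that the seed set cites
        elif seed.intersection(refs):
            result.add(paper)        # papers that cite the seed set
    return result


def reference_closure(seed, references):
    """Seed papers plus everything reachable by following references."""
    seen = set(seed)
    queue = deque(seen)
    while queue:
        paper = queue.popleft()
        for ref in references.get(paper, ()):
            if ref not in seen:
                seen.add(ref)
                queue.append(ref)
    return seen


# Toy graph with hypothetical identifiers: paper -> papers it cites.
toy_references = {
    "A": ["B"],
    "B": ["C"],
    "C": [],
    "D": ["A"],
    "E": ["F"],
}
toy_seed = {"A"}

print(sorted(enclosure(toy_seed, toy_references)))
print(sorted(reference_closure(toy_seed, toy_references)))
```

In this toy graph the enclosure picks up the one paper that cites the seed as well as the seed's own reference, while the closure follows references transitively and ignores citers. That is the distinction the abstract draws between CORD-19E and CORD-19C.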