Estimation of Local Novel Coronavirus (COVID-19) Cases in Wuhan, China from Off-Site Reported Cases and Population Flow Data from Different Sources

In December 2019, novel coronavirus disease (COVID-19) emerged in Wuhan, Hubei Province, China, and spread to the rest of China and overseas. The emergence of this virus coincided with the Spring Festival Travel Rush in China. The total number of COVID-19 cases in Wuhan by 23 January 2020 can be estimated from the cases reported in other cities/regions and the population flow data between Wuhan and these cities/regions. We built a model to estimate the total number of COVID-19 cases in Wuhan by 23 January 2020, based on the number of cases detected outside Wuhan city in China, with the assumption that cases exported from Wuhan were less likely to be underreported in other cities/regions. We employed population flow data from different sources between Wuhan and other cities/regions by 23 January 2020. The total number of cases in Wuhan was determined by maximum likelihood estimation and Akaike Information Criterion (AIC) weights. We estimated 8 679 (95% CI: 7 701, 9 732) total COVID-19 cases in Wuhan by 23 January 2020, based on the combined data from Tencent and Baidu. The source of population flow data affects the estimate of the total number of COVID-19 cases in Wuhan before the city lockdown. A comprehensive analysis based on different data sources is needed to overcome the bias of any single source.


INTRODUCTION
In December 2019, a cluster of patients with pneumonia of unknown cause was reported in Wuhan, Hubei Province, China [1]. On 9 January 2020, a novel coronavirus, named SARS-CoV-2, was identified as the cause of this outbreak [2]. The emergence of this virus coincided with the Spring Festival Travel Rush in China. It was estimated that around 3 billion trips would be made in China between 10 January and 18 February 2020 [3]. Some researchers pointed out the risk of regional and global spread of the disease during the Spring Festival Travel Rush [4]. However, because few severe cases had been reported by mid-January and most cases were linked to the Huanan Seafood Market in Wuhan, neither international nor regional travel restrictions were imposed on Wuhan at the early stage of this outbreak. On 13 January 2020, the first case exported from Wuhan was reported in Thailand, and case numbers increased dramatically after diagnostic kits became available in mid-January. As of 13 July 2020, there were 85 560 laboratory-confirmed cases and 4 648 deaths (58.8 and 83.2% in Wuhan, respectively) [5]. In recognition of the widespread outbreak, the government suspended all public transportation inside Wuhan from 23 January 2020, and some regional travel restrictions were also implemented by other cities/regions [6].

Objective
At the early stage of this outbreak, cases might have been severely underreported due to the lack of diagnostic kits and insufficient screening of suspected cases [7,8]. Several efforts have been made to estimate the COVID-19 case numbers in Wuhan using different modeling approaches, and the estimates range from 4 000 to 75 815 during the period of 18-29 January [8-10].
In this study, we aimed to estimate the number of COVID-19 cases in Wuhan, based on the cases exported from Wuhan to other cities/regions in mainland China and different sources of population flow data between Wuhan and these cities/regions. We tested the impact of different sources of population flow data on the estimated number of cases in Wuhan before the city lockdown, and combined the sources to overcome the bias of any single source. The estimates were made as of 23 January 2020 (before the suspension of public transportation in Wuhan). We assumed that cases exported from Wuhan were less likely to be underreported in other cities/regions in mainland China, as stringent temperature screening was implemented at airports and railway stations.

Data
We obtained the daily numbers of inbound and outbound domestic passengers traveling by air, train or road to/from Wuhan from two data sources:
(1) Tencent's LBS (location-based services) database (see: https://heat.qq.com/). Based on the location data of users of Tencent's mobile software, population flow between 10 December 2016 and 24 January 2017 was generated between Wuhan and 24 cities/regions in China (Anhui, Beijing, Chongqing, Fujian, Gansu, Guangdong, Guangxi, Guizhou, Hainan, Hebei, Heilongjiang, Henan, Hunan, Jiangsu, Jiangxi, Jilin, Liaoning, Ningxia, Shandong, Shanghai, Sichuan, Tianjin, Yunnan, Zhejiang). We assumed that the amount of population flow in 2017 is the same as that in 2020.
(2) Baidu map database (see: https://qianxi.baidu.com/). Based on the location data of users of Baidu's mobile software, population flow from 1 to 20 January 2020 was generated between Wuhan and 26 cities/regions (Shanxi, Shaanxi, and the 24 cities/regions covered by the Tencent data).
For each source separately, we divided the total population flow evenly over the days covered to obtain the average daily population flow. Figure 1 shows the geographical location of the cities/regions that reported COVID-19 cases and the number of COVID-19 cases in each city/region.
As shown in Table S1, we collected the total numbers of reported COVID-19 cases exported from Wuhan to other cities in China by 23 January 2020; cases that were not exported from Wuhan (e.g., family or hospital clusters) were excluded from the analysis [11]. Thirteen cases were excluded due to the lack of travel history to Hubei before illness onset. For the 161 cases with unspecified travel history, we assumed that the probability of a single case being an exported case is θ and that cases are independent of each other, so the number of exported cases among these unspecified cases follows a binomial distribution. As shown in Equation (1), θ represents the probability that a case is exported from Wuhan and n is the number of COVID-19 cases with unspecified travel history, which is 161. P represents the probability that k out of n COVID-19 cases came from Wuhan. Since most cases detected outside Wuhan by 23 January 2020 were exported from Wuhan [12], we considered different levels of θ (1, 0.9, 0.8). We then obtained the expected number of cases exported from Wuhan in city/region i. Table S3 presents the general process of the whole method. The total number of COVID-19 cases exported from Wuhan and diagnosed in each city/region outside Wuhan by 23 January 2020 was assumed to follow a binomial distribution [8], as in Equation (2), where λ is the total number of cases infected in Wuhan by 23 January and p_i is the probability of detecting any exported cases from Wuhan in city/region i outside Wuhan in China.
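The treatment of cases with unspecified travel history can be illustrated with a minimal Python sketch. It assumes the binomial form described for Equation (1) and takes the expected number of exported cases in a city/region as the confirmed exported cases plus θ times the unspecified cases. The function names and the per-city counts are hypothetical and are not taken from Table S1.

```python
from math import comb


def binomial_pmf(k: int, n: int, theta: float) -> float:
    """P(k of n unspecified cases were exported from Wuhan) under a binomial model."""
    return comb(n, k) * theta**k * (1.0 - theta) ** (n - k)


def expected_exported(confirmed_exported: int, unspecified: int, theta: float) -> float:
    """Confirmed exported cases plus the binomial mean (unspecified * theta)."""
    return confirmed_exported + unspecified * theta


N_UNSPECIFIED = 161  # total number of unspecified cases stated in the text

# Expected number of exported cases among all unspecified cases, per theta level.
for theta in (1.0, 0.9, 0.8):
    print(theta, N_UNSPECIFIED * theta)

# Hypothetical city: 10 confirmed exported cases and 5 with unspecified history.
for theta in (1.0, 0.9, 0.8):
    print(theta, expected_exported(10, 5, theta))
```

With θ = 1 every unspecified case counts as exported, so lowering θ only shrinks the per-city counts that feed the later likelihood.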

Number of cases exported from Wuhan and detected in city/region $i$: $n_i \sim \mathrm{Binomial}(\lambda, p_i)$ (2)
The probability p_i is obtained by dividing the daily number of outbound passengers from Wuhan to city/region i by the population served by the Wuhan airport, railway and roads, and multiplying by the mean time from infection to detection, see Equation (3). We then used the cases exported from Wuhan to estimate the total number of COVID-19 cases infected in Wuhan (λ). Based on the data obtained from each city/region, we obtained λ by maximum likelihood estimation. In Equation (4), l(·) and L(·) are the total log-likelihood and the total likelihood, respectively, and f(·) is the probability mass function of the binomial distribution (Equation 2). Here k is the total number of cities/regions included in the estimation, n_i is the number of cases exported from Wuhan and detected in city/region i, and p_i is the probability of finding any exported cases from Wuhan in city/region i. After obtaining λ, the 95% confidence interval (95% CI) can be calculated from the log-likelihood, l, since twice the difference between the maximum log-likelihood and the log-likelihood at a given λ approximately follows a chi-square distribution [13]. From this we derived a 95% CI for the total number of COVID-19 cases infected in Wuhan. As in Equation (5), by subtracting the number of exported cases from the total number of cases infected in Wuhan, we obtained the final estimate of the total number of COVID-19 cases in Wuhan as of 23 January 2020.
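The estimation steps described above can be sketched in Python as follows. This is an illustration under stated assumptions and not the authors' code: the likelihood is maximised by a simple grid search over integer λ, the 95% CI is taken as all λ whose log-likelihood lies within 1.92 (half of the chi-square quantile 3.84 with one degree of freedom) of the maximum, and the outflow and case counts are hypothetical. The catchment population (19 million) and the 10-day delay are the values assumed in the text.

```python
import math

CATCHMENT = 19_000_000  # catchment population assumed in the text
DELAY_DAYS = 10         # mean infection-to-detection delay assumed in the text
HALF_CHI2_95 = 1.92     # half of the 95% chi-square quantile (3.84, 1 d.f.)


def detection_prob(daily_outflow: float) -> float:
    """p_i: daily outflow to region i / catchment population * mean delay."""
    return daily_outflow / CATCHMENT * DELAY_DAYS


def binom_logpmf(n: int, lam: int, p: float) -> float:
    """log f(n | lam, p) for a Binomial(lam, p) count."""
    if n > lam:
        return -math.inf
    log_comb = math.lgamma(lam + 1) - math.lgamma(n + 1) - math.lgamma(lam - n + 1)
    return log_comb + n * math.log(p) + (lam - n) * math.log1p(-p)


def total_loglik(lam, cases, probs):
    """Sum of the per-region binomial log-likelihoods."""
    return sum(binom_logpmf(n, lam, p) for n, p in zip(cases, probs))


def fit_lambda(cases, probs, lam_max=50_000):
    """Grid-search MLE of lambda and a likelihood-based 95% interval."""
    lams = range(max(cases), lam_max + 1)
    lls = [total_loglik(lam, cases, probs) for lam in lams]
    best = max(lls)
    lam_hat = lams[lls.index(best)]
    inside = [lam for lam, ll in zip(lams, lls) if ll >= best - HALF_CHI2_95]
    return lam_hat, inside[0], inside[-1]


# Hypothetical inputs: average daily outflow and exported cases for three regions.
daily_outflow = [20_000, 10_000, 5_000]
exported_cases = [50, 28, 12]

probs = [detection_prob(flow) for flow in daily_outflow]
lam_hat, ci_low, ci_high = fit_lambda(exported_cases, probs)

# Cases remaining in Wuhan: infected in Wuhan minus exported cases.
cases_in_wuhan = lam_hat - sum(exported_cases)
print(lam_hat, (ci_low, ci_high), cases_in_wuhan)
```

A grid search is used here only because it keeps the sketch dependency-free; any one-dimensional optimiser would serve the same purpose.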

Cases in Wuhan = Cases infected in Wuhan − Cases exported from Wuhan (5)

We assumed a population of 19 million (catchment population) traveling through the airport, railway stations and highways in Wuhan, and a 10-day delay on average, which accounted for the reported time interval between infection and case detection [8]. Since exported cases were far fewer than those in Wuhan as of 23 January 2020, it was assumed that all cases in other cities/regions outside Wuhan were detected. If cases in other cities/regions were missed, our estimate would underestimate the actual number of cases in Wuhan. In addition, we assumed that all passengers leaving Wuhan were equally likely to be infected, whether transfer passengers or local residents, as passengers may be at high risk of infection while traveling in train and airplane cabins.

To overcome the bias from different sources of data, we first evaluated the correlation between the two datasets to determine whether there is an apparent inconsistency or discrepancy between them. We found that the Spearman's rank correlation coefficient of the Baidu and Tencent data for the same 24 cities/regions is 0.75, which indicates that the two data sources are significantly correlated at the 99.99% confidence level. We assumed a linear relationship between the Baidu data and the Tencent data (see Figure 2). For all observations, we assumed that the error terms are independent of each other, follow a normal distribution and have the same variance. We then built the linear model (Equation 6) and tested the null hypothesis H0 that α = 0. In Equation (6), N_Baidu and N_Tencent represent the population flow numbers from Baidu and Tencent.
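The consistency check between the two data sources can be illustrated with a short pure-Python sketch that computes Spearman's rank correlation and an ordinary least-squares line. The flow values are hypothetical, and regressing the Baidu flows on the Tencent flows (slope α, intercept β) is an assumption made for illustration; the exact form of Equation (6) is not reproduced here.

```python
import math


def ranks(xs):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    result = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def ols(xs, ys):
    """Least-squares fit y = alpha * x + beta; returns (alpha, beta)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    alpha = sxy / sxx
    return alpha, my - alpha * mx


# Hypothetical average daily flows for five regions (not the study data).
n_tencent = [12_000, 8_000, 30_000, 5_000, 15_000]
n_baidu = [2_500, 2_100, 4_200, 1_700, 2_600]

rho = spearman(n_tencent, n_baidu)
alpha, beta = ols(n_tencent, n_baidu)  # assumed direction: Baidu on Tencent
print(rho, alpha, beta)
```

The significance tests reported in the text (the P-value of the correlation and the F-test on α) would need a statistics library and are left out of this sketch.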
The estimated coefficient α was 0.10, β was 1 272, and the P-value of the F-test was <0.01. We therefore rejected the null hypothesis H0 at the 99% confidence level, which suggests that the two sources of data are likely to have a linear relationship. Since both sets of data are likely to be reasonable, we then applied the Akaike Information Criterion (AIC) [14] to test the fit of the number of cases exported from Wuhan and detected in city/region i, n_est_i, which follows a binomial distribution (Equation 2), based on Baidu and Tencent data, see Equation (7). To estimate the number of cases exported from Wuhan, the model used the estimated total number of COVID-19 cases infected in Wuhan (λ) from Equation (4). p_i is the probability of finding any exported cases from Wuhan in city/region i outside Wuhan in China, which was obtained from Equation (3). Please note that only the 24 cities/regions for which both Baidu and Tencent have population flow data were included in the AIC weight calculation.
Since the Baidu and Tencent data show a significant linear relationship, which corroborates that the general pattern of the data is reasonable, we weighted (Equation 8) and combined (Equation 9) the estimated numbers of cases from Baidu and Tencent based on their AIC values to obtain the final estimate.
In Equations (8) and (9), W_s and AIC_s represent the weight of the estimated number of COVID-19 cases infected in Wuhan and the AIC value for source s, respectively. λ_s is the estimate of the total number of cases from Equation (4) based on source s, and λ is the final estimate of the total number of cases infected in Wuhan by 23 January 2020.
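The weighting and combination step can be sketched in Python as below. The sketch assumes the standard Akaike weight formula, W_s = exp(−Δ_s/2) / Σ_j exp(−Δ_j/2) with Δ_s = AIC_s − min AIC, and a weighted sum of the source-specific estimates; whether Equations (8) and (9) take exactly this form is an assumption. The two λ_s values are the single-source estimates reported in this paper, while the AIC values are hypothetical placeholders.

```python
import math


def aic(log_lik: float, n_params: int = 1) -> float:
    """Akaike Information Criterion: 2k - 2 * log-likelihood."""
    return 2 * n_params - 2 * log_lik


def akaike_weights(aics: dict) -> dict:
    """Standard Akaike weights computed from AIC differences to the best model."""
    best = min(aics.values())
    relative = {s: math.exp(-0.5 * (a - best)) for s, a in aics.items()}
    total = sum(relative.values())
    return {s: r / total for s, r in relative.items()}


def combine(estimates: dict, weights: dict) -> float:
    """Weighted sum of the source-specific estimates of lambda."""
    return sum(weights[s] * estimates[s] for s in estimates)


# Single-source estimates reported in this paper.
lambda_by_source = {"Tencent": 4672, "Baidu": 12950}

# Hypothetical AIC values, for illustration only (not those of Table 1).
aic_by_source = {"Tencent": 250.0, "Baidu": 250.2}

weights = akaike_weights(aic_by_source)
combined = combine(lambda_by_source, weights)
print(weights, round(combined))
```

Because the weights depend only on AIC differences, two sources with similar fit receive nearly equal weight and the combined estimate falls close to the midpoint of the two single-source estimates.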

RESULT
Based on the data sourced from Tencent and Baidu, we estimated the total number of cases in Wuhan, λ (Figure 3), and then the 95% CI of the total number of COVID-19 cases. We estimated 4 672 (95% CI: 4 129, 5 257) and 12 950 (95% CI: 11 510, 14 502) total cases in Wuhan by 23 January 2020, based on Tencent and Baidu population flow data, respectively. In addition, based on the AIC weighting (Table 1), we combined the results from Baidu and Tencent and estimated 8 679 (95% CI: 7 701, 9 732) total cases in Wuhan. Table 2 presents the estimates under different data sources and different levels of θ, the probability of an unspecified case reported in other cities/regions being an exported case from Wuhan. Table S2 presents the probability of finding any cases for each city/region outside Wuhan.

DISCUSSION
A recent study by Imai et al. estimated a total of 4 000 (95% CI: 1 000-9 700) cases on 18 January 2020 [8]. Compared with the 495 confirmed cases reported by the government as of 23 January 2020 [15], Imai et al. estimated around 8 times as many cases before 23 January [8]. This is partly because screening of travelers from Wuhan in other cities was much more effective than local screening in Wuhan, owing to the worsening situation there.
Estimates based on the combination of Baidu and Tencent data were closer to Imai et al.'s result [8] than to the official report [15]. Our model is largely similar to that of Imai et al. [8]. The difference between the two models is that we estimated the total number of cases in Wuhan based on separate data for each city/region in China. In addition, we applied maximum likelihood estimation by calculating the log-likelihood value for each city/region, whereas Imai et al. [8] obtained their estimate from aggregate overseas data, applying maximum likelihood estimation by calculating a simple ratio. In the sensitivity analysis, Table 2 shows that when the probability of an unspecified case reported in other cities/regions being an exported case from Wuhan is close to 1, slight fluctuations in this probability have little impact on the estimate. Estimates of the population outflow provided by Baidu and Tencent differ substantially, leading to significantly different results. We found that the Baidu and Tencent data show a significant linear relationship, which means that the patterns of the two data sources are largely consistent. One possible reason for the discrepancy is that different institutions define the population flow from one city to another differently. Methods that include people who travel to other cities through Wuhan in the population flow may yield a much larger figure than those that count only people who originally depart from Wuhan. Multiple round trips may also affect the count. Another possible reason is that Baidu and Tencent cannot track the entire population flow, since not everyone uses mobile software from Baidu and Tencent.
Imai et al. suggested that by further improving the definition and testing of COVID-19 cases, and further expanding the scope of epidemic monitoring, the gap between the estimated number and officially reported cases would be further narrowed. According to our results, population flow statistics also play a significant role in the estimation. At present, many studies use data from the Baidu and Tencent platforms [10,16-18]. Among them, Tian et al. [18] referred to different data sources to gain a more comprehensive measure of movement volume. According to the data presented in their article, Tian et al. [18] integrated the population flow data first and then conducted the relevant calculation and analysis. In this paper, we calculated estimates from each data source first and then weighted the results. Both methods provide more reasonable results that lie between the conclusions generated from either data source alone, helping to overcome the bias from different sources.

CONCLUSIONS
Different sources of population flow data affect the estimates of the total number of COVID-19 cases in Wuhan before the city lockdown. We built a reproducible model that employs incompatible sets of population flow data to estimate the number of COVID-19 cases more reasonably. We estimated 8 679 (95% CI: 7 701, 9 732) total COVID-19 cases in Wuhan by 23 January 2020, based on the combined data from Tencent and Baidu. Which data source yields the most reliable estimate is not yet clear, though estimates based on a single source of data are likely to be biased. A comprehensive analysis based on different statistics is needed before we reach any conclusions.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. These data can be found here: https://github.com/Larryzza/COVID-19.