Estimating the Serial Interval of the Novel Coronavirus Disease (COVID-19): A Statistical Analysis Using the Public Data in Hong Kong From January 16 to February 15, 2020

Background: The emerging virus, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has caused a large outbreak of novel coronavirus disease (COVID-19) since the end of 2019. As of February 15, 2020, 56 COVID-19 cases had been confirmed in Hong Kong since the first case with symptom onset on January 23, 2020. Methods: Based on the publicly available surveillance data in Hong Kong, we identified 21 transmission events as of February 15, 2020. An interval censored likelihood framework was adopted to fit three candidate distributions (Gamma, Weibull, and lognormal) that govern the serial interval (SI) of COVID-19. We selected the distribution according to the Akaike information criterion corrected for small sample size (AICc). Findings: We found that the lognormal distribution performed slightly better than the other two distributions in terms of the AICc. Assuming a lognormal distribution model, we estimated the mean of SI at 4.9 days (95% CI: 3.6–6.2) and SD of SI at 4.4 days (95% CI: 2.9–8.3) by using the information of all 21 transmission events. Conclusion: The SI of COVID-19 may be shorter than the preliminary estimates in previous works. Given the likelihood that SI could be shorter than the incubation period, pre-symptomatic transmission may occur, and extra efforts on timely contact tracing and quarantine are crucially needed in combating the COVID-19 outbreak.


INTRODUCTION
The coronavirus disease 2019 (COVID-19) is caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2, formerly known as the "2019-nCoV"), which emerged at the end of 2019 [1][2][3][4][5]. COVID-19 cases were soon exported to many Chinese cities and overseas [6], and the travel-related risk of disease spread was suggested in previous studies [4,7-9]. The risks of rapid spreading were evaluated based on the early surveillance data and also compared to other previous respiratory infectious diseases [5,10-14]. Since the first confirmed imported case in Hong Kong on January 23 [15], the local government has implemented a series of control and prevention measures for COVID-19, including enhanced border screening and traffic restrictions [16,17].
The COVID-19 pandemic has affected most regions around the world, including places with less developed healthcare systems. Hong Kong was the hardest hit region in the severe acute respiratory syndrome (SARS) outbreaks in 2003 [18,19], and thus it is expected to be better prepared to mitigate emerging infectious disease outbreaks [20]. The experience in Hong Kong may serve as an example for other regions, in particular less developed places with resource-poor settings [21][22][23][24]. As of February 15, there were 56 COVID-19 cases confirmed in Hong Kong [16], and local transmission was also recognized by the contact tracing investigation. Given the risk of human-to-human transmission, the serial interval (SI), which refers to the time interval from illness onset in a primary case (i.e., infector) to that in a secondary case (i.e., infectee) [25][26][27][28], is of interest because it reflects how quickly successive generations of COVID-19 transmission follow one another. The SI can be used to assist strategic decision-making of public health policies and to construct analytical frameworks for studying the transmission dynamics of SARS-CoV-2.
In this study, we examined the publicly available materials released by the Center for Health Protection (CHP) of Hong Kong. Adopting the case-ascertained design [29], we identified the transmission chain from index cases to secondary cases. We estimated the SI of COVID-19 based on 21 identified transmission chains from the surveillance data and contact tracing data in Hong Kong.

DATA AND METHODS
As of February 15, there were 56 confirmed COVID-19 cases in Hong Kong [16], which followed the case definition in the official diagnostic protocol released by the World Health Organization (WHO) [30]. To identify the pairs of infector (i.e., primary case) and infectee (i.e., secondary case), we reviewed all press releases issued by the CHP of Hong Kong between January 16 and February 15, 2020 [17]. The exact symptom onset dates of all individual patients were publicly released by the CHP [16] and were used to match each transmission chain. For those infectees associated with multiple infectors, we recorded the range of onset dates of all associated infectors, i.e., the lower and upper bounds. With all publicly available information from the CHP, we constructed the transmission events by subjectively screening the exposure link between consecutive COVID-19 infections. We identified 21 transmission events, including 12 infectees matched with only one infector, that were used for SI estimation. Note that all 21 transmission events occurred in Hong Kong, and most of the cases involved Hong Kong residents.
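The bookkeeping described above can be sketched as follows. This is a minimal illustration and not the authors' code: the `TransmissionEvent` class, the `build_event` helper, and the dates are hypothetical and are not taken from the CHP records.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass
class TransmissionEvent:
    """One infectee together with the onset-date range of its candidate infectors."""
    infector_onset_low: date   # earliest onset among the candidate infectors
    infector_onset_up: date    # latest onset among the candidate infectors
    infectee_onset: date       # observed onset of the infectee
    n_infectors: int           # number of candidate infectors

    @property
    def is_pair(self) -> bool:
        # An "infector-infectee" pair has exactly one candidate infector.
        return self.n_infectors == 1

    def si_bounds_days(self) -> Tuple[int, int]:
        """Shortest and longest serial interval (days) compatible with the record."""
        shortest = (self.infectee_onset - self.infector_onset_up).days
        longest = (self.infectee_onset - self.infector_onset_low).days
        return shortest, longest


def build_event(infector_onsets: List[date], infectee_onset: date) -> TransmissionEvent:
    """Collapse the onset dates of all candidate infectors into lower and upper bounds."""
    if not infector_onsets:
        raise ValueError("at least one candidate infector is required")
    return TransmissionEvent(
        infector_onset_low=min(infector_onsets),
        infector_onset_up=max(infector_onsets),
        infectee_onset=infectee_onset,
        n_infectors=len(infector_onsets),
    )


# Hypothetical records, for illustration only.
cluster = build_event([date(2020, 1, 25), date(2020, 1, 28)], date(2020, 2, 1))
pair = build_event([date(2020, 1, 30)], date(2020, 2, 2))
print(cluster.si_bounds_days(), pair.is_pair)
```

For a pair, the two bounds coincide and the observed SI is a single value; for an infectee with several candidate infectors, only a range of possible SIs is known.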
Following previous studies [25], we adopted a distribution function with mean µ and standard deviation (SD) σ, denoted by g(·|µ, σ), to govern the distribution of SI. We considered three different distributions for g(·|µ, σ): Gamma, Weibull, and lognormal. The interval censored likelihood [31], denoted by L_0, of SI estimates is defined in Equation (1). In practical analyses of the serial interval (as well as the incubation period), observations are typically integers, while the population mean can be a real value.
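Based on the definitions that follow, one form that Equation (1) could take is sketched here. This is a reconstruction under assumptions rather than a quotation of the source: the integration variable t (the unknown onset time of the infector) and the product over infectees i are notational choices made for this sketch.

```latex
\begin{equation}
L_0(\mu, \sigma) = \prod_{i} \int_{T_i^{\mathrm{low}}}^{T_i^{\mathrm{up}}}
  g\!\left(\tau_i - t \mid \mu, \sigma\right) h(t)\, \mathrm{d}t
\end{equation}
```

Under this form, when an infectee has a single infector the bounds coincide and the contribution of that infectee reduces to the density g evaluated at the observed SI.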
The h(·) was the probability density function (PDF) of exposure following a uniform distribution with a range from T^low to T^up. The terms T_i^low and T_i^up denoted the lower and upper bounds, respectively, for the range of onset dates of multiple infectors linked to the i-th infectee. The τ_i was the observed onset date of the i-th infectee. Hence, the likelihood function in Equation (1) can be interpreted as the probability of the SI being observed with uncertain onset dates of infectors but a fixed onset date of the infectee [25,31]. We calculated the maximum likelihood estimates of µ and σ. Their 95% confidence intervals (95% CIs) were calculated by using the profile likelihood estimation framework with a cutoff threshold determined by a Chi-square quantile [32]. We selected the distribution of g(·|µ, σ) according to the Akaike information criterion corrected for small sample size, denoted by AICc. We employed both Pearson's correlation and the coefficient of determination, i.e., R-squared, to measure the goodness-of-fit of the selected model.
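A minimal sketch of an interval censored likelihood of this kind is given below for the lognormal case. It rests on several assumptions that are not taken from the source: the function names are hypothetical, the toy events are invented, and a coarse grid search stands in for whatever optimizer was actually used. Because h(·) is uniform, the integral over the infector's onset date reduces to a difference of two CDF values, so only the standard library is needed.

```python
import math


def lognormal_params(mean, sd):
    """Convert a mean and SD on the day scale to log-scale (meanlog, sdlog)."""
    sdlog2 = math.log(1.0 + (sd / mean) ** 2)
    return math.log(mean) - 0.5 * sdlog2, math.sqrt(sdlog2)


def lognormal_pdf(x, mean, sd):
    if x <= 0:
        return 0.0
    meanlog, sdlog = lognormal_params(mean, sd)
    z = (math.log(x) - meanlog) / sdlog
    return math.exp(-0.5 * z * z) / (x * sdlog * math.sqrt(2.0 * math.pi))


def lognormal_cdf(x, mean, sd):
    if x <= 0:
        return 0.0
    meanlog, sdlog = lognormal_params(mean, sd)
    z = (math.log(x) - meanlog) / (sdlog * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


def log_likelihood(events, mean, sd):
    """events: iterable of (t_low, t_up, tau), all in days on a common time axis."""
    total = 0.0
    for t_low, t_up, tau in events:
        if t_up == t_low:
            # Single infector: the SI is observed exactly.
            contrib = lognormal_pdf(tau - t_low, mean, sd)
        else:
            # Uniform h on [t_low, t_up]: the integral of g(tau - t) h(t) dt
            # equals (G(tau - t_low) - G(tau - t_up)) / (t_up - t_low).
            contrib = (lognormal_cdf(tau - t_low, mean, sd)
                       - lognormal_cdf(tau - t_up, mean, sd)) / (t_up - t_low)
        if contrib <= 0.0:
            return float("-inf")
        total += math.log(contrib)
    return total


def grid_mle(events, means, sds):
    """Return (log-likelihood, mean, sd) at the best point of a coarse grid."""
    return max((log_likelihood(events, m, s), m, s) for m in means for s in sds)


# Invented toy events (t_low, t_up, tau), NOT the Hong Kong data.
toy_events = [(0.0, 0.0, 3.0), (0.0, 2.0, 6.0), (1.0, 1.0, 5.0),
              (2.0, 5.0, 7.0), (3.0, 3.0, 5.0)]
mean_grid = [1.0 + 0.1 * k for k in range(91)]   # 1.0 to 10.0 days
sd_grid = [0.5 + 0.1 * k for k in range(76)]     # 0.5 to 8.0 days
print(grid_mle(toy_events, mean_grid, sd_grid))
```

Parameterizing the lognormal by its mean and SD keeps the estimates on the same scale as those reported in the text; the Gamma and Weibull cases would follow the same pattern with their own PDF and CDF.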
In addition, as pointed out in [31], it was possible that the naive likelihood in Equation (1) underestimated the SI due to sampling biases. Hence, we adjusted for the right truncation observation bias due to isolation by using an alternative likelihood function, L, in Equation (2), which is based on the non-truncated version in Equation (1). The truncation scheme in Nishiura et al. [31], also adopted in Kwok et al. [21], relies on prior knowledge of an additional parameter, i.e., the intrinsic growth rate of the epidemic, which is commonly assumed and fixed in the likelihood framework. The truncation scheme adopted in this work was previously discussed in Zhao [33], and it considers both the likelihood of occurrence and the likelihood of being observed subject to the implementation of isolation.
Frontiers in Physics | www.frontiersin.org
Here, the G(·) was the cumulative distribution function of g(·|µ, σ). The d_i was the isolation date of the infector associated with the i-th infectee. All other notations were the same as those in Equation (1). The maximum likelihood estimates were calculated, and AICc was employed for model selection.
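The AICc comparison can be illustrated with the standard formula, AICc = 2k − 2 ln L + 2k(k + 1)/(n − k − 1), where k is the number of fitted parameters and n the number of observations. In the sketch below the function name `aicc` and the log-likelihood values are hypothetical; the values are placeholders and not results from this study.

```python
def aicc(log_lik: float, n_params: int, n_obs: int) -> float:
    """Akaike information criterion corrected for small sample size."""
    denom = n_obs - n_params - 1
    if denom <= 0:
        raise ValueError("AICc requires n_obs > n_params + 1")
    aic = 2.0 * n_params - 2.0 * log_lik
    return aic + 2.0 * n_params * (n_params + 1) / denom


# Hypothetical maximized log-likelihoods, NOT values from the study.
example_loglik = {"gamma": -52.1, "weibull": -52.6, "lognormal": -51.4}

# Each candidate has two parameters (mean and SD); n_obs = 21 transmission events.
scores = {name: aicc(ll, n_params=2, n_obs=21) for name, ll in example_loglik.items()}
best = min(scores, key=scores.get)
print(best, round(scores[best], 2))
```

Since all three candidates have the same number of parameters, the AICc ranking here coincides with the ranking by maximized likelihood; the correction term matters when models of different size are compared.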

RESULTS AND DISCUSSION
The observed SIs of all 21 samples have a mean of 4.3 days, a median of 4 days, an interquartile range (IQR) between 2 and 5 days, and a range from 1 to 13 days. For the 12 "infector-infectee" pairs, the observed SIs have a mean of 3 days, a median of 2 days, an IQR between 2 and 4 days, and a range from 1 to 8 days. Figure 1 shows the likelihood profiles of varying SI with respect to µ and σ of SI. In [...] [31,34,35]. Considering only the 12 "infector-infectee" pairs, we found that the lognormal distribution also outperformed the other two distributions, and we estimated the mean of SI at 3.0 days (95% CI: 1.9-6.8) and SD of SI at 2.0 days (95% CI: 1.0-10.5). In this case, the Pearson's correlation is 0.96, and the R-squared is 0.92. The fitted lognormal distributions are shown in Figure 2.
For the right-truncated scenario [i.e., using Equation (2)], the lognormal distribution also outperformed the other two distributions in terms of the AICc, see Table 1. By using all 21 samples, we estimated the mean of SI at 4.9 days (95% CI: 3.6-6.2) and SD of SI at 4.4 days (95% CI: 2.9-8.3). By using only the 12 "infector-infectee" pairs, we estimated the mean of SI at 3.0 days (95% CI: 2.1-3.9) and SD of SI at 2.0 days (95% CI: 1.2-4.6). The Pearson's correlation and coefficient of determination were no longer applicable here, since the likelihood function was adjusted and thus no longer depended solely on the SI observations.
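For a lognormal SI with mean µ and SD σ, the log-scale parameters follow from the standard moment relations. The worked values below are derived here from the reported point estimates (µ = 4.9 days, σ = 4.4 days) purely as an illustration; the symbols m and s for the log-scale mean and SD are notation introduced for this sketch, and the rounded figures are not reported in the source.

```latex
\begin{align}
s^2 &= \ln\!\left(1 + \frac{\sigma^2}{\mu^2}\right)
     = \ln\!\left(1 + \frac{4.4^2}{4.9^2}\right) \approx 0.59, \\
m   &= \ln \mu - \frac{s^2}{2} = \ln 4.9 - \frac{s^2}{2} \approx 1.29, \\
\text{median} &= e^{m} \approx 3.6 \text{ days}.
\end{align}
```

The median lying below the mean reflects the right skew of the fitted lognormal distribution.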
Compared with the SI of SARS, with a mean of 8.4 days and SD of 3.4 days [36], the estimated 4.9-day SI for COVID-19 indicated rapid cycles of generation replacement in the transmission chain. Hence, highly efficient public health control measures, including contact tracing, isolation, and screening, were strongly recommended to mitigate the epidemic size. The timely supply and delivery of healthcare resources, e.g., facemasks, alcohol sterilizer, and manpower and equipment for treatment, were required in response to the rapidly growing incidence of COVID-19 [4,37]. In places with less developed healthcare systems and limited medical resources, such rapid growth of the epidemic may burden the public health system. Therefore, preparedness and caution regarding the risk of COVID-19 are crucial to minimize impacts [38,39].
[Table 1 note: The highlighted estimates are considered as the main results.]
As also pointed out by recent works [31,34,35,40], the mean of SI at 4.9 days is slightly smaller than the mean incubation period, roughly 5 days, estimated by many previous studies [41][42][43][44]. Pre-symptomatic transmission may occur when the SI is shorter than the incubation period. If isolation can be conducted immediately after symptom onset, pre-symptomatic transmission is likely to contribute to most SARS-CoV-2 infections. This situation has been recognized by a recent epidemiological investigation [45], and has been implemented in the mechanistic modeling studies of the COVID-19 epidemic [4,46], where the pre-symptomatic cases were contagious. As such, merely isolating the symptomatic cases will still leave a considerable proportion of secondary cases, and thus contact tracing and immediate quarantine were crucial to reduce the risk of infection. In addition, we would like to point out that a small number of negative SI observations were reported in recent studies [34,35,47-49]. A negative SI may occur when the incubation period is short with a large variance. However, no negative values were observed in our dataset, which may be due to the small sample size. We further remark that this is unlikely to bias the estimation of the mean SI, but may lead to a slight underestimation of the SD of SI. The purpose of estimating the SI is to approximate the generation interval (the time lag between infections of successive cases), which is strictly positive. Caution should be taken when dealing with negative SI.
A recent epidemiological study used 5 "infector-infectee" pairs from contact tracing data in Wuhan, China during the early outbreak to estimate the mean SI at 7.5 days (95% CI: 5.3-19.0) [42], which appeared larger than our SI estimate of 4.9 days. Although the 95% CIs of the SI estimates in this study, which are consistent with previous studies [21,31,33-35], and those in Li et al. [42] were not significantly separated, a difference in the SI estimates might exist. If this difference was not due to sampling chance, one possible explanation could be the enhanced public awareness and swift control measures, including contact tracing and isolation, implemented in Hong Kong. Since Hong Kong was the hardest hit in the SARS outbreaks in 2003 [18,19], the local public health control was one of the most effective in the world. In the initial phase of the outbreak in Wuhan, transmission occurred without sufficient awareness and effective intervention; thus, the SI estimate in Li et al. [42] may be regarded as the intrinsic (wild) SI of COVID-19, as defined in Champredon et al. [50]. By contrast, the SI estimate in Hong Kong may be regarded as the effective SI in more practical situations where timely action (quarantining cases and their close contacts) is in place [23], such that a case could be isolated before having the chance to infect others. If timely action was not in place, infections with a longer serial interval may occur. Thus, shorter SI observations might be an outcome of effective control in a location. The practice in Hong Kong is an example for other regions, including less developed countries.
The SI estimate can benefit from a larger sample size. The estimates in our study were based on 21 identified transmission events, including 12 "infector-infectee" pairs. Although the sample size was smaller than the 28 transmission events in Nishiura et al. [31], 71 in You et al. [35], and 468 in Du et al. [34], the advantage of this analysis is that all 21 transmission events were identified in Hong Kong. Hence, the surveillance data were under consistent reporting and recording standards, which further reduced the heterogeneity in the observations. Our analysis can be improved if more records on local transmission events become available. Furthermore, a comparison between different localities is important, as it would shed light on the effects of different external factors on the SI.
Accurate and consistent records on dates of illness onset were essential to the estimation of the SI. All samples used in this analysis were identified in Hong Kong and collected consistently from the CHP [16,17]. Hence, the reporting criteria were most likely to be the same for all COVID-19 cases, which potentially made our findings more robust.
Clusters of cases can occur by person-to-person transmission within a cluster, e.g.,
• scenario (I): person A infected B, C, and D; or
• scenario (II): A to B to C to D; or
• scenario (III): a mixture of (I) and (II), e.g., A to B, B to C and D;
or they can occur through common exposure to an unrecognized source of infection, e.g.,
• scenario (IV): an unknown person X infected A, B, C, and D; or
• scenario (V): a mixture of (IV) and (I) or (II), e.g., X to A and B, B to C and D.
The lack of information in the publicly available dataset made it difficult to disentangle such complicated situations. Scenarios (I) and (II) can be covered by "infector-infectee" pairs, such that we could identify the link between two unique consecutive infections. Under scenario (III), we cannot clearly identify the pairwise match between the infector and infectee, which means there were multiple candidates for the infector of one infectee. As such, we employed the PDF h(·) in Equation (1) to account for the possible time of exposure ranging from T^low to T^up. No information is available on the SI for scenarios (IV) and (V) because the onset date of person X is unknown, and thus our analysis was limited to scenarios (I)-(III). We note that extra caution is needed in interpreting the clusters of cases because of this potential limitation. Although we used the interval censored likelihood to deal with the multiple-infector matching issue, more detailed information on the exposure history and clues on "who acquires infection from whom" (WAIFW) would improve our estimates.
A longer SI might be difficult to confirm in reality due to the isolation of confirmed infections [51,52], or difficult to identify and link together due to less accurate information associated with memory error in the backward contact tracing exercise [34]. The issue associated with isolation could possibly bias the SI estimates and lead to an underestimated result [31]. It is possible that the SI is longer at the initial stage than later, when strict isolation takes place [23]. Nevertheless, a comparison of the estimated SI for SARS and COVID-19 in Hong Kong is still meaningful. We found that the estimated SI of COVID-19 appears shorter than that of SARS. It would be hard to imagine that isolation is responsible for the difference: isolation is unlikely to have been more rapid for COVID-19 cases than for SARS cases in Hong Kong, and the other limitations would have applied to both diseases. Thus, the difference we observed between COVID-19 and SARS is likely intrinsic. Given the rapid spreading of COVID-19, effective contact tracing and quarantine/isolation were even more crucial for successful control.

CONCLUSION
Together with the basic reproduction number, the serial interval is one of the most important epidemiological parameters, although it is difficult to estimate and garners less attention than the former. Here, we found that the SI of COVID-19 may be shorter than the preliminary estimates seen in previous works. Since the SI could be shorter than the incubation period among some cases, pre-symptomatic transmission may occur, and extra efforts on timely contact tracing and quarantine are crucially needed in combating the COVID-19 outbreak.

DATA AVAILABILITY STATEMENT
All datasets generated for this study are included in the article/supplementary material.

AUTHOR CONTRIBUTIONS
SZ conceived the study and carried out the analysis. SZ and DH drafted the first manuscript. All authors discussed the results, critically read, revised the manuscript, and gave final approval for publication.