Characterizing COVID-19 Transmission: Incubation Period, Reproduction Rate, and Multiple-Generation Spreading

Understanding the transmission process is crucial for the prevention and mitigation of COVID-19 spread. This paper contributes to COVID-19 knowledge by analyzing the incubation period, the transmission rate from close contact to infection, and the properties of multiple-generation transmission. The data on these parameters are extracted from a detailed line-list database of 9,120 cases reported in mainland China from January 15 to February 29, 2020. The incubation period of COVID-19 has a mean, median, and mode of 7.83, 7, and 5 days, respectively, and is longer than 14 days in 12.5% of cases. The number of close contacts for these cases during the incubation period and a few days before hospitalization follows a log-normal distribution, which may lead to super-spreading events. The disease transmission rate from close contact roughly decreases as the number of close contacts increases, with a median of 0.13. The average numbers of secondary cases are 2.10, 1.35, and 2.2 for the first, second, and third generations, conditioned on at least one offspring. However, the proportions of no further spread in the 2nd, 3rd, and 4th generations are 26.2%, 93.9%, and 90.7%, respectively. Moreover, the conditioned reproduction number in the second generation is geometrically distributed. Our findings suggest that, in order to effectively control the pandemic, prevention measures such as social distancing, wearing masks, and isolating from close contacts would be the most important and least costly measures.


INTRODUCTION
As of July 2020, the cumulative confirmed cases of COVID-19 worldwide have exceeded 17.4 million with over 572 thousand dead. There are 22 countries with more than 100,000 confirmed cases as of July 14, 2020. The high transmissibility of the SARS-CoV-2 virus has substantially changed people's hygiene habits, social relations, and forms of work and schooling during and after the pandemic [1]. In the absence of pharmaceutical intervention measures, public policies such as city lockdowns and workplace and school closures can mitigate the spread of disease, though with substantial economic and societal costs. The indecision between restarting the economy and stopping the pandemic has resulted in a wave of outbreaks in many countries [2].
Understanding the characteristics of the COVID-19 transmission process is crucial in finding a middle ground between restoring economic and societal order and controlling the pandemic. Previous research has shown that COVID-19 can be infectious pre-symptomatically [3], i.e., the virus is transmissible even before symptom onset. Determining the duration of the incubation period and the reproducibility of the virus during the incubation period and shortly after symptom onset but before hospitalization is thus an urgent necessity [4].
Regarding the incubation period, as of Jan. 26, the mean and median were 5 and 4.75 days (obtained from 125 patients) [5]. Confirmed cases reported from Jan. 4 to Feb. 24 showed a median incubation period of 5.1 days (obtained from 181 patients) [6]. By Jan. 22, using 425 patients, the mean incubation period was 5.2 days [7]. Reference [8] gave a shorter incubation period of 4.2 days, implying that COVID-19 is more infectious than initially estimated. As of Mar. 31, the mean incubation time was estimated as 8.0 days with a standard deviation of 4.75 [9]. Through a renewal process, the estimated median of the incubation period was 8.1 days, which is longer than in other studies [10]. The mean and median of the incubation periods were 5.84 and 5.0 days via bootstrap for groups with an age of ≥ 40, and they otherwise demonstrated a significant difference [11]. By meta-analysis, the incubation period was modeled with a lognormal distribution, and the mean and median were 5.8 and 5.1 days [12].
The transmission rate is defined as the probability that an infection occurs among susceptible people within a specific group. It is an important index for providing an indication of how social interactions are related to transmission risk. Nine reports were listed in [13], showing a rate of 35% (95% CI 27-44), depending on infection caused by different contact methods.
The best-known model within infectious disease epidemiology is the SEIR (susceptible-exposed-infectious-recovered) model and its various generalizations. These models are used at the population level to describe the proportion of each state at a given time, with the aim of investigating strategic decisions or the effectiveness of mitigation measures. For illustration, effective containment can explain the subexponential growth in China [19], and the effects of containment measures in Italy have also been analyzed with an SEIR-like model [17]. More results can be found in [20][21][22][23][24][25][26][27].
Clinical investigations may suffer from a limited sample size and biased sampling from the population, leading to geographically or demographically dependent results. Different samples and different methods also lead to different results in data analysis and estimation. Simulation of disease spread and mitigation policies requires a precise setting of the incubation period [19,28]. Metapopulation disease transmission models require a prerequisite setting of the transmission rate during social gathering events to predict the range of disease spread [18, 29,30]. For a better estimate of the reproduction number, a real data sample is a crucial ingredient; however, it is difficult to collect. Given the demand for investigating the properties and modeling of COVID-19, fine-grained data extracted from informative line-list records can provide supporting evidence for existing results and a solid foundation for further study.
In this work, we estimate the parameters of concern from a large-scale epidemiological line-list database, which contains the contact history and epidemiological timelines of 9,120 confirmed COVID-19 cases in China [31]. The duration of the incubation period and the details of close contacts and contact scenarios are extracted from the line-list. Spreading trees are reconstructed from the potential transmission pairs in the line-list data set. From the line-list records of confirmed cases, we have collected 421 chains of spreading involving a total of 1,140 confirmed cases. We fit appropriate distributions to the incubation period as well as to the scale of close contact. The reproducibility is represented by the spreading tree and can be interpreted as the effective reproduction number under strict containment measures in China.
The incubation period is fitted by a Weibull distribution with a mean and median of 7.83 and 7 days, respectively; this is in agreement with [9]. Larger samples and longer observation periods tend to yield longer incubation period estimates, which is consistent with the long-tailed nature of the Weibull distribution. For the secondary attack rate, there are far fewer results owing to the lack of data. We have obtained 412 close contact events to investigate the transmission rate. The results show that the contact scale and the transmission rate are not strongly related, whether the relation is taken to be linear or nonlinear. Moreover, the contact scale is fitted by a log-normal distribution, and the empirical distribution of the transmission rate is also given. Finally, the reproducibility of COVID-19 under strict containment measures is investigated through the multiple-generation spreading structure, revealing the effectiveness of the containment measures in China. The key contributions of our work aim at a better understanding of the properties of COVID-19 spread.
The rest of the paper is organized as follows. Section 2 describes the data and methods. Section 3 reports the empirical analysis and the fitted models. Section 4 discusses the implications of the results and provides an explanation, based on branching processes, of the necessity of ultra-strict prevention measures.

DATA AND METHODS
The line-list database used in this paper contains hand-coded information extracted from 9,120 cases publicly reported by mainland China health commissions from January 15 to February 29, 2020. A typical reported item is as follows: "Patient ID: Huainan-25. The patient Huainan-25 is a 59-year-old woman who is the wife of the Huainan-26 patient. On February 12, she developed fever, muscle soreness, and other symptoms. On February 14, she went to the hospital for treatment and stayed at the hospital for observation. On February 15, her nucleic acid test was positive, and doctors diagnosed her as a suspected patient. Two days later, she was confirmed. Doctors have traced back 3 close contacts, all of whom have been quarantined for medical observation. During the New Year's holiday, she had close contact with her daughter, son-in-law, and granddaughter. Her son-in-law, an asymptomatic patient with a history of suspicious exposure in Hefei, stayed at a designated hospital for observation. Doctors have traced back his 46 close contacts, all of whom have been quarantined for medical observation." The original extracted line-list database contains the epidemiological timelines, e.g., the possible date of virus exposure and the date of symptom onset, for each case. We define the incubation period as the time between virus exposure and symptom onset. There are 457 cases with both the date of exposure and the date of symptom onset reported in the line-list database.
Frontiers in Physics | www.frontiersin.org January 2021 | Volume 8 | Article 589963
Close contact events are social events and scenarios such as living together, dining together, traveling together, and working together. There were 412 close contact events for which the numbers of close contacts and secondary infections were reported. Multiple-generation transmissions can form tree structures that originate from an initial infection. There are 421 transmission chains identified from the line-list.
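The reconstruction of transmission chains and the per-generation reproduction numbers can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the case IDs and function names are hypothetical, each infectee is assumed to have exactly one recorded infector, and the pairs are assumed to contain no cycles.

```python
from collections import defaultdict

def build_generations(pairs):
    """Group cases into generations from (infector, infectee) pairs.

    Assumes a forest: one infector per infectee and no cycles.
    """
    children = defaultdict(list)
    infected = set()
    for infector, infectee in pairs:
        children[infector].append(infectee)
        infected.add(infectee)
    # Roots are infectors that never appear as an infectee.
    roots = sorted(case for case in children if case not in infected)
    generations, level = [], roots
    while level:
        generations.append(level)
        level = [c for parent in level for c in children.get(parent, [])]
    return generations, children

def reproduction_by_generation(pairs):
    """Return (unconditional, conditional) mean offspring per generation.

    The conditional mean only counts cases with at least one offspring.
    """
    generations, children = build_generations(pairs)
    results = []
    for level in generations[:-1]:
        counts = [len(children.get(case, [])) for case in level]
        spreaders = [n for n in counts if n > 0]
        results.append((sum(counts) / len(counts),
                        sum(spreaders) / len(spreaders)))
    return results

# Hypothetical toy chain: A infects B and C, and B infects D.
toy_pairs = [("A", "B"), ("A", "C"), ("B", "D")]
toy_result = reproduction_by_generation(toy_pairs)
print(toy_result)
```

In the toy chain, the second generation contains one case that spreads further and one that does not, so the unconditional and conditional means differ, which is the same distinction drawn for the real chains.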

Duration of the Incubation Period
The incubation period is a vital variable for the control of the pandemic. The quarantine period for people who have had close contact with an infected individual depends on this variable. The quarantine was usually 14 days for COVID-19. However, for strict prevention, it was suggested at the Information Office of Beijing Municipality press conference on June 28 that, after the first 14 days, another 14-day quarantine is necessary in some high-risk areas.
The reason why another 14-day quarantine is necessary can be found in the distribution of the incubation time. The sample of 457 incubation times reveals a skewed distribution, see Figure 1. The mean, median, and mode calculated from the sample data are 7.83, 7, and 5 days, respectively. Moreover, the empirical probability that the incubation period is at least 14 days is $P(\text{incubation period} \ge 14\ \text{days}) = 0.125$.
That is to say, the chance of an asymptomatic infected individual becoming symptomatic after 14 days is about 12.5%. For strict control of COVID-19, a longer quarantine is necessary. A Weibull distribution is fitted to the empirical data, shifted by 1 to the right to avoid zero values. The density function is $f(x) = \frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k-1} e^{-(x/\lambda)^{k}}$ for $x > 0$, where $x$ denotes the shifted incubation period, with $\lambda = 9.93$ and $k = 1.79$. The K-S test value is 0.17, so the Weibull distribution is not rejected for these data.
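The reported Weibull parameters can be checked against the sample statistics with a few lines of arithmetic. The sketch below assumes that the shift of 1 means the fitted variable is the incubation period plus one day; this reading is an assumption, and the variable names are illustrative.

```python
import math

# Parameters reported for the Weibull fit.
LAMBDA = 9.93  # scale
K = 1.79       # shape
SHIFT = 1.0    # assumption: fitted variable = incubation period + 1 day

def weibull_survival(x, lam=LAMBDA, k=K):
    """P(X >= x) for a Weibull(lam, k) variable; equals 1 for x <= 0."""
    return math.exp(-((x / lam) ** k)) if x > 0 else 1.0

# Closed-form Weibull mean and median, shifted back to incubation days.
mean_days = LAMBDA * math.gamma(1 + 1 / K) - SHIFT
median_days = LAMBDA * math.log(2) ** (1 / K) - SHIFT
# Probability that the incubation period is at least 14 days.
tail_14 = weibull_survival(14 + SHIFT)

print(f"mean   = {mean_days:.2f} days")
print(f"median = {median_days:.2f} days")
print(f"P(incubation >= 14 days) = {tail_14:.3f}")
```

Under this assumption the fitted mean is about 7.8 days, the fitted median about 7.1 days, and the tail probability about 0.12, all close to the empirical values of 7.83, 7, and 0.125.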

Scale of Close Contact Events
The scale of a close contact event is the number of people involved in one event in which people have gathered together in a specific way. Table 1 shows the number of different types of social events and scenarios that can potentially facilitate disease spreading. Among the 412 close contact events, more than 93.7% happened by way of living together.
The period of our dataset is the early stage of COVID-19 spread in China. The distribution of the scale of close contact events is therefore a natural feature, observed when people are free to move regardless of the COVID-19 pandemic. The contact scale is intrinsically positive, with a few enormously high data points typically arising. The log-normal distribution is an ideal descriptor of such data, with a positive range, right skewness, a heavy right tail, and easily computed parameter estimates. The mechanism generating log-normally distributed data in ecology can be obtained from a stochastic differential equation [32], which would be another topic for further investigation. The result is shown in Figure 2. The density function of this log-normal distribution is $f(x) = \frac{1}{x \sigma \sqrt{2\pi}} \exp\left(-\frac{(\ln x - \mu)^{2}}{2\sigma^{2}}\right)$ for $x > 0$, where the fitting parameters are $\mu = 2.495$ and $\sigma = 0.745$. The p-value of the K-S test for the log-normal distribution is 0.18, so the hypothesis that the scale is log-normally distributed is not rejected, and the log-normal distribution gives a proper fit among the positive, skewed, heavy-tailed candidate distributions. Although there are various prevention measures worldwide, various contact events result in a heterogeneous scale of close contact. The heavy-tailed nature of the close contact scale reveals a non-negligible possibility of super-spreading events. Therefore, in order to effectively control the pandemic, maintaining social distance and wearing masks should be effective measures.
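To make the heavy right tail concrete, the fitted parameters can be turned into a median, a mean, and a tail probability. This is a minimal sketch: the threshold of 50 people is a hypothetical cut-off chosen for illustration and does not come from the paper.

```python
import math

MU = 2.495     # reported log-scale location parameter
SIGMA = 0.745  # reported log-scale spread parameter

def lognormal_survival(n, mu=MU, sigma=SIGMA):
    """P(N > n) for a log-normal variable, via the complementary error function."""
    z = (math.log(n) - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))

median_scale = math.exp(MU)                 # median of a log-normal law
mean_scale = math.exp(MU + SIGMA ** 2 / 2)  # mean of a log-normal law
THRESHOLD = 50  # hypothetical size of a "large" event, not from the paper
tail = lognormal_survival(THRESHOLD)

print(f"median scale = {median_scale:.1f}")
print(f"mean scale   = {mean_scale:.1f}")
print(f"P(scale > {THRESHOLD}) = {tail:.3f}")
```

The mean lies well above the median, which is the signature of right skewness, and events several times larger than the median keep a probability of a few percent under this fit.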

Transmission Rate and the Scale of Close Contact Events
We define the transmission rate as the number of people infected in one close contact event divided by the number of people in that event. Figure 3 shows the scatter plot of the transmission rate against the scale of close contact events. It can be seen that the rate drops as the scale of events increases in a non-linear fashion. Let p be the transmission rate and N the total number of people in the close contact event. Based on our sample, the mean of p is calculated for each value of N. The relationship between N and p can be fitted with an exponential function whose fitting parameters are $a = 0.453$, $b = 0.121$, and $c = 0.092$, with a goodness-of-fit index of $R^{2} = 0.706$. The exponential relation indicates that a larger scale of close contact tends to give a smaller secondary incidence p. However, the fit is not convincing enough. The correlation coefficient between N and p is −0.29, implying that neither a linear nor a nonlinear relation between N and p is significant. In other words, p can be treated as a natural feature of COVID-19, with a weak monotonic decrease in N. The mean and median of the transmission rate are 0.20 and 0.13, with an interquartile range of 0-0.3. The empirical distribution of the transmission rate is also given in Figure 4. Protective measures that decrease the transmission rate, such as maintaining social distance, wearing masks, and washing hands, would be the least costly ways to contain the pandemic.
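The definition of the transmission rate and its correlation with the event scale can be illustrated with a short computation. The event records below are hypothetical values invented for this sketch, not data from the line-list, and the helper `pearson` is written out only to keep the example self-contained.

```python
from statistics import mean, median

# Hypothetical (event size N, secondary infections) pairs, for illustration only.
events = [(2, 1), (4, 1), (5, 1), (10, 1), (20, 2)]

sizes = [n for n, _ in events]
# Transmission rate p = infected people in the event / people in the event.
rates = [infected / n for n, infected in events]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

mean_rate = mean(rates)
median_rate = median(rates)
corr = pearson(sizes, rates)

print(f"mean p = {mean_rate:.2f}, median p = {median_rate:.2f}, corr(N, p) = {corr:.2f}")
```

A negative coefficient on such data reflects the same qualitative pattern as in the text, where larger events tend to have a smaller share of infected participants.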

Spreading Tree Structures
Transmission events can create tree structures that map disease spread. In total, 421 chains are verified from the recorded data. Among these chains, 311 have secondary cases. We define the reproduction number in each generation as the number of infected people in the next generation divided by the number in the present one. Conditioned on the existence of at least one child in the next generation, the mean reproduction numbers in the first, second, and third generations are 2.10, 1.35, and 2.2. However, without the conditional restriction, the means are 1.55, 0.08, and 0.2, respectively, see Table 2.
Using the numbers of secondary cases caused by the 311 infectors in the first generation, the empirical distribution, together with a geometric fit, is shown in Figure 5. The geometric distribution law is $P(k \text{ secondary cases}) = p(1 - p)^{k-1}$ for $k \ge 1$. The parameter is $p = 0.50$, and the K-S test value is 0.73.
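The conditional and unconditional means are linked by a simple identity: the unconditional mean equals the conditional mean multiplied by the probability of at least one offspring. The sketch below checks this with the proportions of no further spread quoted in the abstract (26.2%, 93.9%, and 90.7%), on the assumption that these proportions apply to the first, second, and third generations of infectors respectively.

```python
# Reported means conditioned on at least one offspring (generations 1 to 3).
conditional_mean = [2.10, 1.35, 2.2]
# Reported proportions with no further spread; the pairing with
# generations 1 to 3 is an assumption made for this check.
no_spread = [0.262, 0.939, 0.907]

# E[offspring] = E[offspring | at least one] * P(at least one offspring)
unconditional_mean = [m * (1 - q) for m, q in zip(conditional_mean, no_spread)]

for gen, value in enumerate(unconditional_mean, start=1):
    print(f"generation {gen}: unconditional mean = {value:.2f}")
```

The products agree with the reported unconditional means of 1.55, 0.08, and 0.2 to the stated precision.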
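As a quick consistency check on the geometric fit, its mean can be compared with the conditional first-generation mean of 2.10. The symbol $K$ for the number of secondary cases of one infector is introduced here only for illustration.

```latex
\mathbb{E}[K] \;=\; \sum_{k \ge 1} k \, p \,(1 - p)^{k-1} \;=\; \frac{1}{p} \;=\; \frac{1}{0.50} \;=\; 2
```

This value is close to, though slightly below, the empirical conditional mean of 2.10.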

DISCUSSION
In this study, based on the details of confirmed cases reported by the mass media, the following features are explored: the Weibull distribution of the incubation period, the log-normal distribution of the scale of close contact events, the geometric distribution of the reproduction number in different generations of virus transmission, and the statistical features of the secondary attack rate.
As far as we know, this is the first time that the scale of close contacts has been reported to be log-normally distributed, which was previously not possible owing to the lack of data. This heavy-tailed distribution implies a relatively larger possibility of super-spreading events compared with light-tailed distributions. It is therefore important to take adequate measures to reduce the scale of close contact and thereby the number of secondary infections. Moreover, efforts should be made to trace close contacts in order to cut off possible spreading chains in advance.
It is notable that the method here applies to all infectious diseases. The crucial step is the line-list record of each confirmed case and the detailed transmission relationships in the spreading tree structure. For infectious diseases where only non-pharmaceutical measures can be applied to prevent spreading, detailed record keeping of each confirmed case and its contact history is crucial. The tree structure is good evidence of the spreading trend and is helpful for the precise estimation of the effective reproduction number. Moreover, contact history is useful for nipping severe infectious diseases in the bud.
Theoretically, the reproduction number, say R, is a determining index quantifying transmissibility. To control the pandemic, R should be less than one. According to the theory of branching processes, there is a phase transition at the critical value $R = 1$. If $R < 1$, then, with a probability of one, the spread of a disease dies out at exponential speed. However, when $R > 1$, the spread increases exponentially with positive probability. The extinction probability $q$ is the minimum nonnegative solution to the equation $f(s) = s$ for $s \in [0, 1)$, where $f(s)$ is the generating function of the reproduction number, and the probability of exponential increase is $1 - q$. From this point of view, the propagation of COVID-19 is an issue of "all or nothing," and the control measures should be as strict as possible to avoid the possibility of exponential increase.
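The role of the fixed-point equation can be illustrated numerically. The sketch below assumes a hypothetical offspring law in which a case produces no offspring with probability `p_zero` and otherwise a geometric number of offspring with parameter 0.50, and it assumes the same law in every generation. Both assumptions are made for illustration only, since the data show that the offspring law changes strongly after the first generation. In standard branching-process theory, iterating the generating function from zero converges to its smallest nonnegative fixed point, which is the probability that the chain dies out.

```python
def offspring_pgf(s, p_zero, p_geom):
    """PGF of: 0 offspring with prob. p_zero, else Geometric(p_geom) on k >= 1."""
    return p_zero + (1 - p_zero) * p_geom * s / (1 - (1 - p_geom) * s)

def extinction_probability(p_zero, p_geom, iterations=500):
    """Smallest nonnegative root of f(s) = s, found by iterating s <- f(s) from 0."""
    s = 0.0
    for _ in range(iterations):
        s = offspring_pgf(s, p_zero, p_geom)
    return s

# Hypothetical supercritical case: mean offspring (1 - 0.262) / 0.50 > 1.
q_super = extinction_probability(p_zero=0.262, p_geom=0.50)
# Hypothetical subcritical case: mean offspring (1 - 0.939) / 0.50 < 1.
q_sub = extinction_probability(p_zero=0.939, p_geom=0.50)

print(f"supercritical: extinction probability = {q_super:.3f}")
print(f"subcritical:   extinction probability = {q_sub:.3f}")
```

In the supercritical case the chain survives with probability one minus the extinction probability, whereas in the subcritical case extinction is certain, which is the phase transition described above.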

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: https://github.com/PDGLin/COVID19_EffSerialInterval_NPI

AUTHOR CONTRIBUTIONS
LZ and JZ contributed equally as first authors. JZ, X-FL, and X-KX designed the analysis; LZ, XW, JY, and X-KX analyzed the data. LZ and X-FL wrote the paper.

FIGURE 5 | The empirical distribution of infection numbers in the second generation with geometric fitting. The geometric distribution law is $P(k \text{ secondary cases}) = p(1 - p)^{k-1}$ for $k \ge 1$. The parameter is $p = 0.50$, and the K-S test value is 0.73.