Symptom Clustering Patterns and Population Characteristics of COVID-19 Based on Text Clustering Method

Background Descriptions of single clinical symptoms of coronavirus disease 2019 (COVID-19) have been widely reported. However, evidence on associations among symptoms is still limited. We sought to explore the potential symptom clustering patterns and high-frequency symptom combinations of COVID-19 to improve the understanding of this disease. Methods In this retrospective cohort study, a total of 1,067 COVID-19 cases were enrolled. Symptom clustering patterns were first explored by a text clustering method. Then, a multinomial logistic regression was applied to reveal the population characteristics of different symptom groups. In addition, time intervals between symptom onset and the first visit were analyzed to consider the effect of longer time intervals on the progression of symptoms. Results Based on text clustering, the symptoms were summarized into four groups. Group 1: no-obvious symptoms; Group 2: mainly fever and/or dry cough; Group 3: mainly upper respiratory tract infection symptoms; Group 4: mainly cardiopulmonary, systemic, and/or gastrointestinal symptoms. Apart from Group 1 with no obvious symptoms, the most frequent symptom combinations were fever only (64 cases, 47.8%), followed by dry cough only (42 cases, 31.3%) in Group 2; expectoration only (21 cases, 19.8%), followed by expectoration complicated with fever (10 cases, 9.4%) in Group 3; and fatigue complicated with fever (12 cases, 4.2%), followed by headache complicated with fever (11 cases, 3.8%) in Group 4. People aged 45–64 years were more likely to have symptoms of Group 4 than those aged 65 years or older (odds ratio [OR] = 2.66, 95% CI: 1.21–5.85) and also had longer time intervals. Conclusions Symptoms of COVID-19 could be divided into four clustering groups with different symptom combinations. The Group 4 symptoms (i.e., mainly cardiopulmonary, systemic, and/or gastrointestinal symptoms) happened more frequently in COVID-19 than in influenza. This distinction could help deepen the understanding of this disease. Middle-aged people had longer time intervals before the first medical visit and, from the perspective of medical delays, are a group that deserves more attention.


INTRODUCTION
Coronavirus disease 2019 (COVID-19) has evolved into a global pandemic, causing significant morbidity and mortality worldwide. As of December 2021, it had caused more than 270 million confirmed cases and more than 5 million deaths worldwide, with the number of confirmed cases continuing to increase at a rate of about 100,000 per day (1).
Clinical symptoms, as indicators for identification and diagnosis, play a vital role in early detection and treatment. COVID-19 has a wide range of clinical manifestations, ranging from asymptomatic infection to severe viral pneumonia (2,3). It has been widely confirmed that fever, dry cough, expectoration, and fatigue are the most common symptoms in patients with COVID-19 (3-5). As the pandemic progressed, cardiovascular symptoms (6), digestive symptoms (7), petechial skin rash (8), and loss of taste (ageusia) and smell (anosmia) (9) were also reported. Numerous studies have contributed to the understanding of COVID-19. Despite a growing body of evidence in this field, heterogeneity across both individuals and studies still leaves much to explore about the symptomatology of COVID-19.
Most previous work on clinical symptoms has been primarily descriptive and has focused on single symptoms (4,5). Given the variability of symptoms and the fact that two or more symptoms normally coexist in one infected case, the association and aggregation of different symptoms may provide more information. The purpose of this study was to explore, with a text clustering method based on the aggregation of symptoms, whether there were potential clustering patterns of different symptoms in patients with COVID-19. On the basis of the clustering results, we examined the population characteristics of different symptom groups. Given that there were both overlaps and variations in symptoms of COVID-19 and other infectious diseases, such as influenza (10-13), we also compared the symptom groups found in this study with symptoms of influenza reported in other studies. By profiling the symptoms of COVID-19 and its population characteristics, we expect to improve the understanding of the disease's clinical manifestations and to identify the high-frequency symptom combinations of COVID-19.

Study Design and Data Source
In this retrospective cohort study, a total of 1,067 laboratory-confirmed cases of COVID-19 from January 21, 2020 to November 20, 2020 in Sichuan Province were included. Demographic information, symptom onset, comorbidities, and epidemiological data of all cases were extracted from individual epidemiological investigation reports sourced from the Epidemic Registration System of the Sichuan Center for Disease Control and Prevention (CDC). Symptoms were first recorded using the predefined items on the epidemiological investigation report form; self-reported symptoms not listed on the form were appended as free text by CDC staff. Epidemiological data included dummy variables, such as whether a case was an indigenous case or an imported case from abroad, and whether a case had been infected individually or in a family or workplace cluster. This study was approved by the Ethics Committee of Sichuan Center for Disease Control and Prevention (SCCDCIRB-2020-007). Written informed consent was obtained from each subject.

Statistical Analysis
First, the k-means clustering method, based on Euclidean distance, was applied to the symptom text of the cases to explore potential symptom groups. The optimal number of clusters was determined by the widely used elbow method (14). Bar charts were used to give a visual representation of the symptom combinations under each group. Categorical variables were presented as counts and percentages; continuous variables were presented as median (interquartile range, IQR) when non-normally distributed and as mean ± SD otherwise.
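As a rough illustration of the clustering step described above, the following Python sketch encodes each case's symptom text as a binary indicator vector, runs k-means with Euclidean distance, and records the within-cluster sum of squares (WCSS) for several values of k so that an elbow can be looked for. The vocabulary, the toy records, the comma-based parsing, and all function names are assumptions made for illustration; the text does not specify how the symptom text was vectorized or which library was used.

```python
import random

# Hypothetical vocabulary and toy records; not the study data.
VOCAB = ["fever", "dry cough", "expectoration", "pharyngodynia", "fatigue", "headache"]
RECORDS = (
    [""] * 5                                   # no obvious symptoms
    + ["fever"] * 3 + ["fever, dry cough"] * 2
    + ["expectoration, pharyngodynia"] * 3
    + ["fatigue, headache, fever"] * 3
)

def encode(text):
    """Turn a comma-separated symptom string into a 0/1 vector over VOCAB."""
    present = {s.strip().lower() for s in text.split(",") if s.strip()}
    return [1.0 if term in present else 0.0 for term in VOCAB]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, rng, max_iter=100):
    """Plain Lloyd's algorithm; returns (labels, WCSS)."""
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    wcss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, wcss

def best_of(points, k, restarts=20, seed=0):
    """Repeat k-means from several random starts and keep the lowest WCSS."""
    rng = random.Random(seed)
    return min((kmeans(points, k, rng) for _ in range(restarts)), key=lambda r: r[1])

vectors = [encode(r) for r in RECORDS]
wcss_by_k = {k: best_of(vectors, k)[1] for k in range(1, 7)}
for k, w in wcss_by_k.items():
    print(k, round(w, 3))  # look for the k after which the drop in WCSS flattens
```

Because WCSS generally keeps shrinking as k grows, the elbow is read as the point where further clusters bring only a small reduction, not as a global minimum.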
Based on the clustering results, with symptom groups as the dependent variable, a multinomial logistic regression was applied to identify potential factors associated with the symptom groups. Group 1 was the reference category in the multinomial regression model. Population characteristics, such as age, gender, comorbidities (hypertension, diabetes, lung disease, and cardiovascular disease), and epidemiological characteristics (imported or indigenous, clustered or individual) were added into the model as covariates. Following Tian et al. (15), age was divided into four groups: 0-12, 13-44, 45-64, and ≥65 years. Because a small proportion of respondents lacked comorbidity or epidemiological information, they were included in the demographic description but excluded from the regression model. In addition, time intervals between symptom onset and the first visit were described, and the proportions of different symptom groups at different time intervals were visualized with a bar chart. Figure 1 shows the procedure of our analysis. In this study, the text clustering was conducted with Python version 3.7.6 and the remaining statistical analyses were conducted with R version 4.0.3. A value of p < 0.05 was considered statistically significant.
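To make the regression setup more concrete, the sketch below shows the four age bands named in the text, how a multinomial logit model with Group 1 as the reference category turns linear predictors into group probabilities, and how an odds ratio with a 95% confidence interval can be obtained from a coefficient and its standard error. The function names, the example coefficients, and the use of a Wald-type interval are assumptions for illustration only; the study fitted its model in R, and the text does not state how its intervals were computed.

```python
import math

def age_group(age_years):
    """Assign the four age bands named in the text (0-12, 13-44, 45-64, >=65)."""
    if age_years < 0:
        raise ValueError("age must be non-negative")
    if age_years < 13:
        return "0-12"
    if age_years < 45:
        return "13-44"
    if age_years < 65:
        return "45-64"
    return ">=65"

def group_probabilities(linear_predictors):
    """Multinomial logit with a reference category.

    linear_predictors maps each non-reference group to its linear predictor
    (intercept + coefficients * covariates); the reference group has value 0.
    """
    exp_terms = {g: math.exp(v) for g, v in linear_predictors.items()}
    denom = 1.0 + sum(exp_terms.values())
    probs = {"Group 1 (reference)": 1.0 / denom}
    probs.update({g: e / denom for g, e in exp_terms.items()})
    return probs

def odds_ratio_ci(beta, se, z=1.96):
    """Odds ratio and Wald-type 95% CI from a log-odds coefficient and its standard error."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

# Hypothetical linear predictors for one covariate pattern (illustrative values only).
example_probs = group_probabilities({"Group 2": -1.2, "Group 3": -1.5, "Group 4": -0.6})
# Hypothetical coefficient and standard error, not taken from Table 2.
example_or = odds_ratio_ci(beta=0.5, se=0.25)
print(age_group(50), example_probs, example_or)
```

Under this parameterization, each odds ratio compares the odds of being in a given symptom group rather than in Group 1 between two levels of a covariate.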

Population Distribution and Symptom Clustering Patterns
From January 21, 2020 to November 20, 2020, information on 1,067 cases was collected. The majority of infected cases were in the 13-44 years (613 cases, 57.45%) and 45-64 years (344 cases, 32.23%) age groups. For comorbidities, the prevalence of hypertension was 6.84%, while the prevalence of diabetes, lung disease, and cardiovascular disease was 2.44%, 3.00%, and 2.36%, respectively. In addition, 41.24% of the infected patients were imported cases and 26.43% were infected through family clustering (Table 1).
The elbow method indicated that the decline in the within-cluster sum of squares leveled off when the data were divided into four groups. Therefore, four clusters were selected for the analysis. Then, drawing on pathophysiology (16,17) and consultation with clinical experts in the Sichuan Center for Disease Control and Prevention, the symptoms were summarized as follows: Group 1: no-obvious symptoms, referring to those with no obvious symptoms but a positive nucleic acid test; Group 2: mainly fever and/or dry cough, referring to those with fever as the main symptom, sometimes complicated with dry cough; Group 3: mainly upper respiratory tract infection symptoms, referring to those mainly with expectoration and upper respiratory tract infection symptoms, such as pharyngodynia, stuffy nose, and runny nose, sometimes complicated with fever; Group 4: mainly cardiopulmonary, systemic, and/or gastrointestinal symptoms, referring to those whose main symptoms were cardiopulmonary symptoms, such as shortness of breath, dyspnea, chest tightness, and chest pain, and/or systemic symptoms, such as fatigue, chills, and myalgia, and/or gastrointestinal symptoms, such as nausea, vomiting, and diarrhea, sometimes accompanied by fever and upper respiratory tract symptoms.
The results showed that more than half (50.7%) of the infected cases did not show obvious symptoms (Group 1) at the first visit. The proportions of the three groups with obvious symptoms (Groups 2, 3, and 4) were 12.6%, 10.0%, and 26.8%, respectively. Among them, Group 4 (cardiopulmonary, systemic, and/or gastrointestinal symptoms) had the highest proportion. Population characteristics of the above symptom groups are summarized in Table 1.
To profile the symptom composition of each group, bar charts were used to visualize the particular symptom combinations under each group (Figure 2). There were overlaps and interactions of symptoms within the same group. In symptom Group 1, all cases had no obvious symptoms (541 cases, 100%). In symptom Group 2, the most frequent symptom combinations were fever only (64 cases, 47.8%), followed by dry cough only (42 cases, 31.3%). In symptom Group 3, the most frequent symptom combinations were expectoration only (21 cases, 19.8%), followed by expectoration complicated with fever (10 cases, 9.4%). In symptom Group 4, the most frequent symptom combination was fatigue complicated with fever (12 cases, 4.2%), followed by headache complicated with fever (11 cases, 3.8%). Overall, apart from asymptomatic cases, which had the highest proportion (50.70%), the six most frequent symptom combinations in the whole population were fever only (6.00%), dry cough only (3.94%), dry cough complicated with fever (2.62%), expectoration only (1.97%), fatigue complicated with fever (1.12%), and headache complicated with fever (1.03%).
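The tallying of symptom combinations described above can be sketched as follows. The toy cases and helper names are hypothetical; only the five counts in `reported` and the total of 1,067 cases come from the text, and the snippet simply re-expresses those counts as whole-population percentages.

```python
from collections import Counter

def combination_counts(cases):
    """Count symptom combinations, ignoring the order in which symptoms are listed."""
    return Counter(frozenset(symptoms) for symptoms in cases)

def percent(count, total, digits=2):
    """Share of the total, in percent, rounded to the given number of digits."""
    return round(100 * count / total, digits)

# Hypothetical toy cases (each case is a list of reported symptoms).
toy_cases = [
    ["fever"], ["fever"], ["dry cough"],
    ["fever", "dry cough"], ["dry cough", "fever"],
    ["fatigue", "fever"],
]
toy_counts = combination_counts(toy_cases)
for combo, n in toy_counts.most_common():
    print(sorted(combo), n, percent(n, len(toy_cases), 1))

# Counts reported in the text, expressed as shares of all 1,067 cases.
TOTAL_CASES = 1067
reported = {
    "fever only": 64,
    "dry cough only": 42,
    "expectoration only": 21,
    "fatigue with fever": 12,
    "headache with fever": 11,
}
whole_population_pct = {name: percent(n, TOTAL_CASES) for name, n in reported.items()}
print(whole_population_pct)
```

Using a `frozenset` as the key means that "fever, dry cough" and "dry cough, fever" are counted as the same combination, which is one reasonable way to treat co-occurring symptoms.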
As for the dominant single symptoms, fever and dry cough were overall the two most frequent symptoms, with frequencies of 64.4% and 38.8%, respectively, followed by expectoration (12.0%) and fatigue (11.4%). Within the groups, fever (68.7%) and dry cough (52.24%) were the dominant symptoms in Group 2; expectoration (59.4%) and pharyngodynia (29.24%) were the dominant symptoms in Group 3; and fatigue (42.7%) and headache (26.2%) were the dominant symptoms in Group 4. Within each group, symptoms showed some clustering around the dominant symptoms.

Population Characteristics of Different Symptom Groups
The results of univariate and multivariate multinomial logistic regression assessing the population characteristics of different symptom groups are shown in Table 2. In the univariate analysis, older age, female sex, and comorbidities (hypertension, diabetes, lung disease, and cardiovascular disease) were all associated with increased odds of the presence of Group 4 symptoms, namely cardiopulmonary, systemic, and/or gastrointestinal symptoms. Imported cases and cases infected through family clustering had lower odds of symptoms in all three groups with obvious symptoms.
Additionally, the multivariate regression model showed that compared with the 0-12 years age group, the odds of symptoms of Group 4 increased in both the 13-44 years and 45-64 years

Time Intervals Between Symptoms Onset and the First Visit
Among all symptomatic cases, the median time interval between symptom onset and the first visit was 1 day (IQR: 0-3 days). In addition, 47.5% of symptomatic patients visited a medical institution on the day of symptom onset, 15.4% visited 1 day after onset, 11.4% visited 2 days after onset, and 25.7% sought medical treatment 3 days or more after onset. Figure 3 displays the proportions of the three groups with obvious symptoms at different time intervals. The proportion of Group 2 symptoms decreased as the time interval lengthened, the proportion of Group 4 symptoms increased over longer time intervals, and the proportion of Group 3 symptoms peaked at intermediate intervals.
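The summary statistics above (median, IQR, and the share of cases visiting 0, 1, 2, or 3 or more days after onset) can be computed along the following lines. The interval values are hypothetical toy data, the function names are made up for this sketch, and the "inclusive" quartile rule is just one common convention; the text does not say which quartile definition was used.

```python
import statistics
from collections import Counter

def interval_category(days):
    """Bin an onset-to-visit interval into the four categories used in the text."""
    if days < 0:
        raise ValueError("interval must be non-negative")
    return {0: "0 days", 1: "1 day", 2: "2 days"}.get(days, ">=3 days")

def summarize(intervals):
    """Median and quartiles; 'inclusive' is one common quartile convention."""
    q1, median, q3 = statistics.quantiles(intervals, n=4, method="inclusive")
    return {"median": median, "iqr": (q1, q3)}

def category_shares(intervals):
    """Percentage of cases falling in each interval category."""
    counts = Counter(interval_category(d) for d in intervals)
    return {cat: round(100 * n / len(intervals), 1) for cat, n in counts.items()}

# Hypothetical intervals in days for nine symptomatic cases (not the study data).
toy_intervals = [0, 0, 0, 0, 1, 1, 2, 3, 5]
toy_summary = summarize(toy_intervals)
toy_shares = category_shares(toy_intervals)
print(toy_summary, toy_shares)
```

The same binning could be cross-tabulated against symptom group to obtain per-interval group proportions of the kind shown in Figure 3.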
The analysis of time intervals by age group showed that the median time interval was 1 day in the 0-12, 13-44, and 45-64 years age groups, and 0 days in the ≥65 years age group (Figure 4). The ranges were wider in the 13-44 years and 45-64 years age groups, at (0,14) days and (0,15) days, respectively, than in the 0-12 years and ≥65 years age groups, at (0,7) days and (0,8) days, respectively. Patients aged 13-64 years seemed to have longer time intervals.

DISCUSSION
This study focused on the aggregation of different symptoms of COVID-19 and explored potential symptom clustering patterns. Similar to many previous studies (2-5, 18), we found that fever and dry cough were the most common symptoms, followed by expectoration and fatigue. Beyond that, this study found probable clustering patterns of symptoms, which could be summarized into four groups. Furthermore, the common symptom combinations under each group were illustrated. Specifically, the most frequent symptom combinations under the three groups with obvious symptoms (Group 2, Group 3, and Group 4) were fever only, expectoration only, and fatigue accompanied by fever, respectively.
It has been confirmed that both COVID-19 and influenza have fever, cough, and expectoration as their main symptoms (13,19,20). However, the two differ in that symptoms such as vomiting, stuffy nose, runny nose, and ocular symptoms were more common in influenza than in COVID-19 (10,11,21). In COVID-19, symptoms such as fatigue, neurological symptoms (headache), gastrointestinal symptoms (diarrhea), and acute respiratory distress syndrome (ARDS) (chest distress) occurred more frequently (22-24). Similar conclusions were reached in a systematic review comparing COVID-19 and influenza (12). These distinct symptoms were largely consistent with those clustered into Group 4 in this study (i.e., mainly cardiopulmonary, systemic, and/or gastrointestinal symptoms), under which the four most frequent symptom combinations were fatigue complicated with fever, headache complicated with fever, fatigue only, and myalgia complicated with fever. Given that there were both overlaps and variations between COVID-19 and influenza, the information provided by single symptoms was limited. Therefore, awareness of symptom clustering patterns and commonly accompanying symptoms may provide more information for understanding this disease.
In addition, the multinomial logistic regression of population characteristics across symptom groups showed that, compared with the youngest age group (0-12 years), those aged 13-44, 45-64, and ≥65 years had increased odds of showing symptoms of Group 4. Previous studies suggest that immunosenescence and inflamm-aging may be an explanation (25,26). Regarding comorbidities, patients with chronic diseases such as diabetes were more likely to show symptoms of Group 4, consistent with previous findings (27). In addition, the odds of symptoms of Group 2, Group 3, and Group 4 were all lower in imported cases than in indigenous cases, and lower in clustered cases than in non-clustered cases. For the imported cases, the entry quarantine (28) may provide an explanation. As for the finding that clustered cases were less likely to show more severe symptoms, a reasonable explanation is that when infection occurs within the same family, work unit, nursery, or school, an infected person is more likely to be identified as a close contact of another case in the cluster, and thus to be found at an early stage with milder symptoms at the first clinical visit. Regarding the finding that the prevalence of symptom Group 4 (26.8%) was higher than that of Group 2 (12.6%) and Group 3 (10.0%), this study considered the progression of symptoms over time. In the time interval analysis, the proportion of symptom Group 2 decreased as the time interval lengthened, while the proportion of Group 4 increased. This indicated that the presence of symptom Group 4 may, to some extent, be related to a longer time interval between symptom onset and the time infected individuals sought medical treatment. Infected individuals who sought medical treatment later were more likely to have symptoms of Group 4.
These results were partly supported by several previous studies focusing on the dynamics of symptoms. Larsen et al. (29) studied the symptoms of 55,924 confirmed cases using a Markov process and showed that there was a possible order in the development of COVID-19 symptoms. Symptoms may begin with fever or cough, followed by upper respiratory symptoms such as sore throat, and then fatigue and other systemic symptoms, with gastrointestinal symptoms such as nausea, vomiting, diarrhea, and abdominal pain presenting at a later stage of the disease. Huang et al. (30) analyzed the clinical characteristics of 305 patients in the early stage of the pandemic in Wuhan Jinyintan Hospital, China. They found that, compared with the early stages of disease, the incidence of cardiopulmonary symptoms increased significantly as the time interval lengthened. A similar pattern was found in the work of Mizrahi et al. (31). These results suggest that a longer interval may indicate a higher possibility of gastrointestinal symptoms (such as nausea, vomiting, and diarrhea), cardiopulmonary symptoms (such as shortness of breath and dyspnea), and/or systemic symptoms, which were largely consistent with the symptoms of Group 4 in this study.
Another concern was that the odds of symptom Group 4 were higher in patients aged 45-64 years than in those aged ≥65 years. Despite immunosenescence and inflamm-aging (32), elderly people were not as likely as expected to show more severe initial symptoms. However, the influence of symptom progression should not be neglected. Results in this study showed that more people aged 45-64 years had longer time intervals, indicating a delay in seeking medical treatment in this population. Similarly, a study of 14,168 hospitalized infected cases in Belgium found that the working-age group (aged 20-60 years) had longer intervals between symptom onset and their visit to a doctor than elderly people in nursing homes (33). One plausible explanation is that elderly people usually pay more attention to their health than the working population, so any abnormal body signal may be more likely to be detected. In contrast, middle-aged people were more likely to delay medical visits than elderly people and, as a result, had more severe symptoms when first diagnosed. Thus, considering the time-delay effect, this study suggests that middle-aged people may be a subpopulation deserving special attention in the prevention and control of the epidemic. Measures such as health education can be taken to improve the timeliness of medical treatment for the working-age population. In addition, employers could relieve work-related stress by providing paid time off.
In contrast to many studies that mainly described single symptoms, this study focused on the associations among different symptoms and explored potential symptom clustering patterns. It also found that the presence of different symptom groups may be related to the time interval between symptom onset and the time infected individuals sought medical treatment. These results provide a further understanding of the spectrum of COVID-19 symptoms. Furthermore, this study revealed that people of working age were more likely to delay medical treatment and, as a result, had a higher possibility of showing symptoms of Group 4. This could provide inspiration for targeted prevention and control of COVID-19.
This study had several limitations. First, information on comorbidities, such as severity and duration, was not collected, so the estimated impact of comorbidities may be biased by heterogeneity in the severity and duration of the diseases. In addition, in the analysis of the population characteristics of different symptom groups, taking diabetes as an example, the OR and its CI were large, which was attributed to the small number of cases answering "Yes." These results, though statistically significant, remain imprecise, so more research is needed in the future. Second, self-reported symptoms may be subject to recall bias, as individuals may remember some symptoms vividly and overlook others. In the later period of the pandemic, such as in November or summer, individuals may have delayed consultation or disregarded their symptoms, attributing them to influenza rather than COVID-19. Similarly, there may be information bias in the self-reported time of symptom onset. Therefore, more efforts will be needed in the future to validate these findings and translate them into practice for combating COVID-19. Furthermore, it should also be noted that all the patients in this study were infected before the end of November 2020. Therefore, for variants of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) discovered afterward, such as Gamma (34), Delta (35), Omicron (36), and possible future variants, the results of this study would not be directly applicable. However, our analysis procedure might serve as a reference as further variants arise.

CONCLUSIONS
This study focused on the associations among symptoms of COVID-19 and found that the symptoms could be divided into four different clustering groups. The Group 4 symptoms clustered in this study, which were mainly cardiopulmonary, systemic, and/or gastrointestinal symptoms, happened more frequently in COVID-19 than in influenza. This distinction could help deepen the understanding of this disease. In addition, we found that the middle-aged population may be a group requiring more attention during this epidemic, and measures such as paid time off may improve the timeliness of medical treatment for this group.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the Ethics Committee of Sichuan Center for Disease Control and Prevention (SCCDCIRB-2020-007).
Written informed consent to participate in this study was provided by the participants' legal guardian/next of kin.

AUTHOR CONTRIBUTIONS
XC, HW, and TZ conceptualized the analysis. XC and HW implemented statistical analysis. XC, HW, HY, JZ, and TZ contributed to the study implementation, interpretation of results, and writing of the manuscript. LZ, CX, SM, ZL, FH, CY, and WZ did the data collection and cleaning. All authors reviewed and provided comments on the manuscript and approved the final version.