Genomic Feature Analysis of Betacoronavirus Provides Insights Into SARS and COVID-19 Pandemics

In December 2019, the world awoke to a new betacoronavirus strain named severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2). Betacoronavirus consists of A, B, C and D subgroups. Both SARS-CoV and SARS-CoV-2 belong to betacoronavirus subgroup B. In the present study, we divided betacoronavirus subgroup B into the SARS1 and SARS2 classes by six key insertions and deletions (InDels) in betacoronavirus genomes, and identified a recently detected betacoronavirus strains RmYN02 as a recombinant strain across the SARS1 and SARS2 classes, which has potential to generate a new strain with similar risk as SARS-CoV and SARS-CoV-2. By analyzing genomic features of betacoronavirus, we concluded: (1) the jumping transcription and recombination of CoVs share the same molecular mechanism, which inevitably causes CoV outbreaks; (2) recombination, receptor binding abilities, junction furin cleavage sites (FCSs), first hairpins and ORF8s are main factors contributing to extraordinary transmission, virulence and host adaptability of betacoronavirus; and (3) the strong recombination ability of CoVs integrated other main factors to generate multiple recombinant strains, two of which evolved into SARS-CoV and SARS-CoV-2, resulting in the SARS and COVID-19 pandemics. As the most important genomic features of SARS-CoV and SARS-CoV-2, an enhanced ORF8 and a novel junction FCS, respectively, are indispensable clues for future studies of their origin and evolution. The WIV1 strain without the enhanced ORF8 and the RaTG13 strain without the junction FCS “RRAR” may contribute to, but are not the immediate ancestors of SARS-CoV and SARS-CoV-2, respectively.


INTRODUCTION
A new betacoronavirus strain named severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) emerged in December 2019 (Hassan et al., 2020a,b;Lundstrom et al., 2020;Seyran et al., 2020). Betacoronavirus consists of A, B, C and D subgroups. Both SARS-CoV and SARS-CoV-2 belong to betacoronavirus subgroup B. Since SARS-CoV-2 is highly similar to SARS-CoV, many studies have focused on the investigation of the receptor binding domain (RBD) of the Spike (S) protein and its receptor angiotensin-converting enzyme 2 (ACE2) using the same strategies and methods as in SARS-CoV (Graham and Baric, 2010). Different from these studies, we previously reported several other findings on SARS-CoV-2 for the first time, including the following in particular: (1) the alternative translation of Nankai coding sequence (Nankai CDS) that characterize the rapid mutation rate of betacoronavirus at the nucleotide level ; (2) a furin cleavage site (FCS) "RRAR" in the junction region between S1 and S2 subunits (junction FCS) of SARS-CoV-2 that may increase the efficiency of viral entry into cells ; and (3) the use of 5 untranslated-region (UTR) barcoding for the detection, identification, classification and phylogenetic analysis of-though not limited to-CoVs . We defined 13-15 nt sequences of 5 UTRs including the start codons (ATGs) of the first open reading frames (ORFs) as barcodes to represent betacoronaviruses. Using 5 UTR barcodes, 1,265 betacoronaviruses were clustered into four classes, matching the C, B, A, and D subgroups of betacoronavirus, respectively . Preliminary experiments showed that the first hairpins (immediately upstream of the first gene ORF1a) formed by 5 UTR barcodes regulate the translation of downstream genes (Li et al., 2021). These previous studies indicated that recombination, receptor binding abilities, junction FCSs and first hairpins are main factors contributing to extraordinary transmission, virulence and host adaptability of betacoronavirus. Particularly, the jumping transcription and recombination of CoVs share the same molecular mechanism (Li et al., 2021), which inevitably causes CoV outbreaks.
In the present study, we started with the identification of key recombination regions and mutation sites in the genomes of betacoronavirus subgroup B and divided the subgroup B into the SARS1 and SARS2 classes using InDels at six sites. Next, we identified two recently detected betacoronavirus strains RmYN01 and RmYN02 from a bat (Zhou et al., 2020) and discovered that RmYN02 was a recombinant SARS2-like CoV strain. This led us to report-for the first time-a recombination event in open reading frame 8 (ORF8) at the whole-gene level in a bat, which had been co-infected by two betacoronavirus strains. ORF8 (Table 1), existing only in betacoronavirus subgroup B, was considered to have played a significant role in adaptation to human hosts following interspecies transmission (Lau et al., 2015) via the modification of viral replication (Muth et al., 2018). Thus, ORF8 is another main factor contributing to extraordinary transmission, virulence and host adaptability of betacoronavirus. Using the relative RNA abundance between RmYN02 and RmYN01, we validated that ORF8 associates with viral replication. Finally, we analyzed these genomic features of betacoronavirus in the context of its evolution (conjoint analysis of phylogeny and molecular functions; Liu et al., 2018) to explain the SARS and COVID-19 pandemics.

Identification of Key Recombination Regions and Mutation Sites
Based on analysis of betacoronavirus subgroup B (section "Materials and Methods"), key insertions and deletions (InDels) were identified at six sites (named M1 to M6) in the ORF3a, membrane (M), ORF7a, 7b, 8 and nucleocapsid (N) genes, respectively ( Table 1). Using the InDels at six sites, betacoronavirus subgroup B was divided into two classes: (1) the SARS1 class includes SARS-CoV (from patients) and SARS-like CoV (from animals), and (2) the SARS2 class includes SARS-CoV-2 (from patients) and SARS2-like CoV (from animals). This classification result is simple and reliable as all recombination and mutations between them are unlikely to undergo reversible changes together. As a mutation site, M1 has a length of 8 nt in the SARS1 class and 11 nt in the SARS2 class. M2, M3, M4, and M5 in the SARS2 class have 3-nt deletions that are complete codons, whereas M6 in the SARS2 class has 6-nt deletions that are not complete codons.
Almost all the identified recombination events (Table 1) occurred in the ORF1a, S and ORF8 genes. The recombination regions RC1-2 and RC3-7 are located in ORF1a and the S1 region of the S gene, respectively, while the recombination events in ORF8 are complex (see below). To initiate the CoV infection, the S protein encoded by the S gene needs to be cleaved into the S1 and S2 subunits for receptor binding and membrane fusion. By analysis of all recombination events in 292 betacoronaviruses of the subgroup B, we obtained the following results: (1) there are a few genotypes of each recombination region (RC1-7); (2) RC3-7 have more diversity than RC1-2 in the genotypes; (3) betacoronaviruses within the SARS1 and SARS2 classes (see above) have the same genotypes of each recombination region; and (4) there are a few non-synonymous substitutions between different sequences of each genotype. These results suggested that recombination, rather than accumulated mutations (i.e., single nucleotide polymorphisms or InDels) had triggered crossspecies transmission and outbreaks of SARS-CoV and SARS-CoV-2. Mutations may change potential recombination sites, affecting recombination.
Further analysis showed that two recombination regions (RC6 and RC7) are localized in the receptor binding domain (RBD) of S1 (Figure 1), while three other recombination regions (RC3, RC4, and RC5) are localized in the N-terminal domain (NTD) of S1. Almost all secondary structures of five protein segments encoded by RC3 to RC7 are disordered, which are responsible for protein protein interaction (PPI). This suggested that the recombination of RC3 to RC7 improve the adaptability of betacoronaviruses in new hosts (host range expansion; Graham and Baric, 2010) by enhancing interaction of RBD and NTD with their receptors. The adaptability improvement may be Recombination regions (RC1-7) and mutation sites (M1-6) were annotated in the viral genomes of SARS-CoV-2 (GenBank: MN908947) by column 2-4 and RmYN02 (GISAID: EPI_ISL_412977) by column 5-7. The amino acid sequences are encoded by RC1-7 from SARS-CoV-2. All the insertions and deletions refer to SARS-CoV (GenBank: AY278489). # Since RmYN02 has a recombinant ORF8, it has the same allele at the M5 site as SARS-CoV.
driven by nature selection, as the positive or negative selection of the S gene is particularly strong (Lau et al., 2015). Since both RBD and NTD had similar recombination events in their PPI regions, we proposed that NTD has a specific receptor just like RBD has ACE2. Thus, the S1 subunit of SARS-CoV-2 may have more than one specific receptor (Figure 1) like gp120 of HIV has the receptors of differentiation 4 receptor (CD4) and the C-C chemokine receptor 5 (CCR5). Comprehensive analysis and reuse of data from different sources are necessary to identify the other receptor/s of SARS-CoV-2. A previous study identified two genetic susceptibility loci (rs11385942 at locus 3p21.31 and rs657152 at locus 9q34.2) in COVID-19 patients with respiratory failure using genome-wide association analysis (Ellinghaus et al., 2020). The locus 3p21.31 was associated with six genes SLC6A20, LZTFL1, CCR9, FYCO1, CXCR6, and XCR1. However, the previous study only focused on the further analysis of the locus 9q34.2 to confirm a potential involvement of the ABO blood-group system. The researchers did not notice that three chemokine receptors CCR9, CXCR6, and XCR1 merit further investigation as candidates for SARS-CoV-2 receptors. The analysis of bulk RNA-seq data showed high expression of CCR9 and XCR1 in thymus and CXCR6 in T cells, compared to other tissues and cell types (Ellinghaus et al., 2020). In particular, the thymic cells were consistently negative for ACE2 but many CoVs can infect thymus (Lins and Smaniotto, 2020).
By investigating interaction of three protein segments encoded by RC3 to RC5 in NTD (Table 1) with CCR9, CXCR6, and XCR1, we found that CCR9 is the most possible candidate among three chemokine receptors. However, the final determination of the other receptor/s of SARS-CoV-2 needs more calculation and experiments on candidates at the whole-genome level.
Our study did not rule out the possibilities of non-receptor proteins binding to NTD.

Identification of Two Betacoronavirus Strains From a Bat
Recently, two betacoronavirus strains RmYN01 and RmYN02 (GISAID: EPI_ISL_412976 and EPI_ISL_412977) were detected from a bat of Rhinolophus malayanus (Zhou et al., 2020). Since betacoronaviruses of the subgroup B share many highly similar regions in their genome sequences, it is difficult to assemble them correctly from a mixed sample using short high-throughput sequencing (HTS) reads. Therefore, EPI_ISL_412976 was only assembled into a partial sequence in that previous study (Zhou et al., 2020). However, the exact identification of viruses requires the complete genomes or even the full-length genomes. Using paired-end sequencing data, we reassembled these two virus genomes and obtained two full-length sequences to update EPI_ISL_412976 and FIGURE 1 | SARS-CoV-2 may have more than one specific receptor. The S protein is cleaved into two subunit S1 and S2 (in red color) for receptor binding and membrane fusion. S1 has two domains, RBD (in green color) and NTD (in blue color). It is well accepted that S1 binds to its specific receptor angiotensin-converting enzyme 2 (ACE2) by the interaction between RBD and ACE2 (in purple color). In the present study, we propose that the S protein of SARS-CoV-2 may have more than one specific receptor for its function like gp120 of HIV has CD4 and CCR5. The structure of S was predicted using trRosetta (Yang et al., 2020).

EPI_ISL_412977 (Supplementary Material).
Using 5 UTR barcodes (section "Introduction"), the betacoronaviruses RmYN01 and RmYN02 were identified as belonging to the subgroup B. Using the InDels at M1-M6, RmYN01 was further identified as belonging to the SARS1 class, respectively. Using the InDels at M1-M4 and M6, RmYN02 was further identified as belonging to the SARS2 class but a recombinant SARS2-like CoV strain. RmYN02 was supposed to have a 3-nt deletion at the M5 site; however, it did not ( Table 1). This led us to report-for the first time-a recombination event in ORF8 at the whole-gene level in a bat, which had been co-infected by two betacoronavirus strains. Existing only in betacoronavirus subgroup B, ORF8 was considered to associate with viral replication (section "Introduction"), mainly based on the discovery of a 29-nt deletion in SARS-CoV (GenBank: AY274119) (Muth et al., 2018) and a 382-nt deletion in SARS-CoV-2 (GISAID: EPI_ISL_414378-80) (Su et al., 2020). However, it was also reported that ORF8 associated with attenuation without changes in its replication . Although many recombination events in ORF8 of betacoronaviruses have been reported by sequence analysis, it is difficult to determine whether they were recombination events or small-size mutation (InDel and SNP) accumulations as most of them only occurred over very small genomic regions, except the two events (see above). In the present study, the discovery of a recombination event in ORF8 at the whole-gene level led to the determination of three types (see below) of ORF8 genes in betacoronavirus subgroup B, providing new clues to investigate the functions of ORF8.
Next, we conducted further research on the biological functions of ORF8 to test a previous hypothesis that type 2 ORF8 genes enhance the viral replication. RmYN01 and RmYN02 were simultaneously detected in a bat, providing a special opportunity to compare their genome copy numbers. The difference between the genome copy numbers of RmYN01 and RmYN02 can be estimated by their relative RNA abundance. Aligning RNA-seq data to the genomes of RmYN01 and RmYN02, our calculation showed that the RmYN01 genome was covered 99.85% of its length with an average depth of 32.89 (Figure 2A), while the RmYN02 genome was covered 99.89% with an average depth of 298.99 ( Figure 2B). The relative RNA abundance between RmYN02 and RmYN01 was about 9. Based on the "leader-tobody fusion" model explaining the replication and transcription of CoVs (Li et al., 2021), the difference of RNA abundance FIGURE 2 | RNA abundances of RmYN01 and RmYN02 in a bat. RNA-seq data from a bat was aligned to two genomes of RmYN01 and RmYN02 (GISAID: EPI_ISL_412976 and EPI_ISL_412977). RNA abundance is represented by read counts (y-axis). The relative RNA abundance between RmYN02 and RmYN01 was about 9. (A) RmYN01 was identified as belonging to the SARS1 class and has a type 3 ORF8. (B) RmYN02 was identified as belonging to the SARS2 class but has an type 2 ORF8 (enhanced ORF8).
within the ORF1a and ORF1b regions (Figures 2A,B) resulted from CoV replication, rather than transcription. This result suggests that type 2 ORF8 (named enhanced ORF8) genes enhance the replication of RmYN02, ruling out the possibility that transcription contributes to the relative RNA abundance between RmYN02 and RmYN01. Another factor S1 may also result in the difference of genome copy numbers between of RmYN01 and RmYN02. However, the nucleotide identity of S1 gene regions between RmYN01 and RmYN02 reaches 69%, while type 3 ORF8 of RmYN01 and type 2 ORF8 of RmYN02 cannot be well aligned to calculate the nucleotide identity (see below). The difference in S1 is nothing compared with that in ORF8. These results validated that ORF8 associates with viral replication.

Conjoint Analysis of Phylogeny and Molecular Functions
Based on conjoint analysis of phylogeny and molecular functions that was proposed in our previous study (Liu et al., 2018), genes (i.e., ORF1a, S1, and ORF8) containing the recombination regions under high selection pressure must be removed in phylogenetic analysis. Using large segments (Supplementary Material) spanning S2, ORF3a, envelope (E), M, ORF6, 7a, 7b, N(9a), and ORF10 (Table 1), phylogenetic tree 1 (Figure 3A) showed that 19 of 21 betacoronaviruses (section "Materials and Methods") were classified into two major clades, corresponding to the SARS1 and SARS2 classes (see above) divided using the InDels at six sites, respectively: (1) the SARS1 class includes two clusters-the SARS-CoV cluster including SARS-CoV and a few most closely related SARS-like CoVs (from bats or civets) and the SARS-like CoV cluster including all other SARS-like CoVs; and (2) the SARS2 class includes two clusters-the SARS-CoV-2 cluster and the SARS2-like cluster including all SARS2-like CoVs (from bats and pangolins). The SARS1 class was divided into the SARS-CoV and SARS-like CoV clusters by the types of ORF8, while the SARS2 class was divided into the SARS-CoV-2 and SARS2-like CoV clusters by the presence of junction FCS "RRAR" . Currently, the SARS-CoV-2 cluster only includes one strain (GenBank: MN908947), which was clustered with SARS2-like CoV into one clade in phylogenetic trees.
1 ORF8 (Figure 3B), RmYN02 was identified as a recombinant SARS2-like CoV strain. The identification of RmYN02 indicated that recombination occurred across the SARS1 and SARS2 classes, which has potential to generate a new strain with similar risk as SARS-CoV and SARS-CoV-2.
As a recombinant SARS-like CoV strain with a type 3 ORF8 isolated from Chinese horseshoe bats (Rhinolophus sinicus), WIV1 was considered most closely related to SARS-CoV (Ge et al., 2013). Comparing phylogenetic tree 1 with 2 suggested that WIV1 is not the immediate ancestor of SARS-CoV. This confirmed a previous hypothesis: the ancestor of SARS-like CoVs from civets was a recombinant virus with ORF8 originating from greater horseshoe bats (Rhinolophus ferrumequinum) and other genomic regions originating from different horseshoe bats (Lau et al., 2015). However, whether these recombination events occurred in bats or civets remains unclear (Lau et al., 2015). Both phylogenetic tree 1 ( Figure 3A) and 2 ( Figure 3B) consistently revealed that SARS-CoV-2 is most closely related to the well-known strain RaTG13 (GenBank: MN996532) isolated from intermediate horseshoe bats (Rhinolophus affinis). However, RaTG13 is unlikely to be the immediate ancestor of SARS-CoV-2 due to lack of the junction FCS "RRAR." In addition, all pangolin (Manis javanica) betacoronaviruses investigated using their public genomes (section "Materials and Methods") were identified as belonging to the SARS2-like CoV cluster. However, further analyses of these genomes did not support that pangolins are the intermediate host(s) of SARS-CoV-2 (Lam et al., 2020) for three main reasons: (1) pangolin betacoronaviruses do not have the junction FCS "RRAR" ; (2) all reported strains (e.g., GISAID: EPI_ISL_410721 and GenBank: MT040334) are farther from SARS-CoV-2 than RaTG13 in the phylogenetic tree 1 and 3; and (3) pangolin betacoronaviruses are unlikely to lose "RRAR" so soon, as betacoronaviruses of the subgroup A lost junction FCSs after a long-term evolutionary change. This suggested that the intermediate host(s) of SARS-CoV-2 carry at least one betacoronavirus strain with junction FCS "RRAR" in the S protein.

Outbreak and Evolution of Betacoronavirus
Recombination, receptor binding abilities, junction FCSs, first hairpins and ORF8s (see above) are main factors contributing to extraordinary transmission, virulence and host adaptability of betacoronavirus. By analysis of these main factors in 1,300 betacoronavirus genomes (section "Materials and Methods"), we concluded: (1) as the most important factor, rapid recombination of viral genomes provides CoVs the strong ability of crossspecies transmission and outbreak; (2) the strong recombination ability of CoVs integrated other main factors to generate multiple recombinant strains, of which very a few evolved into super virus strains (e.g., SARS-CoV and SARS-CoV-2) causing pandemics by natural selection; (3) the immediate ancestor of betacoronavirus did most likely have two junction FCS and a strong first hairpin, and it transmitted across species during its outbreak; and (4) after a period of adaption in new hosts, betacoronavirus was attenuated to spread widely and persist in the host population by loss of abilities attributed to one or more factors (e.g., junction FCSs).
In betacoronavirus subgroup C (Figure 4A), middle east respiratory syndrome coronavirus (MERS-CoV) has two junction FCSs. The first one "RSTR," located at position 694 in the S protein (noted as MERS-S-R694), is non-functioning, as a result of attenuation, because there is a disulfide bond across MERS-S-R694. However, the second junction FCS "RSVR" (MERS-S-R751) is still functional. Originated from the same ancestor of MERS-CoV, MERS-like CoVs (e.g., hedgehog CoV) without "RSTR" were further attenuated by loss of MERS-S-R751. In betacoronavirus subgroup B (Figures 4A,B), SARS-CoV-2 (GenBank: MN908947) has the junction FCS "RRAR" (SARS2-S-R685), but lost another junction FCS by substituting "KNTQ" for "RNTR" (SARS-S-R761), as a result of attenuation. All SARS-2 like CoVs (from bats or pangolins; Li et al., 2020) without "RNTR" were further attenuated by loss of SARS2-S-R685. The immediate ancestor of SARS-CoV with inaccessible "RNTR" (SARS-S-R761) that has secondary structures in helix rather than coil was an attenuated variant of SARS-CoV-2 by loss of "RRAR" (SARS2-S-R685). All SARS-like CoVs without "RNTR" were further attenuated by loss of "RRAR." In betacoronavirus subgroup D, HKU9-CoV was attenuated by loss of two junction FCSs but still have a strong first hairpin. In betacoronavirus subgroup A (Figure 4A), although almost all strains (e.g., HCoV-OC43 and HCoV-HKU1) still have one junction FCS, they do not have strong first hairpins or enhanced ORF8s. These strains were heavily attenuated from the immediate ancestor of betacoronavirus due to complex reasons. For an example, the average arginine (R) percentage (2.63%) of S proteins in betacoronaviruses of the subgroup A except mouse hepatitis virus (MHV) is significantly lower than those in MHV and betacoronaviruses of the subgroups B, C, and D (3.34, 3.33, 3.32, and 3.33%). This indicated that accumulated mutations caused attenuation by loss of arginine residues, since arginine residues are indispensable for the protease cleavage sites. Other reasons may include the loss of strong first hairpins and genetic events in the transcription regulatory sequences (Li et al., 2021), an important factor that was not further investigated in the present study, but merit further investigation in the future.
Guided by conjoint analysis of phylogeny and molecular functions, we concluded the following ( Figure 3D): (1) in general, betacoronaviruses (and even CoVs) were and are undergoing attenuation to spread widely and persist in host population after every outbreak; (2) the immediate ancestor of the subgroup C (e.g., MERS-CoV) was most closely related to the immediate ancestor of betacoronavirus with slight attenuation; (3) the immediate ancestors of the subgroups B and D diverged subsequently and were further attenuated; and (4) betacoronaviruses of the subgroup A were most heavily attenuated and have the highest diversity in their genomes and hosts. In betacoronavirus subgroup B (Figure 3D), (1) the immediate ancestor of the SARS-CoV-2 cluster was most closely related to the immediate ancestor of the subgroup B with slight attenuation; (2) the immediate ancestor of the SARS-CoV cluster diverged subsequently and was further attenuated; and (3) the SARS-like CoV cluster was most heavily attenuated and has the highest diversity in the genomes and hosts. All the SARS-like CoVs (e.g., WIV1 and RmYN01) are attenuated variants of SARS-CoV, while all the SARS2-like CoVs (e.g., RaTG13, RmYN02 and betacoronaviruses from pangolins) are attenuated variants of SARS-CoV-2. As recombinant betacoronavirus, the immediate ancestor of SARS-CoV is characterized by the enhanced ORF8, while the immediate ancestor of SARS-CoV-2 is characterized by the junction FCS "RRAR." Therefore, WIV1 without the enhanced ORF8 and RaTG13 without the junction FCS "RRAR" may contribute to, but are not the immediate ancestors of SARS-CoV and SARS-CoV-2, respectively.

CONCLUSION
Recombination, receptor binding abilities, junction FCSs, first hairpins and ORF8s are main factors contributing to extraordinary transmission, virulence and host adaptability of betacoronavirus. Junction FCSs and enhanced ORF8s increase the efficiencies in viral entry into cells and genome copy numbers, respectively, while strong first hairpins may enhance the translation of their downstream proteins. The strong recombination ability of CoVs integrated other main factors to generate multiple recombinant strains, two of which evolved into SARS-CoV and SARS-CoV-2 by natural selection, resulting in the SARS and COVID-19 pandemics. The outbreaks of MERS-CoV, SARS-CoV and SARS-CoV-2 were triggered by recombination events, not accumulated mutations. So it is not suitable to estimate their divergence time using current theories in evolutionary biology. The origins of ORF8 and the junction FCS "RRAR" are still unknown. Future investigation needs be conducted to search for the betacoronavirus strains that provided the enhanced ORF8 and the junction FCS "RRAR" to SARS-CoV and SARS-CoV-2, respectively. Based on our theories, two predictions can be made: (1) more attenuated (by loss of junction FCSs or ORF8s) variants of SARS-CoV-2 will be reported; and (2) SARS2-like CoV with at least one junction FCS "RRAR" will be eventually detected.

MATERIALS AND METHODS
The software VirusDetect (Zheng et al., 2017) was used to detect viruses in RNA-seq data (Zhou et al., 2020). The software Fastq_clean (Zhang et al., 2014) was used for RNA-seq data cleaning and quality control. The genomes of RmYN01 and RmYN02 (GISAID: EPI_ISL_412976 and EPI_ISL_412977) were reassembled by aligning RNA-seq data on two closest reference genomes JX993988 and MN908947. SVDetect v0.8b and SVFilter (Zhang et al., 2016) were used to removed abnormal aligned reads. Several haploid contigs (Supplementary Material) highly similar to the complete RmYN01 genome were also assembled. This suggested that there exists more than one betacoronavirus strain belonging to the SARS-like CoV cluster in the same sample, from which RmYN01 and RmYN02 were detected. Protein structure data (PDB: 5 × 5B, 6ZGE, 6ZGF, 5 × 5F, 3JCL, 5I08, and 6OHW) were used to analyzed the FCSs of SARS-COV, SARS-CoV-2, SARS2-like CoV, MERS-CoV, MHV, HCoV-HKU1, and HCoV-HKU43, respectively.
1,265 genome sequences of betacoronaviruses (in subgroups A, B, C, and D) were downloaded from the NCBI Virus database 1 in our previous study . Among these genomes, 292 belongs to betacoronavirus subgroup B. Plus 35 genomes from the GISAID database, 1,300 betacoronavirus genomes were used for analysis in the present study. In our previous study, 10 complete genomes of betacoronavirus subgroup B (GenBank: JX993987, JX993988, GQ153539, GQ153540, GQ153542, DQ071615, DQ412043, AY515512, AY572034, and DQ497008) were selected and used for the analysis (Liu et al., 2018). To trace the origin of SARS-CoV, five complete genomes were added in the present study. They are DQ412042 (SARS-like CoV from Rhinolophus ferrumequinum), AY274119 (SARS-like CoV from a SARS patient in Toronto, Tor2), AY278489 (SARS-like CoV from a SARS patient in Guangdong, GD01), AY304486 (SARS-like CoV from civet) and KF367457 (SARS-like CoV from bat). DQ497008 was removed as a redundant sequence of AY274119 and AY278489. To trace the origin of SARS-CoV-2, three complete genomes were added. They are MN908947 (SARS-CoV-2), MN996532 (SARS2-like CoV hosted in Intermediate Horseshoe bats (Rhinolophus affinis) from Yunnan) and MG772934 (SARS2-like CoV hosted in Chinese horseshoe bats (Rhinolophus sinicus) from Zhejiang). A SARS2-like CoV (GISAID: EPI_ISL_410721) from pangolins (Collected in Guangdong, China) and a SARS2-like CoV (GenBank: MT040334) from pangolins (Collected in Guangxi, China) were used to represent SARS2-like CoVs from pangolins after the removal of sequence redundancy. In total, 21 complete genomes including RmYN01 and RmYN02 (GISAID: EPI_ISL_412976 and EPI_ISL_412977) were used for the phylogenetic analysis applying the neighbor joining (NJ) method. Sequence alignment was performed using the Bowtie v0.12.7 software with pairedend alignment allowing 3 mismatches; mutation detection and other data processing were carried out using Perl scripts; the phylogenetic analysis was performed using MEGA v7.0.26; Statistics and plotting were conducted using the software R v2.15.3 with the Bioconductor packages . The structure of the S protein was predicted using trRosetta (Yang et al., 2020).

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/ Supplementary Material.

AUTHOR CONTRIBUTIONS
SG conceived the project. SG and GD supervised this study. JC and SC conducted programming. XL, LW, and QZ downloaded, managed, and processed the data. TY predicted the structure of the S protein. JR analyzed the structure of S1. SG drafted the main manuscript text. SG and ZH revised the manuscript. All authors contributed to the article and approved the submitted version.  (31700787) to GD. The funding bodies played no role in the study design, data collection, analysis, interpretation or manuscript writing.