Biological Significance of the Genomic Variation and Structural Dynamics of SARS-CoV-2 B.1.617

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) variants have been emerging and circulating globally since the start of the COVID-19 pandemic, of which B.1.617 lineage that was first reported in India at the end of 2020, soon became predominant. Tracing genomic variations and understanding their impact on the viral properties are the foundations for the vaccine and drug development and for the mitigation measures to be taken or lifted. In this study, 1,051 near-complete genomes and 1,559 spike (S) sequences belonging to the B.1.617 were analyzed. A genome-wide spread of single nucleotide polymorphisms (SNPs) was identified. Of the high frequency mutations identified, 61% (11/18) involved structural proteins, despite two third of the viral genome encoding nonstructural proteins. There were 22 positive selection sites, mostly distributed across the S protein, of which 16 were led by non-C to U transition and should be of a special attention. Haplotype network revealed that a large number of daughter haplotypes were continually derived throughout the pandemic, of which H177, H181 H219 and H286 from the ancestor haplotype H176 of B.1.617.2 were widely prevalent. Besides the well known substitutions of L452R, P681R and deletions of E156 and F157, as well as the potential biological significance, structural analysis in this study still indicated that new amino acid changes in B.1.617, such as E484Q and N501Y, had reshaped the viral bonding network, and increasingly sequenced N501Y mutant with a potential enhanced binding ability was detected in many other countries in the follow-up monitoring. Although we can’t conclude the properties of all the mutants including N501Y thoroughly, it merits focusing on their spread epidemically and biologically.


INTRODUCTION
The COVID-19 pandemic caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is spreading globally. As of June 15, 2021, more than 175 million cases and nearly 3.9 million deaths have been confirmed 1 . Coronaviruses are enveloped, positive-sense RNA viruses that undergo rapid mutation and recombination (Zhao et al., 2004;Su et al., 2016). As expected, the spread of SARS-CoV-2 has generated a large number of variants. To this end, a variant classification scheme has been developed to assign the major variants to one of three classes: variant of interest (VOI), variant of concern (VOC), and variant of high consequences (VOHCs). To date, no VOHCs have been determined, while four variants have been considered as VOI (B.1.525, B.1.526, B.1.617.1, and C.37) and another four as VOC (B.1.1.7, B.1.351, B.1.617.2, and P.1). Of note, the variants in each of the three classes would be constantly updated upon continual surveillance of the risks to global public health.
The B.1.617 lineage of SARS-CoV-2 was first reported in India in October 2020 and has spread to numerous countries and regions. As of June 15, 2021, three B.1.617 sublineages with distinct mutations have emerged: VOI B.1.617.1 (Kappa), VOC B.1.617.2 (Delta), and VOI B.1.617.3. Though the amino acid (aa) changes of G142D, L452R, D614G, and P681R occurring in the spike (S) protein are signatures for B.1.617 and present in all the three sublineages, distinct mutation profiles were found in each of the sublineages. For example, T478K aa change in the receptorbinding domain (RBD) and 156-157 deletions in the N-terminal domain (NTD) of the S protein are unique to B.1.617.2, while E154K and Q1071H substitutions exist only in the B.1.617.1 sublineage. The mutation features warrant further assessment both individually and as a whole for the B.1.617 sublineages. Some of the aa changes or similar ones in B.1.617 were also identified in other circulating lineages: D614G was also found in B.1 lineage, L452R in B.1.526 (Iota), and P681H in B.1.1.7 (Alpha). L452R is located in the RBD, and P681H or P681R is located in the furin cleavage site. It was observed that the variants carrying spike L452R change are likely to be more transmissible and infective and less susceptible to the neutralizing antibodies from convalescent patients and vaccine recipients (Deng et al., 2021). The variant bearing P681R, such as B.1.617.2, has an increased furin-mediated cleavage at the S1/S2 cleavage site that would lead to enhanced viral fusogenicity and exhibit a higher pathogenicity (Liu et al., 2021b;Peacock et al., 2021;Saito et al., 2021). It was also observed that in the hamster model, B.1.617.1 has a higher pathogenicity than has B.1 (Yadav et al., 2021a). Moreover, it has been reported that B.1.617 variant has a reduced neutralization than the Wuhan prototype and B.1.1.7 by using monoclonal antibodies and/or convalescent sera from infected and vaccinated individuals (Hu et al., 2021;Liu et al., 2021a;Planas et al., 2021;Yadav et al., 2021b).
The understanding of sequence changes is important in clarifying the impact of different residues on the viral properties, such as transmissibility, pathogenicity, infectivity, and immune escape potential. Since only a short time has elapsed since the emergence of the B.1.617 variants, limited evidence has been available to establish the relationship between the aa differences and the phenotypical or epidemiological impact of each emerging variant. Thus, a close surveillance on viral genomic variation and analysis of the changes on the potential biological significance is needed. In this study, we examined the genomic signatures, spatial-temporal dynamics, and tertiary structures of functional proteins bearing the aa changes in B.1.617 viruses. Our results exposed a large number of mutations distributed across the B.1.617 viral genomes. Novel haplotypes have evolved from the original ones, some of which are widely prevalent. Evidence of selective pressure was detected at specific sites on viral proteincoding genes. Specifically, aa substitutions at positions 484 and 501 in the RBD appeared to alter the protein tertiary structure and may affect the biological activity of the B.1.617 variants. This study would help us thoroughly understand the genomic and phenotypic features of the B.1.617 variants and provide a foundation to explore novel strategies for the development of vaccines and anti-viral drugs.

Retrieval and Selection of Viral Sequences
SARS-CoV-2 genomic sequences were gathered from the Global Initiative on Sharing All Influenza Data (GISAID 2 ) by selecting the "VOC G/452R.V3 (B.1.617.+)" as the "Variants" search term. The sequences were downloaded, ensuring that no more than five sequences were collected from the same day and the same country or region. Low-quality full-length genome and S-encoding gene (with gap or "Ns") sequences were removed. The Wuhan-Hu-1/2019 prototype genomic sequence (NC_045512.2) was downloaded from the National Center for Biotechnology Information (NCBI) GenBank Database 3 and used as reference for mutation site and structure annotation.

Calculation of Genetic Diversity
All the sequences were annotated by the accession number, collection date, and location and were aligned using the MAFFT version 7 multiple sequence alignment program (online version 4 ). The number of single-nucleotide polymorphisms (SNPs) across the SARS-CoV-2 genomes was analyzed online 5 in the Coronaviridae taxon. The relative frequencies of aa occurrence (bits) at informative sites were visualized using the online version of the WebLogo sequence logo generator 6 .

Selective Pressure Analysis
In order to characterize the B.1.617 variations and to localize statistically supported positively and negatively selected sites, the selective pressure analysis was performed using the Datamonkey adaptive evolution server (online version 7 ) (Pond et al., 2012;Spielman et al., 2019). Positive (or diversifying) selection is defined as sites statistically significant for a positive value of non-synonymous (dN) to synonymous (dS) substitution ω > 1, negative (or purifying) selection is inferred for ω < 1, and neutrality is ω = 1 (Zhang et al., 2005). To reduce the impact of high possibility of occasional events on the analysis based on the microevolution scale, three site-level selection methods-mixed-effects model of evolution (MEME), fast unbiased Bayesian approximation (FUBAR), and singlelikelihood ancestor counting (SLAC)-were applied, and their results were merged in this study. MEME reports positively selected sites that consider both pervasive and episodic selection using a mixed-effects maximum likelihood approach; it infers an α (dS) value and two β (dN) values (β-and β+) per-site with two ω rates fitting a null and alternative model each; a site is under the positive selection if β+ > α and shown to be assessed significantly by the likelihood ratio test; FUBAR is a method for detecting site selection on datasets including more than 500 sequences; it uses a Bayesian approach to infer dN and dS substitution rates on a per-site basis and reports evidence of selections using posterior probabilities; it finds sites in which positive selection is working in case of posterior probabilities ≥0.9; SLAC is a counting-based method that uses maximum likelihood inferring ancestral characters for individual site across the phylogeny; it directly counts the number of synonymous and non-synonymous changes at each site over evolutionary time, and positive selection is defined if non-synonymous changes are more than those in the site under neutrality (Spielman et al., 2019).

Haplotype Network Performance
DnaSP v5.0 was used to define sequence sets and generate multi-sequence-aligned haplotype data in the rdf file format. Phylogenetic network analyses were performed using the Network 10200 package, based on the haplotypes generated by DnaSP. The data were run with the epsilon parameter set to zero, and an exploratory run was performed by setting the epsilon parameter to 10 with the median joining network algorithm. The network output was annotated using the Draw Network option to indicate geographical and time distribution, as well as cluster nomenclature.

Reconstruction of Geographical-and Time-Scaled Phylogeny
The aligned S gene sequences were used in the phylogenetic analysis. The phylogenetic tree was generated using MEGA 7.0.26 software, employing the maximum likelihood method with the Kimura 2-parameter substitution model.

Spatial Structure Construction and Analysis
The tertiary structures of the B.1.617 variant S protein RBD mutants were predicted using the Phyre2 server 8 (van Dorp et al., 2020). Protein-protein interactions were examined using the HDOCK server 9 , based on a hybrid algorithm of templatebased modeling with default parameters (Islam et al., 2020). The structures were calculated and visualized using PyMOL software version 2.

Sequence Information
The B.1.617 sequences analyzed in this study were collected between December 1, 2020, and June 2, 2021. A total of 1,051 high-quality, near-complete genomes and 1,559 S gene sequences were retrieved (Supplementary Material). As the spatial-temporal analysis in this study was performed based on the S gene, we calculated the collection time and location for the 1,559 sequences. The data showed that the sequences included in the present study covered each month of the 6-month B.1.617 pandemic, during which the sequence number prior to February 2021 was relatively small and then increased from March 2021 ( Figure 1A). With regard to geographic distribution, 29 and 25% of sequences originated from Western Europe and Southern Asia, respectively, while only ∼3% of the sequences were derived from Oceania and Africa ( Figure 1B).

High Genetic Diversity and Amino Acid Changes Occurring in B.1.617 Lineage
The total nucleotide (nt) identity of the analyzed 1,051 near-complete genome sequences was 99.9%. However, when comparing the sequences to the early B.1.617 sequence identified in December 2020 (EPI_ISL_1360317), 1,857 nt sites with a mutation frequency >1% were found across the genome (29,424 nt). A higher occurrence frequency in the coding genes of structural protein than non-structural proteins was observed (Figure 2A). Eighteen nt sites with mutation frequency >60% were identified, of which 11 were located in the structural proteincoding regions and eight within the S gene (C56G, T284C, C333T, G425A, A460G, C1433A, C1450G, and G2848A). Of these eight mutated sites, seven were non-synonymous mutations (T19R, I95T, G142D, K154E, T478K, Q484E, and D950N), and one was a synonymous mutation (D333).
A total of 76 aa substitutions with an occurrence frequency >1% were found in B.1.617, of which 39 were positioned in non-structural proteins and 37 in structural proteins. When the proteins were further analyzed individually, the S protein was found to contain the largest number of substitutions, followed by the nsp3 and nsp12 proteins. There were no substitutions in the nsp1, nsp7-11, E, or ORF10 proteins (Figures 2B,C). More than a quarter (27.6%, 21/76 sites) of all the substitutions targeted the S protein, of which 11 had high occurrence frequencies (>10%) and three (aa 382, 478, and 484) were located in the RBD. Additionally, aa deletions were also found in the S (E156 and F157 ), nsp1 (K142 , S143 , and F144 ), and ORF8 (D119 and F120 ) proteins with 69.5, 44.3, and 1.1% occurrence frequencies, respectively. The RNA-dependent RNA polymerase (RdRp), nsp12, contained 5 aa substitutions with an occurrence frequency of over 1%: T26I, K478N, T643I, G671S, and V675I, of which G671S had the highest frequency (67.8%) ( Figure 2B).

Specific Sites Under Selective Pressure
Diversifying and purifying selection sites on the B.1.617 coding sequences were estimated. The sites that were supported by at least two methods (p-value < 0.1 in MEME and SLAC or posterior probability >0.9 in FUBAR) were considered as candidates under selective pressure, and the sites supported by all the three methods were defined as strong selection sites. Results showed that the selection pressure varied markedly among different proteins. Neither positive nor negative selection sites were found in the nsp1, nsp4, nsp5, nsp8, nsp11, E, ORF3a, ORF6, ORF7b, ORF8, or ORF10 coding sequences. In contrast, both positively and negatively selected sites were identified in the nsp6, nsp12, nsp13, nsp16, S, and nucleocapsid (N) coding sequences. A total of 29 negatively selected sites were found in 11 different proteins, of which 22 (75.9%) were located in the nonstructural proteins. Meanwhile 22 positively selected sites were found in nine different proteins, with 14 (63.6%) sites distributed in the structural proteins, of which nine (41%) were located in the S protein. Additionally, six of these 22 sites were led by C-to-U transition (R134N in nsp10, A394V in nsp14, A222V and H1101D/Y in S, and P13T and T135I in N). As C-to-U transition is a preferred direction of nucleotide mutations in SARS-CoV-2 (Matyasek and Kovarik, 2020;van Dorp et al., 2020), we should pay more attention on the 16 positive selection sites that were mutations of other directions than C to U. Notably, seven sites, of which four (57%) were located in the S protein, were identified under strong positive selection ( Table 1). No positive selection sites were found localized in the RBD.

Spatial-Temporal Distribution of B.1.617 Sublineages
Among the 24 mutated nucleotide sites with occurrence frequency of more than 1% in the 1,559 S gene, 13 were transitional mutations and 11 were transversional ones. The C-to-U transition (six sites) and G-to-U transversion (five sites) accounts for 45.8% of the mutations, while the rest of the 10 mutation directions account for the 54.2%. The overall estimated bias of transitions and transversions in the selected lineage is 5.23.
To trace the history of potential B.1.617 sublineage spread, an unrooted maximum likelihood tree was built using the 1,559 S protein encoding sequences. B.1.617.3, the first reported sublineage, was less prevalent at the time of analysis; and therefore a relatively small number of sequences was included in this study, which did not display any obvious spatial or temporal specificity (Figure 3). Apparent temporal specificity was found in both B.  Figure 3B). Moreover, with regard to geographical distribution, B.1.617.1 was mainly localized to Southern Asia, while B.1.617.2 acquired worldwide distribution ( Figure 3A). The temporal and geographical specificity was caused by the fact that since its emergence, B.1.617.2 quickly replaced B.1.617.1 as the main prevalent sublineage since its emergence.

Haplotype Network Generation
To understand the evolution of B.1.617 sublineages in human hosts and trace infection pathways, we generated a phylogenetic network. Overall, the network showed that different newly mutated daughter nodes were derived from their ancestors (Figure 4). We used SARS-CoV-2 S encoding sequences from the early stage of the COVID-19 pandemic (prior to March 2020) as an outgroup (haplotype 1, H1), resulting in the root of the network being placed in H10. We found that the B.1.617 viruses and H1 were distinguished by four non-synonymous mutations: U1355G (L452R), G1450C (E484Q), A1841G (D614G), and C2042G (P681R). All the 10 sequences in the H10 were sampled in India. Three large individual nodes (I, II, and III) were evolved from H10, of which nodes I and II differed from H10 via the U333C (synonymous mutation), C3301G/T (H1101D) and U284C (I95T), and T333C mutations; while node III differed from H10 via the C56G (T19R), C1433A (T478K), and C1450G (Q484E) mutations. Nodes I, II, and III corresponded to lineages B.1.617.3, B.1.617.1, and B.1.617.2, respectively. For B.1.617.3, H102 was the most prevalent haplotype, and 47 of the 87 sequences (54%) in H102 were sampled in India. H8 was the most prevalent B.1.617.1 haplotype, and 85 of the 190 sequences (45%) were sampled in India. With the pandemic ongoing, novel B.1.617.2 nodes were further derived from the ancestral node. B.1.617.2 is the most prevalent among the three sublineages; the number of individuals included in the nodal type and in mutational branches radiating was striking. Four major daughter network nodes or haplotypes (H219, H177, H181, and H286) were derived from B.1.617.2 and spread widely in different continents, especially around Europe ( Figure 4A)

The Tertiary Structures of Receptor-Binding Domain and RNA-Dependent RNA Polymerase Mutants With Amino Acid Changes
The aa substitutions with occurrence frequencies >1% were V382L, T478K, and E484Q in the RBD of the S protein (S-RBD). In addition, substitutions were also detected at the 417 (K to N) and 501 (N to Y) sites in the RBD, with low occurrence frequencies. Docking structures of B.1.617 wild-type S-RBD and the receptor [human angiotensin-converting enzyme (hACE2)] showed that residues 417, 484, and 501 have a bonding network with hACE2. Thus, corresponding three pairs (N417 and K417, E484 and Q484, and N501 and Y501) of docking structures between the S-RBD and hACE2 were analyzed. A low energy indicates a stable system of the interacted proteins, and we selected the structure with the lowest energy of each complex for further study. The docking energy score and root-meansquare deviation (RMSD) were calculated to evaluate the change in the structures of S-RBD; and the RBD-hACE2 complexes for K417, E484, and Y501 respectively revealed significant conformational changes to N417, Q484, and N501 (Table 2), and it was shown that the complexes of K417, E484, and Y501 were more stable than those of N417, Q484, and N501, respectively. Docking structures showed that residue 417 interacts with D10 and H14 in the hACE2 through two salt bridges. The K-to-N change at aa position 417 leads to the widening of the space between K417 and D10, which is compensated by the close proximity of K417 and H14 (Figures 5A,B). As a result, the K-to-N substitution at residue position 417 marginally affects the viral tertiary structure. Residue Q484 forms an interaction with K11 of hACE2 via hydrogen bonding at the distance of 4.8 Å. In the E484 mutant, the interaction between the two atoms is not only enhanced (3.6 Å) but a new bonding interaction (4.6 Å) forms between S477 in the S protein and Q4 of the hACE2 (Figures 5C,D). Thus, compared with Q484, E484 may serve to strengthen the interaction with hACE2. Furthermore, residue N501 of the S protein interacts with Y21, K333, and D335 of the hACE2, with bond lengths of 3.0, 4.8, and 4.4 Å, respectively. In the RBD-hACE2 complex of Y501, the interaction between Y501 and K333 is formed, with a much shorter bond length compared with N501 (2.7 Å) ( Figure 5E). In addition, the Y501 substitution generated a new interaction between residues Y501 and D18 of hACE2 with a 3.6-Å bond length (Figures 5F,D). Due to the fact that the B.1.617 Y501 mutant, which originated from a US sample collected in April 2021, had a much stronger interaction with hACE2, it is reasonable for us to focus on monitoring its potential phenotypical changes.
Besides the five aa substitutions (T26I, K478N, T643I, G671S, and V675) described above with occurrence frequencies >1% in the RdRp, there were other two sites (G228S and I244V) with frequencies <1% but detected in more than four sequences. Of the seven aa substitutions targeting RdRp, three were located in the nidovirus RdRp-associated nucleotidyltransferase (NiRAN) domain, and the other four were located in the conserved polymerase domain (Figure 5G). No aa substitution sites were identified in the interface domain ( Figure 5G). Tertiary structure analysis showed that the aa sites 26, 228, 478, and 643 were  located on the protein surface and that aa substitutions at these four sites did not alter the spatial structure of RdRp ( Figure 5G).

DISCUSSION
Since the discovery of the D641G variant in early 2020, the continuous emergence of SARS-CoV-2 variants has attracted increasing attention worldwide. B.1.617 was first reported in India at the end of 2020 and has spread rapidly and globally, becoming the most prevalent SARS-CoV-2 variant to date. As analysis of genetic variability and evolution is necessary for the characterization of viral variants and the investigation of how they impact disease severity and transmissibility, and to what extent they affect the performance of diagnostic tests, vaccines, social measures etc., the dynamic genetic features on B.1.617 SARS-CoV-2 genomes were analyzed and evaluated in this study.
In sharp contrast to the small number of B.1.617 sequences and their limited geographical distribution in early 2021, both parameters have dramatically increased since March 2021, reflecting the rapid global spread of B.1.617. Mutation, a phenomenon that naturally occurs over time, is an important factor for viral evolution, providing the virus with opportunities to gain fitness (leading to enhanced transmission and/or improved immune evasion) in the infected host. SNPs, shown to be widely distributed across the B.1.617 genomes (and especially those with high occurring frequencies), are more commonly present in structural rather than non-structural protein genes. Besides site-specific mutations, deletions are found in the S, ORF8, and nsp1 sequences, which may potentially influence the function of proteins participating in viral infection, transmission, and immunomodulation (Islam et al., 2020;Phan, 2020;Zhou et al., 2020). As mentioned above in the section of introduction, some mutations were identified to be related to the altered phenotype of B.1.617 variants, but many problems remained to be solved. For instance, do the E156 /F157 aa deletions in the S protein arm the VOC B.1.617.2 variant with improved transmissibility or reduce its susceptibility to the preexisting neutralizing antibody? This needs to be determined by further in vivo studies.
Methodologically, it is of great importance to infer function from the conservation degree of the protein as a whole or even the individual aa. Positive or diversifying selection can accelerate the fixation of advantageous non-synonymous mutations, while negative or purifying selection can avoid deleterious mutations and favor the optimization of viral features (Daugherty and Malik, 2012). We found evidences of both diversifying and purifying selection in different B.1.617 variant proteins, with the former selection type dominant in structural proteins (especially the S protein) and the latter one dominant in non-structural proteins (such as nsp3 and nsp12). Generally, the presence of diversifying selection may suggest an impact of the host immune response on the variants, whereby viruses could escape from the protective immunity of the host, especially for the neutralizing antibodies, during the course of the pandemic. In fact, reduced neutralization of B.1.617.1 and B.1.617.2 variants by antibodies or convalescent sera elicited as a result of vaccination or nature infection has been reported (Hu et al., 2021;Planas et al., 2021;Supasa et al., 2021). A subset of positively selected sites were found in different proteins of SARS-CoV-2 (Berrio et al., 2020;Velazquez-Salinas et al., 2020;Emam et al., 2021;Rochman et al., 2021), which are informative to evaluate their impacts on the protein structure and the processes critical to the viral life cycle, including replication, transcription, translation, and RNA stability. Purifying selection indicates the importance of steadying the vital functional proteins or domains associated with major functions such as viral replication and infection. Thus, a large number of purifying selection mutations detected in the nsp3 suggest a highly conserved pattern in its function in viral genome replication and in antagonizing the host innate immunity (Dong et al., 2020). Generally, viruses tread a tightrope between low pathogenicity and high infectivity, which is beneficial to their long-term survival. However, statistical data from Scotland and England seemingly exhibited an increased risk of hospitalization among individuals infected with the B.1.617.2 variant, compared with those infected with the early variant B.1.617.1 (PublicHealthEngland, 2021). Regarding viral transmissibility, it was also reported that the estimated effective reproductive number of B.1.617.2 was 55% higher than that of the earlier variants (B.1.1.7, B.1.351, and P.1) and 97% (95% CI: 76-117) higher than that of other non-VOC/VOI members (Campbell et al., 2021). Of course, it is necessary to stress that all the evaluations on viral property need to be verified carefully by experiments.
Phylogenetic analysis showed no obvious spatial-temporal specificity for B.1.617.3 due to a relatively low epidemic intensity, B.1.617.1 was prevalent in the early stages of the pandemic and mainly localized to Southern Asia, and B.1.617.2 was predominant since April 2021 and assumed a global distribution. The haplotype network in this study also demonstrated the same result as the phylogenetic analysis. Although we cannot rule out the potential for sampling bias and mutagenesis bias, which could have led to inaccuracies in this study, our results imply that B.1.617.2 viruses rapidly spread through the population. Furthermore, haplotype network analysis has been applied to SARS-CoV-2 data to understand the viral evolution and trace the infection pathways (Forster et al., 2020a,b). Our network showed that India was the epicenter for B.1.617 variants and that three major nodes were generated from the original haplotype. Furthermore, B.1.617.1 and B.1.617.3 each contained a widely prevalent haplotype. Meanwhile, four additional daughter haplotypes with high prevalence were derived from the prevalent ancestral B.1.617.2 haplotype (H176). Numbers of different haplotypes that were detected in the B.1.617 lineage revealed that a large number of mutations were generated during the process of viral infection and transmission. Some of these novel variants (e.g., B.1.617.2 H219 and H177) still exhibited a good fitness and were widely prevalent in the population. It is worth noting that the phylogeographic patterns in the network are potentially affected by many factors, such as founder effects and sample size (Forster et al., 2020a).
The S1 subunit in the N-terminal region of the S protein contains the RBD (aa 319-541), of which aa 437-508 constitute the receptor-binding motif (Winger and Caspari, 2021), and 17 amino acids make direct contact with hACE2: (K417, G446, Y449, Y453, L455, F456, A475, F486, N487, Y489, Q493, G496, Q498, T500, N501, G502, and Y505) (Lan et al., 2020). The aa substitutions in the S-RBD may cause structural and functional changes. Typically, computational or bioinformatics tools can be applied to investigate tertiary structure and the resulting hostpathogen interactions (Bakhshandeh et al., 2021). The K-to-N substitution at aa site 417 in the S protein did not cause significant changes to the surface interaction between RBD and hACE2. The substitution of E to Q at aa site 484 lengthened the interaction distance between the original residues (Q484 of RBD and K11 of hACE2) and reduced the interaction between the two other residues (S477 of RBD and Q4 of hACE2), which may have weakened the affinity between the S protein and its receptor and may function in antibody escape (Planas et al., 2021). However, functional experiments are needed to verify these speculations. N501Y substitution was detected in B.1.1.7 and B.1.351, and it was suggested to be responsible for the enhanced replication in cell and in animal model and for the potentially increased transmission (Liu et al., 2021c). In the present study, two B.1.617 sequences from the United States were found to contain the N501Y substitution. Tertiary structure analysis showed that compared with N501, Y501 had a much stronger interaction with hACE2. Notably, during the preparation of this manuscript, we continued to analyze the sequences entered into GISAID throughout June and July 2021 and found that the Y501 mutant was also detected in Turkey, Poland, Czech Republic, Germany, and Cambodia, reminding us of the urgent need for the close surveillance of this variant. However, a causal link between the increased binding to hACE2 and the elevated replication in host cells cannot be drawn currently. For an enveloped virus, besides attachment to the receptor, those other factors, such as the cleavage efficiency by furin, that affect membrane fusion are critically important for entry into the cells lining the respiratory tract and the ensuing replication.
Despite the presence of seven aa substitutions, no obvious structural changes were observed in the RdRp. Combined with the fact that no aa changes occurs in nsp7 or nsp8, the two important proteins serving to promote RdRp-mediated replication (Posthuma et al., 2017;Biswal et al., 2021), it was concluded that the replication activity of RdRp of B.1.617 remained steady.
In conclusion, the data reported in this study showed that widely distributed SNPs and aa substitutions were detected across the B.1.617 viral genomes, especially in the S protein. Twenty-two positively selected sites were detected and 16 were non-C-to-U transitions. The aa substitution of N501Y can potentially enhance the interactions between S-RBD and ACE2, and the variants bearing N501Y were increasingly distributed in more countries in the follow-up monitoring. Therefore, we have improved our knowledge about the genetic diversity of B.1.617 lineage and the potential impact on viral property individually and as a whole. The biological significance and the underlining evolution rule of these variants merit further attention and verification.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/Supplementary Material.