ORIGINAL RESEARCH article

Front. Genet., 29 November 2021

Sec. Computational Genomics

Volume 12 - 2021 | https://doi.org/10.3389/fgene.2021.753440

Hotspot Mutations in SARS-CoV-2

  • 1. Department of Computer Science and Engineering, National Institute of Technical Teachers’ Training and Research, Kolkata, India

  • 2. Department of Computer Science and Information Technology, Institute of Technical Education and Research, Siksha ‘O’ Anusandhan (Deemed to be University), Bhubaneswar, India

  • 3. Department of Electronics and Communication Engineering, Jaypee Institute of Information Technology, Noida, India

Article metrics

View details

18

Citations

5,5k

Views

2,5k

Downloads

Abstract

Since its emergence in Wuhan, China, severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) has spread very rapidly around the world, resulting in a global pandemic. Though the vaccination process has started, the number of COVID-affected patients is still quite large. Hence, an analysis of hotspot mutations of the different evolving virus strains needs to be carried out. In this regard, multiple sequence alignment of 71,038 SARS-CoV-2 genomes of 98 countries over the period from January 2020 to June 2021 is performed using MAFFT followed by phylogenetic analysis in order to visualize the virus evolution. These steps resulted in the identification of hotspot mutations as deletions and substitutions in the coding regions based on entropy greater than or equal to 0.3, leading to a total of 45 unique hotspot mutations. Moreover, 10,286 Indian sequences are considered from 71,038 global SARS-CoV-2 sequences as a demonstrative example that gives 52 unique hotspot mutations. Furthermore, the evolution of the hotspot mutations along with the mutations in variants of concern is visualized, and their characteristics are discussed as well. Also, for all the non-synonymous substitutions (missense mutations), the functional consequences of amino acid changes in the respective protein structures are calculated using PolyPhen-2 and I-Mutant 2.0. In addition to this, SSIPe is used to report the binding affinity between the receptor-binding domain of Spike protein and human ACE2 protein by considering L452R, T478K, E484Q, and N501Y hotspot mutations in that region.

1 Introduction

COVID-19 caused by severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) was first identified in late December 2019 and has a high transmission rate (Zhu et al., 2020). The WHO declared this outbreak as a pandemic on March 11, 2020 (Cucinotta and Vanelli, 2020). Like other coronaviruses, SARS-CoV-2 is also an enveloped single-stranded RNA virus containing nearly 30 K nucleotide sequences (Alexandersen et al., 2020). SARS-CoV-2 encompasses 11 codding regions, which include ORF1ab, Spike (S), ORF3a, Envelope (E), Membrane (M), ORF6, ORF7a, ORF7b, ORF8, Nucleocapsid (N), and ORF10.

Though the vaccination process has started, the virus is evolving and spreading all across the world, causing fresh waves every few months. Since the virus is mutating frequently, it creates new variant of the original virus. Among several variants, B.1.1.7 (Alpha), B.1.351 (Beta), P.1 (Gamma), and B.1.617.2 (Delta) are declared as variants of concern (Singh et al., 2021). In this regard, the variant B.1.1.7 was first identified in the United Kingdom, which contains E484K, N501Y, D614G, and P681H mutations in Spike glycoprotein (Tang et al., 2020). In December 2020, the variant B.1.351 was first detected in South Africa, with mutations such as K417N, E484K, N501Y, D614G, and A701V (Tang et al., 2021). The Brazilian variant P.1 also has almost the same mutations as the B.1.351 variant, but instead of A701V, the P.1 variant has H555Y mutation (Faria et al., 2021). On the other hand, the variant B.1.617.2 was first identified in India with L452R, T478K, D614G, and P681R mutations in Spike glycoprotein (Bernal et al., 2021).

To understand the new variants of SARS-CoV-2, Tiwari and Mishra (2021) have performed phylogenetic analysis of 591 SARS-CoV-2 genomes where they have found 43 synonymous and 57 non-synonymous mutations in 12 protein regions. They found the most prevalent mutations in the Spike protein, followed by NSP2, NSP3, and ORF9. They have also highlighted several distinct SARS-CoV-2 features as compared with other human-infecting viruses. Yuan et al. (2020) have analyzed 11,183 global sequences where they have identified 119 single-nucleotide polymorphisms (SNPs) with 74 non-synonymous and 43 synonymous mutations. The mutational profiling shows that the highest mutation has occurred in Nucleocapsid, followed by NSP2, NSP3, and Spike. From China, India, the United States, and Europe, 570 SARS-CoV-2 genomes are analyzed by Weber et al. (2020), where they have identified 10 individual mutations where most of the mutations altered the amino acids in the replication-relevant proteins. Sarkar et al. (2021) have performed a genome-wide analysis of 837 Indian SARS-CoV-2 genomes, where 33 unique mutations were observed, among which 18 mutations were identified in India in five protein regions (six in Spike, five in NSP3, four in RdRp, two in NSP2, and one in Nucleocapsid). The isolated Indian sequences were classified into 22 groups based on their coexisting mutations. This study highlights several mutations identified in various protein regions, which also help to identify the evolution of virus genome across various geographic locations of India. Saha et al. (2020) have performed phylogenetic analysis of 566 Indian SARS-CoV-2 genomes to identify several mutations. As a result, 933 substitutions, 2,449 deletions, and two insertions have been identified from the aligned sequences. In another study, Saha et al. (2021) have performed genomic analysis of 10,664 SARS-CoV-2 genomes, resulting in 7,209 substitutions, 11,700 deletions, 119 insertions, and 53 SNPs.

Motivated by the aforementioned analysis, in this work, we have performed multiple sequence alignment (MSA) of 71,038 SARS-CoV-2 genomes using MAFFT (Katoh et al., 2002) followed by their phylogenetic analysis using Nextstrain (Hadfield et al., 2018) to visualize the virus evolution. This led to the identification of hotspot mutations as deletions and substitutions in the coding regions based on entropy greater than or equal to 0.3. Furthermore, as a demonstrative example, 10,286 Indian sequences are considered from 71,038 global SARS-CoV-2 sequences. For all the non-synonymous substitutions (missense mutations), the functional consequences of amino acid changes in the respective protein structures are calculated using PolyPhen-2 and I-Mutant 2.0. Finally, SSIPe is used to report the binding affinity between the receptor-binding domain (RBD) of Spike protein and human ACE2 protein by considering the hotspot mutations in that region.

2 Methods

In this section, the dataset collection for the SARS-CoV-2 genomes is discussed along with the proposed pipeline.

2.1 Data Preparation

For MSA and phylogenetic analysis, 71,038 global SARS-CoV-2 genomes are collected from Global Initiative on Sharing All Influenza Data (GISAID)1, and the Reference Genome (NC 045512.2)2 is collected from the National Center for Biotechnology Information (NCBI). The SARS-CoV-2 sequences are mostly distributed from January 2020 to June 2021 globally. Moreover, to map the protein sequences and changes in the amino acid, Protein Data Bank (PDB) is collected from Zhang Lab3 (Zhang et al., 2020; Wu et al., 2021), and it is then used to show the structural changes. All these analyses are performed on the High Performance Computing facility of NITTTR, Kolkata; and for checking the amino acid changes, MATLAB R2019b is used.

2.2 Pipeline of the Work

The pipeline of this work is provided in Figure 1A. Initially, MSA of 71,038 global SARS-CoV-2 genomes is performed using MAFFT, which is followed by their phylogenetic analysis using Nextstrain. The corresponding phylogenetic tree is shown in Figure 1B. MAFFT merges local and global algorithms for MSA, and it uses two different heuristic methods such as progressive (FFT-NS-2) and iterative refinement (FFT-NS-i). To create a provisional MSA, FFT-NS-2 calculates all-pairwise distances from which refined distances are calculated. Thereafter, FFT-NS-i is performed to get the final MSA. As MAFFT uses fast Fourier transform, it scores over other alignment techniques. On the other hand, Nextstrain is a collection of open-source tools, which is useful for understanding the evolution and spread of pathogen, particularly during an outbreak. By taking advantage of this tool, in this work, the evolution and geographic distribution of SARS-CoV-2 genomes are visualized by creating the metadata in our High Performance Computing environment.

FIGURE 1

Once the alignment and the phylogenetic analysis are completed, hotspot mutations as deletions and substitutions are identified in the coding regions based on entropy greater than or equal to 0.3. Furthermore, 10,286 Indian sequences are considered as an example to identify such mutations as well. The corresponding phylogenetic tree for Indian sequences is shown in Figure 1C. Moreover, using the codon table, amino acid changes in the SARS-CoV-2 proteins for the corresponding mutations are highlighted as well. The hotspot mutations are identified considering their entropy values, which are calculated as:where represents the frequency of each residue γ occurring at position δ and 5 represents the four possible residues as nucleotides plus gap. Thereafter, the amino acid changes in the SARS-CoV-2 proteins for the non-synonymous deletions and substitutions for both global and Indian sequences are graphically visualized as shown in Figure 1D. Finally, these changes are also used for the evaluation of their functional characteristics and are visualized in the respective protein structure as well.

3 Results

The experiments in this work are carried out according to the pipeline as given in Figure 1A. Initially, MSA of 71,038 global SARS-CoV-2 genomes across 98 countries is carried out using MAFFT followed by their phylogenetic analysis using Nextstrain, which revealed five clades: 19A, 19B, 20A, 20B, and 20C. The number of sequences for each country is reported in Supplementary Table S1. This resulted in the identification of hotspot mutation points as deletions and substitutions in the coding regions based on entropy. In this regard, only those hotspot mutations are considered whose entropy values are greater than or equal to 0.3. The entropy values for each of the genomic coordinates for both global and Indian sequences are provided in Supplementary Table S2. The mutation statistics by considering different threshold values of entropy for each category are reported in Table 1. Based on the results in this table, the entropy value of 0.3 is considered as the threshold for choosing the hotspot mutations. It is to be noted that choosing a threshold value as either 0.2 or 0.1 will lead to a huge amount of hotspot mutations, which is not desired. As a consequence of choosing entropy threshold of 0.3, 45 unique hotspot mutations are identified, which resulted in 39 non-synonymous deletions and substitutions with nine unique deletions and 22 unique amino acid changes. Also, out of the 98 countries that are considered for global analysis, India with 10,286 sequences is taken as an example to demonstrate the mutations for a particular country as well. In this regard, 52 unique hotspot mutations provide 45 non-synonymous deletions and substitutions with five unique amino acid changes for deletions and 36 unique amino acid changes for substitutions. The analysis on other countries with the most number of sequences is provided in the Supplementary Material. The phylogenetic trees in radial and rectangular views considering global analysis are shown in Figures 2A,B, respectively, while for Indian sequences, such views are provided in Figures 2D,E, respectively. These phylogenetic trees respectively show the evolution of the global and Indian SARS-CoV-2 genomes over the months. For the benefit of the readers, it is important to mention that the number of sequences does not have any direct relationship with the number of hotspot mutations. The number of hotspots is based on the entropy value, which in turn depends on the frequency of mutations at a given genomic coordinate. So even with smaller number of sequences, if the frequency of mutations is higher than that with larger number of sequences, it will produce more hotspot mutations. Thus, with 71,038 global sequences, 45 unique hotspot mutations are identified, while for 10,286 Indian sequences, 52 such mutations are identified.

TABLE 1

Threshold valueCoding regions of global SARS-CoV-2 genomes
NSP1NSP2NSP3NSP43CL-ProNSP6NSP7NSP8NSP9NSP10NSP11RdRpHelicaseExonendoRNAseNSP16SpikeORF3aEnvelopeMembraneORF6ORF7aORF7bORF8NucleocapsidORF10
> =0.6000000000000000001000000030
> =0.50 to < 0.6000000100000000001000000000
> =0.40 to < 0.50014008000003000013100000340
> =0.30 to < 0.4000000000000000001000000100
> =0.20 to < 0.3011100100000001016103020131
> =0.10 to < 0.20137413000006331114100001780
> =0.05 to < 0.101318322003005624125702120250
Threshold valueCoding regions of Indian SARS-CoV-2 genomes
> =0.6000000000000000002101010140
> =0.50 to < 0.6000000000000001001101000010
> =0.40 to < 0.5000100100000110009000010110
> =0.30 to < 0.4011410100000010106000000410
> =0.20 to < 0.30034305001004102116101010440
> =0.10 to < 0.200112501000100323118200201240
> =0.05 to < 0.1007114150110041813533011000571

Mutation statistics of 71,038 global and 10,286 Indian SARS-CoV-2 genomes by considering different threshold values.

FIGURE 2

The list of hotspot mutations for the global and Indian SARS-CoV-2 genomes along with their associated details is respectively provided in Tables 2 and 3. For example, in Table 2, genomic coordinate 28,881 in Nucleocapsid with nucleotide changes G > A and G > T has the highest entropy value of 0.773655. India also shows the same mutation but with an entropy value of 1.14807 as shown in Table 3. Please note that mutations like G28881A and G28883C may have an impact on antigenicity of Nucleocapsid protein (Yuan et al., 2020). The entropy values for the corresponding nucleotide changes for global analysis are shown in Figure 2C, while for India, the same is shown in Figure 2F. It is to be noted that the total number of unique amino acid changes for deletions and substitutions is less than the number of non-synonymous deletions and substitutions. One of the reasons for this can be that if there are deletions at consecutive genomic coordinates, the corresponding amino acid changes are the same. For example, as can be seen from Table 2, at the three consecutive genomic coordinates 11,288, 11,289, and 11,290, deletion has occurred with the amino acid change as S106-. Thus, though the number of non-synonymous deletions is 3, the number of unique amino acid change is 1. This is true for other such changes as well.

TABLE 2

Genomic coordinateOverall entropyNucleotide changeAmino acid changeProtein coordinateGene
28,8810.773655G > A, G > TR > K, R > M203Nucleocapsid
28,8830.663399G > CG > R204Nucleocapsid
28,8820.663308G > AR > R203Nucleocapsid
23,6040.642160C > A, C > GP > H, P > R681Spike
11,2960.502171T > -F > -108NSP6
21,9930.500865A > -Y > -144Spike
11,2910.499603G > -G > -107NSP6
28,2800.491543G > CD > H3Nucleocapsid
23,0630.484066A > TN > Y501Spike
21,7700.476393G > -V > -70Spike
3,2670.475810C > TT > I183NSP3
11,2880.474924T > -S > -106NSP6
11,2890.472836C > -S > -106NSP6
21,7650.471435T > -I > -68Spike
21,7670.469881C > -H > -69Spike
11,2900.467890T > -S > -106NSP6
21,7660.467479A > -I > -68Spike
21,7680.467116A > -H > -69Spike
21,7690.466151T > -H > -69Spike
11,2930.465319T > -G > -107NSP6
11,2920.464056G > -G > -107NSP6
11,2940.463926T > -F > -108NSP6
24,9140.461770G > CD > H1118Spike
6,9540.461746T > CI > T1412NSP3
28,9770.460661C > TS > F235Nucleocapsid
21,9920.460243T > -Y > -144Spike
9130.460233C > TS > S36NSP2
11,2950.459624T > -F > -108NSP6
5,9860.459543C > TF > F1089NSP3
28,2820.459253T > AD > E3Nucleocapsid
28,0480.458864G > TR > I52ORF8
14,6760.458373C > TP > P412RdRp
23,2710.458086C > AA > D570Spike
28,2810.458038A > TD > V3Nucleocapsid
27,9720.457841C > TQ > *27ORF8
5,3880.457761C > AA > D890NSP3
28,1110.457624A > GY > C73ORF8
23,7090.456643C > TT > I716Spike
24,5060.455921T > GS > A982Spike
15,2790.455884C > TH > H613RdRp
16,1760.455573T > CT > T912RdRp
21,9910.455314T > -V > -143Spike
25,5630.442049G > TQ > H57ORF3a
22,2270.310063C > TA > V222Spike
28,2530.300528C > T, C > -F > F, F > -120ORF8

List of hotspot mutations for 71,038 global SARS-CoV-2 genomes along with the protein change.

TABLE 3

Genomic coordinateOverall entropyNucleotide changeAmino acid changeProtein coordinateGene
28,8811.14807G > A, G > TR > K, R > M203Nucleocapsid
23,6040.8631C > A, C > GP > H, P > R681Spike
28,8820.69019G > AR > R203Nucleocapsid
28,8830.68846G > CG > R204Nucleocapsid
26,7670.68419T > C, T > GI > T, I > S82Membrane
28,2530.65534C > T, C > -F > F, F > -120ORF8
25,4690.6227C > TS > L26ORF3a
29,4020.61955G > TD > Y377Nucleocapsid
22,9170.61006T > GL > R452Spike
27,6380.60866T > CV > A82ORF7a
25,5630.55354G > TQ > H57ORF3a
22,4440.53665C > TD > D249Spike
18,8770.52834C > TL > L280Exon
26,7350.52715C > TY > Y71Membrane
28,8540.51198C > TS > L194Nucleocapsid
24,4100.49845G > AD > N950Spike
21,9870.49717G > AG > D142Spike
21,6180.48836C > GT > R19Spike
27,7520.48264C > TT > I120ORF7a
22,0340.47915A > -R > -158Spike
22,9950.47879C > AT > K478Spike
28,4610.46436A > GD > G63Nucleocapsid
15,4510.44421G > AG > S671RdRp
23,0120.44086G > CE > Q484Spike
22,0330.4385C > -F > -157Spike
16,4660.43082C > TP > L77Helicase
22,0320.42673T > -F > -157Spike
11,2010.42554A > GT > A77NSP6
28,2490.41704A > -D > -119ORF8
5,1840.40139C > TP > L822NSP3
22,0310.40074T > -F > -157Spike
3130.39475C > TL > L16NSP1
22,0290.38676A > -E > -156Spike
5,7000.38604C > AA > D994NSP3
20,3960.38407A > GK > R259endoRNAse
3,2670.37579C > TT > I183NSP3
22,0300.3738G > -E > -156Spike
28,2510.36694T > -F > -120ORF8
28,2480.36497G > -D > -119ORF8
24,7750.36197A > TQ > H1071Spike
21,8950.35931T > CD > D111Spike
28,2800.35905G > CD > H3Nucleocapsid
28,2500.35546T > -D > -119ORF8
28,2520.351T > -F > -120ORF8
11,4180.34861T > CV > A149NSP6
9,8910.34766C > TA > V446NSP4
17,5230.33196G > TM > I429Helicase
3,4570.3314C > TY > Y246NSP3
4,9650.32981C > TT > I749NSP3
22,0220.31618G > AE > K154Spike
11910.30404C > TP > L129NSP2
21,8460.30253C > TT > I95Spike

List of hotspot mutations for 10,286 Indian SARS-CoV-2 genomes along with the protein change.

The amino acid changes in protein for the non-synonymous deletions and substitutions as reported in Tables 2 and 3 are visualized in Figure 1D; Supplementary Figure S1. All the amino acid changes in the protein for the non-synonymous substitutions or missense mutations for the global sequences are shown in Figure 3, while the same for the Indian sequences are depicted in Figure 4. The month-wise virus evolution in terms of entropy for both global and Indian genomic sequences is visualized respectively in Figures 5 and 6, while the corresponding entropy values are reported in Supplementary Tables S3 and S4. For example, it can be seen from both the figures that both P681H and P681R, which are part of the variant of concerns Alpha or B.1.1.7 and Delta or B.1.617.2, have evolved over time globally and for India as well. It is to be noted that due to the lack of appropriate number of sequences, the data of January and February 2020 have been merged for the global analysis, while for India, such merging is for the months January to March 2020. Also, please note that since the calculation of entropy is performed on aligned sequences, only coding regions are considered for the identification of hotspot mutations, as the non-coding regions exhibit high entropy values and can be misleading while selecting such mutation points as hotspot mutations. Furthermore, the evolution of the mutation points for global SARS-CoV-2 genomes pertaining to the different variants of concern like Alpha, Beta, Gamma, and Delta as declared by the WHO is also reported respectively in Figures 7A,B,C,D. It can be observed from the figures that the popular mutation D614G, which is common in all the variants though predominant in the earlier months of the pandemic, has waned over time. Also, the mutation T478K, which is unique to the Delta variant, is known to facilitate antibody escape (Planas et al., 2021). Some important hotspot mutations like H69-, V70-, Y144-, A222V, N501Y, A570D, P681H, and P681R identified in this study are associated with the different SARS-CoV-2 variants of concern like Alpha, Beta, Gamma, and Delta.

FIGURE 3

FIGURE 4

FIGURE 5

FIGURE 6

FIGURE 7

The unique and common hotspot mutations between global and Indian sequences are represented in the form of Venn diagram in Figures 8A,B, which shows the unique and common non-synonymous hotspot mutations, while the unique and common amino acid changes are shown in Figure 8C. As shown in Figure 8A, there are 37 and 44 unique mutations in global and Indian sequences, while eight are common in both. For non-synonymous hotspot deletions and substitutions, there are 32 and 38 unique mutations in each category, while the common number of such mutations is seven as reported in Figure 8B. For amino acid changes, as shown in Figure 8C, these statistics are 22, 32, and nine. The Venn diagram showing the common and unique hotspot mutations for global and Indian sequences with Alpha, Beta, Gamma, and Delta variants of SARS-CoV-2 is reported in Supplementary Figure S2. For example, in Supplementary Figure S2A, there are four unique mutations in both global sequences and Alpha variant, while there are nine mutations that are common to both.

FIGURE 8

4 Discussion

There are spurts of new waves in almost every country around the globe. India has already gone through the massively catastrophic second wave, and according to the experts, a third wave is imminent. This can be attributed to the fact that the virus is evolving and new strains are getting identified, thereby making the study of this ever-evolving virus all the more important. The functional characteristics of some important mutations in the global and Indian SARS-CoV-2 genomic sequences are reported in Table 4.

TABLE 4

MutationsFunctional characteristics
H69-Leads to conformational changes in Spike protein (Meng et al., 2021; McCarthy et al., 2021)
V70-Leads to conformational changes in Spike protein (Meng et al., 2021; McCarthy et al., 2021)
Y144-Reduces affinity of antibody binding (McCarthy et al., 2021)
L452RIncreases the binding ability of the ACE2 receptor and can also reduce the attaching capability of vaccine-simulated antibodies with Spike protein (Garcia-Beltran et al., 2021)
T478KFacilitates antibody escape (Planas et al., 2021)
E484QAssociated with reduced sera neutralization (Greaney et al., 2021)
N501YHighest binding affinity with human receptor cell hACE2 and resistant to neutralization (Luan et al., 2021)
P681HNear furin cleavage site, may affect transmissibility of the virus (Boehm et al., 2021)
P681RNear furin cleavage site, may affect transmissibility of the virus (Boehm et al., 2021)

Functional characteristics of some important mutations.

Structural changes in amino acid residues may sometimes lead to functional instability in proteins due to change in protein translations. To judge their characteristics, these changes are demonstrated through sequence and structural homology-based prediction for the hotspot deletions and missense mutations for global and Indian sequences in Table 5. The tools used for these predictions are PolyPhen-2 (Polymorphism Phenotyping) (Adzhubei et al., 2010) and I-Mutant 2.0 (Capriotti et al., 2005). PolyPhen-24 works with sequence, structural, and phylogenetic information of missense mutations, while I-Mutant 2.05 uses support vector machine (SVM) for the automatic prediction of protein stability changes upon missense mutations. PolyPhen-2 is used to find the damaging hotspot mutations, and I-Mutant 2.0 determines protein stability. To determine if a mutation is damaging using PolyPhen-2, its score is considered, which lies between 0 and 1. If the score is close to 1, then a mutation is considered to be damaging. It can be concluded from Table 5 that out of the 22 unique amino acid changes for substitutions in global sequences, 14 are damaging, while for Indian sequences, 24 are damaging out of 36 changes. It is important to note that in case of protein, damaging mostly defines instability. Generally, this is used for human proteins. As a consequence, if the human protein is damaging in nature because of mutations, then the human protein–protein interactions may occur with high or low binding affinity. Now in case of virus, similar consequences may happen, which means that if the virus protein is damaged because of mutations, it may interact with human proteins with similar binding affinity. As a result, the virus may acquire characteristics like transmissibility and escaping antibodies (Alenquer et al., 2021; Harvey et al., 2021).

TABLE 5

Change inChange inMapped withPolyPhen-2I-mutant 2.0
NucleotideAmino acidCoding regionsPredictionScoreStabilityDDG (kcal/mol)
G28881AR203 KNucleocapsidProbably damaging0.969Decrease−2.26
G28881TR203MNucleocapsidProbably damaging0.998Decrease−1.52
G28883CG204RNucleocapsidProbably damaging1No change0
C23604AP681HSpikeNot generatedNot generatedDecrease−0.92
C23604GP681RSpikeNot generatedNot generatedDecrease−0.79
G28280CD3HNucleocapsidProbably damaging1Increase0.34
A23063TN501YSpikeBenign0.145Decrease−0.34
C3267TT183INSP3Not generatedNot generatedDecrease-0.1
G24914CD1118HSpikeProbably damaging0.998Decrease−0.1
T6954CI1412TNSP3Benign0.026Decrease−2.78
C28977TS235FNucleocapsidProbably damaging0.998Increase2.43
T28282AD3ENucleocapsidProbably damaging0.997Decrease−0.02
G28048TR52IORF8Probably damaging1Decrease−0.09
C23271AA570DSpikeBenign0.031Decrease−1.32
A28281TD3VNucleocapsidProbably damaging1Decrease−0.22
C5388AA890DNSP3Probably damaging1Decrease−1.09
A28111GY73CORF8Probably damaging0.994Increase1.04
C23709TT716ISpikePossibly damaging0.696Decrease−0.95
T24506GS982ASpikeProbably damaging0.996Decrease−1.36
C22227TA222VSpikeBenign0.001Increase0.48
T26767GI82SMembranePossibly damaging0.951Decrease−2
C25469TS26LORF3aBenign0.017Increase0.92
G29402TD377YNucleocapsidProbably damaging1Increase0.51
T22917GL452RSpikeBenign0.04Decrease−1.4
T27638CV82AORF7aPossibly damaging0.732Decrease-2.18
G25563TQ57HORF3aProbably damaging0.983Decrease−1.12
C28854TS194LNucleocapsidProbably damaging0.994Increase0.45
G24410AD950NSpikePossibly damaging0.731Increase0.15
G21987AG142DSpikeBenign0.051Decrease−1.17
C21618GT19RSpikeBenign0.004Decrease−0.12
C27752TT120IORF7aPossibly damaging0.915Decrease−0.26
C22995AT478KSpikeBenign0Decrease−0.09
A28461GD63GNucleocapsidBenign0Decrease−0.57
G15451AG671SRdRpProbably damaging1Decrease−0.29
G23012CE484QSpikePossibly damaging0.786Decrease−0.48
C16466TP77LHelicaseProbably damaging1Decrease−1.03
A11201GT77ANSP6Possibly damaging0.577Decrease−0.7
C5184TP822LNSP3Benign0.007Decrease−0.54
C5700AA994DNSP3Probably damaging0.972Decrease−0.78
A20396GK259RendoRNAseBenign0Decrease−0.49
A24775TQ1071HSpikePossibly damaging0.998Decrease−1.19
T11418CV149ANSP6Possibly damaging0.865Decrease−3.43
C9891TA446VNSP4Probably damaging0.999Increase0.64
G17523TM429IHelicasePossibly damaging0.649Decrease−1.26
C4965TT749INSP3Probably damaging0.996Decrease−0.92
G22022AE154 KSpikeNot generatedNot GeneratedDecrease−1.4
C1191TP129LNSP2Possibly damaging0.888Decrease−0.53
C21846TT95ISpikeProbably damaging0.999Decrease−1.8

Sequence and structural homology-based prediction of non-synonymous substitution as hotspot mutations along with their protein structural stability for 71,038 global SARS-CoV-2 genomes.

Another important parameter to judge the functional and structural activities of a protein is protein stability, which dictates the conformational structure of a protein. Any change in protein stability may cause misfolding, degradation, or aberrant conglomeration of proteins. I-Mutant 2.0 uses free energy change values (DDG (kcal/mol)) to predict the changes in the protein stability wherein a negative value of DDG indicates that the protein has a decreasing stability, while a positive value indicates an increase in stability. For example, the very low DDG value of G25563T shows that there is a decreased protein stability, thereby resulting in a reduction of virus virulence (Cheng et al., 2021). The results from I-mutant 2.0 show that out of the 14 and 24 unique damaging changes for global and Indian sequences, 10 and 18 changes respectively decrease the stability of the protein structures. Figure 9 shows the binding affinity between the RBD of Spike protein and human ACE2 protein performed using SSIPe6 (Huang et al., 2019) for the four mutations of SARS-CoV-2, viz., L452R, T478K, E484Q, and N501Y, taking place in such domain. The region marked in red shows the exact positions (471–492) where the binding takes place. To report the binding affinity using SSIPe, initially the RBD region of Spike protein (Woo et al., 2020) is docked with human ACE2 protein7 using PatchDock8. The best docked structure is then provided as an input to SSIPe. Table 6 further reports the binding affinity values for the four mutations. A strongly favorable mutation is usually defined as the one that has DDG value ≤ −1.5 kcal/mol, while a strongly unfavorable mutation is the one that has DDG value ≥1.5 kcal/mol. The DDG value of −0.769 kcal/mol for E484Q indicates that this is a favorable mutation, while DDG values of 1.083, 1.248, and 0.236 kcal/mol for L452R, T478K, and N501Y indicate that these mutations are somewhat unfavorable. These results corroborate our earlier explanation that because of mutation, virus–human protein–protein interactions may occur with high or low binding affinity.

FIGURE 9

TABLE 6

Genomic coordinateNucleotide changeAmino acid changeProtein coordinateDDG (kcal/mol)SSIPscoreEvoEFscore
22,917T > GL > R4521.0832.083−1.91
22,995C > AT > K4781.2481.779−0.77
23,012G > CE > Q484−0.7691.098−5.22
23,063A > TN > Y5010.23600.09

Binding affinity of the mutations in RBD region of Spike protein and human ACE2 protein.

Note. RBD, receptor-binding domain.

Supplementary Figure S3 shows the percentage of nucleotide change and frequency of nucleotide change for hotspot mutations for global and Indian sequences. For example, in Supplementary Figure S3A, the occurrence of nucleotide change G > A in 71,038 global sequences is almost 45%, while the number of times it occurs in 45 hotspot mutations is two, as is also evident from Table 2. It can also be seen from Supplementary Figures S3B, S3D that 10 and 16 out of 39 and 45 non-synonymous mutations are from C to T, thereby representing abundant transition. This transition increases the frequency of codons for hydrophobic amino acids and provides evidence of potential antiviral editing mechanisms driven by host (Yuan et al., 2020). Also, more C-to-T transition means less CpG abundance, indicating rapid adaptation of virus in host. This CpG deficiency, which leads to evasion of host antiviral defense mechanisms, is exhibited the most in SARS-CoV-2 virus (Xia, 2020).

5 Conclusion

With the imminent third wave, it is very crucial to understand the evolution of SARS-CoV-2. In this regard, MSA of 71,038 SARS-CoV-2 genomes of 98 countries over the period from January 2020 to June 2021 is performed using MAFFT followed by phylogenetic analysis to visualize the evolution of SARS-CoV-2. This resulted in the identification of hotspot mutations as deletions and substitutions in the coding regions based on entropy, which should be greater than or equal to 0.3. Consequently, a total of 45 unique hotspot mutations out of which 39 non-synonymous deletions and substitutions are identified with nine unique amino acid changes for deletions and 22 unique amino acid changes for substitutions. Moreover, 10,286 Indian sequences are considered from 71,038 global SARS-CoV-2 sequences as a demonstrative example, which gives 52 unique hotspot mutations, resulting in 45 non-synonymous deletions and substitutions with five unique amino acid changes for deletions and 36 unique amino acid changes for substitutions. Some important mutations in such sequences pertaining to the Delta variant of SARS-CoV-2 are T19R, G142D, E156-, F157-, L452R, T478K, and P681R. Furthermore, the evolution of the hotspot mutations along with the mutations in variants of concern is visualized, and their characteristics are also discussed. Moreover, for all the missense mutations, the functional consequences of amino acid changes in the respective protein structures are calculated using PolyPhen-2 and I-Mutant 2.0. Finally, SSIPe is used to report the binding affinity between the RBD of Spike protein and human ACE2 protein by considering L452R, T478K, E484Q, and N501Y hotspot mutations in that region.

Statements

Data availability statement

The aligned 71038 Global SARS-CoV-2 genomes with the reference sequence and the final results of this work are available at http://www.nitttrkol.ac.in/indrajit/projects/COVID-Hotspot-Mutation-Global-71K/. Further inquiries can be directed to the corresponding author.

Author contributions

IS and NG designed the research. IS, NG, NS, and SN analyzed the data and wrote the article. All the authors reviewed and approved the final version of the article.

Funding

This work has been partially supported by CRG short-term research grant on COVID-19 (CVD/2020/000,991) from Science and Engineering Research Board (SERB), Department of Science and Technology, Govt. of India. However, it does not provide any publication fees.

Acknowledgments

We thank all those who have contributed sequences to GISAID and NCBI databases.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2021.753440/full#supplementary-material

References

  • 1

    AdzhubeiI. A.SchmidtS.PeshkinL.RamenskyV. E.GerasimovaA.BorkP.et al (2010). A Method and Server for Predicting Damaging Missense Mutations. Nat. Methods7, 248249. 10.1038/nmeth0410-248

  • 2

    AlenquerM.FerreiraF.LousaD.ValérioM.Medina-LopesM.BergmanM.-L.et al (2021). Signatures in SARS-CoV-2 Spike Protein Conferring Escape to Neutralizing Antibodies. Plos Pathog.17, e1009772. 10.1371/journal.ppat.1009772

  • 3

    AlexandersenS.ChamingsA.BhattaT. R. (2020). SARS-CoV-2 Genomic and Subgenomic RNAs in Diagnostic Samples Are Not an Indicator of Active Replication. Nat. Commun.11, 113. 10.1038/s41467-020-19883-7

  • 4

    BernalJ. L.AndrewsN.GowerC.GallagherE.SimmonsR.ThelwallS.et al (2021). Effectiveness of Covid-19 Vaccines against the B.1.617.2 (Delta) Variant. N. Engl. J. Med.385 (7), 585594. 10.1056/NEJMoa2108891

  • 5

    BoehmE.KronigI.NeherR. A.EckerleI.VetterP.KaiserL. (2021). Novel SARS-CoV-2 Variants: the Pandemics within the Pandemic. Clin. Microbiol. Infect.27, 11091117. 10.1016/j.cmi.2021.05.022

  • 6

    CapriottiE.FariselliP.CasadioR. (2005). I-Mutant2.0: Predicting Stability Changes upon Mutation from the Protein Sequence or Structure. Nucleic Acids Res.33, W306W310. 10.1093/nar/gki375

  • 7

    ChengL.HanX.ZhuZ.QiC.WangP.ZhangX. (2021). Functional Alterations Caused by Mutations Reflect Evolutionary Trends of SARS-CoV-2. Brief. Bioinform.22, 14421450. 10.1093/bib/bbab042

  • 8

    CucinottaD.VanelliM. (2020). WHO Declares COVID-19 a Pandemic. Acta Biomed.91, 157160. 10.23750/abm.v91i1.9397

  • 9

    FariaN. R.MellanT. A.WhittakerC.ClaroI. M.CandidoD. D. S.MishraS.et al (2021). Genomics and Epidemiology of the P.1 SARS-CoV-2 Lineage in Manaus, Brazil. Science372, 815821. 10.1126/science.abh2644

  • 10

    Garcia-BeltranW. F.LamE. C.St. DenisK.NitidoA. D.GarciaZ. H.HauserB. M.et al (2021). Multiple SARS-CoV-2 Variants Escape Neutralization by Vaccine-Induced Humoral Immunity. Cell184, 23722383.e9. 10.1016/j.cell.2021.03.013

  • 11

    GreaneyA. J.StarrT. N.GilchukP.ZostS. J.BinshteinE.LoesA. N.et al (2021). Complete Mapping of Mutations to the SARS-CoV-2 Spike Receptor-Binding Domain that Escape Antibody Recognition. Cell Host & Microbe29, 4457.e9. 10.1016/j.chom.2020.11.007

  • 12

    HadfieldJ.MegillC.BellS. M.HuddlestonJ.PotterB.CallenderC.et al (2018). Nextstrain: Real-Time Tracking of Pathogen Evolution. Bioinformatics34, 41214123. 10.1093/bioinformatics/bty407

  • 13

    HarveyW. T.CarabelliA. M.JacksonB.GuptaR. K.ThomsonE. C.HarrisonE. M.et al (2021). SARS-CoV-2 Variants, Spike Mutations and Immune Escape. Nat. Rev. Microbiol.19, 409424. 10.1038/s41579-021-00573-0

  • 14

    HuangX.ZhengW.PearceR.ZhangY. (2019). SSIPe: Accurately Estimating Protein-Protein Binding Affinity Change upon Mutations Using Evolutionary Profiles in Combination with an Optimized Physical Energy Function. Bioinformatics36, 24292437. 10.1093/bioinformatics/btz926

  • 15

    KatohK.MisawaK.KumaK.MiyataT. (2002). MAFFT: A Novel Method for Rapid Multiple Sequence Alignment Based on Fast Fourier Transform. Nucleic Acids Res.30, 30593066. 10.1093/nar/gkf436

  • 16

    LuanB.WangH.HuynhT. (2021). Enhanced Binding of the N501Y‐mutated SARS‐CoV‐2 Spike Protein to the Human ACE2 Receptor: Insights from Molecular Dynamics Simulations. FEBS Lett.595, 14541461. 10.1002/1873-3468.14076

  • 17

    McCarthyK. R.RennickL. J.NambulliS.Robinson-McCarthyL. R.BainW. G.HaidarG.et al (2021). Recurrent Deletions in the SARS-CoV-2 Spike Glycoprotein Drive Antibody Escape. Science371, 11391142. 10.1126/science.abf6950

  • 18

    MengB.KempS. A.PapaG.DatirR.FerreiraI. A. T. M.MarelliS.et al (2021). Recurrent Emergence of SARS-CoV-2 Spike Deletion H69/V70 and Its Role in the Alpha Variant B.1.1.7. Cell Rep.35 (13), 109292. 10.1016/j.celrep.2021.109292

  • 19

    PlanasD.VeyerD.BaidaliukA.StaropoliI.Guivel-BenhassineF.RajahM. M.et al (2021). Reduced Sensitivity of SARS-CoV-2 Variant Delta to Antibody Neutralization. Nature596, 276280. 10.1038/s41586-021-03777-9

  • 20

    SahaI.GhoshN.MaityD.SharmaN.SarkarJ. P.MitraK. (2020). Genome-wide Analysis of Indian SARS-CoV-2 Genomes for the Identification of Genetic Mutation and SNP. Infect. Genet. Evol.85, 104457. 10.1016/j.meegid.2020.104457

  • 21

    SahaI.GhoshN.PradhanA.SharmaN.MaityD.MitraK. (2021). Whole Genome Analysis of More Than 10 000 SARS-CoV-2 Virus Unveils Global Genetic Diversity and Target Region of NSP6. Brief. Bioinform.22, 11061121. 10.1093/bib/bbab025

  • 22

    SarkarR.MitraS.ChandraP.SahaP.BanerjeeA.DuttaS.et al (2021). Comprehensive Analysis of Genomic Diversity of SARS-CoV-2 in Different Geographic Regions of India: an Endeavour to Classify Indian SARS-CoV-2 Strains on the Basis of Co-existing Mutations. Arch. Virol.166, 801812. 10.1007/s00705-020-04911-0

  • 23

    SinghJ.RahmanS. A.EhteshamN. Z.HiraS.HasnainS. E. (2021). SARS-CoV-2 Variants of Concern Are Emerging in India. Nat. Med.27, 1131. 10.1038/s41591-021-01397-4

  • 24

    TangJ.TooveyO.HarveyK.HuicD. (2021). Introduction of the South African SARS-CoV-2 Variant 501Y.V2 into the UK. J. Infect.82, e8. 10.1016/j.jinf.2021.01.007

  • 25

    TangJ. W.TambyahP. A.HuiD. S. (2021). Emergence of a New SARS-CoV-2 Variant in the UK. J. Infect.82, e27e28. 10.1016/j.jinf.2020.12.024

  • 26

    TiwariM.MishraD. (2021). Investigating the Genomic Landscape of Novel Coronavirus (2019-nCoV) to Identify Non-synonymous Mutations for Use in Diagnosis and Drug Design. J. Clin. Virol.128, 104441. 10.1016/j.jcv.2020.104441

  • 27

    WeberS.RamirezC.DoerflerW. (2020). Signal Hotspot Mutations in SARS-CoV-2 Genomes Evolve as the Virus Spreads and Actively Replicates in Different Parts of the World. Virus. Res.289, 198170. 10.1016/j.virusres.2020.198170

  • 28

    WooH.ParkS.-J.ChoiY. K.ParkT.TanveerM.CaoY.et al (2020). Developing a Fully Glycosylated Full-Length SARS-CoV-2 Spike Protein Model in a Viral Membrane. J. Phys. Chem. B124, 71287137. 10.1021/acs.jpcb.0c04553

  • 29

    WuS.TianC.LiuP.GuoD.ZhengW.HuangX.et al (2021). Effects of SARS‐CoV‐2 Mutations on Protein Structures and Intraviral Protein-Protein Interactions. J. Med. Virol.93, 21322140. 10.1002/jmv.26597

  • 30

    XiaX. (2020). Extreme Genomic CpG Deficiency in SARS-CoV-2 and Evasion of Host Antiviral Defense. Mol. Biol. Evol.37, 26992705. 10.1093/molbev/msaa094

  • 31

    YuanF.WangL.FangY.WangL. (2020). Global SNP Analysis of 11,183 SARS‐CoV‐2 Strains Reveals High Genetic Diversity. Transbound. Emerg. Dis.10.1111/tbed.13931

  • 32

    ZhangC.ZhengW.HuangX.BellE. W.ZhouX.ZhangY. (2020). Protein Structure and Sequence Reanalysis of 2019-nCoV Genome Refutes Snakes as its Intermediate Host and the Unique Similarity between its Spike Protein Insertions and HIV-1. J. Proteome Res.19, 13511360. 10.1021/acs.jproteome.0c00129

  • 33

    ZhuN.ZhangD.WangW.LiX.YangB.SongJ.et al (2020). A Novel Coronavirus from Patients with Pneumonia in China, 2019. N. Engl. J. Med.382, 727733. 10.1056/NEJMoa2001017

Summary

Keywords

COVID-19, deletions, entropy, hotspot mutations, SARS-CoV-2 genomes, substitution

Citation

Saha I, Ghosh N, Sharma  N and Nandi S (2021) Hotspot Mutations in SARS-CoV-2. Front. Genet. 12:753440. doi: 10.3389/fgene.2021.753440

Received

04 August 2021

Accepted

07 October 2021

Published

29 November 2021

Volume

12 - 2021

Edited by

Yang Zhang, University of Michigan, United States

Reviewed by

Xiaoqiang Huang, University of Michigan, United States

Yavuz Oktay, Dokuz Eylul University, Turkey

Updates

Copyright

*Correspondence: Indrajit Saha,

†These authors have contributed equally to this work

This article was submitted to Computational Genomics, a section of the journal Frontiers in Genetics

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.

Outline

Figures

Cite article

Copy to clipboard


Export citation file


Share article

Article metrics