Translation-associated mutational U-pressure in the first ORF of SARS-CoV-2 and other coronaviruses

Within four months of the ongoing COVID-19 pandemic caused by SARS-CoV-2, more than 250 nucleotide mutations have been detected in the ORF1 of the virus isolated from different parts of the globe. These observations open up an obvious question about the rate and direction of mutational pressure for further vaccine and therapeutics designing. In this study, we did a comparative analysis of ORF1a and ORF1b by using the first isolate (Wuhan strain) as the parent sequence. We observed that most of the nucleotide mutations are C to U transitions. The rate of synonymous C to U transitions is significantly higher than the rate of nonsynonymous ones, indicating negative selection on amino acid substitutions. Further, trends in nucleotide usage bias have been investigated in 49 coronaviruses species. A strong bias in nucleotide usage in fourfold degenerated sites towards uracil residues is seen in ORF1 of all the studied coronaviruses. A more substantial mutational U pressure is observed in ORF1a than in ORF1b owing to the translation of ORF1ab via programmed ribosomal frameshifting. Unlike other nucleotide mutations, mutational U pressure caused by cytosine deamination, mostly occurring in the RNA-plus strand, cannot be corrected by the proof-reading machinery of coronaviruses. The knowledge generated on the direction of mutational pressure during translation of viral RNA-plus strands has implications for vaccine and nucleoside analogue development for treating covid-19 and other coronavirus infections.


Introduction
The current COVID-19 pandemic caused by SARS-CoV-2 has claimed more than 0.25 Million human lives with nearly 3.5 million reported infections globally and still this number is increasing day by day. 1 The virus was first detected in a wet market in Wuhan, China in December 2019. It bears a close similarity to bat coronaviruses, and similar viral particles have been detected in pangolin. This led to the speculation that the virus must have originated in bats while pangolins must have been intermediate hosts. Since then, the virus has spread all over the globe affecting humans of all races. The infections are characterized by sore throat, fever, cough, body pains, breathlessness, severe pneumonia, and death due to multi-organ failure involving kidneys and lungs. 2,3 Coronaviruses have an exceptionally large genome of around 30kb. A huge part of the genome, at the 5' end, around 20 kb, harbors a replicase gene that codes for around 16 nonstructural proteins (nsps). 4 The rest of the genome, near the 3' end, codes for the structural (spike, envelope, membrane, nucleocapsid) and several accessory proteins. 5 The replicase gene (ORF1) includes two subunits, ORF1a and ORF1b expressing the polyproteins pp1a and pp1ab, respectively. A slippery sequence and ribosomal frameshifting caused by an RNA pseudoknot ensure the translation of both polyproteins from the same genome. 6,7 Pp1a has nsps 1-11 whereas pp1ab has nsps 1-16. 8 The non-structural proteins form the viral replicase complexes. Here both genomic and subgenomic viral RNAs get synthesized through negative-strand intermediates. Sub genomic RNAs code for structural and accessory proteins.
Variations in viral sequences have an important role in viral propagation and pathogenesis. They may help change the viral phenotype allowing it to evade host immune system and also acquire drug resistance. These variations may arise from copying errors during genome replication. 9 RNA viruses have low replicative fidelity; hence are more prone to mutations than DNA viruses. Around 10 -4 to 10 -6 errors per nucleotide are seen in RNA viruses, which is equivalent to approximately one mutation per genome per replication cycle. This allows the viruses to maintain mutations that are beneficial for them in adapting to new environments. 10 Sometimes, a specific nucleotide mutation occurs more frequently, which is called directional mutational pressure. Mutational pressure changes the nucleotide composition of the genome irreversibly and may be caused due to an error-prone polymerase, RNA editing, oxidative damage. Moreover, the direction of mutational pressure may not be similar along the entire length of the genome. 11 Coronaviruses are known as slowly mutating RNA viruses since they perform proof-reading during replication 12 .
Cytosine to Uracil transitions can occur spontaneously through oxidative damage by free radicals or enzymatically through the action of apolipoprotein B mRNA editing catalytic subunit (APOBEC) family of cytidine deaminases. RNA-editing enzymes from APOBEC family bind single-stranded viral RNA and deaminate cytosine residues. 13,14 Single nucleotide transitions mostly result in synonymous mutations though synonymous codons, i.e., codons that code for same amino acid, don't occur equally in different organisms; these organisms are said to have a codon usage bias. Fourfold degenerate codons can tolerate any nucleotide substitution at the wobble or the third position, which doesn't change the amino acid sequence. But this changed nucleotide in the RNA may affect the RNA structure, RNA interference, and codon usage bias. 11 Since its first appearance in December, 2019, four hundred full-length sequences of ORF1 of SARS-CoV-2 have become available on GenBank. In this study, a comparative analysis of ORF1a and ORF1b has been carried out, taking the first isolate as the parent sequence. Further, trends in nucleotide usage bias have also been investigated in various coronaviruses. There are no approved drugs or vaccines for human coronaviruses yet. With alarmingly increasing damage caused by COVID-19 to human lives and global economy, there is a pressing need for its prevention and treatment strategies through vaccines or potential drugs. This knowledge on direction of mutational pressure may help design more effective vaccines and nucleoside analogues.

Nucleotide usage bias analysis
The nucleotide usage biases along the length of ORF1 from each virus were calculated in sliding windows 150 codons in length, with a step of a single codon. The "VVTAK SW" algorithm (chemres.bsmu.by) was used for the calculations. Among other indices, the algorithm calculates nucleotide usage in fourfold degenerated sites, in which all nucleotide mutations are synonymous. ORF1b in each sequence was "opened up" by the addition of "A" nucleotide into the "AAA" motif of the ribosome slippage sequence. The location of this motif has been determined with the help of GenBank description of each sequence. So, the length of complete ORF1 for SARS-CoV-2 is equal to 7097 codons. This includes ORF1a equal to 4406 codons (until the stop-codon), and the ORF1b of 2696 codons. Nucleotide usages in the fourfold degenerate sites of both the parts of ORF1 (the one before the "AAA" motif of the ribosome slippage sequence, and the one after that motif) were compared for all 49 viruses. Coefficients of correlations between

Mutational pressure analysis in SARS-CoV-2
Nucleotide sequences of the complete ORF1 of SARS-CoV-2 were retrieved with the help of NCBI BLAST algorithm. The reference genome of SARS-CoV-2 was used as a target. 400 full length sequences of ORF1 belonging to different isolates of SARS-CoV-2 from all over the globe were obtained from GenBank on the 15 th of April 2020. The numbers of sites with all possible types of nucleotide mutations in this alignment relative to the reference sequence were calculated. Similarly, the number of sites available for each type of nucleotide mutation was calculated in the reference sequence. The rate of nucleotide mutation of a certain type is equal to the number of sites with a given mutation divided by the total number of nucleotides available for this kind of mutation. For example, the rate of C to U mutations is equal to the number of sites with C to U mutations over the number of C residues. The rates of mutations have been compared with each other with the help of t-test.
The rates of synonymous and nonsynonymous mutations for C to U mutations were calculated. The sites in which C to U mutation is synonymous and those in which it is nonsynonymous in the reference ORF1 sequence of SARS-CoV-2 were determined. To calculate the rate of synonymous C to U transitions, the number of sites with (non)synonymous C to U mutations were divided by the number of sites in which C to U mutations are (non)synonymous. The rates of mutations for both parts of the ORF1 were calculated separately.

Statistical Analysis
The Student's t-test was used for calculating the statistical significance. P-values < 0.05 were considered as statistically significant. Coefficients of correlation have been calculated using MS Excel.

Mutational pressure in SARS-CoV-2 ORF1
As on the 15 th of April 2020, a total of 400 full-length sequences of ORF1 belonging to different isolates of SARS-CoV-2 were available in GenBank. The first full-length genome of SARS-CoV-2 appeared in GenBank towards the end of December 2019, followed by the other sequences. Therefore, the first reported genome of the virus can be considered as the initial sequence, while the others can be regarded as its offspring that mutated during the ongoing pandemic. If each offspring sequence is compared with the initial one, the number of mutations may seem to be relatively small. However, when all those 400 sequences are considered, there are already 250 sites with mutated nucleotides (ignoring ambiguous results of sequencing) in this enormously long ORF that consists of 21227 nucleotides.
To find out the preferable direction of those nucleotide mutations, the numbers of sites with each type of nucleotide substitution were calculated and divided by the usage of a corresponding nucleotide in a reference sequence 15 . Calculated frequencies of nucleotide mutations were compared with each other. Since there is a ribosome slippery sequence in the ORF1, we performed calculations separately for two parts of ORF1: ORF1a and ORF1b, before and after the slippery site, respectively. There is a clear and strong mutational U-pressure in both parts of the ORF1 (Table 1). The most frequent type of nucleotide mutation in SARS-CoV-2 ORF1 during 4 months of the pandemic was cytosine to uracil transition (C to U). The rate of this mutation was found to be more than 6 times higher than the rate of an opposite U to C mutation in ORF1a and more than 4 times higher in ORF1b. The difference between the rate of C to U transitions and the rate of U to C transitions is significant (P < 0.05) for both parts of ORF1. Interestingly, the rate of C to U transitions is also significantly (1.7 times) higher in the ORF1a than in ORF1b. The most frequent cause of this kind of mutation is cytosine deamination, the product of which is uracil. This mutation may be spontaneous or enzymatic. In the latter case, RNA-editing enzymes from APOBEC family may be responsible for it 13 , 14 . These enzymes bind single-stranded viral RNA and deaminate cytosine residues. Spontaneous deamination occurs via oxidation of cytosine by free radicals 16 . The rate of this process is higher for single-stranded RNA than for double-stranded RNA 17 .
Since ORF1a is translated more frequently than ORF1b, the rate of C to U transitions should be higher in ORF1a. Indeed, the process of translation often ends near the ribosome slippery sequence at a stop codon if the ribosome fails to slip to the -1 reading frame 6 . Therefore, ORF1b is not unwound as frequently as ORF1a. The difference between the rates of G to A and A to G transitions was not found to be significant for both parts of ORF1. However, the rate of G to A transitions themselves is significantly higher in ORF1a than in ORF1b (Table 1). The rate of G to U transversions is significantly higher than the rate of U to G transversions in both parts of ORF1 (Table 1). The rate of G to U transversions is, of course, significantly lower than the rate of C to U transitions. However, both of these mutations do contribute to the mutational Upressure in this viral gene. The rates of other transversions are rather low (Table 1). Their preferable directions can be determined only after some time when the virus acquires more mutations. Even now, after the first four months after the breaking of the interspecies barrier by this virus, it is evident that the most expected mutations in its ORF1 are C to U transitions.
The rate of synonymous C to U mutations is higher than the rate of nonsynonymous ones in SARS-CoV-2 ORF1 The number of C to U mutations observed in SARS-CoV-2 ORF1 is enough to make conclusions about the type of natural selection. To make such conclusions, the numbers of sites with synonymous and nonsynonymous C to U mutations in each part of ORF1were calculated 18 . After that, we divided those numbers by the numbers of sites for synonymous and nonsynonymous C to U mutations in corresponding parts of ORF1 of the reference SARS-CoV-2 genome. A comparison of those rates is given in Table 2. As evident from Table 2, the rate of synonymous mutations of C to U direction is significantly higher than the rate of nonsynonymous mutations of the same direction in both parts of ORF1. The rate of synonymous C to U mutations is similar in both parts of ORF1. In contrast, the rate of nonsynonymous C to U mutations is significantly higher in ORF1a than in ORF1b.
The number of sites for synonymous C to U mutations is 2.7 (ORF1a) and 2.5 (ORF1b) times lower than the number of sites for nonsynonymous C to U mutations. So, the increase of the rate of C to U mutations in the first part of ORF1 relative to its second part is caused by the increase of the rate of nonsynonymous mutations. The reason for this growth maybe both in the higher rate of their occurrence and in the weaker negative selection. Indeed, proteins that are cut from the second part of a polyprotein encoded by ORF1b (RNA-dependent-RNA polymerase and helicase, 3'-to-5' exonuclease, endoRNAse, 2'-O-ribose methyltransferase) seem to be more conservative than proteins that are cut from its first part. An interesting fact is that the number of sites with synonymous C to U mutations is lower than the number of sites with nonsynonymous C to U mutations (32 vs. 51) for ORF1a, while for ORF1b it is vice versa (17 vs. 13). As usual, these numbers must be normalized by the number of sites available for those mutations before the comparison.

Nucleotide usage biases in ORF1 of SARS-CoV-2, SARS-CoV, and MERS
Nucleotide usage biases in genes are formed during long-term process of fixation of certain nucleotide mutations in numerous generations 19 . As one can see in Figure 1a, the usage of uracil in fourfold degenerated sites of ORF1 from SARS-CoV-2 is quite high: 53.6±0.2% in the first part of ORF1 and 49.0±0.2% in the second part implying that the rates of C to U transitions and G to U transversions have been much higher than the rates of opposite mutations for a very long time in predecessors of the current virus. Nowadays, almost one-half of nucleotides in fourfold degenerated sites of SARS-CoV-2 ORF1 are uracil residues. Interestingly, in ORF1b, the usage of uracil is still significantly lower than in the first one. It means that the tendency observed during the mutagenesis of SARS-CoV-2 in human cells is the same as during its mutagenesis in cells of its former hosts.
Adenine is the second most common/frequent nucleotide to occur in the fourfold degenerate sites (Figure 1a). This shows that C to U transitions are frequent in RNA minus strands of the virus as well. However, they are not as frequent as those in its RNA plus strands. This fact may be interpreted as the evidence that viral RNA plus strand is a target for APOBEC editing soon after the entry into the host cell. If an infection is successful, coronavirus suppresses expression of host proteins with its early protein nsp1 20 , and its RNA minus strands formed later are not edited by APOBEC. The level of A4f is significantly higher in the ORF1b than in ORF1a.
Both cytosine and guanine are extremely rare in fourfold degenerated sites of SARS-CoV-2 ORF1 (Figure 1a). However, there are still several areas with C4f level around 20% both in ORF1a and ORF1b.
In the first ORF of the SARS-CoV (responsible for the 2002-2003 SARS epidemic), the difference in nucleotide usage levels between two parts of ORF is even more noticeable than in SARS-CoV-2 ( Figure 1b). In ORF1a, the level of U4f is 49.3±0.2%, while in ORF1b, it is 42.9±0.2%. The values of U4f for SARS-CoV ORF1 are significantly lower than those for SARS-CoV-2. As to the values of A4f, they are equal to 26.7±0.2% and 32.7±0.2% for SARS and 28.3±0.2% and 31.5±0.2% for SARS-CoV-2 ORF1 parts. So, the magnitude of change in A4f usage before and after the ribosome slippage site is higher for SARS-CoV than for SARS-CoV-2. It means that one may expect the highest rate of successful ribosome slippage for SARS-CoV-2 than for SARS-CoV.
Nucleotide usage biases in fourfold degenerated sites along the ORF1 of the MERS virus (namely, in Betacoronavirus England 1 strain) are shown in Figure 1c. The usage of U4f is distributed almost identically along the length of ORF1a (49.2±0.2%) and ORF1b (50.5±0.2%). A slight decrease of U4f in ORF1a is because of the area between codons #1400 and #2600, where U4f is lower than in other parts of the same ORF. The difference between A4f in ORF1a and ORF1b is insignificant (22.4±0.1% and 22.2±0.2%). There are areas with relatively elevated C4f usage in this ORF, while G4f is always somewhere near the point of 10%. Due to the absence of changes in nucleotide usage biases before and after the ribosome slippage sequence, one may speculate that this event (ribosome slippage) is almost always successful for the MERS virus. It is essential to check whether MERS virus is rather an exception or a rule among different coronaviruses.

Nucleotide usage biases along the length of Alpha-, Beta-, Gamma-, and Deltacoronaviruses
Nucleotide usage biases along the length of ORF1 have been determined in 46 more completely sequenced coronaviruses. From the data given in Table 3, it can be concluded that there is a mutational U-pressure in all those known species. However, the average value of U4f in fourfold degenerated sites varies from 39 to 71%, while the average value of A4f varies from 16 to 40%. Once again, U4f is always significantly higher than A4f in all species of coronaviruses, in both parts of ORF1. If we take all 49 species of coronaviruses together, U4f is significantly higher in ORF1a than in ORF1b, while A4f and C4f are significantly lower in ORF1a than in ORF1b. The level of G4f is the same in both parts of ORF1. A similar trend is observed when only Alphacoronaviruses or Betacoronaviruses are considered. However, in Deltacoronaviruses, the differences in U4f and C4f in both parts are still significant, while the difference in A4f is not significant.
There are several exceptions from the rule among studied viruses, in which the value of U4f is significantly lower in ORF1a than in ORF1b. They are 3 out of 19 Alphacoronaviruses; 2 out of 18 Betacoronaviruses; 3 out of 10 Deltacoronaviruses (Table 3). There are at least two causes of the existence of those exceptions. At first, coronaviruses are prone to recombination with each other and with other viruses 21 . That is how some fragments of ORF1 may be exchanged during recombination, and a newly acquired fragment of ORF1 will possess an outstanding bias in fourfold degenerated sites. After a certain number of generations, nucleotide usage levels in fourfold degenerated sites will become almost identical through the whole length of ORF1a and ORF1b again. From this point of view, there may be such half-homogenized fragment in the ORF1a of MERS (Figure 1c). The "homogenization" of biases in twofold degenerated sites from third codon positions takes longer time than their "homogenization" in fourfold degenerated sites 22 . Biases in first and second codon positions will be "improved" after an even more extended period of time 23 .
Second, the quality of a regulatory element responsible for ribosome slippage should be different in different coronaviruses 6 . As one can see in Figure 2a, the difference in average values of U4f in two parts of ORF1 may reach more than 8%, or it may be equal to just 1%. Interestingly, there is a correlation between the difference in U4f and the difference in A4f between two parts of ORF1 (R = -0.66). Indeed, the higher the difference in U4f, the higher (by module) the difference in A4f. The same relationships are there between the difference in U4f and the difference in C4f between two parts of ORF1 (R = -0.72). The difference in A4f, however, shows no correlation on the difference in C4f (R = 0.11). It means that in some viruses, U4f in the first part of ORF1 is increased mostly because of the decrease of A4f, while in others, it is increased mostly because of the decrease of C4f. Both SARS-CoV-2 and SARS-CoV belong to the group in which U4f is growing mostly because of the decrease of A4f. The chance of success for ribosome slippage is one of the key factors of viral pathogenesis 24 . According to Figure 2a, SARS-CoV-2 should be able to synthesize more RNA-dependent-RNA-polymerase at the onset of infection than SARS-CoV.

Discussion
Determination of the mutational pressure direction should be a starting point in any vaccine design study 22 . If the most frequent type of nucleotide mutation is known, one may try to choose "cold" spots for this mutation as targets for future vaccine development instead of "hot" spots. Based on our study, we suggest that fragments of RNA of SARS-CoV-2 that have a higher level of U in first and second codon positions and a higher level of C in synonymous sites for C to U mutations should be chosen for vaccine designing.
It has been shown that the genome of SARS-CoV-2 is subject to nucleotide usage bias towards A+U in its ORFs, including ORF1 25,26 . In this study, we examined the usage of U and A separately, since A=U and G=C parity rules do not work in these viruses. Nucleoside analogues have been suggested as anti-Covid-19 drugs 27 . Based on the results of this study, uracil analogues would be more effective in treating COVID-19 than analogues of cytosine and guanine nucleosides. Adenine analogues should be capable of effectively inhibiting the synthesis of RNA-minus strands. So they may also prove to be good therapeutic strategy but only during the early stages of infection. Though care should be taken that any nucleoside analogue used to treat Covid-19 wouldn't be recognized by the proof-reading machinery of the virus 27 or the drug will be rendered ineffective.
Sequences obtained from all over the world belong to active viruses that are under the control of negative selection, which can be confirmed by the fact that the rate of nonsynonymous mutations is less than the rate of synonymous ones. Indeed, the virus is not under strong immune pressure or the pressure of antiviral drugs at the moment. These stresses are known to cause diversifying (positive) selection of nonsynonymous mutations 28 . On the one hand, mutagenesis of C to U causes amino acid substitutions that can lead to quite drastic consequences for the fitness of a virus. Therefore, most of them are eliminated. On the other hand, mutations of C to U direction may help in immune evasion, since they are capable of destroying linear B-cell epitopes 29 .
Nsp1 is a non-structural protein present in Alpha-and Betacoronaviruses, but not in Gamma-and Deltacoronaviruses 20 . Nsp1 can inhibit the expression of host proteins. 30,31 Likely, the expression of nonspecific antiviral enzymes belonging to APOBEC and ADAR families are also inhibited by nsp1. If most of the C to U mutations in ORF1a are caused by APOBEC editing of viral RNA during initial steps of infection, then it may be expected that there will be no difference in nucleotide usage levels in fourfold degenerated sites between both the parts of ORF1 in Gamma-and Deltacoronaviruses. But, as evident from Table 3, the above-mentioned differences still exist in most of the Gamma and Deltacoronaviruses, implying that a significant fraction of C to U mutations is not caused by APOBEC editing.
The existence of RNA exonuclease (nsp14) in genomes of coronaviruses may be the cause of the observed mutational U-pressure. If during the post-replicational proof-reading most of the deaminated and oxidized nucleotides are removed, the only product of deamination remaining is a canonical amine base, uracil. As inosine (product of adenine deamination) is a non-canonical amine base, it is removed during proof-reading, whereas, uracil (product of cytosine deamination) being a canonical amine base is not removed, as proof-reading activity effectively removes mismatched pairs of canonical nucleotides, i.e., pairs of canonical nucleotides with non-canonical ones, but it cannot recognize U that has already appeared in place of C on a matrix strand before the replication as a mutated nucleotide. Interestingly, some noncanonical amine bases, like 8-oxo-G (the product of guanine oxidation), can somehow still pass through the proofreading machinery, as several G to U transversions have been detected in this study.
Without proof-reading machinery, coronaviruses accumulate noncanonical nucleotides much better than in case when it is functional 32 . Indeed, 5-formyl-uracil is paired with adenine in the absence of nsp14 in SARS-CoV and MERS viruses, and it leads to the increase of U to C and A to G transitions since this non-canonical nucleotide can pair with G even better than with A 32 . So, in the absence of proof-reading, the overall rate of mutations becomes higher, while the bias in their rates becomes weaker.
Taken together, coronaviruses are well known for their low mutation rates achieved due to proof-reading during RNA replication 12 . However, they still cannot repair C to U transitions with this mechanism, as U is a canonical amine base. Moreover, C to U transitions occur before the replication, and the resulting U makes a correct pair with A during the complementary RNA strand synthesis. Therefore, mutational U-pressure is seen in all coronaviruses.
The effectiveness of frameshifting for ORF1 for different coronaviruses has been reported to be in the range of 20 -45%. 6 . For the Mouse hepatitis virus A59, the rate of effective frameshifting is higher (from 48 to 70%) 33

Conclusions
In SARS-CoV-2 mutational U-pressure is associated with translation, as the rate of C to U transitions and the intensity of U-bias in fourfold degenerated sites are higher in ORF1a, situated before the ribosome slippery sequence, than in the less frequently translated ORF1b which is situated after the slippery sequence. U-pressure observed in all the examined sequences of ORF1 from different species of coronaviruses is a consequence of proof-reading.    Table 1. Rates of nucleotide mutations (in substitution per site with a corresponding nucleotide) in two parts of SARS-CoV-2 ORF1: before (ORF1a) and after (ORF1b) the ribosome slippage sequence. Significant changes between opposite types of mutations are shown by bold font, significant changes in rate of the same type of mutation between two parts of ORF1 are shown in underlined font.  Table 2. Description of C to U transitions in two parts of SARS-CoV-2 ORF1: before (ORF1a) and after (ORF1b) the ribosome slippage sequence. Significant changes between rates of synonymous and nonsynonymous mutations are shown by bold font, significant changes in rate of the same type of mutation between two parts of ORF1 are shown in underlined font.

ORF1a
ORF1b  Table 3. Average values of nucleotide content in fourfold degenerated sites for two parts of ORF1 (before (ORF1a) and after (ORF1b) the ribosome slippage sequence) for 49 species of coronaviruses.

Species of viruses
ORF1a ORF1b