Evolutionary Constraints on the Norovirus Pandemic Variant GII.4_2006b over the Five-Year Persistence in Japan

Norovirus GII.4 is a major cause of global outbreaks of viral gastroenteritis in humans, and has evolved by antigenic changes under the constantly changing human herd immunity. Major shift in the pandemic GII.4 strain periodically occurs concomitant with changes in the antigenic capsid protein VP1. However, how the newly emerged strain evolves after the onset of pandemic remains unclear. To address this issue, we examined molecular evolution of a pandemic lineage, termed the GII.4_2006b, by using the full-length viral genome and VP1 sequences (n = 317) from stools collected at 20 sites in Japan between 2006 and 2011. Phylogenetic tree showed a radial diversification of the genome sequences of GII.4_2006b, suggesting a rapid genetic diversification of the GII.4_2006b population from a few ancestral variants. Impressively, amino acid sequences of the variable VP1 in given seasons remained as homogeneous as those of viral enzymes under annual increase in the nucleotide diversity in the VP1 coding region. The Hamming distances between the earliest and subsequent variants indicate strong constraints on amino acid changes even for the highly variable P2 subdomain. These results show the presence of evolutionary constraints on the VP1 protein and viral enzymes, and suggest that these proteins gain near maximal levels of fitness benefits in humans around the onset of the outbreaks. These findings have implications for our understanding of molecular evolution, mechanisms of the periodic shifts in the pandemic NoV GII.4 strains, and control of the NoV GII.4 pandemic strain.

The VP1 protein is the major structural protein of the mature virion, which protrudes from the virion surface and plays pivotal roles in the viral interactions with hosts. The VP1 protein is composed of two domains, protruding (P) and shell (S) (Prasad et al., 1999). The P domain is further divided into two subdomains, P1 and P2 (Prasad et al., 1999). The P2 subdomain is placed on the tip of the VP1 protein and constitutes the major antigenic site around the binding site to the putative receptor(s) for infection (Donaldson et al., 2010). This structural feature causes sequence variation (Lindesmith et al., 2008(Lindesmith et al., , 2011(Lindesmith et al., , 2012a(Lindesmith et al., , 2013Bok et al., 2009;Debbink et al., 2012) and structural diversity (Chen et al., 2004(Chen et al., , 2006Donaldson et al., 2010), particularly in the P2 subdomain. Meanwhile, the functional importance of the P2 subdomain can cause suppression of deleterious changes and/or changes that reduce viral replication fitness. However, very little is known about evolution of the VP1 protein during viral maintenance in human populations.
To address this issue, we examined here molecular evolution of the VP1 protein of a pandemic lineage, termed GII.4_2006b, which is also known as the GII.4 Den Haag 2006b. In the autumn/winter of 2006, the national epidemiological surveillance of infectious diseases in Japan reported an unusual increase in the number of outbreaks of NoV infections (Infectious Disease Surveillance Center 1 ). This augmentation was associated with the nationwide spread of a newly emerging GII.4 variant (Motomura et al., 2008), termed GII.4_2006b. The GII.4_2006b initially coexisted as a minority strain among various other NoV lineages in Japan, but starting in October of 2006 it spread extremely rapidly and remained as the major epidemic variant across Japan between (Motomura et al., 2008. In this study, we characterized nucleotide and amino acid diversities of the VP1 proteins, using serially collected full-length 317 NoV genome and VP1 sequences from infections in Japan between 2006 and 2011. The obtained results show long-term persistent of GII.4_2006b in human populations in Japan as a dominant GII.4 subpopulation. Interestingly, both the VP1 protein and viral enzymes had remained as highly homogeneous populations, indicating strong evolutionary constraints on changes in these proteins following the onset of the outbreaks.

NoV Genome Sequencing
Stool specimens were collected from individuals with acute gastroenteritis at 20 regional public health institutes in Japan 1 http://idsc.nih.go.jp/iasr/prompt/graph-ke.html between May 2006 and March 2011 in compliance with the Food Sanitation Law of Japan, according to the methods for the protection of personal information (including methods for anonymization in an unlinkable fashion). The research was approved by research and ethical committee in National Institute of Infectious Diseases. Three to five stool specimens were collected at each site in each year. NoV genome sequences were obtained from the stool specimens as described previously (Motomura et al., 2008(Motomura et al., , 2010.

Genotype Determination
Norovirus genotype was determined by construction of phylogenetic trees of viral genome sequences. Multiple sequence alignments were done as described previously (Motomura et al., 2008(Motomura et al., , 2010 using the MAFFT (Katoh et al., 2009) and alignment tools implemented in the MEGA software suite (Tamura et al., 2011). Phylogenetic trees were constructed as described previously (Motomura et al., 2008(Motomura et al., , 2010 using MEGA software (Tamura et al., 2011). The reliability of interior branches in the tree was assessed by the bootstrap method with 1,000 resamplings.

Analysis of Diversity of Sequence Population
Mean diversity in the entire sequence population was computed with the "Sequence Diversity" menu in MEGA software suite (Tamura et al., 2011). The overall pairwise mean distance between the sequences was computed with the "Distances" menu in MEGA. As substitution models, a maximum composite likelihood and a Poisson model were used for nucleotide and amino acid sequences, respectively. Variance was estimated by the bootstrap method with 100 to 500 bootstrap replications.

Analysis of Individual Amino Acid Variation
Amino acid variations at each position of the VP1 (1-530) were calculated as previously described with a multiple sequence alignment as described previously for other viral proteins (Naganawa et al., 2008;Oka et al., 2009;Takahata et al., 2017) on the basis of Shannon's equation (Shannon, 1997): where H(i), p(x i ), and i indicate the amino acid entropy score of a given position, the probability of occurrence of a given amino acid at the position, and the number of positions, respectively. An H(i) score of zero indicates absolute conservation, whereas 4.4 bits for amino acids or 2.0 bits for nucleic acids indicates complete randomness.

Analysis of Amino Acid Substitutions by Hamming Distance
We used the Hamming distance to assess the changeability of the earliest NoV GII.4_2006b variant in Japan. In information theory, the Hamming distance between two sequences indicates the minimum number of "amino acid substitutions" required to change one sequence into the other. Because the length of amino acid sequences of the VP1 proteins of the NoV GII.4_2006b subpopulations were identical (540 amino acid residues), the Hamming distance measured in this study means the number of different amino acid residues between the earliest and subsequently emerged variants in two aligned sequences. Python was used as the programming language to compute Hamming distances. Hamming distances between the sequence in May 2006 (accession number AB447443; earliest GII.4_2006b sequence in our NoV genome dataset) and the later GII.4_2006b sequences (n = 249) were computed by creating a sequence that assigns mismatches and matches at corresponding positions in the two sequences, and then by counting the numbers of the mismatches.

Nucleotide Accession Numbers
The DDBJ database accession numbers of the 317 NoV GII.4 genome sequences used in this study are provided in Supplementary Table S1 (n = 250, GII.4_2006b) and Supplementary Table S2 (n = 67, GII.4 non-2006b).

Persistence and Diversification of NoV GII.4_2006b Genome in Japan between 2006 and 2011
We obtained 317 genome sequences of NoV GII.4 from the stool specimens collected at 20 sites in Japan between 2006 and 2011 ( Figure 1A). Eight distinct lineages of NoV GII.4 were   (Motomura et al., 2008(Motomura et al., , 2010; Supplementary Tables S1, S2). Among the newly emerged eight GII.4 lineages, only the pandemic variant GII.4_2006b had been detected dominantly and continually throughout Japan ( Figure 1B). The GII.4_2006b represented about 79% (n = 250) of the total GII.4 genomes detected during the 5 years. Phylogenetic analysis shows that the GII.4_2006b genome sequences diverged radially from a few roots ( Figure 1C). These data suggest genetic bottlenecks followed by a rapid genome diversification of GII.4_2006b between 2006 and 2011.

Diversity of NoV GII.4_2006b ORFs
The GII.4_2006b RNA genome encodes three open reading frames, ORF1, ORF2, and ORF3 (Figure 2A). ORF1 encodes viral enzymes and non-structural proteins. ORF2 and ORF3 encodes structural proteins, VP1 and VP2, respectively. We first examined whether the sequence diversity is different among the three ORFs by using the 250 GII.4_2006b genome sequences obtained in this study. The phylogenetic tree and mean diversity in the entire sequence population show that the nucleotide diversity was similar among the three ORFs ( Figure 2B). In contrast, a marked difference was observed in the diversity of amino acid sequences: the ORF1 and ORF2 amino acid sequences remained significantly less diversified than that of ORF3 ( Figure 2C). The data suggest the presence of constraints on the amino acid changes of proteins encoded by ORF1 and ORF2.
Temporal Changes in the Sequence Diversity of NoV GII.4_2006b The GII.4_2006b RNA genome encodes eight viral proteins ( Figure 3A). Shannon entropy of amino acid sequences of the 2006b ORF1 in the present genome dataset indicates that the potential sites for the internal cleavage of the ORF1 precursor protein were perfectly conserved in amino acid levels [H(i) = 0/0] for p48/NTPase (Q/G), NTPase/p22 (Q/G), and VPg/Pro (E/A),  Figure 1C. The analyses were done with tools included in the MEGA software suite (Tamura et al., 2011). Nucleotide sequences (B). Amino acid sequences (C).
Frontiers in Microbiology | www.frontiersin.org To assess the changeability of individual viral proteins, we examined temporal changes in the sequence diversity of the eight protein-coding regions using the 250 GII.4_2006b genomes. The genome sequences were divided into five groups based on the collected seasons, and the overall mean distance of the sequences in a season was calculated using MEGA. The nucleotide mean distance sequentially increased for the eight protein-coding regions (Figure 3B, Nuc), indicating a continuous increase in the dissimilarity of every gene segment in the GII.4_2006b variant population in Japan. In contrast, the temporal change in amino acid sequence diversity was very different among the eight proteins (Figure 3B, Ami). Interestingly, the amino acid mean distance of the generally hypervariable VP1 protein sequences remained comparable to that of three viral enzymes (NTPase, Pro, Pol) for 5 years, with the mean distance remaining at less than 0.01 with small variances (Figure 3B, Upper). After the 3rd epidemic season, the VP1 amino acid distance even decreased.
Meanwhile, the amino acid mean distances of the p22 and VP2 proteins sharply increased in parallel with an increase in the nucleotide distances (Figure 3B, p22 and VP2), suggesting the continuous diversification of these proteins in association with nucleotide diversification. The mean amino acid distance for the p48 protein increased with time yet less extensively than those for the p22 and VP2 proteins (Figure 3B, p48). The mean amino acid distance for the VPg protein stayed at relatively low levels with large variances (Figure 3B, VPg). In sum, these data suggest the presence of strong constraints on amino acid changes in the capsid protein VP1 and enzymes (NTPase, Pro, Pol) of the GII.4_2006b under the diversification of nucleotide sequences.

Long-term Circulation of the NoV GII.4_2006b Subgroup Carrying the Identical Capsid Protein VP1
We identified a GII.4_2006b subpopulation (n = 23) whose nucleotide sequences differed from each other, yet encoded FIGURE 4 | Long-term circulation of the NoV GII.4_2006b subgroup carrying the identical capsid protein VP1. (A) Identification of a GII.4_2006b genome subpopulation group 1 that encodes the identical VP1s in distinct genetic backbones. Nucleotide (Upper) and deduced amino acid (Lower) sequences of the group 1 genomes (n = 23) were aligned using MAFFT software (Katoh et al., 2009), and Shannon entropies at individual positions were calculated as described previously (Naganawa et al., 2008;Oka et al., 2009;Takahata et al., 2017). The distribution of Shannon entropy scores in the GII.4_2006b genome is shown. (B) Detection frequency of the group 1 genomes in five seasons between 2006 and 2011 in Japan. (C) Neighbor-joining tree of the GII.4_2006b VP1 nucleotide sequences (1620 nucleotides). Colored circles indicate the group 1 sequences. (D) P domain dimer model of the GII.4_2006b VP1 protein was constructed as described (Motomura et al., 2008(Motomura et al., , 2010. Blue residues indicate the GII.4_2006b-specific amino acid substitutions at potential epitopes in the P2 subdomain of the GII.4 VP1 (Lindesmith et al., 2012b). exactly the same VP1 amino acid sequences ( Figure 4A). The members of this population, tentatively termed group 1, were detected at distantly located 11 sample collection sites in Japan during the study period (Supplementary Table S1 and Figure  S1). They emerged in the second epidemic season in 2007 and continuously circulated without no changes in the VP1 amino acid residues, representing about 6-13% of the GII.4_2006b genomes in each season ( Figure 4B). The group 1 genomes continuously accumulated nucleotide substitutions in the VP1coding region (Figure 4C), but only synonymous substitutions ( Figure 4A). The GII.4_2006b variants at the onset of epidemics generally had 10 substitutions at the potential epitopes A, B, D, and E (Lindesmith et al., 2012a) (Figure 4D). The group 1 VP1 protein had an additional substitution (S393G) on the epitope D.

Temporal Change in Hamming Distance for the NoV GII.4_2006b Capsid Protein VP1
The NoV VP1 protein has an architecture similar to that of the VP1 proteins of other single-stranded RNA viruses (Prasad et al., 1994(Prasad et al., , 1999; Figure 5A). The S domain is highly conserved, whereas the P2 domain is hypervariable among GII.4 variants. To assess the changeability of the P2 domain of the GII.4_2006b, we examined the temporal accumulations of amino acid substitutions in the S, P1, and P2 regions of the GII.4_2006b VP1 using the Hamming distance between the earliest and subsequent VP1 variants. As the earliest VP1 variant of the GII.4_2006b, we used a sequence from a May 2006 sample, which was collected in spring about 5 months before the onset of the nationwide epidemics of the GII.4_2006b in October of 2006 in Japan (Motomura et al., 2008).
For the S domain, the Hamming distances of the variants in given seasons were at a constant peak of 0 for 5 years (Figure 5B, VP1 Shell). The data suggest that the GII.4_2006b variants having amino acid substitutions in the S domain were mostly cleared during epidemics. For the P1 and P2 subdomains, the peaks of Hamming distances were fixed at 1 and 3 after the second and first epidemic seasons, respectively ( Figure 5B, VP1 P1 and P2). The data indicate that most of the GII.4_2006b variants in the early epidemics had a few amino acid substitutions in the P domain but they could not accumulate more mutations after the second epidemic season. Thus the P domain was more variable than the S domain in the GII.4_2006b variants, as has generally been documented for other NoVs. However, the accumulation of amino acid substitutions was strictly constrained in the P domain of the GII.4_2006b variants during epidemics. In contrast, the Hamming distances of VP2, a minor structural protein in virion (Glass et al., 2000), continuously increased and showed no evidence of fixation of the peak distance during the study period (Figure 5B, VP2).

DISCUSSION
In this report, we studied molecular evolution of the NoV capsid protein of a pandemic lineage, GII.4_2006b. This NoV subpopulation predominated over other coexisting NoV GII.4 subpopulations between the 2006 and 2011 in Japan (Figure 1). Notably, the amino acid sequences of variable VP1 protein of the GII.4_2006b populations remained as homogeneous as that of the viral enzymes for the 5 years under an increase in nucleotide diversity (Figures 2, 3). Even the GII.4_2006b population possessing the identical amino acid sequence in the VP1 protein had persisted in the study period (Figure 4). Even the hypervariable antigenic P2 subdomain of the VP1 protein had resisted sequential accumulations of amino acid substitutions (Figure 5). These results suggest the presence of strong evolutionary constraints on the VP1 protein of the NoV pandemic strain. The finding has implications for our understanding of molecular evolution, mechanisms of the periodic shifts in the pandemic NoV GII.4 strains, and control of the NoV GII.4 pandemic strain.
First, the finding has implications for understanding fitness landscape and evolution of the VP1 protein of NoV GII.4 pandemic strain. The strong constraints on changes imply that the VP1 protein and enzymes of the GII.4_2006b variants had already gained near maximal levels of fitness benefits in humans around the onset of the outbreaks and that new mutations in the VP1 protein were mostly cleared from the GII.4_2006b population, probably due to a reduction in the viral fitness for the spread in humans. In order to predominate over other coexisting GII.4 variants, the pandemic variant should have the VP1 structure that confers the best ability to evade preexisting herd immunity against NoV at that time, while also having affinity to bind to receptor(s) on human cells. Because the antigenic sites are located near the receptor-binding site, new antigenic mutations always have the risk to attenuate VP1 protein function and thereby to cause reduction in the viral replication fitness in humans. Thus, it is possible that the VP1 protein of the pandemic strain had remained conserved in human populations primarily by the necessity to maintain advantageous physical property of the VP1 protein for immune evasion and infectivity simultaneously.
Secondary, the finding has implications in the periodic shifts of the pandemic NoV GII.4 strains. Provided that the VP1 protein sequence of a given pandemic variant remained conserved following the onset of epidemics as seen in the GII.4_2006b, the human herd immunity against the VP1 protein would become increasingly more effective in association with the spread of the virus in humans. Consequently, niche for the pandemic variant in humans would be reduced, and the pandemic variant eventually be replaced by an alternative variant that has the fittest capsid structure under human herd immunity at that time. Consistently, the numbers of reported NoV infection cases in Japan had decreased annually since the late 2007, and the GII.4_2006b was replaced by a new global pandemic strain GII.4_Sydney 2012 in the 2013/2014 season, as reported in other countries (van Beek et al., 2013;Eden et al., 2014).
Finally, the finding has implications in the control of NoV pandemic strains. Although development of vaccines and antiviral agents are of special importance to reduce damages from the NoV infections, structural variations in the viral proteins can be problematic. In this regard, the present study suggests the presence of strong constraints on changes in capsid protein and enzymes of a NoV GII.4 pandemic variant on the course of 5-year persistence across Japan. The finding provides a rationale for developing vaccines and antiviral agents against a pandemic strain. A basic premise of the control is that the sequences of the VP1 protein and viral enzymes of a given pandemic variant remain highly homogeneous after the onset of pandemic. Therefore, it is important to further accumulate information on the evolution of newly emerged pandemic strains to clarify whether present observations of the amino acid conservation in the VP1 and viral enzymes can be extended to other GII.4 pandemic variants. In parallel, it would be important to study genetic diversity of NoV in nature in order to develop systems to predict a new pandemic variant in advance.

AUTHOR CONTRIBUTIONS
HS conceived the study. MY prepared the computing environment for information science. KK, NT, MN, and TT organized collection of stool specimen. TO, HN, and KM performed sequencing. HN and HS performed variation analysis. HS prepared the manuscript. All authors read and approved the final manuscript.