A Non-phylogeny-dependent Reassortment Detection Method for Influenza A Viruses

Influenza A virus is a segmented RNA virus whose genome consists of 8 single-stranded negative-sense RNA segments. This unique genetic structure allows viruses to exchange their segments through reassortment when they infect the same host cell. Studying the determination and nature of influenza A virus reassortment is critical to understanding the generation of pandemic strains and the spread of viruses across species. Reassortment detection is the first step in influenza A virus reassortment research. Several methods for automatic detection of reassortment have been proposed, which can be roughly divided into two categories: phylogenetic methods and distance methods. In this article, we proposed a reassortment detection method that does not require multiple sequence alignment and phylogenetic analysis. We extracted the codon features from the segment sequence and expressed the sequence as a feature vector, and then used the clustering method of self-organizing map to cluster the sequence for each segment. Based on the clustering results and the epidemiological information of the virus, the reassortment detection was implemented. We used this method to perform reassortment detection on the collected 7,075 strains from Asia and identified 516 reassortment events. We also conducted a statistical analysis of the identified reassortment events and found conclusions consistent with previous studies. Our method will provide new insights for automating reassortment detection tasks and understanding the reassortment patterns of influenza A viruses.


INTRODUCTION
Annual epidemics and occasional pandemics caused by the influenza A virus (IAV) are important threats to human health. IAV can cause acute respiratory infections and cause high morbidity and mortality (1). Influenza A virus is a segmented RNA virus whose genome consists of 8 single-stranded, negative-sense RNA segments. When multiple influenza A viruses infect the same host cell, due to the genome characteristics of segmented viruses, the genomes of multiple viruses may exchange their segments, resulting in progeny viruses with new genome combinations (2). This process is called reassortment. Reassortment plays an important but undefined role in the cross-species transmission of influenza A virus (3). Many influenza pandemics in history were related to reassortment, including the Asian H2N2 influenza in 1957, the Hong Kong H3N2 influenza in 1968, and the 2009 H1N1 global influenza pandemic. The H7N9 avian influenza virus that broke out in 2013 was also reassorted from different viruses according to the study (4)(5)(6). Therefore, studying the determination and nature of influenza A virus reassortment is essential for understanding the generation of pandemic strains and the spread of viruses across species.
Reassortment detection is the first step in the study of influenza A virus reassortment. Due to the different evolutionary histories from different viral gene segments, reassortment can be detected by comparing the inconsistencies in the phylogenetic relationship between different segments of the reassortant virus genome (7)(8)(9)(10). However, this method is impractical for studying large-scale influenza A virus data sets. On the one hand, the cost of building a phylogenetic tree will increase exponentially with the growth of data. On the other hand, for viruses with complex reassortment history, this method is also difficult to infer the actual situation. Therefore, several methods for automatic detection of reassortment have been proposed. These methods can be roughly divided into two categories: phylogenetic methods and distance methods. In the phylogenetic method (11), proposed FluReF, which searched the full phylogenetic tree from bottom to top to select candidate reassortment groups and determined candidates that satisfied the preset threshold and resulted in inconsistencies between the segment trees. Nagarajan and Kingsford (12) proposed a reassortment finder GiRaF based on graph incompatibility. This algorithm was based on an earlier method (13), which used a fast search algorithm to mine phylogenetic inconsistencies by constructing an incompatibility graph. GiRaF then used the phylogenetic distance test to increase the false positive rate and combined the results of all fragments of the genome to generate a comprehensive reassortment catalog. In the distance method (14), estimated that the two strains did not have the possibility of reassortment based on the Hamming distance between different segments. De Silva et al. (15) reported an algorithm based on strain neighborhood. They found the closest set of strains from the reference set for each segment and compared the results between different segments to identify reassortment. In general, the phylogenetic method is more reliable than the distance method, but its cost is higher and depends on the accuracy of the constructed phylogenetic tree, which is not suitable for large-scale data sets.
In this article, we proposed a phylogeny-independent reassortment detection method. It does not need to perform multiple sequence alignment or construct a phylogenetic tree. We expressed the sequence as a feature vector and used the clustering method of self-organizing map to cluster the sequence of each segment. After uniformly naming the type of each segment of the virus, it becomes very simple to identify the reassortment. We used this method to detect the overall reassortment for viruses in Asia, and performed a statistical analysis of the identified reassortment events to dig out the reassortment determinations and patterns. Our results showed that the algorithm can identify reassortment events reported in previous studies and reveal the complex reassortment history in Asia. Therefore, our method can effectively perform reassortment analysis on large-scale data sets.

Data Processing
To study the reassortment pattern in Asia as comprehensively as possible, we assembled all the sequences from the Influenza Virus Resources at the National Centre for Biotechnology Information (NCBI) (https://ftp.ncbi.nih.gov/genomes/INFLUENZA/). The sequences sampled in Asia were then extracted. The sequences we downloaded are coding sequences. For each segment, we only consider the longest coding sequence. We deleted sequences with illegal characters in the codon region or incomplete length. Sequences without clear epidemiological information were also deleted. We did not consider the laboratory sequences. The genome of a virus would be preserved only when eight segment sequences met the above standards. After that, we obtained 7,075 genomes of IAVs. The distribution of the number of viruses in each subtype was shown in Figure 1A. There are a large number of IAVs with H1N1 and H3N2 subtypes, both exceeding 1,000 strains. In the distribution of the host (Figure 1B), viruses that infected humans, waterfowl, domestic birds, and swine account for the vast majority.

Feature Extraction
Based on the distance measure DMk proposed by Wei et al. (16), we introduced our feature extraction method for each segment sequence. A segment sequence is composed of 64 kinds of codons arranged in order. Since a codon may appear multiple times in the sequence, we can record the position of each occurrence. All the occurrence positions can form an array, recorded as p 0 , p 1 , p 2 , · · · , p n . In order to facilitate the calculation of the interval, the position at the 0th occurrence p 0 = 0 is added to the position sequence. Then we used the reciprocal of the interval between each occurrence and the previous occurrence as its occurrence density, calculated as Where p i represents the position of the i th occurrence of the codon, and p i−1 represents the position of the i-1 th occurrence of the codon. d i reflects the interval, or density, between the i th occurrence and the i-1 th occurrence of the codon. In order to describe the order of the density array, we accumulated the density array to obtain the partial sum array, calculated as: The partial sum array is calculated directly based on the density array, for example, s 1 = d 1 , s 2 = d 1 + d 2 , and so on. This array is only determined by the occurrence position of the codon, and the position of the codon can be reversed based on the array. Shannon entropy reflects the order relationship of the elements in the sequence. In order to calculate Shannon entropy, we constructed a discrete probability distribution Q = q 1 , q 2 , · · · , q n , where each probability is calculated as q i = s i / n j=1 s j . Then the Shannon entropy can be calculated as: For each codon, we can calculate a Shannon entropy, and the Shannon entropy of all codons finally constitutes a feature vector. In the actual process, we removed the feature of 3 stop codons, and finally got a 61-dimensional vector for each segment sequence.

SOM-Based Clustering
In this section, we proposed a sequence clustering method based on the self-organizing map (SOM). SOM is an unsupervised artificial neural network. Different from the general neural network training based on the reverse transfer of the loss function, it uses a competitive learning strategy and relies on the competition between neurons to gradually optimize the network. The neighbor relationship function is used to maintain the topological structure of the input space. In our clustering method, the algorithm takes the 61-dimensional vectors extracted from the segment sequences as input and maps them to the nodes of the two-dimensional grid. Segment sequences with high similarity in the input space will be mapped to the same node. We used the minisom library to implement our algorithm. First, we calculated the number of nodes in the output layer according to the empirical formula of clustering n = 5 √ N, where N is the number of genomes. The size of the output layer is size = √ n. The calculations above were rounded up. Then we initialize the SOM algorithm according to the following parameters: the output layer size was set to 61; the sigma parameter was set to 2; the learning rate was set to 1, and the neighborhood function was set to triangle; weight initialization method was set to PCA. The feature vector of each sequence was trained iteratively. After training, each vector was input to the SOM network and the winning node was the clustering type of the virus segment. Then we calculated the contour coefficient of the clustering result and recorded the corresponding parameters. Finally, we changed the parameters and repeated clustering. The clustering result with the highest contour coefficient was regarded as our final result for the segment clustering.

Reassortment Detection
Based on our clustering for each segment, we assigned a number x to each type and recorded it as segment_x. The virus sequences contained in each type were considered to have the same segmental ancestors. Therefore, if the eight segment types of a virus can be combined from segment types of other viruses, this virus may be produced by the reassortment of these viruses. There may be a situation where different combinations of multiple viruses can produce the eight segment types of the virus. For example, the eight segment types of virus A may be combined from segment types of virus B and virus C, or combined from segment types of virus D and virus E. Then we need to consider which combination is more likely to produce A by reassortment. To this end, we considered epidemiological information of the viruses such as host, location, and time. For viruses with multiple reassortment possibilities, we thought viruses with the similar host, location, and time are more likely to reassort. Our nonphylogeny-dependent reassortment detection method was shown in Figure 2.

Virus Similarity Network
Based on the feature vectors extracted from each sequence, we represented the IAV genome as a 488-dimensional feature vector composed of eight segment feature vectors. In this way, the similarity of the IAV genome can be characterized by the similarity of feature vectors. We then constructed a similarity matrix of all IAVs and used it to construct a similarity network. The minimum spanning tree algorithm was used to simplify our network, and the cys layout in cystoscope was used for display, which was shown in Figure 2. In the virus similarity network, the subtypes of different virus genomes have been clearly distinguished (Figure 3A), which proves that the feature vector we extracted is reasonable. It should be noted that swine FIGURE 2 | Reassortment detection framework. Each segment sequence was represented as a 61-dimensional feature vector and was input to SOM for clustering. The genotype was defined and reassortment detection was performed combined with the epidemiological information.
is a mixed vessel, which connected the genome of the avian influenza virus and the genome of the human infected influenza virus ( Figure 3B). We also found that some human infected virus at the top of the figure. The genomes of these viruses were similar to those of domestic birds. Therefore, domestic birds also play an important role in the process of the avian influenza virus infecting humans.

Segment Clustering Result
Using our feature extraction and SOM-based clustering method, we divided eight segments of influenza A virus into different types, respectively, which was shown in Table 1. According to the combination of hemagglutinin (HA) and neuraminidase (NA) proteins on the surface, the influenza A virus can be divided into different subtypes. At present, HA can be divided into 18 subtypes (H1-H18), NA can be divided into 11 subtypes (N1-N11). Therefore, segments with the same subtype should be grouped when we cluster HA and NA segments. Based on this, we checked the clustering results of HA and NA segments, which were shown in Figure 4A. Each type contained only one subtype and the same subtypes were clustered together, indicating that our clustering method has no problem in distinguishing between subtypes. We also analyzed the number of types corresponding to different HA and NA subtypes, which was shown in Figure 4B.
For HA subtypes, the number of types in H1, H3, H5, and H9 is relatively large, indicating that the lineages of these subtypes are complex. For NA subtypes, there are relatively more types in N1 and N2. The clustering results for all segments can be found in Supplementary Tables 1-8.

Statistic of Reassortment Subtype
We performed reassortment detection on all the viruses in the dataset one by one in the order of their sampling time and identified 516 reassortment events. The details of each reassortment event can be found in Supplementary Table 9. According to the virus subtypes involved in reassortment, we divided reassortment events into intra-subtype reassortment and inter-subtype reassortment. Intra-subtype reassortment means that the subtype involved in reassortment is single, while inter-subtype reassortment means that the subtypes involved in reassortment are diverse. We analyzed the number of two reassortment patterns and found that the intra-subtype reassortment pattern was more common, which was shown in Figure 5A. In particular, we analyzed the situation of two reassortment patterns of eight common subtypes including H1N1, H3N2, H5N1, H5N6, H6N2, H6N6, H7N9, and H9N2, which was shown in Figure 5B. Viruses with H1N1, H3N2, H9N2 subtypes had a higher proportion of intra-subtype reassortment, Frontiers in Virology | www.frontiersin.org Above these swine nodes are almost all avian host nodes (yellow-domesitic birds, purple-land birds, light blue-shorebirds, green-waterfowl), and below these swine hosts are almost all human host nodes (red). This implied that swine plays the role of an intermediate host between these avian host and human host. At the top of (B), a small number of human host nodes can be found clustered. These nodes are close to the domestic birds node (yellow). Therefore, domestic birds also play an important role in the avian influenza viruses' cross-species and infecting humans. Avian waterfowls were considered as the natural hosts of IAVs, domestic birds and swine may have played an important role as incubation for the IAVs' transmission to humans. while viruses with H5N1, H5N6, H6N2, H6N6, and H7N9 subtypes were more likely to be produced by reassortment of viruses with different subtypes.

Inter-subtype Reassortment Pattern
Although the intra-subtype reassortment pattern was frequent, we were more concerned about inter-subtype reassortment, since such reassortment pattern is more likely to produce pandemic strains. To this end, we made statistics on the frequency of other subtypes in the inter-subtype reassortment events of each subtype. We found that there were strong reassortment signals between H1N1 and H3N2, H6N2 and H6N6, H9N2, and other subtypes, which was shown in Figure 6. H1N1 and H3N2 are common subtypes of human influenza viruses. The viruses with these two subtypes circulate seasonally every year, leading to frequent reassortment. The 2009 H1N1 pandemic is a typical example. The virus that caused this pandemic was a triplereassorted strain whose genome is derived from H1N1 avian virus, H1N1 classical swine virus, and H3N2 human seasonal virus. We also noticed the reassortment relationship between H6N2 and H6N6. There were few subtypes related to H6N6 reassortment and more subtypes related to H6N2 reassortment. However, the reassortment association between the two subtypes was still much more than other subtypes. In 2011, the H6N6 avian influenza virus was isolated from swine in Guangdong Province, China (17), suggesting that the H6 subtype avian influenza virus has the potential to infect mammals and even humans. Therefore, the reassortment between H6N2 and H6N6 needs to be further monitored. It should be noted that we found H9N2 played an important role in the reassortment of other subtypes since its proportion was the first or second. Based on this, we analyzed the specific reassortment relationship between H9N2 and other subtypes.

Reassortment Pattern Related to H9N2
H9N2 avian influenza virus spreads worldwide and is widespread in poultry in Asia (18)(19)(20). A recent study has shown that H9N2 had replaced H5N6 and H7N9 as the dominant avian influenza virus subtype in chickens and ducks in China (21). At the same time, there is also evidence that H9N2 contributed to the emergence and evolution of human infection with avian influenza viruses such as H7N9, H10N8, and H5N6 (22)(23)(24). In our study, we found that H9N2 provided internal segments for viruses with other subtypes. Among 61 intersubtype reassortments related to H9N2, 27 events were H9N2 providing internal segments for H7N9. We also found four cases of H9N2 providing internal segments for H10N8 and four  cases of H9N2 providing internal segments for H5N1. H9N2 occasionally provided internal segments for H5N2, H6N2, H6N6, H3N8, and H1N2. In addition, we found that these subtypes also provided internal segments for H9N2. These H9N2 viruses that obtained internal segments from other subtypes returned to the H9N2 gene pool and provided segments for subsequent reassortment. In general, the reassortment pattern related to H9N2 is complicated. It can provide internal segments for other subtypes, and can also obtain internal segments from other subtypes.

DISCUSSION
With the continuous improvement of sequencing technology and reduction of sequencing costs, the number of sequences in the influenza virus database is rapidly increasing. The rapid growth of data drives the continuous advancement of analytical techniques. Although manual and semi-automatic methods are generally considered to produce standard results, they also have scalability and repeatability issues (12), and cannot cope with the challenges brought by data growth. In the automated method of reassortment detection, the phylogenetic method requires multiple sequence alignment and reconstruction of the phylogenetic tree to perform subsequent reassortment detection tasks. De Silva et al. (15) tried to use GiRaF (12) to reconstruct the phylogenetic map, they found that the calculation time was at least several months per segment. Therefore, we inevitably need to adopt methods that have nothing to do with phylogeny. The method proposed by Rabadan et al. (14) seemed to perform well in detecting reassortants within the pedigree, but cannot point out the specific source of reassortment. The method of de Silva et al. (15) found some well-supported candidate reassortants, but it depends on the size of its neighborhood, and different neighborhood sizes may lead to different results. In our work, we proposed a new phylogeny-independent method for automatic detection of reassortment. It can efficiently process large-scale data sets while studying the source of all segments of the influenza A virus genome to identify reassortment.
In our reassortment detection results, although there is no evidence that every reassortment event we detected exists, as a whole, the reassortment history and patterns reflected in our results can be supported by the relevant literature. According to reports, the reassortment between strains belonging to a single subtype may occur more frequently than intersubtype reassortment (25), which can be found in our results ( Figure 5). In our analysis of the reassortment related to the H9N2 subtype, we found a reassortment pattern that H9N2 provided internal segments for other subtypes. This reassortment pattern has been confirmed by many researches (22)(23)(24). In particular, in the avian influenza H7N9 virus that broke out in 2013, all internal gene segments were closely related to the gene segments of the avian influenza H9N2 virus (26). Related research has proposed a dynamic reassortment model of H7N9 virus evolution: H7N9 virus reassorted in the H9N2 gene pool, and was classified into different H9N2 gene pools under the promotion of poultry transportation, and undergoes further reassortment (27). In our early results (Figure 3B), we also found that domestic birds played an important role in the H7N9 virus infecting humans. Subsequent analysis results also showed that H9N2 was the subtype most associated with H7N9, and it frequently provided internal segments for H7N9 by reassortment. At the same time, we found cases where H9N2 provided internal segments for H10N8. Almost all of these H10N8 viruses originated in Jiangxi, China, which coincides with the study by Wang et al. (28). The reassortment pattern that H9N2 provided internal segments for H5N1 can also be supported in the report (29). All in all, our reassortment detection method can perform reassortment detection on large-scale data sets and obtain effective conclusions.
Based on our method, researchers can study the reassortment pattern by detecting the reassortment of the viruses in the data set one by one. Researchers can also explore the source of each segment of the new virus without phylogenetic analysis. This can be achieved by comparing the feature vector of the new virus with the feature vector of the virus in the database. Although our reassortment detection method is applied to the influenza virus data set, this method also has a reference value for the reassortment of other segmented viruses. Our method will help automate the task of detecting influenza virus reassortment.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author/s.

AUTHOR CONTRIBUTIONS
HR, WC, and JY: formulated the study. XG and MH: performed the research and analyzed the data. BW, YJ, HY, and LL: participated in analysis and discussion. XG and HR: drafted the manuscript. All authors contributed to the article and approved the submitted version.