classLog: Logistic regression for the classification of genetic sequences

Zeller, Michael A.; Arendsee, Zebulun W.; Smith, Gavin J.D.; Anderson, Tavis K.

doi:10.3389/fviro.2023.1215012

ORIGINAL RESEARCH article

Front. Virol., 04 December 2023

Sec. Bioinformatic and Predictive Virology

Volume 3 - 2023 | https://doi.org/10.3389/fviro.2023.1215012

This article is part of the Research Topic: Zoonotic Negative-Sense RNA Viruses

Michael A. Zeller¹,²

Zebulun W. Arendsee³

Gavin J.D. Smith¹,⁴

Tavis K. Anderson³*

¹Programme in Emerging Infectious Diseases, Duke - National University of Singapore Medical School, Singapore, Singapore
²Iowa State University Veterinary Diagnostic Laboratory, Iowa State University, Ames, IA, United States
³Virus and Prion Research Unit, National Animal Disease Center, USDA-ARS, Ames, IA, United States
⁴Centre for Outbreak Preparedness, Duke - National University of Singapore Medical School, Singapore, Singapore

Introduction: Sequencing and phylogenetic classification have become common tasks in human and animal diagnostic laboratories. It is routine to sequence pathogens to identify genetic variations of diagnostic significance and to use these data in real-time genomic contact tracing and surveillance. Under this paradigm, unprecedented volumes of data are generated that require rapid analysis to provide meaningful inference.

Methods: We present a machine learning logistic regression pipeline that can assign classifications to genetic sequence data. The pipeline implements an intuitive and customizable approach to developing a trained prediction model that runs in linear time complexity, generating accurate output rapidly, even with incomplete data. Our approach was benchmarked against porcine reproductive and respiratory syndrome virus (PRRSv) and swine H1 influenza A virus (IAV) datasets. Trained classifiers were tested against sequences and simulated datasets in which sequence quality was artificially degraded by 0, 10, 20, 30, and 40%.

Results: When applied to poor-quality sequence data, the classifier achieved >85% to 95% accuracy for the PRRSv and swine H1 IAV HA datasets, and this increased to near-perfect accuracy when using the full dataset. The model also identifies the amino acid positions used to determine genetic clade identity through a feature selection ranking within the model. These positions can be mapped onto a maximum-likelihood phylogenetic tree, allowing for the inference of clade-defining mutations.

Discussion: Our approach is implemented as a python package with code available at https://github.com/flu-crew/classLog.

1 Introduction

Classification of pathogens has become a routine task in modern veterinary diagnostics (1). Classification of the infectious agent is a critical diagnostic step that allows for an informed decision on vaccination regimens and biosecurity measures that may be considered to clear a pathogen outbreak (2–4). Currently genetic classification is performed using phylogenetic methods such as maximum-likelihood and neighbor joining (2, 5, 6). These methods are effective at classifying sequences and inferring relationships between taxa, but the time and skill required to execute and interpret analyses may impact their application in routine high-throughput activities. While diagnosticians are interested in the transmission and history of disease, the most pressing need is to provide a classification of data. Consequently, methods that do not conduct computationally intensive phylogenetic inference for inferring ancestry and genomic epidemiology are required.

Phylogenetic placement (PP) methods are one solution to the problem of accurately assigning lineage designations to taxa. PP places a given taxon onto a reference tree without recomputing the topology, and lineage designations are subsequently inferred based on the proximity to annotated taxa in the tree. PP methods are advantageous in that they can interpolate lineage within a broad context (between species) and a narrow context (specific clades within a subtype). Multiple phylogenetic placement software packages are available, such as the pplacer suite (7), RAPPAS (8), EPA-ng (9), and Nextclade (10). While PP methods are invaluable for research, there is still room for other methodologies that provide fast and accurate lineage assignments without requiring a robust reference tree topology.

Machine learning has been recognized as a viable method for classifying sequences (4, 11). Differing from PP methods, machine learning approaches do not need a reference tree for classification. Genetic divergence over time leads to distinguishable genetic patterns within monophyletic clades that are linearly separable across aligned amino acid positions. This linear separability lends itself well to supervised machine learning methods such as logistic regression and random forest classification. Logistic regression based on aligned sequences is used as the primary means of automated classification for influenza A viruses (IAV) in swine that are processed within the FLUture database (12). Similarly, porcine reproductive and respiratory syndrome virus (PRRSV) amino acid sequence data have been classified to genetic lineage using random forest, k-nearest neighbor, support vector machines, and multilayer perceptron methods (4). Decision tree machine learning approaches have been introduced to classify avian IAV sequences and SARS-CoV-2 sequences successfully at multiple taxonomical levels (13, 14). PangoLEARN, a random forest model, currently supplements the pangolin classification system for SARS-CoV-2 (11). However, despite machine learning appearing to be an effective approach for classification, few of these algorithms are user-friendly with intuitive generalized software that has been publicly released.

This manuscript introduces a general-purpose software application, classLog, that can train sequence classifiers based on user-labelled training data for use in classification of unknown sequences. The method used by the program leverages logistic regression, a parametric method of classification that runs in linear time complexity. Application of classLog provides a routine and robust way to integrate classification into pipelines where speed is necessary and there is no interest in inferring historical context of the sequences. Through decoupling the classification step from the inference of the history of the virus, this manuscript presents a method of classification that is rapid, accurate, and suitable for high-throughput pipelines.

2 Methods

2.1 Curation of swine H1 IAV and PRRSv North America datasets

We compiled two datasets to test the utility of our classification pipeline: porcine reproductive and respiratory syndrome virus (PRRSV) and influenza A virus (IAV) in swine. We restricted the swine IAV to H1 subtype hemagglutinin (HA) genes from the United States collected between 2015 and 2021: these data were curated and annotated by genetic clade by the Influenza Research Database (2, 15). These lineages were delineated based on a rule system applied to a maximum-likelihood phylogeny. Briefly, lineages were designated as statistically supported phylogenetic clusters when they contained more than 10 taxa, had statistical support >70%, and the average pairwise distance between and within clades was >7% and <7%, respectively. Sequences sampled between 2015 and 2019 were used as a training set (n=3510), while 2020 and later sequences were extracted as a test set (n=163) (Figure 1B). For PRRSV sequences, we extracted the curated ORF5 gene sequence data provided by Paploski et al. (3), and extracted and assigned the genetic clade for each sequence from the GenBank accession’s feature information. The genetic lineage delineations for PRRSV were also based on a maximum-likelihood phylogeny, with monophyletic lineages identified as those with strong statistical support and designated using ClusterPicker (16). The dataset was further refined by removing all “Type 1” European sequences and sequences that were not the full coding region, i.e., not equal to 603 or 606 nucleotides in length; the remaining sequences were translated. The final dataset of 3047 annotated sequences was randomly split into training and test sets, using 80% (n=2,483) and 20% (n=609) of the sequences, respectively (Figure 1A).

Figure 1 Number of samples in training and test sets for Porcine Reproductive and Respiratory Syndrome virus (PRRSv) and Swine H1 Influenza A virus (IAV). (A) In the PRRSv dataset, samples were divided into a training set (80%) and a test set (20%). (B) Samples in the IAV dataset per year. Samples collected in 2015-2019 were used as the training set, and samples collected in 2020 were taken as the test set.

The datasets were split differently to simulate two distinct uses of the classifier. IAV data was split temporally to simulate classifying new data, while PRRSV sequences were split randomly to simulate filling in classifications from a mixed set.

2.2 Simulated sequencing errors and removing informative features

Gene sequences retrieved from Sanger, next-generation, and third-generation sequencing methods are not always complete, and there may be ambiguities and gaps in the data (17–19). These errors affect the estimation of the multiple sequence alignment, which may subsequently decrease the accuracy of classification (20, 21). To mimic decreasing sequence quality, a Python script was created to randomly select a number of sequence indices for replacement with an ambiguous ‘X’. Subsequently, the X’s were removed from the sequence to generate incomplete, unaligned sequences. Test set sequences were degraded at 0%, 10%, 20%, 30%, and 40% prior to classification. While more robust simulations of sequence degradation exist (22), the replication and implementation of these methods is beyond the scope of this manuscript.
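As an illustration only, the degradation step described above can be sketched in a few lines of Python. This is not the script used in the study: the function name `degrade_sequence`, the fixed random seed, the rounding-down of the number of masked positions, and the toy protein string are all assumptions made for the example.

```python
import random


def degrade_sequence(seq, fraction, rng):
    """Mask a random fraction of residues with 'X', then delete the X's.

    Assumes the input contains no genuine 'X' characters, since those
    would be deleted as well.
    """
    n_mask = int(len(seq) * fraction)  # assumption: round down
    masked_positions = set(rng.sample(range(len(seq)), n_mask))
    masked = "".join(
        "X" if i in masked_positions else residue
        for i, residue in enumerate(seq)
    )
    # Removing the X's leaves a shorter, unaligned sequence.
    return masked.replace("X", "")


rng = random.Random(0)  # fixed seed so the sketch is reproducible
protein = "MKAILVVLLYTFATANADTLCIGYHANNSTDTVDTVLEKNVTVTHSVNLLEDKHNGKLCKL"  # toy input
degraded = {
    pct: degrade_sequence(protein, pct / 100, rng)
    for pct in (0, 10, 20, 30, 40)
}
```

Because the masked residues are deleted rather than left in place, each degraded sequence must be realigned before classification, which is why degradation affects alignment as well as content.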

2.3 Constructing classLog: the general sequence classifier

Sequence classification was implemented as a one-versus-rest logistic regression classifier, with a general outline provided (Supplemental Figure 1). Input for classification requires an aligned nucleotide or amino acid FASTA file, with definition lines specifying the classification classes using character delimiters, e.g., A/swine/Iowa/A02636475/2022|1B.2.2.1, where ‘|’ delimits the phylogenetic clade from the strain name. The binary features of this model are the presence or absence of an amino acid at a specific position within the alignment. An optional feature selection process, which selects the most relevant sequence positions for classification, was implemented using a tree classifier that ranks binary features by GINI importance so that the user may restrict the prediction model to the most important features (23). To facilitate the reusability of the classification scheme, the first sequence, feature labels, trained model, and class names are exported using a standard python pickle file format. The first sequence in the pickle file is used for pairwise alignment of unknown sequences to ensure there is consistency between query sequence alignment positions and the model feature positions. During prediction, a matrix of the presence or absence of nucleotides or amino acids at specific alignment positions is created, which is then fed to the model for prediction. For user submitted query sequences, the predicted classification is assigned and reported using classification names derived from the user-annotated classification fasta file used in training.
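To make the encoding and the one-versus-rest scheme concrete, the following is a minimal, standard-library-only sketch. It is not the classLog source code: the function names, the toy four-residue alignment, the clade labels, and the use of plain stochastic gradient descent (rather than a library solver) are assumptions for illustration, and the optional tree-based feature selection and pickle export are omitted.

```python
import math


def parse_defline(defline, delimiter="|"):
    """Split 'name|class' into (name, class); the class is the last field."""
    fields = defline.lstrip(">").strip().split(delimiter)
    return fields[0], fields[-1]


def build_feature_labels(aligned_seqs):
    """One binary feature per (alignment position, residue) seen in training."""
    return sorted({(pos, res) for seq in aligned_seqs for pos, res in enumerate(seq)})


def encode(aligned_seq, feature_labels):
    """1 if the residue is present at that alignment position, otherwise 0."""
    return [
        1 if pos < len(aligned_seq) and aligned_seq[pos] == res else 0
        for pos, res in feature_labels
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_one_vs_rest(X, y, epochs=300, lr=0.5):
    """Fit one binary logistic regression per class by stochastic gradient descent."""
    models = {}
    for cls in sorted(set(y)):
        w, b = [0.0] * len(X[0]), 0.0
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                p = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
                err = p - (1.0 if yi == cls else 0.0)
                w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
                b -= lr * err
        models[cls] = (w, b)
    return models


def predict(x, models):
    """Return the class whose binary model gives the highest score."""
    scores = {
        cls: sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))
        for cls, (w, b) in models.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy, made-up training alignment; definition lines follow the 'name|class' pattern.
training = [
    (">toy1|1A", "MKTE"),
    (">toy2|1A", "MRTE"),
    (">toy3|1B", "MKSN"),
    (">toy4|1B", "MRSN"),
]
labels = [parse_defline(defline)[1] for defline, _ in training]
feature_labels = build_feature_labels([seq for _, seq in training])
X = [encode(seq, feature_labels) for _, seq in training]
models = train_one_vs_rest(X, labels)

predicted_class, score = predict(encode("MKTE", feature_labels), models)
```

In this toy alignment the two classes differ at the last two positions, so the binary features at those positions separate them linearly, which is the property the classifier relies on.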

A prediction threshold option was included within the classifier to provide support for predicted classes on unknown data. Classifications with a score less than the threshold are rejected, and classified into an ‘unknown’ category (default value of 85%). The threshold criteria can have a direct effect on the performance of classification.
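The rejection rule can be illustrated with a short sketch. The function name `assign_with_threshold` and the example scores are hypothetical; only the rule itself (scores below the threshold are reported as ‘unknown’, with a default of 85%) comes from the description above.

```python
def assign_with_threshold(class_scores, threshold=0.85):
    """Return (label, score); scores below the threshold become 'unknown'."""
    best = max(class_scores, key=class_scores.get)
    best_score = class_scores[best]
    if best_score < threshold:
        return "unknown", best_score
    return best, best_score


# Hypothetical scores for two query sequences.
confident = assign_with_threshold({"L1A": 0.97, "L1B": 0.02, "L1C": 0.01})
uncertain = assign_with_threshold({"L1A": 0.60, "L1B": 0.35, "L1C": 0.05})
```

Raising the threshold trades wrong assignments for ‘unknown’ calls, which is why it directly affects the measured performance.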

For validation, the general classifier was trained using 100%, 20%, 10%, 5%, 1%, and 0.5% of the available features within the H1 IAV and the PRRSv training datasets. For the H1 IAV sequence dataset, this resulted in 2439, 487, 243, 121, 24, and 12 features, respectively. For the PRRSV dataset, this resulted in 686, 137, 68, 34, 6, and 3 features. Each classifier was used to classify the 0%, 10%, 20%, 30%, and 40% test set sequences that had been generated to reflect sequencing errors and misalignment.
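The feature counts listed above are consistent with taking each percentage of the total and rounding down. That rounding rule is inferred from the reported numbers rather than stated in the text, and the helper name below is hypothetical.

```python
import math

FRACTIONS_PERCENT = (100, 20, 10, 5, 1, 0.5)


def feature_counts(total_features):
    """Number of retained features at each percentage, rounded down (assumed rule)."""
    return [math.floor(total_features * pct / 100) for pct in FRACTIONS_PERCENT]


iav_counts = feature_counts(2439)   # H1 IAV training set
prrsv_counts = feature_counts(686)  # PRRSV training set
```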

2.4 Simplifying feature identification in query sequences using a Needleman-Wunsch pairwise alignment algorithm

An intrinsic challenge to the implementation of the machine learning classification process was correctly assigning the positions to new genetic sequences. To overcome this challenge without keeping the original alignment, a heuristic was applied such that the first sequence from the training set was saved and stored, and subsequent classification attempts would be pairwise aligned to recover the positions. To increase the speed and keep calculations within a tractable time for computation, a Needleman-Wunsch dynamic programming alignment algorithm (24) with affine gap penalties and a BLOSUM90 substitution matrix (25) was implemented in C++ and exported as a python library using python bindings.
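The idea of recovering alignment positions from a stored reference sequence can be sketched as follows. This is a deliberately simplified stand-in for the implementation described above: it is written in pure Python rather than C++, uses a linear gap penalty instead of affine gap penalties, and scores identity (match/mismatch) instead of BLOSUM90. The function names and scoring values are assumptions.

```python
def needleman_wunsch(ref, query, match=2, mismatch=-1, gap=-2):
    """Global pairwise alignment; returns the two gapped strings."""
    n, m = len(ref), len(query)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if ref[i - 1] == query[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align the two residues
                score[i - 1][j] + gap,      # gap in the query
                score[i][j - 1] + gap,      # gap in the reference
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_ref, aligned_query = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and ref[i - 1] == query[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            aligned_ref.append(ref[i - 1])
            aligned_query.append(query[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_ref.append(ref[i - 1])
            aligned_query.append("-")
            i -= 1
        else:
            aligned_ref.append("-")
            aligned_query.append(query[j - 1])
            j -= 1
    return "".join(reversed(aligned_ref)), "".join(reversed(aligned_query))


def project_onto_reference(ref, query):
    """Express the query in reference coordinates (one character per reference position)."""
    aligned_ref, aligned_query = needleman_wunsch(ref, query)
    # Columns where the reference has a gap are insertions in the query and are dropped.
    return "".join(q for r, q in zip(aligned_ref, aligned_query) if r != "-")


reference = "MKAILVVLLY"  # stands in for the stored first training sequence
projected = project_onto_reference(reference, "MKAVVLLY")  # query lacking two residues
```

The projected string always has the same length as the reference, so each of its characters can be compared against the model's position-specific features.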

2.5 Measuring the performance of the classifier on swine H1 IAV and PRRSv North America dataset classification

The performance of the classifiers was measured using accuracy, macro precision, macro recall, and macro F1 (26–28). Given a confusion matrix M in which the true class is assigned along the y-axis (rows) and the predicted class along the x-axis (columns), so that entry M_ij counts the sequences of true class i predicted as class j, and with n denoting the number of classes, the precision and recall equations can be generalized as follows:

\begin{array}{ll} \mathrm{Precision}_{i} = \frac{M_{ii}}{\sum_{j} M_{ji}} & (1) \end{array}

\begin{array}{ll} \mathrm{Recall}_{i} = \frac{M_{ii}}{\sum_{j} M_{ij}} & (2) \end{array}

\begin{array}{ll} \mathrm{F1}_{i} = 2 \, \frac{\mathrm{Precision}_{i} \times \mathrm{Recall}_{i}}{\mathrm{Precision}_{i} + \mathrm{Recall}_{i}} & (3) \end{array}

\begin{array}{ll} \mathrm{Precision}_{macro} = \frac{1}{n} \sum_{i} \mathrm{Precision}_{i} & (4) \end{array}

\begin{array}{ll} \mathrm{Recall}_{macro} = \frac{1}{n} \sum_{i} \mathrm{Recall}_{i} & (5) \end{array}

\begin{array}{ll} \mathrm{F1}_{macro} = \frac{1}{n} \sum_{i} \mathrm{F1}_{i} & (6) \end{array}

These metrics were taken for each classifier applied to the 0%, 10%, 20%, 30%, and 40% test set sequences with the results plotted using ggplot2 (29) in R v3.959 (30).
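As a worked example of Equations 1 to 6, the sketch below computes the macro metrics from a small confusion matrix with rows as true classes and columns as predicted classes. The function name, the example matrix, and the convention of returning 0 when a denominator is zero are assumptions made for illustration.

```python
def macro_metrics(M):
    """Accuracy and macro precision/recall/F1 from a square confusion matrix.

    M[i][j] counts samples of true class i predicted as class j.
    """
    n = len(M)
    precisions, recalls, f1s = [], [], []
    for i in range(n):
        predicted_as_i = sum(M[j][i] for j in range(n))  # column sum
        truly_i = sum(M[i])                              # row sum
        precision = M[i][i] / predicted_as_i if predicted_as_i else 0.0
        recall = M[i][i] / truly_i if truly_i else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    total = sum(sum(row) for row in M)
    return {
        "accuracy": sum(M[i][i] for i in range(n)) / total,
        "precision_macro": sum(precisions) / n,
        "recall_macro": sum(recalls) / n,
        "f1_macro": sum(f1s) / n,
    }


# Made-up two-class example: 2 of 10 class-0 samples are predicted as class 1.
metrics = macro_metrics([[8, 2], [0, 10]])
```

Here class 0 has precision 8/8 and recall 8/10, while class 1 has precision 10/12 and recall 10/10, giving a macro precision of about 0.917 and a macro recall of 0.9.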

Runtime performance was benchmarked using the Linux `/usr/bin/time` program provided with Ubuntu v20.04 LTS running within the Windows Subsystem for Linux v2 (Supplemental Table 1). A second, non-comparable benchmark that used existing phylogenetic placement approaches was run using pplacer and RAPPAS with the same test sets described above (Supplemental Table 2). The reference trees provided to the phylogenetic placement programs were paraphyletically pruned to 200 taxa using smot v1.0.0 (31) to more realistically simulate a phylogenetic placement scenario. The accuracy of the PP methods was not tested, as sufficient validation has been given in the originating and subsequent publications (7, 8).

2.6 Visualization of swine H1 IAV and PRRSv North America dataset using ordination and phylogenetic analysis

Sequences from both datasets were aligned using MAFFT v7.487 (32). The pairwise number of differences between sequences was extracted from the alignment using Geneious Prime 2022 (33). These distances were ordinated into two-dimensional space using metric multidimensional scaling. Each ordination was colored first by the designated genetic clade, and then by a genetic motif consisting of the amino acids at the top two ranking amino acid positions. Amino acid position rank was calculated as the sum of GINI importance given by the extra tree classifier for each amino acid position, i.e., the top two positions are those most important in determining the classification of the query sequence.
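The position ranking and motif construction can be sketched as below. The importance values are invented for the example (they are not output from the study), positions are 0-based, and the function names are hypothetical; only the rule of summing per-feature importance over each alignment position follows the description above.

```python
from collections import defaultdict


def rank_positions(feature_importance):
    """Rank alignment positions by the summed importance of their residue features.

    feature_importance maps (position, residue) -> importance.
    """
    totals = defaultdict(float)
    for (position, _residue), importance in feature_importance.items():
        totals[position] += importance
    # Highest total first; ties broken by position for a stable order.
    return sorted(totals, key=lambda pos: (-totals[pos], pos))


def motif(aligned_seq, positions):
    """Concatenate the residues found at the chosen positions."""
    return "".join(aligned_seq[pos] for pos in positions)


# Invented importances for a five-column toy alignment.
importance = {
    (0, "M"): 0.00,
    (1, "K"): 0.05,
    (2, "E"): 0.20, (2, "D"): 0.15,
    (4, "E"): 0.10, (4, "N"): 0.12, (4, "G"): 0.08,
}
ranking = rank_positions(importance)
top_two = sorted(ranking[:2])  # report the motif in sequence order
motifs = [motif(seq, top_two) for seq in ("MKEAE", "MKEAN", "MKDAG")]
```

Coloring each sequence by such a two-residue motif is what allows the top-ranked positions to be compared visually against the assigned clades.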

To identify the biological basis of the H1 swine IAV and PRRSv classifications, maximum likelihood trees were inferred for each dataset. Sequences were aligned using MAFFT v7.487 (32), and trees were inferred using IQ-TREE v1.6.12 (34). The PRRSv dataset was analyzed using a BLOSUM62 amino acid substitution model, while the IAV dataset was analyzed using the FLU amino acid substitution model (35). Statistical support was determined using the rapid bootstrap algorithm with 1,000 bootstraps, and the support was displayed on the branch of the resultant trees. Each tree was colored along the backbone by the phylogenetic clade, while the tips were annotated and colored by the top two ranking amino acid positions determined using GINI importance.

3 Results

3.1 classLog performance on H1 swine IAV and PRRSv observed and simulated data

A classLog classifier was trained on PRRSv ORF5 sequences collected and classified to lineage (3), dividing the dataset into 80% training and 20% testing. The classifier achieved perfect accuracy when trained with 10% of the total features (n=68) and no sequence degradation (Figure 2A). At 10% sequence degradation (20 aa), 10% of the features were able to achieve an accuracy of 97%. At 20% sequence degradation (40 aa), 10% of the features were sufficient to achieve 88% accuracy, and increasing the number of features did not improve accuracy. Accuracy rapidly decreased at 30% sequence degradation (60 aa), with 10% of the features achieving 69% correct classifications. At 40% sequence degradation (80 aa), the greatest accuracy achieved was 42%.

Figure 2 Measures of logistic regression classifier performance in the metrics of accuracy, precision, recall, and F1 scoring. (A) Porcine Reproductive and Respiratory Syndrome virus and (B) Swine H1 Influenza A virus datasets. Each metric was measured over simulated sequence degradation of 0%, 10%, 20%, 30% and 40%, as well as with classifiers using 0.5%, 1%, 5%, 10%, 20%, and 100% of the available features for classification.

A classLog classifier was trained on H1 swine sequences present in IRD collected between 2015 and 2019 and was tested on 136 test sequences from 2020. The classifier achieved perfect accuracy when trained with as few as 12 features (0.5%) when there was no sequence degradation (Figure 2B). At 10% sequence degradation (56 aa), 5% of the features (121 features) were needed to achieve perfect accuracy. At 20% sequence degradation (112 aa), 10% of the features (243 features) were sufficient to achieve perfect accuracy. At 30% sequence degradation (170 aa), 10% of the features were sufficient to achieve 93% correct classifications, although 20% of the features (487 features) achieved only 82% correct classification. At 40% sequence degradation (227 aa), there was a steep decline in accuracy, which fell below 60% regardless of the number of features used.

For both datasets, precision was consistently higher than recall (Figure 2). This is a consequence of rejecting classifications below the 85% scoring threshold and classifying them as ‘unknown,’ i.e., the number of false positives decreased while the number of false negatives increased.
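A small numerical illustration shows why rejecting low-scoring calls pushes precision above recall. The eight predictions below are invented for the example, and the function name is hypothetical; ‘unknown’ is treated as a predicted label that never counts as a true class.

```python
def macro_precision_recall(samples, classes, threshold=None):
    """samples: (true_class, predicted_class, score) triples."""
    precisions, recalls = [], []
    for cls in classes:
        tp = fp = fn = 0
        for true_cls, pred_cls, score in samples:
            if threshold is not None and score < threshold:
                pred_cls = "unknown"  # rejected call
            if pred_cls == cls and true_cls == cls:
                tp += 1
            elif pred_cls == cls:
                fp += 1
            elif true_cls == cls:
                fn += 1
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return sum(precisions) / len(classes), sum(recalls) / len(classes)


# Invented predictions: the two wrong calls both have low scores.
samples = [
    ("A", "A", 0.95), ("A", "A", 0.90), ("A", "A", 0.70), ("A", "B", 0.60),
    ("B", "B", 0.92), ("B", "B", 0.88), ("B", "A", 0.65), ("B", "B", 0.80),
]
no_threshold = macro_precision_recall(samples, ["A", "B"])
with_threshold = macro_precision_recall(samples, ["A", "B"], threshold=0.85)
```

Without a threshold, precision and recall are both 0.75. With the 0.85 threshold, the two wrong calls and two low-scoring correct calls are all reported as ‘unknown’: no false positives remain, so precision rises to 1.0, while the rejected sequences count as false negatives and recall drops to 0.5.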

3.2 Using classLog to identify genetic features of biological relevance

The pairwise differences between the test set sequences were used to ordinate points in two-dimensional space (Figure 3). The ordinations of both the PRRSv ORF5 and swine H1 IAV datasets were colored by their original designated clades, and by the motif formed by the amino acids present at the top two features ranked by GINI importance (Supplemental Figures 2, 3). This manuscript uses only the top two features because the number of amino acid combinations for more than two positions exceeds the number of distinct colors available in the palette; however, lower ranked features remain important for discriminating between phylogenetic clades. Qualitatively, the ordination demonstrated separation between distant genetic lineages such as the H1 1A classical swine lineage versus the H1 1B human seasonal lineage (Figure 3C; 2). However, sequences from some closely related genetic clades within the same lineage appeared to overlap when assessed in a two-dimensional ordination. Within the PRRSv data (Figure 3B), the top two ranked amino acid positions (170, 172) corresponded well with the classified genetic clades, suggesting that these positions may be clade-defining mutations. For example, L1A has primarily the EE motif, L1B has EN, and L1C has DG. These divisions were not exclusive, as L5, L8, and L9 also carried the EE motif that predominated in L1A, and more features may need to be accounted for to discriminate between these clades. The top two positions of the swine H1 IAV dataset were 159 and 158 (H1 numbering, 17 aa signal peptide removed) (Figure 3D), with a relatively high number of amino acid polymorphisms at those two positions. While some clades were well matched to one or two motifs, some clades such as 1A alpha were highly varied in the motifs they carried, suggesting that other feature positions with a lower rank may better segregate this clade from the other clades. These data can be generated by extracting the features and their rankings using the classLog algorithm.

Figure 3 Metric multidimensional scaling in two dimensions of the number of pairwise differences between sequences of Porcine Reproductive and Respiratory Syndrome virus (PRRSv) ORF5 protein (A, B) and Swine H1 Influenza A virus (IAV) datasets (C, D). Plots were colored by genetic clade (A, C), and by the motif formed by the top two important positions inferred by decision tree (B, D). For PRRSv the top ranking features were positions 170 and 172. For IAV, the top ranking features were positions 159 and 158.

3.3 Congruency between phylogenetic classification, classLog predictions, and model features

Maximum-likelihood trees were inferred for the PRRSv ORF5 and swine H1 IAV HA test datasets. The backbones of the phylogenetic trees were colored by the assigned genetic lineage, while the tips were labeled and colored by the motif formed by the two amino acid positions that had the highest cumulative GINI importance. For the PRRSv ORF5 dataset (Figure 4A: positions 170 and 172), the majority of L1B sequences carried the EN motif and the majority of L1C sequences carried DG. L1A, L5, L8, and L9 were all represented by EE at 170 and 172, suggesting that the good concordance between the inferred phylogeny and the classLog predicted clade for these lineages was driven by features outside of these two positions.

Figure 4 Phylogenetic maximum likelihood trees. (A) Porcine Reproductive and Respiratory Syndrome virus (PRRSv) ORF5 protein and (B) Swine H1 Influenza A virus (IAV) test sequences, inferred by IQ-TREE v1.6.12 with 1,000 fast bootstraps. Tree backbones are colored by the prior assigned genetic lineage, while tip labels are colored by the motif formed by the top two ranking positions inferred by decision tree: positions 170 and 172 for PRRSv, and positions 159 and 158 for IAV. Bootstrap support is annotated on the branches.

For the swine H1 HA dataset (Figure 4B), the two most important features identified by classLog were positions 159 and 158. The majority of the 1B delta1a clade was represented by GK, the 1B delta2 clade by SN, and 1A pandemic09 by KA. Three distinct motifs were identified within the 1A gamma clade, KT, NT, and ST, with RT interspersed. The threonine at position 158 (158T) was distinct enough to serve as a general rule for separating the 1A gamma clade from the other clades. The remaining major H1 clade, 1A alpha, was associated with a substantial amount of motif diversity, exhibiting GK, GR, KA, SA, SK and RR. This high motif diversity suggests that another set of features may be used by the classifier for identifying this clade.

4 Discussion

Applications of machine learning present computationally efficient ways of classifying genetic sequences without relying on traditional phylogenetic methods. The direct utility of machine learning methods is in high-throughput diagnostic processes, where the primary objective is to assign classification and there is not an immediate interest in inferring the evolutionary history of the sequence in question. By decoupling the classification process from phylogenetic method, complexity and computational time are reduced. Machine learning methods have the additional benefit of being highly portable and reproducible with minimal effort once an initial prediction model is trained. Our command line interface, classLog, represents a user-friendly and validated tool that can ingest annotated genetic sequences, train a classification model, and generate predictions and associated confidence scores without extensive computational and machine learning training.

Logistic regression was chosen to ensure scalability with linear time complexity, fast computational runtime, and for simple model interpretability. These factors allow classLog to function as a lightweight component in classification pipelines. Moreover, many genetic lineage classification schemes frequently depend on phylogenetic relationships to delineate lineages, which is effectively a form of clustering by similarity. Consequently, linear separability emerges when significant genetic divergence exists between designated lineages. Although other machine learning methods such as neural networks can learn complex relationships and patterns, within the narrow context of lineage classification, logistic regression is generally sufficient. Taxonomic classification of virus sequence data is typically performed via either phylogenetic methods or through similarity-based approaches such as BLAST. Phylogenetic methods can be computationally complex: simple techniques such as neighbor joining have a cubic time complexity, but more statistically robust techniques have a higher range of complexity and runtime. BLAST overcomes these complexity issues, but there is a necessity for a curated database of sequences, and large databases can be difficult to update and share. In general, machine learning models can overcome both limitations as they offer both reasonable time complexity and space complexity for classification; and if an adequate dataset is used to generalize a model during training, the subsequent model may be reused without maintaining or training input reference datasets. In recognition of these strengths, machine learning approaches are being used (11, 14), but a generalized application has not yet been created.

classLog can be applied for rapid classification of genomic data either on site or in field settings. The advent of rapid and portable sequencing such as MinION Nanopore technologies has resulted in the generation of thousands of sequences with a critical need to identify what they are, and whether the sample represents an “unknown.” The classLog program can be easily adopted as part of a lightweight pipeline that performs classification on the fly in the field (36). The execution of classLog does not require significant computational resources, and our testing was conducted on regular Windows and macOS laptops. Consequently, it can easily be integrated within mobile diagnostic stations operating in remote locations that may have minimal access to extensive computational resources or trained personnel (37, 38).

A consequence of field genomic epidemiology and the integration of Nanopore technology has been an increase in sequence error rate relative to traditional Sanger sequencing (39, 40). Our testing with classLog on simulated datasets, where we introduced sequence errors, suggested that these inaccuracies do not dramatically reduce the accuracy of classification using this machine learning method. It was noted that classification failure within the H1 sequence dataset was related to the number of samples present in the training dataset. As the sequence errors increased, misclassifications began to occur first in the sequences that had the least clade representation in the training set. It is likely that when more samples are present in the training data to represent a specific clade, the prediction model is better able to generalize the clade. This indicates one potential drawback of classLog: user-curated training datasets must remain large enough for optimal classifier performance. classLog performs within the narrow context of classification, assigning clades within a species, although it can quickly segregate unrelated sequences by specifying them as “unknown”. An alternative to generating large, curated datasets when attempting to classify multiple species could be the application of phylogenetic placement algorithms or the use of advanced machine learning models beyond logistic regression. Logistic regression is a parametric model that performs well on linearly separable classifications. In cases where the data are not linearly separable and training data are limited, non-parametric models such as random forests or neural networks may perform significantly better and may provide easy-to-understand biological context for feature rankings (41, 42), but they require more computational time and effort.

Benchmarks of classLog runtime demonstrate that the combined training and classification time is fast, with each test case taking under a minute in total (Supplementary Table 1). While the conditions of the test are not directly comparable to the testing of pplacer and RAPPAS, the total runtime of classLog was less than that of both PP methods when finding a solution to the same classification problem. It is notable that once the RAPPAS database is built, the placement of taxa onto a tree is very rapid, although the memory usage is higher. However, the use-cases of classLog and PP methods differ: classLog is designed specifically to assign lineages within a narrow scope of genetic diversity within a single species. In comparison, both pplacer and RAPPAS can function with multiple species and additionally infer topology. This difference in use-cases makes the comparison valid only for the subset of problems where the tools overlap.

classLog is a method of creating lightweight classifiers that can assign taxonomic classifications rapidly with minimal user curation and training. The implementation of this classification methodology can benefit diagnostic labs by saving the computational run time associated with current phylogenetic classification approaches, and it can be easily customized to work for different pathogens. An additional benefit is the identification of critical genetic features associated with clade classifications: these features are likely clade-defining mutations and can be used to form hypotheses to investigate the gene-to-phenotype link (43–45) and in other functional studies. A further benefit of machine learning approaches is that the results are more directly interpretable, as they are given as an assignment rather than needing to be inferred from a tree. Together, these benefits offer a more streamlined approach to taxonomic assignment in a diagnostic setting.

Data availability statement

Publicly available datasets were analyzed in this study. This data can be found here: https://github.com/flu-crew/classLog.

Author contributions

MZ conceived the study. MZ, ZA, and TKA programmed the code. MZ, GS, and TKA conceptualized the framework for testing. All authors tested the program. MZ and TKA wrote the manuscript. All authors contributed to the article and approved the submitted version.

Funding

The author(s) declare financial support was received for the research, authorship, and/or publication of this article. This work was supported in part by: the U.S. Department of Agriculture (USDA) Agricultural Research Service [ARS project number 5030-32000-231-000-D]; the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services [contract numbers 75N93021C00015 and 75N93021C00016]; the USDA Agricultural Research Service Research Participation Program of the Oak Ridge Institute for Science and Education (ORISE) through an interagency agreement between the U.S. Department of Energy (DOE) and USDA Agricultural Research Service [contract number DE-AC05-06OR23100]; the SCINet project of the USDA Agricultural Research Service [ARS project number 0500-00093-001-00-D]; and the Duke-NUS Signature Research Program funded by the Ministry of Health, Singapore. The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication. Mention of trade names or commercial products in this article is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the USDA, DOE, or ORISE. USDA is an equal opportunity provider and employer.

Acknowledgments

We gratefully acknowledge pork producers, swine veterinarians, and laboratories for participating in the USDA Influenza A Virus in Swine Surveillance System and publicly sharing sequences.

Conflict of interest

The author TKA declared that they were an editorial board member of Frontiers at the time of submission. This had no impact on the peer review process and the final decision.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fviro.2023.1215012/full#supplementary-material

References

1. Shi M, Lam TT-Y, Hon C-C, Hui RK-H, Faaberg KS, Wennblom T, et al. Molecular epidemiology of PRRSV: a phylogenetic perspective. Virus Res (2010) 154:7–17. doi: 10.1016/j.virusres.2010.08.014

2. Anderson TK, Macken CA, Lewis NS, Scheuermann RH, Van Reeth K, Brown IH, et al. A phylogeny-based global nomenclature system and automated annotation tool for H1 hemagglutinin genes from swine influenza A viruses. mSphere (2016) 1:e00275-16. doi: 10.1128/mSphere.00275-16

3. Paploski I, Corzo C, Rovira A, Murtaugh MP, Sanhueza JM, Vilalta C, et al. Temporal dynamics of co-circulating lineages of porcine reproductive and respiratory syndrome virus. Front Microbiol (2019) 10:2486. doi: 10.3389/fmicb.2019.02486

4. Kim J, Lee K, Rupasinghe R, Rezaei S, Martínez-López B, Liu X. Applications of machine learning for the classification of porcine reproductive and respiratory syndrome virus sublineages using amino acid scores of ORF5 gene. Front Veterinary Sci (2021) 813. doi: 10.3389/fvets.2021.683134