<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article article-type="research-article" dtd-version="2.3" xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Genet.</journal-id>
<journal-title>Frontiers in Genetics</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Genet.</abbrev-journal-title>
<issn pub-type="epub">1664-8021</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">1029185</article-id>
<article-id pub-id-type="doi">10.3389/fgene.2023.1029185</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Genetics</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Classification of group A rotavirus VP7 and VP4 genotypes using random forest</article-title>
<alt-title alt-title-type="left-running-head">Tran et al.</alt-title>
<alt-title alt-title-type="right-running-head">
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3389/fgene.2023.1029185">10.3389/fgene.2023.1029185</ext-link>
</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Tran</surname>
<given-names>Hoc</given-names>
</name>
<xref ref-type="corresp" rid="c001">&#x2a;</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1765041/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Friendship</surname>
<given-names>Robert</given-names>
</name>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Poljak</surname>
<given-names>Zvonimir</given-names>
</name>
</contrib>
</contrib-group>
<aff>Department of Population Medicine, <institution>Ontario Veterinary College</institution>, University of Guelph, <addr-line>Guelph</addr-line>, <addr-line>ON</addr-line>, <country>Canada</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>
<bold>Edited by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/23877/overview">Richard D. Emes</ext-link>, Nottingham Trent University, United Kingdom</p>
</fn>
<fn fn-type="edited-by">
<p>
<bold>Reviewed by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/399071/overview">Minakshi - Prasad</ext-link>, Lala Lajpat Rai University of Veterinary and Animal Sciences, India</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1305986/overview">Ariful Islam</ext-link>, EcoHealth Alliance, United States</p>
</fn>
<corresp id="c001">&#x2a;Correspondence: Hoc Tran, <email>hoctran10@outlook.com</email>
</corresp>
</author-notes>
<pub-date pub-type="epub">
<day>30</day>
<month>05</month>
<year>2023</year>
</pub-date>
<pub-date pub-type="collection">
<year>2023</year>
</pub-date>
<volume>14</volume>
<elocation-id>1029185</elocation-id>
<history>
<date date-type="received">
<day>08</day>
<month>09</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>15</day>
<month>05</month>
<year>2023</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2023 Tran, Friendship and Poljak.</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Tran, Friendship and Poljak</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p>
</license>
</permissions>
<abstract>
<p>
<bold>Introduction:</bold> Group A rotaviruses are a major cause of severe diarrhea in young children and in neonates of many animal species worldwide, and group A rotavirus sequence data are becoming increasingly available. Several methods exist for rotavirus genotyping, but machine learning methods have yet to be explored. Using machine learning algorithms such as random forest alongside alignment-based methodology may allow efficient and accurate classification of circulating rotavirus genotypes under the dual classification system.</p>
<p>
<bold>Methods:</bold> Random forest models were trained on positional features obtained from pairwise and multiple sequence alignment and were cross-validated using either 10-fold cross-validation repeated three times or leave-one-out cross-validation. Models were then validated on unseen data from the testing datasets to assess real-world performance.</p>
<p>
<bold>Results:</bold> All models performed strongly in classification of VP7 and VP4 genotypes, with high overall accuracy and <italic>kappa</italic> values during model training (0.975&#x2013;0.992, 0.970&#x2013;0.989) and during model testing (0.972&#x2013;0.996, 0.969&#x2013;0.996), respectively. Models trained on multiple sequence alignment generally had slightly higher overall accuracy and <italic>kappa</italic> values than models trained on pairwise sequence alignment. In contrast, pairwise sequence alignment models were generally faster than multiple sequence alignment models when models did not need to be retrained. Models that used 10-fold cross-validation repeated three times were also much faster than models that used leave-one-out cross-validation, with no noticeable difference in overall accuracy and <italic>kappa</italic> values between the cross-validation methods.</p>
<p>
<bold>Discussion:</bold> Overall, random forest models showed strong performance in the classification of both group A rotavirus VP7 and VP4 genotypes. Application of these models as classifiers will allow for rapid and accurate classification of the increasing amounts of rotavirus sequence data that are becoming available.</p>
</abstract>
<kwd-group>
<kwd>rotavirus</kwd>
<kwd>classification</kwd>
<kwd>randomForest</kwd>
<kwd>alignment</kwd>
<kwd>machine learning</kwd>
</kwd-group>
<contract-sponsor id="cn001">Ontario Ministry of Agriculture, Food and Rural Affairs<named-content content-type="fundref-id">10.13039/501100000094</named-content>
</contract-sponsor>
<contract-sponsor id="cn002">Natural Sciences and Engineering Research Council of Canada<named-content content-type="fundref-id">10.13039/501100000038</named-content>
</contract-sponsor>
<custom-meta-wrap>
<custom-meta>
<meta-name>section-at-acceptance</meta-name>
<meta-value>Computational Genomics</meta-value>
</custom-meta>
</custom-meta-wrap>
</article-meta>
</front>
<body>
<sec id="s1">
<title>1 Introduction</title>
<p>Group A rotaviruses are among the most common causes of acute gastroenteritis in both young children and animals across the globe. Nearly all young children are expected to be infected with rotavirus within their first 5&#xa0;years of life, and rotavirus infection contributes to over 215,000 deaths annually worldwide (<xref ref-type="bibr" rid="B17">Lanzieri et al., 2011</xref>; <xref ref-type="bibr" rid="B39">Tate et al., 2016</xref>). Several vaccines have been developed to prevent rotavirus infections in children, but vaccine efficacy has been shown to vary greatly in regions such as South Africa and Bangladesh. This can be attributed to large genotypic variation within circulating rotavirus strains consisting of VP7 genotypes G1-G36 and VP4 genotypes P[1]-P[51] across the globe (<xref ref-type="bibr" rid="B23">Madhi et al., 2010</xref>; <xref ref-type="bibr" rid="B44">Zaman et al., 2010</xref>; <xref ref-type="bibr" rid="B10">Harris et al., 2017</xref>; <xref ref-type="bibr" rid="B4">Burke et al., 2019</xref>; <xref ref-type="bibr" rid="B37">Rotavirus Classification Working Group: RCWG, 2021</xref>). Global surveillance of rotavirus genotypes is therefore critical to monitor and evaluate emerging and circulating genotypes before and after vaccine introduction. This will in turn allow more targeted development of vaccines and allow them to be updated as needed. Although not monitored with the same intensity, rotaviruses are important pathogens of animals as well. Young cattle, horses, poultry, and pigs are also commonly infected by rotaviruses, contributing to economic burdens arising from weight loss, mortality, and cost of treatment for infected animals (<xref ref-type="bibr" rid="B22">Luchs and Timenetsky, 2016</xref>).</p>
<p>Rotaviruses are double-stranded RNA viruses classified into the <italic>Reoviridae</italic> family and can be further classified into nine antigenically unique groups (<xref ref-type="bibr" rid="B42">Walker et al., 2019</xref>). Of these groups, group A rotaviruses are of primary interest due to their high frequency of infection within avian and mammalian species (<xref ref-type="bibr" rid="B24">Maes et al., 2009</xref>). Rotaviruses are composed of a total of 11 double-stranded RNA segments, which encode six (VP1-VP4, VP6-VP7) structural proteins and six (NSP1-NSP4, NSP5/6) non-structural proteins (<xref ref-type="bibr" rid="B29">M&#xfc;ller and Johne, 2007</xref>). A dual classification system using the nomenclature GxP[x] (where x is the respective genotype number) has been established for the 36&#xa0;G and 51&#xa0;P genotypes, based on the two outer capsid proteins VP7 and VP4, respectively (<xref ref-type="bibr" rid="B37">Rotavirus Classification Working Group: RCWG, 2021</xref>). Several alignment-based methods have been used to classify rotavirus nucleotide sequence data into their respective genotypes, such as the RotaC web-based tool, the Basic Local Alignment Search Tool (BLAST), and the Virus Pathogen Database and Analysis Resource (VIPR) Rotavirus A Genotype tool. RotaC uses neighbour-joining phylogenetic trees built from distance matrices obtained from alignment, together with nucleotide identity cut-off values, to phylogenetically identify the genotype of a query sequence (<xref ref-type="bibr" rid="B24">Maes et al., 2009</xref>). BLAST compares query sequences to a known database of sequences and identifies similar sequences above a certain threshold within that database (<xref ref-type="bibr" rid="B1">Altschul et al., 1990</xref>). The VIPR tool is a reimplementation of RotaC using custom Java code that also outputs corresponding BLAST results from a curated database (<xref ref-type="bibr" rid="B34">Pickett et al., 2012</xref>). 
However, the amount of rotavirus nucleotide sequence data available is rapidly increasing, providing opportunities to use machine learning methods such as random forest for genotype classification.</p>
<p>Random forest is a widely used supervised machine learning algorithm for both binary and multi-class classification tasks (<xref ref-type="bibr" rid="B5">Chaudhary et al., 2016</xref>; <xref ref-type="bibr" rid="B16">Lakshmanaprabu et al., 2019</xref>; <xref ref-type="bibr" rid="B18">Lee et al., 2019</xref>). Random forest draws bootstrap samples from a training dataset and grows a decision tree on each sample; at each node, a random subset of the available features is considered (the size of this subset is the m<sub>try</sub>) and the best split among those features is chosen (<xref ref-type="bibr" rid="B20">Liaw and Wiener, 2002</xref>). Predictions from the individual decision trees are then aggregated, and the final prediction on new data is decided by majority vote. Random forest can be trained on both categorical and numerical data, allowing for flexibility in the features present in the training data (<xref ref-type="bibr" rid="B11">Ion Titapiccolo et al., 2013</xref>). Important features can also be identified in random forest models, although multicollinearity limits how reliably this can be done (<xref ref-type="bibr" rid="B13">Kuhn and Johnson, 2013</xref>). Overall, random forest has previously demonstrated strong performance using sequence data for classification of viral pathogens or for prediction of their hosts, for example for influenza A virus (0.857&#x2013;1 overall accuracy), coronavirus (0.728&#x2013;0.735 overall accuracy, 0.688&#x2013;0.696 <italic>kappa</italic>), and porcine reproductive and respiratory syndrome virus (&#x3e;0.99 AUC) (<xref ref-type="bibr" rid="B7">Cook et al., 2020</xref>; <xref ref-type="bibr" rid="B3">Brierley and Fowler, 2021</xref>; <xref ref-type="bibr" rid="B12">Kim et al., 2021</xref>). Consequently, using random forest with a large amount of rotavirus sequence data as input may offer a novel approach to classification of rotavirus genotypes. Our objective was therefore to develop a machine learning classifier using random forest alongside alignment-based methodology for efficient and accurate classification of circulating group A rotavirus VP7 and VP4 genotypes.</p>
</sec>
<sec sec-type="materials|methods" id="s2">
<title>2 Materials and methods</title>
<sec id="s2-1">
<title>2.1 Dataset retrieval</title>
<p>The two datasets used in this study were obtained on 15 November 2020 by downloading sequences from the NCBI Nucleotide database and applying the exclusion criteria shown in <xref ref-type="fig" rid="F1">Figure 1</xref>. Sequences were initially retrieved by searching the database with the keywords &#x201c;Rotavirus A VP7&#x201d; and &#x201c;Rotavirus A VP4&#x201d; in R statistical software version 3.6.1 (<xref ref-type="bibr" rid="B36">R Core Team, 2013</xref>). Sequences that were not labelled with a G or P genotype were excluded from their respective datasets. Sequences shorter than 400 base pairs or longer than the expected length of 1,062 base pairs were excluded from the VP7 dataset, and sequences shorter than 500 base pairs or longer than 2,362 base pairs were excluded from the VP4 dataset (<xref ref-type="sec" rid="s11">Supplementary Figure S1</xref>). The number of sequences available for each genotype was then tallied, and genotypes with fewer than 10 sequences were excluded because they provided insufficient data to train the random forest algorithm. For the VP4 dataset specifically, genotypes with excess amounts of sequence data available (&#x3e;500) were reduced to a maximum of 100 randomly selected sequences to reduce computational strain when training the models. Distributions of these sequences by animal species are shown in <xref ref-type="table" rid="T1">Table 1</xref>.</p>
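<p>The exclusion steps described above can be sketched as follows. This is a minimal illustration in Python rather than the R code used in the study; the function name, the record layout, and the parameter names are hypothetical, and the downsampling threshold is left as a parameter.</p>
<preformat><![CDATA[
```python
import random
from collections import Counter, defaultdict


def filter_sequences(records, min_len, max_len, min_per_genotype=10,
                     cap_threshold=None, cap_size=100, seed=0):
    """Apply the exclusion criteria to (accession, genotype, sequence) tuples.

    `genotype` is None for unlabelled sequences. All names and defaults here
    are illustrative assumptions, not the authors' implementation.
    """
    # Step 1: drop unlabelled sequences and those outside the length bounds.
    kept = [r for r in records
            if r[1] is not None and min_len <= len(r[2]) <= max_len]

    # Step 2: drop genotypes left with fewer than `min_per_genotype` sequences.
    counts = Counter(r[1] for r in kept)
    kept = [r for r in kept if counts[r[1]] >= min_per_genotype]

    # Step 3 (optional): randomly downsample over-represented genotypes.
    if cap_threshold is not None:
        rng = random.Random(seed)
        by_genotype = defaultdict(list)
        for r in kept:
            by_genotype[r[1]].append(r)
        kept = []
        for genotype in sorted(by_genotype):
            group = by_genotype[genotype]
            if len(group) > cap_threshold:
                group = rng.sample(group, cap_size)
            kept.extend(group)
    return kept
```
]]></preformat>
<p>Under these assumptions, the VP7 criteria would correspond to <monospace>min_len=400</monospace> and <monospace>max_len=1062</monospace>, and the VP4 criteria to <monospace>min_len=500</monospace> and <monospace>max_len=2362</monospace> with downsampling enabled.</p>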
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption>
<p>Flow chart of data processing for the VP7 and VP4 datasets.</p>
</caption>
<graphic xlink:href="fgene-14-1029185-g001.tif"/>
</fig>
<table-wrap id="T1" position="float">
<label>TABLE 1</label>
<caption>
<p>Distribution of VP7 and VP4 sequences from their respective datasets by animal species after retrieval from the NCBI Nucleotide database.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th colspan="3" align="left">VP7 and VP4 sequence distribution by species</th>
</tr>
<tr>
<th align="center">Species</th>
<th align="center">VP7 Sequences</th>
<th align="center">VP4 Sequences</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center">Human</td>
<td align="center">405</td>
<td align="center">397</td>
</tr>
<tr>
<td align="center">Equine</td>
<td align="center">409</td>
<td align="center">53</td>
</tr>
<tr>
<td align="center">Bovine</td>
<td align="center">2</td>
<td align="center">221</td>
</tr>
<tr>
<td align="center">Avian</td>
<td align="center">45</td>
<td align="center">1</td>
</tr>
<tr>
<td align="center">Swine</td>
<td align="center">21</td>
<td align="center">101</td>
</tr>
<tr>
<td align="center">Other</td>
<td align="center">2</td>
<td align="center">78</td>
</tr>
<tr>
<td align="center">Total</td>
<td align="center">884</td>
<td align="center">851</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s2-2">
<title>2.2 Sequence alignment</title>
<p>Each dataset was aligned separately using two different alignment methods, pairwise sequence alignment and multiple sequence alignment. The resulting aligned sequences were then used to train the random forest algorithm, and model performance was compared between the two alignment methods. Code samples for each alignment method and for model training are shown in <xref ref-type="sec" rid="s11">Supplementary Data Sheet S1</xref>.</p>
</sec>
<sec id="s2-3">
<title>2.3 Pairwise sequence alignment</title>
<p>Sequences were individually aligned against an appropriate complete gene segment reference sequence using the Needleman-Wunsch global alignment method from the Biostrings package in R (<xref ref-type="bibr" rid="B31">Needleman and Wunsch, 1970</xref>; <xref ref-type="bibr" rid="B33">Pag&#xe8;s et al., 2020</xref>). The reference sequences used were obtained from the NCBI RefSeq database (<xref ref-type="bibr" rid="B32">O&#x2019;Leary et al., 2016</xref>). After each alignment, the first character of the aligned sequence, either &#x201c;A&#x201d;, &#x201c;T&#x201d;, &#x201c;C&#x201d;, &#x201c;G&#x201d;, or &#x201c;-&#x201d; (gap), was extracted and stored as position one in a new data frame. This was repeated for the next character as position two and so forth, up to the end of each aligned sequence (<xref ref-type="sec" rid="s11">Supplementary Figure S2</xref>). This was performed separately for the VP7 and VP4 datasets, and the resulting 1,097 and 2,416 positional features, respectively, were used to train the random forest algorithm.</p>
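<p>As an illustration of this step, the sketch below aligns a query to a reference with a basic Needleman-Wunsch implementation and then splits the aligned query into one feature per position. It is written in Python rather than with the Biostrings package used in the study. The scoring values (match 1, mismatch -1, linear gap penalty -2) and the padding of shorter rows with gap characters are assumptions made for the example, not details reported by the authors.</p>
<preformat><![CDATA[
```python
def needleman_wunsch(ref, query, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty (assumed scoring scheme).

    Returns the aligned reference and aligned query as equal-length strings.
    """
    n, m = len(ref), len(query)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if ref[i - 1] == query[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align both
                              score[i - 1][j] + gap,       # gap in query
                              score[i][j - 1] + gap)       # gap in reference

    # Trace back from the bottom-right corner to recover one optimal alignment.
    a_ref, a_query = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = None
        if i > 0 and j > 0:
            sub = match if ref[i - 1] == query[j - 1] else mismatch
        if sub is not None and score[i][j] == score[i - 1][j - 1] + sub:
            a_ref.append(ref[i - 1])
            a_query.append(query[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            a_ref.append(ref[i - 1])
            a_query.append("-")
            i -= 1
        else:
            a_ref.append("-")
            a_query.append(query[j - 1])
            j -= 1
    return "".join(reversed(a_ref)), "".join(reversed(a_query))


def positional_features(aligned_queries, pad="-"):
    """Turn aligned sequences into rows with one character per position.

    Rows are right-padded with `pad` to a common width (an assumption).
    """
    width = max(len(s) for s in aligned_queries)
    return [list(s.ljust(width, pad)) for s in aligned_queries]
```
]]></preformat>
<p>Each row returned by <monospace>positional_features</monospace> corresponds to one sequence and each column to one alignment position, which mirrors the data frame layout described above.</p>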
</sec>
<sec id="s2-4">
<title>2.4 Multiple sequence alignment</title>
<p>Sequences were aligned against each other using the multiple sequence alignment method from the MUSCLE package in R (<xref ref-type="bibr" rid="B8">Edgar, 2004</xref>). Default parameters were used for the multiple sequence alignment, and the resulting alignment was stored in a data frame of the same form as that used for pairwise sequence alignment. This was performed separately for the VP7 and VP4 datasets, and the resulting 1,223 and 2,624 positional features, respectively, were used to train the random forest algorithm.</p>
</sec>
<sec id="s2-5">
<title>2.5 Training and testing datasets</title>
<p>Using the positional features obtained from pairwise and multiple sequence alignment, training and testing datasets were formed by randomly partitioning the data into 70% training data and 30% testing data. The training dataset was used to train the random forest algorithm and the testing dataset was used to validate model performance on unseen data. The genotype distributions of the training and testing data are summarized in <xref ref-type="table" rid="T2">Table 2</xref> for the VP7 dataset and <xref ref-type="table" rid="T3">Table 3</xref> for the VP4 dataset. Accession numbers for the sequences in each training and testing dataset are shown in <xref ref-type="sec" rid="s11">Supplementary Data Sheet S2</xref>.</p>
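<p>A 70/30 partition of this kind can be sketched as below. The sketch assumes the split is made within each genotype (a stratified split) and that the training count is rounded up. Both are assumptions for illustration, so the counts it produces may differ slightly from those in Tables 2 and 3. The function and parameter names are hypothetical.</p>
<preformat><![CDATA[
```python
import random
from collections import defaultdict


def stratified_split(labels, train_percent=70, seed=0):
    """Split sample indices into training and testing sets within each class.

    Integer arithmetic is used so that the per-class training count is
    ceil(n * train_percent / 100) without floating-point rounding surprises.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, test = [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_train = (len(idxs) * train_percent + 99) // 100
        train.extend(idxs[:n_train])
        test.extend(idxs[n_train:])
    return train, test
```
]]></preformat>
<p>Splitting within each genotype keeps rare genotypes represented in both partitions, which matters when some classes have fewer than 20 sequences.</p>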
<table-wrap id="T2" position="float">
<label>TABLE 2</label>
<caption>
<p>Distribution of VP7 genotypes obtained from the NCBI nucleotide database sequences and after division into training (70%) and testing (30%) datasets.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th colspan="5" align="left">VP7 sequence distribution by genotype</th>
</tr>
<tr>
<th align="left">Genotype</th>
<th align="center">Labelled Sequences Obtained</th>
<th align="center">Sequences After Exclusion</th>
<th align="center">Training Dataset</th>
<th align="center">Testing Dataset</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center">G1</td>
<td align="center">176</td>
<td align="center">134</td>
<td align="center">94</td>
<td align="center">40</td>
</tr>
<tr>
<td align="center">G2</td>
<td align="center">22</td>
<td align="center">18</td>
<td align="center">13</td>
<td align="center">5</td>
</tr>
<tr>
<td align="center">G3</td>
<td align="center">330</td>
<td align="center">309</td>
<td align="center">217</td>
<td align="center">92</td>
</tr>
<tr>
<td align="center">G4</td>
<td align="center">22</td>
<td align="center">22</td>
<td align="center">16</td>
<td align="center">6</td>
</tr>
<tr>
<td align="center">G5</td>
<td align="center">7</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">G6</td>
<td align="center">7</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">G7</td>
<td align="center">4</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">G8</td>
<td align="center">20</td>
<td align="center">20</td>
<td align="center">14</td>
<td align="center">6</td>
</tr>
<tr>
<td align="center">G9</td>
<td align="center">53</td>
<td align="center">49</td>
<td align="center">35</td>
<td align="center">14</td>
</tr>
<tr>
<td align="center">G10</td>
<td align="center">7</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">G11</td>
<td align="center">4</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">G12</td>
<td align="center">102</td>
<td align="center">95</td>
<td align="center">68</td>
<td align="center">27</td>
</tr>
<tr>
<td align="center">G13</td>
<td align="center">2</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">G14</td>
<td align="center">196</td>
<td align="center">194</td>
<td align="center">136</td>
<td align="center">58</td>
</tr>
<tr>
<td align="center">G15</td>
<td align="center">5</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">G16</td>
<td align="center">1</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">G17</td>
<td align="center">2</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">G18</td>
<td align="center">9</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">G19</td>
<td align="center">49</td>
<td align="center">43</td>
<td align="center">31</td>
<td align="center">12</td>
</tr>
<tr>
<td align="center">G20</td>
<td align="center">2</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">G21</td>
<td align="center">2</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">G22</td>
<td align="center">2</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">G23</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">G24</td>
<td align="center">1</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">G25</td>
<td align="center">3</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">G26</td>
<td align="center">6</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">G27-G36</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">Total</td>
<td align="center">1034</td>
<td align="center">884</td>
<td align="center">624</td>
<td align="center">260</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="T3" position="float">
<label>TABLE 3</label>
<caption>
<p>Distribution of VP4 genotypes obtained from the NCBI nucleotide database sequences and after division into training (70%) and testing (30%) datasets.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th colspan="6" align="left">VP4 sequence distribution by genotype</th>
</tr>
<tr>
<th align="center">Genotype</th>
<th align="center">Labelled sequences obtained</th>
<th colspan="2" align="center">Sequences after exclusion</th>
<th align="center">Training dataset</th>
<th align="center">Testing dataset</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center">P[1]</td>
<td align="center">48</td>
<td colspan="2" align="center">48</td>
<td align="center">34</td>
<td align="center">14</td>
</tr>
<tr>
<td align="center">P[2]</td>
<td align="center">3</td>
<td colspan="2" align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">P[3]</td>
<td align="center">18</td>
<td colspan="2" align="center">18</td>
<td align="center">13</td>
<td align="center">5</td>
</tr>
<tr>
<td align="center">P[4]</td>
<td align="center">574</td>
<td colspan="2" align="center">100</td>
<td align="center">70</td>
<td align="center">30</td>
</tr>
<tr>
<td align="center">P[5]</td>
<td align="center">80</td>
<td colspan="2" align="center">79</td>
<td align="center">56</td>
<td align="center">23</td>
</tr>
<tr>
<td align="center">P[6]</td>
<td align="center">294</td>
<td colspan="2" align="center">100</td>
<td align="center">70</td>
<td align="center">30</td>
</tr>
<tr>
<td align="center">P[7]</td>
<td align="center">45</td>
<td colspan="2" align="center">43</td>
<td align="center">31</td>
<td align="center">12</td>
</tr>
<tr>
<td align="center">P[8]</td>
<td align="center">2,840</td>
<td colspan="2" align="center">100</td>
<td align="center">70</td>
<td align="center">30</td>
</tr>
<tr>
<td align="center">P[9]</td>
<td align="center">52</td>
<td colspan="2" align="center">52</td>
<td align="center">37</td>
<td align="center">15</td>
</tr>
<tr>
<td align="center">P[10]</td>
<td align="center">1</td>
<td colspan="2" align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">P[11]</td>
<td align="center">95</td>
<td colspan="2" align="center">95</td>
<td align="center">67</td>
<td align="center">28</td>
</tr>
<tr>
<td align="center">P[12]</td>
<td align="center">53</td>
<td colspan="2" align="center">51</td>
<td align="center">36</td>
<td align="center">15</td>
</tr>
<tr>
<td align="center">P[13]</td>
<td align="center">31</td>
<td colspan="2" align="center">24</td>
<td align="center">17</td>
<td align="center">7</td>
</tr>
<tr>
<td align="center">P[14]</td>
<td align="center">65</td>
<td colspan="2" align="center">65</td>
<td align="center">46</td>
<td align="center">19</td>
</tr>
<tr>
<td align="center">P[15]</td>
<td align="center">1</td>
<td colspan="2" align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">P[16]</td>
<td align="center">0</td>
<td colspan="2" align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">P[17]</td>
<td align="center">8</td>
<td colspan="2" align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">P[18]</td>
<td align="center">2</td>
<td colspan="2" align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">P[19]</td>
<td align="center">7</td>
<td colspan="2" align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">P[20]&#x2013;P[22]</td>
<td align="center">0</td>
<td colspan="2" align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">P[23]</td>
<td align="center">24</td>
<td colspan="2" align="center">24</td>
<td align="center">17</td>
<td align="center">7</td>
</tr>
<tr>
<td align="center">P[24]</td>
<td align="center">1</td>
<td colspan="2" align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">P[25]</td>
<td align="center">6</td>
<td colspan="2" align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">P[26]</td>
<td align="center">0</td>
<td colspan="2" align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">P[27]</td>
<td align="center">5</td>
<td colspan="2" align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">P[28]</td>
<td align="center">1</td>
<td colspan="2" align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">P[29]</td>
<td align="center">0</td>
<td colspan="2" align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">P[30]</td>
<td align="center">2</td>
<td colspan="2" align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">P[31]<xref ref-type="table-fn" rid="Tfn1">
<sup>a</sup>
</xref>
</td>
<td align="center">42</td>
<td colspan="2" align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">P[32]</td>
<td align="center">17</td>
<td colspan="2" align="center">17</td>
<td align="center">12</td>
<td align="center">5</td>
</tr>
<tr>
<td align="center">P[33]</td>
<td align="center">1</td>
<td colspan="2" align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">P[34]</td>
<td align="center">0</td>
<td colspan="2" align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">P[35]</td>
<td align="center">1</td>
<td colspan="2" align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">P[36]&#x2013;P[37]</td>
<td align="center">0</td>
<td colspan="2" align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">P[38]</td>
<td align="center">1</td>
<td colspan="2" align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">P[39]</td>
<td align="center">1</td>
<td colspan="2" align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">P[40]</td>
<td align="center">1</td>
<td colspan="2" align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">P[41]&#x2013;P[46]</td>
<td align="center">0</td>
<td colspan="2" align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">P[47]</td>
<td align="center">1</td>
<td colspan="2" align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">P[48]</td>
<td align="center">1</td>
<td colspan="2" align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">P[49]</td>
<td align="center">35</td>
<td colspan="2" align="center">35</td>
<td align="center">25</td>
<td align="center">10</td>
</tr>
<tr>
<td align="center">P[50]&#x2013;P[51]</td>
<td align="center">0</td>
<td colspan="2" align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="center">Total</td>
<td align="center">4,357</td>
<td colspan="2" align="center">851</td>
<td align="center">601</td>
<td align="center">250</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="Tfn1">
<label>
<sup>a</sup>
</label>
<p>36 of 42 sequences were excluded by criteria of sequence length less than 500 base pairs.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</sec>
<sec id="s2-6">
<title>2.6 Model training</title>
<p>Models were trained in R using the caret package with random forest as the classification algorithm (<xref ref-type="bibr" rid="B14">Kuhn, 2008</xref>). Due to the unbalanced nature of the datasets, two cross-validation methods, 10-fold cross-validation repeated three times (R10FCVT) and leave-one-out cross-validation (LOOCV), were chosen to evaluate model performance during training. In 10-fold cross-validation, the training data are randomly divided into 10 distinct folds; each fold serves once as the held-out test set while the remaining nine folds are used for training. In leave-one-out cross-validation, the number of folds equals the number of samples in the dataset, so each sample serves once as the held-out test set while all remaining samples are used for training. Two models were trained for each alignment method, one using R10FCVT and one using LOOCV, for a total of four models each for the VP7 and VP4 datasets. Confusion matrices were generated for each of the models, and overall accuracies and <italic>kappa</italic> values were calculated from these confusion matrices to evaluate performance during training. Cohen&#x2019;s <italic>kappa</italic> was calculated using the following equation:<disp-formula id="equ1">
<mml:math id="m1">
<mml:mrow>
<mml:mi>K</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi>e</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi>e</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>where <inline-formula id="inf1">
<mml:math id="m2">
<mml:mrow>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is the observed agreement and <inline-formula id="inf2">
<mml:math id="m3">
<mml:mrow>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi>e</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is the expected agreement of the model (<xref ref-type="bibr" rid="B6">Cohen, 1960</xref>).</p>
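<p>The equation above can be illustrated with a minimal Python sketch. This is not the R/caret code used in the study; the function name <italic>cohen_kappa</italic> and the three-class confusion matrix are hypothetical and serve only to show how the observed and expected agreement are obtained from a confusion matrix.</p>
<preformat>
```python
def cohen_kappa(confusion):
    """Return (p0, pe, kappa) for a square confusion matrix.

    confusion[i][j] counts samples of reference class i predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    n_classes = len(confusion)
    # Observed agreement p0: share of samples on the diagonal.
    p0 = sum(confusion[i][i] for i in range(n_classes)) / total
    # Expected agreement pe: chance agreement from row and column marginals.
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(n_classes))
                  for j in range(n_classes)]
    pe = sum(r * c for r, c in zip(row_totals, col_totals)) / (total * total)
    return p0, pe, (p0 - pe) / (1 - pe)


# Hypothetical three-class confusion matrix (illustrative counts, not study data).
toy = [
    [50, 2, 0],
    [1, 30, 1],
    [0, 1, 15],
]
p0, pe, kappa = cohen_kappa(toy)
print(round(p0, 4), round(pe, 4), round(kappa, 4))
```
</preformat>
<p>In this toy example the observed agreement is 0.95 and the expected agreement is 0.3964, so <italic>kappa</italic> is lower than the raw accuracy because part of the agreement is attributable to the class marginals.</p>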
</sec>
<sec id="s2-7">
<title>2.7 Model hyperparameter tuning</title>
<p>The m<sub>try</sub> hyperparameter, the number of features randomly sampled as split candidates at each node, was tuned during model training alongside cross-validation. The default value of m<sub>try</sub> is the square root of the number of features in the data, and tuning m<sub>try</sub> was intended to yield the most robust models possible. Because candidate m<sub>try</sub> values are compared on held-out folds, tuning alongside cross-validation reduces the risk of selecting an overfitted model, and it is therefore beneficial to perform the two concurrently (<xref ref-type="bibr" rid="B35">Probst et al., 2019</xref>). The tuning grid for each model consisted of the default m<sub>try</sub> value together with the values 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100. All other hyperparameters, such as the number of trees, minimum and maximum node size, and maximum depth, were left at their default values.</p>
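<p>A minimal Python sketch of how such a tuning grid could be assembled is given here for illustration. It assumes that the default m<sub>try</sub> is the floor of the square root of the number of features; the function name <italic>mtry_grid</italic> and the feature count of 1100 are hypothetical and are not taken from the study.</p>
<preformat>
```python
import math


def mtry_grid(n_features):
    """Candidate mtry values: the assumed default floor(sqrt(p)) plus 10, 20, ..., 100."""
    default = math.isqrt(n_features)  # floor of the square root
    candidates = {default} | set(range(10, 101, 10))
    # mtry cannot exceed the number of available features.
    return sorted(m for m in candidates if m in range(1, n_features + 1))


# Hypothetical feature count, chosen only for illustration.
print(mtry_grid(1100))
```
</preformat>
<p>With 1100 features the grid holds 11 candidates; if the default value coincided with one of the fixed values, the set union would keep only one copy of it.</p>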
</sec>
<sec id="s2-8">
<title>2.8 Model testing</title>
<p>The trained models were tested by evaluating how well they performed on unseen data in the testing dataset. Each model was used to predict the classes of the unseen aligned sequences, and a confusion matrix was generated from its predictions. Overall accuracies, 95% confidence intervals, <italic>kappa</italic> values, no-information rates, and p-values were derived from the confusion matrices to evaluate model performance on the testing dataset. Misclassified sequences were then explored by constructing maximum-likelihood phylogenetic trees with 100 bootstrap iterations in MEGA-X from a partial training dataset (10 randomly sampled sequences from each class) and the full testing dataset; possible outliers were identified by visual analysis using the Interactive Tree of Life webtool (iTOL) (<xref ref-type="bibr" rid="B15">Kumar et al., 2018</xref>; <xref ref-type="bibr" rid="B19">Letunic and Bork, 2021</xref>).</p>
</sec>
<sec id="s2-9">
<title>2.9 Model computational performance</title>
<p>Model computational performance was measured to assess how practical each model may be in real-world situations where a query sequence needs to be identified. Three components were timed for each model: the time elapsed to perform the initial alignment of a query sequence, the time elapsed to train the model, and the time elapsed for the model to predict the class of the query sequence. These times were also summed with and without the training component to compare model performance in situations where the models need to be retrained regularly and in situations where they do not.</p>
</sec>
</sec>
<sec sec-type="results" id="s3">
<title>3 Results</title>
<sec id="s3-1">
<title>3.1 Training model performance</title>
<p>Using the VP7 training dataset of 624 sequences and VP4 training dataset of 601 sequences, random forest models were trained using cross-validation methods of both R10FCVT and LOOCV on positional features from aligned sequence data. Overall accuracies and <italic>kappa</italic> values were calculated to compare model performance during training directly and are summarized in <xref ref-type="table" rid="T4">Table 4</xref>. The best performing model for the VP7 dataset was found to be the multiple sequence alignment LOOCV model, which had m<sub>try</sub>, overall accuracy, and <italic>kappa</italic> values of 30, 0.992, and 0.986, respectively. The worst performing model for the VP7 dataset was found to be the pairwise sequence alignment LOOCV model, which had m<sub>try</sub>, overall accuracy, and <italic>kappa</italic> values of 33, 0.981, and 0.976, respectively. For the VP4 dataset, the best performing model was found to be either of the multiple sequence alignment models, R10FCVT and LOOCV, where both models had the same m<sub>try</sub>, overall accuracy, and <italic>kappa</italic> values of 40, 0.990, and 0.989, respectively. The worst performing model for the VP4 dataset was found to be the pairwise sequence alignment R10FCVT model, which had m<sub>try</sub>, overall accuracy, and <italic>kappa</italic> values of 90, 0.975, and 0.976, respectively. Confusion matrices were generated for each of the trained models to observe class accuracies for the imbalanced VP7 and VP4 datasets and are shown in <xref ref-type="fig" rid="F2">Figure 2</xref> and <xref ref-type="fig" rid="F3">Figure 3</xref>, respectively.</p>
<table-wrap id="T4" position="float">
<label>TABLE 4</label>
<caption>
<p>Comparison of overall accuracy, accuracy standard deviation across folds, <italic>kappa</italic>, and m<sub>try</sub> values after training and tuning of random forest models using positional features from pairwise and multiple sequence alignment.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th colspan="5" align="left">VP7 and VP4 training model performance</th>
</tr>
<tr>
<th align="center">Methods</th>
<th align="center">m<sub>try</sub>
</th>
<th align="center">Accuracy</th>
<th align="center">Accuracy Std</th>
<th align="center">
<italic>Kappa</italic>
</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td colspan="5" align="left">
<italic>Pairwise Sequence Alignment</italic>
</td>
</tr>
<tr>
<td align="center">&#x2003;R10FCVT VP7</td>
<td align="center">90</td>
<td align="center">0.9813</td>
<td align="center">0.0119</td>
<td align="center">0.9736</td>
</tr>
<tr>
<td align="center">&#x2003;LOOCV VP7</td>
<td align="center">33</td>
<td align="center">0.9808</td>
<td align="center">0.1374</td>
<td align="center">0.9757</td>
</tr>
<tr>
<td align="center">&#x2003;R10FCVT VP4</td>
<td align="center">90</td>
<td align="center">0.9751</td>
<td align="center">0.0217</td>
<td align="center">0.9763</td>
</tr>
<tr>
<td align="center">&#x2003;LOOCV VP4</td>
<td align="center">90</td>
<td align="center">0.9767</td>
<td align="center">0.1509</td>
<td align="center">0.9701</td>
</tr>
<tr>
<td colspan="5" align="left">
<italic>Multiple Sequence Alignment</italic>
</td>
</tr>
<tr>
<td align="center">&#xa0;&#xa0;R10FCVT VP7</td>
<td align="center">90</td>
<td align="center">0.9919</td>
<td align="center">0.0091</td>
<td align="center">0.9878</td>
</tr>
<tr>
<td align="center">&#xa0;&#xa0;LOOCV VP7</td>
<td align="center">30</td>
<td align="center">0.9920</td>
<td align="center">0.0892</td>
<td align="center">0.9858</td>
</tr>
<tr>
<td align="center">&#xa0;&#xa0;R10FCVT VP4</td>
<td align="center">40</td>
<td align="center">0.9900</td>
<td align="center">0.0082</td>
<td align="center">0.9891</td>
</tr>
<tr>
<td align="center">&#xa0;&#xa0;LOOCV VP4</td>
<td align="center">40</td>
<td align="center">0.9900</td>
<td align="center">0.0995</td>
<td align="center">0.9891</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>R10FCVT, Repeated 10-fold cross-validation thrice.</p>
</fn>
<fn>
<p>LOOCV, Leave-one-out cross-validation.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption>
<p>Confusion matrices for the trained VP7 models on the cross-validated training dataset composed of 624 VP7 sequences.</p>
</caption>
<graphic xlink:href="fgene-14-1029185-g002.tif"/>
</fig>
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption>
<p>Confusion matrices for the trained VP4 models on the cross-validated training dataset composed of 601 VP4 sequences.</p>
</caption>
<graphic xlink:href="fgene-14-1029185-g003.tif"/>
</fig>
</sec>
<sec id="s3-2">
<title>3.2 Model validation</title>
<p>Using the VP7 testing dataset of 260 sequences and the VP4 testing dataset of 250 sequences, random forest model performance was validated by using each of the trained models to predict the class of unseen aligned sequences. This was done to assess how well the models may perform in real-world situations where the class of a query sequence needs to be identified.</p>
<p>Overall accuracies, <italic>kappa</italic> values, 95% confidence intervals, no information rates, and p-values were calculated for each of the models to compare performance on the testing datasets directly and are summarized in <xref ref-type="table" rid="T5">Table 5</xref>. The best performing model on the VP7 testing dataset was found to be either of the multiple sequence alignment models, R10FCVT and LOOCV, where both models had overall accuracy, 95% confidence interval, and <italic>kappa</italic> values of 0.996, (0.979, 0.999), and 0.995, respectively. The worst performing model on the VP7 testing dataset was found to be the pairwise sequence alignment R10FCVT model, which had overall accuracy, 95% confidence interval, and <italic>kappa</italic> values of 0.985, (0.961, 0.996), and 0.980, respectively. Similarly, the best performing model on the VP4 testing dataset was found to be either of the multiple sequence alignment models, R10FCVT and LOOCV, where both models had overall accuracy, 95% confidence interval, and <italic>kappa</italic> values of 0.996, (0.978, 0.999), and 0.996, respectively. The worst performing model on the VP4 testing dataset was found to be the pairwise sequence alignment R10FCVT model with overall accuracy, 95% confidence interval, and <italic>kappa</italic> values of 0.972, (0.943, 0.989), and 0.969, respectively. VP7 and VP4 models were found to have no-information rates of 0.354 and 0.120, respectively, with overall accuracy for all models being significantly greater (<italic>p</italic> &#x3c; 0.01) than the no-information rate.</p>
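<p>The comparison of overall accuracy against the no-information rate can be sketched as a one-sided exact binomial test, which is our understanding of how such p-values are commonly obtained; the Python code here is an illustration and not the code used in the study. The counts are assumptions back-calculated from the reported proportions (259 of 260 correct predictions for an accuracy of about 0.9962, and 92 of 260 sequences in the largest class for a no-information rate of about 0.3538), and the function name is hypothetical.</p>
<preformat>
```python
from math import comb


def p_value_acc_above_nir(n_correct, n_total, nir):
    """One-sided exact binomial p-value.

    Probability of observing at least n_correct correct predictions out of
    n_total when each prediction is correct with probability nir.
    """
    return sum(
        comb(n_total, k) * nir ** k * (1 - nir) ** (n_total - k)
        for k in range(n_correct, n_total + 1)
    )


# Assumed counts, back-calculated from reported proportions (not reported directly).
nir = 92 / 260
p = p_value_acc_above_nir(259, 260, nir)
print(p)
```
</preformat>
<p>Under these assumed counts the resulting probability is far below 0.01, which is in line with the p-values reported for all models.</p>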
<table-wrap id="T5" position="float">
<label>TABLE 5</label>
<caption>
<p>Comparison of overall accuracy, 95% confidence intervals, <italic>kappa</italic>, no-information rates, and p-values for trained VP7 and VP4 random forest models on testing data using positional features from pairwise and multiple sequence alignment.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th colspan="6" align="left">VP7 and VP4 testing data model performance</th>
</tr>
<tr>
<th align="center">Methods</th>
<th align="center">Overall accuracy</th>
<th align="center">95% confidence interval</th>
<th align="center">
<italic>Kappa</italic>
</th>
<th align="center">No-information rate</th>
<th align="center">
<italic>p</italic>-value [ACC &#x3e; NIR]</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td colspan="6" align="left">
<italic>Pairwise Sequence Alignment</italic>
</td>
</tr>
<tr>
<td align="center">&#x2003;R10FCVT VP7</td>
<td align="center">0.9846</td>
<td align="center">(0.9611, 0.9958)</td>
<td align="center">0.9804</td>
<td align="center">0.3538</td>
<td align="center">&#x3c;0.01</td>
</tr>
<tr>
<td align="center">&#x2003;LOOCV VP7</td>
<td align="center">0.9885</td>
<td align="center">(0.9668, 0.9976)</td>
<td align="center">0.9854</td>
<td align="center">0.3538</td>
<td align="center">&#x3c;0.01</td>
</tr>
<tr>
<td align="center">&#x2003;R10FCVT VP4</td>
<td align="center">0.9720</td>
<td align="center">(0.9432, 0.9887)</td>
<td align="center">0.9693</td>
<td align="center">0.1200</td>
<td align="center">&#x3c;0.01</td>
</tr>
<tr>
<td align="center">&#x2003;LOOCV VP4</td>
<td align="center">0.9760</td>
<td align="center">(0.9485, 0.9911)</td>
<td align="center">0.9737</td>
<td align="center">0.1200</td>
<td align="center">&#x3c;0.01</td>
</tr>
<tr>
<td colspan="6" align="left">
<italic>Multiple Sequence Alignment</italic>
</td>
</tr>
<tr>
<td align="center">&#xa0;&#xa0;R10FCVT VP7</td>
<td align="center">0.9962</td>
<td align="center">(0.9788, 0.9999)</td>
<td align="center">0.9951</td>
<td align="center">0.3538</td>
<td align="center">&#x3c;0.01</td>
</tr>
<tr>
<td align="center">&#xa0;&#xa0;LOOCV VP7</td>
<td align="center">0.9962</td>
<td align="center">(0.9788, 0.9999)</td>
<td align="center">0.9951</td>
<td align="center">0.3538</td>
<td align="center">&#x3c;0.01</td>
</tr>
<tr>
<td align="center">&#xa0;&#xa0;R10FCVT VP4</td>
<td align="center">0.9960</td>
<td align="center">(0.9779, 0.9999)</td>
<td align="center">0.9956</td>
<td align="center">0.1200</td>
<td align="center">&#x3c;0.01</td>
</tr>
<tr>
<td align="center">&#xa0;&#xa0;LOOCV VP4</td>
<td align="center">0.9960</td>
<td align="center">(0.9779, 0.9999)</td>
<td align="center">0.9956</td>
<td align="center">0.1200</td>
<td align="center">&#x3c;0.01</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>R10FCVT, Repeated 10-fold cross-validation thrice; LOOCV, Leave-one-out cross-validation.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>Confusion matrices were generated to observe class accuracies on the VP7 and VP4 testing datasets and are shown in <xref ref-type="fig" rid="F4">Figure 4</xref> and <xref ref-type="fig" rid="F5">Figure 5</xref>, respectively. Misclassified sequences were identified from these confusion matrices and are summarized in <xref ref-type="table" rid="T6">Table 6</xref>. Possible outliers among these misclassified sequences were determined through phylogenetic analysis (<xref ref-type="sec" rid="s11">Supplementary Figure S3</xref>). The misclassified sequences with accession numbers AB735641.1 and EU033979.1 were found to be possible outliers in the VP7 testing dataset, and those with accession numbers EU033986.1 and MH446387.1 were found to be possible outliers in the VP4 testing dataset.</p>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption>
<p>Confusion matrices for the trained VP7 models on the testing dataset composed of 260 VP7 sequences.</p>
</caption>
<graphic xlink:href="fgene-14-1029185-g004.tif"/>
</fig>
<fig id="F5" position="float">
<label>FIGURE 5</label>
<caption>
<p>Confusion matrices for the trained VP4 models on the testing dataset composed of 250 VP4 sequences.</p>
</caption>
<graphic xlink:href="fgene-14-1029185-g005.tif"/>
</fig>
<table-wrap id="T6" position="float">
<label>TABLE 6</label>
<caption>
<p>Comparison of model predictions for misclassified sequences obtained from using the trained models on the VP7 and VP4 testing datasets.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th colspan="7" align="left">VP7 and VP4 model predictions for misclassified sequences</th>
</tr>
<tr>
<th align="center">Unique Identifier</th>
<th align="center">Animal Species</th>
<th align="center">Reference Genotype</th>
<th align="center">PW R10FCVT Prediction</th>
<th align="center">PW LOOCV Prediction</th>
<th align="center">MSA R10FCVT Prediction</th>
<th align="center">MSA LOOCV Prediction</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td colspan="7" align="left">
<italic>VP7 Dataset</italic>
</td>
</tr>
<tr>
<td align="center">&#x2003;EU033979.1</td>
<td align="center">Human</td>
<td align="center">G3</td>
<td align="center">G4</td>
<td align="center">G3</td>
<td align="center">G1</td>
<td align="center">G1</td>
</tr>
<tr>
<td align="center">&#x2003;AB735641.1</td>
<td align="center">Swine</td>
<td align="center">G9</td>
<td align="center">G4</td>
<td align="center">G4</td>
<td align="center">G9</td>
<td align="center">G9</td>
</tr>
<tr>
<td align="center">&#x2003;KU372573.1</td>
<td align="center">Avian</td>
<td align="center">G19</td>
<td align="center">G4</td>
<td align="center">G4</td>
<td align="center">G19</td>
<td align="center">G19</td>
</tr>
<tr>
<td align="center">&#x2003;AY750923.1</td>
<td align="center">Equine</td>
<td align="center">G14</td>
<td align="center">G4</td>
<td align="center">G1</td>
<td align="center">G14</td>
<td align="center">G14</td>
</tr>
<tr>
<td colspan="7" align="left">
<italic>VP4 Dataset</italic>
</td>
</tr>
<tr>
<td align="center">&#xa0;&#xa0;KY077643.1</td>
<td align="center">Swine</td>
<td align="center">P[13]</td>
<td align="center">P[11]</td>
<td align="center">P[11]</td>
<td align="center">P[13]</td>
<td align="center">P[13]</td>
</tr>
<tr>
<td align="center">&#xa0;&#xa0;KT906385.1</td>
<td align="center">Swine</td>
<td align="center">P[13]</td>
<td align="center">P[6]</td>
<td align="center">P[6]</td>
<td align="center">P[13]</td>
<td align="center">P[13]</td>
</tr>
<tr>
<td align="center">&#xa0;&#xa0;KT261372.1</td>
<td align="center">Bovine</td>
<td align="center">P[14]</td>
<td align="center">P[11]</td>
<td align="center">P[11]</td>
<td align="center">P[14]</td>
<td align="center">P[14]</td>
</tr>
<tr>
<td align="center">&#xa0;&#xa0;EF672605.1</td>
<td align="center">Human</td>
<td align="center">P[9]</td>
<td align="center">P[11]</td>
<td align="center">P[11]</td>
<td align="center">P[9]</td>
<td align="center">P[9]</td>
</tr>
<tr>
<td align="center">&#xa0;&#xa0;MH446387.1</td>
<td align="center">Human</td>
<td align="center">P[8]</td>
<td align="center">P[11]</td>
<td align="center">P[11]</td>
<td align="center">P[4]</td>
<td align="center">P[4]</td>
</tr>
<tr>
<td align="center">&#xa0;&#xa0;KF414619.1</td>
<td align="center">Unknown</td>
<td align="center">P[8]</td>
<td align="center">P[4]</td>
<td align="center">P[8]</td>
<td align="center">P[8]</td>
<td align="center">P[8]</td>
</tr>
<tr>
<td align="center">&#xa0;&#xa0;EU033986.1</td>
<td align="center">Human</td>
<td align="center">P[6]</td>
<td align="center">P[11]</td>
<td align="center">P[11]</td>
<td align="center">P[6]</td>
<td align="center">P[6]</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>PW R10FCVT, Pairwise repeated 10-fold cross-validation thrice; PW LOOCV, Pairwise leave-one-out cross-validation; MSA R10FCVT, Multiple sequence alignment repeated 10-fold cross-validation thrice; MSA LOOCV, Multiple sequence alignment leave-one-out cross-validation.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</sec>
<sec id="s3-3">
<title>3.3 Model computational performance results</title>
<p>The times elapsed for alignment of a query sequence, model training, and model prediction were recorded to compare model computational performance and are summarized in <xref ref-type="table" rid="T7">Table 7</xref>. Pairwise and multiple sequence alignment of a query VP7 sequence took 0.14 and 17.21 s, respectively. Pairwise and multiple sequence alignment of a query VP4 sequence took 0.22 and 44.17 s, respectively.</p>
<table-wrap id="T7" position="float">
<label>TABLE 7</label>
<caption>
<p>Comparison of computational performance times for query sequence alignment, model training, and model prediction on testing data for each of the models using pairwise and multiple sequence alignment on the VP7 and VP4 datasets.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th colspan="4" align="left">VP7 and VP4 model computational performance</th>
</tr>
<tr>
<th align="center">Methods</th>
<th align="center">Query Sequence Alignment Time Elapsed (seconds)</th>
<th align="center">Training Time Elapsed (seconds)</th>
<th align="center">Prediction Time Elapsed (seconds)</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td colspan="4" align="left">
<italic>Pairwise Sequence Alignment</italic>
</td>
</tr>
<tr>
<td align="center">&#x2003;VP7 R10FCVT Model</td>
<td align="center">0.14</td>
<td align="center">502.01</td>
<td align="center">0.47</td>
</tr>
<tr>
<td align="center">&#x2003;VP7 LOOCV Model</td>
<td align="center">0.14</td>
<td align="center">8145.21</td>
<td align="center">0.47</td>
</tr>
<tr>
<td align="center">&#x2003;VP4 R10FCVT Model</td>
<td align="center">0.22</td>
<td align="center">1377.69</td>
<td align="center">1.45</td>
</tr>
<tr>
<td align="center">&#x2003;VP4 LOOCV Model</td>
<td align="center">0.22</td>
<td align="center">26136.72</td>
<td align="center">1.45</td>
</tr>
<tr>
<td colspan="4" align="left">
<italic>Multiple Sequence Alignment</italic>
</td>
</tr>
<tr>
<td align="center">&#xa0;&#xa0;VP7 R10FCVT Model</td>
<td align="center">17.21</td>
<td align="center">1323.69</td>
<td align="center">0.53</td>
</tr>
<tr>
<td align="center">&#xa0;&#xa0;VP7 LOOCV Model</td>
<td align="center">17.21</td>
<td align="center">7612.43</td>
<td align="center">0.53</td>
</tr>
<tr>
<td align="center">&#xa0;&#xa0;VP4 R10FCVT Model</td>
<td align="center">44.17</td>
<td align="center">3395.97</td>
<td align="center">1.74</td>
</tr>
<tr>
<td align="center">&#xa0;&#xa0;VP4 LOOCV Model</td>
<td align="center">44.17</td>
<td align="center">18273.13</td>
<td align="center">1.74</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>R10FCVT, Repeated 10-fold cross-validation thrice; LOOCV, Leave-one-out cross-validation.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>During training, the models with the shortest time elapsed for both the VP7 and VP4 datasets were the pairwise sequence alignment R10FCVT models, at 502.01 and 1377.69&#xa0;s, respectively. The models with the longest time elapsed were the pairwise sequence alignment LOOCV models, at 8145.21 and 26136.72 s, respectively. Multiple sequence alignment R10FCVT models took much longer to train than their pairwise counterparts for both datasets, at 1323.69 and 3395.97 s, respectively. In contrast, multiple sequence alignment LOOCV models took less time to train than their pairwise counterparts for both datasets, at 7612.43 and 18273.13 s, respectively.</p>
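<p>Part of the gap between R10FCVT and LOOCV training times can be understood by counting the resampling fits each scheme requires. The Python sketch here is illustrative only: it assumes 11 candidate m<sub>try</sub> values (the default plus 10 to 100 in steps of 10) and ignores the final model fit and the different training-set sizes per fit, so the counts are not expected to match the measured time ratios exactly. The function name is hypothetical.</p>
<preformat>
```python
def resampling_fits(n_samples, n_candidates, scheme):
    """Number of random forest fits needed to evaluate every mtry candidate."""
    if scheme == "R10FCVT":
        per_candidate = 10 * 3      # 10 folds, repeated three times
    elif scheme == "LOOCV":
        per_candidate = n_samples   # one held-out sequence per fit
    else:
        raise ValueError("unknown scheme: " + scheme)
    return per_candidate * n_candidates


n_candidates = 11  # assumed: default mtry plus 10, 20, ..., 100
for label, n in (("VP7", 624), ("VP4", 601)):
    print(label,
          resampling_fits(n, n_candidates, "R10FCVT"),
          resampling_fits(n, n_candidates, "LOOCV"))
```
</preformat>
<p>Under these assumptions R10FCVT requires 330 fits per dataset, whereas LOOCV requires 6864 fits for the 624 VP7 training sequences and 6611 fits for the 601 VP4 training sequences.</p>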
<p>The models with the shortest time elapsed for class prediction of a query VP7 sequence were found to be either of the pairwise sequence alignment models, R10FCVT and LOOCV, which both had time elapsed of 0.47&#xa0;s. Similarly, the models with the shortest time elapsed for class prediction of a query VP4 sequence were found to be either of the pairwise sequence alignment models, R10FCVT and LOOCV, which both had time elapsed of 1.45&#xa0;s. The models with the longest time elapsed for class prediction of a query VP7 sequence were found to be either of the multiple sequence alignment models, R10FCVT and LOOCV, which both had time elapsed of 0.53&#xa0;s. Similarly, the models with the longest time elapsed for class prediction of a query VP4 sequence were found to be either of the multiple sequence alignment models, R10FCVT and LOOCV, which both had time elapsed of 1.74&#xa0;s.</p>
<p>The times elapsed were summed with and without the training component to compare model performance in circumstances where models may or may not need to be retrained; the totals are summarized in <xref ref-type="table" rid="T8">Table 8</xref>. The models with the shortest total time with training were the pairwise R10FCVT models for both the VP7 and VP4 datasets. The models with the shortest total time without training were the two pairwise sequence alignment models, R10FCVT and LOOCV, which were tied for both datasets. Multiple sequence alignment models were generally slower than their pairwise sequence alignment counterparts with and without training for both datasets, with the one exception that multiple sequence alignment LOOCV models trained faster than the pairwise sequence alignment LOOCV models.</p>
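<p>The summation can be reproduced from the component times in Table 7 with a short Python sketch. The component values are copied from Table 7; the dictionary layout and the function name <italic>totals</italic> are hypothetical and are used only for illustration.</p>
<preformat>
```python
# (alignment, training, prediction) times in seconds, copied from Table 7.
timings = {
    "PW VP7 R10FCVT": (0.14, 502.01, 0.47),
    "PW VP7 LOOCV": (0.14, 8145.21, 0.47),
    "PW VP4 R10FCVT": (0.22, 1377.69, 1.45),
    "PW VP4 LOOCV": (0.22, 26136.72, 1.45),
    "MSA VP7 R10FCVT": (17.21, 1323.69, 0.53),
    "MSA VP7 LOOCV": (17.21, 7612.43, 0.53),
    "MSA VP4 R10FCVT": (44.17, 3395.97, 1.74),
    "MSA VP4 LOOCV": (44.17, 18273.13, 1.74),
}


def totals(align, train, predict):
    """Return (total with training, total without training), rounded to 0.01 s."""
    with_training = round(align + train + predict, 2)
    without_training = round(align + predict, 2)
    return with_training, without_training


for name, parts in timings.items():
    print(name, *totals(*parts))
```
</preformat>
<p>Without the training component only alignment and prediction remain, which is why the R10FCVT and LOOCV variants of each alignment method share the same total.</p>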
<table-wrap id="T8" position="float">
<label>TABLE 8</label>
<caption>
<p>Comparison of total time elapsed with and without training for each of the models using pairwise and multiple sequence alignment for classification of a query sequence from start to finish for the VP7 and VP4 datasets.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th colspan="4" align="left">VP7 and VP4 model classification comparison</th>
</tr>
<tr>
<th align="center">Methods</th>
<th colspan="2" align="center">Total time elapsed with training (seconds)</th>
<th align="center">Total time elapsed without training (seconds)</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td colspan="4" align="left">
<italic>Pairwise Sequence Alignment</italic>
</td>
</tr>
<tr>
<td align="center">&#x2003;VP7 R10FCVT Model</td>
<td colspan="2" align="center">502.62</td>
<td align="center">0.61</td>
</tr>
<tr>
<td align="center">&#x2003;VP7 LOOCV Model</td>
<td colspan="2" align="center">8145.82</td>
<td align="center">0.61</td>
</tr>
<tr>
<td align="center">&#x2003;VP4 R10FCVT Model</td>
<td colspan="2" align="center">1379.36</td>
<td align="center">1.67</td>
</tr>
<tr>
<td align="center">&#x2003;VP4 LOOCV Model</td>
<td colspan="2" align="center">26138.39</td>
<td align="center">1.67</td>
</tr>
<tr>
<td colspan="4" align="left">
<italic>Multiple Sequence Alignment</italic>
</td>
</tr>
<tr>
<td align="center">&#xa0;&#xa0;VP7 R10FCVT Model</td>
<td colspan="2" align="center">1341.43</td>
<td align="center">17.74</td>
</tr>
<tr>
<td align="center">&#xa0;&#xa0;VP7 LOOCV Model</td>
<td colspan="2" align="center">7630.17</td>
<td align="center">17.74</td>
</tr>
<tr>
<td align="center">&#xa0;&#xa0;VP4 R10FCVT Model</td>
<td colspan="2" align="center">3441.88</td>
<td align="center">45.91</td>
</tr>
<tr>
<td align="center">&#xa0;&#xa0;VP4 LOOCV Model</td>
<td colspan="2" align="center">18319.04</td>
<td align="center">45.91</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>R10FCVT, Repeated 10-fold cross-validation thrice; LOOCV, Leave-one-out cross-validation.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</sec>
</sec>
<sec sec-type="discussion" id="s4">
<title>4 Discussion</title>
<sec id="s4-1">
<title>4.1 Previous literature and significance of results</title>
<p>Previous studies have looked at prevalent strains found in humans and many different animal species globally. Strains that commonly infect humans worldwide consist of G1, G2, G3, G4, G9, and G12 VP7 genotypes as well as P[4], P[6], and P[8] VP4 genotypes (<xref ref-type="bibr" rid="B9">Gentsch et al., 2005</xref>; <xref ref-type="bibr" rid="B38">Santos and Hoshino, 2005</xref>; <xref ref-type="bibr" rid="B26">Matthijnssens et al., 2009</xref>). Strains that commonly infect equines consist of G3, G5, G10, and G14 VP7 genotypes as well as the P[12] VP4 genotype. Strains that commonly infect bovines consist of G1, G6-G8, G10, G11, G15, G18, and G21 VP7 genotypes as well as P[1], P[5], P[11], P[14], P[17], P[21], and P[29] VP4 genotypes. Strains that commonly infect swine consist of G1-G6, G8-G12, and G26 VP7 genotypes as well as P[1]-P[8], P[11], P[13], P[19], P[23], P[26], P[27], P[32], and P[34] VP4 genotypes (<xref ref-type="bibr" rid="B22">Luchs and Timenetsky, 2016</xref>; <xref ref-type="bibr" rid="B41">Vlasova et al., 2017</xref>). Given that circulating genotypes within humans and animal species are known, we compared the distribution of species and genotypes within our VP7 and VP4 datasets to check for agreement with the literature. Most of the sequences found within our VP7 dataset were from humans and equines. The most prevalent genotypes within our VP7 dataset were found to be G1, G3, G12, and G14, which is in general agreement with current literature. Within our VP4 dataset, most of the sequences were found to be from humans, bovines, and swine. The most prevalent genotypes within this dataset were found to be P[4], P[5], P[6], P[8], P[11], and P[14], which is also in general agreement with current literature.</p>
<p>A previous study also examined alignment-based classification of group A rotavirus genotypes, although using a full genome classification system rather than the dual classification system (<xref ref-type="bibr" rid="B24">Maes et al., 2009</xref>). The RotaC web-based tool first identifies the gene segment to which a query sequence belongs by comparing it to a full genome reference alignment containing group A rotavirus standards. Distance matrices are then generated from pairwise alignment between the query sequence and an appropriate reference sequence using the Needleman-Wunsch algorithm. Neighbour-joining phylogenetic trees are then generated from these distance matrices and used alongside nucleotide identity cut-off values for classification into genotypes, with tree reliability assessed using 100 bootstrap replicates. Phylogenetic classification methods that involve bootstrapping may become computationally intensive, as bootstrapped trees need to be generated every time a query sequence is classified. In contrast, random forest models with established accuracy generally need to be trained only once before they can be used for classification. The full genome classification system also uses all 11 genome segments, with nomenclature defined using the notation Gx-P[x]-Ix-Rx-Cx-Mx-Ax-Nx-Tx-Ex-Hx (where x is the genotype number) for the genes encoding VP7, VP4, VP6, VP1, VP2, VP3, NSP1, NSP2, NSP3, NSP4, and NSP5/6, respectively (<xref ref-type="bibr" rid="B28">Matthijnssens et al., 2011</xref>). Classification through the full genome classification system is considerably more descriptive and may allow for further studies analyzing strain reassortment within and between host species, as well as the discovery of new genotypes (<xref ref-type="bibr" rid="B27">Matthijnssens et al., 2008</xref>; <xref ref-type="bibr" rid="B24">Maes et al., 2009</xref>). 
However, full genome sequences are not yet as readily available as partial genome sequences; dual classification models may therefore remain useful until sequencing efforts catch up. With this in mind, we assessed how well our models performed using the readily available partial genome sequence data with the dual classification system of G and P genotypes. Our models can be expanded to full genome classification when these data become more readily available, and if accuracy warrants it.</p>
<p>Results from model training showed that random forest models trained on positional features from pairwise and multiple sequence alignment learned and predicted the genotypes of labelled VP7 and VP4 sequences with high accuracy (overall accuracy of at least 0.975 for all models). Overall, multiple sequence alignment models outperformed pairwise sequence alignment models in both overall accuracy and <italic>kappa</italic> during training. R10FCVT and LOOCV models performed very similarly during training, with LOOCV models having slightly higher overall accuracy and <italic>kappa</italic> in most cases. Tuning during training also demonstrated that the optimal m<sub>try</sub> value of the VP7 and VP4 models may be either identical or different between R10FCVT and LOOCV for a given alignment method. As a result, some models were almost identical in overall accuracy and <italic>kappa</italic> during training, and the performance of these models was expected to be very similar during model validation as well. Where model validation also showed these models to be identical in overall accuracy and <italic>kappa</italic>, they were still expected to differ in computational performance during training and tuning because they use different cross-validation methods.</p>
<p>Results from model validation showed that the trained random forest models classified unseen data with high accuracy (overall accuracy of at least 0.972 for all models). Overall, multiple sequence alignment models again outperformed pairwise sequence alignment models in overall accuracy and <italic>kappa</italic>. R10FCVT and LOOCV performed identically for the multiple sequence alignment models, whereas LOOCV outperformed R10FCVT for the pairwise sequence alignment models. All models also performed significantly better than the no-information rates, which demonstrates that the random forest algorithm was robust against the tendency to predict the majority class in situations involving imbalanced datasets (<xref ref-type="bibr" rid="B2">Breiman, 2001</xref>).</p>
<p>Phylogenetic analysis of each dataset also revealed that some of the misclassified sequences in the testing dataset were possibly outliers, as neither the models nor the phylogenetic trees placed them in the labelled genotype. These sequences could be analyzed further with other tools, such as BLAST, to confirm whether they are true outliers or simply mislabelled. Moving these sequences from the testing dataset to the training dataset may also allow the models to learn from them and to classify similar sequences correctly in the future. Some sequences were misclassified only by the pairwise sequence alignment models and not by the multiple sequence alignment models or the phylogenetic trees. This further supports the finding that multiple sequence alignment models are generally more accurate than pairwise sequence alignment models at classifying VP7 and VP4 genotypes.</p>
<p>Results for computational performance showed that pairwise sequence alignment models were generally faster than multiple sequence alignment models for alignment of a query sequence, model training, and model prediction. R10FCVT models were also much faster than LOOCV models during training, with no difference in prediction time. Total elapsed time, whether summed over all three components or over alignment and prediction only, likewise favoured pairwise sequence alignment models. Where models must be continually retrained, for example because of a constant influx of new sequence data, pairwise sequence alignment R10FCVT models are favoured. Similarly, where models do not need to be retrained and classification speed is a major consideration, such as in general query sequence classification (<xref ref-type="bibr" rid="B43">Williams et al., 2006</xref>), pairwise sequence alignment R10FCVT models are also favoured. Where rare genotypes are being classified or the cost of misclassification is very high, such as in targeted vaccine development, multiple sequence alignment LOOCV models may be preferred, trading speed for the highest achievable classification accuracy.</p>
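<p>The difference in training time between the two resampling schemes follows largely from the number of forests that must be fitted. The short Python sketch below counts these fits, assuming that R10FCVT denotes repeated 10-fold cross-validation. The number of repeats, the number of candidate m<sub>try</sub> values, and the dataset size are hypothetical values chosen for illustration, not the settings used in this study, and the final refit on the full training set is not counted.</p>
<preformat><![CDATA[
```python
def resampling_fits(n_sequences, n_mtry_values, folds=10, repeats=3):
    """Count the forests fitted while tuning mtry under each resampling scheme.

    All arguments are illustrative assumptions, not values from the study.
    """
    # Repeated k-fold: one fit per fold, per repeat, per candidate mtry value.
    repeated_kfold = folds * repeats * n_mtry_values
    # Leave-one-out: one fit per held-out sequence, per candidate mtry value.
    loocv = n_sequences * n_mtry_values
    return {
        "repeated_kfold": repeated_kfold,
        "loocv": loocv,
        "ratio": loocv / repeated_kfold,
    }


# Hypothetical training set of 1,000 sequences and 5 candidate mtry values.
print(resampling_fits(n_sequences=1000, n_mtry_values=5))
```
]]></preformat>
<p>With these hypothetical values, LOOCV requires 5,000 fits against 150 for repeated 10-fold cross-validation. The number of LOOCV fits grows linearly with the size of the training set, whereas the number of repeated 10-fold fits does not.</p>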
</sec>
<sec id="s4-2">
<title>4.2 Limitations</title>
<p>A major limitation of these models is that they rely on sufficient sequence data being available for each genotype to train the random forest algorithm. VP7 and VP4 sequence data retrieved from NCBI were still lacking for many of the known genotypes; therefore, the classifier cannot yet predict these genotypes. The number of VP7 and VP4 genotypes has also been shown to increase over time, so more sequence data will be required (<xref ref-type="bibr" rid="B27">Matthijnssens et al., 2008</xref>; <xref ref-type="bibr" rid="B30">Mwanga et al., 2020</xref>). Models will also need to be retrained periodically to account for these new genotypes as sequence data become available.</p>
<p>Another limitation is that these models cannot recognize a new group A rotavirus genotype. A sequence from a new genotype will most likely be assigned to the existing genotype to which it is most similar, even though it may be distinct enough to be categorized as a new genotype. New group A rotavirus genotypes will have to be identified through alternative methods, such as the RotaC tool or other hierarchical agglomerative clustering algorithms (<xref ref-type="bibr" rid="B24">Maes et al., 2009</xref>). In addition, although the distribution of genotypes in our dataset is in general agreement with reported genotypes, it is likely that genotypes important for a specific species and jurisdiction were not included in the training and test datasets. However, sequences for these genotypes could be available internally in diagnostic laboratories, and the results of this study suggest that random forest could be used to develop classification models in such situations, provided that sufficient data are available.</p>
<p>Additionally, the use of alignment could be considered a limitation of our models, as alignment is generally a computationally slow process. Multiple sequence alignment was identified as the primary bottleneck in computational performance for models that did not need to be retrained. Pairwise alignment also slowed computational performance in such models, although to a much lesser extent. Alignment-free methods such as k-mer counts have previously been used in combination with random forest and may provide a suitable alternative to alignment if accuracy and computational performance warrant it (<xref ref-type="bibr" rid="B21">Liu et al., 2017</xref>; <xref ref-type="bibr" rid="B25">Malhotra et al., 2017</xref>).</p>
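<p>As a rough illustration of the alignment-free alternative mentioned above, the following Python sketch converts a nucleotide sequence into a fixed-length vector of overlapping k-mer counts that could serve as input features for a classifier. This approach was not evaluated in this study; the function name <monospace>kmer_counts</monospace>, the choice of k, and the example sequence are hypothetical.</p>
<preformat><![CDATA[
```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k=3, alphabet="ACGT"):
    """Return a fixed-length vector of overlapping k-mer counts for one sequence."""
    sequence = sequence.upper()
    # Count every overlapping window of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Enumerate all possible k-mers in a fixed order so that every sequence
    # maps to a vector of the same length (4**k for nucleotides). Windows
    # containing ambiguous bases such as N are not in this list and are ignored.
    vocabulary = ["".join(letters) for letters in product(alphabet, repeat=k)]
    return [counts[kmer] for kmer in vocabulary]


# Hypothetical short sequence; real VP7 or VP4 sequences would be far longer.
features = kmer_counts("ATGTATGGTATT", k=2)
print(len(features), sum(features))
```
]]></preformat>
<p>Because no alignment is required, the cost of building these features grows linearly with sequence length. The trade-off is that positional information, which the alignment-based features retain, is lost.</p>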
</sec>
</sec>
<sec sec-type="conclusion" id="s5">
<title>5 Conclusion</title>
<p>In conclusion, random forest models trained on positional features from pairwise and multiple sequence alignments achieved very high levels of performance for the dual classification of group A rotavirus VP7 and VP4 genotypes. Multiple sequence alignment models were more accurate than pairwise sequence alignment models in both training and testing, with the trade-off that pairwise sequence alignment models were generally faster. Applying these models as classifiers will allow more efficient and accurate classification of group A rotaviruses as the volume of new sequence data grows, which may aid vaccine development. The methodology may also be applicable to accurate and rapid classification of other rotavirus species and possibly other viral pathogens that lack a classification tool. These models can be further improved and extended towards the full genome classification system as the required sequence data become more readily available.</p>
</sec>
</body>
<back>
<sec sec-type="data-availability" id="s6">
<title>Data availability statement</title>
<p>Publicly available datasets were analyzed in this study. This data can be found here: NCBI Nucleotide Database <ext-link ext-link-type="uri" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://www.ncbi.nlm.nih.gov/nuccore">https://www.ncbi.nlm.nih.gov/nuccore</ext-link> and NCBI RefSeq Database <ext-link ext-link-type="uri" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://www.ncbi.nlm.nih.gov/refseq/">https://www.ncbi.nlm.nih.gov/refseq/</ext-link>.</p>
</sec>
<sec id="s7">
<title>Author contributions</title>
<p>ZP conceptualized the study. HT collected, cleaned, and analyzed the data. HT wrote the code for data analysis and wrote the manuscript draft with guidance from ZP. ZP and RF reviewed and edited the manuscript draft. All authors contributed to the article and approved the submitted version.</p>
</sec>
<sec id="s8">
<title>Funding</title>
<p>This research was funded by the Ontario Agri-Food Innovation Alliance (UofG2018-3287) and a Natural Sciences and Engineering Research Council of Canada Discovery Grant (400558).</p>
</sec>
<ack>
<p>We would like to thank researchers who have submitted their rotavirus sequence data to NCBI for ease of access. This work was previously published as part of a Master of Science thesis (<xref ref-type="bibr" rid="B40">Tran, 2021</xref>).</p>
</ack>
<sec sec-type="COI-statement" id="s9">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s10">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<sec id="s11">
<title>Supplementary material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fgene.2023.1029185/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fgene.2023.1029185/full&#x23;supplementary-material</ext-link>
</p>
<supplementary-material xlink:href="Image1.TIFF" id="SM1" mimetype="application/TIFF" xmlns:xlink="http://www.w3.org/1999/xlink"/>
<supplementary-material xlink:href="Image2.TIF" id="SM2" mimetype="application/TIF" xmlns:xlink="http://www.w3.org/1999/xlink"/>
<supplementary-material xlink:href="Image3.pdf" id="SM3" mimetype="application/pdf" xmlns:xlink="http://www.w3.org/1999/xlink"/>
<supplementary-material xlink:href="DataSheet1.docx" id="SM4" mimetype="application/docx" xmlns:xlink="http://www.w3.org/1999/xlink"/>
<supplementary-material xlink:href="DataSheet2.xlsx" id="SM5" mimetype="application/xlsx" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Altschul</surname>
<given-names>S. F.</given-names>
</name>
<name>
<surname>Gish</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Miller</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Myers</surname>
<given-names>E. W.</given-names>
</name>
<name>
<surname>Lipman</surname>
<given-names>D. J.</given-names>
</name>
</person-group> (<year>1990</year>). <article-title>Basic local alignment search tool</article-title>. <source>J. Mol. Biol.</source> <volume>215</volume>, <fpage>403</fpage>&#x2013;<lpage>410</lpage>. <pub-id pub-id-type="doi">10.1016/S0022-2836(05)80360-2</pub-id>
</citation>
</ref>
<ref id="B2">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Breiman</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>2001</year>). <article-title>Random forests</article-title>. <source>Mach. Learn.</source> <volume>45</volume>, <fpage>5</fpage>&#x2013;<lpage>32</lpage>. <pub-id pub-id-type="doi">10.1023/A:1010933404324</pub-id>
</citation>
</ref>
<ref id="B3">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Brierley</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Fowler</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Predicting the animal hosts of coronaviruses from compositional biases of spike protein and whole genome sequences through machine learning</article-title>. <source>PLoS Pathog.</source> <volume>17</volume>, <fpage>e1009149</fpage>. <pub-id pub-id-type="doi">10.1371/journal.ppat.1009149</pub-id>
</citation>
</ref>
<ref id="B4">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Burke</surname>
<given-names>R. M.</given-names>
</name>
<name>
<surname>Tate</surname>
<given-names>J. E.</given-names>
</name>
<name>
<surname>Kirkwood</surname>
<given-names>C. D.</given-names>
</name>
<name>
<surname>Steele</surname>
<given-names>A. D.</given-names>
</name>
<name>
<surname>Parashar</surname>
<given-names>U. D.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Current and new rotavirus vaccines</article-title>. <source>Curr. Opin. Infect. Dis.</source> <volume>32</volume>, <fpage>435</fpage>&#x2013;<lpage>444</lpage>. <pub-id pub-id-type="doi">10.1097/QCO.0000000000000572</pub-id>
</citation>
</ref>
<ref id="B5">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chaudhary</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Kolhe</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Kamal</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>An improved random forest classifier for multi-class classification</article-title>. <source>Inf. Process. Agric.</source> <volume>3</volume>, <fpage>215</fpage>&#x2013;<lpage>222</lpage>. <pub-id pub-id-type="doi">10.1016/j.inpa.2016.08.002</pub-id>
</citation>
</ref>
<ref id="B6">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cohen</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>1960</year>). <article-title>A coefficient of agreement for nominal scales</article-title>. <source>Educ. Psychol. Meas.</source> <volume>20</volume>, <fpage>37</fpage>&#x2013;<lpage>46</lpage>. <pub-id pub-id-type="doi">10.1177/001316446002000104</pub-id>
</citation>
</ref>
<ref id="B7">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cook</surname>
<given-names>P. W.</given-names>
</name>
<name>
<surname>Stark</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Jones</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Kondor</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Zanders</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Benfer</surname>
<given-names>J.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>Detection and characterization of swine origin influenza A(H1N1) pandemic 2009 viruses in humans following zoonotic transmission</article-title>. <source>J. Virol.</source> <volume>95</volume>, <fpage>e01066-20</fpage>. <pub-id pub-id-type="doi">10.1128/JVI.01066-20</pub-id>
</citation>
</ref>
<ref id="B8">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Edgar</surname>
<given-names>R. C.</given-names>
</name>
</person-group> (<year>2004</year>). <article-title>MUSCLE: Multiple sequence alignment with high accuracy and high throughput</article-title>. <source>Nucleic Acids Res.</source> <volume>32</volume>, <fpage>1792</fpage>&#x2013;<lpage>1797</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gkh340</pub-id>
</citation>
</ref>
<ref id="B9">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gentsch</surname>
<given-names>J. R.</given-names>
</name>
<name>
<surname>Laird</surname>
<given-names>A. R.</given-names>
</name>
<name>
<surname>Bielfelt</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Griffin</surname>
<given-names>D. D.</given-names>
</name>
<name>
<surname>Banyai</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Ramachandran</surname>
<given-names>M.</given-names>
</name>
<etal/>
</person-group> (<year>2005</year>). <article-title>Serotype diversity and reassortment between human and animal rotavirus strains: Implications for rotavirus vaccine programs</article-title>. <source>J. Infect. Dis.</source> <volume>192</volume>, <fpage>S146</fpage>&#x2013;<lpage>S159</lpage>. <pub-id pub-id-type="doi">10.1086/431499</pub-id>
</citation>
</ref>
<ref id="B10">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Harris</surname>
<given-names>V. C.</given-names>
</name>
<name>
<surname>Armah</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Fuentes</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Korpela</surname>
<given-names>K. E.</given-names>
</name>
<name>
<surname>Parashar</surname>
<given-names>U.</given-names>
</name>
<name>
<surname>Victor</surname>
<given-names>J. C.</given-names>
</name>
<etal/>
</person-group> (<year>2017</year>). <article-title>Significant correlation between the infant gut microbiome and rotavirus vaccine response in rural Ghana</article-title>. <source>J. Infect. Dis.</source> <volume>215</volume>, <fpage>34</fpage>&#x2013;<lpage>41</lpage>. <pub-id pub-id-type="doi">10.1093/infdis/jiw518</pub-id>
</citation>
</ref>
<ref id="B11">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ion Titapiccolo</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Ferrario</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Cerutti</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Barbieri</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Mari</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Gatti</surname>
<given-names>E.</given-names>
</name>
<etal/>
</person-group> (<year>2013</year>). <article-title>Artificial intelligence models to stratify cardiovascular risk in incident hemodialysis patients</article-title>. <source>Expert Syst. Appl.</source> <volume>40</volume>, <fpage>4679</fpage>&#x2013;<lpage>4686</lpage>. <pub-id pub-id-type="doi">10.1016/j.eswa.2013.02.005</pub-id>
</citation>
</ref>
<ref id="B12">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kim</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Rupasinghe</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Rezaei</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Mart&#xed;nez-L&#xf3;pez</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>X.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Applications of machine learning for the classification of porcine reproductive and respiratory syndrome virus sublineages using amino acid scores of ORF5 gene</article-title>. <source>Front. Vet. Sci.</source> <volume>8</volume>, <fpage>683134</fpage>. <pub-id pub-id-type="doi">10.3389/fvets.2021.683134</pub-id>
</citation>
</ref>
<ref id="B13">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Kuhn</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Johnson</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>2013</year>). <source>Applied predictive modeling</source>. <edition>1st</edition>. <publisher-loc>New York, NY</publisher-loc>: <publisher-name>Springer</publisher-name>.</citation>
</ref>
<ref id="B14">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kuhn</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2008</year>). <article-title>Building predictive models in R using the caret package</article-title>. <source>J. Stat. Softw.</source> <volume>28</volume> (<issue>5</issue>). <pub-id pub-id-type="doi">10.18637/jss.v028.i05</pub-id>
</citation>
</ref>
<ref id="B15">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kumar</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Stecher</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Knyaz</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Tamura</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>MEGA X: Molecular evolutionary genetics analysis across computing platforms</article-title>. <source>Mol. Biol. Evol.</source> <volume>35</volume>, <fpage>1547</fpage>&#x2013;<lpage>1549</lpage>. <pub-id pub-id-type="doi">10.1093/molbev/msy096</pub-id>
</citation>
</ref>
<ref id="B16">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lakshmanaprabu</surname>
<given-names>S. K.</given-names>
</name>
<name>
<surname>Shankar</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Ilayaraja</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Nasir</surname>
<given-names>A. W.</given-names>
</name>
<name>
<surname>Vijayakumar</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Chilamkurti</surname>
<given-names>N.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Random forest for big data classification in the internet of things using optimal features</article-title>. <source>Int. J. Mach. Learn. Cybern.</source> <volume>10</volume>, <fpage>2609</fpage>&#x2013;<lpage>2618</lpage>. <pub-id pub-id-type="doi">10.1007/s13042-018-00916-z</pub-id>
</citation>
</ref>
<ref id="B17">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lanzieri</surname>
<given-names>T. M.</given-names>
</name>
<name>
<surname>Linhares</surname>
<given-names>A. C.</given-names>
</name>
<name>
<surname>Costa</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Kolhe</surname>
<given-names>D. A.</given-names>
</name>
<name>
<surname>Cunha</surname>
<given-names>M. H.</given-names>
</name>
<name>
<surname>Ortega-Barria</surname>
<given-names>E.</given-names>
</name>
<etal/>
</person-group> (<year>2011</year>). <article-title>Impact of rotavirus vaccination on childhood deaths from diarrhea in Brazil</article-title>. <source>Int. J. Infect. Dis.</source> <volume>15</volume>, <fpage>e206</fpage>&#x2013;<lpage>e210</lpage>. <pub-id pub-id-type="doi">10.1016/j.ijid.2010.11.007</pub-id>
</citation>
</ref>
<ref id="B18">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lee</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Jeong</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Jeong</surname>
<given-names>W.-K.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>CPEM: Accurate cancer type classification based on somatic alterations using an ensemble of a random forest and a deep neural network</article-title>. <source>Sci. Rep.</source> <volume>9</volume>, <fpage>16927</fpage>. <pub-id pub-id-type="doi">10.1038/s41598-019-53034-3</pub-id>
</citation>
</ref>
<ref id="B19">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Letunic</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Bork</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Interactive tree of life (iTOL) v5: An online tool for phylogenetic tree display and annotation</article-title>. <source>Nucleic Acids Res.</source> <volume>49</volume>, <fpage>W293</fpage>&#x2013;<lpage>W296</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gkab301</pub-id>
</citation>
</ref>
<ref id="B20">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liaw</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Wiener</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2002</year>). <article-title>Classification and regression by randomForest</article-title>. <source>R News</source> <volume>2</volume>, <fpage>18</fpage>&#x2013;<lpage>22</lpage>. </citation>
</ref>
<ref id="B21">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Gan</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Jiang</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>A sequence-based method to predict the impact of regulatory variants using random forest</article-title>. <source>BMC Syst. Biol.</source> <volume>11</volume>, <fpage>7</fpage>. <pub-id pub-id-type="doi">10.1186/s12918-017-0389-1</pub-id>
</citation>
</ref>
<ref id="B22">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Luchs</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Timenetsky</surname>
<given-names>M. D. C. S. T.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>Group A rotavirus gastroenteritis: Post-vaccine era, genotypes and zoonotic transmission</article-title>. <source>Einstein (Sao Paulo)</source> <volume>14</volume>, <fpage>278</fpage>&#x2013;<lpage>287</lpage>. <pub-id pub-id-type="doi">10.1590/S1679-45082016RB3582</pub-id>
</citation>
</ref>
<ref id="B23">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Madhi</surname>
<given-names>S. A.</given-names>
</name>
<name>
<surname>Cunliffe</surname>
<given-names>N. A.</given-names>
</name>
<name>
<surname>Steele</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Witte</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Kirsten</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Louw</surname>
<given-names>C.</given-names>
</name>
<etal/>
</person-group> (<year>2010</year>). <article-title>Effect of human rotavirus vaccine on severe diarrhea in African infants</article-title>. <source>N. Engl. J. Med.</source> <volume>362</volume>, <fpage>289</fpage>&#x2013;<lpage>298</lpage>. <pub-id pub-id-type="doi">10.1056/NEJMoa0904797</pub-id>
</citation>
</ref>
<ref id="B24">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Maes</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Matthijnssens</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Rahman</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Van Ranst</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2009</year>). <article-title>RotaC: A web-based tool for the complete genome classification of group A rotaviruses</article-title>. <source>BMC Microbiol.</source> <volume>9</volume>, <fpage>238</fpage>. <pub-id pub-id-type="doi">10.1186/1471-2180-9-238</pub-id>
</citation>
</ref>
<ref id="B25">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Malhotra</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Jha</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Poss</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Acharya</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>A random forest classifier for detecting rare variants in NGS data from viral populations</article-title>. <source>Comput. Struct. Biotechnol. J.</source> <volume>15</volume>, <fpage>388</fpage>&#x2013;<lpage>395</lpage>. <pub-id pub-id-type="doi">10.1016/j.csbj.2017.07.001</pub-id>
</citation>
</ref>
<ref id="B26">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Matthijnssens</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Bilcke</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Ciarlet</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Martella</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>B&#xe1;nyai</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Rahman</surname>
<given-names>M.</given-names>
</name>
<etal/>
</person-group> (<year>2009</year>). <article-title>Rotavirus disease and vaccination: Impact on genotype diversity</article-title>. <source>Future Microbiol.</source> <volume>4</volume>, <fpage>1303</fpage>&#x2013;<lpage>1316</lpage>. <pub-id pub-id-type="doi">10.2217/fmb.09.96</pub-id>
</citation>
</ref>
<ref id="B27">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Matthijnssens</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Ciarlet</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Heiman</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Arijs</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Delbeke</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>McDonald</surname>
<given-names>S. M.</given-names>
</name>
<etal/>
</person-group> (<year>2008</year>). <article-title>Full genome-based classification of rotaviruses reveals a common origin between human Wa-Like and porcine rotavirus strains and human DS-1-like and bovine rotavirus strains</article-title>. <source>J. Virol.</source> <volume>82</volume>, <fpage>3204</fpage>&#x2013;<lpage>3219</lpage>. <pub-id pub-id-type="doi">10.1128/JVI.02257-07</pub-id>
</citation>
</ref>
<ref id="B28">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Matthijnssens</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Ciarlet</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>McDonald</surname>
<given-names>S. M.</given-names>
</name>
<name>
<surname>Attoui</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>B&#xe1;nyai</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Brister</surname>
<given-names>J. R.</given-names>
</name>
<etal/>
</person-group> (<year>2011</year>). <article-title>Uniformity of rotavirus strain nomenclature proposed by the rotavirus classification working group (RCWG)</article-title>. <source>Arch. Virol.</source> <volume>156</volume>, <fpage>1397</fpage>&#x2013;<lpage>1413</lpage>. <pub-id pub-id-type="doi">10.1007/s00705-011-1006-z</pub-id>
</citation>
</ref>
<ref id="B29">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>M&#xfc;ller</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Johne</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2007</year>). <article-title>Rotaviruses: Diversity and zoonotic potential--a brief review</article-title>. <source>Berl. Munch Tierarztl Wochenschr</source> <volume>120</volume>, <fpage>108</fpage>&#x2013;<lpage>112</lpage>. <pub-id pub-id-type="doi">10.2376/0005-9366-120-108</pub-id>
</citation>
</ref>
<ref id="B30">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mwanga</surname>
<given-names>M. J.</given-names>
</name>
<name>
<surname>Owor</surname>
<given-names>B. E.</given-names>
</name>
<name>
<surname>Ochieng</surname>
<given-names>J. B.</given-names>
</name>
<name>
<surname>Ngama</surname>
<given-names>M. H.</given-names>
</name>
<name>
<surname>Ogwel</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Onyango</surname>
<given-names>C.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>Rotavirus group A genotype circulation patterns across Kenya before and after nationwide vaccine introduction, 2010&#x2013;2018</article-title>. <source>BMC Infect. Dis.</source> <volume>20</volume>, <fpage>504</fpage>. <pub-id pub-id-type="doi">10.1186/s12879-020-05230-0</pub-id>
</citation>
</ref>
<ref id="B31">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Needleman</surname>
<given-names>S. B.</given-names>
</name>
<name>
<surname>Wunsch</surname>
<given-names>C. D.</given-names>
</name>
</person-group> (<year>1970</year>). <article-title>A general method applicable to the search for similarities in the amino acid sequence of two proteins</article-title>. <source>J. Mol. Biol.</source> <volume>48</volume>, <fpage>443</fpage>&#x2013;<lpage>453</lpage>. <pub-id pub-id-type="doi">10.1016/0022-2836(70)90057-4</pub-id>
</citation>
</ref>
<ref id="B32">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>O&#x2019;Leary</surname>
<given-names>N. A.</given-names>
</name>
<name>
<surname>Wright</surname>
<given-names>M. W.</given-names>
</name>
<name>
<surname>Brister</surname>
<given-names>J. R.</given-names>
</name>
<name>
<surname>Ciufo</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Haddad</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>McVeigh</surname>
<given-names>R.</given-names>
</name>
<etal/>
</person-group> (<year>2016</year>). <article-title>Reference sequence (RefSeq) database at NCBI: Current status, taxonomic expansion, and functional annotation</article-title>. <source>Nucleic Acids Res.</source> <volume>44</volume>, <fpage>D733</fpage>&#x2013;<lpage>D745</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gkv1189</pub-id>
</citation>
</ref>
<ref id="B33">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Pag&#xe8;s</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Aboyoun</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Gentleman</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>DebRoy</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2023</year>). <article-title>Biostrings: Efficient manipulation of biological strings. R package version 2.68.1</article-title>. <comment>Available at: <ext-link ext-link-type="uri" xlink:href="https://bioconductor.org/packages/Biostrings">https://bioconductor.org/packages/Biostrings</ext-link>
</comment>.</citation>
</ref>
<ref id="B34">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pickett</surname>
<given-names>B. E.</given-names>
</name>
<name>
<surname>Sadat</surname>
<given-names>E. L.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Noronha</surname>
<given-names>J. M.</given-names>
</name>
<name>
<surname>Squires</surname>
<given-names>R. B.</given-names>
</name>
<name>
<surname>Hunt</surname>
<given-names>V.</given-names>
</name>
<etal/>
</person-group> (<year>2012</year>). <article-title>ViPR: An open bioinformatics database and analysis resource for virology research</article-title>. <source>Nucleic Acids Res.</source> <volume>40</volume>, <fpage>D593</fpage>&#x2013;<lpage>D598</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gkr859</pub-id>
</citation>
</ref>
<ref id="B35">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Probst</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Wright</surname>
<given-names>M. N.</given-names>
</name>
<name>
<surname>Boulesteix</surname>
<given-names>A.-L.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Hyperparameters and tuning strategies for random forest</article-title>. <source>WIREs Data Min. Knowl. Discov.</source> <volume>9</volume>, <fpage>e1301</fpage>. <pub-id pub-id-type="doi">10.1002/widm.1301</pub-id>
</citation>
</ref>
<ref id="B36">
<citation citation-type="web">
<collab>R Core Team</collab> (<year>2013</year>). <article-title>R: A language and environment for statistical computing</article-title>. <comment>Available at: <ext-link ext-link-type="uri" xlink:href="https://www.R-project.org/">https://www.R-project.org/</ext-link>
</comment>.</citation>
</ref>
<ref id="B37">
<citation citation-type="web">
<collab>Rotavirus Classification Working Group: RCWG</collab> (<year>2021</year>). <article-title>virus-classification</article-title>. <comment>Available at: <ext-link ext-link-type="uri" xlink:href="https://rega.kuleuven.be/cev/viralmetagenomics/virus-classification">https://rega.kuleuven.be/cev/viralmetagenomics/virus-classification</ext-link>
</comment>.</citation>
</ref>
<ref id="B38">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Santos</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Hoshino</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2005</year>). <article-title>Global distribution of rotavirus serotypes/genotypes and its implication for the development and implementation of an effective rotavirus vaccine</article-title>. <source>Rev. Med. Virol.</source> <volume>15</volume>, <fpage>29</fpage>&#x2013;<lpage>56</lpage>. <pub-id pub-id-type="doi">10.1002/rmv.448</pub-id>
</citation>
</ref>
<ref id="B39">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tate</surname>
<given-names>J. E.</given-names>
</name>
<name>
<surname>Burton</surname>
<given-names>A. H.</given-names>
</name>
<name>
<surname>Boschi-Pinto</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Parashar</surname>
<given-names>U. D.</given-names>
</name>
</person-group>
<collab>World Health Organization&#x2013;Coordinated Global Rotavirus Surveillance Network</collab> (<year>2016</year>). <article-title>Global, regional, and national estimates of rotavirus mortality in children &#x3c;5 years of age, 2000&#x2013;2013</article-title>. <source>Clin. Infect. Dis.</source> <volume>62</volume> (<issue>Suppl. 2</issue>), <fpage>S96</fpage>&#x2013;<lpage>S105</lpage>. <pub-id pub-id-type="doi">10.1093/cid/civ1013</pub-id>
</citation>
</ref>
<ref id="B40">
<citation citation-type="thesis">
<person-group person-group-type="author">
<name>
<surname>Tran</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>An investigation of the classification, seasonality, and genotype diversity of rotavirus in swine populations in Canada</article-title>.&#x201d; <comment>[MS dissertation]</comment> (<publisher-loc>Guelph</publisher-loc>: <publisher-name>University of Guelph</publisher-name>).</citation>
</ref>
<ref id="B41">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Vlasova</surname>
<given-names>A. N.</given-names>
</name>
<name>
<surname>Amimo</surname>
<given-names>J. O.</given-names>
</name>
<name>
<surname>Saif</surname>
<given-names>L. J.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Porcine rotaviruses: Epidemiology, immune responses and control strategies</article-title>. <source>Viruses</source> <volume>9</volume>, <fpage>48</fpage>. <pub-id pub-id-type="doi">10.3390/v9030048</pub-id>
</citation>
</ref>
<ref id="B42">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Walker</surname>
<given-names>P. J.</given-names>
</name>
<name>
<surname>Siddell</surname>
<given-names>S. G.</given-names>
</name>
<name>
<surname>Lefkowitz</surname>
<given-names>E. J.</given-names>
</name>
<name>
<surname>Mushegian</surname>
<given-names>A. R.</given-names>
</name>
<name>
<surname>Dempsey</surname>
<given-names>D. M.</given-names>
</name>
<name>
<surname>Dutilh</surname>
<given-names>B. E.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). <article-title>Changes to virus taxonomy and the International Code of Virus Classification and Nomenclature ratified by the International Committee on Taxonomy of Viruses (2019)</article-title>. <source>Arch. Virol.</source> <volume>164</volume>, <fpage>2417</fpage>&#x2013;<lpage>2429</lpage>. <pub-id pub-id-type="doi">10.1007/s00705-019-04306-w</pub-id>
</citation>
</ref>
<ref id="B43">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Williams</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Zander</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Armitage</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2006</year>). <article-title>A preliminary performance comparison of five machine learning algorithms for practical IP traffic flow classification</article-title>. <source>Comput. Commun. Rev.</source> <volume>36</volume>, <fpage>5</fpage>&#x2013;<lpage>16</lpage>. <pub-id pub-id-type="doi">10.1145/1163593.1163596</pub-id>
</citation>
</ref>
<ref id="B44">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zaman</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Dang</surname>
<given-names>D. A.</given-names>
</name>
<name>
<surname>Victor</surname>
<given-names>J. C.</given-names>
</name>
<name>
<surname>Shin</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Yunus</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Dallas</surname>
<given-names>M. J.</given-names>
</name>
<etal/>
</person-group> (<year>2010</year>). <article-title>Efficacy of pentavalent rotavirus vaccine against severe rotavirus gastroenteritis in infants in developing countries in Asia: A randomised, double-blind, placebo-controlled trial</article-title>. <source>Lancet</source> <volume>376</volume>, <fpage>615</fpage>&#x2013;<lpage>623</lpage>. <pub-id pub-id-type="doi">10.1016/S0140-6736(10)60755-6</pub-id>
</citation>
</ref>
</ref-list>
</back>
</article>